The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team using operations @ egi.eu.

Tools/Manuals/TS50



10 data transfer to the server failed

Diagnosis

Usually it means the Globus job manager on the CE cannot call back the WMS/Condor-G (or UI in tests), or the CE cannot be called back itself. This can happen because of firewall or other network problems, or because of an incorrect or absent definition of GLOBUS_TCP_PORT_RANGE on the CE, WMS, Condor-G or UI.
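As a quick sanity check for an incorrect or absent definition, the following minimal Python sketch parses the value of GLOBUS_TCP_PORT_RANGE. It assumes the usual `min,max` form (a space instead of the comma is also tolerated here); the function name `parse_port_range` and the example range 20000,25000 are illustrative only and not taken from this page.

```python
import os


def parse_port_range(value):
    """Parse a GLOBUS_TCP_PORT_RANGE value such as "20000,25000".

    Returns (low, high) or raises ValueError with a short reason.
    """
    if value is None or not value.strip():
        raise ValueError("GLOBUS_TCP_PORT_RANGE is not set")
    # Accept a comma or whitespace between the two port numbers (assumption).
    parts = value.replace(",", " ").split()
    if len(parts) != 2:
        raise ValueError("expected two port numbers, got %r" % value)
    try:
        low, high = int(parts[0]), int(parts[1])
    except ValueError:
        raise ValueError("port numbers must be integers, got %r" % value)
    if not (1 <= low <= high <= 65535):
        raise ValueError("invalid port range %d-%d" % (low, high))
    return low, high


if __name__ == "__main__":
    try:
        low, high = parse_port_range(os.environ.get("GLOBUS_TCP_PORT_RANGE"))
        print("port range %d-%d (%d ports)" % (low, high, high - low + 1))
    except ValueError as err:
        print("problem: %s" % err)
```

Running this in the environment of the relevant service on the CE, WMS, Condor-G host or UI shows whether the variable is defined and well-formed there; it does not test whether the firewall actually lets those ports through.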

It can also occur when the proxy is not acceptable to LCAS/LCMAPS on the CE, e.g. because it is a plain grid proxy instead of a VOMS proxy (currently the lcg-CE should still accept both), or when the VOMS attributes have expired.

Note: the globus-job-manager-marshal and globus-gass-cache-marshal daemons on the gLite 3.1 lcg-CE each allow only a limited number of requests to run in parallel (5 by default) and put the rest into a queue, which can be seen with "ps afuxwww". If any account with pending requests has a problem (see below), it can also cause jobs for other accounts to fail with this error!

Solution

  1. Ensure the CE is up to date, using the latest versions of globus-gma, globus-job-manager-marshal and globus-gass-cache-marshal.
  2. Fix GLOBUS_TCP_PORT_RANGE on CE, WMS, Condor-G or UI as needed and open the firewall(s) correspondingly.
  3. Check if the account to which the DN is mapped has a writable home directory (also check the subdirectories!). A globus-job-run may report this error:
     GRAM Job submission failed because cannot access cache files in
     ~/.globus/.gass_cache, check permissions, quota, and disk space
     (error code 76)
  4. Check contents of $GLOBUS_LOCATION/etc/grid-services/jobmanager-* files.
  5. Check contents of $GLOBUS_LOCATION/etc/globus-job-manager.conf.
  6. Ensure /etc/grid-security is world-readable (hostkey.pem must be protected).
  7. Ensure outgoing connections are allowed from the CE to the GLOBUS_TCP_PORT_RANGE on WMS/Condor-G (or UI).
  8. Check LCAS/LCMAPS configuration on CE. For all VOMS servers of each supported VO there must either be a file like /etc/grid-security/vomsdir/$voms_server_fqdn.*.pem containing the current host certificate, or (for almost all node types) a file /etc/grid-security/vomsdir/$vo_name/$voms_server_fqdn.lsc containing exactly two lines:
    1. the VOMS server DN
    2. the DN of the issuing CA

    Otherwise there may be LCAS failures reported in /var/log/globus-gatekeeper.log:

    LCAS   0:       lcas_plugin_voms-plugin_confirm_authorization_from_x509():
     VOMS Signature error (failure)!

    The full host certificates are still needed on the gLite 3.1 WMS and FTS node types.
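The home directory check in step 3 above (including the subdirectories) could be sketched in Python as follows. This is a minimal illustration, not a site tool: the function name `unwritable_dirs` is hypothetical, and the script has to be run as the mapped account itself, because `os.access` tests the permissions of the calling user (as root it reports nearly everything as writable). It does not detect quota or disk space problems.

```python
import os


def unwritable_dirs(home):
    """Return directories at or below `home` that the calling user
    cannot write to or enter (a missing `home` is reported as well)."""
    bad = []
    if not os.access(home, os.W_OK | os.X_OK):
        bad.append(home)
    # Test each subdirectory from its parent's listing, so that directories
    # which cannot be read are still reported rather than silently skipped.
    for dirpath, dirnames, _files in os.walk(home):
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if not os.access(path, os.W_OK | os.X_OK):
                bad.append(path)
    return bad
```

For example, `unwritable_dirs(os.path.expanduser("~"))` run as the affected account returns an empty list when every directory, including ~/.globus/.gass_cache, is writable.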
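The "exactly two lines" requirement for a .lsc file in the LCAS/LCMAPS step above can be checked with a short Python sketch like the one below. The function name `check_lsc` is hypothetical; ignoring blank lines and expecting each DN to start with "/" (the OpenSSL one-line form) are assumptions of this sketch, and the DNs in the usage example are made up.

```python
def check_lsc(text):
    """Return a list of problems found in the text of a .lsc file
    (an empty list means the basic layout looks right)."""
    problems = []
    # Assumption: blank lines are ignored when counting.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) != 2:
        problems.append(
            "expected exactly 2 non-empty lines, found %d" % len(lines))
    # Assumption: DNs are written in the slash-separated one-line form.
    for number, ln in enumerate(lines, 1):
        if not ln.startswith("/"):
            problems.append(
                "line %d does not look like a DN: %r" % (number, ln))
    return problems


if __name__ == "__main__":
    # Hypothetical content: first the VOMS server DN, then the issuing CA DN.
    example = (
        "/DC=org/DC=example/CN=voms.example.org\n"
        "/DC=org/DC=example/CN=Example CA\n"
    )
    print(check_lsc(example))
```

This only checks the layout of the file. Whether the two DNs actually match the VOMS server certificate and its issuing CA still has to be verified against the certificate itself.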