Difference between revisions of "Tools/Manuals/TS50"
Latest revision as of 12:44, 23 November 2012
10 data transfer to the server failed
Diagnosis
This error usually means that the Globus job manager on the CE cannot call back the WMS/Condor-G (or the UI, in tests), or that the CE itself cannot be called back. Typical causes are firewall or other network problems, or an incorrect or missing definition of GLOBUS_TCP_PORT_RANGE on the CE, WMS, Condor-G or UI.
It can also occur when the proxy is not acceptable to LCAS/LCMAPS on the CE, e.g. because it is a plain grid proxy rather than a VOMS proxy (currently the lcg-CE should still accept both), or because the VOMS attributes have expired.
Note: on the gLite 3.1 lcg-CE, the globus-job-manager-marshal and globus-gass-cache-marshal daemons each allow only a limited number of requests to run in parallel (5 by default) and put the rest into a queue, which can be seen with "ps afuxwww". If any account with pending requests has a problem (see below), jobs for other accounts can also fail with this error!
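As a rough aid for the callback diagnosis above, the following Python sketch parses a GLOBUS_TCP_PORT_RANGE value and probes a single TCP port on a remote host. It is an illustration, not part of any gLite or Globus tooling: the function names are made up for this example, the accepted value formats ("min,max" or "min max") are an assumption about how the variable is commonly written, and the interpretation of the probe results is only a hint.

```python
import socket


def parse_port_range(value):
    """Parse a port range such as "20000,25000" or "20000 25000".

    Assumption: the variable holds two port numbers separated by a comma
    or by whitespace. Returns a (low, high) tuple.
    """
    parts = value.replace(",", " ").split()
    if len(parts) != 2:
        raise ValueError("expected two port numbers, got %r" % value)
    low, high = int(parts[0]), int(parts[1])
    if not (0 < low <= high <= 65535):
        raise ValueError("invalid port range %d-%d" % (low, high))
    return low, high


def probe_port(host, port, timeout=5.0):
    """Try one TCP connection and classify the outcome.

    "open":     something accepted the connection.
    "refused":  the host answered, but nothing listens on that port.
    "filtered": no answer within the timeout, which is consistent with
                (but does not prove) a firewall dropping the packets.
    "error":    any other failure, e.g. the name could not be resolved.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "filtered"
    except OSError:
        return "error"


# Example use on the CE (hypothetical host name, port range read from the
# environment of the node being checked):
#   import os
#   low, high = parse_port_range(os.environ["GLOBUS_TCP_PORT_RANGE"])
#   print(probe_port("wms.example.org", low))
```

A "refused" result on a port inside the range may simply mean that no callback listener is active at that moment, so it says more about the firewall (packets get through) than about the service.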
Solution
- Ensure the CE is up to date, using the latest versions of globus-gma, globus-job-manager-marshal and globus-gass-cache-marshal.
- Fix GLOBUS_TCP_PORT_RANGE on CE, WMS, Condor-G or UI as needed and open the firewall(s) correspondingly.
- Check if the account to which the DN is mapped has a writable home directory (also check the subdirectories!). A globus-job-run may report this error:
GRAM Job submission failed because cannot access cache files in ~/.globus/.gass_cache, check permissions, quota, and disk space (error code 76)
- Check the contents of the $GLOBUS_LOCATION/etc/grid-services/jobmanager-* files.
- Check the contents of $GLOBUS_LOCATION/etc/globus-job-manager.conf.
- Ensure the /etc/grid-security directory is world-readable (hostkey.pem itself must remain protected).
- Ensure outgoing connections are allowed from the CE to the GLOBUS_TCP_PORT_RANGE on WMS/Condor-G (or UI).
- Check the LCAS/LCMAPS configuration on the CE. For every VOMS server of each supported VO there must be either a file like /etc/grid-security/vomsdir/$voms_server_fqdn.*.pem containing the current host certificate, or (for almost all node types) a file /etc/grid-security/vomsdir/$vo_name/$voms_server_fqdn.lsc containing exactly two lines:
- the VOMS server DN
- the DN of the issuing CA
Otherwise there may be LCAS failures reported in /var/log/globus-gatekeeper.log:
LCAS 0: lcas_plugin_voms-plugin_confirm_authorization_from_x509(): VOMS Signature error (failure)!
- The full host certificates are still needed on the gLite 3.1 WMS and FTS node types.
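Two of the checks in the list above (the writable home directory with its subdirectories, and the two-line .lsc file) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not an official tool: the function names are invented here, blank lines in the .lsc file are ignored when counting, and a DN is assumed to be written in the slash-separated form starting with "/". The writability check reflects the permissions of the user running the script, so it is only meaningful when run as the mapped account (root can write almost everywhere).

```python
import os


def check_lsc(path):
    """Return a list of problems found in a VOMS .lsc file (empty if none).

    Assumptions: blank lines are ignored, and both DNs (VOMS server DN,
    then issuing CA DN) use the slash-separated form starting with "/".
    """
    with open(path, encoding="utf-8") as handle:
        lines = [line.strip() for line in handle if line.strip()]
    problems = []
    if len(lines) != 2:
        problems.append("expected exactly 2 lines, found %d" % len(lines))
    for line in lines:
        if not line.startswith("/"):
            problems.append("does not look like a DN: %r" % line)
    return problems


def unwritable_dirs(home):
    """Return directories at or below `home` that the current user
    cannot write to or enter.

    Note: os.walk silently skips directories it cannot list, so anything
    below an unreadable directory is not examined.
    """
    bad = []
    for root, _dirs, _files in os.walk(home):
        if not os.access(root, os.W_OK | os.X_OK):
            bad.append(root)
    return bad


# Example use (paths follow the layout described above; the VO and server
# names are placeholders):
#   print(check_lsc("/etc/grid-security/vomsdir/myvo/voms.example.org.lsc"))
#   print(unwritable_dirs(os.path.expanduser("~")))
```

An empty list from either function means that the sketch found nothing wrong; it does not verify that the DNs in the .lsc file actually match the VOMS server certificate.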