Difference between revisions of "Tools/Manuals/TS50"
Revision as of 21:37, 21 August 2011
10 data transfer to the server failed
Diagnosis
This error usually means that the Globus job manager on the CE cannot call back the WMS/Condor-G (or the UI, in tests), or that the CE itself cannot be called back. Possible causes are firewall or other network problems, or an incorrect or missing definition of GLOBUS_TCP_PORT_RANGE on the CE, WMS, Condor-G or UI.
It can also occur when the proxy is not acceptable to LCAS/LCMAPS on the CE, e.g. because it is a plain grid proxy instead of a VOMS proxy (currently the lcg-CE should still accept both), or when the VOMS attributes have expired.
Note: the globus-job-manager-marshal and globus-gass-cache-marshal daemons on the gLite 3.1 lcg-CE allow only a limited number of requests to run in parallel (5 each by default) and put the rest into a queue, which can be seen with "ps afuxwww". If any account with pending requests has a problem (see below), jobs for other accounts can also fail with this error!
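To help rule out the GLOBUS_TCP_PORT_RANGE cause described above, the following is a minimal Python sketch, not part of any Globus or gLite tool. The function names parse_port_range and can_connect are hypothetical. The sketch assumes the variable holds two port numbers separated by a comma (for example "20000,25000") and, as a further assumption, also tolerates a space as separator.

```python
import socket


def parse_port_range(value):
    """Parse a GLOBUS_TCP_PORT_RANGE value such as "20000,25000" into (low, high).

    Raises ValueError if the value is missing or malformed.
    """
    if value is None or not value.strip():
        raise ValueError("GLOBUS_TCP_PORT_RANGE is not defined")
    # Assumption: the two ports are separated by a comma or by whitespace.
    parts = value.replace(",", " ").split()
    if len(parts) != 2:
        raise ValueError("expected two port numbers, got %r" % value)
    low, high = (int(p) for p in parts)  # int() raises ValueError on non-numbers
    if not (1 <= low <= high <= 65535):
        raise ValueError("invalid port range %d-%d" % (low, high))
    return low, high


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

On a node, parse_port_range(os.environ.get("GLOBUS_TCP_PORT_RANGE")) (after "import os") would flag an absent or malformed definition. A False result from can_connect is only a hint: the connection may be blocked by a firewall, but it may also be refused simply because nothing is listening on that port at that moment.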
Solution
- Ensure the CE is up to date, using the latest versions of globus-gma, globus-job-manager-marshal and globus-gass-cache-marshal.
- Fix GLOBUS_TCP_PORT_RANGE on the CE, WMS, Condor-G or UI as needed and open the firewall(s) accordingly.
- Check if the account to which the DN is mapped has a writable home directory (also check the subdirectories!). A globus-job-run may report this error:
GRAM Job submission failed because cannot access cache files in ~/.globus/.gass_cache, check permissions, quota, and disk space (error code 76)
- Check contents of $GLOBUS_LOCATION/etc/grid-services/jobmanager-* files.
- Check contents of $GLOBUS_LOCATION/etc/globus-job-manager.conf.
- Ensure the /etc/grid-security directory is world-readable (hostkey.pem itself must remain protected).
- Ensure outgoing connections are allowed from the CE to the GLOBUS_TCP_PORT_RANGE on WMS/Condor-G (or UI).
- Check the LCAS/LCMAPS configuration on the CE. For every VOMS server of each supported VO there must be either a file like /etc/grid-security/vomsdir/$voms_server_fqdn.*.pem containing the current host certificate, or (for almost all node types) a file /etc/grid-security/vomsdir/$vo_name/$voms_server_fqdn.lsc containing exactly two lines:
  - the VOMS server DN
  - the DN of the issuing CA
  Otherwise there may be LCAS failures reported in /var/log/globus-gatekeeper.log:
  LCAS 0: lcas_plugin_voms-plugin_confirm_authorization_from_x509(): VOMS Signature error (failure)!
  Note: the full host certificates are still needed on the gLite 3.1 WMS and FTS node types.
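The two-line layout of the .lsc files described above can be checked mechanically. The following is a rough Python sketch under stated assumptions: the function names are hypothetical, blank lines are ignored, and a line is taken to "look like a DN" if it starts with "/" and contains "=". It does not verify that the DNs match the actual VOMS server certificate.

```python
def check_lsc_text(text):
    """Return a list of problems found in the text of an .lsc file (empty if none)."""
    # Assumption: blank lines are ignored when counting lines.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    problems = []
    if len(lines) != 2:
        problems.append("expected exactly 2 non-empty lines, found %d" % len(lines))
    for number, line in enumerate(lines, start=1):
        # Heuristic only: slash-separated DN such as /DC=.../CN=...
        if not line.startswith("/") or "=" not in line:
            problems.append("line %d does not look like a DN: %r" % (number, line))
    return problems


def check_lsc_file(path):
    """Read the file at path and check it with check_lsc_text()."""
    with open(path) as handle:
        return check_lsc_text(handle.read())
```

For example, check_lsc_text("/DC=org/DC=example/CN=voms.example.org\n/DC=org/DC=example/CN=Example CA\n") returns an empty list; both DNs here are made up for illustration. An empty result only means the layout is plausible, so a VOMS Signature error in the gatekeeper log can still occur if the DNs themselves are wrong.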