

Revision as of 20:37, 21 August 2011




10 data transfer to the server failed

Diagnosis

This error usually means that the Globus job manager on the CE cannot call back the WMS/Condor-G (or the UI in tests), or that the CE itself cannot be called back. This can happen because of firewall or other network problems, or because GLOBUS_TCP_PORT_RANGE is defined incorrectly or not at all on the CE, WMS, Condor-G or UI.
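
An absent or malformed definition can be ruled out by validating the value on each host involved. The following is a minimal Python sketch, not an official tool. It assumes the variable holds two port numbers separated by a comma or by whitespace; the function name parse_port_range and the example values are hypothetical.

```python
import os


def parse_port_range(value):
    """Return (low, high) parsed from 'low,high' or 'low high'.

    Assumption: the range is two integers separated by a comma or
    whitespace. Raises ValueError if the value is missing or malformed.
    """
    if value is None or not value.strip():
        raise ValueError("GLOBUS_TCP_PORT_RANGE is not defined")
    parts = value.replace(",", " ").split()
    if len(parts) != 2:
        raise ValueError(f"expected two port numbers, got {value!r}")
    low, high = (int(p) for p in parts)  # int() raises ValueError on non-numeric text
    if not (1 <= low <= high <= 65535):
        raise ValueError(f"invalid port range {low}-{high}")
    return low, high


if __name__ == "__main__":
    try:
        low, high = parse_port_range(os.environ.get("GLOBUS_TCP_PORT_RANGE"))
        print(f"GLOBUS_TCP_PORT_RANGE looks sane: {low}-{high}")
    except ValueError as err:
        print(f"GLOBUS_TCP_PORT_RANGE problem: {err}")
```

Run it in the same environment as the service in question, then compare the reported range with the ports actually opened in the firewall.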

It can also occur when the proxy is not acceptable to LCAS/LCMAPS on the CE, e.g. because it is a plain grid proxy instead of a VOMS proxy (currently the lcg-CE should still accept both), or when the VOMS attributes have expired.

Note: on the gLite 3.1 lcg-CE the globus-job-manager-marshal and globus-gass-cache-marshal daemons each allow only a limited number of requests to run in parallel (5 by default) and put the rest into a queue, which can be seen with "ps afuxwww". If any account with pending requests has a problem (see below), jobs for other accounts can also fail with this error!

Solution

  1. Ensure the CE is up to date, using the latest versions of globus-gma, globus-job-manager-marshal and globus-gass-cache-marshal.
  2. Fix GLOBUS_TCP_PORT_RANGE on CE, WMS, Condor-G or UI as needed and open the firewall(s) correspondingly.
  3. Check if the account to which the DN is mapped has a writable home directory (also check the subdirectories!). A globus-job-run may report this error:
     GRAM Job submission failed because cannot access cache files in
     ~/.globus/.gass_cache, check permissions, quota, and disk space
     (error code 76)
  4. Check the contents of the $GLOBUS_LOCATION/etc/grid-services/jobmanager-* files.
  5. Check the contents of $GLOBUS_LOCATION/etc/globus-job-manager.conf.
  6. Ensure /etc/grid-security is world-readable (hostkey.pem itself must stay protected).
  7. Ensure outgoing connections are allowed from the CE to the GLOBUS_TCP_PORT_RANGE on WMS/Condor-G (or UI).
  8. Check the LCAS/LCMAPS configuration on the CE. For all VOMS servers of each supported VO there must either be a file like /etc/grid-security/vomsdir/$voms_server_fqdn.*.pem containing that server's current host certificate, or (for almost all node types) a file /etc/grid-security/vomsdir/$vo_name/$voms_server_fqdn.lsc containing exactly two lines:
    1. the VOMS server DN
    2. the DN of the issuing CA

    Otherwise there may be LCAS failures reported in /var/log/globus-gatekeeper.log:

      LCAS   0:       lcas_plugin_voms-plugin_confirm_authorization_from_x509():
       VOMS Signature error (failure)!

    The full host certificates are still needed on the gLite 3.1 WMS and FTS node types.
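
The .lsc requirement above (two lines: the VOMS server DN, then the DN of the issuing CA) can be checked mechanically. The following is a rough Python sketch under stated assumptions: blank lines are ignored when counting, and a DN is assumed to be written in the slash-separated form starting with "/". The function names check_lsc and check_vomsdir are hypothetical, and the check does not verify that the DNs are the right ones.

```python
from pathlib import Path


def check_lsc(path):
    """Return a list of problems found in one .lsc file (empty if none)."""
    problems = []
    # Assumption: blank lines are not counted towards the two expected lines.
    lines = [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]
    if len(lines) != 2:
        problems.append(f"expected exactly 2 non-empty lines, found {len(lines)}")
    for ln in lines:
        # Assumption: DNs use the slash-separated form, e.g. /DC=.../CN=...
        if not ln.startswith("/"):
            problems.append(f"line does not look like a DN: {ln!r}")
    return problems


def check_vomsdir(vomsdir="/etc/grid-security/vomsdir"):
    """Map every <vomsdir>/<vo_name>/*.lsc file to its list of problems."""
    return {str(p): check_lsc(p) for p in sorted(Path(vomsdir).glob("*/*.lsc"))}


if __name__ == "__main__":
    for name, problems in check_vomsdir().items():
        print(name, "OK" if not problems else "; ".join(problems))
```

A file reported as OK only has the expected shape; the DNs still have to be compared by hand with those of the VOMS server certificate and its issuing CA.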