Tools/Manuals/TS51

From EGIWiki
Back to Troubleshooting Guide


Cannot read JobWrapper output...

Full message

======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:

Status info for the Job : https://wms221.cern.ch:9000/yFtw9svc7vBkj3GnvCHwOA
Current Status:     Aborted 
Logged Reason(s):
    - [...]
    - Standard output does not contain useful data.
      Cannot read JobWrapper output, both from Condor and from Maradona.
[...]
==========================================================================

Diagnosis

The user job's exit status could not be delivered to the WMS, even though two independent methods should have been tried:

  1. The user job exit status is written into an extra "Maradona" file that is copied to the WMS with globus-url-copy.
  2. The job wrapper script writes the user job exit status to stdout, which is supposed to be sent back to the WMS by Globus.

When both methods fail, it usually means that the job did not run to completion!

That means it either did not start at all:

  • batch system submission problem (e.g. batch system in bad state)
  • batch system stagein problem ("lcg" job managers need stagein to work)
  • batch system query problem (e.g. "qstat -f <batch_job_ID>" must work)
  • WN file system full
  • home directory absent or unwritable
  • home directories not shared between CE and WN, while using standard job manager
  • home directories on CE and WN have different paths (symlinks may not work!)
  • NFS synchronization slowness between CE and WN (for standard job manager)
  • NFS server overloaded/stuck (when a standard job manager is used)
  • time not synchronized between CE and WN
  • mismatch between forward and reverse DNS for CE name/IP-address (as seen from WN)
  • WN cannot globus-url-copy from/to CE (see below)
  • WN cannot scp from/to CE (also check MaxStartups in sshd_config on the CE)
  • grid environment not set up automatically on the WN (!), e.g. due to bad NFS mount
  • ...
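Several of the "did not start" causes above can be checked directly on a worker node. Below is a minimal sketch; the paths and the use of /tmp are illustrative assumptions, not EGI-mandated values, and the time-synchronization and NFS items still need to be compared against the CE by hand (e.g. run `date` on both hosts):

```shell
#!/bin/sh
# Quick local checks for some "job never started" causes listed above.

# Is the filesystem holding the given path completely full?
check_disk() {
    usage=$(df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }')
    [ "$usage" -lt 100 ]
}

# Is the home directory present and writable? The job wrapper needs this.
check_home_writable() {
    t="$HOME/.wn_write_test.$$"
    touch "$t" 2>/dev/null && rm -f "$t"
}

check_disk /tmp     && echo "tmp filesystem: ok" || echo "tmp filesystem: FULL"
check_home_writable && echo "home directory: ok" || echo "home directory: NOT WRITABLE"
```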

or it got killed before it finished:

  • it ran into the wall-clock or CPU time limit (usually the most common cause!)
  • the job got pre-empted by the batch system
  • NFS server overloaded/stuck (when a standard job manager is used)
  • WN ran out of memory and killed "random" processes
  • WN crashed
  • ...
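For the "WN ran out of memory" case, kernel OOM-killer messages usually survive in the system log. A hedged sketch follows; the log path is an assumption, and on some systems `dmesg` or `journalctl -k` would be the source instead:

```shell
# Scan a kernel/system log for OOM-killer activity that may have
# killed "random" processes belonging to a job.
find_oom_kills() {
    grep -Ei 'out of memory|oom-killer|killed process' "$1"
}

# Typical use on a WN (may require root):
# find_oom_kills /var/log/messages
```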

Note: the "lcg" job managers need globus-url-copy to work from WN to CE, otherwise a job cannot even start. The command can fail in particular when the CE does not have the correct contents in /etc/grid-security/vomsdir, e.g. an outdated VOMS server host certificate or bad contents in some of the vo/*.lsc files. Such a problem may prevent certain new jobs from being submitted to the CE, while previously submitted jobs may experience fatal errors when they actually start on the WN.
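An outdated VOMS server host certificate under /etc/grid-security/vomsdir can be spotted with openssl. This is only a sketch: the directory layout varies, and newer setups keep per-VO *.lsc files instead of raw certificates, so treat the *.pem glob as an assumption:

```shell
# Warn about any PEM certificate under a vomsdir-style directory that has
# expired or will expire within one day (-checkend takes seconds).
check_vomsdir() {
    rc=0
    for pem in "$1"/*.pem; do
        [ -e "$pem" ] || continue
        if ! openssl x509 -checkend 86400 -noout -in "$pem" > /dev/null 2>&1; then
            echo "EXPIRING/EXPIRED: $pem"
            rc=1
        fi
    done
    return $rc
}

# Typical invocation on the CE:
# check_vomsdir /etc/grid-security/vomsdir
```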

Jobs may be submitted successfully to the CE, but fail to start on the WN when the gatekeeper and the gridftpd on the CE do not map the user proxy to the same local account (this should never happen on an LCG-CE). Check this as follows for an LCG-CE or OSG-CE:

globus-job-run my-CE /usr/bin/id
globus-url-copy file:/etc/group gsiftp://my-CE/tmp/test.$$
globus-job-run my-CE /bin/ls -l /tmp/test.$$

On a CREAM CE such a problem is very unlikely and cannot easily be checked by the user.

If the job did start, it may actually have finished; in that case both of the following must have failed:

  1. the WN could not do a globus-url-copy to the WMS, _and_
  2. Globus could not send back the job wrapper stdout, e.g. because it was not copied back from the WN to the CE, or because globus-url-copy does not work from the CE to the WMS.

This combined set of problems can still have a single cause. The following examples typically would cause certain new jobs to fail right away, while running jobs might be affected only on exit:

  • a firewall limiting outgoing connections (to the WMS on port 2811 or _its_ Globus port range)
  • some CRLs out of date both on CE and WN
  • some CA files absent
  • wrong time (zone) on CE and WN
  • ...
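The firewall item above can be probed from the CE or WN by attempting a plain TCP connection to the WMS GridFTP port. A sketch using bash's /dev/tcp pseudo-device; the host name below is taken from the log excerpt at the top of this page and is only an example:

```shell
# Return success iff a TCP connection to host:port opens within 5 seconds.
check_port() {
    timeout 5 bash -c "exec 3<> /dev/tcp/$1/$2" 2>/dev/null
}

# Example: can this host reach the WMS GridFTP server?
# check_port wms221.cern.ch 2811 && echo "reachable" || echo "blocked or down"
```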

Note: a single bad WN can cause a large number of jobs to fail at a site!