Tools/Manuals/TS51

Back to Troubleshooting Guide


Cannot read JobWrapper output...

Full message

======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:

Status info for the Job : https://wms221.cern.ch:9000/yFtw9svc7vBkj3GnvCHwOA
Current Status:     Aborted 
Logged Reason(s):
    - [...]
    - Standard output does not contain useful data.
      Cannot read JobWrapper output, both from Condor and from Maradona.
[...]
==========================================================================

Diagnosis

The user job exit status could not be delivered to the WMS, even though two independent methods should have been tried:

  1. The user job exit status is written into an extra "Maradona" file that is copied to the WMS with globus-url-copy.
  2. The job wrapper script writes the user job exit status to stdout, which is supposed to be sent back to the WMS by Globus.
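
For illustration, here is a minimal sketch of what the wrapper roughly does for these two methods; the file name, destination path and variable names are invented, only the general mechanism follows the description above:

# Method 1: write the exit status into an extra "Maradona" file and copy it to the WMS
echo "$job_exit_status" > Maradona.output
globus-url-copy file://$PWD/Maradona.output gsiftp://wms221.cern.ch/var/maradona/$job_id   # hypothetical destination path

# Method 2: write the exit status to the wrapper stdout, which Globus is supposed to send back to the WMS
echo "job exit status = $job_exit_status"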

When both methods fail, it usually means that the job did not run to completion!

That means the job either did not start at all, or it was killed before it finished.

Note: the "lcg" job managers need globus-url-copy to work from the WN to the CE, otherwise a job cannot even start. That command can fail in particular when the CE does not have the correct contents in /etc/grid-security/vomsdir, e.g. an outdated VOMS server host certificate or bad contents in some of the vo/*.lsc files. Such a problem may prevent certain new jobs from being submitted to the CE, while previously submitted jobs may fail with fatal errors when they actually start on the WN.
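
If this is suspected, a site administrator might run checks along the following lines; the hostnames, VO name and file names are placeholders, and which .lsc and .pem files exist depends on how the VO is configured:

# On the CE: inspect the VOMS trust material for the VO
ls -l /etc/grid-security/vomsdir/myvo/
cat /etc/grid-security/vomsdir/myvo/voms.example.org.lsc          # should contain the VOMS server DN and its CA DN
openssl x509 -noout -enddate -in /etc/grid-security/vomsdir/voms.example.org.pem   # old-style VOMS host certificate, if present

# On a WN: verify that globus-url-copy toward the CE works at all
globus-url-copy file:///etc/hostname gsiftp://my-CE.example.org/tmp/guc-test.$$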

Jobs may be submitted successfully to the CE, but fail to start on the WN when the gatekeeper and the gridftpd on the CE do not map the user proxy to the same local account (this should never happen on an LCG-CE). Check this as follows for an LCG-CE or OSG-CE:

globus-job-run my-CE /usr/bin/id
globus-url-copy file:/etc/group gsiftp://my-CE/tmp/test.$$
globus-job-run my-CE /bin/ls -l /tmp/test.$$
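
The account reported by /usr/bin/id in the first command should match the owner of /tmp/test.$$ shown by the final ls -l; if the gridftpd mapped the proxy to a different pool account, the two services map the user inconsistently, which can make jobs fail to start on the WN as described above.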

On a CREAM CE such a problem is very unlikely and cannot easily be checked by the user.

If the job was able to start, it may actually have finished, but in that case it must mean that:

  1. the WN could not do a globus-url-copy to the WMS, _and_
  2. Globus could not send back the job wrapper stdout, e.g. because it was not copied back from the WN to the CE, or because globus-url-copy does not work from the CE to the WMS.

This combined set of problems can still have a single cause, which typically makes certain new jobs fail right away, while running jobs might be affected only on exit.
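
To narrow this down, one might test the gsiftp path toward the WMS directly from the CE, or from a WN through a test job; the WMS host below is taken from the job ID in the error message above, while the destination path is only an example and may not be writable:

globus-url-copy file:///etc/hostname gsiftp://wms221.cern.ch/tmp/guc-test.$$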

Note: a single bad WN can be responsible for a large number of job failures at a site!
