Cannot read JobWrapper output...

Full message

======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:

Status info for the Job : https://wms221.cern.ch:9000/yFtw9svc7vBkj3GnvCHwOA
Current Status:     Aborted 
Logged Reason(s):
    - [...]
    - Standard output does not contain useful data.
      Cannot read JobWrapper output, both from Condor and from Maradona.
[...]
==========================================================================

Diagnosis

The exit status of the user job could not be delivered to the WMS, even though two independent methods should have been tried:

The user job exit status is written into an extra "Maradona" file that is copied to the WMS with globus-url-copy.
The job wrapper script writes the user job exit status to stdout, which is supposed to be sent back to the WMS by Globus.

When both methods fail, it usually means that the job did not run to completion!

That means it either did not start at all:

batch system submission problem (e.g. batch system in bad state)
batch system stagein problem ("lcg" job managers need stagein to work)
batch system query problem (e.g. "qstat -f <batch_job_ID>" must work)
WN file system full
home directory absent or unwritable
home directories not shared between CE and WN, while using standard job manager
home directories on CE and WN have different paths (symlinks may not work!)
NFS synchronization slowness between CE and WN (for standard job manager)
NFS server overloaded/stuck (when a standard job manager is used)
time not synchronized between CE and WN
mismatch between forward and reverse DNS for CE name/IP-address (as seen from WN)
WN cannot globus-url-copy from/to CE (see below)
WN cannot scp from/to CE (also check MaxStartups in sshd_config on the CE)
grid environment not set up automatically on the WN (!), e.g. due to bad NFS mount
...
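
A few of the local causes above (WN file system full, home directory absent or unwritable) can be checked directly on a WN. The following is only a minimal sketch: the function names and the 95% threshold are hypothetical, chosen for illustration, and it assumes a POSIX shell with the standard df and awk utilities. It does not cover the batch system, NFS, DNS or file transfer items.

```shell
# Hypothetical helpers (not part of any grid middleware); run on the WN,
# ideally under the same local account the job would be mapped to.

# Print the usage percentage (digits only) of the file system holding $1.
fs_usage_percent() {
    df -P "$1" 2>/dev/null | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed only if $1 is a directory in which a file can really be created.
dir_is_writable() {
    [ -d "$1" ] || return 1
    probe="$1/.wn_probe.$$"
    # Subshell: a failed redirection must not terminate the calling shell.
    ( : > "$probe" ) 2>/dev/null || return 1
    rm -f "$probe"
}

# Check directory $1 (default: $HOME); fail at or above $2 percent usage
# (default: 95, an arbitrary illustrative threshold).
check_wn_basics() {
    dir=${1:-$HOME}
    limit=${2:-95}
    rc=0
    if dir_is_writable "$dir"; then
        echo "OK: $dir is writable"
    else
        echo "FAIL: $dir is absent or unwritable"
        rc=1
    fi
    usage=$(fs_usage_percent "$dir")
    if [ -n "$usage" ] && [ "$usage" -ge "$limit" ] 2>/dev/null; then
        echo "FAIL: file system holding $dir is ${usage}% full"
        rc=1
    fi
    return $rc
}
```

Calling check_wn_basics without arguments checks $HOME; a non-zero return value indicates that at least one check failed.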

or it got killed before it finished:

it ran into the wall-clock or CPU time limit (should be the most common cause!)
the job got pre-empted by the batch system
NFS server overloaded/stuck (when a standard job manager is used)
WN ran out of memory and killed "random" processes
WN crashed
...

Note: the "lcg" job managers need globus-url-copy to work from WN to CE; otherwise a job cannot even start. The command can fail in particular when the CE does not have the correct contents in /etc/grid-security/vomsdir, e.g. an outdated VOMS server host certificate or bad contents in some of the vo/*.lsc files. Such a problem may prevent certain new jobs from being submitted to the CE, while jobs submitted earlier may fail with fatal errors when they actually start on the WN.

Jobs may be submitted successfully to the CE, but fail to start on the WN when the gatekeeper and the gridftpd on the CE do not map the user proxy to the same local account (this should never happen on an LCG-CE). Check this as follows for an LCG-CE or OSG-CE:

globus-job-run my-CE /usr/bin/id
globus-url-copy file:/etc/group gsiftp://my-CE/tmp/test.$$
globus-job-run my-CE /bin/ls -l /tmp/test.$$
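
The mapping is consistent when the account reported by /usr/bin/id (gatekeeper) is also the owner of the copied file (gridftpd). Note that the three commands must be run from the same shell, so that $$ expands to the same value each time. The comparison can be sketched as follows; the function names are hypothetical, and the sketch assumes the usual "uid=NUMBER(NAME) ..." format of id and the owner in the third column of ls -l.

```shell
# Hypothetical helpers for comparing the two mappings; they only parse text
# and do not contact the CE themselves.

# Read `id` output on stdin, e.g. "uid=18118(dteam001) gid=...", print the name.
account_from_id() {
    sed -n 's/^uid=[0-9]*(\([^)]*\)).*/\1/p'
}

# Read `ls -l` output on stdin, print the owner of the first entry.
owner_from_ls() {
    awk '{ print $3; exit }'
}

# $1 = output of id, $2 = output of ls -l; succeed if both name the same account.
same_mapping() {
    a=$(printf '%s\n' "$1" | account_from_id)
    b=$(printf '%s\n' "$2" | owner_from_ls)
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Possible usage with the commands above (commented out: needs a valid proxy):
#   f=/tmp/test.$$
#   id_out=$(globus-job-run my-CE /usr/bin/id)
#   globus-url-copy file:/etc/group "gsiftp://my-CE$f"
#   ls_out=$(globus-job-run my-CE /bin/ls -l "$f")
#   same_mapping "$id_out" "$ls_out" && echo "same account" || echo "DIFFERENT accounts"
```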

On a CREAM CE such a problem is very unlikely and cannot easily be checked by the user.

If the job was able to start, it may actually have finished, but in that case both of the following must be true:

the WN could not do a globus-url-copy to the WMS, _and_
Globus could not send back the job wrapper stdout, e.g. because it was not copied back from the WN to the CE, or because globus-url-copy does not work from the CE to the WMS.

This combination of problems can still have a single cause. The following examples would typically cause certain new jobs to fail right away, while running jobs might be affected only on exit:

a firewall limiting outgoing connections (to the WMS on port 2811 or _its_ Globus port range)
some CRLs out of date both on CE and WN
some CA files absent
wrong time (zone) on CE and WN
...
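
As a rough first check for out-of-date CRLs, one can look for CRL files that have not been rewritten recently. This is only a heuristic sketch: it assumes the conventional layout with CRL files named *.r0 in /etc/grid-security/certificates and a regular CRL update job on the node, and the function name and the 7-day threshold are hypothetical. An old modification time does not prove that a CRL has expired; it only suggests that updates may have stopped.

```shell
# Hypothetical helper: list CRL files (*.r0) in directory $1 whose last
# modification is more than $2 days ago.
stale_crls() {
    find "$1" -name '*.r0' -mtime +"$2" -print
}

# Possible usage on both the CE and a WN (commented out):
#   stale_crls /etc/grid-security/certificates 7
```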

Note: a single bad WN can cause a large number of jobs to fail at a site!
