Unspecified gridmanager error
Full message
$ glite-wms-job-logging-info -v 2 https://wms221.cern.ch:9000/fFtw9svc7vBkj3GnvCHwOH
[...]
Event: Done
[...]
- Exit code = 1
[...]
- Reason = Got a job held event, reason: Unspecified gridmanager error
- Source = LogMonitor
[...]
- Status code = FAILED
[...]
Diagnosis
This usually means there was a problem with the remote batch system: the job could not be submitted. Possible causes include:
- the non-interactive environment on the CE is incomplete, causing the submit command to fail
- the user has no permission to submit to the given queue
- the batch system is in a bad state (at least for some grid users)
- a bad WN is refusing or failing jobs, e.g. because one of its partitions is full
- ssh from the WN to the CE does not (always) work; see the note on the "lcgpbs" job manager below
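One of the causes above, a full partition on a WN, is quick to rule out. The following is a minimal sketch, not part of the original guide: the function name full_partitions and the default threshold of 95% are assumptions chosen for illustration. It reads `df -P` output on standard input and prints every mount point at or above the threshold.

```shell
#!/bin/sh
# Hypothetical helper: list mount points whose usage is at or above a limit.
# Reads POSIX-format "df -P" output on stdin; the limit (percent) is the
# first argument and defaults to 95, an arbitrary value for illustration.
# Mount points containing spaces are not handled.
full_partitions() {
    awk -v limit="${1:-95}" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)          # strip the trailing percent sign
            if (use + 0 >= limit + 0)
                print $6, $5           # mount point and usage
        }'
}
```

On a suspect WN it could be run as `df -P | full_partitions 95`; no output means no partition reached the threshold.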
Note: sgm and prd pool accounts and ordinary pool accounts usually have different primary groups. For PBS/Torque all those groups must appear in /var/spool/pbs/server_priv/acl_groups/QUEUE_NAME. YAIM will take care of that when the CE is (re)configured. Look for GROUP_ENABLE in the YAIM documentation: https://twiki.cern.ch/twiki/bin/view/LCG/Site-info_configuration_variables
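The group check described above can be sketched as follows. This is an illustration under stated assumptions, not an official tool: the function name group_in_acl is hypothetical, and the ACL file is assumed to hold one group name per line. The file is normally readable only by a privileged user on the batch server.

```shell
#!/bin/sh
# Hypothetical helper: succeed if GROUP appears as a whole line in ACLFILE.
# Assumes the ACL file lists one group name per line.
# Usage: group_in_acl GROUP ACLFILE
group_in_acl() {
    # -x: match the whole line, -F: fixed string, -q: no output
    grep -qxF -- "$1" "$2"
}
```

For example, with a hypothetical pool account dteam001 and the queue ops, `group_in_acl "$(id -gn dteam001)" /var/spool/pbs/server_priv/acl_groups/ops || echo "group missing"` reports a primary group that is not allowed to submit to that queue.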
Check /var/spool/pbs/mom_logs on the WN for PBS/Torque errors.
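When the batch job id is known, the relevant MOM log entries can be pulled out directly. A minimal sketch follows; the function name mom_log_for_job is hypothetical, and the only thing assumed about the log format is that each relevant line contains the job id.

```shell
#!/bin/sh
# Hypothetical helper: print every MOM log line that mentions a job id.
# Usage: mom_log_for_job JOBID [LOGDIR]
mom_log_for_job() {
    jobid=$1
    logdir=${2:-/var/spool/pbs/mom_logs}
    # -F: treat the job id as a fixed string, -h: omit file names
    grep -hF -- "$jobid" "$logdir"/* 2>/dev/null
}
```

Run it on the WN with the job id as printed by qsub, e.g. `mom_log_for_job "$jid"`.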
Note: the "lcgpbs" job manager uses scp to stage in the user proxy, so the grid account on the WN must be able to scp files from the CE. Beware of the MaxStartups limit in sshd_config on the CE, which may be too low. Torque puts a job in the W state when the stagein has failed, and the "lcgpbs" job manager cancels (qdel) any job that is reported in that state.
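Both points in this note can be checked by hand. The sketch below is illustrative only: the function names max_startups and parallel_scp_probe, the default of 20 parallel transfers, and the /tmp/scp-probe file prefix are all assumptions. The first function prints the MaxStartups value set in an sshd_config file (nothing is printed when the keyword is unset, in which case sshd's built-in default applies). The second starts several scp transfers from the CE at once and counts the failures, which may reveal a limit that is too low.

```shell
#!/bin/sh
# Hypothetical helper: print the MaxStartups value from an sshd_config file.
# Keywords are case-insensitive; commented-out lines are ignored.
max_startups() {
    awk 'tolower($1) == "maxstartups" { print $2; exit }' \
        "${1:-/etc/ssh/sshd_config}"
}

# Hypothetical helper: run N scp transfers from the CE in parallel and
# report how many failed. Intended to run under a grid account on a WN.
# Usage: parallel_scp_probe CE_HOST REMOTE_FILE [N]
parallel_scp_probe() {
    ce=$1
    file=$2
    n=${3:-20}
    i=0
    pids=
    while [ "$i" -lt "$n" ]; do
        # BatchMode=yes makes scp fail instead of prompting for a password
        scp -q -o BatchMode=yes "$ce:$file" "/tmp/scp-probe.$$.$i" &
        pids="$pids $!"
        i=$((i + 1))
    done
    fail=0
    for pid in $pids; do
        wait "$pid" || fail=$((fail + 1))
    done
    rm -f /tmp/scp-probe.$$.*
    echo "$fail of $n transfers failed"
}
```

On the CE, `max_startups` shows the configured limit; on a WN, something like `parallel_scp_probe ce.example.org /etc/hostname 20` (the host name is a placeholder) exercises concurrent connections.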
When testing PBS/Torque do not only submit simple test scripts, but also test stagein, e.g. by running the following script under a grid account on the CE, with the queue name as argument:
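#!/bin/sh
# Queue name is the first argument; defaults to "ops".
queue=${1:-ops}
base=stagein-`date +%Y%m%d_%H%M%S`
out=/tmp/$base.out
err=/tmp/$base.err
dat=/tmp/$base.dat
job=/tmp/$base.job
# Data file that the job has to stage in from this host.
echo test successful > $dat
# The here-document is unquoted, so the variables and `hostname`
# are expanded now, on the CE, when the job script is written.
cat > $job << EOF
#!/bin/sh
#
#PBS -S /bin/sh
#PBS -m n
#PBS -q $queue
#PBS -o $out
#PBS -e $err
#PBS -r n
#PBS -W stagein=$dat@`hostname`:$dat
#PBS -l nodes=1
hostname
cat $dat
EOF
jid=`qsub < $job` || exit
sleep 5
# Poll until qstat no longer knows the job, i.e. it has finished.
while qstat $jid 2> /dev/null
do
    sleep 5
done
echo Output:
echo =======
cat $out
echo =======
echo
echo Errors:
echo =======
cat $err
echo =======
echo
# The test passes if the staged-in data appears in the job output.
grep "`cat $dat`" $out || echo test failed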