Unspecified gridmanager error

Full message

$ glite-wms-job-logging-info -v 2 https://wms221.cern.ch:9000/fFtw9svc7vBkj3GnvCHwOH

[...]
Event: Done
[...]
- Exit code                  =    1
[...]
- Reason                     =    Got a job held event, reason:
  Unspecified gridmanager error
- Source                     =    LogMonitor
[...]
- Status code                =    FAILED
[...]

Diagnosis

This usually means there was a problem with the remote batch system: the job could not be submitted. Possible causes include:

the non-interactive environment on the CE is incomplete, causing the submit command to fail
the user has no permission to submit to the given queue
the batch system is in a bad state (at least for some grid users)
a faulty WN is refusing or failing jobs, e.g. because one of its partitions is full
ssh from WN to CE does not (always) work, see below

Note: the sgm, prd and ordinary pool accounts usually have different primary groups. For PBS/Torque, all of those groups must appear in /var/spool/pbs/server_priv/acl_groups/QUEUE_NAME. YAIM takes care of that when the CE is (re)configured. Look for GROUP_ENABLE in the YAIM documentation: https://twiki.cern.ch/twiki/bin/view/LCG/Site-info_configuration_variables
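The group check described in the note can be sketched as a small shell helper. This is only a sketch: it assumes that the ACL file lists one group name per line, and the function names group_in_acl and check_account are made up for this example.

```shell
#!/bin/sh
# group_in_acl GROUP ACL_FILE
# Succeeds if GROUP appears as a complete line in ACL_FILE.
# Assumption: the ACL file holds one group name per line.
group_in_acl() {
    grep -qxF -- "$1" "$2"
}

# check_account ACCOUNT QUEUE [ACL_DIR]
# Looks up the primary group of ACCOUNT and checks it against the
# ACL file of QUEUE. Returns 2 if the account is unknown, 1 if the
# group is not listed, 0 otherwise.
check_account() {
    account=$1
    queue=$2
    acl_dir=${3:-/var/spool/pbs/server_priv/acl_groups}
    group=`id -gn "$account"` || return 2
    if group_in_acl "$group" "$acl_dir/$queue"
    then
        echo "$account: primary group $group is listed for queue $queue"
    else
        echo "$account: primary group $group is NOT listed for queue $queue"
        return 1
    fi
}
```

On the CE this could be called as, for example, check_account opssgm001 ops, where both the account name and the queue name are placeholders for the local ones. Reading the ACL directory normally requires root.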

Check /var/spool/pbs/mom_logs on the WN for PBS/Torque errors.
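To find the relevant entries quickly, the log files can be searched for the job identifier. The helper below is a minimal sketch; the name mom_log_lines is made up for this example, and it simply assumes that the job identifier appears literally in the log lines of interest.

```shell
#!/bin/sh
# mom_log_lines JOBID [LOGDIR]
# Prints every line in the files under LOGDIR that mentions JOBID.
# LOGDIR defaults to the mom_logs directory named in the text.
mom_log_lines() {
    jobid=$1
    logdir=${2:-/var/spool/pbs/mom_logs}
    # -h: do not prefix lines with file names, -F: JOBID is a fixed string
    grep -hF -- "$jobid" "$logdir"/* 2> /dev/null
}
```

On the WN it would be run with the job identifier as printed by qsub, for example mom_log_lines 12345.ce.example.org, where the identifier shown is a placeholder.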

Note: the "lcgpbs" job manager uses scp for the stagein of the user proxy, so the grid account on the WN has to be able to scp files from the CE. Beware of the MaxStartups limit in sshd_config on the CE: it may be too low when many jobs start at about the same time. Torque puts a job in the W state when the stagein has failed, and the "lcgpbs" job manager cancels (qdel) any job that is reported in that state.
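The MaxStartups value that is explicitly configured on the CE can be read with a short helper such as the one below. It is a sketch under a few assumptions: the function name max_startups is made up, the default path /etc/ssh/sshd_config may differ on a given installation, and Match blocks or included files are not taken into account. No output means that the keyword is not set, in which case sshd uses its built-in default.

```shell
#!/bin/sh
# max_startups [CONFIG_FILE]
# Prints the first MaxStartups value set in CONFIG_FILE, if any.
# Keywords are matched case-insensitively and commented-out lines
# are skipped, because their first field starts with "#".
max_startups() {
    awk 'tolower($1) == "maxstartups" { print $2; exit }' \
        "${1:-/etc/ssh/sshd_config}"
}
```

Only the first match is printed, because sshd uses the first value it obtains for a keyword.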

When testing PBS/Torque do not only submit simple test scripts, but also test stagein, e.g. by running the following script under a grid account on the CE, with the queue name as argument:

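#!/bin/sh
# Stagein test: run under a grid account on the CE.
# The optional argument is the queue name (default: ops).
queue=${1:-ops}
base=stagein-`date +%Y%m%d_%H%M%S`
out=/tmp/$base.out
err=/tmp/$base.err
dat=/tmp/$base.dat
job=/tmp/$base.job
# Data file that the WN has to stage in from the CE.
echo test successful > $dat
# Generate the job script; the variables and the backquoted hostname
# are expanded now, on the CE.
cat > $job << EOF
#!/bin/sh
#
#PBS -S /bin/sh
#PBS -m n
#PBS -q $queue
#PBS -o $out
#PBS -e $err
#PBS -r n
#PBS -W stagein=$dat@`hostname`:$dat
#PBS -l nodes=1
hostname
cat $dat
EOF
# Submit the job, then poll until qstat no longer reports it.
jid=`qsub < $job` || exit
sleep 5
while qstat $jid 2> /dev/null
do
    sleep 5
done
echo Output:
echo =======
cat $out
echo =======
echo
echo Errors:
echo =======
cat $err
echo =======
echo
# The test passes only if the staged-in data appears in the job output.
grep "`cat $dat`" $out || echo test failed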
