Tools/Manuals/TS75


Unspecified gridmanager error

Full message

$ glite-wms-job-logging-info -v 2 https://wms221.cern.ch:9000/fFtw9svc7vBkj3GnvCHwOH

[...]
Event: Done
[...]
- Exit code                  =    1
[...]
- Reason                     =    Got a job held event, reason:
  Unspecified gridmanager error
- Source                     =    LogMonitor
[...]
- Status code                =    FAILED
[...]

Diagnosis

This usually means there was a problem with the remote batch system: the job could not be submitted. Possible causes include:

Note: sgm and prd pool accounts and ordinary pool accounts usually have different primary groups. For PBS/Torque all those groups must appear in /var/spool/pbs/server_priv/acl_groups/QUEUE_NAME. YAIM will take care of that when the CE is (re)configured. Look for GROUP_ENABLE in the YAIM documentation: https://twiki.cern.ch/twiki/bin/view/LCG/Site-info_configuration_variables
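The group check described in the note above can be sketched as a small shell function. This is only an illustration: it assumes the ACL file lists one group name per line, and the function name missing_acl_groups is made up for this example.

```shell
#!/bin/sh
# Sketch: print every given group that is NOT listed in a queue ACL file.
# Assumption: the ACL file contains one group name per line.
missing_acl_groups() {
    acl_file=$1
    shift
    for grp in "$@"
    do
        # -x matches the whole line, -F treats the group name as a fixed string
        grep -qxF "$grp" "$acl_file" 2> /dev/null || echo "$grp"
    done
}
```

It could be called on the CE as, for example, missing_acl_groups /var/spool/pbs/server_priv/acl_groups/QUEUE_NAME `id -gn ACCOUNT`, where ACCOUNT stands for one of the pool accounts in question. Any group it prints is missing from the queue ACL.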

Check /var/spool/pbs/mom_logs on the WN for PBS/Torque errors.
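A minimal sketch of such a log check follows. It assumes that the MOM writes one log file per day, named after the date in YYYYMMDD form. The function name mom_log_matches and the choice of search pattern are illustrative only.

```shell
#!/bin/sh
# Sketch: search one day's MOM log for a pattern (case-insensitive).
# Assumption: log files in the directory are named YYYYMMDD.
mom_log_matches() {
    pattern=$1
    logdir=${2:-/var/spool/pbs/mom_logs}
    day=${3:-`date +%Y%m%d`}
    grep -i -- "$pattern" "$logdir/$day"
}
```

On the WN, mom_log_matches JOBID would then show today's log lines that mention the given job ID, and mom_log_matches error would list lines containing the word "error".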

Note: the "lcgpbs" job manager uses scp for the stagein of the user proxy, so the grid account on the WN must be able to scp files from the CE. Beware of the MaxStartups limit in sshd_config on the CE: it may be too low when many jobs start at the same time. Torque puts a job in the W state when the stagein has failed, and the "lcgpbs" job manager cancels (qdel) any job that is reported in that state.
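To see which MaxStartups value is actually configured on the CE, something like the following sketch may help. It assumes the usual whitespace-separated "keyword value" form of sshd_config lines and the common path /etc/ssh/sshd_config. The function name sshd_max_startups is made up for this example.

```shell
#!/bin/sh
# Sketch: print the MaxStartups value set in an sshd configuration file.
# Prints nothing when the directive is absent or commented out,
# in which case sshd uses its built-in default.
sshd_max_startups() {
    conf=${1:-/etc/ssh/sshd_config}
    # sshd keywords are case-insensitive and the first occurrence wins
    awk 'tolower($1) == "maxstartups" { print $2; exit }' "$conf"
}
```

An empty result means no explicit limit is set. A transfer from the WN can be tested separately by running scp by hand under the grid account.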

When testing PBS/Torque do not only submit simple test scripts, but also test stagein, e.g. by running the following script under a grid account on the CE, with the queue name as argument:

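#!/bin/sh
# Stagein test for PBS/Torque: submits a job that stages in a data file
# from this host and prints its content on the WN.
# The optional argument is the queue name (default: ops).
queue=${1:-ops}
base=stagein-`date +%Y%m%d_%H%M%S`
out=/tmp/$base.out
err=/tmp/$base.err
dat=/tmp/$base.dat
job=/tmp/$base.job
# Data file to be staged in; its content marks a successful run.
echo test successful > $dat
# Generate the job script. The EOF delimiter is unquoted, so the variables
# and the `hostname` on the stagein line are expanded here, on the CE.
cat > $job << EOF
#!/bin/sh
#
#PBS -S /bin/sh
#PBS -m n
#PBS -q $queue
#PBS -o $out
#PBS -e $err
#PBS -r n
#PBS -W stagein=$dat@`hostname`:$dat
#PBS -l nodes=1
hostname
cat $dat
EOF
# Submit the job; give up if qsub fails.
jid=`qsub < $job` || exit
# Poll until qstat no longer knows the job.
sleep 5
while qstat $jid 2> /dev/null
do
    sleep 5
done
echo Output:
echo =======
cat $out
echo =======
echo
echo Errors:
echo =======
cat $err
echo =======
echo
# The test passed only if the staged-in content appears in the job output.
grep "`cat $dat`" $out || echo test failed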