Back to Troubleshooting Guide


Unspecified gridmanager error

Full message

$ glite-wms-job-logging-info -v 2 https://wms221.cern.ch:9000/fFtw9svc7vBkj3GnvCHwOH

[...]
Event: Done
[...]
- Exit code                  =    1
[...]
- Reason                     =    Got a job held event, reason:
  Unspecified gridmanager error
- Source                     =    LogMonitor
[...]
- Status code                =    FAILED
[...]

Diagnosis

This usually means there was a problem with the remote batch system: the job could not be submitted. Possible causes include:

  • the non-interactive environment on the CE is incomplete, causing the submit command to fail
  • the user has no permission to submit to the given queue
  • the batch system is in a bad state (at least for some grid users)
  • there is a bad WN refusing or failing jobs, e.g. with a full partition (a quick check is sketched after this list)
  • ssh from WN to CE does not (always) work, see below
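
A quick way to spot a bad WN under PBS/Torque is to list the nodes the batch system considers unusable and to check for full partitions on a suspect node. A minimal sketch, assuming a default Torque installation; the WN host name and mount points are examples:

# On the CE: list nodes that are down, offline or unknown
pbsnodes -l
# Towards a suspect WN: check for full partitions
ssh wn042.example.org df -h /tmp /home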

Note: sgm and prd pool accounts and ordinary pool accounts usually have different primary groups. For PBS/Torque all those groups must appear in /var/spool/pbs/server_priv/acl_groups/QUEUE_NAME. YAIM will take care of that when the CE is (re)configured. Look for GROUP_ENABLE in the YAIM documentation: https://twiki.cern.ch/twiki/bin/view/LCG/Site-info_configuration_variables
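
To verify this by hand, compare the primary groups of the pool accounts with the contents of the queue's acl_groups file. A minimal sketch; the account names and the "ops" queue are examples:

# Primary groups of an ordinary and an sgm pool account
id -gn ops001
id -gn opssgm01
# Groups allowed to submit to the queue (on the PBS/Torque server)
cat /var/spool/pbs/server_priv/acl_groups/ops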

Check /var/spool/pbs/mom_logs on the WN for PBS/Torque errors.
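
The mom_logs files are typically named after the date. A minimal sketch for a default Torque installation:

# On the WN: recent pbs_mom errors for today
grep -i error /var/spool/pbs/mom_logs/`date +%Y%m%d`
tail -n 50 /var/spool/pbs/mom_logs/`date +%Y%m%d`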

Note: the "lcgpbs" job manager uses scp for the stagein of the user proxy, so the grid account on the WN has to be able to scp files from the CE. Beware of the MaxStartups limit in sshd_config on the CE: it may be too low. The "lcgpbs" job manager will cancel (qdel) a job that is reported in the W state. Torque will put a job in that state when the stagein failed.
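
To check these points by hand, try an scp from a WN back to the CE under a grid pool account, inspect the sshd limit on the CE, and look for jobs stuck in the W state. A minimal sketch; the CE host name is an example:

# On a WN, as a grid pool account: copy a file from the CE
scp ce.example.org:/etc/hosts /tmp/scp-test && echo scp OK
# On the CE: check the sshd connection limit
grep -i maxstartups /etc/ssh/sshd_config
# On the CE: jobs in the W (waiting) state, e.g. after a failed stagein
qstat | grep ' W '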

When testing PBS/Torque, do not only submit simple test scripts, but also test stagein, e.g. by running the following script under a grid account on the CE, with the queue name as argument:

#!/bin/sh
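# Stagein test: submit a PBS job that stages a small file in from this host
# (the CE) and prints its content; the queue name can be given as argument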
queue=${1:-ops}
base=stagein-`date +%Y%m%d_%H%M%S`
out=/tmp/$base.out
err=/tmp/$base.err
dat=/tmp/$base.dat
job=/tmp/$base.job
echo test successful > $dat
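# Build the job: stage $dat in from this host and print it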
cat > $job << EOF
#!/bin/sh
#
#PBS -S /bin/sh
#PBS -m n
#PBS -q $queue
#PBS -o $out
#PBS -e $err
#PBS -r n
#PBS -W stagein=$dat@`hostname`:$dat
#PBS -l nodes=1
hostname
cat $dat
EOF
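# Submit the job and poll qstat until it has left the queue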
jid=`qsub < $job` || exit
sleep 5
while qstat $jid 2> /dev/null
do
    sleep 5
done
echo Output:
echo =======
cat $out
echo =======
echo 
echo Errors:
echo =======
cat $err
echo =======
echo 
grep "`cat $dat`" $out || echo test failed
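
For example, saved as test-stagein.sh (the file name is arbitrary) and run on the CE under an ordinary grid pool account:

sh ./test-stagein.sh ops

The output section should contain "test successful"; "test failed" is printed only when the staged-in file did not make it into the job output.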