Note: this wiki is deprecated and due to be decommissioned by the end of September 2022. The content is being migrated elsewhere; new updates will be ignored and lost. If needed, you can get in touch with the EGI SDIS team using operations @


From EGIWiki

Back to Troubleshooting Guide

Unspecified gridmanager error

Full message

$ glite-wms-job-logging-info -v 2

Event: Done
- Exit code                  =    1
- Reason                     =    Got a job held event, reason:
  Unspecified gridmanager error
- Source                     =    LogMonitor
- Status code                =    FAILED


This usually means there was a problem with the remote batch system: the job could not be submitted. Possible causes include:

  • the non-interactive environment on the CE is incomplete, causing the submit command to fail
  • the user has no permission to submit to the given queue
  • the batch system is in a bad state (at least for some grid users)
  • there is a bad WN refusing or failing jobs, e.g. with a full partition
  • ssh from WN to CE does not (always) work, see below
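The first cause in the list above can be probed directly: a sketch, assuming that an `env -i` shell is a reasonable approximation of the stripped-down non-interactive environment the job manager runs the submit command in:

```shell
# Run on the CE. With an empty environment, check that the batch submit
# command (qsub for PBS/Torque) can still be found and executed; if not,
# the non-interactive PATH set up by the system profile is incomplete.
env -i /bin/sh -c 'type qsub' \
    || echo "qsub not found in a non-interactive shell: check the system-wide PATH"
```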

Note: sgm and prd pool accounts usually have different primary groups than ordinary pool accounts. For PBS/Torque all those groups must appear in /var/spool/pbs/server_priv/acl_groups/QUEUE_NAME. YAIM will take care of that when the CE is (re)configured; look for GROUP_ENABLE in the YAIM documentation.
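The ACL requirement above can be verified by hand. A sketch, where the account names and QUEUE_NAME are placeholders to be replaced with your site's pool accounts and queue:

```shell
# For each flavour of pool account (ordinary, sgm, prd), check that its
# primary group is listed in the Torque ACL file for the queue.
acl=/var/spool/pbs/server_priv/acl_groups/QUEUE_NAME
for u in dteam001 sgmdtm001 prddtm001    # placeholder account names
do
    g=`id -gn $u 2> /dev/null` || continue
    # -x: the group must match a whole line of the ACL file
    grep -qx "$g" $acl || echo "primary group $g of $u missing from $acl"
done
```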

Check /var/spool/pbs/mom_logs on the WN for PBS/Torque errors.
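Torque names the mom log files by date (YYYYMMDD), so today's log can be scanned for errors around the time of the failure, for example:

```shell
# Run on the WN; path as given above, file name is today's date.
log=/var/spool/pbs/mom_logs/`date +%Y%m%d`
grep -iE 'error|fail' "$log" | tail -20
```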

Note: the "lcgpbs" job manager uses scp for the stagein of the user proxy, so the grid account on the WN has to be able to scp files from the CE. Beware of the MaxStartups limit in sshd_config on the CE: it may be too low. The "lcgpbs" job manager will cancel (qdel) a job that is reported in the W state. Torque will put a job in that state when the stagein failed.
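The scp path described above can be tested by hand. A sketch, run on a WN under the pool account the job maps to; CE_HOST is a placeholder, and /etc/hostname is just an arbitrary readable file on the CE:

```shell
# BatchMode makes ssh fail instead of prompting for a password, which is
# what happens during an unattended stagein.
CE_HOST=ce.example.org
scp -o BatchMode=yes $CE_HOST:/etc/hostname /tmp/stagein-probe.$$ \
    && echo "scp CE -> WN OK" \
    || echo "scp failed: check host keys, ~/.ssh and MaxStartups on the CE"
rm -f /tmp/stagein-probe.$$

# On the CE itself, check the sshd limit mentioned above:
grep -i maxstartups /etc/ssh/sshd_config
```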

When testing PBS/Torque, do not only submit simple test scripts: also test stagein, e.g. by running the following script under a grid account on the CE, with the queue name as argument:

#!/bin/sh
# Stage-in test for PBS/Torque; pass the queue name as the first argument.
queue=$1
base=stagein-`date +%Y%m%d_%H%M%S`
dat=$base.dat
job=$base.job
out=$base.out
err=$base.err
echo test successful > $dat
cat > $job << EOF
#PBS -S /bin/sh
#PBS -m n
#PBS -q $queue
#PBS -o $out
#PBS -e $err
#PBS -r n
#PBS -W stagein=$dat@`hostname`:$dat
#PBS -l nodes=1
cat $dat
EOF
jid=`qsub < $job` || exit
sleep 5
# Wait until the job has left the queue.
while qstat $jid > /dev/null 2>&1
do
    sleep 5
done
echo Output:
echo =======
cat $out
echo =======
echo Errors:
echo =======
cat $err
echo =======
grep "`cat $dat`" $out || echo test failed
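Assuming the script is saved as stagein-test.sh (the name is illustrative), it is run on the CE under a grid pool account with the queue name as argument:

```shell
# "short" is an example queue name; on success, "test successful" should
# appear in the Output section and the final grep should match.
sh stagein-test.sh short
```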