Tools/Manuals/TS59
Revision as of 14:28, 25 May 2011 by Aesch
Back to Troubleshooting Guide
444444 waiting jobs
Full message
$ lcg-infosites --vo ops ce -f SOME-SITE
# CPU    Free    Total Jobs    Running    Waiting    ComputingElement
----------------------------------------------------------------
  456       3             0          0     444444    ce.site.domain:2119/jobmanager-lcgpbs-ops
Diagnosis
The CE information system provider has safe, easily recognizable default values for various attributes that are published in the BDII; the 444444 waiting jobs shown above is one of them. Normally those defaults are overridden by the actual values that a provider or plugin script obtains from the batch system. When a default value does appear in the BDII, it means the provider failed and its output, if any, was discarded. An info provider can fail for at least the following reasons:
- For Torque-Maui systems: the user running the BDII ("edguser" or "ldap") has no permission to query the Maui scheduler. On the machine running Maui, check that this user is listed on the ADMIN3 line:
  [root@server ~]# grep ADMIN3 /var/spool/maui/maui.cfg
  ADMIN3 edginfo rgma edguser ldap
- The info provider timed out. For PBS/Torque systems this can happen when a worker node (WN) is in bad shape, causing commands like "qstat" and "pbsnodes" to hang. If the WN cannot be recovered quickly (e.g. by a reboot):
  - remove the WN from /var/spool/pbs/server_priv/nodes
  - remove the corresponding jobs from /var/spool/pbs/server_priv/jobs
  - restart the PBS/Torque daemons
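The checks above can be scripted. The following is a minimal sketch, not part of any middleware distribution: the function names are hypothetical, the Waiting value is assumed to be the fifth column as in the message shown on this page, and the last helper assumes the coreutils "timeout" command is installed.

```shell
#!/bin/sh
# Sketch only: all function names below are hypothetical helpers.

# has_default_waiting: read lcg-infosites-style output on stdin and succeed
# if any row reports the 444444 default in the Waiting column (5th field).
has_default_waiting() {
    awk '$5 == "444444" { found = 1 } END { exit !found }'
}

# maui_admin3_has_user FILE USER: succeed if USER appears on the ADMIN3 line
# of the given Maui configuration file.
maui_admin3_has_user() {
    awk -v u="$2" '
        $1 == "ADMIN3" { for (i = 2; i <= NF; i++) if ($i == u) ok = 1 }
        END { exit !ok }
    ' "$1"
}

# responds_within SECONDS CMD...: succeed if CMD finishes within SECONDS.
# Assumes coreutils "timeout"; a hanging command makes the check fail.
responds_within() {
    secs=$1
    shift
    timeout "$secs" "$@" >/dev/null 2>&1
}

# Example usage (left commented out; paths and timeouts are illustrative):
#   lcg-infosites --vo ops ce -f SOME-SITE | has_default_waiting \
#       && echo "default value published: info provider failed"
#   maui_admin3_has_user /var/spool/maui/maui.cfg edguser \
#       || echo "edguser is not listed in ADMIN3"
#   responds_within 10 pbsnodes -a \
#       || echo "pbsnodes failed or hung: look for a WN in bad shape"
```

Each helper reports its result only through its exit status, so the checks can be combined freely or called from a monitoring probe.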