Difference between revisions of "Tools/Manuals/TS59"
Revision as of 15:56, 29 November 2013
444444 waiting jobs
Full message
  $ lcg-infosites --vo ops ce -f SOME-SITE
  #   CPU    Free Total Jobs      Running Waiting ComputingElement
  ----------------------------------------------------------------
      456       3          0            0  444444 ce.site.domain:2119/jobmanager-lcgpbs-ops
- /var/lib/bdii/gip/plugin/glite-info-dynamic-scheduler-wrapper is the plugin responsible for dynamically modifying:
  GLUE2ComputingShareRunningJobs
  GLUE2ComputingShareWaitingJobs
  GLUE2ComputingShareTotalJobs
  GLUE2ComputingShareEstimatedAverageWaitingTime
  GLUE2ComputingShareEstimatedWorstWaitingTime
- /var/lib/bdii/gip/plugin/glite-info-dynamic-ce is the plugin responsible for getting information from the underlying batch system configuration to publish attributes like:
  GLUE2ComputingShareMaxCPUTime
  GLUE2ComputingShareMaxWallTime
  GLUE2ComputingShareMaxRunningJobs
  ...
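As a rough illustration, output like the lcg-infosites listing above can be screened for the placeholder automatically. The sketch below is not part of any gLite tool: the function name `suspicious_ces` is hypothetical, and it assumes that 444444 in the Waiting column marks a default value and that data rows have the six columns shown above.

```python
PLACEHOLDER = 444444  # assumed default value for waiting jobs, as in the listing above


def suspicious_ces(infosites_output, placeholder=PLACEHOLDER):
    """Return the CE identifiers whose Waiting column equals the placeholder."""
    hits = []
    for line in infosites_output.splitlines():
        fields = line.split()
        # Data rows have six columns: CPU, Free, Total Jobs, Running, Waiting, CE.
        # The header and separator lines do not match this shape and are skipped.
        if len(fields) != 6 or not all(f.isdigit() for f in fields[:5]):
            continue
        if int(fields[4]) == placeholder:
            hits.append(fields[5])
    return hits


sample = """\
#   CPU    Free Total Jobs      Running Waiting ComputingElement
----------------------------------------------------------------
    456       3          0            0  444444 ce.site.domain:2119/jobmanager-lcgpbs-ops
"""
print(suspicious_ces(sample))
```

A CE reported by this check is only a candidate: the cause still has to be found with the diagnosis steps.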
Diagnosis
The CE information system provider has safe, easily recognizable default values for various attributes that are published in the BDII. Normally those defaults are overridden by the actual values obtained from the batch system by a particular provider or plugin script. When a default value, such as the 444444 waiting jobs shown above, does appear in the BDII, it means the provider failed and its output, if any, was discarded. An info provider can fail for at least the following reasons:
- For Torque-Maui systems: the user running the BDII ("edguser" or "ldap") has no permission to query the Maui scheduler. On the machine running Maui, check that the user is listed on the ADMIN3 line:
  [root@server ~]# grep ADMIN3 /var/spool/maui/maui.cfg
  ADMIN3 edginfo rgma edguser ldap
- The info provider timed out. For PBS/Torque systems this can happen when a worker node (WN) is in bad shape, causing commands like "qstat" and "pbsnodes" to hang. If the WN cannot be recovered quickly (e.g. by a reboot):
- remove the WN from /var/spool/pbs/server_priv/nodes
- remove the corresponding jobs from /var/spool/pbs/server_priv/jobs
- restart the PBS/Torque daemons
- The info provider has an incomplete environment for querying the batch system. Beware that the info provider may not run under a login shell and therefore would not source files in /etc/profile.d. This is fixed for resource bdii versions >= 5.2.21, which ensure a login shell is used.
- Torque/PBS queues have no "resources_default.walltime" configured, only "resources_max.walltime". This is normal when YAIM is used, but at a few sites with external Torque/PBS servers (not configured by YAIM and used not only by the CE) it has led to this problem. In such cases, configure "resources_default.walltime" where needed, and beware that the problem may persist until all previously submitted jobs have left the batch system. Alternatively, apply a patch like this:
  # diff lcg-info-dynamic-scheduler.bak lcg-info-dynamic-scheduler
  12a13
  > from types import NoneType
  435a437,438
  > if type(qwt) is NoneType:
  >     qwt = 260000
  485c488,491
  < wrt = waitingJobs[0].get('maxwalltime') * nwait
  ---
  > qwt = waitingJobs[0].get('maxwalltime')
  > if type(qwt) is NoneType:
  >     qwt = 260000
  > wrt = qwt * nwait
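Several of the checks in the list above can be sketched as small helper functions. This is only an illustration, not an existing tool: the function names are hypothetical, the Maui check assumes that "#" starts a comment in maui.cfg, and the queue check assumes input in the "create queue ..." / "set queue ..." style printed by `qmgr -c "print server"`.

```python
import subprocess


def admin3_users(maui_cfg_text):
    """Return the set of user names found on ADMIN3 lines of a maui.cfg text."""
    users = set()
    for line in maui_cfg_text.splitlines():
        # Drop anything after '#' (assumed comment marker), then split on whitespace.
        fields = line.split("#", 1)[0].split()
        if fields and fields[0] == "ADMIN3":
            users.update(fields[1:])
    return users


def responds_within(command, seconds):
    """Return True if the command finishes within the given number of seconds.

    A hanging "qstat" or "pbsnodes" would show up as False here.
    Raises FileNotFoundError if the command is not installed.
    """
    try:
        subprocess.run(command, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, timeout=seconds)
    except subprocess.TimeoutExpired:
        return False
    return True


def queues_without_default_walltime(qmgr_text):
    """Return the queues that are created but never get resources_default.walltime."""
    created, with_default = [], set()
    for line in qmgr_text.splitlines():
        fields = line.split()
        if fields[:2] == ["create", "queue"] and len(fields) >= 3:
            created.append(fields[2])
        elif (fields[:2] == ["set", "queue"] and len(fields) >= 4
              and fields[3] == "resources_default.walltime"):
            with_default.add(fields[2])
    return [queue for queue in created if queue not in with_default]


# Example with the ADMIN3 line shown earlier: is the BDII user allowed?
print("ldap" in admin3_users("ADMIN3 edginfo rgma edguser ldap\n"))
```

On a real server one might call `responds_within(["qstat"], 10)` and feed the output of `qmgr -c "print server"` to `queues_without_default_walltime`; the 10-second limit is an arbitrary choice and should be kept below the info provider's own timeout.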