The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can contact the EGI SDIS team at operations @ egi.eu.

Difference between revisions of "Tools/Manuals/TS59"

Revision as of 17:08, 9 August 2013



444444 waiting jobs

Full message

$ lcg-infosites --vo ops ce -f SOME-SITE
#   CPU    Free Total Jobs      Running Waiting ComputingElement
----------------------------------------------------------------
    456       3          0            0  444444 ce.site.domain:2119/jobmanager-lcgpbs-ops
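The symptom in the listing above can also be spotted programmatically. The following is a minimal sketch, assuming the `lcg-infosites` output keeps the six-column layout shown here; the function name `find_placeholder_ces` and the constant are hypothetical and not part of any EGI tool.

```python
# Value seen in the Waiting column of the listing above.
PLACEHOLDER_WAITING = 444444


def find_placeholder_ces(output):
    """Return the ComputingElement names whose Waiting column is 444444.

    Assumes each data row has six whitespace-separated fields:
    CPU, Free, Total Jobs, Running, Waiting, ComputingElement.
    """
    flagged = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # header, separator, or blank line
        try:
            waiting = int(fields[4])
        except ValueError:
            continue  # not a data row
        if waiting == PLACEHOLDER_WAITING:
            flagged.append(fields[5])
    return flagged


if __name__ == "__main__":
    sample = (
        "#   CPU    Free Total Jobs      Running Waiting ComputingElement\n"
        "----------------------------------------------------------------\n"
        "    456       3          0            0  444444 "
        "ce.site.domain:2119/jobmanager-lcgpbs-ops\n"
    )
    for ce in find_placeholder_ces(sample):
        print(ce)
```

In practice the text passed to the function would be the captured output of the `lcg-infosites` command shown above.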

Diagnosis

The CE information system provider has safe, easily recognizable default values for various attributes that are published in the BDII. Normally those defaults are overridden by the actual values obtained from the batch system by a particular provider or plugin script. When a default value does appear in the BDII, it means the provider failed and its output, if any, was discarded. An info provider can fail for at least the following reasons:

  • For Torque-Maui systems: the user running the BDII ("edguser" or "ldap") has no permission to query the Maui scheduler. On the machine running Maui check:
[root@server ~]# grep ADMIN3 /var/spool/maui/maui.cfg
ADMIN3                  edginfo rgma edguser ldap
  • The info provider timed out. For PBS/Torque systems this can happen when a worker node (WN) is in bad shape, causing commands like "qstat" and "pbsnodes" to hang. If the WN cannot be recovered quickly (e.g. by a reboot):
  1. remove the WN from /var/spool/pbs/server_priv/nodes
  2. remove the corresponding jobs from /var/spool/pbs/server_priv/jobs
  3. restart the PBS/Torque daemons
  • The info provider has an incomplete environment for querying the batch system. Beware that the info provider may not run under a login shell and therefore would not source files in /etc/profile.d. This is fixed for resource bdii versions >= 5.2.21, which ensure a login shell is used.
  • The Torque/PBS queues have no "resources_default.walltime" configured, only "resources_max.walltime".
In that case, configure "resources_default.walltime" where needed; beware that the problem
may persist until all previously submitted jobs have left the batch system.
While the problem persists, one may apply the following temporary patch:
# diff lcg-info-dynamic-scheduler.bak lcg-info-dynamic-scheduler
12a13
> from types import NoneType
435a437,438
>         if type(qwt) is NoneType:
>            qwt = 260000
485c488,491
<         wrt = waitingJobs[0].get('maxwalltime')  * nwait
---
>         qwt = waitingJobs[0].get('maxwalltime')
>         if type(qwt) is NoneType:
>            qwt = 260000
>         wrt = qwt * nwait
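
The effect of the patch above can be illustrated in isolation. This is a minimal sketch, not the actual `lcg-info-dynamic-scheduler` code: the function `estimated_wait` is hypothetical, waiting jobs are modelled as plain dictionaries, and `nwait` is assumed to be the number of waiting jobs. Only the fallback value 260000 and the expression `qwt * nwait` come from the patch.

```python
# Fallback used by the patch when a job carries no 'maxwalltime'.
FALLBACK_WALLTIME = 260000


def estimated_wait(waiting_jobs, fallback=FALLBACK_WALLTIME):
    """Mimic the patched computation: wrt = qwt * nwait.

    waiting_jobs is a list of dict-like objects (an assumption made for
    this sketch); nwait is assumed to be the number of waiting jobs.
    """
    nwait = len(waiting_jobs)
    if nwait == 0:
        return 0
    qwt = waiting_jobs[0].get('maxwalltime')
    if qwt is None:
        # Without this guard, None * nwait raises a TypeError,
        # which is how a missing default walltime breaks the provider.
        qwt = fallback
    return qwt * nwait


if __name__ == "__main__":
    # Two waiting jobs from a queue without resources_default.walltime.
    print(estimated_wait([{'maxwalltime': None}, {'maxwalltime': None}]))
```

The patch tests `type(qwt) is NoneType`; the sketch uses `qwt is None`, which performs the same check without needing the `types` import.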