Back to Troubleshooting Guide


444444 waiting jobs

Full message

$ lcg-infosites --vo ops ce -f SOME-SITE
#   CPU    Free Total Jobs      Running Waiting ComputingElement
----------------------------------------------------------------
    456       3          0            0  444444 ce.site.domain:8443/cream-pbs-ops
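
The published value can also be checked by querying the site BDII directly; a minimal sketch, with the BDII hostname and CE ID as placeholders:

ldapsearch -x -LLL -h site-bdii.site.domain -p 2170 -b o=grid \
    '(GlueCEUniqueID=ce.site.domain:8443/cream-pbs-ops)' GlueCEStateWaitingJobs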

Diagnosis

The CE information system provider has safe, easily recognizable default values for various attributes that are published in the BDII; 444444 for GlueCEStateWaitingJobs is one such default. Normally those defaults are overridden by the actual values obtained from the batch system by a particular provider or plugin script. When a default value does appear in the BDII, it means the provider failed and its output, if any, was discarded. An info provider can fail for at least the following reasons:

  • For Torque-Maui systems: the user running the BDII ("edguser" or "ldap") has no permission to query the Maui scheduler. On the machine running Maui, check that the BDII user is listed on the ADMIN3 line:
[root@server ~]# grep ADMIN3 /var/spool/maui/maui.cfg
ADMIN3                  edginfo rgma edguser ldap
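To verify the permission directly, run a read-only Maui query as the BDII user; a minimal sketch, assuming the user is "ldap" and showq is in the default path:
su -s /bin/sh ldap -c 'showq'    # an authorization error here points to ADMIN3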
  • The info provider timed out. For PBS/Torque systems this can happen when there is a WN in bad shape, causing commands like "qstat" and "pbsnodes" to hang. If the WN cannot quickly be recovered (e.g. by a reboot):
  1. remove the WN from /var/spool/pbs/server_priv/nodes
  2. remove the corresponding jobs from /var/spool/pbs/server_priv/jobs
  3. restart the PBS/Torque daemons
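Before removing anything it may help to identify and isolate the suspect node; a sketch, with the node name as a placeholder:
pbsnodes -l                      # list nodes that are down or offline
pbsnodes -o bad-wn.site.domain   # mark the suspect node offline so the scheduler skips it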
  • The info provider has an incomplete environment for querying the batch system. Beware that the info provider may not run under a login shell and therefore would not source files in /etc/profile.d. This is fixed for resource bdii versions >= 5.2.21, which ensure a login shell is used.
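To see the difference, compare a non-login and a login shell for the user running the provider; a sketch, assuming the user is "ldap":
su -s /bin/sh ldap -c 'which qstat'      # non-login shell: /etc/profile.d not sourced
su -s /bin/sh - ldap -c 'which qstat'    # login shell: profile scripts applied
rpm -q bdii                              # check whether a fixed version (>= 5.2.21) is installed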
  • Torque/PBS queues have no "resources_default.walltime" configured, only "resources_max.walltime". This is normal when YAIM is used, but at a few sites with external Torque/PBS servers (not configured by YAIM and not used only by the CE) it has led to this problem. In such cases configure "resources_default.walltime" where needed (see https://wiki.egi.eu/wiki/Middleware_issues_and_solutions#GlueCEStateWaitingJobs:_444444_and_WallTime_workaround for how to do this; a qmgr sketch also follows the patch below) and beware that the problem may persist until all previously submitted jobs have disappeared from the batch system. One could also apply a patch like this:
# diff lcg-info-dynamic-scheduler.bak lcg-info-dynamic-scheduler
12a13
> from types import NoneType
435a437,438
>         if type(qwt) is NoneType:
>            qwt = 260000
485c488,491
<         wrt = waitingJobs[0].get('maxwalltime')  * nwait
---
>         qwt = waitingJobs[0].get('maxwalltime')
>         if type(qwt) is NoneType:
>            qwt = 260000
>         wrt = qwt * nwait
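For the configuration route, a default walltime can be set per queue with qmgr; a minimal sketch, with queue name and value as placeholders:
qmgr -c "set queue ops resources_default.walltime = 36:00:00"
qmgr -c "print queue ops" | grep walltime    # verify the new default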
  • Due to a known bug in CREAM, the info provider does not work properly and reports the error "Cannot find user for xxxxx.xxxxxxxx". This happens when CREAM and TORQUE run on different hosts. Until the fix is released in CREAM, apply the workaround described at:

https://wiki.italiangrid.it/twiki/bin/view/CREAM/KnownIssues#Error_from_TORQUE_infoprovider_E

Further information

  • /var/lib/bdii/gip/plugin/glite-info-dynamic-scheduler-wrapper is the plugin responsible for dynamically modifying:
GLUE2ComputingShareRunningJobs
GLUE2ComputingShareWaitingJobs
GLUE2ComputingShareTotalJobs
GLUE2ComputingShareEstimatedAverageWaitingTime
GLUE2ComputingShareEstimatedWorstWaitingTime
  • /var/lib/bdii/gip/plugin/glite-info-dynamic-ce is the plugin responsible for getting information from the underlying batch system configuration to publish attributes like:
GLUE2ComputingShareMaxCPUTime
GLUE2ComputingShareMaxWallTime
GLUE2ComputingShareMaxRunningJobs
...
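When debugging, both plugins can be run by hand as root on the CE and their LDIF output inspected directly; a sketch using the paths above:
/var/lib/bdii/gip/plugin/glite-info-dynamic-scheduler-wrapper | grep -E 'RunningJobs|WaitingJobs'
/var/lib/bdii/gip/plugin/glite-info-dynamic-ce | grep -E 'MaxCPUTime|MaxWallTime'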
  • If you use YAIM, make sure all the configuration targets are listed together in a single yaim invocation. Otherwise, the configuration files used by the providers and plugins may end up inconsistent or incomplete. For instance, if using Torque and running the server on the CE host, run:
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n TORQUE_server -n TORQUE_utils
  • The directory /var/tmp/info-dynamic-scheduler-generic is used by glite-info-dynamic-scheduler-wrapper. This is created by the YAIM function config_cream_gip_scheduler_plugin. Make sure this directory exists and it is owned by the "ldap" user.
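A quick check and, if needed, fix, assuming the BDII user is "ldap":
ls -ld /var/tmp/info-dynamic-scheduler-generic
mkdir -p /var/tmp/info-dynamic-scheduler-generic      # recreate it if missing
chown ldap:ldap /var/tmp/info-dynamic-scheduler-generic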
  • The /etc/lrms/scheduler.conf file has an LRMS section with information specific for the underlying batch system configuration. This information is normally added by YAIM functions like:
    • config_gip_sched_plugin_pbs for PBS
    • config_gip_sge for SGE
    • config_slurm_gip_sched_plugin for SLURM
    • config_gip_sched_plugin_lsf for LSF
Make sure this section exists in the configuration file.
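As an illustration only, on a Torque/Maui CE the section may look like the sketch below; the exact key and helper path are an assumption and depend on the installed info-dynamic-scheduler packages:
[LRMS]
# command the generic scheduler plugin runs to query the batch system (assumed path)
lrms_backend_cmd: /usr/libexec/lrmsinfo-pbs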
  • For more details, please check:
https://wiki.italiangrid.it/twiki/bin/view/CREAM/SystemAdministratorGuideForEMI2#1_6_Batch_system_integration