The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can contact the EGI SDIS team at operations@egi.eu.

Tools/Manuals/TS105

Unreliable gathering of CE Information

Error

GStat graphs show an erratic number of CPUs for some CEs.
The number of waiting jobs for some CEs is intermittently reported as 444444.

Diagnosis

Such problems may be due to the glite-info-dynamic-ce or glite-info-dynamic-scheduler-wrapper info provider timing out.
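One way to check whether a timeout is plausible is to measure how long an info provider takes when run by hand. The following is a minimal sketch, not part of the middleware: the helper name `time_command` and the timeout value are assumptions for illustration, and the command to run (including its full path and arguments) must be taken from your own installation.

```python
import subprocess
import time


def time_command(argv, timeout_s):
    """Run argv and return (elapsed_seconds, timed_out, returncode).

    timeout_s is a placeholder chosen by the caller; it is NOT the
    timeout actually applied to info providers on a given site.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,  # only the timing matters here
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,          # child is killed if it runs longer
        )
    except subprocess.TimeoutExpired:
        return time.monotonic() - start, True, None
    return time.monotonic() - start, False, proc.returncode


# Hypothetical usage; replace the argument list with the real provider
# command line from your site:
#   elapsed, timed_out, rc = time_command(["/path/to/provider", "args"], 30)
#   print(f"{elapsed:.1f}s timed_out={timed_out} rc={rc}")
```

Running this several times, ideally while the batch system is busy, should show whether the provider occasionally takes much longer than usual.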

Solution

For PBS/Torque/Maui systems:

Many stale files for old jobs in /var/spool/pbs/server_priv/jobs or /var/torque/server_priv/jobs can slow down qstat. In that case, delete those files and restart pbs_server.
With older versions of the middleware and/or batch systems it was a good idea to replace qstat etc. with versions that would cache the results for a while. These days that should not be needed (see next items), but you may want to check out the utilities provided by NIKHEF at the time:

http://www.dutchgrid.nl/Admin/nikhef/

Consider upgrading Torque and/or Maui to more recent versions, but beware of potential compatibility issues e.g. with gLite. You may want to ask for advice e.g. on the LCG-Rollout list.
Look at the Torque/Maui documentation for large clusters.
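
Regarding the stale job files mentioned in the first item of this list, a quick way to see whether they are accumulating is to list files that have not been modified for some time. This is a minimal sketch under stated assumptions: the helper name `find_stale_files` is hypothetical, the 30-day threshold is an arbitrary example, and an old modification time is only a hint that a file is stale. The function only lists candidates and deletes nothing; cross-check against the jobs the batch system still knows about before removing anything.

```python
import os
import time


def find_stale_files(directory, max_age_days, now=None):
    """Return [(path, age_in_days)] for regular files older than max_age_days.

    Age is based on the modification time, which is an assumption about
    what "stale" means. Results are sorted oldest first.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    with os.scandir(directory) as entries:
        for entry in entries:
            # Skip subdirectories and symlinks; only plain files count.
            if not entry.is_file(follow_symlinks=False):
                continue
            mtime = entry.stat(follow_symlinks=False).st_mtime
            if mtime < cutoff:
                stale.append((entry.path, (now - mtime) / 86400))
    return sorted(stale, key=lambda item: item[1], reverse=True)


# Example (reading the directory usually requires root); the threshold
# of 30 days is an assumption, adjust it to your site:
#   for path, age in find_stale_files("/var/spool/pbs/server_priv/jobs", 30):
#       print(f"{age:6.1f} days  {path}")
```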
