Tools/Manuals/TS105



Unreliable gathering of CE Information

Error

  • GStat graphs show an erratic number of CPUs for some CEs.
  • The number of waiting jobs for some CEs is intermittently reported as 444444.
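To check which CEs currently show the second symptom, the published information can be scanned for the value 444444. The following is only a minimal sketch: it assumes the CE information is available as LDIF text (for example, the saved output of an LDAP query against the information system) and that the waiting-job count is published in the GLUE 1.3 attribute GlueCEStateWaitingJobs. The function name is hypothetical, and folded LDIF lines are ignored for simplicity.

```python
def find_placeholder_ces(ldif_text, placeholder=444444):
    """Return the GlueCEUniqueID of every entry whose waiting-job count
    equals the placeholder value.

    Assumes simple "attribute: value" LDIF lines, with entries separated
    by blank lines. Folded (continuation) lines are skipped in this sketch.
    """
    flagged = []
    entry = {}
    # The extra empty string makes sure the last entry is evaluated too.
    for raw in ldif_text.splitlines() + [""]:
        if raw.startswith(" "):
            continue  # folded continuation line; not handled here
        line = raw.strip()
        if line:
            key, sep, value = line.partition(":")
            if sep:
                entry[key.strip()] = value.strip()
            continue
        # A blank line closes the current entry: inspect it, then reset.
        if entry.get("GlueCEStateWaitingJobs") == str(placeholder):
            flagged.append(entry.get("GlueCEUniqueID", "<unknown CE>"))
        entry = {}
    return flagged
```

Because the symptom is intermittent, a single clean scan does not prove that the CE is healthy; running the check repeatedly over some time gives a better picture.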

Diagnosis

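Such problems may be caused by the glite-info-dynamic-ce or glite-info-dynamic-scheduler-wrapper information provider timing out. When a provider does not return in time, the dynamic values it should supply are missing, so placeholder values such as 444444 can be published instead.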
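One way to test this diagnosis is to run the information provider by hand and measure how long it takes. The sketch below is an illustration only: the helper name is hypothetical, and the provider's installation path, its arguments, and the time limit that applies on a given site are not stated on this page and must be supplied by the administrator.

```python
import subprocess
import time


def time_command(argv, timeout_s):
    """Run a command and return (elapsed_seconds, timed_out).

    The command's output is discarded; only its duration matters here.
    If it runs longer than timeout_s seconds, it is killed and
    timed_out is True.
    """
    start = time.monotonic()
    try:
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=False,  # a non-zero exit status is not an error for timing
        )
        timed_out = False
    except subprocess.TimeoutExpired:
        timed_out = True
    return time.monotonic() - start, timed_out
```

For example, `time_command(["/path/to/glite-info-dynamic-ce", "..."], 30)` would report whether the provider finishes within 30 seconds; both the path and the 30-second limit are placeholders. Repeating the measurement while the batch system is busy is more telling than a single run, since the time-outs are intermittent.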

Solution

For PBS/Torque/Maui systems:

  • A large number of stale files left over from old jobs in /var/spool/pbs/server_priv/jobs or /var/torque/server_priv/jobs can slow down qstat. In that case, delete those files and restart pbs_server.
  • With older versions of the middleware and/or batch system, it was useful to replace qstat and related commands with versions that cache their results for a while. This should no longer be necessary (see the following items), but you may still want to look at the utilities that NIKHEF provided at the time.
  • Consider upgrading Torque and/or Maui to a more recent version, but beware of potential compatibility issues, e.g. with gLite. You may want to ask for advice, e.g. on the LCG-Rollout list.
  • Consult the Torque/Maui documentation on large clusters.
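For the first item in the list above, candidate stale job files can be listed before anything is deleted. This is a minimal sketch under stated assumptions: the function name and the age threshold are hypothetical, and file age alone does not prove that a job is gone, so the result should be cross-checked against the jobs the batch system still knows about. The sketch only lists files and never removes them.

```python
import os
import time


def find_stale_files(directory, max_age_days, now=None):
    """Return the sorted paths of regular files in directory that have
    not been modified for more than max_age_days days.

    Nothing is deleted; the caller decides what to do with the list.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    stale = []
    with os.scandir(directory) as entries:
        for entry in entries:
            # Skip subdirectories and symlinks; look at plain files only.
            if not entry.is_file(follow_symlinks=False):
                continue
            if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                stale.append(entry.path)
    return sorted(stale)
```

A call such as `find_stale_files("/var/spool/pbs/server_priv/jobs", 30)` would list files untouched for 30 days; the 30-day threshold is an arbitrary example, and reading that directory normally requires root privileges.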