Tools/Manuals/TS105
Unreliable gathering of CE Information
Error
- GStat graphs show an erratic number of CPUs for some CEs
- The number of waiting jobs for some CEs is intermittently reported as 444444
Diagnosis
Such problems may be due to the glite-info-dynamic-ce or glite-info-dynamic-scheduler-wrapper info provider timing out.
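One way to check whether a slow batch-system query is behind such timeouts is to time the command by hand. The following is a minimal Python sketch, not part of the middleware: the helper name time_command and the 30-second limit in the usage note are assumptions for illustration, and the limit actually applied to the info providers depends on the site's configuration.

```python
import subprocess
import time


def time_command(cmd, timeout_s):
    """Run cmd (a list of arguments) and return (elapsed_seconds, timed_out).

    The command's output is discarded; only its duration is of interest.
    timeout_s is a hypothetical limit chosen by the caller, not the value
    used by the information system.
    """
    start = time.monotonic()
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=False,  # a non-zero exit status is not treated as an error here
        )
        timed_out = False
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this exception
        timed_out = True
    return time.monotonic() - start, timed_out
```

For example, time_command(["qstat", "-f"], 30.0) run on the CE would show whether a full qstat listing completes within 30 seconds. If it regularly does not, the items under Solution are worth checking.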
Solution
For PBS/Torque/Maui systems:
- Many stale files for old jobs in /var/spool/pbs/server_priv/jobs or /var/torque/server_priv/jobs can slow down qstat. In that case, delete those files and restart pbs_server.
- With older versions of the middleware and/or batch systems, it was a good idea to replace qstat and related commands with versions that cached their results for a while. That should no longer be needed (see the next items), but you may still want to look at the utilities NIKHEF provided at the time.
- Consider upgrading Torque and/or Maui to more recent versions, but beware of potential compatibility issues e.g. with gLite. You may want to ask for advice e.g. on the LCG-Rollout list.
- Look at the Torque/Maui documentation for large clusters.
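For the first item in the list, a minimal Python sketch for spotting candidate stale job files is given here. It only lists files and deletes nothing. The helper name find_stale_files and the use of the modification time as the staleness criterion are assumptions for illustration; cross-check the result against the jobs the server still knows about before removing anything.

```python
import os
import time


def find_stale_files(directory, max_age_days, now=None):
    """Return the sorted paths of regular files in directory whose
    modification time is more than max_age_days in the past.

    Assumption: "stale" is approximated by file age alone. The threshold
    is chosen by the caller and is not a value prescribed by Torque.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    stale = []
    with os.scandir(directory) as entries:
        for entry in entries:
            # Skip subdirectories and symlinks; look at plain files only.
            if not entry.is_file(follow_symlinks=False):
                continue
            if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                stale.append(entry.path)
    return sorted(stale)
```

Run as a user allowed to read the spool directory, find_stale_files("/var/spool/pbs/server_priv/jobs", 30) would list files untouched for more than 30 days. The 30-day threshold is only an example.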