The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other supports; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team using operations @ egi.eu.

Difference between revisions of "Tools/Manuals/TS105"

(Created page with '{{TOC_right}} Category:FAQ ------ Back to Troubleshooting Guide ------ = Unreliable gathering of CE Information = == Error == * GSta…')
 
 
(One intermediate revision by one other user not shown)
Line 1: Line 1:
+ {{Template:Op menubar}}
+ {{Template:Doc_menubar}}
+ [[Category:Operations Manuals]]
  {{TOC_right}}
- [[Category:FAQ]]
  ------
  Back to [[Tools/Manuals/SiteProblemsFollowUp|Troubleshooting Guide]]
Line 27: Line 30:
  * Look at the Torque/Maui documentation for large clusters:
  :* http://www.clusterresources.com/torquedocs21/a.flargeclusters.shtml
- :* http://www.clusterresources.com/products/maui/docs/a.ilargeclusters.shtml
+ :* http://www.adaptivecomputing.com/resources/docs/maui/a.ilargeclusters.php

Latest revision as of 12:25, 23 November 2012



Unreliable gathering of CE Information

Error

  • GStat graphs show an erratic number of CPUs for some CEs.
  • The number of waiting jobs for some CEs is intermittently reported as 444444.

Diagnosis

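Such problems may be due to the glite-info-dynamic-ce or glite-info-dynamic-scheduler-wrapper information provider timing out. When a provider times out, the static default values are typically published instead of the real ones, which is where the 444444 placeholder comes from.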
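
One way to check this hypothesis is to time a command by hand and compare its run time with the time limit the information system applies to its providers. The sketch below is only an illustration: the helper name time_command and the 30-second default are assumptions made for this example, not values taken from gLite, so substitute the provider path and timeout actually configured at your site.

```python
import subprocess
import time


def time_command(cmd, timeout_s=30.0):
    """Run cmd (a list of arguments) and return (elapsed_seconds, timed_out).

    timeout_s is a hypothetical limit chosen for illustration; use the one
    your information system really applies to its providers.
    """
    start = time.monotonic()
    try:
        # Output is discarded: only the duration matters here.
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, timeout=timeout_s)
        timed_out = False
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        timed_out = True
    return time.monotonic() - start, timed_out
```

For instance, time_command(["qstat", "-f"]) reports how long a full job listing takes on the batch server. If that alone approaches the limit, the provider that calls it is likely to time out as well.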

Solution

For PBS/Torque/Maui systems:

  • Many stale files for old jobs in /var/spool/pbs/server_priv/jobs or /var/torque/server_priv/jobs can slow down qstat. In that case, delete the stale files and restart pbs_server.
  • With older versions of the middleware and/or batch systems, it was useful to replace qstat and related commands with versions that cache their results for a while. These days that should not be needed (see the next items), but you may want to check out the utilities provided by NIKHEF at the time:
  • Consider upgrading Torque and/or Maui to more recent versions, but beware of potential compatibility issues, e.g. with gLite. You may want to ask for advice, e.g. on the LCG-Rollout list.
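  • Look at the Torque/Maui documentation for large clusters:
      • http://www.clusterresources.com/torquedocs21/a.flargeclusters.shtml
      • http://www.adaptivecomputing.com/resources/docs/maui/a.ilargeclusters.php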
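
To illustrate the first item in the list above, the following sketch lists job files that have not been modified for a given number of days, so that they can be reviewed before anything is deleted. The function name find_stale_files and the 30-day threshold are assumptions made for this example. Modification time is only a heuristic, and candidates should be checked against the jobs that pbs_server still knows about.

```python
import os
import time


def find_stale_files(jobs_dir, max_age_days=30, now=None):
    """Return sorted paths of regular files in jobs_dir older than max_age_days.

    Age is judged by modification time only; nothing is deleted.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # threshold as a Unix timestamp
    stale = []
    with os.scandir(jobs_dir) as entries:
        for entry in entries:
            # Skip directories and symlinks; look at plain files only.
            if not entry.is_file(follow_symlinks=False):
                continue
            if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                stale.append(entry.path)
    return sorted(stale)
```

A call such as find_stale_files("/var/spool/pbs/server_priv/jobs") only reports candidates. Removing them and restarting pbs_server remains a manual step.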