Difference between revisions of "Tools/Manuals/TS55"
Revision as of 14:04, 25 May 2011
Jobs sent to my WMS stay in Waiting state forever
Full message
*************************************************************
BOOKKEEPING INFORMATION:
Status info for the Job : https://gswms01.cern.ch:9000/9AfNLYg09zhvi7i4T0RRQw
Current Status: Waiting
Submitted: Mon Mar 28 18:50:20 2011 CEST
*************************************************************
Diagnosis
When jobs stay in the Waiting state for a long time, the workload_manager (WM) daemon on the WMS is, for some reason, slow in processing its input queue, which consists of requests for matchmaking, submission, or cancellation of jobs. This can have various causes:
- The WM has a backlog, e.g. due to the WMS being overloaded.
- The WM sits in an infinite loop spinning the CPU (check with top, strace, etc.). This has not been seen for a very long time, possibly never.
- The WM sits in a deadlock (check with strace, gdb, etc.). This has not been seen for a long time.
- The WM keeps crashing (check its log file). A complicated JDL file in the input queue $GLITE_LOCATION_VAR/workload_manager/jobdir/* might trigger a bug; in that case the admin could move any such files out of the way to restore the service.
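To tell a backlog from a stuck or crashing WM, it can help to see how many entries sit in the input queue and how old the oldest ones are. The following is a minimal sketch, not part of the WMS tooling: the helper names `queue_snapshot` and `quarantine` are hypothetical, and the code assumes only that queue entries are regular files somewhere under the jobdir directory. It makes no assumption about the internal layout of that directory or about which file is the problematic one; the admin still has to decide that.

```python
import shutil
import time
from pathlib import Path


def queue_snapshot(jobdir, oldest=5):
    """Count regular files under jobdir and list the oldest ones.

    Returns (count, entries), where entries is a list of
    (path, age_in_seconds) tuples, oldest first.
    """
    found = []
    for p in Path(jobdir).rglob("*"):
        try:
            if p.is_file():
                found.append((p.stat().st_mtime, str(p)))
        except OSError:
            # The file vanished while scanning (e.g. the WM consumed it).
            continue
    found.sort()
    now = time.time()
    entries = [(path, now - mtime) for mtime, path in found[:oldest]]
    return len(found), entries


def quarantine(paths, dest):
    """Move suspect queue files into dest and return their new locations.

    dest is a hypothetical holding directory chosen by the admin; it
    should live outside the jobdir so the files leave the queue.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for p in map(Path, paths):
        target = dest / p.name
        shutil.move(str(p), str(target))
        moved.append(str(target))
    return moved
```

A possible use is to call `queue_snapshot` on `$GLITE_LOCATION_VAR/workload_manager/jobdir` twice, a few minutes apart. A count that shrinks suggests a backlog that is being worked off, while a count that stays constant or grows with the same oldest entries points to one of the other causes. Files the admin has identified as suspect can then be passed to `quarantine` rather than deleted, so they remain available for a bug report.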
If only some jobs stay in the Waiting state, while other jobs proceed, see: BrokerHelper: no compatible resources