Tools/Manuals/TS82
Revision as of 21:21, 27 September 2011
Back to Troubleshooting Guide
= JS always fails with 'user proxy expired' message =

== Full message ==
<pre>
$ glite-wms-job-logging-info -v 2 https://wms221.cern.ch:9000/iFtw9svc7vBkj3GnvCHwOK
[...]
Event: Done
[...]
- Exit code               = 1
[...]
- Reason                  = Got a job held event, reason: Globus error 131: the user proxy expired (job is still running)
- Source                  = LogMonitor
[...]
- Status code             = FAILED
[...]
</pre>
== Diagnosis ==
When the WMS jobs that a single user submits to a particular CE consistently fail with that error, the problem may be that the "grid_monitor" process for that user is stuck on the CE, for example in a call to qstat if the batch system is Torque/PBS:
<pre>
--------------------------------------------------------------------------------
USER       PID %CPU %MEM  VSZ  RSS TTY STAT START TIME COMMAND
ops006   17956  0.0  1.2 8016 6452 ?   S    Mar13 0:00 perl /tmp/grid_manager_monitor_agent.ops006.17387.1000 --delete-self --maxtime=3540s
ops006   23017  0.0  0.1 2176  960 ?   S    Mar13 0:00 sh -c /usr/bin/qstat -f 2>/dev/null
ops006   23018  0.0  0.2 4372 1112 ?   S    Mar13 0:00 /usr/bin/qstat -f
--------------------------------------------------------------------------------
</pre>
In this example qstat was stuck in a read from a socket connected to the Torque server.
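A quick way to look for such stuck processes on the CE is to scan the process table for the monitor agent and its batch-system queries. This is a minimal sketch; the process names are taken from the listing above, and the exact paths and the query command (qstat here) depend on your batch system:

```shell
# List grid monitor agents and any qstat calls they spawned, including the
# elapsed time, so a long-stuck query stands out.
# Adjust the pattern for other batch systems (e.g. bjobs for LSF).
ps -eo user,pid,etime,args \
  | grep -E 'grid_manager_monitor_agent|qstat' \
  | grep -v grep \
  || echo "no matching processes found"
```

A qstat whose elapsed time keeps growing while the batch server is unresponsive is the typical culprit; attaching strace to its PID would then show it blocked in a read() on the socket to the server, as in the example above.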
On an LCG-CE this problem should not occur just for a single user, since the calls to the batch system are executed by the globus-gma daemon on behalf of all users and killed on timeout.
== Solution ==
Kill the process that is causing the grid_manager_monitor_agent to hang; the agent should then clean up and exit shortly afterwards, and a new instance will be started a bit later. Also check Proxy expired.
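On a Torque CE this amounts to killing the stuck qstat child. The sketch below is hedged: it uses a sleep process as a stand-in for the real /usr/bin/qstat so the whole sequence can be shown end to end; on a real CE you would first locate the PID, e.g. with pgrep -f 'qstat -f'.

```shell
# Start a stand-in for the stuck batch-system query ('/usr/bin/qstat -f'
# in the ps listing above) so the kill sequence can be demonstrated.
sleep 300 &
stuck_pid=$!

# Kill it; the grid_manager_monitor_agent should then clean up and exit,
# and a new agent instance should appear a bit later.
kill "$stuck_pid"
wait "$stuck_pid" 2>/dev/null || true

# Confirm the process is gone.
if kill -0 "$stuck_pid" 2>/dev/null; then echo "still running"; else echo "killed"; fi
```

If killing the query is not enough (for example because the batch server itself is hung), the agent may get stuck again immediately, and the underlying batch-system problem has to be fixed first.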