JS always fails with 'user proxy expired' message

Full message

$ glite-wms-job-logging-info -v 2

Event: Done
- Exit code                  =    1
- Reason                     =    Got a job held event, reason:
  Globus error 131: the user proxy expired (job is still running)
- Source                     =    LogMonitor
- Status code                =    FAILED


When the WMS jobs that one user submits to a particular CE consistently fail with this error, the cause may be that the "grid_monitor" process for that user is stuck on the CE, for example in a call to qstat if the batch system is Torque/PBS:

ops006   17956  0.0  1.2  8016 6452 ?        S    Mar13   0:00 perl /tmp/grid_manager_monitor_agent.ops006.17387.1000 --delete-self --maxtime=3540s
ops006   23017  0.0  0.1  2176  960 ?        S    Mar13   0:00 sh -c /usr/bin/qstat -f 2>/dev/null
ops006   23018  0.0  0.2  4372 1112 ?        S    Mar13   0:00 /usr/bin/qstat -f

In this example qstat was stuck in a read from a socket connected to the Torque server.

On an LCG-CE this problem should not affect only a single user, because the calls to the batch system are executed by the globus-gma daemon on behalf of all users and are killed on timeout.


Kill the process that causes the grid_manager_monitor_agent to hang (the stuck qstat in the example above). The agent should then clean up and exit shortly afterwards, and a new instance should replace it a little later. Also check the "Proxy expired" entry.
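
As a rough sketch of how the stuck process could be located, the helper below filters a process listing for qstat commands that have been running longer than a given number of seconds. The function name stuck_qstat_pids and the 600 second default are illustrative assumptions, not part of any gLite or Torque tool. The suggested ps invocation relies on the etimes output field, which may be missing from older procps versions.

```shell
# Print the PIDs of qstat processes that have run longer than a limit.
# Reads lines of the form "PID ELAPSED_SECONDS COMMAND [ARGS...]" on stdin,
# for example as produced by:
#   ps -u ops006 -o pid=,etimes=,args=
# (the etimes field is an assumption; older ps versions may not support it).
stuck_qstat_pids() {
    limit="${1:-600}"   # hypothetical threshold in seconds
    # Match only rows whose command itself is qstat (bare or with a path),
    # so the wrapping "sh -c ..." row and the perl agent are not listed.
    awk -v limit="$limit" \
        '($2 + 0) > (limit + 0) && $3 ~ /(^|\/)qstat$/ { print $1 }'
}
```

A possible use on the CE, run as root or as the affected pool account, is to print the candidates first with `ps -u ops006 -o pid=,etimes=,args= | stuck_qstat_pids 600` and to pass the PIDs to kill only after checking that they really are the hung qstat calls.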