Tools/Manuals/TS82



JS always fails with 'user proxy expired' message

Full message

$ glite-wms-job-logging-info -v 2 https://wms221.cern.ch:9000/iFtw9svc7vBkj3GnvCHwOK

[...]
Event: Done
[...]
- Exit code                  =    1
[...]
- Reason                     =    Got a job held event, reason:
  Globus error 131: the user proxy expired (job is still running)
- Source                     =    LogMonitor
[...]
- Status code                =    FAILED
[...]

Diagnosis

When the WMS jobs that one user submits to a particular CE consistently fail with this error, the cause may be that user's "grid_monitor" process being stuck on the CE, for example in a call to qstat if the batch system is Torque/PBS:

--------------------------------------------------------------------------------
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
ops006   17956  0.0  1.2  8016 6452 ?        S    Mar13   0:00 perl /tmp/grid_manager_monitor_agent.ops006.17387.1000 --delete-self --maxtime=3540s
ops006   23017  0.0  0.1  2176  960 ?        S    Mar13   0:00 sh -c /usr/bin/qstat -f 2>/dev/null
ops006   23018  0.0  0.2  4372 1112 ?        S    Mar13   0:00 /usr/bin/qstat -f
--------------------------------------------------------------------------------

In this example qstat was stuck in a read from a socket connected to the Torque server.

On an LCG-CE this problem should not occur just for a single user, since the calls to the batch system are executed by the globus-gma daemon on behalf of all users and killed on timeout.

Solution

Kill the process that is causing the grid_manager_monitor_agent to hang (the stuck qstat in the example above). The agent should then clean up and exit shortly afterwards, and a new instance should be started a little later. See also Proxy expired.
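As a rough illustration of how the stuck process might be located before killing it, the sketch below parses a process listing and reports descendants of a user's monitor agent that have been alive for suspiciously long. It is only a sketch under several assumptions: the listing is taken with `ps -eo pid,ppid,user,etime,args` (the listing above has no PPID column), the function names are hypothetical, the PPID and elapsed-time values in the sample are invented for illustration, and the 3540 second threshold is simply borrowed from the `--maxtime=3540s` argument seen in the example.

```python
def etime_to_seconds(etime):
    """Convert the ps 'etime' format [[dd-]hh:]mm:ss to seconds."""
    days, _, rest = etime.rpartition("-")
    parts = [int(p) for p in rest.split(":")]
    while len(parts) < 3:          # pad missing hours with zero
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((int(days or 0) * 24 + hours) * 60 + minutes) * 60 + seconds


def find_stuck_descendants(ps_text, user, max_age_s=3540,
                           agent_marker="grid_manager_monitor_agent"):
    """Return sorted (pid, age_seconds, command) tuples for processes that
    descend from the user's monitor agent and are older than max_age_s.

    The threshold is an assumed heuristic, not a documented limit.
    """
    procs = {}
    for line in ps_text.strip().splitlines()[1:]:      # skip the header line
        pid, ppid, owner, etime, args = line.split(None, 4)
        procs[int(pid)] = (int(ppid), owner, etime_to_seconds(etime), args)

    agents = {pid for pid, (_, owner, _, args) in procs.items()
              if owner == user and agent_marker in args}

    stuck = []
    for pid, (ppid, _, age, args) in procs.items():
        ancestor, seen = ppid, set()
        # Walk up the parent chain until an agent or an unknown parent is hit.
        while ancestor in procs and ancestor not in seen:
            if ancestor in agents:
                if age > max_age_s:
                    stuck.append((pid, age, args))
                break
            seen.add(ancestor)
            ancestor = procs[ancestor][0]
    return sorted(stuck)


# Hypothetical sample: PIDs and commands follow the listing above,
# PPID and ELAPSED values are made up for illustration.
SAMPLE_PS = """
  PID  PPID USER         ELAPSED COMMAND
17956     1 ops006    2-03:10:00 perl /tmp/grid_manager_monitor_agent.ops006.17387.1000 --delete-self --maxtime=3540s
23017 17956 ops006    2-03:05:00 sh -c /usr/bin/qstat -f 2>/dev/null
23018 23017 ops006    2-03:05:00 /usr/bin/qstat -f
 4242     1 ops006         00:12 sleep 60
"""

if __name__ == "__main__":
    for pid, age, cmd in find_stuck_descendants(SAMPLE_PS, "ops006"):
        print(pid, age, cmd)
```

The script only reports candidates. Whether a listed process is really hung (for example, blocked reading from the batch server) should be checked by hand before it is killed.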
