Tools/Manuals/TS82

Back to Troubleshooting Guide


JS always fails with 'user proxy expired' message

Full message

$ glite-wms-job-logging-info -v 2 https://wms221.cern.ch:9000/iFtw9svc7vBkj3GnvCHwOK

[...]
Event: Done
[...]
- Exit code                  =    1
[...]
- Reason                     =    Got a job held event, reason:
  Globus error 131: the user proxy expired (job is still running)
- Source                     =    LogMonitor
[...]
- Status code                =    FAILED
[...]

Diagnosis

When WMS jobs submitted by a particular user to a given CE consistently fail with this error, the cause may be that the "grid_monitor" process for that user is stuck on the CE, for example in a call to qstat if the batch system is Torque/PBS:

--------------------------------------------------------------------------------
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
ops006   17956  0.0  1.2  8016 6452 ?        S    Mar13   0:00 perl
/tmp/grid_manager_monitor_agent.ops006.17387.1000 --delete-self --maxtime=3540s
ops006   23017  0.0  0.1  2176  960 ?        S    Mar13   0:00 sh -c
/usr/bin/qstat -f 2>/dev/null
ops006   23018  0.0  0.2  4372 1112 ?        S    Mar13   0:00 /usr/bin/qstat -f
--------------------------------------------------------------------------------

In this example, qstat was stuck in a read from a socket connected to the Torque server.
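
To confirm what the suspect process is blocked on, a check along the following lines can be run on the CE. This is only a sketch: the PID 23018 is the one from the ps listing above, and strace is assumed to be available on the node.

$ ls -l /proc/23018/fd    # list open file descriptors; sockets appear as socket:[inode]
$ strace -p 23018         # a read() on a socket descriptor that never returns confirms the hang

Where lsof is installed, "lsof -p 23018" would similarly show the remote end of the socket the process is waiting on.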

On an LCG-CE this problem should not occur just for a single user, since the calls to the batch system are executed by the globus-gma daemon on behalf of all users and killed on timeout.

Solution

Kill the process that causes the grid_manager_monitor_agent to hang; the latter should clean up and exit shortly afterwards, and be replaced by a new instance a little later. Also check the Proxy expired troubleshooting entry.
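
As a sketch of that procedure, using the PIDs and account from the ps listing in the Diagnosis section (replace them with the ones found on your CE):

$ kill 23018              # terminate the stuck qstat; its "sh -c" parent (23017) should exit too
$ kill -9 23018           # only if the process survives the first signal
$ ps -fu ops006 | grep grid_manager_monitor_agent    # a new agent instance should appear later

Killing only the stuck qstat, rather than the monitor agent itself, gives the agent the chance to clean up and exit as described above.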