Jobs fail with this error reported by glite-wms-job-logging-info -v 2:
Globus error 131: the user proxy expired (job is still running)
In the WMS syslog (/var/log/messages) the following error may appear under unusual circumstances:
Mar 19 22:38:49 WMS-host glite-proxy-renewd: Proxy lifetime exceeded value of the Condor limit!
- The job stayed in the queue and/or ran for longer than the remaining lifetime of the proxy that was used when the job was submitted, and the proxy could not be renewed from the specified MyProxy server, if any. Check whether the MyProxy server holds a valid proxy.
- To upload a proxy valid for 800 hours (~1 month):
myproxy-init -d -n -c 800
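- Note: the "-d" and "-n" options are essential! "-d" stores the credential under the certificate subject (DN) as the user name, and "-n" stores it without a passphrase so that it can be used for renewal.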
- Also check if the used WMS has been configured as an "authorized renewer" on the given MyProxy server.
- If the initial proxy is valid for less than 30 minutes, the WMS will not even try to renew it and will log the "Condor limit" error shown above.
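The 30-minute threshold above can be checked on the UI before submitting. The following is a minimal sketch, not part of the WMS tooling: the helper name `proxy_long_enough` is hypothetical, and it assumes the remaining lifetime is available in seconds, for example from `voms-proxy-info -timeleft` where that command is installed.

```shell
# Hypothetical helper: succeeds (exit 0) if the remaining proxy lifetime,
# given in seconds, is at least the minimum (default 1800 s = 30 minutes).
proxy_long_enough() {
    timeleft=$1
    min=${2:-1800}
    [ "$timeleft" -ge "$min" ]
}
```

A possible use would be `proxy_long_enough "$(voms-proxy-info -timeleft)" || echo "proxy too short for renewal"`.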
- If the job actually did finish (output sandbox is on the WMS, the job uploaded output files to an SE, ...) but the WMS reports that the proxy is expired, this is a site problem. For example:
- Site batch system monitoring commands cannot be executed by the "grid_monitor" (see the Job submission chain diagram), so the status of the job cannot be reported to the WMS.
- PBS/Torque may have been configured to remember completed jobs for a long time, instead of forgetting them. The "lcgpbs" job manager ignores the "C" state and keeps reporting the job in the previous state, i.e. running. To let the batch system remember completed jobs for only 60 seconds:
qmgr -c 'set server keep_completed = 60'
- The reason the "lcgpbs" job manager ignores the "C" state is explained here: https://savannah.cern.ch/bugs/?7874
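To see what the batch server is currently configured with, the `keep_completed` value can be read back from the server configuration. This is a small sketch under the assumption that `qmgr -c 'print server'` prints the setting as a `set server keep_completed = N` line; the helper name `keep_completed_value` is hypothetical.

```shell
# Hypothetical helper: read "qmgr -c 'print server'" output on stdin and
# print the configured keep_completed value (prints nothing if it is unset).
keep_completed_value() {
    awk '$1 == "set" && $2 == "server" && $3 == "keep_completed" { print $5 }'
}
```

On the batch server it could be used as `qmgr -c 'print server' | keep_completed_value`.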
- The "grid_monitor" could not write its status files under $GLOBUS_LOCATION/tmp, which must be world-writable with the sticky bit set. Check that the UIDs of the grid accounts have not changed on the CE. For each grid account UID the directory may contain files like these:
-rw-r--r-- 1 sgmops96 cg 100 Aug 15 23:14 grid_manager_monitor_agent_log.26865
-rw-r--r-- 1 sgmops96 cg 17 Aug 15 23:14 grid_manager_monitor_agent_log.26865.lock
-rw-r--r-- 1 sgmops96 cg 144 Aug 15 23:14 grid_manager_monitor_agent_log.26865.time
- Ensure that the UID in each file name (here 26865) is that of the owner (here sgmops96). An easy way to do that:
# ls -ln $GLOBUS_LOCATION/tmp | awk '$NF !~ "."$3'
drwxrwxrwt 2 0 0 434176 Aug 15 23:19 gram_job_state
#
- Only the gram_job_state subdirectory must appear in the output.
- Note: on an LCG-CE the files there should no longer be used, since the "grid_monitor" functionality should be handled by the globus-gma daemon, which keeps them in the gma_state subdirectory instead (owned by root).
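The directory permission and UID checks described above can also be written as a short script with explicit messages. This is only a sketch: the function names are hypothetical, GNU `stat` is assumed, and only the `grid_manager_monitor_agent_log.UID[.suffix]` file naming shown in the listing is handled.

```shell
# Hypothetical helper: succeeds if the directory is world-writable with the
# sticky bit set (mode 1777). Assumes GNU stat.
tmp_dir_mode_ok() {
    [ "$(stat -c %a "$1")" = "1777" ]
}

# Hypothetical helper: print every grid_manager_monitor_agent_log.* file
# whose UID in the file name differs from the numeric UID of its owner.
check_monitor_uids() {
    dir=$1
    for f in "$dir"/grid_manager_monitor_agent_log.*; do
        [ -f "$f" ] || continue
        name=${f##*/}
        rest=${name#grid_manager_monitor_agent_log.}
        uid_in_name=${rest%%.*}          # drop .lock / .time suffix
        owner_uid=$(stat -c %u "$f")
        if [ "$uid_in_name" != "$owner_uid" ]; then
            echo "MISMATCH: $name is owned by UID $owner_uid"
        fi
    done
}
```

On the CE this could be run as `tmp_dir_mode_ok "$GLOBUS_LOCATION/tmp" || echo "wrong mode"` followed by `check_monitor_uids "$GLOBUS_LOCATION/tmp"`; no output from the second command means all file names match their owners.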
- The "grid_monitor" could not copy its status files back to the WMS. On a UI with inbound connectivity for the GLOBUS_TCP_PORT_RANGE you can check that like this:
globus-job-run some_CE.some_domain \
    /opt/globus/bin/globus-url-copy file:/etc/group \
    gsiftp://your_WMS.its_domain/tmp/foo.$$
- It could fail e.g. because the CRLs on the CE are out of date:
error: globus_ftp_control: gss_init_sec_context failed
globus_gsi_callback_module: Could not verify credential
globus_gsi_callback_module: Could not verify credential
globus_gsi_callback_module: Invalid CRL: The available CRL has expired
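Expired CRLs like the one reported above can be looked for directly on the CE. The sketch below makes several assumptions that are not from this page: the function names are hypothetical, the CRLs are assumed to be `*.r0` files in `/etc/grid-security/certificates` (pass another directory if the site differs), and GNU `date` and `openssl` are assumed to be available.

```shell
# Hypothetical helper: succeeds (exit 0) if the given "nextUpdate=..." line,
# as printed by "openssl crl -noout -nextupdate", lies in the past.
# Returns 2 if the date cannot be parsed. Assumes GNU date.
crl_is_expired() {
    next_update=${1#nextUpdate=}
    expiry=$(date -u -d "$next_update" +%s) || return 2
    now=$(date -u +%s)
    [ "$now" -gt "$expiry" ]
}

# Hypothetical helper: print every CRL in the directory whose nextUpdate
# time has passed. The default directory is an assumption.
list_expired_crls() {
    dir=${1:-/etc/grid-security/certificates}
    for crl in "$dir"/*.r0; do
        [ -f "$crl" ] || continue
        line=$(openssl crl -in "$crl" -noout -nextupdate 2>/dev/null) || continue
        if crl_is_expired "$line"; then
            echo "EXPIRED: $crl ($line)"
        fi
    done
}
```

Running `list_expired_crls` on the CE would then list the CRL files that need to be refreshed, if any.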
- If only some users have the problem with a particular CE, see the troubleshooting entry "JS always fails with 'user proxy expired' message".