

Back to Troubleshooting Guide


Proxy expired

Full message

Jobs fail with this error, as reported by glite-wms-job-logging-info -v 2:

Globus error 131: the user proxy expired (job is still running)
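
For reference, the matching event can usually be found in the job's history with something like the following (the WMS host and job ID are placeholders):

    glite-wms-job-logging-info -v 2 https://wms.example.org:9000/your_job_id \
        | grep -i -B 2 -A 2 'proxy expired'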

In the WMS syslog (/var/log/messages) the following error may appear under unusual circumstances:

Mar 19 22:38:49 WMS-host glite-proxy-renewd[5768]:
Proxy lifetime exceeded value of the Condor limit!

Diagnosis

  1. The job stayed in the queue and/or ran for longer than the remaining lifetime of the proxy that was used when the job was submitted, and the proxy could not be renewed from the MyProxy server that was specified, if any. Check whether the MyProxy server holds a valid proxy:
myproxy-info -d
To upload a proxy valid for 800 hours (~1 month):
myproxy-init -d -n -c 800
Note: the "-d" and "-n" options are essential!
  2. If the initial proxy is valid for less than 30 minutes, the WMS will not even try to renew it and will log the "Condor limit" error shown above. A quick way to check the remaining lifetime of the local proxy is sketched after this list.
  3. If the job actually did finish (the output sandbox is on the WMS, the job uploaded output files to an SE, ...) but the WMS reports that the proxy is expired, this is a site problem. For example:
    • Site batch system monitoring commands cannot be executed by the "grid_monitor" (see the Job submission chain diagram), so the status of the job cannot be reported to the WMS.
    • PBS/Torque may have been configured to remember completed jobs for a long time, instead of forgetting them. The "lcgpbs" job manager ignores the "C" state and keeps reporting the job in its previous state, i.e. running; the reason why it ignores the "C" state is explained here: https://savannah.cern.ch/bugs/?7874. To let the batch system remember completed jobs for only 60 seconds (a check of the current value is sketched after this list):
    qmgr -c 'set server keep_completed = 60'
    • The "grid_monitor" could not write its status files under $GLOBUS_LOCATION/tmp, which must be world-writable with the sticky bit set. Check if the UIDs of grid accounts did not change on the CE. For each grid account UID the directory may contain files like these:
        -rw-r--r--    1 sgmops96 cg            100 Aug 15 23:14 grid_manager_monitor_agent_log.26865
        -rw-r--r--    1 sgmops96 cg             17 Aug 15 23:14 grid_manager_monitor_agent_log.26865.lock
        -rw-r--r--    1 sgmops96 cg            144 Aug 15 23:14 grid_manager_monitor_agent_log.26865.time
    Ensure that the UID in each file name (here 26865) matches the numeric UID of the owner (here sgmops96). An easy way to check:
        # ls -ln $GLOBUS_LOCATION/tmp | awk '$NF !~ "."$3'
        drwxrwxrwt    2 0        0          434176 Aug 15 23:19 gram_job_state
        #
    Only the gram_job_state subdirectory must appear in the output.
    • The "grid_monitor" could not copy its status files back to the WMS. On a UI with inbound connectivity for the GLOBUS_TCP_PORT_RANGE you can check that like this:
        globus-job-run some_CE.some_domain \
        /opt/globus/bin/globus-url-copy file:/etc/group \
        gsiftp://your_WMS.its_domain/tmp/foo.$$
    It could fail, e.g. because the CRLs on the CE are out of date (a CRL freshness check is sketched after this list):
        error: globus_ftp_control: gss_init_sec_context failed
        globus_gsi_callback_module: Could not verify credential
        globus_gsi_callback_module: Could not verify credential
        globus_gsi_callback_module: Invalid CRL: The available CRL has expired
  4. If only some users have the problem with a particular CE, see the troubleshooting entry "JS always fails with 'user proxy expired' message".
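
To check how much lifetime the local proxy has left (step 2), a minimal sketch, assuming a VOMS proxy has already been created on the UI and voms-proxy-info is available:

    # Remaining lifetime of the local proxy, in seconds
    voms-proxy-info -timeleft
    # Values below 1800 (30 minutes) will not be renewed by the WMS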
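
To see how long Torque is currently keeping completed jobs (the keep_completed setting from step 3), a sketch assuming qmgr can be run on the Torque server:

    # Dump the server configuration and look for the keep_completed attribute
    qmgr -c 'print server' | grep keep_completed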
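
To verify that $GLOBUS_LOCATION/tmp on the CE is world-writable with the sticky bit set (step 3), a quick check; the expected mode is 1777:

    # The permissions field should read drwxrwxrwt
    ls -ld $GLOBUS_LOCATION/tmp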
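
To check whether any CRL on the CE has expired (step 3), a sketch assuming the CA files are installed in the usual /etc/grid-security/certificates directory:

    # Print the nextUpdate field of every installed CRL;
    # a date in the past means that CRL has expired
    for crl in /etc/grid-security/certificates/*.r0; do
        echo -n "$crl: "
        openssl crl -noout -nextupdate -in "$crl"
    done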