Tools/Manuals/TS05

From EGIWiki

Proxy expired

Full message

Jobs fail with this error reported by glite-wms-job-logging-info -v 2:

Globus error 131: the user proxy expired (job is still running)

In the WMS syslog (/var/log/messages) the following error may appear under unusual circumstances:

Mar 19 22:38:49 WMS-host glite-proxy-renewd[5768]:
Proxy lifetime exceeded value of the Condor limit!

Diagnosis

  1. The job stayed in the queue and/or ran for longer than the remaining lifetime of the proxy used at submission, and the proxy could not be renewed from the specified MyProxy server, if any. Check whether the MyProxy server holds a valid proxy:
    myproxy-info -d
    To upload a proxy valid for 800 hours (about 1 month):
    myproxy-init -d -n -c 800
    Note: the "-d" and "-n" options are essential!
    Also check whether the WMS in use has been configured as an "authorized renewer" on that MyProxy server.
  2. If the initial proxy is valid for less than 30 minutes, the WMS will not even try to renew it and logs the "Condor limit" error shown above.
  3. If the job actually did finish (the output sandbox is on the WMS, the job uploaded output files to an SE, ...) but the WMS reports that the proxy expired, this is a site problem. For example:
    • Site batch system monitoring commands cannot be executed by the "grid_monitor" (see the job submission chain diagram), so the status of the job could not be reported to the WMS.
    • PBS/Torque may have been configured to remember completed jobs for a long time instead of forgetting them. The "lcgpbs" job manager ignores the "C" state and keeps reporting the job in its previous state, i.e. running. To make the batch system remember completed jobs for only 60 seconds:
      qmgr -c 'set server keep_completed = 60'
      The reason the "lcgpbs" job manager ignores the "C" state is explained here: https://savannah.cern.ch/bugs/?7874
    • The "grid_monitor" could not write its status files under $GLOBUS_LOCATION/tmp, which must be world-writable with the sticky bit set. Check that the UIDs of the grid accounts have not changed on the CE. For each grid account UID the directory may contain files like these:
      -rw-r--r--    1 sgmops96 cg   100 Aug 15 23:14 grid_manager_monitor_agent_log.26865
      -rw-r--r--    1 sgmops96 cg    17 Aug 15 23:14 grid_manager_monitor_agent_log.26865.lock
      -rw-r--r--    1 sgmops96 cg   144 Aug 15 23:14 grid_manager_monitor_agent_log.26865.time
    Ensure that the UID in each file name (here 26865) is that of the owner (here sgmops96). An easy way to do that:
      # ls -ln $GLOBUS_LOCATION/tmp | awk '$NF !~ "."$3'
      drwxrwxrwt    2 0        0          434176 Aug 15 23:19 gram_job_state
      #
    Only the gram_job_state subdirectory should appear in the output.
    Note: on an LCG-CE these files should no longer be in use, since the "grid_monitor" functionality should be handled by the globus-gma daemon, which keeps them in the gma_state subdirectory instead (owned by root).
    • The "grid_monitor" could not copy its status files back to the WMS. On a UI with inbound connectivity for the GLOBUS_TCP_PORT_RANGE you can check that like this:
      globus-job-run some_CE.some_domain \
      /opt/globus/bin/globus-url-copy file:/etc/group \
      gsiftp://your_WMS.its_domain/tmp/foo.$$
    It could fail e.g. because the CRLs on the CE are out of date:
      error: globus_ftp_control: gss_init_sec_context failed
      globus_gsi_callback_module: Could not verify credential
      globus_gsi_callback_module: Could not verify credential
      globus_gsi_callback_module: Invalid CRL: The available CRL has expired
  4. If only some users have the problem with a particular CE, see the troubleshooting entry "JS always fails with 'user proxy expired' message".
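
The 30-minute renewal threshold described above can be checked on the UI before a job is submitted. The following is a minimal sketch and not part of the original guide: the function name classify_proxy_lifetime and its output strings are hypothetical, and it assumes the remaining proxy lifetime in seconds has been obtained separately, for example from voms-proxy-info -timeleft or grid-proxy-info -timeleft.

```shell
#!/bin/sh
# Hypothetical helper (not a gLite tool): classify a proxy by its remaining
# lifetime in seconds. The 1800 s (30 min) threshold is the one quoted in
# the diagnosis above; below it the WMS does not attempt a renewal.
classify_proxy_lifetime() {
    timeleft=$1
    # Anything that is not a non-negative integer is reported as unknown.
    case $timeleft in
        ''|*[!0-9]*) echo "unknown"; return 2 ;;
    esac
    if [ "$timeleft" -eq 0 ]; then
        echo "expired"; return 1
    elif [ "$timeleft" -lt 1800 ]; then
        echo "too-short-for-renewal"; return 1
    else
        echo "ok"
    fi
}
```

A possible invocation is classify_proxy_lifetime "$(voms-proxy-info -timeleft)". A result other than "ok" suggests creating a fresh proxy before submitting.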

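The UID consistency check on the "grid_monitor" status files can also be written as a stricter filter. This is only a sketch under stated assumptions: the function name find_uid_mismatches and its output format are hypothetical, and it assumes the file names follow the grid_manager_monitor_agent_log.UID pattern shown in the listing above, optionally followed by .lock or .time.

```shell
#!/bin/sh
# Hypothetical filter: reads `ls -ln` output on stdin and prints every
# grid_manager_monitor_agent_log file whose embedded UID differs from the
# numeric owner UID in column 3. Other directory entries are ignored.
find_uid_mismatches() {
    awk '
        $NF ~ /^grid_manager_monitor_agent_log\.[0-9]+(\.(lock|time))?$/ {
            split($NF, part, ".")   # part[2] is the UID embedded in the name
            if (part[2] != $3) print $NF " owner_uid=" $3
        }
    '
}
```

A possible invocation on the CE is ls -ln "$GLOBUS_LOCATION/tmp" | find_uid_mismatches. Empty output means that every status file is owned by the UID in its name. Unlike the one-liner above, this filter does not list gram_job_state or other unrelated entries.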