The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed you can get in touch with EGI SDIS team using operations @



Tracing a WMS job ID to the batch system job ID

Some recipes are given below for tracing a WMS job ID to the associated batch system job ID.

Jobs sent to CREAM instances

On a CREAM CE in $GLITE_LOCATION_VAR/log/accounting there is an accounting file blahp.log-YYYYMMDD per day, each file starting at midnight UTC of the corresponding date. By default these files are only accessible to the CE administrator. For a given WMS job ID the corresponding batch system job ID(s) can be found like this:

# cd $GLITE_LOCATION_VAR/log/accounting
# grep '<WMS job ID>' \
  blahp.log-20110605 | tr ' ' \\n | grep lrmsID

The WMS job ID will appear multiple times, possibly in multiple files, when the job failed and was resubmitted to the same CE.
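
Because the job ID may be spread over several daily files, the lookup can be wrapped in a small helper that searches any number of files at once. The following is a minimal sketch, not part of the original recipe: the function name find_lrms_ids is hypothetical, and it assumes only what the pipeline above implies, namely that each field is a space-separated key=value pair, optionally enclosed in double quotes.

```shell
# Hypothetical helper: print the distinct batch system job IDs (lrmsID)
# logged for one WMS job ID in the given blahp.log-YYYYMMDD files.
# Assumption: each field looks like lrmsID=VALUE or "lrmsID=VALUE".
find_lrms_ids() {
    job_id=$1
    shift
    grep -hF -- "$job_id" "$@" |   # -F: match the job ID literally, not as a regex
        tr ' ' '\n' |              # one field per line
        tr -d '"' |                # drop optional quotes around fields
        sed -n 's/^lrmsID=//p' |   # keep only the lrmsID values
        sort -u                    # a resubmitted job may repeat an ID
}

# Possible usage on the CE, with the job ID held in $JOBID:
#   find_lrms_ids "$JOBID" "$GLITE_LOCATION_VAR"/log/accounting/blahp.log-*
```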

Jobs sent to LCG-CE or OSG-CE instances

Suppose we have this example job:

======================= glite-wms-job-status Success =====================

Status info for the Job :
Current Status:     Aborted 
Logged Reason(s):
    - Job has been terminated (got SIGTERM)
    - Standard output does not contain useful data.
      Cannot read JobWrapper output, both from Condor and from Maradona.
Status Reason:      hit job retry count (0)
Submitted:          Fri May  6 21:02:40 2011 CEST

If the job went to an LCG-CE, the batch system job ID can be found like this:

$ uberftp lxbra2307 'cat /opt/edg/var/gatekeeper/grid-jobmap_20110506' > \
  grid-jobmap_20110506
$ grep '<WMS job ID>' \
  grid-jobmap_20110506 | tr ' ' \\n | grep lrmsID

On an LCG-CE configured by Quattor the leading directory typically is /var/glite/gatekeeper instead of /opt/edg/var/gatekeeper.

When such a "grid-jobmap" file cannot be obtained, e.g. for a job sent to an OSG-CE, more information can be obtained as follows:

$ glite-wms-job-logging-info -v 2 '<WMS job ID>'
Event: Transfer
- Arrived                    =    Wed Jun  1 22:18:45 2011 CEST
- Dest host                  =    localhost
- Dest instance              =    /var/glite/logmonitor/CondorG.log/CondorG.1306950633.log
- Dest jobid                 =    4210734
- Destination                =    LogMonitor
- Host                       =
- Reason                     =    unavailable
- Result                     =    OK
- Source                     =    JobController

In the "Transfer" record with source "JobController" and result "OK" the "Dest jobid" indicates the Condor-G job ID on the WMS (it is not the remote batch system job ID!) and the "Dest instance" its associated log file. Further information can then be obtained from that log file, as detailed below.
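
Picking the right "Transfer" record out of a long logging-info listing by eye is error-prone, so the selection can be scripted. The sketch below is an illustration only: the function name condorg_job_info is hypothetical, and it assumes the output layout shown in the example above (an "Event:" line followed by "- Key = Value" lines).

```shell
# Hypothetical helper: read glite-wms-job-logging-info output on stdin and
# print "Dest jobid" and "Dest instance" for every Transfer record whose
# Source is JobController and whose Result is OK.
condorg_job_info() {
    awk '
        function emit() {
            if (f["Event"] == "Transfer" && f["Source"] == "JobController" && f["Result"] == "OK")
                print f["Dest jobid"], f["Dest instance"]
        }
        { sub(/^[ \t]+/, "") }           # tolerate indented output
        /^Event:/ {
            emit()                       # finish the previous record
            split("", f)                 # clear all stored fields
            f["Event"] = $2
            next
        }
        /^- / {
            eq = index($0, "=")
            if (eq == 0) next
            key = substr($0, 3, eq - 3)  # text between "- " and "="
            val = substr($0, eq + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            f[key] = val
        }
        END { emit() }
    '
}

# Possible usage: pipe the output of the glite-wms-job-logging-info
# command shown above into condorg_job_info.
```

Each output line holds the Condor-G job ID first and the log file path second.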

On the WMS

If the Condor-G log file is not known, it can be determined as follows on the WMS:

# cd $GLITE_LOCATION_VAR/logmonitor/CondorG.log
# ls -t | xargs grep -l '<WMS job ID>'

Multiple files may match when the job was retried at least once. If no file matches, the correct file may have been deemed complete and moved into the "recycle" subdirectory. In that case:

# cd $GLITE_LOCATION_VAR/logmonitor/CondorG.log/recycle
# ls -t | xargs grep -l '<WMS job ID>'
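
The two searches can be combined so that the "recycle" subdirectory is checked as well in one pass. This is a minimal sketch under stated assumptions: the function name find_condorg_logs is hypothetical, the search string is treated as a literal rather than a regular expression, and results are listed in file name order rather than newest first as with ls -t.

```shell
# Hypothetical helper: list the Condor-G log files under LOG_DIR, then
# under LOG_DIR/recycle, that contain PATTERN as a literal string.
find_condorg_logs() {
    pattern=$1
    log_dir=$2
    for dir in "$log_dir" "$log_dir/recycle"; do
        for file in "$dir"/*; do
            [ -f "$file" ] || continue    # skip subdirectories and unmatched globs
            if grep -qF -- "$pattern" "$file"; then
                printf '%s\n' "$file"
            fi
        done
    done
}

# Possible usage on the WMS, with the search string held in $PATTERN:
#   find_condorg_logs "$PATTERN" "$GLITE_LOCATION_VAR/logmonitor/CondorG.log"
```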

For a completed job the log file would contain entries like these, usually interspersed with entries for other, concurrent jobs:

000 (4210734.000.000) 06/01 22:18:45 Job submitted from host: <>
    ( [...]
017 (4210734.000.000) 06/01 22:19:05 Job submitted to Globus
    Can-Restart-JM: 1
027 (4210734.000.000) 06/01 22:19:05 Job submitted to grid resource
    GridResource: gt2
    GridJobId: gt2
001 (4210734.000.000) 06/01 22:21:33 Job executing on host:
005 (4210734.000.000) 06/01 22:22:19 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job

Each relevant record starts with a 3-digit event code followed by the Condor-G job ID in parentheses, as shown. A record ends with a line consisting of just three dots.
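
Given that record structure, the entries of a single job can be separated from those of concurrent jobs. The following sketch is an illustration, not part of the original recipe: the function name condorg_records is hypothetical, and it relies only on the header and terminator conventions just described.

```shell
# Hypothetical helper: print every record of one Condor-G job from a log
# file, skipping the interleaved records of other jobs.
condorg_records() {
    awk -v id="$1" '
        # A record header is a 3-digit event code followed by "(ID.".
        /^[0-9][0-9][0-9] \(/ { keep = (index($0, "(" id ".") == 5) }
        keep { print }
        # A line holding only three dots closes the current record.
        /^\.\.\.$/ { keep = 0 }
    ' "$2"
}

# Possible usage, with the values from the example above:
#   condorg_records 4210734 /var/glite/logmonitor/CondorG.log/CondorG.1306950633.log
```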

To find the remote job ID the next step is to note the last 2 numbers in the "JM-Contact" string. In the example they are 2599 and 1306959537. They are used below.
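
Extracting the two numbers can also be scripted. The sketch below is hedged on one assumption that the text above does not spell out: that the "JM-Contact" value is a URL whose last two path components are the numbers in question. The function name jm_contact_key and the host name in the usage line are hypothetical. The result is printed with a dot between the two numbers, which is the form used in the accounting log lookup further down.

```shell
# Hypothetical helper: print the last two numeric path components of a
# JM-Contact string, joined by a dot.
# Assumption: the string ends in .../NUMBER/NUMBER with an optional "/".
jm_contact_key() {
    printf '%s\n' "$1" |
        sed -n 's|.*/\([0-9][0-9]*\)/\([0-9][0-9]*\)/*$|\1.\2|p'
}

# Possible usage (made-up host and port):
#   jm_contact_key 'https://ce.example.org:40001/2599/1306959537/'
```

Nothing is printed when the string does not end in two numeric components.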

The OSG-CE gatekeeper host is evident from the same string. The gatekeeper log (typically $GLOBUS_LOCATION/var/globus-gatekeeper.log on the OSG-CE) will contain references to the accounting file needed to determine the remote job ID. For example:

PID: 24037 -- Notice: 0: GATEKEEPER_ACCT_FD=4 (/usr/local/OSG-1.2.12/globus/var/accounting.log)

The previously determined numbers are then joined by a dot and used like this to obtain the remote batch system job ID:

$ cd /usr/local/OSG-1.2.12/globus/var
$ zgrep -l '2599\.1306959537' accounting.log*
$ sed -n 's/.*GRAM_SCRIPT_JOB_ID \([^ |]*\).*2599\.1306959537.*/\1/p' accounting.log
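
When the matching entry sits in a rotated, compressed accounting file, the two commands can be folded into one helper that reads plain and gzipped files alike. This is a sketch under stated assumptions, not a definitive implementation: the function name remote_batch_job_id is hypothetical, the line layout is assumed only as far as the sed expression above implies, and it relies on gzip -cdf copying uncompressed input unchanged, as GNU gzip documents.

```shell
# Hypothetical helper: print the GRAM_SCRIPT_JOB_ID values from accounting
# log lines that also mention PID.TIMESTAMP. Files may be plain or gzipped.
remote_batch_job_id() {
    for n in "$1" "$2"; do
        case $n in
            ''|*[!0-9]*)
                echo "usage: remote_batch_job_id PID TIMESTAMP FILE..." >&2
                return 2 ;;
        esac
    done
    key="$1\\.$2"          # escape the dot so that it matches literally
    shift 2
    gzip -cdf -- "$@" |    # -f passes uncompressed files through unchanged
        sed -n "s/.*GRAM_SCRIPT_JOB_ID \([^ |]*\).*$key.*/\1/p"
}

# Possible usage, with the values from the example above:
#   remote_batch_job_id 2599 1306959537 /usr/local/OSG-1.2.12/globus/var/accounting.log*
```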