Tools/Manuals/TS85



Tracing a WMS job ID to the batch system job ID

Some recipes are given below for tracing a WMS job ID to the associated batch system job ID.

Jobs sent to CREAM instances

On a CREAM CE, the directory $GLITE_LOCATION_VAR/log/accounting holds one accounting file per day, named blahp.log-YYYYMMDD, each starting at midnight UTC of the corresponding date. By default these files are readable only by the CE administrator. For a given WMS job ID, the corresponding batch system job ID(s) can be found like this:

# cd $GLITE_LOCATION_VAR/log/accounting
# grep https://wms208.cern.ch:9000/KfrY2SMKxwrpSnxBNtOFnw \
blahp.log-20110605 | tr ' ' \\n | grep lrmsID
"lrmsID=145778742"


The WMS job ID appears multiple times, possibly in multiple files, if the job failed and was resubmitted to the same CE.
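
Because a resubmitted job can be recorded in several daily files, it may be convenient to search all of them in one go. The following is a minimal sketch, not part of the original recipe: the function name wms_to_lrms is hypothetical, and it assumes that each accounting line consists of space-separated quoted "key=value" fields, as the example output above suggests.

```shell
# Hypothetical helper: list every batch system job ID recorded for a WMS job ID.
# Usage: wms_to_lrms <wms-job-id> [accounting-dir]
wms_to_lrms() {
    jobid=$1
    dir=${2:-$GLITE_LOCATION_VAR/log/accounting}
    # -F: treat the job ID as a fixed string, not a regular expression
    # -h: do not prefix matches with the file name
    grep -hF -- "$jobid" "$dir"/blahp.log-* 2>/dev/null |
        tr ' ' '\n' |      # one field per line (assumed field layout)
        grep lrmsID |
        sort -u            # drop duplicates
}
```

More than one line of output would indicate that the job reached the batch system more than once, e.g. after a resubmission.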

Jobs sent to LCG-CE or OSG-CE instances

Suppose we have this example job:

======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:

Status info for the Job : https://wms221.cern.ch:9000/yFtw9svc7vBkj3GnvCHwOA
Current Status:     Aborted 
Logged Reason(s):
    - Job has been terminated (got SIGTERM)
    - Standard output does not contain useful data.
      Cannot read JobWrapper output, both from Condor and from Maradona.
Status Reason:      hit job retry count (0)
Destination:        lxbra2307.cern.ch:2119/jobmanager-lcgpbs-ops
Submitted:          Fri May  6 21:02:40 2011 CEST
==========================================================================

If the job went to an LCG-CE, the batch system job ID can be found like this:

$ uberftp lxbra2307 'cat /opt/edg/var/gatekeeper/grid-jobmap_20110506' > \
grid-jobmap_20110506
$ grep https://wms221.cern.ch:9000/yFtw9svc7vBkj3GnvCHwOA \
grid-jobmap_20110506 | tr ' ' \\n | grep lrmsID
"lrmsID=21744.lxbra2307.cern.ch"

On an LCG-CE configured by Quattor, the leading directory is typically /var/glite/gatekeeper instead of /opt/edg/var/gatekeeper.

When such a "grid-jobmap" file cannot be obtained, e.g. for a job sent to an OSG-CE, more information can be obtained as follows:

$ glite-wms-job-logging-info -v 2 https://gswms01.cern.ch:9000/Simrfu1NRtThYZ5sj0nXJw
[...]
Event: Transfer
- Arrived                    =    Wed Jun  1 22:18:45 2011 CEST
- Dest host                  =    localhost
- Dest instance              =    /var/glite/logmonitor/CondorG.log/CondorG.1306950633.log
- Dest jobid                 =    4210734
- Destination                =    LogMonitor
- Host                       =    gswms01.cern.ch
- Reason                     =    unavailable
- Result                     =    OK
- Source                     =    JobController
[...]

Look for the "Transfer" record with source "JobController" and result "OK". Its "Dest jobid" is the Condor-G job ID on the WMS (it is not the remote batch system job ID!) and its "Dest instance" is the associated Condor-G log file. Further information can then be obtained from that log file, as detailed below.
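
The selection of that record can be sketched as a small filter. This is an illustration only: the function name condor_g_ids is hypothetical, and the sketch assumes that every event in the command output starts with an "Event:" line followed by "- key = value" lines, as in the excerpt above.

```shell
# Hypothetical filter: read "glite-wms-job-logging-info -v 2" output on stdin and
# print "<Dest jobid> <Dest instance>" for each Transfer record whose
# Source is JobController and whose Result is OK.
condor_g_ids() {
    awk '
        function emit() {
            if (event == "Transfer" && f["Source"] == "JobController" && f["Result"] == "OK")
                print f["Dest jobid"], f["Dest instance"]
            split("", f)    # clear the field table for the next record
        }
        /^[ \t]*Event:/ { emit(); event = $2; next }
        /^[ \t]*- / {
            line = $0
            sub(/^[ \t]*- /, "", line)
            pos = index(line, "=")
            if (pos == 0) next
            key = substr(line, 1, pos - 1)
            val = substr(line, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", key)    # trim padding around the key
            gsub(/^[ \t]+|[ \t]+$/, "", val)    # trim padding around the value
            f[key] = val
        }
        END { emit() }
    '
}
```

It would be used as: glite-wms-job-logging-info -v 2 JOBID | condor_g_ids. The fields are buffered until the end of each record because "Source" and "Result" appear after "Dest jobid" in the output.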

On the WMS

If the Condor-G log file is not known, it can be determined as follows on the WMS:

# cd $GLITE_LOCATION_VAR/logmonitor/CondorG.log
# ls -t | xargs grep -l https://gswms01.cern.ch:9000/Simrfu1NRtThYZ5sj0nXJw
CondorG.1306950633.log
^C

Multiple files may match when the job was retried at least once. If no file matches, the correct file may have been deemed complete and moved into the "recycle" subdirectory. In that case:

# cd $GLITE_LOCATION_VAR/logmonitor/CondorG.log/recycle
# ls -t | xargs grep -l https://gswms01.cern.ch:9000/Simrfu1NRtThYZ5sj0nXJw
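
Both locations can also be searched in a single pass. A minimal sketch, assuming the files in the "recycle" subdirectory are plain (uncompressed) text; the function name find_condorg_log is hypothetical. Unlike the "ls -t" commands above, it does not visit the newest files first.

```shell
# Hypothetical helper: print every Condor-G log file that mentions a WMS job ID,
# looking in the active directory first and then in its "recycle" subdirectory.
# Usage: find_condorg_log <wms-job-id> [condorg-log-dir]
find_condorg_log() {
    jobid=$1
    dir=${2:-$GLITE_LOCATION_VAR/logmonitor/CondorG.log}
    for d in "$dir" "$dir/recycle"; do
        [ -d "$d" ] || continue
        for f in "$d"/*; do
            if [ -f "$f" ]; then
                # -l: print only the file name; -F: match the URL literally.
                # A file without a match is not an error.
                grep -lF -- "$jobid" "$f" || :
            fi
        done
    done
}
```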

For a completed job the log file would contain entries like these, usually interspersed with entries for other, concurrent jobs:

000 (4210734.000.000) 06/01 22:18:45 Job submitted from host: <128.142.167.28:23392>
    (https://gswms01.cern.ch:9000/Simrfu1NRtThYZ5sj0nXJw) [...]
...
[...]
017 (4210734.000.000) 06/01 22:19:05 Job submitted to Globus
    RM-Contact: antaeus.hpcc.ttu.edu:2119/jobmanager-sge
    JM-Contact: https://antaeus.hpcc.ttu.edu:40902/2599/1306959537/
    Can-Restart-JM: 1
...
[...]
027 (4210734.000.000) 06/01 22:19:05 Job submitted to grid resource
    GridResource: gt2 antaeus.hpcc.ttu.edu:2119/jobmanager-sge
    GridJobId: gt2 antaeus.hpcc.ttu.edu:2119/jobmanager-sge
    https://antaeus.hpcc.ttu.edu:40902/2599/1306959537/
...
[...]
001 (4210734.000.000) 06/01 22:21:33 Job executing on host:
    gt2 antaeus.hpcc.ttu.edu:2119/jobmanager-sge
...
[...]
005 (4210734.000.000) 06/01 22:22:19 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...

Each relevant record starts with a 3-digit event code followed by the Condor-G job ID in parentheses, as shown. A record ends with a line consisting of just three dots.
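
Since records of concurrent jobs are interleaved, it can help to print only the records of one Condor-G job. The sketch below relies solely on the record layout just described; the function name condorg_records is hypothetical.

```shell
# Hypothetical helper: print all records of one Condor-G job from a log file.
# Usage: condorg_records <condor-g-job-id> <log-file>
condorg_records() {
    awk -v id="$1" '
        # Record header, e.g.: 017 (4210734.000.000) 06/01 22:19:05 ...
        /^[0-9][0-9][0-9] \(/ { keep = (index($2, "(" id ".") == 1) }
        keep                  { print }
        # A line of exactly three dots closes the current record
        /^\.\.\.$/            { keep = 0 }
    ' "$2"
}
```

For example, condorg_records 4210734 CondorG.1306950633.log | grep JM-Contact would show only the "JM-Contact" line of that job.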

To find the remote job ID, the next step is to note the last two numbers in the "JM-Contact" string. In the example they are 2599 and 1306959537. They are used below.
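
Extracting the two numbers can be sketched as follows. The function name jm_contact_key is hypothetical; the sketch assumes the "JM-Contact" URL ends in two purely numeric path components, as in the example, and prints them joined by a dot.

```shell
# Hypothetical helper: reduce a JM-Contact URL to "<number1>.<number2>".
# Usage: jm_contact_key <jm-contact-url>
jm_contact_key() {
    printf '%s\n' "$1" |
        # Capture the last two numeric path components; a trailing "/" is optional.
        sed -n 's|.*/\([0-9][0-9]*\)/\([0-9][0-9]*\)/*$|\1.\2|p'
}
```

For the URL in the example, jm_contact_key https://antaeus.hpcc.ttu.edu:40902/2599/1306959537/ prints 2599.1306959537. Nothing is printed when the URL does not have the assumed form.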

The OSG-CE gatekeeper host is evident from the same string. The gatekeeper log (typically $GLOBUS_LOCATION/var/globus-gatekeeper.log on the OSG-CE) contains references to the accounting file needed to determine the remote job ID. For example:

PID: 24037 -- Notice: 0: GATEKEEPER_ACCT_FD=4 (/usr/local/OSG-1.2.12/globus/var/accounting.log)

The previously determined numbers are then joined by a dot and used like this to obtain the remote batch system job ID:

$ cd /usr/local/OSG-1.2.12/globus/var
$ zgrep -l 2599.1306959537 accounting.log*
accounting.log
$ sed -n 's/.*GRAM_SCRIPT_JOB_ID \([^ |]*\).*2599.1306959537.*/\1/p' accounting.log
190389
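
The sed command above can be wrapped so that the dot-joined key is passed as an argument. This is a sketch under the same assumption the original command makes about the accounting line layout ("GRAM_SCRIPT_JOB_ID" followed by the job ID, with the key later on the same line); the function name gram_job_id is hypothetical. It additionally escapes the dot in the key so that it only matches a literal dot.

```shell
# Hypothetical helper: print the remote batch system job ID for a dot-joined key.
# Usage: gram_job_id <number1.number2> <accounting-log-file>
gram_job_id() {
    # Escape dots so that "." in the key is not treated as "any character"
    key=$(printf '%s' "$1" | sed 's/\./\\./g')
    sed -n "s/.*GRAM_SCRIPT_JOB_ID \([^ |]*\).*$key.*/\1/p" "$2"
}
```

With the example values, gram_job_id 2599.1306959537 accounting.log corresponds to the sed command shown above. It reads plain files only, so a rotated and compressed accounting file would have to be decompressed first.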