Back to Troubleshooting Guide
Tracing a WMS job ID to the batch system job ID
Some recipes are given below for tracing a WMS job ID to the associated batch system job ID.
Jobs sent to CREAM instances
On a CREAM CE in $GLITE_LOCATION_VAR/log/accounting there is an accounting file blahp.log-YYYYMMDD per day, each file starting at midnight UTC of the corresponding date. By default these files are only accessible to the CE administrator. For a given WMS job ID the corresponding batch system job ID(s) can be found like this:
# cd $GLITE_LOCATION_VAR/log/accounting
# grep https://wms208.cern.ch:9000/KfrY2SMKxwrpSnxBNtOFnw \
    blahp.log-20110605 | tr ' ' \\n | grep lrmsID
"lrmsID=145778742"
The WMS job ID will appear multiple times, possibly in multiple files, when the job failed and was resubmitted to the same CE.
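In that case all daily files can be swept in one go. A self-contained sketch with made-up sample files; the field names and lrmsID values are illustrative, not real accounting data:

```shell
# Hypothetical example: one WMS job ID appearing in two daily
# accounting files after a resubmission to the same CE.
# Field names and values are illustrative.
mkdir -p /tmp/accounting && cd /tmp/accounting
echo '"timestamp=2011-06-05" "jobID=https://wms208.cern.ch:9000/KfrY2SMKxwrpSnxBNtOFnw" "lrmsID=145778742"' \
    > blahp.log-20110605
echo '"timestamp=2011-06-06" "jobID=https://wms208.cern.ch:9000/KfrY2SMKxwrpSnxBNtOFnw" "lrmsID=145779001"' \
    > blahp.log-20110606
# one lrmsID line per submission attempt
grep -h KfrY2SMKxwrpSnxBNtOFnw blahp.log-* | tr ' ' '\n' | grep lrmsID
```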
Jobs sent to LCG-CE or OSG-CE instances
Suppose we have this example job:
======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:

Status info for the Job : https://wms221.cern.ch:9000/yFtw9svc7vBkj3GnvCHwOA
Current Status:     Aborted
Logged Reason(s):
    - Job has been terminated (got SIGTERM)
    - Standard output does not contain useful data.
      Cannot read JobWrapper output, both from Condor and from Maradona.
Status Reason:      hit job retry count (0)
Destination:        lxbra2307.cern.ch:2119/jobmanager-lcgpbs-ops
Submitted:          Fri May 6 21:02:40 2011 CEST
==========================================================================
If the job went to an LCG-CE, the batch system job ID can be found like this:
$ uberftp lxbra2307 'cat /opt/edg/var/gatekeeper/grid-jobmap_20110506' > \
    grid-jobmap_20110506
$ grep https://wms221.cern.ch:9000/yFtw9svc7vBkj3GnvCHwOA \
    grid-jobmap_20110506 | tr ' ' \\n | grep lrmsID
"lrmsID=21744.lxbra2307.cern.ch"
On an LCG-CE configured by Quattor the leading directory typically is /var/glite/gatekeeper instead of /opt/edg/var/gatekeeper.
When such a "grid-jobmap" file cannot be obtained, e.g. for a job sent to an OSG-CE, more information can be obtained as follows:
$ glite-wms-job-logging-info -v 2 https://gswms01.cern.ch:9000/Simrfu1NRtThYZ5sj0nXJw
[...]
Event: Transfer
- Arrived        = Wed Jun 1 22:18:45 2011 CEST
- Dest host      = localhost
- Dest instance  = /var/glite/logmonitor/CondorG.log/CondorG.1306950633.log
- Dest jobid     = 4210734
- Destination    = LogMonitor
- Host           = gswms01.cern.ch
- Reason         = unavailable
- Result         = OK
- Source         = JobController
[...]
In the "Transfer" record with Source "JobController" and Result "OK", the "Dest jobid" field gives the Condor-G job ID on the WMS (note: this is not the remote batch system job ID!), and "Dest instance" gives its associated log file. Further information can then be obtained from that log file, as detailed below.
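These two fields can also be filtered out of saved logging-info output mechanically; a sketch using awk on sample lines like the ones above (the temporary file name is arbitrary):

```shell
# Extract "Dest instance" and "Dest jobid" from saved
# glite-wms-job-logging-info output (sample lines pasted here).
cat > /tmp/logging-info.txt <<'EOF'
Event: Transfer
- Dest instance = /var/glite/logmonitor/CondorG.log/CondorG.1306950633.log
- Dest jobid = 4210734
- Destination = LogMonitor
- Result = OK
- Source = JobController
EOF
# prints the log file path, then the Condor-G job ID
awk -F' = ' '/Dest (jobid|instance)/ {print $2}' /tmp/logging-info.txt
```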
On the WMS
If the Condor-G log file is not known, it can be determined as follows on the WMS:
# cd $GLITE_LOCATION_VAR/logmonitor/CondorG.log
# ls -t | xargs grep -l https://gswms01.cern.ch:9000/Simrfu1NRtThYZ5sj0nXJw
CondorG.1306950633.log
^C
Multiple files may match when the job was retried at least once. If no file matches, the correct file may have been deemed complete and moved into the "recycle" subdirectory. In that case:
# cd $GLITE_LOCATION_VAR/logmonitor/CondorG.log/recycle
# ls -t | xargs grep -l https://gswms01.cern.ch:9000/Simrfu1NRtThYZ5sj0nXJw
For a completed job the log file would contain entries like these, usually interspersed with entries for other, concurrent jobs:
000 (4210734.000.000) 06/01 22:18:45 Job submitted from host: <18.104.22.168:23392>
    (https://gswms01.cern.ch:9000/Simrfu1NRtThYZ5sj0nXJw)
[...]
...
[...]
017 (4210734.000.000) 06/01 22:19:05 Job submitted to Globus
    RM-Contact: antaeus.hpcc.ttu.edu:2119/jobmanager-sge
    JM-Contact: https://antaeus.hpcc.ttu.edu:40902/2599/1306959537/
    Can-Restart-JM: 1
...
[...]
027 (4210734.000.000) 06/01 22:19:05 Job submitted to grid resource
    GridResource: gt2 antaeus.hpcc.ttu.edu:2119/jobmanager-sge
    GridJobId: gt2 antaeus.hpcc.ttu.edu:2119/jobmanager-sge
        https://antaeus.hpcc.ttu.edu:40902/2599/1306959537/
...
[...]
001 (4210734.000.000) 06/01 22:21:33 Job executing on host: gt2 antaeus.hpcc.ttu.edu:2119/jobmanager-sge
...
[...]
005 (4210734.000.000) 06/01 22:22:19 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...
Each relevant record starts with a 3-digit code followed by the Condor-G job ID as shown. A record is ended by a line consisting of just 3 dots followed by a newline.
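Given that structure, the records for one Condor-G job ID can be isolated mechanically; a self-contained awk sketch on two made-up sample records:

```shell
# Print only the record(s) for cluster 4210734: start printing at a
# "NNN (4210734..." header line, stop after the "..." terminator line.
cat > /tmp/CondorG.sample.log <<'EOF'
001 (4210733.000.000) 06/01 22:20:00 Job executing on host: other
...
005 (4210734.000.000) 06/01 22:22:19 Job terminated.
    (1) Normal termination (return value 0)
...
EOF
awk '/^[0-9][0-9][0-9] \(4210734\./ {show=1}
     show {print}
     /^\.\.\.$/ {show=0}' /tmp/CondorG.sample.log
```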
To find the remote job ID, the next step is to note the last two numbers in the "JM-Contact" string. In the example they are 2599 and 1306959537; they are used below.
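Extracting and joining the two numbers can be scripted; a sed sketch on the JM-Contact URL from the example:

```shell
# Take the last two path components of the JM-Contact URL
# and join them with a dot, as needed for the accounting lookup.
jm='https://antaeus.hpcc.ttu.edu:40902/2599/1306959537/'
echo "$jm" | sed -n 's|.*/\([0-9]*\)/\([0-9]*\)/$|\1.\2|p'
# prints: 2599.1306959537
```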
The OSG-CE gatekeeper host is evident from the same string. The gatekeeper log (typically $GLOBUS_LOCATION/var/globus-gatekeeper.log on the OSG-CE) will contain references to an accounting file that we need in order to determine the remote job ID. For example:
PID: 24037 -- Notice: 0: GATEKEEPER_ACCT_FD=4 (/usr/local/OSG-1.2.12/globus/var/accounting.log)
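The accounting file path can be pulled out of such a line mechanically; a sed sketch on the example line above:

```shell
# Extract the accounting file path from a gatekeeper log line.
line='PID: 24037 -- Notice: 0: GATEKEEPER_ACCT_FD=4 (/usr/local/OSG-1.2.12/globus/var/accounting.log)'
echo "$line" | sed -n 's/.*GATEKEEPER_ACCT_FD=[0-9]* (\(.*\))/\1/p'
# prints: /usr/local/OSG-1.2.12/globus/var/accounting.log
```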
The previously determined numbers are then joined by a dot and used like this to obtain the remote batch system job ID:
$ cd /usr/local/OSG-1.2.12/globus/var
$ zgrep -l 2599.1306959537 accounting.log*
accounting.log
$ sed -n 's/.*GRAM_SCRIPT_JOB_ID \([^ |]*\).*2599.1306959537.*/\1/p' accounting.log
190389