The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team using operations @ egi.eu.

Difference between revisions of "Tools/Manuals/TS85"

From EGIWiki

Latest revision as of 13:48, 23 November 2012


Tracing a WMS job ID to the batch system job ID

Some recipes are given below for tracing a WMS job ID to the associated batch system job ID.

Jobs sent to CREAM instances

A CREAM CE writes one accounting file per day, blahp.log-YYYYMMDD, in $GLITE_LOCATION_VAR/log/accounting. Each file starts at midnight UTC of the corresponding date. By default these files are only accessible to the CE administrator. For a given WMS job ID the corresponding batch system job ID(s) can be found like this:

# cd $GLITE_LOCATION_VAR/log/accounting
# grep https://wms208.cern.ch:9000/KfrY2SMKxwrpSnxBNtOFnw \
blahp.log-20110605 | tr ' ' \\n | grep lrmsID
"lrmsID=145778742"


The WMS job ID will appear multiple times, possibly in multiple files, when the job failed and was resubmitted to the same CE.
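
Because a resubmitted job can show up in several daily files, it may be convenient to search all of them in one go. The following is only a sketch: it assumes the space-separated "key=value" fields implied by the command above, and the function name lrms_ids_for_job is made up for illustration.

```shell
# Print the unique batch system job IDs recorded for one WMS job ID.
# Usage: lrms_ids_for_job WMS_JOB_ID FILE...
# Assumption: every record carries a field of the form "lrmsID=<id>".
lrms_ids_for_job() {
    job_id=$1
    shift
    # -F treats the job ID as a fixed string, -h drops the file name prefix.
    grep -hF -- "$job_id" "$@" |
        tr ' ' '\n' |
        sed -n 's/^"lrmsID=\(.*\)"$/\1/p' |
        sort -u
}
```

For example, run as the CE administrator: lrms_ids_for_job https://wms208.cern.ch:9000/KfrY2SMKxwrpSnxBNtOFnw $GLITE_LOCATION_VAR/log/accounting/blahp.log-201106*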

Jobs sent to LCG-CE or OSG-CE instances

Suppose we have this example job:

======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:

Status info for the Job : https://wms221.cern.ch:9000/yFtw9svc7vBkj3GnvCHwOA
Current Status:     Aborted 
Logged Reason(s):
    - Job has been terminated (got SIGTERM)
    - Standard output does not contain useful data.
      Cannot read JobWrapper output, both from Condor and from Maradona.
Status Reason:      hit job retry count (0)
Destination:        lxbra2307.cern.ch:2119/jobmanager-lcgpbs-ops
Submitted:          Fri May  6 21:02:40 2011 CEST
==========================================================================

If the job went to an LCG-CE, the batch system job ID can be found like this:

$ uberftp lxbra2307 'cat /opt/edg/var/gatekeeper/grid-jobmap_20110506' > \
grid-jobmap_20110506
$ grep https://wms221.cern.ch:9000/yFtw9svc7vBkj3GnvCHwOA \
grid-jobmap_20110506 | tr ' ' \\n | grep lrmsID
"lrmsID=21744.lxbra2307.cern.ch"

On an LCG-CE configured by Quattor the leading directory is typically /var/glite/gatekeeper instead of /opt/edg/var/gatekeeper.

When such a "grid-jobmap" file cannot be obtained, e.g. for a job sent to an OSG-CE, more information can be obtained as follows:

$ glite-wms-job-logging-info -v 2 https://gswms01.cern.ch:9000/Simrfu1NRtThYZ5sj0nXJw
[...]
Event: Transfer
- Arrived                    =    Wed Jun  1 22:18:45 2011 CEST
- Dest host                  =    localhost
- Dest instance              =    /var/glite/logmonitor/CondorG.log/CondorG.1306950633.log
- Dest jobid                 =    4210734
- Destination                =    LogMonitor
- Host                       =    gswms01.cern.ch
- Reason                     =    unavailable
- Result                     =    OK
- Source                     =    JobController
[...]

In the "Transfer" record with source "JobController" and result "OK" the "Dest jobid" indicates the Condor-G job ID on the WMS (it is not the remote batch system job ID!) and the "Dest instance" its associated log file. Further information can then be obtained from that log file, as detailed below.
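
Picking the right "Transfer" record out of long logging output by eye is error-prone, so a small filter may help. This is a sketch that assumes the "- Key = Value" layout shown in the example; the function name transfer_targets is hypothetical.

```shell
# Read "glite-wms-job-logging-info -v 2" output on stdin and print
# "<Dest jobid> <Dest instance>" for every Transfer event whose source is
# JobController and whose result is OK.
transfer_targets() {
    awk '
        function emit() {
            if (event == "Transfer" && f["Source"] == "JobController" && f["Result"] == "OK")
                print f["Dest jobid"], f["Dest instance"]
            split("", f)    # forget the fields of the finished event
        }
        function trim(s) {
            gsub(/^[ \t]+|[ \t]+$/, "", s)
            return s
        }
        /^[ \t]*Event:/ {
            emit()          # a new event starts, so evaluate the previous one
            event = $0
            sub(/^[ \t]*Event:/, "", event)
            event = trim(event)
            next
        }
        /^[ \t]*- / {
            pos = index($0, "=")
            if (pos == 0) next
            key = substr($0, 1, pos - 1)
            sub(/^[ \t]*- /, "", key)
            f[trim(key)] = trim(substr($0, pos + 1))
        }
        END { emit() }
    '
}
```

It could be used as: glite-wms-job-logging-info -v 2 https://gswms01.cern.ch:9000/Simrfu1NRtThYZ5sj0nXJw | transfer_targets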

On the WMS

If the Condor-G log file is not known, it can be determined as follows on the WMS:

# cd $GLITE_LOCATION_VAR/logmonitor/CondorG.log
# ls -t | xargs grep -l https://gswms01.cern.ch:9000/Simrfu1NRtThYZ5sj0nXJw
CondorG.1306950633.log
^C

Multiple files may match when the job was retried at least once. If no file matches, the correct file may have been deemed complete and moved into the "recycle" subdirectory. In that case:

# cd $GLITE_LOCATION_VAR/logmonitor/CondorG.log/recycle
# ls -t | xargs grep -l https://gswms01.cern.ch:9000/Simrfu1NRtThYZ5sj0nXJw
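
Both locations can also be searched in a single pass. A minimal sketch, with a made-up function name (find_condorg_logs), that makes no assumption about the file names inside the two directories:

```shell
# List the Condor-G log files that mention a WMS job ID, looking in the
# active log directory first and then in its "recycle" subdirectory.
# Usage: find_condorg_logs WMS_JOB_ID LOG_DIR
find_condorg_logs() {
    job_id=$1
    log_dir=$2
    for dir in "$log_dir" "$log_dir/recycle"; do
        for file in "$dir"/*; do
            [ -f "$file" ] || continue          # skip directories and empty globs
            grep -lF -- "$job_id" "$file"       # -l prints the name of a matching file
        done
    done
    return 0
}
```

For example: find_condorg_logs https://gswms01.cern.ch:9000/Simrfu1NRtThYZ5sj0nXJw $GLITE_LOCATION_VAR/logmonitor/CondorG.log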

For a completed job the log file would contain entries like these, usually interspersed with entries for other, concurrent jobs:

000 (4210734.000.000) 06/01 22:18:45 Job submitted from host: <128.142.167.28:23392>
    (https://gswms01.cern.ch:9000/Simrfu1NRtThYZ5sj0nXJw) [...]
...
[...]
017 (4210734.000.000) 06/01 22:19:05 Job submitted to Globus
    RM-Contact: antaeus.hpcc.ttu.edu:2119/jobmanager-sge
    JM-Contact: https://antaeus.hpcc.ttu.edu:40902/2599/1306959537/
    Can-Restart-JM: 1
...
[...]
027 (4210734.000.000) 06/01 22:19:05 Job submitted to grid resource
    GridResource: gt2 antaeus.hpcc.ttu.edu:2119/jobmanager-sge
    GridJobId: gt2 antaeus.hpcc.ttu.edu:2119/jobmanager-sge
    https://antaeus.hpcc.ttu.edu:40902/2599/1306959537/
...
[...]
001 (4210734.000.000) 06/01 22:21:33 Job executing on host:
    gt2 antaeus.hpcc.ttu.edu:2119/jobmanager-sge
...
[...]
005 (4210734.000.000) 06/01 22:22:19 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...

Each relevant record starts with a 3-digit event code followed by the Condor-G job ID in parentheses, as shown. Each record ends with a line consisting of just three dots.
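
Since the records of concurrent jobs are interleaved, it can be useful to filter a log file down to one Condor-G job. The sketch below relies only on the record layout just described; the function name condorg_records is hypothetical.

```shell
# Print all records that belong to one Condor-G job ID.
# Usage: condorg_records CONDOR_JOB_ID LOG_FILE
condorg_records() {
    awk -v id="$1" '
        # Record header: 3-digit event code, then "(<job id>.<proc>.<subproc>)".
        /^[0-9][0-9][0-9] \(/ { keep = (index($2, "(" id ".") == 1) }
        keep                  { print }
        # A line of just three dots closes the current record.
        /^\.\.\.$/            { keep = 0 }
    ' "$2"
}
```

For the example job: condorg_records 4210734 CondorG.1306950633.log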

To find the remote job ID the next step is to note the last 2 numbers in the "JM-Contact" string. In the example they are 2599 and 1306959537. They are used below.
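
The two numbers can also be extracted mechanically. A sketch, assuming the "JM-Contact" URL layout shown in the example (host and port followed by two numeric path components); jm_contact_key is a made-up name. Note that it reports every "JM-Contact" line it is given, so it should be fed only the records of the job of interest.

```shell
# Print "<first number>.<second number>" for each JM-Contact line found in
# the given files, or on stdin when no file is given.
jm_contact_key() {
    sed -n 's|.*JM-Contact: https://[^/]*/\([0-9][0-9]*\)/\([0-9][0-9]*\)/*$|\1.\2|p' "$@" |
        sort -u
}
```

For the example record this prints 2599.1306959537, which is the form needed in the final step.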

The OSG-CE gatekeeper host is evident from the same string. The gatekeeper log (typically $GLOBUS_LOCATION/var/globus-gatekeeper.log on the OSG-CE) will contain references to an accounting file that we need to determine the remote job ID. For example:

PID: 24037 -- Notice: 0: GATEKEEPER_ACCT_FD=4 (/usr/local/OSG-1.2.12/globus/var/accounting.log)
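
To list the accounting file(s) referenced by a gatekeeper log without reading through it, something like the following might do. It is a sketch that assumes the notice format of the example line; accounting_files is a hypothetical name.

```shell
# Print the unique accounting file paths mentioned in GATEKEEPER_ACCT_FD
# notices, reading the given gatekeeper log files (or stdin).
accounting_files() {
    sed -n 's/.*GATEKEEPER_ACCT_FD=[0-9]* (\([^)]*\)).*/\1/p' "$@" | sort -u
}
```

For example: accounting_files $GLOBUS_LOCATION/var/globus-gatekeeper.log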

The previously determined numbers are then joined by a dot and used like this to obtain the remote batch system job ID:

$ cd /usr/local/OSG-1.2.12/globus/var
$ zgrep -l 2599.1306959537 accounting.log*
accounting.log
$ sed -n 's/.*GRAM_SCRIPT_JOB_ID \([^ |]*\).*2599.1306959537.*/\1/p' accounting.log
190389
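
The same lookup can be wrapped in a function so that the two numbers are passed as arguments. This is a sketch built around the sed expression above; remote_job_id is a made-up name, and the accounting log format is assumed to be whatever that expression already relies on.

```shell
# Print the remote batch system job ID(s) for a given pair of numbers.
# Usage: remote_job_id FIRST_NUMBER SECOND_NUMBER [ACCOUNTING_LOG...]
# Reads stdin when no file is given.
remote_job_id() {
    key="$1.$2"
    shift 2
    # As in the command above, the dot in the key is left as a regex
    # wildcard, so it matches any single separator character.
    sed -n "s/.*GRAM_SCRIPT_JOB_ID \\([^ |]*\\).*${key}.*/\\1/p" "$@"
}
```

For the example: remote_job_id 2599 1306959537 accounting.log. Compressed rotations can be piped in instead, e.g. through zcat.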