Revision as of 13:24, 25 May 2011 by Aesch
Jobs sent to some CE stay in Running state forever

Full message

$ glite-wms-job-logging-info -v 2 https://gswms01.cern.ch:9000/6MuGaiwPVcxEo7w6YlGHng

**********************************************************************
[...]
        ---
Event: Running
- Arrived                    =    Sun Jan 17 11:39:21 2010 CET
- Host                       =    lxbsq1344.cern.ch
- Node                       =    lxbsq1344.cern.ch
- Source                     =    LRMS
- Timestamp                  =    Sun Jan 17 11:38:22 2010 CET
- User                       =    /DC=ch/DC=cern/.....
        ---
Event: ReallyRunning
- Arrived                    =    Sun Jan 17 11:39:23 2010 CET
- Host                       =    lxbsq1344.cern.ch
- Source                     =    LRMS
- Timestamp                  =    Sun Jan 17 11:39:23 2010 CET
- User                       =    /DC=ch/DC=cern/.....
        ---
Event: ResourceUsage
- Arrived                    =    Sun Jan 17 11:39:24 2010 CET
- Host                       =    lxbsq1344.cern.ch
- Source                     =    LRMS
- Timestamp                  =    Sun Jan 17 11:39:24 2010 CET
- User                       =    /DC=ch/DC=cern/.....
        ---
Event: Done
- Arrived                    =    Sun Jan 17 11:39:26 2010 CET
- Exit code                  =    0
- Host                       =    lxbsq1344.cern.ch
- Source                     =    LRMS
- Status code                =    OK
- Timestamp                  =    Sun Jan 17 11:39:26 2010 CET
- User                       =    /DC=ch/DC=cern/..... 

**********************************************************************

$ glite-wms-job-status https://gswms01.cern.ch:9000/6MuGaiwPVcxEo7w6YlGHng


*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://gswms01.cern.ch:9000/6MuGaiwPVcxEo7w6YlGHng
Current Status:     Running
Status Reason:      Job successfully submitted to Globus
Destination:        ce128.cern.ch:2119/jobmanager-lcglsf-grid_ops
Submitted:          Sun Jan 17 11:34:20 2010 CET
*************************************************************
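
The two outputs above disagree: the logging info contains a "Done" event whose source is "LRMS", while the job status is still "Running". As a rough illustration, that pattern could be checked for automatically by parsing the text printed by the logging-info command. The sketch below assumes the output layout shown above ("Event:" lines followed by "- Key = Value" lines); the function names `parse_events` and `payload_finished_but_running` are hypothetical and not part of any gLite tool.

```python
def parse_events(logging_info_text):
    """Split logging-info output into a list of dicts, one per 'Event:' block.

    Assumes the layout shown above: an 'Event: <name>' line followed by
    '- <key> = <value>' lines. Separator lines such as '---' are ignored.
    """
    events = []
    current = None
    for raw in logging_info_text.splitlines():
        line = raw.strip()
        if line.startswith("Event:"):
            current = {"Event": line.split(":", 1)[1].strip()}
            events.append(current)
        elif current is not None and line.startswith("- ") and "=" in line:
            # Split on the first '=' only, so values may contain '=' themselves.
            key, value = line[2:].split("=", 1)
            current[key.strip()] = value.strip()
    return events


def payload_finished_but_running(logging_info_text, current_status):
    """Return True when a 'Done' event from the LRMS exists but the job
    status is still reported as 'Running'."""
    done_from_lrms = any(
        event["Event"] == "Done" and event.get("Source") == "LRMS"
        for event in parse_events(logging_info_text)
    )
    return done_from_lrms and current_status == "Running"
```

Here `current_status` stands for the value of the "Current Status" field in the job status output; how that value is obtained is left to the caller.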

Diagnosis

A "Done" event logged with the source "LRMS" means that the RB/WMS job wrapper resumed control after the user payload returned and that the output sandbox was transferred. The job wrapper should exit shortly afterwards. However, if the user payload started a background process that has not exited in the meantime, the batch system may, depending on its type or configuration, continue to consider the job running. Older versions of the job wrapper killed all payload processes that remained in the process group in which the payload was started, but recent versions leave any such cleanup to the payload itself. In any case, the safest approach is for the payload to ensure that it leaves no stale processes behind, rather than counting on some other layer to clean them up intelligently.
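
One way a payload might clean up after itself is sketched below. It is a minimal illustration, not the job wrapper's own mechanism. It assumes a POSIX worker node and a payload driven from Python. The names `start_helper`, `stop_helper` and `run_payload` are hypothetical. Each background helper is started in its own process group, so that the helper and any children it spawns can be signalled together before the payload returns.

```python
import os
import signal
import subprocess


def start_helper(cmd):
    """Start a background helper in its own session and process group.

    With start_new_session=True the child calls setsid(), so its process
    group ID equals its PID and the whole group can be signalled later.
    """
    return subprocess.Popen(cmd, start_new_session=True)


def stop_helper(proc, grace_seconds=5.0):
    """Terminate a helper's process group, escalating to SIGKILL if needed."""
    try:
        os.killpg(proc.pid, signal.SIGTERM)
    except ProcessLookupError:
        pass  # the whole group has already gone
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        # The helper ignored SIGTERM; force it and reap it.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()


def run_payload(main_cmd, helper_cmds):
    """Run main_cmd with helpers in the background and return its exit code.

    The finally block runs whether main_cmd succeeds, fails or raises, so
    no helper started here outlives the payload.
    """
    helpers = []
    try:
        for cmd in helper_cmds:
            helpers.append(start_helper(cmd))
        return subprocess.call(main_cmd)
    finally:
        for proc in helpers:
            stop_helper(proc)
```

For example, `run_payload(["sh", "-c", "exit 3"], [["sleep", "300"]])` would return 3 and stop the `sleep` helper before returning. A payload written as a shell script could follow the same idea with an exit trap that kills its background jobs.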


For other possible explanations, refer to "Jobs sent to some CE stay in Scheduled state forever".