Tools/Manuals/TS58






Jobs sent to some CE stay in Running state forever

Full message

$ glite-wms-job-logging-info -v 2 https://gswms01.cern.ch:9000/6MuGaiwPVcxEo7w6YlGHng

**********************************************************************
[...]
        ---
Event: Running
- Arrived                    =    Sun Jan 17 11:39:21 2010 CET
- Host                       =    lxbsq1344.cern.ch
- Node                       =    lxbsq1344.cern.ch
- Source                     =    LRMS
- Timestamp                  =    Sun Jan 17 11:38:22 2010 CET
- User                       =    /DC=ch/DC=cern/.....
        ---
Event: ReallyRunning
- Arrived                    =    Sun Jan 17 11:39:23 2010 CET
- Host                       =    lxbsq1344.cern.ch
- Source                     =    LRMS
- Timestamp                  =    Sun Jan 17 11:39:23 2010 CET
- User                       =    /DC=ch/DC=cern/.....
        ---
Event: ResourceUsage
- Arrived                    =    Sun Jan 17 11:39:24 2010 CET
- Host                       =    lxbsq1344.cern.ch
- Source                     =    LRMS
- Timestamp                  =    Sun Jan 17 11:39:24 2010 CET
- User                       =    /DC=ch/DC=cern/.....
        ---
Event: Done
- Arrived                    =    Sun Jan 17 11:39:26 2010 CET
- Exit code                  =    0
- Host                       =    lxbsq1344.cern.ch
- Source                     =    LRMS
- Status code                =    OK
- Timestamp                  =    Sun Jan 17 11:39:26 2010 CET
- User                       =    /DC=ch/DC=cern/..... 

**********************************************************************

$ glite-wms-job-status https://gswms01.cern.ch:9000/6MuGaiwPVcxEo7w6YlGHng


*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://gswms01.cern.ch:9000/6MuGaiwPVcxEo7w6YlGHng
Current Status:     Running
Status Reason:      Job successfully submitted to Globus
Destination:        ce128.cern.ch:2119/jobmanager-lcglsf-grid_ops
Submitted:          Sun Jan 17 11:34:20 2010 CET
*************************************************************
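
The two outputs above show the telltale mismatch: the logging info contains a "Done" event whose source is "LRMS", while the job status still reports "Running". The following is a minimal sketch for spotting the first half of that pattern automatically. The function name has_lrms_done is hypothetical, and the parsing assumes the logging-info output is laid out as in the example above ("Event:" lines followed by "- Key = Value" lines); other versions of the tool may format it differently.

```shell
#!/bin/sh
# Hypothetical helper: reads glite-wms-job-logging-info output on stdin and
# succeeds (exit 0) if it contains a "Done" event whose Source is "LRMS".
# Assumes the layout shown in the example output above.
has_lrms_done() {
    awk '
        # Each "Event:" line starts a new event; remember whether it is "Done".
        /^[ \t]*Event:/ { in_done = ($2 == "Done") }
        # Inside a Done event, look for a Source line whose value is LRMS.
        in_done && /^[ \t]*- Source/ && $NF == "LRMS" { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Demo on an excerpt of the output above (no grid tools needed to try it).
sample='Event: Running
- Source                     =    LRMS
        ---
Event: Done
- Exit code                  =    0
- Source                     =    LRMS'

if printf '%s\n' "$sample" | has_lrms_done; then
    echo "Done event from LRMS found"
fi
```

On a real system the function would be fed from the command shown earlier, for example by piping the output of glite-wms-job-logging-info -v 2 for the job ID into has_lrms_done, and the result compared against the "Current Status" line of glite-wms-job-status.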

Diagnosis

A "Done" event logged with source "LRMS" means that the RB/WMS job wrapper resumed control after the user payload returned and that the output sandbox was transferred. The job wrapper should then exit shortly afterwards. However, if the user payload started a background process that has not exited in the meantime, the batch system may, depending on its type and configuration, continue to consider the job running. Older versions of the job wrapper killed all payload processes that remained in the process group in which the payload was started; recent versions leave any such cleanup to the payload itself. In either case, the safest approach is for the payload to ensure that it leaves no stale processes behind, rather than counting on some other layer to clean them up intelligently.
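
As an illustration of a payload that cleans up after itself, here is a minimal sketch of a payload script that terminates its own background process before returning. It is not taken from the job wrapper or any gLite component. The helper_pid variable and the cleanup function are hypothetical names, and "sleep 600" merely stands in for whatever background process a real payload might start.

```shell
#!/bin/sh
# Sketch of a payload that leaves no background process behind.
# "sleep 600" is a stand-in for a helper started by a real payload.

helper_pid=""

cleanup() {
    # Terminate the helper if it is still alive, then reap it so that
    # no child of this script remains when the payload returns.
    if [ -n "$helper_pid" ] && kill -0 "$helper_pid" 2>/dev/null; then
        kill "$helper_pid" 2>/dev/null
        # wait reports the killed child's status; ignore it here.
        wait "$helper_pid" 2>/dev/null || true
    fi
}

# Run cleanup however the script ends: normal exit or a common signal.
trap cleanup EXIT
trap 'exit 1' HUP INT TERM

sleep 600 &
helper_pid=$!

echo "payload work done"

# Explicit cleanup on the normal path; the EXIT trap is a safety net.
cleanup
```

The sketch tracks a single known child. A payload that starts several helpers, or helpers that start children of their own, would need to record each of them (or signal their process group) to achieve the same effect.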


For other possible explanations, refer to "Jobs sent to some CE stay in Scheduled state forever".
