Tools/Manuals/TS58
Jobs sent to some CE stay in Running state forever
Full message
$ glite-wms-job-logging-info -v 2 https://gswms01.cern.ch:9000/6MuGaiwPVcxEo7w6YlGHng

**********************************************************************
[...]
---
Event: Running
- Arrived = Sun Jan 17 11:39:21 2010 CET
- Host = lxbsq1344.cern.ch
- Node = lxbsq1344.cern.ch
- Source = LRMS
- Timestamp = Sun Jan 17 11:38:22 2010 CET
- User = /DC=ch/DC=cern/.....
---
Event: ReallyRunning
- Arrived = Sun Jan 17 11:39:23 2010 CET
- Host = lxbsq1344.cern.ch
- Source = LRMS
- Timestamp = Sun Jan 17 11:39:23 2010 CET
- User = /DC=ch/DC=cern/.....
---
Event: ResourceUsage
- Arrived = Sun Jan 17 11:39:24 2010 CET
- Host = lxbsq1344.cern.ch
- Source = LRMS
- Timestamp = Sun Jan 17 11:39:24 2010 CET
- User = /DC=ch/DC=cern/.....
---
Event: Done
- Arrived = Sun Jan 17 11:39:26 2010 CET
- Exit code = 0
- Host = lxbsq1344.cern.ch
- Source = LRMS
- Status code = OK
- Timestamp = Sun Jan 17 11:39:26 2010 CET
- User = /DC=ch/DC=cern/.....
**********************************************************************

$ glite-wms-job-status https://gswms01.cern.ch:9000/6MuGaiwPVcxEo7w6YlGHng

*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://gswms01.cern.ch:9000/6MuGaiwPVcxEo7w6YlGHng
Current Status: Running
Status Reason: Job successfully submitted to Globus
Destination: ce128.cern.ch:2119/jobmanager-lcglsf-grid_ops
Submitted: Sun Jan 17 11:34:20 2010 CET
*************************************************************
Diagnosis
When a "Done" event is logged with "LRMS" as its source, it means the RB/WMS job wrapper resumed control after the user payload returned and the output sandbox was transferred. The job wrapper should then exit shortly afterwards. However, if the user payload started a background process that has not exited in the meantime, the batch system may, depending on its type or configuration, continue to consider the job running. Older versions of the job wrapper killed all payload processes that remained in the original process group in which the payload was started, but recent versions leave any such cleanup to the payload itself. In any case, the safest approach is for the payload to ensure that it leaves no stale processes behind, because it should not count on some other layer to clean up such processes intelligently.
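As an illustration of a payload cleaning up after itself, the following is a minimal sketch of a payload script that starts one background helper and stops it before returning. It is not taken from the job wrapper or from any gLite component: the names `stop_helper`, `helper_pid` and `started_pid` are hypothetical, and `sleep 600` merely stands in for whatever background process a real payload might start.

```shell
#!/bin/sh
# Hypothetical payload script: start a background helper, do the main
# work, and make sure the helper is gone before the script returns.

helper_pid=""

stop_helper() {
    # Terminate the helper if it is still alive, then reap it so that
    # no process is left behind in the job's process group.
    if [ -n "$helper_pid" ] && kill -0 "$helper_pid" 2>/dev/null; then
        kill "$helper_pid" 2>/dev/null || true
        wait "$helper_pid" 2>/dev/null || true
    fi
    helper_pid=""
}

# Run the cleanup on normal exit, and also when the script is signalled:
# "exit" inside a signal trap triggers the EXIT trap.
trap stop_helper EXIT
trap 'exit 130' INT
trap 'exit 143' TERM

# Stand-in for a background process started by the payload (assumption).
sleep 600 &
helper_pid=$!
started_pid=$helper_pid   # kept only so the result can be checked below

# ... the real work of the payload would go here ...
echo "payload: main work finished"

# Explicit cleanup before returning control to the job wrapper.
stop_helper

# Optional sanity check: report if the helper somehow survived.
if kill -0 "$started_pid" 2>/dev/null; then
    echo "payload: helper $started_pid is still running" >&2
fi
```

This sketch tracks a single PID. A payload that starts several background processes, or whose helpers fork further children, would need to record each of them or stop them by process group instead.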
For other possible explanations, refer to "Jobs sent to some CE stay in Scheduled state forever".