Tools/Manuals/TS57

From EGIWiki


Jobs sent to some CE stay in Scheduled state forever

Full message

$ glite-wms-job-status https://lb104.cern.ch:9000/2MgVf3kHkyThplWY67rp6w


*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://lb104.cern.ch:9000/2MgVf3kHkyThplWY67rp6w
Current Status:     Scheduled
Status Reason:      Job successfully submitted to Globus
Destination:        some-CE.some-domain:2119/jobmanager-lcgpbs-ops
Submitted:          Fri Oct 17 20:31:16 2008 CEST
*************************************************************

Diagnosis

When jobs sent to a particular CE stay in Scheduled or Running state for a long time, this can have various causes:

The RB/WMS launches a "grid_monitor" process per user (DN) on each lcg-CE that has unfinished jobs for that user. As far as the RB/WMS is concerned, the progress of those jobs depends entirely on what the "grid_monitor" reports to the user's gahp_server process running on the RB/WMS. Information sent to the LB server by the job wrapper itself is mostly ignored for good technical reasons. A Running event will get the job into the Running state, but a Done event has to be logged by the WMS LogMonitor daemon. The LogMonitor infers that state from the Condor-G log files; Condor-G gets the state from the "grid_monitor", which in turn gets it by querying the batch system. Consequently, if the "grid_monitor" fails to report updates, the job remains in the Scheduled or Running state. The updates are reported through globus-url-copy, which may fail for various reasons; see Cannot read JobWrapper output... for suggestions.
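As a first check on the CE, it can help to see whether any "grid_monitor" process is running at all for the local account the user's DN is mapped to. The following is a minimal sketch, not an official tool: the function name count_grid_monitors is hypothetical, and it assumes only that the string "grid_monitor" appears in the command line of the process, as the description above suggests.

```shell
#!/bin/sh
# Count, per local user, the processes whose command line mentions grid_monitor.
# Input: lines of the form "USER COMMAND ARGS..." on stdin.
# The pattern is written as grid_[m]onitor so that this awk process does not
# match itself when the input comes straight from ps.
count_grid_monitors() {
    awk '/grid_[m]onitor/ { n[$1]++ } END { for (u in n) print u, n[u] }' | sort
}

# Usage on the CE (prints one "user count" line per account):
#   ps -eo user=,args= | count_grid_monitors
```

An account with stuck jobs and no line in the output would be a hint that the monitor is not running for that user; a line being present does not by itself prove that updates are reaching the RB/WMS.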

It is also possible that the "grid_monitor" cannot obtain job status updates itself:

The "grid_monitor" does not get a login shell, so it will not pick up environment settings from /etc/profile or /etc/profile.d. The CE admin has to ensure the batch system client commands also work in non-interactive shells. An example showing various options for the "lcglsf" job manager:

Edit /opt/globus/setup/globus/lcglsf.in to have definitions like these:

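   # Source the LSF environment explicitly (the profile scripts are not read
   # in a non-login shell) and call each client command by its absolute path.
   my $LSF = ". /etc/profile.d/lsf.sh; "
           . "/usr/local/lsf/6.2/linux2.6-glibc2.3-x86/bin";
   $bsub   = "$LSF/bsub";
   $bjobs  = "$LSF/bjobs";
   $bkill  = "$LSF/bkill";
   $bacct  = "$LSF/bacct";
   $bmod   = "$LSF/bmod";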
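To verify that the batch system client commands can be found without the login environment, one option is to look them up from a shell started with an empty environment. This is a rough sketch under stated assumptions: the helper name found_in_clean_shell is hypothetical, and an empty environment only approximates (and is probably stricter than) what the "grid_monitor" actually receives.

```shell
#!/bin/sh
# Succeeds if the given command name or absolute path resolves in a
# non-login, non-interactive /bin/sh started with an empty environment,
# so nothing from /etc/profile or /etc/profile.d is inherited.
found_in_clean_shell() {
    env -i /bin/sh -c 'command -v "$1" >/dev/null 2>&1' sh "$1"
}

# Check every command given on the command line.
for cmd in "$@"; do
    if found_in_clean_shell "$cmd"; then
        echo "OK      $cmd"
    else
        echo "MISSING $cmd"
    fi
done
```

Saved as a script, it could be run with the command names or absolute paths used by the job manager, for example bjobs and bsub. A MISSING line for a bare command name suggests the job manager needs either an absolute path or an explicit sourcing of the profile script, as in the definitions above.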