Jobs sent to some CE stay in Scheduled state forever

Full message

$ glite-wms-job-status


Status info for the Job :
Current Status:     Scheduled
Status Reason:      Job successfully submitted to Globus
Destination:        some-CE.some-domain:2119/jobmanager-lcgpbs-ops
Submitted:          Fri Oct 17 20:31:16 2008 CEST


When jobs sent to a particular CE stay in Scheduled or Running state for a long time, this can have various causes:

  • The CE is busy with jobs for other users. This is the normal case for Scheduled jobs.
  • The RB/WMS has a backlog of events to be recorded in the LB server, e.g. because of many other jobs or because the LB server is slow.
  • Mapping problem with the "grid_monitor" on the lcg-CE. See Globus error 158: the job manager could not lock the state lock file.
  • Other problem with the "grid_monitor" on the lcg-CE. See below.
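
When many jobs are involved, it can help to reduce the status output to one line per job, so that jobs stuck on the same CE stand out. The following is a minimal sketch, assuming the "Current Status:" and "Destination:" field labels of the sample output above; the function name summarize_status is made up for this illustration.

```shell
#!/bin/sh
# Reduce glite-wms-job-status output (read from stdin) to lines of the form
# "status<TAB>destination". Only the two field labels shown in the sample
# output are recognised; any other line is ignored.
summarize_status() {
    awk '
        /^[ \t]*Current Status:/ {
            sub(/^[ \t]*Current Status:[ \t]*/, "")
            sub(/[ \t]+$/, "")
            status = $0
        }
        /^[ \t]*Destination:/ {
            sub(/^[ \t]*Destination:[ \t]*/, "")
            sub(/[ \t]+$/, "")
            print status "\t" $0
        }
    '
}
```

It could be used as `glite-wms-job-status <jobId> | summarize_status`, with the result piped through `sort | uniq -c` to count jobs per status and CE.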

The RB/WMS launches one "grid_monitor" process per user (DN) on each lcg-CE that has unfinished jobs for that user. As far as the RB/WMS is concerned, the progress of those jobs depends entirely on what the "grid_monitor" reports to the user's gahp_server process running on the RB/WMS. Information sent to the LB server by the job wrapper itself is mostly ignored for good technical reasons. A Running event will move the job into the Running state, but a Done event has to be logged by the WMS LogMonitor daemon. The LogMonitor derives that state from the Condor-G log files, Condor-G gets it from the "grid_monitor", and the "grid_monitor" gets it by querying the batch system. Consequently, if the "grid_monitor" fails to report updates, the job remains in the Scheduled or Running state. The updates are sent with globus-url-copy, which can fail for various reasons; see Cannot read JobWrapper output... for suggestions.

It is also possible that the "grid_monitor" itself cannot obtain job status updates:

  • The batch system query command (qstat, bjobs, condor_q, ...) may fail or hang for some users, e.g. due to a permission or other configuration problem, or because the batch system is in a bad state.
  • The batch system query command may fail for all users because of an incomplete environment in a non-login shell. See below.
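
A quick first check for the non-login shell case is to look up the query command from a shell that starts with an almost empty environment. The sketch below is only an approximation of what the "grid_monitor" sees: the function name check_noninteractive and the minimal PATH are assumptions for illustration, and it only tests whether the command can be found, not whether the query itself succeeds.

```shell
#!/bin/sh
# Report whether a command can be found by a non-login /bin/sh that starts
# with an almost empty environment (nothing from /etc/profile is loaded).
# The minimal PATH is an assumption; adjust it to the CE's real default.
check_noninteractive() {
    cmd=$1
    if env -i PATH=/usr/bin:/bin /bin/sh -c 'command -v "$1" >/dev/null 2>&1' sh "$cmd"; then
        echo "OK: $cmd found"
    else
        echo "MISSING: $cmd not found without the login environment"
        return 1
    fi
}
```

For example, `check_noninteractive bjobs` or `check_noninteractive qstat` could be run on the CE. If the command is found but still fails, running the actual query in the same way under an affected account is the next step.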

The "grid_monitor" does not get a login shell, so it will not pick up environment settings from /etc/profile or /etc/profile.d. The CE admin has to ensure the batch system client commands also work in non-interactive shells. An example showing various options for the "lcglsf" job manager:

  • Put the LSF configuration file in its default place.
  • Replace bjobs etc. with wrappers that set up the environment first. One can put those wrappers e.g. in /usr/local/bin and ensure those get used by the "lcglsf" job manager (see the next option).
  • Edit /opt/globus/lib/perl/Globus/GRAM/JobManager/ and /opt/globus/setup/globus/ to have definitions like these:

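   # Source the LSF environment first, then call each client by its full path.
   # Each command string expands to ". <profile script>; <LSF bin dir>/<command>".
   my $LSF = ". /etc/profile.d/; "
           . "/usr/local/lsf/6.2/linux2.6-glibc2.3-x86/bin";
   $bsub   = "$LSF/bsub";
   $bjobs  = "$LSF/bjobs";
   $bkill  = "$LSF/bkill";
   $bacct  = "$LSF/bacct";
   $bmod   = "$LSF/bmod";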
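
The wrapper option mentioned in the list above can be sketched in a similar way. The helper below writes a small wrapper that sources an environment file and then runs the real client command. The function name make_wrapper and all three arguments are hypothetical and site-specific; none of the paths are taken from an actual installation.

```shell
#!/bin/sh
# Write a wrapper script that sets up the batch system environment and then
# runs the real client command with the original arguments.
make_wrapper() {
    env_file=$1     # environment script to source (site-specific, hypothetical)
    real_cmd=$2     # absolute path of the real bjobs, bsub, ...
    wrapper=$3      # where to write the wrapper, e.g. under /usr/local/bin
    cat > "$wrapper" <<EOF
#!/bin/sh
# Generated wrapper: load the environment, then hand over to the real command.
. "$env_file"
exec "$real_cmd" "\$@"
EOF
    chmod 755 "$wrapper"
}
```

A call such as `make_wrapper <environment script> <real bjobs path> /usr/local/bin/bjobs` would then be repeated for each client command, and the job manager has to be configured to use the wrappers rather than the original binaries.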