
Tools/Manuals/TS57




Back to Troubleshooting Guide


Jobs sent to some CE stay in Scheduled state forever

Full message

$ glite-wms-job-status https://lb104.cern.ch:9000/2MgVf3kHkyThplWY67rp6w


*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://lb104.cern.ch:9000/2MgVf3kHkyThplWY67rp6w
Current Status:     Scheduled
Status Reason:      Job successfully submitted to Globus
Destination:        some-CE.some-domain:2119/jobmanager-lcgpbs-ops
Submitted:          Fri Oct 17 20:31:16 2008 CEST
*************************************************************
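The complete event history behind this status summary can be retrieved from the LB server; it also shows when the last update for the job arrived:

 $ glite-wms-job-logging-info -v 2 https://lb104.cern.ch:9000/2MgVf3kHkyThplWY67rp6w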

Diagnosis

When jobs sent to a particular CE stay in the Scheduled or Running state for a long time, this can have various causes (a few quick checks are sketched after this list):

  • The CE is busy with jobs from other users. This is the normal case for Scheduled jobs.
  • The RB/WMS has a backlog of events to be recorded in the LB server, e.g. because of many other jobs or because the LB server is slow.
  • A mapping problem with the "grid_monitor" on the lcg-CE. See Globus error 158: the job manager could not lock the state lock file.
  • Another problem with the "grid_monitor" on the lcg-CE. See below.
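A few quick checks on the lcg-CE can help narrow these causes down. A sketch for a PBS-based CE like the one in the example above (adapt the commands to your batch system):

   # Is the queue simply busy with other jobs (first cause)?
   qstat -q
   # Is a "grid_monitor" running for the mapped user (last two causes)?
   ps -ef | grep grid_monitor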

The RB/WMS launches one "grid_monitor" process per user (DN) on each lcg-CE that has unfinished jobs for that user. As far as the RB/WMS is concerned, the progress of those jobs depends entirely on what the "grid_monitor" reports to the user's gahp_server process running on the RB/WMS; information sent to the LB server by the job wrapper itself is mostly ignored, for good technical reasons. A Running event will get the job into the Running state, but a Done event has to be logged by the WMS LogMonitor daemon, which deduces that state from the logfiles of Condor-G; Condor-G in turn gets the state from the "grid_monitor", which obtains it by querying the batch system. Consequently, if the "grid_monitor" fails to report updates, the job remains in the Scheduled or Running state. The updates are reported through globus-url-copy, which may fail for various reasons - see Cannot read JobWrapper output... for suggestions.
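Whether those updates are still arriving can be checked on the RB/WMS itself (a sketch, assuming shell access to that host; jobs there are managed by Condor-G):

   # Show the Globus-level status Condor-G currently holds for its jobs;
   # PENDING roughly corresponds to Scheduled, ACTIVE to Running
   condor_q -globus

If the batch system on the CE says the job is running while condor_q -globus still shows it as PENDING, the "grid_monitor" updates are not getting through.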

It is also possible that the "grid_monitor" itself cannot obtain job status updates:

  • The batch system query command (qstat, bjobs, condor_q, ...) may fail or hang for some users, e.g. due to a permission or other configuration problem, or because the batch system is in a bad state.
  • The batch system query command may fail for all users because of an incomplete environment in a non-login shell. See below.
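The second failure mode can be reproduced by running the query command with an empty environment, which mimics the non-interactive shell the "grid_monitor" runs in (a sketch for LSF, using the installation path from the example below; substitute the full path of your batch system's query command):

   # env -i clears the environment; /bin/sh -c gives a non-login shell
   env -i /bin/sh -c '/usr/local/lsf/6.2/linux2.6-glibc2.3-x86/bin/bjobs'
   # If this fails while bjobs works in a normal login shell, the client
   # depends on settings made in /etc/profile or /etc/profile.d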

The "grid_monitor" does not get a login shell, so it will not pick up environment settings from /etc/profile or /etc/profile.d. The CE admin has to ensure the batch system client commands also work in non-interactive shells. An example showing various options for the "lcglsf" job manager:

  • Put the LSF configuration file in its default place.
  • Replace bjobs etc. with wrappers that set up the environment first (a sketch is given at the end of this section). One can put those wrappers e.g. in /usr/local/bin and ensure those get used by the "lcglsf" job manager (see the next option).
  • Edit /opt/globus/lib/perl/Globus/GRAM/JobManager/lcglsf.pm and /opt/globus/setup/globus/lcglsf.in to have definitions like these:

   my $LSF = ". /etc/profile.d/lsf.sh; "
           . "/usr/local/lsf/6.2/linux2.6-glibc2.3-x86/bin";
   $bsub   = "$LSF/bsub";
   $bjobs  = "$LSF/bjobs";
   $bkill  = "$LSF/bkill";
   $bacct  = "$LSF/bacct";
   $bmod   = "$LSF/bmod";
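For the wrapper option above, a minimal sketch of such a wrapper, e.g. installed as /usr/local/bin/bjobs (the LSF paths are taken from the example above; adjust them to your installation):

   #!/bin/sh
   # Wrapper for bjobs: set up the LSF environment, then run the real
   # client by full path, to avoid invoking this wrapper recursively
   . /etc/profile.d/lsf.sh
   exec /usr/local/lsf/6.2/linux2.6-glibc2.3-x86/bin/bjobs "$@"

Analogous wrappers would be needed for bsub, bkill, bacct and bmod.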