Tools/Manuals/TS57
Back to Troubleshooting Guide
Jobs sent to some CE stay in Scheduled state forever
Full message
$ glite-wms-job-status https://lb104.cern.ch:9000/2MgVf3kHkyThplWY67rp6w

*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://lb104.cern.ch:9000/2MgVf3kHkyThplWY67rp6w
Current Status:     Scheduled
Status Reason:      Job successfully submitted to Globus
Destination:        some-CE.some-domain:2119/jobmanager-lcgpbs-ops
Submitted:          Fri Oct 17 20:31:16 2008 CEST
*************************************************************
Diagnosis
When jobs sent to a particular CE stay in Scheduled or Running state for a long time, this can have various causes:
- The CE is busy with jobs for other users. This is the normal case for Scheduled jobs.
- The RB/WMS has a backlog of events to be recorded in the LB server, e.g. because of many other jobs or because the LB server is slow.
- Mapping problem with the "grid_monitor" on the lcg-CE. See Globus error 158: the job manager could not lock the state lock file.
- Other problem with the "grid_monitor" on the lcg-CE. See below.
The RB/WMS launches one "grid_monitor" process per user (DN) on each lcg-CE that has unfinished jobs for that user. As far as the RB/WMS is concerned, the progress of those jobs depends entirely on what the "grid_monitor" reports to the user's gahp_server process running on the RB/WMS; information sent to the LB server by the job wrapper itself is mostly ignored, for good technical reasons. A Running event will get the job into the Running state, but a Done event needs to be logged by the WMS LogMonitor daemon, which concludes this state from examining the logfiles of Condor-G, which gets the state from the "grid_monitor", which in turn gets the state from querying the batch system.

This means that if the "grid_monitor" fails to report updates, the job remains in the Scheduled or Running state. The updates are reported through globus-url-copy, which may fail for various reasons - see Cannot read JobWrapper output... for suggestions.
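To see which component actually logged which event, and hence where in this chain the job got stuck, the raw LB events for the job can be dumped with glite-wms-job-logging-info (job ID taken from the example above):

```shell
# Dump the raw LB events for the job; verbosity 2 also shows which
# component logged each event, revealing whether the WMS side (Condor-G,
# LogMonitor) ever recorded anything beyond the initial submission:
glite-wms-job-logging-info -v 2 \
    https://lb104.cern.ch:9000/2MgVf3kHkyThplWY67rp6w
```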
It is also possible that the "grid_monitor" itself cannot obtain job status updates:
- The batch system query command (qstat, bjobs, condor_q, ...) may fail or hang for some users, e.g. due to a permission or other configuration problem, or because the batch system is in a bad state.
- The batch system query command may fail for all users because of an incomplete environment in a non-login shell. See below.
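A quick way to check the second case is to run the batch system query command with a scrubbed environment, mimicking what the "grid_monitor" sees in a non-login shell. The toy variable below just demonstrates the effect; at a real site one would substitute the actual query command (e.g. qstat) as shown in the trailing comment:

```shell
# A variable exported in the parent environment survives a plain
# non-login shell, but "env -i" scrubs it -- mimicking the sparse
# environment in which the grid_monitor runs its batch queries:
FOO=bar sh -c 'echo "inherited: ${FOO:-unset}"'   # prints "inherited: bar"
env -i  sh -c 'echo "scrubbed: ${FOO:-unset}"'    # prints "scrubbed: unset"

# Real diagnosis on a CE (assuming a PBS-style batch system):
#   env -i sh -c 'qstat'    # fails if qstat relies on profile settings
```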
The "grid_monitor" does not get a login shell, so it will not pick up environment settings from /etc/profile or /etc/profile.d. The CE admin has to ensure the batch system client commands also work in non-interactive shells. An example showing various options for the "lcglsf" job manager:
- Put the LSF configuration file in its default place.
- Replace bjobs etc. with wrappers that set up the environment first. One can put those wrappers e.g. in /usr/local/bin and ensure those get used by the "lcglsf" job manager (see the next option).
- Edit /opt/globus/lib/perl/Globus/GRAM/JobManager/lcglsf.pm and /opt/globus/setup/globus/lcglsf.in to have definitions like these:

      my $LSF = ". /etc/profile.d/lsf.sh; " .
                "/usr/local/lsf/6.2/linux2.6-glibc2.3-x86/bin";
      $bsub  = "$LSF/bsub";
      $bjobs = "$LSF/bjobs";
      $bkill = "$LSF/bkill";
      $bacct = "$LSF/bacct";
      $bmod  = "$LSF/bmod";
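The wrapper option above can be sketched as follows. At a site the wrapper would live in /usr/local/bin/bjobs and delegate to the real LSF binary; the demo below builds a toy version in a temporary directory so the mechanism can be run and verified anywhere (all paths in it are stand-ins):

```shell
# Sketch of a bjobs wrapper that sources the environment first (option 2).
tmp=$(mktemp -d)

# Stand-in for the real binary (at a site: /usr/local/lsf/.../bin/bjobs):
cat > "$tmp/bjobs.real" <<'EOF'
#!/bin/sh
echo "LSF_ENVDIR=${LSF_ENVDIR:-unset}"
EOF
chmod +x "$tmp/bjobs.real"

# Stand-in for /etc/profile.d/lsf.sh:
echo 'LSF_ENVDIR=/etc/lsf; export LSF_ENVDIR' > "$tmp/lsf.sh"

# The wrapper itself (at a site: /usr/local/bin/bjobs): source the
# environment, then hand over to the real command with all arguments.
cat > "$tmp/bjobs" <<EOF
#!/bin/sh
. "$tmp/lsf.sh"
exec "$tmp/bjobs.real" "\$@"
EOF
chmod +x "$tmp/bjobs"

# Even with a fully scrubbed environment the wrapper works:
env -i "$tmp/bjobs"    # prints "LSF_ENVDIR=/etc/lsf"
```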