Tools/Manuals/TS124

From EGIWiki
Revision as of 10:32, 18 September 2011 by Aesch


How to block failing hosts in LSF

How can I prevent a bad worker node (WN) from becoming a black hole, i.e. accepting and failing a large number of jobs sent to my LSF batch system?

Solution

LSF provides a black hole detection scheme. A host can be configured so that it is automatically closed when jobs exit on it at a very high rate. Such high rates usually indicate a serious problem on the machine, such as a hardware component failure or a full file system.

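For example, to close a host when 20 jobs exit on it within 10 minutes, set an EXIT_RATE threshold of 20 in lsb.hosts. The time window is configured separately, typically through JOB_EXIT_RATE_DURATION in lsb.params. The lsb.hosts entry looks something like this: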

Begin Host
HOST_NAME       pool    tmp     r1m     r15m    ut      EXIT_RATE
default         300     100     2.5     2.0     0.9     20
End Host

The setting can be checked with bhosts -l host. The end of the listing shows the configured threshold and the current exit rate (Load):

THRESHOLD AND LOAD USED FOR EXCEPTIONS:
            JOB_EXIT_RATE
 Threshold   20.00
 Load         0.00
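
To watch the exit rate from a script, the two numbers can be extracted from the listing. The following is a minimal sketch, assuming the exception block is laid out exactly as in the listing above; the function name exit_rate_fields is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: reads `bhosts -l <host>` output on stdin and prints
# "<threshold> <load>" taken from the exceptions block.
# Exits non-zero if the block or either value is missing.
exit_rate_fields() {
    awk '
        # Only look at lines after the exceptions header.
        /THRESHOLD AND LOAD USED FOR EXCEPTIONS/ { in_block = 1; next }
        in_block && $1 == "Threshold" { threshold = $2 }
        in_block && $1 == "Load"      { load = $2 }
        END {
            if (threshold != "" && load != "") print threshold, load
            else exit 1
        }
    '
}

# Usage (on a machine with LSF installed):
#   bhosts -l host | exit_rate_fields
```

For the listing above, this prints "20.00 0.00".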

After 1 bad job:

THRESHOLD AND LOAD USED FOR EXCEPTIONS:
            JOB_EXIT_RATE
 Threshold   20.00
 Load         1.00

Further failed jobs raise the load until it reaches the threshold:

THRESHOLD AND LOAD USED FOR EXCEPTIONS:
            JOB_EXIT_RATE
 Threshold   20.00
 Load        20.00

After 10 minutes (the scan period), the host status is set to closed_Adm and bhosts -l host shows the following:

THRESHOLD AND LOAD USED FOR EXCEPTIONS:
            JOB_EXIT_RATE
 Threshold   20.00
 Load         0.00
 ADMIN ACTION COMMENT: "eadmin: JOB EXIT RATE THRESHOLD EXCEEDED"

An e-mail is also sent to the lsfadmin user.
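
A host closed this way stays closed until an administrator reopens it. The following is a minimal sketch of that step, assuming the underlying problem has been fixed and that badmin hopen with its -C comment option is available in your LSF version; the function reopen_host and the host name wn042 are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reopen a host that was closed by the exit rate
# exception, recording a comment that explains why.
# Run as an LSF administrator.
reopen_host() {
    host=$1
    # Default comment is an assumption; pass your own as the second argument.
    reason=${2:-"problem fixed, reopening"}
    badmin hopen -C "$reason" "$host"
}

# Usage (wn042 is a placeholder host name):
#   reopen_host wn042 "scratch file system cleaned"
```

Afterwards, bhosts -l host can be used to confirm that the host is no longer closed.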