Tools/Manuals/TS124

Revision as of 12:39, 23 November 2012 by imported>Krakow

How to block failing hosts in LSF

How can I prevent a faulty worker node (WN) from becoming a black hole, i.e. accepting and failing a large number of the jobs sent to my LSF batch system?

Solution

LSF provides a black hole detection scheme: a host can be configured so that it is closed automatically when jobs exit on it at a very high rate. Such rates usually point to a serious problem on the machine, such as a failed hardware component or a full file system.
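
The idea behind the scheme can be illustrated with a small sliding-window counter. This is only a conceptual sketch in Python, not LSF's implementation: the class ExitRateMonitor, its parameters, and the choice to trip when the count reaches (rather than exceeds) the threshold are assumptions made for illustration.

```python
from collections import deque


class ExitRateMonitor:
    """Toy model of per-host exit-rate detection (hypothetical, not LSF code)."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold            # e.g. 20 exited jobs
        self.window_seconds = window_seconds  # e.g. 600 s = 10 minutes
        self._exits = deque()                 # timestamps of recent job exits
        self.closed = False

    def record_exit(self, timestamp):
        """Register one failed job; return True once the host is closed."""
        self._exits.append(timestamp)
        # Forget exits that are older than the window.
        while self._exits and timestamp - self._exits[0] > self.window_seconds:
            self._exits.popleft()
        # Assumption: reaching the threshold is enough to close the host.
        if len(self._exits) >= self.threshold:
            self.closed = True
        return self.closed


# A host with one failure per minute never collects 20 exits in 10 minutes.
monitor = ExitRateMonitor(threshold=20, window_seconds=600)
healthy = [monitor.record_exit(60.0 * i) for i in range(30)]

# A black hole failing one job per second is closed at the 20th exit.
black_hole = ExitRateMonitor(threshold=20, window_seconds=600)
failing = [black_hole.record_exit(float(i)) for i in range(20)]
```

In this model an occasional failure never closes the host, while a burst of failures does, which is the behaviour the threshold is meant to achieve.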

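For example, to close a host once 20 jobs have exited within the measurement window (10 minutes here; the window length is configured separately, via JOB_EXIT_RATE_DURATION in lsb.params), add something like this to lsb.hosts: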

Begin Host
HOST_NAME       pool    tmp     r1m     r15m    ut      EXIT_RATE
default         300     100     2.5     2.0     0.9     20
End Host

The setting can be verified with bhosts -l host. The end of the listing shows the configured threshold and the current load:

THRESHOLD AND LOAD USED FOR EXCEPTIONS:
            JOB_EXIT_RATE
 Threshold   20.00
 Load         0.00

After one failed job on the host:

THRESHOLD AND LOAD USED FOR EXCEPTIONS:
            JOB_EXIT_RATE
 Threshold   20.00
 Load         1.00

Further failed jobs raise the load until it reaches the threshold:

THRESHOLD AND LOAD USED FOR EXCEPTIONS:
            JOB_EXIT_RATE
 Threshold   20.00
 Load        20.00

After 10 minutes (the scan period), the host status is set to Closed_Adm and bhosts -l host shows the following:

THRESHOLD AND LOAD USED FOR EXCEPTIONS:
            JOB_EXIT_RATE
 Threshold   20.00
 Load         0.00
 ADMIN ACTION COMMENT: "eadmin: JOB EXIT RATE THRESHOLD EXCEEDED"

An e-mail is also sent to the lsfadmin user.
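
The listings above can also be checked mechanically. The following Python sketch extracts the threshold, the load, and the admin comment from the text printed by bhosts -l. The function name parse_exit_rate is hypothetical, and the parser assumes exactly the layout shown on this page (a single JOB_EXIT_RATE column); other LSF versions may print the section differently.

```python
import re


def parse_exit_rate(bhosts_output):
    """Extract JOB_EXIT_RATE threshold, load and admin comment, if present."""
    result = {"threshold": None, "load": None, "comment": None}
    in_section = False
    for line in bhosts_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("THRESHOLD AND LOAD USED FOR EXCEPTIONS"):
            in_section = True
        elif in_section and stripped.startswith("Threshold"):
            result["threshold"] = float(stripped.split()[1])
        elif in_section and stripped.startswith("Load"):
            result["load"] = float(stripped.split()[1])
        elif stripped.startswith("ADMIN ACTION COMMENT:"):
            # The comment text is printed between double quotes.
            match = re.search(r'"(.*)"', stripped)
            result["comment"] = match.group(1) if match else None
    return result


# Sample text copied from the last listing on this page.
sample = '''THRESHOLD AND LOAD USED FOR EXCEPTIONS:
            JOB_EXIT_RATE
 Threshold   20.00
 Load         0.00
 ADMIN ACTION COMMENT: "eadmin: JOB EXIT RATE THRESHOLD EXCEEDED"
'''
info = parse_exit_rate(sample)
print(info)
```

On a real system the input text could be obtained with subprocess.run(["bhosts", "-l", host], capture_output=True, text=True).stdout, where host is the name of the worker node to inspect.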