Difference between revisions of "Tools/Manuals/TS124"
imported>Krakow |
Revision as of 13:39, 23 November 2012
How to block failing hosts in LSF
How can I prevent a bad worker node (WN) from becoming a black hole, i.e. swallowing and failing a large number of jobs sent to my LSF batch system?
Solution
LSF provides a black hole detection scheme. Hosts can be configured so that a WN is automatically disabled when jobs exit on it at a very high rate. Such high rates are usually due to a serious problem on the machine, such as a hardware component failure or a full file system.
For example, to act on 20 jobs exiting within 10 minutes, add an EXIT_RATE column to the Host section of lsb.hosts, something like this:
 Begin Host
 HOST_NAME  pool  tmp  r1m  r15m  ut   EXIT_RATE
 default    300   100  2.5  2.0   0.9  20
 End Host
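Before reconfiguring the batch system, it can help to confirm that the EXIT_RATE column really lines up with the intended value. The following is a minimal Python sketch, not an LSF tool: the function name parse_host_section is hypothetical, and it assumes a Host section with whitespace-separated columns and no values containing spaces.

```python
def parse_host_section(text):
    """Return {host_name: {column: value}} for the first Host section in text."""
    rows = {}
    columns = None
    inside = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        tokens = line.split()
        lowered = [t.lower() for t in tokens]
        if lowered == ["begin", "host"]:
            inside = True
            continue
        if lowered == ["end", "host"]:
            break
        if not inside:
            continue
        if columns is None:
            columns = tokens  # first line inside the section is the header row
            continue
        # Pair every value with its column name, skipping the HOST_NAME column itself.
        rows[tokens[0]] = dict(zip(columns[1:], tokens[1:]))
    return rows


# Sample text copied from the lsb.hosts fragment in this FAQ entry.
SAMPLE = """\
Begin Host
HOST_NAME  pool  tmp  r1m  r15m  ut   EXIT_RATE
default    300   100  2.5  2.0   0.9  20
End Host
"""

hosts = parse_host_section(SAMPLE)
print(hosts["default"]["EXIT_RATE"])
```

A check like this only validates the layout of the file; LSF itself remains the authority on how the thresholds are interpreted.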
The setting can be checked with bhosts -l host. The current thresholds are shown at the end of the listing:
 THRESHOLD AND LOAD USED FOR EXCEPTIONS:
            JOB_EXIT_RATE
 Threshold    20.00
 Load          0.00
After 1 bad job:
 THRESHOLD AND LOAD USED FOR EXCEPTIONS:
            JOB_EXIT_RATE
 Threshold    20.00
 Load          1.00
Further failed jobs would then lead to:
 THRESHOLD AND LOAD USED FOR EXCEPTIONS:
            JOB_EXIT_RATE
 Threshold    20.00
 Load         20.00
After 10 minutes (the scan period) the status is set to Closed_Adm and the bhosts -l host command gives the following:
 THRESHOLD AND LOAD USED FOR EXCEPTIONS:
            JOB_EXIT_RATE
 Threshold    20.00
 Load          0.00
 ADMIN ACTION COMMENT: "eadmin: JOB EXIT RATE THRESHOLD EXCEEDED"
An e-mail is also sent to the lsfadmin user.
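For sites that want to watch these values from a monitoring script, the exception block can be extracted from the bhosts -l text. The sketch below is illustrative only: parse_exit_rate is a hypothetical helper, and it assumes that JOB_EXIT_RATE is the only exception column, so the first number after "Threshold" and after "Load" is the one of interest.

```python
import re

MARKER = "THRESHOLD AND LOAD USED FOR EXCEPTIONS"
NUMBER = r"(\d+(?:\.\d+)?)"


def parse_exit_rate(bhosts_output):
    """Return (threshold, load, admin_comment) from `bhosts -l` text, or None."""
    # Only look at the text that follows the exceptions header.
    _, found, tail = bhosts_output.partition(MARKER)
    if not found:
        return None
    threshold = re.search(r"Threshold\s+" + NUMBER, tail)
    load = re.search(r"Load\s+" + NUMBER, tail)
    if not (threshold and load):
        return None
    # The admin comment is present only after the host has been closed.
    comment = re.search(r'ADMIN ACTION COMMENT:\s*"([^"]*)"', tail)
    return (
        float(threshold.group(1)),
        float(load.group(1)),
        comment.group(1) if comment else None,
    )


# Sample text modelled on the listing shown in this FAQ entry.
SAMPLE = """\
THRESHOLD AND LOAD USED FOR EXCEPTIONS:
           JOB_EXIT_RATE
Threshold    20.00
Load          0.00
ADMIN ACTION COMMENT: "eadmin: JOB EXIT RATE THRESHOLD EXCEEDED"
"""

print(parse_exit_rate(SAMPLE))
```

In practice the input text would come from running bhosts -l for the host in question. A non-empty admin comment mentioning eadmin suggests the host was closed by the exit-rate exception rather than by hand.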