Tools/Manuals/TS124
Latest revision as of 12:39, 23 November 2012
How to block failing hosts in LSF
How can I prevent a bad WN from becoming a black hole, i.e. swallowing and failing a large number of jobs sent to my LSF batch system?
LSF provides a black hole detection scheme. This allows the machines to be configured so that a WN is automatically disabled when jobs exit at a very high rate on that host. Such high rates are usually due to a serious problem on the machine, such as a hardware component failure or a full file system.
For example, to check for 20 jobs exiting in 10 minutes, add something like this to lsb.hosts:
    Begin Host
    HOST_NAME    pool   tmp   r1m   r15m   ut    EXIT_RATE
    default      300    100   2.5   2.0    0.9   20
    End Host
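To make the idea behind the threshold concrete, here is a minimal Python sketch of a sliding-window exit counter. It is a toy model only, not LSF's internal algorithm: the class name ExitRateMonitor and its parameters are hypothetical, the window of 600 seconds mirrors the 10-minute period mentioned above, and closing the host as soon as the count reaches the threshold is an assumption.

```python
from collections import deque


class ExitRateMonitor:
    """Toy model of exit-rate detection (hypothetical, not LSF code)."""

    def __init__(self, threshold=20, window_seconds=600):
        self.threshold = threshold            # failed jobs tolerated per window
        self.window_seconds = window_seconds  # length of the sliding window
        self.exits = deque()                  # timestamps of recent job exits

    def record_exit(self, timestamp):
        """Record one failed job; return True if the host should be closed."""
        self.exits.append(timestamp)
        # Forget exits that have fallen out of the sliding window.
        while self.exits and self.exits[0] <= timestamp - self.window_seconds:
            self.exits.popleft()
        # Assumption: reaching the threshold is enough to trigger the action.
        return len(self.exits) >= self.threshold


monitor = ExitRateMonitor()
# One failure every 10 seconds: only the 20th failure reaches the threshold.
results = [monitor.record_exit(t * 10) for t in range(20)]
print(results[-2], results[-1])  # False True
```

A host that fails one job per minute never trips this model, because at most 10 exits fit into the 10-minute window; a black hole that fails jobs every few seconds trips it almost immediately.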
The setting can be checked with bhosts -l host, where host is the name of the WN. At the end of the listing, the current threshold and load are shown:
    THRESHOLD AND LOAD USED FOR EXCEPTIONS:
               JOB_EXIT_RATE
    Threshold    20.00
    Load          0.00
After one bad job:
    THRESHOLD AND LOAD USED FOR EXCEPTIONS:
               JOB_EXIT_RATE
    Threshold    20.00
    Load          1.00
Further bad jobs raise the load until it reaches the threshold:
    THRESHOLD AND LOAD USED FOR EXCEPTIONS:
               JOB_EXIT_RATE
    Threshold    20.00
    Load         20.00
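For monitoring scripts, the threshold and load can be pulled out of the listing programmatically. The following Python sketch parses the exceptions section as it appears in the listings above; the function name parse_exit_rate is hypothetical, and the exact layout of bhosts -l output may differ between LSF versions, so treat the patterns as assumptions to verify on your own system.

```python
import re

# Section header as it appears in the bhosts -l listings above.
HEADER = "THRESHOLD AND LOAD USED FOR EXCEPTIONS:"


def parse_exit_rate(bhosts_output):
    """Return (threshold, load) for JOB_EXIT_RATE, or None if not found."""
    # Only look at text after the header, since other parts of the
    # listing may also contain the word "Load".
    _, found, tail = bhosts_output.partition(HEADER)
    if not found:
        return None
    threshold = re.search(r"\bThreshold\s+([0-9.]+)", tail)
    load = re.search(r"\bLoad\s+([0-9.]+)", tail)
    if not (threshold and load):
        return None
    return float(threshold.group(1)), float(load.group(1))


sample = """THRESHOLD AND LOAD USED FOR EXCEPTIONS:
           JOB_EXIT_RATE
Threshold    20.00
Load          1.00
"""
print(parse_exit_rate(sample))  # (20.0, 1.0)
```

In practice the input would be the captured output of bhosts -l for one host; comparing the two returned values shows how close the host is to being closed.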
After 10 minutes (the scan period), the host status is set to Closed_Adm and bhosts -l host gives the following:
    THRESHOLD AND LOAD USED FOR EXCEPTIONS:
               JOB_EXIT_RATE
    Threshold    20.00
    Load          0.00

    ADMIN ACTION COMMENT: "eadmin: JOB EXIT RATE THRESHOLD EXCEEDED"
An e-mail is also sent to the lsfadmin user.
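Besides the e-mail, the admin comment itself can be used to tell hosts closed by the black hole detection apart from hosts closed by hand. The Python sketch below extracts the comment from a bhosts -l listing; the function names are hypothetical, and the comment text is taken from the listing above, so it is an assumption that other LSF versions word it identically.

```python
import re

# Comment text written by eadmin in the listing shown above.
EXIT_RATE_MARKER = "JOB EXIT RATE THRESHOLD EXCEEDED"


def admin_comment(bhosts_output):
    """Return the quoted ADMIN ACTION COMMENT text, or None if absent."""
    match = re.search(r'ADMIN ACTION COMMENT:\s*"([^"]*)"', bhosts_output)
    return match.group(1) if match else None


def closed_for_exit_rate(bhosts_output):
    """True if the listing shows a host closed for its job exit rate."""
    comment = admin_comment(bhosts_output)
    return comment is not None and EXIT_RATE_MARKER in comment


listing = (
    "THRESHOLD AND LOAD USED FOR EXCEPTIONS:\n"
    "           JOB_EXIT_RATE\n"
    "Threshold    20.00\n"
    "Load          0.00\n"
    "\n"
    'ADMIN ACTION COMMENT: "eadmin: JOB EXIT RATE THRESHOLD EXCEEDED"\n'
)
print(admin_comment(listing))
print(closed_for_exit_rate(listing))
```

A host flagged this way can then be inspected for the underlying fault and, once repaired, reopened by the administrator (typically with badmin hopen).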