Revision as of 13:39, 23 November 2012

How to block failing hosts in LSF

How can I prevent a bad WN from becoming a black hole, i.e. swallowing and failing a large number of jobs sent to my LSF batch system?

Solution

LSF provides a black hole detection scheme: the batch system can be configured so that a worker node (WN) is automatically closed when jobs exit on it at a very high rate. Such rates usually point to a serious problem on the machine, such as a hardware component failure or a full file system.

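For example, to close a host once 20 jobs have exited on it within 10 minutes (the window length is governed by JOB_EXIT_RATE_DURATION in lsb.params), add an EXIT_RATE column to the Host section of lsb.hosts, something like this, and apply it with badmin reconfig: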

Begin Host
HOST_NAME       pool    tmp     r1m     r15m    ut      EXIT_RATE
default         300     100     2.5     2.0     0.9     20
End Host
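As a quick sanity check before reconfiguring, the Host section can be read back programmatically. The following is a minimal sketch only, not an LSF tool: the function name read_exit_rates is hypothetical, and it assumes a simple whitespace-separated table with # comments, as in the example above.

```python
def read_exit_rates(lsb_hosts_text):
    """Return {host_name: EXIT_RATE} from the Host section of lsb.hosts text.

    Hypothetical helper: assumes a plain whitespace-separated table whose
    first row inside Begin Host / End Host is the column header.
    """
    rates = {}
    in_host_section = False
    columns = None
    for raw in lsb_hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        words = line.lower().split()
        if words == ["begin", "host"]:
            in_host_section, columns = True, None
            continue
        if words == ["end", "host"]:
            in_host_section = False
            continue
        if not in_host_section:
            continue
        fields = line.split()
        if columns is None:
            columns = fields  # header row: HOST_NAME ... EXIT_RATE
            continue
        row = dict(zip(columns, fields))
        try:
            rates[row["HOST_NAME"]] = float(row.get("EXIT_RATE"))
        except (KeyError, TypeError, ValueError):
            pass  # column missing or left unset for this host
    return rates


SAMPLE = """\
Begin Host
HOST_NAME       pool    tmp     r1m     r15m    ut      EXIT_RATE
default         300     100     2.5     2.0     0.9     20
End Host
"""

print(read_exit_rates(SAMPLE))  # {'default': 20.0}
```

A host that does not appear in the result has no numeric EXIT_RATE in this table.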

The configured threshold can be checked with bhosts -l host, where host is the name of the worker node. The end of the listing shows the current threshold and load:

THRESHOLD AND LOAD USED FOR EXCEPTIONS:
            JOB_EXIT_RATE
 Threshold   20.00
 Load         0.00

After one bad job, the load rises to 1:

THRESHOLD AND LOAD USED FOR EXCEPTIONS:
            JOB_EXIT_RATE
 Threshold   20.00
 Load         1.00

Further bad jobs raise the load until it reaches the threshold:

THRESHOLD AND LOAD USED FOR EXCEPTIONS:
            JOB_EXIT_RATE
 Threshold   20.00
 Load        20.00

At the end of the scan period (10 minutes here), the host status is set to Closed_Adm and bhosts -l host shows the following:

THRESHOLD AND LOAD USED FOR EXCEPTIONS:
            JOB_EXIT_RATE
 Threshold   20.00
 Load         0.00
 ADMIN ACTION COMMENT: "eadmin: JOB EXIT RATE THRESHOLD EXCEEDED"

An e-mail is also sent to the lsfadmin user. Once the underlying problem has been fixed, the host can be reopened with badmin hopen host.
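For monitoring, the exception block printed by bhosts -l can be parsed from captured text. This is a rough sketch under stated assumptions: the function name parse_exit_rate is hypothetical, the layout is taken to be exactly the one shown in the listings above with JOB_EXIT_RATE as the only exception column, and the command output is supplied as a string rather than fetched from LSF.

```python
import re


def parse_exit_rate(bhosts_output):
    """Extract the JOB_EXIT_RATE threshold, load and admin comment.

    Hypothetical helper: assumes the block layout shown in this page,
    with a single JOB_EXIT_RATE column.
    """
    result = {"threshold": None, "load": None, "comment": None}
    in_block = False
    for line in bhosts_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("THRESHOLD AND LOAD USED FOR EXCEPTIONS"):
            in_block = True
            continue
        if stripped.startswith("ADMIN ACTION COMMENT:"):
            # Keep only the quoted text after the label
            result["comment"] = stripped.split(":", 1)[1].strip().strip('"')
            continue
        if in_block:
            match = re.match(r"(Threshold|Load)\s+([0-9.]+)$", stripped)
            if match:
                result[match.group(1).lower()] = float(match.group(2))
    return result


AT_THRESHOLD = """\
THRESHOLD AND LOAD USED FOR EXCEPTIONS:
            JOB_EXIT_RATE
 Threshold   20.00
 Load        20.00
"""

CLOSED = """\
THRESHOLD AND LOAD USED FOR EXCEPTIONS:
            JOB_EXIT_RATE
 Threshold   20.00
 Load         0.00
 ADMIN ACTION COMMENT: "eadmin: JOB EXIT RATE THRESHOLD EXCEEDED"
"""

info = parse_exit_rate(AT_THRESHOLD)
print(info["load"] >= info["threshold"])   # True
print(parse_exit_rate(CLOSED)["comment"])  # eadmin: JOB EXIT RATE THRESHOLD EXCEEDED
```

Note that the load is back at 0.00 once the host has been closed, so the admin comment, rather than the load, is the sign that the threshold was exceeded.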
