Tools/Manuals/TS123



How to reboot a CE without draining Torque queues

Rebooting a CE may cause jobs that are just starting or finishing to fail, while jobs in a steady state (running or queued) should not be affected.

Since draining the queues could take days, one may consider temporarily suspending all jobs as an alternative. For PBS/Torque:

# suspend all running jobs in a particular queue
qsig -s STOP `qselect -q some_queue -s R`

# reboot the CE

# let the suspended jobs continue
qsig -s CONT `qselect -q some_queue -s R`
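
The second qselect call above picks up whatever is in state R at that moment, which may differ from the set of jobs that was stopped. A minimal sketch of a variant that records the job IDs before stopping them and resumes exactly those jobs is given below. The names QUEUE, JOBLIST, suspend_jobs and resume_jobs are hypothetical, chosen for this example only, and the file location is an assumption: any path that survives the reboot will do.

```shell
#!/bin/sh
# Sketch only: QUEUE and JOBLIST are hypothetical names for this example.
QUEUE="${QUEUE:-some_queue}"
# Assumed location; pick a path that is preserved across the reboot.
JOBLIST="${JOBLIST:-/var/tmp/suspended_jobs.$QUEUE}"

suspend_jobs() {
    # Record the IDs of the jobs currently running in the queue.
    qselect -q "$QUEUE" -s R > "$JOBLIST" || return 1
    # Nothing to do if no job is running.
    [ -s "$JOBLIST" ] || return 0
    # Send SIGSTOP to each recorded job.
    while read -r job; do
        qsig -s STOP "$job" < /dev/null
    done < "$JOBLIST"
}

resume_jobs() {
    # Nothing to do if no job was recorded.
    [ -s "$JOBLIST" ] || return 0
    # Send SIGCONT to exactly the jobs that were stopped earlier.
    while read -r job; do
        qsig -s CONT "$job" < /dev/null
    done < "$JOBLIST"
}
```

With these definitions loaded, one would call suspend_jobs before the reboot and resume_jobs once the CE is back. Sending one signal per qsig call is slower than passing all IDs at once, but it avoids calling qsig with no arguments when the queue has no running jobs.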

Warning: this alternative can also cause jobs to fail. A job that is exchanging network traffic with remote services when it is suspended (quite common) will fail if that traffic is subject to timeouts (also quite common).

In any case, the time between the start and the end of the whole operation should be kept as short as possible, to minimize the number of job failures.