From EGIWiki

How to reboot a CE without draining Torque queues

Rebooting a CE may cause jobs that are just starting or finishing to fail, whereas jobs in a steady state (running or queued) should not be affected.

Since draining the queues could take days, one may consider temporarily suspending all jobs as an alternative. For PBS/Torque:

# suspend all running jobs in a particular queue
qsig -s STOP `qselect -q some_queue -s R`

# reboot the CE

# let the suspended jobs continue
qsig -s CONT `qselect -q some_queue -s R`
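The two qselect calls above may return different job lists, for example if a job starts or ends in between. A minimal sketch of one way around that is to record the job IDs once and resume exactly that set. The function names suspend_queue and resume_queue, and the job list file argument, are hypothetical names chosen for this example; only qselect and qsig with the options shown above are taken from this page.

```shell
#!/bin/sh
# Sketch only: suspend_queue, resume_queue and the job list file are
# hypothetical helpers for illustration, not part of Torque.

# suspend_queue QUEUE JOBLIST
# Save the IDs of the running jobs in QUEUE to JOBLIST, then send STOP to each.
suspend_queue() {
    queue=$1
    joblist=$2
    # Query the running jobs once, so the same set can be resumed later.
    qselect -q "$queue" -s R > "$joblist"
    # Do nothing if no job was running (empty list).
    if [ -s "$joblist" ]; then
        while read -r job; do
            qsig -s STOP "$job"
        done < "$joblist"
    fi
}

# resume_queue JOBLIST
# Send CONT to every job ID recorded in JOBLIST.
resume_queue() {
    joblist=$1
    if [ -s "$joblist" ]; then
        while read -r job; do
            qsig -s CONT "$job"
        done < "$joblist"
    fi
}
```

Usage would be along the lines of `suspend_queue some_queue /root/suspended_jobs`, then the reboot, then `resume_queue /root/suspended_jobs`. The path is only an example; if the file is kept on the machine being rebooted, it should be in a location that survives the reboot.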

Warning: this alternative can also cause jobs to fail. A suspended job that happens to be engaged in network traffic with a remote service (quite usual) will fail if that traffic is bound by a timeout (also quite usual) that expires before the job is resumed.

In any case, the time between the start and the end of the whole operation should be kept as short as possible, to minimize the number of job failures.