Tools/Manuals/TS123
How to reboot a CE without draining Torque queues
Rebooting a CE may cause jobs that are just starting or finishing to fail, while jobs in a steady state (running or queued) should not be affected.
Since draining the queues could take days, one may consider temporarily suspending all jobs as an alternative. For PBS/Torque:
# suspend all running jobs in a particular queue
qsig -s STOP `qselect -q some_queue -s R`
# reboot the CE
# let the suspended jobs continue
qsig -s CONT `qselect -q some_queue -s R`
Warning: this alternative can also cause jobs to fail. A job that is exchanging network traffic with remote services at the moment it is suspended (a common situation) may fail if that traffic is subject to timeouts (also common).
In any case, the time between the start and the end of the whole operation should be kept as short as possible, to minimize the number of job failures.
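One possible variation on the commands above is to record the IDs of the jobs that were suspended and later resume exactly that list, which also avoids calling qsig with no job IDs when the queue is empty. The following is only a sketch under stated assumptions, not a tested procedure: the function names, the QSELECT and QSIG override variables, and the JOBLIST file location are hypothetical and chosen for illustration; only the qselect and qsig invocations come from the text above.

```shell
#!/bin/sh
# Sketch: suspend the running jobs of one Torque queue and resume them later.
# QSELECT, QSIG and JOBLIST are hypothetical, overridable settings.
QSELECT=${QSELECT:-qselect}
QSIG=${QSIG:-qsig}
JOBLIST=${JOBLIST:-/var/tmp/suspended_jobs.txt}

# suspend_queue QUEUE: record the running jobs of QUEUE, then send them STOP.
suspend_queue() {
    queue=$1
    # Save the IDs of the jobs currently running in the queue, one per line.
    "$QSELECT" -q "$queue" -s R > "$JOBLIST" || return 1
    # An empty list means there is nothing to suspend.
    [ -s "$JOBLIST" ] || return 0
    # Pass the recorded job IDs to qsig as arguments.
    xargs "$QSIG" -s STOP < "$JOBLIST"
}

# resume_jobs: send CONT to exactly the jobs recorded by suspend_queue.
resume_jobs() {
    [ -s "$JOBLIST" ] || return 0
    xargs "$QSIG" -s CONT < "$JOBLIST"
}

# Intended use (not executed here):
#   suspend_queue some_queue
#   ... reboot the CE ...
#   resume_jobs
```

The job list is written to a file so that it survives the reboot; this assumes the chosen location is not cleared at boot time. The warning about timeouts applies to this variation as well.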