GPGPU WG KnowledgeBase - Batch Schedulers - SLURM
Revision as of 20:04, 27 February 2014 by Walshj1
SLURM has increased in popularity as the LRMS of choice at many large resource centres. This document does not cover general SLURM installation; it describes only the additional configuration needed to schedule GPGPU resources.
GPGPU resources are handled as a Generic Resource (GRES) under SLURM.
To add support for GPGPU generic resources, the following value must be declared in slurm.conf:
GresTypes=gpu
# cons_res seems to be necessary to stop all cores on the
# worker node being allocated to the job
SelectType=select/cons_res
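In addition to slurm.conf, each worker node needs a gres.conf that maps the gpu generic resource to its device files. A minimal sketch, assuming two NVIDIA devices (the device paths and the /etc/slurm location are assumptions for this example):

```shell
# /etc/slurm/gres.conf on each GPGPU worker node
# One line per device; paths below are assumed NVIDIA device files
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
```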
To declare that a worker node supports GPGPUs, the resource must be declared on the node's NodeName/NodeAddr statement.
# Declare a range of nodes wn0 -> wn9
# Each of these nodes has 2 GPGPUs and 8 job slots
NodeName=wn[0-9].example.com Gres=gpu:2 CPUs=8
A Partition defines a set of related nodes, and is roughly equivalent to a PBS queue.
PartitionName=gpgpu Nodes=wn[0-9].example.com Default=YES MaxTime=INFINITE State=UP Shared=YES
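After editing slurm.conf, the running daemons need to pick up the new configuration. On a live cluster this can be done with scontrol, and the new partition checked with sinfo (commands shown as a sketch; they assume working slurmctld/slurmd daemons):

```shell
# Push the updated slurm.conf out to all daemons without restarting them
scontrol reconfigure

# Confirm the gpgpu partition and its nodes are up
sinfo -p gpgpu
```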
To view which nodes support GPU resources, query the node state with scontrol:

scontrol show node wn0

NodeName=wn0 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.00
   Features=(null) Gres=gpu:2
   NodeAddr=wn0.example.com NodeHostName=wn0.example.com
   OS=Linux RealMemory=1 AllocMem=0 Sockets=8 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=2014-01-27T22:30:02 SlurmdStartTime=2014-01-30T15:58:43
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
The above configuration will prevent more than two jobs that each require a single GPU from running simultaneously on any one node:
srun -p gpgpu --gres=gpu:1 sleep 20 &
srun -p gpgpu --gres=gpu:1 sleep 20 &
# This job should wait until a GPGPU is available
srun -p gpgpu --gres=gpu:1 sleep 20 &
# This job will run immediately if a job slot is available
srun -p gpgpu sleep 20
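The same resource requests can be made from a batch script instead of srun. A minimal sketch (the script name, job name and the echo payload are assumptions; the #SBATCH directives mirror the srun flags above):

```shell
#!/bin/bash
#SBATCH --partition=gpgpu
#SBATCH --gres=gpu:1
#SBATCH --job-name=gpu-test

# SLURM sets CUDA_VISIBLE_DEVICES for the job, so CUDA applications
# only see the GPU(s) allocated to this job
echo "Allocated GPU(s): $CUDA_VISIBLE_DEVICES"
```

Submit with `sbatch gpu-test.sh`; `squeue` shows whether the job is pending for a free GPGPU or running.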