GPGPU WG KnowledgeBase - Batch Schedulers - SLURM
Revision as of 20:04, 27 February 2014 by Walshj1
SLURM has increased in popularity as the LRMS of choice at many large resource centres. This document does not cover general SLURM installation; it describes only the additional configuration needed to schedule GPGPU resources.
GPGPU resources are handled as a Generic Resource (GRES) under SLURM.
To add support for GPGPU generic resources, the following value must be declared in slurm.conf:
GresTypes=gpu
# cons_res seems to be necessary to stop all cores on the
# worker node being allocated to the job
SelectType=select/cons_res
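In addition to slurm.conf, each worker node needs a gres.conf that maps the gpu generic resource to its device files. A minimal sketch, assuming two NVIDIA devices (the device paths and the /etc/slurm location are assumptions for this example):

```shell
# /etc/slurm/gres.conf on each GPGPU worker node
# One line per device; paths below are assumed NVIDIA device files
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
```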
To declare that a worker node supports GPGPUs, the resource must be declared on the node's NodeName/NodeAddr statement.
# Declare a range of nodes wn0 -> wn9
# Each of these nodes has 2 GPGPUs and 8 job slots
NodeName=wn[0-9].example.com Gres=gpu:2 CPUs=8
A Partition defines a set of related nodes, and is roughly equivalent to a PBS queue.
PartitionName=gpgpu Nodes=wn[0-9].example.com Default=YES MaxTime=INFINITE State=UP Shared=YES
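After editing slurm.conf, the running daemons need to pick up the new configuration. On a live cluster this can be done with scontrol, and the new partition checked with sinfo (commands shown as a sketch; they assume working slurmctld/slurmd daemons):

```shell
# Push the updated slurm.conf out to all daemons without restarting them
scontrol reconfigure

# Confirm the gpgpu partition and its nodes are up
sinfo -p gpgpu
```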
To view which nodes support GPU resources, query the node state with scontrol:

scontrol show node wn0

NodeName=wn0 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.00
   Features=(null) Gres=gpu:2
   NodeAddr=wn0.example.com NodeHostName=wn0.example.com
   OS=Linux RealMemory=1 AllocMem=0 Sockets=8 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=2014-01-27T22:30:02 SlurmdStartTime=2014-01-30T15:58:43
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
The above configuration will prevent more than two jobs that each require a single GPU from running simultaneously on any one node:
srun -p gpgpu --gres=gpu:1 sleep 20 &
srun -p gpgpu --gres=gpu:1 sleep 20 &
# This job should wait until a GPGPU is available
srun -p gpgpu --gres=gpu:1 sleep 20 &
# This job will run immediately if a job slot is available
srun -p gpgpu sleep 20
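The same resource requests can be made from a batch script instead of srun. A minimal sketch (the script name, job name and the echo payload are assumptions; the #SBATCH directives mirror the srun flags above):

```shell
#!/bin/bash
#SBATCH --partition=gpgpu
#SBATCH --gres=gpu:1
#SBATCH --job-name=gpu-test

# SLURM sets CUDA_VISIBLE_DEVICES for the job, so CUDA applications
# only see the GPU(s) allocated to this job
echo "Allocated GPU(s): $CUDA_VISIBLE_DEVICES"
```

Submit with `sbatch gpu-test.sh`; `squeue` shows whether the job is pending for a free GPGPU or running.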