GPGPU WG KnowledgeBase - Batch Schedulers - SLURM

<< GPGPU Working Group main page

SLURM has increased in popularity as the LRMS of choice at many large resource centres. This document describes how GPGPU resources can be configured and requested under SLURM.


GPGPU resources are handled as a Generic Resource (GRES) under SLURM. To add support for GPGPU generic resources, the following values must be declared in slurm.conf:

GresTypes=gpu
# cons_res seems to be necessary to stop all cores on the worker node being allocated to the job
SelectType=select/cons_res
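
The cons_res plugin is normally paired with a SelectTypeParameters setting so that individual cores, rather than whole nodes, are the unit of allocation. A minimal sketch, assuming core-level scheduling is wanted (the exact value, e.g. CR_Core or CR_Core_Memory, is a site choice):

SelectTypeParameters=CR_Core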


To indicate that a worker node supports GPGPUs, the resource must be declared in the node's NodeName/NodeAddr statement.

# Declare a range of nodes wn0 to wn9
# Each of these nodes has 2 GPGPUs and 8 job slots
NodeName=wn[0-9].example.com Gres=gpu:2 CPUs=8
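
Declaring Gres=gpu:2 in slurm.conf is usually complemented by a gres.conf file on each worker node, mapping the gpu resource to its device files. A minimal sketch for a node with two NVIDIA GPUs (the device paths below are assumptions and must match the actual hardware):

# gres.conf on each GPGPU worker node (device paths are examples)
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1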

A Partition defines a set of related nodes, and is somewhat equivalent to a PBS queue.

PartitionName=gpgpu Nodes=wn[0-9].example.com Default=YES MaxTime=INFINITE State=UP Shared=YES
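
Once the configuration has been reloaded, a quick way to confirm that the partition is up and contains the expected nodes is sinfo (the partition name comes from the example above):

sinfo -p gpgpu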



To view which nodes support GPU resources, inspect the node configuration with scontrol show node. Example output for wn0:

NodeName=wn0 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.00 Features=(null)
   Gres=gpu:2
   NodeAddr=wn0.example.com NodeHostName=wn0.example.com
   OS=Linux RealMemory=1 AllocMem=0 Sockets=8 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=2014-01-27T22:30:02 SlurmdStartTime=2014-01-30T15:58:43
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
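
A more compact overview of the generic resources advertised by each node can be obtained with sinfo format options; a sketch (%N prints the node name and %G its GRES):

sinfo -N -o "%N %G"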


The above configuration will prevent more than two jobs that each require a single GPU from running simultaneously on a given worker node:

srun  -p gpgpu --gres=gpu:1 sleep 20 &
srun  -p gpgpu --gres=gpu:1 sleep 20 &
srun  -p gpgpu --gres=gpu:1 sleep 20 & # This job should wait until a GPGPU is available
srun  -p gpgpu  sleep 20 # This job will run immediately if a job slot is available
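
The job queue can be monitored with squeue; while no GPGPU is available, the waiting job will typically be shown in state PD (pending) with reason Resources:

squeue -p gpgpu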


SLURM also sets the CUDA_VISIBLE_DEVICES environment variable correctly for the GPUs allocated to a job:

# Acquire 2 GPUs on the same node
srun  -p gpgpu --gres=gpu:2 env | grep -i cuda
CUDA_VISIBLE_DEVICES=0,1

# Acquire 1 GPU on a node
srun  -p gpgpu --gres=gpu:1 env | grep -i cuda
CUDA_VISIBLE_DEVICES=0
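
The same behaviour applies to batch jobs submitted with sbatch: a job script that requests GPUs through --gres sees CUDA_VISIBLE_DEVICES restricted to the devices allocated to it. A minimal sketch of such a script (the filename gpu-test.sh is only illustrative), submitted with sbatch gpu-test.sh:

#!/bin/bash
#SBATCH --partition=gpgpu
#SBATCH --gres=gpu:1
# Print the CUDA-related environment SLURM has set up for this allocation
env | grep -i cuda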