GPGPU-CREAM
Goal
- To develop a solution enabling GPU support in CREAM-CE:
- For the most popular LRMSes already supported by CREAM-CE
- Based on GLUE 2.1 schema
Work plan
- Identifying the relevant GPGPU-related parameters supported by the different LRMSes, and abstracting them into meaningful JDL attributes
- GPGPU accounting is expected to be derived from LRMS log files, as is done for CPU accounting, and then to follow the same APEL flow
- Implementing the needed changes in CREAM-core and BLAH components
- Writing the infoproviders according to GLUE 2.1
- Testing and certification of the prototype
- Releasing a CREAM-CE update with full GPGPU support
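The accounting item in the plan above would start from the LRMS accounting logs. A hedged sketch of that step, assuming a Torque-style accounting record carrying a `Resource_List.nodes=...:gpus=N` field (the field layout is an assumption modelled on Torque logs and must be verified per LRMS):

```python
import re

# Hedged sketch for the accounting work item: pull a GPU count out of an
# LRMS accounting record.  The "Resource_List.nodes=...:gpus=N" field is an
# assumption modelled on Torque-style logs, to be checked per LRMS.

def gpus_from_record(record: str) -> int:
    """Return the number of GPUs requested in one accounting record, 0 if none."""
    m = re.search(r'Resource_List\.nodes=\S*?gpus=(\d+)', record)
    return int(m.group(1)) if m else 0

rec = "05/20/2015;E;42.cegpu;user=test Resource_List.nodes=1:ppn=2:gpus=2"
print(gpus_from_record(rec))  # -> 2
```

Records parsed this way could then feed the usual APEL flow alongside the CPU figures.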
Testbed
- 3 nodes (2x Intel Xeon E5-2620v2) with 2 NVIDIA Tesla K20m GPUs per node available at CIRMMP
- MoBrain applications installed: AMBER and GROMACS with CUDA 5.5
- Batch system/Scheduler: Torque 4.2.10 (source compiled with NVML libs)/ Maui 3.3.1
- EMI3 CREAM-CE
Progress
- May 2015
- tested local AMBER job submission with pbs_sched as the scheduler (i.e. not using Maui) and various Torque/NVIDIA GPGPU support options, e.g.:
qsub -l nodes=1:gpus=2:default
qsub -l nodes=1:gpus=2:exclusive_process
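A minimal sketch of how the resource string passed to `qsub -l` in these commands is put together (illustrative only; the function and constant names are hypothetical, not CREAM/BLAH code):

```python
# Illustrative helper that builds the Torque resource string used with
# "qsub -l"; names here are hypothetical, not actual CREAM/BLAH code.

NVML_COMPUTE_MODES = {"default", "exclusive_thread", "prohibited", "exclusive_process"}

def torque_gpu_spec(nodes: int, gpus: int, mode: str) -> str:
    """E.g. torque_gpu_spec(1, 2, 'exclusive_process') -> 'nodes=1:gpus=2:exclusive_process'."""
    if mode not in NVML_COMPUTE_MODES:
        raise ValueError(f"unknown GPU compute mode: {mode!r}")
    return f"nodes={nodes}:gpus={gpus}:{mode}"

print(torque_gpu_spec(1, 2, "default"))  # -> nodes=1:gpus=2:default
```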
- June 2015
- attributes "GPUNumber" and "GPUMode" added to the BLAH_JOB_SUBMIT command, e.g.:
BLAH_JOB_SUBMIT 2 [Cmd="/tmp/test.sh";GridType="pbs";Queue="batch";In="/dev/null";Out="~\/StdOutput";Err="~\/StdError";GPUNumber=1;GPUMode="default"]
BLAH_JOB_SUBMIT 2 [Cmd="test_gpu_blah.sh";GridType="pbs";Queue="batch";In="/dev/null";Out="StdOutput";Err="StdError";GPUNumber=1;GPUMode="exclusive_process"]
- this required modifications to blah_common_submit_functions.sh and server.c
- first implementation of the two new attributes for Torque/pbs_sched:
GPUMode refers to the NVML COMPUTE mode (the gpu_mode variable in the pbsnodes output) and can have the following values for Torque/pbs_sched:
- default - shared mode, available for multiple processes
- exclusive_thread - only one COMPUTE thread is allowed to run on the GPU (v260 exclusive)
- prohibited - no COMPUTE contexts are allowed to run on the GPU
- exclusive_process - only one COMPUTE process is allowed to run on the GPU
- this required modifications to pbs_submit.sh
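A hedged sketch of the kind of parsing these changes need: pull GPUNumber and GPUMode out of the classad-style BLAH_JOB_SUBMIT attribute string and check the mode against the NVML compute modes (illustrative Python, not the actual server.c/pbs_submit.sh code):

```python
import re

# Illustrative only: extract GPUNumber/GPUMode from a BLAH_JOB_SUBMIT
# classad-style attribute string; the real parsing lives in server.c.

NVML_COMPUTE_MODES = {"default", "exclusive_thread", "prohibited", "exclusive_process"}

def parse_gpu_attributes(ad: str):
    """Return (GPUNumber, GPUMode) from a classad-style attribute string."""
    num = re.search(r'GPUNumber\s*=\s*(\d+)', ad)
    mode = re.search(r'GPUMode\s*=\s*"([^"]+)"', ad)
    gpu_mode = mode.group(1) if mode else None
    if gpu_mode is not None and gpu_mode not in NVML_COMPUTE_MODES:
        raise ValueError(f"invalid GPUMode: {gpu_mode!r}")
    return (int(num.group(1)) if num else None, gpu_mode)

ad = '[Cmd="/tmp/test.sh";GridType="pbs";Queue="batch";GPUNumber=1;GPUMode="exclusive_process"]'
print(parse_gpu_attributes(ad))  # -> (1, 'exclusive_process')
```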
- July-August 2015
- implemented the parser in CREAM core for the new JDL attributes GPUNumber and GPUMode
- tested AMBER remote job submission through the glite-ce-job-submit client:
$ glite-ce-job-submit -o jobid.txt -d -a -r cegpu.cerm.unifi.it:8443/cream-pbs-batch test.jdl
$ cat test.jdl
[
  executable = "test_gpu.sh";
  inputSandbox = { "test_gpu.sh" };
  stdoutput = "out.out";
  outputsandboxbasedesturi = "gsiftp://localhost";
  stderror = "err.err";
  outputsandbox = { "out.out","err.err","min.out","heat.out" };
  GPUNumber=2;
  GPUMode="exclusive_process";
]
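A JDL like the one above can be assembled programmatically; a minimal sketch (a hypothetical helper, not part of any CREAM tool):

```python
# Hypothetical sketch: assemble a minimal GPU-aware JDL like the one above
# from Python values; not part of the actual CREAM client.

def make_gpu_jdl(executable: str, gpu_number: int, gpu_mode: str) -> str:
    """Render a minimal JDL text with the two new GPU attributes."""
    return (
        "[\n"
        f'  executable = "{executable}";\n'
        f'  inputSandbox = {{ "{executable}" }};\n'
        '  stdoutput = "out.out";\n'
        '  stderror = "err.err";\n'
        f"  GPUNumber={gpu_number};\n"
        f'  GPUMode="{gpu_mode}";\n'
        "]\n"
    )

print(make_gpu_jdl("test_gpu.sh", 2, "exclusive_process"))
```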
- September 2015
- contacted STFC/Emerald (LSF8 based GPU cluster), IFCA (SGE based GPU cluster) and INFN-CNAF (LSF9 based GPU cluster) asking whether they can provide a testing instance for extending CREAM GPU support to the LSF and SGE LRMSes
- continued troubleshooting CREAM prototype with AMBER application at CIRMMP cluster
- analysed the GLUE 2.1 schema as a base for writing the GPU-aware Torque info providers
- An updated report on GPGPU support in CREAM-CE was presented at the OMB
- October 2015
- Enabled the Maui scheduler at the CIRMMP cluster and started developing a CREAM prototype with GPGPU support for Torque/Maui. Problem: there seems to be no way to set the NVML COMPUTE mode through qsub -l as is done with Torque/pbs_sched (see above)
- Troubleshooting of CREAM prototype.
- Derek Ross of STFC/Emerald (LSF8 based GPU cluster) activated an account for accessing the Emerald GPGPU testbed, in order to start GPGPU-enabled CREAM prototyping for LSF.
- A thread was started on the WP4.4 and VT-GPGPU mailing lists with the APEL team to investigate how to address GPGPU accounting for Torque and other LRMSes.
- November 2015
- Obtained testing availability from ARNES (Slurm based GPU cluster), Queen Mary (SGE based cluster with OpenCL compatible AMD GPUs) and GRIF (HTCondor based GPU cluster)
- GPGPU EGI activities presented at the WLCG Grid Deployment Board
- Accounting issues further discussed at the EGI CF Accelerated Computing Session in Bari
- Continued testing and troubleshooting of GPU-enabled CREAM-CE prototype at CIRMMP testbed.
- Implemented another MoBrain use case: the dockerized DisVis application has been installed at the CIRMMP testbed. MoBrain users (through the enmr.eu VO) can now run DisVis on the CIRMMP GPU cluster via the GPU-enabled CREAM-CE
- Started preparing the certification process: investigating the use of IM (UPV tool) for automatically deploying clusters on the EGI FedCloud to be used for GPU-enabled CREAM-CE certification
Next steps
- verifying whether the parameters identified for Torque have analogues in SLURM/SGE/LSF/HTCondor, or whether additional parameters have to be abstracted into JDL attributes
- writing the BLAH components for SLURM and/or SGE and/or LSF and/or HTCondor
- writing the information providers according to the GLUE 2.1 schema
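The first next step amounts to an abstraction table from the JDL-level GPUNumber attribute to each LRMS's native request syntax. A hedged sketch of what that mapping could look like: the Torque form comes from the testbed above, the Slurm ("--gres=gpu:N") and HTCondor ("request_gpus = N") forms are the standard ones for those systems, while SGE and LSF syntax is typically site-defined and is deliberately left open here.

```python
# Hedged sketch of the LRMS abstraction the next steps call for.  Torque
# syntax is taken from the testbed above; Slurm and HTCondor forms are their
# standard GPU request options; SGE/LSF are site-dependent and left unmapped.

GPU_REQUEST_TEMPLATES = {
    "pbs":    "-l nodes=1:gpus={n}",   # Torque, as used at CIRMMP
    "slurm":  "--gres=gpu:{n}",
    "condor": "request_gpus = {n}",
    # "sge" / "lsf": usually a site-defined resource/complex; to be verified
}

def gpu_request(lrms: str, n: int) -> str:
    """Render the native GPU request for one LRMS from the JDL GPUNumber value."""
    tmpl = GPU_REQUEST_TEMPLATES.get(lrms)
    if tmpl is None:
        raise NotImplementedError(f"GPU request syntax for {lrms!r} not mapped yet")
    return tmpl.format(n=n)

print(gpu_request("slurm", 2))  # -> --gres=gpu:2
```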