GPGPU-CREAM
Goal
- To develop a solution enabling GPU support in CREAM-CE:
- For the most popular LRMSes already supported by CREAM-CE
- Based on GLUE 2.1 schema
Work plan
- Identifying the relevant GPGPU-related parameters supported by the different LRMSes, and abstracting them into meaningful JDL attributes
- GPGPU accounting is expected to be derived from LRMS log files, as is done for CPU accounting, and then to follow the same APEL flow
- Implementing the needed changes in CREAM-core and BLAH components
- Writing the infoproviders according to GLUE 2.1
- Testing and certification of the prototype
- Releasing a CREAM-CE update with full GPGPU support
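The accounting item in the plan above would start from the LRMS accounting logs. A hedged sketch of that step, assuming a Torque-style accounting record carrying a `Resource_List.nodes=...:gpus=N` field (the field layout is an assumption modelled on Torque logs and must be verified per LRMS):

```python
import re

# Hedged sketch for the accounting work item: pull a GPU count out of an
# LRMS accounting record.  The "Resource_List.nodes=...:gpus=N" field is an
# assumption modelled on Torque-style logs, to be checked per LRMS.

def gpus_from_record(record: str) -> int:
    """Return the number of GPUs requested in one accounting record, 0 if none."""
    m = re.search(r'Resource_List\.nodes=\S*?gpus=(\d+)', record)
    return int(m.group(1)) if m else 0

rec = "05/20/2015;E;42.cegpu;user=test Resource_List.nodes=1:ppn=2:gpus=2"
print(gpus_from_record(rec))  # -> 2
```

Records parsed this way could then feed the usual APEL flow alongside the CPU figures.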
Testbed
- 3 nodes (2x Intel Xeon E5-2620v2) with 2 NVIDIA Tesla K20m GPUs per node available at CIRMMP
- MoBrain applications installed: AMBER and GROMACS with CUDA 5.5
- Batch system/Scheduler: Torque 4.2.10 (source compiled with NVML libs)/ Maui 3.3.1
- EMI3 CREAM-CE
Progress
- May 2015
- tested local AMBER job submission with pbs_sched as the scheduler (i.e. not using Maui) and various Torque/NVIDIA GPGPU support options, e.g.:
qsub -l nodes=1:gpus=2:default
qsub -l nodes=1:gpus=2:exclusive_process
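A minimal sketch of how the resource string passed to `qsub -l` in these commands is put together (illustrative only; the function and constant names are hypothetical, not CREAM/BLAH code):

```python
# Illustrative helper that builds the Torque resource string used with
# "qsub -l"; names here are hypothetical, not actual CREAM/BLAH code.

NVML_COMPUTE_MODES = {"default", "exclusive_thread", "prohibited", "exclusive_process"}

def torque_gpu_spec(nodes: int, gpus: int, mode: str) -> str:
    """E.g. torque_gpu_spec(1, 2, 'exclusive_process') -> 'nodes=1:gpus=2:exclusive_process'."""
    if mode not in NVML_COMPUTE_MODES:
        raise ValueError(f"unknown GPU compute mode: {mode!r}")
    return f"nodes={nodes}:gpus={gpus}:{mode}"

print(torque_gpu_spec(1, 2, "default"))  # -> nodes=1:gpus=2:default
```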
- June 2015
- attributes "GPUNumber" and "GPUMode" added to the BLAH_JOB_SUBMIT command, e.g.:
BLAH_JOB_SUBMIT 2 [Cmd="/tmp/test.sh";GridType="pbs";Queue="batch";In="/dev/null";Out="~\/StdOutput";Err="~\/StdError";GPUNumber=1;GPUMode="default"]
BLAH_JOB_SUBMIT 2 [Cmd="test_gpu_blah.sh";GridType="pbs";Queue="batch";In="/dev/null";Out="StdOutput";Err="StdError";GPUNumber=1;GPUMode="exclusive_process"]
- this required modifications to blah_common_submit_functions.sh and server.c
- first implementation of the two new attributes for Torque/pbs_sched:
GPUMode refers to the NVML COMPUTE mode (the gpu_mode variable in the pbsnodes output) and can have the following values for Torque/pbs_sched:
- default - shared mode, available for multiple processes
- exclusive_thread - only one COMPUTE thread is allowed to run on the GPU (v260 exclusive)
- prohibited - no COMPUTE contexts are allowed to run on the GPU
- exclusive_process - only one COMPUTE process is allowed to run on the GPU
- this required modifications to pbs_submit.sh
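A hedged sketch of the kind of parsing these changes need: pull GPUNumber and GPUMode out of the classad-style BLAH_JOB_SUBMIT attribute string and check the mode against the NVML compute modes (illustrative Python, not the actual server.c/pbs_submit.sh code):

```python
import re

# Illustrative only: extract GPUNumber/GPUMode from a BLAH_JOB_SUBMIT
# classad-style attribute string; the real parsing lives in server.c.

NVML_COMPUTE_MODES = {"default", "exclusive_thread", "prohibited", "exclusive_process"}

def parse_gpu_attributes(ad: str):
    """Return (GPUNumber, GPUMode) from a classad-style attribute string."""
    num = re.search(r'GPUNumber\s*=\s*(\d+)', ad)
    mode = re.search(r'GPUMode\s*=\s*"([^"]+)"', ad)
    gpu_mode = mode.group(1) if mode else None
    if gpu_mode is not None and gpu_mode not in NVML_COMPUTE_MODES:
        raise ValueError(f"invalid GPUMode: {gpu_mode!r}")
    return (int(num.group(1)) if num else None, gpu_mode)

ad = '[Cmd="/tmp/test.sh";GridType="pbs";Queue="batch";GPUNumber=1;GPUMode="exclusive_process"]'
print(parse_gpu_attributes(ad))  # -> (1, 'exclusive_process')
```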
- July-August 2015
- implemented the parser in CREAM core for the new JDL attributes GPUNumber and GPUMode
- tested AMBER remote job submission through the glite-ce-job-submit client:
$ glite-ce-job-submit -o jobid.txt -d -a -r cegpu.cerm.unifi.it:8443/cream-pbs-batch test.jdl
$ cat test.jdl
[
  executable = "test_gpu.sh";
  inputSandbox = { "test_gpu.sh" };
  stdoutput = "out.out";
  outputsandboxbasedesturi = "gsiftp://localhost";
  stderror = "err.err";
  outputsandbox = { "out.out","err.err","min.out","heat.out" };
  GPUNumber=2;
  GPUMode="exclusive_process";
]
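A JDL like the one above can be assembled programmatically; a minimal sketch (a hypothetical helper, not part of any CREAM tool):

```python
# Hypothetical sketch: assemble a minimal GPU-aware JDL like the one above
# from Python values; not part of the actual CREAM client.

def make_gpu_jdl(executable: str, gpu_number: int, gpu_mode: str) -> str:
    """Render a minimal JDL text with the two new GPU attributes."""
    return (
        "[\n"
        f'  executable = "{executable}";\n'
        f'  inputSandbox = {{ "{executable}" }};\n'
        '  stdoutput = "out.out";\n'
        '  stderror = "err.err";\n'
        f"  GPUNumber={gpu_number};\n"
        f'  GPUMode="{gpu_mode}";\n'
        "]\n"
    )

print(make_gpu_jdl("test_gpu.sh", 2, "exclusive_process"))
```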
- September 2015
- contacted STFC/Emerald (LSF8 based GPU cluster), IFCA (SGE based GPU cluster) and INFN-CNAF (LSF9 based GPU cluster) asking whether they can provide a testing instance for extending CREAM GPU support to the LSF and SGE LRMSes
- continued troubleshooting CREAM prototype with AMBER application at CIRMMP cluster
- analysed the GLUE 2.1 schema as a base for writing the GPU-aware Torque info providers
- An updated report on GPGPU support in CREAM-CE was presented at the OMB
- October 2015
- Enabled the Maui scheduler at the CIRMMP cluster and started developing a CREAM prototype with GPGPU support for Torque/Maui. Problem: there seems to be no way to set the NVML COMPUTE mode through qsub -l as is done with Torque/pbs_sched (see above)
- Troubleshooting of CREAM prototype.
- Derek Ross of STFC/Emerald (LSF8 based GPU cluster) activated an account for accessing the Emerald GPGPU testbed, in order to start GPGPU-enabled CREAM prototyping for LSF.
- A thread was started on the WP4.4 and VT-GPGPU mailing lists with the APEL team to investigate how to address GPGPU accounting for Torque and other LRMSes.
- November 2015
- Obtained testing availability from ARNES (Slurm based GPU cluster), Queen Mary (SGE based cluster with OpenCL compatible AMD GPUs) and GRIF (HTCondor based GPU cluster)
- GPGPU EGI activities presented at the WLCG Grid Deployment Board
- Accounting issues further discussed at the EGI CF Accelerated Computing Session in Bari
- Continued testing and troubleshooting of GPU-enabled CREAM-CE prototype at CIRMMP testbed.
- Implemented another MoBrain use case: the dockerized DisVis application has been installed at the CIRMMP testbed. MoBrain users (through the enmr.eu VO) can now run DisVis on the CIRMMP GPU cluster via the GPU-enabled CREAM-CE
- Started preparing the certification process: investigating the use of IM (UPV tool) for automatically deploying clusters on the EGI FedCloud to be used for GPU-enabled CREAM-CE certification
Next steps
- verifying whether the parameters identified for Torque have analogues in SLURM/SGE/LSF/HTCondor, or whether additional parameters have to be abstracted into JDL attributes
- writing the BLAH components for SLURM and/or SGE and/or LSF and/or HTCondor
- writing the information providers according to the GLUE 2.1 schema
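The first next step amounts to an abstraction table from the JDL-level GPUNumber attribute to each LRMS's native request syntax. A hedged sketch of what that mapping could look like: the Torque form comes from the testbed above, the Slurm ("--gres=gpu:N") and HTCondor ("request_gpus = N") forms are the standard ones for those systems, while SGE and LSF syntax is typically site-defined and is deliberately left open here.

```python
# Hedged sketch of the LRMS abstraction the next steps call for.  Torque
# syntax is taken from the testbed above; Slurm and HTCondor forms are their
# standard GPU request options; SGE/LSF are site-dependent and left unmapped.

GPU_REQUEST_TEMPLATES = {
    "pbs":    "-l nodes=1:gpus={n}",   # Torque, as used at CIRMMP
    "slurm":  "--gres=gpu:{n}",
    "condor": "request_gpus = {n}",
    # "sge" / "lsf": usually a site-defined resource/complex; to be verified
}

def gpu_request(lrms: str, n: int) -> str:
    """Render the native GPU request for one LRMS from the JDL GPUNumber value."""
    tmpl = GPU_REQUEST_TEMPLATES.get(lrms)
    if tmpl is None:
        raise NotImplementedError(f"GPU request syntax for {lrms!r} not mapped yet")
    return tmpl.format(n=n)

print(gpu_request("slurm", 2))  # -> --gres=gpu:2
```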