GPGPU-CREAM
Revision as of 14:44, 9 October 2015
Goal
- To develop a solution enabling GPU support in CREAM-CE:
- For the most popular LRMSes already supported by CREAM-CE
- Based on GLUE 2.1 schema
Work plan
- Identifying the relevant GPGPU-related parameters supported by the different LRMSes, and abstracting them into meaningful JDL attributes
- GPGPU accounting is expected to be provided by LRMS log files, as is done for CPU accounting, and to follow the same APEL flow
- Implementing the needed changes in CREAM-core and BLAH components
- Writing the infoproviders according to GLUE 2.1
- Testing and certification of the prototype
- Releasing a CREAM-CE update with full GPGPU support
Testbed
- 3 nodes (2x Intel Xeon E5-2620v2) with 2 NVIDIA Tesla K20m GPUs per node available at CIRMMP
- MoBrain applications installed: AMBER and GROMACS with CUDA 5.5
- Batch system/Scheduler: Torque 4.2.10 (source compiled with NVML libs)/ Maui 3.3.1
- EMI3 CREAM-CE
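The testbed's Torque build needs NVML support compiled in. A configure invocation along these lines is typically used; the flags come from the Torque 4.x build documentation, while the NVML include/library paths below are assumptions for a typical CUDA 5.5 installation and must be adjusted to the local layout:

```shell
# Illustrative Torque 4.2 source build with NVIDIA GPU (NVML) support.
# Paths are examples only; adapt them to the local CUDA installation.
./configure --enable-nvidia-gpus \
            --with-nvml-include=/usr/local/cuda-5.5/include \
            --with-nvml-lib=/usr/lib64
make && make install
```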
Progress
- May 2015:
- tested local AMBER job submission with pbs_sched as scheduler (i.e. not using Maui) and various Torque/NVIDIA GPGPU support options, e.g.:
qsub -l nodes=1:gpus=2:default
qsub -l nodes=1:gpus=2:exclusive_process
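Inside a job submitted this way, the GPUs assigned by Torque can be inspected; a minimal job-script sketch (hypothetical script, not from the testbed), assuming Torque's standard $PBS_GPUFILE variable listing the allocated GPUs:

```shell
#!/bin/sh
# Minimal GPU job sketch. Torque's pbs_mom writes the GPUs allocated
# to the job, one per line, to the file named by $PBS_GPUFILE.
echo "GPUs allocated to this job:"
cat "$PBS_GPUFILE"
# Report the compute mode as seen by the driver.
nvidia-smi --query-gpu=name,compute_mode --format=csv
```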
- June 2015:
- attributes "GPUNumber" and "GPUMode" added to the BLAH_JOB_SUBMIT command, e.g.:
BLAH_JOB_SUBMIT 2 [Cmd="/tmp/test.sh";GridType="pbs";Queue="batch";In="/dev/null";Out="~\/StdOutput";Err="~\/StdError";GPUNumber=1;GPUMode="default"]
BLAH_JOB_SUBMIT 2 [Cmd="test_gpu_blah.sh";GridType="pbs";Queue="batch";In="/dev/null";Out="StdOutput";Err="StdError";GPUNumber=1;GPUMode="exclusive_process"]
- this required modifications to blah_common_submit_functions.sh and server.c
- first implementation of the two new attributes for Torque/pbs_sched:
GPUMode refers to the NVML COMPUTE mode (the gpu_mode variable in the pbsnodes output) and can have the following values for Torque/pbs_sched:
- default - shared mode, available for multiple processes
- exclusive_thread - only one COMPUTE thread is allowed to run on the GPU (v260 exclusive)
- prohibited - no COMPUTE contexts are allowed to run on the GPU
- exclusive_process - only one COMPUTE process is allowed to run on the GPU
- this required modifications to pbs_submit.sh
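The translation done inside pbs_submit.sh can be pictured as building the extra Torque resource directive from the two BLAH attributes. The function below is a simplified illustration only; the name build_gpu_resource and the single-node assumption are ours, not the actual BLAH code:

```shell
# Hypothetical sketch: map the BLAH attributes GPUNumber and GPUMode
# onto a Torque "#PBS -l" directive. The real pbs_submit.sh logic
# is more involved.
build_gpu_resource() {
    gpunumber="$1"
    gpumode="$2"
    # No GPUNumber attribute: emit no directive at all.
    [ -z "$gpunumber" ] && return 0
    res="nodes=1:gpus=${gpunumber}"
    # Append the NVML compute mode only when GPUMode was given.
    [ -n "$gpumode" ] && res="${res}:${gpumode}"
    echo "#PBS -l ${res}"
}

build_gpu_resource 2 exclusive_process
# prints: #PBS -l nodes=1:gpus=2:exclusive_process
```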
- July-August 2015:
- implemented the parser in the CREAM core for the new JDL attributes GPUNumber and GPUMode
- tested AMBER remote job submission through glite-ce-submit client:
$ glite-ce-job-submit -o jobid.txt -d -a -r cegpu.cerm.unifi.it:8443/cream-pbs-batch test.jdl
$ cat test.jdl
[
executable = "test_gpu.sh";
inputSandbox = { "test_gpu.sh" };
stdoutput = "out.out";
outputsandboxbasedesturi = "gsiftp://localhost";
stderror = "err.err";
outputsandbox = { "out.out","err.err","min.out","heat.out" };
GPUNumber=2;
GPUMode="exclusive_process";
]
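After submission, the job can be followed with the standard gLite CREAM CLI; a typical sequence, assuming the usual -i input-file option of the gLite CE commands and the jobid.txt file written above:

```shell
# Check the status of the job whose ID was saved in jobid.txt.
glite-ce-job-status -i jobid.txt
# Once the job reaches DONE-OK, retrieve the output sandbox
# (out.out, err.err, min.out, heat.out in the JDL above).
glite-ce-job-output -i jobid.txt
```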
- September 2015:
- contacted STFC/Emerald (LSF8-based GPU cluster), IFCA (SGE-based GPU cluster) and INFN-CNAF (LSF9-based GPU cluster) to ask whether they can provide a testing instance for extending CREAM GPU support to the LSF and SGE LRMSes
- continued troubleshooting CREAM prototype with AMBER application at CIRMMP cluster
- analysed the GLUE2.1 schema as a base for writing the GPU-aware Torque info providers
- An updated report on GPGPU support in CREAM-CE was presented at the OMB
- October 2015:
- Enabled the Maui scheduler at the CIRMMP cluster and started the development of a CREAM prototype with GPGPU support for Torque/Maui. Problem: it seems there is no way to set the NVML COMPUTE mode through qsub -l as is done with Torque/pbs_sched (see above)
- Found problems in passing the GRES directive with Torque/Maui: qsub -W x=GRES:gpu@2 only allows the use of one GPU. Investigating.
- Derek Ross of STFC/Emerald (LSF8-based GPU cluster) sent us the AUP to sign for access to the Emerald GPGPU testbed, in order to start GPGPU-enabled CREAM prototyping for LSF.
- A thread has been started on the WP4.4 and VT-GPGPU mailing lists with the APEL team to investigate how to address GPGPU accounting for Torque and other LRMSes.
Next steps
- verifying whether the parameters identified for Torque have analogues in SLURM/SGE/LSF/Condor, or whether additional parameters have to be abstracted into JDL attributes
- writing the BLAH components for SLURM and/or SGE and/or LSF and/or Condor
- writing the information providers according to the GLUE 2.1 schema
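For reference, SLURM already exposes a generic-resource request that maps naturally onto the GPUNumber attribute; the command below uses standard SLURM syntax, but the mapping to JDL is an assumption of this plan, not an implemented feature:

```shell
# SLURM analogue of the Torque request "qsub -l nodes=1:gpus=2":
# request two GPUs via the generic-resources (GRES) mechanism.
sbatch --nodes=1 --gres=gpu:2 test_gpu.sh
# SLURM has no per-job option comparable to Torque's NVML compute-mode
# suffix (e.g. ":exclusive_process"); GPU exclusivity is usually
# handled at node/partition configuration level instead.
```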