GPGPU-CREAM

From EGIWiki
Jump to navigation Jump to search
 
(32 intermediate revisions by the same user not shown)
Line 20: Line 20:



Latest revision as of 13:31, 15 May 2017

Goal

  • To develop a solution enabling GPU support in CREAM-CE:
  1. For the most popular LRMSes already supported by CREAM-CE
  2. Based on GLUE 2.1 schema

Work plan

  1. Identifying the relevant GPGPU-related parameters supported by the different LRMSes, and abstracting them into meaningful JDL attributes (a sketch of this mapping is given after this list)
  2. GPGPU accounting is expected to be provided by LRMS log files, as done for CPU accounting, and then to follow the same APEL flow
  3. Implementing the needed changes in CREAM-core and BLAH components
  4. Writing the infoproviders according to GLUE 2.1
  5. Testing and certification of the prototype
  6. Releasing a CREAM-CE update with full GPGPU support
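
As an illustration of item 1 above, the sketch below shows how an abstract request for two GPUs could translate into the native directives of the batch systems mentioned on this page. Only the Torque form is taken from this page's own examples; the Slurm, HTCondor and SGE lines are assumptions given for illustration (in particular the SGE consumable name "gpu" is site-defined), not the mapping actually implemented.

# Torque/pbs_sched (as tested at CIRMMP):
qsub -l nodes=1:gpus=2 job.sh
# Slurm (e.g. the ARNES cluster), assuming the generic-resource plugin is configured:
sbatch --gres=gpu:2 job.sh
# HTCondor (e.g. the GRIF cluster), in the submit description file:
#   request_gpus = 2
# SGE (e.g. QMUL), assuming a consumable complex named "gpu" is defined:
qsub -l gpu=2 job.sh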

Testbed

  • 3 nodes (2x Intel Xeon E5-2620v2) with 2 NVIDIA Tesla K20m GPUs per node available at CIRMMP
    • MoBrain applications installed: AMBER and GROMACS with CUDA 5.5
    • Batch system/Scheduler: Torque 4.2.10 (source compiled with NVML libs) / Maui 3.3.1 (the NVML build can be checked as shown after this list)
    • EMI3 CREAM-CE
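
Since Torque on this testbed is compiled against the NVML libraries, the per-node GPU state (including the gpu_mode value referred to under Progress) can be inspected from the batch server; the node name below is only a placeholder.

# Query a worker node's GPU report; with an NVML-enabled Torque build the
# output contains a gpu_status line reporting, among other fields, gpu_mode.
pbsnodes gpu-wn01.example.org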

Progress

  • May 2015
    • tested local AMBER job submission with pbs_sched as scheduler (i.e. not using Maui) and various Torque/NVIDIA GPGPU support options (http://docs.adaptivecomputing.com/torque/4-0-2/Content/topics/3-nodes/NVIDIAGPGPUs.htm), e.g.:
qsub -l nodes=1:gpus=2:default
qsub -l nodes=1:gpus=2:exclusive_process
  • June 2015
    • added support for the JDL attributes GPUNumber and GPUMode: this required modifications to blah_common_submit_functions.sh and server.c
    • first implementation of the two new attributes for Torque/pbs_sched:
BLAH_JOB_SUBMIT 2 [Cmd="test_gpu_blah.sh";GridType="pbs";Queue="batch";In="/dev/null";Out="StdOutput";Err="StdError";GPUNumber=1;GPUMode="exclusive_process"]
GPUMode refers to the NVML COMPUTE mode (the gpu_mode variable in the pbsnodes output) and can have the following values for Torque/pbs_sched:
- default - Shared mode available for multiple processes
- exclusive_thread - Only one COMPUTE thread is allowed to run on the GPU (v260 exclusive)
- prohibited - No COMPUTE contexts are allowed to run on the GPU
- exclusive_process - Only one COMPUTE process is allowed to run on the GPU
this required modifications to pbs_submit.sh (a sketch of the translation is given below)
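The fragment below is only a minimal sketch of the kind of translation added to pbs_submit.sh, assuming the submit script exposes the two attributes as shell variables named bls_opt_gpunumber and bls_opt_gpumode and appends PBS directives to a temporary submit file $bls_tmp_file; all three names are hypothetical, chosen for illustration rather than taken from the released BLAH code.
# Hypothetical sketch: turn the GPU attributes into a Torque resource request.
if [ -n "$bls_opt_gpunumber" ] && [ "$bls_opt_gpunumber" -gt 0 ]; then
    gpu_req="gpus=$bls_opt_gpunumber"
    # append the NVML compute mode (default, exclusive_thread, prohibited, exclusive_process) when requested
    [ -n "$bls_opt_gpumode" ] && gpu_req="$gpu_req:$bls_opt_gpumode"
    echo "#PBS -l nodes=1:$gpu_req" >> "$bls_tmp_file"
fi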
  • July-August 2015
    • implemented the parser on CREAM core for the new JDL attributes GPUNumber and GPUMode
    • tested AMBER remote job submission through the glite-ce-job-submit client (a follow-up example of status and output retrieval is given after this entry):
$ glite-ce-job-submit -o jobid.txt -d -a -r cegpu.cerm.unifi.it:8443/cream-pbs-batch test.jdl
$ cat test.jdl
[
 executable = "test_gpu.sh";
 inputSandbox = { "test_gpu.sh" };
 stdoutput = "out.out";
 outputsandboxbasedesturi = "gsiftp://localhost";
 stderror = "err.err";
 outputsandbox = { "out.out","err.err","min.out","heat.out" };
 GPUNumber=2;
 GPUMode="exclusive_process";
]
    • However, since the possibility for the user to set the NVML Compute mode is not available on all batch systems, it was decided not to support the GPUMode JDL attribute in the future production release.
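After a submission like the one above succeeds, the job would typically be followed up with the standard CREAM client commands shown below; this is only a usage reminder, with the job identifier taken from the jobid.txt file written by the -o option of glite-ce-job-submit.
$ glite-ce-job-status -i jobid.txt
$ glite-ce-job-output -i jobid.txt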
  • September 2015
    • contacted STFC/Emerald (LSF8 based GPU cluster), IFCA (SGE based GPU cluster) and INFN-CNAF (LSF9 based GPU cluster) to ask whether they could provide a testing instance for extending CREAM GPU support to the LSF and SGE LRMSes
    • continued troubleshooting CREAM prototype with AMBER application at CIRMMP cluster
    • analysed the GLUE2.1 schema as a base for writing the GPU-aware Torque info providers
    • An update report of GPGPU support in CREAM-CE (https://indico.egi.eu/indico/getFile.py/access?contribId=5&resId=0&materialId=slides&confId=2380) was presented at the OMB
  • October 2015
    • Enabled Maui scheduler at CIRMMP cluster and started the development of a CREAM prototype with GPGPU support for Torque/Maui. Problem: it seems there is no way to set the NVML Compute mode as is done in Torque/pbs_sched through the qsub -l (see above)
    • Troubleshooting of CREAM prototype.
    • Derek Ross of STFC/Emerald (LSF8 based GPU cluster) activated an account for accessing the Emerald GPGPU testbed, in order to start GPGPU-enabled CREAM prototyping for LSF.
    • A thread on the WP4.4 and VT-GPGPU mailing lists was started with the APEL team to investigate how to address GPGPU accounting for Torque and other LRMSes.
  • November 2015
    • Obtained availability for testing from ARNES (Slurm based GPU cluster), Queen Mary University of London (QMUL) (SGE based cluster with OpenCL compatible AMD GPUs) and GRIF (HTCondor based GPU cluster)
    • GPGPU EGI Activities (http://indico.cern.ch/event/319753/contribution/8/attachments/1181551/1710838/GDB-04Nov15.pdf) presented at the WLCG Grid Deployment Board
    • Accounting issues further discussed at the EGI CF Accelerated Computing Session in Bari (https://indico.egi.eu/indico/sessionDisplay.py?sessionId=47&confId=2544#20151111)
    • Continued testing and troubleshooting of the GPU-enabled CREAM-CE prototype at the CIRMMP testbed.
    • Implemented another MoBrain use-case (https://wiki.egi.eu/wiki/Competence_centre_MoBrain#How_to_run_the_DisVis_docker_image_on_the_enmr.eu_VO): the dockerized DisVis application (http://www.ncbi.nlm.nih.gov/pubmed/26026169) has been properly installed at the CIRMMP testbed. MoBrain users (through the enmr.eu VO) can now run DisVis exploiting the GPU cluster at CIRMMP via the GPU-enabled CREAM-CE
    • Started preparing the certification process: investigating the use of IM, a UPV tool (http://www.grycap.upv.es/im/), for automatically deploying clusters on the EGI FedCloud to be used for the GPU-enabled CREAM-CE certification
  • December 2015
    • Coordination with CESNET/MU partner of MoBrain CC in order to:
      • add their GPU nodes to the cloud testbed (i.e. via OCCI, not interested in extending the grid testbed)
      • preparing a GPU-enabled VM image with Gromacs and Amber, and testing it in the FedCloud
      • cloning the Gromacs and Amber WeNMR portals and interfacing them with the testbed above
    • managed to use the IM tool to deploy Torque, Slurm, SGE and HTCondor clusters on the cloud. Done for CentOS6 on servers without GPUs. Plan to do the same on GPU servers available through the IISAS cloud site.
  • January 2016
    • Supporting the MoBrain CC team at CIRMMP to port their AMBER portal to talk with GPGPU-enabled CREAM-CE
    • Contacted APEL for GPGPU accounting news: they are producing a short document spelling out a couple of scenarios of how we might proceed with grid and cloud
  • February 2016
    • Supporting the MoBrain CC team at CIRMMP to port their AMBER portal to talk with GPGPU-enabled CREAM-CE
    • Contributing to deliverables D4.6 and D6.7 and Periodic Report
  • March 2016
    • GPGPU-enabled CREAM-CE prototype implemented and tested at GRIF HTCondor based GPU/MIC cluster
    • added support for two new JDL attributes MICNumber and GPUModel, expected to be supported by the latest versions of the HTCondor, Slurm and LSF batch systems (see the illustrative JDL below).
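As a purely illustrative example of the new attributes (the attribute names come from this page, while the GPUModel value is a hypothetical string, since the exact model identifiers depend on what each site publishes), a JDL combining GPUModel with GPUNumber could look like:
[
 executable = "test_gpu.sh";
 inputSandbox = { "test_gpu.sh" };
 stdoutput = "out.out";
 stderror = "err.err";
 outputsandbox = { "out.out","err.err" };
 outputsandboxbasedesturi = "gsiftp://localhost";
 GPUNumber=1;
 GPUModel="teslaK20m";
]
MICNumber would be used in the same way to request Xeon Phi (MIC) coprocessors instead of GPUs.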
  • April 2016
    • GPGPU-enabled CREAM-CE prototype implemented and tested at ARNES Slurm based GPU cluster and QMUL SGE based GPU cluster
    • Task activity presented at EGI Conference 2016
    • GLUE2.1 draft updated with relevant Accelerator card specific attributes
  • May 2016
    • Participated in the GLUE-WG meeting of 17 May and updated the GLUE2.1 draft (https://cernbox.cern.ch/index.php/s/JPGIMJunHMl37Bo)
    • Plans to implement a prototype for the infosys based on this GLUE2.1 draft
    • Future official approval of GLUE 2.1 would occur after the specification is revised based on prototype lessons learned

Next steps

  • Work on enabling support for accelerated computing in the grid environment officially stopped at the end of May 2016.
  • The CREAM developers team committed to producing a major CREAM-CE release for CentOS7 systems with the new accelerated computing capabilities by June 2017 (see the statement at https://wiki.italiangrid.it/CREAM).

Back to Accelerated Computing task (https://wiki.egi.eu/wiki/EGI-Engage:TASK_JRA2.4_Accelerated_Computing)