Difference between revisions of "GPGPU-CREAM"

Revision as of 14:26, 4 September 2015


Goal

  • To develop a solution enabling GPU support in CREAM-CE:
  1. For the most popular LRMSes already supported by CREAM-CE
  2. Based on GLUE 2.1 schema

Work plan

  1. Identifying the relevant GPGPU-related parameters supported by the different LRMSes, and abstracting them to meaningful JDL attributes
  2. GPGPU accounting is expected to be provided by the LRMS log files, as is done for CPU accounting, and then to follow the same APEL flow
  3. Implementing the needed changes in CREAM-core and BLAH components
  4. Writing the infoproviders according to GLUE 2.1
  5. Testing and certification of the prototype
  6. Releasing a CREAM-CE update with full GPGPU support

Testbed

  • 3 nodes (2x Intel Xeon E5-2620v2) with 2 NVIDIA Tesla K20m GPUs per node available at CIRMMP
    • MoBrain applications installed: AMBER and GROMACS with CUDA 5.5
    • Batch system/Scheduler: Torque 4.2.10 (source compiled with NVML libs)/ Maui 3.3.1
    • EMI3 CREAM-CE

Progress

qsub -l nodes=1:gpus=2:default
qsub -l nodes=1:gpus=2:exclusive_process
  • June 2015:
    • attributes "GPUNumber" and "GPUMode" added to the BLAH_JOB_SUBMIT command, e.g.:
BLAH_JOB_SUBMIT 2 [Cmd="/tmp/test.sh";GridType="pbs";Queue="batch";In="/dev/null";Out="~\/StdOutput";Err="~\/StdError";GPUNumber=1;GPUMode="default"]
BLAH_JOB_SUBMIT 2 [Cmd="test_gpu_blah.sh";GridType="pbs";Queue="batch";In="/dev/null";Out="StdOutput";Err="StdError";GPUNumber=1;GPUMode="exclusive_process"]
this required modifications to blah_common_submit_functions.sh and server.c
    • first implementation of the two new attributes for PBS/Torque:
GPUMode can have the following values for PBS/Torque:
- default - Shared mode available for multiple processes
- exclusive_thread - Only one COMPUTE thread is allowed to run on the GPU (v260 exclusive)
- prohibited - No COMPUTE contexts are allowed to run on the GPU
- exclusive_process - Only one COMPUTE process is allowed to run on the GPU
this required modifications to pbs_submit.sh
  • July-August 2015:
    • implemented the parser on CREAM core for the new JDL attributes GPUNumber and GPUMode
    • tested AMBER remote job submission through the glite-ce-job-submit client:
$ glite-ce-job-submit -o jobid.txt -d -a -r cegpu.cerm.unifi.it:8443/cream-pbs-batch test.jdl
$ cat test.jdl
[
 executable = "test_gpu.sh";
 inputSandbox = { "test_gpu.sh" };
 stdoutput = "out.out";
 outputsandboxbasedesturi = "gsiftp://localhost";
 stderror = "err.err";
 outputsandbox = { "out.out","err.err","min.out","heat.out" };
 GPUNumber=2;
 GPUMode="exclusive_process";
]
  • September 2015
    • contacted STFC/Emerald (LSF8-based GPU cluster), IFCA (SGE-based GPU cluster) and INFN-CNAF (LSF9-based GPU cluster) to ask whether they can provide a testing instance for extending CREAM GPU support to the LSF and SGE LRMSes
    • troubleshooting the CREAM prototype with the AMBER application on the CIRMMP cluster
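
As an illustration of the PBS/Torque mapping described under June 2015, the two attributes could be turned into a qsub resource request roughly as follows. This is only a minimal sketch and not the code added to pbs_submit.sh: the function name gpu_resource_spec, the fixed nodes=1 and the underscore spelling exclusive_thread are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical helper (not taken from pbs_submit.sh): build the Torque
# "-l" resource string from the GPUNumber and GPUMode attributes.
# Assumption: a single node is requested, as in the qsub examples above.
gpu_resource_spec() {
    gpu_number=$1
    gpu_mode=${2:-default}   # GPUMode falls back to "default" when omitted

    # GPUNumber must be a positive integer.
    case $gpu_number in
        ''|*[!0-9]*|0)
            echo "invalid GPUNumber: $gpu_number" >&2
            return 1 ;;
    esac

    # GPUMode must be one of the four values listed for PBS/Torque.
    case $gpu_mode in
        default|exclusive_thread|prohibited|exclusive_process) ;;
        *)
            echo "invalid GPUMode: $gpu_mode" >&2
            return 1 ;;
    esac

    printf 'nodes=1:gpus=%s:%s\n' "$gpu_number" "$gpu_mode"
}

# Usage: print the resulting command line instead of submitting a job.
spec=$(gpu_resource_spec 2 exclusive_process) && echo "qsub -l $spec"
```

Rejecting unknown values before calling qsub keeps a malformed JDL attribute from reaching the batch system as an opaque resource string.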

Next steps

  • verifying whether the parameters identified for Torque have analogues in SLURM/SGE/LSF/Condor, or whether additional parameters have to be abstracted to JDL attributes
  • writing the BLAH components for SLURM and/or SGE and/or LSF and/or Condor
  • writing the information providers according to the GLUE 2.1 schema
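
To make the first point concrete, the sketch below shows what checking for an analogue might look like for SLURM. It is not part of the prototype: the function name slurm_gpu_options is hypothetical, GPUNumber is mapped to SLURM's generic-resource option --gres=gpu:N (which assumes the site has configured a "gpu" GRES), and no SLURM counterpart for GPUMode is assumed, so any non-default mode is only reported.

```shell
#!/bin/sh
# Hypothetical helper (illustration only): translate the GPU attributes
# into sbatch options. Assumes the cluster defines a "gpu" generic resource.
slurm_gpu_options() {
    gpu_number=$1
    gpu_mode=${2:-default}

    # GPUNumber must be a positive integer.
    case $gpu_number in
        ''|*[!0-9]*|0)
            echo "invalid GPUNumber: $gpu_number" >&2
            return 1 ;;
    esac

    # Request N GPUs per node through SLURM's generic resources.
    printf '%s\n' "--gres=gpu:$gpu_number"

    # This sketch has no mapping for GPUMode, so a non-default value is
    # reported on stderr to make the gap visible.
    if [ "$gpu_mode" != default ]; then
        echo "GPUMode=$gpu_mode has no mapping in this sketch" >&2
    fi
}

# Usage: print the option that would be passed to sbatch.
slurm_gpu_options 2 default
```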