GPGPU-FedCloud
Note:
This page is used to track the GPGPU support-related activities.
For user information, check Federated Cloud GPGPU.
Providers willing to expose GPGPU resources should check the documentation below.
Objective
To provide support for accelerated computing in the EGI-Engage federated cloud.
Participants
Viet Tran (IISAS) viet.tran _at_ savba.sk
Jan Astalos (IISAS)
Miroslav Dobrucky (IISAS)
Current status
Status of the OpenNebula site: wiki.egi.eu/wiki/GPGPU-OpenNebula
The IISAS-GPUCloud site with GPGPU support has been established and integrated into the EGI federated cloud.
HW configuration:
6 computing nodes: IBM dx360 M4 servers, each with two NVIDIA Tesla K20 accelerators.
Ubuntu 14.04.2 LTS with KVM/QEMU, PCI passthrough virtualization of the GPU cards.
SW configuration:
Base OS: Ubuntu 14.04.2 LTS
Hypervisor: KVM
Middleware: OpenStack Liberty
GPU-enabled flavors: gpu1cpu6 (1 GPU + 6 CPU cores), gpu2cpu12 (2 GPUs + 12 CPU cores)
EGI federated cloud configuration:
GOCDB: IISAS-GPUCloud, https://goc.egi.eu/portal/index.php?Page_Type=Site&id=1485
Monitoring: https://cloudmon.egi.eu/nagios/cgi-bin/status.cgi?host=nova3.ui.savba.sk
OpenStack endpoint: https://keystone3.ui.savba.sk:5000/v2.0
OCCI endpoint: https://nova3.ui.savba.sk:8787/occi1.1/
Supported VOs: fedcloud.egi.eu, ops, dteam, moldyngrid, enmr.eu, vo.lifewatch.eu, acc-comp.egi.eu
Applications being tested/running on IISAS-GPUCloud
MolDynGrid: http://moldyngrid.org/
WeNMR: https://www.wenmr.eu/
Lifewatch-CC: https://wiki.egi.eu/wiki/CC-LifeWatch
For information and support, please contact us via cloud-admin _at_ savba.sk
How to use GPGPU on IISAS-GPUCloud
For EGI users:
Join the EGI federated cloud: https://wiki.egi.eu/wiki/Federated_Cloud_user_support#Quick_Start
Install the rOCCI client if you don't have it already (on Linux, a single command: "curl -L http://go.egi.eu/fedcloud.ui | sudo /bin/bash -")
Get a VOMS proxy certificate from fedcloud.egi.eu or any supported VO with -rfc (on the rOCCI client: "voms-proxy-init --voms fedcloud.egi.eu -rfc")
Choose a suitable flavor with GPU (e.g. gpu1cpu6; OCCI users: resource_tpl#f0cd78ab-10a0-4350-a6cb-5f3fdd6e6294)
Choose a suitable image (e.g. Ubuntu-14.04-UEFI; OCCI users: os_tpl#8fc055c5-eace-4bf2-9f87-100f3026227e)
Create a keypair for logging in to your server and store it in the tmpfedcloud.login context file
(see https://wiki.egi.eu/wiki/Fedcloud-tf:CLI_Environment#How_to_create_a_key_pair_to_access_the_VMs_via_SSH and the sketch at the end of this procedure)
Create a VM with the selected image, flavor and keypair (OCCI users: copy the following long OCCI command):
occi --endpoint https://nova3.ui.savba.sk:8787/occi1.1/ \
 --auth x509 --user-cred $X509_USER_PROXY --voms --action create --resource compute \
 --mixin os_tpl#8fc055c5-eace-4bf2-9f87-100f3026227e --mixin resource_tpl#f0cd78ab-10a0-4350-a6cb-5f3fdd6e6294 \
 --attribute occi.core.title="Testing GPU" \
 --context user_data="file://$PWD/tmpfedcloud.login"
Remark: check the proper os_tpl ID with
occi --endpoint https://nova3.ui.savba.sk:8787/occi1.1/ \
 --auth x509 --user-cred $X509_USER_PROXY --voms --action describe --resource os_tpl | grep -A1 Ubuntu-14
Assign a public (floating) IP to your VM (using the VM ID from the previous command and /occi1.1/network/PUBLIC):
occi --endpoint https://nova3.ui.savba.sk:8787/occi1.1/ \
 --auth x509 --user-cred $X509_USER_PROXY --voms --action link \
 --resource https://nova3.ui.savba.sk:8787/occi1.1/compute/$YOUR_VM_ID_HERE -j /occi1.1/network/PUBLIC
Log in to the VM with your private key and use it as your own GPU server (ssh -i tmpfedcloud cloudadm@$VM_PUBLIC_IP)
Remark: please update the VM's OS immediately: sudo apt-get update && unattended-upgrade; sudo reboot
Delete your VM to release resources for other users:
occi --endpoint https://nova3.ui.savba.sk:8787/occi1.1/ \
 --auth x509 --user-cred $X509_USER_PROXY --voms --action delete \
 --resource https://nova3.ui.savba.sk:8787/occi1.1/compute/$YOUR_VM_ID_HERE
Please remember to delete/terminate your servers when you finish your jobs to release resources for other users
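A minimal sketch of the keypair/context-file step above, assuming the context file follows the standard cloud-init #cloud-config user-data format and uses the "cloudadm" login account mentioned in the note below; the file names match the example above, but the exact file used at IISAS may differ:
ssh-keygen -t rsa -b 2048 -f tmpfedcloud -N ""   # creates tmpfedcloud (private key) and tmpfedcloud.pub
cat > tmpfedcloud.login <<EOF
#cloud-config
users:
  - name: cloudadm
    sudo: ALL=(ALL) NOPASSWD:ALL
    ssh-authorized-keys:
      - $(cat tmpfedcloud.pub)
EOF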
For access to IISAS-GPUCloud via portal:
Get a token issued by Keystone with your VOMS proxy certificate. You can use the tool from https://github.com/tdviet/Keystone-VOMS-client
Log in to the OpenStack Horizon dashboard with the token via https://horizon.ui.savba.sk/horizon/auth/token/
Create and manage VMs using the portal.
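For reference, a hedged sketch of what the token request looks like under the hood, assuming the site exposes the Keystone-VOMS extension on the v2.0 API and the IGTF CA certificates are installed under /etc/grid-security/certificates (the Keystone-VOMS-client tool above wraps this for you; the tenant name is illustrative):
curl -s https://keystone3.ui.savba.sk:5000/v2.0/tokens \
 --cert "$X509_USER_PROXY" --key "$X509_USER_PROXY" \
 --capath /etc/grid-security/certificates \
 -H "Content-Type: application/json" \
 -d '{"auth": {"voms": true, "tenantName": "fedcloud.egi.eu"}}'
# the token id is returned in the JSON response under access.token.id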
Note: All network connections to/from VMs are logged and monitored by IDS. If users have long computation, please inform us ahead. VMs with longer inactivity will be deleted for releasing resources The default user account for VM created from Ubuntu-based images via Horizon is "ubuntu". The default user account for VM created by rOCCI is defined in the context file "tmpfedcloud.login"
How to create your own GPGPU server in cloud
These are short instructions for creating a GPGPU server in the cloud from a vanilla Ubuntu image.
Create a VM from a vanilla image with UEFI support (e.g. Ubuntu-14.04-UEFI), making sure to use a flavor with GPU support
Install gcc, make and the extra kernel modules: "apt-get update; apt-get install gcc make linux-image-extra-virtual"
Choose and download the correct driver from http://www.nvidia.com/Download/index.aspx and upload it to the VM
Install the NVIDIA driver: "dpkg -i nvidia-driver-local-repo-ubuntu*_amd64.deb" (or "./NVIDIA-Linux-x86_64-*.run")
Download the CUDA toolkit from https://developer.nvidia.com/cuda-downloads (choose the deb format for a smaller download)
Install the CUDA toolkit: "dpkg -i cuda-repo-ubuntu*_amd64.deb; apt-get update; apt-get install cuda" (very large install, 650+ packages, takes ~15 minutes)
Set the environment, e.g. "export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}; export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
(a consolidated sketch of these commands is shown after this list)
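The same steps collected into a single sketch, to be run as root on the freshly created VM. Assumptions: the package file names depend on the exact driver/CUDA versions you downloaded, and writing the environment variables to /etc/profile.d/cuda.sh is a convenience not stated in the original instructions:
apt-get update
apt-get install -y gcc make linux-image-extra-virtual
dpkg -i nvidia-driver-local-repo-ubuntu*_amd64.deb   # driver repo package uploaded to the VM
dpkg -i cuda-repo-ubuntu*_amd64.deb                  # CUDA repo package
apt-get update
apt-get install -y cuda                              # very large install, ~15 minutes
cat > /etc/profile.d/cuda.sh <<'EOF'
export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
EOF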
Your server is now ready for your application. You can install additional software (NAMD, GROMACS, ...) and your own applications.
For your convenience, a script for installing NVIDIA + CUDA automatically is available at https://github.com/tdviet/NVIDIA_CUDA_installer
Be sure to make a snapshot of your server for later use. You may need to suspend your server before creating the snapshot (due to KVM passthrough).
Do not terminate your server before creating the snapshot; the whole server is deleted when it is terminated.
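A hypothetical sketch of the suspend/snapshot/resume sequence using the OpenStack nova CLI (on IISAS-GPUCloud you would typically do the same from the Horizon dashboard; VM_ID and the snapshot name are placeholders):
nova suspend "$VM_ID"                               # suspend first, recommended with PCI passthrough
nova image-create --poll "$VM_ID" my-gpu-snapshot   # create the snapshot image
nova resume "$VM_ID"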
Other scripts for creating GPGPU servers with NVIDIA + CUDA on the cloud via OCCI, cloud-init and Ansible roles have been developed as a result of a collaboration with INDIGO and West-life, and are available at http://about.west-life.eu/network/west-life/documentation/egi-platforms/accelerated-computing-platforms
Verify if CUDA is correctly installed
]$ sudo apt-get install cuda-samples-8-0
]$ cd /usr/local/cuda-8.0/samples/1_Utilities/deviceQuery
]$ sudo make
/usr/local/cuda-8.0/bin/nvcc -ccbin g++ -I../../common/inc -m64 -gencode arch=compute_20,code=sm_20 -gencode
[..]
mkdir -p ../../bin/x86_64/linux/release
cp deviceQuery ../../bin/x86_64/linux/release
]$ ./deviceQuery
./deviceQuery Starting...
 CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Tesla K20m"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 4743 MBytes (4972937216 bytes)
  (13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
  GPU Max Clock rate:                            706 MHz (0.71 GHz)
  Memory Clock rate:                             2600 Mhz
  Memory Bus Width:                              320-bit
  L2 Cache Size:                                 1310720 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):     (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 7
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Tesla K20m
Result = PASS
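Independently of the CUDA samples, a quick sanity check (assuming the NVIDIA driver installed correctly) is to query the driver directly:
nvidia-smi    # should list the Tesla K20m and the installed driver version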
How to enable GPGPU passthrough in OpenStack
For admins of cloud providers
On the computing node, get the vendor/product ID of your hardware: "lspci | grep NVIDIA" to get the PCI slot of the GPU, then "virsh nodedev-dumpxml pci_xxxx_xx_xx_x"
On the computing node, unbind the device from the host kernel driver
On the computing node, add "pci_passthrough_whitelist = {"vendor_id":"xxxx","product_id":"xxxx"}" to nova.conf
On the controller node, add "pci_alias = {"vendor_id":"xxxx","product_id":"xxxx", "name":"GPU"}" to nova.conf
On the controller node, enable PciPassthroughFilter in the scheduler
Create new flavors with "pci_passthrough:alias" (or add the key to an existing flavor), e.g. nova flavor-key m1.large set "pci_passthrough:alias"="GPU:2"
(an illustrative end-to-end sketch is shown after this list)
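An illustrative end-to-end sketch for a Tesla K20m; the vendor/product IDs (10de:1028), the PCI address and the pci-stub unbind method are assumptions to be adapted to the output of lspci and to your OpenStack version:
lspci -nn | grep -i nvidia                                # e.g. 07:00.0 ... NVIDIA ... [10de:1028]
echo "10de 1028" > /sys/bus/pci/drivers/pci-stub/new_id   # let pci-stub claim the device
echo "0000:07:00.0" > /sys/bus/pci/devices/0000:07:00.0/driver/unbind
echo "0000:07:00.0" > /sys/bus/pci/drivers/pci-stub/bind
# nova.conf on the computing node:
#   pci_passthrough_whitelist = {"vendor_id":"10de","product_id":"1028"}
# nova.conf on the controller node:
#   pci_alias = {"vendor_id":"10de","product_id":"1028", "name":"GPU"}
#   scheduler_default_filters = <existing filters>,PciPassthroughFilter
nova flavor-key gpu2cpu12 set "pci_passthrough:alias"="GPU:2"   # 2 GPUs per instance of this flavor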
How to transfer the ownership (user) of a VM in OpenStack
For admins of cloud providers
On the controller node, execute this script:
#!/bin/bash
#original author: https://ask.openstack.org/en/users/12462/suriyanath/
#source: https://ask.openstack.org/en/question/28026/how-to-move-a-instance-between-projects/
#--defaults-file=pw contains: user, password and other settings
[ -z "$*" ] && echo "Usage: $0 destination_user_id VM-1_id VM-2_id ..... VM-x_id" && exit 1
[ -z "$2" ] && echo "Usage: $0 destination_user_id VM-1_id VM-2_id ..... VM-x_id" && exit 1
for i
do
  if [ "$i" != "$1" ]; then
    echo "moving instance id " $i " to user id" $1;
    mysql --defaults-file=pw <<query
use nova;
update instances set user_id="$1" where uuid="$i";
query
  else
    #get project id of the instance before update
    proj_id=$(mysql --defaults-file=pw <<query
use nova;
select project_id from instances where uuid="$2";
query
)
    #get user id of the instance before update
    old_user_id=$(mysql --defaults-file=pw <<query
use nova;
select user_id from instances where uuid="$2";
query
)
    echo "original_user=" $old_user_id project_id=$proj_id
  fi
done
And the "pw" file contains the following lines:
[mysql]
user=root
password=********
silent
skip-column-names
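For example, with the script saved as move_vm_owner.sh (the file name is illustrative) and the pw file in the working directory, the ownership of two VMs is transferred with:
./move_vm_owner.sh <destination_user_id> <vm_uuid_1> <vm_uuid_2>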
Progress
- May 2015
- Review of available technologies
- GPGPU virtualisation in KVM/QEMU
- Performance testing of passthrough
HW configuration: IBM dx360 M4 server with two NVIDIA Tesla K20 accelerators. Ubuntu 14.04.2 LTS with KVM/QEMU, PCI passthrough virtualization of GPU cards.
Tested application: NAMD molecular dynamics simulation (CUDA version), STMV test example (http://www.ks.uiuc.edu/Research/namd/).
Performance results: The tested application runs 2-3% slower in a virtual machine compared to a direct run on the tested server. If hyperthreading is enabled on the compute server, vCPUs have to be pinned to real cores so that whole cores are dedicated to one VM. To avoid potential performance problems, hyperthreading should be switched off.
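A hedged sketch of one way to enforce such pinning with OpenStack Nova; the page does not state which mechanism was actually used, and the core range and flavor name are illustrative:
# nova.conf on the compute node: reserve host cores for guest vCPUs (illustrative range)
#   vcpu_pin_set = 2-11
# request dedicated (pinned) vCPUs for a GPU flavor
nova flavor-key gpu1cpu6 set hw:cpu_policy=dedicated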
- June 2015
- Creating cloud site with GPGPU support
Configuration: master node, 2 worker nodes (IBM dx360 M4 servers, see above)
Base OS: Ubuntu 14.04.2 LTS
Hypervisor: KVM
Middleware: OpenStack Kilo
- July 2015
- Creating cloud site with GPGPU support
Cloud site created at keystone3.ui.savba.sk, master + two worker nodes, configuration reported above
Creating VM images for GPGPU (based on Ubuntu 14.04, GPU driver and libraries)
- August 2015
- Testing cloud site with GPGPU support
Performance testing and tuning with GPGPU in OpenStack:
- comparing performance of cloud-based VM with non-cloud virtualization and physical machine, finding discrepancies and tuning them
- setting CPU flavor in OpenStack Nova (performance optimization)
- adjusting the OpenStack scheduler
Starting the process of integration of the site into EGI FedCloud:
- Keystone VOMS support being integrated
- OCCI in preparation, installation planned in September
- September 2015
Continue integration to EGI-FedCloud
- October 2015
Full integration into EGI FedCloud, currently in the certification process
Support for the moldyngrid, enmr.eu and vo.lifewatch.eu VOs
- November 2015
Created a new authentication module for logging into the Horizon dashboard via a Keystone token
Various client tools: getting a token, installing NVIDIA + CUDA
Participation in the EGI Community Forum in Bari
Site certified
- December 2015
User support: adding and testing images from various VOs, solving problems with multiple-VO users
Maintenance: security updates and minor improvements
- January 2016
Testing + performance tuning of OpenCL
Updating images with CUDA
Adding OpenStack Ceilometer for better resource monitoring/accounting
- February-March 2016
Testing VM migration
Examining GLUE schemas
Examining accounting formats and tools
- April 2016
Status report presented at EGI Conference 2016
- May 2016
GLUE 2.1 draft (https://cernbox.cern.ch/index.php/s/JPGIMJunHMl37Bo) discussed at the GLUE-WG meeting and updated with relevant accelerator-card-specific attributes.
GPGPU experimental support enabled on the CESNET-Metacloud site. VMs with Tesla M2090 GPU cards tested with the DisVis program.
Working on support for GPUs with the LXC/LXD hypervisor in OpenStack, which would provide better performance than KVM.
- Next steps
Production, application support
Cooperation with the APEL team on accounting of GPUs
Generating II according to GLUE 2.1