GPGPU-OpenNebula

Revision as of 12:01, 19 December 2018
Objective
To provide a testing Cloud site based on the OpenNebula middleware for testing GPGPU support.
Current status
The IISAS-Nebula site has been integrated into the EGI Federated Cloud and is accessible via the acc-comp.egi.eu VO.
HW configuration:
- Management services: OpenNebula Cloud controller and Site BDII in virtual servers on an IBM System x3250 M5 (1x Intel(R) Xeon(R) CPU E3-1241 v3 @ 3.50GHz, 16 GB RAM, 1 TB disk)
- 2 computing nodes: IBM dx360 M4 servers with two NVIDIA Tesla K20 accelerators each; CentOS 7 with KVM/QEMU and PCI passthrough virtualization of the GPU cards
- 2.8 TB block storage via NFS
SW configuration:
- Base OS: CentOS 7
- Hypervisor: KVM
- Middleware: OpenNebula 5.4
- OCCI server: rOCCI-server 2.0.4
GPU-enabled flavors:
- mem_extra_large_gpu: Extra Large Instance - 8 cores and 32 GB RAM + NVIDIA Tesla K20m GPU
- mem_large_gpu: Large Instance - 4 cores and 16 GB RAM + NVIDIA Tesla K20m GPU
- mem_medium_gpu: Medium Instance - 2 cores and 8 GB RAM + NVIDIA Tesla K20m GPU
- mem_small_gpu: Small Instance - 1 core and 4 GB RAM + NVIDIA Tesla K20m GPU
- large_gpu: Large Instance - 4 cores and 4 GB RAM + NVIDIA Tesla K20m GPU
- medium_gpu: Medium Instance - 2 cores and 2 GB RAM + NVIDIA Tesla K20m GPU
- small_gpu: Small Instance - 1 core and 1 GB RAM + NVIDIA Tesla K20m GPU
EGI federated cloud configuration:
- GOCDB: IISAS-Nebula, https://goc.egi.eu/portal/index.php?Page_Type=Site&id=1785
- ARGO monitoring: http://argo.egi.eu/lavoisier/status_report-sf?site=IISAS-Nebula&report=Critical&accept=html
- OCCI endpoint: https://nebula2.ui.savba.sk:11443/
- EGI AppDB: https://appdb.egi.eu/store/site/iisas-nebula
- Supported VOs: biomed, acc-comp.egi.eu, ops, dteam
How to use IISAS-Nebula site
- Join Accelerated_computing_VO
- The VO acc-comp.egi.eu is dedicated to users who develop and test applications/VMs that use GPGPU or other types of accelerated computing.
- Install rOCCI client
- More information about installing and using the rOCCI CLI can be found at HOWTO11_How_to_use_the_rOCCI_Client
- Get RFC proxy certificate from acc-comp.egi.eu VOMS server
$ voms-proxy-init --voms acc-comp.egi.eu -rfc
- Choose a suitable flavor from the list above
- Alternatively you can list the available resource flavors using OCCI client:
$ occi --endpoint https://nebula2.ui.savba.sk:11443/ --auth x509 --user-cred $X509_USER_PROXY --voms \
    --action describe --resource resource_tpl
- Choose a suitable image from the list of supported Virtual Appliance images
- The up-to-date list can be found at EGI AppDB or using OCCI client:
$ occi --endpoint https://nebula2.ui.savba.sk:11443/ --auth x509 --user-cred $X509_USER_PROXY --voms \
    --action describe --resource os_tpl
- Create SSH keys and contextualisation file
- Follow the guide at FAQ10_EGI_Federated_Cloud_User#Contextualisation
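- As a rough sketch, a minimal cloud-config contextualisation file could look like the following; the user name, key and options here are placeholders, so replace them with your own values and follow the FAQ above for the authoritative format:

```yaml
#cloud-config
users:
  - name: cloudadm                          # must match the login name used for SSH below
    ssh-authorized-keys:
      # placeholder - paste the contents of your own public key file here
      - ssh-rsa AAAA...replace-with-your-public-key... user@desktop
    sudo: ['ALL=(ALL) NOPASSWD:ALL']        # allow password-less sudo for driver installation
    shell: /bin/bash
```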
- Create a VM with the selected image, flavor and context_file using OCCI command
$ occi --endpoint https://nebula2.ui.savba.sk:11443/ --auth x509 --user-cred $X509_USER_PROXY --voms \
    --action create --resource compute \
    --mixin os_tpl#uuid_08c3d95e_6937_5e63_914b_279174686ac2_images_15 \
    --mixin resource_tpl#large_gpu \
    --attribute occi.core.title="Testing GPU" \
    --context user_data="file://$PWD/context_file"
- The command should print the URL ID of your new VM
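- The printed URL can be captured in a shell variable for the follow-up commands. A minimal sketch, assuming the create action prints the VM's URL in its output (the URL below is a made-up example, not a real VM):

```shell
# Hypothetical output of the create command above; in practice you would
# capture it with: create_output=$(occi ... --action create ...)
create_output='https://nebula2.ui.savba.sk:11443/compute/1a2b3c4d-5678'

# Keep the last URL-looking token as the VM ID for describe/delete commands.
VM_ID_URL=$(printf '%s\n' "$create_output" | grep -o 'https://[^[:space:]]*' | tail -n 1)
echo "$VM_ID_URL"
```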
- Find out IP address assigned to your new VM
$ occi --endpoint https://nebula2.ui.savba.sk:11443/ --auth x509 --user-cred $X509_USER_PROXY --voms \
    --action describe --resource $VM_ID_URL | grep occi.networkinterface.address
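- To reuse the address in later commands you can extract just the value into a variable. A sketch, assuming the describe action prints an attribute line of the form shown (the sample line and address are made up):

```shell
# Hypothetical attribute line as printed by the describe command above.
sample='occi.networkinterface.address = 147.213.76.100'

# Take the value after the '=' sign.
VM_PUBLIC_IP=$(printf '%s\n' "$sample" | awk -F'= *' '/occi.networkinterface.address/ {print $2; exit}')
echo "$VM_PUBLIC_IP"
```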
- Log into the new VM with your private key
$ ssh -i fedcloud cloudadm@$VM_PUBLIC_IP
- Note: use the username defined in your contextualisation file if it differs from cloudadm.
- Install Nvidia drivers (example installation for CentOS 7)
- Before installation it is recommended to update the system to get the latest security updates:
[cloudadm@localhost ~]$ sudo yum -y update ; sudo reboot
- Installation of cuda-drivers:
[cloudadm@localhost ~]$ sudo yum -y install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
[cloudadm@localhost ~]$ sudo yum -y install http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-8.0.61-1.x86_64.rpm
[cloudadm@localhost ~]$ sudo yum -y install cuda-drivers
[cloudadm@localhost ~]$ sudo reboot
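- The driver build fails if the installed kernel-devel package does not match the running kernel (for example when the repository already carries a newer kernel than the one currently booted), so it can help to compare the two first. A small sketch:

```shell
# The kernel-devel package installed above must match the running kernel,
# otherwise the NVIDIA kernel module cannot be built against its headers.
running_kernel=$(uname -r)
echo "running kernel: $running_kernel"

# Uncomment on the VM to verify that the matching package is installed:
# rpm -q "kernel-devel-$running_kernel"
```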
- After the reboot the drivers should work. You can check this by running the nvidia-smi tool:
[cloudadm@localhost ~]$ sudo nvidia-smi
- Deploy your application into your VM
- After you finish working with your VM, don't forget to delete it to free GPU resources for other users
$ occi --endpoint https://nebula2.ui.savba.sk:11443/ --auth x509 --user-cred $X509_USER_PROXY --voms \
    --action delete --resource $VM_ID_URL
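Since every occi call in this guide repeats the same endpoint and authentication options, they can be factored into a small helper function. A sketch (the echo makes it a dry run that only prints the command it would execute; remove the echo to actually call occi):

```shell
# Endpoint and auth options taken from the commands in this guide.
ENDPOINT='https://nebula2.ui.savba.sk:11443/'

nebula_occi() {
    # usage: nebula_occi <action> <resource> [extra occi options...]
    local action=$1 resource=$2
    shift 2
    # echo turns this into a dry run; drop it to really invoke occi.
    echo occi --endpoint "$ENDPOINT" --auth x509 \
        --user-cred "$X509_USER_PROXY" --voms \
        --action "$action" --resource "$resource" "$@"
}

nebula_occi describe resource_tpl
# e.g. nebula_occi delete "$VM_ID_URL"
```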
How to use NVIDIA Docker image
For applications that need nvidia-docker, it is possible to create a VM using the CentOS 7 Virtual Appliance image with pre-installed NVIDIA Docker. An example of running the TensorFlow Docker image:
- Create virtual machine using NVIDIA Docker image
- Find out the os_tpl ID of the image using occi
$ occi --endpoint https://nebula2.ui.savba.sk:11443/ --auth x509 --user-cred $X509_USER_PROXY --voms --action describe --resource os_tpl
...
title: Image for NVIDIA Docker CentOS 7 [CentOS/7/KVM]
term: uuid_d742ffb3_f928_5dd7_8206_18b2a592f6e0_images_23
- Alternatively, it is possible to determine the IDs from the AppDB link (https://appdb.egi.eu/store/vappliance/nvidia.docker.centos.7). In the section "Availability & Usage" select the VO, site and resource template, then click on 'get IDs'. Templates with the same CPU, memory, disk and OS can then be scrolled through using the '<' and '>' buttons.
- Create VM as described in #How_to_use_IISAS-Nebula_site and update it. NVIDIA drivers are already pre-installed in the image.
$ occi --endpoint https://nebula2.ui.savba.sk:11443/ --auth x509 --user-cred $X509_USER_PROXY --voms \
    --action create --resource compute \
    --mixin os_tpl#uuid_d742ffb3_f928_5dd7_8206_18b2a592f6e0_images_23 \
    --mixin resource_tpl#large_gpu \
    --attribute occi.core.title="Tensorflow" \
    --context user_data="file://$PWD/context_file"
$ ssh -i fedcloud cloudadm@$VM_PUBLIC_IP
[cloudadm@localhost ~]$ sudo yum -y update ; sudo reboot
- Testing TensorFlow using convolution network from tutorials
- Log in to your new VM and download the TensorFlow models from GitHub
[cloudadm@localhost ~]$ sudo yum -y install git
[cloudadm@localhost ~]$ git clone https://github.com/tensorflow/models.git
- Run the test
[cloudadm@localhost ~]$ sudo nvidia-docker run -v ${PWD}/models:/models \
    -it tensorflow/tensorflow:latest-gpu python /models/tutorials/image/mnist/convolutional.py
- In the output you should see
...
2018-04-05 09:27:59.596790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-04-05 09:27:59.947689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4334 MB memory) -> physical GPU (device: 0, name: Tesla K20m, pci bus id: 0000:01:01.0, compute capability: 3.5)
Initialized!
Step 0 (epoch 0.00), 26.5 ms
Minibatch loss: 8.334, learning rate: 0.010000
Minibatch error: 85.9%
...
- Running TensorFlow Notebook
- Start TensorFlow notebook
[cloudadm@localhost ~]$ sudo nvidia-docker run -it -p 8888:8888 tensorflow/tensorflow:latest-gpu
...
Copy/paste this URL into your browser when you connect for the first time, to login with a token:
    http://localhost:8888/?token=...
- Create SSH tunnel from your desktop PC to the notebook port
ssh -f -i fedcloud -p 22 cloudadm@$VM_PUBLIC_IP -L 8888:localhost:8888 -N
- Copy/paste the URL with the token into your browser and you should see the Jupyter Notebook web interface.
- Note
- If TensorFlow fails with an 'Illegal instruction' error, the site admin has to add the 'host-passthrough' CPU model to the VM template. For OpenNebula it looks like:
CPU_MODEL=[ MODEL="host-passthrough" ]
- It should be added to all templates where it is needed, and to the cloudkeeper template /etc/cloudkeeper-one/templates/template.erb so that it is added automatically to all new templates.
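- The 'Illegal instruction' crash typically means the TensorFlow binary uses CPU instructions (such as AVX) that the default virtual CPU model does not expose to the guest; host-passthrough makes the guest see the host's real CPU flags. A quick diagnostic sketch to run inside the VM:

```shell
# Return success when the avx flag is listed in the given cpuinfo file
# (defaults to the guest's real /proc/cpuinfo).
has_avx() {
    grep -qw avx "${1:-/proc/cpuinfo}"
}

if has_avx; then
    echo "AVX visible in the guest - prebuilt TensorFlow should run"
else
    echo "no AVX - ask the site admin for CPU_MODEL host-passthrough"
fi
```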