GPGPU-OpenNebula

Revision as of 12:01, 19 December 2018
Objective
To provide a testing Cloud site based on the OpenNebula middleware for testing GPGPU support.
Current status
The IISAS-Nebula site has been integrated into the EGI Federated Cloud and is accessible via the acc-comp.egi.eu VO.
HW configuration:
- Management services: OpenNebula Cloud controller and Site BDII in virtual servers on an IBM System x3250 M5 (1x Intel(R) Xeon(R) CPU E3-1241 v3 @ 3.50GHz, 16 GB RAM, 1 TB disk)
- 2 computing nodes: IBM dx360 M4 servers with two NVIDIA Tesla K20 accelerators each; CentOS 7 with KVM/QEMU and PCI passthrough virtualization of the GPU cards
- 2.8 TB block storage via NFS
SW configuration:
- Base OS: CentOS 7
- Hypervisor: KVM
- Middleware: OpenNebula 5.4
- OCCI server: rOCCI-server 2.0.4
GPU-enabled flavors:
- mem_extra_large_gpu: Extra Large Instance - 8 cores and 32 GB RAM + NVIDIA Tesla K20m GPU
- mem_large_gpu: Large Instance - 4 cores and 16 GB RAM + NVIDIA Tesla K20m GPU
- mem_medium_gpu: Medium Instance - 2 cores and 8 GB RAM + NVIDIA Tesla K20m GPU
- mem_small_gpu: Small Instance - 1 core and 4 GB RAM + NVIDIA Tesla K20m GPU
- large_gpu: Large Instance - 4 cores and 4 GB RAM + NVIDIA Tesla K20m GPU
- medium_gpu: Medium Instance - 2 cores and 2 GB RAM + NVIDIA Tesla K20m GPU
- small_gpu: Small Instance - 1 core and 1 GB RAM + NVIDIA Tesla K20m GPU
EGI federated cloud configuration:
- GOCDB: IISAS-Nebula, https://goc.egi.eu/portal/index.php?Page_Type=Site&id=1785
- ARGO monitoring: http://argo.egi.eu/lavoisier/status_report-sf?site=IISAS-Nebula&report=Critical&accept=html
- OCCI endpoint: https://nebula2.ui.savba.sk:11443/
- EGI AppDB: https://appdb.egi.eu/store/site/iisas-nebula
- Supported VOs: biomed, acc-comp.egi.eu, ops, dteam
How to use IISAS-Nebula site
- Join Accelerated_computing_VO
- The VO acc-comp.egi.eu is dedicated to users who develop and test applications/VMs that use GPGPU or other types of accelerated computing.
- Install rOCCI client
- More information about installing and using the rOCCI CLI can be found at HOWTO11_How_to_use_the_rOCCI_Client
- Get RFC proxy certificate from acc-comp.egi.eu VOMS server
$ voms-proxy-init --voms acc-comp.egi.eu -rfc
- Choose a suitable flavor from the list above
- Alternatively you can list the available resource flavors using OCCI client:
$ occi --endpoint https://nebula2.ui.savba.sk:11443/ --auth x509 --user-cred $X509_USER_PROXY --voms \
    --action describe --resource resource_tpl
- Choose a suitable image from the list of supported Virtual Appliance images
- The up-to-date list can be found at EGI AppDB or using OCCI client:
$ occi --endpoint https://nebula2.ui.savba.sk:11443/ --auth x509 --user-cred $X509_USER_PROXY --voms \
    --action describe --resource os_tpl
- Create SSH keys and contextualisation file
- Follow the guide at FAQ10_EGI_Federated_Cloud_User#Contextualisation
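- As a rough sketch, a minimal cloud-config contextualisation file could look like the following; the user name, key and options here are placeholders, so replace them with your own values and follow the FAQ above for the authoritative format:

```yaml
#cloud-config
users:
  - name: cloudadm                          # must match the login name used for SSH below
    ssh-authorized-keys:
      # placeholder - paste the contents of your own public key file here
      - ssh-rsa AAAA...replace-with-your-public-key... user@desktop
    sudo: ['ALL=(ALL) NOPASSWD:ALL']        # allow password-less sudo for driver installation
    shell: /bin/bash
```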
- Create a VM with the selected image, flavor and context_file using OCCI command
$ occi --endpoint https://nebula2.ui.savba.sk:11443/ --auth x509 --user-cred $X509_USER_PROXY --voms \
    --action create --resource compute \
    --mixin os_tpl#uuid_08c3d95e_6937_5e63_914b_279174686ac2_images_15 \
    --mixin resource_tpl#large_gpu \
    --attribute occi.core.title="Testing GPU" \
    --context user_data="file://$PWD/context_file"
- The command should print the URL ID of your new VM
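- The printed URL can be captured in a shell variable for the follow-up commands. A minimal sketch, assuming the create action prints the VM's URL in its output (the URL below is a made-up example, not a real VM):

```shell
# Hypothetical output of the create command above; in practice you would
# capture it with: create_output=$(occi ... --action create ...)
create_output='https://nebula2.ui.savba.sk:11443/compute/1a2b3c4d-5678'

# Keep the last URL-looking token as the VM ID for describe/delete commands.
VM_ID_URL=$(printf '%s\n' "$create_output" | grep -o 'https://[^[:space:]]*' | tail -n 1)
echo "$VM_ID_URL"
```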
- Find out IP address assigned to your new VM
$ occi --endpoint https://nebula2.ui.savba.sk:11443/ --auth x509 --user-cred $X509_USER_PROXY --voms \
    --action describe --resource $VM_ID_URL | grep occi.networkinterface.address
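- To reuse the address in later commands you can extract just the value into a variable. A sketch, assuming the describe action prints an attribute line of the form shown (the sample line and address are made up):

```shell
# Hypothetical attribute line as printed by the describe command above.
sample='occi.networkinterface.address = 147.213.76.100'

# Take the value after the '=' sign.
VM_PUBLIC_IP=$(printf '%s\n' "$sample" | awk -F'= *' '/occi.networkinterface.address/ {print $2; exit}')
echo "$VM_PUBLIC_IP"
```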
- Log into the new VM with your private key
$ ssh -i fedcloud cloudadm@$VM_PUBLIC_IP
- Note: use the username defined in your contextualisation file if it differs from cloudadm.
- Install Nvidia drivers (example installation for CentOS 7)
- Before installation it is recommended to update the system to get the latest security updates:
[cloudadm@localhost ~]$ sudo yum -y update ; sudo reboot
- Installation of cuda-drivers:
[cloudadm@localhost ~]$ sudo yum -y install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
[cloudadm@localhost ~]$ sudo yum -y install http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-8.0.61-1.x86_64.rpm
[cloudadm@localhost ~]$ sudo yum -y install cuda-drivers
[cloudadm@localhost ~]$ sudo reboot
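- The driver build fails if the installed kernel-devel package does not match the running kernel (for example when the repository already carries a newer kernel than the one currently booted), so it can help to compare the two first. A small sketch:

```shell
# The kernel-devel package installed above must match the running kernel,
# otherwise the NVIDIA kernel module cannot be built against its headers.
running_kernel=$(uname -r)
echo "running kernel: $running_kernel"

# Uncomment on the VM to verify that the matching package is installed:
# rpm -q "kernel-devel-$running_kernel"
```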
- After the reboot the drivers should work. You can check this by running the nvidia-smi tool:
[cloudadm@localhost ~]$ sudo nvidia-smi
- Deploy your application into your VM
- After you finish working with your VM, don't forget to delete it to free GPU resources for other users
$ occi --endpoint https://nebula2.ui.savba.sk:11443/ --auth x509 --user-cred $X509_USER_PROXY --voms \
    --action delete --resource $VM_ID_URL
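Since every occi call in this guide repeats the same endpoint and authentication options, they can be factored into a small helper function. A sketch (the echo makes it a dry run that only prints the command it would execute; remove the echo to actually call occi):

```shell
# Endpoint and auth options taken from the commands in this guide.
ENDPOINT='https://nebula2.ui.savba.sk:11443/'

nebula_occi() {
    # usage: nebula_occi <action> <resource> [extra occi options...]
    local action=$1 resource=$2
    shift 2
    # echo turns this into a dry run; drop it to really invoke occi.
    echo occi --endpoint "$ENDPOINT" --auth x509 \
        --user-cred "$X509_USER_PROXY" --voms \
        --action "$action" --resource "$resource" "$@"
}

nebula_occi describe resource_tpl
# e.g. nebula_occi delete "$VM_ID_URL"
```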
How to use NVIDIA Docker image
For applications that need nvidia-docker, it is possible to create a VM using the CentOS 7 Virtual Appliance image with pre-installed NVIDIA Docker. An example of running the TensorFlow Docker image:
- Create virtual machine using NVIDIA Docker image
- Find out the os_tpl ID of the image using occi
$ occi --endpoint https://nebula2.ui.savba.sk:11443/ --auth x509 --user-cred $X509_USER_PROXY --voms --action describe --resource os_tpl
...
title: Image for NVIDIA Docker CentOS 7 [CentOS/7/KVM]
term: uuid_d742ffb3_f928_5dd7_8206_18b2a592f6e0_images_23
- Alternatively, it is possible to determine the IDs from the AppDB link (https://appdb.egi.eu/store/vappliance/nvidia.docker.centos.7). In the section "Availability & Usage" select the VO, site and resource template, then click on 'get IDs'. Templates with the same CPU, memory, disk and OS can then be scrolled through using the '<' and '>' buttons.
- Create VM as described in #How_to_use_IISAS-Nebula_site and update it. NVIDIA drivers are already pre-installed in the image.
$ occi --endpoint https://nebula2.ui.savba.sk:11443/ --auth x509 --user-cred $X509_USER_PROXY --voms \
    --action create --resource compute \
    --mixin os_tpl#uuid_d742ffb3_f928_5dd7_8206_18b2a592f6e0_images_23 \
    --mixin resource_tpl#large_gpu \
    --attribute occi.core.title="Tensorflow" \
    --context user_data="file://$PWD/context_file"
$ ssh -i fedcloud cloudadm@$VM_PUBLIC_IP
[cloudadm@localhost ~]$ sudo yum -y update ; sudo reboot
- Testing TensorFlow using convolution network from tutorials
- Log in to your new VM and download the TensorFlow models from GitHub
[cloudadm@localhost ~]$ sudo yum -y install git
[cloudadm@localhost ~]$ git clone https://github.com/tensorflow/models.git
- Run the test
[cloudadm@localhost ~]$ sudo nvidia-docker run -v ${PWD}/models:/models \
    -it tensorflow/tensorflow:latest-gpu python /models/tutorials/image/mnist/convolutional.py
- In the output you should see
...
2018-04-05 09:27:59.596790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-04-05 09:27:59.947689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4334 MB memory) -> physical GPU (device: 0, name: Tesla K20m, pci bus id: 0000:01:01.0, compute capability: 3.5)
Initialized!
Step 0 (epoch 0.00), 26.5 ms
Minibatch loss: 8.334, learning rate: 0.010000
Minibatch error: 85.9%
...
- Running TensorFlow Notebook
- Start TensorFlow notebook
[cloudadm@localhost ~]$ sudo nvidia-docker run -it -p 8888:8888 tensorflow/tensorflow:latest-gpu
...
Copy/paste this URL into your browser when you connect for the first time, to login with a token:
    http://localhost:8888/?token=...
- Create SSH tunnel from your desktop PC to the notebook port
ssh -f -i fedcloud -p 22 cloudadm@$VM_PUBLIC_IP -L 8888:localhost:8888 -N
- Copy/paste the URL with the token into your browser and you should see the Jupyter Notebook web interface.
- Note
- If TensorFlow fails with an 'Illegal instruction' error, the site admin has to add the 'host-passthrough' CPU model to the VM template. For OpenNebula it looks like:
CPU_MODEL=[ MODEL="host-passthrough" ]
- It should be added to all templates where it is needed, and to the cloudkeeper template /etc/cloudkeeper-one/templates/template.erb so that it is added automatically to all new templates.
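- The 'Illegal instruction' crash typically means the TensorFlow binary uses CPU instructions (such as AVX) that the default virtual CPU model does not expose to the guest; host-passthrough makes the guest see the host's real CPU flags. A quick diagnostic sketch to run inside the VM:

```shell
# Return success when the avx flag is listed in the given cpuinfo file
# (defaults to the guest's real /proc/cpuinfo).
has_avx() {
    grep -qw avx "${1:-/proc/cpuinfo}"
}

if has_avx; then
    echo "AVX visible in the guest - prebuilt TensorFlow should run"
else
    echo "no AVX - ask the site admin for CPU_MODEL host-passthrough"
fi
```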