Deploy HTC clusters on top of IaaS infrastructure
How to deploy virtual HTC clusters on EGI
About
This document describes how to use the Elastic Cloud Computing Cluster (EC3) platform to create elastic virtual clusters on the EGI Federated Cloud infrastructure.
Configuration of the environment
EC3 has an official Docker container image, available on Docker Hub, that can be used instead of installing the CLI. You can download it by typing:
]$ sudo docker pull grycap/ec3
In this way you can use the full functionality of EC3 as if the CLI were installed and run directly on your computer.
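If you plan to run EC3 often, a shell alias can shorten the Docker invocation used throughout this guide. This is a minimal sketch; the alias name ec3 and the /tmp/.ec3/clusters path are simply the conventions used in the examples below:

]$ alias ec3='sudo docker run -ti -v /tmp/.ec3/clusters:/root/.ec3/clusters grycap/ec3'
]$ ec3 templates    # equivalent to the full 'docker run ... grycap/ec3 templates' command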
List available templates
To list the available templates, use the command:
]$ cd $HOME
]$ sudo docker run -v /tmp/.ec3/clusters:/root/.ec3/clusters grycap/ec3 templates

         name       kind    summary
---------------------------------------------------------------------------------------------------
         blcr  component    Tool for checkpoint the applications.
   centos-ec2     images    CentOS 6.5 amd64 on EC2.
      ckptman  component    Tool to automatically checkpoint applications running on Spot instances.
       docker  component    An open-source tool to deploy applications inside software containers.
      gnuplot  component    A program to generate two- and three-dimensional plots.
          nfs  component    Tool to configure shared directories inside a network.
       octave  component    A high-level programming language for numerical computations.
      openvpn  component    Tool to create a VPN network.
          sge       main    Install and configure a cluster SGE from distribution repositories.
        slurm       main    Install and configure a cluster SLURM 14.11 from source code.
       torque       main    Install and configure a cluster TORQUE from distribution repositories.
 ubuntu-azure     images    Ubuntu 12.04 amd64 on Azure.
   ubuntu-ec2     images    Ubuntu 14.04 amd64 on EC2.
List virtual clusters
To list the available running clusters, use the command:
]$ cd $HOME
]$ sudo docker run -v /tmp/.ec3/clusters:/root/.ec3/clusters grycap/ec3 list

    name       state             IP         nodes
-------------------------------------------------------
 cluster   configured   212.189.145.XXX       0
Create a cluster
To launch a cluster, you can use the recipes that you have locally, by mounting the folder as a volume, or create your own. It is also recommended to keep the data of active clusters locally, by mounting a volume. In the next example, we are going to deploy a new Torque/Maui cluster on one cloud provider of the EGI Federation (INFN-CATANIA-STACK).
The cluster will be configured with the following templates:
torque (default template), configure_nfs (patched template), ubuntu-1604-occi-INFN-CATANIA-STACK (user's template) and cluster_configure (user's template). User's templates are stored in $HOME/ec3/templates.
]$ cd $HOME
]$ sudo docker run -v /home/centos/:/tmp/ \
     -v /home/centos/ec3/templates:/etc/ec3/templates \
     -v /tmp/.ec3/clusters:/root/.ec3/clusters grycap/ec3 launch unicam_cluster \
     torque ubuntu-1604-occi-INFN-CATANIA-STACK cluster_configure configure_nfs \
     -a /tmp/auth_INFN-CATANIA-STACK.dat

Creating infrastructure
Infrastructure successfully created with ID: 529c62ec-343e-11e9-8b1d-300000000002
Front-end state: launching
Front-end state: pending
Front-end state: running
IP: 212.189.145.XXX
Front-end configured with IP 212.189.145.XXX
Transferring infrastructure
Front-end ready!
Authorization file
The authorization file stores, in plain text, the credentials to access the cloud providers, the IM service and the VMRC service. Each line of the file refers to a single credential and is composed of key/value pairs separated by semicolons. Key and value must be separated by " = ", that is, an equals sign preceded and followed by at least one white space.
Example: creation of an auth file that uses an X.509 proxy certificate.
]$ cd $HOME
]$ cat auth_INFN-CATANIA-STACK.dat
id = occi; type = OCCI; proxy = file(/tmp/proxy.pem); host = http://stack-server.ct.infn.it:8787/occi1.1
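The proxy file referenced above (/tmp/proxy.pem) must exist on the host before launching the cluster. As an illustrative sketch, assuming the VOMS client tools are installed and you are a member of a VO (fedcloud.egi.eu is only an example), a proxy could be generated and checked as follows:

]$ voms-proxy-init -voms fedcloud.egi.eu -rfc -out /tmp/proxy.pem
]$ voms-proxy-info -file /tmp/proxy.pem    # verify the remaining lifetime of the proxy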
Templates
This section contains the templates used to configure the cluster.
a.) cluster_configure.radl
]$ cd $HOME
]$ cat ec3/templates/cluster_configure.radl

configure front (
@begin
---
  - vars:
     - USERS:
        - { name: user01, password: <PASSWORD> }
        - { name: user02, password: <PASSWORD> }
        [..]
    tasks:
     - user:
         name: "{{ item.name }}"
         password: "{{ item.password }}"
         shell: /bin/bash
         append: yes
         state: present
       with_items: "{{ USERS }}"
     - name: Install missing dependences in Debian system
       apt: pkg={{ item }} state=present
       with_items:
        - build-essential
        - mpich
        - gcc
        - g++
        - vim
       become: yes
       when: ansible_os_family == "Debian"
     - name: SSH without password
       include_role:
         name: grycap.ssh
       vars:
         ssh_type_of_node: front
         ssh_user: "{{ user.name }}"
       loop: '{{ USERS }}'
       loop_control:
         loop_var: user
     - name: Updating the /etc/hosts.allow file
       lineinfile:
         path: /etc/hosts.allow
         line: 'sshd: XXX.XXX.XXX.*'
       become: yes
     - name: Updating the /etc/hosts.deny file
       lineinfile:
         path: /etc/hosts.deny
         line: 'ALL: ALL'
       become: yes
@end
)

configure wn (
@begin
---
  - vars:
     - USERS:
        - { name: user01, password: <PASSWORD> }
        - { name: user02, password: <PASSWORD> }
        [..]
    tasks:
     - user:
         name: "{{ item.name }}"
         password: "{{ item.password }}"
         shell: /bin/bash
         append: yes
         state: present
       with_items: "{{ USERS }}"
     - name: Install missing dependences in Debian system
       apt: pkg={{ item }} state=present
       with_items:
        - build-essential
        - mpich
        - gcc
        - g++
        - vim
       become: yes
       when: ansible_os_family == "Debian"
     - name: SSH without password
       include_role:
         name: grycap.ssh
       vars:
         ssh_type_of_node: wn
         ssh_user: "{{ user.name }}"
       loop: '{{ USERS }}'
       loop_control:
         loop_var: user
     - name: Updating the /etc/hosts.allow file
       lineinfile:
         path: /etc/hosts.allow
         line: 'sshd: XXX.XXX.XXX.*'
       become: yes
     - name: Updating the /etc/hosts.deny file
       lineinfile:
         path: /etc/hosts.deny
         line: 'ALL: ALL'
       become: yes
@end
)
b.) ubuntu-1604-occi-INFN-CATANIA-STACK.radl
]$ cd $HOME
]$ cat ec3/templates/ubuntu-1604-occi-INFN-CATANIA-STACK.radl

description ubuntu-1604-occi-INFN-CATANIA-STACK (
    kind = 'images' and
    short = 'Ubuntu 16.04' and
    content = 'FEDCLOUD Image for EGI Ubuntu 16.04 LTS [Ubuntu/16.04/VirtualBox]'
)

system front (
    cpu.arch = 'x86_64' and
    cpu.count >= 4 and
    memory.size >= 8196 and
    instance_type = 'http://schemas.openstack.org/template/resource#35aa7c8d-15a9-4832-ad34-02f2e78bdeb4' and
    disk.0.os.name = 'linux' and
    # EGI_Training tenant
    disk.0.image.url = 'http://stack-server.ct.infn.it:8787/occi1.1/024a1b38-1b60-4df9-861a-9ec79bed1e41' and
    disk.0.os.credentials.username = 'ubuntu'
)

system wn (
    cpu.arch = 'x86_64' and
    cpu.count >= 2 and
    memory.size >= 2048m and
    ec3_max_instances = 10 and    # maximum number of working nodes in the cluster
    instance_type = 'http://schemas.openstack.org/template/resource#98f6ac88-e773-48b8-85bf-86415b421996' and
    disk.0.os.name = 'linux' and
    # EGI_Training tenant
    disk.0.image.url = 'http://stack-server.ct.infn.it:8787/occi1.1/024a1b38-1b60-4df9-861a-9ec79bed1e41' and
    disk.0.os.credentials.username = 'ubuntu'
)
c.) configure_nfs.radl
]$ cd $HOME
]$ cat ec3/templates/configure_nfs.radl

# http://www.server-world.info/en/note?os=CentOS_6&p=nfs&f=1
# http://www.server-world.info/en/note?os=CentOS_7&p=nfs

description nfs (
    kind = 'component' and
    short = 'Tool to configure shared directories inside a network.' and
    content = 'Network File System (NFS) client allows you to access shared directories from Linux client. This recipe installs nfs from the repository and shares the /home/ubuntu directory with all the nodes that compose the cluster. Webpage: http://www.grycap.upv.es/clues/'
)

network public (
    outports contains '111/tcp' and
    outports contains '111/udp' and
    outports contains '2046/tcp' and
    outports contains '2046/udp' and
    outports contains '2047/tcp' and
    outports contains '2047/udp' and
    outports contains '2048/tcp' and
    outports contains '2048/udp' and
    outports contains '2049/tcp' and
    outports contains '2049/udp' and
    outports contains '892/tcp' and
    outports contains '892/udp' and
    outports contains '32803/tcp' and
    outports contains '32769/udp'
)

system front (
    ec3_templates contains 'nfs' and
    disk.0.applications contains (name = 'ansible.modules.grycap.nfs')
)

configure front (
@begin
  - roles:
    - { role: 'grycap.nfs', nfs_mode: 'front', nfs_exports: [{path: "/home", export: "wn*.localdomain(rw,async,no_root_squash,no_subtree_check,insecure)"}] }
@end
)

system wn (
    ec3_templates contains 'nfs'
)

configure wn (
@begin
  - roles:
    - { role: 'grycap.nfs', nfs_mode: 'wn', nfs_client_imports: [{ local: "/home", remote: "/home", server_host: '{{ hostvars[groups["front"][0]]["IM_NODE_PRIVATE_IP"] }}' }] }
@end
)

include nfs_misc (
    template = 'openports'
)
Access the cluster
To access the cluster, use the command:
]$ cd $HOME
]$ sudo docker run -ti -v /tmp/.ec3/clusters:/root/.ec3/clusters grycap/ec3 ssh unicam_cluster

Warning: Permanently added '212.189.145.140' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 14.04.5 LTS (GNU/Linux 3.13.0-164-generic x86_64)

 * Documentation:  https://help.ubuntu.com/

Last login: Tue Feb 19 13:04:45 2019 from servproject.i3m.upv.es
Configuration of the cluster
a.) Enable Password-based authentication
Change the following setting in /etc/ssh/sshd_config:
# Change to no to disable tunnelled clear text passwords
PasswordAuthentication yes
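If you prefer to apply this change non-interactively, a sed one-liner along the following lines should work; this is only a sketch and assumes a PasswordAuthentication line (possibly commented out) is already present in the file:

]$ sudo sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication yes/' /etc/ssh/sshd_config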
Restart the ssh daemon:
]$ sudo service ssh restart
b.) Configure the number of processors of the cluster
Edit the TORQUE nodes file on the front-end and set the number of processors (np) of each working node:

]$ cat /var/spool/torque/server_priv/nodes
wn1 np=XX
wn2 np=XX
[..]
To obtain the number of CPU/cores (np) in Linux, use the command:
]$ lscpu | grep -i CPU
CPU op-mode(s):        32-bit, 64-bit
CPU(s):                16
On-line CPU(s) list:   0-15
CPU family:            6
Model name:            Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
CPU MHz:               2266.858
NUMA node0 CPU(s):     0-3,8-11
NUMA node1 CPU(s):     4-7,12-15
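After editing the nodes file, the TORQUE server must re-read it before the new processor counts take effect. One possible way to do this and verify the result is shown below; the service name is an assumption and may differ (e.g. pbs_server) depending on how TORQUE was installed:

]$ sudo service torque-server restart    # assumed service name; adapt it to your installation
]$ pbsnodes -a                           # list the working nodes and the np values seen by TORQUE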
c.) Test the cluster
Create a simple test script:
]$ cat test.sh
#!/bin/bash
#PBS -N job
#PBS -q batch

#cd $PBS_O_WORKDIR/
hostname -f
sleep 5
Submit to the batch queue:
]$ qsub -l nodes=2 test.sh
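Once the job has been submitted, standard TORQUE commands can be used to follow it; for example (the output file name below is only illustrative, the real one depends on the job ID assigned by the server):

]$ qstat            # show the state of queued and running jobs
]$ cat job.o1       # inspect the standard output of the job once it has finished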
Destroy the cluster
To destroy the running cluster, use the command:
]$ cd $HOME
]$ sudo docker run -ti -v /tmp/.ec3/clusters:/root/.ec3/clusters grycap/ec3 destroy unicam_cluster

WARNING: you are going to delete the infrastructure (including frontend and nodes).
Continue [y/N]? y
Success deleting the cluster!