The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team at operations@egi.eu.

MAN05 top-BDII and site-BDII High Availability

Revision as of 14:54, 14 June 2011

Objective

This document aims to provide guidelines for implementing a high availability TopBDII service.


Service requirements

Hardware

  • dual core CPU
  • 10GB of hard disk space
  • 2-3 GB RAM. If you decide to set BDII_RAM_DISK=yes in your YAIM configuration, it's advisable to have 4GB of RAM.


Co-hosting

  • Because the information system is critical to the operation of the grid, the TopBDII should be installed as a stand-alone service so that problems with other services do not affect the BDII. Under no circumstances should the BDII be co-hosted with a service that has the potential to generate a high load.


Physical vs Virtual Machines

  • There is no consensus on this topic. Some managers report performance issues when deploying a TopBDII service on a virtual machine. Others argue that such performance issues stem from the configuration of the service itself. The only agreed point is that management and disaster recovery of any service deployed on a virtual machine are more flexible and easier. This is worth taking into account given the critical importance of the TopBDII service.


A TopBDII High Availability Proposal

  • The best practice proposal to provide a high availability TopBDII service is based on two mechanisms working as main building blocks:
  1. DNS round robin load balancing
  2. DNS Updater process implementing fault tolerance based on DNS dynamic updates

We will provide a short introduction to some of these DNS mechanisms but for further information on specific implementations, please contact your DNS administrator.


DNS round robin load balancing

  • Load balancing is a technique to distribute workload evenly across two or more resources. A load balancing method, which does not necessarily require a dedicated software or hardware node, is called round robin DNS.
  • We can assume that all transactions (queries to the top-bdii) generate the same resource load. For effective load balancing, all top-bdii instances should have the same hardware configuration. Otherwise, a load balancing arbiter is needed.
  • Simple round robin DNS load balancing is easy to deploy. Assuming that there is a primary DNS server (dns.top.domain) where the DNS load balancing will be implemented, one simply has to add multiple A records mapping the same hostname to multiple IP addresses under the core.top.domain DNS zone.
# In dns.top.domain: Add multiple A records mapping the same hostname to multiple IP addresses
Zone core.top.domain
topbdii.core.top.domain IN A x.x.x.x
topbdii.core.top.domain IN A y.y.y.y
topbdii.core.top.domain IN A z.z.z.z
  • All 3 records are always served in the answer, but the order of the records rotates with each DNS query
  • This does NOT provide fault tolerance against problems in the TopBDIIs themselves
  1. if one TopBDII fails, its DNS “A” record will still be served
  2. one in every three DNS queries will return the failed TopBDII as the first answer
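
The "one in three" figure can be checked with a small simulation. This is only a sketch: it assumes a strictly cyclic rotation of the record order (real DNS servers may rotate differently), and the 192.0.2.x addresses are documentation placeholders standing in for x.x.x.x, y.y.y.y and z.z.z.z.

```python
from collections import deque


def simulate_round_robin(addresses, queries):
    """Return the first address handed out for each of `queries` lookups.

    Assumes a strictly cyclic rotation of the A records between queries.
    """
    ring = deque(addresses)
    first_answers = []
    for _ in range(queries):
        first_answers.append(ring[0])
        ring.rotate(-1)  # the next query starts with the next record
    return first_answers


# Placeholder addresses for the three TopBDII instances (hypothetical).
addresses = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
failed = "192.0.2.2"  # pretend this instance is down

firsts = simulate_round_robin(addresses, 9)
share = firsts.count(failed) / len(firsts)
print(f"failed instance returned first in {share:.0%} of queries")
```

Clients that only try the first address they receive would therefore hit the failed instance in roughly a third of their lookups, which is why round robin alone is not enough.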


DNS Updater

  • The DNS Updater is a process that tests each TopBDII instance and removes or adds DNS entries through DNS dynamic updates. Relevant BDII metrics on which to base that decision are defined in gLite-BDII_top Monitoring (https://twiki.cern.ch/twiki/bin/view/EGEE/BDII#Monitoring_the_BDII_Instance).
  • There are several alternatives to implement the DNS Updater:
  1. Nagios-based tests
  2. a daemonized service
  3. scripts run as cron jobs.
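
As an illustration of the script-based alternative, the decision step of a DNS Updater might look like the sketch below. Everything here is an assumption made for the example: the health test is a bare TCP connection to port 2170 (the usual BDII LDAP port), whereas a production check would use the BDII metrics referenced above; the function names are hypothetical; and the rule of leaving DNS untouched when no instance is healthy is an added safety choice, not part of the original proposal.

```python
import socket

BDII_PORT = 2170  # usual BDII LDAP port; adjust if your site differs


def is_reachable(ip, port=BDII_PORT, timeout=5.0):
    """Crude health test (assumption): can we open a TCP connection?"""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def plan_updates(health, published):
    """Decide which A records to add or delete.

    health:    dict mapping instance IP -> True if the instance passed its test
    published: set of instance IPs currently present in the DNS zone
    Returns a list of ("add" | "delete", ip) tuples.
    """
    # Safety choice (assumption): if every instance looks down, the fault
    # is probably on the monitoring side, so leave the zone untouched.
    if not any(health.values()):
        return []
    actions = []
    for ip, healthy in sorted(health.items()):
        if healthy and ip not in published:
            actions.append(("add", ip))
        elif not healthy and ip in published:
            actions.append(("delete", ip))
    return actions


# Typical use from a cron job (placeholder addresses):
#   instances = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
#   health = {ip: is_reachable(ip) for ip in instances}
#   actions = plan_updates(health, published={"192.0.2.1", "192.0.2.2"})
```

Keeping the decision logic separate from the health test makes it easy to swap the TCP probe for a Nagios result or an LDAP query without touching the rest.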

Fault tolerance mechanism based on DNS dynamic updates

  • Fault tolerance can be implemented by dynamically removing the DNS “A” records of unavailable TopBDII(s)
  • nsupdate, introduced in BIND 8, makes it possible to change DNS records dynamically:
  1. The nsupdate tool connects to a BIND server on port 53 (TCP or UDP) and can update zone records
  2. Updates are authorized based on keys
  3. Updates can only be performed on the DNS primary server
  4. In the BIND implementation, changes are kept in a zone journal file until a “stop” happens, at which point the DNS server rewrites the entire zone file to reflect them. Therefore, the zone should not be edited manually
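
A minimal sketch of how a DNS Updater could drive nsupdate follows. It builds the batch of commands (server, zone, update add/delete, send) that nsupdate reads from standard input and passes the key file with the -k option. The helper names, the 60-second TTL, the key file path and the 192.0.2.x address are assumptions made for the example; the server, zone and host names are the placeholders used earlier on this page.

```python
import subprocess


def build_nsupdate_script(server, zone, name, actions, ttl=60):
    """Build the command batch that nsupdate reads from standard input.

    actions: list of ("add" | "delete", ip) tuples for A records of `name`.
    ttl:     TTL for added records (60 s is an assumption; a short TTL lets
             clients notice changes quickly).
    """
    lines = [f"server {server}", f"zone {zone}"]
    for op, ip in actions:
        if op == "add":
            lines.append(f"update add {name} {ttl} A {ip}")
        elif op == "delete":
            lines.append(f"update delete {name} A {ip}")
        else:
            raise ValueError(f"unknown operation: {op!r}")
    lines.append("send")
    return "\n".join(lines) + "\n"


def apply_updates(script, keyfile):
    """Send the batch to the primary server, authorized by a key file."""
    subprocess.run(["nsupdate", "-k", keyfile], input=script, text=True, check=True)


script = build_nsupdate_script(
    server="dns.top.domain",
    zone="core.top.domain",
    name="topbdii.core.top.domain",
    actions=[("delete", "192.0.2.2")],  # placeholder address of a failed TopBDII
)
print(script)
# To apply it for real (hypothetical key file path):
#   apply_updates(script, "/etc/dns-updater/update.key")
```

Deleting a specific A record by its address, rather than the whole name, leaves the records of the healthy instances in place.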