Revision as of 20:58, 3 June 2011

Objective

This document aims to provide guidelines for implementing a high availability TopBDII service.

Service requirements

Hardware

dual core CPU
10GB of hard disk space
2-3 GB RAM. If you decide to set BDII_RAM_DISK=yes in your YAIM configuration, it's advisable to have 4GB of RAM.

Co-hosting

Due to the critical nature of the information system with respect to the operation of the grid, the TopBDII should be installed as a stand-alone service to ensure that problems with other services do not affect the BDII. Under no circumstances should the BDII be co-hosted with a service that has the potential to generate a high load.

Physical vs Virtual Machines

There is no consensus on this topic. Some managers report performance issues when a TopBDII service is deployed on a virtual machine. Others argue that such issues stem from the configuration of the service itself. The only point of agreement is that management and disaster recovery are easier and more flexible for a service deployed on a virtual machine. Given the critical importance of the TopBDII service, this could be an important point to take into account.

A TopBDII High Availability Proposal

The best practice proposal to provide a high availability TopBDII service is based on three main building blocks:

DNS round robin load balancing
Fault tolerance mechanism based on DNS dynamic updates
A mechanism to test the different TopBDII instances and decide whether to remove or add DNS entries

We will provide a short introduction to DNS round robin load balancing and DNS dynamic updates, but for further information on specific implementations, please contact your DNS administrator.

There are several alternatives for the TopBDII test mechanism:

NAGIOS based tests
a daemonized service
scripts running as crons.

The main idea is to test each TopBDII instance, obtain a set of metrics, and, according to the values of those metrics, decide whether to remove or add DNS entries through DNS dynamic updates. Relevant BDII metrics to check are defined in gLite-BDII_top Monitoring.
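The decision step described above can be sketched as follows. This is only an illustration: the ProbeResult structure, the two metrics (reachability and response time) and the 10 second threshold are assumptions made for the example, not the metrics defined in gLite-BDII_top Monitoring. The rule of making no change when every instance looks unhealthy is also an assumed design choice, meant to avoid emptying the DNS record set.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Outcome of one test against a TopBDII instance (hypothetical structure)."""
    ip: str
    reachable: bool
    response_seconds: float


def plan_dns_changes(results, published, max_response_seconds=10.0):
    """Return (to_remove, to_add): sets of IPs to delete from / add to the DNS.

    results   -- list of ProbeResult, one per tested instance
    published -- set of IPs that currently have an A record
    """
    healthy = {
        r.ip
        for r in results
        if r.reachable and r.response_seconds <= max_response_seconds
    }
    if not healthy:
        # Assumed safety rule: if nothing looks healthy, change nothing.
        return set(), set()
    tested = {r.ip for r in results}
    # Remove only instances that were actually tested and found unhealthy.
    to_remove = (published & tested) - healthy
    # Add healthy instances that are not published yet.
    to_add = healthy - published
    return to_remove, to_add


if __name__ == "__main__":
    probes = [
        ProbeResult("192.0.2.1", True, 1.2),
        ProbeResult("192.0.2.2", False, 0.0),
        ProbeResult("192.0.2.3", True, 2.5),
    ]
    print(plan_dns_changes(probes, {"192.0.2.1", "192.0.2.2"}))
```

The same function could be called from a Nagios event handler, a daemon or a cron script; only the way the probe results are collected would differ.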

DNS round robin load balancing

Load balancing is a technique to distribute workload evenly across two or more resources. Round robin DNS is a load balancing method that does not necessarily require a dedicated software or hardware node.

We can assume that all transactions (queries to top-bdii) generate the same resource load. For effective load balancing, all top-bdii instances should have the same hardware configuration. Otherwise, a load balancing arbiter is needed.

Simple round robin DNS load balancing is easy to deploy. One must assume that there is a primary DNS server (dns.my.domain) where the DNS load balancing will be implemented, and that inherits a zone from

To implement a load balancing mechanism between instances deployed in a WAN:

In dns.top.domain: configure the delegation of the DNS zone core.top.domain to dns.my.domain.
In dns.my.domain: add multiple A records mapping the same hostname to multiple IP addresses under the core.top.domain DNS zone.

# In dns.top.domain: Delegate the core.top.domain zone to dns.my.domain
Zone top.domain
core.top.domain IN NS dns.my.domain

# In dns.my.domain: Add multiple A records mapping the same hostname to multiple IP addresses
Zone core.top.domain
topbdii.core.top.domain IN A x.x.x.x
topbdii.core.top.domain IN A y.y.y.y
topbdii.core.top.domain IN A z.z.z.z

All 3 records are always served in the answer, but the order of the records rotates with each DNS query.

This does NOT provide fault tolerance against problems in the TopBDIIs themselves:

if one TopBDII fails, its DNS “A” record will still be served
one in every three DNS queries will return the failed TopBDII as the first answer
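The "one in every three" figure can be checked with a small simulation. It is a sketch under two assumptions that are not stated in the text: the server rotates the records in a strict cycle, and each client uses only the first address in the answer. Real servers may order records differently, and resolver caching changes what clients actually see.

```python
from collections import Counter, deque


def first_answers(records, queries):
    """Simulate strict cyclic rotation and return the first record of each answer."""
    rotation = deque(records)
    firsts = []
    for _ in range(queries):
        firsts.append(rotation[0])  # clients are assumed to try the first record
        rotation.rotate(-1)         # next answer starts with the following record
    return firsts


# Placeholder addresses taken from the zone example above.
records = ["x.x.x.x", "y.y.y.y", "z.z.z.z"]
queries = 300
counts = Counter(first_answers(records, queries))

# Suppose the instance behind y.y.y.y has failed but its A record is still served.
failed_share = counts["y.y.y.y"] / queries
print(f"share of queries sent to the failed instance first: {failed_share:.2f}")
```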

Fault tolerance mechanism based on DNS dynamic updates

Fault tolerance can be implemented by dynamically removing the DNS “A” records of unavailable TopBDII(s).
The nsupdate tool, introduced in BIND 8, offers the possibility of changing DNS records dynamically:

The nsupdate tool connects to a BIND server on port 53 (TCP or UDP) and can update zone records
Updates are authorized based on keys
Updates can only be performed on the DNS primary server
In the BIND implementation, the DNS server rewrites the entire zone upon “stop” to reflect the changes; until a “stop” happens, the changes are kept in a zone journal file. Therefore, the zone should not be managed manually
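As an illustration of such a dynamic update, the sketch below builds the command script that nsupdate reads on standard input and passes it to the tool with a key file (the -k option). The server, zone and host names reuse the placeholders of the zone example above. The 60 second TTL, the key file path and the helper names are assumptions made for the example.

```python
import subprocess


def build_nsupdate_script(action, server, zone, hostname, ip, ttl=60):
    """Return the nsupdate commands that add or delete one A record."""
    if action == "add":
        update = f"update add {hostname} {ttl} A {ip}"
    elif action == "delete":
        update = f"update delete {hostname} A {ip}"
    else:
        raise ValueError(f"unknown action: {action!r}")
    lines = [f"server {server}", f"zone {zone}", update, "send"]
    return "\n".join(lines) + "\n"


def run_nsupdate(script, key_file):
    """Feed the script to nsupdate on stdin, authenticating with a key file."""
    return subprocess.run(
        ["nsupdate", "-k", key_file],
        input=script,
        text=True,
        capture_output=True,
        check=False,
    )


if __name__ == "__main__":
    script = build_nsupdate_script(
        "delete",
        server="dns.my.domain",
        zone="core.top.domain",
        hostname="topbdii.core.top.domain",
        ip="192.0.2.10",  # documentation address, stands in for x.x.x.x
    )
    print(script)
    # To apply it (hypothetical key path):
    # run_nsupdate(script, "/etc/nsupdate/topbdii.key")
```

Keeping the script generation separate from the call to nsupdate makes it possible to log or review the exact update before it is sent to the primary server.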

Method

Implementation examples

Basically, our setup is also based on DNS round robin (for load balancing), and we use Nagios to check each top-bdii instance and update the DNS records (a Nagios event handler runs a script that adds or deletes the "A" record using nsupdate).

The primary DNS and Nagios are clearly single points of failure, but we prefer to keep the setup very simple. This avoids, for example, inconsistent DNS information when more than one primary DNS is used (as you reported), or incoherent results when more than one server checks the top-bdii instances. To mitigate these single points of failure, we check (via another Nagios instance) the DNS server and the Nagios instance used to update the DNS records, and in case of problems an SMS notification is sent to the people on duty for H24 support.

About a best practice document, I think it should explain:

   * recommended hardware setup;
   * why DNS round robin is a good technique to adopt for top-bdii load balancing;
   * what to check to verify availability of a top-bdii instance;

Other issues, like the use of virtual machines, how to configure the DNS, how to check the top-bdii instances (using Nagios or a cron job, for example) and how to update the DNS, are implementation details: they depend heavily on the configuration, experience and policies adopted at each resource center and NGI. Of course, the best practice documentation could be complemented with some use cases.

BDII check and DNS update

Useful information on how to monitor the BDII service is available at https://twiki.cern.ch/twiki/bin/view/EGEE/BDII#Monitoring_the_BDII_Instance. The probe used by SAM is available at Existing probes integrated into SAM/Nagios. When a checked instance fails, its IP must be removed from the DNS via nsupdate. When the instance starts working as expected again, it must be re-added to the DNS. Both operations could be done by a cron script, a daemon, etc.

A basic recipe to add them to a Nagios instance:

Create a SERVICEGROUP (ex. GROUP-TOPBDII):

define servicegroup{
    servicegroup_name    TOP-BDII
    alias                TOP-BDII
}

Create a host profile for each instance:

define host{
    host_name    top-bdii01
    use          basic_host    ; template with some common definitions
    address      192.168.0.125
}

Write a Nagios plugin to check the BDII service and define the service:

define service{
    host_name                top-bdii01, top-bdii02, top-bdii03
    service_description      TOP-BDII
    use                      basic-service    ; template with some common definitions
    normal_check_interval    5
    max_check_attempts       4
    servicegroups            TOP-BDII
    check_command            check_bdii       ; try the one used by SAM
    event_handler            update_top-bdii
}

Write an event handler (update_top-bdii) that, at least:

   * removes the IP from the DNS if the check result changes its status from OK to CRITICAL;
   * adds the IP to the DNS if the check result changes its status from CRITICAL to OK.

The event handler script is called by Nagios every time the status of a check changes. In the above definition, the check is performed every 5 minutes (normal_check_interval defines the number of "time units" to wait before scheduling the next "regular" check of the service. "Regular" checks are those that occur when the service is in an OK state, or when the service is in a non-OK state but has already been rechecked max_check_attempts number of times).

When a service or host check results in a non-OK or non-UP state and the check has not yet been (re)checked the number of times specified by the max_check_attempts directive in the service or host definition, the error type is called a soft error. The event handler can also be written with the following algorithm (based on the error type):

   * from OK to CRITICAL SOFT 1: do nothing (it can be a network glitch);
   * CRITICAL SOFT 2: run nsupdate to remove the IP from the DNS and try to restart the BDII service (this could be done by configuring NRPE, but it is out of the scope of these notes);
   * CRITICAL SOFT 3: do nothing;
   * CRITICAL HARD: send a notification;
   * from any CRITICAL state to OK: run nsupdate to add the IP to the DNS and send a notification.
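The soft/hard algorithm above can be sketched as a small decision function. It assumes the event handler command passes the Nagios macros $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$ as arguments. The action names returned (dns_remove, dns_add, restart_bdii, notify) are hypothetical labels for this example; the code that actually calls nsupdate, restarts the BDII or sends the notification is left out.

```python
import sys


def decide_actions(state, state_type, attempt):
    """Map the Nagios state, state type and attempt number to a list of actions."""
    if state == "OK":
        # Recovery from any CRITICAL state: publish the instance again.
        return ["dns_add", "notify"]
    if state == "CRITICAL":
        if state_type == "SOFT" and attempt == 2:
            # Second failed check in a row: unlikely to be a network glitch.
            return ["dns_remove", "restart_bdii"]
        if state_type == "HARD":
            return ["notify"]
    # CRITICAL SOFT 1, CRITICAL SOFT 3 and any other state: do nothing.
    return []


if __name__ == "__main__":
    # Assumed invocation from the Nagios command definition:
    #   update_top-bdii $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
    if len(sys.argv) == 4:
        state, state_type, attempt = sys.argv[1], sys.argv[2], int(sys.argv[3])
        print(decide_actions(state, state_type, attempt))
```

With max_check_attempts set to 4 as in the service definition, attempts 1 to 3 are soft states and the fourth failed check produces the hard state.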

Difference between revisions of "MAN05 top-BDII and site-BDII High Availability"

