
MAN05 top-BDII and site-BDII High Availability
= Introduction =
{{Template: Op menubar}} {{Template:Doc_menubar}}
This document provides guidelines for deploying a top-BDII service in high availability mode.


{{DeprecatedAndMovedTo|new_location=https://docs.egi.eu/providers/operations-manuals/man05_top_and_site_bdii_high_availability}}
[[Category:Operations_Manuals]]

== Recommended hardware setup ==
Due to the critical nature of the information system with respect to the operation of the grid, the BDII should be installed as a stand-alone service, to ensure that problems with other services do not affect the BDII. Under no circumstances should the BDII be co-hosted with a service which has the potential to generate a high load. To achieve the desired scalability, multiple BDII instances should be deployed behind a round-robin DNS alias in order to provide a load-balanced BDII service.
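As an illustration, the round-robin alias can be published in the DNS zone as several A records under the same name with a short TTL (the names, IPs and 60 s TTL below are hypothetical, not part of the original document):

```text
; illustrative BIND zone fragment: one alias, three top-BDII instances
topbdii  60  IN  A  192.168.0.125
topbdii  60  IN  A  192.168.0.126
topbdii  60  IN  A  192.168.0.127
```

Clients resolving the alias receive the records in rotating order, which spreads queries across the instances.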
 
 
 
 
 
= Method =
 
= Implementation examples =
 
Basically, our setup is also based on DNS round robin (for load balancing), and we use Nagios to check each top-BDII instance and update the DNS records (a Nagios event handler runs a script that adds/deletes the "A" record using nsupdate).
 
The primary DNS and Nagios are clearly single points of failure, but we prefer to keep the setup very simple, avoiding for example the inconsistency of DNS information when using more than one primary DNS (as you reported), or incoherent results if more than one server checks the top-BDII instances. To mitigate these single points of failure, we check (via another Nagios instance) both the DNS server and the Nagios instance used to update the DNS records, and an SMS notification is sent in case of problems to the people on duty for 24/7 support.
 
About a best practice document, I think it should explain:

* recommended hardware setup;
* why DNS round robin is a good technique to adopt for top-BDII load balancing;
* what to check to verify availability of a top-BDII instance.

Other issues, like the use of virtual machines, how to configure the DNS, how to check the top-BDII instances (using Nagios or a cron job, for example) and how to update the DNS, are implementation details: they highly depend on the configuration, experience and policies adopted at each resource centre and NGI. Of course, the best practice documentation could be complemented with some use cases.
 
 
Service Reference Card: https://twiki.cern.ch/twiki/bin/view/EMI/EMTSrcTemplate#SrcLinks

The following notes also apply to the resource-BDII and the site-BDII.

=== Top-BDII load balancing from the service perspective ===
Load balancing is a technique to distribute workload evenly across two or more resources (http://en.wikipedia.org/wiki/Load_distribution). A load-balancing method which does not necessarily require a dedicated software or hardware node is round robin DNS, described at http://en.wikipedia.org/wiki/Round-robin_DNS.
=== Why does simple DNS load balancing work for the top-BDII? ===
We can assume that all transactions (queries to the top-BDII) generate the same resource load. This means that all top-BDII instances should have the same hardware configuration to achieve effective load balancing; otherwise, a load-balancing arbiter is needed.
=== Virtual machine or physical machine ===
In the IGI setup, each instance is a virtual machine (KVM + virtio) with two cores, 4 GB RAM and a 10 GB hard disk. The main constraint when using round robin DNS is the need for more or less identical hardware (and software) configurations across the instances, to achieve effective load balancing. There is no reason to avoid virtual machines.
=== Top-BDII failover from the service perspective ===
Simple DNS load balancing does not provide fault tolerance against single-instance failures: if one instance out of five fails, 20% of the queries fail.
A simple failover mechanism can correct this situation by removing the failed instance from the DNS round robin set using nsupdate (http://linux.yyz.us/nsupdate/). To propagate DNS changes fast enough, it is important to use very short TTLs (60 s seems to be adequate). To properly configure the DNS server, please contact your DNS administrator.
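The failover step above can be sketched as a small function that generates the nsupdate command batch for removing or re-adding an instance's A record. The DNS server name, zone/alias and key file below are illustrative placeholders, not part of the original document; a real setup would also use TSIG authentication.

```shell
# nsupdate_batch ACTION IP: print the nsupdate commands that delete or
# re-add the given IP under the round-robin alias (illustrative names).
nsupdate_batch() {
    action="$1"   # "add" or "delete"
    ip="$2"
    echo "server ns1.example.org"                       # hypothetical primary DNS
    if [ "$action" = "delete" ]; then
        echo "update delete topbdii.example.org A $ip"
    else
        echo "update add topbdii.example.org 60 A $ip"  # 60 s TTL, as suggested above
    fi
    echo "send"
}
```

The output would then be piped into nsupdate, e.g. `nsupdate_batch delete 192.168.0.125 | nsupdate -k /etc/bind/update.key` (the key path is hypothetical).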
=== BDII check and DNS update ===
Useful information on how to monitor the BDII service is available at https://twiki.cern.ch/twiki/bin/view/EGEE/BDII#Monitoring_the_BDII_Instance. The probe used by SAM is available at Existing probes integrated into SAM/Nagios.
When a checked instance fails, its IP must be removed from the DNS via nsupdate. When the instance starts working again as expected, it must be re-added to the DNS. Both operations can be performed by a cron script, a daemon, etc.
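A cron-driven check could be sketched as follows. The port 2170 / base `o=grid` LDAP endpoint is the standard BDII one; the `CHECK_CMD` override is an illustrative hook, not part of the original document.

```shell
# check_instance HOST: print "ok" or "failed" depending on whether the
# instance answers a base LDAP query on the standard BDII port (2170).
# CHECK_CMD can be overridden (e.g. for testing); by default, ldapsearch runs.
check_instance() {
    host="$1"
    if ${CHECK_CMD:-ldapsearch -x -LLL -H ldap://$host:2170 -b o=grid -s base} >/dev/null 2>&1; then
        echo "ok"
    else
        echo "failed"
    fi
}
# A cron job could loop over the instances and, for each failed or
# recovered one, pipe the corresponding nsupdate commands to the DNS server.
```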
=== A basic recipe to add the checks to a Nagios instance ===
Create a servicegroup (e.g. TOP-BDII):
 define servicegroup{
     servicegroup_name  TOP-BDII
     alias              TOP-BDII
 }
Create a host definition for each instance:
 define host{
     host_name  top-bdii01
     use        basic_host  ; template with some common definitions
     address    192.168.0.125
 }
Define a service that checks the BDII (write or reuse a Nagios plugin; try the one used by SAM):
 define service{
     host_name              top-bdii01, top-bdii02, top-bdii03
     service_description    TOP-BDII
     use                    basic-service  ; template with some common definitions
     normal_check_interval  5
     max_check_attempts     4
     servicegroups          TOP-BDII
     check_command          check_bdii     ; try the one used by SAM
     event_handler          update_top-bdii
 }
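The `check_bdii` and `update_top-bdii` names above must also be mapped to Nagios command definitions. A possible sketch for the event handler command (the script path is a local choice; the `$SERVICE...$`, `$HOSTNAME$` and `$HOSTADDRESS$` macros are standard Nagios macros):

```text
define command{
    command_name  update_top-bdii
    command_line  /usr/local/nagios/libexec/eventhandlers/update_top-bdii $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $HOSTNAME$ $HOSTADDRESS$
}
```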
Write an event handler (update_top-bdii) that, at least:
* removes the IP from the DNS when the check status changes from OK to CRITICAL;
* adds the IP to the DNS when the check status changes from CRITICAL to OK.

The event handler script is called by Nagios every time the status of a check changes. In the above definition, the check is performed every 5 minutes (normal_check_interval defines the number of "time units" to wait before scheduling the next "regular" check of the service; "regular" checks are those that occur when the service is in an OK state, or in a non-OK state that has already been rechecked max_check_attempts times). When a service or host check results in a non-OK or non-UP state and the check has not yet been retried the number of times specified by the max_check_attempts directive in the service or host definition, the error type is called a soft error. The event handler can also be written with the following algorithm (based on the error type):
* from OK to CRITICAL SOFT 1: do nothing (it can be a network glitch);
* CRITICAL SOFT 2: run nsupdate to remove the IP from the DNS and try to restart the BDII service (this could be done by configuring NRPE, but it is out of the scope of these notes);
* CRITICAL SOFT 3: do nothing;
* CRITICAL HARD: send a notification;
* from any CRITICAL state to OK: run nsupdate to add the IP to the DNS and send a notification.
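The algorithm above can be sketched as a shell event handler. The echo lines are placeholders for the real nsupdate and notification commands, which depend on the local setup; the argument order assumes the standard Nagios macros `$SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $HOSTNAME$ $HOSTADDRESS$`.

```shell
# handle_event STATE STATETYPE ATTEMPT HOST IP
# Implements the per-error-type algorithm described in the list above.
handle_event() {
    state="$1"; statetype="$2"; attempt="$3"; host="$4"; ip="$5"
    case "$state" in
    OK)
        # from any CRITICAL state back to OK: re-add the IP and notify
        echo "nsupdate add $host A $ip"
        echo "notify $host OK"
        ;;
    CRITICAL)
        if [ "$statetype" = "SOFT" ] && [ "$attempt" = "2" ]; then
            # SOFT 2: remove the IP (a restart attempt via NRPE could go here)
            echo "nsupdate delete $host A $ip"
        elif [ "$statetype" = "HARD" ]; then
            # HARD error: send a notification
            echo "notify $host CRITICAL"
        fi
        # SOFT 1 (possible network glitch) and SOFT 3: do nothing
        ;;
    esac
}
```

Re-adding an A record whose identical copy is already present in the zone is a no-op for the DNS, so the OK branch does not need to track whether the IP was actually removed.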
 
 
 
= Documentation =
* [http://glite.cern.ch/glite-BDII_top/ <big>gLite-BDII_top Updates</big>]
* [https://twiki.cern.ch/twiki/bin/view/EGEE/BDII <big>gLite-BDII_top User and Admin Manual</big>]
* [https://twiki.cern.ch/twiki/bin/view/EGEE/Glite-BDII <big>gLite-BDII_top Service Reference Card</big>]  
* [http://glite.cern.ch/glite-BDII_top/known_issues <big>gLite-BDII_top known issues</big>]
 
<br />
 
== Authors ==

Latest revision as of 10:54, 31 August 2021