The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team at operations @ egi.eu.

Difference between revisions of "MAN05 top-BDII and site-BDII High Availability"

From EGIWiki
= Objective =
{{Template: Op menubar}} {{Template:Doc_menubar}}
This document provides guidelines for implementing a high availability TopBDII service.


<br />
{{DeprecatedAndMovedTo|new_location=https://docs.egi.eu/providers/operations-manuals/man05_top_and_site_bdii_high_availability}}


= TopBDII service requirements =
[[Category:Operations_Manuals]]
== Hardware ==
* dual core CPU
* 10GB of hard disk space
* 2-3 GB RAM. If you decide to set BDII_RAM_DISK=yes in your YAIM configuration, it's advisable to have 4GB of RAM.
 
<br />
 
== Co-hosting ==
* Because the information system is critical to the operation of the grid, the TopBDII should be installed as a stand-alone service so that problems with other services do not affect the BDII. Under no circumstances should the BDII be co-hosted with a service that has the potential to generate a high load.
 
<br />
 
== Physical vs Virtual Machines ==
* There is no consensus on this topic. Some managers report performance issues when deploying a TopBDII service on a virtual machine. Others argue that such performance issues are related to the configuration of the service itself. The only agreed point is that management and disaster recovery of any service deployed on a virtual machine are more flexible and easier. This could be an important point to take into account, considering the critical importance of the TopBDII service.
<br />
 
= Best practices from a client perspective =
 
* gLite clients use the LCG_GFAL_INFOSYS variable to determine the default TopBDIIs configured at each node (UI, WN, WMS).
 $ env | grep GFAL
 LCG_GFAL_INFOSYS=egee-bdii.cnaf.infn.it:2170
 
* There is an optional YAIM variable (BDII_LIST) to define a list of top level BDIIs to support automatic failover in the GFAL clients. Be aware that lcg-infosites does not work with multiple BDIIs; only gfal, lcg_utils, lcg-info and glite-sd-query do.
 
* Site admins should configure their services with a list of TopBDIIs (the first entry of the list should be the default TopBDII provided by their NGI).
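
The failover behaviour described above can be sketched in Python. This is only an illustration, not how the GFAL clients implement it: it assumes the list is a comma-separated string of <code>host:port</code> entries, and the helper names <code>parse_infosys</code>, <code>tcp_probe</code> and <code>first_reachable</code> are hypothetical.

```python
import socket
from typing import Callable, List, Optional, Tuple

Endpoint = Tuple[str, int]

DEFAULT_PORT = 2170  # port used in the LCG_GFAL_INFOSYS example above


def parse_infosys(value: str) -> List[Endpoint]:
    """Split a comma-separated 'host:port' list (assumed format) into endpoints."""
    endpoints = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        host, _, port = item.partition(":")
        endpoints.append((host, int(port) if port else DEFAULT_PORT))
    return endpoints


def tcp_probe(endpoint: Endpoint, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the endpoint can be opened."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        return False


def first_reachable(endpoints: List[Endpoint],
                    probe: Callable[[Endpoint], bool] = tcp_probe) -> Optional[Endpoint]:
    """Try the endpoints in list order and return the first one that answers."""
    for endpoint in endpoints:
        if probe(endpoint):
            return endpoint
    return None
```

A caller could pass the value of the LCG_GFAL_INFOSYS environment variable to <code>parse_infosys</code> and then use <code>first_reachable</code> to pick a TopBDII. An open TCP port only shows that the LDAP server is listening, not that its content is up to date.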
 
= A TopBDII High Availability Proposal =
* The best practice proposal for a high availability TopBDII service is based on two mechanisms working as main building blocks:
# <big>'''DNS round robin load balancing'''</big>
# <big>'''Fault tolerance DNS Updater'''</big>
 
We will provide a short introduction to some of these DNS mechanisms but for further information on specific implementations, please contact your DNS administrator.
 
<br />
 
== DNS round robin load balancing ==
* [http://en.wikipedia.org/wiki/Load_distribution Load balancing] is a technique to distribute workload evenly across two or more resources. A load balancing method, which does not necessarily require a dedicated software or hardware node, is called [http://en.wikipedia.org/wiki/Round-robin_DNS round robin DNS].
 
* We can assume that all transactions (queries to the TopBDII) generate the same resource load. For effective load balancing, all TopBDII instances should have the same hardware configuration. Otherwise, a load balancing arbiter is needed.
 
* Simple round robin DNS load balancing is easy to deploy. Assuming that there is a primary DNS server (dns.top.domain) where the DNS load balancing will be implemented, one simply has to add multiple A records mapping the same hostname to multiple IP addresses under the core.top.domain [http://en.wikipedia.org/wiki/DNS_zone DNS zone]:
 # In dns.top.domain: add multiple A records mapping the same hostname to multiple IP addresses
 Zone core.top.domain
 topbdii.core.top.domain IN A x.x.x.x
 topbdii.core.top.domain IN A y.y.y.y
 topbdii.core.top.domain IN A z.z.z.z
 
* All 3 records are always served in the answer, but the order of the records rotates with each DNS query.
 
* '''This does NOT provide fault tolerance against problems in the TopBDIIs themselves'''
# if one TopBDII fails, its DNS “A” record will still be served
# one in every three DNS queries will return the failed TopBDII as the first answer
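
The rotation and its limitation can be illustrated with a toy model in Python. It does not query a real DNS server: the class <code>RoundRobinZone</code> and the function <code>share_of_first_answers</code> are hypothetical names, and the model assumes a strict one-step rotation per query.

```python
from collections import deque
from typing import Deque, List


class RoundRobinZone:
    """Toy model of a DNS server rotating the A records of one hostname."""

    def __init__(self, addresses: List[str]) -> None:
        self._records: Deque[str] = deque(addresses)

    def query(self) -> List[str]:
        """Return all records, then rotate so the next answer starts one further."""
        answer = list(self._records)
        self._records.rotate(-1)
        return answer


def share_of_first_answers(zone: RoundRobinZone, address: str, queries: int) -> float:
    """Fraction of answers in which `address` is listed first."""
    hits = sum(1 for _ in range(queries) if zone.query()[0] == address)
    return hits / queries
```

With three records, an instance that has failed but is still in the zone is listed first in one third of the answers, which is why round robin alone does not give fault tolerance.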
 
<br />
 
== Fault tolerance DNS Updater ==
* The DNS Updater is a mechanism (to be implemented by you) which tests the different TopBDIIs and decides whether to remove or add DNS entries through DNS dynamic updates. Fault tolerance is achieved by dynamically removing the DNS “A” records of unavailable TopBDII(s). [http://linux.yyz.us/nsupdate/ nsupdate], introduced in BIND 8, offers the possibility of changing DNS records dynamically:
# The nsupdate tool connects to a BIND server on port 53 (TCP or UDP) and can update zone records
# Updates are authorized based on keys
# Updates can only be performed on the DNS primary server
# In the BIND implementation, the changes are kept in a zone journal file until the server is stopped, at which point the DNS server rewrites the entire zone to reflect them. Therefore, the zone should not be managed manually
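
As a sketch of what such a dynamic update could look like, the Python function below builds the command text that nsupdate reads from standard input (<code>server</code>, <code>zone</code>, <code>update delete</code>, <code>update add</code>, <code>send</code>). The function name <code>build_nsupdate_script</code> and the default TTL of 60 seconds are assumptions made for this example; the host names are the placeholders used in the round robin section.

```python
from typing import List


def build_nsupdate_script(server: str, zone: str, hostname: str,
                          remove: List[str], add: List[str], ttl: int = 60) -> str:
    """Build the command text that is fed to nsupdate on standard input."""
    # Use a fully qualified name so it is not expanded relative to the zone.
    fqdn = hostname if hostname.endswith(".") else hostname + "."
    lines = [f"server {server}", f"zone {zone}"]
    # Delete only the A record of each failed instance, not the whole record set.
    lines += [f"update delete {fqdn} A {ip}" for ip in remove]
    # Re-add recovered instances with an explicit (short) TTL.
    lines += [f"update add {fqdn} {ttl} A {ip}" for ip in add]
    lines.append("send")
    return "\n".join(lines) + "\n"
```

The resulting text could be piped to <code>nsupdate -k</code> followed by the path of the key file that authorizes the update; the key file location depends on the local DNS setup.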
 
<br />
 
=== Implementation ===
 
* There are several alternatives to implement the DNS Updater:
# NAGIOS based tests
# a daemonized service
# scripts run as cron jobs
 
<br />
 
=== What to test: BDII metrics ===
* Status information about the BDII is available by querying the o=infosys root for the UpdateStats object. This entry contains a number of metrics relating to the latest update, such as the time taken to update the database and the total number of entries. An example of such an entry is shown below.
 
 dn: Hostname=lxbra2510.cern.ch,o=infosys
 objectClass: UpdateStats
 Hostname: lxbra2510.cern.ch
 FailedDeletes: 0
 ModifiedEntries: 4950
 DeletedEntries: 1318
 UpdateTime: 150
 FailedAdds: 603
 FailedModifies: 0
 TotalEntries: 52702
 QueryTime: 8
 NewEntries: 603
 DBUpdateTime: 11
 ReadTime: 0
 PluginsTime: 4
 ProvidersTime: 113
 
* The following table shows the meaning of these metrics:
<center>
{|style="border-collapse: collapse; border-width: 1px; border-style: solid; border-color: #000"
|-
!style="border-style: solid; border-width: 1px"| Metric
!style="border-style: solid; border-width: 1px"| Description
|-
|style="border-style: solid; border-width: 1px"|ModifiedEntries
|style="border-style: solid; border-width: 1px"|The number of objects to modify
|-
|style="border-style: solid; border-width: 1px"|DeletedEntries
|style="border-style: solid; border-width: 1px"|The number of objects to delete
|-
|style="border-style: solid; border-width: 1px"|UpdateTime
|style="border-style: solid; border-width: 1px"|The total update time in seconds
|-
|style="border-style: solid; border-width: 1px"|FailedAdds
|style="border-style: solid; border-width: 1px"|The number of add statements which failed
|-
|style="border-style: solid; border-width: 1px"|FailedModifies
|style="border-style: solid; border-width: 1px"|The number of modify statements which failed
|-
|style="border-style: solid; border-width: 1px"|TotalEntries
|style="border-style: solid; border-width: 1px"|The total number of entries in the database
|-
|style="border-style: solid; border-width: 1px"|QueryTime
|style="border-style: solid; border-width: 1px"|The time taken to query the database
|-
|style="border-style: solid; border-width: 1px"|NewEntries
|style="border-style: solid; border-width: 1px"|The number of new objects
|-
|style="border-style: solid; border-width: 1px"|DBUpdateTime
|style="border-style: solid; border-width: 1px"|The time taken to update the database in seconds
|-
|style="border-style: solid; border-width: 1px"|ReadTime
|style="border-style: solid; border-width: 1px"|The time taken to read the LDIF sources in seconds
|-
|style="border-style: solid; border-width: 1px"|PluginsTime
|style="border-style: solid; border-width: 1px"|The time taken to run the plugins in seconds
|-
|style="border-style: solid; border-width: 1px"|ProvidersTime
|style="border-style: solid; border-width: 1px"|The time taken to run the information providers in seconds
|}
</center>
 
* The BDII metrics above can be checked to decide whether a TopBDII instance is reliable and available.
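
A minimal Python sketch of such a decision is shown below. It parses an UpdateStats entry like the example above and compares two metrics against thresholds. The function names and both threshold values are assumptions chosen for illustration, not recommended settings, and the parser only handles plain <code>Attribute: value</code> lines (no base64 values or folded lines).

```python
from typing import Dict, List


def parse_update_stats(ldif: str) -> Dict[str, str]:
    """Parse one simple LDIF entry ('Attribute: value' lines) into a dict."""
    entry: Dict[str, str] = {}
    for line in ldif.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            entry[key.strip()] = value.strip()
    return entry


def health_problems(entry: Dict[str, str],
                    min_total_entries: int = 10000,  # assumed threshold
                    max_update_time: int = 300) -> List[str]:  # assumed, in seconds
    """Return the reasons why the instance looks unhealthy (empty list if none)."""
    if entry.get("objectClass") != "UpdateStats":
        return ["no UpdateStats entry returned"]
    problems = []
    if int(entry.get("TotalEntries", 0)) < min_total_entries:
        problems.append("TotalEntries below threshold")
    if int(entry.get("UpdateTime", 0)) > max_update_time:
        problems.append("UpdateTime above threshold")
    return problems
```

An instance for which <code>health_problems</code> returns a non-empty list would be a candidate for removal from the DNS round robin set. Suitable thresholds depend on the size of the infrastructure published by the TopBDII.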
 
* More information is available in [https://twiki.cern.ch/twiki/bin/view/EGEE/BDII#Monitoring_the_BDII_Instance gLite-BDII_top Monitoring].
 
<br />
 
=== Notes on DNS caching ===
* DNS records obtained in queries are cached by DNS servers (usually for 24 hours). Therefore, to propagate DNS changes fast enough, it is important to use very short TTL values.
* DNS was not designed for very short TTL values. They may greatly increase the number of queries and, as a result, the load on the DNS server.
* The TTL value to be used will have to be tested.
* If the TopBDIIs are only used by sites in the region, and if queries only come from the DNS servers of these few sites, then the number of queries may be low enough to allow a very small TTL.
* The TTL should not be set lower than 30 to 60 seconds.
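
The trade-off can be made concrete with a rough back-of-the-envelope calculation in Python. Both functions are simplifications introduced here: they assume that a failure is noticed at the next periodic check, that the DNS update itself is instantaneous, and that each caching DNS server re-queries at most once per TTL.

```python
def worst_case_failover_seconds(check_interval: int, ttl: int) -> int:
    """Upper bound on how long clients may still be handed a failed instance.

    A failure just after a check is only noticed at the next check, and a
    resolver that cached the record just before its removal keeps it for one TTL.
    """
    return check_interval + ttl


def max_queries_per_second(caching_resolvers: int, ttl: int) -> float:
    """Rough upper bound on the query rate reaching the authoritative server,
    assuming each caching resolver re-queries at most once per TTL."""
    return caching_resolvers / ttl
```

For example, with a check every 300 seconds and a TTL of 60 seconds, a failed instance may still be handed out for up to 360 seconds, and 120 caching DNS servers would generate at most about 2 queries per second for the record.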
 
<br />
 
=== Example 1: The DNS Updater IGI Nagios based mechanism ===
When Nagios needs to check the status of a service, it executes a plugin and passes it information about what needs to be checked.
* The plugin checks the operational state of the service and reports the results back to the Nagios daemon.
* Nagios processes the results of the service check and takes appropriate action as necessary (e.g. send notifications, run event handlers, etc.).

Active checks are executed:
* at regular intervals, as defined by the check_interval and retry_interval options in the service definitions;
* on demand, as needed.

In the IGI setup:
* Each instance is checked every 5 minutes (5 minutes seems to be adequate).
* If a failure occurs, Nagios tries to restart the BDII service and removes the instance from the DNS round robin set using dnsupdate:
** an email is sent as notification;
** if 4 of the 5 instances have failed, an SMS message is sent as notification.
* If a failed instance appears to be restored, Nagios re-adds it to the DNS set.
<br />
 
 
 
=== Example 2: The DNS Updater IBERGRID scripting based mechanism ===
 
<br />

Latest revision as of 10:54, 31 August 2021