MAN05 top-BDII and site-BDII High Availability

Title (EMI.gLite) top-BDII and site-BDII High Availability
Document link https://wiki.egi.eu/wiki/MAN05
Last modified v1.0, 19 August 2014
Policy Group Acronym OMB
Policy Group Name Operations Management Board
Contact Group operations-support@mailman.egi.eu
Document Status Approved
Approved Date 21 June 2011
Procedure Statement This manual describes how to deploy the BDII service in a High Availability configuration. It is equally applicable to site BDII and top BDII.
Owner Owner of procedure


(EMI.gLite) top-BDII and site-BDII High Availability

The objective of this document is to provide guidelines for improving the availability of the information system, addressing three main areas:

  1. Requirements to deploy a TopBDII (siteBDII) service
  2. High Availability from a client perspective
  3. Configuration of a High Availability TopBDII (siteBDII) service


Requirements to deploy a TopBDII (siteBDII) service

Hardware

Co-hosting

Physical vs Virtual Machines

Best practices from a client perspective for top-BDII

BDII_LIST=topbdii.domain.one:2170[,topbdii.domain.two:2170[...]]
LCG_GFAL_INFOSYS=topbdii.domain.one:2170,topbdii.domain.two:2170
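
The comma-separated list above lets a client fall back to a second top-BDII when the first one is unreachable. The following is a minimal Python sketch of that idea, not the logic of any actual grid client: the function names `parse_endpoints` and `first_reachable` are hypothetical, and it assumes every entry has the form `host:port`.

```python
import socket


def parse_endpoints(value):
    """Split 'host:port,host:port' into a list of (host, port) tuples."""
    endpoints = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        # Assumption: every entry carries an explicit port after the last ':'.
        host, _, port = item.rpartition(":")
        endpoints.append((host, int(port)))
    return endpoints


def first_reachable(endpoints, timeout=5.0, connect=socket.create_connection):
    """Return the first endpoint that accepts a TCP connection, or None."""
    for host, port in endpoints:
        try:
            conn = connect((host, port), timeout)
        except OSError:
            continue  # this top-BDII is unreachable, try the next one
        conn.close()
        return (host, port)
    return None
```

For example, `first_reachable(parse_endpoints("topbdii.domain.one:2170,topbdii.domain.two:2170"))` would return the second endpoint if the first one refuses connections. A TCP connect only shows that the port is open; it does not show that the BDII content is fresh.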


Best practices for a TopBDII (siteBDII) High Availability service

  1. DNS round robin load balancing
  2. A fault-tolerant DNS Updater

We will provide a short introduction to some of these DNS mechanisms but for further information on specific implementations, please contact your DNS administrator.

DNS round robin load balancing

# In dns.top.domain: Add multiple A records mapping the same hostname to multiple IP addresses
Zone core.top.domain
topbdii.core.top.domain IN A x.x.x.x
topbdii.core.top.domain IN A y.y.y.y
topbdii.core.top.domain IN A z.z.z.z
  1. If one TopBDII (siteBDII) fails, its DNS “A” record will still be served
  2. One in every three DNS queries will return the failed TopBDII (siteBDII) as the first answer
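
The two points above can be illustrated with a small Python simulation. It is an idealized sketch: it assumes the server rotates the answer list by exactly one position per query, whereas real DNS servers may order answers cyclically, randomly, or in a fixed order. The addresses are the placeholders from the zone example, and `y.y.y.y` is arbitrarily chosen as the failed instance.

```python
from collections import deque


def round_robin_answers(addresses, queries):
    """Yield the answer list for each query, rotating by one each time."""
    ring = deque(addresses)
    for _ in range(queries):
        yield list(ring)
        ring.rotate(-1)  # the next query starts with the next address


records = ["x.x.x.x", "y.y.y.y", "z.z.z.z"]
failed = "y.y.y.y"  # assumption: this instance is down but still in the zone

answers = list(round_robin_answers(records, 300))
failed_first = sum(1 for answer in answers if answer[0] == failed)

# Fraction of queries whose first answer is the failed instance.
print(failed_first / len(answers))
```

Under these assumptions the failed address appears in every answer and is listed first in one third of them, which is why plain round robin needs to be complemented by a mechanism that removes failed instances from the zone.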

Fault-tolerant DNS Updater

  1. The nsupdate tool connects to a BIND server on port 53 (TCP or UDP) and can update zone records
  2. Updates are authorized based on keys
  3. Updates can only be performed on the DNS primary server
  4. In the BIND implementation, the changes are kept in a zone journal file until a “stop” happens, at which point the DNS server rewrites the entire zone to reflect them. Therefore, the zone should not be managed manually.
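
As an illustration of the points above, the following Python sketch builds the command script that nsupdate reads and optionally pipes it to the tool. The helper names, the 60-second TTL, and the key file path are assumptions made for this example; the server, zone, and record names are the placeholders used earlier in this document.

```python
import subprocess


def build_nsupdate_script(server, zone, name, remove=(), add=(), ttl=60):
    """Return nsupdate commands that delete and add A records for `name`."""
    lines = [f"server {server}", f"zone {zone}"]
    for ip in remove:
        # Delete only the A record pointing at this specific address.
        lines.append(f"update delete {name} A {ip}")
    for ip in add:
        # Assumption: a short TTL so that clients notice changes quickly.
        lines.append(f"update add {name} {ttl} A {ip}")
    lines.append("send")
    return "\n".join(lines) + "\n"


def apply_update(script, keyfile):
    """Feed the script to nsupdate on stdin, authenticating with a key file."""
    subprocess.run(["nsupdate", "-k", keyfile], input=script,
                   text=True, check=True)


script = build_nsupdate_script(
    server="dns.top.domain",
    zone="core.top.domain",
    name="topbdii.core.top.domain.",
    remove=["y.y.y.y"],
)
print(script)
```

Calling `apply_update(script, "/path/to/keyfile")` (a hypothetical path) would send the update to the primary server. The script is built separately from the call so that it can be logged or reviewed before it is applied.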

Implementation

  1. NAGIOS based tests
  2. a daemonized service
  3. scripts run as cron jobs

What to test: BDII metrics

$ ldapsearch -x -h <TopBDII/siteBDII> -p 2170 -b "o=infosys" 
(...)
dn: Hostname=localhost,o=infosys
objectClass: UpdateStats
Hostname: lxbra2510.cern.ch
FailedDeletes: 0
ModifiedEntries: 4950
DeletedEntries: 1318
UpdateTime: 150
FailedAdds: 603
FailedModifies: 0
TotalEntries: 52702
QueryTime: 8
NewEntries: 603
DBUpdateTime: 11
ReadTime: 0
PluginsTime: 4
ProvidersTime: 113
$ ldapsearch -x -h <TopBDII/siteBDII> -p 2170 -b "o=infosys" +
(...)
# localhost, infosys
dn: Hostname=localhost,o=infosys
structuralObjectClass: UpdateStats
entryUUID: 09bf40e0-7b23-4992-af55-fd74f036a454
creatorsName: o=infosys
createTimestamp: 20110612223435Z
entryCSN: 20110615120723.216201Z#000000#000#000000
modifiersName: o=infosys
modifyTimestamp: 20110615120723Z
entryDN: Hostname=localhost,o=infosys
subschemaSubentry: cn=Subschema
hasSubordinates: FALSE
Metric           Description
ModifiedEntries  The number of objects to modify
DeletedEntries   The number of objects to delete
UpdateTime       The total update time in seconds
FailedAdds       The number of add statements which failed
FailedModifies   The number of modify statements which failed
TotalEntries     The total number of entries in the database
QueryTime        The time taken to query the database
NewEntries       The number of new objects
DBUpdateTime     The time taken to update the database in seconds
ReadTime         The time taken to read the LDIF sources in seconds
PluginsTime      The time taken to run the plugins in seconds
ProvidersTime    The time taken to run the information providers in seconds
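
A monitoring test can turn these metrics into a pass/fail decision. The Python sketch below parses the `o=infosys` output of ldapsearch and applies two checks. The function names and the thresholds (`min_entries`, `max_update_time`) are hypothetical values chosen for illustration, not recommended limits; suitable values depend on the BDII being monitored.

```python
def parse_update_stats(ldif_text):
    """Collect the integer-valued attributes of the UpdateStats entry."""
    stats = {}
    for line in ldif_text.splitlines():
        key, sep, value = line.partition(": ")
        # Keep only "Name: <integer>" lines; skip dn, Hostname, and so on.
        if sep and value.strip().lstrip("-").isdigit():
            stats[key.strip()] = int(value)
    return stats


def is_healthy(stats, min_entries=1000, max_update_time=600):
    """Assumed policy: enough entries published and updates fast enough."""
    total = stats.get("TotalEntries", 0)
    # A missing UpdateTime is treated as a failure.
    update_time = stats.get("UpdateTime", max_update_time + 1)
    return total >= min_entries and update_time <= max_update_time


sample = """dn: Hostname=localhost,o=infosys
objectClass: UpdateStats
Hostname: lxbra2510.cern.ch
FailedDeletes: 0
UpdateTime: 150
FailedAdds: 603
TotalEntries: 52702
"""

stats = parse_update_stats(sample)
print(stats["TotalEntries"], is_healthy(stats))
```

The sample text is an excerpt of the ldapsearch output shown above. In a real test the text would come from running ldapsearch against the instance under test.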


DNS caching

Example 1: The IGI Nagios based mechanism

  1. an email is sent as notification;
  2. if 4 (out of 5) instances are failing, an SMS message is sent as notification.
[Figure: DNSUpdater@IGI.png]
  1. The Nagios instance can fail
  2. The master DNS where the DNS entries are updated can fail

Example 2: The IBERGRID scripting based mechanism

  1. Written in perl
  2. Can be run as daemon or at the command prompt
  3. The tests are programs that are forked
  4. Tests are added in a modular fashion
  5. Can be used to manage several DNS round robin scenarios
  6. Can manage multiple DNS servers
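
The modular approach described above can be sketched in Python as follows. This is not the IBERGRID code, which is written in Perl and forks external test programs; here the tests are plain Python callables, and the function names are hypothetical. The rule of leaving DNS untouched when no instance is healthy is also an assumption of this sketch.

```python
def check_alias(members, tests):
    """Split member IPs into (healthy, failed); all tests must pass."""
    healthy, failed = [], []
    for ip in members:
        ok = all(test(ip) for test in tests)
        (healthy if ok else failed).append(ip)
    return healthy, failed


def plan_updates(current, healthy):
    """Return (to_add, to_remove) so the DNS records match the healthy set."""
    if not healthy:
        # Assumption: never empty the alias; keep the current records.
        return [], []
    to_add = [ip for ip in healthy if ip not in current]
    to_remove = [ip for ip in current if ip not in healthy]
    return to_add, to_remove


# Each test is a callable taking an IP; new tests are added by extending the list.
tests = [lambda ip: ip != "y.y.y.y"]  # stand-in for a real BDII check

members = ["x.x.x.x", "y.y.y.y", "z.z.z.z"]
healthy, failed = check_alias(members, tests)
to_add, to_remove = plan_updates(current=members, healthy=healthy)
print(to_add, to_remove)
```

Running the same two functions once per round robin alias and once per DNS server would cover the "several scenarios" and "multiple DNS servers" cases in the list above.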
[Figure: Nsupdater@IBERGRID.png]
  1. Three primary servers would then exist for core.ibergrid.eu
  2. All three DNS servers could be dynamically updated independently
  3. The monitoring application should also have three instances, one running at each site
  4. The downside is that DNS information can become incoherent. It would be up to the monitoring application to manage the three DNS servers content and their cohe
[Figure: Nsupdater@IBERGRID2.png]

Revision history

Version  Authors                                      Date            Comments
1.0      Goncalo Borges, Jorge Gomes, Paolo Veronesi  2011-06-15      first draft
1.1      Paolo Veronesi                               2012-06-21      This manual is equally applicable to site BDII, added some notes about it.
         M. Krakowian                                 19 August 2014  Change contact group -> Operations support