VO Service Availability Monitoring




This article is deprecated and should no longer be used. It remains available for reference only.



Introduction to VO SAM

VOServicesWikiFig3.png

SAM as an NGI instance

The current operations model requires each NGI to deploy and operate its own Service Availability Monitoring (SAM) system to monitor the fraction of the EGI production infrastructure under its scope. SAM currently includes the following components:

  • a test execution framework (probes) based on the open source monitoring framework Nagios, and the Nagios Configuration Generator (NCG)
  • the Aggregated Topology Provider (ATP), the Metrics Description Database (MDDB), and the Metrics Results Database (MRDB)
  • the message bus to publish results and a programmatic interface
  • the visualization portal (MyEGI)

A full list of SAM instances across the EGI infrastructure is also available.


Each SAM instance triggers the execution of probes at the grid sites under its scope. The present list of probes includes:

  • Job submission testing via CE probe/metrics: the full job submission chain is exercised - job submission, state monitoring, output sandbox retrieval, ...
  • Data management testing via SRM probe/metrics: get full SRM endpoint(s) and storage areas from BDII, copy a local file to the SRM into default space area(s), ...
  • WN testing via WN probe/metrics: replica management tests (WN<->SE communication), ...
  • WMS testing via WMS probe/metrics using submissions to predefined CEs.
  • LFC testing via LFC probe/metrics: read and update catalogue entries, ...



VOServicesWikiFig4.png

SAM as a VO instance

The main idea is to take the SAM service already in use for EGI operations and adapt it to monitor the resources of a VO, or of multiple VOs, using the same SAM instance.


One of the most obvious advantages of this service is that a VO can develop and integrate its own probes. EGI operation teams test and monitor the status of resources through generic tests, which may be insufficient for certain communities. This approach allows those communities to define custom test suites and insert them in their SAM system.


In order to accomplish this multi-VO monitoring role, the SAM service instance has to be properly adapted:

  • the topology generation has to change so that the resources to be tested are properly configured. The difference with respect to the service used in operations is that VO resources may not be restricted to a single region, and may be spread across the whole EGI infrastructure.
  • the services which interoperate with the SAM services (the WMS used to submit jobs, the default SRM used to replicate files, ...) have to be properly configured to support those VOs


The following sections describe how to install and properly configure a Service Availability Monitoring system for VOs.


General Information

What is the VO SAM?

The VO SAM service is an adaptation of the operational SAM service used by NGIs. It is useful for monitoring VO/VRC infrastructures within a given NGI or group of NGIs.


Who can run the VO SAM?

The VO SAM was delivered so that VRCs (or VOs associated with a VRC) could take over the operation of the service. This is the preferred scenario, since it gives the VRC / VO full independence to configure the service according to VO needs (for example, the integration of new VO-specific probes). Nevertheless, delegating the operation of the service to a third party is not excluded.


Which services interoperate with VO SAM?

The VO SAM interoperates with the following middleware components which have to be declared at configuration time:

  1. WMS: for job submission tests
  2. central SE: for data replica tests
  3. MyProxy service: for renewal of VO credentials

The WMS and the central SE must support the VOs concerned. Ideally, the VO uses dedicated instances of these services for its SAM system, since Nagios tests induce high load peaks. The alternative is to use services available to the VO but shared by all VO users (and probably by other VOs as well). The endpoints of those services can be obtained through information system queries from any user interface:

# lcg-infosites --vo <VO NAME> wms
# lcg-infosites --vo <VO NAME> se

The MyProxy service is used to store and renew the credentials of the user submitting the Nagios jobs. Starting from Update-09 (still under Staged Rollout), SAM supports the use of robot certificates instead of MyProxy credentials. This is an optional feature which can be used only if your CA and VO support robot certificates. If they do, we suggest switching to robot certificates, as they are easier to maintain. Robot certificates also provide better availability, since SAM then no longer depends on the availability of the MyProxy server.

If the VO is unable to provide its own instances of those services, the VRC may make use of the formal links between VRCs and NGIs, which are foreseen by the VRC approval process and agreed at VRC approval time. This provides a direct way to find service providers for VRCs, and for any VO associated with a VRC.


Installation guide

System requirements

The VO SAM is a service with high load peaks, especially at job submission times. The system requirements have to be chosen according to the number of sites to be monitored for each VO and the number of VOs included in the same VO SAM box. As a minimum, we suggest:

  1. Scientific Linux 5, 64-bit
  2. 4 GB of RAM
  3. a quad-core processor (recommended, to better handle parallel submissions)

The system requirements may increase depending on the size of the VO infrastructure to be monitored.
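
As a rough illustration, the suggested minimums can be checked with a few lines of shell before installing. This is only a sketch: the function name `meets_minimum`, the memory threshold (slightly below a nominal 4 GB, to allow for memory reserved by the kernel or firmware) and the treatment of the core count as a warning rather than a failure are assumptions of this example, not part of SAM.

```shell
#!/bin/sh
# meets_minimum ARCH MEM_KB CORES   (hypothetical helper, not shipped with SAM)
# Prints one line per unmet suggestion. Exits non-zero if the architecture or
# the memory is below the suggested minimum; too few cores only give a warning.
meets_minimum() (
    arch=$1
    mem_kb=$2
    cores=$3
    status=0
    if [ "$arch" != "x86_64" ]; then
        echo "FAIL: architecture is $arch, expected x86_64"
        status=1
    fi
    # 3670016 kB = 3.5 GiB: an arbitrary allowance below a nominal 4 GB.
    if [ "$mem_kb" -lt 3670016 ]; then
        echo "FAIL: ${mem_kb} kB of RAM, expected about 4 GB"
        status=1
    fi
    if [ "$cores" -lt 4 ]; then
        echo "WARN: $cores core(s), four are recommended"
    fi
    exit "$status"
)

# Usage on a live Linux host:
#   meets_minimum "$(uname -m)" \
#       "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)" \
#       "$(getconf _NPROCESSORS_ONLN)"
```

The values are passed as arguments so that the same check can be applied to numbers collected from another machine.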


SAM service reference card

To get an overview of the service, in terms of

  • individual services running on the SAM box
  • configuration files for each service
  • log files for each service
  • open ports needed by the services
  • cron jobs scheduled for execution on the SAM box

please consult the SAM service reference card.



Install and configure SAM with YAIM

General guidelines

  • For a fresh installation, please check the installation / configuration guidelines given in the Installing SAM web page.
  • For upgrades, also check the SAM Release Notes web page.
  • Before actually configuring the service with YAIM, you should apply the VO specific guidelines presented in the next sections.
  • A brief explanation of all the YAIM variables used to configure SAM is available in SAM YAIM variable explanations.


Installation notes

  • During installation, you need to use yum-priorities so that the proper package versions are downloaded from the correct repositories. Please check the repository priorities carefully against the SAM version you are using.
  • Package installation sequence:
$ yum install lcg-CA
$ yum install httpd
$ yum --exclude=\*saga\* --exclude=\*SAGA\* groupinstall 'glite-UI (production - x86_64)'
$ yum install egee-NAGIOS
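
To review the repository priorities before running the commands above, something like the following sketch can help. It relies on the `priority=N` option that the yum priorities plugin reads from the repository files (lower numbers are preferred, and 99 is the plugin default when no priority is set). The function name `list_repo_priorities` is an illustrative assumption.

```shell
#!/bin/sh
# list_repo_priorities [DIR]   (hypothetical helper)
# Prints "<priority> <repo-id> <file>" for every repository section found in
# DIR/*.repo (default /etc/yum.repos.d), preferred repositories first.
# Sections without a priority= line are reported with 99, the plugin default.
list_repo_priorities() (
    dir=${1:-/etc/yum.repos.d}
    for f in "$dir"/*.repo; do
        [ -f "$f" ] || continue
        awk -v file="$f" '
            function emit() { if (id != "") print prio, id, file }
            /^[ \t]*\[.*\][ \t]*$/ {
                emit()                       # report the previous section
                id = $0
                sub(/^[ \t]*\[/, "", id)     # strip "[" and "]" around the id
                sub(/\][ \t]*$/, "", id)
                prio = 99
                next
            }
            /^[ \t]*priority[ \t]*=/ {
                split($0, kv, "=")
                gsub(/[ \t]/, "", kv[2])
                prio = kv[2]
            }
            END { emit() }
        ' "$f"
    done | sort -n
)

# Usage:
#   list_repo_priorities /etc/yum.repos.d
```

Which priority each repository should have depends on the SAM version, so the output is meant to be compared by hand with the release notes.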


Upgrade notes

  • After the upgrade and before running YAIM, the atp_synchro.conf.rpmnew and glite-info-service-nagios.conf.rpmnew configuration files should replace the existing ones:
$ mv /etc/atp/atp_synchro.conf.rpmnew /etc/atp/atp_synchro.conf
$ mv /opt/glite/etc/glite-info-service-nagios.conf.rpmnew /opt/glite/etc/glite-info-service-nagios.conf
  • Make sure that the new glite-info-service-nagios.conf includes the following line:
get_data = echo -e "Role=${NAGIOS_ROLE}\nMsgNagiosDestination="$(. /etc/sysconfig/msg-to-queue && echo $MSG_TO_QUEUE_DESTINATION)"\nVersion="$(cat /etc/sam-release)
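
The two upgrade steps can be wrapped in small helpers, as in the sketch below. Keeping a `.bak` copy of the previous file is an extra precaution added in this example, not something SAM requires, and the function names `promote_rpmnew` and `has_get_data` are hypothetical.

```shell
#!/bin/sh
# promote_rpmnew CONF   (hypothetical helper)
# If CONF.rpmnew exists, keep the current CONF as CONF.bak and move the
# new file into place. Does nothing when there is no .rpmnew file.
promote_rpmnew() (
    conf=$1
    if [ ! -f "$conf.rpmnew" ]; then
        echo "no $conf.rpmnew, nothing to do"
        exit 0
    fi
    if [ -f "$conf" ]; then
        cp -p "$conf" "$conf.bak"
    fi
    mv "$conf.rpmnew" "$conf"
)

# has_get_data CONF   (hypothetical helper)
# Succeeds if CONF has a get_data line that reads /etc/sam-release.
has_get_data() {
    grep -q '^get_data *=.*sam-release' "$1"
}

# Usage, with the paths given above:
#   promote_rpmnew /etc/atp/atp_synchro.conf
#   promote_rpmnew /opt/glite/etc/glite-info-service-nagios.conf
#   has_get_data /opt/glite/etc/glite-info-service-nagios.conf \
#       || echo "get_data line missing"
```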


VO specific guidelines

VO Topology generation: ATP vs LDAP

There are two alternative ways to configure a VO SAM.

  1. Use the Aggregated Topology Provider (ATP) as input for the Nagios Configuration Generator (NCG) to obtain the sites and services. ATP is part of the ROC/NGI Nagios package. It aggregates information from GOCDB, the Top BDII and the VO feed, and it is the single authoritative source of topology information.
  2. Use LDAP as the topology provider for the Nagios Configuration Generator (NCG).


  • ATP is now the preferred option for topology generation, and it can be switched on by defining:
NCG_TOPOLOGY_USE_GOCDB=false
NCG_TOPOLOGY_USE_ENOC=false
NCG_TOPOLOGY_USE_LDAP=false
NCG_TOPOLOGY_USE_SAM=false
NCG_TOPOLOGY_USE_ATP=true

After running yaim, your /etc/ncg/ncg.conf should contain the following entries under the SiteInfo block:

<NCG::SiteInfo>
  <ATP>
    ATP_ROOT_URL="http://grid-monitoring.cern.ch/atp"
  </ATP>
  (...)
</NCG::SiteInfo>
  • To use LDAP, switch off all topology providers except LDAP:
NCG_TOPOLOGY_USE_GOCDB=false
NCG_TOPOLOGY_USE_ENOC=false
NCG_TOPOLOGY_USE_LDAP=true
NCG_TOPOLOGY_USE_SAM=false
NCG_TOPOLOGY_USE_ATP=false
# if LDAP topology is used, control adding hosts (see SAM-1470) (optional variable)
NCG_LDAP_ADD_HOSTS=1

Then ensure that similar entries are defined in /etc/ncg/ncg.conf:

<NCG::SiteInfo>
 <LDAP>
   LDAP_ADDRESS=<Your TopBDII you want to use>
   ADD_HOSTS=1
   VO_FILTER=<Your VO>
 </LDAP>
  (...)
</NCG::SiteInfo>

Afterwards, restart the ncg service.
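
Both variants above enable exactly one topology provider and expect a matching block in /etc/ncg/ncg.conf. A minimal consistency check along those lines might look as follows. The function names are hypothetical, the YAIM file is assumed to contain plain `NAME=value` lines as in the snippets above, and only the ATP and LDAP blocks shown in this page are verified.

```shell
#!/bin/sh
# enabled_topology_providers YAIM_FILE   (hypothetical helper)
# Prints the suffix (ATP, LDAP, ...) of every NCG_TOPOLOGY_USE_* set to true.
enabled_topology_providers() {
    sed -n 's/^NCG_TOPOLOGY_USE_\([A-Z]*\)="\{0,1\}true"\{0,1\}[[:space:]]*$/\1/p' "$1"
}

# check_topology YAIM_FILE NCG_CONF   (hypothetical helper)
# Fails unless exactly one provider is enabled and, for ATP or LDAP,
# the corresponding <ATP> or <LDAP> block is present in NCG_CONF.
check_topology() (
    yaim=$1
    ncg=$2
    providers=$(enabled_topology_providers "$yaim")
    count=$(printf '%s\n' "$providers" | grep -c . || true)
    if [ "$count" -ne 1 ]; then
        echo "expected exactly one enabled provider, found $count"
        exit 1
    fi
    case $providers in
        ATP|LDAP)
            if ! grep -q "^[[:space:]]*<$providers>" "$ncg"; then
                echo "no <$providers> block in $ncg"
                exit 1
            fi
            ;;
    esac
    echo "topology provider: $providers"
)

# Usage (the YAIM file path is a placeholder):
#   check_topology /path/to/your/yaim/config /etc/ncg/ncg.conf
```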

YAIM configuration file example (for phys.vo.ibergrid.eu VO)

###
### Generic definitions for some of the core services
SITE_NAME=NCG-INGRID-PT
SITE_BDII_HOST=sbdii01.ncg.ingrid.pt
PX_HOST=px01.ncg.ingrid.pt
BDII_HOST=topbdii01.ncg.ingrid.pt

###
### VO Definitions
# List of VOs to support 
VOS="phys.vo.ibergrid.eu"
# VOMS server definition for vo1
VO_PHYS_VO_IBERGRID_EU_VOMS_SERVERS="'vomss://voms01.ncg.ingrid.pt:8443/voms/phys.vo.ibergrid.eu?/phys.vo.ibergrid.eu/'"
# VOMSES server definition for vo1
VO_PHYS_VO_IBERGRID_EU_VOMSES="'phys.vo.ibergrid.eu voms01.ncg.ingrid.pt 40007 /C=PT/O=LIPCA/O=LIP/OU=Lisboa/CN=voms01.ncg.ingrid.pt phys.vo.ibergrid.eu'"
# DN of the CA which issued the VOMS Certificate
VO_PHYS_VO_IBERGRID_EU_VOMS_CA_DN="/C=PT/O=LIPCA/CN=LIP Certification Authority"
# WMS used to submit jobs to vo1
VO_PHYS_VO_IBERGRID_EU_WMS_HOSTS="wms01.ncg.ingrid.pt"

###
### LFC and SE definitions for the data management tests
JOBSUBMIT_WN_LFC=lfc01.ncg.ingrid.pt
JOBSUBMIT_WN_SE_REP=se01-tic.ciemat.es

###
### Nagios
NAGIOS_ADMIN_DNS="/C=PT/O=LIPCA/O=LIP/OU=Lisboa/CN=Goncalo Borges"
NCG_NAGIOS_ADMIN=goncalo@lip.pt
NAGIOS_HOST=nagios02.ncg.ingrid.pt
# Nagios is acting on a VO 
NAGIOS_ROLE=vo
NCG_PROBES_TYPE=local
# List of VOs the tests should run as. 
NCG_VO="phys.vo.ibergrid.eu"
# Do not show hosts without services associated 
NCG_INCLUDE_EMPTY_HOSTS=0
NAGIOS_HTTPD_ENABLE_CONFIG=true
NAGIOS_NCG_ENABLE_CONFIG=true
NAGIOS_SUDO_ENABLE_CONFIG=true
NAGIOS_NAGIOS_ENABLE_CONFIG=true
NAGIOS_CGI_ENABLE_CONFIG=true
NAGIOS_NSCA_PASS="xxxxxx" 
NAGIOS_NCG_ENABLE_CRON=true

COUNTRY_NAME=Portugal

###
### NCG configurations 
# list all the NGI/ROCs
NCG_GOCDB_ROC_NAME=NGI_IBERGRID
NCG_TOPOLOGY_USE_SAM=false
NCG_TOPOLOGY_USE_GOCDB=false
NCG_TOPOLOGY_USE_ENOC=false
NCG_TOPOLOGY_USE_LDAP=false
NCG_TOPOLOGY_USE_ATP=true
NCG_TOPOLOGY_ATP_ROOT_URL="http://grid-monitoring.cern.ch/atp"
NCG_REMOTE_USE_SAM=false
NCG_REMOTE_USE_NAGIOS=false
NCG_REMOTE_USE_ENOC=false
NCG_MDDB_SUPPORTED_PROFILES="ROC,ROC_CRITICAL,ROC_OPERATORS"

###
### DB data
MYSQL_ADMIN="xxxxxx"
DB_PASS="xxxxxx"
DB_TYPE=mysql
DB_USER=mrs
DB_NAME=mrs
DB_HOST="localhost"

###
### MyEGI
MYEGI_ADMIN_NAME="Goncalo Borges"
MYEGI_ADMIN_EMAIL="goncalo@lip.pt"
MYEGI_DEFAULT_PROFILE="ROC"
MYEGI_DEBUG="true"
MYEGI_DATABASE_PASSWORD="xxxxxx"
MYEGI_REGION="NGI_IBERGRID"
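
In the example above, the per-VO variables are named after the VO in upper case with dots replaced by underscores (phys.vo.ibergrid.eu becomes VO_PHYS_VO_IBERGRID_EU). The sketch below derives that prefix and lists which of the four per-VO variables used in the example are missing from a configuration file. The function names are hypothetical, and mapping hyphens to underscores as well is an assumption of this sketch, since the example only shows dots.

```shell
#!/bin/sh
# yaim_vo_prefix VO_NAME   (hypothetical helper)
# phys.vo.ibergrid.eu -> VO_PHYS_VO_IBERGRID_EU
# Dots follow the example above; hyphens are handled the same way by assumption.
yaim_vo_prefix() {
    printf 'VO_%s\n' "$(printf '%s' "$1" | tr 'a-z.-' 'A-Z__')"
}

# missing_vo_vars VO_NAME YAIM_FILE   (hypothetical helper)
# Prints the per-VO variables from the example that YAIM_FILE does not define.
missing_vo_vars() (
    prefix=$(yaim_vo_prefix "$1")
    file=$2
    for suffix in VOMS_SERVERS VOMSES VOMS_CA_DN WMS_HOSTS; do
        grep -q "^${prefix}_${suffix}=" "$file" || echo "${prefix}_${suffix}"
    done
)

# Usage (the YAIM file path is a placeholder):
#   missing_vo_vars phys.vo.ibergrid.eu /path/to/your/yaim/config
```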


VO specific probes

At present, including VO-specific probes is not straightforward. The first step is to learn how a Nagios probe must be developed. A tutorial is available here:

Afterwards, the VO has to define its own profile in /usr/lib/perl5/vendor_perl/5.8.5/NCG/LocalMetrics/Hash_local.pm, in the same way as the existing profiles are defined in /usr/lib/perl5/vendor_perl/5.8.5/NCG/LocalMetrics/Hash.pm. To understand the attributes, flags and static file rules used in that Perl module, please consult:

Finally, in the YAIM configuration you need to point to the new profile:

NCG_HASH_CONFIG_PROFILES=customprofile

For a multi-VO instance where each VO has its own profile, the configuration would be:

NCG_VO="vo1 vo2"
NCG_HASH_CONFIG_PROFILES=customprofileA,customprofileB
NCG_PROFILE_FQAN_customprofileA=vo1
NCG_PROFILE_FQAN_customprofileB=vo2
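
With several profiles it is easy to list one in NCG_HASH_CONFIG_PROFILES and forget its NCG_PROFILE_FQAN_ line. The following sketch reports such profiles. It only mirrors the pattern of the example above. Whether SAM strictly requires one such line per profile is an assumption here, and the function name is hypothetical.

```shell
#!/bin/sh
# profiles_without_fqan YAIM_FILE   (hypothetical helper)
# Prints every profile named in NCG_HASH_CONFIG_PROFILES that has no
# NCG_PROFILE_FQAN_<profile>= line in the same file.
profiles_without_fqan() (
    file=$1
    # Take the last definition, with or without double quotes around the value.
    profiles=$(sed -n 's/^NCG_HASH_CONFIG_PROFILES="\{0,1\}\([^"]*\)"\{0,1\}[[:space:]]*$/\1/p' "$file" | tail -n 1)
    for p in $(printf '%s' "$profiles" | tr ',' ' '); do
        grep -q "^NCG_PROFILE_FQAN_${p}=" "$file" || echo "$p"
    done
)

# Usage (the YAIM file path is a placeholder):
#   profiles_without_fqan /path/to/your/yaim/config
```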


Contacts

  • GGUS: VO Services Support Unit
  • Mailing list: vo-services@mailman.egi.eu
  • SAM mailing list: tool-admins@mailman.egi.eu


Additional references


FAQ VO Service Availability Monitoring