The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team at operations @ egi.eu.

Difference between revisions of "VO Service Availability Monitoring"

(23 intermediate revisions by 2 users not shown)
Line 1: Line 1:
{{Template:VOServicesMainMenu}}
{{Template:Op menubar}}
<br />
{{Template:Doc_menubar}}
 
{{TOC_right}}
__TOC__
{{Template:Deprecated}}
 
<br />
<br />


Line 11: Line 10:


== SAM as an NGI instance ==
== SAM as an NGI instance ==
The current operations model forces that each NGI must deploy and operate their own [https://wiki.egi.eu/wiki/SAM '''Service Availability Monitoring (SAM)'''] system to monitor the fraction of EGI production infrastructure under their scope. SAM currently includes the following components:
The current operations model forces that each NGI must deploy and operate their own [[SAM |'''Service Availability Monitoring (SAM)''']] system to monitor the fraction of EGI production infrastructure under their scope. SAM currently includes the following components:
* a test execution framework (probes) based on the open source monitoring framework Nagios, and the Nagios Configuration Generator (NCG)
* a test execution framework (probes) based on the open source monitoring framework Nagios, and the Nagios Configuration Generator (NCG)
* the Aggregated Topology Provider (ATP), the Metrics Description Database (MDDB), and the Metrics Results Database (MRDB)
* the Aggregated Topology Provider (ATP), the Metrics Description Database (MDDB), and the Metrics Results Database (MRDB)
* the message bus to publish results and a programmatic interface
* the message bus to publish results and a programmatic interface
* the visualization portal (MyEGI)
* the visualization portal (MyEGI)
The full list of SAM instances across the EGI infrastructure can be consulted [https://wiki.egi.eu/wiki/SAM_Instances here])
The full list of SAM instances across the EGI infrastructure can be consulted [[SAM_Instances| here]])


<br />
<br />
Line 135: Line 134:
==== VO Topology generation: ATP vs LDAP ====
==== VO Topology generation: ATP vs LDAP ====
There are two alternative ways to configure a VO SAM.
There are two alternative ways to configure a VO SAM.
# Use ATP ([http://grid-monitoring.cern.ch/atp|Aggregated Topology Provider)as an input for Nagios Configuration Generator(NCG) to get the sites and services. ATP is part of the ROC/NGI nagios package which aggregate information from GOCDB, Top BDII and VO feed; and it is single authoritative information source with topology information.
# Use [http://grid-monitoring.cern.ch/atp Aggregated Topology Provider] as an input for Nagios Configuration Generator(NCG) to get the sites and services. ATP is part of the ROC/NGI nagios package which aggregate information from GOCDB, Top BDII and VO feed; and it is single authoritative information source with topology information.
# Use LDAP as a topology provider for Nagios Configuration Generator(NCG)
# Use LDAP as a topology provider for Nagios Configuration Generator(NCG)
<br />
<br />
Line 173: Line 172:
Afterwards restart ncg service.
Afterwards restart ncg service.


 
==== Yaim configuration file example (for phys.vo.ibergrid.eu VO) ====
 
  ###
To monitor all the resources under the scope of a VO, you have to properly set the following variables in your YAIM configuration file:
  ### Generic definitions for some of the core services
* '''General VO definitions'''
# List of VOs to support
VOS="vo1"
# VOMS server definition for vo1
VO_vo1_VOMS_SERVERS="'vomss://voms.my.domain:8443/voms/vo1?/vo1/'"
# VOMSES server definition for vo1
VO_vo1_VOMSES="'vo1 voms.my.domain 15001 /C=Country/O=Ca/O=Institution/OU=Department/CN=voms.my.domain vo1'"
# DN of the CA which issued the VOMS Certificate
VO_vo1_VOMS_CA_DN="/C=Country/O=CA/CN=Certification Authority"
  # WMS used to submit jobs to vo1
VO_vo1_WMS_HOSTS="wms01.ncg.ingrid.pt"
 
* '''Specific SAM VO definitions '''
# Nagios is acting on a VO
NAGIOS_ROLE=vo
# List of VOs the tests should run as. You must have a member of each VO willing to store a proxy for your retrieval.
NCG_VO="vo1"
  # Do not show hosts without services associated
NCG_INCLUDE_EMPTY_HOSTS=0
# list all the NGI/ROCs
# NCG_GOCDB_ROC_NAME="NGI_IBERGRID NGI_NL ..."
NCG_GOCDB_ROC_NAME=ALL
 
* '''Monitor more than one VO'''
Include a white space separated VO list for
VO="vo1 vo2 vo3"
NCG_VO="vo1 vo2 vo3"
and insert the information depicted in the previous '''General VO definition''' section for all VOs.
 
<br />
 
 
 
 
# Generic
  SITE_NAME=NCG-INGRID-PT
  SITE_NAME=NCG-INGRID-PT
  SITE_BDII_HOST=sbdii01.ncg.ingrid.pt
  SITE_BDII_HOST=sbdii01.ncg.ingrid.pt
  PX_HOST=px01.ncg.ingrid.pt
  PX_HOST=px01.ncg.ingrid.pt
  BDII_HOST=topbdii01.ncg.ingrid.pt
  BDII_HOST=topbdii01.ncg.ingrid.pt
  RB_HOST=rb02.lip.pt # irrelevant
   
 
###
  # VO Definitions
  ### VO Definitions
# List of VOs to support
  VOS="phys.vo.ibergrid.eu"
  VOS="phys.vo.ibergrid.eu"
# VOMS server definition for vo1
  VO_PHYS_VO_IBERGRID_EU_VOMS_SERVERS="'vomss://voms01.ncg.ingrid.pt:8443/voms/phys.vo.ibergrid.eu?/phys.vo.ibergrid.eu/'"
  VO_PHYS_VO_IBERGRID_EU_VOMS_SERVERS="'vomss://voms01.ncg.ingrid.pt:8443/voms/phys.vo.ibergrid.eu?/phys.vo.ibergrid.eu/'"
# VOMSES server definition for vo1
  VO_PHYS_VO_IBERGRID_EU_VOMSES="'phys.vo.ibergrid.eu voms01.ncg.ingrid.pt 40007 /C=PT/O=LIPCA/O=LIP/OU=Lisboa/CN=voms01.ncg.ingrid.pt phys.vo.ibergrid.eu'"
  VO_PHYS_VO_IBERGRID_EU_VOMSES="'phys.vo.ibergrid.eu voms01.ncg.ingrid.pt 40007 /C=PT/O=LIPCA/O=LIP/OU=Lisboa/CN=voms01.ncg.ingrid.pt phys.vo.ibergrid.eu'"
# DN of the CA which issued the VOMS Certificate
  VO_PHYS_VO_IBERGRID_EU_VOMS_CA_DN="/C=PT/O=LIPCA/CN=LIP Certification Authority"
  VO_PHYS_VO_IBERGRID_EU_VOMS_CA_DN="/C=PT/O=LIPCA/CN=LIP Certification Authority"
# WMS used to submit jobs to vo1
  VO_PHYS_VO_IBERGRID_EU_WMS_HOSTS="wms01.ncg.ingrid.pt"
  VO_PHYS_VO_IBERGRID_EU_WMS_HOSTS="wms01.ncg.ingrid.pt"
 
  ###
  ###
  ### LFC and SE definitions
  ### LFC and SE definitions for the data management tests
  JOBSUBMIT_WN_LFC=lfc01.ncg.ingrid.pt
  JOBSUBMIT_WN_LFC=lfc01.ncg.ingrid.pt
  JOBSUBMIT_WN_SE_REP=se01-tic.ciemat.es
  JOBSUBMIT_WN_SE_REP=se01-tic.ciemat.es
 
  # Nagios
###
 
  ### Nagios
 
  NAGIOS_ADMIN_DNS="/C=PT/O=LIPCA/O=LIP/OU=Lisboa/CN=Goncalo Borges"
  #NAGIOS_ADMIN_DNS="/C=PT/O=LIPCA/O=LIP/OU=Lisboa/CN=Goncalp Borges"
  NCG_NAGIOS_ADMIN=goncalo@lip.pt
  #NCG_NAGIOS_ADMIN=goncalo@lip.pt
  NAGIOS_HOST=nagios02.ncg.ingrid.pt
  NAGIOS_HOST=nagios02.ncg.ingrid.pt
 
  # Nagios is acting on a VO
  NAGIOS_ADMIN_DNS="/C=PT/O=LIPCA/O=LIP/OU=Lisboa/CN=Joao Antonio Tomasio Pina"
NCG_NAGIOS_ADMIN=jpina@lip.pt
  NAGIOS_ROLE=vo
  NAGIOS_ROLE=vo
#NAGIOS_ROLE=ngi
  NCG_PROBES_TYPE=local
  NCG_PROBES_TYPE=local
# List of VOs the tests should run as.
  NCG_VO="phys.vo.ibergrid.eu"
  NCG_VO="phys.vo.ibergrid.eu"
  #NCG_VO="ops"
  # Do not show hosts without services associated
  NCG_INCLUDE_EMPTY_HOSTS=0
  NCG_INCLUDE_EMPTY_HOSTS=0
  NAGIOS_HTTPD_ENABLE_CONFIG=true
  NAGIOS_HTTPD_ENABLE_CONFIG=true
Line 257: Line 215:
  NAGIOS_NAGIOS_ENABLE_CONFIG=true
  NAGIOS_NAGIOS_ENABLE_CONFIG=true
  NAGIOS_CGI_ENABLE_CONFIG=true
  NAGIOS_CGI_ENABLE_CONFIG=true
  NAGIOS_NSCA_PASS="xxxxxx"
  NAGIOS_NSCA_PASS="xxxxxx"  
 
NAGIOS_NCG_ENABLE_CRON=true
  # NGI/ROC Nagios
   
  COUNTRY_NAME=Portugal
  COUNTRY_NAME=Portugal
  #COUNTRY_NAME=ibergrid
  NAGIOS_NCG_ENABLE_CRON=true
###
  ### NCG configurations
  # list all the NGI/ROCs
  NCG_GOCDB_ROC_NAME=NGI_IBERGRID
  NCG_GOCDB_ROC_NAME=NGI_IBERGRID
  NCG_TOPOLOGY_USE_SAM=false
  NCG_TOPOLOGY_USE_SAM=false
Line 269: Line 229:
  NCG_TOPOLOGY_USE_LDAP=false
  NCG_TOPOLOGY_USE_LDAP=false
  NCG_TOPOLOGY_USE_ATP=true
  NCG_TOPOLOGY_USE_ATP=true
#NCG_USE_ATP_VO_FEED=true
  NCG_TOPOLOGY_ATP_ROOT_URL="http://grid-monitoring.cern.ch/atp"
  NCG_TOPOLOGY_ATP_ROOT_URL="http://grid-monitoring.cern.ch/atp"
  NCG_REMOTE_USE_SAM=false
  NCG_REMOTE_USE_SAM=false
  NCG_REMOTE_USE_NAGIOS=false
  NCG_REMOTE_USE_NAGIOS=false
  NCG_REMOTE_USE_ENOC=false
  NCG_REMOTE_USE_ENOC=false
  #ROC_NAME=NGI_IBERGRID
  NCG_MDDB_SUPPORTED_PROFILES="ROC,ROC_CRITICAL,ROC_OPERATORS"
 
  # DB data
  ###
### DB data
  MYSQL_ADMIN="xxxxxx"
  MYSQL_ADMIN="xxxxxx"
  DB_PASS="xxxxxx"
  DB_PASS="xxxxxx"
Line 284: Line 244:
  DB_HOST="localhost"
  DB_HOST="localhost"
   
   
  # MyEGI
  ###
### MyEGI
  MYEGI_ADMIN_NAME="Goncalo Borges"
  MYEGI_ADMIN_NAME="Goncalo Borges"
  MYEGI_ADMIN_EMAIL="goncalo@lip.pt"
  MYEGI_ADMIN_EMAIL="goncalo@lip.pt"
Line 291: Line 252:
  MYEGI_DATABASE_PASSWORD="xxxxxx"
  MYEGI_DATABASE_PASSWORD="xxxxxx"
  MYEGI_REGION="NGI_IBERGRID"
  MYEGI_REGION="NGI_IBERGRID"
  NCG_MDDB_SUPPORTED_PROFILES="ROC,ROC_CRITICAL,ROC_OPERATORS"
   
 
<br />
<br />


== VO specific probes ==
= VO specific probes =
Right now it is pretty tricky to include VO specific probes. The first step is to learn how a nagios probe must be developed. A tutorial is available here:
Right now it is pretty tricky to include VO specific probes. The first step is to learn how a nagios probe must be developed. A tutorial is available here:
* [https://svnweb.cern.ch/trac/sam/browser/trunk/demo-probe <big>'''Tutorial to build nagios probes'''</big>]
* [https://svnweb.cern.ch/trac/sam/browser/trunk/demo-probe <big>'''Tutorial to build nagios probes'''</big>]
Line 310: Line 270:
  NCG_PROFILE_FQAN_customprofileA=vo2
  NCG_PROFILE_FQAN_customprofileA=vo2
  NCG_PROFILE_FQAN_customprofileB=vo2
  NCG_PROFILE_FQAN_customprofileB=vo2
<br />
== Frequently Asked Questions & Troubleshooting ==
=== General ===
Before proceeding, check whether your question is already answered in
* [https://tomtools.cern.ch/confluence/display/SAMDOC/FAQs '''<big>SAM Administrator FAQs</big>'''].
* [https://tomtools.cern.ch/confluence/display/SAMDOC/Troubleshooting '''<big>SAM Troubleshooting</big>'''].
<br />
=== Installation ===
==== perl-DBD-MySQL dependency problems? ====
For installation problems related to the perl-DBD-MySQL package, check that the repository priorities are set as explained in the release notes. The proper package should be fetched from the RPM FORGE EXTRA repository: [http://rpmforge.sw.be/redhat/el5/en/x86_64/extras/RPMS/ RPM FORGE repository]
<br />
==== perl-SOAP-Lite dependency problems? ====
Fri Mar 11 13:13:54 CET 2011 : Can't locate Class/Inspector.pm in @INC
(@INC contains: /usr/lib64/perl5/site_perl/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.8.8 /usr/lib/perl5/site_perl
/usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.8 /usr/lib/perl5/vendor_perl/5.8.5
/usr/lib/perl5/vendor_perl /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/5.8.8 .) at /usr/lib/perl5/vendor_perl/5.8.8/SOAP/Lite.pm line 435.
Fri Mar 11 13:13:54 CET 2011 : BEGIN failed--compilation aborted at /usr/lib/perl5/vendor_perl/5.8.8/SOAP/Lite.pm line 435.
Fri Mar 11 13:13:54 CET 2011 : Compilation failed in require at /usr/bin/voms2htpasswd line 12.
Fri Mar 11 13:13:54 CET 2011 : BEGIN failed--compilation aborted at /usr/bin/voms2htpasswd line 12.
Please check if you have "perl-Class-Inspector" installed. If not, please execute '''"yum install 'perl(Class::Inspector)'"'''
<br />
=== Configuration ===
====Can I start 2 different proxies to submit jobs to the different VOs?====
Yes. You can have a different proxy for each VO. Just use a different user certificate when creating the MyProxy credential. For example:
# For vo1
$ export X509_USER_CERT=~/.globus/usercert-vo1.pem
$ export X509_USER_KEY=~/.globus/userkey-vo1.pem
$ myproxy-init -l nagios -s $PX_HOST -k NagiosRetrieve-NAGIOS_HOSTNAME-vo1 -c 1000 -x -Z "NAGIOS_HOSTNAME_DN"
# For vo2
$ export X509_USER_CERT=~/.globus/usercert-vo2.pem
$ export X509_USER_KEY=~/.globus/userkey-vo2.pem
$ myproxy-init -l nagios -s $PX_HOST -k NagiosRetrieve-NAGIOS_HOSTNAME-vo2 -c 1000 -x -Z "NAGIOS_HOSTNAME_DN"
The same principle applies to any other VO supported by that instance. You can of course use the same user certificate if it is a member of multiple VOs. An easier solution would be to use robot certificates, which will be supported in the next release (https://tomtools.cern.ch/jira/browse/SAM-952).
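With many VOs, the per-VO commands above become repetitive. The following is only a sketch of a dry-run helper that prints one myproxy-init command per VO, using the same options as the example. The function name print_myproxy_commands and its argument order are hypothetical and not part of SAM or MyProxy. The matching X509_USER_CERT and X509_USER_KEY still have to be exported for each VO before a printed command is run.

```shell
# Hypothetical dry-run helper, not shipped with SAM or MyProxy.
# Prints (does not execute) one myproxy-init command per VO, with the
# same options as in the example above.
# Arguments: PX_HOST NAGIOS_HOSTNAME NAGIOS_HOSTNAME_DN VO...
print_myproxy_commands() {
    px_host="$1"
    nagios_host="$2"
    host_dn="$3"
    shift 3
    for vo in "$@"; do
        printf 'myproxy-init -l nagios -s %s -k NagiosRetrieve-%s-%s -c 1000 -x -Z "%s"\n' \
            "$px_host" "$nagios_host" "$vo" "$host_dn"
    done
}
```

Printing the commands first makes it possible to review the credential names before anything is stored on the MyProxy server.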
<br />
==== Can a catch-all VO SAM provide a dedicated VO view? ====
The Nagios web interface was never designed for clear presentation. However, there is a service group view: NCG generates a service group that aggregates all VO dependent checks for each VO. For example:
*[https://nagios01.ncg.ingrid.pt/nagios/cgi-bin/status.cgi?servicegroup=phys.vo.ibergrid.eu&style=detail <big>'''phys.vo.ibergrid.eu view in nagios01.ncg.ingrid.pt'''</big>]
<br />
==== Can I configure VO SAM to use a unique LFC and central SE for all VOs? ====
Yes. Include the following definitions in your YAIM configuration variables. This implicitly assumes that the unique LFC and central SE support all monitored VOs.
# LFC and SE definitions
JOBSUBMIT_WN_LFC=lfc-allvos.my.domain
JOBSUBMIT_WN_SE_REP=se-allvos.my.domain
<br />
==== Can I configure VO dependent LFCs and central SEs in a VO SAM? ====
There is a way to do this, though it is slightly more complicated. Make sure that you do not have lines like the following
/etc/ncg/ncg-localdb.d/jobsubmit:MODIFY_METRIC_PARAMETER!org.sam.CREAMCE-JobState!--wn-lfc!lfc.my.domain
/etc/ncg/ncg-localdb.d/jobsubmit:MODIFY_METRIC_PARAMETER!org.sam.CE-JobState!--wn-lfc!lfc.my.domain
/etc/ncg/ncg-localdb.d/jobsubmit:MODIFY_METRIC_PARAMETER!org.sam.CREAMCE-JobState!--wn-se-rep!se.my.domain
/etc/ncg/ncg-localdb.d/jobsubmit:MODIFY_METRIC_PARAMETER!org.sam.CE-JobState!--wn-se-rep!se.my.domain
anywhere in /etc/ncg/*localdb*, and put the following instead:
VO_ATTRIBUTE!vo1!WN_SE_REP!se-vo1.my.domain
VO_ATTRIBUTE!vo2!WN_SE_REP!se-vo2.my.domain
MODIFY_METRIC_ATTRIBUTE!org.sam.CE-JobState!WN_SE_REP!--wn-se-rep
VO_ATTRIBUTE!vo1!WN_LFC!lfc-vo1.my.domain
VO_ATTRIBUTE!vo2!WN_LFC!lfc-vo2.my.domain
MODIFY_METRIC_ATTRIBUTE!org.sam.CE-JobState!WN_LFC!--wn-lfc
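When several VOs are involved, the entries above can be generated instead of typed by hand. This is a minimal sketch under stated assumptions: the function names emit_vo_attributes and emit_metric_mappings are hypothetical, and the output still has to be placed in a localdb file of your choice under /etc/ncg/.

```shell
# Hypothetical generators for the localdb entries shown above.
# They only print text; nothing is written to /etc/ncg/.

# emit_vo_attributes VO SE_HOST LFC_HOST
# Prints the two per-VO attribute lines for one VO.
emit_vo_attributes() {
    printf 'VO_ATTRIBUTE!%s!WN_SE_REP!%s\n' "$1" "$2"
    printf 'VO_ATTRIBUTE!%s!WN_LFC!%s\n' "$1" "$3"
}

# emit_metric_mappings
# Prints the two metric-level lines, which are needed only once.
emit_metric_mappings() {
    printf '%s\n' 'MODIFY_METRIC_ATTRIBUTE!org.sam.CE-JobState!WN_SE_REP!--wn-se-rep'
    printf '%s\n' 'MODIFY_METRIC_ATTRIBUTE!org.sam.CE-JobState!WN_LFC!--wn-lfc'
}
```

For example, calling emit_vo_attributes once per VO followed by a single emit_metric_mappings reproduces the six lines listed above, in a different order.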
<br />
==== How can I provide access to a VO member? ====
At configuration time, and depending on the NCG_ROLE selected, different users may not have the same permissions to access the VO SAM services. To grant access to a given user, add the user DN to /etc/voms2htpasswd-static.d/YAIM-ops-monitor.conf
<br />
==== How do I run VO SAM tests with a specific FQAN ?====
You can set
VO_ENMR_EU_NCG_DEFAULT_VO_FQAN="YOUR FQAN"
in your yaim configuration file, and start your proxy as normal
$ export X509_USER_CERT=~/.globus/usercert-vo1.pem
$ export X509_USER_KEY=~/.globus/userkey-vo1.pem
$ myproxy-init -l nagios -s $PX_HOST -k NagiosRetrieve-NAGIOS_HOSTNAME-vo1 -c 1000 -x -Z "NAGIOS_HOSTNAME_DN"
==== How to change the email notification header? ====
# optional - change of notification header (SAM-1130):
NCG_NOTIFICATION_HEADER="YOUR HEADER"
<br />
==== How to enable the use of ROBOT certificates? ====
# optional - use of robot certificates (SAM-1180):
NCG_USE_ROBOT_CERT=true
# Robot cert and key can be different for each VO
# and standard Yaim VO notation is used
VO_OPS_ROBOT_CERT=/etc/nagios/globus/robot-cert.pem
VO_OPS_ROBOT_KEY=/etc/nagios/globus/robot-key.pem
<br />
==== How to monitor uncertified sites? ====
To monitor uncertified sites, you will need to use a dedicated TopBDII (with the information of those sites) and a dedicated WMS. The list of uncertified sites to be monitored should also be listed:
# optional - add uncertified gLite sites (SAM-1143)
UNCERTIFIED_SITES="SiteA SiteB SiteC"
UNCERTIFIED_WMS=wms.uncert.org
UNCERTIFIED_BDII=bdii.uncert.org
<br />
==== How to switch host checks off/on? ====
# switch host checks off/on (see SAM-1173) (optional variable)
NCG_CHECK_HOSTS=1
==== How to switch off importing admin DNs ? ====
# switch off importing admin DNs (see SAM-1434) (optional variable)
NCG_CONTACTS_USE_GOCDB=false
=== Run-Time ===
==== ATP synchronization fails while running YAIM ====
Check the ATP log file (/var/log/atp.log) to find the cause of the problem. This can happen when high latency values exceed the ATP synchronization timeouts. Change ATP_SYNC_TIMEOUT to a higher value (e.g. ATP_SYNC_TIMEOUT=1200; only in use for SAM 10 or higher). For previous versions you need to change the YAIM ATP function file directly: /opt/glite/yaim/functions/config_atp
<br />
==== NCG configuration fails while running YAIM ====
Check the ncg log file (/var/log/ncg.log) to find the cause of the problem. This can arise from a bad configuration file (/etc/ncg/ncg.conf) generated from incorrect YAIM configuration variables. Double check your YAIM configuration file.
<br />
== A working example ==
* [https://nagios01.ncg.ingrid.pt/nagios <big>'''phys.vo.ibergrid.eu NAGIOS instance'''</big>]
<br />
== Future improvements ==
* In the next SAM release, ncg will automatically go over the whole infrastructure and look for nodes which support defined VOs.
* Define a VO profile with VO dependent only metrics. Please check [https://tomtools.cern.ch/jira/browse/SAM-1178 SAM-1178]
<br />
== Known issues ==
* This approach is not perfect since it bootstraps all hosts and not only the ones relevant to the VO. The feature to bootstrap only hosts relevant to the VO is not implemented yet. This will be done by Update-12 (sometime in June).


<br />
<br />
Line 473: Line 282:
= Additional references =
= Additional references =
* [http://www.nagios.org/ <big>'''Nagios'''</big>]
* [http://www.nagios.org/ <big>'''Nagios'''</big>]
<br />
=[[FAQ_VO_Service_Availability_Monitoring|FAQ VO Service Availability Monitoring]]=
[[Category:Operations Documentation]]
[[Category:SAM]]

Latest revision as of 06:47, 13 July 2016




This article is deprecated and should no longer be used, but is still available for reference.



Introduction to VO SAM

VOServicesWikiFig3.png

SAM as an NGI instance

The current operations model requires each NGI to deploy and operate its own Service Availability Monitoring (SAM) system to monitor the fraction of the EGI production infrastructure under its scope. SAM currently includes the following components:

  • a test execution framework (probes) based on the open source monitoring framework Nagios, and the Nagios Configuration Generator (NCG)
  • the Aggregated Topology Provider (ATP), the Metrics Description Database (MDDB), and the Metrics Results Database (MRDB)
  • the message bus to publish results and a programmatic interface
  • the visualization portal (MyEGI)

The full list of SAM instances across the EGI infrastructure can be consulted on the SAM Instances wiki page.


Each SAM instance triggers the execution of probes in the grid sites under its scope. The present list of probes includes:

  • Job submission testing via CE probe/metrics: the full job submission chain is exercised - job submission, state monitoring, output sandbox retrieval, ...
  • Data management testing via SRM probe/metrics: get full SRM endpoint(s) and storage areas from BDII, copy a local file to the SRM into default space area(s), ...
  • WN testing via WN probe/metrics: replica management tests (WN<->SE communication), ...
  • WMS testing via WMS probe/metrics using submissions to predefined CEs.
  • LFC testing via LFC probe/metrics: read and update catalogue entries, ...



VOServicesWikiFig4.png

SAM as a VO instance

The main idea is to reuse the SAM service in use for EGI operations and adapt it to monitor the resources of a VO, or of multiple VOs, using the same SAM instance.


One of the most obvious advantages of this service is that a VO can develop and integrate its own probes. EGI operation teams test and monitor the status of resources through generic tests, but these may be insufficient for certain communities. This approach allows those communities to define custom test suites and insert them in their SAM system.


In order to accomplish this multi-VO monitoring role, the SAM service instance has to be properly adapted:

  • the topology generation has to change so that the resources to be tested are properly configured. The difference with respect to the service used in operations is that VO resources may not be restricted to a single region, and may be spread across the whole EGI infrastructure.
  • the services which interoperate with the SAM services (the WMS used to submit jobs, the default SRM used to replicate files, ...) have to be properly configured to support those VOs


The following sections describe how to install and properly configure a Service Availability Monitoring system for VOs.


General Information

What is the VO SAM?

The VO SAM service is an adaptation of the operational SAM service used by NGIs. It is useful for monitoring VO/VRC infrastructures within a given NGI or group of NGIs.


Who can run the VO SAM?

The VO SAM was delivered so that VRCs (or VOs associated with a VRC) could take over the operation of the service. This should be the optimal scenario, since it gives the VRC / VO full independence to configure the service according to VO needs (for example, integration of new VO specific probes). Nevertheless, delegating the service operation to a third party is not excluded.


Which services interoperate with VO SAM?

The VO SAM interoperates with the following middleware components which have to be declared at configuration time:

  1. WMS: For job submission tests
  2. central SE: for data replica tests
  3. MyProxy service: For renewal of VO credentials

The WMS and the central SE must support the VOs in question. The best-case scenario is that the VO uses dedicated instances of these services for its SAM system, since Nagios tests will induce high load peaks. The alternative is to use services at the disposal of the VO but shared by all VO users (and probably by other VOs as well). Information about the endpoints of those services is available through information system queries from any user interface:

# lcg-infosites --vo <VO NAME> wms
# lcg-infosites --vo <VO NAME> se

The MyProxy service is used to store and renew the credential of the user sending the nagios jobs. Starting from Update-09 (still under Stage Rollout) SAM supports usage of robot certificates, instead of MyProxy credentials. This is an optional feature which can be used only if your CA and VO support robot certificates. If your CA supports robot certificates, we suggest switching to robot certificates, as they are easier to maintain. Also robots provide better availability as SAM doesn't depend on availability of MyProxy server.

In case the VO is unable to provide its own instances of those services, the VRCs may make use of the formal links established between VRCs and NGIs, foreseen by the VRC approval process and agreed at VRC approval time. This provides direct opportunities to find service providers for VRCs, and for any VO associated with a VRC.


Installation guide

System requirements

The VO SAM will be a service with high load peaks, especially at job submission times. The system requirements have to be chosen according to the number of sites to be monitored under each VO and the number of VOs to be included in the same VO SAM box. As a minimum requirement, we suggest:

  1. Scientific Linux 5 64 bit OS
  2. 4 GB RAM memory
  3. quad-core processor is recommended to better handle parallel submissions

The system requirements may increase according to the VOs infrastructure to be monitored.


SAM service reference card

To get an overview of the service, in terms of

  • individual services running on the SAM box
  • configuration files for each service
  • log files for each service
  • open ports needed by the services
  • cron jobs scheduled for execution on the SAM box

please consult the SAM service reference card



Install and configure SAM with YAIM

General guidelines

  • For a fresh installation, please check the installation / configuration guidelines described in the Installing SAM web page.
  • For upgrades, also check the SAM Release Notes web page.
  • Before actually configuring the service with YAIM, you should apply the VO specific guidelines presented in the next sections.
  • A brief explanation of all the YAIM variables used to configure SAM is available in SAM YAIM variable explanations.


Installation notes

  • During installation, you need to use yum-priorities in order to download the proper package version from the correct repository. Please check the repository priorities carefully, according to the SAM version you are using.
  • Package installation sequence:
$ yum install lcg-CA
$ yum install httpd
$ yum --exclude=\*saga\* --exclude=\*SAGA\* groupinstall 'glite-UI (production - x86_64)'
$ yum install egee-NAGIOS


Upgrade notes

  • Note that after upgrade and before YAIM execution, atp_synchro.conf.rpmnew and glite-info-service-nagios.conf.rpmnew configuration files should replace the existing ones.
$ mv /etc/atp/atp_synchro.conf.rpmnew /etc/atp/atp_synchro.conf
$ mv /opt/glite/etc/glite-info-service-nagios.conf.rpmnew /opt/glite/etc/glite-info-service-nagios.conf
  • Make sure that the new glite-info-service-nagios.conf file (installed from glite-info-service-nagios.conf.rpmnew) includes the following line
get_data = echo -e "Role=${NAGIOS_ROLE}\nMsgNagiosDestination="$(. /etc/sysconfig/msg-to-queue && echo $MSG_TO_QUEUE_DESTINATION)"\nVersion="$(cat /etc/sam-release)
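
The two upgrade steps above could be wrapped in small helpers. The following is only a sketch: the function names promote_rpmnew and has_get_data_line are hypothetical and not part of SAM or YAIM, and keeping a copy of the previous file with a .bak suffix is an assumption added for safety, not something the upgrade notes require.

```shell
# Hypothetical helpers, not shipped with SAM or YAIM.

# promote_rpmnew FILE
# If FILE.rpmnew exists, keep a copy of the current FILE as FILE.bak
# (an assumed naming choice) and move FILE.rpmnew into place.
promote_rpmnew() {
    conf="$1"
    if [ ! -f "$conf.rpmnew" ]; then
        echo "no $conf.rpmnew found, nothing to do"
        return 0
    fi
    if [ -f "$conf" ]; then
        cp -p "$conf" "$conf.bak" || return 1
    fi
    mv "$conf.rpmnew" "$conf"
}

# has_get_data_line FILE
# Succeeds if FILE contains a line starting with "get_data =".
has_get_data_line() {
    grep -q '^get_data[[:space:]]*=' "$1"
}
```

A possible use is promote_rpmnew /etc/atp/atp_synchro.conf, followed by has_get_data_line /opt/glite/etc/glite-info-service-nagios.conf after the second file has been promoted. The check only confirms that a get_data line exists, not that its content matches the line above.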


VO specific guidelines

VO Topology generation: ATP vs LDAP

There are two alternative ways to configure a VO SAM.

  1. Use the Aggregated Topology Provider (ATP) as an input for the Nagios Configuration Generator (NCG) to get the sites and services. ATP is part of the ROC/NGI Nagios package; it aggregates information from GOCDB, Top BDII and the VO feed, and it is the single authoritative source of topology information.
  2. Use LDAP as a topology provider for the Nagios Configuration Generator (NCG).


  • ATP is now the preferred option for topology generation, and it can be switched on by defining
NCG_TOPOLOGY_USE_GOCDB=false
NCG_TOPOLOGY_USE_ENOC=false
NCG_TOPOLOGY_USE_LDAP=false
NCG_TOPOLOGY_USE_SAM=false
NCG_TOPOLOGY_USE_ATP=true

After running yaim, your /etc/ncg/ncg.conf should contain the following entries under the SiteInfo block:

<NCG::SiteInfo>
  <ATP>
    ATP_ROOT_URL="http://grid-monitoring.cern.ch/atp"
  </ATP>
  (...)
</NCG::SiteInfo>
  • To use LDAP, switch off all topology providers except LDAP:
NCG_TOPOLOGY_USE_GOCDB=false
NCG_TOPOLOGY_USE_ENOC=false
NCG_TOPOLOGY_USE_LDAP=true
NCG_TOPOLOGY_USE_SAM=false
NCG_TOPOLOGY_USE_ATP=false
# if LDAP topology is used, control adding hosts (see SAM-1470) (optional variable)
NCG_LDAP_ADD_HOSTS=1

and ensure that similar entries are defined in /etc/ncg/ncg.conf

<NCG::SiteInfo>
 <LDAP>
   LDAP_ADDRESS=<Your TopBDII you want to use>
   ADD_HOSTS=1
   VO_FILTER=<Your VO>
 </LDAP>
  (...)
</NCG::SiteInfo>

Afterwards, restart the ncg service.
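
Since both options rely on exactly one NCG_TOPOLOGY_USE_* variable being set to true, a quick check of the YAIM configuration file before running YAIM can catch mistakes. This is a minimal sketch: the function names are hypothetical, and it assumes the variables are written unquoted, one per line, as in the listings above.

```shell
# Hypothetical checks, not part of YAIM or NCG.

# count_topology_providers FILE
# Prints how many NCG_TOPOLOGY_USE_* variables are set to true in FILE.
count_topology_providers() {
    grep -Ec '^[[:space:]]*NCG_TOPOLOGY_USE_[A-Z]+=true[[:space:]]*$' "$1"
}

# check_single_topology_provider FILE
# Succeeds only if exactly one topology provider is enabled.
check_single_topology_provider() {
    # grep -c exits non-zero when the count is 0, so fall back to 0.
    n=$(count_topology_providers "$1") || n=0
    if [ "$n" -eq 1 ]; then
        echo "OK: one topology provider enabled"
    else
        echo "WARNING: $n topology providers enabled, expected 1" >&2
        return 1
    fi
}
```

The path passed to the functions is whatever YAIM configuration file you use; no default location is assumed here.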

Yaim configuration file example (for phys.vo.ibergrid.eu VO)

###
### Generic definitions for some of the core services
SITE_NAME=NCG-INGRID-PT
SITE_BDII_HOST=sbdii01.ncg.ingrid.pt
PX_HOST=px01.ncg.ingrid.pt
BDII_HOST=topbdii01.ncg.ingrid.pt

###
### VO Definitions
# List of VOs to support 
VOS="phys.vo.ibergrid.eu"
# VOMS server definition for the VO
VO_PHYS_VO_IBERGRID_EU_VOMS_SERVERS="'vomss://voms01.ncg.ingrid.pt:8443/voms/phys.vo.ibergrid.eu?/phys.vo.ibergrid.eu/'"
# VOMSES server definition for the VO
VO_PHYS_VO_IBERGRID_EU_VOMSES="'phys.vo.ibergrid.eu voms01.ncg.ingrid.pt 40007 /C=PT/O=LIPCA/O=LIP/OU=Lisboa/CN=voms01.ncg.ingrid.pt phys.vo.ibergrid.eu'"
# DN of the CA which issued the VOMS Certificate
VO_PHYS_VO_IBERGRID_EU_VOMS_CA_DN="/C=PT/O=LIPCA/CN=LIP Certification Authority"
# WMS used to submit jobs to the VO
VO_PHYS_VO_IBERGRID_EU_WMS_HOSTS="wms01.ncg.ingrid.pt"

###
### LFC and SE definitions for the data management tests
JOBSUBMIT_WN_LFC=lfc01.ncg.ingrid.pt
JOBSUBMIT_WN_SE_REP=se01-tic.ciemat.es

###
### Nagios
NAGIOS_ADMIN_DNS="/C=PT/O=LIPCA/O=LIP/OU=Lisboa/CN=Goncalo Borges"
NCG_NAGIOS_ADMIN=goncalo@lip.pt
NAGIOS_HOST=nagios02.ncg.ingrid.pt
# Nagios is acting on a VO 
NAGIOS_ROLE=vo
NCG_PROBES_TYPE=local
# List of VOs the tests should run as. 
NCG_VO="phys.vo.ibergrid.eu"
# Do not show hosts without services associated 
NCG_INCLUDE_EMPTY_HOSTS=0
NAGIOS_HTTPD_ENABLE_CONFIG=true
NAGIOS_NCG_ENABLE_CONFIG=true
NAGIOS_SUDO_ENABLE_CONFIG=true
NAGIOS_NAGIOS_ENABLE_CONFIG=true
NAGIOS_CGI_ENABLE_CONFIG=true
NAGIOS_NSCA_PASS="xxxxxx" 
NAGIOS_NCG_ENABLE_CRON=true

COUNTRY_NAME=Portugal

###
### NCG configurations 
# list all the NGI/ROCs
NCG_GOCDB_ROC_NAME=NGI_IBERGRID
NCG_TOPOLOGY_USE_SAM=false
NCG_TOPOLOGY_USE_GOCDB=false
NCG_TOPOLOGY_USE_ENOC=false
NCG_TOPOLOGY_USE_LDAP=false
NCG_TOPOLOGY_USE_ATP=true
NCG_TOPOLOGY_ATP_ROOT_URL="http://grid-monitoring.cern.ch/atp"
NCG_REMOTE_USE_SAM=false
NCG_REMOTE_USE_NAGIOS=false
NCG_REMOTE_USE_ENOC=false
NCG_MDDB_SUPPORTED_PROFILES="ROC,ROC_CRITICAL,ROC_OPERATORS"

###
### DB data
MYSQL_ADMIN="xxxxxx"
DB_PASS="xxxxxx"
DB_TYPE=mysql
DB_USER=mrs
DB_NAME=mrs
DB_HOST="localhost"

###
### MyEGI
MYEGI_ADMIN_NAME="Goncalo Borges"
MYEGI_ADMIN_EMAIL="goncalo@lip.pt"
MYEGI_DEFAULT_PROFILE="ROC"
MYEGI_DEBUG="true"
MYEGI_DATABASE_PASSWORD="xxxxxx"
MYEGI_REGION="NGI_IBERGRID"
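
The per-VO variable names in the example follow a visible pattern: phys.vo.ibergrid.eu becomes VO_PHYS_VO_IBERGRID_EU_. The sketch below derives that prefix from a VO name. The function name vo_yaim_prefix is hypothetical, and the treatment of "-" (replaced by "_", like ".") is an assumption, since the example above only shows dots.

```shell
# Hypothetical helper: derive the YAIM variable prefix for a VO name.
# Assumes the convention visible in the example above: the VO name is
# upper-cased and "." is replaced by "_". Replacing "-" the same way
# is an additional assumption.
vo_yaim_prefix() {
    printf 'VO_%s\n' "$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]' | sed 's/[.-]/_/g')"
}
```

For instance, the name of the WMS variable for a VO can be built as "$(vo_yaim_prefix phys.vo.ibergrid.eu)_WMS_HOSTS".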


VO specific probes

Right now it is pretty tricky to include VO specific probes. The first step is to learn how a Nagios probe must be developed. A tutorial is available here:

  • Tutorial to build nagios probes: https://svnweb.cern.ch/trac/sam/browser/trunk/demo-probe

Afterwards, the VO would have to define its own profile in /usr/lib/perl5/vendor_perl/5.8.5/NCG/LocalMetrics/Hash_local.pm in the same way as other profiles exist in /usr/lib/perl5/vendor_perl/5.8.5/NCG/LocalMetrics/Hash.pm. To understand the attributes, flags and static file rules used in that Perl module, please consult:

Finally, in the Yaim configuration you would need to point to the new profile:

NCG_HASH_CONFIG_PROFILES=customprofile

For a multi-VO instance where each VO has its own profile, the configuration would be:

NCG_VO="vo1 vo2"
NCG_HASH_CONFIG_PROFILES=customprofileA,customprofileB
NCG_PROFILE_FQAN_customprofileA=vo2
NCG_PROFILE_FQAN_customprofileB=vo2
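
With several profiles it is easy to list a profile in NCG_HASH_CONFIG_PROFILES and forget its NCG_PROFILE_FQAN_ line. The following sketch reports such omissions. The function name missing_profile_fqans is hypothetical, and it assumes that every listed profile needs a matching NCG_PROFILE_FQAN_<profile> variable, as the example above suggests.

```shell
# Hypothetical consistency check, not part of YAIM.
# missing_profile_fqans FILE
# Prints the name of every profile listed in NCG_HASH_CONFIG_PROFILES
# that has no matching NCG_PROFILE_FQAN_<profile>= line in FILE.
missing_profile_fqans() {
    profiles=$(sed -n 's/^[[:space:]]*NCG_HASH_CONFIG_PROFILES=//p' "$1" | tr -d '"' | tr ',' ' ')
    for p in $profiles; do
        grep -q "^[[:space:]]*NCG_PROFILE_FQAN_${p}=" "$1" || echo "$p"
    done
}
```

An empty output means every listed profile has an FQAN line. The check does not validate the values themselves.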


Contacts

  • GGUS: VO Services Support Unit
  • Mailing list: vo-services@mailman.egi.eu
  • SAM mailing list: tool-admins@mailman.egi.eu


Additional references

  • Nagios: http://www.nagios.org/


FAQ VO Service Availability Monitoring