The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team at operations @ egi.eu.

Difference between revisions of "SAM"

 
(184 intermediate revisions by 4 users not shown)
Line 1: Line 1:
{{Template:Op menubar}}
{{Template:Tools menubar}}
{{Template:Deprecated}}
{{TOC_right}}
[[Category:SAM]]


The Service Availability Monitoring (SAM) system is used to monitor the resources within the production infrastructure. SAM monitoring data is used for calculation of availability and reliability of grid sites.  
It includes the following components:
 
* probes: a test execution framework (based on the open source monitoring framework Nagios) and the Nagios Configuration Generator (NCG)
= SAM Nagios probes re-factoring TF =
* the Aggregated Topology Provider (ATP), the Metrics Description Database (MDDB), and the Metrics Results Database (MRDB)
* [[SAM Nagios probes refactoring TF]]
* the message bus to publish results and a programmatic interface
* the visualization portal (MyEGI).


= SAM tool instances =
* [[SAM Instances]]


=Tests and probes=
= Documentation =
* [[SAM Tests|Terminology]]
== Introduction ==
* [https://tomtools.cern.ch/confluence/display/SAMDOC/Released+Probes SAM released probes]
=== SAM Tests terminology and types ===
** [https://twiki.cern.ch/twiki/bin/view/EMI/NagiosProbes EMI Nagios probes] ([[EMI Nagios probes|old page instance]])
 
** [https://tomtools.cern.ch/confluence/display/SAM/Probes+org.sam Probes] from org.SAM package
==== SAM Test ====
A '''test''' is a procedure that checks a specific functionality of a given service, i.e. a single measurement (e.g. org.bdii.Freshness, hr.srce.RGMA-CertLifetime). Tests are executed on NGI/ROC SAM instances.
 
A '''SAM test''' is:
* an '''OPERATIONS''' test which raises alarms in the Operations Portal (see [[Operations SAM tests | list]] of OPERATIONS tests)
* an '''AVAILABILITY''' test whose result is used for Resource Centre availability calculation by ACE (see the [https://mon.egi.eu/poem/admin/poem/profile/26/ ROC_CRITICAL] profile)
 
NOTE: '''CRITICAL''' is used ONLY to refer to one of the possible results returned by a Nagios probe.


=Profiles=
==== Probe ====
For OPS VO:
A '''probe''' is code that implements one or more tests.
* [http://grid-monitoring.cern.ch/myegi/sam-pi/metrics_in_profiles?vo_name=ops&profile_name=ARC ARC]
* [http://grid-monitoring.cern.ch/myegi/sam-pi/metrics_in_profiles?vo_name=ops&profile_name=GLEXEC GLEXEC]
* [http://grid-monitoring.cern.ch/myegi/sam-pi/metrics_in_profiles?vo_name=ops&profile_name=NGI NGI]
* [http://grid-monitoring.cern.ch/myegi/sam-pi/metrics_in_profiles?vo_name=ops&profile_name=ROC ROC] - all the possible metrics that NCG can use to configure NGI Nagios
* [http://grid-monitoring.cern.ch/myegi/sam-pi/metrics_in_profiles?vo_name=ops&profile_name=ROC_CRITICAL ROC_CRITICAL] - profile used for EGI Availability/Reliability computation (for EGI Resource Centres it is equivalent to WLCG_CREAM_LCGCE_CRITICAL - see below)
* [http://grid-monitoring.cern.ch/myegi/sam-pi/metrics_in_profiles?vo_name=ops&profile_name=ROC_OPERATORS ROC_OPERATORS] - metrics that raise alarms in the operations dashboard


'''WLCG'''
==== Metric ====
* [http://grid-monitoring.cern.ch/myegi/sam-pi/metrics_in_profiles?vo_name=ops&profile_name=WLCG_CREAM_CRITICAL WLCG_CREAM_CRITICAL]
'''Metric instances''' are tuples of flavour, metric name and optionally FQAN ([http://argoeu.github.io/samdoc/confluence/display/SAMDOC/POEM%20User%27s%20Guide.html#POEMUser%27sGuide-Adding%2FEditingprofilesandmetrics POEM] documentation).
* [http://grid-monitoring.cern.ch/myegi/sam-pi/metrics_in_profiles?vo_name=ops&profile_name=WLCG_CREAM_LCGCE_CRITICAL WLCG_CREAM_LCGCE_CRITICAL]  
"Metric" is a synonym for "test" used in the development documentation. In operations documents, "test" is the reference term to use.
* [http://grid-monitoring.cern.ch/myegi/sam-pi/metrics_in_profiles?vo_name=ops&profile_name=WLCG_CRITICAL WLCG_CRITICAL]
* [http://grid-monitoring.cern.ch/myegi/sam-pi/metrics_in_profiles?vo_name=ops&profile_name=WLCG_CRITICAL_TEST WLCG_CRITICAL_TEST]


'''OSG'''
==== POEM Profile ====
* [http://grid-monitoring.cern.ch/myegi/sam-pi/metrics_in_profiles?vo_name=ops&profile_name=OSG OSG]
'''POEM''' (Profile Management Database, formerly the Metric Description Database) describes existing metrics and groups them into '''profiles''' in order to run tests. In addition, it should define actions that either configure how availability and reliability are computed or allow notifications to the messaging system.
* [http://grid-monitoring.cern.ch/myegi/sam-pi/metrics_in_profiles?vo_name=ops&profile_name=OSG_CRITICAL OSG_CRITICAL]


= Documentation =
A POEM profile is a triple (vo, atp_groups, metric instances) where:

# atp_groups is a set of service instances defined in the Aggregated Topology Provider via a VO feed, e.g. LHCb_Site LCG.CNAF-T2.it= (service_instance1, service_instance2, etc.)
# metric instances is a set of (service_flavor, metric, fqan) tuples
## metric is the fully qualified name of the metric (e.g. hr.srce.SRM2-CertLifetime)
## service_flavor is taken from ATP (e.g. CE, SRM, etc.)
## fqan is the VOMS role to use for the tests (e.g. /Role=lcgadm); the same metric can be run with several FQANs (metric1 with fqan1, metric1 with fqan2, etc.)
# vo is the name of the VO
 
In other words, the ''profile'' is a cartesian product of service groups and metrics, plus VO.
 
== New Administration Guide ==
 
Please use this guide to install a '''SAM Nagios "Update 23" instance''': [[SAMUpdate23]]


== Release Notes ==
== Administrator guides ==
[https://tomtools.cern.ch/confluence/display/SAMDOC/Release+Notes SAM Release Notes]
<!-- https://tomtools.cern.ch/confluence/display/SAMDOC/Release+Notes -->
* [http://argoeu.github.io/samdoc/confluence/display/SAMDOC/Release%20Notes.html SAM Release Notes]
<!-- https://tomtools.cern.ch/confluence/display/SAMDOC/Administrator%27s+Guide -->
* [http://argoeu.github.io/samdoc/confluence/display/SAMDOC/Administrator%27s%20Guide.html SAM admin guide] (including configuration via [http://argoeu.github.io/samdoc/confluence/display/SAMDOC/SAM%20Configuration%20via%20YAIM.html YAIM])
<!-- [https://tomtools.cern.ch/confluence/display/SAMDOC/SAM-Nagios+Card -->
* [http://argoeu.github.io/samdoc/confluence/display/SAMDOC/SAM-Nagios%20Card.html SAM/NAGIOS Reference Card for site managers]
* [[VO_Service_Availability_Monitoring |VO SAM]]
* Monitoring uncertified sites:
<!-- https://tomtools.cern.ch/confluence/display/SAM/Monitor+Uncertified+Sites -->
** IMPORTANT. EGI.eu provides '''catch-all WMS and BDII''' services for the monitoring of uncertified sites. The service is open for use, and your NGI can easily apply [http://site-certification.egi.eu/ here].


== SAM Project Tracking ==
== Probes ==
* [https://tomtools.cern.ch/jira JIRA SAM project tracking system]
<!-- https://tomtools.cern.ch/confluence/display/SAMDOC/Probes -->
* [http://argoeu.github.io/samdoc/confluence/display/SAMDOC/Probes.html SAM Probes]
<!-- https://tomtools.cern.ch/confluence/display/SAMDOC/Probes+Development -->
* [http://argoeu.github.io/samdoc/confluence/display/SAMDOC/Probes%20Development.html Probes development policy]
* Obsolete pages: [https://twiki.cern.ch/twiki/bin/view/EMI/NagiosProbes EMI Nagios] and [https://savannah.cern.ch/task/?21823 status] (ARC, dCache, gLite, UNICORE)


==Installation instructions==
== Developers guides ==
* [https://tomtools.cern.ch/confluence/display/SAMDOC/Home Sam Doc home - New confluence page]
<!-- https://tomtools.cern.ch/confluence/display/SAMDOC/Developer%27s+Guide -->
* [https://tomtools.cern.ch/confluence/display/SAMDOC/Installing+SAM-Nagios Installation Instruction -NEW Confluence page]
* [http://argoeu.github.io/samdoc/confluence/display/SAMDOC/Developer%27s%20Guide.html Probes development guides, SAM PI]
* [https://wiki.egi.eu/wiki/VO_Services/VO_Service_Availability_Monitoring Setting up a VO SAM instance]
* [https://twiki.cern.ch/twiki/bin/view/EGEE/GridMonitoringNcgYaim NAGIOS&NCG YAim Based Installation Instruction -OLD page with YAIM variables definition]
* [https://tomtools.cern.ch/confluence/display/SAMDOC/SAM-Nagios+Card SAM/NAGIOS Reference Card for site managers]
* [https://tomtools.cern.ch/confluence/display/SAMDOC/SAM+Administrators+FAQ SAM Administrators FAQ]


== Monitoring uncertified sites ==
= Support =
* [https://tomtools.cern.ch/confluence/display/SAM/Monitor+Uncertified+Sites Setting NAGIOS to Monitor Uncertified Sites]
<!-- https://tomtools.cern.ch/confluence/display/SAMDOC/Support
** EGI.eu provides catch-all WMS and BDII instances for the monitoring of uncertified sites. The service is open for use, and your NGI can easily apply [http://site-certification.egi.eu/ here].
* [http://argoeu.github.io/samdoc/confluence/display/SAMDOC/Support.html FAQs and Troubleshooting guides] -->
This section collects all information related to the SAM team support activities.


==Tools information pages==
* Support to EGI:
===MyEGI===
** Incidents and bugs affecting the SAM services are managed via GGUS tickets. These tickets follow a support line of three levels:
* [https://tomtools.cern.ch/confluence/display/SAM/MyEGI/ MyEGI documentation]
*** 1st Level Support: carried out by the EGI TPM
* [https://tomtools.cern.ch/confluence/display/SAMDOC/Web+Services+Specification MyEGI Web Services Specification]
*** 2nd Level Support: carried out by the '''[https://wiki.egi.eu/wiki/GGUS:ARGO_SAM_EGI_SUPPORT_FAQ "ARGO/SAM EGI Support"] SU'''
<!-- * [https://twiki.cern.ch/twiki/bin/view/LCG/SAMProbesMetrics SAM Probes and Metrics] -->
*** 3rd Level Support: carried out by the [https://wiki.egi.eu/wiki/GGUS:ARGO_SAM_EGI_SUPPORT_FAQ "SAM-Nagios Experts"] SU
===NCG===
*** New requirements and coordination topics are discussed at [[OTAG]] meetings.
* [https://tomtools.cern.ch/confluence/display/SAM/NCG NCG Component Overview]
** Other topics (different from incidents, bugs, requirements, coordination) are managed on a best effort basis, on the SAM EGI mailing list (argo-ggus-support AT grnet.gr).
* [https://twiki.cern.ch/twiki/bin/view/EGEE/GridMonitoringNcgRecipes Grid Monitoring Specific Ncg Recipes]
<!--   Support to WLCG:
<!-- obsoleted * [https://twiki.cern.ch/twiki/bin/view/EGEE/MyEGEE MyEGEE Documentation]-->
        Incidents and bugs affecting the SAM services are managed via SNOW tickets. These tickets follow a support line of three levels:
===Databases===
        New requirements are managed via SNOW tickets and discussed at SAM HEP VOs coordination meetings.
* [https://tomtools.cern.ch/confluence/display/SAM/ATP Aggregated Topology Provider] (ATP)
        System notifications from VO SAM-Nagios nodes are sent to: SAM ATLAS VO, SAM CMS VO, SAM ALICE VO, SAM LHCB VO-->
* [https://tomtools.cern.ch/confluence/display/SAM/POEM Profile Management Database] (POEM)
* [https://tomtools.cern.ch/confluence/display/SAM/MRS Metric Result Store] (MRS)
<!--* [https://tomtools.cern.ch/jira JIRA SAM project tracking system]-->


= SAM Milestones =
*    Other:
* [https://tomtools.cern.ch/confluence/display/SAMDOC/Milestones SAM milestones]
**        This website includes all SAM documentation for [http://argoeu.github.io/samdoc/confluence/display/SAMDOC/User%27s%20Guide.html users], [http://argoeu.github.io/samdoc/confluence/display/SAMDOC/Administrator%27s%20Guide.html administrators], [http://argoeu.github.io/samdoc/confluence/display/SAMDOC/Developer%27s%20Guide.html developers], and [[#Support|support units]].
**        The [http://argoeu.github.io/samdoc/confluence/display/SAMDOC/Troubleshooting.html Troubleshooting] and [http://argoeu.github.io/samdoc/confluence/display/SAMDOC/FAQs.html FAQs] sections may also be helpful.


=Related Procedures=
=SAM-related Procedures=
* Validate ROC or NGI Nagios Procedures: [[PROC05]]
* Setting a Nagios test status to OPERATIONS: [[PROC06]]
Line 82: Line 102:
* Management of the EGI OPS Availability and Reliability Profile: [[PROC08]]


=SAM/Nagios EGI Support Procedures=
= Resources =
 
* Andrade, P.; Babik, M.; Bhatt, K.; Service Availability Monitoring Framework Based On Commodity Software; CHEP12, March 2012 ([https://wiki.egi.eu/wiki/File:SAM_CHEP2012_poster_1.pdf poster])
* [[SAM/Nagios EGI Support FAQ]]
* [[SAM/Nagios EGI Support Rotas]]
 
= External Links =
* SAM Project [https://tomtools.cern.ch/confluence/display/SAM/Home home page]
*[https://twiki.cern.ch/twiki/bin/view/EGEE/MultiLevelMonitoringOverview Multi Level Monitoring Overview]
*[https://tomtools.cern.ch/confluence/download/attachments/2261694/Ace_Service_Availability_Computation.pdf?version=1&modificationDate=1314361543000 Computation of Service Availability Metrics in ACE]
<!--*[https://twiki.cern.ch/twiki/pub/LCG/GridView/Gridview_Service_Availability_Computation.pdf A/R algorithms] -->
<!--* [https://twiki.cern.ch/twiki/bin/view/EGEE/ExternalROCNagios Deployed ROC and NGI Nagios]--> <!--* [https://twiki.cern.ch/twiki/bin/view/EGEE/OAT_EGEE_III Main EGEE OAT wiki]-->
*[https://tomtools.cern.ch/confluence/display/SAMDOC/Web+Services+Specification#WebServicesSpecification-ServiceAvailabilityinProfile SAM-PI documentation] (Non official wiki page containing [[SAM PI examples]])
[[Category:SAM]]

Latest revision as of 12:15, 13 July 2016



This article is deprecated and should no longer be used, but it remains available for reference.



The Service Availability Monitoring (SAM) system is used to monitor the resources within the production infrastructure. SAM monitoring data is used for calculation of availability and reliability of grid sites.

SAM Nagios probes re-factoring TF

SAM tool instances

Documentation

Introduction

SAM Tests terminology and types

SAM Test

A test is a procedure that checks a specific functionality of a given service, i.e. a single measurement (e.g. org.bdii.Freshness, hr.srce.RGMA-CertLifetime). Tests are executed on NGI/ROC SAM instances.

A SAM test is:

  • an OPERATIONS test which raises alarms in the Operations Portal (see list of OPERATIONS tests)
  • an AVAILABILITY test whose result is used for Resource Centre availability calculation by ACE (see the ROC_CRITICAL profile)

NOTE: CRITICAL is used ONLY to refer to one of the possible results returned by a Nagios probe.

Probe

A probe is code that implements one or more tests.
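
The relationship between a probe, its tests and the CRITICAL result can be illustrated with a minimal sketch. It assumes only the standard Nagios plugin convention of exit codes 0 to 3 (OK, WARNING, CRITICAL, UNKNOWN). The test function, its thresholds and the fixed dates are hypothetical and are not taken from any released SAM probe.

```python
from datetime import datetime, timedelta, timezone

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def cert_lifetime_test(not_after, now, warn_days=30, crit_days=7):
    """Hypothetical test: classify the remaining lifetime of a certificate.

    Returns a (status_code, message) pair. The thresholds are assumptions.
    """
    if not_after is None:
        return UNKNOWN, "certificate expiry date not available"
    remaining = (not_after - now).days
    if remaining < crit_days:
        return CRITICAL, f"certificate expires in {remaining} days"
    if remaining < warn_days:
        return WARNING, f"certificate expires in {remaining} days"
    return OK, f"certificate valid for {remaining} more days"


def run_probe(not_after, now):
    """Run the single test of this probe and print one Nagios-style status line."""
    code, message = cert_lifetime_test(not_after, now)
    print(f"{STATUS_NAMES[code]}: {message}")
    return code


# Example run with a fixed clock and a certificate that expires 90 days later.
now = datetime(2016, 7, 13, tzinfo=timezone.utc)
exit_code = run_probe(now + timedelta(days=90), now)
```

A probe run by Nagios would typically end by passing the returned code to `sys.exit`, so that the scheduler reads the result from the process exit status.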

Metric

Metric instances are tuples of flavour, metric name and, optionally, FQAN (POEM documentation). "Metric" is a synonym for "test" used in the development documentation. In operations documents, "test" is the reference term to use.

POEM Profile

POEM (Profile Management Database, formerly the Metric Description Database) describes existing metrics and groups them into profiles in order to run tests. In addition, it should define actions that either configure how availability and reliability are computed or allow notifications to the messaging system.

A POEM profile is a triple (vo, atp_groups, metric instances) where:

  1. atp_groups is a set of service instances defined in the Aggregated Topology Provider via a VO feed, e.g. LHCb_Site LCG.CNAF-T2.it= (service_instance1, service_instance2, etc.)
  2. metric instances is a set of (service_flavor, metric, fqan) tuples
    1. metric is the fully qualified name of the metric (e.g. hr.srce.SRM2-CertLifetime)
    2. service_flavor is taken from ATP (e.g. CE, SRM, etc.)
    3. fqan is the VOMS role to use for the tests (e.g. /Role=lcgadm); the same metric can be run with several FQANs (metric1 with fqan1, metric1 with fqan2, etc.)
  3. vo is the name of the VO

In other words, the profile is a Cartesian product of service groups and metrics, plus VO.
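
The profile structure described above can be sketched as a small data model. This is an illustration only, not the POEM schema: the class names, the `expand` method, the host names and the `/Role=production` FQAN are hypothetical. It also assumes that a metric instance applies only to service instances of the same flavour, which is one reading of the product of service groups and metrics.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional, Tuple


@dataclass(frozen=True)
class MetricInstance:
    """A (service_flavour, metric, fqan) tuple; the FQAN is optional."""
    service_flavour: str
    metric: str
    fqan: Optional[str] = None


@dataclass
class Profile:
    """A profile as the triple (vo, atp_groups, metric instances)."""
    vo: str
    # Group name -> list of (hostname, flavour) service instances.
    atp_groups: Dict[str, List[Tuple[str, str]]]
    metric_instances: List[MetricInstance]

    def expand(self) -> Iterator[Tuple[str, str, str, str, Optional[str]]]:
        """Yield one (vo, group, host, metric, fqan) entry per check to run."""
        for group, services in self.atp_groups.items():
            for host, flavour in services:
                for mi in self.metric_instances:
                    # Assumption: a metric runs only on services of its flavour.
                    if mi.service_flavour == flavour:
                        yield (self.vo, group, host, mi.metric, mi.fqan)


# Illustrative data: one group with two hypothetical hosts, and one metric
# that is run with two different FQANs.
profile = Profile(
    vo="lhcb",
    atp_groups={
        "LCG.CNAF-T2.it": [
            ("ce01.example.org", "CE"),
            ("srm01.example.org", "SRM"),
        ]
    },
    metric_instances=[
        MetricInstance("SRM", "hr.srce.SRM2-CertLifetime", "/Role=lcgadm"),
        MetricInstance("SRM", "hr.srce.SRM2-CertLifetime", "/Role=production"),
    ],
)

checks = list(profile.expand())
for check in checks:
    print(check)
```

Keeping the FQAN inside the metric instance, rather than on the profile, is what lets the same metric appear several times with different VOMS roles.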

New Administration Guide

Please use this guide to install a SAM Nagios "Update 23" instance: SAMUpdate23

Administrator guides

Probes

Developers guides

Support

This section collects all information related to the SAM team support activities.

  • Support to EGI:
    • Incidents and bugs affecting the SAM services are managed via GGUS tickets. These tickets follow a support line of three levels:
      • 1st Level Support: carried out by the EGI TPM
      • 2nd Level Support: carried out by the "ARGO/SAM EGI Support" SU
      • 3rd Level Support: carried out by the "SAM-Nagios Experts" SU
      • New requirements and coordination topics are discussed at OTAG meetings.
    • Other topics (different from incidents, bugs, requirements, coordination) are managed on a best effort basis, on the SAM EGI mailing list (argo-ggus-support AT grnet.gr).

SAM-related Procedures

  • Validate ROC or NGI Nagios Procedures: PROC05
  • Setting a Nagios test status to OPERATIONS: PROC06
  • Adding new probes to SAM: PROC07
  • Management of the EGI OPS Availability and Reliability Profile: PROC08

Resources

  • Andrade, P.; Babik, M.; Bhatt, K.; Service Availability Monitoring Framework Based On Commodity Software; CHEP12, March 2012 (poster)