The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team at operations @ egi.eu.

Difference between revisions of "SAM"

From EGIWiki
 
(8 intermediate revisions by 2 users not shown)
Line 1: Line 1:
{{Template:Op menubar}}
{{Template:Op menubar}}
{{Template:Tools menubar}}
{{Template:Tools menubar}}
{{Template:Deprecated}}
{{TOC_right}}
{{TOC_right}}
[[Category:SAM]]
[[Category:SAM]]
Line 44: Line 46:
# VO: is the name of the VO
# VO: is the name of the VO


In other words, the ''profile'' is a cartesian product of service groups and metrics, plus VO (read [https://tomtools.cern.ch/confluence/display/SAM/Model more]).
In other words, the ''profile'' is a cartesian product of service groups and metrics, plus VO.
 
==== Availability and reliability profile ====
{|
|[[file:Ace-profile.jpg|thumb|left]]
(Courtesy of [https://www.egi.eu/indico/contributionDisplay.py?sessionId=78&contribId=392&confId=452 P. Andrade], CERN)
|[https://tomtools.cern.ch/confluence/display/SAM/Requirements Availability and Reliability Profiles] are a collection of metrics/services defined for VOs (multiple profiles per VO). Each profile defines its computation algorithm. Metrics can be at different levels, such as critical, non-critical, etc.
|}
 
===SAM POEM Profiles ===
 
====Profiles for RC monitoring====
<!--* [http://grid-monitoring.cern.ch/myegi/sam-pi/metrics_in_profiles?vo_name=ops&profile_name=ROC ROC]-->
*[https://mon.egi.eu/poem/admin/poem/profile/25/ ROC] - Tests for monitoring of all EGI services; applied on all NGI SAM Nagioses. NOTE WELL: starting from SAMUpdate-17 the removal of a metric from ROC profile will immediately cause the removal of the metric from all NGI Nagios instances, i.e. tests will no longer be executed.
** Deployed: on all NGI SAM Nagios
** Tests: 99
** [[ROC_SAM_Tests |ROC Tests description]]
<!--* [http://grid-monitoring.cern.ch/myegi/sam-pi/metrics_in_profiles?vo_name=ops&profile_name=ROC_CRITICAL ROC_CRITICAL] -->
* [https://mon.egi.eu/poem/admin/poem/profile/26/ ROC_CRITICAL] - The profile for Availability/Reliability computation of EGI Resource Centres (OPS VO), subset of ROC tests. NOTE: It replaces WLCG_CREAM_LCGCE_CRITICAL as of 01 Jan 2012.
** Deployed: on all NGI SAM Nagios
** Tests: 31
** This profile contains a subset of [[ROC_SAM_Tests |ROC Tests]]. Please see [https://mon.egi.eu/poem/admin/poem/profile/26/ ROC_CRITICAL] for the list of tests.
<!--*[http://grid-monitoring.cern.ch/myegi/sam-pi/metrics_in_profiles?vo_name=ops&profile_name=ROC_OPERATORS ROC_OPERATORS] -->
*[https://mon.egi.eu/poem/admin/poem/profile/27/ ROC_OPERATORS] - Subset of ROC tests that are Operations tests, metrics that can generate an alarm on the operations dashboard when failing.
** Deployed: on all NGI SAM Nagios
** Tests: 74
** This profile contains a subset of [[ROC_SAM_Tests |ROC Tests]]. Please see [https://mon.egi.eu/poem/admin/poem/profile/27/ ROC_OPERATORS] for the list of tests.
 
====Profile for Cloud RC monitoring ====
* [https://mon.egi.eu/poem/admin/poem/profile/29/ CLOUD-MON] - Tests for monitoring EGI FedCloud resources from cloudmon.egi.eu
** Deployed: on Central instance (cloudmon.egi.eu)
** Tests: 6
** [https://wiki.egi.eu/wiki/Cloud_SAM_tests CLOUD_MONITOR Tests description]
 
====Profiles for Operations Tools monitoring ====
<!-- [http://grid-monitoring.cern.ch/myegi/sam-pi/metrics_in_profiles/?vo_name=ops&profile_name=OPS_MONITOR -->
* [https://mon.egi.eu/poem/admin/poem/profile/22/ OPS_MONITOR] - Tests for monitoring of all EGI.eu Central Operational Tools from opsmon.egi.eu, including NGI SAM
** Deployed: on Central instance (opsmon.egi.eu)
** Tests: 28
** [https://wiki.egi.eu/wiki/OPS-MONITOR_profile_SAM_tests OPS_MONITOR Tests description]
* [https://mon.egi.eu/poem/admin/poem/profile/23/ OPS_MONITOR_CRITICAL] - Subset of OPS_MONITOR tests used for A/R calculation
** Deployed: on Central instance (opsmon.egi.eu)
** Tests: 23
** [https://TOADD OPS_MONITOR_CRITICAL Tests description]
 
====Others====
<!-- http://grid-monitoring.cern.ch/myegi/sam-pi/metrics_in_profiles?vo_name=ops&profile_name=GLEXEC -->
* [https://mon.egi.eu/poem/admin/poem/profile/17/ GLEXEC] - gLExec tests configured on NGI SAM Nagioses
** Deployed: on all NGI SAM Nagios
** Tests: 2
** [https://TOADD GLEXEC Tests description]
* [https://midmon.egi.eu/poem/admin/poem/profile/1/ MW_MONITOR] - Tests for monitoring all EGI services for special purposes (MW upgrades) from midmon.egi.eu
** Deployed: on Central instance (midmon.egi.eu)
** Tests: 15
** [https://wiki.egi.eu/wiki/MW_Nagios_tests MW_MONITOR Tests description]
* [https://secmon.egi.eu/poem/admin/poem/profile/1/ SEC_MONITOR] - Security tests for monitoring all EGI services from secmon.egi.eu
** Deployed: on Central instance (secmon.egi.eu)
** Tests: 14
** [https://wiki.egi.eu/wiki/EGI_CSIRT:SMG SEC_MONITOR Tests description]
 
=== SAM tests  ===
 
Tests on NGI/ROC SAM instances are the ones which the framework includes in the SAM configuration. In addition, SAM admins can add their own probes to these instances.
 
The SAM team proposes the addition of new probes. The addition of probes is part of a SAM release and thus part of the staged rollout. It was agreed that, prior to release, the new list of probes will be briefly presented at the OMB meeting. Probes which check internal components of SAM are not presented at the OMB.
 
The list of '''tests included in the SAM release''' can be found [[NGI profile SAM tests|here - NGI profile SAM tests]].
 
Lists of tests '''to be included''' are [[Inactive SAM tests|here - Inactive SAM tests]].
 
List of '''MW related tests''': [[MW SAM tests]].
 
List of '''operational tools tests''': [[OPS-MONITOR profile SAM tests]].
 
List of '''cloud tests''': [[Cloud SAM tests]].
 
==== Operations tests  ====
 
Tests on the Operations Portal are the ones used for raising alarms for ROD and Operations teams. The Operations Portal does not execute these tests, but receives alarms from NGI/ROC SAM instances. The Operations Portal contains a list of the probes used for alarms; results from other probes are filtered out.
 
The procedure for adding a new probe (PROC06) can be found [[PROC06|here]].
 
'''The list of tests''' can be found [[Operations SAM tests|here - Operations SAM tests]].
 
==== Availability tests  ====
 
The set of tests used for calculating availability and reliability of sites and services. The A/R calculation is related to the OLA. As in the case of the Operations Portal, the availability calculation component receives results from NGI/ROC SAM instances.
 
TSA1.8 proposes changes to the availability calculation (which probe results count in it) and the OMB approves them.
 
'''The list of tests''' can be found [[Availability SAM tests|here - Availability SAM tests]].
 
=== SAM components ===
* MyEGI
<!-- https://tomtools.cern.ch/confluence/display/SAM/MyEGI/ -->
** [https://tomtools.cern.ch/confluence/display/SAM/MyEGI.html MyEGI documentation]
<!-- https://tomtools.cern.ch/confluence/display/SAMDOC/Web+Services+Specification MyEGI
https://tomtools.cern.ch/confluence/display/SAMDOC/Web+Services+Specification.html -->
** [http://argoeu.github.io/samdoc/confluence/display/SAMDOC/Web%20Services%20Specification.html  Web Services Specification]
<!-- https://tomtools.cern.ch/confluence/display/SAM/SAM-Nagios -->
* [https://tomtools.cern.ch/confluence/display/SAM/Nagios.html SAM Nagios]
* [https://tomtools.cern.ch/confluence/display/SAM/NCG NCG (Nagios Config Generator)]
* [https://tomtools.cern.ch/confluence/display/SAM/ATP ATP (Aggregated Topology Provider)]
<!-- https://tomtools.cern.ch/confluence/display/SAM/POEM -->
* [https://tomtools.cern.ch/confluence/display/SAM/POEM.html POEM]
<!-- https://tomtools.cern.ch/confluence/display/SAM/MRS -->
* [https://tomtools.cern.ch/confluence/display/SAM/MRS.html MRS (Metrics Result Store)]
* ACE (Availability Computation Engine)
<!-- https://tomtools.cern.ch/confluence/display/SAM/Availability+and+Reliability+report+generation -->
** [https://tomtools.cern.ch/confluence/display/SAM/ACE.html Availability report generation]
<!-- https://tomtools.cern.ch/confluence/display/SAMDOC/Availability+Computation -->
** [https://tomtools.cern.ch/confluence/download/attachments/2261694/Ace_Service_Availability_Computation.pdf%3Fversion=1&modificationDate=1314361543000 Availability Computation Algorithm]
 
== User guides ==
* [https://tomtools.cern.ch/confluence/display/SAMDOC/MyWLCG+%28MyEGI%29+User%27s+Guide MyEGI, Nagios, POEM]
<!-- https://tomtools.cern.ch/confluence/display/SAMDOC/Web+Services+Specification#WebServicesSpecification-ServiceAvailabilityinProfile -->
* [http://argoeu.github.io/samdoc/confluence/display/SAMDOC/Web%20Services%20Specification.html#WebServicesSpecification-ServiceAvailabilityinProfile SAM-PI documentation] (Non official wiki page containing [[SAM PI examples]])


== New Administration Guide ==
== New Administration Guide ==
Line 176: Line 62:
* Monitoring uncertified sites:
* Monitoring uncertified sites:
<!-- https://tomtools.cern.ch/confluence/display/SAM/Monitor+Uncertified+Sites -->
<!-- https://tomtools.cern.ch/confluence/display/SAM/Monitor+Uncertified+Sites -->
** [https://tomtools.cern.ch/confluence/display/SAM/Monitor+Uncertified+Sites.html Setting NAGIOS to Monitor Uncertified Sites]
** IMPORTANT. EGI.eu provides '''catch-all WMS and BDII''' services for the monitoring of uncertified sites. The service is open for use, and your NGI can easily apply [http://site-certification.egi.eu/ here].
** IMPORTANT. EGI.eu provides '''catch-all WMS and BDII''' services for the monitoring of uncertified sites. The service is open for use, and your NGI can easily apply [http://site-certification.egi.eu/ here].



Latest revision as of 11:15, 13 July 2016



This article is deprecated and should no longer be used, but is still available for reference.



The Service Availability Monitoring (SAM) system is used to monitor the resources within the production infrastructure. SAM monitoring data is used for calculation of availability and reliability of grid sites.

SAM Nagios probes re-factoring TF

SAM tool instances

Documentation

Introduction

SAM Tests terminology and types

SAM Test

A test is a procedure which checks a specific functionality of a given service, i.e. a single measurement (e.g. org.bdii.Freshness, hr.srce.RGMA-CertLifetime). Tests are executed on NGI/ROC SAM instances.

A SAM test is:

  • an OPERATIONS test which raises alarms in the Operations Portal (see list of OPERATIONS tests)
  • an AVAILABILITY test whose result is used for Resource Centre availability calculation by ACE (see the ROC_CRITICAL profile)

NOTE: CRITICAL is used ONLY to refer to one of the possible results returned by a Nagios probe.

Probe

A probe is code which implements one or more tests.
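
To make the probe/test distinction and the note on CRITICAL concrete, here is a minimal sketch of what a probe might look like. It is not taken from any SAM probe: the function `check_cert_lifetime`, its thresholds, and the hard-coded dates are hypothetical. The only convention assumed from outside this page is the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
from datetime import datetime, timedelta, timezone

# Standard Nagios plugin exit codes; CRITICAL is just one possible result.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_cert_lifetime(not_after, now, warn_days=30, crit_days=7):
    """Hypothetical test: classify a certificate by its remaining lifetime."""
    if not_after is None:
        return UNKNOWN, "expiry date could not be determined"
    remaining = not_after - now
    if remaining <= timedelta(days=crit_days):
        return CRITICAL, f"{remaining.days} days of lifetime left"
    if remaining <= timedelta(days=warn_days):
        return WARNING, f"{remaining.days} days of lifetime left"
    return OK, f"{remaining.days} days of lifetime left"


# Illustration only: a real probe would read the expiry date from the service
# under test and finish with sys.exit(status) so that Nagios sees the result.
now = datetime(2016, 7, 13, tzinfo=timezone.utc)
status, message = check_cert_lifetime(now + timedelta(days=90), now)
print(f"{STATUS_NAMES[status]} - {message}")
```

A probe implementing several tests would simply expose several such check functions, each reporting its own status.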

Metric

Metric instances are tuples of flavour, metric name and optionally FQAN (POEM documentation). "Metric" is the synonym for "test" used in the development documentation. In operations documents, "test" is the reference term to be used.

POEM Profile

POEM (Profile Management Database, formerly Metric Description Database) aims to describe existing metrics and group them into profiles in order to run tests. In addition, it should define actions that can either configure the way availability and reliability are computed or allow notifications to a messaging system.

A POEM profile is a triple (vo, atp_groups, metric instances) where:

  1. atp_groups is a set of service instances defined in the Aggregated Topology Provider via a VO feed, e.g. LHCb_Site LCG.CNAF-T2.it = (service_instance1, service_instance2, etc.)
  2. metric instances is a set of (service_flavor, metric, fqan) tuples, where:
    1. metric is the fully qualified name of the metric (e.g. hr.srce.SRM2-CertLifetime)
    2. service_flavor is taken from ATP (e.g. CE, SRM, etc.)
    3. fqan is the VOMS role to use for the tests (e.g. /Role=lcgadm), so that metric1 can be run with fqan1, metric1 with fqan2, etc.
  3. vo is the name of the VO

In other words, the profile is a Cartesian product of service groups and metrics, plus VO.
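
The profile definition above can be sketched as a small data model. This is an illustration under stated assumptions, not the POEM schema: the class names `MetricInstance` and `Profile`, the `expand` method, the hostnames, the VO name, and the `/Role=production` FQAN are all hypothetical. It also assumes that a metric instance only applies to service instances of the matching flavour.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class MetricInstance:
    service_flavor: str          # taken from ATP, e.g. "CE" or "SRM"
    metric: str                  # fully qualified metric name
    fqan: Optional[str] = None   # VOMS role to use for the test, if any


@dataclass
class Profile:
    vo: str
    atp_groups: dict             # group name -> list of (hostname, service_flavor)
    metric_instances: list

    def expand(self):
        """Yield (vo, group, hostname, metric, fqan) for each applicable pairing."""
        pairs = product(self.atp_groups.items(), self.metric_instances)
        for (group, services), mi in pairs:
            for hostname, flavor in services:
                # Assumption: a metric is only run against services of its flavour.
                if flavor == mi.service_flavor:
                    yield (self.vo, group, hostname, mi.metric, mi.fqan)


# Made-up topology and FQANs, for illustration only.
profile = Profile(
    vo="lhcb",
    atp_groups={
        "LCG.CNAF-T2.it": [("ce01.example.org", "CE"), ("srm01.example.org", "SRM")],
    },
    metric_instances=[
        MetricInstance("SRM", "hr.srce.SRM2-CertLifetime", "/Role=lcgadm"),
        MetricInstance("SRM", "hr.srce.SRM2-CertLifetime", "/Role=production"),
    ],
)
rows = list(profile.expand())
for row in rows:
    print(row)
```

Here the same metric appears twice with different FQANs, so the SRM host is tested once per role, while the CE host is skipped because no metric instance targets its flavour.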

New Administration Guide

Please use this guide to install a SAM Nagios "Update 23" instance - SAMUpdate23

Administrator guides

Probes

Developers guides

Support

This section collects all information related to the SAM team support activities.

  • Support to EGI:
    • Incidents and bugs affecting the SAM services are managed via GGUS tickets. These tickets follow a support line of three levels:
      • 1st Level Support: carried out by the EGI TPM
      • 2nd Level Support: carried out by the "ARGO/SAM EGI Support" SU
      • 3rd Level Support: carried out by the "SAM-Nagios Experts" SU
    • New requirements and coordination topics are discussed at OTAG meetings.
    • Other topics (different from incidents, bugs, requirements, coordination) are managed on a best effort basis, on the SAM EGI mailing list (argo-ggus-support AT grnet.gr).

SAM-related Procedures

  • Validate ROC or NGI Nagios Procedures: PROC05
  • Setting a Nagios test status to OPERATIONS: PROC06
  • Adding new probes to SAM: PROC07
  • Management of the EGI OPS Availability and Reliability Profile: PROC08

Resources

  • Andrade, P.; Babik, M.; Bhatt, K.; Service Availability Monitoring Framework Based On Commodity Software; CHEP12, March 2012 (poster)