Difference between revisions of "SAM"
Latest revision as of 12:15, 13 July 2016
This article is deprecated and should no longer be used, but is still available for reference.
The Service Availability Monitoring (SAM) system is used to monitor the resources within the production infrastructure. SAM monitoring data is used for calculation of availability and reliability of grid sites.
SAM Nagios probes re-factoring TF
SAM tool instances
Documentation
Introduction
SAM Tests terminology and types
SAM Test
A test is a procedure that checks a specific functionality of a given service, i.e. a single measurement (e.g. org.bdii.Freshness, hr.srce.RGMA-CertLifetime). Tests are executed on NGI/ROC SAM instances.
A SAM test is:
- an OPERATIONS test which raises alarms in the Operations Portal (see list of OPERATIONS tests)
- an AVAILABILITY test whose result is used for Resource Centre availability calculation by ACE (see the ROC_CRITICAL profile)
NOTE: CRITICAL is used ONLY to refer to one of the possible results returned by a Nagios probe.
Probe
A probe is code that implements one or more tests.
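As a rough illustration of how one probe can implement several tests, the sketch below models a probe as a single Python module with two test functions. Each returns one of the standard Nagios plugin statuses (OK, WARNING, CRITICAL, UNKNOWN). The function names, test names, and thresholds are hypothetical and are not taken from any actual SAM probe.

```python
from enum import IntEnum


class NagiosStatus(IntEnum):
    """Standard Nagios plugin return codes."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


def check_cert_lifetime(days_left, warn_days=30, crit_days=7):
    """Hypothetical test: classify the remaining certificate lifetime."""
    if days_left is None:
        return NagiosStatus.UNKNOWN
    if days_left <= crit_days:
        return NagiosStatus.CRITICAL
    if days_left <= warn_days:
        return NagiosStatus.WARNING
    return NagiosStatus.OK


def check_freshness(age_seconds, max_age_seconds=600):
    """Hypothetical test: flag published data older than max_age_seconds."""
    if age_seconds is None:
        return NagiosStatus.UNKNOWN
    if age_seconds <= max_age_seconds:
        return NagiosStatus.OK
    return NagiosStatus.CRITICAL


# One probe (this module) implements several tests, keyed by illustrative names.
PROBE_TESTS = {
    "example.CertLifetime": check_cert_lifetime,
    "example.Freshness": check_freshness,
}
```

Here CRITICAL appears only as a probe result, which matches the note above: it describes the outcome of a single test run, not the type of test.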
Metric
Metric instances are tuples of flavour, metric name and, optionally, FQAN (see the POEM documentation). "Metric" is a synonym for "test" used in the development documentation. In operations documents, "test" is the reference term to be used.
POEM Profile
POEM (Profile Management Database, formerly the Metric Description Database) describes existing metrics and groups them into profiles in order to run tests. In addition, it should define actions that either configure the way availability and reliability are computed or allow notifications to be sent to the messaging system.
A POEM profile is a triple of (VO, atp_groups, metric instances), where:
- atp_groups is a set of service instance groups defined in the Aggregated Topology Provider (ATP) via a VO feed, e.g. LHCb_Site LCG.CNAF-T2.it = (service_instance1, service_instance2, etc.)
- metric instances is a set of (service_flavor, metric, fqan) tuples, where:
  - metric is the fully qualified name of the metric (e.g. hr.srce.SRM2-CertLifetime)
  - service_flavor is taken from ATP (e.g. CE, SRM, etc.)
  - fqan is the VOMS role to use for the tests (e.g. /Role=lcgadm), so that metric1 is run with fqan1, metric1 with fqan2, etc.
- VO is the name of the VO
In other words, the profile is a Cartesian product of service groups and metrics, plus the VO.
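The Cartesian-product view of a profile can be sketched in a few lines of Python. This is only an illustrative model based on the description above, not POEM's actual data model or API: the class names, the expand function, the hostnames, and the second group name are assumptions.

```python
from itertools import product
from typing import NamedTuple, Optional


class MetricInstance(NamedTuple):
    service_flavor: str          # taken from ATP, e.g. "CE" or "SRM"
    metric: str                  # fully qualified metric name
    fqan: Optional[str] = None   # optional VOMS role


class Profile(NamedTuple):
    vo: str
    atp_groups: dict             # group name -> list of service instances
    metric_instances: list       # list of MetricInstance


def expand(profile):
    """Pair every service group with every metric instance, tagged with the VO."""
    pairs = product(sorted(profile.atp_groups), profile.metric_instances)
    return [(profile.vo, group, mi) for group, mi in pairs]


# Illustrative data only: hostnames and the second group are made up.
profile = Profile(
    vo="lhcb",
    atp_groups={
        "LCG.CNAF-T2.it": ["se1.example.org", "ce1.example.org"],
        "LCG.EXAMPLE.org": ["se2.example.org"],
    },
    metric_instances=[
        MetricInstance("SRM", "hr.srce.SRM2-CertLifetime", "/Role=lcgadm"),
        MetricInstance("SRM", "hr.srce.SRM2-CertLifetime"),
    ],
)

for vo, group, mi in expand(profile):
    print(vo, group, mi.service_flavor, mi.metric, mi.fqan)
```

The same metric appears twice with different FQAN values, mirroring the "metric1 with fqan1, metric1 with fqan2" case. The sketch pairs every group with every metric instance, as the text states, and does not filter by service flavour.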
New Administration Guide
Please use this guide to install a SAM Nagios "Update 23" instance: SAMUpdate23
Administrator guides
- SAM Release Notes
- SAM admin guide (including configuration via YAIM)
- SAM/NAGIOS Reference Card for site managers
- VO SAM
- Monitoring uncertified sites:
  - IMPORTANT. EGI.eu provides catch-all WMS and BDII services for the monitoring of uncertified sites. The service is open for use, and your NGI can easily apply here.
Probes
- SAM Probes
- Probes development policy
- Obsolete pages: EMI Nagios and status (ARC, dCache, gLite, UNICORE)
Developers guides
- Probes development guides, SAM PI
Support
This section collects all information related to the SAM team support activities.
- Support to EGI:
  - Incidents and bugs affecting the SAM services are managed via GGUS tickets. These tickets follow a support line of three levels:
    - 1st Level Support: carried out by the EGI TPM
    - 2nd Level Support: carried out by the "ARGO/SAM EGI Support" SU
    - 3rd Level Support: carried out by the "SAM-Nagios Experts" SU
    - New requirements and coordination topics are discussed at OTAG meetings.
  - Other topics (different from incidents, bugs, requirements, coordination) are managed on a best-effort basis via the SAM EGI mailing list (argo-ggus-support AT grnet.gr).
- Other:
  - This website includes all SAM documentation for users, administrators, developers, and support units.
  - The Troubleshooting and FAQs sections may also be helpful.
SAM-related Procedures
- Validate ROC or NGI Nagios Procedures: PROC05
- Setting a Nagios test status to OPERATIONS: PROC06
- Adding new probes to SAM: PROC07
- Management of the EGI OPS Availability and Reliability Profile: PROC08
Resources
- Andrade, P.; Babik, M.; Bhatt, K.; Service Availability Monitoring Framework Based On Commodity Software; CHEP12, March 2012 (poster)