Service Level Target - Availability Reliability

From EGIWiki



Description

The ARGO service collects status results and computes daily and monthly availability (A) and reliability (R) metrics of distributed services. Both status results and A/R metrics are delivered through the ARGO Web UI, where a user can drill down from the availability of a site to the individual test results that contributed to the computed figure.

Availability - Defined as the ability of a service to fulfil its intended function at a specific time or over a calendar month.

Reliability - Defined as the ability of a service to fulfil its intended function at a specific time or over a calendar month, excluding scheduled maintenance periods.
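
The two definitions can be illustrated with a short sketch. The exact formulas are not given on this page, so the following is an assumption: availability is taken as the time a service was up divided by the time its status was known, and reliability uses the same ratio after also removing scheduled downtime from the denominator. The function name and its parameters are hypothetical.

```python
def availability_reliability(up, unknown, downtime, total):
    """Return (availability, reliability) as fractions between 0 and 1.

    All arguments are durations in the same unit (for example minutes).
    Assumption: periods with unknown status are excluded from both
    denominators; scheduled downtime is excluded only for reliability.
    """
    known = total - unknown
    availability = up / known if known > 0 else None
    reliable_base = known - downtime
    reliability = up / reliable_base if reliable_base > 0 else None
    return availability, reliability


# Example: 180 units up, 48 unknown, 40 in scheduled downtime, 288 in total.
a, r = availability_reliability(up=180, unknown=48, downtime=40, total=288)
```

With these inputs the reliability is higher than the availability, because the scheduled maintenance period no longer counts against the service.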


Components

ARGO is comprised of the following building blocks:

Definitions

Groupings of resources

The definitions of entities (resources) are the following:

Metrics and Statuses

The following define the Metric and the Status, the core building blocks of the algorithm used for A/R computations.

These status values are mutually exclusive. The status of a resource can have only one value at a given point in time.

Profiles

There are three (3) types of profiles used within each A/R computation: metric profiles, operations profiles and aggregation profiles.


Time slices

For computations of A/R results the ARGO compute engine uses 288 discrete samples on the daily timeline. The quantization of 288 values has been selected because it corresponds to a sampling interval of 5 minutes (24 h * 60 = 1440 min; 1440 min / 288 = 5 min).
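
As an illustration of the 5-minute quantization, the sketch below maps a timestamp to its slot on the daily timeline. The function name and constants are hypothetical and are not taken from the ARGO code base.

```python
from datetime import datetime

SLOTS_PER_DAY = 288
SLOT_SECONDS = 24 * 60 * 60 // SLOTS_PER_DAY  # 300 seconds = 5 minutes


def slot_index(ts: datetime) -> int:
    """Return the 0-based slot (0..287) of the day that contains ts."""
    seconds_into_day = ts.hour * 3600 + ts.minute * 60 + ts.second
    return seconds_into_day // SLOT_SECONDS


first = slot_index(datetime(2015, 8, 1, 0, 4, 59))   # still in the first slot
last = slot_index(datetime(2015, 8, 1, 23, 59, 59))  # last slot of the day
```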

The compute engine performs computations on a daily timeframe. Although the computations run every hour, each run repeats the same daily computation with updated metric data.


A/R Computation Algorithm

The A/R results are produced by integrating status results according to the metric, operations and aggregation profiles. The compute engine therefore needs to handle status results from metric data efficiently in order to combine and integrate them algorithmically. When the engine creates a daily timeline for a specific service endpoint and a specific metric, it initializes a 288-item array reserved for that service endpoint and metric pair.

Empty sliced timeline.png

When metric data is collected for a specific metric (for a specific service endpoint) it is roughly in the following form:

{ time_stamp | metric | service_flavour | hostname | status | vo | vofqan | profile | dates }

The engine then gathers all relevant daily data for the specific service endpoint and metric. For example, suppose that on a given day there are 5 distinct metric data rows for the hostname foo.example.com, the service mysql.service and the metric mysql.some.metric. The data rows for that day will be of the following form:

{ time_stamp #1 | mysql.some.metric | mysql.service | foo.example.com | UNKNOWN | vo | vofqan | profile | dates }
{ time_stamp #2 | mysql.some.metric | mysql.service | foo.example.com | OK | vo | vofqan | profile | dates }
{ time_stamp #3 | mysql.some.metric | mysql.service | foo.example.com | OK | vo | vofqan | profile | dates }
{ time_stamp #4 | mysql.some.metric | mysql.service | foo.example.com | CRITICAL | vo | vofqan | profile | dates }
{ time_stamp #5 | mysql.some.metric | mysql.service | foo.example.com | OK | vo | vofqan | profile | dates }

The compute engine will also grab the last metric result from the previous day's timeline:

{ time_stamp #0 | mysql.some.metric | mysql.service | foo.example.com | OK | vo | vofqan | profile | dates }

Based on the timestamp and status fields the compute engine will map these data points to the correct indexes of the metric array:

Init sliced timeline.png

Afterwards the compute engine will fill in the gaps appropriately, like so:

Filled sliced timeline.png
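
The mapping and gap-filling steps can be sketched as follows. This is a minimal illustration, not the ARGO implementation: it assumes that each gap is filled by carrying the most recent status forward, and that the last status of the previous day seeds the beginning of the timeline. The function name, the input format (seconds since midnight paired with a status) and the default seed value are hypothetical.

```python
SLOTS_PER_DAY = 288
SLOT_SECONDS = 24 * 60 * 60 // SLOTS_PER_DAY  # 300 seconds


def build_daily_timeline(samples, previous_status="MISSING"):
    """Build a 288-item status array from (seconds_since_midnight, status) pairs.

    previous_status is the last status of the previous day (time_stamp #0).
    """
    timeline = [None] * SLOTS_PER_DAY

    # Step 1: place every sample in its slot; a later sample that falls
    # in the same slot overwrites an earlier one.
    for seconds, status in sorted(samples):
        timeline[seconds // SLOT_SECONDS] = status

    # Step 2: fill the gaps by carrying the last known status forward
    # (assumed fill rule).
    current = previous_status
    for i in range(SLOTS_PER_DAY):
        if timeline[i] is None:
            timeline[i] = current
        else:
            current = timeline[i]
    return timeline


day = build_daily_timeline(
    [(600, "UNKNOWN"), (3600, "OK"), (7200, "CRITICAL")],
    previous_status="OK",
)
```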

When the engine needs to combine several different timelines in order to produce an aggregated timeline result (for example for a specific service flavor), it does the following:

  1. Reserves a new array for the aggregation timeline
  2. Aligns the relevant timeline arrays
  3. Begins from index 0 and combines all array_items[0] to produce the aggregation_item[0]
  4. Moves to the next index and repeats until the end of the arrays is reached

The end result is an aggregated timeline:

Aggregated sliced timeline.png

In all cases AND and OR operations are based on the Operations profile used.

It is important to note that handling the status results as discrete samples makes aggregations simple to implement: all timelines have the same length and can be combined index by index.
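
The index-by-index aggregation can be sketched as follows. The combining operation is passed in as a function, since the actual AND and OR results depend on the operations profile. The helper `worst_of` is a hypothetical two-status stand-in used only to make the example runnable.

```python
from functools import reduce


def aggregate_timelines(timelines, combine):
    """Combine equally sized status arrays index by index.

    combine(a, b) is any binary status operation, for example the AND
    or the OR of an operations profile.
    """
    if not timelines:
        raise ValueError("at least one timeline is required")
    length = len(timelines[0])
    if any(len(t) != length for t in timelines):
        raise ValueError("timelines must be aligned (same length)")
    # zip(*timelines) yields, for each index, the items of all timelines.
    return [reduce(combine, column) for column in zip(*timelines)]


def worst_of(a, b):
    """Hypothetical stand-in operation: CRITICAL wins over OK."""
    return "CRITICAL" if "CRITICAL" in (a, b) else "OK"


aggregated = aggregate_timelines(
    [["OK", "OK", "CRITICAL"], ["OK", "CRITICAL", "CRITICAL"]],
    worst_of,
)
```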

Status Aggregation Algorithm

Status timelines, unlike the sampled timelines used for the A/R computations described above, have no pre-established points in time shared by all timelines, so the compute engine operates differently.

If, for example, the compute engine is given 3 continuous status timelines that need to be aggregated, a new timeline is reserved for the aggregation.

Empty status timeline.png

Then the points of interest (timestamps where status changes occur) are collected

Pois status timeline.png

and the compute engine slices the timeline accordingly

Sliced status timeline.png

The compute engine then creates a number of chunks based on the points of interest found

Chunked status timeline.png

and iteratively fills each chunk, based on the profiles used in the given computation.

Aggr1 status timeline.png


Aggr2 status timeline.png

Once all chunks are filled, the compute engine stitches them back together into the complete aggregated timeline, as in the picture below:

Filled status timeline.png
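
The steps above can be sketched in a few lines. In this illustration a status timeline is a list of (timestamp, status) change points sorted by time, and each status holds until the next change point. The representation, the function names and the `default` status used before a timeline's first change point are assumptions made for the example; `worst_of` is a hypothetical stand-in for an operation from the operations profile.

```python
from bisect import bisect_right
from functools import reduce


def status_at(timeline, t, default="MISSING"):
    """Status in effect at time t on a sorted list of change points."""
    times = [ts for ts, _ in timeline]
    i = bisect_right(times, t) - 1
    return timeline[i][1] if i >= 0 else default


def aggregate_status_timelines(timelines, combine, default="MISSING"):
    # Points of interest: every timestamp where any timeline changes status.
    points = sorted({ts for tl in timelines for ts, _ in tl})
    result = []
    for t in points:
        # Each point of interest opens a chunk; fill it with the combined status.
        status = reduce(combine, (status_at(tl, t, default) for tl in timelines))
        # Stitching: adjacent chunks with the same status are merged.
        if not result or result[-1][1] != status:
            result.append((t, status))
    return result


def worst_of(a, b):
    """Hypothetical stand-in operation: CRITICAL wins over OK."""
    return "CRITICAL" if "CRITICAL" in (a, b) else "OK"


merged = aggregate_status_timelines(
    [
        [(0, "OK"), (10, "CRITICAL"), (20, "OK")],
        [(0, "OK"), (15, "CRITICAL"), (30, "OK")],
    ],
    worst_of,
)
```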

Reports

In the following subsections the metric and aggregation profiles used for each EGI report are given.

The operations profile used in all of the subsequent EGI reports is given in the tables below:

AND       OK        WARNING   UNKNOWN   MISSING   CRITICAL  DOWNTIME
OK        OK        WARNING   UNKNOWN   MISSING   CRITICAL  DOWNTIME
WARNING   WARNING   WARNING   UNKNOWN   MISSING   CRITICAL  DOWNTIME
UNKNOWN   UNKNOWN   UNKNOWN   UNKNOWN   MISSING   CRITICAL  DOWNTIME
MISSING   MISSING   MISSING   MISSING   MISSING   CRITICAL  DOWNTIME
CRITICAL  CRITICAL  CRITICAL  CRITICAL  CRITICAL  CRITICAL  CRITICAL
DOWNTIME  DOWNTIME  DOWNTIME  DOWNTIME  DOWNTIME  CRITICAL  DOWNTIME


OR        OK        WARNING   UNKNOWN   MISSING   CRITICAL  DOWNTIME
OK        OK        OK        OK        OK        OK        OK
WARNING   OK        WARNING   WARNING   WARNING   WARNING   WARNING
UNKNOWN   OK        WARNING   UNKNOWN   UNKNOWN   CRITICAL  UNKNOWN
MISSING   OK        WARNING   UNKNOWN   MISSING   CRITICAL  DOWNTIME
CRITICAL  OK        WARNING   CRITICAL  CRITICAL  CRITICAL  CRITICAL
DOWNTIME  OK        WARNING   UNKNOWN   DOWNTIME  CRITICAL  DOWNTIME
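
The two tables can be expressed as lookup tables. The sketch below transcribes the rows above as they are given; the names `make_op`, `status_and` and `status_or` are hypothetical and only illustrate how such a profile could be applied in code.

```python
STATUSES = ["OK", "WARNING", "UNKNOWN", "MISSING", "CRITICAL", "DOWNTIME"]

# Each row lists the results for the columns in STATUSES order.
AND_ROWS = {
    "OK":       ["OK", "WARNING", "UNKNOWN", "MISSING", "CRITICAL", "DOWNTIME"],
    "WARNING":  ["WARNING", "WARNING", "UNKNOWN", "MISSING", "CRITICAL", "DOWNTIME"],
    "UNKNOWN":  ["UNKNOWN", "UNKNOWN", "UNKNOWN", "MISSING", "CRITICAL", "DOWNTIME"],
    "MISSING":  ["MISSING", "MISSING", "MISSING", "MISSING", "CRITICAL", "DOWNTIME"],
    "CRITICAL": ["CRITICAL"] * 6,
    "DOWNTIME": ["DOWNTIME", "DOWNTIME", "DOWNTIME", "DOWNTIME", "CRITICAL", "DOWNTIME"],
}

OR_ROWS = {
    "OK":       ["OK"] * 6,
    "WARNING":  ["OK", "WARNING", "WARNING", "WARNING", "WARNING", "WARNING"],
    "UNKNOWN":  ["OK", "WARNING", "UNKNOWN", "UNKNOWN", "CRITICAL", "UNKNOWN"],
    "MISSING":  ["OK", "WARNING", "UNKNOWN", "MISSING", "CRITICAL", "DOWNTIME"],
    "CRITICAL": ["OK", "WARNING", "CRITICAL", "CRITICAL", "CRITICAL", "CRITICAL"],
    "DOWNTIME": ["OK", "WARNING", "UNKNOWN", "DOWNTIME", "CRITICAL", "DOWNTIME"],
}


def make_op(rows):
    """Turn a table of rows into a binary function on status names."""
    table = {(a, b): rows[a][j] for a in STATUSES for j, b in enumerate(STATUSES)}

    def op(a, b):
        return table[(a, b)]

    return op


status_and = make_op(AND_ROWS)
status_or = make_op(OR_ROWS)
```

Both tables are symmetric, so the order of the two operands does not matter.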



Sites A/R

In the Sites A/R report the following metric profile is used:

Metric Service Type
org.nordugrid.ARC-CE-ARIS ARC-CE
org.nordugrid.ARC-CE-IGTF ARC-CE
org.nordugrid.ARC-CE-result ARC-CE
org.nordugrid.ARC-CE-srm ARC-CE
org.nordugrid.ARC-CE-sw-csh ARC-CE
emi.cream.CREAMCE-JobSubmit CREAM-CE
emi.wn.WN-Bi CREAM-CE
emi.wn.WN-Csh CREAM-CE
emi.wn.WN-SoftVer CREAM-CE
hr.srce.CADist-Check CREAM-CE
hr.srce.CREAMCE-CertLifetime CREAM-CE
hr.srce.GRAM-Auth GRAM5
hr.srce.GRAM-CertLifetime GRAM5
hr.srce.GRAM-Command GRAM5
hr.srce.QCG-Computing-CertLifetime QCG.Computing
pl.plgrid.QCG-Computing QCG.Computing
hr.srce.SRM2-CertLifetime SRMv2
org.sam.SRM-Del SRMv2
org.sam.SRM-Get SRMv2
org.sam.SRM-GetSURLs SRMv2
org.sam.SRM-GetTURLs SRMv2
org.sam.SRM-Ls SRMv2
org.sam.SRM-LsDir SRMv2
org.sam.SRM-Put SRMv2
org.bdii.Entries Site-BDII
org.bdii.Freshness Site-BDII
emi.unicore.TargetSystemFactory unicore6.TargetSystemFactory
emi.unicore.UNICORE-Job unicore6.TargetSystemFactory
eu.egi.OCCI-IGTF eu.egi.cloud.vm-management.occi
eu.egi.cloud.OCCI-Context eu.egi.cloud.vm-management.occi
eu.egi.cloud.OCCI-VM eu.egi.cloud.vm-management.occi
org.nagios.OCCI-TCP eu.egi.cloud.vm-management.occi
eu.egi.Keystone-IGTF org.openstack.nova
eu.egi.cloud.OpenStack-VM org.openstack.nova
org.nagios.Keystone-TCP org.openstack.nova

The Aggregation profile used is the following one:

Sites Aggregation Profile
Operation  Capability     Operation  Service Flavor
AND        Compute        OR         CREAM-CE
                                     ARC-CE
                                     GRAM5
                                     unicore6.TargetSystemFactory
                                     QCG.Computing
           Storage        OR         SRMv2
                                     SRM
           Information    OR         Site-BDII
           vm-management  OR         eu.egi.cloud.vm-management.occi
                                     org.openstack.nova
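
The way this profile combines statuses can be sketched as follows: service flavors inside a capability are combined with OR, and the capability results are combined with AND. The two priority lists reproduce the AND and OR tables of the operations profile given earlier on this page. Skipping capabilities for which a site runs no service flavor is an assumption of this sketch, and the function name and input format are hypothetical.

```python
# For AND the status later in the list wins; for OR likewise.
AND_ORDER = ["OK", "WARNING", "UNKNOWN", "MISSING", "DOWNTIME", "CRITICAL"]
OR_ORDER = ["MISSING", "DOWNTIME", "UNKNOWN", "CRITICAL", "WARNING", "OK"]

SITES_PROFILE = {
    "Compute": ["CREAM-CE", "ARC-CE", "GRAM5",
                "unicore6.TargetSystemFactory", "QCG.Computing"],
    "Storage": ["SRMv2", "SRM"],
    "Information": ["Site-BDII"],
    "vm-management": ["eu.egi.cloud.vm-management.occi", "org.openstack.nova"],
}


def site_status(flavor_status, profile=SITES_PROFILE):
    """Combine per-flavor statuses of one site at one point in time."""
    capability_results = []
    for flavors in profile.values():
        present = [flavor_status[f] for f in flavors if f in flavor_status]
        if present:  # assumption: capabilities the site does not offer are skipped
            capability_results.append(max(present, key=OR_ORDER.index))
    if not capability_results:
        return "MISSING"
    return max(capability_results, key=AND_ORDER.index)


status = site_status({
    "CREAM-CE": "CRITICAL",
    "ARC-CE": "OK",
    "SRMv2": "WARNING",
    "Site-BDII": "OK",
})
```

In this example the failing CREAM-CE is masked by the healthy ARC-CE inside the Compute capability, so the site status is determined by the WARNING on storage.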


NGI sites A/R

For the NGI-level aggregation, the A/R results of all sites belonging to the NGI are collected and aggregated dynamically, weighted by the HEPSPEC factor of each site. Hence larger sites contribute more to the overall NGI A/R and smaller sites less.
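
One natural reading of this weighting is a weighted arithmetic mean, sketched below. The exact formula used by ARGO is not stated on this page, so this is an assumption; the function name and the sample numbers are hypothetical.

```python
def ngi_weighted_result(site_results):
    """Weighted mean of site A/R values.

    site_results is a list of (value, hepspec) pairs, where value is the
    availability or reliability of a site (in percent) and hepspec is the
    HEPSPEC factor of that site.
    """
    total_weight = sum(weight for _, weight in site_results)
    if total_weight == 0:
        return None  # no weighted sites, nothing to average
    return sum(value * weight for value, weight in site_results) / total_weight


# A large site at 100% and a small site at 50%.
ngi_availability = ngi_weighted_result([(100.0, 3000), (50.0, 1000)])
```

The large site dominates the result: the weighted value is 87.5, against 75.0 for an unweighted mean.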

Monthly League Tables

Monthly EGI League Tables are accessible via the ARGO Web UI (Lavoisier) under the following link: http://argo.egi.eu/lavoisier/ngi_reports?month=YYYY-MM

To get results for a specific month, replace YYYY and MM with the calendar year and month respectively. For example, to obtain results for August 2015 the link should be formatted as follows: http://argo.egi.eu/lavoisier/ngi_reports?month=2015-08
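
The link can also be built programmatically. The helper below is a hypothetical convenience function; it only formats the URL pattern given above and does not contact the service.

```python
def league_table_url(year: int, month: int) -> str:
    """Return the monthly league table link for the given year and month."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    return f"http://argo.egi.eu/lavoisier/ngi_reports?month={year:04d}-{month:02d}"


url = league_table_url(2015, 8)
```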

Monthly reports are also available on the Resource Centres OLA and Resource infrastructure Provider OLA reports wiki page.

Core services A/R

The Core service A/R report utilizes the following metric profile:

Metric Service Type
org.activemq.OpenWireSSL egi.APELRepository
org.nagiosexchange.AccountingPortal-WebCheck egi.AccountingPortal
org.nagiosexchange.AppDB-WebCheck egi.AppDB
org.nagiosexchange.GGUS-WebCheck egi.GGUS
org.nagios.GOCDB-PortCheck egi.GOCDB
org.nagiosexchange.GOCDB-PI egi.GOCDB
org.nagiosexchange.GOCDB-WebCheck egi.GOCDB
org.nagiosexchange.GSTAT-WebCheck egi.GSTAT
org.activemq.Network-Topic egi.MSGBroker
org.activemq.Network-VirtualDestination egi.MSGBroker
org.activemq.OpenWire egi.MSGBroker
org.activemq.OpenWireSSL egi.MSGBroker
org.activemq.STOMP egi.MSGBroker
org.activemq.STOMPSSL egi.MSGBroker
org.nagiosexchange.MetricsPortal-WebCheck egi.MetricsPortal
org.nagiosexchange.OpsPortal-WebCheck egi.OpsPortal
eu.egi.cloud.Perun-Check egi.Perun
org.nagiosexchange.Portal-WebCheck egi.Portal
ch.cern.sam.SAMCentralWebAPI egi.SAM
org.nagiosexchange.TMP-WebCheck egi.TMP
org.nagiosexchange.OpsPortal-WebCheck ngi.OpsPortal
org.nagiosexchange.MyEGIWebInterface ngi.SAM
org.nagiosexchange.NagiosHostSummary ngi.SAM
org.nagiosexchange.NagiosProcess ngi.SAM
org.nagiosexchange.NagiosServiceSummary ngi.SAM
org.nagiosexchange.NagiosWebInterface ngi.SAM
org.nagiosexchange.MyEGIWebInterface vo.SAM
org.nagiosexchange.NagiosHostSummary vo.SAM
org.nagiosexchange.NagiosProcess vo.SAM
org.nagiosexchange.NagiosServiceSummary vo.SAM
org.nagiosexchange.NagiosWebInterface vo.SAM

The Aggregation profile used is the following one:

Core Services Aggregation Profile
Operation  Capability         Operation  Service Flavor
AND        gstat              OR         egi.GSTAT
           vosam              OR         vo.SAM
           ngisam             OR         ngi.SAM
           egisam             OR         egi.SAM
           brokering          OR         egi.MSGBroker
           egiportal          OR         egi.Portal
           egiopsportal       OR         egi.OpsPortal
           egimetricsportal   OR         egi.MetricsPortal
           registry           OR         egi.GOCDB
           helpdesk           OR         egi.GGUS
           applications       OR         egi.AppDB
           authentication     OR         egi.Perun
           tpm                OR         egi.TPM
           apelrepository     OR         egi.APELRepository
           accountingportal   OR         egi.AccountingPortal


Recomputation procedure

Please refer to PROC10.
