The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team using operations @ egi.eu.

Difference between revisions of "New Availability Reporting"

From EGIWiki
 

Latest revision as of 15:11, 6 January 2015


Use case 1: NGI availability reports

We would like to have NGI availability reports. These reports should cover the central services operated by the NGI, including the regional tools and other core middleware services, for example:

  • the VOMS service
  • the top-BDII service
  • the WMS service

and the operational services, including

    • the NGI SAM service
    • the accounting portal and repositories (where available)
    • the NGI operations dashboard (where available)
    • the NGI helpdesk (where available)
    • ...

Only services for which the NGI has direct administrative responsibility are concerned. For example, the NGI availability reports should not include WMS, VOMS, etc. instances that sites deploy independently to support local user communities and local projects.

It is important to consider that an NGI core service is often physically distributed across different sites, which only host the hardware and have no administrative responsibility for the service. This has several implications.

  1. If one instance is down but the rest of the cluster is up, then the "logical" service is still available. This means that the alias, not the individual physical instances, should be monitored for the purpose of availability computation.
  2. The site availability should not be impacted by the unavailability of physical instances of a service operated by the NGI.

This use case could be satisfied by:

  • grouping NGI services into a dedicated NGI site (in the case of a distributed service, only the alias is registered)
  • creating an NGI availability profile applicable only to the "NGI" site, where the availability of the site is computed as the AND composition of the availability of all registered services. Note that if some (optional) services are NOT available, then UP should still be returned, i.e. the profile should include a mandatory set of services (e.g. regional SAM) and a complementary set of optional services (e.g. the local helpdesk, VOMS, etc.)
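The mandatory/optional split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `ngi_site_is_up` and the status dictionary are hypothetical, and "NOT available" is read here as "not deployed or registered", which the text does not state explicitly.

```python
from typing import Iterable, Mapping


def ngi_site_is_up(status: Mapping[str, bool],
                   mandatory: Iterable[str],
                   optional: Iterable[str]) -> bool:
    """AND composition over the services registered in the "NGI" site.

    `status` maps a service alias to True (UP) or False (DOWN).
    A service that is not registered has no entry in `status`.
    """
    # A mandatory service must be registered and UP.
    for name in mandatory:
        if not status.get(name, False):
            return False
    # Assumption: an optional service that is not registered counts as UP,
    # but a registered optional service that is DOWN brings the site DOWN.
    for name in optional:
        if not status.get(name, True):
            return False
    return True


# Hypothetical NGI: regional SAM is mandatory, helpdesk and VOMS are optional,
# and no NGI helpdesk is deployed.
status = {"SAM": True, "VOMS": True}
print(ngi_site_is_up(status, ["SAM"], ["helpdesk", "VOMS"]))  # True
```

Only aliases appear as keys, so the failure of a single physical instance behind an alias does not change the result as long as the alias itself is reported UP.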


Use case 2: EGI.eu availability reports

We would like to measure the overall availability of EGI.eu services. Examples of such services are:

Operational

  • accounting portal and accounting repository
  • GOCDB
  • operations portal
  • central MyEGI
  • message bus
  • GGUS
  • security Nagios and Pakiti
  • security dashboard
  • DTEAM VOMS
  • OPS VOMS
  • Overall availability of all EGI production sites

Technical

  • EGI repository
  • RT

User

  • application database
  • training database

The availability of each category above is the AND of the availability of its services. For example, for the Operational category: the EGI.eu operations service is UP if (GGUS is UP) AND (Operations Portal is UP) AND ... AND (GOCDB is UP).

In GOCDB the VIRTUALOPS ROC could be evolved into the EGI.eu ROC, which includes all EGI.eu services. A new availability profile for EGI.eu is needed.
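As a rough sketch of the per-category rule, the snippet below computes one UP/DOWN value per category. The `CATEGORIES` table is an abridged copy of the lists above, and the function name `category_status` is an assumption made for illustration; it is not part of any existing EGI tool.

```python
from typing import Dict, List, Mapping

# Abridged category membership, copied from the lists above.
CATEGORIES: Dict[str, List[str]] = {
    "Operational": ["GOCDB", "operations portal", "GGUS", "message bus"],
    "Technical": ["EGI repository", "RT"],
    "User": ["application database", "training database"],
}


def category_status(status: Mapping[str, bool]) -> Dict[str, bool]:
    """Return UP (True) or DOWN (False) per category as the AND of its services.

    Every service listed in CATEGORIES must have an entry in `status`;
    a missing entry raises KeyError rather than being silently ignored.
    """
    return {
        category: all(status[service] for service in services)
        for category, services in CATEGORIES.items()
    }


# Hypothetical snapshot: everything is UP except RT.
snapshot = {service: True for services in CATEGORIES.values() for service in services}
snapshot["RT"] = False
print(category_status(snapshot))
```

In this snapshot only the Technical category is reported DOWN, because RT is one of its members.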

Use case 3: Regionalized NGI availability reports

Regional VO support in the operations tools is essential to the idea of NGI autonomy, coined at the end of EGEE-III and also promoted by EGI. NGI_PL users submit jobs using vo.plgrid.pl, and we would like to have Availability/Reliability (A/R) statistics for this VO. A next step would be to allow customization of the A/R computation algorithm by (1) adding regional tests for regular EGI services. For example, in NGI_PL we extended WMS monitoring to cover a couple of new functionalities, and we would like to make these probes critical for WMS availability. Along with that extension goes the ability to customize site availability by (2) manipulating the critical services list, e.g. by adding UI as a service critical for a site (NGI_PL monitors UIs). Finally, we would like to be able to modify the A/R computation algorithm by (3) adding our own regional service types.

Note: (2) overlaps with use case 4 below. (M.Radecki for NGI_PL)
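One possible way to picture customizations (1) and (3) is a profile that maps each service type to its set of critical tests, with a regional overlay merged on top. The sketch below assumes that representation; the test names (`job-submit`, `regional-probe`, `ui-check`) and both function names are hypothetical and do not refer to real SAM probes.

```python
from typing import Dict, Iterable, Mapping, Set

Profile = Dict[str, Set[str]]  # service type -> names of its critical tests


def extend_profile(base: Mapping[str, Iterable[str]],
                   regional: Mapping[str, Iterable[str]]) -> Profile:
    """Return a new profile: `base` plus regional critical tests.

    Service types present only in `regional` are added as new types;
    `base` itself is left unmodified.
    """
    merged: Profile = {stype: set(tests) for stype, tests in base.items()}
    for stype, tests in regional.items():
        merged.setdefault(stype, set()).update(tests)
    return merged


def service_is_up(results: Mapping[str, bool], critical: Iterable[str]) -> bool:
    """A service is UP only if every critical test has run and passed."""
    return all(results.get(test, False) for test in critical)


# Hypothetical profiles: NGI_PL adds a probe to WMS and introduces UI checks.
standard = {"WMS": {"job-submit"}}
ngi_pl = extend_profile(standard, {"WMS": {"regional-probe"}, "UI": {"ui-check"}})

wms_results = {"job-submit": True, "regional-probe": False}
print(service_is_up(wms_results, standard["WMS"]))  # True
print(service_is_up(wms_results, ngi_pl["WMS"]))    # False
```

The same WMS instance is UP under the standard profile and DOWN under the regional one, which is the effect of making the regional probe critical.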

Use case 4: Extension of the standard OPS site availability profile

KIT requested that new services (in addition to CE, SE and BDII) be included in the availability computation, for example: WMS, LB, LFC, FTS, top-BDII, VOMS. In other words, the site requests that any local core service (that is, one operated independently of the NGI) can be considered in OPS availability reports.

As a site does not necessarily operate all of these services (WMS, LB, etc.), if such an optional service does not exist, its availability (in the site availability computation algorithm) should be 1.

For example:

Site is UP iff (CE is UP) AND (SE is UP) AND (Site BDII is UP) AND (WMS is UP) AND (LFC is UP) AND (VOMS is UP)
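The rule that a non-existent optional service contributes an availability of 1 can be sketched as follows. The function name `site_availability` and the 0/1 encoding of DOWN/UP are assumptions made for this illustration.

```python
from typing import Iterable, Mapping


def site_availability(measured: Mapping[str, int], profile: Iterable[str]) -> int:
    """Return 1 (UP) iff every profile service the site operates is UP.

    `measured` maps a service type to 1 (UP) or 0 (DOWN). A service type
    from the profile that the site does not operate has no entry and is
    treated as 1, so it never lowers the result.
    """
    return min((measured.get(service, 1) for service in profile), default=1)


# Hypothetical extended profile and a site that operates no WMS or VOMS.
profile = ["CE", "SE", "Site BDII", "WMS", "VOMS"]
site = {"CE": 1, "SE": 1, "Site BDII": 1}
print(site_availability(site, profile))  # 1
```

Taking the minimum over 0/1 values is equivalent to the AND in the formula above; a site that does operate a WMS and reports it DOWN would get 0.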