Revision as of 11:29, 24 October 2012
|Title||Management of the EGI OPS Availability and Reliability Profile|
|Last modified||16:18, 16 March 2011 (UTC)|
|Policy Group Acronym||OMB|
|Policy Group Name||Operations Management Board|
|Contact Person||E. Imamagic|
|Approved Date||28 March 2011|
|Procedure Statement||This document specifies the procedure for modifying the EGI OPS Availability and Reliability profile.|
Management of the EGI OPS Availability and Reliability Profile
The profile needs to be changed whenever a Nagios test is added to or removed from it, so that the results of that test are included in or excluded from the monthly Availability and Reliability statistics. A change in the OPS Availability and Reliability profile affects the computation of these monthly statistics for all EGI Resource Infrastructures and Resource Centres.
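As a rough illustration of why a profile change alters the statistics, the sketch below models a profile as a set of test names and counts only the results of tests that belong to it. The test names, the result format and the `ok_fraction` figure are hypothetical and deliberately simplified; this is not the algorithm ACE uses to compute Availability and Reliability.

```python
from typing import FrozenSet, Iterable, List, Optional, Tuple

# Hypothetical test names and results, used only for illustration.
CURRENT_PROFILE: FrozenSet[str] = frozenset({"test-a", "test-b"})
RESULTS: List[Tuple[str, bool]] = [
    ("test-a", True),
    ("test-b", True),
    ("test-c", False),
    ("test-a", True),
]


def change_profile(profile: Iterable[str],
                   add: Iterable[str] = (),
                   remove: Iterable[str] = ()) -> FrozenSet[str]:
    """Return a new profile with the given tests added and removed."""
    return (frozenset(profile) | frozenset(add)) - frozenset(remove)


def ok_fraction(profile: FrozenSet[str],
                results: List[Tuple[str, bool]]) -> Optional[float]:
    """Fraction of OK results among tests in the profile (toy figure, not ACE's)."""
    counted = [ok for test, ok in results if test in profile]
    if not counted:
        return None  # no test of the profile produced a result
    return sum(counted) / len(counted)


new_profile = change_profile(CURRENT_PROFILE, add=["test-c"])
print(ok_fraction(CURRENT_PROFILE, RESULTS))  # test-c is ignored
print(ok_fraction(new_profile, RESULTS))      # test-c now counts
```

Because the same profile is applied to every Resource Centre, adding a single test changes the figure everywhere that test runs.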
- The key words Profile, Metric, Probe and Test are defined in the SAM Tests page.
- List of AVAILABILITY tests
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
This procedure is applicable to the EGI OPS Availability and Reliability profile. Any change applied is global, as it has effects on all EGI Resource Centres. The ACE component uses profiles to generate monthly Availability and Reliability reports.
This procedure is NOT applicable to VO-specific Availability and Reliability profiles used by non-OPS VOs (e.g. user communities, national operations VOs, etc.).
Entities involved in the procedure
- Applicant. The Applicant submits a request for changing the EGI OPS profile. Anybody is allowed to submit a request. The request is submitted to the respective Operations Centre, which, after accepting it, forwards it to the Operations Management Board (OMB) for discussion.
- Operations Centre. The entity associated with EGI that is responsible for delivering local operational services to a Resource Infrastructure Provider. In order to contribute resources to EGI, a Resource Infrastructure Provider must be associated with an Operations Centre.
- Resource Infrastructure Operations Manager. Represents the respective Resource Infrastructure within the OMB.
- Chief Operations Officer (COO). The COO is the chairman of the Operations Management Board (OMB).
- SAM Product Team. The SAM Product Team is responsible for scheduling, integrating and releasing probes.
This procedure requires the use of the ACE system for generating monthly availability and reliability statistics. It is not applicable to the GridView system, which is currently in use. The critical feature that ACE supports and GridView lacks is the definition of multiple profiles for availability and reliability statistics.
If the change request includes the addition of new tests, each test MUST first go through the following steps:
- integration of the probe in the SAM release (see procedure PROC07);
- integration of the test in the Operations Dashboard (see procedure PROC06); being an OPERATIONS test is a necessary condition for being an AVAILABILITY test.
The two procedures above ensure that the new tests are included in the SAM release, deployed on all Resource Infrastructure SAM instances and accepted by Operations Centre operators.
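The two prerequisites above can be expressed as a simple eligibility check. The following is a minimal sketch under stated assumptions: the `CandidateTest` class, its fields and the test name are hypothetical, and in practice the two flags would be set by whoever verifies the outcome of PROC07 and PROC06.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CandidateTest:
    """Hypothetical record describing a test proposed for the profile."""
    name: str
    in_sam_release: bool      # probe integrated in the SAM release (PROC07)
    is_operations_test: bool  # test integrated in the Operations Dashboard (PROC06)


def missing_prerequisites(test: CandidateTest) -> List[str]:
    """Return the procedures the test still has to go through."""
    missing = []
    if not test.in_sam_release:
        missing.append("PROC07")
    if not test.is_operations_test:
        missing.append("PROC06")
    return missing


def is_eligible(test: CandidateTest) -> bool:
    """A test can be proposed as an AVAILABILITY test only if nothing is missing."""
    return not missing_prerequisites(test)


candidate = CandidateTest("example-test", in_sam_release=True, is_operations_test=False)
print(is_eligible(candidate), missing_prerequisites(candidate))
```

Returning the list of missing procedures, rather than a bare boolean, lets the Applicant see which step to complete before submitting the request.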
|1||Applicant||Sends a change request to their own Operations Centre. The request is submitted through a GGUS ticket.
Use the "Affected ROC/NGI" field to address the ticket to the appropriate Operations Centre. Template:
Subject: Request for adding/removing XXX(,YYY,...) test(s) to/from the EGI OPS A/R Profile
We would like to request adding/removing XXX(,YYY,...) test(s) to/from the EGI OPS Profile.
Prerequisite data:
* name of SAM test(s):
* name of service on which the test runs:
* link to documentation page:
* motivation (which part of the infrastructure will be improved by the new test, or a description of users' problems that will be avoided in future; provide a list of GGUS tickets if possible)
|2||Operations Centre||The Operations Centre processes the request specified in the GGUS ticket and accepts or rejects it.
Reasons for rejection need to be specified in the GGUS ticket. In case of acceptance, an RT ticket is opened in the noc-managers queue to forward the request for discussion in the OMB. Template:
Subject: Request for adding/removing XXX(,YYY,...) test(s) to/from the EGI OPS Profile
We would like to request adding/removing XXX(,YYY,...) test(s) to/from the EGI OPS Profile.
Please see details in GGUS ticket _link to Applicant's GGUS ticket_.
|3||COO and Resource Infrastructure Operations Manager||The COO schedules a presentation of the requested change at the next possible OMB meeting. The relevant Resource Infrastructure Operations Manager presents the request during the meeting. The Applicant is invited to attend the meeting. Only one request is processed at a time, as the impact of each change needs to be assessed. Requests are processed according to their priority, as agreed by the OMB.|
|4 (*)||COO||Opens a GGUS ticket, addressed to the SAM/Nagios 3rd level SU, requesting the creation of a new ACE profile with a modified set of AVAILABILITY tests.|
|5 (*)||ACE team||ACE team creates the new ACE profile.|
|6||SA1.8 staff||During the following month, two availability and reliability reports are generated, one with the current profile and one with the new profile. SA1.8 staff compares the figures and presents them at the next OMB meeting with a recommendation for acceptance or rejection.|
|7||OMB||If the availability and reliability statistics generated with the new profile are satisfactory, OMB approves the modification.|
|8 (*)||COO||Opens a child GGUS ticket requesting that the new availability and reliability profile become the official profile for EGI.|
|9||COO||Broadcasts the modification to all relevant parties (i.e. Operations Centres and Resource Centres). Closes the GGUS tickets (parent and child).|
(*) - These steps depend on the procedure for creating new profiles, which will be defined by the ACE team once ACE is in production. The steps defined here have been provided by the ACE team. This procedure will be updated if any change occurs.
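The request template used in step 1 can also be filled in programmatically, for example when several requests are prepared at once. The sketch below only builds the subject and body text; the `render_request` function and its parameters are hypothetical, it does not talk to GGUS, and the motivation line is shortened compared with the template.

```python
from typing import List, Tuple


def render_request(action: str, tests: List[str], service: str,
                   doc_link: str, motivation: str) -> Tuple[str, str]:
    """Fill in the step 1 template and return (subject, body)."""
    if action not in ("adding", "removing"):
        raise ValueError("action must be 'adding' or 'removing'")
    if not tests:
        raise ValueError("at least one test name is required")
    direction = "to" if action == "adding" else "from"
    names = ",".join(tests)
    noun = "test" if len(tests) == 1 else "tests"
    subject = (f"Request for {action} {names} {noun} "
               f"{direction} the EGI OPS A/R Profile")
    body = "\n".join([
        f"We would like to request {action} {names} {noun} "
        f"{direction} the EGI OPS Profile.",
        "Prerequisite data:",
        f"* name of SAM test(s): {names}",
        f"* name of service on which the test runs: {service}",
        f"* link to documentation page: {doc_link}",
        f"* motivation: {motivation}",
    ])
    return subject, body


# Placeholder values, as in the template itself.
subject, body = render_request("adding", ["XXX", "YYY"], "SERVICE", "LINK", "MOTIVATION")
print(subject)
print(body)
```

The resulting text would still be pasted into a GGUS ticket by the Applicant, with the "Affected ROC/NGI" field set by hand.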
- 18/10/2011: (T. Ferrari) removed obsolete information: "This procedure does not apply to modifications which have already been agreed with the SAM team: including CREAM-CE probes, switching from the old SAM CA probe to the new one, switching from the old SAM ARC probes to Nagios ones."