Difference between revisions of "PROC24"
Revision as of 13:26, 29 July 2016
Title | Handling Major Incidents |
Document link | https://wiki.egi.eu/wiki/PROC24 |
Last modified | 0.1 |
Policy Group Acronym | OMB |
Policy Group Name | Operations Management Board |
Contact Group | operations at mailman.egi.eu |
Document Status | DRAFT |
Approved Date | |
Procedure Statement | The document describes the process of handling a major incident happening within the EGI infrastructure. |
Owner | Owner of procedure |
Overview
Major Incidents cause serious interruptions of activities and must be resolved with greater urgency than other incidents. The aim of the procedure is the fast recovery of the service, where necessary by means of workarounds and by creating a dynamically established team of specific experts coordinated by EGI Operations.
Definitions
Please refer to the EGI Glossary for the definitions of the terms used in this procedure.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Entities involved in the procedure
- Incident Manager (IM): a member of EGI Operations who classifies the incident as a major incident; if needed, s/he also establishes and coordinates a dynamic team of a few experts to handle the issue (Incident Team)
- Incident Team: a dynamic team of a few experts nominated by the IM
- Customers/Users: if a significant number of customers/users are unable to use the services, or if customers/users may incur significant losses or costs as a consequence of the incident, Customers/Users can be provided directly with temporary workarounds and mitigation procedures to be used until the incident is resolved
- Service providers: teams responsible for development, release and deployment of the services involved in the incident
Requirements
The incident has a major impact on the EGI infrastructure, involving for instance a service heavily used by several VOs. This is the case if
- a significant number of grid and/or cloud resources are not available to the EGI users
- central operation tools are not available
- a significant number of users is impacted by the unavailability of specific services
- a strategic user is impacted
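The criteria above can be expressed as a simple check. This is a minimal sketch, not part of the procedure: the class and field names are hypothetical, and it assumes that any single criterion is enough to treat the incident as major, which the procedure does not state explicitly. What counts as "significant" is left to the judgment of the Incident Manager.

```python
from dataclasses import dataclass


@dataclass
class IncidentImpact:
    """Observed impact of an incident (all field names are hypothetical)."""
    many_resources_unavailable: bool = False  # significant number of grid/cloud resources unavailable
    central_tools_unavailable: bool = False   # central operations tools unavailable
    many_users_impacted: bool = False         # significant number of users hit by unavailable services
    strategic_user_impacted: bool = False     # a strategic user is impacted


def is_major_incident(impact: IncidentImpact) -> bool:
    """Return True if at least one criterion holds (assumption: any one suffices)."""
    return any((
        impact.many_resources_unavailable,
        impact.central_tools_unavailable,
        impact.many_users_impacted,
        impact.strategic_user_impacted,
    ))
```

The result is only an aid for the Incident Manager, who remains responsible for the classification.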
Steps
The Operations Team is informed about an incident which can be classified as major. The Incident Manager starts the emergency procedure when s/he classifies the incident as a Major Incident. The following table describes how the major incident is led to resolution.
Step# | Responsible | Action |
---|---|---|
1 | Anyone | Inform the Operations Team about an incident which could be classified as major. |
2 | EGI Operations team | Assign an Incident Manager to investigate whether the incident is a major incident. |
3 | Incident Manager | Initially investigate the incident to determine whether it is a major incident. |
4 | Incident Manager | If it is, establish (and lead), if needed, a temporary Incident Team of experts to handle the incident. |
5 | Incident Team | Assess the status of the impacted services, evaluate the possible solutions and workarounds with corresponding resolution time estimates, and plan the resolution. Extend the Incident Team if needed. |
6 | Incident Manager | Inform the impacted users/customers about the incident: description of the incident, services impacted, plan for resolution, ETA, possible workarounds. |
7 | Incident Team | Trigger the CHM1 procedure. Implement the planned workaround until the incident is resolved. |
8 | Incident Manager | Inform users/customers and stakeholders about the resolution of the incident. |
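The notification that the Incident Manager sends to impacted users/customers can be assembled from a small record. The sketch below is illustrative only: it assumes the message covers a description of the incident, the services impacted, the plan for resolution, an ETA, and possible workarounds. The class name, field names, and output layout are hypothetical and not prescribed by the procedure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentNotification:
    """Content of the message to impacted users/customers (hypothetical structure)."""
    description: str
    services_impacted: List[str]
    resolution_plan: str
    eta: str
    workarounds: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Build a plain-text message with one section per required item."""
        lines = [
            f"Description: {self.description}",
            f"Services impacted: {', '.join(self.services_impacted)}",
            f"Plan for resolution: {self.resolution_plan}",
            f"ETA: {self.eta}",
        ]
        if self.workarounds:
            lines.append("Possible workarounds:")
            lines.extend(f"  - {item}" for item in self.workarounds)
        else:
            lines.append("Possible workarounds: none identified")
        return "\n".join(lines)


if __name__ == "__main__":
    # Example values are invented for illustration.
    note = IncidentNotification(
        description="Central monitoring is unreachable",
        services_impacted=["monitoring"],
        resolution_plan="Restart the affected hosts after a storage check",
        eta="to be confirmed",
    )
    print(note.render())
```

Keeping every item as a required field, with only the workarounds optional, makes it harder to send a notification that omits one of the listed items.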
Revision History
Version | Authors | Date | Comments |
---|---|---|---|