PROC24
Title | Handling Major Incidents |
Document link | https://wiki.egi.eu/wiki/PROC24 |
Last modified | 0.1 |
Policy Group Acronym | OMB |
Policy Group Name | Operations Management Board |
Contact Group | operations at mailman.egi.eu |
Document Status | DRAFT |
Approved Date | |
Procedure Statement | The document describes the process of handling a major incident happening within the EGI infrastructure. |
Owner | Owner of procedure |
Overview
Major incidents cause serious interruptions of activities and must be resolved with greater urgency than ordinary incidents. The aim of this procedure is the fast recovery of the service, where necessary by means of workarounds and by establishing a dynamic team of specific experts coordinated by EGI Operations.
Definitions
Please refer to the EGI Glossary for the definitions of the terms used in this procedure.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Entities involved in the procedure
- Incident Manager (IM): a member of EGI Operations who classifies the incident as a major incident and, if needed, establishes and coordinates a dynamic team of a few experts to address the issue
- Customers/Users: if a significant number of customers/users are unable to use the services, or if the incident could cause them significant loss or costs, they can be provided directly with temporary workarounds and mitigation procedures to use until the incident is resolved
- Service providers: teams responsible for development, release and deployment of the services involved in the incident
Requirements
The incident has a major impact on the EGI infrastructure, for instance because it involves a service heavily used by several VOs. This is the case if:
- a significant number of grid and/or cloud resources are not available to EGI users
- central operations tools are not available
- a significant number of users are impacted by the unavailability of specific services
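The classification criteria above could be expressed as a simple check, for example in a ticketing or monitoring helper. The following is only a minimal sketch: the `IncidentReport` fields, the threshold values, and the choice to treat the three conditions as alternatives (any one is sufficient) are assumptions made for illustration, since the procedure does not quantify "significant" or state how the conditions combine.

```python
from dataclasses import dataclass


@dataclass
class IncidentReport:
    """Hypothetical summary of an incident; field names are illustrative."""
    unavailable_resource_fraction: float  # share of grid/cloud resources unavailable (0.0-1.0)
    central_tools_unavailable: bool       # True if central operations tools are down
    impacted_user_fraction: float         # share of users hit by unavailable services (0.0-1.0)


# Assumed thresholds: the procedure only says "significant", so these
# numbers are placeholders that EGI Operations would have to agree on.
RESOURCE_THRESHOLD = 0.25
USER_THRESHOLD = 0.25


def is_major_incident(report: IncidentReport) -> bool:
    """Return True if at least one of the three listed conditions holds.

    Treating the conditions as alternatives is an assumption of this sketch.
    """
    return (
        report.unavailable_resource_fraction >= RESOURCE_THRESHOLD
        or report.central_tools_unavailable
        or report.impacted_user_fraction >= USER_THRESHOLD
    )
```

In practice the final classification remains a judgment of the Incident Manager; a check like this would at most flag candidate incidents for review.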
Steps
The following table describes the steps of the procedure.
Step# | Responsible | Action | Prerequisites, if any |
---|---|---|---|
Revision History
Version | Authors | Date | Comments |
---|---|---|---|