
Revision as of 13:13, 29 July 2016


Title: Handling Major Incidents
Document link: https://wiki.egi.eu/wiki/PROC24
Last modified: 0.1
Policy Group Acronym: OMB
Policy Group Name: Operations Management Board
Contact Group: operations at mailman.egi.eu
Document Status: DRAFT
Approved Date:
Procedure Statement: The document describes the process of handling a major incident happening within the EGI infrastructure.
Owner: Owner of procedure


Overview

Major Incidents cause serious interruptions of activities and must be resolved with greater urgency than ordinary incidents. The aim of the procedure is the fast recovery of the service, where necessary by means of workarounds and by establishing a dynamic team of specific experts coordinated by EGI Operations.

Definitions

Please refer to the EGI Glossary for the definitions of the terms used in this procedure.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Entities involved in the procedure

  • Incident Manager (IM): a member of EGI Operations who classifies the incident as a major incident; if needed, they also establish and coordinate a dynamic team of a few experts to address the issue (Incident Team)
  • Incident Team: a dynamic team of a few experts nominated by the IM
  • Customers/Users: if a significant number of customers/users are unable to use the services, or if the incident could cause them significant loss or costs, they can be provided directly with temporary workarounds and mitigation procedures to use until the incident is resolved
  • Service providers: teams responsible for the development, release and deployment of the services involved in the incident

Requirements

The incident has a major impact on the EGI infrastructure, for instance because it involves a service heavily used by several VOs. This is the case if:

  • a significant number of grid and/or cloud resources are not available to the EGI users
  • central operation tools are not available
  • a significant number of users is impacted by the unavailability of specific services
  • a strategic user is impacted
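
The criteria listed above can be read as a simple check. The following is a minimal sketch in Python, assuming the criteria are alternatives, so that any one of them is enough (the procedure does not state this explicitly). The `IncidentImpact` class, its field names, and `is_major_incident` are hypothetical names chosen for illustration and do not belong to any EGI tool. What counts as "significant" is left to the Incident Manager's judgment, so each criterion is modeled as a boolean flag rather than a threshold.

```python
from dataclasses import dataclass


@dataclass
class IncidentImpact:
    """Hypothetical record of the impact assessed by the Incident Manager."""

    # A significant number of grid and/or cloud resources are unavailable.
    resources_unavailable: bool = False
    # Central operation tools are unavailable.
    central_tools_unavailable: bool = False
    # A significant number of users is impacted by unavailable services.
    many_users_impacted: bool = False
    # Fourth criterion in the list above (impact on a strategic user).
    strategic_user_impacted: bool = False


def is_major_incident(impact: IncidentImpact) -> bool:
    """Return True if at least one criterion holds (assumed any-of reading)."""
    return any(
        (
            impact.resources_unavailable,
            impact.central_tools_unavailable,
            impact.many_users_impacted,
            impact.strategic_user_impacted,
        )
    )
```

For example, `is_major_incident(IncidentImpact(central_tools_unavailable=True))` evaluates to `True`, while an `IncidentImpact()` with no flag set does not qualify.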

Steps

The Incident Manager starts the emergency procedure on classifying the incident as a Major Incident. The following table describes how the major incident is led to resolution.

Step# | Responsible | Action
1 | Incident Manager | (if needed) establish (and lead) a temporary team of experts to handle the incident
2 | Incident Team | assess the status of the impacted services, evaluate the possible solutions and workarounds with the corresponding estimated resolution times, plan the resolution
3 | Incident Manager | inform users/customers about the incident (description of the incident, services impacted, plan for resolution, ETA, possible workarounds)
4 | Incident Team | implement the planned solution until the incident is resolved
5 | Incident Manager | report to the entities involved about the outcome of the plan
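
The steps in the table can also be kept as a small checklist, together with a template for the step 3 communication. This is only an illustrative sketch in Python: `STEPS`, `IncidentNotice`, `render`, and `checklist` are hypothetical names, the action labels are shortened paraphrases of the table, and the field list of the notice mirrors the items named in step 3 (description, services impacted, plan for resolution, ETA, possible workarounds). It does not reflect any existing EGI tooling.

```python
from dataclasses import dataclass, field
from typing import List

IM = "Incident Manager"
TEAM = "Incident Team"

# (step number, responsible, shortened action label) paraphrased from the table.
STEPS = [
    (1, IM, "establish and lead a temporary team of experts, if needed"),
    (2, TEAM, "assess impacted services, evaluate solutions, plan the resolution"),
    (3, IM, "inform users/customers about the incident"),
    (4, TEAM, "implement the planned solution until the incident is resolved"),
    (5, IM, "report the outcome of the plan to the entities involved"),
]


@dataclass
class IncidentNotice:
    """Hypothetical container for the information listed in step 3."""

    description: str
    services_impacted: List[str]
    resolution_plan: str
    eta: str
    workarounds: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Format the notice as plain text, one item per line."""
        lines = [
            f"Description: {self.description}",
            f"Services impacted: {', '.join(self.services_impacted)}",
            f"Plan for resolution: {self.resolution_plan}",
            f"ETA: {self.eta}",
            "Possible workarounds: " + ("; ".join(self.workarounds) or "none"),
        ]
        return "\n".join(lines)


def checklist(steps=STEPS) -> List[str]:
    """Return one printable line per step, in the order given by the table."""
    return [f"{number}. [{who}] {action}" for number, who, action in steps]
```

Calling `print("\n".join(checklist()))` lists the five steps with their responsible entity, and `IncidentNotice(...).render()` produces the text that the Incident Manager could adapt before sending it to users/customers.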

Revision History

Version Authors Date Comments