
PROC24

From EGIWiki
Revision as of 12:29, 29 July 2016 by Krakow (talk | contribs)


Title Handling Major Incidents
Document link https://wiki.egi.eu/wiki/PROC24
Last modified 0.1
Policy Group Acronym OMB
Policy Group Name Operations Management Board
Contact Group operations at mailman.egi.eu
Document Status DRAFT
Approved Date
Procedure Statement The document describes the process of handling a major incident happening within the EGI infrastructure.
Owner Owner of procedure


Overview

Major Incidents cause serious interruptions of activities and must be resolved with greater urgency than ordinary incidents. The aim of the procedure is the fast recovery of the service, where necessary by means of workarounds and of a dynamically established team of specific experts coordinated by EGI Operations.

Definitions

Please refer to the EGI Glossary for the definitions of the terms used in this procedure.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Entities involved in the procedure

  • Incident Manager (IM): a member of EGI Operations who classifies the incident as a major incident; if needed, s/he also establishes and coordinates a dynamic team of a few experts to address the issue (the Incident Team)
  • Incident Team: a dynamic team of a few experts nominated by the IM
  • Customers/Users: when a significant number of customers/users are unable to use the services, or when the incident may cause them significant losses or costs, Customers/Users can be provided directly with temporary workarounds and mitigation procedures to use until the incident is resolved
  • Service providers: teams responsible for development, release and deployment of the services involved in the incident

Requirements

An incident is major when it has a big impact on the EGI infrastructure, for instance when it involves a service heavily used by several VOs. This is the case if:

  • a significant number of grid and/or cloud resources are not available to the EGI users
  • central operation tools are not available
  • a significant number of users is impacted by the unavailability of specific services
  • a strategic user is impacted
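
The criteria above could be recorded as a simple checklist, for example in a triage script. The following is a minimal sketch and not part of the procedure: the `IncidentReport` fields and the `is_major_incident` function are hypothetical names, and the sketch assumes that meeting any single criterion is enough. What counts as a "significant number" is not defined here, so each criterion is a boolean filled in by the person doing the triage.

```python
from dataclasses import dataclass


@dataclass
class IncidentReport:
    """Hypothetical triage record; field names are illustrative only."""
    significant_resources_unavailable: bool = False  # grid and/or cloud resources
    central_tools_unavailable: bool = False          # central operation tools
    significant_users_impacted: bool = False         # by unavailability of specific services
    strategic_user_impacted: bool = False


def is_major_incident(report: IncidentReport) -> bool:
    """Return True if at least one criterion holds (assumption: any one suffices)."""
    return any((
        report.significant_resources_unavailable,
        report.central_tools_unavailable,
        report.significant_users_impacted,
        report.strategic_user_impacted,
    ))
```

The final classification remains the Incident Manager's decision; a checklist like this only records which criteria were judged to apply.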

Steps

The Operations Team is informed about an incident which can be classified as major. The Incident Manager starts the emergency procedure when s/he classifies the incident as a Major Incident. The following table describes how the major incident is led to resolution.

Step# | Responsible | Action
1 | Anyone | Inform the Operations Team (via email, Skype, phone or GGUS) about an incident which could be classified as major.
2 | EGI Operations team | Assign an Incident Manager to investigate whether the incident is a major incident.
3 | Incident Manager | Carry out an initial investigation to determine whether the incident is a major incident.
4 | Incident Manager | If it is, establish and lead, if needed, a temporary Incident Team of experts to handle the incident.
5 | Incident Team | Assess the status of the impacted services, evaluate the possible solutions and workarounds with the corresponding resolution time estimates, and plan the resolution. Extend the Incident Team if needed.
6 | Incident Manager | Inform the impacted users/customers about the incident, providing: a description of the incident; the services impacted; the plan for resolution; the ETA; possible workarounds.
7 | Incident Team | Trigger the CHM1 procedure. Implement the planned workaround until the incident is resolved.
8 | Incident Manager | Inform users/customers and stakeholders about the solution of the incident.
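
The notification that the Incident Manager sends to impacted users/customers carries five items: description of the incident, services impacted, plan for resolution, ETA and possible workarounds. A message template along these lines might help keep such notifications complete. This is only a sketch: the `IncidentNotification` class, its field names and the output layout are hypothetical and are not prescribed by this procedure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentNotification:
    """Hypothetical container for the five items of the user notification."""
    description: str
    services_impacted: List[str]
    resolution_plan: str
    eta: str  # free text, since the procedure does not fix a format
    workarounds: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Build a plain-text message listing every item in a fixed order."""
        lines = [
            f"Description of the incident: {self.description}",
            f"Services impacted: {', '.join(self.services_impacted)}",
            f"Plan for resolution: {self.resolution_plan}",
            f"ETA: {self.eta}",
        ]
        if self.workarounds:
            lines.append("Possible workarounds:")
            lines.extend(f"  - {w}" for w in self.workarounds)
        else:
            lines.append("Possible workarounds: none identified yet")
        return "\n".join(lines)
```

Making the first four fields mandatory means a notification cannot be built while the description, impacted services, plan or ETA is still missing, whereas workarounds stay optional because they may not exist yet.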


Revision History

Version Authors Date Comments