Difference between revisions of "MAN04 Tool Intervention Management"
Revision as of 10:36, 31 May 2011
Title | Management of central operational tool unscheduled downtimes |
Document link | https://wiki.egi.eu/wiki/MAN03_Tool_Intervention_Management |
Last review | Tferrari 13:55, 8 March 2011 (UTC) |
Policy Group Acronym | OMB |
Policy Group Name | Operations Management Board |
Contact Person | E. Imamagic |
Document Status | draft |
Approved Date | specify |
Procedure Statement | This manual provides information on how to manage central operational tool unscheduled downtimes. |
Management of central operational tool unscheduled downtimes
This document describes how to intervene when a central operational tool fails unexpectedly.
Scope
This manual applies only to unscheduled downtimes of central operational tools (see the list of central operational tools).
Scheduled downtimes are managed according to existing procedures.
Procedure
In the following sections relevant scenarios are covered.
When using the Operations Portal Broadcast tool, include the following groups:
- LCG Rollout Mailing List
- Operators Mailing lists
- OSG Mailing list
- Tool Admins Mailing List
- WLCG Tier 1 contacts
- NGI managers
- VO managers
- VO users
- Site administrators
- Operation tools
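As a minimal sketch of how an administrator might check a broadcast's recipient selection against the list above before sending it: the group names are copied from this manual, while `REQUIRED_BROADCAST_GROUPS` and `missing_groups` are hypothetical names introduced here for illustration and are not part of the Operations Portal.

```python
# Group names copied from this manual; everything else is a hypothetical helper.
REQUIRED_BROADCAST_GROUPS = (
    "LCG Rollout Mailing List",
    "Operators Mailing lists",
    "OSG Mailing list",
    "Tool Admins Mailing List",
    "WLCG Tier 1 contacts",
    "NGI managers",
    "VO managers",
    "VO users",
    "Site administrators",
    "Operation tools",
)


def missing_groups(selected):
    """Return the required groups absent from `selected`, in manual order.

    Matching ignores letter case and surrounding whitespace.
    """
    chosen = {name.strip().lower() for name in selected}
    return [g for g in REQUIRED_BROADCAST_GROUPS if g.lower() not in chosen]
```

Calling `missing_groups` with the groups ticked in the broadcast form returns whatever still needs to be added; an empty list means the selection covers every group named in this manual.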
Case 1: short "undetected" downtime
Description: The service fails and recovers before the administrator can react (e.g. a short power or network outage).
Actions:
- The administrator should enter an UNSCHEDULED downtime in GOCDB with a detailed description of the failure (see notification template 4). (TODO: who gets the information in case of downtime of a central operational tool?)
Case 2: "detected" downtime
Description: The service fails and the administrator detects the problem. Recovery takes a longer time (TODO: longer than 1h?).
Actions:
- The administrator should announce the failure using the Operations Portal Broadcast tool (see notification template 1).
- At the time of recovery, the administrator should announce the recovery using the Operations Portal Broadcast tool (see notification template 3).
- The administrator should enter an UNSCHEDULED downtime in GOCDB with a detailed description of the failure (see notification template 4).
Case 3: prolonged downtime
Description: Service recovery is delayed.
Actions:
- The administrator should announce that recovery is taking longer than expected using the Operations Portal Broadcast tool (see notification template 2).
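The three cases above can be summarised as a lookup from scenario to required actions. The following is only a sketch under stated assumptions: `Case`, `Step`, `STEPS` and `steps_for` are hypothetical names introduced here, the code calls neither GOCDB nor the Operations Portal, and it does not encode the boundary between cases, since the manual still marks that threshold as a TODO.

```python
from enum import Enum
from typing import List, NamedTuple


class Case(Enum):
    SHORT_UNDETECTED = 1  # Case 1: fails and recovers before anyone reacts
    DETECTED = 2          # Case 2: administrator detects the failure
    PROLONGED = 3         # Case 3: recovery is delayed


class Step(NamedTuple):
    channel: str   # "broadcast" (Operations Portal Broadcast tool) or "gocdb"
    template: int  # notification template number cited in this manual
    purpose: str


# Hypothetical table; steps appear in the order this manual lists them.
STEPS = {
    Case.SHORT_UNDETECTED: [
        Step("gocdb", 4, "enter UNSCHEDULED downtime with failure description"),
    ],
    Case.DETECTED: [
        Step("broadcast", 1, "announce the failure"),
        Step("broadcast", 3, "announce the recovery"),
        Step("gocdb", 4, "enter UNSCHEDULED downtime with failure description"),
    ],
    Case.PROLONGED: [
        Step("broadcast", 2, "announce that recovery is taking longer"),
    ],
}


def steps_for(case: Case) -> List[Step]:
    """Return a copy of the actions listed for `case`."""
    return list(STEPS[case])
```

For example, `steps_for(Case.DETECTED)` yields two broadcast steps (templates 1 and 3) followed by the GOCDB entry (template 4). Keeping the mapping in a single table makes it easy to update once the open TODO items in this manual are resolved.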