Revision as of 11:55, 31 May 2011


Title	Management of central operational tool unscheduled downtimes
Document link	https://wiki.egi.eu/wiki/MAN03_Tool_Intervention_Management
Last review	Tferrari 13:55, 8 March 2011 (UTC)
Policy Group Acronym	OMB
Policy Group Name	Operations Management Board
Contact Person	E. Imamagic
Document Status	draft
Approved Date	specify
Procedure Statement	This manual provides information on how to manage central operational tool unscheduled downtimes.

Management of central operational tool unscheduled downtimes

This document describes how to intervene when a central operational tool fails unexpectedly.

Scope

This manual applies only to unscheduled downtimes of central operational tools (see the list of central operational tools).

Scheduled downtimes are managed according to existing procedures.

Procedure

The following sections cover the relevant scenarios.

Case 1: short "undetected" downtime

Description: The service fails and recovers before the administrator manages to react (e.g. a short power or network outage).

Actions:

The administrator should announce the failure using the Operations Portal Broadcast tool (see notification template 4).

Case 2: "detected" downtime

Description: The service fails and the administrator detects the problem. Recovery takes a longer time (TODO: longer than 1h?).

Actions:

The administrator should announce the failure using the Operations Portal Broadcast tool (see notification template 1).
At the time of recovery, the administrator should announce the recovery using the Operations Portal Broadcast tool (see notification template 3).
The administrator should announce the detailed description of the failure using the Operations Portal Broadcast tool (see notification template 4).

Case 3: prolonged downtime

Description: Service recovery is delayed.

Actions:

The administrator should announce that recovery is taking longer using the Operations Portal Broadcast tool (see notification template 2).
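
The mapping from cases to notification templates can be summarised in a small lookup. The following Python sketch is illustrative only: the names `DowntimeCase`, `TEMPLATES_BY_CASE` and `templates_for` are hypothetical and are not part of the Operations Portal. The template numbers are taken from the actions listed for each case.

```python
from enum import Enum
from typing import List


class DowntimeCase(Enum):
    """Hypothetical labels for the three cases described in this manual."""
    SHORT_UNDETECTED = 1
    DETECTED = 2
    PROLONGED = 3


# Template numbers to broadcast, in order, for each case (assumed encoding
# of the "Actions" lists above).
TEMPLATES_BY_CASE = {
    DowntimeCase.SHORT_UNDETECTED: [4],   # post mortem only
    DowntimeCase.DETECTED: [1, 3, 4],     # failure, recovery, post mortem
    DowntimeCase.PROLONGED: [2],          # extended outage notice
}


def templates_for(case: DowntimeCase) -> List[int]:
    """Return a copy of the ordered template numbers for the given case."""
    return list(TEMPLATES_BY_CASE[case])
```

For example, `templates_for(DowntimeCase.DETECTED)` returns the templates for a detected downtime in the order in which they are to be sent.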

Notification templates

When using the Operations Portal Broadcast tool, the following groups should be included:

LCG Rollout Mailing List
Operators Mailing lists
OSG Mailing list
Tool Admins Mailing List
WLCG Tier 1 contacts
NGI managers
VO managers
VO users
Site administrators
Operation tools

Individual notification templates, together with their targets, are predefined in the Operations Portal Broadcast tool.
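
An administrator who assembles the target list by hand may want to check it against the groups listed above. The Python sketch below assumes the group names are compared as plain strings; `REQUIRED_GROUPS` and `missing_groups` are hypothetical names, not part of the Broadcast tool.

```python
# Target groups copied from the list in this section.
REQUIRED_GROUPS = frozenset({
    "LCG Rollout Mailing List",
    "Operators Mailing lists",
    "OSG Mailing list",
    "Tool Admins Mailing List",
    "WLCG Tier 1 contacts",
    "NGI managers",
    "VO managers",
    "VO users",
    "Site administrators",
    "Operation tools",
})


def missing_groups(selected):
    """Return the required groups absent from `selected`, sorted by name."""
    return sorted(REQUIRED_GROUPS - set(selected))
```

An empty result means every required group has been selected.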

1. Service failure notification

Subject:

[SERVICE_NAME] outage

Message:

Dear all,

[SERVICE_NAME] is experiencing unscheduled downtime.

[ADDITIONAL_INFO]

Apologies for any inconvenience caused.

Best Regards
[SERVICE_TEAM]
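
Filling the bracketed placeholders can be automated. The sketch below uses template 1 as an example and assumes that placeholders always have the form `[UPPER_CASE_NAME]`; the function `fill` and the constant names are hypothetical, and the service name in the usage note is invented for illustration.

```python
import re

SUBJECT_TEMPLATE = "[SERVICE_NAME] outage"
MESSAGE_TEMPLATE = """Dear all,

[SERVICE_NAME] is experiencing unscheduled downtime.

[ADDITIONAL_INFO]

Apologies for any inconvenience caused.

Best Regards
[SERVICE_TEAM]"""


def fill(template, values):
    """Replace each [KEY] placeholder with values[KEY].

    Raises ValueError if any placeholder is left unfilled, so that an
    incomplete notification is not broadcast by mistake.
    """
    text = template
    for key, value in values.items():
        text = text.replace("[" + key + "]", value)
    leftover = re.findall(r"\[[A-Z_]+\]", text)
    if leftover:
        raise ValueError("unfilled placeholders: " + ", ".join(leftover))
    return text
```

For instance, `fill(SUBJECT_TEMPLATE, {"SERVICE_NAME": "Example Tool"})` produces the subject line for a hypothetical service called Example Tool, while calling `fill(MESSAGE_TEMPLATE, ...)` without `SERVICE_TEAM` raises an error instead of sending a message with a raw placeholder.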

2. Extended service failure notification

Subject:

[SERVICE_NAME] extended outage

Message:

Dear all,

Outage of [SERVICE_NAME] is extended.

[Recovery 

[ADDITIONAL_INFO]

Apologies for any inconvenience caused.

Best Regards
[SERVICE_TEAM]

3. Service recovery notification without detailed information

Subject:

[SERVICE_NAME] recovery

Message:

Dear all,

[SERVICE_NAME] is back online. Additional information on failure will be provided soon.

Best Regards
[SERVICE_TEAM]

4. Post mortem analysis

Description:

[SERVICE_NAME] recovery

Revision History

@@ Line 56: / Line 56: @@
 Actions:
-# Administrator should enter UNSCHEDULED downtime to GOCDB with detailed description of failure (see notification template 4). (TODO: who gets the information in case of downtime of central operational tool).
+# Administrator should announce the failure by using Operations Portal Broadcast tool (see notification template 4).
 == Case 2: "detected" downtime ==
@@ Line 65: / Line 65: @@
 # Administrator should announce the failure by using Operations Portal Broadcast tool (see notification template 1).
 # At the time of recovery administrator should announce the recovery by using Operations Portal Broadcast tool (see notification template 3).
-# Administrator should enter UNSCHEDULED downtime to GOCDB with detailed description of failure (see notification template 4).
+# Administrator should announce the detailed description of failure by using Operations Portal Broadcast tool (see notification template 4).
 == Case 3: prolonged downtime ==

Difference between revisions of "MAN04 Tool Intervention Management"

