Difference between revisions of "Collaboration Tools Availability and Continuity Plan"

Latest revision as of 17:29, 1 February 2022

Back to main page: Services Availability Continuity Plans

This article is Deprecated and has been moved to https://confluence.egi.eu/display/IMS/SACM%3A+Collaboration+Tools.

@@ Line 2: / Line 2: @@
 [[Category:Operations]]
 Back to main page: [[Services Availability Continuity Plans]]
+{{DeprecatedAndMovedTo|new_location=https://confluence.egi.eu/display/IMS/SACM%3A+Collaboration+Tools}}
-= Introduction =
-This page reports on the Availability and Continuity Plan for the '''[[EGI Collaboration tools]]''' and is the result of the risk assessment conducted for this service: a series of risks and threats has been identified and analysed, along with the corresponding countermeasures currently in place. Whenever a countermeasure is not considered satisfactory for avoiding a risk, or for reducing its likelihood or impact, a new treatment is agreed with the service provider to improve the availability and continuity of the service. The process concludes with an availability and continuity test.
-{| class="wikitable"
-|-
-! scope="col"|
-! scope="col"| Last
-! scope="col"| Next
-|-
-! scope="row"| Risks assessment
-| 2018-04-24
-| 2019 April
-|-
-! scope="row"| Av/Co plan and test
-| 2018-10-26
-| --
-|-
-|}
-Previous plans are collected here: <add a link to doc db>
-= Performances =
-The following performance targets were agreed in the OLA, on a monthly basis:
-*Availability: DNS 99%; other services 95%
-*Reliability: 99%
-<pre style="color: blue"> Other availability requirements:
-- the service is accessible through X509 certificate and/or other authentication system
-- The service is accessible via CLI and/or webUI
-- (depending on the service, specific requirements can be identified. In case, for each requirement report what is the action/measure in case of failure)
-The service availability is regularly tested by the Nagios probes org.nagiosexchange.Portal-WebCheck and org.nagiosexchange.RT-WebCheck: https://argo-mon.egi.eu/nagios/cgi-bin/status.cgi?servicegroup=SITE_GRIDOPS-CTOOLS_egi.Portal&style=overview https://argo-mon.egi.eu/nagios/cgi-bin/status.cgi?servicegroup=SITE_GRIDOPS-CTOOLS_eu.egi.rt&style=overview
-</pre>
-The performance reports on Availability and Reliability are produced by [http://egi.ui.argo.grnet.gr/egi/OPS-MONITOR-Critical ARGO] in near real time and are also periodically collected into the [https://documents.egi.eu/public/ShowDocument?docid=2324 Documentation Database].
-Over the past years, the performance figures have not highlighted any particular Av/Co issues with the Collaboration tools that need further investigation.
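As an illustration of how a month of monitoring data could be compared with the OLA targets above, here is a minimal Python sketch. It assumes the commonly used definitions in which availability is uptime over the known time, and reliability additionally excludes scheduled downtime; the exact formulas applied by ARGO are not stated on this page. The <code>MonthlyStatus</code> class, its field names and the <code>meets_ola</code> helper are hypothetical and exist only for this example.

```python
from dataclasses import dataclass

# Monthly OLA targets quoted in this plan (percent).
AVAILABILITY_TARGETS = {"DNS": 99.0, "other": 95.0}
RELIABILITY_TARGET = 99.0


@dataclass
class MonthlyStatus:
    """Hours spent in each status during one month (hypothetical input format)."""
    up: float
    down: float
    scheduled_downtime: float = 0.0
    unknown: float = 0.0

    @property
    def total(self) -> float:
        return self.up + self.down + self.scheduled_downtime + self.unknown


def availability(s: MonthlyStatus) -> float:
    """Uptime as a percentage of the known time (assumed definition)."""
    known = s.total - s.unknown
    return 100.0 * s.up / known if known > 0 else 0.0


def reliability(s: MonthlyStatus) -> float:
    """Uptime as a percentage of known time outside scheduled downtime (assumed definition)."""
    counted = s.total - s.unknown - s.scheduled_downtime
    return 100.0 * s.up / counted if counted > 0 else 0.0


def meets_ola(s: MonthlyStatus, service: str = "other") -> bool:
    """True when both monthly targets are met for the given service class."""
    return (availability(s) >= AVAILABILITY_TARGETS[service]
            and reliability(s) >= RELIABILITY_TARGET)
```

For example, a 720-hour month with 710 hours up, 4 hours down and 6 hours of scheduled downtime gives an availability of about 98.6% and a reliability of about 99.4%: this meets the targets for the other services, but not the 99% availability target for DNS.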
-= Risks assessment and management =
-For more details, see the [https://docs.google.com/spreadsheets/d/1KWfMyLv51BPg-XFCK5zHXuIrLbmKfFBBSIqV4Yn25P4/edit#gid=1565906040 Google spreadsheet]. A summary of the assessment is reported here.
-== Risks analysis ==
-<pre style="color: blue"> to update </pre>
-{| class="wikitable"
-! Risk id
-! Risk description
-! Affected components
-! Established measures
-! Risk level
-! Expected duration of downtime / time for recovery
-! Comment
-|-
-| 1
-| Service unavailable / loss of data due to hardware failure
-| all services
-| virtualization on HA platform, backups
-| style="background: yellow"| Medium
-| 1 or more working days
-| the measures already in place are considered satisfactory and risk level is acceptable
-|-
-| 2
-| Service unavailable / loss of data due to software failure
-| depends on affected sw, probably all services
-| monitoring of system health, backups
-| style="background: yellow"| Medium
-| up to 8 hours (1 working day)
-| the measures already in place are considered satisfactory and risk level is acceptable
-|-
-| 3
-| Service unavailable / loss of data due to human error
-| depends on affected sw/data, probably all services
-| monitoring of system health, backups, actively maintained documentation (wiki)
-| style="background: yellow"| Medium
-| up to 8 hours (1 working day)
-| the measures already in place are considered satisfactory and risk level is acceptable
-|-
-| 4
-| Service unavailable due to network failure (network outage with causes external to the site)
-| all services (could affect only selected users, depends on problematic networks)
-| monitoring of service availability, alternative network routes
-| style="background: green"| Low
-| up to 4 hours (half working day)
-| the measures already in place are considered satisfactory and risk level is acceptable
-|-
-| 5
-| Unavailability of key technical and support staff (holidays period, sickness, ...)
-| depends on the problem requiring staff attention, could escalate to all services
-| contacts to other local staff capable of administering the services
-| style="background: yellow"| Medium
-| 1 or more working days
-| the measures already in place are considered satisfactory and risk level is acceptable
-|-
-| 6
-| Major disruption in the data centre, for example fire, flood or electrical failure
-| all services
-| access to other data centres, geographically diverse backups
-| style="background: yellow"| Medium
-| 1 or more working days
-| the measures already in place are considered satisfactory and risk level is acceptable
-|-
-| 7
-| Major security incident. The system is compromised by external attackers and needs to be reinstalled and restored.
-| depends on compromised subsystem and/or data, could escalate to all services
-| monitoring of system health, security audits, backups, following best practices for security configuration and timely implementation of patches
-| style="background: yellow"| Medium
-| up to 8 hours (1 working day)
-| the measures already in place are considered satisfactory and risk level is acceptable
-|-
-| 8
-| (D)DoS attack. The service is unavailable because of a coordinated DDoS.
-| depends on the attack, could extend from one specific service to complete network outage (=all services)
-| monitoring of service availability, alternative network routes, anti-DDoS measures in the ISP network
-| style="background: green"| Low
-| up to 4 hours (half working day)
-| the measures already in place are considered satisfactory and risk level is acceptable
-|}
-== Additional information ==
-<pre style="color: blue">
-- procedures for the several countermeasures to invoke in case of risk occurrence (put a link if public)
-- the Availability targets don't change in case the plan is invoked.
-- recovery requirements:
--- Maximum tolerable period of disruption (MTPoD) (the maximum amount of time that a service can be unavailable or undelivered after an event that causes disruption to operations, before its stakeholders perceive unacceptable consequences): 2 days
--- Recovery time objective (RTO) (the acceptable amount of time to restore the service in order to avoid unacceptable consequences associated with a break in continuity (this has to be less than MTPoD)): 1 day
--- Recovery point objective (RPO) (the acceptable latency of data that will not be recovered): 2 days
-- approach for the return to normal working conditions as reported in the risk assessment.
-</pre>
-== Outcome ==
-<pre style="color: blue"> to update </pre>
-The level of all the identified risks is acceptable and the countermeasures already adopted are considered satisfactory.
-= Availability and Continuity test =
-<pre style="color: blue"> to update </pre>
-The proposed A/C test will focus on a recovery scenario: the service has been disrupted and needs to be reinstalled from scratch.
-The time spent for restoring the service will be measured, using the last backup of the data stored in it.
-Performing this test will be useful to spot any issue in the recovery procedures of the service.
-'''Test details''':
-*The recovery process has been tested and it took 26 minutes.
-*Backups are created every two days, so in the worst case up to two days of data can be lost.
-'''Outcomes and recommendations''':
-The test as a whole can be considered successful: the recovery time is acceptable, although we need to evaluate whether losing two days of data in the worst case is tolerable. Some services included in the Collaboration Tools might need a higher backup frequency: we are going to perform a Business Impact Analysis for these services and then agree the necessary updates to the plan with the providers.
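The comparison between the test results and the recovery requirements of this plan (MTPoD 2 days, RTO 1 day, RPO 2 days) can be sketched in Python as follows. This is only an illustration: the function name <code>evaluate_recovery_test</code> and the wording of the findings are hypothetical, and the worst-case data loss is assumed to equal one full backup interval (a failure just before the next backup).

```python
from datetime import timedelta

# Recovery requirements quoted in this plan.
MTPOD = timedelta(days=2)  # maximum tolerable period of disruption
RTO = timedelta(days=1)    # recovery time objective
RPO = timedelta(days=2)    # recovery point objective


def evaluate_recovery_test(recovery_time: timedelta,
                           backup_interval: timedelta) -> list[str]:
    """Return a list of findings; an empty list means all objectives are met."""
    findings = []
    if RTO >= MTPOD:
        findings.append("RTO must be shorter than MTPoD")
    if recovery_time > RTO:
        findings.append("measured recovery time exceeds RTO")
    # Assumption: in the worst case one full backup interval of data is lost.
    if backup_interval > RPO:
        findings.append("worst-case data loss exceeds RPO")
    return findings


# Values from the test reported above: 26 minutes to recover, backups every two days.
print(evaluate_recovery_test(timedelta(minutes=26), timedelta(days=2)))
```

With the measured values the list of findings is empty, which matches the conclusion above. The two-day backup interval sits exactly at the RPO limit, so any service whose Business Impact Analysis results in a shorter RPO would need more frequent backups.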
-= Revision History  =
-{| class="wikitable"
-|-
-! Version
-! Authors
-! Date
-! Comments
-|-
-| <br>
-| Alessandro Paolini
-| 2018-04-25
-| first draft, discussing with the provider
-|-
-| <br>
-| Alessandro Paolini
-| 2018-10-26
-| recovery test performed, plan finalised
-|-
-| <br>
-| Alessandro Paolini
-| 2019-11-25
-| starting the yearly review....
-|-
-|
-|
-|
-|
-|}
