GGUS Availability and Continuity Plan
This page reports on the Availability and Continuity Plan for GGUS. It is the result of the risk assessment conducted for this service: a series of risks and threats has been identified and analysed, along with the corresponding countermeasures currently in place. Whenever a countermeasure is not considered satisfactory for avoiding a risk, or for reducing its likelihood or impact, a new treatment for improving the availability and continuity of the service is agreed with the service provider. The process concludes with an availability and continuity test.
|Risks assessment||2018-04-26||2019 2nd half|
|Av/Co plan and test||2018-11-06||--|
Previous plans are collected here: https://documents.egi.eu/secure/ShowDocument?docid=3542
The following performance targets were agreed in the OLA, on a monthly basis:
- Availability 99%
- Reliability 99%
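To put the 99% target in perspective, it can be converted into a monthly downtime budget. The sketch below is only illustrative: the function name `downtime_budget` and the 30-day month are assumptions, and it does not model how the OLA treats scheduled downtime or periods with unknown monitoring status.

```python
from datetime import timedelta


def downtime_budget(target: float, period: timedelta) -> timedelta:
    """Return the maximum unavailability within `period` that still meets `target`.

    `target` is a fraction (0.99 for 99%). Hypothetical helper, not part of the OLA.
    """
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be a fraction between 0 and 1")
    return period * (1.0 - target)


# Assumption: a 30-day month (720 hours).
budget = downtime_budget(0.99, timedelta(days=30))
print(f"Tolerated downtime: {budget.total_seconds() / 3600:.1f} hours")
```

For a 30-day month this corresponds to roughly 7.2 hours of tolerated unavailability.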
Other availability requirements:
- the service is accessible through X509 certificate and/or other authentication system
- the service is accessible via CLI and/or webUI
- (depending on the service, specific requirements can be identified. If so, for each requirement report the action/measure to take in case of failure)
The service availability is regularly tested by the Nagios probes org.nagiosexchange.Portal-WebCheck and org.nagiosexchange.RT-WebCheck:
- https://argo-mon.egi.eu/nagios/cgi-bin/status.cgi?servicegroup=SITE_GRIDOPS-CTOOLS_egi.Portal&style=overview
- https://argo-mon.egi.eu/nagios/cgi-bin/status.cgi?servicegroup=SITE_GRIDOPS-CTOOLS_eu.egi.rt&style=overview
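The kind of check a web probe performs can be sketched as follows. This is not the code of the WebCheck probes named above, only a minimal illustration under stated assumptions: the function name `web_check` and the injectable `opener` parameter are hypothetical, and only the OK and CRITICAL states of the standard Nagios plugin exit codes are used.

```python
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes (WARNING=1 and UNKNOWN=3 are not used here).
OK, CRITICAL = 0, 2


def web_check(url, timeout=10.0, opener=urllib.request.urlopen):
    """Fetch `url` once and return (exit_code, message) in Nagios plugin style.

    `opener` defaults to urllib.request.urlopen; it is a parameter only so the
    function can be exercised without network access.
    """
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx/5xx responses.
        return CRITICAL, f"CRITICAL - HTTP {exc.code} from {url}"
    except OSError as exc:
        # Covers URLError, connection failures and timeouts.
        return CRITICAL, f"CRITICAL - {url} unreachable: {exc}"
    if 200 <= status < 400:
        return OK, f"OK - HTTP {status} from {url}"
    return CRITICAL, f"CRITICAL - HTTP {status} from {url}"
```

A wrapper script would call `web_check` with the service URL, print the message and exit with the returned code, which is how a Nagios-style scheduler interprets the result.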
Over the past years, the performance figures of the GGUS helpdesk system have not highlighted any particular Av/Co issues that need further investigation.
Risks assessment and management
For more details, please refer to the Google spreadsheet. A summary of the assessment is reported here.
|Risk id||Risk description||Affected components||Established measures||Risk level||Expected duration of downtime / time for recovery||Comment|
|1||Service unavailable / loss of data due to hardware failure||Web GUI, Webservices, Databases||fail-safe architecture, virtualisation solution, integration in on-call service||Medium||up to 4 hours (half working day)||the measures already in place are considered satisfactory and risk level is acceptable|
|2||Service unavailable / loss of data due to software failure||Web GUI, Webservices, Databases||fail-safe architecture, software in version control, integration in on-call service||Medium||up to 4 hours (half working day)||the measures already in place are considered satisfactory and risk level is acceptable|
|3||Service unavailable / loss of data due to human error||Web GUI, Webservices, Databases||periodic training of on-call service staff; system with very restricted access, only accessible by service owner and the team of the on-call service||Medium||up to 4 hours (half working day)||the measures already in place are considered satisfactory and risk level is acceptable|
|4||Service unavailable due to network failure (network outage with causes external to the site)||Web GUI, Webservices, Databases||two independent network providers||Low||up to 4 hours (half working day)||the measures already in place are considered satisfactory and risk level is acceptable|
|5||Unavailability of key technical and support staff (holidays period, sickness, ...)||Web GUI, Webservices, Databases||integration in 24/7 on-call service||Low||1 or more working days||the measures already in place are considered satisfactory and risk level is acceptable|
|6||Major disruption in the data centre, for example fire, flood or electric failure||Web GUI, Webservices, Databases||well organized computing center, power supply (UPS), fire alarm system, tape backup at different location||Medium||1 or more working days||the measures already in place are considered satisfactory and risk level is acceptable|
|7||Major security incident. The system is compromised by external attackers and needs to be reinstalled and restored.||Web GUI, Webservices, Databases||server configuration according to best practices, most recent patch installation of OS, virus protection, firewall||Medium||1 or more working days||the measures already in place are considered satisfactory and risk level is acceptable|
|8||(D)DoS attack. The service is unavailable because of a coordinated DDoS.||Web GUI, Webservices, Databases||preconnected load balancer, fail2ban, firewall||Medium||1 or more working days||the measures already in place are considered satisfactory and risk level is acceptable|
The level of all the identified risks is acceptable and the countermeasures already adopted are considered satisfactory.
- procedures for the several countermeasures to invoke in case of risk occurrence (put a link if public)
- the Availability targets don't change in case the plan is invoked.
- recovery requirements:
-- Maximum tolerable period of disruption (MTPoD) (the maximum amount of time that a service can be unavailable or undelivered after an event that causes disruption to operations, before its stakeholders perceive unacceptable consequences): 2 days
-- Recovery time objective (RTO) (the acceptable amount of time to restore the service in order to avoid unacceptable consequences associated with a break in continuity (this has to be less than MTPoD)): 1 day
-- Recovery point objective (RPO) (the acceptable latency of data that will not be recovered): 2 days
- approach for the return to normal working conditions as reported in the risk assessment.
Availability and Continuity test
The proposed A/C test focused on a recovery scenario:
- the database is corrupted and a backup version has to be imported;
- result: after importing the backup, the system was up and running again without any further problems;
- the import of the backup took ~30 minutes;
- in such a scenario the data gathered between the backup and the database crash may be lost. In the worst case this may be 1 day (24 hours) of data.
The test can be considered successful: restoring the service took relatively little time (while GGUS is unavailable, the Operations Portal Dashboard is in read-only mode, and the regional helpdesk systems interfaced to GGUS only lose their connection with it). Even if at most 1 day of data is lost, new tickets can be opened and replies to tickets can be posted again once the service is restored.
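The test figures can be compared with the recovery requirements stated in the plan (MTPoD 2 days, RTO 1 day, RPO 2 days). The sketch below is a minimal illustration; the function and variable names are hypothetical, and it assumes that the ~30 minute backup import is the dominant part of the restore time, which the test report does not state explicitly.

```python
from datetime import timedelta

# Recovery requirements as stated in the plan.
MTPOD = timedelta(days=2)
RTO = timedelta(days=1)
RPO = timedelta(days=2)

# Figures observed in the recovery test.
restore_time = timedelta(minutes=30)       # import of the backup (approximate)
worst_case_data_loss = timedelta(hours=24)  # data gathered since the last backup


def evaluate(restore, data_loss, rto=RTO, rpo=RPO, mtpod=MTPOD):
    """Return one boolean per recovery requirement (hypothetical helper)."""
    return {
        "rto_below_mtpod": rto < mtpod,
        "restore_within_rto": restore <= rto,
        "data_loss_within_rpo": data_loss <= rpo,
    }


for check, passed in evaluate(restore_time, worst_case_data_loss).items():
    print(f"{check}: {'pass' if passed else 'FAIL'}")
```

With the values above, all three checks pass: the restore time is well within the RTO, and the worst-case data loss of 24 hours is within the 2-day RPO.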
|Alessandro Paolini||2018-04-26||first draft, discussing with the provider.|
|Alessandro Paolini||2018-11-06||added the information about the test; plan finalised.|
|Alessandro Paolini||2019-11-25||starting the yearly review....|