Operations Portal Availability and Continuity Plan
Back to main page: Services Availability Continuity Plans
This page reports the Availability and Continuity Plan for the Operations Portal. It is the result of the risk assessment conducted for this service: a series of risks and threats has been identified and analysed, along with the corresponding countermeasures currently in place. Whenever a countermeasure is not considered satisfactory for avoiding or reducing the likelihood of a risk occurring, a new treatment for improving the availability and continuity of the service is agreed with the service provider. The process is concluded with an availability and continuity test.
|Activity||Last update||Next review|
|Risks assessment||2021-04-06||2022 April|
|Av/Co plan and test||2020-04-07||2022 April|
Previous plans are collected here: https://documents.egi.eu/secure/ShowDocument?docid=3538
The following performance targets, on a monthly basis, were agreed in the OLA:
- Availability 99%
- Reliability 99%
Other availability requirements:
- the service is accessible through EOSC AAI and EGI Check-in
- the service is also accessible with a certificate, by using the "IGTF Proxy certificate" option in the AAI
- a web service also exposes information through a REST API
The service availability is regularly tested by the Nagios probe org.nagiosexchange.OpsPortal-WebCheck: https://argo-mon.egi.eu/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_egi.OpsPortal&style=detail
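A probe of this kind essentially performs an HTTP check and reports a Nagios-style state. The sketch below is illustrative only, assuming a plain HTTP GET; the function names and thresholds are ours, not the actual implementation of the OpsPortal-WebCheck probe.

```python
import urllib.request

# Nagios-style service states
OK, WARNING, CRITICAL = 0, 1, 2

def classify_status(http_status):
    """Map an HTTP status code to a Nagios-style service state."""
    if 200 <= http_status < 300:
        return OK
    if 300 <= http_status < 400:
        return WARNING   # reachable, but redirected away from the expected page
    return CRITICAL      # 4xx/5xx or anything unexpected

def check_portal(url, timeout=10):
    """Perform a simple web check of the portal and return its state."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_status(resp.status)
    except OSError:
        return CRITICAL  # DNS failure, timeout, connection refused, ...
```

The monitoring framework interprets the returned state (0 = OK, 1 = WARNING, 2 = CRITICAL) when computing the availability and reliability figures.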
Over the past years, the Operations Portal has not shown particular Av/Co issues in its performance figures that would need to be further investigated.
Risks assessment and management
For more details, please see the Google spreadsheet; a summary of the assessment is reported here.
|Risk id||Risk description||Affected components||Established measures||Risk level||Expected duration of downtime / time for recovery||Treatment (Protective, mitigation measures, recovery activities, controls)|
|1||Service unavailable / loss of data due to hardware failure||Cluster web / Databases||Databases are hosted on a cluster of machines: the other nodes will continue to work. Web services are now containerized and the images are immediately available in case of disruption; if a node is disabled, another one will be spawned||Low||1 sec||Recovery activities: daily backup. Protective measures / controls: monitoring of the nodes|
|2||Service unavailable / loss of data due to software failure||Cluster web / Databases||Databases are hosted on a cluster of machines: the other nodes will continue to work. Web services are now containerized and the images are immediately available in case of disruption; if a node is disabled, another one will be spawned||Low||1 sec||Recovery activities: daily backup. Protective measures / controls: monitoring of the nodes|
|3||Service unavailable / loss of data due to software failure||Lavoisier||The configuration of Lavoisier is stored centrally. We can spawn a new machine to replace it or use the pre-production instance||Medium||15 min||Protective measures / controls: monitoring of the nodes. Recovery activities: the pre-production instance can be restarted in production mode; if the problem lies in the configuration, it can be reverted from GitLab. We are working to move the Lavoisier services to the new web platform to ensure full continuity of service in case of a problem on a given node.|
|4||Service unavailable / loss of data due to human error||Lavoisier||The configuration of Lavoisier is stored centrally. We can spawn a new machine to replace it or use the pre-production instance||Medium||15 min||Protective measures / controls: monitoring of the nodes. Recovery activities: the pre-production instance can be restarted in production mode; if the problem lies in the configuration, it can be reverted from GitLab. We are working to move the Lavoisier services to the new web platform to ensure full continuity of service in case of a problem on a given node.|
|5||Service unavailable due to network failure (network outage with causes external to the site)||Cluster web / Databases||RENATER has redundant network connectivity||Medium||1 working day||Multiple and redundant links to RENATER (preventive)|
|6||Not enough people for maintaining and operating the service||Cluster web / Databases||The team schedules holidays so that one person is always at the office; generally only two periods are not covered, 25th-26th December and 15th August||Medium||1 or more working days||Monitoring of the services, VPN access to work from home (preventive)|
|7||Major disruption in the data centre||Cluster web / Databases||The computing centre has an electric backup system and fire control devices; the data centre has 2 separate machine rooms||Medium||less than 1 hour||Monitoring of the building and multiple redundant systems (electricity and cooling) (preventive)|
|8||Major security incident. The system is compromised by external attackers and needs to be reinstalled and restored.||Cluster web / Databases / Lavoisier||Regular upgrades of the components, checks of CVEs and known vulnerabilities, isolation of the components||Medium||until the resolution of the security incident||Security monitoring systems (preventive)|
|9||(D)DoS attack. The service is unavailable because of a coordinated DDoS.||Cluster web / Databases||RENATER and the local network team provide protection against DoS attacks; the firewall can limit the impact of a DDoS||Medium||until the end of the attack||Security monitoring systems, firewall (preventive). In case of a DDoS attack, all requests will be redirected to a static landing page.|
The level of all the identified risks is acceptable and the countermeasures already adopted are considered satisfactory.
- procedures for the several countermeasures to invoke in case of risk occurrence: https://gitlab.in2p3.fr/opsportal/sf3/-/wikis/cp
- the Availability targets do not change in case the plan is invoked.
- recovery requirements:
- Maximum tolerable period of disruption (MTPoD) (the maximum amount of time that a service can be unavailable or undelivered after an event that causes disruption to operations, before its stakeholders perceive unacceptable consequences): 2 days
- Recovery time objective (RTO) (the acceptable amount of time to restore the service in order to avoid unacceptable consequences associated with a break in continuity (this has to be less than MTPoD)): 1 day
- Recovery point objective (RPO) (the acceptable latency of data that will not be recovered): 2 days
- approach for the return to normal working conditions as reported in the risk assessment.
Availability and Continuity test
The service did not suffer particular issues requiring the execution of a new continuity test, nor were there significant changes. The test previously reported is still valid. Hereafter are the details as included in the previous A&C plan.
The proposed A/C test will focus on a recovery scenario: the service has been disrupted and needs to be reinstalled from scratch.
The test was performed for the Lavoisier component.
To simulate the disruption, we:
- stopped the Lavoisier service
- deleted the repository with the Lavoisier configuration
Then, for the recovery part, we:
- got the Lavoisier engine (wget a zip file and unzip it)
- got the local configuration (git clone)
- added the properties, i.e. passwords and sensitive information (stored in GitLab variables)
- added the certificate (local copy from laptop)
- restarted the service
- Duration : ~ 8 min
- Start : Thu Nov 8 14:10:26 CET 2018
- End : Thu Nov 8 14:18:49 CET 2018
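The duration recorded above can be verified directly from the two timestamps (both are CET, so the offset cancels out):

```python
from datetime import datetime

# Timestamps recorded during the recovery test of the Lavoisier component
fmt = "%a %b %d %H:%M:%S %Y"
start = datetime.strptime("Thu Nov 8 14:10:26 2018", fmt)
end = datetime.strptime("Thu Nov 8 14:18:49 2018", fmt)

elapsed = end - start
print(elapsed)  # 0:08:23, i.e. roughly 8 minutes

# Well within the 2-day Maximum tolerable period of disruption
assert elapsed.total_seconds() < 2 * 24 * 3600
```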
The recovery test can be considered successful since restoring the service after a disruption to the Lavoisier component took very little time, well within the Maximum tolerable period of disruption for this service.
|Alessandro Paolini||2018-03-27||first draft, discussing with the provider|
|Cyril l'Orphelin, Alessandro Paolini||2018-11-08||added information on the recovery test; plan finalised.|
|Alessandro Paolini||2019-11-19||starting the yearly review....|
|Alessandro Paolini||2020-02-18||review completed|
|Alessandro Paolini, Cyril l'Orphelin||2021-04-07||yearly review completed|