Latest revision as of 11:00, 6 January 2022
Introduction
This page reports on the Availability and Continuity Plan for EGI Notebooks, the result of the risk assessment conducted for this service: a series of risks and threats has been identified and analysed, along with the corresponding countermeasures currently in place. Whenever a countermeasure is not considered satisfactory for avoiding or reducing either the likelihood or the impact of a risk, a new treatment for improving the availability and continuity of the service is agreed with the service provider. The process is concluded with an availability and continuity test.
 | Last | Next
---|---|---
Risks assessment | 2021-12-23 | Jan 2023
Av/Co plan | 2022-01-06 | Jan 2023
Previous plans are collected here: https://documents.egi.eu/document/3651
Performance
The following performance targets, on a monthly basis, were agreed in the OLA:
- Availability: 95%
- Reliability: 95%
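As an illustration, the monthly targets translate into a maximum tolerable downtime that can be computed directly (a sketch; a 30-day month is assumed):

```shell
# Allowed monthly downtime for the agreed availability target
# (a sketch; a 30-day month is assumed).
TARGET=95                            # availability target, in percent
MINUTES_IN_MONTH=$((30 * 24 * 60))   # 43200 minutes
ALLOWED_DOWN=$(( MINUTES_IN_MONTH * (100 - TARGET) / 100 ))
echo "A ${TARGET}% monthly target allows up to ${ALLOWED_DOWN} minutes (~$((ALLOWED_DOWN / 60)) hours) of downtime"
```

For the agreed 95% target this is 2160 minutes, i.e. about a day and a half per month, consistent with the "up to 1 day" recovery times in the risk assessment below.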
Other availability requirements:
- The service is accessible through OpenID Connect (Check-in), and users can create personal tokens as needed for API access.
- The service is accessible via WebUI and API.
The service availability is regularly tested by Nagios probes:
- jupyterhub (eu.egi.Notebooks-Status)
- accounting (eu.egi.cloud.APEL-Pub)
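Outside Nagios, a quick manual check can mirror what the probes verify; a minimal sketch, assuming the standard JupyterHub API endpoint is exposed at the service host name:

```shell
# Minimal manual availability check (a sketch; the endpoint path is
# an assumption based on standard JupyterHub deployments).
check_endpoint() {
  # exits 0 when the URL answers with an HTTP success status
  curl -fsS --max-time 10 -o /dev/null "$1"
}

if check_endpoint "https://notebooks.egi.eu/hub/api"; then
  echo "jupyterhub: OK"
else
  echo "jupyterhub: DOWN"
fi
```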
The performance reports in terms of Availability and Reliability are produced by ARGO in near real time, and they are also periodically collected into the Documentation Database.
Risks assessment and management
For more details, please see the Google spreadsheet; a summary of the assessment is reported here.
Risks analysis
Risk id | Risk description | Affected components | Established measures | Risk level | Expected duration of downtime / time for recovery | Comment
---|---|---|---|---|---|---
1 | Service unavailable / loss of data due to hardware failure | All | Service configuration and deployment on Kubernetes, managed as code in git repositories; daily backup of user storage | Medium | up to 1 day | The measures already in place are considered satisfactory and the risk level is acceptable
2 | Service unavailable / loss of data due to software failure | All | Service configuration and deployment on Kubernetes, managed as code in git repositories; daily backup of user storage | Medium | up to 1 day | The measures already in place are considered satisfactory and the risk level is acceptable
3 | Service unavailable / loss of data due to human error | All | Use of the CHM (change management) process; service configuration and deployment on Kubernetes, managed as code in git repositories; daily backup of user storage | Low | up to 1 day | The measures already in place are considered satisfactory and the risk level is acceptable
4 | Service unavailable due to network failure (network outage with causes external to the site) | All | Service configuration and deployment on Kubernetes, managed as code in git repositories; daily backup of user storage | Low | up to 1 day | The measures already in place are considered satisfactory and the risk level is acceptable
5 | Unavailability of key technical and support staff (holiday periods, sickness, ...) | All | Documentation of deployments, use of git for storing and sharing deployment configurations, training of staff | Medium | 1 or more working days | Deployments are now documented and configured using git; 2 staff members are capable of operating the service, and this number can still be increased
6 | Major disruption in the data centre (e.g. fire, flood, or electric failure) | All | Service configuration and deployment on Kubernetes, managed as code in git repositories; daily backup of user storage | Low | up to 1 working day | The measures already in place are considered satisfactory and the risk level is acceptable
7 | Major security incident: the system is compromised by external attackers and needs to be reinstalled and restored | All | Following the security advisories of the products used; daily backup of user storage | Medium | up to 1 working day | The measures already in place are considered satisfactory and the risk level is acceptable
8 | (D)DoS attack: the service is unavailable because of a coordinated DDoS | All | Service configuration and deployment on Kubernetes, managed as code in git repositories; daily backup of user storage | Low | up to 1 working day | The measures already in place are considered satisfactory and the risk level is acceptable
Outcome
The rating of risk 5 (Unavailability of key technical and support staff) was decreased from 6 (High) to 4 (Medium): currently 2 staff members are capable of operating the system.
Additional information
- Documentation of the various countermeasures to invoke in case a risk occurs is available at https://docs.egi.eu/providers/notebooks/
- The Availability targets don't change in case the plan is invoked.
- Approach for the return to normal working conditions as reported in the risk assessment.
- The support unit Notebooks shall be used to report any incident or service request.
- The providers can contact EGI Operations via ticket or email in case the continuity plan is invoked, or to discuss any change to it.
Recovery requirements:
- Maximum tolerable period of disruption (MTPoD) (the maximum amount of time that a service can be unavailable or undelivered after an event that causes disruption to operations, before its stakeholders perceive unacceptable consequences): 2 days
- Recovery time objective (RTO) (the acceptable amount of time to restore the service in order to avoid unacceptable consequences associated with a break in continuity (this has to be less than MTPoD)): 1 day
- Recovery point objective (RPO) (the acceptable latency of data that will not be recovered): 1 day
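These three values must be mutually consistent (by definition the RTO has to be less than the MTPoD, and the daily backup bounds the RPO); a trivial sanity check with the figures above:

```shell
# Recovery requirement sanity check (values taken from the plan above).
MTPOD_DAYS=2   # maximum tolerable period of disruption
RTO_DAYS=1     # recovery time objective
RPO_DAYS=1     # recovery point objective (matches the daily backup)

if [ "$RTO_DAYS" -lt "$MTPOD_DAYS" ]; then
  echo "consistent: RTO (${RTO_DAYS}d) < MTPoD (${MTPOD_DAYS}d); RPO = ${RPO_DAYS}d"
fi
```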
Availability and Continuity test
During 2021 the service was moved to a different provider and reinstalled, repeating the recovery test defined in the previous years.
Test details
The proposed A/C test checks whether recovery from a disruption can be performed by reinstalling the whole service from scratch. The latest user data backup will be restored and the time spent will be measured. Performing this test is useful to spot any issue in the recovery procedures of the service.
Test steps:
- Deploy Kubernetes on a set of Virtual Machines, get one public IP for the ingress node and record that public IP
- Register a new domain name "recover-notebooks.test.fedcloud.eu" in https://nsupdate.egi.eu pointing to the public IP of the k8s ingress
- Register a new client in the dev instance of EGI Check-in
- Deploy the notebooks on the Kubernetes cluster, configuring the Check-in client credentials obtained in the previous step and using "recover-notebooks.test.fedcloud.eu" as the host name for the ingress
- Clone the EGI-Notebooks-backup repo
- Create the secrets as detailed in the repo
- Create the rbac roles and launch the recovery job
- Log in with a user of the original notebooks.egi.eu and check that the files are actually recovered
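The deployment and recovery steps can be condensed into a single shell function; a hedged sketch, not a definitive procedure: it assumes a working kubeconfig, helm, and the secrets described in the EGI-Notebooks-backup repo. The file name rbac.yaml and the variable $BACKUP_REPO_URL are placeholders, not names from the original test.

```shell
# Hedged sketch of the recovery steps (the function is only defined
# here, not executed; it assumes a working kubeconfig, helm, and the
# secrets described in the EGI-Notebooks-backup repo).
recover_notebooks() {
  local host="recover-notebooks.test.fedcloud.eu"   # registered in nsupdate.egi.eu

  # deploy JupyterHub with the Check-in client credentials in hub.yaml
  helm install -f hub.yaml --namespace catchall --version=0.8.2 \
       --name hub jupyterhub/jupyterhub

  # clone the backup tooling, create secrets/roles, launch the recovery job
  git clone "$BACKUP_REPO_URL"   # URL of the EGI-Notebooks-backup repo (placeholder)
  kubectl apply -f rbac.yaml     # rbac roles (file name assumed)
  kubectl apply -f job.yaml      # recovery job

  # watch the job; afterwards, log in as an existing user to verify the files
  kubectl get pods -l job-name=notebooks-backup-recover
}
```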
Test outcome
The test was successfully executed. The recovery of user storage took 10 minutes:
Started: Thu, 06 Jun 2019 16:37:15 +0200 Finished: Thu, 06 Jun 2019 16:47:18 +0200
and the files are available for the users as expected.
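The quoted duration can be re-derived from the pod's Started/Finished timestamps (a GNU date installation is assumed):

```shell
# Recompute the recovery duration from the pod timestamps
# (GNU date assumed for the -d option).
START="2019-06-06 16:37:15 +0200"
FINISH="2019-06-06 16:47:18 +0200"
SECS=$(( $(date -d "$FINISH" +%s) - $(date -d "$START" +%s) ))
echo "recovery took ${SECS}s (~$((SECS / 60)) minutes)"
```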
Detailed log of test
Installation of the Jupyterhub instance:
$ helm install -f hub.yaml --namespace catchall --version=0.8.2 --name hub jupyterhub/jupyterhub
NAME: hub
LAST DEPLOYED: Thu Jun  6 16:35:44 2019
NAMESPACE: catchall
STATUS: DEPLOYED
RESOURCES:
==> v1/ConfigMap
NAME        DATA  AGE
hub-config  1     1s
==> v1/Deployment
NAME   READY  UP-TO-DATE  AVAILABLE  AGE
hub    0/1    1           0          0s
proxy  0/1    1           0          0s
==> v1/PersistentVolumeClaim
NAME        STATUS  VOLUME                                    CAPACITY  ACCESS MODES  STORAGECLASS         AGE
hub-db-dir  Bound   pvc-61c44c5c-8868-11e9-a52c-fa163ed00f15  1Gi       RWO           managed-nfs-storage  1s
==> v1/Pod(related)
NAME                    READY  STATUS             RESTARTS  AGE
hub-597c78b9fb-87kn4    0/1    ContainerCreating  0         0s
proxy-8474bf55cb-2xhgs  0/1    ContainerCreating  0         0s
==> v1/Role
NAME  AGE
hub   1s
==> v1/RoleBinding
NAME  AGE
hub   0s
==> v1/Secret
NAME        TYPE    DATA  AGE
hub-secret  Opaque  3     1s
==> v1/Service
NAME          TYPE       CLUSTER-IP      EXTERNAL-IP  PORT(S)                     AGE
hub           ClusterIP  10.103.124.120  <none>       8081/TCP                    0s
proxy-api     ClusterIP  10.99.7.137     <none>       8001/TCP                    0s
proxy-public  NodePort   10.97.236.34    <none>       80:30590/TCP,443:30739/TCP  0s
==> v1/ServiceAccount
NAME  SECRETS  AGE
hub   1        1s
==> v1/StatefulSet
NAME              READY  AGE
user-placeholder  0/0    0s
==> v1beta1/Ingress
NAME        HOSTS                               ADDRESS  PORTS  AGE
jupyterhub  recover-notebooks.test.fedcloud.eu           80     0s
==> v1beta1/PodDisruptionBudget
NAME              MIN AVAILABLE  MAX UNAVAILABLE  ALLOWED DISRUPTIONS  AGE
hub               1              N/A              0                    1s
proxy             1              N/A              0                    1s
user-placeholder  0              N/A              0                    1s
user-scheduler    1              N/A              0                    1s
NOTES:
Thank you for installing JupyterHub!

Your release is named hub and installed into the namespace catchall.

You can find if the hub and proxy is ready by doing:

 kubectl --namespace=catchall get pod

and watching for both those pods to be in status 'Ready'.

You can find the public IP of the JupyterHub by doing:

 kubectl --namespace=catchall get svc proxy-public

It might take a few minutes for it to appear!

Note that this is still an alpha release! If you have questions, feel free to
 1. Read the guide at https://z2jh.jupyter.org
 2. Chat with us at https://gitter.im/jupyterhub/jupyterhub
 3. File issues at https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues
Perform restore of backup:
$ kubectl apply -f job.yaml
job.batch/notebooks-backup-recover created
$ kubectl get pod
NAME                            READY  STATUS   RESTARTS  AGE
notebooks-backup-recover-2fntt  1/1    Running  0         6s
$ kubectl logs notebooks-backup-recover-2fntt
+ mkdir /backup
+ restic restore latest --target /backup
created new cache in /root/.cache/restic
restoring <Snapshot 87683f08 of [/exports] at 2019-06-06 14:10:56.913319998 +0000 UTC by root@onebackup-d2n5s> to /backup
+ python /usr/local/bin/recover.py --backup-path /backup/exports/ --namespace catchall /backup/exports/pvc
INFO:root:PVC: 00d584d784d9463b9545e726a290a512
INFO:root:Destination path: /exports/catchall-00d584d784d9463b9545e726a290a512-pvc-b9c1162b-8868-11e9-a52c-fa163ed00f15
INFO:root:Will restore storage of user dfadf63fcb2723480357cb8ff9f0570cda7d2872ca24e65bfe21f0154f238ce2 at /exports/catchall-00d584d784d9463b9545e726a290a512-pvc-b9c1162b-8868-11e9-a52c-fa163ed00f15 from /backup/exports/catchall-00d584d784d9463b9545e726a290a512-pvc-f94a7ab5-4ae1-11e9-85ab-fa163e6125a0
[...]
INFO:root:PVC: fe1c6c56e09d4ea3a4fa0328a43fa925
INFO:root:Destination path: /exports/catchall-fe1c6c56e09d4ea3a4fa0328a43fa925-pvc-f40b68a0-8869-11e9-a52c-fa163ed00f15
INFO:root:Will restore storage of user 025166931789a0f57793a6092726c2ad89387a4cc167e7c63c5d85fc91021d18 at /exports/catchall-fe1c6c56e09d4ea3a4fa0328a43fa925-pvc-f40b68a0-8869-11e9-a52c-fa163ed00f15 from /backup/exports/catchall-fe1c6c56e09d4ea3a4fa0328a43fa925-pvc-b25fddf7-ee78-11e8-8d67-fa163e6125a0
INFO:root:Restored 75 users
Pod execution information:
$ kubectl describe pod notebooks-backup-recover-2fntt
Name:               notebooks-backup-recover-2fntt
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               k8s-worker/172.16.4.7
Start Time:         Thu, 06 Jun 2019 16:37:12 +0200
Labels:             controller-uid=924467c7-8868-11e9-a52c-fa163ed00f15
                    job-name=notebooks-backup-recover
Annotations:        <none>
Status:             Succeeded
IP:                 10.44.0.8
Controlled By:      Job/notebooks-backup-recover
Containers:
  recover:
    Container ID:  docker://2896f5aecc493423b536cb4a7646290cc273e412bd7c8384b3d39b7bbc1f188d
    Image:         eginotebooks/svc-backup:0.1.0-5b7cab0
    Image ID:      docker-pullable://eginotebooks/svc-backup@sha256:4ce168e68bddcdefd1cea163a7695476b3f174bb9b52230ccb52831fb9d76276
    Port:          <none>
    Host Port:     <none>
    Args:          /usr/local/bin/recover.sh
    State:         Terminated
      Reason:      Completed
      Exit Code:   0
      Started:     Thu, 06 Jun 2019 16:37:15 +0200
      Finished:    Thu, 06 Jun 2019 16:47:18 +0200
    Ready:          False
    Restart Count:  0
    Environment:
      NFS_PATH:              /exports
      NAMESPACE:             catchall
      RESTIC_REPOSITORY:     sftp:centos@134.158.151.166:/backups/notebooks
      RESTIC_PASSWORD_FILE:  /restic-secret/password
    Mounts:
      /exports from dest (rw)
      /restic-secret/ from restic-password (rw)
      /root/.ssh from restic-ssh (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from notebooks-backup-recover-token-cm7vq (ro)
Conditions:
  Type             Status
  Initialized      True
  Ready            False
  ContainersReady  False
  PodScheduled     True
Volumes:
  dest:
    Type:          HostPath (bare host directory volume)
    Path:          /exports
    HostPathType:  DirectoryOrCreate
  restic-ssh:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  restic-ssh
    Optional:    false
  restic-password:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  restic-password
    Optional:    false
  notebooks-backup-recover-token-cm7vq:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  notebooks-backup-recover-token-cm7vq
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  nfs-server=true
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age  From                 Message
  ----    ------     ---- ----                 -------
  Normal  Scheduled  15m  default-scheduler    Successfully assigned default/notebooks-backup-recover-2fntt to k8s-worker
  Normal  Pulling    15m  kubelet, k8s-worker  Pulling image "eginotebooks/svc-backup:0.1.0-5b7cab0"
  Normal  Pulled     15m  kubelet, k8s-worker  Successfully pulled image "eginotebooks/svc-backup:0.1.0-5b7cab0"
  Normal  Created    15m  kubelet, k8s-worker  Created container recover
  Normal  Started    15m  kubelet, k8s-worker  Started container recover
Revision History
Version | Authors | Date | Comments
---|---|---|---
v1 | Enol Fernández, Alessandro Paolini | 2019-06-06 | Initial version
 | | 2019-06-11 | plan finalised
 | Enol Fernández, Alessandro Paolini | 2020-11-30 | review completed, no need of a new continuity/recovery test; decreased the rating of risk 5; updated the sections "Performance" and "Additional information"
 | Alessandro Paolini | 2021-12-15, 2022-01-06 | yearly review: updated the risk assessment and the performance section; no need to perform a new recovery/continuity test