Difference between revisions of "Operations Procedures"
(45 intermediate revisions by 6 users not shown) | |||
Line 1: | Line 1: | ||
{{Template:Op menubar}} | {{Template:Op menubar}} {{Template:Doc_menubar}} {{TOC_right}} | ||
{{Template:Doc_menubar}} | |||
{{TOC_right}} | |||
= Operations = | = Operations = | ||
Line 14: | Line 12: | ||
| '''Area''' | | '''Area''' | ||
| '''Relevant to''' | | '''Relevant to''' | ||
| ''' | | '''Status''' | ||
|- | |- | ||
| [[PROC01|PROC 01]] | | [[PROC01|PROC 01]] | ||
| [[PROC01| | | [[PROC01|EGI Infrastructure Oversight Escalation]] | ||
| Operations ticket escalation | | Operations ticket escalation | ||
| Ticket Management | | Ticket Management | ||
| Resource Centre Administrators, Operations Centres, | | Resource Centre Administrators, Operations Centres, Operations | ||
| | | Approved | ||
|- | |- | ||
| [[PROC02|PROC 02]] | | [[PROC02|PROC 02]] | ||
Line 27: | Line 25: | ||
| Step-by-step instructions on how to create a new Operations Centre | | Step-by-step instructions on how to create a new Operations Centre | ||
| Operations Centre Management | | Operations Centre Management | ||
| Operations Centres, | | Operations Centres, Operations | ||
| | | Approved | ||
|- | |- | ||
| [[PROC03|PROC 03]] | | [[PROC03|PROC 03]] | ||
Line 34: | Line 32: | ||
| Step-by-step instructions on how to decommission an Operations Centre | | Step-by-step instructions on how to decommission an Operations Centre | ||
| Operations Centre Management | | Operations Centre Management | ||
| Operations Centres, | | Operations Centres, Operations | ||
| | | Approved | ||
|- | |- | ||
| [[PROC04|PROC 04]] | | [[PROC04|PROC 04]] | ||
Line 41: | Line 39: | ||
| Instructions for RODs and Operations Centres on how to handle justification for poor monthly performance | | Instructions for RODs and Operations Centres on how to handle justification for poor monthly performance | ||
| Availability and Monitoring | | Availability and Monitoring | ||
| Resource Centre Administrators, Operations Centres, | | Resource Centre Administrators, Operations Centres, Operations | ||
| | | Approved | ||
|- | |- | ||
| [[PROC05|PROC 05]] | | [[PROC05|PROC 05]] | ||
Line 48: | Line 46: | ||
| This procedure is part of the [[PROC02|Operations Centre creation]] procedure. | | This procedure is part of the [[PROC02|Operations Centre creation]] procedure. | ||
| Availability and Monitoring | | Availability and Monitoring | ||
| Operations Centres, | | Operations Centres, Operations | ||
| | | DEPRECATED | ||
|- | |- | ||
| [[PROC06|PROC 06]] | | [[PROC06|PROC 06]] | ||
Line 55: | Line 53: | ||
| A Nagios probe is set to OPERATIONS when its results are used to generate notifications for the Operations Dashboard. This procedure details the steps to set a Nagios test to OPERATIONS. | | A Nagios probe is set to OPERATIONS when its results are used to generate notifications for the Operations Dashboard. This procedure details the steps to set a Nagios test to OPERATIONS. | ||
| Availability and Monitoring | | Availability and Monitoring | ||
| Operations Centres, | | Operations Centres, Operations | ||
| | | Approved | ||
|- | |- | ||
| [[PROC07|PROC 07]] <!-- Procedure number --> | | [[PROC07|PROC 07]] <!-- Procedure number --> | ||
| [[PROC07|Adding new probes to | | [[PROC07|Adding new probes to ARGO]] <!-- Title --> | ||
| Addition of new OPS Nagios probes to | | Addition of new OPS Nagios probes to ARGO. <!-- Comment --> | ||
| Availability and Monitoring <!-- Area --> | | Availability and Monitoring <!-- Area --> | ||
| Resource Centre Administrators, Operations Centres, | | Resource Centre Administrators, Operations Centres, Operations <!-- Relevant to --> | ||
| | | Approved | ||
|- | |- | ||
| [[PROC08|PROC 08]] <!-- Procedure number --> | | [[PROC08|PROC 08]] <!-- Procedure number --> | ||
Line 69: | Line 67: | ||
| Request for a change to the EGI OPS Availability and Reliability profile. A change in the profile is needed every time a Nagios test is added to or removed from the profile, so that its results are included in or excluded from the monthly Availability and Reliability statistics. <!-- Comment --> | | Request for a change to the EGI OPS Availability and Reliability profile. A change in the profile is needed every time a Nagios test is added to or removed from the profile, so that its results are included in or excluded from the monthly Availability and Reliability statistics. <!-- Comment --> | ||
| Availability and Monitoring <!-- Area --> | | Availability and Monitoring <!-- Area --> | ||
| Resource Centre Administrators, Operations Centres, | | Resource Centre Administrators, Operations Centres, Operations <!-- Relevant to --> | ||
| | | Approved | ||
|- | |- | ||
| [[PROC09|PROC 09]] <!-- Procedure number --> | | [[PROC09|PROC 09]] <!-- Procedure number --> | ||
| [[PROC09|Resource Centre Registration and Certification]] <!-- Title --> | | [[PROC09|Resource Centre Registration and Certification]] <!-- Title --> | ||
| Registration of a new Resource Centre | | Registration of a new Resource Centre | ||
| Resource Centre Management | | Resource Centre Management | ||
| Resource Centre Administrator, Operations Centres | | Resource Centre Administrator, Operations Centres | ||
| | | Approved | ||
|- | |- | ||
| [[PROC10|PROC 10]] <!-- Procedure number --> | | [[PROC10|PROC 10]] <!-- Procedure number --> | ||
Line 84: | Line 82: | ||
| Availability and Monitoring <!-- Area --> | | Availability and Monitoring <!-- Area --> | ||
| Resource Centre Administrators, Operations Centres<!-- Relevant to --> | | Resource Centre Administrators, Operations Centres<!-- Relevant to --> | ||
| | | Approved | ||
|- | |- | ||
| [[PROC11|PROC 11]] | | [[PROC11|PROC 11]] | ||
Line 91: | Line 89: | ||
| Resource Centre Management | | Resource Centre Management | ||
| Resource Centre Administrator, Operations Centres | | Resource Centre Administrator, Operations Centres | ||
| | | Approved | ||
|- | |- | ||
| [[PROC12|PROC 12]] | | [[PROC12|PROC 12]] | ||
Line 98: | Line 96: | ||
| Resource Centre Management | | Resource Centre Management | ||
| Resource Centre Administrator, Operations Centres | | Resource Centre Administrator, Operations Centres | ||
| | | Approved | ||
|- | |- | ||
| [[PROC13|PROC 13]] | | [[PROC13|PROC 13]] | ||
Line 105: | Line 103: | ||
| VO Management | | VO Management | ||
| VO Managers, Operations Manager | | VO Managers, Operations Manager | ||
| | | Approved | ||
|- | |- | ||
| [[PROC14|PROC 14]] | | [[PROC14|PROC 14]] | ||
Line 112: | Line 110: | ||
| VO Management | | VO Management | ||
| VO Managers, Operations Manager | | VO Managers, Operations Manager | ||
| | | Approved | ||
|- | |- | ||
| [[PROC15|PROC 15]] | | [[PROC15|PROC 15]] | ||
Line 119: | Line 117: | ||
| Resource Centre Management | | Resource Centre Management | ||
| Resource Centre Administrator, Operations Centres | | Resource Centre Administrator, Operations Centres | ||
| | | Approved | ||
|- | |- | ||
| [[PROC16|PROC 16]] | | [[PROC16|PROC 16]] | ||
Line 126: | Line 124: | ||
| Resource Centre Management | | Resource Centre Management | ||
| Resource Centre Administrator, Operations Centres | | Resource Centre Administrator, Operations Centres | ||
| | | Approved | ||
|- | |- | ||
| [[PROC17|PROC 17]] | | [[PROC17|PROC 17]] | ||
Line 133: | Line 131: | ||
| Resource Centre Management | | Resource Centre Management | ||
| Resource Centre Administrator, Operations Centres | | Resource Centre Administrator, Operations Centres | ||
| 18. | | Deprecated | ||
|- | |||
| [[PROC18|PROC 18]] | |||
| [[PROC18|Temporary Cloud Resource Centre Registration and Certification]] | |||
| A temporary procedure for registration of a new Cloud Resource Centre. Also applies to certified Resource Centers which introduce cloud resources for the first time. | |||
| Resource Centre Management | |||
| Resource Centre Administrator, Operations Centres | |||
| Deprecated | |||
|- | |||
| [[PROC19|PROC 19]] | |||
| [[PROC19|Introducing new cloud stack and grid middleware in EGI Production Infrastructure]] | |||
| A procedure for the steps to introduce new stack (Cloud platform) or middleware (Grid Platform) in EGI Production Infrastructure. | |||
| Resource Centre Management | |||
| Resource Centre Administrator, Operations Centres | |||
| Draft | |||
|- | |||
| [[PROC20|PROC 20]]<br> | |||
| [[PROC20|Support for CVMFS replication across the EGI and OSG CVMFS services]] | |||
| The document describes the process of enabling the replication of CVMFS spaces across OSG and EGI CVMFS infrastructures | |||
| Resource Centre Management | |||
| Resource Centre Administrator, VO managers | |||
| Approved | |||
|- | |||
| [[PROC21|PROC 21]] | |||
| [[PROC21|Resource Centre suspension]]<br> | |||
| The document describes the process of Resource Centre suspension in EGI infrastructure | |||
| Resource Centre Management | |||
| Resource Centre Administrator, VO managers | |||
| Approved | |||
|- | |||
| [[PROC22|PROC 22]] | |||
| [[PROC22|Support for CVMFS replication across the EGI Infrastructure]]<br> | |||
| The procedure describes the process of creating a repository within the EGI CVMFS infrastructure for an EGI VO. | |||
| Resource Centre Management | |||
| Resource Centre Administrator, VO managers | |||
| DRAFT | |||
|- | |||
| [[PROC23|PROC23]] | |||
| Production tools release and deployment process | |||
| The procedure describes the process of release and deployment in EGI production infrastructure for Production tools | |||
| Production tools | |||
| | |||
| APPROVED | |||
|- | |||
| [[PROC24|PROC24]] | |||
| Major incident handling | |||
| The procedure describes the process of handling major incidents<br> | |||
| Production tools | |||
| | |||
| DRAFT | |||
|- | |||
| [[PROC25|PROC25]] | |||
| UMD and CMD software release procedure | |||
| The procedure describes the process of adding a new product release to the software provisioning process and releasing it in UMD/CMD. <br> | |||
| Middleware software for HTC/Cloud deployed in multiple resource centres in EGI. | |||
| | |||
| DRAFT | |||
|- | |||
| PROC26 ([https://ims.egi.eu/display/EGIPP/ISRM5+Verify+Helpdesk+SUs+are+working+and+perform+a+periodic+review+of+them ISRM5]) | |||
| Verify helpdesk Support Units are working and perform a periodic review of them | |||
| The document describes the process for verifying that the support teams are still able to follow up on the GGUS tickets assigned to their Support Unit. It also defines criteria for decommissioning SUs that are no longer active. | |||
| Ticket Management | |||
| | |||
| Approved | |||
|} | |} | ||
[[Procedure template|Structure template for new procedures]] | [[Procedure template|Structure template for new procedures]] | ||
= Security = | = Security = | ||
{|border="1" class="wikitable sortable" | |||
{| border="1" class="wikitable sortable" | |||
|- style="background-color:lightgray;" | |- style="background-color:lightgray;" | ||
| '''Number''' | | '''Number''' | ||
| '''Title''' | | '''Title''' | ||
| '''Comment''' | | '''Comment''' | ||
| '''Status''' | | '''Status''' | ||
| '''Area''' | | '''Area''' | ||
| '''Relevant to''' | | '''Relevant to''' | ||
|- | |- | ||
| SEC 01 | | [[SEC01|SEC 01]] | ||
| [ | | [[SEC01|EGI Security Incident Handling]] | ||
| The "Security Incident Handling Procedure" defines site and incident coordinator responsibilities when handling Grid-related security incidents. All EGI sites are required to follow this procedure to report and handle Grid-related security incidents. | | The "Security Incident Handling Procedure" defines site and incident coordinator responsibilities when handling Grid-related security incidents. All EGI sites are required to follow this procedure to report and handle Grid-related security incidents. | ||
| | | Approved March 2016<br> | ||
| Security | | Security | ||
| Resource Centres, EGI CSIRT | | Resource Centres, EGI CSIRT | ||
|- | |- | ||
| SEC 02 <!-- number --> | | [[SEC02|SEC 02]] <!-- number --> | ||
| [https://documents.egi.eu/ | | [https://documents.egi.eu/secure/ShowDocument?docid=3145 EGI Vulnerability issue handling process] <!-- title and wiki link --> | ||
| | | This procedure is used to handle vulnerabilities in Software relevant to the EGI infrastructure. <!-- comment--> | ||
| ''approved'', | | ''approved'', Nov 2017 <!-- status, date of approval --> | ||
| Security <!-- area --> | | Security <!-- area --> | ||
| Resource Centres, Risk Assessment Team, Technology Providers, SVG <!-- Relevant to --> | | Resource Centres, Risk Assessment Team, Technology Providers, SVG <!-- Relevant to --> | ||
|- | |- | ||
| [[SEC03|SEC 03]] <!-- number --> | | [[SEC03|SEC 03]] <!-- number --> | ||
| [ | | [[SEC03|EGI-CSIRT Critical Vulnerability Handling]] | ||
| After a problem has been assessed as critical, and a solution is available, then sites are required to take action. This document primarily defines the procedure from this time, where sites are asked to take action, and what steps are taken if they do not respond or do not take action. If a site fails to take action, this may lead to site suspension. | | After a problem has been assessed as critical, and a solution is available, then sites are required to take action. This document primarily defines the procedure from this time, where sites are asked to take action, and what steps are taken if they do not respond or do not take action. If a site fails to take action, this may lead to site suspension. | ||
| ''approved'', | | | ||
| Security | ''approved'', | ||
| Resource Centres, Operations Centres, EGI-CSIRT, SVG | |||
8. Sept. 2015 | |||
| Security | |||
| Resource Centres, Operations Centres, EGI-CSIRT, SVG | |||
|- | |- | ||
| [[SEC04|SEC 04]] | |||
| [https://documents.egi.eu/document/1018 Compromised Certificates and Central Security Emergency suspension] | |||
| This procedure describes what should be done in the event of a compromised identity certificate, including long lived certificates and proxies. This applies to robot certificates and service certificates as well as user certificates. Certificates are considered to be compromised if they are exposed outside intended policy, or linked to security incidents or malicious jobs. This procedure also addresses usage of Central Security Emergency suspension. The implications of a CA compromise are also briefly described. | |||
| ''approved'', September 27 2013 <!-- status, date of approval --> | |||
| Security | |||
| EGI CSIRT | |||
|- | |||
| [[SEC05|SEC 05]] | |||
| [[SEC05|Security Resource Centre Certification Procedure]] | |||
| Security Resource Centre Certification Procedure applies to Resource Centres under certification process and re-certification of suspended Resource Centres (sites). This step of the security certification procedure checks that the resources under certification do not contain known CRITICAL software vulnerabilities. | |||
| ''approved'', November 27 2014 | |||
| Security | |||
| Resource Centres, Operations Centres, EGI-CSIRT | |||
|} | |} | ||
[[ | [[EGI CSIRT:Policies#EGI_Operational_Security_Procedures|More information]] | ||
= EGI Policies | = Security Policies = | ||
See EGI Security [[SPG:Documents|Policies]] | |||
[[Category:Operations_Procedures|*]] |
Revision as of 11:57, 4 October 2021
Operations
EGI Operational Procedures are prescriptive documents that describe step-by-step processes involving several partners. The purpose of a procedure is to define the related workflow. Procedures are approved by the OMB and are periodically reviewed.
Number | Title | Comment | Area | Relevant to | Status |
PROC 01 | EGI Infrastructure Oversight Escalation | Operations ticket escalation | Ticket Management | Resource Centre Administrators, Operations Centres, Operations | Approved |
PROC 02 | Operations Centre Creation | Step-by-step instructions on how to create a new Operations Centre | Operations Centre Management | Operations Centres, Operations | Approved |
PROC 03 | Operations Centre decommissioning | Step-by-step instructions on how to decommission an Operations Centre | Operations Centre Management | Operations Centres, Operations | Approved |
PROC 04 | Quality verification of monthly availability and reliability statistics | Instructions for RODs and Operations Centres on how to handle justification for poor monthly performance | Availability and Monitoring | Resource Centre Administrators, Operations Centres, Operations | Approved |
PROC 05 | Validation of Operations Centre Nagios | This procedure is part of the Operations Centre creation procedure. | Availability and Monitoring | Operations Centres, Operations | DEPRECATED |
PROC 06 | Setting a Nagios test status to OPERATIONS | A Nagios probe is set to OPERATIONS when its results are used to generate notifications for the Operations Dashboard. This procedure details the steps to set a Nagios test to OPERATIONS. | Availability and Monitoring | Operations Centres, Operations | Approved |
PROC 07 | Adding new probes to ARGO | Addition of new OPS Nagios probes to ARGO. | Availability and Monitoring | Resource Centre Administrators, Operations Centres, Operations | Approved |
PROC 08 | Management of the EGI OPS Availability and Reliability Profile | Request for a change to the EGI OPS Availability and Reliability profile. A change in the profile is needed every time a Nagios test is added to or removed from the profile, so that its results are included in or excluded from the monthly Availability and Reliability statistics. | Availability and Monitoring | Resource Centre Administrators, Operations Centres, Operations | Approved |
PROC 09 | Resource Centre Registration and Certification | Registration of a new Resource Centre | Resource Centre Management | Resource Centre Administrator, Operations Centres | Approved |
PROC 10 | Recomputation of monitoring results and availability statistics | Notification of problems with the monitoring results gathered by SAM and to request a recomputation of results and the related availability and reliability statistics | Availability and Monitoring | Resource Centre Administrators, Operations Centres | Approved |
PROC 11 | Resource Centre Decommissioning | Decommissioning of a Resource Centre before it is turned into CLOSED in GOCDB | Resource Centre Management | Resource Centre Administrator, Operations Centres | Approved |
PROC 12 | Production Service Decommissioning | Decommissioning of an EGI production service | Resource Centre Management | Resource Centre Administrator, Operations Centres | Approved |
PROC 13 | VO Deregistration | Decommissioning of a Virtual Organization supported by the European Grid Infrastructure | VO Management | VO Managers, Operations Manager | Approved |
PROC 14 | VO Registration | Registration of a Virtual Organization to the European Grid Infrastructure | VO Management | VO Managers, Operations Manager | Approved |
PROC 15 | Resource Center renaming | A procedure for changing name of a Resource Center. | Resource Centre Management | Resource Centre Administrator, Operations Centres | Approved |
PROC 16 | Decommissioning of unsupported software | A procedure for removal of unsupported software from production infrastructure. | Resource Centre Management | Resource Centre Administrator, Operations Centres | Approved |
PROC 17 | Decommissioning of service type | A procedure for removal of service type from production infrastructure. | Resource Centre Management | Resource Centre Administrator, Operations Centres | Deprecated |
PROC 18 | Temporary Cloud Resource Centre Registration and Certification | A temporary procedure for registration of a new Cloud Resource Centre. Also applies to certified Resource Centers which introduce cloud resources for the first time. | Resource Centre Management | Resource Centre Administrator, Operations Centres | Deprecated |
PROC 19 | Introducing new cloud stack and grid middleware in EGI Production Infrastructure | A procedure for the steps to introduce new stack (Cloud platform) or middleware (Grid Platform) in EGI Production Infrastructure. | Resource Centre Management | Resource Centre Administrator, Operations Centres | Draft |
PROC 20 | Support for CVMFS replication across the EGI and OSG CVMFS services | The document describes the process of enabling the replication of CVMFS spaces across OSG and EGI CVMFS infrastructures | Resource Centre Management | Resource Centre Administrator, VO managers | Approved |
PROC 21 | Resource Centre suspension | The document describes the process of Resource Centre suspension in EGI infrastructure | Resource Centre Management | Resource Centre Administrator, VO managers | Approved |
PROC 22 | Support for CVMFS replication across the EGI Infrastructure | The procedure describes the process of creating a repository within the EGI CVMFS infrastructure for an EGI VO. | Resource Centre Management | Resource Centre Administrator, VO managers | DRAFT |
PROC23 | Production tools release and deployment process | The procedure describes the process of release and deployment in EGI production infrastructure for Production tools | Production tools | | APPROVED |
PROC24 | Major incident handling | The procedure describes the process of handling major incidents | Production tools | | DRAFT |
PROC25 | UMD and CMD software release procedure | The procedure describes the process of adding a new product release to the software provisioning process and releasing it in UMD/CMD. | Middleware software for HTC/Cloud deployed in multiple resource centres in EGI. | | DRAFT |
PROC26 (ISRM5) | Verify helpdesk Support Units are working and perform a periodic review of them | The document describes the process for verifying that the support teams are still able to follow up on the GGUS tickets assigned to their Support Unit. It also defines criteria for decommissioning SUs that are no longer active. | Ticket Management | | Approved |
Structure template for new procedures
Security
Number | Title | Comment | Status | Area | Relevant to |
SEC 01 | EGI Security Incident Handling | The "Security Incident Handling Procedure" defines site and incident coordinator responsibilities when handling Grid-related security incidents. All EGI sites are required to follow this procedure to report and handle Grid-related security incidents. | Approved March 2016 | Security | Resource Centres, EGI CSIRT |
SEC 02 | EGI Vulnerability issue handling process | This procedure is used to handle vulnerabilities in Software relevant to the EGI infrastructure. | approved, Nov 2017 | Security | Resource Centres, Risk Assessment Team, Technology Providers, SVG |
SEC 03 | EGI-CSIRT Critical Vulnerability Handling | After a problem has been assessed as critical, and a solution is available, then sites are required to take action. This document primarily defines the procedure from this time, where sites are asked to take action, and what steps are taken if they do not respond or do not take action. If a site fails to take action, this may lead to site suspension. | approved, 8. Sept. 2015 | Security | Resource Centres, Operations Centres, EGI-CSIRT, SVG |
SEC 04 | Compromised Certificates and Central Security Emergency suspension | This procedure describes what should be done in the event of a compromised identity certificate, including long lived certificates and proxies. This applies to robot certificates and service certificates as well as user certificates. Certificates are considered to be compromised if they are exposed outside intended policy, or linked to security incidents or malicious jobs. This procedure also addresses usage of Central Security Emergency suspension. The implications of a CA compromise are also briefly described. | approved, September 27 2013 | Security | EGI CSIRT |
SEC 05 | Security Resource Centre Certification Procedure | Security Resource Centre Certification Procedure applies to Resource Centres under certification process and re-certification of suspended Resource Centres (sites). This step of the security certification procedure checks that the resources under certification do not contain known CRITICAL software vulnerabilities. | approved, November 27 2014 | Security | Resource Centres, Operations Centres, EGI-CSIRT |
Security Policies
See EGI Security Policies