The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can contact the EGI SDIS team at operations @ egi.eu.

Difference between revisions of "Operations and Operations Support"

From EGIWiki
 
(122 intermediate revisions by 8 users not shown)
Line 1: Line 1:
{{Template:Op menubar}}
{{Template:Op menubar}} {{Template:GO menubar}} {{TOC_right}}  
{{TOC_right}}


[[Category:COD]]
<br>
= Introduction =
The purpose of this page is to collect all materials needed by the COD team to perform the Grid operations oversight activities.


= Introduction  =


= People and contact =
'''New version on https://wiki.egi.eu/wiki/EGI_Operations_Team'''


The COD team is formed from the Dutch and Polish teams and includes COD managers (people responsible for managerial issues) and COD shifters (people performing day-to-day COD work).
This page collects internal materials needed by the EGI.eu Operations and EGI Operations Support teams to perform the EGI Infrastructure operations oversight activities.


;'''COD managers:''' : Ron Trompert (Chair), Marcin Radecki, Luuk Uljee, Malgorzata Krakowian
'''NOTE''': on April 30th 2016 the EGI Operations Support activity stopped; all its tasks passed to Operations.
;'''COD shifters:''' : Malgorzata Krakowian, Ron Trompert, Luuk Uljee, Maarten van Ingen, Ernst Pijper, Alexander Verkooijen


= Contact  =


[[Grid_operations_oversight/Photo | People behind the names]]
EGI.eu Operations:


*GGUS Support Unit:Operations
*operations @ egi.eu


There are 2 mailing lists used for different cases:
= Actions  =
** '''manager-central-operator-on-duty''' AT mailman.egi.eu - for COD managerial issues like suggesting changes in procedures, tools. '''COD managers''' are recipients of this list.
** '''central-operator-on-duty''' AT mailman.egi.eu - for reporting COD day-to-day issues like problems with tools or Nagios tests. '''COD shifters''' are recipients of this list.


= COD Duties =
* COD managers
** '''representing RODs/COD in OTAG, OMB and Operations meetings''' - collecting requirements and improvement proposals from RODs concerning operations tools and procedures
** '''suspending Resource Centres''' in case of operational issues
** '''taking part in OLA task force'''
** '''writing new procedures''' - when needed, COD takes part in the procedure creation process
** '''preparing ROD newsletters''' - informing RODs about recent and upcoming developments related to Grid Oversight
** '''preparing ROD metrics reports''' - providing an overview of the operations support process in the grid infrastructure.
* COD shifters
** '''escalation of operational problems with RODs'''
** '''dealing with GGUS tickets assigned to COD'''
** '''process coordination''' of:
*** creation and decommission of Operations Centre
*** setting a Nagios test to an operations test
*** getting explanations for low availability and reliability metrics
= COD shifters work instructions =
This section collects all work instructions, which specify exactly which steps to follow to carry out an activity.


{| border="1" cellspacing="0" cellpadding="5" align="center"
{| class="wikitable"
! Action
|-
! Description
! Action  
! Related procedures
! Responsible<br>
|-
! Procedure
| '''GGUS tickets assigned to COD'''
! Instructions and related pages<br>
|
|-
The COD shifter is obliged to check the current status of all '''GGUS tickets assigned to COD'''
| '''ROD certification'''  
* see [http://tinyurl.com/2ws735h Link to all GGUS tickets assigned to COD]
| OS<br>
* If the ticket is waiting for COD action, the shifter should perform the action
|
*[https://wiki.egi.eu/wiki/PROC02 Operations Centre Creation]


|
*[[WI01 ROD certification ticket handling|WI01 - New ROD team certification work instructions]]


In case of a request for:
|-
* '''ROD certification'''
| '''Creation of a new NGI'''  
**  see [[Grid_operations_oversight/WI01| New ROD team certification work instructions]]
| OS<br>
* '''Creation of a new NGI'''  
|
** see [[PROC02 | Creation of a new Operations Centre process coordination]]
*[https://wiki.egi.eu/wiki/PROC02 Operations Centre Creation]
** In cases where COD is also the Integration Process Coordinator, COD is responsible for the whole procedure.
* '''Operations Centre decommission'''
** see [[PROC03|Operations Centre decommission process coordination]]
** COD validates the request and removes ROD information from the all-operators mailing list
* '''Setting a Nagios test to an operations test'''
** see [[PROC06| Procedure for setting a Nagios test to an operations test]]
** COD is responsible for coordinating the whole process.


If the shifter does not know what kind of action should be taken, he/she should contact the COD managers.
|  
|  
* [[PROC02|  Creation of a new Operations Centre process coordination]]
*[[WI02 Operations centre creation|WI02 - New Operations Centre creation work instruction]]
* [[PROC03|Operations Centre decommission process coordination]]
 
* [[PROC06| Procedure for setting a Nagios test to an operations test]]
|-
|-
| '''Availability/reliability reports'''
| '''Monthly operations broadcast'''
| OS
|  
|  
* Handling availability/reliability reports: [[Availability_and_reliability_work_instruction_for_COD | Availability and reliability work instruction]]
** [[Underperforming_sites_and_suspensions |  AR reports metrics]]
|  
|  
* [[Operations:COD_Escalation_Procedure|COD escalation procedure]]
*[[WI04_Monthly_broadcast| WI04 - Monthly Operations broadcast]]
* [[Availability_and_reliability_monthly_statistics | Availability and reliability monthly statistics procedure]]
 
|-
|-
| '''Operational portal dashboard issues'''
| '''Operations Centre decommission'''  
| O<br>
|  
|  
*[https://operations-portal.egi.eu/dashboard/ccodView COD dashboard link]
*[https://wiki.egi.eu/wiki/PROC03 Operations Centre decommissioning]
|
 
* [[PROC01|COD escalation procedure]]
| <br>
|-
| '''Setting a Nagios test to an operations test'''
| O<br>
|  
*[https://wiki.egi.eu/wiki/PROC06 Setting a Nagios test status to OPERATIONS]
 
| <br>
|-
|-
| '''Handover'''
| '''Operational portal dashboard issues'''  
| O<br>
|
*[https://wiki.egi.eu/wiki/PROC01 EGI Infrastructure Oversight Escalation]
 
|  
|  
[https://operations-portal.egi.eu/dashboard/ccodView COD dashboard link]
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
* At the end of the shift, a handover should be submitted (sent to COD) via the Handover tool in the Operational Portal
 
** Problems on the dashboard that will pass to the next week: the GGUS ID of the ticket and when the next escalation step should be taken
** GGUS tickets assigned to COD: for each ticket its last status and the action taken by the shifter should be provided
** Other issues: problems with tools etc.
|
|-
|-
|}
| '''Availability/reliability followup procedure'''
| O<br>
|
*[https://wiki.egi.eu/wiki/PROC04 Quality verification of monthly availability and reliability statistics]<br>
 
|
*[https://wiki.egi.eu/wiki/PROC10 Recomputation of monitoring results and availability statistics]
*[[WI03 RC and RP OLA violation report followup|WI03 RC and RP OLA violation report followup]]
*[[Underperforming sites and suspensions|Underperforming sites and suspensions]]
 
|-
| '''Unknown followup procedure'''
| O<br>
|
*[https://wiki.egi.eu/wiki/PROC04 Quality verification of monthly availability and reliability statistics]
 
|
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
*[[Unknown issue|UNKNOWN issue]]
*[[WI03 RC and RP OLA violation report followup|WI03 RC and RP OLA violation report followup]]
 
|-
| '''Top-level BDII followup procedure'''
| O<br>
|
*[https://wiki.egi.eu/wiki/PROC04 Quality verification of monthly availability and reliability statistics]
 
|
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
*[[WI03 RC and RP OLA violation report followup|WI03 RC and RP OLA violation report followup]]


|-
| '''ROD performance index followup procedure'''
| O<br>
| <br>
|
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
*[[WI03 RC and RP OLA violation report followup|WI03 RC and RP OLA violation report followup]]
*[[ROD performance index|ROD performance index]]


''NOTE: all procedures should contain the following template: https://wiki.egi.eu/wiki/PDT:Procedure_Template''
|}


= Events =
== ''Work Instructions''<br>  ==


* [[Grid_operations_oversight/CODOD|Phone conference Meetings, Agenda and Actions]]
*[[WI01 ROD certification ticket handling|WI01 - New ROD team certification work instructions]]
*[[WI02 Operations centre creation|WI02 - New Operations Centre creation work instruction]]
*[[WI03 RC and RP OLA violation report followup|WI03 - RC and RP OLA violation report followup]]
*[[WI04 Monthly broadcast|WI04 - Monthly Operations broadcast]]
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
*[[WI06_Core_services_process| Core services process]]


= Resources =
== Pages listing NGIs<br>  ==


* [https://documents.egi.eu/secure/ShowDocument?docid=298 Document server: ROD newsletter]
For EGI Operations: to be updated upon OC creation or decommissioning
* [https://documents.egi.eu/secure/ShowDocument?docid=155 Document server: Operations Support Metrics]


== ROD and COD Performance ==
*[https://wiki.egi.eu/wiki/GOCDB_grouping_action https://wiki.egi.eu/wiki/GOCDB_grouping_action ]<br>
* [[Grid_operations_oversight/OperationsSupportMetrics | Operations Support Metrics]]
*[https://wiki.egi.eu/wiki/Operations_centres https://wiki.egi.eu/wiki/Operations_centres] <br>
* [[Grid_operations_oversight/OperationsSupportMetrics_summary | Operations Support Metrics - reports summary]]
*https://wiki.egi.eu/wiki/Top-BDII_list_for_NGI <br>
*https://goc.egi.eu/portal/index.php?Page_Type=Service_Group&id=1205<br>
*https://goc.egi.eu/portal/index.php?Page_Type=Service_Group&id=1206
*https://goc.egi.eu/portal/index.php?Page_Type=Service_Group&id=1184
*https://docs.google.com/a/egi.eu/spreadsheets/d/1Zsk3ykVllc5GzNG2Hhref7wzTvz_rSKcckV8nnWWZIs/edit#gid=163292516
*folder "08 - sites-history Q"


== Nagios tests ==
<br>
* [[Operations:Operations_tests| Operations tests list ]]: list of Nagios probes generating alarms for visualization in the Operations Dashboard
* [[Availability_and_reliability_tests| Availability and reliability tests list ]]: list of Nagios probes whose results are used for Availability and Reliability computation


== OTAG topics ==
<br>


=== Operational Portal: Dashboard ===
<br>
* [http://bit.ly/dZ3RWN  RT tickets]
* [[Grid_operations_oversight/COD_interaction_with_Dashboard_team| COD interactions with Dashboard team (draft)]]
* [[Grid_operations_oversight/COD_OTAG_topics| COD topics to be discussed on OTAG meeting]]


=== GOC DB ===
= Resources  =
* [[Grid_operations_oversight/COD_GOCDB_requirements|Collection of GOC DB requirements regarding COD work (draft)]]


== Pages in draft state ==
*[[Operations Procedures|Operations Procedures]]
*[http://www.youtube.com/user/EGIGridOversight Youtube channel]


* [[Grid_operations_oversight/COD_Improvements_to_availability_procedure|Improvements to Availability Calculation Procedure (draft)]]
<!--
== ROD and COD Performance  ==


* [[Grid_operations_oversight/A/R_fixing_procedure| A/R fixing procedure (draft)]]  
*[[Grid operations oversight/OperationsSupportMetrics summary|Operations Support Metrics - reports summary]]-->


<br>


[[Category:COD]]
[[Category:Infrastructure_Oversight]]

Latest revision as of 17:17, 28 July 2016





Introduction

New version on https://wiki.egi.eu/wiki/EGI_Operations_Team

This page collects internal materials needed by the EGI.eu Operations and EGI Operations Support teams to perform the EGI Infrastructure operations oversight activities.

NOTE: on April 30th 2016 the EGI Operations Support activity stopped; all its tasks passed to Operations.

Contact

EGI.eu Operations:

  • GGUS Support Unit:Operations
  • operations @ egi.eu

Actions

This section collects all work instructions, which specify exactly which steps to follow to carry out an activity.

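Action | Responsible | Procedure | Instructions and related pages
ROD certification | OS | PROC02 Operations Centre Creation | WI01 - New ROD team certification work instructions
Creation of a new NGI | OS | PROC02 Operations Centre Creation | WI02 - New Operations Centre creation work instruction
Monthly operations broadcast | OS | | WI04 - Monthly Operations broadcast
Operations Centre decommission | O | PROC03 Operations Centre decommissioning |
Setting a Nagios test to an operations test | O | PROC06 Setting a Nagios test status to OPERATIONS |
Operational portal dashboard issues | O | PROC01 EGI Infrastructure Oversight Escalation | WI05 - Escalation procedure in case of unresponsive NGI
Availability/reliability followup procedure | O | PROC04 Quality verification of monthly availability and reliability statistics | PROC10 Recomputation of monitoring results and availability statistics; WI03 RC and RP OLA violation report followup; Underperforming sites and suspensions
Unknown followup procedure | O | PROC04 Quality verification of monthly availability and reliability statistics | WI05 - Escalation procedure in case of unresponsive NGI; UNKNOWN issue; WI03 RC and RP OLA violation report followup
Top-level BDII followup procedure | O | PROC04 Quality verification of monthly availability and reliability statistics | WI05 - Escalation procedure in case of unresponsive NGI; WI03 RC and RP OLA violation report followup
ROD performance index followup procedure | O | | WI05 - Escalation procedure in case of unresponsive NGI; WI03 RC and RP OLA violation report followup; ROD performance index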

Work Instructions

Pages listing NGIs

For EGI Operations: to be updated upon OC creation or decommissioning




Resources