The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team at operations @ egi.eu.

Difference between revisions of "Operations and Operations Support"

From EGIWiki
 
(26 intermediate revisions by 2 users not shown)
Line 5: Line 5:
= Introduction  =
= Introduction  =


This page collects internal materials needed by EGI.eu Operations and EGI Operations Support team to perform the EGI Infrastructure operations oversight activities.  
'''New version on https://wiki.egi.eu/wiki/EGI_Operations_Team'''
 
This page collects internal materials needed by EGI.eu Operations and EGI Operations Support team to perform the EGI Infrastructure operations oversight activities.
 
'''NOTE''': on April 30th 2016 the EGI Operations Support activity stopped; all its tasks passed to Operations.


= Contact  =
= Contact  =


EGI.eu Operations:
EGI.eu Operations:  


*GGUS Support Unit: Operation
*GGUS Support Unit:Operations
*operations @ egi.eu
*operations @ egi.eu


EGI Operations Support:
= Actions =
 
*GGUS Support Unit: EGI Operations Support
*operations-support @ mailman.egi.eu
 
= Duties  =
 
= Shifters work instructions =


This section collects all work instructions, each specifying exactly which steps to follow to carry out an activity.
This section collects all work instructions, each specifying exactly which steps to follow to carry out an activity.
Line 28: Line 25:
|-
|-
! Action  
! Action  
! Description
! Responsible<br>
! Related procedures
! Procedure
! Instructions and related pages<br>
|-
|-
| '''GGUS tickets assigned to COD'''  
| '''ROD certification'''  
| OS<br>
|  
|  
The COD shifter must check the current status of all '''GGUS tickets assigned to COD'''
*[https://wiki.egi.eu/wiki/PROC02 Operations Centre Creation]


*see [https://ggus.eu/index.php?mode=ticket_search&ticket_id=&supportunit=COD&status=open&orderhow=desc&search_submit=GO! Link to all GGUS tickets assigned to COD]  
|
*If the ticket is waiting for COD action, the shifter should perform that action
*[[WI01 ROD certification ticket handling|WI01 - New ROD team certification work instructions]]


<br> In case of a request for:  
|-
| '''Creation of a new NGI'''
| OS<br>  
|
*[https://wiki.egi.eu/wiki/PROC02 Operations Centre Creation]


*'''ROD certification'''
|  
**see [[WI01 ROD certification ticket handling|New ROD team certification work instructions]]
*[[WI02 Operations centre creation|WI02 - New Operations Centre creation work instruction]]
*'''Creation of a new NGI'''
**see [[PROC02|Creation of a new Operations Centre process coordination]]
**see [[WI02 Operations centre creation|work instruction]]  
**If COD is also the Integration Process Coordinator, COD is responsible for the whole procedure.
*'''Operations Centre decommission'''
**see [[PROC03|Operations Centre decommission process coordination]]
**COD validates the request and removes the ROD information from the all-operators mailing list
*'''Setting a Nagios test to an operations test'''
**see [[PROC06|Procedure for setting a Nagios test to an operations test]]
**The test can be set to operations in the Operations Portal here: https://operations-portal.egi.eu/dashboard/regionalPreferences. Choose "ALL" as the scope.
**The broadcast can be sent here: https://operations-portal.egi.eu/broadcast Subject: New OPERATIONS tests related to (choose the right scope here). There is no option to select RODs, so CC: all-operator-on-duty@mailman.egi.eu
**The Nagios ROC_OPERATORS profile must be updated by the SAM team: http://grid-monitoring.cern.ch/poem/admin/poem/profile/26/
**COD is responsible for coordinating the whole process.


If the shifter does not know which action to take, he/she should contact the COD managers.
|-
| '''Monthly operations broadcast'''
| OS
|
|
*[[WI04_Monthly_broadcast| WI04 - Monthly Operations broadcast]]


|-
| '''Operations Centre decommission'''
| O<br>
|  
|  
*[[PROC02|Creation of a new Operations Centre process coordination]]
*[https://wiki.egi.eu/wiki/PROC03 Operations Centre decommissioning]
*[[PROC03|Operations Centre decommission process coordination]]
*[[PROC06|Procedure for setting a Nagios test to an operations test]]


<br>  
| <br>
|-
| '''Setting a Nagios test to an operations test'''
| O<br>
|
*[https://wiki.egi.eu/wiki/PROC06 Setting a Nagios test status to OPERATIONS]


| <br>
|-
|-
| '''Operational portal dashboard issues'''  
| '''Operational portal dashboard issues'''  
| O<br>
|  
|  
*[https://operations-portal.egi.eu/codDashboard/ngi/any/tab/list/filter/operators/page/list COD dashboard link]
*[https://wiki.egi.eu/wiki/PROC01 EGI Infrastructure Oversight Escalation]


|  
|  
*[[PROC01|COD escalation procedure]]
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]


|-
|-
| '''Handover'''  
| '''Availability/reliability followup procedure'''  
| O<br>
|  
|  
*[https://operations-portal.egi.eu/dashboard/ccodView COD dashboard link]
*[https://wiki.egi.eu/wiki/PROC04 Quality verification of monthly availability and reliability statistics]<br>
*At the end of the shift, a handover should be submitted (sent to COD) via the Handover tool in the Operational Portal
**Problems on the dashboard that will carry over to the next week: the GGUS ID of the ticket and when the next escalation step should be taken
**GGUS tickets assigned to COD: for each ticket, its latest status and the action taken by the shifter should be provided
**Other issues: problems with tools, etc.


| <br>
|-
| '''Availability/reliability followup procedure'''
|  
|  
*[[WI03 Availability and Reliability report followup|WI03 - Availability and reliability report work instruction]]  
*[https://wiki.egi.eu/wiki/PROC10 Recomputation of monitoring results and availability statistics]
*[[WI03 RC and RP OLA violation report followup|WI03 RC and RP OLA violation report followup]]  
*[[Underperforming sites and suspensions|Underperforming sites and suspensions]]
*[[Underperforming sites and suspensions|Underperforming sites and suspensions]]
|
*[[PROC04|Availability and reliability monthly statistics procedure]]


|-
|-
| '''Unknown followup procedure'''  
| '''Unknown followup procedure'''  
| O<br>
|  
|  
*[[WI08 Unknown report followup|WI08 - Unknown report work instruction]]
*[https://wiki.egi.eu/wiki/PROC04 Quality verification of monthly availability and reliability statistics]
*[[Unknown issue|UNKNOWN issue ]]


|  
|  
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
*[[Unknown issue|UNKNOWN issue]]
*[[WI03 RC and RP OLA violation report followup|WI03 RC and RP OLA violation report followup]]


|-
|-
| '''Top-level BDII followup procedure'''  
| '''Top-level BDII followup procedure'''  
| O<br>
|  
|  
*[[WI04 Core services report followup|WI04 - Core services report work instruction ]]
*[https://wiki.egi.eu/wiki/PROC04 Quality verification of monthly availability and reliability statistics]


|  
|  
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
*[[WI03 RC and RP OLA violation report followup|WI03 RC and RP OLA violation report followup]]


|-
|-
| '''ROD performance index followup procedure'''  
| '''ROD performance index followup procedure'''  
| O<br>
| <br>
|  
|  
*[[WI07 ROD performance index report follwup|WI07 - ROD Performance Index report work instruction]]  
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
*[[WI03 RC and RP OLA violation report followup|WI03 RC and RP OLA violation report followup]]  
*[[ROD performance index|ROD performance index]]
*[[ROD performance index|ROD performance index]]
|
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]


|}
|}
Line 124: Line 123:
*[[WI01 ROD certification ticket handling|WI01 - New ROD team certification work instructions]]  
*[[WI01 ROD certification ticket handling|WI01 - New ROD team certification work instructions]]  
*[[WI02 Operations centre creation|WI02 - New Operations Centre creation work instruction]] 
*[[WI02 Operations centre creation|WI02 - New Operations Centre creation work instruction]] 
*[[WI03 Availability and Reliability report followup|WI03 - Availability and reliability report work instruction]]  
*[[WI03 RC and RP OLA violation report followup|WI03 - RC and RP OLA violation report followup]]  
*[[WI04 Core services report followup|WI04 - Core services report work instruction ]]  
*[[WI04 Monthly broadcast|WI04 - Monthly Operations broadcast]]  
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]  
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
*[[WI06 Tickets older than 30 days|WI06 - Tickets &gt; 30 days]]  
*[[WI06_Core_services_process| Core services process]]
*[[WI07 ROD performance index report follwup|WI07 - ROD Performance Index report work instruction]]
 
*[[WI08 Unknown report followup|WI08 - Unknown report work instruction]]
== Pages listing NGIs<br>  ==
 
For EGI Operations: to be updated whenever an Operations Centre (OC) is created or decommissioned
 
*[https://wiki.egi.eu/wiki/GOCDB_grouping_action https://wiki.egi.eu/wiki/GOCDB_grouping_action ]<br>
*[https://wiki.egi.eu/wiki/Operations_centres https://wiki.egi.eu/wiki/Operations_centres] <br>
*https://wiki.egi.eu/wiki/Top-BDII_list_for_NGI <br>
*https://goc.egi.eu/portal/index.php?Page_Type=Service_Group&id=1205<br>
*https://goc.egi.eu/portal/index.php?Page_Type=Service_Group&id=1206
*https://goc.egi.eu/portal/index.php?Page_Type=Service_Group&id=1184
*https://docs.google.com/a/egi.eu/spreadsheets/d/1Zsk3ykVllc5GzNG2Hhref7wzTvz_rSKcckV8nnWWZIs/edit#gid=163292516
*folder "08 - sites-history Q"


= Events  =
<br>


*[https://www.egi.eu/indico/categoryDisplay.py?categId=11 EGI indico page] with COD meeting agendas.
<br>
*All open actions can be found from [[COD actions|COD actions]]
 
<br>


= Resources  =
= Resources  =


*[https://documents.egi.eu/secure/ShowDocument?docid=298 Document server: ROD newsletter]
*[https://documents.egi.eu/secure/ShowDocument?docid=155 Document server: Operations Support Metrics]
*[[Operations Procedures|Operations Procedures]]  
*[[Operations Procedures|Operations Procedures]]  
*[http://www.youtube.com/user/EGIGridOversight Youtube channel]
*[http://www.youtube.com/user/EGIGridOversight Youtube channel]
*[https://operations-portal.in2p3.fr/dashboard/regionalPreferences Mailing lists for each ROD]
*[https://wiki.egi.eu/wiki/COD_Knowledge_database Knowledge database]


<!--
<!--
Line 149: Line 156:


*[[Grid operations oversight/OperationsSupportMetrics summary|Operations Support Metrics - reports summary]]-->  
*[[Grid operations oversight/OperationsSupportMetrics summary|Operations Support Metrics - reports summary]]-->  
=== Oct 2011 to date  ===
*Please provide a link here
<br>


<br>  
<br>  


Definition of [[Operations support metrics|Operations Support metrics]]
[[Category:Infrastructure_Oversight]]
 
=== May 2010-Sep 2011  ===
 
*Operations Support [https://documents.egi.eu/document/155 metrics]
 
=== Until April 2010  ===
 
*EGEE-III Operations Support [https://documents.egi.eu/document/829 metrics]
 
== Nagios tests  ==
 
*[[Operations SAM tests|Operations tests list ]]: list of Nagios probes generating alarms for visualization in the Operations Dashboard
*[[Availability SAM tests|Availability and reliability tests list]]: list of Nagios probes whose results are used for Availability and Reliability computation
 
== OTAG topics  ==
 
=== Operational Portal: Dashboard  ===
 
*[http://bit.ly/dZ3RWN RT tickets]
*[[COD Interaction with Dashboard team|COD interactions with Dashboard team (draft)]]
*[[COD OTAG topics|COD topics to be discussed on OTAG meeting]]
 
== Pages in draft state  ==
 
*[[Availability procedure improvements|Improvements to Availability Calculation Procedure (draft)]]
*[[Candidate or Suspended sites|Candidate Suspended Sites List]]
 
[[Category:Grid_Oversight]]

Latest revision as of 17:17, 28 July 2016





Introduction

New version on https://wiki.egi.eu/wiki/EGI_Operations_Team

This page collects internal materials needed by EGI.eu Operations and EGI Operations Support team to perform the EGI Infrastructure operations oversight activities.

NOTE: on April 30th 2016 the EGI Operations Support activity stopped; all its tasks passed to Operations.

Contact

EGI.eu Operations:

  • GGUS Support Unit:Operations
  • operations @ egi.eu

Actions

This section collects all work instructions, each specifying exactly which steps to follow to carry out an activity.

Action | Responsible | Procedure | Instructions and related pages
ROD certification | OS | |
Creation of a new NGI | OS | |
Monthly operations broadcast | OS | |
Operations Centre decommission | O | |
Setting a Nagios test to an operations test | O | |
Operational portal dashboard issues | O | |
Availability/reliability followup procedure | O | |
Unknown followup procedure | O | |
Top-level BDII followup procedure | O | |
ROD performance index followup procedure | O | |

Work Instructions

Pages listing NGIs

For EGI Operations: to be updated whenever an Operations Centre (OC) is created or decommissioned




Resources