The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team at operations @ egi.eu.

Difference between revisions of "Operations and Operations Support"

From EGIWiki
 
(24 intermediate revisions by 2 users not shown)
Line 5: Line 5:
= Introduction  =
= Introduction  =


This page collects internal materials needed by the EGI.eu Operations and EGI Operations Support teams to perform the EGI Infrastructure operations oversight activities.
'''New version on https://wiki.egi.eu/wiki/EGI_Operations_Team'''

This page collects internal materials needed by the EGI.eu Operations and EGI Operations Support teams to perform the EGI Infrastructure operations oversight activities.

'''NOTE''': on April 30th 2016 the EGI Operations Support activity stopped; all its tasks passed to Operations.


= Contact  =
= Contact  =
Line 11: Line 15:
EGI.eu Operations:  
EGI.eu Operations:  


*GGUS Support Unit: Operation
*GGUS Support Unit:Operations
*operations @ egi.eu
*operations @ egi.eu


EGI Operations Support:
= Actions =
 
*GGUS Support Unit: EGI Operations Support
*operations-support @ mailman.egi.eu
 
= Duties  =
 
= Shifters work instructions =


This section collects all work instructions, which specify exactly what steps to follow to carry out an activity.
This section collects all work instructions, which specify exactly what steps to follow to carry out an activity.
Line 28: Line 25:
|-
|-
! Action  
! Action  
! Description
! Responsible<br>
! Related procedures
! Procedure
! Instructions and related pages<br>
|-
|-
| '''GGUS tickets assigned to COD'''  
| '''ROD certification'''  
| <br>
| OS<br>  
*'''ROD certification'''
|
**see [[WI01 ROD certification ticket handling|New ROD team certification work instructions]]  
*[https://wiki.egi.eu/wiki/PROC02 Operations Centre Creation]
*'''Creation of a new NGI'''
 
**see [[PROC02|Creation of a new Operations Centre process coordination]]
|
**see [[WI02 Operations centre creation|work instruction]]
*[[WI01 ROD certification ticket handling|WI01 - New ROD team certification work instructions]]
**If COD is also the Integration Process Coordinator, COD is responsible for the whole procedure.
*'''Operations Centre decommission'''
**see [[PROC03|Operations Centre decommission process coordination]]
**COD validates the request and removes the ROD information from the all-operators mailing list
*'''Setting a Nagios test to an operations test'''
**see [[PROC06|Procedure for setting a Nagios test to an operations test]]
**The test can be set to operations in the Operations Portal at https://operations-portal.egi.eu/dashboard/regionalPreferences (choose "ALL" as the scope).
**The broadcast can be sent from https://operations-portal.egi.eu/broadcast with the subject "New OPERATIONS tests related to (choose right scope here)". There is no option to select RODs, so CC all-operator-on-duty@mailman.egi.eu
**The Nagios ROC_OPERATORS profile must be updated by the SAM team: http://grid-monitoring.cern.ch/poem/admin/poem/profile/26/
**COD is responsible for coordinating the whole process.


If the shifter does not know what action to take, he/she should contact the COD managers.
|-
| '''Creation of a new NGI'''
| OS<br>
|
*[https://wiki.egi.eu/wiki/PROC02 Operations Centre Creation]


|  
|  
*[[PROC02|Creation of a new Operations Centre process coordination]]
*[[WI02 Operations centre creation|WI02 - New Operations Centre creation work instruction]]
*[[PROC03|Operations Centre decommission process coordination]]
*[[PROC06|Procedure for setting a Nagios test to an operations test]]


<br>  
|-
| '''Monthly operations broadcast'''
| OS
|
|
*[[WI04_Monthly_broadcast| WI04 - Monthly Operations broadcast]]
 
|-
| '''Operations Centre decommission'''
| O<br>  
|
*[https://wiki.egi.eu/wiki/PROC03 Operations Centre decommissioning]
 
| <br>
|-
| '''Setting a Nagios test to an operations test'''
| O<br>
|
*[https://wiki.egi.eu/wiki/PROC06 Setting a Nagios test status to OPERATIONS]


| <br>
|-
|-
| '''Operational portal dashboard issues'''  
| '''Operational portal dashboard issues'''  
| O<br>
|  
|  
*[https://operations-portal.egi.eu/codDashboard/ngi/any/tab/list/filter/operators/page/list COD dashboard link]
*[https://wiki.egi.eu/wiki/PROC01 EGI Infrastructure Oversight Escalation]


|  
|  
*[[PROC01|COD escalation procedure]]
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]


|-
|-
| '''Availability/reliability followup procedure'''  
| '''Availability/reliability followup procedure'''  
| O<br>
|  
|  
*[[WI03 Availability and Reliability report followup|WI03 - Availability and reliability report work instruction]]
*[https://wiki.egi.eu/wiki/PROC04 Quality verification of monthly availability and reliability statistics]<br>
*[[Underperforming sites and suspensions|Underperforming sites and suspensions]]


|  
|  
*[[PROC04|Availability and reliability monthly statistics procedure]]
*[https://wiki.egi.eu/wiki/PROC10 Recomputation of monitoring results and availability statistics]
*[[WI03 RC and RP OLA violation report followup|WI03 RC and RP OLA violation report followup]]
*[[Underperforming sites and suspensions|Underperforming sites and suspensions]]


|-
|-
| '''Unknown followup procedure'''  
| '''Unknown followup procedure'''  
| O<br>
|  
|  
*[[WI08 Unknown report followup|WI08 - Unknown report work instruction]]
*[https://wiki.egi.eu/wiki/PROC04 Quality verification of monthly availability and reliability statistics]
*[[Unknown issue|UNKNOWN issue ]]


|  
|  
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
*[[Unknown issue|UNKNOWN issue]]
*[[WI03 RC and RP OLA violation report followup|WI03 RC and RP OLA violation report followup]]


|-
|-
| '''Top-level BDII followup procedure'''  
| '''Top-level BDII followup procedure'''  
| O<br>
|  
|  
*[[WI04 Core services report followup|WI04 - Core services report work instruction ]]
*[https://wiki.egi.eu/wiki/PROC04 Quality verification of monthly availability and reliability statistics]


|  
|  
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
*[[WI03 RC and RP OLA violation report followup|WI03 RC and RP OLA violation report followup]]


|-
|-
| '''ROD performance index followup procedure'''  
| '''ROD performance index followup procedure'''  
| O<br>
| <br>
|  
|  
*[[WI07 ROD performance index report follwup|WI07 - ROD Performance Index report work instruction]]  
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
*[[WI03 RC and RP OLA violation report followup|WI03 RC and RP OLA violation report followup]]  
*[[ROD performance index|ROD performance index]]
*[[ROD performance index|ROD performance index]]
|
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]


|}
|}
Line 107: Line 123:
*[[WI01 ROD certification ticket handling|WI01 - New ROD team certification work instructions]]  
*[[WI01 ROD certification ticket handling|WI01 - New ROD team certification work instructions]]  
*[[WI02 Operations centre creation|WI02 - New Operations Centre creation work instruction]]
*[[WI02 Operations centre creation|WI02 - New Operations Centre creation work instruction]]
*[[WI03 Availability and Reliability report followup|WI03 - Availability and reliability report work instruction]]  
*[[WI03 RC and RP OLA violation report followup|WI03 - RC and RP OLA violation report followup]]  
*[[WI04 Core services report followup|WI04 - Core services report work instruction ]]  
*[[WI04 Monthly broadcast|WI04 - Monthly Operations broadcast]]  
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]  
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
*[[WI06 Tickets older than 30 days|WI06 - Tickets &gt; 30 days]]  
*[[WI06_Core_services_process| Core services process]]
*[[WI07 ROD performance index report follwup|WI07 - ROD Performance Index report work instruction]]
 
*[[WI08 Unknown report followup|WI08 - Unknown report work instruction]]
== Pages listing NGIs<br>  ==
 
For EGI Operations: to be updated when an OC is created or decommissioned
 
*[https://wiki.egi.eu/wiki/GOCDB_grouping_action https://wiki.egi.eu/wiki/GOCDB_grouping_action ]<br>
*[https://wiki.egi.eu/wiki/Operations_centres https://wiki.egi.eu/wiki/Operations_centres] <br>
*https://wiki.egi.eu/wiki/Top-BDII_list_for_NGI <br>
*https://goc.egi.eu/portal/index.php?Page_Type=Service_Group&amp;id=1205<br>
*https://goc.egi.eu/portal/index.php?Page_Type=Service_Group&amp;id=1206
*https://goc.egi.eu/portal/index.php?Page_Type=Service_Group&amp;id=1184
*https://docs.google.com/a/egi.eu/spreadsheets/d/1Zsk3ykVllc5GzNG2Hhref7wzTvz_rSKcckV8nnWWZIs/edit#gid=163292516
*folder "08 - sites-history Q"


= Events  =
<br>


*[https://www.egi.eu/indico/categoryDisplay.py?categId=11 EGI indico page] with COD meeting agendas.
<br>
*All open actions can be found from [[COD actions|COD actions]]
 
<br>


= Resources  =
= Resources  =


*[https://documents.egi.eu/secure/ShowDocument?docid=298 Document server: ROD newsletter]
*[https://documents.egi.eu/secure/ShowDocument?docid=155 Document server: Operations Support Metrics]
*[[Operations Procedures|Operations Procedures]]  
*[[Operations Procedures|Operations Procedures]]  
*[http://www.youtube.com/user/EGIGridOversight Youtube channel]
*[http://www.youtube.com/user/EGIGridOversight Youtube channel]
*[https://operations-portal.in2p3.fr/dashboard/regionalPreferences Mailing lists for each ROD]
*[https://wiki.egi.eu/wiki/COD_Knowledge_database Knowledge database]


<!--
<!--
Line 132: Line 156:


*[[Grid operations oversight/OperationsSupportMetrics summary|Operations Support Metrics - reports summary]]-->  
*[[Grid operations oversight/OperationsSupportMetrics summary|Operations Support Metrics - reports summary]]-->  
=== Oct 2011 to date  ===
*Please provide a link here


<br>  
<br>  


<br>
[[Category:Infrastructure_Oversight]]
 
Definition of [[Operations support metrics|Operations Support metrics]]
 
=== May 2010-Sep 2011  ===
 
*Operations Support [https://documents.egi.eu/document/155 metrics]
 
=== Until April 2010  ===
 
*EGEE-III Operations Support [https://documents.egi.eu/document/829 metrics]
 
== Nagios tests  ==
 
*[[Operations SAM tests|Operations tests list ]]: list of Nagios probes generating alarms for visualization in the Operations Dashboard
*[[Availability SAM tests|Availability and reliability tests list]]: list of Nagios probes whose results are used for Availability and Reliability computation
 
== OTAG topics  ==
 
=== Operational Portal: Dashboard  ===
 
*[http://bit.ly/dZ3RWN RT tickets]
*[[COD Interaction with Dashboard team|COD interactions with Dashboard team (draft)]]
*[[COD OTAG topics|COD topics to be discussed on OTAG meeting]]
 
== Pages in draft state  ==
 
*[[Availability procedure improvements|Improvements to Availability Calculation Procedure (draft)]]
*[[Candidate or Suspended sites|Candidate Suspended Sites List]]
 
[[Category:Grid_Oversight]]

Latest revision as of 17:17, 28 July 2016





Introduction

New version on https://wiki.egi.eu/wiki/EGI_Operations_Team

This page collects internal materials needed by the EGI.eu Operations and EGI Operations Support teams to perform the EGI Infrastructure operations oversight activities.

NOTE: on April 30th 2016 the EGI Operations Support activity stopped; all its tasks passed to Operations.

Contact

EGI.eu Operations:

  • GGUS Support Unit:Operations
  • operations @ egi.eu

Actions

This section collects all work instructions, which specify exactly what steps to follow to carry out an activity.

Action | Responsible | Procedure | Instructions and related pages
ROD certification | OS | |
Creation of a new NGI | OS | |
Monthly operations broadcast | OS | |
Operations Centre decommission | O | |
Setting a Nagios test to an operations test | O | |
Operational portal dashboard issues | O | |
Availability/reliability followup procedure | O | |
Unknown followup procedure | O | |
Top-level BDII followup procedure | O | |
ROD performance index followup procedure | O | |

Work Instructions

Pages listing NGIs

For EGI Operations: to be updated when an OC is created or decommissioned




Resources