Difference between revisions of "WI03 RC and RP OLA violation report followup"

From EGIWiki
(Ticket content)
(GGUS Ticket content (template))
 
(52 intermediate revisions by 6 users not shown)
Line 1: Line 1:
{{Template:Op menubar}}  
+
{{Template:Op menubar}} {{Template:GO menubar}} {{TOC_right}}  
{{Template:GO menubar}} {{TOC_right}}  
 
  
[[Category:Grid Oversight]]
+
<br>
  
 +
= RC and RP OLA&nbsp;violation work instruction for EGI Operations  =
  
= Availability and reliability report work instruction for COD  =
+
This page describes the steps to take when following up RC and RP&nbsp;OLA violations.
  
This page describes steps which should be taken by COD shifter to follow availability/reliability issues.
+
== General info  ==
  
== General info ==
+
*'''Receiver''': NGI
 +
*'''Subject''': RC and RP&nbsp;OLA violation<br>
 +
*'''Threshold''':&nbsp;
 +
*'''Goal''': We expect to see improvement
 +
*'''Deadline for answers''': 10 days
 +
*'''No response''':&nbsp;
 +
 
 +
== Steps  ==
 +
 
 +
Starts on the 1st day of the month<br>
 +
 
 +
{| border="1" class="wikitable"
 +
|-
 +
| '''Step [#]'''
 +
| <br>
 +
|
 +
'''Max. Duration [work days]'''
  
*'''Receiver''': Site
+
(time before moving to next step)
*'''Subject''': Availability or reliability under target for last 3 months
+
 
*'''Threshold''': Availability: 70%, Reliability: 75%
+
| '''Responsible'''  
*'''Goal''': We expect to see improvement
+
| '''Step'''
*'''Deadline for answers''': 10 days
+
|-
*'''No response''': site suspension  
+
| rowspan="3" | 1 <br>
 +
| 1
 +
| rowspan="3" | 10<br>
 +
| SLM, CA
 +
|
 +
#'''SLM''':&nbsp;Produce '''Availability/Reliability report'''
 +
#'''SLM:'''&nbsp;Create DOC&nbsp;DB&nbsp;entry and add link to&nbsp; https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics
 +
#'''SLM''': Send mail to operations @ egi.eu
 +
 
 +
<br>
 +
 
 +
|-
 +
| 2<br>
 +
| PS
 +
|
 +
#Create report for '''NGI&nbsp;Top BDII'''<br>
 +
#Add report to DOC&nbsp;DB entry created by SLM
 +
#Create summary and send to operations @ egi.eu
 +
 
 +
|-
 +
| 3
 +
| MK
 +
|
 +
#Create report for '''ROD performance index, NGI&nbsp;SAM&nbsp; and QoS '''<br>
 +
#Add reports to DOC&nbsp;DB entry for [https://wiki.egi.eu/wiki/ROD_performance_index#Performance_reports ROD performance index], NGI&nbsp;SAM (same as for BDII)&nbsp; and QoS <br>
 +
#Create summary and send to operations @ egi.eu
 +
#Create summary for
#*'''NGIs: underperforming ava/rel'''
#*'''NGIs: high Unknown'''
#*'''RCs: underperforming ava/rel for 3 months'''
#'''CA''':&nbsp;Send summary to operations @ egi.eu
 +
 
 +
|-
 +
| 2<br>
 +
| <br>
 +
| 10<br>
 +
| VS<br>
 +
|
 +
#Put summary of reports together<br>
 +
#Send email to NOC&nbsp;managers about all reports (link to DOC&nbsp;DB)
 +
#Create master and child GGUS tickets against NGIs
 +
#Follow up GGUS tickets against NGIs
 +
 
 +
|}
 +
 
 +
<br>
 +
 
 +
=== NGI managers email (example)<br>  ===
 +
<pre>Subject:&nbsp;EGI RC OLA and RP OLA Reports for October 2014
 +
 
 +
Content:&nbsp;Dear NGI managers
 +
 
 +
Please find EGI RC OLA and RP OLA Reports for October 2014 under:
 +
 
 +
https://documents.egi.eu/public/ShowDocument?docid=2352
 +
 
 +
Entry includes reports:
 +
* EGI Cloud RC A/R
 +
* EGI RP/RC A/R/U
 +
* EGI RP Quality of Support
 +
* EGI RP ROD performance index
 +
* EGI RP SAM A/R
 +
* EGI RP Top-BDII A/R
 +
 
 +
Best Regards
 +
</pre>
 +
=== DOC&nbsp;DB content (example)  ===
 +
<pre>Title:&nbsp;EGI RC OLA and RP OLA Reports for October 2014
 +
 
 +
Abstract:&nbsp;Container for reports supporting
 +
 
 +
Resource Centre Operational Level Agreement
 +
 
 +
https://documents.egi.eu/document/31
 +
 
 +
and
 +
 
 +
Resource infrastructure Provider Operational Level Agreement
 +
 
 +
https://documents.egi.eu/document/463
 +
</pre>
 +
=== GGUS&nbsp;Ticket content (template)  ===
 +
<pre>$NGI - $MONTH $YEAR - RP/RC OLA performance
 +
 
 +
Dear NGI/ROC,
 +
 
 +
the EGI RC OLA and RP OLA Report for November 2020 has been produced and is available at the following links:
 +
- NGIs reports: http://egi.ui.argo.grnet.gr/egi/report-ar/Critical/NGI?month=2020-11 (clicking on an NGI name displays the A/R figures of its resource centres)
 +
- RCs reports: http://egi.ui.argo.grnet.gr/egi/report-ar/Critical/SITES?month=2020-11
 +
 
 +
According to the Service targets reports for Resource infrastructure Provider [1] and Resource Centre[2] OLAs, the following problems occurred:
 +
 
 +
============= RC Availability Reliability [2]==========
 +
 
 +
According to the recent availability/reliability report, the following sites performed below the Availability target threshold in 3 consecutive months (September, October, and November):
 +
 
 +
$SITE
 +
$SITE
 +
$SITE
 +
 
 +
* Within 10 working days of receiving this ticket, the NGI can suspend the site or ask for it not to be suspended by providing an adequate explanation. If this ticket receives no answer, the NGI will be contacted by email; if the email also receives no reply, the site will be suspended [6].
 +
 
 +
If the NGI intervenes and performance is still below targets 3 days after the intervention, the site will also be suspended.
 +
 
 +
If you think that the site should not be suspended, please provide justification in this ticket within 10 working days. If the site performance rises above targets within 3 days of the explanation being provided, the site will not be suspended. Otherwise EGI Operations may decide to suspend the site.
 +
 
 +
 
 +
============= Quality of Support [1]==========
 +
 
 +
According to the recent report, your NGI achieved insufficient Quality of Support performance:
 +
 
 +
less urgent (expected 5 working days):
 +
urgent (expected 5 working days):
 +
very urgent (expected 1 working day):
 +
top priority (expected 1 working day): 
  
== Steps ==
+
In order to see the details (https://wiki.egi.eu/wiki/Service_Level_Target_-_Quality_of_Support ):
  
=== Parent ticket ===
+
1) go to https://ggus.eu/?mode=report_view
#Ticket is submitted by Georgios Kaklamanos or George Fergadis.
+
2) choose "response times" or "tickets submitted" and the proper timeframe
#Add ticket URL to [[Grid_operations_oversight/CODOD_actions#Monthly_Actions| Monthly actions]]
+
3) select your NGI
#Add ticket URL to [[Underperforming_sites_and_suspensions| Underperforming sites and suspensions]]
+
4) group by Responsible unit and priority
 +
5) click on the displayed lines to get the ticket details
  
=== Submit child tickets to sites ===
 
#Go to Dropbox - COD - TicketCreator - AvaRel report
 
#Prepare input file EGI_sus.csv based on the records marked as red in the source pdf. Input file syntax: <br /> <pre>NGI;Site;Availability;Reliability</pre> Take into account only sites that are in Certified state in [https://goc.egi.eu/portal/ GocDB]. <br> Make sure NGIs are named according to the below table.
 
#Run ticket creator: <br /><pre>perl start-suspend.pl ticket_number ‘date, e.g. Sep 2012’ “EGI_sus.csv”</pre> More info about [[Grid_operations_oversight/WI03/TG-AR|Ticket generator for A/R]]
 
  
=== Handling the child tickets ===
+
Please investigate the reasons for the situation described above and provide an explanation for each case in this ticket within 10 days of receiving it.
#NGIs that replied within 10 days - check the explanation. If uncertain whether to suspend or not, discuss with COO by submitting a ticket to them.
 
##If after 3 days from receiving the explanation from NGI performance shows no improvement (Availability is still <70%, Reliability <75%) COD should suspend the site. Inform NGI and site about the suspension.
 
##In cases COD agree the site should not be suspended (such as: raise of availability >70% and reliability >75% or any other important reason, such as NGI SAM problem) the site can be left certified
 
#NGIs that didn’t reply - after 10 days suspend the site. Inform NGI and site about the suspension.
 
#Prepare summary report and place it in the parent ticket.
 
#Update:
 
##[[Underperforming_sites_and_suspensions| Underperforming_sites_and_suspensions]]
 
##[[List_of_sites_for_which_the_availability_followup_procedures_were_not_applicable |List of sites for which the availability followup procedures were not applicable]]
 
The whole process should be completed '''by the end of the month'''.
 
  
== Additional info ==
+
If no explanation is provided, EGI Operations will be informed about the situation and may take further steps.
  
=== Naming the NGIs ===
+
**********************
  
In grid view NGIs/ROCs are named differently than in GGUS. You should change NGI/ROC name according to GGUS. NGI name table can be found here: [[NGIs_GGUS| NGIs GGUS names]]
 
  
=== Ticket content  ===
+
Links:
  
<pre>Subject:$SU/$siteName - site suspension
+
[1] https://documents.egi.eu/public/ShowDocument?docid=463 "Resource infrastructure Provider Operational Level Agreement"
  
Dear $SU,
+
[2] https://documents.egi.eu/public/ShowDocument?docid=31 "Resource Centre Operational Level Agreement"
 
According to recent availability/reliability report $siteName has achieved poor performance below target
 
Ava. 70% or Rel. 75% in three consecutive months.
 
More details: https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics
 
  
The aim of submitting this ticket is the intervention of the NGI and immediate improvement of the situation.
+
[3] https://wiki.egi.eu/wiki/PROC01 "EGI Infrastructure Oversight escalation"
  
According to procedures approved on OMB 17.08, the site will be suspended 10 working days after receiving
+
[4] https://wiki.egi.eu/wiki/PROC10 "Recomputation of SAM results or availability reliability statistics"
this ticket unless NGI intervene. If NGI intervene and performance is still below targets 3 days after the
 
intervention, the site will also be suspended.
 
  
If you think that the site should not be suspended please provide justification in this ticket within 10
+
[5] https://wiki.egi.eu/wiki/MAN05 "top-BDII and site-BDII High Availability"
working days. In case the site performance rises above targets within 3 days from providing explanation,
 
the site will not be suspended. Otherwise COD may decide on suspension of the site.
 
  
You will be notified about the outcome in this ticket.
+
[6] https://wiki.egi.eu/wiki/PROC04 "Quality verification of monthly availability and reliability statistics"
  
 
Best Regards,
 
Best Regards,
EGI Central Operator on Duty
+
EGI Operations
 +
 
 
</pre>  
 
</pre>  
 +
More info about [[Ticket generator Availability Reliability|Ticket generator for A/R]]
 +
 +
<br>
  
More info about [[Grid_operations_oversight/WI03/TG-AR|Ticket generator for A/R]]
+
[[Category:Infrastructure_Oversight]]

Latest revision as of 11:28, 7 December 2020





RC and RP OLA violation work instruction for EGI Operations

This page describes the steps to take when following up RC and RP OLA violations.

General info

  • Receiver: NGI
  • Subject: RC and RP OLA violation
  • Threshold
  • Goal: We expect to see improvement
  • Deadline for answers: 10 days
  • No response
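The answer deadline can be computed mechanically. The sketch below assumes that the 10 days are working days, as the GGUS ticket template on this page states, that working days are Monday to Friday, and that public holidays are ignored. The function name `add_working_days` is hypothetical and not part of any EGI tool.

```python
from datetime import date, timedelta


def add_working_days(start: date, working_days: int) -> date:
    """Return the date that lies `working_days` working days after `start`.

    Assumption: working days are Monday to Friday; public holidays are
    not taken into account.
    """
    current = start
    remaining = working_days
    while remaining > 0:
        current += timedelta(days=1)
        # weekday() returns 0 for Monday through 6 for Sunday
        if current.weekday() < 5:
            remaining -= 1
    return current


# Example: a ticket received on Monday 7 December 2020
deadline = add_working_days(date(2020, 12, 7), 10)
```

Counting forward one day at a time keeps the logic easy to verify by hand, which matters more here than speed.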

Steps

Starts on the 1st day of the month

Step 1 (max. duration: 10 work days, the time before moving to the next step)

1.1 Responsible: SLM, CA
  1. SLM: Produce Availability/Reliability report
  2. SLM: Create DOC DB entry and add link to https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics
  3. SLM: Send mail to operations @ egi.eu

1.2 Responsible: PS
  1. Create report for NGI Top BDII
  2. Add report to DOC DB entry created by SLM
  3. Create summary and send to operations @ egi.eu

1.3 Responsible: MK
  1. Create report for ROD performance index, NGI SAM and QoS
  2. Add reports to DOC DB entry for ROD performance index, NGI SAM (same as for BDII) and QoS
  3. Create summary and send to operations @ egi.eu
  4. Create summary for
    • NGIs: underperforming ava/rel
    • NGIs: high Unknown
    • RCs: underperforming ava/rel for 3 months
  5. CA: Send summary to operations @ egi.eu

Step 2 (max. duration: 10 work days)

Responsible: VS
  1. Put summary of reports together
  2. Send email to NOC managers about all reports (link to DOC DB)
  3. Create master and child GGUS tickets against NGIs
  4. Follow up GGUS tickets against NGIs
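The check behind the "RCs: underperforming ava/rel for 3 months" summary can be sketched as follows. This is a minimal illustration, not the tool used by EGI Operations. The function name and the input layout are hypothetical. The 70% availability and 75% reliability targets, and the rule that a month counts as underperforming when either figure is below target, are taken from the earlier revision of this page and should be confirmed against the current OLAs.

```python
from typing import Dict, List, Tuple

# Assumed targets, taken from the earlier revision of this page.
AVAILABILITY_TARGET = 70.0
RELIABILITY_TARGET = 75.0


def underperforming_sites(
    monthly: Dict[str, List[Tuple[float, float]]], months: int = 3
) -> List[str]:
    """Return the sites that were below target in each of the last `months` months.

    `monthly` maps a site name to its (availability, reliability) figures in
    percent, ordered from oldest to newest month.
    """
    flagged = []
    for site, figures in sorted(monthly.items()):
        recent = figures[-months:]
        if len(recent) < months:
            # Not enough history to claim consecutive underperformance
            continue
        if all(a < AVAILABILITY_TARGET or r < RELIABILITY_TARGET for a, r in recent):
            flagged.append(site)
    return flagged


# Example with made-up figures for hypothetical sites
figures = {
    "SITE-A": [(60.0, 80.0), (65.0, 70.0), (50.0, 60.0)],
    "SITE-B": [(60.0, 80.0), (90.0, 95.0), (50.0, 60.0)],
}
flagged_sites = underperforming_sites(figures)
```

Sites with fewer than three months of figures are skipped rather than flagged, since the rule requires three consecutive months of data.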


NGI managers email (example)

Subject: EGI RC OLA and RP OLA Reports for October 2014

Content: Dear NGI managers

Please find EGI RC OLA and RP OLA Reports for October 2014 under:

https://documents.egi.eu/public/ShowDocument?docid=2352

Entry includes reports:
* EGI Cloud RC A/R
* EGI RP/RC A/R/U
* EGI RP Quality of Support
* EGI RP ROD performance index
* EGI RP SAM A/R
* EGI RP Top-BDII A/R

Best Regards

DOC DB content (example)

Title: EGI RC OLA and RP OLA Reports for October 2014

Abstract: Container for reports supporting 

Resource Centre Operational Level Agreement

https://documents.egi.eu/document/31

and

Resource infrastructure Provider Operational Level Agreement

https://documents.egi.eu/document/463

GGUS Ticket content (template)

$NGI - $MONTH $YEAR - RP/RC OLA performance

Dear NGI/ROC,	

the EGI RC OLA and RP OLA Report for November 2020 has been produced and is available at the following links:
- NGIs reports: http://egi.ui.argo.grnet.gr/egi/report-ar/Critical/NGI?month=2020-11 (clicking on an NGI name displays the A/R figures of its resource centres)
- RCs reports: http://egi.ui.argo.grnet.gr/egi/report-ar/Critical/SITES?month=2020-11

According to the Service targets reports for Resource infrastructure Provider [1] and Resource Centre[2] OLAs, the following problems occurred:

============= RC Availability Reliability [2]==========

According to the recent availability/reliability report, the following sites performed below the Availability target threshold in 3 consecutive months (September, October, and November):

$SITE
$SITE
$SITE 

* Within 10 working days of receiving this ticket, the NGI can suspend the site or ask for it not to be suspended by providing an adequate explanation. If this ticket receives no answer, the NGI will be contacted by email; if the email also receives no reply, the site will be suspended [6].

If the NGI intervenes and performance is still below targets 3 days after the intervention, the site will also be suspended.

If you think that the site should not be suspended, please provide justification in this ticket within 10 working days. If the site performance rises above targets within 3 days of the explanation being provided, the site will not be suspended. Otherwise EGI Operations may decide to suspend the site.


============= Quality of Support [1]==========

According to the recent report, your NGI achieved insufficient Quality of Support performance:

less urgent (expected 5 working days): 
urgent (expected 5 working days):
very urgent (expected 1 working day):
top priority (expected 1 working day):  

To see the details (https://wiki.egi.eu/wiki/Service_Level_Target_-_Quality_of_Support ):

1) go to https://ggus.eu/?mode=report_view
2) choose "response times" or "tickets submitted" and the proper timeframe
3) select your NGI
4) group by Responsible unit and priority
5) click on the displayed lines to get the ticket details


Please investigate the reasons for the situation described above and provide an explanation for each case in this ticket within 10 days of receiving it.

If no explanation is provided, EGI Operations will be informed about the situation and may take further steps.

**********************


Links:

[1] https://documents.egi.eu/public/ShowDocument?docid=463 "Resource infrastructure Provider Operational Level Agreement"

[2] https://documents.egi.eu/public/ShowDocument?docid=31 "Resource Centre Operational Level Agreement"

[3] https://wiki.egi.eu/wiki/PROC01 "EGI Infrastructure Oversight escalation"

[4] https://wiki.egi.eu/wiki/PROC10 "Recomputation of SAM results or availability reliability statistics"

[5] https://wiki.egi.eu/wiki/MAN05 "top-BDII and site-BDII High Availability"

[6] https://wiki.egi.eu/wiki/PROC04 "Quality verification of monthly availability and reliability statistics"

Best Regards,
EGI Operations

More info about Ticket generator for A/R
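Filling in the placeholders of the ticket template above could look like the sketch below. It is not the ticket generator referenced on this page. It only illustrates the substitution of `$NGI`, `$MONTH` and `$YEAR` in the subject line and the construction of the two ARGO report links for a given month. The function names are hypothetical, and the URL pattern is assumed from the two example links in the template.

```python
from string import Template
from typing import Dict, List

# Base URL assumed from the example links in the ticket template.
ARGO_BASE = "http://egi.ui.argo.grnet.gr/egi/report-ar/Critical"

# Subject line exactly as given in the ticket template.
SUBJECT = Template("$NGI - $MONTH $YEAR - RP/RC OLA performance")


def ticket_subject(ngi: str, month_name: str, year: int) -> str:
    """Fill the subject placeholders for one NGI."""
    return SUBJECT.substitute(NGI=ngi, MONTH=month_name, YEAR=year)


def report_links(year: int, month: int) -> Dict[str, str]:
    """Build the NGI and RC report links for the given reporting month."""
    period = f"{year:04d}-{month:02d}"
    return {
        "ngi": f"{ARGO_BASE}/NGI?month={period}",
        "sites": f"{ARGO_BASE}/SITES?month={period}",
    }


def site_lines(sites: List[str]) -> str:
    """Render one site name per line, replacing the repeated $SITE lines."""
    return "\n".join(sites)


# Example with a hypothetical NGI name
subject = ticket_subject("NGI_EXAMPLE", "November", 2020)
links = report_links(2020, 11)
```

`Template.substitute` raises `KeyError` when a placeholder is left unfilled, which helps catch an incomplete ticket before it is submitted.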