
Difference between revisions of "WI03 RC and RP OLA violation report followup"

Line 40: Line 40:
#'''SLM''': Produce Availability/Reliability report
#'''SLM:''' Create DOC DB entry and add link to  https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics
#'''CA''': Add ticket URL to [[Underperforming sites and suspensions|Underperforming sites and suspensions]]
#'''CA''': create summary for
#*NGIs: underperforming ava/rel
Line 61: Line 62:
|
#Create report for ROD performance index, NGI&nbsp;SAM&nbsp; and QoS <br>
#Add reports to DOC&nbsp;DB entry for [https://wiki.egi.eu/wiki/ROD_performance_index#Performance_reports ROD performance index], NGI&nbsp;SAM (same as for BDII)&nbsp; and QoS <br>
#Create summary and send to Operations@ egi.eu


Line 72: Line 73:
|
#Put summary of reports together<br>
#Send email to NOC&nbsp;managers about all reports (link to DOC&nbsp;DB)
#Create and follow-up GGUS tickets against NGIs
#&nbsp;Update [[Underperforming sites and suspensions|Underperforming sites and suspensions]]


| <br>
Line 80: Line 82:
<br>


=== Parent ticket ===
=== ===
 
=== Ticket content  ===
<pre>Subject: $NGI_SU - $date - RP/RC OLA violation
 
Dear $NGI_SU,
 
According to the Service targets reports for Resource infrastructure Provider [1] and Resource Center [2] OLA
for $date, the following problems occurred:
 
============= RC Availability Reliability [2]==========
 
According to the recent availability/reliability report, the following sites have performed below the Availability target
threshold in 3 consecutive months:
* site name - ava:  rel:
*
*
 
* The sites will be suspended 10 working days after receiving this ticket unless the NGI intervenes. *

If the NGI intervenes and performance is still below targets 3 days after the intervention, the site will also be suspended.

If you think that the site should not be suspended, please provide justification in this ticket within 10
working days. If the site performance rises above targets within 3 days of providing the explanation,
the site will not be suspended. Otherwise EGI Operations may decide on suspension of the site.
 
Report:
 
============= RP Availability Reliability [1]==========
 
According to the recent availability/reliability report, your Operations Center has achieved insufficient
performance:
- Availability:
- Reliability:
 
Report:
 
============= RP Unknown [1]==========
 
The Unknown metric for your NGI was higher than 10%.
 
Note:
Having a high percentage of UNKNOWN status of sites implies that there is no data available regarding the
site availability for a substantial amount of time. For more information about UNKNOWN status please visit [6]
 
 
Report:
 
============= Top BDII Availability Reliability [1]==========
According to availability/reliability Top Level BDII report your Top Level BDII achieved insufficient
performance:
- Availability:
- Reliability:
 
We would like to kindly ask you to take action to improve your Top-BDII performance.
 
Note:
If the top-BDII performance was due to problems with middleware or you believe there are errors in the computation,
please send a GGUS ticket to the Service Level Management support unit, following the procedure [4]
 
 
If you need information on how to set up a highly available Top-BDII, have a look at [5]


#Ticket is submitted by EGI&nbsp;SLM team.
#Add ticket URL to [[COD actions#Monthly_Actions|Monthly actions]]
#Add ticket URL to [[Underperforming sites and suspensions|Underperforming sites and suspensions]]


=== Submit child tickets to sites  ===
Report:


#Go to ..........<br>
============= SAM Availability Reliability [1]==========
#Prepare the input file EGI_sus.csv based on the records marked as red in the source pdf. Include only sites whose availability is red for each of the last 3 months. Input file syntax: <br> <pre>NGI;Site;Ava1(oldest);Ava2(middle);Ava3(newest)</pre> Take into account only sites that are in Certified state in [https://goc.egi.eu/portal/ GocDB]. <br> Make sure NGIs are named according to the table below.
#Run the ticket creator: <br><pre>perl start-suspend.pl ticket_number 'date, e.g. Sep 2012' "EGI_sus.csv"</pre> More info about [[Ticket generator Availability Reliability|Ticket generator for A/R]]<br>If you get errors, make sure all the " and ' characters are typed as plain (straight) quotes in the terminal.
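The input file format described above can be illustrated with a short Python sketch. It parses lines in the `NGI;Site;Ava1;Ava2;Ava3` form and keeps the certified sites whose availability was below target in all three months. The 80% target is an assumption taken from the child ticket handling rules on this page; the function name, the sample rows and the `certified_sites` set are hypothetical, and this is not the code of `start-suspend.pl`.

```python
import csv
import io

AVAILABILITY_TARGET = 80.0  # percent; assumption taken from the handling rules on this page


def sites_to_suspend(csv_text, certified_sites, target=AVAILABILITY_TARGET):
    """Return (NGI, site, [ava1, ava2, ava3]) for sites below target in all 3 months."""
    selected = []
    reader = csv.reader(io.StringIO(csv_text), delimiter=";")
    for row in reader:
        if len(row) != 5:
            continue  # skip blank or malformed lines
        ngi, site = row[0].strip(), row[1].strip()
        try:
            values = [float(v) for v in row[2:5]]  # oldest, middle, newest
        except ValueError:
            continue  # skip a header line or non-numeric values
        # Only sites in Certified state are considered (the set stands in for a GocDB lookup).
        if site in certified_sites and all(v < target for v in values):
            selected.append((ngi, site, values))
    return selected


# Hypothetical sample data, not real sites.
sample = "NGI_XX;SITE-A;70;65;50\nNGI_XX;SITE-B;70;95;50\nNGI_YY;SITE-C;10;20;30\n"
result = sites_to_suspend(sample, certified_sites={"SITE-A", "SITE-B"})
print(result)
```

In this sample only SITE-A is selected: SITE-B was above target in the middle month, and SITE-C is not in the certified set.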


=== Handling the child tickets  ===
According to availability/reliability SAM report your NGI SAM instance achieved insufficient
performance:
- Availability:
- Reliability:


#NGIs that replied within 10 days - check the explanation. If uncertain whether to suspend or not, discuss with COO by submitting a ticket to them.
We would like to kindly ask you to take action to improve your NGI SAM performance.
##If, 3 days after receiving the explanation from the site, performance shows no improvement (Availability is still &lt;80%, Reliability &lt;85%), COD should suspend the site. Inform NGI and site about the suspension.
##If COD agrees that the site should not be suspended (for example, availability rises to &gt;70% and reliability to &gt;75%, or there is any other important reason, such as an NGI SAM problem), the site can be left certified.
#NGIs that didn’t reply - after 7 days put a reminder in the ticket. If no answer after 10 days from submitting tickets, suspend the site. Inform NGI and site about the suspension. <br>'''Tip''': It is recommended to send an e-mail to NGI managers mailing list and all NGI managers informing about the situation, and suspend the site if there's no reply or improvement.
#Prepare summary report and place it in the parent ticket.  
#Update:
##[[Underperforming sites and suspensions|Underperforming_sites_and_suspensions]]
##[[List of sites for which the availability followup procedures were not applicable|List of sites for which the availability followup procedures were not applicable]]


The whole process should be completed '''by the end of the month'''.
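The handling rules in this section can be summarised as a small decision function. The sketch below is only an illustration: the day counts (7, 10 and 3) and the 80%/85% thresholds come from the steps above, while the function name, its parameters and the returned labels are hypothetical. It treats a site as still underperforming if either figure is below its threshold, which is an assumption, and it does not replace the COD judgment described above (COD may still decide to leave a site certified).

```python
def next_action(days_open, reply_day=None, availability=None, reliability=None):
    """Suggest the next step for one child ticket.

    days_open: days since the ticket was submitted.
    reply_day: day (counted from submission) on which the NGI explained, or None.
    availability, reliability: current figures in percent, if known.
    """
    if reply_day is None:
        if days_open >= 10:
            return "suspend"  # no answer after 10 days
        if days_open >= 7:
            return "remind"  # put a reminder in the ticket
        return "wait"
    if days_open - reply_day < 3:
        return "wait"  # give the site 3 days after the explanation
    if availability is None or reliability is None:
        return "review"  # figures missing, a person has to check
    if availability < 80 or reliability < 85:
        return "suspend"  # no improvement; assumption: either figure below threshold
    return "keep certified"


print(next_action(7))  # no reply yet, one week after submission
print(next_action(8, reply_day=4, availability=60, reliability=90))
```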
Report:


== Additional info  ==
============= ROD performance index [1]==========


=== Naming the NGIs  ===
According to recent report more than 10 items were not handled by your ROD team according to
the escalation procedure described in [3]


In grid view NGIs/ROCs are named differently than in GGUS. You should change NGI/ROC name according to GGUS. NGI name table with these differences can be found here: [[NGI naming in GGUS|NGIs GGUS names]]  
No. Tickets expired occurrence:
No. Alarms older than 72h occurrence:
 
 
Report:
 
============= Quality of Support [1]==========
 
According to recent report your NGI achieved insufficient Quality of Support performance:
 
less urgent (expected 5 working days):
urgent (expected 5 working days):
very urgent (expected 1 working day):
top priority (expected 1 working day): 
 
 
Report:
 
Links:
 
[1] https://documents.egi.eu/public/ShowDocument?docid=463
 
[2] https://documents.egi.eu/public/ShowDocument?docid=31
 
[3] https://wiki.egi.eu/wiki/PROC01
 
[4] https://wiki.egi.eu/wiki/PROC10
 
[5] https://wiki.egi.eu/wiki/MAN05
 
[6] https://wiki.egi.eu/wiki/Unknown_issue


A mapping from countries to NGIs is available here: [[Operations centres|Operations centres]]


=== Ticket content  ===
<pre>Subject: $SU/$siteName - site suspension


Dear $SU,
According to recent availability/reliability report $siteName has achieved poor performance below Availability target
threshold in three consecutive months.
Availability for last 3 months was as follows: $availability1, $availability2, $availability3.
More details: https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics


The aim of submitting this ticket is the intervention of the NGI and immediate improvement of the situation.
Please investigate the reasons for the situation described above and provide an explanation for each case in this ticket within 10 days
of receiving it.


According to procedures approved on OMB 17.08.2010, the site will be suspended 10 working days after receiving
this ticket unless the NGI intervenes. If the NGI intervenes and performance is still below targets 3 days after the
intervention, the site will also be suspended.
If no explanation is provided, EGI Operations will be informed about the situation and may take further steps.


If you think that the site should not be suspended please provide justification in this ticket within 10
working days. In case the site performance rises above targets within 3 days from providing explanation,
the site will not be suspended. Otherwise COD may decide on suspension of the site.


You will be notified about the outcome in this ticket.


Best Regards,
Best Regards,
EGI Operations Team
EGI Operations
</pre>  
</pre>  
More info about [[Ticket generator Availability Reliability|Ticket generator for A/R]]  
More info about [[Ticket generator Availability Reliability|Ticket generator for A/R]]  

Revision as of 16:36, 13 October 2014





RC and RP OLA violation work instruction for EGI Operations

This page describes the steps to take when following up RC and RP OLA violations.

General info

  • Receiver: NGI
  • Subject: RC and RP OLA violation
  • Threshold
  • Goal: We expect to see improvement
  • Deadline for answers: 10 days
  • No response

Steps


Step [#]

Max. Duration [work days]

(time before moving to next step)

Responsible Step
1
1 5
SLM, CA
  1. SLM: Produce Availability/Reliability report
  2. SLM: Create DOC DB entry and add link to  https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics
  3. CA: Add ticket URL to Underperforming sites and suspensions
  4. CA: create summary for
    • NGIs: underperforming ava/rel
    • NGIs: high Unknown
    • RCs: underperforming ava/rel for 3 months
  5. CA: Send summary to Operations@ egi.eu

2
PS
  1. Create report for NGI Top BDII
  2. Add report to DOC DB entry created by SLM
  3. Create summary and send to Operations@ egi.eu

3 MK
  1. Create report for ROD performance index, NGI SAM  and QoS
  2. Add reports to DOC DB entry for ROD performance index, NGI SAM (same as for BDII)  and QoS
  3. Create summary and send to Operations@ egi.eu

2

10
MK or CA
  1. Put summary of reports together
  2. Send email to NOC managers about all reports (link to DOC DB)
  3. Create and follow-up GGUS tickets against NGIs
  4.  Update Underperforming sites and suspensions
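
The summary produced in step 1 groups NGIs by the problems found in the reports. A minimal Python sketch of that grouping is shown here. The 10% limit for Unknown comes from the ticket text on this page; the NGI-level availability and reliability targets are not stated here, so they are passed in as parameters, and the function name, dictionary keys and sample figures are all hypothetical.

```python
UNKNOWN_LIMIT = 10.0  # percent; from the "RP Unknown" section of the ticket text


def build_summary(ngi_stats, ava_target, rel_target, unknown_limit=UNKNOWN_LIMIT):
    """Group NGIs for the summary.

    ngi_stats maps an NGI name to a dict with the keys
    'availability', 'reliability' and 'unknown' (all in percent).
    """
    summary = {"underperforming": [], "high_unknown": []}
    for ngi, stats in sorted(ngi_stats.items()):
        if stats["availability"] < ava_target or stats["reliability"] < rel_target:
            summary["underperforming"].append(ngi)
        if stats["unknown"] > unknown_limit:
            summary["high_unknown"].append(ngi)
    return summary


# Hypothetical figures and targets, for illustration only.
stats = {
    "NGI_XX": {"availability": 99.0, "reliability": 99.5, "unknown": 12.0},
    "NGI_YY": {"availability": 60.0, "reliability": 70.0, "unknown": 1.0},
}
summary = build_summary(stats, ava_target=90.0, rel_target=90.0)
print(summary)
```

An NGI can appear in both lists, since the two checks are independent.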


Ticket content

Subject: $NGI_SU - $date - RP/RC OLA violation

Dear $NGI_SU,	

According to the Service targets reports for Resource infrastructure Provider [1] and Resource Center [2] OLA
for $date, the following problems occurred:

============= RC Availability Reliability [2]==========

According to the recent availability/reliability report, the following sites have performed below the Availability target
threshold in 3 consecutive months:
* site name - ava:  rel:
*
*

* The sites will be suspended 10 working days after receiving this ticket unless the NGI intervenes. *

If the NGI intervenes and performance is still below targets 3 days after the intervention, the site will also be suspended.

If you think that the site should not be suspended, please provide justification in this ticket within 10
working days. If the site performance rises above targets within 3 days of providing the explanation,
the site will not be suspended. Otherwise EGI Operations may decide on suspension of the site.

Report: 

============= RP Availability Reliability [1]==========

According to the recent availability/reliability report, your Operations Center has achieved insufficient
performance:
- Availability:
- Reliability:

Report:

============= RP Unknown [1]==========

The Unknown metric for your NGI was higher than 10%.

Note:
Having a high percentage of UNKNOWN status of sites implies that there is no data available regarding the 
site availability for a substantial amount of time. For more information about UNKNOWN status please visit [6]


Report:

============= Top BDII Availability Reliability [1]==========
According to availability/reliability Top Level BDII report your Top Level BDII achieved insufficient 
performance: 
- Availability:
- Reliability:

We would like to kindly ask you to take action to improve your Top-BDII performance.

Note: 
If the top-BDII performance was due to problems with middleware or you believe there are errors in the computation, 
please send a GGUS ticket to the Service Level Management support unit, following the procedure [4]


If you need information on how to set up a highly available Top-BDII, have a look at [5]


Report:

============= SAM Availability Reliability [1]==========

According to availability/reliability SAM report your NGI SAM instance achieved insufficient 
performance:
- Availability:
- Reliability:

We would like to kindly ask you to take action to improve your NGI SAM performance.

Report:

============= ROD performance index [1]==========

According to recent report more than 10 items were not handled by your ROD team according to 
the escalation procedure described in [3]

No. Tickets expired occurrence: 
No. Alarms older than 72h occurrence: 


Report:

============= Quality of Support [1]==========

According to recent report your NGI achieved insufficient Quality of Support performance: 

less urgent (expected 5 working days): 
urgent (expected 5 working days):
very urgent (expected 1 working day):
top priority (expected 1 working day):  


Report:

Links:

[1] https://documents.egi.eu/public/ShowDocument?docid=463

[2] https://documents.egi.eu/public/ShowDocument?docid=31

[3] https://wiki.egi.eu/wiki/PROC01

[4] https://wiki.egi.eu/wiki/PROC10

[5] https://wiki.egi.eu/wiki/MAN05

[6] https://wiki.egi.eu/wiki/Unknown_issue




Please investigate the reasons for the situation described above and provide an explanation for each case in this ticket within 10 days
of receiving it.

If no explanation is provided, EGI Operations will be informed about the situation and may take further steps.



Best Regards,
EGI Operations

More info about Ticket generator for A/R
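
The `$NGI_SU` and `$date` placeholders in the ticket text happen to match the syntax of Python's `string.Template`, so filling them in can be sketched as follows. This is only an illustration under that assumption and says nothing about how the ticket generator itself works; the excerpt is shortened, and the function name and sample values are hypothetical.

```python
from string import Template

# Shortened excerpt of the ticket text above; the full text would be filled in the same way.
TICKET = Template(
    "Subject: $NGI_SU - $date - RP/RC OLA violation\n"
    "\n"
    "Dear $NGI_SU,\n"
)


def render_ticket(ngi_su, date):
    """Fill the $NGI_SU and $date placeholders of the excerpt."""
    # substitute() raises KeyError if the template contains a placeholder
    # that is not supplied, which helps catch incomplete tickets.
    return TICKET.substitute(NGI_SU=ngi_su, date=date)


text = render_ticket("NGI_XX", "Sep 2014")  # hypothetical support unit and month
print(text)
```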