
WI03 RC and RP OLA violation report followup

Revision as of 17:46, 11 November 2014

RC and RP OLA violation work instruction for EGI Operations

This page describes the steps to take when following up RC and RP OLA violations.

General info

  • Receiver: NGI
  • Subject: RC and RP OLA violation
  • Threshold
  • Goal: We expect to see improvement
  • Deadline for answers: 10 days
  • No response

Steps

Starts on the 1st day of the month.

Step 1
Max. duration [work days] (time before moving to next step): 10

Step 1, part 1. Responsible: SLM, CA
  1. SLM: Produce Availability/Reliability report
  2. SLM: Create DOC DB entry and add link to https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics
  3. SLM: Send mail to operations @ egi.eu
  4. CA: Create summary for
    • NGIs: underperforming ava/rel
    • NGIs: high Unknown
    • RCs: underperforming ava/rel for 3 months
  5. CA: Send summary to operations @ egi.eu

Step 1, part 2. Responsible: PS
  1. Create report for NGI Top BDII
  2. Add report to DOC DB entry created by SLM
  3. Create summary and send to operations @ egi.eu

Step 1, part 3. Responsible: MK
  1. Create report for ROD performance index, NGI SAM and QoS
  2. Add reports to DOC DB entry for ROD performance index, NGI SAM (same as for BDII) and QoS
  3. Create summary and send to operations @ egi.eu

Step 2
Max. duration [work days] (time before moving to next step): 10
Responsible: MK or CA
  1. Put summary of reports together
  2. Send email to NOC managers about all reports (link to DOC DB)
  3. Create master and child GGUS tickets against NGIs
  4. Add ticket URL to Underperforming sites and suspensions
  5. Follow up GGUS tickets against NGIs
  6. Update Underperforming sites and suspensions
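
The step durations above are counted in work days. A minimal sketch of how such a deadline could be computed, assuming that work days are Monday to Friday and that public holidays are ignored (both are assumptions not stated on this page; the function name `add_work_days` is hypothetical):

```python
from datetime import date, timedelta


def add_work_days(start: date, work_days: int) -> date:
    """Return the date that lies `work_days` working days after `start`.

    Assumption: working days are Monday-Friday; holidays are not considered.
    """
    current = start
    remaining = work_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


# Step 1 starts on the 1st day of the month and may take up to 10 work days.
step1_start = date(2014, 10, 1)
step1_deadline = add_work_days(step1_start, 10)
print(step1_deadline.isoformat())
```

The same helper could be applied to the 10 work days allowed for Step 2 by passing the date on which Step 1 finished.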


DOC DB content

EGI Availability/Reliability Report for September 2014

Container for Availability/Reliability Report supporting 
RP OLA agreement https://documents.egi.eu/public/ShowDocument?docid=463 
and Resource Center OLA agreement https://documents.egi.eu/public/ShowDocument?docid=31

Ticket content

Subject: $NGI_SU - $date - RP/RC OLA violation

Dear $NGI_SU,	

According to the Service targets reports for Resource infrastructure Provider [1] and Resource Center [2] OLA
for $date, the following problems occurred:

Report: https://documents.egi.eu/public/ShowDocument?docid=2352

============= RC Availability Reliability [2]==========

According to the recent availability/reliability report, the following sites performed below the Availability target
threshold in 3 consecutive months:
* site name - ava:  rel:
*
*

* The sites will be suspended 10 working days after receiving this ticket unless the NGI intervenes. *

If the NGI intervenes and performance is still below targets 3 days after the intervention, the site will also be suspended.

If you think that the site should not be suspended please provide justification in this ticket within 10 
working days. In case the site performance rises above targets within 3 days from providing explanation, 
the site will not be suspended. Otherwise EGI Operations may decide on suspension of the site.


============= RP Availability Reliability [1]==========

According to the recent availability/reliability report, the sites operated by your Operations Center achieved insufficient
average performance:
- Availability:
- Reliability:

We would like to kindly ask you to take action to improve your Resource Infrastructure performance.


============= RP Unknown [1]==========

The Unknown metric in your NGI was higher than 10%.

Note:
Having a high percentage of UNKNOWN status of sites implies that there is no data available regarding the 
site availability for a substantial amount of time. For more information about UNKNOWN status please visit [6]

We would like to kindly ask you to take action to decrease the Unknown metric.

============= Top BDII Availability Reliability [1]==========
According to the Top Level BDII availability/reliability report, your Top Level BDII achieved insufficient
performance:
- Availability:
- Reliability:

We would like to kindly ask you to take an action to improve your Top-BDII performance.

Note: 
If the top-BDII performance was due to problems with middleware or you believe there are errors in the computation, 
please send a GGUS ticket to the Service Level Management support unit, following the procedure [4]


If you need information on how to set up a highly available Top-BDII, have a look at [5]


============= SAM Availability Reliability [1]==========

According to the SAM availability/reliability report, your NGI SAM instance achieved insufficient
performance:
- Availability:
- Reliability:

We would like to kindly ask you to take an action to improve your NGI SAM performance.


============= ROD performance index [1]==========

According to the recent report, more than 10 items were not handled by your ROD team according to
the escalation procedure described in [3].

No. Tickets expired occurrence: 
No. Alarms older than 72h occurrence: 

We would like to kindly ask you to take an action to improve your ROD team performance.

============= Quality of Support [1]==========

According to the recent report, your NGI achieved insufficient Quality of Support performance:

less urgent (expected 5 working days): 
urgent (expected 5 working days):
very urgent (expected 1 working day):
top priority (expected 1 working day):  


**********************

Please investigate the reasons for the situations mentioned above and provide an explanation for each case in this ticket within 10 days
of receiving it.

If no explanation is provided, EGI Operations will be informed about the situation and may take further steps.

**********************


Links:

[1] https://documents.egi.eu/public/ShowDocument?docid=463

[2] https://documents.egi.eu/public/ShowDocument?docid=31

[3] https://wiki.egi.eu/wiki/PROC01

[4] https://wiki.egi.eu/wiki/PROC10

[5] https://wiki.egi.eu/wiki/MAN05

[6] https://wiki.egi.eu/wiki/Unknown_issue

Best Regards,
EGI Operations
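
The ticket sections above are triggered by conditions named in the template: an RC below the Availability target in 3 consecutive months, an Unknown metric higher than 10%, and more than 10 items not handled by the ROD team. A minimal sketch of these checks follows, assuming the monthly figures are available as plain Python values. The function names and the data layout are hypothetical, and summing expired tickets and old alarms into one item count is an assumption. The `target` value must come from the OLAs [1] [2]; it is not defined on this page.

```python
from typing import Dict, List

UNKNOWN_THRESHOLD_PERCENT = 10.0  # "higher than 10%" (RP Unknown section)
ROD_ITEM_THRESHOLD = 10           # "more than 10 items" (ROD section)
CONSECUTIVE_MONTHS = 3            # "3 consecutive months" (RC section)


def rcs_below_target(monthly_availability: Dict[str, List[float]],
                     target: float) -> List[str]:
    """Return sites whose last 3 monthly values are all below `target`.

    `monthly_availability` maps a site name to its monthly availability
    values, oldest first (hypothetical layout).
    """
    flagged = []
    for site, values in sorted(monthly_availability.items()):
        recent = values[-CONSECUTIVE_MONTHS:]
        if len(recent) == CONSECUTIVE_MONTHS and all(v < target for v in recent):
            flagged.append(site)
    return flagged


def unknown_too_high(unknown_percent: float) -> bool:
    """True if the NGI Unknown metric exceeds the 10% threshold."""
    return unknown_percent > UNKNOWN_THRESHOLD_PERCENT


def rod_needs_followup(tickets_expired: int, alarms_older_than_72h: int) -> bool:
    """True if more than 10 items were not handled (assumed: sum of both counts)."""
    return tickets_expired + alarms_older_than_72h > ROD_ITEM_THRESHOLD


# Usage with made-up data; 80.0 is a placeholder, not the OLA target.
history = {"SITE-A": [60.0, 65.0, 50.0], "SITE-B": [60.0, 90.0, 50.0]}
print(rcs_below_target(history, target=80.0))
```

Only sections whose check returns a positive result would then be kept in the ticket sent to the NGI.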

More info about Ticket generator for A/R
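
The ticket template uses `$NGI_SU` and `$date` as placeholders. A minimal sketch of filling them with Python's standard `string.Template`, which uses the same `$name` syntax. This is only an illustration and not necessarily how the ticket generator mentioned above works; the function name `render_header` and the sample values are made up.

```python
from string import Template

# Subject and greeting lines copied from the ticket template.
SUBJECT = Template("$NGI_SU - $date - RP/RC OLA violation")
GREETING = Template("Dear $NGI_SU,")


def render_header(ngi_su: str, report_date: str) -> str:
    """Fill the subject and greeting placeholders of the ticket template."""
    values = {"NGI_SU": ngi_su, "date": report_date}
    # substitute() raises KeyError when a placeholder has no value,
    # which guards against sending a ticket with an unfilled "$date".
    subject = SUBJECT.substitute(values)
    greeting = GREETING.substitute(values)
    return f"Subject: {subject}\n\n{greeting}"


print(render_header("NGI_EXAMPLE", "September 2014"))
```

Using `substitute()` rather than `safe_substitute()` is a deliberate choice here: a missing value stops the rendering instead of leaving a raw placeholder in the ticket.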