Difference between revisions of "WI03 RC and RP OLA violation report followup"
Line 40: | Line 40:
 #'''SLM''': Produce Availability/Reliability report
 #'''SLM:''' Create DOC DB entry and add link to https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics
+#'''CA''': Add ticket URL to [[Underperforming sites and suspensions|Underperforming sites and suspensions]]
 #'''CA''': create summary for
 #*NGIs: underperforming ava/rel
Line 61: | Line 62:
 |
 #Create report for ROD performance index, NGI SAM and QoS <br>
 #Add reports to DOC DB entry for [https://wiki.egi.eu/wiki/ROD_performance_index#Performance_reports ROD performance index], NGI SAM (same as for BDII) and QoS <br>
 #Create summary and send to operations@egi.eu
Line 72: | Line 73:
 |
 #Put summary of reports together<br>
 #Send email to NOC managers about all reports (link to DOC DB)
 #Create and follow-up GGUS tickets against NGIs
+# Update [[Underperforming sites and suspensions|Underperforming sites and suspensions]]
 | <br>
Revision as of 17:36, 13 October 2014
RC and RP OLA violation work instruction for EGI Operations
This page describes the steps EGI Operations should take to follow up on RC and RP OLA violations.
General info
- Receiver: NGI
- Subject: RC and RP OLA violation
- Threshold:
- Goal: We expect to see improvement
- Deadline for answers: 10 days
- No response:
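The deadline runs from the day the NGI receives the ticket, and the ticket text itself counts in working days. A minimal sketch of that calculation follows, assuming a Monday-to-Friday working week and ignoring public holidays (neither is specified on this page). `add_working_days` is a hypothetical helper name, not part of any EGI tool.

```python
from datetime import date, timedelta


def add_working_days(start: date, working_days: int) -> date:
    """Return the date `working_days` working days after `start`.

    Assumption: Saturday and Sunday are the only non-working days.
    """
    current = start
    remaining = working_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


# Ticket received on Monday 13 October 2014, answer due 10 working days later.
deadline = add_working_days(date(2014, 10, 13), 10)
print(deadline.isoformat())  # 2014-10-27
```

A real deployment would probably also need a holiday calendar, which is left out here.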
Steps
Step [#] | Max. Duration [work days] (time before moving to next step) | Responsible | Step
1.1 | 5 | SLM, CA |
1.2 | | PS |
1.3 | | MK |
2 | 10 | MK or CA |
Ticket content
Subject: $NGI_SU - $date - RP/RC OLA violation

Dear $NGI_SU,

According to the Service targets reports for Resource infrastructure Provider [1] and Resource Center [2] OLA
for $date, the following problems occurred:

============= RC Availability Reliability [2]==========
According to the recent availability/reliability report, the following sites have performed below the Availability target
threshold in 3 consecutive months:
* site name - ava: rel:
*
*
* The sites will be suspended 10 working days after receiving this ticket unless the NGI intervenes. *
If the NGI intervenes and performance is still below targets 3 days after the intervention, the site will also be suspended.
If you think that the site should not be suspended, please provide justification in this ticket within 10
working days. If the site performance rises above targets within 3 days of providing the explanation,
the site will not be suspended. Otherwise EGI Operations may decide on suspension of the site.
Report:

============= RP Availability Reliability [1]==========
According to the recent availability/reliability report, your Operations Center has achieved insufficient
performance:
- Availability:
- Reliability:
Report:

============= RP Unknown [1]==========
The Unknown metric for your NGI is higher than 10%.
Note:
A high percentage of UNKNOWN status for sites implies that no data on site availability was available
for a substantial amount of time. For more information about UNKNOWN status please visit [6]
Report:

============= Top BDII Availability Reliability [1]==========
According to the availability/reliability Top Level BDII report, your Top Level BDII achieved insufficient
performance:
- Availability:
- Reliability:
We kindly ask you to take action to improve your Top-BDII performance.
Note:
If the Top-BDII performance was due to problems with middleware or you believe there are errors in the computation,
please send a GGUS ticket to the Service Level Management support unit, following the procedure [4]
If you need information on how to set up a highly available Top-BDII, have a look at [5]
Report:

============= SAM Availability Reliability [1]==========
According to the availability/reliability SAM report, your NGI SAM instance achieved insufficient
performance:
- Availability:
- Reliability:
We kindly ask you to take action to improve your NGI SAM performance.
Report:

============= ROD performance index [1]==========
According to the recent report, more than 10 items were not handled by your ROD team according to
the escalation procedure described in [3]
No. Tickets expired occurrence:
No. Alarms older than 72h occurrence:
Report:

============= Quality of Support [1]==========
According to the recent report, your NGI achieved insufficient Quality of Support performance:
less urgent (expected 5 working days):
urgent (expected 5 working days):
very urgent (expected 1 working day):
top priority (expected 1 working day):
Report:

Links:
[1] https://documents.egi.eu/public/ShowDocument?docid=463
[2] https://documents.egi.eu/public/ShowDocument?docid=31
[3] https://wiki.egi.eu/wiki/PROC01
[4] https://wiki.egi.eu/wiki/PROC10
[5] https://wiki.egi.eu/wiki/MAN05
[6] https://wiki.egi.eu/wiki/Unknown_issue

Please investigate the reasons for the situations mentioned above and provide an explanation for each case in this ticket within 10 days
of receiving it.
If no explanation is provided, EGI Operations will be informed about the situation and may take further steps.

Best Regards,
EGI Operations
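The ticket text quotes three numeric conditions: an Unknown value higher than 10%, more than 10 unhandled ROD items, and availability below target in 3 consecutive months. The sketch below shows one way those checks could be expressed. All function names are hypothetical. The availability target is passed in by the caller because this page does not state its value, and treating the ROD "items" as the sum of the two counters listed in the ticket is an assumption.

```python
from typing import Sequence

UNKNOWN_LIMIT_PERCENT = 10.0  # "higher than 10%" in the RP Unknown section
ROD_ITEM_LIMIT = 10           # "more than 10 items" in the ROD section
CONSECUTIVE_MONTHS = 3        # "3 consecutive months" in the RC section


def unknown_too_high(unknown_percent: float) -> bool:
    """True if the Unknown percentage exceeds the limit quoted in the ticket."""
    return unknown_percent > UNKNOWN_LIMIT_PERCENT


def rod_index_exceeded(tickets_expired: int, alarms_older_72h: int) -> bool:
    """True if more than 10 items were left unhandled.

    Assumption: "items" means expired tickets plus alarms older than 72h.
    """
    return tickets_expired + alarms_older_72h > ROD_ITEM_LIMIT


def below_target_for_three_months(
    monthly_availability: Sequence[float], target: float
) -> bool:
    """True if the three most recent monthly values are all below `target`.

    Assumption: the values are ordered oldest to newest and only the
    latest three months matter.
    """
    recent = monthly_availability[-CONSECUTIVE_MONTHS:]
    return len(recent) == CONSECUTIVE_MONTHS and all(v < target for v in recent)
```

For example, `below_target_for_three_months([90, 60, 65, 50], target=70)` is true because the last three months are all under the (illustrative) target of 70.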
More info about Ticket generator for A/R
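The template uses `$NGI_SU` and `$date` as placeholders, which happens to match the syntax of Python's `string.Template`. The sketch below fills in the subject, the greeting and one RC site line. It is an illustration only and not the code of the ticket generator mentioned above; `render_header`, `rc_site_line` and the sample values are hypothetical.

```python
from string import Template

# Subject and greeting copied from the ticket template on this page.
HEADER = Template(
    "Subject: $NGI_SU - $date - RP/RC OLA violation\n"
    "\n"
    "Dear $NGI_SU,\n"
)


def render_header(ngi_su: str, date: str) -> str:
    """Fill the two placeholders used by the ticket template."""
    return HEADER.substitute(NGI_SU=ngi_su, date=date)


def rc_site_line(site_name: str, availability: float, reliability: float) -> str:
    """Format one entry of the "* site name - ava: rel:" list."""
    return f"* {site_name} - ava: {availability} rel: {reliability}"


# Hypothetical sample values, for illustration only.
print(render_header("NGI_EXAMPLE", "September 2014"))
print(rc_site_line("EXAMPLE-SITE", 65, 70))
```

`Template.substitute` raises `KeyError` when a placeholder is left unfilled, which helps catch a ticket sent with a raw `$date` in it.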