WI03 RC and RP OLA violation report followup

From EGIWiki




RC and RP OLA violation work instruction for EGI Operations

This page describes the steps to take when following up on RC and RP OLA violations.

General info

  • Receiver: NGI
  • Subject: RC and RP OLA violation
  • Threshold
  • Goal: We expect to see improvement
  • Deadline for answers: 10 days
  • No response
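
The answer deadline above can be turned into a calendar date. The following is a minimal sketch, assuming the 10 days are working days (as the ticket template on this page states) and that Monday to Friday are working days; public holidays are ignored. The function name `add_working_days` and the example date are hypothetical.

```python
from datetime import date, timedelta


def add_working_days(start: date, working_days: int) -> date:
    """Return the date that lies `working_days` working days after `start`.

    Assumption: Monday to Friday are working days; public holidays are ignored.
    """
    current = start
    remaining = working_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


# Hypothetical example: a ticket received on Monday 2 November 2015.
deadline = add_working_days(date(2015, 11, 2), 10)
print(deadline.isoformat())  # prints 2015-11-16
```

A real deployment would probably also need a holiday calendar per NGI, which this sketch leaves out.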

Steps

Starts on the 1st day of the month

Each step below lists its number, its maximum duration in work days (the time before moving to the next step), the responsible party, and the actions to take.

Step 1 (max. duration: 10 work days)

1.1 Responsible: SLM, CA
  1. SLM: Produce Availability/Reliability report
  2. SLM: Create DOC DB entry and add link to https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics
  3. SLM: Send mail to operations@egi.eu

1.2 Responsible: PS
  1. Create report for NGI Top BDII
  2. Add report to DOC DB entry created by SLM
  3. Create summary and send to operations@egi.eu

1.3 Responsible: MK
  1. Create report for ROD performance index, NGI SAM and QoS
  2. Add reports for ROD performance index, NGI SAM (same as for BDII) and QoS to DOC DB entry
  3. Create summary and send to operations@egi.eu
  4. Create summary for:
    • NGIs: underperforming availability/reliability
    • NGIs: high Unknown
    • RCs: underperforming availability/reliability for 3 months
  5. CA: Send summary to operations@egi.eu

Step 2 (max. duration: 10 work days)

Responsible: VS
  1. Put summary of reports together
  2. Send email to NOC managers about all reports (link to DOC DB)
  3. Create master and child GGUS tickets against NGIs
  4. Follow up GGUS tickets against NGIs
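
The selection of RCs that underperformed for 3 months can be sketched as a simple filter over monthly figures. This is only an illustration under stated assumptions: the function name `sites_below_target`, the site names, the figures, and the target value of 80.0 are all hypothetical, since this page does not state the threshold.

```python
from typing import Dict, List


def sites_below_target(
    monthly_availability: Dict[str, List[float]],
    target: float,
    months: int = 3,
) -> List[str]:
    """Return sites whose last `months` availability figures are all below `target`."""
    flagged = []
    for site, values in sorted(monthly_availability.items()):
        recent = values[-months:]
        # Sites with fewer than `months` data points are never flagged.
        if len(recent) == months and all(v < target for v in recent):
            flagged.append(site)
    return flagged


# Hypothetical figures in percent, oldest month first; the target is a placeholder.
history = {
    "SITE-A": [95.0, 60.0, 55.0, 40.0],
    "SITE-B": [70.0, 85.0, 65.0, 50.0],
    "SITE-C": [50.0, 45.0],
}
print(sites_below_target(history, target=80.0))  # prints ['SITE-A']
```

Here SITE-B is not flagged because one of its last three months met the placeholder target, and SITE-C is not flagged because it has fewer than three months of data.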


NGI managers email (example)

Subject: EGI RC OLA and RP OLA Reports for October 2014

Content: Dear NGI managers

Please find the EGI RC OLA and RP OLA Reports for October 2014 at:

https://documents.egi.eu/public/ShowDocument?docid=2352

The entry includes the following reports:
* EGI Cloud RC A/R
* EGI RP/RC A/R/U
* EGI RP Quality of Support
* EGI RP ROD performance index
* EGI RP SAM A/R
* EGI RP Top-BDII A/R

Best Regards

DOC DB content (example)

Title: EGI RC OLA and RP OLA Reports for October 2014

Abstract: Container for reports supporting 

Resource Centre Operational Level Agreement

https://documents.egi.eu/document/31

and

Resource infrastructure Provider Operational Level Agreement

https://documents.egi.eu/document/463

GGUS Ticket content (template)

$NGI - $MONTH $YEAR - RP/RC OLA performance

Dear NGI/ROC,	

According to the service target reports for the Resource infrastructure Provider [1] and Resource Centre [2] OLAs for September, the following problems occurred:

EGI RC OLA and RP OLA Reports for September 2015: https://documents.egi.eu/public/ShowDocument?docid=2607

============= RC Availability Reliability [2]==========

According to the recent availability/reliability report, the following sites performed below the Availability target threshold in 3 consecutive months (January, February, March):

$SITE
$SITE
$SITE 

* Within 10 working days of receiving this ticket, the NGI can either suspend the site or ask that it not be suspended by providing an adequate explanation. If no answer is provided to this ticket, the NGI will be contacted by email; if no reply to the email is provided, the site will be suspended [6].

If the NGI intervenes and performance is still below targets 3 days after the intervention, the site will also be suspended.

If you think that the site should not be suspended, please provide justification in this ticket within 10 working days. If the site performance rises above targets within 3 days of the explanation being provided, the site will not be suspended. Otherwise EGI Operations may decide to suspend the site.


============= RP Availability Reliability [1]==========

According to the recent availability/reliability report, the sites operated by your Operations Centre achieved insufficient average performance:
- Availability:
- Reliability:

We kindly ask you to take action to improve your Resource Infrastructure performance.

============= Top BDII Availability Reliability [1]==========
According to the Top Level BDII availability/reliability report, your Top Level BDII achieved insufficient performance:
- Availability:
- Reliability:

We kindly ask you to take action to improve your Top-BDII performance.

Note: 
If the Top-BDII performance was caused by middleware problems, or you believe there are errors in the computation, please send a GGUS ticket to the Service Level Management support unit, following the procedure in [4].


If you need information on how to set up a highly available Top-BDII, have a look at [5]


============= Quality of Support [1]==========

According to the recent report, your NGI achieved insufficient Quality of Support performance:

less urgent (expected 5 working days): 
urgent (expected 5 working days):
very urgent (expected 1 working day):
top priority (expected 1 working day):  

Please investigate the reasons for the situation described above and provide an explanation for each case in this ticket within 10 days of receiving it.

If no explanation is provided, EGI Operations will be informed about the situation and may take further steps.

**********************


Links:

[1] https://documents.egi.eu/public/ShowDocument?docid=463 "Resource infrastructure Provider Operational Level Agreement"

[2] https://documents.egi.eu/public/ShowDocument?docid=31 "Resource Centre Operational Level Agreement"

[3] https://wiki.egi.eu/wiki/PROC01 "EGI Infrastructure Oversight escalation"

[4] https://wiki.egi.eu/wiki/PROC10 "Recomputation of SAM results or availability reliability statistics"

[5] https://wiki.egi.eu/wiki/MAN05 "top-BDII and site-BDII High Availability"

[6] https://wiki.egi.eu/wiki/PROC04 "Quality verification of monthly availability and reliability statistics"

Best Regards,
EGI Operations

More info about Ticket generator for A/R
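
The $NGI, $MONTH and $YEAR placeholders in the ticket subject can be filled in mechanically. The following is a minimal sketch, assuming Python's `string.Template`, whose `$NAME` syntax happens to match the placeholders used in the template; it is not the EGI ticket generator itself. The function name `ticket_subject` and the NGI name `NGI_EXAMPLE` are hypothetical.

```python
from string import Template

# Subject line copied from the GGUS ticket template on this page.
SUBJECT = Template("$NGI - $MONTH $YEAR - RP/RC OLA performance")


def ticket_subject(ngi: str, month: str, year: int) -> str:
    """Fill the $NGI, $MONTH and $YEAR placeholders of the subject line."""
    # substitute() raises KeyError if a placeholder is left without a value.
    return SUBJECT.substitute(NGI=ngi, MONTH=month, YEAR=year)


# "NGI_EXAMPLE" is a made-up name used only for illustration.
print(ticket_subject("NGI_EXAMPLE", "September", 2015))
# prints: NGI_EXAMPLE - September 2015 - RP/RC OLA performance
```

Using `substitute` rather than `safe_substitute` means a missing value fails loudly instead of leaving a raw placeholder in a ticket sent to an NGI.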