The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other supports; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team at operations @ egi.eu.

Difference between revisions of "WI03 RC and RP OLA violation report followup"

From EGIWiki
 
(82 intermediate revisions by 7 users not shown)
Line 1: Line 1:
{{Template:Op menubar}} {{TOC_right}}  
{{Template:Op menubar}} {{Template:GO menubar}} {{TOC_right}}  
[[Category:Deprecated]]
{| style="border:1px solid black; background-color:lightgrey; color: black; padding:5px; font-size:140%; width: 90%; margin: auto;"
| style="padding-right: 15px; padding-left: 15px;" |
|[[File:Alert.png]] This article is '''Deprecated''' and should no longer be used, but is still available for reasons of reference.
|}
<br>


= Internal procedure for COD - '''Availability and reliability work instruction for COD''' =
= RC and RP OLA&nbsp;violation work instruction for EGI Operations =


This page describes steps which should be taken by COD shifter to follow availability/reliability issues.  
This page describes steps which should be taken to follow RC and RP&nbsp;OLA violation issues.  


<br> When GGUS ticket about availability/reliability metrics is assigned to COD:
== General info  ==


<br>  
*'''Receiver''': NGI
*'''Subject''': RC and RP&nbsp;OLA violation<br>
*'''Threshold''':&nbsp;
*'''Goal''': We expect to see improvement
*'''Deadline for answers''': 10 days
*'''No response''':&nbsp;
 
== Steps  ==
 
Starts on the 1st day of the month<br>  


{| align="center" cellspacing="0" cellpadding="5" border="1"
{| border="1" class="wikitable"
|-
! Timelines
! Step
! Substep
! Description
|-
|-
| '''Step [#]'''
| <br>
|  
|  
| 1
'''Max. Duration [work days]'''  
|
| Add ticket url to [https://wiki.egi.eu/wiki/Underperforming_sites_and_suspensions Underperforming_sites_and_suspensions] page
|-
|
| 2
|
| Ava/Rel report review
|-
|
|
| 1
| Prepare the ''''sites for suspension'''' list: look at the availability metrics for the two previous months and the current one in the AR report. If all three are below 70%, the site qualifies for suspension.
Check if the site was mentioned in [https://wiki.egi.eu/wiki/List_of_underperforming_sites List of sites for which the availability followup procedures were not applicable] page. In some cases there could be no need to open a ticket.


|-
(time before moving to next step)
|
|
| 2
| Prepare the ''''sites to be asked for explanation'''' list: look at the current month in the AR report. If Ava. is below 70% or Rel. is below 75%, the site qualifies to be asked for an explanation. This list should be prepared according to the input file requirements of the [https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD#How_to_use_ticket_generator ticket generator].
Check if the site was mentioned in [https://wiki.egi.eu/wiki/List_of_underperforming_sites List of sites for which the availability followup procedures were not applicable] page. In some cases there could be no need to open a ticket.


| '''Responsible'''
| '''Step'''
|-
|-
|  
| rowspan="3" | 1 <br>
| 3  
|
| Create tickets for each case as a child to the tickets assigned to COD
|-
|
|  
| 1  
| 1  
| For ''''sites for suspension'''' list please use [https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD#How_to_use_ticket_generator ticket generator]
| rowspan="3" | 10<br>
|-
| SLM, CA
|  
|  
|
#'''SLM''':&nbsp;Produce '''Availability/Reliability report'''  
| 2
#'''SLM:'''&nbsp;Create DOC&nbsp;DB&nbsp;entry and add link to&nbsp; https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics
| For ''''sites to be asked for explanation'''' list please use [https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD#How_to_use_ticket_generator ticket generator]
#'''SLM''': Send mail to operations @ egi.eu
|-
| '''Within''' 10 working days from when the tickets are created.  
| 4
|
|
'''Handling of sites below targets'''


When explanation is provided and is found satisfactory put as a solution of the ticket
<br>  
<pre>'The explanation is satisfactory. Thank you!'. </pre>
After that you should set child ticket to 'verified' status.


|-
|-
| '''After''' 10 working days from when the tickets are created.
| 2<br>
| 5
| PS
|  
|  
| Final actions.
#Create report for '''NGI&nbsp;Top BDII'''<br>
#Add report to DOC&nbsp;DB entry created by SLM
#Create summary and send to operations @ egi.eu
 
|-
|-
| 3
| MK
|  
|  
|
#Create report for '''ROD performance index, NGI&nbsp;SAM&nbsp; and QoS '''<br>
| 1
#Add reports to DOC&nbsp;DB entry for [https://wiki.egi.eu/wiki/ROD_performance_index#Performance_reports ROD performance index], NGI&nbsp;SAM (same as for BDII)&nbsp; and QoS <br>
| '''Handling of sites that are eligible for suspension'''  
#Create summary and send to operations @ egi.eu
*in the case of '''no''' NGI intervention, the site is suspended in GOC DB - as a reason put a link to GGUS ticket created for the site
#Create summary for
#*'''NGIs: underperforming ava/rel'''
#*'''NGIs: high Unknown'''
#*'''RCs: underperforming ava/rel for 3 months'''
#'''CA''':&nbsp;Send summary to operations @ egi.eu
*in the case of NGI intervention, non suspension will occur if both the COD and COO agree on the reasoning provided by the NGI
**COO should be involved to the ticket


|-
|-
| 2<br>
| <br>
| 10<br>
| VS<br>
|  
|  
|
#Put summary of reports together<br>
| 2
#Send email to NOC&nbsp;managers about all reports (link to DOC&nbsp;DB)
| '''Handling of sites below targets'''
#Create master and child GGUS tickets against NGIs
If the explanation is not given in due time, or the explanation is found inadequate, COD sends mail to the NGI/ROC manager with CC to ROD and GGUS:
#Follow-up GGUS tickets against NGIs


*informing that NGI/ROC manager should make the site react on the ticket or suspend the site within 3 days
|}
*if the NGI does not react, COD will suspend the site on the 4th day.
<pre>Dear XX


I would like to inform you that 10 working days have passed.
<br>
Please make the site react on the ticket or suspend the site within 3 days.
If the NGI does not react, COD will suspend the site on the 4th day.


Best Regards
=== NGI managers email (example)<br>  ===
XXX
<pre>Subject:&nbsp;EGI RC OLA and RP OLA Reports for October 2014
On behalf of COD team
</pre>
|-
|
| 6
|
| Prepare summary report (it should be placed in parent ticket):  
#sites which are not responsive and did not provide a satisfactory explanation
#sites which were suspended
#ROCs/NGIs which are not responsive
#...


|-
Content:&nbsp;Dear NGI managers
|
| 7
|
| Update [https://wiki.egi.eu/wiki/List_of_underperforming_sites List of sites for which the availability followup procedures were not applicable] page. Put here outstanding cases which should be recorded. This could be used for example to avoid opening a ticket next month for a solved issue.
|-
|
| 8
|
| Update [https://wiki.egi.eu/wiki/Underperforming_sites_and_suspensions Underperforming_sites_and_suspensions] page.
|}


= Questions/issues  =
Please find EGI RC OLA and RP OLA Reports for October 2014 under:


''MR: what do we do with sites marked with "n/a"?''
https://documents.egi.eu/public/ShowDocument?docid=2352


''MK: we don't take into account months with "N/A" ''
Entry includes reports:
* EGI Cloud RC A/R
* EGI RP/RC A/R/U
* EGI RP Quality of Support
* EGI RP ROD performance index
* EGI RP SAM A/R
* EGI RP Top-BDII A/R


<br> <span style="color: rgb(255, 0, 0);">'''VERY IMPORTANT'''</span>  
Best Regards
</pre>  
=== DOC&nbsp;DB content (example) ===
<pre>Title:&nbsp;EGI RC OLA and RP OLA Reports for October 2014


<span style="background: none repeat scroll 0% 0% rgb(255, 0, 0);"> In grid view NGIs/ROCs are named differently than in GGUS. You should change the NGI/ROC name according to GGUS.</span>
Abstract:&nbsp;Container for reports supporting


<br>
Resource Centre Operational Level Agreement


{| align="center" cellspacing="0" cellpadding="5" border="1"
https://documents.egi.eu/document/31
|-
! GGUS
! Gridview
|-
| ROC_DECH
| GermanySwitzerland
|-
| NGI_FRANCE
| NGI_France
|-
| ROC_Asia/Pacific
| AsiaPacific
|-
| ROC_Italy
| Italy
|-
| ROC_CERN
| CERN
|-
| ROC_Russia
| Russia
|-
| ROC_North
| NorthernEurope
|-
| ROC_UK/Ireland
| UKI
|-
| ROC_SE
| SouthEasternEurope
|-
| ROC_SW
| SouthWesternEurope
|}


= Tickets content  =
and


== Request for explanation  ==
Resource infrastructure Provider Operational Level Agreement
<pre>Subject:$SU/$siteName - availability/reliability statistics for $date


Dear $SU,
https://documents.egi.eu/document/463
</pre>
=== GGUS&nbsp;Ticket content (template)  ===
<pre>$NGI - $MONTH $YEAR - RP/RC OLA performance


According to recent availability/reliability report $siteName has achieved
Dear NGI/ROC,
poor performance Ava. $availability  Rel. $realiability.
More details: https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics.


Could you please provide explanations for poor performance of the $siteName site?
the EGI RC OLA and RP OLA Report for November 2020 has been produced and is available at the following links:
- NGIs reports: http://egi.ui.argo.grnet.gr/egi/report-ar/Critical/NGI?month=2020-11 (clicking on the NGI name displays the resource centres' A/R figures)
- RCs reports: http://egi.ui.argo.grnet.gr/egi/report-ar/Critical/SITES?month=2020-11


Your explanation must be returned within 10 working days from when the ticket is created.
According to the Service targets reports for Resource infrastructure Provider [1] and Resource Centre[2] OLAs, the following problems occurred:
If the explanation is not given in due time, or the explanation is found inadequate,
COD escalation procedure will be followed https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure


If the site was certified during last month please close this ticket and
============= RC Availability Reliability [2]==========
put this info in a ticket solution field. There is known bug in report
generation tool being worked on.


According to recent availability/reliability report following sites have achieved insufficient performance below Availability target threshold in 3 consecutive months (September, October, and November):


Best Regards,
$SITE
EGI Central Operator on Duty
$SITE
</pre>
$SITE
== Site for suspension  ==
<pre>Subject:$SU/$siteName site suspension


Dear $SU,
* During the 10 working days after receiving this ticket the NGI can suspend the site or ask not to suspend the site by providing an adequate explanation. If no answer is provided to this ticket, the NGI will be contacted by email; if no reply is provided to the email, the site will be suspended[6].


According to recent availability/reliability report $siteName has achieved
If the NGI intervenes and performance is still below targets 3 days after the intervention, the site will also be suspended.
poor performance below target Ava. 50% or Rel. 50% in three consecutive months.
More details: https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics.


According to procedures approved on OMB 17.08, the site will be suspended within 10 working days unless the NGI intervenes.
If you think that the site should not be suspended please provide justification in this ticket within 10 working days. In case the site performance rises above targets within 3 days from providing explanation, the site will not be suspended. Otherwise EGI Operations may decide on suspension of the site.
If you think that the site should not be suspended please provide justification within 10 working days.


Best Regards,
EGI Central Operator on Duty
</pre>
= How to use ticket generator  =


current version of the script: 3.0
============= Quality of Support [1]==========


features:  
According to recent report your NGI achieved insufficient Quality of Support performance:  


*bulk child ticket creation
less urgent (expected 5 working days):
*'assigned to' set
urgent (expected 5 working days):
*'affected site' set
very urgent (expected 1 working day):
*'type of problem' set to Operations
top priority (expected 1 working day): 


<br>
In order to see the details (https://wiki.egi.eu/wiki/Service_Level_Target_-_Quality_of_Support ):


<br>
1) go on https://ggus.eu/?mode=report_view
2) choose "response times" or "tickets submitted" and the proper timeframe
3) select your NGI
4) group by Responsible unit and priority
5) click on the lines displayed for getting the tickets details


*'''Configure the script'''.


At the beginning of the start-explanations.pl/start-suspend.pl file you have to fill in the following variables:
Please investigate the reasons for the situation mentioned above and provide an explanation for each case in this ticket within 10 days from receiving it.  
<pre># PRODUCTION
my $endpoint = "https://gusiwr.fzk.de/arsys/services/ARService?server=gusiwr&amp;webService=Grid_HelpDesk";
my $user = ""; # login to GGUS web-services
my $pass = ""; # password to GGUS web-services


# Submitter data, Those data will be used as submitter's data to create tickets
If no explanation is provided, EGI Operations will be informed about the situation and may take further steps.
my $Mail = ""; # your email address
my $DN = "";  # your DN
my $Name = ""; # Name and Surname
</pre>
<br>


*'''Prepare input file.'''
**********************


The plain-text input file format for both scripts is as follows:


''ROC/NGI support unit in GGUS; Site name; Availability; Reliability;''
Links:


Remember that each line should contain exactly one site and the number of semicolons should always be 4. For the start-suspend.pl script the Availability and Reliability values are omitted, but the semicolons are still necessary.  
[1] https://documents.egi.eu/public/ShowDocument?docid=463 "Resource infrastructure Provider Operational Level Agreement"


example:  
[2] https://documents.egi.eu/public/ShowDocument?docid=31 "Resource Centre Operational Level Agreement"
<pre>NGI_PL; CYFRONET_LCG2; 50%; 10%;
NGI_PL; IFJ-PAN; 15%; 3%;
</pre>
*'''Execute the tool'''


Log in to a machine with Perl installed and execute the script as follows:  
[3] https://wiki.egi.eu/wiki/PROC01 "EGI Infrastructure Oversight escalation"


''perl start-explanations.pl/start-suspend.pl PARENT_TICKET_ID "DATE" FILE_NAME''
[4] https://wiki.egi.eu/wiki/PROC10 "Recomputation of SAM results or availability reliability statistics"


PARENT_TICKET_ID - number of "Availability/reliability statistics for *" ticket
[5] https://wiki.egi.eu/wiki/MAN05 "top-BDII and site-BDII High Availability"


DATE - date of the report. Format: "month year"  
[6] https://wiki.egi.eu/wiki/PROC04 "Quality verification of monthly availability and reliability statistics"


FILE_NAME - file with input availability/reliability data
Best Regards,
EGI Operations


example:
<pre>  perl start-explanations.pl 4121 "May 2010" dane.txt
</pre>  
</pre>  
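
The selection rules and the input file format described on this page can be sketched in a few lines of Python. This is only an illustration, not part of the ticket generator: the function names are hypothetical, the 70% and 75% thresholds are the ones quoted in the steps of the older revision, and the exact spacing of the line with omitted values for start-suspend.pl is an assumption (only the count of 4 semicolons is stated on this page).

```python
from typing import Optional, Sequence

# Thresholds quoted in the older revision of this page (percent).
AVAILABILITY_TARGET = 70.0
RELIABILITY_TARGET = 75.0


def needs_explanation(availability: float, reliability: float) -> bool:
    """Current month only: Ava. below 70% or Rel. below 75%."""
    return availability < AVAILABILITY_TARGET or reliability < RELIABILITY_TARGET


def qualifies_for_suspension(availabilities: Sequence[float]) -> bool:
    """Two previous months plus the current one, all below 70%.

    Months reported as "n/a" are not handled here; the caller must
    pass exactly three numeric values.
    """
    return len(availabilities) == 3 and all(
        a < AVAILABILITY_TARGET for a in availabilities
    )


def format_input_line(
    support_unit: str,
    site: str,
    availability: Optional[float] = None,
    reliability: Optional[float] = None,
) -> str:
    """Build one input line: 'SU; site; Ava; Rel;' with exactly 4 semicolons.

    Leaving both values as None produces empty fields, which is how this
    sketch assumes the start-suspend.pl input looks.
    """
    fields = [
        support_unit,
        site,
        "" if availability is None else f"{availability:g}%",
        "" if reliability is None else f"{reliability:g}%",
    ]
    return "; ".join(fields) + ";"


def is_valid_input_line(line: str) -> bool:
    """Check the one rule stated on this page: exactly 4 semicolons per line."""
    return line.count(";") == 4
```

For example, `format_input_line("NGI_PL", "CYFRONET_LCG2", 50, 10)` reproduces the first line of the example input file above.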
= Best practice  =
More info about [[Ticket generator Availability Reliability|Ticket generator for A/R]]


*If the site explains that the site administrator was on holiday, put as a solution: "This time the explanation is found satisfactory, although in the future, in case of administrator holidays, the site should provide a deputy administrator. If this is not possible, the NGI should put the failing site in downtime. Thank you!". Close the ticket and verify it.
<br>


[[Category:COD]]
[[Category:Infrastructure_Oversight]]

Latest revision as of 16:18, 30 July 2021



This article is Deprecated and should no longer be used, but is still available for reasons of reference.


RC and RP OLA violation work instruction for EGI Operations

This page describes steps which should be taken to follow RC and RP OLA violation issues.

General info

  • Receiver: NGI
  • Subject: RC and RP OLA violation
  • Threshold
  • Goal: We expect to see improvement
  • Deadline for answers: 10 days
  • No response
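
The answer deadline can be computed with a short sketch like the one below. It assumes the deadline is counted in working days, as the ticket text further down states, and that a working day is simply Monday to Friday; public holidays are not considered, and the function name is hypothetical.

```python
from datetime import date, timedelta


def add_working_days(start: date, working_days: int) -> date:
    """Return the date that lies `working_days` working days after `start`.

    Assumption: Saturday and Sunday are the only non-working days.
    """
    current = start
    remaining = working_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            remaining -= 1
    return current


# A ticket created on Monday 7 December 2020 is due on Monday 21 December 2020.
deadline = add_working_days(date(2020, 12, 7), 10)
```

If local holidays matter, they would have to be passed in and skipped in the same loop.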

Steps

Starts on the 1st day of the month

Step [#] | Max. Duration [work days] (time before moving to next step) | Responsible | Step
1
1 10
SLM, CA
  1. SLM: Produce Availability/Reliability report
  2. SLM: Create DOC DB entry and add link to  https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics
  3. SLM: Send mail to operations @ egi.eu


2
PS
  1. Create report for NGI Top BDII
  2. Add report to DOC DB entry created by SLM
  3. Create summary and send to operations @ egi.eu
3 MK
  1. Create report for ROD performance index, NGI SAM  and QoS
  2. Add reports to DOC DB entry for ROD performance index, NGI SAM (same as for BDII)  and QoS
  3. Create summary and send to operations @ egi.eu
  4. Create summary for
       • NGIs: underperforming ava/rel
       • NGIs: high Unknown
       • RCs: underperforming ava/rel for 3 months
  5. CA: Send summary to operations @ egi.eu
2

10
VS
  1. Put summary of reports together
  2. Send email to NOC managers about all reports (link to DOC DB)
  3. Create master and child GGUS tickets against NGIs
  4. Follow-up GGUS tickets against NGIs


NGI managers email (example)

Subject: EGI RC OLA and RP OLA Reports for October 2014

Content: Dear NGI managers

Please find EGI RC OLA and RP OLA Reports for October 2014 under:

https://documents.egi.eu/public/ShowDocument?docid=2352

Entry includes reports:
* EGI Cloud RC A/R
* EGI RP/RC A/R/U
* EGI RP Quality of Support
* EGI RP ROD performance index
* EGI RP SAM A/R
* EGI RP Top-BDII A/R

Best Regards

DOC DB content (example)

Title: EGI RC OLA and RP OLA Reports for October 2014

Abstract: Container for reports supporting 

Resource Centre Operational Level Agreement

https://documents.egi.eu/document/31

and

Resource infrastructure Provider Operational Level Agreement

https://documents.egi.eu/document/463

GGUS Ticket content (template)

$NGI - $MONTH $YEAR - RP/RC OLA performance

Dear NGI/ROC,	

the EGI RC OLA and RP OLA Report for November 2020 has been produced and is available at the following links:
- NGIs reports: http://egi.ui.argo.grnet.gr/egi/report-ar/Critical/NGI?month=2020-11 (clicking on the NGI name displays the resource centres' A/R figures)
- RCs reports: http://egi.ui.argo.grnet.gr/egi/report-ar/Critical/SITES?month=2020-11

According to the Service targets reports for Resource infrastructure Provider [1] and Resource Centre[2] OLAs, the following problems occurred:

============= RC Availability Reliability [2]==========

According to recent availability/reliability report following sites have achieved insufficient performance below Availability target threshold in 3 consecutive months (September, October, and November):

$SITE
$SITE
$SITE 

* During the 10 working days after receiving this ticket the NGI can suspend the site or ask not to suspend the site by providing an adequate explanation. If no answer is provided to this ticket, the NGI will be contacted by email; if no reply is provided to the email, the site will be suspended[6].

If the NGI intervenes and performance is still below targets 3 days after the intervention, the site will also be suspended.

If you think that the site should not be suspended please provide justification in this ticket within 10 working days. In case the site performance rises above targets within 3 days from providing explanation, the site will not be suspended. Otherwise EGI Operations may decide on suspension of the site.


============= Quality of Support [1]==========

According to recent report your NGI achieved insufficient Quality of Support performance: 

less urgent (expected 5 working days): 
urgent (expected 5 working days):
very urgent (expected 1 working day):
top priority (expected 1 working day):  

In order to see the details (https://wiki.egi.eu/wiki/Service_Level_Target_-_Quality_of_Support ):

1) go on https://ggus.eu/?mode=report_view
2) choose "response times" or "tickets submitted" and the proper timeframe
3) select your NGI
4) group by Responsible unit and priority
5) click on the lines displayed for getting the tickets details


Please investigate the reasons for the situation mentioned above and provide an explanation for each case in this ticket within 10 days from receiving it.

If no explanation is provided, EGI Operations will be informed about the situation and may take further steps.

**********************


Links:

[1] https://documents.egi.eu/public/ShowDocument?docid=463 "Resource infrastructure Provider Operational Level Agreement"

[2] https://documents.egi.eu/public/ShowDocument?docid=31 "Resource Centre Operational Level Agreement"

[3] https://wiki.egi.eu/wiki/PROC01 "EGI Infrastructure Oversight escalation"

[4] https://wiki.egi.eu/wiki/PROC10 "Recomputation of SAM results or availability reliability statistics"

[5] https://wiki.egi.eu/wiki/MAN05 "top-BDII and site-BDII High Availability"

[6] https://wiki.egi.eu/wiki/PROC04 "Quality verification of monthly availability and reliability statistics"

Best Regards,
EGI Operations
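
The placeholders in the ticket template above ($NGI, $MONTH, $YEAR, $SITE) could be filled in along the following lines. This is a minimal sketch using Python's standard `string.Template`, not the way the tickets are actually generated; the function names are hypothetical, and the report URLs simply follow the pattern of the two links quoted in the template.

```python
from string import Template
from typing import Dict, Iterable

# Subject line exactly as it appears in the template above.
SUBJECT = Template("$NGI - $MONTH $YEAR - RP/RC OLA performance")

# Pattern taken from the two report links quoted in the template.
REPORT_URL = "http://egi.ui.argo.grnet.gr/egi/report-ar/Critical/{kind}?month={year:04d}-{month:02d}"


def render_subject(ngi: str, month_name: str, year: int) -> str:
    """Fill the $NGI, $MONTH and $YEAR placeholders of the subject line."""
    return SUBJECT.substitute(NGI=ngi, MONTH=month_name, YEAR=year)


def render_site_list(sites: Iterable[str]) -> str:
    """Replace the repeated $SITE lines with one site name per line."""
    return "\n".join(sites)


def report_urls(year: int, month: int) -> Dict[str, str]:
    """Build the NGI and RC report links for the given month."""
    return {
        kind: REPORT_URL.format(kind=kind, year=year, month=month)
        for kind in ("NGI", "SITES")
    }
```

`Template.substitute` raises `KeyError` when a placeholder is left unfilled, which helps avoid sending a ticket with a literal `$NGI` in its subject.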

More info about Ticket generator for A/R