https://wiki.egi.eu/w/api.php?action=feedcontributions&user=Dzila&feedformat=atomEGIWiki - User contributions [en]2024-03-28T20:35:02ZUser contributionsMediaWiki 1.37.1https://wiki.egi.eu/w/index.php?title=EGI-InSPIRE:Plan_2012_SA1.8&diff=30025EGI-InSPIRE:Plan 2012 SA1.82011-12-19T10:10:53Z<p>Dzila: /* DTEAM VO Services */</p>
<hr />
<div>= Plans 2012 SA1.8 =<br />
<br />
== Assessment of progress, 2011 ==<br />
<br />
=== Core Grid Services ===<br />
<br />
==== DTEAM VO Services ====<br />
<br />
The migration of the DTEAM VO was finalized in January 2011. The DTEAM VO is served by 2 geographically distributed VOMS servers in Thessaloniki and Athens (voms.hellasgrid.gr and voms2.hellasgrid.gr). During this year 7 NGI groups were created in the DTEAM VO (NGI_FI, NGI_NDGF, NGI_DE, NGI_IT, NGI_IE, NGI_UK, NGI_ZA) and 3 ROC groups were decommissioned (ROC_Italy, SEE, dech).<br />
<br />
==== EGI Catch All CA ====<br />
<br />
During 2011 the EGI Catch All CA set up three new Registration Authorities: in Senegal, in Egypt, and for SixSq (a partner in StratusLab) in Switzerland. This brings the total number of RAs to 7.<br />
<br />
==== Core Services for Site Certification ====<br />
<br />
A TOP-BDII, a WMS and an LB service were installed as catch-all services for NGIs that do not operate their own services for the site certification process. In addition, a portal was built that syncs with GOCDB and allows NGI Managers to add and remove uncertified sites from the catch-all TOP-BDII on demand. <br />
<br />
<br />
=== Operations tool and availability computation ===<br />
<br />
==== Propose Changes for Operations tools ====<br />
<br />
An assessment of the operations tools was completed and the results were presented at the EGI Technical Conference in Lyon.<br />
<br />
https://wiki.egi.eu/wiki/POEM_and_ACE_requirements<br />
<br />
==== Data more readily available to NGIs ====<br />
<br />
This has been provided by MyEGI. Improvements may be suggested as more experience is gained from its usage.<br />
<br />
==== Follow-up with developers for issues that affect accuracy ====<br />
<br />
A high number of UNKNOWN statuses is reported by certain NGI Nagios instances / sites. This is still under investigation, but it seems to involve mostly NGI Nagios operations rather than the developers. This is an ongoing activity.<br />
<br />
=== Operational Level Agreements (OLAs) ===<br />
<br />
==== MSA 411 ====<br />
<br />
The milestone MSA 411 "Operational Level Agreements within the EGI Production Infrastructure" was achieved during 2011.<br />
<br />
https://documents.egi.eu/document/524<br />
<br />
<br />
==== Continue adaptations to the OLA between NGI and sites ====<br />
<br />
The RC OLA has been finalized and is available at: <br />
<br />
https://documents.egi.eu/document/31<br />
<br />
==== Produce OLA between EGI and NGIs, as well as a Core services OLA ====<br />
<br />
The RP OLA, which was started during 2011, partially covers this: the NGI responsibilities include the services that the NGI provides as core services. This work is ongoing, since more service thresholds should be included in this OLA as the tools evolve. The first release of the RP OLA was finalized in 2011 and the second release will follow in early 2012.<br />
<br />
https://documents.egi.eu/document/463<br />
<br />
In 2012 the EGI.eu OLA will cover the services offered by EGI.<br />
<br />
==== Propose an OLA amendment procedure (Spring 2011) ====<br />
<br />
This action was not completed, as the OLAs were not finalized. It is an action for 2012.<br />
<br />
==== Evaluate the impact of increased availability suspension threshold ====<br />
<br />
During 2011 TSA1.8 evaluated the impact of increasing the availability suspension threshold. The results of the evaluation were presented at the Technical Forum in Lyon:<br />
<br />
https://www.egi.eu/indico/conferenceDisplay.py?confId=267<br />
<br />
<br />
==== Reconvene with the OLA task force at least once per 2 months ====<br />
<br />
A fixed schedule was not really needed. Depending on the requirements, sometimes 2 meetings took place within 1 month, as the TF work has to go through the OMB for approval and any additional comments have to be addressed.<br />
<br />
==== Availability/Reliability ====<br />
<br />
TSA1.8 is responsible for the distribution of monthly league tables and continues to add useful material to the wiki:<br />
<br />
https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics <br />
<br />
The investigation of whether advancements in the operational tools can simplify the procedure is an ongoing activity and will continue in 2012:<br />
<br />
https://rt.egi.eu/guest/Ticket/Display.html?id=289<br />
<br />
Regarding the investigation of the prime causes of site failures: this is ongoing. The first step is to determine the causes of the high percentage of UNKNOWN states in NGI Nagios (mentioned above under the accuracy issues) before going deeper into sites. Categorization of site replies to COD tickets for the reports could start in 2012. The initial results of the investigation show that the problems are mostly related to operations.<br />
<br />
== Plans for 2012 ==<br />
<br />
=== Core Grid Services ===<br />
<br />
==== DTEAM VO Services ====<br />
<br />
The plan for 2012 is to finalize the decommissioning of the legacy ROC groups (ROC_Benelux, ROC_France, ROC_UKI). Currently the DTEAM VO services are provided using the VOMRS service. It will be investigated whether the new VOMS service provides all the needed functionality.<br />
<br />
==== EGI Catch All CA ====<br />
<br />
Continue the support and operation of the EGI Catch All CA and the expansion of the RA Network as needed.<br />
<br />
==== Core Services for Site Certification ====<br />
<br />
Continue the support and operation of the Site Certification Core Services.<br />
<br />
=== Operations tool and availability computation ===<br />
<br />
==== Follow-up with developers for issues that affect accuracy ====<br />
<br />
Continue the investigation of the relatively high number of UNKNOWN statuses from certain NGI Nagios instances / sites. Target date 2012Q2.<br />
<br />
=== Operational Level Agreements (OLAs) ===<br />
<br />
==== MSA 418 ====<br />
<br />
The milestone MSA 418 "Operational Level Agreements (OLAs) within the EGI production infrastructure" is planned for 2012Q1, with a deadline at the end of the first month of 2012Q2.<br />
<br />
==== Produce OLA between EGI and NGIs, as well as a Core services OLA ====<br />
<br />
The 2nd release of the RP OLA will be finalized early 2012Q1. A new work item for 2012 is the EGI.eu OLA. A draft version will be ready in 2012Q2 and the final version is expected in 2012Q3. In 2012Q3 a new revision of the RP OLA will be drafted including any a<br />
<br />
==== Propose an OLA amendment procedure ====<br />
<br />
The amendment procedure for the OLA is scheduled for 2012Q2.<br />
<br />
==== OLA Task Force Meetings ====<br />
<br />
The OLA Task Force will reconvene via video conference and/or face to face meetings as needed.<br />
<br />
==== Availability/Reliability ====<br />
<br />
TSA1.8 will continue the distribution of monthly league tables and the maintenance of the relevant wiki space:<br />
<br />
https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics <br />
<br />
The investigation of whether advancements in the operational tools can simplify the procedure will continue in 2012, and recommendations will be made to operations and tools developers.<br />
<br />
https://rt.egi.eu/guest/Ticket/Display.html?id=289<br />
<br />
Regarding the investigation of the prime causes of site failures: this is ongoing. The first step is to determine the causes of the high percentage of UNKNOWN states in NGI Nagios (mentioned above under the accuracy issues) before going deeper into sites. Categorization of site replies to COD tickets for the reports could start in 2012. The initial results of the investigation show that the problems are mostly related to operations.</div>Dzilahttps://wiki.egi.eu/w/index.php?title=Resource_Centres_OLA_and_Resource_infrastructure_Provider_OLA_reports&diff=28065Resource Centres OLA and Resource infrastructure Provider OLA reports2011-11-18T09:55:01Z<p>Dzila: /* Process for quality verification */</p>
<hr />
<div>{{Template:Op menubar}}<br />
[[Category:Procedures]]<br />
[[Category:Service Level Management]]<br />
{{TOC_right}}<br />
<br />
It is mandatory that EGI certified Resource Centres provide a minimum monthly availability and reliability as specified below (see the [https://documents.egi.eu/document/31 site-NGI Operational Level Agreement] for details). Availability and reliability statistics (based on the global OPS VO) are issued on a monthly basis.<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
| '''minimum availability'''<br />
| 70%<br />
|-<br />
| '''minimum reliability'''<br />
| 75%<br />
|-<br />
|'''Condition for suspension'''<br />
| Resource Centres which have an availability of less than '''70%''' for three consecutive months will be suspended, i.e. removed from the production infrastructure. The 70% threshold applies from PY2 (May 2011 reports). Note: this suspension policy was reviewed in April 2011, and the original 50% threshold was increased to 70%.<br />
|-<br />
|'''Condition for justification'''<br />
|Resource Centres not providing minimum monthly performance (70% availability, 75% reliability) MUST provide justification through a GGUS ticket.<br />
|}<br />
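
The conditions in the table above can be sketched as a small check. This is only an illustrative sketch, not part of any EGI tool: the function names and the input format (a chronologically ordered list of monthly percentages) are assumptions, and "three consecutive months" is interpreted here as the three most recent months.

```python
# Thresholds taken from the table above (percent).
MIN_AVAILABILITY = 70.0
MIN_RELIABILITY = 75.0
# Number of consecutive months below the availability threshold
# that makes a Resource Centre eligible for suspension.
SUSPENSION_MONTHS = 3


def needs_justification(availability, reliability):
    """Return True if the monthly figures miss either minimum target."""
    return availability < MIN_AVAILABILITY or reliability < MIN_RELIABILITY


def eligible_for_suspension(monthly_availability):
    """Return True if the three most recent months are all below 70%.

    monthly_availability is a hypothetical input: a list of monthly
    availability percentages, oldest first.
    """
    recent = monthly_availability[-SUSPENSION_MONTHS:]
    if len(recent) < SUSPENSION_MONTHS:
        return False
    return all(value < MIN_AVAILABILITY for value in recent)
```

For example, a site with monthly availabilities of 90, 60, 65 and 69 percent would be eligible for suspension under this reading, while a site at exactly 70% availability and 75% reliability would not need to provide a justification.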
<br />
<br />
= Performance reports=<br />
* [https://wiki.egi.eu/wiki/Availability_and_reliability_reports_metrics Overview] of availability and reliability statistics including suspended sites<br />
* [https://wiki.egi.eu/wiki/List_of_sites_for_which_the_availability_followup_procedures_were_not_applicable List of sites] for which availability followup procedures were not applicable<br />
<!--* [https://twiki.cern.ch/twiki/bin/view/EGEE/SuspendedSites List of suspended sites (2009)]--><br />
== 2011 ==<br />
[https://documents.egi.eu/document/332 Jan]/<br />
[https://documents.egi.eu/document/402 Feb]/<br />
[https://documents.egi.eu/document/465 Mar]/<br />
[https://documents.egi.eu/document/508 Apr]/<br />
[https://documents.egi.eu/document/593 May]/<br />
[https://documents.egi.eu/document/648 Jun]/<br />
[https://documents.egi.eu/document/716 Jul]/<br />
[https://documents.egi.eu/document/783 Aug]/<br />
[https://documents.egi.eu/document/820 Sep]/<br />
[https://documents.egi.eu/document/879 Oct]<br />
<br />
== 2010 ==<br />
*[https://documents.egi.eu/document/42 May]/[https://documents.egi.eu/document/96 Jun]/[https://documents.egi.eu/document/130 Jul]/[https://documents.egi.eu/document/157 Aug]/[https://documents.egi.eu/document/219 Sep]/[https://documents.egi.eu/document/238 Oct]/[https://documents.egi.eu/document/266 Nov]/[https://documents.egi.eu/document/299 Dec]<br />
*[https://edms.cern.ch/document/963325 January 2008 - April 2010] (EGEE league tables)<br />
<br />
== EGI-wide Availability and Reliability ==<br />
It is available [https://documents.egi.eu/public/ShowDocument?docid=415 here] (xls file, data from May 01 2010)<br />
<br />
= Availability statistics per service/Resource Centre =<br />
The authoritative source of availability and reliability data is [https://grid-monitoring.cern.ch/myegi/sa/# MyEGI].<br />
<br />
=Report generator=<br />
*[http://gvdev.cern.ch/GVPC/Excel/ '''(DEPRECATED)''' GridView availability/reliability report generator] (providing access to the database including Nagios results for OPS and SAM results for VOs)<br />
<!--*[http://gridview015.cern.ch/GVPC/Excel/ACE/ ACE report generator]<br />
*[https://gvdev.cern.ch/ACEVAL/ace_index.php ACE visualization portal]--><br />
<br />
=Process for quality verification=<br />
<br />
* '''Generation of statistics'''<br />
Availability and reliability statistics are automatically generated in the first week of the month by the [https://wiki.egi.eu/wiki/External_tools#Availability_Computation_Engine Availability Computation Engine] (Gridview until May 2011) using the profile, in PDF format, and are placed under [http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/]. An Excel version is available at [http://gvdev.cern.ch/GVPC/Excel/ACE/].<br />
<br />
* '''Preliminary processing'''<br />
Once the reports are generated, sanity checks are performed by EGI SA1 (Task TSA1.8). After this step is completed, statistics are uploaded into the EGI document server. Links to monthly statistics will be provided on a regular basis at this wiki page.<br />
<br />
* '''Publication'''<br />
An announcement of the new results is distributed by EGI SA1 (TSA1.8) to the NGI Operations Managers mailing list. COD (TSA1.7) is responsible for supervising the follow-up of the statistics: it asks NGIs to chase sites that need to provide comments when thresholds are not met, and it identifies sites eligible for suspension. This phase starts by filing a ticket to the COD Support Unit. The overall comments gathering process is handled through tickets.<br />
<br />
* '''Handling of sites below targets'''<br />
For a site that misses availability/reliability targets but is not eligible for suspension: <br />
<br />
# a child ticket is opened by the COD team and assigned to the respective NGI, asking for an explanation <br />
# the explanation must be produced within 10 working days from the time the ticket is received by the site (please see the known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs]). Reminders and escalation are performed in accordance with the COD escalation procedures [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure].<br />
# if the explanation is found satisfactory the ticket is closed <br />
# conversely, if the explanation is not given in due time, or the explanation is found inadequate, the COD escalation procedure will be followed [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], with the site being suspended if neither the site nor the NGI replies to the ticket <br />
# the child ticket can then be closed <br />
# the parent ticket will be closed when all child tickets have been closed.<br />
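
The 10 working day deadline used in the ticket handling steps can be computed as sketched below. This is a minimal illustration under stated assumptions: working days are taken to be Monday to Friday, public holidays are ignored, and the function name <code>add_working_days</code> is hypothetical rather than part of GGUS or any COD tool.

```python
from datetime import date, timedelta


def add_working_days(start, working_days):
    """Return the date that lies `working_days` working days after `start`.

    Assumption: working days are Monday to Friday; public holidays
    are not taken into account.
    """
    current = start
    remaining = working_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


# A ticket received on Friday 18 November 2011 would be due
# 10 working days later.
deadline = add_working_days(date(2011, 11, 18), 10)
```

In practice the exact counting convention (for example whether the day of receipt counts) is defined by the COD escalation procedure, so the result should be treated as indicative only.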
<br />
* '''Handling of sites that are eligible for suspension'''<br />
For a site that is eligible for suspension: <br />
# a child ticket is opened by the COD team and assigned to the appropriate NGI, notifying that the site will be suspended within 10 working days (please see the known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs])<br />
# after the 10-working-day period has passed, during which normal COD escalation procedures apply [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], the site is suspended by COD unless the NGI has intervened or the EGI Chief Operations Officer objects. <br />
# in the case of NGI intervention, the site will not be suspended if both the COD and the COO agree with the reasoning provided by the NGI <br />
# the child ticket closes either when the site is suspended or when suspension is canceled <br />
# the parent ticket will be closed when all child tickets have been closed<br />
<br />
* '''Wiki follow up page'''<br />
Sites that fail to provide explanations justifying the failure to meet OLA targets, or whose explanation is found inadequate, as well as sites that are suspended, will be recorded in a wiki page [https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD_metrics]<br />
<br />
* '''Recomputation procedure'''<br />
Should there be doubts about the validity of Availability/Reliability reports, a RC/NGI can request recomputations according to the procedure defined at [https://wiki.egi.eu/wiki/PROC10]<br />
<br />
=Known issues and recommendations to NGIs=<br />
# ACE, as Gridview did in the past, always calculates the availability and reliability of a site as soon as it shows up in GOCDB and in the BDIIs, regardless of its certification status. While processing the data in order to generate the availability/reliability report, ACE takes into account the certification status of the site at that moment in order to decide whether the site is certified, and as a result shows up in the report, or is uncertified and has to be excluded. '''Thus newly certified sites will get inaccurate Availability/Reliability figures for the month they were certified and all months before that.''' Because the certification status history is not currently available in the operations tools, until a solution is implemented NGIs should check whether they have sites affected by this issue and report it as the explanation. More information at [https://gus.fzk.de/ws/ticket_info.php?ticket=60594] and [https://gus.fzk.de/ws/ticket_info.php?ticket=60925]. '''As of December 2010, Gridview had included a snapshot feature so availability takes into account the topology at the last day of the month. While it does not solve the problem completely, it reduces its impact. However ACE reports (used since May 2011) do not include the snapshot feature yet.''' <br />
# The calculations performed by ACE always '''take into account the information system status and GOCDB information at the time the calculation is performed, and not that of a certain checkpoint in the past'''. The implication of this is that any complete recalculation risks altering the results for sites that had correct numbers in the first place. Thus, until a solution is found, '''complete recalculations are avoided whenever possible''', and errors are fixed on a per-site basis for those sites that have lower numbers than they should.<br />
# Weighted availability is calculated by multiplying the number of logical CPUs a site publishes by the published HEPSPEC value. It is important that these numbers are correct: if the HEPSPEC value for a site is too high or too low (for example, by mistake), the overall NGI weighted availability will be affected.<br />
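
The effect of a wrong HEPSPEC value on the weighted availability can be illustrated with a short sketch. The site weight (logical CPUs multiplied by HEPSPEC) follows the description above; the aggregation into an NGI figure as a weighted mean of site availabilities is an assumption made for illustration only, and the function names and input format are hypothetical. The actual computation is defined by ACE.

```python
def site_weight(logical_cpus, hepspec):
    """Weight of a site: published logical CPUs times published HEPSPEC."""
    return logical_cpus * hepspec


def ngi_weighted_availability(sites):
    """Weighted mean availability over the sites of an NGI.

    `sites` is a list of (availability_percent, logical_cpus, hepspec)
    tuples. The weighted-mean aggregation is an assumption, not the
    documented ACE algorithm.
    """
    total_weight = sum(site_weight(cpus, hs) for _, cpus, hs in sites)
    if total_weight == 0:
        raise ValueError("no site publishes a non-zero weight")
    weighted_sum = sum(avail * site_weight(cpus, hs) for avail, cpus, hs in sites)
    return weighted_sum / total_weight


# Two equally sized sites with correct HEPSPEC values.
correct = ngi_weighted_availability([(90.0, 100, 10.0), (50.0, 100, 10.0)])

# The same sites, with the second one publishing a HEPSPEC value that is
# ten times too high: the poorly performing site dominates the NGI figure.
skewed = ngi_weighted_availability([(90.0, 100, 10.0), (50.0, 100, 100.0)])
```

Under these assumptions the NGI figure drops noticeably when the under-performing site publishes an inflated HEPSPEC value, which is why NGIs are asked to verify the published numbers.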
<br />
=Operational Level Agreements=<br />
==Resource Centre Operational Level Agreement==<br />
* [https://documents.egi.eu/document/31 Resource Centre (RC) Operational Level Agreement]<br />
* [[Resource_Centre_OLA:_Release_Notes|RC OLA release notes]]<br />
<br />
==Resource infrastructure Provider Operational Level Agreement==<br />
* (DRAFT) [https://documents.egi.eu/document/463 Resource infrastructure Provider (RP) Operational Level Agreement]<br />
* [[Resource_infrastructure_Provider_OLA:_Release_Notes|RP OLA release notes]]<br />
<br />
=Resources=<br />
* Definition of Availability and Reliability and related computation algorithm ([https://tomtools.cern.ch/confluence/download/attachments/2261694/Ace_Service_Availability_Computation.pdf?version=1&modificationDate=1314361543000 paper])<br />
<!--([https://twiki.cern.ch/twiki/pub/LCG/GridView/Gridview_Service_Availability_Computation.pdf paper])--><br />
* NEW! [https://wiki.egi.eu/wiki/Availability_and_reliability_tests List of Nagios tests] used for availability computation<br />
* [https://tomtools.cern.ch/confluence/display/SAM/ACE Availability Computation Engine (ACE) home page]<br />
* [https://twiki.cern.ch/twiki/bin/view/LCG/ACE (Old) Availability Computation Engine] (ACE)<br />
<br />
<!--* OLD: [https://twiki.cern.ch/twiki/bin/view/EGEE/MonthlyAvailability EGEE-III Comments on site availability and reliability statistics] --><br />
*[https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD COD procedure for oversight of availability and reliability performance]<br />
* Impact of change of suspension policy for under-performing sites: [https://wiki.egi.eu/wiki/Availability_and_reliability_threshold_change_impact report]</div>Dzilahttps://wiki.egi.eu/w/index.php?title=Dteam_vo&diff=26945Dteam vo2011-11-03T09:25:40Z<p>Dzila: /* General Information */</p>
<hr />
<div>{{Template:Op menubar}} {{Template:Doc_menubar}} {{TOC_right}} <br />
<br />
= General Information =<br />
<br />
The DTEAM VO is an infrastructure VO that MUST be enabled by all EGI Resource Centres that support the VO concept for user authentication, as stated in the [https://documents.egi.eu/document/31 Resource Centre Operational Level Agreement]. It is meant for testing and troubleshooting of grid capabilities across EGI Resource Centres. Usage of the DTEAM VO is subject to the EGI [[SPG:Documents|Security Policies]]. <br />
<br />
*[http://operations-portal.egi.eu/vo/downloadAUP/file/dteam-AcceptableUsePolicy-20110926-1316993681969.txt DTEAM AUP]. <br />
*'''Get support''': in order to get support about the DTEAM VO please [http://helpdesk.egi.eu/ open a ticket], select type ''Operations'', and set ''concerned VO'' to ''dteam''. If you have privileges, assign it to the Support Unit ''VOsupport unit''.<br />
*[https://voms.hellasgrid.gr:8443/vo/dteam/vomrs DTEAM VOMRS]<br />
<br />
= Recipes for VO/ROC/NGI/Site managers =<br />
<br />
== What users filling the '''dteam''' VO Registration form should do ==<br />
<br />
Select the appropriate '''Representative''' and '''Group''' for themselves. The Representative corresponding to their region is offered in a drop-down menu. <br />
<br />
'''Example:''' <br />
<blockquote style="background-color: lightgrey; border: solid thin grey; padding: 5px;">dteam users from Greece should select Kostas Koumantaros or Ioannis Liabotis as their Representative and /dteam/NGI_GRNET as their Group. </blockquote> <br />
Everybody is automatically registered under the root group /dteam in addition to any Group they might select. Nobody can de-assign them from this "root group" unless they are "Denied" in the first place or later "Suspended" by the VO-Admin, in which case they can't run any Grid jobs and they get deleted from the VOMS database. <br />
<br />
When users select additional Groups, the GroupOwners have nothing to do, if they have no objection. Users may select GroupRoles within a given Group as well. <br />
<br />
== What the VO-Admin can do ==<br />
<br />
Everything including VO member suspension/removal that nobody else can do! <br />
<br />
If you try to remove a member and the box-to-tick is grey, this means that the member has some authority (GroupOwner/Manager or Representative). You'll have to remove that function from him/her first via "Manage VO Admin Roles". <br />
<br />
To remove the GroupOwner/Manager authority, use control/click on the relevant Group/Role (it will be blue)! <br />
<br />
== What the Representative can do ==<br />
<br />
Approve Candidates during the initial registration and handle Expired users. <br />
<br />
To do this, the Representative should either click on the link (s)he got in the email notification or go to the web interface, open the "Members" sub-menu, click on "Set status", search for "New" candidates and approve those assigned to him/her. <br />
<br />
The Representative selected by the user can assign another Representative before approving, as appropriate. <br />
<br />
'''Example:''' <br />
<blockquote style="background-color: lightgrey; border: solid thin grey; padding: 5px;">a DTEAM VO Candidate from a Russian LCG Site selected the SWE ROC manager as Representative. Gonzalo (SWE) can replace himself with Alexander (RDIG). </blockquote> <br />
== What the GroupOwners can do ==<br />
<br />
Group Owners can create groups/group roles and assign new Group Owner/Manager roles to members within the subgroups. If they decide that a user doesn't belong to their group(s), they can de-assign him/her at any time. <br />
<br />
'''Example:''' <br />
<blockquote style="background-color: lightgrey; border: solid thin grey;padding: 5px;">If Sven from DECH selects additional group /dteam/see, Kostas can move him out. </blockquote> <br />
== What the GroupManagers can do ==<br />
<br />
They can deassign users from their group at any time. <br />
<br />
http://cern.ch/dimou/lcg/vomrs/Groups-Roles.doc contains EGEE-era implementation details and plans on Groups/Roles. As VOMRS functionality will be implemented in VOMS, this document is becoming obsolete. <br />
<br />
== Proposed distribution of responsibilities ==<br />
<br />
{| border="1"<br />
|-<br />
! Operations manager and deputy <br />
! Operations centre staff <br />
! Site staff<br />
|-<br />
| GroupOwner,GroupManager, VO Representative <br />
| GroupManager <br />
| Group Member<br />
|}<br />
<br />
= Mini How-To =<br />
<br />
*To (De)Assign someone as Representative go to "Manage VO Admin Roles". <br />
*To (De)Assign someone as GroupOwner go to "Manage VO Admin Roles", search for the VO member and select the Group (s)he should own. <br />
*To Change Representative for all members go to "Change Representative", select the right DN from the drop-down menu, click on each member. <br />
*To receive email notification for actions you need to take go to "Subscription" and select what you wish to be notified about.<br />
<br />
{| border="1"<br />
|-<br />
! <br />
! VO Admin <br />
! Representative <br />
! GroupOwner <br />
! GroupManager<br />
|-<br />
| Candidate <br />
| remove <br />
| <br />
| <br />
| <br />
|-<br />
| Applicant <br />
| Remove/approve/deny Assign/deassign to/from group and group role <br />
| Remove/approve/suspend/expire <br />
| Assign/deassign to/from group and group role<br />
|-<br />
| Member <br />
| Remove/approve/suspend/expire Assign/deassign to/from group and group role <br />
| expire from Institute but not from the VO <br />
| assign/deassign to/from group and group role <br />
| assign/deassign to/from group and group role<br />
|-<br />
| Member’s certificate <br />
| Remove/approve/deny/suspend <br />
| <br />
| assign/deassign to/from group and group role <br />
| assign/deassign to/from group and group role<br />
|}<br />
<br />
= Resources =<br />
<br />
*VOMRS Tutorials: http://www.uscms.org/SoftwareComputing/Grid/VO/tutorials.html <br />
*VOMRS Online Documentation: http://computing.fnal.gov/docs/products/vomrs/<br />
<br />
= Acknowledgements =<br />
<br />
Information provided in this page was collected from M. Dimou's VOMRS [http://dimou.web.cern.ch/dimou/lcg/registrar/TF/vomrs-tips.html tips page], with material provided by Tanya Levshina (VOMRS Project Leader and developer).</div>Dzilahttps://wiki.egi.eu/w/index.php?title=EGI-InSPIRE:SA1.8-QR6&diff=26837EGI-InSPIRE:SA1.8-QR62011-11-02T10:38:14Z<p>Dzila: /* Core services for uncertified sites */</p>
<hr />
<div>__NOTOC__ <br />
<br />
= 1. Task Meetings =<br />
<br />
{| cellspacing="0" cellpadding="5" border="1" align="center"<br />
|-<br />
! style="width: 25%;" | Date (dd/mm/yyyy) <br />
! style="width: 25%;" | Url Indico Agenda <br />
! style="width: 10%;" | Title <br />
! style="width: 10%;" | Outcome<br />
|-<br />
|5/8/2011<br />
|https://www.egi.eu/indico/conferenceDisplay.py?confId=561<br />
|5th OLA Task Force meeting<br />
|Further evolution of the RP OLA and the VO SLA<br />
|-<br />
|<br />
|<br />
|<br />
|<br />
|}<br />
<br />
= 2. Main Achievements =<br />
<br />
<br><br />
<br />
*The RP OLA document has been produced and discussed in the EGI Technical Forum in Lyon.<br />
*The VO SLA document was also produced<br />
*First NGI Core service reports were generated in October 2011 for the TopBDII service using the MyEGI programmatic interface<br />
*EGI sites availability recalculation procedure was finalized<br />
*Service Level Management Support Unit was created in GGUS and will be used for availability/reliability issues<br />
* A documentation and operations training session was held in the EGI Technical Forum in Lyon. This handled PROC01 and Nagios from the perspective of RODs and SAM Nagios admins.<br />
<br />
==== EGI Catch-All CA ====<br />
<br />
The EGI Catch All CA is servicing 5 countries which do not have a national accredited Certification Authority. These countries are Albania, Azerbaijan, Bosnia and Herzegovina, Georgia and Senegal. In addition a Registration Authority has been established at SixSq, a company located in Switzerland and affiliated with the StratusLab project.<br />
<br />
==== Core services for uncertified sites ====<br />
<br />
*The webpage which NGI managers can use to add uncertified sites to the EGI catch-all WMS and BDII services dedicated to these sites has been moved to production. It is available at [http://site-certification.egi.eu/]<br />
<br />
= 3. Issues and Mitigation =<br />
<br />
{| cellspacing="0" cellpadding="2" border="1"<br />
|-<br />
! scope="col" | Issue Description <br />
! scope="col" | Mitigation Description<br />
|-<br />
| The Virtual Site concept for Core services is not easy to implement<br />
| In the case of the Top BDII, the MyEGI programmatic interface was used together with a special spreadsheet for the calculation<br />
|}<br />
<br />
= 4. Plans for the next period =<br />
*Work will start on the EGI.eu OLA<br />
*Possibilities to obtain availability/reliability for more Core services will be explored<br />
*The availability profile used for EGI sites will need to be separated from WLCG sites.<br />
*Further clean up the operations wiki<br />
<br />
= 5. Number of sites suspended =<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
! style="width: 25%" | Month<br />
! style="width: 25%" | Suspended sites<br />
|-<br />
|July 2011<br />
|2<br />
|-<br />
|August 2011<br />
|0<br />
|-<br />
|September 2011<br />
|2<br />
|-<br />
|}</div>Dzilahttps://wiki.egi.eu/w/index.php?title=EGI-InSPIRE:SA1.8-QR6&diff=26836EGI-InSPIRE:SA1.8-QR62011-11-02T10:37:33Z<p>Dzila: /* 2. Main Achievements */</p>
<hr />
<div>__NOTOC__ <br />
<br />
= 1. Task Meetings =<br />
<br />
{| cellspacing="0" cellpadding="5" border="1" align="center"<br />
|-<br />
! style="width: 25%;" | Date (dd/mm/yyyy) <br />
! style="width: 25%;" | Url Indico Agenda <br />
! style="width: 10%;" | Title <br />
! style="width: 10%;" | Outcome<br />
|-<br />
|5/8/2011<br />
|https://www.egi.eu/indico/conferenceDisplay.py?confId=561<br />
|5th OLA Task Force meeting<br />
|Further evolution of the RP OLA and the VO SLA<br />
|-<br />
|<br />
|<br />
|<br />
|<br />
|}<br />
<br />
= 2. Main Achievements =<br />
<br />
<br><br />
<br />
*The RP OLA document has been produced and discussed in the EGI Technical Forum in Lyon.<br />
*The VO SLA document was also produced<br />
*First NGI Core service reports were generated in October 2011 for the TopBDII service using the MyEGI programmatic interface<br />
*EGI sites availability recalculation procedure was finalized<br />
*Service Level Management Support Unit was created in GGUS and will be used for availability/reliability issues<br />
* A documentation and operations training session was held in the EGI Technical Forum in Lyon. This handled PROC01 and Nagios from the perspective of RODs and SAM Nagios admins.<br />
<br />
==== EGI Catch-All CA ====<br />
<br />
The EGI Catch All CA is servicing 5 countries which do not have a national accredited Certification Authority. These countries are Albania, Azerbaijan, Bosnia and Herzegovina, Georgia and Senegal. In addition a Registration Authority has been established at SixSq, a company located in Switzerland and affiliated with the StratusLab project.<br />
<br />
==== Core services for uncertified sites ====<br />
<br />
*The webpage which NGI managers can use to add uncertified sites to the EGI catch-all WMS and BDII services dedicated to these sites has been launched into production. It is available at [http://site-certification.egi.eu/]<br />
<br />
= 3. Issues and Mitigation =<br />
<br />
{| cellspacing="0" cellpadding="2" border="1"<br />
|-<br />
! scope="col" | Issue Description <br />
! scope="col" | Mitigation Description<br />
|-<br />
| The Virtual Site concept for Core services is not easy to implement<br />
| In the case of the Top BDII, the MyEGI programmatic interface was used together with a special spreadsheet for the calculation<br />
|}<br />
<br />
= 4. Plans for the next period =<br />
*Work will start on the EGI.eu OLA<br />
*Possibilities to obtain availability/reliability for more Core services will be explored<br />
*The availability profile used for EGI sites will need to be separated from WLCG sites.<br />
*Further clean up the operations wiki<br />
<br />
= 5. Number of sites suspended =<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
! style="width: 25%" | Month<br />
! style="width: 25%" | Suspended sites<br />
|-<br />
|July 2011<br />
|2<br />
|-<br />
|August 2011<br />
|0<br />
|-<br />
|September 2011<br />
|2<br />
|-<br />
|}</div>Dzilahttps://wiki.egi.eu/w/index.php?title=EGI-InSPIRE:SA1_Task_Metrics_Table&diff=26661EGI-InSPIRE:SA1 Task Metrics Table2011-10-31T09:40:35Z<p>Dzila: </p>
<hr />
<div>{{Template:Op menubar}} <br />
<br />
SA1 task quarterly metrics. <br />
<br />
Back to the [[SA1 Task QR Reports and Metrics]]. <!--<br />
Task metrics.<br />
Note. Only provide values for the metrics relevant to your task.<br />
Values reported need to be aggregated during the reference three months.<br />
--> <br />
<br />
{| cellspacing="0" cellpadding="2" border="1"<br />
|-<br />
! scope="col" | SA1 Task <br />
! scope="col" | Metric name <br />
! scope="col" | Metric description <br />
! scope="col" | QR3 <br />
! QR4 <br />
! QR5 <br />
! QR6 <br />
! QR7 <br />
! QR8 <br />
! QR9 <br />
! QR10 <br />
! QR11 <br />
! QR12 <br />
! QR13 <br />
! QR14 <br />
! QR15 <br />
! QR16<br />
|-<br />
! scope="row" | TSA1.1 <br />
! scope="row" | M.SA1.Size.1 <br />
! scope="row" | Total number of production resource centres that are part of the EGI <br />
| &lt;Q3_value&gt; <br />
| &lt;Q4_value&gt; <br />
| &lt;Q5_value&gt; <br />
| &lt;Q6_value&gt; <br />
| &lt;Q7_value&gt; <br />
| &lt;Q8_value&gt; <br />
| &lt;Q9_value&gt; <br />
| &lt;Q10_value&gt; <br />
| &lt;Q11_value&gt; <br />
| &lt;Q12_value&gt; <br />
| &lt;Q13_value&gt; <br />
| &lt;Q14_value&gt; <br />
| &lt;Q15_value&gt; <br />
| &lt;Q16_value&gt;<br />
|-<br />
! scope="row" | TSA1.2 <br />
! scope="row" | M.SA1.OperationalSecurity.1 <br />
! scope="row" | Number of Site Security Challenges (SSCs) performed <br />
| 0 <br />
| 0 <br />
| 40 <br />
| &lt;Q6_value&gt; <br />
| &lt;Q7_value&gt; <br />
| &lt;Q8_value&gt; <br />
| &lt;Q9_value&gt; <br />
| &lt;Q10_value&gt; <br />
| &lt;Q11_value&gt; <br />
| &lt;Q12_value&gt; <br />
| &lt;Q13_value&gt; <br />
| &lt;Q14_value&gt; <br />
| &lt;Q15_value&gt; <br />
| &lt;Q16_value&gt;<br />
|-<br />
! scope="row" | TSA1.2 <br />
! scope="row" | M.SA1.OperationalSecurity.2 <br />
! scope="row" | Number of Sites passing one Service Challenge <br />
| N/A <br />
| 0 <br />
| N/A (evaluation is still ongoing) <br />
| &lt;Q6_value&gt; <br />
| &lt;Q7_value&gt; <br />
| &lt;Q8_value&gt; <br />
| &lt;Q9_value&gt; <br />
| &lt;Q10_value&gt; <br />
| &lt;Q11_value&gt; <br />
| &lt;Q12_value&gt; <br />
| &lt;Q13_value&gt; <br />
| &lt;Q14_value&gt; <br />
| &lt;Q15_value&gt; <br />
| &lt;Q16_value&gt;<br />
|-<br />
! scope="row" | TSA1.2 <br />
! scope="row" | M.SA1.OperationalSecurity.3 <br />
! scope="row" | Number of suspended sites for security issues <br />
| 0 <br />
| 0 <br />
| 0 <br />
| &lt;Q6_value&gt; <br />
| &lt;Q7_value&gt; <br />
| &lt;Q8_value&gt; <br />
| &lt;Q9_value&gt; <br />
| &lt;Q10_value&gt; <br />
| &lt;Q11_value&gt; <br />
| &lt;Q12_value&gt; <br />
| &lt;Q13_value&gt; <br />
| &lt;Q14_value&gt; <br />
| &lt;Q15_value&gt; <br />
| &lt;Q16_value&gt;<br />
|-<br />
! scope="row" | TSA1.3 <br />
! scope="row" | M.SA1.ServiceValidation.1 <br />
! scope="row" | Total number of components tested/rejected in staged rollout <br />
| 11/2 <br />
| 29/1<br> <br />
| 54/2<br />
| &lt;Q6_value&gt; <br />
| &lt;Q7_value&gt; <br />
| &lt;Q8_value&gt; <br />
| &lt;Q9_value&gt; <br />
| &lt;Q10_value&gt; <br />
| &lt;Q11_value&gt; <br />
| &lt;Q12_value&gt; <br />
| &lt;Q13_value&gt; <br />
| &lt;Q14_value&gt; <br />
| &lt;Q15_value&gt; <br />
| &lt;Q16_value&gt;<br />
|-<br />
! scope="row" | TSA1.3 <br />
! scope="row" | M.SA1.ServiceValidation.2 <br />
! scope="row" | Number of staged rollout tests undertaken <br />
| 14 <br />
| 40 <br />
| 81<br />
| &lt;Q6_value&gt; <br />
| &lt;Q7_value&gt; <br />
| &lt;Q8_value&gt; <br />
| &lt;Q9_value&gt; <br />
| &lt;Q10_value&gt; <br />
| &lt;Q11_value&gt; <br />
| &lt;Q12_value&gt; <br />
| &lt;Q13_value&gt; <br />
| &lt;Q14_value&gt; <br />
| &lt;Q15_value&gt; <br />
| &lt;Q16_value&gt;<br />
|-<br />
! scope="row" | TSA1.3 <br />
! scope="row" | M.SA1.ServiceValidation.3 <br />
! scope="row" | Number of EA teams <br />
| 40 <br />
| 45 <br />
| 46<br />
| &lt;Q6_value&gt; <br />
| &lt;Q7_value&gt; <br />
| &lt;Q8_value&gt; <br />
| &lt;Q9_value&gt; <br />
| &lt;Q10_value&gt; <br />
| &lt;Q11_value&gt; <br />
| &lt;Q12_value&gt; <br />
| &lt;Q13_value&gt; <br />
| &lt;Q14_value&gt; <br />
| &lt;Q15_value&gt; <br />
| &lt;Q16_value&gt;<br />
|-<br />
! scope="row" | TSA1.5 <br />
! scope="row" | M.SA1.Accounting.1 <br />
! scope="row" | Number of sites adopting AMQ messaging for Usage Record publication <br />
| 149 (90 RGMA, 62 direct insertion, 56% infrastructure ok) <br />
| 241&nbsp; <br />
| &lt;Q5_value&gt; <br />
| &lt;Q6_value&gt; <br />
| &lt;Q7_value&gt; <br />
| &lt;Q8_value&gt; <br />
| &lt;Q9_value&gt; <br />
| &lt;Q10_value&gt; <br />
| &lt;Q11_value&gt; <br />
| &lt;Q12_value&gt; <br />
| &lt;Q13_value&gt; <br />
| &lt;Q14_value&gt; <br />
| &lt;Q15_value&gt; <br />
| &lt;Q16_value&gt;<br />
|-<br />
! scope="row" | TSA1.7 <br />
! scope="row" | M.SA1.Support.7 <br />
! scope="row" | COD Workload per month <br />
| <br />
764/551/844 <br />
<br />
See: https://documents.egi.eu/secure/ShowDocument?docid=155&amp;version=1 <br />
<br />
| <br />
135/363/315 <br />
<br />
See: https://documents.egi.eu/secure/ShowDocument?docid=155&amp;version=1 <br />
<br />
| &lt;Q5_value&gt; <br />
| &lt;Q6_value&gt; <br />
| &lt;Q7_value&gt; <br />
| &lt;Q8_value&gt; <br />
| &lt;Q9_value&gt; <br />
| &lt;Q10_value&gt; <br />
| &lt;Q11_value&gt; <br />
| &lt;Q12_value&gt; <br />
| &lt;Q13_value&gt; <br />
| &lt;Q14_value&gt; <br />
| &lt;Q15_value&gt; <br />
| &lt;Q16_value&gt;<br />
|-<br />
! scope="row" | TSA1.7 <br />
! scope="row" | M.SA1.Support.8 <br />
! scope="row" | ROD Workload per month (breakdown per region/NGI) <br />
| <br />
2943/1912/2090 <br />
<br />
See: https://documents.egi.eu/secure/ShowDocument?docid=155&amp;version=1 <br />
<br />
| <br />
1530/1692/2059 <br />
<br />
See: https://documents.egi.eu/secure/ShowDocument?docid=155&amp;version=1 <br />
<br />
| &lt;Q5_value&gt; <br />
| &lt;Q6_value&gt; <br />
| &lt;Q7_value&gt; <br />
| &lt;Q8_value&gt; <br />
| &lt;Q9_value&gt; <br />
| &lt;Q10_value&gt; <br />
| &lt;Q11_value&gt; <br />
| &lt;Q12_value&gt; <br />
| &lt;Q13_value&gt; <br />
| &lt;Q14_value&gt; <br />
| &lt;Q15_value&gt; <br />
| &lt;Q16_value&gt;<br />
|-<br />
! scope="row" | TSA1.7 <br />
! scope="row" | M.SA1.Support.9 <br />
! scope="row" | ROD Quality Metrics per month (breakdown per region/NGI) <br />
| <br />
0.90/0.81/0.76 <br />
<br />
See: https://documents.egi.eu/secure/ShowDocument?docid=155&amp;version=1 <br />
<br />
| <br />
0.85/0.82/0.86 <br />
<br />
See: https://documents.egi.eu/secure/ShowDocument?docid=155&amp;version=1 <br />
<br />
| &lt;Q5_value&gt; <br />
| &lt;Q6_value&gt; <br />
| &lt;Q7_value&gt; <br />
| &lt;Q8_value&gt; <br />
| &lt;Q9_value&gt; <br />
| &lt;Q10_value&gt; <br />
| &lt;Q11_value&gt; <br />
| &lt;Q12_value&gt; <br />
| &lt;Q13_value&gt; <br />
| &lt;Q14_value&gt; <br />
| &lt;Q15_value&gt; <br />
| &lt;Q16_value&gt;<br />
|-<br />
! scope="row" | TSA1.8 <br />
! scope="row" | M.SA1.Operation.2 <br />
! scope="row" | Number of sites suspended <br />
| 1/0/1 <br />
| 2/0/0 <br />
| 0/2/7 <br />
| 2/0/2 <br />
| &lt;Q7_value&gt; <br />
| &lt;Q8_value&gt; <br />
| &lt;Q9_value&gt; <br />
| &lt;Q10_value&gt; <br />
| &lt;Q11_value&gt; <br />
| &lt;Q12_value&gt; <br />
| &lt;Q13_value&gt; <br />
| &lt;Q14_value&gt; <br />
| &lt;Q15_value&gt; <br />
| &lt;Q16_value&gt;<br />
|}<br />
<br />
[[Category:Metrics]]</div>Dzilahttps://wiki.egi.eu/w/index.php?title=EGI-InSPIRE:SA1.8-QR6&diff=26658EGI-InSPIRE:SA1.8-QR62011-10-31T09:20:41Z<p>Dzila: /* 4. Plans for the next period */</p>
<hr />
<div>__NOTOC__ <br />
<br />
= 1. Task Meetings =<br />
<br />
{| cellspacing="0" cellpadding="5" border="1" align="center"<br />
|-<br />
! style="width: 25%;" | Date (dd/mm/yyyy) <br />
! style="width: 25%;" | Url Indico Agenda <br />
! style="width: 10%;" | Title <br />
! style="width: 10%;" | Outcome<br />
|-<br />
|5/8/2011<br />
|https://www.egi.eu/indico/conferenceDisplay.py?confId=561<br />
|5th OLA Task Force meeting<br />
|Further evolution of the RP OLA and the VO SLA<br />
|-<br />
|<br />
|<br />
|<br />
|<br />
|}<br />
<br />
= 2. Main Achievements =<br />
<br />
<br><br />
<br />
*The RP OLA document has been produced and discussed in the EGI Technical Forum in Lyon.<br />
*The VO SLA document was also produced<br />
*First NGI Core service reports were generated in October 2011 for the TopBDII service using the MyEGI programmatic interface<br />
*EGI sites availability recalculation procedure was finalized<br />
*Service Level Management Support Unit was created in GGUS and will be used for availability/reliability issues<br />
<br />
= 3. Issues and Mitigation =<br />
<br />
{| cellspacing="0" cellpadding="2" border="1"<br />
|-<br />
! scope="col" | Issue Description <br />
! scope="col" | Mitigation Description<br />
|-<br />
| The Virtual Site concept for Core services is not easy to implement<br />
| For the Top BDII, the MyEGI programmatic interface was used together with a dedicated spreadsheet for the calculation<br />
|}<br />
<br />
= 4. Plans for the next period =<br />
*Work will start on the EGI.eu OLA<br />
*Possibilities to obtain availability/reliability for more Core services will be explored<br />
*The availability profile used for EGI sites will need to be separated from WLCG sites.<br />
<br />
= 5. Number of sites suspended =<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
! style="width: 25%" | Month<br />
! style="width: 25%" | Suspended sites<br />
|-<br />
|July 2011<br />
|2<br />
|-<br />
|August 2011<br />
|0<br />
|-<br />
|September 2011<br />
|2<br />
|-<br />
|}</div>Dzilahttps://wiki.egi.eu/w/index.php?title=EGI-InSPIRE:SA1.8-QR6&diff=26657EGI-InSPIRE:SA1.8-QR62011-10-31T09:17:21Z<p>Dzila: /* 3. Issues and Mitigation */</p>
<hr />
<div>__NOTOC__ <br />
<br />
= 1. Task Meetings =<br />
<br />
{| cellspacing="0" cellpadding="5" border="1" align="center"<br />
|-<br />
! style="width: 25%;" | Date (dd/mm/yyyy) <br />
! style="width: 25%;" | Url Indico Agenda <br />
! style="width: 10%;" | Title <br />
! style="width: 10%;" | Outcome<br />
|-<br />
|5/8/2011<br />
|https://www.egi.eu/indico/conferenceDisplay.py?confId=561<br />
|5th OLA Task Force meeting<br />
|Further evolution of the RP OLA and the VO SLA<br />
|-<br />
|<br />
|<br />
|<br />
|<br />
|}<br />
<br />
= 2. Main Achievements =<br />
<br />
<br><br />
<br />
*The RP OLA document has been produced and discussed in the EGI Technical Forum in Lyon.<br />
*The VO SLA document was also produced<br />
*First NGI Core service reports were generated in October 2011 for the TopBDII service using the MyEGI programmatic interface<br />
*EGI sites availability recalculation procedure was finalized<br />
*Service Level Management Support Unit was created in GGUS and will be used for availability/reliability issues<br />
<br />
= 3. Issues and Mitigation =<br />
<br />
{| cellspacing="0" cellpadding="2" border="1"<br />
|-<br />
! scope="col" | Issue Description <br />
! scope="col" | Mitigation Description<br />
|-<br />
| The Virtual Site concept for Core services is not easy to implement<br />
| For the Top BDII, the MyEGI programmatic interface was used together with a dedicated spreadsheet for the calculation<br />
|}<br />
<br />
= 4. Plans for the next period =<br />
<br />
= 5. Number of sites suspended =<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
! style="width: 25%" | Month<br />
! style="width: 25%" | Suspended sites<br />
|-<br />
|July 2011<br />
|2<br />
|-<br />
|August 2011<br />
|0<br />
|-<br />
|September 2011<br />
|2<br />
|-<br />
|}</div>Dzilahttps://wiki.egi.eu/w/index.php?title=EGI-InSPIRE:SA1.8-QR6&diff=26656EGI-InSPIRE:SA1.8-QR62011-10-31T09:09:50Z<p>Dzila: /* 2. Main Achievements */</p>
<hr />
<div>__NOTOC__ <br />
<br />
= 1. Task Meetings =<br />
<br />
{| cellspacing="0" cellpadding="5" border="1" align="center"<br />
|-<br />
! style="width: 25%;" | Date (dd/mm/yyyy) <br />
! style="width: 25%;" | Url Indico Agenda <br />
! style="width: 10%;" | Title <br />
! style="width: 10%;" | Outcome<br />
|-<br />
|5/8/2011<br />
|https://www.egi.eu/indico/conferenceDisplay.py?confId=561<br />
|5th OLA Task Force meeting<br />
|Further evolution of the RP OLA and the VO SLA<br />
|-<br />
|<br />
|<br />
|<br />
|<br />
|}<br />
<br />
= 2. Main Achievements =<br />
<br />
<br><br />
<br />
*The RP OLA document has been produced and discussed in the EGI Technical Forum in Lyon.<br />
*The VO SLA document was also produced<br />
*First NGI Core service reports were generated in October 2011 for the TopBDII service using the MyEGI programmatic interface<br />
*EGI sites availability recalculation procedure was finalized<br />
*Service Level Management Support Unit was created in GGUS and will be used for availability/reliability issues<br />
<br />
= 3. Issues and Mitigation =<br />
<br />
{| cellspacing="0" cellpadding="2" border="1"<br />
|-<br />
! scope="col" | Issue Description <br />
! scope="col" | Mitigation Description<br />
|-<br />
| <br />
|<br />
|}<br />
<br />
= 4. Plans for the next period =<br />
<br />
= 5. Number of sites suspended =<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
! style="width: 25%" | Month<br />
! style="width: 25%" | Suspended sites<br />
|-<br />
|July 2011<br />
|2<br />
|-<br />
|August 2011<br />
|0<br />
|-<br />
|September 2011<br />
|2<br />
|-<br />
|}</div>Dzilahttps://wiki.egi.eu/w/index.php?title=EGI-InSPIRE:SA1.8-QR6&diff=26655EGI-InSPIRE:SA1.8-QR62011-10-31T08:48:34Z<p>Dzila: </p>
<hr />
<div>__NOTOC__ <br />
<br />
= 1. Task Meetings =<br />
<br />
{| cellspacing="0" cellpadding="5" border="1" align="center"<br />
|-<br />
! style="width: 25%;" | Date (dd/mm/yyyy) <br />
! style="width: 25%;" | Url Indico Agenda <br />
! style="width: 10%;" | Title <br />
! style="width: 10%;" | Outcome<br />
|-<br />
|5/8/2011<br />
|https://www.egi.eu/indico/conferenceDisplay.py?confId=561<br />
|5th OLA Task Force meeting<br />
|Further evolution of the RP OLA and the VO SLA<br />
|-<br />
|<br />
|<br />
|<br />
|<br />
|}<br />
<br />
= 2. Main Achievements =<br />
<br />
<br />
= 3. Issues and Mitigation =<br />
<br />
{| cellspacing="0" cellpadding="2" border="1"<br />
|-<br />
! scope="col" | Issue Description <br />
! scope="col" | Mitigation Description<br />
|-<br />
| <br />
|<br />
|}<br />
<br />
= 4. Plans for the next period =<br />
<br />
= 5. Number of sites suspended =<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
! style="width: 25%" | Month<br />
! style="width: 25%" | Suspended sites<br />
|-<br />
|July 2011<br />
|2<br />
|-<br />
|August 2011<br />
|0<br />
|-<br />
|September 2011<br />
|2<br />
|-<br />
|}</div>Dzilahttps://wiki.egi.eu/w/index.php?title=EGI-InSPIRE:SA1.8-QR6&diff=26652EGI-InSPIRE:SA1.8-QR62011-10-31T08:15:13Z<p>Dzila: /* 1. Task Meetings */</p>
<hr />
<div>__NOTOC__ <br />
<br />
= 1. Task Meetings =<br />
<br />
{| cellspacing="0" cellpadding="5" border="1" align="center"<br />
|-<br />
! style="width: 25%;" | Date (dd/mm/yyyy) <br />
! style="width: 25%;" | Url Indico Agenda <br />
! style="width: 10%;" | Title <br />
! style="width: 10%;" | Outcome<br />
|-<br />
|5/8/2011<br />
|https://www.egi.eu/indico/conferenceDisplay.py?confId=561<br />
|5th OLA Task Force meeting<br />
|Further evolution of the RP OLA and the VO SLA<br />
|-<br />
|<br />
|<br />
|<br />
|<br />
|}<br />
<br />
= 2. Main Achievements =<br />
<br />
<br />
= 3. Issues and Mitigation =<br />
<br />
{| cellspacing="0" cellpadding="2" border="1"<br />
|-<br />
! scope="col" | Issue Description <br />
! scope="col" | Mitigation Description<br />
|-<br />
| <br />
|<br />
|}<br />
<br />
= 4. Plans for the next period =</div>Dzilahttps://wiki.egi.eu/w/index.php?title=Dteam_vo&diff=25692Dteam vo2011-10-07T12:59:41Z<p>Dzila: /* Proposed responsibilities */</p>
<hr />
<div>{{Template:Op menubar}}<br />
{{Template:Doc_menubar}}<br />
{{TOC_right}}<br />
<br />
=General Information =<br />
The DTEAM VO is an infrastructure VO that MUST be enabled by all EGI Resource Centres that support the VO concept for user authentication, as stated in the [https://documents.egi.eu/document/31 Resource Centre Operational Level Agreement]. It is meant for testing and troubleshooting of grid capabilities across EGI Resource Centres. Usage of the DTEAM VO is subject to the EGI [[SPG:Documents| Security Policies]].<br />
* [http://operations-portal.egi.eu/vo/downloadAUP/file/dteam-AcceptableUsePolicy-20110926-1316993681969.txt DTEAM AUP].<br />
* '''Get support''': to get support for the DTEAM VO, please [http://helpdesk.egi.eu/ open a ticket], select type ''Operations'', and set ''concerned VO'' to ''dteam''. If you have privileges, assign it to the Support Unit ''VOsupport unit''.<br />
<br />
=Recipes for VO/ROC/NGI/Site managers=<br />
<br />
==What users filling the '''dteam''' VO Registration form should do==<br />
<br />
Select the appropriate '''Representative''' and '''Group''' for themselves. The Representative corresponding to their region is offered in a drop-down menu. Example: dteam users from Greece should select Kostas Koumantaros or Ioannis Lambiotis as their Representative and /dteam/NGI_GRNET as their Group.<br />
Everybody is automatically registered under the root group /dteam in addition to any Group they might select. Nobody can de-assign them from this "root group" unless the VO-Admin marks them "Denied" at registration or "Suspended" later on, in which case they can't run any Grid jobs and they are deleted from the VOMS database.<br />
When users select additional Groups, the GroupOwners need take no action unless they object.<br />
Users may select GroupRoles within a given Group as well.<br />
<br />
==What the VO-Admin can do==<br />
<br />
Everything, including VO member suspension and removal, which nobody else can do.<br />
'''NB:''' If you try to remove a member and the box to tick is grey, the member holds some authority (GroupOwner/Manager or Representative). You will have to remove that function from him/her first via "Manage VO Admin Roles". To remove the GroupOwner/Manager authority, use control/click on the relevant Group/Role (it will be blue).<br />
<br />
==What the Representative can do==<br />
<br />
Approve Candidates during the initial registration and handle Expired users. To do this, the Representative should either click on the link (s)he got in the email notification or go to the web interface, open the "Members" sub-menu, click on "Set status", search for "New" candidates and approve those assigned to him/her.<br />
<br />
The Representative selected by the user can assign another Representative before approving, as appropriate. Example: a DTEAM VO Candidate from a Russian LCG Site selected the SWE ROC manager as Representative. Gonzalo (SWE) can replace himself with Alexander (RDIG).<br />
<br />
==What the GroupOwners can do==<br />
Group Owners can create groups/group roles and assign new Group Owner/Manager roles to members within the subgroups. If they decide that a user doesn't belong to their group(s), they can de-assign him/her at any time. Example: If Sven from DECH selects the additional group /dteam/see, Kostas can move him out.<br />
<br />
==What the GroupManagers can do==<br />
They can deassign users from their group at any time.<br />
<br />
http://cern.ch/dimou/lcg/vomrs/Groups-Roles.doc contains EGEE-era implementation details and plans on Groups/Roles. As VOMRS functionality will be implemented in VOMS, this document is becoming obsolete.<br />
<br />
==Proposed distribution of responsibilities==<br />
{| border="1"<br />
! Operations manager and deputy<br />
! Operations centre staff<br />
! Site staff<br />
|-<br />
|GroupOwner,GroupManager, VO Representative<br />
|GroupManager<br />
|Group Member<br />
|-<br />
|}<br />
<br />
=Mini How-To=<br />
<br />
* To (De)Assign someone as Representative go to "Manage VO Admin Roles".<br />
* To (De)Assign someone as GroupOwner go to "Manage VO Admin Roles", search for the VO member and select the Group (s)he should own.<br />
* To change the Representative for all members, go to "Change Representative", select the right DN from the drop-down menu, and click on each member.<br />
* To receive email notification for actions you need to take go to "Subscription" and select what you wish to be notified about.<br />
<br />
{| border="1"<br />
! <br />
! VO Admin<br />
! Representative<br />
! GroupOwner<br />
! GroupManager<br />
|-<br />
|Candidate<br />
|remove <br />
|<br />
|<br />
|<br />
|-<br />
|Applicant<br />
|Remove/approve/deny Assign/deassign to/from group and group role<br />
|Remove/approve/suspend/expire<br />
|Assign/deassign to/from group and group role<br />
|-<br />
|Member<br />
|Remove/approve/suspend/expire Assign/deassign to/from group and group role<br />
|expire from Institute but not from the VO<br />
|assign/deassign to/from group and group role<br />
|assign/deassign to/from group and group role<br />
|-<br />
|Member’s certificate<br />
|Remove/approve/deny/suspend<br />
|<br />
|assign/deassign to/from group and group role<br />
|assign/deassign to/from group and group role<br />
|-<br />
|}<br />
<br />
=Migration of the dteam VO from CERN VOMS server to EGI VOMS (AUTH/NGI_GRNET)=<br />
# Sync dteam Greece with dteam CERN.<br />
# Advise sites to add the new VOMS server to their configuration. Sites need to replace the following site-info.def definitions: <br />
<pre><br />
VO_DTEAM_VOMS_SERVERS='vomss://voms.cern.ch:8443/voms/dteam?/dteam/' <br />
<br />
VO_DTEAM_VOMSES="\<br />
'dteam lcg-voms.cern.ch 15004 \<br />
/DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch dteam 24' \<br />
'dteam voms.cern.ch 15004 \<br />
/DC=ch/DC=cern/OU=computers/CN=voms.cern.ch dteam 24'" </pre><br />
and<br />
<pre><br />
VO_DTEAM_VOMS_CA_DN="\<br />
'/DC=ch/DC=cern/CN=CERN Trusted Certification Authority' \<br />
'/DC=ch/DC=cern/CN=CERN Trusted Certification Authority'"<br />
</pre><br />
with these:<br />
<pre><br />
VO_DTEAM_VOMS_SERVERS='vomss://voms.hellasgrid.gr:8443/voms/dteam?/dteam/' <br />
<br />
VO_DTEAM_VOMSES="\<br />
'dteam voms.hellasgrid.gr 15004 \<br />
/C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms.hellasgrid.gr dteam 24' \<br />
'dteam voms2.hellasgrid.gr 15004 \<br />
/C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms2.hellasgrid.gr dteam 24'" </pre><br />
and<br />
<pre><br />
VO_DTEAM_VOMS_CA_DN="\<br />
'/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006' \<br />
'/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006'"<br />
</pre><br />
'''Run yaim after changing site-info.def'''. The new "lsc" files should be '''voms.hellasgrid.gr.lsc''' and '''voms2.hellasgrid.gr.lsc''' with the following contents, respectively:<br />
<pre><br />
/C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms.hellasgrid.gr<br />
/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006<br />
</pre><br />
<pre><br />
/C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms2.hellasgrid.gr<br />
/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006<br />
</pre><br />
<ol start="3"><br />
<li> Sites also need an rpm containing the host cert(s) of the new VOMS server(s), at least for the WMS, for as long as it still requires the certs of supported VOs. We could add those certs to lcg-vomscerts to smooth the transition, but it may be better for EGI to control its own rpm. As of 11/10/2010, lcg-vomscerts has already been updated: version 6.1.0 and later contain the new certs. Latest as of 11/11/2010: [http://etics-repository.cern.ch/repository/download/registered/org.glite/lcg-vomscerts/6.2.0/noarch/lcg-vomscerts-6.2.0-1.noarch.rpm]. <br />
</ol><br />
<ol start="4"><br />
<li> Wait a bit (1 month sounds reasonable).<br />
</ol><br />
<ol start="5"><br />
<li> Close registrations at CERN. Running <code>service vomrs stop</code> should be enough.<br />
</ol><br />
<ol start="6"><br />
<li> Sync dteam Greece with dteam CERN.<br />
</ol><br />
<ol start="7"><br />
<li> Advise new users to register with Greece. https://voms.hellasgrid.gr:8443/vo/dteam/vomrs<br />
</ol><br />
<ol start="8"><br />
<li> Remove CERN dteam. '''This will take place on Wednesday January 26'''.<br />
</ol><br />
<ol start="9"><br />
<li> Advise sites to drop CERN dteam configuration.<br />
</ol><br />
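
The two "lsc" files listed in step 2 can also be written by a short script rather than by hand. The following is only a minimal sketch under stated assumptions: the target directory <code>VOMSDIR</code> and the helper name <code>write_lsc</code> are illustrative and not part of the original procedure, and the host DN is assumed to follow the pattern shown in the listings above. Point <code>VOMSDIR</code> at the directory where your services actually read the dteam lsc files.

```shell
#!/bin/sh
# Sketch: write the dteam .lsc files for the two HellasGrid VOMS servers.
# VOMSDIR is an assumed, overridable target directory (hypothetical default).
VOMSDIR="${VOMSDIR:-./vomsdir/dteam}"

# DN of the CA that issued the VOMS host certificates (taken from the page).
CA_DN='/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006'

# write_lsc HOST: hypothetical helper that writes HOST.lsc with two lines,
# the host certificate subject DN followed by the issuing CA DN.
write_lsc() {
    host="$1"
    mkdir -p "$VOMSDIR"
    printf '%s\n%s\n' \
        "/C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=$host" \
        "$CA_DN" > "$VOMSDIR/$host.lsc"
}

for h in voms.hellasgrid.gr voms2.hellasgrid.gr; do
    write_lsc "$h"
done
```

Generating both files from one CA DN keeps the second line identical in the two files, which is where a copy-and-paste slip is most likely to go unnoticed.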
<br />
= Resources =<br />
*VOMRS Tutorials: http://www.uscms.org/SoftwareComputing/Grid/VO/tutorials.html<br />
*VOMRS Online Documentation: http://computing.fnal.gov/docs/products/vomrs/<br />
<br />
= Acknowledgements =<br />
Information provided in this page was collected from M. Dimou's VOMRS [http://dimou.web.cern.ch/dimou/lcg/registrar/TF/vomrs-tips.html tips page], with material provided by Tanya Levshina (VOMRS Project Leader and developer).</div>Dzilahttps://wiki.egi.eu/w/index.php?title=Dteam_vo&diff=25691Dteam vo2011-10-07T12:59:02Z<p>Dzila: /* Proposed responsibilities */</p>
<hr />
<div>{{Template:Op menubar}}<br />
{{Template:Doc_menubar}}<br />
{{TOC_right}}<br />
<br />
=General Information =<br />
The DTEAM VO is an infrastructure VO that MUST be enabled by all EGI Resource Centres that support the VO concept for user authentication, as stated in the [https://documents.egi.eu/document/31 Resource Centre Operational Level Agreement]. It is meant for testing and troubleshooting of grid capabilities across EGI Resource Centres. Usage of the DTEAM VO is subject to the EGI [[SPG:Documents| Security Policies]].<br />
* [http://operations-portal.egi.eu/vo/downloadAUP/file/dteam-AcceptableUsePolicy-20110926-1316993681969.txt DTEAM AUP].<br />
* '''Get support''': to get support for the DTEAM VO, please [http://helpdesk.egi.eu/ open a ticket], select type ''Operations'', and set ''concerned VO'' to ''dteam''. If you have privileges, assign it to the Support Unit ''VOsupport unit''.<br />
<br />
=Recipes for VO/ROC/NGI/Site managers=<br />
<br />
==What users filling the '''dteam''' VO Registration form should do==<br />
<br />
Select the appropriate '''Representative''' and '''Group''' for themselves. The Representative corresponding to their region is offered in a drop-down menu. Example: dteam users from Greece should select Kostas Koumantaros or Ioannis Lambiotis as their Representative and /dteam/NGI_GRNET as their Group.<br />
Everybody is automatically registered under the root group /dteam in addition to any Group they might select. Nobody can de-assign them from this "root group" unless the VO-Admin marks them "Denied" at registration or "Suspended" later on, in which case they can't run any Grid jobs and they are deleted from the VOMS database.<br />
When users select additional Groups, the GroupOwners need take no action unless they object.<br />
Users may select GroupRoles within a given Group as well.<br />
<br />
==What the VO-Admin can do==<br />
<br />
Everything, including VO member suspension and removal, which nobody else can do.<br />
'''NB:''' If you try to remove a member and the box to tick is grey, the member holds some authority (GroupOwner/Manager or Representative). You will have to remove that function from him/her first via "Manage VO Admin Roles". To remove the GroupOwner/Manager authority, use control/click on the relevant Group/Role (it will be blue).<br />
<br />
==What the Representative can do==<br />
<br />
Approve Candidates during the initial registration and handle Expired users. To do this, the Representative should either click on the link (s)he got in the email notification or go to the web interface, open the "Members" sub-menu, click on "Set status", search for "New" candidates and approve those assigned to him/her.<br />
<br />
The Representative selected by the user can assign another Representative before approving, as appropriate. Example: a DTEAM VO Candidate from a Russian LCG Site selected the SWE ROC manager as Representative. Gonzalo (SWE) can replace himself with Alexander (RDIG).<br />
<br />
==What the GroupOwners can do==<br />
Group Owners can create groups/group roles and assign new Group Owner/Manager roles to members within the subgroups. If they decide that a user doesn't belong to their group(s), they can de-assign him/her at any time. Example: If Sven from DECH selects the additional group /dteam/see, Kostas can move him out.<br />
<br />
==What the GroupManagers can do==<br />
They can deassign users from their group at any time.<br />
<br />
http://cern.ch/dimou/lcg/vomrs/Groups-Roles.doc contains EGEE-era implementation details and plans on Groups/Roles. As VOMRS functionality will be implemented in VOMS, this document is becoming obsolete.<br />
<br />
==Proposed responsibilities==<br />
{| border="1"<br />
! Operations manager and deputy<br />
! Operations centre staff<br />
! Site staff<br />
|-<br />
|GroupOwner,GroupManager, VO Representative<br />
|GroupManager<br />
|Group Member<br />
|-<br />
|}<br />
<br />
=Mini How-To=<br />
<br />
* To (De)Assign someone as Representative go to "Manage VO Admin Roles".<br />
* To (De)Assign someone as GroupOwner go to "Manage VO Admin Roles", search for the VO member and select the Group (s)he should own.<br />
* To change the Representative for all members, go to "Change Representative", select the right DN from the drop-down menu, and click on each member.<br />
* To receive email notification for actions you need to take go to "Subscription" and select what you wish to be notified about.<br />
<br />
{| border="1"<br />
! <br />
! VO Admin<br />
! Representative<br />
! GroupOwner<br />
! GroupManager<br />
|-<br />
|Candidate<br />
|remove <br />
|<br />
|<br />
|<br />
|-<br />
|Applicant<br />
|Remove/approve/deny Assign/deassign to/from group and group role<br />
|Remove/approve/suspend/expire<br />
|Assign/deassign to/from group and group role<br />
|-<br />
|Member<br />
|Remove/approve/suspend/expire Assign/deassign to/from group and group role<br />
|expire from Institute but not from the VO<br />
|assign/deassign to/from group and group role<br />
|assign/deassign to/from group and group role<br />
|-<br />
|Member’s certificate<br />
|Remove/approve/deny/suspend<br />
|<br />
|assign/deassign to/from group and group role<br />
|assign/deassign to/from group and group role<br />
|-<br />
|}<br />
<br />
=Migration of the dteam VO from CERN VOMS server to EGI VOMS (AUTH/NGI_GRNET)=<br />
# Sync dteam Greece with dteam CERN.<br />
# Advise sites to add the new VOMS server to their configuration. Sites need to replace the following site-info.def definitions: <br />
<pre><br />
VO_DTEAM_VOMS_SERVERS='vomss://voms.cern.ch:8443/voms/dteam?/dteam/' <br />
<br />
VO_DTEAM_VOMSES="\<br />
'dteam lcg-voms.cern.ch 15004 \<br />
/DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch dteam 24' \<br />
'dteam voms.cern.ch 15004 \<br />
/DC=ch/DC=cern/OU=computers/CN=voms.cern.ch dteam 24'" </pre><br />
and<br />
<pre><br />
VO_DTEAM_VOMS_CA_DN="\<br />
'/DC=ch/DC=cern/CN=CERN Trusted Certification Authority' \<br />
'/DC=ch/DC=cern/CN=CERN Trusted Certification Authority'"<br />
</pre><br />
with these:<br />
<pre><br />
VO_DTEAM_VOMS_SERVERS='vomss://voms.hellasgrid.gr:8443/voms/dteam?/dteam/' <br />
<br />
VO_DTEAM_VOMSES="\<br />
'dteam voms.hellasgrid.gr 15004 \<br />
/C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms.hellasgrid.gr dteam 24' \<br />
'dteam voms2.hellasgrid.gr 15004 \<br />
/C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms2.hellasgrid.gr dteam 24'" </pre><br />
and<br />
<pre><br />
VO_DTEAM_VOMS_CA_DN="\<br />
'/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006' \<br />
'/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006'"<br />
</pre><br />
'''Run yaim after changing site-info.def'''. The new "lsc" files should be '''voms.hellasgrid.gr.lsc''' and '''voms2.hellasgrid.gr.lsc''' with the following contents, respectively:<br />
<pre><br />
/C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms.hellasgrid.gr<br />
/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006<br />
</pre><br />
<pre><br />
/C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms2.hellasgrid.gr<br />
/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006<br />
</pre><br />
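The two "lsc" files above can also be generated rather than typed by hand. The following is a minimal sketch only: the helper names are hypothetical, and the target directory is an assumption that each site should replace with its own VOMS directory for dteam.<br />

```python
from pathlib import Path

# DNs copied from the listings above.
HOST_DNS = {
    "voms.hellasgrid.gr": "/C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms.hellasgrid.gr",
    "voms2.hellasgrid.gr": "/C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms2.hellasgrid.gr",
}
CA_DN = "/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006"


def lsc_content(host_dn: str, ca_dn: str) -> str:
    """Return the two-line body of an lsc file: host DN, then issuing CA DN."""
    return f"{host_dn}\n{ca_dn}\n"


def write_lsc_files(target_dir: Path) -> list[Path]:
    """Write one <hostname>.lsc file per VOMS server into target_dir.

    target_dir is an assumption: pass whatever directory your
    middleware reads the dteam lsc files from.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for host, host_dn in HOST_DNS.items():
        path = target_dir / f"{host}.lsc"
        path.write_text(lsc_content(host_dn, CA_DN))
        written.append(path)
    return written
```

Calling write_lsc_files(Path("dteam")) in a scratch directory is an easy way to compare the generated files with the expected contents before installing them.<br />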
<ol start="3"><br />
<li> Sites also need an rpm containing the host cert(s) of the new VOMS server(s), at least for the WMS while it still requires the certs of supported VOs. We could add those certs to lcg-vomscerts to smooth the transition, but it may be better for EGI to control its own rpm. Update 11/10/2010: lcg-vomscerts has already been updated. Versions 6.1.0 and later contain the new certs. Latest as of 11/11/2010: [http://etics-repository.cern.ch/repository/download/registered/org.glite/lcg-vomscerts/6.2.0/noarch/lcg-vomscerts-6.2.0-1.noarch.rpm lcg-vomscerts-6.2.0-1.noarch.rpm].<br />
</ol><br />
<ol start="4"><br />
<li> Wait a bit (1 month sounds reasonable).<br />
</ol><br />
<ol start="5"><br />
<li> Close registrations at CERN. Stopping the VOMRS service (service vomrs stop) should do.<br />
</ol><br />
<ol start="6"><br />
<li> Sync dteam Greece with dteam CERN.<br />
</ol><br />
<ol start="7"><br />
<li> Advise new users to register with Greece. https://voms.hellasgrid.gr:8443/vo/dteam/vomrs<br />
</ol><br />
<ol start="8"><br />
<li> Remove CERN dteam. '''This will take place on Wednesday January 26'''.<br />
</ol><br />
<ol start="9"><br />
<li> Advise sites to drop CERN dteam configuration.<br />
</ol><br />
<br />
= Resources =<br />
*VOMRS Tutorials: http://www.uscms.org/SoftwareComputing/Grid/VO/tutorials.html<br />
*VOMRS Online Documentation: http://computing.fnal.gov/docs/products/vomrs/<br />
<br />
= Acknowledgements =<br />
Information provided in this page was collected from M. Dimou's VOMRS [http://dimou.web.cern.ch/dimou/lcg/registrar/TF/vomrs-tips.html tips page], with material provided by Tanya Levshina (VOMRS Project Leader and developer).</div>Dzilahttps://wiki.egi.eu/w/index.php?title=Dteam_vo&diff=25690Dteam vo2011-10-07T12:57:53Z<p>Dzila: /* Proposed responsibilities */</p>
<hr />
<div>{{Template:Op menubar}}<br />
{{Template:Doc_menubar}}<br />
{{TOC_right}}<br />
<br />
=General Information =<br />
The DTEAM VO is an infrastructure VO that MUST be enabled by all EGI Resource Centres that support the VO concept for user authentication, as stated in the [https://documents.egi.eu/document/31 Resource Centre Operational Level Agreement]. It is meant for testing and troubleshooting of grid capabilities across EGI Resource Centres. Usage of the DTEAM VO is subject to the EGI [[SPG:Documents| Security Policies]].<br />
* [http://operations-portal.egi.eu/vo/downloadAUP/file/dteam-AcceptableUsePolicy-20110926-1316993681969.txt DTEAM AUP].<br />
* '''Get support''': in order to get support about the DTEAM VO please [http://helpdesk.egi.eu/ open a ticket], select type ''Operations'', and set ''concerned VO'' to ''dteam''. If you have privileges, assign it to the Support Unit ''VOsupport unit''.<br />
<br />
=Recipes for VO/ROC/NGI/Site managers=<br />
<br />
==What users filling the '''dteam''' VO Registration form should do==<br />
<br />
Select the appropriate '''Representative''' and '''Group''' for themselves. The Representative corresponding to their region is offered in a drop-down menu. Example: dteam users from Greece should select Kostas Koumantaros or Ioannis Lambiotis as their Representative and /dteam/NGI_GRNET as their Group.<br />
Everybody is automatically registered under the root group /dteam in addition to any Group they might select. Nobody can de-assign them from this "root group" unless the VO-Admin sets them to "Denied" at registration or to "Suspended" later on, in which case they cannot run any Grid jobs and they are deleted from the VOMS database.<br />
When users select additional Groups, the GroupOwners have nothing to do, if they have no objection.<br />
Users may select GroupRoles within a given Group as well.<br />
<br />
==What the VO-Admin can do==<br />
<br />
Everything including VO member suspension/removal that nobody else can do!<br />
'''NB!!!''' If you try to remove a member and the tick box is grey, the member has some authority (GroupOwner/Manager or Representative). You will have to remove that function from him/her first via "Manage VO Admin Roles". To remove the GroupOwner/Manager authority, use control/click on the relevant Group/Role (it will be blue)!<br />
<br />
==What the Representative can do==<br />
<br />
Approve Candidates during the initial registration and handle Expired users. To do this, the Representative should either click on the link (s)he got in the email notification or go to the web interface, open the "Members" sub-menu, click on "Set status", search for "New" candidates and approve those assigned to him/her.<br />
<br />
The Representative selected by the user can assign another Representative before approving, as appropriate. Example: a DTEAM VO Candidate from a Russian LCG Site selected the SWE ROC manager as Representative. Gonzalo (SWE) can replace himself with Alexander (RDIG).<br />
<br />
==What the GroupOwners can do==<br />
Group Owners can create groups/group roles and assign new Group Owner/Manager roles to members within the subgroups. If they decide that a user doesn't belong to their group(s), they can de-assign him/her at any time. Example: if Sven from DECH selects the additional group /dteam/see, Kostas can move him out.<br />
<br />
==What the GroupManagers can do==<br />
They can deassign users from their group at any time.<br />
<br />
http://cern.ch/dimou/lcg/vomrs/Groups-Roles.doc contains EGEE-era implementation details and plans on Groups/Roles. As VOMRS functionality will be implemented in VOMS, this document is becoming obsolete.<br />
<br />
==Proposed responsibilities==<br />
{| border="1"<br />
! Operations manager and deputy<br />
! Operations centre staff<br />
! Site staff<br />
|-<br />
|GroupOwner,GroupManager<br />
|GroupManager<br />
|GroupMember<br />
|-<br />
|}<br />
<br />
=Mini How-To=<br />
<br />
* To (De)Assign someone as Representative go to "Manage VO Admin Roles".<br />
* To (De)Assign someone as GroupOwner go to "Manage VO Admin Roles", search for the VO member and select the Group (s)he should own.<br />
* To change the Representative for all members, go to "Change Representative", select the right DN from the drop-down menu, and click on each member.<br />
* To receive email notification for actions you need to take go to "Subscription" and select what you wish to be notified about.<br />
<br />
{| border="1"<br />
! <br />
! VO Admin<br />
! Representative<br />
! GroupOwner<br />
! GroupManager<br />
|-<br />
|Candidate<br />
|remove <br />
|<br />
|<br />
|<br />
|-<br />
|Applicant<br />
|Remove/approve/deny Assign/deassign to/from group and group role<br />
|Remove/approve/suspend/expire<br />
|Assign/deassign to/from group and group role<br />
|-<br />
|Member<br />
|Remove/approve/suspend/expire Assign/deassign to/from group and group role<br />
|expire from Institute but not from the VO<br />
|assign/deassign to/from group and group role<br />
|assign/deassign to/from group and group role<br />
|-<br />
|Member’s certificate<br />
|Remove/approve/deny/suspend<br />
|<br />
|assign/deassign to/from group and group role<br />
|assign/deassign to/from group and group role<br />
|-<br />
|}<br />
<br />
=Migration of the dteam VO from CERN VOMS server to EGI VOMS (AUTH/NGI_GRNET)=<br />
# Sync dteam Greece with dteam CERN.<br />
# Advise sites to add the new VOMS server to their configuration. They need to be told new site-info.def definitions to replace these: <br />
<pre><br />
VO_DTEAM_VOMS_SERVERS='vomss://voms.cern.ch:8443/voms/dteam?/dteam/' <br />
<br />
VO_DTEAM_VOMSES="\<br />
'dteam lcg-voms.cern.ch 15004 \<br />
/DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch dteam 24' \<br />
'dteam voms.cern.ch 15004 \<br />
/DC=ch/DC=cern/OU=computers/CN=voms.cern.ch dteam 24'" </pre><br />
and<br />
<pre><br />
VO_DTEAM_VOMS_CA_DN="\<br />
'/DC=ch/DC=cern/CN=CERN Trusted Certification Authority' \<br />
'/DC=ch/DC=cern/CN=CERN Trusted Certification Authority'"<br />
</pre><br />
with these:<br />
<pre><br />
VO_DTEAM_VOMS_SERVERS='vomss://voms.hellasgrid.gr:8443/voms/dteam?/dteam/' <br />
<br />
VO_DTEAM_VOMSES="\<br />
'dteam voms.hellasgrid.gr 15004 \<br />
/C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms.hellasgrid.gr dteam 24' \<br />
'dteam voms2.hellasgrid.gr 15004 \<br />
/C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms2.hellasgrid.gr dteam 24'" </pre><br />
and<br />
<pre><br />
VO_DTEAM_VOMS_CA_DN="\<br />
'/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006' \<br />
'/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006'"<br />
</pre><br />
'''Run yaim after changing site-info.def'''. The new "lsc" files should be '''voms.hellasgrid.gr.lsc''' and '''voms2.hellasgrid.gr.lsc''' with the following contents, respectively:<br />
<pre><br />
/C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms.hellasgrid.gr<br />
/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006<br />
</pre><br />
<pre><br />
/C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms2.hellasgrid.gr<br />
/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006<br />
</pre><br />
<ol start="3"><br />
<li> Sites also need an rpm containing the host cert(s) of the new VOMS server(s), at least for the WMS while it still requires the certs of supported VOs. We could add those certs to lcg-vomscerts to smooth the transition, but it may be better for EGI to control its own rpm. Update 11/10/2010: lcg-vomscerts has already been updated. Versions 6.1.0 and later contain the new certs. Latest as of 11/11/2010: [http://etics-repository.cern.ch/repository/download/registered/org.glite/lcg-vomscerts/6.2.0/noarch/lcg-vomscerts-6.2.0-1.noarch.rpm lcg-vomscerts-6.2.0-1.noarch.rpm].<br />
</ol><br />
<ol start="4"><br />
<li> Wait a bit (1 month sounds reasonable).<br />
</ol><br />
<ol start="5"><br />
<li> Close registrations at CERN. Stopping the VOMRS service (service vomrs stop) should do.<br />
</ol><br />
<ol start="6"><br />
<li> Sync dteam Greece with dteam CERN.<br />
</ol><br />
<ol start="7"><br />
<li> Advise new users to register with Greece. https://voms.hellasgrid.gr:8443/vo/dteam/vomrs<br />
</ol><br />
<ol start="8"><br />
<li> Remove CERN dteam. '''This will take place on Wednesday January 26'''.<br />
</ol><br />
<ol start="9"><br />
<li> Advise sites to drop CERN dteam configuration.<br />
</ol><br />
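Before closing the migration, a site may want to confirm that no dteam variable in its site-info.def still points at CERN. The following is a rough sketch under stated assumptions: the function name is hypothetical, and it only understands plain VO_DTEAM_* assignments with backslash-continued lines as shown in the listings above, so it is not a general shell parser.<br />

```python
import re

# Start of a dteam assignment, e.g. VO_DTEAM_VOMSES="\
ASSIGNMENT = re.compile(r"^(VO_DTEAM_\w+)=")
# Markers taken from the old CERN definitions shown above.
CERN_MARKER = re.compile(r"cern\.ch|DC=cern", re.IGNORECASE)


def variables_still_pointing_at_cern(site_info_text: str) -> list[str]:
    """Return the VO_DTEAM_* variable names whose value mentions CERN."""
    flagged = []
    current = None  # variable whose (possibly multi-line) value we are inside
    for line in site_info_text.splitlines():
        match = ASSIGNMENT.match(line)
        if match:
            current = match.group(1)
        elif not line.strip():
            current = None  # a blank line ends the current assignment
        if current and CERN_MARKER.search(line) and current not in flagged:
            flagged.append(current)
    return flagged
```

An empty result suggests that the CERN dteam configuration has been dropped; any names returned indicate definitions that still need replacing.<br />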
<br />
= Resources =<br />
*VOMRS Tutorials: http://www.uscms.org/SoftwareComputing/Grid/VO/tutorials.html<br />
*VOMRS Online Documentation: http://computing.fnal.gov/docs/products/vomrs/<br />
<br />
= Acknowledgements =<br />
Information provided in this page was collected from M. Dimou's VOMRS [http://dimou.web.cern.ch/dimou/lcg/registrar/TF/vomrs-tips.html tips page], with material provided by Tanya Levshina (VOMRS Project Leader and developer).</div>Dzilahttps://wiki.egi.eu/w/index.php?title=Dteam_vo&diff=25689Dteam vo2011-10-07T12:56:02Z<p>Dzila: /* Recipes for VO/ROC/NGI/Site managers */</p>
<hr />
<div>{{Template:Op menubar}}<br />
{{Template:Doc_menubar}}<br />
{{TOC_right}}<br />
<br />
=General Information =<br />
The DTEAM VO is an infrastructure VO that MUST be enabled by all EGI Resource Centres that support the VO concept for user authentication, as stated in the [https://documents.egi.eu/document/31 Resource Centre Operational Level Agreement]. It is meant for testing and troubleshooting of grid capabilities across EGI Resource Centres. Usage of the DTEAM VO is subject to the EGI [[SPG:Documents| Security Policies]].<br />
* [http://operations-portal.egi.eu/vo/downloadAUP/file/dteam-AcceptableUsePolicy-20110926-1316993681969.txt DTEAM AUP].<br />
* '''Get support''': in order to get support about the DTEAM VO please [http://helpdesk.egi.eu/ open a ticket], select type ''Operations'', and set ''concerned VO'' to ''dteam''. If you have privileges, assign it to the Support Unit ''VOsupport unit''.<br />
<br />
=Recipes for VO/ROC/NGI/Site managers=<br />
<br />
==What users filling the '''dteam''' VO Registration form should do==<br />
<br />
Select the appropriate '''Representative''' and '''Group''' for themselves. The Representative corresponding to their region is offered in a drop-down menu. Example: dteam users from Greece should select Kostas Koumantaros or Ioannis Lambiotis as their Representative and /dteam/NGI_GRNET as their Group.<br />
Everybody is automatically registered under the root group /dteam in addition to any Group they might select. Nobody can de-assign them from this "root group" unless the VO-Admin sets them to "Denied" at registration or to "Suspended" later on, in which case they cannot run any Grid jobs and they are deleted from the VOMS database.<br />
When users select additional Groups, the GroupOwners have nothing to do, if they have no objection.<br />
Users may select GroupRoles within a given Group as well.<br />
<br />
==What the VO-Admin can do==<br />
<br />
Everything including VO member suspension/removal that nobody else can do!<br />
'''NB!!!''' If you try to remove a member and the tick box is grey, the member has some authority (GroupOwner/Manager or Representative). You will have to remove that function from him/her first via "Manage VO Admin Roles". To remove the GroupOwner/Manager authority, use control/click on the relevant Group/Role (it will be blue)!<br />
<br />
==What the Representative can do==<br />
<br />
Approve Candidates during the initial registration and handle Expired users. To do this, the Representative should either click on the link (s)he got in the email notification or go to the web interface, open the "Members" sub-menu, click on "Set status", search for "New" candidates and approve those assigned to him/her.<br />
<br />
The Representative selected by the user can assign another Representative before approving, as appropriate. Example: a DTEAM VO Candidate from a Russian LCG Site selected the SWE ROC manager as Representative. Gonzalo (SWE) can replace himself with Alexander (RDIG).<br />
<br />
==What the GroupOwners can do==<br />
Group Owners can create groups/group roles and assign new Group Owner/Manager roles to members within the subgroups. If they decide that a user doesn't belong to their group(s), they can de-assign him/her at any time. Example: if Sven from DECH selects the additional group /dteam/see, Kostas can move him out.<br />
<br />
==What the GroupManagers can do==<br />
They can deassign users from their group at any time.<br />
<br />
http://cern.ch/dimou/lcg/vomrs/Groups-Roles.doc contains EGEE-era implementation details and plans on Groups/Roles. As VOMRS functionality will be implemented in VOMS, this document is becoming obsolete.<br />
<br />
==Proposed responsibilities==<br />
{| border="1"<br />
! <br />
! Operations manager and deputy<br />
! Operations centre staff<br />
! Site staff<br />
|-<br />
|Candidate<br />
|remove <br />
|<br />
|<br />
|<br />
|-<br />
|Applicant<br />
|Remove/approve/deny Assign/deassign to/from group and group role<br />
|Remove/approve/suspend/expire<br />
|Assign/deassign to/from group and group role<br />
|-<br />
|Member<br />
|Remove/approve/suspend/expire Assign/deassign to/from group and group role<br />
|expire from Institute but not from the VO<br />
|assign/deassign to/from group and group role<br />
|assign/deassign to/from group and group role<br />
|-<br />
|Member’s certificate<br />
|Remove/approve/deny/suspend<br />
|<br />
|assign/deassign to/from group and group role<br />
|assign/deassign to/from group and group role<br />
|-<br />
|}<br />
<br />
=Mini How-To=<br />
<br />
* To (De)Assign someone as Representative go to "Manage VO Admin Roles".<br />
* To (De)Assign someone as GroupOwner go to "Manage VO Admin Roles", search for the VO member and select the Group (s)he should own.<br />
* To change the Representative for all members, go to "Change Representative", select the right DN from the drop-down menu, and click on each member.<br />
* To receive email notification for actions you need to take go to "Subscription" and select what you wish to be notified about.<br />
<br />
{| border="1"<br />
! <br />
! VO Admin<br />
! Representative<br />
! GroupOwner<br />
! GroupManager<br />
|-<br />
|Candidate<br />
|remove <br />
|<br />
|<br />
|<br />
|-<br />
|Applicant<br />
|Remove/approve/deny Assign/deassign to/from group and group role<br />
|Remove/approve/suspend/expire<br />
|Assign/deassign to/from group and group role<br />
|-<br />
|Member<br />
|Remove/approve/suspend/expire Assign/deassign to/from group and group role<br />
|expire from Institute but not from the VO<br />
|assign/deassign to/from group and group role<br />
|assign/deassign to/from group and group role<br />
|-<br />
|Member’s certificate<br />
|Remove/approve/deny/suspend<br />
|<br />
|assign/deassign to/from group and group role<br />
|assign/deassign to/from group and group role<br />
|-<br />
|}<br />
<br />
=Migration of the dteam VO from CERN VOMS server to EGI VOMS (AUTH/NGI_GRNET)=<br />
# Sync dteam Greece with dteam CERN.<br />
# Advise sites to add the new VOMS server to their configuration. They need to be told new site-info.def definitions to replace these: <br />
<pre><br />
VO_DTEAM_VOMS_SERVERS='vomss://voms.cern.ch:8443/voms/dteam?/dteam/' <br />
<br />
VO_DTEAM_VOMSES="\<br />
'dteam lcg-voms.cern.ch 15004 \<br />
/DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch dteam 24' \<br />
'dteam voms.cern.ch 15004 \<br />
/DC=ch/DC=cern/OU=computers/CN=voms.cern.ch dteam 24'" </pre><br />
and<br />
<pre><br />
VO_DTEAM_VOMS_CA_DN="\<br />
'/DC=ch/DC=cern/CN=CERN Trusted Certification Authority' \<br />
'/DC=ch/DC=cern/CN=CERN Trusted Certification Authority'"<br />
</pre><br />
with these:<br />
<pre><br />
VO_DTEAM_VOMS_SERVERS='vomss://voms.hellasgrid.gr:8443/voms/dteam?/dteam/' <br />
<br />
VO_DTEAM_VOMSES="\<br />
'dteam voms.hellasgrid.gr 15004 \<br />
/C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms.hellasgrid.gr dteam 24' \<br />
'dteam voms2.hellasgrid.gr 15004 \<br />
/C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms2.hellasgrid.gr dteam 24'" </pre><br />
and<br />
<pre><br />
VO_DTEAM_VOMS_CA_DN="\<br />
'/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006' \<br />
'/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006'"<br />
</pre><br />
'''Run yaim after changing site-info.def'''. The new "lsc" files should be '''voms.hellasgrid.gr.lsc''' and '''voms2.hellasgrid.gr.lsc''' with the following contents, respectively:<br />
<pre><br />
/C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms.hellasgrid.gr<br />
/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006<br />
</pre><br />
<pre><br />
/C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms2.hellasgrid.gr<br />
/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006<br />
</pre><br />
<ol start="3"><br />
<li> Sites also need an rpm containing the host cert(s) of the new VOMS server(s), at least for the WMS while it still requires the certs of supported VOs. We could add those certs to lcg-vomscerts to smooth the transition, but it may be better for EGI to control its own rpm. Update 11/10/2010: lcg-vomscerts has already been updated. Versions 6.1.0 and later contain the new certs. Latest as of 11/11/2010: [http://etics-repository.cern.ch/repository/download/registered/org.glite/lcg-vomscerts/6.2.0/noarch/lcg-vomscerts-6.2.0-1.noarch.rpm lcg-vomscerts-6.2.0-1.noarch.rpm].<br />
</ol><br />
<ol start="4"><br />
<li> Wait a bit (1 month sounds reasonable).<br />
</ol><br />
<ol start="5"><br />
<li> Close registrations at CERN. Stopping the VOMRS service (service vomrs stop) should do.<br />
</ol><br />
<ol start="6"><br />
<li> Sync dteam Greece with dteam CERN.<br />
</ol><br />
<ol start="7"><br />
<li> Advise new users to register with Greece. https://voms.hellasgrid.gr:8443/vo/dteam/vomrs<br />
</ol><br />
<ol start="8"><br />
<li> Remove CERN dteam. '''This will take place on Wednesday January 26'''.<br />
</ol><br />
<ol start="9"><br />
<li> Advise sites to drop CERN dteam configuration.<br />
</ol><br />
<br />
= Resources =<br />
*VOMRS Tutorials: http://www.uscms.org/SoftwareComputing/Grid/VO/tutorials.html<br />
*VOMRS Online Documentation: http://computing.fnal.gov/docs/products/vomrs/<br />
<br />
= Acknowledgements =<br />
Information provided in this page was collected from M. Dimou's VOMRS [http://dimou.web.cern.ch/dimou/lcg/registrar/TF/vomrs-tips.html tips page], with material provided by Tanya Levshina (VOMRS Project Leader and developer).</div>Dzilahttps://wiki.egi.eu/w/index.php?title=PROC10_Recomputation_of_SAM_results_or_availability_reliability_statistics&diff=25582PROC10 Recomputation of SAM results or availability reliability statistics2011-10-06T13:33:34Z<p>Dzila: /* Steps */</p>
<hr />
<div>{{Template:Op menubar}}<br />
{{Template:Doc_menubar}}<br />
[[Category:Procedures]]<br />
__TOC__<br />
<br />
= Procedure for the recomputation of SAM results and availability/reliability =<br />
<br />
*'''Title''': Recomputation of monitoring results and availability <br />
*'''Document link''': <br />
*'''Last modified''': <br />
*'''Version''': 1.0 <br />
*'''Policy Group Acronym''': OMB<br />
*'''Policy Group Name''': Operations Management Board<br />
*'''Contact Person''': Dimitris Zilaskos<br />
*'''Document Status''': DRAFT<br />
*'''Approved Date''': <br />
*'''Procedure Statement''':The purpose of this document is to ...<br />
<br />
= Overview =<br />
This procedure documents the steps for requesting a correction in the <br />
[[SAM_Instances|SAM test results]] and in the related [[Availability_and_reliability_monthly_statistics|availability statistics]].<br />
<br />
DISCLAIMER: This procedure is only applicable to EGI OPS test results. Procedures for the computation of VO-specific availability report are VO-specific and are out of scope.<br />
<br />
= Prerequisites =<br />
Fixes in test results are accepted only when the failures were due to problems <br />
caused by the monitoring infrastructure itself. Some examples:<br />
* invalid proxy certificate used for submitting the monitoring probes in a Nagios instance;<br />
* problems with the Storage Element used for replica management tests resulting in errors on CE's metrics.<br />
<br />
= Steps =<br />
<br />
# '''STEP 1''': notify your Operations Centre by opening a [http://helpdesk.egi.eu/ GGUS ticket] to be assigned to your Operations Centre Support Unit. In the GGUS ticket you must mention:<br />
## the starting and ending time of the problem (including day and hour in UTC)<br />
## the Site, ROC or NGI affected by the problem<br />
## the VO affected by the problem<br />
## a description of the problem<br />
# '''STEP 2''': the Operations Centre analyzes the request. If the request is validated, the ticket is re-assigned to the [[GGUS:SLM-FAQ|Service Level Management]] (SLM) Support Unit, which is responsible for (1) collecting all reported problems and (2) discussing them with the SAM Support Unit by re-assigning the ticket to the [[GGUS:SAM/Nagios_FAQ|SAM/Nagios SU]].<br />
# '''STEP 3''': if the request for recomputation of the test results is accepted, the SAM Support Unit will be responsible for triggering a recomputation of the monthly availability statistics. Re-computation is performed by following these steps:<br />
## All Nagios metric results for any site and service are set to ''unknown'' status from the beginning of the hour reported in the starting time to one hour after the ending time. This covers results that may have arrived late.<br />
## The period is then recomputed for that particular Site, ROC or NGI. As a consequence, the availability and reliability of other sites won't be affected, as unknown periods are not considered in the computation.<br />
# '''STEP 4''': when the new availability statistics are ready for distribution, the SAM/Nagios SU reassigns the ticket to the SLM Support Unit, in order to notify that a new set of reports can be re-distributed to EGI.<br />
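The period that is set to ''unknown'' in STEP 3 can be worked out from the start and end times given in the ticket. The following is a minimal sketch of that arithmetic only: the function name is hypothetical, and it assumes "one hour after the ending time" means the reported end time plus exactly 60 minutes.<br />

```python
from datetime import datetime, timedelta, timezone


def unknown_window(problem_start: datetime, problem_end: datetime):
    """Return (window_start, window_end) for the period set to 'unknown'.

    window_start is the beginning of the hour containing problem_start;
    window_end is problem_end plus one hour, to cover late results.
    Both inputs are expected in UTC, as requested in STEP 1.
    """
    if problem_end < problem_start:
        raise ValueError("problem_end must not precede problem_start")
    window_start = problem_start.replace(minute=0, second=0, microsecond=0)
    window_end = problem_end + timedelta(hours=1)
    return window_start, window_end


# Illustrative values only: a problem reported from 09:40 to 13:15 UTC.
start = datetime(2011, 10, 3, 9, 40, tzinfo=timezone.utc)
end = datetime(2011, 10, 3, 13, 15, tzinfo=timezone.utc)
window = unknown_window(start, end)
```

For the illustrative times above, the window runs from 09:00 to 14:15 UTC on the same day.<br />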
<br />
= External links =<br />
* [https://tomtools.cern.ch/confluence/display/SAMDOC/Availability+Re-computation+Policy WLCG Availability re-computation policy]<br />
<br />
= Revision history =</div>Dzilahttps://wiki.egi.eu/w/index.php?title=Dteam_vo&diff=24432Dteam vo2011-09-14T15:03:52Z<p>Dzila: /* Migration of the dteam VO from CERN to EGI VOMS (AUTH/NGI_GRNET) */</p>
<hr />
<div>==Migration of the dteam VO from CERN to EGI VOMS (AUTH/NGI_GRNET)==<br />
# Sync dteam Greece with dteam CERN.<br />
# Advise sites to add the new VOMS server to their configuration. They need to be told new site-info.def definitions to replace these: <br />
<pre><br />
VO_DTEAM_VOMS_SERVERS='vomss://voms.cern.ch:8443/voms/dteam?/dteam/' <br />
<br />
VO_DTEAM_VOMSES="\<br />
'dteam lcg-voms.cern.ch 15004 \<br />
/DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch dteam 24' \<br />
'dteam voms.cern.ch 15004 \<br />
/DC=ch/DC=cern/OU=computers/CN=voms.cern.ch dteam 24'" </pre><br />
and<br />
<pre><br />
VO_DTEAM_VOMS_CA_DN="\<br />
'/DC=ch/DC=cern/CN=CERN Trusted Certification Authority' \<br />
'/DC=ch/DC=cern/CN=CERN Trusted Certification Authority'"<br />
</pre><br />
with these:<br />
<pre><br />
VO_DTEAM_VOMS_SERVERS='vomss://voms.hellasgrid.gr:8443/voms/dteam?/dteam/' <br />
<br />
VO_DTEAM_VOMSES="\<br />
'dteam voms.hellasgrid.gr 15004 \<br />
/C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms.hellasgrid.gr dteam 24' \<br />
'dteam voms2.hellasgrid.gr 15004 \<br />
/C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms2.hellasgrid.gr dteam 24'" </pre><br />
and<br />
<pre><br />
VO_DTEAM_VOMS_CA_DN="\<br />
'/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006' \<br />
'/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006'"<br />
</pre><br />
'''Run yaim after changing site-info.def'''. The new "lsc" files should be '''voms.hellasgrid.gr.lsc''' and '''voms2.hellasgrid.gr.lsc''' with the following contents, respectively:<br />
<pre><br />
/C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms.hellasgrid.gr<br />
/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006<br />
</pre><br />
<pre><br />
/C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms2.hellasgrid.gr<br />
/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006<br />
</pre><br />
<ol start="3"><br />
<li> Sites also need an rpm containing the host cert(s) of the new VOMS server(s), at least for the WMS while it still requires the certs of supported VOs. We could add those certs to lcg-vomscerts to smooth the transition, but it may be better for EGI to control its own rpm. Update 11/10/2010: lcg-vomscerts has already been updated. Versions 6.1.0 and later contain the new certs. Latest as of 11/11/2010: [http://etics-repository.cern.ch/repository/download/registered/org.glite/lcg-vomscerts/6.2.0/noarch/lcg-vomscerts-6.2.0-1.noarch.rpm lcg-vomscerts-6.2.0-1.noarch.rpm].<br />
</ol><br />
<ol start="4"><br />
<li> Wait a bit (1 month sounds reasonable).<br />
</ol><br />
<ol start="5"><br />
<li> Close registrations at CERN. Stopping the VOMRS service (service vomrs stop) should do.<br />
</ol><br />
<ol start="6"><br />
<li> Sync dteam Greece with dteam CERN.<br />
</ol><br />
<ol start="7"><br />
<li> Advise new users to register with Greece. https://voms.hellasgrid.gr:8443/vo/dteam/vomrs<br />
</ol><br />
<ol start="8"><br />
<li> Remove CERN dteam. '''This will take place on Wednesday January 26'''.<br />
</ol><br />
<ol start="9"><br />
<li> Advise sites to drop CERN dteam configuration.<br />
</ol><br />
<br />
==General information about the VO==<br />
The dteam VO started in EGEE as a VO for operations. '''General Grid security policies''' as defined in the documents at [https://wiki.egi.eu/wiki/SPG:Documents] are applicable. Currently the '''AUP''' [http://cic.gridops.org/common/all/documents/AUP/dteam-AcceptableUsePolicy-20060830-144906.aup] (which needs updating for EGI terminology) states:<br />
<br />
This acceptable Use Policy applies to all members of the DTEAM Virtual<br />
Organization, hereafter referred to as the VO, with reference to use of<br />
the LCG/EGEE Grid infrastructure, hereafter referred to as the Grid. The<br />
ROC managers' coordination committee is the body that owns and gives<br />
authority to this policy.<br />
<br />
The goal of the VO is to facilitate the deployment of a stable production <br />
Grid infrastructure. To this end, members of this VO -who have to be<br />
associated with a registered site and be involved in its operation- are<br />
allowed to run tests which validate the correct configuration of their<br />
site. Site performance evaluation and/or monitoring programs may also be<br />
run under the DTEAM VO with the approval of the Site Manager, subject to<br />
the agreement of the affected sites' management.<br />
<br />
During all times at which they are utilising Grid resources, in testing or<br />
performing productions for validation, the Members and Managers of the VO<br />
agree to be bound by the Grid Acceptable Usage Rules, VO Security Policy<br />
and other relevant Grid Policies, and to use the Grid only in the<br />
furtherance of the stated goals of the VO.<br />
<br />
==How to get support==<br />
<br />
Open a GGUS ticket, '''select Operations as type''', and '''set concerned VO to dteam'''. If you have privileges, assign it to the '''VOsupport unit'''.<br />
<br />
==Recipes for VO/ROC/NGI/Site managers==<br />
<br />
===What users filling the '''dteam''' VO Registration form should do:===<br />
<br />
Select the appropriate '''Representative''' and '''Group''' for themselves. The Representative corresponding to their region is offered in a drop-down menu. Example: dteam users from Greece should select Kostas Koumantaros or Ioannis Lambiotis as their Representative and /dteam/NGI_GRNET as their Group.<br />
Everybody is automatically registered under the root group /dteam in addition to any Group they might select. Nobody can de-assign them from this "root group" unless the VO-Admin sets them to "Denied" at registration or to "Suspended" later on, in which case they cannot run any Grid jobs and they are deleted from the VOMS database.<br />
When users select additional Groups, the GroupOwners have nothing to do, if they have no objection.<br />
Users may select GroupRoles within a given Group as well.<br />
<br />
===What the VO-Admin can do:===<br />
<br />
Everything including VO member suspension/removal that nobody else can do!<br />
'''NB!!!''' If you try to remove a member and the tick box is grey, the member has some authority (GroupOwner/Manager or Representative). You will have to remove that function from him/her first via "Manage VO Admin Roles". To remove the GroupOwner/Manager authority, use control/click on the relevant Group/Role (it will be blue)!<br />
<br />
===What the Representative can do:===<br />
<br />
Approve Candidates during the initial registration and handle Expired users. To do this, the Representative should either click on the link (s)he got in the email notification or go to the web interface, open the "Members" sub-menu, click on "Set status", search for "New" candidates and approve those assigned to him/her.<br />
<br />
The Representative selected by the user can assign another Representative before approving, as appropriate. Example: a DTEAM VO Candidate from a Russian LCG Site selected the SWE ROC manager as Representative. Gonzalo (SWE) can replace himself with Alexander (RDIG).<br />
<br />
===What the GroupOwners can do:===<br />
<br />
Group Owners can create groups/group roles and assign new Group Owner/Manager roles to members within the subgroups. If they decide that a user doesn't belong to their group(s), they can de-assign him/her at any time. Example: If Sven from DECH selects additional group /dteam/see, Kostas can move him out.<br />
<br />
<br />
===What the GroupManagers can do:===<br />
They can deassign users from their group at any time.<br />
<br />
http://cern.ch/dimou/lcg/vomrs/Groups-Roles.doc contains EGEE-era implementation details and plans on Groups/Roles. As VOMRS functionality will be implemented in VOMS, this document is becoming obsolete.<br />
<br />
===Mini How-To:===<br />
<br />
* To (De)Assign someone as Representative go to "Manage VO Admin Roles".<br />
* To (De)Assign someone as GroupOwner go to "Manage VO Admin Roles", search for the VO member and select the Group (s)he should own.<br />
* To change the Representative for all members go to "Change Representative", select the right DN from the drop-down menu and click on each member.<br />
* To receive email notification for actions you need to take go to "Subscription" and select what you wish to be notified about.<br />
<br />
VOMRS Tutorials: http://www.uscms.org/SoftwareComputing/Grid/VO/tutorials.html<br />
<br />
VOMRS Online Documentation: http://computing.fnal.gov/docs/products/vomrs/<br />
<br />
{| border="1"<br />
! <br />
! VO Admin<br />
! Representative<br />
! GroupOwner<br />
! GroupManager<br />
|-<br />
|Candidate<br />
|remove <br />
|<br />
|<br />
|<br />
|-<br />
|Applicant<br />
|Remove/approve/deny Assign/deassign to/from group and group role<br />
|Remove/approve/suspend/expire<br />
|Assign/deassign to/from group and group role<br />
|-<br />
|Member<br />
|Remove/approve/suspend/expire Assign/deassign to/from group and group role<br />
|expire from Institute but not from the VO<br />
|assign/deassign to/from group and group role<br />
|assign/deassign to/from group and group role<br />
|-<br />
|Member’s certificate<br />
|Remove/approve/deny/suspend<br />
|<br />
|assign/deassign to/from group and group role<br />
|assign/deassign to/from group and group role<br />
|-<br />
|}<br />
<br />
'''Info obtained from Maria Dimou VOMRS tips page http://dimou.web.cern.ch/dimou/lcg/registrar/TF/vomrs-tips.html, with material provided by Tanya Levshina (VOMRS Project Leader and developer)'''</div>Dzilahttps://wiki.egi.eu/w/index.php?title=Resource_Centres_OLA_and_Resource_infrastructure_Provider_OLA_reports&diff=24205Resource Centres OLA and Resource infrastructure Provider OLA reports2011-09-08T11:53:51Z<p>Dzila: /* 2011 */</p>
<hr />
<div>{{Template:Op menubar}}<br />
[[Category:Procedures]]<br />
[[Category:Service Level Management]]<br />
{{TOC_right}}<br />
<br />
It is mandatory that EGI certified Resource Centres provide a minimum monthly availability and reliability as specified below (see the [https://documents.egi.eu/document/31 site-NGI Operational Level Agreement] for details). Availability and reliability statistics (based on the global OPS VO) are issued on a monthly basis.<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
| '''minimum availability'''<br />
| 70%<br />
|-<br />
| '''minimum reliability'''<br />
| 75%<br />
|-<br />
|'''Condition for suspension'''<br />
| Resource Centres with an availability of less than '''70%''' for three consecutive months will be suspended, i.e. removed from the production infrastructure. The 70% threshold applies from PY2 (May 2011 reports): the suspension policy was reviewed in April 2011, and the original 50% threshold was increased to 70%.<br />
|-<br />
|'''Condition for justification'''<br />
|Resource Centres not providing minimum monthly performance (70% availability, 75% reliability) MUST provide justification through a GGUS ticket.<br />
|}<br />
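<br />
The thresholds in the table above can be expressed as a short check. The following is only an illustrative Python sketch: the function names and the list-of-monthly-percentages input are assumptions made for this example and are not part of any EGI tool.<br />

```python
# Thresholds taken from the table above (percentages).
MIN_AVAILABILITY = 70.0
MIN_RELIABILITY = 75.0
SUSPENSION_MONTHS = 3  # consecutive months below the availability threshold


def needs_justification(availability, reliability):
    """True if a Resource Centre missed either monthly minimum."""
    return availability < MIN_AVAILABILITY or reliability < MIN_RELIABILITY


def eligible_for_suspension(monthly_availability):
    """True if the last three monthly availability figures are all below 70%.

    monthly_availability is a hypothetical list of percentages, oldest first.
    """
    recent = monthly_availability[-SUSPENSION_MONTHS:]
    return (len(recent) == SUSPENSION_MONTHS
            and all(value < MIN_AVAILABILITY for value in recent))
```

For example, a Resource Centre reporting 72% availability and 74% reliability in a month would need to provide a justification, but it would not count towards suspension because its availability is not below 70%.<br />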
<br />
<br />
= Performance reports=<br />
* [https://wiki.egi.eu/wiki/Availability_and_reliability_reports_metrics Overview] of availability and reliability statistics including suspended sites<br />
* [https://wiki.egi.eu/wiki/List_of_sites_for_which_the_availability_followup_procedures_were_not_applicable List of sites] for which availability followup procedures were not applicable<br />
<!--* [https://twiki.cern.ch/twiki/bin/view/EGEE/SuspendedSites List of suspended sites (2009)]--><br />
== 2011 ==<br />
[https://documents.egi.eu/document/332 Jan]/<br />
[https://documents.egi.eu/document/402 Feb]/<br />
[https://documents.egi.eu/document/465 Mar]/<br />
[https://documents.egi.eu/document/508 Apr]/<br />
[https://documents.egi.eu/document/593 May]/<br />
[https://documents.egi.eu/document/648 Jun]/<br />
[https://documents.egi.eu/document/716 Jul]/<br />
[https://documents.egi.eu/document/783 Aug]<br />
<br />
== 2010 ==<br />
*[https://documents.egi.eu/document/42 May]/[https://documents.egi.eu/document/96 Jun]/[https://documents.egi.eu/document/130 Jul]/[https://documents.egi.eu/document/157 Aug]/[https://documents.egi.eu/document/219 Sep]/[https://documents.egi.eu/document/238 Oct]/[https://documents.egi.eu/document/266 Nov]/[https://documents.egi.eu/document/299 Dec]<br />
*[https://edms.cern.ch/document/963325 January 2008 - April 2010] (EGEE league tables)<br />
<br />
== EGI-wide Availability and Reliability ==<br />
It is available [https://documents.egi.eu/public/ShowDocument?docid=415 here] (xls file, data from May 01 2010)<br />
<br />
= Availability statistics per service/Resource Centre =<br />
[https://grid-monitoring.cern.ch/myegi/sa/# MyEGI]<br />
<br />
=Report generator=<br />
*[http://gvdev.cern.ch/GVPC/Excel/ '''(DEPRECATED)''' GridView availability/reliability report generator] (providing access to the database including Nagios results for OPS and SAM results for VOs)<br />
*[http://gridview015.cern.ch/GVPC/Excel/ACE/ ACE report generator]<br />
*[https://gvdev.cern.ch/ACEVAL/ace_index.php ACE visualization portal]<br />
<br />
=Process for quality verification=<br />
<br />
* '''Generation of statistics'''<br />
Availability and reliability statistics are automatically generated the first week of the month by the [https://wiki.egi.eu/wiki/External_tools#Availability_Computation_Engine Availability Computation Engine] (Gridview until May 2011) using the profile in pdf format and placed under [http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/]. An Excel version is available at [http://gvdev.cern.ch/GVPC/Excel/ACE/]<br />
<br />
* '''Preliminary processing'''<br />
Once the reports are generated, sanity checks are performed by EGI SA1 (Task TSA1.8). After this step is completed, statistics are uploaded into the EGI document server. Links to monthly statistics will be provided on a regular basis at this wiki page.<br />
<br />
* '''Publication'''<br />
An announcement of the new results is distributed by EGI SA1 (TSA1.8) to the NGI Operations Managers mailing list. COD (TSA1.7) is responsible for supervising the statistics: it asks NGIs to chase sites that need to provide comments when thresholds are not met, and it identifies sites eligible for suspension. This phase starts by filing a ticket to the COD Support Unit. The overall comment-gathering process is handled through tickets.<br />
<br />
* '''Handling of sites below targets'''<br />
For a site that misses availability/reliability targets but is not eligible for suspension: <br />
<br />
# a child ticket is opened by the COD team and assigned to the respective NGI, asking for explanation to be given <br />
# the explanation must be produced within 10 working days from when the ticket is received by the site (please see known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs]). Reminders and escalation are performed in accordance with COD escalation procedures [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure].<br />
# if the explanation is found satisfactory the ticket is closed <br />
# conversely, if the explanation is not given in due time, or the explanation is found inadequate, the COD escalation procedure will be followed [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], with the site being suspended if neither the site nor the NGI replies to the ticket <br />
# the child ticket can then be closed <br />
# the parent ticket will be closed when all child tickets have been closed.<br />
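<br />
The 10 working days deadline mentioned in the steps above can be computed as in the following Python sketch. It is only an illustration under stated assumptions: working days are taken to be Monday to Friday, public holidays are ignored, and the function name is invented for this example.<br />

```python
from datetime import date, timedelta


def add_working_days(start, working_days):
    """Return the date that lies `working_days` working days after `start`.

    Assumption: working days are Monday to Friday; holidays are not considered.
    """
    current = start
    remaining = working_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


# Hypothetical ticket received on Monday 5 September 2011.
deadline = add_working_days(date(2011, 9, 5), 10)
print(deadline)
```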
<br />
* '''Handling of sites that are eligible for suspension'''<br />
For a site that is eligible for suspension: <br />
# a child ticket is opened by the COD team and assigned to the appropriate NGI, notifying that the site will be suspended within 10 working days (please see known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs])<br />
# after the 10-day period passes, during which normal COD escalation procedures apply [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], the site is suspended by COD unless the NGI has intervened or the EGI Chief Operations Officer objects. <br />
# in the case of NGI intervention, the site is not suspended if both the COD and the COO agree with the reasoning provided by the NGI <br />
# the child ticket closes either when the site is suspended or when suspension is canceled <br />
# the parent ticket will be closed when all child tickets have been closed<br />
<br />
* '''Wiki follow up page'''<br />
Sites that fail to justify missing the OLA targets, or whose explanation is found inadequate, as well as sites that are suspended, are recorded in a wiki page [https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD_metrics]<br />
<br />
* '''Recomputation procedure'''<br />
Should there be doubts about the validity of Availability/Reliability reports, a RC/NGI can request recomputations according to the procedure defined at [https://tomtools.cern.ch/confluence/display/SAM/Availability+Re-computation+Policy]<br />
<br />
=Known issues and recommendations to NGIs=<br />
# ACE, like Gridview in the past, always calculates the availability and reliability of a site as soon as it shows up in GOCDB and in the BDIIs, regardless of its certification status. While processing the data to generate the availability/reliability report, ACE uses the certification status of the site at that moment to decide whether the site is certified, and therefore shown in the report, or uncertified, and therefore excluded. '''Thus newly certified sites will get inaccurate Availability/Reliability figures for the month they were certified and all months before that.''' Because the certification status history is not currently available in the operations tools, until a solution is implemented NGIs should check whether they have sites affected by this issue and report it as explanation. More information at [https://gus.fzk.de/ws/ticket_info.php?ticket=60594] and [https://gus.fzk.de/ws/ticket_info.php?ticket=60925]. '''As of December 2010, Gridview had included a snapshot feature so availability takes into account the topology at the last day of the month. While it does not solve the problem completely, it reduces its impact. However ACE reports (used since May 2011) do not include the snapshot feature yet.''' <br />
# The calculations performed by ACE always '''take into account the information system status and GOCDB information at the time the calculation is performed, and not that of a certain checkpoint in the past'''. The implication is that any complete recalculation risks altering the results for sites that had correct numbers in the first place. Thus, until a solution is found, '''complete recalculations are avoided whenever possible''', and errors are fixed on a per-site basis for sites whose numbers are lower than they should be.<br />
# Weighted availability is calculated by multiplying the number of logical CPUs a site publishes by the published HEPSPEC value. It is important that these numbers are correct: if the HEPSPEC value for a site is too high or too low (for example because of a mistake), the overall NGI weighted availability will be affected.<br />
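<br />
To illustrate why a wrong HEPSPEC value distorts the NGI figure, here is a minimal Python sketch. The text above only states that the weight is the number of logical CPUs multiplied by the HEPSPEC value; combining the sites as a weighted mean, the dictionary keys and the function names are assumptions made for this example and may differ from what ACE actually does.<br />

```python
def site_weight(logical_cpus, hepspec):
    """Weight of a site: published logical CPUs times published HEPSPEC."""
    return logical_cpus * hepspec


def ngi_weighted_availability(sites):
    """Weighted mean of site availabilities (assumed combination rule).

    Each site is a dict with the hypothetical keys
    'availability' (percent), 'logical_cpus' and 'hepspec'.
    """
    total_weight = sum(site_weight(s["logical_cpus"], s["hepspec"]) for s in sites)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(
        s["availability"] * site_weight(s["logical_cpus"], s["hepspec"])
        for s in sites
    )
    return weighted_sum / total_weight


correct = [
    {"availability": 90.0, "logical_cpus": 100, "hepspec": 10.0},
    {"availability": 50.0, "logical_cpus": 100, "hepspec": 10.0},
]
# Same sites, but the second one publishes a HEPSPEC value that is far too high.
mistaken = [
    {"availability": 90.0, "logical_cpus": 100, "hepspec": 10.0},
    {"availability": 50.0, "logical_cpus": 100, "hepspec": 90.0},
]
print(ngi_weighted_availability(correct), ngi_weighted_availability(mistaken))
```

In this made-up example the inflated HEPSPEC value gives the poorly performing site nine times more weight, which pulls the NGI figure down well below the unbiased value.<br />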
<br />
=Resources=<br />
* Definition of Availability and Reliability and related computation algorithm ([https://twiki.cern.ch/twiki/pub/LCG/GridView/Gridview_Service_Availability_Computation.pdf paper])<br />
* NEW! [https://wiki.egi.eu/wiki/Availability_and_reliability_tests List of Nagios tests] used for availability computation<br />
* [https://twiki.cern.ch/twiki/bin/view/LCG/ACE Availability Computation Engine] (ACE)<br />
* [https://documents.egi.eu/document/31 Operational Level Agreement between NGI and site]<br />
* [https://wiki.egi.eu/wiki/OLA_release_notes OLA release notes]<br />
<!--* OLD: [https://twiki.cern.ch/twiki/bin/view/EGEE/MonthlyAvailability EGEE-III Comments on site availability and reliability statistics] --><br />
*[https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD COD procedure for oversight of availability and reliability performance]<br />
* Impact of change of suspension policy for under-performing sites: [https://wiki.egi.eu/wiki/Availability_and_reliability_threshold_change_impact report]</div>Dzilahttps://wiki.egi.eu/w/index.php?title=New_Availability_Reporting&diff=24001New Availability Reporting2011-09-02T08:31:15Z<p>Dzila: /* Use case 2: EGI.eu availability reports */</p>
<hr />
<div>{{Template:Op menubar}}<br />
{{Template:Tools menubar}}<br />
__TOC__<br />
<br />
=Use case 1: NGI availability reports=<br />
We would like to have NGI availability reports. These reports should cover the central services operated by the NGI, including the regional tools and other core middleware services, for example:<br />
<br />
* the VOMS service<br />
* the top-BDII service<br />
* the WMS service<br />
<br />
and the operational services including<br />
** the NGI SAM service<br />
** the accounting portal and repositories (where available)<br />
** the NGI operations dashboard (where available)<br />
** the NGI helpdesk (where available)<br />
**...<br />
<br />
The concerned services would be only those for which the NGI has direct administration responsibilities. For example the NGI availability reports shouldn't include WMS, VOMS etc. instances that are independently deployed by the sites to support local user communities and local projects.<br />
<br />
It is important to consider that an NGI core service is often physically distributed across different sites that only host the hardware (with no administration responsibility). This has several implications.<br />
<br />
# If one instance is down but the rest of the cluster is up, then the "logical" service is still available. This means that the alias should be monitored for the sake of availability computation, not the individual physical instances<br />
# The site availability should not be impacted by the unavailability of physical instances of a service operated by the NGI.<br />
<br />
This use case could be satisfied by:<br />
- grouping NGI services into a dedicated NGI site (in case of a distributed service, only the alias is registered)<br />
- create a NGI availability profile just applicable to the "NGI" site, where the availability of the site is computed as the AND composition of the availability of all registered services. Note that if some (optional) services are NOT available, then UP should be returned, i.e. the profile should include a mandatory set of services (e.g. regional SAM) and a complementary set of optional services (e.g. the local helpdesk, VOMS, etc.)<br />
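<br />
The composition described above can be sketched in Python as follows. This is only one possible reading of the proposal: it assumes that "not available" for an optional service means the NGI does not operate it (so it is not registered), that a registered optional service still takes part in the AND, and that a distributed service is UP when at least one instance behind its alias is UP. The service names and function names are invented for this example.<br />

```python
def logical_service_up(instances_up):
    """A distributed service is UP if at least one physical instance is UP."""
    return any(instances_up)


def ngi_site_up(status, mandatory, optional):
    """AND composition over the services registered in the 'NGI' site.

    status maps a service name to True (UP) or False (DOWN) and contains
    only the services that are actually registered.
    """
    # Every mandatory service must be registered and UP.
    for name in mandatory:
        if not status.get(name, False):
            return False
    # Optional services count only when they are registered.
    for name in optional:
        if name in status and not status[name]:
            return False
    return True


status = {
    "regional-SAM": True,
    "top-BDII": logical_service_up([False, True]),  # one instance down
}
print(ngi_site_up(status, mandatory=["regional-SAM"],
                  optional=["top-BDII", "helpdesk", "VOMS"]))
```

Here the helpdesk and VOMS are not registered, so they do not affect the result, and the top-BDII alias is still considered UP because one instance responds.<br />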
<br />
<!-- ==NGI middleware availability==<br />
The NGI middleare logical site includes all core middleware services operated by the NGI: WMS, top-BDII, VOMS etc. regardless of their physical location.<br />
<br />
For example top-BDII is OK IF (tB1 is OK) OR (tB2 is ok) OR .... (tBN is ok).<br />
<br />
NGI middleware service is UP IF (VOMS is UP) AND (top-BDII is UP) AND .... (WMS is UP)<br />
<br />
==NGI operations tool availability==<br />
The NGI operations logical site includes all operations tools operated by the NGI: helpdesk, SAM, ops dashboard etc.<br />
<br />
NGI operations services is UP IF (Helpdesk is UP) AND ... AND (SAM is UP) --><br />
<br />
= Use case 2: EGI.eu availability reports=<br />
We would like to measure the overall availability of EGI.eu services. <br />
Examples of such services are:<br />
<br />
Operational<br />
* accounting portal and accounting repository<br />
* GOCDB<br />
* operations portal<br />
* central MyEGI <br />
* message bus<br />
* GGUS<br />
* security Nagios and Pakiti<br />
* security dashboard<br />
* DTEAM VOMS<br />
* OPS VOMS<br />
* Overall availability of all EGI production sites<br />
<br />
Technical<br />
* EGI repository<br />
* RT<br />
<br />
User<br />
* application database<br />
* training database<br />
<br />
For each category above, for example operations, EGI.eu operations service is UP if (GGUS is UP) AND (Operations Portal is UP) AND ... AND (GOCDB is UP).<br />
<br />
In GOCDB the VIRTUALOPS ROC could be evolved into the EGI.eu ROC, which includes all EGI.eu services. <br />
A new availability profile for EGI.eu is needed.<br />
<br />
= Use case 3: Regionalized NGI availability reports =<br />
<br />
The regional VO support in tools is essential for the idea of NGI autonomy coined at the end of EGEE-III and promoted also by EGI. NGI_PL users submit jobs using vo.plgrid.pl, we extended the list of operations tests in NGI_PL by adding PL-Grid-specific service tests, we would like to measure availability being able to customize the list of tests and services on which the algorithm depends. (M. Radecki to expand)<br />
<br />
= Use case 4: Extension of the standard OPS site availability profile =<br />
KIT requested that new services (in addition to CE, SE and BDII) be included in the availability computation, for example WMS, LB, LFC, FTS, top-BDII and VOMS.<br />
In other words, the site requests that any local core service (one operated independently of the NGI) can be considered in OPS availability reports.<br />
<br />
As not all services necessarily need to be operated WMS, LB etc., if such optional services do not exist, the respective availability (in the site availability computation algorithm) should be 1.</div>Dzilahttps://wiki.egi.eu/w/index.php?title=New_Availability_Reporting&diff=24000New Availability Reporting2011-09-02T08:30:30Z<p>Dzila: /* Use case 2: EGI.eu availability reports */</p>
<hr />
<div>{{Template:Op menubar}}<br />
{{Template:Tools menubar}}<br />
__TOC__<br />
<br />
=Use case 1: NGI availability reports=<br />
We would like to have NGI availability reports. These reports should cover the central services operated by the NGI, including the regional tools and other core middleware services, for example:<br />
<br />
* the VOMS service<br />
* the top-BDII service<br />
* the WMS service<br />
<br />
and the operational services including<br />
** the NGI SAM service<br />
** the accounting portal and repositories (where available)<br />
** the NGI operations dashboard (where available)<br />
** the NGI helpdesk (where available)<br />
**...<br />
<br />
The concerned services would be only those for which the NGI has direct administration responsibilities. For example the NGI availability reports shouldn't include WMS, VOMS etc. instances that are independently deployed by the sites to support local user communities and local projects.<br />
<br />
It is important to consider that an NGI core service is often physically distributed across different sites that only host the hardware (with no administration responsibility). This has several implications.<br />
<br />
# If one instance is down but the rest of the cluster is up, then the "logical" service is still available. This means that the alias should be monitored for the sake of availability computation, not the individual physical instances<br />
# The site availability should not be impacted by the unavailability of physical instances of a service operated by the NGI.<br />
<br />
This use case could be satisfied by:<br />
- grouping NGI services into a dedicated NGI site (in case of a distributed service, only the alias is registered)<br />
- create a NGI availability profile just applicable to the "NGI" site, where the availability of the site is computed as the AND composition of the availability of all registered services. Note that if some (optional) services are NOT available, then UP should be returned, i.e. the profile should include a mandatory set of services (e.g. regional SAM) and a complementary set of optional services (e.g. the local helpdesk, VOMS, etc.)<br />
<br />
<!-- ==NGI middleware availability==<br />
The NGI middleare logical site includes all core middleware services operated by the NGI: WMS, top-BDII, VOMS etc. regardless of their physical location.<br />
<br />
For example top-BDII is OK IF (tB1 is OK) OR (tB2 is ok) OR .... (tBN is ok).<br />
<br />
NGI middleware service is UP IF (VOMS is UP) AND (top-BDII is UP) AND .... (WMS is UP)<br />
<br />
==NGI operations tool availability==<br />
The NGI operations logical site includes all operations tools operated by the NGI: helpdesk, SAM, ops dashboard etc.<br />
<br />
NGI operations services is UP IF (Helpdesk is UP) AND ... AND (SAM is UP) --><br />
<br />
= Use case 2: EGI.eu availability reports=<br />
We would like to measure the overall availability of EGI.eu services. <br />
Examples of such services are:<br />
<br />
Operational<br />
** accounting portal and accounting repository<br />
** GOCDB<br />
** operations portal<br />
** central MyEGI <br />
** message bus<br />
** GGUS<br />
** security Nagios and Pakiti<br />
** security dashboard<br />
** DTEAM VOMS<br />
** OPS VOMS<br />
** Overall availability of all EGI production sites<br />
<br />
Technical<br />
** EGI repository<br />
** RT<br />
<br />
User<br />
** application database<br />
** training database<br />
<br />
For each category above, for example operations, EGI.eu operations service is UP if (GGUS is UP) AND (Operations Portal is UP) AND ... AND (GOCDB is UP).<br />
<br />
In GOCDB the VIRTUALOPS ROC could be evolved into the EGI.eu ROC, which includes all EGI.eu services. <br />
A new availability profile for EGI.eu is needed.<br />
<br />
= Use case 3: Regionalized NGI availability reports =<br />
<br />
The regional VO support in tools is essential for the idea of NGI autonomy coined at the end of EGEE-III and promoted also by EGI. NGI_PL users submit jobs using vo.plgrid.pl, we extended the list of operations tests in NGI_PL by adding PL-Grid-specific service tests, we would like to measure availability being able to customize the list of tests and services on which the algorithm depends. (M. Radecki to expand)<br />
<br />
= Use case 4: Extension of the standard OPS site availability profile =<br />
KIT requested that new services (in addition to CE, SE and BDII) be included in the availability computation, for example WMS, LB, LFC, FTS, top-BDII and VOMS.<br />
In other words, the site requests that any local core service (one operated independently of the NGI) can be considered in OPS availability reports.<br />
<br />
As not all services necessarily need to be operated WMS, LB etc., if such optional services do not exist, the respective availability (in the site availability computation algorithm) should be 1.</div>Dzilahttps://wiki.egi.eu/w/index.php?title=New_Availability_Reporting&diff=23999New Availability Reporting2011-09-02T08:29:08Z<p>Dzila: /* Use case 2: EGI.eu availability reports */</p>
<hr />
<div>{{Template:Op menubar}}<br />
{{Template:Tools menubar}}<br />
__TOC__<br />
<br />
=Use case 1: NGI availability reports=<br />
We would like to have NGI availability reports. These reports should cover the central services operated by the NGI, including the regional tools and other core middleware services, for example:<br />
<br />
* the VOMS service<br />
* the top-BDII service<br />
* the WMS service<br />
<br />
and the operational services including<br />
** the NGI SAM service<br />
** the accounting portal and repositories (where available)<br />
** the NGI operations dashboard (where available)<br />
** the NGI helpdesk (where available)<br />
**...<br />
<br />
The concerned services would be only those for which the NGI has direct administration responsibilities. For example the NGI availability reports shouldn't include WMS, VOMS etc. instances that are independently deployed by the sites to support local user communities and local projects.<br />
<br />
It is important to consider that an NGI core service is often physically distributed across different sites that only host the hardware (with no administration responsibility). This has several implications.<br />
<br />
# If one instance is down but the rest of the cluster is up, then the "logical" service is still available. This means that the alias should be monitored for the sake of availability computation, not the individual physical instances<br />
# The site availability should not be impacted by the unavailability of physical instances of a service operated by the NGI.<br />
<br />
This use case could be satisfied by:<br />
- grouping NGI services into a dedicated NGI site (in case of a distributed service, only the alias is registered)<br />
- create a NGI availability profile just applicable to the "NGI" site, where the availability of the site is computed as the AND composition of the availability of all registered services. Note that if some (optional) services are NOT available, then UP should be returned, i.e. the profile should include a mandatory set of services (e.g. regional SAM) and a complementary set of optional services (e.g. the local helpdesk, VOMS, etc.)<br />
<br />
<!-- ==NGI middleware availability==<br />
The NGI middleare logical site includes all core middleware services operated by the NGI: WMS, top-BDII, VOMS etc. regardless of their physical location.<br />
<br />
For example top-BDII is OK IF (tB1 is OK) OR (tB2 is ok) OR .... (tBN is ok).<br />
<br />
NGI middleware service is UP IF (VOMS is UP) AND (top-BDII is UP) AND .... (WMS is UP)<br />
<br />
==NGI operations tool availability==<br />
The NGI operations logical site includes all operations tools operated by the NGI: helpdesk, SAM, ops dashboard etc.<br />
<br />
NGI operations services is UP IF (Helpdesk is UP) AND ... AND (SAM is UP) --><br />
<br />
= Use case 2: EGI.eu availability reports=<br />
We would like to measure the overall availability of EGI.eu services. <br />
Examples of such services are:<br />
<br />
Operational<br />
<br />
- accounting portal and accounting repository<br />
- GOCDB<br />
- operations portal<br />
- central MyEGI <br />
- message bus<br />
- GGUS<br />
- security Nagios and Pakiti<br />
- security dashboard<br />
- DTEAM VOMS<br />
- OPS VOMS<br />
- Overall availability of all EGI production sites<br />
<br />
Technical<br />
- EGI repository<br />
- RT<br />
<br />
User<br />
- application database<br />
- training database<br />
<br />
For each category above, for example operations, EGI.eu operations service is UP if (GGUS is UP) AND (Operations Portal is UP) AND ... AND (GOCDB is UP).<br />
<br />
In GOCDB the VIRTUALOPS ROC could be evolved into the EGI.eu ROC, which includes all EGI.eu services. <br />
A new availability profile for EGI.eu is needed.<br />
<br />
= Use case 3: Regionalized NGI availability reports =<br />
<br />
The regional VO support in tools is essential for the idea of NGI autonomy coined at the end of EGEE-III and promoted also by EGI. NGI_PL users submit jobs using vo.plgrid.pl, we extended the list of operations tests in NGI_PL by adding PL-Grid-specific service tests, we would like to measure availability being able to customize the list of tests and services on which the algorithm depends. (M. Radecki to expand)<br />
<br />
= Use case 4: Extension of the standard OPS site availability profile =<br />
KIT requested that new services (in addition to CE, SE and BDII) be included in the availability computation, for example WMS, LB, LFC, FTS, top-BDII and VOMS.<br />
In other words, the site requests that any local core service (one operated independently of the NGI) can be considered in OPS availability reports.<br />
<br />
As not all services necessarily need to be operated WMS, LB etc., if such optional services do not exist, the respective availability (in the site availability computation algorithm) should be 1.</div>Dzilahttps://wiki.egi.eu/w/index.php?title=New_Availability_Reporting&diff=23998New Availability Reporting2011-09-02T08:28:49Z<p>Dzila: /* Use case 2: EGI.eu availability reports */</p>
<hr />
<div>{{Template:Op menubar}}<br />
{{Template:Tools menubar}}<br />
__TOC__<br />
<br />
=Use case 1: NGI availability reports=<br />
We would like to have NGI availability reports. These reports should cover the central services operated by the NGI, including the regional tools and other core middleware services, for example:<br />
<br />
* the VOMS service<br />
* the top-BDII service<br />
* the WMS service<br />
<br />
and the operational services including<br />
** the NGI SAM service<br />
** the accounting portal and repositories (where available)<br />
** the NGI operations dashboard (where available)<br />
** the NGI helpdesk (where available)<br />
**...<br />
<br />
The concerned services would be only those for which the NGI has direct administration responsibilities. For example the NGI availability reports shouldn't include WMS, VOMS etc. instances that are independently deployed by the sites to support local user communities and local projects.<br />
<br />
It is important to consider that an NGI core service is often physically distributed across different sites that only host the hardware (with no administration responsibility). This has several implications.<br />
<br />
# If one instance is down but the rest of the cluster is up, then the "logical" service is still available. This means that the alias should be monitored for the sake of availability computation, not the individual physical instances<br />
# The site availability should not be impacted by the unavailability of physical instances of a service operated by the NGI.<br />
<br />
This use case could be satisfied by:<br />
- grouping NGI services into a dedicated NGI site (in case of a distributed service, only the alias is registered)<br />
- create a NGI availability profile just applicable to the "NGI" site, where the availability of the site is computed as the AND composition of the availability of all registered services. Note that if some (optional) services are NOT available, then UP should be returned, i.e. the profile should include a mandatory set of services (e.g. regional SAM) and a complementary set of optional services (e.g. the local helpdesk, VOMS, etc.)<br />
<br />
<!-- ==NGI middleware availability==<br />
The NGI middleare logical site includes all core middleware services operated by the NGI: WMS, top-BDII, VOMS etc. regardless of their physical location.<br />
<br />
For example top-BDII is OK IF (tB1 is OK) OR (tB2 is ok) OR .... (tBN is ok).<br />
<br />
NGI middleware service is UP IF (VOMS is UP) AND (top-BDII is UP) AND .... (WMS is UP)<br />
<br />
==NGI operations tool availability==<br />
The NGI operations logical site includes all operations tools operated by the NGI: helpdesk, SAM, ops dashboard etc.<br />
<br />
NGI operations services is UP IF (Helpdesk is UP) AND ... AND (SAM is UP) --><br />
<br />
= Use case 2: EGI.eu availability reports=<br />
We would like to measure the overall availability of EGI.eu services. <br />
Examples of such services are:<br />
<br />
Operational<br />
- accounting portal and accounting repository<br />
- GOCDB<br />
- operations portal<br />
- central MyEGI <br />
- message bus<br />
- GGUS<br />
- security Nagios and Pakiti<br />
- security dashboard<br />
- DTEAM VOMS<br />
- OPS VOMS<br />
- Overall availability of all EGI production sites<br />
<br />
Technical<br />
- EGI repository<br />
- RT<br />
<br />
User<br />
- application database<br />
- training database<br />
<br />
For each category above, the aggregate service is UP only if all of its services are UP. For example, for the operational category, the EGI.eu operations service is UP if (GGUS is UP) AND (Operations Portal is UP) AND ... AND (GOCDB is UP).<br />
<br />
In GOCDB the VIRTUALOPS ROC could be evolved into the EGI.eu ROC, which includes all EGI.eu services. <br />
A new availability profile for EGI.eu is needed.<br />
<br />
= Use case 3: Regionalized NGI availability reports =<br />
<br />
Regional VO support in the tools is essential to the idea of NGI autonomy coined at the end of EGEE-III and also promoted by EGI. For example, NGI_PL users submit jobs using vo.plgrid.pl, and we extended the list of operations tests in NGI_PL by adding PL-Grid-specific service tests. We would like to measure availability while being able to customize the list of tests and services on which the algorithm depends. (M. Radecki to expand)<br />
<br />
= Use case 4: Extension of the standard OPS site availability profile =<br />
KIT requested that new services (in addition to CE, SE and BDII) be included in the availability computation, for example: WMS, LB, LFC, FTS, top-BDII, VOMS.<br />
In other words, the site requests that any local core service (operated independently of the NGI) can be considered in OPS availability reports.<br />
<br />
As not all of these services (WMS, LB, etc.) necessarily need to be operated by a site, if such optional services do not exist, the respective availability (in the site availability computation algorithm) should be 1.</div>Dzilahttps://wiki.egi.eu/w/index.php?title=Resource_Centres_OLA_and_Resource_infrastructure_Provider_OLA_reports&diff=22876Resource Centres OLA and Resource infrastructure Provider OLA reports2011-08-04T13:28:32Z<p>Dzila: /* 2011 */</p>
<hr />
<div>{{Template:Op menubar}}<br />
[[Category:Procedures]]<br />
{{TOC_right}}<br />
<br />
It is mandatory that EGI certified Resource Centres provide a minimum monthly availability and reliability as specified below (see the [https://documents.egi.eu/document/31 site-NGI Operational Level Agreement] for details). Availability and reliability statistics (based on the global OPS VO) are issued on a monthly basis.<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
| '''minimum availability'''<br />
| 70%<br />
|-<br />
| '''minimum reliability'''<br />
| 75%<br />
|-<br />
|'''Condition for suspension'''<br />
| Resource Centres which have an availability of less than '''70%''' for three consecutive months will be suspended, i.e. removed from the production infrastructure. Note: this suspension policy was reviewed in April 2011, and the original 50% threshold was increased to 70%, effective from PY2 (May 2011 reports).<br />
|-<br />
|'''Condition for justification'''<br />
|Resource Centres not providing minimum monthly performance (70% availability, 75% reliability) MUST provide justification through a GGUS ticket.<br />
|}<br />
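<br />
The two conditions in the table can be expressed as a small sketch. It is an illustration under stated assumptions, not the tool used by COD: the function names are hypothetical, percentages are plain floats, and the monthly list is assumed to be ordered from oldest to newest.<br />

```python
from typing import Sequence

MIN_AVAILABILITY = 70.0   # percent, from the table above
MIN_RELIABILITY = 75.0    # percent, from the table above
SUSPENSION_MONTHS = 3     # consecutive months below the availability threshold


def needs_justification(availability: float, reliability: float) -> bool:
    """True when a monthly result misses either minimum target."""
    return availability < MIN_AVAILABILITY or reliability < MIN_RELIABILITY


def eligible_for_suspension(monthly_availability: Sequence[float]) -> bool:
    """True when the last three monthly availabilities are all below 70%.

    Assumption: monthly_availability is ordered oldest to newest.
    """
    recent = monthly_availability[-SUSPENSION_MONTHS:]
    return (len(recent) == SUSPENSION_MONTHS
            and all(value < MIN_AVAILABILITY for value in recent))


print(needs_justification(availability=72.0, reliability=74.0))
print(eligible_for_suspension([95.0, 60.0, 65.0, 69.9]))
```

A site can therefore need a justification ticket for a single bad month without being eligible for suspension.<br />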
<br />
<br />
= Performance reports=<br />
* [https://wiki.egi.eu/wiki/Availability_and_reliability_reports_metrics Overview] of availability and reliability statistics including suspended sites<br />
* [https://wiki.egi.eu/wiki/List_of_sites_for_which_the_availability_followup_procedures_were_not_applicable List of sites] for which availability followup procedures were not applicable<br />
<!--* [https://twiki.cern.ch/twiki/bin/view/EGEE/SuspendedSites List of suspended sites (2009)]--><br />
== 2011 ==<br />
[https://documents.egi.eu/document/716 Jul]<br />
[https://documents.egi.eu/document/648 Jun]<br />
[https://documents.egi.eu/document/593 May]<br />
[https://documents.egi.eu/document/508 Apr]<br />
[https://documents.egi.eu/document/465 Mar]<br />
[https://documents.egi.eu/document/402 Feb]<br />
[https://documents.egi.eu/document/332 Jan]<br />
<br />
== 2010 ==<br />
*[https://documents.egi.eu/document/299 Dec] | [https://documents.egi.eu/document/266 Nov] | [https://documents.egi.eu/document/238 Oct] | [https://documents.egi.eu/document/219 Sep] | [https://documents.egi.eu/document/157 Aug] | [https://documents.egi.eu/document/130 Jul] | [https://documents.egi.eu/document/96 Jun]|[https://documents.egi.eu/document/42 May]<br />
*[https://edms.cern.ch/document/963325 January 2008 - April 2010] (EGEE league tables)<br />
<br />
== EGI-wide Availability and Reliability ==<br />
It is available [https://documents.egi.eu/public/ShowDocument?docid=415 here] (xls file, data from May 01 2010)<br />
<br />
= Availability statistics per service/Resource Centre =<br />
[https://grid-monitoring.cern.ch/myegi/sa/# MyEGI]<br />
<br />
=Report generator=<br />
*[http://gvdev.cern.ch/GVPC/Excel/ '''(DEPRECATED)''' GridView availability/reliability report generator] (providing access to the database including Nagios results for OPS and SAM results for VOs)<br />
*[http://gridview015.cern.ch/GVPC/Excel/ACE/ ACE report generator]<br />
*[https://gvdev.cern.ch/ACEVAL/ace_index.php ACE visualization portal]<br />
<br />
=Process for quality verification=<br />
<br />
* '''Generation of statistics'''<br />
Availability and reliability statistics are generated automatically in the first week of each month by the [https://wiki.egi.eu/wiki/External_tools#Availability_Computation_Engine Availability Computation Engine] (Gridview until May 2011) using the profile. The reports are produced in PDF format and placed under [http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/]. An Excel version is available at [http://gvdev.cern.ch/GVPC/Excel/ACE/]<br />
<br />
* '''Preliminary processing'''<br />
Once the reports are generated, sanity checks are performed by EGI SA1 (Task TSA1.8). After this step is completed, statistics are uploaded to the EGI document server. Links to monthly statistics will be provided on a regular basis on this wiki page.<br />
<br />
* '''Publication'''<br />
An announcement of the new results is distributed by EGI SA1 (TSA1.8) to the NGI Operations Managers mailing list. COD (TSA1.7) is responsible for following up on the statistics: it asks NGIs to obtain comments from sites that did not meet the thresholds, and it identifies sites eligible for suspension. This phase starts by filing a ticket to the COD Support Unit. The overall comment-gathering process is handled through tickets.<br />
<br />
* '''Handling of sites below targets'''<br />
For a site that misses availability/reliability targets but is not eligible for suspension: <br />
<br />
# a child ticket is opened by the COD team and assigned to the respective NGI, asking for explanation to be given <br />
# the explanation must be produced within 10 working days from the date the ticket is received by the site (please see known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs]). Reminders and escalation are performed in accordance with COD escalation procedures [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure].<br />
# if the explanation is found satisfactory the ticket is closed <br />
# conversely, if the explanation is not given in due time, or the explanation is found inadequate, the COD escalation procedure will be followed [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], with the site being suspended if neither the site nor the NGI replies to the ticket <br />
# the child ticket can then be closed <br />
# the parent ticket will be closed when all child tickets have been closed.<br />
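<br />
The 10-working-day deadline in step 2 can be computed with a short sketch. This is an assumption-laden illustration: the helper name is hypothetical, only Saturdays and Sundays are skipped (public holidays are ignored), and the day the ticket is received is assumed not to count.<br />

```python
from datetime import date, timedelta


def add_working_days(start: date, working_days: int) -> date:
    """Return the date reached after the given number of working days.

    Assumptions: Monday to Friday are working days, public holidays are
    ignored, and the start day itself is not counted.
    """
    current = start
    remaining = working_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0..4 correspond to Monday..Friday
            remaining -= 1
    return current


# Hypothetical ticket received on Monday 1 August 2011.
deadline = add_working_days(date(2011, 8, 1), 10)
print(deadline.isoformat())
```

An NGI with its own holiday calendar would need to extend the weekday check accordingly.<br />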
<br />
* '''Handling of sites that are eligible for suspension'''<br />
For a site that is eligible for suspension: <br />
# a child ticket is opened by the COD team and assigned to the appropriate NGI, notifying that the site will be suspended within 10 working days (please see known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs])<br />
# after the 10-day period has passed, during which normal COD escalation procedures apply [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], the site is suspended by COD unless the NGI has intervened or the EGI Chief Operations Officer (COO) objects. <br />
# in the case of NGI intervention, the site will not be suspended if both the COD and the COO accept the reasoning provided by the NGI <br />
# the child ticket closes either when the site is suspended or when suspension is canceled <br />
# the parent ticket will be closed when all child tickets have been closed<br />
<br />
* '''Wiki follow up page'''<br />
Sites that fail to explain why OLA targets were not met, or whose explanation is found inadequate, as well as sites that are suspended, will be recorded in a wiki page [https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD_metrics]<br />
<br />
* '''Recomputation procedure'''<br />
Should there be doubts about the validity of Availability/Reliability reports, a RC/NGI can request recomputations according to the procedure defined at [https://tomtools.cern.ch/confluence/display/SAM/Availability+Re-computation+Policy]<br />
<br />
=Known issues and recommendations to NGIs=<br />
# ACE, like Gridview in the past, always calculates the availability and reliability of a site as soon as it shows up in GOCDB and in the BDIIs, regardless of its certification status. While processing the data to generate the availability/reliability report, ACE takes into account the certification status of the site at that moment to decide whether the site is certified, and therefore shows up in the report, or uncertified, and therefore has to be excluded. '''Thus newly certified sites will get inaccurate Availability/Reliability figures for the month they were certified and all months before that.''' Because the certification status history is not currently available in the operations tools, until a solution is implemented NGIs should check whether they have sites affected by this issue and report it as the explanation. More information at [https://gus.fzk.de/ws/ticket_info.php?ticket=60594] and [https://gus.fzk.de/ws/ticket_info.php?ticket=60925]. '''As of December 2010, Gridview included a snapshot feature so that availability takes into account the topology on the last day of the month. While it does not solve the problem completely, it reduces its impact. However, ACE reports (used since May 2011) do not include the snapshot feature yet.''' <br />
# The calculations performed by ACE always '''take into account the information system status and GOCDB information at the time the calculation is performed, and not that of a certain checkpoint in the past'''. The implication is that any complete recalculation risks altering the results for sites that had correct numbers in the first place. Thus, until a solution is found, '''complete recalculations are avoided whenever possible''', and errors are fixed on a per-site basis for those sites that have lower numbers than they should.<br />
# The weight used for weighted availability is calculated by multiplying the number of logical CPUs a site publishes by the published HEPSPEC value. It is important that these numbers are correct: if the HEPSPEC value for a site is too high or too low (for example because of a mistake), the overall NGI weighted availability will be affected.<br />
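<br />
The effect of a wrong HEPSPEC value can be seen in a small sketch. The weight (logical CPUs times HEPSPEC) comes from the text above; combining the sites as a weighted mean is an assumption made for illustration, and the function names and figures are hypothetical.<br />

```python
from typing import Iterable, Tuple

# Each site is (logical_cpus, hepspec, availability_percent).
Site = Tuple[int, float, float]


def site_weight(logical_cpus: int, hepspec: float) -> float:
    """Weight of a site: published logical CPUs times published HEPSPEC."""
    return logical_cpus * hepspec


def ngi_weighted_availability(sites: Iterable[Site]) -> float:
    """Weighted mean of site availabilities (assumed aggregation rule)."""
    sites = list(sites)
    total_weight = sum(site_weight(cpus, hs) for cpus, hs, _ in sites)
    if total_weight == 0:
        raise ValueError("no site publishes a non-zero weight")
    weighted_sum = sum(site_weight(cpus, hs) * avail
                       for cpus, hs, avail in sites)
    return weighted_sum / total_weight


correct = [(100, 10.0, 90.0), (100, 10.0, 50.0)]
mistaken = [(100, 10.0, 90.0), (100, 100.0, 50.0)]  # HEPSPEC ten times too high
print(ngi_weighted_availability(correct))
print(ngi_weighted_availability(mistaken))
```

In this made-up case a single mistyped HEPSPEC value on the weaker site pulls the NGI figure from 70% down to roughly 54%.<br />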
<br />
=Resources=<br />
* Definition of Availability and Reliability and related computation algorithm ([https://twiki.cern.ch/twiki/pub/LCG/GridView/Gridview_Service_Availability_Computation.pdf paper])<br />
* NEW! [https://wiki.egi.eu/wiki/Availability_and_reliability_tests List of Nagios tests] used for availability computation<br />
* [https://twiki.cern.ch/twiki/bin/view/LCG/ACE Availability Computation Engine] (ACE)<br />
* [https://documents.egi.eu/document/31 Operational Level Agreement between NGI and site]<br />
* [https://wiki.egi.eu/wiki/OLA_release_notes OLA release notes]<br />
<!--* OLD: [https://twiki.cern.ch/twiki/bin/view/EGEE/MonthlyAvailability EGEE-III Comments on site availability and reliability statistics] --><br />
*[https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD COD procedure for oversight of availability and reliability performance]<br />
* Impact of change of suspension policy for under-performing sites: [https://wiki.egi.eu/wiki/Availability_and_reliability_threshold_change_impact report]</div>Dzilahttps://wiki.egi.eu/w/index.php?title=EGI-InSPIRE:SA1.8-QR5&diff=22760EGI-InSPIRE:SA1.8-QR52011-08-03T11:48:22Z<p>Dzila: /* EGI Catch-All CA */</p>
<hr />
<div>__NOTOC__ <br />
<br />
= 1. Task Meetings =<br />
<br />
{| cellspacing="0" cellpadding="5" border="1" align="center"<br />
|-<br />
! style="width: 25%;" | Date (dd/mm/yyyy) <br />
! style="width: 25%;" | Url Indico Agenda <br />
! style="width: 10%;" | Title <br />
! style="width: 10%;" | Outcome<br />
|- <br />
| 14/06/2011 - 16/06/2011 <br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=481<br />
| Documentation workshop, Zürich<br />
| Operations procedure manuals, Best Practices wiki pages and GOCWIKI transfer were the main focus of the workshop.<br />
|-<br />
| 29/07/2011 - 29/07/2011<br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=559<br />
| 4th OLA Task Force meeting<br />
| Evolution of the RP OLA<br />
|- <br />
| 25/05/2011 <br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=490<br />
| Op. Doc.:Operational Procedures<br />
| Preparation for the Workshop<br />
|-<br />
| 9/05/2011 <br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=470<br />
| Operational Documentation: GOCwiki transfer<br />
| Finalisation of steps for the transfer<br />
|}<br />
<br />
= 2. Main Achievements =<br />
<br />
<br> <br />
<br />
*The RC OLA[https://documents.egi.eu/document/31] has been finalized and approved based on the comments received from the OMB at Vilnius on April on June. <br />
*In collaboration with WLCG, EGI league reports now use the new ACE availability calculation system, which supports different profiles. This is an important step towards obtaining different reports for different services in the future (i.e. Core Services). <br />
*From May 2011 onwards, EGI thresholds for suspending a site have been increased to 70%/75% Availability/Reliability, based on the evaluation performed in recent months.<br />
<br />
==== Core Services for uncertified sites<br> ====<br />
<br />
* WMS, LB and TopBDII services have been launched for use by uncertified sites<br />
* A webpage [https://cert-devel.grid.auth.gr/] where NGI managers can add uncertified sites has been created, and a demo version was presented at the July OMB. NGI managers can use this page to add uncertified sites to the Catch All certification infrastructure.<br />
<br />
==== EGI Catch-All CA ====<br />
<br />
The EGI Catch All CA serves 5 countries which do not have a nationally accredited Certification Authority. These countries are Albania, Azerbaijan, Bosnia and Herzegovina, Georgia and Senegal.<br />
<br />
===== GOCWIKI<br> =====<br />
<br />
*A deadline was set for closing the GOCWIKI (end of September 2011). <br />
*The wiki transfer is almost completed. There is effectively nothing left on GOCWIKI.<br />
<br />
===== PROCEDURE MANUALS =====<br />
<br />
*All sections except the ROD&nbsp;part are in excellent shape.&nbsp; These only need OMB approval for release. <br />
*The ROD&nbsp;part still needs some work, and feedback from COD&nbsp;would be appreciated.<br />
<br />
===== BEST&nbsp;PRACTICES =====<br />
<br />
*Completed and fully operational.&nbsp; Only requires people to contribute.<br />
<br />
= 3. Issues and Mitigation =<br />
<br />
{| cellspacing="0" cellpadding="2" border="1"<br />
|-<br />
! scope="col" | Issue Description <br />
! scope="col" | Mitigation Description<br />
|-<br />
| The problem of getting people to participate in the Operational Documentation group is still an ongoing issue.<br />
|}<br />
<br />
= 4. Plans for the next period =<br />
* Produce a new RP OLA draft, to discuss in the OLA workshop in Lyon<br />
* Explore the options provided by ACE profiles to obtain reports for NGI/Core services<br />
* Adjust the WLCG recalculation policy and procedure for EGI sites to EGI needs<br />
* Investigate options for replication of the ops VO from CERN to EGI catch-all VOMS.<br />
<br />
= 5. Number of sites suspended =<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
! style="width: 25%" | Month<br />
! style="width: 25%" | Suspended sites<br />
|-<br />
|Apr 2011<br />
|0<br />
|-<br />
|May 2011<br />
|2<br />
|-<br />
|June 2011<br />
|7<br />
|-<br />
|}</div>Dzilahttps://wiki.egi.eu/w/index.php?title=EGI-InSPIRE:SA1.8-QR5&diff=22747EGI-InSPIRE:SA1.8-QR52011-08-03T08:56:05Z<p>Dzila: /* 4. Plans for the next period */</p>
<hr />
<div>__NOTOC__ <br />
<br />
= 1. Task Meetings =<br />
<br />
{| cellspacing="0" cellpadding="5" border="1" align="center"<br />
|-<br />
! style="width: 25%;" | Date (dd/mm/yyyy) <br />
! style="width: 25%;" | Url Indico Agenda <br />
! style="width: 10%;" | Title <br />
! style="width: 10%;" | Outcome<br />
|- <br />
| 14/06/2011 - 16/06/2011 <br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=481<br />
| Documentation workshop, Zürich<br />
| Operations procedure manuals, Best Practices wiki pages and GOCWIKI transfer were the main focus of the workshop.<br />
|-<br />
| 29/07/2011 - 29/07/2011<br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=559<br />
| 4th OLA Task Force meeting<br />
| Evolution of the RP OLA<br />
|- <br />
| 25/05/2011 <br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=490<br />
| Op. Doc.:Operational Procedures<br />
| Preparation for the Workshop<br />
|-<br />
| 9/05/2011 <br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=470<br />
| Operational Documentation: GOCwiki transfer<br />
| Finalisation of steps for the transfer<br />
|}<br />
<br />
= 2. Main Achievements =<br />
<br />
<br> <br />
<br />
*The RC OLA[https://documents.egi.eu/document/31] has been finalized and approved based on the comments received from the OMB at Vilnius on April on June. <br />
*In collaboration with WLCG, EGI league reports now use the new ACE availability calculation system, which supports different profiles. This is an important step towards obtaining different reports for different services in the future (i.e. Core Services). <br />
*From May 2011 onwards, EGI thresholds for suspending a site have been increased to 70%/75% Availability/Reliability, based on the evaluation performed in recent months.<br />
<br />
==== Core Services for uncertified sites<br> ====<br />
<br />
* WMS, LB and TopBDII services have been launched for use by uncertified sites<br />
* A webpage [https://cert-devel.grid.auth.gr/] where NGI managers can add uncertified sites has been created, and a demo version was presented at the July OMB. NGI managers can use this page to add uncertified sites to the Catch All certification infrastructure.<br />
<br />
==== EGI Catch-All CA ====<br />
<br />
The EGI Catch All CA serves 5 countries which do not have a nationally accredited Certification Authority. These countries are Albania, Azerbaijan, Bosnia and Herzegovina, Georgia and Senegal.<br />
<br />
===== GOCWIKI<br> =====<br />
<br />
*A deadline was set for closing the GOCWIKI (end of September 2011). <br />
*The wiki transfer is almost completed. There is effectively nothing left on GOCWIKI.<br />
<br />
===== PROCEDURE MANUALS =====<br />
<br />
*All sections except the ROD&nbsp;part are in excellent shape.&nbsp; These only need OMB approval for release. <br />
*The ROD&nbsp;part still needs some work, and feedback from COD&nbsp;would be appreciated.<br />
<br />
===== BEST&nbsp;PRACTICES =====<br />
<br />
*Completed and fully operational.&nbsp; Only requires people to contribute.<br />
<br />
= 3. Issues and Mitigation =<br />
<br />
{| cellspacing="0" cellpadding="2" border="1"<br />
|-<br />
! scope="col" | Issue Description <br />
! scope="col" | Mitigation Description<br />
|-<br />
| The problem of getting people to participate in the Operational Documentation group is still an ongoing issue.<br />
|}<br />
<br />
= 4. Plans for the next period =<br />
* Produce a new RP OLA draft, to discuss in the OLA workshop in Lyon<br />
* Explore the options provided by ACE profiles to obtain reports for NGI/Core services<br />
* Adjust the WLCG recalculation policy and procedure for EGI sites to EGI needs<br />
* Investigate options for replication of the ops VO from CERN to EGI catch-all VOMS.<br />
<br />
= 5. Number of sites suspended =<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
! style="width: 25%" | Month<br />
! style="width: 25%" | Suspended sites<br />
|-<br />
|Apr 2011<br />
|0<br />
|-<br />
|May 2011<br />
|2<br />
|-<br />
|June 2011<br />
|7<br />
|-<br />
|}</div>Dzilahttps://wiki.egi.eu/w/index.php?title=EGI-InSPIRE:SA1.8-QR5&diff=22746EGI-InSPIRE:SA1.8-QR52011-08-03T08:54:49Z<p>Dzila: /* = Core Services for uncertified sites */</p>
<hr />
<div>__NOTOC__ <br />
<br />
= 1. Task Meetings =<br />
<br />
{| cellspacing="0" cellpadding="5" border="1" align="center"<br />
|-<br />
! style="width: 25%;" | Date (dd/mm/yyyy) <br />
! style="width: 25%;" | Url Indico Agenda <br />
! style="width: 10%;" | Title <br />
! style="width: 10%;" | Outcome<br />
|- <br />
| 14/06/2011 - 16/06/2011 <br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=481<br />
| Documentation workshop, Zürich<br />
| Operations procedure manuals, Best Practices wiki pages and GOCWIKI transfer were the main focus of the workshop.<br />
|-<br />
| 29/07/2011 - 29/07/2011<br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=559<br />
| 4th OLA Task Force meeting<br />
| Evolution of the RP OLA<br />
|- <br />
| 25/05/2011 <br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=490<br />
| Op. Doc.:Operational Procedures<br />
| Preparation for the Workshop<br />
|-<br />
| 9/05/2011 <br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=470<br />
| Operational Documentation: GOCwiki transfer<br />
| Finalisation of steps for the transfer<br />
|}<br />
<br />
= 2. Main Achievements =<br />
<br />
<br> <br />
<br />
*The RC OLA[https://documents.egi.eu/document/31] has been finalized and approved based on the comments received from the OMB at Vilnius on April on June. <br />
*In collaboration with WLCG, EGI league reports now use the new ACE availability calculation system, which supports different profiles. This is an important step towards obtaining different reports for different services in the future (i.e. Core Services). <br />
*From May 2011 onwards, EGI thresholds for suspending a site have been increased to 70%/75% Availability/Reliability, based on the evaluation performed in recent months.<br />
<br />
==== Core Services for uncertified sites<br> ====<br />
<br />
* WMS, LB and TopBDII services have been launched for use by uncertified sites<br />
* A webpage [https://cert-devel.grid.auth.gr/] where NGI managers can add uncertified sites has been created, and a demo version was presented at the July OMB. NGI managers can use this page to add uncertified sites to the Catch All certification infrastructure.<br />
<br />
==== EGI Catch-All CA ====<br />
<br />
The EGI Catch All CA serves 5 countries which do not have a nationally accredited Certification Authority. These countries are Albania, Azerbaijan, Bosnia and Herzegovina, Georgia and Senegal.<br />
<br />
===== GOCWIKI<br> =====<br />
<br />
*A deadline was set for closing the GOCWIKI (end of September 2011). <br />
*The wiki transfer is almost completed. There is effectively nothing left on GOCWIKI.<br />
<br />
===== PROCEDURE MANUALS =====<br />
<br />
*All sections except the ROD&nbsp;part are in excellent shape.&nbsp; These only need OMB approval for release. <br />
*The ROD&nbsp;part still needs some work, and feedback from COD&nbsp;would be appreciated.<br />
<br />
===== BEST&nbsp;PRACTICES =====<br />
<br />
*Completed and fully operational.&nbsp; Only requires people to contribute.<br />
<br />
= 3. Issues and Mitigation =<br />
<br />
{| cellspacing="0" cellpadding="2" border="1"<br />
|-<br />
! scope="col" | Issue Description <br />
! scope="col" | Mitigation Description<br />
|-<br />
| The problem of getting people to participate in the Operational Documentation group is still an ongoing issue.<br />
|}<br />
<br />
= 4. Plans for the next period =<br />
* Produce a new RP OLA draft, to discuss in the OLA workshop in Lyon<br />
* Explore the options provided by ACE profiles to obtain reports for NGI/Core services<br />
* Adjust the WLCG recalculation policy and procedure for EGI sites to EGI needs<br />
<br />
= 5. Number of sites suspended =<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
! style="width: 25%" | Month<br />
! style="width: 25%" | Suspended sites<br />
|-<br />
|Apr 2011<br />
|0<br />
|-<br />
|May 2011<br />
|2<br />
|-<br />
|June 2011<br />
|7<br />
|-<br />
|}</div>Dzilahttps://wiki.egi.eu/w/index.php?title=EGI-InSPIRE:SA1.8-QR5&diff=22745EGI-InSPIRE:SA1.8-QR52011-08-03T08:54:17Z<p>Dzila: /* 2. Main Achievements */</p>
<hr />
<div>__NOTOC__ <br />
<br />
= 1. Task Meetings =<br />
<br />
{| cellspacing="0" cellpadding="5" border="1" align="center"<br />
|-<br />
! style="width: 25%;" | Date (dd/mm/yyyy) <br />
! style="width: 25%;" | Url Indico Agenda <br />
! style="width: 10%;" | Title <br />
! style="width: 10%;" | Outcome<br />
|- <br />
| 14/06/2011 - 16/06/2011 <br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=481<br />
| Documentation workshop, Zürich<br />
| Operations procedure manuals, Best Practices wiki pages and GOCWIKI transfer were the main focus of the workshop.<br />
|-<br />
| 29/07/2011 - 29/07/2011<br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=559<br />
| 4th OLA Task Force meeting<br />
| Evolution of the RP OLA<br />
|- <br />
| 25/05/2011 <br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=490<br />
| Op. Doc.:Operational Procedures<br />
| Preparation for the Workshop<br />
|-<br />
| 9/05/2011 <br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=470<br />
| Operational Documentation: GOCwiki transfer<br />
| Finalisation of steps for the transfer<br />
|}<br />
<br />
= 2. Main Achievements =<br />
<br />
<br> <br />
<br />
*The RC OLA[https://documents.egi.eu/document/31] has been finalized and approved based on the comments received from the OMB at Vilnius on April on June. <br />
*In collaboration with WLCG, EGI league reports now use the new ACE availability calculation system, which supports different profiles. This is an important step towards obtaining different reports for different services in the future (i.e. Core Services). <br />
*From May 2011 onwards, EGI thresholds for suspending a site have been increased to 70%/75% Availability/Reliability, based on the evaluation performed in recent months.<br />
<br />
==== Core Services for uncertified sites<br> ====<br />
<br />
* WMS, LB and TopBDII services have been launched for use by uncertified sites<br />
* A webpage [https://cert-devel.grid.auth.gr/] where NGI managers can add uncertified sites has been created, and a demo version was presented at the July OMB. NGI managers can use this page to add uncertified sites to the Catch All certification infrastructure.<br />
<br />
==== EGI Catch-All CA ====<br />
<br />
The EGI Catch All CA serves 5 countries which do not have a nationally accredited Certification Authority. These countries are Albania, Azerbaijan, Bosnia and Herzegovina, Georgia and Senegal.<br />
<br />
===== GOCWIKI<br> =====<br />
<br />
*A deadline was set for closing the GOCWIKI (end of September 2011). <br />
*The wiki transfer is almost completed. There is effectively nothing left on GOCWIKI.<br />
<br />
===== PROCEDURE MANUALS =====<br />
<br />
*All sections except the ROD&nbsp;part are in excellent shape.&nbsp; These only need OMB approval for release. <br />
*The ROD&nbsp;part still needs some work, and feedback from COD&nbsp;would be appreciated.<br />
<br />
===== BEST&nbsp;PRACTICES =====<br />
<br />
*Completed and fully operational.&nbsp; Only requires people to contribute.<br />
<br />
= 3. Issues and Mitigation =<br />
<br />
{| cellspacing="0" cellpadding="2" border="1"<br />
|-<br />
! scope="col" | Issue Description <br />
! scope="col" | Mitigation Description<br />
|-<br />
| The problem of getting people to participate in the Operational Documentation group is still an ongoing issue.<br />
|}<br />
<br />
= 4. Plans for the next period =<br />
* Produce a new RP OLA draft, to discuss in the OLA workshop in Lyon<br />
* Explore the options provided by ACE profiles to obtain reports for NGI/Core services<br />
* Adjust the WLCG recalculation policy and procedure for EGI sites to EGI needs<br />
<br />
= 5. Number of sites suspended =<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
! style="width: 25%" | Month<br />
! style="width: 25%" | Suspended sites<br />
|-<br />
|Apr 2011<br />
|0<br />
|-<br />
|May 2011<br />
|2<br />
|-<br />
|June 2011<br />
|7<br />
|-<br />
|}</div>Dzilahttps://wiki.egi.eu/w/index.php?title=GGUS:SLM_FAQ&diff=22734GGUS:SLM FAQ2011-08-03T07:47:54Z<p>Dzila: Created page with '= Not active yet = {{GGUS-FAQ |Unit= Service Level Management |Updated= 2011-08-3 |purpose= * to provide a contact with the EGI.eu TSA1.8 unit that is responsible for providing…'</p>
<hr />
<div>= Not active yet =<br />
{{GGUS-FAQ<br />
|Unit= Service Level Management <br />
|Updated= 2011-08-3<br />
|purpose= <br />
* to provide a contact point for the EGI.eu TSA1.8 unit, which is responsible for providing a stable, reliable infrastructure.<br />
* to collect Availability/Reliability reports and handle them according to EGI procedures<br />
* to handle requests from RPs/RCs with regard to the Availability/Reliability reports such as corrections<br />
|components= <br />
This Support Unit does not provide support for any specific technology component that is deployed in the infrastructure. Relevant issues for which this Support Unit can be contacted are:<br />
* delivery of the Availability/Reliability reports<br />
* questions/clarifications about the reports<br />
|assigned by=<br />
* TPM<br />
* any other SU that identifies a problem as relevant to this SU<br />
|solved by=<br />
* tickets can be internally solved or assigned to other specific SUs depending on the problem<br />
|responsible=<br />
* the EGI Chief Operations Officer<br />
|documentation= What documentation is available on UNIT?<br />
* EGI web site: http://www.egi.eu/infrastructure/<br />
* EGI Operations wiki: [[Operations]]<br />
|sortname=SLM<br />
}}</div>Dzilahttps://wiki.egi.eu/w/index.php?title=EGI-InSPIRE:SA1.8-QR5&diff=22239EGI-InSPIRE:SA1.8-QR52011-07-28T08:15:57Z<p>Dzila: /* 4. Plans for the next period */</p>
<hr />
<div>__NOTOC__ <br />
<br />
= 1. Task Meetings =<br />
<br />
{| cellspacing="0" cellpadding="5" border="1" align="center"<br />
|-<br />
! style="width: 25%;" | Date (dd/mm/yyyy) <br />
! style="width: 25%;" | Url Indico Agenda <br />
! style="width: 10%;" | Title <br />
! style="width: 10%;" | Outcome<br />
|- <br />
| 14/06/2011 - 16/06/2011 <br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=481<br />
| Documentation workshop, Zürich<br />
| Worked on wiki pages for operational tools and best practices.<br />
|-<br />
| 29/07/2011 - 29/07/2011<br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=559<br />
| 4th OLA Task Force meeting<br />
| Evolution of the RP OLA<br />
|}<br />
<br />
= 2. Main Achievements =<br />
<br />
* Set a deadline for turning off GOCWIKI (end of September 2011).<br />
* The RC OLA [https://documents.egi.eu/document/31] has been finalized and approved, based on the comments received from the OMB at Vilnius in April and in June.<br />
* In collaboration with WLCG, EGI league reports now use the new ACE availability calculation system, which supports different profiles. This is an important step towards obtaining different reports for different services in the future (i.e. Core Services).<br />
* From May 2011 onwards, the EGI thresholds for suspending a site have been increased to 70%/75% Availability/Reliability, based on the evaluation performed in recent months.<br />
<br />
= 3. Issues and Mitigation =<br />
<br />
{| cellspacing="0" cellpadding="2" border="1"<br />
|-<br />
! scope="col" | Issue Description <br />
! scope="col" | Mitigation Description<br />
|-<br />
|}<br />
<br />
= 4. Plans for the next period =<br />
* Produce a new RP OLA draft, to discuss in the OLA workshop in Lyon<br />
* Explore the options provided by ACE profiles to obtain reports for NGI/Core services<br />
* Adjust the WLCG recalculation policy and procedure for EGI sites to EGI needs<br />
<br />
= 5. Number of sites suspended =<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
! style="width: 25%" | Month<br />
! style="width: 25%" | Suspended sites<br />
|-<br />
|Apr 2011<br />
|0<br />
|-<br />
|May 2011<br />
|2<br />
|-<br />
|June 2011<br />
|7<br />
|-<br />
|}</div>Dzilahttps://wiki.egi.eu/w/index.php?title=EGI-InSPIRE:SA1.8-QR5&diff=22238EGI-InSPIRE:SA1.8-QR52011-07-28T08:10:00Z<p>Dzila: /* 2. Main Achievements */</p>
<hr />
<div>__NOTOC__ <br />
<br />
= 1. Task Meetings =<br />
<br />
{| cellspacing="0" cellpadding="5" border="1" align="center"<br />
|-<br />
! style="width: 25%;" | Date (dd/mm/yyyy) <br />
! style="width: 25%;" | Url Indico Agenda <br />
! style="width: 10%;" | Title <br />
! style="width: 10%;" | Outcome<br />
|- <br />
| 14/06/2011 - 16/06/2011 <br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=481<br />
| Documentation workshop, Zürich<br />
| Worked on wiki pages for operational tools and best practices.<br />
|-<br />
| 29/07/2011 - 29/07/2011<br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=559<br />
| 4th OLA Task Force meeting<br />
| Evolution of the RP OLA<br />
|}<br />
<br />
= 2. Main Achievements =<br />
<br />
* Set a deadline for turning off GOCWIKI (end of September 2011).<br />
* The RC OLA [https://documents.egi.eu/document/31] has been finalized and approved, based on the comments received from the OMB at Vilnius in April and in June.<br />
* In collaboration with WLCG, EGI league reports now use the new ACE availability calculation system, which supports different profiles. This is an important step towards obtaining different reports for different services in the future (i.e. Core Services).<br />
* From May 2011 onwards, the EGI thresholds for suspending a site have been increased to 70%/75% Availability/Reliability, based on the evaluation performed in recent months.<br />
<br />
= 3. Issues and Mitigation =<br />
<br />
{| cellspacing="0" cellpadding="2" border="1"<br />
|-<br />
! scope="col" | Issue Description <br />
! scope="col" | Mitigation Description<br />
|-<br />
|}<br />
<br />
= 4. Plans for the next period =<br />
<br />
= 5. Number of sites suspended =<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
! style="width: 25%" | Month<br />
! style="width: 25%" | Suspended sites<br />
|-<br />
|Apr 2011<br />
|0<br />
|-<br />
|May 2011<br />
|2<br />
|-<br />
|June 2011<br />
|7<br />
|-<br />
|}</div>Dzilahttps://wiki.egi.eu/w/index.php?title=EGI-InSPIRE:SA1.8-QR5&diff=22237EGI-InSPIRE:SA1.8-QR52011-07-28T08:08:28Z<p>Dzila: /* 2. Main Achievements */</p>
<hr />
<div>__NOTOC__ <br />
<br />
= 1. Task Meetings =<br />
<br />
{| cellspacing="0" cellpadding="5" border="1" align="center"<br />
|-<br />
! style="width: 25%;" | Date (dd/mm/yyyy) <br />
! style="width: 25%;" | Url Indico Agenda <br />
! style="width: 10%;" | Title <br />
! style="width: 10%;" | Outcome<br />
|- <br />
| 14/06/2011 - 16/06/2011 <br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=481<br />
| Documentation workshop, Zürich<br />
| Worked on wiki pages for operational tools and best practices.<br />
|-<br />
| 29/07/2011 - 29/07/2011<br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=559<br />
| 4th OLA Task Force meeting<br />
| Evolution of the RP OLA<br />
|}<br />
<br />
= 2. Main Achievements =<br />
<br />
* Set a deadline for turning off GOCWIKI (end of September 2011).<br />
* The RC OLA [https://documents.egi.eu/document/31] has been finalized and approved, based on the comments received from the OMB at Vilnius in April and in June.<br />
* In collaboration with WLCG, EGI league reports now use the new ACE availability calculation system, which supports different profiles. This is an important step towards obtaining different reports for different services in the future (i.e. Core Services).<br />
<br />
= 3. Issues and Mitigation =<br />
<br />
{| cellspacing="0" cellpadding="2" border="1"<br />
|-<br />
! scope="col" | Issue Description <br />
! scope="col" | Mitigation Description<br />
|-<br />
|}<br />
<br />
= 4. Plans for the next period =<br />
<br />
= 5. Number of sites suspended =<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
! style="width: 25%" | Month<br />
! style="width: 25%" | Suspended sites<br />
|-<br />
|Apr 2011<br />
|0<br />
|-<br />
|May 2011<br />
|2<br />
|-<br />
|June 2011<br />
|7<br />
|-<br />
|}</div>Dzilahttps://wiki.egi.eu/w/index.php?title=EGI-InSPIRE:SA1.8-QR5&diff=22236EGI-InSPIRE:SA1.8-QR52011-07-28T08:05:59Z<p>Dzila: /* 2. Main Achievements */</p>
<hr />
<div>__NOTOC__ <br />
<br />
= 1. Task Meetings =<br />
<br />
{| cellspacing="0" cellpadding="5" border="1" align="center"<br />
|-<br />
! style="width: 25%;" | Date (dd/mm/yyyy) <br />
! style="width: 25%;" | Url Indico Agenda <br />
! style="width: 10%;" | Title <br />
! style="width: 10%;" | Outcome<br />
|- <br />
| 14/06/2011 - 16/06/2011 <br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=481<br />
| Documentation workshop, Zürich<br />
| Worked on wiki pages for operational tools and best practices.<br />
|-<br />
| 29/07/2011 - 29/07/2011<br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=559<br />
| 4th OLA Task Force meeting<br />
| Evolution of the RP OLA<br />
|}<br />
<br />
= 2. Main Achievements =<br />
<br />
* Set a deadline for turning off GOCWIKI (end of September 2011).<br />
* The RC OLA [https://documents.egi.eu/document/31] has been finalized and approved, based on the comments received from the OMB at Vilnius in April and in June.<br />
<br />
= 3. Issues and Mitigation =<br />
<br />
{| cellspacing="0" cellpadding="2" border="1"<br />
|-<br />
! scope="col" | Issue Description <br />
! scope="col" | Mitigation Description<br />
|-<br />
|}<br />
<br />
= 4. Plans for the next period =<br />
<br />
= 5. Number of sites suspended =<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
! style="width: 25%" | Month<br />
! style="width: 25%" | Suspended sites<br />
|-<br />
|Apr 2011<br />
|0<br />
|-<br />
|May 2011<br />
|2<br />
|-<br />
|June 2011<br />
|7<br />
|-<br />
|}</div>Dzilahttps://wiki.egi.eu/w/index.php?title=EGI-InSPIRE:SA1.8-QR5&diff=22235EGI-InSPIRE:SA1.8-QR52011-07-28T07:39:18Z<p>Dzila: /* 1. Task Meetings */</p>
<hr />
<div>__NOTOC__ <br />
<br />
= 1. Task Meetings =<br />
<br />
{| cellspacing="0" cellpadding="5" border="1" align="center"<br />
|-<br />
! style="width: 25%;" | Date (dd/mm/yyyy) <br />
! style="width: 25%;" | Url Indico Agenda <br />
! style="width: 10%;" | Title <br />
! style="width: 10%;" | Outcome<br />
|- <br />
| 14/06/2011 - 16/06/2011 <br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=481<br />
| Documentation workshop, Zürich<br />
| Worked on wiki pages for operational tools and best practices.<br />
|-<br />
| 29/07/2011 - 29/07/2011<br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=559<br />
| 4th OLA Task Force meeting<br />
| Evolution of the RP OLA<br />
|}<br />
<br />
= 2. Main Achievements =<br />
<br />
* Set a deadline for turning off GOCWIKI (end of September 2011).<br />
<br />
= 3. Issues and Mitigation =<br />
<br />
{| cellspacing="0" cellpadding="2" border="1"<br />
|-<br />
! scope="col" | Issue Description <br />
! scope="col" | Mitigation Description<br />
|-<br />
|}<br />
<br />
= 4. Plans for the next period =<br />
<br />
= 5. Number of sites suspended =<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
! style="width: 25%" | Month<br />
! style="width: 25%" | Suspended sites<br />
|-<br />
|Apr 2011<br />
|0<br />
|-<br />
|May 2011<br />
|2<br />
|-<br />
|June 2011<br />
|7<br />
|-<br />
|}</div>Dzilahttps://wiki.egi.eu/w/index.php?title=EGI-InSPIRE:SA1.8-QR5&diff=22234EGI-InSPIRE:SA1.8-QR52011-07-28T07:38:57Z<p>Dzila: /* 1. Task Meetings */</p>
<hr />
<div>__NOTOC__ <br />
<br />
= 1. Task Meetings =<br />
<br />
{| cellspacing="0" cellpadding="5" border="1" align="center"<br />
|-<br />
! style="width: 25%;" | Date (dd/mm/yyyy) <br />
! style="width: 25%;" | Url Indico Agenda <br />
! style="width: 10%;" | Title <br />
! style="width: 10%;" | Outcome<br />
|- <br />
| 14/06/2011 - 16/06/2011 <br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=481<br />
| Documentation workshop, Zürich<br />
| Worked on wiki pages for operational tools and best practices.<br />
|-<br />
| 19/06/2011 - 29/06/2011<br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=559<br />
| 4th OLA Task Force meeting<br />
| Evolution of the RP OLA<br />
|}<br />
<br />
= 2. Main Achievements =<br />
<br />
* Set a deadline for turning off GOCWIKI (end of September 2011).<br />
<br />
= 3. Issues and Mitigation =<br />
<br />
{| cellspacing="0" cellpadding="2" border="1"<br />
|-<br />
! scope="col" | Issue Description <br />
! scope="col" | Mitigation Description<br />
|-<br />
|}<br />
<br />
= 4. Plans for the next period =<br />
<br />
= 5. Number of sites suspended =<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
! style="width: 25%" | Month<br />
! style="width: 25%" | Suspended sites<br />
|-<br />
|Apr 2011<br />
|0<br />
|-<br />
|May 2011<br />
|2<br />
|-<br />
|June 2011<br />
|7<br />
|-<br />
|}</div>Dzilahttps://wiki.egi.eu/w/index.php?title=EGI-InSPIRE:SA1.8-QR5&diff=22216EGI-InSPIRE:SA1.8-QR52011-07-28T06:48:38Z<p>Dzila: </p>
<hr />
<div>__NOTOC__ <br />
<br />
= 1. Task Meetings =<br />
<br />
{| cellspacing="0" cellpadding="5" border="1" align="center"<br />
|-<br />
! style="width: 25%;" | Date (dd/mm/yyyy) <br />
! style="width: 25%;" | Url Indico Agenda <br />
! style="width: 10%;" | Title <br />
! style="width: 10%;" | Outcome<br />
|- <br />
| 14/06/2011 - 16/06/2011 <br />
| https://www.egi.eu/indico/conferenceDisplay.py?confId=481<br />
| Documentation workshop, Zürich<br />
| Worked on wiki pages for operational tools and best practices.<br />
|}<br />
<br />
= 2. Main Achievements =<br />
<br />
* Set a deadline for turning off GOCWIKI (end of September 2011).<br />
<br />
= 3. Issues and Mitigation =<br />
<br />
{| cellspacing="0" cellpadding="2" border="1"<br />
|-<br />
! scope="col" | Issue Description <br />
! scope="col" | Mitigation Description<br />
|-<br />
|}<br />
<br />
= 4. Plans for the next period =<br />
<br />
= 5. Number of sites suspended =<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
! style="width: 25%" | Month<br />
! style="width: 25%" | Suspended sites<br />
|-<br />
|Apr 2011<br />
|0<br />
|-<br />
|May 2011<br />
|2<br />
|-<br />
|June 2011<br />
|7<br />
|-<br />
|}</div>Dzilahttps://wiki.egi.eu/w/index.php?title=EGI-InSPIRE:SA1_Task_Metrics_Table&diff=22212EGI-InSPIRE:SA1 Task Metrics Table2011-07-28T06:45:04Z<p>Dzila: </p>
<hr />
<div>{{Template:Op menubar}} <br />
<br />
SA1 task quarterly metrics. <br />
<br />
Back to the [[SA1 Task QR Reports and Metrics]]. <!--<br />
Task metrics.<br />
Note. Only provide values for the metrics relevant to your task.<br />
Values reported need to be aggregated during the reference three months.<br />
--> <br />
<br />
{| cellspacing="0" cellpadding="2" border="1"<br />
|-<br />
! scope="col" | SA1 Task <br />
! scope="col" | Metric name <br />
! scope="col" | Metric description <br />
! scope="col" | QR3 <br />
! QR4 <br />
! QR5 <br />
! QR6 <br />
! QR7 <br />
! QR8 <br />
! QR9 <br />
! QR10 <br />
! QR11 <br />
! QR12 <br />
! QR13 <br />
! QR14 <br />
! QR15 <br />
! QR16<br />
|-<br />
! scope="row" | TSA1.1 <br />
! scope="row" | M.SA1.Size.1 <br />
! scope="row" | Total number of production resource centres that are part of the EGI <br />
| &lt;Q3_value&gt; <br />
| &lt;Q4_value&gt; <br />
| &lt;Q5_value&gt; <br />
| &lt;Q6_value&gt; <br />
| &lt;Q7_value&gt; <br />
| &lt;Q8_value&gt; <br />
| &lt;Q9_value&gt; <br />
| &lt;Q10_value&gt; <br />
| &lt;Q11_value&gt; <br />
| &lt;Q12_value&gt; <br />
| &lt;Q13_value&gt; <br />
| &lt;Q14_value&gt; <br />
| &lt;Q15_value&gt; <br />
| &lt;Q16_value&gt;<br />
|-<br />
! scope="row" | TSA1.2 <br />
! scope="row" | M.SA1.OperationalSecurity.1 <br />
! scope="row" | Number of Site Security Challenge (SSC) made <br />
| 0 <br />
| 0 <br />
| 40 <br />
| &lt;Q6_value&gt; <br />
| &lt;Q7_value&gt; <br />
| &lt;Q8_value&gt; <br />
| &lt;Q9_value&gt; <br />
| &lt;Q10_value&gt; <br />
| &lt;Q11_value&gt; <br />
| &lt;Q12_value&gt; <br />
| &lt;Q13_value&gt; <br />
| &lt;Q14_value&gt; <br />
| &lt;Q15_value&gt; <br />
| &lt;Q16_value&gt;<br />
|-<br />
! scope="row" | TSA1.2 <br />
! scope="row" | M.SA1.OperationalSecurity.2 <br />
! scope="row" | Number of Sites passing one Service Challenge <br />
| N/A <br />
| 0 <br />
| N/A (evaluation is still ongoing)<br />
| &lt;Q6_value&gt; <br />
| &lt;Q7_value&gt; <br />
| &lt;Q8_value&gt; <br />
| &lt;Q9_value&gt; <br />
| &lt;Q10_value&gt; <br />
| &lt;Q11_value&gt; <br />
| &lt;Q12_value&gt; <br />
| &lt;Q13_value&gt; <br />
| &lt;Q14_value&gt; <br />
| &lt;Q15_value&gt; <br />
| &lt;Q16_value&gt;<br />
|-<br />
! scope="row" | TSA1.2 <br />
! scope="row" | M.SA1.OperationalSecurity.3 <br />
! scope="row" | Number of suspended sites for security issues <br />
| 0 <br />
| 0<br />
| 0<br />
| &lt;Q6_value&gt; <br />
| &lt;Q7_value&gt; <br />
| &lt;Q8_value&gt; <br />
| &lt;Q9_value&gt; <br />
| &lt;Q10_value&gt; <br />
| &lt;Q11_value&gt; <br />
| &lt;Q12_value&gt; <br />
| &lt;Q13_value&gt; <br />
| &lt;Q14_value&gt; <br />
| &lt;Q15_value&gt; <br />
| &lt;Q16_value&gt;<br />
|-<br />
! scope="row" | TSA1.3 <br />
! scope="row" | M.SA1.ServiceValidation.1 <br />
! scope="row" | Total number of components tested/rejected in staged rollout <br />
| 11/2 <br />
| 29/1<br> <br />
| &lt;Q5_value&gt; <br />
| &lt;Q6_value&gt; <br />
| &lt;Q7_value&gt; <br />
| &lt;Q8_value&gt; <br />
| &lt;Q9_value&gt; <br />
| &lt;Q10_value&gt; <br />
| &lt;Q11_value&gt; <br />
| &lt;Q12_value&gt; <br />
| &lt;Q13_value&gt; <br />
| &lt;Q14_value&gt; <br />
| &lt;Q15_value&gt; <br />
| &lt;Q16_value&gt;<br />
|-<br />
! scope="row" | TSA1.3 <br />
! scope="row" | M.SA1.ServiceValidation.2 <br />
! scope="row" | Number of staged rollout tests undertaken <br />
| 14 <br />
| 40 <br />
| &lt;Q5_value&gt; <br />
| &lt;Q6_value&gt; <br />
| &lt;Q7_value&gt; <br />
| &lt;Q8_value&gt; <br />
| &lt;Q9_value&gt; <br />
| &lt;Q10_value&gt; <br />
| &lt;Q11_value&gt; <br />
| &lt;Q12_value&gt; <br />
| &lt;Q13_value&gt; <br />
| &lt;Q14_value&gt; <br />
| &lt;Q15_value&gt; <br />
| &lt;Q16_value&gt;<br />
|-<br />
! scope="row" | TSA1.3 <br />
! scope="row" | M.SA1.ServiceValidation.3 <br />
! scope="row" | Number of EA teams <br />
| 40 <br />
| 45 <br />
| &lt;Q5_value&gt; <br />
| &lt;Q6_value&gt; <br />
| &lt;Q7_value&gt; <br />
| &lt;Q8_value&gt; <br />
| &lt;Q9_value&gt; <br />
| &lt;Q10_value&gt; <br />
| &lt;Q11_value&gt; <br />
| &lt;Q12_value&gt; <br />
| &lt;Q13_value&gt; <br />
| &lt;Q14_value&gt; <br />
| &lt;Q15_value&gt; <br />
| &lt;Q16_value&gt;<br />
|-<br />
! scope="row" | TSA1.5 <br />
! scope="row" | MSA1.Accounting.1 <br />
! scope="row" | Number of sites adopting AMQ messaging for Usage Record publication <br />
| 149 (90 RGMA, 62 direct insertion, 56% infrastructure ok) <br />
| 241&nbsp; <br />
| &lt;Q5_value&gt; <br />
| &lt;Q6_value&gt; <br />
| &lt;Q7_value&gt; <br />
| &lt;Q8_value&gt; <br />
| &lt;Q9_value&gt; <br />
| &lt;Q10_value&gt; <br />
| &lt;Q11_value&gt; <br />
| &lt;Q12_value&gt; <br />
| &lt;Q13_value&gt; <br />
| &lt;Q14_value&gt; <br />
| &lt;Q15_value&gt; <br />
| &lt;Q16_value&gt;<br />
|-<br />
! scope="row" | TSA1.7 <br />
! scope="row" | M.SA1.Support.7 <br />
! scope="row" | COD Workload per month <br />
| <br />
764/551/844 <br />
<br />
See: https://documents.egi.eu/secure/ShowDocument?docid=155&amp;version=1<br />
<br />
| <br />
135/363/315 <br />
<br />
See: https://documents.egi.eu/secure/ShowDocument?docid=155&amp;version=1<br />
<br />
| &lt;Q5_value&gt; <br />
| &lt;Q6_value&gt; <br />
| &lt;Q7_value&gt; <br />
| &lt;Q8_value&gt; <br />
| &lt;Q9_value&gt; <br />
| &lt;Q10_value&gt; <br />
| &lt;Q11_value&gt; <br />
| &lt;Q12_value&gt; <br />
| &lt;Q13_value&gt; <br />
| &lt;Q14_value&gt; <br />
| &lt;Q15_value&gt; <br />
| &lt;Q16_value&gt;<br />
|-<br />
! scope="row" | TSA1.7 <br />
! scope="row" | M.SA1.Support.8 <br />
! scope="row" | ROD Workload per month (breakdown per region/NGI) <br />
| <br />
2943/1912/2090 <br />
<br />
See: https://documents.egi.eu/secure/ShowDocument?docid=155&amp;version=1<br />
<br />
| <br />
1530/1692/2059 <br />
<br />
See: https://documents.egi.eu/secure/ShowDocument?docid=155&amp;version=1<br />
<br />
| &lt;Q5_value&gt; <br />
| &lt;Q6_value&gt; <br />
| &lt;Q7_value&gt; <br />
| &lt;Q8_value&gt; <br />
| &lt;Q9_value&gt; <br />
| &lt;Q10_value&gt; <br />
| &lt;Q11_value&gt; <br />
| &lt;Q12_value&gt; <br />
| &lt;Q13_value&gt; <br />
| &lt;Q14_value&gt; <br />
| &lt;Q15_value&gt; <br />
| &lt;Q16_value&gt;<br />
|-<br />
! scope="row" | TSA1.7 <br />
! scope="row" | M.SA1.Support.9 <br />
! scope="row" | ROD Quality Metrics per month (breakdown per region/NGI) <br />
| <br />
0.90/0.81/0.76 <br />
<br />
See: https://documents.egi.eu/secure/ShowDocument?docid=155&amp;version=1<br />
<br />
| <br />
0.85/0.82/0.86<br />
<br />
See: https://documents.egi.eu/secure/ShowDocument?docid=155&amp;version=1<br />
<br />
| &lt;Q5_value&gt; <br />
| &lt;Q6_value&gt; <br />
| &lt;Q7_value&gt; <br />
| &lt;Q8_value&gt; <br />
| &lt;Q9_value&gt; <br />
| &lt;Q10_value&gt; <br />
| &lt;Q11_value&gt; <br />
| &lt;Q12_value&gt; <br />
| &lt;Q13_value&gt; <br />
| &lt;Q14_value&gt; <br />
| &lt;Q15_value&gt; <br />
| &lt;Q16_value&gt;<br />
|-<br />
! scope="row" | TSA1.8 <br />
! scope="row" | M.SA1.Operation.2 <br />
! scope="row" | Number of sites suspended <br />
| 1/0/1 <br />
| 2/0/0 <br />
| 0/2/7 <br />
| &lt;Q6_value&gt; <br />
| &lt;Q7_value&gt; <br />
| &lt;Q8_value&gt; <br />
| &lt;Q9_value&gt; <br />
| &lt;Q10_value&gt; <br />
| &lt;Q11_value&gt; <br />
| &lt;Q12_value&gt; <br />
| &lt;Q13_value&gt; <br />
| &lt;Q14_value&gt; <br />
| &lt;Q15_value&gt; <br />
| &lt;Q16_value&gt;<br />
|}<br />
<br />
[[Category:Metrics]]</div>Dzilahttps://wiki.egi.eu/w/index.php?title=Resource_Centres_OLA_and_Resource_infrastructure_Provider_OLA_reports&diff=21881Resource Centres OLA and Resource infrastructure Provider OLA reports2011-07-19T15:51:08Z<p>Dzila: /* Process for quality verification */</p>
<hr />
<div>{{Template:Op menubar}}<br />
[[Category:Procedures]]<br />
{{TOC_right}}<br />
<br />
It is mandatory that EGI certified Resource Centres provide a minimum monthly availability and reliability as specified below (see the [https://documents.egi.eu/document/31 site-NGI Operational Level Agreement] for details). Availability and reliability statistics (based on the global OPS VO) are issued on a monthly basis.<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
| '''minimum availability'''<br />
| 70%<br />
|-<br />
| '''minimum reliability'''<br />
| 75%<br />
|-<br />
|'''Condition for suspension'''<br />
| Resource Centres which have an availability of less than '''70%''' for three consecutive months will be suspended, i.e. removed from the production infrastructure. The 70% threshold applies from PY2 (May 2011 reports). Note: this suspension policy was reviewed in April 2011, and the original 50% threshold was increased to 70%.<br />
|-<br />
|'''Condition for justification'''<br />
|Resource Centres not providing minimum monthly performance (70% availability, 75% reliability) MUST provide justification through a GGUS ticket.<br />
|}<br />
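The conditions in the table above can be sketched in code. This is only an illustrative sketch, not part of any EGI tool: the function and constant names are hypothetical, and it assumes that "three consecutive months" refers to the three most recent monthly reports.<br />

```python
from typing import Sequence

# Thresholds taken from the table above (percent).
MIN_AVAILABILITY = 70.0
MIN_RELIABILITY = 75.0
# Number of consecutive months below the availability threshold
# that makes a Resource Centre eligible for suspension.
SUSPENSION_MONTHS = 3


def needs_justification(availability: float, reliability: float) -> bool:
    """Return True if a monthly result misses either minimum target."""
    return availability < MIN_AVAILABILITY or reliability < MIN_RELIABILITY


def eligible_for_suspension(monthly_availability: Sequence[float]) -> bool:
    """Return True if the latest three monthly availabilities are all below 70%.

    Assumption: the list is ordered oldest to newest, one value per month.
    """
    recent = monthly_availability[-SUSPENSION_MONTHS:]
    return len(recent) == SUSPENSION_MONTHS and all(
        value < MIN_AVAILABILITY for value in recent
    )
```

For example, a Resource Centre with 72% availability and 74% reliability would need to provide a justification (reliability is below 75%), but it would not count towards the three-month suspension condition for that month.<br />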
<br />
<br />
= Performance reports=<br />
* [https://wiki.egi.eu/wiki/Availability_and_reliability_reports_metrics Overview] of availability and reliability statistics including suspended sites<br />
* [https://wiki.egi.eu/wiki/List_of_sites_for_which_the_availability_followup_procedures_were_not_applicable List of sites] for which availability followup procedures were not applicable<br />
<!--* [https://twiki.cern.ch/twiki/bin/view/EGEE/SuspendedSites List of suspended sites (2009)]--><br />
== 2011 ==<br />
[https://documents.egi.eu/document/648 Jun]<br />
[https://documents.egi.eu/document/593 May]<br />
[https://documents.egi.eu/document/508 Apr]<br />
[https://documents.egi.eu/document/465 Mar]<br />
[https://documents.egi.eu/document/402 Feb]<br />
[https://documents.egi.eu/document/332 Jan]<br />
<br />
== 2010 ==<br />
*[https://documents.egi.eu/document/299 Dec] | [https://documents.egi.eu/document/266 Nov] | [https://documents.egi.eu/document/238 Oct] | [https://documents.egi.eu/document/219 Sep] | [https://documents.egi.eu/document/157 Aug] | [https://documents.egi.eu/document/130 Jul] | [https://documents.egi.eu/document/96 Jun]|[https://documents.egi.eu/document/42 May]<br />
*[https://edms.cern.ch/document/963325 January 2008 - April 2010] (EGEE league tables)<br />
<br />
== EGI-wide Availability and Reliability ==<br />
It is available [https://documents.egi.eu/public/ShowDocument?docid=415 here] (xls file, data from May 01 2010)<br />
<br />
= Availability statistics per service/Resource Centre =<br />
[https://grid-monitoring.cern.ch/myegi/sa/# MyEGI]<br />
<br />
=Report generator=<br />
*[http://gvdev.cern.ch/GVPC/Excel/ '''(DEPRECATED)''' GridView availability/reliability report generator] (providing access to the database including Nagios results for OPS and SAM results for VOs)<br />
*[http://gridview015.cern.ch/GVPC/Excel/ACE/ ACE report generator]<br />
*[https://gvdev.cern.ch/ACEVAL/ace_index.php ACE visualization portal]<br />
<br />
=Process for quality verification=<br />
<br />
* '''Generation of statistics'''<br />
Availability and reliability statistics are generated automatically in the first week of each month by the [https://wiki.egi.eu/wiki/External_tools#Availability_Computation_Engine Availability Computation Engine] (Gridview until May 2011) using the profile. The reports are produced in PDF format and placed under [http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/]. An Excel version is available at [http://gvdev.cern.ch/GVPC/Excel/ACE/]<br />
<br />
* '''Preliminary processing'''<br />
Once the reports are generated, sanity checks are performed by EGI SA1 (Task TSA1.8). After this step is completed, statistics are uploaded into the EGI document server. Links to monthly statistics will be provided on a regular basis at this wiki page.<br />
<br />
* '''Publication'''<br />
An announcement of the new results is distributed by EGI SA1 (TSA1.8) to the NGI Operations Managers mailing list. COD (TSA1.7) is responsible for supervising the statistics: it asks NGIs to follow up with sites that need to provide comments when thresholds are not met, and it identifies sites eligible for suspension. This phase starts by filing a ticket to the COD Support Unit. The overall comment-gathering process is handled through tickets.<br />
<br />
* '''Handling of sites below targets'''<br />
For a site that misses availability/reliability targets but is not eligible for suspension: <br />
<br />
# a child ticket is opened by the COD team and assigned to the respective NGI, asking for an explanation <br />
# the explanation must be produced within 10 working days of the ticket being received by the site (please see known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs]). Reminders and escalation are performed in accordance with COD escalation procedures [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure].<br />
# if the explanation is found satisfactory the ticket is closed <br />
# conversely, if the explanation is not given in due time, or the explanation is found inadequate, the COD escalation procedure will be followed [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], with the site being suspended if neither the site nor the NGI replies to the ticket <br />
# the child ticket can then be closed <br />
# the parent ticket will be closed when all child tickets have been closed.<br />
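The 10-working-day deadline in the steps above can be computed as in the following sketch. It is an illustration only: the helper name is hypothetical, and it assumes that working days are Monday to Friday and ignores public holidays, which this page does not define.<br />

```python
from datetime import date, timedelta


def add_working_days(start: date, working_days: int) -> date:
    """Return the date that lies `working_days` working days after `start`.

    Assumption: Saturdays and Sundays are the only non-working days.
    """
    current = start
    remaining = working_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


# Hypothetical example: ticket received on Monday 4 July 2011.
deadline = add_working_days(date(2011, 7, 4), 10)
```

With these assumptions, a ticket received on Monday 4 July 2011 would be due on Monday 18 July 2011.<br />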
<br />
* '''Handling of sites that are eligible for suspension'''<br />
For a site that is eligible for suspension: <br />
# a child ticket is opened by the COD team and assigned to the appropriate NGI, notifying that the site will be suspended within 10 working days (please see known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs])<br />
# after the 10-day period has passed, during which normal COD escalation procedures apply [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], the site is suspended by COD unless the NGI has intervened or the EGI Chief Operations Officer (COO) objects. <br />
# in the case of NGI intervention, the site is not suspended if both the COD and the COO agree with the reasoning provided by the NGI <br />
# the child ticket closes either when the site is suspended or when suspension is canceled <br />
# the parent ticket will be closed when all child tickets have been closed<br />
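The decision taken at the end of the 10-day period can be summarised as a small sketch. The function and parameter names are hypothetical, and the logic is only one reading of the steps above, not an official rule set.<br />

```python
def site_is_suspended(
    ngi_intervened: bool,
    cod_agrees: bool,
    coo_agrees: bool,
    coo_objects: bool = False,
) -> bool:
    """Decide the outcome for a site that is eligible for suspension.

    Assumed reading of the procedure:
    - an objection by the Chief Operations Officer (COO) cancels suspension;
    - an NGI intervention cancels suspension only if both COD and the COO
      agree with the reasoning provided by the NGI;
    - in every other case the site is suspended by COD.
    """
    if coo_objects:
        return False
    if ngi_intervened and cod_agrees and coo_agrees:
        return False
    return True
```

In this reading, an NGI intervention that convinces COD but not the COO still leads to suspension.<br />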
<br />
* '''Wiki follow up page'''<br />
Sites that fail to provide an explanation for missing the OLA targets, or whose explanation is found inadequate, as well as sites that are suspended, are recorded in a wiki page [https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD_metrics]<br />
<br />
* '''Recomputation procedure'''<br />
Should there be doubts about the validity of Availability/Reliability reports, a RC/NGI can request recomputations according to the procedure defined at [https://tomtools.cern.ch/confluence/display/SAM/Availability+Re-computation+Policy]<br />
<br />
=Known issues and recommendations to NGIs=<br />
# ACE, like Gridview in the past, always calculates the availability and reliability of a site as soon as it shows up in GOCDB and in the BDIIs, regardless of its certification status. While processing the data to generate the availability/reliability report, ACE uses the certification status of the site at that moment to decide whether the site is certified, and therefore shown in the report, or uncertified, and therefore excluded. '''Thus newly certified sites will get inaccurate Availability/Reliability figures for the month they were certified and all months before that.''' Because the certification status history is not currently available in the operations tools, NGIs should, until a solution is implemented, check whether they have sites affected by this issue and report it as the explanation. More information at [https://gus.fzk.de/ws/ticket_info.php?ticket=60594] and [https://gus.fzk.de/ws/ticket_info.php?ticket=60925]. '''As of December 2010, Gridview included a snapshot feature so that availability takes into account the topology on the last day of the month. While it does not solve the problem completely, it reduces its impact. However, ACE reports (used since May 2011) do not include the snapshot feature yet.''' <br />
# The calculations performed by ACE always '''take into account the information system status and GOCDB information at the time the calculation is performed, and not that of a certain checkpoint in the past'''. As a consequence, any complete recalculation risks altering the results for sites that had correct numbers in the first place. Thus, until a solution is found, '''complete recalculations are avoided whenever possible''', and errors are fixed on a per-site basis for those sites that have lower numbers than they should.<br />
# Weighted availability is calculated by multiplying the number of logical CPUs a site publishes by its published HEPSPEC value. It is important that these numbers are correct: if the HEPSPEC value for a site is too high or too low (for example, by mistake), the overall NGI weighted availability will be affected.<br />
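The effect of a wrong HEPSPEC value on the weighted availability can be illustrated with a short sketch. The weight (logical CPUs multiplied by HEPSPEC) comes from the text above. The aggregation as a weighted mean over the sites of an NGI is an assumption made for illustration, and the function name and figures are hypothetical.<br />

```python
from typing import Iterable, Tuple

# One entry per site: (logical CPUs, published HEPSPEC value, availability in percent)
Site = Tuple[int, float, float]


def ngi_weighted_availability(sites: Iterable[Site]) -> float:
    """Weighted mean of site availabilities, weighted by logical CPUs * HEPSPEC.

    Assumption: the NGI figure is a weighted mean; the actual ACE
    aggregation may differ in detail.
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for logical_cpus, hepspec, availability in sites:
        weight = logical_cpus * hepspec
        total_weight += weight
        weighted_sum += weight * availability
    if total_weight == 0:
        raise ValueError("no site publishes a non-zero weight")
    return weighted_sum / total_weight


# Two equally sized sites with correct HEPSPEC values.
correct = ngi_weighted_availability([(100, 10.0, 90.0), (100, 10.0, 50.0)])
# Same sites, but the second one publishes a HEPSPEC value ten times too high.
skewed = ngi_weighted_availability([(100, 10.0, 90.0), (100, 100.0, 50.0)])
```

In this made-up example the correct figure is 70%, while the mistaken HEPSPEC value pulls the NGI figure down to roughly 53.6%, because the poorly performing site dominates the weighting.<br />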
<br />
=Resources=<br />
* Definition of Availability and Reliability and related computation algorithm ([https://twiki.cern.ch/twiki/pub/LCG/GridView/Gridview_Service_Availability_Computation.pdf paper])<br />
* NEW! [https://wiki.egi.eu/wiki/Availability_and_reliability_tests List of Nagios tests] used for availability computation<br />
* [https://twiki.cern.ch/twiki/bin/view/LCG/ACE Availability Computation Engine] (ACE)<br />
* [https://documents.egi.eu/document/31 Operational Level Agreement between NGI and site]<br />
* [https://wiki.egi.eu/wiki/OLA_release_notes OLA release notes]<br />
<!--* OLD: [https://twiki.cern.ch/twiki/bin/view/EGEE/MonthlyAvailability EGEE-III Comments on site availability and reliability statistics] --><br />
*[https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD COD procedure for oversight of availability and reliability performance]<br />
* Impact of change of suspension policy for under-performing sites: [https://wiki.egi.eu/wiki/Availability_and_reliability_threshold_change_impact report]</div>Dzilahttps://wiki.egi.eu/w/index.php?title=Resource_Centres_OLA_and_Resource_infrastructure_Provider_OLA_reports&diff=21880Resource Centres OLA and Resource infrastructure Provider OLA reports2011-07-19T15:50:37Z<p>Dzila: /* Process for quality verification */</p>
<hr />
<div>{{Template:Op menubar}}<br />
[[Category:Procedures]]<br />
{{TOC_right}}<br />
<br />
It is mandatory that EGI certified Resource Centres provide a minimum monthly availability and reliability as specified below (see the [https://documents.egi.eu/document/31 site-NGI Operational Level Agreement] for details). Availability and reliability statistics (based on the global OPS VO) are issued on a monthly basis.<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
| '''minimum availability'''<br />
| 70%<br />
|-<br />
| '''minimum reliability'''<br />
| 75%<br />
|-<br />
|'''Condition for suspension'''<br />
| Resource Centres which have an availability of less than '''70%''' for three consecutive months will be suspended, i.e. removed from the production infrastructure. The 70% threshold applies from PY2 (May 2011 reports). Note: this suspension policy was reviewed in April 2011, and the original 50% threshold was increased to 70%.<br />
|-<br />
|'''Condition for justification'''<br />
|Resource Centres not providing minimum monthly performance (70% availability, 75% reliability) MUST provide justification through a GGUS ticket.<br />
|}<br />
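The thresholds in the table above can be expressed as a small check. This is only an illustrative sketch, not part of ACE or any EGI tool: the names <code>MonthlyFigures</code>, <code>needs_justification</code> and <code>eligible_for_suspension</code> are hypothetical, and the monthly history is assumed to be ordered from oldest to newest.

```python
from dataclasses import dataclass

# Thresholds taken from the table above (percentages).
MIN_AVAILABILITY = 70.0
MIN_RELIABILITY = 75.0
SUSPENSION_MONTHS = 3  # consecutive months below the availability threshold


@dataclass
class MonthlyFigures:
    """Hypothetical container for one month of site statistics."""
    availability: float
    reliability: float


def needs_justification(month: MonthlyFigures) -> bool:
    """True if the site missed either monthly minimum."""
    return (month.availability < MIN_AVAILABILITY
            or month.reliability < MIN_RELIABILITY)


def eligible_for_suspension(history: list[MonthlyFigures]) -> bool:
    """True if availability was below the minimum in each of the
    last three months (history is ordered oldest to newest)."""
    recent = history[-SUSPENSION_MONTHS:]
    return (len(recent) == SUSPENSION_MONTHS
            and all(m.availability < MIN_AVAILABILITY for m in recent))
```

For example, a site with availability 65%, 68% and 69% over the last three months would be eligible for suspension, while a site with 65%, 72% and 69% would only need to justify the two months below target.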
<br />
<br />
= Performance reports=<br />
* [https://wiki.egi.eu/wiki/Availability_and_reliability_reports_metrics Overview] of availability and reliability statistics including suspended sites<br />
* [https://wiki.egi.eu/wiki/List_of_sites_for_which_the_availability_followup_procedures_were_not_applicable List of sites] for which availability followup procedures were not applicable<br />
<!--* [https://twiki.cern.ch/twiki/bin/view/EGEE/SuspendedSites List of suspended sites (2009)]--><br />
== 2011 ==<br />
[https://documents.egi.eu/document/648 Jun]<br />
[https://documents.egi.eu/document/593 May]<br />
[https://documents.egi.eu/document/508 Apr]<br />
[https://documents.egi.eu/document/465 Mar]<br />
[https://documents.egi.eu/document/402 Feb]<br />
[https://documents.egi.eu/document/332 Jan]<br />
<br />
== 2010 ==<br />
*[https://documents.egi.eu/document/299 Dec] | [https://documents.egi.eu/document/266 Nov] | [https://documents.egi.eu/document/238 Oct] | [https://documents.egi.eu/document/219 Sep] | [https://documents.egi.eu/document/157 Aug] | [https://documents.egi.eu/document/130 Jul] | [https://documents.egi.eu/document/96 Jun]|[https://documents.egi.eu/document/42 May]<br />
*[https://edms.cern.ch/document/963325 January 2008 - April 2010] (EGEE league tables)<br />
<br />
== EGI-wide Availability and Reliability ==<br />
It is available [https://documents.egi.eu/public/ShowDocument?docid=415 here] (xls file, data from May 01 2010)<br />
<br />
= Availability statistics per service/Resource Centre =<br />
[https://grid-monitoring.cern.ch/myegi/sa/# MyEGI]<br />
<br />
=Report generator=<br />
*[http://gvdev.cern.ch/GVPC/Excel/ '''(DEPRECATED)''' GridView availability/reliability report generator] (providing access to the database including Nagios results for OPS and SAM results for VOs)<br />
*[http://gridview015.cern.ch/GVPC/Excel/ACE/ ACE report generator]<br />
*[https://gvdev.cern.ch/ACEVAL/ace_index.php ACE visualization portal]<br />
<br />
=Process for quality verification=<br />
<br />
* '''Generation of statistics'''<br />
Availability and reliability statistics are automatically generated the first week of the month by the [https://wiki.egi.eu/wiki/External_tools#Availability_Computation_Engine Availability Computation Engine] (Gridview until May 2011) using the profile in pdf format and placed under [http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/]. An Excel version is available at [http://gvdev.cern.ch/GVPC/Excel/ACE/]<br />
<br />
* '''Preliminary processing'''<br />
Once the reports are generated, sanity checks are performed by EGI SA1 (Task TSA1.8). After this step is completed, statistics are uploaded into the EGI document server. Links to monthly statistics will be provided on a regular basis at this wiki page.<br />
<br />
* '''Publication'''<br />
An announcement of the new results is distributed by EGI SA1 (TSA1.8) to the NGI Operations Managers mailing list. COD (TSA1.7) is responsible for supervising the statistics: it asks NGIs to follow up with sites that need to provide comments when thresholds are not met, and it identifies sites eligible for suspension. This phase starts by filing a ticket to the COD Support Unit. The overall comment-gathering process is handled through tickets.<br />
<br />
* '''Handling of sites below targets'''<br />
For a site that misses availability/reliability targets but is not eligible for suspension: <br />
<br />
# a child ticket is opened by the COD team and assigned to the respective NGI, asking for explanation to be given <br />
# the explanation must be produced within 10 working days from the time the ticket is received by the site (please see known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs]). Reminders and escalation are performed in accordance with COD escalation procedures [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure].<br />
# if the explanation is found satisfactory the ticket is closed <br />
# conversely, if the explanation is not given in due time, or the explanation is found inadequate, the COD escalation procedure will be followed [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], with the site being suspended if neither the site nor the NGI replies to the ticket <br />
# the child ticket can then be closed <br />
# the parent ticket will be closed when all child tickets have been closed.<br />
<br />
* '''Handling of sites that are eligible for suspension'''<br />
For a site that is eligible for suspension: <br />
# a child ticket is opened by the COD team and assigned to the appropriate NGI, notifying that the site will be suspended within 10 working days (please see known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs])<br />
# after the 10-day period has passed, during which normal COD escalation procedures apply [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], the site is suspended by COD unless the NGI has intervened or the EGI Chief Operations Officer (COO) objects. <br />
# in the case of NGI intervention, the site is not suspended if both the COD and the COO agree with the reasoning provided by the NGI <br />
# the child ticket closes either when the site is suspended or when suspension is canceled <br />
# the parent ticket will be closed when all child tickets have been closed<br />
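
Both procedures above give a deadline of 10 working days. A minimal sketch of how such a deadline could be computed is shown below; it assumes that working days are Monday to Friday and ignores public holidays, and the function name <code>add_working_days</code> is hypothetical rather than part of GGUS or any COD tool.

```python
from datetime import date, timedelta


def add_working_days(start: date, working_days: int) -> date:
    """Return the date reached after counting `working_days`
    Monday-to-Friday days following `start` (holidays ignored)."""
    current = start
    remaining = working_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


# A ticket received on Tuesday 19 July 2011 would be due on 2 August 2011.
deadline = add_working_days(date(2011, 7, 19), 10)
```

An NGI calendar with local holidays would shift the result, so the value should be treated as an estimate of the deadline only.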
<br />
* '''Wiki follow up page'''<br />
Sites that fail to provide an explanation for missing the OLA targets, or whose explanation is found inadequate, as well as sites that are suspended, are recorded on a wiki page [https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD_metrics]<br />
<br />
* '''Recomputation procedure'''<br />
Should there be doubts about the validity of Availability/Reliability reports, an RC/NGI can request recomputations according to the procedure defined at [https://tomtools.cern.ch/confluence/display/SAM/Availability+Re-computation+Policy]<br />
<br />
=Known issues and recommendations to NGIs=<br />
# ACE, like Gridview in the past, always calculates the availability and reliability of a site as soon as it shows up in GOCDB and in the BDIIs, regardless of its certification status. When processing the data to generate the availability/reliability report, ACE uses the certification status of the site at that moment to decide whether the site is certified, and therefore shows up in the report, or uncertified, and therefore has to be excluded. '''Thus newly certified sites will get inaccurate Availability/Reliability figures for the month they were certified and all months before that.''' Because the certification status history is not currently available in the operations tools, until a solution is implemented NGIs should check whether they have sites affected by this issue and report it as the explanation. More information at [https://gus.fzk.de/ws/ticket_info.php?ticket=60594] and [https://gus.fzk.de/ws/ticket_info.php?ticket=60925]. '''As of December 2010, Gridview included a snapshot feature so that availability takes into account the topology on the last day of the month. While it does not solve the problem completely, it reduces its impact. However, ACE reports (used since May 2011) do not include the snapshot feature yet.''' <br />
# The calculations performed by ACE always '''take into account the information system status and GOCDB information at the time the calculation is performed, and not that of a certain checkpoint in the past'''. The implication is that any complete recalculation risks altering the results for sites that had correct numbers in the first place. Thus, until a solution is found, '''complete recalculations are avoided whenever possible''', and errors are fixed on a per-site basis for those sites whose figures are lower than they should be.<br />
# Weighted availability is calculated by multiplying the number of logical CPUs a site published with the published HEPSPEC value. It is important that these numbers are correct: if the HEPSPEC value for a site is too high or too low (for example because of a mistake), the overall NGI weighted availability will be affected.<br />
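
The effect described in the last item can be illustrated with a short sketch. It assumes that the per-site weight is the number of logical CPUs multiplied by the HEPSPEC value, as stated above, and that the NGI figure is the weight-averaged site availability; the averaging step and the function names are assumptions made for illustration, not the documented ACE algorithm.

```python
def site_weight(logical_cpus: int, hepspec: float) -> float:
    """Weight of a site: published logical CPUs times published HEPSPEC."""
    return logical_cpus * hepspec


def ngi_weighted_availability(sites) -> float:
    """Weighted mean of site availabilities (assumed aggregation).

    `sites` is an iterable of (availability_percent, logical_cpus, hepspec).
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for availability, logical_cpus, hepspec in sites:
        weight = site_weight(logical_cpus, hepspec)
        total_weight += weight
        weighted_sum += weight * availability
    if total_weight == 0:
        raise ValueError("no site publishes a non-zero weight")
    return weighted_sum / total_weight


# Two sites of equal size: the NGI figure is the plain average, 70.0.
correct = ngi_weighted_availability([(90.0, 100, 10.0), (50.0, 100, 10.0)])

# The second site publishes a HEPSPEC value ten times too high by mistake:
# its low availability now dominates the NGI figure (about 53.6).
skewed = ngi_weighted_availability([(90.0, 100, 10.0), (50.0, 100, 100.0)])
```

Under these assumptions a single wrong HEPSPEC value moves the NGI figure by more than 16 percentage points, which is why the published values should be checked.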
<br />
=Resources=<br />
* Definition of Availability and Reliability and related computation algorithm ([https://twiki.cern.ch/twiki/pub/LCG/GridView/Gridview_Service_Availability_Computation.pdf paper])<br />
* NEW! [https://wiki.egi.eu/wiki/Availability_and_reliability_tests List of Nagios tests] used for availability computation<br />
* [https://twiki.cern.ch/twiki/bin/view/LCG/ACE Availability Computation Engine] (ACE)<br />
* [https://documents.egi.eu/document/31 Operational Level Agreement between NGI and site]<br />
* [https://wiki.egi.eu/wiki/OLA_release_notes OLA release notes]<br />
<!--* OLD: [https://twiki.cern.ch/twiki/bin/view/EGEE/MonthlyAvailability EGEE-III Comments on site availability and reliability statistics] --><br />
*[https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD COD procedure for oversight of availability and reliability performance]<br />
* Impact of change of suspension policy for under-performing sites: [https://wiki.egi.eu/wiki/Availability_and_reliability_threshold_change_impact report]</div>Dzilahttps://wiki.egi.eu/w/index.php?title=Resource_Centres_OLA_and_Resource_infrastructure_Provider_OLA_reports&diff=21878Resource Centres OLA and Resource infrastructure Provider OLA reports2011-07-19T15:26:11Z<p>Dzila: /* Known issues and recommendations to NGIs */</p>
<hr />
<div>{{Template:Op menubar}}<br />
[[Category:Procedures]]<br />
{{TOC_right}}<br />
<br />
It is mandatory that EGI-certified Resource Centres provide a minimum monthly availability and reliability as specified below (see the [https://documents.egi.eu/document/31 site-NGI Operational Level Agreement] for details). Availability and reliability statistics (based on the global OPS VO) are issued on a monthly basis.<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
| '''minimum availability'''<br />
| 70%<br />
|-<br />
| '''minimum reliability'''<br />
| 75%<br />
|-<br />
|'''Condition for suspension'''<br />
| Resource Centres which have an availability of less than '''70%''' for three consecutive months will be suspended, i.e. removed from the production infrastructure. Note: this suspension policy was reviewed in April 2011, and the original 50% threshold was increased to 70% starting from PY2 (May 2011 reports).<br />
|-<br />
|'''Condition for justification'''<br />
|Resource Centres not providing minimum monthly performance (70% availability, 75% reliability) MUST provide justification through a GGUS ticket.<br />
|}<br />
<br />
<br />
= Performance reports=<br />
* [https://wiki.egi.eu/wiki/Availability_and_reliability_reports_metrics Overview] of availability and reliability statistics including suspended sites<br />
* [https://wiki.egi.eu/wiki/List_of_sites_for_which_the_availability_followup_procedures_were_not_applicable List of sites] for which availability followup procedures were not applicable<br />
<!--* [https://twiki.cern.ch/twiki/bin/view/EGEE/SuspendedSites List of suspended sites (2009)]--><br />
== 2011 ==<br />
[https://documents.egi.eu/document/648 Jun]<br />
[https://documents.egi.eu/document/593 May]<br />
[https://documents.egi.eu/document/508 Apr]<br />
[https://documents.egi.eu/document/465 Mar]<br />
[https://documents.egi.eu/document/402 Feb]<br />
[https://documents.egi.eu/document/332 Jan]<br />
<br />
== 2010 ==<br />
*[https://documents.egi.eu/document/299 Dec] | [https://documents.egi.eu/document/266 Nov] | [https://documents.egi.eu/document/238 Oct] | [https://documents.egi.eu/document/219 Sep] | [https://documents.egi.eu/document/157 Aug] | [https://documents.egi.eu/document/130 Jul] | [https://documents.egi.eu/document/96 Jun]|[https://documents.egi.eu/document/42 May]<br />
*[https://edms.cern.ch/document/963325 January 2008 - April 2010] (EGEE league tables)<br />
<br />
== EGI-wide Availability and Reliability ==<br />
It is available [https://documents.egi.eu/public/ShowDocument?docid=415 here] (xls file, data from May 01 2010)<br />
<br />
= Availability statistics per service/Resource Centre =<br />
[https://grid-monitoring.cern.ch/myegi/sa/# MyEGI]<br />
<br />
=Report generator=<br />
*[http://gvdev.cern.ch/GVPC/Excel/ '''(DEPRECATED)''' GridView availability/reliability report generator] (providing access to the database including Nagios results for OPS and SAM results for VOs)<br />
*[http://gridview015.cern.ch/GVPC/Excel/ACE/ ACE report generator]<br />
*[https://gvdev.cern.ch/ACEVAL/ace_index.php ACE visualization portal]<br />
<br />
=Process for quality verification=<br />
<br />
* '''Generation of statistics'''<br />
Availability and reliability statistics are automatically generated the first week of the month by the [https://wiki.egi.eu/wiki/External_tools#Availability_Computation_Engine Availability Computation Engine] (Gridview until May 2011) using the profile in pdf format and placed under [http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/]. An Excel version is available at [http://gvdev.cern.ch/GVPC/Excel/ACE/]<br />
<br />
* '''Preliminary processing'''<br />
Once the reports are generated, sanity checks are performed by EGI SA1 (Task TSA1.8). After this step is completed, statistics are uploaded into the EGI document server. Links to monthly statistics will be provided on a regular basis at this wiki page.<br />
<br />
* '''Publication'''<br />
An announcement of the new results is distributed by EGI SA1 (TSA1.8) to the NGI Operations Managers mailing list. COD (TSA1.7) is responsible for supervising the statistics: it asks NGIs to follow up with sites that need to provide comments when thresholds are not met, and it identifies sites eligible for suspension. This phase starts by filing a ticket to the COD Support Unit. The overall comment-gathering process is handled through tickets.<br />
<br />
* '''Handling of sites below targets'''<br />
For a site that misses availability/reliability targets but is not eligible for suspension: <br />
<br />
# a child ticket is opened by the COD team and assigned to the respective NGI, asking for explanation to be given <br />
# the explanation must be produced within 10 working days from the time the ticket is received by the site (please see known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs]). Reminders and escalation are performed in accordance with COD escalation procedures [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure].<br />
# if the explanation is found satisfactory the ticket is closed <br />
# conversely, if the explanation is not given in due time, or the explanation is found inadequate, the COD escalation procedure will be followed [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], with the site being suspended if neither the site nor the NGI replies to the ticket <br />
# the child ticket can then be closed <br />
# the parent ticket will be closed when all child tickets have been closed.<br />
<br />
* '''Handling of sites that are eligible for suspension'''<br />
For a site that is eligible for suspension: <br />
# a child ticket is opened by the COD team and assigned to the appropriate NGI, notifying that the site will be suspended within 10 working days (please see known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs])<br />
# after the 10-day period has passed, during which normal COD escalation procedures apply [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], the site is suspended by COD unless the NGI has intervened or the EGI Chief Operations Officer (COO) objects. <br />
# in the case of NGI intervention, the site is not suspended if both the COD and the COO agree with the reasoning provided by the NGI <br />
# the child ticket closes either when the site is suspended or when suspension is canceled <br />
# the parent ticket will be closed when all child tickets have been closed<br />
<br />
* '''Wiki follow up page'''<br />
Sites that fail to provide an explanation for missing the OLA targets, or whose explanation is found inadequate, as well as sites that are suspended, are recorded on a wiki page [https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD_metrics]<br />
<br />
=Known issues and recommendations to NGIs=<br />
# ACE, like Gridview in the past, always calculates the availability and reliability of a site as soon as it shows up in GOCDB and in the BDIIs, regardless of its certification status. When processing the data to generate the availability/reliability report, ACE uses the certification status of the site at that moment to decide whether the site is certified, and therefore shows up in the report, or uncertified, and therefore has to be excluded. '''Thus newly certified sites will get inaccurate Availability/Reliability figures for the month they were certified and all months before that.''' Because the certification status history is not currently available in the operations tools, until a solution is implemented NGIs should check whether they have sites affected by this issue and report it as the explanation. More information at [https://gus.fzk.de/ws/ticket_info.php?ticket=60594] and [https://gus.fzk.de/ws/ticket_info.php?ticket=60925]. '''As of December 2010, Gridview included a snapshot feature so that availability takes into account the topology on the last day of the month. While it does not solve the problem completely, it reduces its impact. However, ACE reports (used since May 2011) do not include the snapshot feature yet.''' <br />
# The calculations performed by ACE always '''take into account the information system status and GOCDB information at the time the calculation is performed, and not that of a certain checkpoint in the past'''. The implication is that any complete recalculation risks altering the results for sites that had correct numbers in the first place. Thus, until a solution is found, '''complete recalculations are avoided whenever possible''', and errors are fixed on a per-site basis for those sites whose figures are lower than they should be.<br />
# Weighted availability is calculated by multiplying the number of logical CPUs a site published with the published HEPSPEC value. It is important that these numbers are correct: if the HEPSPEC value for a site is too high or too low (for example because of a mistake), the overall NGI weighted availability will be affected.<br />
<br />
=Resources=<br />
* Definition of Availability and Reliability and related computation algorithm ([https://twiki.cern.ch/twiki/pub/LCG/GridView/Gridview_Service_Availability_Computation.pdf paper])<br />
* NEW! [https://wiki.egi.eu/wiki/Availability_and_reliability_tests List of Nagios tests] used for availability computation<br />
* [https://twiki.cern.ch/twiki/bin/view/LCG/ACE Availability Computation Engine] (ACE)<br />
* [https://documents.egi.eu/document/31 Operational Level Agreement between NGI and site]<br />
* [https://wiki.egi.eu/wiki/OLA_release_notes OLA release notes]<br />
<!--* OLD: [https://twiki.cern.ch/twiki/bin/view/EGEE/MonthlyAvailability EGEE-III Comments on site availability and reliability statistics] --><br />
*[https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD COD procedure for oversight of availability and reliability performance]<br />
* Impact of change of suspension policy for under-performing sites: [https://wiki.egi.eu/wiki/Availability_and_reliability_threshold_change_impact report]</div>Dzilahttps://wiki.egi.eu/w/index.php?title=Resource_Centres_OLA_and_Resource_infrastructure_Provider_OLA_reports&diff=21835Resource Centres OLA and Resource infrastructure Provider OLA reports2011-07-19T09:55:37Z<p>Dzila: /* Process for quality verification */</p>
<hr />
<div>{{Template:Op menubar}}<br />
[[Category:Procedures]]<br />
{{TOC_right}}<br />
<br />
It is mandatory that EGI-certified Resource Centres provide a minimum monthly availability and reliability as specified below (see the [https://documents.egi.eu/document/31 site-NGI Operational Level Agreement] for details). Availability and reliability statistics (based on the global OPS VO) are issued on a monthly basis.<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
| '''minimum availability'''<br />
| 70%<br />
|-<br />
| '''minimum reliability'''<br />
| 75%<br />
|-<br />
|'''Condition for suspension'''<br />
| Resource Centres which have an availability of less than '''70%''' for three consecutive months will be suspended, i.e. removed from the production infrastructure. Note: this suspension policy was reviewed in April 2011, and the original 50% threshold was increased to 70% starting from PY2 (May 2011 reports).<br />
|-<br />
|'''Condition for justification'''<br />
|Resource Centres not providing minimum monthly performance (70% availability, 75% reliability) MUST provide justification through a GGUS ticket.<br />
|}<br />
<br />
<br />
= Performance reports=<br />
* [https://wiki.egi.eu/wiki/Availability_and_reliability_reports_metrics Overview] of availability and reliability statistics including suspended sites<br />
* [https://wiki.egi.eu/wiki/List_of_sites_for_which_the_availability_followup_procedures_were_not_applicable List of sites] for which availability followup procedures were not applicable<br />
<!--* [https://twiki.cern.ch/twiki/bin/view/EGEE/SuspendedSites List of suspended sites (2009)]--><br />
== 2011 ==<br />
[https://documents.egi.eu/document/648 Jun]<br />
[https://documents.egi.eu/document/593 May]<br />
[https://documents.egi.eu/document/508 Apr]<br />
[https://documents.egi.eu/document/465 Mar]<br />
[https://documents.egi.eu/document/402 Feb]<br />
[https://documents.egi.eu/document/332 Jan]<br />
<br />
== 2010 ==<br />
*[https://documents.egi.eu/document/299 Dec] | [https://documents.egi.eu/document/266 Nov] | [https://documents.egi.eu/document/238 Oct] | [https://documents.egi.eu/document/219 Sep] | [https://documents.egi.eu/document/157 Aug] | [https://documents.egi.eu/document/130 Jul] | [https://documents.egi.eu/document/96 Jun]|[https://documents.egi.eu/document/42 May]<br />
*[https://edms.cern.ch/document/963325 January 2008 - April 2010] (EGEE league tables)<br />
<br />
== EGI-wide Availability and Reliability ==<br />
It is available [https://documents.egi.eu/public/ShowDocument?docid=415 here] (xls file, data from May 01 2010)<br />
<br />
= Availability statistics per service/Resource Centre =<br />
[https://grid-monitoring.cern.ch/myegi/sa/# MyEGI]<br />
<br />
=Report generator=<br />
*[http://gvdev.cern.ch/GVPC/Excel/ '''(DEPRECATED)''' GridView availability/reliability report generator] (providing access to the database including Nagios results for OPS and SAM results for VOs)<br />
*[http://gridview015.cern.ch/GVPC/Excel/ACE/ ACE report generator]<br />
*[https://gvdev.cern.ch/ACEVAL/ace_index.php ACE visualization portal]<br />
<br />
=Process for quality verification=<br />
<br />
* '''Generation of statistics'''<br />
Availability and reliability statistics are automatically generated the first week of the month by the [https://wiki.egi.eu/wiki/External_tools#Availability_Computation_Engine Availability Computation Engine] (Gridview until May 2011) using the profile in pdf format and placed under [http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/]. An Excel version is available at [http://gvdev.cern.ch/GVPC/Excel/ACE/]<br />
<br />
* '''Preliminary processing'''<br />
Once the reports are generated, sanity checks are performed by EGI SA1 (Task TSA1.8). After this step is completed, statistics are uploaded into the EGI document server. Links to monthly statistics will be provided on a regular basis at this wiki page.<br />
<br />
* '''Publication'''<br />
An announcement of the new results is distributed by EGI SA1 (TSA1.8) to the NGI Operations Managers mailing list. COD (TSA1.7) is responsible for supervising the statistics: it asks NGIs to follow up with sites that need to provide comments when thresholds are not met, and it identifies sites eligible for suspension. This phase starts by filing a ticket to the COD Support Unit. The overall comment-gathering process is handled through tickets.<br />
<br />
* '''Handling of sites below targets'''<br />
For a site that misses availability/reliability targets but is not eligible for suspension: <br />
<br />
# a child ticket is opened by the COD team and assigned to the respective NGI, asking for explanation to be given <br />
# the explanation must be produced within 10 working days from the time the ticket is received by the site (please see known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs]). Reminders and escalation are performed in accordance with COD escalation procedures [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure].<br />
# if the explanation is found satisfactory the ticket is closed <br />
# conversely, if the explanation is not given in due time, or the explanation is found inadequate, the COD escalation procedure will be followed [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], with the site being suspended if neither the site nor the NGI replies to the ticket <br />
# the child ticket can then be closed <br />
# the parent ticket will be closed when all child tickets have been closed.<br />
<br />
* '''Handling of sites that are eligible for suspension'''<br />
For a site that is eligible for suspension: <br />
# a child ticket is opened by the COD team and assigned to the appropriate NGI, notifying that the site will be suspended within 10 working days (please see known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs])<br />
# after the 10-day period has passed, during which normal COD escalation procedures apply [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], the site is suspended by COD unless the NGI has intervened or the EGI Chief Operations Officer (COO) objects. <br />
# in the case of NGI intervention, the site is not suspended if both the COD and the COO agree with the reasoning provided by the NGI <br />
# the child ticket closes either when the site is suspended or when suspension is canceled <br />
# the parent ticket will be closed when all child tickets have been closed<br />
<br />
* '''Wiki follow up page'''<br />
Sites that fail to provide an explanation for missing the OLA targets, or whose explanation is found inadequate, as well as sites that are suspended, are recorded on a wiki page [https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD_metrics]<br />
<br />
=Known issues and recommendations to NGIs=<br />
# Gridview always calculates the availability and reliability of a site as soon as it shows up in GOCDB and in the BDIIs, regardless of its certification status. When processing the data to generate the availability/reliability report, Gridview uses the certification status of the site at that moment to decide whether the site is certified, and therefore shows up in the report, or uncertified, and therefore has to be excluded. '''Thus newly certified sites will get inaccurate Availability/Reliability figures for the month they were certified and all months before that.''' Because the certification status history is not currently available in the operations tools, until a solution is implemented NGIs should check whether they have sites affected by this issue and report it as the explanation. More information at [https://gus.fzk.de/ws/ticket_info.php?ticket=60594] and [https://gus.fzk.de/ws/ticket_info.php?ticket=60925]. '''As of December 2010, Gridview has included a snapshot feature so that availability takes into account the topology on the last day of the month. While it does not solve the problem completely, it reduces its impact. Currently (January 2011) it is used for the PDF report generator but not for the Excel one. However, ACE reports (used since May 2011) do not include the snapshot feature yet.''' <br />
# The calculations performed by Gridview always '''take into account the information system status and GOCDB information at the time the calculation is performed, and not that of a certain checkpoint in the past'''. The implication is that any complete recalculation risks altering the results for sites that had correct numbers in the first place. Thus, until a solution is found, '''complete recalculations are avoided whenever possible''', and errors are fixed on a per-site basis for those sites whose figures are lower than they should be.<br />
# Weighted availability is calculated by multiplying the number of logical CPUs a site published with the published HEPSPEC value. It is important that these numbers are correct: if the HEPSPEC value for a site is too high or too low (for example because of a mistake), the overall NGI weighted availability will be affected.<br />
<br />
=Resources=<br />
* Definition of Availability and Reliability and related computation algorithm ([https://twiki.cern.ch/twiki/pub/LCG/GridView/Gridview_Service_Availability_Computation.pdf paper])<br />
* NEW! [https://wiki.egi.eu/wiki/Availability_and_reliability_tests List of Nagios tests] used for availability computation<br />
* [https://twiki.cern.ch/twiki/bin/view/LCG/ACE Availability Computation Engine] (ACE)<br />
* [https://documents.egi.eu/document/31 Operational Level Agreement between NGI and site]<br />
* [https://wiki.egi.eu/wiki/OLA_release_notes OLA release notes]<br />
<!--* OLD: [https://twiki.cern.ch/twiki/bin/view/EGEE/MonthlyAvailability EGEE-III Comments on site availability and reliability statistics] --><br />
*[https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD COD procedure for oversight of availability and reliability performance]<br />
* Impact of change of suspension policy for under-performing sites: [https://wiki.egi.eu/wiki/Availability_and_reliability_threshold_change_impact report]</div>Dzilahttps://wiki.egi.eu/w/index.php?title=Resource_Centres_OLA_and_Resource_infrastructure_Provider_OLA_reports&diff=21834Resource Centres OLA and Resource infrastructure Provider OLA reports2011-07-19T09:34:32Z<p>Dzila: /* Known issues and recommendations to NGIs */</p>
<hr />
<div>{{Template:Op menubar}}<br />
[[Category:Procedures]]<br />
{{TOC_right}}<br />
<br />
It is mandatory that EGI-certified Resource Centres provide a minimum monthly availability and reliability as specified below (see the [https://documents.egi.eu/document/31 site-NGI Operational Level Agreement] for details). Availability and reliability statistics (based on the global OPS VO) are issued on a monthly basis.<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
| '''minimum availability'''<br />
| 70%<br />
|-<br />
| '''minimum reliability'''<br />
| 75%<br />
|-<br />
|'''Condition for suspension'''<br />
| Resource Centres which have an availability of less than '''70%''' for three consecutive months will be suspended, i.e. removed from the production infrastructure. Note: this suspension policy was reviewed in April 2011, and the original 50% threshold was increased to 70% starting from PY2 (May 2011 reports).<br />
|-<br />
|'''Condition for justification'''<br />
|Resource Centres not providing minimum monthly performance (70% availability, 75% reliability) MUST provide justification through a GGUS ticket.<br />
|}<br />
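
The conditions in the table above can be sketched as a small check. This is only an illustrative sketch, not the logic used by the reporting tools: the function names are hypothetical, and it assumes that "three consecutive months" refers to the three most recent monthly availability figures, given as percentages.

```python
# Thresholds taken from the table above (percentages).
MIN_AVAILABILITY = 70.0
MIN_RELIABILITY = 75.0
SUSPENSION_MONTHS = 3  # consecutive months below the availability threshold


def needs_justification(availability, reliability):
    """Return True if a site missed either monthly target (hypothetical helper)."""
    return availability < MIN_AVAILABILITY or reliability < MIN_RELIABILITY


def eligible_for_suspension(monthly_availability):
    """Return True if the last three monthly availability values are all below 70%.

    Assumption: the list is ordered from oldest to most recent month.
    """
    recent = monthly_availability[-SUSPENSION_MONTHS:]
    return len(recent) == SUSPENSION_MONTHS and all(
        value < MIN_AVAILABILITY for value in recent
    )
```

A site with monthly availability of 90%, 60%, 65% and 50% would be flagged by `eligible_for_suspension`, while a site with 60%, 65% and 80% would only have needed to justify the first two months.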
<br />
<br />
= Performance reports=<br />
* [https://wiki.egi.eu/wiki/Availability_and_reliability_reports_metrics Overview] of availability and reliability statistics including suspended sites<br />
* [https://wiki.egi.eu/wiki/List_of_sites_for_which_the_availability_followup_procedures_were_not_applicable List of sites] for which availability followup procedures were not applicable<br />
<!--* [https://twiki.cern.ch/twiki/bin/view/EGEE/SuspendedSites List of suspended sites (2009)]--><br />
== 2011 ==<br />
[https://documents.egi.eu/document/648 Jun]<br />
[https://documents.egi.eu/document/593 May]<br />
[https://documents.egi.eu/document/508 Apr]<br />
[https://documents.egi.eu/document/465 Mar]<br />
[https://documents.egi.eu/document/402 Feb]<br />
[https://documents.egi.eu/document/332 Jan]<br />
<br />
== 2010 ==<br />
*[https://documents.egi.eu/document/299 Dec] | [https://documents.egi.eu/document/266 Nov] | [https://documents.egi.eu/document/238 Oct] | [https://documents.egi.eu/document/219 Sep] | [https://documents.egi.eu/document/157 Aug] | [https://documents.egi.eu/document/130 Jul] | [https://documents.egi.eu/document/96 Jun]|[https://documents.egi.eu/document/42 May]<br />
*[https://edms.cern.ch/document/963325 January 2008 - April 2010] (EGEE league tables)<br />
<br />
== EGI-wide Availability and Reliability ==<br />
It is available [https://documents.egi.eu/public/ShowDocument?docid=415 here] (xls file, data from May 01 2010)<br />
<br />
= Availability statistics per service/Resource Centre =<br />
[https://grid-monitoring.cern.ch/myegi/sa/# MyEGI]<br />
<br />
=Report generator=<br />
*[http://gvdev.cern.ch/GVPC/Excel/ '''(DEPRECATED)''' GridView availability/reliability report generator] (providing access to the database including Nagios results for OPS and SAM results for VOs)<br />
*[http://gridview015.cern.ch/GVPC/Excel/ACE/ ACE report generator]<br />
*[https://gvdev.cern.ch/ACEVAL/ace_index.php ACE visualization portal]<br />
<br />
=Process for quality verification=<br />
<br />
* '''Generation of statistics'''<br />
Availability and reliability statistics are automatically generated the first week of the month by the [https://wiki.egi.eu/wiki/External_tools#Availability_Computation_Engine Availability Computation Engine] (Gridview until May 2011) in pdf format and placed under [http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/]. An Excel version is available at [http://gvdev.cern.ch/GVPC/Excel/]<br />
<br />
* '''Preliminary processing'''<br />
Once the reports are generated, sanity checks are performed by EGI SA1 (Task TSA1.8). After this step is completed, statistics are uploaded into the EGI document server. Links to monthly statistics will be provided on a regular basis at this wiki page.<br />
<br />
* '''Publication'''<br />
An announcement of the new results is distributed by EGI SA1 (TSA1.8) to the NGI Operations Managers mailing list. COD (TSA1.7) is responsible for supervising the statistics: it asks NGIs to follow up with sites that need to provide comments when thresholds are not met, and it identifies sites eligible for suspension. This phase starts by filing a ticket to the COD Support Unit. The overall comment-gathering process is handled through tickets.<br />
<br />
* '''Handling of sites below targets'''<br />
For a site that misses availability/reliability targets but is not eligible for suspension: <br />
<br />
# a child ticket is opened by the COD team and assigned to the respective NGI, asking for an explanation <br />
# the explanation must be provided within 10 working days of the site receiving the ticket (please see the known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs]). Reminders and escalation are performed in accordance with the COD escalation procedure [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure].<br />
# if the explanation is found satisfactory, the ticket is closed <br />
# conversely, if the explanation is not given in due time, or is found inadequate, the COD escalation procedure will be followed [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], with the site being suspended if neither the site nor the NGI replies to the ticket <br />
# the child ticket can then be closed <br />
# the parent ticket will be closed when all child tickets have been closed.<br />
<br />
* '''Handling of sites that are eligible for suspension'''<br />
For a site that is eligible for suspension: <br />
# a child ticket is opened by the COD team and assigned to the appropriate NGI, notifying it that the site will be suspended within 10 working days (please see the known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs])<br />
# after the 10-day period has passed, during which normal COD escalation procedures apply [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], the site is suspended by COD unless the NGI has intervened or the EGI Chief Operations Officer (COO) objects. <br />
# in the case of NGI intervention, the site is not suspended if both COD and the COO agree with the reasoning provided by the NGI <br />
# the child ticket closes either when the site is suspended or when suspension is canceled <br />
# the parent ticket will be closed when all child tickets have been closed<br />
<br />
* '''Wiki follow up page'''<br />
Sites that fail to explain why OLA targets were not met, or whose explanation is found inadequate, as well as sites that are suspended, are recorded on a wiki page [https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD_metrics]<br />
<br />
=Known issues and recommendations to NGIs=<br />
# GridView always calculates the availability and reliability of a site as soon as it shows up in GOCDB and in the BDIIs, regardless of its certification status. While processing the data to generate the availability/reliability report, GridView uses the certification status of the site at that moment to decide whether the site is certified, and therefore shown in the report, or uncertified, and therefore excluded. '''Thus newly certified sites will get inaccurate Availability/Reliability figures for the month they were certified and all months before that.''' Because the certification status history is not currently available in the operations tools, until a solution is implemented NGIs should check whether they have sites affected by this issue and report it as the explanation. More information at [https://gus.fzk.de/ws/ticket_info.php?ticket=60594] and [https://gus.fzk.de/ws/ticket_info.php?ticket=60925]. '''As of December 2010, GridView includes a snapshot feature so that availability takes into account the topology on the last day of the month. While it does not solve the problem completely, it reduces its impact. Currently (January 2011) it is used for the PDF report generator but not for the Excel one. However, ACE reports (used since May 2011) do not include the snapshot feature yet.''' <br />
# The calculations performed by GridView always '''take into account the information system status and GOCDB information at the time the calculation is performed, and not that of a certain checkpoint in the past'''. This means that any complete recalculation risks altering the results for sites that had correct numbers in the first place. Thus, until a solution is found, '''complete recalculations are avoided whenever possible''', and errors are fixed on a per-site basis for those sites that have lower numbers than they should.<br />
# Weighted availability is calculated by multiplying the number of logical CPUs a site publishes by the published HEPSPEC value. It is important that these numbers are correct: if the HEPSPEC value for a site is too high or too low (for example because of a mistake), the overall NGI weighted availability will be affected.<br />
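
To illustrate why a wrong HEPSPEC value matters, the weighting described in the last item can be sketched as follows. This is a minimal sketch under stated assumptions, not the GridView or ACE implementation: the function names are hypothetical, and it assumes that the NGI figure is the weight-averaged availability of its sites, where each site's weight is its number of logical CPUs multiplied by its published HEPSPEC value.

```python
def site_weight(logical_cpus, hepspec):
    """Weight of a site: published logical CPUs times published HEPSPEC value."""
    return logical_cpus * hepspec


def ngi_weighted_availability(sites):
    """Weighted mean availability over (logical_cpus, hepspec, availability) tuples.

    Assumption: the NGI figure is sum(weight * availability) / sum(weight).
    Returns 0.0 when no site publishes any capacity.
    """
    total_weight = sum(site_weight(cpus, hs) for cpus, hs, _ in sites)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(site_weight(cpus, hs) * avail for cpus, hs, avail in sites)
    return weighted_sum / total_weight


# Two equally sized sites with 90% and 50% availability.
correct = [(100, 10.0, 90.0), (100, 10.0, 50.0)]
# Same sites, but the second one publishes a HEPSPEC value ten times too high.
mistaken = [(100, 10.0, 90.0), (100, 100.0, 50.0)]
```

With the correct values the two sites contribute equally; with the mistaken HEPSPEC value the poorly performing site dominates the result and pulls the NGI figure down.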
<br />
=Resources=<br />
* Definition of Availability and Reliability and related computation algorithm ([https://twiki.cern.ch/twiki/pub/LCG/GridView/Gridview_Service_Availability_Computation.pdf paper])<br />
* NEW! [https://wiki.egi.eu/wiki/Availability_and_reliability_tests List of Nagios tests] used for availability computation<br />
* [https://twiki.cern.ch/twiki/bin/view/LCG/ACE Availability Computation Engine] (ACE)<br />
* [https://documents.egi.eu/document/31 Operational Level Agreement between NGI and site]<br />
* [https://wiki.egi.eu/wiki/OLA_release_notes OLA release notes]<br />
<!--* OLD: [https://twiki.cern.ch/twiki/bin/view/EGEE/MonthlyAvailability EGEE-III Comments on site availability and reliability statistics] --><br />
*[https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD COD procedure for oversight of availability and reliability performance]<br />
* Impact of change of suspension policy for under-performing sites: [https://wiki.egi.eu/wiki/Availability_and_reliability_threshold_change_impact report]</div>Dzilahttps://wiki.egi.eu/w/index.php?title=Resource_Centres_OLA_and_Resource_infrastructure_Provider_OLA_reports&diff=21833Resource Centres OLA and Resource infrastructure Provider OLA reports2011-07-19T09:34:11Z<p>Dzila: /* Known issues and recommendations to NGIs */</p>
<hr />
<div>{{Template:Op menubar}}<br />
[[Category:Procedures]]<br />
{{TOC_right}}<br />
<br />
It is mandatory that EGI-certified Resource Centres provide a minimum monthly availability and reliability as specified below (see the [https://documents.egi.eu/document/31 site-NGI Operational Level Agreement] for details). Availability and reliability statistics (based on the global OPS VO) are issued on a monthly basis.<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
| '''minimum availability'''<br />
| 70%<br />
|-<br />
| '''minimum reliability'''<br />
| 75%<br />
|-<br />
|'''Condition for suspension'''<br />
| Resource Centres which have an availability of less than '''70%''' for three consecutive months will be suspended, i.e. removed from the production infrastructure. Note: this suspension policy was reviewed in April 2011, and the original 50% threshold was increased to 70% starting from PY2 (May 2011 reports).<br />
|-<br />
|'''Condition for justification'''<br />
|Resource Centres not providing minimum monthly performance (70% availability, 75% reliability) MUST provide justification through a GGUS ticket.<br />
|}<br />
<br />
<br />
= Performance reports=<br />
* [https://wiki.egi.eu/wiki/Availability_and_reliability_reports_metrics Overview] of availability and reliability statistics including suspended sites<br />
* [https://wiki.egi.eu/wiki/List_of_sites_for_which_the_availability_followup_procedures_were_not_applicable List of sites] for which availability followup procedures were not applicable<br />
<!--* [https://twiki.cern.ch/twiki/bin/view/EGEE/SuspendedSites List of suspended sites (2009)]--><br />
== 2011 ==<br />
[https://documents.egi.eu/document/648 Jun]<br />
[https://documents.egi.eu/document/593 May]<br />
[https://documents.egi.eu/document/508 Apr]<br />
[https://documents.egi.eu/document/465 Mar]<br />
[https://documents.egi.eu/document/402 Feb]<br />
[https://documents.egi.eu/document/332 Jan]<br />
<br />
== 2010 ==<br />
*[https://documents.egi.eu/document/299 Dec] | [https://documents.egi.eu/document/266 Nov] | [https://documents.egi.eu/document/238 Oct] | [https://documents.egi.eu/document/219 Sep] | [https://documents.egi.eu/document/157 Aug] | [https://documents.egi.eu/document/130 Jul] | [https://documents.egi.eu/document/96 Jun]|[https://documents.egi.eu/document/42 May]<br />
*[https://edms.cern.ch/document/963325 January 2008 - April 2010] (EGEE league tables)<br />
<br />
== EGI-wide Availability and Reliability ==<br />
It is available [https://documents.egi.eu/public/ShowDocument?docid=415 here] (xls file, data from May 01 2010)<br />
<br />
= Availability statistics per service/Resource Centre =<br />
[https://grid-monitoring.cern.ch/myegi/sa/# MyEGI]<br />
<br />
=Report generator=<br />
*[http://gvdev.cern.ch/GVPC/Excel/ '''(DEPRECATED)''' GridView availability/reliability report generator] (providing access to the database including Nagios results for OPS and SAM results for VOs)<br />
*[http://gridview015.cern.ch/GVPC/Excel/ACE/ ACE report generator]<br />
*[https://gvdev.cern.ch/ACEVAL/ace_index.php ACE visualization portal]<br />
<br />
=Process for quality verification=<br />
<br />
* '''Generation of statistics'''<br />
Availability and reliability statistics are automatically generated the first week of the month by the [https://wiki.egi.eu/wiki/External_tools#Availability_Computation_Engine Availability Computation Engine] (Gridview until May 2011) in pdf format and placed under [http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/]. An Excel version is available at [http://gvdev.cern.ch/GVPC/Excel/]<br />
<br />
* '''Preliminary processing'''<br />
Once the reports are generated, sanity checks are performed by EGI SA1 (Task TSA1.8). After this step is completed, statistics are uploaded into the EGI document server. Links to monthly statistics will be provided on a regular basis at this wiki page.<br />
<br />
* '''Publication'''<br />
An announcement of the new results is distributed by EGI SA1 (TSA1.8) to the NGI Operations Managers mailing list. COD (TSA1.7) is responsible for supervising the statistics: it asks NGIs to follow up with sites that need to provide comments when thresholds are not met, and it identifies sites eligible for suspension. This phase starts by filing a ticket to the COD Support Unit. The overall comment-gathering process is handled through tickets.<br />
<br />
* '''Handling of sites below targets'''<br />
For a site that misses availability/reliability targets but is not eligible for suspension: <br />
<br />
# a child ticket is opened by the COD team and assigned to the respective NGI, asking for an explanation <br />
# the explanation must be provided within 10 working days of the site receiving the ticket (please see the known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs]). Reminders and escalation are performed in accordance with the COD escalation procedure [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure].<br />
# if the explanation is found satisfactory, the ticket is closed <br />
# conversely, if the explanation is not given in due time, or is found inadequate, the COD escalation procedure will be followed [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], with the site being suspended if neither the site nor the NGI replies to the ticket <br />
# the child ticket can then be closed <br />
# the parent ticket will be closed when all child tickets have been closed.<br />
<br />
* '''Handling of sites that are eligible for suspension'''<br />
For a site that is eligible for suspension: <br />
# a child ticket is opened by the COD team and assigned to the appropriate NGI, notifying it that the site will be suspended within 10 working days (please see the known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs])<br />
# after the 10-day period has passed, during which normal COD escalation procedures apply [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], the site is suspended by COD unless the NGI has intervened or the EGI Chief Operations Officer (COO) objects. <br />
# in the case of NGI intervention, the site is not suspended if both COD and the COO agree with the reasoning provided by the NGI <br />
# the child ticket closes either when the site is suspended or when suspension is canceled <br />
# the parent ticket will be closed when all child tickets have been closed<br />
<br />
* '''Wiki follow up page'''<br />
Sites that fail to explain why OLA targets were not met, or whose explanation is found inadequate, as well as sites that are suspended, are recorded on a wiki page [https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD_metrics]<br />
<br />
=Known issues and recommendations to NGIs=<br />
# GridView always calculates the availability and reliability of a site as soon as it shows up in GOCDB and in the BDIIs, regardless of its certification status. While processing the data to generate the availability/reliability report, GridView uses the certification status of the site at that moment to decide whether the site is certified, and therefore shown in the report, or uncertified, and therefore excluded. '''Thus newly certified sites will get inaccurate Availability/Reliability figures for the month they were certified and all months before that.''' Because the certification status history is not currently available in the operations tools, until a solution is implemented NGIs should check whether they have sites affected by this issue and report it as the explanation. More information at [https://gus.fzk.de/ws/ticket_info.php?ticket=60594] and [https://gus.fzk.de/ws/ticket_info.php?ticket=60925]. '''As of December 2010, GridView includes a snapshot feature so that availability takes into account the topology on the last day of the month. While it does not solve the problem completely, it reduces its impact. Currently (January 2011) it is used for the PDF report generator but not for the Excel one. However, ACE reports (used since May 2011) do not include the snapshot feature yet.''' <br />
# The calculations performed by GridView always '''take into account the information system status and GOCDB information at the time the calculation is performed, and not that of a certain checkpoint in the past'''. This means that any complete recalculation risks altering the results for sites that had correct numbers in the first place. Thus, until a solution is found, '''complete recalculations are avoided whenever possible''', and errors are fixed on a per-site basis for those sites that have lower numbers than they should.<br />
# Weighted availability is calculated by multiplying the number of logical CPUs a site publishes by the published HEPSPEC value. It is important that these numbers are correct: if the HEPSPEC value for a site is too high or too low (for example because of a mistake), the overall NGI weighted availability will be affected.<br />
<br />
=Resources=<br />
* Definition of Availability and Reliability and related computation algorithm ([https://twiki.cern.ch/twiki/pub/LCG/GridView/Gridview_Service_Availability_Computation.pdf paper])<br />
* NEW! [https://wiki.egi.eu/wiki/Availability_and_reliability_tests List of Nagios tests] used for availability computation<br />
* [https://twiki.cern.ch/twiki/bin/view/LCG/ACE Availability Computation Engine] (ACE)<br />
* [https://documents.egi.eu/document/31 Operational Level Agreement between NGI and site]<br />
* [https://wiki.egi.eu/wiki/OLA_release_notes OLA release notes]<br />
<!--* OLD: [https://twiki.cern.ch/twiki/bin/view/EGEE/MonthlyAvailability EGEE-III Comments on site availability and reliability statistics] --><br />
*[https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD COD procedure for oversight of availability and reliability performance]<br />
* Impact of change of suspension policy for under-performing sites: [https://wiki.egi.eu/wiki/Availability_and_reliability_threshold_change_impact report]</div>Dzilahttps://wiki.egi.eu/w/index.php?title=Resource_Centres_OLA_and_Resource_infrastructure_Provider_OLA_reports&diff=21832Resource Centres OLA and Resource infrastructure Provider OLA reports2011-07-19T09:32:17Z<p>Dzila: /* Known issues and recommendations to NGIs */</p>
<hr />
<div>{{Template:Op menubar}}<br />
[[Category:Procedures]]<br />
{{TOC_right}}<br />
<br />
It is mandatory that EGI-certified Resource Centres provide a minimum monthly availability and reliability as specified below (see the [https://documents.egi.eu/document/31 site-NGI Operational Level Agreement] for details). Availability and reliability statistics (based on the global OPS VO) are issued on a monthly basis.<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
| '''minimum availability'''<br />
| 70%<br />
|-<br />
| '''minimum reliability'''<br />
| 75%<br />
|-<br />
|'''Condition for suspension'''<br />
| Resource Centres which have an availability of less than '''70%''' for three consecutive months will be suspended, i.e. removed from the production infrastructure. Note: this suspension policy was reviewed in April 2011, and the original 50% threshold was increased to 70% starting from PY2 (May 2011 reports).<br />
|-<br />
|'''Condition for justification'''<br />
|Resource Centres not providing minimum monthly performance (70% availability, 75% reliability) MUST provide justification through a GGUS ticket.<br />
|}<br />
<br />
<br />
= Performance reports=<br />
* [https://wiki.egi.eu/wiki/Availability_and_reliability_reports_metrics Overview] of availability and reliability statistics including suspended sites<br />
* [https://wiki.egi.eu/wiki/List_of_sites_for_which_the_availability_followup_procedures_were_not_applicable List of sites] for which availability followup procedures were not applicable<br />
<!--* [https://twiki.cern.ch/twiki/bin/view/EGEE/SuspendedSites List of suspended sites (2009)]--><br />
== 2011 ==<br />
[https://documents.egi.eu/document/648 Jun]<br />
[https://documents.egi.eu/document/593 May]<br />
[https://documents.egi.eu/document/508 Apr]<br />
[https://documents.egi.eu/document/465 Mar]<br />
[https://documents.egi.eu/document/402 Feb]<br />
[https://documents.egi.eu/document/332 Jan]<br />
<br />
== 2010 ==<br />
*[https://documents.egi.eu/document/299 Dec] | [https://documents.egi.eu/document/266 Nov] | [https://documents.egi.eu/document/238 Oct] | [https://documents.egi.eu/document/219 Sep] | [https://documents.egi.eu/document/157 Aug] | [https://documents.egi.eu/document/130 Jul] | [https://documents.egi.eu/document/96 Jun]|[https://documents.egi.eu/document/42 May]<br />
*[https://edms.cern.ch/document/963325 January 2008 - April 2010] (EGEE league tables)<br />
<br />
== EGI-wide Availability and Reliability ==<br />
It is available [https://documents.egi.eu/public/ShowDocument?docid=415 here] (xls file, data from May 01 2010)<br />
<br />
= Availability statistics per service/Resource Centre =<br />
[https://grid-monitoring.cern.ch/myegi/sa/# MyEGI]<br />
<br />
=Report generator=<br />
*[http://gvdev.cern.ch/GVPC/Excel/ '''(DEPRECATED)''' GridView availability/reliability report generator] (providing access to the database including Nagios results for OPS and SAM results for VOs)<br />
*[http://gridview015.cern.ch/GVPC/Excel/ACE/ ACE report generator]<br />
*[https://gvdev.cern.ch/ACEVAL/ace_index.php ACE visualization portal]<br />
<br />
=Process for quality verification=<br />
<br />
* '''Generation of statistics'''<br />
Availability and reliability statistics are automatically generated the first week of the month by the [https://wiki.egi.eu/wiki/External_tools#Availability_Computation_Engine Availability Computation Engine] (Gridview until May 2011) in pdf format and placed under [http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/]. An Excel version is available at [http://gvdev.cern.ch/GVPC/Excel/]<br />
<br />
* '''Preliminary processing'''<br />
Once the reports are generated, sanity checks are performed by EGI SA1 (Task TSA1.8). After this step is completed, statistics are uploaded into the EGI document server. Links to monthly statistics will be provided on a regular basis at this wiki page.<br />
<br />
* '''Publication'''<br />
An announcement of the new results is distributed by EGI SA1 (TSA1.8) to the NGI Operations Managers mailing list. COD (TSA1.7) is responsible for supervising the statistics: it asks NGIs to follow up with sites that need to provide comments when thresholds are not met, and it identifies sites eligible for suspension. This phase starts by filing a ticket to the COD Support Unit. The overall comment-gathering process is handled through tickets.<br />
<br />
* '''Handling of sites below targets'''<br />
For a site that misses availability/reliability targets but is not eligible for suspension: <br />
<br />
# a child ticket is opened by the COD team and assigned to the respective NGI, asking for an explanation <br />
# the explanation must be provided within 10 working days of the site receiving the ticket (please see the known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs]). Reminders and escalation are performed in accordance with the COD escalation procedure [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure].<br />
# if the explanation is found satisfactory, the ticket is closed <br />
# conversely, if the explanation is not given in due time, or is found inadequate, the COD escalation procedure will be followed [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], with the site being suspended if neither the site nor the NGI replies to the ticket <br />
# the child ticket can then be closed <br />
# the parent ticket will be closed when all child tickets have been closed.<br />
<br />
* '''Handling of sites that are eligible for suspension'''<br />
For a site that is eligible for suspension: <br />
# a child ticket is opened by the COD team and assigned to the appropriate NGI, notifying it that the site will be suspended within 10 working days (please see the known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs])<br />
# after the 10-day period has passed, during which normal COD escalation procedures apply [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], the site is suspended by COD unless the NGI has intervened or the EGI Chief Operations Officer (COO) objects. <br />
# in the case of NGI intervention, the site is not suspended if both COD and the COO agree with the reasoning provided by the NGI <br />
# the child ticket closes either when the site is suspended or when suspension is canceled <br />
# the parent ticket will be closed when all child tickets have been closed<br />
<br />
* '''Wiki follow up page'''<br />
Sites that fail to explain why OLA targets were not met, or whose explanation is found inadequate, as well as sites that are suspended, are recorded on a wiki page [https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD_metrics]<br />
<br />
=Known issues and recommendations to NGIs=<br />
# GridView always calculates the availability and reliability of a site as soon as it shows up in GOCDB and in the BDIIs, regardless of its certification status. While processing the data to generate the availability/reliability report, GridView uses the certification status of the site at that moment to decide whether the site is certified, and therefore shown in the report, or uncertified, and therefore excluded. '''Thus newly certified sites will get inaccurate Availability/Reliability figures for the month they were certified and all months before that.''' Because the certification status history is not currently available in the operations tools, until a solution is implemented NGIs should check whether they have sites affected by this issue and report it as the explanation. More information at [https://gus.fzk.de/ws/ticket_info.php?ticket=60594] and [https://gus.fzk.de/ws/ticket_info.php?ticket=60925]. '''As of December 2010, GridView includes a snapshot feature so that availability takes into account the topology on the last day of the month. While it does not solve the problem completely, it reduces its impact. Currently (January 2011) it is used for the PDF report generator but not for the Excel one.''' <br />
# The calculations performed by GridView always '''take into account the information system status and GOCDB information at the time the calculation is performed, and not that of a certain checkpoint in the past'''. This means that any complete recalculation risks altering the results for sites that had correct numbers in the first place. Thus, until a solution is found, '''complete recalculations are avoided whenever possible''', and errors are fixed on a per-site basis for those sites that have lower numbers than they should.<br />
# Weighted availability is calculated by multiplying the number of logical CPUs a site publishes by the published HEPSPEC value. It is important that these numbers are correct: if the HEPSPEC value for a site is too high or too low (for example because of a mistake), the overall NGI weighted availability will be affected.<br />
<br />
=Resources=<br />
* Definition of Availability and Reliability and related computation algorithm ([https://twiki.cern.ch/twiki/pub/LCG/GridView/Gridview_Service_Availability_Computation.pdf paper])<br />
* NEW! [https://wiki.egi.eu/wiki/Availability_and_reliability_tests List of Nagios tests] used for availability computation<br />
* [https://twiki.cern.ch/twiki/bin/view/LCG/ACE Availability Computation Engine] (ACE)<br />
* [https://documents.egi.eu/document/31 Operational Level Agreement between NGI and site]<br />
* [https://wiki.egi.eu/wiki/OLA_release_notes OLA release notes]<br />
<!--* OLD: [https://twiki.cern.ch/twiki/bin/view/EGEE/MonthlyAvailability EGEE-III Comments on site availability and reliability statistics] --><br />
*[https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD COD procedure for oversight of availability and reliability performance]<br />
* Impact of change of suspension policy for under-performing sites: [https://wiki.egi.eu/wiki/Availability_and_reliability_threshold_change_impact report]</div>Dzilahttps://wiki.egi.eu/w/index.php?title=Resource_Centres_OLA_and_Resource_infrastructure_Provider_OLA_reports&diff=21560Resource Centres OLA and Resource infrastructure Provider OLA reports2011-07-08T11:29:10Z<p>Dzila: /* Report generator */</p>
<hr />
<div>{{Template:Op menubar}}<br />
[[Category:Procedures]]<br />
{{TOC_right}}<br />
<br />
It is mandatory that EGI-certified Resource Centres provide a minimum monthly availability and reliability as specified below (see the [https://documents.egi.eu/document/31 site-NGI Operational Level Agreement] for details). Availability and reliability statistics (based on the global OPS VO) are issued on a monthly basis.<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
| '''minimum availability'''<br />
| 70%<br />
|-<br />
| '''minimum reliability'''<br />
| 75%<br />
|-<br />
|'''Condition for suspension'''<br />
| Resource Centres which have an availability of less than '''70%''' for three consecutive months will be suspended, i.e. removed from the production infrastructure. Note: this suspension policy was reviewed in April 2011, and the original 50% threshold was increased to 70% starting from PY2 (May 2011 reports).<br />
|-<br />
|'''Condition for justification'''<br />
|Resource Centres not providing minimum monthly performance (70% availability, 75% reliability) MUST provide justification through a GGUS ticket.<br />
|}<br />
<br />
<br />
= Performance reports=<br />
* [https://wiki.egi.eu/wiki/Availability_and_reliability_reports_metrics Overview] of availability and reliability statistics including suspended sites<br />
* [https://wiki.egi.eu/wiki/List_of_sites_for_which_the_availability_followup_procedures_were_not_applicable List of sites] for which availability followup procedures were not applicable<br />
<!--* [https://twiki.cern.ch/twiki/bin/view/EGEE/SuspendedSites List of suspended sites (2009)]--><br />
== 2011 ==<br />
[https://documents.egi.eu/document/648 Jun]<br />
[https://documents.egi.eu/document/593 May]<br />
[https://documents.egi.eu/document/508 Apr]<br />
[https://documents.egi.eu/document/465 Mar]<br />
[https://documents.egi.eu/document/402 Feb]<br />
[https://documents.egi.eu/document/332 Jan]<br />
<br />
== 2010 ==<br />
*[https://documents.egi.eu/document/299 Dec] | [https://documents.egi.eu/document/266 Nov] | [https://documents.egi.eu/document/238 Oct] | [https://documents.egi.eu/document/219 Sep] | [https://documents.egi.eu/document/157 Aug] | [https://documents.egi.eu/document/130 Jul] | [https://documents.egi.eu/document/96 Jun]|[https://documents.egi.eu/document/42 May]<br />
*[https://edms.cern.ch/document/963325 January 2008 - April 2010] (EGEE league tables)<br />
<br />
== EGI-wide Availability and Reliability ==<br />
It is available [https://documents.egi.eu/public/ShowDocument?docid=415 here] (overall xls file from May 01 2010)<br />
<br />
= Availability statistics per service/Resource Centre =<br />
[https://grid-monitoring.cern.ch/myegi/sa/# MyEGI]<br />
<br />
=Report generator=<br />
*[http://gvdev.cern.ch/GVPC/Excel/ '''(DEPRECATED)''' GridView availability/reliability report generator] (providing access to the database including Nagios results for OPS and SAM results for VOs)<br />
*[http://gridview015.cern.ch/GVPC/Excel/ACE/ ACE report generator]<br />
*[https://gvdev.cern.ch/ACEVAL/ace_index.php ACE visualization portal]<br />
<br />
=Process for quality verification=<br />
<br />
* '''Generation of statistics'''<br />
Availability and reliability statistics are automatically generated the first week of the month by the [https://wiki.egi.eu/wiki/External_tools#Availability_Computation_Engine Availability Computation Engine] (Gridview until May 2011) in pdf format and placed under [http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/]. An Excel version is available at [http://gvdev.cern.ch/GVPC/Excel/]<br />
<br />
* '''Preliminary processing'''<br />
Once the reports are generated, sanity checks are performed by EGI SA1 (Task TSA1.8). After this step is completed, statistics are uploaded into the EGI document server. Links to monthly statistics will be provided on a regular basis at this wiki page.<br />
<br />
* '''Publication'''<br />
An announcement of the new results is distributed by EGI SA1 (TSA1.8) to the NGI Operations Managers mailing list. COD (TSA1.7) is responsible for supervising the statistics: it asks NGIs to follow up with sites that need to provide comments when thresholds are not met, and it identifies sites eligible for suspension. This phase starts by filing a ticket to the COD Support Unit. The overall comment-gathering process is handled through tickets.<br />
<br />
* '''Handling of sites below targets'''<br />
For a site that misses availability/reliability targets but is not eligible for suspension: <br />
<br />
# a child ticket is opened by the COD team and assigned to the respective NGI, asking for an explanation <br />
# the explanation must be provided within 10 working days from the date the ticket is received by the site (please see known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs]). Reminders and escalation are performed in accordance with the COD escalation procedures [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure].<br />
# if the explanation is found satisfactory the ticket is closed <br />
# conversely, if the explanation is not given in due time, or is found inadequate, the COD escalation procedure will be followed [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], with the site being suspended if neither the site nor the NGI replies to the ticket <br />
# the child ticket can then be closed <br />
# the parent ticket will be closed when all child tickets have been closed.<br />
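The 10-working-day deadline in the steps above can be computed as follows. This is only an illustrative sketch: the helper name is hypothetical, weekends are assumed to be Saturday and Sunday, public holidays are ignored, and counting is assumed to start on the day after the ticket is received.

```python
from datetime import date, timedelta


def add_working_days(start: date, working_days: int) -> date:
    """Return the date reached after the given number of working days.

    Assumptions: Monday to Friday are working days, holidays are ignored,
    and the start date itself is not counted.
    """
    current = start
    remaining = working_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0..4 correspond to Monday..Friday
            remaining -= 1
    return current


# A ticket received on Friday 8 July 2011 would be due on Friday 22 July 2011.
deadline = add_working_days(date(2011, 7, 8), 10)
```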
<br />
* '''Handling of sites that are eligible for suspension'''<br />
For a site that is eligible for suspension: <br />
# a child ticket is opened by the COD team and assigned to the appropriate NGI, notifying it that the site will be suspended within 10 working days (please see known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs])<br />
# after the 10-day period, during which normal COD escalation procedures apply [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], the site is suspended by COD unless the NGI has intervened or the EGI Chief Operations Officer objects. <br />
# in the case of NGI intervention, the site is not suspended if both the COD and the COO agree with the reasoning provided by the NGI <br />
# the child ticket closes either when the site is suspended or when suspension is canceled <br />
# the parent ticket will be closed when all child tickets have been closed<br />
<br />
* '''Wiki follow up page'''<br />
Sites that fail to provide an explanation for missing the OLA targets, or whose explanation is found inadequate, as well as sites that are suspended, are recorded in a wiki page [https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD_metrics]<br />
<br />
* ''' Impact of increasing the availability suspension threshold to 70%'''<br />
There is a proposal to increase the suspension threshold from the current 50% to 70% for 3 consecutive months. The impact of this can be found at [https://wiki.egi.eu/wiki/Availability_and_reliability_threshold_change_impact]<br />
<br />
=Known issues and recommendations to NGIs=<br />
# Gridview always calculates the availability and reliability of a site as soon as it shows up in GOCDB and in the BDIIs, regardless of its certification status. While processing the data in order to generate the availability/reliability report, GridView takes into account the Certification status of the site at that moment in order to decide whether the site is certified, and therefore shows up in the report, or is uncertified and has to be excluded. '''Thus newly certified sites will get inaccurate Availability/Reliability figures for the month they were certified and all months before that.''' Because the Certification status history is not currently available in the operations tools, until a solution is implemented NGIs should check if they have sites affected by this issue and report it as explanation. More information at [https://gus.fzk.de/ws/ticket_info.php?ticket=60594] and [https://gus.fzk.de/ws/ticket_info.php?ticket=60925]. '''As of December 2010, Gridview has included a snapshot feature so availability takes into account the topology at the last day of the month. While it does not solve the problem completely, it reduces its impact. Currently (January 2011) it is used for the pdf reports generator but not for the Excel ones.''' <br />
# arcCE tests have been considered critical since mid July 2010, but sites and RODs are not getting notified in any operational tool about their results. This is being investigated. In the meantime, sites/NGIs dealing with availability/reliability tickets caused by arcCE issues are advised to solve such tickets by mentioning in the solution that the problem was due to the arcCE. Some background: [https://gus.fzk.de/ws/ticket_info.php?ticket=61953] and [https://gus.fzk.de/ws/ticket_info.php?ticket=62074]<br />
# creamCE tests are critical but not taken into account for availability/reliability calculations. This was discussed on various occasions, most recently at the OMB of 26 October: [https://www.egi.eu/indico/conferenceDisplay.py?confId=150]. '''As of May 2011 creamCE/ArcCE are taken into account'''.<br />
# The calculations performed by gridview always '''take into account the information system status and gocdb information at the time the calculation is performed, and not that of a certain checkpoint in the past'''. The implication of this is that any complete recalculation has the risk of altering the results for sites that had correct numbers in the first place. Thus until a solution is found, '''complete recalculations are avoided whenever possible''', and errors are fixed on a per-site basis for those sites that have lower numbers than they should.<br />
# Weighted availability is calculated by multiplying the number of logical CPUs a site published with the published HEPSPEC value. It is important that these numbers are correct: if the HEPSPEC value for a site is too high or too low (for example because of a mistake), the overall NGI weighted availability will be affected.<br />
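The effect of a wrong HEPSPEC value described in the last item can be illustrated with a small sketch. The site weight (logical CPUs multiplied by the published HEPSPEC value) follows the text above; combining the sites into an NGI figure as a weight-averaged availability is an assumption made here for illustration, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Site:
    name: str
    logical_cpus: int     # number of logical CPUs published by the site
    hepspec: float        # published HEPSPEC value
    availability: float   # monthly availability in percent


def site_weight(site: Site) -> float:
    """Weight of a site: logical CPUs multiplied by the published HEPSPEC value."""
    return site.logical_cpus * site.hepspec


def ngi_weighted_availability(sites: Sequence[Site]) -> float:
    """Weight-averaged availability over all sites (assumed aggregation)."""
    total_weight = sum(site_weight(s) for s in sites)
    if total_weight == 0:
        raise ValueError("total weight is zero; check published CPU and HEPSPEC values")
    return sum(site_weight(s) * s.availability for s in sites) / total_weight


correct = [Site("A", 100, 10.0, 90.0), Site("B", 300, 10.0, 50.0)]
# Same sites, but site B publishes a HEPSPEC value ten times too high by mistake.
mistaken = [Site("A", 100, 10.0, 90.0), Site("B", 300, 100.0, 50.0)]
```

With the correct values the NGI figure is 60%; with the mistaken HEPSPEC value for site B it drops to roughly 51%, because the poorly performing site dominates the weighting.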
<br />
=Resources=<br />
* Definition of Availability and Reliability and related computation algorithm ([https://twiki.cern.ch/twiki/pub/LCG/GridView/Gridview_Service_Availability_Computation.pdf paper])<br />
* NEW! [https://wiki.egi.eu/wiki/Availability_and_reliability_tests List of Nagios tests] used for availability computation<br />
* [https://twiki.cern.ch/twiki/bin/view/LCG/ACE Availability Computation Engine] (ACE)<br />
* [https://documents.egi.eu/document/31 Operational Level Agreement between NGI and site]<br />
* [https://wiki.egi.eu/wiki/OLA_release_notes OLA release notes]<br />
<!--* OLD: [https://twiki.cern.ch/twiki/bin/view/EGEE/MonthlyAvailability EGEE-III Comments on site availability and reliability statistics] --><br />
*[https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD COD procedure for oversight of availability and reliability performance]</div>Dzilahttps://wiki.egi.eu/w/index.php?title=Resource_Centres_OLA_and_Resource_infrastructure_Provider_OLA_reports&diff=21559Resource Centres OLA and Resource infrastructure Provider OLA reports2011-07-08T11:28:05Z<p>Dzila: /* Report generator */</p>
<hr />
<div>{{Template:Op menubar}}<br />
[[Category:Procedures]]<br />
{{TOC_right}}<br />
<br />
It is mandatory that EGI certified Resource Centres provide a minimum monthly availability and reliability as specified below (see the [https://documents.egi.eu/document/31 site-NGI Operational Level Agreement] for details). Availability and reliability statistics (based on the global OPS VO) are issued on a monthly basis.<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
| '''minimum availability'''<br />
| 70%<br />
|-<br />
| '''minimum reliability'''<br />
| 75%<br />
|-<br />
|'''Condition for suspension'''<br />
| Resource Centres which have an availability of less than '''70%''' for three consecutive months will be suspended, i.e. removed from the production infrastructure. Note: this suspension policy was reviewed in April 2011, and the original 50% threshold was increased to 70% starting from PY2 (May 2011 reports).<br />
|-<br />
|'''Condition for justification'''<br />
|Resource Centres not providing minimum monthly performance (70% availability, 75% reliability) MUST provide justification through a GGUS ticket.<br />
|}<br />
<br />
<br />
= Performance reports=<br />
* [https://wiki.egi.eu/wiki/Availability_and_reliability_reports_metrics Overview] of availability and reliability statistics including suspended sites<br />
* [https://wiki.egi.eu/wiki/List_of_sites_for_which_the_availability_followup_procedures_were_not_applicable List of sites] for which availability followup procedures were not applicable<br />
<!--* [https://twiki.cern.ch/twiki/bin/view/EGEE/SuspendedSites List of suspended sites (2009)]--><br />
== 2011 ==<br />
[https://documents.egi.eu/document/648 Jun]<br />
[https://documents.egi.eu/document/593 May]<br />
[https://documents.egi.eu/document/508 Apr]<br />
[https://documents.egi.eu/document/465 Mar]<br />
[https://documents.egi.eu/document/402 Feb]<br />
[https://documents.egi.eu/document/332 Jan]<br />
<br />
== 2010 ==<br />
*[https://documents.egi.eu/document/299 Dec] | [https://documents.egi.eu/document/266 Nov] | [https://documents.egi.eu/document/238 Oct] | [https://documents.egi.eu/document/219 Sep] | [https://documents.egi.eu/document/157 Aug] | [https://documents.egi.eu/document/130 Jul] | [https://documents.egi.eu/document/96 Jun]|[https://documents.egi.eu/document/42 May]<br />
*[https://edms.cern.ch/document/963325 January 2008 - April 2010] (EGEE league tables)<br />
<br />
== EGI-wide Availability and Reliability ==<br />
It is available [https://documents.egi.eu/public/ShowDocument?docid=415 here] (overall xls file from May 01 2010)<br />
<br />
= Availability statistics per service/Resource Centre =<br />
[https://grid-monitoring.cern.ch/myegi/sa/# MyEGI]<br />
<br />
=Report generator=<br />
*[http://gvdev.cern.ch/GVPC/Excel/ '''(DEPRECATED)''' GridView availability/reliability report generator] (providing access to the database including Nagios results for OPS and SAM results for VOs)<br />
*[http://gridview015.cern.ch/GVPC/Excel/ACE/ ACE report generator]<br />
<br />
=Process for quality verification=<br />
<br />
* '''Generation of statistics'''<br />
Availability and reliability statistics are automatically generated the first week of the month by the [https://wiki.egi.eu/wiki/External_tools#Availability_Computation_Engine Availability Computation Engine] (Gridview until May 2011) in pdf format and placed under [http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/]. An Excel version is available at [http://gvdev.cern.ch/GVPC/Excel/]<br />
<br />
* '''Preliminary processing'''<br />
Once the reports are generated, sanity checks are performed by EGI SA1 (Task TSA1.8). After this step is completed, statistics are uploaded into the EGI document server. Links to monthly statistics will be provided on a regular basis at this wiki page.<br />
<br />
* '''Publication'''<br />
An announcement of the new results is distributed by EGI SA1 (TSA1.8) to the NGI Operations Managers mailing list. COD (TSA1.7) is responsible for supervising the statistics: it asks NGIs to follow up with sites that need to provide comments when thresholds are not met, and it identifies sites eligible for suspension. This phase starts by filing a ticket to the COD Support Unit. The overall comment-gathering process is handled through tickets.<br />
<br />
* '''Handling of sites below targets'''<br />
For a site that misses availability/reliability targets but is not eligible for suspension: <br />
<br />
# a child ticket is opened by the COD team and assigned to the respective NGI, asking for an explanation <br />
# the explanation must be provided within 10 working days from the date the ticket is received by the site (please see known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs]). Reminders and escalation are performed in accordance with the COD escalation procedures [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure].<br />
# if the explanation is found satisfactory the ticket is closed <br />
# conversely, if the explanation is not given in due time, or is found inadequate, the COD escalation procedure will be followed [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], with the site being suspended if neither the site nor the NGI replies to the ticket <br />
# the child ticket can then be closed <br />
# the parent ticket will be closed when all child tickets have been closed.<br />
<br />
* '''Handling of sites that are eligible for suspension'''<br />
For a site that is eligible for suspension: <br />
# a child ticket is opened by the COD team and assigned to the appropriate NGI, notifying it that the site will be suspended within 10 working days (please see known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs])<br />
# after the 10-day period, during which normal COD escalation procedures apply [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], the site is suspended by COD unless the NGI has intervened or the EGI Chief Operations Officer objects. <br />
# in the case of NGI intervention, the site is not suspended if both the COD and the COO agree with the reasoning provided by the NGI <br />
# the child ticket closes either when the site is suspended or when suspension is canceled <br />
# the parent ticket will be closed when all child tickets have been closed<br />
<br />
* '''Wiki follow up page'''<br />
Sites that fail to provide an explanation for missing the OLA targets, or whose explanation is found inadequate, as well as sites that are suspended, are recorded in a wiki page [https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD_metrics]<br />
<br />
* ''' Impact of increasing the availability suspension threshold to 70%'''<br />
There is a proposal to increase the suspension threshold from the current 50% to 70% for 3 consecutive months. The impact of this can be found at [https://wiki.egi.eu/wiki/Availability_and_reliability_threshold_change_impact]<br />
<br />
=Known issues and recommendations to NGIs=<br />
# Gridview always calculates the availability and reliability of a site as soon as it shows up in GOCDB and in the BDIIs, regardless of its certification status. While processing the data in order to generate the availability/reliability report, GridView takes into account the Certification status of the site at that moment in order to decide whether the site is certified, and therefore shows up in the report, or is uncertified and has to be excluded. '''Thus newly certified sites will get inaccurate Availability/Reliability figures for the month they were certified and all months before that.''' Because the Certification status history is not currently available in the operations tools, until a solution is implemented NGIs should check if they have sites affected by this issue and report it as explanation. More information at [https://gus.fzk.de/ws/ticket_info.php?ticket=60594] and [https://gus.fzk.de/ws/ticket_info.php?ticket=60925]. '''As of December 2010, Gridview has included a snapshot feature so availability takes into account the topology at the last day of the month. While it does not solve the problem completely, it reduces its impact. Currently (January 2011) it is used for the pdf reports generator but not for the Excel ones.''' <br />
# arcCE tests have been considered critical since mid July 2010, but sites and RODs are not getting notified in any operational tool about their results. This is being investigated. In the meantime, sites/NGIs dealing with availability/reliability tickets caused by arcCE issues are advised to solve such tickets by mentioning in the solution that the problem was due to the arcCE. Some background: [https://gus.fzk.de/ws/ticket_info.php?ticket=61953] and [https://gus.fzk.de/ws/ticket_info.php?ticket=62074]<br />
# creamCE tests are critical but not taken into account for availability/reliability calculations. This was discussed on various occasions, most recently at the OMB of 26 October: [https://www.egi.eu/indico/conferenceDisplay.py?confId=150]. '''As of May 2011 creamCE/ArcCE are taken into account'''.<br />
# The calculations performed by gridview always '''take into account the information system status and gocdb information at the time the calculation is performed, and not that of a certain checkpoint in the past'''. The implication of this is that any complete recalculation has the risk of altering the results for sites that had correct numbers in the first place. Thus until a solution is found, '''complete recalculations are avoided whenever possible''', and errors are fixed on a per-site basis for those sites that have lower numbers than they should.<br />
# Weighted availability is calculated by multiplying the number of logical CPUs a site published with the published HEPSPEC value. It is important that these numbers are correct: if the HEPSPEC value for a site is too high or too low (for example because of a mistake), the overall NGI weighted availability will be affected.<br />
<br />
=Resources=<br />
* Definition of Availability and Reliability and related computation algorithm ([https://twiki.cern.ch/twiki/pub/LCG/GridView/Gridview_Service_Availability_Computation.pdf paper])<br />
* NEW! [https://wiki.egi.eu/wiki/Availability_and_reliability_tests List of Nagios tests] used for availability computation<br />
* [https://twiki.cern.ch/twiki/bin/view/LCG/ACE Availability Computation Engine] (ACE)<br />
* [https://documents.egi.eu/document/31 Operational Level Agreement between NGI and site]<br />
* [https://wiki.egi.eu/wiki/OLA_release_notes OLA release notes]<br />
<!--* OLD: [https://twiki.cern.ch/twiki/bin/view/EGEE/MonthlyAvailability EGEE-III Comments on site availability and reliability statistics] --><br />
*[https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD COD procedure for oversight of availability and reliability performance]</div>Dzilahttps://wiki.egi.eu/w/index.php?title=Resource_Centres_OLA_and_Resource_infrastructure_Provider_OLA_reports&diff=21228Resource Centres OLA and Resource infrastructure Provider OLA reports2011-07-05T11:12:19Z<p>Dzila: /* 2011 */</p>
<hr />
<div>{{Template:Op menubar}}<br />
[[Category:Procedures]]<br />
{{TOC_right}}<br />
<br />
It is mandatory that EGI certified Resource Centres provide a minimum monthly availability and reliability as specified below (see the [https://documents.egi.eu/document/31 site-NGI Operational Level Agreement] for details). Availability and reliability statistics (based on the global OPS VO) are issued on a monthly basis.<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
| '''minimum availability'''<br />
| 70%<br />
|-<br />
| '''minimum reliability'''<br />
| 75%<br />
|-<br />
|'''Condition for suspension'''<br />
| Resource Centres which have an availability of less than '''70%''' for three consecutive months will be suspended, i.e. removed from the production infrastructure. Note: this suspension policy was reviewed in April 2011, and the original 50% threshold was increased to 70% starting from PY2 (May 2011 reports).<br />
|-<br />
|'''Condition for justification'''<br />
|Resource Centres not providing minimum monthly performance (70% availability, 75% reliability) MUST provide justification through a GGUS ticket.<br />
|}<br />
<br />
<br />
= Performance reports=<br />
* [https://wiki.egi.eu/wiki/Availability_and_reliability_reports_metrics Overview] of availability and reliability statistics including suspended sites<br />
* [https://wiki.egi.eu/wiki/List_of_sites_for_which_the_availability_followup_procedures_were_not_applicable List of sites] for which availability followup procedures were not applicable<br />
<!--* [https://twiki.cern.ch/twiki/bin/view/EGEE/SuspendedSites List of suspended sites (2009)]--><br />
== 2011 ==<br />
[https://documents.egi.eu/document/648 Jun]<br />
[https://documents.egi.eu/document/593 May]<br />
[https://documents.egi.eu/document/508 Apr]<br />
[https://documents.egi.eu/document/465 Mar]<br />
[https://documents.egi.eu/document/402 Feb]<br />
[https://documents.egi.eu/document/332 Jan]<br />
<br />
== 2010 ==<br />
*[https://documents.egi.eu/document/299 Dec] | [https://documents.egi.eu/document/266 Nov] | [https://documents.egi.eu/document/238 Oct] | [https://documents.egi.eu/document/219 Sep] | [https://documents.egi.eu/document/157 Aug] | [https://documents.egi.eu/document/130 Jul] | [https://documents.egi.eu/document/96 Jun]|[https://documents.egi.eu/document/42 May]<br />
*[https://edms.cern.ch/document/963325 January 2008 - April 2010] (EGEE league tables)<br />
<br />
== EGI-wide Availability and Reliability ==<br />
It is available [https://documents.egi.eu/public/ShowDocument?docid=415 here] (overall xls file from May 01 2010)<br />
<br />
= Availability statistics per service/Resource Centre =<br />
[https://grid-monitoring.cern.ch/myegi/sa/# MyEGI]<br />
<br />
=Report generator=<br />
*[http://gvdev.cern.ch/GVPC/Excel/ GridView availability/reliability report generator] (providing access to the database including Nagios results for OPS and SAM results for VOs)<br />
*[http://gridview015.cern.ch/GVPC/Excel/ACE/ ACE report generator]<br />
<br />
=Process for quality verification=<br />
<br />
* '''Generation of statistics'''<br />
Availability and reliability statistics are automatically generated the first week of the month by GridView in pdf format and placed under [http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/]. An Excel version is available at [http://gvdev.cern.ch/GVPC/Excel/]<br />
<br />
* '''Preliminary processing'''<br />
Once the reports are generated, sanity checks are performed by EGI SA1 (Task TSA1.8). After this step is completed, statistics are uploaded into the EGI document server. Links to monthly statistics will be provided on a regular basis at this wiki page.<br />
<br />
* '''Publication'''<br />
An announcement of the new results is distributed by EGI SA1 (TSA1.8) to the NGI Operations Managers mailing list. COD (TSA1.7) is responsible for supervising the statistics: it asks NGIs to follow up with sites that need to provide comments when thresholds are not met, and it identifies sites eligible for suspension. This phase starts by filing a ticket to the COD Support Unit. The overall comment-gathering process is handled through tickets.<br />
<br />
* '''Handling of sites below targets'''<br />
For a site that misses availability/reliability targets but is not eligible for suspension: <br />
<br />
# a child ticket is opened by the COD team and assigned to the respective NGI, asking for an explanation <br />
# the explanation must be provided within 10 working days from the date the ticket is received by the site (please see known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs]). Reminders and escalation are performed in accordance with the COD escalation procedures [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure].<br />
# if the explanation is found satisfactory the ticket is closed <br />
# conversely, if the explanation is not given in due time, or is found inadequate, the COD escalation procedure will be followed [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], with the site being suspended if neither the site nor the NGI replies to the ticket <br />
# the child ticket can then be closed <br />
# the parent ticket will be closed when all child tickets have been closed.<br />
<br />
* '''Handling of sites that are eligible for suspension'''<br />
For a site that is eligible for suspension: <br />
# a child ticket is opened by the COD team and assigned to the appropriate NGI, notifying it that the site will be suspended within 10 working days (please see known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs])<br />
# after the 10-day period, during which normal COD escalation procedures apply [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], the site is suspended by COD unless the NGI has intervened or the EGI Chief Operations Officer objects. <br />
# in the case of NGI intervention, the site is not suspended if both the COD and the COO agree with the reasoning provided by the NGI <br />
# the child ticket closes either when the site is suspended or when suspension is canceled <br />
# the parent ticket will be closed when all child tickets have been closed<br />
<br />
* '''Wiki follow up page'''<br />
Sites that fail to provide an explanation for missing the OLA targets, or whose explanation is found inadequate, as well as sites that are suspended, are recorded in a wiki page [https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD_metrics]<br />
<br />
* ''' Impact of increasing the availability suspension threshold to 70%'''<br />
There is a proposal to increase the suspension threshold from the current 50% to 70% for 3 consecutive months. The impact of this can be found at [https://wiki.egi.eu/wiki/Availability_and_reliability_threshold_change_impact]<br />
<br />
=Known issues and recommendations to NGIs=<br />
# Gridview always calculates the availability and reliability of a site as soon as it shows up in GOCDB and in the BDIIs, regardless of its certification status. While processing the data in order to generate the availability/reliability report, GridView takes into account the Certification status of the site at that moment in order to decide whether the site is certified, and therefore shows up in the report, or is uncertified and has to be excluded. '''Thus newly certified sites will get inaccurate Availability/Reliability figures for the month they were certified and all months before that.''' Because the Certification status history is not currently available in the operations tools, until a solution is implemented NGIs should check if they have sites affected by this issue and report it as explanation. More information at [https://gus.fzk.de/ws/ticket_info.php?ticket=60594] and [https://gus.fzk.de/ws/ticket_info.php?ticket=60925]. '''As of December 2010, Gridview has included a snapshot feature so availability takes into account the topology at the last day of the month. While it does not solve the problem completely, it reduces its impact. Currently (January 2011) it is used for the pdf reports generator but not for the Excel ones.''' <br />
# arcCE tests have been considered critical since mid July 2010, but sites and RODs are not getting notified in any operational tool about their results. This is being investigated. In the meantime, sites/NGIs dealing with availability/reliability tickets caused by arcCE issues are advised to solve such tickets by mentioning in the solution that the problem was due to the arcCE. Some background: [https://gus.fzk.de/ws/ticket_info.php?ticket=61953] and [https://gus.fzk.de/ws/ticket_info.php?ticket=62074]<br />
# creamCE tests are critical but not taken into account for availability/reliability calculations. This was discussed on various occasions, most recently at the OMB of 26 October: [https://www.egi.eu/indico/conferenceDisplay.py?confId=150]. '''As of May 2011 creamCE/ArcCE are taken into account'''.<br />
# The calculations performed by gridview always '''take into account the information system status and gocdb information at the time the calculation is performed, and not that of a certain checkpoint in the past'''. The implication of this is that any complete recalculation has the risk of altering the results for sites that had correct numbers in the first place. Thus until a solution is found, '''complete recalculations are avoided whenever possible''', and errors are fixed on a per-site basis for those sites that have lower numbers than they should.<br />
# Weighted availability is calculated by multiplying the number of logical CPUs a site published with the published HEPSPEC value. It is important that these numbers are correct: if the HEPSPEC value for a site is too high or too low (for example because of a mistake), the overall NGI weighted availability will be affected.<br />
<br />
=Resources=<br />
* Definition of Availability and Reliability and related computation algorithm ([https://twiki.cern.ch/twiki/pub/LCG/GridView/Gridview_Service_Availability_Computation.pdf paper])<br />
* NEW! [https://wiki.egi.eu/wiki/Availability_and_reliability_tests List of Nagios tests] used for availability computation<br />
* [https://twiki.cern.ch/twiki/bin/view/LCG/ACE Availability Computation Engine] (ACE)<br />
* [https://documents.egi.eu/document/31 Operational Level Agreement between NGI and site]<br />
* [https://wiki.egi.eu/wiki/OLA_release_notes OLA release notes]<br />
<!--* OLD: [https://twiki.cern.ch/twiki/bin/view/EGEE/MonthlyAvailability EGEE-III Comments on site availability and reliability statistics] --><br />
*[https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD COD procedure for oversight of availability and reliability performance]</div>Dzilahttps://wiki.egi.eu/w/index.php?title=Resource_Centre_OLA:_Release_Notes&diff=20855Resource Centre OLA: Release Notes2011-06-28T07:26:54Z<p>Dzila: /* Release notes */</p>
<hr />
<div>{{Template:Op menubar}}<br />
{{TOC_right}}<br />
[[Category:OperationalLevelAgreement]]<br />
<br />
= Resource Centre OLA v. 1.0=<br />
<br />
Download the [https://documents.egi.eu/document/31 Resource Centre OLA v1.0]<br />
<br />
==Release notes==<br />
<br />
This is the first version of the EGI Resource Centre OLA adopting the EGI operations terminology defined in the [https://documents.egi.eu/document/218 EGI Operations Architecture] (EGI-InSPIRE deliverable D4.1). <br />
<br />
Specifically, in this version:<br />
* terminology: it was adapted and harmonized. Sites are now referred to as Resource Centres (RCs). The EGEE Regional Operations Centre (ROC) entity is now more generically defined as ''Operations Centre''. The infrastructure operated by the Operations Centre is defined as the Resource Infrastructure, and the legal entity responsible for it is the Resource infrastructure Provider. ''Service Level Availability'' is replaced by ''Operational Level Agreement'' in compliance with ITIL.<br />
* responsibilities: the Resource Centre Operations Manager role was introduced to define the Resource Centre contact responsible for accepting the OLA and for making sure it is endorsed by the Resource Centre. The OLA clarifies that the accuracy of GOCDB information is a joint responsibility of the Resource Centre Operations Manager and of the Resource Provider Operations Manager.<br />
* deployed middleware: any supported grid technology that complies with the UMD specification can be deployed. The minimum set of capabilities to be provided by a Resource Centre is relaxed in compliance with the current availability computation algorithm. <br />
* deployed resources: all constraints concerning the minimum amount of resource capacity to be provided were removed.<br />
* service quality parameters:<br />
** the maximum response time to GGUS tickets is raised from 4 to 8 operating hours<br />
* suspension policy: a Resource Centre that does not provide the minimum requested availability for three consecutive months is eligible for suspension. The minimum availability to be provided was raised from 50% to 70%<br />
* supported VOs: DTEAM and OPS need to be mandatorily supported by the Resource Centre for troubleshooting and monitoring purposes.<br />
* change management: the OLA amendment procedure is now defined.</div>Dzilahttps://wiki.egi.eu/w/index.php?title=Resource_Centres_OLA_and_Resource_infrastructure_Provider_OLA_reports&diff=19955Resource Centres OLA and Resource infrastructure Provider OLA reports2011-06-17T12:23:54Z<p>Dzila: /* Known issues and recommendations to NGIs */</p>
<hr />
<div>{{Template:Op menubar}}<br />
[[Category:Procedures]]<br />
{{TOC_right}}<br />
<br />
It is mandatory that EGI certified Resource Centres provide a minimum monthly availability and reliability as specified below (see the [https://documents.egi.eu/document/31 site-NGI Operational Level Agreement] for details). Availability and reliability statistics (based on the global OPS VO) are issued on a monthly basis.<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
| '''minimum availability'''<br />
| 70%<br />
|-<br />
| '''minimum reliability'''<br />
| 75%<br />
|-<br />
|'''Condition for suspension'''<br />
| Resource Centres which have an availability of less than '''70%''' for three consecutive months will be suspended, i.e. removed from the production infrastructure. The 70% threshold applies from PY2 (May 2011 reports). Note: this suspension policy was reviewed in April 2011, when the original 50% threshold was increased to 70%.<br />
|-<br />
|'''Condition for justification'''<br />
|Resource Centres not providing minimum monthly performance (70% availability, 75% reliability) MUST provide justification through a GGUS ticket.<br />
|}<br />
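<br />
The conditions in the table above can be sketched in a few lines of Python. This is only an illustration: the function and constant names are hypothetical and are not part of GridView, ACE or any other EGI tool, and the input is assumed to be a list of monthly availability percentages ordered oldest first.<br />

```python
MIN_AVAILABILITY = 70.0  # percent, minimum availability from the table above
MIN_RELIABILITY = 75.0   # percent, minimum reliability from the table above
SUSPENSION_MONTHS = 3    # consecutive months below the availability threshold


def needs_justification(availability, reliability):
    """Return True if a monthly result misses either minimum target."""
    return availability < MIN_AVAILABILITY or reliability < MIN_RELIABILITY


def eligible_for_suspension(monthly_availability):
    """Return True if the latest SUSPENSION_MONTHS values are all below target.

    monthly_availability is a list of percentages, oldest month first.
    """
    recent = monthly_availability[-SUSPENSION_MONTHS:]
    return (len(recent) == SUSPENSION_MONTHS
            and all(a < MIN_AVAILABILITY for a in recent))
```

For example, a Resource Centre with monthly availabilities of 90, 65, 60 and 69 percent would be flagged by <code>eligible_for_suspension</code>, while one with 65, 60 and 71 percent would not.<br />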
<br />
= Performance=<br />
* [https://wiki.egi.eu/wiki/Availability_and_reliability_reports_metrics Overview] of availability and reliability statistics including suspended sites<br />
* [https://wiki.egi.eu/wiki/List_of_sites_for_which_the_availability_followup_procedures_were_not_applicable List of sites] for which availability followup procedures were not applicable<br />
<!--* [https://twiki.cern.ch/twiki/bin/view/EGEE/SuspendedSites List of suspended sites (2009)]--><br />
== 2011 ==<br />
[https://documents.egi.eu/document/593 May]<br />
[https://documents.egi.eu/document/508 Apr]<br />
[https://documents.egi.eu/document/465 Mar]<br />
[https://documents.egi.eu/document/402 Feb]<br />
[https://documents.egi.eu/document/332 Jan]<br />
<br />
== 2010 ==<br />
*[https://documents.egi.eu/document/299 Dec] | [https://documents.egi.eu/document/266 Nov] | [https://documents.egi.eu/document/238 Oct] | [https://documents.egi.eu/document/219 Sep] | [https://documents.egi.eu/document/157 Aug] | [https://documents.egi.eu/document/130 Jul] | [https://documents.egi.eu/document/96 Jun]|[https://documents.egi.eu/document/42 May]<br />
*[https://edms.cern.ch/document/963325 January 2008 - April 2010] (EGEE league tables)<br />
<br />
== EGI-wide Availability and Reliability ==<br />
It is available [https://documents.egi.eu/public/ShowDocument?docid=415 here] (overall xls file from May 01 2010)<br />
<br />
=Report generator=<br />
*[http://gvdev.cern.ch/GVPC/Excel/ GridView availability/reliability report generator] (providing access to the database including Nagios results for OPS and SAM results for VOs)<br />
*[http://gridview015.cern.ch/GVPC/Excel/ACE/ ACE report generator]<br />
<br />
=Process for quality verification=<br />
<br />
* '''Generation of statistics'''<br />
Availability and reliability statistics are automatically generated the first week of the month by GridView in pdf format and placed under [http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/]. An Excel version is available at [http://gvdev.cern.ch/GVPC/Excel/]<br />
<br />
* '''Preliminary processing'''<br />
Once the reports are generated, sanity checks are performed by EGI SA1 (Task TSA1.8). After this step is completed, statistics are uploaded into the EGI document server. Links to monthly statistics will be provided on a regular basis at this wiki page.<br />
<br />
* '''Publication'''<br />
An announcement of the new results is distributed by EGI SA1 (TSA1.8) to the NGI Operations Managers mailing list. COD (TSA1.7) is responsible for supervising the statistics: it asks NGIs to follow up with sites that need to provide comments when thresholds are not met, and it identifies sites eligible for suspension. This phase starts by filing a ticket to the COD Support Unit. The overall comment-gathering process is handled through tickets.<br />
<br />
* '''Handling of sites below targets'''<br />
For a site that misses availability/reliability targets but is not eligible for suspension: <br />
<br />
# a child ticket is opened by the COD team and assigned to the respective NGI, asking for explanation to be given <br />
# the explanation must be produced within 10 working days of the ticket being received by the site (please see known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs]). Reminders and escalation are performed in accordance with COD escalation procedures [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure].<br />
# if the explanation is found satisfactory the ticket is closed <br />
# conversely, if the explanation is not given in due time, or is found inadequate, the COD escalation procedure will be followed [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], with the site being suspended if neither the site nor the NGI replies to the ticket <br />
# the child ticket can then be closed <br />
# the parent ticket will be closed when all child tickets have been closed.<br />
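<br />
The 10 working day deadline in the steps above can be estimated with a small helper. This is a minimal sketch that assumes working days are Monday to Friday and ignores public holidays; the function name is hypothetical and the procedure itself does not define how working days are counted.<br />

```python
from datetime import date, timedelta


def add_working_days(start, days):
    """Return the date that is `days` working days (Mon-Fri) after `start`."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        # weekday() is 0 for Monday through 6 for Sunday
        if current.weekday() < 5:
            remaining -= 1
    return current
```

Under these assumptions, a ticket received on Friday 17 June 2011 would be due on <code>add_working_days(date(2011, 6, 17), 10)</code>, i.e. Friday 1 July 2011.<br />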
<br />
* '''Handling of sites that are eligible for suspension'''<br />
For a site that is eligible for suspension: <br />
# a child ticket is opened by the COD team and assigned to the appropriate NGI, notifying that the site will be suspended within 10 working days (please see known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs])<br />
# after the 10-day period passes, during which normal COD escalation procedures apply [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], the site is suspended by COD unless the NGI has intervened or the EGI Chief Operations Officer objects. <br />
# in the case of NGI intervention, the site is not suspended if both COD and the COO agree with the reasoning provided by the NGI <br />
# the child ticket closes either when the site is suspended or when suspension is canceled <br />
# the parent ticket will be closed when all child tickets have been closed<br />
<br />
* '''Wiki follow up page'''<br />
Sites that fail to provide an explanation for missing OLA targets, or whose explanation is found inadequate, as well as sites that are suspended, will be recorded in a wiki page [https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD_metrics]<br />
<br />
* ''' Impact of increasing the availability suspension threshold to 70%'''<br />
There is a proposal to increase the suspension threshold to 70% for 3 consecutive months, from the current 50%. The impact of this can be found at [https://wiki.egi.eu/wiki/Availability_and_reliability_threshold_change_impact]<br />
<br />
=Known issues and recommendations to NGIs=<br />
# GridView always calculates the availability and reliability of a site as soon as it shows up in GOCDB and in the BDIIs, regardless of its certification status. While processing the data to generate the availability/reliability report, GridView uses the certification status of the site at that moment to decide whether the site is certified, and therefore shown in the report, or uncertified, and therefore excluded. '''Thus newly certified sites will get inaccurate Availability/Reliability figures for the month they were certified and all months before that.''' Because the certification status history is not currently available in the operations tools, until a solution is implemented NGIs should check whether they have sites affected by this issue and report it as the explanation. More information at [https://gus.fzk.de/ws/ticket_info.php?ticket=60594] and [https://gus.fzk.de/ws/ticket_info.php?ticket=60925]. '''As of December 2010, GridView has included a snapshot feature so that availability takes into account the topology on the last day of the month. While it does not solve the problem completely, it reduces its impact. Currently (January 2011) it is used for the PDF report generator but not for the Excel one.''' <br />
# arcCE tests have been considered critical since mid July 2010, but sites and RODs are not getting notified in any operational tool about their results. This is being investigated; in the meantime, sites/NGIs dealing with availability/reliability tickets caused by arcCE issues are advised to solve such tickets, mentioning in the solution that the cause was the arcCE. Some background: [https://gus.fzk.de/ws/ticket_info.php?ticket=61953] and [https://gus.fzk.de/ws/ticket_info.php?ticket=62074]<br />
# creamCE tests are critical but not taken into account for availability/reliability calculations. This was discussed on various occasions, most recently at the OMB of 26 October: [https://www.egi.eu/indico/conferenceDisplay.py?confId=150]. '''As of May 2011 creamCE/ArcCE are taken into account'''.<br />
# The calculations performed by GridView always '''take into account the information system status and GOCDB information at the time the calculation is performed, and not that of a certain checkpoint in the past'''. The implication is that any complete recalculation risks altering the results for sites that had correct numbers in the first place. Thus, until a solution is found, '''complete recalculations are avoided whenever possible''', and errors are fixed on a per-site basis for those sites that have lower numbers than they should.<br />
# Weighted availability weights each site by the number of logical CPUs it publishes multiplied by its published HEPSPEC value. It is important that these numbers are correct: if the HEPSPEC value for a site is too high or too low (for example because of a mistake), the overall NGI weighted availability will be affected.<br />
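<br />
The effect of a wrong HEPSPEC value described in the last item can be illustrated with a short Python sketch. The per-site weight (logical CPUs multiplied by HEPSPEC) comes from the text above; combining sites as a weighted mean of their availabilities is an assumption made for illustration and may not match the exact GridView or ACE computation. All names are hypothetical.<br />

```python
def site_weight(logical_cpus, hepspec):
    """Weight of a site: published logical CPUs times published HEPSPEC."""
    return logical_cpus * hepspec


def ngi_weighted_availability(sites):
    """Weighted mean availability over sites (assumed aggregation).

    sites is a list of (availability_percent, logical_cpus, hepspec) tuples.
    Returns None when the total weight is zero.
    """
    total_weight = sum(site_weight(c, h) for _, c, h in sites)
    if total_weight == 0:
        return None
    weighted_sum = sum(a * site_weight(c, h) for a, c, h in sites)
    return weighted_sum / total_weight
```

With two sites of equal size at 90% and 50% availability, this sketch gives 70%. If the second site mistakenly publishes a HEPSPEC value ten times too high, the same formula gives about 53.6%, which shows why the published numbers matter.<br />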
<br />
=Resources=<br />
* Definition of Availability and Reliability and related computation algorithm ([https://twiki.cern.ch/twiki/pub/LCG/GridView/Gridview_Service_Availability_Computation.pdf paper])<br />
* NEW! [https://wiki.egi.eu/wiki/Availability_and_reliability_tests List of Nagios tests] used for availability computation<br />
* [https://twiki.cern.ch/twiki/bin/view/LCG/ACE Availability Computation Engine] (ACE)<br />
* [https://documents.egi.eu/document/31 Operational Level Agreement between NGI and site]<br />
* [https://wiki.egi.eu/wiki/OLA_release_notes OLA release notes]<br />
<!--* OLD: [https://twiki.cern.ch/twiki/bin/view/EGEE/MonthlyAvailability EGEE-III Comments on site availability and reliability statistics] --><br />
*[https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD COD procedure for oversight of availability and reliability performance]</div>Dzilahttps://wiki.egi.eu/w/index.php?title=Resource_Centres_OLA_and_Resource_infrastructure_Provider_OLA_reports&diff=19954Resource Centres OLA and Resource infrastructure Provider OLA reports2011-06-17T12:21:03Z<p>Dzila: /* 2011 */</p>
<hr />
<div>{{Template:Op menubar}}<br />
[[Category:Procedures]]<br />
{{TOC_right}}<br />
<br />
It is mandatory that EGI certified Resource Centres provide a minimum monthly availability and reliability as specified below (see the [https://documents.egi.eu/document/31 site-NGI Operational Level Agreement] for details). Availability and reliability statistics (based on the global OPS VO) are issued on a monthly basis.<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
| '''minimum availability'''<br />
| 70%<br />
|-<br />
| '''minimum reliability'''<br />
| 75%<br />
|-<br />
|'''Condition for suspension'''<br />
| Resource Centres which have an availability of less than '''70%''' for three consecutive months will be suspended, i.e. removed from the production infrastructure. The 70% threshold applies from PY2 (May 2011 reports). Note: this suspension policy was reviewed in April 2011, when the original 50% threshold was increased to 70%.<br />
|-<br />
|'''Condition for justification'''<br />
|Resource Centres not providing minimum monthly performance (70% availability, 75% reliability) MUST provide justification through a GGUS ticket.<br />
|}<br />
<br />
= Performance=<br />
* [https://wiki.egi.eu/wiki/Availability_and_reliability_reports_metrics Overview] of availability and reliability statistics including suspended sites<br />
* [https://wiki.egi.eu/wiki/List_of_sites_for_which_the_availability_followup_procedures_were_not_applicable List of sites] for which availability followup procedures were not applicable<br />
<!--* [https://twiki.cern.ch/twiki/bin/view/EGEE/SuspendedSites List of suspended sites (2009)]--><br />
== 2011 ==<br />
[https://documents.egi.eu/document/593 May]<br />
[https://documents.egi.eu/document/508 Apr]<br />
[https://documents.egi.eu/document/465 Mar]<br />
[https://documents.egi.eu/document/402 Feb]<br />
[https://documents.egi.eu/document/332 Jan]<br />
<br />
== 2010 ==<br />
*[https://documents.egi.eu/document/299 Dec] | [https://documents.egi.eu/document/266 Nov] | [https://documents.egi.eu/document/238 Oct] | [https://documents.egi.eu/document/219 Sep] | [https://documents.egi.eu/document/157 Aug] | [https://documents.egi.eu/document/130 Jul] | [https://documents.egi.eu/document/96 Jun]|[https://documents.egi.eu/document/42 May]<br />
*[https://edms.cern.ch/document/963325 January 2008 - April 2010] (EGEE league tables)<br />
<br />
== EGI-wide Availability and Reliability ==<br />
It is available [https://documents.egi.eu/public/ShowDocument?docid=415 here] (overall xls file from May 01 2010)<br />
<br />
=Report generator=<br />
*[http://gvdev.cern.ch/GVPC/Excel/ GridView availability/reliability report generator] (providing access to the database including Nagios results for OPS and SAM results for VOs)<br />
*[http://gridview015.cern.ch/GVPC/Excel/ACE/ ACE report generator]<br />
<br />
=Process for quality verification=<br />
<br />
* '''Generation of statistics'''<br />
Availability and reliability statistics are automatically generated the first week of the month by GridView in pdf format and placed under [http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/]. An Excel version is available at [http://gvdev.cern.ch/GVPC/Excel/]<br />
<br />
* '''Preliminary processing'''<br />
Once the reports are generated, sanity checks are performed by EGI SA1 (Task TSA1.8). After this step is completed, statistics are uploaded into the EGI document server. Links to monthly statistics will be provided on a regular basis at this wiki page.<br />
<br />
* '''Publication'''<br />
An announcement of the new results is distributed by EGI SA1 (TSA1.8) to the NGI Operations Managers mailing list. COD (TSA1.7) is responsible for supervising the statistics: it asks NGIs to follow up with sites that need to provide comments when thresholds are not met, and it identifies sites eligible for suspension. This phase starts by filing a ticket to the COD Support Unit. The overall comment-gathering process is handled through tickets.<br />
<br />
* '''Handling of sites below targets'''<br />
For a site that misses availability/reliability targets but is not eligible for suspension: <br />
<br />
# a child ticket is opened by the COD team and assigned to the respective NGI, asking for explanation to be given <br />
# the explanation must be produced within 10 working days of the ticket being received by the site (please see known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs]). Reminders and escalation are performed in accordance with COD escalation procedures [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure].<br />
# if the explanation is found satisfactory the ticket is closed <br />
# conversely, if the explanation is not given in due time, or is found inadequate, the COD escalation procedure will be followed [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], with the site being suspended if neither the site nor the NGI replies to the ticket <br />
# the child ticket can then be closed <br />
# the parent ticket will be closed when all child tickets have been closed.<br />
<br />
* '''Handling of sites that are eligible for suspension'''<br />
For a site that is eligible for suspension: <br />
# a child ticket is opened by the COD team and assigned to the appropriate NGI, notifying that the site will be suspended within 10 working days (please see known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs])<br />
# after the 10-day period passes, during which normal COD escalation procedures apply [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], the site is suspended by COD unless the NGI has intervened or the EGI Chief Operations Officer objects. <br />
# in the case of NGI intervention, the site is not suspended if both COD and the COO agree with the reasoning provided by the NGI <br />
# the child ticket closes either when the site is suspended or when suspension is canceled <br />
# the parent ticket will be closed when all child tickets have been closed<br />
<br />
* '''Wiki follow up page'''<br />
Sites that fail to provide an explanation for missing OLA targets, or whose explanation is found inadequate, as well as sites that are suspended, will be recorded in a wiki page [https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD_metrics]<br />
<br />
* ''' Impact of increasing the availability suspension threshold to 70%'''<br />
There is a proposal to increase the suspension threshold to 70% for 3 consecutive months, from the current 50%. The impact of this can be found at [https://wiki.egi.eu/wiki/Availability_and_reliability_threshold_change_impact]<br />
<br />
=Known issues and recommendations to NGIs=<br />
# GridView always calculates the availability and reliability of a site as soon as it shows up in GOCDB and in the BDIIs, regardless of its certification status. While processing the data to generate the availability/reliability report, GridView uses the certification status of the site at that moment to decide whether the site is certified, and therefore shown in the report, or uncertified, and therefore excluded. '''Thus newly certified sites will get inaccurate Availability/Reliability figures for the month they were certified and all months before that.''' Because the certification status history is not currently available in the operations tools, until a solution is implemented NGIs should check whether they have sites affected by this issue and report it as the explanation. More information at [https://gus.fzk.de/ws/ticket_info.php?ticket=60594] and [https://gus.fzk.de/ws/ticket_info.php?ticket=60925]. '''As of December 2010, GridView has included a snapshot feature so that availability takes into account the topology on the last day of the month. While it does not solve the problem completely, it reduces its impact. Currently (January 2011) it is used for the PDF report generator but not for the Excel one.''' <br />
# arcCE tests have been considered critical since mid July 2010, but sites and RODs are not getting notified in any operational tool about their results. This is being investigated; in the meantime, sites/NGIs dealing with availability/reliability tickets caused by arcCE issues are advised to solve such tickets, mentioning in the solution that the cause was the arcCE. Some background: [https://gus.fzk.de/ws/ticket_info.php?ticket=61953] and [https://gus.fzk.de/ws/ticket_info.php?ticket=62074]<br />
# creamCE tests are critical but not taken into account for availability/reliability calculations. This was discussed on various occasions, most recently at the OMB of 26 October: [https://www.egi.eu/indico/conferenceDisplay.py?confId=150]<br />
# The calculations performed by GridView always '''take into account the information system status and GOCDB information at the time the calculation is performed, and not that of a certain checkpoint in the past'''. The implication is that any complete recalculation risks altering the results for sites that had correct numbers in the first place. Thus, until a solution is found, '''complete recalculations are avoided whenever possible''', and errors are fixed on a per-site basis for those sites that have lower numbers than they should.<br />
# Weighted availability weights each site by the number of logical CPUs it publishes multiplied by its published HEPSPEC value. It is important that these numbers are correct: if the HEPSPEC value for a site is too high or too low (for example because of a mistake), the overall NGI weighted availability will be affected.<br />
<br />
=Resources=<br />
* Definition of Availability and Reliability and related computation algorithm ([https://twiki.cern.ch/twiki/pub/LCG/GridView/Gridview_Service_Availability_Computation.pdf paper])<br />
* NEW! [https://wiki.egi.eu/wiki/Availability_and_reliability_tests List of Nagios tests] used for availability computation<br />
* [https://twiki.cern.ch/twiki/bin/view/LCG/ACE Availability Computation Engine] (ACE)<br />
* [https://documents.egi.eu/document/31 Operational Level Agreement between NGI and site]<br />
* [https://wiki.egi.eu/wiki/OLA_release_notes OLA release notes]<br />
<!--* OLD: [https://twiki.cern.ch/twiki/bin/view/EGEE/MonthlyAvailability EGEE-III Comments on site availability and reliability statistics] --><br />
*[https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD COD procedure for oversight of availability and reliability performance]</div>Dzilahttps://wiki.egi.eu/w/index.php?title=Resource_Centre_OLA:_Release_Notes&diff=18898Resource Centre OLA: Release Notes2011-06-10T10:25:44Z<p>Dzila: /* Resource Centre OLA v. 1.0 */</p>
<hr />
<div>{{Template:Op menubar}}<br />
{{TOC_right}}<br />
[[Category:OperationalLevelAgreement]]<br />
<br />
= Resource Centre OLA v. 1.0=<br />
<br />
{| border="1"<br />
|- style="background-color:lightgray;"<br />
| Release Date<br />
| Release Version<br />
| Approved by<br />
|-<br />
| 03 June 2011<br />
| 1.0<br />
| Operations Management Board<br />
|}<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="center"<br />
!date<br />
!version<br />
!Comment<br />
|-<br />
| June 2011<br />
| 1.0 [https://documents.egi.eu/document/218]<br />
| <br />
Version 1.0 Release notes<br />
<br />
First version of the OLA that attempts to adopt the new terminology as explained in the Operations Architecture document at [https://documents.egi.eu/document/218]. Available at [https://documents.egi.eu/document/31]<br />
<br />
Specifically, in this version:<br />
*Sites are now referred to as Resource Centres or RCs<br />
*NGIs are Resource Infrastructure Providers or RIPs<br />
*Minimum thresholds for Cores and Storage removed.<br />
*Includes all the previous changes required for the transition from EGEE to EGI<br />
**EGEE replaced with EGI<br />
**SLD replaced with OLA<br />
**ROCs with NGIs<br />
**Removed strict references to gLite as the supported middleware<br />
**More flexible site services requirements<br />
**Sites are required to support more than one non-monitoring VO<br />
**GGUS thresholds updated<br />
**Regional bodies replaced with their national incarnations<br />
**References to SAM replaced with NGI Nagios and related tools.<br />
<br />
*Definitions changed to be more consistent with the terminology about capabilities<br />
*Functionality capabilities and Information discovery capabilities defined and required<br />
*Alarm tickets removed<br />
*Responsibility of the Resource Infrastructure Provider Operations Manager to inform RCs of changes to the OLA redefined.<br />
*Certified RC defined<br />
*Functional Capabilities redefined<br />
*Clarified that the RIP is responsible for the accuracy of its own and its sites' information in GOCDB<br />
<br />
|}<br />
<!-- = Previous releases = --></div>Dzila