
Difference between revisions of "EGI-InSPIRE:SA1.7-QR6"

Line 27: Line 27:
'''ROD teams news letter'''


This quarter we have published a ROD teams newsletter in May and June. The rationale behind the newsletter is described in the QR4 report.
This quarter we have published a ROD teams newsletter in August and October. The rationale behind the newsletter is described in the QR4 report.


'''ROD session at EGI TF'''


We will organise a ROD teams session at the EGI TF in Lyon. We have drawn up a programme that will be published on the web soon. The session will give an overview of the Grid oversight work and will also cover some more detailed topics. We aim to have a presentation on the operations portal, and we will conclude the session with a presentation of the results of a survey that we will publish soon.
At this edition of the EGI Tech Forum we organised a 1.5-hour session covering three topics. First, COD staff presented the new simplified escalation procedure that came into effect on October 1st. Second, the ROD metrics and their incorporation into the OLA were discussed, which caused a fair amount of discussion. The outcome was that these metrics will continue to be collected and published in the newsletter; the discussion on how they should enter the OLA will be restarted later. Finally, the results of an investigation into the reasons for closing alarms in non-OK status were presented, together with some tips on how to do this properly.
Next, COD staff presented the results of the survey that we held among our RODs about the work that they do. There were questions about the operational tools, documentation and so on, and the COD provided its feedback on these in this slot. Helpfully, the operational tools developers Cyril l’Orphelin and Emir Imamagic were in the audience, so part of the slot became a very useful Q&A session between users and developers of the operational tools.


'''Procedures'''


We have proposed a change to the COD escalation procedure that removes an unnecessary step. At present, when a site is unresponsive to a GGUS ticket issued after an alarm, the ROD team eventually escalates the issue to COD, who bounce it back to the NGI, asking the NGI to try again to get in contact with the site. We deem this step unnecessary, and the proposed change to the procedure is currently under discussion.
 


'''Availability followup'''
*There is a Nagios probe under development that will raise an alarm when a site's availability and/or reliability is below the 70%/75% threshold. The COD has provided input, which was put into the RT ticket: https://rt.egi.eu/rt/Ticket/Display.html?id=289.
*There is a Nagios probe under development that will raise an alarm when a site's availability and/or reliability is below the 70%/75% threshold. The COD has provided input, which was put into the RT ticket: https://rt.egi.eu/rt/Ticket/Display.html?id=289. We have organised a phone conference on the requirements that this probe should fulfill.
*Recently, we discovered that the availability and reliability metrics contained a substantial number of UNKNOWN test results, both for individual sites and for all sites in an entire NGI. Since UNKNOWN test results are not taken into account in the availability/reliability metrics, this clouds those metrics. The issue is currently under investigation.
*Unknown issue


= 3. Issues and Mitigation  =

Revision as of 17:29, 1 November 2011


1. Task Meetings

Date (dd/mm/yyyy) Url Indico Agenda Title Outcome


2. Main Achievements

Grid Oversight

ROD teams news letter

This quarter we have published a ROD teams newsletter in August and October. The rationale behind the newsletter is described in the QR4 report.

ROD session at EGI TF

At this edition of the EGI Tech Forum we organised a 1.5-hour session covering three topics. First, COD staff presented the new simplified escalation procedure that came into effect on October 1st. Second, the ROD metrics and their incorporation into the OLA were discussed, which caused a fair amount of discussion. The outcome was that these metrics will continue to be collected and published in the newsletter; the discussion on how they should enter the OLA will be restarted later. Finally, the results of an investigation into the reasons for closing alarms in non-OK status were presented, together with some tips on how to do this properly.

Next, COD staff presented the results of the survey that we held among our RODs about the work that they do. There were questions about the operational tools, documentation and so on, and the COD provided its feedback on these in this slot. Helpfully, the operational tools developers Cyril l’Orphelin and Emir Imamagic were in the audience, so part of the slot became a very useful Q&A session between users and developers of the operational tools.

Procedures


Availability followup

  • There is a Nagios probe under development that will raise an alarm when a site's availability and/or reliability is below the 70%/75% threshold. The COD has provided input, which was put into the RT ticket: https://rt.egi.eu/rt/Ticket/Display.html?id=289. We have organised a phone conference on the requirements that this probe should fulfill.
  • Unknown issue
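
To illustrate the kind of threshold check described for the probe, here is a minimal Python sketch. It is not the actual Nagios probe. It assumes that availability is the share of known time a site was up, that reliability additionally leaves out scheduled downtime, and that UNKNOWN results are excluded from both metrics. The names `SiteHours`, `availability`, `reliability` and `below_threshold` are hypothetical, as are the sample figures.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds quoted in the report (percent).
AVAILABILITY_THRESHOLD = 70.0
RELIABILITY_THRESHOLD = 75.0


@dataclass
class SiteHours:
    """Hours per status for one site over a reporting period (hypothetical structure)."""
    up: float
    down: float                # unscheduled downtime
    scheduled_downtime: float
    unknown: float             # UNKNOWN results, excluded from both metrics


def availability(h: SiteHours) -> Optional[float]:
    """Percentage of known time the site was up; None if nothing is known."""
    known = h.up + h.down + h.scheduled_downtime
    return 100.0 * h.up / known if known > 0 else None


def reliability(h: SiteHours) -> Optional[float]:
    """Like availability, but scheduled downtime is also excluded (assumed definition)."""
    known = h.up + h.down
    return 100.0 * h.up / known if known > 0 else None


def below_threshold(h: SiteHours) -> bool:
    """True when availability or reliability is known and under its threshold."""
    a, r = availability(h), reliability(h)
    return (a is not None and a < AVAILABILITY_THRESHOLD) or (
        r is not None and r < RELIABILITY_THRESHOLD
    )


# Made-up sample data for three sites.
healthy = SiteHours(up=600, down=100, scheduled_downtime=20, unknown=0)
failing = SiteHours(up=400, down=200, scheduled_downtime=120, unknown=0)
mostly_unknown = SiteHours(up=10, down=0, scheduled_downtime=0, unknown=710)

for name, hours in [("healthy", healthy), ("failing", failing),
                    ("mostly_unknown", mostly_unknown)]:
    print(name, below_threshold(hours))
```

Under these assumptions the `mostly_unknown` site raises no alarm even though almost nothing is known about it, which is why a large share of UNKNOWN results can make the metrics misleading.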

3. Issues and Mitigation

Issue Description Mitigation Description

4. Plans for the next period