EGI-InSPIRE:SA1.7-QR6
Revision as of 17:23, 1 November 2011
1. Task Meetings
Date (dd/mm/yyyy) | Indico Agenda URL | Title | Outcome |
---|---|---|---|
2. Main Achievements
Grid Oversight
ROD teams newsletter
This quarter we published ROD teams newsletters in May and June. The rationale behind the newsletter is described in the QR4 report.
ROD session at EGI TF
We will organise a ROD teams session at the EGI Technical Forum (TF) in Lyon. The programme has been drafted and will be published on the web soon. The session will include an overview of the Grid oversight work as well as a discussion of more detailed topics. We aim to include a presentation on the operations portal, and we will conclude the session with a presentation of the results of a survey that we will publish shortly.
Procedures
We have proposed a change to the COD escalation procedure that removes an unnecessary step. At present, when a site does not respond to a GGUS ticket issued after an alarm, the ROD team eventually escalates the issue to the COD, which bounces it back to the NGI and asks the NGI to try again to contact the site. We consider this step unnecessary, and the proposed change to the procedure is currently under discussion.
Availability follow-up
- A Nagios probe is under development that will raise an alarm when a site's availability and/or reliability falls below the 70% (availability) / 75% (reliability) threshold. The COD has provided input, which has been recorded in RT ticket https://rt.egi.eu/rt/Ticket/Display.html?id=289.
- We recently discovered a substantial number of UNKNOWN test results in the availability and reliability metrics, both for individual sites and for all sites in an entire NGI. Because UNKNOWN test results are not taken into account when computing availability and reliability, a large share of them makes these metrics less representative of the sites' actual state. The issue is currently under investigation.
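The threshold check and the effect of excluding UNKNOWN results can be illustrated with a minimal Python sketch. It assumes the commonly used EGI definitions, availability = up / (total - unknown) and reliability = up / (total - unknown - scheduled downtime), and a hypothetical input of hours spent in each status over the reporting period. The names `SiteHours`, `availability`, `reliability`, `unknown_fraction` and `should_alarm` are illustrative only and are not taken from the actual Nagios probe.

```python
import math
from dataclasses import dataclass

# Thresholds as stated in the report: 70% availability, 75% reliability.
AVAILABILITY_THRESHOLD = 0.70
RELIABILITY_THRESHOLD = 0.75


@dataclass
class SiteHours:
    """Hypothetical input: hours per status over the reporting period."""
    up: float
    down: float
    scheduled_downtime: float
    unknown: float

    @property
    def total(self) -> float:
        return self.up + self.down + self.scheduled_downtime + self.unknown


def availability(h: SiteHours) -> float:
    # UNKNOWN hours are removed from the denominator, not counted as down.
    known = h.total - h.unknown
    return h.up / known if known > 0 else math.nan


def reliability(h: SiteHours) -> float:
    # Reliability additionally excludes scheduled downtime.
    known = h.total - h.unknown - h.scheduled_downtime
    return h.up / known if known > 0 else math.nan


def unknown_fraction(h: SiteHours) -> float:
    # Share of the period for which no usable test result exists.
    return h.unknown / h.total if h.total > 0 else 0.0


def should_alarm(h: SiteHours) -> bool:
    # A NaN metric (no known hours) compares False, so a site whose
    # results are all UNKNOWN never triggers the alarm.
    a, r = availability(h), reliability(h)
    return a < AVAILABILITY_THRESHOLD or r < RELIABILITY_THRESHOLD


if __name__ == "__main__":
    # 200 of 720 hours UNKNOWN: availability is computed from 520 hours only.
    site = SiteHours(up=500, down=20, scheduled_downtime=0, unknown=200)
    print(round(availability(site), 3), round(unknown_fraction(site), 3),
          should_alarm(site))
```

In the example, a site whose results are UNKNOWN for more than a quarter of the month still reports roughly 96% availability and raises no alarm. This is the kind of distortion described above; an alarm based on these metrics may need to consider `unknown_fraction` as well.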
3. Issues and Mitigation
Issue Description | Mitigation Description |
---|---|