
1. Task Meetings

Date (dd/mm/yyyy) | Indico agenda URL | Title | Outcome
30-8-2011 | | COD |
19-9-2011 | | Network Support workshop at EGI TF |
22-9-2011 | | Grid Oversight session at EGI TF |
22-9-2011 | | COD F2F |
19-10-2011 | | Nagios A/R probe |
26-10-2011 | | COD |
18-8-2011 | | NetSup Status Update VideoConference |
weekly | | shopping list meeting TPM |

2. Main Achievements

Grid Oversight

ROD teams newsletter

This quarter we published a ROD teams newsletter in August and in October. The rationale behind the newsletter is described in the QR4 report.

ROD teams questionnaire

Some time ago we sent out a questionnaire to the ROD teams, because we wanted to know how they perceive their work. We asked for their opinion on the operational tools, documentation, video tutorials, this newsletter, and so on. We received no fewer than 44 responses, which we found very valuable. From 12 NGIs we received more than one response. The outcome was discussed during the Grid Oversight session at the EGI Tech Forum.

ROD session at EGI TF

At this edition of the EGI Tech Forum we organised a 1.5-hour session covering three topics. First, COD staff presented the new simplified escalation procedure that came into effect on October 1st. Second, the ROD metrics and their incorporation in the OLA were discussed. This topic caused a fair amount of discussion, and the outcome was that these metrics will continue to be collected and published in the ROD newsletter. Later on we will restart the discussion on how they should enter the OLA. Third, the results of an investigation into the reasons for closing alarms in non-OK status were presented, together with some tips on how to do this properly.

Next, COD staff presented the results of the survey that we held among our RODs about the work they do. The survey included questions about the operational tools, documentation, and so on, and the COD gave its feedback on the responses in this slot. Helpfully, the operational tools developers Cyril l'Orphelin and Emir Imamagic were in the audience, so part of the slot became a very useful Q&A session between users and developers of the operational tools.

Finally, Cyril l'Orphelin gave an interesting presentation on the recent developments and improvements of the dashboard. There is going to be a security dashboard to detect security issues and inform sites about them. There is also going to be a VO-oriented dashboard. Links to the presentations may be found at:

RP OLA and ROD metrics

For the last few months the COD team has been working within the OLA task force to create the Resource Provider OLA, which will set out the obligations between EGI and the NGIs. This OLA was approved at the OMB on October 25th, 2011. One of the actions was to define a ROD metric on the basis of which EGI will check whether the ROD service is properly delivered by the NGIs. During our COD session at the Technical Forum in Lyon we presented our proposal for this metric; please read the presentation (from page 9). As a result of the discussion we decided to first provide a monthly simulation of this metric to check what the current status is. We decided to set the initial threshold at 10 items. This means that, starting in October, we will ask every NGI above 10 items to explain through GGUS the reason for that result and how it plans to improve the situation.
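
The follow-up rule described above can be illustrated with a small sketch. Only the threshold of 10 items comes from this report; the function name, the NGI names, and the item counts are hypothetical, and the report does not define here what counts as an "item".

```python
# Hypothetical sketch: the threshold of 10 items comes from the report;
# the NGI names and item counts below are invented for illustration.
ROD_METRIC_THRESHOLD = 10


def ngis_needing_followup(items_by_ngi, threshold=ROD_METRIC_THRESHOLD):
    """Return the NGIs whose monthly item count is strictly above the threshold."""
    return sorted(ngi for ngi, items in items_by_ngi.items() if items > threshold)


monthly_items = {"NGI_A": 3, "NGI_B": 10, "NGI_C": 17}  # made-up data
for ngi in ngis_needing_followup(monthly_items):
    # In practice the follow-up would be a GGUS ticket; here we only print.
    print(f"{ngi}: above {ROD_METRIC_THRESHOLD} items, ask for an explanation via GGUS")
```

The sketch reads "above 10 items" as strictly greater than 10, so an NGI with exactly 10 items is not contacted.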

Non-OK Alarms Followup

In general, alarms should not be closed in non-OK status. In some cases, however, it is inevitable. Closing alarms in non-OK status is allowed, but the ROD team in question should give a reason for doing so. We collected the reasons why ROD teams closed alarms in non-OK status during August and September, in order to identify whether the reasons given were valid, whether there are deficiencies in the operational tools, or whether there is a lack of training or documentation/information. NGIs that were closing alarms in non-OK status for invalid or insufficient reasons were identified and approached by the COD team.

Availability followup

There is a Nagios probe under development that will raise an alarm when a site's availability and/or reliability is below the 70%/75% threshold. The COD has provided input, which was put into the RT ticket. We have organised a phone conference on the requirements that this probe should fulfil. We have made a new proposal in this area and hope to reach an agreement with all parties involved so that this issue can make some progress.
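
As an illustration of the alarm condition only, a minimal sketch follows. It assumes that 70% applies to availability and 75% to reliability, and that both values are given as percentages. It is not the probe under development, and the function names are hypothetical.

```python
AVAILABILITY_MIN = 70.0  # percent; assumed to be the availability threshold
RELIABILITY_MIN = 75.0   # percent; assumed to be the reliability threshold


def metrics_below_threshold(availability, reliability):
    """Return the names of the metrics that fall below their threshold."""
    failing = []
    if availability < AVAILABILITY_MIN:
        failing.append("availability")
    if reliability < RELIABILITY_MIN:
        failing.append("reliability")
    return failing


def should_raise_alarm(availability, reliability):
    """An alarm is due when availability and/or reliability is too low."""
    return bool(metrics_below_threshold(availability, reliability))
```

A site with 69.9% availability and 80% reliability would trigger an alarm under this reading, while a site sitting exactly on both thresholds would not.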

Unknown followup

Recently, we discovered that the availability and reliability metrics contained a substantial number of UNKNOWN test results, for individual sites but also for all sites in an entire NGI. Since UNKNOWN test results are not taken into account in the availability/reliability metrics, this clouds those metrics. This issue is currently under investigation. More information on this topic may be found at: Grid_operations_oversight/Unknown_issue
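
To see why a large share of UNKNOWN results clouds the metric, consider the following sketch. It assumes a simplified formula in which availability is the number of OK results divided by the number of known (OK plus failed) results, so UNKNOWN results drop out of the calculation entirely. The actual EGI computation may differ; the function names and the counts are hypothetical.

```python
def availability_percent(ok, failed, unknown):
    """Availability with UNKNOWN results left out of the calculation.

    Assumed formula: OK / (OK + failed). The `unknown` count is
    deliberately ignored, which is the effect being illustrated.
    Returns None when no result is known at all.
    """
    known = ok + failed
    if known == 0:
        return None
    return 100.0 * ok / known


def unknown_fraction(ok, failed, unknown):
    """Share of all test results that are UNKNOWN (0.0 if there are no results)."""
    total = ok + failed + unknown
    return unknown / total if total else 0.0


# Made-up example: only 10 of 100 results are known, and all of them are OK.
sample = {"ok": 10, "failed": 0, "unknown": 90}
print(availability_percent(**sample), unknown_fraction(**sample))
```

Under this assumption a site can report perfect availability while most of its test results are UNKNOWN, which is why reporting the UNKNOWN share next to the metric is useful.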


TPM

Nothing in particular to report on this task, except that the handover of the TPM shift between the Italian and German TPM teams takes place at the same local time, 14:00, regardless of summer or winter time.

Network Support

The main task meeting was held in Lyon during the EGI Technical Forum, at the Network Support Operations workshop (Sept 19, 2011). The agenda of the Net Sup workshop is available online. Additionally, as in the previous quarter, task meetings were carried out by video conference and phone calls. The main remote meeting was the Network Support coordination video conference on August 18, 2011. The agenda of the VC meeting was:

1. Update on the ongoing activities on the three tools we intend to present at the Net Sup Workshop at the EGI Tech Forum at the end of September (Lyon, France): HINTS, PerfSONAR live CD for e2eMON, and NetJobs
2. Questionnaire for NRENs about IPv6
3. Collaboration with HEPiX IPv6 WG
4. IPv6 strategy

Another video conference was held on October 12, 2011 on IPv6 activities (GARR-SWITCH). Its agenda focused on the next steps around IPv6, to be carried out jointly by SWITCH and GARR.

HINTS improvements: work on a plug-in for 64-bit architectures is in progress at FranceGrille/CNRS DSI. Further testing and validation users have been enrolled (currently 11).

The current tools (Network) for network support for EGI were presented to the NGIs in Lyon, and slow but progressive adoption by NGIs/NRENs is under way. PerfSONAR-MDM based tools were also presented to the NGIs at the EGI Technical Forum by DANTE.

Participation of the EGI NetSup coordination team in the HEPiX IPv6 Working Group has continued, and a test node has been set up at GARR for the HEPiX IPv6 testbed. IPv6 activities now have a plan jointly designed by SWITCH and GARR. In addition, the IPv6 survey has been filled in by almost all NGIs/EIROs and is published online, available at: IPv6

3. Issues and Mitigation

Issue Description | Mitigation Description
Grid Oversight: None |
TPM: None |
Network Support: An issue has emerged with the DownCollector tool, which requires an upgrade/port to GOCDB v.4. | No manpower has been identified so far to do this porting; in the meantime the tool has been removed from the list of provided tools.
Network Support: Some network tickets in GGUS required attention and the involvement of the corresponding parties, and solving them was somewhat time-demanding. |

4. Plans for the next period

Grid Oversight

1. Continue the investigation of the impact of new middleware in EGI on the operations support model.

2. Continue the investigation on how to improve availability and reliability metrics. In this respect we will continue to monitor the progress on RT ticket 289, where a request was formulated to create a Nagios probe that measures availability and reliability.

3. Evaluate upcoming new releases of the operational dashboard.

4. Continue reviewing the ROD metrics.


TPM

Nothing in particular to report; the work will continue as usual.

Network Support

Wrap up the results of the IPv6 survey for NGIs and circulate them to Operations, JRA1, Network Support and TMB. Proceed with the IPv6-related activities:

 1. Write a reference document on possible strategy for the inclusion of IPv6-only resources
 2. Perform basic tests with UMD and IPv6
 3. Keep collaborating with the HEPiX IPv6 WLCG working group

Officially release the 64-bit HINTS probe. Clarify the AuthN/AuthZ model of HINTS in a document related to the tool.