1. Task Meetings

2. Main Achievements

Grid Oversight

1. ROD teams newsletter

The transition from EGEE to EGI-InSPIRE brought many changes. For Operations, the EGEE Regional Operations Centres (ROCs) are being dismantled and their responsibilities transferred to the NGIs; for some ROCs this process has already been completed. In the EGI era, ROD teams will monitor the quality of sites in their country or region, whereas COD is responsible for the global oversight of the whole EGI infrastructure. The aim is to provide a high-quality grid infrastructure to the user communities. These changes have also led us to think about how COD and ROD are going to interact with each other in this new setting. During the Grid Oversight session at the EGI Tech Forum it was made clear to us that people find it cumbersome to travel in order to have regular face-to-face meetings. Nevertheless, we do feel the need to create and maintain a coherent and active Grid Oversight community and to have interaction between ROD and COD that goes beyond the dashboards. This is necessary, in our view, to create a top-quality grid infrastructure for our users. For this reason we have created this newsletter. Its purpose is to inform you about recent and upcoming developments related to Grid Oversight and to show you the metrics indicating how well we did in the past month. It is our intention to publish a newsletter every month.

2. Input given on approved Procedures

New NGI creation process coordination: The purpose of this document is to clearly describe the actions and the related steps to be undertaken for integrating an NGI (or a group of NGIs) into the EGI operational structure. The newest version became effective as of Dec 1st.

Operations Centre decommission: The purpose of this document is to clearly describe the actions and the related steps to be undertaken for decommissioning an Operations Centre. This procedure became effective as of Dec 1st.

COD escalation procedure: The purpose of this document is to define an escalation procedure for operational problems. The newest version became effective as of Dec 1st. This procedure is essential for ROD work and we encourage you to read it.

Making a Nagios test an operations test: The purpose of this document is to clearly describe the actions and the related steps to be undertaken for making a Nagios test an operations test. A Nagios test is set as an operations test to enable the operations dashboard to display an alarm in case the test fails. This procedure will become effective as of Jan 1st.

3. Renaming of "critical" tests

“Operations test” should be used for tests raising alarms for ROD. It was recently decided that a new name should be assigned to a test which raises alarms in the operations dashboard. COD used to call it a “critical test”, but this caused confusion with the critical Nagios test status. In a poll, the name that gained the majority was “operations test”.

Network Support

The main achievement is the outcome of the EGI Network Support proposal Task Force (TF): a structured proposal around seven identified use cases, which was formalized and discussed at the face-to-face meeting in Amsterdam on January 24, 2011. The community has been introduced to the seven use cases: the GGUS Support System, the PERT team, Scheduled maintenances, Network troubleshooting on demand, e2e Scheduled Monitoring, DownCollector, Policy and Collaboration. For each of them, the specific proposal from the task force has been described and discussed within the EGI operations community. The proposal from the TF was based on a questionnaire previously distributed to the NGIs. The results are published on the EGI Operations Wiki on NST. A roadmap ahead has been agreed upon for each use case. In particular, the task committed to:

1) Set up a Network Support unit within GGUS for network-related issues. GARR has agreed to start providing the corresponding required effort (voluntarily), at least in a preliminary way, in order to assess its long-term sustainability and to reconfirm this commitment in the coming months. The GGUS workflow has been identified and agreed.

2) Provide, maintain and support an on-demand Network Troubleshooting tool, called HINTS, voluntarily provided (unfunded) by the French NGI. A central HINTS server instance will be made available at GARR, and the French NGI will start a pilot deployment of the tool once the central server is available.

3) Provide and maintain a live CD distribution for on-demand and scheduled e2e monitoring, based on perfSONAR-MDM, voluntarily contributed by the Spanish NGI and NREN RedIRIS. Later on, a dedicated GUI will be made available. Historical data will be stored in a database.

4) Keep a permanent liaison with the GN3 perfSONAR community, assess the tools provided by pS, and provide feedback to the GN3 community. Report periodically on the new features and progress of the pS-based tools.

5) Further refine the NetJobs tool with respect to the provided functionality and the usability of the Web Interface, providing a central server instance at GARR; promote the tool within the EGI Network Support operations community, especially for the basic metrics (number of hops, RTT, available bandwidth).

6) Organize a general questionnaire for the NRENs, aimed at better understanding their interaction model with the NGIs, the best practices, and the tools they are familiar with, and asking about their availability to provide a PERT contact point for the EGI project.

3. Issues and Mitigation

Grid Oversight: None
Network Support: A first set of issues is related to the poor funding of this task: most of the work and the proposed tools are provided completely unfunded, on a purely voluntary basis, by the NGIs and the NRENs.

This issue as such is hard to mitigate, being intrinsically related to the EGI-InSPIRE DoW and the strategic choices made initially. However, despite this pending issue, many NGIs and some NRENs have shown a very positive and collaborative attitude in contributing manpower and tools to the EGI Operations community.
Network Support: A second issue comes from the lack of information on the specific interaction between the NGIs and their corresponding underlying NRENs. This will be mitigated by a specific questionnaire for the NRENs, similar to the one already distributed to the NGIs, in which the interaction model, the workflows, the preferred tools and the communication channels between NREN and NGI will be identified and highlighted.
Network Support: A third issue comes from the very distributed nature of the developments and of the support manpower for the task, which, as already stated, is essentially completely unfunded. This will be mitigated by extensive use of videoconferencing and email.
Network Support: A fourth concern is the lack of manpower and official support for the HINTS tool, since the two main developers are leaving. However, we have already contacted the French NGI about this, and they agreed to keep supporting the tool.

4. Plans for the next period

Grid Oversight

1. Continue ROC transition to NGIs.

2. Initiate investigation on how to achieve a consistent and coherent integration of non-production resources in the infrastructure.

3. Initiate investigation of the impact of new middleware in EGI on the operations support model.

4. Initiate investigation on how to improve availability and reliability metrics.

5. Evaluation of upcoming new releases of the operational dashboard.

Network Support

1. Organization of the questionnaire for the NRENs

2. Ensure HINTS continues to be supported and maintained, and keep discussing this with the French NGI

3. Provide a HINTS server instance at GARR and start the early adoption by some sites within the French NGI

4. Keep working on the live perfSONAR CD distribution for e2e monitoring and the corresponding GUI

5. Refine the NetJobs tool and improve its GUI

6. Liaise with the GN3 pS community, and the GN3-SA2-T3 task in particular, to maintain the synergies between GN3 and EGI for the benefit of the two communities.

7. Keep working on the use case related to network-related scheduled maintenances

