1. Task Meetings
2. Main Achievements
ROD teams newsletter
This quarter we published a ROD teams newsletter in November, December and January. The rationale behind the newsletter is described in the SA1.7-QR4 report.
ROD performance index
For background information on this, have a look at SA1.7-QR6, section RP OLA and ROD metrics. Since October we have been asking, through GGUS, all NGIs with more than 10 items in the COD dashboard during one month to explain the reason for this result and how they plan to improve the situation. The good news is that we have seen a continuous decline in the number of items in the COD dashboard.
Non-OK Alarms Followup
For background information on this, have a look at SA1.7-QR6, section Non-OK Alarms Followup. We have continued this activity in QR7.
See SA1.7-QR6 for more background information. There has been a phone conference with JRA1 (https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=minutes&confId=716) where the availability probe was discussed. There will be a probe that meets the following specs:
- The probe only measures availability
- The probe computes the availability over the past 30 days
- The probe returns a WARNING when: 70% <= availability <= 75%
- The probe returns a CRITICAL when: availability <70%
A test version of the probe will be available in March.
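The thresholds above can be sketched as a small classification function. This is only an illustration, not the probe being developed: the function name `classify_availability` is hypothetical, the WARNING band is read as 70% to 75% inclusive, and returning OK above 75% is an assumption, since the specs only define WARNING and CRITICAL. The numeric values follow the usual Nagios plugin exit-code convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
from enum import IntEnum


class Status(IntEnum):
    # Conventional Nagios plugin exit codes.
    OK = 0
    WARNING = 1
    CRITICAL = 2


def classify_availability(availability_pct: float) -> Status:
    """Map a 30-day availability percentage to a probe status.

    Hypothetical helper: the thresholds come from the specs above,
    the OK case above 75% is an assumption.
    """
    if availability_pct < 70.0:
        return Status.CRITICAL
    if availability_pct <= 75.0:
        return Status.WARNING
    return Status.OK


if __name__ == "__main__":
    for value in (65.0, 70.0, 75.0, 90.0):
        print(value, classify_availability(value).name)
```

How the 30-day availability itself is computed is left out here, since the source does not specify it.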
See SA1.7-QR6 for more background information. There has been a phone conference with JRA1 (https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=minutes&confId=716) where the "unknown" issue was discussed. The following proposal was made for the follow-up of the unknown issue:
- Each month the COD team will send a GGUS ticket to each NGI concerned (one ticket per NGI) with the list of sites that are above 10% UNKNOWN
- In the ticket we will ask the NGI to investigate the issue and fix the problem
- The NGI closes the ticket, which indicates that it has received the notification
In the meantime we have put this proposal into practice.
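The monthly selection step of this procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the input format (tuples of NGI name, site name, and monthly UNKNOWN percentage), the function name `sites_to_report`, and the sample names in the usage example are all hypothetical, and creating the actual GGUS tickets is out of scope.

```python
from collections import defaultdict

# Threshold from the proposal: sites above 10% UNKNOWN are reported.
UNKNOWN_THRESHOLD_PCT = 10.0


def sites_to_report(site_stats):
    """Group sites above the UNKNOWN threshold by NGI.

    site_stats: iterable of (ngi, site, unknown_pct) tuples
    (hypothetical input format). Returns a dict mapping each NGI
    to a sorted list of its sites, i.e. one ticket per NGI.
    """
    per_ngi = defaultdict(list)
    for ngi, site, unknown_pct in site_stats:
        if unknown_pct > UNKNOWN_THRESHOLD_PCT:
            per_ngi[ngi].append(site)
    return {ngi: sorted(sites) for ngi, sites in per_ngi.items()}


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        ("NGI_A", "SITE-1", 12.5),
        ("NGI_A", "SITE-2", 3.0),
        ("NGI_B", "SITE-3", 10.0),
        ("NGI_B", "SITE-4", 25.0),
    ]
    for ngi, sites in sites_to_report(sample).items():
        print(ngi, sites)
```

NGIs with no site above the threshold do not appear in the result, so no ticket would be opened for them.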
With the new SU “MPI User Support”, the problem of tickets bouncing back to TPM for lack of a proper support unit should be solved.
Instructions for the TPM on how to use the SU “Operations”: This SU is meant for managerial problems that concern operations and is not a catch-all. The purpose of this SU is to provide a contact with the EGI.eu team that coordinates EGI operations for any technical and operational matter, and to handle requests from Resource Centers and Resource Infrastructure Providers that are willing to be integrated into the EGI production infrastructure. It is also the place to report any operational issue that is general and does not concern a specific Resource Infrastructure or deployed Grid middleware. Middleware-related issues that cannot be handled by TPM must be assigned to the DMSU. This includes configuration and documentation problems with the middleware. For deployment problems that concern a vast majority of the production sites, for which it is infeasible to open an individual ticket to every site/NGI, TPM can assign the ticket to the Operations SU. The SU “Operations” staff can offer coordination of the handling of such incidents when the scale cannot be managed by TPM.
- Started setting up IPv6 testbed
- Finalized IPv6 survey for NGIs
- Documented IPv6 status on Wiki
- Collaborated with the HEPiX IPv6 WG by setting up a node at GARR and providing testbed verification scripts for GridFTP
- Enrolled new HINTS users for testing HINTS
- Defined workplan schedule for HINTS
- Disseminated information on PerfSONAR MDM tools to NGIs
3. Issues and Mitigation
|Issue Description||Mitigation Description|
|network support: Suffered a major security incident at GARR||HINTS server was hacked and is unavailable. The security policy at GARR has been completely reviewed and hardened considerably.|
|network support: DownCollector required upgrade to GOCDB4||Currently no manpower available for this|
4. Plans for the next period
The plan for the next period is to proceed with the current activities and to come up with a proposal to include test resources in the infrastructure.
- Keep consolidating HINTS and provide the HINTS probe for the 64-bit architecture
- Continue the deployment campaign for HINTS
- Provide more structured support for PerfSONAR-MDM and disseminate information about it among the Tier-2 sites, in collaboration with DANTE
- Restart (after re-installation) production HINTS server at GARR