Difference between revisions of "EGI-InSPIRE:SA1.7-QR8"
Revision as of 10:16, 2 May 2012
1. Task Meetings
2. Main Achievements
Grid Oversight
ROD teams newsletter
This quarter we published a ROD teams newsletter in February and in April. The rationale behind the newsletter is described in the SA1.7-QR4 report.
ROD performance index
For background information on this, have a look at SA1.7-QR6, section RP OLA and ROD metrics. Since October, every NGI with more than 10 items in the COD dashboard during one month has been asked through GGUS to explain the reason for that result and how it plans to improve the situation. We are continuing to collect and investigate these metrics, and to correlate them with other metrics to see whether any conclusions can be drawn.
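The monthly selection described above can be sketched as follows. This is only an illustration: the function name, the input mapping, and the NGI names are hypothetical, and the threshold is read as "strictly more than 10 items".

```python
def ngis_needing_explanation(items_per_ngi, threshold=10):
    """Return the NGIs whose monthly COD dashboard item count exceeds the threshold.

    items_per_ngi is a hypothetical mapping of NGI name -> number of items
    that appeared in the COD dashboard during the month.
    """
    return sorted(ngi for ngi, count in items_per_ngi.items() if count > threshold)


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    monthly_counts = {"NGI_A": 3, "NGI_B": 14, "NGI_C": 10}
    for ngi in ngis_needing_explanation(monthly_counts):
        print(f"Open a GGUS ticket asking {ngi} for an explanation")
```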
Non-OK Alarms Followup
For background information on this, have a look at SA1.7-QR6, section Non-OK Alarms Followup. We have continued this activity in Q8.
Availability followup
See SA1.7-QR6 for more background information. A phone conference was held with JRA1 (https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=minutes&confId=716) in which the availability probe was discussed. There will be a probe that meets the following specs:
- The probe only measures availability
- The probe computes the availability over the past 30 days
- The probe returns a WARNING when: 70% <= availability <= 75%
- The probe returns a CRITICAL when: availability <70%
We are waiting for this probe to be available for testing.
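Until the probe is delivered, the threshold logic in the specs above can be sketched as below. This is not the JRA1 implementation: the function name and parameters are hypothetical, the WARNING band is read as 70% to 75% inclusive, and the numeric return codes follow the common Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL), which is an assumption since the report does not name the framework.

```python
# Status codes following the usual Nagios plugin convention (assumed here).
OK, WARNING, CRITICAL = 0, 1, 2


def availability_status(availability_pct, warning_max=75.0, critical_below=70.0):
    """Map a 30-day availability percentage to a probe status.

    CRITICAL: availability < 70%
    WARNING:  70% <= availability <= 75%
    OK:       availability > 75%
    """
    if availability_pct < critical_below:
        return CRITICAL
    if availability_pct <= warning_max:
        return WARNING
    return OK


if __name__ == "__main__":
    # Made-up availability values, for illustration only.
    for value in (68.5, 72.0, 91.3):
        print(value, availability_status(value))
```

Both thresholds are parameters so that the bands can be adjusted once the real probe is available for testing.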
Unknown Followup
See SA1.7-QR6 for more background information. A phone conference was held with JRA1 (https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=minutes&confId=716) in which the "unknown" issue was discussed. The following proposal was made for the follow-up of the unknown issue:
- Each month the COD team will send GGUS tickets to the NGIs (one ticket per NGI) with the list of sites that are above 10% of UNKNOWN
- In the ticket we will ask the NGI to investigate the issue and fix the problem
- The NGI closes the ticket, which indicates that it has received the notification
In the meantime we have put this proposal into practice.
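The monthly grouping step of this procedure (one ticket per NGI, listing the sites above 10% UNKNOWN) can be sketched as below. The function name, the input format, and the NGI and site names are hypothetical; how the UNKNOWN percentages are obtained and how the GGUS ticket is actually opened are outside the scope of this sketch.

```python
from collections import defaultdict


def sites_to_report(unknown_by_site, threshold_pct=10.0):
    """Group the sites above the UNKNOWN threshold by NGI.

    unknown_by_site is a hypothetical mapping of (ngi, site) -> percentage of
    the month spent in UNKNOWN status. The result maps each affected NGI to
    the sorted list of its sites, i.e. the content of one ticket per NGI.
    """
    tickets = defaultdict(list)
    for (ngi, site), unknown_pct in unknown_by_site.items():
        if unknown_pct > threshold_pct:
            tickets[ngi].append(site)
    return {ngi: sorted(sites) for ngi, sites in tickets.items()}


if __name__ == "__main__":
    # Made-up monthly figures, for illustration only.
    monthly = {
        ("NGI_A", "SITE-1"): 12.5,
        ("NGI_A", "SITE-2"): 4.0,
        ("NGI_B", "SITE-3"): 31.0,
    }
    for ngi, sites in sites_to_report(monthly).items():
        print(f"Ticket for {ngi}: sites above 10% UNKNOWN: {', '.join(sites)}")
```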
TPM
With the new SU “MPI User Support”, the problem of tickets bouncing back to TPM for lack of a proper support unit should be solved.
Instructions for the TPM on how to use the SU “Operations”: this SU is meant for managerial problems that concern operations and is not a catch-all. Its purpose is to provide a contact with the EGI.eu team that coordinates EGI operations for any technical and operational matter, and to handle requests from Resource Centers and Resource Infrastructure Providers that are willing to be integrated into the EGI production infrastructure. It is also the place to report any operational issue that is general and does not concern a specific Resource Infrastructure or the Grid middleware deployed.
Middleware-related issues that cannot be handled by TPM must be assigned to the DMSU. This includes configuration and documentation problems with the middleware. Deployment problems that concern a vast majority of the production sites, for which it is infeasible to open an individual ticket to every site/NGI, can be assigned by TPM to the Operations SU. The SU “Operations” staff can coordinate the handling of such incidents when the scale cannot be managed by TPM.
Network Support
- Started setting up IPv6 testbed
- Finalized IPv6 survey for NGIs
- Documented IPv6 status on Wiki
- Collaborated with HEPiX IPv6 WG setting up node at GARR and providing testbed verification scripts for GridFTP
- Enrolled new HINTS users for testing HINTS
- Defined workplan schedule for HINTS
- Disseminated information on PerfSONAR MDM tools to NGIs
3. Issues and Mitigation
Issue Description | Mitigation Description |
---|---|
Network support: suffered a major security incident at GARR | HINTS server hacked and unavailable. The security policy at GARR has been completely reviewed and hardened considerably. |
Network support: DownCollector required upgrade to GOCDB4 | Currently no manpower is available for this. |
4. Plans for the next period
Grid Oversight
The plan for the next period is to proceed with the current activities and to come up with a proposal to include test resources in the infrastructure.
TPM
Network Support
- Keep consolidating HINTS and providing the 64-bit architecture HINTS probe
- Continue the deployment campaign for HINTS
- Provide more structured support for PerfSONAR-MDM and promote it among the Tier-2 sites, in collaboration with DANTE
- Restart (after re-installation) production HINTS server at GARR