
EGI-InSPIRE:SA1.7-QR7






1. Task Meetings

Date (dd-mm-yyyy) | Indico agenda URL | Title | Outcome
--- | --- | --- | ---
17-11-2011 | https://www.egi.eu/indico/conferenceDisplay.py?confId=688 | COD | https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=minutes&confId=688
21-11-2011 | https://www.egi.eu/indico/conferenceDisplay.py?confId=695 | COD-EGI.eu | https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=1&confId=695
23-11-2011 | https://www.egi.eu/indico/conferenceDisplay.py?confId=701 | unknown meeting | https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=minutes&confId=701
05-12-2011 | https://www.egi.eu/indico/conferenceDisplay.py?confId=703 | COD | https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=minutes&confId=703
15-12-2011 | https://www.egi.eu/indico/conferenceDisplay.py?confId=708 | COD | https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=minutes&confId=708
19-12-2011 | https://www.egi.eu/indico/conferenceDisplay.py?confId=716 | COD (availability probe) | https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=minutes&confId=716
19-01-2012 | https://www.egi.eu/indico/conferenceDisplay.py?confId=803 | COD | https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=minutes&confId=803
24-01-2012 | https://www.egi.eu/indico/conferenceDisplay.py?confId=618 | OMB | https://www.egi.eu/indico/getFile.py/access?contribId=1&resId=0&materialId=slides&confId=618
13-01-2012 | https://www.egi.eu/indico/conferenceDisplay.py?confId=781 | IPv6 meeting | https://www.egi.eu/indico/conferenceDisplay.py?confId=781

2. Main Achievements

Grid Oversight

ROD teams newsletter

This quarter we published a ROD teams newsletter in November, December and January. The rationale behind the newsletter is described in the SA1.7-QR4 report.

ROD performance index

For background information on this, have a look at SA1.7-QR6, section RP OLA and ROD metrics. Since October we have been asking, through GGUS, every NGI with more than 10 items in the COD dashboard during one month to explain the reason for this result and how it plans to improve the situation. The good news is that we have seen a continuous decline in the number of items in the COD dashboard.

Non-OK Alarms Followup

For background information on this, have a look at SA1.7-QR6, section Non-OK Alarms Followup. We have continued this activity in QR7.

Availability followup

See SA1.7-QR6 for more background information. The availability probe was discussed in a phone conference with JRA1 (https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=minutes&confId=716). There will be a probe that meets the following specifications:

  • The probe only measures availability
  • The probe computes the availability 30 days in the past
  • The probe returns a WARNING when: 70% <= availability <= 75%
  • The probe returns a CRITICAL when: availability <70%

A test version of the probe will be available in March.
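
The thresholds listed above can be sketched as a small classification function. This is only an illustration under stated assumptions, not the probe discussed with JRA1: the function name `availability_status` is hypothetical, the return values assume the conventional Nagios plugin exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL), and the WARNING band is read as 70% <= availability <= 75%. How the 30-day availability figure itself is computed is not covered here.

```python
# Assumed: conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def availability_status(availability_pct: float) -> int:
    """Classify a 30-day availability figure (in percent).

    Hypothetical helper: the thresholds follow the specification above,
    everything else is an assumption made for illustration.
    """
    if availability_pct < 70.0:
        return CRITICAL
    if availability_pct <= 75.0:
        return WARNING
    return OK
```

With these boundaries a site at exactly 70% raises a WARNING rather than a CRITICAL, and a site at exactly 75% still raises a WARNING.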

Unknown Followup

See SA1.7-QR6 for more background information. The "unknown" issue was discussed in a phone conference with JRA1 (https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=minutes&confId=716). The following proposal was made for its follow-up:

  • Each month the COD team will send GGUS tickets to the NGIs (one ticket per NGI) with the list of sites that are above 10% UNKNOWN
  • In the ticket we will ask the NGI to investigate the issue and fix the problem
  • The NGI closes the ticket, which indicates that it has received the notification

In the meantime we have put this proposal into practice.
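
The monthly selection described in the proposal can be sketched as follows. This is a minimal illustration, not the tooling used by the COD team: the function name `sites_to_report`, the input layout (tuples of NGI, site and UNKNOWN percentage) and the example names are all hypothetical, and "above 10%" is read as strictly greater than 10.

```python
from collections import defaultdict

UNKNOWN_THRESHOLD_PCT = 10.0  # sites strictly above this value are reported


def sites_to_report(site_stats):
    """Group sites above the UNKNOWN threshold by NGI.

    site_stats: iterable of (ngi, site, unknown_pct) tuples (assumed layout).
    Returns a dict mapping each NGI to the sorted list of its affected sites,
    i.e. the content of the single ticket that NGI would receive.
    """
    per_ngi = defaultdict(list)
    for ngi, site, unknown_pct in site_stats:
        if unknown_pct > UNKNOWN_THRESHOLD_PCT:
            per_ngi[ngi].append(site)
    return {ngi: sorted(sites) for ngi, sites in per_ngi.items()}


# Example with made-up figures: only NGI_A gets a ticket.
example = [
    ("NGI_A", "site-3", 25.0),
    ("NGI_A", "site-1", 12.5),
    ("NGI_A", "site-2", 4.0),
    ("NGI_B", "site-9", 10.0),
]
tickets = sites_to_report(example)
```

An NGI with no site above the threshold does not appear in the result, so no ticket is opened for it.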

TPM

With the new SU “MPI User Support”, the problem of tickets bouncing back to TPM because of the lack of a proper support unit should be solved.

Instructions for the TPM on how to use the SU “Operations”: this SU is meant for managerial problems that concern operations and is not a catch-all. Its purpose is to provide a contact with the EGI.eu team that coordinates EGI operations for any technical and operational matter, and to handle requests from Resource Centers and Resource Infrastructure Providers that are willing to be integrated into the EGI production infrastructure. It is also the place to report any operational issue that is general and does not concern a specific Resource Infrastructure or the Grid middleware deployed.

Middleware-related issues that cannot be handled by TPM must be assigned to the DMSU. This includes configuration and documentation problems with the middleware. Deployment problems that concern a vast majority of the production sites, for which it is infeasible to open an individual ticket to every site/NGI, can be assigned by TPM to the Operations SU. The SU “Operations” staff can offer coordination of the handling of such incidents when the scale cannot be managed by TPM.
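
The assignment guidance above can be summarised as a small decision function. This is a rough sketch for illustration only, not a GGUS feature: the `Issue` categories and the function `suggest_support_unit` are hypothetical names, and cases the guidance does not cover return `None` so that they stay with TPM's own judgement.

```python
from enum import Enum, auto


class Issue(Enum):
    """Hypothetical ticket categories derived from the guidance above."""
    MPI = auto()
    MIDDLEWARE = auto()             # includes configuration and documentation problems
    INTEGRATION_REQUEST = auto()    # Resource Center / Infrastructure Provider joining EGI
    GENERAL_OPERATIONS = auto()     # not tied to one infrastructure or middleware
    WIDESPREAD_DEPLOYMENT = auto()  # affects a vast majority of production sites
    OTHER = auto()


def suggest_support_unit(issue, tpm_can_handle=False):
    """Return the support unit suggested by the guidance, or None if not covered."""
    if issue is Issue.MPI:
        return "MPI User Support"
    if issue is Issue.MIDDLEWARE:
        # Only middleware issues TPM cannot handle itself go to the DMSU.
        return "TPM" if tpm_can_handle else "DMSU"
    if issue in (Issue.INTEGRATION_REQUEST,
                 Issue.GENERAL_OPERATIONS,
                 Issue.WIDESPREAD_DEPLOYMENT):
        return "Operations"
    return None  # "Operations" is not a catch-all
```

Returning `None` for anything else reflects the point that the SU “Operations” must not be used as a default destination.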

Network Support

  1. Started setting up an IPv6 testbed
  2. Finalized the IPv6 survey for NGIs
  3. Documented the IPv6 status on the Wiki
  4. Collaborated with the HEPiX IPv6 WG by setting up a node at GARR and providing testbed verification scripts for GridFTP
  5. Enrolled new HINTS users for testing HINTS
  6. Defined the workplan schedule for HINTS
  7. Disseminated information on PerfSONAR MDM tools to NGIs

3. Issues and Mitigation

Issue Description | Mitigation Description
--- | ---
Network support: suffered a major security incident at GARR; HINTS server hacked and unavailable | Security policy at GARR has been completely reviewed and hardened considerably.
Network support: DownCollector required an upgrade to GOCDB4 | Currently no manpower available for this

4. Plans for the next period

Grid Oversight

The plan for the next period is to proceed with the current activities and to come up with a proposal to include test resources in the infrastructure.

TPM

Network Support

  1. Keep consolidating HINTS and provide the 64-bit architecture HINTS probe
  2. Continue the deployment campaign for HINTS
  3. Provide more structured support for PerfSONAR-MDM and disseminate information about it to the Tier-2 sites, in collaboration with DANTE
  4. Restart (after re-installation) production HINTS server at GARR