
Difference between revisions of "EGI-InSPIRE:SA1.7-QR4"

 
(26 intermediate revisions by 2 users not shown)
Line 1: Line 1:
__NOTOC__
{{Template:EGI-Inspire menubar}}
 
{{Template:Inspire_reports_menubar}}
{{TOC_right}}
= 1. Task Meetings =
= 1. Task Meetings =
<!--
<!--
Line 11: Line 14:
! style="width: 10%" | Outcome
! style="width: 10%" | Outcome
|-
|-
|26-01-2010
|14-02-2011
|https://www.egi.eu/indico/conferenceDisplay.py?confId=315
|https://www.egi.eu/indico/conferenceDisplay.py?confId=327
|CODOC meeting with COO
|https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=minutes&confId=315
|-
|26-01-2010
|https://www.egi.eu/indico/conferenceDisplay.py?confId=314
|CODOC
|https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=minutes&confId=314
|-
|01-12-2010
|https://www.egi.eu/indico/conferenceDisplay.py?confId=227
|CODOC
|https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=minutes&confId=227
|-
|15-12-2010
|https://www.egi.eu/indico/conferenceDisplay.py?confId=235
|CODOC
|CODOC
|https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=minutes&confId=235
|https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=minutes&confId=327
|-
|-
|13-01-2011
|14-03-2011
|https://www.egi.eu/indico/conferenceDisplay.py?confId=273
|https://www.egi.eu/indico/conferenceDisplay.py?confId=429
|CODOC
|CODOC
|https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=minutes&confId=273
|https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=minutes&confId=429
|-
|-
|19-01-2011
|31-03-2011
|https://www.egi.eu/indico/conferenceDisplay.py?confId=249
|https://www.egi.eu/indico/conferenceDisplay.py?confId=437
|CODOC
|COD Dashboard meeting
|https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=minutes&confId=249
|https://www.egi.eu/indico/conferenceDisplay.py?confId=437
|-
|-
|14-01-2011
|13-04-2011
|https://www.egi.eu/indico/conferenceDisplay.py?confId=271
|https://www.egi.eu/indico/conferenceDisplay.py?confId=457
|EGI Net Sup proposal Task Force meeting nr. 5
|CODOC F2F
|https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=minutes&confId=271
|https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=minutes&confId=457
|-
|-
|10-12-2011
|07-04-2011
|https://www.egi.eu/indico/conferenceDisplay.py?confId=226
|EGI Net Sup proposal Task Force meeting nr. 4
|
|
|Network Update
|https://wiki.egi.eu/w/images/6/6b/EGI_Network_Support_coordination-2011-4-7.doc
|-
|-
|22-11-2011
|https://www.egi.eu/indico/conferenceDisplay.py?confId=222
|EGI Net Sup proposal Task Force meeting nr. 3
|https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=minutes&confId=222
|-
|-
|10-11-2011
|weekly
|
|https://www.egi.eu/indico/categoryDisplay.py?categId=27
|EGI Net Sup proposal Task Force meeting nr. 2
|shopping list meeting
|
|https://www.egi.eu/indico/categoryDisplay.py?categId=27
|-
|-
|}
|}
Line 68: Line 52:
-->
-->
== Grid Oversight ==
== Grid Oversight ==
1. '''ROD teams newsletter'''
'''ROD teams newsletter'''
 
The transition from EGEE to EGI InSPIRE brought many changes. For Operations, the EGEE Regional Operations Centres (ROCs) are being dismantled and their responsibilities transferred to the NGIs, or this process has already been completed. In the EGI era, ROD teams monitor the quality of sites in their country or region, whereas COD is responsible for global oversight of the whole EGI infrastructure. The aim is to provide a high-quality grid infrastructure to the user communities.
The transition from EGEE to EGI InSPIRE brought many changes. For Operations, the EGEE Regional Operations Centres (ROCs) are being dismantled and their responsibilities transferred to the NGIs, or this process has already been completed. In the EGI era, ROD teams monitor the quality of sites in their country or region, whereas COD is responsible for global oversight of the whole EGI infrastructure. The aim is to provide a high-quality grid infrastructure to the user communities.
These changes have also led us to think about how COD and ROD will interact in this new setting. During the Grid Oversight session at the EGI Tech Forum it became clear that people find it cumbersome to travel to regular face-to-face meetings. Nevertheless, we feel the need to create and maintain a coherent and lively Grid Oversight community, with interaction between ROD and COD that goes beyond the dashboards. In our view this is necessary to create a top-quality grid infrastructure for our users. For this reason we have created this newsletter. Its purpose is to inform you about recent and upcoming developments related to Grid Oversight and to show you the metrics indicating how well we did in the past month. It is our intention to publish a newsletter every month.
These changes have also led us to think about how COD and ROD will interact in this new setting. During the Grid Oversight session at the EGI Tech Forum it became clear that people find it cumbersome to travel to regular face-to-face meetings. Nevertheless, we feel the need to create and maintain a coherent and lively Grid Oversight community, with interaction between ROD and COD that goes beyond the dashboards. In our view this is necessary to create a top-quality grid infrastructure for our users. For this reason we have created this newsletter. Its purpose is to inform you about recent and upcoming developments related to Grid Oversight and to show you the metrics indicating how well we did in the past month. We have published newsletters since December 2011 and will continue to do so on a monthly basis.
2. '''Input given on approved Procedures'''
''New NGI creation process coordination'' The purpose of this document is to describe clearly the actions and related steps to be undertaken for integrating an NGI (or a group of NGIs) into the EGI operational structure. The newest version became effective as of Dec 1st.


''Operations Centre decommission'' The purpose of this document is to describe clearly the actions and related steps to be undertaken for decommissioning an Operations Centre. This procedure became effective as of Dec 1st.
'''ROD session at EGI UF'''


''COD escalation procedure'' The purpose of this document is to define an escalation procedure for operational problems. The newest version became effective as of Dec 1st. This procedure is essential for ROD work and we encourage you to read it.
At the EGI User Forum in Vilnius we organised a ROD teams session with four presentations. First, Marcin Radecki discussed the Grid Oversight work. Second, Gonçalo Borges from NGI IBERGRID presented the IBERGRID operations and their experience with the regionalised operational tools. Finally, a slot on operational tools included two presentations: Cyril L'Orphelin on the status and roadmap of the operational portal, and Emir Imamagic on the SAM roadmap. The presentations can be downloaded from: https://www.egi.eu/indico/sessionDisplay.py?sessionId=9&confId=207#20110411. We were very pleased that no fewer than 35 people attended this session.


''Making a Nagios test an operations test'' The purpose of this document is to describe clearly the actions and related steps to be undertaken for making a Nagios test an operations test. A Nagios test is set as an operations test to enable the operations dashboard to display an alarm if the test fails. This procedure will become effective as of Jan 1st.
'''Tutorial videos'''
3. '''Renaming of "critical" tests'''
“Operations test” should be used for tests raising alarms for ROD.
Recently it was decided that a new name should be assigned to tests that raise alarms in the operations dashboard. COD used to call such a test a “critical test”, but this caused confusion with the critical Nagios test status. In a poll, the name that gained the majority was “operations test”.


== Network Support ==
The COD team has started using video tutorials to pass information to ROD members. You can now learn your duties by watching them!
The main achievement is the outcome of the EGI Network Support proposal Task Force: a structured proposal built around seven identified use cases, formalised and discussed at the face-to-face meeting in Amsterdam on January 24, 2011.
The series will contain 6 parts:
( https://www.egi.eu/indico/conferenceTimeTable.py?confId=153#20110124)
    1. How to become a ROD member – the 7 steps required to become a ROD member
The community has been introduced to the seven use cases:  the GGUS Support System, the PERT team,  Scheduled maintenances, Network troubleshooting on demand, e2e Scheduled Monitoring,  DownCollector, Policy and Collaboration.  
    2. Operations tools – a brief introduction of operations tools needed by a ROD member to perform their duties
For each one of them the specific proposal from the task force has been described and discussed within the EGI operations community.  
    3. How to handle alarms – instructions on how to manage alarms on the Operations Portal (ticket creation, closing and masking alarms)
The proposal from the TF has been based on a previously distributed questionnaire to the NGIs.
    4. How to handle tickets – instructions on how to manage tickets on the Operations Portal (ticket creation, updating and closing tickets)
Results are published on the EGI Operations Wiki on  https://wiki.egi.eu/wiki/NST.  
    5. Issues escalated to COD – an introduction of cases which are escalated to COD and how to deal with them
A roadmap has been agreed upon for each of them.
    6. Operations portal – a brief introduction of the Operations Portal tools
In particular the task committed to:
Currently the first two videos are available, and you can find links to them on the ROD wiki page: [[wiki/Grid_operations_oversight/ROD#Videos_tutorials]]. All videos will be uploaded to YouTube soon.


1) Set up a Network Support unit within GGUS for network-related issues. GARR has agreed to start providing the required effort voluntarily, at least on a preliminary basis, in order to assess its long-term sustainability and to reconfirm this commitment in the coming months. The GGUS workflow has been identified and agreed.
== TPM ==
TPM activity is carried out by two teams that are in permanent contact, so no extra meetings are required to organize the daily work. TPM can be considered a very reliable service. A prototype of the Technology Helpdesk (EMI/IGE/SAGA) was presented in Vilnius. It is a separate GGUS instance for middleware-related tickets. TPM should be able to identify these tickets and assign them to DMSU.


2) Provide, maintain and support an on-demand network troubleshooting tool called HINTS, voluntarily provided (unfunded) by the French NGI. A central HINTS server instance will be made available at GARR, and the French NGI will start a pilot deployment of the tool once the central server is available.
== Network Support ==
In Quarter Four, IGI/GARR staff coordinated the Network Support task for EGI in accordance with IGI’s commitments in the EGI-Inspire DoW. In particular, a videoconference among the partners contributing to the Net Sup task force was held on April 7, 2011, to review the overall status of the tools being further developed by volunteering NGIs to improve their usability and reliability. FranceGrilles and UREC reported on the status of HINTS, GARR and UREC summarized the current status of NetJobs, and RedIRIS summarized the status of the PerfSONAR e2e Monitor live CD. All tools appear to be at a satisfactory level of development and deployment already. It was decided to complete the development of the tools, test them extensively on a distributed set of EGI sites, and be ready to present them at a stage mature enough for deployment at the next EGI User Forum in Lyon, as suggested by the SA1 activity of EGI-Inspire during the face-to-face meeting in Amsterdam on January 24, 2011 (Network Supported F2F OMB).

In Q4, IGI/GARR staff have been deploying the NetJobs server in Rome and transferring the DB from the development server at UREC in Paris. The front end at http://netjobs.dir.garr.it currently connects to a DB located at GARR. A platform for further development of both the front end and the back-end DB, based on NetBeans and pgAdmin (for PostgreSQL), has also been set up to enable further refinement of the tool in the coming months. Currently 8 sites in France and Italy belong to the NetJobs testbed, and the intention is to extend this set in the coming weeks.

A session (informal discussion) on the further tasks Network Support should be involved in, besides the three tools currently being provided, was organized at the EGI User Forum in Vilnius on Monday, April 11, 2011. Few participants attended the meeting (FranceGrilles, Switch/Swing, METALAB/CESNET, GARR, KIT/D-Grid (GGUS)). Support for IPv6 was identified as one of the tasks requiring more effort. Although requirements for the middleware normally come from the UCB and the user community in general, there was consensus to start a discussion on this item with both the EGI and EMI communities.

In Q4 GARR also attended a PerfSONAR development meeting in Poznan, Poland (February 7-8, 2011) and liaised with the NREN/GEANT community about the PerfSONAR tools for multi-domain network monitoring and their possible application to EGI. A permanent communication channel with GEANT/DANTE (SA2, NA4) has been established around PerfSONAR and its possible application to EGI. In particular, IGI/GARR staff joined the PerfSONAR User Panel in consideration of their role within the EGI-Inspire project and EGI.

Finally, in Q4 the FAQ section of the GGUS Network Support Unit was updated to describe the newly established procedures for EGI network support, as discussed at the end of January in Amsterdam. A first installation of the HINTS server has been performed and is currently reachable at GARR at http://grid-4.dir.garr.it; it will be further tested in the coming months, together with the deployment of probes at various pilot sites.


3) Provide and maintain a perfSONAR-based live CD distribution for on demand and scheduled e2e monitoring, based on perfSONAR-MDM, voluntarily contributed by the Spanish NGI and NREN  RedIRIS. Later on, a dedicated GUI will be made available. Historical Data will be stored in a DB.
To summarize the main achievements:


4) Keep a permanent liaison with the GN3 PerfSONAR community, assess the tools provided by pS, and provide feedback to the GN3 community. Report periodically on new features and progress around the pS-based tools.
- Defined the long term strategy for Network Support.
 
- Installed the HINTS server in France and Italy.
5) Further refine the NetJobs tool with respect to the functionality provided and the usability of the web interface, providing a central server instance at GARR; promote the tool within the EGI Net Sup operations community, especially for the basic metrics (n. hops, RTT, available bandwidth).
- Written the NetSup GGUS FAQ section for GGUS users.
 
- Defined the workflow for network support within GGUS.
6) Organize a general questionnaire for the NRENs, aimed at better understanding their interaction model with the NGIs, their best practices and the tools they are familiar with, and asking about their availability to provide a PERT contact point for the EGI project.
- Formalised liaison with GN3 and EGI participation in the perfSONAR User Panel.


= 3. Issues and Mitigation =
= 3. Issues and Mitigation =
Line 114: Line 96:
|-
|-
|'''Grid Oversight:''' None
|'''Grid Oversight:''' None
|
|-
|'''TPM:''' None
|
|-
|'''Network Support:''' None
|
|
|}
|}
Line 122: Line 110:
1. Continue ROC transition to NGIs.  
1. Continue ROC transition to NGIs.  


2. Initiate investigation on how to have a consistent and coherent integration of nonproduction resources in the infrastructure.
2. Continue investigation of the impact on operations support model related to new  
 
3. Initiate investigation of the impact on operations support model related to new  
middlewares in EGI.
middlewares in EGI.
   
   
4. Initiate the investigation on how to improve availability and reliability metrics.
3. Continue the investigation on how to improve availability and reliability metrics.
 
4. Evaluation of upcoming new releases of the operational dashboard.
 
5. Finish the tutorial videos.
 
== TPM ==
Plans will be worked out to further automate TPM's work and to improve the monitoring of untouched tickets. A workshop with everyone involved in the TPM task could help with this.
In preparation for this workshop, ticket statistics will be compiled and analysed.


5. Evaluation of upcoming new releases of the operational dashboard.


== Network Support ==
== Network Support ==
 
Fully deploy the operational tools (HINTS, NetJobs, e2eMON Live perfSONAR CD) on a large testbed and test them with a view to production usage by EGI. Organize a questionnaire for the NRENs to clarify NREN-NGI interaction models. Start a discussion on IPv6 in the community.
Plans for the next period.

Latest revision as of 19:13, 6 January 2015




1. Task Meetings

Date (dd-mm-yyyy) | Url Indico Agenda | Title | Outcome
14-02-2011 | https://www.egi.eu/indico/conferenceDisplay.py?confId=327 | CODOC | https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=minutes&confId=327
14-03-2011 | https://www.egi.eu/indico/conferenceDisplay.py?confId=429 | CODOC | https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=minutes&confId=429
31-03-2011 | https://www.egi.eu/indico/conferenceDisplay.py?confId=437 | COD Dashboard meeting | https://www.egi.eu/indico/conferenceDisplay.py?confId=437
13-04-2011 | https://www.egi.eu/indico/conferenceDisplay.py?confId=457 | CODOC F2F | https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=minutes&confId=457
07-04-2011 | | Network Update | https://wiki.egi.eu/w/images/6/6b/EGI_Network_Support_coordination-2011-4-7.doc
weekly | https://www.egi.eu/indico/categoryDisplay.py?categId=27 | shopping list meeting | https://www.egi.eu/indico/categoryDisplay.py?categId=27

2. Main Achievements

Grid Oversight

ROD teams newsletter

The transition from EGEE to EGI InSPIRE brought many changes. For Operations, the EGEE Regional Operations Centres (ROCs) are being dismantled and their responsibilities transferred to the NGIs, or this process has already been completed. In the EGI era, ROD teams monitor the quality of sites in their country or region, whereas COD is responsible for global oversight of the whole EGI infrastructure. The aim is to provide a high-quality grid infrastructure to the user communities. These changes have also led us to think about how COD and ROD will interact in this new setting. During the Grid Oversight session at the EGI Tech Forum it became clear that people find it cumbersome to travel to regular face-to-face meetings. Nevertheless, we feel the need to create and maintain a coherent and lively Grid Oversight community, with interaction between ROD and COD that goes beyond the dashboards. In our view this is necessary to create a top-quality grid infrastructure for our users. For this reason we have created this newsletter. Its purpose is to inform you about recent and upcoming developments related to Grid Oversight and to show you the metrics indicating how well we did in the past month. We have published newsletters since December 2011 and will continue to do so on a monthly basis.

ROD session at EGI UF

At the EGI User Forum in Vilnius we organised a ROD teams session with four presentations. First, Marcin Radecki discussed the Grid Oversight work. Second, Gonçalo Borges from NGI IBERGRID presented the IBERGRID operations and their experience with the regionalised operational tools. Finally, a slot on operational tools included two presentations: Cyril L'Orphelin on the status and roadmap of the operational portal, and Emir Imamagic on the SAM roadmap. The presentations can be downloaded from: https://www.egi.eu/indico/sessionDisplay.py?sessionId=9&confId=207#20110411. We were very pleased that no fewer than 35 people attended this session.

Tutorial videos

The COD team has started using video tutorials to pass information to ROD members. You can now learn your duties by watching them! The series will contain 6 parts:

   1. How to become a ROD member – the 7 steps required to become a ROD member
   2. Operations tools – a brief introduction to the operations tools a ROD member needs to perform their duties
   3. How to handle alarms – instructions on how to manage alarms on the Operations Portal (ticket creation, closing and masking alarms)
   4. How to handle tickets – instructions on how to manage tickets on the Operations Portal (ticket creation, updating and closing tickets)
   5. Issues escalated to COD – an introduction of cases which are escalated to COD and how to deal with them
   6. Operations portal – a brief introduction of the Operations Portal tools

Currently the first two videos are available, and you can find links to them on the ROD wiki page: wiki/Grid_operations_oversight/ROD#Videos_tutorials. All videos will be uploaded to YouTube soon.

TPM

TPM activity is carried out by two teams that are in permanent contact, so no extra meetings are required to organize the daily work. TPM can be considered a very reliable service. A prototype of the Technology Helpdesk (EMI/IGE/SAGA) was presented in Vilnius. It is a separate GGUS instance for middleware-related tickets. TPM should be able to identify these tickets and assign them to DMSU.

Network Support

In Quarter Four, IGI/GARR staff coordinated the Network Support task for EGI in accordance with IGI’s commitments in the EGI-Inspire DoW. In particular, a videoconference among the partners contributing to the Net Sup task force was held on April 7, 2011, to review the overall status of the tools being further developed by volunteering NGIs to improve their usability and reliability. FranceGrilles and UREC reported on the status of HINTS, GARR and UREC summarized the current status of NetJobs, and RedIRIS summarized the status of the PerfSONAR e2e Monitor live CD. All tools appear to be at a satisfactory level of development and deployment already. It was decided to complete the development of the tools, test them extensively on a distributed set of EGI sites, and be ready to present them at a stage mature enough for deployment at the next EGI User Forum in Lyon, as suggested by the SA1 activity of EGI-Inspire during the face-to-face meeting in Amsterdam on January 24, 2011 (Network Supported F2F OMB).

In Q4, IGI/GARR staff have been deploying the NetJobs server in Rome and transferring the DB from the development server at UREC in Paris. The front end at http://netjobs.dir.garr.it currently connects to a DB located at GARR. A platform for further development of both the front end and the back-end DB, based on NetBeans and pgAdmin (for PostgreSQL), has also been set up to enable further refinement of the tool in the coming months. Currently 8 sites in France and Italy belong to the NetJobs testbed, and the intention is to extend this set in the coming weeks.

A session (informal discussion) on the further tasks Network Support should be involved in, besides the three tools currently being provided, was organized at the EGI User Forum in Vilnius on Monday, April 11, 2011. Few participants attended the meeting (FranceGrilles, Switch/Swing, METALAB/CESNET, GARR, KIT/D-Grid (GGUS)). Support for IPv6 was identified as one of the tasks requiring more effort. Although requirements for the middleware normally come from the UCB and the user community in general, there was consensus to start a discussion on this item with both the EGI and EMI communities.

In Q4 GARR also attended a PerfSONAR development meeting in Poznan, Poland (February 7-8, 2011) and liaised with the NREN/GEANT community about the PerfSONAR tools for multi-domain network monitoring and their possible application to EGI. A permanent communication channel with GEANT/DANTE (SA2, NA4) has been established around PerfSONAR and its possible application to EGI. In particular, IGI/GARR staff joined the PerfSONAR User Panel in consideration of their role within the EGI-Inspire project and EGI.

Finally, in Q4 the FAQ section of the GGUS Network Support Unit was updated to describe the newly established procedures for EGI network support, as discussed at the end of January in Amsterdam. A first installation of the HINTS server has been performed and is currently reachable at GARR at http://grid-4.dir.garr.it; it will be further tested in the coming months, together with the deployment of probes at various pilot sites.

To summarize the main achievements:

- Defined the long term strategy for Network Support.
- Installed the HINTS server in France and Italy.
- Written the NetSup GGUS FAQ section for GGUS users.
- Defined the workflow for network support within GGUS.
- Formalised liaison with GN3 and EGI participation in the perfSONAR User Panel.

3. Issues and Mitigation

Issue Description Mitigation Description
Grid Oversight: None
TPM: None
Network Support: None

4. Plans for the next period

Grid Oversight

1. Continue ROC transition to NGIs.

2. Continue investigation of the impact on operations support model related to new middlewares in EGI.

3. Continue the investigation on how to improve availability and reliability metrics.

4. Evaluation of upcoming new releases of the operational dashboard.

5. Finish the tutorial videos.

TPM

Plans will be worked out to further automate TPM's work and to improve the monitoring of untouched tickets. A workshop with everyone involved in the TPM task could help with this. In preparation for this workshop, ticket statistics will be compiled and analysed.


Network Support

Fully deploy the operational tools (HINTS, NetJobs, e2eMON Live perfSONAR CD) on a large testbed and test them with a view to production usage by EGI. Organize a questionnaire for the NRENs to clarify NREN-NGI interaction models. Start a discussion on IPv6 in the community.