
Difference between revisions of "Agenda-13-06-2016"

 
(26 intermediate revisions by the same user not shown)
Line 7: Line 7:


= UMD/CMD  =
* UMD 3.14.2 RC ready
** problem with dependencies generated within EPEL: Package voms-clients is obsoleted by voms-clients-cpp, trying to install voms-clients-cpp-2.0.13-1.el6.x86_64 instead
** proposed solution: set repository priorities so that UMD comes first (thanks Mattias)


* UMD 4 next release in preparation, release scheduled by June
** first update for SL6
** adding several products, see products in verification
* CMD
** RT setup: IT support to configure CMD together with UMD, discussion in progress
** Verification process
*** starting with BDII info provider
*** external infrastructure needed to perform the tests
** Staged-Rollout: TBD
== Staged rollout updates  ==


Line 27: Line 40:


= Operations =
== Central monitoring ==
* this has been postponed due to technical issues in setting up the central instance
== RFC proxy will be default ==
* moving to RFC proxy instead of legacy proxy
* in production for a while now; everybody is using RFC
* we will ask the VOMS TP to make a small modification to the VOMS client, changing the default


== EGI Operations Support activities stopped ==
Line 37: Line 60:


== Monthly Availability/Reliability ==
A/R report on ARGO: http://argo.egi.eu/lavoisier/ngi_reports?accept=html
List of the underperforming RCs for (at least) 3 consecutive months:
* AfricaArabia https://ggus.eu/?mode=ticket_info&ticket_id=117094: main problems with the monitoring system, waiting for the release of the central one
** ASRT
** DZ-01-ARN (recovered)
** EG-ZC-T3: unresponsive for two months, must be suspended
** ZA-UJ
* AsiaPacific: (since February) https://ggus.eu/index.php?mode=ticket_info&ticket_id=121222
** IN-DAE-VECC-02 (miscellaneous issues)
** MY-UPM-BIRUNI-01
* NGI_DE: https://ggus.eu/?mode=ticket_info&ticket_id=121975
** UNI-SIEGEN-HEP
* NGI_HR: https://ggus.eu/index.php?mode=ticket_info&ticket_id=120573
** egee.fesb.hr: issue with the SE which affected the whole NGI; the situation has improved and they plan to decommission it this year.
* NGI_IL: (since last month) https://ggus.eu/index.php?mode=ticket_info&ticket_id=121223
** IL_IUCC_IG: suspended on June 6th
* '''NGI_MARGI https://ggus.eu/index.php?mode=ticket_info&ticket_id=118465 no monitoring data since January'''
* NGI_MD: https://ggus.eu/index.php?mode=ticket_info&ticket_id=120578
** the only site MD-02-IMI was suspended in March for security reasons, asked for news
* NGI_NDGF: https://ggus.eu/index.php?mode=ticket_info&ticket_id=121985
** EENet problem with the probe
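
The "3 consecutive months" criterion above can be illustrated with a minimal sketch. The function name, the 80% threshold, and the sample figures are assumptions made for illustration only; they are not taken from the ARGO report or from the OLA.

```python
from typing import Dict, List


def underperforming(history: Dict[str, List[float]],
                    threshold: float = 80.0,
                    months: int = 3) -> List[str]:
    """Return sites whose last `months` monthly values are all below `threshold`.

    `history` maps a site name to its monthly figures in percent, oldest first.
    The threshold is a hypothetical value chosen for this example.
    """
    flagged = []
    for site, values in history.items():
        recent = values[-months:]
        # A site with fewer than `months` data points is never flagged.
        if len(recent) == months and all(v < threshold for v in recent):
            flagged.append(site)
    return sorted(flagged)


# Hypothetical monthly figures, not real A/R data.
sample = {
    "SITE-A": [92.0, 75.0, 60.0, 70.0],
    "SITE-B": [70.0, 85.0, 79.0, 78.0],
    "SITE-C": [50.0, 40.0],
}
print(underperforming(sample))  # ['SITE-A']
```

Only SITE-A is flagged: SITE-B recovered in one of the last three months, and SITE-C has too little history to evaluate.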


== Decommissioning SL5 ==
* Tracked on [https://wiki.egi.eu/wiki/SL5_retirement SL5_retirement wiki]
* Sites still deploying unsupported service end-points risk suspension, unless documented technical reasons prevent a Site Admin from updating these end-points https://wiki.egi.eu/wiki/PROC16_Decommissioning_of_unsupported_software#Escalation_phase see step 7
* '''Status https://wiki.egi.eu/wiki/SL5_retirement#2016-06-13_Overall_status''' reported below.
* '''from this week on, EGI Operations can suspend sites that host SL5 services in production that are not in downtime'''
** tickets will be opened
=== Status and actions ===
* 2 [https://midmon.egi.eu/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_Top-BDII&style=detail&servicestatustypes=16&hoststatustypes=15&serviceprops=0&hostprops=0 Top-BDII]
* 15 [https://midmon.egi.eu/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_Site-BDII&style=detail&servicestatustypes=16&hoststatustypes=15&serviceprops=0&hostprops=0 Site-BDII]
* 5 [https://midmon.egi.eu/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_MyProxy&style=detail&servicestatustypes=16&hoststatustypes=15&serviceprops=0&hostprops=0 MyProxy]
* 7 [https://midmon.egi.eu/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_WMS&style=detail&servicestatustypes=16&hoststatustypes=15&serviceprops=0&hostprops=0 WMS] and 7 [https://midmon.egi.eu/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_LB&style=detail&servicestatustypes=16&hoststatustypes=15&serviceprops=0&hostprops=0 LB]
* 2 [https://midmon.egi.eu/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_VOMS&style=detail&servicestatustypes=16&hoststatustypes=15&serviceprops=0&hostprops=0 VOMS]
* 1 [https://midmon.egi.eu/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_emi.ARGUS&style=detail&servicestatustypes=16&hoststatustypes=15&serviceprops=0&hostprops=0 ARGUS]
* 12 [https://midmon.egi.eu/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_CREAM-CE&style=detail&servicestatustypes=16&hoststatustypes=15&serviceprops=0&hostprops=0 CREAM-CE]
* 0 [https://midmon.egi.eu/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_QCG.Computing&style=detail&servicestatustypes=16&hoststatustypes=15&serviceprops=0&hostprops=0 QCG Computing]
* 2 [https://midmon.egi.eu/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_SRM&style=detail&servicestatustypes=16&hoststatustypes=15&serviceprops=0&hostprops=0 STORM]


== NGIs argus server not properly configured ==
Some time ago (more than a year I think), EGI ran a campaign to have
NGIs run an "NGI Argus" service. This campaign resulted in new services
being added to goc-db for each NGI.
Unfortunately, as explained in the OMB in February, our monitoring is
currently unable to check the deployment of these services:
- For 6 services, our monitoring cannot contact the NGI Argus
- For 18 services, our monitoring is not authorized to get the right
information from the NGI Argus
- For 1 service, our monitoring indicates that the NGI Argus is not
properly configured and does not pull the rules from argus.cern.ch
In the end, only 5 services are properly configured and monitored!
The changes are rather easy:
* If we can't contact them, the site needs to make sure that there is no firewall blocking 195.251.55.111 from accessing the argus 'pap' port
* If we are not authorized, the site needs to add the right ACE to their argus authorization
pap-admin add-ace 'CN=srv-111.afroditi.hellasgrid.gr,OU=afroditi.hellasgrid.gr,O=HellasGrid, C=GR' 'POLICY_READ_LOCAL|POLICY_READ_REMOTE|CONFIGURATION_READ'
* If the argus server is not properly configured (no rule pulled), the site has to follow http://wiki.nikhef.nl/grid/Argus_Global_Banning_Setup_Overview#NGI_Argus
The '''current status''' of the infrastructure can be found:
* In the secmon nagios (not sure you have access to this):
https://secmon.egi.eu/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_ngi.ARGUS&style=detail&sorttype=1&sortoption=3
* On the security dashboard:
https://operations-portal.egi.eu/csiDashboard/ngi/any/tab/list/filter/monitoring/page/list?tsid=4
On the security dashboard, each NGI should have an "argus-ban" result:
* "Ok" means ok
* "Unknown" means that we can't contact them
* "High" means that we are not authorized
* "Critical" means that Argus is not pulling rules from argus.cern.ch
The parent ticket is https://ggus.eu/?mode=ticket_info&ticket_id=120770
'''2016_06_13 UPDATE'''
pending tickets:
* NGI_MD https://ggus.eu/?mode=ticket_info&ticket_id=120746
* NGI_FI https://ggus.eu/?mode=ticket_info&ticket_id=120747
* NGI_MARGI https://ggus.eu/?mode=ticket_info&ticket_id=120765


== FedCloud status ==


=== A/R profile ===
* only [https://ggus.eu/?mode=ticket_info&ticket_id=120995 GoeGrid] (NGI_DE) is not publishing images
* open [https://ggus.eu/?mode=ticket_info&ticket_id=121262 tickets] to sites where dteam is not working: [https://ggus.eu/index.php?mode=ticket_info&ticket_id=121265 MK-04-FINKICLOUD] -> this can lead to suspension as per OLA!
* '''cloud profiles still under approval at OMB''', email to be circulated by EGI Operations for approval; if the profiles are approved, the new profile will be used for A/R from July 1st, and suspensions will start from August 1st


{| border=1
| '''A/R Profile'''
| '''March'''
| '''April'''
| '''May'''
|-
| improvements
| 2
| 6
| 5
|-
| unchanged
| 11
| 7
| 5
|-
| worsening
| 9
| 10
| 12
|}
* CYFRONET-CLOUD (+100%): in the old profile it fails the accounting test
* GoeGRID (+80.7%): in the old profile it fails the cdmi test
* TR-FC1-ULAKBIM (+47.59%): it was failing the accounting test in the old profile
* HG-09-Okeanos-Cloud: https://ggus.eu/index.php?mode=ticket_info&ticket_id=122012 (SOLVED, updated the cert)
** failures with the probes:
** eu.egi.cloud.OCCI-Context-ops: CATEGORIES CRITICAL - SSL_connect returned=1 errno=0 state=error: certificate verify failed
** eu.egi.cloud.OCCI-VM-ops: CRITICAL - SSL connection with "https://okeanos-occi2.hellasgrid.gr:9000/" could not be established! SSL_connect
* '''MK-04-FINKICLOUD unreachable'''
* NCG-INGRID-PT (+26.74%): https://ggus.eu/index.php?mode=ticket_info&ticket_id=122013 (a new server is going to be put in production, decommissioning the old one)
** failures mainly with the cloud probes:
** eu.egi.cloud.OCCI-VM-ops (sometimes warning, sometimes critical): WARNING - "http://aurora.ncg.ingrid.pt:8787" failed to instantiate a COMPUTE instance in the given timeframe! Timeout: 300s
** eu.egi.cloud.OpenStack-VM-ops: Critical: could not fetch flavor ID, endpoint does not correctly exposes available flavors: 110 Connection timed out
* SCAI (-21.61%) https://ggus.eu/index.php?mode=ticket_info&ticket_id=122015 (CAs not completely updated)
** some repeated failures with the CA probes
** also eu.egi.cloud.OCCI-VM-ops CRITICAL - Unexpected response from https://fc.scai.fraunhofer.de:8787/! Net::HTTP::Post failed! HTTP Response status: [500] Internal Server Error : The server has either erred or is incapable of performing the requested operation.
* UPV-GRyCAP (-24.56%) https://ggus.eu/index.php?mode=ticket_info&ticket_id=122014 (SOLVED, CAs updated)
** it is still failing the eu.egi.OCCI-IGTF probe
** org.nagios.OCCI-TCP: 05-11-2016 17:56:27 Connection refused


= AOB  =
Line 75: Line 205:


* '''11 Jul 2016''' https://indico.egi.eu/indico/event/3003/
* new calendar available until end of 2016 https://indico.egi.eu/indico/category/32/

Latest revision as of 10:06, 14 June 2016


General information

UMD/CMD

  • UMD 3.14.2 RC ready
    • problem with dependencies generated within EPEL: Package voms-clients is obsoleted by voms-clients-cpp, trying to install voms-clients-cpp-2.0.13-1.el6.x86_64 instead
    • proposed solution: set repository priorities so that UMD comes first (thanks Mattias)
  • UMD 4 next release in preparation, release scheduled by June
    • first update for SL6
    • adding several products, see products in verification
  • CMD
    • RT setup: IT support to configure CMD together with UMD, discussion in progress
    • Verification process
      • starting with BDII info provider
      • external infrastructure needed to perform the tests
    • Staged-Rollout: TBD

Staged rollout updates

Preview repository

Released on 2016-05-17:

  • Preview 1.2.0
    • LCMAPS-plugins-vo-ca-ap 0.0.1-1
    • STORM 1.11.11
  • Preview 2.1.0
    • NorduGrid ARC 15.03 update 6
    • LCMAPS-plugins-vo-ca-ap 0.0.1-1

Generic information about Preview repository: https://wiki.egi.eu/wiki/Preview_Repository

Note: EGI provides the preview repository without any additional quality assurance process; the products are released as provided by the product teams. EGI recommends the use of the UMD repositories, which contain software verified through the quality assurance process of UMD.

Operations

Central monitoring

  • this has been postponed due to technical issues in setting up the central instance

RFC proxy will be default

  • moving to RFC proxy instead of legacy proxy
  • in production for a while now; everybody is using RFC
  • we will ask the VOMS TP to make a small modification to the VOMS client, changing the default

EGI Operations Support activities stopped

  • Operations Support core activity has not been re-bid in the phase 2 of the EGI core activities
  • all Operations Support activities have been moved to the EGI.eu Operations
  • all the operational procedures involving Operations Support have been updated to point to EGI Operations. Please let us know if we missed any documents.

  • The Operations Support support unit in GGUS has been decommissioned. Please use the Operations support unit from now on.

Monthly Availability/Reliability

A/R report on ARGO: http://argo.egi.eu/lavoisier/ngi_reports?accept=html

List of the underperforming RCs for (at least) 3 consecutive months:

Decommissioning SL5

Status and actions

NGIs argus server not properly configured

Some time ago (more than a year I think), EGI ran a campaign to have NGIs run an "NGI Argus" service. This campaign resulted in new services being added to goc-db for each NGI.

Unfortunately, as explained in the OMB in February, our monitoring is currently unable to check the deployment of these services:

  • For 6 services, our monitoring cannot contact the NGI Argus
  • For 18 services, our monitoring is not authorized to get the right information from the NGI Argus
  • For 1 service, our monitoring indicates that the NGI Argus is not properly configured and does not pull the rules from argus.cern.ch

In the end, only 5 services are properly configured and monitored!

The changes are rather easy:

  • If we can't contact them, the site needs to make sure that there is no firewall blocking 195.251.55.111 from accessing the argus 'pap' port
  • If we are not authorized, the site needs to add the right ACE to their argus authorization
pap-admin add-ace 'CN=srv-111.afroditi.hellasgrid.gr,OU=afroditi.hellasgrid.gr,O=HellasGrid, C=GR' 'POLICY_READ_LOCAL|POLICY_READ_REMOTE|CONFIGURATION_READ'
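
For the firewall case, a site admin may first want to confirm that the PAP port accepts TCP connections at all. The following is a minimal sketch under stated assumptions: the helper name `port_reachable` is made up for this example, and the host name and port in the usage comment are placeholders to be replaced with the site's own values. A successful connection from an arbitrary host only shows that the service is listening; it does not prove that the firewall admits 195.251.55.111.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Usage (placeholder host and port, substitute the values of your own Argus PAP):
# print(port_reachable("argus.example.org", 8150))
```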

The current status of the infrastructure can be found:

  • In the secmon nagios (not sure you have access to this):

https://secmon.egi.eu/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_ngi.ARGUS&style=detail&sorttype=1&sortoption=3

  • On the security dashboard:

https://operations-portal.egi.eu/csiDashboard/ngi/any/tab/list/filter/monitoring/page/list?tsid=4

On the security dashboard, each NGI should have an "argus-ban" result:

  • "Ok" means ok
  • "Unknown" means that we can't contact them
  • "High" means that we are not authorized
  • "Critical" means that Argus is not pulling rules from argus.cern.ch
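
The mapping between the "argus-ban" result and the corrective action described above can be summarised in a short lookup. This is only an illustrative sketch: the dictionary and function names are invented here and the dashboard does not expose such an API.

```python
# Illustrative lookup from the dashboard's "argus-ban" result to the action
# described in these minutes. The names are hypothetical.
ARGUS_BAN_ACTIONS = {
    "ok": "nothing to do",
    "unknown": "monitoring cannot contact the NGI Argus: check that no firewall "
               "blocks 195.251.55.111 from the Argus 'pap' port",
    "high": "monitoring is not authorized: add the ACE with pap-admin add-ace",
    "critical": "Argus does not pull rules from argus.cern.ch: review the "
                "NGI Argus configuration",
}


def next_action(result: str) -> str:
    """Return the suggested action for an "argus-ban" result (case-insensitive)."""
    return ARGUS_BAN_ACTIONS.get(
        result.strip().lower(),
        "unrecognised result: check the dashboard manually",
    )


print(next_action("High"))
```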

The parent ticket is https://ggus.eu/?mode=ticket_info&ticket_id=120770

2016_06_13 UPDATE pending tickets:

FedCloud status

  • only GoeGrid (NGI_DE) is not publishing images
  • open tickets to sites where dteam is not working: MK-04-FINKICLOUD -> this can lead to suspension as per OLA!
  • cloud profiles still under approval at OMB, email to be circulated by EGI Operations for approval; if the profiles are approved, the new profile will be used for A/R from July 1st, and suspensions will start from August 1st

A/R Profile    March  April  May
improvements   2      6      5
unchanged      11     7      5
worsening      9      10     12

  • CYFRONET-CLOUD (+100%): in the old profile it fails the accounting test
  • GoeGRID (+80.7%): in the old profile it fails the cdmi test
  • TR-FC1-ULAKBIM (+47.59%): it was failing the accounting test in the old profile
  • HG-09-Okeanos-Cloud: https://ggus.eu/index.php?mode=ticket_info&ticket_id=122012 (SOLVED, updated the cert)
    • failures with the probes:
    • eu.egi.cloud.OCCI-Context-ops: CATEGORIES CRITICAL - SSL_connect returned=1 errno=0 state=error: certificate verify failed
    • eu.egi.cloud.OCCI-VM-ops: CRITICAL - SSL connection with "https://okeanos-occi2.hellasgrid.gr:9000/" could not be established! SSL_connect
  • MK-04-FINKICLOUD unreachable
  • NCG-INGRID-PT (+26.74%): https://ggus.eu/index.php?mode=ticket_info&ticket_id=122013 (a new server is going to be put in production, decommissioning the old one)
    • failures mainly with the cloud probes:
    • eu.egi.cloud.OCCI-VM-ops (sometimes warning, sometimes critical): WARNING - "http://aurora.ncg.ingrid.pt:8787" failed to instantiate a COMPUTE instance in the given timeframe! Timeout: 300s
    • eu.egi.cloud.OpenStack-VM-ops: Critical: could not fetch flavor ID, endpoint does not correctly exposes available flavors: 110 Connection timed out
  • SCAI (-21.61%) https://ggus.eu/index.php?mode=ticket_info&ticket_id=122015 (CAs not completely updated)
    • some repeated failures with the CA probes
    • also eu.egi.cloud.OCCI-VM-ops CRITICAL - Unexpected response from https://fc.scai.fraunhofer.de:8787/! Net::HTTP::Post failed! HTTP Response status: [500] Internal Server Error : The server has either erred or is incapable of performing the requested operation.
  • UPV-GRyCAP (-24.56%) https://ggus.eu/index.php?mode=ticket_info&ticket_id=122014 (SOLVED, CAs updated)
    • it is still failing the eu.egi.OCCI-IGTF probe
    • org.nagios.OCCI-TCP: 05-11-2016 17:56:27 Connection refused

AOB

Next meeting