
Difference between revisions of "Agenda-09-05-2016"

(Created page with "{{TOC right}} = General information = * the Operations meeting will be on the '''2nd Monday of the month''' * the EGI Operations Meeting schedule for '''first half of 2016''' ...")
 
 
(40 intermediate revisions by 3 users not shown)
Line 7: Line 7:


= News from URT =
= News from URT =
* CMD under design stage, comments are welcome https://wiki.egi.eu/wiki/EGI_Cloud_Middleware_Distribution
* UMD4 to be released by May
* wiki writing/rewriting in progress for both (the release schedule for UMD still has to be written properly, apologies for that)


== Staged rollout updates  ==
== Staged rollout updates  ==
* VOMS-ADMIN SERVER 3.4.1
* STORM 1.11.10
* DCACHE 2.13.27


== Next releases  ==
== Next releases  ==
* UMD 4.1.0 RC ready, release by May
** goal: add SL6 to UMD4 in order to allow the decommissioning of UMD3
** SL6 migrated from UMD3 to UMD4, non-supported products have been removed
** CentOS7 --> only Frontier for now, a new release will be made including dCache and ARGUS server


= Preview repository =
= Preview repository =
Line 37: Line 48:


* EGI Operations proposal to align Fedcloud sites to the A/R related procedures used for the grid sites
* EGI Operations proposal to align Fedcloud sites to the A/R related procedures used for the grid sites
 
* based on the availability and reliability of the services monitored in cloudmon, EGI Operations will start following up with underperforming sites, as is already done for every grid site
** based on the availability and reliability of the services monitored in cloudmon, EGI Operations will start following up with underperforming sites, as is already done for every grid site
* sites will NOT be suspended for a/r performance at least until end of May
** sites will NOT be suspended for a/r performance at least until end of May
* in parallel EGI Operations will start [https://wiki.egi.eu/wiki/PROC08 PROC08] to include cloud probes in the EGI_CRITICAL and EGI profiles used for A/R computations (IN PROGRESS)
* in parallel EGI Operations will start [https://wiki.egi.eu/wiki/PROC08 PROC08] to include cloud probes in the EGI_CRITICAL and EGI profiles used for A/R computations (IN PROGRESS)


Line 51: Line 61:
** '''Starting notification of sites eligible for suspension'''
** '''Starting notification of sites eligible for suspension'''


== FedCloud status ==
=== Comparing the two profiles ===
 
=== Old issues ===
 
Grouped by NGI, please follow up with sites.
 
* NGI_UK
** 100IT (OpenStack)
*** vmcatcher issues https://ggus.eu/index.php?mode=ticket_info&ticket_id=116358#update#19 '''IN PROGRESS'''
*** BDII and GOCDB have different Endpoint URLs https://ggus.eu/index.php?mode=ticket_info&ticket_id=119002#update#5 '''FIXED'''
 
* NGI_PL
** CYFRONET-CLOUD (OpenStack)
*** VMCatcher https://ggus.eu/index.php?mode=ticket_info&ticket_id=116363#update#29 '''IN PROGRESS'''
 
* NGI_DE
** GoeGrid (OpenNebula)
*** OCCI, VMCatcher https://ggus.eu/index.php?mode=ticket_info&ticket_id=119003 https://ggus.eu/index.php?mode=ticket_info&ticket_id=116365 '''IN PROGRESS'''
 
* NGI_GRNET
** HG-09-Okeanos-Cloud (Synnefo)
*** VMCatcher, issue with large metadata, on hold (it requires some development) https://ggus.eu/index.php?mode=ticket_info&ticket_id=116368 '''ON HOLD'''


* NGI_TR
see the FedCloud meeting slides for details https://indico.egi.eu/indico/event/2847/
** TR-FC1-ULAKBIM (OpenStack)
*** Missing GLUE2DomainID and image description looks wrong https://ggus.eu/index.php?mode=ticket_info&ticket_id=119005#update#15 '''IN PROGRESS'''


* New tickets opened to track issues in publishing appliances on AppDB for fedcloud.egi.eu: https://ggus.eu/index.php?mode=ticket_info&ticket_id=120010
* nagios probes
* Issue with OCCI and fedcloud.egi.eu VO at MK-04-FINKICLOUD (NGI_MARGI): https://ggus.eu/index.php?mode=ticket_info&ticket_id=120027


=== New issues ===
{| class="wikitable"
|New profile
|Old profile
|-
|
* eu.egi.cloud.vm-management.occi
** eu.egi.cloud.OCCI-Context
** eu.egi.cloud.OCCI-VM
** org.nagios.OCCI-TCP
** eu.egi.OCCI-IGTF
* org.openstack.nova
** eu.egi.Keystone-IGTF
** eu.egi.cloud.OpenStack-VM
** org.nagios.Keystone-TCP
|
* eu.egi.cloud.vm-management.occi
** eu.egi.cloud.OCCI-Context
** eu.egi.cloud.OCCI-VM
** org.nagios.OCCI-TCP
* eu.egi.cloud.storage-management.cdmi
** org.nagios.CDMI-TCP
* eu.egi.cloud.accounting
** eu.egi.cloud.APEL-Pub
|}


* How the sites figures are changed:


=== Actions ===
{| class="wikitable"
|
|March
|April
|-
|improvements
|2
|6
|-
|unchanged
|11
|7
|-
|worsening
|9
|10
|}


* EGI Operations have been asked by user support to contact '''sites with long-standing unresolved technical problems in the support of the fedcloud.egi.eu VO'''
* '''BIFI (-79%)''': https://ggus.eu/index.php?mode=ticket_info&ticket_id=121201 (SOLVED)
** if issues cannot be fixed quickly, '''sites will be asked to remove the support to fedcloud.egi.eu'''
** failures starting from April 7th related to CAs probes
** they will re-enable the VO support as soon as they are able to fix the issues
*** eu.egi.Keystone-IGTF and eu.egi.OCCI-IGTF
** sites will be contacted directly by EGI Operations
** missing some CAs
** misconfiguration in Apache2 webserver. Wrong setting:
#SSLVerifyClient optional_no_ca
Correct setting:
SSLVerifyClient optional
* '''CETA-GRID (-42%)''': failures seem due to the grid-part of the site
* '''CYFRONET-CLOUD (-100%)''': https://ggus.eu/index.php?mode=ticket_info&ticket_id=121202
** same as in March, problem not solved yet: wrong endpoint URL on GOC-DB
* '''IFCA-LCG2 (-24%)''': until April 6th the failures were due to the eu.egi.OCCI-IGTF probe, then fixed
** the other failures during the month were due to the OpenStack upgrade to the Liberty version
* '''IN2P3-IRES (-6%)''' and NCG-INGRID-PT (-7%): failures due to the grid part of the site
* '''TR-FC1-ULAKBIM (-6%)''': failures registered between April 7th and 11th in both reports
* '''HG-09-Okeanos-Cloud''': https://ggus.eu/index.php?mode=ticket_info&ticket_id=121203 (SOLVED)
** the token served by the AA mechanism had expired
* '''GoeGRID''': https://ggus.eu/index.php?mode=ticket_info&ticket_id=121204 (SOLVED)
** failures with the CAs probe
*** some CA files were invalid
* '''PRISMA-INFN-BARI''': it must be closed


=== Getting help ===  
== FedCloud ==


* the whole FedCloud wiki has been reviewed, removing redundancies, updating links and instructions
* almost all sites publish at least one image from AppDB
* from the operations point of view: https://wiki.egi.eu/wiki/Federated_Cloud_resource_providers_support
** only exceptions [https://ggus.eu/?mode=ticket_info&ticket_id=120995 GoeGrid] (NGI_DE) and [https://ggus.eu/?mode=ticket_info&ticket_id=120996  TR-FC1-ULAKBIM] (NGI_TR), getting critical because they have use cases to serve (respectively with Terradue and BILS), please NGIs have a look
** see in particular the [https://wiki.egi.eu/wiki/MAN10 manual for the installation of a cloud site]
* open [https://ggus.eu/?mode=ticket_info&ticket_id=121262 tickets] to sites where dteam is not working: [https://ggus.eu/index.php?mode=ticket_info&ticket_id=121263  UPV-GRyCAP] [https://ggus.eu/index.php?mode=ticket_info&ticket_id=121264 CESGA] [https://ggus.eu/index.php?mode=ticket_info&ticket_id=121265 MK-04-FINKICLOUD] [https://ggus.eu/index.php?mode=ticket_info&ticket_id=121266 BIFI] [https://ggus.eu/index.php?mode=ticket_info&ticket_id=121267 IN2P3-IRES] [https://ggus.eu/index.php?mode=ticket_info&ticket_id=121268 FZJ] [https://ggus.eu/index.php?mode=ticket_info&ticket_id=121269 CYFRONET-CLOUD] [https://ggus.eu/index.php?mode=ticket_info&ticket_id=121270 INFN-CATANIA-NEBULA] [https://ggus.eu/index.php?mode=ticket_info&ticket_id=121271 100IT]
** TBD: review the support units associated with FedCloud (in progress)
* please remember that '''from June on EGI Operations will be starting notifying sites eligible for suspension'''
* Cloud Middleware Distribution (similar to UMD, but for cloud only) is in the design phase; you may want to check the status of the discussion happening in the FedCloud and UMD teams here https://wiki.egi.eu/wiki/EGI_Cloud_Middleware_Distribution


== Decommissioning SL5 ==
== Decommissioning SL5 ==
* Tracked on [https://wiki.egi.eu/wiki/SL5_retirement SL5_retirement wiki]
* Tracked on [https://wiki.egi.eu/wiki/SL5_retirement SL5_retirement wiki]
* No checks for dCache, DPM, ARC, UNICORE --> ''' Action on NGIs/ROCs to follow up directly with sites'''
* Sites still deploying unsupported service end-points risk suspension, unless documented technical reasons prevent a Site Admin from updating these end-points https://wiki.egi.eu/wiki/PROC16_Decommissioning_of_unsupported_software#Escalation_phase see step 7
* Status https://wiki.egi.eu/wiki/SL5_retirement#2016-05-09_Overall_status
* '''don't forget to upgrade the apel client host if on SL5, we count on it to stop brokers accepting SSLv3'''


== NGIs argus server not properly configured ==
== NGIs argus server not properly configured ==
Line 137: Line 183:


The parent ticket is https://ggus.eu/?mode=ticket_info&ticket_id=120770
The parent ticket is https://ggus.eu/?mode=ticket_info&ticket_id=120770
'''2016_05_09 UPDATE'''
pending tickets:
* NGI_MD https://ggus.eu/?mode=ticket_info&ticket_id=120746
* NGI_FI https://ggus.eu/?mode=ticket_info&ticket_id=120747
* NGI_MARGI https://ggus.eu/?mode=ticket_info&ticket_id=120765
* AfricaArabia https://ggus.eu/?mode=ticket_info&ticket_id=120767
Five other servers are failing again


= AOB  =
= AOB  =
Line 146: Line 202:
List of the underperforming RCs for (at least) 3 consecutive months:
List of the underperforming RCs for (at least) 3 consecutive months:


* AfricaArabia https://ggus.eu/?mode=ticket_info&ticket_id=117094:
* AfricaArabia https://ggus.eu/?mode=ticket_info&ticket_id=117094: main problems with the monitoring system, waiting for the release of the central one
** ASRT
** DZ-01-ARN
** DZ-01-ARN
** EG-ZC-T3: unresponsive for two months, must be suspended
** EG-ZC-T3: unresponsive for two months, must be suspended
** ZA-UJ
** ZA-UJ


* AsiaPacific: (since February) https://ggus.eu/index.php?mode=ticket_info&ticket_id=120180
* AsiaPacific: (since February) https://ggus.eu/index.php?mode=ticket_info&ticket_id=121222
** MY-UM-SIFIR: network and power failure
** MY-UM-SIFIR: network and power failure - SUSPENDED on May 5th
** IN-DAE-VECC-02 (miscellaneous issues)
** INDIACMS-TIFR


* NGI_DE: (since February) https://ggus.eu/index.php?mode=ticket_info&ticket_id=120181
* NGI_DE: (since February) https://ggus.eu/index.php?mode=ticket_info&ticket_id=120181 (SOLVED)
** LRZ-LMU no feedback
** LRZ-LMU problems in updating the CAs, now A/R figures are improving


* NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=120577
* NGI_HR: https://ggus.eu/index.php?mode=ticket_info&ticket_id=120573
** FBF-Brescia-IT working for improving the behaviour
** egee.fesb.hr issue with SE element which affected the whole NGI


* NGI_MARGI https://ggus.eu/index.php?mode=ticket_info&ticket_id=118465 no monitoring data since January  
* NGI_MARGI https://ggus.eu/index.php?mode=ticket_info&ticket_id=118465 no monitoring data since January  
Line 165: Line 224:
** the only site MD-02-IMI was suspended in March for security reasons, asked for news
** the only site MD-02-IMI was suspended in March for security reasons, asked for news


* ROC_LA
* NGI_NDGF: https://ggus.eu/index.php?mode=ticket_info&ticket_id=121224 (SOLVED)
** UFAL: suspended by the NGI manager
** EENet problem with the probe
 
* NGI_PL: https://ggus.eu/index.php?mode=ticket_info&ticket_id=121225 (SOLVED)
** CYFRONET-PROMETHEUS Site will be reorganized. QCG-Computing was established recently and it has so far 100% availability.
 
== EGI Operations Support activities stopped ==
 
* EGI Operations Support activity stopped on April 30, 2016
* Operations Support GGUS SU to be decommissioned
* all corresponding tickets will be moved to EGI Operations (except resource allocation)


== Next meeting ==
== Next meeting ==


* '''9 May 2016''' https://indico.egi.eu/indico/event/2739/
* '''13 Jun 2016''' https://indico.egi.eu/indico/event/2740/

Latest revision as of 13:07, 9 May 2016


General information

News from URT

Staged rollout updates

  • VOMS-ADMIN SERVER 3.4.1
  • STORM 1.11.10
  • DCACHE 2.13.27

Next releases

  • UMD 4.1.0 RC ready, release by May
    • goal: add SL6 to UMD4 in order to allow the decommissioning of UMD3
    • SL6 migrated from UMD3 to UMD4, non-supported products have been removed
    • CentOS7 --> only Frontier for now, a new release will be made including dCache and ARGUS server

Preview repository

Preview 2.0.0 was released on April 1st.

The second major release of Preview was created to release the products available on the CentOS 7 and Scientific Linux 6 platforms that are about to be included in UMD4.

The products available in this first release are for the CentOS 7 platform only:

  • ARC
  • Argus
  • dcache
  • fts3
  • site-bdii
  • top-bdii

The Scientific Linux 6 products will be available in one of the next updates.

Generic information about Preview repository: https://wiki.egi.eu/wiki/Preview_Repository

Note: EGI provides the preview repository without any additional quality assurance process; the products are released as provided by the product teams. EGI recommends the use of the UMD repositories, which contain software verified through the quality assurance process of UMD.

Operational issues

Aligning Fedcloud sites to the A/R procedures

  • EGI Operations proposal to align Fedcloud sites to the A/R related procedures used for the grid sites
  • based on the availability and reliability of the services monitored in cloudmon, EGI Operations will start following up with underperforming sites, as is already done for every grid site
  • sites will NOT be suspended for a/r performance at least until end of May
  • in parallel EGI Operations will start PROC08 to include cloud probes in the EGI_CRITICAL and EGI profiles used for A/R computations (IN PROGRESS)

The proposed timeline is:

  • February 2016:
    • EGI Operations will check the status of the production cloud services in order to understand which issues (if any) the site has and provide help to NGIs and sites;
    • Start of the integration of cloud probes in the EGI_CRITICAL profile (current set + OpenStack): to be agreed with the ARGO team, PROC08 will be followed
  • June 2016:
    • Starting notification of sites eligible for suspension

Comparing the two profiles

see the FedCloud meeting slides for details https://indico.egi.eu/indico/event/2847/

  • nagios probes
New profile:
  • eu.egi.cloud.vm-management.occi
    • eu.egi.cloud.OCCI-Context
    • eu.egi.cloud.OCCI-VM
    • org.nagios.OCCI-TCP
    • eu.egi.OCCI-IGTF
  • org.openstack.nova
    • eu.egi.Keystone-IGTF
    • eu.egi.cloud.OpenStack-VM
    • org.nagios.Keystone-TCP

Old profile:
  • eu.egi.cloud.vm-management.occi
    • eu.egi.cloud.OCCI-Context
    • eu.egi.cloud.OCCI-VM
    • org.nagios.OCCI-TCP
  • eu.egi.cloud.storage-management.cdmi
    • org.nagios.CDMI-TCP
  • eu.egi.cloud.accounting
    • eu.egi.cloud.APEL-Pub
  • How the site figures changed (number of sites):
    • improvements: 2 in March, 6 in April
    • unchanged: 11 in March, 7 in April
    • worsening: 9 in March, 10 in April
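  • BIFI (-79%): https://ggus.eu/index.php?mode=ticket_info&ticket_id=121201 (SOLVED)
    • failures starting from April 7th related to CAs probes
      • eu.egi.Keystone-IGTF and eu.egi.OCCI-IGTF
    • missing some CAs
    • misconfiguration in Apache2 webserver. Wrong setting:

#SSLVerifyClient optional_no_ca

Correct setting:

SSLVerifyClient optional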
  • CETA-GRID (-42%): failures seem due to the grid-part of the site
  • CYFRONET-CLOUD (-100%): https://ggus.eu/index.php?mode=ticket_info&ticket_id=121202
    • same as in March, problem not solved yet: wrong endpoint URL on GOC-DB
  • IFCA-LCG2 (-24%): until April 6th the failures were due to the eu.egi.OCCI-IGTF probe, then fixed
    • the other failures during the month were due to the OpenStack upgrade to the Liberty version
  • IN2P3-IRES (-6%) and NCG-INGRID-PT (-7%): failures due to the grid part of the site
  • TR-FC1-ULAKBIM (-6%): failures registered between April 7th and 11th in both reports
  • HG-09-Okeanos-Cloud: https://ggus.eu/index.php?mode=ticket_info&ticket_id=121203 (SOLVED)
    • the token served by the AA mechanism had expired
  • GoeGRID: https://ggus.eu/index.php?mode=ticket_info&ticket_id=121204 (SOLVED)
    • failures with the CAs probe
      • some CA files were invalid
  • PRISMA-INFN-BARI: it must be closed
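
The SSLVerifyClient settings quoted above can be checked mechanically. The following is a minimal sketch, not part of the original minutes: it scans Apache configuration text for SSLVerifyClient lines and tells whether the active value is `optional`. The function names and the sample fragments are hypothetical. The minutes show the wrong setting with a leading `#`, and since it is unclear whether that character belongs to the configuration itself, the sketch reports commented-out lines separately instead of guessing.

```python
import re

# One SSLVerifyClient line: optional leading '#', then the directive and its value.
LINE = re.compile(r"^[ \t]*(#?)[ \t]*SSLVerifyClient[ \t]+(\S+)", re.IGNORECASE)


def find_ssl_verify_client(config_text):
    """Return (is_commented, value) for every SSLVerifyClient line found."""
    found = []
    for line in config_text.splitlines():
        match = LINE.match(line)
        if match:
            found.append((match.group(1) == "#", match.group(2).lower()))
    return found


def uses_recommended_setting(config_text):
    """True only if at least one active directive exists and all are 'optional'."""
    active = [value for commented, value in find_ssl_verify_client(config_text)
              if not commented]
    return bool(active) and all(value == "optional" for value in active)


# Hypothetical configuration fragments, for illustration only.
wrong = "SSLEngine on\n#SSLVerifyClient optional_no_ca\n"
also_wrong = "SSLEngine on\nSSLVerifyClient optional_no_ca\n"
fixed = "SSLEngine on\nSSLVerifyClient optional\n"

for name, text in [("wrong", wrong), ("also_wrong", also_wrong), ("fixed", fixed)]:
    print(name, find_ssl_verify_client(text), uses_recommended_setting(text))
```

Both readings of the "wrong" line (commented out, or active with `optional_no_ca`) are rejected; only an active `SSLVerifyClient optional` passes.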

FedCloud

Decommissioning SL5

NGIs argus server not properly configured

Some time ago (more than a year ago, I think), EGI ran a campaign to have NGIs run an "NGI Argus" service. This campaign resulted in new services being added to goc-db for each NGI.

Unfortunately, as explained in the OMB in February, our monitoring is currently unable to check the deployment of these services:

  • For 6 services, our monitoring cannot contact the NGI Argus
  • For 18 services, our monitoring is not authorized to get the right information from the NGI Argus
  • For 1 service, our monitoring indicates that the NGI Argus is not properly configured and does not pull the rules from argus.cern.ch

In the end, only 5 services are properly configured and monitored!

The changes are rather easy:

  • If we can't contact them, the site needs to make sure that there is no firewall blocking 195.251.55.111 from accessing the argus 'pap' port
  • If we are not authorized, the site needs to add the right ACE to their argus authorization
pap-admin add-ace 'CN=srv-111.afroditi.hellasgrid.gr,OU=afroditi.hellasgrid.gr,O=HellasGrid, C=GR' 'POLICY_READ_LOCAL|POLICY_READ_REMOTE|CONFIGURATION_READ'
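
For the first case (monitoring cannot contact the service), a site can at least confirm that the PAP port accepts TCP connections. The sketch below is an illustration only: the function name is hypothetical, and the port number 8150 is an assumption (it is commonly the default PAP port in Argus, but the minutes do not state it). A successful check from another machine does not prove that 195.251.55.111 is allowed through the firewall; it only rules out the service being down or the port being closed to everyone.

```python
import socket

# Assumption: 8150 is the usual default PAP port in Argus; check your own configuration.
ASSUMED_PAP_PORT = 8150


def can_connect(host, port=ASSUMED_PAP_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example call (hypothetical host name, commented out so nothing is contacted on import):
# print(can_connect("argus.example.org"))
```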

The current status of the infrastructure can be found:

  • In the secmon nagios (not sure you have access to this):

https://secmon.egi.eu/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_ngi.ARGUS&style=detail&sorttype=1&sortoption=3

  • On the security dashboard:

https://operations-portal.egi.eu/csiDashboard/ngi/any/tab/list/filter/monitoring/page/list?tsid=4

On the security dashboard, each NGI should have an "argus-ban" result:

  • "Ok" means ok
  • "Unknown" means that we can't contact them
  • "High" means that we are not authorized
  • "Critical" means that Argus is not pulling rules from argus.cern.ch
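
The result codes above, together with the fixes listed earlier, can be kept as a small lookup table, for example when following up on the pending tickets. This is only a sketch: the dictionary and function names are hypothetical, and the follow-up text for "Critical" is an assumption, since the minutes do not give a fix for that case.

```python
# Meaning and suggested follow-up for each "argus-ban" result, as described in the minutes.
ARGUS_BAN = {
    "ok": ("the NGI Argus is properly configured and monitored",
           "nothing to do"),
    "unknown": ("monitoring cannot contact the NGI Argus",
                "make sure no firewall blocks 195.251.55.111 from the Argus 'pap' port"),
    "high": ("monitoring is not authorized on the NGI Argus",
             "add the monitoring host's ACE with pap-admin add-ace"),
    # Assumption: the minutes give no fix for this case.
    "critical": ("Argus is not pulling rules from argus.cern.ch",
                 "no fix is given in the minutes; review the Argus configuration"),
}


def explain(result):
    """Return a one-line explanation for a dashboard result (case-insensitive)."""
    key = result.strip().lower()
    if key not in ARGUS_BAN:
        raise ValueError(f"unexpected argus-ban result: {result!r}")
    meaning, action = ARGUS_BAN[key]
    return f"{result.strip()}: {meaning}; {action}"


for result in ("Ok", "Unknown", "High", "Critical"):
    print(explain(result))
```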

The parent ticket is https://ggus.eu/?mode=ticket_info&ticket_id=120770

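2016_05_09 UPDATE

Pending tickets:

  • NGI_MD https://ggus.eu/?mode=ticket_info&ticket_id=120746
  • NGI_FI https://ggus.eu/?mode=ticket_info&ticket_id=120747
  • NGI_MARGI https://ggus.eu/?mode=ticket_info&ticket_id=120765
  • AfricaArabia https://ggus.eu/?mode=ticket_info&ticket_id=120767

Five other servers are failing again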

AOB

Monthly Availability/Reliability

A/R report on ARGO: http://argo.egi.eu/lavoisier/ngi_reports?accept=html

List of the underperforming RCs for (at least) 3 consecutive months:

EGI Operations Support activities stopped

  • EGI Operations Support activity stopped on April 30, 2016
  • Operations Support GGUS SU to be decommissioned
  • all corresponding tickets will be moved to EGI Operations (except resource allocation)

Next meeting