Difference between revisions of "Agenda-2020-11-16"

From EGIWiki
 
** https://wiki.egi.eu/wiki/Next_middleware_release
* UMD4 release in preparation
** StoRM, VOMS, BDII update, dCache
** VERY URGENT
* feedback on software automation from the EGI Conference
  
 
== Preview repository  ==
** '''[[Preview 1.29.0]]''' [https://appdb.egi.eu/store/software/preview.repository/releases/1.0/1.29.0/ AppDB info] (sl6):  ARC 6.8.0 and 6.8.1, BDII 5.5.26, CVMFS 2.7.4, dCache 5.2.31, DMLite/DPM 1.14.0, frontier-squid 4.13.1, glite-info-update-endpoints 3.0.2, lcg-info 1.12.5, STORM 1.11.18
** '''[[Preview 2.29.0]]''' [https://appdb.egi.eu/store/software/preview.repository/releases/2.0/2.29.0/ AppDB info] (CentOS 7):  ARC 6.8.0 and 6.8.1, BDII 5.5.26, CVMFS 2.7.4, dCache 5.2.31, DMLite/DPM 1.14.0, frontier-squid 4.13.1, glite-info-update-endpoints 3.0.2, lcg-info 1.12.5, STORM 1.11.18
* included in the upcoming release: DPM, VOMS
  
 
= Operations  =
** '''(14th Sept)''' 70 endpoints, 14 CRITICAL, success rate is about 80%
** '''Oct 1st: included in the [https://poem.egi.eu/ui/public_metricprofiles/ARGO_MON_CRITICAL ARGO_MON_CRITICAL] profile (A/R computation)'''
*** (Oct 12th) 71 endpoints, success rate (including WARNING) 85.9%
*** (Nov 16th) 76 endpoints, success rate (including WARNING) 84.2%
** working on the probe for the host certificate validity check: [https://ggus.eu/index.php?mode=ticket_info&ticket_id=147386 GGUS 147386]
  
 
== Feedback from DMSU  ==

== Upgrade of central argus node ==

Message sent to the administrators of the NGI argus servers:

* A replacement of the central argus servers (lcgargus03.cern.ch & lcgargus04.cern.ch), which are behind the argus.cern.ch & lcgargus.cern.ch aliases, is planned for Tuesday 17th November 2020 between 10:00 and 12:00.
* This replacement should be transparent and require no configuration change on your side. Please report any issue you have with your NGI argus server.
* The two new hosts, lcgargus21.cern.ch and lcgargus22.cern.ch, are already production-ready; you can test them remotely if you want. The operation next week is simply a change of alias.
  
 
== Monthly Availability/Reliability ==
 
 
** AsiaPacific: https://ggus.eu/index.php?mode=ticket_info&ticket_id=147748
*** '''HK-HKU-CC-01''': migrating DPM from SL6 to CentOS 7
*** '''TW-NCUHEP''': ARC-CE failures due to an outdated CA package; performance is now good
** NGI_DE: https://ggus.eu/index.php?mode=ticket_info&ticket_id=146871
*** '''GoeGRID''': CREAM-CE intermittent failures not affecting ATLAS; failures with ARC-CE, now passing the tests
** NGI_DE: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148519
*** LRZ-LMU: CE had problems due to the decommissioning of SharedFS; the other CE returns UNKNOWN in the IGTF test.
** NGI_HR: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148518
*** egee.irb.hr: in the process of a major upgrade from CentOS 6 to CentOS 7, with some delays.
** NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148957
*** INFN-CATANIA: SRM problems
*** INFN-CATANIA-STACK: recovered
*** INFN-PADOVA: decommissioning process
** NGI_PL: https://ggus.eu/index.php?mode=ticket_info&ticket_id=147311
*** WCSS64: failures on QCG and CREAM CEs
** NGI_UK:
*** '''UKI-NORTHGRID-SHEF-HEP''': https://ggus.eu/index.php?mode=ticket_info&ticket_id=146455 ARC-CE re-installed, some condor problems to fix, improving...
*** '''UKI-SOUTHGRID-SUSX''': https://ggus.eu/index.php?mode=ticket_info&ticket_id=144720 Migration from CREAM to ARC, WN migration to CentOS7; SRM to be decommissioned; ARC-CE was failing the IGTF test, then solved; site-bdii failures; new failures on the ARC-CE.
** ROC_LA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148515
*** ATLAND: downtime due to a power cut and quarantine
** ROC_LA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148956
*** CBPF: SRM failures due to information not being properly published. Physical access to the facilities is restricted due to COVID measures; a DPM update is planned for December.
** NGI_UA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148958
*** UA-NSCMBR: IGTF outdated; improving...
*Under-performed sites after 3 consecutive months, under-performed NGIs, QoS violations: ('''October 2020'''):
** AsiaPacific: https://ggus.eu/index.php?mode=ticket_info&ticket_id=149353
*** JP-KEK-CRC-02: migration from CREAM-CE to ARC-CE; some problems with the ARC-CE, which has since been marked as "not production"
** CERN-PROD: https://ggus.eu/index.php?mode=ticket_info&ticket_id=149351 WebDAV failures
** NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=149352
*** INFN-LECCE
*** TRIGRID-INFN-CATANIA
** NGI_UA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=149356
*** UA_BITP_ARC: BDII freshness failures
** ROC_LA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=149355
*** SUPERCOMPUTO-UNAM
  
  
** [https://argo-mon.egi.eu/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_CREAM-CE&style=detail eu.egi.sec.CREAMCE]
* Nov 1st: probe returns CRITICAL status, alarms created on the ROD dashboard, ROD teams start to create tickets
** https://ggus.eu/index.php?mode=ticket_info&ticket_id=149312
* 1st Jan 2021: EGI Ops will start chasing the sites still providing CREAM-CE endpoints
** By this time, service endpoints that could not be upgraded should be put into downtime by the site admin or ROD:
* [https://wiki.egi.eu/wiki/PROC16_Decommissioning_of_unsupported_software PROC16 Decommission of unsupported software]
* deadline: '''end of July'''
* Catalin is in contact with the ARC team to organise a webinar on ARC administration, scheduled (to be confirmed) for July 6th; please contact operations@ for information.

* Status
|-
| 2020-10-12 || 32 || 19 || -
|-
| 2020-11-16 || 26 || 16 || -
|}
  

== Next meeting  ==
In 2021

Latest revision as of 15:11, 16 November 2020

Back to https://wiki.egi.eu/wiki/Operations_Meeting

General information

Middleware

UMD

  • UMD4 release in preparation
    • StoRM, VOMS, BDII update, dCache
    • VERY URGENT
  • feedback on software automation from the EGI Conference

Preview repository

  • released on 2020-10-09
    • Preview 1.29.0 AppDB info (sl6): ARC 6.8.0 and 6.8.1, BDII 5.5.26, CVMFS 2.7.4, dCache 5.2.31, DMLite/DPM 1.14.0, frontier-squid 4.13.1, glite-info-update-endpoints 3.0.2, lcg-info 1.12.5, STORM 1.11.18
    • Preview 2.29.0 AppDB info (CentOS 7): ARC 6.8.0 and 6.8.1, BDII 5.5.26, CVMFS 2.7.4, dCache 5.2.31, DMLite/DPM 1.14.0, frontier-squid 4.13.1, glite-info-update-endpoints 3.0.2, lcg-info 1.12.5, STORM 1.11.18
  • included in the upcoming release: DPM, VOMS

Operations

ARGO/SAM

FedCloud

Feedback from DMSU

Upgrade of central argus node

Message sent to the administrators of the NGI argus servers:

  • A replacement of the central argus servers (lcgargus03.cern.ch & lcgargus04.cern.ch), which are behind the argus.cern.ch & lcgargus.cern.ch aliases, is planned for Tuesday 17th November 2020 between 10:00 and 12:00.
  • This replacement should be transparent and require no configuration change on your side. Please report any issue you have with your NGI argus server.
  • The two new hosts, lcgargus21.cern.ch and lcgargus22.cern.ch, are already production-ready; you can test them remotely if you want. The operation next week is simply a change of alias.

Monthly Availability/Reliability


  • sites suspended:

IPv6 readiness plans

CREAM-CE Decommission

ARC Middleware 5 end of support, migration to ARC 6


  • Status
{|
! Date !! Number of endpoints in BDII !! Number of GGUS tickets !! Issues
|-
| 2020-06-08 || 75 || 42 || Some ARC endpoints publish a timestamp instead of a version such as 5.X.Y; we can reasonably assume they are ARC6 nightly builds, but we will close the corresponding tickets after explicit confirmation from the site admin.
|-
| 2020-07-13 || 53 || 29 || -
|-
| 2020-09-14 || 34 || 18 || -
|-
| 2020-10-12 || 32 || 19 || -
|-
| 2020-11-16 || 26 || 16 || -
|}

Storage accounting

Many sites have stopped publishing storage accounting records; 57 tickets were opened to fix this.

AOB

Next meeting

In 2021