
Difference between revisions of "Agenda-2021-01-11"

 
(26 intermediate revisions by 2 users not shown)
Line 8: Line 8:


== UMD ==
== UMD ==
* plans on CentOS8 ONGOING
 
* UMD4 schedule: https://wiki.egi.eu/wiki/UMD_Release_Schedule
* CentOS8 rebuild EOL in 2021 (was: May 2029), '''possible switch to CentOS8 Stream''' (maintained until August 2024) https://blog.centos.org/2020/12/future-is-centos-stream/ '''discussion ongoing''', especially in WLCG
** https://wiki.egi.eu/wiki/Next_middleware_release
** https://wiki.egi.eu/wiki/Next_middleware_release
 
* '''CentOS7 will be maintained until June 2024'''
* UMD4 release in preparation
** Moving UMD4/C7 to UMD5/C7
** StoRM, VOMS, BDII update, dCache
* '''SL6 is retired''', URT will not accept updates (unless critical and agreed with EGI Operations)
** VERY URGENT


* feedback on software automation from the EGI Conference
* feedback on software automation from the EGI Conference


== Preview repository  ==
== Preview repository  ==
* released on 2020-10-09
*2020-11-30
** '''[[Preview 1.29.0]]''' [https://appdb.egi.eu/store/software/preview.repository/releases/1.0/1.29.0/ AppDB info] (sl6):  ARC 6.8.0 and 6.8.1, BDII 5.5.26, CVMFS 2.7.4, dCache 5.2.31, DMLite/DPM 1.14.0, frontier-squid 4.13.1, glite-info-update-endpoints 3.0.2, lcg-info 1.12.5, STORM 1.11.18
** '''[[Preview 1.30.0]]''' [https://appdb.egi.eu/store/software/preview.repository/releases/1.0/1.30.0/ AppDB info] '''(last release on sl6)''': CVMFS 2.7.5 and egi-cvmfs-2-7.12, dCache 5.2.35, DMLite/DPM 1.14.2, Dynafed 1.6.0, STORM 1.11.19, VOMS 10-20 release, xrootd 4.12.5
** '''[[Preview 2.29.0]]''' [https://appdb.egi.eu/store/software/preview.repository/releases/2.0/2.29.0/ AppDB info] (CentOS 7):  ARC 6.8.0 and 6.8.1, BDII 5.5.26, CVMFS 2.7.4, dCache 5.2.31, DMLite/DPM 1.14.0, frontier-squid 4.13.1, glite-info-update-endpoints 3.0.2, lcg-info 1.12.5, STORM 1.11.18
** '''[[Preview 2.30.0]]''' [https://appdb.egi.eu/store/software/preview.repository/releases/2.0/2.30.0/ AppDB info] (CentOS 7):  APEL-SSM 3.0.1, CVMFS 2.7.5 and egi-cvmfs-2-7.12, dCache 5.2.35, DMLite/DPM 1.14.2, Dynafed 1.6.0, STORM 1.11.19, VOMS 10-20 release, xrootd 4.12.5 and 5.0.3
* included in the upcoming release: DPM, VOMS


= Operations  =
= Operations  =


== ARGO/SAM  ==
== ARGO/SAM  ==
* [https://argo-mon-fedcloud.cro-ngi.hr/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_org.opensciencegrid.htcondorce&style=detail HTCondor-CE probes] included in the ARGO_MON_OPERATORS profile on May 13th: https://ggus.eu/index.php?mode=ticket_info&ticket_id=146949
* [https://argo-mon-fedcloud.cro-ngi.hr/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_org.opensciencegrid.htcondorce&style=detail HTCondor-CE probes]  
** '''(14th Sept)''' 70 endpoints, 14 CRITICAL, success rate is about 80%
** '''Oct 1st: included in the [https://poem.egi.eu/ui/public_metricprofiles/ARGO_MON_CRITICAL ARGO_MON_CRITICAL] profile (A/R computation)'''
*** (Nov 16th) 76 endpoints, success rate (including WARNING) 84.2%
** working on the probe for the host certificate validity check: [https://ggus.eu/index.php?mode=ticket_info&ticket_id=147386 GGUS 147386]
** working on the probe for the host certificate validity check: [https://ggus.eu/index.php?mode=ticket_info&ticket_id=147386 GGUS 147386]
** integration with secmon and pakiti: [https://ggus.eu/index.php?mode=ticket_info&ticket_id=150006 GGUS 150006]
* CREAM-CE metrics removed from ARGO_MON, ARGO_MON_OPERATIONS and ARGO_MON_CRITICAL ([https://ggus.eu/index.php?mode=ticket_info&ticket_id=149778 GGUS 149778])
* CREAM-CE metrics removed from ARGO_MON, ARGO_MON_OPERATIONS and ARGO_MON_CRITICAL ([https://ggus.eu/index.php?mode=ticket_info&ticket_id=149778 GGUS 149778])
** emi.cream.CREAMCE*
** emi.cream.CREAMCE*
Line 41: Line 39:


== Feedback from DMSU  ==
== Feedback from DMSU  ==
== Upgrade of central argus node ==
Message sent to administrators of NGIs argus servers:
* A replacement of the central argus servers (lcgargus03.cern.ch & lcgargus04.cern.ch), which are behind the argus.cern.ch & lcgargus.cern.ch aliases, is planned for Tuesday 17th November 2020 between 10:00 and 12:00.
* This replacement should be transparent, requiring no change of configuration on your side. Please report any issue you have with your NGI argus server.
* The two new hosts, lcargus21.cern.ch and lcgargus22.cern.ch are already ready for production, you can remotely test them if you want. The operation next week is simply a change of alias.


== Monthly Availability/Reliability ==
== Monthly Availability/Reliability ==
Line 53: Line 45:
*** '''HK-HKU-CC-01''': migrating DPM from sl6 to CentOS7
*** '''HK-HKU-CC-01''': migrating DPM from sl6 to CentOS7
*** '''TW-NCUHEP''': ARC-CE failures due to outdated CAs package, performance is now good
*** '''TW-NCUHEP''': ARC-CE failures due to outdated CAs package, performance is now good
** CERN-PROD: https://ggus.eu/index.php?mode=ticket_info&ticket_id=149351
** '''CERN-PROD''': https://ggus.eu/index.php?mode=ticket_info&ticket_id=149351
*** webdav failures which required a fix in the EOS services https://its.cern.ch/jira/browse/EOS-4515 ; some instability with the site-bdii
*** webdav failures which required a fix in the EOS services https://its.cern.ch/jira/browse/EOS-4515 ; some instability with the site-bdii
** NGI_HR: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148518
** NGI_HR: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148518
*** egee.irb.hr: in the process of a major upgrade from CentOS 6 to CentOS 7, some delays.
*** '''egee.irb.hr''': in the process of a major upgrade from CentOS 6 to CentOS 7, some delays.
** NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148957
** NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148957
*** INFN-CATANIA: SRM problems; the SRM service will be decommissioned
*** '''INFN-CATANIA''': SRM problems; the SRM service will be decommissioned
*** INFN-CATANIA-STACK: recovered
*** '''INFN-CATANIA-STACK''': recovered
*** INFN-PADOVA: decommissioning process
*** '''INFN-PADOVA''': decommissioning process
** NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=149352
** NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=149352
*** INFN-LECCE: authz failures on SRM; CREAM-CE to decommission
*** '''INFN-LECCE''': authz failures on SRM; CREAM-CE to decommission
*** TRIGRID-INFN-CATANIA: CREAM-CE to decommission
*** '''TRIGRID-INFN-CATANIA''': CREAM-CE to decommission
** NGI_IT https://ggus.eu/index.php?mode=ticket_info&ticket_id=149798
** NGI_IT https://ggus.eu/index.php?mode=ticket_info&ticket_id=149798
*** INFN-ROMA1-CMS: intermittent failures on SRM service; some failures on ARC-CE servers
*** '''INFN-ROMA1-CMS''': intermittent failures on SRM service; some failures on ARC-CE servers
**NGI_UK:
**NGI_UK:
***'''UKI-SOUTHGRID-SUSX''': https://ggus.eu/index.php?mode=ticket_info&ticket_id=144720 Migration from CREAM to ARC, WN migration to CentOS7; SRM to be decommissioned; ARC-CE was failing the IGTF test, then solved; site-bdii failures. new failures on ARC-CE.
***'''UKI-SOUTHGRID-SUSX''': https://ggus.eu/index.php?mode=ticket_info&ticket_id=144720 Migration from CREAM to ARC, WN migration to CentOS7; SRM to be decommissioned; ARC-CE was failing the IGTF test, then solved; site-bdii failures. new failures on ARC-CE.
** NGI_UA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148958
** NGI_UA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148958
*** UA-NSCMBR: IGTF outdated; new failures with ARC-CE and SRM/webdav
*** '''UA-NSCMBR''': IGTF outdated; new failures with ARC-CE and SRM/webdav
** ROC_LA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148515
** ROC_LA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148515
*** ATLAND: downtime due to powercut and quarantine
*** '''ATLAND''': downtime due to powercut and quarantine
** ROC_LA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148956
** ROC_LA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148956
*** CBPF: SRM failures due to information not properly published. Physical access to facilities restricted due to COVID measures; planned a DPM update in December.
*** '''CBPF''': SRM failures due to information not properly published. Physical access to facilities restricted due to COVID measures; planned a DPM update in December.
** ROC_LA https://ggus.eu/index.php?mode=ticket_info&ticket_id=149355
** ROC_LA https://ggus.eu/index.php?mode=ticket_info&ticket_id=149355
*** SUPERCOMPUTO-UNAM: scheduled a downtime for upgrading the site.
*** '''SUPERCOMPUTO-UNAM''': scheduled a downtime for upgrading the site.
*Under-performed sites after 3 consecutive months, under-performed NGIs, QoS violations: ('''December 2020'''):
*Under-performed sites after 3 consecutive months, under-performed NGIs, QoS violations: ('''December 2020'''):
** AsiaPacific: https://ggus.eu/index.php?mode=ticket_info&ticket_id=150109
*** '''INDIACMS-TIFR'''
*** '''KR-KNU-T3'''
** NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=150108
** NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=150108
*** GARR-01-DIR
*** '''GARR-01-DIR'''
** NGI_NDGF: https://ggus.eu/index.php?mode=ticket_info&ticket_id=150111
*** '''SE-SNIC-T2''': network issues. Planned a meeting with the internet provider.
** NGI_TR: https://ggus.eu/index.php?mode=ticket_info&ticket_id=150107
** NGI_TR: https://ggus.eu/index.php?mode=ticket_info&ticket_id=150107
*** AZ-IFAN
*** '''AZ-IFAN''': CREAM-CE and SRM decommissioned, HTCondorCE deployed; Site-BDII re-installed.
** Russia: https://ggus.eu/index.php?mode=ticket_info&ticket_id=150110
*** '''ITEP''': hardware problems with storage element, replacement of ARC-CE machine


*sites suspended:
*sites suspended:
Line 89: Line 88:
* please provide updates to the IPv6 assessment (ongoing) https://wiki.egi.eu/w/index.php?title=IPV6_Assessment  
* please provide updates to the IPv6 assessment (ongoing) https://wiki.egi.eu/w/index.php?title=IPV6_Assessment  
* any relevant information will be summarised at the OMB
* any relevant information will be summarised at the OMB
== Top-BDII problem affecting the publication of accounting records ==
* on 20th Dec 2020 the top-bdii at CERN lcg-bdii.cern.ch stopped working
** https://ggus.eu/?mode=ticket_info&ticket_id=150055
* since then, it wasn't possible to publish the accounting data
** the SSM script couldn't find the Message Brokers queue to send the messages
* top-bdii fixed on 4th Jan 2021
* this problem affected all sites because the APEL SSM config file points to CERN's top-BDII by default
** each site can instead set the top-BDII of its region:
*** [https://goc.egi.eu/portal/index.php?Page_Type=Service_Group&id=1205 Top-BDIIs service group] on GOCDB
*** [https://argo-mon.egi.eu/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_Top-BDII&style=overview Top-BDII servers] monitored by ARGO
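
The switch to a regional top-BDII described above can be sketched as a small script. This is only an illustration under stated assumptions: the <code>[broker]</code> section and the <code>bdii</code> / <code>network</code> option names are assumed to match the SSM sender configuration deployed at the site, and <code>top-bdii.example.org</code> is a placeholder, not a real endpoint. Check the actual config file and the Top-BDIIs service group on GOCDB before changing anything.

```python
import configparser
import io

# Assumed layout of an SSM sender configuration. The [broker] section and
# the "bdii"/"network" option names are assumptions made for this sketch.
SAMPLE_CFG = """\
[broker]
bdii: ldap://lcg-bdii.cern.ch:2170
network: PROD
"""


def set_top_bdii(cfg_text: str, bdii_url: str) -> str:
    """Return cfg_text with the BDII used for broker lookup set to bdii_url."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    if not parser.has_section("broker"):
        raise ValueError("no [broker] section found")
    parser.set("broker", "bdii", bdii_url)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    # top-bdii.example.org is a placeholder, not a real regional top-BDII.
    print(set_top_bdii(SAMPLE_CFG, "ldap://top-bdii.example.org:2170"))
```

Note that <code>configparser</code> drops comments and normalises the delimiter when writing the file back, so for a production config a manual edit of the single line may be preferable.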


== CREAM-CE Decommission ==
== CREAM-CE Decommission ==
Line 101: Line 113:
* Nov 1st: probe returns CRITICAL status, alarms created on the ROD dashboard, ROD teams start to create tickets
* Nov 1st: probe returns CRITICAL status, alarms created on the ROD dashboard, ROD teams start to create tickets
** https://ggus.eu/index.php?mode=ticket_info&ticket_id=149312
** https://ggus.eu/index.php?mode=ticket_info&ticket_id=149312
* 1st Jan 2021: EGI Ops will start chasing the sites still providing CREAM-CE endpoints
* 1st Feb 2021: EGI Ops will start chasing the sites still providing CREAM-CE endpoints
** By this time service end-points which couldn't be upgraded should be put into downtime by site admin or ROD:
** By this time service end-points which couldn't be upgraded should be put into downtime by site admin or ROD
 
* 1st March 2021: Sites still deploying unsupported service endpoints risk suspension, unless documented technical reasons prevent a Site Admin from updating these endpoints.
== ARC Middleware 5 end of support, migration to ARC 6 ==
* [https://operations-portal.egi.eu/broadcast/archive/2668 EGI Operations Broadcast]
* [https://wiki.egi.eu/wiki/PROC16_Decommissioning_of_unsupported_software PROC16 Decommission of unsupported software]
* deadline: '''end of July'''
 
 
* Status
{| class="wikitable sortable"
|-
! Date !! Number of endpoints in BDII !! Number of GGUS tickets !! Issues
|-
| 2020-06-08 || 75 || 42 || Some ARC endpoints publish a timestamp instead of a version like 5.X.Y; we can fairly assume they are ARC6 nightly builds, but we're going to close the corresponding tickets after explicit confirmation from the site admin.
|-
| 2020-07-13 || 53 || 29 || -
|-
| 2020-09-14 || 34 || 18 || -
|-
| 2020-10-12 || 32 || 19 || -
|-
| 2020-11-16 || 26 || 16 || -
|}


== Storage accounting ==
== VOMS upgrade to CentOS 7 ==


Many sites stopped the publication of storage accounting records. Opened [https://ggus.eu/index.php?mode=ticket_search&show_columns_check%5B0%5D=TICKET_TYPE&show_columns_check%5B1%5D=AFFECTED_VO&show_columns_check%5B2%5D=AFFECTED_SITE&show_columns_check%5B3%5D=PRIORITY&show_columns_check%5B4%5D=RESPONSIBLE_UNIT&show_columns_check%5B5%5D=STATUS&show_columns_check%5B6%5D=DATE_OF_CHANGE&show_columns_check%5B7%5D=SHORT_DESCRIPTION&show_columns_check%5B8%5D=SCOPE&su_hierarchy=0&keyword=publishing+storage+accounting+records&specattrib=none&status=all&typeofproblem=all&ticket_category=all&date_type=creation+date&tf_radio=1&timeframe=any&from_date=10+Jul+2020&to_date=11+Jul+2020&orderticketsby=REQUEST_ID&orderhow=desc&search_submit=GO%21&ticket_per_page=60  57 tickets] to fix that.
* VOMS for CentOS 7 released Nov 23rd with [https://repository.egi.eu/2020/11/18/release-umd-4-12-3/ UMD 4.12.3]
* 12 tickets not solved yet
** VOMS Admin 3.8.0, VOMS Server 2.0.15
* page for checking when the records were published: http://goc-accounting.grid-support.ac.uk/storagetest/storagesitesystems.html
* VOMS endpoints registered on GOCDB as production and monitored: 41
* [http://accounting-devel.egi.eu/storage.php Accounting Portal Prototype view]
** Provided by 33 sites
* list of tickets opened: [https://ggus.eu/index.php?mode=ticket_search&su_hierarchy=0&status=all&date_type=creation+date&tf_radio=1&timeframe=any&from_date=11+Dec+2020&to_date=12+Dec+2020&ticket_category=all&typeofproblem=all&specattrib=none&user=paolini&keyword=upgrade+your+VOMS+server+to%3A+CentOS7%2C+VOMS+Admin+server+3.8.0%2C+VOMS+server+2.0.15&orderticketsby=REQUEST_ID&orderhow=desc&ticket_per_page=50&show_columns_check%5B0%5D=TICKET_TYPE&show_columns_check%5B1%5D=AFFECTED_VO&show_columns_check%5B2%5D=AFFECTED_SITE&show_columns_check%5B3%5D=PRIORITY&show_columns_check%5B4%5D=RESPONSIBLE_UNIT&show_columns_check%5B5%5D=STATUS&show_columns_check%5B6%5D=DATE_OF_CHANGE&show_columns_check%5B7%5D=SHORT_DESCRIPTION&show_columns_check%5B8%5D=SCOPE&search_submit=Search GGUS]
* the VOMS servers need to be published in the BDII in order to easily collect the deployed version


= AOB  =
= AOB  =
Line 137: Line 130:


== Next meeting  ==
== Next meeting  ==
In 2021
8th Feb 2021

Latest revision as of 13:28, 11 January 2021


General information

Middleware

UMD

  • feedback on software automation from the EGI Conference

Preview repository

  • 2020-11-30
    • Preview 1.30.0 AppDB info (last release on sl6): CVMFS 2.7.5 and egi-cvmfs-2-7.12, dCache 5.2.35, DMLite/DPM 1.14.2, Dynafed 1.6.0, STORM 1.11.19, VOMS 10-20 release, xrootd 4.12.5
    • Preview 2.30.0 AppDB info (CentOS 7): APEL-SSM 3.0.1, CVMFS 2.7.5 and egi-cvmfs-2-7.12, dCache 5.2.35, DMLite/DPM 1.14.2, Dynafed 1.6.0, STORM 1.11.19, VOMS 10-20 release, xrootd 4.12.5 and 5.0.3

Operations

ARGO/SAM

  • HTCondor-CE probes
    • working on the probe for the host certificate validity check: GGUS 147386
    • integration with secmon and pakiti: GGUS 150006
  • CREAM-CE metrics removed from ARGO_MON, ARGO_MON_OPERATIONS and ARGO_MON_CRITICAL (GGUS 149778)
    • emi.cream.CREAMCE*
    • eu.egi.CREAM*

FedCloud

Feedback from DMSU

Monthly Availability/Reliability

  • sites suspended:
    • WCSS64 (NGI_PL)

IPv6 readiness plans

Top-BDII problem affecting the publication of accounting records

  • on 20th Dec 2020 the top-bdii at CERN lcg-bdii.cern.ch stopped working
  • since then, it wasn't possible to publish the accounting data
    • the SSM script couldn't find the Message Brokers queue to send the messages
  • top-bdii fixed on 4th Jan 2021
  • this problem affected all sites because the APEL SSM config file points to CERN's top-BDII by default

CREAM-CE Decommission

VOMS upgrade to CentOS 7

  • VOMS for CentOS 7 released Nov 23rd with UMD 4.12.3
    • VOMS Admin 3.8.0, VOMS Server 2.0.15
  • VOMS endpoints registered on GOCDB as production and monitored: 41
    • Provided by 33 sites
  • list of tickets opened: GGUS
  • the VOMS servers need to be published in the BDII in order to easily collect the deployed version

AOB

Next meeting

8th Feb 2021