The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed you can get in touch with EGI SDIS team using operations @ egi.eu.

Difference between revisions of "Agenda-2021-03-08"

From EGIWiki
(29 intermediate revisions by 2 users not shown)
Line 6: Line 6:


= Middleware =
== Software repository maintenance ==
* [https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=30300 unscheduled downtime] from 03-Mar-21 17:00:00 (UTC) to 05-Mar-21 14:30:06 (UTC)
* Broadcasts announcing the [https://operations-portal.egi.eu/broadcast/archive/2834 unexpected downtime] and the [https://operations-portal.egi.eu/broadcast/archive/2837 recovery]
* Migration from IASA to IBERGRID
* [https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=30308 Warning downtime] until 09-Mar-21 14:30:00 (UTC)
* the repository is fully online; the related website is not yet (only some very simple HTML pages)
* more details of the incident will be circulated soon


== UMD ==
Line 12: Line 19:
* migration of Software Provisioning infrastructure to IBERGRID still ongoing
** in particular, administration portal used for release creation done successfully
* February release planned https://wiki.egi.eu/wiki/UMD_Release_Schedule to be discussed at today's meeting
* still planning next release, delay due to repo migration
* problem: UMD-4 missing voms-clients-cpp-2.0.15: http://repository.egi.eu/sw/production/umd/4/centos7/x86_64/updates/
** to be fixed urgently


== Preview repository  ==
*released on 2020-11-30:
** '''[[Preview 1.30.0]]''' [https://appdb.egi.eu/store/software/preview.repository/releases/1.0/1.30.0/ AppDB info] '''(last release on sl6)''':  CVMFS 2.7.5 and egi-cvmfs-2-7.12, dCache 5.2.35, DMLite/DPM 1.14.2, Dynafed 1.6.0, STORM 1.11.19, VOMS 10-20 release, xrootd 4.12.5
** '''[[Preview 2.30.0]]''' [https://appdb.egi.eu/store/software/preview.repository/releases/2.0/2.30.0/ AppDB info] (CentOS 7):  APEL-SSM 3.0.1, CVMFS 2.7.5 and egi-cvmfs-2-7.12, dCache 5.2.35, DMLite/DPM 1.14.2, Dynafed 1.6.0, STORM 1.11.19, VOMS 10-20 release, xrootd 4.12.5 and 5.0.3
* collecting information for the next release
*released on 2021-02-22
** '''[[Preview 2.31.0]]''' [https://appdb.egi.eu/store/software/preview.repository/releases/2.0/2.31.0/ AppDB info] (CentOS 7):  APEL-SSM 3.1.1, ARC 6.10.1, CVMFS 2.8.0 and egi-cvmfs-3-1.13, davix 0.7.6, dCache 5.2.38, gfal2 2.18.2


= Operations  =


== ARGO/SAM  ==
* Migration to CentOS 7 completed
** some probes not yet ready for CentOS 7 are temporarily executed by https://egi-mon-old.argo.grnet.gr/nagios/
* [https://argo-mon-fedcloud.cro-ngi.hr/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_org.opensciencegrid.htcondorce&style=detail HTCondor-CE probes]
** working on the probe for the host certificate validity check: [https://ggus.eu/index.php?mode=ticket_info&ticket_id=147386 GGUS 147386]
** integration with secmon and pakiti: [https://ggus.eu/index.php?mode=ticket_info&ticket_id=150006 GGUS 150006]
* Site-BDII metrics org.bdii.Entries and org.bdii.Freshness [https://ggus.eu/index.php?mode=ticket_info&ticket_id=150657 removed] from ARGO_MON_CRITICAL profile
** the metrics are still kept in the ARGO_MON_OPERATORS profiles
** it is still an important service to support infrastructure oversight activities
* [https://argo-mon.egi.eu/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_org.opensciencegrid.htcondorce&style=overview HTCondor-CE probes]
** deployed on secmon and pakiti: [https://ggus.eu/index.php?mode=ticket_info&ticket_id=150006 GGUS 150006]
** working on the probe for the host certificate validity check: [https://ggus.eu/index.php?mode=ticket_info&ticket_id=147386 GGUS 147386]
*** With 8.9.12 installed (expected the week of Mar 15), you should be able to query remote HTCondor-CEs for their host certificate using the following:
$ python -c 'import htcondor; ad = htcondor.Collector("collector2.opensciencegrid.org:9619").locate(htcondor.DaemonTypes.Schedd, "hosted-ce10.opensciencegrid.org"); print(htcondor.SecMan().ping(ad, "READ")["ServerPublicCert"])' | openssl x509 -noout -subject -enddate
subject= /CN=hosted-ce10.opensciencegrid.org
notAfter=Apr 26 12:26:42 2021 GMT
* CREAM-CE metrics removed from ARGO_MON, ARGO_MON_OPERATIONS and ARGO_MON_CRITICAL ([https://ggus.eu/index.php?mode=ticket_info&ticket_id=149778 GGUS 149778])
** emi.cream.CREAMCE*
Line 46: Line 54:
*** '''INDIACMS-TIFR''' failures with HTCondor-CE and webdav; additional failures with SRM tests
*** '''KR-KNU-T3''': migration from CREAM-CE to HTCondor-CE
** NGI_DE: https://ggus.eu/index.php?mode=ticket_info&ticket_id=150467
*** '''SCAI''': replacement of the cloud cluster
** NGI_HR: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148518
*** '''egee.irb.hr''': in the process of a major upgrade from CentOS 6 to CentOS 7, some delays.
Line 59: Line 65:
*** '''SUPERCOMPUTO-UNAM''': scheduled a downtime for upgrading the site.
*Under-performed sites after 3 consecutive months, under-performed NGIs, QoS violations: ('''February 2021'''):
 
** NGI_DE: https://ggus.eu/index.php?mode=ticket_info&ticket_id=150816
*** GoeGrid
** NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=150818
*** INFN-PISA
** ROC_LA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=150817
*** ICN-UNAM: replacing CREAM-CE
*sites suspended:
** GARR-01-DIR (NGI_IT)
** GARR-01-DIR (NGI_IT), TW-NCUHEP (AsiaPacific)


== IPv6 readiness plans  ==
Line 69: Line 79:
* relevant information, if any, will be summarised at the OMB


== Top-BDII problem affecting the publication of accounting records ==
== APEL migration from ActiveMQ to ARGO Message Service (AMS) ==
* Migration instructions: https://github.com/apel/ssm/blob/dev/migrating_to_ams.md
* ActiveMQ is going to be decommissioned at the end of March, with the end of EOSC-hub
** https://ggus.eu/index.php?mode=ticket_info&ticket_id=140318
* Currently an issue with the APEL client prevents SSM from properly sending the records through AMS
** it doesn't affect cloud and storage accounting
** ARC-CE might not work if using an old bundled version of SSM, but new ARC versions may work if set to use standalone SSM
** with HTCondor-CE it may work; we will find some sites to test it
** by mid-March a fix will be released; then the sites with ARC-CE/HTCondor-CE can implement the change
* starting the migration with FedCloud sites
** [https://ggus.eu/index.php?mode=ticket_search&su_hierarchy=0&status=all&date_type=creation+date&tf_radio=1&timeframe=any&from_date=05+Mar+2021&to_date=06+Mar+2021&ticket_category=all&typeofproblem=all&specattrib=none&user=paolini&keyword=APEL+migration+from+ActiveMQ+to+AMS&orderticketsby=REQUEST_ID&orderhow=desc&ticket_per_page=50&show_columns_check%5B0%5D=TICKET_TYPE&show_columns_check%5B1%5D=AFFECTED_VO&show_columns_check%5B2%5D=AFFECTED_SITE&show_columns_check%5B3%5D=PRIORITY&show_columns_check%5B4%5D=RESPONSIBLE_UNIT&show_columns_check%5B5%5D=STATUS&show_columns_check%5B6%5D=DATE_OF_CHANGE&show_columns_check%5B7%5D=SHORT_DESCRIPTION&show_columns_check%5B8%5D=SCOPE&search_submit=Search list of tickets]
 
=== Feedback from NGI_FRANCE ===
On the Cloud infra, several tickets have been opened to switch to the new messaging system. It would be nice to have the following RPMs made available from the CMD repo, and not only from UMD:
* apel-ssm-2.4.1-1.el7.noarch
* python-argo-ams-library-0.5.1-1.el7.noarch
 
In addition, many Cloud sites are now using OpenStack Stein or newer. These versions are provided with python-daemon = 2.2.3-1.el7, which conflicts with the requirement of apel-ssm (python-daemon < 2.2.0).
 
'''Feedback from URT''':
* there is a new apel-ssm 3.0.0 version in the untested repo, and this new version solves the dependency issue with the python-daemon <= 2.2.0 requirement.
* there is also the new python-argo-ams 0.54 library


* on 20th Dec 2020 the top-bdii at CERN lcg-bdii.cern.ch stopped working
** https://ggus.eu/?mode=ticket_info&ticket_id=150055
* since then, it wasn't possible to publish the accounting data
** the SSM script couldn't find the Message Brokers queue to send the messages
* top-bdii fixed on 4th Jan 2021
* this problem affected all the sites because CERN's top-BDII is set by default in the APEL SSM config file
** each site can instead set the top-BDII of its region:
*** [https://goc.egi.eu/portal/index.php?Page_Type=Service_Group&id=1205 Top-BDIIs service group] on GOCDB
*** [https://argo-mon.egi.eu/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_Top-BDII&style=overview Top-BDII servers] monitored by ARGO
* CERN's top-BDII is going to be retired

== ARC-CE probe failing due to UMD repositories being down ==
* The [https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=30300 unavailability] of the UMD repository caused a failure with the ARC-CE IGTF probes (org.nordugrid.ARC-CE-result-ops)
** https://ggus.eu/index.php?mode=ticket_info&ticket_id=150827
Job terminated as Failed. - Failed in data staging: Failed checking source replica http://repository.egi.eu:80/sw/production/cas/1/current/meta/ca-policy-egi-core.list: Failed to obtain information about file: Failed to connect to repository.egi.eu(IPv4):80 - JID: gsiftp://alex4.nipne.ro:2811/jobs/yq0NDmskJcynuvw3Vp3UrRNqABFKDmABFKDm8hJKDmABFKDmxx7PPm
* Asked the ARC-CE developers to remove this dependency from the probe:
** https://ggus.eu/index.php?mode=ticket_info&ticket_id=150833
* A recomputation will be requested to exclude these failures from the A/R figures


== CREAM-CE Decommission ==
Line 116: Line 144:


== Next meeting  ==
8th Mar 2021
12th Apr 2021

Latest revision as of 12:41, 9 March 2021



Back to https://wiki.egi.eu/wiki/Operations_Meeting

General information

Middleware

Software repository maintenance

  • unscheduled downtime from 03-Mar-21 17:00:00 (UTC) to 05-Mar-21 14:30:06 (UTC)
  • Broadcasts announcing the unexpected downtime and the recovery
  • Migration from IASA to IBERGRID
  • Warning downtime until 09-Mar-21 14:30:00 (UTC)
  • the repository is fully online; the related website is not yet (only some very simple HTML pages)
  • more details of the incident will be circulated soon

UMD

  • CentOS8 discussion still ongoing
  • migration of Software Provisioning infrastructure to IBERGRID still ongoing
    • in particular, administration portal used for release creation done successfully
  • still planning next release, delay due to repo migration

Preview repository

  • released on 2021-02-22
    • Preview 2.31.0 AppDB info (CentOS 7): APEL-SSM 3.1.1, ARC 6.10.1, CVMFS 2.8.0 and egi-cvmfs-3-1.13, davix 0.7.6, dCache 5.2.38, gfal2 2.18.2

Operations

ARGO/SAM

  • Site-BDII metrics org.bdii.Entries and org.bdii.Freshness removed from ARGO_MON_CRITICAL profile
    • the metrics are still kept in the ARGO_MON_OPERATORS profiles
    • it is still an important service to support infrastructure oversight activities
  • HTCondor-CE probes
    • deployed on secmon and pakiti: GGUS 150006
    • working on the probe for the host certificate validity check: GGUS 147386
      • With 8.9.12 installed (expected the week of Mar 15), you should be able to query remote HTCondor-CEs for their host certificate using the following:
$ python -c 'import htcondor; ad = htcondor.Collector("collector2.opensciencegrid.org:9619").locate(htcondor.DaemonTypes.Schedd, "hosted-ce10.opensciencegrid.org"); print(htcondor.SecMan().ping(ad, "READ")["ServerPublicCert"])' | openssl x509 -noout -subject -enddate
subject= /CN=hosted-ce10.opensciencegrid.org
notAfter=Apr 26 12:26:42 2021 GMT
  • CREAM-CE metrics removed from ARGO_MON, ARGO_MON_OPERATIONS and ARGO_MON_CRITICAL (GGUS 149778)
    • emi.cream.CREAMCE*
    • eu.egi.CREAM*
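
The `notAfter` line printed by the host certificate query above can be turned into a remaining-validity figure, which is the kind of value a certificate validity probe would compare against a threshold. The following is only a minimal sketch, assuming the `openssl` output format shown in the example and an English/C locale for month names; `parse_not_after` and `days_until_expiry` are hypothetical helper names and are not part of any existing probe.

```python
from datetime import datetime, timezone


def parse_not_after(line):
    """Parse an openssl 'notAfter=...' line into a timezone-aware UTC datetime."""
    prefix = "notAfter="
    if not line.startswith(prefix):
        raise ValueError("not a notAfter line: %r" % line)
    stamp = line[len(prefix):].strip()
    # openssl prints the date as e.g. 'Apr 26 12:26:42 2021 GMT'
    parsed = datetime.strptime(stamp, "%b %d %H:%M:%S %Y GMT")
    return parsed.replace(tzinfo=timezone.utc)


def days_until_expiry(line, now):
    """Number of days (floored) between 'now' and the certificate expiry."""
    return (parse_not_after(line) - now).days


if __name__ == "__main__":
    # Sample line taken from the example output above.
    sample = "notAfter=Apr 26 12:26:42 2021 GMT"
    # Reference date chosen for illustration: the day of this meeting.
    reference = datetime(2021, 3, 8, tzinfo=timezone.utc)
    print(days_until_expiry(sample, reference))
```

Passing the reference time explicitly keeps the helper easy to test; a real check would pass the current UTC time instead.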

FedCloud

Feedback from DMSU

Monthly Availability/Reliability

IPv6 readiness plans

APEL migration from ActiveMQ to ARGO Message Service (AMS)

  • Migration instructions: https://github.com/apel/ssm/blob/dev/migrating_to_ams.md
  • ActiveMQ is going to be decommissioned at the end of March, with the end of EOSC-hub
  • Currently an issue with the APEL client prevents SSM from properly sending the records through AMS
    • it doesn't affect cloud and storage accounting
    • ARC-CE might not work if using an old bundled version of SSM, but new ARC versions may work if set to use standalone SSM
    • with HTCondor-CE it may work; we will find some sites to test it
    • by mid-March a fix will be released; then the sites with ARC-CE/HTCondor-CE can implement the change
  • starting the migration with FedCloud sites

Feedback from NGI_FRANCE

On the Cloud infra, several tickets have been opened to switch to the new messaging system. It would be nice to have the following RPMs made available from the CMD repo, and not only from UMD:

  • apel-ssm-2.4.1-1.el7.noarch
  • python-argo-ams-library-0.5.1-1.el7.noarch

In addition, many Cloud sites are now using OpenStack Stein or newer. These versions are provided with python-daemon = 2.2.3-1.el7, which conflicts with the requirement of apel-ssm (python-daemon < 2.2.0).

Feedback from URT:

  • there is a new apel-ssm 3.0.0 version in the untested repo, and this new version solves the dependency issue with the python-daemon <= 2.2.0 requirement.
  • there is also the new python-argo-ams 0.54 library
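
The conflict reported by NGI_FRANCE comes down to a version comparison between the installed python-daemon and the constraint declared by apel-ssm. The sketch below illustrates that comparison only; it assumes plain dotted numeric versions and deliberately ignores RPM epoch and release semantics. `parse_version` and `satisfies` are hypothetical helpers, not RPM or yum APIs.

```python
import operator

# Comparison operators as they appear in a dependency constraint.
_OPS = {
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    ">=": operator.ge,
    ">": operator.gt,
}


def parse_version(text):
    """Turn '2.2.3-1.el7' into (2, 2, 3), dropping the RPM release suffix."""
    upstream = text.split("-", 1)[0]
    return tuple(int(part) for part in upstream.split("."))


def satisfies(installed, op, required):
    """Return True if the installed version meets the 'op required' constraint."""
    return _OPS[op](parse_version(installed), parse_version(required))


if __name__ == "__main__":
    # python-daemon shipped with OpenStack Stein vs. the apel-ssm requirement.
    print(satisfies("2.2.3-1.el7", "<", "2.2.0"))
```

With the versions quoted above, the installed python-daemon 2.2.3 does not meet a `< 2.2.0` requirement, which is why the package manager reports a conflict.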

ARC-CE probe failing due to UMD repositories being down

Job terminated as Failed. - Failed in data staging: Failed checking source replica http://repository.egi.eu:80/sw/production/cas/1/current/meta/ca-policy-egi-core.list: Failed to obtain information about file: Failed to connect to repository.egi.eu(IPv4):80 - JID: gsiftp://alex4.nipne.ro:2811/jobs/yq0NDmskJcynuvw3Vp3UrRNqABFKDmABFKDm8hJKDmABFKDmxx7PPm 

CREAM-CE Decommission

  • End of Security Updates and Support: 31st Dec 2020
  • Decommissioning deadline: 31st Jan 2021
  • PROC16 Decommission of unsupported software
  • Decommissioning start date: Oct 1st 2020
  • Nov 1st: probe returns CRITICAL status, alarms created on the ROD dashboard, ROD teams start to create tickets
  • 1st Feb 2021: EGI Ops will start chasing the sites still providing CREAM-CE endpoints
    • By this time service end-points which couldn't be upgraded should be put into downtime by site admin or ROD
  • 1st March 2021: Sites still deploying unsupported service endpoints risk suspension, unless documented technical reasons prevent a Site Admin from updating these endpoints.
  • Tickets opened: 49
  • Please note that at least one CE endpoint should be associated to the APEL service type in order to monitor the publication of the accounting data, as explained here
    • If the CE you are going to remove was also registered as APEL service type, do not forget to move the APEL service type to a different CE endpoint.

VOMS upgrade to CentOS 7

  • VOMS for CentOS 7 released Nov 23rd with UMD 4.12.13
    • VOMS Admin 3.8.0, VOMS Server 2.0.15
  • VOMS endpoints registered on GOCDB as production and monitored: 41
    • Provided by 33 sites
  • list of tickets opened: GGUS
  • the VOMS servers need to be published in the BDII in order to easily collect the deployed version

AOB

Next meeting

12th Apr 2021