The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team using operations @ egi.eu.

Agenda-2021-11-15

== UMD ==


* including EOS in UMD
* CentOS Stream 8 is now the recommended OS for new installations
* migrations from CentOS 8 to CentOS Stream 8 (C8->CS8) are recommended
* CS9 will be supported by CERN and FNAL
* middleware: the recommended path is C7->CS9 (we will probably skip CS8)
 
* new release: https://repository.egi.eu/UMD/4.15.1.html
** ARC-CE 6.13.0: bug-fix release
** Xrootd 5.3.1: bug-fix release
** ''CERN EOS 5.0.2: new release of EOS Open Storage, which provides a storage solution for large amounts of physics data and user files, with a focus on interactive and batch analysis.''
** dCache 6.2.31: security vulnerability fix
** Infrastructure Manager Nagios probe 1.3.1
** GridFTP 13.21.1: minor bug fixes to some Globus packages
** gfal2 2.19.2: regular update of the gfal clients
** gfal2-utils 1.6.0: regular update of the gfal2-utils clients
** EGI CVMFS 3.3.16: new release of the meta-package providing the default CVMFS configuration for EGI
** CVMFS 2.8.2: patch release with bug fixes and new diagnostic commands for the client
** HTCondor 9.0.1: new major release of HTCondor
** HTCondor-CE 5.1.3: new major release of HTCondor-CE


== Preview repository  ==


== ARGO/SAM  ==
* probe for checking the HTCondorCE host certificate validity deployed in production ([https://ggus.eu/index.php?mode=ticket_info&ticket_id=147386 GGUS 147386]):
** checks on expiration date, CN, and CA:
*** https://argo-mon.egi.eu/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_org.opensciencegrid.htcondorce&style=detail
** it is working fine (very few failures)
** to be included in the A/R profile


== FedCloud  ==
* [https://indico.egi.eu/event/4775/ badging]


== Feedback from DMSU  ==
*Under-performed sites after 3 consecutive months, under-performed NGIs, QoS violations: ('''October 2021'''):
** NGI_DE: https://ggus.eu/index.php?mode=ticket_info&ticket_id=154745
*** '''GoeGrid''': relocation of the cluster to a different building on the campus and subsequent network issues; handover to new staff; problems fixed.
** NGI_IBERGRID: https://ggus.eu/index.php?mode=ticket_info&ticket_id=154750
*** '''UAM-LCG2'''
** NGI_RO: https://ggus.eu/index.php?mode=ticket_info&ticket_id=154746
*** '''GRIDIFIN'''
** NGI_PL: https://ggus.eu/index.php?mode=ticket_info&ticket_id=154747
*** '''PSNC''': storage backend issues affecting the HPC cluster and DPM, also causing ARC-CE instability; the DPM issues were fixed, work on the HPC cluster is ongoing
** Russia: https://ggus.eu/index.php?mode=ticket_info&ticket_id=154748
*** '''RU-SARFTI''': ARC-CE failures and problems with the hard drives; fixed
** NGI_UA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=154749
*** '''UA-KNU''': failures with IGTF metric, now fixed.


*sites suspended:
* if relevant, information will be summarised at the OMB


== APEL migration from ActiveMQ to ARGO Message Service (AMS) ==
* '''ActiveMQ decommissioned on July 8th''': for security reasons it can no longer be maintained.
** Scheduled downtimes: https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=30888, https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=30889
* Migration instructions (HTCondorCE, Storage, and Cloud accounting): https://github.com/apel/ssm/blob/dev/migrating_to_ams.md
* ARC 6.12.0 released, instructions:
** http://www.nordugrid.org/arc/releases/6.12/release_notes_6.12.html
** all the sites with ARC-CE need to update to this version
* Recommended versions:
** APEL Client: 1.9.0
** APEL SSM: 3.2.1
* Cloud accounting campaign:
** [https://ggus.eu/index.php?mode=ticket_search&su_hierarchy=0&status=all&date_type=creation+date&tf_radio=1&timeframe=any&from_date=05+Mar+2021&to_date=06+Mar+2021&ticket_category=all&typeofproblem=all&specattrib=none&user=paolini&keyword=APEL+migration+from+ActiveMQ+to+AMS&orderticketsby=REQUEST_ID&orderhow=desc&ticket_per_page=50&show_columns_check%5B0%5D=TICKET_TYPE&show_columns_check%5B1%5D=AFFECTED_VO&show_columns_check%5B2%5D=AFFECTED_SITE&show_columns_check%5B3%5D=PRIORITY&show_columns_check%5B4%5D=RESPONSIBLE_UNIT&show_columns_check%5B5%5D=STATUS&show_columns_check%5B6%5D=DATE_OF_CHANGE&show_columns_check%5B7%5D=SHORT_DESCRIPTION&show_columns_check%5B8%5D=SCOPE&search_submit=Search list of tickets]
** 2 tickets (out of 21) not solved yet
*** '''UA-BITP''' [https://ggus.eu/index.php?mode=ticket_info&ticket_id=150857 150857]: old records associated with the "None" VO (due to a configuration issue) need to be removed from the repository
*** '''CYFRONET-CLOUD''' [https://ggus.eu/index.php?mode=ticket_info&ticket_id=150845 150845]:
* HTCondorCE and Storage accounting campaign:
** [https://ggus.eu/index.php?mode=ticket_search&su_hierarchy=0&status=all&date_type=creation+date&tf_radio=1&timeframe=any&from_date=15+Mar+2021&to_date=16+Mar+2021&ticket_category=all&typeofproblem=all&specattrib=none&user=paolini&keyword=APEL+migration+from+ActiveMQ+to+AMS+-+new+settings&orderticketsby=REQUEST_ID&orderhow=desc&ticket_per_page=50&show_columns_check%5B0%5D=TICKET_TYPE&show_columns_check%5B1%5D=AFFECTED_VO&show_columns_check%5B2%5D=AFFECTED_SITE&show_columns_check%5B3%5D=PRIORITY&show_columns_check%5B4%5D=RESPONSIBLE_UNIT&show_columns_check%5B5%5D=STATUS&show_columns_check%5B6%5D=DATE_OF_CHANGE&show_columns_check%5B7%5D=SHORT_DESCRIPTION&show_columns_check%5B8%5D=SCOPE&search_submit=Search list of tickets]
** 8 tickets (out of 53) not solved yet
*** '''ICN-UNAM''' [https://ggus.eu/index.php?mode=ticket_info&ticket_id=151374 151374]:
*** '''INFN-COSENZA''' [https://ggus.eu/index.php?mode=ticket_info&ticket_id=151183 151183]: using an old SSM version
*** '''INFN-NAPOLI-ATLAS''' [https://ggus.eu/index.php?mode=ticket_info&ticket_id=150976 150976]: some problems similar to [https://ggus.eu/index.php?mode=ticket_info&ticket_id=154347 GGUS 154347]
*** '''INFN-PISA''' [https://ggus.eu/index.php?mode=ticket_info&ticket_id=150977 150977]:
*** '''INFN-ROMA3''' [https://ggus.eu/index.php?mode=ticket_info&ticket_id=150978 150978]:
*** '''INFN-TRIESTE''' [https://ggus.eu/index.php?mode=ticket_info&ticket_id=151981 151981]:
*** '''SUPERCOMPUTO-UNAM''' [https://ggus.eu/index.php?mode=ticket_info&ticket_id=151375 151375]:
*** '''UA-BITP''' [https://ggus.eu/index.php?mode=ticket_info&ticket_id=150991 150991]: accounting records not properly parsed
* ARC-CE and storage accounting campaign:
** [https://ggus.eu/index.php?mode=ticket_search&su_hierarchy=0&status=all&date_type=creation+date&tf_radio=1&timeframe=any&from_date=11+Jun+2021&to_date=12+Jun+2021&ticket_category=all&typeofproblem=all&specattrib=none&user=paolini&keyword=APEL+migration+from+ActiveMQ+to+AMS+-+ARC-CE+new+settings&orderticketsby=REQUEST_ID&orderhow=asc&ticket_per_page=120&show_columns_check%5B0%5D=TICKET_TYPE&show_columns_check%5B1%5D=AFFECTED_SITE&show_columns_check%5B2%5D=PRIORITY&show_columns_check%5B3%5D=RESPONSIBLE_UNIT&show_columns_check%5B4%5D=STATUS&show_columns_check%5B5%5D=DATE_OF_CHANGE&show_columns_check%5B6%5D=SHORT_DESCRIPTION&search_submit=Search list of tickets]
** 11 tickets (out of 112) not solved yet
*** '''Australia-ATLAS''' [https://ggus.eu/index.php?mode=ticket_info&ticket_id=152428 152428] and '''Australia-T2''' [https://ggus.eu/index.php?mode=ticket_info&ticket_id=152429 152429]: they still run ARC-CE 5.4; they are moving to a Cloudscheduler-based compute system and will remove the ARC-CEs in the near future
*** '''CA-SFU-T2''' [https://ggus.eu/index.php?mode=ticket_info&ticket_id=152433 152433]: CEs updated; some errors with the benchmark, which seem harmful. Duplicated records for the previous months were cleaned, and it was suggested to set `apel_messages = summaries` in the arc.conf file; some inconsistencies are being investigated.
*** '''IN2P3-IPNL''' [https://ggus.eu/index.php?mode=ticket_info&ticket_id=152460 152460]: CE not yet in production
*** '''JP-KEK-CRC-02''' [https://ggus.eu/index.php?mode=ticket_info&ticket_id=152471 152471]: new CEs installed; some authorization failures when sending the records...
*** '''RU-SPbSU''' [https://ggus.eu/index.php?mode=ticket_info&ticket_id=152491 152491]: some discrepancies between the local database and the central repository; ARC developers involved, see [https://ggus.eu/index.php?mode=ticket_info&ticket_id=154090 GGUS 154090]...
*** '''Taiwan-LCG2''' [https://ggus.eu/index.php?mode=ticket_info&ticket_id=152497 152497]: setting up the ARC6 server...
*** '''TASK''' [https://ggus.eu/index.php?mode=ticket_info&ticket_id=152498 152498]:
*** '''TW-FTT''' [https://ggus.eu/index.php?mode=ticket_info&ticket_id=152503 152503]:
*** '''UA-MHI''' [https://ggus.eu/index.php?mode=ticket_info&ticket_id=152509 152509]: downtime until 22nd Sept...
*** '''UKI-NORTHGRID-MAN-HEP''' [https://ggus.eu/index.php?mode=ticket_info&ticket_id=152521 152521]: HTC data for May and June seems higher than expected; there are duplicated records that need to be deleted from the central repository...
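Several of the open tickets come down to sites running versions older than the ones named above (ARC-CE 6.12.0, APEL client 1.9.0, APEL SSM 3.2.1). The following is a minimal sketch of how a site could compare its installed versions against that list. It assumes those versions are treated as minimums, that versions are purely numeric dotted strings, and the component keys and the host inventory are hypothetical labels, not real package names.

```python
from __future__ import annotations

# Versions named in these notes, treated here as minimums (an assumption).
# The keys are illustrative labels, not actual package names.
MINIMUM_VERSIONS = {
    "arc-ce": "6.12.0",
    "apel-client": "1.9.0",
    "apel-ssm": "3.2.1",
}


def parse_version(version: str) -> tuple[int, ...]:
    """Turn a purely numeric dotted version such as '3.2.1' into (3, 2, 1).

    Release suffixes like '-1.el7' are not handled in this sketch.
    """
    return tuple(int(part) for part in version.split("."))


def outdated_components(installed: dict[str, str]) -> list[str]:
    """List the components whose installed version is below the minimum."""
    outdated = []
    for name, minimum in MINIMUM_VERSIONS.items():
        current = installed.get(name)
        if current is None:
            continue  # component not present on this host: nothing to compare
        if parse_version(current) < parse_version(minimum):
            outdated.append(f"{name} {current} < {minimum}")
    return outdated


if __name__ == "__main__":
    # Hypothetical inventory of a CE that still runs ARC 5.4.
    host = {"arc-ce": "5.4.4", "apel-client": "1.9.0", "apel-ssm": "3.2.1"}
    print(outdated_components(host))
```

Versions are compared as integer tuples rather than strings, because a plain string comparison would wrongly rank "6.9" above "6.12".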
 
=== Prerequisites for using AMS ===
* A valid host certificate from an IGTF Accredited CA.
* A GOCDB 'Site' entry flagged as 'Production'.
* A GOCDB 'Service' entry of the correct service type flagged as 'Production'. The following service types are used:
** For Grid accounting use 'gLite-APEL'.
** For Cloud accounting use 'eu.egi.cloud.accounting'.
** For Storage accounting use 'eu.egi.storage.accounting'.
* The 'Host DN' listed in the GOCDB 'Service' entry must exactly match the certificate DN of the host used for accounting. Make sure there are no leading or trailing spaces in the 'Host DN' field.
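The service-entry prerequisites above can be expressed as a few string checks. This is a minimal sketch, not an EGI tool: the function names are hypothetical, the DN is an invented example, and obtaining the real values (from the host certificate and from GOCDB) is left out. It does not check the IGTF accreditation of the CA or the site's 'Production' flag.

```python
from __future__ import annotations

# GOCDB service types per accounting flavour, as listed in the prerequisites.
GOCDB_SERVICE_TYPES = {
    "grid": "gLite-APEL",
    "cloud": "eu.egi.cloud.accounting",
    "storage": "eu.egi.storage.accounting",
}


def check_host_dn(gocdb_host_dn: str, certificate_dn: str) -> list[str]:
    """Compare the GOCDB 'Host DN' with the host certificate DN."""
    problems = []
    # Stray whitespace is reported on its own because it is easy to miss.
    if gocdb_host_dn != gocdb_host_dn.strip():
        problems.append("GOCDB 'Host DN' has leading or trailing spaces")
    # The remaining value must be an exact, case-sensitive match.
    if gocdb_host_dn.strip() != certificate_dn:
        problems.append("GOCDB 'Host DN' does not match the certificate DN")
    return problems


def check_service_entry(kind: str, service_type: str, production: bool,
                        gocdb_host_dn: str, certificate_dn: str) -> list[str]:
    """Return every problem found; an empty list means all checks pass."""
    problems = []
    expected = GOCDB_SERVICE_TYPES[kind]
    if service_type != expected:
        problems.append(f"service type is {service_type!r}, expected {expected!r}")
    if not production:
        problems.append("service entry is not flagged as 'Production'")
    problems.extend(check_host_dn(gocdb_host_dn, certificate_dn))
    return problems


if __name__ == "__main__":
    # Hypothetical DN; real values come from the certificate and GOCDB.
    dn = "/C=XX/O=Example/CN=apel.example.org"
    print(check_service_entry("grid", "gLite-APEL", True, dn + " ", dn))
```

Reporting whitespace separately from a genuine mismatch makes the common copy-and-paste error (a trailing space in the 'Host DN' field) easy to tell apart from a wrong DN.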


=== Monitoring of the accounting data ===
To ensure that the publication of the accounting data is monitored, '''one CE per site''' needs to be registered as an "'''''APEL'''''" service endpoint.
* http://goc-accounting.grid-support.ac.uk/rss/SITE-NAME_Pub.html
* http://goc-accounting.grid-support.ac.uk/rss/SITE-NAME_Sync.html
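The two monitoring pages above differ only in the SITE-NAME placeholder and the suffix. A minimal sketch of filling in the placeholder, assuming SITE-NAME is the site name exactly as registered in GOCDB (the function name is hypothetical):

```python
from __future__ import annotations

from urllib.parse import quote

# Base URL of the accounting RSS pages, as given in these notes.
RSS_BASE = "http://goc-accounting.grid-support.ac.uk/rss"


def accounting_rss_urls(site_name: str) -> dict[str, str]:
    """Fill in the SITE-NAME placeholder of both monitoring pages."""
    # quote() leaves typical site names such as 'INFN-PISA' unchanged.
    name = quote(site_name)
    return {
        "pub": f"{RSS_BASE}/{name}_Pub.html",
        "sync": f"{RSS_BASE}/{name}_Sync.html",
    }


if __name__ == "__main__":
    for url in accounting_rss_urls("INFN-PISA").values():
        print(url)
```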
= AOB  =
* EGI Conference 18 - 21 Oct 2021: https://indico.egi.eu/event/5464/overview


== Next meeting  ==
Dec

Latest revision as of 13:45, 15 November 2021






General information

Middleware

UMD


Preview repository

  • released on 2021-06-10
  • released on 2021-08-11
    • Preview 2.35.0 (CentOS 7): APEL SSM 3.2.1, DPM/DMLite 1.15.0 and 1.15.1, frontier-squid 4.15.2, xrootd 5.3.0
  • We plan to stop releasing Preview, since it does not seem to be used much, and it is also easier to get the latest versions of the products from EPEL or the product teams' repositories before their release in UMD.

Operations

ARGO/SAM

FedCloud

Feedback from DMSU

New Known Error Database (KEDB)

The KEDB has been moved to Jira+Confluence: https://confluence.egi.eu/display/EGIKEDB/EGI+Federation+KEDB+Home

  • problems are tracked with Jira tickets to better follow up their evolution
  • problems can be registered by DMSU staff and EGI Operations team

Monthly Availability/Reliability

  • sites suspended:

Documentation

IPv6 readiness plans

AOB

Next meeting

Dec