The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other supports; new updates will be ignored and lost.
If needed you can get in touch with EGI SDIS team using operations @ egi.eu.

Difference between revisions of "Agenda-2021-12-13"

From EGIWiki
(Created page with "{{Template:Op menubar}} {{Template:Doc_menubar}} {{TOC_right}} Category:Grid Operations Meetings Back to https://wiki.egi.eu/wiki/Operations_Meeting = General informatio...")
 
 
(34 intermediate revisions by the same user not shown)
Line 41: Line 41:
*** https://argo-mon.egi.eu/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_org.opensciencegrid.htcondorce&style=detail
** it is working fine (very few failures)
** In January we will make the request to include the metric in the A/R profile
* Memory limits set by the ARC-CE probe: https://ggus.eu/index.php?mode=ticket_info&ticket_id=155081
** the default is 512MB, but the limits were increased because of failures on some sites
*** 1GB for normal test jobs, 1.5GB for security jobs
*** these limits seem too high for simple test jobs that are expected to run fast and with low demand
** request to revert to the default limits and let the probe use CE-specific settings, if any
* a proposal could be:
** sites with particular environment settings can define the values on GOCDB using the extension properties
** the probe is executed with its default values unless there is something else defined on GOCDB


== FedCloud  ==
Line 57: Line 65:
*Under-performed sites in the past A/R reports with issues not yet fixed:
** AfricaArabia: https://ggus.eu/index.php?mode=ticket_info&ticket_id=154295
*** '''MA-01-CNRST''': ARC-CE failures: jobs are submitted but don't manage to finish; failures with org.nordugrid.ARC-CE-SRM-result
** NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=150818
*** '''INFN-PISA''': HTCondorCE failures fixed; SRM failures not yet fixed; webdav failures
** NGI_PL: https://ggus.eu/index.php?mode=ticket_info&ticket_id=153659
*** '''TASK''': in the process of replacing QCG with ARC-CE
** NGI_UA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=152841
*** '''UA-NSCMBR''': problem during the DPM update: conflict between xrootd 5 and dmlite 1.13. Unscheduled downtime due to power failure in the computing centre. NFS configuration issue affected ARC-CE. Accounting data republished using the ARC accounting functionalities.
** NGI_UK: https://ggus.eu/index.php?mode=ticket_info&ticket_id=153660
*** '''UKI-SOUTHGRID-SUSX''': CE configuration issues; some other failures occurred.
** ROC_LA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=153658
*** '''SUPERCOMPUTO-UNAM''': some network issues
*Under-performed sites after 3 consecutive months, under-performed NGIs, QoS violations: ('''October 2021'''):
** NGI_DE: https://ggus.eu/index.php?mode=ticket_info&ticket_id=154745
*** '''GoeGrid''': relocation of the cluster to a different building on the campus and subsequent network issues; handover to new staff; problems fixed.
** NGI_IBERGRID: https://ggus.eu/index.php?mode=ticket_info&ticket_id=154750
*** '''UAM-LCG2'''
** NGI_RO: https://ggus.eu/index.php?mode=ticket_info&ticket_id=154746
*** '''GRIDIFIN''': jobs have not managed to finish since 15th July
** NGI_PL: https://ggus.eu/index.php?mode=ticket_info&ticket_id=154747
*** '''PSNC''': storage backend issues affecting the HPC cluster and DPM, also causing ARC-CE instability; DPM issues were fixed, working on HPC cluster
** ROC_LA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=153658
*** '''SUPERCOMPUTO-UNAM''': jobs cannot be submitted
** Russia: https://ggus.eu/index.php?mode=ticket_info&ticket_id=154748
*** '''RU-SARFTI''': ARC-CE failures, problem with hard drives, fixed
** NGI_UA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=154749
*** '''UA-KNU''': failures with IGTF metric, now fixed.
*Under-performed sites after 3 consecutive months, under-performed NGIs, QoS violations: ('''November 2021'''):
** AsiaPacific: https://ggus.eu/index.php?mode=ticket_info&ticket_id=155177
*** '''INDIACMS-TIFR''': major power outages, network issues; currently failures with HTCondorCE
*** '''TW-NCHC'''
** NGI_DE: https://ggus.eu/index.php?mode=ticket_info&ticket_id=155178
*** '''LRZ-LMU''': problem with retrieving the SURL
*** '''MPPMU'''
** NGI_IBERGRID: https://ggus.eu/index.php?mode=ticket_info&ticket_id=155180
*** '''NCG-INGRID-PT''': problems with the webdav probes affecting only ops VO (see also [https://ggus.eu/index.php?mode=ticket_info&ticket_id=151396 GGUS 151396])
 
 
 
 


*sites suspended:
Line 99: Line 109:
* please provide updates to the IPv6 assessment (ongoing) https://wiki.egi.eu/w/index.php?title=IPV6_Assessment  
* any relevant information will be summarised at the OMB
== Transition from X509 to federated identities (AARC profile token) ==
* WLCG is testing AAI tokens (WLCG profile) as the authz system for accessing the middleware, with Indigo IAM as a replacement for VOMS
* In Feb 2022 OSG will fully move to token-based AAI, abandoning X509 certificates
* HTCondorCE: [https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=PlanToReplaceGridCommunityToolkit replacement of Grid Community Toolkit]
** The long-term support series (9.0.x) from the CHTC repositories will support X509/VOMS authentication through Sep 2022
** Starting in 9.3.0 (released in October), the HTCondor feature releases do NOT contain this support
** EGI sites are recommended to stay with the long-term support series for the time being
What we need to know in preparation for the transition:
'''Checking the middleware compliance with the AARC Profile token''':
* '''ARC-CE''': https://ggus.eu/index.php?mode=ticket_info&ticket_id=154958
** So far focusing on the WLCG profile, which is built upon the AARC profile, so this should cover everything.
* '''Argus''': https://ggus.eu/index.php?mode=ticket_info&ticket_id=154959
* '''dCache''': https://ggus.eu/index.php?mode=ticket_info&ticket_id=154960
* '''DPM''': https://ggus.eu/index.php?mode=ticket_info&ticket_id=154961
** The DPM is in maintenance mode to be phased out by ~2024. There is no effort for implementing new functionality, which furthermore would be short-lived.
* '''HTCondor-CE''': https://ggus.eu/index.php?mode=ticket_info&ticket_id=154962
* '''STORM''': https://ggus.eu/index.php?mode=ticket_info&ticket_id=154963
'''Need to check the awareness and readiness of user communities''':
* which GRID services do they use
* are they familiar with AAI identities
* are they ready for the switch
'''Migration of the VOs from VOMS to Check-in'''
* transition period where both X509 and tokens can be used
** delays in updating the GRID elements to the latest version compliant with tokens
** not all of the middleware products can be compliant with tokens at the same time
** the same VO has to interact with elements supporting different authentication methods


= AOB  =
 
* DPM migration
* Transition from X509 to federated identities


== Next meeting  ==
Jan

Latest revision as of 11:54, 13 December 2021


Back to https://wiki.egi.eu/wiki/Operations_Meeting

General information

Middleware

UMD

  • CentOS Stream 8 is now the recommended OS for new installations
  • C8->CS8 migrations are recommended
  • CS9 will be supported by CERN and FNAL
  • middleware: the recommended path is C7->CS9 (we will probably skip CS8)
  • new release https://repository.egi.eu/UMD/4.15.1.html
    • ARC-CE 6.13.0 bug fixes release
    • Xrootd 5.3.1 bug fixes release
    • CERN EOS 5.0.2 new release of EOS Open Storage, which provides a storage solution for large amounts of physics data and user files, with a focus on interactive and batch analysis.
    • dCache 6.2.31 security vulnerability fix
    • Infrastructure Manager Nagios probe 1.3.1
    • GridFTP 13.21.1 minor bug fix of some Globus packages
    • gfal2 2.19.2 regular update of the gfal clients
    • gfal2-utils 1.6.0 regular update of the gfal2-utils clients
    • EGI CVMFS 3.3.16 new release of the default CVMFS configuration meta-package for EGI.
    • CVMFS 2.8.2 patch release containing bug fixes for clients and new diagnostics commands for the client.
    • HTCondor 9.0.1 new major release of HTCondor
    • HTCondor-CE 5.1.3 new major release of the HTCondor-CE

Preview repository

  • released on 2021-06-10
  • released on 2021-08-11
    • Preview 2.35.0 (CentOS 7): APEL SSM 3.2.1, DPM/DMLite 1.15.0 and 1.15.1, frontier-squid 4.15.2, xrootd 5.3.0
  • We plan to stop releasing Preview since it doesn't seem to be used very much, and it is also easier to get the latest version of the products from EPEL or the product teams' repos prior to their release in UMD.

Operations

ARGO/SAM

  • probe for checking the HTCondorCE host certificate validity deployed in production (GGUS 147386):
  • Memory limits set by the ARC-CE probe: https://ggus.eu/index.php?mode=ticket_info&ticket_id=155081
    • the default is 512MB, but the limits were increased because of failures on some sites
      • 1GB for normal test jobs, 1.5GB for security jobs
      • these limits seem too high for simple test jobs that are expected to run fast and with low demand
    • request to revert to the default limits and let the probe use CE-specific settings, if any
  • a proposal could be:
    • sites with particular environment settings can define the values on GOCDB using the extension properties
    • the probe is executed with its default values unless there is something else defined on GOCDB
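
The proposal above could work roughly as in the following sketch. It is only an illustration, not the probe's actual code: the function name and the property key ARC_PROBE_MEMORY_MB are hypothetical, and the extension properties are assumed to have already been fetched from GOCDB into a plain dictionary. Only the 512MB default comes from the notes above.

```python
# Default memory limit quoted in the agenda (512MB).
DEFAULT_MEMORY_MB = 512

# Hypothetical extension property name; the real key would have to be agreed.
MEMORY_PROPERTY = "ARC_PROBE_MEMORY_MB"


def resolve_memory_limit(extension_properties: dict) -> int:
    """Return the memory limit (MB) the probe should use for one CE.

    The site-defined value wins when it is present and is a positive
    integer; otherwise the probe falls back to its default.
    """
    raw = extension_properties.get(MEMORY_PROPERTY)
    if raw is None:
        return DEFAULT_MEMORY_MB
    try:
        value = int(raw)
    except (TypeError, ValueError):
        # Malformed value on GOCDB: ignore it and keep the default.
        return DEFAULT_MEMORY_MB
    return value if value > 0 else DEFAULT_MEMORY_MB
```

With this approach a site that defines nothing gets the default, while a site with a particular environment only has to set one property on its service endpoint.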

FedCloud

Feedback from DMSU

New Known Error Database (KEDB)

The KEDB has been moved to Jira+Confluence: https://confluence.egi.eu/display/EGIKEDB/EGI+Federation+KEDB+Home

  • problems are tracked with Jira tickets to better follow their evolution
  • problems can be registered by DMSU staff and EGI Operations team

Monthly Availability/Reliability



  • sites suspended:

Documentation

IPv6 readiness plans

Transition from X509 to federated identities (AARC profile token)

  • WLCG is testing AAI tokens (WLCG profile) as the authz system for accessing the middleware, with Indigo IAM as a replacement for VOMS
  • In Feb 2022 OSG will fully move to token-based AAI, abandoning X509 certificates
  • HTCondorCE: replacement of Grid Community Toolkit
    • The long-term support series (9.0.x) from the CHTC repositories will support X509/VOMS authentication through Sep 2022
    • Starting in 9.3.0 (released in October), the HTCondor feature releases do NOT contain this support
    • EGI sites are recommended to stay with the long-term support series for the time being
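
The version rule above can be sketched as a small check a site could run against its installed HTCondor version. This is only an illustration: the function name is hypothetical, and treating every release older than 9.3.0 as still carrying X509/VOMS support is an assumption drawn from the statement above. It does not account for the end of support of the 9.0.x series in Sep 2022.

```python
def keeps_x509_voms(version: str) -> bool:
    """Return True if the given HTCondor version is older than 9.3.0.

    Assumption: releases before 9.3.0 (including the 9.0.x long-term
    support series) still contain X509/VOMS authentication support.
    """
    # Compare only major and minor numbers, e.g. "9.0.1" -> (9, 0).
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) < (9, 3)
```

For example, keeps_x509_voms("9.0.1") is true for the long-term support series, while keeps_x509_voms("9.3.0") is false.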

What we need to know in preparation for the transition:

Checking the middleware compliance with the AARC Profile token:

Need to check the awareness and readiness of user communities:

  • which GRID services do they use
  • are they familiar with AAI identities
  • are they ready for the switch

Migration of the VOs from VOMS to Check-in

  • transition period where both X509 and tokens can be used
    • delays in updating the GRID elements to the latest version compliant with tokens
    • not all of the middleware products can be compliant with tokens at the same time
    • the same VO has to interact with elements supporting different authentication methods

AOB

  • DPM migration
  • Transition from X509 to federated identities

Next meeting

Jan