Agenda-2020-09-14
General information
Middleware
UMD
- plans on CentOS8 STARTED
Preview repository
- released on 2020-08-05
- Preview 1.28.0 AppDB info (SL6): dCache 5.2.25, frontier-squid 4.12.2, gfal2 2.18.1, xrootd 5.0.0
- Preview 2.28.0 AppDB info (CentOS 7): dCache 5.2.25, frontier-squid 4.12.2, gfal2 2.18.1, xrootd 5.0.0
Operations
ARGO/SAM
- HTCondor-CE probes included in the ARGO_MON_OPERATORS profile on May 13th: https://ggus.eu/index.php?mode=ticket_info&ticket_id=146949
- (14th Sept) 70 endpoints, 14 CRITICAL, success rate is about 80%
- on Oct 1st they will be included in the ARGO_MON_CRITICAL profile (A/R computation)
- please fix the failures by that date
- working on the probe for the host certificate validity check: GGUS 147386
- CREAM-CE metrics in the ARGO_MON_OPERATORS profile on May 27th: eu.egi.CREAMCE-JobSubmit, eu.egi.CREAMCE.WN-Csh, eu.egi.CREAMCE.WN-Softver
- (14th Sept) results: 152 endpoints, 20 WARNING (Timeout occurred (900 sec)), 24 CRITICAL. Success rate 84.2% (71.1% if the WARNING results are also counted as failures)
- When eu.egi.CREAMCE.WN-Softver is successful:
CREAM JobOutput OK: retrieved outputSandbox: ['std.err', 'std.out'] **** std.err **** **** std.out **** egee01 has UMD 3.14.4
- When it fails:
CREAM JobOutput ERROR [DONE-OK, exitCode=1 ]: retrieved outputSandbox: ['std.err', 'std.out'] **** std.err **** **** std.out **** ERROR: unable to find glite, EMI, LCG or UMD WN version on n1037-amd
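The success rates quoted above for the HTCondor-CE and CREAM-CE probes can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function name `success_rate` is hypothetical, and it assumes the rate is simply the share of endpoints not counted as failed.

```python
def success_rate(total, failed):
    """Percentage of endpoints not counted as failed, rounded to one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * (total - failed) / total, 1)


# HTCondor-CE (14th Sept): 70 endpoints, 14 CRITICAL
htcondor = success_rate(70, 14)

# CREAM-CE (14th Sept): 152 endpoints, 20 WARNING, 24 CRITICAL
cream_critical_only = success_rate(152, 24)       # only CRITICAL counted as failed
cream_with_warning = success_rate(152, 24 + 20)   # WARNING also counted as failed

print(htcondor, cream_critical_only, cream_with_warning)  # prints: 80.0 84.2 71.1
```

Under this assumption the HTCondor-CE figure of "about 80%" corresponds to 56 of 70 endpoints, and the two CREAM-CE figures differ only in whether the 20 timeouts are treated as failures.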
FedCloud
Feedback from DMSU
Verify configuration records
On a yearly basis, the information registered in GOC-DB needs to be verified. NGIs and RCs have been asked to check it. In particular:
- NGI managers should review the people registered and the roles assigned to them, and in particular check the following information:
- ROD E-Mail
- Security E-Mail
- NGI managers should also review the status of the "not certified" RCs, according to the RC Status Workflow;
- RC administrators should review the people registered and the roles assigned to them, and in particular check the following information:
- telephone numbers
- CSIRT E-Mail
- RC administrators should also review the information related to the registered service endpoints.
The process should be completed by June 22nd.
- 30 tickets
- Not yet solved: 8
Monthly Availability/Reliability
- Under-performed sites in the past A/R reports with issues not yet fixed:
- AsiaPacific: https://ggus.eu/index.php?mode=ticket_info&ticket_id=147748
- HK-HKU-CC-01: migrating DPM from SL6 to CentOS 7
- TW-NCUHEP: ARC-CE failures due to outdated CAs package
- NGI_BG: https://ggus.eu/index.php?mode=ticket_info&ticket_id=147747
- BG01-IPP: CREAM-CE failures, improving...
- NGI_DE: https://ggus.eu/index.php?mode=ticket_info&ticket_id=146871
- GoeGRID: CREAM-CE intermittent failures not affecting ATLAS; failures with ARC-CE
- NGI_DE: https://ggus.eu/index.php?mode=ticket_info&ticket_id=147313
- mainz: some problems in March and April that could not be fixed easily; in May, the HPC infrastructure was attacked and the whole computer center was shut down; the site is in downtime.
- wuppertalprod: SRM failures due to a BDII issue, fixed
- NGI_GRNET: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148171
- HG-02-IASA: problems with certificate renewal due to the COVID situation; the monthly figures are improving
- NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148170
- Hephy-Vienna: SRM decommissioned, moved to EOS
- INFN-PADOVA-STACK: the ESACO instance was having problems with the new AAI host certificate. See 148242
- NGI_PL: https://ggus.eu/index.php?mode=ticket_info&ticket_id=147311
- WCSS64
- NGI_PL: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148167
- WUT: downtime for site update, production jobs can run.
- NGI_UK:
- UKI-NORTHGRID-SHEF-HEP: https://ggus.eu/index.php?mode=ticket_info&ticket_id=146455 ARC-CE re-installed, some condor problems to fix
- UKI-SOUTHGRID-SUSX: https://ggus.eu/index.php?mode=ticket_info&ticket_id=144720 Migration from CREAM to ARC, WN migration to CentOS7; SRM to be decommissioned; ARC-CE was failing the IGTF test, then solved; site-bdii failures.
- NGI_UA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=147750
- UA-ISMA: migration to ARC6 and other planned software updates
- Under-performed sites after 3 consecutive months, under-performed NGIs, QoS violations (August 2020):
- CERN: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148516
- webdav failures due to insufficient space in the partition, fixed.
- NGI_DE: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148519
- LRZ-LMU: CE had problems due to the decommissioning of SharedFS
- NGI_HR: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148518
- egee.irb.hr: in the process of a major upgrade from CentOS 6 to CentOS 7, some delays.
- NGI_IL: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148521
- TECHNION-HEP
- NGI_NL: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148520
- SARA-MATRIX
- ROC_LA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148515
- ATLAND: downtime due to a power cut and quarantine
- sites suspended:
IPv6 readiness plans
- please provide updates to the IPv6 assessment (ongoing) https://wiki.egi.eu/w/index.php?title=IPV6_Assessment
- any relevant information will be summarised at the OMB
CREAM-CE Decommission
- End of Security Updates and Support: 31st Dec 2020 (Decommissioning deadline)
- Original broadcast: https://operations-portal.egi.eu/broadcast/archive/2293
- PROC16 Decommission of unsupported software
- Decommissioning start date: Oct 1st 2020
- a probe detecting CREAM-CE endpoints will be run, returning WARNING status
- Nov 1st: probe returns CRITICAL status, alarms created on the ROD dashboard, ROD teams start to create tickets
- 1st Jan 2021: EGI Ops will start chasing the sites still providing CREAM-CE endpoints
- By this time, service endpoints which couldn't be upgraded should be put into downtime by the site admin or ROD.
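The decommissioning timeline above can be expressed as a simple date lookup. This is a minimal sketch, not the actual probe logic: the function name `cream_campaign_phase` and the phase labels are hypothetical, and it assumes each phase runs until the next milestone listed in the agenda.

```python
from datetime import date

# Milestones as listed in the agenda.
WARNING_FROM = date(2020, 10, 1)    # probe starts returning WARNING
CRITICAL_FROM = date(2020, 11, 1)   # probe returns CRITICAL, ROD tickets start
CHASING_FROM = date(2021, 1, 1)     # EGI Ops starts chasing remaining sites


def cream_campaign_phase(day):
    """Return a label for the campaign phase on the given day (illustrative labels)."""
    if day < WARNING_FROM:
        return "before campaign"
    if day < CRITICAL_FROM:
        return "WARNING"
    if day < CHASING_FROM:
        return "CRITICAL"
    return "EGI Ops chasing"


for d in (date(2020, 9, 14), date(2020, 10, 15), date(2020, 11, 1), date(2021, 1, 1)):
    print(d.isoformat(), cream_campaign_phase(d))
```

A site still publishing a CREAM-CE endpoint on the meeting date (14th Sept) falls in the "before campaign" phase and has until Oct 1st before the probe starts reporting it.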
ARC Middleware 5 end of support, migration to ARC 6
- EGI Operations Broadcast
- PROC16 Decommission of unsupported software
- deadline: end of July
- Catalin is in contact with the ARC team to arrange a webinar on ARC administration, scheduled (to be confirmed) for July 6th; please contact operations@ for information
- Status
Date | Number of endpoints in BDII | Number of GGUS tickets | Issues |
---|---|---|---|
2020-06-08 | 75 | 42 | Some ARC endpoints publish a timestamp instead of a version like 5.X.Y; we can fairly assume they are ARC6 nightly builds, but we're going to close the corresponding tickets after explicit confirmation from the site admin. |
2020-07-13 | 53 | 29 | - |
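The issue noted in the table, where some endpoints publish a timestamp instead of a version like 5.X.Y, suggests a simple triage of published version strings. The sketch below is purely illustrative: the function name, the labels, and in particular the pattern used to recognise a "timestamp" (a run of 8 to 14 digits) are assumptions, since the agenda does not say what the published timestamps look like.

```python
import re

# A version of the form 5.X.Y, as described in the table.
ARC5_VERSION = re.compile(r"^5\.\d+\.\d+$")
# ASSUMPTION: a "timestamp" is a run of 8 to 14 digits; the real format is not given.
TIMESTAMP_LIKE = re.compile(r"^\d{8,14}$")


def classify_arc_version(published):
    """Triage a published version string into 'arc5', 'timestamp' or 'other'."""
    value = published.strip()
    if ARC5_VERSION.match(value):
        return "arc5"       # still on ARC 5: ticket stays open
    if TIMESTAMP_LIKE.match(value):
        return "timestamp"  # presumed ARC 6 nightly: close only after site confirmation
    return "other"          # anything else: check manually


# Hypothetical sample values, not taken from the BDII.
for sample in ("5.4.4", "20200601", "6.6.0"):
    print(sample, classify_arc_version(sample))
```

The "timestamp" label deliberately does not close anything on its own, in line with the note that tickets are closed only after explicit confirmation from the site admin.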
Storage accounting
Many sites stopped publishing storage accounting records; 57 tickets were opened to fix that.
- 15 tickets not solved yet
- page for checking when the records were published: http://goc-accounting.grid-support.ac.uk/storagetest/storagesitesystems.html
- Accounting Portal Prototype view
AOB
Next meeting
Sept 14th, 2020 https://indico.egi.eu/event/5098/