Difference between revisions of "Agenda-2020-09-14"
Revision as of 10:02, 14 September 2020
Back to https://wiki.egi.eu/wiki/Operations_Meeting
General information
Middleware
UMD
- plans on CentOS8 STARTED
Preview repository
- released on 2020-08-05
- Preview 1.28.0 AppDB info (sl6): dCache 5.2.25, frontier-squid 4.12.2, gfal2 2.18.1, xrootd 5.0.0
- Preview 2.28.0 AppDB info (CentOS 7): dCache 5.2.25, frontier-squid 4.12.2, gfal2 2.18.1, xrootd 5.0.0
Operations
ARGO/SAM
- HTCondor-CE probes included in the ARGO_MON_OPERATORS profile on May 13th: https://ggus.eu/index.php?mode=ticket_info&ticket_id=146949
- (28th Aug) 70 endpoints, 14 CRITICAL, success rate is about 80%
- on Oct 1st they will be included in the ARGO_MON_CRITICAL profile (A/R computation)
- please fix the failures by that date
- working on the probe for the host certificate validity check: GGUS 147386
- CREAM-CE metrics in the ARGO_MON_OPERATORS profile on May 27th: eu.egi.CREAMCE-JobSubmit, eu.egi.CREAMCE.WN-Csh, eu.egi.CREAMCE.WN-Softver
- (24th Aug) results: 156 endpoints, 22 WARNING (timeout occurred after 900 sec), 31 CRITICAL. Success rate 80.1% (66% if the WARNING results are also counted as failures)
- When eu.egi.CREAMCE.WN-Softver is successful:
CREAM JobOutput OK: retrieved outputSandbox: ['std.err', 'std.out'] **** std.err **** **** std.out **** egee01 has UMD 3.14.4
When it fails:
CREAM JobOutput ERROR [DONE-OK, exitCode=1 ]: retrieved outputSandbox: ['std.err', 'std.out'] **** std.err **** **** std.out **** ERROR: unable to find glite, EMI, LCG or UMD WN version on n1037-amd
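The success rates quoted above for the HTCondor-CE and CREAM-CE probes can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the rate is simply the share of endpoints that are not failing; the function name `success_rate` is hypothetical and is not part of ARGO.

```python
def success_rate(endpoints, failing):
    """Percentage of endpoints that are not in the failing set.

    Assumption: the rate is (endpoints - failing) / endpoints,
    which matches the figures quoted in the agenda.
    """
    return 100.0 * (endpoints - failing) / endpoints


# HTCondor-CE, 28th Aug: 70 endpoints, 14 CRITICAL
htcondor = success_rate(70, 14)

# CREAM-CE, 24th Aug: 156 endpoints, 31 CRITICAL, 22 WARNING
cream_critical_only = success_rate(156, 31)
cream_with_warning = success_rate(156, 31 + 22)  # WARNING counted as failure

print(f"HTCondor-CE: {htcondor:.1f}%")
print(f"CREAM-CE (CRITICAL only): {cream_critical_only:.1f}%")
print(f"CREAM-CE (CRITICAL + WARNING): {cream_with_warning:.1f}%")
```

Counting the 22 timeouts as failures is what brings the CREAM-CE figure down from about 80% to about 66%.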
FedCloud
Feedback from DMSU
Verify configuration records
On a yearly basis, the information registered in GOC-DB needs to be verified. NGIs and RCs have been asked to check it. In particular:
- NGI managers should review the people registered and the roles assigned to them, and in particular check the following information:
- ROD E-Mail
- Security E-Mail
- NGI managers should also review the status of the "not certified" RCs, according to the RC Status Workflow;
- RC administrators should review the people registered and the roles assigned to them, and in particular check the following information:
- telephone numbers
- CSIRT E-Mail
- RC administrators should also review the information related to the registered service endpoints.
The process should be completed by June 22nd.
- 30 tickets
- Not yet solved after 1 month: 16
Monthly Availability/Reliability
- Under-performed sites in the past A/R reports with issues not yet fixed:
- AsiaPacific: https://ggus.eu/index.php?mode=ticket_info&ticket_id=147748
- HK-HKU-CC-01: migrating DPM from sl6 to CentOS7
- TW-NCUHEP: ARC-CE failures due to outdated CAs package
- NGI_BG: https://ggus.eu/index.php?mode=ticket_info&ticket_id=147747
- BG01-IPP: CREAM-CE failures, improving...
- NGI_DE: https://ggus.eu/index.php?mode=ticket_info&ticket_id=146871
- GoeGRID: CREAM-CE intermittent failures not affecting ATLAS; failures with ARC-CE
- NGI_DE: https://ggus.eu/index.php?mode=ticket_info&ticket_id=147313
- mainz: some problems in March and April that could not be fixed easily; in May the HPC infrastructure was attacked and the whole computer center was shut down; the site is in downtime.
- wuppertalprod: SRM failures due to a BDII issue, fixed
- NGI_GRNET: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148171
- HG-02-IASA: problems with certificate renewal due to the COVID situation; the monthly figures are improving
- NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148170
- Hephy-Vienna
- INFN-PADOVA-STACK: the ESACO instance was having problems with the new AAI host certificate. See 148242
- NGI_PL: https://ggus.eu/index.php?mode=ticket_info&ticket_id=147311
- WCSS64
- NGI_PL: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148167
- WUT: downtime for site update, production jobs can run.
- NGI_UK:
- UKI-NORTHGRID-SHEF-HEP: https://ggus.eu/index.php?mode=ticket_info&ticket_id=146455 ARC-CE re-installed, some condor problems to fix
- UKI-SOUTHGRID-SUSX: https://ggus.eu/index.php?mode=ticket_info&ticket_id=144720 Migration from CREAM to ARC, WN migration to CentOS7; SRM to be decommissioned; ARC-CE was failing the IGTF test, then solved; site-bdii failures.
- NGI_UA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=147750
- UA-ISMA: migration to ARC6 and other planned software updates
- Under-performed sites after 3 consecutive months, under-performed NGIs, QoS violations (August 2020):
- CERN: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148516
- webdav failures due to insufficient space in the partition, fixed.
- NGI_DE: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148519
- LRZ-LMU: CE had problems due to the decommissioning of SharedFS
- NGI_HR: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148518
- egee.irb.hr: in the process of a major upgrade from CentOS 6 to CentOS 7, some delays.
- NGI_IL: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148521
- TECHNION-HEP
- NGI_NL: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148520
- SARA-MATRIX
- ROC_LA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148515
- ATLAND: downtime due to a power cut and quarantine
- sites suspended:
IPv6 readiness plans
- please provide updates to the IPv6 assessment (ongoing) https://wiki.egi.eu/w/index.php?title=IPV6_Assessment
- relevant information, if any, will be summarised at the OMB
ARC Middleware 5 end of support, migration to ARC 6
- EGI Operations Broadcast
- PROC16 Decommission of unsupported software
- deadline: end of July
- Catalin is in contact with the ARC team to arrange a webinar on ARC administration, scheduled (to be confirmed) for July 6th; please contact operations@ for information
- Status
Date | Number of endpoints in BDII | Number of GGUS tickets | Issues |
---|---|---|---|
2020-06-08 | 75 | 42 | Some ARC endpoints publish a timestamp instead of a version like 5.X.Y; we can fairly assume they are ARC6 nightly builds, but we're going to close the corresponding tickets after explicit confirmation from the site admin. |
2020-07-13 | 53 | 29 | - |
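The note in the table about endpoints publishing a timestamp instead of a version like 5.X.Y could be turned into a simple triage rule. The sketch below is illustrative only: the function name `classify_arc_version`, the category labels, and the assumption that a timestamp looks like a run of eight or more digits are all hypothetical, since the agenda does not show the actual published values.

```python
import re


def classify_arc_version(published):
    """Classify the version string published by an ARC endpoint.

    Hypothetical helper: the patterns below are assumptions made for
    illustration, not the format actually published in the BDII.
    """
    value = published.strip()

    # A dotted version such as 5.X.Y or 6.X.Y
    match = re.fullmatch(r"(\d+)\.(\d+)(?:\.(\d+))?", value)
    if match:
        major = int(match.group(1))
        return "arc5-or-older" if major < 6 else "arc6-or-newer"

    # Assumption: a timestamp is a run of 8 or more digits (e.g. YYYYMMDD...)
    if re.fullmatch(r"\d{8,}", value):
        return "timestamp"

    return "unknown"


# Example inputs are made up for illustration
for sample in ["5.4.4", "6.7.0", "20200601", "nightly"]:
    print(sample, "->", classify_arc_version(sample))
```

In line with the table note, a "timestamp" result would only suggest a presumed ARC6 nightly build; the corresponding ticket would still wait for explicit confirmation from the site admin.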
Storage accounting
Many sites stopped publishing storage accounting records. 57 tickets were opened to fix that.
- page for checking when the records were published: http://goc-accounting.grid-support.ac.uk/storagetest/storagesitesystems.html
- Accounting Portal Prototype view
SECMON failures
Several CEs are failing the job submission tests, preventing Pakiti from checking the vulnerability fixes on the WNs.
- original ticket: https://ggus.eu/index.php?mode=ticket_info&ticket_id=143837
- List of tickets to the sites
- https://ggus.eu/index.php?mode=ticket_info&ticket_id=144732
AOB
Next meeting
Sept 14th, 2020 https://indico.egi.eu/event/5098/