Agenda-2020-07-13
Latest revision as of 14:24, 14 September 2020
General information
Middleware
UMD
- plans on CentOS8 STARTED
Preview repository
- released on 2020-05-08
- Preview 1.27.0 AppDB info (sl6): ARC 6.5.0 and 6.6.0, CVMFS 2.7.2, dCache 5.2.20, frontier-squid 4.11.2, gfal2 2.17.2, xrootd 4.11.3
- Preview 2.27.0 AppDB info (CentOS 7): ARC 6.5.0 and 6.6.0, CVMFS 2.7.2, dCache 5.2.20, frontier-squid 4.11.2, gfal2 2.17.2, xrootd 4.11.3
Operations
ARGO/SAM
- HTCondor-CE probes included in the ARGO_MON_OPERATORS profile on May 13th: https://ggus.eu/index.php?mode=ticket_info&ticket_id=146949
- 57 endpoints, 17 CRITICAL, success rate is about 70.2%
- on Sept 1st they will be included in the ARGO_MON_CRITICAL profile (A/R computation)
- please fix the failures by that date
- working on the probe for the host certificate validity check: GGUS 147386
- CREAM-CE metrics in the ARGO_MON_OPERATORS profile on May 27th: eu.egi.CREAMCE-JobSubmit, eu.egi.CREAMCE.WN-Csh, eu.egi.CREAMCE.WN-Softver
- results: 177 endpoints, 15 WARNING (timeout occurred after 900 sec), 53 CRITICAL. Success rate 70% (61.6% if the WARNING results are also counted as failures)
- When eu.egi.CREAMCE.WN-Softver is successful:
  CREAM JobOutput OK: retrieved outputSandbox: ['std.err', 'std.out']
  **** std.err ****
  **** std.out ****
  egee01 has UMD 3.14.4
- When it fails:
  CREAM JobOutput ERROR [DONE-OK, exitCode=1 ]: retrieved outputSandbox: ['std.err', 'std.out']
  **** std.err ****
  **** std.out ****
  ERROR: unable to find glite, EMI, LCG or UMD WN version on n1037-amd
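The success rates quoted above can be reproduced with a short calculation. This is a minimal sketch, not part of ARGO itself: the function name `success_rate` and its parameters are hypothetical, and it assumes the success rate is simply the share of endpoints that are not in a failing state.

```python
def success_rate(endpoints: int, critical: int, warning: int = 0) -> float:
    """Return the percentage of endpoints that are not failing.

    Assumption: an endpoint counts as failing when it is CRITICAL, and
    optionally also when it is WARNING (pass warning > 0 for that case).
    """
    if endpoints <= 0:
        raise ValueError("endpoints must be positive")
    failing = critical + warning
    if failing > endpoints:
        raise ValueError("more failing endpoints than endpoints in total")
    return 100.0 * (endpoints - failing) / endpoints


# Figures taken from the agenda items above.
htcondor = success_rate(57, 17)           # HTCondor-CE probes
cream = success_rate(177, 53)             # CREAM-CE, CRITICAL only
cream_strict = success_rate(177, 53, 15)  # CREAM-CE, WARNING counted as failure

print(f"HTCondor-CE: {htcondor:.1f}%")
print(f"CREAM-CE: {cream:.1f}% ({cream_strict:.1f}% counting WARNING as failure)")
```

With the numbers from the agenda this gives about 70.2% for HTCondor-CE and about 70% (61.6% with WARNING counted as failure) for CREAM-CE, matching the figures reported above.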
FedCloud
Feedback from DMSU
Verify configuration records
On a yearly basis, the information registered in the GOC-DB needs to be verified. NGIs and RCs have been asked to check it. In particular:
- NGI managers should review the people registered and the roles assigned to them, and in particular check the following information:
- ROD E-Mail
- Security E-Mail
- NGI managers should also review the status of the "not certified" RCs, according to the RC Status Workflow;
- RC administrators should review the people registered and the roles assigned to them, and in particular check the following information:
- telephone numbers
- CSIRT E-Mail
- RC administrators should also review the information related to the registered service endpoints.
The process should be completed by June 22nd.
- 30 tickets
- Not yet solved after 1 month: 16
Monthly Availability/Reliability
- Under-performed sites in the past A/R reports with issues not yet fixed:
- AfricaArabia: https://ggus.eu/index.php?mode=ticket_info&ticket_id=146877
- ZA-WITS-CORE: SE hardware problem, machine sent to the vendor; CREAM-CE failures due to a known issue with the classads library (GGUS 146979)
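- AsiaPacific: https://ggus.eu/index.php?mode=ticket_info&ticket_id=142591
- INDIACMS-TIFR: SRM service not published in the BDII (142245), DPM 1.9.0 version (145676); installing a new DPM headnode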
- NGI_DE: https://ggus.eu/index.php?mode=ticket_info&ticket_id=146871
- GoeGRID: CREAM-CE intermittent failures not affecting ATLAS; failures with ARC-CE
- NGI_DE: https://ggus.eu/index.php?mode=ticket_info&ticket_id=147313
- mainz: some problems in March and April that could not be fixed easily; in May, the HPC infrastructure was attacked and the whole computer center was shut down; the site is in downtime.
- wuppertalprod: SRM failures due to a BDII issue, fixed
- NGI_UK:
- UKI-NORTHGRID-SHEF-HEP: https://ggus.eu/index.php?mode=ticket_info&ticket_id=146455 ARC-CE re-installed, some condor problems to fix
- UKI-SOUTHGRID-SUSX: https://ggus.eu/index.php?mode=ticket_info&ticket_id=144720 Migration from CREAM to ARC, WN migration to CentOS7; SRM to be decommissioned; ARC-CE was failing the IGTF test, then solved; site-bdii failures.
- ROC_CANADA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=146452
- CA-SFU-T2: SLURM problems (some old jobs not properly cancelled) caused failures of the site-BDII freshness check; recovered
- CA-WATERLOO-T2: SRM failures not involving production VOs, fixed; some unscheduled downtime affected the A/R figures; ARC-CE and Site-BDII back in production with a minimum set of resources; A/R figures are improving.
- Under-performed sites after 3 consecutive months, under-performed NGIs, QoS violations (June 2020):
- NGI_BG: https://ggus.eu/index.php?mode=ticket_info&ticket_id=147747
- BG01-IPP
- AsiaPacific: https://ggus.eu/index.php?mode=ticket_info&ticket_id=147748
- HK-HKU-CC-01
- TW-NCUHEP
- NGI_NL: https://ggus.eu/index.php?mode=ticket_info&ticket_id=147749
- SARA-MATRIX: SRM not published in the BDII; a network change is planned to solve the problem.
- NGI_UA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=147750
- UA-ISMA: migration to ARC6 and other planned software updates
- sites suspended:
IPv6 readiness plans
- please provide updates to the IPv6 assessment (ongoing) https://wiki.egi.eu/w/index.php?title=IPV6_Assessment
- relevant information, if any, will be summarised at the OMB
ARC Middleware 5 end of support, migration to ARC 6
- EGI Operations Broadcast
- PROC16 Decommission of unsupported software
- deadline: end of July
- Catalin is in contact with the ARC team to arrange a webinar on ARC administration, scheduled (to be confirmed) for July 6th; please contact operations@ for information
- Status
Date | Number of endpoints in BDII | Number of GGUS tickets | Issues |
---|---|---|---|
2020-06-08 | 75 | 42 | Some ARC endpoints publish a timestamp instead of a version like 5.X.Y; we can fairly assume they are ARC6 nightly builds, but we're going to close the corresponding tickets after explicit confirmation from the site admin. |
2020-07-13 | 53 | 29 | - |
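The issue noted in the 2020-06-08 row (endpoints publishing a timestamp instead of a version like 5.X.Y) could be triaged with a small classifier. This is only a sketch under stated assumptions: the function name `classify_arc_version`, the category labels, and in particular the timestamp format (a run of 8 to 14 digits) are hypothetical, since the agenda does not say what the published timestamps look like.

```python
import re

# A release-style version such as 5.4.4 or 6.6.0.
RELEASE_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")
# Assumption: a nightly build publishes a digits-only timestamp,
# e.g. 20200601 or 20200601123456. The real format may differ.
TIMESTAMP_RE = re.compile(r"^\d{8,14}$")


def classify_arc_version(published: str) -> str:
    """Classify a published ARC version string for ticket triage."""
    value = published.strip()
    match = RELEASE_RE.match(value)
    if match:
        major = int(match.group(1))
        return "arc5-or-older" if major <= 5 else "arc6-or-newer"
    if TIMESTAMP_RE.match(value):
        # Probably an ARC6 nightly build, but the ticket is closed only
        # after explicit confirmation from the site admin.
        return "needs-confirmation"
    return "unknown"


for sample in ["5.4.4", "6.6.0", "20200601123456", "nightly"]:
    print(sample, "->", classify_arc_version(sample))
```

Keeping a separate "needs-confirmation" category mirrors the policy in the table: a timestamp is treated as a hint, not as proof that the endpoint runs ARC 6.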
Storage accounting
Many sites stopped publishing storage accounting records; 57 tickets were opened to fix that.
- page for checking when the records were published: http://goc-accounting.grid-support.ac.uk/storagetest/storagesitesystems.html
- Accounting Portal Prototype view
SECMON failures
Several CEs are failing the job submission tests, preventing Pakiti from checking the vulnerability fixes on the WNs.
- original ticket: https://ggus.eu/index.php?mode=ticket_info&ticket_id=143837
- List of tickets to the sites
- https://ggus.eu/index.php?mode=ticket_info&ticket_id=144732
AOB
Next meeting
Sept 14th, 2020 https://indico.egi.eu/event/5098/