Agenda-2021-09-13
Revision as of 16:33, 9 September 2021
General information
Middleware
UMD
- CentOS8 discussion still ongoing
- repository frontend web pages restored as static pages
- UMD 4.15.0 has been released (https://repository.egi.eu/static/UMD/4.15.0.html) and includes several updates for CentOS7:
- StoRM 1.11.21 - several bug fixes and improvements
- lcmaps-plugins 1.8.1 - Update of lcmaps plugins
- CERN Frontier 4.15.2.1
- dmlite 1.15.0
- APEL SSM 3.2.1
- Dynamic DNS Nagios probe 1.0.1
- Infrastructure Manager Nagios probe 1.0.1
- dCache 6.2
Preview repository
- released on 2021-05-20:
- Preview 2.33.0 (CentOS 7): ARC 6.11.0, StoRM 1.11.20 and 1.11.21, VOMS 04-21
- released on 2021-06-10
- Preview 2.34.0 (CentOS 7): ARC 6.12.0, CVMFS 2.8.1, xrootd 5.2.0
Operations
ARGO/SAM
- New version of the APEL Pub and Sync metrics deployed in production (GGUS 140317):
- probe for checking the HTCondorCE host certificate validity (GGUS 147386):
- checks on expiration date, CN, and CA:
- to be deployed in production once new condor client is released in UMD
FedCloud
Feedback from DMSU
New Known Error Database (KEDB)
The KEDB has been moved to Jira+Confluence: https://confluence.egi.eu/display/EGIKEDB/EGI+Federation+KEDB+Home
- problems are tracked with Jira tickets to better follow their evolution
- problems can be registered by DMSU staff and EGI Operations team
Verify configuration records
On a yearly basis, the information registered in GOC-DB needs to be verified. NGIs and RCs have been asked to check it. In particular:
- NGI managers should review the people registered and the roles assigned to them, and in particular check the following information:
- ROD E-Mail
- Security E-Mail
- NGI managers should also review the status of the "not certified" RCs, according to the RC Status Workflow;
- RC administrators should review the people registered and the roles assigned to them, and in particular check the following information:
- telephone numbers
- CSIRT E-Mail
- RC administrators should also review the information related to the registered service endpoints.
The process should be completed by July 2nd.
Monthly Availability/Reliability
- Under-performed sites in the past A/R reports with issues not yet fixed:
- NGI_FRANCE: https://ggus.eu/index.php?mode=ticket_info&ticket_id=152253
- AUVERGRID: Long downtime connected to IN2P3-LPC site. Some problems with org.nordugrid.ARC-CE-result and org.nordugrid.ARC-CE-srm metrics: even if the sub-metrics complete successfully, the test jobs don't manage to get the "ending" status, producing an UNKNOWN status in the A/R computation
- NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=150818
- INFN-PISA: HTCondorCE failures fixed; SRM failures not yet
- NGI_UA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=152258
- UA-BITP: authentication issues with one of the nagios servers, fixed; additionally, power supply issues at the resource center
- UA-KNU: storage system degradation; host certificate expired, and a new one was then installed.
- NGI_UA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=152841
- UA-NSCMBR: problem during the DPM update: conflict between xrootd 5 and dmlite 1.13. Unscheduled downtime due to power failure in the computing centre
- ROC_LA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148956
- CBPF: DPM updated; SRM failures due to information not properly published, fixed; other SRM failures due to available space
- ROC_LA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=150817
- ICN-UNAM: replaced CREAM-CE; SE certificate expired; new failures with HTCondorCE; problems disappeared after re-installation; further failures on the CE, then fixed.
- Russia: https://ggus.eu/index.php?mode=ticket_info&ticket_id=152840
- RU-SARFTI: failures with org.nordugrid.ARC-CE-SRM-result
- RU-SPbSU: failures with org.nordugrid.ARC-CE-SRM-result
- Under-performed sites after 3 consecutive months, under-performed NGIs, QoS violations (August 2021):
- NGI_GRNET: https://ggus.eu/index.php?mode=ticket_info&ticket_id=153655
- GR-07-UOI-HEPLAB
- NGI_IBERGRID: https://ggus.eu/index.php?mode=ticket_info&ticket_id=153661
- USC-LCG2: SRM failures, services restarted
- NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=153657
- INFN-MILANO-ATLASC: SRM failures caused by the IPv4 data-port interface: the gridftp server was opening connections to the private interface instead of the public one.
- NGI_PL: https://ggus.eu/index.php?mode=ticket_info&ticket_id=153659
- TASK
- NGI_TR: https://ggus.eu/index.php?mode=ticket_info&ticket_id=153654
- AZ-IFAN
- NGI_UK: https://ggus.eu/index.php?mode=ticket_info&ticket_id=153660
- UKI-SOUTHGRID-SUSX: CE configuration issues
- ROC_LA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=153658
- SUPERCOMPUTO-UNAM
- sites suspended:
Documentation
- plan to decommission MediaWiki
- content to be moved to different locations (confluence and https://docs.egi.eu/)
- confluence space hosting policies and procedures: EGI Policies and Procedures
- EGI Federation Operations
- Change Management, Release and Deployment Management, Incident and Service Request Management, Problem Management, Information Security Management
- Manuals, How-Tos, Troubleshooting, FAQs:
- a large amount of material needs to be reviewed and, where necessary, updated when it is moved to the new location
- location will be https://docs.egi.eu/providers/operations-manuals/
IPv6 readiness plans
- please provide updates to the IPv6 assessment (ongoing) https://wiki.egi.eu/w/index.php?title=IPV6_Assessment
- any relevant information will be summarised at the OMB
APEL migration from ActiveMQ to ARGO Message Service (AMS)
- ActiveMQ was decommissioned on July 8th: for security reasons it is no longer possible to maintain it.
- Migration instructions (HTCondorCE, Storage, and Cloud accounting): https://github.com/apel/ssm/blob/dev/migrating_to_ams.md
- ARC 6.12.0 released, instructions:
- http://www.nordugrid.org/arc/releases/6.12/release_notes_6.12.html
- all the sites with ARC-CE need to update to this version
- Recommended versions:
- APEL Client: 1.9.0
- APEL SSM: 3.2.1
- Cloud accounting campaign:
- list of tickets
- 2 tickets (out of 21) not solved yet
- HTCondorCE and Storage accounting campaign:
- list of tickets
- 6 tickets (out of 53) not solved yet
- ARC-CE and storage accounting campaign:
- list of tickets
- 11 tickets (out of 112) not solved yet
- Australia-ATLAS 152428 and Australia-T2 152429: they still have ARC-CE 5.4; they are moving to a Cloudscheduler-based compute system and will be removing the ARC-CEs in the near future
- CA-SFU-T2 152433: CEs updated, check the accounting publication in the coming days... some errors with the benchmark which seem harmful. There are duplicated records for the previous months; it was suggested to set `apel_messages = summaries` in the ARC configuration file.
- IN2P3-IPNL 152460: CE not yet in production
- JP-KEK-CRC-02 152471: by the end of September...
- RU-SPbSU 152491: some discrepancy between local database and central repository, involved ARC developers...
- Taiwan-LCG2 152497: setting up the ARC6 server...
- TASK 152498:
- TW-FTT 152503:
- UA-MHI 152509: downtime until 22nd Sept...
- UKI-NORTHGRID-MAN-HEP 152521: HTC data in May and June seems higher than expected... there are duplicated records that need to be deleted in the central repository...
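The ticket counts reported for the three accounting campaigns above can be tallied with a short script. This is only a minimal sketch: the figures are copied from the bullets above, while the names `CAMPAIGNS` and `summarise` are illustrative and not part of any EGI tool.

```python
# Open and total ticket counts as reported in the agenda:
# campaign name -> (tickets not solved yet, total tickets)
CAMPAIGNS = {
    "Cloud": (2, 21),
    "HTCondorCE and Storage": (6, 53),
    "ARC-CE and storage": (11, 112),
}


def summarise(campaigns):
    """Return per-campaign solved/open/total counts plus overall totals."""
    summary = {}
    for name, (open_tickets, total) in campaigns.items():
        summary[name] = {
            "solved": total - open_tickets,
            "open": open_tickets,
            "total": total,
        }
    summary["ALL"] = {
        "solved": sum(item["solved"] for item in summary.values()),
        "open": sum(o for o, _ in campaigns.values()),
        "total": sum(t for _, t in campaigns.values()),
    }
    return summary


if __name__ == "__main__":
    for name, counts in summarise(CAMPAIGNS).items():
        share = 100.0 * counts["solved"] / counts["total"]
        print(f"{name}: {counts['solved']}/{counts['total']} solved ({share:.1f}%)")
```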
Prerequisites for using AMS
- A valid host certificate from an IGTF Accredited CA.
- A GOCDB 'Site' entry flagged as 'Production'.
- A GOCDB 'Service' entry of the correct service type flagged as 'Production'. The following service types are used:
- For Grid accounting use 'gLite-APEL'.
- For Cloud accounting use 'eu.egi.cloud.accounting'.
- For Storage accounting use 'eu.egi.storage.accounting'.
- The 'Host DN' listed in the GOCDB 'Service' entry must exactly match the certificate DN of the host used for accounting. Make sure there are no leading or trailing spaces in the 'Host DN' field.
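The 'Host DN' requirement above can be checked locally before contacting support. The following is a minimal sketch, assuming both DN strings have already been obtained by hand (for example copied from the GOCDB 'Service' entry and from the host certificate); it does not query GOCDB or read certificates. The function name `check_host_dn` and the example DN are hypothetical.

```python
from typing import List

# Service types listed in the prerequisites above.
SERVICE_TYPES = {
    "grid": "gLite-APEL",
    "cloud": "eu.egi.cloud.accounting",
    "storage": "eu.egi.storage.accounting",
}


def check_host_dn(gocdb_host_dn: str, certificate_dn: str) -> List[str]:
    """Return a list of problems; an empty list means the strings match exactly."""
    problems = []
    # Leading or trailing spaces in the GOCDB field break the exact match.
    if gocdb_host_dn != gocdb_host_dn.strip():
        problems.append("Host DN has leading or trailing spaces")
    # The comparison is deliberately strict: no normalisation is applied.
    if gocdb_host_dn != certificate_dn:
        problems.append("Host DN does not exactly match the certificate DN")
    return problems


if __name__ == "__main__":
    cert_dn = "/DC=org/DC=example/CN=ce01.example.org"  # hypothetical DN
    for problem in check_host_dn(cert_dn + " ", cert_dn):
        print(problem)
```

The comparison is kept strict on purpose, since the prerequisite asks for an exact match rather than an equivalent DN.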
Monitoring of the accounting data
To ensure the monitoring of the publication of the accounting data, one CE per site needs to be registered as "APEL" service endpoint.
- http://goc-accounting.grid-support.ac.uk/rss/SITE-NAME_Pub.html
- http://goc-accounting.grid-support.ac.uk/rss/SITE-NAME_Sync.html
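The two monitoring pages follow the pattern shown above, with SITE-NAME replaced by the site's name. As a minimal sketch (the helper name `accounting_monitoring_urls` is hypothetical, and the site name used in the example is one taken from this agenda), the URLs for a given site can be built like this:

```python
BASE_URL = "http://goc-accounting.grid-support.ac.uk/rss"


def accounting_monitoring_urls(site_name: str) -> dict:
    """Build the Pub and Sync monitoring page URLs for one site."""
    return {
        "Pub": f"{BASE_URL}/{site_name}_Pub.html",
        "Sync": f"{BASE_URL}/{site_name}_Sync.html",
    }


if __name__ == "__main__":
    for kind, url in accounting_monitoring_urls("UA-MHI").items():
        print(kind, url)
```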
AOB
Next meeting
Oct