Progress of SA1 issues
- D4.7 Operations Sustainability
SA1.1 Activity Management
- Attended WLCG GDB, presented MW upgrade campaign status
- Ongoing tracking of the upgrade status of gLite 3.2 components
- Including preparations for suspension of unresponsive sites
- Planning for retirement of EMI-1
- SVG - handling of several new issues. High Risk ADVISORY [EGI-SVG-2012-4600] issued.
- Planning for security workshop and talks at ISGC 2013
- New custom security probes for NGI SAM instance being developed
SA1.3 Staged rollout
- Globus integration task meeting 16.11.2012
SA1.4 Central tools
- GOCDB: unscheduled downtime today because of a power cut at STFC. The failover instance appeared to be unreachable and no DNS switch was applied today. We are in contact with the GOCDB administrators to find out when the system will be restored.
- A misconfiguration of the site BDII on HG-03-AUTH caused SAM CREAM-CE tests to use an incorrect broker and fail to report WN results on Monday 19th. A workaround was provided to SAM admins and the problem was corrected on HG-03-AUTH.
- Problem with sites not publishing WN results correctly (NGI_PL) is under investigation. Logs on the brokers indicate an error on the client side.
- Broker network upgrade planned for next week, see broadcast: https://operations-portal.egi.eu/broadcast/archive/id/817
- Incident in the broker network today caused problems for the SAM infrastructure:
POST MORTEM: Yesterday a misconfiguration of the EGI message broker instance running at HG-03-AUTH caused incorrect message broker endpoint information to be published to the top BDIIs. The issue was fixed earlier today at approximately 11:40 EEST. A workaround was communicated to the administrators of the NGI SAM instances today at 9:43 CET.
This problem caused all the org.sam.CREAMCE-JobSubmit-/ops/Role=lcgadmin tests on NGI/ROC Nagios instances to fail with a "proxy expired" error. The impact of the issue is still being assessed.
It will take several hours for the CREAM tests to function properly again, because the test jobs already submitted with the bad broker information have to fail on timeout before new test jobs can be submitted.
The incident has an impact on OPS Availability and Reliability statistics of sites. Statistics will be recomputed automatically by the SAM team. There is no need to request recomputations explicitly. A recomputation will be triggered as soon as SAM becomes stable again.
- Repository: network outage last Tuesday and additional unscheduled downtime on 20/11 due to a power cut and other infrastructural issues at RAL
- Assessment of the latest list of sites not publishing user DNs, and discussion at the OMB. EGI.eu proceeded today with opening NGI tickets to foster progress in fixing this problem: https://ggus.eu/ws/ticket_info.php?ticket=88641
- Presentation at the GDB meeting at CERN
- Working on the new features for the next release on 2012-11-28
- Discussion with WLCG people about the GGUS-SNOW interface
#88630 may have a broader impact; we will raise it at the next operations meeting.
Software support was presented at today's OMB; no specific comments were received.
|DMSU tickets flow Nov 4--10|
|back to tpm||0|
|reassigned to 3rd level||11|
|open DMSU tickets status|
|waiting for reply||3|
SA1.8 Availability and core services
- Publication of final A/R reports for October 2012
- There is an open request from NGI_IT to perform recomputation of A/R for October
- Finalized migration procedure for dteam VO from VOMRS onto VOMS endpoint
- Final migration and provision of the UMD2-based VOMS-only service scheduled for end of November
- Ongoing investigation of the notification mechanism on the UMD2-based VOMS endpoint
- Handling of dteam VO membership/registration requests
Documentation
- ongoing work on the EGI service portfolio
- work started on splitting the Availability and reliability monthly statistics page into a procedure page and a statistics page
- EGI OLA introduced during the OMB meeting
- new procedure created: PROC16 Unsupported software version decommission, introduced at the OMB meeting for approval
- new escalation process created: Escalation for operational problem with unsupported MW at site, introduced at the OMB meeting for approval
- User documentation space was created