EGI-InSPIRE:Sa1 2012-11-21

From EGIWiki
Revision as of 17:58, 20 November 2012 by Tferrari (talk | contribs) (SA1.5 Accounting)


Progress of SA1 issues


  • D4.7 Operations Sustainability

SA1.1 Activity Management

  • Attended WLCG GDB, presented MW upgrade campaign status

SA1.2 Security

SA1.3 Staged rollout

SA1.3 Integration

  1. Globus integration task meeting 16.11.2012

SA1.4 Central tools

  • A problem with the configuration of the site BDII at HG-03-AUTH caused the SAM CREAM-CE tests to use an incorrect broker and to fail to report WN results on Monday 19th. A workaround was provided to the SAM admins, and the problem was corrected at HG-03-AUTH.
  • A problem with sites not publishing WN results correctly (NGI_PL) is under investigation. Logs on the brokers indicate an error on the client side.
  • A broker network upgrade is planned for next week, see broadcast:
  • An incident in the broker network today caused problems for the SAM infrastructure:

POST MORTEM: Yesterday a misconfiguration of the EGI message broker instance running at HG-03-AUTH caused wrong message broker endpoint information to be published to the top BDIIs. The issue was fixed earlier today at approximately 11:40 EEST. A workaround was communicated to the administrators of the NGI SAM instances today at 9:43 CET.

This problem caused all the org.sam.CREAMCE-JobSubmit-/ops/Role=lcgadmin tests on NGI/ROC Nagios instances to fail with a "proxy expired" error. The impact of the issue is still being assessed.

It will take several hours before the CREAM tests function properly again: the test jobs already submitted with the bad broker information must first fail on timeout, and only then can new test jobs be submitted.

The incident has an impact on OPS Availability and Reliability statistics of sites. Statistics will be recomputed automatically by the SAM team. There is no need to request recomputations explicitly. A recomputation will be triggered as soon as SAM becomes stable again.

SA1.5 Accounting

  • Repository: network outage last Tuesday and additional unscheduled downtime on 20/11 due to a power cut and other infrastructure issues at RAL.
  • The latest list of sites not publishing user DNs was assessed and discussed at the OMB. NGI tickets were opened today to foster progress in fixing this problem:

SA1.6 Helpdesk

  • Presentation at the GDB meeting at CERN
  • Working on the new features for the next release on 2012-11-28
  • Discussion with WLCG people about the GGUS-SNOW interface

SA1.7 Support

Software Support

Network Support

SA1.8 Availability and core services

  1. Publication of final A/R reports for October 2012
  2. There is an open request from NGI_IT to perform recomputation of A/R for October
  3. Finalized migration procedure for dteam VO from VOMRS onto VOMS endpoint
  4. Final migration and provision of UMD2 based VOMS only service scheduled for end of November
  5. Ongoing investigation of notification mechanism on UMD2 based VOMS endpoint
  6. Handling of dteam VO membership/registration requests

Documentation

  1. Ongoing work on the EGI service portfolio
  2. Work started to split the Availability and Reliability monthly statistics page into a procedure page and a statistics page
  3. EGI OLA introduced during the OMB meeting
  4. New procedure created: PROC16 Unsupported software version decommission, introduced at the OMB meeting for approval
  5. New escalation process created: Escalation for operational problem with unsupported MW at site, introduced at the OMB meeting for approval
  6. User documentation space was created