EGI-InSPIRE:Sa1 2012-11-21

Revision as of 17:32, 27 November 2012 by Ap (talk | contribs) (→‎SA1.5 Accounting)



Progress of SA1 issues

Nothing new to report

Milestones/Deliverables

  • D4.7 Operations Sustainability in progress

SA1.1 Activity Management

Meetings

  • Attended WLCG GDB, presented MW upgrade campaign status
  • WLCG MB
  • EGI.eu-DECIDE
  • Operations Management Board, preparation of agenda and contributions
  • Monday coordination meeting of software upgrade campaign
  • PC2012 meeting
  • ops VO management
  • Overview and status assessment of the unsupported middleware process
  • definition of decommissioning calendar for remaining glite 3.2 products and EMI 1
  • definition of new escalation procedures for handling future middleware upgrades
  • definition of new probes needed for future mw upgrade campaigns
  • Review of the WLCG software lifecycle document
  • Follow up of one issue between SAM and the EGI brokers network (defined a requirement for SAM)
  • kickoff of SLURM discussion forum and status assessment of support in CREAM
  • definition of mandate of working group for analysis/revision of probes (under COD coordination)
  • revision of MoU with DANTE

SA1.2 Security

  • Ongoing activities for tracking the status of upgrades from gLite 3.2 components, including preparations for suspension of unresponsive sites
  • Planning for retirement of EMI-1
  • SVG - handling of several new issues. High Risk ADVISORY [EGI-SVG-2012-4600] issued.
  • Planning for security workshop and talks at ISGC 2013
  • New custom security probes for NGI SAM instance being developed

SA1.3 Staged rollout

  • Release of UMD 2.3.0
  • Participation in the OMB

SA1.3 Integration

  1. Globus integration task meeting 16.11.2012

SA1.4 Central tools

  • GOCDB: unscheduled downtime today because of a power cut at STFC. The failover instance appeared to be unreachable and no DNS switching was applied today. We are in contact with the GOCDB administrators to understand when the system will be restored.
  • A misconfiguration of the site BDII on HG-03-AUTH caused SAM CREAM-CE tests to use an incorrect broker and fail to report WN results on Monday 19th. A workaround was provided to SAM admins and the problem was corrected on HG-03-AUTH.
  • Problem with sites not publishing WN results correctly (NGI_PL) under investigation. Logs on the brokers indicate an error on the client side.
  • broker network upgrade planned next week, see broadcast: https://operations-portal.egi.eu/broadcast/archive/id/817
  • incident in the broker network today caused problems to the SAM infrastructure:

POST MORTEM

Yesterday a misconfiguration on the EGI message broker instance running at HG-03-AUTH caused wrong message broker endpoint information to be published to top BDIIs. The issue was fixed earlier today at approximately 11:40 EEST. A workaround was communicated to the administrators of the NGI SAM instances today at 9:43 CET.

This problem caused all the org.sam.CREAMCE-JobSubmit-/ops/Role=lcgadmin tests on NGI/ROC Nagios instances to fail with a "proxy expired" error. The impact of the issue is still being assessed.

It will take several hours before the CREAM tests function properly again: the test jobs already submitted with the bad broker information have to fail by timeout, and only then can new test jobs be submitted.

The incident has an impact on OPS Availability and Reliability statistics of sites. Statistics will be recomputed automatically by the SAM team. There is no need to request recomputations explicitly. A recomputation will be triggered as soon as SAM becomes stable again.

SA1.5 Accounting

Repository

  • Network outage last Tuesday and additional unscheduled downtime on 20/11 due to a power cut and other infrastructure issues at RAL.
  • Assessment of the latest list of sites not publishing user DNs, and discussion at the OMB. EGI.eu opened NGI tickets today to foster progress in fixing this problem: https://ggus.eu/ws/ticket_info.php?ticket=88641

Portal

New release of the Accounting Portal on http://accounting.egi.eu. Changes in this version:

  • Improved UserDN country classification patterns.
  • Improvements on usage by country.
  • GET interface for CSV
  • Support new RFC 2253 UserDNs.
  • Better support for custom VOs.
  • UserDN NGI attribution
  • Support for local jobs, three options selectable on most views:
    • Only Grid jobs (default).
    • Grid+Local jobs (In case there is a corresponding global VO, both are aggregated)
    • Only Local jobs
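
For readers unfamiliar with the two UserDN notations mentioned in the change list, the sketch below converts a slash-separated DN (the form commonly seen in grid accounting records) into RFC 2253 string form, where components are comma-separated and listed in reverse order. It is only an illustration: the function name `legacy_to_rfc2253` is hypothetical, and nothing here is taken from the Accounting Portal code, whose handling of DNs is not described in this report.

```python
import re

# Characters that RFC 2253 requires to be backslash-escaped inside a value.
# Leading '#', and leading or trailing spaces, are NOT handled in this sketch.
_SPECIAL = re.compile(r'([,+"\\<>;])')


def legacy_to_rfc2253(dn: str) -> str:
    """Convert '/C=UK/O=eScience/CN=jane doe' to 'CN=jane doe,O=eScience,C=UK'."""
    if not dn.startswith("/"):
        raise ValueError("expected a DN starting with '/'")
    # Split only on slashes that begin a new "KEY=" component, so values
    # such as "CN=host/node.example.org" stay in one piece.
    parts = re.split(r"/(?=[A-Za-z][A-Za-z0-9.]*=)", dn[1:])
    rdns = []
    for part in parts:
        key, _, value = part.partition("=")
        value = _SPECIAL.sub(r"\\\1", value)  # escape special characters
        rdns.append(f"{key}={value}")
    # RFC 2253 lists the most specific component first.
    return ",".join(reversed(rdns))


if __name__ == "__main__":
    print(legacy_to_rfc2253("/C=UK/O=eScience/OU=CLRC/CN=jane doe"))
```

A value that itself contains a slash followed by something that looks like `KEY=` would be split incorrectly by this heuristic, so it should not be relied on for arbitrary DNs.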

SA1.6 Helpdesk

  • Presentation at the GDB meeting at CERN
  • Working on the new features for the next release on 2012-11-28
  • Discussion with WLCG people about the interface GGUS-SNOW

SA1.7 Support

Software Support

#88630 may have broader impact; we will raise it at the next operations meeting.

Software support was presented at today's OMB; no specific comments were received.

DMSU tickets flow, Nov 4-10:

  • assigned: 17
  • back to TPM: 0
  • reassigned to 3rd level: 11
  • solved: 3

Open DMSU tickets status:

  • assigned: 1
  • in progress: 11
  • waiting for reply: 3
  • on hold: 4

Network Support

No report received

SA1.8 Availability and core services

  1. Publication of final A/R reports for October 2012
  2. There is an open request from NGI_IT to perform recomputation of A/R for October
  3. Finalized migration procedure for dteam VO from VOMRS onto VOMS endpoint
  4. Final migration and provision of UMD2 based VOMS only service scheduled for end of November
  5. Ongoing investigation of notification mechanism on UMD2 based VOMS endpoint
  6. Handling of dteam VO membership/registration requests

Documentation

  1. ongoing work on the EGI service portfolio
  2. work started to split the Availability and reliability monthly statistics page into a procedure page and a statistics page
  3. EGI OLA introduced during the OMB meeting
  4. new procedure created: PROC16 Unsupported software version decommission, introduced at the OMB meeting for approval
  5. new escalation process created: Escalation for operational problem with unsupported MW at site, introduced at the OMB meeting for approval
  6. User documentation space was created

Meetings

  • EGI operations presentation at TF-NOC task force meeting, Poznan 12-13 Dec