NGI DE CH Operations Center:Operations Meeting:13042012

Latest revision as of 10:13, 23 April 2012

Operations Meeting Main

Introduction

  • Minutes of last meeting

Announcements

  • Meetings/conferences
      ◦ Ops-Workshop in April 2012: https://www.egi.eu/indico/conferenceDisplay.py?confId=820 (timetable: https://www.egi.eu/indico/conferenceTimeTable.py?confId=820#all.detailed)
  • Availability/reliability statistics
https://documents.egi.eu/public/ShowDocument?docid=1091
BDII: 92%, but see https://ggus.eu/ws/ticket_info.php?ticket=81094 (caused by a downtime of the NGI-DE Nagios; the values
were not recalculated).
NGI_DE: availability 92%, reliability 96%. First green month.
  • Monitoring
Nothing to report.
  • Staged rollout/updates
sites with CREAM CEs: enabling glexec in GOC-DB
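
For readers less familiar with the A/R figures quoted in the statistics item, here is a minimal sketch of how availability and reliability can be derived from monitoring time. The formulas are an assumption based on the commonly used definition (availability ignores periods with unknown status; reliability additionally excludes scheduled downtime). The function name and the sample hours are hypothetical and are not taken from the linked report.

```python
def availability_reliability(total_h, up_h, scheduled_down_h=0.0, unknown_h=0.0):
    """Return (availability, reliability) as fractions between 0 and 1.

    Assumed definitions (not taken from the minutes):
      availability = up / (total - unknown)
      reliability  = up / (total - scheduled downtime - unknown)
    """
    known_h = total_h - unknown_h
    if known_h <= 0:
        raise ValueError("no known monitoring time in the period")
    availability = up_h / known_h
    # Time during which the service was expected to be up.
    expected_h = known_h - scheduled_down_h
    reliability = up_h / expected_h if expected_h > 0 else 0.0
    return availability, reliability


# Hypothetical 30-day month: 720 h in total, 24 h scheduled and 12 h unscheduled downtime.
a, r = availability_reliability(total_h=720, up_h=684, scheduled_down_h=24)
print(f"A: {a:.1%}  R: {r:.1%}")
```

Under these assumed definitions an unscheduled outage lowers both figures, whereas a scheduled downtime lowers only the availability.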

Round the sites

NGI-DE
  • BMRZ-FRANKFURT (Uni Frankfurt)
  • DESY-HH
  • DESY-ZN
  • FZJuelich (Mathilda Romberg)
Nothing to report.
  • Goegrid
  • GSI
  • ITWM
  • KIT (GridKa, FZK-LCG2)
auger SoftwareManager role.
  • KIT (Uni Karlsruhe)
  • LRZ
  • MPI-K
  • MPPMU
  • RWTH Aachen
  • SCAI (Andre Gemuend)
Nothing to report.
  • Uni Bonn
  • Uni Dortmund
  • Uni Dresden
  • Uni Freiburg
  • Uni Mainz-Maigrid
  • Uni Siegen
  • Uni Wuppertal
SwiNG
  • CSCS (via Email)
Everything had been working fine until yesterday, when a network problem caused the GPFS scratch filesystem to die.
We were unable to recover it, and today we rebuilt it from scratch.
ARC is still not working, but all gLite/EGI services are up and running.
This was an unscheduled downtime that will certainly affect this month's A&R.
Next week the grid cluster at CSCS enters a scheduled downtime.
We will move the hardware from the old building to the new datacentre
and introduce major changes to the infrastructure: new WNs to replace the
old Sun Blades and a new network design, moving from hybrid
Ethernet/InfiniBand to all-InfiniBand. The downtime should last no longer than 3 weeks.
  • PSI
  • Switch (Alessandro Ussai)
- Not much to report; we decommissioned our site at SWITCH in 2011.
- We only have a GIIS, which is run for the ARC sites in NGI_CH. This is why we no longer attend the ops meeting regularly.
As we don't have resources, we will attend sporadically, when necessary, as part of the monitoring tasks.

Note: please update your entry at https://wiki.egi.eu/wiki/NGI_DE:Sites if needed.

Status ROD

There were again tickets for ROD.

  • Any problematic tickets? A ticket from central COD about alarms that are older than 72 hours. The situation is unclear: who takes action? The ROD shifter checked the dashboard, but the alarm had disappeared. Strange behaviour of the dashboard. There was a thread on our email list. We/Dimitri/KIT will report this in the escalated tickets.
  • We handle our tickets (user tickets in the NGI-DE helpdesk) very leniently. We have to think about escalation procedures and an escalation table with expiration dates that depend on the ticket priority.
  • Handover of the ROD shift
  • The ROD shift schedule was updated by Dimitri: https://wiki.egi.eu/wiki/NGI_DE_CH_Operations_Center:Operations_Teams#Shifts_rotation_table
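
As a starting point for the escalation-table discussion, the idea of expiration dates that depend on ticket priority could be sketched as follows. This is only an illustration: the priority names, the deadlines, the `Ticket` class and the helper functions are all hypothetical placeholders, not agreed NGI-DE values or part of any helpdesk API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical escalation table: priority -> time allowed without an update
# before the ticket is escalated. Names and values are placeholders.
ESCALATION_DEADLINES = {
    "top priority": timedelta(hours=4),
    "very urgent": timedelta(days=1),
    "urgent": timedelta(days=3),
    "less urgent": timedelta(days=7),
}


@dataclass
class Ticket:
    ticket_id: int
    priority: str
    last_update: datetime


def expires_at(ticket):
    """Return the time at which the ticket becomes due for escalation."""
    return ticket.last_update + ESCALATION_DEADLINES[ticket.priority]


def tickets_to_escalate(tickets, now):
    """Return the tickets whose deadline has passed, earliest deadline first."""
    overdue = [t for t in tickets if expires_at(t) <= now]
    return sorted(overdue, key=expires_at)


# Made-up example tickets, checked at a fixed point in time.
now = datetime(2012, 4, 13, 12, 0)
tickets = [
    Ticket(1, "urgent", datetime(2012, 4, 9, 9, 0)),
    Ticket(2, "less urgent", datetime(2012, 4, 10, 9, 0)),
    Ticket(3, "very urgent", datetime(2012, 4, 12, 8, 0)),
]
print([t.ticket_id for t in tickets_to_escalate(tickets, now)])
```

Keeping the deadlines in a single table means the values can be adjusted once they are agreed, without touching the check itself.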

AOB

If you have additional topics to be discussed during the meeting, please submit them in advance via our email list.