NGI DE CH Operations Center:Operations Meeting:02092011

Operations Meeting Main

How to connect

Connect Via Phone:

The following DFN ISDN gateways are available: Germany - Berlin: +49-30-2541080; Germany - Stuttgart: +49-711-6330190

If requested, please enter the following conference number: 97922688

Introduction

Minutes of last meeting

Announcements

Meetings/conferences

5-9 September 2011
GridKa School
Karlsruhe, Germany (http://gridka-school.scc.kit.edu/2011)

8-9 September 2011
Paradiso workshop
Brussels, Belgium (http://www.paradiso-fp7.eu)

19-23 September 2011
EGI Technical Forum
Lyon, France (http://tf2011.egi.eu/)

Availability/reliability statistics

John Alan Kennedy (MPPMU) asked why MPPMU's scheduled downtime [08-07-2011 16:00 to 15-07-2011 16:00 UTC] was not taken into account. Explanation from Dimitris Zilaskos: the downtime was only marked as 'at warning'.

For the UNI-Siegen-HEP problem, we will wait until the August statistics are available and then see whether the situation has improved.

Monitoring

Our Nagios box was updated. Last week there were some problems with notifications, and the test results were not up to date. This problem was caused by the central instance at CERN; the EGI monitoring team is involved.

Staged rollout/updates

ntr

Round the sites

NGI-DE

BMRZ-FRANKFURT (Uni Frankfurt)
DESY-HH

New WNs (replacing the old hardware); DESY-HH now has 37 kHS and ~4800 job slots.
100% occupancy (mostly CMS and ATLAS, but also ILC and HERA).

DESY-ZN
FZJuelich (Rebecca Breu)

ntr

Goegrid
GSI
ITWM (Martin Braun)

ntr
Announcement: short downtime of 1 hour next Monday because of maintenance on the internet connectivity.

KIT (GridKa, FZK-LCG2, Dimitri Nilsen, Foued Jrad, Tobias Koenig)

Business as usual.

We tried to install the UMD release 1.1 of the gLite logging and bookkeeping (LB) package. The test was not successful; the information was posted on the rollout board, but there has been no answer yet. The LB package is actually not very useful.

GridKa School next week.

KIT (Uni Karlsruhe)
LRZ
MPI-K
MPPMU (Cesare Delle Fratte)

update in July

Big problem with the information system since last weekend.
Big problem with the CREAM CE, which got stuck because of lack of memory; now green again.
-> These two main problems influenced the reliability/availability statistics for August, but we are concerned about the two different numbers on the operations portal and the GridView portal.
Recommendation: we will wait until the official numbers for August are published.

RWTH Aachen

During the last two weeks: problems with dCache, probably caused by a RAID controller; 4000 files with wrong checksums, still under investigation.

SCAI
Uni Bonn
Uni Dortmund
Uni Dresden (Ralph Mueller Pfefferkorn)

Production is running fine.

Problems with the data published to gstat/BDII, caused by a wrong Torque queue configuration; the problem was fixed this week.

Uni Freiburg (Anton Gamel)

On Wednesday evening we had a downtime because of an air-conditioning failure; some pools were down for several hours.

Tickets: decommissioning of CE1. The CE was drained and put out of production in the GOCDB, but some PanDA jobs are still being submitted bypassing the WMS. One reason could be CEs hard-coded in user jobs. Is there any advice on how to proceed?

CREAM is "eating" memory because of the BLAH daemon. Our CREAM CE currently has 8 GB of RAM and 1 GB of swap, and it is using all of the swap. Recommendation: better to add memory or ask on the rollout list/board. Currently there is no recommendation on the rollout list/board. At KIT we have 16 GB.

Two recommendations from John Alan Kennedy (MPPMU): We at MPPMU had the same problem; I read on the rollout list that the problem is the SG helper. Regarding the PanDA jobs: ask the ATLAS people. Torsten Harrenberg is on vacation, but ROD is back from vacation.

Uni Mainz-Maigrid
Uni Siegen
Uni Wuppertal

SwiNG

CSCS

Business as usual, despite some issues with the Lustre filesystem (scratch) that we hope to fix in the coming weeks by replacing it with GPFS, using SSDs for metadata.

PSI
Switch

Note: please update your entry at https://wiki.egi.eu/wiki/NGI_DE:Sites if needed.

Status ROD

Any problematic tickets?
Handover of the ROD shift

(from DESY-HH, previous week) Nothing serious to report, but worth mentioning: in NGI_DE there are discrepancies in the number of published job slots (GoeGrid publishes twice as many as it actually has) and in the APEL statistics (MPPMU: wrong SI2K). Proposal: could we have a table in the NGI_DE wiki with the actual number of job slots and the total HS for all NGI_DE sites, so that we can compare it with the information from BDII, gstat and GridView and contact sites in case of disagreement?

This week: week 35 (29.08-04.09): Team2, FhG (SCAI)

Next weeks:

Week 36 (05.09-11.09): Team3, KIT

Week 37 (12.09-18.09): Team4, JUELICH

Week 38 (19.09-25.09): Team5, BADW-LRZ

ROD shift schedule https://wiki.egi.eu/wiki/NGI_DE_CH_Operations_Center:Operations_Teams#Shifts_rotation_table

ROD Workshop @ EGI TechForum.

EGI Technical Forum: the ROD team's session is on Thursday, September 22nd, at 4 pm. Have a look at the agenda at: https://www.egi.eu/indico/contributionDisplay.py?contribId=35&confId=452.

AOB

For write access to this wiki page please contact wilhelm.buehleraddkit.edu

One participant: Sound quality was very good during this telephone conference and the wiki page is very useful

If you have additional topics to be discussed during the meeting, please submit them in advance via our email list.
