
NGI DE CH Operations Center:Operations Meeting:30112012


Operations Meeting Main

Introduction

  • Minutes of last meeting

Announcements

UNI-FREIBURG

Still in error; an update plan has been requested.
https://ggus.eu/ws/ticket_info.php?ticket=87414

MAIGRID

https://ggus.eu/ws/ticket_info.php?ticket=87418
Currently in downtime.

FZJ

An update plan has been requested.
https://ggus.eu/ws/ticket_info.php?ticket=88681

GoeGrid

https://ggus.eu/ws/ticket_info.php?ticket=87415

Errors at FZK-LCG2, MPPMU, RWTH-Aachen, LRZ-LMU.


  • Meetings/conferences
  • Availability/reliability statistics
  • Monitoring
  • Staged rollout/updates

Round the sites

NGI-DE
  • BMRZ-FRANKFURT (Uni Frankfurt)
  • DESY-HH
 *   All services are running EMI.
 *   Now migrating worker nodes to the EMI2 release (SL5).
 *   For WLCG and other VOs we have a test queue with EMI2/SL6 worker nodes.
 *   Recently upgraded/replaced part of the worker nodes; most of the recent worker-node purchases are Interlagos AMD processors.
 *   We observe some discrepancy between the HEPSPEC results and the real VO jobs on Intel vs. AMD processors. (Does anyone know about the plan for a new benchmark to replace HEPSPEC?)
 *   Recently added storage space to the SEs; another storage update will follow this month.
 *   Since August we have been running the CVMFS-NFS variant for the ATLAS software directory. It works rather nicely for our case (the CVMFS-NFS server is an extremely old machine, but with an SSD); we have had up to 4,500 running ATLAS jobs.
 *   The only problem we observe is when too many jobs (more than 5-10) set up the ATLAS software at the same time on one worker node; jobs on other worker nodes are not affected, so it is an NFS-client issue. In that case the setup time grows from 15 seconds to more than 30 minutes, and since ATLAS has an internal timeout on software setup, such jobs are killed by the timeout watcher. But since we have many worker nodes, the probability that jobs land at the same time on the same node is low, so only a small fraction of jobs is killed by this setup timeout.
  • DESY-ZN
  • FZJuelich
  • Goegrid
  • GSI
  • ITWM
  • KIT (GridKa, FZK-LCG2)
  • KIT (Uni Karlsruhe)
  • LRZ
  • MPI-K
  • MPPMU
  • RWTH Aachen
  • SCAI
  • Uni Bonn
  • Uni Dortmund
  • Uni Dresden
  • Uni Freiburg
  • Uni Mainz-Maigrid
  • Uni Siegen
  • Uni Wuppertal
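The DESY-HH observation above, that with many worker nodes concurrent ATLAS software setups rarely collide on one node, can be sketched with a small probability estimate. The numbers below (200 jobs starting setup in one window, 400 nodes, slowdown above 5 concurrent setups per node) are illustrative assumptions, not figures from the minutes.

```python
from math import comb

def p_node_overloaded(jobs, nodes, threshold):
    """P(a given node receives more than `threshold` of `jobs`
    starting in the same window), jobs placed uniformly at random:
    the upper tail of Binomial(jobs, 1/nodes)."""
    p = 1.0 / nodes
    # P(X <= threshold) for X ~ Binomial(jobs, p), then take the complement
    lower_tail = sum(comb(jobs, i) * p**i * (1 - p)**(jobs - i)
                     for i in range(threshold + 1))
    return 1.0 - lower_tail

# Assumed numbers (hypothetical): 200 jobs start a software setup in the
# same window on a 400-node cluster, and more than 5 concurrent setups
# on one node trigger the slow (>30 min) setup path.
per_node = p_node_overloaded(200, 400, 5)
# Union-style bound over all nodes: chance that ANY node is overloaded.
cluster = 1 - (1 - per_node) ** 400
```

With these assumptions the per-node overload probability is on the order of 10^-5 and the whole-cluster chance stays below a percent, which matches the "small fraction of jobs killed" observation.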
SwiNG
  • CSCS
 *   Received new hardware for storage. 6 IBM DCS3700 and 6 IBM M4 IO servers. We are testing the setup and hope to be able to add it to dCache in January.
 *   Recently upgraded dCache to 1.9.12-23 without major issues. Added an xrootd door and modified the way we have dCache pools installed; now we have one domain per dCache Pool, which is very useful when we have problems in one pool and need to operate on it without affecting others. (this may have been communicated already by Pablo Fernandez in the previous meeting).
 *   Currently having some problems with WNs being kicked out of GPFS randomly. This could be related to network issues on the IB network. Still debugging it.
 *   Currently having some problems with CVMFS: at times CVMFS raises an error condition and the mount point stalls. This affects jobs running off that CVMFS mount point. We are working on it.
 *   Suffered some HW problems on some WNs (DALCO/SUPERMICRO/INTEL twin nodes). Several components have been replaced (disks and boards, mostly).
  • PSI
  • Switch
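The CSCS change to one domain per dCache pool can be illustrated with a layout-file fragment. This is only a sketch: the pool names and paths below are invented, and the exact syntax should be checked against the layout-file documentation for the dCache version in use.

```ini
# Illustrative dCache layout fragment (hypothetical names and paths).
# With each pool in its own domain, one pool's JVM can be stopped or
# restarted for maintenance without affecting the other pools.
[pool1Domain]
[pool1Domain/pool]
name=pool1
path=/srv/dcache/pool1

[pool2Domain]
[pool2Domain/pool]
name=pool2
path=/srv/dcache/pool2
```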

Note: please update your entry at https://wiki.egi.eu/wiki/NGI_DE:Sites if needed.

Status ROD

AOB

If you have additional topics to be discussed during the meeting, please submit them in advance via our email list.