
NGI DE CH Operations Center:Operations Meeting:30112012


Operations Meeting Main

Introduction

  • Minutes of last meeting

Announcements

UNI-FREIBURG

Still in error; an update plan has been requested.
https://ggus.eu/ws/ticket_info.php?ticket=87414

MAIGRID

https://ggus.eu/ws/ticket_info.php?ticket=87418
Currently in downtime.

FZJ

An update plan has been requested.
https://ggus.eu/ws/ticket_info.php?ticket=88681

GoeGrid

https://ggus.eu/ws/ticket_info.php?ticket=87415

Errors at FZK-LCG2, MPPMU, RWTH-Aachen, LRZ-LMU.


  • Meetings/conferences
  • Availability/reliability statistics
  • Monitoring
  • Staged rollout/updates

Round the sites

NGI-DE
  • BMRZ-FRANKFURT (Uni Frankfurt)
  • DESY-HH
 *   All services are running EMI.
 *   Now migrating worker nodes to the EMI2 release (SL5).
 *   For WLCG and other VOs we have a test queue with EMI2/SL6 worker nodes.
 *   Recently upgraded/replaced part of the worker nodes; most of the recent worker-node purchases are Interlagos AMD processors.
 *   We observe some discrepancy between the HEPSPEC results and the real VO jobs on Intel vs. AMD processors. (Does anyone know about the plan for a new benchmark to replace HEPSPEC?)
 *   Recently added storage space to the SEs; another storage update will follow this month.
 *   Since August we have been running the CVMFS-NFS variant for the ATLAS software directory. It works rather nicely for our case (the CVMFS-NFS server is an extremely old machine, but with an SSD); we have had up to 4,500 running ATLAS jobs.
 *   The only problem we observe is when too many jobs (more than 5-10) set up the ATLAS software at the same time on one worker node; jobs on other worker nodes are not affected, so it is an NFS-client issue. In that case the setup time grows from 15 seconds to more than 30 minutes, and since ATLAS has an internal timeout on software setup, such jobs are killed by the timeout watcher. But since we have many worker nodes, the probability that jobs land at the same time on the same node is low, so only a small fraction of jobs is killed by this setup timeout.
  • DESY-ZN
  • FZJuelich
  • Goegrid
  • GSI
  • ITWM
  • KIT (GridKa, FZK-LCG2)
  • KIT (Uni Karlsruhe)
  • LRZ
  • MPI-K
  • MPPMU
  • RWTH Aachen
  • SCAI
  • Uni Bonn
  • Uni Dortmund
  • Uni Dresden
  • Uni Freiburg
  • Uni Mainz-Maigrid
  • Uni Siegen
  • Uni Wuppertal
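The DESY-HH observation above, that with many worker nodes concurrent ATLAS software setups rarely collide on one node, can be sketched with a small probability estimate. The numbers below (200 jobs starting setup in one window, 400 nodes, slowdown above 5 concurrent setups per node) are illustrative assumptions, not figures from the minutes.

```python
from math import comb

def p_node_overloaded(jobs, nodes, threshold):
    """P(a given node receives more than `threshold` of `jobs`
    starting in the same window), jobs placed uniformly at random:
    the upper tail of Binomial(jobs, 1/nodes)."""
    p = 1.0 / nodes
    # P(X <= threshold) for X ~ Binomial(jobs, p), then take the complement
    lower_tail = sum(comb(jobs, i) * p**i * (1 - p)**(jobs - i)
                     for i in range(threshold + 1))
    return 1.0 - lower_tail

# Assumed numbers (hypothetical): 200 jobs start a software setup in the
# same window on a 400-node cluster, and more than 5 concurrent setups
# on one node trigger the slow (>30 min) setup path.
per_node = p_node_overloaded(200, 400, 5)
# Union-style bound over all nodes: chance that ANY node is overloaded.
cluster = 1 - (1 - per_node) ** 400
```

With these assumptions the per-node overload probability is on the order of 10^-5 and the whole-cluster chance stays below a percent, which matches the "small fraction of jobs killed" observation.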
SwiNG
  • CSCS
 *   Received new hardware for storage. 6 IBM DCS3700 and 6 IBM M4 IO servers. We are testing the setup and hope to be able to add it to dCache in January.
 *   Recently upgraded dCache to 1.9.12-23 without major issues. Added an xrootd door and modified the way we have dCache pools installed; now we have one domain per dCache Pool, which is very useful when we have problems in one pool and need to operate on it without affecting others. (this may have been communicated already by Pablo Fernandez in the previous meeting).
 *   Currently having some problems with WNs being kicked out of GPFS randomly. This could be related to network issues on the IB network. Still debugging it.
 *   Currently having some problems with CVMFS: at times CVMFS raises an error condition and the mount point stalls. This affects jobs running off that CVMFS mount point. We are working on it.
 *   Suffered some HW problems on some WNs (DALCO/SUPERMICRO/INTEL twin nodes). Several components have been replaced (disks and boards, mostly).
  • PSI
  • Switch
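The CSCS change to one domain per dCache pool can be illustrated with a layout-file fragment. This is only a sketch: the pool names and paths below are invented, and the exact syntax should be checked against the layout-file documentation for the dCache version in use.

```ini
# Illustrative dCache layout fragment (hypothetical names and paths).
# With each pool in its own domain, one pool's JVM can be stopped or
# restarted for maintenance without affecting the other pools.
[pool1Domain]
[pool1Domain/pool]
name=pool1
path=/srv/dcache/pool1

[pool2Domain]
[pool2Domain/pool]
name=pool2
path=/srv/dcache/pool2
```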

Note: please update your entry at https://wiki.egi.eu/wiki/NGI_DE:Sites if needed.

Status ROD

AOB

If you have additional topics to be discussed during the meeting, please submit them in advance via our email list.