NGI DE CH Operations Center: Operations Meeting, 30.11.2012
Introduction
- Minutes of last meeting
Announcements
UNI-FREIBURG
Still in error; an update plan has been requested. https://ggus.eu/ws/ticket_info.php?ticket=87414
MAIGRID
https://ggus.eu/ws/ticket_info.php?ticket=87418 — site currently in downtime.
FZJ
Update plan requested. https://ggus.eu/ws/ticket_info.php?ticket=88681
GoeGrid
https://ggus.eu/ws/ticket_info.php?ticket=87415
Errors at: FZK-LCG2, MPPMU, RWTH-Aachen, LRZ-LMU
- Meetings/conferences
- Availability/reliability statistics
- Monitoring
- Staged rollout/updates
Round the sites
- NGI-DE
- BMRZ-FRANKFURT (Uni Frankfurt)
- DESY-HH
* All services are now running EMI. We are currently migrating the worker nodes to the EMI 2 release (SL5) for WLCG; for the other VOs we have a test queue with EMI 2 / SL6 worker nodes.
* Recently upgraded/replaced part of the worker nodes. Most of the recent worker-node purchases are Interlagos AMD processors. We observe some discrepancy between the HEPSPEC results and real VO jobs when comparing Intel and AMD processors. (Does anyone know about plans for a new benchmark to replace HEPSPEC?)
* Recently added storage space to the SEs; another storage update will follow this month.
* Since August we have been running the CVMFS-NFS variant for the ATLAS software directory. It works rather well for our case (the CVMFS-NFS server is an extremely old machine, but equipped with an SSD); we have had up to 4,500 running ATLAS jobs. The only problem we observe is when too many jobs (more than 5-10) set up the ATLAS software at the same time on one worker node: the setup time then grows from 15 seconds to more than 30 minutes, and since ATLAS has an internal timeout on software setup, such jobs are killed by the timeout watcher. Jobs on other worker nodes are not affected, so this appears to be an NFS-client issue. Because we have many worker nodes, the probability of jobs landing on the same node simultaneously is low, so only a small fraction of jobs is killed by this setup timeout.
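The CVMFS-over-NFS setup described above could look roughly like the following on a worker node. This is only a sketch under stated assumptions: the server hostname, export path, and mount options are hypothetical and not taken from the DESY report.

```
# /etc/fstab on a worker node (sketch; hostname and options are assumptions)
# The cvmfs-nfs server mounts cvmfs locally and re-exports it read-only via NFS;
# worker nodes then mount the ATLAS software directory as a plain NFS share.
cvmfs-nfs.example.org:/cvmfs/atlas.cern.ch  /cvmfs/atlas.cern.ch  nfs  ro,nolock,noatime  0 0
```

One design consequence, consistent with the report: all cache pressure concentrates on the single NFS server, so many simultaneous software setups on one client can serialize on the NFS client rather than on the server.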
- DESY-ZN
- FZJuelich
- Goegrid
- GSI
- ITWM
- KIT (GridKa, FZK-LCG2)
- KIT (Uni Karlsruhe)
- LRZ
- MPI-K
- MPPMU
- RWTH Aachen
- SCAI
- Uni Bonn
- Uni Dortmund
- Uni Dresden
- Uni Freiburg
- Uni Mainz-Maigrid
- Uni Siegen
- Uni Wuppertal
- SwiNG
- CSCS
* Received new hardware for storage: 6 IBM DCS3700 arrays and 6 IBM M4 I/O servers. We are testing the setup and hope to be able to add it to dCache in January.
* Recently upgraded dCache to 1.9.12-23 without major issues. Added an xrootd door and modified the way the dCache pools are installed: we now have one domain per dCache pool, which is very useful when we have problems in one pool and need to operate on it without affecting the others. (This may have been communicated already by Pablo Fernandez in the previous meeting.)
* Currently having some problems with WNs being kicked out of GPFS randomly. This could be related to network issues on the IB network; still debugging.
* Currently having some problems with CVMFS: at times CVMFS raises an error condition and the mount point becomes stalled, which affects jobs running off that mount point. We are working on it.
* Suffered some HW problems on some WNs (DALCO/Supermicro/Intel twin nodes). Several components have been replaced (mostly disks and boards).
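The one-domain-per-pool arrangement mentioned above can be sketched in a dCache layout file. This is a minimal illustration, not the CSCS configuration: domain names and pool paths are hypothetical.

```
# Sketch of a dCache layout file (names and paths are assumptions).
# Each pool runs in its own domain, so one pool (and its JVM) can be
# stopped or restarted without affecting the other pools on the host.
[pool1Domain]
[pool1Domain/pool]
name=pool1
path=/dcache/pool1

[pool2Domain]
[pool2Domain/pool]
name=pool2
path=/dcache/pool2
```

With this layout, an operation such as restarting only `pool1Domain` leaves `pool2` serving data, which matches the isolation benefit described in the report.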
- PSI
- Switch
Note: please update your entry at https://wiki.egi.eu/wiki/NGI_DE:Sites if needed.
Status ROD
- Any problematic tickets?
- Handover of the ROD shift
- ROD shift schedule https://wiki.egi.eu/wiki/NGI_DE_CH_Operations_Center:Operations_Teams#Shifts_rotation_table
AOB
If you have additional topics to be discussed during the meeting, please submit them in advance via our email list.