NGI DE CH Operations Center:Operations Meeting:02092011
How to connect
Connect Via Phone:
The following DFN ISDN gateways are available:
Germany - Berlin: +49-30-2541080
Germany - Stuttgart: +49-711-6330190
If requested, please enter the following conference number: 97922688
- Minutes of last meeting
5-9 September 2011: GridKa School, Karlsruhe, Germany (http://gridka-school.scc.kit.edu/2011)
8-9 September 2011: Paradiso workshop, Brussels, Belgium (http://www.paradiso-fp7.eu)
19-23 September 2011: EGI Technical Forum, Lyon, France (http://tf2011.egi.eu/)
- Availability/reliability statistics
John Alan Kennedy (MPPMU) asked why MPPMU's scheduled downtime [08-07-2011 16:00 to 15-07-2011 16:00 UTC] was not taken into account. Explanation from Dimitris Zilaskos: the downtime was only marked as 'at warning'.
Regarding the problem of UNI-Siegen-HEP, we will wait until the August statistics are available and then see whether the situation has improved.
Our Nagios box was updated. Last week there were some problems with notifications, and the tests were not up to date. This problem was caused by the central instance at CERN. The EGI monitoring team is involved.
- Staged rollout/updates
Round the sites
- BMRZ-FRANKFURT (Uni Frankfurt)
New WNs (replacing the old hardware); we now have 37 kHS and ~4800 job slots at DESY-HH. 100% occupancy (mostly cms and atlas, but also ilc and hera).
- FZJuelich (Rebecca Breu)
- ITWM (Martin Braun)
Nothing to report. Announcement: short downtime of 1 hour next Monday for maintenance of the internet connectivity.
- KIT (GridKa, FZK-LCG2, Dimitri Nilsen, Foued Jrad, Tobias Koenig)
Business as usual. We tried to install the UMD release 1.1 of the gLite Logging and Bookkeeping (LB) package. The test was not successful; the information was put on the rollout board, but we still have no answer. The LB package is actually not very useful. GridKa School is next week.
- KIT (Uni Karlsruhe)
- MPPMU (Cesare Delle Fratte)
Update: in July the big problem was the information system. Since last weekend there has been a big problem with the CREAM CE, which got stuck because of a lack of memory; it is now green again. These two main problems influenced the reliability/availability statistics for August, but we are concerned about the two different numbers on the operations portal and the GridView portal.
Recommendation: we will wait until the official numbers for August are published.
- RWTH Aachen
During the last two weeks: problems with dCache, probably caused by a RAID controller; 4000 files with wrong checksums, still under investigation.
- Uni Bonn
- Uni Dortmund
- Uni Dresden (Ralph Mueller Pfefferkorn)
Production runs quite fine
Problems with the data published to gstat/BDII, caused by a wrong Torque queue configuration; the problem has been fixed this week.
- Uni Freiburg (Anton Gamel)
On Wednesday evening we had a downtime because of an air-conditioning failure; some pools were down for some hours.
Tickets: decommissioning of CE1. The CE was drained and taken out of production in the GOCDB, but some PanDA jobs are still being submitted, bypassing the WMS. One reason could be hard-coded CEs in user jobs. Is there any advice on how to proceed?
CREAM is "eating" memory, caused by the BLAH daemon. Our CREAM currently has 8 GB of memory and 1 GB of swap, and it is using all the swap. Recommendation: better to add memory or ask the rollout list/board. Currently there is no recommendation on the rollout list/board. At KIT we have 16 GB.
Two recommendations from John Alan Kennedy (MPPMU): We at MPPMU had the same problem; I read on the rollout list that the problem is the SG helper. Regarding the PanDA jobs: ask the ATLAS people. Torsten Harrenberg is on vacation, but ROD is back from vacation.
- Uni Mainz-Maigrid
- Uni Siegen
- Uni Wuppertal
Business as usual, despite some issues with the Lustre filesystem (scratch), which we hope to fix in the upcoming weeks by replacing it with GPFS with SSDs for metadata.
Note: please update your entry at https://wiki.egi.eu/wiki/NGI_DE:Sites if needed.
- Any problematic tickets?
- Handover of the ROD shift
(from DESY-HH, week before) - Nothing serious to report, but it is worth mentioning that in NGI_DE there are discrepancies in the number of published job slots (GoeGrid: twice as many as in reality) and in the APEL statistics (MPPMU: wrong SI2K). Proposal: can we have a table in the NGI_DE wiki with the actual number of job slots and the total HS for all sites in NGI_DE, so that we can compare this with the information from BDII, gstat, and GridView and contact sites in case of disagreement?
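The comparison proposed in this report could be scripted roughly as follows. This is only a minimal sketch in Python: the function name `find_slot_discrepancies`, the 10% tolerance, and all site names and numbers are illustrative assumptions, not real NGI_DE data. How the published values are collected from BDII, gstat, or GridView is left open here.

```python
def find_slot_discrepancies(reference, published, tolerance=0.10):
    """Return sites whose published job slots deviate from the reference table.

    reference: dict mapping site name -> actual number of job slots
               (as it would be maintained by hand in the wiki table)
    published: dict mapping site name -> number of job slots seen in the
               information system
    tolerance: allowed relative deviation (assumed value, 10% by default)

    The result maps each suspicious site to (actual, published); a published
    value of None means the site did not publish anything at all.
    """
    mismatches = {}
    for site, actual in reference.items():
        seen = published.get(site)
        if seen is None:
            # Site is in the reference table but publishes no value.
            mismatches[site] = (actual, None)
        elif abs(seen - actual) > tolerance * actual:
            # Published value is outside the allowed band around the reference.
            mismatches[site] = (actual, seen)
    return mismatches


# Hypothetical example data: SITE-A publishes twice its real number of slots,
# SITE-B is within tolerance, SITE-C publishes nothing.
reference = {"SITE-A": 1000, "SITE-B": 500, "SITE-C": 200}
published = {"SITE-A": 2000, "SITE-B": 510}

for site, (actual, seen) in sorted(find_slot_discrepancies(reference, published).items()):
    print(f"{site}: reference={actual}, published={seen}")
```

The same check would work for the total HS per site if the two dictionaries held HS values instead of job slots; the sites returned are the ones to contact.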
Week 35 (29.08 - 04.09, this week): Team2, FhG (SCAI)
Week 36 (05.09 - 11.09): Team3, KIT
Week 37 (12.09 - 18.09): Team4, JUELICH
Week 38 (19.09 - 25.09): Team5, BADW-LRZ
- ROD shift schedule https://wiki.egi.eu/wiki/NGI_DE_CH_Operations_Center:Operations_Teams#Shifts_rotation_table
- ROD Workshop @ EGI TechForum.
EGI Technical Forum: the ROD team’s session is on Thursday, September 22nd, at 4 in the afternoon. Have a look at the agenda at: https://www.egi.eu/indico/contributionDisplay.py?contribId=35&confId=452
For write access to this wiki page please contact wilhelm.buehleraddkit.edu
One participant: Sound quality was very good during this telephone conference and the wiki page is very useful
If you have additional topics to be discussed during the meeting, please submit them in advance via our email list.