NGI DE CH Operations Center:Operations Meeting:16032012
Latest revision as of 11:32, 22 March 2012
Introduction
- Minutes of last meeting
Announcements
- Meetings/conferences
- Availability/reliability statistics
Feb:
VoOps: Av/Re = 97% (UNI BONN only has 69%)
BDII: Av/Re = 99.3%
- Monitoring
https://helpdesk.ngi-de.eu/index.php?mode=ticket_info&ticket_id=1969
https://helpdesk.ngi-de.eu/index.php?mode=ticket_info&ticket_id=1978
High rate of Nagios tests with status "unknown". We also contacted the sites MPI-K, UNI-Dresden, UNI-Karlsruhe and UNI-Siegen by email. Possible reason: connection problems.
The NGI-DE Nagios was also down during our (KIT, FZK-LCG2) downtime last week. We will recalculate the availability numbers for March.
- Staged rollout/updates
UMD:
http://www.eu-emi.eu/emi-1-kebnekaise-updates/-/asset_publisher/Ir6q/content/update-14-16-03-2012
http://www.eu-emi.eu/emi-1-kebnekaise-updates/-/asset_publisher/Ir6q/content/update-13-17-02-2012
- Survey
Usage and future maintenance of deployed software
Operations/Platform Deployment Survey
We need your feedback on the surveys by email by 20 March.
Round the sites
- NGI-DE
- BMRZ-FRANKFURT (Uni Frankfurt)
- DESY-HH
- DESY-ZN
- FZJuelich
- Goegrid
- GSI
- ITWM (Martin Braun)
  - 09.03 -> ITWM will not be able to attend the phone conference on 09.03. There is nothing special to report.
  - 16.03 -> Nothing to report.
- KIT (GridKa, FZK-LCG2, Dimitri Nilsen, Tobias Koenig)
  - 09.03 -> Announcement: downtime 13.03 - 15.03 for full site maintenance.
  - 16.03 -> The downtime was completed with a two-hour delay. During the downtime one central router was replaced by a new model, and we carried out some dCache updates, disk firmware and BIOS upgrades, a tape TSM update, a reinstallation of the complete cluster and changes to the central power supply. Some LFCs and 3D-DBs were migrated to Oracle version 11g.
- KIT (Uni Karlsruhe)
- LRZ
- MPI-K
- MPPMU (Cesare Delle Fratte)
  - Nothing to report.
  - A few problems with CREAM and ATLAS jobs. The problems are solved now.
- RWTH Aachen
- SCAI (Andre Gemuend)
  - Problems with the SE: the DPM daemon died. We will update EMI-DPM.
  - ROD filed two tickets (one concerning the DPM daemon problem and one concerning the related CREAM CE) instead of only one for the DPM daemon.
- Uni Bonn
- Uni Dortmund
- Uni Dresden (Ralph Mueller-Pfefferkorn)
  - 16.03 -> Last week we had a two-day downtime. We updated CREAM and APEL to the EMI release. dCache was updated to version 1.9.12, and the dCache update included the upgrade from SL4 to SL5. Everything seems to be fine now.
  - Problems with the EMI release: after the update the Nagios test sometimes fails with an error message along the lines of "job could not be submitted".
  - Comment by KIT: We also tested the EMI BDII in preproduction. The documentation did not mention that it needs 3 GB of RAM.
- Uni Freiburg
- Uni Mainz-Maigrid
- Uni Siegen
- Uni Wuppertal
- SwiNG
- CSCS
- PSI
- Switch
Note: please update your entry at https://wiki.egi.eu/wiki/NGI_DE:Sites if needed.
Status ROD
Welcome: Daniel Waldmann (LRZ)
- Any problematic tickets? No
- Handover of the ROD shift
- ROD shift schedule https://wiki.egi.eu/wiki/NGI_DE_CH_Operations_Center:Operations_Teams#Shifts_rotation_table
09 27.02 04.03 Team5, LRZ
10 05.03 11.03 Team6, CSCS/NGI_CH
11 12.03 18.03 Team1, DESY
12 19.03 25.03 Team2, FhG
- Nagios<-->Dashboard issue
(Nagios sends information/notifications to the dashboard, but they are not displayed correctly there. As a consequence, tickets stay in the dashboard longer than the underlying problem persists.)
https://helpdesk.ngi-de.eu/?mode=ticket_info&ticket_id=2039
https://helpdesk.ngi-de.eu/?mode=ticket_info&ticket_id=2038
AOB
- Reminder: Operations workshop in April. We hope many sites will participate. You can register now. See the announcement section above.
- Short telco meeting before the workshop on 13 April.
If you have additional topics to be discussed during the meeting, please submit them in advance via our email list.