Difference between revisions of "NGI DE CH Operations Center:Operations Meeting:01072011"
Latest revision as of 16:18, 5 September 2011
Introduction
- Minutes of last meeting
no comments
Announcements
- Meetings/conferences
A Grid educational event organized by Yves Kemp (DESY-HH) is currently running at the University of Berlin. The test user certificates have been delivered and no problems are known. All is running fine.
- Availability/reliability statistics
NGI-DE average: 97%/98% (availability/reliability).
Problem site: BMRZ-FRANKFURT (ticket 1304), 73%/73% availability/reliability.
Cause: gLite 3.2 migration, misconfigurations and some disk space issues. The problems have been solved.
- Sites certification procedure
Sites UNI-SIEGEN-HEP and UNI-BONN are fully recertified and back in production.
- Monitoring
The regional NGI-DE monitoring and the ARC monitoring were updated this week. The regional monitoring now supports Globus 5. Each Globus site should register its Globus 5 resources in the GOCDB. We will continue the monitoring together with the Swiss sites; Res has to ask his people whether they want this.
Question from Florian (LRZ): Which Globus services will be monitored?
Answer: GSI-SSH, GridFTP and GRAM 5 are monitored. The GSI-SSH test checks the connection to the ports and the authorization, in particular whether the test DN is supported at the site (non-D-Grid sites have to enter the DN manually; for D-Grid sites this is done automatically if they support the DG ops or DG test).
Florian: We (LRZ) also have problems registering our Globus 5 resources in the GOCDB (invalid service type).
Answer: Please open a ticket. Foued will check this after the meeting.
- Staged rollout/updates
ntr from KIT.
Martin (ITWM): EMI WN and EMI DPM staged rollout: we have some problems; the corresponding tickets have not been solved yet and we are still waiting for solutions.
- Other
Sites UNI-SIEGEN-HEP and UNI-BONN are fully recertified and back in production.
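The port-connection part of the Globus tests mentioned under Monitoring can be sketched roughly as follows. This is only an illustration, not the actual NGI-DE probe: the port numbers are commonly used defaults and are assumptions here (sites may use others), the host name is a placeholder, and the authorization check against the test DN is not covered.

```python
import socket

# Assumed default ports (not taken from the minutes); sites may differ.
ASSUMED_PORTS = {
    "GSI-SSH": 2222,
    "GridFTP": 2811,
    "GRAM 5": 2119,
}


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_site(host, ports=ASSUMED_PORTS, timeout=5.0):
    """Check every service port of one host; returns {service: reachable}."""
    return {
        service: port_is_open(host, port, timeout)
        for service, port in ports.items()
    }


if __name__ == "__main__":
    # Placeholder host name, to be replaced with a real Globus 5 endpoint.
    for service, reachable in check_site("globus.example.org").items():
        print(f"{service}: {'reachable' if reachable else 'NOT reachable'}")
```

A reachable port only shows that the service is listening; whether the test DN is authorized at the site still has to be verified with a real GSI login.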
Round the sites
- NGI-DE
- BMRZ-FRANKFURT (Uni Frankfurt)
- DESY-HH
Andreas, Dmitri:
Good load: 4000 jobs
Job efficiency: 10 VOs in parallel
Network interruption last Wednesday: the firewall settings and the DFN connection were reconfigured. Another interruption is needed.
Updated two of three dCache instances successfully to 1.9.12 (second golden release).
Concerned about the migration from gLite to the EMI release. People from CERN (LCG rollout list) recommend not using the EMI release, and we do not know when the release will be available. At present we only apply the recommended high-risk security updates for WNs. Dimitri (KIT) does not see a need to install the EMI release at the moment; we have to wait until the first EMI release is officially available.
- DESY-ZN
- FZJuelich
- Goegrid
- GSI
- ITWM
Martin:
Taking part in the staged rollout for WNs / EMI test: SAM test failed -> DT was needed; to resolve the incident, a rollback to gLite 3.2 was necessary.
SW areas ran full -> upgraded some WNs to SL5.6, but the CREAM CEs want an older version; excluded some packages (e.g. the CArgus package) that caused the problem.
Switched next week's ROD shift with CSCS.
- KIT (GridKa, FZK-LCG2)
Dimitri:
In recent weeks LHCb put a very high load on the dCache Urprod instance; as a consequence the system crashed. The dCache instance had to be optimized (additional servers etc.).
Problems with one tape library (stuck tapes, errors reading labels).
CREAM CE instability still there.
- KIT (Uni Karlsruhe)
- LRZ
Florian:
Systems up and running.
No DT planned within the next weeks.
Plan to evaluate CernVM-FS.
- MPI-K
- MPPMU
Cesare:
CREAM CEs are up and running.
Next week we will have a DT to upgrade the BDII and SRM to the latest gLite 3.2 and for new WNs. The old LCG CE and the old farm will be decommissioned.
Good HammerCloud tests.
Problems with the BDII are still under investigation.
- RWTH Aachen
- SCAI
Oliver: ntr
- Uni Bonn
- Uni Dortmund
- Uni Dresden
Ralph:
Longer DT to update the operating system of several components (WNs and CREAM CEs). Because of limited manpower, vendor support was needed. Everything has been working again since yesterday.
- Uni Freiburg
Anton via email:
Hello Tobias,
unfortunately I cannot attend the meeting today.
For the minutes: Site UNI-FREIBURG is running smoothly. We had some small downtimes (at risk) for servicing the water cooling system.
Cheers!
Anton
- Uni Mainz-Maigrid
- Uni Siegen
- Uni Wuppertal
- SwiNG
- CSCS
Miguel via email:
Hi all,
Unfortunately I cannot make it to the meeting today. This is the report for CSCS:
Last week there was a problem with our scratch Lustre filesystem. One of the RAIDs lost 3 disks while rebuilding after a previous disk failure and we lost scratch data. The jobs that were reading from or writing to that part of the FS failed (74 out of ~1400). Since then we have disabled the failed storage server and everything works fine.
Also last week we had two tickets:
a) https://ggus.eu/ws/ticket_info.php?ticket=71814 APEL Consumer did not respond to a request for confirmation of pre-published data. It has been solved, as it was a problem present in both the gLite 3.2 and EMI1 releases of APEL.
b) https://ggus.eu/ws/ticket_info.php?ticket=71436 hone jobs are aborted immediately after submission into the cream02.lcg.cscs.ch:8443/cream-pbs-other queue. It has also been solved and the changes will be applied when we upgrade our EA EMI 1 CREAM CE.
We are suffering from random segfaults of the Torque pbs_mom service on the WNs. This means that, occasionally and for short periods of time (a few hours max.), there might be fewer slots available for jobs. Running jobs are not affected. The Adaptive Computing support team is working on it, but they're having a hard time reproducing the issues we see here.
Next Wednesday is CSCS maintenance day and we've been informed that there might be a network downtime of about 30 min. It is not yet certain whether it will affect us, but if the network team confirms it, we'll declare the downtime in GOCDB according to the rules. Besides that, there is no planned action for CSCS-LCG2 production systems on maintenance day.
Starting on Monday July 4 we will be on duty with the ROD shift. Our previous problems accessing the Operations Portal have been fixed.
That's all.
Best regards,
Miguel
- PSI
- Switch
Res:
We have to check the ROD responsibility for next week. There seems to be some confusion, and the division of responsibility between ITWM and CSCS is unclear. Note added after the meeting: ITWM is on duty for the ROD shift in CW 27.
Note: please update your entry at https://wiki.egi.eu/wiki/NGI_DE:Sites if needed.
Status ROD
CW  From   To     Team
24  13.06  19.06  Team3, KIT
25  20.06  26.06  Team4, JUELICH
26  27.06  03.07  Team5, BADW-LRZ
27  04.07  10.07  Team2, FhG (ITWM) (switched wk 27/29)
28  11.07  17.07  Team1, DESY
29  18.07  24.07  Team6, CSCS/NGI_CH (switched wk 27/29)
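To avoid confusion about who is on duty in a given week, the schedule above can be looked up by date. The following is a minimal sketch, not an official tool: the function names are made up for illustration, the year 2011 is assumed from the meeting date, and the "switched wk 27/29" annotations are left out of the parsed team names.

```python
from datetime import date, datetime

# Schedule rows copied from the minutes: CW, first day, last day, team.
SCHEDULE_TEXT = """\
24 13.06 19.06 Team3, KIT
25 20.06 26.06 Team4, JUELICH
26 27.06 03.07 Team5, BADW-LRZ
27 04.07 10.07 Team2, FhG (ITWM)
28 11.07 17.07 Team1, DESY
29 18.07 24.07 Team6, CSCS/NGI_CH
"""

YEAR = 2011  # assumption: the schedule carries no year


def parse_schedule(text, year=YEAR):
    """Turn the plain-text schedule into (week, start, end, team) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        week, start, end, team = line.split(maxsplit=3)
        start_date = datetime.strptime(f"{start}.{year}", "%d.%m.%Y").date()
        end_date = datetime.strptime(f"{end}.{year}", "%d.%m.%Y").date()
        rows.append((int(week), start_date, end_date, team))
    return rows


def team_on_duty(rows, day):
    """Return (week, team) for the shift covering day, or None if not listed."""
    for week, start, end, team in rows:
        if start <= day <= end:
            return week, team
    return None


if __name__ == "__main__":
    rows = parse_schedule(SCHEDULE_TEXT)
    print(team_on_duty(rows, date(2011, 7, 4)))  # -> (27, 'Team2, FhG (ITWM)')
```

The lookup agrees with the note added after the meeting that ITWM is on duty in CW 27.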
- Any problematic tickets?
Problems with University of Karlsruhe (T2 of KIT): the site was down for one week and did not respond. The site was contacted by phone. Now solved.
At one site, job submission failed for a short time. The test itself seems to be unstable. Recommendation: please open a ticket to notify the site.
- Handover of the ROD shift
Res:
We have to check the ROD responsibility for next week. There seems to be some confusion, and the division of responsibility between ITWM and CSCS is unclear. Note added after the meeting: ITWM is on duty for the ROD shift in CW 27.
- ROD shift schedule https://wiki.egi.eu/wiki/NGI_DE:ROD-DE
AOB
@ Henry Jonker / Frankfurt: At GridKa we have enabled the life sciences project e-NMR (enmr.eu) on one of our CEs so that it can use our computing resources. From our site it is working; some jobs have been running since yesterday. Note: the ticket was reopened.
Important: For the next meeting we will use the DFN telephone conferencing system and the new documentation page https://wiki.egi.eu/wiki/NGI_DE_CH_Operatons_Center . We want to track the meeting there so that the sites can enter their information in the wiki. Connection details and the link to the web page will follow with the next reminder.
If you have additional topics to be discussed during the meeting, please submit them in advance via our email list.