EGI-InSPIRE:Switzerland-QR9
Latest revision as of 19:19, 9 January 2015
| Quarterly Report Number | NGI Name | Partner Name | Author |
|---|---|---|---|
| 9 | NGI_CH | SWITCH | Simon Leinen |
1. MEETINGS AND DISSEMINATION
1.1. CONFERENCES/WORKSHOPS ORGANISED
| Date | Location | Title | Participants | Outcome (Short report & Indico URL) |
|---|---|---|---|---|
1.2. OTHER CONFERENCES/WORKSHOPS ATTENDED
| Date | Location | Title | Participants | Outcome (Short report & Indico URL) |
|---|---|---|---|---|
| May 30 - June 1 | Uppsala, SE | NorduGrid 2012 conference | Sigve Haug, Gianfranco Sciacca (UNIBE-LHEP), Tyanko Aleksiev, Sergio Maffioletti (UZH) | Defined the way forward for establishing operational solutions still missing from the ARC middleware: APEL accounting (affects CSCS, UNIBE, UNIGE), the information system, and full integration with ATLAS operations. |
| June 25 - June 29 | Cetraro, IT | Int. Adv. Res. Workshop on High Performance Computing, Grid and Clouds | Sergio Maffioletti | The main focus was how to prepare providers and community support for next-generation large-scale data analysis, with an emphasis on cloud computing. |
1.3. PUBLICATIONS
| Publication title | Journal / Proceedings title | Journal references (volume, issue, pages) | Authors |
|---|---|---|---|
| AppPot: bridging the Grid and Cloud worlds | Proc. EGI Community Forum 2012 | | R. Murri, S. Maffioletti, T. Aleksiev |
| A Grid execution model for Computational Chemistry Applications using GC3Pie and AppPot | Proc. EGI Community Forum 2012 | | A. Costantini, A. Laganà, S. Maffioletti, R. Murri, O. Gervasi |
| Computational workflows with GC3Pie | Proc. EGI Community Forum 2012 | | S. Maffioletti, R. Murri, T. Aleksiev |
| GC3Pie: A Python framework for high-throughput computing | Proc. EGI Community Forum 2012 | | S. Maffioletti, R. Murri, T. Aleksiev |
| Experience in Grid Site Testing for ATLAS, CMS and LHCb with HammerCloud | Proc. CHEP 2012 | | J. Elmsheuser, R. Medrano Llamas, F. Legger, A. Sciabá, G. Sciacca, M. Ubeda Garcia, D. van der Ster |
2. ACTIVITY REPORT
2.1. Progress Summary
CSCS
The entire site moved from its old location in Manno to a newly built centre in Lugano-Cornaredo. The move was used as an opportunity to upgrade the T2 cluster; see "Main Achievements" below.
2.2. Main Achievements
CSCS
- Full network redesign, from a mixed environment with two separate networks (InfiniBand and Ethernet) to a fully integrated InfiniBand and 10G Ethernet network with Voltaire bridges.
- Replacement of the remaining old Sun "Thumper" storage servers by newer IBM-based storage (DS3500).
- Replacement of the remaining old Sun worker nodes (96 blade machines) with new Sandy Bridge systems providing 32 job slots per host.
- Reorganization and simplification of the GPFS-based scratch filesystem on which the CSCS jobs run.
- Increased the bandwidth available to VM-based grid services from 1G to 10G Ethernet.
- Replaced the two VM-based CREAM-CE machines by three physical hosts; these systems now run the latest update of EMI UMD 1.
- Replaced the two VM-based ARGUS servers by three physical hosts, integrated with the CREAM-CEs on the same nodes; these systems also run the latest update of EMI UMD 1.
- Installed CernVM-FS for all the VOs supported by CSCS; currently it is only used by ATLAS and LHCb.
- General upgrade of service nodes from SL 5.4/5.5 to SL 5.7 with the latest kernel available at the time (release 308).
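The CernVM-FS rollout mentioned above typically amounts to a small client configuration on each node plus a sanity check. A minimal sketch, assuming illustrative settings (the cache limit and squid proxy hostname are placeholders, not CSCS's actual values):

```shell
# /etc/cvmfs/default.local -- illustrative client settings, not the site's real config.
# Repositories to mount; ATLAS and LHCb were the active users at the time.
CVMFS_REPOSITORIES=atlas.cern.ch,lhcb.cern.ch
# Local disk cache limit in MB (site-dependent placeholder).
CVMFS_QUOTA_LIMIT=10000
# Site squid proxy -- hypothetical hostname.
CVMFS_HTTP_PROXY="http://squid.example.ch:3128"
```

After editing the file, `cvmfs_config chksetup` validates the client setup and `cvmfs_config probe` verifies that each configured repository can actually be mounted.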
PSI
- Added 11 worker nodes (2 × 8-core Xeon "Sandy Bridge" E5-2670, 2.6 GHz, 48 GB DDR3) and put them into production.
UNIGE
- Stable operation
- Upgraded network hardware
- Prepared the next upgrade: replacement of the oldest CPU nodes.
UNIBE-LHEP
- Very stable cluster operation
- Successful relocation of T2 hardware from CSCS to Bern (~10,500 HEP-SPEC06, plus storage); commissioning is underway.
UZH
- Involvement in NA activities, in particular Virtual Teams, specifically:
- EGI Champions
- Science gateway primer
2.3. Issues and mitigation
CSCS
- Unfortunately, we are seeing soft-lockup CPU errors on our old Sun "Thor" storage servers, which cause the affected systems (dCache pools) to block on the available software RAIDs. This appears to be a known Sun bug that is not going to be fixed; a replacement plan is being drafted.
- Initially we saw ARP problems with the Ethernet/InfiniBand bridge; this now appears to be fixed.
- The CSCS status on the WLCG Dashboard is not always green. This seems to be caused by user-job-related issues; we suspect CVMFS may be involved, but it is difficult to be sure at this point. Debugging is in progress.
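Until the Thor replacement lands, the software RAIDs can at least be watched from the OS side so a blocked or degraded array is caught early. A sketch using standard Linux md tools (the array device name is a placeholder; these are diagnostic commands, not CSCS's actual monitoring):

```shell
# Overview of all md arrays: sync status and missing-member markers like [UU_].
cat /proc/mdstat

# Detailed state of one array -- /dev/md0 is a placeholder device name.
mdadm --detail /dev/md0

# Flag degraded arrays for alerting: lines with "_" in the member map
# indicate a failed or missing disk.
grep "_" /proc/mdstat && echo "WARNING: degraded md array detected"
```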
UNIGE
- Disk space management: identifying and removing data we no longer need.
- Solaris is no longer supported by the DPM software team, and our SE includes such machines. The mitigation is not to update DPM and to hope that it lasts another year or two.
- The DPM software on the Solaris disk servers does not support RFC proxies; as a consequence, our SE is no longer usable as a data source for NorduGrid jobs on other sites.
| Issue Description | Mitigation Description |
|---|---|