Difference between revisions of "EGI-InSPIRE:Switzerland-QR9"
Revision as of 18:10, 27 July 2012
| Quarterly Report Number | NGI Name | Partner Name | Author |
|---|---|---|---|
| | | | |
1. MEETINGS AND DISSEMINATION
1.1. CONFERENCES/WORKSHOPS ORGANISED
| Date | Location | Title | Participants | Outcome (Short report & Indico URL) |
|---|---|---|---|---|
| | | | | |
1.2. OTHER CONFERENCES/WORKSHOPS ATTENDED
| Date | Location | Title | Participants | Outcome (Short report & Indico URL) |
|---|---|---|---|---|
| | | | | |
1.3. PUBLICATIONS
| Publication title | Journal / Proceedings title | Journal references (volume, issue, pages) | Authors |
|---|---|---|---|
| | | | |
2. ACTIVITY REPORT
2.1. Progress Summary
CSCS
The entire site moved from its old location in Manno to a newly built centre in Lugano-Cornaredo. This was used as an opportunity to upgrade the T2 cluster; see "Main Achievements" below.
2.2. Main Achievements
CSCS
- Full network redesign: from a mixed environment with two separate networks (InfiniBand and Ethernet) to a fully integrated InfiniBand and 10 Gb Ethernet network with Voltaire bridges.
- Replaced the remaining old Sun Thumper storage servers with newer IBM-based storage (DS3500).
- Replaced the remaining old Sun worker nodes (96 blade machines) with new Sandy Bridge systems providing 32 job slots per host.
- Reorganized and simplified the GPFS-based scratch filesystem on which CSCS jobs run.
- Increased the bandwidth available to VM-based grid services from 1 Gb to 10 Gb Ethernet.
- Replaced 2 VM-based CREAM-CE machines with 3 physical hosts, now running the latest EMI UMD-1 update.
- Replaced 2 VM-based ARGUS servers with 3 physical hosts, integrated with the CREAM-CE on the same nodes and also running the latest EMI UMD-1 update.
- Installed CernVM-FS for all the VOs supported by CSCS; currently only ATLAS and LHCb use it.
- General upgrade of service nodes from SL 5.4/5.5 to SL 5.7 with the latest kernel available at the moment (308).
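The CernVM-FS installation mentioned above is typically driven by a small client configuration file. A minimal sketch is shown below; the repository names, proxy URL, and cache limit are illustrative assumptions, not values taken from this report:

```ini
# /etc/cvmfs/default.local — minimal CVMFS client configuration (sketch).
# Repository names and the proxy URL are illustrative assumptions.
CVMFS_REPOSITORIES=atlas.cern.ch,lhcb.cern.ch
CVMFS_HTTP_PROXY="http://squid.example.org:3128"
CVMFS_QUOTA_LIMIT=20000   # local cache limit in MB
```

After editing such a file, `cvmfs_config setup` applies the configuration and `cvmfs_config probe` verifies that the configured repositories actually mount.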
PSI
- Added 11 worker nodes (2 × 8-core Xeon "Sandy Bridge" E5-2670, 2.6 GHz, 48 GB DDR3) and put them into production.
UNIGE
- Stable operation
- Upgraded network hardware
- Preparation of next upgrade: replacement of the oldest CPU nodes.
2.3. Issues and mitigation
CSCS
- Unfortunately, we are seeing soft-lockup CPU errors in our old Sun Thors that cause the affected systems (dCache pools) to block access to their software RAIDs. This appears to be a known Sun bug that will not be fixed; a replacement plan is being drafted.
- Initially we saw ARP problems with the Ethernet/InfiniBand bridge; these now appear to be fixed.
- The CSCS status on the WLCG Dashboard is not always green. This seems to be caused by user-job-related issues, possibly involving CVMFS, but it is difficult to be sure at this point; debugging is in progress.
UNIGE
- Disk space management: identifying and removing data we no longer need.
- Solaris is no longer supported by the DPM software team, and our SE includes such machines. The mitigation is not to update DPM and to hope that it lasts another year or two.
- The DPM software on the Solaris disk servers does not support RFC proxies, so our SE is no longer usable as a data source for NorduGrid jobs on other sites.