EGI-InSPIRE:Switzerland-QR5
Quarterly Report Number | NGI Name | Partner Name | Author |
---|---|---|---|
5 | NGI_CH | Switzerland | Andres Aeschlimann |
1. MEETINGS AND DISSEMINATION
1.1. CONFERENCES/WORKSHOPS ORGANISED
Date | Location | Title | Participants | Outcome (Short report & Indico URL) |
---|---|---|---|---|
2011-06-10 | Bern, Switzerland | Swiss National Grid Association - Scientific Advisory Council 2011 | 20 | http://www.swing-grid.ch/event/306650-swing-scientific-advisory-council-2011 |
1.2. OTHER CONFERENCES/WORKSHOPS ATTENDED
Date | Location | Title | Participants | Outcome (Short report & Indico URL) |
---|---|---|---|---|
2011-05-09 - 2011-05-12 | Sundvolden, Norway | Annual Nordugrid Conference | 1 | http://indico.hep.lu.se/conferenceDisplay.py?confId=1047 |
1.3. PUBLICATIONS
Publication title | Journal / Proceedings title | Journal references (volume number, issue, pages from - to) | Authors (1. 2. 3. et al.) |
---|---|---|---|
Swiss National Grid Association - Annual Report 2011 | http://www.swing-grid.ch |
2. ACTIVITY REPORT
2.1. Progress Summary
2.2. Main Achievements
CSCS
- Participated in the staged rollout of the following components:
  - EMI1 ARGUS 1.3.0-6
  - EMI1 APEL 1.0.0-0
  - EMI1 CREAM 1.3
  - EMI1 glexec_wn 1.0.0-1
  - gLite glexec_wn 3.2.5-1
  - gLite glexec_wn 3.2.6-3
- Deployed 2 new Argus servers. Replaced the CREAM installation on 1 machine with a fresh installation of EMI1 CREAM.
2.3. Issues and mitigation
Issue Description | Mitigation Description |
---|---|
CSCS: 1. A failure in one of our Lustre scratch servers caused a few jobs to fail immediately and many to stall or not be queued. | 1. Deactivated the Lustre server and removed hung jobs after a long period of time. |
CSCS: 2. We are starting to see an excessive rate of failed disks in our Lustre servers (Sun J4400). | 2. Replacing disks as soon as we can, but since all disks are of the same age, sooner or later we will have to replace some machines/disks. |
CSCS: 3. Random segfaults of the Torque pbs_mom process. | 3. Contacted Adaptive Computing, but they have not been very responsive on this issue. We are working on getting pbs_mom to dump core files. |
CSCS: 4. The Torque batch system did not fail over when it was supposed to. | 4. Replaced Torque with a newer version. |
CSCS: 5. Still suffering from user jobs doing excessive I/O, which slows down other jobs running in the shared Lustre or GPFS filesystem. | 5. Penalized excessive use of certain operations (such as the commands 'find' and 'du'). |