EGI-InSPIRE:Switzerland-QR5

From EGIWiki
Revision as of 09:11, 28 July 2011 by Aesch (talk | contribs)
Quarterly Report Number | NGI Name | Partner Name | Author
5 | NGI_CH | Switzerland | Andres Aeschlimann


1. MEETINGS AND DISSEMINATION

1.1. CONFERENCES/WORKSHOPS ORGANISED

Date | Location | Title | Participants | Outcome (Short report & Indico URL)
2011-06-10 | Bern, Switzerland | Swiss National Grid Association - Scientific Advisory Council 2011 | 20 | http://www.swing-grid.ch/event/306650-swing-scientific-advisory-council-2011

1.2. OTHER CONFERENCES/WORKSHOPS ATTENDED

Date | Location | Title | Participants | Outcome (Short report & Indico URL)
2011-05-09 - 2011-05-12 | Sundvolden, Norway | Annual Nordugrid Conference | 1 | http://indico.hep.lu.se/conferenceDisplay.py?confId=1047


1.3. PUBLICATIONS

Publication title | Journal / Proceedings title | Journal references (Volume number, Issue, Pages from - to) | Authors (1., 2., 3., Et al?)
Swiss National Grid Association - Annual Report 2011 | http://www.swing-grid.ch

2. ACTIVITY REPORT

2.1. Progress Summary

2.2. Main Achievements

CSCS

  • Participated in the staged rollout of the following components:
    • EMI1 ARGUS 1.3.0-6
    • EMI1 APEL 1.0.0-0
    • EMI1 CREAM 1.3
    • EMI1 glexec_wn 1.0.0-1
    • gLite glexec_wn 3.2.5-1
    • gLite glexec_wn 3.2.6-3
  • Deployed two new Argus servers and replaced the CREAM installation on one machine with a fresh installation of EMI1 CREAM.

2.3. Issues and mitigation

Issue Description
CSCS
1. A failure in one of our Lustre scratch servers caused a few jobs to fail immediately and many others to stall or not be queued.
2. We are starting to see an excessive rate of failed disks in our Lustre servers (Sun J4400).
3. The Torque pbs_mom process suffers random segfaults.
4. The Torque batch system did not fail over when it was supposed to.
5. User jobs doing excessive I/O still slow down other jobs running in the shared Lustre or GPFS filesystems.

Mitigation Description
CSCS
1. Deactivated the Lustre server and removed the hung jobs after a long period of time.
2. We replace disks as soon as we can, but since all disks are of the same age, sooner or later we will have to replace some machines/disks.
3. Contacted Adaptive Computing, who have not been very responsive on this issue. We are trying to make pbs_mom dump core files.
4. Replaced Torque with a newer version.
5. Penalized excessive use of certain operations (such as the commands 'find', 'du', etc.).