EGI-InSPIRE:Switzerland-QR10
Revision as of 12:23, 5 November 2012
Quarterly Report Number | NGI Name | Partner Name | Author |
---|---|---|---|
1. MEETINGS AND DISSEMINATION
1.1. CONFERENCES/WORKSHOPS ORGANISED
Date | Location | Title | Participants | Outcome (Short report & Indico URL) |
---|---|---|---|---|
September 2012 | Prague | EGI Technical Forum | UNIBE-LHEP (Gianfranco Sciacca), UZH (Sergio Maffioletti, Tyanko Alexiev), SWITCH (Alessandro Usai, Simon Leinen, Valery Tschopp) | Important know-how built; accounting problem discussed; OMB and OTAG |
1.2. OTHER CONFERENCES/WORKSHOPS ATTENDED
Date | Location | Title | Participants | Outcome (Short report & Indico URL) |
---|---|---|---|---|
October 2012 | Lugano (CH) | Swiss High Performance Computing Forum | UNIGE (Szymon Gadomski), UNIBE-LHEP (Gianfranco Sciacca and Sigve Haug), ETH/CSCS (Christoph Grab and Pablo Fernandez) | Talk: 'Disk Pool Manager Storage Systems at the Universities of Bern and Geneva', S. Gadomski (UNIGE-DPNC) and G. Sciacca (UNIBE-LHEP) |
August 2012 | Karlsruhe (DE) | GridKa school | G. Sciacca (UNIBE), Sergio Maffioletti (UZH) | Various topics of interest for Grid site administrators, community networking |
1.3. PUBLICATIONS
Publication title | Journal / Proceedings title | Journal references (volume number, issue, pages from - to) | Authors (1., 2., 3., et al.?) |
---|---|---|---|
2. ACTIVITY REPORT
2.1. Progress Summary
Progress on accounting and monitoring.
Transition to UMD releases almost complete.
UNIBE-ID certified (accounting problem in progress).
Review/reshuffle of the tasks among the partners.
2.2. Main Achievements
Solution to the UNIBE accounting problem finally in place. CSCS is still waiting for a definitive solution for accounting in the mixed CREAM CE + ARC CE environment.
ARC monitoring to be enabled on the production Nagios system. The ARC probes are NOT production quality, so some sites reserve the right to decide whether to accept monitoring of their ARC services (flagged as non-production in GOCDB).
Nagios ARC probes review: ongoing. The ARC gridftp probe was decommissioned by OMB after a request from NGI_CH.
Upgrade to UMD1 and UMD2 complete except at UNIGE: Geneva will soon upgrade the site BDII and DPM from gLite 3.2 to UMD (see GGUS ticket). Apart from UNIGE, no gLite 3.1 or gLite 3.2 services remain in NGI_CH.
Security effort to be reviewed by NGI_CH.
Some manager tasks to be reassigned from SWITCH to SWING.
UNIBE-ID certified through the Nagios test system; it will be monitored by the Nagios production system as soon as the probes are enabled.
Note: ARC 2 BDII problem escalated with a GGUS ticket. In general, it has been remarked that ARC support is lacking and could be improved.
2.3. Issues and mitigation
CSCS
- Some storage problems due to hardware failures; they were tackled with emergency disks.
- CPU expansion at CSCS (the cluster now provides 21800 HS06 of computing power).
- All compute nodes upgraded to UMD1.
- All VOBOX middleware services (gsissh and bdii) were removed from the VO-specific machines.
PSI
- 1 TB disk failures in our old Sun X4540 systems for over a year; no data loss thanks to the resilient ZFS RAID6. These systems need to remain in operation for some more time.
- Introduction of a 3 GB RAM usage limit for jobs in the SGE configuration. This required switching from srmcp (Java) based tools to lcg-tools, because srmcp caused short-lived memory peaks high enough for SGE to kill the job.
UNIGE
- Issue: Space token UNIGE-DPNC_LOCALGROUPDISK exceeded 90% full a few times.
- Mitigation: Identifying which replicas are no longer needed or less needed.
- Methods: statistics on age and last access times (available from the DPM), lists of datasets to keep (maintained by analysis project leaders), evaluation of total data size per project based on those lists, negotiation in the group meetings.
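The clean-up method above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Replica` record, its field names, the 180-day idle threshold and the sample data are all hypothetical, and the sketch does not query the DPM; it assumes the sizes and last access times have already been exported from it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Replica:
    # Hypothetical record; in practice these values would come from the DPM.
    dataset: str
    project: str
    size_bytes: int
    last_access: datetime


def size_by_project(replicas, keep_lists):
    """Total size of the replicas that appear on each project's keep list."""
    totals = {}
    for r in replicas:
        if r.dataset in keep_lists.get(r.project, set()):
            totals[r.project] = totals.get(r.project, 0) + r.size_bytes
    return totals


def deletion_candidates(replicas, keep_lists, now, min_idle=timedelta(days=180)):
    """Replicas on no keep list and idle for at least min_idle, oldest access first."""
    keep = set().union(*keep_lists.values())
    idle = [r for r in replicas
            if r.dataset not in keep and now - r.last_access >= min_idle]
    return sorted(idle, key=lambda r: r.last_access)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    now = datetime(2012, 11, 5)
    replicas = [
        Replica("dataset_1", "project_a", 400, datetime(2012, 10, 20)),
        Replica("dataset_2", "project_a", 300, datetime(2011, 12, 1)),
        Replica("dataset_3", "project_b", 200, datetime(2012, 1, 15)),
        Replica("dataset_4", "project_b", 100, datetime(2012, 9, 30)),
    ]
    keep_lists = {"project_a": {"dataset_1"}, "project_b": {"dataset_4"}}
    print(size_by_project(replicas, keep_lists))
    for r in deletion_candidates(replicas, keep_lists, now):
        print(r.dataset, r.last_access.date(), r.size_bytes)
```

The resulting candidate list is only a starting point; as noted above, the final decision is negotiated in the group meetings.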
UNIBE-LHEP
Main achievements
- Stable operation, no down time.
- Completed electrical and cooling upgrades of the server room in order to accommodate the hardware moved from CSCS
- Rolling move (without downtime) of all production worker nodes and Lustre servers to new water-cooled racks
- Built a new ARC CE as front-end of the new cluster
- Progress in commissioning the new cluster: WN image customised for ATLAS and the InfiniBand interconnect
- Pledged CPU and disk resources to ATLAS as Tier-2 centre for April 2013
- Site account records now appear in the central EGI accounting portal (backdated to 2010)
Issues and mitigations
- Issue: bugs in the current version of the ARC middleware used to build the new CE
- Mitigation: adopt code hacks developed elsewhere in order to circumvent the bugs
- Issue: EGI Nagios monitoring not yet in production or of production quality
- Mitigation: increase the level of local monitoring, at the cost of manpower, and rely on operational monitoring by ATLAS
UNIBE-ID
Site certified (ARC 2 CE, no gLite/storage). Accounting to be fixed soon.
Issue Description | Mitigation Description |
---|