Difference between revisions of "TSA2.5 Deployed Middleware Support Unit"
Revision as of 18:18, 24 January 2012
DMSU people and expertise
DMSU_People_Institutes page provides the list of people with their expertise.
Ticket handling procedure
The purpose of DMSU work is twofold: to find solutions to problems that do not require changes in code, documentation, or anything else released by the technology provider (TP), and otherwise to provide a thorough analysis that yields a well-specified bug report.
The following ticket handling guidelines contribute to achieving these goals:
- The first mandatory step of DMSU work on a ticket is to understand the cause of the reported problem. The outcome of the analysis is documented in the ticket, preferably as a response to the user. The analysis may or may not include a thorough reproduction of the problem; this is left to common sense.
- During the analysis DMSU also assesses the priority of the ticket (see below) and adjusts the Type of problem and Ticket category fields if necessary.
- Typically, the analysis involves communication with the user. DMSU sets the ticket state to Waiting-for-reply whenever it expects feedback from the user. GGUS is expected to implement an automatic switch to In-progress when the user answers.
- DMSU expertise should cover most tickets. For the remaining tough issues, developers (i.e. the 3rd line support) can be involved. However, control of the ticket is still kept within DMSU, i.e. the ticket is not reassigned to another support unit.
- If the solution of the problem does not require changes in code, documentation, default configuration etc., i.e. a release of anything by the technology provider, DMSU closes the ticket.
- Otherwise, the ticket is reassigned to the appropriate 3rd line support unit. In this case, the most recent comment (i.e. on reassignment) should contain a brief summary of the DMSU analysis on the ticket, pointing to what is wrong exactly, how to reproduce the problem etc., so that 3rd line supporters don't have to gather all information from the ticket correspondence, which tends to be rather long.
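The guidelines above can be read as a small decision procedure. The following is only an illustrative sketch in Python, not part of GGUS or of any DMSU tooling; the names `Analysis`, `Action` and `next_action` are hypothetical and were chosen for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    WAIT_FOR_USER = "set state to Waiting-for-reply"
    CLOSE_IN_DMSU = "close the ticket in DMSU"
    REASSIGN_TO_3RD_LINE = "reassign to the 3rd line support unit"


@dataclass
class Analysis:
    """Outcome of the DMSU analysis of one ticket (hypothetical structure)."""
    needs_user_feedback: bool   # DMSU is waiting for an answer from the user
    needs_release_change: bool  # fix requires a TP release (code, docs, default config, ...)
    summary: str = ""           # brief summary of the analysis for the 3rd line


def next_action(analysis: Analysis) -> Action:
    """Pick the next step for a ticket according to the guidelines above."""
    if analysis.needs_user_feedback:
        return Action.WAIT_FOR_USER
    if not analysis.needs_release_change:
        return Action.CLOSE_IN_DMSU
    # Reassignment must carry a summary so the 3rd line does not have to
    # read the whole ticket correspondence.
    if not analysis.summary.strip():
        raise ValueError("a summary of the DMSU analysis is required on reassignment")
    return Action.REASSIGN_TO_3RD_LINE
```

Raising an error on a missing summary is a design choice of this sketch; it simply makes the "brief summary on reassignment" requirement hard to forget.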
A special case is tickets that were solved in DMSU but require a comment from the 3rd line, e.g. to confirm the feasibility of the solution. Those tickets should be closed in DMSU with just a comment indicating that the 3rd line was contacted, and the 3rd line should be approached by other means. The standard GGUS workflow must not be used for this communication, mostly in order to keep the statistics clean.
If a ticket is wrongly assigned to 3rd line support, i.e. the problem is quite simple and should preferably have been solved by DMSU, then:
- 3rd line support reassigns the ticket back to DMSU. A comment pointing to the appropriate documentation, or justifying why this is a trivial issue, must be given in this case.
- This mechanism will be used as a metric of DMSU failures and checked thoroughly; therefore it should not be abused.
When the user does not react to a question raised, she is typically reminded weekly, at the DMSU meetings. If there is no reaction for more than one month, the ticket is closed as unsolved.
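The reminder rule for unanswered tickets can be sketched as a check run at each weekly DMSU meeting. This is a hypothetical helper, not an existing tool; in particular, treating "one month" as 30 days is an assumption made for this example.

```python
from datetime import date, timedelta

# Assumption: "more than one month" is approximated as more than 30 days.
CLOSE_AFTER = timedelta(days=30)


def meeting_action(question_raised: date, meeting_day: date) -> str:
    """Decide at a weekly meeting what to do with a ticket waiting for the user.

    question_raised: day DMSU asked the user a question that is still unanswered.
    meeting_day:     day of the current DMSU meeting.
    """
    if meeting_day < question_raised:
        raise ValueError("meeting_day precedes question_raised")
    if meeting_day - question_raised > CLOSE_AFTER:
        return "close as unsolved"
    return "remind user"
```

For example, a question raised on 2 January and still unanswered at a meeting on 6 February (35 days later) would lead to closing the ticket as unsolved.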
Ticket priorities
Followup of tickets with 3rd line support units
DMSU shifts
The main purpose of the DMSU shift is no surprise: to keep things running and not to leave an important issue without a fast reaction.
The shifts are held by groups of people with expertise on different middleware stacks. However, due to the prevailing gLite-related traffic in DMSU, only gLite shifts are formally organized currently; the other stacks are handled on a best-effort basis.
The specific duties of the person on shift are:
- to follow incoming emails from GGUS, being able to react within approx. 2 hours in normal working hours
- to identify "top priority" and "very urgent" issues, not only by the priority set by the submitter but also by using common sense, and to make sure an appropriate expert starts looking into the issue; this includes assigning the ticket to a specific person
- to keep checking that the response time is reasonable, in particular in reaction to further correspondence from the submitter; it should be almost immediate for "top priority", and we can probably afford up to 1 week for "less urgent"
One person holds the shift for one week; the duty is passed to the next person on Monday afternoon.
Shift schedule
Dec 5 | Zdeněk Salvet |
Dec 12 | INFN |
Dec 19 | Aleš Křenek |
Dec 26 | best effort |
Jan 2 | Aleš Křenek |
Jan 9 | Alessandro Paolini |
Jan 16 | Zdeněk Salvet |
Jan 23 | Sergio Traldi |
Jan 30 | Aleš Křenek |
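Looking up who is on shift on a given day can be sketched as follows. The entries are copied from the schedule above; the years (December 2011, January 2012) are an assumption inferred from the revision date of this page, and the function name `on_shift` is hypothetical. The sketch works at day granularity, so on a handover Monday it already returns the incoming person even though the duty is passed only in the afternoon.

```python
from bisect import bisect_right
from datetime import date, timedelta

# (Monday on which the shift week starts, who holds the shift).
# Assumption: December entries belong to 2011, January entries to 2012.
SCHEDULE = [
    (date(2011, 12, 5), "Zdeněk Salvet"),
    (date(2011, 12, 12), "INFN"),
    (date(2011, 12, 19), "Aleš Křenek"),
    (date(2011, 12, 26), "best effort"),
    (date(2012, 1, 2), "Aleš Křenek"),
    (date(2012, 1, 9), "Alessandro Paolini"),
    (date(2012, 1, 16), "Zdeněk Salvet"),
    (date(2012, 1, 23), "Sergio Traldi"),
    (date(2012, 1, 30), "Aleš Křenek"),
]


def on_shift(day: date) -> str:
    """Return the schedule entry whose week contains the given day."""
    starts = [start for start, _ in SCHEDULE]
    i = bisect_right(starts, day) - 1  # last week starting on or before `day`
    if i < 0 or day >= starts[-1] + timedelta(days=7):
        raise ValueError("date is outside the published schedule")
    return SCHEDULE[i][1]
```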
DMSU Digests
Brief description and indexing of issues solved within DMSU that are likely to have broader impact on EGI Operations.
Maintained on the separate page Middleware_issues_and_solutions.
Operations Documentation
DMSU contributes to maintenance of EGI Operations_Manuals, in particular
- MAN05 BDII high-availability
- WMS_best_practices
- VOMS_Replication
Systems available for DMSU
In order to debug issues and design workarounds, access to some systems is needed. The DMSU_machines page contains the list of systems available to the DMSU staff, per partner.
CESNET
prague_cesnet_lcg2
- # nodes/cores: 20/80
- OS: SL 5.2
- Batch system: PBSPro
- Grid m/w: gLite 3.1
- EA Site: Y
https://goc.gridops.org/site/list?id=48
{floi1,floi2}.egee.cesnet.cz
Virtual machines for experimental services, can be scratched and reinstalled as required.
- # nodes/cores: 2/2
- OS: SL 5.3 x86_64
- Batch system: N/A
- Scheduler: N/A
- Grid m/w and flavour: LB 2.0 (of gLite 3.2)
- EA Site: N
JUELICH
INFN
- # nodes/cores: a) 4/8 b) 3/6
- OS: a) SL4 x86_32 b) SL5 x86_64
- Batch system: torque/pbs
- Scheduler: maui
- Grid m/w and flavour: a) INFNGRID 3.1 (based on gLite 3.1) b) INFNGRID 3.2 (based on gLite 3.2)
- EA Site: Y (for STORM service)
https://goc.gridops.org/site/list?id=95
BADW-LRZ
Linux Cluster
- Nodes/Cores: 938/5532
- OS: SUSE Linux Enterprise Server 10
- Batch System/Scheduler: SGE 6.2
- Globus 4.0.7
- EA Site: N
NDGF
Smokerings
- Nodes/Cores: 66/528
- CentOS 5.7
- Torque (to be replaced by SLURM)
- MOAB (to be replaced by SLURM)
- ARC 1.1
- EA Site
- https://goc.gridops.org/node/list?id=7055071
Obsolete stuff
Not used anymore but keeping the old links here.
Relevant tickets
Relevant tickets passed through DMSU and assigned to other support units are gathered here
Meetings
- 100603 DMSU Kickoff Meeting, Amsterdam
- 100608 DMSU Weekly Assigner Meeting
- 100615 DMSU Weekly Assigner Meeting
- 100622 DMSU Weekly Assigner Meeting
- 100817 DMSU Weekly Assigner Meeting
- 100824 DMSU Weekly Assigner Meeting
- 100831 DMSU Weekly Assigner Meeting
- 100907 DMSU Weekly Assigner Meeting
- 100921 DMSU Weekly Assigner Meeting
- 100928 DMSU Weekly Assigner Meeting
- 101005 DMSU Weekly Assigner Meeting
- 101026 DMSU Weekly Assigner Meeting
- 110125 DMSU Weekly Assigner Meeting