
Difference between revisions of "TSA2.5 Deployed Middleware Support Unit"

== Purpose ==
 
The purpose of DMSU work is twofold: to find solutions to middleware-related
problems that ''do not'' require changes
in code, documentation, or anything else released by the technology provider (TP),
and otherwise to provide a thorough analysis that yields a well-specified
bug report.
 
DMSU also carries sufficient expertise to provide emergency fixes to middleware
problems in the unlikely case the TP fails to deliver, for whatever reason.
 
== People and expertise ==
 
The [[DMSU_People_Institutes]] page lists the people involved
and their expertise.
 
== Interaction with TPM and 3rd line support ==
 
All accountable interaction
with DMSU on middleware issues happens through assignment of GGUS tickets.
 
=== TPM ===
 
TPM assigns tickets to DMSU to start handling them.
 
DMSU may reassign a ticket back to TPM when it appears not to be a middleware
issue.
In this case, a clear description of the reason for assigning back must
be given in the comment.
 
=== 3rd line support units ===
 
DMSU assigns a ticket to a 3rd line support unit when it is confirmed that
solving the issue '''requires changes''' to some artifact released by the TP
(code, documentation, default configuration etc.).
 
Exceptionally, a ticket can be reassigned if DMSU lacks the required
expertise.
 
In both cases, the last DMSU comment on the ticket should contain
a '''summary of the problem''' and its analysis
(unless the history of the ticket is very brief).
 
A 3rd line support unit can '''reassign''' a ticket '''back''' to DMSU if it turns out
that the issue was easy to solve and requires no fix in any artefact,
so that it should have been solved by DMSU.
In this case, a description of the reason for assigning back (e.g. pointers
to appropriate documentation) must be given.
 
For ''top priority'' and ''very urgent'' tickets, 3rd line support is
expected to assign an ETA (details below). DMSU '''checks the ETA assignment'''
and may trigger its renegotiation.
 
If the 3rd line support finds a workaround which may lead to
'''lowering priority''', DMSU must be informed, and the change in priority
must be approved (if necessary after consulting EGI operations).
 
== Ticket priorities ==
 
In GGUS, tickets are classified into four priority levels.
Middleware ticket handling in DMSU and 3rd line support differs
according to priority, as described below.
 
=== Top Priority ===
 
Issues which affect the entire infrastructure, its significant portion,
or a very large number of users, with paralyzing impact.
Immediate reaction is required.
 
'''DMSU reaction''' -- immediate within working hours. For the sake of speed, DMSU work is
mostly restricted to assessing whether the ticket really deserves the Top Priority category. Once this is confirmed, the appropriate 3rd line support unit is involved (to get an early warning). In general, no thorough, time-consuming analysis is done, and the ticket is reassigned to 3rd line quickly.
 
'''TP reaction''' -- typically, the SLA guarantees a reaction within 4 hours.
The reaction should contain an ETA ('''E'''stimated '''T'''ime of '''A'''rrival of the fix).
The time is not formally bounded, but it should be within a few days; a fix of a top-priority issue
typically triggers an emergency release.
 
'''ETA monitoring''' -- top priority tickets are quite rare,
so the ETA is currently evaluated manually.
 
=== Very Urgent ===
 
Issues of broad impact, where no workaround is known or feasible.
 
'''DMSU reaction''' -- preferably on the same day. The internal ticket handling
guidelines apply; however, the ticket should not be delayed for more
than 2 working days before reassignment to 3rd line.
 
'''TP reaction''' -- typically, the SLA guarantees 2 working days.
An ETA is required here as well.
The problem is expected to be fixed within 45 days (as a safe upper limit), typically in the
next scheduled bugfix release.
 
'''ETA monitoring''' -- support in GGUS will be required and is to be negotiated.
This is not urgent, as the number of very urgent tickets is also quite low.
 
=== Urgent ===
 
Issues with impact on a significant user community that affect
only some patterns of their work, and for which a workaround is generally available.
 
'''DMSU reaction''' -- preferably within 2 working days to assess the priority and within 5 working days to produce the first results of the ticket analysis.
 
'''TP reaction''' -- typically, SLA guarantees 5 working days. The fix is
scheduled according to the actual release plan of the TP. The only
requirement is that ''urgent'' issues should precede ''less urgent'' ones.
 
'''ETA monitoring''' -- no ETA assigned.
 
=== Less Urgent ===
 
Less significant issues with either easy workaround or marginal impact.
 
'''DMSU reaction''' -- within 2 weeks to produce first results of the ticket
analysis.
 
'''TP reaction''' -- typically, SLA guarantees 15 working days.
''Less urgent'' issues are fixed on the best effort basis.
 
'''ETA monitoring''' -- no ETA assigned.
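
The four priority levels can be summarised as a lookup table. The following is a minimal sketch only, not part of GGUS or any DMSU tooling: the <code>PriorityPolicy</code> type, its field names and the <code>eta_required</code> helper are hypothetical, and the values are transcribed from the descriptions in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriorityPolicy:
    """Hypothetical record of the handling rules for one GGUS priority level."""
    dmsu_reaction: str   # how quickly DMSU is expected to react
    tp_sla: str          # typical reaction time guaranteed by the TP's SLA
    eta_required: bool   # whether the TP must assign an ETA
    fix_expected: str    # when the fix is expected


# Values transcribed from the priority descriptions; keys are lower-case names.
POLICIES = {
    "top priority": PriorityPolicy(
        "immediate within working hours", "4 hours", True,
        "within a few days, typically an emergency release"),
    "very urgent": PriorityPolicy(
        "preferably same day, at most 2 working days", "2 working days", True,
        "within 45 days, typically the next scheduled bugfix release"),
    "urgent": PriorityPolicy(
        "2 working days to assess, 5 working days for first results",
        "5 working days", False, "according to the TP release plan"),
    "less urgent": PriorityPolicy(
        "2 weeks for first results", "15 working days", False, "best effort"),
}


def eta_required(priority: str) -> bool:
    """Return True if tickets of this priority need an ETA from the TP."""
    return POLICIES[priority.lower()].eta_required
```

For example, <code>eta_required("Very Urgent")</code> is true, while <code>eta_required("Urgent")</code> is false.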
 
== Followup of tickets with 3rd line support units ==
 
Besides handling the incoming tickets DMSU also performs
elementary followup of the tickets assigned to the 3rd line.
This work is restricted (due to limited available effort)
to checking high priority categories, and to basic aggregate checks on others.
 
Tickets are handled differently according to their priority:
 
=== Top Priority and Very Urgent ===
 
When such a ticket is assigned to 3rd line, the TP is obliged,
within the reaction time given by the SLA, to assign an ETA to the ticket.
Assignment of ETA is essential for EGI Operation to plan accordingly
(e.g. whether to deploy emergency workarounds).
 
The assignment of ETA is checked by DMSU.
 
When the ETA time arrives, DMSU checks whether the fix was delivered.
If not, TP is requested to provide a new estimate and an appropriate
justification. If there are doubts, the ticket can be escalated to TCB.
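
The ETA checks described here can be sketched as a small decision function. This is an illustration under stated assumptions rather than an existing tool: the function name, its parameters and the returned action strings are all hypothetical. Escalation to TCB is left out on purpose, since it is a judgement call rather than an automatic step.

```python
from datetime import datetime
from typing import Optional


def eta_followup(sla_deadline: datetime,
                 eta: Optional[datetime],
                 fix_delivered: bool,
                 now: datetime) -> str:
    """Suggest the next DMSU follow-up step for a top priority or very urgent ticket.

    sla_deadline -- end of the TP reaction time given by the SLA (hypothetical input)
    eta          -- ETA assigned by the TP, or None if not assigned yet
    """
    if fix_delivered:
        return "no action"
    if eta is None:
        # The TP must assign an ETA within the SLA reaction time.
        if now > sla_deadline:
            return "ask TP to assign ETA"
        return "wait for ETA"
    if now >= eta:
        # ETA has arrived without a fix: a new estimate and justification are due.
        return "request new estimate and justification"
    return "wait for fix"
```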
 
=== Urgent and Less Urgent ===
 
'''The process described below is tentative, and further discussion is required.'''
 
It's agreed that solving all submitted tickets may reach beyond
the capabilities of TP.
Therefore the Fedora approach of closing low-priority tickets
on major release, regardless of the fix availability, is taken.
This is a tradeoff approach, avoiding the ever-increasing backlog
of tickets.
If the reported problems persist in the new release,
and users are still affected, they are expected to submit new tickets.
 
More specifically, the following is expected from TP (assuming the major
release at month X):
# When a fix is available in a revision or minor release, the ticket is closed as ''solved''.
# Before a major release, e.g. at month X-1, the TP is expected to run a pre-release campaign on all open tickets.
# Issues that can be solved with feasible effort are fixed in this campaign and the fixes are scheduled for the upcoming major release.
# All issues submitted earlier than month X-1 are closed as ''unsolved'' once the release is available.
 
This round should happen for major releases, and it is optional for
minor ones. We also require that it is done at least once per year
if major releases are less frequent.
As long as the process is followed, every major release starts with
a clean slate.
 
Finally, DMSU checks, at time point X+delta, i.e. well after the release,
that there are no open tickets submitted before X-1.
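
The closing step at release time and the later DMSU check can be sketched as follows. This is a minimal illustration, assuming a hypothetical in-memory <code>Ticket</code> record and a <code>cutoff</code> date standing for the start of month X-1; it does not use any real GGUS interface.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Ticket:
    """Hypothetical minimal ticket record."""
    ticket_id: int
    submitted: date
    status: str  # "open", "solved" or "unsolved"


def close_stale_tickets(tickets: List[Ticket], cutoff: date) -> List[int]:
    """Close as 'unsolved' every open ticket submitted before the cutoff.

    Returns the ids of the tickets that were closed.
    """
    closed = []
    for ticket in tickets:
        if ticket.status == "open" and ticket.submitted < cutoff:
            ticket.status = "unsolved"
            closed.append(ticket.ticket_id)
    return closed


def find_leftover_tickets(tickets: List[Ticket], cutoff: date) -> List[int]:
    """DMSU check at X+delta: ids of tickets still open although submitted before the cutoff."""
    return [t.ticket_id for t in tickets
            if t.status == "open" and t.submitted < cutoff]
```

An empty list from <code>find_leftover_tickets</code> means the process was followed for that release.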
 
== Internal ticket handling guidelines ==
 
* The first mandatory step of DMSU work on a ticket is understanding the cause of the reported problem.  The outcome of the analysis is '''documented with the ticket''', preferably as a response to the user. The analysis may or may not include thorough reproduction of the problem; it is left to common sense.
* During the analysis DMSU also assesses the priority of the ticket (see the Ticket priorities section) and adjusts the ''Type of problem'' and ''Ticket category'' fields if necessary.
* Typically, the analysis involves communication with the users. DMSU sets the ticket state to ''Waiting-for-reply'' whenever expecting feedback from the user. It is foreseen that GGUS will implement an automatic switch to ''In-progress'' when the user answers.
* DMSU expertise should cover most tickets.  When necessary, developers (i.e. the 3rd line support) can be involved for brief consultation.  As long as no considerable effort is required from the 3rd line support, control of the ticket is kept within DMSU, i.e. the ticket is '''not reassigned''' to another support unit. On the other hand, tough issues that exceed DMSU expertise should still be reassigned.
* If solution of the problem does not induce changes in code, documentation, default configuration etc., i.e. release of anything by the technology provider, '''DMSU closes the ticket'''.
* Otherwise, the ticket is '''reassigned''' to the appropriate 3rd line support unit. In this case, the most recent comment (i.e. on reassignment) should contain a '''brief summary''' of the DMSU analysis on the ticket, pointing to what is wrong exactly, how to reproduce the problem etc., so that 3rd line supporters don't have to gather all information from the ticket correspondence, which tends to be rather long.
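
The routing decisions in these guidelines can be condensed into one function. This is a sketch only: the three boolean inputs stand for the outcome of the DMSU analysis, and the function name and returned strings are hypothetical labels, not GGUS states.

```python
def next_step(is_middleware_issue: bool,
              needs_tp_release: bool,
              dmsu_has_expertise: bool) -> str:
    """Suggest where a ticket goes once the DMSU analysis is done.

    needs_tp_release   -- solving the issue requires a change in anything
                          released by the TP (code, documentation, defaults)
    dmsu_has_expertise -- DMSU is able to analyse and solve the issue itself
    """
    if not is_middleware_issue:
        # Not a middleware problem: hand back to TPM, stating the reason.
        return "reassign to TPM with explanation"
    if needs_tp_release or not dmsu_has_expertise:
        # The last DMSU comment should summarise the analysis for the 3rd line.
        return "reassign to 3rd line with summary"
    return "close in DMSU"
```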
 
A special case is tickets that were solved in DMSU but require a '''comment by the 3rd line''', i.e. to confirm feasibility of the solution. Those tickets should be closed in DMSU with a comment indicating that the 3rd line was contacted, and the 3rd line should be approached by other means. The standard GGUS workflow must not be used for this communication, mostly in order to keep the statistics clean.
 
If a ticket is wrongly assigned to 3rd line support, i.e. the problem is
quite simple and should have been solved by DMSU, then:
* 3rd line support '''reassigns''' the ticket '''back''' to DMSU. A comment pointing to appropriate documentation or justifying why this is a trivial issue must be given in this case.
* This mechanism will be used as a metric of DMSU failures and checked thoroughly; therefore it should not be abused.
 
When the user does not respond to a question, she is typically reminded weekly, at the DMSU meetings. If there is '''no reaction''' for more than '''one month''', the ticket is closed as ''unsolved''.
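
The handling of unanswered questions can be sketched as a simple time check. The function and constant names are hypothetical, and "one month" is taken as 30 days purely for illustration; the text does not define it more precisely.

```python
from datetime import date, timedelta

REMINDER_INTERVAL = timedelta(days=7)   # weekly reminder at the DMSU meeting
CLOSE_AFTER = timedelta(days=30)        # assumption: "one month" read as 30 days


def waiting_action(asked_on: date, today: date) -> str:
    """Suggest what to do with a ticket that waits for the user's reply.

    asked_on -- day the question was raised to the user
    """
    waited = today - asked_on
    if waited > CLOSE_AFTER:
        return "close as unsolved"
    if waited >= REMINDER_INTERVAL:
        return "remind user"
    return "wait"
```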
 
 
== DMSU shifts  ==
 
The main purpose of the DMSU shift is no surprise: to keep things running and not to leave an important issue without a fast reaction.
 
The shifts are held by groups of people with expertise on different middleware stacks. However, because gLite-related traffic prevails in DMSU, only gLite shifts are currently formally organized; the other stacks are handled on a best effort basis.
 
The specific duties of the person on shift are:
 
*to follow incoming emails from GGUS, being able to react within approx. 2 hours in normal working hours
*to identify "top priority" and "very urgent" issues, not only by the priority set by the submitter but also by using common sense, and to make sure an appropriate expert starts looking into the issue; this includes assigning the ticket to a specific person
*to keep checking that the response time is reasonable, in particular in reaction to further correspondence from the submitter; it should be almost immediate on "top priority", and we can probably afford up to 1 week for "less urgent"
 
One person holds the shift for one week; the duty is passed to the next person on Monday afternoon.
 
=== Shift schedule  ===
 
{| style="width: 173px; height: 119px;"
 
|-
| Dec 5
| Zdeněk Salvet
|-
| Dec 12
| INFN
|-
| Dec 19
| Aleš Křenek
|-
| Dec 26
| best effort
|-
| Jan 2
| Aleš Křenek
|-
| Jan 9
| Alessandro Paolini
|-
| Jan 16
| Zdeněk Salvet
|-
| Jan 23
| Sergio Traldi
|-
| Jan 30
| Aleš Křenek
|-
| Feb 06
| Sara Bertocco
|-
| Feb 13
| Zdeněk Salvet
|-
|}
 
== DMSU Digests ==
 
Brief description and indexing of issues solved within DMSU that are likely to have broader impact on EGI Operations.
 
Maintained on a separate page: [[Middleware_issues_and_solutions]].
 
== Operations Documentation ==
 
DMSU contributes to maintenance of EGI [[Operations_Manuals]], in particular
 
* [[MAN05]] BDII high-availability
* [[WMS_best_practices]]
* [[VOMS_Replication]]
 
 
== Systems available for DMSU ==
 
In order to debug issues and design workarounds, access to some systems is needed. The [[DMSU_machines]] page contains the list of systems available to the DMSU staff per partner.
 
== Obsolete stuff ==
 
[[DMSU_Old_Stuff]]
 
Not used anymore but keeping the old links here.

Latest revision as of 20:18, 1 December 2012
