
ROD Alarms and tickets

From EGIWiki
Revision as of 14:25, 16 June 2011 by Pslizik (talk | contribs) (Created by moving material from Operations/ROD/Draft)

Handling tickets

Creating tickets

A ticket is created once an alarm in an error state is more than 24 hours old, whether or not the site has already taken some action on the alarm. Tickets are created from the Operations Portal. The process is summarised in the following list:

  1. Click on the double arrow in the upper left corner, next to the NGI name, to expand the information on the site. Then open the "New NAGIOS alarms" drop-down box and click the "T+" icon to create the ticket.
  2. Check the site notepad to see if any action has been taken on the alarm.

Refer to the Dashboard Howto if you need a more detailed guide.

If more than one alarm should be handled by the same ticket, proceed as follows:

  1. Create a ticket for one of the alarms.
  2. Open the Assigned Alarms drop-down box. Click on the mask icon next to the alarm identifier.
  3. A window will open in which you can select the alarms to be masked.

If an alarm that is masked by another alarm remains in the "critical" state because of a different, unrelated problem, you can unmask it by clicking the mask icon again and then close the ticket for the alarms that have been solved.

Once the ticket form is open, complete and submit it:

  1. Fill in the relevant information in the ticket section. If the site notepad contained any information, ensure that the ticket reflects it. Also ensure that the TO: select boxes and the FROM: and SUBJECT: fields are all correct. Generally, a ticket should go to the site, the NGI and the ROD.
  2. Press the Submit button. A pop-up window will appear confirming that the ticket was submitted correctly. Your ticket has now been assigned a GGUS ID and also an internal (hidden) Dashboard ID. This means that a ticket created through the Dashboard must also be closed through the Dashboard. If you close a ticket opened through the Dashboard in GGUS, it will remain open in the Dashboard!

Creating tickets without an alarm

It is also possible to create a ticket for a site without an alarm. This can be needed when one of the tools (GStat, for instance) reports an issue that does not create an alarm in the Dashboard. In this case, click the "T+" icon in the upper right corner of the site box (the one with the "Create a ticket (without an alarm)" tooltip) and fill in the appropriate fields in the same way as when creating a ticket for an alarm.

Ticket content templates

The email is addressed to the corresponding NGI, together with the site and ROD. To view the list of NGI e-mail addresses, click the Regional List link in the Dashboard menu.

Generally, you should not remove any content from the template, but you are free to add any information you think the site might find helpful in any of the three editing fields (Header content, Main content, and Footer content).

Changing the state of and closing a ticket

  1. When the state of an alarm for a site with an open ticket changes to OK, the ticket associated with that alarm can be updated in the Dashboard. To do this, click Update for the ticket in the Tickets drop-down box, change Escalate to Problem solved, and describe how the problem was solved. Clicking Update will then close the ticket in both GGUS and the Dashboard.
  2. If the Nagios alarm is in an unstable state and the site has not responded to the problem within a reasonable amount of time (3 days), a 2nd email can be sent to the site by setting the Escalate field to 2nd step.
  3. If a new failure is detected for the site, do not modify the existing ticket (though its deadline can be extended). Submit a new ticket for the new problem instead.
  4. If the site's problem cannot be fixed within a reasonable amount of time (3 days from the 2nd step), escalate the ticket to Political procedure. This means that the NGI manager will contact both COD and the site to negotiate about suspending the site.
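The escalation steps above can be read as a small state machine. The following is a minimal sketch of that reading, not the Dashboard's actual logic: the `Step` enum, the `next_step` function, and the name "1st step" for a freshly opened ticket are hypothetical, and "no response" and "not fixed" are simplified into a single count of days spent in the current step.

```python
from enum import Enum


class Step(Enum):
    FIRST = "1st step"                 # hypothetical name for a newly opened ticket
    SECOND = "2nd step"                # value named in the text
    POLITICAL = "Political procedure"  # value named in the text
    SOLVED = "Problem solved"          # value named in the text


# Waiting period given in the text before each escalation.
ESCALATION_DELAY_DAYS = 3


def next_step(current: Step, alarm_ok: bool, days_in_step: int) -> Step:
    """Suggest the next value of the Escalate field (illustrative only)."""
    if alarm_ok:
        # The alarm has recovered, so the ticket can be closed.
        return Step.SOLVED
    if days_in_step < ESCALATION_DELAY_DAYS:
        # Still inside the waiting period: leave the ticket as it is.
        return current
    if current is Step.FIRST:
        return Step.SECOND
    if current is Step.SECOND:
        return Step.POLITICAL
    # Already at the last step (or closed): nothing further to escalate to.
    return current
```

For example, `next_step(Step.FIRST, alarm_ok=False, days_in_step=3)` suggests moving to the 2nd step, while a recovered alarm always leads to Problem solved. A new, unrelated failure is outside this sketch, since the text asks for a separate ticket in that case.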

Workflow and escalation procedure

The workflow and escalation procedures are documented in more detail at PROC01.

Ticket handling during weekends

Weekends are not considered working days, so ROD teams have no responsibilities during weekends. RODs should therefore ensure that tickets do not expire during a weekend. The alarm age does not increase during the weekend.
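To illustrate how an alarm age that pauses over the weekend interacts with the 24-hour ticket threshold, here is a minimal sketch. It assumes naive local timestamps and that the weekend is exactly Saturday and Sunday; `working_age` and `ticket_due` are hypothetical helper names, and the Dashboard may compute the age differently.

```python
from datetime import datetime, timedelta

# Threshold taken from the text: a ticket is due once the alarm is older than 24 hours.
TICKET_THRESHOLD = timedelta(hours=24)


def working_age(raised: datetime, now: datetime) -> timedelta:
    """Age of an alarm, not counting any time on Saturday or Sunday."""
    age = timedelta()
    cursor = raised
    while cursor < now:
        # Walk forward one calendar day at a time, stopping at `now`.
        next_midnight = datetime.combine(
            cursor.date() + timedelta(days=1), datetime.min.time()
        )
        segment_end = min(next_midnight, now)
        if cursor.weekday() < 5:  # Monday=0 ... Friday=4
            age += segment_end - cursor
        cursor = segment_end
    return age


def ticket_due(raised: datetime, now: datetime) -> bool:
    """True once the weekend-adjusted alarm age has passed 24 hours."""
    return working_age(raised, now) > TICKET_THRESHOLD
```

Under these assumptions, an alarm raised on Friday at 12:00 has an age of 22 hours on the following Monday at 10:00, so no ticket is due yet; it becomes due shortly after Monday noon.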

Currently there is no automatic mechanism for handling ticket expiration over public holiday periods. Because public holidays vary from site to site, RODs are encouraged to ask their sites to announce their public holidays (if the sites are located in another country) so that ticket expiration can be set accordingly. (Likewise, ROD operators have no duties while they are on public holidays.) The ROD can edit the ticket's expiration date by clicking the "T+" (Edit Ticket) icon. The value is set to 3 days by default.
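Since expiration has to be adjusted by hand, a ROD may want to work out a suitable date before editing the ticket. The sketch below is one way to do that, assuming the 3-day default is counted in working days (the text does not say whether it is) and that the site's announced holidays are available as a plain list of dates. `expiry_date` is a hypothetical helper, not a Dashboard function.

```python
from collections.abc import Iterable
from datetime import date, timedelta

DEFAULT_EXPIRY_DAYS = 3  # default value named in the text


def expiry_date(
    opened: date,
    holidays: Iterable[date] = (),
    working_days: int = DEFAULT_EXPIRY_DAYS,
) -> date:
    """Date reached after `working_days` days, skipping weekends and holidays."""
    holiday_set = set(holidays)
    current = opened
    remaining = working_days
    while remaining > 0:
        current += timedelta(days=1)
        # Only Monday-Friday dates that are not announced holidays count.
        if current.weekday() < 5 and current not in holiday_set:
            remaining -= 1
    return current
```

With these assumptions, a ticket opened on Thursday 16 June 2011 would expire on Tuesday 21 June, or on Wednesday 22 June if the site had announced Monday 20 June as a public holiday.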

Handling sites and nodes

Alarms raised for sites when monitoring is off

If an alarm is raised for a service that has its monitoring status set to OFF in GOCDB (also visible in the Dashboard in the Nodes box, or in the alarm row as Node status), ROD should not open a ticket. The alarm can be cleared, even if it is red, by pressing the lightning icon and giving an explanation.
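The rule above, combined with the 24-hour threshold for creating tickets, amounts to a short decision procedure. The following sketch expresses it under simplifying assumptions: the `Alarm` fields, the status strings, and the returned action labels are all hypothetical and do not correspond to Dashboard or GOCDB data structures.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    status: str          # hypothetical values: "OK" or "CRITICAL"
    age_hours: float     # alarm age as shown to the ROD
    monitoring_on: bool  # monitoring status of the service in GOCDB


def rod_action(alarm: Alarm) -> str:
    """Suggest what the ROD should do with an alarm (illustrative only)."""
    if not alarm.monitoring_on:
        # Monitoring is OFF: never open a ticket, clear the alarm with an explanation.
        return "clear"
    if alarm.status != "OK" and alarm.age_hours > 24:
        # Alarm in an error state for more than 24 hours: a ticket is due.
        return "ticket"
    return "none"
```

Checking the monitoring status first reflects the text: an alarm on an unmonitored service is cleared regardless of its colour or age.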

Sites with multiple tickets open

When opening a ticket against a site that already has open tickets, ROD should consider that the problems may be linked or dependent on pending solutions. If the new problem is different but possibly linked, the expiry dates of all the tickets should be synchronized to the latest date.

Also consider masking new problems with an old ticket.
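Synchronizing the expiry dates of linked tickets, as described above, simply means moving every ticket to the latest date among them. A minimal sketch follows, assuming the expiry dates have been collected by hand into a dictionary keyed by ticket ID; `synchronise_expiry` and the ticket IDs in the example are hypothetical.

```python
from datetime import date


def synchronise_expiry(expiry_dates: dict[str, date]) -> dict[str, date]:
    """Return a copy in which every linked ticket expires on the latest date."""
    if not expiry_dates:
        return {}
    latest = max(expiry_dates.values())
    return {ticket_id: latest for ticket_id in expiry_dates}


# Hypothetical ticket IDs, for illustration only.
linked = {"ticket-A": date(2011, 6, 20), "ticket-B": date(2011, 6, 23)}
synced = synchronise_expiry(linked)
```

The new dates would still have to be entered for each ticket through the Edit Ticket icon.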

If more than one ticket has been opened for the same problem, ROD must decide which ticket has been active and then close those with no responses. When closing such a ticket, ROD staff must comment in it that it is being closed because it is linked to another ticket, and state the GGUS number of that ticket for reference.