The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other supports; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team using operations @ egi.eu.

ROD Alarms and tickets
Latest revision as of 13:58, 23 October 2014

Alarms

Alarms are automatically generated notifications created by Nagios and are handled from within the Dashboard. 

Handling alarms

When an alarm is generated, the site administrators have 24 hours to start acting on the issue. If ROD spots an alarm, ROD can notify the site's administrators about the problem.

  • If the problem is fixed within 24 hours and the solution is tested by Nagios (the alarm's color turns green in the Dashboard), then ROD has to make sure that the results are not flapping and can then close the alarm without any other action.
  • If the problem cannot be fixed within 24 hours, and the site administrators put the service into an unscheduled downtime, ROD should just wait until the problem is fixed or until the downtime is over. No other action is necessary.
  • If the problem cannot be fixed within 24 hours and the administrators don't put the service into a downtime, then a ticket must be issued. The procedure is described below in the Tickets section.
  • If the service is in downtime and the problem is fixed (as verified by Nagios), then ROD can close the alarm.
  • If the downtime is over and the problem is still present (e.g. if the administrators forgot to extend the downtime), then a ticket must be issued.
  • If an alarm is raised for a service that has its monitoring status set to OFF in GOCDB (also visible in the Dashboard in the Nodes box or in the alarm row as Node status) then ROD should not open a ticket.
  • Such an alarm can be cleared even if it is marked red, by pressing the lightning icon and giving an explanation.
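
The rules above can be read as a small decision procedure. The following Python sketch is only an illustration of that reading, not part of the Dashboard or of Nagios: the `Alarm` fields, the `Action` names and the `next_action` function are all hypothetical, and the sketch assumes that a flapping result counts as "not yet fixed".

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CLOSE_ALARM = "close the alarm"
    WAIT = "wait and keep watching"
    OPEN_TICKET = "open a ticket"
    CLEAR_WITH_EXPLANATION = "clear the alarm and give an explanation"


@dataclass
class Alarm:
    # All field names are assumptions made for this sketch.
    age_hours: float              # age of the alarm in an error state
    nagios_ok: bool               # Nagios has verified the fix (alarm is green)
    flapping: bool = False        # results keep switching between OK and error
    in_downtime: bool = False     # the service is currently in a downtime
    downtime_over: bool = False   # a downtime ended but the problem persists
    monitoring_off: bool = False  # monitoring status is OFF in GOCDB


def next_action(alarm: Alarm) -> Action:
    """Return the action suggested by the rules listed above."""
    if alarm.monitoring_off:
        # No ticket for unmonitored services; the alarm may be cleared.
        return Action.CLEAR_WITH_EXPLANATION
    if alarm.nagios_ok and not alarm.flapping:
        # Fixed and stable, whether or not the service is in downtime.
        return Action.CLOSE_ALARM
    if alarm.in_downtime:
        # Wait until the problem is fixed or the downtime is over.
        return Action.WAIT
    if alarm.downtime_over or alarm.age_hours > 24:
        return Action.OPEN_TICKET
    # Younger than 24 hours: the site still has time to act.
    return Action.WAIT
```

For example, `next_action(Alarm(age_hours=30, nagios_ok=False))` yields `Action.OPEN_TICKET`, while the same alarm with `in_downtime=True` yields `Action.WAIT`.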

For handling tickets during public holidays, see below. There is also a video tutorial on handling alarms available.

Tickets

In contrast with alarms, which are mere notifications, tickets are created manually. They are used to report problems to the responsible support units. Additionally, they make it possible to track the actions taken to resolve the issue.

Creating tickets

Ticket creation occurs when the age of an alarm in an error state has exceeded 24 hours, whether or not the site has already taken some action on the alarm. A ticket has to be created from the Operations Portal. To create a ticket, click on the double arrow in the upper left corner next to the NGI name. That opens the drop-down box with information on the site. Then, open the New NAGIOS alarms drop-down box. Click the "T+" icon to create the ticket.

Refer to the Dashboard Howto if you need a more detailed guide.

If more than one alarm should be handled by the same ticket, proceed as follows:

  1. Create a ticket for one of the alarms.
  2. Open the Assigned Alarms drop-down box. Click on the mask icon next to the alarm identifier.
  3. A window will open in which you can select the alarms to be masked.

If an alarm that is masked by another alarm remains in "critical" condition because of another (unrelated) problem, you can unmask it by clicking on the mask icon again and then close the ticket for the solved alarms.

  1. Fill in the relevant information in the ticket section. If there was information in the site notepad, ensure that the ticket reflects that information. Also ensure that the TO: select boxes, and the FROM: and SUBJECT: fields are all correct. Generally, a ticket should go to the site, the NGI and ROD.
  2. Press the Submit button and a pop-up window will appear confirming that the ticket was correctly submitted. Your ticket has now been assigned a GGUS ID, but also an internal (hidden) Dashboard ID, which means that if you create a ticket through Dashboard, you have to close it through Dashboard as well. If you close a ticket opened through Dashboard in GGUS, it will remain open in Dashboard!

Creating tickets without an alarm

It is also possible to create a ticket for a site without an alarm. This can happen if there is an issue with one of the tools (GStat, for instance) that does not create an alarm in Dashboard. In this case, click on the "T+" icon in the upper right corner of the site box (the one with the "Create a ticket (without an alarm)" tooltip), and fill in the appropriate fields as when creating a ticket for an alarm.

Ticket content templates

The email is addressed to the corresponding NGI, together with the site and ROD. To view the list of NGI e-mail addresses, click the Regional List link in the Dashboard menu.

Generally, you should not remove any content from the template, but you are free to add any information you think the site might find helpful in any of the three editing fields (Header content, Main content, and Footer content).

Changing the state of and closing a ticket

  1. When the state of an alarm for a site with an open ticket changes to OK, then the ticket associated with that alarm can be updated in the Dashboard. Do this by clicking Update for the ticket in the Tickets drop down. Now change Escalate to Problem solved and fill in any information about how the problem has been solved. Clicking Update will then close the ticket in both GGUS and the Dashboard.
  2. If the Nagios alarm is in an unstable state and the site has not responded to the problem within 3 days, then a 2nd email can be sent to the site by updating the Escalate field to 2nd step.
  3. If a new failure is detected for the site, the existing ticket should not be modified (though the deadline can be extended) but a new ticket should be submitted for this new problem.
  4. If the site's problem cannot be fixed within 3 days from the 2nd step of the escalation procedure, then escalate the ticket to Political procedure. This means that the NGI manager will contact both EGI Operations and the site to negotiate about suspending the site.
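
As a rough illustration of the escalation sequence in the list above, the sketch below picks the next value of the Escalate field. It is a hypothetical helper, not Dashboard code: the function name, its parameters and the label "1st step" for a freshly created ticket are assumptions; only "2nd step", "Political procedure", "Problem solved" and the 3-day intervals come from the text.

```python
# Days a ticket may stay in one step before it is escalated (from the text).
DAYS_PER_STEP = 3


def next_escalate_value(current: str, days_in_step: int, alarm_ok: bool) -> str:
    """Suggest the next value of the Escalate field for one ticket.

    current      -- current Escalate value; "1st step" is an assumed label
                    for a ticket that has not been escalated yet
    days_in_step -- days spent in the current step without a fix or response
    alarm_ok     -- True once the associated alarm has changed to OK
    """
    if alarm_ok:
        return "Problem solved"
    if days_in_step < DAYS_PER_STEP:
        return current  # the site still has time within this step
    if current == "1st step":
        return "2nd step"
    if current == "2nd step":
        return "Political procedure"
    return current  # already at the last step
```

A new failure at the same site is not covered here, because it goes into a separate ticket rather than changing the state of the existing one.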

Sites with multiple tickets open

When opening a ticket against a site with existing tickets, ROD should consider that these problems may be linked or dependent on pending solutions. If the problem is different but may be linked, the expiry dates for each ticket should be synchronized to the latest date.

Also consider masking new problems with an old ticket.

Handling alarms and tickets during weekends and public holidays

Weekends are not considered working days, so ROD teams do not have any responsibilities during weekends. RODs should ensure that tickets do not expire during weekends. The alarm age does not increase during the weekend.

Currently there is no automatic mechanism for handling ticket expiration over public holiday periods, because they differ among countries. If some of the sites the ROD team is in charge of are located in another country, the ROD is encouraged to get them to announce their public holidays, so that ticket expiration can be set accordingly. (Correspondingly, ROD operators also have no duties when they are on public holidays.) The ROD can edit the ticket's expiration day by clicking the "T+" (Edit Ticket) icon. The value is set to 3 days by default.

Please note that ROD is not requested to announce their national holidays to the EGI Operations team. However, on the last day before a public holiday, ROD is requested to check

  • if there are any tickets that will expire during the holiday, and change their expiration date;
  • if there are any alarms that will pass the 72-hour period during the holidays, and handle them properly in advance.
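
Because there is no automatic mechanism for holidays, the expiration date has to be worked out by hand. The sketch below shows one possible way to do that calculation. It is an illustration under stated assumptions, not a feature of the Dashboard: the function names are hypothetical, the default of 3 days is counted in calendar days, and the list of public holidays has to be supplied by the caller.

```python
from datetime import date, timedelta
from typing import Iterable


def next_working_day(day: date, holidays: Iterable[date] = ()) -> date:
    """Return `day` itself, or the first later day that is neither a
    weekend day nor one of the given public holidays."""
    holiday_set = set(holidays)
    # weekday() returns 5 for Saturday and 6 for Sunday.
    while day.weekday() >= 5 or day in holiday_set:
        day += timedelta(days=1)
    return day


def ticket_expiry(opened: date, holidays: Iterable[date] = (), days: int = 3) -> date:
    """Expiration date that avoids weekends and the given holidays.

    Assumes the default expiration of 3 days means calendar days.
    """
    return next_working_day(opened + timedelta(days=days), holidays)
```

For a ticket opened on Thursday 23 October 2014, the default expiration would fall on Sunday 26 October, so `ticket_expiry(date(2014, 10, 23))` returns Monday 27 October instead.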

Workflow and escalation procedure

The workflow and escalation procedures are documented in more detail at PROC01 Grid Oversight escalation.