Difference between revisions of "FAQ Regional Operator on Duty"

From EGIWiki
(Handling the eu.egi.lowAvailability alarm)
= Handling the '''eu.egi.lowAvailability''' alarm =
ROD teams handle availability alarms through the Dashboard in the Operations Portal. These alarms are intended as a warning to the NGI about poor performance of a site within the last 30 days.
Go to procedure [[PROC04_Quality_verification_of_monthly_availability_and_reliability_statistcs#Process_of_handling_RC_Availability_and_Reliability| PROC04 Quality verification of monthly availability and reliability statistcs ]]
'''Understanding the alarm:'''
When an alarm is raised, it means that the Availability metric has dropped below the threshold of 70% over the last 30-day period.
'''Handling alarms:'''
ROD should treat the alarm as a warning that availability over the last 30 days has dropped below 70%.
The alarm is handled like any other alarm: usually a ticket must be submitted to the site. The ticket can be closed as soon as the alarm returns to OK status; however, it is recommended to wait until availability is a few percent above the threshold (e.g. 80%) before closing it. If the problem continues for over 30 days the ticket should be closed. If the alarm is raised again, ROD has to open a new ticket. This should motivate the site to work on the problem.
It is up to ROD whether to ask the site for an explanation.
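The threshold logic described above can be sketched in a few lines of Python. This is only an illustration: the function name `availability_action` and the returned strings are hypothetical and are not part of the Operations Portal, the 80% margin is just the example value from the text, and the rule about closing tickets after 30 days is not modelled.

```python
ALARM_THRESHOLD = 70.0    # percent; threshold stated in the text above
SAFE_CLOSE_LEVEL = 80.0   # percent; example "safe" level from the text (assumption)


def availability_action(availability_pct, ticket_open):
    """Suggest a ROD action for a 30-day availability figure (hypothetical helper)."""
    if availability_pct < ALARM_THRESHOLD:
        # The alarm is raised: usually a ticket must be submitted to the site.
        return "keep ticket open" if ticket_open else "open ticket"
    if ticket_open:
        # The alarm is OK, so closing is allowed; waiting for a margin is recommended.
        if availability_pct >= SAFE_CLOSE_LEVEL:
            return "close ticket"
        return "may close ticket (recommended: wait for margin)"
    return "no action"
```

For example, a site at 72% with an open ticket is technically above the threshold, but the sketch flags that waiting for a safer margin is recommended before closing.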

Revision as of 11:05, 22 November 2012


How to handle issues during weekends and public holidays?

Weekends and public holidays are not considered working days, so ROD teams do not have any responsibilities on these days. RODs should nevertheless ensure that tickets do not expire and that alarms do not age beyond 72h during these days.
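One way to check the 72h limit before a weekend or holiday is sketched below. The helper `ages_out_before` and its parameters are hypothetical; the sketch assumes ROD knows when the alarm was raised and when the team will next be on duty.

```python
from datetime import datetime, timedelta

MAX_ALARM_AGE = timedelta(hours=72)  # age limit stated in the text above


def ages_out_before(raised_at, next_working_time):
    """Return True if the alarm would be older than 72h at next_working_time.

    Hypothetical helper: both arguments are datetime objects in the same time zone.
    """
    return next_working_time - raised_at > MAX_ALARM_AGE
```

An alarm for which this returns True should be dealt with before the non-working days begin.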

What to do with alarms when a node is not in production and is part of a production site?

Testing nodes on production sites are often set as non-production. In such cases the Nagios monitoring system still sends information about all nodes, so ROD will see alarms for the non-production node on their dashboard. If it is necessary to monitor such a testing node, it is recommended to put the non-production node in downtime.

What to do when a site has multiple alarms/tickets?

When opening a ticket against a site that already has tickets, ROD should consider that the problems may be linked or dependent on pending solutions. In such cases ROD should use the masking mechanism to gather the alarms and assign them to one ticket rather than open a ticket for each alarm.

If the problems are different but may be linked, the expiry dates of the tickets should be synchronized to the latest date.
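Synchronizing expiry dates to the latest one amounts to taking the maximum. A minimal sketch, assuming the tickets are held in a plain dictionary keyed by a made-up ticket identifier (this is not how the Operations Portal stores them):

```python
from datetime import date


def synchronize_expiry(expiry_dates):
    """Set every ticket's expiry date to the latest one in the group.

    expiry_dates maps a (hypothetical) ticket id to its expiry date.
    """
    latest = max(expiry_dates.values())
    return {ticket_id: latest for ticket_id in expiry_dates}
```

Given two linked tickets expiring on 20 and 27 November, both end up with the 27 November expiry date.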

How to handle issues for site/node in downtime?

Handling tickets for site/node in downtime

When a ticket has been raised against a site that subsequently enters downtime, the expiry date on the ticket can be extended.

Sites in downtime still have monitoring switched on and may therefore appear to be failing tests, but no alarms will be raised against them on the Operations Portal. When opening tickets, ROD must take care not to open them against sites in downtime.

Handling alarms for site/node in downtime

It often happens that a failure generates a lot of alarms and the site manager then decides to put the site in downtime. Getting these alarms back to OK may take more than 72h, at which point the issue is escalated to COD.
ROD should not create a ticket for sites/nodes in downtime and is not obliged to deal with such alarms, but it is recommended to close them to avoid escalation to COD. In such cases ROD should give a link to the downtime in GOC DB as the reason for closing the NON-OK alarm.

Site in downtime for more than a month

If a site is in DOWNTIME for more than a month then it is advised that the site should go to the uncertified state.

What to do in case of accounting issue?

In case of problems with accounting, it is not recommended to suggest downtime at the second step of the escalation process for this test. The accounting service is not critical for users, but the issue still needs to be followed up.

Watch out for flapping states

You may want to wait for a second test to be run before closing an alarm which is in an OK state. This ensures that the OK result for that test is stable. The waiting period depends, of course, on how long the test takes and how frequently it is run.
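The "wait for a second test" advice can be expressed as a check that the most recent results are all OK. The function `stable_ok` and the list-of-strings representation are assumptions made for illustration; they do not reflect any Nagios or Operations Portal API.

```python
def stable_ok(results, required=2):
    """Return True if the last `required` test results are all "OK".

    results is a chronological list of status strings (hypothetical format),
    oldest first, e.g. ["CRITICAL", "OK", "OK"].
    """
    if len(results) < required:
        return False
    return all(status == "OK" for status in results[-required:])
```

With the default of two results, a single OK after a failure is not yet treated as stable, which guards against closing an alarm on a flapping test.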

Handling the eu.egi.lowAvailability alarm

Go to procedure PROC04 Quality verification of monthly availability and reliability statistcs