Revision as of 18:06, 6 December 2012

This page collects information about the UNKNOWN status issue.

Present situation

Availability and Reliability calculation formulas:

Availability = Uptime / (Total time - Time_status_was_UNKNOWN)
Reliability = Uptime / (Total time - Scheduled Downtime - Time_status_was_UNKNOWN)

How to read these formulas in the context of UNKNOWN status:

The period in which a site is in UNKNOWN status is excluded from the calculation.
During this period EGI does not know what is happening with the infrastructure.
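
The two formulas can be sketched in a few lines of Python to show how excluding UNKNOWN time changes the result. This is only an illustration: the function names, the hour units and the example figures are assumptions, not part of any SAM or ACE code.

```python
def availability(uptime, total_time, unknown_time):
    """Availability = Uptime / (Total time - Time_status_was_UNKNOWN)."""
    known_time = total_time - unknown_time
    if known_time <= 0:
        return None  # the whole period was UNKNOWN, so the value is undefined
    return uptime / known_time


def reliability(uptime, total_time, scheduled_downtime, unknown_time):
    """Reliability = Uptime / (Total time - Scheduled Downtime - Time_status_was_UNKNOWN)."""
    known_time = total_time - scheduled_downtime - unknown_time
    if known_time <= 0:
        return None
    return uptime / known_time


# Hypothetical month of 720 h: 600 h up, 24 h scheduled downtime, 72 h UNKNOWN.
a = availability(600, 720, 72)
r = reliability(600, 720, 24, 72)
print(f"availability={a:.3f} reliability={r:.3f}")
```

With these made-up numbers the 72 hours of UNKNOWN status are removed from the denominator, so the site is rated on 648 known hours rather than on the full 720.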

Problems & Questions

There is no policy telling test developers when a test should return UNKNOWN status. What does UNKNOWN status mean?
For some NGIs the share of UNKNOWN status is ~0% at all their sites, while for others it reaches ~40%; sometimes there are disproportions even within one NGI. What causes such high values and such disproportions?

What can cause UNKNOWN status?

A site will have UNKNOWN status in the following cases:
1. the relevant availability test(s) reported UNKNOWN state
2. the results of the relevant availability test(s) are missing from the central MRS database.

The second case can be caused by:
2.1. SAM instance failure (network connection failure, internal test failures)
2.2. missing WN test results.

WN test results can be missing if:
2.2.1. the SAM WN-probe framework was unable to publish results to the messaging system (e.g. firewall issues at sites)
2.2.2. the SAM job failed to run at the site (e.g. CE errors, WMS errors) (side note: this case should be CRITICAL; we need to double check how ACE summarizes the case when it has both CRITICAL and UNKNOWN).
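
The two top-level cases can be told apart with a small check. The sketch below is not ACE logic: the function name, the test names and the convention that a missing result is represented by None are all assumptions made for illustration.

```python
def unknown_causes(results):
    """Return the cases (1 and/or 2 from the list above) that apply to a site.

    `results` maps each relevant availability test to its reported status,
    or to None when no result was found (hypothetical representation).
    """
    causes = []
    if any(status == "UNKNOWN" for status in results.values()):
        causes.append("1: test reported UNKNOWN")
    if any(status is None for status in results.values()):
        causes.append("2: test results missing")
    return causes


# Hypothetical test names and results for one site.
site_results = {"ce-job-submit": "OK", "wn-probe": None, "srm-put": "UNKNOWN"}
print(unknown_causes(site_results))
```

Separating the two cases matters because case 1 points at the test itself, while case 2 points at the SAM instance or at result publishing.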

When can a test return UNKNOWN status?

UNKNOWN status is documented in the Nagios plugins developer guidelines (http://nagiosplug.sourceforge.net/developer-guidelines.html): "Invalid command line arguments were supplied to the plugin or low-level failures internal to the plugin (such as unable to fork, or open a tcp socket) that prevent it from performing the specified operation. Higher-level errors (such as name resolution errors, socket timeouts, etc) are outside of the control of plugins and should generally NOT be reported as UNKNOWN states."

There is no plugin review process, so we cannot be sure that plugin developers actually follow the guidelines.
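
A plugin that follows the quoted guideline might look like the sketch below. The exit codes 0 to 3 are the standard Nagios plugin return codes; the check_tcp function, its arguments and the 5 second timeout are hypothetical and are not taken from any existing SAM probe.

```python
import socket

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp(argv):
    """Return (exit_code, message) for a TCP connect check of HOST PORT."""
    # Invalid command line arguments are the plugin's own problem: UNKNOWN.
    if len(argv) != 2 or not argv[1].isdigit() or not 0 < int(argv[1]) < 65536:
        return UNKNOWN, "UNKNOWN - usage: check_tcp HOST PORT"
    host, port = argv[0], int(argv[1])
    try:
        with socket.create_connection((host, port), timeout=5):
            return OK, f"OK - connected to {host}:{port}"
    except OSError as exc:
        # Name resolution errors, refused connections and timeouts are
        # higher-level errors, so they are reported as CRITICAL.
        # Simplification: a local failure to create the socket would also
        # land here, although the guideline would class it as UNKNOWN.
        return CRITICAL, f"CRITICAL - {host}:{port}: {exc}"


code, message = check_tcp(["example-host"])  # port argument is missing
print(code, message)
```

A real plugin would print the message and exit with the returned code, so that Nagios records the corresponding state.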

What can cause UNKNOWN status disproportions between sites within one NGI?

Disproportions indicate a site problem and are caused by case 1 or case 2.2.

Where should a Nagios admin look for help?

A Nagios admin should contact the SAM team through tool-admins@mailman.egi.eu or GGUS.

Solution proposals

Strict policy for developers on how to use UNKNOWN status

Advantage: we will be sure that all problems are properly reported as ERROR, not UNKNOWN
Disadvantage: someone has to write the policy and check whether it is respected

Alarms for UNKNOWN status should be created when UNKNOWN status lasts longer than 4h

Advantage: we will be notified if the UNKNOWN status lasts too long
Disadvantage: it means extra work for ROD, which will have to look after not only ERRORs but also UNKNOWNs
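
The 4h rule could be evaluated over a site's status history roughly as follows. This is a sketch only: the list of (timestamp, status) samples and the function name are assumptions, and the real alarm would be raised by the operations tools rather than by a script like this.

```python
from datetime import datetime, timedelta


def long_unknown_periods(samples, limit=timedelta(hours=4)):
    """Return (start, end) intervals in which status stayed UNKNOWN longer than `limit`.

    `samples` is a time-ordered list of (timestamp, status) pairs.
    """
    periods = []
    start = None
    for timestamp, status in samples:
        if status == "UNKNOWN":
            if start is None:
                start = timestamp  # an UNKNOWN period begins
        else:
            if start is not None and timestamp - start > limit:
                periods.append((start, timestamp))
            start = None
    # The site may still be UNKNOWN at the last sample.
    if start is not None and samples[-1][0] - start > limit:
        periods.append((start, samples[-1][0]))
    return periods


# Hypothetical status history for one site.
t0 = datetime(2011, 10, 12, 0, 0)
history = [
    (t0, "OK"),
    (t0 + timedelta(hours=1), "UNKNOWN"),
    (t0 + timedelta(hours=3), "UNKNOWN"),
    (t0 + timedelta(hours=6), "OK"),       # 5 h of UNKNOWN: alarm
    (t0 + timedelta(hours=7), "UNKNOWN"),
    (t0 + timedelta(hours=9), "OK"),       # 2 h of UNKNOWN: no alarm
]
alarms = long_unknown_periods(history)
print(alarms)
```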

Threshold for UNKNOWN status

Advantage: it is easy and fast to implement and automate
Disadvantage: there is a possibility of overlooking an important problem
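
The proposal does not say what the threshold is applied to or what its value should be. One possible reading, sketched below, flags sites whose share of time in UNKNOWN status exceeds a limit; the 10% value, the site names and the function name are all assumptions for illustration.

```python
def sites_over_threshold(unknown_fraction, threshold=0.10):
    """Return the sites whose fraction of time in UNKNOWN exceeds `threshold`.

    `unknown_fraction` maps a site name to a value between 0.0 and 1.0.
    The default of 0.10 is a placeholder, not an agreed EGI value.
    """
    return sorted(site for site, fraction in unknown_fraction.items()
                  if fraction > threshold)


# Hypothetical per-site UNKNOWN shares within one NGI.
ngi_sites = {"SITE-A": 0.00, "SITE-B": 0.40, "SITE-C": 0.08}
flagged = sites_over_threshold(ngi_sites)
print(flagged)
```

A check of this kind is simple to automate, but as noted above a site just under the limit can still hide a real problem.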

Revision history

Version	Authors	Date	Comments
1.0	Malgorzata Krakowian	2011-10-12	First draft

Difference between revisions of "Unknown issue"

Revision as of 11:19, 1 November 2012 by Krakow; revision as of 18:06, 6 December 2012 by Magda (minor edit: moved Grid operations oversight/Unknown issue to Unknown issue: Wiki space review (RT 4614))