Unknown issue

This page collects information about the UNKNOWN status issue.

Present situation

Availability and Reliability calculation formulas:

Availability = Uptime / (Total time - Time_status_was_UNKNOWN)
Reliability = Uptime / (Total time - Scheduled Downtime - Time_status_was_UNKNOWN)

How to read the formulas in the context of UNKNOWN status:

  1. Any period in which a site is in UNKNOWN status is excluded from the calculation.
  2. During such a period EGI does not know what is happening with the infrastructure.
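
As a rough illustration of the formulas above, the following Python sketch computes both figures for a 720-hour month. The function names, the hour-based units and the sample numbers are assumptions made for this example; they are not taken from ACE or any other EGI tool.

```python
def availability(uptime_h, total_h, unknown_h):
    """Availability = Uptime / (Total time - Time_status_was_UNKNOWN)."""
    measured_h = total_h - unknown_h
    if measured_h <= 0:
        return None  # the whole period was UNKNOWN, so the ratio is undefined
    return uptime_h / measured_h


def reliability(uptime_h, total_h, scheduled_downtime_h, unknown_h):
    """Reliability = Uptime / (Total time - Scheduled Downtime - Time_status_was_UNKNOWN)."""
    measured_h = total_h - scheduled_downtime_h - unknown_h
    if measured_h <= 0:
        return None  # nothing left to measure against
    return uptime_h / measured_h


# Hypothetical site A: 72 h UNKNOWN, 24 h scheduled downtime, 600 h up.
print(round(availability(600, 720, 72), 3))      # 0.926
print(round(reliability(600, 720, 24, 72), 3))   # 0.962

# Hypothetical site B: UNKNOWN for 40% of the month (288 h), up for the rest.
print(availability(432, 720, 288))               # 1.0
```

Site B shows why a large UNKNOWN share matters: a site that was unobserved for 40% of the month can still report 100% availability, because the unobserved hours are removed from the denominator.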

Problems & Questions

  1. There is no policy telling test developers when a test should return UNKNOWN status. What does UNKNOWN status mean?
  2. Some NGIs show ~0% UNKNOWN for all their sites while others reach as much as ~40%, and sometimes the disproportions appear even within one NGI. What causes such high values and disproportions?

What can cause UNKNOWN status?

A site will have UNKNOWN status in the following cases:
1. The relevant availability test(s) reported UNKNOWN state.
2. Results of the relevant availability test(s) are missing from the central MRS database.

The second case can be caused by:
2.1. SAM instance failure (network connection failure, internal test failures)
2.2. Missing WN test results.

WN test results can be missing if:
2.2.1. the SAM WN-probe framework was unable to publish results to the messaging system (e.g. firewall issues on sites)
2.2.2. the SAM job failed to run on the site (e.g. CE errors, WMS errors) (side note: this case should be CRITICAL; how ACE summarizes a case that has both CRITICAL and UNKNOWN results still needs to be double-checked).

When can a test return UNKNOWN status?

The UNKNOWN status is documented in the Nagios plugins developer guidelines (http://nagiosplug.sourceforge.net/developer-guidelines.html): "Invalid command line arguments were supplied to the plugin or low-level failures internal to the plugin (such as unable to fork, or open a tcp socket) that prevent it from performing the specified operation. Higher-level errors (such as name resolution errors, socket timeouts, etc) are outside of the control of plugins and should generally NOT be reported as UNKNOWN states."

There is no plugin review process, so we cannot be sure that plugin developers actually follow the guidelines.
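
To make the guideline more concrete, here is a minimal Python sketch of a probe that follows it. The exit codes 0 to 3 are the standard Nagios plugin return codes; the `check_tcp` function, its arguments and the 10-second timeout are hypothetical, and mapping connection failures to CRITICAL is an assumption of this example (the guideline only says such errors should generally not be UNKNOWN).

```python
import socket

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp(argv, connect=socket.create_connection):
    """Return (exit_code, message) for a hypothetical TCP reachability probe.

    `connect` is injectable so the decision logic can be exercised
    without touching the network.
    """
    # Invalid command line arguments: the probe cannot even attempt the check.
    if len(argv) != 2:
        return UNKNOWN, "UNKNOWN - usage: check_tcp HOST PORT"
    host, port_text = argv
    try:
        port = int(port_text)
    except ValueError:
        return UNKNOWN, "UNKNOWN - port is not a number: %r" % port_text
    if not 0 < port < 65536:
        return UNKNOWN, "UNKNOWN - port out of range: %d" % port

    try:
        conn = connect((host, port), timeout=10)
    except OSError as exc:
        # Name resolution errors, refused connections and timeouts are
        # higher-level errors, so they are not reported as UNKNOWN here.
        return CRITICAL, "CRITICAL - cannot connect to %s:%d (%s)" % (host, port, exc)
    conn.close()
    return OK, "OK - %s:%d accepts connections" % (host, port)
```

A wrapper script would print the message and exit with the returned code, for example `code, message = check_tcp(sys.argv[1:])` followed by `print(message)` and `sys.exit(code)`.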

What can cause UNKNOWN status disproportions between sites within one NGI?

Disproportions indicate a site problem and are caused by case 1 or case 2.2.

Where should a Nagios admin look for help?

A Nagios admin should contact the SAM team through tool-admins@mailman.egi.eu or GGUS.

Solution proposals

Strict policy for developers on how to use UNKNOWN status

Advantage: we will be sure that all problems are properly classified as ERROR rather than UNKNOWN
Disadvantage: someone has to write the policy and check whether it is respected

Alarms should be raised when UNKNOWN status lasts longer than 4h

Advantage: we will be notified if UNKNOWN status lasts too long
Disadvantage: it means extra work for ROD, who will have to look after not only ERRORs but also UNKNOWNs
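
The 4h rule could be checked along the following lines. This is only a sketch: the `unknown_alarms` function, the sample format (time-ordered status samples, each status holding until the next sample) and the example timestamps are assumptions, and a real implementation in the operations tools would likely look different.

```python
from datetime import datetime, timedelta

ALARM_AFTER = timedelta(hours=4)  # the 4h limit from this proposal


def unknown_alarms(samples, now, alarm_after=ALARM_AFTER):
    """Return the start times of UNKNOWN periods lasting longer than alarm_after.

    samples: time-ordered list of (datetime, status) pairs; each status is
    assumed to hold until the next sample, and the last one until `now`.
    """
    alarms = []
    start = None  # start of the UNKNOWN period currently being tracked
    for when, status in samples:
        if status == "UNKNOWN":
            if start is None:
                start = when
        else:
            if start is not None and when - start > alarm_after:
                alarms.append(start)
            start = None
    # An UNKNOWN period that is still open counts up to `now`.
    if start is not None and now - start > alarm_after:
        alarms.append(start)
    return alarms


day = datetime(2011, 10, 12)
samples = [
    (day.replace(hour=0), "OK"),
    (day.replace(hour=1), "UNKNOWN"),  # lasts 2h: no alarm
    (day.replace(hour=3), "OK"),
    (day.replace(hour=6), "UNKNOWN"),  # still open at 11:00: alarm
    (day.replace(hour=8), "UNKNOWN"),
]
print(unknown_alarms(samples, now=day.replace(hour=11)))
```

Consecutive UNKNOWN samples are merged into one period, so ROD would get a single alarm per outage rather than one per test run.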

Threshold for UNKNOWN status

Advantage: it is easy and fast to implement and automate
Disadvantage: there is a possibility of overlooking an important problem
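
A threshold check could be as simple as the sketch below. The proposal does not fix a threshold value, so the 10% default, the function name and the site names are purely hypothetical.

```python
def sites_over_unknown_threshold(unknown_hours, total_hours, threshold=0.10):
    """Return {site: unknown_fraction} for sites whose UNKNOWN share exceeds threshold.

    unknown_hours: mapping of site name to hours spent in UNKNOWN status.
    threshold: hypothetical limit; no value has been agreed in this proposal.
    """
    flagged = {}
    for site, hours in unknown_hours.items():
        fraction = hours / total_hours
        if fraction > threshold:
            flagged[site] = fraction
    return flagged


# Hypothetical month of 720 hours for three sites of one NGI.
month = {"SITE-A": 0, "SITE-B": 36, "SITE-C": 288}
print(sites_over_unknown_threshold(month, total_hours=720))
```

With these numbers only SITE-C (40% UNKNOWN) is flagged; SITE-B (5%) stays below the limit, which is exactly where an important but short-lived problem could be overlooked.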

Revision history

Version Authors Date Comments
1.0 Malgorzata Krakowian 2011-10-12 First draft