Difference between revisions of "Unknown issue"
m (moved Grid operations oversight/Unknown issue to Unknown issue: Wiki space review (RT 4614)) |
Revision as of 17:06, 6 December 2012
This page contains information about the UNKNOWN status issue.
Present situation
Availability and Reliability calculation formulas:
Availability = Uptime / (Total time - Time_status_was_UNKNOWN)
Reliability = Uptime / (Total time - Scheduled Downtime - Time_status_was_UNKNOWN)
How to read these formulas in the context of UNKNOWN status:
- A period in which a site is in UNKNOWN status is excluded from the calculation.
- During this period EGI does not know what is happening with the infrastructure.
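The formulas above can be sketched in Python as follows. This is only an illustration, not the actual ACE implementation: the function names, the use of hours as the time unit, and returning None when no known time remains are all assumptions.

```python
def availability(uptime, total, unknown):
    """Availability = Uptime / (Total time - Time_status_was_UNKNOWN)."""
    known = total - unknown
    if known <= 0:
        # Assumption: with no known time left there is nothing to compute.
        return None
    return uptime / known


def reliability(uptime, total, scheduled_downtime, unknown):
    """Reliability = Uptime / (Total time - Scheduled Downtime - Time_status_was_UNKNOWN)."""
    known = total - scheduled_downtime - unknown
    if known <= 0:
        # Assumption: same convention as in availability().
        return None
    return uptime / known
```

For example, a site that was up for 360 of 720 hours and UNKNOWN for the remaining 360 hours gets `availability(360, 720, 360) == 1.0`. Because the UNKNOWN period is removed from the denominator, it hides whatever happened during that time.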
Problems & Questions
- There is no policy telling test developers when a test should return UNKNOWN status. What does UNKNOWN status mean?
- Some NGIs reach ~0% UNKNOWN for all their sites, while others reach as much as ~40%; sometimes there are disproportions even within one NGI. What is the reason for such high values and disproportions?
What can cause UNKNOWN status?
A site will have UNKNOWN status in the following cases:
1. relevant availability test(s) reported UNKNOWN state
2. results of relevant availability test(s) are missing in the central MRS database.
The second case can be caused by:
2.1. SAM instance failure (network connection failure, internal test failures)
2.2. WN test results missing.
WN test results can be missing if:
2.2.1. the SAM WN-probe framework was unable to publish results to the messaging system (e.g. firewall issues on sites)
2.2.2. the SAM job failed to run on the site (e.g. CE errors, WMS error). Side note: this case should be CRITICAL; we need to double-check how ACE summarizes a case that has both CRITICAL and UNKNOWN results.
When can a test return UNKNOWN status?
UNKNOWN status is documented in the Nagios plugins developer guidelines (http://nagiosplug.sourceforge.net/developer-guidelines.html): "Invalid command line arguments were supplied to the plugin or low-level failures internal to the plugin (such as unable to fork, or open a tcp socket) that prevent it from performing the specified operation. Higher-level errors (such as name resolution errors, socket timeouts, etc) are outside of the control of plugins and should generally NOT be reported as UNKNOWN states."
There is no plugin review process, so we cannot be sure that plugin developers actually follow the guidelines.
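One possible reading of the guideline quoted above is sketched below for a hypothetical TCP reachability probe. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin return codes. The probe itself, its name `check_tcp`, the 5 second timeout and the mapping of each exception to a state are assumptions made for illustration, not part of any SAM probe.

```python
import socket

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp(argv, connect=socket.create_connection):
    """Return (exit_code, message) for a hypothetical TCP reachability probe.

    `connect` is injectable so the function can be exercised without a network.
    """
    # Invalid command line arguments: the plugin cannot perform the check.
    if len(argv) != 2 or not argv[1].isdigit():
        return UNKNOWN, "UNKNOWN - usage: check_tcp HOST PORT"
    host, port = argv[0], int(argv[1])
    try:
        conn = connect((host, port), timeout=5)
    except (socket.gaierror, socket.timeout, ConnectionError) as exc:
        # Higher-level errors (name resolution, timeouts, refused connections)
        # should not be reported as UNKNOWN according to the guideline.
        return CRITICAL, f"CRITICAL - cannot reach {host}:{port}: {exc}"
    except OSError as exc:
        # Assumption: any remaining OS error is treated as a low-level
        # failure internal to the plugin.
        return UNKNOWN, f"UNKNOWN - internal plugin failure: {exc}"
    conn.close()
    return OK, f"OK - {host}:{port} is reachable"
```

A real plugin would print the message and call `sys.exit()` with the returned code, since Nagios reads the state from the process exit status.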
What can cause UNKNOWN status disproportions between sites within one NGI?
Disproportions indicate a site problem and are caused by case 1 or case 2.2.
Where should a NAGIOS admin look for help?
A NAGIOS admin should contact the SAM team through tool-admins@mailman.egi.eu or GGUS.
Solution proposals
Strict policy for developers on how to use UNKNOWN status
Advantage: we will be sure that all problems are properly reported as ERROR, not UNKNOWN
Disadvantages: someone has to write the policy and check whether it is respected
Alarms for UNKNOWN status should be created when UNKNOWN status lasts longer than 4h
Advantage: we will be notified if the UNKNOWN status lasts too long
Disadvantages: it means extra work for ROD, which will have to look after not only ERRORs but also UNKNOWNs
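The 4h alarm rule could be checked along the following lines. This is a sketch under stated assumptions: the input format (a time-sorted list of `(timestamp, status)` samples, each status holding until the next sample) and the function name `unknown_alarms` are hypothetical and do not describe how the operations dashboard works.

```python
from datetime import datetime, timedelta

ALARM_AFTER = timedelta(hours=4)  # limit taken from the proposal


def unknown_alarms(samples, alarm_after=ALARM_AFTER):
    """Return (start, end) pairs of UNKNOWN periods longer than alarm_after.

    samples: list of (timestamp, status) tuples sorted by timestamp.
    """
    alarms = []
    start = None  # start of the current UNKNOWN run, if any
    for timestamp, status in samples:
        if status == "UNKNOWN":
            if start is None:
                start = timestamp
        else:
            if start is not None and timestamp - start > alarm_after:
                alarms.append((start, timestamp))
            start = None
    # An UNKNOWN run still open at the last sample is measured up to it.
    if start is not None and samples[-1][0] - start > alarm_after:
        alarms.append((start, samples[-1][0]))
    return alarms
```

A run of exactly 4h does not raise an alarm here, matching the wording "longer than 4h".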
Threshold for UNKNOWN status
Advantage: it is easy and fast to implement and automate
Disadvantages: there is a possibility of overlooking an important problem
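The proposal does not say what the threshold applies to. Assuming it means the fraction of time a site spent in UNKNOWN status, a check could look like the sketch below. The 10% default, the function name and the site names are hypothetical.

```python
def sites_over_threshold(unknown_fraction_by_site, threshold=0.10):
    """Return the sorted names of sites whose UNKNOWN fraction exceeds threshold.

    unknown_fraction_by_site: dict mapping site name to the fraction (0.0-1.0)
    of the reporting period spent in UNKNOWN status.
    """
    return sorted(
        site
        for site, fraction in unknown_fraction_by_site.items()
        if fraction > threshold
    )


# Hypothetical site names and values.
report = {"SITE-A": 0.00, "SITE-B": 0.40, "SITE-C": 0.08}
flagged = sites_over_threshold(report)
```

With these values only SITE-B is flagged. SITE-C stays below the threshold and would not be looked at, which illustrates the disadvantage noted above.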
Revision history
Version | Authors | Date | Comments |
---|---|---|---|
1.0 | Malgorzata Krakowian | 2011-10-12 | First draft |