“...a wealth of information creates a poverty of attention...”
― Herbert A. Simon
Why did it take you so long to restart a server?!?!
You know this problem: several monitoring systems constantly watch your IT infrastructure. Your team receives notifications even for non-critical errors, and action is taken before the business notices. It took time and dedication to implement proactive processes; you are on the right path and you can be proud of it.
One ugly day - usually a Sunday, between 2:00 and 3:00 - a major incident occurs and thousands of events are generated. The stand-by support team is overwhelmed by the flood of events, and no one has an overview. After some time the culprit is found: a buggy process used all the memory of a critical server, leading to a crash. Rebooting the server solved the problem (or something like that).
Given the situation, you feel your team did a good job.
The Business Application Owner seems to disagree: “If a simple server restart solved the issue, why did it take three hours to restart the server? Don’t you monitor your systems? Your team has good technical skills and I know you all work very hard, but you need to be proactive!”
Ironically, the better the monitoring process is, the worse the problem gets. To be proactive, you need to check many metrics. However, in a major incident – especially a network-related one – all of these metrics fire off their events at once.
You know the problem, but what is the solution?
Smart sales representatives will tell you the monitoring tool they sell can deal with that. You do not believe it.
Smart consultants will come up with some ideas:
- Tune thresholds and set up dependencies to prevent false positives or irrelevant notifications. You have done this already; it did improve the situation, but it did not solve the problem.
- Consolidate your tools and keep just one monitoring tool to have a central dashboard. Consolidation is a good idea; however, you will never reduce your tool set to just one tool: a “generic” tool such as Nagios will never replace a specific tool such as AppDynamics, and vice versa.
An Umbrella Monitoring System might help
A customer had exactly this problem. Over the years, they had set up a very extensive monitoring process to track production-relevant applications in about 40 different factories. In addition to the standard monitoring of server metrics, Oracle database clusters and network devices, the customer extensively monitors the data flow between the different applications and within applications. The monitoring detects not only hanging processes, but also logical errors and possibly inconsistent data.
Such processes help to detect even small errors quickly. On the other hand, a few months ago the crash of a database server led to 1,127 unique events within only a few hours.
The idea of the umbrella monitoring system is simple:
1. Send all alarms into a central system
2. Have a program that “links” related alarms
3. Use some GUI to display events in a way humans can handle.
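To make step 2 more concrete, here is a minimal sketch of what “linking” could look like. It assumes that every alarm names the component it refers to and that you maintain a map of which component depends on which; the `Alarm` class, the `DEPENDS_ON` map and all host names are hypothetical and only serve as illustration. An alarm is attached to the topmost component in its dependency chain that is itself alarming. Real correlation rules will need more than this (time windows, for example).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    source: str   # monitoring tool that raised the alarm (hypothetical names)
    host: str     # component the alarm refers to
    message: str


# Hypothetical dependency map: component -> component it depends on.
DEPENDS_ON = {
    "app-orders": "db-01",
    "app-billing": "db-01",
    "report-job": "app-orders",
}


def root_cause(host, alarming_hosts, depends_on):
    """Return the topmost alarming component in the dependency chain of `host`."""
    root = host
    seen = {host}
    current = host
    while current in depends_on:
        current = depends_on[current]
        if current in seen:  # guard against cycles in the map
            break
        seen.add(current)
        if current in alarming_hosts:
            root = current
    return root


def link_alarms(alarms, depends_on):
    """Group alarms under the component assumed to be their common root cause."""
    alarming_hosts = {alarm.host for alarm in alarms}
    groups = defaultdict(list)
    for alarm in alarms:
        root = root_cause(alarm.host, alarming_hosts, depends_on)
        groups[root].append(alarm)
    return dict(groups)


ALARMS = [
    Alarm("nagios", "db-01", "host unreachable"),
    Alarm("app-monitor", "app-orders", "database connection refused"),
    Alarm("app-monitor", "app-billing", "query timeout"),
    Alarm("flow-check", "report-job", "no data received"),
    Alarm("nagios", "web-07", "disk usage above threshold"),
]

groups = link_alarms(ALARMS, DEPENDS_ON)
for root, linked in groups.items():
    print(f"{root}: {len(linked)} linked alarm(s)")
```

In this toy example, the four alarms caused by the database end up under `db-01`, while the unrelated disk alarm on `web-07` stays on its own.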
In the picture below, you see the events the way they appear in any event management tool (we only show the top 10 events; you can imagine what the remaining 1,100-plus events look like).
Now look at the picture below: you see the same information after related events have been aggregated into one event. Much cleaner, isn’t it?
The column “Critical Events not yet notified” is particularly useful. It shows you that the Event might lead to further critical Incidents later on. These Incidents have not happened yet, but you know they will sooner or later, so you can warn your Business Owner in advance.
Implementing such an umbrella system takes some time, but it is not rocket science. The main idea is quite simple; implementing it takes some patience.
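How such a column is filled depends on your tool. One possible approach, shown here only as a sketch under assumptions, is to walk a dependency map downwards from the failed component and list everything that depends on it but has not raised an alarm yet. The function `pending_impacts`, the map and the component names are hypothetical.

```python
def pending_impacts(root, depends_on, alarming):
    """List components that depend on `root` (directly or not) but have not alarmed yet."""
    # Invert the map: component -> components that depend on it.
    dependents = {}
    for child, parent in depends_on.items():
        dependents.setdefault(parent, []).append(child)

    pending = []
    stack = [root]
    seen = {root}
    while stack:
        for child in dependents.get(stack.pop(), []):
            if child in seen:  # already visited; also protects against cycles
                continue
            seen.add(child)
            stack.append(child)
            if child not in alarming:
                pending.append(child)
    return sorted(pending)


# Hypothetical dependency map: component -> component it depends on.
depends_on = {
    "app-orders": "db-01",
    "app-billing": "db-01",
    "report-job": "app-orders",
    "plant-dashboard": "app-billing",
}

# Components that have already raised an alarm.
alarming = {"db-01", "app-orders"}

not_yet_notified = pending_impacts("db-01", depends_on, alarming)
print(not_yet_notified)
```

Everything returned here is a candidate for an early warning: nothing has failed visibly yet, but each of these components sits downstream of the one that crashed.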
In this blog series, we will go through the necessary steps.