When you set up a monitoring system for SQL Server, you often use thresholds to determine whether an instance is healthy. You might say that you want to be alerted when CPU use is over 90% or when there’s only 10% of disk space left. The trouble with these thresholds is that they often throw false positives, sending you an alert when nothing is really wrong. Simple thresholds often have to be tuned to the individual instance, since a server with 10 TB of disk still has 1 TB of space left at 90% disk use.
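To make the 10 TB example concrete, here is a minimal Python sketch comparing a percentage-based disk check with one based on absolute free space. The function names and the 50 GB floor are assumptions chosen for illustration, not recommended values.

```python
TB = 1024 ** 4
GB = 1024 ** 3

def percent_alert(total_bytes, used_bytes, max_used_pct=90.0):
    """Naive check: alert when the used percentage reaches the limit."""
    return 100.0 * used_bytes / total_bytes >= max_used_pct

def free_space_alert(total_bytes, used_bytes, min_free_bytes=50 * GB):
    """Alternative check: alert only when absolute free space is low.

    The 50 GB floor is an arbitrary, illustrative value.
    """
    return total_bytes - used_bytes < min_free_bytes

# A 10 TB server at 90% use still has 1 TB free.
big_total, big_used = 10 * TB, 9 * TB
# A 100 GB server at 90% use has only 10 GB free.
small_total, small_used = 100 * GB, 90 * GB

print(percent_alert(big_total, big_used))         # both servers trip the 90% rule
print(percent_alert(small_total, small_used))
print(free_space_alert(big_total, big_used))      # only the small server is short on space
print(free_space_alert(small_total, small_used))
```

The percentage rule treats both servers the same, while the absolute rule flags only the small one. Neither is right everywhere, which is why the threshold usually has to be tuned per instance.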
Baron Schwartz blogged about this issue in an article, and he has been building software that monitors MySQL beyond simple thresholds, after stating that they do not work in most cases. He makes a good point, and I think a lot of monitoring is ineffective because DBAs just stop paying attention to the alerts. The question then becomes how to perform automated health checks that don’t cry wolf.
Let’s just not throw the baby out with the bathwater. Health checks are essential for any complex system, and a modern database-driven application is no exception. You simply cannot have a human glued to a screen 24/7 to catch every issue. You do have to set guidelines that a computer can understand, and that means thresholds.
Thresholds are arbitrary – and that’s okay
Sometimes monitoring is just about maintaining an agreed-upon level of uptime and performance, often termed an SLA. If my team needs an application that is unavailable for no more than 5 minutes per month, I need a threshold to tell me whether I met that goal or not. Latency is often another key metric for an SLA. Google has done studies that show that if a page takes too long to load (like over 400 ms), they lose customers. You need to set a threshold for how long it takes to load pages in your application. You also probably have a data backup SLA, so you may want to get an alert when the offsite backups haven’t run in three days.
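As a rough sketch of the arithmetic behind these SLA checks, the Python below turns the 5-minutes-per-month budget into an availability percentage and tests a backup against the three-day limit. The function names and the 30-day month are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def downtime_budget_met(downtime_minutes, budget_minutes=5.0):
    """True if the month's downtime stayed within the SLA budget."""
    return downtime_minutes <= budget_minutes

def availability_pct(downtime_minutes, days_in_month=30):
    """Availability as a percentage of all minutes in the month."""
    total_minutes = days_in_month * 24 * 60
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes

def backup_is_stale(last_backup, now, max_age=timedelta(days=3)):
    """True if the last offsite backup is older than the allowed age."""
    return now - last_backup > max_age

# 5 minutes of downtime in a 30-day month is roughly 99.988% availability.
print(round(availability_pct(5), 3))

now = datetime(2024, 1, 10, 12, 0)  # fixed timestamps keep the example repeatable
print(backup_is_stale(datetime(2024, 1, 6, 12, 0), now))  # four days old
print(backup_is_stale(datetime(2024, 1, 9, 12, 0), now))  # one day old
```

The numbers themselves come from the agreement, not from the server, which is what makes these thresholds reasonable even though they are arbitrary.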
Draw a line in the sand
There are also internal standards you get to set. You could set basic thresholds for how often deadlocks can occur before you start investigating, or for how long one process can block another, say 30 seconds. These metrics are pretty much one-to-one with user-facing issues, but the point at which you start paying attention to them is arbitrary. In a big enough system, there are going to be odd errors that seem important but just aren’t. Deadlocks, for instance, are just a normal part of having a concurrent system with locking (citation). What you need to do as the DBA is set an appropriate standard for your application, so you don’t go crazy tracking each error while still keeping the system running smoothly. Doctors use a similar idea when evaluating your health. When does high blood pressure become hypertension? It’s a pretty arbitrary standard, but it still provides a useful category for humans to evaluate the risk of inaction vs. the cost of action.
Proof of the pudding is in the tasting
Thresholds often monitor metrics that are simply proxies or stand-ins for true health. A server can be running at 10% CPU while a process is still waiting on CPU resources. Think of a multi-core, multi-socket machine: one core can be running at 100% while the others sit idly by, and all the while an impatient user is waiting for his report on the other end. A server might have only 20 MB of disk space left, but that’s perfectly fine if the data isn’t going to grow. The most important things to monitor are the things your users actually care about. Imagine keeping track of your 401(k). You set a health check that makes sure your investments are worth 20% more year-over-year. This would work in USD, but not in Zimbabwe dollars. It would be great if you could check whether you could afford a townhouse in Maui and a round of golf every week when you retire, instead of an arbitrary dollar amount that’s subject to inflation.
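The multi-core case can be shown with a few lines of Python. This is only a sketch: the per-core readings are made-up sample data, and the 95% saturation limit is an assumption.

```python
def average_cpu(per_core_pct):
    """Overall CPU use as the plain average across cores."""
    return sum(per_core_pct) / len(per_core_pct)

def saturated_cores(per_core_pct, limit=95.0):
    """Indexes of cores running at or above the limit."""
    return [i for i, pct in enumerate(per_core_pct) if pct >= limit]

# Made-up sample: ten cores, one pegged at 100% and the rest idle.
cores = [100.0] + [0.0] * 9

print(average_cpu(cores))      # the average looks healthy
print(saturated_cores(cores))  # but core 0 is saturated
```

A threshold on the average would never fire here, even though the work stuck on the busy core cannot go any faster.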
Monitoring the end goal is the gold standard of health checks. You stop measuring an arbitrary resource like CPU or disk space and instead ask whether your application can perform its tasks in a reasonable amount of time. You stop asking how many bytes you have left, and start asking if a teenage girl can express her romantic angst to all of her friends (if you’re Facebook, for instance). Resource monitoring, or any proxy-based monitoring, is still useful for troubleshooting, but only after a results-based alert has raised a red flag. This is the true issue with most health checks: they monitor stand-ins instead of the real thing. Make your health check mimic a real-world use case, like placing an order, and you’ll have a much better idea of whether something’s wrong.
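A results-based check of this kind might look something like the following Python sketch. It times a synthetic task and reports whether the task succeeded within a latency limit. The names run_health_check, place_test_order and failing_order are hypothetical; a real check would run an actual test transaction against your application instead of sleeping.

```python
import time

def run_health_check(task, max_seconds):
    """Run a synthetic task and report whether it succeeded in time."""
    start = time.perf_counter()
    try:
        task()
        ok = True
    except Exception:
        # Any failure of the task counts as unhealthy.
        ok = False
    elapsed = time.perf_counter() - start
    return {"ok": ok, "elapsed": elapsed, "healthy": ok and elapsed <= max_seconds}

def place_test_order():
    """Hypothetical stand-in for a real end-to-end test transaction."""
    time.sleep(0.05)

def failing_order():
    """Hypothetical stand-in for a transaction that errors out."""
    raise RuntimeError("order service unavailable")

# 0.4 seconds is an assumed limit; use whatever your SLA specifies.
print(run_health_check(place_test_order, max_seconds=0.4)["healthy"])
print(run_health_check(failing_order, max_seconds=0.4)["healthy"])
```

The check knows nothing about CPU or disk. If it fails, that is the moment to go look at the resource metrics.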
Since keeping systems up is a huge part of a DBA’s job, checking status on all of our instances is pretty important. Threshold monitoring is pretty much the standard practice, but it comes with way too many false positives if deployed naively. Just to review, there are three good uses for a threshold:
- To test metrics for an SLA
- To help you pay attention to what’s important and ignore the rest
- To verify that a system can actually perform a task end-to-end