When you think of application monitoring, the first things that come to mind are probably server load and availability, disk space usage, memory consumption, performance, and so on. To maximize application uptime, these are necessary metrics to monitor, but they’re by no means complete. There are many other measurements to consider, including the monitoring of application-specific metrics, cloud-related issues, potential security breaches, and most importantly, the user experience.
In this article, we’ll explore the basics of monitoring an application, its environment and components, and other basic telemetry. We’ll also dive into how to balance the monitoring of individual components with a more holistic system and user view. Finally, we’ll see how the latest tools and analytics available can make your life easier, and help ensure that your users are ultimately much more satisfied.
When monitoring an application to ensure acceptable uptime and performance for your users, you need to start with the components. This includes the physical servers themselves and, to start, their overall availability. Server monitoring, like monitoring computers in general, involves enough telemetry that it needs to be a core focus. (Other infrastructure will be covered in the next section.)
Beyond an indication of whether a server is simply up or down, other metrics to track include a server’s CPU utilization, including peaks and averages over various time periods. Things to look for include the obvious over-utilization, but don’t forget that under-utilization of CPU power can indicate issues and be just as concerning. For example, it can indicate anything from network routing issues (such as requests not arriving), to application features not being used. You’ll also want to view individual server statistics, as well as those for groups of servers, to understand if CPU usage is a systemic problem (for example, overall application server stress) or indicative of a subset of servers that are out-of-date (such as older hardware), or a server that’s about to fail.
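As a rough illustration of the per-server and per-group view described above, the following sketch summarizes CPU samples and flags both over- and under-utilization. The thresholds, server names, and the "more than half the fleet" rule for calling a problem systemic are all assumptions made for the example, not recommended values.

```python
from statistics import mean

# Hypothetical thresholds; tune them to your own environment.
HIGH_CPU = 85.0  # percent: an average above this suggests overload
LOW_CPU = 5.0    # percent: an average below this may mean requests are not arriving


def summarize_cpu(samples):
    """Return the average and peak of a list of CPU utilization percentages."""
    return {"avg": mean(samples), "peak": max(samples)}


def classify(summary):
    """Label a server as over-utilized, under-utilized, or normal by its average."""
    if summary["avg"] >= HIGH_CPU:
        return "over"
    if summary["avg"] <= LOW_CPU:
        return "under"
    return "normal"


def fleet_report(cpu_by_server):
    """Classify every server and report whether overload looks systemic."""
    labels = {name: classify(summarize_cpu(s)) for name, s in cpu_by_server.items()}
    overloaded = [name for name, label in labels.items() if label == "over"]
    # Assumption: overload on more than half the fleet is treated as systemic.
    systemic = len(overloaded) > len(labels) / 2
    return labels, systemic


# Made-up sample data: CPU percentages collected over one time window.
fleet = {
    "app-01": [40.0, 55.0, 60.0],
    "app-02": [90.0, 95.0, 88.0],
    "app-03": [2.0, 3.0, 1.0],
}
labels, systemic = fleet_report(fleet)
print(labels, systemic)
```

In this sample, one server is overloaded and one is nearly idle, so the report points at individual machines rather than at overall application server stress.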
Other telemetry to monitor includes server memory utilization and I/O load over time. These are especially important in environments where server virtualization is used heavily. In these cases, the statistics reported from virtual servers may not indicate CPU or memory usage issues, even though the underlying physical servers may be oversubscribed in terms of virtualization, CPU, or I/O communication with disks and peripherals, or starved of physical memory.
Finally, server-specific measurements need to include user requests over time, as well as concurrent user activity reported in standard deviation graphs. Not only will this yield server performance information, it shows the utilization of your systems overall. We’ll examine the value of this data (such as in usage analytics) later in this article.
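One simple way to read concurrent user activity in terms of standard deviation is to flag samples that sit unusually far from the mean. The sketch below assumes a plain list of concurrent-user counts and a cutoff of two standard deviations; both the data and the cutoff are illustrative assumptions.

```python
from statistics import mean, pstdev


def unusual_samples(counts, num_devs=2.0):
    """Return (index, value) pairs more than num_devs standard deviations from the mean."""
    avg = mean(counts)
    dev = pstdev(counts)
    if dev == 0:
        return []  # every sample is identical, so nothing stands out
    return [(i, c) for i, c in enumerate(counts) if abs(c - avg) > num_devs * dev]


# Made-up sample data: concurrent users measured at regular intervals.
concurrent_users = [100, 102, 98, 101, 99, 100, 250, 100]
spikes = unusual_samples(concurrent_users)
print(spikes)
```

A single large spike also inflates the mean and the standard deviation, so on real data a rolling window or a median-based measure may work better.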
Now that we’ve covered servers and their physical component telemetry, let’s look a little deeper at some of the foundational components of your application’s physical build-out. This includes network infrastructure, storage infrastructure, and overall bandwidth capacity and consumption.
As any seasoned IT professional can tell you, it’s important to quantify network monitoring beyond the common statement, “The network is slow!” Network utilization monitoring includes the measurement of network traffic in terms of bits-per-second across LANs and sub-LANs within your application infrastructure. Understanding the theoretical and practical limits of these segments is crucial to knowing when packets will be lost, and when network storms may ensue. For instance, as you approach the bandwidth limit of a 100Mbps LAN segment, UDP messages will be lost, and TCP/IP messages that are lost will be retransmitted, potentially amplifying the problem. Monitoring the network should reveal segment bandwidth usage over time, across different areas of the network (between the application servers and database servers, for example).
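To make "approaching the bandwidth limit" concrete, segment utilization can be estimated from two readings of an interface byte counter. This is a minimal sketch: the readings are made-up numbers, the 80 percent warning level is an assumption, and counter wraparound is not handled.

```python
LINK_BPS = 100_000_000  # capacity of a 100Mbps LAN segment, in bits per second
WARN_PERCENT = 80.0     # hypothetical warning level, not a standard value


def utilization_percent(bytes_start, bytes_end, interval_seconds, link_bps):
    """Percent of link capacity used between two byte-counter readings."""
    bits_sent = (bytes_end - bytes_start) * 8
    return 100.0 * bits_sent / interval_seconds / link_bps


# Made-up readings: 600 MB transferred during a 60-second interval.
usage = utilization_percent(0, 600_000_000, 60, LINK_BPS)
needs_attention = usage >= WARN_PERCENT
print(f"{usage:.1f}% of capacity, needs attention: {needs_attention}")
```

Collecting this figure per segment over time gives the usage history the paragraph above calls for, for example between the application servers and the database servers.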
Further, protocol-specific network monitoring will provide more granular insight into application utilization, and perhaps performance issues for certain areas of functionality (such as HTTP/S traffic versus back-end database traffic). Additionally, monitoring requests to specific network ports can pinpoint potential security holes (like port 23 for Telnet), as well as routing and switching delays within applicable network components.
Beyond raw network utilization, other infrastructure to be monitored includes network-attached storage solutions. Although these numbers are included in the network telemetry, specific telemetry is required to indicate storage usage, timeouts, and potential disk failures. Again, tracking both over- and under-utilization of storage resources is valuable. For instance, a lack of storage system access can indicate failure of a data backup plan.
Basic Application Telemetry
Turning our focus to the application itself, it’s important to monitor some key telemetry, which can involve database access and processing. In terms of access, it’s crucial to watch the number of open database connections, which can balloon and affect performance. Reasons for this include large (and growing) pools of physical and virtual application servers, programming errors, and application server misconfiguration. Tracking this over time can point out design decisions made early that don’t scale as application usage increases.
In terms of database processing, it’s important to monitor the quantity of database queries, their response times, and the quantity of data passed between the database and applications. This needs to include both averages and outliers. Occasional latency can be hidden or overshadowed when looking only at averages, yet those outliers can directly impact and annoy your users.
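The gap between averages and outliers is easy to see with a small example. The sketch below uses made-up query latencies and a nearest-rank percentile helper written for this illustration; it is one of several common percentile definitions.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up sample data: 98 fast queries and two very slow ones, in milliseconds.
latencies_ms = [20] * 98 + [900, 1500]

avg = mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
worst = max(latencies_ms)
print(f"avg={avg} p50={p50} p99={p99} max={worst}")
```

Here the average stays under 50 ms while the 99th percentile is 900 ms, which is the kind of occasional latency that averages alone would hide.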
In terms of errors, your monitoring strategy should look at application exceptions, database errors or warnings, application server logs for unusual activity (such as excessive Java garbage collection), web logs indicating concerning requests, and so on. This is the start of monitoring for security indicators in your application.
Many of the monitoring basics covered so far mostly apply to servers and infrastructure you own. However, as public cloud usage grows, it’s important to include cloud-specific telemetry in your monitoring plan and strategy. This includes taking baseline measurements of all the telemetry outlined so far, both before and after any components of your application are moved to the cloud. As your deployment changes over time, or if at some point you switch cloud providers, you need to re-baseline your metrics.
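A before-and-after baseline comparison can be as simple as checking each metric's relative change. In the sketch below, the metric names, the values, and the 10 percent tolerance are all hypothetical and serve only to show the idea.

```python
def compare_to_baseline(baseline, current, tolerance=0.10):
    """Return metrics whose relative change from the baseline exceeds the tolerance."""
    changed = {}
    for name, base in baseline.items():
        if name not in current or base == 0:
            continue  # skip metrics that cannot be compared
        change = (current[name] - base) / base
        if abs(change) > tolerance:
            changed[name] = round(change, 3)
    return changed


# Made-up measurements taken before and after a move to the cloud.
before_move = {"request_latency_ms": 120.0, "cpu_avg_percent": 45.0, "error_rate": 0.01}
after_move = {"request_latency_ms": 150.0, "cpu_avg_percent": 47.0, "error_rate": 0.01}

drift = compare_to_baseline(before_move, after_move)
print(drift)
```

When the deployment changes or you switch providers, the "after" measurements become the new baseline for the next comparison.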
Cloud monitoring includes obvious metrics, such as cloud availability (checking for outages) and Internet latency and outages between you, your ISP, and your cloud provider. But it should go further and include:
- Internet routing decisions
- Measurements of fixed or subscribed lines between you and your provider
- Internal and external request latency
- Cloud-to-cloud and ground-to-cloud timings to cover hybrid cloud usage
Other metrics vary by cloud service—especially PaaS—that you subscribe to, such as database, compute, storage, and so on.
Additional Application Parameters
Although we touched on basic application telemetry earlier, application-specific monitoring should also include organizationally defined key performance indicators, or KPIs. These are application-specific, and include measurements such as transactions (as defined by your application) per second or other timeframe, request throughput, and request latency, to ensure they meet internal goals or external customer service level agreements (SLAs).
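As a small sketch of checking such KPIs against a target, the following computes transactions per second and the share of requests that finish within a latency limit. The 300 ms limit and 95 percent target are invented for the example; real values would come from your own goals or SLAs.

```python
SLA_LATENCY_MS = 300  # hypothetical latency limit per request
SLA_TARGET = 0.95     # hypothetical share of requests that must meet the limit


def transactions_per_second(transaction_count, window_seconds):
    """Average transaction rate over a measurement window."""
    return transaction_count / window_seconds


def fraction_within(latencies_ms, limit_ms):
    """Share of requests that completed within the latency limit."""
    if not latencies_ms:
        raise ValueError("no latency samples to evaluate")
    within = sum(1 for latency in latencies_ms if latency <= limit_ms)
    return within / len(latencies_ms)


# Made-up sample data for one measurement window.
latencies_ms = [100, 200, 250, 400, 150]
tps = transactions_per_second(120, 60)
compliance = fraction_within(latencies_ms, SLA_LATENCY_MS)
sla_met = compliance >= SLA_TARGET
print(tps, compliance, sla_met)
```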
For e-commerce applications, KPIs may also include overall sales, credit card transactions, or the percentage of abandoned shopping carts per day. Looking deeper, you should also track database size growth rates, changing database index requirements, query plans, and so on to determine future needs and optimizations over time.
Beyond application usage, it’s important to include monitoring of DevOps activity such as application deployments, continuous delivery, and even testing activity. Monitoring and understanding how these activities affect live systems can help you optimize your DevOps procedures.
So far, the telemetry discussed has focused mainly on components and very granular data. However, it’s important to take a system, or end-to-end, view of monitoring, where you look beyond components such as servers, databases, or just the network. With this strategy, your monitoring helps to uncover system-wide issues that affect users, from the users’ point of view. For instance, when an issue occurs, users don’t care that your servers weren’t overloaded, or the network wasn’t saturated. All they know is that something was slow, failed, or behaved unexpectedly.
As an example, let’s say your application experiences an issue or outage—from the user perspective—due to a complex interaction between components that otherwise appear to be up and running fine. Even if this incident affects only a small percentage of transactions, the net result might be hundreds or even thousands of people who are now unhappy with your service. Looking at issues this way (as opposed to simply measuring server utilization or resource consumption) can be a real eye-opener.
With this monitoring philosophy, the goal is to detect system issues before users do, or at least determine what’s happening from the user’s point of view. Done right, this approach will indicate which (and how many) users were affected, and how long a particular issue lasted. Looking for patterns and correlations in these events often uncovers seasonal behaviors or other usage patterns (maybe even hacking attempts) that adversely affect your application.
If a hacking attempt or security breach is to blame, holistic monitoring can help you determine if critical data was lost or stolen, or if systems have become compromised in any way. In most security incidents, the sooner you detect an issue, the better chance you have of limiting the damage.
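To illustrate the user-centered view, the sketch below takes a list of failed requests and reports how many distinct users were affected and how long the failures lasted. The event format of (timestamp, user ID) pairs and the sample values are assumptions for the example; real data would come from your own logs or tracing.

```python
from datetime import datetime


def incident_impact(failed_requests):
    """Summarize failed requests given as (timestamp, user_id) pairs."""
    if not failed_requests:
        return {"users": 0, "duration_seconds": 0.0}
    times = [timestamp for timestamp, _ in failed_requests]
    users = {user_id for _, user_id in failed_requests}
    duration = (max(times) - min(times)).total_seconds()
    return {"users": len(users), "duration_seconds": duration}


# Made-up sample events: three failures that hit two distinct users.
events = [
    (datetime(2024, 1, 1, 12, 0, 0), "alice"),
    (datetime(2024, 1, 1, 12, 0, 30), "bob"),
    (datetime(2024, 1, 1, 12, 2, 0), "alice"),
]
impact = incident_impact(events)
print(impact)
```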
Monitoring Tools, Analytics, and Strategy
With the basics of monitoring telemetry in mind, it’s time to touch on strategies and tools to take it to the next level. For example, as important as it is to have a sound monitoring strategy, you also need to have a well-planned response strategy in place that includes the following:
- First-level detection: Identify, understand, and begin root cause analysis of the issue.
- A documented communication plan. This should include the names and contact information of executives or leaders who can make quick, appropriate business and technology decisions, taking into account time zones.
- Identification of a short-term fix to restore the application.
- Kick off an investigation for future avoidance (a long-term solution).
Monitoring tools to use include:
- Dashboards for real-time system telemetry and reporting.
- Log parsing, using tools that safely work with production systems.
- Visualization tools such as dashboards that monitor at a glance.
- Business intelligence to mine your logs for hidden information, such as seasonal usage patterns or security incidents.
- Automation tools that remove manual work, with automated detection, recovery, and risk mitigation.
- Analytics, such as advanced threat intelligence, that look for suspicious user activity, out-of-band network access, unusual database activity, and so on, to detect hacking incidents before they become security breaches.
Working with a monitoring tool vendor will not only help you implement a sound monitoring strategy, it will also ensure that it evolves and becomes more comprehensive over time.