Today I wanted to write about something that’s been on my mind for the last few months. The industry spends quite a bit of time talking about observability these days and something’s been, somewhat vaguely, bothering me about it. So about a week or so ago, I spent some time figuring out what was bothering me and had some insights I would like to share.
First, observability is not an end in and of itself, but a means to an end. We don’t gather all this data, manage it, and analyze it just to achieve observability. Treating observability as the end would be like saying that we back up our data so that we can have backups. Nope, we back up our data so that we can restore our systems in case of catastrophic failure. So what, then, is the objective of all the hard work involved in observability? That, it turns out, has multiple answers. And those answers depend on which function you are in, what your business objectives are, what industry you are in, etc.
Today we speak about observability explicitly in terms of DevOps/SRE. So, what is the objective of observability for DevOps engineers or SREs? The answer to that one hides behind the acronym itself … Site Reliability. We need observability for our digital properties to be reliable, available, and performant. Our customers wouldn’t like it otherwise and we would lose revenue, users, reputation, etc. But there are other goals of observability. If you are a CISO, Director of Compliance, or SOC Analyst, your goals are to ensure that apps are not breached, that regulatory rules are complied with, that data leaks are prevented, etc. In other words, your goal is to maintain secure and compliant systems. Furthermore, if you are on the revenue side of the business, like in Product Management or Marketing, your goals are probably to improve your products, understand your users, and improve customer experience, all of which can be summarized as revenue growth. I call this: Objectives-Driven Observability.
The second thing bothering me is the generally accepted definition of observability, which equates it with collecting logs, metrics, and traces.
But is this what it really is? I don’t think so. To explain what I mean by this, we will need to look at what logs, metrics, and traces are but, first, as I said above, let’s start with the objectives. Then we’ll circle back to answer the question of what defines observability.
To achieve any of the above-mentioned objectives for operations or security, for example, you need some data, and some platform/technology to apply analytics to that data (I’ll leave the discussion of platform type and properties for a future blog and keep the focus on data). So, for example, to achieve the objective of reliability, you will need to detect things like: when your systems are experiencing errors, when important transactions fail, when exceptions occur in your code, when a new code deployment occurred, etc. All of these are examples of events that occur at a particular time and they hold key insights into what might be impacting your reliability. Furthermore, you will probably need to understand other important information such as how long it takes to execute a key transaction, what the latency of inserts into your database is, and how much memory or CPU you are consuming. These are examples of measurements and they are different from events because they are measurements over time of specific Service Level Indicators (SLIs) that impact your reliability.
In the vernacular of other functions such as Security or Marketing, this information takes on different flavors but it is essentially the same. A user that fails a login is an event and the rate of traffic on your site from a single external IP address is a Key Risk Indicator (KRI), a numerical measurement that might indicate your site is under Distributed Denial of Service (DDoS) attack. Successful payment is an event and the number of clicks on a suggested catalog item is a Key Performance Indicator (KPI), a numerical measurement that indicates success of a specific offer or product feature. SLIs, KRIs, and KPIs are all numerical indicators that simply go by different names in different domains, so let’s refer to them collectively here as KPIs. So, the definition of observability might look something like this:
Let’s go a bit further to the left now and talk about where events and KPIs come from. You can probably guess where I’m going: logs, metrics, traces. But what is the relationship? Logs, metrics, and traces are data formats, and their payload is events and indicators (SLIs, KRIs, and KPIs), which collectively are data, or machine data to be more precise. Logs are unstructured text generated in code and they exist because developers need to debug their pre-production code by understanding what events their code is generating. Metrics and traces give you numerical measurements of specific indicators at a point in time, which are then aggregated to produce measurements over time of those same indicators. Logs are almost exclusively the source of events but they are also a huge source of KPIs. For example, the rate of 404 HTTP response codes is an SLI computed by analyzing the rate of specific events over time, and it can then be monitored. Also, in Security, the rate of failed logins by users, indicating, for example, a possible brute force attack, is a computed KRI that comes from analyzing the rate of failed login events. And in Marketing or Product vernacular, the number of concurrent users on a site is a KPI computed by analyzing two kinds of events, logins and logoffs. Metrics typically carry measurements of infrastructure or application SLIs such as CPU, memory, latency, etc., and traces enable computation of measurements of transaction SLIs such as how long it takes to execute specific steps in an important transaction.
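To make the step from events to KPIs a little more concrete, here is a minimal Python sketch of how the 404 rate and the concurrent-user count might be derived from log events. Everything in it is an assumption made for illustration: the log line layout, the sample data, and the helper names (parse_event, count_per_minute, concurrent_users) are hypothetical, and a real platform would do this at a very different scale.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log lines: "<ISO timestamp> <event kind> <key>=<value>".
LOG_LINES = [
    "2024-01-01T10:00:05 http_response status=200",
    "2024-01-01T10:00:20 http_response status=404",
    "2024-01-01T10:00:40 http_response status=404",
    "2024-01-01T10:01:10 http_response status=404",
    "2024-01-01T10:01:30 login user=alice",
    "2024-01-01T10:01:45 login user=bob",
    "2024-01-01T10:02:15 logoff user=alice",
]


def parse_event(line):
    """Turn one log line into an event: a timestamp, a kind, and one attribute."""
    timestamp, kind, detail = line.split(" ", 2)
    key, value = detail.split("=", 1)
    return {"time": datetime.fromisoformat(timestamp), "kind": kind, key: value}


def count_per_minute(events, kind, **match):
    """Count matching events per minute: an indicator computed from events."""
    counts = Counter()
    for event in events:
        if event["kind"] != kind:
            continue
        if all(event.get(k) == v for k, v in match.items()):
            minute = event["time"].replace(second=0, microsecond=0)
            counts[minute] += 1
    return dict(counts)


def concurrent_users(events):
    """Track how many users are logged in after each login or logoff event."""
    active = set()
    series = []
    for event in sorted(events, key=lambda e: e["time"]):
        if event["kind"] == "login":
            active.add(event["user"])
        elif event["kind"] == "logoff":
            active.discard(event["user"])
        else:
            continue  # other event kinds do not change the user count
        series.append((event["time"], len(active)))
    return series


events = [parse_event(line) for line in LOG_LINES]
not_found_rate = count_per_minute(events, "http_response", status="404")
users_online = concurrent_users(events)

for minute, count in sorted(not_found_rate.items()):
    print(minute.isoformat(), "404 responses:", count)
for moment, count in users_online:
    print(moment.isoformat(), "users online:", count)
```

The same parsed events feed both indicators, which is the point: the log is only the carrier, while the SLI and the KPI come from analyzing the events over time.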
So, what is my point? Logs, metrics, and traces don’t give you observability… monitoring and analysis of events and KPIs do. In the end, it looks something like this:
Enterprises use some combination of these three data formats to capture the events and compute the KPIs required to achieve their objectives. Typically there is more than one objective: reliability, security and compliance and, more and more often, business intelligence as well. In digital business, looking at the same events and KPIs through different lenses is required to maximize the return on managing machine data and to achieve enterprise objectives.
In other words, while technology vendors can obsess over data formats, business leaders and employees, at the end of the day, focus their energies on achieving business objectives. Hence, the value they need from their observability platform should be objectives-driven across a variety of functions and business requirements. That’s what we’re focused on at Sumo Logic.