You’ve probably heard about logs and metrics and have been told you need to monitor them. You may be told to monitor them for compliance or because someone suggested it’d be a great way to reduce time to investigate bugs. It can be a ton to handle and understand, and there’s a great deal of information out there.

Brief history of modern IT

Information systems are ever-evolving. The hardware and software they use change rapidly, and so does the architecture that brings those pieces together. These systems have been around for generations in various forms, even in the pre-computer era. Though it may be fun to research how logging worked with a Babbage Difference Engine, it may not be the most relevant topic here. To make this relevant to logs and metrics, let’s look at just the past two decades and focus on web systems.

We all know how IT systems looked even up to the mid-2000s. They were relatively simple compared to today. There was usually a centralized server, some networking devices connecting them all together, and some controller managing access. Web IT systems were generally composed of a centralized web server that hosted static pages with some dynamic content queried from databases. Of course, there’s a bit more complexity to it all, but it was typically far less complex architecturally than today.

If there were errors or reliability issues, we’d look into the logs of a few servers. The architecture was mostly manageable enough to look at logs manually.

Simple.

The cloud

As web technology advanced, websites evolved from static with a bit of dynamically generated content to robust applications where the browser was just a terminal to interact with the applications. This presented a scalability problem on the backend where customers needed access to their systems in near real-time with little or no delay. Service-oriented architectures (SOA) started taking off and gave birth to the “cloud” as we know it: on-demand scalability. Amazon Web Services (AWS) propelled this technology to the mainstream and proliferated the concepts worldwide with the release of Elastic Compute Cloud (EC2) in 2006.

To achieve the scalability needed to meet demand, new information system architectures emerged. You no longer have to physically build servers to support loads or throw more RAM into a physical box. There’s no need to manage the virtualization software. The cloud platforms allow you to click a few buttons and have a high-performance server in minutes. Load balancers and autoscaling services now add or remove capacity automatically as needed.

Now, we have an infrastructure of a potentially unknown number of systems working in unison. Architectures that deploy microservices through a container service like Docker add even more complexity. This poses a major issue for troubleshooting when a problem arises.

Bottlenecks in troubleshooting

Let’s suppose that during a peak usage time, customers are seeing error statuses as the page loads. At the same time, you’re still seeing plenty of healthy activity. To find the errors, you would look at your logs. When there are hundreds, if not thousands, of instances and containers, it can be virtually impossible to track down even the instance that’s having the problem, let alone the individual event that is causing those errors.

This is generally where we are today.

Today’s architecture is incredibly complex and comprises an endless number of components. The fun part is that a tiny team can assemble all these system components. The same level of physical architecture needed years ago would have been laughably impossible with those resources. The not-so-fun part is that managing, orchestrating, and monitoring it all cannot be done “the easy way” anymore by just opening a log file and grepping it.

Manually reviewing the individual logs of all the components doesn’t scale. To help with investigations and troubleshooting, many log management tools have been developed to collect logs centrally and analyze large amounts of data in real time.
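As a rough illustration of why central collection helps, here is a minimal sketch that assumes the logs from every instance have already been gathered and parsed into records. The record fields (`instance`, `status`) and the instance names are hypothetical and not taken from any particular tool.

```python
from collections import Counter

# Hypothetical, already-parsed log records collected from many instances.
records = [
    {"instance": "web-01", "status": 200},
    {"instance": "web-02", "status": 500},
    {"instance": "web-03", "status": 200},
    {"instance": "web-02", "status": 503},
    {"instance": "web-01", "status": 500},
    {"instance": "web-02", "status": 502},
]


def errors_by_instance(records):
    """Count server-side errors (HTTP 5xx) per instance."""
    return Counter(r["instance"] for r in records if 500 <= r["status"] < 600)


counts = errors_by_instance(records)
worst_instance, worst_count = counts.most_common(1)[0]
print(worst_instance, worst_count)
```

With everything in one place, finding the noisiest instance becomes a single query instead of a login to each machine.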

Logs vs metrics

Before we continue, understanding the difference between logs and metrics may be a good idea.

What are logs?

A log message is a set of data that a system generates when an event happens in order to describe that event. The log message contains the log data: the details about the event, such as the resource that was accessed, who accessed it, and when. Different events in a system will carry different sets of data in their messages.
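To make that concrete, here is a minimal sketch of one log message built as structured data. The function name, field names, and sample values are assumptions chosen for illustration; real systems each define their own format.

```python
import json
from datetime import datetime, timezone


def build_log_message(level, event, resource, user, timestamp=None):
    """Return one log message as a JSON string describing a single event."""
    timestamp = timestamp or datetime.now(timezone.utc)
    log_data = {
        "timestamp": timestamp.isoformat(),  # when it happened
        "level": level,
        "event": event,
        "resource": resource,  # what was accessed
        "user": user,          # who accessed it
    }
    return json.dumps(log_data, sort_keys=True)


message = build_log_message(
    "INFO", "file_read", "/reports/q3.csv", "alice",
    timestamp=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
)
print(message)
```

A different event, such as a failed login, would likely carry a different set of fields in the same envelope.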

If you want to read deeper into the topic, check out Logging and Log Management (Chuvakin et al., 2013). The authors specify five general categories of logs: informational, debug, warning, error, and alert.

Informational messages are typically “benign” events; debug messages are used while troubleshooting code or systems; warnings flag things like something that may be missing but won’t directly impact the system; errors convey that a problem has occurred; and alerts indicate that something important has happened, largely around security use cases.

Logs will tell the “story” of what happened in a system that got it to the issue you’re troubleshooting.

To learn more about creating a log message, I have a previous blog post around what goes into it: Best Practices for Creating Custom Logs.

What are metrics?

While logs are about a specific event, metrics are a measurement of the system at a point in time. Each measurement can have a value, a timestamp, and an identifier of what that value applies to (like a source or a tag). Logs may be collected whenever an event occurs, but metrics are typically collected at fixed time intervals. That interval is referred to as the resolution.
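A metric data point as described here can be sketched in a few lines. The `MetricPoint` class, the `sample` helper, and the fake reader function are hypothetical names used only for illustration, and the 60-second resolution is an arbitrary choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricPoint:
    name: str       # identifier of what is being measured
    source: str     # which system the value applies to
    timestamp: int  # Unix time in seconds
    value: float    # the measurement itself


def sample(read_value, name, source, start, resolution_s, count):
    """Take `count` measurements spaced `resolution_s` seconds apart."""
    return [
        MetricPoint(name, source, start + i * resolution_s, read_value(i))
        for i in range(count)
    ]


# A fake reader stands in for a real CPU probe.
points = sample(lambda i: 40.0 + i, "cpu_percent", "web-01",
                start=1_700_000_000, resolution_s=60, count=3)
for point in points:
    print(point)
```

Unlike a log, nothing here waits for an event: a point is recorded at every interval whether or not anything interesting happened.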

The collected data points are referred to as a time-series metric. Common metric types include gauges, counters, and timers, and they are typically visualized as graphs over time.

The topic of metrics is still relatively new in the industry, and I highly recommend reading James Turnbull’s The Art of Monitoring.

Though a measurement of the system’s health can be stored in a performance log file, this can be a very costly way to collect it. You’re constantly creating a new entry with all the metadata about the system just to get a single number. A metric normalizes this so you aren’t logging the same data repeatedly. You’re only storing the number and timestamp. For these measurements, the metric will be a tiny fraction of the size of an entire log entry.
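The size difference can be illustrated with a small comparison. This is only a sketch: the log fields are made up, and the 16-byte binary layout (an 8-byte timestamp plus an 8-byte float) is an assumption, since real metric stores use their own encodings.

```python
import json
import struct

# The same memory reading written as a full log entry (hypothetical fields)...
log_entry = json.dumps({
    "timestamp": "2024-01-15T09:30:00+00:00",
    "level": "INFO",
    "host": "web-01",
    "region": "us-east-1",
    "service": "checkout",
    "message": "memory utilization sampled",
    "memory_percent": 71.5,
})

# ...and as a bare metric point: 8-byte timestamp + 8-byte float value.
# The series identity (name, host, tags) is assumed to be stored once per
# series rather than repeated with every point.
metric_point = struct.pack("!qd", 1_705_311_000, 71.5)

print(len(log_entry.encode("utf-8")), "bytes as a log entry")
print(len(metric_point), "bytes as a metric point")
```

Multiplied across every host and every collection interval, that per-measurement overhead is what makes logging a health number repeatedly so expensive.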

Do I really need logs and metrics?

A log is an event that happened, and a metric is a measurement of the health of a system.

Think of it this way. If you go to a doctor when you’re sick, you can tell her everywhere you went in the past week. You can mention things like attending a food festival, your kid’s birthday party, and your office. If you give your doctor enough clues about your history, she may be able to pinpoint your ailment exactly.

This would be the same as troubleshooting your system based completely on looking at logs.

To help her come up with a very confident conclusion about what is wrong and how to get better, she’ll need your vitals like blood pressure, temperature, weight, and possibly other measurements like blood samples. You will always have these measurements that can be taken whether you’re healthy or sick. These are the equivalent of the metrics.

If you theoretically have a constant flow of these measurements for long periods, she can see your temperature starting to rise outside of your normal range at a certain time. She can see from your history of where you were that you were at the food festival a day before. She can conclude that the food you ate at the festival got you sick; a day later, your temperature started to rise slightly without you noticing, and a day after that, you started feeling your symptoms. She’ll confidently know the ailment based on how that pattern of symptoms lines up with what you remember doing.

Ok, so this is an overly simplified approach to medicine. I have no medical expertise to say that my anecdote is realistic. But it’s a good analogy to grasp the difference between logs and metrics and how they can work together.

Let’s relate this analogy to IT management. If you start seeing significant errors on your server, you can look back at your logs. The problem may be that there’s always an acceptable error rate, and you don’t know exactly when the real issue started in all the noise. It can take some time to dig in and find the starting point of the actual issue.

Now look at your metrics history. Memory usage is fairly stable and consistent, and then you spot a sudden spike in memory utilization. But without the logs, it doesn’t mean much besides telling you, “Uhh, this isn’t right…”

Bringing together both logs and metrics can provide much more clarity. During the spike in memory utilization, some unusual log entries may indicate the specific event that caused the spike.
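Here is a minimal sketch of that correlation, assuming memory readings and log entries are both available as simple timestamped tuples. The spike rule (baseline mean plus three standard deviations), the function names, and all sample data are illustrative assumptions, not how any specific product detects anomalies.

```python
from statistics import mean, stdev


def find_spike(points, baseline_size=5, threshold=3.0):
    """Return the timestamp of the first value above the baseline limit.

    `points` is a time-ordered list of (timestamp, value). The limit is the
    baseline mean plus `threshold` standard deviations. Returns None if no
    value exceeds it.
    """
    baseline = [value for _, value in points[:baseline_size]]
    limit = mean(baseline) + threshold * stdev(baseline)
    for timestamp, value in points[baseline_size:]:
        if value > limit:
            return timestamp
    return None


def logs_near(logs, timestamp, window_s=60):
    """Return log entries within `window_s` seconds of `timestamp`."""
    return [entry for entry in logs if abs(entry[0] - timestamp) <= window_s]


# Hypothetical memory utilization (%), sampled every 60 seconds.
memory = [(0, 50.0), (60, 51.0), (120, 49.0), (180, 50.0), (240, 51.0),
          (300, 52.0), (360, 90.0), (420, 92.0)]

# Hypothetical log entries as (timestamp, level, message).
logs = [
    (100, "INFO", "request ok"),
    (350, "WARNING", "cache rebuild started"),
    (365, "ERROR", "out of memory in worker"),
    (600, "INFO", "request ok"),
]

spike_ts = find_spike(memory)
suspects = logs_near(logs, spike_ts, window_s=30)
print(spike_ts, suspects)
```

The metric answers "when did it start?" and the logs around that moment answer "what was happening?".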

How to get started with logs and metrics

Whichever service(s) you choose to handle your data collection, it’s all going to come down to your use case and what you want to understand. If you’re a DevOps engineer keeping your systems up and stable, then audit logs for your GSuite probably won’t be the most relevant. On the other hand, if you’re a security engineer keeping the bad guys out, then stack traces from the dev environment may not tell you that someone is snooping around your AWS environment. As a developer, you probably do care about those stack traces.

Security

Do you lack visibility into who’s accessing your AWS environment? You should be monitoring at least your CloudTrail, VPC flow, and ELB logs. Host metrics from your instances can tell you that your CPU and network utilization are suddenly under heavy load, which can be indicative of an attack. An alert on this spike lets you jump into your tool at the exact time the spike started. Then you can correlate your events to that starting point to understand why it may have begun.

DevOps

Do you want to debug your code and troubleshoot an outage? Sending debug logs, stack traces, and commits would help, but also make sure you’re bringing in your production logs to see how your code runs once it’s under a real load. Again, host metrics or CloudWatch Metrics can tell you, potentially down to the second, when your system’s resource utilization suddenly began rising. You can then use that exact moment to see that a commit landed seconds before, broke a service, and brought down other services.
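One way to line up a commit with the start of a spike is a simple lookup over deploy times. This is a sketch under stated assumptions: the `deploys` list, its commit IDs, and the spike timestamp are all hypothetical, and in practice the data would come from your deployment and metrics tooling.

```python
from bisect import bisect_right


def last_deploy_before(deploys, spike_ts):
    """Return the most recent (timestamp, commit_id) at or before `spike_ts`.

    `deploys` must be sorted by timestamp. Returns None if nothing was
    deployed before the spike.
    """
    times = [timestamp for timestamp, _ in deploys]
    index = bisect_right(times, spike_ts)
    return deploys[index - 1] if index else None


# Hypothetical deploy history as (Unix timestamp, commit ID).
deploys = [(1000, "a1b2c3"), (5000, "d4e5f6"), (9000, "0a1b2c")]

suspect = last_deploy_before(deploys, spike_ts=5012)
print(suspect)
```

A commit that landed 12 seconds before utilization began rising is a strong first lead, though it still needs the surrounding logs to confirm it.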

So which event logs should you collect? The answer is: it depends.

Plan and architect

There are many use cases and scenarios for what you want to know about your system. Often there is no single right answer. Good ol’ requirements gathering, project management, and architecting are important to make sure you are bringing in the right data to solve your problem. There will be use cases where you only need logs. There are times when you’ll only need metrics. Then there are ones where bringing them together will make your job easier than you thought possible.

I’ve worked with customers who have the dreaded compliance use case. No one wants to spend money on storing logs for regulatory reasons. Yet if they’re already sending that data over, they may as well see what other intelligence they can gather from it. It is not uncommon for a customer to get that “whoa!” moment and realize there’s a goldmine of information waiting to make their lives easier from the required compliance logs alone.

There are many logging and metrics tools out there. There are still few outside of Sumo Logic that allow you to ingest both logs and metrics and then correlate them natively. Oftentimes, they handle either logs or metrics and have integrations for the other type of data. We feel the two are too important to be independent of each other. They need to work together to tell you the whole story of your systems.
