Monitoring requires a multi-faceted approach if DevOps teams want end-to-end visibility and deep insight into issues. This is especially true in the case of modern microservices applications, which are essentially collections of distributed services that talk to each other over a service mesh. With monolithic applications, requests can be tracked easily from the client to the server and back, but with modern applications, every request passes through numerous services before completion. Along the way, a request can fail or slow down for various reasons. While metrics are great for alerting and logs are good for digging deeper to identify problem areas, there are certain issues that need deeper visibility than metrics and logs can provide. This is why distributed tracing belongs in every DevOps team’s monitoring arsenal.
There are quite a few similarities between logs and traces. First of all, they have the same goal of delivering deeper visibility into the functioning of a system. They look beyond metrics to tell the detailed story behind an issue. In doing so, they collect a lot of repetitive data that needs to be distilled and analyzed in order to find hidden gems. But if you want to look at an individual occurrence of a particular issue in fine-grained detail, they are unbeatable.
Second, they’re both proactively inserted into an application by its developers. This is significant because, as Pawel Brzoska of Sumo Logic says, “an application’s observability gains a lot from the fact that telemetry signals are designed, composed, and produced by an application developer/vendor in compliance with industry standards, and are not a proprietary, black box component of the monitoring vendor.”
Third, logs and distributed tracing have a similar architecture: they collect performance data from the system and route it to an external location for further analysis. Thus, they both require storage and analysis of performance data. In fact, you could argue that traces are a form of structured log. They both describe unique events that happened in a specific timeframe. As a matter of fact, many customers achieve some tracing use cases (see below) by logging equivalent information into individual component logs and using stitching techniques in log analytics engines to achieve trace-like visibility based on them. This is not easy, but it can work.
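To make the stitching idea concrete, here is a minimal sketch in Python. It assumes each component writes JSON log lines that share a request identifier; the field names (`ts`, `service`, `request_id`, `msg`) and the sample records are hypothetical, and a real log analytics engine would do this with its own query language rather than application code.

```python
import json
from collections import defaultdict

# Hypothetical JSON log lines from three services; all field names are assumptions.
LOG_LINES = [
    '{"ts": 3.0, "service": "payments", "request_id": "r-1", "msg": "charge ok"}',
    '{"ts": 1.0, "service": "gateway", "request_id": "r-1", "msg": "received"}',
    '{"ts": 1.5, "service": "gateway", "request_id": "r-2", "msg": "received"}',
    '{"ts": 2.0, "service": "orders", "request_id": "r-1", "msg": "order created"}',
]


def stitch(lines):
    """Group log records by request ID and sort each group by timestamp."""
    by_request = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        by_request[record["request_id"]].append(record)
    return {
        request_id: sorted(records, key=lambda r: r["ts"])
        for request_id, records in by_request.items()
    }


stitched = stitch(LOG_LINES)
# Print the trace-like view of one request across services.
for record in stitched["r-1"]:
    print(record["ts"], record["service"], record["msg"])
```

The result is only trace-like: it shows the order of events for a request, but not the parent-child relationships or per-service durations that spans capture directly.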
There are some key differences between logs and traces that make them complementary to each other. First, let’s focus on the data format. Logs can be produced by any part of the stack, sometimes with minimal structure, whereas distributed tracing creates highly structured records of user transactions as they pass through different services and parts of the application stack. Log data is typically collected as raw text, while tracing data is recorded as structured events called “spans”. A span records how long a request spent in a particular service. Multiple spans can then be stitched together (using trace and span IDs) into a “trace”, which describes the entire transaction. Spans adhere to a specific structure and can be more lightweight than log files.
Second, and most importantly, logs and traces serve two different use cases that are complementary but also orthogonal. Logs describe the detailed behavior of a single infrastructure component (a node, container, process, or application service) and its ability to serve clients’ requests. Traces, on the other hand, introduce the concept of a transaction: a single client call that travels through multiple infrastructure elements. A single trace describes that journey and gathers details of the performance and availability of every step along the way into a single record.
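The relationship between spans and a trace can be sketched in a few lines of Python. This is an illustrative data model, not the schema of any particular tracing tool: the `Span` fields, the `assemble_trace` helper, and the sample services are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One timed unit of work in one service (hypothetical minimal schema)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def assemble_trace(spans, trace_id):
    """Return (depth, span) pairs for one trace, parents before children."""
    children = {}
    for span in spans:
        if span.trace_id == trace_id:
            children.setdefault(span.parent_id, []).append(span)

    ordered = []

    def walk(parent_id, depth):
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            ordered.append((depth, span))
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return ordered


# Spans arrive out of order and mixed with spans from other traces.
SPANS = [
    Span("t1", "b", "a", "orders", 10, 90),
    Span("t1", "a", None, "gateway", 0, 100),
    Span("t1", "c", "b", "payments", 20, 70),
    Span("t2", "x", None, "gateway", 5, 15),
]

trace = assemble_trace(SPANS, "t1")
for depth, span in trace:
    print("  " * depth + f"{span.service}: {span.duration_ms} ms")
```

The trace ID selects the spans that belong to one transaction, and the parent span ID rebuilds the call hierarchy, which is what lets a tracing tool show the whole journey as a single record.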
From that point of view, traces are better when you want to troubleshoot top-down: from the user transaction down to individual faulty infrastructure components. Once you have found the faulty component, tracing becomes less valuable and you need to switch to logs, which, as mentioned above, contain full detailed information about the component’s behavior.
Finally, logging has existed for decades, while distributed tracing has been gaining prominence in recent years. Its rise has been driven by the migration from monoliths to microservices-based, distributed, and cloud-native applications. Microservices architectures mean that user transactions can involve dozens or even hundreds of individual services, and traditional application performance tools are unable to stitch the myriad of service calls into a coherent picture. Additionally, modern applications can be complex to the point of almost being organic in their intricacies. Monitoring these applications has required new ways of thinking and innovative approaches, of which tracing is but one, albeit important, development.
Troubleshooting and monitoring modern applications require different, complementary data sources to get a full picture. First, performance metrics are typically used to monitor for the symptoms of failures (high load, high usage, slowdowns, etc.). They are ideal to set up high-level alerts, but they rarely tell you where the underlying issue is. Logs, on the other hand, give deep details on errors and events but don’t provide much insight into the experience of end-users. While metrics point to the requests that have failed and the logs sometimes shed some light on why they failed, tracing pinpoints the exact source of the failure. It helps to answer the following questions:
Weaveworks has shared their experience of distributed tracing with Jaeger and shed light on how it complements metrics and logs. Tracing is able to identify the small details behind latency or failed requests that would otherwise be hard to track with metrics and logs.
In his talk at KubeCon, Bryan Boreham of Weaveworks discussed how tracing can help optimize system performance. He explained that identifying the longest span is the best way to find the slowest queries that need to be sped up, and that gaps between spans should be filled with new spans to get the complete picture. In addition, if there are many spans of the same duration, this could be caused by a timeout, which is an unnatural end to a request and needs to be corrected.
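The three heuristics above can be sketched as simple checks over span timings. This is an illustration under stated assumptions, not code from the talk: spans are modeled as hypothetical `(name, start_ms, end_ms)` tuples, and the function names and sample data are invented for the example.

```python
from collections import Counter

# Hypothetical child spans of one request: (name, start_ms, end_ms).
SPANS = [
    ("auth", 0, 20),
    ("db-query", 20, 520),
    ("cache", 600, 620),
    ("render", 620, 640),
]

# Hypothetical spans for the same operation collected across several requests.
DB_CALLS = [
    ("db-query", 0, 5000),
    ("db-query", 100, 5100),
    ("db-query", 200, 340),
    ("db-query", 50, 5050),
]


def longest_span(spans):
    """Return the span with the largest duration: the first optimization target."""
    return max(spans, key=lambda s: s[2] - s[1])


def gaps(spans, min_gap_ms=1):
    """Return (start, end) intervals covered by no span: candidates for new spans."""
    ordered = sorted(spans, key=lambda s: s[1])
    if not ordered:
        return []
    found = []
    covered_until = ordered[0][2]
    for _, start, end in ordered[1:]:
        if start - covered_until >= min_gap_ms:
            found.append((covered_until, start))
        covered_until = max(covered_until, end)
    return found


def repeated_durations(spans, min_count=2):
    """Return durations shared by several spans, which may point to a timeout."""
    counts = Counter(end - start for _, start, end in spans)
    return {duration: n for duration, n in counts.items() if n >= min_count}


print("longest:", longest_span(SPANS)[0])
print("uninstrumented gaps:", gaps(SPANS))
print("suspicious durations:", repeated_durations(DB_CALLS))
```

In practice a tracing UI surfaces the same signals visually; the sketch only shows what to look for in the data.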
Sampling is when the tracing tool records only a subset of traces, filtering out repetitive tracing data and leaving behind the meaningful traces. It has origins in Google’s Dapper paper and is being taken to new heights with Jaeger. Jaeger has multiple types of sampling such as:
Probabilistic sampling: The sampler samples a percentage of randomly selected traces, and this percentage can be defined.
Rate-limited sampling: The sampler is set to sample a specific number of traces per second.
Adaptive sampling: At the time of writing, this is yet to be released officially, but it’s the most sophisticated type of sampling, where multiple rules can be set. For example, the sampler can be set to apply varying rate limits for different services based on their traffic volumes.
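The first two strategies can be illustrated with a short Python sketch. This is not Jaeger’s implementation or API: the class names, the `is_sampled` method, and the injectable `rng` and `clock` parameters are assumptions chosen to keep the example self-contained and testable. The rate-limited variant is written here as a token bucket, which is one common way to enforce a per-second budget.

```python
import random
import time


class ProbabilisticSampler:
    """Keep each trace with a fixed probability `rate` between 0 and 1."""

    def __init__(self, rate, rng=None):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0 and 1")
        self.rate = rate
        self.rng = rng or random.Random()

    def is_sampled(self):
        return self.rng.random() < self.rate


class RateLimitedSampler:
    """Token bucket: about `traces_per_second` sampled traces per second on
    average, allowing an initial burst of up to that many traces."""

    def __init__(self, traces_per_second, clock=time.monotonic):
        self.rate = float(traces_per_second)
        self.clock = clock
        self.tokens = self.rate
        self.last = clock()

    def is_sampled(self):
        now = self.clock()
        # Refill tokens for the elapsed time, capped at the bucket size.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


sampler = ProbabilisticSampler(0.1, rng=random.Random(42))
kept = sum(sampler.is_sampled() for _ in range(10_000))
print(f"probabilistic sampler kept {kept} of 10000 traces")
```

Probabilistic sampling keeps overhead proportional to traffic, while rate limiting caps it regardless of traffic, which is why an adaptive sampler that tunes limits per service is attractive for systems with uneven load.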
The standard for distributed tracing has evolved from OpenTracing and OpenCensus to OpenTelemetry. While the first two are more mature standards, OpenTelemetry, which merges them, entered the CNCF as a sandbox project; however, according to CNCF, it’s on track to replace the former standards.
Zipkin and Jaeger are the most commonly used tools. Zipkin is the older of the two, but Jaeger has been adopted by CNCF, and is therefore better integrated into the Kubernetes ecosystem. Appdash is another option that’s based on Zipkin but written in Go.
There are multiple options for reporters that send tracing data to Zipkin and Jaeger. These include Kafka, Scribe, ActiveMQ, gRPC, and RabbitMQ.
Metrics and logs are essential for most day-to-day operations, but they don’t always tell all the details of the story. Distributed tracing fills in those gaps in observability, and it’s a natural extension for anyone committed to logging. You should consider setting up a basic instrumentation practice and systematically adding tracing IDs to logs from the start. In other words, don’t throw your metrics or logs away; instead, enhance them with distributed tracing.
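As a starting point for adding tracing IDs to logs, here is a minimal sketch using only Python’s standard library. It assumes the trace ID is available per request (here it is either passed in or generated); the names `current_trace_id`, `TraceIdFilter`, `build_logger`, and `handle_request` are hypothetical, and a real setup would take the ID from the tracing library’s active span context instead.

```python
import contextvars
import io
import logging
import uuid

# Holds the trace ID of the request currently being handled ("-" when none).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


def build_logger(stream):
    logger = logging.getLogger("tracing_demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.handlers = [handler]
    return logger


def handle_request(logger, trace_id=None):
    # Assumption: a real service would read the ID from incoming request headers.
    token = current_trace_id.set(trace_id or uuid.uuid4().hex)
    try:
        logger.info("order received")
        logger.info("order stored")
    finally:
        current_trace_id.reset(token)


buffer = io.StringIO()
log = build_logger(buffer)
handle_request(log, trace_id="abc123")
print(buffer.getvalue(), end="")
```

With the ID on every line, a log search for one trace ID returns exactly the log records that belong to the transaction a trace has flagged, which is the hand-off from traces to logs described earlier.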