Connect the dots with the new Trace operator
The trace operator is a new “beta” feature in Sumo Logic that allows the user to identify and follow chains of entities across different log messages, which themselves may be distributed across different assemblies, machines, or even datacenters. Its origins lie in our culture of “dogfooding” and a recent hackathon where engineers had the opportunity to work on cool or itch-scratching projects of their own choosing.
Since the Sumo Logic service itself is a cloud-based distributed system, we often found ourselves investigating behaviors across multiple components of the system. Following our own logging advice, we use unique IDs to track these events and to make them easily identifiable within our logs. However, unless the “originating ID” follows activity across every single system component, it was still necessary to perform multiple searches to follow event chains all the way to the end. To show how trace automates this procedure and makes our lives easier, we’ll walk through a simplified session tracking example.
Session Tracking Example
Say that your product uses a variety of session IDs to track requests as they flow throughout your system. For example, different components might use a series of 4-digit hexadecimal IDs to process a customer order as shown below.
Now imagine that an error is encountered within the system while processing the accountID causing an internal error log to be generated containing the webID: “PROCESSING FAILED: webID=7F92“.
Manually connecting the dots
Starting from this information, we could perform a series of searches and manual investigations to uncover the root cause from this set of logs:
- User action webID=7F92
- Initiating requestID=082A for webID=7F92 …
- … orderID=34C8 received for requestID=082A …
- Retrieving userID=11D2 for requestID=082A …
- … accountID=1234 access, userID=11D2 …
- ERROR accountID=1234 not found!
(this error percolates back until the original webID fails) - PROCESSING FAILED: webID=79F92
Note that to arrive at this conclusion we are essentially following a ”chain” of these hex IDs across different components of our system.
Session tracking with trace
The idea of the trace operator is to automate this process, allowing us to jump almost directly from the observed webID (log #1) to the original failure deep within the system (log #6) via the following query:
* | trace “ID=([0-9a-fA-F]{4})” “7F92″ | where _raw matches “*ERROR*”
Let’s deconstruct what’s happening here. First, assume that our * keyword search query runs over the time window of interest, capturing all relevant logs and plenty of irrelevant ones as well. Next we have the trace operator:
- The regular expression (with exactly one capturing group) ”ID=([0-9a-fA-F]{4})” tells trace how to identify the individual pieces of the chain we are trying to build, in this case 4-digit hex strings following “ID=“.
- The final value gives trace the starting point to build a chain from, which for us is the original webID 7F92.
- trace then scans incoming logs to build the underlying chain based on IDs occurring together in the same log, starting from the user supplied initial value (here 7F92).
For example, when trace observes this log
Initiating requestID=082A for webID=7F92 …
it uses the regex to identify two IDs: 082A and 7F92. Since 7F92 is the starting point it is already part of the chain, and since 082A has just co-occurred with 7F92 we add it to the chain as well. As trace works its way through the logs, any log containing any ID which is part of the chain is passed through, and any other log is simply ignored. For example the following log would not be added, because none of these IDs are connected to the chain we build starting from the webID 7F92:
Initiating requestID=8182 for webID=8384 …
This is how the trace operator filters logs by “connecting the dots” across different log messages.
The smoking gun
Finally, once we’ve used trace to filter down to logs containing IDs which we know to be connected to the failing webID 7F92, we do string matching to filter down to logs containing the substring “ERROR” and discover a failure associated with the accountID. Note that if we had simply done an “ERROR” keyword search we might be faced with a deluge of other errors not directly connected to the specific issue we were trying to investigate. Furthermore, without the constructing our chain of IDs, there would be no obvious connection between accountID 1234 and our failure webID 7F92. Hopefully this example has given you a taste for what you can do with trace – there are certainly many other possible applications.
"From Academia to @SumoLogic". Great blog post on academic & engineering challenges of #logmanagement | sumolo.gs/14tZHzd
21 May | SumoLogic
