

Machine Data at Strata: “BigData++”

03.05.2014 | Posted by David Andrzejewski, Data Sciences Engineer

A few weeks ago I had the pleasure of hosting the machine data track of talks at Strata Santa Clara. Like “big data”, the phrase “machine data” is associated with multiple (sometimes conflicting) definitions; two prominent ones come from Curt Monash and Daniel Abadi. The focus of the machine data track is on data which is generated and/or collected automatically by machines. This includes software logs and sensor measurements from systems as varied as mobile phones, airplane engines, and data centers. The concept is closely related to the “internet of things”, which refers to the trend of increasing connectivity and instrumentation in existing devices, like home thermostats.

More data, more problems

This data can be useful for the early detection of operational problems or the discovery of opportunities for improved efficiency. However, the decoupling of data generation and collection from human action means that the volume of machine data can grow at machine scales (i.e., Moore’s Law), an issue raised by both Monash and Abadi. This explosive growth rate amplifies existing challenges associated with “big data.” In particular, two common motifs among the talks at Strata were the difficulties around:

  1. mechanics: the technical details of data collection, storage, and analysis
  2. semantics: extracting understandable and actionable information from the data deluge

The talks

The talks covered applications involving machine data from both physical systems (e.g., cars) and computer systems, and highlighted the growing fuzziness of the distinction between the two categories.

Steven Gustafson and Parag Goradia of GE discussed the “industrial internet” of sensors monitoring heavy equipment such as airplane engines or manufacturing machinery. One anecdotal data point was that a single gas turbine sensor can generate 500 GB of data per day. Because of the physical scale of these applications, using data to drive even small efficiency improvements can have enormous impacts (e.g., in amounts of jet fuel saved).

Moving from energy generation to distribution, Brett Sargent of LumaSense Technologies presented a startling perspective on the state of the power grid in the United States, stating that the average age of an electrical distribution substation in the United States is over 50 years, while its intended lifetime was only 40 years. His talk discussed remote sensing and data analysis for monitoring and troubleshooting this critical infrastructure.

Ian Huston, Alexander Kagoshima, and Noelle Sio from Pivotal presented analyses of traffic data. The talk revealed both common-sense (traffic moves more slowly during rush hour) and counterintuitive (disruptions in London tended to resolve more quickly when it was raining) findings.

My presentation showed how we apply machine learning at Sumo Logic to help users navigate machine log data (e.g., software logs). The talk emphasized the effectiveness of combining human guidance with machine learning algorithms.

Krishna Raj Raja and Balaji Parimi of CloudPhysics discussed how machine data can be applied to problems in data center management. One very interesting idea was to use data and modeling to predict how different configuration changes would affect data center performance.

Conclusions

The amount of data available for analysis is exploding, and we are still in the very early days of discovering how to best make use of it. It was great to hear about different application domains and novel techniques, and to discuss strategies and design patterns for getting the most out of data.

 


Beyond LogReduce: Refinement and personalization

01.23.2013 | Posted by David Andrzejewski, Data Sciences Engineer

LogReduce is a powerful feature unique to the Sumo Logic offering. At the click of a single button, the user can apply the Summarize function to their previous search results, distilling hundreds of thousands of unstructured log messages into a discernible set of underlying patterns.

While this capability represents a significant advance in log analysis, we haven’t stopped there. One of the central principles of Sumo Logic is that, as a cloud-based log management service, we are uniquely positioned to deliver a superior service that learns and improves from user interactions with the system. In the case of LogReduce, we’ve added features that allow the system to learn better, more accurate patterns (refinement), and to learn which patterns a given user might find most relevant (personalization).

Refinement

Users have the ability to refine the automatically extracted signatures by splitting overly generalized patterns into finer-grained signatures or editing overly specific signatures to mark fields as wild cards. These modifications will then be remembered by the Sumo Logic system. As a result, all future queries run by users within the organization will be improved by returning higher-quality signatures.
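
As a hypothetical illustration (these signatures are invented, not actual LogReduce output), an overly specific signature such as

  Connection to host db-host-01 failed after 30 seconds

could be edited to mark the host and timeout fields as wildcards,

  Connection to host * failed after * seconds

while an overly general signature that lumps together unrelated connection and authentication failures could be split into two finer-grained signatures, one per failure type.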

Personalization

Personalized LogReduce helps users uncover the insights most important to them by capturing user feedback and using it to shape the ranking of the returned results. Users can promote or demote signatures to ensure that they do (or do not) appear at the top of Summarize results. Besides obeying this explicit feedback, Sumo Logic also uses this information to compute a relevance score which is used to rank signatures according to their content. These relevance profiles are individually tailored to each Sumo Logic user. For example, consider these Summarize query results:

 Results before feedback 

Since we haven’t given any feedback yet, the signatures’ relevance scores are all equal to 5 (neutral), and they fall back to being ranked by count.

Promotion

Now, let’s pretend that we are in charge of ensuring that our database systems are functioning properly, so we promote one of the database-related signatures:

Results after promote

We can see that the signature we have promoted has now been moved to the top of the results, with the maximum relevance score of 10. When we do future Summarize queries, that signature will continue to appear at the top of results (unless we later choose to undo its promotion by simply clicking the thumb again).

The scores of the other two database-related signatures have increased as well, improving their rankings. This is because the content of these signatures is similar to the promoted database signature. This boost will also persist across future searches.

Demotion

This functionality works in the opposite direction as well. Continuing our running example, our intense focus on database management may mean that we find log messages about compute jobs to be distracting noise in our search results. We could try to “blacklist” these messages by putting Boolean negations in our original query string (e.g., “!comput*”), but this approach is not very practical or flexible. As we add more and more terms to our search, it becomes increasingly likely that we will unintentionally filter out messages that are actually important to us. With Personalized LogReduce, we can simply demote one of the computation-related logs:

Results after demote

This signature then drops to the bottom of the results. As with promotion, the relevance and ranking of the other similar computation-related signature have also been lowered, and this behavior will persist across other Summarize queries for this user.

Implicit feedback

Besides taking into account explicit user feedback (promotion and demotion), Summarize can also track and leverage the implicit signals present in user behavior. Specifically, when a user does a “View Details” drill-down into a particular signature to view the raw logs, this is also taken to be a weaker form of evidence to increase the relevance scores of related signatures.
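
As a rough sketch of the kind of bookkeeping this implies (all names and scoring rules here are hypothetical illustrations, not the actual Sumo Logic implementation), one can imagine relevance scores on a 0 to 10 scale being updated roughly like this in Scala:

  // Sketch only: relevance scores live on a 0 to 10 scale, with 5 as neutral.
  case class Signature(id: String, text: String, count: Long, relevance: Double = 5.0)

  // Placeholder similarity in [0, 1]; a real system would compare signature
  // content more carefully than this simple word-overlap measure.
  def similarity(a: Signature, b: Signature): Double = {
    val (wa, wb) = (a.text.split("\\s+").toSet, b.text.split("\\s+").toSet)
    wa.intersect(wb).size.toDouble / wa.union(wb).size.max(1)
  }

  // The chosen signature absorbs the full feedback weight; similar signatures
  // receive a similarity-scaled share of it, and scores stay within [0, 10].
  def applyFeedback(sigs: Seq[Signature], target: Signature, weight: Double): Seq[Signature] =
    sigs.map { s =>
      val share = if (s.id == target.id) 1.0 else similarity(s, target)
      s.copy(relevance = (s.relevance + weight * share).max(0.0).min(10.0))
    }

  def promote(sigs: Seq[Signature], s: Signature) = applyFeedback(sigs, s, 5.0)      // pin to the top
  def demote(sigs: Seq[Signature], s: Signature) = applyFeedback(sigs, s, -5.0)      // push to the bottom
  def viewDetails(sigs: Seq[Signature], s: Signature) = applyFeedback(sigs, s, 1.0)  // weaker implicit signal

  // Signatures are then ranked by relevance, falling back to count for ties.
  def rank(sigs: Seq[Signature]): Seq[Signature] = sigs.sortBy(s => (-s.relevance, -s.count))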

Conclusion

The signature refinement and personalized relevance extensions to LogReduce enable the Sumo Logic service to learn from experience as users explore their log data. This kind of virtuous cycle holds great promise for helping users get from raw logs to business-critical insights in the quickest and easiest way possible, and we’re only getting started. Try these features out on your own logs at no cost with Sumo Logic Free and let us know what you think!


Scala at Sumo: type classes with a machine learning example

09.23.2012 | Posted by David Andrzejewski, Data Sciences Engineer

At Sumo Logic we use the Scala programming language, and we are always on the lookout for ways to leverage its features to write high-quality software. The type class pattern (an idea which comes from Haskell) provides a flexible mechanism for associating behaviors with types, and context bounds make this pattern particularly easy to express in Scala code. Using this technique, we can extend existing types with new functionality without worrying about inheritance. In this post we introduce a motivating example and examine a few different ways to express our ideas in code before coming back to type classes, context bounds, and what they can do for us.
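
As a quick taste of the pattern (a generic sketch under assumed names, not necessarily the example developed in the full post), a type class and a context bound look roughly like this:

  // A type class describing how to measure distance between two values of type T.
  trait Distance[T] {
    def distance(a: T, b: T): Double
  }

  object Distance {
    // An instance for points represented as arrays of doubles (Euclidean distance).
    implicit val euclidean: Distance[Array[Double]] = new Distance[Array[Double]] {
      def distance(a: Array[Double], b: Array[Double]): Double =
        math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
    }
  }

  // The context bound [T: Distance] requires an implicit Distance[T] in scope,
  // so we can give "nearest neighbor" behavior to existing types without inheritance.
  def nearest[T: Distance](query: T, candidates: Seq[T]): T =
    candidates.minBy(c => implicitly[Distance[T]].distance(query, c))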

… Continue Reading


Scala at Sumo: grab bag of tricks

07.25.2012 | Posted by David Andrzejewski, Data Sciences Engineer

As mentioned previously on this blog, at Sumo Logic we primarily develop our backend software using the Scala programming language. In this post we present some (very) miscellaneous Scala tidbits and snippets that showcase some interesting features we have taken advantage of in our own code.

… Continue Reading


Connect the dots with the new Trace operator

07.18.2012 | Posted by David Andrzejewski, Data Sciences Engineer

The trace operator is a new “beta” feature in Sumo Logic that allows the user to identify and follow chains of entities across different log messages, which themselves may be distributed across different assemblies, machines, or even datacenters.  Its origins lie in our culture of “dogfooding” and a recent hackathon where engineers had the opportunity to work on cool or itch-scratching projects of their own choosing.

Since the Sumo Logic service itself is a cloud-based distributed system, we often find ourselves investigating behaviors across multiple components of the system. Following our own logging advice, we use unique IDs to track these events and to make them easily identifiable within our logs. However, unless the “originating ID” follows activity across every single system component, it is still necessary to perform multiple searches to follow event chains all the way to the end. To show how trace automates this procedure and makes our lives easier, we’ll walk through a simplified session tracking example.

Session Tracking Example

Say that your product uses a variety of session IDs to track requests as they flow throughout your system.  For example, different components might use a series of 4-digit hexadecimal IDs to process a customer order as shown below.
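
In the walkthrough below, these IDs chain together roughly as follows (the values are taken from the example logs):

  webID=7F92 → requestID=082A → orderID=34C8
  webID=7F92 → requestID=082A → userID=11D2 → accountID=1234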

Now imagine that an error is encountered within the system while processing the accountID, causing an internal error log to be generated containing the webID: “PROCESSING FAILED: webID=7F92”.

Manually connecting the dots

Starting from this information, we could perform a series of searches and manual investigations to uncover the root cause from this set of logs:

  1. User action webID=7F92
  2. Initiating requestID=082A for webID=7F92 …
  3.  … orderID=34C8 received for requestID=082A …
  4. Retrieving userID=11D2 for requestID=082A …
  5. … accountID=1234 access, userID=11D2 …
  6. ERROR accountID=1234 not found! 
    (this error percolates back until the original webID fails)
  7. PROCESSING FAILED: webID=7F92

Note that to arrive at this conclusion we are essentially following a “chain” of these hex IDs across different components of our system.

Session tracking with trace

The idea of the trace operator is to automate this process, allowing us to jump almost directly from the observed webID (log #1) to the original failure deep within the system (log #6) via the following query:

* | trace "ID=([0-9a-fA-F]{4})" "7F92" | where _raw matches "*ERROR*"

Let’s deconstruct what’s happening here. First, assume that our * keyword search query runs over the time window of interest, capturing all relevant logs and plenty of irrelevant ones as well.  Next we have the trace operator:

  • The regular expression (with exactly one capturing group) "ID=([0-9a-fA-F]{4})" tells trace how to identify the individual pieces of the chain we are trying to build, in this case 4-digit hex strings following "ID=".
  • The final value gives trace the starting point to build a chain from, which for us is the original webID 7F92.
  • trace then scans incoming logs to build the underlying chain based on IDs occurring together in the same log, starting from the user-supplied initial value (here 7F92).

For example, when trace observes this log

Initiating requestID=082A for webID=7F92 …

it uses the regex to identify two IDs: 082A and 7F92. Since 7F92 is the starting point, it is already part of the chain, and since 082A has just co-occurred with 7F92, we add it to the chain as well. As trace works its way through the logs, any log containing any ID which is part of the chain is passed through, and any other log is simply ignored. For example, the following log would not be passed through, because none of its IDs are connected to the chain we build starting from the webID 7F92:

Initiating requestID=8182 for webID=8384 …

This is how the trace operator filters logs by “connecting the dots” across different log messages.
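
To make the mechanics concrete, here is a rough single-pass sketch of this filtering logic in Scala (an illustration of the idea only, not the actual trace implementation):

  import scala.util.matching.Regex

  // Keep a growing set of "connected" IDs, starting from the user-supplied seed.
  // A log line is passed through if it mentions any connected ID, and any other
  // IDs on that line are added to the chain.
  def traceFilter(logs: Seq[String], idPattern: Regex, seed: String): Seq[String] = {
    var chain = Set(seed)
    logs.filter { line =>
      val ids = idPattern.findAllMatchIn(line).map(_.group(1)).toSet
      if (ids.exists(chain.contains)) {
        chain ++= ids   // every ID co-occurring with the chain joins the chain
        true            // pass this log through
      } else {
        false           // ignore logs not connected to the chain
      }
    }
  }

  // Usage, mirroring the query above:
  //   traceFilter(logs, "ID=([0-9a-fA-F]{4})".r, "7F92").filter(_.contains("ERROR"))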

The smoking gun

Finally, once we’ve used trace to filter down to logs containing IDs which we know to be connected to the failing webID 7F92, we do string matching to keep only logs containing the substring “ERROR” and discover a failure associated with the accountID. Note that if we had simply done an “ERROR” keyword search, we might be faced with a deluge of other errors not directly connected to the specific issue we were trying to investigate. Furthermore, without constructing our chain of IDs, there would be no obvious connection between accountID 1234 and our failing webID 7F92. Hopefully this example has given you a taste for what you can do with trace – there are certainly many other possible applications.


Sumo Logic Logging Tips, Part 1 of N

05.03.2012 | Posted by David Andrzejewski, Data Sciences Engineer

The Sumo Logic Log Management and Analytics Service allows any organization to easily store, search, and analyze the log data produced by its applications. But even with the very best tools, the quality of the insights uncovered will depend on the raw materials: the logs themselves.

At Sumo Logic we make extensive use of our own product for developing, debugging, and troubleshooting the Sumo Logic service itself, which is a complex, distributed, cloud-based SaaS application. This “dogfooding” process has helped enormously to improve the product, both in terms of features and maturity. It has also taught us a few things about writing log statements in our own application. In this open-ended series we plan to share some logging patterns and strategies that we have found to be helpful.

General Logging Tips

When in doubt, log it

This one is quite simple: you won’t know about the things you don’t log. While pairing every single line of code with a corresponding log statement would be taking things too far, on balance we have found that it is more common to regret something you didn’t log than something you did. This advice may be a bit self-serving, but when an elusive bug rears its ugly head on your production system at 3 A.M., you will be glad you logged that crucial extra piece of information.

Use logging tools

There are a variety of logging libraries that can give you a high degree of control over different logging levels, output locations, and formats. Log4j is a good example.

Periodically log the progress of long-running operations

Many data processing tasks involve a “bulk processor” that chews through many individual pieces of data, one at a time. Logging every item processed may be overly verbose, but logging at regular intervals (say, every second or every 10 seconds) may be quite helpful:
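A minimal sketch of this pattern (hypothetical names, using the JDK’s built-in java.util.logging so the snippet has no external dependencies; Log4j or any similar library works just as well):

  import java.util.logging.Logger

  object BulkProcessor {
    private val log = Logger.getLogger("BulkProcessor")
    private val reportIntervalMillis = 10000L  // report progress at most every 10 seconds

    def processAll(items: Iterator[String]): Unit = {
      var total = 0L
      var lastCount = 0L
      var lastReport = System.currentTimeMillis()
      for (item <- items) {
        // ... process a single item here ...
        total += 1
        val now = System.currentTimeMillis()
        if (now - lastReport >= reportIntervalMillis) {
          val rate = (total - lastCount) * 1000.0 / (now - lastReport)
          log.info(s"BulkProcessor progress: totalProcessed=$total, itemsPerSecond=$rate")
          lastCount = total
          lastReport = now
        }
      }
      log.info(s"BulkProcessor finished: totalProcessed=$total")
    }
  }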

If our logs contain a regular stream of these messages indicating that 5,000 items were processed per second, and we then observe an unexplained drop to 5 items per second, this may be an excellent clue that something unusual is happening.

Log for readability and parsability

Logging statements can be directed at many audiences: operational personnel maintaining the system, other developers reading your code, or computer programs parsing out structured data fields. Each of these audiences will appreciate logs that are self-documenting and easily consumed.

GOOD
Records request received – user=foobaz@bar.com, userId=77A2063948571CC6, requestId=5B54D5E1D59E9850, recordsRequested=50

NOT AS GOOD
foobaz@bar.com | 77A2063948571CC6 | 5B54D5E1D59E9850 | 50 

Logging for Distributed Systems

A common characteristic of modern distributed applications is that any single event touches many different system components which themselves may reside on many different machines. These systems can be especially challenging to build, operate, and maintain. Problems may manifest only under particular circumstances, and often application logs act as the “black box”, containing the only evidence available to uncover the root causes of system misbehavior. Given this setting, it is of the utmost importance to ensure that your logging is sufficient to help you understand your system and keep things running smoothly.

Use unique IDs to track an event across multiple system components

Unique IDs (such as the hexadecimal representation of a random number) can be helpful to pinpoint relevant logs when investigating a problem. For example, when a user logs into your website, you can associate that visit with a unique session ID that is embedded in all logs resulting from this visit. Searching your logs for this session ID should allow you to easily reconstruct all of the system events associated with this user’s session.

This pattern can also be nested for finer-grained searchability. For example, during a visit to your website a user may access their account preferences. When the frontend of your site issues an internal request for this information, this request itself can be associated with both the user session ID and a new request ID. If initial troubleshooting of the user session reveals that this particular request was unsuccessful, this request ID can then be used to drill deeper into why the request failed.

A reasonable rule of thumb is to use a new ID for any event, transaction, or session that looks like “one thing” to other components of the system.
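
For example, hypothetical log lines following this pattern for the account-preferences scenario above might look like:

  User login succeeded - sessionId=3A7F19B02C44D8E1, user=foobaz@bar.com
  Fetching account preferences - sessionId=3A7F19B02C44D8E1, requestId=9C21E7A4
  Account preferences request failed - sessionId=3A7F19B02C44D8E1, requestId=9C21E7A4, reason=timeout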

Log each side of a remote message or request

The interfaces between different components, especially those residing on different machines, can be a rich source of potential problems. Your message may be delivered to the wrong recipient, the remote recipient may fail to process your request, your message may be eaten by a grue, and so on. 

One strategy for dealing with this issue is to put matching send/receive logging statements on each side of any remote message or request. A search for these message pairs that returns an odd number of logs would then be a red flag indicating that a sent message was never received, was received twice, or suffered some other mischief.
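
For instance, a matched pair for a single request might look like this (the message format and component names are illustrative):

  Sending records request - requestId=5B54D5E1D59E9850, destination=record-service
  Received records request - requestId=5B54D5E1D59E9850, source=frontend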
