
Armin Wasicek

Armin Wasicek is a senior software engineer at Sumo Logic working on advanced analytics. Previously, he spent many happy years as a researcher in academia and industry. His interests are machine learning, security and the internet of things. Armin holds PhD and MSc degrees from Technical University Vienna, Austria and he was a Marie Curie Fellow at University of California, Berkeley.

Posts by Armin Wasicek


Artificial Intelligence vs. Machine Learning vs. Deep Learning: What's the Difference?


5 Best Practices for Using Sumo Logic Notebooks for Data Science

This year, at Sumo Logic’s third annual user conference, Illuminate 2018, we presented Sumo Logic Notebooks as a way to do data science in Sumo Logic. Sumo Logic Notebooks are an experimental feature that integrates Sumo Logic, notebooks, and common machine learning frameworks. They are a bold attempt to go beyond what the current Sumo Logic product has to offer and to enable a data science workflow that leverages our core platform.

Why Notebooks?

In the data science world, notebooks have emerged as an important tool. Notebooks are active documents that individuals or groups create to write and run code, display results, and share outcomes and insights. Like every other story, a data science notebook follows a structure that is typical for its genre. It usually has four parts: we (a) start by defining a data set, (b) continue to clean and prepare the data, (c) perform some modeling using the data, and (d) interpret the results. In essence, a notebook should record why experiments were initiated and how they were performed, and then display the results.

Anatomy of a Notebook

A notebook segments a computation into individual steps called paragraphs. A paragraph contains an input section and an output section. Each paragraph executes separately and modifies the global state of the notebook. State can be defined as the ensemble of all relevant variables, memories, and registers. Paragraphs do not have to contain computations; they can also contain text or visualizations that illustrate the workings of the code. The input section (blue) contains the instructions for the notebook execution engine (sometimes called kernel or interpreter). The output section (green) displays a trace of the paragraph’s execution and/or an intermediate result. In addition, the notebook software exposes some controls (purple) for managing and versioning notebook content as well as for operational aspects such as starting and stopping executions.
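As a rough illustration of the four-part structure and of paragraphs that build on shared state, the following sketch treats each commented block as one notebook paragraph. The latency values and the outlier rule are made up for this example and are not part of Sumo Logic Notebooks; only the Python standard library is used.

```python
import statistics

# Paragraph (a): define the data set (hypothetical response times in ms).
raw_latencies = [120, 135, None, 128, 950, 131, 126, None, 140]

# Paragraph (b): clean and prepare the data by dropping missing values.
latencies = [value for value in raw_latencies if value is not None]

# Paragraph (c): model the data with a simple median-based outlier rule.
median = statistics.median(latencies)
deviations = [abs(value - median) for value in latencies]
mad = statistics.median(deviations)  # median absolute deviation
outliers = [value for value in latencies if abs(value - median) > 5 * mad]

# Paragraph (d): interpret the result by reading out the state.
print(f"median={median} ms, MAD={mad} ms, outliers={outliers}")
```

In a notebook, paragraph (c) could be edited and rerun on its own, because `latencies` from paragraph (b) is still held in memory.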
Human Speed vs Machine Speed

The power of the notebook is rooted in its ability to segment and then slow down computation. Computer programs are commonly executed at machine speed: when a program is submitted to the processor for execution, it runs from start to end as fast as possible and only blocks for IO or user input. Consequently, the state of the program changes so fast that humans can neither observe nor modify it. Programmers typically attach debuggers, physically or virtually, to stop programs at so-called breakpoints and to read out and analyze their state. Thus, they slow down execution to human speed. Notebooks make interrogating the state more explicit. Certain paragraphs are dedicated to making progress in the computation, i.e., advancing the state, whereas other paragraphs simply serve to read out and display the state. Moreover, it is possible to rewind state during execution by overwriting certain variables. It is also simple to kill the current execution, thereby deleting the state and starting anew.

Notebooks as an Enabler for Productivity

Notebooks increase productivity because they allow for incremental improvement. It is cheap to modify code and rerun only the relevant paragraph. So when developing a notebook, the user builds up state and then iterates on that state until progress is made. Running a stand-alone program, on the contrary, incurs more setup time and might be prone to side effects. A notebook will most likely keep all its state in working memory, whereas a stand-alone program needs to build up the state again every time it is run. This takes more time, and the required IO operations might fail. Working off a program state in memory and iterating on it has proved to be very efficient.
This is particularly true for data scientists, as their programs usually deal with large amounts of data that have to be loaded in and out of memory, as well as with computations that can be time-consuming. From an organizational point of view, notebooks are a valuable tool for knowledge management. As they are designed to be self-contained, sharable units of knowledge, they lend themselves to:

- Knowledge transfer
- Auditing and validation
- Collaboration

Notebooks at Sumo Logic

At Sumo Logic, we expose notebooks as an experimental feature to empower users to build custom models and analytics pipelines on top of log and metrics data sets. The notebooks provide the framework to structure a thought process. This thought process can be aimed at delivering a special kind of insight or outcome. It could be drilling down on a search, or an analysis specific to a vertical or an organization. We provide notebooks to enable users to go beyond what Sumo Logic operators have to offer, and to train and test custom machine learning (ML) algorithms on their data. Inside notebooks, we deliver data using data frames as the core data structure. Data frames make it easy to integrate logs and metrics with third-party data. Moreover, we integrate with other leading data wrangling, model management, and visualization tools and services to provide a blend of the best technologies for creating value with data.

Technology Stack

Sumo Logic Notebooks are an integration of several software packages that makes it easy to define data sets using the Sumo Query language and to use the resulting data set as a data frame in common machine learning frameworks. Notebooks are delivered as a Docker container and can therefore be installed on laptops or cloud instances without much effort. The most common machine learning libraries, such as Apache Spark, pandas, and TensorFlow, are pre-installed, but others are easy to add through Python’s pip installer, or using apt-get and other package management software from the command line.
Changes can be made persistent by committing the Docker image. The key component of Sumo Logic Notebooks is the integration of the Sumo Logic API data adapter with Apache Spark. After a query has been submitted, the adapter loads the data and ingests it into Spark. From there, we can switch over to a Python/pandas environment or continue with Spark. The notebook software provides the interface to specify data science workflows.

Best Practices for Writing Notebooks

#1 One notebook, one focus

A notebook contains a complete record of procedures, data, and thoughts to pass on to other people. For that purpose, it needs to be focused. Although it is tempting to put everything in one place, this might be confusing for users. It is better to write two or more notebooks than to overload a single one.

#2 State is explicit

A common source of confusion is that program state gets passed on between paragraphs through hidden variables. The set of variables that represents the interface between two subsequent paragraphs should be made explicit. Referencing variables from paragraphs other than the previous one should be avoided.

#3 Push code into modules

A notebook integrates code; it is not a tool for code development. That would be an Integrated Development Environment (IDE). Therefore, a notebook should only contain glue code and maybe one core algorithm. All other code should be developed in an IDE, unit tested, version controlled, and then imported into the notebook via libraries. Modularity and all other good software engineering practices are still valid in notebooks. As in practice #1, too much code clutters the notebook and distracts from the original purpose or analysis goal.

#4 Use self-explanatory variable names and tidy up your code

Notebooks are meant to be shared and read by others. Others might have a hard time following our thought process if we did not come up with good, self-explanatory names. Tidying up the code goes a long way, too.
Notebooks impose an even higher quality standard than traditional code.

#5 Label diagrams

A picture is worth a thousand words. A diagram, however, needs some words to label axes, describe lines and dots, and convey other important information such as sample size. Without that information, a reader can have a hard time grasping the proportions or the importance of a diagram. Also keep in mind that diagrams are easily copied from the notebook and pasted into other documents or chats. They then lose the context of the notebook in which they were developed.

Bottom Line

The segmentation of a thought process is what fuels the power of the notebook. Facilitating incremental improvements when iterating on a problem boosts productivity. Sumo Logic enables the adoption of notebooks to foster the use of data science with logs and metrics data.

Additional Resources

- Visit our Sumo Logic Notebooks documentation page to get started
- Check out Sumo Logic Notebooks on DockerHub or Read the Docs
- Read our latest press release announcing new platform innovations, including our new Data Science Insights innovation


Understanding Sumo Logic Query Language Design Patterns


Finding and Debugging Memory Leaks with Sumo

Memory leaks happen when programs allocate more memory than they return. Besides compute, memory is one of the critical assets of any computer system. If a machine runs out of memory, it cannot provide its service. In the worst case, the entire machine might crash and tear down all running programs. The bugs responsible for that misbehavior are often hard to find. Sumo’s collector enables monitoring memory consumption out of the box. Using some additional tooling, it is possible to collect fine-grained logs and metrics that accelerate finding and debugging memory leaks.

Memory Management and Memory Leaks

Memory management is done on multiple levels:

- The Operating System (OS) keeps track of the memory allocated by its programs in kernel space.
- In user space, virtual machines like the JVM might implement their own memory management component.

At its core, memory management follows a Producer-Consumer pattern. The OS or VM gives away (produces) chunks of memory whenever programs request (consume) memory. Since memory is a finite resource in any computer system, programs have to release the allocated memory, which is then returned to the pool of available memory managed by the producer. In some applications, the programmer is responsible for releasing memory; in others, like the JVM, a thread called the garbage collector collects all objects that are no longer used. A healthy system runs through this give-and-take in a perfect circle. In a bad system, the program fails to return unused memory. This happens, for example, if the programmer forgets to call the function free, or if some objects keep being referenced from a global scope after usage.
In that case, new operations allocate more memory on top of the already allocated, but unused memory. This misbehavior is called a memory leak. Depending on the size of the objects, a single leak can be as little as a few bytes or kilobytes, or even megabytes if the objects contain images, for example. Depending on how often the erroneous allocation is called, the free space can fill up within a few microseconds, or it can take months to exhaust the memory of a server. This long time-to-failure can make memory leaks very tricky to debug, because it is hard to track an application running over a long period. Moreover, if the leak is just a few bytes, this marginal amount gets lost in the noise of common allocation and release operations. The usual observation period might be too short to recognize a trend.

This article describes a particularly interesting instance of a memory leak. The example uses the Akka actor framework, but for simplicity, you can think of an actor as an object. The specific operation in this example is downloading a file:

- An actor is instantiated when the user invokes a specific operation (download a file)
- The actor accumulates memory over its lifetime (it keeps adding to the temporary file in memory)
- After the operation completes (the file has been saved to disk), the actor is not released

The root cause of the memory leak is that the actor can handle only one request and is useless after saving the content of the file, yet it is never stopped. There are no references to the actor in the application code, but there still is a parent-child relationship defined in the actor system, which acts as a global scope.

From After-the-Fact Analysis to Online Memory Supervision

Usually, when a program runs out of memory, it terminates with an “Out of Memory” error or exception. In the case of the JVM, it can create a heap dump on termination. A heap dump is an image of the program’s memory at the instant of termination, saved to disk.
This heap dump file can then be analyzed using tools such as MemoryAnalyzer, YourKit, or VisualVM for the JVM. These tools are very helpful for identifying which objects consume how much memory. They operate, however, on a snapshot of the memory and cannot keep track of how memory consumption evolves. Verifying that a patch works is out of the scope of these tools. With a little scripting, we can remedy this and use Sumo to build an “Online Memory Supervisor” that stores and processes this information for us. In addition to keeping track of the memory consumption history of our application, it saves us from juggling heap dump files that can potentially become very large. Here’s how we do it:

1. Mechanism to interrogate JVM for current objects and their size

The JVM provides an API for creating actual memory dumps during runtime, or for just retrieving a histogram of all current objects and their approximate size in memory. We want to do the latter, as it is much more lightweight. The jmap tool in the JDK makes this interface accessible from the command line:

jmap -histo PID

Getting the PID of the JVM is as easy as grepping for it in the process table. Note that if the JVM runs as a server under an unprivileged user, we need to run the command as this user via su. A bash one-liner to dump the object histogram could look like:

sudo su stream -c "jmap -histo \$(ps -ax | grep '[0-9]* java' | awk '{print \$1}') > /tmp/${HOSTID}_jmap-histo-$(date +%s).txt"

2. Turn result into metrics for Sumo or just drop it as logs

As a result of the previous operation, we now have a file containing a table with object names, counts, and retained memory. In order to use it in Sumo, we need to submit it for ingestion. Here we have two options: (a) send the raw file as logs, or (b) convert the counts to metrics. Each object’s measurement is part of a time series tracking the evolution of the object’s memory consumption.
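To make the table structure concrete, here is a minimal sketch of how such a histogram file could be parsed. It assumes the usual `jmap -histo` layout (a header, a separator line, rows of the form `num: #instances #bytes class name`, and a trailing `Total` line). The class names in the sample are hypothetical placeholders, and `HistogramRow` and `parse_histogram` are names invented for this illustration.

```python
import re
from typing import NamedTuple


class HistogramRow(NamedTuple):
    rank: int
    instances: int
    size_bytes: int
    class_name: str


# Matches data rows such as "  69:            18           1584  com.example.Child".
ROW_PATTERN = re.compile(r"^\s*(\d+):\s+(\d+)\s+(\d+)\s+(\S+)")


def parse_histogram(text):
    """Return one HistogramRow per data row, skipping header and footer lines."""
    rows = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line)
        if match is None:
            continue  # header, separator, or "Total" line
        rank, instances, size_bytes, class_name = match.groups()
        rows.append(
            HistogramRow(int(rank), int(instances), int(size_bytes), class_name)
        )
    return rows


# Sample input; the class names are made up for this example.
SAMPLE = """\
 num     #instances         #bytes  class name
----------------------------------------------
  69:            18           1584  com.example.Child
  98:            15            720  com.example.Parent
Total            33           2304
"""

rows = parse_histogram(SAMPLE)
for row in rows:
    print(row)
```

Skipping every line that does not match the row pattern is one simple way to cut off the beginning and the end of the file without counting lines.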
Sumo Metrics ingests various time series input formats; we’ll use Graphite because it’s simple. To convert a jmap histogram to Graphite, we use bash scripting. The script cuts off the beginning and the end of the file and then parses the histogram to produce two measurements:

<class name, object count, timestamp>
<class name, retained size, timestamp>

Sending these measurements to Sumo can be done through Sumo’s collector, through collectd with the Sumo plugin, or by sending directly to the HTTP endpoint. For simplicity, we’ve used the Graphite format and targeted the Sumo collector. To be able to differentiate both measurements as well as different hosts, we prepend this information to the classpath:

<count|size>.<host>.<classpath>

For example, a jmap histogram might contain data in tabular form like:

69: 18 1584
98: 15 720
103: 21 672
104: 21 672

Our script turns that into Graphite format and adds some more hierarchy to the package name. In the next section, we will leverage this hierarchy to perform queries on object counts and sizes.

18 123
15 123
21 123
21 123

In our case, we’ll just forward these measurements to the Sumo collector. Previously, we’ve defined a Graphite source for Metrics. Then, it’s as easy as cat histogram-in-graphite | nc -q0 localhost 2003.

3. Automate processing via Ansible and StackStorm

We are now capable of creating a fine-grained measurement of an application’s memory consumption using a couple of shell commands and scripts. Using the DevOps automation tools Ansible and StackStorm, we can turn this manual workflow into an Online Memory Supervision System. Ansible helps us automate taking the measurement on multiple hosts. It connects to each individual host via ssh, runs the jmap command and the conversion script, and submits the measurement to Sumo. StackStorm manages this workflow for us. At a given interval, it kicks off Ansible and logs the process. In case something goes wrong, it defines remediation steps.
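The naming scheme from step 2 and the hand-off to a Graphite source can be sketched in a few lines. This is an illustration under stated assumptions, not the script used for this post: `to_graphite_lines` and `send_to_graphite` are hypothetical helpers, the class names are placeholders, and the sketch relies only on the Graphite plaintext format (`<path> <value> <timestamp>`, one metric per line) and the port 2003 used in the nc command above.

```python
import socket
import time


def to_graphite_lines(host, rows, timestamp=None):
    """Build Graphite plaintext lines from (class_name, count, size_bytes) rows."""
    if timestamp is None:
        timestamp = int(time.time())
    lines = []
    for class_name, count, size_bytes in rows:
        # Dots separate hierarchy levels in Graphite, so the package
        # structure of the class name becomes part of the metric tree.
        lines.append(f"count.{host}.{class_name} {count} {timestamp}")
        lines.append(f"size.{host}.{class_name} {size_bytes} {timestamp}")
    return lines


def send_to_graphite(lines, address=("localhost", 2003)):
    """Send the lines over TCP, similar to piping them into nc."""
    payload = "".join(line + "\n" for line in lines).encode("ascii")
    with socket.create_connection(address, timeout=5) as connection:
        connection.sendall(payload)


# Hypothetical rows; the class names are made up for this example.
rows = [("com.example.Child", 18, 1584), ("com.example.Parent", 15, 720)]
lines = to_graphite_lines("memleak3", rows, timestamp=123)
print("\n".join(lines))
# send_to_graphite(lines)  # requires a Graphite source listening on localhost:2003
```

Real class names from a JVM can contain characters such as `[` or `$`, so a production script would likely need to sanitize them before using them as metric paths.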
Of course, there are alternatives among the myriad of available tools. Ansible competes with SaltStack, Chef, and Puppet. StackStorm is event-driven automation with all the bells and whistles; for this example, we could have used a shell script with sleep or a simple cron job.

Using Sumo to Troubleshoot Memory Leaks

Now it’s time to use Sumo to analyze our memory. In the previous steps, we have submitted and ingested our application’s fine-grained memory consumption data. After this preparation, we can leverage Sumo to query the data and build dashboards. Using queries, we can perform in-depth analysis. This is useful as part of a post-mortem analysis to track down a memory leak, or during development to check whether a memory allocation/deallocation scheme actually works. During runtime, dashboards can monitor critical components of the application. Let’s check this out on a live example. We’ll use a setup of three JVMs simulating an application and a StackStorm instance. Each runs in its own Docker container, simulating a distributed system. To make our lives easier, we orchestrate this demo setup using Vagrant:

Figure 1: Memory leak demo setup and control flow

A Memory Measurement node orchestrates the acquisition process. We’ve developed a short Ansible script that connects to several application nodes and retrieves a histogram dump from the JVMs running the faulty program from [1]. It converts the dumps to Graphite metrics and sends them via the collector to Sumo. StackStorm periodically triggers the Ansible workflow. Finally, we use the UI to find and debug memory leaks.

Analyze memory consumption

First, we want to get an overview of what’s going on in the memory. We start by looking at the total memory consumption of a single host. A simple sum over all object sizes yields the application’s memory consumption over time. The steeply increasing curve abruptly comes to an end at a total of about 800 MB.
This is the total memory that we assigned to the JVM (java -Xmx800m -jar memleak-assembly-0.1.jar).

Figure 2: Total memory consumption of host memleak3

Drilling down on the top memory consumers often hints at the classes responsible for a memory leak. For that query, we parse out all objects and sum their counts and sizes. Then we display only the top 10 counts. In the size query, we filter out objects above a certain size. These objects are the root objects of the application and do not contain much information.

Figure 3: Top memory consumers on a single node

Figure 4: Top memory consumers by size

We find out that a Red-Black Tree dominates the objects. Looking at the Scala manual tells us that sorted sets and maps make extensive use of this data structure: “Scala provides implementations of immutable sets and maps that use a red-black tree internally. Access them under the names TreeSet and TreeMap.” We know that ActorSystem uses maps to store and maintain actors. Parsing and aggregating queries help to monitor entire subsystems of a distributed application. We use that to find out that the ActorSystem accumulates memory not only on a single host but over a set of hosts. This leads us to believe that this increase might not be an individual error, but a systemic issue.

Figure 5: Use query parsing and aggregation operations to display the ActorSystem’s memory consumption

A more detailed view of the Child actor reveals the trend of how it accumulates memory. The trick in this query is that in the search part we filter for packages* in the search expression and then use the aggregation part to parse out the single hosts and sum the size values of their individual objects. Since all three JVMs started at the same time, their memory usage increases at a similar rate in this picture. We can also split this query into three separate queries like below. These look at how the Child actors on all three hosts are evolving.
Figure 6: The bad Child actor accumulating memory

Finally, we verify that the patch worked. The latest chart shows that allocation and deallocation are now in balance on all three hosts.

Figure 7: Memory leak removed, all good now

Memory Analysis for Modern Apps

Traditional memory analyzers were born in the era of standalone desktop applications. Therefore, they work on snapshots and heap dumps and cannot track the dynamics of memory allocation and deallocation patterns. Moreover, they are restricted to working on single images, and it is not easy to adapt them to a distributed system. Modern Apps have different requirements. Digital Businesses provide service 24/7, scale out in the cloud, and compete in terms of feature velocity. To achieve feature velocity, detecting memory issues online is more useful than detecting them after the fact. Bugs such as memory leaks need rapid detection, and bugfixes need to be deployed frequently and without stopping services. Pulling heap dumps and starting memory analyzers just won’t work in many cases. Sumo takes memory analysis to the next level. Leveraging Sumo’s Metrics product, we can track memory consumption for classes and objects within an application. We look at aggregations of their counts and sizes to pinpoint the fault. Memory leaks are often hard to find and need superior visibility into an application’s memory stack to become debuggable. Sumo achieves this not only for a single instance of an application but scales memory analysis across the cloud. Additionally, Sumo’s Unified Logs and Monitoring (ULM) enables correlating logs and metrics and facilitates understanding the root cause of a memory leak.

Bottom Line

In this post, we showed how to turn Sumo into a fine-grained, online memory supervision system using modern DevOps tools. The fun doesn’t stop here. The presented framework can easily be extended to include metrics for threads and other resources of an application.
As a result of this integration, developers and operators gain high visibility into the execution of their application.

References

[1] Always stop unused Akka actors (blog post)
[2] Acquire object histograms from multiple hosts (Ansible script)
[3] Sumo’s Modern Apps report (BI report)