
Christian Beedgen, Co-Founder & CTO

An Official Docker Image For The Sumo Logic Collector

12.11.2014 | Posted by Christian Beedgen, Co-Founder & CTO

Learning By Listening, And Doing

Over the last couple of months, we have spent a lot of time learning about Docker, the distributed application delivery platform that is taking the world by storm. We have started looking into how we can best leverage Docker for our own service. And of course, we have spent a lot of time talking to our customers. We have so far learned a lot by listening to them describe how they deal with logging in a containerized environment.

We actually have already re-blogged how Caleb, one of our customers, is Adding Sumo Logic To A Dockerized App. Our very own Dwayne Hoover has written about Four Ways to Collect Docker Logs in Sumo Logic.

Along the way, it has become obvious that it makes sense for us to provide an “official” image for the Sumo Logic Collector. Sumo Logic exposes an easy-to-use HTTP API, but the vast majority of our customers leverage our Collector software as a trusted, production-grade data collection conduit. We are, and will continue to be, excited about folks building their own images for their own custom purposes. Yet the questions we get make it clear that we should release an official Sumo Logic Collector image for use in a containerized world.

Instant Gratification, With Batteries Included

A common way to integrate logging with containers is to use Syslog. This has been discussed before in various places all over the internet. If you can direct all your logs to Syslog, we now have a Sumo Logic Syslog Collector image that will get you up and running immediately:

docker run -d -p 514:514 -p 514:514/udp --name="sumo-logic-collector" sumologic/collector:latest-syslog [Access ID] [Access key]

Started this way, the container’s default Syslog port 514 is mapped to the same port on the host. To test whether everything is working, use telnet on the host:

telnet localhost 514

Then type some text, hit return, and then CTRL-] to close the connection, and enter quit to exit telnet. After a few moments, what you type should show up in the Sumo Logic service. Use a search to find the message(s).

To test the UDP listener, use Netcat on the host, along the lines of:

echo "I'm in ur sysloggz" | nc -v -u -w 0 localhost 514

And again, the message should show up on the Sumo Logic end when searched for.

If you want to start a container that is configured to log to syslog and make it automatically latch on to the Collector container’s exposed port, use linking:

docker run -it --link sumo-logic-collector:sumo ubuntu /bin/bash

From within the container, you can then talk to the Collector listening on port 514 by using the environment variables populated by the linking:

echo "I'm in ur linx" | nc -v -u -w 0 $SUMO_PORT_514_TCP_ADDR $SUMO_PORT_514_TCP_PORT

That’s all there is to it. The image is available from Docker Hub. Setting up an Access ID/Access Key combination is described in our online help.

Composing Collector Images From Our Base Image

Following the instructions above will get you going quickly, but of course it can’t possibly cover all the various logging scenarios that we need to support. To that end, we actually started by first creating a base image. The Syslog image extends this base image, and your future images can easily extend it as well. Let’s take a look at what is actually going on! Here’s the GitHub repo:

One of the main problems we set out to solve was how to create an image that does not require customer credentials to be baked in. Having credentials in the image itself is obviously a bad idea! Putting them into the Dockerfile is even worse. The trick is to leverage a not-so-well-documented command-line switch on the Collector executable that passes the Sumo Logic Access ID and Access Key combination to the Collector. Here’s the meat of the startup script referenced in the Dockerfile:

/opt/SumoCollector/collector console -- -t -i $access_id -k $access_key -n $collector_name -s $sources_json

The rest is really just grabbing the latest Collector Debian package and installing it on top of a base Ubuntu 14.04 system, invoking the start script, checking arguments, and so on.
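For the curious, a rough sketch of what such a Dockerfile could look like (this is illustrative only; the real one lives in the GitHub repo, and the download URL and script name here are assumptions):

FROM ubuntu:14.04
MAINTAINER Sumo Logic
# Illustrative: fetch and install the latest Collector Debian package.
RUN apt-get update && apt-get install -y wget && \
    wget -q "https://collectors.sumologic.com/rest/download/deb/64" -O /tmp/collector.deb && \
    dpkg -i /tmp/collector.deb
# run.sh checks arguments and starts the Collector with the credentials
# passed on the command line (see the startup script excerpt above).
ADD run.sh /run.sh
ENTRYPOINT ["/run.sh"]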

As part of our continuous delivery pipeline, we are getting ready to update the Docker Hub-hosted image every time a new Collector is released. This will ensure that when you pull the image, the latest and greatest code is available.

How To Add The Batteries Yourself

The base image is intentionally kept very sparse and essentially ships with “batteries not included”. By itself, it will not lead to a working container. This is because the Sumo Logic Collector has a variety of ways to set up the actual log collection. It supports tailing files locally and remotely, as well as pulling Windows event logs locally and remotely.

Of course, it can also act as a Syslog sink. And it can do any of this in any combination, all at the same time. Therefore, the Collector is either configured manually via the Sumo Logic UI or (and this is almost always the better way) via a configuration file. The configuration file, however, changes from use case to use case and from customer to customer. Baking it into a generic image simply makes no sense.

What we did instead is provide a set of examples, found in the same GitHub repository under “example”. There are a couple of sumo-sources.json example files illustrating, respectively, how to set up file collection and how to set up Syslog UDP and Syslog TCP collection. The idea is to let you either take one of the example files verbatim or use one as a starting point for your own sumo-sources.json. Then you can build a custom image using our image as the base image. To make this more concrete, create a new folder and put this Dockerfile in there:

FROM sumologic/collector
MAINTAINER Happy Sumo Customer
ADD sumo-sources.json /etc/sumo-sources.json

Then put a sumo-sources.json groomed to fit your use case into the same folder, build the image, and enjoy.
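Building and tagging the image is then a single command (the tag name here is just an example):

docker build -t my-sumo-collector .

After that, run it the same way as the stock images, passing your Access ID and Access Key as arguments.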

A Full Example

Using this approach, if you want to collect files from various containers, mount a directory on the host to the Sumo Logic Collector container. Then mount the same host directory to all the containers that use file logging. In each container, setup logging to log into a subdirectory of the mounted log directory. Finally, configure the Collector to just pull it all in.

The Sumo Logic Collector has for years been used across our customer base in production for pulling logs from files. More often than not, the Collector is pulling from a deep hierarchy of files on some NAS mount or equivalent. The Collector is quite adept and battle tested at dealing with file-based collection.

Let’s say the logs directory on the host is called /tmp/clogs. Before setting up the source configuration accordingly, make a new directory for the files describing the image. Call it for example sumo-file. Into this directory, put this Dockerfile:

FROM sumologic/collector
MAINTAINER Happy Sumo Customer
ADD sumo-sources.json /etc/sumo-sources.json

The Dockerfile extends the base image, as discussed. Next to the Dockerfile, in the same directory, there needs to be a file called sumo-sources.json which contains the configuration:

{
  "api.version": "v1",
  "sources": [
    {
      "sourceType": "LocalFile",
      "name": "localfile-collector-container",
      "pathExpression": "/tmp/clogs/**",
      "multilineProcessingEnabled": false,
      "automaticDateParsing": true,
      "forceTimeZone": false,
      "category": "collector-container"
    }
  ]
}

With this in place, build the image, and run it:

docker run -d -v /tmp/clogs:/tmp/clogs --name="sumo-logic-collector" [image name] [your Access ID] [your Access key]

Finally, add -v /tmp/clogs:/tmp/clogs when running other containers that are configured to log to /tmp/clogs in order for the Collector to pick up the files.
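For example, assuming a hypothetical application image called my-app that writes its logs under /tmp/clogs:

docker run -d -v /tmp/clogs:/tmp/clogs --name="my-app" my-app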

Just like the ready-to-go Syslog image described in the beginning, a canonical image for file collection is available:

docker run -v /tmp/clogs:/tmp/clogs -d --name="sumo-logic-collector" sumologic/collector:latest-file [Access ID] [Access key]

If you want to learn more about using JSON to configure sources to collect logs with the Sumo Logic Collector, there is a help page with all the options spelled out.

That’s all for today. We have more coming. Watch this space. And yes, comments are very welcome.

Brandon Mensing

Keep Tabs on Your Box

12.09.2014 | Posted by Brandon Mensing

You can argue till you’re blue in the face about which cloud storage service to use these days. But you can’t argue (at least not with me) about whether you should be auditing what’s going on in those tools. Sure, these services have decent EULAs and plenty of encryption. However, that doesn’t mean your own employees are following your policies. And let’s not forget how easily passwords fail.

The good news is that you can get set up to monitor your Box account with Sumo Logic in a few simple steps. Following our documentation, a simple script talks to the Box service, retrieves audit events, and brings them into Sumo Logic via a Script Source. Then just install the Box app from our service (a few simple clicks) and you’ll be able to quickly inspect how your organization is using Box.
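For the curious, pulling enterprise audit events out of Box boils down to a call against the Box events API. This is a sketch, not the script from our documentation, and it assumes you already have a valid OAuth access token with admin visibility:

curl -s -H "Authorization: Bearer $BOX_ACCESS_TOKEN" \
  "https://api.box.com/2.0/events?stream_type=admin_logs&limit=100"

The script from our documentation wraps a call like this and hands the resulting events to a Sumo Logic Script Source.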

With our off-the-shelf dashboards, I encourage all of you using Box to explore your deployment. You may be surprised at what you find. Odd behavior patterns can bubble up, showing you that a critically sensitive document is being overshared; that one of your users has recently shared thousands upon thousands of documents in an apparent attempt to walk out the door with your IP; or that an unexpected user is logged in and accessing documents.

Cloud storage systems promise a lot of security and privacy but ultimately it’s still up to you to identify malicious users, policy violations and security threats. Using this great new app, you’ll find that it’s easy and effective to monitor and investigate activities in your Box account. Give it a try today and use our support portal to send us your feedback!

Christian Beedgen, Co-Founder & CTO

Shifting Into Overdrive

12.02.2014 | Posted by Christian Beedgen, Co-Founder & CTO


How Our Journey Began

Four years ago, my co-founder Kumar and I were just two guys who called coffee shops our office space. We had seen Werner Vogels’ AWS vision pitch at Stanford and imagined a world of yottabyte scale where machine learning algorithms could make sense of it all. We dreamed of becoming the first and only native cloud analytics platform for machine-generated data and next-gen apps, and we dreamed that we would attract and empower customers. We imagined the day when we’d get our 500th customer. After years of troubleshooting scale limitations in on-premises enterprise software deployments, we bet our life savings on the conviction that multi-tenant cloud apps could scale to the infinite data volumes that were just a few years away.

Eclipsing Our First Goal

Just a few weeks ago, we added our 500th enterprise customer, a little over two years since Sumo Logic’s inception. As software developers, the most gratifying part of our job is when customers use and love our software. This past month has been the most gratifying part of the journey so far, as I’ve travelled around the world meeting with dozens of happy customers. In each city, I’m blown away by the impact that Sumo Logic has on our customers’ mission-critical applications. Our code works, our customers love our software, and our business is taking off faster than we could have imagined.

Momentum Is Kicking In

Our gratitude for our customers only grows when we dig through the stats of what we’ve been able to build together with our world-class investors and team of 170+ Sumos. In the last quarter alone, we exceeded expectations with:

  • 100%+ Quarter over Quarter ACV growth
  • 100+ new customer logos
  • 12 new 1 Terabyte/day customers
  • 1 Quadrillion new logs indexed

And dozens of new Sumos are bringing badass skills from companies like Google, Atlassian, Microsoft, Akamai, and even VMware…

Shifting Into Overdrive

It is still early days, and we have a long road of building ahead of us. Big data is approaching a $20B-per-year industry. And we’re addressing machine data, which is growing 5X faster than any other segment of data. No company has built a platform for machine data that approaches our scale in the cloud:

  • 1 million events ingested per second
  • 8 petabytes scanned per day
  • 1 million queries processed per day

Today, we’re excited to share the news that Ramin Sayar will be joining us to lead Sumo Logic as our new president and CEO. With 20 years of industry experience, he has a proven track record of remarkable leadership, incubating and growing significant new and emerging businesses within leading companies. He comes to us from VMware, where he was Sr. Vice President and General Manager of the Cloud Management Business Unit. In his time at VMware, he developed the product and business strategy and led the fastest-growing business unit. He was responsible for business strategy, R&D, the operating P&L, product management, product marketing, and field/business operations for VMware’s industry-leading cloud management offerings.

Our mission remains the same: to enable businesses to harness the power of machine data to improve their operations and deliver outstanding customer experience. With our current momentum and Ramin’s leadership, I am extremely excited about the next chapter in Sumo Logic’s journey. Please know how grateful we are to you, our customers, partners, and investors, for your belief in us and for the privilege to innovate on your behalf every day.

Binh Nguyen

Improving Misfit Wearables’ Devices With Sumo Logic

11.20.2014 | Posted by Binh Nguyen

At Misfit Wearables, we’ve been using Sumo Logic with great success, and we wanted to share our story. We create smart devices that promote fitness and wellness, and capturing and analyzing data are critical to developing great devices. We previously used a well-known open source log management tool. It was slow and limited and didn’t really deliver the value we were looking for, yet for some time we had come to merely accept what we had configured it to do for us. When we made the change to Sumo Logic, we saw some big changes. In short, Sumo Logic is an effective tool for managing logs and analyzing data, and it is widely used by all our engineers at Misfit. But it goes way beyond that. The feedback from our product team has been tremendous, and the biggest difference is performance: a job that once took five to ten minutes now takes several seconds. That was only the beginning of our Sumo Logic story, because we are now implementing the Partition feature throughout our environment, and we’re already seeing results. Partitions have taken us to another level of performance improvement.


The Misfit Environment

To describe our environment: we collect a lot of data, but our setup is not unusual. Each day, we collect various logs from different collectors such as servers, clients, customers, websites and stores. Then, using the Data Forwarding option, all of these can be backed up to an AWS S3 bucket. Pictured below is a report from the Sumo Logic dashboard of the total queries run daily.



Fig 1. Daily query by Sumo Logic


Sumo Logic also provides us with an Anomaly Detection tool that can help us automatically uncover security and other issues in real time. For those who are new to the software, there are a number of useful supporting apps that make things easier, such as Amazon CloudFront, Data Volume, AWS CloudTrail and Log Analysis Quick Start. Adoption and value have thus come about quickly at Misfit, and we are constantly finding ways to save time and effort with this powerful tool.



 Fig 2. Sumo Logic Applications Collection



Recently, Sumo Logic introduced a feature called “Partitions,” and we have started implementing it in a number of productive ways. For example, we can now easily filter a subset of one collector into a partition by creating an index. With this approach, we have seen drastically improved search query performance, because the total number of messages that needs to be searched is reduced, and partition indexes can automatically be included in searches.

To better understand the new feature, we set up the following test:

1. We chose a small collector, which accounts for only around 2% of total daily volume.

2. We then measured the query time over the last N days (N = 1, 2, 3, …, 14), using the default index and a partition index for this collector.








Fig 3. Query time by using the default index versus an index by partitions. (a) Query time by the default index. (b) Query time by using an index for this collector.  (c) The ratio between the query time by using an index and the query time by using the default index.


In Figure 3, for example, searching logs over the “Last 14 days” takes about 154 seconds using the default index, but only about 12 seconds using an index created for this collector, roughly a 13x speedup. The ratio between the non-partitioned and partitioned indexes over various time periods can be seen in Figure 3c. We save a lot of time, effort and resources with this new feature, which helps shorten product development cycles.

Saving time and resources on queries and analysis is critical to our product. We at Misfit Wearables have enjoyed using Sumo Logic, and we look forward to further features from the company that help us do what we do every day: improve our product day by day, and even hour by hour.


Binh Nguyen @ Misfit Wearables 


Brandon Mensing

Use AWS CloudTrail to Avoid a Console Attack

11.11.2014 | Posted by Brandon Mensing

Our app for AWS CloudTrail now offers a dashboard specifically for monitoring console login activity. In the months since the AWS team added this feature, we decided to break out these user activities in order to provide better visibility into what’s going on with your AWS account.

Many of you might think of this update as incremental and not newsworthy, but I’m actually writing here today to tell you otherwise! More and more people are using APIs and CLIs (and third parties) to work with AWS outside the console. As console logins are becoming more and more rare and as more business-critical assets are being deployed in AWS, it’s critical to always know who’s logged into your console and when.

For a great and terrifying read about just how badly things can go wrong when someone gains access to your console, look no further than the story of Code Spaces. With one story opening with “was a company” and another “abruptly closed,” there isn’t exactly a lot of suspense about how things turned out. After attackers managed to gain access to Code Spaces’ AWS console, they built themselves a stronghold of backdoors and began an attempt to extort money from the company. When the attackers’ accounts were removed, they quickly used the additional users they had created to get back in and begin taking out infrastructure and data. With the service down and their customers’ data in disarray, all trust in the product was lost. The company was effectively destroyed in a matter of hours.

The new dashboard in our updated CloudTrail app allows you to quickly see who’s attempting to log in to your console, from where, and whether or not they’re using multi-factor authentication (which we highly recommend).

CloudTrail Console Logins

If you haven’t installed the app previously, be sure to follow the simple steps in our documentation to set up the appropriate permissions in AWS. If you have already installed the app, you can install it again to get a new copy with the additional dashboard included. From there, we encourage you to customize the queries for your specific situation, and even consider setting up a scheduled search to alert you to problematic situations.

Keeping an eye out for suspicious activity on your AWS console can provide invaluable insight. As attackers get more sophisticated, it’s harder and harder to keep your business secure and operational. With the help of Sumo Logic and logs from AWS CloudTrail, you can stay ahead of the game by preventing the most obvious (and most insidious) types of breaches. With functionality like this, perhaps Code Spaces would still be in business.

Eillen Voellinger

Certified, bonafide, on your side

11.05.2014 | Posted by Eillen Voellinger

“How do I build trust with a cloud-based service?” This is the most common question Sumo Logic is asked, and we’ve got you covered. We built the service to be not just an effortless choice for enterprise customers but the obvious one, and building trust through a secure architecture was one of the first things we took care of.

Sumo Logic is SOC 2 Type 2 and HIPAA compliant. Sumo Logic also complies with the U.S.-E.U. Safe Harbor framework and will soon be PCI DSS 3.0 compliant. No other cloud-based log analytics service can say this. For your company, this means you can safely get your logs into Sumo Logic, a service you can trust and a service that will protect your data just like you would.

These are no small accomplishments, and it takes an A-team to get them done. It all came together when we hired Joan Pepin, a phreak and a hacker by her own admission. Joan is our VP of Security and CISO. She was employee number 11 at Sumo Logic, and her proficiency has helped shape our secure service.

Our secure architecture is also a perfect match for our “Customer First” policy and agile development culture. We make sure that we are quickly able to meet customer needs and to fix issues in real-time without compromising our secure software development processes. From network security to secure software development practices, we ensured that our developers are writing secure code in a peer-reviewed and process-driven fashion.

Sumo Logic was built from the ground up to be secure, reliable, fast, and compliant. Joan understands what it means to defend a system, keep tabs on it, and watch it function live. Joan worked for the Department of Defense. She can’t actually talk about what she did when she was there, but we can confirm that she was there, because the Department of Defense, as she puts it, “thought my real world experience would balance off the Ph.Ds.”

Joan learned the craft from Dr. Who, a member of the Legion of Doom. If hacker groups were rock and roll, the Legion of Doom would be Muddy Waters, Chuck Berry, and Buddy Holly. They created the idea of a hacker group. They hacked into a number of state 911 systems and stole the documentation on them, distributing it through BBSes across the United States. They were the original famous hacking group. Joan is no Jane-come-lately; she’s got the best resume you can have in this business.

We’re frequently asked about the security procedures we follow at Sumo Logic. Security is baked into every component of our service. Beyond the various attestations I mentioned earlier, we also encrypt data at rest and in transit. Other security processes that are core to the Sumo Logic service include:

+ Centrally managed, FIPS-140 two-factor authentication devices for operations personnel

+ Biometric access controls

+ Whole-disk encryption

+ Thread-level access controls

+ Whitelisting of individual processes, users, ports and addresses

+ Strong AES-256-CBC encryption

+ Regular penetration tests and vulnerability scans

+ A strong Secure Development Life-Cycle (SDLC)

+ Threat intelligence and managed vulnerability feeds to stay current with the constantly evolving threatscape and security trends

If you’re still curious about the extent to which our teams have gone to keep your data safe, check out our white paper on the topic:

We use our own service to capture our own logs, which has helped us achieve our enviable security and compliance posture. We’ve done the legwork so that your data is secure and so that you can use Sumo Logic to meet your own security and compliance needs. We have been there and done that with the Sumo Logic service, and now it’s your turn.

Karthik Anantha Padmanabhan

Optimizing Breadth-First Search for Social Networks

10.28.2014 | Posted by Karthik Anantha Padmanabhan

Social network graphs, like the ones captured by Facebook and Twitter, exhibit small-world characteristics [2][3]. In 2011, Facebook documented that among all Facebook users at the time of their research (721 million users with 69 billion friendship links), the average shortest path distance between users is 4.74. This simply means that, on average, any two people in the world are separated by just five other people. It’s a small world indeed! Formally, a small-world network is defined as a network where the typical distance L between two randomly chosen nodes grows proportionally to the logarithm of the number of nodes N in the network [4].

Consider the following scenario: you find a social network profile and want someone to introduce you to the person in that profile. Luckily, you are given the entire friendship graph captured by this social network. If there are mutual friends, you just ask one of them to help you out. If not, you need some sequence of friend introductions to finally meet that person. What is the minimum number of intermediate friend introductions you need to meet the person you are interested in? This is equivalent to finding the shortest path in the social network graph between you and that person. The solution is to run Breadth-First Search on the graph with your profile as the starting vertex. The other interesting question is: if we know our graph exhibits small-world properties, can we make exhaustive Breadth-First Search (BFS) faster? The ideas expressed on this topic appeared in Beamer et al. [1], where the authors optimized BFS for the number of edges traversed.

Breadth-First Search:

BFS uses the idea of a frontier that separates the visited nodes from unvisited nodes. The frontier holds the nodes of the recently visited level and is used to find the next set of nodes to be visited. On every step of BFS, the current frontier is used to identify the next frontier from the set of unvisited nodes.


Figure 1. A simple graph

Looking at the example in the figure, the current frontier consists of nodes 1, 2, 3, 4 and 5. The edges from these nodes are examined to find nodes that have not been visited. In this case, node 2’s edges are used to mark H and add it to the next frontier. But note that even though H has been marked by 2, nodes 3, 4 and 5 still inspect H to see whether it has been visited.

Pseudocode for Naive BFS [5]:

Input: A graph G = (V, E) with vertex set V and edge set E, and a source vertex s

Output: parent: Array[Int], where parent(v) gives the parent of v in the graph, or -1 if a parent does not exist

import scala.collection.mutable.ArrayBuffer

class BFS(g: Graph) {
  // parent(v) == -1 marks v unvisited; the source is marked with -2.
  val parent: Array[Int] = ArrayBuffer.fill(g.numberVertices)(-1).toArray

  def bfs(source: Int, updater: (Seq[Int], Array[Int]) => Seq[Int]): Unit = {
    var frontier = Seq(source)
    parent(source) = -2
    while (frontier.nonEmpty) {
      frontier = updater(frontier, parent)
    }
  }
}

trait TopDownUpdater extends FrontierUpdater {
  // Examine every edge out of the frontier; unvisited neighbors form the next frontier.
  def update(frontier: Seq[Int], parents: Array[Int]): Seq[Int] = {
    val next = ArrayBuffer[Int]()
    frontier.foreach { node =>
      graph.getNeighbors(node).filter(parents(_) == -1).foreach { neighbor =>
        next += neighbor
        parents(neighbor) = node
      }
    }
    next
  }
}

One observation about conventional BFS (henceforth referred to as top-down BFS) is that it always performs at its worst-case complexity, i.e., O(|V| + |E|), where |V| and |E| are the number of vertices and edges respectively. For example, if a node v has p parents, then we only need to explore an edge from one of those p parents to v to check for connectivity. But top-down BFS checks all incoming edges to v.

The redundancy of these additional edge lookups is more pronounced when top-down BFS is run on graphs exhibiting small-world properties. As a consequence of the definition of small-world networks, the number of nodes increases exponentially with the effective diameter of the network, which results in large networks with very low diameters. The low diameter of these graphs forces them to have a large number of nodes at each level, which leads to top-down BFS visiting a large number of nodes in every step and makes the frontier very large. Traversing the edges of the nodes in the frontier is the major computation performed, and top-down BFS unfortunately ends up visiting all the outgoing edges of the frontier. Moreover, it has also been shown in [1] that most of the edge lookups from the frontier nodes end up in already-visited nodes (marked by some other parent), which gives further evidence that iterating through all edges of the frontier can be avoided.

The idea behind bottom-up BFS [1] is to avoid visiting all the edges of the nodes in the frontier, which is a pretty useful thing to do for the reasons mentioned above. To accomplish this, bottom-up BFS traverses the edges of the unvisited nodes to find a parent in the current frontier. If an unvisited node has at least one of its parents in the current frontier, then that node is added to the next frontier. To efficiently find if a node’s parent is present in the frontier, the frontier data structure is changed to a bitmap.


Figure 2. Bottom up BFS

In the above example, {H, I, J, K} are the unvisited nodes. However, only nodes {H, J} have a neighbor in the current frontier, and as a result the next frontier becomes {H, J}. In the next iteration, the set of unvisited nodes will be {I, K}, and each of them has a parent in the current frontier {H, J}. So {I, K} will be visited, and the search will complete in the next iteration, since all nodes will have been visited and no more nodes can be added to the next frontier.


Pseudocode for Bottom-up BFS:

Input: A graph G = (V, E) with vertex set V and edge set E, and a source vertex s

Output: parent: Array[Int], where parent(v) gives the parent of v in the graph, or -1 if a parent does not exist

import scala.collection.mutable
import scala.collection.mutable.BitSet

trait DirectedSerialAncestorManager extends SerialAncestorManager {
  var _graph: SerialDirectedGraph = _
  // The body was elided in the original gist; assume the graph can
  // return a node's incoming neighbors (its potential parents).
  def getAncestor(id: Int): IndexedSeq[Int] = _graph.getIncomingNeighbors(id)
  def getVertices: IndexedSeq[Int] = 0 to _graph.numberVertices - 1
}

trait SBottomUpUpdater extends FrontierUpdater with SerialAncestorManager {
  // For each unvisited node, stop at the first parent found in the frontier.
  def update(frontier: BitSet, parents: Array[Int]): BitSet = {
    val next = mutable.BitSet()
    getVertices.filter(parents(_) == -1).foreach { node =>
      getAncestor(node).find(frontier) match {
        case Some(ancestor) =>
          parents(node) = ancestor
          next(node) = true
        case None => ()
      }
    }
    next
  }
}

The major advantage of this approach is that the search for an unvisited node’s parent terminates as soon as any one parent is found in the current frontier. Contrast this with top-down BFS, which must visit all the neighbors of every node in the frontier during each step.

Top-down, Bottom-up, or both?

When the frontier is large, you gain by performing bottom-up BFS, as it only examines some edges of the unvisited nodes. But when the frontier is small, bottom-up BFS may not be advantageous: apart from having to scan all vertices, it incurs the additional overhead of identifying the unvisited nodes. Small-world networks usually start off with small frontiers in the initial steps and see an exponential increase in frontier size in the middle stages of the search. These tradeoffs lead us to another approach for small-world networks that combines both top-down and bottom-up BFS: hybrid BFS [1]. In hybrid BFS, the size of the frontier is used as a heuristic to switch between the two approaches. A thorough analysis of this heuristic is presented in [1]; a minimal sketch of the switching logic follows.
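This sketch assumes a FrontierUpdater trait exposing update(frontier: Seq[Int], parents: Array[Int]): Seq[Int], as in the pseudocode above. The fixed threshold is purely illustrative; [1] derives the switch point from the number of edges into and out of the frontier rather than a constant fraction:

class HybridBFS(g: Graph, topDown: FrontierUpdater, bottomUp: FrontierUpdater) {
  // Illustrative constant; not the tuned heuristic from [1].
  val switchFraction = 0.05

  def step(frontier: Seq[Int], parents: Array[Int]): Seq[Int] =
    if (frontier.size > (g.numberVertices * switchFraction).toInt)
      bottomUp.update(frontier, parents) // large frontier: scan unvisited nodes
    else
      topDown.update(frontier, parents) // small frontier: expand frontier edges
}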

How about parallelizing these approaches?

When trying to parallelize the two approaches, observe that bottom-up BFS is easier to parallelize than top-down BFS. In bottom-up BFS, you can introduce parallelism at the stage where you populate the next frontier. Each unvisited node can be examined in parallel, and since every node updates only its own entry in the next data structure, no locks are required.

import scala.collection.mutable.BitSet
import scala.collection.parallel.ParSeq

trait ParallelAncestorManager {
  def getAncestor(id: Int): ParSeq[Int]
  def getParVertices: ParSeq[Int]
}

trait PBottomUpUpdater extends FrontierUpdater with ParallelAncestorManager {
  // Unvisited nodes are examined in parallel; each node writes only its
  // own slots in `parents` and `next`, so no locks are required.
  def update(frontier: Seq[Int], parents: Array[Int]): Seq[Int] = {
    val next = BitSet()
    val frontierSet = frontier.toSet
    getParVertices.filter(parents(_) == -1).foreach { node =>
      getAncestor(node).find(frontierSet.contains) match {
        case Some(ancestor) =>
          parents(node) = ancestor
          next(node) = true
        case None => ()
      }
    }
    next.toSeq
  }
}

On inspecting the top-down BFS pseudocode for sources of parallelism, observe that the nodes in the current frontier can be explored in parallel. The parallel top-down pseudocode is:

trait PTopDownUpdater extends FrontierUpdater {
  // Frontier nodes are expanded in parallel; `next` and `parents` are
  // updated concurrently (see the race discussion below).
  def update(frontier: Seq[Int], parents: Array[Int]): Seq[Int] = {
    val next = ArrayBuffer[Int]()
    frontier.par.foreach { node =>
      graph.getNeighbors(node).filter(parents(_) == -1).foreach { neighbor =>
        next += neighbor
        parents(neighbor) = node
      }
    }
    next
  }
}

In terms of correctness, the above pseudocode looks fine, but there is a benign race condition introduced by concurrently updating parents and next. A node may be added to the next frontier more than once, which makes the step less efficient but does not affect the correctness of the algorithm. Cleaner code would use a synchronized block to ensure that only one thread updates the frontier at a time, as sketched below.
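For illustration, here is one way such a synchronized variant could look. The trait name is ours, not from [1], and coarse locking like this sacrifices much of the parallelism:

trait SafePTopDownUpdater extends FrontierUpdater {
  def update(frontier: Seq[Int], parents: Array[Int]): Seq[Int] = {
    val next = ArrayBuffer[Int]()
    frontier.par.foreach { node =>
      graph.getNeighbors(node).foreach { neighbor =>
        // Re-check visitation under the lock so each node is claimed exactly once.
        next.synchronized {
          if (parents(neighbor) == -1) {
            next += neighbor
            parents(neighbor) = node
          }
        }
      }
    }
    next
  }
}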

The hybrid approach, combining the parallel versions of top-down and bottom-up BFS, provides one of the fastest single-node implementations of parallel BFS [1].


  1. Beamer, Scott, Krste Asanović, and David Patterson. “Direction-Optimizing Breadth-First Search.” Scientific Programming 21.3 (2013): 137-148.

  2. Ugander, Johan, et al. “The Anatomy of the Facebook Social Graph.” arXiv preprint arXiv:1111.4503 (2011).

  3. Li, Jun, Shuchao Ma, and Shuang Hong. “Recommendation on Social Network Based on Graph Model.” Proceedings of the 31st Chinese Control Conference (CCC). IEEE, 2012.

  4. Watts, Duncan J., and Steven H. Strogatz. “Collective Dynamics of ‘Small-World’ Networks.” Nature 393.6684 (1998): 440-442.

  5. Cormen, Thomas H., Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. MIT Press, 1990.

Ariel Smoliar, Senior Product Manager

Transaction Mining for Deeper Machine Data Intelligence

10.22.2014 | Posted by Ariel Smoliar, Senior Product Manager

The new Sumo Logic Transaction capability allows users to analyze related sequences of machine data. The comprehensive views uncover user behavior, operational and security insights that can help organizations optimize business strategy, plans and processes.

The new capability allows you to monitor transactions by a specific transaction ID (session ID, IP, user name, email, etc.) while handling data from distributed systems, where a request is passed through several different systems, each with its own transaction ID.

Over the past two months, we have worked with beta customers on a variety of use cases, including:

  • Tracking transactions in a payment processing platform

  • Following typical user sessions, detecting anomalous checkout transactions and catching checkout drop off in e-commerce websites

  • Tracking renewals, upgrades and new signup transactions

  • Monitoring phone registration failures over a specific period

  • Tracking on-boarding of new users in SaaS products

The last use case is reflective of what SaaS companies care most about: truly understanding the behaviors of users on their website that drive long-term engagement. We’ve used our new transaction analytics capabilities to better understand how users find our site, the process by which they get to our Sumo Logic Free page, and how quickly they sign up. Our customer success team uses Transaction Analytics to monitor how long it takes users to create a dashboard, run a search, and perform other common actions. This enables them to provide very specific feedback to the product team for future improvements.

This screenshot depicts a query with IP as the transaction ID and the various states mapped from the logs.

A Sankey diagram visualizes the flow between the various components/states of a transaction on an e-commerce website.

Many of our customers already use tools such as Google Analytics to monitor visitor flow on their websites and understand customer behavior. We are not launching this new capability to replace Google Analytics (even if it is not embraced in some countries, such as Germany). What we bring on top of monitoring visitor flow is the ability to identify divergence in state sequences and to better understand the transitions between states, in terms of latency for example. You have probably seen companies announcing plugins for log management platforms to detect anomalies and monitor user behavior and sessions. Our product philosophy is to provide users a well-rounded capability that enables them to make smart choices without requiring external tools, all from their machine data within the Sumo product.

It was a fascinating journey working on the transaction capability with our analytics team. It’s a natural evolution of our analytics strategy, which now includes: 1) real-time aggregation and correlation with our Dashboards; 2) machine learning to automatically uncover anomalies and patterns; and 3) transaction analytics to rapidly uncover relationships across distributed events.

We are all excited to launch Transaction Analytics. Please share with us your feedback on the new capability and let us know if we can help with your use cases. The transaction searches and the new visualization are definitely our favorite content.

Amanda Saso, Principal Tech Writer

Data, with a little Help from my friends

10.20.2014 | Posted by Amanda Saso, Principal Tech Writer


Ever had that sinking feeling when you start a new job and wonder just why you made the jump? I had a gut check when, shortly after joining Sumo Logic in June of 2012, I realized that we had less than 50 daily hits to our Knowledge Base on our support site. Coming from a position where I was used to over 7,000 customers reading my content each day, I nearly panicked.  After calming down, I realized that what I was actually looking at was an amazing opportunity.

Fast forward to 2014. I’ve already blogged about the work I’ve done with our team to bring new methods to deliver up-to-date content. (If you missed it, you can read the blog here.) Even with these improvements I couldn’t produce metrics that proved just how many customers and prospects we have clicking through our Help system. Since I work at a data analytics company, it was kind of embarrassing to admit that I had no clue how many visitors were putting their eyes on our Help content. I mean, this is some basic stuff!

Considering how much time I’ve spent working with our product, I knew that I could get all the information I needed using Sumo Logic…if I could get my hands on some log data. I had no idea how to get logging enabled, not to mention how logs should be uploaded to our Service. Frankly, my English degree is not conducive to solving engineering challenges (although I could write a pretty awesome poem about my frustrations). I’m at the mercy of my Sumo Logic co-workers to drive any processes involving how Help is delivered and how logs are sent to Sumo Logic.  All I could do was pitch my ideas and cross my fingers.

I am very lucky to work with a great group of people who are happy to help me out when they can. This is especially true of Stefan Zier, our Chief Architect, who once again came to my aid. He decommissioned old Help pages (my apologies to anyone who found their old bookmarks rudely displaying 404’s) and then routed my Help from the S3 bucket through our product, meaning that Help activity can be logged. I now refer to him as Stefan, Patron Saint of Technical Writers. Another trusty co-worker we call Panda helped me actually enable the logging.

Once the logging began we could finally start creating some Monitors to build out a Help Metrics Dashboard. In addition to getting the number of hits and the number of distinct users, we really wanted to know which pages were generating the most hits (no surprise that search-related topics bubbled right to the top). We’re still working on other metrics, but let me share just a few data points with you.


Take a look at the number of hits our Help site has handled since October 1st:


We now know that Wednesday is when you look at Help topics the most:


And here’s where our customers are using Help, per our geo lookup operator Monitor:


It’s very exciting to see how much Sumo Logic has grown, and how many people now look at content written by our team, from every corner of the world. Personally, it’s gratifying to feel a sense of ownership over a dataset in Sumo Logic, thanks to my friends.

What’s next from our brave duo of tech writers? Beyond adding more logging, we’re working to find a way to get feedback on Help topics directly from users. If you have any ideas or feedback in the short term, please shoot us an email. We would love to hear from you!

Kumar Saurabh, Co-Founder & VP of Engineering

Machine Data Intelligence – an update on our journey

10.16.2014 | Posted by Kumar Saurabh, Co-Founder & VP of Engineering

In 1965, Dr. Hubert Dreyfus, a professor of philosophy at MIT, later at Berkeley, was hired by RAND Corporation to explore the issue of artificial intelligence.  He wrote a 90-page paper called “Alchemy and Artificial Intelligence” (later expanded into the book What Computers Can’t Do) questioning the computer’s ability to serve as a model for the human brain.  He also asserted that no computer program could defeat even a 10-year-old child at chess.

Two years later, in 1967, several MIT students and professors challenged Dreyfus to play a game of chess against MacHack (a chess program that ran on a PDP-6 computer with only 16K of memory). Dreyfus accepted. At one point, Dreyfus found a move that could have captured the enemy queen. The only way the computer could escape was to keep Dreyfus in check with its own queen until it could fork Dreyfus’s queen and king and then exchange queens. And that is what the computer did. It went on to checkmate Dreyfus in the middle of the board.

I’ve brought up this “man vs. machine” story because I see another domain where a similar change is underway: the field of Machine Data.

Businesses run on IT, and IT infrastructure is getting bigger by the day, yet IT operations still depend on analytics tools with very basic monitoring logic. As systems become more complex (and more agile), simple monitoring just doesn’t cut it. We cannot support or sustain the necessary speed and agility unless the tools become much more intelligent.

We believed in this when we started Sumo Logic, and with the learnings from running a large-scale system ourselves, we continue to invest in making operational tooling more intelligent. We knew the market needed a system that complemented human expertise. Humans don’t scale that well; our memory is imperfect, so the ideal tools should pick up on signals that humans cannot, at a scale that matches business needs and today’s scale of IT data exhaust.

Two years ago we launched our service with a pattern recognition technology called LogReduce, and about five months ago we launched Structure-Based Anomaly Detection. The last three months of the journey have been a lot like teaching a chess program new tricks: the game remains the same, but the system keeps getting better at it and more versatile.

We are now extending our Structure-Based Anomaly Detection capabilities with Metric-Based Anomaly Detection. A metric is just that: a time series of numeric values. You can take any log and filter, aggregate, and pre-process it however you want, and if you can turn it into a number with a timestamp, we can baseline it and automatically alert you when the current value of the metric goes outside the expected range based on its history. We developed this new engine in collaboration with the Microsoft Azure Machine Learning team, who have some really compelling models for detecting anomalies in a time series of metric data; you can read more about that here.
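To make “baselining” concrete, here is a toy sketch of one classic approach: flagging points that drift outside a tolerance band around a trailing mean. This only illustrates the idea; it is not the engine we built with the Azure Machine Learning team:

case class Point(timestampMs: Long, value: Double)

// Flag points outside mean +/- k * stddev of the trailing window.
def anomalies(series: Seq[Point], window: Int = 60, k: Double = 3.0): Seq[Point] =
  series.indices.flatMap { i =>
    val history = series.slice(math.max(0, i - window), i).map(_.value)
    if (history.size < window) None // not enough history to baseline yet
    else {
      val mean = history.sum / history.size
      val std = math.sqrt(history.map(v => (v - mean) * (v - mean)).sum / history.size)
      if (math.abs(series(i).value - mean) > k * std) Some(series(i)) else None
    }
  }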

The hard part about Anomaly Detection is not detecting anomalies; it is detecting anomalies that are actionable. Making an anomaly actionable begins with making it understandable. Once an analyst or operator can grok an anomaly, they are much more amenable to alerting on it, building a playbook around it, or even hooking up automated remediation to the alert: the Holy Grail.

And not all Anomaly Detection engines are equal. Like chess programs, there are ones that can beat a five-year-old and others that can beat even the grandmasters. We are well on our way to building a comprehensive Anomaly Detection engine that becomes a critical tool in every operations team’s arsenal. The key question to ask is: does the engine tell you something that is insightful, actionable, and that you could not have found with standard monitoring tools?

Below is an example from an actual Sumo Logic production use case, where some of our nodes were spending a lot of time in garbage collection, impacting dashboard refresh rates for some of our customers.


If this looks interesting, our Metric-Based Anomaly Detection service based on Azure Machine Learning is being offered to select customers in a limited beta release, and will be coming soon to machines…err…a browser near you (we are a cloud-based service, after all).

P.S. If you like stories, here is another one for you. Thirty years after MacHack beat Dreyfus, in 1997, Kasparov (arguably one of the best human chess players ever) played the Caro-Kann Defence against Deep Blue. He allowed Deep Blue to commit a knight sacrifice, which wrecked his defenses and forced him to resign in fewer than twenty moves. Enough said.