

Sequoia Joins The Team and Eight Lessons of a First-Time CEO

05.20.2014 | Posted by Vance Loiselle, CEO

I originally envisioned this blog as a way to discuss our recent $30 million funding, led by our latest investor, Sequoia Capital, with full participation from Greylock, Sutter Hill and Accel. I’ve been incredibly impressed with the whole Sequoia team and look forward to our partnership with Pat. Yet despite 300 enterprise customers (9 of the Global 500), lots of recent success against our large competitor, Splunk, and other interesting momentum metrics, I’d rather talk about the ride and lessons learned from my first two years as a CEO.

  1. It’s Lonely. Accept It and Move On.  My mentor, former boss, and CEO of my previous company told me this years ago. But at the time, it applied to him and not me (in hindsight I realize I did not offer much help). But, like being a first-time parent, you really can’t fathom it until you face it yourself. I’m sure there’s some psychology about how certain people deal with it and others don’t. I’m constantly thinking about the implications of tactical and strategic decisions. I’ve learned that if you’re too comfortable, you’re not pushing hard enough. The best advice I can give is to find a Board member you can trust, and use him or her as a sounding board early and often.
  2. Trust Your Gut.  There have been many occasions when I have been given good advice on key decisions. One problem with good advice is that you can get too much of it, and it isn’t always aligned. The best leader I ever met, and another long-time mentor, would always ask, ‘what is your gut telling you?’ More often than not, your gut is right. The nice thing about following your instincts is that the only person to blame if it goes awry is yourself.
  3. Act Like It’s Your Money. I grew up in Maine, where $100,000 can still buy a pretty nice house. When I first moved to California from Boston it took me some time to get accustomed to the labor costs and other expenses. The mentality in most of the top startups in Silicon Valley is “don’t worry, you can always raise OPM (other people’s money)”. Though I understand the need to invest ahead of the curve, especially in a SaaS-based business like ours, I also believe too much funding can cause a lack of discipline. People just expect they can hire or spend their way around a problem.
  4. Don’t Be Arrogant. Just saying it almost disqualifies you. Trust me, I have come across all kinds. Backed by arguably the four best Venture Capital firms in the business, I have had plenty of opportunities to meet other CEOs, founders and execs.  Some are incredible people and leaders. Some, however, act like they and their company are way too valuable and important to treat everyone with respect. Life is too short not to believe in karma.
  5. Listen Carefully. If a sales rep is having trouble closing deals, put yourself in his shoes and figure out what help he needs. If the engineering team is not meeting objectives fast enough, find out if they really understand the customer requirements. Often the smallest tweaks in communication or expectations can drastically change the results. Lastly, listen to your customer(s). It is very easy to write off a loss or a stalled relationship to some process breakdown, but customers buy from people they trust.  Customers trust people who listen.
  6. It’s a People Business. Software will eat the world, but humans still make the decisions. We’re building a culture that values openness and rapid decision-making while aligning our corporate mission with individual responsibilities. This balance is a constant work in progress, and I understand that getting it right is key to successfully scaling the Sumo Logic business.
  7. Find the Right VCs at the Right Time. I can’t take any credit for getting Greylock or Sutter Hill to invest in our A and B rounds, respectively. But I do have them to thank for hiring me and helping me. We partnered with Accel in November of 2012 and now Sequoia has led this recent investment. Do not underestimate the value of getting high-quality VCs. Their access to customers, top talent, and strategic partners is invaluable, not to mention the guidance they give in Board meetings and at times of key decisions. The only advice I can give here is: 1) know your business cold, 2) execute your plan and 3) raise money when you have wind at your back. Venture Capitalists make a living by picking the right markets with the right teams with the right momentum. Markets can swing (check Splunk’s stock price in the last three months) and momentum can swing (watch the Bruins in the Stanley Cup – never mind, they lost to the Canadiens).
  8. Believe. It may be cliché, but you have to believe in the mission. If you haven’t watched Twelve O’Clock High, watch it. It’s not politically correct, but it speaks volumes about how to lead and manage. You may choose the wrong strategy or tactics at times. But you’ll never know if you don’t have conviction about the goals.

OK, so I’m no Jack Welch or Steve Jobs, and many of these lessons are common sense. But no matter how much you think you know, there is way more that you don’t. Hopefully one person will be a little better informed or prepared by my own experience.

 


Building Scala at Scale

05.12.2014 | Posted by Jacek Migdal

The Scala compiler can be brutally slow. The community has a love-hate relationship with it. Love means “Yes, scalac is slow”. Hate means, “Scala — 1★ Would Not Program Again”. It’s hard to go a week without reading another rant about the Scala compiler.

Moreover, one of the Typesafe co-founders left the company shouting, “The Scala compiler will never be fast” (17:53). Even Scala inventor Martin Odersky provides a list of fundamental reasons why compiling is slow.

At Sumo Logic, we happily build over 600K lines of Scala code[1] with Maven and find this setup productive. Based on the public perception of the Scala build process, this seems about as plausible as a UFO landing on the roof of our building. Here’s how we do it:

Many modules

At Sumo Logic, we have more than 120 modules. Each has its own source directory, unit tests, and dependencies. As a result, each of them is reasonably small and well defined. Usually, you only need to modify one or a few of them, which means you can build just those and fetch prebuilt binaries for the rest of their dependencies[2].

Using this method is a huge win in build time and also makes the IDE and test suites run more quickly. Fewer elements are always easier to handle.
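For a sense of what this looks like day to day, here is a minimal sketch of building a single module. The Maven flags (-pl, -am) are the standard ones for selecting modules; stream-pipeline is a module name that appears later in this post, and some-other-module is made up.

# Compile and unit-test a single module; everything it depends on is resolved
# as prebuilt binaries from the artifact repository instead of being rebuilt.
mvn -pl stream-pipeline test-compile

# If you have touched a few modules, build them together with the in-repo
# modules they depend on (-am = "also make").
mvn -pl stream-pipeline,some-other-module -am test-compile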

We keep all modules in a single GitHub repository. Though we have experimented with a separate repository for each project, keeping track of version dependencies was too complicated.

Parallelism on module level

Although Moore’s law is still at work, single cores have not become much faster since 2004. The Scala compiler has some parallelism, but it’s nowhere close to saturating eight cores[3] in our use case.

Enabling parallel builds in Maven 3 helped a lot. At first, it caused a lot of non-deterministic failures, but it turns out that always forking the Java compiler fixed most of the problems[4]. That allows us to fully saturate all of the CPU cores during most of the build time. Even better, it allows us to overcome other bottlenecks (e.g., fetching dependencies).
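Concretely, the module-level parallelism is just Maven 3’s -T option; the “always fork the Java compiler” part is a maven-compiler-plugin setting (<fork>true</fork>), typically in a shared parent POM. A minimal sketch, with our actual profiles and plugin wiring omitted:

# Build modules in parallel, one Maven thread per CPU core.
mvn -T 1C clean test-compile

# Forking is configured in the maven-compiler-plugin (<fork>true</fork>),
# so each javac invocation runs in a separate process.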

Incremental builds with Zinc

Zinc brings features from sbt to other build systems, providing two major gains:

  • It keeps warmed-up compilers running, which avoids the JVM startup and warm-up tax.
  • It allows incremental compilation. We usually don’t compile from a clean state; we make a small change and recompile only what’s affected. This is a huge gain when doing Test-Driven Development.

For a long time we were unable to use Zinc with parallel module builds. As it turns out, we needed to tell Zinc to fork Java compilers. Luckily, an awesome Typesafe developer, Peter Vlugter, implemented that option and fixed our issue.
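If you haven’t used the standalone Zinc distribution, the day-to-day interaction looks roughly like the sketch below. The option names are from the standalone Typesafe zinc launcher; check the help output of your version if they differ.

# Start the long-lived Zinc (Nailgun) server so compilers stay warm between builds.
zinc -start

# Check whether it is running; handy before kicking off a build.
zinc -status

# Shut it down, e.g. before switching JDKs.
zinc -shutdown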

Time statistics

The following example shows the typical development workflow of building one module. For this benchmark, we picked the largest one by lines of code (53K LOC).

Example module build time

This next example shows building all modules (674K LOC), the most time consuming task.

 Total build time

Usually we can skip test compilation, bringing build time down to 12 minutes.[5]

Wrapper utility

Still, some engineers were not happy, because:

  • They often build and test more than they need to.
  • Computers get slow if you saturate the CPU (e.g., a video conference becomes sluggish).
  • Passing the correct arguments to Maven is hard.

Educating developers might have helped, but we picked the easier route. We created a simple bash wrapper that:

  • Runs every Maven process with lower CPU priority (nice -n 15), so the build doesn’t slow the browser, IDE, or a video conference.
  • Makes sure that Zinc is running. If not, it starts it.
  • Allows you to compile all the dependencies (downstream) easily for any module.
  • Allows you to compile all the things that depend on a module (upstream).
  • Makes it easy to select the kind of tests to run.

Though it is a simple wrapper, it improves usability a lot. For example, if you fixed a library bug for a module called “stream-pipeline” and would like to build and run unit tests for all modules that depend on it, just use this command:

bin/quick-assemble.sh -tu stream-pipeline
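For the curious, here is a stripped-down sketch of such a wrapper. It is illustrative rather than our actual script: the flag names and the Zinc commands are assumptions, but the core is just nice, a Zinc check, and Maven’s -pl/-am/-amd options.

#!/usr/bin/env bash
# Minimal sketch of a build wrapper: low priority, warm Zinc, targeted Maven build.
set -e

run_tests=false; upstream=false; downstream=false
while getopts "tud" opt; do
  case "$opt" in
    t) run_tests=true ;;    # also compile and run unit tests
    u) upstream=true ;;     # build everything that depends on the module
    d) downstream=true ;;   # build the module's own in-repo dependencies first
  esac
done
shift $((OPTIND - 1))
module="$1"

# Make sure a Zinc server is around so compilation stays incremental.
zinc -status >/dev/null 2>&1 || zinc -start

goal=compile
$run_tests && goal=test

flags=""
$upstream && flags="$flags -amd"    # Maven: also make dependents
$downstream && flags="$flags -am"   # Maven: also make dependencies

# Low CPU priority so the IDE, browser, and video calls stay responsive.
exec nice -n 15 mvn -T 1C $flags -pl "$module" "$goal"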

Tricks we learned along the way

  1. Print the longest chain of module dependencies by build time.
    That helps identify unnecessary or poorly designed dependencies, which can be removed. This makes the dependency graph much shallower, which means more parallelism.
  2. Run a build in a loop until it fails.
    As simple as in bash:  while bin/quick-assemble.sh; do :; done.
    Then leave it overnight. This is very helpful for debugging non-deterministic bugs, which are common in a multithreading environment.
  3. Analyze the bottlenecks of build time.
    CPU? IO? Are all cores used? Network speed? The limiting factor can vary during different phases. iStat Menus proved to be really helpful.
  4. Read the Maven documentation.
    Many things in Maven are not intuitive. The “trial and error” approach can be very tedious for this build system. Reading the documentation carefully is a huge time saver.

Summary

Building at scale is usually hard. Scala makes it harder because of its relatively slow compiler, so you will hit these issues much earlier than in other languages. However, the problems are solvable through general development best practices, especially:

  • Modular code
  • Parallel execution by default
  • Investment in tooling

Then it just rocks!

 

[1] ( find ./ -name '*.scala' -print0 | xargs -0 cat ) | wc -l
[2] All modules are built and tested by Jenkins and the binaries are stored in Nexus.
[3] The author’s 15-inch MacBook Pro (late 2013) has a quad-core i7, i.e., eight logical cores.
[4] We have little Java code. In theory the Java 1.6 compiler is thread-safe, but it has some concurrency bugs. We decided not to dig into that, as forking seemed to be the easier solution.
[5] Benchmark methodology:

  • Hardware: MacBook Pro, 15-inch, Late 2013, 2.3 GHz Intel i7, 16 GB RAM.
  • All tests were run three times and median time was selected.
  • Non-incremental Maven goal: clean test-compile.
  • Incremental Maven goal: test-compile. A random change was introduced to trigger some recompilation.

Why You Should Never Catch Throwable In Scala

05.05.2014 | Posted by Russell

Scala is a subtle beast and you should heed its warnings. Most Scala and Java programmers have heard that catching Throwable, the superclass of all exceptions and errors, is evil and patterns like the following should be avoided:

try {
  aDangerousFunction()
} catch {
  case ex: Throwable => println(ex)
  // Or even worse
  case ex => println(ex)
}

This pattern is absurdly dangerous. Here’s why:

The Problem

In Java, catching all throwables can do nasty things like preventing the JVM from properly responding to a StackOverflowError or an OutOfMemoryError. Certainly not ideal, but not catastrophic. In Scala, it is much more heinous. Scala uses exceptions to return from nested closures. Consider code like the following:

def inlineMeAgain[T](f: => T): T = {
  f
}

def inlineme(f: => Int): Int = {
  try {
    inlineMeAgain {
      return f
    }
  } catch {
    case ex: Throwable => 5
  }
}

def doStuff {
  val res = inlineme {
    10
  }
  println("we got: " + res + ". should be 10")
}

doStuff

We use a return statement from within two nested closures. This seems like it may be a bit of an obscure edge case, but it’s certainly possible in practice. To implement that non-local return, the Scala compiler throws a NonLocalReturnControl exception at runtime. Unfortunately, it is a Throwable, and you’ll catch it. Whoops. That code will print 5, not 10. Certainly not what was expected.

The Solution

While we can say “don’t catch Throwables” until we’re blue in the face, sometimes you really want to make sure that absolutely no exceptions get through. You could list all the other exception types everywhere you want to catch Throwable, but that’s cumbersome and error-prone. Fortunately, this is actually quite easy to handle, thanks to Scala’s focus on implementing much of the language without magic: the “catch” part of a try-catch is just some sugar over a partial function, so we can define our own partial functions!

import scala.util.control.ControlThrowable

def safely[T](handler: PartialFunction[Throwable, T]): PartialFunction[Throwable, T] = {
  // Always rethrow the compiler's control-flow throwables (e.g. NonLocalReturnControl)
  case ex: ControlThrowable => throw ex
  // case ex: OutOfMemoryError (Assorted other nasty exceptions you don't want to catch)
  // If it's an exception they handle, pass it on
  case ex: Throwable if handler.isDefinedAt(ex) => handler(ex)
  // If they didn't handle it, rethrow. This line isn't necessary, just for clarity
  case ex: Throwable => throw ex
}

// Usage:
/*
def doSomething: Unit = {
  try {
    somethingDangerous
  } catch safely {
    case ex: Throwable => println("AHHH")
  }
}
*/

This defines a function “safely”, which takes a partial function and yields another partial function. Now, by simply using catch safely { /* catch block */ } we’re free to catch Throwables (or anything else) safely and restrict the list of all the evil exception types to one place in the code. Glorious.


Sumo Logic, ServiceNow and the Future of Event Management

04.29.2014 | Posted by Sanjay Sarathy, CMO

Today’s reality is that companies have to deal with disjointed systems when it comes to detecting, investigating and remediating issues in their infrastructure. Compound that with the exponential growth of machine data and you have a recipe for frustrated IT and security teams who are tasked with uncovering insights from this data exhaust and then remediating issues as appropriate. Customer dissatisfaction, at-risk SLAs and even revenue misses are the inevitable consequences of this fragmented approach.

With our announcement today of a certified integration with ServiceNow, organizations now have a closed-loop system that makes it much easier to uncover known and unknown events in Sumo Logic and then immediately create alerts and incidents in ServiceNow. The bi-directional integration lets companies streamline the entire change management process, capture current and future knowledge, and lay the groundwork for integrated event management capabilities. This integration takes advantage of all the Sumo Logic analytics capabilities, including LogReduce and Anomaly Detection, to identify what’s happening in your enterprise, even if you never had rules to detect issues in the first place.

ServiceNow Integration

The cloud-to-cloud integration of ServiceNow and Sumo Logic also boosts productivity by eliminating the whole concept of downloading, installing and managing software. Furthermore, IT organizations have the ability to elastically scale their data analytics needs to meet the service management requirements of the modern enterprise.

Let us know if you’re interested in seeing our integration with ServiceNow.  And while you’re at it, feel free to register for Sumo Logic Free.  It’s a zero price way to understand how our machine data analytics service works.

PS – check out our new web page which provides highlights of recent capabilities and features that we’ve launched. 

Mitigating the Heartbleed Vulnerability

04.10.2014 | Posted by Stefan Zier, Lead Architect

By now, you have likely read about the security vulnerability known as the Heartbleed bug. It is a vulnerability in the widespread OpenSSL library, and it allows attackers to steal information that would, under normal conditions, be protected by the SSL/TLS encryption used to secure traffic on the Internet (including Sumo Logic).

How did we eliminate the threat?

When we were notified about the issue, we quickly discovered that our own customer-facing SSL implementation was vulnerable to the attack — thankfully, the advisory was accompanied by some scripts and tools to test for the vulnerability. Mitigation happened in four steps:

  1. Fix vulnerable servers. As a first step, we needed to make sure to close the information leak. In some cases, that meant working with third party vendors (most notably, Amazon Web Services, who runs our Elastic Load Balancers) to get all servers patched (a quick openssl version check, sketched after this list, makes a useful first pass on your own hosts). This step was concluded once we confirmed that all of the load balancers on the DNS rotation were no longer vulnerable.

  2. Replace SSL key pairs. Even though we had no reason to believe there was any actual attack against our SSL private keys, it was clear all of them had to be replaced as a precaution. Once we had the new key pairs deployed to all the servers and load balancers, we revoked all previous certificates with our CA, GeoTrust. All major browsers perform revocation checks against OCSP responders or CRLs.

  3. Notify customers. Shortly after we resolved the issues, we sent an advisory to all of our customers, recommending a password change. Again, this was a purely precautionary measure, as there is no evidence of any passwords leaking.

  4. Update Collectors. We have added a new feature to our Collectors that will automatically replace the Collector’s credentials. Once we complete testing, we will recommend that all customers upgrade to the new version. We also enabled support for certificate revocation checking, which wasn’t enabled previously.
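As a footnote to step 1: a quick first pass on a single host can be as simple as the version check sketched below. It only looks at the reported OpenSSL version (1.0.1 through 1.0.1f were vulnerable; 1.0.1g contains the fix), and distributions often backport fixes without changing the version string, so treat it as triage only and rely on the dedicated Heartbleed test tools for a real answer.

#!/usr/bin/env bash
# Rough triage: flag OpenSSL builds whose reported version falls in the
# known-vulnerable Heartbleed range (CVE-2014-0160). Not a definitive test.
ver=$(openssl version | awk '{print $2}')
case "$ver" in
  1.0.1|1.0.1[a-f])
    echo "OpenSSL $ver: possibly vulnerable, verify with a real Heartbleed test" ;;
  *)
    echo "OpenSSL $ver: not in the known-vulnerable 1.0.1-1.0.1f range" ;;
esac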

How has this affected our customers?

Thus far, we have not seen any signs of unusual activity, nor have we seen any customers lose data due to this bug. Unfortunately, we’ve had to inconvenience our customers with requests to change passwords and update Collectors, but given the gravity of the vulnerability, we felt the inconvenience was justified.

Internal impact

Our intranet is hosted on AWS and our users use OpenVPN to connect to it. The version of OpenVPN we had been running needed to be updated to a version that was released today. Other servers behind OpenVPN also needed updating.

Sumo Logic uses on the order of 60-70 SaaS services internally. Shortly after we resolved our customer facing issues, we performed an assessment of all those SaaS services. We used the scripts to test for the vulnerability combined with DNS lookups. If a service looked like it was hosted with a provider/service that was known to have been vulnerable (such as AWS ELB), we added it to our list. We are now working our way through the list and changing passwords on all affected applications, starting with the most critical ones. Unfortunately, manually changing passwords in all of the affected applications takes time and presents a burden on our internal IT users. We plan to have completed this process by the end of the week. 

Interesting Days

Overall, the past few days were pretty interesting on the internet. Many servers (as many as 66% of SSL servers on the net) run OpenSSL, and most of them were affected, including big sites such as Yahoo Mail. The pace of exploitation and spread of the issue were both concerning. Thankfully, Codenomicon, the company that discovered this vulnerability, did an amazing job handling and disclosing it in a pragmatic and responsible fashion. This allowed everybody to fix the issue rapidly and minimize impact on their users.


AWS Elastic Load Balancing – New Visibility Into Your AWS Load Balancers

03.06.2014 | Posted by Ariel Smoliar, Senior Product Manager

After the successful launch of the Sumo Logic Application for AWS CloudTrail last November and with numerous customers now using this application, we were really excited to work again on a new logging service from AWS, this time providing analytics around log files generated by the AWS Load Balancers.

Our integration with AWS CloudTrail targets use cases relevant to security, usage and operations. With our new application for AWS Elastic Load Balancing, we give our customers dashboards that provide real-time insights into operational data. You will also be able to add your own use cases, based on your requirements, by parsing the log entries and visualizing the data with our visualization tools.

Insights from ELB Log Data

Sumo Logic runs natively on the AWS infrastructure and uses AWS load balancers, so we had plenty of raw data to work with during the development of this content. You will find 12 fields in the ELB logs covering the entire request/response lifecycle. By adding the request, backend and response processing times, we can highlight the total time (latency) from when the load balancer started reading the request headers to when it started sending the response headers to the client. The Latency Analysis dashboard presents a granular analysis per domain, client IP and backend instance (EC2).
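If you ever want to sanity-check those numbers against the raw files outside of the application, the same arithmetic is easy to sketch on the command line. The field positions below assume the classic 12-field ELB access log layout (the three processing times are fields 5, 6 and 7), and elb_access.log is a made-up filename.

# Total time per request = request + backend + response processing time;
# print the slowest requests with their client and backend addresses.
awk '{ printf "%.3f %s %s\n", $5 + $6 + $7, $3, $4 }' elb_access.log | sort -rn | head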

The application also analyzes status codes from both the ELB and the backend instances. Please note that the totals for the two will match most of the time; they diverge when there are issues, such as no backend response or a rejected client request. Additionally, for ELBs configured with a TCP listener (layer 4) rather than HTTP, the TCP requests are still logged; in this case you will see that the URL has three dashes and there are no values for the HTTP status codes.
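Along the same lines, a quick command-line sketch can surface where the two status codes disagree. This again assumes the classic 12-field layout, with the ELB status code in field 8 and the backend status code in field 9.

# Count ELB vs. backend status code mismatches, most frequent pairs first.
awk '$8 != $9 { pair = $8 " vs " $9; count[pair]++ } END { for (p in count) print count[p], p }' elb_access.log | sort -rn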

Alerting Frequency

Often during my discussions with Sumo Logic users, the topic of scheduled searches and alerting comes up. Based on our work with ELB logs, there is no single threshold we can recommend that covers every use case. The threshold should be based on the application – e.g., tiny beacon requests and downloads of huge files cause very different latencies. Sumo Logic gives you the flexibility to set a threshold in the scheduled search, or simply to change the color in the graph based on the value range for monitoring purposes.

Visualization

I want to talk a little bit about machine data visualization. While skiing last week in Steamboat, Colorado, I kept thinking about what the beautiful Rocky Mountain landscape had to do with the somewhat more mundane world of load balancer data visualization. So here is what we did to present the load balancer data in a more compelling way.


You can slice and dice the data using our Transpose operator, as we did in the Latency by Load Balancer monitor, but I would like to focus on a different feature that was built by our UI team and share how we used it in this application. This feature combines the number of requests, the total request size, and the client IP address, and integrates these data elements into the Total Requests and Data Volume monitor.

We first used this visualization approach in our Nginx app (Traffic Volume and Bytes Served monitor). We received very positive feedback and decided it made sense to incorporate this approach into this application as well.

Combining three fields in a single view gives you a faster overview of your environment and also provides the ability to drill down and investigate any activity.


The resulting chart reminds one of the landscape above, right? :-)

To get this same visualization, click on the gear icon in the Search screen and choose the Change Series option. 


For each data series, you can choose how you would like to represent the data. We used Column Chart for the total requests and Line Chart for the received and sent data. 


I find it beautiful and useful. I hope you plan to use this visualization approach in your dashboards, and please let us know if any help is required.

One more thing…

Please stay tuned and check our posts next week… we can’t wait to share with you where we’re going next in the world of Sumo Logic Applications.


Machine Data at Strata: “BigData++”

03.05.2014 | Posted by David Andrzejewski, Data Sciences Engineer

A few weeks ago I had the pleasure of hosting the machine data track of talks at Strata Santa Clara. Like “big data”, the phrase “machine data” is associated with multiple (sometimes conflicting) definitions; two prominent ones come from Curt Monash and Daniel Abadi. The focus of the machine data track is on data that is generated and/or collected automatically by machines. This includes software logs and sensor measurements from systems as varied as mobile phones, airplane engines, and data centers. The concept is closely related to the “internet of things”, which refers to the trend of increasing connectivity and instrumentation in existing devices, like home thermostats.

More data, more problems

This data can be useful for the early detection of operational problems or the discovery of opportunities for improved efficiency. However, the decoupling of data generation and collection from human action means that the volume of machine data can grow at machine scales (i.e., Moore’s Law), an issue raised by both Monash and Abadi. This explosive growth rate amplifies existing challenges associated with “big data.” In particular, two common motifs among the talks at Strata were the difficulties around:

  1. mechanics: the technical details of data collection, storage, and analysis
  2. semantics: extracting understandable and actionable information from the data deluge

The talks

The talks covered applications involving machine data from both physical systems (e.g., cars) and computer systems, and highlighted the growing fuzziness of the distinction between the two categories.

Steven Gustafson and Parag Goradia of GE discussed the “industrial internet” of sensors monitoring heavy equipment such as airplane engines or manufacturing machinery. One anecdotal data point was that a single gas turbine sensor can generate 500 GB of data per day. Because of the physical scale of these applications, using data to drive even small efficiency improvements can have enormous impacts (e.g., in amounts of jet fuel saved).

Moving from energy generation to distribution, Brett Sargent of LumaSense Technologies presented a startling perspective on the state of the power grid in the United States, stating that the average age of an electrical distribution substation in the United States is over 50 years, while its intended lifetime was only 40 years. His talk discussed remote sensing and data analysis for monitoring and troubleshooting this critical infrastructure.

Ian Huston, Alexander Kagoshima, and Noelle Sio from Pivotal presented analyses of traffic data. The talk revealed both common-sense (traffic moves more slowly during rush hour) and counterintuitive (disruptions in London tended to resolve more quickly when it was raining) findings.

My presentation showed how we apply machine learning at Sumo Logic to help users navigate machine log data (e.g., software logs). The talk emphasized the effectiveness of combining human guidance with machine learning algorithms.

Krishna Raj Raja and Balaji Parimi of CloudPhysics discussed how machine data can be applied to problems in data center management. One very interesting idea was to use data and modeling to predict how different configuration changes would affect data center performance.

Conclusions

The amount of data available for analysis is exploding, and we are still in the very early days of discovering how to best make use of it. It was great to hear about different application domains and novel techniques, and to discuss strategies and design patterns for getting the most out of data.

 


The New Era of Security – yeah, it’s that serious!

02.23.2014 | Posted by Bruno Kurtic, Founding Vice President of Product and Strategy

Security is a tricky thing and it means different things to different people.   It is truly in the eye of the beholder.  There is the checkbox kind, there is the “real” kind, there is the checkbox kind that holds up, and there is the “real” kind that is circumvented, and so on.  Don’t kid yourself: the “absolute” kind does not exist. 

I want to talk about security solutions based on log data.  This is the kind of security that kicks in after the perimeter security (firewalls), intrusion detection (IDS/IPS), vulnerability scanners, and dozens of other security technologies have done their thing.  It ties all of these technologies together, correlates their events, reduces false positives and enables forensic investigation.  Sometimes this technology is called Log Management and/or Security Information and Event Management (SIEM).  I used to build these technologies years ago, but it seems like decades ago. 

SIEM

A typical SIEM product is a hulking appliance, sharp edges, screaming colors – the kind of design that instills confidence and says “Don’t come close, I WILL SHRED YOU! GRRRRRRRRRR”.

Ahhhh, SIEM. It makes you feel safe, doesn’t it? It should not. I proclaim this at the risk of being yet another one of those guys who wants to rag on SIEM, but I built one, and beat many, so I feel I’ve got some ragging rights. So, what’s wrong with SIEM? Where does it fall apart?

SIEM does not scale

It is hard enough to capture a terabyte of daily logs (40,000 Events Per Second, 3 Billion Events per Day) and store them. It is a couple of orders of magnitude harder to run correlation in real time and alert when something bad happens. SIEM tools are extraordinarily difficult to run at scales above 100GB of data per day. This is because they are designed to scale by adding more CPU, memory, and fast spindles to the same box. The exponential growth of data in the two decades since those SIEM tools were designed has outpaced the ability to add CPU, memory, and fast spindles into the box.

Result: Data growth outpaces capacity → Data dropped  from collection → Significant data dropped from correlation → Gap in analysis → Serious gap in security

SIEM normalization can’t keep pace

SIEM tools depend on normalization (shoehorning) of all data into one common schema so that you can write queries across all events.  That worked fifteen years ago when sources were few.  These days sources and infrastructure types are expanding like never before.  One enterprise might have multiple vendors and versions of network gear, many versions of operating systems, open source technologies, workloads running in infrastructure as a service (IaaS), and many custom written applications.  Writing normalizers to keep pace with changing log formats is not possible.

Result: Too many data types and versions → Falling behind on adding new sources → Reduced source support → Gaps in analysis → Serious gaps in security

SIEM is rule-only based

This is a tough one. Rules are useful, even required, but not sufficient. Rules only catch the things you express in them, the things you know to look for. To be secure, you must be ahead of new threats. A million monkeys writing rules in real time: not possible.

Result: Your rules are stale → You hire a million monkeys → Monkeys eat all your bananas → You analyze only a subset of relevant events → Serious gap in security

SIEM is too complex


It is way too hard to run these things. I’ve had too many meetings and discussions with my former customers on how to keep the damned things running and too few meetings on how to get value out of the fancy features we provided. In reality most customers get to use only 20% of the features, because the rest of the stuff is not reachable. It is like putting your best tools on a shelf just out of reach. You can see them, you could do oh so much with them, but you can’t really use them because they are out of reach.

Result: You spend a lot of money → Your team spends a lot of time running SIEM → They don’t succeed on leveraging the cool capabilities → Value is low → Gaps in analysis → Serious gaps in security   

So, what is an honest, forward-looking security professional who does not want to duct tape a solution to do?  What you need is what we just started: Sumo Logic Enterprise Security Analytics.  No, it is not absolute security, it is not checkbox security, but it is a more real security because it:

Scales

Processes terabytes of your data per day in real time. Evaluates rules regardless of data volume and does not restrict what you collect or analyze. Furthermore, there is no SIEM-style normalization: just add data, a pinch of savvy, a tablespoon of massively parallel compute, and voila.

Result: you add all relevant data → you analyze it all → you get better security 

Simple

It is SaaS, there are no appliances, there are no servers, there is no storage, there is just a browser connected to an elastic cloud.

Result: you don’t have to spend time on running it → you spend time on using it → you get more value → better analysis → better security

Machine Learning

Rules, check. What about that other unknown stuff? Answer: a machine that learns from data. It detects patterns without human input. It then figures out baselines and normal behavior across sources. In real time it compares new data to the baseline and notifies you when things are sideways. Even if “things” are things you’ve NEVER even thought about and NOBODY in the universe has EVER written a single rule to detect. Sumo Logic detects those too.

Result: Skynet … nah, benevolent overlord, nah, not yet anyway.   New stuff happens → machines go to work → machines notify you → you provide feedback → machines learn and get smarter → bad things are detected → better security

Read more: Sumo Logic Enterprise Security Analytics


Sumo Logic Deployment Infrastructure and Practices

01.08.2014 | Posted by Manish Khettry

Introduction

Here at Sumo Logic, we run a log management service that ingests and indexes many terabytes of data a day; our customers then use our service to query and analyze all of this data. Powering this service are a dozen or more separate programs (which I will call assemblies from now on), running in the cloud and communicating with one another. For instance, the Receiver assembly is responsible for accepting log lines from collectors running on our customers’ host machines, while the Index assembly creates text indices for the massive amount of data constantly being fed into our system by the Receivers.

We deploy to our production system multiple times each week, while our engineering teams are constantly building new features, fixing bugs, improving performance, and, last but not least, working on infrastructure improvements to help in the care and wellbeing of this complex big-data system. How do we do it? This blog post tries to explain our (semi)-continuous deployment system.

Running through hoops

In any continuous deployment system, you need multiple hoops that your software must pass through before you deploy it for your users. At Sumo Logic, we have four well defined tiers with clear deployment criteria for each. A tier is an instance of the entire Sumo Logic service where all the assemblies are running in concert, along with all the monitoring infrastructure (health checks, internal administrative tools, auto-remediation scripts, etc.) watching over it.

Night

This is the first step in the sequence of steps that our software goes through. Originally intended as a nightly deploy, Night now automatically receives the latest clean builds of each assembly on our master branch several times every day. A clean build means that all the unit tests for the assemblies pass. In our complex system, however, it is the interaction between assemblies which can break functionality. To test these interactions, we have a number of integration tests running against Night regularly. Any failures in these integration tests are an early warning that something is broken. We also have a dedicated person troubleshooting problems with Night, whose responsibility is, at the very least, to identify and file bugs for problems.

Stage

We cut a release branch once a week and use Stage to test this branch much as we use Night to keep master healthy. The same set of integration tests that run against Night also run against Stage and the goal is to stabilize the branch in readiness for a deployment to production. Our QA team does ad-hoc testing and runs their manual test suites against Stage.

Long

Right before production is the Long tier. We consider this almost as important as our Production tier. The interaction between Long and Production is well described in this webinar given by our founders. Logs from Long are fed to Production and vice versa, so Long is used to monitor and troubleshoot problems with Production.

Deployments to Long are done manually a few days before a scheduled deployment to Production from a build that has passed all automated unit tests as well as integration tests on Stage. While the deployment is manually triggered, the actual  process of upgrading and restarting the entire system is about as close to a one-button-click as you can get (or one command on the CLI)!

Production

After Long has soaked for a few days, we manually deploy the software running on Long to Production, the last hoop our software has to jump through. We aim for a full deployment every week and will often do smaller upgrades of our software between full deploys.

Being Production, this deployment is closely watched and there are a fair number of safeguards built into the process. Most notably, we have two dedicated engineers who manage this deployment, with one acting as an observer. We also have a tele-conference with screen sharing that anyone can join and observe the deploy process.

Social Practices

Closely associated with the software infrastructure are the social aspects of keeping this system running. These are:

Ownership

We have well defined ownership of these tiers within engineering and devops, which rotates weekly. An engineer is designated Primary and is responsible for Long and Production. Similarly, we have a designated Jenkins Cop role to keep our continuous integration system, Night, and Stage healthy.

Group decision making and notifications

We have a short standup every day before lunch, which everyone in engineering attends. The Primary and Jenkins Cop update the team on any problems or issues with these tiers from the previous day.

In addition to a physical meeting, we use Campfire to discuss ongoing problems and notify others of changes to any of these tiers. If someone wants to change a configuration property on Night to test a new feature, that person would update everyone else on Campfire. Everyone (and not just the Primary or Jenkins Cop) is in the loop about these tiers and can jump in to troubleshoot problems.

Automate almost everything. A checklist for the rest.

There are certain things that are done or triggered manually. In cases where humans operate something (a deploy to Long or Production for instance), we have a checklist for engineers to follow. For more on checklists, I refer you to an excellent book, The Checklist Manifesto.

Conclusion

This system has been in place since Sumo Logic went live and has served us well. It bears mentioning that the key to all of this is automation, uniformity, and well-delineated responsibilities. For example, spinning up a complete system takes just a couple of commands in our deployment shell. Also, any deployment (even a personal one for development) comes up with everything pre-installed and running, including health checks, monitoring dashboards and auto-remediation scripts. Identifying and fixing a problem on Production is no different from doing so on Night. In almost every way (except for waking up the Jenkins Cop in the middle of the night, and the sizing), these are identical tiers!

While automation is key, it doesn’t take away from the fact that it is people who run the system and keep things healthy. A deployment to production can be stressful, more so for the Primary than anyone else, and having a well-defined checklist can take away some of the stress.

Any system like this needs constant improvements and since we are not sitting idle, there are dozens of features, big and small that need to be worked on. Two big ones are:

  • Red-Green deployments, where new releases are rolled out to a small set of instances and once we are confident they work, are pushed to the rest of the fleet.

  • More frequent deployments of smaller parts of the system. Smaller more frequent deployments are less risky.

In other words, there is a lot of work to do. Come join us at Sumo Logic!


Open Source in the Sumo Logic UI

12.09.2013 | Posted by Bill Lazar

Startups are well known for their “go fast, release, and iterate” approach. Having quality engineers at Sumo Logic is a big part of doing that well enough that customers want our solution, but as at many other young tech companies, open source libraries and tools are also a key element in our ability to deliver.

As a recent hire on the User Interface development team, I was excited to see just which open source software goes into our cloud log management solution. The list is extensive, because so many of our peers are making great stuff available, but a quick look at just the front-end codebase shows:

  • jQuery: The big daddy, jQuery is used by millions of web applications and websites to add dynamic behavior and content to otherwise plain pages.
  • Backbone: A lean, subtly powerful framework for building expressive client-side apps, Backbone provides a core set of MV* classes and a foundation for many community-developed extensions.
  • Sass/Compass: Think “programmable CSS” and you’re capturing the essence of Sass while big brother Compass adds an extensive set of reusable cross-browser CSS patterns as well as several handy utilities.
  • D3: A library for manipulating documents based on data, we use D3 to drive many of the beautiful interactive charts that enable our customers to understand the huge volume of data they process in our application.
  • Require.js: Building large applications is much easier to manage when code can be split into small, coherent chunks (files) and Require.js enables apps to do just this. 
  • Code Mirror: This versatile text editor is the basis for Sumo Logic’s powerful search query editors.
  • jQuery Plugins: Many; the more important ones to us include Select2, Toaster, qTip, and jQuery UI.

Collectively these libraries, along with their counterparts used in our service layer, make it possible for a small company to rapidly deliver the depth and quality of Sumo Logic in a cost-effective process. Instead of writing essentially boilerplate code to perform mundane tasks, our team is able to create application-specific, high-value code.

In the days before FOSS proliferated, the cost per developer or per CPU for each piece of software would have been prohibitive; the economics of Silicon Valley, where two guys in a coffee shop can spin up a Pinterest or a Sumo Logic, just wouldn’t have worked.
