
Data, with a little Help from my friends

10.20.2014 | Posted by Amanda Saso, Principal Tech Writer

 

Ever had that sinking feeling when you start a new job and wonder just why you made the jump? I had a gut check when, shortly after joining Sumo Logic in June of 2012, I realized that we had less than 50 daily hits to our Knowledge Base on our support site. Coming from a position where I was used to over 7,000 customers reading my content each day, I nearly panicked.  After calming down, I realized that what I was actually looking at was an amazing opportunity.

Fast forward to 2014. I’ve already blogged about the work I’ve done with our team to bring new methods to deliver up-to-date content. (If you missed it, you can read the blog here.) Even with these improvements I couldn’t produce metrics that proved just how many customers and prospects we have clicking through our Help system. Since I work at a data analytics company, it was kind of embarrassing to admit that I had no clue how many visitors were putting their eyes on our Help content. I mean, this is some basic stuff!

Considering how much time I’ve spent working with our product, I knew that I could get all the information I needed using Sumo Logic…if I could get my hands on some log data. I had no idea how to get logging enabled, not to mention how logs should be uploaded to our Service. Frankly, my English degree is not conducive to solving engineering challenges (although I could write a pretty awesome poem about my frustrations). I’m at the mercy of my Sumo Logic co-workers to drive any processes involving how Help is delivered and how logs are sent to Sumo Logic.  All I could do was pitch my ideas and cross my fingers.

I am very lucky to work with a great group of people who are happy to help me out when they can. This is especially true of Stefan Zier, our Chief Architect, who once again came to my aid. He decommissioned old Help pages (my apologies to anyone who found their old bookmarks rudely displaying 404’s) and then routed my Help from the S3 bucket through our product, meaning that Help activity can be logged. I now refer to him as Stefan, Patron Saint of Technical Writers. Another trusty co-worker we call Panda helped me actually enable the logging.

Once the logging began we could finally start creating some Monitors to build out a Help Metrics Dashboard. In addition to getting the number of hits and the number of distinct users, we really wanted to know which pages were generating the most hits (no surprise that search-related topics bubbled right to the top). We’re still working on other metrics, but let me share just a few data points with you.

 

Take a look at the number of hits our Help site has handled since October 1st:

 

We now know that Wednesday is when you look at Help topics the most:

 

And here’s where our customers are using Help, per our geo lookup operator Monitor:

 

It’s very exciting to see how much Sumo Logic has grown, and how many people now look at content written by our team, from every corner of the world. Personally, it’s gratifying to feel a sense of ownership over a dataset in Sumo Logic, thanks to my friends.

What’s next from our brave duo of tech writers? Beyond adding additional logging, we’re working to find a way to get feedback on Help topics directly from users. If you have any ideas or feedback, in the short term, please shoot us an email at documentation@sumologic.com. We would love to hear from you!


Machine Data Intelligence – an update on our journey

10.16.2014 | Posted by Kumar Saurabh, Co-Founder & VP of Engineering

In 1965, Dr. Hubert Dreyfus, a professor of philosophy at MIT, later at Berkeley, was hired by RAND Corporation to explore the issue of artificial intelligence.  He wrote a 90-page paper called “Alchemy and Artificial Intelligence” (later expanded into the book What Computers Can’t Do) questioning the computer’s ability to serve as a model for the human brain.  He also asserted that no computer program could defeat even a 10-year-old child at chess.

Two years later, in 1967, several MIT students and professors challenged Dreyfus to play a game of chess against MacHack (a chess program that ran on a PDP-6 computer with only 16K of memory). Dreyfus accepted. At one point Dreyfus found a move that could have captured the enemy queen. The only way the computer could escape was to keep Dreyfus in check with its own queen until it could fork Dreyfus's queen and king, and then exchange queens. That is exactly what the computer did, and it went on to checkmate Dreyfus in the middle of the board.

I’ve brought up this “man vs. machine” story because I see another domain where a similar change is underway: the field of Machine Data.

Businesses run on IT, and IT infrastructure is getting bigger by the day, yet IT operations still depend heavily on analytics tools with very basic monitoring logic. As systems become more complex (and more agile), simple monitoring just doesn't cut it. We cannot support or sustain the necessary speed and agility unless the tools become much more intelligent.

We believed this when we started Sumo Logic, and with the lessons learned from running a large-scale system ourselves, we continue to invest in making operational tooling more intelligent. We knew the market needed a system that complemented human expertise. Humans don't scale well: our memory is imperfect, so the ideal tools should pick up on signals that humans cannot, and at a scale that matches today's business needs and volume of IT data exhaust.

Two years ago we launched our service with a pattern recognition technology called LogReduce and about five months ago we launched Structure Based Anomaly Detection. And the last three months of the journey have been a lot like teaching a chess program new tricks – the game remains the same, just that the system keeps getting better at it and more versatile.

We are now extending our Structure-Based Anomaly Detection capabilities with Metric-Based Anomaly Detection. A metric could be just that – a time series of numerical values. You can take any log, filter, aggregate, and pre-process it however you want; if you can turn it into a number with a time stamp, we can baseline it and automatically alert you when the current value of the metric moves outside the range its history would predict. We developed this new engine in collaboration with the Microsoft Azure Machine Learning team, who have some really compelling models for detecting anomalies in a time series of metric data – you can read more about that here.
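The core idea is simple to sketch. The following toy Python example is not Sumo Logic's engine (which uses far more sophisticated models); it just illustrates the baseline-and-band principle by flagging points that fall outside a tolerance band derived from a trailing window:

```python
from statistics import mean, stdev

def anomalies(series, window=20, k=3.0):
    """Flag indices whose value falls outside mean +/- k*stdev of the trailing window."""
    flagged = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if abs(series[i] - mu) > k * sigma:
            flagged.append(i)
    return flagged

# A steady metric with one spike: only the spike is flagged.
metric = [100 + (i % 5) for i in range(40)] + [500] + [100 + (i % 5) for i in range(10)]
print(anomalies(metric))  # -> [40]
```

Note that after the spike enters the trailing window it inflates the baseline's variance, which is exactly why production engines need smarter models than a fixed z-score band.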

The hard part of anomaly detection is not detecting anomalies – it is detecting anomalies that are actionable. Making an anomaly actionable begins with making it understandable. Once an analyst or an operator can grok an anomaly, they are much more willing to alert on it, build a playbook around it, or even hook up automated remediation to the alert – the Holy Grail.

And not all anomaly detection engines are equal. As with chess programs, there are ones that can beat a 5-year-old and others that can beat grandmasters. We are well on our way to building a comprehensive Anomaly Detection engine that becomes a critical tool in every operations team's arsenal. The key question to ask is: does the engine tell you something that is insightful, actionable, and that you could not have found with standard monitoring tools?

Below is an example from an actual Sumo Logic production use case, where some of our nodes were spending a lot of time in garbage collection, impacting dashboard refresh rates for some customers.

[Image: anomaly_detection_metric]

If this looks interesting, our Metric Based Anomaly Detection service based on Azure Machine Learning is being offered to select customers in a limited beta release and will be coming soon to machines…err..a browser near you (we are a cloud based service after all).

P.S. If you like stories, here is another one for you. Thirty years after MacHack beat Dreyfus, in 1997, Kasparov (arguably one of the best human chess players) played the Caro-Kann Defence against Deep Blue. He then allowed Deep Blue to commit a knight sacrifice, which wrecked his defenses and forced him to resign in fewer than twenty moves. Enough said.

References

[1] http://www.chess.com/article/view/machack-attack



Scala at Sumo: type class law for reliable abstractions

10.09.2014 | Posted by David Andrzejewski, Data Sciences Engineer

Abstraction is a fundamental concept in software development. Identifying and building abstractions well-suited to the problem at hand can make the difference between clear, maintainable code and a teetering, Jenga-like monolith duct-taped together by a grotesque ballet of tight coupling and special case handling. While a well-designed abstraction can shield us from detail, it can also suffer from leakage, failing to behave as expected or specified and causing problems for code built on top of it. Ensuring the reliability of our abstractions is therefore of paramount concern.

In previous blog posts, we’ve separately discussed the benefits of using type classes in Scala to model abstractions, and using randomized property testing in Scala to improve tests. In this post we discuss how to combine these ideas in order to build more reliable abstractions for use in your code. If you find these ideas interesting please be sure to check out the references at the end of this post.

Type classes for fun and profit

Type classes allow us to easily build additional behaviors around data types in a type-safe way. One simple and useful example is associated with the monoid abstraction, which represents a set of items which can be combined with one another (such as the integers, which can be combined by addition). Loosely[1], a monoid consists of

  • a collection of objects (e.g., integers)

  • a binary operation for combining these objects to yield a new object of the same type (e.g., addition)

  • an identity object whose combination leaves an object unchanged (e.g., the number 0)

This abstraction is captured by the scalaz trait Monoid[F]:

trait Monoid[F] {
  def zero: F
  def append(f1: F, f2: => F): F
}

The utility of this machinery is that it gives us a generalized way to use types that support some notion of “addition” or “combination”, for example[2]:

def addItUp[F : Monoid](items: Seq[F]): F = {
  // Combine a bunch of items
  val m = implicitly[Monoid[F]]
  items.foldLeft(m.zero){case (total, next) => m.append(total, next)}
}

scala> addItUp(Seq("day ", "after ", "day"))
res1: String = "day after day"
scala> addItUp(Seq(1,2,3))
res2: Int = 6

As described in our earlier machine learning example, this can be more convenient than requiring that the data types themselves subtype or inherit from some kind of “Addable” interface.

I am the law!

In Scala, the Monoid[F] trait definition (combined with the compiler type-checking) buys us some important sanity checks with respect to behavior. For example, the function signature append(x: F, y: F): F guarantees that we’re never going to get a non-F result[3].

However, there are additional properties that an implementation of Monoid[F] must satisfy in order to truly conform to the conceptual definition of a monoid, but which are not easily encoded into the type system. For example, the monoid binary operation must satisfy left and right identity with respect to the “zero” element. For integers under addition the zero element is 0, and we do indeed have x + 0 = 0 + x = x for any integer x.

We can codify this requirement in something called type class law. When defining a particular type class, we can add some formal properties or invariants which we expect implementations to obey. The codification of these constraints can then be kept alongside the type class definition. Again returning to scalaz Monoid[4], we have

trait Monoid[F] extends Semigroup[F] {
  ...
  trait MonoidLaw extends SemigroupLaw {
    def leftIdentity(a: F)(implicit F: Equal[F]) =
      F.equal(a, append(zero, a))
    def rightIdentity(a: F)(implicit F: Equal[F]) =
      F.equal(a, append(a, zero))
  }
  ...
}

An interesting observation is that this implementation depends upon another type class instance Equal[F] which simply supplies an equal() function for determining whether two instances of F are indeed equal. Of course, Equal[F] comes supplied with its own type class laws for properties any well-defined notion of equality must satisfy such as commutativity (x==y iff y==x), reflexivity (x==x), and transitivity (if a==b and b==c then a==c).

A machine learning example

We now consider an example machine learning application where we are evaluating some binary classifier (like a decision tree) over test data. We run our evaluation over different sets of data, and for each set we produce a very simple output indicating how many predictions were made, and of those, how many were correct: 

case class Evaluation(total: Int, correct: Int)

We can implement Monoid[Evaluation] [5] in order to combine our experimental results across multiple datasets:

object EvaluationMonoid extends Monoid[Evaluation] {
  def zero = Evaluation(0, 0)
  def append(x: Evaluation, y: => Evaluation) =
    Evaluation(x.total + y.total, x.correct + y.correct)
}

We’d like to ensure that our implementation satisfies the relevant type class laws. We could write a handful of unit tests against one or more hand-coded examples, for example using ScalaTest:

"Evaluation Monoid" should {

  import EvaluationMonoid._
  implicit val eq = Equal.equalA[Evaluation]
  val testEval = Evaluation(3, 2)

  "obey Monoid typeclass Law" in {
    Monoid.monoidLaw.leftIdentity(testEval) should be (true)
    Monoid.monoidLaw.rightIdentity(testEval) should be (true)
  }
}

However, this merely gives us an existence result: there exists some value for which the desired property holds. We'd like something a little stronger. This is where we can use ScalaCheck to do property testing, randomly generating as many arbitrary instances of Evaluation as we'd like. If the law holds for all [6] generated instances, we can have a higher degree of confidence in the correctness of our implementation. To accomplish this we simply need to supply a means of generating random Evaluation instances via ScalaCheck Gen:

val evalGen = for {
  total   <- Gen.choose(0, 1000)
  correct <- Gen.choose(0, total)
} yield Evaluation(total, correct)

"Evaluation Monoid" should {

  import EvaluationMonoid._
  implicit val eq = Equal.equalA[Evaluation]

  "obey Monoid typeclass Law" in {
    forAll (evalGen) { testEval =>
      Monoid.monoidLaw.leftIdentity(testEval) should be (true)
      Monoid.monoidLaw.rightIdentity(testEval) should be (true)
    }
  }
}

Now that’s an abstraction we can believe in!

So what?

This level of confidence becomes important when we begin to compose type class instances, mixing and matching this machinery to achieve our desired effects. Returning to our Evaluation example, we may want to evaluate different models over these datasets, storing the results for each dataset in a Map[String,Evaluation] where the keys refer to which model was used to obtain the results. In scalaz, we get the Monoid[Map[String,Evaluation]] instance “for free”, given an instance of Monoid[Evaluation]:

scala> implicit val em = EvaluationMonoid
em: EvaluationMonoid.type = EvaluationMonoid$@34f5b235
 
scala> implicit val mm = mapMonoid[String,Evaluation]
mm: scalaz.Monoid[Map[String,Evaluation]] = scalaz.std.MapInstances$$anon$4@13105b09
 
scala> val dataset1 = Map("modelA" -> Evaluation(3,2),
| "modelB" -> Evaluation(4,1))
dataset1: scala.collection.immutable.Map[String,Evaluation] =
Map(modelA -> Evaluation(3,2), modelB -> Evaluation(4,1))
 
scala> val dataset2 = Map("modelA" -> Evaluation(5,4))
dataset2: scala.collection.immutable.Map[String,Evaluation] =
Map(modelA -> Evaluation(5,4))
 
scala> mm.append(dataset1,dataset2)
res3: Map[String,Evaluation] =
Map(modelA -> Evaluation(8,6), modelB -> Evaluation(4,1))

Conclusion and references

If you are using the scalaz library, many of the provided type classes come “batteries included” with type class laws. Even if you are not, these ideas can help you to build more reliable type class instances which can be composed and extended with confidence. See below for some additional references and readings on this subject:

Footnotes

[1] Omitting associativity and explicit discussion of closure.

[2] For brevity, these code snippets do not show library (scalaz, ScalaTest, ScalaCheck) imports.

[3] Excluding the unfortunate possibilities of null return values or thrown Exceptions.

[4] A semigroup is a more general concept than a monoid, which is modeled in scalaz by having Monoid[F] extend Semigroup[F].

[5] This implementation has a bit of a boilerplate flavor; this post describes how we could automagically derive our Monoid[Evaluation] instance.

[6] As implied by the ScalaCheck project’s appropriate logo.

Cloud Log Management for Control Freaks

10.02.2014 | Posted by Bright Fulton

The following is a guest post from Bright Fulton, Director of Engineering Operations at Swipely.

Like other teams that value their time and focus, Swipely Engineering strongly prefers partnering with third party infrastructure, platform, and monitoring services. We don’t, however, like to be externally blocked while debugging an issue or asking a new question of our data. Is giving up control the price of convenience? It shouldn’t be. The best services do the heavy lifting for you while preserving flexibility. The key lies in how you interface with the service: stay in control of data ingest and code extensibility.

A great example of this principle is Swipely’s log management architecture. We’ve been happily using Sumo Logic for years. They have an awesome product and are responsive to their customers. That’s a strong foundation, but because logging is such a vital function, we retain essential controls while taking advantage of all the power that Sumo Logic provides.


Get the benefits

Infrastructure services have flipped our notion of stability: instead of being comforted by long uptime, we now see it as a liability. Instances start, do work for an hour, terminate. But where do the logs go? One key benefit of a well integrated log management solution is centralization: stream log data off transient systems and into a centralized service.

Once stored and indexed, we want to be able to ask questions of our logs, to react to them. Quick answers come from ad-hoc searches:

  • How many times did we see this exception yesterday?

  • Show me everything related to this request ID.

Next, we define scheduled reports to catch issues earlier and shift toward a strategic view of our event data.

  • Alert me if we didn’t process a heartbeat job last hour.

  • Send me a weekly report of which instance types have the worst clock skew.

Good cloud log management solutions make this centralization, searching, and reporting easy.


Control the data

It’s possible to get these benefits without sacrificing control of the data by keeping the ingest path simple: push data through a single transport agent and keep your own copy. Swipely’s logging architecture collects with rsyslog and processes with Logstash before forwarding everything to both S3 and Sumo Logic.

Swipely’s Logging Architecture

Put all your events in one agent and watch that agent.

You likely have several services that you want to push time series data to: logs, metrics, alerts. To solve each concern independently could leave you with multiple long running agent processes that you need to install, configure, and keep running on every system. Each of those agents will solve similar problems of encryption, authorization, batching, local buffering, back-off, updates. Each comes with its own idiosyncrasies and dependencies. That’s a lot of complexity to manage in every instance.

The lowest common denominator of these time series event domains is the log. Simplify by standardizing on one log forwarding agent in your base image. Use something reliable, widely deployed, open source. Swipely uses rsyslog, but more important than which one is that there is just one.

Tee time

It seems an obvious point, but control freaks shouldn’t need to export their data from third parties. Instead of forwarding straight to the external service, send logs to an aggregation server first. Swipely uses Logstash to receive the many rsyslog streams. In addition to addressing vendor integrations in one place, this point of centralization allows you to:

  • Tee your event stream. Different downstream services have different strengths. Swipely sends all logs to both Sumo Logic for search and reporting and to S3 for retention and batch jobs.

  • Apply real-time policies. Since Logstash sees every log almost immediately, it’s a great place to enforce invariants, augment events, and make routing decisions. For example, logs that come in without required fields are flagged (or dropped). We add classification tags based on source and content patterns. Metrics are sent to a metric service. Critical events are pushed to an SNS topic.
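A minimal sketch of what such a tee might look like in a Logstash configuration. The plugin names (syslog input, s3 and http outputs) follow standard Logstash conventions, but the port, bucket, tag, and endpoint URL are placeholders for illustration, not Swipely's actual settings:

```
input {
  syslog { port => 5514 }            # receive the many rsyslog streams
}

filter {
  # Enforce an invariant: flag events that arrive without a program name.
  if ![program] {
    mutate { add_tag => ["missing_program"] }
  }
}

output {
  # Tee 1: S3 for retention and batch jobs.
  s3 {
    bucket => "example-log-archive"
  }
  # Tee 2: the log management service, for search and reporting.
  http {
    url         => "https://collectors.example.com/receiver/v1/http/XXXX"
    http_method => "post"
  }
}
```

The point is that every downstream integration lives in this one place, so swapping or adding a destination never touches the fleet of instances.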


Control the code

The output is as important as the input. Now that you’re pushing all your logs to a log management service and interacting happily through search and reports, extend the service by making use of indexes and aggregation operators from your own code.

Wrap the API

Good log management services have good APIs and Sumo Logic has several. The Search Job API is particularly powerful, giving access to streaming results in the same way we’re used to in their search UI.

Swipely created the sumo-search gem in order to take advantage of the Search Job API. We use it to permit arbitrary action on the results of a search.

Custom alerts and dashboards

Bringing searches into the comfort of the Unix shell is part of the appeal of a tool like this, but even more compelling is bringing them into code. For example, Swipely uses sumo-search from a periodic job to send alerts that are more actionable than just the search query results. We can select the most pertinent parts of the message and link in information from other sources. 

Engineers at Swipely start weekly tactical meetings by reporting trailing seven day metrics. For example: features shipped, slowest requests, error rates, analytics pipeline durations. These indicators help guide and prioritize discussion. Although many of these metrics are from different sources, we like to see them together in one dashboard. With sumo-search and the Search Job API, we can turn any number from a log query into a dashboard widget in a couple lines of Ruby.


Giving up control is not the price of SaaS convenience. Sumo Logic does the heavy lifting of log management for Swipely and provides an interface that allows us to stay flexible. We control data on the way in by preferring open source tools in the early stages of our log pipeline and saving everything we send to S3. We preserve our ability to extend functionality by making their powerful search API easy to use from both shell and Ruby.

We’d appreciate feedback (@swipelyeng) on our logging architecture. Also, we’re not really control freaks and would love pull requests and suggestions on sumo-search!


Debugging Amazon SES message delivery using Sumo Logic

10.02.2014 | Posted by Vivek Kaushal

 

We at Sumo Logic use Amazon SES (Simple Email Service) to send thousands of emails every day for things like search results, alerts, and account notifications. We need to monitor SES to ensure timely delivery and to know when emails bounce.

Amazon SES provides notifications about status of email via Amazon SNS (Simple Notification Service). Amazon SNS allows you to send these notifications to any HTTP endpoint. We ingest these messages using Sumo Logic’s HTTP Source.

Using these logs, we have identified problems like a scheduled search that always sent results to an invalid email address, and a Microsoft Office 365 outage when a customer reported not receiving the sign-up email.

 

Here’s a step by step guide on how to send your Amazon SES notifications to Sumo Logic.

1. Set Up Collector. The first step is to set up a hosted collector in Sumo Logic which can receive logs via HTTP endpoint. While setting up the hosted collector, we recommend providing an informative source category name, like “aws-ses”.  

2. Add HTTP Source. After adding a hosted collector, you need to add a HTTP Source. Once a HTTP Source is added, it will generate a URL which will be used to receive notifications from SNS. The URL looks like https://collectors.sumologic.com/receiver/v1/http/ABCDEFGHIJK.  

3. Create SNS Topic. In order to send notifications from SES to SNS, we need to create an SNS topic. The following picture shows how to create a new SNS topic in the SNS console. We use “SES-Notifications” as the name of the topic in our example.

4. Create SNS Subscription. SNS allows you to send a notification to multiple HTTP Endpoints by creating multiple subscriptions within a topic. In this step we will create one subscription for the SES-Notifications topic created in step 3 and send notifications to the HTTP endpoint generated in step 2.

5. Confirm Subscription. After a subscription is created, Amazon SNS will send a subscription confirmation message to the endpoint. This subscription confirmation notification can be found in Sumo Logic by searching for: _sourceCategory=<name of the sourceCategory provided in step 1>

For example: _sourceCategory=aws-ses 

Copy the link from the logs and paste it in your browser.

6. Send SES notifications to SNS. Finally, configure SES to send notifications to SNS. For this, go to the SES console and select the verified senders option on the left-hand side. In the list of verified email addresses, select the email address for which you want to configure notifications. The page looks like this:

On the above page, expand the notifications section and click edit notifications. Select the SNS topic you created in step 3.

 

7. Switch message format to raw (Optional). SES sends notifications to SNS in a JSON format. Any notification sent through SNS is by default wrapped into a JSON message. Thus in this case, it creates a nested JSON, resulting in a nearly unreadable message. To remove this problem of nested JSON messages, we highly recommend configuring SNS to use raw message delivery option.
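The difference is easy to demonstrate in a few lines of Python. With the default format, the SES notification arrives as a JSON string inside the SNS envelope's Message field, so it has to be decoded twice (the payloads below are truncated illustrations, not complete SES notifications):

```python
import json

# Default SNS delivery: the SES notification is a JSON *string* inside the envelope.
wrapped = '{"Type": "Notification", "Message": "{\\"notificationType\\": \\"Bounce\\"}"}'
envelope = json.loads(wrapped)
inner = json.loads(envelope["Message"])   # a second parse is required
print(inner["notificationType"])          # -> Bounce

# Raw message delivery: the SES notification arrives as-is, one parse suffices.
raw = '{"notificationType": "Bounce"}'
print(json.loads(raw)["notificationType"])  # -> Bounce
```

Raw delivery means field extraction in the log search can address keys like notificationType directly instead of digging through an escaped, nested document.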

Before setting raw message format

After setting raw message format


The JSON operator was used to easily parse the messages, as shown in the queries below:

1. Retrieve general information out of messages
_sourceCategory=aws-ses | json "notificationType", "mail", "mail.destination", "mail.destination[0]", "bounce", "bounce.bounceType", "bounce.bounceSubType", "bounce.bouncedRecipients[0]" nodrop

2. Identify most frequently bounced recipients
_sourceCategory=aws-ses AND !"notificationType\":\"Delivery" | json "notificationType", "mail.destination[0]" as type, destination nodrop | count by destination | sort by _count


We are Shellshock Bash Bug Free Here at Sumo Logic, but What about You?

10.01.2014 | Posted by Vera Chen

Be Aware and Be Prepared

I am betting most of you have heard about the recent “Shellshock” Bash bug. If not, here is why you should care: this bug affects users of Bash, one of the most popular utilities installed on operating systems today. Discovered in early September 2014, this extremely severe bug affects Bash versions dating back to 1.13 and allows shell commands placed after function definitions in environment variables to be executed, exposing systems to security threats. This vulnerability allows remote attackers to execute arbitrary shell commands, gain access to internal data, publish malicious code, reconfigure environments, and exploit systems in countless ways.

Shellshock Bash Bug Free, Safe and Secure

None of the Sumo Logic service components were impacted due to the innate design of our systems.  However, for those of you out there who might have fallen victim to this bug based on your system architecture, you’ll want to jump in quickly to address potential vulnerabilities. 

What We Can Do for You

If you have been searching around for a tool to expedite the process of identifying potential attacks on your systems, you're in the right place!  I recommend that you consider Sumo Logic, and especially our pattern recognition capability called LogReduce.  Here is how it works: the search feature enables you to search for the well-known “() {” Shellshock indicator, while the touch of the LogReduce button effectively returns potential malicious activity for you to consider.  Take, for instance, a large group of messages that could be a typical series of ping requests: LogReduce separates messages by their distinct signatures, making it easier for you to review those that differ from the norm.  You can easily see instances of scans, attempts, and real attacks separated into distinct groups.  This feature streamlines your investigation process to uncover abnormalities and potential attacks.  Give it a try and see for yourself how LogReduce can reveal a broad range of remote attacker activity, from downloads of malicious files to your systems, to internal file dumps for external retrieval, and many others.

Witness it Yourself

Check out this video to see how our service enables you to proactively identify suspicious or malicious activity on your systems: Sumo Logic: Finding Shellshock Vulnerabilities

Give Us a Try

For those of you who are completely new to our service, you can sign up for a free 30-day trial here: Sumo Logic Free 30 Day Trial

 

LogReduce vs Shellshock

09.25.2014 | Posted by Joan Pepin, VP of Security/CISO

 

Shellshock is the latest catastrophic vulnerability to hit the Internet. Following so closely on the heels of Heartbleed, it serves as a reminder of how dynamic information security can be.

(NOTE: Sumo Logic was not and is not vulnerable to this attack, and all of our BASH binaries were patched on Wednesday night.)

Land-Grab

Right now there is a massive land-grab going on, as dozens of criminal hacker groups (and others) are looking to exploit this widespread and serious vulnerability for profit. Patching this vulnerability while simultaneously sifting through massive volumes of data looking for signs of compromise is a daunting task for your security and operations teams. However, Sumo Logic’s patent pending LogReduce technology can make this task much easier, as we demonstrated this morning.

Way Big Data

While working with a customer to develop a query to show possible exploitation of Shellshock, we saw over 10,000 exploitation attempts in a fifteen minute window. It quickly became clear that a majority of the attempts were being made by their internal scanner. By employing LogReduce we were able to very quickly pick out the actual attack attempts from the data-stream, which allowed our customer to focus their resources on the boxes that had been attacked.

 

Fighting the Hydra

From a technical perspective, the Shellshock attack can be hidden in any HTTP header; we have seen it in the User-Agent, the Referrer, and as part of the GET request itself. Once invoked in this way, it can be used to do anything from sending a ping, to sending an email, to installing a trojan horse or opening a remote shell. All of which we have seen already today. And HTTP isn’t even the only vector, but rather just one of many which may be used, including DHCP.
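Because the payload can ride in any header, a detector has to inspect all of them. A toy Python sketch of that idea (the function and header values are our own illustration; real attacks are far more varied than this single signature):

```python
# The classic Shellshock signature: an empty function definition
# followed by trailing commands, e.g. "() { :;}; <attacker command>".
SIGNATURE = "() {"

def suspicious_headers(headers):
    """Return the names of HTTP headers carrying the Shellshock pattern."""
    return [name for name, value in headers.items() if SIGNATURE in value]

request_headers = {
    "User-Agent": "() { :;}; /bin/ping -c 1 attacker.example.com",
    "Referer": "http://example.com/",
    "Cookie": "() { :;}; /bin/cat /etc/passwd",
}
print(suspicious_headers(request_headers))  # -> ['User-Agent', 'Cookie']
```

A substring match like this catches the naive form but not obfuscated variants, which is precisely why grouping similar messages with LogReduce beats any single hand-written pattern.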

So Shellshock presents a highly flexible attack vector and can be employed in a number of ways to do a large variety of malicious things. It is so flexible that there is no single way to search for it or alert on it that will be completely reliable. There is no single silver bullet to slay this monster; however, LogReduce can quickly shine light on the situation and whittle it down to a much less monstrous scale.

We are currently seeing many different varieties of scanning, both innocent and not-so-innocent; as well as a wide variety of malicious behavior, from directly installing trojan malware to opening remote shells for attackers. This vulnerability is actively being exploited in the wild this very second. The Sumo Logic LogReduce functionality can help you mitigate the threat immediately.


Secret Santa – The Math Behind The Game

09.25.2014 | Posted by mycal tucker

It’s that time of year again! Time for Secret Santa. After all, what shows off your holiday spirit better than exchanging gifts in August? As you attempt to organize your friends into a Secret Santa pool, though, I wonder if you appreciate the beautiful math going on in the background.

For those of you unfamiliar with Secret Santa, here’s the basic idea. A group of friends write their names on slips of paper and drop them into a hat. Once everyone’s name is in, each person blindly draws out a name from the hat. These slips of paper indicate whose Secret Santa each person is. For the sake of simplicity, let us assume that if a person draws their own name, they are their own Secret Santa.

As an example, consider a group of three friends: Alice, Bob, and Carol. Alice draws Bob’s name out of the hat. Bob draws Alice’s name out of the hat. Carol draws her own name out of the hat. In this example, Alice will give Bob a gift; Bob will give Alice a gift; and Carol will give herself a gift.

Here comes the math.

In the example previously described, I would argue that there are two “loops” of people. A loop can be defined as an ordered list of names such that each person gives a gift to the next person in the list except for the last person, who gives to the first person in the list. Below we see a graphical interpretation of the example that clearly shows two loops. Alice and Bob are one loop while Carol is her own loop.

 

We could equally well display this information by using a list. Alice gives a gift to the first person in the list, Bob gives to the second person, and Carol gives to the third person. Thus we can describe the graph above by writing [B, A, C].

One can easily imagine a different arrangement of gift-giving resulting in a different number of loops, however. For example, if Alice drew Bob’s name, Bob drew Carol’s name, and Carol drew Alice’s name, there would only be one loop. If Alice drew her own name, Bob his own name, and Carol her own name, there would be three loops.

     [B, C, A]     [A, B, C]

 

In these diagrams, each node is a person and each edge describes giving a gift. Note that each person has exactly one incoming and one outgoing edge since everybody receives and gives one gift. Below each diagram is the corresponding list representation.
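Counting loops from the list representation is straightforward: walk each unvisited node until you come back around to where you started. A sketch (using 0-indexed people, so [B, A, C] becomes Vector(1, 0, 2)):

```scala
// Count the loops in a Secret Santa assignment. perm(i) is the index
// of the person that person i gives a gift to; perm is assumed to be
// a permutation of 0 until perm.length.
object Loops {
  def countLoops(perm: Vector[Int]): Int = {
    val visited = Array.fill(perm.length)(false)
    var loops = 0
    for (start <- perm.indices if !visited(start)) {
      loops += 1 // each unvisited start opens a new loop
      var i = start
      while (!visited(i)) { visited(i) = true; i = perm(i) }
    }
    loops
  }
}
```

For instance, Loops.countLoops(Vector(1, 0, 2)) gives 2, matching the Alice/Bob loop plus Carol's self-loop.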

The question that had been keeping me up at night recently is as follows: for a group of x people participating in Secret Santa, what is the average number of loops one can expect to see after everyone has drawn names from the hat? After I started touting my discovery of a revolutionary graph theory problem to my friends, they soon informed me that I was merely studying the fairly well-known problem of the expected number of cycles in a random permutation. Somewhat deflated but determined to research the problem for myself, I pressed on.

To get a rough estimate of the answer, I first simulated the game on my computer. I ran 100 trials for x ranging from 1 to 100 and calculated the number of loops for each trial. I plotted the results and noticed that the resulting curve looked a lot like a log curve. Here’s the graph with a best-fit log line on top.

 

The jitters in the curve no doubt come from not sampling enough simulated trials. Even with that noise, though, what is truly remarkable is that the expected number of loops is nearly exactly equal to the natural log of how many people participate.

These results gave me insights into the problem, but they still didn’t give a completely satisfactory answer. For very small x, for example, ln(x) is a terrible approximation for the number of loops. If x=1, the expected number of loops is necessarily 1 but my log-based model says I should expect 0 loops. Furthermore, intuitively it seems like calculating the number of loops should be a discrete process rather than plugging into a continuous function. Finally, I still didn’t even know for sure that my model was correct. I resolved to analytically prove the exact formula for loops.

Let f(x) represent the average number of loops expected if x people participate in Secret Santa. I decided to work off the hypothesis that f(x) = 1 + 1/2 + 1/3 + … + 1/x (also known as the xth harmonic number). This equation works for small numbers and asymptotically approaches ln(x) for large x.
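As a quick numerical sanity check of this hypothesis, here is a sketch of the simulation (with the loop counting inlined so it stands alone):

```scala
import scala.util.Random

// Sketch of the simulation: average the loop count over many random
// drawings and compare against the harmonic number H_x.
object SantaSim {
  def countLoops(perm: Vector[Int]): Int = {
    val visited = Array.fill(perm.length)(false)
    var loops = 0
    for (start <- perm.indices if !visited(start)) {
      loops += 1
      var i = start
      while (!visited(i)) { visited(i) = true; i = perm(i) }
    }
    loops
  }

  def averageLoops(x: Int, trials: Int, rng: Random): Double =
    (1 to trials)
      .map(_ => countLoops(rng.shuffle((0 until x).toVector)))
      .sum.toDouble / trials

  def harmonic(x: Int): Double = (1 to x).map(1.0 / _).sum
}
```

With enough trials, averageLoops(10, 100000, new Random()) lands near harmonic(10) ≈ 2.93, while ln(10) ≈ 2.30 is noticeably off, which is exactly the small-x discrepancy noted above.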

Since I already know f(x) is correct for small x, the natural way to try to prove my result generally is through a proof by induction.

Base Case:

Let x=1

f(x)=1

 

The average number of loops for a single person in Secret Santa is 1.

The base case works.

 

Inductive Step:

Assume f(x) = 1 + 1/2 + 1/3 + … + 1/x

Prove that f(x+1) = 1 + 1/2 + 1/3 + … + 1/x + 1/(x+1)

 

f(x+1) = [f(x) + 1] * 1/(x+1) + f(x) * x/(x+1)

f(x+1) = f(x) + 1/(x+1)

f(x+1) = 1 + 1/2 + 1/3 + … + 1/x + 1/(x+1)

Q.E.D.

 

The key insight into this proof is the first line of the inductive step. Here’s one way to think about it, using the list representation described earlier:

There are two cases one needs to consider.

1) The last element that we place into the (x+1)th spot in the list has value x+1. This means the first x spots contain all the numbers from 1 to x. The odds of this happening are 1/(x+1). Crucially, we get to assume that the average number of loops from the first x elements is therefore f(x). Adding the last element adds exactly one loop: player x+1 giving himself a gift.

2) The last element that we place into the (x+1)th spot in the list does not have value x+1. This covers all the other cases (the odds of this happening are x/(x+1)). In this scenario, one of the first x people points to x+1, and x+1 points to one of the first x people. In essence, the (x+1)th person is merely extending a loop already determined by the first x people. Therefore the number of loops is just f(x).

If we assume a uniform distribution of permutations (an assumption that is easily violated if players have choice) and we weight these two cases by the probability of each happening, we get the total expected number of loops, f(x+1).

Just like that, we have proved something theoretically beautiful that also applies to something as mundane as a gift exchange game. It all started by simulating a real-world event, looking at the data, and then switching back into analytical mode.

 

****************************************************************************

 

As I mentioned above, my “research” was by no means novel. For further reading on this topic, feel free to consult this nice summary by John Canny about random permutations or, of course, the Wikipedia article about it.

Since starting to write this article, a colleague of mine has emailed me saying that someone else has even thought of this problem in the Secret Santa context and posted his insights here.


Changing Representation

09.18.2014 | Posted by Robert Sloan

I don’t deal in veiled motives — I really like information theory. A lot. It’s been an invaluable conceptual tool for almost every area of my work; and I’m going to try to convince you of its usefulness for engineering problems. Let’s look at a timestamp parsing algorithm in the Sumo Logic codebase.

The basic idea is that each thread gets some stream of input lines (these are from my local /var/log/appfirewall.log), and we want to parse the leading timestamps into another numeric field:

 

Jul 25 08:33:02 vorta.local socketfilterfw[86] <Info>: java: Allow TCP CONNECT (in:5 out:0)

Jul 25 08:39:54 vorta.local socketfilterfw[86] <Info>: Stealth Mode connection attempt to UDP 1 time

Jul 25 08:42:40 vorta.local socketfilterfw[86] <Info>: Stealth Mode connection attempt to UDP 1 time

Jul 25 08:43:01 vorta.local socketfilterfw[86] <Info>: java: Allow TCP LISTEN  (in:0 out:1)

Jul 25 08:44:17 vorta.local socketfilterfw[86] <Info>: Stealth Mode connection attempt to UDP 6 time

 

Sumo Logic is a giant distributed system, and we receive logs with hundreds of different timestamp formats, which are interleaved in the input stream. CPU time on the frontend is dedicated to parsing raw log lines, so if we can derive timestamps more quickly, we can reduce our AWS costs. Let’s assume that exactly one timestamp parser will match; we’ll leave ambiguities for another day.

How can we implement this? The naive approach is to try all of the parsers in an arbitrary sequence each time and see which one works; but all of them are computationally expensive to evaluate. Maybe we try to cache them or parallelize in some creative way? We know that caching should be optimal if the logs were all in the same format; and linear search would be optimal if they were randomly chosen.
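For concreteness, the naive fixed-order approach might look like the sketch below (the format list is illustrative, not Sumo Logic's real parser set):

```scala
import java.text.SimpleDateFormat
import java.util.Locale

// Naive approach: try candidate formats in a fixed order, returning
// the parsed epoch millis plus how many attempts it took. The format
// list here is illustrative, not Sumo Logic's actual parser set.
object NaiveTimestampParser {
  val formats: Seq[String] = Seq(
    "MMM dd HH:mm:ss",             // e.g. Jul 25 08:33:02
    "EEE MMM dd HH:mm:ss zzz yyyy" // e.g. Fri Jul 25 09:06:49 PDT 2014
  )

  def parse(raw: String): Option[(Long, Int)] =
    formats.zipWithIndex.view.flatMap { case (fmt, i) =>
      try Some((new SimpleDateFormat(fmt, Locale.US).parse(raw).getTime, i + 1))
      catch { case _: java.text.ParseException => None }
    }.headOption
}
```

Note the cost: a line in the second format always burns a failed attempt on the first parser, and that per-line overhead is exactly what we want to drive down.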

In any case, the most efficient way to do this isn’t clear, so let’s do some more analysis: take the sequence of correct timestamp formats and label them:

 

Timestamp                    | Format                       | Label
Jul 25 08:52:10              | MMM dd HH:mm:ss              | Format 1
Fri Jul 25 09:06:49 PDT 2014 | EEE MMM dd HH:mm:ss ZZZ yyyy | Format 2
1406304462                   | EpochSeconds                 | Format 3
[Jul 25 08:52:10]            | MMM dd HH:mm:ss              | Format 1

 

How can we turn this into a normal, solvable optimization problem? Well, if we try our parsers in a fixed order, the index label is actually just the number of parsing attempts before hitting the correct parser. Let’s keep the parsers in the original order and add another function that reorders them, and then we’ll try them in that order:

 

Format                       | Parser Label | Parser Index
MMM dd HH:mm:ss              | Format 1     | 2
EEE MMM dd HH:mm:ss ZZZ yyyy | Format 2     | 1
EpochSeconds                 | Format 3     | 3

 

This is clearly better, and we can change this function on every time step. Having the optimal parser choice be a low number is always better, because we’re trying to minimize the time delay of the parsing process:

 

(Time Delay) ∝ (# Tries)

 

 But can we really just optimize over that? It’s not at all clear to me how that translates into an algorithm. While it’s a nice first-order formulation, we’re going to have to change representations to connect it to anything more substantial.

 

Parser Index | Parser Index (Binary) | Parser Index (Unary)
2            | 10                    | 11
1            | 1                     | 1
3            | 11                    | 111

 

This makes it clear that making the parser index small is equivalent to making its decimal/binary/unary representation small. In other words, we want to minimize the information content of the index sequence over our choice of parsers.

 In mathematical terms, the information (notated H) is just the sum of -p log p over each event, where p is the event’s probability. As an analogy, think of -log p as the length of the unary sequence (as above) and p as the probability of the sequence — we’ll use the experimental probability distribution over the parser indices that actually occur.
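Concretely, H over the empirical distribution of parser indices can be computed as follows (a sketch, using log base 2 so the answer comes out in bits):

```scala
// Empirical information content H = sum of -p * log2(p) over the
// distinct parser indices observed in a sequence, where p is each
// index's empirical probability.
object Info {
  def entropy(indices: Seq[Int]): Double = {
    val n = indices.size.toDouble
    indices.groupBy(identity).values.map { group =>
      val p = group.size / n
      -p * math.log(p) / math.log(2) // bits
    }.sum
  }
}
```

A stream that always hits the same parser has entropy 0, while a 50/50 split between two parsers has entropy 1 bit.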

As long as the probability of taking more tries is strictly decreasing, minimizing the information also minimizes the time required, because the information is strictly increasing with the number of tries it takes.

 

arg min {Time Delay} = arg min {Sequence Length * Probability of sequence}
                     = arg min {-p(# Tries) * log p(# Tries)} = arg min {H(# Tries)}

 

That’s strongly suggestive that what we want to use as the parser-order-choosing function is actually a compression function, whose entire goal in life is to minimize the information content (and therefore size) of byte sequences. Let’s see if we can make use of one: in the general case, these algorithms look like Seq(Int) → Seq(Int), making the second sequence shorter.

 

Parser Index Sequence (Length 13):        12,43,32,64,111,33,12,43,32,64,111,33,12
Parser Index, LZW Compressed (Length 10): 12,43,32,64,111,33,256,258,260,12

 

Let’s say that we have some past sequence — call it P — and we’re trying to find the next parser-index mapping. I admit that it’s not immediately clear how to do this with a compression algorithm a priori, but if we just perturb the algorithm, we can compare the options for the next functions as:

 

newInfo(parser label) = H(compress(P + [parser label])) - H(compress(P))

 

Any online compression algorithm will allow you to hold state so that you don’t have to repeat computations in determining this. Then we can just choose the parser with the least newInfo; and if the compressor minimizes information content (which compressors are pretty good at), our algorithm will minimize the required work. If you’d like a deeper explanation of compression, ITILA [1] is a good reference.
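A full compressor-driven chooser is more than fits here, but a toy stand-in captures the flavor: an order-0 frequency model that tries parsers in descending order of how often each has matched so far. (To be clear, this simplification is mine, not the scheme described above.)

```scala
import scala.collection.mutable

// Toy stand-in for the compression-based chooser: an order-0 model
// that orders parsers by how often each has matched so far. A real
// implementation would score candidates with newInfo(...) from an
// online compressor instead.
class AdaptiveParserOrder(numParsers: Int) {
  private val counts = mutable.ArrayBuffer.fill(numParsers)(0L)

  // Parser indices (0-based), most frequently matched first.
  def order: Seq[Int] = counts.indices.sortBy(i => -counts(i))

  // Record which parser actually matched the last line.
  def observe(matched: Int): Unit = counts(matched) += 1
}
```

After a burst of lines in one format, that format's parser floats to the front of the try order, so the common case costs a single attempt.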

 With a fairly small, reasonable change of representation, we now have a well-defined, implementable, fast metric to make online decisions about parser choice. Note that this system will work regardless of the input stream — there is not a worst case except those of the compression algorithm. In this sense, this formulation is adaptive.

 Certainly, the reason that we can draw a precise analogy to a solved problem is because analogous situations show up in many fields, which at least include Compression/Coding, Machine Learning [2], and Controls [3]. Information theory is the core conceptual framework here, and if I’ve succeeded in convincing you, Bayesian Theory [4] is my favorite treatment.

 

References:

  1. Information Theory, Inference, and Learning Algorithms by David MacKay

  2. Prediction, Learning, and Games by Nicolo Cesa-Bianchi and Gabor Lugosi.

  3. Notes on Dynamic Programming and Optimal Control by Dimitri Bertsekas

  4. Bayesian Theory by Jose Bernardo and Adrian Smith


No Magic: Regular Expressions, Part 3

09.11.2014 | Posted by Russell

Evaluating the NFA

In part 1, we parsed the regular expression into an abstract syntax tree. In part 2, we converted that syntax tree into an NFA. Now it’s time to evaluate that NFA against a potential string.

NFAs, DFAs and Regular Expressions

Recall from part 2 that there are two types of finite automata: deterministic and non-deterministic. They have one key difference: a non-deterministic finite automaton can have multiple paths out of the same node for the same token, as well as paths that can be pursued without consuming input. In expressiveness (often referred to as “power”), NFAs, DFAs, and regular expressions are all equivalent. This means that if you can express a rule or pattern (e.g. strings of even length) with an NFA, you can also express it with a DFA or a regular expression. Let’s first consider the regular expression abc* expressed as a DFA:

[Diagram: the DFA for the regular expression abc*]

Evaluating a DFA is straightforward: simply move through the states by consuming the input string. If you finish consuming input in the match state, match, otherwise, don’t. Our state machine, on the other hand, is an NFA. The NFA our code generates for this regular expression is:

[Diagram: the NFA our code generates for abc*]

Note that there are multiple unlabeled edges that we can follow without consuming a character. How can we track that efficiently? The answer is surprisingly simple: instead of tracking only one possible state, keep a list of states that the engine is currently in. When you encounter a fork, take both paths (turning one state into two). When a state lacks a valid transition for the current input, remove it from the list.

There are two subtleties we have to consider: avoiding infinite loops in the graph and handling no-input transitions properly. When we are evaluating a given state, we first advance all the states to enumerate everything reachable from our current state without consuming any more input. This is the phase that requires care to maintain a “visited set” to avoid looping infinitely in our graph. Once we have enumerated those states, we consume the next token of input, either advancing those states or removing them from our set.

import scala.collection.mutable

object NFAEvaluator {
  def evaluate(nfa: State, input: String): Boolean =
    evaluate(Set(nfa), input)

  def evaluate(nfas: Set[State], input: String): Boolean = {
    input match {
      case "" =>
        evaluateStates(nfas, None).exists(_ == Match())
      case string =>
        evaluate(
          evaluateStates(nfas, input.headOption),
          string.tail
        )
    }
  }

  def evaluateStates(nfas: Set[State],
                     input: Option[Char]): Set[State] = {
    val visitedStates = mutable.Set[State]()
    nfas.flatMap { state =>
      evaluateState(state, input, visitedStates)
    }
  }

  def evaluateState(currentState: State, input: Option[Char],
                    visitedStates: mutable.Set[State]): Set[State] = {
    if (visitedStates contains currentState) {
      Set()
    } else {
      visitedStates.add(currentState)
      currentState match {
        case placeholder: Placeholder =>
          evaluateState(
            placeholder.pointingTo,
            input,
            visitedStates
          )
        case consume: Consume =>
          if (Some(consume.c) == input
              || consume.c == '.') {
            Set(consume.out)
          } else {
            Set()
          }
        case s: Split =>
          evaluateState(s.out1, input, visitedStates) ++
            evaluateState(s.out2, input, visitedStates)
        case m: Match =>
          if (input.isDefined) Set() else Set(Match())
      }
    }
  }
}

And that's it!

Put a bow on it

We’ve finished all the important code, but the API isn’t as clean as we’d like. Now, we need to create a single-call user interface to call our regular expression engine. We’ll also add the ability to match your pattern anywhere in the string with a bit of syntactic sugar.

object Regex {
  def fullMatch(input: String, pattern: String) = {
    val parsed = RegexParser(pattern).getOrElse(
      throw new RuntimeException("Failed to parse regex")
    )
    val nfa = NFA.regexToNFA(parsed)
    NFAEvaluator.evaluate(nfa, input)
  }

  def matchAnywhere(input: String, pattern: String) =
    fullMatch(input, ".*" + pattern + ".*")
}

To use it:

Regex.fullMatch("aaaaab", "a*b") // True
Regex.fullMatch("aaaabc", "a*b") // False
Regex.matchAnywhere("abcde", "cde") // True

That’s all there is to it. A semi-functional regex implementation in just 106 lines. There are a number of things that could be added but I decided they added complexity without enough value:

  1. Character classes
  2. Value extraction
  3. ?
  4. Escape characters
  5. And many more.

I hope this simple implementation helps you understand what’s going on under the hood! It’s worth mentioning that the performance of this evaluator is heinous. Truly terrible. Perhaps in a future post I’ll look into why and talk about ways to optimize it…
