Vera Chen

We are Shellshock Bash Bug Free Here at Sumo Logic, but What about You?

10.01.2014 | Posted by Vera Chen

Be Aware and Be Prepared

I am betting most of you have heard about the recent “Shellshock Bash Bug”. If not, here is why you should care – this bug affects users of Bash, one of the most widely installed utilities on operating systems today. Discovered in early September 2014, this extremely severe bug affects Bash versions dating back to 1.13 and stems from the way Bash processes shell commands placed after function definitions, which exposes systems to security threats. The vulnerability allows remote attackers to execute arbitrary shell commands, gain access to internal data, publish malicious code, reconfigure environments and exploit systems in countless ways.

Shellshock Bash Bug Free, Safe and Secure

None of the Sumo Logic service components were impacted due to the innate design of our systems.  However, for those of you out there who might have fallen victim to this bug based on your system architecture, you’ll want to jump in quickly to address potential vulnerabilities. 

What We Can Do for You

If you have been searching for a tool to expedite the process of identifying potential attacks on your systems, you’re in the right place! I recommend that you consider Sumo Logic, and especially our pattern recognition capability, LogReduce. Here is how it works – the search feature lets you look for the well-known “() {” Shellshock indicator, and a touch of the LogReduce button surfaces potentially malicious activity for you to review. Take, for instance, a large group of messages that could be a typical series of ping requests: LogReduce separates the messages by their distinct signatures, making it easier for you to review the ones that differ from the norm. You can easily see instances of scans, attempts and real attacks separated into distinct groups. This streamlines your investigation process and helps you uncover abnormalities and potential attacks. Give it a try and see for yourself how LogReduce can reveal a broad range of remote attacker activity, from downloads of malicious files onto your systems to internal file dumps staged for external retrieval, and more.
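As a rough illustration (the source category below is a placeholder – point the search at your own web server logs and adjust it to your environment), a query along these lines pulls back the Shellshock probes and lets LogReduce group them by signature:

_sourceCategory=prod/apache/access "() {"
| logreduce

From there, the groups that don’t look like your own scanner or monitoring traffic are the ones worth digging into.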

Witness it Yourself

Check out this video to see how our service enables you to proactively identify suspicious or malicious activity on your systems: Sumo Logic: Finding Shellshock Vulnerabilities

Give Us a Try

For those of you who are completely new to our service, you can sign up for a free 30-day trial here: Sumo Logic Free 30 Day Trial

 

Sanjay Sarathy, CMO

Why Do DevOps Shops Care About Machine Data Analytics?

09.30.2014 | Posted by Sanjay Sarathy, CMO

Introduction

The IT industry is always changing, and at the forefront today is the DevOps movement. The whole idea of DevOps is to help businesses become more responsive to user requests and adapt faster to market conditions. Successful DevOps rollouts count on the ability to rapidly diagnose application issues that are hidden in machine data, so the ability to quickly uncover patterns and anomalies in your logs is paramount. As a result, DevOps shops are fast becoming a sweet spot for us. Yes, DevOps can mean many things – lean IT methodologies, agile software development, programmable architectures, a sharing culture and more. At the root of it all is data, especially machine data.

DevOps job trends have exploded onto the scene, as the graphic below indicates.

In the midst of this relatively recent boom, DevOps teams have been searching for tools that help them fulfill their requirements. Sumo Logic is a DevOps shop, and at DevOps Days in Austin we detailed our own DevOps scale-up. We covered everything from culture change to spreading knowledge to the issues that we faced. The result is that our machine data analytics service is not only incredibly useful to us as a DevOps organization but also provides deep insights for any organization looking to optimize its processes.

Sumo Logic At Work In A DevOps Setting

The very notion of software development has been rocked to its core by DevOps, and that shift has been enabled by rapid analysis throughout the development lifecycle. Sumo Logic makes it possible to easily integrate visibility into any software infrastructure and monitor the effects of changes across development, test and production environments. Data analysis can now cast a wide net and, with our custom dashboards and flexible integration, can take place anywhere you can put code. Rapid cause-and-effect, rapid error counts, and rapid analysis mean rapid software development and code updating. If user performance has been an issue, DevOps and Sumo Logic can address those experiences as well, through analytic insight from relevant data sources in your environment. That makes for better software for your company and your customers. It also means happier developers, and we know that hasn’t traditionally been an easy task.

Sumo Logic offers an enterprise-scale, cloud-based product that grows as a business grows. TuneIn, a well-known internet radio and podcast platform, utilizes Sumo Logic, and in a recent guest post their development teams shared how they used our technology to create custom searches and alerts for errors and exceptions in the logs, allowing them to reduce overall error rates by close to twenty percent. Another Sumo Logic customer, PagerDuty, shared their story of a rapid Sumo Logic DevOps deployment and of reaching their ROI point in under a month.

Flexibility, speed, scalability, and extensibility – these are the kinds of qualities that DevOps shops look for in their commercial tools. Netskope is a cloud-based security company and a DevOps shop that has integrated Sumo Logic into their cloud infrastructure. In this video, they describe the value of Sumo Logic in providing instant feedback on the performance and availability of their application.

Today, DevOps teams around the world are using Sumo Logic to deliver the insights they need on demand. With Sumo Logic supporting DevOps teams throughout their application lifecycle, organizations are able to deliver on the promise of their applications and fulfill their business goals.

LogReduce vs Shellshock

09.25.2014 | Posted by Joan Pepin, VP of Security/CISO

 

Shellshock is the latest catastrophic vulnerability to hit the Internet. Following so closely on the heels of Heartbleed, it serves as a reminder of how dynamic information security can be.

(NOTE: Sumo Logic was not and is not vulnerable to this attack, and all of our BASH binaries were patched on Wednesday night.)

Land-Grab

Right now there is a massive land-grab going on, as dozens of criminal hacker groups (and others) look to exploit this widespread and serious vulnerability for profit. Patching the vulnerability while simultaneously sifting through massive volumes of data looking for signs of compromise is a daunting task for your security and operations teams. However, Sumo Logic’s patent-pending LogReduce technology can make this task much easier, as we demonstrated this morning.

Way Big Data

While working with a customer to develop a query to show possible exploitation of Shellshock, we saw over 10,000 exploitation attempts in a fifteen minute window. It quickly became clear that a majority of the attempts were being made by their internal scanner. By employing LogReduce we were able to very quickly pick out the actual attack attempts from the data-stream, which allowed our customer to focus their resources on the boxes that had been attacked.

 

Fighting the Hydra

From a technical perspective, the Shellshock attack can be hidden in any HTTP header; we have seen it in the User-Agent header, in the Referer header, and as part of the GET request itself. Once invoked in this way, it can be used to do anything from sending a ping, to sending an email, to installing a trojan horse or opening a remote shell – all of which we have seen already today. And HTTP isn’t even the only vector, but rather just one of many that may be used, including DHCP.

So Shellshock presents a highly flexible attack vector that can be employed in a number of ways to do a large variety of malicious things. It is so flexible that there is no single way to search for it or alert on it that will be completely reliable. There is no single silver bullet to slay this monster; however, LogReduce can quickly shine light on the situation and whittle it down to a much less monstrous scale.

We are currently seeing many different varieties of scanning, both innocent and not-so-innocent; as well as a wide variety of malicious behavior, from directly installing trojan malware to opening remote shells for attackers. This vulnerability is actively being exploited in the wild this very second. The Sumo Logic LogReduce functionality can help you mitigate the threat immediately.

mycal tucker

Secret Santa – The Math Behind The Game

09.25.2014 | Posted by mycal tucker

It’s that time of year again! Time for Secret Santa. After all, what shows off your holiday spirit better than exchanging gifts in August? As you attempt to organize your friends into a Secret Santa pool, though, I wonder if you appreciate the beautiful math going on in the background.

For those of you unfamiliar with Secret Santa, here’s the basic idea. A group of friends write their names on slips of paper and drop them into a hat. Once everyone’s name is in, each person blindly draws out a name from the hat. These slips of paper indicate whose Secret Santa each person is. For the sake of simplicity, let us assume that if a person draws their own name, they are their own Secret Santa.

As an example, consider a group of three friends: Alice, Bob, and Carol. Alice draws Bob’s name out of the hat. Bob draws Alice’s name out of the hat. Carol draws her own name out of the hat. In this example, Alice will give Bob a gift; Bob will give Alice a gift; and Carol will give herself a gift.

Here comes the math.

In the example previously described, I would argue that there are two “loops” of people. A loop can be defined as an ordered list of names such that each person gives a gift to the next person in the list except for the last person, who gives to the first person in the list. Below we see a graphical interpretation of the example that clearly shows two loops. Alice and Bob are one loop while Carol is her own loop.

 

We could equally well display this information by using a list. Alice gives a gift to the first person in the list, Bob gives to the second person, and Carol gives to the third person. Thus we can describe the graph above by writing [B, A, C].

One can easily imagine a different arrangement of gift-giving resulting in a different number of loops, however. For example, if Alice drew Bob’s name, Bob drew Carol’s name, and Carol drew Alice’s name, there would be only one loop. If Alice drew her own name, Bob his own name, and Carol her own name, there would be three loops.

     [B, C, A]     [A, B, C]

 

In these diagrams, each node is a person and each edge describes giving a gift. Note that each person has exactly one incoming and one outgoing edge since everybody receives and gives one gift. Below each diagram is the corresponding list representation.

The question that had been keeping me up at night recently is as follows: for a group of x people participating in Secret Santa, what is the average number of loops one can expect to see after everyone has drawn names from the hat? After I started touting my discovery of a revolutionary graph theory problem to my friends, they soon informed me that I was merely studying the fairly well-known problem of the expected number of cycles in a random permutation. Somewhat deflated but determined to research the problem for myself, I pressed on.

To get a rough estimate of the answer, I first simulated the game on my computer. I ran 100 trials for x ranging from 1 to 100 and calculated the number of loops for each trial. I plotted the results and noticed that the resulting curve looked a lot like a log curve. Here’s the graph with a best-fit log line on top.

 

The jitters in the curve no doubt come from not sampling enough simulated trials. Even with that noise, though, what is truly remarkable is that the expected number of loops is nearly exactly equal to the natural log of how many people participate.
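If you want to reproduce the experiment, a simulation along these lines will do (this is a sketch of my own in Scala, not the original code): it counts the loops in a random assignment and averages over many trials.

import scala.util.Random

object SecretSanta {
  // assignment(i) is the index of the person that player i gives a gift to
  // (the list representation described above).
  def countLoops(assignment: Vector[Int]): Int = {
    val visited = Array.fill(assignment.length)(false)
    var loops = 0
    for (start <- assignment.indices if !visited(start)) {
      loops += 1
      var current = start
      while (!visited(current)) {
        visited(current) = true
        current = assignment(current)
      }
    }
    loops
  }

  // Average loop count over many random drawings for x players.
  def averageLoops(x: Int, trials: Int = 100): Double = {
    val total = (1 to trials).map { _ =>
      countLoops(Random.shuffle((0 until x).toVector))
    }.sum
    total.toDouble / trials
  }
}

// e.g. SecretSanta.averageLoops(100) comes out close to ln(100) ≈ 4.6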

These results gave me insights into the problem, but they still didn’t give a completely satisfactory answer. For very small x, for example, ln(x) is a terrible approximation for the number of loops. If x=1, the expected number of loops is necessarily 1 but my log-based model says I should expect 0 loops. Furthermore, intuitively it seems like calculating the number of loops should be a discrete process rather than plugging into a continuous function. Finally, I still didn’t even know for sure that my model was correct. I resolved to analytically prove the exact formula for loops.

Let f(x) represent the average number of loops expected if x people participate in Secret Santa. I decided to work off the hypothesis that f(x) = 1 + 1/2 + 1/3 + … + 1/x (also known as the xth harmonic number). This equation works for small numbers (for x = 2, the two possible drawings give two loops and one loop, averaging 1.5 = 1 + 1/2) and asymptotically approaches ln(x) for large x.

Since I already know f(x) is correct for small x, the natural way to try to prove my result generally is through a proof by induction.

Base Case:

Let x=1

f(x)=1

 

The average number of loops for a single person in Secret Santa is 1.

The base case works.

 

Inductive Step:

Assume f(x) = 1 + 1/2 + 1/3 + … + 1/x

Prove that f(x+1) = 1 + 1/2 + 1/3 + … + 1/x + 1/(x+1)

 

f(x+1) = [f(x) + 1] * 1/(x+1) + f(x) * x/(x+1)

f(x+1) = f(x) + 1/(x+1)

f(x+1) = 1 + 1/2 + 1/3 + … + 1/x + 1/(x+1)

Q.E.D.

 

The key insight in this proof is the first line of the inductive step. Here’s one way to think about it using the list representation described earlier:

There are two cases one needs to consider.

1) The last element that we place into the (x+1)th spot in the list has value x+1. This means the first x spots contain all the numbers from 1 to x. The odds of this happening are 1/(x+1). Crucially, we get to assume that the average number of loops from the first x elements is therefore f(x). Adding the last element adds exactly one loop: player x+1 giving himself a gift.

2) The last element that we place into the (x+1)th spot in the list does not have value x+1. This covers all the other cases (the odds of this happening are x/(x+1)). In this scenario, one of the first x people points to x+1, and x+1 points to one of the first x people. In essence, the (x+1)th person is merely extending a loop already determined by the first x people. Therefore the number of loops is just f(x).

If we assume a uniform distribution of permutations (an assumption that is easily violated if players have a choice) and we weight these two cases by the probability of each of them happening, we get the total expected number of loops for f(x+1).

Just like that, we have proved something theoretically beautiful that also applies to something as mundane as a gift exchange game. It all started by simulating a real-world event, looking at the data, and then switching back into analytical mode.

 

****************************************************************************

 

As I mentioned above, my “research” was by no means novel. For further reading on this topic, feel free to consult this nice summary by John Canny about random permutations or, of course, the Wikipedia article about it.

Since I started writing this article, a colleague of mine has emailed me saying that someone else has thought of this problem in the Secret Santa context and posted his insights here.

Ben Newton, Senior Product Manager

Piercing the Fog in a DevOps World

09.22.2014 | Posted by Ben Newton, Senior Product Manager

Fog on I-280

Two things still amaze me about the San Francisco Bay Area two years after moving here from the East Coast – the blindingly blue, cloudless skies, and the fog. It is hard to describe how beautiful it is to drive up the spine of the San Francisco Peninsula on northbound I-280 as the fog rolls over the Santa Cruz Mountains. You can see the fog pouring slowly over the peaks of the mountains, and see the highway in front of you disappear into the white, fuzzy nothingness of its inexorable progress down the valley. There is always some part of me that wonders what will happen to my car as I pass into the fog. But then I look at my GPS, know that I have driven this road hundreds of times, and assure myself that my house does still exist in there – somewhere.

The Viaduct

Now, I can contrast that experience with learning to drive in the Blue Ridge Mountains of North Carolina. Here’s the background – it’s only my second time behind the wheel, and my Mom takes me on this crazy stretch of road called the Viaduct. Basically, imagine a road hanging off the side of a mountain, with a sheer mountainside on one side and a whole lot of nothing on the other. Now, imagine that road covered in pea-soup fog with 10 ft visibility, and a line of half a dozen cars being led by a terrified teenager with white-knuckled hands on the wheel of a minivan, hoping he won’t careen off the side of the road to a premature death. A completely different experience.

So, what’s the difference between those two experiences? Well, 20 years of driving, and GPS, for starters. I don’t worry about driving into the thick fog as I drive home because I have done it before, I know exactly where I am and how fast I am going, and I am confident that I can avoid obstacles. That knowledge, insight, and experience make all the difference between an awe-inspiring journey and a gut-wrenching nail-biter. This is really not that different from running a state-of-the-art application. Just as I need GPS and experience to brave the fog going home, the difference between confidently innovating and delighting your customers, versus living in constant fear of the next disaster, is driven by both technology and culture. Here are some ways I would flesh out the analogy:

GPS for DevOps

An app team without visibility into their metrics and errors is a team that will never do world-class operations. Machine Data Analytics provides the means to gather the telemetry data and then provide that insight in real-time. This empowers App Ops and DevOps teams to move more quickly and innovate.

Fog Lights for Avoiding Obstacles

You can’t avoid obstacles if you can’t see them in time. You need the right real-time analytics to quickly detect  issues and avoid them before they wreck your operations.

Experience Brings Confidence

If you have driven the road before, it always increases confidence and speed. Signature-based anomaly detection means that the time senior engineers put in to classify previous events gives the entire team the confidence to classify and debug issues.

 

So, as you drive your Application Operations and DevOps teams to push your application to the cutting edge of performance, remember that driving confidently into the DevOps fog is only possible with the right kind of visibility.

 

Images linked from:

  • http://searchresearch1.blogspot.com/2012/09/wednesday-search-challenge-9512-view-of.html
  • http://www.blueridgerunners.org/LinnCove.jpg
Robert Sloan

Changing Representation

09.18.2014 | Posted by Robert Sloan

I don’t deal in veiled motives — I really like information theory. A lot. It’s been an invaluable conceptual tool for almost every area of my work; and I’m going to try to convince you of its usefulness for engineering problems. Let’s look at a timestamp parsing algorithm in the Sumo Logic codebase.

The basic idea is that each thread gets some stream of input lines (these are from my local /var/log/appfirewall.log), and we want to parse the timestamps into another numeric field:

 

Jul 25 08:33:02 vorta.local socketfilterfw[86] <Info>: java: Allow TCP CONNECT (in:5 out:0)

Jul 25 08:39:54 vorta.local socketfilterfw[86] <Info>: Stealth Mode connection attempt to UDP 1 time

Jul 25 08:42:40 vorta.local socketfilterfw[86] <Info>: Stealth Mode connection attempt to UDP 1 time

Jul 25 08:43:01 vorta.local socketfilterfw[86] <Info>: java: Allow TCP LISTEN  (in:0 out:1)

Jul 25 08:44:17 vorta.local socketfilterfw[86] <Info>: Stealth Mode connection attempt to UDP 6 time

 

Because Sumo Logic is a giant distributed system, we receive logs with hundreds of different timestamp formats, interleaved in the input stream. CPU time on the frontend is dedicated to parsing raw log lines, so if we can derive timestamps more quickly, we can reduce our AWS costs. Let’s assume that exactly one timestamp parser will match – we’ll leave ambiguities for another day.

How can we implement this? The naive approach is to try all of the parsers in an arbitrary sequence each time and see which one works, but all of them are computationally expensive to evaluate. Maybe we try to cache them or parallelize in some creative way? We know that caching would be optimal if the logs were all in the same format, and linear search would be optimal if the formats were randomly chosen.
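As a sketch of the naive strategy (the type and names here are placeholders of mine, not the actual Sumo Logic code), note that the cost of a line is proportional to how many parsers we try before one matches:

object NaiveTimestampParsing {
  // Placeholder type: a parser either produces an epoch timestamp or fails.
  type TimestampParser = String => Option[Long]

  // Try the parsers in a fixed order until one matches. The second element
  // of the result is the number of attempts, which is exactly the "parser
  // index" that the rest of this post tries to keep small.
  def parseWithCount(line: String,
                     parsers: Seq[TimestampParser]): (Option[Long], Int) = {
    var tries = 0
    for (parser <- parsers) {
      tries += 1
      parser(line) match {
        case Some(ts) => return (Some(ts), tries)
        case None     => // fall through to the next parser
      }
    }
    (None, tries)
  }
}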

In any case, the most efficient way to do this isn’t clear, so let’s do some more analysis: take the sequence of correct timestamp formats and label them:

 

Timestamp                      | Format                        | Label
Jul 25 08:52:10                | MMM dd HH:mm:ss               | Format 1
Fri Jul 25 09:06:49 PDT 2014   | EEE MMM dd HH:mm:ss ZZZ yyyy  | Format 2
1406304462                     | EpochSeconds                  | Format 3
[Jul 25 08:52:10]              | MMM dd HH:mm:ss               | Format 1

 

How can we turn this into a normal, solvable optimization problem? Well, if we try our parsers in a fixed order, the index label is actually just the number of parsing attempts before hitting the correct parser. Let’s keep the parsers in the original order and add another function that reorders them, and then we’ll try them in that order:

 

Format                        | Parser Label | Parser Index
MMM dd HH:mm:ss               | Format 1     | 2
EEE MMM dd HH:mm:ss ZZZ yyyy  | Format 2     | 1
EpochSeconds                  | Format 3     | 3

 

This is clearly better, and we can change this function on every time step. Having the optimal parser choice be a low number is always better, because we’re trying to minimize the time delay of the parsing process:

 

(Time Delay) ∝ (# Tries)

 

 But can we really just optimize over that? It’s not at all clear to me how that translates into an algorithm. While it’s a nice first-order formulation, we’re going to have to change representations to connect it to anything more substantial.

 

Parser Index | Parser Index (Binary) | Parser Index (Unary)
2            | 10                    | 11
1            | 1                     | 1
3            | 11                    | 111

 

This makes it clear that making the parser index small is equivalent to making its decimal/binary/unary representation small. In other words, we want to minimize the information content of the index sequence over our choice of parsers.

 In mathematical terms, the information (notated H) is just the sum of -p log p over each event, where p is the event’s probability. As an analogy, think of -log p as the length of the unary sequence (as above) and p as the probability of the sequence — we’ll use the experimental probability distribution over the parser indices that actually occur.

 As long as the probability of taking more tries is strictly decreasing, minimizing it also minimizes the time required because the information is strictly increasing with the number of tries it takes.

 

arg min { Time Delay } = arg min { Sequence Length * Probability of sequence }
= arg min { -p(# Tries) * log p(# Tries) } = arg min { H(# Tries) }

 

That’s strongly suggestive that what we want to use as the parser-order-choosing function is actually a compression function, whose entire goal in life is to minimize the information content (and therefore size) of byte sequences. Let’s see if we can make use of one: in the general case, these algorithms look like Seq(Int) => Seq(Int), making the second sequence shorter.
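As a concrete illustration (a minimal sketch of my own – any online compressor with this shape would do), here is an LZW-style compressor over integer sequences; it produces exactly the compressed sequence shown in the table below:

import scala.collection.mutable

object LZW {
  // Single symbols 0-255 are implicitly in the dictionary; learned
  // multi-symbol phrases get codes starting at 256.
  def compress(input: Seq[Int]): Seq[Int] = {
    val phrases = mutable.Map[List[Int], Int]()
    var nextCode = 256
    val output = mutable.ListBuffer[Int]()
    var current = List.empty[Int]

    def codeOf(phrase: List[Int]): Int = phrase match {
      case single :: Nil => single      // a lone symbol is its own code
      case longer        => phrases(longer)
    }

    for (symbol <- input) {
      val candidate = current :+ symbol
      if (candidate.size == 1 || phrases.contains(candidate)) {
        current = candidate             // keep extending the known phrase
      } else {
        output += codeOf(current)       // emit the longest known phrase
        phrases(candidate) = nextCode   // learn the new phrase
        nextCode += 1
        current = List(symbol)
      }
    }
    if (current.nonEmpty) output += codeOf(current)
    output.toList
  }
}

// LZW.compress(Seq(12,43,32,64,111,33,12,43,32,64,111,33,12))
//   == Seq(12,43,32,64,111,33,256,258,260,12)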

 

Parser Index Sequence (Length 13):        12,43,32,64,111,33,12,43,32,64,111,33,12
Parser Index, LZW Compressed (Length 10): 12,43,32,64,111,33,256,258,260,12

 

Let’s say that we have some past sequence – call it P – and we’re trying to find the next parser-index mapping. I admit that it’s not immediately clear a priori how to do this with a compression algorithm, but if we just perturb the algorithm, we can compare the options for the next parser as:

 

newInfo(parser label) = H(compress(P + [parser label])) - H(compress(P))

 

Any online compression algorithm will allow you to hold state so that you don’t have to repeat computations when determining this. Then we can just choose the parser with the least newInfo; and if the compressor does a good job of minimizing information content (which I’ll assume it does), then our algorithm will minimize the required work. If you’d like a deeper explanation of compression, ITILA [1] is a good reference.
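To make the decision rule concrete, here is a minimal sketch of my own (not the Sumo Logic implementation): rather than running a real compressor, it approximates newInfo with an adaptive frequency model and orders the parsers by how little new information each would add.

object ParserChooser {
  // Approximate newInfo(label) by the surprisal -log2 p(label) under the
  // empirical distribution of past labels, with add-one smoothing so that
  // unseen parsers still get a finite cost. A fuller implementation would
  // query an online compressor's state instead.
  def chooseOrder(pastLabels: Seq[Int], allLabels: Seq[Int]): Seq[Int] = {
    val counts: Map[Int, Int] =
      pastLabels.groupBy(identity).map { case (label, hits) => label -> hits.size }
    val total = pastLabels.size.toDouble

    def surprisal(label: Int): Double = {
      val p = (counts.getOrElse(label, 0) + 1.0) / (total + allLabels.size)
      -math.log(p) / math.log(2.0)
    }

    // Try the historically cheapest (least surprising) parsers first.
    allLabels.sortBy(surprisal)
  }
}

// Example: format 1 dominates the recent history, so it is tried first.
// ParserChooser.chooseOrder(Seq(1, 1, 2, 1, 3, 1), Seq(1, 2, 3))  // Seq(1, 2, 3)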

With a fairly small, reasonable change of representation, we now have a well-defined, implementable, fast metric for making online decisions about parser choice. Note that this system will work regardless of the input stream – there is no worst case except those of the compression algorithm. In this sense, the formulation is adaptive.

Certainly, the reason that we can draw a precise analogy to a solved problem is that analogous situations show up in many fields, which at least include Compression/Coding, Machine Learning [2], and Controls [3]. Information theory is the core conceptual framework here, and if I’ve succeeded in convincing you, Bayesian Theory [4] is my favorite treatment.

 

References:

  1. Information Theory, Inference, and Learning Algorithms by David MacKay

  2. Prediction, Learning, and Games by Nicolo Cesa-Bianchi and Gabor Lugosi.

  3. Notes on Dynamic Programming and Optimal Control by Dimitri Bertsekas

  4. Bayesian Theory by Jose Bernardo and Adrian Smith

Johnathan Hodge

New MySQL App – now GA!

09.17.2014 | Posted by Johnathan Hodge

Just check out the MySQL website and you’ll understand why we added a new app to the Sumo Logic Application Library for MySQL:

MySQL is the world’s most popular open source database software, with over 100 million copies of its software downloaded or distributed throughout its history.

As we spoke to companies about what insight they are missing from MySQL today, there were 3 common themes:

  • Understanding errors: Simply aggregating all the logs into one place for analysis would reduce Mean Time To Investigate

  • Insight into replication issues – many companies are flying blind in this area

  • Query performance – understanding not simply the slowest performers but changes over time

So, we created a set of dashboards and queries that target these areas. The application’s overview dashboard is especially useful because it highlights the daily pattern of slow queries – along with a seven-day average baseline to make it really clear when something isn’t right. You can jump from here into any of the other dashboards…

… including more detail on the slow queries, specifically.

The area that surprised me most from my discussions with customers was the need for insight into replication. It’s clearly a pain point for companies running MySQL – not because of MySQL per se, but because of the scale of customer environments. Issues with replication are often only uncovered once they have passed a certain pain threshold! With good log analysis, we were able to create a useful dashboard on replication. One of our beta customers said that the app was immediately valuable: “This is useful! Right away I learned that I should add an index to….”. Obviously, we were thrilled with this type of feedback!

We have other dashboards and useful searches in the application to give you greater insight into your MySQL deployment. The App is available now in the Sumo Logic Application Library. Go get it – and let me know what you think!

Mike Cook

Why TuneIn Chose Sumo Logic For Machine Data Analytics

09.15.2014 | Posted by Mike Cook

The following is a guest post from Mike Cook, Director of Technical Operations at TuneIn.

Introduction

TuneIn is a rapidly growing service that allows consumers to listen to over 100,000 radio stations and more than four million podcasts from every continent. During the recently held soccer World Cup, over 10.5 million people listened live to the games on radio stations streamed via TuneIn – one of the biggest events in our company’s history.

The State of Machine Data Analytics, pre-Sumo Logic

We had no consolidated strategy for analyzing our logs across our systems and applications. As a result, and especially because of our rapid growth, troubleshooting and event correlation became manual and increasingly painful affairs that involved looking at individual server and application logs. We tried internal tools including syslog-ng and rsyslog, but the maintenance and overhead costs for a lean IT team were too high.

Why Sumo Logic?

There were a number of reasons why Sumo Logic was appealing to us:

  • As a cloud-based service, Sumo Logic immediately offloaded the maintenance issues we had to deal with running our own home-grown log management infrastructure.
  • With pre-built support for a number of infrastructure components that TuneIn runs on, including AWS, Cisco, VMware, Windows/IIS and Linux, we were confident that we could get insights around our logs far faster than other options.
  • In addition, the Sumo Logic LogReduce technology provided a more robust investigation tool to find root causes of issues that traditional monitoring tools just can’t detect.
  • Finally, Sumo Logic provides compelling business value for what we are trying to accomplish.

Internal Adoption of Sumo Logic

We started with creating basic dashboards and alerts around our key operating system logs. As the application development teams realized the value of what Sumo Logic could provide them, we added additional log sources and launched a series of lunch-and-learns to demonstrate the value of the service. These lunch-and-learns have rapidly broadened the adoption of Sumo Logic across TuneIn. We’re now seeing support teams using it to get customer statistics; different development teams using it to analyze API performance; and the executive team getting overall visibility through real-time dashboards.

Business Benefits

It didn’t take us long to see the benefit of Sumo Logic. Since TuneIn is a distributed PaaS, it was frequently difficult to correlate and troubleshoot issues in a particular API. Time to resolution has dropped from several hours to several minutes now that developer and operations staff can search and pinpoint issues. Our development teams were quick to create custom searches and alerts for errors and exceptions in the logs, allowing us to reduce overall error rates by close to 20%. Without Sumo Logic we wouldn’t have known most of those errors were even occurring. We’re just scratching the surface with Sumo Logic. We continue to expand our usage and gain critical insight into API performance, what our most popular user activities are, and where our application bottlenecks are. Sumo Logic isn’t just a log consolidation tool; it also serves as a critical tool in our Business Intelligence toolbox.

Russell

No Magic: Regular Expressions, Part 3

09.11.2014 | Posted by Russell

Evaluating the NFA

In part 1, we parsed the regular expression into an abstract syntax tree. In part 2, we converted that syntax tree into an NFA. Now it’s time to evaluate that NFA against a potential string.

NFAs, DFAs and Regular Expressions

Recall from part 2 that there are two types of finite automata: deterministic and non-deterministic. They have one key difference: a non-deterministic finite automaton can have multiple paths out of the same node for the same token, as well as paths that can be pursued without consuming input. In expressiveness (often referred to as “power”), NFAs, DFAs and regular expressions are all equivalent. This means that if you can express a rule or pattern (e.g. strings of even length) with an NFA, you can also express it with a DFA or a regular expression. Let’s first consider the regular expression abc* expressed as a DFA:

[Figure: the DFA for abc*]

Evaluating a DFA is straightforward: simply move through the states by consuming the input string. If you finish consuming input in the match state, match; otherwise, don’t. Our state machine, on the other hand, is an NFA. The NFA our code generates for this regular expression is:

[Figure: the NFA our code generates for abc*]

Note that there are multiple unlabeled edges that we can follow without consuming a character. How can we track that efficiently? The answer is surprisingly simple: instead of tracking only one possible state, keep a list of states that the engine is currently in. When you encounter a fork, take both paths (turning one state into two). When a state lacks a valid transition for the current input, remove it from the list.

There are two subtleties we have to consider: avoiding infinite loops in the graph and handling no-input transitions properly. When we are evaluating a given state, we first advance all the states to enumerate all the possible states reachable from our current state if we don’t consume any more input. This is the phase that requires care to maintain a “visited set” to avoid infinitely looping in our graph. Once we have enumerated those states, we consume the next token of input, either advancing those states or removing them from our set.

object NFAEvaluator {
  def evaluate(nfa: State, input: String): Boolean =
    evaluate(Set(nfa), input)

  def evaluate(nfas: Set[State], input: String): Boolean = {
    input match {
      case "" =>
        evaluateStates(nfas, None).exists(_ == Match())
      case string =>
        evaluate(
          evaluateStates(nfas, input.headOption),
          string.tail
        )
    }
  }

  def evaluateStates(nfas: Set[State],
                     input: Option[Char]): Set[State] = {
    val visitedStates = mutable.Set[State]()
    nfas.flatMap { state =>
      evaluateState(state, input, visitedStates)
    }
  }

  def evaluateState(currentState: State, input: Option[Char],
                    visitedStates: mutable.Set[State]): Set[State] = {

    if (visitedStates contains currentState) {
      Set()
    } else {
      visitedStates.add(currentState)
      currentState match {
        case placeholder: Placeholder =>
          evaluateState(
            placeholder.pointingTo,
            input,
            visitedStates
          )
        case consume: Consume =>
          if (Some(consume.c) == input
              || consume.c == '.') {
            Set(consume.out)
          } else {
            Set()
          }
        case s: Split =>
          evaluateState(s.out1, input, visitedStates) ++
            evaluateState(s.out2, input, visitedStates)
        case m: Match =>
          if (input.isDefined) Set() else Set(Match())
      }
    }
  }
}

And that's it!

Put a bow on it

We’ve finished all the important code, but the API isn’t as clean as we’d like. Now, we need to create a single-call user interface to call our regular expression engine. We’ll also add the ability to match your pattern anywhere in the string with a bit of syntactic sugar.

object Regex {
  def fullMatch(input: String, pattern: String) = {
    val parsed = RegexParser(pattern).getOrElse(
      throw new RuntimeException("Failed to parse regex")
    )
    val nfa = NFA.regexToNFA(parsed)
    NFAEvaluator.evaluate(nfa, input)
  }

  def matchAnywhere(input: String, pattern: String) =
    fullMatch(input, ".*" + pattern + ".*")
}

To use it:

Regex.fullMatch("aaaaab", "a*b") // True
Regex.fullMatch("aaaabc", "a*b") // False
Regex.matchAnywhere("abcde", "cde") // True

That’s all there is to it: a semi-functional regex implementation in just 106 lines. There are a number of things that could be added, but I decided they added complexity without enough value:

  1. Character classes
  2. Value extraction
  3. ?
  4. Escape characters
  5. And many more.

I hope this simple implementation helps you understand what’s going on under the hood! It’s worth mentioning that the performance of this evaluator is heinous. Truly terrible. Perhaps in a future post I’ll look into why and talk about ways to optimize it…

Ozan Unlu

Debugging to Customer Hugging – Becoming an SE

09.10.2014 | Posted by Ozan Unlu

“I know app developers, and that’s not you!” It was a statement that I couldn’t really argue with, and it was coming from one of my closest friends. It didn’t matter that I was employed in a career as an app developer at one of the top software companies in the world. It didn’t matter that I was performing well and the tools and applications I coded were being used by hundreds of internal developers. It didn’t even matter that the friend making the conclusion had never written a single line of code in his life, nor had he any idea of my technical ability. The funny thing was, he meant it as a compliment, and so began the biggest career transition of my life.

Coding and logic puzzles were always very intuitive to me, so I always enjoyed solving a variety of technical challenges. Yet articulation, interpersonal communication and cross-team collaboration were some of my other strong suits that I felt weren’t being used in my professional life. My ambition to be as successful as possible, combined with my desire to fulfill my potential, always had me wondering if there was a role better suited for me where I would be able to leverage both diverse skill sets. Over the years I had many mentors, and through all the various conversations and constructive criticism, the same trend was always prevalent. They all thought I could be more successful in a Program Manager or Technical Lead role, as it would allow me to take advantage of these strengths that were being under-used in a purely development-focused role. So I made those career moves, but decided to stay within the company. After all, I didn’t want to cast away the experience and knowledge I had gained there, and I believed it would propel me in my new roles since they were in a related field. It did; I continued to be successful, and it was certainly a step in the right direction, but it needed to be taken further. I had tunnel vision, and when I looked at my career, all my choices seemed a little too safe. It was time to take a risk.

I was told about the Sales Engineering role and how it could be the perfect position for me to stretch my wings and use my full potential. The more I looked into it, the better it seemed. I would be a technical expert with deep knowledge of the product while at the same time selling the value of the solution to potential clients. I would be listening to customers’ needs and educating them on whether or not our product would be the best fit for them. After spending so much time on research and development teams creating software with the same handful of peers every day, the prospect of working with a mixture of clients who were among the top engineering minds in the world, across a plethora of different technologies, was enticing. Just the ability to work with these industry leaders on a variety of different challenges allowed me to solve more technical problems than I was ever able to as a developer working on only a handful of projects over the course of a year. I had warmed up to the idea, and it was time to commit to something new.

There is one area of the world that people consistently consider the “Mecca of Tech,” and that is the San Francisco / Silicon Valley Bay Area. That was settled. If I was going to go into sales, I had promised myself I would never sell a product in which I didn’t have full confidence, so I needed to find a company with a product I really believed in. Enter Sumo Logic: a fully cloud based data analytics and machine learning solution.

Curious, I created a free account and played around with the product. In a very short time, I could see the impressive power and versatile functionality, and the value it could provide to nearly any tech company. The company was also growing at a tremendous rate, supported by top investors, and sported a unique combination of relatively low risk and high upside, so I couldn’t craft an argument to deter myself from joining. I interviewed, and when offered, accepted the job. After committing, what awaited me next felt like a breath of fresh air.

Joining a startup from a large company and transitioning into the sales field from development, I didn’t know what type of culture to expect. What awaited me was a company culture where team members are genuinely and actively supportive, and it was awesome. In the first couple of months I learned more about various technologies in the market than I ever knew existed before. I work with customers and drastically improve their systems, their processes and, consequently, their careers. I did not expect to be able to contribute to our internal product development process, yet I have our best engineers coming to ask which direction we should take our features. Being able to work with customers and feel like you’re truly helping them, while at the same time continuing to design and engineer a product on the cutting edge, is the best of both worlds – and the sizable increase in compensation isn’t a bad side effect either. I have no regrets about making the biggest career transition of my life; I’m happier than I’ve ever been and I’m not looking back.

If you want to join Ozan and work as a Sales Engineer at Sumo, click here!
