Blog › Authors › Russell Cohen

Russell

No Magic: Regular Expressions, Part 3

09.11.2014 | Posted by Russell

Evaluating the NFA

In part 1, we parsed the regular expression into an abstract syntax tree. In part 2, we converted that syntax tree into an NFA. Now it’s time evaluate that NFA against a potential string.

NFAs, DFAs and Regular Expressions

Recall from part 2 that there are two types of finite automata: deterministic and non-deterministic. They have one key difference: A non-deterministic finite automata can have multiple paths out of the same node for the same token as well as paths that can be pursued without consuming input. In expressiveness (often referred to as “power”), NFAs, DFAs and regular expressions are all equivalent. This means if you can express a rule or pattern, (eg. strings of even length), with an NFA, you can also express it with a DFA or a regular expression. Lets first consider a regular expressionabc* expressed as a DFA:

regexdfa.png

Evaluating a DFA is straightforward: simply move through the states by consuming the input string. If you finish consuming input in the match state, match, otherwise, don’t. Our state machine, on the other hand, is an NFA. The NFA our code generates for this regular expression is:

dfavsnfa.png

Note that there are multiple unlabeled edges that we can follow without consuming a character. How can we track that efficiently? The answer is surprisingly simple: instead of tracking only one possible state, keep a list of states that the engine is currently in. When you encounter a fork, take both paths (turning one state into two). When a state lacks a valid transition for the current input, remove it from the list.

There are 2 subtleties we have to consider: avoiding infinite loops in the graph and handling no-input-transitions properly. When we are evaluating a given state, we first advance all the states to enumerate all the possible states reachable from our current state if we don’t consume any more input. This is the phase that also requires care to maintain a “visited set” to avoid infinitely looping in our graph. Once we have enumerated those states, we consume the next token of input, either advancing those states or removing them from our set

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
object NFAEvaluator {
def evaluate(nfa: State, input: String): Boolean =
evaluate(Set(nfa), input)
 
def evaluate(nfas: Set[State], input: String): Boolean = {
input match {
case "" =>
evaluateStates(nfas, None).exists(_ == Match())
case string =>
evaluate(
evaluateStates(nfas, input.headOption),
string.tail
)
}
}
 
def evaluateStates(nfas: Set[State],
input: Option[Char]): Set[State] = {
val visitedStates = mutable.Set[State]()
nfas.flatMap { state =>
evaluateState(state, input, visitedStates)
}
}
 
def evaluateState(currentState: State, input: Option[Char],
visitedStates: mutable.Set[State]): Set[State] = {
 
if (visitedStates contains currentState) {
Set()
} else {
visitedStates.add(currentState)
currentState match {
case placeholder: Placeholder =>
evaluateState(
placeholder.pointingTo,
input,
visitedStates
)
case consume: Consume =>
if (Some(consume.c) == input
|| consume.c == '.') {
Set(consume.out)
} else {
Set()
}
case s: Split =>
evaluateState(s.out1, input, visitedStates) ++
evaluateState(s.out2, input, visitedStates)
case m: Match =>
if (input.isDefined) Set() else Set(Match())
}
}
}
}
view raw regexblog11.scala hosted with ❤ by GitHub

And that's it!

Put a bow on it

We’ve finished all the important code, but the API isn’t as clean as we’d like. Now, we need to create a single-call user interface to call our regular expression engine. We’ll also add the ability to match your pattern anywhere in the string with a bit of syntactic sugar.

1 2 3 4 5 6 7 8 9 10 11 12
object Regex {
def fullMatch(input: String, pattern: String) = {
val parsed = RegexParser(pattern).getOrElse(
throw new RuntimeException("Failed to parse regex")
)
val nfa = NFA.regexToNFA(parsed)
NFAEvaluator.evaluate(nfa, input)
}
 
def matchAnywhere(input: String, pattern: String) =
fullMatch(input, ".*" + pattern + ".*")
}
view raw regexblog11.scala hosted with ❤ by GitHub

To use it:

1 2 3
Regex.fullMatch("aaaaab", "a*b") // True
Regex.fullMatch("aaaabc", "a*b") // False
Regex.matchAnywhere("abcde", "cde") // True
view raw regexblog12.scala hosted with ❤ by GitHub

That’s all there is to it. A semi-functional regex implementation in just 106 lines. There are a number of things that could be added but I decided they added complexity without enough value:

  1. Character classes
  2. Value extraction
  3. ?
  4. Escape characters
  5. Any many more.

I hope this simple implementation helps you understand what’s going on under the hood! It’s worth mentioning that the performance of this evaluator is heinous. Truly terrible. Perhaps in a future post I’ll look into why and talk about ways to optimize it…

Russell

No Magic: Regular Expressions, Part 2

09.05.2014 | Posted by Russell

The code for this post, as well as the post itself, are on github.

Converting the Parse Tree to an NFA

In the last post, we transformed the flat string representation of a regular expression to the hierarchical parse tree form. In this post we’ll transform the parse tree to a state machine. The state machine linearizes the regex components into a graph, producing an “a follows b follows c” relationship. The graph representation will make it easier to evaluate against a potential string.

Why make yet another transformation just to match a regex?

It would certainly be possible to translate directly from a string to a state machine. You could even attempt to evaluate the regex directly from the parse tree or the string. Doing this, however, would likely involve significantly more complicated code. By slowly lowering the level of abstraction, we can ensure that the code at each stage is easy to understand and reason about. This is especially important when building something like a regular expression evaluator with literally infinite edge cases.

NFAs, DFAs, and You

We’ll be creating a state machine called an NFA, or nondeterministic finite automata. It has two types of components: states and transitions. When we diagram it, states will be represented as circles and transitions will be arrows. The match state will be a double circle. If a transition is labeled, it means that we must consume that character from the input to make that transition. Transitions can also be unlabeled — that means that we can make the transition without consuming input. Aside: In other literature, you may see this represented as ε. This state machine represents the simple regex “ab”:

ab.dot.png

Our nodes can have multiple valid subsequent states for a given input, as in the following diagram where there are two paths consuming the same input leaving a node:

nfa.dot.png

Contrast this with deterministic finite automata, where, as the name suggests, a given input can only result in one state change. This relaxation in constraints will make evaluating it a bit more tricky, but it makes it much easier to generate them from our tree, as we’ll see shortly. Fundamentally, NFAs and DFAs are equivalent in the state machines they can represent.

Transforming, in Theory

Let’s lay out a strategy for transforming our parse tree info an NFA. While it may seem daunting, we’ll see that by breaking the process into composable pieces, it can be straightforward. Recall the syntax elements that we’ll need to transform:

1. Characters Literals: Lit(c: Char)

2. *Repeat(r: RegexExpr) 

3. +Plus(r: RegexExpr) 

4. Concatenation: Concat(first: RegexExpr, second: RegexExpr) 

5. |Or(a: RegexExpr, b: RegexExpr)

The transformations that follow were initially outlined by Thompson in his 1968 paper. In the transformation diagrams,In will refer the to the entry point of the state machine and Out will refer to the exit. In practice those can be the “Match” state, the “Start” state, or other regex components. The In / Out abstraction allows us to compose and combine state machines, a key insight. For example, if we have a state machine that matches “abc” and another that matches “cde”, connecting them sequentially will yield a state machine that matches “abccde”. We’ll apply this principle more generally to transform from each element in our syntax tree into a composable state machine.

Let’s start with character literals. A character literal is a transition from one state to another that consumes input. Consuming the literal ‘a’ would look like this:

lit.dot.png

Next lets examine concatenation. To concatenate two components of the parse tree, we just need to link them together with a empty arrow. Note that in this example, just as a Concat can contain the concatenation of two arbitrary regular expressions, A and B in the diagram can both be state machines, not just individual states. Something a little bit weird happens if A and B are both literals. How do we connect to arrows together with no intervening states? The answer is that when necessary, literals can have phantom states at each end to make our state machine sound.Concat(A, B) would transform to:

concat.dot.png

To represent Or(A, B) we need to branch from the start state into the two separate states. Once those machines terminate, they must both point to the subsequent (Out) state:

split.dot.png

Lets consider *. Star can be 0 or more repetitions of a pattern. To do that, we want one wire pointing directly to the next state, and one looping back to our current state via AA* would look like:

astar.dot.png

For +, we’ll use a little trick. a+ is just aa*. Generalizing, Plus(A) can be rewritten to Concat(A, Repeat(A)) We’ll rewrite it to that, rather than designing a special pattern for it. I’ve included + in the language for a specific reason: to demonstrate how other, more complicated regular expression elements that I’ve omitted can often be expressed in terms of our language.

Transforming, in practice

Now that we’ve made a theoretical plan, lets dive in to how we’ll actually code it. We’ll be creating a mutable graph to store our tree. While I’d prefer immutability, making immutable graphs is annoyingly difficult, and I’m lazy.

If we were to try to boil the above diagrams into core components, we’d end up with three types of components: arrows that consume input, the match state, and one state splitting into two states. I know this seems a little odd, and potentially incomplete. You’ll have to trust me that this choice leads to the cleanest code. Here are our 3 NFA component classes:

abstract class State

class Consume(val c: Char, val out: State) extends State
class Split(val out1: State, val out2: State) extends State
case class Match() extends State

Aside: I made Match a case class instead of a regular class. Case classes in Scala bring certain sensible defaults to your class. I used it because it gives value based equality. This makes all Match objects equal, a useful property. For the other types of NFA components, we want reference equality.

Our code will recursively traverse our syntax tree, keeping an andThen object as a parameter. andThen is the what we will attach to the free hanging outputs of our expression. This is required because from an aribtrary branch in your syntax tree, you lack the context of what comes next – andThen allows us pass that context down as we recurse. It also gives us an easy way to append the Match state.

When it comes time to handle Repeat we’ll run into a problem. The andThen for Repeat is the split operator itself. To deal with this issue, we’ll introduce a placeholder to let us bind it later. We’ll represent the placeholder with this class:

class Placeholder(var pointingTo: State) extends State

The var in Placeholder means the pointingTo is mutable. It is the isolated bit of mutability allowing us easily create a cyclical graph. All the other members are immutable.

To start things off, andThen is Match() – this means that we’ll create a state machine matching our Regex which can then transfer to the Match state without consuming anymore input. The code is short but dense:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
object NFA {
def regexToNFA(regex: RegexExpr): State =
regexToNFA(regex, Match())
 
private def regexToNFA(regex: RegexExpr,
andThen: State): State = {
regex match {
case Literal(c) => new Consume(c, andThen)
case Concat(first, second) => {
// Convert first to an NFA. The "out" of that is
// the result of converting second to an NFA.
regexToNFA(first, regexToNFA(second, andThen))
}
case Or(l, r) => new Split(
regexToNFA(l, andThen),
regexToNFA(r, andThen)
)
 
case Repeat(r) =>
val placeholder = new Placeholder(null)
val split = new Split(
// One path goes to the placeholder,
// the other path goes back to r
regexToNFA(r, placeholder),
andThen
)
placeholder.pointingTo = split
placeholder
 
case Plus(r) =>
regexToNFA(Concat(r, Repeat(r)), andThen)
}
}
}
view raw regexblog10.scala hosted with ❤ by GitHub

And we're done! The word to line of code ratio in this post is high -- each line encodes a lot of information, but boils down the the transformations we discussed in the previous section. I should clarify I didn't just sit down and write the code like this -- getting it this short and to the point was the result of several iterations on the data-structures and algorithm. Clean code is hard.

In the debugging process, I wrote a quick script to generate dot files from NFAs so I could look at the NFAs I was actually generating to debug issues. Note that the NFAs this code generates contain lots of unnecessary transitions and states — When writing this, I’ve tried to keep the code as simple as possible, at the cost of performance. Here are some examples for simple regular expressions:

(..)*

evenstrs.dot.png

ab

ab.dot.png

a|b

aorb.dot.png

Now that we have an NFA (the hard part) all we have to do is evaluate it (the easier part). Come back next week for the wrapup post on writing the evaluator, or just read the code for yourself on github.

Russell

No Magic: Regular Expressions

08.18.2014 | Posted by Russell

The code for this post, as well as the post itself, are on github.

Until recently, regular expressions seemed magical to me. I never understood how you could determine if a string matched a given regular expression. Now I know! Here’s how I implemented a basic regular expression engine in under 200 lines of code.

The Spec

Implementing full regular expressions is rather cumbersome, and worse, doesn’t teach you much. The version we’ll implement is just enough to learn without being tedious. Our regular expression language will support:

  • : Match any character
  • : Match abc or cde
  • : Match one or more of the previous pattern
  • : Match 0 or more of the previous pattern
  •  and  ) for grouping

While this is a small set of options, we’ll still be able to make some cute regexes, like m (t|n| ) | b to match Star Wars subtitles without matching Star Trek ones, or (..)* the set of all even length strings.

The Plan of Attack

We’ll evaluate regular expressions in 3 phases: 1. Parse the regular expression into a syntax tree 2. Convert the syntax tree into a state machine 3. Evaluate the state machine against our string

We’ll use a state machine called an NFA to evaluate regular expressions (more on that later). At a high level, the NFA will represent our regex. As we consume inputs, we’ll move from state to state in the NFA. If we get to a point where we can’t follow an allowed transition, the regular expression doesn’t match the string.

This approach was originally demonstrated by Ken Thompson, one of the original authors of Unix. In his 1968 CACM paper he outlines the implementation of a text editor and includes this as a regular expression evaluator. The only difference is that his article is written 7094 machine code. Things used to be way more hard core.

This algorithm is a simplification of how non-backtracking engines like RE2 evaluate regular expressions in provably linear time. It’s notably different from the regex engines found in Python and Java that use backtracking. When given certain inputs, they’ll run virtually forever on small strings. Ours will run in O(length(input) * length(expression).

My approach roughly follows the strategy Russ Cox outlines in his excellent blog post.

Representing a Regular Expression

Lets step back and think about how to represent a regular expression. Before we can hope to evaluate a regular expression, we need to convert it into a data structure the computer can operate on. While strings have a linear structure, regular expressions have a natural hierarchy.

Lets consider the string abc|(c|(de)). If you were to leave it as a string, you’d have to backtrack and jump around as you tried to keep track of the different sets of parenthesis while evaluating the expression. One solution is converting it to a tree, which a computer can easily traverse. For example, b+a would become:

syntaxex.png

To represent the tree, we’ll want to create a hierarchy of classes. For example, our Or class will need to have two subtrees to represent its two sides. From the spec, there are 4 different regular expression components we’ll need to recognize: +*|, and character literals like .a and b. In addition, we’ll also need to be able to represent when one expression follows another. Here are our classes:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
abstract class RegexExpr
 
// ., a, b
case class Literal(c: Char) extends RegexExpr
 
// a|b
case class Or(expr1: RegexExpr, expr2: RegexExpr) extends RegexExpr
 
// ab -> Concat(a,b); abc -> Concat(a, Concat(b, c))
case class Concat(first: RegexExpr, second: RegexExpr) extends RegexExpr
 
// a*
case class Repeat(expr: RegexExpr) extends RegexExpr
 
// a+
case class Plus(expr: RegexExpr) extends RegexExpr
view raw blog1.scala hosted with ❤ by GitHub

Parsing a Regular Expression

To get from a string to a tree representation, we’ll use conversion process known as “parsing.” I’m not going to talk about building parsers in a high level of detail. Rather, I’ll give just enough to point you in the right direction if you wanted to write your own. Here, I’ll give an overview of how I parsed regular expressions with Scala’s parser combinator library.

Scala’s parser library will let us write a parser by just writing the set of rules that describe our language. It uses a lot of obtuse symbols unfortunately, but I hope you’ll be able to look through the noise to see the gist of what’s happening.

When implementing a parser we need to consider order of operations. Just as “PEMDAS” applies to arithmetic, a different set of rules apply to regular expressions. We can express this more formally with the idea of an operator “binding” to the characters near it. Different operators “bind” with different strengths — as an analogy, * binds more tightly than + in expressions like 5+6*4. In regular expressions, * binds more tightly than |. If we were to represent parsing this as a tree the weakest operators end up on top.

multex.png

It follows that we should parse the weakest operators first, followed by the stronger operators. When parsing, you can imagine it as extracting an operator, adding it to the tree, then recursing on the remaining 2 parts of the string.

In regular expressions the order of binding strength is: 1. Character Literal & parentheses 2. + and * 3. “Concatenation” — a is after b 4. |

Since we have 4 levels of binding strength, we need 4 different types of expressions. We named them (somewhat arbitrarily): litlowExpr (+*), midExpr (concatenation), and highExpr (|). Lets jump into the code. First we’ll make a parser for the most basic level, a single character:

1 2 3 4
object RegexParser extends RegexParsers {
def charLit: Parser[RegexExpr] = ("""\w""".r | ".") ^^ {
char => Literal(char.head)
}
view raw regexblog2.scala hosted with ❤ by GitHub

Lets take a moment to explain the syntax. This defines a parser that will build a RegexExpr. The right hand side says: “Find something that matches \w (any word character) or a period. If you do, turn it into a Literal.

Parentheses must be defined at the lowest level of the parser since they are the strongest binding. However, you need to be able to put anything in parentheses. We can accomplish this with the following code:

1 2
def parenExpr: Parser[RegexExpr] = "(" ~> highExpr <~ ")"
def lit: Parser[RegexExpr] = charLit | parenExpr
view raw regexblog3.scala hosted with ❤ by GitHub

Here, we’ll define * and +:

1 2 3 4 5
def repeat: Parser[RegexExpr] = lit <~ "*"
^^ { case l => Repeat(l) }
def plus: Parser[RegexExpr] = lit <~ "+"
^^ { case p => Plus(p) }
def lowExpr: Parser[RegexExpr] = repeat | plus | lit
view raw regexblog4.scala hosted with ❤ by GitHub

Next, we’ll define concatenation, the next level up:

1 2 3
def concat: Parser[RegexExpr] = rep(lowExpr)
^^ { case list => listToConcat(list)}
def midExpr: Parser[RegexExpr] = concat | lowExpr
view raw regexblog.scala hosted with ❤ by GitHub

Finally, we’ll define or:

1 2
def or: Parser[RegexExpr] = midExpr ~ "|" ~ midExpr
^^ { case l ~ "|" ~ r => Or(l, r)}
view raw regexblog6.scala hosted with ❤ by GitHub

Lastly, we’ll define highExprhighExpr is an or, the weakest binding operator, or if there isn’t an or, a midExpr.

1
def highExpr: Parser[RegexExpr] = or | midExpr
view raw regexblog7.scala hosted with ❤ by GitHub

Finally, a touch of helper code to finish it off:

1 2 3 4 5 6 7 8 9 10 11 12
def listToConcat(list: List[RegexExpr]): RegexExpr = list match {
case head :: Nil => head
case head :: rest => Concat(head, listToConcat(rest))
}
 
def apply(input: String): Option[RegexExpr] = {
parseAll(highExpr, input) match {
case Success(result, _) => Some(result)
case failure : NoSuccess => None
}
}
}
view raw regexblog8.scala hosted with ❤ by GitHub

That’s it! If you take this Scala code you’ll be able to generate parse trees for any regular expression that meets the spec. The resulting data structures will be trees.

Now that we can convert our regular expressions into syntax trees, we’re well underway to being able to evaluate them. I wrote about the next steps in part 2. As always, the code is on Github.

Russell

Why You Should Never Catch Throwable In Scala

05.05.2014 | Posted by Russell

Scala is a subtle beast and you should heed its warnings. Most Scala and Java programmers have heard that catching Throwable, a superclass of all exceptions, is evil and patterns like the following should be avoided:

1 2 3 4 5 6 7
try {
aDangerousFunction()
} catch {
case ex: Throwable => println(ex)
// Or even worse
case ex => println(ex)
}

This pattern is absurdly dangerous. Here’s why:

The Problem

In Java, catching all throwables can do nasty things like preventing the JVM from properly responding to a StackOverflowError or an OutOfMemoryError. Certainly not ideal, but not catastrophic. In Scala, it is much more heinous. Scala uses exceptions to return from nested closures. Consider code like the following:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
def inlineMeAgain[T](f: => T): T = {
f
}
 
def inlineme(f: => Int): Int = {
try {
inlineMeAgain {
return f
}
} catch {
case ex: Throwable => 5
}
}
 
def doStuff {
val res = inlineme {
10
}
println("we got: " + res + ". should be 10")
}
doStuff
view raw scalaclosures.scala hosted with ❤ by GitHub

We use a return statement from within two nested closures. This seems like it may be a bit of an obscure edge case, but it’s certainly possible in practice. In order to handle this, the Scala compiler will throw a NonLocalReturnControl exception. Unfortunately, it is a Throwable, and you’ll catch it. Whoops. That code will print 5, not 10. Certainly not what was expected.

The Solution

While we can say “don’t catch Throwables” until we’re blue in the face, sometimes you really want to make sure that absolutely no exceptions get through. You could include the other exception types everywhere you want to catch Throwable, but that’s cumbersome and error prone. Fortunately, this is actually quite easy to handle, thanks to Scala’s focus on implementing much of the language without magic—the “catch” part of the try-catch is just some sugar over a partial function—we can define partial functions!

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
def safely[T](handler: PartialFunction[Throwable, T]): PartialFunction[Throwable, T] = {
case ex: ControlThrowable => throw ex
// case ex: OutOfMemoryError (Assorted other nasty exceptions you don't want to catch)
//If it's an exception they handle, pass it on
case ex: Throwable if handler.isDefinedAt(ex) => handler(ex)
// If they didn't handle it, rethrow. This line isn't necessary, just for clarity
case ex: Throwable => throw ex
}
 
// Usage:
/*
def doSomething: Unit = {
try {
somethingDangerous
} catch safely {
ex: Throwable => println("AHHH")
}
}
*/

This defines a function “safely”, which takes a partial function and yields another partial function. Now, by simply using catch safely { /* catch block */ } we’re free to catch Throwables (or anything else) safely and restrict the list of all the evil exception types to one place in the code. Glorious.

Russell

Fuzzing For Correctness

08.13.2012 | Posted by Russell

Handling the Edge-Case Search-Space Explosion: A Case for Randomized Testing

For much of software testing the traditional quiver of unit, integration and system testing suffices. Your inputs typically have a main case and a small set of well understood edge cases you can cover. But: periodically, we come upon software problems where the range of acceptable inputs and the number of potential code paths is simply too large to have an acceptable degree of confidence in correctness of a software unit (be it a function, module, or entire system).

… Continue Reading

Russell

3 Tips for Writing Performant Scala

07.23.2012 | Posted by Russell

Here at Sumo Logic we write a lot of Scala code. We also have a lot of data, so some of our code has to go really, really fast. While Scala allows us to write correct, clear code quickly, it can be challenging to ensure that you are getting the best performance possible.

Two expressions which seem to be equivalent in terms of performance can behave radically differently. Only with an in-depth understanding the implementation of the language and the standard library can one predict which will be faster.

… Continue Reading

Twitter