Russell
07.23.2012

3 Tips for Writing Performant Scala

Here at Sumo Logic we write a lot of Scala code. We also have a lot of data, so some of our code has to go really, really fast. While Scala allows us to write correct, clear code quickly, it can be challenging to ensure that you are getting the best performance possible.

Two expressions which seem to be equivalent in terms of performance can behave radically differently. Only with an in-depth understanding the implementation of the language and the standard library can one predict which will be faster.

For a great explanation of the implementation and details of the Scala language, I recommend reading Programming in Scala 2ed by Odersky, Spoon and Venners cover to cover. It’s worth every page.

Short of reading the 800+ pages of Programming in Scala, here are 3 pieces of low hanging fruit to help improve the performance of your Scala code.

1. Understand the Collections!

Users of Java are used to ArrayList with constant time lookup and amortized constant time append. In Scala, the object you get when you request a
List(1, 2, 3)
you get linked list. It can be prepended with objects using the “cons” (::) operator in constant time, but many other operations such as index based lookup, length, and append will run in linear time(!). If you want random access, you want an IndexedSeq. If you want constant time append use a ListBuffer. Read the collections chapter of Programming in Scala 2ed for all the details.

2. Be Lazy!

Scala’s collection libraries allow us to write nearly any collection operation as short chain of functions. For example, let’s say we had a bunch of log entries. For each of them we wanted to extract the first word, pull them into groups of 8, then count the number of groups of 8 that contain the word “ERROR.” We would probably write that as:

1 2
val logs: List[String] = aLotOfStrings
logs.map(_.takeWhile(_ != ' ')).grouped(8).count(_.contains("ERROR"))
view raw gistfile1.scala hosted with ❤ by GitHub

logs.map(_.takeWhile(_ != ‘ ‘)) will create an intermediate collection that we never use directly. If the size of logs was near our memory limit, the auxiliary collection could run us out of memory. To avoid generating the intermediate collections, we can run the operations on the list in a “lazy” manner. When we call the “.view” method on a Scala collection, it returns a view into the collection that provides lazy evaluation through a series of closures. For example, consider:

1
(1 to 10) map (_ + 5) map (_ * 2)
view raw funcexample.scala hosted with ❤ by GitHub

If f(x) = x + 5, and g(x) = x * 2, then this is really just the functional composition of g(f(x)) — No reason to create the intermediate collections. A view runs transformations as functional composition instead of as a series of intermediate collections.

So, going back to our initial example, the operation would become:

1
logs.view.map(_.takeWhile(_ != ' ')).grouped(8).count(_.contains("ERROR"))
view raw lazymap.scala hosted with ❤ by GitHub

The call to count will force the results of this computation to be evaluated.  If your chain produces a collection on the other side (eg. just returning a subset of the logs), use .force to make it strict and return a concrete collection.

Using lazy collections must be taken with a grain of salt — while lazy collections often can improve performance, they can also make it worse.  For example:

1 2
def nonview = (1 to 5000000).map(_ % 10).filter(_ > 5).reduce(_ + _)
def view = (1 to 5000000).view.map(_ % 10).filter(_ > 5).reduce(_ + _)
view raw gistfile1.scala hosted with ❤ by GitHub

For this microbenchmark, the lazy version ran 1.5x faster than the strict version.  However, for smaller values of n, the strict version will run faster. Lazy evaluation requires the creation of an additional closure. If creating the closures takes longer than creating intermediate collections, the lazy version will run slower. Profile and understand your bottlenecks before optimizing!

3. Don’t be Lazy!

If you really need a piece of code to go fast, given the current state of Scala libraries and compiler, you’re going to need to write more of it. Sometimes (unfortunately) to write truly performant code in Scala, you need to write it like it’s C or Java. This means eschewing a lot of things you’ve come to love about Scala, such as:

  • Use while loops instead of for loops.  For loops create closures that can create significant overhead. Let me give some context for this:

While-loop version code:

1 2 3 4 5 6 7 8 9 10 11 12 13 14
var i,j,k = 0
var x = 0L
while(i < 10000) {
while(j < 1000) {
while(k < 1000) {
x += 1
k += 1
}
k = 0;
j += 1
}
j = 0
i += 1;
}
view raw gistfile1.scala hosted with ❤ by GitHub
 

The Scala code:

1 2 3 4
var x = 0L
for(i <- 0 to 10000; j <- 0 to 1000; k <- 0 to 1000) {
x += 1
}
view raw gistfile1.scala hosted with ❤ by GitHub

Obviously the while-loop version will run faster, but the difference is surprising. In my benchmark, the while loop version ran on average in .557ms.  The Scala version runs in 9.584ms. That is a 17x improvement! The exact reason is beyond the scope of this post, but in a nutshell, in the Scala version { x += 1 } is creating an anonymous class each time we want to increment x. For what it’s worth, this is issue 1338 on the Scala issue tracker, and there is a compiler plugin to perform a lot of these optimizations automatically. 

  • Replace convenience methods like exists, count, etc. with their hard-coded variants.  For example, instead of:

Version 1:

1
(1 until 1000).count(_ % 10 == 0)
view raw count.scala hosted with ❤ by GitHub

Version 2:

1 2 3 4 5 6
var x = 1
var num = 0
while(x < 1000) {
if(x % 10 == 0) { num += 1 }
x += 1
}
view raw count2.scala hosted with ❤ by GitHub

Version 2 gets a 3x speedup over version 1.

  • Avoid objects when possible — use primitives instead.  Whenever possible, the Scala compiler will insert JVM primitives for things like Ints, Booleans, Doubles and Longs.  However, if you prevent this (by using an implicit conversion, etc.) and the compiler is forced to box your object, you will pay a significant performance cost. [Look for Value Classes to address this in future versions of Scala.] You could also specialize containers for primitive types, but that is beyond the scope of this post.

I really hate suggesting that you write Scala like it’s C to get better performance. I really do. And I really enjoy programming in Scala. I hope in the future the standard library evolves in a way so that it becomes faster than hand-coding the C equivalent.

The Bottom Line

The first two suggestions get followed at Sumo Logic, and really boil down to a solid understanding of Scala’s standard library. The third suggestion gets followed very rarely if at all. This seems surprising — shouldn’t we be trying to write that fastest code possible?

The answer, of course, is no. If we wanted to write the fastest code possible, we would essentially write Scala as if it were C. But then, why not just use C?

There are multiple factors we need to optimize for here. By necessity, our services here at Sumo Logic are designed to scale horizontally. If we need better performance, we can spin up more nodes. It costs money. Developer time also costs money. Writing idiomatic Scala that fully utilizes the type-safety and functional properties of the language can and will produce code that runs slower than writing everything with while-loops in C-style.

But slower code is OK. The critical trade-off for us is that writing clean Scala is faster, less error prone, and easier to maintain than the alternative. Scala’s performance shortcomings have garnered some criticism on the internet recently (and for good reason). This isn’t necessarily a reason not to use Scala. I suspect Scala performance will improve with time as the Scala compiler produces more optimized bytecode and the JVM gains native support for some of the functional features of the Scala language. The critical thing to recognize is that you must consider both developer and code performance in your optimizations.

[Benchmarking Notes]

Code was benchmarked with a script that first executes the function in question 10000 times, then runs it 1000 more times and computes the average. This ensures that the HotSpot JVM will JIT compile the code in question.

«
  1. Viktor Klang says:

    I recommend using @tailrec instead of while loops. Can be more performant, doesn’t introduce vars, easier to compose since you have small methods.

  2. paradigmatic says:

    Currently, the compiler will convert tail recursive methods to while loops. So you can achieve performances by writing old school SMLish code, rather than the bug prone C-ish implementation. I think it is currently a very good compromise.

  3. Mark Hammons says:

    You should note that a recursive function has negligible speed loss compared to a while loop. I’m talking in the range of 3 milliseconds slower over 10000*1000*1000 iterations compared to a while loop. Also please note that your benchmark is flawed in that it only iterates numbers, which Scala will optimize away in the case of while loops and recursive functions. Have your benchmark do something like instantiate objects instead And I think the gap between for and while will close some.

  4. Ichoran says:

    You should use x += k instead of x += 1 to keep the compiler from noticing the while loop can be elided. Also, you should compile with -optimise; this produces a nice speedup for for loops in many cases. You’ll find with version 2.9.1 and later that depending on details your while loop is usually only ~2x faster for simple arithmetic.

  5. [...] They rave about the YourKit profiler and the Caliper microbenchmarker, and point to this interesting post. This is all pre-macros, so the advice may need an update [...]

Twitter