

Building Scala at Scale

05.12.2014 | Posted by Jacek Migdal

The Scala compiler can be brutally slow. The community has a love-hate relationship with it. Love means “Yes, scalac is slow”. Hate means, “Scala — 1★ Would Not Program Again”. It’s hard to go a week without reading another rant about the Scala compiler.

Moreover, one of the Typesafe co-founders left the company shouting, “The Scala compiler will never be fast” (17:53). Even Scala inventor Martin Odersky provides a list of fundamental reasons why compiling is slow.

At Sumo Logic, we happily build over 600K lines of Scala code[1] with Maven and find this setup productive. Based on the public perception of the Scala build process, this seems about as plausible as a UFO landing on the roof of our building. Here’s how we do it:

Many modules

At Sumo Logic, we have more than 120 modules. Each has its own source directory, unit tests, and dependencies. As a result, each of them is reasonably small and well defined. Usually, you just need to modify one or a few of them, which means that you can just build them and fetch binaries of dependencies[2].

Using this method is a huge win in build time and also makes the IDE and test suites run more quickly. Fewer elements are always easier to handle.
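
For illustration, this is roughly what the workflow enables (a sketch, not our exact commands; the goals depend on the POM, and the module name is only an example reused from later in this post). Maven's -pl flag restricts the build to one module, and its dependencies are resolved as prebuilt binaries from Nexus instead of being rebuilt from source:

  # Build and test a single module; its dependencies are fetched as jars from Nexus.
  # Assumes the module's directory name matches the module name.
  mvn -pl stream-pipeline clean test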

We keep all modules in a single GitHub repository. Though we experimented with a separate repository for each project, keeping track of version dependencies proved too complicated.

Parallelism on module level

Although Moore’s law is still at work, single cores have not become much faster since 2004. The Scala compiler has some parallelism, but it’s nowhere close to saturating eight cores[3] in our use case.

Enabling parallel builds in Maven 3 helped a lot. At first it caused many non-deterministic failures, but it turned out that always forking the Java compiler fixed most of the problems[4]. That allows us to fully saturate all of the CPU cores during most of the build time. Even better, it allows us to overcome other bottlenecks (e.g., fetching dependencies).
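
As a rough sketch, Maven 3's -T flag is what drives the module-level parallelism; the javac fork itself is configured through the maven-compiler-plugin's fork option in the POM (the thread count below is only an example):

  # One Maven build thread per CPU core; independent modules compile in parallel.
  # The maven-compiler-plugin is configured with <fork>true</fork> in the POM.
  mvn -T 1C clean install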

Incremental builds with Zinc

Zinc brings features from sbt to other build systems, providing two major gains:

  • It keeps warmed-up compilers running, which avoids the JVM startup and warm-up tax.
  • It allows incremental compilation. We rarely compile from a clean state; usually we make a small change and only the affected files are recompiled. This is a huge gain when doing Test-Driven Development.

For a long time we were unable to use Zinc with parallel module builds. As it turns out, we needed to tell Zinc to fork Java compilers. Luckily, an awesome Typesafe developer, Peter Vlugter, implemented that option and fixed our issue.
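
A minimal sketch of the day-to-day flow, assuming the standalone Typesafe Zinc launcher is installed and the scala-maven-plugin in the POM is set up to talk to the Zinc server:

  # Start the long-lived compile server once; it stays warm between builds.
  zinc -start
  # Later builds reuse the warm compiler and recompile only what changed.
  mvn test-compile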

Time statistics

The following example shows the typical development workflow of building one module. For this benchmark, we picked the largest one by lines of code (53K LOC).

[Chart: example module build time]

This next example shows building all modules (674K LOC), the most time-consuming task.

[Chart: total build time]

Usually we can skip test compilation, bringing build time down to 12 minutes.[5]
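
With stock Maven (not necessarily our exact flags), skipping test compilation looks like this:

  # -Dmaven.test.skip=true skips both compiling and running tests.
  mvn -T 1C clean install -Dmaven.test.skip=true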

Wrapper utility

Still, some engineers were not happy, because:

  • They often built and tested more than necessary.
  • Computers get slow when the CPU is saturated (e.g., a video conference becomes sluggish).
  • Passing the correct arguments to Maven is hard.

Educating developers might have helped, but we picked the easier route. We created a simple bash wrapper that:

  • Runs every Maven process at lower CPU priority (nice -n 15), so the build doesn't slow down the browser, IDE, or a video conference.
  • Makes sure that Zinc is running, and starts it if not.
  • Allows you to compile all the dependencies (downstream) easily for any module.
  • Allows you to compile all the things that depend on a module (upstream).
  • Makes it easy to select the kind of tests to run.

Though it is a simple wrapper, it improves usability a lot. For example, if you fixed a library bug for a module called “stream-pipeline” and would like to build and run unit tests for all modules that depend on it, just use this command:

bin/quick-assemble.sh -tu stream-pipeline
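
To give a feel for it, here is a minimal sketch of such a wrapper. It is not our actual script; the flag names just mirror the example above, and the Zinc check and Maven flags are simplifications:

  #!/usr/bin/env bash
  # Hypothetical sketch of quick-assemble.sh, not the real script.
  # -t  also run unit tests for the built modules
  # -u  also build the modules that depend on the given module
  set -euo pipefail

  RUN_TESTS=false
  ALSO_DEPENDENTS=false
  while getopts "tu" opt; do
    case "$opt" in
      t) RUN_TESTS=true ;;
      u) ALSO_DEPENDENTS=true ;;
    esac
  done
  shift $((OPTIND - 1))
  MODULE="${1:?module name required}"

  # Make sure the Zinc compile server is up before building.
  pgrep -f zinc >/dev/null || zinc -start

  ARGS=(-T 1C -pl "$MODULE" -am)       # the module plus everything it depends on
  $ALSO_DEPENDENTS && ARGS+=(-amd)     # ...and everything that depends on it
  GOAL=test-compile
  $RUN_TESTS && GOAL=test

  # Low CPU priority so the IDE, browser, and video calls stay responsive.
  nice -n 15 mvn "${ARGS[@]}" "$GOAL"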

Tricks we learned along the way

  1. Print the longest chain of module dependencies by build time.
    That helps identify unnecessary or poorly designed dependencies, which can be removed. This makes the dependency graph much shallower, which means more parallelism.
  2. Run a build in a loop until it fails.
    As simple as in bash:  while bin/quick-assemble.sh; do :; done.
    Then leave it overnight. This is very helpful for debugging non-deterministic bugs, which are common in a multithreading environment.
  3. Analyze the bottlenecks of build time.
    CPU? IO? Are all cores used? Network speed? The limiting factor can vary between build phases. iStat Menus proved to be really helpful; a few stock command-line checks (see the snippet after this list) also work.
  4. Read the Maven documentation.
    Many things in Maven are not intuitive. The “trial and error” approach can be very tedious for this build system. Reading the documentation carefully is a huge time saver.
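
For the bottleneck analysis in point 3, a few stock command-line tools (shown here for macOS; nothing specific to our build) go a long way alongside iStat Menus:

  top -o cpu     # are all cores busy, or is the build waiting on something?
  iostat -w 2    # disk throughput sampled every two seconds while the build runs
  nettop         # network usage, e.g. while dependencies are being fetched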

Summary

Building at scale is usually hard. Scala makes it harder, because the compiler is relatively slow, so you will hit these issues much earlier than in other languages. However, the problems are solvable through general development best practices, especially:

  • Modular code
  • Parallel execution by default
  • Investment in tooling

Then it just rocks!

 

[1] ( find ./ -name '*.scala' -print0 | xargs -0 cat ) | wc -l
[2] All modules are built and tested by Jenkins and the binaries are stored in Nexus.
[3] The author's 15-inch MacBook Pro (late 2013) has eight logical cores (four physical cores with Hyper-Threading).
[4] We have little Java code. Theoretically, the Java 1.6 compiler is thread-safe, but it has some concurrency bugs. We decided not to dig into that, as forking seemed to be the easier solution.
[5] Benchmark methodology:

  • Hardware: MacBook Pro, 15-inch, Late 2013, 2.3 GHz Intel i7, 16 GB RAM.
  • All tests were run three times and median time was selected.
  • Non-incremental Maven goal: clean test-compile.
  • Incremental Maven goal: test-compile. A random change was introduced to trigger some recompilation.

Do logs have a schema?

06.12.2013 | Posted by Jacek Migdal

As human beings, we share quite a few life events that we keep track of, like birthdays, holidays, anniversaries, and so on. These are structured events that occur on exact dates or during specific times of year. 

But how do you keep track of the unique, unexpected events that can be life-changing? The first meeting with someone, an inspiring conversation that sparked a realization—events that may seem common to many, but are so special to you.

Computer systems present the same dilemma. Some events are expected, like adding a new user. Other events look routine, but from time to time they carry crucial, unexpected information. Unfortunately, we most often realize how important those pivotal events were only after we experience a malfunction.

That’s where logs come in.

Virtually every computer program has some append-only structure for logs. Usually, it is as simple as a text file with a new line for each event. Sometimes the messages are saved to a database if the information may be used later. Why does it work that way? Well, it's very easy to use and implement; usually it's just one line of code. Don't let the simplicity fool you. Logs provide a very powerful way of understanding and debugging systems. In many cases, logs are the sole method of figuring out why something has happened.
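
To make that concrete, the "one line of code" is often little more than appending a timestamped line to a text file (the file name and message below are made up for illustration):

  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) WARN login failed user=alice reason=bad_password" >> app.log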

From time to time, I’ll read about a new log management tool that converts log data into some standardized format. Well, there is limited value in that approach. Extracting data from logs is useful and could answer many business and operational questions. This works well with things that we expect, and things that answer numerical questions, like determining how many users have signed up in a given period of time.

However, during the process of converting logs to a standardized format, valuable data could be lost. For example, it’s interesting that many users couldn’t log in to your service, but the crucial information is why it happened. The unexpected part is usually very important and often even more valuable.

So do logs have a schema? Well, for the expected things, sure. But for analyzing the unexpected events it’s hard to think of a schema at all, beyond perhaps some partial structure.

That’s why at Sumo Logic, we accept any kind of log you throw at us. During log collection we just need to understand the events (e.g. separate lines) and the timestamp format. Everything else can be derived when you run a query.

Our query language lets you find or extract structure, and data can be visualized and/or exported. Sumo Logic's key advantage is how we handle the unexpected with machine learning algorithms. Our patent-pending LogReduce groups similar events on the fly to find anomalies, enabling our customers to review large sets of events quickly and identify the root cause of unexpected behavior.

No one ever intends to create bugs, but with the complexity and fast pace of software development they are inevitable. Well-designed systems should be debuggable. Log management tools, such as Sumo Logic, are here to help you deal with the logs that are a huge part of today’s technology.

“Only those days are important which are still unknown to us
Only those few moments are important, the ones we are still waiting for”
(lyrics from a famous Polish song by Marek Grechuta)

 
