David Andrzejewski, Data Sciences Engineer
09.23.2012

Scala at Sumo: type classes with a machine learning example

At Sumo Logic we use the Scala programming language, and we are always on the lookout for ways that we can leverage its features to write high-quality software. The type class pattern (an idea which comes from Haskell) provides a flexible mechanism for associating behaviors with types, and context bounds make this pattern particularly easy to express in Scala code.  Using this technique, we can extend existing types with new functionalities without worrying about inheritance.  In this post we introduce a motivating example and examine a few different ways to express our ideas in code before coming back to type classes, context bounds, and what they can do for us.

Machine learning example: fixed length feature vector representation

Consider a machine learning setting where we have a very simple binary linear classifier that classifies an item according to the sign of the inner product between a weight vector and a fixed-length feature vector representation of that item. For this classifier to work on elements of some type T, we need a way to convert an instance of T into an Array[Double] for our inner product calculation. Converting some object of interest into this more abstract representation is a very common pre-processing task for machine learning algorithms, and is often handled in the simplest possible way, which we introduce as Option #1.

Option 1: require the caller to convert their data to Array[Double] before calling our code

This is the simplest possible approach, but it has a few disadvantages. First, we’re forcing the caller to do the conversions and bookkeeping around the association between the original objects of type T and their corresponding feature vectors. We are also disregarding the assistance of Scala’s type system – it could be possible to accidentally train the same classifier on incompatible types T1 and T2, because the classifier only knows about Array[Double]. Extending this idea, consider a NearestNeighbors class that uses feature vectors to return the k nearest neighbor training examples for a given test example. This class could potentially return feature vectors constructed from different original types than T, which may not be the desired behavior.

Option 2: add a parameter T => Array[Double] that allows us to compute a feature vector from an instance of T

This has some advantages over Option #1: the caller no longer worries about the associations between instances of T and Array[Double]. We also gain the benefits of Scala’s type-checking, in that a LinearClassifier[T1] will complain about being asked to classify a T2. However this approach also introduces additional burdens:

  • if we call other methods that require this functionality, we need to continue passing the function around, or we need to evaluate it and pass its result (an Array[Double]) around, essentially assuming the bookkeeping burden from Option #1.
  • if we want to enrich the our feature representation with more information (e.g., an IndexedSeq[String] describing each feature), we would need to add another function or another output to this function. This could quickly become messy.

Option 3: define a FeatureVector trait with a method features: Array[Double], adding the type bound T <: FeatureVector

Now we are getting somewhere – this design alleviates the two stated disadvantages of Option #2. However, this unfortunately assumes that the original definition of T must have been aware of the FeatureVector trait in order to extend it. Alternatively, the caller could define and instantiate some “wrapper” class that extends FeatureVector. One hazard of the wrapper approach is that it could become unwieldy if we want to further augment T with other functionality. We would either need separate wrappers for every combination of functionalities, or we would need to nest wrappers somehow. If we did decide to use wrappers, we could use Scala’s implicit machinery to automate the wrapping/unwrapping with a view bound T <% FeatureVector.

Type classes

The concept of sub-typing is usually associated with an is-a relation: the bound T <: FeatureVector asserts that our code can treat T as a FeatureVector in the sense that T will supply a features() method. A type class can be thought of as a has-a relation: we would like to ensure that type T has some associated type FeatureVector[T] capable of computing features(x: T) for any given instance x of type T. This design facilitates separation of concerns while still leveraging the type system: instead of having our item T know how to turn itself into an Array[Double], we can easily drop in different implementations of FeatureVector[T] that may compute features(x: T) in different ways. Also, by supplying separate T-parametrized types for different functionalities, we can mix and match behaviors onto our pre-existing type T more easily than we could with wrappers. A classic example from the Scala standard libraries is the Ordering[T] trait: Rather than requiring instances of T to supply an implementation of compare(x: T), we can rely on the existence of an Ordering[T] capable of computing compare(x: T, y: T).

Option 4: define a type FeatureVector[T] and use the context bound T : FeatureVector 

In this version, we have a type-parameterized trait FeatureVector[T], which we define as being able to compute features(x: T) for instances of our type T. Similar to a view bound, the context bound LinearClassifier[T : FeatureVector] asserts the existence of an implicit instance of type FeatureVector[T] in the scope of the caller, and then threading it through to all further calls.  If no FeatureVector[T] is available, we will get an error at compile time. For a full runnable example using this approach, see this gist.  

Note that we could use this design without context bounds or implicits by explicitly passing around an instance of FeatureVector[T], similar to how we passed a function around in Option #2 above. Like Option #2 however, we would have the burden of passing an extra argument around, and this inconvenience might discourage us from factoring our code in this way.

This has been a slightly circuitous journey, but I hope the running example and discussion of different alternatives has helped to illustrate the usefulness of the type class pattern and context bounds in Scala. See the references below for more details.

REFERENCES


«
  1. David Andrzejewski, Data Sciences Engineer David Andrzejewski, Data Sciences Engineer says:

    Slides from a Bay Area Scala Enthusiasts lightning talk (loosely) based on this blog post: http://sumologic.com/scalameetup

    Bay Area Scala Enthusiasts meetup:
    http://www.meetup.com/Bay-Area-Scala-Enthusiasts/

Twitter