Sign up for a live Kubernetes or DevSecOps demo

Click here

David Andrzejewski

David Andrzejewski is a senior engineering manager at Sumo Logic, where he works on applying statistical modeling and analysis techniques to machine data such as logs and metrics. He also co-organizes the SF Bay Area Machine Learning meetup group. David holds a PhD in Computer Sciences from the University of Wisconsin–Madison.

Posts by David Andrzejewski

Blog

Recapping the Top 3 Talks on Futuristic Machine Learning at Scale By the Bay 2018

As discussed in our previous post, we recently had the opportunity to present some interesting challenges and proposed directions for data science and machine learning (ML) at the 2018 Scale By the Bay conference. While the excellent talks and panels at the conference were too numerous to cover here, I wanted to briefly summarize three talks in particular that I found to represent some really interesting (to me) directions for ML on the java virtual machine (JVM). Talk 1: High-performance Functional Bayesian Inference in Scala By Avi Bryant (Stripe) | Full Video Available Here Probabilistic programming lies at the intersection of machine learning and programming languages, where the user directly defines a probabilistic model of their data. This formal representation has the advantage of neatly separating conceptual model specification from the mechanics of inference and estimation, with the intention that this separation will make modeling more accessible to subject matter experts while allowing researchers and engineers to focus on optimizing the underlying infrastructure. (image source) Rainier in an open-source library in Scala that allows the user to define their model and do inference in terms of monadic APIs over distributions and random variables. Some key design decisions are that Rainier is “pure JVM” (ie, no FFI) for ease of deployment, and that the library targets single-machine (ie, not distributed) use cases but achieves high performance via the nifty technical trick of inlining training data directly into dynamically generated JVM bytecode using ASM. Talk 2: Structured Deep Learning with Probabilistic Neural Programs By Jayant Krishnamurthy (Semantic Machines) | Full Video Available Here Machine learning examples and tutorials often focus on relatively simple output spaces: Is an email spam or not? Binary outputs: Yes/No, 1/0, +/-, … What is the expected sale price of a home? Numerical outputs – $1M, $2M, $5M, … (this is the Bay Area, after all!) However, what happens when we want our model to output a more richly structured object? Say that we want to convert a natural language description of an arithmetic formula into a formal binary tree representation that can then be evaluated, for example “three times four minus one” would map to the binary expression tree “(- (* 3 4) 1)”. The associated combinatorial explosion in the size of the output space makes “brute-force” enumeration and scoring infeasible. The key idea of this approach is to define the model outputs in terms of a probabilistic program (which allows us to concisely define structured outputs), but with the probability distributions of the random variables in that program being parameterized in terms of neural networks (which are very expressive and can be efficiently trained). This talk consisted mostly of live-coding, using an open-source Scala implementation which implements a monadic API for a function from neural network weights to a probability distribution over outputs. Talk 3: Towards Typesafe Deep Learning in Scala By Tongfei Chen (Johns Hopkins University) | Full Video Available Here (image source) For a variety of reasons, the most popular deep learning libraries such as TensorFlow & PyTorch are primarily oriented around the Python programming language. Code using these libraries consists primarily of various transformation or processing steps applied to n-dimensional arrays (ndarrays). It can be easy to accidentally introduce bugs by confusing which of the n axes you intended to aggregate over, mis-matching the dimensionalities of two ndarrays you are combining, and so on. These errors will occur at run time, and can be painful to debug. This talk proposes a collection of techniques for catching these issues at compile time via type safety in Scala, and walks through an example implementation in an open-source library. The mechanics of the approach are largely based on typelevel programming constructs and ideas from the shapeless library, although you don’t need to be a shapeless wizard yourself to simply use the library, and the corresponding paper demonstrates how some famously opaque compiler error messages can be made more meaningful for end-users of the library. Conclusion Aside from being great, well-delivered talks, several factors made these presentations particularly interesting to me. First, all three had associated open-source Scala libraries. There is of course no substitute for actual code when it comes to exploring the implementation details and trying out the approach on your own test data sets. Second, these talks shared a common theme of using the type system and API design to supply a higher-level mechanism for specifying modeling choices and program behaviors. This can both make end-user code easier to understand as well as unlock opportunities for having the underlying machinery automatically do work on your behalf in terms of error-checking and optimization. Finally, all three talks illustrated some interesting connections between statistical machine learning and functional programming patterns, which I found interesting as a longer-term direction for trying to build practical machine learning systems. Additional Resources Learn how to analyze Killer Queen game data with machine learning and data science with Sumo Logic Notebooks Interested in working with the Sumo Logic engineering team? We’re hiring! Check out our open positions here Sign up for a free trial of Sumo Logic

Blog

Careful Data Science with Scala

This post gives a brief overview of some ideas we presented at the recent Scale By the Bay conference in San Francisco, for more details you can see a video of the talk or take a look at the slides. The Problems of Sensitive Data and Leakage Data science and machine learning have gotten a lot of attention recently, and the ecosystem around these topics is moving fast. One significant trend has been the rise of data science notebooks (including our own here at Sumo Logic): interactive computing environments that allow individuals to rapidly explore, analyze, and prototype against datasets. However, this ease and speed can compound existing risks. Governments, companies, and the general public are increasingly alert to the potential issues around sensitive or personal data (see, for example, GDPR). Data scientists and engineers need to continuously balance the benefits of data-driven features and products against these concerns. Ideally, we’d like a technological assistance that makes it easier for engineers to do the right thing and avoid unintended data processing or revelation. Furthermore, there is also a subtle technical problem known in the data mining community as “leakage”. Kaufman et al won the best paper award at KDD 2011 for Leakage in Data Mining: Formulation, Detection, and Avoidance, which describes how it is possible to (completely by accident) allow your machine learning model to “cheat” because of unintended information leaks in the training data contaminating the results. This can lead machine learning systems which work well on sample datasets but whose performance is significantly degraded in the real world. As this can be a major problem, especially in systems that pull data from disparate sources to make important predictions. Oscar Boykin of Stripe presented an approach to this problem at Scale By the Bay 2017 using functional-reactive feature generation from time-based event streams. Information Flow Control (IFC) for Data Science My talk at Scale By the Bay 2018 discussed how we might use Scala to encode notions of data sensitivity, privacy, or contamination, thereby helping engineers and scientists avoid these problems. The idea is based on programming languages (PL) research by Russo et al, where sensitive data (“x” below) is put in a container data type (the “box” below) which is associated with some security level. Other code can apply transformations or analyses to the data in-place (known as Functor “map” operation in functional programming), but only specially trusted code with an equal or greater security level can “unbox” the data. To encode the levels, Russo et al propose using the Lattice model of secure information flow developed by Dorothy E. Denning. In this model, the security levels form a partially ordered set with the guarantee that any given pair of levels will have a unique greatest lower bound and least upper bound. This allows for a clear and principled mechanism for determining the appropriate level when combining two pieces of information. In the Russo paper and our Scale By the Bay presentation, we use two levels for simplicity: High for sensitive data, and Low for non-sensitive data. To map this research to our problem domain, recall that we want data scientists and engineers to be able to quickly experiment and iterate when working with data. However, when data may be from sensitive sources or be contaminated with prediction target information, we want only certain, specially-audited or reviewed code to be able to directly access or export the results. For example, we may want to lift this restriction only after data has been suitably anonymized or aggregated, perhaps according to some quantitative standard like differential privacy. Another use case might be that we are constructing data pipelines or workflows and we want the code itself to track the provenance and sensitivity of different pieces of data to prevent unintended or inappropriate usage. Note that, unlike much of the research in this area, we are not aiming to prevent truly malicious actors (internal or external) from accessing sensitive data – we simply want to provide automatic support in order to assist engineers in handling data appropriately. Implementation and Beyond Depending on how exactly we want to adapt the ideas from Russo et al, there are a few different ways to implement our secure data wrapper layer in Scala. Here we demonstrate one approach using typeclass instances and implicit scoping (similar to the paper) as well as two versions where we modify the formulation slightly to allow changing the security level as a monadic effect (ie, with flatMap) having last-write-wins (LWW) semantics, and create a new Neutral security level that always “defers” to the other security levels High and Low. Implicit scoping Most similar to the original Russo paper, we can create special “security level” object instances, and require one of them to be in implicit scope when de-classifying data. (Thanks to Sergei Winitzki of Workday who suggested this at the conference!) Value encoding For LWW flatMap, we can encode the levels as values. In this case, the security level is dynamically determined at runtime by the type of the associated level argument, and the de-classify method reveal() returns a type Option[T] where it is None if the level is High. This implementation uses Scala’s pattern-matching functionality. Type encoding For LWW flatMap, we can encode the levels as types. In this case, the compiler itself will statically determine if reveal() calls are valid (ie, against the Low security level type), and simply fail to compile code which accesses sensitive data illegally. This implementation relies on some tricks derived from Stefan Zeiger’s excellent Type-Level Computations in Scala presentation. Data science and machine learning workflows can be complex, and in particular there are often potential problems lurking in the data handling aspects. Existing research in security and PL can be a rich source of tools and ideas to help navigate these challenges, and my goal for the talk was to give people some examples and starting points in this direction. Finally, it must be emphasized that a single software library can in no way replace a thorough organization-wide commitment to responsible data handling. By encoding notions of data sensitivity in software, we can automate some best practices and safeguards, but it will necessarily only be a part of a complete solution. Watch the Full Presentation at Scale by the Bay Learn More

Blog

Stateful traversal and econometric simulation: together at last!

Functional programming concepts such as the State monad and the Traversable type class can be powerful tools for thinking about common software engineering problems. As with any abstractions, these ideas may seem unfamiliar or opaque in the absence of concrete examples. The goal of this post is therefore to motivate and contextualize a “stateful traversal” design pattern by working through the specific problem of generating synthetic data points drawn from a simple statistical model. AR(p) modeling The p-th order linear autoregressive model AR(p) assumes that data point $$y_t$$ is generated according to the equation where the $$a_i$$ are the fixed autoregressive coefficients and $$\epsilon_t$$ are the iid Gaussian noise terms. This is a basic and foundational model in econometrics and time series analysis. Data generation Say we have some model estimation code which we would like to sanity test against synthetic data points generated by a known model. First, we note that the $$\epsilon$$ terms are, by definition, totally independent of anything else. We can therefore simply sample $$N$$ of these to populate an $$\epsilon$$ vector. The next step is a bit more involved. While we could compute each $$y_t$$ independently using only the previous noise terms $$\epsilon$$, this calculation would be $$O(N^2)$$ and a bit awkward. A more intuitive approach might be to sweep a pointer over the $$\epsilon$$ array and our (to-be-populated) $$y$$ array, using pointer/index arithmetic to select the appropriate previous terms for our calculation of each $$y_t$$. Basic sliding window model. This simple diagram shows how we can slide a function $$f$$ with fixed coefficients a over our previously generated $$y$$ and iid $$\epsilon$$. An unfortunate aspect of this design is that it is mixes inputs and outputs in a way that could easily lead to programming errors. Also, the pointer arithmetic trick couples us to this exact problem and somewhat obscures what we’re really doing: threading the dynamic evolution of some state (here, the trailing window) through the course of our data-generating process. To make the latter point more concrete, consider the bookkeeping complexity that would be entailed by the extension of this approach to models with a richer notion of state, such as an ARIMA model or an Infinite Hidden Markov Model. This motivates an alternative implementation which explicitly models the evolution of some state as we generate our data. State-threading approach The idea here is that the AR(p) equation above tells us how to get a mapping from some state of type S (window $$[y_{t-p},\ldots,y_{t-1}]$$) and an input type I (noise $$\epsilon_t$$) to a new output of type O (output $$y_t$$) and a new state of type S (new window $$[y_{t-(p-1)},\ldots,y_t]$$). As a Scala type signature this would look something like (S,I) => (S,O). Basic state-based model. The above diagram shows the simplest version of this idea, where $$g$$ is nearly identical to $$f$$, except with an additional output for the new state S. We can simplify this a bit via the concept of partial application. We note that $$ a $$ is fixed across all evaluations of our function, so we can fix that parameter to get $$g(y,\epsilon_t)$$, as shown in this diagram. Model with coefficient parameters fixed. Finally, we combine partial application with the functor property of our $$\epsilon$$ sequence by mapping partial application of our function over $$\epsilon$$ to get separate functions $$g_t(y)$$ for each individual position $$t$$. Our function now simply maps from a previous window S to a new window S and an output value O, as shown in this diagram Per-t state-based model. This re-arrangement has stripped our sequential data-generating computation down to its essence: given a sequence of iid noise terms $$\epsilon$$, we almost directly encode our mathematical definition as a function which takes a sliding window state S as its input, and returns a computed value O and a new sliding window state S as output, therefore having type signature S => (S,O). Plug-and-play All that remains is to construct the plumbing necessary to put it all together. As luck would have it, there is a common functional programming idiom for traversing a data structure while simultaneously accumulating state and transforming individual elements based on that state. This allows us to simply supply our transformed function of $$g$$ along with an initial window state (say, all zeros). #wrap_githubgist02765cb6b2af72bce5f1 .gist-data {max-height: 100%;} document.write('') document.write('\n \n \n \n \n \n\n \n \n \n \n case class AR(coeff: List[Double]) {\n \n \n \n type Coeff = List[Double]\n \n \n \n type Window = List[Double]\n \n \n \n \n\n \n \n \n // Assuming Coeff supplied as [... , a_{t-2}, a_{t-1}] and\n \n \n \n // likewise, Window supplied as [..., y_{t-2}, y_{t-1}]\n \n \n \n private def g(a: Coeff, noise: Double, w: Window): (Window, Double) = {\n \n \n \n val yt = (a zip w).map{case (x,y) => x*y}.sum + noise\n \n \n \n (w.tail :+ yt, yt)\n \n \n \n }\n \n \n \n \n\n \n \n \n def generate(noise: List[Double]): List[Double] = {\n \n \n \n val initWindow = coeff.map(_ => 0.0) // Init window = all zeros\n \n \n \n val gs = noise.map((g _).curried(coeff)) // One S => (S, O) for each noise input\n \n \n \n StatefulGeneration.generate(initWindow, gs) // ???\n \n \n \n }\n \n \n \n }\n \n\n\n \n\n \n \n\n\n \n \n view raw\n ar-model.scala\n hosted with ❤ by GitHub\n \n \n\n') Not Found This is somewhat remarkable. The AR(p) equation tells us how to go from a state (the window of previous $$y_t$$) and an input ($$\epsilon_t$$) to a new state (the new window) and an output (the new $$y_t$$), and we can directly plug exactly this information into some generic machinery to achieve our desired result. So how does it work? State monad Clearly we’ve buried some crucial mechanics – let’s take a look at StatefulGeneration.generate(): #wrap_githubgist5507cbc7dda71c9688dc .gist-data {max-height: 100%;} document.write('') document.write('\n \n \n \n \n \n\n \n \n \n \n object StatefulGeneration {\n \n \n \n def generate[G[_] : Traverse : Functor, S, O](init: S, gs: G[S => (S,O)]): G[O] = {\n \n \n \n val (traverse, functor) = (implicitly[Traverse[G]], implicitly[Functor[G]])\n \n \n \n val stateFunctions = functor.map(gs)(State.apply[S,O])\n \n \n \n traverse.sequenceS[S,O](stateFunctions).run(init)._2\n \n \n \n }\n \n \n \n }\n \n\n\n \n\n \n \n\n\n \n \n view raw\n stateful-gen.scala\n hosted with ❤ by GitHub\n \n \n\n') Not Found The context bounds assert the availability of utility type class instances for our container type G[_] (here, List), which are supplied by scalaz.std.list.listInstance and retrieved via the implicitly statements. The first piece of machinery invoked in the above example is functor.map(gs)(State.apply). Recall that gs has type List[Window => (Window,Double)], that is, a sequence of functions that each map an input state to a new output state along with an output value. This simple and abstract definition of a stateful computation occupies a special place in the functional programming world, known as the State Monad. There exists a vast amount of instructional content on this topic which we shall not recapitulate here (you can find plenty on the web, for a nice example see “Learn You A Haskell” aka LYAH). Suffice it to say that for our purposes a State instance is a wrapper for a function that takes a state of type S as input, and returns some new state of type S along with an “output” value of type A, something like: #wrap_githubgist01a320efa31d736c4ace .gist-data {max-height: 100%;} document.write('') document.write('\n \n \n \n \n \n\n \n \n \n \n case class State[S, A](run: S => (S, A))\n \n\n\n \n\n \n \n\n\n \n \n view raw\n state.scala\n hosted with ❤ by GitHub\n \n \n\n') Not Found The “Monad” half of the name means that there exists a flatMap function over State which, loosely speaking, “chains” two of these computations, using the output S of one State as the input to the next and returning a new stateful computation. An example implementation and simple diagram are below, where State.run takes an initial state S as input and then executes the wrapped function: #wrap_githubgistf54c8ff1b5044d6d9d58 .gist-data {max-height: 100%;} document.write('') document.write('\n \n \n \n \n \n\n \n \n \n \n def flatMap[A,B](ma: State[S,A])(amb: A => State[S,B]): State[S,B] = {\n \n \n \n State({s => {\n \n \n \n val (s1, a) = ma.run(s)\n \n \n \n amb(a).run(s1)\n \n \n \n }})\n \n \n \n }\n \n\n\n \n\n \n \n\n\n \n \n view raw\n state-flatmap.scala\n hosted with ❤ by GitHub\n \n \n\n') Not Found Diagram of flatMap over State. Returning to our generate() function above, the snippet simply wraps each of our $$g_t$$ functions Window => (Window,Double) into a scalaz State, for which flatMap and a variety of other nice helper functions are already defined. Traverse/Sequence What about traverse.sequenceS[Window,Double]? Let’s inspect the type signature of sequenceS, substituting in the actual concrete types we’re using: #wrap_githubgistb0ffc4bb59bf15329b51 .gist-data {max-height: 100%;} document.write('') document.write('\n \n \n \n \n \n\n \n \n \n \n def sequenceS[Window,Double](fga: List[State[Window,Double]]): State[Window,List[Double]]\n \n\n\n \n\n \n \n\n\n \n \n view raw\n sequences-signature.scala\n hosted with ❤ by GitHub\n \n \n\n') Not Found Verbally, this translates to transforming a List of stateful computations with Window state and Double output into a single stateful computation with Window state and List[Double] output. Informally, sequenceS is able to accomplish this by using the flatMap machinery defined in the previous section to chain all of the individual stateful computations into one. Furthermore, sequenceS also transforms the “output” variable of the resulting stateful computation from Double to List[Double], a list of all the individual outputs, which is exactly the final output we originally wanted. In general, Sequence allows us to “commute” two higher-order types, going from G[F[A]] to F[G[A]]. In this specific case we are transforming List[State[Window,Double]] to State[Window, List[Double]]. The End Finally, we run it by supplying an all-zeros window as the initial state. The resulting value is of type (Window,List[Double]), containing both the final window state and the list of all output values. We retrieve the latter via ._2 and declare victory! What have we gained by all this? First, we have pushed the work of incidental wiring into well-defined library functions, leaving our custom code focused solely on the particulars of our problem: the $$g()$$ function which emits a value and updates the state. This compact surface area for custom logic should be easier to both understand and test. Second, notice that StatefulGeneration.generate() is polymorphic in the type of the data container G[_], subject to the availability of Traverse and Functor type class instances. Finally, the stateful traversal and transformation of sequential data structures is ubiquitous in software, making this design pattern a valuable addition to one’s “vocabulary” for understanding and reasoning about code. References This example was briefly discussed in my talk “Economical machine learning via functional programming” (slides, video) at the Big Data Scala by the Bay conference in Oakland. The code examples use the scalaz library, for which the “Learning Scalaz” blog series is an invaluable companion. The functional programming concepts discussed here are heavily influenced by the excellent book Functional Programming in Scala. My reference for the AR(p) process is Applied Econometric Time Series. A more thorough look at some of these ideas can be found in Eric Torreborre’s epic blog post on the “Essence of the Iterator Pattern” paper by Jeremy Gibbons and Bruno C. d. S. Oliveira. Also see other blog discussion here, here, and here. Any mistakes are of course my own, and I’d love to hear about them!

October 27, 2015

Blog

Scala at Sumo: type class law for reliable abstractions

Abstraction is a fundamental concept in software development. Identifying and building abstractions well-suited to the problem at hand can make the difference between clear, maintainable code and a teetering, Jenga-like monolith duct-taped together by a grotesque ballet of tight coupling and special case handling. While a well-designed abstraction can shield us from detail, it can also suffer from leakage, failing to behave as expected or specified and causing problems for code built on top of it. Ensuring the reliability of our abstractions is therefore of paramount concern. In previous blog posts, we’ve separately discussed the benefits of using type classes in Scala to model abstractions, and using randomized property testing in Scala to improve tests. In this post we discuss how to combine these ideas in order to build more reliable abstractions for use in your code. If you find these ideas interesting please be sure to check out the references at the end of this post. Type classes for fun and profit Type classes allow us to easily build additional behaviors around data types in a type-safe way. One simple and useful example is associated with the monoid abstraction, which represents a set of items which can be combined with one another (such as the integers, which can be combined by addition). Loosely[1], a monoid consists of a collection of objects (e.g., integers) a binary operation for combining these objects to yield a new object of the same type (e.g., addition) an identity object whose combination leaves an object unchanged (e.g., the number 0) This abstraction is captured by the scalaz trait Monoid[F]: #wrap_githubgistc125b7eef26b07354a01 .gist-data {max-height: 100%;} document.write('') document.write('\n \n \n \n \n \n\n \n \n \n \n trait Monoid[F] {\n \n \n \n def zero: F\n \n \n \n def append(f1: F, f2: => F): F\n \n \n \n }\n \n\n\n \n\n \n \n\n\n \n \n view raw\n lawblog-monoid.scala\n hosted with ❤ by GitHub\n \n \n\n') Not Found The utility of this machinery is that it gives us a generalized way to use types that support some notion of “addition” or “combination”, for example[2]: #wrap_githubgist8ba856345509f6fd8bf3 .gist-data {max-height: 100%;} document.write('') document.write('\n \n \n \n \n \n\n \n \n \n \n def addItUp[F : Monoid](items: Seq[F]): F = {\n \n \n \n // Combine a bunch of items\n \n \n \n val m = implicitly[Monoid[F]]\n \n \n \n items.foldLeft(m.zero){case (total, next) => m.append(total,next)}\n \n \n \n }\n \n \n \n \n\n \n \n \n scala> addItUp(Seq("day ", "after ", "day"))\n \n \n \n res1: String = "day after day"\n \n \n \n scala> addItUp(Seq(1,2,3))\n \n \n \n res2: Int = 6\n \n\n\n \n\n \n \n\n\n \n \n view raw\n lawblog-additup.scala\n hosted with ❤ by GitHub\n \n \n\n') Not Found As described in our earlier machine learning example, this can be more convenient than requiring that the data types themselves subtype or inherit from some kind of “Addable” interface. I am the law! In Scala, the Monoid[F] trait definition (combined with the compiler type-checking) buys us some important sanity checks with respect to behavior. For example, the function signature append(x: F, y: F): F guarantees that we’re never going to get a non-F result[3]. However, there are additional properties that an implementation of Monoid[F] must satisfy in order to truly conform to the conceptual definition of a monoid, but which are not easily encoded into the type system. For example, the monoid binary operation must satisfy left and right identity with respect to the “zero” element. For integers under addition the zero element is 0, and we do indeed have x + 0 = 0 + x = x for any integer x. We can codify this requirement in something called type class law. When defining a particular type class, we can add some formal properties or invariants which we expect implementations to obey. The codification of these constraints can then be kept alongside the type class definition. Again returning to scalaz Monoid[4], we have #wrap_githubgist274da5ddfb2b96ce6ec8 .gist-data {max-height: 100%;} document.write('') document.write('\n \n \n \n \n \n\n \n \n \n \n trait Monoid[F] extends Semigroup[F] { \n \n \n \n ...\n \n \n \n trait MonoidLaw extends SemigroupLaw {\n \n \n \n def leftIdentity(a: F)(implicit F: Equal[F]) = \n \n \n \n F.equal(a, append(zero, a))\n \n \n \n def rightIdentity(a: F)(implicit F: Equal[F]) = \n \n \n \n F.equal(a, append(a, zero))\n \n \n \n }\n \n \n \n ...\n \n \n \n }\n \n\n\n \n\n \n \n\n\n \n \n view raw\n lawblog-monoidlaw.scala\n hosted with ❤ by GitHub\n \n \n\n') Not Found An interesting observation is that this implementation depends upon another type class instance Equal[F] which simply supplies an equal() function for determining whether two instances of F are indeed equal. Of course, Equal[F] comes supplied with its own type class laws for properties any well-defined notion of equality must satisfy such as commutativity (x==y iff y==x), reflexivity (x==x), and transitivity (if a==b and b==c then a==c). A machine learning example We now consider an example machine learning application where we are evaluating some binary classifier (like a decision tree) over test data. We run our evaluation over different sets of data, and for each set we produce a very simple output indicating how many predictions were made, and of those, how many were correct: #wrap_githubgist0fea8477be69860df262 .gist-data {max-height: 100%;} document.write('') document.write('\n \n \n \n \n \n\n \n \n \n \n case class Evaluation(total: Int, correct: Int)\n \n\n\n \n\n \n \n\n\n \n \n view raw\n lawblog-evaluationscala\n hosted with ❤ by GitHub\n \n \n\n') Not Found We can implement Monoid[Evaluation] [5] in order to combine the our experimental results across multiple datasets: #wrap_githubgist837f1f4fed2d366e34bd .gist-data {max-height: 100%;} document.write('') document.write('\n \n \n \n \n \n\n \n \n \n \n object EvaluationMonoid extends Monoid[Evaluation] {\n \n \n \n def zero = Evaluation(0,0)\n \n \n \n def append(x: Evaluation, y: => Evaluation) =\n \n \n \n Evaluation(x.total + y.total, x.correct + y.correct)\n \n \n \n }\n \n\n\n \n\n \n \n\n\n \n \n view raw\n lawblog-evalmonoid.scala\n hosted with ❤ by GitHub\n \n \n\n') Not Found We’d like to ensure that our implementation satisfies the relevant type class laws. We could write a handful of unit tests against one or more hand-coded examples, for example using ScalaTest: #wrap_githubgistf0c4a31d7ac6b3f8cf2b .gist-data {max-height: 100%;} document.write('') document.write('\n \n \n \n \n \n\n \n \n \n \n "Evaluation Monoid" should {\n \n \n \n \n\n \n \n \n import EvaluationMonoid._\n \n \n \n implicit val eq = Equal.equalA[Evaluation]\n \n \n \n val testEvaluation = Evaluation(3, 2)\n \n \n \n \n\n \n \n \n "obey Monoid typeclass Law" in {\n \n \n \n Monoid.monoidLaw.leftIdentity(testEval) should be (true)\n \n \n \n Monoid.monoidLaw.rightIdentity(testEval) should be (true)\n \n \n \n }\n \n \n \n }\n \n\n\n \n\n \n \n\n\n \n \n view raw\n lawblog-spottest.scala\n hosted with ❤ by GitHub\n \n \n\n') Not Found However, this merely gives us an existence result. That is, there exists some value for which our the desired property holds. We’d like something a little stronger. This is where we can use ScalaCheck to do property testing, randomly generating as many arbitrary instances of Evaluation as we’d like. If the law holds for all [6] generated instances, we can have a higher degree confidence in the correctness of our implementation. To accomplish this we simply need to supply a means of generating random Evaluation instances via ScalaCheck Gen: #wrap_githubgist6248a52718525e005cd4 .gist-data {max-height: 100%;} document.write('') document.write('\n \n \n \n \n \n\n \n \n \n \n val evalGen = for {total <- Gen.choose(0, 1000);\n \n \n \n correct <- Gen.choose(0, total)} \n \n \n \n yield Evaluation(total,correct)\n \n \n \n \n\n \n \n \n "Evaluation Monoid" should {\n \n \n \n \n\n \n \n \n import EvaluationMonoid._\n \n \n \n implicit val eq = Equal.equalA[Evaluation]\n \n \n \n \n \n \n \n "obey Monoid typeclass Law" in {\n \n \n \n forAll (evalGen) { testEval => {\n \n \n \n Monoid.monoidLaw.leftIdentity(testEval) should be (true) \n \n \n \n Monoid.monoidLaw.rightIdentity(testEval) should be (true)\n \n \n \n }}\n \n \n \n }\n \n \n \n }\n \n\n\n \n\n \n \n\n\n \n \n view raw\n lawblog-proptest.scala\n hosted with ❤ by GitHub\n \n \n\n') Not Found Now that’s an abstraction we can believe in! So what? This level of confidence becomes important when we begin to compose type class instances, mixing and matching this machinery to achieve our desired effects. Returning to our Evaluation example, we may want to evaluate different models over these datasets, storing the results for each dataset in a Map[String,Evaluation] where the keys refer to which model was used to obtain the results. In scalaz, we get the Monoid[Map[String,Evaluation]] instance “for free”, given an instance of Monoid[Evaluation]: #wrap_githubgistac4abe76c1a5058b05cc .gist-data {max-height: 100%;} document.write('') document.write('\n \n \n \n \n \n\n \n \n \n \n scala> implicit val em = EvaluationMonoid\n \n \n \n em: EvaluationMonoid.type = EvaluationMonoid$@34f5b235\n \n \n \n \n\n \n \n \n scala> implicit val mm = mapMonoid[String,Evaluation]\n \n \n \n mm: scalaz.Monoid[Map[String,Evaluation]] = scalaz.std.MapInstances$$anon$4@13105b09\n \n \n \n \n\n \n \n \n scala> val dataset1 = Map("modelA" -> Evaluation(3,2),\n \n \n \n | "modelB" -> Evaluation(4,1))\n \n \n \n dataset1: scala.collection.immutable.Map[String,Evaluation] = \n \n \n \n Map(modelA -> Evaluation(3,2), modelB -> Evaluation(4,1))\n \n \n \n \n\n \n \n \n scala> val dataset2 = Map("modelA" -> Evaluation(5,4))\n \n \n \n dataset2: scala.collection.immutable.Map[String,Evaluation] = \n \n \n \n Map(modelA -> Evaluation(5,4))\n \n \n \n \n\n \n \n \n scala> mm.append(dataset1,dataset2)\n \n \n \n res3: Map[String,Evaluation] = \n \n \n \n Map(modelA -> Evaluation(8,6), modelB -> Evaluation(4,1))\n \n\n\n \n\n \n \n\n\n \n \n view raw\n lawblog-mapmonoid.scala\n hosted with ❤ by GitHub\n \n \n\n') Not Found Conclusion and references If you are using the scalaz library, many of the provided type classes come “batteries included” with type class laws. Even if you are not, these ideas can help you to build more reliable type class instances which can be composed and extended with confidence. See below for some additional references and readings on this subject: Law Enforcement using Discipline Verifying Typeclass Laws in Haskell with QuickCheck How to test Scalaz type class instances with specs2 and ScalaCheck Haskell’s Type Classes: We Can Do Better Footnotes [1] Omitting associativity and explicit discussion of closure. [2] For brevity, these code snippets do not show library (scalaz, ScalaTest, ScalaCheck) imports. [3] Excluding the unfortunate possibilities of null return values or thrown Exceptions. [4] A semigroup is a more general concept than a monoid, which is modeled in scalaz by having Monoid[F] extend Semigroup[F]. [5] This implementation has a bit of a boilerplate flavor, this post describes how we could automagically derive our Monoid[Evaluation] instance. [6] As implied by the ScalaCheck project’s appropriate logo.

October 9, 2014

Blog

Machine Data at Strata: “BigData++”

A few weeks ago I had the pleasure of hosting the machine data track of talks at Strata Santa Clara. Like “big data”, the phrase “machine data” is associated with multiple (sometimes conflicting) definitions, two prominent ones come from Curt Monash and Daniel Abadi. The focus of the machine data track is on data which is generated and/or collected automatically by machines. This includes software logs and sensor measurements from systems as varied as mobile phones, airplane engines, and data centers. The concept is closely related to the “internet of things”, which refers to the trend of increasing connectivity and instrumentation in existing devices, like home thermostats. More data, more problems This data can be useful for the early detection of operational problems or the discovery of opportunities for improved efficiency. However, the de­coupling of data generation and collection from human action means that the volume of machine data can grow at machine scales (i.e., Moore’s Law), an issue raised by both Monash and Abadi. This explosive growth rate amplifies existing challenges associated with “big data.” In particular, two common motifs among the talks at Strata were the difficulties around: mechanics: the technical details of data collection, storage, and analysis semantics: extracting understandable and actionable information from the data deluge The talks The talks covered applications involving machine data from both physical systems (e.g., cars) and computer systems, and highlighted the growing fuzziness of the distinction between the two categories. Steven Gustafson and Parag Goradia of GE discussed the “industrial internet” of sensors monitoring heavy equipment such as airplane engines or manufacturing machinery. One anecdotal data point was that a single gas turbine sensor can generate 500 GB of data per day. Because of the physical scale of these applications, using data to drive even small efficiency improvements can have enormous impacts (e.g., in amounts of jet fuel saved). Moving from energy generation to distribution, Brett Sargent of LumaSense Technologies presented a startling perspective on the state of the power grid in the United States, stating that the average age of an electrical distribution substation in the United States is over 50 years, while its intended lifetime was only 40 years. His talk discussed remote sensing and data analysis for monitoring and troubleshooting this critical infrastructure. Ian Huston, Alexander Kagoshima, and Noelle Sio from Pivotal presented analyses of traffic data. The talk revealed both common-­sense (traffic moves more slowly during rush hour) and counterintuitive (disruptions in London tended to resolve more quickly when it was raining) findings. My presentation showed how we apply machine learning at Sumo Logic to help users navigate machine log data (e.g., software logs). The talk emphasized the effectiveness of combining human guidance with machine learning algorithms. Krishna Raj Raja and Balaji Parimi of Cloudphysics discussed how machine data can be applied to problems in data center management. One very interesting idea was to use data and modeling to predict how different configuration changes would affect data center performance. Conclusions The amount of data available for analysis is exploding, and we are still in the very early days of discovering how to best make use of it. It was great to hear about different application domains and novel techniques, and to discuss strategies and design patterns for getting the most out of data.

March 5, 2014

Blog

Beyond LogReduce: Refinement and personalization

LogReduce is a powerful feature unique to the Sumo Logic offering. At the click of a single button, the user can apply the Summarize function to their previous search results, distilling hundreds of thousands of unstructured log messages into a discernible set of underlying patterns. While this capability represents a significant advance in log analysis, we haven’t stopped there. One of the central principles of Sumo Logic is that, as a cloud-based log management service, we are uniquely positioned to deliver a superior service that learns and improves from user interactions with the system. In the case of LogReduce, we’ve added features that allow the system to learn better, more accurate patterns (refinement), and to learn which patterns a given user might find most relevant (personalization). Refinement Users have the ability to refine the automatically extracted signatures by splitting overly generalized patterns into finer-grained signatures or editing overly specific signatures to mark fields as wild cards. These modifications will then be remembered by the Sumo Logic system. As a result, all future queries run by users within the organization will be improved by returning higher-quality signatures. Personalization Personalized LogReduce helps users uncover the insights most important to them by capturing user feedback and using it to shape the ranking of the returned results. Users can promote or demote signatures to ensure that they do (or do not) appear at the top of Summarize results. Besides obeying this explicit feedback, Sumo Logic also uses this information to compute a relevance score which is used to rank signatures according to their content. These relevance profiles are individually tailored to each Sumo Logic user. For example, consider these Summarize query results: Since we haven’t given any feedback yet, their relevance scores are all equal to 5 (neutral) and they fall back to being ranked by count. Promotion Now, let’s pretend that we are in charge of ensuring that our database systems are functioning properly, so we promote one of the database-related signatures: We can see that the signature we have promoted has now been moved to the top of the results, with the maximum relevance score of 10. When we do future Summarize queries, that signature will continue to appear at the top of results (unless we later choose to undo its promotion by simply clicking the thumb again). The scores of the other two database-related signatures have increased as well, improving their rankings. This is because the content of these signatures is similar to the promoted database signature. This boost also will persist to future searches. Demotion This functionality works in the opposite direction as well. Continuing our running example, our intense focus on database management may mean that we find log messages about compute jobs to be distracting noise in our search results. We could try to “blacklist” these messages by putting Boolean negations in our original query string (e.g., “!comput*”), but this approach is not very practical or flexible. As we add more and more terms to our our search, it becomes increasingly likely that we will unintentionally filter out messages that are actually important to us. With Personalized LogReduce, we can simply demote one of the computation-related logs: This signature then drops to the bottom of the results. As with promotion, the relevance and ranking of the other similar computation-related signature has also been lowered, and this behavior will be persisted across other Summarize queries for this user. Implicit feedback Besides taking into account explicit user feedback (promotion and demotion), Summarize can also track and leverage the implicit signals present in user behavior. Specifically, when a user does a “View Details” drill-down into a particular signature to view the raw logs, this is also taken to be a weaker form of evidence to increase the relevance scores of related signatures. Conclusion The signature refinement and personalized relevance extensions to LogReduce enable the Sumo Logic service to learn from experience as users explore their log data. This kind of virtuous cycle holds great promise for helping users get from raw logs to business-critical insights in the quickest and easiest way possible, and we’re only getting started. Try these features out on your own logs at no cost with Sumo Logic Free and let us know what you think!

January 23, 2013

Blog

Scala at Sumo: type classes with a machine learning example

At Sumo Logic we use the Scala programming language, and we are always on the lookout for ways that we can leverage its features to write high-quality software. The type class pattern (an idea which comes from Haskell) provides a flexible mechanism for associating behaviors with types, and context bounds make this pattern particularly easy to express in Scala code. Using this technique, we can extend existing types with new functionalities without worrying about inheritance. In this post we introduce a motivating example and examine a few different ways to express our ideas in code before coming back to type classes, context bounds, and what they can do for us. Machine learning example: fixed length feature vector representation Consider a machine learning setting where we have a very simple binary linear classifier that classifies an item according to the sign of the inner product between a weight vector and a fixed-length feature vector representation of that item. For this classifier to work on elements of some type T, we need a way to convert an instance of T into an Array[Double] for our inner product calculation. Converting some object of interest into this more abstract representation is a very common pre-processing task for machine learning algorithms, and is often handled in the simplest possible way, which we introduce as Option #1. Option 1: require the caller to convert their data to Array[Double] before calling our code // This is the simplest possible approach, but it has a few disadvantages. First, we’re forcing the caller to do the conversions and bookkeeping around the association between the original objects of type T and their corresponding feature vectors. We are also disregarding the assistance of Scala’s type system – it could be possible to accidentally train the same classifier on incompatible types T1 and T2, because the classifier only knows about Array[Double]. Extending this idea, consider a NearestNeighbors class that uses feature vectors to return the k nearest neighbor training examples for a given test example. This class could potentially return feature vectors constructed from different original types than T, which may not be the desired behavior. Option 2: add a parameter T => Array[Double] that allows us to compute a feature vector from an instance of T // This has some advantages over Option #1: the caller no longer worries about the associations between instances of T and Array[Double]. We also gain the benefits of Scala’s type-checking, in that a LinearClassifier[T1] will complain about being asked to classify a T2. However this approach also introduces additional burdens: if we call other methods that require this functionality, we need to continue passing the function around, or we need to evaluate it and pass its result (an Array[Double]) around, essentially assuming the bookkeeping burden from Option #1. if we want to enrich the our feature representation with more information (e.g., an IndexedSeq[String] describing each feature), we would need to add another function or another output to this function. This could quickly become messy. Option 3: define a FeatureVector trait with a method features: Array[Double], adding the type bound T <: FeatureVector // Now we are getting somewhere – this design alleviates the two stated disadvantages of Option #2. However, this unfortunately assumes that the original definition of T must have been aware of the FeatureVector trait in order to extend it. Alternatively, the caller could define and instantiate some “wrapper” class that extends FeatureVector. One hazard of the wrapper approach is that it could become unwieldy if we want to further augment T with other functionality. We would either need separate wrappers for every combination of functionalities, or we would need to nest wrappers somehow. If we did decide to use wrappers, we could use Scala’s implicit machinery to automate the wrapping/unwrapping with a view bound T <% FeatureVector. Type classes The concept of sub-typing is usually associated with an is-a relation: the bound T <: FeatureVector asserts that our code can treat T as a FeatureVector in the sense that T will supply a features() method. A type class can be thought of as a has-a relation: we would like to ensure that type T has some associated type FeatureVector[T] capable of computing features(x: T) for any given instance x of type T. This design facilitates separation of concerns while still leveraging the type system: instead of having our item T know how to turn itself into an Array[Double], we can easily drop in different implementations of FeatureVector[T] that may compute features(x: T) in different ways. Also, by supplying separate T-parametrized types for different functionalities, we can mix and match behaviors onto our pre-existing type T more easily than we could with wrappers. A classic example from the Scala standard libraries is the Ordering[T] trait: Rather than requiring instances of T to supply an implementation of compare(x: T), we can rely on the existence of an Ordering[T] capable of computing compare(x: T, y: T). Option 4: define a type FeatureVector[T] and use the context bound T : FeatureVector // In this version, we have a type-parameterized trait FeatureVector[T], which we define as being able to compute features(x: T) for instances of our type T. Similar to a view bound, the context bound LinearClassifier[T : FeatureVector] asserts the existence of an implicit instance of type FeatureVector[T] in the scope of the caller, and then threading it through to all further calls. If no FeatureVector[T] is available, we will get an error at compile time. For a full runnable example using this approach, see this gist. Note that we could use this design without context bounds or implicits by explicitly passing around an instance of FeatureVector[T], similar to how we passed a function around in Option #2 above. Like Option #2 however, we would have the burden of passing an extra argument around, and this inconvenience might discourage us from factoring our code in this way. This has been a slightly circuitous journey, but I hope the running example and discussion of different alternatives has helped to illustrate the usefulness of the type class pattern and context bounds in Scala. See the references below for more details. REFERENCES Daniel C. Sobral’s epic StackOverflow answer is a great explanation of these ideas. He also has a related blog post. Another good StackOverflow discussion The excellent book Scala in Depth briefly discusses type classes in Scala. More discussion of type class concepts in the context of the very interesting scalaz library

September 24, 2012

Blog

Scala at Sumo: grab bag of tricks

As mentioned previously on this blog, at Sumo Logic we primarily develop our backend software using the Scala programming language. In this post we present some (very) miscellaneous Scala tidbits and snippets that showcase some interesting features we have taken advantage of in our own code. Laziness Scala borrows many ideas from the world of functional programming, including “laziness”. The essential idea of a lazy collection is that its contents are not computed until they are needed. For example, one could construct a lazy abstraction layer over expensive external resources like disk or database accesses. This abstraction can then be created and passed around without incurring the expense of populating it until the values are actually read out by some other part of the program. This post dives into the performance implications of laziness, here we present a very contrived example that illustrates the basic concept. // // ]]> In the first version, a strict Range is created and mapped over. The map is eagerly evaluated, simultaneously populating the mutant ListBuffer with values and producing a sequence of Boolean values. The exists method is then called with an identity argument, returning true after reading the second value in the sequence, which is true because 1 > 0. However the fact that our ListBuffer contains the value (0,1,2,3) tells us that the map was computed over all elements of the Range; this is because that entire computation happens “before” exists begins consuming the values. // // ]]> In the second version, we call view on the strict Range to create a lazy sequence. The mapping function is then only called when elements of this sequence are consumed by the exists method. Once exists hits the true value, it short-circuits and returns true without consuming the rest of the sequence. This is why we see 0 and 1 only in the ListBuffer. The map computation was only evaluated on an “as-needed” basis to supply values for exists to consume, and exists only needed to consume 0 and 1 before terminating with a value of true. Note that we are using side-effects and a mutable data structure within the mapping function here for illustrative purposes only! In actual code this could easily introduce nasty bugs, as demonstrated by the mildly surprising result of our little experiment. Regex unapply magic When using a regular expression to parse values out of a string for further processing, it is fairly common to see code resembling the following: // // ]]> Conveniently, the Regex class can be combined with Scala pattern-matching machinery to directly bind captured groups to local variables in one shot: // // ]]> This specific instance is a particularly nice example of the general usefulness of Scala case classes, pattern matching, and extractors. If you are interested, these mechanisms are a good place to start digging deeper into how this trick works. Handling nested Java null checks In Scala, the use of null is generally avoided upon due to the availability of the much nicer Option. However, occasionally we need to follow a nested series of calls against null-riddled Java code where if any value in the chain returns null, we would like to return some default value. In this case we can combine a common Scala trick (looping over Option values with for-comprehensions) with the fact that Option can be used as a wrapper of potentially-null values. For the simple case of nested-access with potential lurking nulls, this snippet is much easier on the eyes than an equivalent set of nested if-thens: // // ]]> Executing this App yields the desired behavior: Any null in the chain defaults to authority -1 -1 -1 -1 Complete non-null chain yields correct authority 76

July 25, 2012

Blog

Sumo Logic Logging Tips, Part 1 of N

The Sumo Logic Log Management and Analytics Service allows any organization to easily store, search, and analyze the log data produced by its applications. But even with the very best tools, the quality of the insights uncovered will depend on the raw materials: the logs themselves. At Sumo Logic we make extensive use of our own product for developing, debugging, and troubleshooting our own product, which is itself a distributed and complex cloud-based SaaS application. This “dogfooding” process has helped enormously to improve the product, both in terms of features and maturity. It has also taught us a few things about writing logging into our own application. In this open-ended series we plan to share some logging patterns and strategies that we have found to be helpful. General Logging Tips When in doubt, log it This one is quite simple: you won’t know about the things you don’t log. While pairing every single line of code with a corresponding log statement may be taking things too far, on the balance we have found that it is more common to regret something you didn’t log than something you did. This advice may be a bit self-serving, but when an elusive bug rears its ugly head on your production system at 3 A.M. you will be glad you logged that crucial extra piece of information. Use logging tools There are a variety of logging libraries that can give you a high degree of control over different logging levels, output locations, and formats. Log4j is a good example. Periodically log the progress of long-running operations Many data processing tasks involve a “bulk processor” that chews through many individual pieces of data, one at a time. Logging every item processed may be overly verbose, but logging at regular intervals (say, every second or every 10 seconds) may be quite helpful: // // ]]> If our logs contain a regular stream of these messages indicating that 5000 items were processed per second and we observe an unexplained drop to 5 items per second, this may be an excellent clue that something unusual is happening. Log for readability and parsability Logging statements can be directed at many audiences: operational personnel maintaining the system, other developers reading your code, or computer programs parsing out structured data fields. Each of these audiences will appreciate logs that are self-documenting and easily consumed. GOOD Records request received – user=foobaz@bar.com, userId=77A2063948571CC6, requestId=5B54D5E1D59E9850, recordsRequested=50 NOT AS GOOD foobaz@bar.com | 77A2063948571CC6 | 5B54D5E1D59E9850 | 50 Logging for Distributed Systems A common characteristic of modern distributed applications is that any single event touches many different system components which themselves may reside on many different machines. These systems can be especially challenging to build, operate, and maintain. Problems may manifest only under particular circumstances, and often application logs act as the “black box”, containing the only evidence available to uncover the root causes of system misbehavior. Given this setting, it is of the utmost importance to ensure that your logging is sufficient to help you understand your system and keep things running smoothly. Use unique IDs to track an event across multiple system components Unique IDs (such as the hexademical representation of a random number) can be helpful to pinpoint relevant logs when investigating a problem. For example, when a user logs into your website, you can associate that visit with a unique session ID that is embedded in all logs resulting from this visit. Searching your logs for this session ID should allow you to easily to reconstruct all of the system events associated with this user’s session. This pattern can also be nested for finer-grained searchability. For example, during a visit to your website a user may access their account preferences. When the frontend of your site issues a internal request for this information, this request itself can be associated with both the user session ID and a new request ID. If initial troubleshooting of the user session reveals that this particular request was unsuccessful, this request ID can be then used to drill deeper into why the request failed. A reasonable rule of thumb is to use a new ID for any event, transaction, or session that looks like “one thing” to other components of the system. Log each side of a remote message or request The interfaces between different components, especially those residing on different machines, can be a rich source of potential problems. Your message may be delivered to the wrong recipient, the remote recipient may fail to process your request, your message may be eaten by a grue, and so on. One strategy for dealing with this issue is to put matching send/receive logging statements on each side of any remote message or request. A search for these message pairs that returns an odd number of logs would then be a red flag indicating that a sent message was never received, was received twice, or suffered some other mischief.