# Automated Infrastructure Problem Discovery using Sumo Logic and Chef

The Chef-Sumo integration, which can be found on the Sumo Logic Developers open source page, is a way to inch towards something like automated “infrastructure problem discovery.” By “problem”, I mean services not working as intended. That could be a recurring incident, a persistent performance problem, or a security threat. Gathering useful statistics on the interaction between your code and your “Infrastructure-as-Code” will make it easier to discover these problems without intense rote manual querying.

For example, a basic problem that you might discover is that something is in the logs where nothing should be. This could be a one-off security problem, or a problem across the entire service. Monitoring and alerting from kernel to user will tell you quickly.

The Chef-Sumo combination is a quick and scalable way to build this solution. Chef’s verbose Infrastructure-as-Code is powerful, allowing for service description and discovery, e.g. automated AWS discovery and deployment. Sumo pares down Chef’s verbose output into dashboardable, queryable SaaS, and correlates it with other service logs, simultaneously widening coverage and narrowing focus.

To Chef, Sumo is yet another agent to provision; rollout is no more complicated than anything else in a cookbook. To Sumo, Chef is yet another log stream; once provisioned, the Chef server’s logs are parsed into sources and registered in Sumo.

## Types of Problems

This narrowing of focus is critical. Since storage is cheap and logging services want lock-in, the instinct in DevOps is to hoard information. Too often, teams suffer from a cargo cult mentality where the data’s “bigness” is all that matters. In practice, this usually means collecting terabytes of data that are unorganized, poorly described and not directed towards problem-solving.

It’s much easier to find needles in haystacks with magnets. With infrastructure logs, that means finding literal anomalies, like an unknown user with privileged access. Or sometimes it means finding pattern mismatches or deviations from known benchmarks, like a flood of pings from a proxy.

## Problem Solving on Rails

Sumo has two out-of-the-box query tools that can make the problem-solving process simpler: Outlier and Anomaly. These are part of Sumo’s “log reduce” family. Outlier tracks the moving average and standard deviation of a value, allowing for alerts and reports when the difference between the value and its moving average exceeds some multiple of the standard deviation.
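To make the moving-average idea concrete, here is a minimal Python sketch of that kind of check. It is not Sumo's implementation; the function name `rolling_outliers`, the `window` and `threshold` parameters, and the sample latencies are all assumptions chosen for illustration.

```python
from collections import deque
from math import sqrt


def rolling_outliers(values, window=5, threshold=3.0):
    """Flag points that differ from the trailing-window mean by more than
    `threshold` standard deviations. Hypothetical helper, not a Sumo API."""
    history = deque(maxlen=window)  # the most recent `window` values
    flagged = []
    for index, value in enumerate(values):
        # Only judge a point once a full window of history exists.
        if len(history) == window:
            mean = sum(history) / window
            variance = sum((x - mean) ** 2 for x in history) / window
            stddev = sqrt(variance)
            if abs(value - mean) > threshold * stddev:
                flagged.append((index, value))
        history.append(value)
    return flagged


# Made-up per-minute response times with one obvious spike.
latencies = [10, 11, 9, 10, 10, 11, 50, 10, 9, 11]
print(rolling_outliers(latencies))  # → [(6, 50)]
```

Note that in this sketch the spike itself inflates the window's standard deviation for the next few points, so a smaller anomaly right after a large one could go unflagged. That trade-off is part of choosing a window and threshold.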

Here’s an example query for a simple AWS alert:

    source=AWS_inFooTown
    | parse "* * *: * * * * * * * * \"* *://*:*/* HTTP/" as server, port, backend
    | timeslice by 1m
    | avg(server) as OKserver, avg(port) as OKport, avg(backend) as OKbackend by _timeslice
    | (OKserver+OKport+OKbackend) as total_time_OK
    | fields _timeslice, total_time_OK
    | outlier total_time_OK

In other search tools, this would require indexing and forwarding the sources, setting up stdev searches in separate summary indexes, and collecting them on a manually written average. Not only does that take a lot of time and effort, it requires knowing where to look. While you will still need to parse each service into your own simple language, not having to learn where to deploy this on every new cookbook is a huge time-saver.

Anomaly is also a huge time-saver, and comes with some pre-built templates for RED/YELLOW/GREEN problems. It detects literal anomalies based on some machine learning logic. Check here to learn more about the logic’s internals.

## How to Look before You Leap

While it’s all hyperloops and SaaS in theory, no configuration management and monitoring rollout is all that simple, especially when the question is “what should monitor what” and the rollout is of a Chef-provisioning-Sumo-monitoring-Chef process.

For example, sometimes the “wrong” source is monitored when Chef is provisioning applications that each consume multiple sources. The simplest way to avoid the confusion at the source is to avoid arrays completely when defining Sumo sources. Stick with hashes for all sources, and Chef will merge based on the hash-defined rules. Read CodeRanger’s excellent explanation of this fix here.
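The difference between array-defined and hash-defined sources can be sketched with a simplified merge model in Python. This is not Chef's actual merge code, and Chef's real handling of arrays depends on attribute precedence. The sketch assumes the simplest case, where hashes merge key by key and any other value, arrays included, is replaced wholesale. The `deep_merge` helper, the source names, and the paths are hypothetical.

```python
def deep_merge(base, override):
    """Simplified model of attribute merging (an assumption, not Chef's code):
    dicts merge key by key; any other value, lists included, is replaced."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Array-style definitions: the second cookbook's list replaces the first.
base_list = {"sources": [{"name": "syslog", "path": "/var/log/syslog"}]}
app_list = {"sources": [{"name": "nginx", "path": "/var/log/nginx/access.log"}]}

# Hash-style definitions: each source is keyed by name, so both survive.
base_hash = {"sources": {"syslog": {"path": "/var/log/syslog"}}}
app_hash = {"sources": {"nginx": {"path": "/var/log/nginx/access.log"}}}

list_result = deep_merge(base_list, app_list)
hash_result = deep_merge(base_hash, app_hash)

print([s["name"] for s in list_result["sources"]])  # → ['nginx']
print(sorted(hash_result["sources"]))               # → ['nginx', 'syslog']
```

Under this model the array version silently drops the `syslog` source, while the hash version keeps one entry per source name. That is the kind of "wrong source" confusion that keyed definitions avoid.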

This is a pretty tedious solution, however, and the good folks at Chef and Sumo have come up with something that’s a lot more elegant: custom resources in Chef, with directives in the JSON configuration. This avoids source-by-source editing, and is in line with Sumo’s JSON standards.

To get started with this approach, take a look at the custom resources debate on GitHub, and read the source for Kennon Kwok’s cookbook for Sumo collectors.

Editor’s Note: Automated Infrastructure Problem Discovery using Sumo Logic and Chef is published by the Sumo Logic DevOps Community. If you’d like to learn more or contribute, visit devops.sumologic.com. Also, be sure to check out the Sumo Logic Developers Open Source page for free tools, APIs and example code that will enable you to monitor and troubleshoot applications from code to production.

Alex Entrekin served on the executive staff of Cloudshare where he was primarily responsible for advanced analytics and monitoring systems. His work extending Splunk into actionable user profiling was featured at VMworld: “How a Cloud Computing Provider Reached the Holy Grail of Visibility.”

Alex is currently an attorney, researcher and writer based in Santa Barbara, CA. He holds a J.D. from the UCLA School of Law.
