Alex Entrekin

Posts by Alex Entrekin

Blog

Working With Field Extraction Rules in Sumo Logic

Field extraction rules compress queries into short phrases, filter out unwanted fields and drastically speed up query times. Fifty at a time can be stored and used in what Sumo Logic calls a “parser library.” These rules are a must once you move from simple collection to correlation and dashboarding. Since they tailor searches prior to source ingestion, the rules never collect unwanted fields, which can drastically speed up query times. Correlations and dashboards require many queries to load simultaneously, so the speed impact can be significant. Setting Up Field Extraction Rules The Sumo Logic team has written some templates to help you get started with common logs like IIS and Apache. While you will need to edit them, they take a lot of the pain out of writing regex parsers from scratch (phew). And if you write your own reusable parsers, save them as a template so you can help yourself to them later. To get started, find a frequently used query snippet. The best candidates are queries that (1) are used frequently and (2) take a while to load. These might pull from dense sources (like iis) or just crawl back over long periods of time. You can also look at de facto high usage queries saved in dashboards, alerts and pinned searches. Once you have the query, first take a look at what the source pulls without any filters. This is important both to ensure that you collect what’s needed, and that you don’t include anything that will throw off the rules. Since rules are “all or nothing,” only include persistent fields. In the example below, I am pulling from a safend collector. Here’s the output from a collector on a USB: 2014-10-09T15:12:33.912408-04:00 safend.host.com [Safend Data Protection] File Logging Alert details: User: user@user.com, Computer: computer.host.com, Operating System: Windows 7, Client GMT: 10/9/2014 7:12:33 PM, Client Local Time: 10/9/2014 3:12:33 PM, Server Time: 10/9/2014 7:12:33 PM, Group: , Policy: Safend for Cuomer Default Policy, Device Description: Disk drive, Device Info: SanDisk Cruzer Pattern USB Device, Port: USB, Device Type: Removable Storage Devices, Vendor: 0781, Model: 550A, Distinct ID: 3485320307908660, Details: , File Name: F:\SOME_FILE_NAME, File Type: PDF, File Size: 35607, Created: 10/9/2014 7:12:33 PM, Modified: 10/9/2014 7:12:34 PM, Action: Write There are certainly reasons to collect all of this (and note that the rule won’t limit collection on the source collector) but I only want to analyze a few parameters. To get it just right, filter it in the Field Extraction panel: Below is the simple Parse Expression I used. Note that more parsing tools are supported that can grep nearly anything that a regular query can. But in this case, I just used parse and nodrop. Nodrop tells the query to pass results along even if the query returns nothing from that field. In this case, it acts like an OR function that concatenates the first three parse functions along with the last one. So if ‘parse regex “Action…”‘ returns nothing, nodrop commands the query to “not drop”, return a blank, and in this case, continue to the next function. Remember that Field Extraction Rules are “all or nothing” with respect to fields. If you add a field that doesn’t exist, nodrop will not help since it only works within existing fields. Use Field Extraction Rules to Speed Up Dashboard Load Time The above example would be a good underlying rule for a larger profiling dashboard. It returns file information only—Action on the File, File ID, File Size, and Type. Another extraction rule might return only User and User Activities, while yet another might include only host server actions. These rules can then be surfaced as dashboard panes, combined into profiles and easily edited. They load only the fields extracted, significantly improving load time, and the modularity of the rules provides a built-in library that makes editing and sharing useful snippets much simpler. Working With Field Extraction Rules in Sumo Logic is published by the Sumo Logic DevOps Community. If you’d like to learn more or contribute, visit devops.sumologic.com. Also, be sure to check out Sumo Logic Developers for free tools and code that will enable you to monitor and troubleshoot applications from code to production. About the Author Alex Entrekin served on the executive staff of Cloudshare where he was primarily responsible for advanced analytics and monitoring systems. His work extending Splunk into actionable user profiling was featured at VMworld: “How a Cloud Computing Provider Reached the Holy Grail of Visibility.” Alex is currently an attorney, researcher and writer based in Santa Barbara, CA. He holds a J.D. from the UCLA School of Law.

Blog

Monitoring and Analyzing Puppet Logs With Sumo Logic

The top Puppet question on ServerFault is How can the little guys effectively learn and use Puppet? Learning Puppet requires learning a DSL that’s thorny enough that the final step in many migrations is to buy Puppet training classes for team. While there is no getting around learning the Puppet DSL, the “little guys” can be more effective if they avoid extending Puppet beyond the realm of configuration management (CM). It can be tempting to extend Puppet to become a monitoring hub, a CI spoke, or many other things. After all, if it’s not in Puppet, it won’t be in your environment, so why not build on that powerful connectedness? The cons of Puppet for log analysis and monitoring Here’s one anecdote from scriptcrafty explaining some of the problems with extending beyond CM: Centralized logic where none is required, Weird DSLs and templating languages with convoluted error messages, Deployment and configuration logic disembodied from the applications that required them and written by people who have no idea what the application requires, Weird configuration dependencies that are completely untestable in a development environment, Broken secrets/token management and the heroic workarounds, Divergent and separate pipelines for development and production environments even though the whole point of these tools is to make things re-usable, and so on and so forth. Any environment complex enough to need Puppet is already too complex to be analyzed with bash and PuppetDB queries. These tools work well for spot investigation and break/fix, but do not extend easily into monitoring and analysis. I’ll use “borrow-time” as an example. To paraphrase the Puppet analytics team, “borrow-time” is the amount of time that the JRuby instances handling Puppet tasks spend on each request. If this number gets high, then there may be something unusually expensive going on. For instance, when the “borrow-timeout-count” metric is > 0, some build request has gone unfilled. It’s tempting to think that the problem is solved by setting a “borrow-timeout-count” trigger in PuppetDB for >0. After all, just about any scripting language will do, and then analysis can be done in the PuppetDB logs. Puppet even has some guides for this in Puppet Server – What’s Going on in There? Monitoring a tool with only its own suggested metrics is not just a convenience sample, but one that is also blind to the problem at hand—uptime and consistency across an inconsistent and complex environment. Knowing that some request has gone unhandled is a good starting point. A closer look at Puppet logs and metrics But look at everything else that Puppet shows when pulling metrics:is trying approach is it runs a risk so let’s look at what one “borrow-time” metrics pull brings up: In the Puppet server: pe-jruby-metrics->status->experimental->metrics "metrics": { "average-borrow-time": 75, "average-free-jrubies": 1.86, "average-lock-held-time": 0, "average-lock-wait-time": 0, "average-requested-jrubies": 1.8959058782351241, "average-wait-time": 77, "borrow-count": 10302, "borrow-retry-count": 0, "borrow-timeout-count": 0, "borrowed-instances": [ { "duration-millis": 2888, "reason": { "request": { "request-method": "post", "route-id": "puppet-v3-catalog-/*/", "uri": "/puppet/v3/catalog/foo.puppetlabs.net" } }, }, ...], "num-free-jrubies": 0, "num-jrubies": 4, "num-pool-locks": 0, "requested-count": 10305, "requested-instances": [ { "duration-millis": 134, "reason": { "request": { "request-method": "get", "route-id": "puppet-v3-file_metadata-/*/", "uri": "/puppet/v3/file_metadata/modules/catalog_zero16/catalog_zero16_impl83.txt" } }, }, ...], "return-count": 10298 } If you are lucky, you’ll have an intuitive feeling about the issue before asking whether the retry count is too high, or if it was only a problem in a certain geo. If the problem is severe, you won’t have time to check the common errors (here and here); you’ll want context. How Sumo Logic brings context to Puppet logs Adding context—such as timeseries, geo, tool, and user—is the primary reason to use Sumo for Puppet monitoring and analysis. Here is an overly simplified example Sumo Logic query where jruby borrowing is compared with the Apache log 2**/3**/4** errors: _sourceName=*jruby-metrics* AND _sourceCategory=*apache* | parse using public/apache/access | if(status_code matches "2*", 1, 0) as successes | if(status_code matches "5*", 1, 0) as server_errors | if(status_code matches "4*", 1, 0) as client_errors | if (num-free-jrubies matches “0”,1,0) as borrowrequired | timeslice by 1d | sum(successes) as successes, sum(client_errors) as client_errors, sum(server_errors) as server_errors sum(borrowrequired) as borrowed_jrubies by _timeslice Centralizing monitoring across the environment means not only querying and joining siloed data, but also allowing for smarter analysis. By appending an “outlier” query to something like the above, you can set baselines and spot trends in your environment instead of guessing and then querying. | timeslice 15d | max(borrowed_jrubies) as borrowed_jrubies by _timeslice | outlier response_time source: help.sumologic.com/Search/Search_Query_Language/Search_Operators/outlier About the Author Alex Entrekin served on the executive staff of Cloudshare where he was primarily responsible for advanced analytics and monitoring systems. His work extending Splunk into actionable user profiling was featured at VMworld: “How a Cloud Computing Provider Reached the Holy Grail of Visibility.” Alex is currently an attorney, researcher and writer based in Santa Barbara, CA. He holds a J.D. from the UCLA School of Law. Monitoring and Analyzing Puppet Logs With Sumo Logic is published by the Sumo Logic DevOps Community. If you’d like to learn more or contribute, visit devops.sumologic.com. Also, be sure to check out Sumo Logic Developers for free tools and code that will enable you to monitor and troubleshoot applications from code to production.

Blog

Building Software Release Cycle Health Dashboards in Sumo Logic

Gauging the health and productivity of a software release cycle is notoriously difficult. Atomic age metrics like “man months” and LOCs may be discredited, but they are too often a reflexive response for DevOps problems. Instead of understanding the cycle itself, management may hire a “DevOps expert” or homebrew one by taking someone off their project and focusing them on “automation.” Or they might add man months and LOCs with more well-intentioned end-to-end tests. What could go wrong? Below, I’ve compiled some metrics and tips for building a release cycle health dashboard using Sumo Logic. Measuring Your Software Release Cycle Speed Jez Humble points to some evidence that delivering faster not only shortens feedback but also makes people happier, even on deployment days. Regardless, shorter feedback cycles do tend to bring in more user involvement in the release, resulting in more useful features and fewer bugs. Even if you are not pushing for only faster releases, you will still need to allocate resources between functions and services. Measuring deployment speed will help. Change lead time: Time between ticket accepted and ticket closed. Change frequency: Time between deployments. Recovery Time: Time between a severe incident and resolution. To get this data to Sumo Logic, ingest your SCM and incident management tools. While not typical log streams, the tags and timestamps are necessary to tracking the pipeline. You can return deployment data from your release management tools. Tracking Teams, Services with the Github App To avoid averaging out insights, Separately tag services and teams in each of the tests above. For example, if a user logic group works on identities and billing, track billing and identity services separately. For Github users, there is an easy solution, the Sumo Logic App for Github, which is currently available in preview. It generates pre-built dashboards in common monitoring areas like security, commit/pipeline and issues. More importantly, each panel provides queries that can be repurposed for separately tagged, team-specific panels. Reusing these queries allows you to build clear pipeline visualizations very quickly. For example, let’s build a “UI” team change frequency panel. First, create a lookup table designating UserTeams. Pin it to saved queries as it can be used across the dashboard to break out teams: "id","user","email","team", "1","Joe","joe@example.com","UI" "2","John","john@example.com","UI" "3","Susan","susan@example.com","UI" "4","John","another_john@example.com","backspace" "5","John","yet_another_john@example.com","backspace" Next, copy the “Pull Requests by Repository” query from the panel: _sourceCategory=github_logs and ( "opened" or "closed" or "reopened" ) | json "action", "issue.id", "issue.number", "issue.title" , "issue.state", "issue.created_at", "issue.updated_at", "issue.closed_at", "issue.body", "issue.user.login", "issue.url", "repository.name", "repository.open_issues_count" as action, issue_ID, issue_num, issue_title, state, createdAt, updatedAt, closedAt, body, user, url, repo_name, repoOpenIssueCnt | count by action,repo_name | where action != "assigned" | transpose row repo_name column action Then, pipe in the team identifier with a lookup command: _sourceCategory=github_logs and ( "opened" or "closed" or "reopened" ) | json "action", "issue.id", "issue.number", "issue.title" , "issue.state", "issue.created_at", "issue.updated_at", "issue.closed_at", "issue.body", "issue.user.login", "issue.url", "repository.name", "repository.open_issues_count" as action, issue_ID, issue_num, issue_title, state, createdAt, updatedAt, closedAt, body, user, url, repo_name, repoOpenIssueCnt | lookup team from https://toplevelurlwithlookups.com/UserTeams.csv on user=user | count by action,repo_name, team | where action != "assigned" | transpose row repo_name team column action This resulting query tracks commits — open, closed or reopened — by team. The visualization can be controlled on the panel editor, and the lookup can be easily piped to other queries to break the pipeline by teams. Don’t Forget User Experience It may seem out of scope to measure user experience alongside a deployment schedule and recovery time, but it’s a release cycle health dashboard, and nothing is a better measure of a release cycle’s health than user satisfaction. There are two standards worth including: Apdex and Net Promoter Score. Apdex: measures application performance on a 0-1 satisfaction scale calculated by… If you want to build an Apdex solely in Sumo Logic, you could read through this blog post and use the new Metrics feature in Sumo Logic. This is a set of numeric metrics tools for performance analysis. It will allow you to set, then tune satisfaction and tolerating levels without resorting to a third party tool. Net Promoter Score: How likely is it that you would recommend our service to a friend or colleague? This one-question survey correlates with user satisfaction, is simple to embed anywhere in an application or marketing channel, and can easily be forwarded to a Sumo Logic dashboard through a webhook. When visualizing these UX metrics, do not use the single numerical callout. Take advantage of Sumo Logic’s time-series capabilities by tracking a line chart with standard deviation. Over time, this will give you an expected range of satisfaction and visual cues of spikes in dissatisfaction that sit on the same timeline as your release cycle. Controlling the Release Cycle Logging Deluge A release cycle has a few dimensions that involve multiple sources, which allow you to query endlessly. For example, speed requires ticketing, CI and deployment logs. Crawling all the logs in these sources can quickly add up to TBs of data. That’s great fun for ad hoc queries, but streams like comment text are not necessary for a process health dashboard, and their verbosity can result in slow dashboard load times and costly index overruns. To avoid this, block this and other unnecessary data by partitioning sources in Sumo Logic’s index tailoring menus. You can also speed up the dashboard by scheduling your underlying query runs for once a day. A health dashboard doesn’t send alerts, so it doesn’t need to be running in real-time. More Resources: How Do you Measure Team Success? On the Care and Feeding of Feedback Cycles Martin Fowler’s Test Pyramid Just Say No to More End to End Tests Quantifying Devops Capability: It’s Important to Keep CALMS 9 Metrics DevOps Teams Track Building Software Release Cycle Health Dashboards in Sumo Logic is published by the Sumo Logic DevOps Community. If you’d like to learn more or contribute, visit devops.sumologic.com. Also, be sure to check out Sumo Logic Developers for free tools and code that will enable you to monitor and troubleshoot applications from code to production. About the Author Alex Entrekin served on the executive staff of Cloudshare where he was primarily responsible for advanced analytics and monitoring systems. His work extending Splunk into actionable user profiling was featured at VMworld: “How a Cloud Computing Provider Reached the Holy Grail of Visibility.” Alex is currently an attorney, researcher and writer based in Santa Barbara, CA. He holds a J.D. from the UCLA School of Law.

Blog

Automated Infrastructure Problem Discovery using Sumo Logic and Chef

The Chef-Sumo integration, which can be found on the Sumo Logic Developers open source page, is a way to inch towards something like automated “infrastructure problem discovery.” By “problem”, I mean services not working as intended. That could be a recurring incident, a persistent performance problem, or a security threat. Gathering useful statistics on the interaction between your code and your “Infrastructure-as-Code” will make it easier to discover these problems without intense rote manual querying. For example, a basic problem that you might discover is that something is in the logs where nothing should be. This could be a one-off security problem, or a problem across the entire service. Monitoring and alerting from kernel to user will tell you quickly. The Chef-Sumo combination is a quick and scalable way to build this solution. Chef’s verbose Infrastructure-as-Code is powerful, allowing for service description and discovery, e.g. automated AWS discovery and deployment. Sumo pares down Chef’s verbose output into dashboardable, queryable SaaS, and correlates it with other service logs, simultaneously widening coverage and narrowing focus. To Chef, Sumo is yet another agent to provision; rollout is no more complicated than anything else in a cookbook. To Sumo, Chef is yet another log stream; once provisioned, the Chef server is parsed into sources and registered in Sumo. Types of Problems This focus is critical. Since storage is cheap and logging services want lock-in, the instinct in DevOps is to hoard information. Too often, teams suffer from a cargo cult mentality where the data’s “bigness” is all that matters. In practice, this usually means collecting TBs of data that are unorganized, poorly described and not directed towards problem-solving. It’s much easier to find needles in haystacks with magnets. With infrastructure logs, that means finding literal anomalies, like an unknown user with privileged access. Or sometimes it means finding pattern mismatches or deviations from known benchmarks, like a flood of pings from a proxy. Problem Solving on Rails Sumo has two out-of-the-box query tools that can make the problem-solving process simpler— Outlier and Anomaly. These are part of Sumo’s “log reduce” family. Outlier tracks the moving average and standard deviation of a value, allowing for alerts and reports when the difference between the value exceeds the mean by some multiple of the standard deviation. Here’s an example query for a simple AWS alert: | source=AWS_inFooTown | parse “* * *: * * * * * * * * \ “* *://*:*/* HTTP/” as server, port, backend | timeslice by 1m | avg(server) as OKserver, avg(port) as OKport, avg(backend) as OKbackend by _timeslice | (OKserver+OKport+OKbackend) as total_time_OK | fields _timeslice, total_time_OK | outlier total_time_OK In other search tools, this would require indexing and forwarding the sources, setting up stdev searches in separate summary indexes, and collecting them on a manually written average. Not only does that take a lot of time and effort, it requires knowing where to look. While you will still need to parse each service into your own simple language, not having to learn where to deploy this on every new cookbook is a huge time-saver. Anomaly is also a huge time-saver, and comes with some pre-built templates for RED/YELLOW/GREEN problems. It detects literal anomalies based on some machine learning logic. Check here to learn more about the logic’s internals. How to Look before You Leap While it’s all hyperloops and SaaS in theory, no configuration management and monitoring rollout is all that simple, especially when the question is “what should monitor what” and the rollout is of a Chef-provisioning-Sumo-monitoring-Chef process. For example, sometimes the “wrong” source is monitored when Chef is provisioning applications that each consume multiple sources. The simplest way to avoid the confusion at the source is to avoid arrays completely when defining Sumo. Stick with hashes for all sources, and Chef will merge based on the hash-defined rules. Read CodeRanger’s excellent explanation of this fix here. This is a pretty tedious solution, however, and the good folks at Chef and Sumo have come up with something that’s a lot more elegant: custom resources in Chef, with directives in the JSON configuration. This avoids source-by-source editing, and is in line with Sumo’s JSON standards. To get started with this approach, take a look at the custom resources debate on GitHub, and read the source for Kennon Kwok’s cookbook for Sumo collectors. Editor’s Note: Automated Infrastructure Problem Discovery using Sumo Logic and Chef is published by the Sumo Logic DevOps Community. If you’d like to learn more or contribute, visit devops.sumologic.com. Also, be sure to check out the Sumo Logic Developers Open Source page for free tools, API’s and example code that will enable you to monitor and troubleshoot applications from code to production. About the Author Alex Entrekin served on the executive staff of Cloudshare where he was primarily responsible for advanced analytics and monitoring systems. His work extending Splunk into actionable user profiling was featured at VMworld: “How a Cloud Computing Provider Reached the Holy Grail of Visibility.” Alex is currently an attorney, researcher and writer based in Santa Barbara, CA. He holds a J.D. from the UCLA School of Law. Resources: LearnChef on Custom Resources CodeRanger on Solving Attribute Merging Problems in Chef with Hashes Gist of Custom Chef Resources in Sumo by Ken Kwok More Ken Kwok on Custom Resources in Chef & Sumo Using Multiple JSON files to Configure Sumo Sources