
October 5, 2016 By Alex Entrekin

Monitoring and Analyzing Puppet Logs With Sumo Logic

The top Puppet question on ServerFault is "How can the little guys effectively learn and use Puppet?" Learning Puppet requires learning a DSL that's thorny enough that the final step in many migrations is to buy Puppet training classes for the team.

While there is no getting around learning the Puppet DSL, the “little guys” can be more effective if they avoid extending Puppet beyond the realm of configuration management (CM). It can be tempting to extend Puppet to become a monitoring hub, a CI spoke, or many other things. After all, if it’s not in Puppet, it won’t be in your environment, so why not build on that powerful connectedness?

The cons of Puppet for log analysis and monitoring

Here’s one anecdote from scriptcrafty explaining some of the problems with extending beyond CM:

Centralized logic where none is required; weird DSLs and templating languages with convoluted error messages; deployment and configuration logic disembodied from the applications that required them, written by people who have no idea what the application requires; weird configuration dependencies that are completely untestable in a development environment; broken secrets/token management and the heroic workarounds; divergent and separate pipelines for development and production environments even though the whole point of these tools is to make things re-usable; and so on and so forth.

Any environment complex enough to need Puppet is already too complex to be analyzed with bash and PuppetDB queries. These tools work well for spot investigation and break/fix, but do not extend easily into monitoring and analysis.
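For spot investigation, a one-off PuppetDB query is often all you need. As a minimal sketch (assuming PuppetDB's query API v4 on its default port 8080; the hostname and certname below are hypothetical, for illustration only):

```python
import json
from urllib.parse import urlencode

def pdb_nodes_url(host, query, port=8080):
    """Build a PuppetDB v4 query URL for the /nodes endpoint.

    `query` is a PuppetDB AST query, e.g. ["=", "certname", "foo.example.com"].
    """
    params = urlencode({"query": json.dumps(query)})
    return "http://{}:{}/pdb/query/v4/nodes?{}".format(host, port, params)

# Hypothetical host and certname:
url = pdb_nodes_url("puppetdb.example.com", ["=", "certname", "foo.example.com"])
# Fetching the result would then be, e.g., urllib.request.urlopen(url).read()
```

This style works fine for break/fix, but each question requires hand-writing a new query, which is exactly why it does not scale into monitoring.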

I'll use "borrow-time" as an example. To paraphrase the Puppet analytics team, "borrow-time" is the amount of time that the JRuby instances handling Puppet tasks spend on each request. If this number gets high, something unusually expensive may be going on. For instance, when the "borrow-timeout-count" metric is > 0, some request for a JRuby instance has gone unfilled.

It's tempting to think the problem is solved by setting a trigger on "borrow-timeout-count" > 0 in PuppetDB. After all, just about any scripting language will do, and the analysis can then be done in the PuppetDB logs. Puppet even has a guide for this: Puppet Server – What's Going on in There?
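A sketch of that trigger logic, assuming the metrics dict has already been fetched from Puppet Server's status API (e.g. GET https://<server>:8140/status/v1/services?level=debug; the function name is a hypothetical stand-in, not part of any Puppet API):

```python
def should_alert(metrics):
    """Return True when any borrow request has timed out unfilled.

    `metrics` is the parsed pe-jruby-metrics "metrics" object.
    """
    return metrics.get("borrow-timeout-count", 0) > 0

sample = {"borrow-count": 10302, "borrow-timeout-count": 0}
should_alert(sample)  # False: no borrow request has gone unfilled
```

This is the whole trigger, and that is precisely the problem the next section describes: it fires on one number and discards everything else in the payload.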

Monitoring a tool with only its own suggested metrics is not just a convenience sample; it is also blind to the actual problem at hand: uptime and consistency across an inconsistent and complex environment. Knowing that some request has gone unhandled is only a starting point.

A closer look at Puppet logs and metrics

But this approach runs a risk: it ignores everything else that Puppet shows when pulling metrics. Look at what one "borrow-time" metrics pull brings up on the Puppet server, under pe-jruby-metrics->status->experimental->metrics:


"metrics": {
    "average-borrow-time": 75,
    "average-free-jrubies": 1.86,
    "average-lock-held-time": 0,
    "average-lock-wait-time": 0,
    "average-requested-jrubies": 1.8959058782351241,
    "average-wait-time": 77,
    "borrow-count": 10302,
    "borrow-retry-count": 0,
    "borrow-timeout-count": 0,
    "borrowed-instances": [
        {
            "duration-millis": 2888,
            "reason": {
                "request": {
                    "request-method": "post",
                    "route-id": "puppet-v3-catalog-/*/",
                    "uri": "/puppet/v3/catalog/foo.puppetlabs.net"
                }
            }
        },
        ...],
    "num-free-jrubies": 0,
    "num-jrubies": 4,
    "num-pool-locks": 0,
    "requested-count": 10305,
    "requested-instances": [
        {
            "duration-millis": 134,
            "reason": {
                "request": {
                    "request-method": "get",
                    "route-id": "puppet-v3-file_metadata-/*/",
                    "uri": "/puppet/v3/file_metadata/modules/catalog_zero16/catalog_zero16_impl83.txt"
                }
            }
        },
        ...],
    "return-count": 10298
}
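That payload is plain JSON, so pulling context out of it (which routes are holding JRuby instances, and for how long) is straightforward. A minimal sketch, assuming the "metrics" object above has already been parsed into a dict:

```python
from collections import defaultdict

def borrow_millis_by_route(metrics):
    """Sum duration-millis of currently borrowed JRuby instances per route-id."""
    totals = defaultdict(int)
    for inst in metrics.get("borrowed-instances", []):
        route = inst["reason"]["request"]["route-id"]
        totals[route] += inst["duration-millis"]
    return dict(totals)

metrics = {
    "borrowed-instances": [
        {"duration-millis": 2888,
         "reason": {"request": {"request-method": "post",
                                "route-id": "puppet-v3-catalog-/*/",
                                "uri": "/puppet/v3/catalog/foo.puppetlabs.net"}}}
    ]
}
borrow_millis_by_route(metrics)  # {"puppet-v3-catalog-/*/": 2888}
```

Even this small rollup answers a question the single trigger cannot: which kinds of requests are tying up the pool.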

If you are lucky, you'll have an intuitive feeling about the issue before asking whether the retry count is too high, or whether it was only a problem in a certain geo. If the problem is severe, you won't have time to work through the common errors one by one; you'll want context.

How Sumo Logic brings context to Puppet logs

Adding context, such as time series, geo, tool, and user, is the primary reason to use Sumo Logic for Puppet monitoring and analysis. Here is a deliberately simplified example query that compares JRuby borrowing with 2xx/4xx/5xx status codes from the Apache access log:

_sourceName=*jruby-metrics* AND _sourceCategory=*apache*
| parse using public/apache/access
| if(status_code matches "2*", 1, 0) as successes
| if(status_code matches "5*", 1, 0) as server_errors
| if(status_code matches "4*", 1, 0) as client_errors
| if(num-free-jrubies matches "0", 1, 0) as borrowrequired
| timeslice by 1d
| sum(successes) as successes, sum(client_errors) as client_errors, sum(server_errors) as server_errors, sum(borrowrequired) as borrowed_jrubies by _timeslice

Centralizing monitoring across the environment means not only querying and joining siloed data, but also allowing for smarter analysis. By appending an “outlier” query to something like the above, you can set baselines and spot trends in your environment instead of guessing and then querying.


| timeslice 15d
| max(borrowed_jrubies) as borrowed_jrubies by _timeslice
| outlier borrowed_jrubies

source: help.sumologic.com/Search/Search_Query_Language/Search_Operators/outlier
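Sumo's outlier operator flags points that drift beyond a rolling mean by some multiple of the rolling standard deviation. To make the idea concrete, here is a stand-alone sketch of that logic (not Sumo's implementation; the window size, threshold, and sample data are illustrative):

```python
import statistics

def outliers(series, window=3, threshold=3.0):
    """Flag indices more than `threshold` stdevs from the trailing window mean."""
    flagged = []
    for i in range(window, len(series)):
        trailing = series[i - window:i]
        mean = statistics.mean(trailing)
        stdev = statistics.pstdev(trailing)
        if stdev and abs(series[i] - mean) > threshold * stdev:
            flagged.append(i)
    return flagged

daily_borrows = [4, 5, 4, 5, 4, 40, 5, 4]   # hypothetical daily counts
outliers(daily_borrows)  # [5] -- the spike to 40 is flagged
```

The point of a baseline like this is that the threshold adapts to whatever "normal" looks like in your environment, instead of a hand-picked constant.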

About the Author

Alex Entrekin served on the executive staff of Cloudshare where he was primarily responsible for advanced analytics and monitoring systems. His work extending Splunk into actionable user profiling was featured at VMworld: “How a Cloud Computing Provider Reached the Holy Grail of Visibility.” Alex is currently an attorney, researcher and writer based in Santa Barbara, CA. He holds a J.D. from the UCLA School of Law.

Monitoring and Analyzing Puppet Logs With Sumo Logic is published by the Sumo Logic DevOps Community. If you’d like to learn more or contribute, visit devops.sumologic.com. Also, be sure to check out Sumo Logic Developers for free tools and code that will enable you to monitor and troubleshoot applications from code to production.
