In the Introduction to DevOps, we talked about Continuous Delivery and the Three Ways. The Three Ways are the patterns used by high-performing DevOps teams. This section focuses on the second pattern, which is to enable fast and continuous feedback from operations to development. A key factor in automating feedback is telemetry. By instrumenting your production application and environment to emit telemetry, the DevOps team can automate feedback mechanisms while monitoring applications in real time. DevOps teams use telemetry to see and solve problems as they occur, but this data can be useful to both technical and business users.
When properly instrumented, telemetry can also be used to see and understand in real time how customers are engaging with the application. This could be critical information for product managers, marketing teams, and customer support. Thus it’s important that feedback mechanisms share continuous intelligence with all stakeholders.
What is Telemetry and Why Should I Care?
Telemetry is “an automated communications process by which measurements and other data are collected… and transmitted to receiving equipment for monitoring,” according to Wikipedia. Telemetry is a major component in modern motor racing, for example, where race engineers collect data including G forces, temperature readings, wheel speed, and suspension displacement during a test or race and use it to tune the car for optimum performance. Systems used in Formula One have become so advanced that the potential lap time of the car can be calculated, establishing a time the driver is expected to meet.
In DevOps and the world of modern cloud apps, we are tracking the health and performance of an application. That telemetry data comes from logs, metrics and events. The measurements are things like memory consumption, CPU performance, and database response time.
Why is this important? John Allspaw, CTO at Etsy, once said, “If it moves, graph it. If it matters, alert on it.” With 1.6 million active sellers and 24 million active buyers in 2016, Etsy generates enormous volumes of data. Etsy has been the poster child for both DevOps and modern application monitoring. Etsy deploys 50 times per day while monitoring more than 200 million events per day. Etsy tracks everything from logins, shop stats and page performance to rate limiting, API failures and search performance.
To manage this feat Etsy developed StatsD, which they characterize as a “brain-dead simple way to create time-series metrics.” Stats are then sent to a metrics backend like Graphite. Etsy open-sourced StatsD and blogged about it in their seminal post, Measure Anything, Measure Everything.
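To make the idea concrete, here is a minimal sketch of how an application might emit StatsD-style metrics. It relies only on the plain-text StatsD line format (`name:value|type`) sent over UDP. The metric names, the helper functions `format_metric` and `send_metric`, and the host and port defaults are illustrative assumptions, not Etsy's code.

```python
import socket

# StatsD metric types used in this sketch: counter, timer (milliseconds), gauge.
VALID_TYPES = {"c", "ms", "g"}


def format_metric(name: str, value: float, metric_type: str) -> str:
    """Build one StatsD line of the form <name>:<value>|<type>."""
    if metric_type not in VALID_TYPES:
        raise ValueError(f"unsupported metric type: {metric_type!r}")
    return f"{name}:{value}|{metric_type}"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one metric line over UDP (fire-and-forget, no reply expected).

    The host and port are assumptions; 8125 is the conventional StatsD port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric names, for illustration only.
print(format_metric("shop.logins", 1, "c"))              # shop.logins:1|c
print(format_metric("search.response_time", 320, "ms"))  # search.response_time:320|ms
```

Calling `send_metric(format_metric("shop.logins", 1, "c"))` would push the counter to a StatsD daemon listening locally. Because UDP is connectionless, the application does not block or fail if the daemon is unavailable, which is part of what keeps this style of instrumentation cheap to add.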
More to the point, the level of visibility that this data provides can improve mean time to identify (MTTI) from days to minutes and mean time to resolve (MTTR) by more than 100 percent. Consider the 2016 Puppet Labs State of DevOps Report, which found that high-performing teams resolved production incidents 168 times faster than their peers. One of the top practices that decreased MTTR was the proactive monitoring of logs, metrics and events (telemetry).
DevOps Visibility and the Challenge of Continuous Delivery
As organizations embrace the DevOps approach to application development they face new challenges that can’t be met with legacy monitoring tools. Teams need DevOps Visibility. While continuous integration, automated testing and continuous delivery have greatly improved the quality of software, clean code doesn’t mean software always behaves as expected. A faulty algorithm or failure to account for unforeseen conditions can cause software to behave unpredictably. Within the continuous delivery (CD) pipeline, troubleshooting can be difficult, and in cases like debugging in a production environment it may not even be possible.
DevOps teams are challenged with monitoring, tracking and troubleshooting issues in a context where applications, systems, network, and tools across the toolchain all emit their own logging data. In fact, we are generating an ever-increasing variety, velocity, and volume of data.
Challenges of Frequent Release Cycles
As DevOps teams release faster and automate more, these goals can also become pain points. Frequent releases introduce new complexity, and automation obscures that complexity. In fact, DevOps teams cite deployment complexity as their #1 challenge, according to the EMA’s Automating for Digital Transformation survey. When asked how continuous delivery impacted production environments, 52% of respondents indicated that Ops is spending more time troubleshooting and 48% said that Development is drawn into production troubleshooting.
The current challenges for DevOps teams are:
- Ops is spending more time troubleshooting
- Development is drawn into production troubleshooting
- Service levels have degraded with more frequent releases
- Performance and availability problems have increased
- Accelerated release cycles that lead to new deployment complexity
- Difficulty syncing multiple development workstreams
5 Use Cases for Monitoring Events, Logs and Metrics
The primary use cases for monitoring events, logs and metrics fall into these categories:
- Outage Monitoring – monitor production issues and outages to see and solve problems as they occur.
- Product Monitoring – monitor user behavior to get product feedback.
- Build Failure – monitor build, test and deploy processes. Gather feedback so development and operations can safely deploy code.
- Predict Future Anomalies – anticipate issues before they occur
- Security – Monitor for both audit and compliance issues, and security breaches
What Metrics Should I Track?
To get the telemetry you want, you’ll need to decide what’s important to track and determine which logs, events and metrics relate to your KPIs. Developers can also code up custom events that enrich the telemetry from standard logs and events. To get a clear picture, you’ll need to examine telemetry at three layers: application, operating environment (system), and network.
What’s important to monitor? Examples include:
- Performance: how quickly a “normalized” search is executed vs. others. (Normalizing the search levels the differences between lightweight and heavier searches.)
- Potential optimizations when deploying infrastructure
- Changes that occur when an account is provisioned
- Adoption patterns when new features are deployed
- Whether a feature is visible and discoverable
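A custom event can be as simple as a structured log line tagged with the layer it describes. The sketch below shows one possible shape, assuming JSON lines written to standard output. The field names, the `build_event` and `emit` helpers, and the example event are hypothetical and would need to match whatever your log pipeline expects.

```python
import json
import time

# The three layers named above.
LAYERS = ("application", "system", "network")


def build_event(name: str, layer: str, **attributes) -> dict:
    """Create a custom telemetry event as a plain dictionary."""
    if layer not in LAYERS:
        raise ValueError(f"layer must be one of {LAYERS}")
    return {"event": name, "layer": layer, "timestamp": time.time(), **attributes}


def emit(event: dict) -> str:
    """Serialize the event as one JSON line and write it to stdout."""
    line = json.dumps(event, sort_keys=True)
    print(line)
    return line


# Hypothetical adoption event for a newly deployed feature.
emit(build_event("feature_used", "application",
                 feature="saved_search", account_id="acct-123"))
```

Emitting one self-describing JSON object per line keeps the events easy to parse alongside standard logs, so the same tooling can graph adoption of a new feature or alert on a missing event.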
Telemetry in the Build Stage
For DevOps teams, accelerated releases have led to greater deployment complexity, performance issues, and lower service levels, so there’s been a need for better troubleshooting and root-cause analysis tools. You monitor the build, test and deploy process to improve software quality, maintain or enhance performance, and prevent issues. The goal is to have app visibility at every stage of the release cycle. Telemetry combined with predictive analytics can also help project future KPI violations.
The stakeholders at this stage are the developers, release engineers, and product managers who drive feature development. They rely on automated feedback loops, informed by continuously flowing telemetry, to gain visibility into not only software quality but also app performance and usage within the staging environment. Some examples of telemetry to gather in staging include:
- Deriving time-series metrics from your source code repository (e.g., GitHub) can help determine the health of the release cycle.
- Gathering data from artifact repositories (e.g., Artifactory) can tell you a lot about the health of your builds.
- Collecting logs from your CI server (e.g., Jenkins) can point to continuous integration failures.
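As a rough illustration of turning build data into a time-series metric, the sketch below computes a daily build failure rate. It assumes the build records have already been exported from the CI server as `(day, result)` pairs. That record shape and the `failure_rate_by_day` helper are assumptions made for this example, not a CI server API.

```python
from collections import defaultdict


def failure_rate_by_day(builds):
    """Return {day: fraction of builds that did not succeed}, ordered by day.

    `builds` is an iterable of (day, result) pairs, where result is a
    status string such as "SUCCESS" or "FAILURE".
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for day, result in builds:
        totals[day] += 1
        if result != "SUCCESS":
            failures[day] += 1
    return {day: failures[day] / totals[day] for day in sorted(totals)}


# Hypothetical build history.
builds = [
    ("2024-01-01", "SUCCESS"),
    ("2024-01-01", "FAILURE"),
    ("2024-01-02", "SUCCESS"),
]
print(failure_rate_by_day(builds))  # {'2024-01-01': 0.5, '2024-01-02': 0.0}
```

Graphing this rate over time makes it easier to notice when the release cycle is degrading, rather than discovering it one broken build at a time.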
Automated tracking of code deploys is essential for teams who practice Continuous Deployment. Monitoring every aspect of your server and network architecture helps detect when something has gone awry. Correlating the time of every code deploy with changes in that monitoring data helps you quickly identify human-triggered problems and greatly shorten your time to resolve them.
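One simple way to correlate deploys with problems is to look for the most recent deploy shortly before an incident began. The sketch below assumes deploys are recorded as `(timestamp, version)` pairs and uses a one-hour lookback window. Both the record shape and the window are illustrative choices, and a recent deploy is only a suspect, not proof of cause.

```python
from datetime import datetime, timedelta


def suspect_deploy(deploys, incident_time, window=timedelta(hours=1)):
    """Return the latest deploy within `window` before the incident, or None.

    `deploys` is a list of (deployed_at, version) pairs.
    """
    candidates = [
        deploy for deploy in deploys
        if timedelta(0) <= incident_time - deploy[0] <= window
    ]
    return max(candidates, key=lambda deploy: deploy[0], default=None)


# Hypothetical deploy log and incident start time.
deploys = [
    (datetime(2024, 1, 1, 9, 0), "v1"),
    (datetime(2024, 1, 1, 11, 30), "v2"),
    (datetime(2024, 1, 1, 13, 0), "v3"),
]
incident = datetime(2024, 1, 1, 12, 0)
print(suspect_deploy(deploys, incident))
```

Here the call returns the `v2` deploy, which finished 30 minutes before the incident. The `v1` deploy falls outside the window, and `v3` happened after the incident started.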
Telemetry at the Run Stage
Once deployed, availability, reliability and scalability are the focus. When things go wrong in a cloud or serverless computing environment you can’t just blindly reboot the server. You must take a disciplined approach to identifying and resolving issues. So, you want to create enough telemetry to confirm that your services are running properly in production. If your developers are responsible for code in production as with Amazon’s “you build it, you run it,” the stakeholders are dev and DevOps teams. In traditional Enterprise IT shops, the stakeholders are everyone from dev and release to Ops.
Monitoring user behavior gives you instant feedback on things like how your app is being used. Examples of what you can monitor include:
- How customers are engaging with your app
- How successful and engaging the various features are
- What’s going on when crashes or non-crashing errors occur, such as failed HTTP requests, failed syncs, and timeouts
- How successful your trial versions are
- Where you should concentrate future investments
Consider Outsmart Games, a game developer that creates games for mobile devices, including a combat game called “Blood Gate Age of Alchemy.” Outsmart had just deployed the new game into production and, by monitoring telemetry (using Sumo Logic), discovered that some users were experiencing outages. In identifying the problem they discovered a scenario the developers hadn’t considered: in the game, users were asked to select a weapon before combat. For users who opted not to use a weapon, the game crashed. Monitoring user behavior decreased their MTTI from days to minutes.
By monitoring IP addresses, the company was able to identify specific users that experienced the outage and sent them a nurturing message. This kind of “application intelligence” can be invaluable to not only the DevOps team, but product management, marketing and even sales.
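The general idea of finding affected users from telemetry can be sketched as a small aggregation over failure events. This is an illustration only, not Outsmart's or Sumo Logic's actual method. The event fields (`ip`, `type`), the failure type names, and the `affected_ips` helper are all hypothetical.

```python
from collections import Counter

# Event types treated as user-visible failures in this sketch.
FAILURE_TYPES = frozenset({"crash", "timeout", "failed_sync"})


def affected_ips(events, failure_types=FAILURE_TYPES):
    """Count failure events per client IP address."""
    return Counter(
        event["ip"] for event in events if event["type"] in failure_types
    )


# Hypothetical events using documentation-reserved IP ranges.
events = [
    {"ip": "203.0.113.7", "type": "crash"},
    {"ip": "203.0.113.7", "type": "timeout"},
    {"ip": "198.51.100.4", "type": "crash"},
    {"ip": "192.0.2.10", "type": "login"},
]
for ip, count in affected_ips(events).most_common():
    print(ip, count)
```

The resulting list of addresses and failure counts is the kind of output a support or marketing team could use to decide who should receive a follow-up message.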
Telemetry and Security
As you’ll read elsewhere on this site, the purpose of DevSecOps as a distinct methodology is to integrate security smoothly into the DevOps framework, and to do so fully in the spirit of DevOps. But what does this mean in the context of feedback and the process of automating feedback?
When most practitioners speak of threat intelligence, they are trying to understand the risks of the most common and severe external threats, such as zero-day threats, advanced persistent threats (APTs) and exploits. Telemetry is already of enormous value here, with companies like CrowdStrike offering solutions. At the core of this kind of threat monitoring is the telemetry that comes from logs, metrics and events.
Monitoring is even more important in a DevOps context, since most or all of the basic control processes are automated. For DevSecOps, by far the most appropriate choice is a comprehensive, integrated monitoring system with a full API and dashboard capabilities. So, establishing telemetry to gather threat intelligence should be considered a best practice in any high-performing DevOps team.
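As one small example of security telemetry, the sketch below flags IP addresses that produce a burst of failed logins within a short window. The event shape `(timestamp_seconds, ip, succeeded)`, the thresholds, and the `flag_repeated_failures` helper are assumptions made for illustration. A real monitoring system would cover far more signals than this.

```python
from collections import defaultdict, deque


def flag_repeated_failures(events, max_failures=5, window_seconds=60):
    """Return the set of IPs with at least `max_failures` failed logins
    inside any sliding window of `window_seconds`.

    `events` is an iterable of (timestamp_seconds, ip, succeeded) tuples.
    """
    recent = defaultdict(deque)  # ip -> timestamps of recent failures
    flagged = set()
    for timestamp, ip, succeeded in sorted(events):
        if succeeded:
            continue
        failures = recent[ip]
        failures.append(timestamp)
        # Drop failures that have fallen out of the sliding window.
        while timestamp - failures[0] > window_seconds:
            failures.popleft()
        if len(failures) >= max_failures:
            flagged.add(ip)
    return flagged


# Hypothetical login events: one burst, one slow trickle.
burst = [(t, "203.0.113.7", False) for t in (0, 10, 20, 30, 40)]
trickle = [(t, "198.51.100.4", False) for t in (0, 100, 200, 300, 400)]
print(flag_repeated_failures(burst + trickle))  # {'203.0.113.7'}
```

Feeding the flagged set into an alerting channel follows the same pattern as operational alerts: graph the failures, and alert when they cross a threshold that matters.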
We’ve focused on automating feedback mechanisms through the use of logs, metrics and events (telemetry). Of course there are other patterns to consider: Creating feedback mechanisms that share the learnings discovered in operating an application back to the development team. You can learn more in this section of the DevOps Chronicles site.