Site reliability engineers (SREs) rely on monitoring and analytics tools like Sumo Logic to guarantee uptime and performance of their applications and various components or services in production. The ability to visually monitor, automatically generate alerts and efficiently troubleshoot an issue in real time has become table stakes for any modern SRE team.
Quality engineering (QE) teams have the same challenges and goals. As development teams embrace agile practices and accelerate their software delivery cycles, they also need to dramatically increase the speed and frequency at which they test their applications. This continuous testing places a huge burden on QE teams and requires them to monitor their testing pipelines in real time to identify and troubleshoot any issues that prevent them from properly testing new code and increasing release velocity. QE teams can no longer afford to be a bottleneck in the development process.
To accomplish this, QE teams need to analyze all the data their pipelines create in real time. They need to collect data from sources up and down their testing pipeline, then analyze that information and present it visually so that different groups across the development organization can make effective decisions. From senior engineering leaders trying to measure and increase overall release velocity to individual developers trying to fix a specific problem, this information is invaluable to any modern software development organization.
“If you can’t measure it, you can’t improve it.” – Peter Drucker
Monitoring a QE pipeline properly first requires the ability to collect data from sources across the pipeline. This includes test logs from internal grids and services such as Sauce Labs, CI tools like Jenkins, Git hosting services like GitHub, artifact repositories like JFrog, security testing tools like Veracode, and many others. The analytics platform ingesting this data must be flexible enough to work with both structured and unstructured data and allow QE to present this information visually in a way that's easy to consume for different types of users.
Once monitoring is implemented successfully, QE teams quickly gain a new level of visibility into what's actually happening across their testing pipeline. This visibility provides greater accountability across the entire development organization, as QE is no longer a black box of passed and failed tests but rather a transparent system that's vital to the release cycle. This transparency also builds trust across the development organization, as quality engineers and developers alike now have a centralized system of record that helps them collaborate to identify and resolve issues faster.
This visibility can also be easily customized for different users. Senior engineering leaders can have custom dashboards that provide an overall view of QE, while development and QE teams can have dashboards specific to their application. Even individual developers can have a dashboard for what they care about most.
Once proper monitoring is established, QE teams can begin to create alerts based on this information to proactively highlight issues or problems when they occur. Issues like test times suddenly increasing or test failures growing on a particular platform can now be automatically identified and the appropriate resources notified to address the issue. Integrations with tools like Jira will automatically generate trouble tickets while incident management tools like PagerDuty and Atlassian Opsgenie streamline the alerting process even further to help reduce the time it takes to resolve issues.
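As a rough illustration of the "test times suddenly increasing" case, the following sketch compares the latest run duration against the median of recent runs. It is a simple heuristic under stated assumptions, not how Sumo Logic evaluates alerts: the function name, the 1.5x factor and the minimum sample count are all hypothetical choices.

```python
from statistics import median


def duration_spiked(history, latest, factor=1.5, min_samples=5):
    """Return True if `latest` exceeds `factor` times the median of `history`.

    `history` is a list of recent test durations in seconds. The factor and
    the minimum sample count are illustrative assumptions, not recommended
    values.
    """
    if len(history) < min_samples:
        # Too little data to form a meaningful baseline; do not alert.
        return False
    baseline = median(history)
    return baseline > 0 and latest > factor * baseline
```

A check like this would typically run after each build, with a positive result handed to whatever notification or ticketing integration the team already uses.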
Teams can also begin to create more advanced alerts that correlate data from various sources to provide new insights. One example is a correlation across GitHub commits and pull requests, Jenkins builds and completed automated tests to calculate the velocity of the pipeline and identify any current or future bottlenecks.
Once an issue is identified, the final step in this process is to quickly determine the root cause and fix the issue. With thousands of tests running across multiple builds and multiple platforms, isolating and identifying the root cause of a particular QE problem can be overwhelming. However, if the QE pipeline is properly instrumented and integrated with a machine data analytics platform like Sumo Logic, this process becomes much simpler. With direct access to the logs and metrics associated with a particular issue, quality engineers and developers can quickly investigate it, drill down into the related data and ultimately identify its root cause. Sumo Logic's query language allows users to analyze massive amounts of data quickly, while Search Templates let occasional users apply the same power through simple drop-down forms. With this new process in place, troubleshooting a QE issue moves from an inefficient, ad hoc process that struggles to scale to a highly organized system.
QE-Ops in Action
Here at Sumo Logic, our QE team uses our machine data analytics platform to more effectively monitor and troubleshoot our testing pipelines. The results have been tremendous: you can read more about how we gained a single, unified view of key analytics through our testing cycle in another of our technical blogs.
We’re still in the early days of fully implementing a true QE-Ops system, but here’s a snapshot of how we monitor, alert and troubleshoot our automated testing to ensure the functional health of our development pipeline.
Sumo Logic Automated Testing Pipeline
Here’s how the pipeline flows...
- Corresponding test jobs are created on Jenkins. Tests run on different browsers on the chosen platform: our in-house ECS-based grid or Sauce Labs.
- A plugin collects customized test result metadata as JSON.
- The JSON file is then uploaded to an Amazon S3 bucket.
- From S3, the file is ingested into Sumo Logic.
- Search queries are then executed on the ingested JSON logs.
- Dashboards that monitor the QE environment are created via queries.
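To make the shape of the data concrete, here is a minimal sketch of what one test result record could look like before it is written to S3. Every field name in it (job_name, build_url, failed_expectation and so on) is a hypothetical assumption chosen for illustration; the actual schema produced by our Jenkins plugin is not reproduced here.

```python
import json
from datetime import datetime, timezone


def build_test_record(job_name, build_url, deployment, browser, platform,
                      test_name, status, duration_seconds,
                      failed_expectation=None, timestamp=None):
    """Build one test-result record as a JSON-serializable dict.

    All field names are illustrative assumptions, not the real schema.
    """
    if status not in ("passed", "failed", "skipped"):
        raise ValueError(f"unexpected status: {status!r}")
    ts = timestamp or datetime.now(timezone.utc)
    return {
        "timestamp": ts.isoformat(),
        "job_name": job_name,
        "build_url": build_url,
        "deployment": deployment,
        "browser": browser,
        "platform": platform,
        "test_name": test_name,
        "status": status,
        "duration_seconds": duration_seconds,
        "failed_expectation": failed_expectation,
    }


def to_log_line(record):
    """Serialize a record as one compact JSON line (one log message per test)."""
    return json.dumps(record, sort_keys=True, separators=(",", ":"))
```

Writing one JSON object per line keeps each test result a separate log message, which makes it straightforward to parse fields and aggregate on them once the file is ingested.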
We’ve created one overview dashboard (Production-Test-Overview), which gives us a bigger picture of all the Jenkins jobs failing on different production environments, and multiple deployment-specific dashboards (Production-Test-Details) for root cause analysis. Let’s look at a typical troubleshooting flow from a quality engineer’s perspective. It starts with the Overview dashboard, which the QE team monitors continuously. It contains multiple single-value panels for different deployments, which help to quickly identify an unhealthy deployment.
All of these panels are linked to a deployment-specific dashboard. Clicking the drill-down button on a panel opens the Production-Test-Details dashboard, which helps troubleshoot different use cases.
Identifying Failure Causes
When troubleshooting, you always want to reduce the scope and target the jobs with the lowest success rates. The panel below lists the jobs with a success rate <= 85%.
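The logic behind such a panel can be sketched in a few lines. This is an illustrative stand-in, not the query that backs our dashboard: it assumes each record is a dict with hypothetical "job_name" and "status" fields, and it ignores skipped tests when computing the rate.

```python
from collections import defaultdict


def success_rate_by_job(records):
    """Return {job_name: success rate in percent}, ignoring skipped tests."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        if r["status"] not in ("passed", "failed"):
            continue
        total[r["job_name"]] += 1
        if r["status"] == "passed":
            passed[r["job_name"]] += 1
    return {job: 100.0 * passed[job] / total[job] for job in total}


def jobs_at_or_below(records, threshold_pct=85.0):
    """List (job, rate) pairs with rate <= threshold, worst job first."""
    rates = success_rate_by_job(records)
    flagged = [(job, rate) for job, rate in rates.items() if rate <= threshold_pct]
    return sorted(flagged, key=lambda pair: (pair[1], pair[0]))
```

Sorting the worst job first mirrors the troubleshooting advice above: start with the smallest scope that explains the most failures.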
After we’ve identified the failing job name and its build URL, we begin to investigate the cause. The panel below shows a list of failure reasons along with the build URL; we can correlate the two or filter on the buildurl column to track down the root cause. In this example, the test failed because no emails were received when a password was reset - failedExpectation “No Matching Mail Found”.
Comparing Test Results Across Browsers
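As a hedged sketch of the grouping such a panel performs, the function below counts failure reasons and collects the build URLs where each one occurred. The field names ("status", "job_name", "failed_expectation", "build_url") are hypothetical assumptions, not our actual log schema.

```python
from collections import defaultdict


def failure_reasons(records, job_name=None):
    """Return [(reason, count, sorted build URLs)], most frequent reason first.

    If `job_name` is given, only failures from that job are considered.
    """
    counts = defaultdict(int)
    urls = defaultdict(set)
    for r in records:
        if r["status"] != "failed":
            continue
        if job_name is not None and r["job_name"] != job_name:
            continue
        reason = r.get("failed_expectation") or "unknown"
        counts[reason] += 1
        urls[reason].add(r["build_url"])
    rows = [(reason, counts[reason], sorted(urls[reason])) for reason in counts]
    return sorted(rows, key=lambda row: (-row[1], row[0]))
```

Keeping the build URLs next to each reason is what makes the correlation step quick: one row points directly at the builds worth opening.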
In this example, we want to look at the tests that were run over the past 24 hours for Chrome and Firefox. This is a typical use case when you are launching a set of new features, and want to compare the test results on different browsers.
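A comparison like this boils down to filtering by time window and browser, then counting outcomes. The sketch below assumes hypothetical "timestamp", "browser" and "status" fields, with timestamps stored as timezone-aware ISO 8601 strings; it is an illustration of the idea rather than the query behind our dashboard.

```python
from datetime import datetime, timedelta, timezone


def compare_browsers(records, browsers=("chrome", "firefox"),
                     now=None, window=timedelta(hours=24)):
    """Count passed and failed tests per browser within the trailing window."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - window
    summary = {b: {"passed": 0, "failed": 0} for b in browsers}
    for r in records:
        browser = r["browser"].lower()
        if browser not in summary:
            continue
        # Assumes timezone-aware ISO 8601 timestamps, e.g. 2024-01-01T12:00:00+00:00
        ts = datetime.fromisoformat(r["timestamp"])
        if ts < cutoff or ts > now:
            continue
        if r["status"] in ("passed", "failed"):
            summary[browser][r["status"]] += 1
    return summary
```

Passing `now` explicitly makes the window reproducible, which is useful when re-running the comparison for a past release.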
Determining Test Stability based on Weekly Comparison
Configuring Alerts on Critical Health Checks to Monitor Production Job Failures
Alerting, Troubleshooting & Triaging
We’ve also created alerts from our test data. In the example below, we’ve created an alert based on a success rate that falls below 85%.
After we’ve identified this failing test, we begin to investigate the cause. In this example, by accessing logs and metrics associated with the test, we’re able to quickly determine the test failed because no emails were received when a password was reset - failedExpectation “No Matching Mail Found”.
Sumo Logic also provides many other valuable dashboard capabilities, such as time comparison trends and outlier detection. These help you analyze the data and gain more insight.
Not a Sumo Logic user yet? You can sign up for free here.