
December 15, 2015 By Joe Nolan

Root Cause Analysis Best Practices

To prevent regression issues from becoming a trend, it’s wise to do a Root Cause Analysis (RCA) for bugs. Mistakes happen, and customers get that for the most part. When you release a new feature, it can be tough to weed out all of the bugs beforehand. But the most annoying scenario is a bug that shows up in code that was working just fine in the previous release. RCA to the rescue.

CSI

Just what is RCA? It’s pretty simple. When you find an issue, you want the development team associated with the broken feature to become detectives (like on the Crime Scene Investigation TV show) and determine how a bug was missed. It takes a team with different roles to perform the investigation:

  • Designers – question whether their design for a new feature indirectly impacted the broken feature. Did they misunderstand the relationships or just not know about them? Maybe the team works in a silo and isn’t aware of competing code.
  • Developers – ask themselves whether their code caused this. Did they interpret the specs differently than the rest of the team?
  • QA – wonders why their tests missed this. Maybe the team decided it was an edge case and didn’t do a deep verification for that feature.

While each issue is different, every one of them was introduced at some point in time. The most critical step in starting the investigation is to test the issue against previous builds until the first bad one is identified. In addition to comparing builds, you can use a tool like Sumo Logic to gather evidence from event streams and log files and gain visibility across the entire stack. That’s convenient because team members use different toolsets, and troubleshooting between staging and production environments can be challenging. From there it is usually a simple process to get to the ‘root’.
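Checking previous builds does not have to mean testing every one of them. A minimal sketch of a binary search over builds, assuming the builds are ordered oldest to newest, the oldest is known good, the newest is known bad, and the bug stays present once introduced. The build labels and the `is_bad` check are hypothetical; in practice `is_bad` would install a build and run whatever reproduces the issue.

```python
from typing import Callable, Sequence


def first_bad_build(builds: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest bad build.

    Assumes builds[0] is good, builds[-1] is bad, and every build
    after the first bad one is also bad.
    """
    if len(builds) < 2:
        raise ValueError("need at least one good and one bad build")
    lo, hi = 0, len(builds) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid  # bug already present, look earlier
        else:
            lo = mid  # still good, look later
    return builds[hi]


# Hypothetical build labels and a stand-in check for illustration only.
builds = ["1.0", "1.1", "1.2", "1.3", "1.4", "1.5"]
culprit = first_bad_build(builds, lambda build: build >= "1.3")
print(culprit)
```

Each pass halves the range, so even a long list of builds needs only a handful of manual checks before the team can look at what changed in the culprit build.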

It’s Not a Root Canal!

Whenever I hear it’s time for the team to do a root cause analysis I cringe. No matter how you spin it someone will feel like you are airing their dirty laundry.

This must be an ego-less process.

It is extremely important for the team to understand this is not an exercise in pointing fingers. Management must establish this culture from the beginning and be careful in any communication about the findings to focus on the solution and not the originator.

A team approach is important. If a team is following best practices there should be enough checks and balances that no one person, or even functional group, is the only culprit.

Now What?

Great news! The team determined what caused the issue. Sorry, this is only half of what RCA is for. Not only should you fix the problem, but you must ensure it doesn’t happen again! If you think a customer was pissed when you caused a regression, it’s even worse if that same one pops up again after it’s been ‘fixed’. You need to prevent this from happening again:

  • If it was a coding error, write the necessary automated unit and integration tests around it. Catch this the instant it happens again. Don’t hope QA is downstream to protect you.
  • Was there a breakdown in process? Is it likely to happen again? Fix it!
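For the coding-error case, the test can be as small as pinning the exact input that broke. A minimal sketch using Python’s built-in `unittest`; the `apply_discount` function and the failing input are hypothetical stand-ins for whatever regressed in your product.

```python
import unittest


def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical function standing in for the code that regressed."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents - (price_cents * percent) // 100


class ApplyDiscountRegressionTest(unittest.TestCase):
    def test_full_discount_is_zero(self):
        # Pins the exact input from the (hypothetical) bug report.
        self.assertEqual(apply_discount(1999, 100), 0)

    def test_out_of_range_percent_is_rejected(self):
        with self.assertRaises(ValueError):
            apply_discount(1999, 101)


# Run the suite in-process; a CI job would normally discover and run it.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ApplyDiscountRegressionTest)
result = unittest.TextTestRunner().run(suite)
```

Naming the test after the regression makes it obvious to the next developer why the case exists, which keeps it from being deleted as redundant.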

The obvious answer seems to be for QA to tighten up their tests. But in a Continuous world you want to catch bugs early, before they get to QA. Quality should be front-loaded, so that by the time a feature (or bug fix) hits QA, the developer already deems it production ready.

Document the Results

Bug tracking systems, like JIRA, give you the ability to document the findings of the RCA. They may let participants record both how the problem occurred and the solution. That makes it easy to report on when an RCA can be considered complete.
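To make "complete" concrete, here is a rough sketch of what such a record and report might look like. This is not JIRA’s data model or API; the field names and issue keys are hypothetical, and a real tracker would store these as custom fields on the bug.

```python
from dataclasses import dataclass


@dataclass
class RcaRecord:
    """Hypothetical RCA fields attached to a bug."""
    issue_key: str
    how_it_occurred: str = ""
    solution: str = ""
    prevention: str = ""

    def is_complete(self) -> bool:
        # Complete only when every finding has been filled in.
        findings = (self.how_it_occurred, self.solution, self.prevention)
        return all(text.strip() for text in findings)


# Made-up records for illustration.
records = [
    RcaRecord("APP-101", "Shared helper changed", "Restored old behavior", "Added unit test"),
    RcaRecord("APP-102", "Spec read two ways", "Clarified spec"),
]
open_rcas = [record.issue_key for record in records if not record.is_complete()]
print(open_rcas)
```

Including a prevention field reflects the point made earlier: an RCA that only explains and fixes the bug is half done.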

Learn from Mistakes

The RCA is a great tool for making the team aware of the impact of their work and for cleaning up poor practices. It’s in a team’s best interest to know whether they are being effective and to implement improvements such as:

  • Designers communicate across feature teams to discover inter-dependent features
  • Developers can make it a standard to have unit and integration tests as part of their code reviews
  • Team defines smaller stories to allow less room for interpretation
  • QA defines tests earlier so developers can write code against the tests

Make the Time and Do It!

Root Cause Analysis will not only improve the quality of your product, it will also build a stronger team. When members understand that we all make mistakes, and that blame is not assigned, the team develops trust. Management and client-facing co-workers will appreciate knowing the team is willing to review these issues and learn from their mistakes.

About the Author

Joe Nolan (@JoeSolobx) is the Mobile QA team lead at Blackboard. He has over 10 years of experience leading multinationally located QA teams, and is the founder of the DC Software QA and Testing Meetup.
