DevOps Glossary

Root Cause Analysis

What is Root Cause Analysis?

Root cause analysis (RCA) is a method of problem-solving used to investigate known problems and identify their antecedent and underlying causes. While the term root cause analysis seems to imply that issues have a singular cause, this is not always the case. A problem may have a single cause or several, stemming from deficiencies in products, people, processes or other factors.

When is Root Cause Analysis Used?

Root cause analysis is used as an investigative tool in a variety of industries. Engineers and product designers use an RCA technique known as failure analysis to proactively evaluate what conditions might cause a product or project to fail. RCA also has applications in healthcare. Doctors may use RCA to help with a diagnosis, and epidemiologists can use it to trace the source of an infectious disease outbreak.

For IT organizations, root cause analysis is a key aspect of the cyber security incident response process. When a security breach occurs, SecOps teams must collaborate quickly to determine where the breach originated, isolate the vulnerability that caused the breach and initiate corrective and preventive actions to ensure the vulnerability cannot be exploited again.

How to Do Root Cause Analysis

When investigating a cyber security incident, security operations teams must act quickly to identify and isolate the root cause of the event. The basic outline of the RCA process is identical across industries, regardless of the tools that individual practitioners choose to implement. A process for root cause analysis is described in the following four steps:

  1. Identification and description - The first step to a successful root cause analysis is the accurate identification and description of the problem. If the problem is poorly understood, it may prove difficult to correctly isolate its underlying causes. For IT operators responding to an automated alert from a security analytics tool, an initial problem statement could be "Our security system sent an alert". Accurate event descriptions also play an important role: the starting point for a successful analysis should be a collection of accurate event descriptions detailing everything that happened in connection with the problem.
  2. Chronology - Once IT operators have identified the problem and its associated events, they should arrange the events in chronological order, as in a timeline or sequence of events. This makes it easier to identify causal relationships between events connected to the problem. Organizations that use security analytics software can automate the collection of event logs and the integration of logs from multiple sources into a single, standardized format and platform. This streamlines the RCA process and helps these organizations reach step three sooner.
  3. Differentiation - Differentiation is the third step of the RCA process. Here, investigators incorporate additional contextual data surrounding the events to understand how events are correlated. When a cyber security event is detected, security operators must analyze dependencies between events to distinguish between root causes, causal factors and non-causal factors within the system. Using a data analysis technique called event correlation, enterprise security analytics tools can filter through high volumes of computer logs from a variety of different sources and pinpoint the ones that are most likely to be connected to the problem.
  4. Causal graphing - In the final step of the RCA process, investigators are encouraged to produce a causal graph, diagram or another visual interpretation of the result of the RCA process. Causal graphing illustrates a sequence of key events that begins with the root causes and ends with the problem. This exercise demonstrates the logical pathway that was followed to determine how the problem occurred.
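
The chronology, differentiation and causal graphing steps above can be sketched in a few lines of Python. This is a minimal illustration, not a substitute for a security analytics tool: the log entries, field names and cause links are all invented for the example, and the time-window filter is only a crude stand-in for real event correlation.

```python
from datetime import datetime, timedelta

# Hypothetical events from two log sources; all values are illustrative.
firewall_log = [
    {"time": "2024-03-01T10:05:00", "source": "firewall",
     "event": "Outbound connection to unknown host"},
    {"time": "2024-03-01T09:58:00", "source": "firewall",
     "event": "Inbound connection on port 22"},
]
server_log = [
    {"time": "2024-03-01T10:01:00", "source": "server",
     "event": "New admin account created"},
]


def build_timeline(*logs):
    """Step 2: merge logs from several sources and sort them chronologically."""
    merged = [event for log in logs for event in log]
    return sorted(merged, key=lambda e: datetime.fromisoformat(e["time"]))


def events_before(timeline, problem_time, minutes):
    """Step 3 (simplified): keep events in a time window ending at the problem."""
    end = datetime.fromisoformat(problem_time)
    start = end - timedelta(minutes=minutes)
    return [e for e in timeline
            if start <= datetime.fromisoformat(e["time"]) <= end]


# Step 4: a causal graph stored as effect -> list of direct causes.
# These links are assumptions an investigator would record by hand.
causes = {
    "Outbound connection to unknown host": ["New admin account created"],
    "New admin account created": ["Inbound connection on port 22"],
}


def root_causes(problem, causes):
    """Walk cause links backwards; events with no recorded cause are roots."""
    roots, stack, seen = [], [problem], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        parents = causes.get(node, [])
        if parents:
            stack.extend(parents)
        else:
            roots.append(node)
    return roots


timeline = build_timeline(firewall_log, server_log)
recent = events_before(timeline, "2024-03-01T10:05:00", minutes=5)
roots = root_causes("Outbound connection to unknown host", causes)
```

Storing the graph as effect-to-causes makes it straightforward to start from the observed problem and trace back, which mirrors the direction in which an investigation usually proceeds.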

Root Cause Analysis Tools and Techniques

While the general process for root cause analysis remains consistent across industries, investigators differ in the tools and techniques that they use to get to the underlying source of a problem. Even security operators who can automate much of the RCA process with security analytics applications must be familiar with methodologies of root cause analysis to accurately interpret the causes of security events. Two commonly used methods for RCA in cloud computing environments are described below:


Five Whys Root Cause Analysis

The "Five Whys" method of root cause analysis is an investigative technique that encourages the practitioner to repeatedly ask "Why?" to get to the deepest chain of causation that leads to an incident, event or problem. When a problem is observed, we can rarely get to the root cause after a single iteration of asking "Why did this happen?" We may have to go through several layers of questioning to understand the root cause of an event and identify an opportunity for corrective actions. Use this example as a template for conducting Five Whys RCA:

Problem Statement: The company data server was infected with malware.

  • Why? The server was not updated with the latest malware definitions for our anti-malware application.
  • Why? The automated server that deploys the updates is not operational.
  • Why? The automated server broke last month and it hasn't been repaired or replaced.
  • Why? The person responsible for approving the repair or replacement is on vacation and there was inadequate communication about who should cover change approvals.
  • Why? There is no defined process for covering change approvals when the approver is away.

Solution: Create a process to ensure that repairs can be approved, even when the normal approving person is away.

This simple example illustrates the depth of questioning that can be required to isolate the root cause.
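
For teams that record their investigations, a Five Whys chain can be kept as a simple ordered list. The sketch below is one possible way to do that; the function name five_whys is hypothetical, and the answers are shortened paraphrases of the example above.

```python
def five_whys(problem, answers):
    """Format a chain of 'Why?' answers and return the last one as the
    candidate root cause. The chain may be shorter or longer than five."""
    if not answers:
        raise ValueError("at least one answer is needed")
    lines = [f"Problem Statement: {problem}"]
    for depth, answer in enumerate(answers, start=1):
        # Indent each answer one level deeper to show the chain of causation.
        lines.append(f"{'  ' * depth}Why? {answer}")
    return "\n".join(lines), answers[-1]


problem = "The company data server was infected with malware."
answers = [
    "Malware definitions were out of date.",
    "The update server is not operational.",
    "The update server broke and was not repaired.",
    "The approver was away and nobody covered approvals.",
    "No process exists for covering approvals.",
]

report, candidate_root_cause = five_whys(problem, answers)
print(report)
```

The last answer is only a candidate: the investigator still has to judge whether another round of "Why?" would reveal something deeper.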


Fishbone/Ishikawa Diagram Analysis

A fishbone diagram is a visual graphing tool that prompts the investigator to consider potential causes for a problem from a variety of sources. By grouping candidate causes into categories, it helps investigators avoid fixating on a single type of cause and reach the root cause sooner. A common framework for fishbone diagrams is the 5 Ms, where investigators look at:

  • Man: Human factors that could have caused the problem
  • Machine: Hardware or technical causal factors
  • Material: Causal factors stemming from material issues, including consumables and information
  • Method: Causal factors stemming from breakdowns in process or methodology
  • Measurement: Causal factors stemming from inaccuracies in measurement tools or inspections

Environmental causal factors are also frequently investigated as part of a Fishbone/Ishikawa diagram analysis.
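
Before drawing the diagram, the categories can be captured as a plain mapping from category to candidate causes. The sketch below assumes the malware scenario from the Five Whys example; the assignment of causes to categories is illustrative, and the helper unexplored is a hypothetical name.

```python
# Fishbone categories (the 5 Ms plus Environment) mapped to candidate causes.
# The entries are illustrative assumptions, not findings.
fishbone = {
    "Man": ["Approver on vacation with no delegate"],
    "Machine": ["Update deployment server is broken"],
    "Material": ["Malware definitions are out of date"],
    "Method": ["No process for covering change approvals"],
    "Measurement": [],
    "Environment": [],
}


def unexplored(diagram):
    """Return the categories that have no candidate causes yet."""
    return [category for category, items in diagram.items() if not items]


def render(problem, diagram):
    """Produce a text outline of the diagram, one category per branch."""
    lines = [f"Problem: {problem}"]
    for category, items in diagram.items():
        lines.append(f"  {category}")
        lines.extend(f"    - {item}" for item in items)
    return "\n".join(lines)


print(render("The company data server was infected with malware.", fishbone))
print("Not yet explored:", ", ".join(unexplored(fishbone)))
```

Listing the empty categories is a simple prompt to check whether measurement or environmental factors were considered at all, rather than silently skipped.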