An Allied Reliability Group White Paper
4200 Faber Place Drive
Charleston, SC 29405
843.414.5760
www.alliedreliabilitygroup.com

The Basics of Root Cause Analysis
March 25, 2014
Reliability… it’s in our DNA.

Contents
What Is Root Cause Analysis and Why Is It Important?
Establishing RCA Triggers
Effective RCA Barriers
Performing RCA
Recognize
Rationalize
Ratify
Resolve
Realize
Metrics

What Is Root Cause Analysis and Why Is It Important?

Root Cause Analysis (RCA) is the core skill used by maintenance and
reliability engineering professionals to resolve problems that impact an
organization’s ability to meet strategic objectives. RCA is not just a
tool; it is a systematic methodology used by managers, engineers,
supervisors, operators, and technicians to eliminate chronic problems
that affect an organization. RCA is the preferred process for solving a
variety of problems, not just equipment failures. Take quality
management systems, for example, as defined by ISO 9001:

“The organization shall take action to eliminate the cause of
nonconformities in order to prevent recurrence.”
ISO 9001, Quality Management Systems - Requirements, Clause 8.5,
Improvement, Paragraph 8.5.2, Corrective Action

Sponsorship or advocacy for the RCA process must
be earned. Ultimately, it comes down to a personal
choice made by the individual to support the new
way of doing business. The commitment of management,
and of anyone else impacted by the RCA process,
is best gained by:
• Building awareness of why the RCA process
is important and why the sequence of steps
within the process is relevant to meeting the
needs of the business.
• Helping people understand what is expected
of them and how the RCA process will impact
their role or ability to meet their personal
objectives.
• Providing case studies and concrete examples
of successful problem solving that relate to
personnel’s own experiences and needs.
© 2014 Allied Reliability Group Page 1
• Telling the manager specifically what actions
must be taken in order to ensure the success
of the RCA process.
• Acknowledging behaviors that reinforce the
expectations of the RCA process in order to
encourage continued support.

Establishing RCA Triggers

The most significant challenge to overcome when starting an RCA program
is not having enough resources to implement the corrective actions
before the facility suffers from the same problem again. Strong
management sponsorship and commitment certainly help to overcome this
challenge; however, if leadership does not believe that failures are
being resolved effectively and in a timely manner, they will lose
interest and, as a result, become more directive and demanding in an
effort to implement risk-mitigating actions. This often leads to
counterproductive and extraordinary measures such as around-the-clock
maintenance monitoring of critical assets, the implementation of
preventive routines that 90% of the time are not adding value, and the
feeling that more MRO spares need to be carried in inventory. These are
extraordinary measures because they are very costly decisions.

Establishing RCA triggers is the best way to ensure that the
organization is not constrained by investigation time limits that
compromise the integrity of the RCA and to ensure that there are
sufficient labor, material, and financial resources to execute
corrective actions.

RCA triggers act like a decision tree and should be based on
organizational strategic objectives. For each trigger, it is recommended
that you identify the level of effort that is allowed to resolve the
risk. In essence, you are performing a Cost-Benefit Analysis. Every time
an organization investigates an event, there is a cost to the
organization relative to manpower and the cost of corrective actions.
These costs should not be greater than the financial value gained from
preventing future occurrences. A good rule of thumb is to not exceed 85%
of the financial benefit within a single fiscal year.

In the event that the failure does not impact one of the agreed-upon
triggers, simply document the problem using the Change Analysis method
discussed later in this white paper. That way, the failure history is
recorded in case this singular event is related to another, higher-risk
event.

Limiting RCA efforts to specific triggers helps organizations overcome
time and resource barriers. Engaging leadership, as Sponsors for the RCA
program, in this first round of RCA decision making aligns the strategic
objectives of the organization with the RCA program and ensures that
successful achievement of objectives is closely tied to how effectively
the organization supports and executes RCA.

Effective RCA Barriers

There are a number of reasons why an RCA program, especially a new
program, is ineffective and eventually unsustainable:

• Poorly Defined Problem Statement – Poorly defined problems lead to
misguided RCA teams and ineffective problem solving. In many instances
of asset-related failures, the problem needing to be solved has nothing
to do with the asset. The failure, in effect, is merely a symptom of a
systemic problem, or multiple problems, that needs to be investigated
and resolved.

• No Formal RCA Process – Informal RCA practices lead to
assumption-based analysis and decision making. Without the proper
facilitation, RCA events become unproductive and rarely result in
effective solutions. Most times, informal RCA becomes a “check the box”
activity.

• Time-Limited Investigations – Although not ideal, it is common for
leadership to limit the time that RCA teams have to investigate
problems. Typically, this results in the team stopping at the physical
roots. This means that the true root causes of the failure will not be
resolved and the organization will suffer from this problem again in the
future.

• Unchecked Assumptions – It is normal to build an RCA diagram based on
gut feel and assumptions; however, this should only be the first step in
brainstorming possible causal chains. The effectiveness of corrective
actions is dependent upon the accuracy of the analysis. Facts should
always be used to check assumptions.

• Insufficient Analysis Detail – When we try to solve asset-related
problems with limited knowledge or detail, we have a tendency to only
recognize the “rule breakers”. Rule breakers are events like “Johnny was
not wearing his fall protection”, “Johnny did not follow the procedure”,
or “Johnny ran a red light”. Although these events may be true, they are
not the whole story. Stopping at them leads to improper corrective
action selection. It is important to break the chain of events down into
small bites of information so we can better understand the human,
systemic, and latent details that led to failure.

• Interim “Recovery” Solutions Become Permanent – In many situations, it
is necessary to implement interim solutions in order to quickly recover
from the failure event and return to normal operation. This can often
mask the root causes and may even create a false sense of problem
resolution.

• RCA Team Lacks Expertise – It is not uncommon to have an RCA team that
lacks the skills, knowledge, and experience to drill down and explore
all possible causal chains. A good indicator of this barrier is a high
frequency of RCAs and solutions that primarily focus on physical roots.

• Inadequate Resources to Resolve “Big” Issues – RCA teams will quickly
become frustrated and unproductive if they believe that their solutions
are unlikely to be implemented due to budget constraints, unavailable
capital for engineered solutions, and an already overburdened
maintenance backlog.

• Skirting the “Blame Game” – Because human and latent root causes
inevitably lead back to a decision made by a member of your
organization, it is natural for RCA team members to attempt to hide
details or skirt around a particular causal chain. No one likes to point
fingers.

• “It’s Not My Job” Syndrome – It is easy for RCA team members to become
overwhelmed by the thought of the mountain of work that is piling up
while they are engaged in the RCA. Some may even be vocal about RCA not
being their responsibility. This can quickly derail the flow of progress
within the RCA.

Performing RCA

As we have already stated, RCA is a systematic approach to problem
solving. Figure 1 shows Allied Reliability Group’s systematic approach,
known as the “R5 Cause Analysis” process. This model resembles the
popular Six Sigma DMAIC methodology, whereby you first set out to define
the incident in order to recognize the problem needing to be solved.
With a clear understanding of the problem, the initial investigator then
measures the impact that the problem is having on organizational
objectives as a way to rationalize whether or not further investigation
is warranted. With a ratified path forward, the RCA team proceeds to
analyze the causal factors in order to determine how to improve
performance by mitigating the root causes of the incident. Finally, the
process is complete once you have realized that your solutions are
effective and have implemented controls to prevent recurrence.

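The Cost-Benefit Analysis described under Establishing RCA Triggers can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, field names, and dollar figures are hypothetical, and the 85% rule of thumb is read here as "investigation cost plus corrective action cost should not exceed 85% of the benefit expected within one fiscal year".

```python
from dataclasses import dataclass


@dataclass
class RcaCostBenefit:
    """Hypothetical record of the costs and benefit tied to one RCA trigger."""

    investigation_cost: float      # manpower spent investigating the event
    corrective_action_cost: float  # cost of implementing corrective actions
    annual_benefit: float          # value gained in one fiscal year by preventing recurrence

    def total_cost(self) -> float:
        return self.investigation_cost + self.corrective_action_cost

    def within_rule_of_thumb(self, max_ratio: float = 0.85) -> bool:
        """True when total cost stays at or below max_ratio of the annual benefit."""
        if self.annual_benefit <= 0:
            return False  # nothing to gain, so no level of effort is justified
        return self.total_cost() <= max_ratio * self.annual_benefit


# Illustrative numbers only; they do not come from the white paper.
case = RcaCostBenefit(
    investigation_cost=12_000,
    corrective_action_cost=30_000,
    annual_benefit=60_000,
)
print(case.total_cost(), case.within_rule_of_thumb())
```

An event that fails this check would not be investigated further; per the text, it would simply be documented with the Change Analysis method so the failure history is retained.
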
Figure 1: R5 Cause Analysis Process
Recognize

Incident Analysis

As previously discussed, the first step in the process is to determine
if the incident that triggered the call for RCA is the problem needing
to be solved or if it is merely an effect of a much bigger incident.
Starting at too high a level within the overall cause and effect
relationship may prolong the analysis process and result in both
management and RCA team members becoming disengaged. Additionally, if
the initiating incident is nothing more than a symptom of an underlying
chain of events, the team might not arrive at the necessary corrective
actions that will truly prevent recurrence. This often leads
stakeholders to be skeptical of the RCA program and may result in a lack
of sponsorship to continue analysis efforts.

Design and Application Review

The Design and Application Review method is used to compare the desired
expectations of an asset, process, or procedure to the original design
or configuration. Differences between the desired expectation and the
design should be noted as potential questions for further investigation
during the RCA process. As an example, if a production process currently
requires 700 gallons of a chemical per minute and the pump is only
capable of 650
gallons per minute per the pump flow curve, then this could be a problem
or a contributing factor to the incident being investigated.

Change Analysis

The Change Analysis method is also used to clarify the problem, or
problems, that need to be solved through RCA. Change Analysis helps the
team identify questions that need to be answered and data that must be
collected. Because the Change Analysis method quantifies the impact of
the event or initial problem, when coupled with formally defined
triggers, the Change Analysis method is very useful when trying to
determine if an RCA is required and to what level of detail.

Listed here are the steps that should be followed and the questions that
should be asked when facilitating a Change Analysis:

1. What happened? Interview all personnel directly and indirectly
involved in the incident. Preserve all physical evidence and fully
document the scene of the incident in order to later confirm the failure
mode and mechanisms.
2. When did it happen? Document the timeline of events that surrounds
the initiating incident. Collect eyewitness statements, video or
photographic evidence, and all data that supports your timeline.
3. Where did it happen? Identify the specific machine, system, or area
where the incident occurred. Gather information pertaining to similar
occurrences, including those that happened in other areas of the plant
or facility.
4. How did it happen? Itemize all changes in product specifications,
maintenance and operating practices or procedures, and changes to the
environment that may have contributed to the incident.
5. Who was involved? List the interviewees directly or indirectly
involved in the incident, making sure to include those individuals or
organizations who responded after the incident.
6. What was the effect or impact on the organization? Gather data
relative to downtime, product loss, waste, scrap, and other financially
quantifiable effects resulting from the incident.

Problem Statement

The number one barrier to effective problem solving is starting an
analysis with a poorly defined problem statement. Fortunately, the
result of either pre-analysis method is a much more clearly defined
problem statement for beginning the analysis. After the incident
analysis is completed, it is time to write the problem statement. The
problem statement should be written in terms of the part or equipment,
the defect, and the impact of the defect.

Here are a few things to remember when writing the problem statement:

• No storytelling, stick to the facts
• Follow the events, not the blame
• Details are better than opinions
• Do not jump to conclusions or try to propose solutions

Rationalize

Document Physical Evidence

Physical root causes are the first to be analyzed within the Resolve
phase. Physical evidence helps the RCA team evaluate and eliminate
suspected causal chains during the RCA. This shortens the time it takes
to analyze the problem.

When documenting physical evidence associated with the incident, it is
helpful to think in terms of the defect that is evident for a specific
part and the reason why it occurred. This is known as the “failure
mechanism”, a term used to describe the
chain of events that led to the failure. A failure mechanism is actually
a single statement that contains the device, failure mode, and primary
means of failure, or “mechanism”. Documenting physical evidence in this
way will help the RCA team.

Types of Root Causes

Many RCAs stop at the physical root cause, where technical solutions can
be created. As such, human, systemic, and latent causes of problems are
not addressed. If the RCA is taken to the latent causes, then the team
can look at the cost and benefits of addressing the problem at each
level and determine the best level for a short-term and a long-term
solution. At each level moving down the tree in Figure 2, you see
expanded benefits, but in many cases at a higher cost and effort to
capture that benefit. It is important for the team to complete a
Cost-Benefit Analysis to determine where to address the problem.

Figure 2: RCA Elements Guide
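
The failure mechanism statement described under Document Physical Evidence (device, failure mode, and mechanism in a single statement) can be captured as a small record. The sketch below is illustrative only: the class name, the "due to" wording of the generated statement, and the example values are assumptions, not part of the R5 Cause Analysis process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMechanism:
    """One piece of physical evidence, expressed as a single statement."""

    device: str        # the part or equipment showing the defect
    failure_mode: str  # the defect that is evident
    mechanism: str     # the primary means of failure (why the defect occurred)

    def __post_init__(self) -> None:
        # All three elements are required for the statement to be complete.
        for name in ("device", "failure_mode", "mechanism"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must not be empty")

    def statement(self) -> str:
        return f"{self.device} {self.failure_mode} due to {self.mechanism}"


# Hypothetical example, not taken from the white paper.
example = FailureMechanism("Pump bearing", "seized", "loss of lubrication")
print(example.statement())
```

Keeping the three elements as separate fields makes it easy to spot evidence that names a device and a defect but never states the mechanism.
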
RCA Business Case and Charter

The last step in the Rationalize phase of the R5 Cause Analysis process
is to begin documenting the business case for moving forward. This is
not a “check the box” activity. In order to gain management’s commitment
to allocate resources to the analysis of root causes, and eventually
towards implementing solutions once the analysis is complete, it is
important to communicate the value to the business for doing so, what
success looks like, the plan for moving forward, and how progress and
results will be measured.

The tool that is commonly used to communicate all of this is the A3
charter. It is called an “A3” charter because everything that needs to
be communicated fits on a single sheet of A3-size paper (297 mm by
420 mm, close to the 11 inch by 17 inch tabloid sheet). The charter is
divided into boxes for Business Opportunity and Charter, Current
Condition, RCA, Target Condition, Proposed Action Plan, and Metrics
Plan.

Business Opportunity and Charter – The purpose of this box is to
communicate the problem statement and the effect this problem has on the
company’s ability to meet strategic objectives.

Current Condition – In this box, document the current condition or what
is known about the conditions that may have contributed to the problem.
Remember to capture what was learned during interviews about procedural
changes, changes to maintenance routines, changes to parts used on the
asset, or even environmental changes.

RCA – Usually, during the Rationalize phase, there is not enough
information to diagram the root causes of the problem (this happens
during the Resolve phase). However, if one of the basic RCA methods was
used as a way to brainstorm possible avenues to follow up on during the
analysis, then a preliminary graphic could certainly be placed in this
box to help build awareness around what the RCA team will be
investigating.

Target Condition – The “Target Condition” describes for leadership and
stakeholders what success looks like and what will change as a result of
implementing the solutions or corrective actions proposed by the RCA
team.

Proposed Action Plan – At first, this box will be populated with the
steps the team plans to take in order to analyze the problem. As the RCA
team identifies solutions, this box will be expanded to communicate
implementation and post-implementation plans.

Metrics Plan – The last component of the A3 charter is the “Metrics
Plan”, which illustrates how the organization will measure the progress
of the RCA team and how solutions will be evaluated after
implementation. It is a good practice to provide both milestones for the
team and a definition of performance indicators in this box.

As shown, initially you will only be able to populate two (2) or three
(3) boxes within the charter as the business case for RCA. In the Ratify
phase, you will return to this document to communicate how the RCA team
plans to tackle the issue at hand. Finally, as you finish the
investigation and begin to propose corrective actions, you will again
return to this document as a means of communicating with management and
other stakeholders what you found and how the team plans to mitigate the
problem in the future.

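The six boxes of the A3 charter, and the way they fill in over successive phases, can be modeled as a simple record. This is a minimal sketch: the class, method names, and example text are hypothetical, and only the six box names come from the description above.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class A3Charter:
    """Hypothetical container with one text field per A3 charter box."""

    business_opportunity_and_charter: str = ""
    current_condition: str = ""
    rca: str = ""
    target_condition: str = ""
    proposed_action_plan: str = ""
    metrics_plan: str = ""

    def populated_boxes(self) -> List[str]:
        """Names of boxes that already contain text."""
        return [f.name for f in fields(self) if getattr(self, f.name).strip()]

    def open_boxes(self) -> List[str]:
        """Names of boxes still to be filled in during later phases."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Illustrative content for a charter at the end of the Rationalize phase.
charter = A3Charter(
    business_opportunity_and_charter="Feed pump bearing failure caused 6 hours of downtime.",
    current_condition="Lubrication route changed two months before the failure.",
)
print(charter.populated_boxes())
print(charter.open_boxes())
```

Listing the open boxes gives the Sponsor a quick view of what still has to be added when the team returns to the charter in the Ratify and Resolve phases.
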
Ratify

RCA Team

With the business case clearly understood by management and other
stakeholders, you now need to assemble the team of people who will be
responsible for analyzing root causes and determining corrective action
solutions.

It is best to build a cross-functional group of experts who understand
the effects that operating, maintenance, and engineering procedures and
standards of practice have on asset performance. You will need to
identify those within the organization, or external to the organization,
who are intimately familiar with the assets involved in the incident.

There are a number of reasons why a cross-functional problem solving
team is the best model for facilitating an RCA. Often, when we are
trying to solve complex problems, we are too close to the problem to see
it for what it really is. Cross-functional teams help us expand our
perspective in order to see the big picture and more accurately find
solutions.

Cross-functional teams improve our ability to communicate the results of
the analysis and build buy-in for the solutions at all levels within the
organization. This ensures a higher likelihood that solutions will be
implemented as planned. A cross-functional team also allows us to divide
the analysis by function, which reduces the time it takes to complete
the analysis. Finally, by bringing people together with different
experiences and levels of knowledge, we are able to transcend functional
boundaries and more easily solve complex problems that require creative,
out-of-the-box thinking.

In addition to the RCA team, you will also need to identify who within
the organization will be designated to implement the corrective actions.
Answering this question up front ensures that the RCA team members will
not be distracted by the magnitude of work stemming from the solutions
they select to resolve the problem. This also creates an opportunity for
management to begin budgeting for implementation.

RCA Team Roles

A good place to start looking for RCA team members is the “who” list you
recorded and potentially interviewed during the incident analysis. Each
team member should be trained in the methods the RCA Facilitator plans
to use during the analysis.

“Cross-functional” also means multiple roles. There are three (3) types
of team members in the problem solving team structure:

Sponsor – This person owns the problem and is responsible for motivating
the team, ensuring that each person fully understands the problem
needing to be solved, and guiding decision making to ensure alignment
with the strategic objectives of the organization. The Sponsor is also
the team member responsible for communicating progress and results to
top management in order to maintain support for the process. The team’s
Sponsor should be a manager who has authority over implementation
resources, believes in the RCA program, and will actively support the
team’s efforts.

Facilitator – This person guides the team through the process and is
responsible for engaging team members in the analysis to ensure that all
perspectives are recognized and considered. The Facilitator is the owner
of the RCA process, which means he or she is responsible for maintaining
the team’s focus and the integrity of the analysis itself. One of the
key characteristics of the team’s Facilitator is that he or she is able
to remain objective, never trying to influence the team’s ideas or
decisions based on his or her own preconceived notions.

Contributor – The majority of team members will serve as Contributors.
Fundamentally, their responsibility is to participate as expert
witnesses to the problem at hand. Contributors are responsible for
generating ideas under the guidance of the Facilitator, providing
plausible solutions to resolve the problem, and working collaboratively
with implementation resources to ensure that the team’s vision is
realized. Contributors need to be willing to participate in discussions,
not just excited about telling others the way it was, is, and forever
shall be. Refer to the Change Analysis and identify those who were
closest to the event when it occurred, as they will have firsthand
knowledge of the situations leading up to and following the problem. It
is important to find people who can help build a complete picture around
the problem. Be cautious of those who have a limited perspective and are
unable to accept the perspective of others.

Resolve

The R5 Cause Analysis T3 Chart (Figure 3) is an excellent job aid to
help you remember when and how to use each of the eight (8) RCA methods
in a transitional scheme during the Resolve phase.

Figure 3: R5 Cause Analysis T3 Chart

Time-Based Methods

“Time” methods are preferred when analyzing accidents or undesirable
events in which the time sequence is critical to the evaluation of
combined contributing factors. These methods help the RCA team determine
if causal chains are in fact interrelated in time. Time methods can also
help illustrate the relationship of conditional factors that may appear
to be unrelated.

Time-based methods help organize seemingly random factors into a logical
sequence or scenario to explain how the incident happened.

There are four (4) steps to facilitating a time-based RCA:

1. The first thing that needs to be done is to organize the data
gathered during pre-analysis, or during troubleshooting and restoration
activities.
2. To remove the randomness of the event, the second step is to validate
the “primary” event sequence using the Sequence of Events method.
3. Next, identify the contributing factors that enabled the primary
event sequence. These are not actual occurrences; they are instead
supposed conditions or systemic circumstances that must have been
present in order for the event or events to occur. Contributing factors
are initially identified based on assumptions, but always check
assumptions with evidence.
4. The fourth and final step of the time-based RCA facilitation is to
prioritize how the RCA team will investigate known events or
contributing factors down to root causes in order to identify solutions
to prevent recurrence. Time-based methods are an intermediate step in
the overall transformational RCA that helps the RCA team and the
organization decipher random events and conditions and their
relationship to the incident. Typically, a tree-based or
transparency-based method is still needed in order to effectively solve
the real problem.

Sequence of Events

The best method to use when trying to identify the importance of each
contributing factor in the causal chain is Sequence of Events. This
method displays a horizontal causal chain, relative to time, leading up
to the specific problem needing to be solved. It is common, as well, to
document the events in time after the problem as these factors may have
led to the frequency at which the problem occurs.

When facilitating this method, it is a good practice to provide evidence
that supports your timeline. Evidence within the Sequence of Events
Analysis is known as “conditional causes” and may lead your RCA team to
discover other problems that must be resolved in order to effectively
eliminate the root cause of your initial problem. If you completed a
Change Analysis prior to beginning your Sequence of Events Analysis,
then you are more likely to have the evidence you need to clarify the
incident requiring your attention.

Most Facilitators will start by transferring the pre-analysis data to
sticky notes in order to easily separate events from conditional causes
and move evidence around within the analysis as ideas from the team are
contributed.

Record the events leading up to the incident. Events should be written
in a way that states what happened, not a condition, conclusion, or
suspected circumstance. Additionally, recording post-incident events
helps to identify if restoration or troubleshooting activities may be
contributing to the frequency of the incident.

Then, add the evidence collected to the diagram to validate the primary
event sequence. If an event is missing evidence, assign an action item
to a member of the RCA team to validate the event. In some situations,
it may be necessary to pause the analysis until each and every event has
been validated to prevent false conclusions as to what actually happened
leading up to the incident.

Forcing Functions

Once the primary event sequence has been validated, the next step is to
identify the contributing factors or “forcing functions”. We often refer
to these as “forcing functions” because they are the situations that
existed, or are perceived to have existed, that enabled the primary
events to result in an undesirable incident. There are two (2) types of
forcing functions most commonly used in time-based methods: conditional
and systemic.

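The validation step of the Sequence of Events method (order the events in time, attach evidence, and assign an action item for any event that lacks evidence) can be sketched as follows. The `Event` class, the helper functions, and the sample events are hypothetical and serve only to illustrate the bookkeeping.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Event:
    """One entry in the primary event sequence (what happened, and when)."""

    when: datetime
    what_happened: str
    evidence: List[str] = field(default_factory=list)


def build_timeline(events: List[Event]) -> List[Event]:
    """Return the events ordered by time, earliest first."""
    return sorted(events, key=lambda e: e.when)


def events_needing_validation(timeline: List[Event]) -> List[Event]:
    """Events without supporting evidence; each one needs an action item."""
    return [e for e in timeline if not e.evidence]


# Illustrative events only; they are not taken from the white paper.
events = [
    Event(datetime(2014, 3, 3, 14, 20), "Pump tripped on high motor current", ["alarm log"]),
    Event(datetime(2014, 3, 3, 13, 55), "Operator opened discharge valve fully"),
    Event(datetime(2014, 3, 3, 14, 5), "Flow rate rose above setpoint", ["historian trend"]),
]

timeline = build_timeline(events)
for e in events_needing_validation(timeline):
    print(f"Action item: validate '{e.what_happened}' ({e.when:%H:%M})")
```

Post-incident events can be added to the same list; sorting by time places them after the incident without any special handling.
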
Within the Sequence of Events method, we are going to identify the
conditional functions. Conditional functions are different from events
because they identify circumstances, such as asset parameters or
environmental changes, that could have contributed to an event or led to
the event causing the incident you are trying to solve. Some
practitioners will also refer to these circumstances as “conditional
causes”.

Conditional functions must also be validated using data collected prior
to the analysis or after the analysis by a member of the RCA team.
However, placement of these factors within the primary event sequence is
subjective and based on the knowledge and experience of the RCA team.
The goal is to capture the situations that existed within the timeline
that could lead the team closer to identifying the true root causes of
the problem.

Event and Causal Factors

When dealing with time-related problems in which various contributing
conditions or branched causal chains exist, it is best to expand the
Sequence of Events by using the Event and Causal Factors Analysis
method. This method helps your RCA team determine the relationship in
time of primary, secondary, and conditional causes, especially if the
team is expected to process a large volume of data, evidence, or
eyewitness accounts that appear to be unrelated to the physical events
that led to the accident or undesirable incident.

At this stage in the analysis, the RCA team should use the Fault Tree
Analysis method discussed in the following section to break down the
conditional causes that led to the accident or undesirable incident.
This will help determine corrective actions to prevent recurrence and
thus stop the rest of the primary sequence of events from happening in
the future.

The Event and Causal Factors Analysis method helps your RCA team
determine the relationship in time of secondary events and systemic
contributing factors, especially if the team is expected to process a
large volume of data, evidence, or eyewitness accounts that appear to be
unrelated to the primary events.

Tree-Based Methods

“Tree” methods are used to examine the undesired effects of a system,
such as the introduction of product defects and equipment breakdowns.
Tree methods present the possible causes identified by the RCA team in
branching scenarios that represent the logical ordering of known
factors, with each scenario then evaluated using evidence to determine
solution selection.

Five Why Analysis

The Five Why method is a basic RCA tool that evaluates possible causes
by asking why each event or factor occurred in a chained progression,
typically from top to bottom. The reason for the “5” in the “Five Why”
is to ensure that human and potentially systemic root causes are
documented in the causal chain. Stopping before the 5th “Why” may only
capture the physical events that occurred and may not provide enough
detail for effective solution selection.

The Five Why method is facilitated by asking why a condition exists. The
progression of conditions can shift from the physical roots, to human,
then systemic. At the fifth “Why”, we transition to the lowest element
of root cause, the latent cause.

The Five Why method is best used on the shop floor by Operators and
Technicians as a basic problem solving method to quickly and simply
record the events that occurred leading up to the failure or quality
issue. This method is not suitable for complex problems because it is
limited to a single causal chain.

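A Five Why chain is a single list of causes, each answering “why” for the one above it. The sketch below records such a chain and checks whether it has moved past the physical roots. The function names, the root-type labels used as tags, and the example answers are assumptions made for illustration.

```python
from typing import List, Tuple

# Root cause levels named in the text, from physical down to latent.
ROOT_TYPES = ("physical", "human", "systemic", "latent")


def five_why_chain(problem: str, answers: List[Tuple[str, str]]) -> List[str]:
    """Format one causal chain; each answer is a (cause, root_type) pair."""
    lines = [f"Problem: {problem}"]
    for depth, (cause, root_type) in enumerate(answers, start=1):
        if root_type not in ROOT_TYPES:
            raise ValueError(f"unknown root type: {root_type}")
        lines.append(f"Why {depth}: {cause} [{root_type}]")
    return lines


def reached_beyond_physical(answers: List[Tuple[str, str]]) -> bool:
    """True when at least one answer is a human, systemic, or latent cause."""
    return any(root_type != "physical" for _, root_type in answers)


# Hypothetical chain, not taken from the white paper.
answers = [
    ("Bearing overheated", "physical"),
    ("Bearing ran without grease", "physical"),
    ("Lubrication task was skipped", "human"),
    ("Route sheet did not list the bearing", "systemic"),
    ("No review step exists when routes are revised", "latent"),
]

for line in five_why_chain("Pump bearing seized", answers):
    print(line)
print(reached_beyond_physical(answers))
```

Because the structure is a flat list, it cannot represent branches, which mirrors the limitation noted above: one chain, one path of causes.
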
Fault Tree Analysis

A Fault Tree Analysis is simply a branched Five Why. When you are faced with a multi-faceted problem that could have long causal chains, the Fault Tree method is the preferred approach in order to achieve a common understanding of all of the major factors that could have contributed to the system’s undesired effect. This is an advanced method and is a better tool to use than Five Why when trying to solve complex, equipment-related problems. We must remember that when dealing with equipment-related problems we always have a minimum of two (2) causes that exist at the same point in time, a conditional cause and an actionable cause. This means that directly under your effect or problem needing to be solved, you will have at least two (2) causal chains. For this reason alone, the Five Why method is inadequate.

Logic Tree Analysis

The Logic Tree Analysis method is used to examine the various scenarios represented in a fault tree using logic to determine if causal chains are independent or interrelated.

This method uses “And” statements to illustrate that two (2) or more chains are related in time and both must occur to cause the problem. RCA teams, and their sponsors, love to see “And” statements because it reduces the number of solutions that have to be implemented. When you have two (2) causal factors that are linked by “And”, you only have to eliminate one (1) to effectively prevent the problem from occurring again in the future.

“Or” statements are used to illustrate the opposite, that each chain or branch independently causes the problem with no relationship to other factors. With an “Or” statement, you must implement a solution for each cause in order to prevent reoccurrence.

When you are transitioning from the Fault Tree to the Logic Tree Analysis, you will walk the team backwards through the diagram, from bottom to top. This helps the team think sequentially and makes it easier to decide if causal factors are related in time or are independent.

Transparency-Based Methods

“Transparency” methods are used to proactively identify product design, safety, quality, or reliability problems that have the potential to impact your organization’s ability to meet strategic objectives. These methods create visibility of unknown relationships between systems, machines, and components, as well as the control mechanisms, such as standard operating procedures and preventive maintenance routines, that may be ineffective in mitigating risk.

Cause and Effect (Fishbone Diagram)

A Cause and Effect Diagram (also known as a “fishbone diagram”) is a basic brainstorming tool used to illustrate the relationships of various causal factors that may contribute to the problem, or “effect”. Most practitioners facilitate this brainstorming process by creating four (4) branches, one (1) for each causal factor category. We call these branches the “4 Ms”, which stand for Machine, Methods, Materials, and Man. This allows you and the RCA team to organize your thoughts to better understand what causal factors need to be analyzed further using the Simplified Failure Mode and Effects Analysis (sFMEA) or Failure Mode, Effects, and Criticality Analysis (FMECA) advanced transparency methods.

Simplified Failure Mode and Effects Analysis

sFMEA is used to identify likely failure modes in a top-down approach from system to component. We call it “simplified” because this form of Failure Mode Analysis (FMA) stops at the component level. Instead of examining the individual failure modes and effects of replacement spares such as fasteners, gaskets, and springs, the sFMEA looks at the
relationship of these parts to their parent component or machine as the potential causes of failure. The relationship between component, part, and problem is what we call the failure mode, and the relationship between problem and cause is known as the failure mechanism. Combining the two forms the complete root cause statement.

From here, we can identify if a new risk mitigating action, or “control”, is needed to prevent the failure mechanism from occurring.

One of the advantages of starting your analysis with the Cause and Effect method is that it helps the team gain a common understanding of the big picture issues, especially if team members came to the problem solving event prepared to contribute ideas based on their cross-functional perspectives. The downside of the sFMEA method is that the team’s view point during the analysis is limited to what they can see on the screen, or in the template. The Facilitator will routinely need to refresh the big picture perspective by summarizing the analysis as it unfolds, in effect reconnecting the cause and effect dots in people’s minds.

Failure Mode, Effects, and Criticality Analysis

The FMECA method allows the team to quantify the risk priority of each identified failure mode within the sFMEA. A FMECA analyzes risk relative to how severely the failure mode impacts organizational objectives, such as production capacity, the probability that the failure mode will occur again in the future, and how likely it is that your organization will detect the onset of the failure mode before the effect is realized by the organization. The sum of these three (3) risk factors is known as the Risk Priority Number (RPN) of the failure mode and can be used to prioritize solution selection. This is particularly valuable when comparing the effectiveness of current controls and potential solutions.

Realize

Solution Selection

Based on the thresholds established by the RCA team, the last step in the transparency RCA method is to identify corrective actions that will reduce the overall risk associated with the loss of function.

Once the results of the RCA have been captured, the team will go through this solution selection process. Ideally, every potential failure mode will be addressed, but that might not be economically feasible based on the boundaries and challenges communicated by the Sponsor.

Effective solution selection comes down to three (3) factors:

• The solution must prevent the incident and problem you are trying to solve from recurring or at least mitigate the risk.
• The solution must be within the control of your organization to implement without external limitations or constraints.
• The solution must align with the values and strategic objectives of your organization.

In order to ensure that the solutions provide a reasonable value to the organization to offset the cost of implementation, it is recommended that a solution rating matrix be established. For example, each solution could be evaluated based on its ability to impact chosen strategic objectives such as Cost, Quality, Delivery, Environmental Performance, and the Safety and Health of employees and the community surrounding the facility.

Along with the matrix, you and the RCA team will need to determine the minimum required score for solution selection. Remember the “Sponsor” role from the RCA team structure? The Sponsor is an advocate and advisor to the RCA team who represents the direction and perspective of stakeholders, but also helps to remove barriers during the RCA process. When establishing solution
selection criteria, consult your Sponsor for guidance to ensure that management will continue to support the implementation of corrective actions.

Risk Priority Number

As we already stated, the RPN is the sum of three (3) risk factors: severity, occurrence, and detectability. The Facilitator must guide the team to identify the level of risk in each factor and determine which failure modes are the most significant to the organization’s ability to resolve the problem at hand. It is recommended that a minimum threshold be established for solution selection. For example, the team could agree that failure modes that are unlikely to occur will not be addressed in solution selection. Or, the team could decide that failure modes that have a minor impact on production performance will not be selected, regardless of the probability of occurrence. This needs to be a consensus decision and it is the responsibility of the Facilitator to guide the team to an agreed upon threshold.

TECHNICAL NOTE:
There are three (3) accepted ways to calculate RPN using the severity, occurrence, and detectability risk factors. First, there is the simple calculation that sums the severity, occurrence, and detection risk factors in order to determine risk priority. Second, there is the traditional calculation, which is to multiply these same three (3) risk factors together to produce an RPN between ‘1’ and ‘1000’. This is a widely accepted practice as it provides more granularity in the analysis. If the FMECA returns a large number of potential causes, use this traditional RPN calculation to clearly separate one risk from another. The third and very common variation to the risk calculation is to divide the product of the three (3) risk factors by the total points possible. This weights the three (3) risk factors and produces an RPN relative to 100% of the total possible risk. Many practitioners use this methodology because it is easier to relate risk to non-technical associates in terms of a percentage.

The trick to facilitating solution selection using the FMECA method is to focus ideas on preventing the potential causes of each failure mode. RCA teams are commonly sidetracked in this portion of the analysis by focusing their attention on the failure effects and trying to determine how to improve their ability to detect the symptoms of failure. This is a reactive way of thinking.

Once the team has identified all of the recommended actions for each failure mode, and there could be more than one (1) per failure mode, guide the team back through the risk evaluation as a means of verifying that the proposed solutions will reduce the likelihood of occurrence or improve the organization’s ability to detect the failure mode before a loss of function. Only redesign solutions that call for functional redundancy will reduce the severity of impact risk.

The before and after RPN values are an excellent data point to use when developing the business case for solution implementation. The post-solution risk values can also be used to track and validate the effectiveness of each solution.

Corrective Action Tracking

Communication is the key to success of any RCA program. Ensure that a Communication Plan is implemented to maximize knowledge, awareness, and recognition and to ensure solution implementation. This includes training plans for embedding any human, systemic, or latent root solutions.

During implementation, each corrective action chosen should be managed using standard project management processes and tracked with a Corrective Actions Tracking Log. Using a Corrective Actions Master List, enter each corrective action, the person who is responsible for it, and the completion date in a spreadsheet or project tracking tool. To help with tracking, create a separate list for actions that call for review, analysis, or
investigation. Also, projects or “nice-to-do” tasks should be kept separate from the Corrective Actions Master List, which should only include those specific items that result from a formal RCA investigation.

The Corrective Actions Tracking Log should be updated frequently and have the highest visibility in the organization. If a corrective action is not completed on time, an explanation should be provided and a new date assigned.

Metrics

There are two (2) types of metrics that should be implemented as part of the RCA program. The first type of metric measures the program itself. The second type of metric is designed to measure the solutions from the RCA investigations. These are driven by the behaviors that the solution from the RCA is meant to change. Recommended metrics include:

• Number of People Task Qualified to Facilitate Root Cause Investigations
• Number of Root Cause Investigations Performed
• Percent of Corrective Actions Implemented
• Mean Time to Implement Corrective Actions
• Percent of Maintenance Labor Consumed by RCA Corrective Action Resolution
• Percent of Problems Resolved within 90 days
• Percent of Problems Resolved within 12 months
• Percent of Assets Analyzed with Increasing Mean Time Between Failure
Figure 4: R5 Cause Analysis Metrics Structure
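As an illustration only, the following Python sketch shows one way a Corrective Actions Tracking Log could feed two of the metrics listed above, Percent of Corrective Actions Implemented and Mean Time to Implement Corrective Actions, and flag actions that are past due. The record layout, field names, and sample entries are assumptions made for this example. They are not a prescribed template, and a spreadsheet or project tracking tool would serve the same purpose.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CorrectiveAction:
    """One row of a hypothetical Corrective Actions Tracking Log."""
    description: str
    owner: str                        # person responsible for the action
    identified: date                  # date the RCA identified the action
    due: date                         # agreed completion date
    completed: Optional[date] = None  # None while the action is still open


def percent_implemented(log: List[CorrectiveAction]) -> float:
    """Percent of logged corrective actions that have been completed."""
    if not log:
        return 0.0
    done = sum(1 for a in log if a.completed is not None)
    return 100.0 * done / len(log)


def mean_days_to_implement(log: List[CorrectiveAction]) -> Optional[float]:
    """Average days from identification to completion, completed actions only."""
    durations = [(a.completed - a.identified).days
                 for a in log if a.completed is not None]
    return sum(durations) / len(durations) if durations else None


def overdue(log: List[CorrectiveAction], today: date) -> List[CorrectiveAction]:
    """Open actions past their due date; each needs an explanation and a new date."""
    return [a for a in log if a.completed is None and a.due < today]


# Hypothetical sample entries, for illustration only.
log = [
    CorrectiveAction("Revise lubrication PM", "Owner A",
                     date(2014, 1, 6), date(2014, 2, 5), date(2014, 1, 26)),
    CorrectiveAction("Update start-up SOP", "Owner B",
                     date(2014, 1, 6), date(2014, 2, 5), date(2014, 2, 15)),
    CorrectiveAction("Retrain operators", "Owner C",
                     date(2014, 1, 20), date(2014, 2, 19)),
    CorrectiveAction("Add alignment check", "Owner A",
                     date(2014, 2, 3), date(2014, 3, 5)),
]

print(percent_implemented(log))      # 50.0
print(mean_days_to_implement(log))   # 30.0
print([a.description for a in overdue(log, date(2014, 3, 1))])
```

Keeping the identification date alongside the due date and completion date is a design choice of this sketch: it allows lead time to be measured from the point the RCA produced the action rather than from the date the work was scheduled.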
Number of People Task Qualified to Facilitate Root Cause Investigations

This metric is designed to quantify the organization’s capacity to investigate problems using RCA techniques. The intent is not to make every engineer, manager, or employee an expert in RCA facilitation. Instead, the goal should be to ensure that an adequate number of associates are task qualified in multiple problem solving techniques and have a demonstrated ability to lead a cross-functional group through the RCA process. A target to aim for would be 100% of those roles within the organization that have responsibility for problem resolution, such as Maintenance Engineers, Reliability Engineers, and Continuous Improvement Leaders, plus 23% of operating and maintenance resources that are expected to perform initial investigations. The number of task qualified personnel should be proportionate to the engineering and maintenance organizations’ capacities to execute corrective actions.

Number of Root Cause Investigations Performed

Although the focus of the RCA program should be to solve problems, not simply investigate problems, early on in the deployment of the program, the organization will need to gauge its ability to consistently apply the RCA process. There is no sense in evaluating corrective actions if investigations are infrequent, as the overall benefit to the organization will be insignificant. So, how many root cause investigations should be completed in order to justify continued sponsorship for the program? The answer is simple: 100% of those problems that can be attributed to the organization’s triggers. When calculating this metric, count the number of investigations performed relative to the number of triggers met. In doing so, the organization will be able to determine if the program is being consistently executed. If the triggers are too aggressive, and a low percentage of investigations are performed, then the triggers need to be refined to ensure adequate capacity for RCA. This is why we start with triggers.

Percent of Corrective Actions Implemented

Problems will not go away unless corrective actions, identified through formal RCA, are implemented. This metric evaluates the organization’s discipline to implement corrective actions. A reasonable target is 80% of the identified solutions that do not require capital investment. Using this target turns the organization’s attention towards those solutions that are within the control of the local organization. In many cases, changes to standard operating procedures and maintenance practices are identified and completely within the organization’s ability to implement. As we have discussed, the RCA team is responsible for evaluating proposed solutions prior to presenting the results of the investigation to leadership. Their evaluation criteria should consider whether or not each solution is within the organization’s ability to implement without external constraints.

Mean Time to Implement Corrective Actions

In the event that the percent of corrective actions implemented is below the agreed-upon target, the organization should evaluate its capacity to execute solutions. The first of two (2) metrics that will enable decisions to be made relative to corrective action implementation is the mean time to implement. This metric will help identify constraints relative to the maintenance backlog, or the total volume of maintenance work divided by the number of net available labor hours per week. Ideally, corrective actions will be implemented within 30 days. This metric is looking for the mean, the average lead-time, so some solutions may take longer or may be implemented sooner than 30 days. A mean time greater than 30 days may be the result of a maintenance backlog greater than 6 weeks, meaning new work orders that enter the backlog will take longer than 6 weeks to plan, schedule, and execute due to labor and/or material constraints. With this metric in place, the organization can determine if additional improvements are required within the maintenance work management process in order to fully realize the benefits of the RCA program.

Percent of Maintenance Labor Consumed by RCA Corrective Action Resolution

If the mean time to implement corrective actions is within the desired target, but a low percentage of corrective actions are being implemented, the organization must look at the percent of maintenance labor consumed by RCA corrective action resolution.
To enable this metric, the computerized maintenance management or enterprise asset management system must contain a work order code that links the consumption of labor and materials to RCA corrective actions. With this visibility within the work order system, the organization can ascertain whether or not the RCA program is causing an increase in maintenance backlog, thus preventing more solutions from being implemented (“flooding the system”, so to speak), or if the volume of available labor hours per week is insufficient due to other, higher priority work orders. It is not uncommon within a new RCA program to still have a lot of “firefighting” going on. These emergent work orders consume maintenance labor that could be otherwise allocated to permanently resolving the same issues that are causing the reactive behavior. Having this metric in place allows the organization to make priority decisions in the short term that will improve results long term.

Percent of Problems Resolved within 90 days

Up to this point, the organization has evaluated its capacity to apply the RCA process, and made decisions to improve the implementation of corrective actions. Next it is important to evaluate the results. First, the organization should evaluate the effectiveness of solutions in the short term. Using the triggers identified by the organization, calculate the percent of problems (i.e., triggers met) that did not recur within 90 days of corrective action implementation. For example, if a critical asset failure resulted in more than 4 hours of downtime, and this was a trigger for the production area, run a maintenance history report, using the asset identification number and the failure code associated with the trigger, for the last 90 days and determine if the same event occurred after implementation of the corrective actions. If the event did not reappear in the history report, it can be considered a short-term victory and should be reevaluated at the 12-month mark. If the event did occur after corrective actions were implemented, the RCA Facilitator should reopen the investigation and determine, using the Failure Mode and Effects Analysis technique, if the event was a result of a root cause that was not identified in the first investigation, or if the solutions implemented were insufficient in resolving the problem. This is known as a dynamic RCA and shows why it is important to retain a formal record of each investigation.

Overall, the focus of this metric is not to achieve perfection but to find opportunities to improve the application of investigation and problem solving techniques, and to increase the organization’s understanding of repetitive problems in order to successfully eliminate root causes. With new RCA programs, a good target for this metric would be 60%. As organizational maturity increases, and reactive practices are replaced with proactive solutions, a target of 100% is not unrealistic at the 90-day interval.

Percent of Problems Resolved within 12 months

This metric is similar to the 90-day metric; however, the intent of this metric is to determine if the corrective actions implemented were sustainable. The calculation is similar, just broadening the history report to 12 months rather than 90 days. A good target would be 100% of those problems that were resolved at the 90-day interval and 60% of those that were still evident at 90 days. Using both the 90-day and 12-month metrics to evaluate solution effectiveness ensures that program successes do not go unrecognized and provides a series of milestones from which the organization can gauge program maturity. Those problems that are unsuccessfully resolved within 12 months may require help from external resources in order to bring additional knowledge and perspectives to the analysis. These events should also be prioritized if capital solutions were identified but not provisioned for during the short-term corrective action selection process.
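The 90-day and 12-month calculations described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the record layout, asset identifiers, failure codes, and dates are hypothetical, “12 months” is approximated as 365 days, and each problem is assumed to be evaluated only after its window has fully elapsed. In practice the failure dates would come from the maintenance history report for the asset and failure code tied to the trigger.

```python
from datetime import date, timedelta


def recurred_within(implemented, failure_dates, window_days):
    """True if the same event reappears within window_days after implementation."""
    cutoff = implemented + timedelta(days=window_days)
    return any(implemented < d <= cutoff for d in failure_dates)


def percent_resolved(problems, window_days):
    """Percent of problems (triggers met) with no recurrence inside the window."""
    if not problems:
        return 0.0
    resolved = sum(
        1 for p in problems
        if not recurred_within(p["implemented"], p["failures"], window_days)
    )
    return 100.0 * resolved / len(problems)


# Hypothetical history: one entry per trigger met. "failures" lists the dates
# on which the same asset and failure code appear in the history report.
problems = [
    {"asset_id": "P-101", "failure_code": "BRG",
     "implemented": date(2013, 1, 10), "failures": []},
    {"asset_id": "C-205", "failure_code": "SEAL",
     "implemented": date(2013, 2, 1), "failures": [date(2013, 3, 15)]},
    {"asset_id": "M-310", "failure_code": "ALIGN",
     "implemented": date(2013, 3, 1), "failures": [date(2013, 9, 20)]},
    {"asset_id": "F-412", "failure_code": "BELT",
     "implemented": date(2013, 4, 1), "failures": [date(2013, 3, 1)]},
]

print(percent_resolved(problems, 90))    # 75.0
print(percent_resolved(problems, 365))   # 50.0
```

In this sample, asset M-310 counts as a short-term victory at 90 days but recurs before the 12-month mark, which is the kind of case the second metric is meant to surface. Failures dated before implementation, as with F-412, are ignored.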
Percent of Assets Analyzed with Increasing
Mean Time Between Failure
The ultimate goal of the RCA program, relative to
asset and process reliability, is to see an increase in
asset Mean Time Between Failures (MTBF), or the
average duration between functional failures,
regardless of failure mode. This metric can be easily
translated into organizational value. If the asset is
available to operations over longer periods of time,
and assuming the product(s) produced by the asset
are in demand or “sold out”, then every hour of
additional availability equates to more revenue or
contribution margin for the organization. As the
MTBF increases, the window of asset availability
increases and the frequency of maintenance activity
decreases. As a result, the organization can also
relate increases in MTBF to reductions in
maintenance material and contract or overtime
labor costs.
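To make the definition concrete, the following Python sketch computes MTBF as the average interval between consecutive functional failure time stamps, then reports the percent of analyzed assets whose current MTBF exceeds a baseline. It is a simplified illustration: the function names, asset identifiers, and values are assumptions for this example, and the interval calculation does not subtract repair or planned downtime, which a given organization may choose to exclude.

```python
from datetime import datetime


def mtbf_hours(failure_times):
    """Mean time between consecutive functional failures, in hours."""
    times = sorted(failure_times)
    if len(times) < 2:
        return None  # at least two failures are needed to form an interval
    gaps = [(later - earlier).total_seconds() / 3600
            for earlier, later in zip(times, times[1:])]
    return sum(gaps) / len(gaps)


def percent_improving(baseline, current):
    """Percent of analyzed assets whose current MTBF exceeds the baseline MTBF."""
    assets = [a for a in baseline if a in current]
    if not assets:
        return 0.0
    improved = sum(1 for a in assets if current[a] > baseline[a])
    return 100.0 * improved / len(assets)


# Hypothetical failure time stamps for one asset, regardless of failure mode.
failures = [datetime(2014, 1, 1), datetime(2014, 1, 11), datetime(2014, 1, 31)]
print(mtbf_hours(failures))   # 360.0

# Hypothetical baseline and current MTBF values (hours) per analyzed asset.
baseline = {"P-101": 300.0, "C-205": 500.0, "M-310": 400.0, "F-412": 250.0}
current = {"P-101": 360.0, "C-205": 450.0, "M-310": 620.0, "F-412": 275.0}
print(percent_improving(baseline, current))   # 75.0
```

Because the result is a simple percentage of assets, it can be trended from one review period to the next without setting a fixed target.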
For this metric, the organization must be capable of tracking failure and maintenance history within the computerized maintenance management or enterprise asset management system using event or time stamps. Typically, this metric is not calculated within the first 12 months of RCA program deployment; however, to enable this metric, the organization will need to set a baseline MTBF for each asset triggering a root cause investigation. After the first 12 months, compare the current MTBF of assets analyzed through the RCA process against the initial baselines collected within the Recognize phase. Then calculate the percent of assets analyzed that have an increasing MTBF. There is no set target; the results of this metric should be trended over time as a measuring stick for program maturity. On a per asset basis, however, it is recommended that results be shared with leadership in order to demonstrate the value realized by the organization from RCA and sustain sponsorship for continued deployment.

About Allied Reliability Group

Allied Reliability Group (ARG) offers best-in-industry maintenance, reliability, and operational consulting and services, training, staffing, and integrated software solutions servicing the industrial and manufacturing sector.

Reliability… it's in our DNA.

For more information about Allied Reliability Group, please contact:

Global Headquarters
843.414.5760
[email protected]
www.alliedreliabilitygroup.com
© 2014 Allied Reliability Group The Basics of Root Cause Analysis