Software Reliability: What Went Wrong? How to Fix It?
Ram Chillarege, Chillarege, Inc.
Almost 50 years ago, software reliability was defined with a hardware mindset. While the research community grew, its industry acceptance was muted. Rethinking the definitions of failures and faults will usher in meaningful research and create value for the practice.

From the average person to tech gurus, the phrase “software reliability” conjures up endless frustrations that plague the use of modern software. However, the academic community intended a narrow definition of “software reliability”: failure of running software. This often alluded to the catastrophic stoppage of a program or service, measured by metrics such as the mean-time-between-failures. All good, when one considers that most industrial products are rated by similar metrics. But therein lies the issue. Just one metric? Software demands new notions of reliability to fully capture its behavior and remain meaningful, especially today.

The notions of software reliability were developed over five decades ago. It evolved as a dual to its hardware cousin and soon found acceptance in the software testing community. But that is where it remained. With time, software reliability’s importance waned since it primarily served as an assessment metric. Its ability to provide insight into the causes of poor reliability was superficial and certainly not actionable. To make things worse, software testing is the tail of an organization that is driven by headstrong architects and developers whose priorities are dictated by the whims of the business. Today, with 30 million software engineers worldwide clamoring for every new incremental technology that can boost their productivity, software reliability remains on the sidelines.

Can this be changed? Most certainly, yes. Yes, if the software failure event is refined closer to the customer experience. And if the fault models capture the true nature of the corrective experience.
Digital Object Identifier 10.1109/MC.2024.3431968
Date of current version: 21 October 2024
Notes from the Field, Computer, published by the IEEE Computer Society, November 2024. 0018-9162/24 © 2024 IEEE

HARD → SOFT
The original definition of a software failure was simple: A program either worked correctly or incorrectly. It appeared like the dual of a digital hardware failure. The simplicity had elegance but failed to grasp the complexities in software failure. This classical notion of software reliability should be called HARD software reliability. That would explain the difficulty that most practicing software engineers have with accepting the contributions in the area of software reliability.

The reality of software failure is far more nuanced. For example, an enterprise financial application could be considered working for thousands of customers, while a subset may complain of a failure if the foreign currency service feed is delayed by 5 min, causing upheavals among a few brokers. Or imagine that a mobile application goes viral and a group of new customers finds that its registration process demands personal information that could be a legal issue in that nation. One asks, is that a software failure? Is that a functional software failure, or would that be deemed nonfunctional since it is a security issue and not a functional capability?

The classical notions of software reliability do not have a language to express these issues that plague the software developer. Neither do they have the means to aggregate these customer and business frustrations. What makes this shortcoming even more glaring is that over 90% of software tickets an organization sees are software failures that may never require a line of code to fix the issue. Instead, they may require changes to feeds, data, configurations, access levels, documentation, or UI that is autogenerated based on region.

As serious as this issue is, a little humor would not be misplaced: We have a HARD software reliability problem. And we need a truly SOFT software reliability fix. The rigidity of our notions of failure and fault could benefit from a softer touch. And from this, we will gain tremendous leverage in modeling and providing real value to the software development business.

SOFTWARE FAILURE
Let’s first focus on failure and how it affects the customer. When software failure occurs, it delivers pain. It does not have to be catastrophic. It might merely be poor response time or difficulty of usage. That is still a failure as far as the customer sees it. It can also be quite complex to identify. For instance, the bank records may show a transaction with an incorrect value, but the account balance is fine. And it may take that one customer who checks their accounts to notice it. Another example may involve a telco operator that is getting reports of dropped calls in a few zip codes while other customers in that very region are enjoying the streaming service with no glitches.

Software failure needs to be captured by the nature of pain and the degree of pain. I call these impact and severity. Severity is often captured on a scale of one to four. And impact is captured by around a dozen categories: reliability, performance, usability, integrity, security… and so forth. Just these two parameters allow us to describe, quite adequately, the consequence of the failure on the customer. The categories of impact can be grouped into functional and nonfunctional too. The only two functional categories are capability and new-requirements. It might seem strange that a failure attribute of pain is marked as new-requirements. It is just the nature of our software business that quite often the service stream identifies new features through customer demand.

SOFTWARE FAULT
One of the definitions of a fault is that it is the cause of the failure. But what exactly is the fault? A fairly exact representation is possible if we ask what was changed to rectify the fault. The change could occur in different artifacts: code, data, design, or even documentation. The object touched is called the target. And more specifically, the change made in that target is called the defect type. Examples of defect type for the code target are assignments, checking, algorithms, and so on. The defect type describes the semantics of the change in the space of the target. Data would have a different set of types as compared to code. The combination of target and type provides a fairly complete description of the software fault. Multiple targets are often involved in any one fix. For example, the fix for a failure may need a design change, a code change, and a data source change. So, this one failure would be ascribed an impact and severity, and the corresponding fault would have six other attributes: three targets each with its defect type.

As an interesting aside, this structure for faults nicely captures events that IBM service teams called “nondefects.” Traditionally, development teams and management focused on the classical “defect,” which required a code change. The nondefects got
left by the wayside. As time evolved, nondefects became 90% of the service tickets, and there just was not the data or analysis to gain insights to bring processes under control. Orthogonal defect classification (ODC) viewed all changes with a uniform abstraction, be they defect or nondefect. This uniformity allowed for root cause analysis of nondefects leveraging tools and practices that already existed for defects.

Here are a couple of examples of nondefects just to illustrate the concept. One can have poor performance in a cluster of machines that provide a cloud service because a human configured the virtual machines incorrectly. Contrast that with an enterprise warehousing application where a series of customer orders were deemed incorrect because the UI prompted users in a manner that confused them. The code was correct, but the UI design was confusing. Both these instances required changes to correct the situation, but they were not code changes. In the first case, a manual process was reworked and hopefully documented to avoid a repeat failure. The failure would be tagged as nonfunctional (performance), with the fault tagged with a target of manual procedure. In the latter, UI designers need to get involved to ensure the help text provides greater clarity. While the impact is functional, the fault is tagged with a target of documentation.

Fault and failure data, when captured by these attributes, can lead to a great deal of understanding. While data is captured at the level of an individual defect, patterns emerge when looking across aggregates in time, process, or customer groups. The ODC metrics are quite different from the single-dimension mean-time-between-failures metrics of the past. Relationships between groups of data and their patterns can be mapped to the underlying development processes, yielding behavioral insights of skill groups.

ODC
These ideas on the expanded view of software faults and failures evolved into a framework that I called Orthogonal Defect Classification (ODC) [a]. At the core, it began with a clear articulation of software faults and failures, providing clarity beyond what was understood in the ’90s. Today, it has additional categories such as trigger, the force that activates faults, and source, which subdivides code across legacy, technology, and platforms. The collection of categories is mapped into a framework across four principal dimensions that link cause and effect. ODC data can be captured across the life cycle and historical releases to create a gold mine of organizational patterns. There are specific techniques to analyze customer segments and link their pain points to the development process. The same data, sliced differently, allows one to study technology platforms and skill groups within the organization.

Our experience with ODC across hundreds of projects in different verticals (telco, retail, warehousing, industrial applications, real-time systems, embedded systems, and mobile) has demonstrated the value and durability of this concept over the past 30 years.

Software engineering management is complex. And sophisticated tools are necessary to deal with the complexity and to provide focused analysis of problems. While the classical hard software reliability struggled due to its narrow definition, softening of the failure and fault model got us started over four decades ago. ODC, which evolved from there, reframed how we need to think of these data and created a far broader framework of business and technical insight.

Software reliability that is true to its SOFT nature is better served when expressed in a language closer to what’s experienced. The ODC terminology partitions the space of failures and faults into distinct event and action groups. Now, measures on these subspaces become meaningful to the customer and the developer. In today’s world where security issues now dominate the software landscape, ODC provides a language to clearly articulate the consequence of events and the ramifications of remedial measures.

RAM CHILLAREGE is president of Chillarege, Inc., Raleigh, NC 27613 USA. Contact him at ram5@chillarege.com.

[a] Orthogonal defect classification: https://en.wikipedia.org/wiki/Orthogonal_defect_classification
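
As an illustration only, the failure and fault attributes described in this article (impact and severity for a failure; one target and one defect type per change made in the fix) could be recorded roughly as in the Python sketch below. This is not ODC tooling or an official schema. The class names, helper functions, severity values, and the defect-type labels used for non-code targets are all assumptions made for the example, and the impact list holds only the categories the article names, not the full set of about a dozen.

```python
from collections import Counter
from dataclasses import dataclass, field

# Impact categories named in the article (the full set has about a dozen).
FUNCTIONAL_IMPACTS = {"capability", "new-requirements"}
NONFUNCTIONAL_IMPACTS = {"reliability", "performance", "usability",
                         "integrity", "security"}


@dataclass(frozen=True)
class Fault:
    """One change made while fixing a failure (hypothetical record)."""
    target: str       # object touched: "code", "data", "design", ...
    defect_type: str  # semantics of the change within that target


@dataclass
class Failure:
    """A customer-visible failure and the faults fixed to resolve it."""
    impact: str    # nature of the pain
    severity: int  # degree of the pain, on a scale of 1 to 4
    faults: list[Fault] = field(default_factory=list)

    def __post_init__(self):
        if not 1 <= self.severity <= 4:
            raise ValueError("severity must be between 1 and 4")
        if self.impact not in FUNCTIONAL_IMPACTS | NONFUNCTIONAL_IMPACTS:
            raise ValueError(f"unknown impact: {self.impact}")

    @property
    def is_functional(self) -> bool:
        return self.impact in FUNCTIONAL_IMPACTS

    @property
    def is_nondefect(self) -> bool:
        # A "nondefect" here means the fix required no code change.
        return bool(self.faults) and all(f.target != "code" for f in self.faults)


def nondefect_share(failures):
    """Fraction of failures whose fix touched no code."""
    return sum(f.is_nondefect for f in failures) / len(failures)


def count_by_target(failures):
    """Aggregate view: how often each target was changed across all fixes."""
    return Counter(fault.target for f in failures for fault in f.faults)


# Severity values and non-code defect-type labels are invented for illustration.
tickets = [
    # Cloud cluster slowed down by misconfigured virtual machines.
    Failure("performance", 2, [Fault("manual procedure", "configuration step")]),
    # Warehouse orders entered incorrectly because of confusing UI prompts.
    Failure("capability", 3, [Fault("documentation", "help text")]),
    # One fix spanning three targets, each with its own defect type.
    Failure("reliability", 1, [Fault("design", "interface"),
                               Fault("code", "checking"),
                               Fault("data", "source")]),
]

print(f"nondefect share: {nondefect_share(tickets):.2f}")
print(count_by_target(tickets))
```

Keeping each fault as a separate (target, defect type) pair is what lets a single failure carry several targets, and the same records can then be grouped by impact, severity, or target to look for the aggregate patterns the article describes.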