
Stochastic Modelling and Applied Probability 41

Terje Aven
Uwe Jensen

Stochastic Models in Reliability
Second Edition
Stochastic Modelling and Applied Probability 41
(Formerly: Applications of Mathematics)

Stochastic Mechanics
Random Media
Signal Processing and Image Synthesis
Mathematical Economics and Finance
Stochastic Optimization
Stochastic Control
Stochastic Models in Life Sciences


Edited by P.W. Glynn, Y. Le Jan

Advisory Board: M. Hairer, I. Karatzas, F.P. Kelly, A. Kyprianou, B. Øksendal, G. Papanicolaou, E. Pardoux, E. Perkins, H.M. Soner

For further volumes: http://www.springer.com/series/602
Terje Aven • Uwe Jensen

Stochastic Models
in Reliability
Second Edition

Terje Aven
University of Stavanger
Stavanger, Norway

Uwe Jensen
Fak. Naturwissenschaften
Inst. Angewandte Mathematik u. Statistik
Universität Hohenheim
Stuttgart, Germany

ISSN 0172-4568
ISBN 978-1-4614-7893-5 ISBN 978-1-4614-7894-2 (eBook)
DOI 10.1007/978-1-4614-7894-2
Springer New York Heidelberg Dordrecht London
Library of Congress Control Number: 2013942488

Mathematics Subject Classification (2010): 60G, 60K, 60K10, 60K20, 90B25

© Springer Science+Business Media New York 1999, 2013


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection
with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and
executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication
or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s
location, in its current version, and permission for use must always be obtained from Springer. Permissions
for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable
to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication,
neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors
or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the
material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


Preface

In this second edition of the book, two major topics have been added to the
original version. The first one relates to copula models (Sect. 2.3), which are
used to study the effects of structural dependencies on system reliability. We
believe that an introduction to the fundamental ideas and concepts of copula
models is important when reviewing basic reliability theory. The second new
topic we have included is maintenance optimization models under constraints
(Sect. 5.5). These models have been addressed in some recent publications to
meet the demand for models that adequately balance economic criteria and
safety. We consider two specific models. The first is the so-called delay time
model where the aim is to determine optimal inspection intervals minimizing
the expected discounted costs under some safety constraints. The second
model is also about optimal inspection, but here the system is represented
by a monotone (coherent) structure function. In addition, we have made a
number of minor adjustments to increase precision and we have also corrected
misprints.
We received positive feedback on the first edition from friends and
colleagues. Their hints and suggestions have been incorporated into this second
edition. We thank all who contributed, by whatever means, to preparing the
new edition.

Stavanger, Norway Terje Aven


Stuttgart, Germany Uwe Jensen

Preface to the First Edition

As can be seen from the files of the databases of Zentralblatt/Mathematical
Abstracts and Mathematical Reviews, about 1 % of all mathematical publications
are connected to the keyword reliability. This gives an impression of the
importance of this field and makes it clear that it is impossible to include all
the topics connected to reliability in one book. The existing literature on
reliability covers inter alia lifetime analysis, complex systems and maintenance
models, and the books by Barlow and Proschan [31, 32] can be viewed as first
milestones in this area. Since then the models and tools have been developed
further. The aim of Stochastic Models in Reliability is to give a comprehensive
up-to-date presentation of some of the classical areas of reliability, based on a
more advanced probabilistic framework using the modern theory of stochastic
processes. This framework allows the analyst to formulate general failure
models, establish formulas for computing various performance measures, as
well as to determine how to identify optimal replacement policies in complex
situations. A number of special cases analyzed previously can be included in
this framework. Our book presents a unifying approach to some of the key
research areas of reliability theory, summarizing and extending results obtained
in recent years. Having future work in this area in mind, it will be useful
to have at hand a general set-up where the conditions and assumptions are
formulated independently of particular models.
This book comprises five chapters in addition to two appendices.
Chapter 1 gives a short introduction to stochastic models of reliability,
linking existing theory and the topics treated in this book. It also contains an
overview of some questions and problems to be treated in the book. In addition
Sect. 1.1.6 explains why martingale theory is a useful tool for describing and
analyzing the structure of complex reliability models. In the final section of
the chapter we briefly discuss some important aspects of reliability modeling
and analysis, and present two real-life examples. To apply reliability models
in practice successfully, there are many challenges related to modeling and
analysis that need to be faced. However, it is not within the scope of this
book to discuss these challenges in detail. Our text is an introduction to the
topic and of motivational character.
Chapter 2 presents an overview of some parts of basic reliability theory: the
theory of complex (monotone) systems, both binary and multistate systems,
as well as lifetime distributions and nonparametric classes of lifetime
distributions. The aim of this chapter has not been to give a complete overview of
the existing theory, but to highlight important areas and give a basis for the
coming chapters.
Chapter 3 presents a general set-up for analyzing failure-prone systems.
A (semi-) martingale approach is adopted. This general approach makes it
possible to formulate a unifying theory of both nonrepairable and repairable
systems, and it includes point processes, counting processes, and Markov
processes as special cases. The time evolution of the system can also be analyzed
on different information levels, which is one of the main attractions of the
(semi-) martingale approach. Attention is drawn to the failure rate process,
which is a key parameter of the model. Several examples of application of the
set-up are given, including a monotone (coherent) system of possibly
dependent components, and failure time and (minimal) repair models. A model for
analyzing the time to failure based on risk reserves (the difference between
total income and accumulated costs of repairs) is also covered.
In the next two chapters we look more closely at types of models for
analyzing situations where the system and its components could be repaired
or replaced in the case of failures, and where we model the downtime or costs
associated with downtimes.
Chapter 4 gives an overview of availability theory of complex systems,
having components that are repaired upon failure. Emphasis is placed on
monotone systems comprising independent components, each generating an
alternating renewal process. Multistate systems are also covered, as well as
systems comprising cold standby components. Different performance measures
are studied, including the distributions of the number of system failures in a
time interval and the downtime of the system in a time interval. The chapter
gives a rather comprehensive asymptotic analysis, providing a theoretical basis
for approximation formulae used in cases where the time interval considered
is long or the components are highly available.
Chapter 5 presents a framework for models of maintenance optimization,
using the set-up described in Chap. 3. The framework includes a number of
interesting special cases dealt with by other authors.
By allowing different information levels, it is possible to extend, for
example, the classical age replacement model and minimal repair/replacement
model to situations where information is available about the underlying
condition of the system and the replacement time is based on this information.
Again we illustrate the applicability of the model by considering monotone
systems.
Chapters 3–5 are based on stochastic process theory, including theory
of martingales and point, counting, and renewal processes. For the sake of
completeness and to help the reader who is not familiar with this theory,
two appendices have been included summarizing the mathematical basis and
some key results. Appendix A gives a general introduction to probability and
stochastic process theory, whereas Appendix B gives a presentation of results
from renewal theory. Appendix A also summarizes basic notation and symbols.
Although conceived mainly as a research monograph, this book can also
be used for graduate courses and seminars. It primarily addresses probabilists
and statisticians with research interests in reliability. But at least parts of it
should be accessible to a broader group of readers, including operations
researchers and engineers. A solid basis in probability and stochastic processes
is required, however. In some countries many operations researchers and
reliability engineers now have a rather comprehensive theoretical background in
these topics, so that it should be possible to benefit from reading the more
sophisticated theory presented in this book. To bring the reliability field
forward, we believe that more operations researchers and engineers should be
familiar with the probabilistic framework of modern reliability theory.
Chapters 1 and 2 and the first part of Chaps. 4 and 5 are more elementary and do
not require the more advanced theory of stochastic processes.
References are kept to a minimum throughout, but readers are referred to
the bibliographic notes following each chapter, which give a brief review of
the material covered and related references.

Acknowledgments
We express our gratitude to our institutions, the Stavanger University College,
the University of Oslo, and the University of Ulm, for providing a rich
intellectual environment, and facilities indispensable for the writing of this book.
The authors are grateful for the financial support provided by the Norwegian
Research Council and Deutscher Akademischer Austauschdienst. We would
also like to acknowledge our indebtedness to Jelte Beimers, Jørund Gåsemyr,
Harald Haukås, Tina Herberts, Karl Hinderer, Günter Last, Volker Schmidt,
Richard Serfozo, Marcel Smith, Fabio Spizzichino and Rune Winther for making
helpful comments and suggestions on the manuscript. Thanks for TeXnical
support go to Jürgen Wiedmann.
We especially thank Bent Natvig, University of Oslo, for the great deal
of time and effort he spent reading and preparing comments. Thanks also go
to the three reviewers for providing advice on the content and organization
of the book. Their informed criticism motivated several refinements and
improvements. Of course, we take full responsibility for any errors that remain.
We also acknowledge the editing and production staff at Springer for their
careful work. In particular, we appreciate the smooth cooperation of John
Kimmel.

Stavanger, Norway Terje Aven


Ulm, Germany Uwe Jensen
Contents

1 Introduction
1.1 Lifetime Models
1.1.1 Complex Systems
1.1.2 Damage Models
1.1.3 Different Information Levels
1.1.4 Simpson’s Paradox
1.1.5 Predictable Lifetime
1.1.6 A General Failure Model
1.2 Maintenance
1.2.1 Availability Analysis
1.2.2 Optimization Models
1.3 Reliability Modeling
1.3.1 Nuclear Power Station
1.3.2 Gas Compression System

2 Basic Reliability Theory
2.1 Complex Systems
2.1.1 Binary Monotone Systems
2.1.2 Multistate Monotone Systems
2.2 Basic Notions of Aging
2.2.1 Nonparametric Classes of Lifetime Distributions
2.2.2 Closure Theorems
2.2.3 Stochastic Comparison
2.3 Copula Models of Complex Systems in Reliability
2.3.1 Introduction to Copula Models
2.3.2 The Influence of the Copula on the Lifetime Distribution of the System
2.3.3 Archimedean Copulas
2.3.4 The Expectation of the Lifetime of a Two-Component-System with Exponential Marginals
2.3.5 Marshall–Olkin Distribution

3 Stochastic Failure Models
3.1 Notation and Fundamentals
3.1.1 The Semimartingale Representation
3.1.2 Transformations of SSMs
3.2 A General Lifetime Model
3.2.1 Existence of Failure Rate Processes
3.2.2 Failure Rate Processes in Complex Systems
3.2.3 Monotone Failure Rate Processes
3.2.4 Change of Information Level
3.3 Point Processes in Reliability: Failure Time and Repair Models
3.3.1 Alternating Renewal Processes: One-Component Systems with Repair
3.3.2 Number of System Failures for Monotone Systems
3.3.3 Compound Point Process: Shock Models
3.3.4 Shock Models with State-Dependent Failure Probability
3.3.5 Shock Models with Failures of Threshold Type
3.3.6 Minimal Repair Models
3.3.7 Comparison of Repair Processes for Different Information Levels
3.3.8 Repair Processes with Varying Degrees of Repair
3.3.9 Minimal Repairs and Probability of Ruin

4 Availability Analysis of Complex Systems
4.1 Performance Measures
4.2 One-Component Systems
4.2.1 Point Availability
4.2.2 The Distribution of the Number of System Failures
4.2.3 The Distribution of the Downtime in a Time Interval
4.2.4 Steady-State Distribution
4.3 Point Availability and Mean Number of System Failures
4.3.1 Point Availability
4.3.2 Mean Number of System Failures
4.4 Distribution of the Number of System Failures
4.4.1 Asymptotic Analysis for the Time to the First System Failure
4.4.2 Some Sufficient Conditions
4.4.3 Asymptotic Analysis of the Number of System Failures
4.5 Downtime Distribution Given System Failure
4.5.1 Parallel System
4.5.2 General Monotone System
4.5.3 Downtime Distribution of the ith System Failure
4.6 Distribution of the System Downtime in an Interval
4.6.1 Compound Poisson Process Approximation
4.6.2 Asymptotic Analysis
4.7 Generalizations and Related Models
4.7.1 Multistate Monotone Systems
4.7.2 Parallel System with Repair Constraints
4.7.3 Standby Systems

5 Maintenance Optimization
5.1 Basic Replacement Models
5.1.1 Age Replacement Policy
5.1.2 Block Replacement Policy
5.1.3 Comparisons and Generalizations
5.2 A General Replacement Model
5.2.1 An Optimal Stopping Problem
5.2.2 A Related Stopping Problem
5.2.3 Different Information Levels
5.3 Applications
5.3.1 The Generalized Age Replacement Model
5.3.2 A Shock Model of Threshold Type
5.3.3 Information-Based Replacement of Complex Systems
5.3.4 A Parallel System with Two Dependent Components
5.3.5 Complete Information About T1, T2 and T
5.3.6 A Burn-In Model
5.4 Repair Replacement Models
5.4.1 Optimal Replacement Under a General Repair Strategy
5.4.2 A Markov-Modulated Repair Process: Optimization with Partial Information
5.4.3 The Case of m=2 States
5.5 Maintenance Optimization Models Under Constraints
5.5.1 A Delay Time Model with Safety Constraints
5.5.2 Optimal Test Interval for a Monotone Safety System

A Background in Probability and Stochastic Processes
A.1 Basic Definitions
A.2 Random Variables, Conditional Expectations
A.2.1 Random Variables and Expectations
A.2.2 Lp-Spaces and Conditioning
A.2.3 Properties of Conditional Expectations
A.2.4 Regular Conditional Probabilities
A.2.5 Computation of Conditional Expectations
A.3 Stochastic Processes on a Filtered Probability Space
A.4 Stopping Times
A.5 Martingale Theory
A.6 Semimartingales
A.6.1 Change of Time
A.6.2 Product Rule

B Renewal Processes
B.1 Basic Theory of Renewal Processes
B.2 Renewal Reward Processes
B.3 Regenerative Processes
B.4 Modified (Delayed) Processes

References

Index
1 Introduction

This chapter gives an introduction to the topics covered in this book: failure
time models, complex systems, different information levels, maintenance and
optimal replacement. We also include a section on reliability modeling, where
we draw attention to some important factors to be considered in the modeling
process. Two real-life examples are presented: a reliability study of a system
in a power plant and an availability analysis of a gas compression system.

1.1 Lifetime Models


In reliability we are mainly concerned with devices or systems that fail at an
unforeseen or unpredictable (this term is defined precisely later) random age
of $T > 0$. This random variable is assumed to have a distribution $F$,
$F(t) = P(T \le t)$, $t \in \mathbb{R}$, with a density $f$. The hazard or failure rate $\lambda$ is defined on
the support of the distribution by

$$\lambda(t) = \frac{f(t)}{\bar F(t)},$$

with the survival function $\bar F(t) = 1 - F(t)$. The failure rate $\lambda(t)$ measures the
proneness to failure at time $t$ in that $\lambda(t)\,\Delta t \approx P(T \le t + \Delta t \mid T > t)$ for small
$\Delta t$. The (cumulative) hazard function is denoted by $\Lambda$,

$$\Lambda(t) = \int_0^t \lambda(s)\,ds = -\ln\{\bar F(t)\}.$$

The well-known relation

$$\bar F(t) = P(T > t) = \exp\{-\Lambda(t)\} \tag{1.1}$$

establishes the link between the cumulative hazard and the survival function.
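As a quick numerical illustration of how $f$, $\bar F$, $\lambda$ and $\Lambda$ fit together, the following Python sketch assumes a Weibull lifetime with shape $k = 2$ and scale $\eta = 1$ (an arbitrary choice, not taken from the text). It integrates $\lambda$ numerically to obtain $\Lambda$, checks relation (1.1) against the closed-form survival function, and compares $\lambda(t)\,\Delta t$ with $P(T \le t + \Delta t \mid T > t)$ for a small $\Delta t$. All function names are hypothetical.

```python
import math

# Assumption for illustration: Weibull lifetime with shape K and scale ETA.
K, ETA = 2.0, 1.0

def hazard(t: float) -> float:
    """Failure rate lambda(t) = f(t) / Fbar(t) of the Weibull(K, ETA) distribution."""
    return (K / ETA) * (t / ETA) ** (K - 1)

def survival(t: float) -> float:
    """Closed-form survival function Fbar(t) = P(T > t) = exp(-(t/ETA)^K)."""
    return math.exp(-((t / ETA) ** K))

def cumulative_hazard(t: float, n: int = 10_000) -> float:
    """Lambda(t) = integral of lambda over [0, t], via the composite trapezoidal rule."""
    h = t / n
    total = 0.5 * (hazard(0.0) + hazard(t)) + sum(hazard(i * h) for i in range(1, n))
    return total * h

def conditional_failure_prob(t: float, dt: float) -> float:
    """P(T <= t + dt | T > t) = 1 - Fbar(t + dt) / Fbar(t)."""
    return 1.0 - survival(t + dt) / survival(t)

for t in (0.5, 1.0, 2.0):
    # Relation (1.1): Fbar(t) = exp(-Lambda(t)); both columns should agree.
    print(t, math.exp(-cumulative_hazard(t)), survival(t))

t, dt = 1.0, 1e-4
print(hazard(t) * dt, conditional_failure_prob(t, dt))  # nearly equal for small dt
```

For a shape parameter below 1 the failure rate is unbounded at $t = 0$, so the endpoint evaluation `hazard(0.0)` would fail and a different quadrature rule would be needed.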
Modeling in reliability theory is mainly concerned with additional information
about the state of a system, which is gathered during the operating time of
the system. This additional information leads to updated predictions about
proneness to system failure. There are many ways to introduce such additional
information into the model. In the following sections some examples of how to
introduce additional information and how to model the lifetime $T$ are given.

T. Aven and U. Jensen, Stochastic Models in Reliability, Stochastic Modelling and Applied Probability 41, DOI 10.1007/978-1-4614-7894-2_1, © Springer Science+Business Media New York 2013

1.1.1 Complex Systems

As will be introduced in detail in Chap. 2, a complex system comprises $n$
components with positive random lifetimes $T_i$, $i = 1, 2, \ldots, n$, $n \in \mathbb{N}$. Let
$\Phi : \{0,1\}^n \to \{0,1\}$ be the structure function of the system, which is
assumed to be monotone. The possible states of the components and of
the system, “intact” and “failed,” are indicated by “1” and “0,” respectively.
Then $\Phi_t = \Phi(X_t)$ describes the state of the system at time $t$, where
$X_t = (X_t(1), \ldots, X_t(n))$ and $X_t(i)$ denotes the indicator function

$$X_t(i) = I(T_i > t) = \begin{cases} 1 & \text{if } T_i > t \\ 0 & \text{if } T_i \le t, \end{cases}$$

which is 1 if component $i$ is intact at time $t$, and 0 otherwise. The lifetime $T$
of the system is then given by $T = \inf\{t \in \mathbb{R}_+ : \Phi_t = 0\}$.
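Since the state vector $X_t$ is right-continuous and only changes at the component failure times, for a monotone system with $\Phi(0, \ldots, 0) = 0$ the infimum defining $T$ is attained at one of the $T_i$. The minimal Python sketch below uses this observation to compute $T$ from realized component lifetimes and an arbitrary 0/1-valued structure function; the names `system_lifetime`, `series`, `parallel` and `two_out_of_three` are hypothetical and chosen for illustration only.

```python
from typing import Callable, Sequence

StructureFunction = Callable[[Sequence[int]], int]

def system_lifetime(phi: StructureFunction, lifetimes: Sequence[float]) -> float:
    """Return T = inf{t >= 0 : phi(X_t) = 0} with X_t(i) = I(T_i > t).

    X_t is right-continuous and only jumps at the T_i, so it is enough to
    evaluate phi at each component failure time in increasing order.
    """
    for t in sorted(lifetimes):
        x_t = [1 if t_i > t else 0 for t_i in lifetimes]  # X_t(i) = I(T_i > t)
        if phi(x_t) == 0:
            return t
    return float("inf")  # only possible if phi(0, ..., 0) = 1

def series(x: Sequence[int]) -> int:
    return min(x)

def parallel(x: Sequence[int]) -> int:
    return max(x)

def two_out_of_three(x: Sequence[int]) -> int:
    return 1 if sum(x) >= 2 else 0

lifetimes = [3.0, 1.0, 2.0]
print(system_lifetime(series, lifetimes))            # 1.0, the minimum of the T_i
print(system_lifetime(parallel, lifetimes))          # 3.0, the maximum of the T_i
print(system_lifetime(two_out_of_three, lifetimes))  # 2.0
```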

Example 1.1. As a simple example the following system with three components
is considered, which is intact if component 1 and at least one of the
components 2 or 3 are intact:

[Figure: component 1 in series with the parallel pair of components 2 and 3]

In this example $\Phi_t = X_t(1)\{1 - (1 - X_t(2))(1 - X_t(3))\}$ is easily obtained
with $T = \inf\{t \in \mathbb{R}_+ : \Phi_t = 0\} = T_1 \wedge (T_2 \vee T_3)$, where as usual $a \wedge b$ and
$a \vee b$ denote $\min\{a, b\}$ and $\max\{a, b\}$, respectively. The additional information
about the lifetime $T$ is given by the observation of the state of the single
components. As long as all components are intact, only a failure of component
1 leads to system failure. If one of the components 2 or 3 fails first, then the
next component failure is a system failure.
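For Example 1.1 the system lifetime can be simulated directly from $T = T_1 \wedge (T_2 \vee T_3)$. If the components are independent, $P(T > t) = E[\Phi_t]$ is obtained by substituting $p_i = P(T_i > t)$ into the structure function, because $\Phi$ is linear in each argument. The Python sketch below assumes exponential component lifetimes with hypothetical failure rates and compares this exact value with a Monte Carlo estimate.

```python
import math
import random

def phi(x1, x2, x3):
    """Structure function of Example 1.1: component 1 in series with the parallel pair (2, 3).
    Works for 0/1 states and, by multilinearity, for component reliabilities p_i."""
    return x1 * (1 - (1 - x2) * (1 - x3))

def reliability_exact(t, rates):
    """P(T > t) = phi(p1, p2, p3) with p_i = exp(-rate_i * t) (independent exponentials)."""
    return phi(*(math.exp(-r * t) for r in rates))

def reliability_mc(t, rates, n=100_000, seed=1):
    """Monte Carlo estimate of P(T > t) using T = T1 ∧ (T2 ∨ T3)."""
    rng = random.Random(seed)
    survived = 0
    for _ in range(n):
        t1, t2, t3 = (rng.expovariate(r) for r in rates)
        survived += min(t1, max(t2, t3)) > t
    return survived / n

rates = (0.5, 1.0, 1.0)  # hypothetical failure rates of components 1, 2, 3
print(reliability_exact(1.0, rates), reliability_mc(1.0, rates))
```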

Under the classical assumption that all components work independently,
i.e., the random variables $T_i$, $i = 1, \ldots, n$, are independent, certain
characteristics of the system lifetime are of interest:
• Determining the system lifetime distribution from the known component
lifetime distributions or at least finding bounds for this distribution (see
Sects. 2.1 and 2.2).
• Are certain properties of the component lifetime distributions like
increasing failure rate (IFR) or increasing failure rate average (IFRA)
preserved by forming monotone systems? One of these closure theorems
states, for example, that the distribution of the system lifetime is IFRA if
all component lifetimes have IFRA distributions (see Sect. 2.2).
• In what way does a certain component contribute to the functioning of
the whole system? The answer to this question leads to the definition of
several importance measures (see Sect. 2.1).
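The IFRA closure theorem can be illustrated numerically in a special case. A lifetime distribution is IFRA when the failure rate average $\Lambda(t)/t = -\ln \bar F(t)/t$ is nondecreasing, and every IFR distribution is IFRA. The Python sketch below assumes Weibull components with shape 2 (IFR, hence IFRA) and hypothetical scale parameters, arranged as in Example 1.1, and checks on a grid that $-\ln P(T > t)/t$ for the system is nondecreasing. This illustrates the theorem; it does not prove it.

```python
import math

SHAPE = 2.0               # Weibull shape > 1: IFR components (assumption)
SCALES = (2.0, 1.0, 1.5)  # hypothetical scale parameters of components 1, 2, 3

def component_reliability(t, scale):
    """P(T_i > t) for a Weibull(SHAPE, scale) lifetime."""
    return math.exp(-((t / scale) ** SHAPE))

def system_reliability(t):
    """P(T > t) for the system of Example 1.1 with independent components."""
    p1, p2, p3 = (component_reliability(t, s) for s in SCALES)
    return p1 * (1 - (1 - p2) * (1 - p3))

def failure_rate_average(t):
    """Lambda(t) / t = -ln P(T > t) / t for the system lifetime."""
    return -math.log(system_reliability(t)) / t

grid = [0.05 * k for k in range(1, 81)]  # t in (0, 4]
averages = [failure_rate_average(t) for t in grid]
is_ifra_on_grid = all(a <= b for a, b in zip(averages, averages[1:]))
print(is_ifra_on_grid)
```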

1.1.2 Damage Models

Additional information about the lifetime $T$ can also be introduced into the
model in a quite different way. If the state or damage of the system at time
$t \in \mathbb{R}_+$ can be observed and this damage is described by a random variable $X_t$,
then the lifetime of the system may be defined as

$$T = \inf\{t \in \mathbb{R}_+ : X_t \ge S\},$$

i.e., as the first time the damage hits a given level $S$. Here $S$ can be a constant
or, more generally, a random variable independent of the damage process. Some
examples of damage processes $X = (X_t)$ of this kind are described in the
following subsections.

Wiener Process

The damage process is a Wiener process with positive drift starting at 0 and
the failure threshold S is a positive constant. The lifetime of the system is
then known to have an inverse Gaussian distribution. Models of this kind are
especially of interest if one considers different environmental conditions under
which the system is working, as, for example, in so-called burn-in models.
An accelerated aging caused by additional stress or different environmental
conditions can be described by a change of time. Let $\tau : \mathbb{R}_+ \to \mathbb{R}_+$ be an
increasing function. Then $Z_t = X_{\tau(t)}$ denotes the actual observed damage.
The time transformation $\tau$ drives the speed of the deterioration. One possible
way to express different stress levels in time intervals $[t_i, t_{i+1})$, $0 = t_0 < t_1 <
\ldots < t_k$, $i = 0, 1, \ldots, k-1$, $k \in \mathbb{N}$, is the choice

$$\tau(t) = \sum_{j=0}^{i-1} \beta_j (t_{j+1} - t_j) + \beta_i (t - t_i), \quad t \in [t_i, t_{i+1}), \quad \beta_v > 0.$$

In this case it is seen that if $F_0$ is the inverse Gaussian distribution function
of $T = \inf\{t \in \mathbb{R}_+ : X_t \ge S\}$, and $F$ is the distribution function of the
lifetime $T_a = \inf\{t \in \mathbb{R}_+ : Z_t \ge S\}$ under accelerated aging, then $F(t) =
F_0(\tau(t))$. A generalization in another direction is to consider a random time
change, which means that $\tau$ is a stochastic process. By this, randomly varying
environmental conditions can be modeled.
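Both statements can be illustrated with a short Python sketch under stated assumptions. The damage process is taken to be $X_t = \mu t + \sigma W_t$ with hypothetical values $\mu = 1$, $\sigma = 1$, $S = 2$ (the text does not fix a diffusion coefficient), and $F_0$ is evaluated from the standard closed form of the first passage time distribution of Brownian motion with drift. A crude Euler simulation checks $F_0$ at one time point; because the path is only monitored on a grid, some crossings are missed and the simulated probability is slightly too small. The piecewise-linear clock `tau` uses hypothetical knots $t_i$ and slopes $\beta_i$ and gives the accelerated distribution $F(t) = F_0(\tau(t))$.

```python
import math
import random

def norm_cdf(z: float) -> float:
    """Standard normal distribution function, via math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ig_cdf(t: float, mu: float, sigma: float, S: float) -> float:
    """F0(t) = P(T <= t) for T = inf{t : mu*t + sigma*W_t >= S}, mu > 0, S > 0 (inverse Gaussian)."""
    if t <= 0.0:
        return 0.0
    s = sigma * math.sqrt(t)
    return (norm_cdf((mu * t - S) / s)
            + math.exp(2.0 * mu * S / sigma ** 2) * norm_cdf(-(mu * t + S) / s))

def hits_by(t_end: float, mu: float, sigma: float, S: float, dt: float, rng: random.Random) -> bool:
    """Euler path of X_t = mu*t + sigma*W_t on a grid of step dt; True if X reaches S by t_end."""
    x, sd = 0.0, sigma * math.sqrt(dt)
    for _ in range(int(round(t_end / dt))):
        x += mu * dt + rng.gauss(0.0, sd)
        if x >= S:
            return True
    return False

def tau(t: float, knots: list, betas: list) -> float:
    """Piecewise-linear clock with slope betas[i] on [knots[i], knots[i+1]) and knots[0] = 0."""
    if not knots[0] <= t < knots[-1]:
        raise ValueError("tau is only defined on [t_0, t_k)")
    elapsed = 0.0
    for i, beta in enumerate(betas):
        if t < knots[i + 1]:
            return elapsed + beta * (t - knots[i])
        elapsed += beta * (knots[i + 1] - knots[i])
    raise AssertionError("unreachable when len(betas) == len(knots) - 1")

mu, sigma, S = 1.0, 1.0, 2.0                           # hypothetical drift, diffusion, threshold
knots, betas = [0.0, 1.0, 3.0, 10.0], [1.0, 2.0, 0.5]  # hypothetical t_i and beta_i

def accelerated_cdf(t: float) -> float:
    """F(t) = F0(tau(t)), the lifetime distribution under accelerated aging."""
    return ig_cdf(tau(t, knots, betas), mu, sigma, S)

rng = random.Random(7)
n = 4000
p_sim = sum(hits_by(2.0, mu, sigma, S, 0.004, rng) for _ in range(n)) / n
print(ig_cdf(2.0, mu, sigma, S), p_sim)  # closed form vs. (slightly low) simulation
print(tau(2.0, knots, betas), accelerated_cdf(2.0))
```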

Compound Point Processes

Processes of this kind describe so-called shock processes where the system is
subject to shocks that occur from time to time and add a random amount
to the damage. The successive times of occurrence of shocks, $T_n$, are given
by an increasing sequence $0 < T_1 \le T_2 \le \ldots$ of random variables, where the
inequality is strict unless $T_n = \infty$. Each time point $T_n$ is associated with a
real-valued random mark $V_n$, which describes the additional damage caused by
the $n$th shock. The marked point process is denoted $(T, V) = (T_n, V_n)$, $n \in \mathbb{N}$.
From this marked point process the corresponding compound point process
$X$ with

$$X_t = \sum_{n=1}^{\infty} I(T_n \le t)\, V_n \tag{1.2}$$

is derived, which describes the accumulated damage up to time $t$. The simplest
example is a compound Poisson process in which the shock arrival process is
Poisson and the shock amounts $(V_n)$ are i.i.d. random variables. As before,
the lifetime $T$ is the first time the damage process $(X_t)$ hits the level $S$. If we
go one step further and assume that $S$ is not deterministic and fixed, but a
random failure level, then we can describe a situation in which the observed
damage process does not carry complete information about the (failure) state
of the system; the failure can occur at different damage levels $S$.
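A minimal Python simulation of the lifetime $T = \inf\{t : X_t \ge S\}$ for a compound Poisson damage process (1.2) is sketched below. It assumes, for illustration only, a shock rate $\nu$ and exponentially distributed marks $V_n$ with mean $m$. In that special case the number of shocks needed to reach $S$ is 1 plus a Poisson variable with mean $S/m$, so $E[T] = (1 + S/m)/\nu$, which gives a check on the simulation. A random failure level is obtained by drawing $S$ afresh, independently of the damage process, for each run.

```python
import random

def shock_lifetime(rate, mean_mark, S, rng):
    """First time the compound Poisson damage X_t = sum_n I(T_n <= t) V_n reaches level S.

    Shocks arrive as a Poisson process with the given rate; the marks V_n are
    i.i.d. exponential with mean mean_mark (both are illustrative assumptions).
    """
    t, x = 0.0, 0.0
    while True:
        t += rng.expovariate(rate)             # next shock time T_n
        x += rng.expovariate(1.0 / mean_mark)  # damage increment V_n
        if x >= S:                             # failure at the first shock with X_{T_n} >= S
            return t

rng = random.Random(2013)
rate, mean_mark, S = 2.0, 1.0, 5.0  # hypothetical nu, m and S
n = 20_000
mean_T = sum(shock_lifetime(rate, mean_mark, S, rng) for _ in range(n)) / n
print(mean_T, (1 + S / mean_mark) / rate)  # simulated vs. exact E[T]

# Random failure level: draw S independently of the damage process for each run.
random_level_T = [shock_lifetime(rate, mean_mark, rng.uniform(3.0, 7.0), rng) for _ in range(5)]
print(random_level_T)
```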
Another way to describe the failure mechanism is the following. Let the
accumulated damage up to time t be given by the shock process Xt as in
(1.2). If the system is up at t− just before t, the accumulated damage equals
Xt− = x and a shock of magnitude y occurs at t, then the probability of
failure at t is p(x + y), where p(x) is a given [0, 1]-valued function. In this
model failures can only occur at shock times and the accumulated damage
determines the failure probability.
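The second failure mechanism can be simulated in the same way. In this sketch the shock process is again compound Poisson with exponential shock sizes, and the function `p` is supplied by the caller; `hypothetical_p` is an invented example, not a model from the text. For a constant $p \equiv q$ each shock is fatal with probability $q$, independently of the damage, so the failure time is exponential with rate $q \cdot \text{rate}$, which gives a simple check.

```python
import math
import random


def shock_failure_time(rate, mean_shock, p, rng, max_shocks=10**6):
    """Failure time when a shock of size y arriving at accumulated damage
    x = X_{t-} causes failure with probability p(x + y).

    Shocks: Poisson arrivals at `rate`, exponential sizes with mean
    `mean_shock` (illustrative assumptions). Returns (time, shocks used).
    """
    t, damage = 0.0, 0.0
    for n in range(1, max_shocks + 1):
        t += rng.expovariate(rate)             # shock time T_n
        y = rng.expovariate(1.0 / mean_shock)  # shock magnitude V_n
        if rng.random() < p(damage + y):       # fails with probability p(x + y)
            return t, n
        damage += y                            # survives: damage becomes x + y
    return math.inf, max_shocks                # no failure within max_shocks


def hypothetical_p(z, scale=10.0):
    """Hypothetical damage-dependent failure probability, increasing in z."""
    return 1.0 - math.exp(-z / scale)
```

Only shock times can be failure times here, and the accumulated damage enters solely through the argument of `p`.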

1.1.3 Different Information Levels

It was pointed out above in what way additional information can lead to a reli-
ability model. But it is also important to note that in one and the same model
different observation levels are possible, i.e., the amount of information actually available about the state of a system may vary. The following examples
will show the effect of different degrees of information.

1.1.4 Simpson’s Paradox

This paradox says that if one compares the death rates in two countries, say
A and B, then it is possible that the crude overall death rate in country A is
higher than in B although all age-specific death rates in B are higher than in A.
This can be transferred to reliability in the following way. Considering a two-
component parallel system, the failure rate of the system lifetime may increase
although the component lifetimes have decreasing failure rates. The following
proposition, which can be proved by some elementary calculations, yields an
example of this.

Proposition 1.2. Let $T = T_1 \vee T_2$ with i.i.d. random variables $T_i$, $i = 1, 2$, following the common distribution $F$,

$$F(t) = 1 - e^{-u(t)}, \quad t \ge 0, \quad u(t) = \gamma t + \alpha (1 - e^{-\beta t}), \quad \alpha, \beta, \gamma > 0.$$

If $2\alpha e^{\alpha} < \left(\frac{\gamma}{\beta}\right)^2 < 1$, then the failure rate $\lambda$ of the lifetime $T$ increases, whereas the component lifetimes $T_i$ have decreasing failure rates.
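Proposition 1.2 can be checked numerically. Since $F_T = F^2$, the system failure rate is $\lambda_T = 2Ff/(1 - F^2) = 2F u'/(1 + F)$, while each component has failure rate $u'(t) = \gamma + \alpha\beta e^{-\beta t}$ and the series system $T_1 \wedge T_2$ has rate $2u'(t)$. The sketch below reads the condition of the proposition as $2\alpha e^{\alpha} < (\gamma/\beta)^2 < 1$ and evaluates these rates on a grid for one hypothetical parameter choice.

```python
import math


def u(t, alpha, beta, gamma):
    """Cumulative hazard u(t) = gamma*t + alpha*(1 - exp(-beta*t))."""
    return gamma * t + alpha * (1.0 - math.exp(-beta * t))


def component_rate(t, alpha, beta, gamma):
    """Failure rate u'(t) of each T_i; decreasing in t."""
    return gamma + alpha * beta * math.exp(-beta * t)


def parallel_rate(t, alpha, beta, gamma):
    """Failure rate of T1 ∨ T2: 2 F f / (1 - F^2) simplifies to 2 F u' / (1 + F)."""
    F = 1.0 - math.exp(-u(t, alpha, beta, gamma))
    return 2.0 * F * component_rate(t, alpha, beta, gamma) / (1.0 + F)


def series_rate(t, alpha, beta, gamma):
    """Failure rate of T1 ∧ T2 for independent components: 2 u'(t)."""
    return 2.0 * component_rate(t, alpha, beta, gamma)


# Hypothetical parameters satisfying 2*alpha*e^alpha < (gamma/beta)^2 < 1.
alpha, beta, gamma = 0.1, 1.5, 1.0
grid = [0.01 * i for i in range(1001)]          # t in [0, 10]
par = [parallel_rate(t, alpha, beta, gamma) for t in grid]
comp = [component_rate(t, alpha, beta, gamma) for t in grid]
ser = [series_rate(t, alpha, beta, gamma) for t in grid]
```

On this grid `par` increases while `comp` and `ser` decrease, matching the IFR/DFR contrast discussed below.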

This example shows that it makes a great difference whether only the sys-
tem lifetime can be observed (aging property: IFR) or additional information
about the component lifetimes is available (aging property: DFR). The aging
property of the system lifetime of a complex system depends not only on the joint distribution of the component lifetimes but also, of course, on the structure function. Instead of a two-component parallel system, consider
a series system where the component lifetimes have the same distributions as
in Proposition 1.2. Then the failure rate of Tser = T1 ∧ T2 decreases, whereas
Tpar = T1 ∨ T2 has an IFR.

1.1.5 Predictable Lifetime

The Wiener process $X = (X_t)$, $t \in \mathbb{R}_+$, with positive drift $\mu$ and variance scaling parameter $\sigma$, is a popular damage threshold model. The process X can
be represented as Xt = σBt +μt, where B is standard Brownian motion. If one
assumes that the failure level S is a fixed known constant, then the lifetime
T = inf{t ∈ R+ : Xt ≥ S} follows an inverse Gaussian distribution with a
finite mean ET = S/μ. One criticism of this model is that the paths of X are
not monotone. As a partial answer, one can respond that maintenance actions
also lead to improvements and thus X could be decreasing at some time points.
A more severe criticism from the point of view of the available information is
the following. It is often assumed that in this model the paths of the damage
process can be observed continuously. But this would make the lifetime T a
predictable random time (a precise definition follows in Chap. 3), i.e., there is
an increasing sequence τn , n ∈ N, of random time points that announces the
failure. In this model one could choose τn = inf{t ∈ R+ : Xt ≥ S − 1/n}, take n large enough, and stop operating the system at τn “just” before failure to carry out some preventive maintenance, cf. Fig. 1.1. This does not usually
apply in practical situations. This example shows that one has to distinguish
carefully between the different information levels for the model formulation
(complete information) and for the actual observation (partial information).

Fig. 1.1. Predictable stopping time (a path of Xt plotted against t; the level S − 1/n is reached at τn, before the failure level S is reached at T)
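The announcing sequence $\tau_n$ can be illustrated by simulation. This is a minimal sketch under assumptions of our own: an Euler discretization of $X_t = \sigma B_t + \mu t$ with step `dt`, so the simulated hitting times are grid approximations (slightly biased upward) of the continuous ones, and the parameter values are hypothetical. The mean $ET = S/\mu$ serves as a rough check.

```python
import math
import random


def simulate_tau_and_T(S, mu, sigma, n, dt, rng):
    """Euler path of X_t = sigma*B_t + mu*t from X_0 = 0.

    Returns (tau_n, T): the first grid times at which X reaches S - 1/n and S.
    """
    level_n = S - 1.0 / n
    t, x, tau_n = 0.0, 0.0, None
    while x < S:
        t += dt
        x += mu * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        if tau_n is None and x >= level_n:
            tau_n = t                      # announcing time, never after T
    return tau_n, t


def run(n, S=5.0, mu=1.0, sigma=1.0, dt=0.01, n_paths=1000, seed=3):
    rng = random.Random(seed)
    return [simulate_tau_and_T(S, mu, sigma, n, dt, rng) for _ in range(n_paths)]


paths_2, paths_20 = run(2), run(20)
mean_T = sum(T for _, T in paths_2) / len(paths_2)            # close to S/mu = 5
gap_2 = sum(T - tau for tau, T in paths_2) / len(paths_2)
gap_20 = sum(T - tau for tau, T in paths_20) / len(paths_20)
```

Because both runs use the same seed, they produce identical paths, and the gap $T - \tau_n$ shrinks pathwise as $n$ grows, which illustrates why continuous observation of $X$ makes $T$ predictable.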

1.1.6 A General Failure Model

The general failure model considered in Chap. 3 uses elements of the theory
of stochastic processes and particularly some martingale theory. Some readers might wonder whether such sophisticated theory is necessary and suitable in reliability, a domain with engineering applications. Instead of a
comprehensive justification we give a motivating example.

Example 1.3. We consider a simple two-component parallel system with independent Exp($\alpha_i$) distributed component lifetimes $T_i$, $i = 1, 2$. The system lifetime $T = T_1 \vee T_2$ has distribution function

$$F(t) = P(T_1 \le t, T_2 \le t) = (1 - e^{-\alpha_1 t})(1 - e^{-\alpha_2 t})$$

with an ordinary failure rate

$$\lambda(t) = \frac{\alpha_1 e^{-\alpha_1 t} + \alpha_2 e^{-\alpha_2 t} - (\alpha_1 + \alpha_2) e^{-(\alpha_1 + \alpha_2) t}}{e^{-\alpha_1 t} + e^{-\alpha_2 t} - e^{-(\alpha_1 + \alpha_2) t}}.$$
This formula is rather complicated for such a simple system and reveals nothing about the structure of the system. Using elementary calculus it can be shown that for α1 ≠ α2 the failure rate is increasing on (0, t∗) and decreasing on (t∗, ∞) for some t∗ > 0. This property of the failure rate, however, is neither obvious nor immediate to see. We also know that F is of IFRA type.
But is it not more natural and simpler to say that a failure rate (process)
should be 0 as long as both components work (no system failure can occur)
and, when the first component failure occurs, then the rate switches to α1 or
α2 depending on which component survives? We want to derive a model that
allows such a simple failure rate process and also includes the ordinary failure
rate. Of course, this simple failure rate process, which can be expressed as
$$\lambda_t = \alpha_1 I(T_2 \le t < T_1) + \alpha_2 I(T_1 \le t < T_2),$$

needs knowledge about the random component lifetimes Ti . Now the failure
rate λt is a stochastic process and the information about the status of the
components at time t is represented by a filtration. The model allows for
changing the information level and the ordinary failure rate can be derived
from λt on the lowest level possible, namely no information about the com-
ponent lifetimes.
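Both failure rates of Example 1.3 can be compared directly. Projecting the failure rate process onto the lowest information level means taking the conditional expectation $E[\lambda_t \mid T > t]$, i.e., weighting $\alpha_1$ and $\alpha_2$ with $P(T_2 \le t < T_1)$ and $P(T_1 \le t < T_2)$ and dividing by $P(T > t)$. The sketch below checks numerically that this reproduces the ordinary failure rate $\lambda(t)$ and locates the maximum $t^*$ on a grid, for the hypothetical choice $\alpha_1 = 1 \neq \alpha_2 = 3$.

```python
import math


def ordinary_rate(t, a1, a2):
    """Ordinary failure rate lambda(t) of T = T1 ∨ T2 from Example 1.3."""
    num = (a1 * math.exp(-a1 * t) + a2 * math.exp(-a2 * t)
           - (a1 + a2) * math.exp(-(a1 + a2) * t))
    den = math.exp(-a1 * t) + math.exp(-a2 * t) - math.exp(-(a1 + a2) * t)
    return num / den


def projected_rate(t, a1, a2):
    """E[lambda_t | T > t] for lambda_t = a1 I(T2 <= t < T1) + a2 I(T1 <= t < T2)."""
    p_only_1 = (1.0 - math.exp(-a2 * t)) * math.exp(-a1 * t)  # P(T2 <= t < T1)
    p_only_2 = (1.0 - math.exp(-a1 * t)) * math.exp(-a2 * t)  # P(T1 <= t < T2)
    # P(T > t) by inclusion-exclusion: P(T1 > t) + P(T2 > t) - P(T1 > t, T2 > t)
    p_up = math.exp(-a1 * t) + math.exp(-a2 * t) - math.exp(-(a1 + a2) * t)
    return (a1 * p_only_1 + a2 * p_only_2) / p_up


a1, a2 = 1.0, 3.0                                # hypothetical rates, a1 != a2
grid = [0.01 * i for i in range(2001)]           # t in [0, 20]
rates = [ordinary_rate(t, a1, a2) for t in grid]
k = max(range(len(rates)), key=rates.__getitem__)
t_star = grid[k]                                 # grid approximation of t*
```

The maximum exceeds the limit $\min(\alpha_1, \alpha_2)$, which is why $\lambda(t)$ must eventually decrease; with $\alpha_1 = \alpha_2$ the same function is nondecreasing.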

The modern theory of stochastic processes allows for the development of a general failure model that incorporates the above aspects: time dynamics and different information levels. Chapter 3 presents this model. The failure rate process $\lambda_t$ is one of the basic parameters of this set-up. If we consider the lifetime $T$, under some mild conditions we obtain the failure rate process on $\{T > t\}$ as the limit of conditional expectations with respect to the pre-$t$-history ($\sigma$-algebra) $\mathcal{F}_t$,

$$\lambda_t = \lim_{h \to 0+} \frac{1}{h} P(T \le t + h \mid \mathcal{F}_t),$$

extending the classical failure rate λ(t) of the system. To apply the set-up,
focus should be placed on the failure rate process (λt ). When this process
has been determined, the model has basically been established. Using the
above interpretation of the failure rate process, it is in most cases rather
straightforward to determine its form. The formal proofs are, however, often
quite difficult.
If we go one step further and consider a model in which the system can
be repaired or replaced at failure, then attention is paid to the number Nt of
system failures in [0, t]. Given certain conditions, the counting process N =
(Nt ), t ∈ R+ , has an “intensity” that as an extension of the failure rate process
can be derived as the limit of conditional expectations

$$\lambda_t = \lim_{h \to 0+} \frac{1}{h} E[N_{t+h} - N_t \mid \mathcal{F}_t],$$

where $\mathcal{F}_t$ denotes the history of the system up to time $t$. Hence we can interpret
λt as the (conditional) expected number of system failures per unit of time at
time t given the available information at that time. Chapter 3 includes several
special cases that demonstrate the broad spectrum of potential applications.

1.2 Maintenance
To prolong the lifetime, to increase the availability, and to reduce the prob-
ability of an unpredictable failure, various types of maintenance actions are
being implemented. The most important maintenance actions include:
• Preventive replacements of parts of the system or of the whole system
• Repairs of failed units
• Providing spare parts
• Inspections to check the state of the system if not observed continuously
Taking maintenance actions into account leads, depending on the specific
model, to one of the following subject areas: Availability Analysis and Opti-
mization Models.

1.2.1 Availability Analysis

If the system or parts of it are repaired or replaced when failures occur, the
problem is to characterize the performance of the system. Different measures of performance can be defined, for example:
• The probability that the system is functioning at a certain point in time
(point availability)
• The mean time to the first failure of the system
• The probability distribution of the downtime of the system in a given time
interval.
Traditionally, focus has been placed on analyzing the point availability and
its limit (the steady-state availability). For a single component, the steady-state formula is given by MTTF/(MTTF + MTTR), where MTTF and MTTR represent the mean time to failure and the mean time to repair (mean repair time), respectively. The steady-state availability of a system comprising several components can then be calculated using the theory of complex (monotone) systems.
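A minimal sketch of these steady-state formulas follows, assuming independent components so that the system availability follows from the structure function; only series and parallel structures are covered, and the MTTF and MTTR values are hypothetical.

```python
from math import prod


def availability(mttf, mttr):
    """Steady-state availability MTTF / (MTTF + MTTR) of one component."""
    return mttf / (mttf + mttr)


def series_availability(avails):
    """Series system of independent components: all must be up."""
    return prod(avails)


def parallel_availability(avails):
    """Parallel system of independent components: at least one must be up."""
    return 1.0 - prod(1.0 - a for a in avails)


# Hypothetical example: two components with MTTF = 990 h and MTTR = 10 h.
a = availability(990.0, 10.0)                 # 0.99
in_series = series_availability([a, a])
in_parallel = parallel_availability([a, a])
```

The series value is the product of the component availabilities, while the parallel value is one minus the product of the unavailabilities.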
Often, performance measures related to a time interval are used. Such
measures include the distribution of the number of system failures, and the
distribution of the downtime of the system, or at least the mean of these dis-
tributions. Measures related to the number of system failures are important
from an operational and safety point of view, whereas measures related to
the downtime are more interesting from a production point of view. Information about the probability of having a long downtime in a time interval
is important for assessing the economic risk related to the operation of the
system. For production systems, it is sometimes necessary to use a multistate
representation of the system and some of its components, to reflect different
production levels.
Compared to the steady-state availability, it is of course more complicated
to compute the performance measures related to a time interval, in particu-
lar the probability distributions of the number of system failures and of the
downtime. Using simplifications and approximations, it is however possible to
establish formulas that can be used in practice. For highly available systems,
a Poisson approximation for the number of system failures and a compound
Poisson approximation for the downtime distribution are useful in many cases.
These topics are addressed in Chap. 4, which gives a detailed analysis
of the availability of monotone systems. Emphasis is placed on performance
measures related to a time interval. Sufficient conditions are given under which the Poisson and the compound Poisson distributions are asymptotic limits.

1.2.2 Optimization Models

If a valuation structure is given, i.e., costs of replacements, repairs, downtime, etc., and gains, then one is naturally led to the problem of planning the maintenance action so as to minimize (maximize) the costs (gains) with respect to
a given criterion. Examples of such criteria are expected costs per unit time
and total expected discounted costs.

Example 1.4. We resume Example 1.3, p. 6, and consider the simple two-
component parallel system with independent Exp(αi ) distributed component
lifetimes Ti , i = 1, 2, with the system lifetime T = T1 ∨ T2 . We now allow
preventive replacements at costs of c units to be carried out before failure,
and a replacement upon system failure at cost c + k. It seems intuitive that
T1 ∧ T2 , the time of the first component failure, should be a candidate for an
optimal replacement time with respect to some cost criterion, at least if c is
“small” compared to k. How can we prove that this random time T1 ∧ T2 is
optimal among all possible replacement times? How can we characterize the
set of all possible replacement times?
These questions can only be answered in the framework of martingale
theory and are addressed in Chap. 5.
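A first numerical impression, though not an answer to the optimality question, can be obtained by comparing two candidate policies under the long-run expected cost per unit time. The sketch assumes the system is as good as new after each replacement, so that by the renewal-reward theorem the cost rate is E[cycle cost]/E[cycle length]. It only compares "replace at $T_1 \wedge T_2$" (cost $c$) with "replace at system failure $T$" (cost $c + k$); optimality over all replacement times is the subject of Chap. 5.

```python
def cost_rate_first_failure(c, a1, a2):
    """Replace preventively at T1 ∧ T2 (cost c); E[T1 ∧ T2] = 1 / (a1 + a2)."""
    return c * (a1 + a2)


def cost_rate_system_failure(c, k, a1, a2):
    """Replace only at system failure T1 ∨ T2 (cost c + k)."""
    mean_max = 1.0 / a1 + 1.0 / a2 - 1.0 / (a1 + a2)   # E[T1 ∨ T2]
    return (c + k) / mean_max


def break_even_k(c, a1, a2):
    """Failure penalty k at which both policies have the same cost rate."""
    mean_max = 1.0 / a1 + 1.0 / a2 - 1.0 / (a1 + a2)
    return c * ((a1 + a2) * mean_max - 1.0)
```

For $\alpha_1 = \alpha_2 = 1$ and $c = 1$, replacing at the first component failure is cheaper exactly when $k > 2$, consistent with the intuition that $T_1 \wedge T_2$ is attractive when $c$ is small compared to $k$.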

One can imagine that thousands of models (and papers) can be created by
combining the different types of lifetime models with different maintenance
actions. The general optimization framework formulated in Chap. 5 incorpo-
rates a number of such models. Here the emphasis is placed on determining
the optimal replacement time of a deteriorating system. The framework is
based on the failure model of Chap. 3, which means that rather complex and
very different situations can be studied. Special cases include monotone sys-
tems, (minimal) repair models, and damage processes, with different informa-
tion levels.

1.3 Reliability Modeling


Models analyzed in this book are general, in the sense that they do not refer
to any specific real life situation but are applicable in a number of cases. This
is the academic and theoretical approach of mathematicians (probabilists,
statisticians) who provide tools that can be used in applications.
The reliability engineer, on the other hand, has a somewhat different start-
ing point. He or she is faced with a real problem and has to analyze this prob-
lem using a mathematical model that describes the situation appropriately.
Sometimes it is rather straightforward to identify a suitable model, but often the problem is complex and it is difficult to see how to solve it. In many cases,
a model needs to be developed. The modeling process requires both experience
on the part of the practitioner and knowledge on the part of the theorist.
However, it is not within the scope of this book to discuss in detail the
many practical aspects related to reliability modeling and analysis. Only a
few issues will be addressed. In this introductory section we will highlight
important factors to be considered in the modeling process and two real life
examples will be presented.
The objectives of the reliability study can affect modeling in many ways,
for example, by specifying which performance measures and which factors
(parameters) are to be analyzed. Different objectives will require different
approaches and methods for modeling and analysis. Is the study to provide
decision support in a design process of a system where the problem is to choose
between alternative solutions; is the problem to give a basis for specifying
reliability requirements; or is the aim to search for an optimal preventive
maintenance strategy? Clearly, these situations call for different models.
The objectives of the study may also influence the choice of the computa-
tional approach. If it is possible to use analytical calculation methods, these
would normally be preferred. For complex situations, Monte Carlo simulation
often represents a useful alternative, cf., e.g., [13, 64].
The modeling process starts by clarifying the characteristics of the situa-
tion to be analyzed. Some of the key points to address are:
Can the system be decomposed into a set of independent subsystems (com-
ponents)? Are all components operating normally or are some on stand-by?
What is the state of the component after a repair? Is it “as good as new”?
What are the resources available for carrying out the repairs? Are some types
of preventive maintenance being employed? Is the state of the components
and the system continuously monitored, or is it necessary to carry out inspec-
tions to reveal their condition? Is information available about the underlying
condition of the system and components, such as wear, stress, and damage?
Having identified important features of the system, we then have to look
more specifically at the various elements of the model and resolve questions
like the following:
• How should the deterioration process of the components and system
be modeled? Is it sufficient to use a standard lifetime model where
the age of the unit is the only information available? How should the
repair/replacement times be modeled?
• How are the preventive maintenance activities to be reflected in the model?
Are these activities to be considered fixed in the model or is it possible to
plan preventive maintenance action so that costs (rewards) are minimized
(maximized)?
• Is a binary (two-state) approach for components and system sufficiently
accurate, or is multistate modeling required?
• How are the system and components to be represented? Is a reliability block diagram appropriate?
• Are time dynamics to be included or is a time stationary model sufficient?
• How are the parameters of the model to be determined? What kind of
input data are required for using the model? How is uncertainty to be
dealt with?
Depending on the answers to these questions, relevant models can be iden-
tified. It is a truism that no model can cover all aspects, and it is recommended
that one starts with a simple model describing the main features of the system.
The following application examples give further insight into the situations
that can be modeled using the theory presented in this book.

1.3.1 Nuclear Power Station

In this example we consider a small part of a very complex technical system, in which safety aspects are of great importance. The nuclear power station under
consideration consists of two identical boiling water reactors in commercial
operation, each with an electrical power of 1,344 MW. They started operating in 1984 and 1985, respectively, with an efficiency of 35%.
Nuclear power plants have to shut down from time to time to exchange
the nuclear fuel. This is usually performed annually. During the shutdown
phase a lot of maintenance tasks and surveillance tests are carried out. One
problem during such phases is that decay heat is still produced and thus
has to be removed. Therefore, residual heat removal (RHR) systems are in
operation. At the particular site, three identical systems are available, each
with a capacity of 100%. They are designed to remove decay heat during
accident conditions occurring at full power as well as for operational purposes
in cooldown phases.
One of these RHR systems is schematically shown in Fig. 1.2. It consists of
three different trains including the closed cooling water system. Several pumps
and valves are part of the RHR system. The primary cooling system can be
modeled as a complex system comprising the following main components:
• Closed cooling water system pump (CCWS)
• Service water system pump (SWS)
• Low-pressure pump with a pre-stage (LP)
• High-pressure pump (HP)
• Nuclear heat exchanger (RHR)
• Valves (V1 , V2 , V3 )
For the analysis we have to distinguish between two cases:
1. The RHR system is not in operation.
Then the functioning of the system can be viewed as a binary structure of the main components as is shown in the reliability block diagram in Fig. 1.3.

Fig. 1.2. Cooling system of a power plant

Fig. 1.3. Reliability block diagram (V1 and V2 in parallel, in series with LP, RHR, SWS, CCWS, HP, and V3)

When the system is needed, it is possible that single components
or the whole system fails to start on demand. In this case, to calculate the
probability of a failure on demand, we have to take all components in the
reliability block diagram into consideration. Two of the valves, V1 and V2, are in parallel. Therefore, the RHR system fails on demand if both V1 and V2 fail or at least one of the remaining components LP, . . . , HP, V3 fails. We assume that the time from a check of a component until a failure in the idle state is exponentially distributed. The failure rates are λv1, λv2, λv3 for the valves and λp1, λp2, λp3, λp4, λh for the other components. If the check (inspection or operating period) dates t time units back, then the probability of a failure on demand is given by

$$1 - \{1 - (1 - e^{-\lambda_{v1} t})(1 - e^{-\lambda_{v2} t})\}\, e^{-(\lambda_{p1} + \lambda_{p2} + \lambda_{p3} + \lambda_{p4} + \lambda_h + \lambda_{v3}) t}.$$

2. The RHR system is in operation.
During an operation phase, only the pumps and the nuclear heat exchanger
can fail to operate. Once the valves have opened on demand at the start of the operation phase, they cannot fail during operation. Therefore,
in this operation case, we can either ignore the valves in the block diagram
or assign failure probability 0 to V1 , V2 , V3 . The structure reduces to a
simple series system. If we assume that the failure-free operating times
of the pumps and the heat exchanger are independent and have distribu-
tions Fp1 , Fp2 , Fp3 , Fp4 , and Fh , respectively, then the probability that the
system fails before a fixed operating time t is just

$$1 - \bar F_{p1}(t)\,\bar F_{p2}(t)\,\bar F_{p3}(t)\,\bar F_{p4}(t)\,\bar F_h(t),$$

where $\bar F(t) = 1 - F(t)$ denotes the survival probability.
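Both formulas can be evaluated directly. In this sketch the failure rates are hypothetical inputs, since the text gives no numbers. For the operating case the survival functions are passed in as callables; using exponential survival functions with the same rates is purely an illustrative assumption.

```python
import math


def demand_failure_prob(t, lam_v1, lam_v2, lam_v3, lam_others):
    """Probability of failure on demand t time units after the last check.

    lam_others = [lam_p1, lam_p2, lam_p3, lam_p4, lam_h]; exponential idle times.
    """
    q_valves = (1.0 - math.exp(-lam_v1 * t)) * (1.0 - math.exp(-lam_v2 * t))
    p_rest_ok = math.exp(-(sum(lam_others) + lam_v3) * t)   # series part works
    return 1.0 - (1.0 - q_valves) * p_rest_ok


def operating_failure_prob(t, survivals):
    """Series of pumps and heat exchanger: 1 - product of survival probabilities."""
    return 1.0 - math.prod(sf(t) for sf in survivals)


# Hypothetical rates (per hour), for illustration only.
rates = [1e-4, 2e-4, 1e-4, 1e-4, 5e-5]
p_demand = demand_failure_prob(720.0, 1e-5, 1e-5, 1e-5, rates)
exp_survivals = [lambda t, r=r: math.exp(-r * t) for r in rates]
p_operating = operating_failure_prob(720.0, exp_survivals)
```

Because V1 and V2 are redundant, a failure of only one of them never contributes to the demand failure probability.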


In both cases the failure time distributions and the failure rates have to be
estimated. One essential condition for the derivation of the above formulae is
that all components have stochastically independent failure times or lifetimes.
In some cases such an independence condition does not apply. In Chap. 3 a
general theory is developed that also includes the case of complex systems
with dependent component lifetimes. The framework presented covers differ-
ent information levels, which allow updating of reliability predictions using
observations of the condition of the components of the system, for example.

1.3.2 Gas Compression System

This example outlines various aspects of the modeling process related to the
design of a gas compression system.
A gas producer was designing a gas production system, and one of the most
critical decisions was related to the design of the gas compression system.
At a certain stage of the development, two alternatives for the compression
system were considered:

(i) One gas train with a maximum throughput capacity of 100%
(ii) Two trains in parallel, each with a maximum throughput capacity of 50%.
Normal production is 100%. For case (i) this means that the train is
operating normally and a failure stops production completely. For case (ii)
both trains are operating normally. If one train fails, production is reduced to
50%. If both trains are down, production is 0.
Each train comprises compressor–turbine, cooler, and scrubber. A failure
of one of these “components” results in the shutdown of the train. Thus a
train is represented by a series structure of the three components compressor–
turbine, cooler, and scrubber.

The following failure and repair time data were assumed:

Component             Failure rate (per year)   Mean repair time (hours)
Compressor–turbine    10                        12
Cooler                2                         50
Scrubber              1                         20

To compare the two alternatives, a number of performance measures were considered. Particular interest was shown in performance measures related to
the number of system shutdowns, the time the system has a reduced produc-
tion level, and the total production loss due to failures of the system. The gas
sales agreement states that the gas demand is to be met with a very high reli-
ability, and failures could lead to considerable penalties and loss of goodwill,
as well as worse sales perspectives for the future.
Using models as will be described in Chap. 4, it was possible to compute
these performance measures, given certain assumptions.
It was assumed that each component generates an alternating renewal pro-
cess, which means that the repair brings the component to a condition that is
as good as new. The uptimes were assumed to be distributed exponentially,
so that the component in the operating state has a constant failure rate. The
failure rate used was based on experience data for similar equipment. Such a
component model was considered to be sufficiently accurate for the purpose
of the analysis. The exponential model represents a “first-order approxima-
tion,” which makes it rather easy to gain insight into the performance of the
system. For a complex “component” with many parts to be maintained, it is known that the overall failure behavior is approximately exponential, i.e., the failure rate is approximately constant. Clearly, compared with a model that utilizes all relevant information, the exponential model is rather crude. But again we have to draw attention to the purpose of the analysis:
provide decision support concerning the choice of design alternatives. Only
the essential features should be included in the model.
A similar type of reasoning applies to the problem of dependency between
components. In this application all uptimes and downtimes of the compo-
nents were assumed to be independent. In practice there are, of course, some
dependencies present, but by looking into the failure causes and the way the
components were defined, the assumption of independence was not considered
to be a serious weakness of the model, undermining the results of the analysis.
To determine the repair time distribution, expert opinions were used. The
repair times, which also include fault diagnosis, repair preparation, test and
restart, were assessed for different failure modes. As for the uptimes, it was
assumed that no major changes over time take place concerning component
design, operational procedures, etc.

Uncertainty related to the input quantities used was not considered. Ins-
tead, sensitivity studies were performed with the purpose of identifying how
sensitive the results were with respect to variations in input parameters.
Of the results obtained, we include the following examples:

• The gas train is down 2.7% of the time in the long run.
• For alternative (i), the average system failure rate, i.e., the average number
of system failures per year, equals 13. For alternative (ii), a distinction is made between failures resulting in production below 100% and below 50%. The
average system failure rates for these levels are approximately 26 and 0.7,
respectively. Alternative (ii) has a probability of about 50% of having one
or more complete shutdowns during a year.
• The mean lost production equals 2.7% for both alternatives. The proba-
bility that the lost production during 1 year is more than 4% of demand is
approximately equal to 0.16 for alternative (i) and 0.08 for alternative (ii).
This last result is based on assumptions concerning the variation of the
repair times. Refer to Sect. 4.7.1, p. 162, where the models and methods used
to compute these measures are summarized.
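The headline figures can be roughly cross-checked from the table of failure and repair data. This sketch uses simplifications of our own, suited to highly available systems rather than the exact models of Chap. 4: the train failure rate is the sum of the component rates; component unavailability is MTTR/(MTTF + MTTR) with independent components in series; the rate of complete shutdowns in alternative (ii) is taken as $2\lambda q$ (one train fails while the other is down); and the number of complete shutdowns per year is treated as Poisson. The probabilities of losing more than 4% of demand depend on the repair-time variation and are not reproduced.

```python
import math

HOURS_PER_YEAR = 8760.0

# (failure rate per year, mean repair time in hours), from the table of data
DATA = {
    "compressor-turbine": (10.0, 12.0),
    "cooler": (2.0, 50.0),
    "scrubber": (1.0, 20.0),
}


def train_figures(data):
    """Return (train failure rate per year, train unavailability q)."""
    rate = sum(r for r, _ in data.values())
    avail = 1.0
    for r, mttr in data.values():
        mttf = HOURS_PER_YEAR / r              # mean uptime in hours
        avail *= mttf / (mttf + mttr)          # series of independent components
    return rate, 1.0 - avail


rate, q = train_figures(DATA)
below_100_rate = 2.0 * rate                    # (ii): either train fails
both_down_rate = 2.0 * rate * q                # (ii): one fails while the other is down
p_shutdown = 1.0 - math.exp(-both_down_rate)   # Poisson approximation, 1 year
lost_i = q                                     # (i): all production lost while down
lost_ii = 2 * 0.5 * q                          # (ii): 50% lost per train down
```

The rounded results match the figures quoted above: a train unavailability of about 2.7%, failure rates of 13, 26, and about 0.7 per year, and a shutdown probability of about 50%.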
The results obtained, together with an economic analysis, gave the man-
agement a good basis for choosing the best alternative.

Bibliographic Notes. There are now many journals strongly devoted to reliability, for example, the IEEE Transactions on Reliability and Reliability Engineering and System Safety. In addition, there are many journals in
Probability and Operations Research that publish papers in this field.
As mentioned before, there is an extensive literature covering a variety of
stochastic models of reliability. Instead of providing a long and inevitably incomplete list of references, we cite some of the surveys and review articles, as well as some of the reliability books.
From time to time, the Naval Research Logistics Quarterly journal pub-
lishes survey articles in this field, among them the renowned article by Pier-
skalla and Voelker [130], which appeared with 259 references in 1976, updated
by Sherif and Smith [144] with an extensive bibliography of 524 references
in 1981, followed by Valdez-Flores and Feldman [158] with 129 references in
1989. Bergman’s review [39] reflects the author’s experience in industry and
emphasizes the usefulness of reliability methods in applications. Gertsbakh’s
paper [75] reviews asymptotic methods in reliability and especially investigates
under what conditions the lifetime of a complex system with many compo-
nents is approximately exponentially distributed. Natvig [125] gives a concise
overview of importance measures for monotone systems. The surveys of Arjas
[4] and Koch [108] consider reliability models using more advanced mathematical tools such as marked point processes and martingales. A guided tour for the
non-expert through point process and intensity-based models in reliability is
presented in the article of Hokstad [89]. The book of Thompson [155] gives a
more elementary presentation of point processes in reliability. Other reliability books that we would like to draw attention to are Aven [13], Barlow and
Proschan [31, 32], Beichelt and Franken [36], Bergman and Klefsjö [40], Gaede
[70], Gertsbakh [74], Høyland and Rausand [90], and Kovalenko, Kuznetsov,
and Pegg [110]. Some of the models addressed in this introduction are treated
in the overview of Jensen [94] where related references can also be found.
2
Basic Reliability Theory

This chapter presents some basic theory of reliability, including complex system theory and properties of lifetime distributions. Basic availability theory
and models for maintenance optimization are included in Chaps. 4 and 5,
respectively.
The purpose of this chapter is not to give a complete overview of the exist-
ing theory, but to introduce the reader to common reliability concepts, models,
and methods. The exposition highlights basic ideas and results, and it provides
a starting point for the more advanced theory presented in Chaps. 3–5.

2.1 Complex Systems


This section gives an overview of some basic theory of complex systems. Binary
monotone (coherent) systems are covered, as well as multistate monotone
systems.

2.1.1 Binary Monotone Systems

In this section we give an introduction to the classical theory of monotone (coherent) systems. First we study the structural relations between a system
and its components. Then methods for calculation of system reliability are
reviewed when the component reliabilities are known. When not stated oth-
erwise, the random variables representing the state of the components are
assumed to be independent.

Structural Properties

We consider a system comprising n components, which are numbered consecutively from 1 to n. In this section we distinguish between two states: a functioning state and a failure state. This dichotomy applies to the system as well as to each component. To indicate the state of the ith component, we assign a binary variable $x_i$ to component $i$:

$$x_i = \begin{cases} 1 & \text{if component } i \text{ is in the functioning state} \\ 0 & \text{if component } i \text{ is in the failure state.} \end{cases}$$

T. Aven and U. Jensen, Stochastic Models in Reliability, Stochastic Modelling and Applied Probability 41, DOI 10.1007/978-1-4614-7894-2_2, © Springer Science+Business Media New York 2013

Fig. 2.1. Series structure (components 1, 2, . . . , n connected in series between the terminals a and b)

(The term binary variable refers to a variable taking on the values 0 or 1.)
Similarly, the binary variable Φ indicates the state of the system:

$$\Phi = \begin{cases} 1 & \text{if the system is in the functioning state} \\ 0 & \text{if the system is in the failure state.} \end{cases}$$

We assume that
Φ = Φ(x),
where x = (x1 , x2 , . . . , xn ), i.e., the state of the system is determined com-
pletely by the states of the components. We refer to the function Φ(x) as the
structure function of the system, or simply the structure. In the following we
will often use the phrase structure in place of system.

Example 2.1. A system that is functioning if and only if each component is functioning is called a series system. The structure function for this system is given by

$$\Phi(x) = x_1 \cdot x_2 \cdots x_n = \prod_{i=1}^{n} x_i.$$

A series structure can be illustrated by the reliability block diagram in Fig. 2.1.
“Connection between a and b” means that the system functions.

Example 2.2. A system that is functioning if and only if at least one compo-
nent is functioning is called a parallel system. The corresponding reliability
block diagram is shown in Fig. 2.2.
The structure function is given by

$$\Phi(x) = 1 - (1 - x_1)(1 - x_2) \cdots (1 - x_n) = 1 - \prod_{i=1}^{n} (1 - x_i). \qquad (2.1)$$

The expression on the right-hand side in (2.1) is often written $\coprod_{i=1}^{n} x_i$. Thus, a parallel system with two components has structure function

$$\Phi(x) = 1 - (1 - x_1)(1 - x_2) = \coprod_{i=1}^{2} x_i,$$

which we also write as $\Phi(x) = x_1 \amalg x_2$.

Fig. 2.2. Parallel structure (components 1, 2, . . . , n connected in parallel)

Example 2.3. A system that is functioning if and only if at least k out of n
components are functioning is called a k-out-of-n system. A series system is an
n-out-of-n system, and a parallel system is a 1-out-of-n system. The structure
function for a k-out-of-n system is given by
$$\Phi(x) = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} x_i \ge k \\ 0 & \text{if } \sum_{i=1}^{n} x_i < k. \end{cases}$$

As an example, we will look at a 2-out-of-3 system. This system can be
illustrated by the reliability block diagram shown in Fig. 2.3. An airplane
that is capable of functioning if and only if at least two of its three engines
are functioning is an example of a 2-out-of-3 system.
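As a quick check of Examples 2.1 to 2.3, the following Python sketch encodes the three structure functions for binary state tuples. The function names are our own and purely illustrative. For n = 3 it verifies exhaustively that a series system is the n-out-of-n system and that a parallel system is the 1-out-of-n system.

```python
from itertools import product
from math import prod


def series(x):
    """Series structure: Phi(x) = x_1 * x_2 * ... * x_n."""
    return prod(x)


def parallel(x):
    """Parallel structure: Phi(x) = 1 - (1 - x_1)...(1 - x_n), cf. (2.1)."""
    return 1 - prod(1 - xi for xi in x)


def k_out_of_n(x, k):
    """k-out-of-n structure: 1 if at least k components function, else 0."""
    return 1 if sum(x) >= k else 0


n = 3
for x in product((0, 1), repeat=n):
    assert series(x) == k_out_of_n(x, n)    # series = n-out-of-n
    assert parallel(x) == k_out_of_n(x, 1)  # parallel = 1-out-of-n

# State vectors for which the 2-out-of-3 system (Fig. 2.3) functions
working = [x for x in product((0, 1), repeat=3) if k_out_of_n(x, 2)]
print(working)  # [(0, 1, 1), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
```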
Definition 2.4. (Monotone system). A system is said to be monotone if
1. its structure function Φ is nondecreasing in each argument, and
2. Φ(0) = 0 and Φ(1) = 1.
Condition 1 says that the system cannot deteriorate (that is, change from
the functioning state to the failed state) by improving the performance of a
component (that is, replacing a failed component by a functioning compo-
nent). Condition 2 says that if all the components are in the failure state,
then the system is in the failure state, and if all the components are in the
functioning state, then the system is in the functioning state.
All the systems we consider are monotone. In the reliability literature,
much attention has been devoted to coherent systems, which form a subclass of
monotone systems. Before we define a coherent system we need some notation.
Fig. 2.3. 2-out-of-3 structure (the series pairs {1, 2}, {1, 3}, and {2, 3} connected in parallel)

The vector $(\cdot_i, x)$ denotes a state vector where the state of the ith
component is equal to 1 or 0; $(1_i, x)$ denotes a state vector where the state
of the ith component is equal to 1, and $(0_i, x)$ denotes a state vector where
the state of the ith component is equal to 0; the state of component j, $j \ne i$,
equals $x_j$. If we want to specify the state of some components, say $i \in J$
($J \subset \{1, 2, \ldots, n\}$), we use the notation $(\cdot_J, x)$. For example, $(0_J, x)$ denotes
the state vector where the states of the components in J are all 0 and the
state of component i, $i \notin J$, equals $x_i$.

Definition 2.5. (Coherent system). A system is said to be coherent if


1. its structure function Φ is nondecreasing in each argument, and
2. each component is relevant, i.e., there exists at least one vector (·i , x) such
that Φ(1i , x) = 1 and Φ(0i , x) = 0.

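It is seen that if Φ is coherent, then Φ is also monotone. Relevance of any
component i gives a vector x with $\Phi(1_i, x) = 1$ and $\Phi(0_i, x) = 0$. Since Φ is
nondecreasing, it follows that $\Phi(1) \ge \Phi(1_i, x) = 1$ and $\Phi(0) \le \Phi(0_i, x) = 0$. We also need the
following terminology.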

Definition 2.6. (Minimal cut set). A cut set K is a set of components that
by failing causes the system to fail, i.e., Φ(0K , 1) = 0. A cut set is minimal
if it cannot be reduced without losing its status as a cut set.

Definition 2.7. (Minimal path set). A path set S is a set of components
that by functioning ensures that the system is functioning, i.e., $\Phi(1_S, 0) = 1$.
A path set is minimal if it cannot be reduced without losing its status as a
path set.

Example 2.8. Consider the reliability block diagram presented in Fig. 2.4. The
minimal cut sets of the system are: {1, 5}, {4, 5}, {1, 2, 3}, and {2, 3, 4}. Note
that, for example, {1, 4, 5} is a cut set, but it is not minimal. The minimal
path sets are {1, 4}, {2, 5}, and {3, 5}. In the following we will refer to this
example as the “5-components example.”
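The minimal cut and path sets listed in Example 2.8 can be recovered by brute force, as in this Python sketch. It assumes the structure function implied by the decomposition in Example 2.12: the series pair 1, 4 in parallel with the module (2 ∐ 3) in series with 5. The helper names are hypothetical. Enumerating all $2^5 - 1$ nonempty subsets is only practical for small systems.

```python
from itertools import combinations

N = 5  # components are labelled 1, ..., N


def phi(x):
    """Structure function of Fig. 2.4, as decomposed in Example 2.12.

    x maps each component label to its state (0 or 1).
    """
    upper = x[1] * x[4]                            # series pair 1-4
    lower = (1 - (1 - x[2]) * (1 - x[3])) * x[5]   # (2 parallel 3) in series with 5
    return 1 - (1 - upper) * (1 - lower)


def is_cut(K):
    """Phi(0_K, 1) = 0: failing all components in K makes the system fail."""
    return phi({i: 0 if i in K else 1 for i in range(1, N + 1)}) == 0


def is_path(S):
    """Phi(1_S, 0) = 1: the components in S alone keep the system working."""
    return phi({i: 1 if i in S else 0 for i in range(1, N + 1)}) == 1


def minimal(family):
    """Keep the sets that have no proper subset in the family."""
    return [s for s in family if not any(t < s for t in family)]


subsets = [frozenset(c) for r in range(1, N + 1)
           for c in combinations(range(1, N + 1), r)]
min_cuts = minimal([s for s in subsets if is_cut(s)])
min_paths = minimal([s for s in subsets if is_path(s)])
print(sorted(sorted(s) for s in min_cuts))   # [[1, 2, 3], [1, 5], [2, 3, 4], [4, 5]]
print(sorted(sorted(s) for s in min_paths))  # [[1, 4], [2, 5], [3, 5]]
```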
Fig. 2.4. Example of a reliability block diagram (the series pair 1, 4 in parallel with a branch in which components 2 and 3, in parallel, are in series with component 5)

Computing System Reliability


Let $X_i$ be independent binary random variables representing the state of the
ith component at a given point in time, i = 1, 2, ..., n. Let
$$p_i = P(X_i = 1)$$
$$q_i = P(X_i = 0)$$
$$h = h(p) = P(\Phi(X) = 1) \qquad (2.2)$$
$$g = g(q) = P(\Phi(X) = 0),$$
where $p = (p_1, p_2, \ldots, p_n)$, $q = (q_1, q_2, \ldots, q_n)$, and $X = (X_1, X_2, \ldots, X_n)$.
The probabilities $p_i$ and $q_i$ are referred to as the reliability and unreliability
of component i, respectively, and h and g are the corresponding reliability and
unreliability of the system.
The problem is to compute the system reliability h given the component
reliabilities pi . Often it will be more efficient to let the starting point of the
calculation be the unreliabilities. Note that h + g = 1 and pi + qi = 1.
Before we present methods for computation of system reliability for a
general structure, we will look closer into some special cases. We start with
the series structure.
Example 2.9. (Reliability of a series structure). For a series structure the
system functioning means that all the components function, hence

$$h = P(\Phi(X) = 1) = P\Big(\prod_{i=1}^{n} X_i = 1\Big) = P(X_1 = 1, X_2 = 1, \ldots, X_n = 1) = \prod_{i=1}^{n} P(X_i = 1) = \prod_{i=1}^{n} p_i. \qquad (2.3)$$

Example 2.10. (Reliability of a parallel structure). The reliability of a
parallel structure is given by
$$h = 1 - \prod_{i=1}^{n} (1 - p_i) = \coprod_{i=1}^{n} p_i. \qquad (2.4)$$

The proof of (2.4) is analogous to the proof of (2.3).

Example 2.11. (Reliability of a k-out-of-n structure). The reliability of
a k-out-of-n structure of independent components, which all have the same
reliability p, equals
$$h = \sum_{i=k}^{n} \binom{n}{i} p^i (1 - p)^{n-i}.$$
This formula holds since $\sum_{i=1}^{n} X_i$ has a binomial distribution with parameters
n and p under the given assumptions. The case that the component reliabilities
are not equal is treated later.
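Below is a small Python sketch of formulas (2.3), (2.4), and the binomial formula of Example 2.11. Each is cross-checked against brute-force summation over all state vectors. The values n = 4, k = 2, p = 0.9 are illustrative assumptions, not data from the text.

```python
from itertools import product
from math import comb, prod


def h_series(p):
    """(2.3): product of the component reliabilities."""
    return prod(p)


def h_parallel(p):
    """(2.4): 1 - product of the component unreliabilities."""
    return 1 - prod(1 - pi for pi in p)


def h_k_out_of_n(k, n, p):
    """Binomial formula of Example 2.11 (n independent components, reliability p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def h_enumerate(phi, p):
    """Brute force: sum of P(X = x) over all binary x with phi(x) = 1."""
    return sum(
        prod(pi if xi else 1 - pi for xi, pi in zip(x, p))
        for x in product((0, 1), repeat=len(p))
        if phi(x)
    )


n, k, p = 4, 2, 0.9  # illustrative values, not taken from the text
assert abs(h_k_out_of_n(k, n, p) - h_enumerate(lambda x: sum(x) >= k, [p] * n)) < 1e-12
assert abs(h_series([p] * n) - h_k_out_of_n(n, n, p)) < 1e-12    # series = n-out-of-n
assert abs(h_parallel([p] * n) - h_k_out_of_n(1, n, p)) < 1e-12  # parallel = 1-out-of-n
print(round(h_k_out_of_n(2, 3, 0.9), 4))  # 2-out-of-3 with p = 0.9: 0.972
```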

Next we look at an arbitrary series–parallel structure. By using the calculation
formulae for a series structure and a parallel structure it is relatively
straightforward to calculate the reliability of combinations of series and parallel
structures, provided that each component is included in just one such
structure. Let us consider an example.

Example 2.12. Consider again the reliability block diagram in Fig. 2.4. The
system can be viewed as a parallel structure of two independent modules: the
structure comprising the components 1 and 4, and the structure comprising
the components 2, 3, and 5. The reliability of the former structure equals p1 p4 ,
whereas the reliability of the latter equals (1 − (1 − p2 )(1 − p3 ))p5 . Thus the
system reliability is given by

h = 1 − {1 − p1 p4 }{1 − (1 − (1 − p2 )(1 − p3 ))p5 }.

Assuming that $q_1 = q_2 = q_3 = 0.02$ and $q_4 = q_5 = 0.01$, this formula gives
$h = 0.9997$, i.e., $g = 3 \cdot 10^{-4}$.
If, for example, a 2-out-of-3 structure of independent components with the
same reliability p is in series with the above system, the total system reliability
will be as above multiplied by the reliability of the 2-out-of-3 structure,
which equals
$$\binom{3}{2} p^2 (1 - p) + \binom{3}{3} p^3 (1 - p)^0 = 3p^2(1 - p) + p^3.$$
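The numbers of Example 2.12 can be reproduced with a few lines of Python. The sketch simply evaluates the series–parallel decomposition. The text leaves p unspecified for the additional 2-out-of-3 structure, so the value r = 0.99 used below is hypothetical.

```python
q = {1: 0.02, 2: 0.02, 3: 0.02, 4: 0.01, 5: 0.01}  # data of Example 2.12
p = {i: 1 - qi for i, qi in q.items()}


def coprod(a, b):
    """Two independent modules in parallel: 1 - (1 - a)(1 - b)."""
    return 1 - (1 - a) * (1 - b)


module_14 = p[1] * p[4]                  # series pair 1-4
module_235 = coprod(p[2], p[3]) * p[5]   # (2 parallel 3) in series with 5
h = coprod(module_14, module_235)
g = 1 - h
print(f"h = {h:.4f}, g = {g:.2e}")      # h = 0.9997, g = 3.10e-04


def h_two_out_of_three(r):
    """3 r^2 (1 - r) + r^3 for identical, independent components."""
    return 3 * r**2 * (1 - r) + r**3


r = 0.99  # hypothetical component reliability; the text leaves p unspecified
h_total = h * h_two_out_of_three(r)
```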

Now consider a general monotone structure. Computation of system reliability
for complex systems might be a formidable task (in fact, impracticable
in some cases) unless an efficient method (algorithm) is used. Developing such
methods is therefore an important area of research within reliability theory.

There exist a number of methods for reliability computation of a general
structure. Many of these methods are based on the minimal cut (path) sets.
For smaller systems the so-called inclusion–exclusion method may be applied,
but this method is primarily a method for approximate calculations for systems
that are either very reliable or unreliable.

Inclusion–Exclusion Method. Let $A_j$ be the event that minimal cut set
$K_j$ is not functioning, j = 1, 2, ..., k. Then clearly,
$$P(A_j) = \prod_{i \in K_j} q_i$$
and
$$g = P\Big(\bigcup_{j=1}^{k} A_j\Big).$$
Furthermore, let
$$w_1 = \sum_{j=1}^{k} P(A_j),$$
$$w_2 = \sum_{i<j} P(A_i A_j),$$
$$\vdots$$
$$w_r = \sum_{1 \le i_1 < i_2 < \cdots < i_r \le k} P\Big(\bigcap_{j=1}^{r} A_{i_j}\Big).$$
Then the well-known inclusion–exclusion formula states that
$$g = w_1 - w_2 + w_3 - \cdots + (-1)^{k+1} w_k \qquad (2.5)$$
and for $r \le k$
$$g \le w_1 - w_2 + w_3 - \cdots + w_r, \quad r \text{ odd},$$
$$g \ge w_1 - w_2 + w_3 - \cdots - w_r, \quad r \text{ even}.$$
Although in general it is not true that the upper bounds decrease and the
lower bounds increase, in practice it is often sufficient to calculate only a few
$w_r$ terms to obtain a close approximation. If the component unreliabilities
$q_i$ are small, i.e., the reliabilities $p_i$ are large, then the $w_2$ term will usually
be negligible compared to $w_1$, so that $g \approx w_1$. Note that $w_1$ is an upper
bound for g. By using $w_1$ as an estimate for the system unreliability, we will
overestimate the system unreliability. In most cases, such an underestimation
of reliability is preferable to an overestimation of reliability.
With a large number of minimal cut sets, the exact calculation using (2.5)
will be extensive. The number of terms in the sum in $w_r$ equals $\binom{k}{r}$. Thus the
total number of terms is
$$\sum_{r=1}^{k} \binom{k}{r} = (1 + 1)^k - 1 = 2^k - 1.$$

Example 2.13. (Continuation of Examples 2.8 and 2.12). The problem is to
calculate the unreliability of the 5-components system of Fig. 2.4 by means
of the approximation method described above. We assume that $q_1 = q_2 =
q_3 = 0.02$ and $q_4 = q_5 = 0.01$. We find that $w_1 = 3 \cdot 10^{-4}$, which means
that $g \approx 3 \cdot 10^{-4}$. It is intuitively clear that the error introduced by this
approximation will not be significant. Calculating $w_2$ confirms this:
$$w_2 = q_1 q_4 q_5 + q_1 q_2 q_3 q_5 + q_1 q_2 q_3 q_4 q_5 + q_1 q_2 q_3 q_4 q_5 + q_2 q_3 q_4 q_5 + q_1 q_2 q_3 q_4 = 2.2 \cdot 10^{-6}.$$
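The following Python sketch computes the terms $w_r$ of the inclusion–exclusion formula (2.5) for the 5-components example. It assumes independent components, so that $P(A_{i_1} \cdots A_{i_r})$ is the product of the $q_i$ over the union of the cut sets involved. The sketch reproduces $w_1$ and $w_2$ of Example 2.13 and sums all $2^k - 1$ terms to obtain g exactly.

```python
from itertools import combinations
from math import prod

q = {1: 0.02, 2: 0.02, 3: 0.02, 4: 0.01, 5: 0.01}
cuts = [{1, 5}, {4, 5}, {1, 2, 3}, {2, 3, 4}]  # minimal cut sets (Example 2.8)


def w(r):
    """w_r: sum over all r-subsets of cut sets of P(all of them fail).

    With independent components, the joint failure probability is the
    product of q_i over the union of the cut sets in the subset.
    """
    return sum(prod(q[i] for i in set().union(*combo))
               for combo in combinations(cuts, r))


k = len(cuts)
ws = [w(r) for r in range(1, k + 1)]
g_exact = sum((-1) ** (r + 1) * ws[r - 1] for r in range(1, k + 1))  # (2.5)
print(f"w1 = {ws[0]:.3e}, w2 = {ws[1]:.3e}, g = {g_exact:.4e}")
```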

There also exist other bounds and approximations for the system reliability.
For example, it can be shown that
$$1 - \prod_{j=1}^{k} \Big(1 - \prod_{i \in K_j} q_i\Big) = 1 - \prod_{j=1}^{k} \coprod_{i \in K_j} p_i$$
is an upper bound for g, and a good approximation for small values of the
component unreliabilities $q_i$; see Barlow and Proschan [32], p. 35. This bound
is always as good as or better than $w_1$. In the following we sketch some alternative
methods for reliability computation.
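This sketch compares the bound above with $w_1$ and with the exact unreliability for the data of Example 2.13. It assumes independent components and takes the exact g from the decomposition of Example 2.12.

```python
from math import prod

q = {1: 0.02, 2: 0.02, 3: 0.02, 4: 0.01, 5: 0.01}
p = {i: 1 - qi for i, qi in q.items()}
cuts = [{1, 5}, {4, 5}, {1, 2, 3}, {2, 3, 4}]

cut_unrel = [prod(q[i] for i in K) for K in cuts]  # P(A_j) for each K_j
w1 = sum(cut_unrel)
bound = 1 - prod(1 - u for u in cut_unrel)         # 1 - prod_j (1 - prod_{i in K_j} q_i)

# Exact unreliability from the series-parallel decomposition of Example 2.12
g = (1 - p[1] * p[4]) * (1 - (1 - q[2] * q[3]) * p[5])

assert g <= bound <= w1
print(f"g = {g:.5e}, bound = {bound:.5e}, w1 = {w1:.5e}")
```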

Method Using the Minimal Cut Set Representation of the Structure
Function. Using
$$\Phi(X) = \prod_{j=1}^{k} \coprod_{i \in K_j} X_i,$$
and by multiplying out the right-hand side of this expression, we can find
an exact expression of h (or g). As an illustration consider a 2-out-of-3 system.
Then
$$\Phi = (X_1 \amalg X_2) \cdot (X_1 \amalg X_3) \cdot (X_2 \amalg X_3)$$
and by multiplication we obtain
$$\Phi = X_1 X_2 + X_1 X_3 + X_2 X_3 - 2 X_1 X_2 X_3.$$
We have used $X_i^r = X_i$ for r = 1, 2, .... It follows by taking expectations that
$$h = p_1 p_2 + p_1 p_3 + p_2 p_3 - 2 p_1 p_2 p_3.$$
For systems with low reliabilities, it is possible to establish similar results
based on the minimal path sets.
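For the 2-out-of-3 illustration, here is a short Python check of two claims. First, the cut set representation and its multiplied-out form agree on all binary states. Second, the resulting expression for h matches direct summation. The reliabilities (0.9, 0.8, 0.7) are illustrative only.

```python
from itertools import product
from math import prod


def coprod(a, b):
    """a ∐ b = 1 - (1 - a)(1 - b)."""
    return 1 - (1 - a) * (1 - b)


def phi_cut_rep(x):
    """2-out-of-3 via its minimal cut sets {1,2}, {1,3}, {2,3}."""
    x1, x2, x3 = x
    return coprod(x1, x2) * coprod(x1, x3) * coprod(x2, x3)


def phi_expanded(x):
    """Multiplied-out form, using x_i**r == x_i for binary x_i."""
    x1, x2, x3 = x
    return x1 * x2 + x1 * x3 + x2 * x3 - 2 * x1 * x2 * x3


for x in product((0, 1), repeat=3):
    assert phi_cut_rep(x) == phi_expanded(x)


def h_formula(p1, p2, p3):
    return p1 * p2 + p1 * p3 + p2 * p3 - 2 * p1 * p2 * p3


def h_enumerate(p):
    return sum(prod(pi if xi else 1 - pi for xi, pi in zip(x, p))
               for x in product((0, 1), repeat=3) if phi_expanded(x) == 1)


p = (0.9, 0.8, 0.7)  # illustrative reliabilities, not from the text
assert abs(h_formula(*p) - h_enumerate(p)) < 1e-12
```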
State Enumeration Method. Of the direct methods that do not use the
minimal cut (path) sets, the state enumeration method is conceptually the
simplest. With this method reliability is calculated using
$$h = E\Phi(X) = \sum_{x} \Phi(x) P(X = x) = \sum_{x:\, \Phi(x) = 1} \prod_{i=1}^{n} p_i^{x_i} (1 - p_i)^{1 - x_i}.$$
This method, however, is not suitable for larger systems, since the number of
terms in the sum can be extremely large, up to $2^n - 1$.

Fig. 2.5. Bridge structure (components 1 and 4 on the upper branch, 2 and 5 on the lower branch, and component 3 connecting the two branches)
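Here is a minimal Python sketch of the state enumeration method, applied to the bridge of Fig. 2.5. It assumes that the minimal path sets of the bridge are {1, 4}, {2, 5}, {1, 3, 5}, and {2, 3, 4}. This is our reading of the figure, and it is consistent with the conditional structures in Example 2.14. All reliabilities are set to an illustrative 0.9. The loop visits all $2^n$ states, which is exactly why the method does not scale.

```python
from itertools import product
from math import prod

PATHS = [{1, 4}, {2, 5}, {1, 3, 5}, {2, 3, 4}]  # assumed minimal path sets of the bridge


def phi_bridge(x):
    """x is a tuple of 5 states; x[i - 1] is the state of component i."""
    return int(any(all(x[i - 1] for i in S) for S in PATHS))


def h_state_enumeration(phi, p):
    """h = sum over x with phi(x) = 1 of prod_i p_i^x_i (1 - p_i)^(1 - x_i).

    The cost grows like 2**n, so this is only practical for small n.
    """
    total = 0.0
    for x in product((0, 1), repeat=len(p)):
        if phi(x):
            total += prod(pi if xi else 1 - pi for xi, pi in zip(x, p))
    return total


p = [0.9] * 5  # illustrative equal reliabilities
print(round(h_state_enumeration(phi_bridge, p), 5))  # 0.97848
```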

Factoring Method. Of other methods we will confine ourselves to describing
the so-called factoring algorithm (pivot-decomposition method). The basic
idea of this method is to make a conditional probability argument using the
relation
$$h(p) = p_i h(1_i, p) + (1 - p_i) h(0_i, p), \qquad (2.6)$$
where $h(x_i, p)$ equals the reliability of the system given that the state of
component i is $x_i$. Formula (2.6) follows from the law of total probability.
The decomposition is then applied again to each conditional structure that is
not yet of series–parallel form, until only series–parallel structures remain.
To illustrate the method we will give an example.

Example 2.14. Consider a bridge structure as given by the diagram shown in
Fig. 2.5. If we first choose to pivot on component 3, formula (2.6) holds with
i = 3. It is not difficult to see that given $x_3 = 1$, the system structure has
the form of components 1 and 2 in parallel, in series with components 4 and 5 in parallel,
and that given $x_3 = 0$, the system structure has the form of the series pair 1, 4
in parallel with the series pair 2, 5.
These two structures are both of series–parallel form, and we see that
$$h(1_3, p) = (p_1 \amalg p_2)(p_4 \amalg p_5),$$
$$h(0_3, p) = p_1 p_4 \amalg p_2 p_5.$$

Thus a formula for the exact computation of h(p) is established. Note that it
was sufficient to perform only one pivotal decomposition in this case. If the
structure given x3 = 1 had not been in a series–parallel form, we would have
had to perform another pivotal decomposition, and so on.
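The pivotal decomposition of Example 2.14 translates directly into code. This sketch implements (2.6) with i = 3 and the two series–parallel formulas above. It then cross-checks the result against brute-force enumeration, again assuming the minimal path sets {1, 4}, {2, 5}, {1, 3, 5}, {2, 3, 4} for the bridge. The reliabilities are illustrative.

```python
from itertools import product
from math import prod


def coprod(a, b):
    """1 - (1 - a)(1 - b): two independent units in parallel."""
    return 1 - (1 - a) * (1 - b)


def h_bridge_factoring(p):
    """Pivot on component 3, formula (2.6): h = p3 h(1_3, p) + (1 - p3) h(0_3, p)."""
    p1, p2, p3, p4, p5 = p
    h_3_up = coprod(p1, p2) * coprod(p4, p5)   # (1 || 2) in series with (4 || 5)
    h_3_down = coprod(p1 * p4, p2 * p5)        # (1-4) || (2-5)
    return p3 * h_3_up + (1 - p3) * h_3_down


def h_bridge_enumeration(p):
    paths = [(1, 4), (2, 5), (1, 3, 5), (2, 3, 4)]  # assumed minimal path sets
    total = 0.0
    for x in product((0, 1), repeat=5):
        if any(all(x[i - 1] for i in S) for S in paths):
            total += prod(pi if xi else 1 - pi for xi, pi in zip(x, p))
    return total


p = (0.95, 0.9, 0.8, 0.85, 0.99)  # illustrative, unequal reliabilities
assert abs(h_bridge_factoring(p) - h_bridge_enumeration(p)) < 1e-12
```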

For a monotone structure Φ we have
$$\Phi(x \amalg y) \ge \Phi(x) \amalg \Phi(y), \qquad (2.7)$$
where $x \amalg y = (x_1 \amalg y_1, \ldots, x_n \amalg y_n)$. This is seen by noting that $\Phi(x \amalg y)$
is greater than or equal to both $\Phi(x)$ and $\Phi(y)$. It follows from (2.7) that
$$h(p \amalg p') \ge h(p) \amalg h(p')$$
for all $0 \le p \le 1$ and $0 \le p' \le 1$. These results state that redundancy at
the component level is more effective than redundancy at system level. This
principle is well known among design engineers. Note that if the system is a
parallel system, then equality holds in the above inequalities. If the system is
coherent, then equality holds if and only if the system is a parallel system.
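The following is a numerical spot check of the inequality $h(p \amalg p') \ge h(p) \amalg h(p')$ for series, parallel, and 2-out-of-3 structures. It uses a coarse grid of illustrative reliabilities and assumes independent components. It is a sanity check, not a proof. For the parallel structure it also confirms equality.

```python
from itertools import product
from math import prod


def coprod(*xs):
    """x_1 ∐ ... ∐ x_m = 1 - prod(1 - x_i)."""
    return 1 - prod(1 - x for x in xs)


def h_series(p):
    return prod(p)


def h_parallel(p):
    return coprod(*p)


def h_2oo3(p):
    p1, p2, p3 = p
    return p1 * p2 + p1 * p3 + p2 * p3 - 2 * p1 * p2 * p3


grid = [0.1, 0.5, 0.9]  # illustrative reliabilities
for h, n in [(h_series, 2), (h_parallel, 2), (h_2oo3, 3)]:
    for p in product(grid, repeat=n):
        for p_prime in product(grid, repeat=n):
            component_level = h([coprod(a, b) for a, b in zip(p, p_prime)])  # h(p ∐ p')
            system_level = coprod(h(p), h(p_prime))                          # h(p) ∐ h(p')
            assert component_level >= system_level - 1e-12
            if h is h_parallel:  # equality for a parallel system
                assert abs(component_level - system_level) < 1e-12
```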

Time Dynamics

The above theory can be applied to different situations, covering both rep-
airable and nonrepairable systems. As an example, consider a monotone sys-
tem in a time interval [0, t0 ], and assume that the components of the system
are “new” at time t = 0 and that a failed component stays in the failure
state for the rest of the time interval. Thus the component is not repaired or
replaced. This situation, for example, can describe a system with component
failure states that can only be discovered by testing or inspection. We assume
that the lifetime of component i is determined by a lifetime distribution Fi (t)
having failure rate function $\lambda_i(t)$. To calculate system reliability at a fixed
point in time, i.e., the reliability function at this point, we can proceed as
above with $q_i = F_i(t)$ and $p_i = \bar F_i(t)$. Thus, for a series system the reliability
at time t takes the form
$$h = \prod_{i=1}^{n} \bar F_i(t). \qquad (2.8)$$
But $\bar F_i(t)$ can be expressed by means of the failure rate $\lambda_i(t)$:
$$\bar F_i(t) = e^{-\int_0^t \lambda_i(u)\,du}. \qquad (2.9)$$
By putting (2.9) into formula (2.8) we obtain
$$h = e^{-\int_0^t \left[\sum_{i=1}^{n} \lambda_i(u)\right] du}. \qquad (2.10)$$
From (2.10) we can conclude that the failure rate of a series structure of
independent components equals the sum of the failure rates of the components
of the structure. In particular this means that if the components have constant
failure rates $\lambda_i$, i = 1, 2, ..., n, then the series structure has constant failure
rate $\sum_{i=1}^{n} \lambda_i$.
For a parallel structure we do not have a similar result. With constant
failure rates of the components, the system will have a time-dependent failure
rate; cf. Example 1.3, p. 6.
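To illustrate (2.10) and the remark on parallel structures, the sketch below estimates failure rates numerically as $-\frac{d}{dt} \log \bar F(t)$ by a central difference. The constant rates 0.5, 1.0, 2.0 (series) and 0.5, 1.0 (parallel) are illustrative assumptions.

```python
import math

rates_series = [0.5, 1.0, 2.0]  # illustrative constant failure rates


def series_survival(t):
    """(2.8) with exponential components: prod_i exp(-lambda_i t)."""
    return math.prod(math.exp(-rate * t) for rate in rates_series)


def failure_rate(survival, t, dt=1e-6):
    """Numerical failure rate -d/dt log survival(t) (central difference)."""
    return -(math.log(survival(t + dt)) - math.log(survival(t - dt))) / (2 * dt)


# Series structure: constant failure rate equal to the sum of the rates
for t in (0.5, 1.0, 3.0):
    assert abs(failure_rate(series_survival, t) - sum(rates_series)) < 1e-5


def parallel_survival(t, l1=0.5, l2=1.0):
    """Two independent exponential components in parallel."""
    return math.exp(-l1 * t) + math.exp(-l2 * t) - math.exp(-(l1 + l2) * t)


rates = [failure_rate(parallel_survival, t) for t in (0.5, 1.0, 3.0, 10.0)]
print([round(r, 3) for r in rates])  # clearly not constant in t
```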

Reliability Importance Measures

An important objective of many reliability and risk analyses is to identify
those components or events that are most important (critical) from a reliability/safety
point of view and that should be given priority with respect to
improvements. Thus, we need an importance measure. A large number of such
measures have been suggested (see Bibliographic Notes, p. 55). Here we briefly
describe two measures, Improvement Potential and Birnbaum’s measure.
Consider again the 5-components example (cf. pp. 20, 22, and 24). The
unreliability of the system equals
$$g = \{1 - p_1 p_4\}\{1 - p_5(p_2 + p_3 - p_2 p_3)\} \approx w_1,$$
$$w_1 = q_1 q_5 + q_4 q_5 + q_1 q_2 q_3 + q_2 q_3 q_4 = 0.02 \cdot 0.01 + 0.01 \cdot 0.01 + 0.02 \cdot 0.02 \cdot 0.02 + 0.02 \cdot 0.02 \cdot 0.01 = 3 \cdot 10^{-4}.$$

If we look at the subsystems comprising the minimal cut sets, it is clear
from the above expression that subsystems {1, 5} and {4, 5} are most important
in the sense that they are contributing most to unreliability. To decide
which components are most important, we must define more precisely what is
meant by important. For example, we might decide to let the component with
the highest potential for increasing the system reliability be most important
(measure for reliability improvement potential) or the component that has the
largest effect on system reliability by a small improvement of the component
reliability (Birnbaum’s measure).

Improvement Potential. The following reliability importance measure for
component i, $I_i^A$, is appropriate in a large number of situations, in particular
during design:
$$I_i^A = h(1_i, p) - h(p),$$
where $h(p)$ is the reliability of the system and $h(1_i, p)$ is the reliability assuming
that component i is in the best state 1. The measure $I_i^A$ expresses the
system reliability improvement potential of the component, in other words,
the unreliability that is caused by imperfect performance of component i. This
measure can be used for all types of reliability definitions, and it can be used
for repairable or nonrepairable systems.
For a highly reliable monotone system the measure $I_i^A$ is equivalent to the
well-known Vesely–Fussell importance measure [86]. In fact, in this case $I_i^A$ is
approximately equal to the sum of the unreliabilities of the minimal cut sets
that include component i, i.e.,
$$I_i^A \approx \sum_{j:\, i \in K_j} \prod_{l \in K_j} q_l. \qquad (2.11)$$
This is seen by applying the inclusion–exclusion formula. This formula states
that $1 - h(p) \approx \sum_{j=1}^{k} \prod_{l \in K_j} q_l$. Putting $q_i = 0$ in this formula and
subtracting, we obtain the desired approximation formula for $I_i^A$. Note that,
like the Vesely–Fussell measure, the measure $I_i^A$ gives the same importance
to all the components of a parallel system, irrespective of component reliabilities,
namely, $I_i^A = \prod_{j=1}^{n} q_j$. This is as it should be because each one of the
components has the potential of making the system unreliability negligible,
for example, by introducing redundancy.

Example 2.15. Computation of $I_i^A$ for the 5-components example gives
$$I_1^A = 2 \cdot 10^{-4}, \quad I_2^A = 1 \cdot 10^{-5}, \quad I_3^A = 1 \cdot 10^{-5}, \quad I_4^A = 1 \cdot 10^{-4}, \quad I_5^A = 3 \cdot 10^{-4}.$$
Thus component 5 is the most important component based on this measure.
Components 1 and 4 follow in second and third place, respectively.
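The following Python sketch computes $I_i^A$ exactly from h and compares it with approximation (2.11) for the data of Example 2.13. It takes h from the decomposition of Example 2.12.

```python
from math import prod

q = {1: 0.02, 2: 0.02, 3: 0.02, 4: 0.01, 5: 0.01}
p = {i: 1 - qi for i, qi in q.items()}
cuts = [{1, 5}, {4, 5}, {1, 2, 3}, {2, 3, 4}]  # minimal cut sets (Example 2.8)


def h(p):
    """Reliability of the 5-components system (decomposition of Example 2.12)."""
    return 1 - (1 - p[1] * p[4]) * (1 - p[5] * (p[2] + p[3] - p[2] * p[3]))


# Exact improvement potential I_i^A = h(1_i, p) - h(p)
IA = {i: h({**p, i: 1.0}) - h(p) for i in p}
# Approximation (2.11): sum of cut-set unreliabilities over cut sets containing i
IA_approx = {i: sum(prod(q[j] for j in K) for K in cuts if i in K) for i in p}

ranking = sorted(IA, key=IA.get, reverse=True)
for i in ranking:
    print(i, f"{IA[i]:.2e}", f"{IA_approx[i]:.2e}")
```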

Birnbaum’s Measure. Birnbaum’s measure for the reliability importance
of component i, $I_i^B$, is defined by
$$I_i^B = \frac{\partial h}{\partial p_i}.$$
Thus Birnbaum’s measure equals the partial derivative of the system reliability
with respect to $p_i$. The approach is well known from classical sensitivity analyses.
We see that if $I_i^B$ is large, a small change in the reliability of component
i will give a relatively large change in system reliability.
Birnbaum’s measure might be appropriate, for example, in the operation
phase where possible improvement actions are related to operation and main-
tenance parameters. Before looking closer into specific improvement actions of
the components, it will be informative to measure the sensitivity of the system
reliability with respect to small changes in the reliability of the components.
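To compute $I_i^B$ the following formula is often used:
$$I_i^B = h(1_i, p) - h(0_i, p). \qquad (2.12)$$
This formula is established using (2.6), p. 25. Since $h(1_i, p)$ and $h(0_i, p)$ do not
depend on $p_i$, (2.6) shows that h is linear in $p_i$, and differentiating with respect to $p_i$ gives (2.12).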

Example 2.16. Using (2.12) we find that
$$I_1^B = 1.03 \cdot 10^{-2} \approx 1 \cdot 10^{-2}, \quad I_2^B = I_3^B = 6 \cdot 10^{-4}, \quad I_4^B = 1.02 \cdot 10^{-2} \approx 1 \cdot 10^{-2}, \quad I_5^B = 3 \cdot 10^{-2}.$$
We see that for this example the Birnbaum measure gives the same ranking
of the components as the measure $I_i^A$. However, this is not true in general.
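This sketch computes Birnbaum's measure via (2.12) and uses a central-difference derivative as a cross-check. By (2.6), h is affine in each $p_i$, so the difference quotient reproduces $\partial h / \partial p_i$ up to rounding. The data are those of Example 2.13.

```python
q = {1: 0.02, 2: 0.02, 3: 0.02, 4: 0.01, 5: 0.01}
p = {i: 1 - qi for i, qi in q.items()}


def h(p):
    """Reliability of the 5-components system (decomposition of Example 2.12)."""
    return 1 - (1 - p[1] * p[4]) * (1 - p[5] * (p[2] + p[3] - p[2] * p[3]))


IB = {i: h({**p, i: 1.0}) - h({**p, i: 0.0}) for i in p}  # (2.12)

# Cross-check: h is affine in each p_i, so a difference quotient equals the derivative
eps = 1e-6
IB_numeric = {i: (h({**p, i: p[i] + eps}) - h({**p, i: p[i] - eps})) / (2 * eps)
              for i in p}
assert all(abs(IB[i] - IB_numeric[i]) < 1e-6 for i in p)
print({i: f"{v:.2e}" for i, v in IB.items()})
```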

It is not difficult to see that
$$I_i^B = E[\Phi(1_i, X) - \Phi(0_i, X)] = P(\Phi(1_i, X) - \Phi(0_i, X) = 1) = P(\Phi(1_i, X) = 1, \Phi(0_i, X) = 0).$$
If $\Phi(1_i, x) - \Phi(0_i, x) = 1$, we call $(1_i, x)$ a critical path vector and $(0_i, x)$ a
critical cut vector for component i. For simplicity, we often say that component
i is critical for the system.
Thus we have shown that $I_i^B$ equals the probability that the system is
in a state so that component i is critical for the system. If the components
are dependent, this probability is often used as the definition of Birnbaum’s
measure. Now set $p_j = 1/2$ for all $j \ne i$. Then
$$I_i^B = \frac{1}{2^{n-1}} \sum_{(\cdot_i, x)} [\Phi(1_i, x) - \Phi(0_i, x)] = \frac{1}{2^n} \sum_{x} [\Phi(1_i, x) - \Phi(0_i, x)].$$
This quantity is used as a measure of the structural importance of component i.
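Structural importance can be computed exactly by counting critical vectors. This sketch does so for the 5-components example, with the structure function as in Example 2.12. It uses Python's `fractions` module to keep the results exact.

```python
from fractions import Fraction
from itertools import product


def phi(x):
    """Fig. 2.4 structure; x[i - 1] is the state of component i."""
    x1, x2, x3, x4, x5 = x
    return 1 - (1 - x1 * x4) * (1 - (1 - (1 - x2) * (1 - x3)) * x5)


n = 5
structural = {}
for i in range(n):
    critical = 0
    for x in product((0, 1), repeat=n):
        if x[i] == 1:  # visit each vector (.i, x) once, via its "up" version
            down = x[:i] + (0,) + x[i + 1:]
            critical += phi(x) - phi(down)  # 1 exactly when component i is critical
    structural[i + 1] = Fraction(critical, 2 ** (n - 1))
print(structural)
```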

Some Comments on the Use of Importance Measures. The two importance
measures presented in this section can be useful tools in the system
optimization process/system improvement process. This process can be
described as follows:
1. Identify the most important units by means of the chosen importance
measure
2. Identify possible improvement actions/measures for these units
3. Estimate the effect on reliability of implementing these improvement measures
4. Perform cost evaluations
5. Make an overall evaluation and take a decision.
The importance measure to be used in a particular case depends on the
characteristics we want the measure to reflect. Undoubtedly, different situations
call for different importance measures. In a design phase the system
reliability improvement potential $I_i^A$ might be the most informative measure,
but for a system with frozen design, the Birnbaum measure might be more
informative, since this measure reflects how small component reliability
improvements affect system reliability.

Dependent Components
In the following some remarks on systems with dependent components are
made. A more systematic treatment concerning copula models can be found
in the last subsection of this chapter.
One of the most difficult tasks in reliability engineering is to analyze depen-
dent components (often referred to as common mode failures). It is difficult to
formulate the dependency in a mathematically stringent way and at the same
time obtain a realistic model and to provide data for the model. Whether we
succeed in incorporating a “correct” contribution from common mode failures
is very much dependent on the modeling ability of the analyst. By defining
the components in a suitable way, it is often possible to preclude dependency.
For example, common mode failures that are caused by a common external
cause can be identified and separated out so that the components can be con-
sidered as independent components. Another useful method for “elimination”
of dependency is to redefine components. For example, instead of including
a parallel structure of dependent components in the system, this structure
could be represented by one component. Of course, this does not remove the
dependency, but it moves it to a lower level of the analysis. Special techniques,
such as Markov modeling, can then be used to analyze the parallel structure
itself, or we can try to estimate/assign reliability parameters directly for this
new component.
Although it is often possible to “eliminate” dependency between compo-
nents by proper modeling, it will in many cases be required to establish a
model that explicitly takes into account the dependency. Refer to Chap. 3 for
examples of such models.
Another way of taking into account dependency is to obtain bounds to the
system reliability, assuming that the components are associated and not neces-
sarily independent. Association is a type of positive dependency, for example,
as a result of components supporting loads. The precise mathematical definition
is as follows (cf. [32]):

Definition 2.17. Random variables $T_1, T_2, \ldots, T_n$ are associated if
$$\operatorname{cov}[f(T), g(T)] \ge 0$$
for all pairs of increasing binary functions f and g.

A number of results are established for associated components, for example,
the following inequalities:
$$\max_{1 \le j \le s} \prod_{i \in S_j} p_i \le h \le 1 - \max_{1 \le j \le k} \prod_{i \in K_j} q_i,$$

where Sj equals the jth minimal path set, j = 1, 2, . . . , s and Kj equals the
jth minimal cut set, j = 1, 2, . . . , k. This method usually leads to very wide
intervals for the reliability.
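Independent components are in particular associated. The bounds can therefore be evaluated for the 5-components example with the data of Example 2.13. The sketch shows how wide the resulting interval is compared with the exact h. The path and cut sets are those of Example 2.8.

```python
from math import prod

q = {1: 0.02, 2: 0.02, 3: 0.02, 4: 0.01, 5: 0.01}
p = {i: 1 - qi for i, qi in q.items()}
paths = [{1, 4}, {2, 5}, {3, 5}]              # minimal path sets S_j
cuts = [{1, 5}, {4, 5}, {1, 2, 3}, {2, 3, 4}]  # minimal cut sets K_j

lower = max(prod(p[i] for i in S) for S in paths)
upper = 1 - max(prod(q[i] for i in K) for K in cuts)
h = 1 - (1 - p[1] * p[4]) * (1 - p[5] * (p[2] + p[3] - p[2] * p[3]))  # exact, independent case

assert lower <= h <= upper
print(f"{lower:.4f} <= h = {h:.4f} <= {upper:.4f}")  # 0.9702 <= h = 0.9997 <= 0.9998
```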

2.1.2 Multistate Monotone Systems

In this section parts of the theory presented in Sect. 2.1.1 will be generalized to
include multistate systems where components and system are allowed to have
an arbitrary (finite) number of states/levels. Multistate monotone systems are
used to model, e.g., production and transportation systems for oil and gas,
and power transmission systems.
We consider a system comprising n components, numbered consecutively
from 1 to n. As in the binary case, $x_i$ represents the state of component i,
i = 1, 2, ..., n, but now $x_i$ can be in one out of $M_i + 1$ states,
$$x_{i0}, x_{i1}, x_{i2}, \ldots, x_{iM_i} \qquad (x_{i0} < x_{i1} < x_{i2} < \cdots < x_{iM_i}).$$

The set comprising these states is denoted $S_i$. The states $x_{ij}$ represent, for
example, different levels of performance, from the worst, $x_{i0}$, to the best,
$x_{iM_i}$. The states $x_{i0}, x_{i1}, \ldots, x_{i,M_i-1}$ are referred to as the failure states of
the components.
Similarly, $\Phi = \Phi(x)$ denotes the state (level) of the system. The various
values Φ can take are denoted
$$\Phi_0, \Phi_1, \ldots, \Phi_M \qquad (\Phi_0 < \Phi_1 < \cdots < \Phi_M).$$

We see that if $M_i = 1$, i = 1, 2, ..., n, and M = 1, then the model is identical
with the binary model of Sect. 2.1.1.
Definition 2.18. (Monotone system). A system is said to be monotone if
1. its structure function Φ is nondecreasing in each argument, and
2. $\Phi(x_{10}, x_{20}, \ldots, x_{n0}) = \Phi_0$ and $\Phi(x_{1M_1}, x_{2M_2}, \ldots, x_{nM_n}) = \Phi_M$.
In the following we will restrict attention to monotone systems. As usual,
we use the convention that $(x_1, x_2, \ldots, x_n) > (z_1, z_2, \ldots, z_n)$ means that $x_i \ge z_i$,
i = 1, 2, ..., n, and there exists at least one i such that $x_i > z_i$.
Fig. 2.6. A simple example of a flow network (components 1 and 2 in parallel, in series with component 3, between a and b)

Definition 2.19. (Minimal cut vector). A vector z is a cut vector to level
c if $\Phi(z) < c$. A cut vector to level c, z, is minimal if $\Phi(x) \ge c$ for all $x > z$.

Definition 2.20. (Minimal path vector). A vector y is a path vector to
level c if $\Phi(y) \ge c$. A path vector to level c, y, is minimal if $\Phi(x) < c$ for all
$x < y$.

Example 2.21. Figure 2.6 shows a simple example of a flow network model.
The system comprises three components. Flow (gas/oil) is transmitted from a
to b. The components 1 and 2 are binary, whereas component 3 can be in one
out of three states: 0, 1, or 2. The states of the components are interpreted
as flow capacity rates for the components. The state/level of the system is
defined as the maximum flow that can be transmitted from a to b, i.e.,

Φ = Φ(x) = min{x1 + x2 , x3 }.

If, for example, the component states are $x_1 = 0$, $x_2 = 1$, and $x_3 = 2$, then
the flow throughput equals 1, i.e., $\Phi = \Phi(0, 1, 2) = 1$. The possible system
levels are 0, 1, and 2. We see that Φ is a multistate monotone system. The
minimal cut vectors and path vectors are as follows:

System level 2
Minimal cut vectors: (0, 1, 2), (1, 0, 2), and (1, 1, 1)
Minimal path vectors : (1, 1, 2)

System level 1
Minimal cut vectors: (0, 0, 2) and (1, 1, 0)
Minimal path vectors : (0, 1, 1) and (1, 0, 1).
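In such a small state space, Definitions 2.19 and 2.20 can be applied literally by brute force. This Python sketch reproduces the minimal cut and path vectors of Example 2.21. The helper names are our own.

```python
from itertools import product

STATES = [(0, 1), (0, 1), (0, 1, 2)]  # state sets S_1, S_2, S_3


def phi(x):
    """Maximum flow from a to b in Fig. 2.6."""
    x1, x2, x3 = x
    return min(x1 + x2, x3)


vectors = list(product(*STATES))


def greater(x, z):
    """x > z: x_i >= z_i for all i, with strict inequality for some i."""
    return all(a >= b for a, b in zip(x, z)) and x != z


def minimal_cut_vectors(c):
    """Definition 2.19: Phi(z) < c and Phi(x) >= c for all x > z."""
    return [z for z in vectors if phi(z) < c
            and all(phi(x) >= c for x in vectors if greater(x, z))]


def minimal_path_vectors(c):
    """Definition 2.20: Phi(y) >= c and Phi(x) < c for all x < y."""
    return [y for y in vectors if phi(y) >= c
            and all(phi(x) < c for x in vectors if greater(y, x))]


for level in (2, 1):
    print(level, minimal_cut_vectors(level), minimal_path_vectors(level))
```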

Computing System Reliability

Assume that the state $X_i$ of the ith component is a random variable, i =
1, 2, ..., n. Let
$$p_{ij} = P(X_i = x_{ij}),$$
$$h_j = P(\Phi(X) \ge \Phi_j),$$
$$a = E\Phi(X)/\Phi_M = \sum_j \Phi_j P(\Phi(X) = \Phi_j)/\Phi_M.$$

We call $h_j$ the reliability of the system at system level j. For the flow network
example above, a represents the expected throughput (flow) relative to the
maximum throughput (flow) level.
The problem is to compute hj for one or more values of j, and a,
based on the probabilities pij . We assume that the random variables Xi are
independent.

Example 2.22. (Continuation of Example 2.21). Assume that

$$p_{i1} = 1 - p_{i0} = 0.96, \quad i = 1, 2,$$
$$p_{32} = 0.97, \quad p_{31} = 0.02, \quad p_{30} = 0.01.$$
Then by simple probability calculus we find that
$$h_2 = P(X_1 = 1, X_2 = 1, X_3 = 2) = 0.96 \cdot 0.96 \cdot 0.97 = 0.894;$$
$$h_1 = P(\{X_1 = 1\} \cup \{X_2 = 1\}, X_3 \ge 1) = P(\{X_1 = 1\} \cup \{X_2 = 1\})\, P(X_3 \ge 1)$$
$$= \{1 - P(X_1 = 0) P(X_2 = 0)\}\, P(X_3 \ge 1) = 0.9984 \cdot 0.99 = 0.988;$$
$$a = (0.094 \cdot 1 + 0.894 \cdot 2)/2 = 0.941.$$
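The values of Example 2.22 can be reproduced by enumerating all 12 state vectors, assuming independent components. Here is a minimal Python sketch:

```python
from itertools import product

# P(X_i = state) for each component (data of Example 2.22)
dist = [
    {0: 0.04, 1: 0.96},
    {0: 0.04, 1: 0.96},
    {0: 0.01, 1: 0.02, 2: 0.97},
]


def phi(x):
    """Flow network of Example 2.21."""
    x1, x2, x3 = x
    return min(x1 + x2, x3)


level_prob = {0: 0.0, 1: 0.0, 2: 0.0}  # P(Phi(X) = level), independence assumed
for x in product(*(d.keys() for d in dist)):
    pr = 1.0
    for xi, d in zip(x, dist):
        pr *= d[xi]
    level_prob[phi(x)] += pr

h2 = level_prob[2]
h1 = level_prob[1] + level_prob[2]
a = sum(level * pr for level, pr in level_prob.items()) / 2  # Phi_M = 2
print(round(h2, 3), round(h1, 3), round(a, 3))  # 0.894 0.988 0.941
```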

For the above example it is easy to calculate the system reliability directly
by using elementary probability rules. For larger systems it will be very time-
consuming (in some cases impossible) to perform these calculations if special
techniques or algorithms are not used. If the minimal cut vectors or path
vectors for a specific level are known, the system reliability for this level can
be computed exactly, using, for example, the algorithm described in [17]. For
highly reliable systems, which are most common in practice, simple approxi-
mations can be used as described in the following.
Analogous to the binary case, approximations can be established based on
the inclusion–exclusion method. For example, we have
$$1 - h_j = \sum_r \prod_{i=1}^{n} P(X_i \le z_{ir}) - \varepsilon, \qquad (2.13)$$
where $(z_{1r}, z_{2r}, \ldots, z_{nr})$ represents the rth minimal cut vector for level j and ε is a positive
error term satisfying
$$\varepsilon \le \sum_{r<l} \prod_{i=1}^{n} P(X_i \le \min\{z_{ir}, z_{il}\}).$$

Example 2.23. (Continuation of Example 2.22). If we use (2.13) to calculate
$h_j$, we obtain
$$h_2 \approx 1 - (0.04 \cdot 1 \cdot 1 + 1 \cdot 0.04 \cdot 1 + 1 \cdot 1 \cdot 0.03) = 0.890,$$
$$h_1 \approx 1 - (0.04 \cdot 0.04 \cdot 1 + 1 \cdot 1 \cdot 0.01) = 0.988,$$
$$a \approx (1 \cdot 0.098 + 2 \cdot 0.890)/2 = 0.939.$$
We can conclude that the approximations are quite good for this example.
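This is a sketch of approximation (2.13) for Example 2.23. It also evaluates the upper bound on the error term ε. The cumulative probabilities $P(X_i \le z)$ are derived from the data of Example 2.22, and the minimal cut vectors are taken from Example 2.21. Independent components are assumed.

```python
from math import prod

# P(X_i <= z) for the data of Example 2.22
cdf = [
    {0: 0.04, 1: 1.0},
    {0: 0.04, 1: 1.0},
    {0: 0.01, 1: 0.03, 2: 1.0},
]
min_cut_vectors = {  # from Example 2.21
    2: [(0, 1, 2), (1, 0, 2), (1, 1, 1)],
    1: [(0, 0, 2), (1, 1, 0)],
}


def approx_unreliability(level):
    """First term of (2.13): sum over minimal cut vectors z of prod_i P(X_i <= z_i)."""
    return sum(prod(cdf[i][zi] for i, zi in enumerate(z))
               for z in min_cut_vectors[level])


def error_bound(level):
    """Bound on epsilon: sum over pairs r < l of prod_i P(X_i <= min(z_ir, z_il))."""
    zs = min_cut_vectors[level]
    return sum(prod(cdf[i][min(a, b)] for i, (a, b) in enumerate(zip(zr, zl)))
               for r, zr in enumerate(zs) for zl in zs[r + 1:])


h2 = 1 - approx_unreliability(2)
h1 = 1 - approx_unreliability(1)
a = (1 * (h1 - h2) + 2 * h2) / 2
print(round(h2, 3), round(h1, 3), round(a, 3))  # 0.89 0.988 0.939
```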

The problem of determining the probabilities pij will, as in the binary case,
depend on the particular situation considered. Often it will be appropriate to
define pij by the limiting availabilities of the component, cf. Chap. 4.

Discussion

The traditional reliability theory based on a binary approach has recently
been generalized by allowing components and system to have an arbitrary
finite number of states. For most reliability applications, binary modeling
should be sufficiently accurate, but for certain types of applications, such as
gas and oil production and transportation systems and telecommunication,
a multistate approach is usually required for the system and components. In
a gas transportation system, for example, the state of the system is defined
as the rate of delivered gas, and in most cases a binary model (100%, 0%)
would be a poor representation of the system. A component in such a sys-
tem may represent a compressor station comprising a certain number (M ) of
compressor units in parallel. The states of the component equal the capacity
levels corresponding to M compressor units running, M − 1 compressor units
running, and so on.
There also exist a number of reliability importance measures for multistate
systems (see Bibliographic Notes, p. 55). Many of these measures represent
natural generalizations of importance measures of binary systems. We see,
for example, that the measure $I^A$ can easily be extended to multistate models.
For the Birnbaum measure, it is not so straightforward to generalize the measure.
Several measures have been proposed, for example, the r, s-reliability
importance $I_i^{r,s}$ of component i, which is given by
$$I_i^{r,s} = P(\Phi(r_i, X) \ge \Phi_k) - P(\Phi(s_i, X) \ge \Phi_k), \qquad (2.14)$$
where $\Phi(j_i, X)$ equals the state of the system given that $X_i = x_{ij}$.
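As an illustration of (2.14), the following sketch computes $I_3^{2,1}$ for the flow network of Example 2.21 with the data of Example 2.22. It assumes $x_{3j} = j$, as in Example 2.21, and independent components. The helper names are hypothetical.

```python
from itertools import product

# Data of Example 2.22; states are labelled by their values, i.e. x_ij = j
dist = [{0: 0.04, 1: 0.96}, {0: 0.04, 1: 0.96}, {0: 0.01, 1: 0.02, 2: 0.97}]


def phi(x):
    """Flow network of Example 2.21."""
    return min(x[0] + x[1], x[2])


def prob_at_least(level, i, state):
    """P(Phi(X) >= level) with component i (0-based index) fixed in `state`."""
    fixed = [d if j != i else {state: 1.0} for j, d in enumerate(dist)]
    total = 0.0
    for x in product(*(d.keys() for d in fixed)):
        pr = 1.0
        for xj, d in zip(x, fixed):
            pr *= d[xj]
        if phi(x) >= level:
            total += pr
    return total


def rs_importance(i, r, s, level):
    """I_i^{r,s} as in (2.14), for the system level Phi_k = level."""
    return prob_at_least(level, i, r) - prob_at_least(level, i, s)


print(round(rs_importance(2, 2, 1, level=2), 4))  # component 3, states 2 vs 1: 0.9216
```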

2.2 Basic Notions of Aging


In this section we introduce and recapitulate some properties of lifetime distributions.
Let T be a positive random variable with distribution function
F: $T \sim F$, i.e., $P(T \le t) = F(t)$. If F has a density f, then $\lambda(t) = f(t)/\bar F(t)$
is the failure or hazard rate, where as usual $\bar F(t) = 1 - F(t)$ denotes the
survival probability. Here and in the following we sometimes simplify the
notation and define a mapping by its values to avoid constructions like
$\lambda: D \to \mathbb{R}_+$, $D \subset \mathbb{R}_+ \setminus \{t \in \mathbb{R}_+ : \bar F(t) = 0\}$, $t \mapsto \lambda(t) = f(t)/\bar F(t)$, if
there is no fear of ambiguity. Interpreting T as the lifetime of some component
or system, the failure rate measures the proneness to failure at time
t: $\lambda(t)\,\Delta t \approx P(T \le t + \Delta t \mid T > t)$. The well-known relation
$$\bar F(t) = \exp\Big(-\int_0^t \lambda(s)\,ds\Big)$$
shows that F is uniquely determined by the failure rate. One notion of aging
could be an increasing failure rate (IFR). However, this IFR property is in
some cases too strong and other intuitive notions of aging have been sug-
gested. Among them are the increasing failure rate average (IFRA) property
and the notions of new better than used (NBU) and new better than used
in expectation (NBUE). In the following subsection these concepts are intro-
duced formally and the relationships among them are investigated.
Furthermore, these notions should be applied to complex systems. If we
consider the time dynamics of such systems, we want to investigate how the
reliability of the whole system changes in time if the components have one of
the mentioned aging properties.
Another question is how different lifetime (random) variables and their cor-
responding distributions can be compared. This leads to notions of stochastic
ordering. The comparison of the lifetime distribution with the exponential
distribution leads to useful estimates of the system reliability.

2.2.1 Nonparametric Classes of Lifetime Distributions

We first define the IFR and decreasing failure rate (DFR) properties of a lifetime distribution $F$ by means of the conditional survival probability

$$P(T > t + x \mid T > t) = \bar F(t + x)/\bar F(t).$$

Definition 2.24. Let $T$ be a positive random variable with $T \sim F$.

(i) $F$ is an IFR distribution if $\bar F(t + x)/\bar F(t)$ is nonincreasing in $t$ on the domain of the distribution for each $x \ge 0$.
(ii) $F$ is a DFR distribution if $\bar F(t + x)/\bar F(t)$ is nondecreasing in $t$ on the domain of the distribution for each $x \ge 0$.

In the following we will restrict attention to the “increasing” part in the definition of the aging notion. The “decreasing” part can be treated analogously. The IFR property says that with increasing age the probability of surviving $x$ further time units decreases. This definition does not make use
of the existence of a density $f$ (failure rate $\lambda$). But if a density exists, then the IFR property is equivalent to a nondecreasing failure rate, which can immediately be seen as follows. From

$$\lambda(t) = \lim_{x \to 0+} \frac{1}{x}\left(1 - \frac{\bar F(t + x)}{\bar F(t)}\right)$$

we obtain that the IFR property implies that $\lambda$ is nondecreasing. Conversely, if $\lambda$ is nondecreasing, then we can conclude that

$$P(T > t + x \mid T > t) = \exp\left\{-\int_t^{t+x} \lambda(s)\,ds\right\}$$

is nonincreasing in $t$, i.e., $F$ is IFR. If $F$ has the IFR property, then it is continuous for all $t < t^* = \sup\{t \in \mathbb{R}_+ : \bar F(t) > 0\}$ (possibly $t^* = \infty$), and a jump can only occur at $t^*$ if $t^* < \infty$. This can be directly deduced from the IFR definition.
It seems reasonable that the aging properties of the components of a monotone structure are inherited by the system. However, the example of a parallel structure with two independent components, the lifetimes of which are distributed Exp($\lambda_1$) and Exp($\lambda_2$), respectively, shows that in this respect the IFR property is too strong. As was pointed out in Example 1.3, p. 6, for $\lambda_1 \ne \lambda_2$ the failure rate of the system lifetime is increasing in $(0, t^*)$ and decreasing in $(t^*, \infty)$ for some $t^* > 0$, i.e., constant component failure rates lead in this case to a nonmonotone system failure rate. To characterize the class of lifetime distributions of systems with IFR components we are led to the IFRA property. We use the notation

$$\Lambda(t) = \int_0^t \frac{dF(s)}{1 - F(s-)}$$

for the accumulated failure rate. The distribution function $F$ is uniquely determined by $\Lambda$, and the relation is given by

$$\bar F(t) = \exp\{-\Lambda^c(t)\} \prod_{s \le t} (1 - \Delta\Lambda(s))$$

for all $t$ such that $\Lambda(t) < \infty$, where $\Delta\Lambda(s) = \Lambda(s) - \Lambda(s-)$ is the jump height at time $s$ and $\Lambda^c(t) = \Lambda(t) - \sum_{s \le t} \Delta\Lambda(s)$ is the continuous part of $\Lambda$ (cf. [2], p. 91 or [115], p. 436). In the case that $F$ is continuous, we obtain the simple exponential formula $\bar F(t) = \exp\{-\Lambda(t)\}$ or $\Lambda(t) = -\ln \bar F(t)$.
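The nonmonotone failure rate of the two-component parallel system discussed above can be checked numerically. The sketch below evaluates $f(t)/\bar F(t)$ for $\bar F(t) = e^{-\lambda_1 t} + e^{-\lambda_2 t} - e^{-(\lambda_1+\lambda_2)t}$, the survival function of the maximum of two independent exponential lifetimes. The rates $\lambda_1 = 1$, $\lambda_2 = 3$ and the grid are illustrative choices. For contrast, it also shows that equal rates give a nondecreasing failure rate.

```python
import math


def parallel_failure_rate(t, lam1, lam2):
    """Failure rate of max(T1, T2) with independent T1 ~ Exp(lam1), T2 ~ Exp(lam2)."""
    lam12 = lam1 + lam2
    surv = math.exp(-lam1 * t) + math.exp(-lam2 * t) - math.exp(-lam12 * t)
    dens = lam1 * math.exp(-lam1 * t) + lam2 * math.exp(-lam2 * t) - lam12 * math.exp(-lam12 * t)
    return dens / surv


grid = [0.1 * k for k in range(0, 61)]  # t in [0, 6]
rates = [parallel_failure_rate(t, 1.0, 3.0) for t in grid]
peak = max(range(len(rates)), key=rates.__getitem__)
print(f"rate rises from {rates[0]:.3f} to {rates[peak]:.3f} at t = {grid[peak]:.1f}, "
      f"then falls to {rates[-1]:.3f}")

# With equal rates the same formula gives a nondecreasing failure rate.
equal = [parallel_failure_rate(t, 1.0, 1.0) for t in grid]
print(all(b >= a for a, b in zip(equal, equal[1:])))
```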
Definition 2.25. A distribution $F$ is IFRA if $-(1/t)\ln \bar F(t)$ is nondecreasing in $t > 0$ on $\{t \in \mathbb{R}_+ : \bar F(t) > 0\}$.
Remark 2.26. (i) The “decreasing” analog is denoted DFRA.
(ii) If $F$ is IFRA, then $(\bar F(t))^{1/t}$ is nonincreasing, which is equivalent to

$$\bar F(\alpha t) \ge (\bar F(t))^\alpha$$

for $0 \le \alpha \le 1$ and $t \ge 0$.
Next we will introduce two aging notions that are related to the residual lifetime of a component of age $t$. Let $T \sim F$ be a positive random variable with finite expectation. Then the distribution of the remaining lifetime after $t \ge 0$ is given by

$$P(T - t > x \mid T > t) = \frac{\bar F(x + t)}{\bar F(t)}$$

with expectation

$$\mu(t) = E[T - t \mid T > t] = \frac{1}{\bar F(t)}\int_0^\infty \bar F(x + t)\,dx = \frac{1}{\bar F(t)}\int_t^\infty \bar F(x)\,dx \tag{2.15}$$

for $0 \le t < t^* = \sup\{t \in \mathbb{R}_+ : \bar F(t) > 0\}$. The conditional expectation $\mu(t)$ is called the mean residual life at time $t$.
Definition 2.27. Let $T \sim F$ be a positive random variable.
(i) $F$ is NBU if
$$\bar F(x + t) \le \bar F(x)\bar F(t) \quad \text{for } x, t \ge 0.$$
(ii) $F$ is NBUE if $\mu = ET < \infty$ and
$$\mu(t) \le \mu \quad \text{for } 0 \le t < t^*.$$

Remark 2.28. (i) The corresponding notions for “better” replaced by “worse,” NWU and NWUE, are obtained by reversing the inequality signs.
(ii) These properties are intuitive notions of aging. $F$ is NBU means that the probability of surviving $x$ further time units for a component of age $t$ is not larger than the probability that a new component survives $x$ time units. For NBUE distributions the expected remaining lifetime of a component of age $t$ is not larger than the expected lifetime of a new component.
Now we want to establish the relations between these four notions of aging.
Theorem 2.29. Let T ∼ F be a positive random variable with finite expecta-
tion. Then we have

F IFR ⇒ F IFRA ⇒ F NBU ⇒ F NBUE.

Proof. F IFR ⇒ F IFRA: Since an IFR distribution $F$ is continuous for all $t < t^* = \sup\{t \in \mathbb{R}_+ : \bar F(t) > 0\}$, the simple exponential formula $\bar F(t) = \exp\{-\Lambda(t)\}$ holds true, and we see that the IFR property implies that $\exp\{\Lambda(t + x) - \Lambda(t)\}$ is nondecreasing in $t$ for all positive $x$. Therefore $\Lambda$ is convex, i.e., $\Lambda(\alpha t + (1 - \alpha)u) \le \alpha\Lambda(t) + (1 - \alpha)\Lambda(u)$, $0 \le \alpha \le 1$. Taking the limit $u \to 0-$ we have $\Lambda(0-) = 0$ and $\Lambda(\alpha t) \le \alpha\Lambda(t)$, which amounts to $\bar F(\alpha t) \ge (\bar F(t))^\alpha$. But this is equivalent to the IFRA property (see Remark 2.26 above).
F IFRA ⇒ F NBU: With the abbreviations $a = -(1/x)\ln \bar F(x)$ and $b = -(1/y)\ln \bar F(y)$ we obtain from the IFRA property for positive $x, y$ that $-(1/(x + y))\ln \bar F(x + y) \ge a \vee b = \max\{a, b\}$ and
$$-\ln \bar F(x + y) \ge (a \vee b)(x + y) \ge ax + by = -\ln \bar F(x) - \ln \bar F(y).$$

But this is the NBU property $\bar F(x + y) \le \bar F(x)\bar F(y)$.

F NBU ⇒ F NBUE: This inequality follows by integrating the NBU inequality

$$\bar F(t)\mu(t) = \int_0^\infty \bar F(x + t)\,dx \le \bar F(t)\int_0^\infty \bar F(x)\,dx = \bar F(t)\mu,$$

which completes the proof. □




Examples can be constructed which show that none of the above implica-
tions can be reversed.
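The chain IFR ⇒ IFRA ⇒ NBU ⇒ NBUE, and the fact that the first implication cannot be reversed, can be illustrated on a grid. The sketch below assumes, as an illustrative example not taken from the text, the parallel system of two independent exponential components with rates 1 and 3. This system is IFRA by Theorem 2.31 below but, as noted above, not IFR. Grid checks of this kind can refute a property but never prove it. The IFR check uses only the single offset $x = 0.5$, and the mean residual life (2.15) is computed with a truncated trapezoidal rule.

```python
import math

# Illustrative example (not from the text): T = max(T1, T2) with independent
# T1 ~ Exp(1) and T2 ~ Exp(3).
LAM1, LAM2 = 1.0, 3.0


def surv(t):
    """Survival function F̄(t) of the parallel system."""
    return math.exp(-LAM1 * t) + math.exp(-LAM2 * t) - math.exp(-(LAM1 + LAM2) * t)


def mean_residual_life(t, upper=60.0, steps=60_000):
    """mu(t) from (2.15), integrating F̄ over [t, upper] with the trapezoidal rule."""
    h = (upper - t) / steps
    total = 0.5 * (surv(t) + surv(upper)) + sum(surv(t + k * h) for k in range(1, steps))
    return total * h / surv(t)


grid = [0.05 * k for k in range(1, 101)]  # t in (0, 5]
TOL = 1e-12

# IFR: F̄(t + x)/F̄(t) nonincreasing in t (checked here for x = 0.5 only)
ratios = [surv(t + 0.5) / surv(t) for t in grid]
is_ifr = all(b <= a + TOL for a, b in zip(ratios, ratios[1:]))

# IFRA: -(1/t) ln F̄(t) nondecreasing in t
avg_rate = [-math.log(surv(t)) / t for t in grid]
is_ifra = all(b >= a - TOL for a, b in zip(avg_rate, avg_rate[1:]))

# NBU: F̄(x + t) <= F̄(x) F̄(t) on a coarser grid
coarse = grid[::10]
is_nbu = all(surv(x + t) <= surv(x) * surv(t) + TOL for x in coarse for t in coarse)

# NBUE: mu(t) <= mu = mu(0) = E[T]
mu = mean_residual_life(0.0)
is_nbue = all(mean_residual_life(t) <= mu + 1e-6 for t in grid[::20])

print(is_ifr, is_ifra, is_nbu, is_nbue)  # expected: False True True True
```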

2.2.2 Closure Theorems

In the previous subsection it was mentioned that the lifetime of a monotone system with IFR components need not be of IFR type. This gave rise to the definition of the IFRA class of lifetime distributions, and we will show that this class is closed under forming monotone structures. There are also other reliability operations, among them mixtures of distributions or forming the sum of random variables, and the question arises whether certain distribution classes are closed under these operations. For example, convolutions arise in connection with the addition of lifetimes and cold reserves.
Before we come to the IFRA Closure Theorem we need a preparatory
lemma to prove a property of the reliability function h(p) = P (Φ(X) = 1) of
a monotone structure.

Lemma 2.30. Let $h$ be the reliability function of a monotone structure. Then $h$ satisfies the inequality

$$h(p^\alpha) \ge h^\alpha(p) \quad \text{for } 0 < \alpha \le 1,$$

where $p^\alpha = (p_1^\alpha, \dots, p_n^\alpha)$.

Proof. We prove the result for binary structures which are nondecreasing in each argument (nondecreasing structures) but do not necessarily satisfy $\Phi(\mathbf{0}) = 0$ and $\Phi(\mathbf{1}) = 1$. We use induction on $n$, the number of components in the system. For $n = 1$ the assertion is obviously true. The induction step is carried out by means of the pivotal decomposition formula:

$$h(p^\alpha) = p_n^\alpha\, h(1_n, p^\alpha) + (1 - p_n^\alpha)\, h(0_n, p^\alpha).$$

Now $h(1_n, p^\alpha)$ and $h(0_n, p^\alpha)$ define reliability functions of nondecreasing structures with $n - 1$ components. Therefore we have $h(\cdot_n, p^\alpha) \ge h^\alpha(\cdot_n, p)$ and also

$$h(p^\alpha) \ge p_n^\alpha\, h^\alpha(1_n, p) + (1 - p_n^\alpha)\, h^\alpha(0_n, p).$$
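The last step is to show that

$$p_n^\alpha\, h^\alpha(1_n, p) + (1 - p_n^\alpha)\, h^\alpha(0_n, p) \ge \big(p_n h(1_n, p) + (1 - p_n) h(0_n, p)\big)^\alpha.$$

But since $v(x) = x^\alpha$ is a concave function for $x \ge 0$, we have

$$v(x + a) - v(x) \ge v(y + a) - v(y) \quad \text{for } 0 \le x \le y,\ 0 \le a.$$

Setting $a = p_n(h(1_n, p) - h(0_n, p))$ (which is nonnegative by monotonicity), $x = p_n h(0_n, p)$ and $y = h(0_n, p)$ gives $(p_n h(1_n, p))^\alpha - (p_n h(0_n, p))^\alpha \ge (p_n h(1_n, p) + (1 - p_n) h(0_n, p))^\alpha - h^\alpha(0_n, p)$, which after rearranging is the desired inequality. □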
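Lemma 2.30 can be spot-checked numerically. The sketch below computes $h(p)$ by state enumeration, $h(p) = \sum_x \Phi(x)\prod_i p_i^{x_i}(1 - p_i)^{1 - x_i}$, for a 2-out-of-3 structure (an illustrative choice). It then tests $h(p^\alpha) \ge h^\alpha(p)$ at randomly drawn $p$ and $\alpha$. A random search like this can only fail to find counterexamples; it is not a proof.

```python
import itertools
import random


def reliability(phi, p):
    """h(p) = sum over states x of phi(x) * prod_i p_i^x_i (1 - p_i)^(1 - x_i)."""
    total = 0.0
    for x in itertools.product((0, 1), repeat=len(p)):
        if phi(x):
            prob = 1.0
            for xi, pi in zip(x, p):
                prob *= pi if xi else 1.0 - pi
            total += prob
    return total


def two_out_of_three(x):
    """Illustrative structure: the system works if at least 2 of 3 components work."""
    return int(sum(x) >= 2)


rng = random.Random(2013)  # fixed seed for reproducibility
violations = 0
for _ in range(10_000):
    p = [rng.random() for _ in range(3)]
    alpha = rng.uniform(1e-3, 1.0)
    lhs = reliability(two_out_of_three, [pi ** alpha for pi in p])
    rhs = reliability(two_out_of_three, p) ** alpha
    if lhs < rhs - 1e-12:
        violations += 1
print("violations:", violations)  # expected: violations: 0
```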


Now we can establish the IFRA Closure Theorem.


Theorem 2.31. If each of the independent components of a monotone struc-
ture has an IFRA lifetime distribution, then the system itself has an IFRA
lifetime distribution.
Proof. Let $F, F_i$, $i = 1, \dots, n$, be the distributions of the lifetimes of the system and the components, respectively. The IFRA property is characterized by

$$\bar F_i(\alpha t) \ge (\bar F_i(t))^\alpha$$

for $0 \le \alpha \le 1$ and $t \ge 0$. The distribution $F$ is related to the $F_i$ by the reliability function $h$:

$$\bar F(t) = h(\bar F_1(t), \dots, \bar F_n(t)).$$

By Lemma 2.30 above, using the monotonicity of $h$, we can conclude that

$$\bar F(\alpha t) = h(\bar F_1(\alpha t), \dots, \bar F_n(\alpha t)) \ge h(\bar F_1^\alpha(t), \dots, \bar F_n^\alpha(t)) \ge h^\alpha(\bar F_1(t), \dots, \bar F_n(t)) = \bar F^\alpha(t)$$

for $0 < \alpha \le 1$. For $\alpha = 0$ this inequality holds true since $F(0) = 0$. This proves the IFRA property of $F$. □


We know that independent IFR components form an IFRA monotone system and hence, if the components have exponentially distributed lifetimes, the system lifetime is of IFRA type. Since constant failure rates are also included in the DFR class, one cannot hope for a corresponding closure theorem for DFRA distributions. However, considering other reliability operations, things may change. For example, let $\{F_k : k \in \mathbb{N}\}$ be a family of distributions and $F = \sum_{k=1}^\infty p_k F_k$ be its mixture with respect to some probability distribution $(p_k)$. Then it is known that the DFR and the DFRA property are preserved, i.e., if all $F_k$ are DFR(A), then the mixture $F$ is also DFR(A) (for a proof of a slightly more general result see [32], p. 103). Of course, by the same argument as above, a closure theorem for mixtures cannot hold true for IFRA distributions.
Finally, we state a closure theorem for convolutions. Since a complete proof is lengthy (and technical), we do not present it here; we refer to [32], p. 100, and [139], p. 23.
Theorem 2.32. Let X and Y be two independent random variables with IFR
distributions. Then X + Y has an IFR distribution.
By induction this property extends to an arbitrary finite number of random
variables. This shows, for example, that the Erlang distribution is of IFR type
because it is the distribution of the sum of exponentially distributed random
variables.
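For instance, after cancelling the common factor $e^{-\lambda t}$, the failure rate of the Erlang distribution with $k$ phases and rate $\lambda$ can be written as $\lambda(\lambda t)^{k-1}/(k-1)! \big/ \sum_{j=0}^{k-1}(\lambda t)^j/j!$. The sketch below evaluates this expression on a grid and checks that it is nondecreasing, as Theorem 2.32 predicts. The values $k = 3$ and $\lambda = 2$ are illustrative.

```python
import math


def erlang_hazard(t, k, lam):
    """Failure rate f(t)/F̄(t) of the Erlang(k, lam) distribution (sum of k i.i.d. Exp(lam))."""
    x = lam * t
    surv_terms = sum(x ** j / math.factorial(j) for j in range(k))  # e^{x} * F̄(t)
    density_term = lam * x ** (k - 1) / math.factorial(k - 1)      # e^{x} * f(t)
    return density_term / surv_terms


k, lam = 3, 2.0  # illustrative parameters
rates = [erlang_hazard(0.1 * i, k, lam) for i in range(0, 101)]  # t in [0, 10]
print(all(b >= a for a, b in zip(rates, rates[1:])))  # nondecreasing on the grid
print(rates[0], rates[-1])  # starts at 0 and approaches lam from below
```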

2.2.3 Stochastic Comparison


There are many possibilities to compare random variables or their distribu-
tions, respectively, with each other, and a rich literature treats various ways
of defining stochastic orders. One of the most important in reliability is the
stochastic order. Let X and Y be two random variables. Then X is said to be
smaller in the stochastic order, denoted X ≤st Y, if P (X > t) ≤ P (Y > t) for
all t ∈ R+ . In reliability terms we say that X is stochastically smaller than
Y , if the probability of surviving a given time t is smaller for X than for Y
for all t. Note that the stochastic order compares two distributions, the ran-
dom variables could even be defined on different probability spaces. One main
point is now to compare a given lifetime distribution with the exponential
one. The reason why we choose the exponential distribution is its simplic-
ity and the special role it plays on the border between the IFR(A) and the
DFR(A) classes. However, it turns out that in general a random variable with
an IFR(A) distribution is not stochastically smaller than an exponentially
distributed one, but their distributions cross at most once.
Lemma 2.33. Let $T$ be a positive random variable with IFRA distribution $F$ and let $x_p$ be fixed such that $F(x_p) = p$ ($p$-quantile). Then for $0 < p < 1$

$$\bar F(t) \ge e^{-\alpha t} \ \text{for } 0 \le t < x_p \qquad \text{and} \qquad \bar F(t) \le e^{-\alpha t} \ \text{for } x_p \le t$$

hold true, where $\alpha = -\frac{1}{x_p}\ln(1 - p)$.

Proof. For an IFRA distribution $v(t) = (-\ln \bar F(t))/t$ is nondecreasing. Therefore the result follows by noting that $v(t) \le v(x_p) = \alpha$ for $t < x_p$ and $v(t) \ge \alpha$ for $t \ge x_p$. □


The last lemma compares an IFRA distribution with an exponential distribution with the same $p$-quantile. It is also of interest to compare $F$ having expectation $\mu$ with a corresponding Exp($1/\mu$) distribution. The easiest way seems to be to set $\alpha = 1/\mu$ in the above lemma. But an IFRA distribution function may have jumps, so that there might be no $t$ with $v(t) = 1/\mu$. If, on the other hand, $F$ has the stronger IFR property, then it is continuous for $t < t^* = \sup\{t \in \mathbb{R}_+ : \bar F(t) > 0\}$ (possibly $t^* = \infty$), and a jump can only occur at $t^*$ if $t^* < \infty$. So we find a value $t_\mu$ with $v(t_\mu) = 1/\mu$, excluding the degenerate case $\bar F(\mu) = 0$, i.e., $t^* = \mu$. This leads to the following result.
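Lemma 2.34. Let $T$ be a positive random variable with IFR distribution $F$ and mean $\mu$, and let $t_\mu = \inf\{t \in \mathbb{R}_+ : -\frac{1}{t}\ln \bar F(t) \ge \frac{1}{\mu}\}$. Then

$$\bar F(t) \ge e^{-t/\mu} \ \text{for } 0 \le t < t_\mu, \qquad \bar F(t) \le e^{-t/\mu} \ \text{for } t_\mu \le t,$$

and $t_\mu \ge \mu$ hold true.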

Proof. The inequality for the survival probability follows from Lemma 2.33 with $\alpha = 1/\mu$, where in the degenerate case $t^* = \mu$ we have $t_\mu = t^* = \mu$.
It remains to show $t_\mu \ge \mu$. To this end we first confine ourselves to the continuous case and assume that $F$ has no jump at $t^*$. Then $F(T)$ has a uniform distribution on $[0, 1]$, so $-\ln \bar F(T) = -\ln(1 - F(T))$ is standard exponential and we obtain $E[\ln \bar F(T)] = -1$. Now

$$\frac{\bar F(t + x)}{\bar F(t)} = \exp\{-(\Lambda(t + x) - \Lambda(t))\}$$

is nonincreasing in $t$ for all $x \ge 0$, which implies that $\Lambda(t) = -\ln \bar F(t)$ is convex, and we can apply J.L. Jensen's inequality to yield

$$1 = E[-\ln \bar F(T)] \ge -\ln \bar F(\mu).$$

This is tantamount to $-\frac{1}{\mu}\ln \bar F(\mu) \le \frac{1}{\mu}$ and hence $t_\mu \ge \mu$, which proves the assertion for continuous $F$.
In case $F$ has a jump at $t^*$ we can approximate $F$ by continuous distributions. Then $t^*$ is finite and all considerations can be carried over to the limit. We omit the details. □


Example 2.35. Let $T$ follow a Weibull distribution $\bar F(t) = \exp\{-t^\beta\}$ with mean $\mu = \Gamma(1 + 1/\beta)$, where $\Gamma$ is the Gamma function. Then clearly $F$ is IFR if $\beta > 1$. Lemma 2.34 yields $\bar F(t) \ge \exp\{-t/\mu\}$ for $0 \le t < t_\mu = (1/\mu)^{1/(\beta - 1)}$ and $t_\mu \ge \mu$. Note that in this case $t_\mu > \mu$, which slightly extends the well-known result $\bar F(t) \ge \exp\{-t/\mu\}$ for $0 \le t < \mu$ (see [32], Theorem 6.2, p. 111).
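Example 2.35 is easy to check numerically for a concrete shape parameter. The sketch below takes $\beta = 2$ (an illustrative choice) and computes $\mu = \Gamma(1 + 1/\beta)$ with the standard-library gamma function and $t_\mu = (1/\mu)^{1/(\beta - 1)}$. It then verifies the two inequalities of Lemma 2.34 on a grid.

```python
import math

beta = 2.0                        # illustrative shape parameter, beta > 1 (IFR case)
mu = math.gamma(1.0 + 1.0 / beta)  # mean of F̄(t) = exp(-t**beta)
t_mu = (1.0 / mu) ** (1.0 / (beta - 1.0))


def surv(t):
    return math.exp(-t ** beta)


grid = [0.01 * k for k in range(0, 400)]
lower_ok = all(surv(t) >= math.exp(-t / mu) for t in grid if t < t_mu)
upper_ok = all(surv(t) <= math.exp(-t / mu) for t in grid if t >= t_mu)
print(f"mu = {mu:.4f}, t_mu = {t_mu:.4f}", lower_ok, upper_ok)
# mu = 0.8862, t_mu = 1.1284 True True
```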

A lot of other bounds for the survival probability can be set up under
various conditions (see the references listed in the Bibliographic Notes). Next
we want to give one example of how such bounds can be carried over to
monotone systems. As an immediate consequence of the last lemma we obtain
the following corollary.

Corollary 2.36. Let $h$ be the reliability function of a monotone system with lifetime distribution $F$. If the components are independent with IFR distributions $F_i$ and means $\mu_i$, $i = 1, \dots, n$, then we have

$$\bar F(t) \ge h(e^{-t/\mu_1}, \dots, e^{-t/\mu_n}) \quad \text{for } t < \min\{\mu_1, \dots, \mu_n\}.$$


Actually the inequality holds true for $t < \min\{t_{\mu_1}, \dots, t_{\mu_n}\}$. The idea of this inequality is to give a bound on the reliability of the system at time $t$ based only on $h$, the $\mu_i$ and the knowledge that the $F_i$ are of IFR type. If the reliability function $h$ is unknown, then it could be replaced by that of a series system, the least reliable monotone structure ($h(p) \ge \prod_i p_i$), to yield

$$\bar F(t) \ge h(e^{-t/\mu_1}, \dots, e^{-t/\mu_n}) \ge \prod_{i=1}^n e^{-t/\mu_i} = \exp\left\{-t\sum_{i=1}^n \frac{1}{\mu_i}\right\}$$

for $t < \min\{\mu_1, \dots, \mu_n\}$.
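To see how conservative these bounds can be, the sketch below compares three quantities for $t < \min\{\mu_1, \dots, \mu_n\}$: the exact reliability of a 2-out-of-3 system of independent Weibull components, the bound $h(e^{-t/\mu_1}, \dots, e^{-t/\mu_n})$, and the series-system bound $\exp\{-t\sum_i 1/\mu_i\}$. The structure and the Weibull parameters (all with shape at least 1, hence IFR) are illustrative assumptions.

```python
import itertools
import math


def reliability(phi, p):
    """h(p) by state enumeration over x in {0, 1}^n."""
    total = 0.0
    for x in itertools.product((0, 1), repeat=len(p)):
        if phi(x):
            prob = 1.0
            for xi, pi in zip(x, p):
                prob *= pi if xi else 1.0 - pi
            total += prob
    return total


def two_out_of_three(x):
    """Illustrative structure: the system works if at least 2 of 3 components work."""
    return int(sum(x) >= 2)


# Illustrative IFR components: F̄_i(t) = exp(-(t / eta_i) ** beta_i) with beta_i >= 1.
PARAMS = [(1.0, 2.0), (1.5, 3.0), (2.0, 1.5)]  # (eta_i, beta_i)
MEANS = [eta * math.gamma(1.0 + 1.0 / beta) for eta, beta in PARAMS]

rows = []
for t in [0.1 * k for k in range(1, 10)]:
    if t >= min(MEANS):
        break
    exact = reliability(two_out_of_three, [math.exp(-(t / eta) ** beta) for eta, beta in PARAMS])
    h_bound = reliability(two_out_of_three, [math.exp(-t / m) for m in MEANS])
    series_bound = math.exp(-t * sum(1.0 / m for m in MEANS))
    rows.append((t, exact, h_bound, series_bound))
    print(f"t = {t:.1f}: exact {exact:.4f}, h-bound {h_bound:.4f}, series bound {series_bound:.4f}")
```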


These few examples given here indicate how aging properties lead to
bounds on the reliability or survival probability of a single component and
how these affect the lifetime of a system comprising independent components.

2.3 Copula Models of Complex Systems in Reliability


2.3.1 Introduction to Copula Models

We consider a complex system comprising $n$ components. The lifetimes of the components are described by non-negative random variables $T_1, \dots, T_n$, where $T_i$ has continuous distribution $F_i$ with support $\mathbb{R}_+$, $i = 1, \dots, n$. Usually, the lifetimes are assumed to be stochastically independent. But in a number of cases such an assumption is not likely to hold true, e.g., if all components of a system are exposed to the same environmental conditions or stresses. Therefore, we want to extend the model to possibly dependent lifetimes with joint cumulative distribution function $H$:

$$H(t_1, \dots, t_n) = P(T_1 \le t_1, \dots, T_n \le t_n).$$

To investigate the influence of the dependence structure on the system reliability it turns out to be useful to assume that the dependence structure is given by a copula. Such a copula $C$ is defined as an $n$-variate distribution function on the cube $[0, 1]^n$ with marginals that are uniform distributions on $[0, 1]$, i.e.,
1. $C(u) = 0$ for any $u \in [0, 1]^n$ if at least one coordinate of $u = (u_1, \dots, u_n)$ is 0.
2. $C(u) = u_i$ for any $u \in [0, 1]^n$ if all coordinates of $u$ are 1 except $u_i$.
The link between the joint distribution function $H$ and the marginal distribution functions $F_i$ of the random variables $T_i$ is given by a copula $C$. According to Sklar's theorem (see Nelsen [127]), for any $n$-variate distribution $H$ with marginals $F_i$ there exists an $n$-copula $C$ such that

$$H(t_1, \dots, t_n) = C(F_1(t_1), \dots, F_n(t_n))$$


for all $t_1, \dots, t_n$. If $F_1, \dots, F_n$ are continuous, as is assumed here, then this copula $C$ is uniquely determined.
As before, we consider a binary monotone system admitting two states: working (coded as 1) and failed (coded as 0). The state of the system is uniquely determined by the binary states of the $n$ components, i.e., there is a structure function $\Phi: \{0, 1\}^n \to \{0, 1\}$ giving the state of the system according to the states of the components. We consider a monotone system, i.e., we assume that this structure function is monotone in each component and $\Phi(0, \dots, 0) = 0$, $\Phi(1, \dots, 1) = 1$. Let $X_t(i) = I(T_i > t)$, $i = 1, \dots, n$, describe the state of the $i$th component at time $t$, $t \in \mathbb{R}_+$, where $I$ is the indicator function. Then

$$F^S(t) := P(\Phi(X_t(1), \dots, X_t(n)) = 0)$$

is the distribution function of the system lifetime. Of course, in addition to the structure function $\Phi$, this distribution also depends on the copula $C$.
One aim is to investigate how the dependence structure determines the lifetime distribution $F^S$ of the system and, in particular, in which way properties such as expectation or quantiles depend on the copula. To this end we need the system lifetime distribution $F^S$ to be given explicitly in terms of $\Phi$ and $C$ as follows (see [71]). Let $C$ be an $n$-dimensional copula and $\tilde C$ the induced probability measure such that $C(t_1, \dots, t_n) = \tilde C\big(\prod_{i=1}^n [0, t_i]\big)$. Note that since the support of the copula $C$ is $[0, 1]^n$, we have $\tilde C([0, 1]^n) = 1$. For $0 \le s \le 1$ we denote the intervals $B_0^s = [0, s]$ and $B_1^s = (s, 1]$, where $B_1^1 = \emptyset$.
We introduce the function $G_{\Phi,C}: [0, 1]^n \to [0, 1]$ with

$$F^S(t) := P(\Phi(X_t(1), \dots, X_t(n)) = 0) = G_{\Phi,C}(F_1(t), \dots, F_n(t))$$

to emphasize that the lifetime distribution $F^S$ depends on $\Phi$ and on $C$. This function $G_{\Phi,C}$ can be determined as follows (for a proof see [71]).
Theorem 2.37. The system lifetime distribution $F^S$ is given for all $t \ge 0$ by

$$F^S(t) = G_{\Phi,C}(F_1(t), \dots, F_n(t)),$$

where

$$G_{\Phi,C}(t_1, \dots, t_n) := 1 - \sum_{x \in \{0,1\}^n} \Phi(x) \cdot \tilde C\Big(\prod_{i=1}^n B_{x_i}^{t_i}\Big).$$

Since this formula is rather complex we will explain it in more detail for
the case n = 2 and give some examples.
Let $Y_1, Y_2$ be random variables, each uniformly distributed on $[0, 1]$, with joint distribution $C(t_1, t_2) = P(Y_1 \le t_1, Y_2 \le t_2)$, $t_1, t_2 \in [0, 1]$, and induced probability measure $\tilde C$. For the sets $D_1 = B_0^{t_1} \times B_0^{t_2}$, $D_2 = B_0^{t_1} \times B_1^{t_2}$, $D_3 = B_1^{t_1} \times B_1^{t_2}$, $D_4 = B_1^{t_1} \times B_0^{t_2}$ in Fig. 2.7 we get
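Fig. 2.7. Example for $n = 2$: the unit square divided at $t_1$ and $t_2$ into $D_1$ (lower left), $D_4$ (lower right), $D_2$ (upper left) and $D_3$ (upper right).

$$\begin{aligned}
\tilde C(D_1) &= P(Y_1 \le t_1, Y_2 \le t_2) = C(t_1, t_2),\\
\tilde C(D_2) &= P(Y_1 \le t_1, t_2 < Y_2 \le 1) = C(t_1, 1) - C(t_1, t_2) = t_1 - C(t_1, t_2),\\
\tilde C(D_3) &= P(t_1 < Y_1 \le 1, t_2 < Y_2 \le 1) = 1 - C(1, t_2) - C(t_1, 1) + C(t_1, t_2) = 1 - t_2 - t_1 + C(t_1, t_2),\\
\tilde C(D_4) &= P(t_1 < Y_1 \le 1, Y_2 \le t_2) = C(1, t_2) - C(t_1, t_2) = t_2 - C(t_1, t_2).
\end{aligned}$$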

Example 2.38.
(i) In the case of a parallel system with $n$ components, the structure function is given by $\Phi(x_1, \dots, x_n) = 1 - \prod_{i=1}^n (1 - x_i)$, which is 0 if and only if $x = (0, \dots, 0)$. Therefore, the sum in $G_{\Phi,C}$ extends over all possible $x$ except the null vector, yielding

$$G_{\Phi,C}(t_1, \dots, t_n) = 1 - \Big(1 - \tilde C\Big(\prod_{i=1}^n B_0^{t_i}\Big)\Big) = C(t_1, \dots, t_n).$$

It follows, as is to be expected, that

$$F^S(t) = G_{\Phi,C}(F_1(t), \dots, F_n(t)) = C(F_1(t), \dots, F_n(t)) = H(t, \dots, t).$$

(ii) For a series system with $n$ components, we have $\Phi(x_1, \dots, x_n) = \prod_{i=1}^n x_i$, which is 1 if and only if $x = (1, \dots, 1)$. Hence

$$G_{\Phi,C}(t_1, \dots, t_n) = 1 - \tilde C\Big(\prod_{i=1}^n B_1^{t_i}\Big).$$

If we denote by $\bar H(t_1, \dots, t_n) = P(T_1 > t_1, \dots, T_n > t_n)$ the survival function of $H$ and by $\bar C$ the $n$-dimensional joint survival function corresponding to $C$, then we get for the lifetime distribution of a series system

$$F^S(t) = 1 - \bar H(t, \dots, t) = 1 - \bar C(F_1(t), \dots, F_n(t)).$$

In the special case $n = 2$ we have $G_{\Phi,C}(t_1, t_2) = t_1 + t_2 - C(t_1, t_2)$, yielding

$$F^S(t) = F_1(t) + F_2(t) - C(F_1(t), F_2(t)).$$

(iii) If the $n$ component lifetimes are independent, then the copula $C$ is the product copula $\Pi(t_1, \dots, t_n) = t_1 \cdots t_n$. Thus

$$G_{\Phi,C}(t_1, \dots, t_n) = 1 - \sum_{x \in \{0,1\}^n} \Phi(x) \prod_{i=1}^n t_i^{1 - x_i}(1 - t_i)^{x_i}.$$

The intact probabilities of the components at time $t$ are $\bar F_i(t) = 1 - F_i(t) = P(X_t(i) = 1)$, $i = 1, \dots, n$. The system reliability is then given by

$$\bar F^S(t) = \sum_{x \in \{0,1\}^n} \Phi(x) \prod_{i=1}^n (\bar F_i(t))^{x_i}(F_i(t))^{1 - x_i},$$

the well-known formula that results from the state enumeration method (see Chap. 2.1, p. 25).
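Theorem 2.37 can be implemented directly for small $n$. In the sketch below, $\tilde C$ of a box $\prod_i (a_i, b_i]$ is computed by inclusion-exclusion over its $2^n$ corners. Because the marginals of a copula are uniform, the boundary points carry no mass, so $B_0^t = [0, t]$ may be treated as $(0, t]$. The bivariate Clayton copula with $\theta = 2$, the argument values and the helper names are illustrative assumptions. The three printed pairs reproduce cases (i), (ii) and (iii) of Example 2.38.

```python
import itertools
import math


def box_measure(C, lows, highs):
    """C̃(prod_i (lows_i, highs_i]) by inclusion-exclusion over the 2^n box corners."""
    n = len(lows)
    total = 0.0
    for v in itertools.product((0, 1), repeat=n):
        corner = [hi if vi else lo for vi, lo, hi in zip(v, lows, highs)]
        total += (-1) ** (n - sum(v)) * C(corner)
    return total


def G(phi, C, t):
    """G_{Phi,C}(t) = 1 - sum_x Phi(x) * C̃(prod_i B_{x_i}^{t_i}) as in Theorem 2.37."""
    works = 0.0
    for x in itertools.product((0, 1), repeat=len(t)):
        if phi(x):
            lows = [ti if xi else 0.0 for xi, ti in zip(x, t)]   # B_1^t = (t, 1]
            highs = [1.0 if xi else ti for xi, ti in zip(x, t)]  # B_0^t = [0, t]
            works += box_measure(C, lows, highs)
    return 1.0 - works


def clayton2(u, theta=2.0):
    """Bivariate Clayton copula for theta > 0 (illustrative choice)."""
    if min(u) == 0.0:
        return 0.0
    return (u[0] ** -theta + u[1] ** -theta - 1.0) ** (-1.0 / theta)


def product_copula(u):
    return math.prod(u)


def parallel(x):
    return int(any(x))


def series(x):
    return int(all(x))


def two_out_of_three(x):
    return int(sum(x) >= 2)


t2 = (0.3, 0.6)
print(G(parallel, clayton2, t2), clayton2(t2))                # (i): G = C
print(G(series, clayton2, t2), t2[0] + t2[1] - clayton2(t2))  # (ii): G = t1 + t2 - C
t3 = (0.2, 0.5, 0.7)  # failure probabilities F_i(t)
enum = 1.0 - sum(
    math.prod((1.0 - ti) if xi else ti for xi, ti in zip(x, t3))
    for x in itertools.product((0, 1), repeat=3)
    if two_out_of_three(x)
)
print(G(two_out_of_three, product_copula, t3), enum)          # (iii): state enumeration
```

The corner enumeration costs $O(4^n)$ copula evaluations, which is only sensible for small $n$.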

2.3.2 The Influence of the Copula on the Lifetime Distribution of the System

In the following we want to investigate in which way the dependence structure, i.e., the copula, influences one-dimensional properties $q(F^S)$ of the system lifetime distribution $F^S(t)$, where the functional $q: \mathcal{D} \to \bar{\mathbb{R}}$ is a mapping from the space $\mathcal{D}$ of distribution functions of non-negative random variables to $\bar{\mathbb{R}} = \mathbb{R} \cup \{-\infty, \infty\}$.
Important examples of such functionals are
• the system reliability $R_t$ at a fixed time $t$

$$R_t(F^S) = P(\Phi(X_t(1), \dots, X_t(n)) = 1) = 1 - F^S(t) = \bar F^S(t),$$

• the expectation $E$

$$E(F^S) = \int_0^\infty \bar F^S(t)\,dt,$$
• the $p$-quantiles $Q_p$ of the system lifetime distribution

$$Q_p(F^S) = \inf\{t \in \mathbb{R}_+ : F^S(t) \ge p\}, \quad 0 < p \le 1.$$

To investigate the influence of the copula on these one-dimensional quantities, we first have to compare different multivariate distributions. There are a lot of comparison methods that are presented in some detail in [123, 99] and related to copulas in Nelsen [127]. We briefly summarize the notions we need. We consider $n$ non-negative random variables $T_1, \dots, T_n$ with joint distribution function $H$, marginals $F_1, \dots, F_n$ and survival function $\bar H(t_1, \dots, t_n) = P(T_1 > t_1, \dots, T_n > t_n)$. In the case $n = 2$ we have the relation $\bar H(t_1, t_2) = 1 - F_1(t_1) - F_2(t_2) + H(t_1, t_2)$. Now we want to compare two $n$-variate distribution functions $H, G \in \mathcal{D}(F_1, \dots, F_n)$, where $\mathcal{D}(F_1, \dots, F_n)$ denotes the set of distribution functions with marginals $F_1, \dots, F_n$, each with support $\mathbb{R}_+$.

Definition 2.39. Let $H, G \in \mathcal{D}(F_1, \dots, F_n)$, $n \ge 2$.
(i) $G$ is more positive lower orthant dependent (PLOD) than $H$, written $H \prec_{cL} G$, if $H(t) \le G(t)$ for all $t = (t_1, \dots, t_n) \in \mathbb{R}^n$.
(ii) $G$ is more positive upper orthant dependent (PUOD) than $H$, written $H \prec_{cU} G$, if $\bar H(t) \le \bar G(t)$ for all $t$.
(iii) $G$ is more concordant than $H$, written $H \prec_c G$, if both $H(t) \le G(t)$ and $\bar H(t) \le \bar G(t)$ hold for all $t$.

For n = 2, parts (i) and (ii) of the above definition are equivalent as can
be seen from the relation between H and H̄. This does not hold true in higher
dimensions. To compare two distributions H, G ∈ D(F1 , . . . , Fn ) with fixed
marginals it is, of course, enough to compare their corresponding copulas.
For $n = 2$ random variables $X, Y$ with continuous distribution functions $F, G$ and copula $C$, there are well-known measures of the degree of dependence, such as Kendall's tau $\tau_{X,Y}$ or Spearman's rho $\rho_{X,Y}$, which can be expressed in terms of the copula $C$:

$$\tau_{X,Y} = 4\int_{[0,1]^2} C(u, v)\,dC(u, v) - 1, \qquad \rho_{X,Y} = 12\int_{[0,1]^2} C(u, v)\,du\,dv - 3.$$
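The integral representation of Spearman's rho lends itself to a simple numerical check. The sketch below approximates $12\int_{[0,1]^2} C(u, v)\,du\,dv - 3$ with the midpoint rule for six copulas: $W$, the product copula, three Clayton copulas and $M$. The grid size and the Clayton parameters are illustrative choices. The exact values for $W$ and $M$ are $-1$ and $1$, and the computed values increase along the list, in line with the pointwise ordering of these copulas.

```python
def spearman_rho(C, n=200):
    """12 * ∫∫_{[0,1]^2} C(u, v) du dv - 3, approximated by the midpoint rule on an n x n grid."""
    h = 1.0 / n
    mids = [(k + 0.5) * h for k in range(n)]
    total = sum(C(u, v) for u in mids for v in mids)
    return 12.0 * total * h * h - 3.0


def clayton(theta):
    """Bivariate Clayton copula, used here only for theta > 0."""
    def C(u, v):
        return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)
    return C


copulas = [
    ("W", lambda u, v: max(u + v - 1.0, 0.0)),
    ("Pi", lambda u, v: u * v),
    ("Clayton 1", clayton(1.0)),
    ("Clayton 2", clayton(2.0)),
    ("Clayton 5", clayton(5.0)),
    ("M", min),
]
rhos = [spearman_rho(C) for _, C in copulas]
for (name, _), rho in zip(copulas, rhos):
    print(f"{name:>10}: rho ≈ {rho:.4f}")
```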

These representations show that if $C_1 \prec_{cL} C_2$, then Kendall's tau and Spearman's rho are ordered in the same way, i.e., both measures inherit the monotonicity of copulas with respect to the PLOD ordering. In a similar way we want to investigate the effect of an increase of dependence on one-dimensional properties $q(F^S)$ of the system lifetime distribution. We cannot hope for results for arbitrary systems, but for parallel and series systems (see Fig. 2.8) we can prove the following theorem. For this we need the usual stochastic order on $\mathcal{D}$: $F \le_s G$ iff $F(t) \ge G(t)$ for all $t \ge 0$.
Fig. 2.8. (a) Parallel and (b) series system with components $c_1, \dots, c_n$

Theorem 2.40. Let the functional $q: \mathcal{D} \to \mathbb{R}$ be nondecreasing with respect to the usual stochastic order on $\mathcal{D}$ and let $C_1$ and $C_2$ be two $n$-dimensional copulas.
(i) If for a parallel system $C_1 \prec_{cL} C_2$, then

$$q(F^S_{C_2}) \le q(F^S_{C_1});$$

(ii) if for a series system $C_1 \prec_{cU} C_2$, then

$$q(F^S_{C_1}) \le q(F^S_{C_2}).$$

If $q$ is nonincreasing, then the inequalities in (i) and (ii) are reversed.

Proof. (i) For a parallel system, note that according to Example 2.38(i) it holds that

$$F^S_{C_i}(t) = C_i(F_1(t), \dots, F_n(t)),$$

where $i = 1, 2$ and $F_1, \dots, F_n \in \mathcal{D}$. It is clear that $F^S_{C_1}(t) \le F^S_{C_2}(t)$ for all $t \ge 0$, since $C_1 \prec_{cL} C_2$. That means $F^S_{C_2} \le_s F^S_{C_1}$. Because of the monotonicity of $q$ we get the assertion

$$q(F^S_{C_2}) \le q(F^S_{C_1}).$$

The proof of (ii) is similar: For a series system we have

$$F^S_{C_i}(t) = 1 - \bar C_i(F_1(t), \dots, F_n(t)).$$

Therefore, the PUOD ordering of the $C_i$ yields $F^S_{C_1} \le_s F^S_{C_2}$ and consequently the assertion.
The case of nonincreasing $q$ is obvious. □


The above theorem shall be applied to the three functionals mentioned earlier, namely the system reliability $R_t(F^S) = \bar F^S(t)$, the expectation $E(F^S) = \int_0^\infty \bar F^S(t)\,dt$ and the quantile $Q_p(F^S) := \inf\{t \in \mathbb{R}_+ : F^S(t) \ge p\}$, $0 < p \le 1$. Note that these functionals are all nondecreasing with respect to the usual stochastic ordering.
One is often interested in bounds for these reliability quantities in cases when the marginals are (approximately) known but the dependence structure is unknown. For this we can utilize the so-called Fréchet–Hoeffding bounds (see Nelsen [127])

$$W(u_1, \dots, u_n) = \max\Big\{1 - n + \sum_{i=1}^n u_i,\ 0\Big\}, \qquad M(u_1, \dots, u_n) = \min\{u_1, \dots, u_n\}.$$

While $M$ itself is a copula, $W$ is not a distribution function for $n \ge 3$. It is known (see Nelsen [127]) that all copulas $C$ lie within these two bounds, i.e.,

$$W \prec_{cL} C \prec_{cL} M.$$
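Both statements can be illustrated numerically. The sketch below checks $W \le C \le M$ pointwise at random points for a three-dimensional Clayton copula; the family, $\theta = 1.5$ and the seed are illustrative assumptions. It also computes the $W$-volume of the box $[1/2, 1]^3$, which comes out negative ($-1/2$), so $W$ cannot be a distribution function for $n = 3$.

```python
import itertools
import random


def W(u):
    return max(1.0 - len(u) + sum(u), 0.0)


def M(u):
    return min(u)


def clayton(theta):
    """n-dimensional Clayton copula, theta > 0 (an illustrative Archimedean family)."""
    def C(u):
        return (sum(ui ** -theta for ui in u) - len(u) + 1.0) ** (-1.0 / theta)
    return C


def box_volume(F, lows, highs):
    """Signed F-volume of the box prod_i (lows_i, highs_i] (inclusion-exclusion over corners)."""
    n = len(lows)
    return sum(
        (-1) ** (n - sum(v)) * F([hi if vi else lo for vi, lo, hi in zip(v, lows, highs)])
        for v in itertools.product((0, 1), repeat=n)
    )


rng = random.Random(1)
C = clayton(1.5)
within = all(
    W(u) - 1e-12 <= C(u) <= M(u) + 1e-12
    for u in ([rng.uniform(1e-6, 1.0) for _ in range(3)] for _ in range(20_000))
)
print(within, box_volume(W, [0.5] * 3, [1.0] * 3))  # True -0.5
```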

Using the preceding theorem yields
(i) for a parallel system:

$$\begin{aligned}
R_t(F^S_M) &\le R_t(F^S_C) \le R_t(F^S_W),\\
E(F^S_M) &\le E(F^S_C) \le E(F^S_W),\\
Q_p(F^S_M) &\le Q_p(F^S_C) \le Q_p(F^S_W),
\end{aligned}$$

where we used the notation $F^S_C$ for the system lifetime distribution according to the copula $C$;
(ii) in the case $n = 2$ the relation $W \prec_{cU} C \prec_{cU} M$ holds true, yielding the reverse inequalities for a series system:

$$\begin{aligned}
R_t(F^S_W) &\le R_t(F^S_C) \le R_t(F^S_M),\\
E(F^S_W) &\le E(F^S_C) \le E(F^S_M),\\
Q_p(F^S_W) &\le Q_p(F^S_C) \le Q_p(F^S_M).
\end{aligned}$$

This example provides us with an upper bound $Q_p(F^S_W)$ and a lower bound $Q_p(F^S_M)$, respectively, for the quantile $Q_p(F^S_C)$ of a parallel system. The corresponding bounds for the quantile $Q_p(F^S_C)$ of a series system are $Q_p(F^S_M)$ and $Q_p(F^S_W)$, respectively. Note that the lower bound for a parallel system coincides with the upper bound for a series system.
This example also shows that the stronger the dependence between the component lifetimes in a series system, the more reliable the system. For a parallel system the reverse holds true: the system becomes weaker the stronger the dependence, always under the assumption that the marginals remain the same.
2.3.3 Archimedean Copulas

In general it is not easy to check whether multivariate copulas are PLOD, PUOD, or concordance ordered. But for an important subclass, the so-called Archimedean copulas, the concordance order can be checked by investigating the properties of generators of Archimedean copulas (see Nelsen [127]). A function $\varphi: [0, 1] \to [0, \infty]$ is a generator (of an $n$-dimensional Archimedean copula) if $\varphi$ is continuous, strictly decreasing, $\varphi(0) = \infty$, $\varphi(1) = 0$ and the inverse $\varphi^{-1}$ is completely monotonic, i.e.,

$$(-1)^k \frac{d^k}{dt^k}\varphi^{-1}(t) \ge 0, \quad t \ge 0,\ k = 0, 1, 2, \dots$$

The function $C: [0, 1]^n \to [0, 1]$ defined by

$$C(u) = \varphi^{-1}(\varphi(u_1) + \varphi(u_2) + \dots + \varphi(u_n))$$

is then an $n$-dimensional Archimedean copula with generator $\varphi$.
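The construction $C(u) = \varphi^{-1}(\varphi(u_1) + \dots + \varphi(u_n))$ translates directly into code. The sketch below builds a copula from a generator and its inverse, using two generators as illustrations:
- the Clayton generator $\varphi(t) = (t^{-\theta} - 1)/\theta$ with $\theta = 2$, whose inverse $(1 + \theta s)^{-1/\theta}$ is completely monotonic for $\theta > 0$;
- the Gumbel generator $\varphi(t) = (-\ln t)^\theta$ with $\theta = 1.5$ (valid for $\theta \ge 1$).

Both families and the parameter values are illustrative additions, not taken from the text. The checks compare the Clayton case with its closed form and confirm a uniform margin.

```python
import math


def archimedean(phi, phi_inv):
    """Return the Archimedean copula C(u) = phi_inv(phi(u_1) + ... + phi(u_n))."""
    def C(u):
        return phi_inv(sum(phi(ui) for ui in u))
    return C


THETA_CLAYTON = 2.0
clayton = archimedean(
    lambda t: (t ** -THETA_CLAYTON - 1.0) / THETA_CLAYTON,
    lambda s: (1.0 + THETA_CLAYTON * s) ** (-1.0 / THETA_CLAYTON),
)

THETA_GUMBEL = 1.5
gumbel = archimedean(
    lambda t: (-math.log(t)) ** THETA_GUMBEL,
    lambda s: math.exp(-s ** (1.0 / THETA_GUMBEL)),
)

u = (0.3, 0.6, 0.8)
clayton_closed = (sum(ui ** -THETA_CLAYTON for ui in u) - len(u) + 1.0) ** (-1.0 / THETA_CLAYTON)
print(clayton(u), clayton_closed)
print(gumbel((0.4, 1.0, 1.0)))  # uniform margin: equals 0.4 up to rounding
```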


Definition 2.41. A function $f: \mathbb{R}_+ \to \mathbb{R}$ is subadditive if for all $x_1, \dots, x_n \in \mathbb{R}_+$

$$f(x_1 + \dots + x_n) \le f(x_1) + \dots + f(x_n). \tag{2.16}$$

Using this definition, the following theorem supplies us with a necessary and sufficient condition to check the concordance order of two Archimedean copulas $C_1, C_2$ with generators $\varphi_1$ and $\varphi_2$, respectively.
Theorem 2.42. Let $C_1$ and $C_2$ be $n$-dimensional Archimedean copulas generated by $\varphi_1$ and $\varphi_2$. Then $C_1 \prec_{cL} C_2$ if and only if $\varphi_1 \circ \varphi_2^{-1}$ is subadditive.

Proof. Let $f = \varphi_1 \circ \varphi_2^{-1}$. The function $f$ is continuous and nondecreasing with

$$f(0) = \varphi_1 \circ \varphi_2^{-1}(0) = \varphi_1(1) = 0.$$

According to the definition, $C_1 \prec_{cL} C_2$ holds true if and only if for all $x_1, \dots, x_n \in [0, 1]$

$$\varphi_1^{-1}(\varphi_1(x_1) + \dots + \varphi_1(x_n)) \le \varphi_2^{-1}(\varphi_2(x_1) + \dots + \varphi_2(x_n)). \tag{2.17}$$

Inserting $t_i = \varphi_2(x_i)$, $i = 1, \dots, n$, (2.17) is equivalent to

$$\varphi_1^{-1}(f(t_1) + \dots + f(t_n)) \le \varphi_2^{-1}(t_1 + \dots + t_n) \tag{2.18}$$

for all $t_1, \dots, t_n \ge 0$.
Applying the strictly decreasing function $\varphi_1$ to both sides of (2.18), one gets

$$f(t_1 + \dots + t_n) \le f(t_1) + \dots + f(t_n).$$

This shows the equivalence of the subadditivity of $f = \varphi_1 \circ \varphi_2^{-1}$ and $C_1 \prec_{cL} C_2$. □


To verify whether $\varphi_1 \circ \varphi_2^{-1}$ is subadditive may still be a challenge. Therefore, we state three sufficient conditions for subadditivity in the following corollary. The elementary proofs can be found in Nelsen [127] for the case $n = 2$ and can easily be extended to the general case $n \ge 2$.

Corollary 2.43. Under the assumptions of Theorem 2.42, $C_1 \prec_{cL} C_2$ holds true if any of the following conditions is satisfied:
(i) $\varphi_1 \circ \varphi_2^{-1}$ is concave;
(ii) $\varphi_1/\varphi_2$ is nondecreasing on $(0, 1)$;
(iii) $\varphi_1$ and $\varphi_2$ are continuously differentiable on $(0, 1)$ and $\varphi_1'/\varphi_2'$ is nondecreasing on $(0, 1)$.

2.3.4 The Expectation of the Lifetime of a Two-Component System with Exponential Marginals

As an example we consider a complex system with $n = 2$ components with lifetimes $T_1, T_2$, which are both exponentially distributed with the same parameter $\lambda > 0$. To model the dependence we consider the one-parameter Clayton or Pareto family of copulas

$$C_\theta(u, v) = \big[(u^{-\theta} + v^{-\theta} - 1)_+\big]^{-1/\theta}, \quad \theta \in [-1, \infty) \setminus \{0\},$$

with generator $\varphi_\theta(t) = \frac{1}{\theta}(t^{-\theta} - 1)$. Is this family positively ordered in the sense that for $\theta_1 \le \theta_2$ we have $C_{\theta_1} \prec_c C_{\theta_2}$? Note that in the case $n = 2$ the PLOD and PUOD orderings coincide and are equivalent to the concordance ordering $\prec_c$. To check whether the Clayton family is positively ordered we can use Corollary 2.43 part (iii). The generator $\varphi_\theta$ is continuously differentiable on $(0, 1)$ with $\varphi_\theta'(t) = -t^{-\theta - 1}$. The ratio $\varphi_{\theta_1}'/\varphi_{\theta_2}' = t^{\theta_2 - \theta_1}$ is nondecreasing on $(0, 1)$ for $\theta_1 \le \theta_2$, which is sufficient for $C_{\theta_1} \prec_c C_{\theta_2}$, i.e., the degree of dependence increases with $\theta$. The extreme cases $\theta = -1$ and $\theta \to \infty$ are the Fréchet–Hoeffding bounds $C_{-1} = W$ and $C_\infty = M$. The limiting case $\theta \to 0$ yields the product copula $C_0 = \Pi$ (independence).
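The positive ordering of the Clayton family can also be observed directly. The sketch below evaluates $C_\theta$ on a grid of interior points for a few increasing values of $\theta$ and checks that $C_{\theta_1} \le C_{\theta_2}$ pointwise whenever $\theta_1 \le \theta_2$. The parameter values are illustrative; they include $\theta = -1$, where $C_{-1} = W$, and $\theta = 0$, which is treated as the product copula.

```python
def clayton(theta):
    """Bivariate Clayton copula C_theta for theta in [-1, inf); theta == 0 is the product copula."""
    def C(u, v):
        if theta == 0:
            return u * v
        base = max(u ** -theta + v ** -theta - 1.0, 0.0)  # the (.)_+ in the definition
        return base ** (-1.0 / theta)
    return C


THETAS = [-1.0, -0.5, 0.0, 0.5, 2.0, 5.0]  # illustrative, increasing
GRID = [(k + 0.5) / 50 for k in range(50)]  # interior points of (0, 1)

ordered = all(
    clayton(t1)(u, v) <= clayton(t2)(u, v) + 1e-12
    for t1, t2 in zip(THETAS, THETAS[1:])
    for u in GRID
    for v in GRID
)
print(ordered)
```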

Parallel System

The lifetime $T = T_1 \vee T_2$ of a parallel system has distribution function $F^{par}_{C_\theta}(t) = P(T \le t) = C_\theta(F_1(t), F_2(t))$. Since the family $C_\theta$ is positively ordered (concordance ordering), the expectation

$$E(F^{par}_{C_\theta}) = \int_0^\infty \big(1 - C_\theta(F_1(t), F_2(t))\big)\,dt$$

is decreasing in $\theta$. The extreme and special cases are:


• θ = −1, C−1 = W: $E(F^{par}_W) = \int_0^\infty \left(1 - W(F_1(t), F_2(t))\right) dt$.
In the exponential case F1(t) = F2(t) = F(t) = 1 − exp(−λt) we get
$$E(F^{par}_W) = \int_0^\infty \left[1 - (2F(t) - 1)^+\right] dt = \frac{1}{\lambda}(1 + \ln 2).$$
• θ = 0, C0 = Π: $E(F^{par}_\Pi) = \int_0^\infty \left(1 - F_1(t)F_2(t)\right) dt$.
In the exponential case F1(t) = F2(t) = F(t) = 1 − exp(−λt) we get
$$E(F^{par}_\Pi) = \int_0^\infty \left[1 - F^2(t)\right] dt = \frac{3}{2}\cdot\frac{1}{\lambda}.$$
• θ = ∞, C∞ = M: $E(F^{par}_M) = \int_0^\infty \left[1 - M(F_1(t), F_2(t))\right] dt$.
In the exponential case F1(t) = F2(t) = F(t) = 1 − exp(−λt) we get
$$E(F^{par}_M) = \int_0^\infty \left[1 - F(t)\right] dt = \frac{1}{\lambda}.$$
This shows that in the independence case the second component in this two-component parallel system prolongs the mean lifetime by 50%. The largest possible prolongation, about 69% (ln 2 · 100 ≈ 69.3), is attained in the case of extreme negative dependence (C−1 = W), whereas, as is to be expected, the worst case is a correlation of 1 between the component lifetimes (C∞ = M), in which case the second component does not prolong the mean lifetime at all.
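The three closed-form values, and the fact that E(F^par_{Cθ}) decreases in θ, can be reproduced by numerical integration. The following sketch assumes NumPy, an arbitrary rate λ = 2, and a trapezoidal rule on [0, 40/λ] (the truncated tail is of order e^{−40}); the helper names are hypothetical and the Clayton function is restated so that the block runs on its own.

```python
import numpy as np


def clayton(u, v, theta):
    """Clayton copula; theta in [-1, inf) without 0 (theta = -1 gives W)."""
    with np.errstate(divide="ignore"):  # F(0) = 0 gives 0 ** (-theta) = inf for theta > 0
        s = np.maximum(u ** -theta + v ** -theta - 1.0, 0.0)
        return s ** (-1.0 / theta)


def expected_parallel_lifetime(copula, lam, n=200_001):
    """E(T1 v T2) = int_0^inf (1 - C(F(t), F(t))) dt for Exp(lam) marginals."""
    t = np.linspace(0.0, 40.0 / lam, n)  # integrand is O(exp(-lam t)); tail beyond 40/lam negligible
    F = 1.0 - np.exp(-lam * t)
    y = 1.0 - copula(F, F)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(t)) / 2.0)  # trapezoidal rule


W = lambda u, v: np.maximum(u + v - 1.0, 0.0)
Pi = lambda u, v: u * v
M = np.minimum

lam = 2.0
print(expected_parallel_lifetime(W, lam), (1.0 + np.log(2.0)) / lam)
print(expected_parallel_lifetime(Pi, lam), 1.5 / lam)
print(expected_parallel_lifetime(M, lam), 1.0 / lam)

# E(T1 v T2) should decrease as theta (the degree of dependence) increases
thetas = [-1.0, -0.5, 0.5, 2.0, 10.0]
means = [expected_parallel_lifetime(lambda u, v, th=th: clayton(u, v, th), lam)
         for th in thetas]
print(means)
```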

Series System
The lifetime T = T1 ∧ T2 of a series system has distribution function $F^{ser}_{C_\theta}(t) = P(T \le t) = F_1(t) + F_2(t) - C_\theta(F_1(t), F_2(t))$ according to Example 2.38. Since T1 ∧ T2 + T1 ∨ T2 = T1 + T2, we get for the expectation of the system lifetime
$$E\left(F^{ser}_{C_\theta}\right) = E(T_1) + E(T_2) - E(T_1 \vee T_2).$$
Therefore, the properties of the expectation can be transferred from the parallel system; in particular, the expectation of the series system lifetime is increasing in θ:
• θ = −1, C−1 = W: $E(F^{ser}_W) = E(T_1) + E(T_2) - \int_0^\infty \left(1 - W(F_1(t), F_2(t))\right) dt$.
In the exponential case we get
$$E(F^{ser}_W) = \frac{2}{\lambda} - \frac{1}{\lambda}(1 + \ln 2) = \frac{1}{\lambda}(1 - \ln 2).$$
• θ = 0, C0 = Π: $E(F^{ser}_\Pi) = E(T_1) + E(T_2) - \int_0^\infty \left(1 - F_1(t)F_2(t)\right) dt$.
In the exponential case we get
$$E(F^{ser}_\Pi) = \frac{2}{\lambda} - \frac{3}{2}\cdot\frac{1}{\lambda} = 0.5\cdot\frac{1}{\lambda}.$$
• θ = ∞, C∞ = M: $E(F^{ser}_M) = E(T_1) + E(T_2) - \int_0^\infty \left[1 - M(F_1(t), F_2(t))\right] dt$.
In the exponential case we get
$$E(F^{ser}_M) = \frac{1}{\lambda}.$$
This shows that the expected lifetime of a series system can be reduced to about 30% [(1 − ln 2) · 100 ≈ 30.7] of the expected lifetime of one component, namely in the case C−1 = W.
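Analogously, a minimal numerical sketch (same assumptions as for the parallel case: NumPy, trapezoidal rule, an arbitrary rate, here λ = 0.5, and hypothetical helper names) integrates the joint survival function P(T1 > t, T2 > t) = 1 − F1(t) − F2(t) + C(F1(t), F2(t)) directly, rather than going through E(T1) + E(T2) − E(T1 ∨ T2), and recovers the ratio 1 − ln 2 ≈ 0.307.

```python
import numpy as np


def expected_series_lifetime(copula, lam, n=200_001):
    """E(T1 ^ T2) = int_0^inf (1 - F1 - F2 + C(F1, F2)) dt for Exp(lam) marginals."""
    t = np.linspace(0.0, 40.0 / lam, n)
    F = 1.0 - np.exp(-lam * t)
    survival = 1.0 - 2.0 * F + copula(F, F)  # P(T1 > t, T2 > t)
    return float(np.sum((survival[1:] + survival[:-1]) * np.diff(t)) / 2.0)


W = lambda u, v: np.maximum(u + v - 1.0, 0.0)
Pi = lambda u, v: u * v
M = np.minimum

lam = 0.5
e_w = expected_series_lifetime(W, lam)
e_pi = expected_series_lifetime(Pi, lam)
e_m = expected_series_lifetime(M, lam)
print(e_w, (1.0 - np.log(2.0)) / lam)   # worst case, lower Frechet-Hoeffding bound
print(e_pi, 0.5 / lam)                  # independence
print(e_m, 1.0 / lam)                   # best case, upper bound
print("ratio to one component:", e_w / (1.0 / lam))
```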
52 2 Basic Reliability Theory

2.3.5 Marshall–Olkin Distribution

In this subsection we consider the bivariate Marshall–Olkin (M–O) distribution and investigate the influence of the degree of dependence on the system reliability. The M–O distribution is interesting insofar as it can be interpreted physically. As before we consider a complex system with two components. The system is subject to shocks that are always "fatal" to one or both of the components. The shocks occur at times Z1, Z2, Z12, where the index indicates whether only the first, only the second, or both components are destroyed. These random variables are assumed to be independent and exponentially distributed with parameters λ1, λ2, λ12 > 0, respectively. The component lifetimes T1, T2 are given by
$$T_1 = Z_1 \wedge Z_{12} \quad\text{and}\quad T_2 = Z_2 \wedge Z_{12}$$
and follow exponential distributions with parameters λ1 + λ12 and λ2 + λ12.
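The shock construction translates directly into a sampler. This Monte Carlo sketch (NumPy's `Generator.exponential`, which takes the scale 1/λ; the seed, the sample size and the rates λ1 = λ2 = λ12 = 5 are arbitrary choices, and `sample_marshall_olkin` is a hypothetical helper) checks the exponential marginals, the joint survival function exp(−λ1 t1 − λ2 t2 − λ12 (t1 ∨ t2)), and the probability λ12/(λ1 + λ2 + λ12) that both components fail at the same time, which is the singular part of the M–O distribution.

```python
import numpy as np


def sample_marshall_olkin(lam1, lam2, lam12, size, rng):
    """Draw (T1, T2) from the fatal-shock model: T1 = Z1 ^ Z12, T2 = Z2 ^ Z12."""
    z1 = rng.exponential(1.0 / lam1, size)    # shock destroying only component 1
    z2 = rng.exponential(1.0 / lam2, size)    # shock destroying only component 2
    z12 = rng.exponential(1.0 / lam12, size)  # common shock destroying both
    return np.minimum(z1, z12), np.minimum(z2, z12)


rng = np.random.default_rng(seed=1)
lam1, lam2, lam12 = 5.0, 5.0, 5.0
t1, t2 = sample_marshall_olkin(lam1, lam2, lam12, 200_000, rng)

print(t1.mean(), 1.0 / (lam1 + lam12))   # marginal of T1 is Exp(lam1 + lam12)
print(t2.mean(), 1.0 / (lam2 + lam12))
# joint survival P(T1 > s, T2 > u) = exp(-lam1 s - lam2 u - lam12 max(s, u))
s, u = 0.05, 0.1
print(np.mean((t1 > s) & (t2 > u)),
      np.exp(-lam1 * s - lam2 * u - lam12 * max(s, u)))
# the common shock produces simultaneous failures
print(np.mean(t1 == t2), lam12 / (lam1 + lam2 + lam12))
```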
The joint distribution of T1 and T2 is called the Marshall–Olkin distribution, with joint distribution function
$$\begin{aligned} H(t_1,t_2) &= \bar H(t_1,t_2) + F_1(t_1) + F_2(t_2) - 1 \\ &= \exp\left(-\lambda_1 t_1 - \lambda_2 t_2 - \lambda_{12}(t_1 \vee t_2)\right) - \exp\left(-(\lambda_1+\lambda_{12})t_1\right) \\ &\quad - \exp\left(-(\lambda_2+\lambda_{12})t_2\right) + 1, \qquad t_1, t_2 \ge 0. \end{aligned}$$
The associated M–O copula is
$$C_{\alpha,\beta}(u_1,u_2) = \min\left((1-u_1)^{1-\alpha}(1-u_2),\ (1-u_1)(1-u_2)^{1-\beta}\right) + u_1 + u_2 - 1,$$
where 0 ≤ u1, u2 ≤ 1 and $\alpha = \frac{\lambda_{12}}{\lambda_1+\lambda_{12}}$, $\beta = \frac{\lambda_{12}}{\lambda_2+\lambda_{12}}$. As limiting cases we get for the M–O copula
$$C_{0,0}(u_1,u_2) = \lim_{\alpha\to 0+,\ \beta\to 0+} C_{\alpha,\beta}(u_1,u_2) = \Pi(u_1,u_2) = u_1 \cdot u_2$$
and
$$C_{1,1}(u_1,u_2) = M(u_1,u_2) = u_1 \wedge u_2.$$
This implies that the limits λ1 → ∞, λ2 → ∞ or the case λ12 = 0 result in the product copula, whereas the limit λ12 → ∞ or the case λ1 = λ2 = 0 yields the upper Fréchet–Hoeffding bound. The family Cα,β, 0 ≤ α, β ≤ 1, is positively ordered with respect to the concordance ordering in α (β fixed) as well as in β (α fixed). For 0 ≤ α, β ≤ 1 we get
$$\Pi \prec_c C_{\alpha,\beta} \prec_c M.$$
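A short sketch (NumPy assumed; `mo_copula` is a hypothetical helper that implements the formula for Cα,β as printed above) checks the two limiting cases, the ordering in α for a fixed β (here β = 0.4) on a grid, and the consistency relation H(t1, t2) = Cα,β(F1(t1), F2(t2)) for one arbitrary choice of λ1, λ2, λ12.

```python
import numpy as np


def mo_copula(u1, u2, alpha, beta):
    """Marshall-Olkin copula C_{alpha,beta}(u1, u2) as given above, 0 <= alpha, beta <= 1."""
    u1 = np.asarray(u1, dtype=float)
    u2 = np.asarray(u2, dtype=float)
    a, b = 1.0 - u1, 1.0 - u2  # survival-scale arguments
    return np.minimum(a ** (1.0 - alpha) * b, a * b ** (1.0 - beta)) + u1 + u2 - 1.0


grid = np.linspace(0.0, 1.0, 101)
U1, U2 = np.meshgrid(grid, grid)
Pi, M = U1 * U2, np.minimum(U1, U2)

# limiting cases C_{0,0} = Pi and C_{1,1} = M
print(np.allclose(mo_copula(U1, U2, 0.0, 0.0), Pi))
print(np.allclose(mo_copula(U1, U2, 1.0, 1.0), M))

# concordance ordering in alpha for fixed beta, and Pi <= C_{alpha,beta} <= M
beta_fixed = 0.4
family = [mo_copula(U1, U2, al, beta_fixed) for al in (0.0, 0.3, 0.6, 1.0)]
print(all(np.all(c1 <= c2 + 1e-12) for c1, c2 in zip(family, family[1:])))
print(all(np.all((Pi - 1e-12 <= c) & (c <= M + 1e-12)) for c in family))

# consistency with the M-O joint distribution function: H(t1, t2) = C(F1(t1), F2(t2))
lam1, lam2, lam12 = 2.0, 3.0, 1.5
alpha, beta = lam12 / (lam1 + lam12), lam12 / (lam2 + lam12)
t1, t2 = 0.3, 0.7
F1 = 1.0 - np.exp(-(lam1 + lam12) * t1)
F2 = 1.0 - np.exp(-(lam2 + lam12) * t2)
H = (np.exp(-lam1 * t1 - lam2 * t2 - lam12 * max(t1, t2))
     - np.exp(-(lam1 + lam12) * t1) - np.exp(-(lam2 + lam12) * t2) + 1.0)
print(np.isclose(mo_copula(F1, F2, alpha, beta), H))
```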

Now we are in a position to compare the reliabilities $R_t(F^{par}_C)$ and $R_t(F^{ser}_C)$ by means of Theorem 2.40 for different copulas and all t ≥ 0 (the equality in the middle requires equal marginals F1 = F2, as in the examples below):
$$R_t(F^{ser}_\Pi) \le R_t(F^{ser}_{C_{\alpha,\beta}}) \le R_t(F^{ser}_M) = R_t(F^{par}_M) \le R_t(F^{par}_{C_{\alpha,\beta}}) \le R_t(F^{par}_\Pi).$$

The Parallel System

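For a parallel system the reliability $R_t(F^{par}_{C_{\alpha,\beta}})$ can be determined explicitly as follows:
$$\begin{aligned} R_t(F^{par}_{C_{\alpha,\beta}}) = \bar F^S(t) &= 1 - C_{\alpha,\beta}(F_1(t), F_2(t)) \\ &= 1 - \min\left((1-F_1(t))^{1-\alpha}(1-F_2(t)),\ (1-F_1(t))(1-F_2(t))^{1-\beta}\right) - F_1(t) - F_2(t) + 1 \\ &= e^{-(\lambda_1+\lambda_{12})t} + e^{-(\lambda_2+\lambda_{12})t} - e^{-(\lambda_1+\lambda_2+\lambda_{12})t}, \qquad t \ge 0. \end{aligned}$$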

The reliability functions for different copulas with the same marginals Fi (t) =
1 − exp(−10t), i = 1, 2, are displayed graphically in Fig. 2.9.

Fig. 2.9. Reliability functions of a parallel system

The dotted line in Fig. 2.9 represents the independence case with λ1 =
10, λ2 = 10, λ12 = 0. The dashed line corresponds to λ1 = 5, λ2 = 5, λ12 = 5,
whereas the solid line represents the upper Fréchet–Hoeffding bound with
λ1 = 0, λ2 = 0, λ12 = 10.

Figure 2.9 shows that with increasing degree of dependence between the component lifetimes, here increasing λ12, the reliability of a parallel system decreases. For example, for t = 0.1 the reliability ranges from $R_{0.1} = 0.60$ (λ12 = 0) to $R_{0.1} = 0.37$ (λ12 = 10), i.e., due to dependence between the component lifetimes the reliability may drop to about 60% of its value in the independence case.
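The values quoted for t = 0.1 follow from the closed-form reliability above. A minimal sketch (NumPy assumed; the function name is ours, and the three parameter sets are those given for Fig. 2.9, each with Exp(10) marginals):

```python
import numpy as np


def parallel_reliability(t, lam1, lam2, lam12):
    """R_t of a two-component parallel system with M-O distributed lifetimes."""
    return (np.exp(-(lam1 + lam12) * t) + np.exp(-(lam2 + lam12) * t)
            - np.exp(-(lam1 + lam2 + lam12) * t))


# the three parameter sets of Fig. 2.9; every marginal is Exp(10)
cases = {"independent": (10.0, 10.0, 0.0),
         "intermediate": (5.0, 5.0, 5.0),
         "upper bound M": (0.0, 0.0, 10.0)}
r = {name: parallel_reliability(0.1, *lams) for name, lams in cases.items()}
for name, value in r.items():
    print(f"{name:>14}: R_0.1 = {value:.4f}")
print("ratio to independence:", r["upper bound M"] / r["independent"])
```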

The Series System


Analogously we can analyze the reliability of a series system under the same conditions as above. The system reliability is
$$R_t(F^{ser}_{C_{\alpha,\beta}}) = \bar F^S(t) = 1 - F_1(t) - F_2(t) + C_{\alpha,\beta}(F_1(t), F_2(t)) = e^{-(\lambda_1+\lambda_2+\lambda_{12})t}, \qquad t \ge 0.$$
Figure 2.10 shows the reliability functions for different copulas.

Fig. 2.10. Reliability functions of a series system

As before, the dotted line in Fig. 2.10 represents the independence case
with λ1 = 10, λ2 = 10, λ12 = 0. The dashed line corresponds to λ1 = 5, λ2 =
5, λ12 = 5, whereas the solid line represents the upper Fréchet–Hoeffding
bound with λ1 = 0, λ2 = 0, λ12 = 10.

With increasing degree of dependence the series system improves in the sense that its reliability increases. Furthermore, a parallel system is always at least as reliable as a series system with the same marginals. For the upper Fréchet–Hoeffding bound the reliability functions of the parallel and the series system coincide, i.e., the best series system is as reliable as the worst parallel system. In this limiting case the correlation of the component lifetimes is ρ(T1, T2) = 1.
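The three statements above can be checked pointwise on a time grid with the closed-form reliabilities. A minimal sketch (NumPy assumed; function names are ours, the parameter sets are those of Figs. 2.9 and 2.10, and the grid on [0, 0.5] is arbitrary):

```python
import numpy as np


def parallel_reliability(t, lam1, lam2, lam12):
    return (np.exp(-(lam1 + lam12) * t) + np.exp(-(lam2 + lam12) * t)
            - np.exp(-(lam1 + lam2 + lam12) * t))


def series_reliability(t, lam1, lam2, lam12):
    return np.exp(-(lam1 + lam2 + lam12) * t)


t = np.linspace(0.0, 0.5, 501)
cases = [(10.0, 10.0, 0.0), (5.0, 5.0, 5.0), (0.0, 0.0, 10.0)]  # increasing lambda_12
series = [series_reliability(t, *c) for c in cases]
parallel = [parallel_reliability(t, *c) for c in cases]

# the series system improves with dependence (pointwise in t)
print(all(np.all(a <= b) for a, b in zip(series, series[1:])))
# a parallel system is never less reliable than the series system with the same marginals
print(all(np.all(p >= s - 1e-15) for p, s in zip(parallel, series)))
# upper Frechet-Hoeffding bound: parallel and series reliability coincide
print(np.allclose(parallel[-1], series[-1]))
```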

Bibliographic Notes. The basic reliability theory of complex systems was developed in the 1960s and 1970s, and is to a large extent covered by the two books of Barlow and Proschan [31] and [32]. Some more recent books in this field are Aven [13] and Høyland and Rausand [90]. Our presentation is based on Aven [13], which also includes the theory of multistate monotone systems. This theory was developed in the 1980s. Refer to Natvig [126] and Aven [17] for further details and references.
For specific references to methods (algorithms) for reliability computations, see [132] and the many papers on this topic appearing in reliability journals each year.
Birnbaum's reliability importance measure presented in Sect. 2.1.1 was introduced by Birnbaum [43]. The improvement potential measure has been used in different contexts, see, e.g., [13, 28]. The measure (2.14) was proposed by Butler [52]. For other references on reliability importance measures, see [13, 28, 39, 79, 86, 90, 125].
Section 2.2, which presents some well-known properties of lifetime distributions, is based on Barlow and Proschan [31], [32], Gertsbakh [74], and Shaked and Shanthikumar [139]. We have not dealt with stochastic comparisons and orders in detail. An overview of this topic with applications in reliability can be found in the book of Shaked and Shanthikumar [139].
Good sources for multivariate comparison methods and dependence concepts are Müller and Stoyan [123], Joe [99] and, in particular related to copulas, Nelsen [127].
3
Stochastic Failure Models

A general set-up should include all basic failure time models, should take
into account the time-dynamic development, and should allow for different
information and observation levels. Thus, one is led in a natural way to the
theory of stochastic processes in continuous time, including (semi-) martingale
theory, in the spirit of Arjas [3, 4] and Koch [108]. As was pointed out in
Chap. 1, this theory is a powerful tool in reliability analysis. It should be
stressed, however, that the purpose of this chapter is to present and introduce ideas rather than to give a far-reaching excursion into the theory of stochastic processes. So the mathematical technicalities are kept to the minimum necessary to develop the tools to be used. Also, a number of remarks and examples are included to illustrate the theory. Yet, to benefit from reading this chapter, a solid background in stochastics is required. Section 3.1 summarizes the mathematics needed. For a more comprehensive and in-depth presentation of the mathematical basis, we refer to Appendix A and to monographs such as those by Brémaud [50], Dellacherie and Meyer [61, 62], Kallenberg [101], or Rogers and Williams [133].

3.1 Notation and Fundamentals


Let (Ω, F, P) be the basic probability space. The information up to time t is represented by the pre-t-history Ft, which contains all events of F that can be distinguished up to and including time t. The filtration F = (Ft), t ∈ R+, which is the family of increasing pre-t-histories, is assumed to satisfy the usual conditions of completeness and right continuity, i.e., Ft ⊂ F contains all P-negligible sets of F and $\mathcal F_t = \mathcal F_{t+} = \bigcap_{s>t}\mathcal F_s$. We define $\mathcal F_\infty = \bigvee_{t\ge 0}\mathcal F_t$ as the smallest σ-algebra containing all events of Ft for all t ∈ R+.
If {Xj , j ∈ J} is a family of random variables and {Aj , j ∈ J} is a system
of subsets in F , then σ(Xj , j ∈ J) and σ(Aj , j ∈ J), respectively, denote
the completion of the generated σ-field, i.e., the generated σ-field including
all P -negligible sets of F . In many cases the information is determined by a

T. Aven and U. Jensen, Stochastic Models in Reliability, Stochastic Modelling 57


and Applied Probability 41, DOI 10.1007/978-1-4614-7894-2 3,
© Springer Science+Business Media New York 2013
58 3 Stochastic Failure Models

stochastic process Z = (Zt), t ∈ R+, and the corresponding filtration is the so-called natural or internal one, which is generated by this stochastic process and denoted $\mathbb F^Z = (\mathcal F^Z_t)$, t ∈ R+, $\mathcal F^Z_t = \sigma(Z_s, 0 \le s \le t)$. But since it is sometimes desirable to observe one stochastic process on different information levels, it is more convenient to use filtrations as measures of information.
On the basic filtered probability space we now consider a stochastic process
Z = (Zt ), which is adapted to a general filtration F, i.e., on the F-information
level the process can be observed, or in mathematical terms: FtZ ⊂ Ft , which
assures that Zt is Ft -measurable for all t ∈ R+ . All stochastic processes are, if
not stated otherwise, assumed to be right-continuous and to have left limits.
A random variable X is integrable if E|X| < ∞. If the pth power of
a random variable X is integrable, E|X|p < ∞, 1 ≤ p < ∞, then it is
sometimes said that X is an element of Lp , the vector space of real-valued
random variables with finite pth moment. A stochastic process (Xt ), t ∈ R+ , is
called integrable if all Xt are integrable, i.e., Xt ∈ L1 for all t ∈ R+ . A family
of random variables (Xt), t ∈ R+, is called uniformly integrable if
$$\lim_{c\to\infty}\ \sup_{t\in\mathbb R_+} E\left[|X_t|\, I(|X_t| \ge c)\right] = 0.$$

To simplify the notation, we assume that relations such as ⊂, = between measurable sets or ≤, <, = between random variables always hold with probability one, which means that the term P-a.s. is suppressed. For conditional expectations no distinction is made between a version and the equivalence class of P-a.s. equal versions.
If we consider a stochastic process X = (Xt) and do not demand that it is right-continuous, then expressions like $Y_t = \int_0^t X_s\,ds$ have no meaning unless (Xt) fulfills some measurability condition in the argument t. One condition is the following.
Definition 3.1. A stochastic process X is F-progressive or progressively measurable if for every t the mapping (s, ω) → Xs(ω) on [0, t] × Ω is measurable with respect to the product σ-algebra B([0, t]) ⊗ Ft, where B([0, t]) is the Borel σ-algebra on [0, t].
Every left- or right-continuous adapted process is progressively measurable. If X is progressive, then so is Y = (Yt), $Y_t = \int_0^t X_s\,ds$. A further measurability restriction is needed in connection with stochastic processes in continuous time. This is the fundamental concept of predictability.
Definition 3.2. Let F be a filtration on the basic probability space and let
P(F) be the σ-algebra on (0, ∞) × Ω generated by the system of sets

(s, t] × A, 0 ≤ s < t, A ∈ Fs , t > 0.


P(F) is called the F-predictable σ-algebra on (0, ∞) × Ω. A stochastic process
X = (Xt ) is called F-predictable, if X0 is F0 -measurable and the mapping
(t, ω) → Xt (ω) on (0, ∞) × Ω into R is measurable with respect to P(F).
3.1 Notation and Fundamentals 59

Every left-continuous process adapted to F is F-predictable. In most applications we will be concerned with predictable processes that are left-continuous. Note that F-predictable processes are also F-progressive.
To get an impression of the meaning of the term predictable, we remark that for an F-predictable process X the value Xt can be predicted from the information available "just" before time t, i.e., Xt is measurable with respect to $\mathcal F_{t-} = \bigvee_{s<t}\mathcal F_s = \sigma(A_s : A_s \in \mathcal F_s,\ 0 \le s < t)$. Processes of this kind are important elements in the framework of point processes. Additional information on these measurability concepts can be found in Appendix A.3, p. 254.
Some further important terms are introduced in the following definitions.
Definition 3.3. A random variable τ with values in R+ ∪ {∞} is called an
F-stopping time if {τ ≤ t} ∈ Ft for all t ∈ R+ .
Thus a stopping time is related to the given information in that at any
time t it is possible to decide whether τ has happened up to time t or not, only
using information of the past and present but not anticipating the future.
If F = (Ft ) is a filtration and τ an F-stopping time, then the information
up to the random time τ is given by Fτ = {A ∈ F∞ : A ∩ {τ ≤ t} ∈ Ft for
all t ∈ R+ }. To understand the meaning of this definition, we specialize to
a deterministic stopping time τ = t∗ ∈ R+ . Then A ∈ Ft∗ is equivalent to
A ∩ {t∗ ≤ t} ∈ Ft for all t ∈ R+ , where {t∗ ≤ t} stands for Ω if t∗ ≤ t and for
∅ otherwise, i.e., for t = t∗ the event must be in Ft∗ and then it is in Ft for
all t > t∗ because the filtration is monotone.
Definition 3.4. An integrable F-adapted process (Xt ), t ∈ R+ , is called a
martingale (submartingale, supermartingale), if for all s > t, s, t ∈ R+ ,
E[Xs |Ft ] = (≥, ≤)Xt .
In the following we denote by M the set of martingales with paths that are
right-continuous and have left-hand limits and by M0 the set of martingales
M ∈ M with M0 = 0.

3.1.1 The Semimartingale Representation


Semimartingale representations of stochastic processes play a key role in our
set-up. They allow the process to be decomposed into a drift or regression
part and an additive random fluctuation described by a martingale.
Definition 3.5. A stochastic process Z = (Zt), t ∈ R+, is called a smooth semimartingale (SSM) if it has a decomposition of the form
$$Z_t = Z_0 + \int_0^t f_s\,ds + M_t, \qquad (3.1)$$
where f = (ft), t ∈ R+, is a progressively measurable stochastic process with $E\int_0^t |f_s|\,ds < \infty$ for all t ∈ R+, E|Z0| < ∞ and M = (Mt) ∈ M0. Short notation: Z = (f, M).

A martingale is the mathematical model of a fair game; a martingale M ∈ M0 has the constant expectation function EMt = EM0 = 0 for all t ∈ R+. The drift term is an integral over a stochastic process. To give this integral meaning, (ft) should also be measurable in the argument t, which is ensured, for example, if f has right-continuous paths or, more generally, if f is progressively measurable. Since the drift part in the above decomposition is continuous, a process Z which admits such a representation is called an SSM, or a smooth F-semimartingale if we would like to emphasize that Z is adapted to the filtration F. For some additional details concerning SSMs, see Appendix A.6, p. 266.
Below we formulate conditions under which a process Z admits a semimartingale representation and show how this decomposition can be found. To this end we denote $D(t,h) = h^{-1}E[Z_{t+h} - Z_t \mid \mathcal F_t]$, t, h ∈ R+.
C1 For all t, h ∈ R+, versions of the conditional expectation E[Zt+h|Ft] exist such that the limit
$$f_t = \lim_{h\to 0+} D(t,h)$$
exists P-a.s. for all t ∈ R+ and (ft), t ∈ R+, is F-progressively measurable with $E\int_0^t |f_s|\,ds < \infty$ for all t ∈ R+.
C2 For all t ∈ R+, the process (hD(t, h)), h ∈ R+, has P-a.s. absolutely continuous paths.
C3 For all t ∈ R+, a constant c > 0 exists such that {D(t, h) : 0 < h ≤ c} is uniformly integrable.
The following theorem shows that these conditions are sufficient for an SSM representation.
Theorem 3.6. Let Z = (Zt), t ∈ R+, be a stochastic process on the probability space (Ω, F, P), adapted to the filtration F. If C1, C2, and C3 hold true, then Z is an SSM with representation Z = (f, M), where f is the limit defined in C1 and M is an F-martingale given by
$$M_t = Z_t - Z_0 - \int_0^t f_s\,ds.$$

Proof. We have to show that with (ft) from condition C1 the right-continuous process $M_t = Z_t - Z_0 - \int_0^t f_s\,ds$ is an F-martingale, i.e., that for all A ∈ Ft and s ≥ t, s, t ∈ R+, E[IA Ms] = E[IA Mt], where IA denotes the indicator variable. This is equivalent to
$$E[I_A(M_s - M_t)] = \int_A \left(Z_s - Z_t - \int_t^s f_u\,du\right) dP = 0.$$
For all r with t ≤ r ≤ s and A ∈ Ft, IA is Fr-measurable. This yields
$$\frac{1}{h}E[I_A(Z_{r+h} - Z_r)] = \frac{1}{h}E\left[E[I_A(Z_{r+h} - Z_r) \mid \mathcal F_r]\right] = E\left[I_A\,\frac{1}{h}E[Z_{r+h} - Z_r \mid \mathcal F_r]\right] = E[I_A D(r,h)].$$
From C1 it follows that D(r, h) → fr as h → 0+ and therefore also IA D(r, h) → IA fr as h → 0+ P-a.s. Now IA D(r, h) is uniformly integrable by C3, which ensures that
$$\lim_{h\to 0+} E[I_A D(r,h)] = \lim_{h\to 0+} \frac{1}{h}E[I_A(Z_{r+h} - Z_r)] = E[I_A f_r]. \qquad (3.2)$$
Because of C2 there exists a process (gt) such that
$$E[I_A(Z_s - Z_t)] = E\left[I_A \int_t^s g_u\,du\right] = \int_t^s E[I_A g_u]\,du, \qquad (3.3)$$
where the second equality follows from Fubini's theorem. By (3.3) the function u ↦ E[IA(Zu − Zt)], u ≥ t, is absolutely continuous with derivative E[IA gu] for almost all u, while by (3.2) its right derivative at u equals E[IA fu]; hence E[IA gu] = E[IA fu] for almost all u ≥ t. Then (3.2) and (3.3) together yield
$$E[I_A(Z_s - Z_t)] = \int_t^s E[I_A f_u]\,du = E\left[I_A \int_t^s f_u\,du\right],$$
which proves the assertion.


Remark 3.7. (i) In the terminology of Dellacherie and Meyer [62] an SSM Z = (f, M) is a special semimartingale because the drift term $\int_0^t f_s\,ds$ is continuous and therefore predictable. Hence the decomposition of Z is unique P-a.s., because a second decomposition Z = (f′, M′) leads to the continuous and therefore predictable martingale M − M′ of integrable variation, which is identically 0 (cf. Appendix A.5, Lemma A.39, p. 263). (ii) It can be shown that if Z = (f, M) is an SSM and for some constant c > 0 the family of random variables $\{|h^{-1}\int_t^{t+h} f_s\,ds| : 0 < h \le c\}$ is bounded by some integrable random variable Y, then the conditions C1–C3 hold true, i.e., under this boundedness condition C1–C3 are not only sufficient but also necessary for a semimartingale representation. The proof of the main part (C2) is based on the Radon–Nikodym theorem. The details are of a technical nature; they are therefore omitted and left to the interested reader. (iii) For applications it is often of interest to find an SSM representation for point processes, i.e., to determine the compensator of such a process (cf. Definition 3.8 on p. 62). For such and other more specialized processes, specifically adapted methods to find the compensator can be applied; see below and [16, 50, 58, 103, 115].
One of the simplest examples of a process with an SSM representation is the Poisson process (Nt), t ∈ R+, with constant rate λ > 0. It is well known and easy to see from the definition of a martingale that Mt = Nt − λt defines a martingale with respect to the internal filtration $\mathcal F^N_t = \sigma(N_s, 0 \le s \le t)$. If we consider conditions C1–C3, we find that D(t, h) = λ for all t, h ∈ R+ because the Poisson process has independent and stationary increments: $E[N_{t+h} - N_t \mid \mathcal F^N_t] = E[N_{t+h} - N_t] = EN_h = h\lambda$. Therefore, C1–C3 are satisfied with ft = λ for all ω ∈ Ω and all t ∈ R+, which results in the representation $N_t = \int_0^t \lambda\,ds + M_t = \lambda t + M_t$.
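For the Poisson process, the martingale property of Mt = Nt − λt and the value D(t, h) = λ can be illustrated by simulation. The sketch below (NumPy's `Generator.poisson`; rate, time points, sample size and seed are arbitrary, and `poisson_counts` is a hypothetical helper) builds Nt from independent Poisson increments and checks that M has mean zero and that its increments are uncorrelated with the past; a Monte Carlo check is, of course, no proof.

```python
import numpy as np


def poisson_counts(times, lam, n_paths, rng):
    """N_t at the sorted time points `times` for n_paths independent Poisson(lam) paths."""
    dt = np.diff(np.concatenate(([0.0], times)))
    # independent Poisson(lam * dt) increments over consecutive intervals
    increments = rng.poisson(lam * dt, size=(n_paths, len(times)))
    return np.cumsum(increments, axis=1)


rng = np.random.default_rng(seed=7)
lam = 3.0
times = np.array([0.5, 1.0, 2.0, 4.0])
N = poisson_counts(times, lam, 100_000, rng)
M = N - lam * times  # candidate martingale M_t = N_t - lambda t

print("E[M_t] ~ 0:", M.mean(axis=0))
# martingale increments are uncorrelated with the past: E[(M_4 - M_2) N_2] = 0
print("E[(M_4 - M_2) N_2] ~ 0:", np.mean((M[:, 3] - M[:, 2]) * N[:, 2]))
# D(t, h) = E[N_{t+h} - N_t] / h should equal lambda
print("D(2, 2) ~ lambda:", np.mean(N[:, 3] - N[:, 2]) / 2.0)
```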
The Poisson process is a point process as well as an example of a Markov
process, and the question arises under which conditions point and Markov
processes admit an SSM representation.

Point and Counting Processes


A point process over R+ can be described by an increasing sequence of random variables, by a purely atomic random measure, or by means of its corresponding counting process. Since we want to use the semimartingale structure of point processes, we will mostly use the last description of a point process. A (univariate) point process is an increasing sequence (Tn), n ∈ N, of positive random variables, which may also take the value +∞: 0 < T1 ≤ T2 ≤ .... The inequality is strict unless Tn = ∞. We always assume that $T_\infty = \lim_{n\to\infty} T_n = \infty$, i.e., that the point process is nonexplosive.
This point process is also completely characterized by the random measure μ on (0, ∞) defined by
$$\mu(\omega, A) = \sum_{k\ge 1} I(T_k(\omega) \in A)$$
for all Borel sets A of (0, ∞).
Another equivalent way to describe a point process is by a counting process N = (Nt), t ∈ R+, with
$$N_t(\omega) = \sum_{k\ge 1} I(T_k(\omega) \le t),$$
which is, for each realization ω, a right-continuous step function with jumps of
magnitude 1 and N0 (ω) = 0. Nt counts the number of time points Tn , which
occur up to time t. Since (Nt ), t ∈ R+ , and (Tn ), n ∈ N, obviously carry the
same information, the associated counting process is sometimes also called a
point process.
A slight generalization is the notion of a multivariate point process. Let
(Tn ), n ∈ N, be a point process as before and (Vn ), n ∈ N, a sequence of
random variables with values in a finite set {a1 , . . . , am }. Then the sequence of
pairs (Tn , Vn ), n ∈ N, is called a multivariate point process and the associated
m-variate counting process Nt = (Nt(1), ..., Nt(m)) is defined by
$$N_t(i) = \sum_{k\ge 1} I(T_k \le t)\, I(V_k = a_i), \qquad i \in \{1, \dots, m\}.$$

Let us now consider a univariate point process (Tn), n ∈ N, and its associated counting process (Nt), t ∈ R+, with ENt < ∞ for all t ∈ R+ on a filtered probability space (Ω, F, F, P). The traditional definition of the compensator of a point process is the following.
Definition 3.8. Let N be an integrable point process adapted to the filtration F. The unique F-predictable increasing process A = (At) such that
$$E\int_0^\infty C_s\,dN_s = E\int_0^\infty C_s\,dA_s \qquad (3.4)$$
is fulfilled for all nonnegative F-predictable processes C is called the compensator of N with respect to F.

The existence and the uniqueness of the compensator can be proved by means of the so-called dual predictable projection; we refer to the work of Jacod [92]. The following martingale characterization of the compensator links the dynamical view of point processes with the semimartingale set-up (for a proof, see [103], p. 60).

Theorem 3.9. Let N be an integrable point process adapted to the filtration F. Then A is the F-compensator of N if and only if the difference process N − A is an F-martingale in M0.

Proof (Sketch). Let A be the compensator and C be the predictable process defined as the indicator of the set (t, s] × B, where s > t, B ∈ Ft. Then the definition of the compensator yields
$$E[I_B(N_s - N_t)] = E[I_B(A_s - A_t)], \qquad (3.5)$$
which gives
$$E[I_B(N_s - A_s)] = E[I_B(N_t - A_t)].$$
Hence, N − A is a martingale. Conversely, if N − A is a martingale, then A is integrable and we obtain (3.5). In the general case, (3.4) can be established using the monotone class theorem.


If we view the compensator as a random measure A(dt) on (0, ∞), then we can interpret this measure in an infinitesimal form by the heuristic expression
$$A(dt) = E[dN_t \mid \mathcal F_{t-}].$$
So, for an increment dt in time from t on, A(dt) is what we can predict from the information gathered in [0, t) about the increase of Nt, and dMt = dNt − A(dt) is what remains unforeseen. Thus, M is sometimes called an innovation martingale and A(dt) the (dual) predictable projection.
In many cases (which are those we are mostly interested in) the F-compensator A of a counting process N can be represented as an integral of the form
$$A_t = \int_0^t \lambda_s\,ds$$
with some nonnegative (F-progressively measurable) stochastic process (λt), t ∈ R+, i.e., N has an SSM representation N = (λ, M).

Definition 3.10. Let N be an integrable counting process with an F-SSM representation
$$N_t = A_t + M_t = \int_0^t \lambda_s\,ds + M_t,$$
where (λt), t ∈ R+, is a nonnegative process. Then λ is called the F-intensity of N.

Remark 3.11. (i) To speak of the intensity is a little misleading (but harmless) because it is not unique. It can be shown (see Brémaud [50], p. 31) that if one can find a predictable intensity, then it is unique except on a set of measure 0 with respect to the product measure of P and Lebesgue measure. On the other hand, if there exists an intensity, then one can always find a predictable version. (ii) The heuristic interpretation
$$\lambda_t\,dt = E[dN_t \mid \mathcal F_{t-}]$$
is very similar to the ordinary failure or hazard rate of a random variable.

Theorem 3.9 and Definition 3.10 link the point process to the semimartingale representation, and using the definition of the compensator, it is possible to verify formally that a process λ is the F-intensity of the point process N. We have to show that
$$E\int_0^\infty C_s\,dN_s = E\int_0^\infty C_s\lambda_s\,ds$$
for all nonnegative F-predictable processes C. Another way to verify that a process A is the compensator is to check the general conditions C1–C3 on page 60 or to use the conditions given by Aven [16].
To go one step further we now specialize to the internal filtration $\mathbb F^N = (\mathcal F^N_t)$, $\mathcal F^N_t = \sigma(N_s, 0 \le s \le t)$, and determine the $\mathbb F^N$-compensator of N in an explicit form. The proof of the following theorem can be found in Jacod [92] and in Brémaud [50], p. 61. Regular conditional distributions are introduced in Appendix A.2, p. 252.

Theorem 3.12. Let N be an integrable point process and $\mathbb F^N$ its internal filtration. For each n let Gn(ω, B) be the regular conditional distribution of the interarrival time Un+1 = Tn+1 − Tn, n ∈ N0, T0 = 0, given the past $\mathcal F^N_{T_n}$ at the $\mathbb F^N$-stopping time Tn: $G_n(\omega, B) = P(U_{n+1} \in B \mid \mathcal F^N_{T_n})(\omega)$.
(i) Then for Tn < t ≤ Tn+1 the compensator A is given by
$$A_t = A_{T_n} + \int_0^{t-T_n} \frac{G_n(dx)}{G_n([x,\infty))}.$$
(ii) If the conditional distribution Gn admits a density gn for all n, then the $\mathbb F^N$-intensity λ is given by
$$\lambda_t = \sum_{n\ge 0} \frac{g_n(t-T_n)}{1 - \int_0^{t-T_n} g_n(x)\,dx}\, I(T_n < t \le T_{n+1}).$$
Note that expressions of the form "0/0" are always set equal to 0.

Example 3.13. (Renewal process). Let the interarrival times Un+1 = Tn+1 − Tn, n ∈ N0, T0 = 0, be i.i.d. random variables with common distribution function F, density f and failure rate r: r(t) = f(t)/(1 − F(t)). Then it follows from Theorem 3.12 that with respect to the internal history $\mathcal F^N_t = \sigma(N_s, 0 \le s \le t)$ the intensity on {Tn < t ≤ Tn+1} is given by λt = r(t − Tn). This results in the SSM representation N = (λ, M),
$$N_t = \int_0^t \lambda_s\,ds + M_t$$
with the intensity
$$\lambda_t = \sum_{n\ge 0} r(t - T_n)\, I(T_n < t \le T_{n+1}).$$
This agrees with the intuition that the intensity at time t is the failure rate of the item last renewed before t, evaluated at its age t − Tn.
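Evaluating the intensity of Example 3.13 along an observed path only requires locating the last renewal before t. A minimal sketch (NumPy assumed; the Weibull hazard r(x) = 2x and the renewal times are illustrative choices, and the helper names are hypothetical). Note the left-continuity: at t = Tn+1 the intensity still uses the age t − Tn.

```python
import numpy as np


def renewal_intensity(t, arrival_times, hazard):
    """Intensity lambda_t = r(t - T_n) on {T_n < t <= T_{n+1}}, with T_0 = 0 (t > 0)."""
    points = np.concatenate(([0.0], np.sort(np.asarray(arrival_times, dtype=float))))
    # side="left" gives the index of the first point >= t, so the point before it
    # is the last renewal T_n strictly before t (left-continuous intensity)
    n = np.maximum(np.searchsorted(points, t, side="left") - 1, 0)
    return hazard(t - points[n])


def weibull_hazard(x, shape=2.0, scale=1.0):
    """r(x) = (shape/scale) (x/scale)^(shape-1); shape = 2, scale = 1 gives r(x) = 2x."""
    return (shape / scale) * (x / scale) ** (shape - 1.0)


T = [1.0, 2.5]  # observed renewal times T_1, T_2 (illustrative values)
for t in (0.5, 1.0, 1.5, 2.5, 3.0):
    print(t, renewal_intensity(t, T, weibull_hazard))
```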
Example 3.14. (Markov-modulated Poisson process). A Poisson process can be generalized by replacing the constant intensity with a randomly varying intensity, which takes one of the m values λi, 0 < λi < ∞, i ∈ S = {1, ..., m}, m ∈ N. The changes are driven by a homogeneous Markov chain Y = (Yt), t ∈ R+, with values in S and infinitesimal parameters qi, the rate to leave state i, and qij, the rate to reach state j from state i:
$$q_i = \lim_{h\to 0+} \frac{1}{h} P(Y_h \ne i \mid Y_0 = i),$$
$$q_{ij} = \lim_{h\to 0+} \frac{1}{h} P(Y_h = j \mid Y_0 = i), \qquad i, j \in S,\ i \ne j,$$
$$q_{ii} = -q_i = -\sum_{j \ne i} q_{ij}.$$
The point process (Tn) corresponds to the counting process N = (Nt), t ∈ R+, with
$$N_t = \sum_{n=1}^{\infty} I(T_n \le t).$$
It is assumed that N has a stochastic intensity $\lambda_{Y_t}$ with respect to the filtration F generated by N and Y:
$$\mathcal F_t = \sigma(N_s, Y_s, 0 \le s \le t).$$
Then N is called a Markov-modulated Poisson process with SSM representation
$$N_t = \int_0^t \lambda_{Y_s}\,ds + M_t.$$
Roughly speaking, in state i the point process is Poisson with rate λi. But note that the ordinary failure rate of T1 is not constant. If we cannot observe the Markov chain Y, but only the point process (Tn), then we look for an intensity with respect to the subfiltration A = (At), t ∈ R+, At = σ(Ns, 0 ≤ s ≤ t). For this we have to estimate the current state of the Markov chain, which involves the infinitesimal parameters qi, qij. For details we refer to Sects. 3.2.4 and 5.4.2.
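A simulation can make the SSM representation tangible: N_T minus the compensator $\int_0^T \lambda_{Y_s}\,ds$ should have mean zero. The sketch below (NumPy assumed; the two-state generator matrix Q, the rates λ1 = 1, λ2 = 5, the horizon and the seed are arbitrary, and `simulate_mmpp` is a hypothetical helper) uses competing exponential clocks: in state i, either a point of N occurs (rate λi) or Y jumps (rate qi).

```python
import numpy as np


def simulate_mmpp(Q, rates, y0, t_end, rng):
    """Simulate a Markov-modulated Poisson process on [0, t_end].

    Q     -- generator matrix of the chain Y (off-diagonal q_ij >= 0, rows sum to 0)
    rates -- rates[i] = lambda_i > 0, the Poisson rate while Y is in state i
    Returns N_{t_end} and the compensator A_{t_end} = int_0^{t_end} lambda_{Y_s} ds.
    """
    y, t, count, compensator = y0, 0.0, 0, 0.0
    while True:
        q = -Q[y, y]                      # rate to leave state y
        total = q + rates[y]              # two competing exponential clocks
        dt = rng.exponential(1.0 / total)
        if t + dt >= t_end:
            compensator += rates[y] * (t_end - t)
            return count, compensator
        compensator += rates[y] * dt
        t += dt
        if rng.random() < rates[y] / total:
            count += 1                    # a point of N; Y stays where it is
        else:
            jump = np.clip(Q[y], 0.0, None) / q  # jump law of Y out of state y
            y = rng.choice(len(rates), p=jump)


rng = np.random.default_rng(seed=3)
Q = np.array([[-1.0, 1.0],
              [2.0, -2.0]])
rates = np.array([1.0, 5.0])
samples = np.array([simulate_mmpp(Q, rates, 0, 3.0, rng) for _ in range(4000)])
N_T, A_T = samples[:, 0], samples[:, 1]
print("E[N_T - A_T] ~ 0:", np.mean(N_T - A_T))
```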

Markov Processes

The question whether Markov processes admit semimartingale representations can generally be answered in the affirmative: (most) Markov processes and bounded functions of such processes have an SSM representation.
Let (Xt), t ∈ R+, be a right-continuous homogeneous Markov process on (Ω, F, P^x) with respect to the (internal) filtration Ft = σ(Xs, 0 ≤ s ≤ t) with values in a measurable space (S, B(S)). For applications we will often confine ourselves to S = R with its Borel σ-field B. Here P^x, x ∈ S, denotes the probability measure on the set of paths which start in X0 = x: P^x(X0 = x) = 1.
Let B denote the set of bounded, measurable functions on S with values in R and let E^x denote expectation with respect to P^x. Then the infinitesimal generator A is defined as follows: If for f ∈ B the limit
$$\lim_{h\to 0+} \frac{1}{h}\left(E^x f(X_h) - f(x)\right) = g(x)$$
exists for all x ∈ S with g ∈ B, then we set Af = g and say that f belongs to the domain D(A) of the infinitesimal generator A. It is known that if f ∈ D(A), then
$$M^f_t = f(X_t) - f(X_0) - \int_0^t Af(X_s)\,ds$$
defines a martingale (cf., e.g., [101], p. 328). This shows that a function Zt = f(Xt) of a homogeneous Markov process has an SSM representation if f ∈ D(A).
Example 3.15 (Markov pure jump process). A homogeneous Markov process X = (Xt) with right-continuous paths, which are constant between isolated jumps, is called a Markov pure jump process. As before, P^x denotes the probability law conditioned on X0 = x, and τx = inf{t ∈ R+ : Xt ≠ x} is the exit time of state x. It is known that τx follows an Exp(λ(x)) distribution if 0 < λ(x) < ∞ and that P^x(τx = ∞) = 1 if λ(x) = 0, for some suitable mapping λ on the set of possible outcomes of X0 with values in R+. Let v(x, ·) be the jump law or transition probability at x, defined by v(x, B) = P^x(X_{τx} ∈ B) for λ(x) > 0. If f belongs to the domain D(A) of the infinitesimal generator, then we obtain (cf. Métivier [122])
$$Af(x) = \lambda(x)\int \left(f(y) - f(x)\right) v(x, dy). \qquad (3.6)$$
Let us now consider some particular cases. (i) Poisson process N = (Nt) with parameter λ > 0. In this case we have jumps of height 1, i.e., v(x, {x + 1}) = 1. For f(x) = x we get Af(x) ≡ λ. This again shows that Nt − λt is a martingale. If we take f(x) = x², then we obtain Af(x) = λ((x + 1)² − x²) = λ(2x + 1), and for N² we have the SSM representation
$$N_t^2 = f(N_t) = \int_0^t \lambda(2N_s + 1)\,ds + M^f_t.$$
(ii) Compound Poisson process X = (Xt). Let N be a Poisson process with an intensity λ : R → R+, 0 < λ(x) < ∞, and (Yn), n ∈ N, a sequence of i.i.d. random variables with finite mean μ. Then
$$X_t = \sum_{n=1}^{N_t} Y_n$$
defines a Markov pure jump process with v(x, B) = P^x(X_{τx} ∈ B) = P(Y1 ∈ B − x). By formula (3.6) for the infinitesimal generator we get the SSM representation
$$X_t = \int_0^t \lambda(X_s)\mu\,ds + M_t.$$
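Formula (3.6) is easy to evaluate when the jump law is finite. The following pure-Python sketch (the names `pure_jump_generator`, `poisson_law`, `compound_law` and the two-point jump distribution are illustrative assumptions; it also uses a constant intensity λ(x) ≡ λ in case (ii)) reproduces Af = λ and Af(x) = λ(2x + 1) for the Poisson process and Af(x) = λμ for a compound Poisson process.

```python
def pure_jump_generator(f, jump_rate, jump_law):
    """Return Af with Af(x) = lambda(x) * sum_y (f(y) - f(x)) v(x, {y}), cf. (3.6).

    jump_rate(x) plays the role of lambda(x); jump_law(x) returns a finite jump
    distribution v(x, .) as a dict {target state y: probability}.
    """
    def Af(x):
        return jump_rate(x) * sum(p * (f(y) - f(x)) for y, p in jump_law(x).items())
    return Af


lam = 2.0
# (i) Poisson process: jumps of height 1, v(x, {x + 1}) = 1
poisson_law = lambda x: {x + 1: 1.0}
A_id = pure_jump_generator(lambda x: x, lambda x: lam, poisson_law)
A_sq = pure_jump_generator(lambda x: x * x, lambda x: lam, poisson_law)
print(A_id(3), A_sq(3), lam * (2 * 3 + 1))  # Af = lambda for f(x) = x, lambda (2x + 1) for x^2

# (ii) compound Poisson with jump sizes Y in {1, 2}, each with probability 1/2 (mu = 1.5)
compound_law = lambda x: {x + 1: 0.5, x + 2: 0.5}
A_cp = pure_jump_generator(lambda x: x, lambda x: lam, compound_law)
print(A_cp(0), lam * 1.5)  # Af(x) = lambda * mu for f(x) = x
```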

We now return to the general theory of Markov processes. The so-called Dynkin formula states that for a stopping time τ we have
$$E^x g(X_\tau) = g(x) + E^x\int_0^\tau Ag(X_s)\,ds$$
if E^x τ < ∞ and g ∈ D(A) (see Dynkin [66], p. 133). This formula can now be extended to the more general case of SSMs. If Z = (f, M) is an F-SSM with (P-a.s.) bounded Z and f, then for all F-stopping times τ with Eτ < ∞ we obtain
$$EZ_\tau = EZ_0 + E\int_0^\tau f_s\,ds.$$
Here EMτ = 0 is a consequence of the Optional Sampling Theorem (see Appendix A.5, Theorem A.34, p. 262). The following example shows how the Dynkin formula can be applied to determine the expectation of a stopping time.

Example 3.16. Let $B = (B_t)$ be a k-dimensional Brownian motion with initial point $B_0 = x$, and let g be a bounded twice continuously differentiable function on $\mathbb{R}^k$ with bounded derivatives. Then we obtain (cf. Métivier [122], p. 201) the SSM representation for $g(B_t)$:

$$g(B_t) = g(x) + \frac{1}{2}\int_0^t \sum_{i=1}^{k} \frac{\partial^2 g}{\partial x_i^2}(B_s)\,ds + M_t^g.$$

For some $R > 0$ and $|x| < R$ we consider the stopping time $\sigma = \inf\{t \in \mathbb{R}_+ : |B_t| \ge R\}$ with respect to the internal filtration, which is the first exit time of the ball $K_R = \{y \in \mathbb{R}^k : |y| < R\}$. By means of the Dynkin formula we can determine the expectation $E^x\sigma$ in the following way. Let us assume $E^x\sigma < \infty$ and choose $g(x) = |x|^2$. Outside the closed ball g may be modified to satisfy the boundedness conditions, since B stays in the closed ball up to time σ. Since $\sum_{i=1}^k \partial^2 g/\partial x_i^2 = 2k$, Dynkin's formula then yields
68 3 Stochastic Failure Models
$$E^x g(B_\sigma) = R^2 = |x|^2 + E^x \frac{1}{2}\int_0^\sigma 2k\,ds = |x|^2 + kE^x\sigma,$$

which is tantamount to $E^x\sigma = k^{-1}(R^2 - |x|^2)$. To show $E^x\sigma < \infty$ we may replace σ by $\tau_n = n \wedge \sigma$ in the above formula. Then $E^x|B_{\tau_n}|^2 = |x|^2 + kE^x\tau_n \le R^2$ gives $E^x\tau_n \le k^{-1}(R^2 - |x|^2)$, and together with the monotone convergence theorem the result is established.
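A rough Monte Carlo check of $E^x\sigma = k^{-1}(R^2 - |x|^2)$ can be sketched with an Euler random walk. The step size `dt`, the number of paths and the seed are arbitrary choices, not from the text. Because the exit condition is only checked at grid times, the estimate is biased slightly upward; with the defaults it should land a little above the exact value 0.5 for k = 2, R = 1, x = 0.

```python
import math
import random

def mean_exit_time(k=2, radius=1.0, x=None, dt=1e-3, paths=2000, seed=3):
    """Euler Monte Carlo estimate of E^x sigma, sigma = first exit time of B from K_R."""
    rng = random.Random(seed)
    start = list(x) if x is not None else [0.0] * k
    sd = math.sqrt(dt)                                   # std. dev. of one increment per coordinate
    total = 0.0
    for _ in range(paths):
        b, t = start[:], 0.0
        while sum(c * c for c in b) < radius * radius:  # still inside the open ball
            b = [c + rng.gauss(0.0, sd) for c in b]
            t += dt
        total += t
    return total / paths

def dynkin_value(k, radius, x):
    """Closed form E^x sigma = (R^2 - |x|^2) / k obtained from Dynkin's formula."""
    return (radius ** 2 - sum(c * c for c in x)) / k
```

Calling `mean_exit_time()` and `dynkin_value(2, 1.0, [0.0, 0.0])` compares the simulated and the exact value; shrinking `dt` reduces the discretization bias at the cost of running time.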

3.1.2 Transformations of SSMs


Next we investigate under which conditions certain transformations of SSMs again lead to SSMs, i.e., preserve the SSM property.

Random Stopping
One example is the stopping of a process Z, i.e., the transformation from $Z = (Z_t)$ to the process $Z^\zeta = (Z_{t\wedge\zeta})$, where ζ is some stopping time. If $Z = (f, M)$ is an F-SSM and ζ is an F-stopping time, then $Z^\zeta$ is again an F-SSM with representation

$$Z^\zeta_t = Z_0 + \int_0^t I(\zeta > s) f_s\,ds + M_{t\wedge\zeta}, \quad t \in \mathbb{R}_+.$$

This result is an immediate consequence of the fact that a stopped martingale is a martingale.

A Product Rule
A second example of a transformation is the product of two SSMs. To see under which conditions such a product of two SSMs again forms an SSM, some further notation and definitions are required, which are presented in Appendix A. Here we only give the general result. For the conditions and a detailed proof we refer to Appendix A.6, Theorem A.51, p. 269.
Let $Z = (f, M)$ and $Y = (g, N)$ be F-SSMs with $M, N \in \mathcal{M}_0^2$ and $MN \in \mathcal{M}_0$. Then, under suitable integrability conditions, ZY is an F-SSM with representation

$$Z_tY_t = Z_0Y_0 + \int_0^t (Y_s f_s + Z_s g_s)\,ds + R_t,$$

where $R = (R_t)$ is a martingale in $\mathcal{M}_0$.
Remark 3.17. (i) If $Z = (f, M)$ and $Y = (g, N)$ are two SSMs and f and g are considered as “derivatives,” then $Yf + Zg$ is the “derivative” of the product ZY, in accordance with the ordinary product rule. (ii) Martingales M, N for which MN is a martingale are called orthogonal. This property can be interpreted in the sense that the increments of the martingales are “conditionally uncorrelated,” i.e.,

$$E[(M_t - M_s)(N_t - N_s) \mid \mathcal{F}_s] = 0$$

for all $0 \le s \le t$.
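Orthogonality is a genuine restriction. For two independent Poisson processes, which have no common jumps, the compensated martingales are orthogonal. A compensated Poisson martingale M is not orthogonal to itself, however: $M_t^2 - \lambda t$, not $M_t^2$, is a martingale. This is why a naive use of the product rule for $N \cdot N$ would miss the extra term λ in the drift $\lambda(2N_s + 1)$ of $N^2$. The following Monte Carlo sketch estimates $E[M_tN_t]$ (close to 0) and $E[M_t^2]$ (close to λt). The rates, time horizon and path count are illustrative assumptions.

```python
import random

def compensated_poisson(lam, t, rng):
    """M_t = N_t - lam * t for a rate-lam Poisson process N started at 0."""
    n, s = 0, rng.expovariate(lam)
    while s <= t:
        n += 1
        s += rng.expovariate(lam)
    return n - lam * t

def cross_moments(lam=1.0, lam2=2.0, t=1.0, paths=40000, seed=11):
    """Monte Carlo estimates of E[M_t N_t] (independent processes) and E[M_t^2]."""
    rng = random.Random(seed)
    cross = square = 0.0
    for _ in range(paths):
        m = compensated_poisson(lam, t, rng)
        n = compensated_poisson(lam2, t, rng)   # independent of m, so no common jumps
        cross += m * n
        square += m * m
    return cross / paths, square / paths

print(cross_moments())
```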

A Change of Filtration

Another transformation is a certain change of the filtration, which allows the observation of a stochastic process on different information levels.

Definition 3.18. Let $A = (\mathcal{A}_t)$, $t \in \mathbb{R}_+$, and $F = (\mathcal{F}_t)$, $t \in \mathbb{R}_+$, be two filtrations on the same probability space $(\Omega, \mathcal{F}, P)$. Then A is called a subfiltration of F if $\mathcal{A}_t \subset \mathcal{F}_t$ for all $t \in \mathbb{R}_+$.

In this case F can be viewed as the complete information filtration and A as the actual observation filtration on a lower level. If $Z = (f, M)$ is an SSM with respect to the filtration F, then the projection to the observation filtration A is given by the conditional expectation $\hat Z$ with $\hat Z_t = E[Z_t \mid \mathcal{A}_t]$. The following projection theorem states that $\hat Z$ is an A-semimartingale. Different versions of this theorem are proved in the literature. The version presented here for SSMs is based on [50], pp. 87, 108, [100], p. 202, and [161].

Theorem 3.19 (Projection Theorem). Let $Z = (f, M)$ be an F-SSM and A a subfiltration of F. Then $\hat Z$ with

$$\hat Z_t = \hat Z_0 + \int_0^t \hat f_s\,ds + \bar M_t \qquad (3.7)$$

is an A-SSM, where
(i) $\hat Z$ is A-adapted with a.s. right-continuous paths with left-hand limits and $\hat Z_t = E[Z_t \mid \mathcal{A}_t]$ for all $t \in \mathbb{R}_+$;
(ii) $\hat f$ is A-progressively measurable with $\hat f_t = E[f_t \mid \mathcal{A}_t]$ for almost all $t \in \mathbb{R}_+$ (Lebesgue measure);
(iii) $\bar M$ is an A-martingale.
If in addition $Z_0, \int_0^\infty |f_s|\,ds \in L^2$ and $M \in \mathcal{M}_0^2$, then $\hat Z_0, \int_0^\infty |\hat f_s|\,ds \in L^2$ and $\bar M \in \mathcal{M}_0^2$.

Unfortunately, monotonicity properties of Z and f do not in general extend to $\hat Z$ and $\hat f$, respectively. So if, for example, f has monotone paths, this need not be true for the corresponding process $\hat f$. Whether $\hat f$ has monotone paths depends on the path properties of f as well as on the subfiltration A. If f is already adapted to the subfiltration A, then obviously $\hat f = f$. In this case projecting onto the subfiltration only filters out information that does not affect the drift term.
The Projection Theorem will mainly be applied to solve optimal stopping problems on different information levels in the following manner. Let $Z = (f, M)$ be an F-SSM and let $\hat Z = (\hat f, \bar M)$ be the corresponding A-SSM with respect to a subfiltration A of F. Our goal is to determine the supremum of $EZ_\tau$ over the set $C^A$ of A-stopping times τ, i.e., to solve the optimal stopping problem on the lower A-information level. For this we can use the rule of successive conditioning for conditional expectations (cf. Appendix A.2, p. 251) to obtain

$$\sup\{EZ_\tau : \tau \in C^A\} = \sup\{E\hat Z_\tau : \tau \in C^A\}.$$

In Sect. 5.2.1, Theorem 5.9, p. 181, conditions are given under which the stopping problem for an SSM Z can be solved. If these conditions apply to $\hat Z$, then we can solve this optimal stopping problem on the A-level according to Theorem 5.9. If the stopping problem can be solved on the F-level, then we get a bound for the stopping value on the A-level. Since every A-stopping time is also an F-stopping time, we have the inequality

$$\sup\{E\hat Z_\tau : \tau \in C^A\} \le \sup\{EZ_\tau : \tau \in C^F\}.$$

3.2 A General Lifetime Model


First let us consider the simple indicator process $Z_t = I(T \le t)$, where T is the lifetime random variable defined on the basic probability space. Obviously Z is the counting process corresponding to the simple point process $(T_n)$ with $T = T_1$ and $T_n = \infty$ for $n \ge 2$. The paths of this indicator process Z are constant, except for one jump from 0 to 1 at T. Let us assume that this indicator process has a smooth F-semimartingale representation with an F-martingale $M \in \mathcal{M}_0$ and a nonnegative stochastic process $\lambda = (\lambda_t)$:

$$I(T \le t) = \int_0^t I(T > s)\lambda_s\,ds + M_t, \quad t \in \mathbb{R}_+. \qquad (3.8)$$

The general lifetime model is then defined by the filtration F and the corresponding F-SSM representation of the indicator process.

Definition 3.20. The process $\lambda = (\lambda_t)$, $t \in \mathbb{R}_+$, in the SSM representation (3.8) is called the F-failure rate or the F-hazard rate process, and the compensator $\Lambda_t = \int_0^t I(T > s)\lambda_s\,ds$ is called the F-hazard process.

We drop F when it is clear from the context. As was mentioned before (cf. Remark 3.11 on p. 64), the intensity of the indicator (point) process is not unique. If one F-failure rate λ is known, we may pass to a left-continuous version $(\lambda_{t-})$ to obtain a predictable, unique intensity:

$$I(T \le t) = \int_0^t I(T \ge s)\lambda_{s-}\,ds + M_t.$$

Before investigating under which conditions such a representation exists, we give some examples.

Example 3.21. If the failure rate process λ is deterministic, forming expectations in (3.8) (note that $EM_t = 0$) leads to the integral equation

$$F(t) = P(T \le t) = EI(T \le t) = \int_0^t P(T > s)\lambda_s\,ds = \int_0^t (1 - F(s))\lambda_s\,ds.$$

The unique solution

$$\bar F(t) = 1 - F(t) = \exp\left(-\int_0^t \lambda_s\,ds\right) \qquad (3.9)$$

is just the well-known relation between the standard failure rate and the distribution function. This shows that if the hazard rate process λ is deterministic, it coincides with the ordinary failure rate.
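To see (3.9) at work numerically, the sketch below solves the integral equation on a grid with the trapezoidal rule and compares the result with the exponential formula. The Weibull-type rate $\lambda_s = 2s$ (for which $\bar F(t) = e^{-t^2}$), the horizon and the grid size are illustrative assumptions.

```python
import math

def solve_integral_equation(rate, t_max, steps):
    """Grid solution of F(t) = int_0^t (1 - F(s)) rate(s) ds by the trapezoidal rule."""
    h = t_max / steps
    values = [0.0]                                   # F(0) = 0
    for k in range(steps):
        r0, r1 = rate(k * h), rate((k + 1) * h)
        f0 = values[-1]
        # F1 = F0 + h/2 * ((1 - F0) r0 + (1 - F1) r1) is linear in F1: solve it directly
        f1 = (f0 + 0.5 * h * ((1 - f0) * r0 + r1)) / (1 + 0.5 * h * r1)
        values.append(f1)
    return values

rate = lambda s: 2.0 * s                             # illustrative Weibull-type failure rate
grid = solve_integral_equation(rate, 1.5, 1000)
exact = 1 - math.exp(-1.5 ** 2)                      # 1 - exp(-int_0^1.5 2s ds), formula (3.9)
print(grid[-1], exact)
```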
Example 3.22. In continuation of Example 1.1, p. 2, we consider a three-component system with one component in series with a two-component parallel system. It is assumed that the component lifetimes $T_1, T_2, T_3$ are i.i.d. exponentially distributed with parameter α > 0. What is the failure rate process corresponding to the system lifetime $T = T_1 \wedge (T_2 \vee T_3)$? This depends on the information level, i.e., on the filtration F.
• $\mathcal{F}_t = \sigma(X_s, 0 \le s \le t)$, where $X_s = (X_s(1), X_s(2), X_s(3))$ and $X_s(i) = I(T_i > s)$, $i = 1, 2, 3$. Observing on the component level means that $\mathcal{F}_t$ is generated by the indicator processes of the component lifetimes up to time t. It can be shown (by means of the results of the next section) that the failure rate process of the system lifetime is given by $\lambda_t = \alpha\{1 + (1 - X_t(2)) + (1 - X_t(3))\}$ on $\{T > t\}$. As long as all components work, the rate is α due to component 1. When one of the two parallel components 2 or 3 fails first, the rate switches to 2α.
• $\mathcal{F}_t = \sigma(I(T \le s), 0 \le s \le t)$. If only the system lifetime can be observed, the failure rate process reduces to the ordinary deterministic failure rate

$$\lambda_t = \alpha\left(1 + 2\,\frac{1 - e^{-\alpha t}}{2 - e^{-\alpha t}}\right).$$
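Both rates in this example can be cross-checked numerically, as sketched below. The code does three things:

- it evaluates the closed-form system-level rate;
- it compares that rate with $-\frac{d}{dt}\log P(T > t)$;
- it averages the component-level rate over the unobserved states of components 2 and 3 given $\{T > t\}$, as suggested by the projection theorem (Theorem 3.19).

The value α = 1.3 and the finite-difference step are illustrative.

```python
import math

def survival(alpha, t):
    """P(T > t) for T = T1 ∧ (T2 ∨ T3) with T1, T2, T3 i.i.d. Exp(alpha)."""
    q = math.exp(-alpha * t)                     # P(T_i > t)
    return q * (1 - (1 - q) ** 2)

def ordinary_rate(alpha, t):
    """Closed-form failure rate when only the system lifetime is observed."""
    q = math.exp(-alpha * t)
    return alpha * (1 + 2 * (1 - q) / (2 - q))

def projected_rate(alpha, t):
    """E[alpha * (1 + I(T2 <= t) + I(T3 <= t)) | T > t]: the component-level
    rate averaged over what a system-level observer cannot see."""
    q = math.exp(-alpha * t)
    # Given T > t, components 2 and 3 cannot both have failed; exactly one has
    # failed with probability 2q(1 - q) / (1 - (1 - q)^2).
    p_one_failed = 2 * q * (1 - q) / (1 - (1 - q) ** 2)
    return alpha * (1 + p_one_failed)

def numeric_rate(alpha, t, h=1e-6):
    """-d/dt log P(T > t) by a central difference."""
    return -(math.log(survival(alpha, t + h)) - math.log(survival(alpha, t - h))) / (2 * h)

print(ordinary_rate(1.3, 0.7), projected_rate(1.3, 0.7), numeric_rate(1.3, 0.7))
```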
Example 3.23. Consider the damage threshold model in which the deterioration is described by the Wiener process $X_t = \sigma B_t + \mu t$, where B is standard Brownian motion and σ, μ > 0 are constants. In this case, whether and in what way the lifetime $T = \inf\{t \in \mathbb{R}_+ : X_t \ge K\}$, $K \in \mathbb{R}_+$, can be characterized by a failure rate process also depends on the available information.
• $\mathcal{F}_t = \sigma(B_s, 0 \le s \le t)$. Observing the actual state of the system proves to be too informative to be described by a failure rate process. The martingale part is identically 0, and the drift part or the predictable compensator is the indicator process $I(T \le t)$ itself. No semimartingale representation (3.8) exists because the lifetime is predictable, as we will see in the following section.
• $\mathcal{F}_t = \sigma(I(T \le s), 0 \le s \le t)$. If only the system lifetime can be observed, conditions change completely. A representation (3.8) exists. The first hitting time T of the barrier K is known to follow a so-called inverse Gaussian distribution (cf. [133], p. 26). The failure rate process is then the ordinary failure rate corresponding to the density

$$f(t) = \frac{K}{\sqrt{2\pi\sigma^2 t^3}} \exp\left(-\frac{(K - \mu t)^2}{2\sigma^2 t}\right), \quad t > 0.$$
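For the second observation level, the failure rate can be evaluated numerically as $f(t)/(1 - F(t))$. The sketch below combines the density above with the standard closed form of the first-passage distribution function of Brownian motion with positive drift:

$$F(t) = \Phi\left(\frac{\mu t - K}{\sigma\sqrt t}\right) + e^{2\mu K/\sigma^2}\,\Phi\left(-\frac{K + \mu t}{\sigma\sqrt t}\right).$$

This closed form is not stated in the text, and the parameter values in the check are arbitrary. For large t the factor $e^{2\mu K/\sigma^2}$ and the subtraction $1 - F(t)$ lose precision, so the sketch is meant for moderate arguments only.

```python
import math

def norm_cdf(z):
    """Standard normal distribution function Phi."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ig_density(t, K, mu, sigma):
    """Density of T = inf{t : sigma B_t + mu t >= K} (inverse Gaussian)."""
    return K / math.sqrt(2 * math.pi * sigma ** 2 * t ** 3) * math.exp(-(K - mu * t) ** 2 / (2 * sigma ** 2 * t))

def ig_cdf(t, K, mu, sigma):
    """Closed-form P(T <= t) for Brownian motion with drift mu > 0 hitting level K > 0."""
    s = sigma * math.sqrt(t)
    return norm_cdf((mu * t - K) / s) + math.exp(2 * mu * K / sigma ** 2) * norm_cdf(-(K + mu * t) / s)

def ig_failure_rate(t, K, mu, sigma):
    """Ordinary failure rate f(t) / (1 - F(t)) seen by an observer of T only."""
    return ig_density(t, K, mu, sigma) / (1.0 - ig_cdf(t, K, mu, sigma))

print([round(ig_failure_rate(t, 2.0, 1.0, 1.0), 4) for t in (0.5, 1.5, 3.0)])
```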

3.2.1 Existence of Failure Rate Processes

It is possible to formulate rather general conditions on Z to ensure a semimartingale representation (3.8), as shown by Theorem 3.6, p. 60. But in reliability models we often have more specific processes $V_t = I(T \le t)$ for which a representation (3.8) has to be found. Whether such a representation exists depends on the random variable T (or on the probability measure P) and on the filtration F. If T is a stopping time with respect to the filtration F, then a representation (3.8) only exists for stopping times which are totally inaccessible in the following sense:

Definition 3.24. An F-stopping time τ is called

• predictable if an increasing sequence $(\tau_n)$, $n \in \mathbb{N}$, of F-stopping times $\tau_n < \tau$ exists such that $\lim_{n\to\infty}\tau_n = \tau$;
• totally inaccessible if $P(\tau = \sigma < \infty) = 0$ for all predictable F-stopping times σ.

Roughly speaking, a stopping time τ is predictable if it is announced by a sequence of (observable) stopping times, and τ is totally inaccessible if it occurs “suddenly” without announcement. For example, a random variable T with an absolutely continuous distribution has the representation

$$V_t = I(T \le t) = \int_0^t I(T > s)\lambda(s)\,ds + M_t, \quad t \in \mathbb{R}_+$$

with respect to the filtration $F^T = (\mathcal{F}_t)$ generated by T, $\mathcal{F}_t = \sigma(T \wedge t)$, where λ is the ordinary failure rate.
In general it can be shown that if V has an SSM representation (3.8), then T is a totally inaccessible stopping time. On the other hand, if T is totally inaccessible, then there is a (unique) decomposition $V = \Lambda + M$ in which the process Λ is (P-a.s.) continuous. We state this result without proof (cf. [62], p. 137 and [122], p. 113).

Lemma 3.25. Let $(\Omega, \mathcal{F}, F, P)$ be a filtered probability space and T an F-stopping time.
(i) If the process $V = (V_t)$, $V_t = I(T \le t)$, has an SSM representation

$$V_t = \int_0^t I(T > s)\lambda_s\,ds + M_t, \quad t \in \mathbb{R}_+,$$

then T is a totally inaccessible stopping time and the martingale M is bounded in $L^2$, $M \in \mathcal{M}_0^2$.
(ii) If T is a totally inaccessible stopping time, then the process $V = (V_t)$, $V_t = I(T \le t)$, has a unique (P-a.s.) decomposition $V = \Lambda + M$, where M is a uniformly integrable martingale and Λ is continuous (P-a.s., the predictable compensator).

“Most” continuous increasing functions that occur in applications are absolutely continuous (apart from pathological special cases such as the Cantor function). Therefore, we can conclude from Lemma 3.25 that the class of lifetime models with a compensator Λ of the form $\Lambda_t = \int_0^t I(T > s)\lambda_s\,ds$ is rich enough to include models for most real-life systems in continuous time. In view of Example 3.23, the condition that V admits an SSM representation seems a natural restriction. If the lifetime could be predicted by an announcing sequence of stopping times, maintenance actions would make no sense: they could always be carried out “just” before a failure. In Example 3.23, $\tau_n = \inf\{t \in \mathbb{R}_+ : X_t = K - \frac{1}{n}\}$ is such an announcing sequence with respect to $\mathcal{F}_t = \sigma(B_s, 0 \le s \le t)$ (compare also Fig. 1.1, p. 6). In addition, Example 3.23 shows that one and the same random variable T can be predictable or totally inaccessible, depending on the corresponding information filtration.
How can the failure rate process λ be ascertained or identified for a given information level F? In general, we can determine λ under the conditions of Theorem 3.6 as the limit

$$I(T > t)\lambda_t = \lim_{h \to 0+} \frac{1}{h} P(t < T \le t + h \mid \mathcal{F}_t)$$

in the sense of almost sure convergence. Another way to verify whether a given process λ is the failure rate is to show that the corresponding hazard process defines the compensator of $I(T \le t)$. In some special cases λ can be represented in a more explicit form, for example for complex systems. This will be carried out in some detail in the next section.

3.2.2 Failure Rate Processes in Complex Systems

In the following we want to derive the hazard rate process for the lifetime T of
a complex system under fairly general conditions. We make no independence
assumption concerning the component lifetimes, and we allow two or more
components to fail at the same time with positive probability.
Let Ti , i = 1, . . . , n, be n positive random variables that describe the com-
ponent lifetimes of a monotone complex system with structure function Φ.
Our aim is to derive the failure rate process for the lifetime

$$T = \inf\{t \in \mathbb{R}_+ : \Phi(X_t) = 0\}$$

with respect to the filtration F given by $\mathcal{F}_t = \sigma(X_s, 0 \le s \le t)$, where as before $X_s = (X_s(1), \dots, X_s(n))$ and $X_s(i) = I(T_i > s)$, $i = 1, \dots, n$. We call this filtration the complete information filtration or filtration on the component level.
For a specific outcome ω let m(ω) be the number of different failure time points $0 < T_{(1)} < T_{(2)} < \dots < T_{(m)}$, and let $J_{(k)} = \{i : T_i(\omega) = T_{(k)}(\omega)\}$ be the set of components that fail at $T_{(k)}$. For completeness we define

$$T_{(r)} = \infty, \quad J_{(r)} = \emptyset \quad \text{for } r \ge m + 1.$$

Thus, the sequence $(T_{(k)}, J_{(k)})$, $k \in \mathbb{N}$, forms a multivariate point process. Now we fix a certain failure pattern $J \subset \{1, \dots, n\}$ and consider the time $T_J$ of occurrence of this pattern, i.e.,

$$T_J = \begin{cases} T_{(k)} & \text{if } J_{(k)} = J \text{ for some } k,\\ \infty & \text{if } J_{(k)} \neq J \text{ for all } k.\end{cases}$$

The corresponding counting process $V_t(J) = I(T_J \le t)$ has a compensator $A_t(J)$ with respect to F, which is assumed to be absolutely continuous, so that $\lambda_t(J)$ is the F-failure rate process:

$$V_t(J) = \int_0^t I(T_J > s)\lambda_s(J)\,ds + M_t(J).$$

In the case $P(T_J = \infty) = 1$, we set $\lambda_t(J) = 0$ for $t \in \mathbb{R}_+$.

Example 3.26. If we assume that the component lifetimes are independent random variables, the only interesting (nontrivial) failure patterns are those consisting of a single component, $J = \{j\}$, $j \in \{1, \dots, n\}$. In this case the F-failure rate processes $\lambda_t(\{j\})$ are merely the ordinary failure rates $\lambda_t(j)$ corresponding to $T_j$.

Example 3.27. We now consider the special case n = 2 in which $(T_1, T_2)$ follows the bivariate exponential distribution of Marshall and Olkin (cf. [121]) with parameters $\beta_1, \beta_2 > 0$ and $\beta_{12} \ge 0$. A plausible interpretation of this distribution is as follows. Three independent exponential random variables $Z_1, Z_2, Z_{12}$ with corresponding parameters $\beta_1, \beta_2, \beta_{12}$ describe the time points when a shock causes failure of component 1, of component 2, or of all intact components at the same time, respectively. Then the component lifetimes are given by $T_1 = Z_1 \wedge Z_{12}$ and $T_2 = Z_2 \wedge Z_{12}$, and the joint survival probability is seen to be

$$P(T_1 > t, T_2 > s) = \exp\{-\beta_1 t - \beta_2 s - \beta_{12}(t \vee s)\}, \quad s, t \in \mathbb{R}_+.$$

The three different patterns to distinguish are {1}, {2}, {1, 2}. Note that $T_{\{1\}} \neq T_1$; for example, $T_{\{1\}} = \infty$ on $\{T_1 = T_2\}$, i.e., on $\{Z_{12} < Z_1 \wedge Z_2\}$. Calculations then yield

$$\lambda_t(\{1\}) = \begin{cases} \beta_1 & \text{on } \{T_1 > t, T_2 > t\},\\ \beta_1 + \beta_{12} & \text{on } \{T_1 > t, T_2 \le t\},\\ 0 & \text{elsewhere,}\end{cases}$$

$\lambda_t(\{2\})$ is given by obvious index interchanges, and

$$\lambda_t(\{1, 2\}) = \begin{cases} \beta_{12} & \text{on } \{T_1 > t, T_2 > t\},\\ 0 & \text{elsewhere.}\end{cases}$$
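The shock construction also gives a direct way to simulate this distribution. The sketch below checks two things:

- the joint survival formula;
- the probability $P(T_1 = T_2) = \beta_{12}/(\beta_1 + \beta_2 + \beta_{12})$ that the pattern {1, 2} occurs. This holds because $T_1 = T_2$ exactly when $Z_{12}$ is the smallest of the three shock times.

The parameter values, time points and path count are illustrative. The sketch assumes $\beta_{12} > 0$, because `random.expovariate` rejects a zero rate.

```python
import math
import random

def marshall_olkin(beta1, beta2, beta12, rng):
    """Sample (T1, T2) via T1 = Z1 ∧ Z12, T2 = Z2 ∧ Z12 (requires beta12 > 0)."""
    z1, z2, z12 = rng.expovariate(beta1), rng.expovariate(beta2), rng.expovariate(beta12)
    return min(z1, z12), min(z2, z12)

def check(beta1=1.0, beta2=0.5, beta12=0.8, t=0.4, s=0.7, paths=50000, seed=5):
    """Empirical vs exact P(T1 > t, T2 > s) and P(T1 = T2)."""
    rng = random.Random(seed)
    joint = simultaneous = 0
    for _ in range(paths):
        t1, t2 = marshall_olkin(beta1, beta2, beta12, rng)
        joint += (t1 > t and t2 > s)
        simultaneous += (t1 == t2)        # pattern {1, 2}: both fail at the common shock Z12
    exact_joint = math.exp(-beta1 * t - beta2 * s - beta12 * max(t, s))
    exact_simultaneous = beta12 / (beta1 + beta2 + beta12)
    return joint / paths, exact_joint, simultaneous / paths, exact_simultaneous

print(check())
```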

Now we have the F-failure rate processes λ(J) at hand for each pattern J. We are interested in deriving the F-failure rate process λ of T. The next theorem shows how this process λ is composed of the single processes λ(J) on the component observation level F. Here we remind the reader of some notation introduced in Chap. 2. For $x \in \mathbb{R}^n$ and $J = \{j_1, \dots, j_r\} \subset \{1, \dots, n\}$, the vectors $(1_J, x)$ and $(0_J, x)$ denote those n-dimensional state vectors in which the components $x_{j_1}, \dots, x_{j_r}$ of x are replaced by 1s and 0s, respectively. Let D(t) be the set of components that have failed up to time t, formally

$$D(t) = \begin{cases} J_{(1)} \cup \dots \cup J_{(k)} & \text{if } T_{(k)} \le t < T_{(k+1)},\\ \emptyset & \text{if } t < T_{(1)}.\end{cases}$$

Then we define a pattern J to be critical at time $t \ge 0$ if

$$I(J \cap D(t) = \emptyset)\,(\Phi(1_J, X_t) - \Phi(0_J, X_t)) = 1$$

and denote by

$$\Gamma_\Phi(t) = \{J \subset \{1, \dots, n\} : I(J \cap D(t) = \emptyset)\,(\Phi(1_J, X_t) - \Phi(0_J, X_t)) = 1\}$$

the collection of all such patterns critical at t.

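Theorem 3.28. Let $(\lambda_t(J))$ be the F-failure rate process corresponding to $T_J$, $J \subset \{1, \dots, n\}$. Then for all $t \in \mathbb{R}_+$ on $\{T > t\}$:

$$\lambda_t = \sum_{J \subset \{1,\dots,n\}} I(J \cap D(t) = \emptyset)\,(\Phi(1_J, X_t) - \Phi(0_J, X_t))\,\lambda_t(J) = \sum_{J \in \Gamma_\Phi(t)} \lambda_t(J).$$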

Proof. By Definition 3.8, p. 62, a predictable increasing process $(A_t)$ is the compensator of the counting process $(V_t)$, $V_t = I(T \le t)$, if

$$E\int_0^\infty C_s\,dV_s = E\int_0^\infty C_s\,dA_s$$

holds true for every nonnegative F-predictable process C. Thus, we have to show that

$$E\int_0^\infty C_s\,dV_s = E\int_0^\infty C_s I(T > s) \sum_{J \in \Gamma_\Phi(s)} \lambda_s(J)\,ds \qquad (3.10)$$

for all nonnegative predictable processes C. Since $(\lambda_t(J))$ are the F-failure rate processes corresponding to $T_J$, we have for all $J \subset \{1, \dots, n\}$

$$E\int_0^\infty C_s(J)\,dV_s(J) = E\int_0^\infty C_s(J) I(T_J > s)\lambda_s(J)\,ds$$

and therefore

$$E\int_0^\infty \sum_{J \subset \{1,\dots,n\}} C_s(J)\,dV_s(J) = E\int_0^\infty \sum_{J \subset \{1,\dots,n\}} C_s(J) I(T_J > s)\lambda_s(J)\,ds \qquad (3.11)$$

holds true for all nonnegative predictable processes $(C_t(J))$. If we choose, in particular, for some nonnegative predictable process C

$$C_t(J) = C_t f_{t-},$$

where $f_{t-}$ is the left-continuous version of $f_t = I(J \in \Gamma_\Phi(t))$, we see that (3.11) reduces to (3.10), noting that under the integral sign we can replace $f_{t-}$ by $f_t$, and the proof is complete. □


Remark 3.29. (i) The proof follows the lines of Arjas (Theorem 4.1 in [6]) except for the definition of the set $\Gamma_\Phi(t)$ of critical failure patterns at time t. In [6] this set includes on $\{T > t\}$ all cut sets, whereas in our definition those cut sets J are excluded for which at time t “it is known” that $T_J = \infty$. However, this deviation is harmless because in [6] only extra zeros are added. (ii) We now have a tool that allows us to determine the failure rate process corresponding to the lifetime T of a complex system in an easy way: add at time t the failure rates of those patterns that are critical at t.
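Rule (ii) is easy to mechanize. The sketch below enumerates the critical patterns of Theorem 3.28 by brute force over all subsets J, which is feasible only for small n, and adds their rates. The structure function, the 0-based component indices and the pattern-rate function are illustrative inputs, not notation from the text. For the system of Example 3.22 with independent exponential components, pattern rates are taken as α for single components and 0 otherwise, in line with Example 3.26.

```python
from itertools import combinations

def critical_patterns(phi, x, failed):
    """Gamma_Phi(t): patterns J with J ∩ D(t) = ∅ and Phi(1_J, x) - Phi(0_J, x) = 1.

    phi    -- structure function on 0/1 tuples (1 = component works)
    x      -- the state X_t as a tuple of 0/1
    failed -- the set D(t) of components (0-based indices) failed up to t
    """
    n = len(x)
    patterns = []
    for r in range(1, n + 1):
        for J in combinations(range(n), r):
            if failed.intersection(J):
                continue
            up = tuple(1 if i in J else x[i] for i in range(n))      # (1_J, x)
            down = tuple(0 if i in J else x[i] for i in range(n))    # (0_J, x)
            if phi(up) - phi(down) == 1:
                patterns.append(frozenset(J))
    return patterns

def system_rate(phi, x, failed, pattern_rate):
    """Theorem 3.28 on {T > t}: add the rates of the patterns critical at t."""
    return sum(pattern_rate(J) for J in critical_patterns(phi, x, failed))

# Example 3.22 with 0-based indices: component 0 in series with the parallel pair {1, 2}
phi = lambda x: x[0] * max(x[1], x[2])
alpha = 0.5
# independent continuous lifetimes: only single-component patterns have positive rate
pattern_rate = lambda J: alpha if len(J) == 1 else 0.0
print(system_rate(phi, (1, 1, 1), set(), pattern_rate))   # all components work
print(system_rate(phi, (1, 0, 1), {1}, pattern_rate))     # second component has failed
```

The two calls reproduce the rates α and 2α of Example 3.22. Multi-component patterns are also critical here, but they contribute nothing because their rates vanish.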

As an immediate consequence we obtain the following corollary.

Corollary 3.30. Let $T_i$, $i = 1, \dots, n$, be independent random variables that have absolutely continuous distributions with ordinary failure rates $\lambda_t(i)$. Then the F-failure rate processes $\lambda(\{i\})$ are deterministic, $\lambda_t(\{i\}) = \lambda_t(i)$, and on $\{T > t\}$

$$\lambda_t = \sum_{i=1}^n (\Phi(1_i, X_t) - \Phi(0_i, X_t))\,\lambda_t(i) = \sum_{\{i\} \in \Gamma_\Phi(t)} \lambda_t(i), \quad t \in \mathbb{R}_+. \qquad (3.12)$$

In the case of independent component lifetimes we only have to add the ordinary failure rates of those components that are critical at t to obtain the F-failure rate of the system at time t. If we drop the independence assumption, it is not enough to require that $P(T_i = T_j) = 0$ for $i \neq j$, as the following example shows.

Example 3.31. Let $U_1, U_2$ be i.i.d. random variables from an Exp(β) distribution and let $T_1 = U_1$, $T_2 = U_1 + U_2$ be the component lifetimes of a two-component series system. Then we obviously have $P(T_1 = T_2) = 0$, but the F-failure rate of $T_{\{2\}} = T_2$ on $\{T_2 > t\}$,

$$\lambda_t(\{2\}) = \beta I(T_1 \le t),$$

is not deterministic. The system F-failure rate is seen to be

$$I(T > t)\lambda_t = I(T_1 > t)\beta.$$

To see how formula (3.12) can be used we resume Example 3.22, p. 71.

Example 3.32. Again we consider the three-component system with one component in series with a two-component parallel system, such that the lifetime of the system is given by $T = T_1 \wedge (T_2 \vee T_3)$. It is assumed that the component lifetimes $T_1, T_2, T_3$ are i.i.d. exponentially distributed with parameter α > 0. If at time t all three components work, then {1} is the only single-component pattern in $\Gamma_\Phi(t)$, and $I(T > t)\lambda_t = \alpha I(T_1 > t)$ on $\{T_2 > t, T_3 > t\}$. If one of the components 2 or 3 has failed first before time t, say component 2, then the single-component patterns in $\Gamma_\Phi(t)$ are {1} and {3}, and $I(T > t)\lambda_t = \alpha(I(T_1 > t) + I(T_3 > t))$ on $\{T_2 \le t\}$. Combining these two formulas yields the failure rate process on $\{T > t\}$

$$\lambda_t = \alpha(1 + I(T_2 \le t) + I(T_3 \le t))$$

given in Example 3.22.

Example 3.33. We now go back to the pair $(T_1, T_2)$ of random variables, which follows the bivariate exponential distribution of Marshall and Olkin with parameters $\beta_1, \beta_2 > 0$ and $\beta_{12} \ge 0$, and consider a parallel system with lifetime $T = T_1 \vee T_2$. Then on $\{T > t\}$ the collections of critical patterns are

$$\Gamma_\Phi(t) = \begin{cases} \{\{1, 2\}\} & \text{on } \{T_1 > t, T_2 > t\},\\ \{\{1\}\} & \text{on } \{T_1 > t, T_2 \le t\},\\ \{\{2\}\} & \text{on } \{T_1 \le t, T_2 > t\}.\end{cases}$$

Using the results of Example 3.27, p. 74, the F-failure rate process of the system lifetime is seen to be

$$I(T > t)\lambda_t = \beta_{12} I(T_1 > t, T_2 > t) + (\beta_1 + \beta_{12}) I(T_1 > t, T_2 \le t) + (\beta_2 + \beta_{12}) I(T_1 \le t, T_2 > t),$$

which can be reduced to

$$I(T > t)\lambda_t = \beta_{12} I(T > t) + \beta_1 I(T_1 > t, T_2 \le t) + \beta_2 I(T_1 \le t, T_2 > t).$$

3.2.3 Monotone Failure Rate Processes

We have investigated under which conditions failure rate processes exist and how they can be determined explicitly for complex systems. In reliability, it plays an important role whether failure rates are increasing or decreasing, so it is natural to extend such monotonicity properties to F-failure rates in the following way.

Definition 3.34. Let an F-SSM representation (3.8) hold true for the positive random variable T with failure rate process λ. Then λ is called F-increasing (F-IFR, increasing failure rate) or F-decreasing (F-DFR, decreasing failure rate) if λ has P-a.s. nondecreasing or nonincreasing paths, respectively, for $t \in [0, T)$.

Remark 3.35. (i) Clearly, monotonicity properties of λ are only of importance on the random interval $[0, T)$. On $[T, \infty)$ we can specify λ arbitrarily. (ii) In the case of complex systems the above definition reflects both the information level F and the structure function Φ. An alternative definition, which is derived from notions of multivariate aging terms, is given by Arjas [5]; see also Shaked and Shanthikumar [140].

In the case of a complex system with independent component lifetimes, the following closure property can be established.

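Proposition 3.36. Assume that in a monotone system the component lifetimes $T_i$, $i = 1, \dots, n$, are independent random variables with absolutely continuous distributions and nondecreasing ordinary failure rates $\lambda_t(i)$, and let F be the filtration on the component level. Then the F-failure rate process λ corresponding to the system lifetime T is F-IFR.

Proof. Under the assumptions of the proposition, patterns with two or more components have failure rate 0 (cf. Example 3.26), so only single-component patterns contribute to λ. Since the system is monotone, a component i that is critical at some time s remains critical at all later times $t \in [s, T)$. Further component failures can only lower $\Phi(0_i, X_t)$, whereas $\Phi(1_i, X_t) = \Phi(X_t) = 1$ as long as the system works. Moreover, a critical component cannot fail before T, since its failure would cause a system failure. Hence the set of critical single-component patterns $\{i\} \in \Gamma_\Phi(t)$ is nondecreasing in t on $[0, T)$. So from (3.12), p. 76, it can be seen that if all component failure rates are nondecreasing, the F-failure rate process λ is also nondecreasing for $t \in [0, T)$. □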


Such a closure theorem does not hold true for the ordinary failure rate of the lifetime T, as can be seen from simple counterexamples (see Sect. 2.2.1 or [32], p. 83). From the proof of Proposition 3.36 it is evident that we cannot draw an analogous conclusion for decreasing failure rates: each component that becomes critical adds a new term to the sum in (3.12), which can outweigh any decrease of the individual rates.
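A well-known counterexample of this kind is a two-component parallel system with independent Exp(a) and Exp(b) components, a ≠ b. Each component has a constant, hence nondecreasing, failure rate, and the component-level rate from (3.12) can only jump upward. Yet the ordinary failure rate of $T = T_1 \vee T_2$ starts at 0, rises above min(a, b) and then decreases back towards it. The sketch below shows this for the assumed rates a = 1 and b = 3.

```python
import math

def parallel_hazard(a, b, t):
    """Ordinary failure rate of T = T1 ∨ T2 with independent Exp(a), Exp(b) components."""
    ea, eb, eab = math.exp(-a * t), math.exp(-b * t), math.exp(-(a + b) * t)
    survival = ea + eb - eab
    density = a * ea + b * eb - (a + b) * eab
    return density / survival

def component_level_rate(a, b, t1_failed, t2_failed):
    """F-failure rate from (3.12) on {T > t}: a component is critical once the other has failed."""
    return a * t2_failed + b * t1_failed

print([round(parallel_hazard(1.0, 3.0, t), 4) for t in (0.0, 0.5, 1.0, 2.0, 10.0)])
```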

3.2.4 Change of Information Level

One of the advantages of the semimartingale technique is the possibility of studying the random evolution of a stochastic process on different information levels. This was described in general in Sect. 3.1.2 by the projection theorem, which says in which way an SSM representation changes when changing the filtration from F to a subfiltration A. This projection theorem can be applied to the lifetime indicator process

$$V_t = I(T \le t) = \int_0^t I(T > s)\lambda_s\,ds + M_t. \qquad (3.13)$$

If the lifetime can be observed, i.e., $\{T \le s\} \in \mathcal{A}_s$ for all $0 \le s \le t$, then the change of the information level from F to A leads from (3.13) to the representation

$$\hat V_t = E[I(T \le t) \mid \mathcal{A}_t] = I(T \le t) = \int_0^t I(T > s)\hat\lambda_s\,ds + \bar M_t, \qquad (3.14)$$

where $\hat\lambda_t = E[\lambda_t \mid \mathcal{A}_t]$. The indicator $I(T > s)$ can be taken out of the conditional expectation because $\{T > s\} \in \mathcal{A}_s$. Note that, in general, this formula only holds for almost all $t \in \mathbb{R}_+$. In all our examples we can find A-progressive versions of the conditional expectations. The projection theorem shows that, under some mild technical conditions, the failure rate on a lower information level can be obtained merely by forming conditional expectations.
Remark 3.37. Unfortunately, monotonicity properties are in general not preserved when changing the observation level. As was noted above (see Proposition 3.36), if all components of a monotone system have independent lifetimes with increasing failure rates, then T is F-IFR on the component observation level. But switching to a subfiltration A may lead to a nonmonotone failure rate process $\hat\lambda$.
The following example illustrates the role of partial information.

Example 3.38. Consider a two-component parallel system with i.i.d. random variables $T_i$, $i = 1, 2$, describing the component lifetimes, which follow an exponential distribution with parameter α > 0. Then the system lifetime is $T = T_1 \vee T_2$ and the complete information filtration is given by

$$\mathcal{F}_t = \sigma(I(T_1 > s), I(T_2 > s), 0 \le s \le t).$$

In this case the F-semimartingale representation (3.13) is given by

$$I(T \le t) = \int_0^t I(T > s)\,\alpha\{I(T_1 \le s) + I(T_2 \le s)\}\,ds + M_t = \int_0^t I(T > s)\lambda_s\,ds + M_t.$$

Now several subfiltrations can describe different lower information levels, where it is assumed that the system lifetime T can be observed on all observation levels. Examples of partial information and the formal description via subfiltrations A and A-failure rates are as follows:

a) Information about T until h, complete information after h:

$$\mathcal{A}^a_t = \begin{cases} \sigma(I(T \le s), 0 \le s \le t) & \text{for } 0 \le t < h,\\ \mathcal{F}_t & \text{for } t \ge h,\end{cases}$$

$$\hat\lambda^a_t = \begin{cases} 2\alpha\left(1 - (2 - e^{-\alpha t})^{-1}\right) & \text{for } 0 \le t < h,\\ \lambda_t & \text{for } t \ge h.\end{cases}$$

b) Information about component lifetime $T_1$ and T:

$$\mathcal{A}^b_t = \sigma(I(T \le s), I(T_1 \le s), 0 \le s \le t),$$

$$\hat\lambda^b_t = \alpha\left(I(T_1 \le t) + I(T_1 > t)P(T_2 \le t)\right).$$

c) Information about T only:

$$\mathcal{A}^c_t = \sigma(I(T \le s), 0 \le s \le t),$$

$$\hat\lambda^c_t = 2\alpha\left(1 - (2 - e^{-\alpha t})^{-1}\right).$$
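The rates in cases b) and c) are linked by successive conditioning, since $\mathcal{A}^c_t \subset \mathcal{A}^b_t$. On $\{T > t\}$, averaging $\hat\lambda^b_t$ over the unobserved state of component 1, given only that the system still works, should return $\hat\lambda^c_t$. The sketch below checks this identity numerically. The function names and the value α = 0.8 are illustrative choices.

```python
import math

def rate_c(alpha, t):
    """A^c-failure rate on {T > t}: only T is observed."""
    return 2 * alpha * (1 - 1 / (2 - math.exp(-alpha * t)))

def rate_b(alpha, t, t1_failed):
    """A^b-failure rate on {T > t}: T1 and T are observed."""
    return alpha * (1.0 if t1_failed else 1 - math.exp(-alpha * t))   # P(T2 <= t) if T1 > t

def averaged_rate_b(alpha, t):
    """E[rate_b | T > t]: average over the state of component 1 given only {T > t}."""
    q = math.exp(-alpha * t)                 # P(T_i > t)
    surv = 1 - (1 - q) ** 2                  # P(T > t) for the parallel system
    p_t1_failed = (1 - q) * q / surv         # P(T1 <= t < T2) / P(T > t)
    return p_t1_failed * rate_b(alpha, t, True) + (1 - p_t1_failed) * rate_b(alpha, t, False)

print(averaged_rate_b(0.8, 1.0), rate_c(0.8, 1.0))
```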
The failure rate corresponding to $A^c$ in this example is the standard deterministic failure rate, because $\{T > t\}$ is an atom of $\mathcal{A}^c_t$: every subset of $\{T > t\}$ belonging to $\mathcal{A}^c_t$ has probability 0 or $P(T > t)$. Therefore $\hat\lambda^c$ can always be chosen to be deterministic on $\{T > t\}$. This corresponds to our intuition, because on this information level we cannot observe any other random event before T. Example 3.21 shows that such deterministic failure rates satisfy the well-known exponential formula (3.9), p. 71.

An interesting question to ask is then: under what conditions does such an exponential formula also extend to random failure rate processes? This question is referred to briefly in [4] and answered to some extent in [165]. The following treatment differs slightly in that the starting point is the basic lifetime model of this section. The failure rate process λ is assumed to be observable on some level A, i.e., λ is adapted to that filtration. This observation level can be somewhere between the trivial filtration $G = (\mathcal{G}_t)$, $t \in \mathbb{R}_+$, $\mathcal{G}_t = \{\emptyset, \Omega\}$, which does not allow for any random information, and the basic complete information filtration F. So T itself need not be observable at level A (and should not be, if we want to arrive at an exponential formula). Using the projection theorem we obtain

$$E[I(T \le t) \mid \mathcal{A}_t] = 1 - \bar F_t = \int_0^t \bar F_s \lambda_s\,ds + \bar M_t, \qquad (3.15)$$

where $\bar F$ denotes the conditional survival probability,

$$\bar F_t = E[I(T > t) \mid \mathcal{A}_t] = P(T > t \mid \mathcal{A}_t),$$

and $\bar M$ is an A-martingale. In general, $\bar F$ need not be monotone and can be rather irregular. But suppose $\bar F$ has continuous paths of bounded variation. Then, by (3.15), so does the martingale $\bar M$, which is therefore identically 0, and the solution of the resulting integral equation is

$$\bar F_t = \exp\left(-\int_0^t \lambda_s\,ds\right), \qquad (3.16)$$

which is a generalization of formula (3.9). If A is the trivial filtration G, then (3.16) coincides with (3.9). For (3.16) to hold, it is necessary that the observation of λ and other events on level A have only a “smooth” influence on the conditional survival probability.
Remark 3.39. This is a more technical remark to show how one can proceed if $\bar F$ is not continuous. Let $(\bar F_{t-})$, $t \in \mathbb{R}_+$, be the left-continuous version of $\bar F$. Equation (3.15) can be rewritten as

$$\bar F_t = 1 - \int_0^t \bar F_{s-}\lambda_s\,ds - \bar M_t.$$

Under mild conditions an A-martingale L can be found such that $\bar M$ can be represented as the (stochastic) integral $\bar M_t = \int_0^t \bar F_{s-}\,dL_s$; take

$$L_t = \int_0^t \frac{I(\bar F_{s-} > 0)}{\bar F_{s-}}\,d\bar M_s.$$

With the semimartingale Z, $Z_t = -\int_0^t \lambda_s\,ds - L_t$, (3.15) becomes

$$\bar F_t = 1 + \int_0^t \bar F_{s-}\,dZ_s.$$

If Z is of locally finite variation, then the unique solution of this integral equation is given by the so-called Doléans exponential (see [101], p. 440)

$$\bar F_t = \mathcal{E}(Z)_t = \exp\{Z^c_t\}\prod_{0<s\le t}(1 + \Delta Z_s) = \exp\left(-\int_0^t \lambda_s\,ds\right)\exp\{-L^c_t\}\prod_{0<s\le t}(1 - \Delta L_s),$$

where $Z^c$ ($L^c$) denotes the continuous part of Z (L) and $\Delta Z_s = Z_s - Z_{s-}$ ($\Delta L_s = L_s - L_{s-}$) denotes the jump height at s. This extended exponential formula shows that possible jumps of the conditional survival probability are not caused by jumps of the failure rate process but by (unpredictable) jumps of the martingale part.
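The product formula can be checked on a single deterministic path of finite variation. In the sketch below, the constant drift −0.7 stands for the continuous part $Z^c_t = -\int_0^t \lambda_s\,ds - L^c_t$, and the two jumps stand for $\Delta Z_s = -\Delta L_s$; all numbers are hypothetical. A left-point Riemann–Stieltjes scheme for $\bar F_t = 1 + \int_0^t \bar F_{s-}\,dZ_s$ is compared with the closed form. The jumps of the solution come only from the factors $1 + \Delta Z_s$.

```python
import math

def doleans_exponential(drift, jumps, t):
    """exp(Z^c_t) * prod_{0 < s <= t} (1 + dZ_s) for Z_t = drift * t + sum of the jumps
    dZ_s at the times s in `jumps` (a path of finite variation)."""
    prod = 1.0
    for s, dz in jumps.items():
        if 0 < s <= t:
            prod *= 1 + dz
    return math.exp(drift * t) * prod

def solve_stieltjes(drift, jumps, t, steps):
    """Left-point Riemann-Stieltjes scheme for F_t = 1 + int_0^t F_{s-} dZ_s."""
    def z(u):
        return drift * u + sum(dz for s, dz in jumps.items() if 0 < s <= u)
    h = t / steps
    f, z_prev = 1.0, z(0.0)
    for k in range(1, steps + 1):
        z_next = z(k * h)
        f += f * (z_next - z_prev)   # F at the left endpoint times the increment of Z
        z_prev = z_next
    return f

jumps = {0.4: -0.2, 1.1: 0.1}        # hypothetical jumps dZ_s = -dL_s
print(doleans_exponential(-0.7, jumps, 1.5), solve_stieltjes(-0.7, jumps, 1.5, 20000))
```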

3.3 Point Processes in Reliability: Failure Time and Repair Models
A number of models in reliability are described by point processes and their corresponding counting processes. One example is a shock model, in which shocks affecting a technical system arrive at random time points $T_n$ according to a point process and cause some damage of random amount $V_n$. Another is a repair model, in which failures occur at random time points $T_n$ and cause random repair costs $V_n$. In both cases the sequence $(T_n, V_n)$ is a multivariate or marked point process, which is introduced as follows.

Definition 3.40. Let $(T_n)$, $n \in \mathbb{N}$, be a point process and $(V_n)$, $n \in \mathbb{N}$, a sequence of random variables taking values in a measurable space $(S, \mathcal{S})$. Then a marked point process $(T_n, V_n)$, $n \in \mathbb{N}$, is the ordered sequence of time points $T_n$ and marks $V_n$ associated with the time points, and $(S, \mathcal{S})$ is called the mark space.
The mark $V_n$ describes the event occurring at time $T_n$, for example the magnitude of the shock arriving at a system at time $T_n$ (see Fig. 3.1).

[Fig. 3.1. Marked point process: marks V1, V2, V3 (vertical axis S) at the time points T1, T2, T3 (horizontal axis t)]

For each $A \in \mathcal{S}$ we associate the counting process $(N_t(A))$, $t \in \mathbb{R}_+$,

$$N_t(A) = \sum_{n=1}^{\infty} I(V_n \in A)\,I(T_n \le t),$$

which counts the number of marked points up to time t with marks in A. This family of counting processes N carries the same information as the sequence $(T_n, V_n)$ and is therefore an equivalent description of the marked point process.

Example 3.41. A point process $(T_n)$ can be viewed as a marked point process for which $S$ consists of a single point. Another link between point and marked point processes is given by the counting process $N = (N_t)$, $N_t = N_t(S)$, which corresponds to the sequence $(T_n)$.

Example 3.42 (Alternating Renewal Process). Consider a system which is repaired or replaced after failure (models of this kind are treated in detail in Sect. 4.2). Let $U_k$ represent the length of the $k$th operation period and $R_k$ the length of the $k$th repair/replacement time. Assume that $(U_k)$ and $(R_k)$, $k \in \mathbb{N}$, are independent i.i.d. sequences of positive random variables. Let the mark space be $S = \{0, 1\}$, where 0 and 1 stand for "failure" and "repair/replacement completed", respectively. Then the random time points $T_n$ are

$$
T_n = \sum_{k=1}^{[(n+1)/2]} U_k + \sum_{k=1}^{[n/2]} R_k, \quad n = 1, 2, \ldots,
$$

where $[a]$ denotes the integer part of $a$. The mark sequence is deterministic and alternates between 0 and 1:

$$
V_n = \tfrac{1}{2}\big(1 + (-1)^{n}\big),
$$

so that $V_1 = 0$ (first failure), $V_2 = 1$ (first repair completed), and so on. We see that $N_t(\{0\})$ counts the number of failures and $N_t(\{1\})$ the number of completed repairs up to time $t$.
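The closed-form expression for $T_n$ can be checked against the obvious construction that alternately adds an operating period and a repair time. The following Python sketch does this; the sample values and the helper names (`event_times_closed_form`, `event_times_by_alternation`, `count_events`) are illustrative assumptions, and events are labelled by name rather than by mark value.

```python
import random


def event_times_closed_form(U, R, n):
    """T_n = U_1 + ... + U_[(n+1)/2] + R_1 + ... + R_[n/2] (Python lists are 0-based)."""
    return sum(U[: (n + 1) // 2]) + sum(R[: n // 2])


def event_times_by_alternation(U, R):
    """Build T_1, T_2, ... by alternately adding an operating period and a repair time."""
    times, kinds, clock = [], [], 0.0
    for u, r in zip(U, R):
        clock += u
        times.append(clock)
        kinds.append("failure")           # odd n: end of an operating period
        clock += r
        times.append(clock)
        kinds.append("repair completed")  # even n: end of a repair period
    return times, kinds


def count_events(times, kinds, t, kind):
    """N_t(A) for the mark set A = {kind}: events of the given kind with T_n <= t."""
    return sum(1 for s, k in zip(times, kinds) if s <= t and k == kind)


rng = random.Random(1)
U = [rng.expovariate(1.0) for _ in range(5)]  # hypothetical Exp(1) operating periods
R = [rng.expovariate(4.0) for _ in range(5)]  # hypothetical Exp(4) repair times
times, kinds = event_times_by_alternation(U, R)
for n, t_n in enumerate(times, start=1):
    assert abs(t_n - event_times_closed_form(U, R, n)) < 1e-9
print(count_events(times, kinds, times[-1], "failure"))  # 5 failures in total
```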

We now want to extend the concept of stochastic intensities from point processes to marked point processes. The internal filtration $\mathbb{F}^N$ of $(T_n, V_n)$ is defined by

$$
\mathcal{F}_t^N = \sigma\big(N_s(A),\ 0 \le s \le t,\ A \in \mathcal{S}\big).
$$

This filtration is equivalently generated by the history $\{(T_n, V_n) : T_n \le t\}$ of the marked point process.

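Definition 3.43. Let $\mathbb{F}$ be some filtration including $\mathbb{F}^N$: $\mathcal{F}_t^N \subset \mathcal{F}_t$, $t \in \mathbb{R}_+$. A stochastic process $(\lambda_t(A),\ t \in \mathbb{R}_+,\ A \in \mathcal{S})$ is called the stochastic intensity of the marked point process $N$ if (i) for each $t$, $A \mapsto \lambda_t(A)$ is a random measure on $\mathcal{S}$; (ii) for each $A \in \mathcal{S}$, $N_t(A)$ admits the $\mathbb{F}$-intensity $\lambda_t(A)$.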

We can now formulate the extension of Theorem 3.12, p. 64, to marked point processes (cf. [50], p. 238, [92, 115], p. 22).

Theorem 3.44. Let $N$ be an integrable marked point process and $\mathbb{F}^N$ its internal filtration. Suppose that for each $n$ there exists a regular conditional distribution of $(U_{n+1}, V_{n+1})$, $U_{n+1} = T_{n+1} - T_n$, given the past $\mathcal{F}^N_{T_n}$, of the form

$$
G_n(\omega, A, B) = P\big(U_{n+1} \in A,\ V_{n+1} \in B \mid \mathcal{F}^N_{T_n}\big)(\omega) = \int_A g_n(\omega, s, B)\, ds,
$$

where $g_n(\omega, s, B)$ is, for fixed $B$, a measurable function and, for fixed $(\omega, s)$, a finite measure on $(S, \mathcal{S})$. Then the process given by

$$
\lambda_t(C) = \frac{g_n(t - T_n, C)}{G_n([t - T_n, \infty), S)} = \frac{g_n(t - T_n, C)}{1 - \int_0^{t - T_n} g_n(s, S)\, ds}
$$

on $(T_n, T_{n+1}]$ is a stochastic intensity of $N$, and for each $C \in \mathcal{S}$,

$$
N_t(C) - \int_0^t \lambda_s(C)\, ds
$$

is an $\mathbb{F}^N$-martingale.
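For a concrete instance of the intensity formula, suppose (as a hypothetical example, not taken from the text) that, given $\mathcal{F}^N_{T_n}$, the interarrival time $U_{n+1}$ is Weibull distributed with shape 2 and scale 1, and that the mark $V_{n+1}$ is independent of $U_{n+1}$ and takes the values "minor" and "major" with probabilities 0.7 and 0.3. Then $g_n(s, C) = f(s)\, P(V_{n+1} \in C)$, and the theorem should return the hazard rate $r(u) = 2u$ of the interarrival time multiplied by the mark probability. The Python sketch below evaluates the quotient in Theorem 3.44, using a trapezoidal rule for the integral in the denominator, and compares it with that product.

```python
import math

SHAPE, SCALE = 2.0, 1.0                     # hypothetical Weibull interarrival law
MARK_PROBS = {"minor": 0.7, "major": 0.3}   # hypothetical mark distribution
S = set(MARK_PROBS)                         # the whole mark space


def weibull_density(s):
    return (SHAPE / SCALE) * (s / SCALE) ** (SHAPE - 1) * math.exp(-((s / SCALE) ** SHAPE))


def g(s, C):
    """Joint density g_n(s, C) = f(s) * P(V in C) (marks independent of U here)."""
    return weibull_density(s) * sum(MARK_PROBS[m] for m in C)


def intensity(u, C, steps=20_000):
    """lambda_t(C) from Theorem 3.44 with u = t - T_n; denominator by the trapezoidal rule."""
    h = u / steps
    integral = sum(0.5 * h * (g(k * h, S) + g((k + 1) * h, S)) for k in range(steps))
    return g(u, C) / (1.0 - integral)


def weibull_hazard(u):
    return (SHAPE / SCALE) * (u / SCALE) ** (SHAPE - 1)


u = 0.8
print(intensity(u, {"major"}), weibull_hazard(u) * MARK_PROBS["major"])  # both close to 0.48
```

Because the marks are independent of the interarrival times in this example, the mark probability simply factors out of the quotient; with dependent marks, $g_n(s, C)$ would have to be specified jointly.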

To find the SSM representation of a stochastic process that is derived from a marked point process, we can make use of the intensity of the latter. The following theorem is proved in Brémaud [50], p. 235. For the formulation of this result it is more convenient to use a slightly different notation for the process $N_t(C)$, namely,

$$
N_t(C) = N(t, C) = \sum_{n=1}^{\infty} I(V_n \in C)\, I(T_n \le t).
$$

Theorem 3.45. Let $(N(t, C))$, $t \in \mathbb{R}_+$, $C \in \mathcal{S}$, be an integrable marked point process admitting the intensity $\lambda_t(C)$ with respect to some filtration $\mathbb{F}$. Let $H(t, z)$ be an $\mathcal{S}$-marked $\mathbb{F}$-predictable process such that, for all $t \in \mathbb{R}_+$,

$$
E \int_0^t \int_S |H(s, z)|\, \lambda_s(dz)\, ds < \infty.
$$

Then, defining $M(ds, dz) = N(ds, dz) - \lambda_s(dz)\, ds$,

$$
\int_0^t \int_S H(s, z)\, M(ds, dz)
$$

is an $\mathbb{F}$-martingale.

In the following subsections we consider some examples and particular cases. As was mentioned in Example 3.41, a point process $(T_n)$ and its associated counting process $(N_t)$ are special cases of marked point processes. Point process models in our SSM set-up require the assumption that the counting process $(N_t)$, $t \in \mathbb{R}_+$, on a filtered probability space $(\Omega, \mathcal{F}, \mathbb{F}, P)$ has an absolutely continuous compensator or, what amounts to the same, admits an $\mathbb{F}$-SSM representation

$$
N_t = \int_0^t \lambda_s\, ds + M_t. \tag{3.17}
$$

This point process model is consistent with the general lifetime model considered in Sect. 3.2. If the process $N$ is stopped at $T_1$, then (3.17) reduces to (3.13):

$$
N_{t \wedge T_1} = I(T_1 \le t) = \int_0^{t \wedge T_1} \lambda_s\, ds + M_{t \wedge T_1} = \int_0^t I(T_1 > s)\, \lambda_s\, ds + M'_t,
$$

where $M'$ is the stopped martingale $M$, $M'_t = M_{t \wedge T_1}$. The time to first failure or shock corresponds to the lifetime $T = T_1$.


In general, N is determined by its compensator or by its intensity λ, and
it is possible to construct a point process N (and a corresponding probability
measure) from a given intensity λ (these problems are considered in some
detail in [92], see also [115], Chap. 8). This allows us to define point process
models in reliability by considering a given intensity.

3.3.1 Alternating Renewal Processes: One-Component Systems with Repair

We resume Example 3.42, p. 82, and assume that the operating times $U_k$ follow a distribution $F$ with density $f$ and failure rate $\rho(t) = f(t)/\bar F(t)$, whereas the repair times follow a distribution $G$ with density $g$ and hazard rate $\eta(t) = g(t)/\bar G(t)$. Note that the failure/hazard rate is always set to 0 outside the support of the distribution. Then $N_t(\{0\})$ counts the number of failures up to time $t$, with intensity $\lambda_t(\{0\}) = \rho(t - T_n) X(t)$ on $(T_n, T_{n+1}]$, where $X(t) = V_n$ on $(T_n, T_{n+1}]$ (with $T_0 = 0$, $V_0 = 1$) indicates whether the system is up ($X(t) = 1$) or down ($X(t) = 0$) at $t$. The corresponding internal intensity for $N_t(\{1\})$ is $\lambda_t(\{1\}) = \eta(t - T_n)(1 - X(t))$. If the operating times are exponentially distributed with rate $\rho > 0$, the expected number of failures up to time $t$ is given by

$$
E N_t(\{0\}) = \rho \int_0^t E X(s)\, ds.
$$
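The last formula can be checked by simulation. The Python sketch below additionally assumes (this is not required by the text) that the repair times are exponential with rate $\mu$; then $EX(s)$ is the point availability $\mu/(\rho+\mu) + \rho/(\rho+\mu)\, e^{-(\rho+\mu)s}$ of a system that starts in the up state, and the formula becomes $EN_t(\{0\}) = \rho\big[\mu t/(\rho+\mu) + \rho\,(1 - e^{-(\rho+\mu)t})/(\rho+\mu)^2\big]$. The parameter values are hypothetical.

```python
import math
import random


def simulate_failures(rho, mu, t, rng):
    """Number of failures in [0, t] of one alternating renewal path (starts up at 0)."""
    clock, failures = 0.0, 0
    while True:
        clock += rng.expovariate(rho)  # operating period ~ Exp(rho)
        if clock > t:
            return failures
        failures += 1
        clock += rng.expovariate(mu)   # repair period ~ Exp(mu) (assumption)
        if clock > t:
            return failures


def expected_failures(rho, mu, t):
    """rho * int_0^t EX(s) ds with the exponential availability function."""
    s = rho + mu
    integrated_availability = mu * t / s + rho / s**2 * (1.0 - math.exp(-s * t))
    return rho * integrated_availability


rng = random.Random(2013)
rho, mu, t, reps = 1.0, 4.0, 3.0, 50_000
estimate = sum(simulate_failures(rho, mu, t, rng) for _ in range(reps)) / reps
print(estimate, expected_failures(rho, mu, t))  # both approximately 2.44
```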

3.3.2 Number of System Failures for Monotone Systems

We now consider a monotone system comprising $m$ independent components. For each component we define an alternating renewal process, indexed by "$i$". The operating and repair times $U_{ik}$ and $R_{ik}$, respectively, are independent i.i.d. sequences with distributions $F_i$ and $G_i$. We assume that the up-time distributions $F_i$ are absolutely continuous with failure rates $\lambda_t(i)$. The point process $(T_n)$ is the superposition of the $m$ independent alternating renewal processes $(T_{in})$, $i = 1, \ldots, m$, and the associated counting process is merely the sum of the single counting processes. Since we are now only interested in the occurrence of failures, we denote by $N_t(i)$ the number of failures of component $i$ (omitting the argument $\{0\}$) and the total number of component failures by $N_t = \sum_{i=1}^m N_t(i)$. The time $T_n$ records the occurrence of a component failure or the completion of a repair. As in Chap. 2, $\Phi : A \to \{0, 1\}$ is the structure function, where $A = \{0, 1\}^m$, and the process $X_t = (X_t(1), \ldots, X_t(m))$ denotes the vector of component states at time $t$ with values in $A$. The mark space is $S = A \times A$, and the value of $V_n = (X_{T_n-}, X_{T_n})$ describes the change of the component states occurring at time $T_n$, where we set $V_0 = ((1, \ldots, 1), (1, \ldots, 1))$, i.e., we start with intact components at $T_0 = 0$. Note that $V_n = (x, y)$ means that $y = (0_i, x)$ or $y = (1_i, x)$ for some $i \in \{1, \ldots, m\}$, because the up-time distributions are absolutely continuous, so that at time $T_n$ only one component changes its status. Combining Corollary 3.30, p. 76, and Theorem 3.44, p. 83, we get the following result.
Corollary 3.46. Let $\Gamma = \{(x, y) \in S : \Phi(x) = 1,\ \Phi(y) = 0,\ y = (0_j, x) \text{ for some } j \in \{1, \ldots, m\}\}$ be the set of marks indicating a system failure. Then the process

$$
N_t(\Gamma) = \sum_{i=1}^m \int_0^t \{\Phi(1_i, X_s) - \Phi(0_i, X_s)\}\, dN_s(i),
$$

counting the number of system failures up to time $t$, admits the intensity

$$
\lambda_t(\Gamma) = \sum_{i=1}^m \{\Phi(1_i, X_t) - \Phi(0_i, X_t)\}\, \rho_t(i)\, X_t(i)
$$

with respect to the internal filtration, where

$$
\rho_t(i) = \sum_{k=0}^{\infty} \lambda_{t - T_{ik}}(i)\, I(T_{ik} < t \le T_{i,k+1}).
$$

Proof. We know that $\rho_t(i) X_t(i)$ are intensities of $N_t(i)$ and thus

$$
M_t(i) = N_t(i) - \int_0^t \rho_s(i) X_s(i)\, ds
$$

defines a martingale (also with respect to the internal filtration of the superposition because of the independence of the component processes). Define

$$
\Delta\Phi_t(i) = \Phi(1_i, X_t) - \Phi(0_i, X_t)
$$

and let $\Delta\Phi_{t-}(i)$ be the left-continuous and therefore predictable version of this process. Since at a jump of $N_t(i)$ no other components change their status ($P$-a.s.), we have

$$
\int_0^t \Delta\Phi_s(i)\, dN_s(i) = \int_0^t \Delta\Phi_{s-}(i)\, dN_s(i).
$$

It follows that

$$
N_t(\Gamma) - \int_0^t \lambda_s(\Gamma)\, ds = \int_0^t \sum_{i=1}^m \Delta\Phi_s(i)\, dM_s(i) = \int_0^t \sum_{i=1}^m \Delta\Phi_{s-}(i)\, dM_s(i).
$$

But the last integral is the sum of integrals of bounded, predictable processes and so, by Theorem 3.45, is a martingale, which proves the assertion. □


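To determine the expected number of system failures up to time $t$, we observe that $E M_t(i) = 0$, i.e., $E N_t(i) = \int_0^t m_s(i)\, ds$ with $m_s(i) = E[\rho_s(i) X_s(i)]$, and that $\Delta\Phi_t(i)$ and $\rho_t(i) X_t(i)$ are stochastically independent (the former depends only on the states of the other components). This results in

$$
E N_t(\Gamma) = \int_0^t \sum_{i=1}^m E[\Delta\Phi_s(i)]\, m_s(i)\, ds. \tag{3.18}
$$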
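The integrand of Corollary 3.46 is easy to evaluate once the structure function is given. The Python sketch below computes $\Delta\Phi_t(i)$ (equal to 1 exactly when component $i$ is critical) and $\lambda_t(\Gamma)$ for a frozen state vector $X_t$ and given current rates $\rho_t(i)$; the 2-out-of-3 structure and the rate values are hypothetical examples.

```python
def set_component(x, i, value):
    """Return the state vector (value_i, x)."""
    y = list(x)
    y[i] = value
    return tuple(y)


def delta_phi(phi, x, i):
    """Delta Phi(i) = Phi(1_i, x) - Phi(0_i, x): 1 if component i is critical in state x."""
    return phi(set_component(x, i, 1)) - phi(set_component(x, i, 0))


def system_failure_intensity(phi, x, rates):
    """lambda_t(Gamma) = sum_i Delta Phi_t(i) * rho_t(i) * X_t(i)."""
    return sum(delta_phi(phi, x, i) * rates[i] * x[i] for i in range(len(x)))


def two_out_of_three(x):
    return int(sum(x) >= 2)


rates = (0.1, 0.2, 0.3)  # hypothetical current values rho_t(i)
print(system_failure_intensity(two_out_of_three, (1, 1, 1), rates))  # 0.0 (no component is critical)
print(system_failure_intensity(two_out_of_three, (1, 1, 0), rates))  # about 0.3 = rho(1) + rho(2)
```

Any monotone structure function written as a Python callable on 0/1 tuples can be passed in; for instance `min` gives a series system and `max` a parallel system.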

3.3.3 Compound Point Process: Shock Models

Let us now assume that a system is exposed to shocks at random times $(T_n)$. A shock occurring at $T_n$ causes a random amount of damage $V_n$, and these damages accumulate. The marked point process $(T_n, V_n)$ with mark space $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ describes this shock process. To avoid notational difficulties we write in this subsection $N(t, C)$ for the associated counting processes, describing the number of shocks up to time $t$ with amounts in $C$. We are interested in the so-called compound point process

$$
X_t = \sum_{n=1}^{N(t)} V_n
$$

with $N(t) = N(t, \mathbb{R})$, which gives the total damage up to $t$, and we want to derive the infinitesimal characteristics or the "intensity" of this process, i.e., to establish an SSM representation. We might also think of repair models in which failures occur at random time points $T_n$. Upon failure, repair is performed. If the cost of the $n$th repair is $V_n$, then $X_t$ describes the accumulated costs up to time $t$.
To derive an SSM representation of $X$, we first assume that we are given a general intensity $\lambda_t(C)$ of the marked point process with respect to some filtration $\mathbb{F}$. The main point now is to observe that

$$
X_t = \int_0^t \int_S z\, N(ds, dz).
$$

Then we can use Theorem 3.45, p. 84, with the predictable process $H(s, z) = z$ to see that

$$
M_t^{\mathbb{F}} = \int_0^t \int_S z\, \big(N(ds, dz) - \lambda_s(dz)\, ds\big)
$$

is a martingale if $E \int_0^t \int_S |z|\, \lambda_s(dz)\, ds < \infty$. Equivalently, $X$ has the $\mathbb{F}$-SSM representation $X = (f, M^{\mathbb{F}})$, with

$$
f_s = \int_S z\, \lambda_s(dz).
$$

To come to a more explicit representation we make the following assumptions (A):
• The filtration is the internal one, $\mathbb{F}^N$;
• $U_{n+1} = T_{n+1} - T_n$ is independent of $\mathcal{F}^N_{T_n} \vee \sigma(V_{n+1})$;
• $U_{n+1}$ has an absolutely continuous distribution with density $g_n(t)$ and (ordinary) failure or hazard rate $r_n(t)$;
• $V_{n+1}$ is a positive random variable, independent of $\mathcal{F}^N_{T_n}$, with finite mean $E V_{n+1}$.

Under these assumptions we get by Theorem 3.44, p. 83,

$$
\lambda_t(C) = \sum_{n=0}^{\infty} r_n(t - T_n)\, P(V_{n+1} \in C)\, I(T_n < t \le T_{n+1})
$$

and therefore the SSM representation

$$
X_t = \int_0^t \sum_{n=0}^{\infty} E[V_{n+1}]\, r_n(s - T_n)\, I(T_n < s \le T_{n+1})\, ds + M_t^{\mathbb{F}^N}.
$$

In the case of constant expectations $E V_n = E V_1$ we have

$$
f_s = E[V_1]\, \lambda_s(\mathbb{R}).
$$
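In the compound Poisson case (shock rate $\nu$, i.i.d. amounts) the drift is $f_s = E[V_1]\,\nu$, so taking expectations in the SSM representation gives $EX_t = \nu t\, E[V_1]$. The Python sketch below checks this by simulation; the exponential shock amounts and all parameter values are hypothetical choices.

```python
import random


def total_damage(nu, mean_damage, t, rng):
    """X_t for Poisson(nu) shock times and exponential shock amounts with the given mean."""
    clock, damage = 0.0, 0.0
    while True:
        clock += rng.expovariate(nu)
        if clock > t:
            return damage
        damage += rng.expovariate(1.0 / mean_damage)


rng = random.Random(7)
nu, mean_damage, t, reps = 2.0, 0.5, 5.0, 40_000
estimate = sum(total_damage(nu, mean_damage, t, rng) for _ in range(reps)) / reps
print(estimate, nu * t * mean_damage)  # E X_t = int_0^t f_s ds = 5.0; estimate close to it
```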

3.3.4 Shock Models with State-Dependent Failure Probability


Now we introduce a failure mechanism in which the marks $V_n = (Y_n, W_n)$ are pairs of random variables, where $Y_n > 0$ represents the amount of damage caused by the $n$th shock and $W_n$ equals 1 or 0 according to whether or not the system fails at the $n$th shock. Upon failure, repair is performed. So the marks $V_n$ take values in $S = \mathbb{R}_+ \times \{0, 1\}$. The associated counting process is $N(t, \mathbb{R}_+ \times \{0, 1\})$, and $\tilde N(t) = N(t, \mathbb{R}_+ \times \{1\})$ counts the number of failures up to time $t$. The accumulated damage is described by

$$
X_t = \sum_{n=1}^{N(t, S)} Y_n.
$$
In addition to (A), p. 87, we now assume:
• $Y_{n+1}$ is independent of $\mathcal{F}^N_{T_n}$ with distribution $F_{n+1}(y) = P(Y_{n+1} \le y)$;
• For each $k \in \mathbb{N}_0$ there exists a measurable function $p_k(x)$ such that $0 \le p_k(x) \le 1$ and
$$
P\big(W_{n+1} = 1 \mid \mathcal{F}^N_{T_n} \vee \sigma(Y_{n+1})\big) = p_{\tilde N(T_n)}(X_{T_n} + Y_{n+1}). \tag{3.19}
$$

Note that $\mathcal{F}^N_{T_n} = \sigma((T_i, Y_i, W_i),\ i = 1, \ldots, n)$ and that

$$
\tilde N(T_n) = \sum_{i=1}^n W_i, \qquad X_{T_n} = \sum_{i=1}^n Y_i.
$$

The assumption (3.19) can be interpreted as follows: if the accumulated damage is $x$ and $k$ failures have already occurred, then an additional shock of magnitude $y$ causes the system to fail with probability $p_k(x + y)$.
To derive the compensator of $\tilde N(t) = N(t, \mathbb{R}_+ \times \{1\})$, the number of failures up to time $t$, we observe that

$$
\begin{aligned}
P\big(U_{n+1} \in A,\ Y_{n+1} \in \mathbb{R}_+,\ W_{n+1} = 1 \mid \mathcal{F}^N_{T_n}\big)
&= P(U_{n+1} \in A)\, P\big(W_{n+1} = 1 \mid \mathcal{F}^N_{T_n}\big) \\
&= P(U_{n+1} \in A)\, E\big[p_{\tilde N(T_n)}(X_{T_n} + Y_{n+1}) \mid \mathcal{F}^N_{T_n}\big].
\end{aligned}
$$

Then Theorem 3.44 yields the intensity on $\{T_n < t \le T_{n+1}\}$:

$$
\lambda_t(\mathbb{R}_+ \times \{1\}) = r_n(t - T_n)\, E\big[p_{\tilde N(T_n)}(X_{T_n} + Y_{n+1}) \mid \mathcal{F}^N_{T_n}\big].
$$

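Example 3.47. As a shock arrival process we now consider a Poisson process with rate $\nu$, $0 < \nu < \infty$, and an i.i.d. sequence of shock amounts with common distribution $F$. Then we get

$$
\lambda_t(\mathbb{R}_+ \times \{1\}) = \nu \int_0^{\infty} p_{\tilde N(t)}(X_t + y)\, dF(y).
$$

If the failure probability does not depend on the number of failures $\tilde N$ and the shock magnitudes are deterministic, $Y_n = 1$, then the accumulated damage $X_t$ equals the number of shocks $N_t = N(t, S)$, and we have

$$
\lambda_t(\mathbb{R}_+ \times \{1\}) = \nu\, p(N_t + 1).
$$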
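The integral in Example 3.47 can be evaluated for any shock-size distribution through its quantile function, $\int_0^\infty p(x + y)\, dF(y) = \int_0^1 p(x + F^{-1}(u))\, du$. In the Python sketch below, the failure probability $p(x) = 1 - e^{-x}$ and the $\mathrm{Exp}(\theta)$ shock sizes are hypothetical choices for which the integral has the closed form $1 - e^{-x}\,\theta/(\theta + 1)$; the last line reproduces the deterministic case $Y_n = 1$.

```python
import math

THETA = 2.0  # hypothetical rate of the Exp(THETA) shock sizes


def p(x):
    """Hypothetical failure probability at accumulated damage x (independent of k)."""
    return 1.0 - math.exp(-x)


def exp_quantile(u):
    return -math.log1p(-u) / THETA  # F^{-1}(u) for Exp(THETA)


def failure_intensity(nu, x, quantile, n=100_000):
    """nu * int_0^1 p(x + F^{-1}(u)) du by the midpoint rule; x = current damage X_t."""
    h = 1.0 / n
    return nu * h * sum(p(x + quantile((k + 0.5) * h)) for k in range(n))


nu, x = 0.5, 1.2
closed_form = nu * (1.0 - math.exp(-x) * THETA / (THETA + 1.0))
print(failure_intensity(nu, x, exp_quantile), closed_form)
print(failure_intensity(nu, 3.0, lambda u: 1.0, n=10), nu * p(4.0))  # Y_n = 1: nu * p(N_t + 1)
```

The quantile form avoids truncating the infinite integration range; the midpoint rule never evaluates the quantile function at $u = 1$, where it may be infinite.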

To derive a semimartingale description of the first time to failure

$$
T = \inf\{T_n : W_n = 1\},
$$

we simply stop the counting process $\tilde N$ at the $\mathbb{F}^N$-stopping time $T$ and get

$$
I(T \le t) = \tilde N(t \wedge T) = \int_0^{t \wedge T} \lambda_s(\mathbb{R}_+ \times \{1\})\, ds + M_{t \wedge T} = \int_0^t I(T > s)\, \lambda_s(\mathbb{R}_+ \times \{1\})\, ds + M_{t \wedge T},
$$

where $M$ is a martingale. The time to first failure admits a failure rate process, which is just the intensity of the counting process $\tilde N$.

3.3.5 Shock Models with Failures of Threshold Type

The situation is as above; we only change the failure mechanism in that the first time to failure $T$ is defined as the first time the accumulated damage reaches or exceeds a given threshold $K \in \mathbb{R}_+$:

$$
T = \inf\Big\{t \in \mathbb{R}_+ : \sum_{i=1}^{N(t, S)} Y_i \ge K\Big\} = \inf\Big\{T_n : \sum_{i=1}^{n} Y_i \ge K\Big\}.
$$

This is the hitting time of the set $[K, \infty)$.

This failure model seems to be quite different from the previous one. However, it is just a special case, obtained by setting the failure probability function $p_k(x)$ of (3.19) for all $k$ equal to the indicator of the interval $[K, \infty)$:

$$
p_k(x) = p(x) = I_{[K, \infty)}(x).
$$



Then we get

$$
\begin{aligned}
P(W_{n+1} = 1 \mid \mathcal{F}^N_{T_n}) &= E[p(X_{T_n} + Y_{n+1}) \mid \mathcal{F}^N_{T_n}] \\
&= P(Y_{n+1} + X_{T_n} \ge K \mid \mathcal{F}^N_{T_n}) \\
&= 1 - F_{n+1}((K - X_{T_n})-).
\end{aligned}
$$

This can be interpreted as follows: if the accumulated damage after $n$ shocks is $x$, then the system fails with probability $P(Y_{n+1} \ge K - x)$ when the next shock occurs, which is the probability that the total damage hits the threshold $K$. Obviously, the shock at $T$ and all later shocks are counted by $\tilde N(t) = N(t, \mathbb{R}_+ \times \{1\})$. The failure counting process $\tilde N$ has on $\{T_n < t \le T_{n+1}\}$ the intensity

$$
\lambda_t(\mathbb{R}_+ \times \{1\}) = r_n(t - T_n)\{1 - F_{n+1}((K - X_{T_n})-)\}. \tag{3.20}
$$

The first time to failure is described by

$$
I(T \le t) = \int_0^t I(T > s)\, \lambda_s(\mathbb{R}_+ \times \{1\})\, ds + M_t,
$$

with a suitable martingale $M$.

Example 3.48. Let us again consider the compound Poisson case with shock arrival rate $\nu$ and $F_n = F$ for all $n \in \mathbb{N}_0$. Since $r_n(s - T_n) = \nu$ and $K - X_{T_n} = K - X_t$ on $\{T_n < t < T_{n+1}\}$, we get

$$
I(T \le t) = \int_0^t I(T > s)\, \nu\, \bar F((K - X_s)-)\, ds + M_t.
$$
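A quick plausibility check of the threshold model: with $\mathrm{Exp}(\theta)$ shock amounts, the partial sums of the amounts form a Poisson process of rate $\theta$ on the damage axis, so the number of shocks needed to reach $K$ is $1 + \mathrm{Poisson}(\theta K)$, and by Wald's identity $ET = (1 + \theta K)/\nu$. The Python sketch below simulates the first failure time under these hypothetical assumptions; the parameter values are illustrative.

```python
import random


def first_failure_time(nu, theta, K, rng):
    """First time the accumulated damage reaches or exceeds K (threshold model)."""
    clock, damage = 0.0, 0.0
    while damage < K:
        clock += rng.expovariate(nu)      # Poisson(nu) shock arrivals
        damage += rng.expovariate(theta)  # Exp(theta) shock amounts (assumption)
    return clock


rng = random.Random(42)
nu, theta, K, reps = 1.0, 2.0, 3.0, 40_000
mean_T = sum(first_failure_time(nu, theta, K, rng) for _ in range(reps)) / reps
print(mean_T, (1 + theta * K) / nu)  # simulated mean vs. (1 + theta K)/nu = 7.0
```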

3.3.6 Minimal Repair Models

In the literature on repair models special attention has been given to so-called minimal repair models. Instead of replacing a failed system by a new one, a repair restores the system to a certain degree. These minimal repairs are often verbally described (and defined) as follows:
• “The . . . assumption is made that the system failure rate is not disturbed
after performing minimal repair. For instance, after replacing a single tube
in a television set, the set as a whole will be about as prone to failure after
the replacement as before the tube failure” (Barlow and Hunter [30]).
• “A minimal repair is one which leaves the unit in precisely the condition
it was in immediately before the failure” (Phelps [129]).
The definition of the state of the system immediately before failure depends to a considerable degree on the information one has about the system. So it makes a difference whether all components of a complex system are observed or only failure of the whole system is recognized. In the first case the lifetime of the repaired component (the tube of the TV set) is associated with the residual system lifetime. In the second case the only information about the condition of the system immediately before failure is its age. So a minimal repair in this case would mean replacing the system (the whole TV set) by another one of the same age that has not yet failed. Minimal repairs of this kind are also called black box or statistical minimal repairs, whereas the component-wise minimal repairs are also called physical minimal repairs.
Example 3.49. We consider a simple two-component parallel system with independent $\mathrm{Exp}(1)$ distributed component lifetimes $X_1, X_2$ and allow for exactly one minimal repair.
• Physical minimal repair. After failure at $T = T_1 = X_1 \vee X_2$ the component that caused the system to fail is repaired minimally. Since the component lifetimes are exponentially distributed, the additional lifetime is given by an $\mathrm{Exp}(1)$ random variable $X_3$ independent of $X_1$ and $X_2$. The total lifetime $T_1 + X_3$ has distribution
$$
P(T_1 + X_3 > t) = e^{-t}(2t + e^{-t}).
$$
• Black box minimal repair. The lifetime $T = T_1 = X_1 \vee X_2$ until the first failure of the system has distribution $P(T_1 \le t) = (1 - e^{-t})^2$ and failure rate
$$
\lambda(t) = 2\,\frac{1 - e^{-t}}{2 - e^{-t}}.
$$
The additional lifetime $T_2 - T_1$ until the second failure is assumed to have conditional distribution
$$
P(T_2 - T_1 \le x \mid T_1 = t) = P(T_1 \le t + x \mid T_1 > t) = 1 - e^{-x}\,\frac{2 - e^{-(t+x)}}{2 - e^{-t}}.
$$
Integrating leads to the distribution of the total lifetime $T_2$:
$$
P(T_2 > t) = e^{-t}(2 - e^{-t})\big(1 + t - \ln(2 - e^{-t})\big).
$$

It is (perhaps) no surprise that the total lifetime after a black box minimal repair is stochastically greater than after a physical minimal repair:

$$
P(T_2 > t) \ge P(T_1 + X_3 > t) \quad \text{for all } t \ge 0.
$$
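Both survival functions can be checked numerically. The Python sketch below evaluates the two closed forms on a grid, confirms the stochastic ordering there, and compares them with simulated lifetimes. For the black box case it simulates the second failure time of a nonhomogeneous Poisson process with cumulative failure rate $\Lambda(t) = -\ln P(T_1 > t) = t - \ln(2 - e^{-t})$ as $\Lambda^{-1}(E_1 + E_2)$ with independent $\mathrm{Exp}(1)$ variables $E_1, E_2$, inverting $\Lambda$ by bisection (an implementation choice, not taken from the text).

```python
import math
import random


def surv_physical(t):
    """P(T_1 + X_3 > t) = e^{-t} (2t + e^{-t})."""
    return math.exp(-t) * (2.0 * t + math.exp(-t))


def surv_black_box(t):
    """P(T_2 > t) = e^{-t} (2 - e^{-t}) (1 + t - ln(2 - e^{-t}))."""
    e = math.exp(-t)
    return e * (2.0 - e) * (1.0 + t - math.log(2.0 - e))


def cum_hazard(t):
    """Lambda(t) = -ln P(T_1 > t) = t - ln(2 - e^{-t})."""
    return t - math.log(2.0 - math.exp(-t))


def inv_cum_hazard(v, iterations=50):
    lo, hi = 0.0, v + 1.0  # Lambda(v + 1) >= v + 1 - ln 2 > v
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if cum_hazard(mid) < v:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


grid = [0.1 * k for k in range(101)]
assert all(surv_black_box(t) >= surv_physical(t) - 1e-12 for t in grid)

rng = random.Random(3)
reps, t0 = 40_000, 2.0
physical = [max(rng.expovariate(1.0), rng.expovariate(1.0)) + rng.expovariate(1.0)
            for _ in range(reps)]
black_box = [inv_cum_hazard(rng.expovariate(1.0) + rng.expovariate(1.0)) for _ in range(reps)]
print(sum(x > t0 for x in physical) / reps, surv_physical(t0))    # about 0.560
print(sum(x > t0 for x in black_box) / reps, surv_black_box(t0))  # about 0.600
```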
Below we summarize some typical categories of minimal repair models and give some further examples. Let $(T_n)$ be a point process describing the failure times at which instantaneous repairs are carried out and let $N = (N_t)$, $t \in \mathbb{R}_+$, be the corresponding counting process

$$
N_t = \sum_{n=1}^{\infty} I(T_n \le t).
$$

We assume that $N$ is adapted to some filtration $\mathbb{F}$ and has $\mathbb{F}$-intensity $(\lambda_t)$. Different types of repair processes are characterized by different intensities $\lambda$. The repairs are minimal if the intensity $\lambda$ is not affected by the occurrence of failures or, in other words, if one cannot determine the failure time points from the observation of $\lambda$. More formally, minimal repairs can be characterized as follows.

Definition 3.50. Let $(T_n)$, $n \in \mathbb{N}$, be a point process with an integrable counting process $N$ and corresponding $\mathbb{F}$-intensity $\lambda$. Suppose that $\mathbb{F}^\lambda = (\mathcal{F}_t^\lambda)$, $t \in \mathbb{R}_+$, is the filtration generated by $\lambda$: $\mathcal{F}_t^\lambda = \sigma(\lambda_s,\ 0 \le s \le t)$. Then the point process $(T_n)$ is called a minimal repair process (MRP) if none of the variables $T_n$, $n \in \mathbb{N}$, for which $P(T_n < \infty) > 0$ is an $\mathbb{F}^\lambda$-stopping time, i.e., for all $n \in \mathbb{N}$ with $P(T_n < \infty) > 0$ there exists $t \in \mathbb{R}_+$ such that $\{T_n \le t\} \notin \mathcal{F}_t^\lambda$.

This is a rather general definition that comprises the well-known special case of a nonhomogeneous Poisson process, as is seen below. A renewal process with a strictly increasing or decreasing hazard rate $r$ of the interarrival times has intensity (compare Example 3.13, p. 64)

$$
\lambda_t = \sum_{n \ge 0} r(t - T_n)\, I(T_n < t \le T_{n+1}), \quad T_0 = 0,\ \lambda_0 = r(0+),
$$

and is therefore not an MRP, because $N_t = |\{s \in \mathbb{R}_+ : 0 < s \le t,\ \lambda_{s+} = \lambda_0\}|$, i.e., the failure times can be read off the path of $\lambda$.


In the following we give some examples of (minimal) repair processes.
(a) In the basic statistical minimal repair model the intensity is a time-dependent deterministic function $\lambda_t = \lambda(t)$, so that the process is a nonhomogeneous Poisson process. This means that the age (the failure intensity) is not changed as a result of a failure (minimal repair). Here $\mathcal{F}_t^\lambda = \{\Omega, \emptyset\}$ for all $t \in \mathbb{R}_+$, so clearly the failure times $T_n$ are no $\mathbb{F}^\lambda$-stopping times. The following special cases have been given much attention in the literature:
$$
\lambda_p(t) = \lambda\beta(\lambda t)^{\beta - 1} \quad \text{(power law)}, \qquad \lambda_L(t) = \lambda e^{\beta t} \quad \text{(log-linear model)}.
$$
For the parallel system in Example 3.49, one has $\lambda(t) = 2(1 - e^{-t})/(2 - e^{-t})$. If the intensity is constant, $\lambda_t \equiv \lambda$, the times between successive repairs are independent $\mathrm{Exp}(\lambda)$ distributed random variables. This is the case in which repairs have the same effect as replacements.
(b) If in (a) the intensity is not deterministic but a random variable $\lambda(\omega)$ which is known at the time origin ($\lambda$ is $\mathcal{F}_0$-measurable), or, more generally, $\lambda = (\lambda_t)$ is a stochastic process such that $\lambda_t$ is $\mathcal{F}_0$-measurable for all $t \in \mathbb{R}_+$, i.e., $\mathcal{F}_0 = \sigma(\lambda_s, s \in \mathbb{R}_+)$ and $\mathcal{F}_t = \mathcal{F}_0 \vee \sigma(N_s, 0 \le s \le t)$, then the process is called a doubly stochastic Poisson process or a Cox process. The process generalizes the basic model (a); the failure (minimal repair) times are no $\mathbb{F}^\lambda$-stopping times, since $\mathcal{F}_t^\lambda \subset \sigma(\lambda) = \mathcal{F}_0$ and $T_n$ is not $\mathcal{F}_0$-measurable.
Also the Markov-modulated Poisson process of Example 3.14, p. 65, where the intensity $\lambda_t = \lambda_{Y_t}$ is determined by a Markov chain $(Y_t)$, is an MRP. Indeed, it is a slight modification of a doubly stochastic Poisson process in that the filtration $\mathcal{F}_t = \sigma(N_s, Y_s, 0 \le s \le t)$ does not include the information about the paths of $\lambda$ in $\mathcal{F}_0$.

(c) For the physical minimal repair in Example 3.49, $\lambda_t = I(X_1 \wedge X_2 \le t)$. In this case $\mathbb{F}^\lambda$ is generated by the minimum of $X_1$ and $X_2$. The first failure time of the system, $T_1$, equals $X_1 \vee X_2$, which is not an $\mathbb{F}^\lambda$-stopping time: the filtration generated by $\lambda$ does not determine $X_1 \vee X_2$.
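A nonhomogeneous Poisson process as in (a) is conveniently simulated by a time change: if $S_1 < S_2 < \cdots$ are the points of a unit-rate Poisson process and $\Lambda(t) = \int_0^t \lambda(s)\, ds$, then $T_n = \Lambda^{-1}(S_n)$. For the power law $\Lambda(t) = (\lambda t)^\beta$, and for the log-linear model $\Lambda(t) = (\lambda/\beta)(e^{\beta t} - 1)$. The Python sketch below uses these inverses and checks $EN_t = \Lambda(t)$ by simulation; the parameter values are hypothetical.

```python
import math
import random


def nhpp_times(inv_cum_intensity, horizon, rng):
    """Failure times T_n = Lambda^{-1}(S_n) <= horizon, S_n unit-rate Poisson points."""
    times, s = [], 0.0
    while True:
        s += rng.expovariate(1.0)
        t = inv_cum_intensity(s)
        if t > horizon:
            return times
        times.append(t)


def power_law_inverse(lam, beta):
    return lambda v: v ** (1.0 / beta) / lam            # Lambda(t) = (lam t)^beta


def log_linear_inverse(lam, beta):
    return lambda v: math.log1p(beta * v / lam) / beta  # Lambda(t) = lam (e^{beta t} - 1)/beta


rng = random.Random(11)
reps, horizon = 40_000, 4.0
inverse = power_law_inverse(0.5, 2.0)
mean_power = sum(len(nhpp_times(inverse, horizon, rng)) for _ in range(reps)) / reps
print(mean_power, (0.5 * horizon) ** 2.0)  # estimate vs. E N_4 = (lam t)^beta = 4.0
```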
In the following we give another characterization of an MRP.

Theorem 3.51. Assume that $P(T_n < \infty) = 1$ for all $n \in \mathbb{N}$ and that there exist versions of the conditional probabilities $F_t(n) = E[I(T_n \le t) \mid \mathcal{F}_t^\lambda]$ such that for each $n \in \mathbb{N}$, $(F_t(n))$, $t \in \mathbb{R}_+$, is an ($\mathbb{F}^\lambda$-progressive) stochastic process.
(i) Then the point process $(T_n)$ is an MRP if and only if for each $n \in \mathbb{N}$ there exists some $t \in \mathbb{R}_+$ such that $P(0 < F_t(n) < 1) > 0$.
(ii) If furthermore $(F_t) = (F_t(1))$ has $P$-a.s. continuous paths of bounded variation on finite intervals, then
$$
1 - F_t = \exp\Big\{-\int_0^t \lambda_s\, ds\Big\}.
$$

Proof. (i) To prove (i) we show that $P(F_t(n) \in \{0, 1\}) = 1$ for all $t \in \mathbb{R}_+$ is equivalent to $T_n$ being an $\mathbb{F}^\lambda$-stopping time. Since $F_0(n) = 0$ and, by the dominated convergence theorem for conditional expectations,
$$
\lim_{t \to \infty} F_t(n) = 1,
$$
the assumption that $P(F_t(n) \in \{0, 1\}) = 1$ for all $t \in \mathbb{R}_+$ is equivalent to $F_t(n) = I(T_n \le t)$ ($P$-a.s.). But as $(F_t(n))$ is adapted to $\mathbb{F}^\lambda$, this means that $T_n$ is an $\mathbb{F}^\lambda$-stopping time. This shows that under the given assumptions $P(0 < F_t(n) < 1) > 0$ is equivalent to $T_n$ being no $\mathbb{F}^\lambda$-stopping time.
(ii) For the second assertion we apply the exponential formula (3.16) as described on p. 80. □


Example 3.52. In continuation of Example 3.49 of the two-component parallel system, we allow for repeated physical minimal repairs. Let $(X_k)$, $k \in \mathbb{N}$, be a sequence of i.i.d. random variables following an exponential distribution with parameter 1: $X_k \sim \mathrm{Exp}(1)$. Then we define
$$
T_1 = X_1 \vee X_2, \qquad T_{n+1} = T_n + X_{n+2}, \quad n \in \mathbb{N}.
$$
We consider the filtration generated by the sequence $(X_k)$, $k \in \mathbb{N}$. The intensity of the corresponding counting process $N_t = \sum_{n=1}^{\infty} I(T_n \le t)$ with respect to this filtration is then $\lambda_t = I(X_1 \wedge X_2 \le t)$. [If we had considered the filtration generated by the sequence $(T_n)$, $n \in \mathbb{N}$, we would have derived the deterministic intensity $2(1 - e^{-t})/(2 - e^{-t})$.]

By elementary calculations it can be seen that
$$
E[I(T_1 > t) \mid \mathcal{F}_t^\lambda] = P(T_1 > t \mid X_1 \wedge X_2 \wedge t)
$$
is continuous and nonincreasing. According to Theorem 3.51 it follows that $(T_n)$ is an MRP and that the time to the first failure has conditional distribution
$$
1 - F_t = \exp\Big\{-\int_0^t I(X_1 \wedge X_2 \le s)\, ds\Big\} = \exp\big\{-(t - X_1 \wedge X_2)^+\big\}.
$$

Now we want to illustrate the above definition of a minimal repair in a more complex situation. We consider the shock damage repair model described in Sect. 3.3.4. We now assume that the shock arrival process $(T_k^*)$ is a nonhomogeneous Poisson process with intensity function $\nu(t)$ and that $(V_k)$ with $V_k = (Y_k, W_k)$ is an i.i.d. sequence of pairs of random variables, independent of $(T_k^*)$. The common distribution of the positive variables $Y_k$ is denoted $F$. The failure mechanism is as before, but the probability $p(x)$ of failure at the occurrence of a shock, if the accumulated damage is $x$, is independent of the number of previous failures. Then we obtain for the failure counting process the intensity
$$
\lambda_t = \nu(t) \int_0^{\infty} p(X_{t-} + y)\, dF(y), \tag{3.21}
$$
where
$$
X_t = \sum_{k=1}^{\infty} Y_k\, I(T_k^* \le t)
$$
denotes the accumulated damage up to time $t$. The following theorem shows under which condition the failure point process is an MRP.

Theorem 3.53. If $0 < p(x) < 1$ holds for all $x$, then the point process $(T_n)$ driven by the intensity (3.21) is an MRP.

Proof. The random variables $W_k$ equal 1 or 0 according to whether or not the system fails at the $k$th shock. The first failure time $T_1$ can then be represented by
$$
T_1 = \inf\{T_k^* : W_k = 1\}.
$$
At each occurrence of a shock a Bernoulli experiment is carried out with outcome $W_k$. The random variable $W_k$ is not measurable with respect to $\sigma(X_{T_k^*})$ because by the condition $0 < p(x) < 1$ it follows that
$$
E[I(W_k = 1) \mid X_{T_k^*}] = P(W_k = 1 \mid X_{T_k^*}) = p(X_{T_k^*}) \notin \{0, 1\}.
$$
This shows that $T_1$ cannot be an $\mathbb{F}^X$-stopping time, where $\mathbb{F}^X$ is generated by the process $X = (X_t)$. Since we have $\mathcal{F}_t^\lambda \subset \mathcal{F}_t^X$, $T_1$ is no $\mathbb{F}^\lambda$-stopping time either. By induction via
$$
T_{n+1} = \inf\{T_k^* > T_n : W_k = 1\}
$$
we infer that none of the variables $T_n$ is an $\mathbb{F}^\lambda$-stopping time, which shows that $(T_n)$ is an MRP. □


Remark 3.54. (1) In the case $p(x) = c$ for some $c$, $0 < c \le 1$, the process is a nonhomogeneous Poisson process with intensity $\lambda_t = \nu(t)c$ and therefore an MRP. (2) The condition $0 < p(x) < 1$ excludes the case of threshold models, for which $p(x) = 1$ for $x \ge K$ and $p(x) = 0$ otherwise, for some constant $K > 0$. For such a threshold model we have
$$
T_1 = \inf\{t \in \mathbb{R}_+ : \lambda_t \ge \nu(t)\},
$$
if $P(Y_k \le x) > 0$ for all $x > 0$. In this case $T_1$ is an $\mathbb{F}^\lambda$-stopping time and consequently $(T_n)$ is no MRP.

3.3.7 Comparison of Repair Processes for Different Information Levels

Consider a monotone system comprising $m$ independent components with lifetimes $Z_i$, $i = 1, \ldots, m$, and corresponding ordinary failure rates $\lambda_t(i)$. Its structure function $\Phi : \{0, 1\}^m \to \{0, 1\}$ represents the state of the system (1: intact, 0: failure), and the process $X_t = (X_t(1), \ldots, X_t(m))$ denotes the vector of component states at time $t$ with values in $\{0, 1\}^m$. Example 3.49 suggests comparing the effects of minimal repairs on different information levels. However, it seems difficult to define such point processes for arbitrary information levels. One possible way is sketched in the following, where considerations are restricted to the complete information $\mathbb{F}$-level (component level) and the "black-box level" $\mathbb{A}^T$ generated by $T = T_1$, $\mathcal{A}_t = \sigma(I(T_1 \le s), 0 \le s \le t)$. Note that $T_1$ describes the time to first failure, i.e.,
$$
T_1 = \inf\{t \in \mathbb{R}_+ : \Phi(X_t) = 0\}.
$$
This time to first system failure is governed by the hazard rate process $\lambda$ for $t \in [0, T)$ (cf. Corollary 3.30 on p. 76):
$$
\lambda_t = \sum_{i=1}^m \big(\Phi(1_i, X_t) - \Phi(0_i, X_t)\big)\, \lambda_t(i). \tag{3.22}
$$
i=1

Our aim is to extend the definition of $\lambda_t$ also to $\{T_1 \le t\}$. To this end we extend the definition of $X_t(i)$ on $\{Z_i \le t\}$, following the idea that upon system failure the component which caused the failure is repaired minimally in the sense that it is restored and operates at the same failure rate as if it had not failed before. So we define $X_t(i) = 0$ on $\{Z_i \le t\}$ if the first failure of component $i$ caused no system failure; otherwise we set $X_t(i) = 1$ on $\{Z_i \le t\}$ (note that in the latter case the value of $X_t(i)$ is redefined for $t = Z_i$). In this way we define $X_t$ and, by (3.22), the process $\lambda_t$ for all $t \in \mathbb{R}_+$. This completed intensity $\lambda_t$ induces a point process $(N_t)$ which counts the number of minimal repairs on the component level. The corresponding complete information filtration $\mathbb{F} = (\mathcal{F}_t)$, $t \in \mathbb{R}_+$, is given by
$$
\mathcal{F}_t = \sigma\big(N_s, I(Z_i \le s),\ 0 \le s \le t,\ i = 1, \ldots, m\big).
$$

To investigate whether the process $(N_t)$ is an MRP we define the random variables
$$
Y_i = \inf\{t \in \mathbb{R}_+ : \Phi(1_i, X_t) - \Phi(0_i, X_t) = 1\}, \quad i = 1, \ldots, m, \quad \inf\emptyset = \infty,
$$
which describe the time when component $i$ becomes critical, i.e., the time from which on a failure of component $i$ would lead to system failure. It follows that
$$
\lambda_t = \sum_{i=1}^m I(Y_i \le t)\, \lambda_t(i), \qquad \mathcal{F}_t^\lambda = \sigma\big(I(Y_i \le s),\ 0 \le s \le t,\ i = 1, \ldots, m\big).
$$

Obviously, on $\{Y_i < \infty\}$ we have $Z_i > Y_i$, and it can be shown that $Z_i$ is not measurable with respect to $\sigma(Y_1, \ldots, Y_m)$. For a two-component parallel system this means that $Z_1 \vee Z_2$ is not measurable with respect to $\sigma(Z_1 \wedge Z_2)$, which holds true observing that $E[I(Z_1 \vee Z_2 > z) \mid Z_1 \wedge Z_2] \notin \{0, 1\}$ for some $z$ (note that the random variables $Z_i$ are assumed to be independent). The extension to the general case is intuitive, but the details of a formal, lengthy proof are omitted. We state that the time to the first failure
$$
T_1 = \min\{Z_i : Y_i < \infty,\ i = 1, \ldots, m\}
$$
is no $\mathbb{F}^\lambda$-stopping time. By induction it can be seen that also $T_n$ is no $\mathbb{F}^\lambda$-stopping time and $(T_n)$ is an MRP.
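Now we want to consider the same system on the "black-box level". The change to the $\mathbb{A}^T$-level by conditioning leads to the failure rate $\hat\lambda$, $\hat\lambda_t = E[\lambda_t \mid \mathcal{A}_t]$. This failure rate $\hat\lambda$ can be chosen to be deterministic,
$$
\hat\lambda_t = E[\lambda_t \mid T_1 > t];
$$
it is the ordinary failure rate of $T_1$. For the time to the first system failure we have the two representations
$$
\begin{aligned}
I(T_1 \le t) &= \int_0^t I(T_1 > s)\, \lambda_s\, ds + M_t && (\mathbb{F}\text{-level}) \\
&= \int_0^t I(T_1 > s)\, \hat\lambda_s\, ds + \bar M_t && (\mathbb{A}^T\text{-level}).
\end{aligned}
$$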

From the deterministic failure rate $\hat\lambda$ a nonhomogeneous Poisson process $(\hat T_n)_{n \in \mathbb{N}}$, $0 < \hat T_1 < \hat T_2 < \cdots$, can be constructed, where $\hat T_1$ and $T_1$ have the same distribution. This nonhomogeneous Poisson process with
$$
\hat N_t = \sum_{n=1}^{\infty} I(\hat T_n \le t) = \int_0^t \hat\lambda_s\, ds + \hat M_t
$$
describes the MRP on the $\mathbb{A}^T$-level. Comparing these two information levels, Example 3.49 suggests $E N_t \ge E \hat N_t$ for all $t > 0$. A general comparison, also for arbitrary subfiltrations, seems to be an open problem (cf. [4, 124]).

Example 3.55. In the two-component parallel system of Example 3.49 we have the failure rate process $\lambda_t = I(X_1 \wedge X_2 \le t)$ on the component level and $\hat\lambda_t = 2(1 - e^{-t})/(2 - e^{-t})$ on the black-box level. So one has two descriptions of the same random lifetime $T = T_1$:
$$
\begin{aligned}
I(T_1 \le t) &= \int_0^t I(T_1 > s)\, I(X_1 \wedge X_2 \le s)\, ds + M_t \\
&= \int_0^t I(T_1 > s)\, 2\,\frac{1 - e^{-s}}{2 - e^{-s}}\, ds + \bar M_t.
\end{aligned}
$$
The process $N$ counts the number of minimal repairs on the component level:
$$
N_t = \int_0^t I(X_1 \wedge X_2 \le s)\, ds + M_t.
$$
This is a delayed Poisson process, the (repair) intensity of which is equal to 1 after the first component failure. The process $\hat N$ counts the number of minimal repairs on the black-box level:
$$
\hat N_t = \int_0^t 2\,\frac{1 - e^{-s}}{2 - e^{-s}}\, ds + \hat M_t.
$$
This is a nonhomogeneous Poisson process with an intensity which corresponds to the ordinary failure rate of $T_1$. Elementary calculations indeed yield
$$
E N_t = t - \tfrac{1}{2}\big(1 - e^{-2t}\big) \ge E \hat N_t = t - \ln(2 - e^{-t}).
$$
To interpret this result one should note that on the component level only the critical component which caused the system to fail is repaired. A black box repair, which is a replacement by a system of the same age that has not yet failed, could be a replacement by a system with both components working.
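The two expectations follow by integrating the intensities, since the martingale parts have mean zero: $EN_t = \int_0^t P(X_1 \wedge X_2 \le s)\, ds$ with $X_1 \wedge X_2 \sim \mathrm{Exp}(2)$, and $E\hat N_t = \int_0^t \hat\lambda_s\, ds$. The Python sketch below confirms the closed forms with a trapezoidal rule and checks the inequality on a grid; the step count and the grid are arbitrary choices.

```python
import math


def trapezoid(f, t, steps=10_000):
    h = t / steps
    return h * (0.5 * f(0.0) + sum(f(k * h) for k in range(1, steps)) + 0.5 * f(t))


def component_level_intensity(s):
    return 1.0 - math.exp(-2.0 * s)  # E I(X1 ^ X2 <= s) = P(Exp(2) <= s)


def black_box_intensity(s):
    return 2.0 * (1.0 - math.exp(-s)) / (2.0 - math.exp(-s))


def expected_component_level(t):
    return t - 0.5 * (1.0 - math.exp(-2.0 * t))


def expected_black_box(t):
    return t - math.log(2.0 - math.exp(-t))


for t in (0.5, 1.0, 3.0):
    print(t, trapezoid(component_level_intensity, t), expected_component_level(t),
          trapezoid(black_box_intensity, t), expected_black_box(t))

assert all(expected_component_level(0.05 * k) >= expected_black_box(0.05 * k) - 1e-12
           for k in range(201))
```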

3.3.8 Repair Processes with Varying Degrees of Repair

As in the minimal repair section, let $(T_n)$ be a point process describing failure times at which instantaneous repairs are carried out and let $N = (N_t)$, $t \in \mathbb{R}_+$, be the corresponding counting process. We assume that $N$ is adapted to some filtration $\mathbb{F}$ and has $\mathbb{F}$-intensity $(\lambda_t)$.

One way to model varying levels or degrees of repairs is the following.


Consider a new item or system having lifetime distribution F with failure
rate r(t). Assume that the nth repair has the effect that the distribution to
the next failure is that of an unfailed item of age An ≥ 0. Then An = 0
means complete repair (as good as new) or replacement and An > 0 can be
interpreted as a partial repair which sets the item back to the functioning
state. Theorem 3.12, p. 64, immediately yields the intensity of such a repair process with respect to the internal filtration $\mathbb{F}^N$: Let $(A_n)$, $n \in \mathbb{N}$, be a sequence of nonnegative random variables such that $A_n$ is $\mathcal{F}^N_{T_n}$-measurable; then the $\mathbb{F}^N$-intensity of N is given by
$$\lambda_t = \sum_{n=0}^{\infty} r(t - T_n + A_n)\, I(T_n < t \le T_{n+1}), \qquad A_0 = T_0 = 0.$$

The two extreme cases are:

1. $A_n = 0$ for all $n \in \mathbb{N}$. Then N is a renewal process with interarrival time distribution F; all repairs are complete restorations to the as-good-as-new state.
2. $A_n = T_n$ for all $n \in \mathbb{N}$. Then N is a nonhomogeneous Poisson process with intensity r(t); all repairs are (black-box) minimal repairs.
In addition we can introduce random degrees Zn ≤ 1 of the nth repair.
Starting with a new item the first failure occurs at T1 . A repair with degree Z1
is instantaneously carried out and results in a virtual age of A1 = (1 − Z1 )T1 .
Continuing we can define the sequence of virtual ages recursively by
$$A_{n+1} = (1 - Z_{n+1})(A_n + T_{n+1} - T_n), \qquad A_0 = 0.$$
Negative values of Zn may be interpreted as additional aging due to the nth
failure or a clumsy repair. In the literature there exist many models describing
different ways of generating or prescribing the random sequence of repair
degrees, cf. Bibliographic Notes.
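The virtual-age recursion can be simulated directly. The Python sketch below is illustrative only. It assumes a Weibull lifetime distribution with scale 1 and shape β, so that $r(t) = \beta t^{\beta - 1}$. The function name `simulate_virtual_age` and the `degree` callback that supplies $Z_n$ are hypothetical. After each repair, the time to the next failure is drawn from the residual-life distribution of an unfailed item of the current virtual age, by inverting its survival function.

```python
import random


def simulate_virtual_age(horizon, beta, degree, rng):
    """Simulate failure times T_1 < T_2 < ... in [0, horizon] under the
    virtual-age model A_n = (1 - Z_n)(A_{n-1} + T_n - T_{n-1}), A_0 = 0.

    Illustrative assumptions: Weibull lifetime with scale 1 and shape beta,
    i.e. survival exp(-t**beta) and failure rate r(t) = beta * t**(beta - 1);
    degree(n) returns the repair degree Z_n <= 1 of the nth repair.
    Returns the failure times and the virtual ages after each repair.
    """
    times, ages = [], []
    t, a, n = 0.0, 0.0, 0                  # T_0 = 0, A_0 = 0
    while True:
        # Residual life of an unfailed item of age a:
        # P(X > x) = exp(-((a + x)**beta - a**beta)); invert with an Exp(1) draw.
        e = rng.expovariate(1.0)
        x = (a ** beta + e) ** (1.0 / beta) - a
        if t + x > horizon:
            return times, ages
        t += x
        n += 1
        a = (1.0 - degree(n)) * (a + x)    # virtual age after the nth repair
        times.append(t)
        ages.append(a)


if __name__ == "__main__":
    rng = random.Random(42)
    # minimal repairs (Z_n = 0): mean number of failures in [0, 2] is 2**2 = 4
    counts = [len(simulate_virtual_age(2.0, 2.0, lambda n: 0.0, rng)[0])
              for _ in range(10000)]
    print(sum(counts) / len(counts))
```

With `degree = lambda n: 1.0` every repair is complete and the failure times form a renewal process. With `degree = lambda n: 0.0` one gets $A_n = T_n$, i.e. minimal repairs. In that case the expected number of failures in [0, t] is $\int_0^t r(s)\,ds = t^\beta$. Negative degrees model the additional aging mentioned above.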

3.3.9 Minimal Repairs and Probability of Ruin

In this section we investigate a model that combines a certain reward and
cost structure with minimal repairs. Consider a one-unit system that fails
from time to time according to a point process. After failure a minimal repair
is carried out that leaves the state of the system unchanged. The system can
work in one of m unobservable states. State “1” stands for new or in good
condition and “m” is defective or in bad condition. Aging of the system is
described by a link between the failure point process and the unobservable
state of the system. The failure or minimal repair intensity may depend on
the state of the system.
Starting with an initial capital of u ≥ 0, there is some constant flow of
income, on the one hand, and, on the other hand, each minimal repair incurs
a random cost. The risk process R = (Rt ), t ∈ R+ , describes the difference
between the income including the initial capital u and the accumulated costs

for minimal repairs up to time t. The time of ruin is defined as $\tau = \tau(u) = \inf\{t \in \mathbb{R}_+ : R_t \le 0\}$. Since explicit formulas are rarely available, we are interested in bounds for $P(\tau < \infty)$ and $P(\tau \le t)$, the infinite- and the finite-horizon ruin probabilities.
A related question is when to stop processing the system and carry out an inspection or a renewal in order to maximize some reward functional. This problem is treated in Sect. 5.4.
For the mathematical formulation of the model, let the basic probability
space (Ω, F , P ) be equipped with a filtration F, the complete information
level, to which all processes are adapted, and let S = {1, . . . , m} be the set
of unobservable states. We assume that the time points of failures (minimal
repairs) 0 < T1 < T2 < · · · form a Markov-modulated Poisson process as
described in Example 3.14, p. 65. Let us recapitulate the details:
• The changes of the states are driven by a homogeneous Markov process $Y = (Y_t)$, $t \in \mathbb{R}_+$, with values in S and infinitesimal parameters $q_i$, the rate to leave state i, and $q_{ij}$, the rate to reach state j from state i:
$$q_i = \lim_{h \to 0+} \frac{1}{h} P(Y_h \ne i \mid Y_0 = i), \qquad q_{ij} = \lim_{h \to 0+} \frac{1}{h} P(Y_h = j \mid Y_0 = i), \quad i, j \in S,\ i \ne j,$$
$$q_{ii} = -q_i = -\sum_{j \ne i} q_{ij}.$$

• The time points $(T_n)$ form a point process and $N = (N_t)$, $t \in \mathbb{R}_+$, is the corresponding counting process $N_t = \sum_{n \ge 1} I(T_n \le t)$. It has a stochastic intensity $\lambda_{Y_t}$ depending on the unobservable state, i.e., N admits the representation
$$N_t = \int_0^t \lambda_{Y_s}\,ds + M_t,$$
where M is an F-martingale and $0 < \lambda_i < \infty$, $i \in S$. The filtration $\mathbb{F}^\lambda$ generated by the intensity ($\mathbb{F}^\lambda = \mathbb{F}^Y$ if $\lambda_i \ne \lambda_j$ for $i \ne j$) does not include $\mathbb{F}^N$ as a subfiltration. Hence $T_n$, $n \in \mathbb{N}$, is not an $\mathbb{F}^\lambda$-stopping time, and according to Definition 3.50, p. 92, N is an MRP.
• (Xn ), n ∈ N, is a sequence of positive i.i.d. random variables, independent
of N and Y , with common distribution F and finite mean μ. The cost
caused by the nth minimal repair at time Tn is described by Xn .
• There is an initial capital u and an income of constant rate c > 0 per
unit time.
Now the process R, given by
$$R_t = u + ct - \sum_{n=1}^{N_t} X_n,$$
describes the available capital at time t as the difference of the income and the total amount of costs for minimal repairs up to time t.
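Simulating R directly gives a feel for the ruin probabilities discussed below. The following Python sketch is a hypothetical illustration, not part of the original treatment. It simulates the environment Y together with the Markov-modulated failure process by means of competing exponential clocks. At each failure it subtracts a repair cost, and it reports the first time R drops to zero or below. The generator is passed as a matrix `q` of rates $q_{ij}$, the failure rates as a list `lam`, and the cost distribution as a sampling function `claim`. All of these names are assumptions of the sketch, and u > 0 is assumed.

```python
import random


def simulate_ruin_time(u, c, q, lam, claim, horizon, rng, y0=0):
    """Simulate R_t = u + c t - sum_{n <= N_t} X_n up to `horizon` and return
    the ruin time inf{t : R_t <= 0}, or None if no ruin occurs before horizon.

    Illustrative assumptions: u > 0; the environment Y is a finite Markov
    chain with transition rates q[i][j] (i != j); failures (minimal repairs)
    occur at rate lam[Y_t] > 0; claim(rng) draws one repair cost X_n.
    """
    t, r, y = 0.0, float(u), y0
    m = len(lam)
    while True:
        q_out = sum(q[y][j] for j in range(m) if j != y)   # q_i, rate to leave y
        total = q_out + lam[y]
        dt = rng.expovariate(total)      # time to next event (jump or failure)
        if t + dt > horizon:
            return None
        t += dt
        r += c * dt                      # income at constant rate c
        if rng.random() < lam[y] / total:
            r -= claim(rng)              # minimal repair at T_n with cost X_n
            if r <= 0:
                return t                 # ruin can only occur at a failure time
        else:
            # environment jump from y to j with probability q[y][j] / q_out
            others = [j for j in range(m) if j != y]
            y = rng.choices(others, weights=[q[y][j] for j in others])[0]


def ruin_probability(u, c, q, lam, claim, horizon, runs, seed=0):
    """Monte Carlo estimate of the finite-horizon ruin probability P(tau(u) <= t)."""
    rng = random.Random(seed)
    hits = sum(simulate_ruin_time(u, c, q, lam, claim, horizon, rng) is not None
               for _ in range(runs))
    return hits / runs
```

As an example, take a single environmental state, no income (c = 0), unit costs and u = 2.5. Ruin then occurs exactly at the third failure. So `ruin_probability(2.5, 0.0, [[0.0]], [1.0], lambda rng: 1.0, 2.0, 20000)` should be close to $P(N_2 \ge 3) = 1 - 5e^{-2} \approx 0.323$.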

The process R is commonly used in other branches of applied probability like queueing or collective risk theory. In risk theory one is mainly interested in the distribution of the time to ruin $\tau = \inf\{t \in \mathbb{R}_+ : R_t \le 0\}$.

The Failure Rate Process of the Ruin Time

We want to show that the indicator process $V_t = I(\tau(u) \le t)$ has a semimartingale representation
$$V_t = I(\tau \le t) = \int_0^t I(\tau > s)\, h_s\,ds + M_t, \quad t \in \mathbb{R}_+, \tag{3.23}$$
where M is a mean zero martingale with respect to the filtration $\mathbb{F} = (\mathcal{F}_t)$, $t \in \mathbb{R}_+$, which is generated by all introduced random quantities:
$$\mathcal{F}_t = \sigma(N_s, Y_s, X_i,\ 0 \le s \le t,\ i = 1, \ldots, N_t).$$

The failure rate process $h = (h_t)$, $t \in \mathbb{R}_+$, can be derived in the same way as was done for shock models with failures of threshold type (cf. p. 89). Note that ruin can only occur at a failure time; therefore, the ruin time is a hitting time of a compound point process:
$$\tau = \inf\Big\{t \in \mathbb{R}_+ : A_t = \sum_{n=1}^{N_t} B_n \ge u\Big\} = \inf\{T_n : A_{T_n} \ge u\},$$
where $B_n = X_n - cU_n$ and $U_n = T_n - T_{n-1}$, $n = 1, 2, \ldots$. Replacing $X_t$ by $A_t$, $r(t - T_n)$ by $\lambda_{Y_t}$, and the threshold S by u in formula (3.20) on p. 90, we get the following lemma.

Lemma 3.56. Let $\tau = \tau(u)$ be the ruin time and F the distribution of the claim sizes, $\bar F(x) = F((x, \infty)) = P(X_1 > x)$, $x \in \mathbb{R}$. Then the F-failure rate process h is given by
$$h_t = \lambda_{Y_t} \bar F(R_{t-}) = \sum_{i=1}^m \lambda_i I(Y_t = i) \bar F(R_{t-}), \quad t \in \mathbb{R}_+.$$

The failure rate process h is bounded above by $\max\{\lambda_i : i \in S\}$. If all claim arrival rates $\lambda_i$ coincide, $\lambda = \lambda_i$, $i \in S$, we have the classical Poisson case. It is then not surprising that the hazard rate decreases when the risk reserve increases and vice versa. Of course, the paths of R are not monotone and so the failure rate processes do not have monotone paths either. But they have (stochastically) a tendency to increase or decrease in the following sense. As follows from the results of Sect. 3.3.3, the process R has an F-semimartingale representation
$$R_t = u + \int_0^t \sum_{i=1}^m I(Y_s = i)(c - \lambda_i \mu)\,ds + L_t$$

with a mean zero F-martingale L. If we have positive drift in all environmental states, i.e., $c - \lambda_i\mu > 0$, $i = 1, \ldots, m$, then R is a submartingale and it is seen that h tends to 0 as $t \to \infty$ (P-a.s.). On the other hand, suppose that the claim rate $\lambda_{Y_t}$ is increasing (P-a.s.), that the drift is nonpositive for all states, i.e., $c - \lambda_i\mu \le 0$, $i = 1, \ldots, m$, and that $\bar F$ is convex on the support of the distribution. Then R is a supermartingale and it follows by Jensen's inequality for conditional expectations:
$$\begin{aligned} E[h_{t+s} \mid \mathcal{F}_t] &= E[\lambda_{Y_{t+s}} \bar F(R_{(t+s)-}) \mid \mathcal{F}_t] \ge E[\lambda_{Y_t} \bar F(R_{(t+s)-}) \mid \mathcal{F}_t] \\ &= \lambda_{Y_t} E[\bar F(R_{(t+s)-}) \mid \mathcal{F}_t] \ge \lambda_{Y_t} \bar F(E[R_{(t+s)-} \mid \mathcal{F}_t]) \\ &\ge \lambda_{Y_t} \bar F(R_{t-}) = h_t, \quad t, s \in \mathbb{R}_+. \end{aligned}$$
This shows that h is a submartingale, i.e., h is stochastically increasing.

Bounds for Finite Time Ruin Probabilities

Except in simple cases, such as Poisson arrivals of exponentially distributed claims (P/E case), the finite time ruin probabilities $\psi(u,t) = P(\tau(u) \le t)$ cannot be expressed by the basic model parameters in an explicit form. So there is a variety of suggested bounds and approximations (see Asmussen [9] and Grandell [78] for overviews). In the following, bounds for the ruin probabilities in finite time will be derived that are based on the semimartingale representation given in Lemma 3.56. It turns out that, especially for small values of t, known bounds can be improved.
From now on we assume that the claim arrival process is Poisson with rate λ > 0. Then Lemma 3.56 yields the representation
$$V_t = I(\tau(u) \le t) = \int_0^t I(\tau(u) > s)\,\lambda \bar F(R_s)\,ds + M_t, \quad t \in \mathbb{R}_+. \tag{3.24}$$
Note that the paths of R have only countably many jumps, so that under the integral sign $R_{s-}$ can be replaced by $R_s$. Taking expectations on both sides of (3.24), one gets by Fubini's theorem
$$\begin{aligned} \psi(u,t) &= \int_0^t E[I(\tau(u) > s)\,\lambda \bar F(R_s)]\,ds \\ &= \int_0^t (1 - \psi(u,s))\,\lambda E[\bar F(R_s) \mid \tau(u) > s]\,ds. \end{aligned} \tag{3.25}$$

As a solution of this integral equation we have the following representation of the finite time ruin probability:
$$\psi(u,t) = 1 - \exp\Big(-\int_0^t \lambda E[\bar F(R_s) \mid \tau(u) > s]\,ds\Big). \tag{3.26}$$
This shows that the (possibly defective) distribution of τ(u) has the hazard rate
$$\lambda E[\bar F(R_t) \mid \tau(u) > t].$$
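The step from the integral equation (3.25) to (3.26) can be spelled out as follows. The sketch assumes u > 0, so that ψ(u, 0) = 0, and ψ(u, t) < 1. For brevity write $g(s) = \lambda E[\bar F(R_s) \mid \tau(u) > s]$. The right-hand side of (3.25) is an integral in t, so ψ(u, ·) is absolutely continuous and, for almost every t,
$$\frac{\partial}{\partial t}\psi(u,t) = (1 - \psi(u,t))\,g(t) \quad\Longleftrightarrow\quad \frac{\partial}{\partial t}\ln\big(1 - \psi(u,t)\big) = -g(t).$$
Integrating from 0 to t with $\ln(1 - \psi(u,0)) = 0$ gives $1 - \psi(u,t) = \exp\big(-\int_0^t g(s)\,ds\big)$, which is (3.26).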

Now let $N^X$ be the renewal process generated by the sequence $(X_i)$, $i \in \mathbb{N}$,
$$N^X_t = \sup\Big\{k \in \mathbb{N}_0 : \sum_{i=1}^k X_i \le t\Big\}, \qquad A(u,t) = \int_0^t a(u,s)\,ds,$$
where $a(u,s) = \lambda P(N^X_{u+cs} = N_s)$. Then bounds for ψ(u,t) can be established.

Theorem 3.57. For all u, t ≥ 0, the following inequality holds true:
$$B(u,t) \le \psi(u,t) \le A(u,t),$$
where A is defined as above and $B(u,t) = 1 - \exp\{-\lambda \int_0^t \bar F(u + cs)\,ds\}$.

Proof. For the lower bound we use the representation (3.26) and simply observe that $E[\bar F(R_s) \mid \tau(u) > s] \ge \bar F(u + cs)$, since $R_s \le u + cs$ and $\bar F$ is nonincreasing.
For the upper bound we start with formula (3.24). Since $\{\tau(u) > t\} \subset \{R_t \ge 0\}$, we have
$$V_t = \int_0^t I(\tau(u) > s)\,\lambda \bar F(R_s)\,ds + M_t \le \int_0^t I(R_s \ge 0)\,\lambda \bar F(R_s)\,ds + M_t.$$
Taking expectations on both sides of this inequality we get
$$\psi(u,t) = EV_t \le \int_0^t \lambda E[I(R_s \ge 0) \bar F(R_s)]\,ds.$$
It remains to show that $a(u,t) = \lambda E[I(R_t \ge 0) \bar F(R_t)]$. Denote the k-fold convolution of F by $F^{*k}$ and let $T_k = \sum_{i=1}^k X_i$. By the independence of the claim arrival process and $(X_i)$, $i \in \mathbb{N}$, it follows that
$$\begin{aligned} E\Big[I(R_t \ge 0)\,\bar F\Big(u + ct - \sum_{i=1}^{N_t} X_i\Big)\Big] &= \sum_{k=0}^\infty E\Big[I\Big(u + ct - \sum_{i=1}^k X_i \ge 0\Big)\,\bar F\Big(u + ct - \sum_{i=1}^k X_i\Big)\Big] P(N_t = k) \\ &= \sum_{k=0}^\infty \int_0^{u+ct} \bar F(u + ct - x)\,dF^{*k}(x)\,P(N_t = k) \\ &= \sum_{k=0}^\infty \{F^{*k}(u+ct) - F^{*(k+1)}(u+ct)\}\,P(N_t = k) \\ &= \sum_{k=0}^\infty P(N^X_{u+ct} = k)\,P(N_t = k) \\ &= P(N^X_{u+ct} = N_t), \end{aligned}$$
which completes the proof.




The bounds of the theorem seem to have several advantages. As numerical examples show, they perform well especially for small values of t, for which $\psi(u,t) \ll \psi(u,\infty)$ (see Aven and Jensen [25]). In addition, no assumptions have been made about the tail of the claim size distribution F or the drift of the risk reserve process, although such assumptions are necessary for most of the asymptotic methods. On the other hand, this makes clear that one cannot expect these bounds to perform well for $t \to \infty$.
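The two bounds of Theorem 3.57 are easy to evaluate numerically in the P/E case. The Python sketch below assumes exponentially distributed claims with mean μ. Then $N^X$ is a Poisson process with rate 1/μ, and $a(u,s)$ becomes a sum of products of Poisson probabilities. The helper names, the truncation `kmax` of that sum and the Simpson step count are choices made for illustration. A crude Monte Carlo estimate of ψ(u, t) is included as a sanity check.

```python
import math
import random


def poisson_pmf(k, mean):
    """P(Poisson(mean) = k), evaluated in log space for numerical stability."""
    if mean == 0.0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def simpson(f, a, b, n=400):
    """Composite Simpson rule on [a, b] with an even number n of panels."""
    h = (b - a) / n
    s = f(a) + f(b) + sum((4 if k % 2 else 2) * f(a + k * h) for k in range(1, n))
    return s * h / 3


def ruin_bounds(u, t, lam, c, mu, kmax=200):
    """Lower bound B(u, t) and upper bound A(u, t) of Theorem 3.57 for
    Poisson(lam) claim arrivals and (assumed) exponential claims with mean mu."""
    def a(s):
        # a(u, s) = lam * P(N^X_{u+cs} = N_s): two independent Poisson counts
        m1, m2 = (u + c * s) / mu, lam * s
        return lam * sum(poisson_pmf(k, m1) * poisson_pmf(k, m2) for k in range(kmax))

    upper = simpson(a, 0.0, t)
    # B(u, t) = 1 - exp(-lam * int_0^t Fbar(u + cs) ds) with Fbar(x) = exp(-x/mu)
    lower = 1.0 - math.exp(-lam * simpson(lambda s: math.exp(-(u + c * s) / mu), 0.0, t))
    return lower, upper


def ruin_probability_mc(u, t, lam, c, mu, runs=40000, seed=0):
    """Crude Monte Carlo estimate of psi(u, t) in the same P/E model."""
    rng = random.Random(seed)
    ruined = 0
    for _ in range(runs):
        s, claims = 0.0, 0.0
        while True:
            s += rng.expovariate(lam)              # next claim arrival
            if s > t:
                break
            claims += rng.expovariate(1.0 / mu)    # claim size with mean mu
            if u + c * s - claims <= 0:
                ruined += 1
                break
    return ruined / runs


if __name__ == "__main__":
    lo, hi = ruin_bounds(u=1.0, t=1.0, lam=1.0, c=1.2, mu=1.0)
    print(lo, ruin_probability_mc(1.0, 1.0, 1.0, 1.2, 1.0), hi)
```

For u = 1, t = 1, λ = 1, c = 1.2 and μ = 1 the lower bound is approximately 0.193 and the upper bound about 0.25, and the Monte Carlo estimate falls between them. In the exponential case B(u, t) also has the closed form $1 - \exp\{-\lambda\mu e^{-u/\mu}(1 - e^{-ct/\mu})/c\}$, which is convenient for checking the quadrature.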
Bibliographic Notes. The book of Brémaud [50] is one of the basic
sources of the martingale dynamics of point process systems. The introduction
(p. XV) also contains a sketch of the historical development. The SSM app-
roach in connection with optimal stopping problems is considered by Jensen
[98]. Comprehensive overviews of lifetime models in the martingale framework are those of Arjas [3, 4] and Koch [108]. An essential basis for the presentation of point processes in the martingale framework was laid by Jacod [92].
A number of books on point processes are available now. Among others, the
martingale approach is exposed in Brémaud [50], Karr [103], and Daley and
Vere-Jones [58], which also include the basic results about marked point processes. A full account of marked point processes can be found in the monograph of Last and Brandt [115].
Details on the theory of Markov processes, briefly mentioned in Sect. 3.1,
can be found in the classic book of Dynkin [66] or in the more recent mono-
graphs on stochastic processes mentioned at the beginning of this chapter.
One of the first papers considering random hazard rates in lifetime models
is that of Bergman [38]. Failure rate processes for multivariate reliability sys-
tems were introduced by Arjas in [6]. Shock processes have been investigated
by a number of authors. Aven treated these processes in the framework of
counting processes in some generality in [15]. Recent work on shock models of
threshold type concentrates on deriving the distribution of the hitting (life-)
time under general conditions. Wendt [163] considers a doubly stochastic Pois-
son shock arrival process, whereas Lehmann [119] investigates shock models
with failure thresholds varying in time.
Models of minimal repairs have been considered by Barlow and Hunter [30],
Aven [18], Bergman [39], Block et al. [48], Stadje and Zuckerman [151], Shaked
and Shanthikumar [141], and Beichelt [35], among others. Our formulation of
the minimal repair concept in a general counting process framework is taken
from [24]. Varying degrees of repairs are investigated in a number of papers
like Brown and Proschan [51], Kijima [107], and Last and Szekli [116, 117].
As was pointed out by Bergman [39], information plays an important role
in minimal repair models. Further steps in investigating information-based
minimal repair were carried out by Arjas and Norros [7] and Natvig [124].
General references to risk theory are among others the books of Grandell
[77] and Rolski et al. [134]. Overviews of bounds and approximations of
ruin probabilities can be found in Asmussen [9] and Grandell [78]. Most of
the approximations are based on limit theorems for ψ(u, t) as u → ∞, t → ∞.
One of the exceptions is the inverse martingales technique used by Delbaen
and Haezendonck [60].
4
Availability Analysis of Complex Systems

In this chapter we establish methods and formulas for computing various per-
formance measures of monotone systems of repairable components. Emphasis
is placed on the point availability, the distribution of the number of failures
in a time interval, and the distribution of downtime of the system. A number
of asymptotic results are formulated and proved, mainly for systems having
highly available components.
The performance measures are introduced in Sect. 4.1. In Sects. 4.3–4.6 re-
sults for binary monotone systems are presented. Since many of these results
are based on the one-component case, we first give in Sect. 4.2 a rather compre-
hensive treatment of this case. Section 4.7 presents generalizations and related
models. Section 4.7.1 covers multistate monotone systems. In Sects. 4.2–4.5
and 4.7.1 it is assumed that there are at least as many repair facilities (chan-
nels) as components. In Sect. 4.7.2 we consider a parallel system having r
repair facilities, where r is less than the number of components. Attention is
drawn to the case with r = 1. Finally, in Sect. 4.7.3 we present models for
analysis of passive redundant systems.
In this chapter we focus on the situation that the components have ex-
ponential lifetime distributions. See Sect. 4.7.1, p. 163, and Bibliographic
Notes, p. 173, for some comments concerning the more general case of non-
exponential lifetimes.

4.1 Performance Measures


We consider a binary monotone system with state process (Φt ) = (Φ(Xt )),
as described in Sect. 2.1. Here Φt equals 1 if the system is functioning
at time t and 0 if the system is not functioning at time t, and $X_t = (X_t(1), X_t(2), \ldots, X_t(n)) \in \{0,1\}^n$ describes the states of the components.
The performance measures relate to one point in time t or an interval J,
which has the form [0, u] or (u, v], 0 < u < v. To simplify notation, we simply
write u instead of [0, u].

T. Aven and U. Jensen, Stochastic Models in Reliability, Stochastic Modelling 105


and Applied Probability 41, DOI 10.1007/978-1-4614-7894-2 4,
© Springer Science+Business Media New York 2013

Emphasis will be placed on the following performance measures:


(a) Point availability at time t, A(t), given by
$$A(t) = E\Phi_t = P(\Phi_t = 1).$$

(b) Let $N_J$ be equal to the number of system failures in the interval J. We consider the following performance measures:
$$P(N_J \le k),\ k \in \mathbb{N}_0, \qquad M(J) = EN_J,$$
$$A[u,v] = P(\Phi_t = 1,\ \forall t \in [u,v]) = P(\Phi_u = 1, N_{(u,v]} = 0).$$
The performance measure A[u, v] is referred to as the interval reliability.

(c) Let $Y_J$ denote the downtime in the interval J, i.e.,
$$Y_J = \int_J (1 - \Phi_t)\,dt.$$
We consider the performance measures
$$P(Y_J \le y),\ y \in \mathbb{R}_+, \qquad A_D(J) = \frac{EY_J}{|J|},$$
where |J| denotes the length of the interval J. In the literature, the measure $A_D(J)$ is sometimes referred to as the interval unavailability, but we shall not use this term here.
The above performance measures relate to a fixed point in time or a finite
time interval. Often it is more attractive, in particular from a computational
point of view, to consider the asymptotic limit of the measure (as t, u or
v → ∞), suitably normalized (in most cases such limits exist). In the following
we shall consider both the above measures and suitably defined limits.
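To make the definitions concrete, the following Python sketch evaluates the quantities entering these measures on a single sample path of a system that starts in the functioning state. Two things are assumptions for illustration: the path representation, a sorted list of switching times that alternate between failures and repair completions, and the function names. Averaging the returned values over many simulated paths gives Monte Carlo estimates of A(t), M(J), A[u, v] and $A_D(J)$.

```python
import bisect


def state(switches, t):
    """Phi_t for a path that starts up at time 0 and whose sorted switching
    times alternate failure, repair completion, failure, ...  The state is
    right-continuous: the system is down at a failure instant."""
    return 1 if bisect.bisect_right(switches, t) % 2 == 0 else 0


def failures_in(switches, u, v):
    """N_(u,v]: number of failure times (even list positions) in (u, v]."""
    return sum(1 for s in switches[0::2] if u < s <= v)


def downtime_in(switches, u, v):
    """Y_(u,v]: total length of the down periods inside (u, v]."""
    total = 0.0
    for k in range(0, len(switches), 2):
        start = switches[k]
        # a final failure without a recorded repair stays down indefinitely
        end = switches[k + 1] if k + 1 < len(switches) else float("inf")
        total += max(0.0, min(end, v) - max(start, u))
    return total


def up_throughout(switches, u, v):
    """Indicator of {Phi_t = 1 for all t in [u, v]} = {Phi_u = 1, N_(u,v] = 0}."""
    return state(switches, u) == 1 and failures_in(switches, u, v) == 0
```

For the path `[2.0, 3.0, 7.0, 7.5]` (failures at 2 and 7, repairs completed at 3 and 7.5), there are two failures in (0, 10], the downtime in (0, 10] is 1.5, and the system is up throughout [3, 6.9] but not throughout [3, 7].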

4.2 One-Component Systems


We consider in this section a one-component system. Hence Φt = Xt = Xt (1).
If the system fails, it is repaired or replaced. Let Tk , k ∈ N, represent the
length of the kth operation period, and let Rk , k ∈ N, represent the length
of the kth repair/replacement time for the system; see Fig. 4.1. We assume
that (Tk ), k ∈ N, and (Rk ), k ∈ N, are independent i.i.d. sequences of positive
random variables. We denote the probability distributions of Tk and Rk by F
and G, respectively, and assume that they have finite means, i.e.,
$$\mu_F < \infty, \qquad \mu_G < \infty.$$
In reliability engineering $\mu_F$ and $\mu_G$ are referred to as the mean time to failure (MTTF) and the mean time to repair (MTTR), respectively.

[Fig. 4.1. Time evolution of a failure and repair process for a one-component system starting at time t = 0 in the operating state: the state $X_t$ plotted against t over the successive periods $T_1, R_1, T_2, R_2, T_3$.]
To simplify the presentation, we also assume that F is an absolutely con-
tinuous distribution, i.e., F has a density function f and failure rate function
λ. We do not make the same assumption for the distribution function G, since
that would exclude discrete repair time distributions, which are often used in
practice.
In some cases we also need the variances of F and G, denoted $\sigma_F^2$ and $\sigma_G^2$,
respectively. In the following, when writing the variance of a random variable,
or any other moment, it is tacitly assumed that these are finite.
The sequence
T1 , R1 , T2 , R2 , · · ·
forms an alternating renewal process.
We introduce the following variables:
$$S_n = T_1 + \sum_{k=1}^{n-1} (R_k + T_{k+1}), \quad n \in \mathbb{N},$$
and
$$S_n^\circ = \sum_{k=1}^n (T_k + R_k), \quad n \in \mathbb{N}.$$
By convention, $S_0 = S_0^\circ = 0$, and sums over empty sets are zero. We see that $S_n$ represents the nth failure time, and $S_n^\circ$ represents the completion time of the nth repair.
The $S_n$ sequence generates a modified (delayed) renewal process N with renewal function M. The first interarrival time has distribution F. All other interarrival times have distribution $F * G$ (convolution of F and G), with mean $\mu_F + \mu_G$. Let $H^{(n)}$ denote the distribution function of $S_n$. Then
$$H^{(n)} = F * (F * G)^{*(n-1)},$$
where $B^{*n}$ denotes the n-fold convolution of a distribution B and, as usual, $B^{*0}$ equals the distribution with mass 1 at 0. Note that we have
$$M(t) = \sum_{n=1}^\infty H^{(n)}(t)$$
(cf. (B.2), p. 274, in Appendix B). The $S_n^\circ$ sequence generates an ordinary renewal process $N^\circ$ with renewal function $M^\circ$. The interarrival times, $T_k + R_k$, have distribution $F * G$, with mean $\mu_F + \mu_G$. Let $H^{\circ(n)}$ denote the distribution function of $S_n^\circ$. Then
$$H^{\circ(n)} = (F * G)^{*n}.$$
Let $\alpha_t$ denote the forward recurrence time at time t, i.e., the time from t to the next event:
$$\alpha_t = S_{N_t+1} - t \ \text{ on } \{X_t = 1\} \qquad\text{and}\qquad \alpha_t = S^\circ_{N^\circ_t+1} - t \ \text{ on } \{X_t = 0\}.$$
Hence, given that the system is up at time t, the forward recurrence time $\alpha_t$ equals the time to the next failure time. If the system is down at time t, the forward recurrence time equals the time to complete the repair. Let $F_{\alpha_t}$ and $G_{\alpha_t}$ denote the conditional distribution functions of $\alpha_t$ given that $X_t = 1$ and $X_t = 0$, respectively. Then we have for $x \in \mathbb{R}$
$$F_{\alpha_t}(x) = P(\alpha_t \le x \mid X_t = 1) = P(S_{N_t+1} - t \le x \mid X_t = 1)$$
and
$$G_{\alpha_t}(x) = P(\alpha_t \le x \mid X_t = 0) = P(S^\circ_{N^\circ_t+1} - t \le x \mid X_t = 0).$$
Similarly for the backward recurrence time, we define $\beta_t$, $F_{\beta_t}$, and $G_{\beta_t}$. The backward recurrence time $\beta_t$ equals the age of the system if the system is up at time t and the duration of the repair if the system is down at time t, i.e.,
$$\beta_t = t - S^\circ_{N^\circ_t} \ \text{ on } \{X_t = 1\} \qquad\text{and}\qquad \beta_t = t - S_{N_t} \ \text{ on } \{X_t = 0\}.$$

4.2.1 Point Availability

We will show that the point availability A(t) is given by
$$A(t) = \bar F(t) + \int_0^t \bar F(t - x)\,dM^\circ(x) = \bar F(t) + \bar F * M^\circ(t). \tag{4.1}$$
Using a standard renewal argument conditioning on the duration of $T_1 + R_1$, it is not difficult to see that A(t) satisfies the following equation:
$$A(t) = \bar F(t) + \int_0^t A(t - x)\,d(F * G)(x)$$
(cf. the derivation of the renewal equation in Appendix B, p. 275). Hence, by using Theorem B.2, p. 275, in Appendix B, formula (4.1) follows. Alternatively, we may use a more direct approach, writing
$$X_t = I(T_1 > t) + \sum_{n=1}^\infty I(S_n^\circ \le t,\ S_n^\circ + T_{n+1} > t),$$
which gives
$$A(t) = EX_t = \bar F(t) + \sum_{n=1}^\infty \int_0^t \bar F(t - x)\,dH^{\circ(n)}(x) = \bar F(t) + \int_0^t \bar F(t - x)\,dM^\circ(x).$$
The point unavailability $\bar A(t)$ is given by $\bar A(t) = 1 - A(t) = F(t) - \bar F * M^\circ(t)$.


In the case that F is exponential with failure rate λ, it can be shown that

Ā(t) ≤ λμG ,

see Proposition 4.11, p. 114.


By the Key Renewal Theorem (Theorem B.7, p. 277, in Appendix B), it follows that
$$\lim_{t \to \infty} A(t) = \frac{\mu_F}{\mu_F + \mu_G}, \tag{4.2}$$
noting that the mean of $F * G$ equals $\mu_F + \mu_G$ and $\int_0^\infty \bar F(t)\,dt = \mu_F$. The right-hand side of (4.2) is called the limiting availability (or steady-state availability) and is for short denoted A. The limiting unavailability is defined as $\bar A = 1 - A$. Usually $\mu_G$ is small compared to $\mu_F$, so that
$$\bar A = \frac{\mu_G}{\mu_F} + o\Big(\frac{\mu_G}{\mu_F}\Big), \qquad \frac{\mu_G}{\mu_F} \to 0.$$
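When no closed form is at hand, the renewal-type equation for A(t) can be solved numerically. The following Python sketch discretizes the Stieltjes integral with the trapezoidal rule on an equidistant grid. It assumes that $\bar F$ and the convolution $K = F * G$ are supplied as functions; the names `availability`, `surv_f` and `cdf_k` are hypothetical. The j = 1 term contains the unknown $A(t_i)$ itself, so each step solves a one-dimensional linear equation.

```python
import math


def availability(surv_f, cdf_k, t_max, n=1000):
    """Approximate A(t) on [0, t_max] from the renewal-type equation
        A(t) = Fbar(t) + int_0^t A(t - x) dK(x),   K = F * G,
    using the trapezoidal rule for the Stieltjes integral on n equal steps.
    surv_f(t) must return Fbar(t) and cdf_k(t) must return K(t).
    Returns the grid points t_i and the approximations A(t_i).
    """
    h = t_max / n
    grid = [i * h for i in range(n + 1)]
    # increments K(t_j) - K(t_{j-1}) of the cycle-length distribution
    dk = [0.0] + [cdf_k(grid[j]) - cdf_k(grid[j - 1]) for j in range(1, n + 1)]
    a = [surv_f(0.0)]
    for i in range(1, n + 1):
        # the j = 1 term contains the unknown A(t_i); move it to the left side
        rhs = surv_f(grid[i]) + 0.5 * dk[1] * a[i - 1]
        for j in range(2, i + 1):          # these terms use known values only
            rhs += 0.5 * (a[i - j] + a[i - j + 1]) * dk[j]
        a.append(rhs / (1.0 - 0.5 * dk[1]))
    return grid, a


if __name__ == "__main__":
    lam, nu = 1.0, 4.0                     # failure rate and repair rate
    ts, av = availability(
        lambda t: math.exp(-lam * t),
        # K = F * G for two exponential distributions with different rates
        lambda t: 1.0 - (nu * math.exp(-lam * t) - lam * math.exp(-nu * t)) / (nu - lam),
        t_max=5.0,
    )
    print(av[-1], nu / (lam + nu))         # both close to the limit in (4.2)
```

Exponential up- and downtimes (failure rate λ, repair rate ν) give a convenient check, since then $A(t) = \nu/(\lambda+\nu) + \lambda/(\lambda+\nu)\,e^{-(\lambda+\nu)t}$ in closed form. For λ = 1 and ν = 4 the computed values approach the limiting availability 0.8, and the unavailability stays below the bound $\lambda\mu_G = 0.25$ quoted above. For general F and G the convolution K would itself have to be computed numerically.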

4.2.2 The Distribution of the Number of System Failures

Consider first the interval [0, v]. We see that
$$\{N_v \le n\} = \{S_{n+1} > v\}, \quad n \in \mathbb{N}_0,$$
because if the number of failures in this interval is less than or equal to n, then the (n+1)th failure occurs after v, and vice versa. Thus, for $n \in \mathbb{N}_0$,
$$P(N_v \le n) = 1 - (F * G)^{*n} * F(v). \tag{4.3}$$
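Formula (4.3) can be checked by simulation. The Python sketch below counts failures of a simulated alternating renewal process; its function names are hypothetical. It also evaluates (4.3) in closed form for the special case in which F and G are exponential with a common rate. In that case $S_{n+1}$ is a sum of 2n + 1 independent exponential variables, i.e. Gamma distributed, so $P(S_{n+1} > v)$ is a Poisson probability.

```python
import math
import random


def count_failures(v, draw_up, draw_down, rng):
    """N_v: number of failures in [0, v] of an alternating renewal process
    starting in the operating state; draw_up/draw_down sample T_k and R_k."""
    t, n = 0.0, 0
    while True:
        t += draw_up(rng)        # next failure time S_{n+1}
        if t > v:
            return n
        n += 1
        t += draw_down(rng)      # completion of the repair


def prob_at_most_equal_rates(v, n, rate):
    """(4.3) when F and G are exponential with the same rate (v > 0):
    S_{n+1} ~ Gamma(2n + 1, rate), so P(S_{n+1} > v) = P(Poisson(rate*v) <= 2n)."""
    m = rate * v
    return sum(math.exp(k * math.log(m) - m - math.lgamma(k + 1)) for k in range(2 * n + 1))


if __name__ == "__main__":
    rng = random.Random(1)
    runs = 20000
    sims = [count_failures(3.0, lambda r: r.expovariate(1.0),
                           lambda r: r.expovariate(1.0), rng) for _ in range(runs)]
    for n in range(4):
        print(n, sum(s <= n for s in sims) / runs, prob_at_most_equal_rates(3.0, n, 1.0))
```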

Some closely related results are stated below in Propositions 4.1 and 4.2.

Proposition 4.1. The probability of n failures occurring in [0, v] and the system being up at time v is given by
$$P(N_v = n, X_v = 1) = \int_0^v \bar F(v - x)\,d(F * G)^{*n}(x), \quad n \in \mathbb{N}_0.$$
Proof. The result clearly holds for n = 0. For n ≥ 1, the result follows by observing that
$$\{N_v = n, X_v = 1\} = \{S_n^\circ + T_{n+1} > v,\ S_n^\circ \le v\}.$$

Proposition 4.2. The probability of n failures occurring in [0, v] and the system being down at time v is given by
$$P(N_v = n, X_v = 0) = \begin{cases} \int_0^v \bar G(v - x)\,dH^{(n)}(x) & n \in \mathbb{N}, \\ 0 & n = 0. \end{cases}$$
Proof. The proof is similar to the proof of Proposition 4.1. For $n \in \mathbb{N}$, it is seen that
$$\{N_v = n, X_v = 0\} = \{S_n + R_n > v,\ S_n \le v\}.$$

From Propositions 4.1 and 4.2 we can deduce several results, for example, a formula for $P(N_u = n \mid X_u = 1)$ using that
$$P(N_u = n \mid X_u = 1) = \frac{P(N_u = n, X_u = 1)}{A(u)}.$$
In the theorem below we establish general formulas for $P(N_{(u,v]} \le n)$ and A[u, v].

Theorem 4.3. The probability that at most n ($n \in \mathbb{N}_0$) failures occur during the interval (u, v] equals
$$P(N_{(u,v]} \le n) = [1 - F_{\alpha_u} * (F * G)^{*n}(v - u)]\,A(u) + [1 - G_{\alpha_u} * (F * G)^{*n} * F(v - u)]\,\bar A(u),$$
and
$$A[u,v] = \bar F_{\alpha_u}(v - u)\,A(u).$$
Proof. To establish the formula for $P(N_{(u,v]} \le n)$, we condition on the state of the system at time u:
$$P(N_{(u,v]} \le n) = \sum_{j=0}^1 P(N_{(u,v]} \le n \mid X_u = j)\,P(X_u = j).$$
From this equality the formula follows trivially for n = 0. For $n \in \mathbb{N}$, we need to show that the following two equalities hold true:
$$P(N_{(u,v]} > n \mid X_u = 1) = (F_{\alpha_u} * G) * (F * G)^{*(n-1)} * F(v - u), \tag{4.4}$$
$$P(N_{(u,v]} > n \mid X_u = 0) = G_{\alpha_u} * (F * G)^{*n} * F(v - u). \tag{4.5}$$
But (4.4) follows directly from (4.3) with the forward recurrence time distribution given $\{X_u = 1\}$ as the first operating time distribution. Formula (4.5) is established analogously.
The formula for A[u, v] is seen to hold observing that
$$A[u,v] = P(X_u = 1, N_{(u,v]} = 0) = A(u)\,P(N_{(u,v]} = 0 \mid X_u = 1) = A(u)\,P(\alpha_u > v - u \mid X_u = 1).$$
This completes the proof of the theorem.




If the downtimes are much smaller than the uptimes in probability (which is the common situation in practice), then N is close to a renewal process generated by all the uptimes. Hence, if the times to failure are exponentially distributed, the process N is close to a homogeneous Poisson process. Formal asymptotic results will be established later, see Sect. 4.4.
In the following two propositions we relate the distribution of the forward
and backward recurrence times and the renewal functions M and M ◦ .

Proposition 4.4. The probability that the system is up (down) at time t and the forward recurrence time at time t is greater than w is given by
$$A[t, t+w] = P(X_t = 1, \alpha_t > w) = \bar F(t + w) + \int_0^t \bar F(t - x + w)\,dM^\circ(x), \tag{4.6}$$
$$P(X_t = 0, \alpha_t > w) = \int_0^t \bar G(t - x + w)\,dM(x). \tag{4.7}$$
Proof. Consider first formula (4.6). It is not difficult to see that
$$X_t I(\alpha_t > w) = \sum_{n=0}^\infty I(S_n^\circ \le t,\ S_n^\circ + T_{n+1} > t + w). \tag{4.8}$$
By taking expectations we find that
$$P(X_t = 1, \alpha_t > w) = \bar F(t + w) + \sum_{n=1}^\infty \int_0^t \bar F(t - x + w)\,dH^{\circ(n)}(x) = \bar F(t + w) + \int_0^t \bar F(t - x + w)\,dM^\circ(x).$$
This proves (4.6). To prove (4.7) we use a similar argument writing
$$(1 - X_t) I(\alpha_t > w) = \sum_{n=1}^\infty I(S_n \le t,\ S_n + R_n > t + w). \tag{4.9}$$
This completes the proof of the proposition.

Proposition 4.5. The probability that the system is up (down) at time t and the backward recurrence time at time t is greater than w is given by
$$P(X_t = 1, \beta_t > w) = \begin{cases} \bar F(t) + \int_0^{t-w} \bar F(t - x)\,dM^\circ(x) & w \le t, \\ 0 & w > t, \end{cases} \tag{4.10}$$
$$P(X_t = 0, \beta_t > w) = \begin{cases} \int_0^{t-w} \bar G(t - x)\,dM(x) & w \le t, \\ 0 & w > t. \end{cases} \tag{4.11}$$
Proof. The proof is similar to the proof of Proposition 4.4. Replace the indicator function in the sums in (4.8) and (4.9) by
$$I(S_n^\circ + T_{n+1} > t,\ S_n^\circ + w < t) \qquad\text{and}\qquad I(S_n + R_n > t,\ S_n + w < t),$$
respectively.


Theorem 4.6. The asymptotic distributions of the state process $(X_t)$ and the forward (backward) recurrence times at time t are given by
$$\lim_{t\to\infty} P(X_t = 1, \alpha_t > w) = \frac{\int_w^\infty \bar F(x)\,dx}{\mu_F + \mu_G}, \qquad \lim_{t\to\infty} P(X_t = 0, \alpha_t > w) = \frac{\int_w^\infty \bar G(x)\,dx}{\mu_F + \mu_G},$$
$$\lim_{t\to\infty} P(X_t = 1, \beta_t > w) = \frac{\int_w^\infty \bar F(x)\,dx}{\mu_F + \mu_G}, \qquad \lim_{t\to\infty} P(X_t = 0, \beta_t > w) = \frac{\int_w^\infty \bar G(x)\,dx}{\mu_F + \mu_G}. \tag{4.12}$$
Proof. The results follow by applying the Key Renewal Theorem (see Appendix B, p. 277) to formulas (4.6), (4.7), (4.10), and (4.11).

Let us introduce
$$F_\infty(w) = \frac{\int_0^w \bar F(x)\,dx}{\mu_F}, \tag{4.13}$$
$$G_\infty(w) = \frac{\int_0^w \bar G(x)\,dx}{\mu_G}. \tag{4.14}$$

The distribution $F_\infty$ ($G_\infty$) is the asymptotic limit distribution of the forward and backward recurrence times in a renewal process generated by the uptimes (downtimes) and is called the equilibrium distribution for F (G), cf. Theorem B.13, p. 279, in Appendix B. We would expect $F_\infty$ and $G_\infty$ to equal the asymptotic distributions of the forward and backward recurrence times in the alternating renewal process. As shown in the following proposition, this is indeed true.

Proposition 4.7. The asymptotic distributions of the forward and backward recurrence times are given by
$$\lim_{t\to\infty} \bar F_{\alpha_t}(w) = \lim_{t\to\infty} \bar F_{\beta_t}(w) = \bar F_\infty(w)$$
and
$$\lim_{t\to\infty} \bar G_{\alpha_t}(w) = \lim_{t\to\infty} \bar G_{\beta_t}(w) = \bar G_\infty(w). \tag{4.15}$$
Proof. To establish these formulas, we use (4.2) (see p. 109), Theorem 4.6, and identities like
$$P(\alpha_t > w \mid X_t = 1) = \frac{P(X_t = 1, \alpha_t > w)}{A(t)}.$$

The following theorem expresses the asymptotic distribution of $N_{(t,t+w]}$ as a function of F, G, $F_\infty$, $G_\infty$ and A.

Theorem 4.8. For $n \in \mathbb{N}_0$,
$$\lim_{t\to\infty} P(N_{(t,t+w]} \le n) = [1 - F_\infty * (F * G)^{*n}(w)]\,A + [1 - G_\infty * (F * G)^{*n} * F(w)]\,\bar A.$$
Proof. The result follows from the expression for the distribution of the number of failures given in Theorem 4.3, p. 110, combined with the limiting availability formula (4.2), p. 109, and Proposition 4.7.


If the lifetime distribution F is exponential with failure rate λ, then, by the memoryless property, the conditional distribution of the forward recurrence time $\alpha_t$ given $X_t = 1$ is F for all t. It is easily verified from the expression (4.13) for the equilibrium distribution for F that $F_\infty(t) = F(t)$.
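For completeness, the verification amounts to inserting $\bar F(x) = e^{-\lambda x}$ and $\mu_F = 1/\lambda$ into (4.13):
$$F_\infty(w) = \frac{\int_0^w e^{-\lambda x}\,dx}{1/\lambda} = \lambda \cdot \frac{1 - e^{-\lambda w}}{\lambda} = 1 - e^{-\lambda w} = F(w).$$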
Next we consider an increasing interval (t, t+w], w → ∞. Then we can use
the normal distribution to find an approximate value for the distribution of N .
The asymptotic normality, as formulated in the following theorem, follows by
applying the Central Limit Theorem for renewal processes, see Theorem B.12,
p. 278, in Appendix B. The notation N (μ, σ 2 ) is used for the normal distri-
bution with mean μ and variance σ 2 .

Theorem 4.9. The asymptotic distribution of $N_{(t,t+w]}$ as $w \to \infty$ is given by
$$\frac{N_{(t,t+w]} - w/(\mu_F + \mu_G)}{[w(\sigma_F^2 + \sigma_G^2)/(\mu_F + \mu_G)^3]^{1/2}} \xrightarrow{D} \mathrm{N}(0, 1). \tag{4.16}$$
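In practice (4.16) is used as an approximation for large w. The following Python sketch returns the resulting approximate value of $P(N_{(t,t+w]} \le n)$. The function name is hypothetical, and the standard normal distribution function is expressed through `math.erf`.

```python
import math


def normal_approx_cdf(n, w, mu_f, mu_g, var_f, var_g):
    """Approximate P(N_(t,t+w] <= n) for large w via (4.16): N_(t,t+w] is
    approximately normal with mean w / (mu_f + mu_g) and variance
    w (var_f + var_g) / (mu_f + mu_g)**3."""
    mu = mu_f + mu_g
    mean = w / mu
    sd = math.sqrt(w * (var_f + var_g) / mu ** 3)
    z = (n - mean) / sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # standard normal cdf
```

As an example, take exponential up- and downtimes of mean 1, so that $\mu_F = \mu_G = \sigma_F^2 = \sigma_G^2 = 1$, and w = 400. The approximation at n = 210 is then $\Phi(1) \approx 0.841$. The exact value of $P(N_{400} \le 210)$ is, by (4.3), a Poisson(400) probability in this case, and it is close to the approximation. A continuity correction (replacing n by n + 1/2) is a common refinement, but it is not part of (4.16).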

The expected number of system failures can be found from the distribution function. Obviously, $M(v) \approx M^\circ(v)$ for large v. The exact relationship between M(v) and $M^\circ(v)$ is given in the following proposition.
Proposition 4.10. The difference between the renewal functions M(v) and $M^\circ(v)$ equals the unavailability at time v, i.e.,
$$M(v) = M^\circ(v) + \bar A(v).$$
Proof. Using that $P(N_v \le n) = 1 - (F * G)^{*n} * F(v)$ (by (4.3), p. 109) and the expression (4.1), p. 108, for the availability A(t), we obtain
$$M(v) = \sum_{n=1}^\infty P(N_v \ge n) = \sum_{n=0}^\infty (F * G)^{*n} * F(v) = F(v) + M^\circ * F(v) = M^\circ(v) + \bar A(v),$$
which is the desired result.




The number of system failures in [0, v], $N_v$, generates a counting process with stochastic intensity process
$$\eta_v = \lambda(\beta_v) X_v, \tag{4.17}$$
where λ is the failure rate function and $\beta_v$ is the backward recurrence time at time v, i.e., the relative age of the system at time v, cf. Sect. 3.3.2, p. 85. We have $m(v) = E\eta_v$, where m(v) is the renewal density of M(v). Thus if the system has an exponential lifetime distribution with failure rate λ,
$$m(v) = \lambda A(v). \tag{4.18}$$
In general,
$$m(v) \le \Big[\sup_{s \le v} \lambda(s)\Big] A(v). \tag{4.19}$$
This bound can be used to establish an upper bound also for the unavailability $\bar A(t)$.
Proposition 4.11. The unavailability at time t, $\bar A(t)$, satisfies
$$\bar A(t) \le \sup_{s \le t} \lambda(s) \int_0^t \bar G(u)\,du \le \Big[\sup_{s \le t} \lambda(s)\Big] \mu_G. \tag{4.20}$$

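Proof. From (4.7), p. 111, we have
$$\bar A(t) = P(X_t = 0) = \int_0^t \bar G(t - x)\,dM(x) = \int_0^t \bar G(t - x)\,m(x)\,dx. \tag{4.21}$$
Using (4.19) this gives
$$\bar A(t) \le \int_0^t \bar G(t - x) \Big[\sup_{s \le x} \lambda(s)\Big] A(x)\,dx.$$
Since $A(x) \le 1$ and $\sup_{s \le x} \lambda(s) \le \sup_{s \le t} \lambda(s)$ for $x \le t$, it follows that
$$\bar A(t) \le \sup_{s \le t} \lambda(s) \int_0^t \bar G(t - x)\,dx = \sup_{s \le t} \lambda(s) \int_0^t \bar G(u)\,du \le \Big[\sup_{s \le t} \lambda(s)\Big] \mu_G,$$
which proves (4.20).

Hence, if the system has an exponential lifetime distribution with failure rate λ, then
$$\bar A(t) \le \lambda \int_0^t \bar G(s)\,ds \le \lambda \mu_G. \tag{4.22}$$
It is also possible to establish lower bounds on $\bar A(t)$. A simple bound is obtained by combining (4.21) and the fact that
$$t \le ES_{N_t+1} \le (\mu_F + \mu_G)(1 + M(t))$$
(cf. Appendix B, p. 279). Since $\bar G$ is nonincreasing, this gives
$$\bar A(t) \ge \bar G(t) M(t) \ge \bar G(t) \Big(\frac{t}{\mu_F + \mu_G} - 1\Big).$$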
Now suppose at time t that the system is functioning and the relative age is u. What can we then say about the intensity process at time t + v (v > 0)? The probability distribution of $\eta_{t+v}$ is determined if we can find the distribution of the relative age at time t + v. But the relative age is given by (4.10), p. 112, slightly modified to take into account that the first uptime has distribution given by $F_u(x) = 1 - \bar F(u + x)/\bar F(u)$ for $0 \le u \le t$:
$$P(X_{t+v} = 1, \beta_{t+v} > w \mid X_t = 1, \beta_t = u) = \begin{cases} \bar F_u(v) + \int_0^{v-w} \bar F(v - x)\,dM^\circ(x) & w \le v, \\ 0 & w > v. \end{cases}$$
The asymptotic distribution, as $v \to \infty$, is the same as in formula (4.12), p. 112.
The (modified) renewal process $(N_t)$ has cycle lengths $T_k + R_k$ with mean $\mu_F + \mu_G$, $k \ge 2$. Thus we would expect that the (mean) average number of failures per unit of time is approximately equal to $1/(\mu_F + \mu_G)$ for large t. In the following theorem some asymptotic results are presented that give precise formulations of this idea.

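Theorem 4.12. With probability one,
$$\lim_{t\to\infty}\frac{N_t}{t} = \frac{1}{\mu_F+\mu_G}. \tag{4.23}$$
Furthermore,
$$\lim_{t\to\infty}\frac{EN_t}{t} = \frac{1}{\mu_F+\mu_G}, \tag{4.24}$$
$$\lim_{u\to\infty}E[N_{u+w}-N_u] = \frac{w}{\mu_F+\mu_G}, \tag{4.25}$$
$$\lim_{t\to\infty}\Big(EN_t - \frac{t}{\mu_F+\mu_G}\Big) = \frac{\sigma_F^2+\sigma_G^2}{2(\mu_F+\mu_G)^2} - \frac12.$$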

Proof. These results follow directly from renewal theory; see Appendix B, pp. 276–278. □
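Theorem 4.12 can be illustrated by simulation. The sketch below assumes Weibull uptimes and uniform repair times (both arbitrary, hypothetical choices). It simulates one long path of the alternating up/down process and compares N_t/t with 1/(μF + μG), as in (4.23).

```python
import math
import random

def failures_up_to(t_end, draw_up, draw_down, rng):
    """Number of failures N_t in [0, t_end] for an alternating up/down process
    that starts in the up state at time 0."""
    t, n = 0.0, 0
    while True:
        t += draw_up(rng)       # end of the current uptime: a failure
        if t > t_end:
            return n
        n += 1
        t += draw_down(rng)     # repair completed, next uptime starts

# Hypothetical distributions: Weibull uptimes (scale 10, shape 2), Uniform(0, 2) repairs.
scale, shape = 10.0, 2.0
mu_F = scale * math.gamma(1.0 + 1.0 / shape)   # mean Weibull uptime, about 8.86
mu_G = 1.0                                      # mean of Uniform(0, 2)

rng = random.Random(42)
t_end = 200_000.0
n_t = failures_up_to(t_end, lambda r: r.weibullvariate(scale, shape),
                     lambda r: r.uniform(0.0, 2.0), rng)
print(n_t / t_end, 1.0 / (mu_F + mu_G))   # the two numbers should be close
```

`random.weibullvariate(alpha, beta)` takes the scale first and the shape second, so its mean is `alpha * gamma(1 + 1/beta)`.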


4.2.3 The Distribution of the Downtime in a Time Interval

First we formulate and prove some results related to the mean of the downtime
in the interval [0, u]. As before (cf. Sect. 4.1, p. 106), we let Yu represent the
downtime in the interval [0, u].

Theorem 4.13. The expected downtime in [0, u] is given by
$$EY_u = \int_0^u \bar A(t)\,dt. \tag{4.26}$$
Asymptotically, the (expected) portion of time the system is down equals the limiting unavailability, i.e.,
$$\lim_{u\to\infty}A_D(u) = \lim_{u\to\infty}\frac{EY_u}{u} = \bar A. \tag{4.27}$$
With probability one,
$$\lim_{u\to\infty}\frac{Y_u}{u} = \bar A. \tag{4.28}$$

Proof. Using the definition of Y_u and Fubini's theorem we find that
$$EY_u = E\int_0^u(1-\Phi_t)\,dt = \int_0^u E(1-\Phi_t)\,dt = \int_0^u \bar A(t)\,dt.$$

This proves (4.26). Formula (4.27) follows from (4.26) and the limiting availability formula (4.2), p. 109: since Ā(t) → Ā as t → ∞, the time average $u^{-1}\int_0^u \bar A(t)\,dt$ converges to the same limit. Alternatively, we can use the Renewal Reward Theorem (Theorem B.15, p. 280, in Appendix B), interpreting Y_u as a reward. From this theorem we can conclude that EY_u/u converges to the ratio of the expected downtime in a renewal cycle and the expected length of a cycle, i.e., to the limiting unavailability Ā. The Renewal Reward Theorem also proves (4.28). □
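Formula (4.26) can be checked numerically in the exponential case. The sketch assumes exponential uptimes and repair times (hypothetical rates `lam` and `mu`), so that Ā(t) has a closed form. It compares a trapezoidal approximation of the integral in (4.26) with the exact integral and shows EY_u/u approaching Ā, in line with (4.27).

```python
import math

def a_bar(t, lam, mu):
    """Ā(t) for exponential uptimes (rate lam) and repairs (rate mu), up at time 0."""
    s = lam + mu
    return lam / s * (1.0 - math.exp(-s * t))

def expected_downtime(u, lam, mu, steps=20_000):
    """EY_u = ∫_0^u Ā(t) dt, formula (4.26), by the trapezoidal rule."""
    h = u / steps
    inner = sum(a_bar(k * h, lam, mu) for k in range(1, steps))
    return h * (0.5 * a_bar(0.0, lam, mu) + inner + 0.5 * a_bar(u, lam, mu))

def expected_downtime_exact(u, lam, mu):
    """Closed-form integral of Ā(t) for the same model."""
    s = lam + mu
    return lam / s * (u - (1.0 - math.exp(-s * u)) / s)

lam, mu = 0.1, 1.0           # hypothetical rates
limit = lam / (lam + mu)     # Ā = μG/(μF + μG)
for u in (1.0, 10.0, 100.0, 10_000.0):
    print(u, expected_downtime(u, lam, mu) / u,
          expected_downtime_exact(u, lam, mu) / u, limit)
```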


Now we look into the problem of finding formulas for the downtime distribution.
Let $N^{op}_s$ denote the number of system failures after s units of operational time, i.e.,
$$N^{op}_s = \sum_{n=1}^{\infty} I\Big(\sum_{k=1}^n T_k \le s\Big).$$

Note that
$$N^{op}_s \ge n \iff \sum_{k=1}^n T_k \le s, \quad n\in\mathbb N. \tag{4.29}$$
k=1

Let Z_s denote the total downtime associated with the operating time s, but not including s, i.e.,
$$Z_s = \sum_{i=1}^{N^{op}_{s-}} R_i,$$
where
$$N^{op}_{s-} = \lim_{u\to s-}N^{op}_u.$$
Define
$$C_s = s + Z_s.$$
We see that C_s represents the calendar time at which the system has accumulated s units of operating time and the repairs of all failures that occurred up to, but not including, operating time s have been completed.
The following theorem gives an exact expression for the probability distribution of Y_u, the total downtime in [0, u].

Theorem 4.14. The distribution of the downtime in a time interval [0, u] is given by
$$P(Y_u\le y) = \sum_{n=0}^\infty G^{*n}(y)\,P(N^{op}_{u-y}=n) \tag{4.30}$$
$$= \sum_{n=0}^\infty G^{*n}(y)\big[F^{*n}(u-y) - F^{*(n+1)}(u-y)\big]. \tag{4.31}$$

Proof. To prove the theorem we first argue that
$$P(Y_u\le y) = P(C_{u-y}\le u) = P(u-y+Z_{u-y}\le u) = P(Z_{u-y}\le y).$$
The first equality follows by noting that the event {Y_u ≤ y} is equivalent to the event that the uptime in the interval [0, u] is equal to or longer than u − y. This means that the point in time when the total uptime of the system equals u − y must occur before or at u, i.e., C_{u−y} ≤ u. Now a standard conditional probability argument gives
$$P(Z_{u-y}\le y) = \sum_{n=0}^\infty P(Z_{u-y}\le y \mid N^{op}_{(u-y)-}=n)\,P(N^{op}_{(u-y)-}=n)$$
$$= \sum_{n=0}^\infty G^{*n}(y)\,P(N^{op}_{(u-y)-}=n)$$
$$= \sum_{n=0}^\infty G^{*n}(y)\,P(N^{op}_{u-y}=n).$$
We have used that the repair times are independent of the process $(N^{op}_s)$ and that F is continuous, so that $N^{op}_{(u-y)-} = N^{op}_{u-y}$ with probability one. This proves (4.30). Formula (4.31) follows because, by (4.29), $P(N^{op}_x = n) = F^{*n}(x) - F^{*(n+1)}(x)$. □
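As a sanity check of (4.31), the sketch below assumes F = Exp(λ) and G = Exp(μ) with hypothetical parameters. Under these assumptions, $F^{*n}(x) - F^{*(n+1)}(x)$ is a Poisson probability and $G^{*n}$ is an Erlang distribution function. The series can therefore be summed directly and compared with a Monte Carlo estimate of P(Y_u ≤ y). Truncating the series at `n_max` terms is an approximation.

```python
import math
import random

def poisson_pmf(n, m):
    """P(Poisson(m) = n), computed on the log scale to avoid overflow."""
    if m == 0.0:
        return 1.0 if n == 0 else 0.0
    return math.exp(-m + n * math.log(m) - math.lgamma(n + 1))

def downtime_cdf(u, y, lam, mu, n_max=200):
    """P(Y_u <= y) from (4.31) with F = Exp(lam) and G = Exp(mu)."""
    if y >= u:
        return 1.0
    m = lam * (u - y)
    total, cum = 0.0, 0.0          # cum = sum_{k<n} P(Poisson(mu*y) = k)
    for n in range(n_max + 1):
        g_n = 1.0 if n == 0 else max(0.0, 1.0 - cum)   # Erlang(n, mu) cdf at y
        total += g_n * poisson_pmf(n, m)
        cum += poisson_pmf(n, mu * y)
    return total

def simulate_downtime(u, lam, mu, rng):
    """One realisation of Y_u, the downtime in [0, u], starting in the up state."""
    t, down = 0.0, 0.0
    while True:
        t += rng.expovariate(lam)          # uptime
        if t >= u:
            return down
        r = rng.expovariate(mu)            # repair time
        down += min(r, u - t)              # count only the part inside [0, u]
        t += r
        if t >= u:
            return down

u, y, lam, mu = 10.0, 1.0, 0.5, 2.0        # hypothetical parameters
rng = random.Random(7)
reps = 20_000
mc = sum(simulate_downtime(u, lam, mu, rng) <= y for _ in range(reps)) / reps
print(downtime_cdf(u, y, lam, mu), mc)
```

For y = 0 the series reduces to e^{−λu}, the probability of no failure in [0, u], which gives a quick check of the implementation.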


In the case that F is exponential with failure rate λ, the following simple bounds apply:
$$e^{-\lambda(u-y)}\big[1+\lambda(u-y)G(y)\big] \le P(Y_u\le y) \le e^{-\lambda(u-y)[1-G(y)]}.$$
The lower bound follows by including only the first two terms of the sum in (4.30), observing that $N^{op}_t$ is Poisson distributed with mean λt. The upper bound follows by using (4.30) and the inequality
$$G^{*n}(y) \le (G(y))^n,$$
and summing the resulting Poisson series: with s = u − y, $\sum_{n\ge0}(G(y))^n e^{-\lambda s}(\lambda s)^n/n! = e^{-\lambda s[1-G(y)]}$.
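The two bounds can be compared with the exact series when the repair times are also exponential. The sketch below uses hypothetical parameters and truncates the series at `n_max` terms. It evaluates the lower bound, (4.30) and the upper bound for a few values of y.

```python
import math

def poisson_pmf(n, m):
    """P(Poisson(m) = n), on the log scale to avoid overflow."""
    if m == 0.0:
        return 1.0 if n == 0 else 0.0
    return math.exp(-m + n * math.log(m) - math.lgamma(n + 1))

def exact_cdf(u, y, lam, mu, n_max=200):
    """(4.30) with Exp(lam) uptimes and Exp(mu) repairs (Erlang convolutions)."""
    m, total, cum = lam * (u - y), 0.0, 0.0
    for n in range(n_max + 1):
        g_n = 1.0 if n == 0 else max(0.0, 1.0 - cum)
        total += g_n * poisson_pmf(n, m)
        cum += poisson_pmf(n, mu * y)
    return total

def simple_bounds(u, y, lam, mu):
    """Lower and upper bounds on P(Y_u <= y); G(y) = 1 - exp(-mu*y) here."""
    s, g = u - y, 1.0 - math.exp(-mu * y)
    lower = math.exp(-lam * s) * (1.0 + lam * s * g)
    upper = math.exp(-lam * s * (1.0 - g))
    return lower, upper

lam, mu, u = 0.5, 2.0, 10.0   # hypothetical parameters
for y in (0.0, 0.5, 1.0, 2.0, 5.0):
    lo, hi = simple_bounds(u, y, lam, mu)
    print(y, round(lo, 4), round(exact_cdf(u, y, lam, mu), 4), round(hi, 4))
```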

In the case that the interval is rather long, the downtime will be approximately
normally distributed, as is shown in Theorem 4.15 below.

Fig. 4.2. Time evolution of a failure and repair process for a one-component system starting at time t = 0 in the failure state (X_t plotted against t; the downtimes R1, R2, R3 alternate with the uptimes T1, T2, T3).

Theorem 4.15. The asymptotic distribution of Y_u as u → ∞ is given by
$$\sqrt u\Big(\frac{Y_u}{u}-\bar A\Big) \xrightarrow{D} N(0,\tau^2), \tag{4.32}$$
where
$$\tau^2 = \frac{\mu_F^2\sigma_G^2 + \mu_G^2\sigma_F^2}{(\mu_F+\mu_G)^3}.$$
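Proof. The result follows by applying Theorem B.17, p. 280, in Appendix B. The length of the first renewal cycle equals $S_1^\circ = T_1 + R_1$, and the downtime in this cycle equals $Y_{S_1^\circ} = R_1$. Using $A = 1-\bar A = \mu_F/(\mu_F+\mu_G)$ and the independence of T_1 and R_1, we obtain
$$\frac{\mathrm{Var}[R_1-\bar A S_1^\circ]}{ES_1^\circ} = \frac{\mathrm{Var}[R_1A - T_1\bar A]}{ES_1^\circ} = \frac{A^2\,\mathrm{Var}[R_1] + \bar A^2\,\mathrm{Var}[T_1]}{\mu_F+\mu_G} = \frac{\mu_F^2\sigma_G^2+\mu_G^2\sigma_F^2}{(\mu_F+\mu_G)^3}.$$
□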
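Theorem 4.15 can be illustrated by simulating many independent copies of Y_u for a large u. The sketch assumes exponential uptimes and repair times with hypothetical means μF = 9 and μG = 1, for which τ² = 0.162. It compares the sample variance of √u(Y_u/u − Ā) with τ².

```python
import random
import statistics

def downtime(u, mean_up, mean_down, rng):
    """Y_u for exponential uptimes/downtimes (hypothetical choice), up at time 0."""
    t, down = 0.0, 0.0
    while True:
        t += rng.expovariate(1.0 / mean_up)
        if t >= u:
            return down
        r = rng.expovariate(1.0 / mean_down)
        down += min(r, u - t)          # only downtime inside [0, u] counts
        t += r
        if t >= u:
            return down

mu_F, mu_G = 9.0, 1.0                  # hypothetical means
var_F, var_G = mu_F**2, mu_G**2        # exponential: variance = mean^2
a_bar = mu_G / (mu_F + mu_G)
tau2 = (mu_F**2 * var_G + mu_G**2 * var_F) / (mu_F + mu_G) ** 3

u, reps = 2000.0, 2000
rng = random.Random(1)
z = [u**0.5 * (downtime(u, mu_F, mu_G, rng) / u - a_bar) for _ in range(reps)]
print(statistics.mean(z), statistics.variance(z), tau2)
```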


4.2.4 Steady-State Distribution

The asymptotic results established above provide good approximations for the performance measures related to a given point in time or an interval. Based on the asymptotic values we can define a stationary (steady-state) process whose distributions and means equal these asymptotic values. To define such a process in our case, we generalize the model analyzed above by allowing X0 to be 0 or 1.
Thus the time evolution of the process is as shown in Fig. 4.2, or as shown in Fig. 4.1 (p. 107) beginning with an uptime. The process is characterized by the parameters A(0), F*(t), F(t), G*(t), G(t). Here F*(t) denotes the distribution of the first uptime provided that the system starts in state 1 at time 0 (i.e., X0 = 1), and G*(t) denotes the distribution of the first downtime

provided that the system starts in state 0 at time 0 (i.e., X0 = 0). If we now assume that F*(t) and G*(t) are equal to the asymptotic distributions of the recurrence times, i.e., F∞(t) and G∞(t), respectively, and that A(0) = A, it can be shown that the process (Xt, αt) is stationary; see Birolini [44]. This means that we have, for example,
$$A(t) = A, \quad \forall t\in\mathbb R_+,$$
$$A[u,u+w] = \frac{\int_w^\infty \bar F(x)\,dx}{\mu_F+\mu_G}, \quad \forall u,w\in\mathbb R_+,$$
$$M(u,u+w] = \frac{w}{\mu_F+\mu_G}, \quad \forall u,w\in\mathbb R_+.$$
μF + μG

4.3 Point Availability and Mean Number of System Failures

Consider now a monotone system comprising n independent components. For each component we define a model as in Sect. 4.2, indexed by “i”. The uptimes
and downtimes of component i are thus denoted Tik and Rik with distributions
Fi and Gi , respectively. The lifetime distribution Fi is absolutely continuous
with a failure rate function λi (t). The process (Nt ) refers now to the number of
system failures, whereas (Nt (i)) counts the number of failures of component i.
The counting process (Nt (i)) has intensity process (ηt (i)) = (λi (βt (i))Xt (i)),
where (Xt (i)) equals the state process of component i and (βt (i)) the backward
recurrence time of component i. The mean of (Nt (i)) is denoted Mi (t), whereas
the mean of the renewal process having interarrival times Tik + Rik , k ∈ N, is
denoted Mi◦ (t). If the process (Xt ) is regenerative, we denote the consecutive
cycle lengths S1 , S2 , . . .. We write S in place of S1 . Remember that a stochastic
process (Xt ) is called regenerative if there exists a finite random variable S
such that the process beyond S is a probabilistic replica of the process starting
at 0. The precise definition is given in Appendix B, p. 281.
In the following we establish results similar to those obtained in the previous section. Some results are quite easy to generalize to monotone systems, while others are extremely difficult. Simplifications and approximate methods are therefore sought. First we look at the point availability.

4.3.1 Point Availability

The following results show that the point availability (limiting availability) of
a monotone system is equal to the reliability function h with the component
reliabilities replaced by the component availabilities Ai (t) (Ai ).

Theorem 4.16. The system availability at time t, A(t), and the limiting sys-
tem availability, limt→∞ A(t), are given by

$$A(t) = h(A_1(t), A_2(t), \ldots, A_n(t)) = h(\mathbf A(t)), \tag{4.33}$$
$$\lim_{t\to\infty}A(t) = h(A_1, A_2, \ldots, A_n) = h(\mathbf A). \tag{4.34}$$

Proof. Formula (4.33) is simply an application of the reliability function formula (2.2), see p. 21, with A_i(t) = P(X_t(i) = 1), using the independence of the components. The reliability function h(p) is linear in each p_i (see Sect. 2.1, p. 25) and therefore continuous. It follows that A(t) → h(A1, A2, ..., An) as t → ∞, which proves (4.34). □


The limiting system availability can also be interpreted as the expected portion of time the system is operating in the long run, or as the long-run average availability, noting that
$$\lim_{t\to\infty}E\Big[\frac1t\int_0^t\Phi_s\,ds\Big] = \lim_{t\to\infty}\frac1t\int_0^t A(s)\,ds = \lim_{t\to\infty}A(t).$$
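For small systems, h can be evaluated by brute force, which makes (4.33)–(4.34) easy to apply. The sketch below is only an illustration. It enumerates all 2^n states, so its cost is exponential in n, and the structure function and the mean up- and downtimes are hypothetical.

```python
from itertools import product

def reliability_function(phi, p):
    """h(p) = E[phi(X)] for independent components with P(X_i = 1) = p_i,
    computed by summing over all 2^n state vectors."""
    total = 0.0
    for x in product((0, 1), repeat=len(p)):
        prob = 1.0
        for xi, pi in zip(x, p):
            prob *= pi if xi else 1.0 - pi
        total += prob * phi(x)
    return total

# Hypothetical system: component 1 in series with the parallel pair {2, 3}.
phi = lambda x: x[0] * (1 - (1 - x[1]) * (1 - x[2]))
mu_F = [100.0, 50.0, 50.0]    # hypothetical mean uptimes
mu_G = [1.0, 5.0, 5.0]        # hypothetical mean repair times
A = [f / (f + g) for f, g in zip(mu_F, mu_G)]   # limiting component availabilities
print(reliability_function(phi, A))             # limiting system availability h(A), (4.34)
```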

4.3.2 Mean Number of System Failures

We first state some results established in Sect. 3.3.2, cf. formula (3.18), p. 86.
See also (4.17) and (4.18), p. 114.

Theorem 4.17. The expected number of system failures in [0, u] is given by
$$EN_u = \sum_{i=1}^n\int_0^u\big[h(1_i,\mathbf A(t)) - h(0_i,\mathbf A(t))\big]\,dM_i(t) \tag{4.35}$$
$$= \sum_{i=1}^n\int_0^u\big[h(1_i,\mathbf A(t)) - h(0_i,\mathbf A(t))\big]\,m_i(t)\,dt$$
$$= \sum_{i=1}^n\int_0^u\big[h(1_i,\mathbf A(t)) - h(0_i,\mathbf A(t))\big]\,E\eta_t(i)\,dt,$$
where m_i(t) is the renewal density function of M_i(t).

Corollary 4.18. If component i has constant failure rate λi, i = 1, 2, ..., n, then
$$EN_u = \sum_{i=1}^n\int_0^u\big[h(1_i,\mathbf A(t)) - h(0_i,\mathbf A(t))\big]\lambda_iA_i(t)\,dt \tag{4.36}$$
$$\le u\tilde\lambda,$$
where $\tilde\lambda = \sum_{i=1}^n\lambda_i$.
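Formula (4.36) is easy to evaluate numerically when the repair times are also exponential, because A_i(t) is then explicit. The sketch below assumes a hypothetical parallel system of two components, for which h(1_1, A(t)) − h(0_1, A(t)) = 1 − A_2(t). It checks the bound EN_u ≤ uλ̃ and shows EN_u/u approaching the limit Σ_i [h(1_i, A) − h(0_i, A)]λ_iA_i (cf. (4.37), using 1/(μFi + μGi) = λ_iA_i).

```python
import math

def availability(t, lam, mu):
    """A_i(t) for exponential uptimes (rate lam) and repairs (rate mu), up at time 0."""
    s = lam + mu
    return mu / s + lam / s * math.exp(-s * t)

def expected_system_failures(u, lams, mus, steps=20_000):
    """EN_u from (4.36) for two components in parallel, by the trapezoidal rule."""
    def integrand(t):
        a1 = availability(t, lams[0], mus[0])
        a2 = availability(t, lams[1], mus[1])
        # component i causes a system failure only if the other one is down
        return lams[0] * a1 * (1 - a2) + lams[1] * a2 * (1 - a1)
    h = u / steps
    inner = sum(integrand(k * h) for k in range(1, steps))
    return h * (0.5 * integrand(0.0) + inner + 0.5 * integrand(u))

lams, mus = (0.05, 0.02), (1.0, 0.5)       # hypothetical failure and repair rates
lam_tilde = sum(lams)
A = [m / (l + m) for l, m in zip(lams, mus)]
lam_phi = lams[0] * A[0] * (1 - A[1]) + lams[1] * A[1] * (1 - A[0])
for u in (10.0, 100.0, 10_000.0):
    en = expected_system_failures(u, lams, mus)
    print(u, en, en / u, lam_phi, u * lam_tilde)
```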

Next we will generalize the asymptotic results (4.23)–(4.25), p. 116.



Theorem 4.19. The expected number of system failures per unit of time is asymptotically given by
$$\lim_{u\to\infty}\frac{EN_u}{u} = \sum_{i=1}^n\frac{h(1_i,\mathbf A)-h(0_i,\mathbf A)}{\mu_{F_i}+\mu_{G_i}}, \tag{4.37}$$
$$\lim_{u\to\infty}\frac{EN_{(u,u+w]}}{w} = \sum_{i=1}^n\frac{h(1_i,\mathbf A)-h(0_i,\mathbf A)}{\mu_{F_i}+\mu_{G_i}}. \tag{4.38}$$
Furthermore, if the process X is a regenerative process having finite expected cycle length, i.e., ES < ∞, then with probability one,
$$\lim_{u\to\infty}\frac{N_u}{u} = \sum_{i=1}^n\frac{h(1_i,\mathbf A)-h(0_i,\mathbf A)}{\mu_{F_i}+\mu_{G_i}}. \tag{4.39}$$

Proof. To prove these results, we make use of formula (4.35). Dividing this formula by u and using the Elementary Renewal Theorem (see Appendix B, p. 277), formula (4.37) can be shown to hold, noting that E[Φ(1i, Xt) − Φ(0i, Xt)] → h(1i, A) − h(0i, A) as t → ∞. Let $h_i^*(t) = E[\Phi(1_i,X_t) - \Phi(0_i,X_t)]$ and let $h_i^*$ denote its limit as t → ∞. Then we can write formula (4.35) divided by u in the following form:
$$\sum_{i=1}^n\Big(h_i^*\,\frac{M_i(u)}{u} + \frac1u\int_0^u\big[h_i^*(t)-h_i^*\big]\,dM_i(t)\Big).$$

Hence, in view of the Elementary Renewal Theorem, formula (4.37) follows if
$$\lim_{u\to\infty}\frac1u\int_0^u\big[h_i^*(t)-h_i^*\big]\,dM_i(t) = 0. \tag{4.40}$$
But (4.40) holds true by Proposition B.14, p. 279, in Appendix B.


Formula (4.38) is shown by writing
$$E[N_{u+w}-N_u] = \sum_{i=1}^n\int_u^{u+w}E[\Phi(1_i,X_t)-\Phi(0_i,X_t)]\,dM_i(t)$$
and using Blackwell's Theorem; see Theorem B.9, p. 278, in Appendix B.


If we assume that the process X is regenerative with ES < ∞, it follows from the theory of renewal reward processes (see Appendix B, p. 280) that, with probability one, $\lim_{u\to\infty}N_u/u$ exists and equals
$$\lim_{u\to\infty}\frac{EN_u}{u} = \frac{EN_S}{ES}.$$
Combining this with (4.37), we can conclude that (4.39) holds true, and the proof of the theorem is complete. □


Definition 4.20. The limit of EN_u/u, given by formula (4.37), is referred to as the system failure rate and is denoted λΦ, i.e.,
$$\lambda_\Phi = \lim_{u\to\infty}\frac{EN_u}{u} = \sum_{i=1}^n\frac{h(1_i,\mathbf A)-h(0_i,\mathbf A)}{\mu_{F_i}+\mu_{G_i}}. \tag{4.41}$$
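Formula (4.41) only needs the reliability function and the component means. The following minimal sketch assumes a small system whose structure function is given as a Python function (hypothetical examples) and evaluates h by enumerating all states.

```python
from itertools import product

def reliability_function(phi, p):
    """h(p): sum over all state vectors x of phi(x) * P(X = x), independent components."""
    total = 0.0
    for x in product((0, 1), repeat=len(p)):
        prob = 1.0
        for xi, pi in zip(x, p):
            prob *= pi if xi else 1.0 - pi
        total += prob * phi(x)
    return total

def system_failure_rate(phi, mu_F, mu_G):
    """λΦ from (4.41): sum_i [h(1_i, A) - h(0_i, A)] / (μ_Fi + μ_Gi)."""
    A = [f / (f + g) for f, g in zip(mu_F, mu_G)]
    rate = 0.0
    for i in range(len(A)):
        up, down = A.copy(), A.copy()
        up[i], down[i] = 1.0, 0.0          # component i forced up / down
        critical = reliability_function(phi, up) - reliability_function(phi, down)
        rate += critical / (mu_F[i] + mu_G[i])
    return rate

parallel = lambda x: max(x)                # two components in parallel
print(system_failure_rate(parallel, [20.0, 20.0], [1.0, 1.0]))
```

With μF = 20 (λ = 0.05) and μG = 1 for both components, each term equals (1/21)/21, so λΦ = 2/441 ≈ 0.0045.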

Remark 4.21. 1. Heuristically, the limit (4.37) is easily established. In the interval (t, t + w), t large and w small, the probability that component i fails is approximately w/(μFi + μGi), and this failure implies a system failure if Φ(1i, Xt) = 1 and Φ(0i, Xt) = 0, i.e., the system fails if component i fails. But the probability that Φ(1i, Xt) = 1 and Φ(0i, Xt) = 0 is approximately equal to h(1i, A) − h(0i, A), which gives the desired result.
2. At time t we can define a system failure rate λΦ(t) by
$$\lambda_\Phi(t) = \sum_{i=1}^n\big[\Phi(1_i,X_t)-\Phi(0_i,X_t)\big]\eta_t(i),$$
cf. Sect. 3.3.2, p. 85. Since
$$E\lambda_\Phi(t) = \sum_{i=1}^n\big[h(1_i,\mathbf A(t))-h(0_i,\mathbf A(t))\big]m_i(t),$$
where m_i(t) denotes the renewal density of M_i(t), we see that EλΦ(t) → λΦ as t → ∞ provided that m_i(t) → 1/(μFi + μGi). From renewal theory (see Theorem B.10, p. 278, in Appendix B) we know the following: if the renewal cycle lengths Tik + Rik have a density function h such that h(t)^p is integrable for some p > 1, and h(t) → 0 as t → ∞, then Mi has a density mi such that mi(t) → 1/(μFi + μGi) as t → ∞. See the remark following Theorem B.10 for other sufficient conditions for mi(t) → 1/(μFi + μGi) to hold. If component i has an exponential lifetime distribution with parameter λi, then mi(t) = λi Ai(t) (cf. (4.18), p. 114). Since μFi = 1/λi, this converges to λi μFi/(μFi + μGi) = 1/(μFi + μGi).
It is intuitively clear that the process X is regenerative if the components
have exponential lifetime distributions. Before we prove this formally, we for-
mulate a result related to ENu◦ : the expected number of visits to the best
state (1, 1, . . . , 1) in [0, u]. The result is analogous to (4.35) and (4.37).
Lemma 4.22. The expected number of visits to state (1, 1, ..., 1) in [0, u] is given by
$$EN_u^\circ = \sum_{i=1}^n\int_0^u\prod_{j\ne i}A_j(t)\,dM_i^\circ(t). \tag{4.42}$$
Furthermore,
$$\lim_{u\to\infty}\frac{EN_u^\circ}{u} = \prod_{j=1}^nA_j\sum_{i=1}^n\frac{1}{\mu_{F_i}}. \tag{4.43}$$

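Proof. Formula (4.42) is shown by arguing as in the proof of (4.35) (cf. Sect. 3.3.2, p. 85), writing
$$EN_u^\circ = E\Big[\sum_{i=1}^n\int_0^u\prod_{j\ne i}X_j(t)\,dN_t^\circ(i)\Big].$$
To show (4.43) we can repeat the proof of (4.37) to obtain
$$\lim_{u\to\infty}\frac{EN_u^\circ}{u} = \sum_{i=1}^n\frac{1}{\mu_{F_i}+\mu_{G_i}}\prod_{j\ne i}A_j = \prod_{j=1}^nA_j\sum_{i=1}^n\frac{1}{\mu_{F_i}},$$
where the last equality uses A_i = μFi/(μFi + μGi). This completes the proof of the lemma. □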



The above result can be shown heuristically using the same type of arguments as in Remark 4.21. For highly available components we have Ai ≈ 1, hence the limit (4.43) is approximately equal to
$$\sum_{i=1}^n\frac{1}{\mu_{F_i}}.$$

This is as expected, since the number of visits to state (1, 1, ..., 1) per unit of time should then be approximately equal to the average number of component failures per unit of time. If a component fails, it will normally be repaired before any other component fails, and, consequently, the process returns to state (1, 1, ..., 1).
Theorem 4.23. If all the components have exponential lifetimes, then X is
a regenerative process.
Proof. Because of the memoryless property of the exponential distribution and the fact that all component uptimes and downtimes are independent, we can conclude that X is regenerative (as defined in Appendix B, p. 281) if we can prove that P(S < ∞) = 1, where S = inf{t > S′ : Xt = (1, 1, ..., 1)} and S′ = min{Ti1 : i = 1, 2, ..., n}. It is clear that if X returns to the state (1, 1, ..., 1), then the process beyond S is a probabilistic replica of the process starting at 0.
Suppose that P(S < ∞) < 1. Then there exists an ε > 0 such that P(S < ∞) ≤ 1 − ε. Now let τi be the point in time of the ith visit of X to the state (1, 1, ..., 1), i.e., τ1 = S and, for i ≥ 2,
$$\tau_i = \inf\{t > \tau_{i-1} + S_i' : X_t = (1,1,\ldots,1)\},$$
where $S_i'$ has the same distribution as S′. We define inf ∅ = ∞. Since τi < ∞ is equivalent to τk − τk−1 < ∞, k = 1, 2, ..., i (τ0 = 0), we obtain
$$P(\tau_i<\infty) = [P(S<\infty)]^i \le (1-\varepsilon)^i.$$

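For all t ∈ R+,
$$P(N_t^\circ \ge i) \le P(\tau_i<\infty),$$
and it follows that
$$EN_t^\circ = \sum_{i=1}^\infty P(N_t^\circ\ge i) \le \sum_{i=1}^\infty(1-\varepsilon)^i = \frac{1-\varepsilon}{1-(1-\varepsilon)} = \frac{1-\varepsilon}{\varepsilon} < \infty.$$
Consequently, EN°_t/t → 0 as t → ∞. But this result contradicts (4.43), whose right-hand side is strictly positive, and therefore P(S < ∞) = 1. □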


Under the given set-up the regenerative property only holds true if the
lifetimes of the components are exponentially distributed. However, this can
be generalized by considering phase-type distributions with an enlarged state
space, which also includes the phases; see Sect. 4.7.1, p. 163.

4.4 Distribution of the Number of System Failures


In general, it is difficult to calculate the distribution of the number of system failures N(u,v]. Only in some special cases is it possible to obtain practical computation formulas, and in the following we look more closely at some of these.

If the repair times are small compared to the lifetimes and the lifetimes
are exponentially distributed with parameter λi , then clearly the number of
failures of component i in the time interval (u, u + w], Nu+w (i) − Nu (i), is
approximately Poisson distributed with parameter λi w. If the system is a
series system, and we make the same assumptions as above, it is also clear
that the number of system failures in the interval (u, u + w] is approximately Poisson distributed with parameter $\sum_{i=1}^n\lambda_i w$. The number of system failures in [0, t], N_t, is then approximately a Poisson process with intensity $\sum_{i=1}^n\lambda_i$.
If the system is highly available and the components have constant failure
rates, the Poisson distribution (with the asymptotic rate λΦ ) will in fact also
produce good approximations for more general systems. As motivation, we
observe that EN(u,u+w] /w is approximately equal to the asymptotic system
failure rate λΦ , and N(u,u+w] is “nearly independent” of the history of N up
to u, noting that the process X frequently restarts itself probabilistically, i.e.,
X re-enters the state (1, 1, . . . , 1).
Refer to [22, 82] for Monte Carlo simulation studies of the accuracy of
the Poisson approximation. As an illustration of the results obtained in these
studies, consider a parallel system of two identical components where the

failure rate λ is equal to 0.05, the repair times are all equal to 1, and the expected number of system failures is equal to 5. This means, as shown below, that the time interval is about 1,000 and the expected number of component failures is about 100. Using the definition of the system failure rate λΦ (cf. (4.41), p. 123) with μG = 1, we obtain
$$\frac{EN_u}{u} = \frac{5}{u} \approx \lambda_\Phi = 2\bar A_1\,\frac{1}{\mu_{F_1}+\mu_{G_1}} = 2\,\frac{\mu_G}{\frac1\lambda+\mu_G}\cdot\frac{1}{\frac1\lambda+\mu_G} \approx 2\lambda^2 = 0.005.$$
Hence u ≈ 1,000 and 2EN_u(i) ≈ 2λu ≈ 100. Clearly, this is an approximate steady-state situation, and we would expect the Poisson distribution to give an accurate approximation. The Monte Carlo simulations in [22] confirm this. The distance measure, defined as the maximum distance between the Poisson distribution (with mean λΦu) and the “true” distribution obtained by Monte Carlo simulation, is equal to 0.006. If we instead take λ = 0.2 and EN_u = 0.2, we find that the expected number of component failures is about 1. Thus we are far from a steady-state situation and, as expected, the distance measure is larger: 0.02. Still, the Poisson approximation produces relatively accurate results.
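The parallel-system example can be reproduced approximately by simulation. The sketch below assumes exponential lifetimes with λ = 0.05, repair times equal to 1 and u = 5/λΦ. It counts system failures on [0, u] and reports the maximum difference between the empirical and the Poisson(λΦu) distribution functions. That distance measure is our assumption; the measure used in [22] may be defined differently, so the result need not match 0.006 exactly.

```python
import bisect
import math
import random

def failure_times(u, lam, repair, rng):
    """Failure times on [0, u] of one component with Exp(lam) uptimes and a fixed
    repair time `repair`; the component is up at time 0."""
    fails, t = [], 0.0
    while True:
        t += rng.expovariate(lam)
        if t > u:
            return fails
        fails.append(t)
        t += repair

def is_down(fails, t, repair):
    """True if the component with sorted failure times `fails` is under repair at t."""
    k = bisect.bisect_right(fails, t) - 1
    return k >= 0 and t < fails[k] + repair

def system_failures(u, lam, repair, rng):
    """System failures of two components in parallel: a component fails while the
    other one is under repair."""
    f1 = failure_times(u, lam, repair, rng)
    f2 = failure_times(u, lam, repair, rng)
    return (sum(is_down(f2, t, repair) for t in f1)
            + sum(is_down(f1, t, repair) for t in f2))

lam, repair = 0.05, 1.0
mu_F = 1.0 / lam
a_bar = repair / (mu_F + repair)           # Ā_1 = μG/(μF + μG)
lam_phi = 2.0 * a_bar / (mu_F + repair)    # (4.41) for this system, = 2/441
u = 5.0 / lam_phi                          # chosen so that λΦ u = 5

rng = random.Random(2013)
reps = 4000
counts = [system_failures(u, lam, repair, rng) for _ in range(reps)]
mean_count = sum(counts) / reps

m = lam_phi * u
emp_cdf = poi_cdf = dist = 0.0
for k in range(max(counts) + 1):
    emp_cdf += counts.count(k) / reps
    poi_cdf += math.exp(-m) * m**k / math.factorial(k)
    dist = max(dist, abs(emp_cdf - poi_cdf))
print(round(lam_phi, 6), round(u, 1), mean_count, round(dist, 4))
```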
In the following we look at the problem of establishing formalized asymp-
totic results for the distribution of the number of system failures. We first
consider the interval reliability.

4.4.1 Asymptotic Analysis for the Time to the First System Failure

The above discussion indicates that, for highly available systems comprising components with exponentially distributed lifetimes, the interval reliability A[0, u], defined by A[0, u] = P(N_u = 0), is approximately an exponential function of u. Equivalently, the time to the first system failure is approximately exponentially distributed. This result can also be formulated as a limiting result as shown in the
theorem below. It is assumed that the process X is a regenerative process
with regenerative state (1, 1, . . . , 1). The variable S denotes the length of the
first renewal cycle of the process X, i.e., the time until the process returns to
state (1, 1, . . . , 1). Let TΦ denote the time to the first system failure and q the
probability that a system failure occurs in a renewal cycle, i.e.,

q = P (NS ≥ 1) = P (TΦ < S).

For q ∈ (0, 1), let P0 and P1 denote the conditional probability given NS = 0
and NS ≥ 1, i.e., P0 (·) = P (·|NS = 0) and P1 (·) = P (·|NS ≥ 1). The
corresponding expectations are denoted E0 and E1. Furthermore, let $c_{0S}^2 = E_0S^2/(E_0S)^2 - 1$ denote the squared coefficient of variation of S under P0.
The notation $\xrightarrow{P}$ is used for convergence in probability and $\xrightarrow{D}$ for convergence in distribution, cf. Appendix A, p. 248. We write Exp(t) for the

exponential distribution with parameter t, Poisson(t) for the Poisson distribution with mean t, and N(μ, σ²) for the normal distribution with mean μ and variance σ².
For each component i (i ∈ {1, 2, . . . , n}) we assume that there is a sequence
of uptime and downtime distributions (Fij , Gij ), j = 1, 2, . . ..
To simplify notation, we normally omit the index j. When assuming in
the following that X is a regenerative process, it is tacitly understood for all
j ∈ N. We shall formulate conditions which guarantee that αTΦ is asymptot-
ically exponentially distributed with parameter 1, where α is a suitable nor-
malizing “factor” (more precisely, a normalizing sequence depending on j).
The following factors will be studied: q/E0 S, q/ES, 1/ETΦ , and λΦ . These
factors are asymptotically equivalent under the conditions stated in the the-
orem below, i.e., the ratio of any two of these factors converges to one as
j → ∞. To motivate this, note that for a highly available system we have ETΦ ≈ E0S(1/q) ≈ ES(1/q). Here E0S equals the expected length of a cycle having no system failures, and 1/q equals the expected number of cycles until a system failure occurs (the number of such cycles is geometrically distributed with parameter q). We have E0S ≈ ES when q is small. Note also that
$$\lambda_\Phi = \frac{EN_S}{ES} \tag{4.44}$$
by the Renewal Reward Theorem (Theorem B.15, p. 280, in Appendix B). For a highly available system we have ENS ≈ q and hence λΦ ≈ q/ES.
Results from Monte Carlo simulations presented in [22] show that the factors
q/E0 S, q/ES, and 1/ETΦ typically give slightly better results (i.e., better
fit to the exponential distribution) than the system failure rate λΦ . From a
computational point of view, however, λΦ is much more attractive than the
other factors, which are in most cases quite difficult to compute. We therefore
normally use λΦ as the normalizing factor.
The basic idea of the proof of the asymptotic exponentiality of αTΦ is as
follows: If we assume that X is a regenerative process and the probability
that a system failure occurs in a renewal cycle, i.e., q, is small (converges to
zero), then the time to the first system failure will be approximately equal
to the sum of a number of renewal cycles having no system failures; and this
number of cycles is geometrically distributed with parameter q. Now if q → 0
as j → ∞, the desired result follows by using Laplace transformations. The
result can be formulated in general terms as shown in the lemma below.
Note that series systems are excluded since such systems have q = 1. We
will analyze series systems later in this section; see Theorem 4.35, p. 143.

Lemma 4.24. Let S, Si, i = 1, 2, ..., be a sequence of non-negative i.i.d. random variables with distribution function F(t) having finite mean a > 0 and finite variance. Let ν be a random variable independent of (Si), geometrically distributed with parameter q (0 < q ≤ 1), i.e., P(ν = k) = q p^{k−1}, k = 1, 2, ..., p = 1 − q. Furthermore, let


$$S^* = \sum_{i=1}^{\nu-1}S_i.$$
Consider now a sequence Fj, qj (j = 1, 2, ...) satisfying the above conditions for each j. Then if (as j → ∞)
$$q \to 0 \tag{4.45}$$
and
$$q c_S^2 \to 0, \tag{4.46}$$
where $c_S^2$ denotes the squared coefficient of variation of S, we have (as j → ∞)
$$\frac{qS^*}{a} \xrightarrow{D} \mathrm{Exp}(1). \tag{4.47}$$
Proof. Let $\tilde S^* = qS^*/a$. By conditioning on the value of ν, it is seen that the Laplace transform of S*, $L_{S^*}(x) = Ee^{-xS^*}$, equals q/[1 − pL(x)], where L(x) is the Laplace transform of S_i. Let ψ(x) = [L(x) − 1 + ax]/x. Then
$$L_{S^*}(x) = \frac{q}{1-p(1-ax+x\psi(x))}.$$
We need to show that
$$L_{\tilde S^*}(x) = Ee^{-(qx/a)S^*} \to \frac{1}{1+x},$$
since the convergence theorem for Laplace transforms then gives the desired result. Noting that
$$Ee^{-(qx/a)S^*} = \frac{1}{1+px-(px/a)\psi(qx/a)},$$
we must require that
$$(x/a)\psi(qx/a) \to 0,$$
i.e.,
$$[L(qx/a)-1+qx]/q \to 0.$$
Using ES = a and the inequalities $0 \le e^{-t}-1+t \le t^2/2$ for t ≥ 0, we find that
$$0 \le [L(qx/a)-1+qx]/q = E\big[e^{-(qx/a)S}-1+(qx/a)S\big]/q$$
$$\le E\big[((qx/a)S)^2\big]/(2q)$$
$$= \frac{x^2}{2}\,\frac{q}{a^2}\,ES^2$$
$$= \frac{x^2}{2}\,q(1+c_S^2).$$

The desired conclusion (4.47) now follows since q → 0 and $qc_S^2 \to 0$ (assumptions (4.45) and (4.46)). □
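Lemma 4.24 is easy to check by simulation. The sketch below assumes S_i uniform on (0, 2), so a = 1, and q = 0.01, both hypothetical choices. It compares the mean of qS*/a and the probability P(qS*/a > 1) with the Exp(1) values 1 and e^{−1} ≈ 0.368.

```python
import math
import random

def geometric_sum(q, draw, rng):
    """S* = S_1 + ... + S_{ν-1}, where P(ν = k) = q (1 - q)^(k - 1), k = 1, 2, ..."""
    if q >= 1.0:
        return 0.0                                   # ν = 1 with probability one
    # inverse transform sampling: P(ν > k) = (1 - q)^k
    nu = 1 + int(math.log(1.0 - rng.random()) / math.log(1.0 - q))
    return sum(draw(rng) for _ in range(nu - 1))

q, a = 0.01, 1.0                                     # S_i ~ Uniform(0, 2), so a = 1
rng = random.Random(3)
reps = 5000
x = [q * geometric_sum(q, lambda r: r.uniform(0.0, 2.0), rng) / a for _ in range(reps)]
mean_x = sum(x) / reps                               # exact mean of qS*/a is 1 - q
tail_1 = sum(v > 1.0 for v in x) / reps              # compare with exp(-1)
print(mean_x, tail_1, math.exp(-1.0))
```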


Theorem 4.25. Assume that X is a regenerative process, and that Fij and Gij change in such a way that the following conditions hold (as j → ∞):
$$q \to 0, \tag{4.48}$$
$$q c_{0S}^2 \to 0, \tag{4.49}$$
$$\frac{qE_1S}{ES} \to 0, \tag{4.50}$$
$$E_1(N_S-1) \to 0. \tag{4.51}$$
Then
$$A[0,u/\lambda_\Phi] \to e^{-u}, \quad\text{i.e.,}\quad \lambda_\Phi T_\Phi \xrightarrow{D} \mathrm{Exp}(1). \tag{4.52}$$

Proof. Using Lemma 4.24, we first prove that under conditions (4.48)–(4.50) we have
$$\frac{T_\Phi q}{E_0S} \xrightarrow{D} \mathrm{Exp}(1). \tag{4.53}$$
Let ν denote the renewal cycle index associated with the time of the first system failure, TΦ. Then TΦ has the same distribution as
$$\sum_{k=1}^{\nu-1}S_{0k} + W_\nu,$$
where (S0k) and (Wk) are independent sequences of i.i.d. random variables with
$$P(S_{0k}\le s) = P_0(S\le s)$$
and
$$P(W_k\le w) = P_1(T_\Phi\le w).$$
Both sequences are independent of ν, which has a geometric distribution with parameter q = P(NS ≥ 1). Hence, (4.53) follows from Lemma 4.24 (applied with a = E0S, where condition (4.46) becomes (4.49)) provided that
$$\frac{W_\nu q}{E_0S} \xrightarrow{P} 0. \tag{4.54}$$
By a standard conditional probability argument it follows that
$$ES = (1-q)E_0S + qE_1S,$$

and by noting that
$$\frac{qE_1T_\Phi}{E_0S} = \frac{qEW}{E_0S} \le \frac{qE_1S}{E_0S} = \frac{qE_1S\,(1-q)}{ES-qE_1S} = \frac{\frac{qE_1S}{ES}(1-q)}{1-\frac{qE_1S}{ES}} \to 0, \tag{4.55}$$
we see, by Markov's inequality, that (4.54) holds.
Using (4.44) we obtain
$$\frac{\lambda_\Phi}{q/E_0S} = \frac{\lambda_\Phi}{q/ES}\cdot\frac{E_0S}{ES} = \frac{EN_S/ES}{q/ES}\cdot\frac{E_0S}{ES} = \frac{EN_S}{q}\cdot\frac{E_0S}{ES}.$$
Now EN_S/q = 1 + E1(NS − 1) → 1 in view of (4.51), and
$$\frac{E_0S}{ES} = \frac{1-q\frac{E_1S}{ES}}{1-q} \to 1$$
by (4.48) and (4.50). Hence the ratio of λΦ and q/E0S converges to 1. Combining this with (4.53), the conclusion of the theorem follows. □


Remark 4.26. The above theorem shows that
$$\alpha T_\Phi \xrightarrow{D} \mathrm{Exp}(1)$$
for α equal to λΦ. But the result also holds for the normalizing factors q/E0S, q/ES, and 1/ETΦ. For q/E0S and q/ES this is seen from the proof of the theorem. To establish the result for 1/ETΦ, let
$$S^* = \sum_{i=1}^{\nu-1}S_{0i}.$$
Then ES* = E0S(1 − q)/q, observing that the mean of ν equals 1/q. It follows that
$$ET_\Phi = E_0S(1-q)/q + E_1T_\Phi,$$
which can be rewritten as
$$qET_\Phi/E_0S = 1 - q + qE_1T_\Phi/E_0S.$$
We see that the right-hand side of this expression converges to 1, remembering (4.48), (4.50), and (4.55). Hence, 1/ETΦ is also a normalizing factor. Note that condition (4.51) is not required if the normalizing factor equals q/E0S, q/ES, or 1/ETΦ.
We can conclude that the ratio between any of these normalizing factors
converges to one if the conditions of the theorem hold true.

4.4.2 Some Sufficient Conditions

It is intuitively clear that if the components have constant failure rates, and
the component unavailabilities converge to zero, then the conditions of The-
orem 4.25 would hold. In Theorems 4.27 and 4.30 below this result will be
formally established. We assume, for the sake of simplicity, that no single
component is in series with the rest of the system. If there are one or more
components in series with the rest of the system, we know that the time to
failure of these components has an exact exponential distribution, and by in-
dependence it is straightforward to establish the limiting distribution of the
total system.
Define
$$d = \sum_{i=1}^n\lambda_i\mu_{G_i}, \qquad \tilde\lambda = \sum_{i=1}^n\lambda_i.$$

Theorem 4.27. Assume that the system has no components in series with
the rest of the system, i.e., Φ(0i , 1) = 1 for i = 1, 2, . . . , n. Furthermore,
assume that component i has an exponential lifetime distribution with failure
rate λi > 0, i = 1, 2, . . . , n.
If d → 0 and there exist constants c1 and c2 such that λi ≤ c1 < ∞ and ERi² ≤ c2 < ∞ for all i, then the conditions (4.48), (4.49), and (4.50) of Theorem 4.25 are met, and, consequently, $\alpha T_\Phi \xrightarrow{D} \mathrm{Exp}(1)$ for α equal to q/E0S, q/ES, or 1/ETΦ.

Proof. As will be shown below, it is sufficient to show that q → 0 holds (condition (4.48)) and that there exists a finite constant c such that
$$\tilde\lambda^2E(S'')^2 \le c, \tag{4.56}$$
where S″ represents the “busy” period of the renewal cycle. This period lasts from the first component failure to the next regenerative point, i.e., to the time when the process again visits state (1, 1, ..., 1). (The term “busy” period is taken from queueing theory. In the busy period at least one component is under repair.) Let S′ be an exponentially distributed random variable with parameter λ̃ representing the time to the first component failure. This means that we can write
$$S = S' + S''.$$
Assume that we have already proved (4.56). Then this condition and (4.48)
imply (4.50), noting that

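$$\frac{qE_1S}{ES} \le \tilde\lambda qE_1S$$
$$= \tilde\lambda(qE_1S' + qE_1S'')$$
$$= q + \tilde\lambda qE[S''\mid N_S\ge1]$$
$$= q + \tilde\lambda E[S''I(N_S\ge1)]$$
$$\le q + \tilde\lambda q^{1/2}[E(S'')^2]^{1/2}$$
$$= q + q^{1/2}[\tilde\lambda^2E(S'')^2]^{1/2},$$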
where the first inequality uses ES ≥ ES′ = 1/λ̃, and the last inequality follows from Schwarz's inequality. Furthermore, condition (4.56) together with (4.48) implies (4.49), noting that
$$c_{0S}^2 \le \frac{E_0S^2}{(E_0S)^2}$$
$$\le \tilde\lambda^2E_0S^2$$
$$= \tilde\lambda^2E[S^2I(N_S=0)]/(1-q)$$
$$\le \tilde\lambda^2ES^2/(1-q)$$
$$= \tilde\lambda^2\{E(S')^2 + E(S'')^2 + 2E[S'S'']\}/(1-q)$$
$$\le \tilde\lambda^2\{(2/\tilde\lambda^2) + E(S'')^2 + 2(E(S')^2E(S'')^2)^{1/2}\}/(1-q)$$
$$= \{2 + \tilde\lambda^2E(S'')^2 + 2(2^{1/2})(\tilde\lambda^2E(S'')^2)^{1/2}\}/(1-q),$$
where we again have used Schwarz's inequality. Alternatively, an upper bound on E[S′S″] can be established using that S′ and S″ are independent:
$$E[S'S''] = ES'\,ES'' = (1/\tilde\lambda)ES'' \le (1/\tilde\lambda)\{E(S'')^2\}^{1/2}.$$

Now, to establish (4.48), we note that with probability λ̃i = λi/λ̃ the busy period begins at the time of the failure of component i. If none of the remaining components fails during the repair of this component, the busy period comes to an end when the repair is completed. Therefore, since there are no components in series with the rest of the system,
$$1-q \ge \sum_{i=1}^n\tilde\lambda_i\int_0^\infty e^{-t(\tilde\lambda-\lambda_i)}\,dG_i(t),$$
where Gi is the distribution of the repair time of component i. Hence, using $\sum_i\tilde\lambda_i = 1$ and 1 − e^{−x} ≤ x,
$$q \le \sum_{i=1}^n\tilde\lambda_i\int_0^\infty\big[1-e^{-t(\tilde\lambda-\lambda_i)}\big]\,dG_i(t) \le \sum_{i=1}^n\lambda_i\int_0^\infty t\,dG_i(t) = d.$$
Consequently, d → 0 implies q → 0.

It remains to show (4.56). Clearly, the busy period will only increase if we assume that the flow of failures of component i is a Poisson flow with parameter λi. In other words, we adjoin failures that arise according to a Poisson process during the repair intervals of component i, assuming that repair begins immediately for each failure. This means that the process can be regarded as an M/G/∞ queueing process, where the Poisson input flow has parameter λ̃ and there are infinitely many servers with service time distributed according to the law
$$G(t) = \sum_{i=1}^n\tilde\lambda_iG_i(t).$$
Note that the probability that a “failure is due to component i” equals λ̃i. It is also clear that the busy period increases still more if, instead of infinitely many servers, we take only one, i.e., the process is an M/G/1 queueing process. Thus, E(S″)² ≤ E(S̃″)², where S̃″ is the busy period in a single-server system with a Poisson input flow λ̃ and service time distribution G(t). It is a well-known result from the theory of queueing processes (and branching processes) that the second-order moment of the busy period (extinction time) equals $ER_G^2/(1-\tilde\lambda ER_G)^3$, where R_G is the service time having distribution G; see, e.g., [80]. Since $\tilde\lambda ER_G = d$ and $\tilde\lambda ER_G^2 = d_2$, where $d_2 = \sum_{i=1}^n\lambda_iER_i^2$, we obtain
$$\tilde\lambda^2E(S'')^2 \le \frac{\tilde\lambda d_2}{(1-d)^3} \le \frac{n^2c_1^2c_2}{(1-d)^3}.$$
The conclusion of the theorem follows. □
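The bound q ≤ d from the proof can be illustrated for the simplest system without components in series: two components in parallel. The sketch assumes exponential lifetimes and fixed (deterministic) repair times, a hypothetical set-up in which q can be computed exactly. The busy period ends at the repair completion unless the other component fails during the repair, and in that case a system failure occurs.

```python
import math

def q_two_parallel(lams, repairs):
    """Exact probability q of a system failure in a renewal cycle for two components
    in parallel with Exp(lams[i]) lifetimes and fixed repair times repairs[i]."""
    lt = sum(lams)
    (l1, l2), (b1, b2) = lams, repairs
    # busy period starts with a failure of component i with probability λ_i / λ̃
    return l1 / lt * (1.0 - math.exp(-l2 * b1)) + l2 / lt * (1.0 - math.exp(-l1 * b2))

def d_value(lams, repairs):
    """d = sum_i λ_i μ_Gi, with μ_Gi equal to the fixed repair time."""
    return sum(l * b for l, b in zip(lams, repairs))

lams = (0.05, 0.02)                         # hypothetical failure rates
for scale in (1.0, 0.1, 0.01):              # shrink the repair times
    repairs = (2.0 * scale, 5.0 * scale)
    print(scale, q_two_parallel(lams, repairs), d_value(lams, repairs))
```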


We now give sufficient conditions for E1(NS − 1) → 0 (assumption (4.51) in Theorem 4.25).
We define
$$\breve\mu_i = \sup_{0\le t<t^*}\{E[R_{i1}-t\mid R_{i1}>t]\},$$
where t* = sup{t ∈ R+ : Ḡi(t) > 0}. We see that μ̆i expresses the maximum expected residual repair time of component i. We might have μ̆i = ∞, but in the following we restrict attention to the finite case. We know from Sect. 2.2, p. 37, that if Gi has the NBUE property, then μ̆i ≤ μGi. If the repair times are bounded by a constant c, i.e., P(Rik ≤ c) = 1, then μ̆i ≤ c. Let
$$\tilde\mu = \sum_{i=1}^n\breve\mu_i.$$

Lemma 4.28. Assume that the lifetime of component i is exponentially distributed with failure rate λi, i = 1, 2, ..., n. Then
$$P_1(N_S\ge k) \le (\tilde\lambda\tilde\mu)^{k-1}, \quad k = 2, 3, \ldots. \tag{4.57}$$
134 4 Availability Analysis of Complex Systems

Proof. The lemma will be shown by induction. We first prove that (4.57)
holds true for k = 2. Suppose the first system failure occurs at time t. Let
Lt denote the number of component failures after t until all components are
again functioning for the first time. Furthermore, let Rit denote the remaining
repair time of component i at time t (put Rit = 0 if component i is functioning
at time t). Finally, let Vt = maxi Rit and let GVt (v) denote the distribution
function of Vt . Note that Lt ≥ 1 implies that at least one component must fail
in the interval (t, t + Vt ) and that the probability of at least one component
failure in this interval increases if we replace the failed components at t by
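functioning components. Using these observations and the inequality $1 - e^{-x} \le x$, we obtain
$$\begin{aligned}
P(L_t \ge 1) &= \int_0^\infty P(L_t \ge 1 \mid V_t = v)\,dG_{V_t}(v) \\
&\le \int_0^\infty (1 - e^{-\tilde\lambda v})\,dG_{V_t}(v) \\
&\le \tilde\lambda \int_0^\infty v\,dG_{V_t}(v) = \tilde\lambda EV_t \le \tilde\lambda E\sum_i R_{it} \\
&\le \tilde\lambda\tilde\mu,
\end{aligned}$$
where the last inequality uses that the expected remaining repair time of a component under repair is at most $\breve\mu_i$, so that $ER_{it} \le \breve\mu_i$. Since $N_S \ge 2$ implies $L_t \ge 1$, formula (4.57) is shown for k = 2 under $P_1$, conditional on the event that the first system failure occurs at time t. Integrating over the failure time t, we obtain (4.57) for k = 2. Now assume that $P_1(N_S \ge k) \le (\tilde\lambda\tilde\mu)^{k-1}$ for some $k \ge 2$. We must show that
$$P_1(N_S \ge k+1) \le (\tilde\lambda\tilde\mu)^{k}.$$
We have
$$\begin{aligned}
P_1(N_S \ge k+1) &= P_1(N_S \ge k+1 \mid N_S \ge k)\,P_1(N_S \ge k) \\
&\le P_1(N_S \ge k+1 \mid N_S \ge k)\cdot(\tilde\lambda\tilde\mu)^{k-1},
\end{aligned}$$
thus it remains to show that
$$P_1(N_S \ge k+1 \mid N_S \ge k) \le \tilde\lambda\tilde\mu. \tag{4.58}$$
Suppose that the kth system failure in the renewal cycle occurs at time t. Then if at least one more system failure occurs in the renewal cycle, there must be at least one component failure before all components are again functioning, i.e., $L_t \ge 1$. Repeating the above arguments for k = 2, the inequality (4.58) follows. □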


Remark 4.29. The inequality (4.57) states that the number of system failures in a renewal cycle, given that at least one system failure occurs in the cycle, is bounded in distribution by a geometric random variable with parameter $\tilde\lambda\tilde\mu$ (provided this quantity is less than 1).
4.4 Distribution of the Number of System Failures 135

Theorem 4.30. Assume that the system has no components in series with the rest of the system. Furthermore, assume that component i has an exponential lifetime distribution with failure rate $\lambda_i > 0$, $i = 1, 2, \ldots, n$. Suppose $d' \to 0$, where $d' = \tilde\lambda\tilde\mu$, and there exist constants $c_1$ and $c_2$ such that $\lambda_i \le c_1 < \infty$ and $ER_i^2 \le c_2 < \infty$ for all i. Then the conditions (4.48)–(4.51) of Theorem 4.25 (p. 129) are all met, and, consequently, the limiting result (4.52) holds, i.e.,
$$\lambda_\Phi T_\Phi \xrightarrow{D} \mathrm{Exp}(1).$$

Proof. Since $d \le d'$ (note that $\mu_{G_i} \le \breve\mu_i$, so $d = \sum_i \lambda_i\mu_{G_i} \le \tilde\lambda\tilde\mu$), it suffices to show that condition (4.51) holds under the given assumptions. But from (4.57) of Lemma 4.28 we have
$$E_1(N_S - 1) = \sum_{k=2}^\infty P_1(N_S \ge k) \le \sum_{k=2}^\infty (d')^{k-1} = \frac{d'}{1 - d'},$$
and the desired result follows. □
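The step from (4.57) to the bound on $E_1(N_S - 1)$ is a geometric series. Write $d' = \tilde\lambda\tilde\mu$. The minimal Python sketch below uses helper names of our own. It lists the tail bounds $(d')^{k-1}$ and checks that their sum approaches $d'/(1 - d')$. The value d' = 0.05 in the test is an arbitrary illustration.

```python
from typing import List


def geometric_tail_bounds(d_prime: float, kmax: int) -> List[float]:
    """Upper bounds (d')^(k-1) on P_1(N_S >= k) for k = 2, ..., kmax, as in (4.57)."""
    if not 0 <= d_prime < 1:
        raise ValueError("the geometric bound is informative only for 0 <= d' < 1")
    return [d_prime ** (k - 1) for k in range(2, kmax + 1)]


def expected_extra_failures_bound(d_prime: float) -> float:
    """Bound on E_1(N_S - 1) = sum_{k >= 2} P_1(N_S >= k), namely d' / (1 - d')."""
    return d_prime / (1 - d_prime)
```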




The above results show that the time to the first system failure is approxi-
mately exponentially distributed with parameter q/E0 S ≈ q/ES ≈ 1/ETΦ ≈
λΦ . For a system comprising highly available components, it is clear that
P (Xt = 1) would be close to one, hence the above approximations for the
interval reliability can also be used for an interval (t, t + u].

4.4.3 Asymptotic Analysis of the Number of System Failures

For a highly available system, the downtimes will be small compared to the
uptimes, and the time from when the system has failed until it returns to the
state (1, 1, . . . , 1) will also be small. Hence, the above results also justify the
Poisson process approximation for N . More formally, it can be shown that
Nt/α converges in distribution to a Poisson distribution under the same as-
sumptions as the first system failure time converges to the exponential distri-
bution. Let TΦ∗ (k) denote the time between the (k − 1)th and the kth system
failure. From this sequence we define an associated sequence TΦ (k) of i.i.d.
variables, distributed as TΦ , by letting TΦ (1) = TΦ∗ (1), TΦ (2) be equal to the
time to the first system failure following the first regenerative point after the
first system failure, etc. Then it is seen that
$$T_\Phi(1) + T_\Phi(2)\,(1 - I(N(1) \ge 2)) \le T_\Phi^*(1) + T_\Phi^*(2) \le T_\Phi(1) + T_\Phi(2) + S_\nu,$$
where N(1) equals the number of system failures in the first renewal cycle having one or more system failures, and $S_\nu$ equals the length of this cycle. Here ν denotes the renewal cycle index associated with the time of the first system failure. For α being one of the normalizing factors (i.e., $q/E_0S$, $q/ES$, $1/ET_\Phi$, or $\lambda_\Phi$), we will prove that $\alpha T_\Phi(2) I(N(1) \ge 2)$ converges in probability to zero. It is sufficient to show that $P(N(1) \ge 2) \to 0$, noting that for every ε > 0
$$P(\alpha T_\Phi(2) I(N(1) \ge 2) > \varepsilon) \le P(N(1) \ge 2).$$


But
$$P(N(1) \ge 2) = P_1(N_S \ge 2) \le E_1(N_S - 1),$$
where the last expression converges to zero in view of (4.51), p. 129. The distribution of $S_\nu$ is the same as the conditional distribution of the cycle length given that a system failure occurs in the cycle, cf. Theorem 4.25 and its proof. Thus, if (4.48)–(4.51) hold, it follows that $\alpha(T_\Phi^*(1) + T_\Phi^*(2))$ converges in distribution to the sum of two independent exponentially distributed random variables with parameter 1, i.e.,
$$P(N_{t/\alpha} \ge 2) = P(\alpha(T_\Phi^*(1) + T_\Phi^*(2)) \le t) \to 1 - e^{-t} - te^{-t}.$$
Similarly, we establish the general distribution. We summarize the result in the following theorem.
Theorem 4.31. Assume that X is a regenerative process, and that $F_{ij}$ and $G_{ij}$ change in such a way that (as $j \to \infty$) the conditions (4.48)–(4.51) hold. Then (as $j \to \infty$)
$$N_{t/\alpha} \xrightarrow{D} \mathrm{Poisson}(t), \tag{4.59}$$
where α is a normalizing factor that equals either $q/E_0S$, $q/ES$, $1/ET_\Phi$ or $\lambda_\Phi$.
Results from Monte Carlo simulations [22] indicate that the asymptotic system failure rate $\lambda_\Phi$ is normally preferable as the parameter of the Poisson distribution when the expected number of system failures is not too small. When the expected number of system failures is small (less than one), the factor $1/ET_\Phi$ gives slightly better results. The system failure rate is, however, easier to compute.

Asymptotic Normality
Now we turn to a completely different way to approximate the distribution
of Nt . Above, the up and downtime distributions are assumed to change such
that the system availability increases and after a time rescaling Nt converges to
a Poisson variable. Now we leave the up and downtime distribution unchanged
and establish a central limit theorem as t increases to infinity. The theorem
generalizes (4.16), p. 114.
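Theorem 4.32. If X is a regenerative process with cycle length S, $\mathrm{Var}[S] < \infty$ and $\mathrm{Var}[N_S] < \infty$, then as $t \to \infty$,
$$\sqrt{t}\left(\frac{N_{u+t} - N_u}{t} - \lambda_\Phi\right) \xrightarrow{D} N(0, \gamma_\Phi^2),$$
where
$$\gamma_\Phi^2\, ES = \mathrm{Var}[N_S - \lambda_\Phi S]. \tag{4.60}$$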
Proof. Noting that the system failure rate $\lambda_\Phi$ is given by
$$\lambda_\Phi = \frac{EN_S}{ES}, \tag{4.61}$$
the result follows from Theorem B.17, p. 280, in Appendix B. □
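Formulas (4.60) and (4.61) suggest ratio estimators when cycle data are available, for example from a regenerative simulation of X. The Python sketch below assumes we are given per-cycle pairs $(N_S, S)$. The function name and the data in the test are made up for illustration. It estimates $\lambda_\Phi$ by $\sum N_S/\sum S$. It estimates $\gamma_\Phi^2$ by the empirical variance of $N_S - \hat\lambda_\Phi S$ divided by the mean cycle length.

```python
from statistics import fmean, pvariance
from typing import Sequence, Tuple


def regenerative_estimates(n_cycle: Sequence[int],
                           s_cycle: Sequence[float]) -> Tuple[float, float]:
    """Ratio estimates of lambda_Phi = E N_S / E S, cf. (4.61),
    and gamma_Phi^2 = Var[N_S - lambda_Phi S] / E S, cf. (4.60).

    n_cycle[k], s_cycle[k]: number of system failures and length of the
    k-th regeneration cycle (hypothetical data).
    """
    lam_phi = sum(n_cycle) / sum(s_cycle)
    # Residuals N_S - lambda_Phi * S; their sample mean is zero by construction.
    resid = [n - lam_phi * s for n, s in zip(n_cycle, s_cycle)]
    gamma2 = pvariance(resid) / fmean(s_cycle)
    return lam_phi, gamma2
```

$\hat\lambda_\Phi$ is a ratio estimator, so the residuals $N_S - \hat\lambda_\Phi S$ sum to zero exactly over the sample. This mirrors $E[N_S - \lambda_\Phi S] = 0$.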


Below we argue that if the system failure rate is small, then we have $\gamma_\Phi^2 \approx \lambda_\Phi$. We obtain
$$\gamma_\Phi^2 = \frac{\mathrm{Var}[N_S - \lambda_\Phi S]}{ES} = \frac{E(N_S - \lambda_\Phi S)^2}{ES} \approx \frac{EN_S^2}{ES} \approx \frac{EN_S}{ES} = \lambda_\Phi.$$
The second equality holds because $E[N_S - \lambda_\Phi S] = 0$ by (4.61). The last approximation follows by observing that if the system failure rate is small, then $N_S$ is, with a probability close to one, equal to the indicator function $I(N_S \ge 1)$. More formally, it is possible to show that under certain conditions, $\gamma_\Phi^2/\lambda_\Phi$ converges to one. We formulate the result in the following proposition.

Proposition 4.33. Assume X is a regenerative process with cycle length S and that $F_{ij}$ and $G_{ij}$ change in such a way that conditions (4.48)–(4.50) of Theorem 4.25 (p. 129) hold (as $j \to \infty$). Furthermore, assume that (as $j \to \infty$)
$$E_1(N_S - 1)^2 \to 0 \tag{4.62}$$
and
$$q c_S^2 \to 0, \tag{4.63}$$
where $c_S^2$ denotes the squared coefficient of variation of S. Then (as $j \to \infty$)
$$\frac{\gamma_\Phi^2}{\lambda_\Phi} \to 1.$$
Proof. Using (4.60) and writing N in place of $N_S$, we get
$$\begin{aligned}
\frac{\gamma_\Phi^2}{\lambda_\Phi} &= \frac{E(N - \lambda_\Phi S)^2}{\lambda_\Phi ES} \\
&= \frac{q^{-1}EN^2 + q^{-1}(\lambda_\Phi)^2 ES^2 - 2q^{-1}\lambda_\Phi E[NS]}{q^{-1}\lambda_\Phi ES} \\
&= \frac{E_1N^2 + q^{-1}(\lambda_\Phi)^2 ES^2 - 2q^{-1}\lambda_\Phi E[NS]}{q^{-1}\lambda_\Phi ES}.
\end{aligned}$$
The denominator converges to 1, since it equals the ratio between two normalizing factors. The result therefore follows if we can show that $E_1N^2$ converges to 1 and all the other terms of the numerator converge to zero. Writing
$$E_1N^2 = E_1[1 + (N-1)]^2 = 1 + E_1(N-1)^2 + 2E_1(N-1)$$
and using condition (4.62), together with $E_1(N-1) \le E_1(N-1)^2$ (under $P_1$, N − 1 is a nonnegative integer), it is seen that $E_1N^2$ converges to 1. Now consider the term $q^{-1}(\lambda_\Phi)^2 ES^2$. Using that $\lambda_\Phi = EN/ES$ (formula (4.61)), we obtain
$$\begin{aligned}
q^{-1}(\lambda_\Phi)^2 ES^2 &= q^{-1}(EN/ES)^2 ES^2 = q^{-1}(EN)^2\{ES^2/(ES)^2\} \\
&= q(E_1N)^2(1 + c_S^2) = q[1 + E_1(N-1)]^2(1 + c_S^2).
\end{aligned}$$
Letting $q \to 0$ (condition (4.48)) and applying (4.62) and (4.63), we see that $q^{-1}(\lambda_\Phi)^2 ES^2$ converges to zero. It remains to show that $q^{-1}\lambda_\Phi E[NS]$ converges to zero. But this is shown in the same way as the previous term, noting that
$$E[NS] \le (EN^2)^{1/2}(ES^2)^{1/2}$$
by the Cauchy–Schwarz inequality. This completes the proof of the proposition. □


Proposition 4.34. Under the same conditions as formulated in Theorem 4.30, p. 135, the following limiting result holds true (as $j \to \infty$):
$$\frac{\gamma_\Phi^2}{\lambda_\Phi} \to 1.$$
Proof. It is sufficient to show that conditions (4.62) and (4.63) hold. For condition (4.62), use that under $P_1$, N is bounded in distribution by a geometric random variable with parameter $d' = \tilde\lambda\tilde\mu$, cf. (4.57) of Lemma 4.28, p. 133. Note that for a variable N that has a geometric distribution with parameter $d'$ we have
$$E(N-1)^2 = \sum_{k=1}^\infty (k-1)^2 (d')^{k-1}(1 - d') = \frac{d'(1 + d')}{(1 - d')^2}.$$
From this equality it follows that $E_1(N_S - 1)^2 \to 0$ as $d' \to 0$. To establish (4.63) we can repeat the arguments in the proof of Theorem 4.27, p. 131, showing (4.49), observing that
$$c_S^2 \le \frac{ES^2}{(ES)^2} \le \tilde\lambda^2 ES^2.$$
□
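The closed form for $E(N-1)^2$ can be checked against a truncated series. The short Python sketch below uses names of our own. It assumes N is geometric on {1, 2, ...} with $P(N = k) = (d')^{k-1}(1 - d')$.

```python
def geometric_second_moment_series(d: float, terms: int = 10_000) -> float:
    """Partial sum of sum_{k>=1} (k-1)^2 d^(k-1) (1-d) for N geometric on {1, 2, ...}."""
    return sum((k - 1) ** 2 * d ** (k - 1) * (1 - d) for k in range(1, terms + 1))


def geometric_second_moment_closed(d: float) -> float:
    """Closed form E(N - 1)^2 = d (1 + d) / (1 - d)^2, which tends to 0 as d -> 0."""
    return d * (1 + d) / (1 - d) ** 2
```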


For a parallel system of two components it is possible to establish simple expressions for some of the above quantities, such as q and $ET_\Phi$.

Parallel System of Two Identical Components

Consider a parallel system comprising two identical components having exponential life lengths with failure rate λ. Suppose one of the components has failed. Then a system failure occurs if the operating component fails before the repair is completed. In that case the number of system failures in the cycle is at least 1 ($N_S \ge 1$). Consequently,
$$q = P(N_S \ge 1) = \int_0^\infty F(t)\,dG(t) = \int_0^\infty (1 - e^{-\lambda t})\,dG(t),$$
where $F(t) = P(T \le t) = 1 - e^{-\lambda t}$ and $G(t) = P(R \le t)$ denote the component lifetime and repair time distributions, respectively. It follows that
$$q \le \int_0^\infty \lambda t\,dG(t) = \lambda\mu_G.$$
Thus, for a parallel system comprising two identical components, it is trivially verified that the convergence of $\lambda\mu_G$ to zero implies that $q \to 0$. From the Taylor formula we have $1 - e^{-x} = x - \frac{1}{2}x^2 + x^3 O(1)$, $x \to 0$, where $|O(1)| \le 1$. Hence, if $\lambda\mu_G \to 0$ and $ER^3/\mu_G^3$ is bounded by a finite constant, we have
$$\begin{aligned}
q &= \lambda\mu_G - \frac{\lambda^2}{2}ER^2 + \lambda^3 ER^3 O(1) \\
&= \lambda\mu_G - \frac{(\lambda\mu_G)^2}{2}(1 + c_G^2) + o((\lambda\mu_G)^2),
\end{aligned}$$
where $c_G^2$ denotes the squared coefficient of variation of G, defined by $c_G^2 = \mathrm{Var}R/\mu_G^2$. We can conclude that if $\lambda\mu_G$ is small and we compare distributions G with the same mean, then those with a large variance exhibit a smaller probability q.
If we instead apply the Taylor formula $1 - e^{-x} = x - x^2 O(1)$, we can write
$$q = \lambda\mu_G + o(\lambda\mu_G), \quad \lambda\mu_G \to 0.$$
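For two common repair laws the integral defining q has a closed form. A constant repair time $\mu_G$ gives $q = 1 - e^{-\lambda\mu_G}$ ($c_G^2 = 0$). An exponential repair time with mean $\mu_G$ gives $q = \lambda\mu_G/(1 + \lambda\mu_G)$ ($c_G^2 = 1$). The Python sketch below, with helper names of our own, compares both with the second-order expansion. It also shows that the more variable repair law yields the smaller q.

```python
import math


def q_constant_repair(lam: float, mu: float) -> float:
    """q = int (1 - e^{-lam t}) dG(t) for a constant repair time mu (c_G^2 = 0)."""
    return 1 - math.exp(-lam * mu)


def q_exponential_repair(lam: float, mu: float) -> float:
    """The same integral for an exponential repair time with mean mu (c_G^2 = 1)."""
    return lam * mu / (1 + lam * mu)


def q_second_order(lam: float, mu: float, cg2: float) -> float:
    """Expansion lam*mu - (lam*mu)^2 (1 + c_G^2) / 2."""
    x = lam * mu
    return x - 0.5 * x * x * (1 + cg2)
```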

For this example it is also possible to establish an explicit formula for $E_0S$. It is seen that
$$E_0S = E\min\{T_1, T_2\} + E[R \mid R < T],$$
where $T_1$ and $T_2$ are the times to failure of components 1 and 2, respectively. But
$$E\min\{T_1, T_2\} = \frac{1}{2\lambda}$$
and
$$E[R \mid R < T] = E[R\,I(R < T)]/(1 - q) = \int_0^\infty r e^{-\lambda r}\,dG(r)\Big/(1 - q).$$
This gives
$$E_0S = \frac{1}{2\lambda} + \frac{1}{1 - q}\int_0^\infty r e^{-\lambda r}\,dG(r).$$
From the Taylor formula we have $e^{-x} = 1 - xO(1)$, $x \to 0$, where $|O(1)| \le 1$. Using this and noting that
$$\int_0^\infty r e^{-\lambda r}\,dG(r) = \mu_G[1 + \lambda\mu_G(c_G^2 + 1)O(1)],$$
it can be shown that if the failure rate λ and the squared coefficient of variation $c_G^2$ are bounded by a finite constant, then the normalizing factor $q/E_0S$ is asymptotically given by
$$\frac{q}{E_0S} = 2\lambda^2\mu_G + o(\lambda\mu_G), \quad \lambda\mu_G \to 0.$$
Now we will show that the system failure rate λΦ , defined by (4.41), p. 123,
is also approximately equal to 2λ2 μG . First note that the unavailability of a
component, Ā, is given by Ā = λμG /(1 + λμG ). It follows that
$$\lambda_\Phi = \frac{2\bar A}{\lambda^{-1} + \mu_G} = 2\lambda^2\mu_G + o(\lambda\mu_G), \quad \lambda\mu_G \to 0, \tag{4.64}$$
provided that the failure rate λ is bounded by a finite constant.
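For a constant repair time $\mu_G$ we have $\int_0^\infty r e^{-\lambda r}\,dG(r) = \mu_G e^{-\lambda\mu_G}$ and $1 - q = e^{-\lambda\mu_G}$, so $E_0S = 1/(2\lambda) + \mu_G$ exactly. The sketch below assumes this constant-repair case and uses a hypothetical helper name. It evaluates $q/E_0S$ and $\lambda_\Phi$ from (4.64) and compares both with $2\lambda^2\mu_G$.

```python
import math
from typing import Dict


def normalizing_factors_const_repair(lam: float, mu: float) -> Dict[str, float]:
    """q/E_0S and lambda_Phi for two identical parallel components with constant repair time mu."""
    q = 1 - math.exp(-lam * mu)
    # int r e^{-lam r} dG(r) = mu e^{-lam mu} and 1 - q = e^{-lam mu}, hence E_0S = 1/(2 lam) + mu.
    e0s = 1 / (2 * lam) + mu * math.exp(-lam * mu) / (1 - q)
    a_bar = lam * mu / (1 + lam * mu)        # component unavailability
    lam_phi = 2 * a_bar / (1 / lam + mu)     # formula (4.64), exact form
    return {"q/E0S": q / e0s, "lambda_Phi": lam_phi, "2 lam^2 mu": 2 * lam ** 2 * mu}
```

In this case the ratios to $2\lambda^2\mu_G$ are $(1 - e^{-x})/(x(1 + 2x))$ and $(1 + x)^{-2}$ with $x = \lambda\mu_G$. Both tend to 1 as $x \to 0$.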
Next we will compute the exact distribution and mean of $T_\Phi$. Let us denote this distribution by $F_{T_\Phi}(t)$. In the following, $F_X$ denotes the distribution of any random variable X and $F_{iX}(t) = P_i(X \le t)$, i = 0, 1, where $P_0(\cdot) = P(\cdot \mid N_S = 0)$ and $P_1(\cdot) = P(\cdot \mid N_S \ge 1)$. Observe that the length of a renewal cycle S can be written as $S' + S''$. Here $S'$ represents the time to the first failure of a component, and $S''$ represents the "busy" period, i.e., the time from when one component has failed until the process returns to the best state (1, 1). The variables $S'$ and $S''$ are independent and $S'$ is exponentially distributed with rate $\tilde\lambda = 2\lambda$. Now, assume a component has failed. Let R denote the repair time of this component and let T denote the time to failure of the operating component. Then
$$F_{1T}(t) = P(T \le t \mid T \le R) = \frac{1}{q}\int_0^\infty (1 - e^{-\lambda(t \wedge r)})\,dG(r),$$
where $a \wedge b$ denotes the minimum of a and b. Furthermore,
$$F_{0R}(t) = P(R \le t \mid R < T) = \frac{1}{\bar q}\int_0^t e^{-\lambda r}\,dG(r),$$
where $\bar q = 1 - q$. Now, by conditioning on whether a system failure occurs in the first renewal cycle or not, we obtain
$$\begin{aligned}
F_{T_\Phi}(t) &= qP(T_\Phi \le t \mid N_S \ge 1) + \bar qP(T_\Phi \le t \mid N_S = 0) \\
&= qF_{1T_\Phi}(t) + \bar qF_{0T_\Phi}(t).
\end{aligned} \tag{4.65}$$

To find an expression for $F_{1T_\Phi}(t)$ we use a standard conditional probability argument, yielding
$$\begin{aligned}
F_{1T_\Phi}(t) &= \int_0^t P_1(T_\Phi \le t \mid S' = s)\,dF_{S'}(s) \\
&= \int_0^t P(T \le t - s \mid T \le R)\,dF_{S'}(s) \\
&= \int_0^t F_{1T}(t - s)\,dF_{S'}(s).
\end{aligned}$$
Consider now $F_{0T_\Phi}(t)$. By conditioning on S = s, we obtain
$$\begin{aligned}
F_{0T_\Phi}(t) &= \int_0^t P_0(T_\Phi \le t \mid S = s)\,dF_{0S}(s) \\
&= \int_0^t F_{T_\Phi}(t - s)\,dF_{0S}(s).
\end{aligned}$$
Inserting the above expressions into (4.65) gives
$$F_{T_\Phi}(t) = h(t) + \bar q\int_0^t F_{T_\Phi}(t - s)\,dF_{0S}(s),$$
where
$$h(t) = q\int_0^t F_{1T}(t - s)\,dF_{S'}(s). \tag{4.66}$$
Hence, $F_{T_\Phi}(t)$ satisfies a renewal equation with the defective distribution $\bar qF_{0S}(s)$. Arguing as in the proof of Theorem B.2, p. 275, in Appendix B, it follows that
$$F_{T_\Phi}(t) = h(t) + \int_0^t h(t - s)\,dM_0(s), \tag{4.67}$$
where the renewal function $M_0(s)$ equals
$$M_0(s) = \sum_{j=1}^\infty \bar q^{\,j} F_{0S}^{*j}(s).$$

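We note that $F_{0S} = F_{S'} * F_{0R}$ and that the Laplace transform of $S'$ equals $2\lambda/(2\lambda + v)$. We also note that $\bar q = L_G(\lambda)$ and $L_{F_{0R}}(v) = L_G(v + \lambda)/L_G(\lambda)$. Hence the Laplace transform of $M_0$ takes the form
$$L_{M_0}(v) = \frac{\bar q\,\frac{2\lambda}{2\lambda+v}\,L_{F_{0R}}(v)}{1 - \bar q\,\frac{2\lambda}{2\lambda+v}\,L_{F_{0R}}(v)} = \frac{\frac{2\lambda}{2\lambda+v}\,L_G(v+\lambda)}{1 - \frac{2\lambda}{2\lambda+v}\,L_G(v+\lambda)}.$$
It is seen that the Laplace transform of $F_{1T}$ is given by
$$L_{F_{1T}}(v) = \frac{1}{1 - L_G(\lambda)}\,(1 - L_G(v+\lambda))\,\frac{\lambda}{\lambda+v}.$$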
Now using (4.67) and (4.66) and the above expressions for the Laplace transforms, we obtain the following simple formula for $L_{F_{T_\Phi}}$:
$$L_{F_{T_\Phi}}(v) = \frac{2\lambda^2}{\lambda + v}\cdot\frac{1 - L_G(v+\lambda)}{v + 2\lambda(1 - L_G(v+\lambda))}.$$
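A quick consistency check of this transform: it should equal 1 at v = 0, and minus its derivative at v = 0 should equal $ET_\Phi$. The Python sketch below assumes a constant repair time $\mu_G$, so that $L_G(s) = e^{-s\mu_G}$. It approximates the derivative by a forward difference with a step size of our choosing. The result is compared with $ET_\Phi = (3 - 2L_G(\lambda))/(2\lambda(1 - L_G(\lambda)))$, which follows from the renewal argument. The helper names are ours.

```python
import math
from typing import Callable


def lst_t_phi(v: float, lam: float, lg: Callable[[float], float]) -> float:
    """2 lam^2/(lam+v) * (1 - L_G(v+lam)) / (v + 2 lam (1 - L_G(v+lam)))."""
    a = 1 - lg(v + lam)
    return 2 * lam ** 2 / (lam + v) * a / (v + 2 * lam * a)


def mean_from_lst(lam: float, lg: Callable[[float], float], h: float = 1e-7) -> float:
    """E T_Phi = -d/dv L(v) at v = 0, approximated by a forward difference with step h."""
    return (lst_t_phi(0.0, lam, lg) - lst_t_phi(h, lam, lg)) / h
```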
The mean $ET_\Phi$ can be found from this formula, or alternatively by using a direct renewal argument. We obtain
$$ET_\Phi = ES' + E(T_\Phi - S') = \frac{1}{2\lambda} + E\min\{R, T\} + (1 - q)ET_\Phi,$$
noting that the time one component is down before system failure occurs or the renewal cycle terminates equals $\min\{R, T\}$. If a system failure does not occur, the process starts over again. It follows that
$$ET_\Phi = \frac{1}{2q\lambda} + \frac{E\min\{R, T\}}{q}.$$
Note that
$$E\min\{R, T\} = \int_0^\infty \bar F(t)\bar G(t)\,dt = \int_0^\infty e^{-\lambda t}\bar G(t)\,dt.$$
It is also possible to write
$$ET_\Phi = \frac{3}{2\lambda}\cdot\frac{1 - \frac{2}{3}L_G(\lambda)}{1 - L_G(\lambda)}.$$
Now using the Taylor formula $e^{-x} = 1 - xO(1)$, $|O(1)| \le 1$, we obtain
$$E\min\{R, T\} = \int_0^\infty e^{-\lambda t}\bar G(t)\,dt = \mu_G + \lambda\mu_G^2(c_G^2 + 1)O(1),$$
where $c_G^2$ is the squared coefficient of variation of G. From this it can be shown that the normalizing factor $1/ET_\Phi$ can be written in the same form as the other normalizing factors:
$$\frac{1}{ET_\Phi} = 2\lambda^2\mu_G + o(\lambda\mu_G), \quad \lambda\mu_G \to 0,$$
assuming that λ and $c_G^2$ are bounded by a finite constant.
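The two expressions for $ET_\Phi$ can be compared numerically. For illustration, the Python sketch below assumes an exponential repair time with mean $\mu_G$. Then $L_G(\lambda) = 1/(1 + \lambda\mu_G)$, and both formulas reduce to $ET_\Phi = (1 + 3\lambda\mu_G)/(2\lambda^2\mu_G)$, i.e., $1/ET_\Phi = 2\lambda^2\mu_G/(1 + 3\lambda\mu_G)$. $E\min\{R, T\}$ is computed by the trapezoidal rule on a truncated interval. The helper names are ours.

```python
import math
from typing import Callable


def expected_min_repair_life(lam: float, surv_g: Callable[[float], float],
                             upper: float, steps: int = 100_000) -> float:
    """E min{R, T} = integral_0^inf e^{-lam t} Gbar(t) dt (trapezoidal rule on [0, upper])."""
    h = upper / steps

    def f(t: float) -> float:
        return math.exp(-lam * t) * surv_g(t)

    total = 0.5 * (f(0.0) + f(upper)) + sum(f(k * h) for k in range(1, steps))
    return total * h


def et_phi_renewal(lam: float, q: float, e_min: float) -> float:
    """E T_Phi = 1 / (2 q lam) + E min{R, T} / q (direct renewal argument)."""
    return 1 / (2 * q * lam) + e_min / q


def et_phi_transform(lam: float, lg_at_lam: float) -> float:
    """E T_Phi = (3 / (2 lam)) * (1 - (2/3) L_G(lam)) / (1 - L_G(lam))."""
    return 3 / (2 * lam) * (1 - 2 / 3 * lg_at_lam) / (1 - lg_at_lam)
```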

Asymptotic Analysis for Systems having Components in Series with the Rest of the System

We now return to the general asymptotic analysis. Remember that $d = \sum_i \lambda_i\mu_{G_i}$ and $\tilde\lambda = \sum_i \lambda_i$. So far we have focused on nonseries systems (series systems have q = 1). Below we show that a series system also has a Poisson limit under the assumption that the lifetimes are exponentially distributed. We also formulate and prove a general asymptotic result for the situation where some components are in series with the rest of the system. A component i is in series with the rest of the system if $\Phi(0_i, \mathbf{1}) = 0$.

Theorem 4.35. Assume that Φ is a series system and the lifetimes are exponentially distributed. Let $\lambda_i$ be the failure rate of component i. If $d \to 0$ (as $j \to \infty$), then (as $j \to \infty$)
$$N_{t/\tilde\lambda} \xrightarrow{D} \mathrm{Poisson}(t).$$

Proof. Let $N_t^P(i)$ be the Poisson process with intensity $\lambda_i$ generated by the consecutive uptimes of component i. Then it is seen that
$$\sum_{i=1}^n N^P_{t/\tilde\lambda}(i) - D = N_{t/\tilde\lambda} \le \sum_{i=1}^n N^P_{t/\tilde\lambda}(i),$$
where
$$D = \sum_{i=1}^n N^P_{t/\tilde\lambda}(i) - N_{t/\tilde\lambda}.$$
We have $D \ge 0$, and hence the conclusion of the theorem follows if we can show that $ED \to 0$, since then D converges in probability to zero. Note that $\sum_{i=1}^n N^P_{t/\tilde\lambda}(i)$ is Poisson distributed with mean
$$E\sum_{i=1}^n N^P_{t/\tilde\lambda}(i) = \sum_{i=1}^n (t/\tilde\lambda)\lambda_i = t. \tag{4.68}$$
From (4.36) of Corollary 4.18, p. 121, we have
$$EN_{t/\tilde\lambda} = \sum_{i=1}^n \int_0^{t/\tilde\lambda} [h(1_i, A(s)) - h(0_i, A(s))]\,\lambda_i A_i(s)\,ds,$$
which gives
$$EN_{t/\tilde\lambda} = \sum_{i=1}^n \int_0^{t/\tilde\lambda} \prod_{k \ne i} A_k(s)\,\lambda_i A_i(s)\,ds = \tilde\lambda\int_0^{t/\tilde\lambda} \prod_{k=1}^n A_k(s)\,ds.$$
We combine this expression with (4.68) and the inequality $1 - \prod_i(1 - q_i) \le \sum_i q_i$. We also use the component unavailability bound (4.22) of Proposition 4.11, p. 114 ($\bar A_i(t) \le \lambda_i\mu_{G_i}$). This gives
$$\begin{aligned}
ED &= \tilde\lambda\int_0^{t/\tilde\lambda}\Big[1 - \prod_{i=1}^n A_i(s)\Big]\,ds \\
&\le \tilde\lambda\int_0^{t/\tilde\lambda} \sum_{i=1}^n \bar A_i(s)\,ds \\
&\le \tilde\lambda(t/\tilde\lambda)\sum_{i=1}^n \lambda_i\mu_{G_i} = td.
\end{aligned}$$
Now if $d \to 0$, we see that $ED \to 0$ and the proof is complete. □
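The bound $ED \le td$ can be illustrated numerically when repair times are also exponential, an assumption made only for this sketch. Take repair rate $\nu_i = 1/\mu_{G_i}$ and assume component i is functioning at time 0. Then $A_i(s) = \nu_i/(\lambda_i + \nu_i) + \lambda_i/(\lambda_i + \nu_i)\,e^{-(\lambda_i + \nu_i)s}$. The Python sketch below evaluates the integral expression for ED by the trapezoidal rule. Helper names and parameter values are hypothetical.

```python
import math
from typing import Sequence


def availability_exp(lam: float, nu: float, s: float) -> float:
    """A_i(s) for exponential uptimes (rate lam) and downtimes (rate nu), up at s = 0."""
    return nu / (lam + nu) + lam / (lam + nu) * math.exp(-(lam + nu) * s)


def expected_d(lams: Sequence[float], nus: Sequence[float], t: float,
               steps: int = 20_000) -> float:
    """ED = lam_tilde * int_0^{t/lam_tilde} [1 - prod_i A_i(s)] ds (trapezoidal rule)."""
    lam_tilde = sum(lams)
    upper = t / lam_tilde
    h = upper / steps

    def g(s: float) -> float:
        prod = 1.0
        for lam, nu in zip(lams, nus):
            prod *= availability_exp(lam, nu, s)
        return 1.0 - prod  # series-system unavailability at time s

    total = 0.5 * (g(0.0) + g(upper)) + sum(g(k * h) for k in range(1, steps))
    return lam_tilde * total * h
```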




Remark 4.36. Arguing as in the proof of the theorem above, it can be shown that if $a_j \to a$ as $j \to \infty$, then
$$N_{a_jt/\tilde\lambda} \xrightarrow{D} \mathrm{Poisson}(ta).$$
Observe that $\sum_{i=1}^n N^P_{a_jt/\tilde\lambda}(i)$ is Poisson distributed with parameter $a_jt$. As $j \to \infty$ this variable converges in distribution to a Poisson variable with parameter at.

Theorem 4.37. Assume that the components have exponentially distributed lifetimes, and let $\lambda_i$ be the failure rate of component i. Let A denote the set of components that are in series with the rest of the system, and let B be the remaining components. Let $N^A$, $\tilde\lambda^A$, etc., denote the number of system failures, the total failure rate, etc., associated with the series system comprising the components in A. Similarly define $N^B$, $\alpha^B$, $d^B$, etc., for the system comprising the components in B. Assume that the following conditions hold (as $j \to \infty$):
1. $d \to 0$
2. The conditions of Theorem 4.25, p. 129, i.e., (4.48)–(4.51), hold for system B
3. $\tilde\lambda^A/\alpha^B \to a$.
Then (as $j \to \infty$)
$$N_{t/\alpha^B} \xrightarrow{D} \mathrm{Poisson}(t(1 + a)).$$

Remark 4.38. The conditions of Theorem 4.25 ensure that
$$N^B_{t/\alpha^B} \xrightarrow{D} \mathrm{Poisson}(t),$$
cf. Theorem 4.31, p. 136. Theorem 4.30, p. 135, gives sufficient conditions for (4.48)–(4.51).
4.5 Downtime Distribution Given System Failure 145

Proof. First note that
$$N_{t/\alpha^B} \le N^A_{t/\alpha^B} + N^B_{t/\alpha^B} = N^A_{a_jt/\tilde\lambda^A} + N^B_{t/\alpha^B},$$
where $a_j = \tilde\lambda^A/\alpha^B$. In view of Remark 4.36 above and the conditions of the theorem, it is sufficient to show that $D^*$ converges to zero. Here $D^*$ is defined as the expected number of times system A fails while system B is down, or vice versa. The probability that system A (B) is not functioning is less than or equal to d. Indeed, the unavailability of a monotone system is bounded by the sum of the component unavailabilities, which in turn is bounded by d, cf. (4.22), p. 115. It is therefore seen that
$$\begin{aligned}
D^* &\le d\,[EN^A_{a_jt/\tilde\lambda^A} + EN^B_{t/\alpha^B}] \le d\,[\tilde\lambda^A a_jt/\tilde\lambda^A + EN^B_{t/\alpha^B}] \\
&= d\,[a_jt + EN^B_{t/\alpha^B}].
\end{aligned}$$
To find a suitable bound on $EN^B_{t/\alpha^B}$, we need to refer to the argumentation in the proof of Theorem 4.43, formulas (4.88) and (4.93), p. 156. Using these results we can show that $EN^B_{t/\alpha^B} \to t$. Hence, $D^* \to 0$ and the theorem is proved. □


4.5 Downtime Distribution Given System Failure


In this section we study the downtime distribution of the system given that
a failure has occurred. We investigate the downtime distribution given a fail-
ure at time t, the asymptotic (steady-state) distribution obtained by letting
t → ∞, and the distribution of the downtime following the ith system fail-
ure. Recall that Φ represents the structure function of the system and Nt the
number of system failures in [0, t]. Component i generates an alternating re-
newal process with uptime distribution Fi and downtime distribution Gi , with
means μFi and μGi , respectively. The lifetime distribution Fi is absolutely
continuous with a failure rate function λi . The n component processes are
independent.
Let $\Delta N_t = N_t - N_{t-}$. Define $G_\Phi(\cdot, t)$ as the downtime distribution at time t, i.e.,
$$G_\Phi(y, t) = P(Y \le y \mid \Delta N_t = 1),$$
where Y is a random variable representing the downtime (we omit the dependency on t). The asymptotic (steady-state) downtime distribution is given by
$$G_\Phi(y) = \lim_{t\to\infty} G_\Phi(y, t),$$
assuming that the limit exists. It turns out that it is quite simple to establish the asymptotic (steady-state) downtime distribution of a parallel system, so we first consider this category of systems.
4.5.1 Parallel System

Consider a parallel system comprising n stochastically identical components, with repair time distribution G. Since a system failure coincides with one and only one component failure, we have
$$P(Y > y \mid \Delta N_t = 1) = \bar G(y)[\bar G_{\alpha_t}(y)]^{n-1},$$
where $\bar G_{\alpha_t}(y) = P(\alpha_t(i) > y \mid X_i(t) = 0)$ and $G_{\alpha_t}$ denotes the distribution of the forward recurrence time in state 0 of a component. But we know from (4.14) and (4.15), p. 112, that the asymptotic form of $\bar G_{\alpha_t}(y)$ is given by
$$\lim_{t\to\infty} \bar G_{\alpha_t}(y) = \frac{\int_y^\infty \bar G(x)\,dx}{\mu_G} = \bar G_\infty(y). \tag{4.69}$$
Thus we have proved the following theorem.

Theorem 4.39. For a parallel system of n identical components, the asymptotic (steady-state) downtime distribution given system failure equals
$$G_\Phi(y) = 1 - \bar G(y)\left[\frac{\int_y^\infty \bar G(x)\,dx}{\mu_G}\right]^{n-1}. \tag{4.70}$$
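Formula (4.70) only needs the repair-time survival function. The Python sketch below evaluates it. The names are ours, and the tail integral is approximated by the trapezoidal rule on [y, upper], with `upper` an assumed truncation point. Two cases have known answers and serve as checks. Exponential repair times with mean $\mu_G$ give $G_\Phi(y) = 1 - e^{-ny/\mu_G}$. Uniform repair times on [0, b] give $1 - (1 - y/b)^{2n-1}$.

```python
import math
from typing import Callable


def steady_state_downtime_cdf(y: float, n: int, surv_g: Callable[[float], float],
                              mu_g: float, upper: float, steps: int = 20_000) -> float:
    """G_Phi(y) = 1 - Gbar(y) * [int_y^inf Gbar(x) dx / mu_G]^(n-1), formula (4.70).

    The tail integral is approximated on [y, upper]; 'upper' must cover the support
    of G up to a negligible remainder.
    """
    if y >= upper:
        return 1.0
    h = (upper - y) / steps
    tail = 0.5 * (surv_g(y) + surv_g(upper)) + sum(surv_g(y + k * h) for k in range(1, steps))
    tail *= h
    return 1.0 - surv_g(y) * (tail / mu_g) ** (n - 1)


# Three identical components, exponential repair with mean 2: compare with 1 - exp(-3y/2).
print(steady_state_downtime_cdf(1.0, 3, lambda x: math.exp(-x / 2.0), 2.0, 60.0),
      1 - math.exp(-1.5))
```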

Next we consider a parallel system of not necessarily identical components. We have the following result.

Theorem 4.40. Let $m_i(t)$ be the renewal density function of $M_i(t)$, and assume that $m_i(t)$ is right-continuous and satisfies
$$\lim_{t\to\infty} m_i(t) = \frac{1}{\mu_{F_i} + \mu_{G_i}}. \tag{4.71}$$
For a parallel system of not necessarily identical components, the asymptotic (steady-state) downtime distribution given system failure equals
$$G_\Phi(y) = \sum_{i=1}^n c_i\left[1 - \bar G_i(y)\prod_{k \ne i}\frac{\int_y^\infty \bar G_k(x)\,dx}{\mu_{G_k}}\right],$$
where
$$c_i = \frac{1/\mu_{G_i}}{\sum_{k=1}^n 1/\mu_{G_k}} \tag{4.72}$$
denotes the asymptotic (steady-state) probability that component i causes a system failure.
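The weights (4.72) and the mixture in Theorem 4.40 are straightforward to evaluate. The sketch below assumes exponential repair times, for illustration only. In that case $\int_y^\infty \bar G_k(x)\,dx/\mu_{G_k} = e^{-y/\mu_{G_k}}$, and the whole mixture collapses to $1 - \exp(-y\sum_k 1/\mu_{G_k})$. The helper names are ours.

```python
import math
from typing import List, Sequence


def failure_cause_weights(mu_g: Sequence[float]) -> List[float]:
    """c_i = (1/mu_Gi) / sum_k (1/mu_Gk), formula (4.72)."""
    inv = [1.0 / m for m in mu_g]
    total = sum(inv)
    return [x / total for x in inv]


def downtime_cdf_exp_repairs(y: float, mu_g: Sequence[float]) -> float:
    """G_Phi(y) of Theorem 4.40 under the added assumption of exponential repair times."""
    c = failure_cause_weights(mu_g)
    total = 0.0
    for i, ci in enumerate(c):
        prod = math.exp(-y / mu_g[i])            # Gbar_i(y)
        for k, mk in enumerate(mu_g):
            if k != i:
                prod *= math.exp(-y / mk)        # equilibrium residual of component k
        total += ci * (1.0 - prod)
    return total
```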
Proof. The proof follows the lines of the proof of Theorem 4.39. The difference is that we have to take into consideration which component causes system failure and the probability of this event given system failure. Clearly,
$$1 - \bar G_i(y)\prod_{k \ne i}\frac{\int_y^\infty \bar G_k(x)\,dx}{\mu_{G_k}}$$
equals the asymptotic downtime distribution given that component i causes system failure. Hence it suffices to show (4.72). The system failure rate $\lambda_\Phi$ is given by $\lambda_\Phi = \sum_{i=1}^n \lambda_\Phi^{(i)}$, where
$$\lambda_\Phi^{(i)} = \prod_{k \ne i}\bar A_k\,\frac{1}{\mu_{F_i} + \mu_{G_i}}$$
represents the expected number of system failures per unit of time caused by failures of component i. An intuitive argument then gives that the asymptotic (steady-state) probability that component i causes system failure equals
$$\frac{\lambda_\Phi^{(i)}}{\lambda_\Phi} = \frac{\frac{1}{\mu_{F_i}+\mu_{G_i}}\prod_{k \ne i}\bar A_k}{\sum_{l=1}^n \frac{1}{\mu_{F_l}+\mu_{G_l}}\prod_{k \ne l}\bar A_k} = \frac{\frac{1}{\mu_{G_i}}\prod_{k=1}^n \bar A_k}{\sum_{l=1}^n \frac{1}{\mu_{G_l}}\prod_{k=1}^n \bar A_k} = c_i.$$
To establish sufficient conditions for this result to hold, we need to carry out a somewhat more formal proof. Let $c_i(t)$ be defined as the conditional probability that component i causes system failure given that the system failure occurs at time t. For each h > 0 let
$$N^c_{[t,t+h)}(i) = \int_{[t,t+h)} (\Phi(1_i, X_s) - \Phi(0_i, X_s))\,dN_s(i),$$
$$N^c_{[t,t+h)} = \sum_{i=1}^n N^c_{[t,t+h)}(i).$$
Then
$$c_i(t) = \lim_{h\to 0+}\frac{P(N^c_{[t,t+h)}(i) = 1)}{P(N^c_{[t,t+h)} = 1)} = \lim_{h\to 0+}\frac{\frac{1}{h}EN^c_{[t,t+h)}(i) - o_i(1)}{\frac{1}{h}EN^c_{[t,t+h)} - o(1)}, \tag{4.73}$$
where
$$o_i(1) = E[N^c_{[t,t+h)}(i)\,I(N^c_{[t,t+h)}(i) \ge 2)]/h$$
and
$$o(1) = E[N^c_{[t,t+h)}\,I(N^c_{[t,t+h)} \ge 2)]/h.$$
Hence it remains to study the limit of the ratio of the first terms of (4.73). We use that
$$EN^c_{[t,t+h)}(i) = \int_{[t,t+h)} (h(1_i, A(s)) - h(0_i, A(s)))\,m_i(s)\,ds,$$
where $A_i(s) = P(X_s(i) = 1)$ equals the availability of component i at time s. It follows that
$$c_i(t) = \frac{\{h(1_i, A(t)) - h(0_i, A(t))\}\,m_i(t)}{\sum_{k=1}^n \{h(1_k, A(t)) - h(0_k, A(t))\}\,m_k(t)}.$$
From this expression, we see that $\lim_{t\to\infty} c_i(t) = c_i$ provided that
$$\lim_{t\to\infty} m_i(t) = \frac{1}{\mu_{F_i} + \mu_{G_i}}.$$
This completes the proof of the theorem. □


Remark 4.41. 1. From renewal theory (see Theorem B.10, p. 278, in Appendix B), sufficient conditions can be formulated for the limiting result (4.71) to hold true. For example, suppose the renewal cycle lengths $T_{ik} + R_{ik}$ have a density function h with $h(t)^p$ integrable for some p > 1, and $h(t) \to 0$ as $t \to \infty$. Then $M_i$ has a density $m_i$ such that $m_i(t) \to 1/(\mu_{F_i} + \mu_{G_i})$ as $t \to \infty$. If component i has an exponential lifetime distribution with parameter $\lambda_i$, then we know that $m_i(t) = \lambda_i A_i(t)$ (cf. (4.18), p. 114), which converges to $1/(\mu_{F_i} + \mu_{G_i})$.
2. From the above proof it is seen that the downtime distribution at time t, $G_\Phi(y, t)$, is given by
$$G_\Phi(y, t) = \sum_{i=1}^n c_i(t)\left[1 - \bar G_i(y)\prod_{k \ne i}\bar G_{k\alpha_t}(y)\right].$$

4.5.2 General Monotone System

Consider now an arbitrary monotone system comprising the minimal cut sets $K_k$, $k = 1, 2, \ldots, k_0$. No simple formula exists for the downtime distribution in this case. But for highly available systems the following formula can be used to approximate the downtime distribution:
$$\sum_k r_k G_{K_k}(y),$$
where
$$r_k = \frac{\lambda_{K_k}}{\sum_l \lambda_{K_l}}.$$
Here λKk and GKk denote the asymptotic (steady-state) failure rate of min-
imal cut set Kk and the asymptotic (steady-state) downtime distribution of
minimal cut set Kk , respectively, when this set is considered in isolation (i.e.,
we consider the parallel system comprising the components in Kk ). We see
that rk is approximately equal to the probability that minimal cut set Kk
causes system failure. Refer to [23, 72] for more detailed analyses in the general
case. In [72] it is formally proved that the asymptotic downtime distribution
exists and is equal to the steady-state downtime distribution.
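The approximation $\sum_k r_k G_{K_k}(y)$ can be assembled from the parallel-system results. $\lambda_{K_k}$ is the steady-state failure rate of the parallel system formed by $K_k$, i.e., the expression for $\lambda_\Phi^{(i)}$ in the proof of Theorem 4.40 summed over $i \in K_k$. $G_{K_k}$ comes from Theorem 4.40. The Python sketch below additionally assumes exponential repair times, so that $G_{K_k}$ has the closed form $1 - \exp(-y\sum_{i \in K_k} 1/\mu_{G_i})$. Function names and the example systems are hypothetical.

```python
import math
from typing import Sequence


def cut_set_rate(cut: Sequence[int], mu_f: Sequence[float], mu_g: Sequence[float]) -> float:
    """Steady-state failure rate of the parallel system formed by the components in 'cut'."""
    rate = 0.0
    for i in cut:
        term = 1.0 / (mu_f[i] + mu_g[i])
        for k in cut:
            if k != i:
                term *= mu_g[k] / (mu_f[k] + mu_g[k])  # steady-state unavailability of k
        rate += term
    return rate


def approx_downtime_cdf(y: float, cuts: Sequence[Sequence[int]],
                        mu_f: Sequence[float], mu_g: Sequence[float]) -> float:
    """sum_k r_k G_{K_k}(y) with r_k = lambda_{K_k} / sum_l lambda_{K_l}.

    G_{K_k} is taken from Theorem 4.40 under the assumption of exponential repair times.
    """
    rates = [cut_set_rate(c, mu_f, mu_g) for c in cuts]
    total = sum(rates)
    cdf = 0.0
    for c, r in zip(cuts, rates):
        cdf += r / total * (1.0 - math.exp(-y * sum(1.0 / mu_g[k] for k in c)))
    return cdf
```

For a 2-out-of-3 system of identical components, with minimal cut sets {0, 1}, {0, 2} and {1, 2}, all weights $r_k$ equal 1/3.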

4.5.3 Downtime Distribution of the ith System Failure

The above asymptotic (steady-state) formulas for GΦ give in most cases good
approximations to the downtime distribution of the ith system failure, i ∈ N.
Even for the first system failure observed, the asymptotic formulas produce
relatively accurate approximations. This is demonstrated by Monte Carlo simulations in [23]. An example is given below. Let the distance measure $D_i(y)$ be defined by
$$D_i(y) = |G_\Phi(y) - \hat G_{i,\Phi}(y)|,$$
where $\hat G_{i,\Phi}(y)$ equals the "true" downtime distribution of the ith system failure obtained by Monte Carlo simulations. Fig. 4.3 plots the distance measures for the first and second system failures as a function of y. The system is a parallel system of two identical components with constant repair times and exponential lifetimes. As we can see from the figure, the distance is quite small: the maximum distance is about 0.012 for i = 1 and 0.004 for i = 2.

[Plot: $D_i(y)$ on the vertical axis (0.000 to 0.014) against y on the horizontal axis (0.0 to 1.0), with one curve for i = 1 and one for i = 2.]

Fig. 4.3. The distance $D_i(y)$, i = 1, 2, as a function of y for a parallel system of two components with constant repair times, $\mu_G = 1$, $\lambda = 0.1$

Explicit expressions for the downtime distribution of the ith system failure are known only for some special cases. Below we present such expressions for the downtime distribution of the first failure for a two-component parallel system of identical components with exponentially distributed lifetimes.
Theorem 4.42. For a parallel system of two identical components with constant failure rate λ and repair time distribution G, the downtime distribution $G_{1,p2}(y)$ of the first system failure is given by
$$G_{1,p2}(y) = 1 - \bar G(y)\,\frac{\int_0^\infty\int_0^s \bar G(y + s - x)\,dF(x)\,dF(s)}{\int_0^\infty\int_0^s \bar G(s - x)\,dF(x)\,dF(s)} \tag{4.74}$$
$$G_{1,p2}(y) = 1 - \bar G(y)\,\frac{\int_y^\infty [1 - e^{-\lambda(r - y)}]\,dG(r)}{\int_0^\infty [1 - e^{-\lambda r}]\,dG(r)}. \tag{4.75}$$
Proof. Let $T_i$ and $R_i$ have distribution functions F and G, respectively, i = 1, 2, and let
$$Y = \min_{1 \le i \le 2}(T_i + R_i) - \max_{1 \le i \le 2} T_i.$$
It is seen that the downtime distribution $G_{1,p2}(y)$ equals the conditional distribution of Y given that Y > 0. The equality (4.74) follows if we can show that
$$P(Y > y) = \bar G(y)\int_0^\infty\int_0^s 2\bar G(y + s - x)\,dF(x)\,dF(s). \tag{4.76}$$
Consider the event that $T_i = s$, $T_j = x$, $R_i > y$, and $T_j + R_j > y + s$ for x < s and j ≠ i. For this event it holds that Y is greater than y. The probability of this event, integrated over all s and x, is given by
$$\int_0^\infty\int_0^s \bar G(y + s - x)\,\bar G(y)\,dF(x)\,dF(s).$$
By taking the union over i = 1, 2, we find that (4.76) holds.
But the double integral in (4.76) can be written as
$$\begin{aligned}
2\int_0^\infty\int_0^s \bar G(y + s - x)\,d(1 - e^{-\lambda x})\,d(1 - e^{-\lambda s})
&= 1 - \int_0^\infty\int_0^s G(y + s - x)\,2\lambda^2 e^{-\lambda(x + s)}\,dx\,ds \\
&= 1 - \int_0^\infty\int_x^\infty G(y + s - x)\,\lambda e^{-\lambda(s - x)}\,2\lambda e^{-2\lambda x}\,ds\,dx.
\end{aligned}$$
Introducing r = y + s − x gives
$$\begin{aligned}
1 - \int_0^\infty\int_y^\infty 2\lambda e^{-2\lambda x}\,G(r)\,\lambda e^{-\lambda(r - y)}\,dr\,dx
&= 1 - \int_y^\infty G(r)\,\lambda e^{-\lambda(r - y)}\,dr \\
&= \int_y^\infty (1 - e^{-\lambda(r - y)})\,dG(r).
\end{aligned}$$
4.6 Distribution of the System Downtime in an Interval 151

Thus the formulas (4.75) and (4.74) in the theorem are identical. This completes the proof of the theorem. □


Now what can we say about the limiting downtime distribution of the first system failure as the failure rate converges to 0? Is it equal to the steady-state downtime distribution $G_\Phi$? Yes: for the above example we can show that if the failure rate converges to 0, the distribution $G_{1,p2}(y)$ converges to the steady-state formula, i.e.,
$$\lim_{\lambda\to 0} G_{1,p2}(y) = 1 - \bar G(y)\,\frac{\int_y^\infty \bar G(r)\,dr}{\mu_G} = G_\Phi(y).$$
This is seen by noting that
$$\lim_{\lambda\to 0}\frac{\int_y^\infty [1 - e^{-\lambda(r - y)}]\,dG(r)}{\int_0^\infty [1 - e^{-\lambda r}]\,dG(r)} = \frac{\int_y^\infty (r - y)\,dG(r)}{\int_0^\infty r\,dG(r)} = \frac{\int_y^\infty \bar G(r)\,dr}{\int_0^\infty \bar G(r)\,dr}.$$
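For the example of Fig. 4.3 (constant repair time $\mu_G$), both distributions are explicit for $y < \mu_G$. We have $G_\Phi(y) = y/\mu_G$ and, from (4.75), $G_{1,p2}(y) = 1 - (1 - e^{-\lambda(\mu_G - y)})/(1 - e^{-\lambda\mu_G})$. The Python sketch below, with helper names of our own, computes $\max_y |G_\Phi(y) - G_{1,p2}(y)|$ on a grid. For $\mu_G = 1$ and $\lambda = 0.1$ it gives roughly 0.0125. This is close to the maximum of about 0.012 reported for i = 1 in Fig. 4.3. A first-order expansion in λ (our own) suggests a peak of about $\lambda\mu_G/8$, which vanishes as $\lambda \to 0$.

```python
import math


def first_downtime_cdf_const(y: float, lam: float, mu: float) -> float:
    """G_{1,p2}(y) from (4.75) with a constant repair time mu (so Gbar(y) = 1 for y < mu)."""
    if y >= mu:
        return 1.0
    num = 1.0 - math.exp(-lam * (mu - y))  # int_y^inf [1 - e^{-lam (r - y)}] dG(r)
    den = 1.0 - math.exp(-lam * mu)        # int_0^inf [1 - e^{-lam r}] dG(r)
    return 1.0 - num / den


def steady_state_cdf_const(y: float, mu: float) -> float:
    """G_Phi(y) of (4.70) with n = 2 and constant repair: y / mu on [0, mu)."""
    return min(1.0, y / mu)


def max_distance(lam: float, mu: float = 1.0, grid: int = 1000) -> float:
    """max over a grid on [0, mu] of |G_Phi(y) - G_{1,p2}(y)|."""
    ys = [mu * k / grid for k in range(grid + 1)]
    return max(abs(steady_state_cdf_const(y, mu) - first_downtime_cdf_const(y, lam, mu))
               for y in ys)
```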
This result can be extended to general monotone systems, and it is not necessary to establish an exact expression for the distribution of the first downtime; see [72]. Consider the asymptotic set-up introduced in Sect. 4.4 to study highly available components. The components have exponential lifetime distributions $F_{ij}(t) = 1 - e^{-\lambda_{ij}t}$ and fixed repair time distributions $G_i$, and we assume $\lambda_{ij} \to 0$ as $j \to \infty$. Then for a parallel system it can be shown that the distribution of the ith system downtime converges as $j \to \infty$ to the steady-state downtime distribution $G_\Phi$. For a general system it is more complicated. Assume that the steady-state downtime distribution converges as $j \to \infty$ to $G_\Phi^*$ (say). Then the distribution of the ith system downtime converges to the same limit. See [72] for details.

4.6 Distribution of the System Downtime in an Interval


In this section we study the distribution of the system downtime in a time interval. The model is the one described in Sect. 4.3, p. 120: the system is monotone and comprises n independent components, and component i generates an alternating renewal process with uptime distribution $F_i$ and downtime distribution $G_i$.
We immediately observe that the asymptotic expression for the expected average downtime presented in Theorem 4.13, p. 116, also holds for monotone systems, with $A = h(\mathbf{A})$, where $\mathbf{A}$ is the vector of (limiting) component availabilities. Formula (4.28) of Theorem 4.13 requires that the process X is a regenerative process with finite expected cycle length.
The rest of this section is organized as follows. First we present some approximate methods for computing the distribution of $Y_u$ (the downtime in the time interval [0, u]) when the components are highly available, utilizing that $(Y_u)$ is approximately a compound Poisson process, denoted $(CP_u)$, and the exact one-unit formula (4.30), p. 118, for the downtime distribution. Then we formulate some sufficient conditions under which the distribution of $CP_u$ is an asymptotic limit. The framework is the same as described in Sect. 4.4.1, p. 126. Finally, we study the convergence to the normal distribution.

4.6.1 Compound Poisson Process Approximation

We assume that the components have constant failure rates and that the components are highly available, i.e., the products $\lambda_i\mu_{G_i}$ are small. Then it can be argued heuristically that the process $(Y_u)$, $u \in \mathbb{R}_+$, is approximately a compound Poisson process,
$$Y_u \approx \sum_{i=1}^{N_u} Y_i \approx CP_u. \tag{4.77}$$
Here $N_u$ is the number of system failures in [0, u] and $Y_i$ is the downtime of the ith system failure. The dependency between $N_u$ and the random variables $Y_i$ is not “very strong”, since $N_u$ is mainly governed by the renewal cycles without system failures. We can ignore the downtimes $Y_i$ of the second, third, etc., system failure in a renewal cycle of the process X, since the probability of having two or more system failures in a cycle is small when the components are highly available. This means that the random variables $Y_i$ are approximately independent and identically distributed.
From this we can find an approximate expression for the distribution of $Y_u$.
A closely related approximation can be established by considering system operational time, as described in the following.
Let $N_s^{op}$ be the number of system failures in [0, s] when we consider operational time. Similar to the reasoning in Sect. 4.4, p. 125, it can be argued that $N_s^{op}$ is approximately a homogeneous Poisson process with intensity
$$\frac{\lambda_\Phi}{h(\mathbf{A})} = \sum_{i=1}^n \frac{h(1_i,\mathbf{A}) - h(0_i,\mathbf{A})}{(\mu_{F_i} + \mu_{G_i})\,h(\mathbf{A})}.$$
To motivate this result, we note that the expected number of system failures per unit of time when considering calendar time is approximately equal to the asymptotic (steady-state) system failure rate $\lambda_\Phi$, given by (cf. formula (4.41), p. 123)
$$\lambda_\Phi = \sum_{i=1}^n \frac{h(1_i,\mathbf{A}) - h(0_i,\mathbf{A})}{\mu_{F_i} + \mu_{G_i}}.$$

Then, observing that the ratio between calendar time and operational time is approximately $1/h(\mathbf{A})$, we see that the expected number of system failures per unit of time when considering operational time, $EN^{op}(u, u+w]/w$, is approximately equal to $\lambda_\Phi/h(\mathbf{A})$. Furthermore, $N^{op}_{(u,u+w]}$ is “nearly independent” of the history of $N^{op}$ up to u, noting that the state process X frequently restarts itself probabilistically, i.e., X re-enters the state (1, 1, . . . , 1). It can be shown by repeating the proof of the Poisson limit Theorem 4.31, p. 136, and using the fact that $h(\mathbf{A}) \to 1$ as $\lambda_i\mu_{G_i} \to 0$, that $N^{op}_{t/\alpha}$ has an asymptotic Poisson distribution with parameter t. The system downtimes given system failure are approximately identically distributed with distribution function G(y), say, independent of $N^{op}$, and approximately independent of each other, since with high probability the state process X restarts itself quickly after a system failure. The distribution function G(y) is normally taken as the asymptotic (steady-state) downtime distribution given system failure, or an approximation to this distribution; see Sect. 4.5.
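Both rates are easy to evaluate once the reliability function h is available. The sketch below computes h by brute-force enumeration of the $2^n$ component states (so it is only meant for small n), then λΦ from formula (4.41) and the operational-time intensity λΦ/h(A). As an illustration it uses two identical components in parallel with μF = 10 and μG = 1, the setting of Fig. 4.4; the helper names are hypothetical.

```python
from itertools import product

def reliability(phi, p):
    """h(p) = P(phi(X) = 1) for independent binary components with P(X_i = 1) = p_i,
    computed by enumerating all 2^n component state vectors."""
    total = 0.0
    for x in product((0, 1), repeat=len(p)):
        if phi(x):
            prob = 1.0
            for xi, pi in zip(x, p):
                prob *= pi if xi else 1.0 - pi
            total += prob
    return total

def failure_rates(phi, mu_f, mu_g):
    """Return (lambda_Phi, lambda_Phi / h(A)) with lambda_Phi from formula (4.41)."""
    avail = [f / (f + g) for f, g in zip(mu_f, mu_g)]    # component availabilities A_i
    lam_phi = 0.0
    for i in range(len(avail)):
        up = avail[:i] + [1.0] + avail[i + 1:]            # (1_i, A)
        down = avail[:i] + [0.0] + avail[i + 1:]          # (0_i, A)
        lam_phi += ((reliability(phi, up) - reliability(phi, down))
                    / (mu_f[i] + mu_g[i]))
    return lam_phi, lam_phi / reliability(phi, avail)

def parallel(x):
    return int(any(x))

def series(x):
    return int(all(x))

lam_cal, lam_op = failure_rates(parallel, [10.0, 10.0], [1.0, 1.0])
print(lam_cal, lam_op)   # about 0.01653 (= 2/121) and 0.01667 (= 1/60)
```

For the parallel pair, $h(1_i,\mathbf A) - h(0_i,\mathbf A) = \bar A = 1/11$ for each component, which gives $\lambda_\Phi = 2/121$, while $h(\mathbf A) = 120/121$.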
Considering the system as a one-unit system, we can now apply the exact formula (4.30), p. 118, for the downtime distribution with the Poisson parameter λΦ. It follows that
$$P(Y_u \le y) \approx \sum_{n=0}^\infty G^{*n}(y)\,\frac{[\lambda_\Phi(u-y)]^n}{n!}\,e^{-\lambda_\Phi(u-y)} = P_u(y), \tag{4.78}$$
where the last equality defines $P_u(y)$. Formula (4.78) gives good approximations for “typical real life cases” with small component unavailabilities; see [82]. Figure 4.4 presents the downtime distribution for a parallel system of two components with repair times identically equal to 1 and μF = 10, using the steady-state formula GΦ for G (formula (4.70), p. 146). The “true” distribution is found using Monte Carlo simulation. We see that formula (4.78) gives a good approximation.
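Formula (4.78) is straightforward to evaluate numerically. The sketch below assumes the setting of Fig. 4.4: two identical components in parallel with μF = 10 and repair times identically equal to 1. With constant repair times we take GΦ to be uniform on [0, 1] (the downtime is the residual repair time of the component that failed first), so that $G^{*n}$ is the Irwin-Hall distribution. For λΦ it uses formula (4.41), which for this system gives 2Ā/(μF + μG) = 2/121. These choices and the function names are assumptions of the sketch.

```python
import math

def irwin_hall_cdf(n, y):
    """G^{*n}(y) when G is uniform on [0, 1]; G^{*0} is the point mass at 0."""
    if y < 0:
        return 0.0
    if n == 0:
        return 1.0
    s = sum((-1) ** k * math.comb(n, k) * (y - k) ** n
            for k in range(min(n, math.floor(y)) + 1))
    return min(1.0, max(0.0, s / math.factorial(n)))

def p_u(y, u, rate, conv_cdf, n_max=50):
    """P_u(y) of (4.78): sum_n G^{*n}(y) * [rate (u - y)]^n e^{-rate (u - y)} / n!."""
    m = rate * (u - y)
    total, pois = 0.0, math.exp(-m)          # pois = Poisson probability of n
    for n in range(n_max + 1):
        total += conv_cdf(n, y) * pois
        pois *= m / (n + 1)
    return total

lam_phi = 2 / 121        # lambda_Phi for the parallel example (mu_F = 10, mu_G = 1)
for y in (0.0, 0.2, 0.4, 0.6, 0.8):
    print(f"y = {y:.1f}  P_10(y) = {p_u(y, 10.0, lam_phi, irwin_hall_cdf):.4f}")
```

At y = 0 the sum reduces to $e^{-10\lambda_\Phi} \approx 0.848$, the probability of no system failure in [0, 10]; at y = 0.8 it is about 0.967, in line with the range shown in Fig. 4.4.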

4.6.2 Asymptotic Analysis

We argued above that $(Y_u)$ is approximately a compound Poisson process when the system comprises highly available components. In the following theorem we formalize this result.
The set-up is the same as in Sect. 4.4.1, p. 126. We consider for each component i a sequence $\{F_{ij}, G_{ij}\}$, $j \in \mathbb{N}$, of distributions satisfying certain conditions. To simplify notation, we normally omit the index j. When we assume in the following that X is a regenerative process, this is tacitly understood to hold for all $j \in \mathbb{N}$.
We say that a renewal cycle is a “success” if no system failure occurs during the cycle and a “fiasco” if a system failure occurs.
Let α be a suitable normalizing factor (or, more precisely, a normalizing sequence in j) such that $N_{t/\alpha}$ converges in distribution to a Poisson variable with mean t, cf. Theorem 4.31, p. 136. Normally we take $\alpha = \lambda_\Phi$, but we could also use $q/E_0S$, $q/ES$, or $1/ET_\Phi$, where q equals the probability that a system failure occurs in a cycle, S equals the length of a cycle, $E_0S$ equals the expected length of a cycle with no system failures, and $T_\Phi$ equals the time to the first system failure. Furthermore, let $Y_{i1}$ denote the length of the first downtime of the system in the ith “fiasco” renewal cycle, and $Y_{i2}$ the length of the remaining downtime in the same cycle. We assume that the asymptotic distribution of $Y_{i1}$ exists (as $j \to \infty$): $Y_{i1} \xrightarrow{D} G^*_\Phi$ (say).

Fig. 4.4. $P_{10}(y)$ and $P(Y_{10} \le y)$ for a parallel system of two components with constant repair times, μG = 1, λ = 0.1 (horizontal axis: y from 0 to 0.8; vertical axis: probability from 0.80 to 0.98).
A random variable is denoted CP(r, G) if it has the same distribution as $\sum_{i=1}^N Y_i$, where N is a Poisson variable with mean r, the variables $Y_i$ are i.i.d. with distribution function G, and N and the $Y_i$ are independent. The distribution of CP(r, G) equals
$$\sum_{i=0}^\infty G^{*i}\,\frac{r^i}{i!}\,e^{-r},$$
where $G^{*i}$ denotes the ith convolution of G (with $G^{*0}$ the distribution concentrated at 0).
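The series can be checked against the definition by simulation. In the sketch below G is taken to be exponential with rate β, an assumption made only because $G^{*i}$ is then the Erlang distribution function in closed form; N is sampled with Knuth's multiplication method. Names and parameter values are illustrative.

```python
import math
import random

def erlang_cdf(i, beta, y):
    """G^{*i}(y) for exponential G with rate beta; G^{*0} is the point mass at 0."""
    if y < 0:
        return 0.0
    if i == 0:
        return 1.0
    term, partial = 1.0, 0.0
    for k in range(i):                    # partial = sum_{k<i} (beta y)^k / k!
        partial += term
        term *= beta * y / (k + 1)
    return 1.0 - math.exp(-beta * y) * partial

def cp_cdf(y, r, conv_cdf, i_max=60):
    """P(CP(r, G) <= y) = sum_i G^{*i}(y) r^i e^{-r} / i!, truncated at i_max."""
    total, pois = 0.0, math.exp(-r)
    for i in range(i_max + 1):
        total += conv_cdf(i, y) * pois
        pois *= r / (i + 1)
    return total

def sample_cp(r, sample_y, rng):
    """One draw of Y_1 + ... + Y_N with N ~ Poisson(r) independent of the i.i.d. Y_i."""
    n, prod, limit = 0, rng.random(), math.exp(-r)
    while prod > limit:                   # Knuth: N = (number of uniforms used) - 1
        n += 1
        prod *= rng.random()
    return sum(sample_y(rng) for _ in range(n))

r, beta, y = 2.0, 1.0, 1.5
rng = random.Random(12345)
draws = [sample_cp(r, lambda g: g.expovariate(beta), rng) for _ in range(100_000)]
empirical = sum(d <= y for d in draws) / len(draws)
series = cp_cdf(y, r, lambda i, x: erlang_cdf(i, beta, x))
print(f"series {series:.4f}   simulation {empirical:.4f}")
```

The atom $e^{-r}$ at zero (no Poisson events) is visible in the series as the $i = 0$ term.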

Theorem 4.43. Assume that X is a regenerative process, and that $F_{ij}$ and $G_{ij}$ change in such a way that the following conditions hold (as $j \to \infty$):
$$q \to 0, \tag{4.79}$$
$$q\,c_{0S}^2 \to 0, \tag{4.80}$$
where $c_{0S}^2 = [E_0S^2/(E_0S)^2] - 1$ denotes the squared coefficient of variation of S under $P_0$,
$$\frac{qE_1S}{ES} \to 0, \tag{4.81}$$
$$E_1(N_S - 1) \to 0, \tag{4.82}$$
$$Y_{i1} \xrightarrow{D} G^*_\Phi. \tag{4.83}$$
Then (as $j \to \infty$)
$$Y_{t/\alpha} \xrightarrow{D} CP(t, G^*_\Phi), \tag{4.84}$$
where $\alpha = \lambda_\Phi$, $q/E_0S$, $q/ES$, or $1/ET_\Phi$.

Proof. First we will introduce two renewal processes, $N'$ and $N''$, having the same asymptotic properties as $N_{t/\alpha}$. From Theorem 4.31, p. 136, we know that
$$N_{t/\alpha} \xrightarrow{D} \text{Poisson}(t)$$
under conditions (4.79)–(4.82).
Let ν(1) equal the renewal cycle index associated with the first “fiasco” renewal cycle, and let $U_1$ denote the time to the starting point of this cycle, i.e.,
$$U_1 = \sum_{i=1}^{\nu(1)-1} S_i.$$
Note that if the first cycle is a “fiasco” cycle, then $U_1 = 0$. Starting from the beginning of renewal cycle ν(1) + 1, we define $U_2$ as the time to the starting point of the next “fiasco” renewal cycle. Similarly we define $U_3, U_4, \ldots$. The random variables $U_i$ are the interarrival times of the renewal process $N'_t$, i.e.,
$$N'_t = \sum_{k=1}^\infty I\Big(\sum_{i=1}^k U_i \le t\Big).$$
By repeating the proofs of Theorem 4.25 (p. 129) and Theorem 4.31 it is seen that
$$N'_{t/\alpha} \xrightarrow{D} \text{Poisson}(t). \tag{4.85}$$
Using that the process $N'_t$ and the random variables $Y_{i1}$ are independent, and the fact that $Y_{i1} \xrightarrow{D} G^*_\Phi$ (assumption (4.83)), it follows that
$$\sum_{i=1}^{N'_{t/\alpha}} Y_{i1} \xrightarrow{D} CP(t, G^*_\Phi). \tag{4.86}$$
A formal proof of this can be carried out using moment generating functions.
Next we introduce $N''_t$ as the renewal process having interarrival times with the same distribution as $U_1 + S_{\nu(1)}$, i.e., the renewal cycle also includes the “fiasco” cycle. It follows from the proof of Theorem 4.25, using condition (4.81), that $N''_{t/\alpha}$ has the same asymptotic Poisson distribution as $N'_{t/\alpha}$.
It is seen that
$$N''_t \le N'_t, \tag{4.87}$$
$$N''_t \le N_t \le \sum_{i=1}^{N'_t} N_{(i)} = N'_t + \sum_{i=1}^{N'_t}\left(N_{(i)} - 1\right), \tag{4.88}$$
where $N_{(i)}$ equals the number of system failures in the ith “fiasco” cycle. Note that $N'_t$ is at least the number of “fiasco” cycles up to time t, including the one that is possibly running at t, and $N''_t$ equals the number of finished “fiasco” cycles at time t, without the one possibly running at t.
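Now to prove the result (4.84) we will make use of the following inequalities:
$$Y_{t/\alpha} \le \sum_{i=1}^{N'_{t/\alpha}} Y_{i1} + \sum_{i=1}^{N'_{t/\alpha}} Y_{i2}, \tag{4.89}$$
$$Y_{t/\alpha} \ge \sum_{i=1}^{N'_{t/\alpha}} Y_{i1} - \sum_{i=N''_{t/\alpha}+1}^{N'_{t/\alpha}} Y_{i1}. \tag{4.90}$$
In view of (4.86) and the inequalities (4.89) and (4.90), we need to show that
$$\sum_{i=1}^{N'_{t/\alpha}} Y_{i2} \xrightarrow{P} 0, \tag{4.91}$$
$$\sum_{i=N''_{t/\alpha}+1}^{N'_{t/\alpha}} Y_{i1} \xrightarrow{P} 0. \tag{4.92}$$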

To establish (4.91) we first note that
$$Y_{i2} \xrightarrow{P} 0,$$
since, for every ε > 0,
$$P(Y_{i2} > \varepsilon) \le P_1(N_S \ge 2) \le E_1(N_S - 1) \to 0$$
by (4.82). Using moment generating functions it can be shown that (4.91) holds.

The key part of the proof of (4.92) is to show that $(N'_{t/\alpha})$ is uniformly integrable in j (t fixed). If this result is established, then since $N'_{t/\alpha} \xrightarrow{D} \text{Poisson}(t)$ by (4.85), it follows that
$$EN'_{t/\alpha} \to t. \tag{4.93}$$
And because of the inequality (4.87), $(N''_{t/\alpha})$ is also uniformly integrable, so that $EN''_{t/\alpha} \to t$, and we can conclude that (4.92) holds, noting that
$$P(N'_{t/\alpha} - N''_{t/\alpha} \ge 1) \le EN'_{t/\alpha} - EN''_{t/\alpha} \to 0.$$
Thus it remains to show that $(N'_{t/\alpha})$ is uniformly integrable.
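Let $F_U$ denote the probability distribution of U and let $V_l = \sum_{i=1}^l U_i$. Then, using $F_U^{*l}(x) \le (F_U(x))^l$ (which holds since the $U_i$ are nonnegative), we obtain
$$\begin{aligned}
E[N'_{t/\alpha} I(N'_{t/\alpha} \ge k)] &= \sum_{l=k}^\infty P(N'_{t/\alpha} \ge l) + (k-1)P(N'_{t/\alpha} \ge k)\\
&= \sum_{l=k}^\infty P(V_l \le t/\alpha) + (k-1)P(V_k \le t/\alpha)\\
&= \sum_{l=k}^\infty F_U^{*l}(t/\alpha) + (k-1)F_U^{*k}(t/\alpha)\\
&\le \sum_{l=k}^\infty (F_U(t/\alpha))^l + (k-1)(F_U(t/\alpha))^k\\
&= \frac{(F_U(t/\alpha))^k}{1 - F_U(t/\alpha)} + (k-1)(F_U(t/\alpha))^k.
\end{aligned}$$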

Since $F_U(t/\alpha) \to 1 - e^{-t}$ as $j \to \infty$, it follows that for any sequence $F_{ij}, G_{ij}$ satisfying the conditions (4.79)–(4.82), $(N'_{t/\alpha})$ is uniformly integrable. To see this, let ε be given such that $0 < \varepsilon < e^{-t}$. Then there is a $j_0$ such that
$$\sup_{j \ge j_0} E[N'_{t/\alpha} I(N'_{t/\alpha} \ge k)] \le \frac{(1 - e^{-t} + \varepsilon)^k}{e^{-t} - \varepsilon} + (k-1)(1 - e^{-t} + \varepsilon)^k.$$
Consequently,
$$\lim_{k\to\infty}\sup_j E[N'_{t/\alpha} I(N'_{t/\alpha} \ge k)] = 0,$$
i.e., $(N'_{t/\alpha})$ is uniformly integrable, and the proof is complete. □


Remark 4.44. The conditions (4.79)–(4.82) of Theorem 4.43 ensure the asymptotic Poisson distribution of $N_{t/\alpha}$, cf. Theorem 4.31, p. 136. Sufficient conditions for (4.79)–(4.82) are given in Theorem 4.27, p. 131.

Asymptotic Normality

We now study convergence to the normal distribution. The theorem below is a “time average result”: it does not require the system to be highly available. The result generalizes (4.32), p. 119.

Theorem 4.45. If X is a regenerative process with cycle length S and associated downtime $Y = Y_S$, $\mathrm{Var}[S] < \infty$, and $\mathrm{Var}[Y] < \infty$, then as $t \to \infty$,
$$\sqrt{t}\left(\frac{Y_t}{t} - \bar A\right) \xrightarrow{D} N(0, \tau_\Phi^2), \tag{4.94}$$
where
$$\tau_\Phi^2 = \frac{\mathrm{Var}[Y - \bar A S]}{ES}, \tag{4.95}$$
$$\bar A = \frac{EY}{ES}. \tag{4.96}$$
Proof. The result (4.94) follows by applying the Central Limit Theorem for renewal reward processes, Theorem B.17, p. 280, in Appendix B. □


In the case that the system is highly available, we have
$$\tau_\Phi^2 \approx \lambda_\Phi EY_1^2, \tag{4.97}$$
where $Y_1$ is the downtime of the first system failure (note that $Y_1 = Y_{11}$). The idea used to establish (4.97) is the following. As before, let S be the time of the first return to the best state (1, 1, . . . , 1). Then (4.97) follows by using (4.95), (4.96), $\bar A \approx 0$, the fact that $Y \approx Y_1$ if a system failure occurs in the renewal cycle, the fact that the probability of two or more failures occurring in the renewal cycle is negligible, and $\lambda_\Phi = EN_S/ES$ (by the Renewal Reward Theorem, p. 280). Since $E[Y - \bar AS] = 0$ by (4.96), we obtain
$$\tau_\Phi^2 = \frac{\mathrm{Var}[Y - \bar AS]}{ES} = \frac{E(Y - \bar AS)^2}{ES} \approx \frac{EY^2}{ES} = \frac{E_1Y^2\,q}{ES} \approx \frac{EY_1^2\,q}{ES} \approx EY_1^2\,\frac{EN_S}{ES} = EY_1^2\,\lambda_\Phi,$$
which gives (4.97).
More formally, it is possible to show that under certain conditions the ratio $\tau_\Phi^2/(\lambda_\Phi EY_1^2)$ converges to 1; see [26].

4.7 Generalizations and Related Models


4.7.1 Multistate Monotone Systems

We consider a multistate monotone system Φ as described in Sect. 2.1.2, p. 31, observed in a time interval J, with the following extensions of the model. We assume that there exists a reference level $D_t$ at time t, t ∈ J, which expresses a desirable level of system performance at time t. The reference level $D_t$ at time t is a positive random variable, taking values in $\{d_0, d_1, \ldots, d_r\}$. For a flow network system we interpret $D_t$ as the demand rate at time t. In the following we will use the term “demand rate” also in the general case. The state of the system at time t, which in the following we refer to as the throughput rate, is assumed to be a function of the states of the components and the demand rate, i.e.,
$$\Phi_t = \Phi(X_t, D_t).$$
If $D_t$ is a constant, we write $\Phi(X_t)$. The process $(\Phi_t)$ takes values in $\{\Phi_0, \Phi_1, \ldots, \Phi_M\}$.

Performance Measures

The performance measures introduced in Sect. 4.1, p. 105, can now be gener-
alized to the above model.
(a) For a fixed time t we define point availabilities

P (Φt ≥ Φk |Dt = d),


E[Φt |Dt = d],
P (Φt ≥ Dt ).

(b) Let NJ be defined as the number of times the system state is below demand in J. We consider the following performance measures related to NJ:

P(NJ ≤ k), k ∈ N0,
ENJ,
P(Φt ≥ Dt, ∀t ∈ J) = P(NJ = 0).

Some closely related measures are obtained by replacing Dt by Φk and NJ by $N_J^k$, where $N_J^k$ is equal to the number of times the process Φ is below state Φk during the interval J.
(c) Let
$$Y_J = \int_J (D_t - \Phi_t)\,dt = \int_J D_t\,dt - \int_J \Phi_t\,dt.$$
We see that $Y_J$ represents the lost throughput (volume) in J, i.e., the difference between the accumulated demand (volume) and the actual throughput (volume) in J. We consider the following performance measures related to $Y_J$:
$$P(Y_J \le y),\ y \in \mathbb{R}_+, \qquad \frac{EY_J}{|J|},$$
$$\frac{E\int_J \Phi_t\,dt}{E\int_J D_t\,dt}, \tag{4.98}$$
where |J| denotes the length of the interval J. The measure (4.98) is called throughput availability.
(d) Let
$$Z_J = \frac{1}{|J|}\int_J I(\Phi_t \ge D_t)\,dt.$$
The random variable $Z_J$ represents the portion of time during which the throughput rate equals (or exceeds) the demand rate. We consider the following performance measures related to $Z_J$:
$$P(Z_J \le y),\ y \in \mathbb{R}_+, \qquad EZ_J.$$
The measure $EZ_J$ is called demand availability.
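For a single recorded or simulated path, the integrals in (c) and (d) reduce to sums when $\Phi_t$ and $D_t$ are piecewise constant. The sketch below assumes such a path, given as hypothetical (duration, throughput rate, demand rate) segments. Averaging the returned values over many simulated paths would estimate $EY_J/|J|$ and $EZ_J$, and the ratio of the averaged integrals would estimate the throughput availability (4.98).

```python
def path_measures(segments):
    """Evaluate Y_J, Y_J/|J|, int Phi dt / int D dt, and Z_J on one sample path.

    segments: list of (duration, phi, demand), with the throughput rate phi and the
    demand rate constant on each piece; J is the concatenation of the pieces.
    """
    length = sum(d for d, _, _ in segments)                  # |J|
    demand = sum(d * dem for d, _, dem in segments)          # int_J D_t dt
    throughput = sum(d * phi for d, phi, _ in segments)      # int_J Phi_t dt
    lost = demand - throughput                               # Y_J
    z = sum(d for d, phi, dem in segments if phi >= dem) / length   # Z_J
    return lost, lost / length, throughput / demand, z

# Hypothetical path for two 50 % trains and demand 100 over |J| = 10 time units:
# 7 units with both trains up, 2 units with one train down, 1 unit with both down.
path = [(7.0, 100.0, 100.0), (2.0, 50.0, 100.0), (1.0, 0.0, 100.0)]
print(path_measures(path))   # (200.0, 20.0, 0.8, 0.7)
```

On this path 20 % of the demanded volume is lost, while the throughput meets demand 70 % of the time; the two measures weight partial losses differently.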


As in the binary case, in practice we will often use the limiting values of these performance measures.
The above performance measures are the most common measures used in reliability studies of offshore oil and gas production and transport systems; see, e.g., Aven [13]. In particular, the throughput availability is widely used when predicting the performance of various design options. For economic analysis and as a basis for decision-making, however, it is essential to be able to compute the full distribution of the throughput loss, and not only the mean. The measures related to the number of times the system is below a certain demand level are also useful, but more from an operational and safety point of view.

Computation

We now briefly look at how some of the measures defined above can be computed. To simplify the analysis we shall make the following assumptions:

Assumptions

1. J = [0, u].
2. The demand rate $D_t$ equals the maximum throughput rate $\Phi_M$ for all t.
3. The n component processes $(X_t(i))$ are independent. Furthermore, with probability one, the n component processes $(X_t(i))$ make no transitions (“jumps”) at the same time.
4. The process $(X_t(i))$ generates an alternating renewal process $T_{i1}, R_{i1}, T_{i2}, R_{i2}, \ldots$, as described in Sect. 4.2, p. 106, where $T_{im}$ represents the time spent in the state $x_{iM_i}$ during the mth visit to this state, and $R_{im}$ represents the time spent in the states $\{x_{i0}, x_{i1}, \ldots, x_{i,M_i-1}\}$ during the mth visit to these states.
For all i and r, the limits
$$a_{ir} = \lim_{t\to\infty} P(X_t(i) = x_{ir})$$
exist.
Arguing as in the binary case, we can use results from regenerative and renewal reward processes to generalize the results obtained in the previous sections. To illustrate this, we formulate some of these extensions below; the proofs are omitted. We focus here on the asymptotic results; see Theorems 4.16, p. 120, and 4.19, p. 122, for the analogous results in the binary case. We need the following notation:

$\mu_i = ET_{im} + ER_{im}$
$N_t = N^k_{[0,t]}$ (k is fixed)
$p_{ir}(t) = P(X_t(i) = x_{ir})$; if t is fixed, we write $p_{ir}$ and $X(i)$ instead of $p_{ir}(t)$ and $X_t(i)$
$\mathbf p = (p_{10}, \ldots, p_{nM_n})$
$\mathbf a = (a_{10}, \ldots, a_{nM_n})$
$\Phi_k(X) = I(\Phi(X) \ge \Phi_k)$
$h_k(\mathbf p) = E\Phi_k(X)$
$h(\mathbf p) = E\Phi(X)$
$(1_{ir}, \mathbf p)$ = $\mathbf p$ with $p_{ir}$ replaced by 1 and $p_{il}$ replaced by 0 for $l \ne r$.

We see that $\mu_i$ is equal to the expected cycle length for component i, $N_t$ represents the number of times the process Φ is below state $\Phi_k$ during the interval [0, t], and $\Phi_k(X)$ equals 1 if the system is in state $\Phi_k$ or better, and 0 otherwise.

Theorem 4.46. The limiting availabilities are given by
$$\lim_{t\to\infty} E\Phi(X_t) = h(\mathbf a), \qquad \lim_{t\to\infty} P(\Phi(X_t) \ge \Phi_k) = h_k(\mathbf a).$$

Theorem 4.47. Let
$$\gamma_{ilr} = h_k(1_{il}, \mathbf a) - h_k(1_{ir}, \mathbf a)$$
and let $f_{ilr}$ denote the expected number of times component i makes a transition from state $x_{il}$ to state $x_{ir}$ during a cycle of component i. Assume $f_{ilr} < \infty$. Then the expected number of times the system state is below $\Phi_k$ per unit of time in the long run equals
$$\lim_{u\to\infty}\frac{EN_u}{u} = \lim_{u\to\infty}\frac{E[N_{u+s} - N_u]}{s} = \sum_{i=1}^n\sum_{r<l}\frac{f_{ilr}\,\gamma_{ilr}}{\mu_i}. \tag{4.99}$$
If X is a regenerative process having finite expected cycle length, then with probability one,
$$\lim_{u\to\infty}\frac{N_u}{u} = \sum_{i=1}^n\sum_{r<l}\frac{f_{ilr}\,\gamma_{ilr}}{\mu_i}.$$
The limit (4.99) is denoted $\lambda_\Phi$. If the random variables $T_{im}$ are exponentially distributed, then X is regenerative, cf. Theorem 4.23, p. 124.
It is also possible to extend the asymptotic results related to the distribu-
tion of the number of system failures at level k, and the distribution of the
lost volume (downtime). We can view the system as a binary system of binary
components, and the asymptotic results of Sects. 4.4–4.6 apply.

Gas Compression System

Consider the gas compression system example in Sect. 1.3.2, p. 13. Two design
alternatives were studied:
(i) One gas train with a maximum throughput capacity of 100 %.
(ii) Two trains in parallel, each with a maximum throughput capacity of 50 %.
Normal production is 100 %. Each train comprises compressor–turbine, cooler
and scrubber. To analyze the performance of the system it was considered
sufficient to use approximate methods developed for highly available systems,
as presented in this chapter. In the system analysis, each train was treated as
one component, having exponential lifetime distribution with a failure rate of
13 per year, and mean repair time equal to

(10/13) · 12 + (2/13) · 50 + (1/13) · 20 ≈ 18.5 (h).

From this we find that the asymptotic unavailability Ā, given by formula
(4.2), p. 109, for a train equals 0.027, assuming 8,760 h per year. The number
of system failures per unit of time is given by the system failure rate λΦ . For
alternative (i) there is only one failure level and λΦ = 13. For alternative (ii)
we must distinguish between failures resulting in production below 100 % and
below 50 %. The system in these two cases can be viewed as a series system
of the two trains and a parallel system of the two trains, respectively. Hence
the system failure rate for these levels is approximately equal to 26 and 0.7,
respectively. Note that for the latter case (cf. (4.64), p. 140),

λΦ ≈ 2 · Ā · 13.

Using that the number of system failures is approximately Poisson distributed, we can compute the probability that a certain number of failures occurs during a specific period of time. For example, we find that for alternative (ii) there is a probability of about $e^{-0.7} \approx 0.50$ of having no complete shutdowns during a year.
Let EY denote the asymptotic mean lost production relative to the demand. For alternative (i) it is clear that EY equals 0.027, observing that a failure results in 100 % loss and the unavailability equals 0.027. For alternative (ii) we obtain the same value for the asymptotic mean lost production, as is seen from the following calculation:
$$EY = 0.5 \cdot 2 \cdot 0.027 \cdot 0.973 + 1 \cdot 0.027^2 = 0.027.$$
The first term in the sum represents the contribution from failures leading to 50 % loss, whereas the second term represents the contribution from failures leading to 100 % loss. The latter contribution is in practice negligible compared to the former. To compute the distribution of the lost production, we need to know more about the distribution of the repair time R of a train. It was assumed in this application that $ER^2 = 1000$ (the unit of time is hours), which corresponds to a squared coefficient of variation equal to 1.9 and a standard deviation equal to 25.7. This assumption makes it possible to approximate the distribution of the lost production during a year using the normal approximation. We know the mean (EY = 0.027) and need to estimate the variance of Y. To do this we make use of (4.94) and (4.97), p. 158, which state that, in the binary case, the variance of the fraction of time lost over a period of length t is approximately $\lambda_\Phi EY_1^2/t$, where $Y_1$ is the downtime of the first system failure. For alternative (i) we find that the variance equals approximately
$$(13/8760) \cdot 1000/8760 = 1.7 \cdot 10^{-4},$$
and for alternative (ii) (we ignore situations with both trains down, so that the lost production is approximately 50 % of the downtime)
$$(50/100)^2 \cdot (26/8760) \cdot 1000/8760 = 0.85 \cdot 10^{-4}.$$
From this we estimate, for example, the probability that the lost production during 1 year exceeds 4 % of demand to be 0.16 for alternative (i) and 0.08 for alternative (ii).
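The numbers in this example can be reproduced with a few lines of arithmetic. The sketch below follows the steps above: the train unavailability from formula (4.2) with μF = 1/λ, the Poisson probability of no complete shutdown, the variances from (4.97), and normal tail probabilities. As in the text, it uses the rounded mean EY = 0.027; with the unrounded unavailability (about 0.0267) the two tail probabilities come out slightly lower, about 0.15 and 0.07.

```python
import math

HOURS_PER_YEAR = 8760.0
rate = 13.0                                            # train failures per year
mean_repair = (10 * 12 + 2 * 50 + 1 * 20) / 13         # mu_G in hours, about 18.5
second_moment = 1000.0                                 # assumed E[R^2] in hours^2

# Repair-time spread implied by E[R^2] = 1000
sd_repair = math.sqrt(second_moment - mean_repair ** 2)              # about 25.7 h
cv2_repair = (second_moment - mean_repair ** 2) / mean_repair ** 2   # about 1.9

# Train unavailability (4.2): mu_G / (mu_F + mu_G) with mu_F = 1/lambda
x = rate * mean_repair / HOURS_PER_YEAR
unavail = x / (1 + x)                                  # about 0.027

# Alternative (ii): a full shutdown needs both trains down (parallel system)
lam_full = 2 * unavail * rate                          # about 0.7 per year
p_no_shutdown = math.exp(-lam_full)                    # about 0.50

# Variance of the yearly fraction of lost production: lambda_Phi E[Y1^2] / t
var_i = (rate / HOURS_PER_YEAR) * second_moment / HOURS_PER_YEAR
var_ii = 0.5 ** 2 * (2 * rate / HOURS_PER_YEAR) * second_moment / HOURS_PER_YEAR

def normal_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

mean_loss = 0.027                                      # rounded EY, as in the text
p_i = normal_tail((0.04 - mean_loss) / math.sqrt(var_i))    # about 0.16
p_ii = normal_tail((0.04 - mean_loss) / math.sqrt(var_ii))  # about 0.08
print(round(unavail, 3), round(lam_full, 2), round(p_no_shutdown, 2),
      round(var_i, 6), round(var_ii, 6), round(p_i, 2), round(p_ii, 2))
```

Halving the loss per failure while doubling the failure rate halves the variance, which is why alternative (ii) has the thinner tail despite the same mean.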

Special Case: Phase-Type Distributions

In the asymptotic analysis in Sects. 4.4–4.6 the main emphasis has been placed on the situation where the lifetimes are exponentially distributed. Using the so-called phase-type approach, we can show that the multistate model also provides a framework for covering other types of distributions. The phase-type approach makes use of the fact that a distribution function can be approximated by a mixture of Erlang distributions (with the same scale parameter), cf., e.g., Asmussen [8] and Tijms [156]. It is common to use a mixture of two Erlang distributions whose first two moments match those of the distribution considered. Now assume that the lifetime of component i, with distribution $F_i$, can be described as the sum of $M_i$ independent random variables, each of which is exponentially distributed with rate $\lambda_{i0}$, i.e., the lifetime of component i is Erlang distributed with parameters $\lambda_{i0}$ and $M_i$. Then we have a situation that fits into the above multistate framework, and the asymptotic results can be applied.
The state space for component i is $\{0, 1, \ldots, M_i\}$. The component process $(X_t(i))$ starts in state $M_i$; it stays there for a time governed by an exponential random variable with rate $\lambda_{i0}$ and then jumps to state $M_i - 1$; it stays there for a time governed by an exponential random variable with rate $\lambda_{i0}$ and then jumps to state $M_i - 2$, and this continues until the process reaches state 0. After a duration in state 0 having distribution $G_i$, it returns to state $M_i$. We see that $f_{ilr} = 1$ if $l = r + 1$ and $f_{ilr} = 0$ otherwise (for r < l). Furthermore,
$$\mu_i = M_i\,\frac{1}{\lambda_{i0}} + \mu_{G_i} = \mu_{F_i} + \mu_{G_i},$$
$$\lambda_\Phi = \sum_{i=1}^n \frac{1}{\mu_i}\left[h_1(1_{i1}, \mathbf a) - h_1(1_{i0}, \mathbf a)\right] = \sum_{i=1}^n \frac{1}{\mu_i}\left[h(1_i, \mathbf A) - h(0_i, \mathbf A)\right],$$
$$a_{i0} = \frac{\mu_{G_i}}{\mu_i} = \bar A_i,$$
using the terminology from the binary theory. Remember that the formulas established in Sects. 4.2 and 4.3 for the expected cycle length and the steady-state (un)availability of component i, and for the system failure rate, are applicable also for nonexponential distributions.
Thus, by modifying the state space, we have been able to extend the results of the previous sections, namely Theorems 4.25 (p. 129), 4.31 (p. 136), and 4.43 (p. 154), to Erlang distributions.
Now assume that the lifetime distribution of component i is a mixture of Erlang distributions, i.e., with probability $p_{ir} > 0$ the distribution equals an Erlang distribution with parameters $\lambda_{i0}$ and $M_{ir}$, $r = 1, 2, \ldots, r_i$. This situation can be analyzed as above with the state space for component i given by $\{0, 1, \ldots, M_i\}$, where $M_i = \max_r\{M_{ir}\}$. When the component state process $(X_t(i))$ leaves state 0, it jumps to state $M_{ir}$ with probability $p_{ir}$. The component then stays in this state for a time governed by an exponential distribution with parameter $\lambda_{i0}$, before it jumps to state $M_{ir} - 1$, etc. As above, we can use the formulas for the binary case to compute the expected cycle length and steady-state (un)availability of component i, and the system failure rate. It is seen that
$$\mu_{F_i} = \sum_{r=1}^{r_i} p_{ir} M_{ir}\,\frac{1}{\lambda_{i0}}.$$
We can conclude that the set-up also covers mixtures of Erlang distributions, and Theorems 4.25, 4.31, and 4.43 apply.
Note that we have not proved that the limiting results obtained in the
previous sections hold true for general lifetime distributions Fij . We have
shown that if the distributions Fij all belong to a certain class of mixtures of
Erlang distributions, then the results hold. Starting from general distributions
Fij , we can write Fij as a limit of Fijr , r → ∞, where Fijr are mixtures of
Erlang distributions. But interchanging the limits as j → ∞ and as r → ∞
is not justified in general. Refer also to Bibliographic Notes, p. 173, for some
comments related to the non-exponential case.
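The two-moment fit mentioned above can be sketched as follows. For a squared coefficient of variation $c^2 \le 1$, a commonly used choice (cf. Tijms [156]) is a mixture of Erlang distributions with k − 1 and k phases and a common rate, where $1/k \le c^2 \le 1/(k-1)$; in the notation above this corresponds to $r_i = 2$, $M_{i1} = k - 1$, $M_{i2} = k$ and a common $\lambda_{i0}$. The case $c^2 > 1$ (usually handled with hyperexponential distributions) is not covered, and the function names are hypothetical.

```python
import math

def fit_erlang_mixture(mean, cv2):
    """Fit a mixture of Erlang(k-1, rate) (prob. p) and Erlang(k, rate) (prob. 1-p)
    to a given mean and squared coefficient of variation cv2, 0 < cv2 <= 1."""
    if not 0.0 < cv2 <= 1.0:
        raise ValueError("this two-moment fit needs 0 < cv2 <= 1")
    k = math.ceil(1.0 / cv2)                       # smallest k with 1/k <= cv2
    p = (k * cv2 - math.sqrt(max(0.0, k * (1.0 + cv2) - k * k * cv2))) / (1.0 + cv2)
    rate = (k - p) / mean                          # matches the mean
    return k, p, rate

def mixture_moments(k, p, rate):
    """Mean and squared coefficient of variation of the fitted mixture."""
    phases = [(p, k - 1), (1.0 - p, k)]            # (probability, number of phases M)
    m1 = sum(w * m / rate for w, m in phases)      # mu_F = sum_r p_r M_r / rate
    m2 = sum(w * m * (m + 1) / rate ** 2 for w, m in phases)   # E[Erlang(M)^2] = M(M+1)/rate^2
    return m1, m2 / m1 ** 2 - 1.0

k, p, rate = fit_erlang_mixture(100.0, 0.3)        # hypothetical lifetime: mean 100, cv2 0.3
print(k, round(p, 4), round(rate, 5), mixture_moments(k, p, rate))
```

The fitted mixture has $M_i = k$ phases, so the component state space in the multistate model is $\{0, 1, \ldots, k\}$; the mean returned by `mixture_moments` is exactly the formula for $\mu_{F_i}$ above.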

4.7.2 Parallel System with Repair Constraints

Consider the model as described in Sect. 4.3, p. 120, but assume now that there are repair constraints, i.e., at most r (r < n) components can be repaired at the same time. Hence if i > r components are down, the remaining i − r components are waiting in a repair queue. We shall restrict attention to the case r = 1, i.e., only one repair facility (channel) is available. The repair policy is first come, first served. We assume exponentially distributed lifetimes.
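Consider first a parallel system of two components, and the set-up of Sect. 4.4, p. 126. It is not difficult to see that $ET_\Phi$, q, and $E_0S$ are identical to the corresponding quantities when there are no repair constraints; see the section on a parallel system of two identical components, p. 139. We can also find explicit expressions for ES and $\lambda_\Phi$. Since the time to the first component failure is exponentially distributed with parameter 2λ, $ES = 1/(2\lambda) + ES'$, where $S'$ equals the time from the first component failure until the process again returns to (1, 1). Denoting the repair time of the failed component by R, we see that
$$ES' = \mu_G + qE[S' - R \mid N_S \ge 1].$$
But $E[S' - R \mid N_S \ge 1] = ES'$: if the second component fails during the repair, then at the completion of the repair one component is again under repair and the other is functioning, and by the memoryless property the remaining time until (1, 1) is distributed as $S'$. It follows that
$$ES = \frac{1}{2\lambda} + \frac{\mu_G}{1-q}.$$
Hence, since the number of system failures in a cycle is geometrically distributed with $EN_S = q/(1-q)$,
$$\lambda_\Phi = \frac{EN_S}{ES} = \frac{q/(1-q)}{ES} = \frac{2\lambda q}{1 - q + 2\lambda\mu_G}.$$
Alternatively, and more easily, we could have found $\lambda_\Phi$ by defining a cycle S as the time between two consecutive visits to a state with just one component functioning. Then it is seen that $ES = \mu_G + (1-q)/(2\lambda)$ and $EN_S = q$, resulting in the same $\lambda_\Phi$ as above.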
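The formula for λΦ can be checked by simulation. The sketch below assumes constant repair times d (an illustrative choice), so that the probability that the operating component fails during a repair is $q = 1 - e^{-\lambda d}$. It simulates the alternation between the state with both components up and the state with one component under repair, using the memoryless lifetimes as in the argument above, and it also evaluates both cycle-based expressions. Names and parameter values are hypothetical.

```python
import math
import random

def lambda_phi(lam, mu_g, q):
    """System failure rate 2*lam*q / (1 - q + 2*lam*mu_G) with one repair facility."""
    return 2.0 * lam * q / (1.0 - q + 2.0 * lam * mu_g)

def lambda_phi_alternative(lam, mu_g, q):
    """Same rate from the cycle between visits to 'one component functioning'."""
    return q / (mu_g + (1.0 - q) / (2.0 * lam))

def simulate_rate(lam, repair_time, horizon, rng):
    """Long-run system failures per unit time: two components in parallel,
    exponential(lam) lifetimes, constant repair time, a single repair facility."""
    t, failures, both_up = 0.0, 0, True
    while t < horizon:
        if both_up:
            t += rng.expovariate(2.0 * lam)    # wait for the first of two failures
        # One component in repair, the other operating (memoryless lifetime).
        if rng.expovariate(lam) < repair_time:
            failures += 1                      # both down: a system failure
            both_up = False                    # after repair: one up, one starts repair
        else:
            both_up = True                     # repair done before the second failure
        t += repair_time
    return failures / t

lam, d = 0.1, 1.0
q = 1.0 - math.exp(-lam * d)
print(lambda_phi(lam, d, q), lambda_phi_alternative(lam, d, q),
      simulate_rate(lam, d, 1_000_000.0, random.Random(7)))
```

With λ = 0.1 and d = 1 both formulas give about 0.0172 system failures per unit of time, and the simulated rate should agree to within about one percent.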
Now suppose we have n ≥ 2, and let $\Phi_t$ be defined as the number of components functioning at time t. To analyze the system, we can utilize that the state process $\Phi_t$ is a semi-Markov process with jump times at the completion of repairs. In states 0, 1, . . . , n − 1 the time between transitions has distribution G(t), and the transition probability $P_{ij}$ is given by
$$P_{ij} = \begin{cases}
\displaystyle\int_0^\infty \binom{i}{i-j+1} F(s)^{i-j+1}\,(1 - F(s))^{j-1}\,dG(s), & 1 \le j \le i \le n-1,\\[2ex]
\displaystyle\int_0^\infty (1 - F(s))^{i}\,dG(s), & j = i + 1,\\[1ex]
0, & 1 \le i < j - 1,
\end{cases}$$
observing that if the state is i and the repair is completed at time s, then the probability that the process jumps to state j, where j ≤ i ≤ n − 1, equals the probability that i − j + 1 components fail before s and j − 1 components survive s; furthermore, if the state is i and the repair is completed at time s, then the probability that the process jumps to state i + 1 equals the probability that i components survive s. If the process is in state n, it stays there for an exponential time with rate nλ and then jumps to state n − 1.
Having established the transition probabilities, we can compute a number of interesting performance measures for the system using results from semi-Markov theory. For example, there is an explicit formula for the limiting probability $\lim_{t\to\infty} P(\Phi_t = k)$, which depends on the mean time spent in each state and the limiting probabilities of the embedded discrete-time Markov chain; see Ross [135], p. 104.
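The transition probabilities are easy to tabulate. The sketch below assumes exponential lifetimes with rate λ, so that $F(s) = 1 - e^{-\lambda s}$, and, purely for illustration, a constant repair time d, so that the integrals with respect to dG reduce to evaluation at s = d. It builds the matrix for states 0, 1, ..., n and checks that each row is a probability distribution. The function name is hypothetical.

```python
import math

def transition_matrix(n, lam, d):
    """Embedded transition probabilities P[i][j], states i = 0..n (number functioning),
    exponential(lam) lifetimes and constant repair time d, one repair facility."""
    f = 1.0 - math.exp(-lam * d)        # F(d): a functioning unit fails during a repair
    P = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):                  # states 0..n-1: next jump at repair completion
        for j in range(1, i + 1):       # i - j + 1 of the i functioning units fail
            P[i][j] = math.comb(i, i - j + 1) * f ** (i - j + 1) * (1 - f) ** (j - 1)
        P[i][i + 1] = (1 - f) ** i      # all i functioning units survive the repair
    P[n][n - 1] = 1.0                   # state n: next event is a failure (rate n*lam)
    return P

P = transition_matrix(4, 0.1, 1.0)
for i, row in enumerate(P):
    print(i, [round(p, 4) for p in row], round(sum(row), 12))
```

Each row for i ≤ n − 1 is a shifted binomial distribution (the number of the i functioning units that survive the repair, plus the repaired unit), which is why the rows sum to 1.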

4.7.3 Standby Systems

In this section we study the performance of standby systems comprising n identical components, of which n − 1 are normally operating and one is in (cold) standby. Emphasis is placed on the case where the components have constant failure rates and the mean repair time is relatively small compared to the MTTF.
Standby systems as analyzed here are used in many situations in real life.
As an example we return to the gas compression system in Sect. 1.3, p. 13 and
Sect. 4.7.1, p. 162. To increase the availability for the alternatives considered,
we may add a standby train such that when a failure of a train occurs, the
standby train can be put into operation and a production loss is avoided.

Model

The following assumptions are made:


• Normally n − 1 components are running and one is in standby.
• Failed components are repaired. The repair regime is characterized by either
R1 Only one component can be repaired at a time (one repair facility/channel), and the repair policy is “first come, first served”; or
R2 Up to n repairs can be carried out at a time (n repair facilities/channels).
• Switchover to the standby component is perfect, i.e., instantaneous and failure-free.
• A standby component that has completed its repair is functioning on demand, i.e., the failure rate is zero in the standby state.
• All failure times and repair times are independent with probability distributions F(t) and G(t), respectively. F is absolutely continuous and has finite mean, and G has a finite third-order moment. We assume
$$\int_0^\infty F(t)\,dG(t) > 0.$$

In the following T refers to a failure time of a component and R refers to a repair time.
The squared coefficient of variation of the repair time distribution is denoted $c_G^2$.
Let Φt denote the state of the system at time t, i.e., the number of com-
ponents functioning at time t (Φt ∈ {n, n − 1, . . . , 0}). For repair regime R1,
Φ is generally a regenerative process, or a modified regenerative process. For
a two-component system it is seen that the time points when Φ jumps to
state 1 are regenerative points, i.e., the time points when (i) the operating
component fails and the second component is not under repair (the process
jumps from state 2 to 1) or (ii) both components are failed and the repair
of the component being repaired is completed (the process jumps from state
0 to 1). For n > 2, the points in time when the process jumps from state 0
to 1 are regenerative points, noting that the situation then is characterized
by one “new” component, and n − 1 in a repair queue. Assuming exponential
lifetimes, we can define other regenerative points, e.g., consecutive visits to
the best state n, or consecutive visits to state n − 1.
Also for a two-component system under repair regime R2, the process is generally a (modified) regenerative process. The regenerative points are given by the points when the process jumps from state 2 to 1 (case (i) above). If the system has more than two components (n > 2), the regenerative property does not hold for a general failure time distribution. However, under the assumption of an exponential time to failure, the process is regenerative. Regenerative points are given by consecutive visits to state n, or by the points when the process jumps from state n to state n − 1. In the following, when considering a system of more than two components, we assume an exponential lifetime distribution. Remember that a cycle refers to the time between two consecutive regenerative points.
168 4 Availability Analysis of Complex Systems

Performance Measures

The system can be considered as a special case of a multistate monotone system, with the demand rate $D_t$ set to n − 1. Hence the performance measures defined in Sect. 4.7.1, p. 158, also apply to the system analyzed in this section. Availability refers to the probability that at least n − 1 components are functioning, and system failure refers to the event that the state process Φ is below n − 1. Note that we cannot apply the computation results of Sect. 4.7.1, since the state processes of the components are not stochastically independent. The general asymptotic results obtained in Sects. 4.4–4.6 for regenerative processes are, however, applicable.
Among the performance measures, we will put emphasis on the limiting availability and the limiting mean number of system failures in a time interval.
We need the following notation for i = n, n − 1, …, 0:
$$p_i(t) = P(\Phi_t = i), \qquad p_i = \lim_{t\to\infty} p_i(t),$$
provided the limits exist. Clearly, the availability at time t, A(t), is given by
$$A(t) = p_n(t) + p_{n-1}(t)$$
and the limiting availability, A, is given by
$$A = p_n + p_{n-1}.$$

Computation

First, we focus on the limiting unavailability Ā, i.e., the expected portion of time in the long run that at least two components are not functioning. Under the assumption of constant failure and repair rates this unavailability can easily be computed using Markov theory, noting that Φ is a birth and death process. The probability $\tilde p_i$ of having i components down is given by (cf. [13], p. 303)
$$\tilde p_i = p_{n-i} = \frac{z_i}{1 + \sum_{j=1}^{n} z_j}, \tag{4.100}$$
where
$$z_i = \begin{cases} \dfrac{(n-1)\,(n-1)!}{(n-i)!}\,\dfrac{\delta^i}{\prod_{l=1}^{i} u_l}, & i = 1, 2, \dots, n,\\[1ex] 1, & i = 0, \end{cases}$$
$\delta = \mu_G/\mu_F$, and $u_l = 1$ under repair regime R1 and $u_l = l$ under repair regime R2. Note that if δ is small, then $\tilde p_i \approx z_i$ for $i \ge 1$. Hence
$$\bar A \approx \tilde p_2 \approx \frac{(n-1)^2}{u_2}\,\delta^2. \tag{4.101}$$
4.7 Generalizations and Related Models 169

We can also write
$$\bar A = \frac{(n-1)^2}{u_2}\,\delta^2 + o(\delta^2), \qquad \delta \to 0.$$
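To make (4.100) concrete, here is a minimal Python sketch (function names and parameter values are hypothetical). It evaluates the stationary probabilities once from the closed form and once from the generic birth-death product formula, assuming, as in the model above, that the standby unit cannot fail, that n − 1 units operate whenever possible, and that the repair rate with i units down is $u_i/\mu_G$; time is measured in units of $\mu_F$, so the failure rate of an operating unit is 1.

```python
import math


def standby_distribution(n, delta, regime="R1"):
    """Stationary probabilities p~_i (i components down, i = 0..n) from (4.100).

    delta = mu_G / mu_F; u_l = 1 under R1 and u_l = l under R2.
    """
    z = [1.0]  # z_0 = 1
    for i in range(1, n + 1):
        prod_u = 1.0 if regime == "R1" else float(math.factorial(i))
        coef = (n - 1) * math.factorial(n - 1) / math.factorial(n - i)
        z.append(coef * delta ** i / prod_u)
    total = sum(z)  # equals 1 + sum_{j=1}^n z_j
    return [zi / total for zi in z]


def birth_death_stationary(n, delta, regime="R1"):
    """Same distribution from the generic birth-death product formula.

    Failure rate of an operating unit = 1; repair rate in state i = u_i / delta.
    """
    def birth(i):  # rate i -> i+1: n-1 units operate in state 0, else n-i
        return n - 1 if i == 0 else n - i

    def death(i):  # rate i -> i-1
        u = 1 if regime == "R1" else i
        return u / delta

    weights = [1.0]
    for i in range(1, n + 1):
        weights.append(weights[-1] * birth(i - 1) / death(i))
    total = sum(weights)
    return [w / total for w in weights]


n, delta = 4, 0.05                          # hypothetical parameter values
p = standby_distribution(n, delta, "R1")
unavailability = sum(p[2:])                 # at least two components down
approximation = (n - 1) ** 2 * delta ** 2   # (4.101) with u_2 = 1 (regime R1)
print(unavailability, approximation)
```

Agreement of the two functions is a check that the product $\prod_{l=1}^{i} u_l$ and the factor $(n-1)(n-1)!/(n-i)!$ are exactly the ratios of birth and death rates; for δ = 0.05 the approximation (4.101) is still visibly conservative, and it tightens as δ decreases.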
In general, we can find expressions for the limiting unavailability by using the regenerative property of the process Φ. Defining Y and S as the system downtime in a cycle and the length of a cycle, respectively, it follows from the Renewal Reward Theorem (Theorem B.15, p. 280, in Appendix B) that
$$\bar A = \frac{EY}{ES}. \tag{4.102}$$
Here system downtime corresponds to the time during which two or more of the components are not functioning. Let us now look more closely at the problem of computing Ā, given by (4.102), under repair regime R1.

Repair Regime R1. In general, semi-Markov theory can be used to establish formulas for the unavailability, cf. [27]. In practice, $\mu_G$ is usually small compared to $\mu_F$; typically, $\delta = \mu_G/\mu_F$ is less than 0.1. In this case we can establish simple approximation formulas, as shown below.
First we consider the case with two components, i.e., n = 2. The regenerative points for the process Φ are generated by the jumps from state 2 to 1. In view of (4.102) the limiting system unavailability Ā can be written as
$$\bar A = \frac{E[\max\{R - T, 0\}]}{ET + E[\max\{R - T, 0\}]} \tag{4.103}$$
$$\bar A = \frac{\mu_G - w}{\mu_F + (\mu_G - w)}, \tag{4.104}$$
where
$$w = E[\min\{R, T\}] = \int_0^\infty \bar F(t)\,\bar G(t)\,dt,$$
noting that $\max\{R - T, 0\} = R - \min\{R, T\}$ and that the system downtime equals 0 if the repair of the failed component is completed before the failure of the operating component, and equals the difference between the repair time of the failed component and the time to failure of the operating component if this difference is positive. Thus we have proved the following theorem.

Theorem 4.48. If n = 2, then the unavailability Ā is given by (4.104).

We now assume an exponential failure time distribution $F(t) = 1 - e^{-\lambda t}$. Then we have
$$\bar A \approx \bar A', \tag{4.105}$$
where, using $ER^2 = \mu_G^2(1 + c_G^2)$ and $\lambda\mu_G = \delta$,
$$\bar A' = \frac{\lambda^2}{2}\,ER^2 = \frac{\delta^2}{2}\,[1 + c_G^2]. \tag{4.106}$$
This gives a simple approximation formula for computing Ā. The approxima-
tion (4.105) is established formally by the following proposition.

Proposition 4.49. If n = 2 and $F(t) = 1 - e^{-\lambda t}$, then
$$0 \le \bar A' - \bar A \le (\bar A')^2 + \frac{\delta^3}{6}\,\frac{ER^3}{\mu_G^3}. \tag{4.107}$$

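Proof. Recall that $\int_0^\infty t\,\bar G(t)\,dt = ER^2/2$ and $\int_0^\infty t^2\,\bar G(t)\,dt = ER^3/3$. Using that $1 - e^{-\lambda t} \le \lambda t$ and changing the order of integration, it follows that
$$\bar A = \frac{\lambda(\mu_G - w)}{1 + \lambda(\mu_G - w)} \le \lambda(\mu_G - w), \tag{4.108}$$
where
$$\lambda(\mu_G - w) = \lambda\int_0^\infty F(t)\,\bar G(t)\,dt \le \lambda\int_0^\infty (\lambda t)\,\bar G(t)\,dt = \lambda^2\,\frac{1}{2}\,ER^2 = \bar A'. \tag{4.109}$$
It remains to show the right-hand inequality of (4.107). Considering, with $1 - e^{-x} \ge x - x^2/2$ for $x \ge 0$,
$$\bar A\,(1 + \lambda(\mu_G - w)) = \lambda\int_0^\infty F(t)\,\bar G(t)\,dt \ge \lambda\int_0^\infty \Big(\lambda t - \frac{1}{2}(\lambda t)^2\Big)\bar G(t)\,dt = \bar A' - \frac{1}{6}\lambda^3 ER^3$$
and the inequalities $\bar A \le \lambda(\mu_G - w) \le \bar A'$ obtained above, it is not difficult to see that
$$0 \le \bar A' - \bar A \le \bar A\,\lambda(\mu_G - w) + \frac{1}{6}\lambda^3 ER^3 \le (\bar A')^2 + \frac{1}{6}\lambda^3 ER^3,$$
which completes the proof. □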


Hence $\bar A'$ overestimates the unavailability, and the error term will be negligible provided that $\delta = \mu_G/\mu_F$ is sufficiently small.
Next, let us compare the approximation formula $\bar A'$ with the standard “Markov formula” $\bar A_M = \delta^2$, obtained by assuming exponentially distributed failure and repair times (replace $c_G^2$ by 1 in the expression (4.106) for $\bar A'$, or use the Markov formula (4.101), p. 168). It follows that
$$\bar A' = \bar A_M \cdot \frac{1}{2}\,[1 + c_G^2]. \tag{4.110}$$
From this, we see that using the Markov formula when the squared coefficient of variation of the repair time distribution, $c_G^2$, is not close to 1 will introduce a relatively large error. If the repair time is constant, then $c_G^2 = 0$ and the unavailability obtained from the Markov formula is twice $\bar A'$. If $c_G^2$ is large, say 2, then the unavailability obtained from the Markov formula is 2/3 of $\bar A'$.
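For a constant repair time ($c_G^2 = 0$) these quantities can be compared numerically. The sketch below, with hypothetical parameter values, evaluates the exact expression (4.104), the approximation $\bar A'$ from (4.106), the Markov formula $\bar A_M = \delta^2$, and the error bound (4.107); for a constant repair time r we have $w = \int_0^r e^{-\lambda t}\,dt = (1 - e^{-\lambda r})/\lambda$ and $ER^2 = r^2$, $ER^3 = r^3$.

```python
import math


def exact_unavailability_const_repair(lam, r):
    """(4.104) for n = 2, Exp(lam) lifetimes and a constant repair time r.

    Here mu_G - w = r - int_0^r e^{-lam t} dt = r - (1 - e^{-lam r}) / lam.
    """
    x = lam * (r - (1.0 - math.exp(-lam * r)) / lam)  # lam * (mu_G - w)
    return x / (1.0 + x)


lam, r = 0.01, 2.0                  # hypothetical: mu_F = 100, mu_G = 2, delta = 0.02
a_exact = exact_unavailability_const_repair(lam, r)
a_prime = lam ** 2 * r ** 2 / 2.0   # (4.106) with E R^2 = r^2
a_markov = (lam * r) ** 2           # Markov formula delta^2
bound = a_prime ** 2 + (lam * r) ** 3 / 6.0  # right-hand side of (4.107), E R^3 = r^3
print(a_exact, a_prime, a_markov)
```

With these values the exact unavailability lies just below $\bar A'$, within the bound (4.107), while the Markov formula is twice $\bar A'$, as stated above for constant repair times.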
Assume now n > 2, with repair regime R1 as before, and assume that δ is relatively small. Then it is possible to generalize the approximations obtained above for n = 2.
Since δ is small, there will be a negligible probability of having $\Phi \le n - 3$, i.e., three or more components not functioning at the same time. By neglecting this possibility we obtain a simplified process that is identical to the process for the two-component system analyzed above, with failure rate $(n-1)\lambda$. Hence, by replacing λ with $(n-1)\lambda$, formula (4.105) is valid for general n, i.e., $\bar A \approx \bar A'$, where
$$\bar A' = \frac{[(n-1)\delta]^2}{2}\,[1 + c_G^2].$$
The error bounds are, however, more difficult to obtain; see [27].
The relation between the approximation formulas $\bar A'$ and $\bar A_M$, given by (4.101), p. 168, is the same for all n ≥ 2. Hence $\bar A' = \bar A_M \cdot \frac{1}{2}[1 + c_G^2]$ (formula (4.110)) holds for n > 2 too.
Next we establish results for the long run average number of system failures. It follows from the Renewal Reward Theorem that $EN_t/t$ and $E[N_{t+s} - N_t]/s$ converge to $\lambda_\Phi = EN/ES$ as $t \to \infty$, where N equals the number of system failures in one renewal cycle and S equals the length of the cycle as before. With probability one, $N_t/t$ converges to the same value. Under repair regime R1, $N \in \{0, 1\}$. Hence EN equals the probability that the system fails in a cycle, i.e., EN = q using the terminology of Sects. 4.3 and 4.4. Below we find expressions for $\lambda_\Phi$ in the case that the repair regime is R1. The regenerative points are consecutive visits to state n − 1.

Theorem 4.50. If n = 2, then
$$\lambda_\Phi = \frac{q}{\mu_F + EY}, \tag{4.111}$$
where
$$q = \int_0^\infty F(t)\,dG(t), \tag{4.112}$$
$$EY = \int_0^\infty F(t)\,\bar G(t)\,dt.$$

Proof. First note that EY equals the expected downtime in a cycle and is given by
$$EY = E[(R - T)\,I(T < R)] = E[R - \min\{R, T\}],$$
cf. (4.103)–(4.104), p. 169. We have established above that
$$\lambda_\Phi = \frac{EN}{ES} = \frac{q}{ES},$$
where N equals the number of system failures in one renewal cycle, S equals the length of the cycle, and $q = P(T \le R)$ equals the probability of having a system failure during a cycle. Thus it remains to show that
$$ES = \mu_F + EY. \tag{4.113}$$
Suppose the system has just jumped to state 1. We then have one component operating and one undergoing repair. If a system failure occurs (i.e., $T \le R$), then the cycle length equals R, and if a system failure does not occur (i.e., $T > R$), then the cycle length equals T. Consequently,
$$S = I(T \le R)\,R + I(T > R)\,T = T + (R - T)\,I(T < R).$$
Formula (4.113) follows and the proof is complete. □
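The cycle structure used in this proof translates directly into a Monte Carlo check. The following sketch (hypothetical names and parameter values; exponential lifetimes and a constant repair time are assumed only to obtain closed-form reference values) accumulates Y, S and N = I(T ≤ R) over independent cycles and compares the renewal-reward ratios with (4.102) and (4.111).

```python
import math
import random


def simulate_cycles(lam, draw_repair, cycles=200_000, seed=42):
    """Renewal-reward estimates of the unavailability and of lambda_Phi
    for the two-component standby system (n = 2, repair regime R1).

    A cycle starts when the system jumps to state 1: one unit operates with
    lifetime T ~ Exp(lam) while the other is under repair for a time R.
    """
    rng = random.Random(seed)
    downtime = length = failures = 0.0
    for _ in range(cycles):
        t = rng.expovariate(lam)
        r = draw_repair(rng)
        downtime += max(r - t, 0.0)          # Y = (R - T) I(T < R)
        length += max(t, r)                  # S = T + (R - T) I(T < R)
        failures += 1.0 if t <= r else 0.0   # N = I(T <= R)
    return downtime / length, failures / length


lam, r = 1.0, 0.5  # hypothetical values, chosen so that system failures are frequent
q = 1.0 - math.exp(-lam * r)                 # (4.112) for constant R = r
ey = r - (1.0 - math.exp(-lam * r)) / lam    # EY = int_0^r F(t) dt
es = 1.0 / lam + ey                          # (4.113)
a_sim, rate_sim = simulate_cycles(lam, lambda rng: r)
print(a_sim, ey / es)    # simulated vs. exact unavailability (4.102)
print(rate_sim, q / es)  # simulated vs. exact lambda_Phi (4.111)
```

Note that $S = \max\{T, R\}$, which is how the code accumulates the cycle length; the seed only makes the run reproducible.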




We see from (4.111) that if F(t) is exponential with rate λ and the components are highly available, then $q \approx \lambda\mu_G$ and $\mu_F + EY \approx 1/\lambda$, so that
$$\lambda_\Phi \approx \lambda^2 \mu_G.$$

If n > 2 and the repair regime is R1, it is not difficult to see that q is given by (4.112) with F(t) replaced by $1 - e^{-(n-1)\lambda t}$. It is, however, more difficult to find an expression for ES. For highly available components, we can approximate the system with a two-state system with failure rate $(n-1)\lambda$; hence,
$$\lambda_\Phi \approx [(n-1)\lambda]^2 \mu_G, \qquad ES \approx \frac{1}{(n-1)\lambda}.$$
When the state process of the system jumps from state n to n − 1, it will return to state n with a high probability and the sojourn time in state n − 1 will be relatively short; consequently, the expected cycle length is approximately equal to the expected time in the best state n, i.e., $1/((n-1)\lambda)$.

Repair Regime R2. Finally in this section we briefly comment on repair regime R2. We assume constant failure rates. It can be argued that if there are ample repair facilities, i.e., the repair regime is R2, the steady-state unavailability is invariant with respect to the repair time distribution, cf., e.g., Smith [145] and Tijms [156], p. 175. This means that we can use the steady-state Markov formula (4.100), p. 168, also when the repair time distribution is not exponential. The result depends on the repair time distribution only through its mean value. However, a strict mathematical proof of this invariance result does not seem to have been presented yet.

Bibliographic Notes. Alternating renewal processes are studied in many textbooks, e.g., Birolini [44] and Ross [135]. Different versions of the one-component downtime distribution formula in Theorem 4.14 (p. 118) have been formulated and proved in the literature, cf. [44, 45, 57, 65, 69, 154]. The first version was established by Takács. Theorem 4.14, which is taken from Haukås and Aven [82], seems to be the most general formulation and also has the simplest proof.
Some key references to the theory of point availability of monotone systems
and the mean number of system failures are Barlow and Proschan [31, 32] and
Ross [136]; see also Aven [13]. Parallel systems of two identical components
have been studied by a number of researchers, see, e.g., [34, 73, 76]. Gaver
[73] established formulas for the distribution and mean of the time to the first
system failure, identical to those presented in Sect. 4.4, p. 139. Our derivation of these formulas, however, differs from Gaver’s.
Asymptotic analysis of highly available systems has been carried out by
a number of researchers. A survey is given by Gertsbakh [75], with emphasis
on results related to the convergence of the distribution of the first system
failure to the exponential distribution. See also the books by Gnedenko and
Ushakov [76], Ushakov [157], and Kovalenko et al. [110, 111]. Some of the
earliest results go back to work done by Keilson [104] and Solovyev [148]. A
result similar to Lemma 4.24 (p. 127) was first proved by Keilson [104]; see also
[76, 105, 109]. Our version of this lemma is taken from Aven and Jensen [26].
To establish the asymptotic exponential distribution, different normalizing
factors are used, e.g., q/E0 S, where q equals the probability of having at least
one system failure in a renewal cycle and E0 S equals the expected cycle length
given that no system failures occur in the cycle. This factor, as well as the other
factors considered in the early literature in this field (cf., e.g., the references
[75, 76, 157]) are generally difficult to compute. The asymptotic failure rate of the system, λφ, is more attractive from a computational point of view and receives the most attention in this presentation. We find it somewhat difficult to
read some of the earlier literature on availability. A large part of the research
in this field has been developed outside the framework of monotone system
theory. Using this framework it is possible to give a unified presentation of
the results. Our set-up and results (Sect. 4.4) are to a large extent taken from
the recent papers by Aven and Haukås [22] and Aven and Jensen [26]. These
papers also cover convergence of the number of system failures to the Poisson
distribution.
The literature includes a number of results proving that the exponen-
tial/Poisson distribution is the asymptotic limit of certain sums of point
processes. Most of these results are related to the thinning of independent
processes, see e.g., Çinlar [55], Daley and Vere-Jones [58], and Kovalenko et
al. [111]. See also Lam and Lehoczky [114] and the references therein. These
results are not applicable for the availability problems studied in this book.
Sections 4.5 and 4.6 are to a large extent based on Gåsemyr and Aven
[72], Aven and Haukås [23], and Aven and Jensen [26]. Gåsemyr and Aven
[72] and Aven and Haukås [23] study the asymptotic downtime distribution
given system failure. Theorem 4.42 is due to Haukås (see [26, 81]) and Smith
[146]. Aven and Jensen [26] give sufficient conditions for when a compound
Poisson distribution is an asymptotic limit for the distribution of the downtime
of a monotone system observed in a time interval. An alternative approach
for establishing the compound Poisson process limit is given by Serfozo [138].
There exist several asymptotic results in the literature linking the sums of
independent point processes with integer marks to the compound Poisson
process; see, e.g., [153]. It is, however, not possible to use these results for
studying the asymptotic downtime distributions of monotone systems.
Section 4.7.1 generalizes results obtained in the previous sections to mul-
tistate systems. The presentation on multistate systems is based on Aven
[11, 14]. For the analysis in Sect. 4.7.3 on standby systems, reference is given
to the work by Aven and Opdal [27].
In this chapter we have primarily focused on the situation that the com-
ponent lifetime distributions are exponential. In Sect. 4.7.1 we outlined how
some of the results can be extended to phase-type distributions. A detailed
analysis of the nonexponential case (nonregenerative case) is however outside
the scope of this book. Further research is needed to present formally proved
results for the general case. Presently, the literature covers only some partic-
ular cases. Intuitively, it seems clear that it is possible to generalize many of
the results obtained in this chapter. Consider, for example, the convergence
to the Poisson process for the number of system failures. As long as the components are highly available, we would expect the number of failures to be approximately Poisson distributed. But formal asymptotic results are rather difficult to establish; see, for example, [102, 106, 112, 152, 162]. Strict conditions on the system structure and on the component lifetime and downtime distributions have to be imposed to establish the results. Also the general approach of showing that the compensator of the counting process converges in probability (see Daley and Vere-Jones [58], p. 552) is difficult to apply in our setting.
Of course, this chapter covers only a small number of availability mod-
els compared to the large number of models presented in the literature. We
have, for example, not included models where some components remain in
“suspended animation” while a component is being repaired/replaced, and
models allowing preventive maintenance. For such models, and other related
models, refer to the above cited references, Beichelt and Franken [36], Osaki
[128], Srinivasan and Subramanian [150], Van Heijden and Schornagel [160],
and Yearout et al. [166]. See also the survey paper by Smith et al. [147].
5
Maintenance Optimization

In this chapter we combine the general lifetime model of Chap. 3 with maintenance actions like repairs and replacements. Given a certain cost and reward structure, an optimal repair and replacement strategy will be derived. We begin with some basic and well-known models and then move on to more complex ones, which show how the general approach can be exploited to open up a variety of different optimization models.

5.1 Basic Replacement Models


First of all we consider some basic models that are simple in both the lifetime
modeling and the optimization criterion. These basic models include the age
and the block replacement models that are widely used and thoroughly inves-
tigated. A technical system is considered, the lifetime of which is described
by a positive random variable T with distribution F . Upon failure the system
is immediately replaced by an equivalent one and the process repeats itself.
A preventive replacement can be carried out before failure. Each replacement
incurs a cost of c > 0 and each failure adds a penalty cost k > 0.

5.1.1 Age Replacement Policy

For this policy a replacement age s, s > 0, is fixed for each system at which
a preventive replacement takes place. If Ti , i = 1, 2, . . . , are the successive
lifetimes of the systems, then τi = Ti ∧ s denotes the operating time of the ith
system and equals the ith cycle length. The random variables Ti are assumed
to form an i.i.d. sequence with common distribution F , i.e., F (t) = P (Ti ≤ t).
The costs for one cycle are described by the stochastic process Z = (Zt ), t ∈
R+ , Zt = c + kI(T ≤ t). Clearly, the average cost after n cycles is

T. Aven and U. Jensen, Stochastic Models in Reliability, Stochastic Modelling and Applied Probability 41, DOI 10.1007/978-1-4614-7894-2_5, © Springer Science+Business Media New York 2013
$$\frac{\sum_{i=1}^{n} Z_{\tau_i}}{\sum_{i=1}^{n} \tau_i}$$
and the total cost per unit time up to time t is given by
$$C_t = \frac{1}{t}\sum_{i=1}^{N_t} Z_{\tau_i},$$
where $(N_t)$, $t \in \mathbb{R}_+$, is the renewal counting process generated by $(\tau_i)$ and $Z_\tau = c + kI(T \le \tau)$ describes the costs incurred in one cycle. It is well known from renewal theory (see Appendix B, p. 280) that the limits of the expectations of these ratios, $K_s$, coincide and are equal to the ratio of the expected cost for one cycle to the expected cycle length:
$$K_s = \lim_{n\to\infty} E\left[\frac{\sum_{i=1}^{n} Z_{\tau_i}}{\sum_{i=1}^{n} \tau_i}\right] = \lim_{t\to\infty} E C_t = \frac{E Z_\tau}{E\tau}.$$
The objective is to find the replacement age that minimizes this long run average cost per unit time. Inserting the cost function $Z_t = c + kI(T \le t)$ we get
$$K_s = \frac{c + kF(s)}{\int_0^s (1 - F(x))\,dx}. \tag{5.1}$$
Now elementary analysis can be used to find the optimal replacement age s, i.e., to find $s^*$ with
$$K_{s^*} = \inf\{K_s : s \in \mathbb{R}_+ \cup \{\infty\}\}.$$
Here $s^* = \infty$ means that preventive replacements do not pay and it is optimal to replace only at failures. As can easily be seen, this case occurs if the lifetimes are exponentially distributed: if $F(t) = 1 - \exp\{-\lambda t\}$, $t \ge 0$, $\lambda > 0$, then (5.1) becomes $K_s = \lambda\,[c/F(s) + k]$, so that $K_\infty = \lambda(c + k) \le K_s$ for all s > 0.
Example 5.1. Using rudimentary calculus we see that in the case of an increasing failure rate $\lambda(t) = f(t)/\bar F(t)$, the optimal replacement age is given by
$$s^* = \inf\Big\{t \in \mathbb{R}_+ : \lambda(t)\int_0^t \bar F(x)\,dx - F(t) \ge \frac{c}{k}\Big\},$$
where $\inf\emptyset = \infty$. By differentiating it is not hard to show that the left-hand side of the inequality is increasing in the IFR case, so that $s^*$ can easily be determined. As an example consider the Weibull distribution $F(t) = 1 - \exp\{-(\lambda t)^\beta\}$, $t \ge 0$, with $\lambda > 0$ and $\beta > 1$. The corresponding failure rate is $\lambda(t) = \lambda\beta(\lambda t)^{\beta-1}$ and the optimal replacement age is the unique solution of
$$\lambda(t)\int_0^t \exp\{-(\lambda x)^\beta\}\,dx - 1 + \exp\{-(\lambda t)^\beta\} = \frac{c}{k}.$$
The cost minimum is then given by $K_{s^*} = k\lambda(s^*)$.
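For the Weibull case, $s^*$ can be found numerically. The sketch below is one possible approach, not the authors' method: it approximates $\int_0^t \bar F(x)\,dx$ with Simpson's rule, locates $s^*$ by bisection (valid because the left-hand side is nondecreasing for IFR lifetimes), and checks the identity $K_{s^*} = k\lambda(s^*)$. The cost values, Weibull parameters, integration grid and the search cap `upper` are hypothetical choices.

```python
import math


def surv(x, lam, beta):
    """Weibull survival function Fbar(x) = exp(-(lam x)^beta)."""
    return math.exp(-(lam * x) ** beta)


def rate(t, lam, beta):
    """Weibull failure rate lambda(t) = lam beta (lam t)^(beta - 1)."""
    return lam * beta * (lam * t) ** (beta - 1)


def int_surv(t, lam, beta, m=2000):
    """Composite Simpson approximation of int_0^t Fbar(x) dx (m even)."""
    h = t / m
    total = surv(0.0, lam, beta) + surv(t, lam, beta)
    for i in range(1, m):
        total += (4 if i % 2 else 2) * surv(i * h, lam, beta)
    return total * h / 3.0


def cost_rate(s, c, k, lam, beta):
    """Long-run average cost K_s of the age replacement policy, (5.1)."""
    return (c + k * (1.0 - surv(s, lam, beta))) / int_surv(s, lam, beta)


def optimal_age(c, k, lam, beta, upper=100.0, tol=1e-10):
    """Bisection for s* = inf{t : lambda(t) int_0^t Fbar - F(t) >= c/k}.

    The left-hand side is nondecreasing in the IFR case; if the inequality
    fails at the (hypothetical) cap `upper`, s* = infinity is returned.
    """
    def g(t):
        return (rate(t, lam, beta) * int_surv(t, lam, beta)
                - (1.0 - surv(t, lam, beta)) - c / k)

    if g(upper) < 0.0:
        return math.inf
    lo, hi = 0.0, upper
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return hi


c, k, lam, beta = 1.0, 10.0, 1.0, 2.0   # hypothetical cost and Weibull parameters
s_star = optimal_age(c, k, lam, beta)
print(s_star, cost_rate(s_star, c, k, lam, beta), k * rate(s_star, lam, beta))
```

With β = 1 (exponential lifetimes) the condition can never be met and the function returns infinity, in line with the remark on exponential lifetimes above.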

The age replacement policy allows for planning of a preventive replacement only when a new item is installed. If one wants to fix the time points for preventive replacements in advance for a longer period, one is led to the block replacement policy.

5.1.2 Block Replacement Policy

Under this policy the item is replaced at times $is$, $i = 1, 2, \dots$, for a fixed $s > 0$, and at failures. The preventive replacements occur at regular predetermined intervals at a cost of c, whereas failures within the intervals incur a cost of c + k.
The advantage of this policy is its simple structure and administration, because the time points of preventive replacements are fixed and determined in advance. On the other hand, preventive replacements are carried out irrespective of the age of the processing unit, so that this policy is usually applied to several units at the same time and only if the replacement cost c is comparatively low.
For a fixed time interval s the long run average cost per unit time is
$$K_s = \frac{(c + k)M(s) + c}{s}, \tag{5.2}$$
where M is the renewal function $M(t) = \sum_{j=1}^{\infty} F^{*j}(t)$ (see Appendix B, p. 274). If the renewal function is known explicitly, we can again use elementary analysis to find the optimal s, i.e., to find $s^*$ with
$$K_{s^*} = \inf\{K_s : s \in \mathbb{R}_+ \cup \{\infty\}\}.$$
In most cases the renewal function is not known explicitly. In such a case asymptotic expansions like Theorem B.5, p. 277, in Appendix B or numerical methods have to be used. As is to be expected, in the case of an Exp(λ) distribution preventive replacements do not pay: $M(s) = \lambda s$, so that $K_s = (c + k)\lambda + c/s$ is decreasing in s and $s^* = \infty$.

Example 5.2. Let F be the Gamma distribution function with parameters λ > 0 and n = 2. The corresponding renewal function is
$$M(s) = \frac{\lambda s}{2} - \frac{1}{4}\big(1 - e^{-2\lambda s}\big)$$
(cf. [1], p. 274) and $s^*$ can be determined as the solution of
$$\frac{d}{ds}M(s) = \frac{M(s)}{s} + \frac{c}{s(c + k)}.$$
The solution $s^*$ is finite if and only if $c/(c + k) < 1/4$, i.e., if failure replacements are more than four times as expensive as preventive replacements.
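Because the renewal function of Example 5.2 is explicit, the optimality condition can be solved numerically. In the sketch below (hypothetical names and cost values), the condition multiplied by s reads $sM'(s) - M(s) = c/(c + k)$; substituting $u = 2\lambda s$ turns its left-hand side into $(1 - (1 + u)e^{-u})/4$, which increases from 0 to 1/4, and this is where the threshold 1/4 comes from. At the optimum the first-order condition also gives $K_{s^*} = (c + k)M'(s^*)$.

```python
import math


def renewal_gamma2(s, lam):
    """Renewal function of Example 5.2 (Gamma lifetimes, parameters lam and n = 2)."""
    return lam * s / 2.0 - (1.0 - math.exp(-2.0 * lam * s)) / 4.0


def renewal_density_gamma2(s, lam):
    """Derivative d/ds M(s) = (lam / 2) (1 - e^{-2 lam s})."""
    return lam / 2.0 * (1.0 - math.exp(-2.0 * lam * s))


def block_cost_rate(s, c, k, lam):
    """Long-run average cost K_s of the block replacement policy, (5.2)."""
    return ((c + k) * renewal_gamma2(s, lam) + c) / s


def optimal_block_interval(c, k, lam, tol=1e-12):
    """Solve s M'(s) - M(s) = c / (c + k) by bisection in u = 2 lam s.

    The left-hand side equals (1 - (1 + u) e^{-u}) / 4, increasing from 0
    to 1/4; hence s* < infinity iff c / (c + k) < 1/4.
    """
    target = c / (c + k)
    if target >= 0.25:
        return math.inf  # preventive replacements do not pay

    def h(u):
        return (1.0 - (1.0 + u) * math.exp(-u)) / 4.0

    lo, hi = 0.0, 1.0
    while h(hi) < target:  # enlarge the bracket until it contains the root
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid) < target:
            lo = mid
        else:
            hi = mid
    return hi / (2.0 * lam)


c, k, lam = 1.0, 9.0, 1.0   # hypothetical costs: c/(c+k) = 0.1 < 1/4
s_star = optimal_block_interval(c, k, lam)
print(s_star, block_cost_rate(s_star, c, k, lam))
```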

The age and block replacement policies will result in a finite optimal value
of s only if there is some aging and wear-out of the units, i.e., in probabilistic
terms the lifetime distribution F fulfills some aging condition like IFR, NBU,
or NBUE (see Chap. 2 for these notions). To judge whether it pays to follow
a certain policy and in order to compare the policies it is useful to consider
the number of failures and the number of planned preventive replacements in
a time interval [0, t].

5.1.3 Comparisons and Generalizations

Let F be the underlying lifetime distribution that generates the renewal counting process $(N_t)$, $t \in \mathbb{R}_+$, so that $N_t$ describes the number of failures or completed replacements in [0, t] under the basic policy “replace at failure only.” Let $N_t^A(s)$ and $N_t^B(s)$ denote the number of failures up to time t under policy A (age replacement) or B (block replacement), respectively, and $R_t^A(s)$ and $R_t^B(s)$ the corresponding total number of removals in [0, t], including failures and preventive replacements. We now summarize some early comparison results that can be found, including the proofs, in the monographs of Barlow and Proschan [31, 32]. We remind the reader of the notion of stochastic comparison of two positive random variables X and Y: $X \le_{st} Y$ means $P(X > t) \le P(Y > t)$ for all $t \in \mathbb{R}_+$.

Theorem 5.3. The following four assertions hold true:
(i) $N_t \ge_{st} N_t^B(s)$ for all $t \ge 0$, $s \ge 0$ $\iff$ F is NBU;
(ii) $N_t \ge_{st} N_t^A(s)$ for all $t \ge 0$, $s \ge 0$ $\iff$ F is NBU;
(iii) F IFR $\Rightarrow$ $N_t \ge_{st} N_t^A(s) \ge_{st} N_t^B(s)$ for all $t \ge 0$, $s \ge 0$;
(iv) $R_t^A(s) \le_{st} R_t^B(s)$ for all $t \ge 0$, $s \ge 0$.

Parts (i) and (ii) say that under the weak aging notion NBU it is useful to apply a replacement strategy, since the number of failures is (stochastically) decreased under such a strategy. If, in addition, F has an increasing failure rate, block replacement results in stochastically fewer failures than age replacement, and it follows that $EN_t^A(s) \ge EN_t^B(s)$. On the other hand, for any lifetime distribution F (irrespective of aging notions), block policies have more removals than age policies.

Theorem 5.4. $N_t^A(s)$ is stochastically increasing in s for each $t \ge 0$ if and only if F is IFR.

This result says that IFR is characterized by the reasonable aging condi-
tion that the number of failures is growing with increasing replacement age.
Somewhat weaker results hold true for the block policy (see Shaked and Zhu
[143] for proofs):

Theorem 5.5. If $N_t^B(s)$ is stochastically increasing in s for each $t \ge 0$, then F is IFR.

Theorem 5.6. The expected value $EN_t^B(s)$ is increasing in s for each $t \ge 0$ if and only if the renewal function M(t) is convex.

Since the monographs of Barlow and Proschan appeared, many possible generalizations have been investigated concerning (a) the comparison methods and (b) the lifetime models, replacement policies, and cost structures. It is beyond the scope of this book to describe all of these models and refinements. Some hints for further reading can be found in the Bibliographic Notes at the end of the chapter.
Berg [37] and Dekker [63], among others, use a marginal cost analysis for studying the optimal replacement problem. Let us, for example, consider this approach for block-type policies. In this model it is assumed that the long run average cost per unit time is given by
$$K_s = \frac{c + R(s)}{s}, \tag{5.3}$$
where c is the cost of a preventive replacement and $R(s) = \int_0^s r(x)\,dx$ denotes the total expected costs due to deterioration over an interval of length s. The derivative r, called the (marginal) deterioration cost rate, is assumed to be continuous and piecewise differentiable. If in the block replacement model of the preceding Sect. 5.1.2 the lifetime distribution function F has a bounded density f, then it is known (see Appendix B, p. 278) that the corresponding renewal function M also admits a density m, and we have $R(s) = \int_0^s (c + k)\,m(x)\,dx$, which shows that the block replacement model is a special case of this block-type model. Now certain properties of the marginal cost rate can be carried over to the cost function K. The proof of the following theorem is straightforward and can be found in [63].

Theorem 5.7. (i) If r(t) is nonincreasing on $[t_0, t_1]$ for some $0 \le t_0 < t_1$ and $r(t_0) < K_{t_0}$, then $K_s$ is also nonincreasing in s on $[t_0, t_1]$;
(ii) if r(t) increases strictly for $t > t_0$ and some $t_0 \ge 0$, where $r(t_0) < K_{t_0}$, and if either
(a) $\lim_{t\to\infty} r(t) = \infty$ or (b) $\lim_{t\to\infty} r(t) = a$ and $\lim_{t\to\infty} (at - R(t)) > c$,
then $K_s$ has a minimum, say $K^*$ at $s^*$, which is unique on $[t_0, \infty)$; moreover,
$$K^* = K_{s^*} = r(s^*).$$

Thus a myopic policy, in which at every moment we consider whether to defer the replacement or not, is optimal. That is, the expected cost of deferring the replacement to time $t + \Delta t$, being $r(t)\Delta t$, should be compared with the minimum average cost over an interval of the same length, being $K^*\Delta t$. Hence if r(t) is larger than $K^*$, the deferment costs are larger and we should replace. This is the idea of marginal cost analysis as described, for example, in [37, 63].
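As a small worked illustration of Theorem 5.7 (ii)(a) and of the marginal cost argument, consider a hypothetical linear deterioration cost rate r(t) = at with a > 0, so that $R(s) = as^2/2$ in (5.3). Differentiating (5.3) in general shows that $K_s$ decreases while $r(s) < K_s$ and that stationary points satisfy $K_s = r(s)$; in the linear case the minimizer is explicit:

```latex
\begin{aligned}
\frac{d}{ds}K_s &= \frac{r(s)\,s - \bigl(c + R(s)\bigr)}{s^2} = \frac{r(s) - K_s}{s},\\
K_s &= \frac{c}{s} + \frac{a s}{2}, \qquad s^* = \sqrt{2c/a},\\
K^* &= \frac{c}{s^*} + \frac{a s^*}{2} = \sqrt{\tfrac{ac}{2}} + \sqrt{\tfrac{ac}{2}} = \sqrt{2ac} = a s^* = r(s^*).
\end{aligned}
```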
The above framework can be extended to age-type policies if we consider the following long run average cost per unit time
$$K_s = \frac{c + \int_0^s r(x)\,\bar F(x)\,dx}{\int_0^s \bar F(x)\,dx}, \tag{5.4}$$
where c is the cost of a preventive replacement and r denotes the marginal deterioration cost rate. Again it can easily be seen that the basic age replacement model (5.1) is a special case, setting $r(x) = k\lambda(x)$, where $\lambda(x) = f(x)/\bar F(x)$ is the failure rate, since then $\int_0^s k\lambda(x)\,\bar F(x)\,dx = kF(s)$. Now a very similar analysis can be carried out (see [63]) and the same theorem holds true for this cost criterion, except that condition (ii)(b) has to be replaced by
$$\lim_{t\to\infty} r(t) = a \quad\text{and}\quad a > \lim_{s\to\infty} K_s \quad\text{for some } a > 0.$$

This shows that behind these two quite different models the same optimization mechanism works. This has been exploited by Aven and Bergman in [19] (see also [21]). They recognized that for many replacement models the optimization criterion can be written in the form
$$\frac{E\left[\int_0^\tau a_t h_t\,dt + c_0\right]}{E\left[\int_0^\tau h_t\,dt + p_0\right]}, \tag{5.5}$$
where τ is a stopping time based on the information about the condition of the system, $(a_t)$ is a nondecreasing stochastic process, $(h_t)$ is a nonnegative stochastic process, and $c_0$ and $p_0$ are nonnegative random variables; all variables are adapted to the information about the condition of the system. Both the block-type model (5.3) and the age-type model (5.4) are included. Take, for example, deterministic values for all random quantities, in particular $\tau = s$, $h_t = \bar F(t)$, $a_t = r(t)$, $p_0 = 0$, and $c_0 = c$. This leads to the age-type model. In (5.5) the stopping time τ is the control variable, which should be determined in such a way that (5.5) is minimized. This problem of choosing a minimizing stopping time is known as an optimal stopping problem and will be further developed in the next section.

5.2 A General Replacement Model


In this section we want to develop the tools that allow certain maintenance
problems to be solved in a fairly general way, also considering the possibility
of taking different levels of information into account.

5.2.1 An Optimal Stopping Problem

In connection with maintenance models as described above, we will have to


solve optimization problems. Often an optimal point in time has to be de-
termined that maximizes some reward functional. In terms of the theory of
stochastic processes, this optimal point in time will be a stopping time τ that
5.2 A General Replacement Model 181

maximizes the expectation EZτ of some stochastic process Z. We will see that
the smooth semimartingale (SSM) representation of Z, as introduced in de-
tail in Sect. 3.1, is an excellent tool to carry out this optimization. Therefore,
we want to solve the stopping problem and to characterize optimal stopping
times for the case in which Z is an SSM and τ ranges in a suitable class of
stopping times, say

C F = {τ : τ is an F-stopping time, τ < ∞, EZτ > −∞}.

Without any conditions on the structure of the process Z one cannot hope to find an explicit solution of the stopping problem. A condition called the monotone case in the discrete time setting can be transferred to continuous time as follows.
Definition 5.8 (MON). Let Z = (f, M) be an SSM. Then the condition
$$\{f_t \le 0\} \subset \{f_{t+h} \le 0\}\ \ \forall\, t, h \in \mathbb{R}_+, \qquad \bigcup_{t\in\mathbb{R}_+} \{f_t \le 0\} = \Omega \tag{5.6}$$
is said to be the monotone case, and the stopping time
$$\zeta = \inf\{t \in \mathbb{R}_+ : f_t \le 0\}$$
is called the ILA-stopping rule (infinitesimal-look-ahead).
Obviously, in the monotone case the process f driving the SSM $Z_t = \int_0^t f_s\,ds + M_t$ remains nonpositive once it crosses zero from above, and the ILA-stopping rule ζ is a natural candidate to solve the maximization problem.
Theorem 5.9. Let Z = (f, M) be an $\mathbb F$-SSM and ζ the ILA-stopping rule. If the martingale M is uniformly integrable, then in the monotone case (5.6)
$$EZ_\zeta = \sup\{EZ_\tau : \tau \in C^{\mathbb F}\}.$$

Remark 5.10. The condition that the martingale is uniformly integrable can be relaxed; in [98] it is shown that the condition may be replaced by
$$M_\zeta \in L^1,\quad \zeta \in C^{\mathbb F},\quad \lim_{t\to\infty} \int_{\{\tau > t\}} M_t^-\,dP = 0 \ \ \forall\, \tau \in C^{\mathbb F},$$
where, as usual, $a^-$ denotes the negative part of $a \in \mathbb{R}$: $a^- = \max\{-a, 0\}$. In most cases, however, such a generalization will not be needed in what follows.
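Proof. Since M is uniformly integrable, we have $EM_\tau = 0$ for all $\tau \in C^{\mathbb F}$ as a consequence of the optional sampling theorem (cf. Appendix A, p. 262). Also ζ is an element of $C^{\mathbb F}$, because $\zeta < \infty$ by (5.6) and $EZ_\zeta^- \le E|Z_0| + E|M_\zeta| < \infty$. It remains to show that
$$E\int_0^\zeta f_s\,ds \ge E\int_0^\tau f_s\,ds$$
for all $\tau \in C^{\mathbb F}$. But this is an immediate consequence of $f_s > 0$ on $\{\zeta > s\}$ and $f_s \le 0$ on $\{\zeta \le s\}$: the difference $\int_0^\zeta f_s\,ds - \int_0^\tau f_s\,ds = \int_0^\infty [I(s < \zeta) - I(s < \tau)]\,f_s\,ds$ has a nonnegative integrand. □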


The following example demonstrates how this optimization technique can be applied.

Example 5.11. Let ρ be an exponentially distributed random variable with parameter λ > 0 on the basic probability space $(\Omega, \mathcal F, \mathbb F, P)$ equipped with the filtration $\mathbb F$ generated by ρ:
$$\mathcal F_t = \sigma(\{\rho > s\}, 0 \le s \le t) = \sigma(I(\rho > s), 0 \le s \le t) = \sigma(\rho \wedge t).$$
For the latter equality we make use of our agreement that σ(·) denotes the completion of the generated σ-algebra so that, for instance, the event $\{\rho = t\} = \bigcap_{n\in\mathbb N}\{t - \tfrac{1}{n} < \rho \le t\}$ is also included in $\sigma(\rho \wedge t)$. Then we define
$$Z_t = e^t I(\rho > t), \qquad t \in \mathbb{R}_+.$$

This process $Z$ can be interpreted as the potential gain in a harvesting problem (in a wider sense): there is an exponentially growing potential gain, and at any time $t$ the decision-maker has to decide whether to realize this gain or to continue observation with the chance of earning a higher gain. But the gain can only be realized up to a random time $\rho$, which is unknown in advance. So there is a risk of losing all potential gains, and the problem is to find an optimal harvesting time.
The process $Z$ is adapted, right-continuous and integrable with

$$E[Z_{t+h} \mid \mathcal{F}_t] = e^{t+h} E[I(\rho > t+h) \mid \mathcal{F}_t] = e^{(1-\lambda)h} Z_t, \quad h, t \in \mathbb{R}_+.$$

Thus $Z$ is a submartingale (martingale, supermartingale) if $\lambda < 1$ ($\lambda = 1$, $\lambda > 1$). Obviously we have

$$\lim_{h \to 0+} \frac{1}{h} E[Z_{t+h} - Z_t \mid \mathcal{F}_t] = Z_t(1 - \lambda) = f_t.$$

Theorem 3.6, p. 60, states that $Z$ is an SSM with representation

$$Z_t = 1 + \int_0^t Z_s(1 - \lambda)\,ds + M_t.$$

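Three cases will be discussed separately:

1. $\lambda < 1$. The monotone case (5.6) holds true with the ILA stopping time $\zeta = \rho$. But $\zeta$ is not optimal, because $EZ_\zeta = 0$, whereas $Z$ is a submartingale with unbounded expectation function $EZ_t = e^{(1-\lambda)t}$, so that $\sup\{EZ_\tau : \tau \in C^{\mathbb{F}}\} = \infty$. (In particular, $M$ cannot be uniformly integrable here, since otherwise Theorem 5.9 would apply.)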

2. $\lambda > 1$. The monotone case holds true with the ILA stopping time $\zeta = 0$. It is not hard to show that in this case the martingale

$$M_t = Z_t - 1 - \int_0^t Z_s(1 - \lambda)\,ds$$

is uniformly integrable. Theorem 5.9 ensures that $\zeta$ is optimal with $EZ_\zeta = 1$.
3. $\lambda = 1$. Again the monotone case (5.6) holds true with the ILA stopping time $\zeta = 0$. However, the martingale $M_t = e^t I(\rho > t) - 1$ is not uniformly integrable. But for all $\tau \in C^{\mathbb{F}}$ we have $EM_\tau^- \le 1$ and

$$\lim_{t \to \infty} \int_{\{\tau > t\}} M_t^- \, dP \le \lim_{t \to \infty} \int_{\{\tau > t\}} dP = 0,$$

so that the more general conditions mentioned in Remark 5.10 are fulfilled with $M_\zeta = 0$. This yields

$$EZ_\zeta = 1 = \sup\{EZ_\tau : \tau \in C^{\mathbb{F}}\}.$$
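For a fixed deterministic time $t$, which is a trivial stopping time, the example gives $EZ_t = e^t P(\rho > t) = e^{(1-\lambda)t}$. The short sketch below is our own illustration, not part of the source: it checks this formula by simulation and makes visible why stopping at once is optimal for $\lambda \ge 1$, while the expected gain grows without bound for $\lambda < 1$. Function names, seed and sample size are arbitrary choices.

```python
import math
import random

def expected_gain(t: float, lam: float) -> float:
    """E[Z_t] = e^t * P(rho > t) = exp((1 - lam) * t) for rho ~ Exp(lam)."""
    return math.exp((1.0 - lam) * t)

def simulated_gain(t: float, lam: float, n: int = 200_000, seed: int = 1) -> float:
    """Monte Carlo estimate of E[e^t * I(rho > t)] at the fixed stopping time t."""
    rng = random.Random(seed)
    survivors = sum(1 for _ in range(n) if rng.expovariate(lam) > t)
    return math.exp(t) * survivors / n

# Compare waiting until t = 3 with stopping immediately (zeta = 0 gives gain Z_0 = 1).
for lam in (0.5, 1.0, 2.0):
    print(f"lambda={lam}: exact E Z_3 = {expected_gain(3.0, lam):.4f}, "
          f"simulated = {simulated_gain(3.0, lam):.4f}")
```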

5.2.2 A Related Stopping Problem


As was described in Sect. 5.1, replacement policies of age and block type are strongly connected to the following stopping problem: minimize

$$K_\tau = \frac{EZ_\tau}{EX_\tau} \tag{5.7}$$

in a suitable class of stopping times, where $Z$ and $X$ are real stochastic processes. For a precise formulation and solution of this problem we use the set-up given in Chap. 3. On the basic complete probability space $(\Omega, \mathcal{F}, P)$ a filtration $\mathbb{F} = (\mathcal{F}_t)$, $t \in \mathbb{R}_+$, is given, which is assumed to fulfill the usual conditions concerning right continuity and completeness. Furthermore, let $Z = (Z_t)$ and $X = (X_t)$, $t \in \mathbb{R}_+$, be real right-continuous stochastic processes adapted to the filtration $\mathbb{F}$. Let $T > 0$ be a finite $\mathbb{F}$-stopping time with $EZ_T > -\infty$, $E|X_T| < \infty$ and

$$C_T^{\mathbb{F}} = \{\tau : \tau \text{ is an } \mathbb{F}\text{-stopping time}, \ \tau \le T, \ EZ_\tau > -\infty, \ E|X_\tau| < \infty\}.$$

For $\tau \in C_T^{\mathbb{F}}$ we consider the ratio $K_\tau$ in (5.7). The stopping problem is then to find a stopping time $\sigma \in C_T^{\mathbb{F}}$ with

$$K^* = K_\sigma = \inf\{K_\tau : \tau \in C_T^{\mathbb{F}}\}. \tag{5.8}$$
In this model T describes the random lifetime of some technical system.
The index t can be regarded as a time point and Ft as the σ-algebra which
contains all gathered information up to time t. The stochastic processes Z and
X are adapted to the stream of information F, i.e., Z and X are observable
with respect to the given information or in mathematical terms, Zt and Xt are
Ft -measurable for all t ∈ R+ . The replacement times can then be identified
with stopping times not greater than the system lifetime T.

Example 5.12. In the case of block-type models no random information is to be considered, so that the filtration reduces to the trivial one and all stopping times are constants, i.e., $C_T^{\mathbb{F}} = \mathbb{R}_+ \cup \{\infty\}$. In this case elementary calculus yields the optimum and no additional effort is necessary.
Example 5.13. Let $Z_t = c + kI(T \le t)$, $X_t = t$, and $\mathcal{F}_t = \sigma(Z_s, 0 \le s \le t) = \sigma(I(T \le s), 0 \le s \le t)$ be the σ-algebra generated by $Z$, i.e., at any time $t \ge 0$ it is known whether the system works or not. The $\mathbb{F}$-stopping times $\tau \in C_T^{\mathbb{F}}$ are of the form $\tau = t^* \wedge T$ for some $t^* > 0$. Then we have $EZ_\tau = c + kEI(T \le \tau) = c + kP(T \le t^*)$ and $EX_\tau = E\tau$, which leads to the
basic age replacement policy.
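Example 5.13 thus reduces to minimizing the deterministic function $t^* \mapsto (c + kP(T \le t^*))/E[t^* \wedge T]$, where $E[t^* \wedge T] = \int_0^{t^*} P(T > s)\,ds$. A minimal Python sketch, assuming a Weibull lifetime and the illustrative values $c = 1$, $k = 4$, shape 2, scale 1 (none of these come from the source), evaluates this ratio with Simpson's rule and searches a grid:

```python
import math

def weibull_survival(s: float, shape: float, scale: float) -> float:
    """P(T > s) for a Weibull(shape, scale) lifetime (hypothetical choice of F)."""
    return math.exp(-((s / scale) ** shape))

def expected_cycle_length(t: float, shape: float, scale: float, steps: int = 1000) -> float:
    """E[t ∧ T] = integral_0^t P(T > s) ds via the composite Simpson rule (steps even)."""
    h = t / steps
    total = weibull_survival(0.0, shape, scale) + weibull_survival(t, shape, scale)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * weibull_survival(i * h, shape, scale)
    return total * h / 3.0

def age_replacement_cost(t: float, c: float, k: float, shape: float, scale: float) -> float:
    """K_tau for tau = t ∧ T: (c + k P(T <= t)) / E[t ∧ T]."""
    failure_prob = 1.0 - weibull_survival(t, shape, scale)
    return (c + k * failure_prob) / expected_cycle_length(t, shape, scale)

def best_replacement_age(c, k, shape, scale, t_max, grid=300):
    """Crude grid search over t* in (0, t_max]; a sketch, not a production optimizer."""
    ages = [t_max * (i + 1) / grid for i in range(grid)]
    return min(ages, key=lambda t: age_replacement_cost(t, c, k, shape, scale))

t_star = best_replacement_age(1.0, 4.0, 2.0, 1.0, t_max=2.0)
print(t_star, age_replacement_cost(t_star, 1.0, 4.0, 2.0, 1.0))
```

For large $t^*$ the ratio approaches $(c + k)/ET$, the cost of never replacing preventively.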
To solve the above-mentioned stopping problem, we will make use of semimartingale representations of the processes $Z$ and $X$. It is assumed that $Z$ and $X$ are SSMs as introduced in Sect. 3.1 with representations

$$Z_t = Z_0 + \int_0^t f_s\,ds + M_t, \qquad X_t = X_0 + \int_0^t g_s\,ds + L_t.$$

As in Sect. 3.1 we use the short notation $Z = (f, M)$ and $X = (g, L)$. Almost all of the stochastic processes used in applications without predictable jumps admit such SSM representations. The following general assumption is made throughout this section:

Assumption (A). $Z = (f, M)$ and $X = (g, L)$ are SSMs with $EZ_0 > 0$, $EX_0 \ge 0$, $g_s > 0$ for all $s \in \mathbb{R}_+$, and $M^T, L^T \in \mathcal{M}_0$ are uniformly integrable martingales, where $M_t^T = M_{t \wedge T}$, $L_t^T = L_{t \wedge T}$.
Remember that all relations between real random variables hold (only) P -
almost surely. The first step to solve the optimization problem is to establish
bounds for K ∗ in (5.8).
Lemma 5.14. Assume that (A) is fulfilled and

$$q = \inf\left\{\frac{f_t(\omega)}{g_t(\omega)} : 0 \le t < T(\omega), \ \omega \in \Omega\right\} > -\infty.$$

Then

$$b_l \le K^* \le b_u$$

holds true, where the bounds are given by

$$b_u = \frac{EZ_T}{EX_T}, \qquad b_l = \begin{cases} \dfrac{E[Z_0 - qX_0]}{EX_T} + q & \text{if } E[Z_0 - qX_0] > 0, \\[1ex] \dfrac{EZ_0}{EX_0} & \text{if } E[Z_0 - qX_0] \le 0. \end{cases}$$

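Proof. Because $T \in C_T^{\mathbb{F}}$, only the lower bound has to be shown. Since the martingales $M^T$ and $L^T$ are uniformly integrable, the optional sampling theorem (see Appendix A, p. 262) yields $EM_\tau = EL_\tau = 0$ for all $\tau \in C_T^{\mathbb{F}}$. Using $f_s \ge q g_s$ on $[0, T)$, we therefore obtain

$$K_\tau = \frac{EZ_0 + E\int_0^\tau f_s\,ds}{EX_\tau} \ge \frac{EZ_0 + qE[X_\tau - X_0]}{EX_\tau} = \frac{EZ_0 - qEX_0}{EX_\tau} + q \ge b_l.$$

The last inequality follows from $EX_0 \le EX_\tau \le EX_T$ (recall $g_s > 0$), which completes the proof. □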


The following example gives these bounds for the basic age replacement
policy.
Example 5.15 (Continuation of Example 5.13). Let us return to the simple cost process $Z_t = c + kI(T \le t)$ with the natural filtration as before. Then $I(T \le t)$ has the SSM representation

$$I(T \le t) = \int_0^t I(T > s)\lambda(s)\,ds + M_t,$$

where $\lambda$ is the usual failure rate of the lifetime $T$. It follows that the processes $Z$ and $X$ have representations

$$Z_t = c + \int_0^t I(T > s)k\lambda(s)\,ds + \tilde M_t, \quad \tilde M_t = kM_t,$$

and

$$X_t = t = \int_0^t ds.$$

Assuming the IFR property, we obtain with $\lambda(0) = \inf\{\lambda(t) : t \in \mathbb{R}_+\}$ and $q = k\lambda(0)$ the following bounds for $K^*$ in the basic age replacement model:

$$b_u = \frac{EZ_T}{EX_T} = \frac{c + k}{ET}, \qquad b_l = \frac{c}{ET} + k\lambda(0).$$
These bounds could also be established directly by using (5.1), p. 176. The
benefit of Lemma 5.14 lies in its generality, which also allows the bounds to
be found in more complex models as the following example shows.
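For a concrete IFR case, the bounds of Example 5.15 can be compared with the numerically minimized cost. The sketch below assumes (our choice, not the book's) a Weibull lifetime with shape 2, for which $\lambda(0) = 0$, $ET = \text{scale} \cdot \Gamma(3/2)$, and $\int_0^t P(T > s)\,ds$ has a closed form via the error function; the cost parameters are illustrative.

```python
import math

def age_replacement_bounds(c, k, mean_life, lam0=0.0):
    """Example 5.15: b_l = c/ET + k*lambda(0), b_u = (c + k)/ET for an IFR lifetime."""
    return c / mean_life + k * lam0, (c + k) / mean_life

def rayleigh_cost(t, c, k, scale):
    """K for tau = t ∧ T with a Weibull(shape 2) lifetime, using
    integral_0^t exp(-(s/scale)^2) ds = scale * sqrt(pi)/2 * erf(t/scale)."""
    failure_prob = 1.0 - math.exp(-((t / scale) ** 2))
    cycle = scale * math.sqrt(math.pi) / 2.0 * math.erf(t / scale)
    return (c + k * failure_prob) / cycle

c, k, scale = 1.0, 4.0, 1.0
mean_life = scale * math.gamma(1.5)          # ET for shape 2; here lambda(0) = 0
b_l, b_u = age_replacement_bounds(c, k, mean_life)
k_grid = min(rayleigh_cost(0.005 * i, c, k, scale) for i in range(1, 1000))
print(f"b_l = {b_l:.3f} <= min K = {k_grid:.3f} <= b_u = {b_u:.3f}")
```

With $\lambda(0) = 0$ the lower bound is rather loose; the upper bound is the cost of replacing only at failure.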
Example 5.16 (Shock Model). Consider now a compound point process model in which shocks arrive according to a marked point process $(T_n, V_n)$ as was outlined in Sect. 3.3.3. Here we assume that $(T_n)$ is a nonhomogeneous Poisson process with a deterministic intensity $\lambda(s)$ integrating to $\Lambda(t) = \int_0^t \lambda(s)\,ds$ and that $(V_n)$ forms an i.i.d. sequence of nonnegative random variables independent of $(T_n)$ with $V_n \sim F$. The accumulated damage up to time $t$ is then described by

$$R_t = \sum_{n=1}^{N_t} V_n,$$

where $N_t = \sum_{n=1}^\infty I(T_n \le t)$ is the number of shocks that have arrived by time $t$. The lifetime of the system is modeled as the first time $R_t$ reaches a fixed threshold $S > 0$:

$$T = \inf\{t \in \mathbb{R}_+ : R_t \ge S\}.$$
We stick to the simple cost structure of the basic age replacement model, i.e.,

Zt = c + kI(T ≤ t).

But now we want to minimize the expected costs per number of arrived shocks
in the long run, i.e.,
Xt = Nt .
This cost criterion is appropriate if we think, for example, of systems which
are used by customers at times Tn . Each usage causes some random damage
(shock). If the customers arrive with varying intensities governed by external
circumstances, e.g., different intensities at different periods of a day, it makes
no sense to relate the costs to time, and it is more reasonable to relate the
costs to the number of customers served.
The semimartingale representations with respect to the internal filtration generated by the marked point process are (cf. Sect. 3.3.5, p. 89)

$$Z_t = c + \int_0^t I(T > s)k\lambda(s)\bar F((S - R_s)-)\,ds + M_t, \qquad X_t = \int_0^t \lambda(s)\,ds + L_t.$$

The martingale $M$ is uniformly integrable, and so is $L^T = (L_{t \wedge T})$ if we assume that $E\int_0^T \lambda(s)\,ds = E\Lambda(T) < \infty$. Lemma 5.14 yields, with

$$q = \inf\{k\bar F((S - R_t)-) : 0 \le t < T(\omega), \ \omega \in \Omega\} = k\bar F(S-),$$

the following bounds for $K^* = \inf\{K_\tau : \tau \in C_T^{\mathbb{F}}\}$:

$$b_u = \frac{c + k}{EX_T}, \qquad b_l = \frac{c}{EX_T} + k\bar F(S-),$$
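where $EX_T = E\Lambda(T)$. Observe that $X_T = \inf\{n \in \mathbb{N} : \sum_{i=1}^n V_i \ge S\}$ and $\{X_T > m\} = \{\sum_{i=1}^m V_i < S\}$. This yields

$$EX_T = \sum_{m=0}^\infty P\left(\sum_{i=1}^m V_i < S\right) \le \sum_{m=0}^\infty F(S-)^m = \frac{1}{\bar F(S-)}$$

if $F(S-) < 1$, where the inequality uses $\{\sum_{i=1}^m V_i < S\} \subset \{V_1 < S, \ldots, V_m < S\}$. In addition, using Wald's equation $E\sum_{n=1}^{X_T} V_n = EX_T\,EV_1 \ge S$, i.e., $EX_T \ge S/EV_1$, we can derive the following alternative bounds

$$b_u = (c + k)\frac{EV_1}{S}, \qquad b_l = (c + k)\bar F(S-),$$

which can easily be computed.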
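The following sketch specializes Example 5.16 to exponentially distributed damages $V_n \sim \mathrm{Exp}(\nu)$, an assumption made only for illustration (Example 5.23 below uses the same choice, for which $EX_T = \nu S + 1$). It simulates $X_T$, the number of shocks up to failure, checks $S/EV_1 \le EX_T \le 1/\bar F(S-)$, and compares the two pairs of bounds; all parameter values are hypothetical.

```python
import math
import random

def shocks_to_failure(threshold: float, nu: float, rng: random.Random) -> int:
    """X_T = inf{n : V_1 + ... + V_n >= S} for i.i.d. damages V_i ~ Exp(nu)."""
    damage, n = 0.0, 0
    while damage < threshold:
        damage += rng.expovariate(nu)
        n += 1
    return n

def shock_model_bounds(c, k, threshold, nu, mean_xt):
    """Bounds of Example 5.16 for Exp(nu) damages, where F-bar(S-) = exp(-nu S)."""
    surv = math.exp(-nu * threshold)
    lemma = (c / mean_xt + k * surv, (c + k) / mean_xt)           # needs E X_T
    # Alternative bounds from E X_T >= S / E V_1 (Wald) and E X_T <= 1 / F-bar(S-).
    alternative = ((c + k) * surv, (c + k) * (1.0 / nu) / threshold)
    return lemma, alternative

rng = random.Random(7)
S, nu, runs = 2.0, 1.0, 100_000
mc_mean = sum(shocks_to_failure(S, nu, rng) for _ in range(runs)) / runs
print(f"simulated E X_T = {mc_mean:.3f}, nu*S + 1 = {nu * S + 1}")
print(shock_model_bounds(c=1.0, k=5.0, threshold=S, nu=nu, mean_xt=nu * S + 1.0))
```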


To solve the stopping problem (5.8) for a ratio of expectations, we use the solution of the simpler case in which we look for the maximum of the expectations $EZ_\tau$, where $Z$ is an SSM and $\tau$ ranges in a suitable class of stopping times, which has been considered in detail in Sect. 5.2. It is a well-known technique to replace the minimization problem (5.8) by an equivalent maximization problem. Observing that $K_\tau = EZ_\tau/EX_\tau \ge K^*$ is equivalent to $K^*EX_\tau - EZ_\tau \le 0$ for all $\tau \in C_T^{\mathbb{F}}$, where equality holds for an optimal stopping time, one has the maximization problem:

$$\text{Find } \sigma \in C_T^{\mathbb{F}} \text{ with } EY_\sigma = \sup\{EY_\tau : \tau \in C_T^{\mathbb{F}}\} = 0, \text{ where} \tag{5.9}$$

$$Y_t = K^*X_t - Z_t \quad \text{and} \quad K^* = \inf\{K_\tau : \tau \in C_T^{\mathbb{F}}\}.$$

This new stopping problem can be solved by means of the semimartingale representation of the process $Y = (Y_t)$ for $t \in [0, T)$:

$$Y_t = K^*X_0 - Z_0 + \int_0^t (K^*g_s - f_s)\,ds + R_t, \tag{5.10}$$

where the martingale $R = (R_t)$, $t \in \mathbb{R}_+$, is given by

$$R_t = K^*L_t - M_t.$$

Now the procedure is as follows. If the integrand $k_s = K^*g_s - f_s$ fulfills the monotone case (MON), then Theorem 5.9, p. 181, of Sect. 5.2 yields that the ILA-stopping rule $\sigma = \inf\{t \in \mathbb{R}_+ : k_t \le 0\}$ is optimal, provided the martingale part $R$ is uniformly integrable. Note, however, that this stopping time $\sigma$ depends on the unknown value $K^*$, which can be determined from the equality $EY_\sigma = 0$.
Next we want to define monotonicity conditions that ensure (MON). Obviously, under assumption (A), p. 184, the monotone case holds true if the ratio $f_s/g_s$ is increasing ($P$-a.s.) with $f_0/g_0 < K^*$ and $\lim_{s \to \infty} f_s/g_s > K^*$. The value $K^*$ is unknown, so we need to use the bounds derived above; moreover, it seems too restrictive to demand that the ratio be increasing. In particular, bath-tub-shaped functions, which first decrease up to some $s_0$ and increase for $s > s_0$, should be covered by the monotonicity condition. This results in the following definition.
Definition 5.17. Let $a, b \in \mathbb{R} \cup \{-\infty, \infty\}$ be constants with $a \le b$. Then a function $r : \mathbb{R}_+ \to \mathbb{R}$ is called

(i) $(a, b)$-increasing, if for all $t, h \in \mathbb{R}_+$

$$r(t) \ge a \ \text{implies} \ r(t + h) \ge r(t) \wedge b;$$

(ii) $(a, b)$-decreasing, if for all $t, h \in \mathbb{R}_+$

$$r(t) \le b \ \text{implies} \ r(t + h) \le r(t) \vee a.$$

Roughly speaking, an $(a, b)$-increasing function $r(t)$ passes the levels $a$ and $b$ from below as $t$ increases and never falls back below such a level. Between $a$ and $b$ the increase is monotone. Obviously, a $(0, 0)$-decreasing function fulfills (MON) if $r(\infty) \le 0$. A $(-\infty, \infty)$-increasing (decreasing) function is monotone
in the ordinary sense.
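Definition 5.17(i) can be checked on a sampled path. The hypothetical helper below tests the defining implication on a finite grid of time points; passing this test is only a necessary condition for the continuous-time property. The bath-tub path mirrors the case discussed above: its initial dip does not matter as long as it lies below the level $a$.

```python
import math

def is_ab_increasing(values, a, b):
    """Check Definition 5.17(i) on a sampled path r(t_0), r(t_1), ... with t_0 < t_1 < ...:
    r(t_i) >= a must imply r(t_j) >= min(r(t_i), b) for all j > i."""
    future_min = math.inf                # minimum of r over the later grid points
    for value in reversed(values):
        if value >= a and future_min < min(value, b):
            return False
        future_min = min(future_min, value)
    return True

# A bath-tub-shaped path r(t) = (t - 1)^2 sampled on [0, 3] (illustrative example).
path = [(0.01 * i - 1.0) ** 2 for i in range(301)]
print(is_ab_increasing(path, 1.5, 3.0))  # r_0 = 1 < a, so the dip is irrelevant
print(is_ab_increasing(path, 0.5, 3.0))  # r_0 = 1 >= a, but r later falls to 0
```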
The main idea for solving the stopping problem is that, if the ratio $f_s/g_s$ satisfies such a monotonicity condition, then instead of considering all stopping times $\tau \in C_T^{\mathbb{F}}$ one may restrict the search for an optimal stopping time to the class of indexed stopping times

$$\rho_x = \inf\{t \in \mathbb{R}_+ : xg_t - f_t \le 0\} \wedge T, \quad \inf\emptyset = \infty, \quad x \in \mathbb{R}. \tag{5.11}$$

The optimal stopping level $x^*$ for the ratio $f_s/g_s$ can be determined from $EY_\sigma = 0$ and coincides with $K^*$, as is shown in the following theorem.

Theorem 5.18. Assume (A) (see p. 184) and let $\rho_x$, $x \in \mathbb{R}$, and the bounds $b_u$, $b_l$ be defined as in (5.11) and in Lemma 5.14, p. 184, respectively. If the process $(r_t)$, $t \in \mathbb{R}_+$, with $r_t = f_t/g_t$ has $(b_l, b_u)$-increasing paths on $[0, T)$, then

$$\sigma = \rho_{x^*}, \quad \text{with } x^* = \inf\{x \in \mathbb{R} : xEX_{\rho_x} - EZ_{\rho_x} \ge 0\},$$

is an optimal stopping time and $x^* = K^*$.

Proof. Since $r$ is $(b_l, b_u)$-increasing with $b_l \le K^* \le b_u$, it follows that $r$ is also $(K^*, K^*)$-increasing, i.e., it passes $K^*$ at most once from below. Thus the monotone case holds true for the SSM $Y$. From the general assumption (A) on p. 184 we deduce that the martingale part of $Y$ is uniformly integrable, so that

$$\sigma = \inf\{t \in \mathbb{R}_+ : K^*g_t - f_t \le 0\} \wedge T = \rho_{K^*}$$

is optimal with $EY_\sigma = \sup\{EY_\tau : \tau \in C_T^{\mathbb{F}}\} = 0$.

It remains to show that $x^* = K^*$. Define

$$v(x) = xEX_{\rho_x} - EZ_{\rho_x} = xEX_0 - EZ_0 + E\int_0^{\rho_x} (xg_s - f_s)\,ds.$$

Now $v(x)$ is obviously nondecreasing in $x$, and by the definition of $\rho_x$ and (A) we have $v(x) \ge -EZ_0$. For $x < K^*$ and $v(x) > -EZ_0$ the following strict inequality holds, since in this case we have either $EX_0 > 0$, or $EX_0 = 0$ and $P(\rho_x > 0) > 0$:

$$v(x) < K^*EX_0 - EZ_0 + E\int_0^{\rho_x} (K^*g_s - f_s)\,ds \le v(K^*) = 0.$$

Likewise, for $x < K^*$ and $v(x) = -EZ_0$ we have $v(x) < v(K^*) = 0$ because $EZ_0 > 0$. Therefore,

$$x^* = \inf\{x \in \mathbb{R} : v(x) \ge v(K^*) = 0\} = K^*,$$

which proves the assertion. □




Remark 5.19. 1. If $E[Z_0 - qX_0] < 0$, then the lower bound $b_l$ in Lemma 5.14 is attained for $\sigma = 0$. So in this case $K^* = EZ_0/EX_0$ is the minimum without any further monotonicity assumptions.
2. If no monotonicity conditions hold at all, then $x^* = \inf\{x \in \mathbb{R} : xEX_{\rho_x} - EZ_{\rho_x} \ge 0\}$ is the cost minimum if only stopping times of type $\rho_x$ are considered. But $T = \rho_\infty$ is among this restricted class of stopping times, so that $x^*$ is at least an improved upper bound for $K^*$, i.e., $b_u \ge x^*$. From the definition of $x^*$ we obtain $x^* \ge K_{\rho_{x^*}}$, which is obviously bounded below by the overall minimum $K^*$: $b_u \ge x^* \ge K_{\rho_{x^*}} \ge K^*$.
3. Processes $r$ with $(b_l, b_u)$-increasing paths include, in particular, unimodal or bath-tub-shaped processes, provided that $r_0 < b_l$.

The case of a deterministic process $r$ is of special interest and is stated as a corollary under the assumptions of the last theorem.

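Corollary 5.20. If $(f_t)$ and $(g_t)$ are deterministic with inverse of the ratio $r^{-1}(x) = \inf\{t \in \mathbb{R}_+ : r_t = f_t/g_t \ge x\}$, $x \in \mathbb{R}$, and $X_0 \equiv 0$, then $\sigma = t^* \wedge T$ is optimal with $t^* = r^{-1}(K^*) \in \mathbb{R}_+ \cup \{\infty\}$ and

$$K^* = \inf\left\{x \in \mathbb{R} : \int_0^{r^{-1}(x)} (xg_s - f_s)P(T > s)\,ds \ge EZ_0\right\}.$$

If, in addition, $r$ is constant with $r_t \equiv r_0$ for all $t \in \mathbb{R}_+$, then

$$K^* = \frac{EZ_0}{EX_T} + r_0 \quad \text{and} \quad \sigma = T.$$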
Remark 5.21. The bounds for K ∗ in Lemma 5.14 are sharp in the following
sense. For constant rt ≡ r0 in the above corollary the upper and lower bounds
coincide.
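Corollary 5.20 turns the computation of $K^*$ into a one-dimensional root search, because the integral in the definition of $K^*$ is nondecreasing in $x$. A minimal sketch for the basic age replacement model ($g_s = 1$, $f_s = k\lambda(s)$, $EZ_0 = c$, $X_0 = 0$), assuming a Weibull lifetime with shape greater than 1 so that $r^{-1}$ has a closed form; it uses the identity $\int_0^t k\lambda(s)P(T > s)\,ds = kP(T \le t)$. The function name and parameters are illustrative.

```python
import math

def weibull_k_star(c, k, shape, scale, tol=1e-10):
    """Corollary 5.20 for age replacement with a hypothetical Weibull(shape > 1) lifetime:
    g_s = 1, f_s = k*lambda(s), X_0 = 0, EZ_0 = c. Returns (K*, t*) with t* = r^{-1}(K*)."""
    def survival(s):
        return math.exp(-((s / scale) ** shape))

    def inverse_ratio(x):
        # r^{-1}(x): first t with k*lambda(t) = k*(shape/scale)*(t/scale)^(shape-1) >= x
        return scale * (x * scale / (k * shape)) ** (1.0 / (shape - 1.0))

    def excess(x, steps=2000):
        # integral_0^{t_x} (x - k lambda(s)) P(T > s) ds - EZ_0
        #   = x * integral_0^{t_x} P(T > s) ds - k * P(T <= t_x) - c
        t_x = inverse_ratio(x)
        h = t_x / steps
        simpson = survival(0.0) + survival(t_x) + sum(
            (4 if i % 2 else 2) * survival(i * h) for i in range(1, steps))
        return x * simpson * h / 3.0 - k * (1.0 - survival(t_x)) - c

    lo, hi = 0.0, 1.0
    while excess(hi) < 0.0:              # enlarge the bracket until the sign changes
        lo, hi = hi, 2.0 * hi
    while hi - lo > tol:                 # excess(x) is nondecreasing in x: bisection
        mid = 0.5 * (lo + hi)
        if excess(mid) >= 0.0:
            hi = mid
        else:
            lo = mid
    return hi, inverse_ratio(hi)

k_star, t_star = weibull_k_star(c=1.0, k=4.0, shape=2.0, scale=1.0)
print(k_star, t_star)
```

Under these assumptions the result should agree with a direct minimization of $(c + kP(T \le t))/\int_0^t P(T > s)\,ds$ over $t$; with shape 2 and scale 1 the cost rate is $r(t) = 8t$, so $t^* = K^*/8$.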

5.2.3 Different Information Levels

As indicated in Sect. 3.2.4 in the context of the general lifetime model, the semimartingale set-up has the advantage of opening new fields of application. One of these is the aspect of partial information. In the framework of stochastic process theory, information is represented by a filtration, an increasing family of σ-fields. So it is natural to describe partial information by a family of smaller σ-fields. Let $\mathbb{A} = (\mathcal{A}_t)$ be a subfiltration of $\mathbb{F} = (\mathcal{F}_t)$, i.e., $\mathcal{A}_t \subset \mathcal{F}_t$ for all $t \in \mathbb{R}_+$. The σ-field $\mathcal{F}_t$ describes the complete information up to time $t$, and $\mathcal{A}_t$ can be regarded as the available partial information that allows us to observe versions of the conditional expectations $\hat Z_t = E[Z_t \mid \mathcal{A}_t]$ and $\hat X_t = E[X_t \mid \mathcal{A}_t]$, respectively. For all $\mathbb{A}$-stopping times $\tau$ it holds true that $EZ_\tau = E\hat Z_\tau$ and $EX_\tau = E\hat X_\tau$. So the problem of finding a stopping time $\sigma$ in the class $C_T^{\mathbb{A}}$ of $\mathbb{A}$-stopping times that minimizes $K_\tau = EZ_\tau/EX_\tau$ can be reduced to the ordinary stopping problem by the means developed in the last subsection if $\hat Z$ and $\hat X$ admit $\mathbb{A}$-SSM representations:

$$K_\sigma = \inf\left\{K_\tau = \frac{EZ_\tau}{EX_\tau} : \tau \in C_T^{\mathbb{A}}\right\} = \inf\left\{K_\tau = \frac{E\hat Z_\tau}{E\hat X_\tau} : \tau \in C_T^{\mathbb{A}}\right\}.$$

The projection theorem (Theorem 3.19, p. 69) yields:


If Z is an F-SSM with representation Z = (f, M ) and A is a subfiltration
of F, then Ẑt = E[Zt |At ] is an A-SSM with Ẑ = (fˆ, M̄ ), where fˆ is an
A-progressively measurable version of (E[ft |At ]) , t ∈ R+ , and M̄ is an A-
martingale.
Loosely speaking, if $f$ is the "density" of $Z$, we get the "density" $\hat f$ of $\hat Z$ simply as the conditional expectation with respect to the subfiltration $\mathbb{A}$. The idea is then to use the projection $\hat Z$ of $Z$ onto the $\mathbb{A}$-level and to apply the optimization technique described above to $\hat Z$. Of course, on the lower information level the cost minimum cannot decrease,

$$\inf\{K_\tau : \tau \in C_T^{\mathbb{A}}\} \ge \inf\{K_\tau : \tau \in C_T^{\mathbb{F}}\},$$

since all $\mathbb{A}$-stopping times are also $\mathbb{F}$-stopping times, and the question of to what extent the information level influences the cost minimum has to be investigated.

5.3 Applications
The general set-up to minimize the ratio of expectations allows for many special cases covering a variety of maintenance models. A few of these are presented in this section to show how the general approach can be exploited.

5.3.1 The Generalized Age Replacement Model

We first focus on the age replacement model with the long run average cost per unit time criterion: find $\sigma \in C_T^{\mathbb{F}}$ with

$$K^* = K_\sigma = \frac{EZ_\sigma}{EX_\sigma} = \inf\{K_\tau : \tau \in C_T^{\mathbb{F}}\},$$

where we now insert $Z_t = c + I(T \le t)$ and $X_t = t$, $t \in \mathbb{R}_+$. Without loss of generality the constant $k$, the penalty costs for replacements at failures introduced in Sect. 5.1.1, is set equal to 1. We will now make use of the general lifetime model described in detail in Sect. 3.2. This means that the indicator process $V_t = I(T \le t)$ is assumed to have an $\mathbb{F}$-SSM representation with a failure rate process $\lambda$:

$$V_t = I(T \le t) = \int_0^t I(T > s)\lambda_s\,ds + M_t.$$

We know then that $\lambda$ has nonnegative paths, $T$ is a totally inaccessible $\mathbb{F}$-stopping time, and $M$ is a uniformly integrable $\mathbb{F}$-martingale (cf. Definition 3.24 and Lemma 3.25, p. 72). With $\lambda_{\min} = q = \inf\{\lambda_t : 0 \le t < T(\omega), \ \omega \in \Omega\}$ we get from Lemma 5.14, p. 184, the bounds

$$b_l = \frac{c}{ET} + \lambda_{\min} \le K^* \le b_u = \frac{c + 1}{ET}.$$
Note that, in contrast to Example 5.15, p. 185, $\lambda$ may be a stochastic failure rate process. If the paths of $\lambda$ are $(b_l, b_u)$-increasing, then the SSMs $Z$ and $X$ meet the requirements of Theorem 5.18, p. 188, and it follows that

$$K^* = x^* = \inf\{x \in \mathbb{R} : xE\rho_x - EZ_{\rho_x} \ge 0\} \quad \text{and} \quad \sigma = \rho_{x^*},$$

where $\rho_x = \inf\{t \in \mathbb{R}_+ : \lambda_t \ge x\} \wedge T$. Consequently, if $\lambda$ is nondecreasing or bath-tub-shaped starting at $\lambda_0 < b_l$, we get this solution of the stopping problem. The optimal replacement time is a control-limit rule for the failure rate process $\lambda$.
To give an idea of how partial information influences this optimal solution, we resume the example of a two-component parallel system with i.i.d. random variables $X_i \sim \mathrm{Exp}(\alpha)$, $i = 1, 2$, which describe the component lifetimes (cf. Example 3.38, p. 79). Then the system lifetime is $T = X_1 \vee X_2$ with corresponding indicator process

$$V_t = I(T \le t) = \int_0^t I(T > s)\alpha(I(X_1 \le s) + I(X_2 \le s))\,ds + M_t = \int_0^t I(T > s)\lambda_s\,ds + M_t.$$

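Possible different information levels were described in detail in Sect. 3.2.4. We now restrict ourselves to four levels:

(a) The complete information level: $\mathbb{F} = (\mathcal{F}_t)$,

$$\mathcal{F}_t = \sigma(I(X_1 \le s), I(X_2 \le s), 0 \le s \le t),$$

with failure rate process $\lambda_t = \lambda_t^a = \alpha(I(X_1 \le t) + I(X_2 \le t))$.

(b) Information only about $T$ until $h > 0$, complete information after $h$: $\mathbb{A}^b = (\mathcal{A}_t^b)$,

$$\mathcal{A}_t^b = \begin{cases} \sigma(I(T \le s), 0 \le s \le t) & \text{if } 0 \le t < h, \\ \mathcal{F}_t & \text{if } t \ge h, \end{cases}$$

and failure rate process

$$\hat\lambda_t^b = E[\lambda_t \mid \mathcal{A}_t^b] = \begin{cases} 2\alpha(1 - (2 - e^{-\alpha t})^{-1}) & \text{if } 0 \le t < h, \\ \lambda_t & \text{if } t \ge h. \end{cases}$$

(c) Information about component lifetime $X_1$: $\mathbb{A}^c = (\mathcal{A}_t^c)$,

$$\mathcal{A}_t^c = \sigma(I(T \le s), I(X_1 \le s), 0 \le s \le t),$$

and failure rate process $\hat\lambda_t^c = E[\lambda_t \mid \mathcal{A}_t^c] = \alpha(I(X_1 \le t) + I(X_1 > t)P(X_2 \le t))$.

(d) Information only about $T$: $\mathbb{A}^d = (\mathcal{A}_t^d)$, $\mathcal{A}_t^d = \sigma(I(T \le s), 0 \le s \le t)$, and failure rate (process) $\hat\lambda_t^d = E[\lambda_t \mid \mathcal{A}_t^d] = 2\alpha(1 - (2 - e^{-\alpha t})^{-1})$.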
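By the projection theorem, the failure rate on level (d) is the conditional expectation of $\lambda_t$ given the observed history, which on $\{T > t\}$ is just $E[\lambda_t \mid T > t]$. The Monte Carlo sketch below (illustrative only; seed and sample size are arbitrary) compares this conditional mean with the closed form given for level (d):

```python
import math
import random

def projected_rate_mc(t, alpha, n=200_000, seed=3):
    """Monte Carlo estimate of E[lambda_t | T > t] for two i.i.d. Exp(alpha) components
    in parallel, where lambda_t = alpha * (I(X1 <= t) + I(X2 <= t)) and T = X1 ∨ X2."""
    rng = random.Random(seed)
    total, survivors = 0.0, 0
    for _ in range(n):
        x1, x2 = rng.expovariate(alpha), rng.expovariate(alpha)
        if max(x1, x2) > t:                              # condition on {T > t}
            total += alpha * ((x1 <= t) + (x2 <= t))
            survivors += 1
    return total / survivors

def level_d_rate(t, alpha):
    """hat lambda^d_t = 2 alpha (1 - (2 - e^{-alpha t})^{-1}) from level (d)."""
    return 2.0 * alpha * (1.0 - 1.0 / (2.0 - math.exp(-alpha * t)))

for t in (0.5, 1.0, 2.0):
    print(t, round(level_d_rate(t, 1.0), 4), round(projected_rate_mc(t, 1.0), 4))
```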
In all four cases the bounds remain the same with $ET = \frac{3}{2\alpha}$:

$$b_l = \frac{2\alpha}{3}c, \qquad b_u = \frac{2\alpha}{3}(c + 1).$$
Since $\mathbb{A}^b$ and $\mathbb{A}^c$ are subfiltrations of $\mathbb{F}$ and include $\mathbb{A}^d$ as a subfiltration, we must have for the optimal stopping values

$$b_l \le K_a^* \le K_b^* \le K_d^* \le b_u, \qquad K_a^* \le K_c^* \le K_d^*,$$

i.e., on a higher information level we can achieve a lower cost minimum. Let us consider the complete information case in more detail. The failure rate process is nondecreasing and the assumptions of Theorem 5.18, p. 188, are met. For the stopping times $\rho_x = \inf\{t \in \mathbb{R}_+ : \lambda_t \ge x\} \wedge T$ we have to consider values of $x$ in $[b_l, b_u]$ and to distinguish between the cases $0 < x \le \alpha$ and $x > \alpha$:
• $0 < x \le \alpha$. In this case we have $\rho_x = X_1 \wedge X_2$, $E\rho_x = \frac{1}{2\alpha}$, $EZ_{\rho_x} = c$, so that $xE\rho_x - EZ_{\rho_x} = 0$ leads to $x^* = 2\alpha c$, where $0 < x^* \le \alpha$ is equivalent to $c \le \frac12$;
• $\alpha < x$. In this case we have $\rho_x = T$, $E\rho_x = \frac{3}{2\alpha}$, $EZ_{\rho_x} = c + 1$, so that $x^* = b_u$, where $x^* > \alpha$ is equivalent to $c > \frac12$.
The other information levels are treated in a similar way. Only case (b) needs some special attention, because the failure rate process $\hat\lambda^b$ is no longer monotone but only piecewise nondecreasing. To meet the $(b_l, b_u)$-increasing condition, we must have $\hat\lambda_h^b < b_l$, i.e., $2\alpha(1 - (2 - e^{-\alpha h})^{-1}) < \frac{2\alpha}{3}c$. This inequality holds for all $h \in \mathbb{R}_+$ if $c \ge \frac32$, and for

$$h < h(\alpha, c) = -\frac{1}{\alpha}\ln\frac{3 - 2c}{3 - c}$$

if $0 < c < \frac32$.

We summarize these considerations in the following proposition, the proof of which follows the lines above and is elementary but not straightforward.

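Proposition 5.22. For $0 < c \le \frac12$ the optimal stopping times and values $K^*$ are

a) $K_a^* = 2\alpha c$, $\sigma_a = X_1 \wedge X_2$;

b) $K_b^* = \alpha\dfrac{c + (1 - e^{-\alpha h})^2}{0.5 + (1 - e^{-\alpha h})^2}$, $\sigma_b = ((X_1 \wedge X_2) \vee h) \wedge T$, if $0 < h < h(\alpha, c)$;

c) $K_c^* = \alpha\sqrt{2c}$, $\sigma_c = X_1 \wedge \left(-\frac{1}{\alpha}\ln\left(1 - \sqrt{2c}\right)\right)$;

d) $K_d^* = 2\alpha\left(\sqrt{\frac{c^2}{4} + c} - \frac{c}{2}\right)$, $\sigma_d = T \wedge \left(-\frac{1}{\alpha}\ln\left(1 - \frac{c}{2} - \sqrt{\frac{c^2}{4} + c}\right)\right)$.

For $c > \frac12$ we have $K^* = b_u$ and $\sigma = T$ on all levels.

For decreasing $c$ the differences between the cost minima increase. If the costs $c$ for a preventive replacement are greater than half of the penalty costs, i.e., $c > \frac12 k = \frac12$, then extra information and preventive replacements are not
profitable.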
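The closed forms of Proposition 5.22 can be evaluated directly, and the ordering of the cost minima stated earlier can be checked numerically. The sketch below does this for one illustrative parameter choice ($\alpha = 1$, $c = 0.2$, $h$ equal to half of $h(\alpha, c)$) and also simulates the level-(b) rule $\sigma_b$, estimating the renewal-reward ratio $EZ_{\sigma_b}/E\sigma_b$ with penalty $k = 1$. All names are hypothetical.

```python
import math
import random

def proposition_5_22(alpha, c, h):
    """Closed-form minima K_a*, K_b*, K_c*, K_d* of Proposition 5.22 (k = 1, 0 < c <= 1/2)."""
    u2 = (1.0 - math.exp(-alpha * h)) ** 2
    k_a = 2.0 * alpha * c
    k_b = alpha * (c + u2) / (0.5 + u2)
    k_c = alpha * math.sqrt(2.0 * c)
    k_d = 2.0 * alpha * (math.sqrt(c * c / 4.0 + c) - c / 2.0)
    return k_a, k_b, k_c, k_d

def h_limit(alpha, c):
    """h(alpha, c) = -(1/alpha) ln((3 - 2c)/(3 - c)), valid for 0 < c < 3/2."""
    return -math.log((3.0 - 2.0 * c) / (3.0 - c)) / alpha

def simulate_level_b(alpha, c, h, n=200_000, seed=11):
    """Monte Carlo ratio E Z / E sigma_b for sigma_b = ((X1 ∧ X2) ∨ h) ∧ T, T = X1 ∨ X2."""
    rng = random.Random(seed)
    cost = length = 0.0
    for _ in range(n):
        x1, x2 = rng.expovariate(alpha), rng.expovariate(alpha)
        life = max(x1, x2)
        stop = min(max(min(x1, x2), h), life)
        cost += c + (1.0 if stop == life else 0.0)   # penalty k = 1 if stopped by failure
        length += stop
    return cost / length

alpha, c = 1.0, 0.2
h = 0.5 * h_limit(alpha, c)
k_a, k_b, k_c, k_d = proposition_5_22(alpha, c, h)
b_l, b_u = 2.0 * alpha * c / 3.0, 2.0 * alpha * (c + 1.0) / 3.0
print(b_l, k_a, k_b, k_c, k_d, b_u)
print(simulate_level_b(alpha, c, h))     # should be close to k_b
```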

5.3.2 A Shock Model of Threshold Type


In the shock model of Example 5.16, p. 185, the shock arrivals were described by a marked point process $(T_n, V_n)$, where at time $T_n$ a shock causing damage of amount $V_n$ occurs. Here we assume that $(T_n)$ and $(V_n)$ are independent and that $(V_n)$ forms an i.i.d. sequence of nonnegative random variables with $V_n \sim F$. As usual, $N_t = \sum_{n=1}^\infty I(T_n \le t)$ counts the number of shocks until $t$, and

$$R_t = \sum_{n=1}^{N_t} V_n$$

describes the accumulated damage up to time $t$. In the threshold-type model, the lifetime $T$ is given by

$$T = \inf\{t \in \mathbb{R}_+ : R_t \ge S\}, \quad S > 0.$$
Now $\mathbb{F}$ is the internal history generated by $(T_n, V_n)$ and $(\lambda_t)$ the $\mathbb{F}$-intensity of $(N_t)$. The cost of a preventive replacement is $c > 0$ and that of a replacement at failure is $c + k$, $k > 0$, which results in a cost process $Z_t = c + kI(T \le t)$. The aim is to minimize the expected cost per arriving shock in the long run, i.e., to find $\sigma \in C_T^{\mathbb{F}}$ with

$$K^* = K_\sigma = \inf\left\{K_\tau = \frac{EZ_\tau}{EX_\tau} : \tau \in C_T^{\mathbb{F}}\right\},$$

where $X_t = N_t$. The only assumption concerning the shock arrival process is that the intensity $\lambda$ is positive: $\lambda_t > 0$ on $[0, T)$. According to Example 5.16 and Sect. 3.3.3 we have the following SSM representations:

$$Z_t = c + \int_0^t I(T > s)k\lambda_s\bar F((S - R_s)-)\,ds + M_t, \qquad X_t = \int_0^t \lambda_s\,ds + L_t.$$

Then the cost rate process $r$ is given on $[0, T)$ by $r_t = k\bar F((S - R_t)-)$, which is obviously nondecreasing. Under the integrability assumptions of Theorem 5.18, p. 188, we see that the optimal stopping time is $\sigma = \rho_{x^*} = \inf\{t \in \mathbb{R}_+ : r_t \ge x^*\} \wedge T$, where the level $x^* = \inf\{x \in \mathbb{R} : xEX_{\rho_x} - EZ_{\rho_x} \ge 0\} = K^*$ has to be found numerically. Thus the optimal stopping time is a control-limit rule for the process $(R_t)$: replace the system the first time the accumulated damage hits a certain control limit.

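Example 5.23. Under the above assumptions let $(N_t)$ be a point process with positive intensity $(\lambda_s)$ and $V_n \sim \mathrm{Exp}(\nu)$. Then we get with $\bar F(x) = \exp\{-\nu x\}$ and $EX_T = E[\inf\{n \in \mathbb{N} : \sum_{i=1}^n V_i \ge S\}] = \nu S + 1$ the bounds

$$b_l = \frac{c}{\nu S + 1} + ke^{-\nu S}, \qquad b_u = \frac{c + k}{\nu S + 1},$$

and the control-limit rules

$$\rho_x = \inf\{t \in \mathbb{R}_+ : k\exp\{-\nu(S - R_t)\} \ge x\} \wedge T = \inf\left\{t \in \mathbb{R}_+ : R_t \ge \frac{1}{\nu}\ln\frac{x}{k} + S\right\} \wedge T.$$

We set $g(x) = \frac{1}{\nu}\ln\frac{x}{k} + S$ and observe that $\rho_x = \inf\{t \in \mathbb{R}_+ : R_t \ge g(x)\}$ if $0 < x \le k$. For such values of $x$ we find

$$EX_{\rho_x} = \nu g(x) + 1, \qquad EZ_{\rho_x} = c + kP(T = \rho_x) = c + ke^{-\nu(S - g(x))} = c + x.$$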

The probability $P(T = \rho_x)$ is just the probability that a Poisson process with rate $\nu$ has no event in the interval $[g(x), S]$, which equals $e^{-\nu(S - g(x))}$. With these quantities the optimal control limit $x^* = K^*$ is the unique solution of

$$x^* = \frac{c + x^*}{\nu g(x^*) + 1},$$

provided that $b_l \le x^* \le b_u$. As expected, this solution does not depend on the
specific intensity of the shock arrival process.
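The fixed-point equation of Example 5.23 is equivalent to $x(\ln(x/k) + \nu S) = c$, whose left-hand side is increasing on $[b_l, b_u]$ (its derivative is $\ln(x/k) + \nu S + 1 > 0$ there), so bisection suffices. The sketch below assumes $b_u \le k$, so that $g(x) \le S$ on the search interval, and uses hypothetical parameter values. The simulation draws only damages, never arrival times, which reflects that the answer does not depend on the intensity.

```python
import math
import random

def control_limit(c, k, nu, S, tol=1e-12):
    """x* of Example 5.23: x = (c + x)/(nu g(x) + 1) with g(x) = ln(x/k)/nu + S,
    i.e. x (ln(x/k) + nu S) = c, solved by bisection on [b_l, b_u]."""
    b_l = c / (nu * S + 1.0) + k * math.exp(-nu * S)
    b_u = (c + k) / (nu * S + 1.0)

    def h(x):
        return x * (math.log(x / k) + nu * S) - c

    if h(b_l) > 0.0 or h(b_u) < 0.0:
        raise ValueError("x* is not in [b_l, b_u]; the control-limit solution does not apply")
    lo, hi = b_l, b_u
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h(mid) < 0.0 else (lo, mid)
    return 0.5 * (lo + hi)

def simulate_cost_per_shock(x, c, k, nu, S, runs=100_000, seed=5):
    """Monte Carlo E Z / E X for the rule 'replace when R >= g(x)' (illustrative check)."""
    rng = random.Random(seed)
    limit = math.log(x / k) / nu + S
    cost = shocks = 0.0
    for _ in range(runs):
        damage, count = 0.0, 0
        while damage < limit:
            damage += rng.expovariate(nu)
            count += 1
        cost += c + (k if damage >= S else 0.0)   # failure if the crossing shock reaches S
        shocks += count
    return cost / shocks

x_star = control_limit(c=1.0, k=5.0, nu=1.0, S=2.0)
print(x_star, simulate_cost_per_shock(x_star, 1.0, 5.0, 1.0, 2.0))
```

Since $K_{\rho_x} = (c + x)/(\nu g(x) + 1)$ attains its minimum exactly at $x^*$, a grid search over $x$ provides an independent check.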

5.3.3 Information-Based Replacement of Complex Systems

In this section the basic lifetime model for complex systems is combined with
the possibility of preventive replacements. A system with random lifetime T >
0 is replaced by a new equivalent one after failure. A preventive replacement
can be carried out before failure. There are costs for each replacement and an
additional amount has to be paid for replacements after failures. The aim is to
determine an optimal replacement policy with respect to some cost criterion.

Several cost criteria are known among which the long run average cost per unit
time criterion is by far the most popular one. But the general optimization
procedure also allows for other criteria. As an example, the total expected discounted cost criterion will be applied in this section. We will also consider the possibility of taking different information levels into account. This set-up will
be applied to complex monotone systems for which in Sect. 3.2 some examples
of various degrees of observation levels were given. For the special case of a two-
component parallel system with dependent component lifetimes, it is shown
how the optimal replacement policy depends on the different information levels
and on the degree of dependence of the component lifetimes.
Consider a monotone system with random lifetime $T > 0$ with an $\mathbb{F}$-semimartingale representation

$$I(T \le t) = \int_0^t I(T > s)\lambda_s\,ds + M_t \tag{5.12}$$

for some filtration $\mathbb{F}$. When the system fails, it is immediately replaced by an identical one and the process repeats itself. A preventive replacement can be carried out before failure. Each replacement incurs a cost of $c > 0$ and each failure adds a penalty cost $k > 0$. The problem is to find a replacement (stopping) time that minimizes the total expected discounted costs.
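Let $\alpha > 0$ be the discount rate and $(Z_\tau, \tau), (Z_{\tau_1}, \tau_1), (Z_{\tau_2}, \tau_2), \ldots$ a sequence of i.i.d. pairs of positive random variables, where $\tau_i$ represents the replacement age of the $i$th implemented system, i.e., the length of the $i$th cycle, and $Z_{\tau_i}$ describes the costs incurred during the $i$th cycle discounted to the beginning of the cycle. Then the total expected discounted costs are

$$K_\tau = E\left[Z_{\tau_1} + e^{-\alpha\tau_1}Z_{\tau_2} + e^{-\alpha(\tau_1 + \tau_2)}Z_{\tau_3} + \cdots\right] = \frac{EZ_\tau}{E[1 - e^{-\alpha\tau}]},$$

since, by independence of the cycles, the $n$th term has expectation $EZ_\tau\,(Ee^{-\alpha\tau})^{n-1}$ and the geometric series sums to $EZ_\tau/(1 - Ee^{-\alpha\tau})$. Thus $K_\tau$ is the ratio of the expected discounted costs for one cycle and $E[1 - e^{-\alpha\tau}]$. Again the set of admissible stopping (replacement) times less than or equal to $T$ is

$$C_T^{\mathbb{F}} = \{\tau : \tau \text{ is an } \mathbb{F}\text{-stopping time}, \ \tau \le T, \ EZ_\tau^- < \infty\}.$$

The stopping problem is to find a stopping time $\sigma \in C_T^{\mathbb{F}}$ with

$$K^* = K_\sigma = \inf\{K_\tau : \tau \in C_T^{\mathbb{F}}\}. \tag{5.13}$$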
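The identity $K_\tau = EZ_\tau/E[1 - e^{-\alpha\tau}]$ can be checked by simulating many replacement cycles. The sketch assumes an exponential lifetime $T \sim \mathrm{Exp}(\mu)$ and a fixed replacement age $t$, chosen only because $Ee^{-\alpha\tau}$ and $EZ_\tau$ then have closed forms. With a constant failure rate, preventive replacement is not worthwhile (the cost rate is constant, so $\sigma = T$ by Corollary 5.20); this is purely a check of the formula, with illustrative parameters.

```python
import math
import random

def discounted_cost(t, c, k, mu, alpha):
    """K_tau = E Z_tau / E[1 - exp(-alpha tau)] for tau = t ∧ T with T ~ Exp(mu)."""
    decay = 1.0 - math.exp(-(alpha + mu) * t)
    e_disc = mu / (alpha + mu) * decay + math.exp(-(alpha + mu) * t)   # E exp(-alpha tau)
    e_cost = c * e_disc + k * mu / (alpha + mu) * decay                # E Z_tau
    return e_cost / (1.0 - e_disc)

def simulated_discounted_cost(t, c, k, mu, alpha, runs=10_000, seed=2, eps=1e-10):
    """Average of Z_tau1 + e^{-alpha tau1} Z_tau2 + ... over independent histories."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        discount = 1.0                     # exp(-alpha * (tau_1 + ... + tau_{i-1}))
        while discount > eps:              # truncate once remaining costs are negligible
            life = rng.expovariate(mu)
            tau = min(t, life)
            z = (c + (k if life <= t else 0.0)) * math.exp(-alpha * tau)
            total += discount * z
            discount *= math.exp(-alpha * tau)
    return total / runs

# Illustrative parameters: replacement age t = 1, c = 1, k = 4, mu = 0.5, alpha = 0.5.
print(discounted_cost(1.0, 1.0, 4.0, 0.5, 0.5),
      simulated_discounted_cost(1.0, 1.0, 4.0, 0.5, 0.5))
```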
Stopping at a fixed time $t$ leads to the following costs for one cycle discounted to the beginning of the cycle:

$$Z_t = (c + kI(T \le t))e^{-\alpha t}, \quad t \in \mathbb{R}_+.$$

Starting from (5.12), such a semimartingale representation can also be obtained for $Z = (Z_t)$, $t \in \mathbb{R}_+$, by using the product rule for "differentiating" semimartingales introduced in Sect. 3.1.2. Then Theorem A.51, p. 269, can be applied to yield for $t \in [0, T]$:

$$Z_t = c + \int_0^t I(T > s)\alpha e^{-\alpha s}\left(-c + \frac{k}{\alpha}\lambda_s\right)ds + R_t = c + \int_0^t I(T > s)\alpha e^{-\alpha s}r_s\,ds + R_t, \tag{5.14}$$

where $r_s = \alpha^{-1}(-\alpha c + \lambda_s k)$ is a cost rate and $R = (R_t)$, $t \in \mathbb{R}_+$, is a uniformly integrable $\mathbb{F}$-martingale. Since $X_t = 1 - e^{-\alpha t} = \int_0^t \alpha e^{-\alpha s}\,ds$, the ratio of the "derivatives" of the two semimartingales $Z$ and $X$ is given by $(r_t)$.
We now consider a monotone system with random component lifetimes $T_i > 0$, $i = 1, 2, \ldots, n$, $n \in \mathbb{N}$, and structure function $\Phi : \{0, 1\}^n \to \{0, 1\}$ as introduced in Chap. 2. The system lifetime $T$ is given by $T = \inf\{t \in \mathbb{R}_+ : \Phi_t = 0\}$, where the vector process $(X_t)$ describes the state of the components and $\Phi_t = \Phi(X_t) = I(T > t)$ indicates the state of the system at time $t$. If the random variables $T_i$ are independent with (ordinary) failure rates $\lambda_t(i)$ and $\mathbb{F} = (\mathcal{F}_t)$ is the (complete information) filtration generated by $X$, $\mathcal{F}_t = \sigma(X_s, 0 \le s \le t)$, then Corollary 3.30 in Sect. 3.2.2 yields the following semimartingale representation for $\Phi_t$:

$$1 - \Phi_t = \int_0^t I(T > s)\lambda_s\,ds + M_t, \qquad \lambda_t = \sum_{i=1}^n (\Phi(1_i, X_t) - \Phi(0_i, X_t))\lambda_t(i).$$

To find the minimum $K^*$ we proceed as before. First of all, bounds $b_l$ and $b_u$ for $K^*$ are determined by means of $q = \inf\{r_t : 0 \le t < T(\omega), \ \omega \in \Omega\}$, the minimum of the cost rate, with $q \ge -c$:

$$b_l = \frac{c}{E[1 - e^{-\alpha T}]} + q \le K^* \le b_u = \frac{E[(c + k)e^{-\alpha T}]}{E[1 - e^{-\alpha T}]}. \tag{5.15}$$

If all failure rates $\lambda_t(i)$ are of IFR type, then the $\mathbb{F}$-failure rate process $\lambda$ and the ratio process $r$ are nondecreasing. Therefore, Theorem 5.18, p. 188, can be applied to yield $\sigma = \rho_{x^*}$. So the optimal stopping time is among the control-limit rules

$$\rho_x = \inf\{t \in \mathbb{R}_+ : r_t \ge x\} \wedge T = \inf\left\{t \in \mathbb{R}_+ : \lambda_t \ge \frac{\alpha}{k}(c + x)\right\} \wedge T.$$

This means: replace the system the first time the sum of the failure rates of critical components reaches the level $\frac{\alpha}{k}(c + x^*)$. The value $x^*$ has to be determined as

$$x^* = \inf\{x \in \mathbb{R} : xE[1 - e^{-\alpha\rho_x}] - E[(c + kI(T = \rho_x))e^{-\alpha\rho_x}] \ge 0\}.$$
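For a single component ($n = 1$) with deterministic IFR failure rate, the control-limit rule reduces to a replacement age $t_x$ with $\lambda(t_x) = \alpha(c + x)/k$, and $x^*$ can be found by bisection. The sketch below assumes a Weibull lifetime with shape 2, so that $\lambda(s) = 2s/\text{scale}^2$, $\lambda(0) = 0$ and $q = -c$, together with illustrative parameters. It uses $E[1 - e^{-\alpha(t \wedge T)}] = \int_0^t \alpha e^{-\alpha s}P(T > s)\,ds$ and $E[e^{-\alpha T}I(T \le t)] = \int_0^t e^{-\alpha s}\lambda(s)P(T > s)\,ds$.

```python
import math

def simpson(func, upper, steps=1000):
    """Composite Simpson rule for integral_0^upper func(s) ds (steps must be even)."""
    if upper <= 0.0:
        return 0.0
    h = upper / steps
    acc = func(0.0) + func(upper)
    acc += sum((4 if i % 2 else 2) * func(i * h) for i in range(1, steps))
    return acc * h / 3.0

def cycle_integrals(t, alpha, scale):
    """A(t) = E[1 - exp(-alpha (t ∧ T))] and B(t) = E[exp(-alpha T) I(T <= t)] for a
    single component with a hypothetical Weibull(shape 2, scale) lifetime."""
    def surv(s):
        return math.exp(-((s / scale) ** 2))

    def rate(s):                          # failure rate lambda(s) = 2 s / scale^2
        return 2.0 * s / scale ** 2

    a = simpson(lambda s: alpha * math.exp(-alpha * s) * surv(s), t)
    b = simpson(lambda s: math.exp(-alpha * s) * rate(s) * surv(s), t)
    return a, b

def discounted_ratio(t, c, k, alpha, scale):
    """K for tau = t ∧ T: E[(c + k I(T <= tau)) e^{-alpha tau}] / A(t) = (c (1 - A) + k B) / A."""
    a, b = cycle_integrals(t, alpha, scale)
    return (c * (1.0 - a) + k * b) / a

def discounted_control_limit(c, k, alpha, scale, tol=1e-8):
    """Bisection for x* = inf{x : x A(t_x) - (c (1 - A(t_x)) + k B(t_x)) >= 0},
    where rho_x = t_x ∧ T and lambda(t_x) = alpha (c + x) / k."""
    def t_of(x):
        return scale ** 2 * alpha * (c + x) / (2.0 * k)

    def v(x):
        a, b = cycle_integrals(t_of(x), alpha, scale)
        return x * a - (c * (1.0 - a) + k * b)

    lo, hi = -c, 1.0                      # q = -c here because lambda(0) = 0
    while v(hi) < 0.0:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if v(mid) >= 0.0 else (mid, hi)
    return hi, t_of(hi)

x_star, t_star = discounted_control_limit(c=1.0, k=4.0, alpha=0.1, scale=1.0)
print(f"x* = {x_star:.4f}, replace at age t* = {t_star:.4f}")
```

As a cross-check, a grid search of $K_{t \wedge T}$ over fixed ages $t$ should give approximately the same minimum.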



The effect of partial information is considered in the following only for the case that no single component or only some of the $n$ components are observed, say those with index in a subset $\{i_1, i_2, \ldots, i_r\} \subset \{1, 2, \ldots, n\}$, $r \le n$. Then the subfiltration $\mathbb{A}$ is generated by $T$, or by $T$ and the corresponding component lifetimes, respectively. The projection theorem yields a representation on the corresponding observation level:

$$1 - \hat\Phi_t = E[I(T \le t) \mid \mathcal{A}_t] = I(T \le t) = \int_0^t I(T > s)\hat\lambda_s\,ds + \bar M_t.$$

If the $\mathbb{A}$-failure rate process $\hat\lambda_t = E[\lambda_t \mid \mathcal{A}_t]$ is $(b_l, b_u)$-increasing, then the stopping problem can also be solved on the lower information level by means of Theorem 5.18. We carry this out in more detail in the next section, allowing also for dependencies between the component lifetimes. To keep the complexity of the calculations at a manageable level, we confine ourselves to a two-component parallel system.

5.3.4 A Parallel System with Two Dependent Components

A two-component parallel system is considered now to demonstrate how the optimal replacement rule can be determined explicitly. It is assumed that the component lifetimes $T_1$ and $T_2$ follow a bivariate exponential distribution. There are many multivariate extensions of the univariate exponential distribution, but it seems that only a few models, like those of Freund [68] and Marshall and Olkin [121], are physically motivated.
The idea behind Freund’s model is that after failure of one component the stress placed on the surviving component is changed. As long as both components work, the lifetimes follow independent exponential distributions with parameters $\beta_1$ and $\beta_2$. When one of the components fails, the parameter of the surviving component is switched to $\bar\beta_1$ or $\bar\beta_2$, respectively.
Marshall and Olkin proposed a bivariate exponential distribution for a
two-component system where the components are subjected to shocks. The
components may fail separately or both at the same time due to such shocks.
This model includes the possibility of a common cause of failure that destroys
the whole system at once.
As a combination of these two models the following bivariate distribution can be derived. Let the pair $(Y_1, Y_2)$ of random variables be distributed according to the model of Freund and let $Y_{12}$ be another positive random variable, independent of $Y_1$ and $Y_2$, exponentially distributed with parameter $\beta_{12}$. Then $(T_1, T_2)$ with $T_1 = Y_1 \wedge Y_{12}$, $T_2 = Y_2 \wedge Y_{12}$ is said to follow a combined exponential distribution. For brevity the notation $\gamma_i = \beta_1 + \beta_2 - \bar\beta_i$, $i \in \{1, 2\}$, and $\beta = \beta_1 + \beta_2 + \beta_{12}$ is introduced. The survival function

$$\bar F(x,y) = P(T_1 > x, T_2 > y) = P(Y_1 > x, Y_2 > y)\,P(Y_{12} > x \vee y)$$
198 5 Maintenance Optimization

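is then given by

$$\bar F(x,y) = \begin{cases} \dfrac{\beta_1}{\gamma_2}\, e^{-\gamma_2 x - (\bar\beta_2 + \beta_{12}) y} - \dfrac{\bar\beta_2 - \beta_2}{\gamma_2}\, e^{-\beta y} & \text{for } x \le y,\\[2ex] \dfrac{\beta_2}{\gamma_1}\, e^{-\gamma_1 y - (\bar\beta_1 + \beta_{12}) x} - \dfrac{\bar\beta_1 - \beta_1}{\gamma_1}\, e^{-\beta x} & \text{for } x > y, \end{cases} \tag{5.16}$$

where here and in the following $\gamma_i \neq 0$, $i \in \{1, 2\}$, is assumed. For $\beta_i = \bar\beta_i$ this formula reduces to the Marshall–Olkin distribution, and for $\beta_{12} = 0$, (5.16) gives the Freund distribution. From (5.16) the distribution H of the system lifetime $T = T_1 \vee T_2$ can be obtained:

$$\begin{aligned} H(t) = P(T \le t) &= P(T_1 \le t, T_2 \le t)\\ &= 1 - \frac{\beta_2}{\gamma_1} e^{-(\bar\beta_1 + \beta_{12})t} - \frac{\beta_1}{\gamma_2} e^{-(\bar\beta_2 + \beta_{12})t} + \frac{\beta_1\bar\beta_2 + \beta_2\bar\beta_1 - \bar\beta_1\bar\beta_2}{\gamma_1\gamma_2}\, e^{-\beta t}. \end{aligned} \tag{5.17}$$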
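As an illustrative cross-check of (5.16) and (5.17), not part of the original derivation, the following Python sketch samples (T1, T2) from the combined model exactly as constructed above: two independent exponential clocks for the Freund pair, a switch of the survivor to rate β̄1 or β̄2 after the first failure (using memorylessness), and an independent common shock Y12. It compares the empirical distribution of T = T1 ∨ T2 with H(t) and checks (5.17) against the inclusion-exclusion identity H(t) = 1 − F̄(t, 0) − F̄(0, t) + F̄(t, t). Function names, the seed and the sample size are arbitrary choices; the parameter values are those of Table 5.1.

```python
import math
import random


def survival(x, y, b1, b2, bb1, bb2, b12):
    """Joint survival function F-bar(x, y) of (5.16); assumes gamma_1, gamma_2 != 0."""
    g1, g2 = b1 + b2 - bb1, b1 + b2 - bb2
    beta = b1 + b2 + b12
    if x <= y:
        return b1 / g2 * math.exp(-g2 * x - (bb2 + b12) * y) - (bb2 - b2) / g2 * math.exp(-beta * y)
    return b2 / g1 * math.exp(-g1 * y - (bb1 + b12) * x) - (bb1 - b1) / g1 * math.exp(-beta * x)


def system_cdf(t, b1, b2, bb1, bb2, b12):
    """H(t) = P(T1 <= t, T2 <= t) from (5.17)."""
    g1, g2 = b1 + b2 - bb1, b1 + b2 - bb2
    beta = b1 + b2 + b12
    d = (b1 * bb2 + b2 * bb1 - bb1 * bb2) / (g1 * g2)
    return (1 - b2 / g1 * math.exp(-(bb1 + b12) * t)
            - b1 / g2 * math.exp(-(bb2 + b12) * t)
            + d * math.exp(-beta * t))


def sample_pair(b1, b2, bb1, bb2, b12, rng):
    """Draw (T1, T2) = (Y1 ^ Y12, Y2 ^ Y12) with (Y1, Y2) from Freund's model."""
    x1, x2 = rng.expovariate(b1), rng.expovariate(b2)
    if x1 < x2:   # component 1 fails first; component 2 continues at rate bb2
        y1, y2 = x1, x1 + rng.expovariate(bb2)
    else:         # component 2 fails first; component 1 continues at rate bb1
        y1, y2 = x2 + rng.expovariate(bb1), x2
    y12 = rng.expovariate(b12) if b12 > 0 else math.inf
    return min(y1, y12), min(y2, y12)


def empirical_cdf(t, params, n=100_000, seed=1):
    """Monte Carlo estimate of P(T1 v T2 <= t) for the parallel system."""
    rng = random.Random(seed)
    hits = sum(max(sample_pair(*params, rng)) <= t for _ in range(n))
    return hits / n


PARAMS = (1.0, 3.0, 1.5, 3.5, 0.5)   # beta1, beta2, bar-beta1, bar-beta2, beta12 (Table 5.1)
```

With these parameters, empirical_cdf(0.5, PARAMS) should agree with system_cdf(0.5, *PARAMS) up to Monte Carlo error of order 1/sqrt(n).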
The optimization problem will be solved for three different information levels:
• Complete information about T1, T2 (and T). The corresponding filtration F is generated by both component lifetimes:
$$\mathcal F_t = \sigma\bigl(I(T_1 \le s), I(T_2 \le s), 0 \le s \le t\bigr), \quad t \in \mathbb R_+.$$
• Information about T1 and T. The corresponding filtration A is generated by one component lifetime, say T1, and the system lifetime:
$$\mathcal A_t = \sigma\bigl(I(T_1 \le s), I(T \le s), 0 \le s \le t\bigr), \quad t \in \mathbb R_+.$$
• Information about T. The filtration generated by T is denoted by B:
$$\mathcal B_t = \sigma\bigl(I(T \le s), 0 \le s \le t\bigr), \quad t \in \mathbb R_+.$$

In the following it is assumed that βi ≤ β̄i , i ∈ {1, 2}, and β̄1 ≤ β̄2 ,
i.e., after failure of one component the stress placed on the surviving one is
increased. Without loss of generality the penalty costs for replacements after
failures are set to k = 1. The solution of the stopping problem will be outlined
in the following. More details are contained in [84].

5.3.5 Complete Information About T1 , T2 and T

The failure rate process λ on the F-observation level is given by (cf. Example 3.27, p. 74)

$$\lambda_t = \beta_{12} + \bar\beta_2 I(T_1 < t < T_2) + \bar\beta_1 I(T_2 < t < T_1).$$

Inserting $q = -c + \beta_{12}\alpha^{-1}$ in (5.15), we get the bounds for the stopping value $K^*$
5.3 Applications 199

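$$b_l = \frac{cv}{1-v} + \frac{\beta_{12}}{\alpha} \quad\text{and}\quad b_u = \frac{(c+1)v}{1-v},$$

where $v = E[e^{-\alpha T}]$ can be determined by means of the distribution H. Since the failure rate process is monotone on $[0, T)$, the optimal stopping time can be found among the control limit rules $\rho_x = \inf\{t \in \mathbb R_+ : r_t \ge x\} \wedge T$:

$$\rho_x = \begin{cases} 0 & \text{for } x \le \frac{\beta_{12}}{\alpha} - c,\\ T_1 \wedge T_2 & \text{for } \frac{\beta_{12}}{\alpha} - c < x \le \frac{\bar\beta_1 + \beta_{12}}{\alpha} - c,\\ T_1 & \text{for } \frac{\bar\beta_1 + \beta_{12}}{\alpha} - c < x \le \frac{\bar\beta_2 + \beta_{12}}{\alpha} - c,\\ T & \text{for } x > \frac{\bar\beta_2 + \beta_{12}}{\alpha} - c. \end{cases}$$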

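The optimal control limit $x^*$ is the solution of the equation

$$x\,E[1 - e^{-\alpha\rho_x}] - EZ_{\rho_x} = 0.$$

Since the optimal value $x^*$ lies between the bounds $b_l$ and $b_u$, the considerations can be restricted to the cases $x \ge b_l > \beta_{12}\alpha^{-1} - c$. In the first case, when $\beta_{12}\alpha^{-1} - c < x \le (\bar\beta_1 + \beta_{12})\alpha^{-1} - c$, one has $\rho_x = T_1 \wedge T_2$ and

$$E[1 - e^{-\alpha\rho_x}] = \frac{\alpha}{\beta + \alpha}, \qquad EZ_{\rho_x} = c\,E[e^{-\alpha\rho_x}] + E[I(T \le \rho_x)e^{-\alpha\rho_x}] = c\,\frac{\beta}{\beta+\alpha} + \frac{\beta_{12}}{\beta+\alpha}.$$

The solution of the equation

$$x^*\,\frac{\alpha}{\beta+\alpha} - \left(c\,\frac{\beta}{\beta+\alpha} + \frac{\beta_{12}}{\beta+\alpha}\right) = 0$$

is given by

$$x^* = \frac{1}{\alpha}(c\beta + \beta_{12}) \quad\text{if}\quad \frac{\beta_{12}}{\alpha} - c < x^* \le \frac{\bar\beta_1 + \beta_{12}}{\alpha} - c.$$

Inserting $x^*$ in the latter inequality we obtain the condition $0 < c \le c_1$, where $c_1 = \bar\beta_1(\beta + \alpha)^{-1}$.
The remaining two cases $(\bar\beta_1 + \beta_{12})\alpha^{-1} - c < x \le (\bar\beta_2 + \beta_{12})\alpha^{-1} - c$ and $x > (\bar\beta_2 + \beta_{12})\alpha^{-1} - c$ are treated in a similar manner. After some extensive calculations the following solution of the stopping problem is derived:

$$\rho_{x^*} = \begin{cases} T_1 \wedge T_2 & \text{for } 0 < c \le c_1,\\ T_1 & \text{for } c_1 < c \le c_2,\\ T & \text{for } c_2 < c, \end{cases} \qquad x^* = \begin{cases} x_1^* & \text{for } 0 < c \le c_1,\\ x_2^* & \text{for } c_1 < c \le c_2,\\ x_3^* & \text{for } c_2 < c, \end{cases}$$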

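where $c_1$ is defined as above and

$$\begin{aligned} c_2 &= \frac{\bar\beta_2}{\beta+\alpha} + \frac{\beta_2(\bar\beta_2 - \bar\beta_1)}{(\bar\beta_1 + \beta_{12} + \alpha)(\beta+\alpha)},\\ x_1^* &= \frac{1}{\alpha}(c\beta + \beta_{12}),\\ x_2^* &= \frac{1}{\alpha}\left(c(\beta_1 + \beta_{12}) + \beta_{12} + \frac{(c+1)\beta_2\bar\beta_1 - c\beta_1\beta_2}{\bar\beta_1 + \beta_2 + \beta_{12} + \alpha}\right),\\ x_3^* &= b_u. \end{aligned}$$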

The explicit formulas for the optimal stopping value are presented here only to show how the procedure works and that even in seemingly simple cases extensive calculations are necessary. The main conclusion can be drawn from the structure of the optimal policy. For small values of c (recall that the penalty cost for failures is k = 1), it is optimal to stop and replace the system at the first component failure. For mid-range values of c, the replacement should take place when the “better” component, the one with the lower residual failure rate (β̄1 ≤ β̄2), fails. If the “worse” component fails first, the replacement takes place only at system failure. For high values of c, preventive replacements do not pay, and it is optimal to wait until system failure. In this case the optimal stopping value is equal to the upper bound x* = bu.

Information About T1 and T

The failure rate process corresponding to this observation level A is given by

$$\lambda_t = g(t)I(T_1 > t) + (\bar\beta_2 + \beta_{12})I(T_1 \le t),$$
$$g(t) = \bar\beta_1 + \beta_{12} - \frac{\bar\beta_1\gamma_1}{\beta_2 e^{\gamma_1 t} + \beta_1 - \bar\beta_1},$$

where the function g is derived by means of (5.16) as the limit

$$g(t) = \lim_{h \to 0+}\frac{1}{h}\,P(t < T_1 \le t+h,\ T_2 \le t+h \mid T_1 > t).$$
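The closed form of g can be compared numerically with the limit that defines it. The sketch below (hypothetical function names, Table 5.1 parameters, so that γ1 = 2.5 > 0) evaluates the difference quotient for a small h, using P(T1 > a, T2 ≤ b) = F̄(a, 0) − F̄(a, b) from (5.16), and checks that g(0) = β12 and that g is nondecreasing on a grid. It is a numerical sanity check, not a proof.

```python
import math

B1, B2, BB1, BB2, B12 = 1.0, 3.0, 1.5, 3.5, 0.5   # Table 5.1 parameters (illustrative)
G1, G2 = B1 + B2 - BB1, B1 + B2 - BB2
BETA = B1 + B2 + B12


def survival(x, y):
    """F-bar(x, y) = P(T1 > x, T2 > y) from (5.16)."""
    if x <= y:
        return B1 / G2 * math.exp(-G2 * x - (BB2 + B12) * y) - (BB2 - B2) / G2 * math.exp(-BETA * y)
    return B2 / G1 * math.exp(-G1 * y - (BB1 + B12) * x) - (BB1 - B1) / G1 * math.exp(-BETA * x)


def g_closed(t):
    """Closed-form A-level failure rate while component 1 is still working."""
    return BB1 + B12 - BB1 * G1 / (B2 * math.exp(G1 * t) + B1 - BB1)


def g_limit(t, h=1e-7):
    """Difference quotient (1/h) P(t < T1 <= t+h, T2 <= t+h | T1 > t)."""
    def t1_alive_t2_failed(a, b):        # P(T1 > a, T2 <= b)
        return survival(a, 0.0) - survival(a, b)
    joint = t1_alive_t2_failed(t, t + h) - t1_alive_t2_failed(t + h, t + h)
    return joint / (h * survival(t, 0.0))
```

The agreement is limited by the O(h) discretization error of the difference quotient, which is far below the tolerance used in the checks.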

The paths of the failure rate process λ depend only on the observable component lifetime T1 and not on T2. The paths are nondecreasing, so that the same procedure as before can be applied. For $\gamma_1 = \beta_1 + \beta_2 - \bar\beta_1 > 0$ the following results can be obtained:

$$\rho_{x^*} = \begin{cases} T_1 \wedge b^* & \text{for } 0 < c \le c_1,\\ T_1 & \text{for } c_1 < c \le c_2,\\ T & \text{for } c_2 < c, \end{cases} \qquad x^* = \begin{cases} x_1^* & \text{for } 0 < c \le c_1,\\ x_2^* & \text{for } c_1 < c \le c_2,\\ x_3^* & \text{for } c_2 < c. \end{cases}$$

The constants c1, c2 and the stopping values x2*, x3* are the same as in the complete information case. What is optimal on a higher information level and can be observed on a lower information level must be optimal on the latter too. So only the case 0 < c ≤ c1 is new. In this case the optimal replacement time is T1 ∧ b* with a constant b*, which is the unique solution of the equation

$$d_1\exp\{\gamma_1 b^*\} + d_2\exp\{-(\bar\beta_1 + \beta_{12} + \alpha)b^*\} + d_3 = 0.$$

The constants $d_i$, i ∈ {1, 2, 3}, are lengthy expressions in α and the β and γ constants and are therefore not presented here (see [84]). The values of b* and x1* have to be determined numerically. For γ1 < 0 a similar result can be obtained.

Information About T

On this lowest level B, no additional information about the state of the components is available up to the time of system failure. The failure rate is deterministic and can be derived from the distribution H:

$$\lambda_t = -\frac{d}{dt}\ln\bigl(1 - H(t)\bigr).$$

In this case the replacement times $\rho_x = T \wedge b$, $b \in \mathbb R_+ \cup \{\infty\}$, are the well-known age replacement policies. Even if λ is not monotone, such a policy is optimal on this B-level. The optimal values $b^*$ and $x^*$ have to be determined by minimizing $K_{\rho_x}$ as a function of b.
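A minimal sketch of the B-level computation follows. It assumes, in line with the bounds and the root equation x E[1 − e^{−αρ_x}] − EZ_{ρ_x} = 0 used above, that stopping at ρ costs Z_ρ = (c + k I(T ≤ ρ)) e^{−αρ} with k = 1 and that K_ρ = EZ_ρ / E[1 − e^{−αρ}]. For ρ = T ∧ b both expectations have closed forms once 1 − H(t) from (5.17) is written as a sum of exponentials. The minimization over b is a plain logarithmic grid search, chosen for transparency rather than efficiency. All names are illustrative.

```python
import math


def survival_terms(b1, b2, bb1, bb2, b12):
    """Return [(a_i, r_i)] with 1 - H(t) = sum a_i exp(-r_i t), read off from (5.17)."""
    g1, g2 = b1 + b2 - bb1, b1 + b2 - bb2
    beta = b1 + b2 + b12
    d = (b1 * bb2 + b2 * bb1 - bb1 * bb2) / (g1 * g2)
    return [(b2 / g1, bb1 + b12), (b1 / g2, bb2 + b12), (-d, beta)]


def age_cost(b, terms, c, alpha, k=1.0):
    """K for the age policy rho = T ^ b (b may be math.inf)."""
    # E[exp(-alpha T); T <= b] and P(T > b)
    disc_fail = sum(a * r / (r + alpha) * (1 - math.exp(-(r + alpha) * b)) for a, r in terms)
    surv_b = sum(a * math.exp(-r * b) for a, r in terms)
    expected_z = (c + k) * disc_fail + c * math.exp(-alpha * b) * surv_b
    # E[1 - exp(-alpha rho)] = alpha * integral_0^b exp(-alpha t) P(T > t) dt
    expected_x = alpha * sum(a / (r + alpha) * (1 - math.exp(-(r + alpha) * b)) for a, r in terms)
    return expected_z / expected_x


def best_age_policy(params, c, alpha, n=4000):
    """Grid search over b in [1e-3, 1e2] plus b = infinity; returns (K_min, b_opt)."""
    terms = survival_terms(*params)
    grid = [10 ** (-3 + 5 * i / n) for i in range(n + 1)] + [math.inf]
    return min((age_cost(b, terms, c, alpha), b) for b in grid)
```

With b = ∞ the policy is ρ = T and the cost reduces to the upper bound b_u = (c + 1)v/(1 − v), so the grid minimum can never exceed b_u.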

Numerical Examples

The following tables show the effects of changes in two parameters, the replacement cost parameter c and the “dependence parameter” β12. To be able to compare the cost minima K* = x*, both tables refer to the same set of parameters: β1 = 1, β2 = 3, β̄1 = 1.5, β̄2 = 3.5, α = 0.08. The optimal replacement times are denoted by:
a: ρx* = T1 ∧ T2   b: ρx* = T1   c: ρx* = T1 ∧ b*
d: ρx* = T ∧ b   e: ρx* = T = T1 ∨ T2.
Table 5.1 shows the cost minima x* for different values of c. For small values of c, the influence of the information level is greater than for moderate values. For c > 1.394 (the threshold c2 for these parameters), preventive replacements do not pay, and additional information beyond the system lifetime T is not profitable.
Table 5.2 shows how the cost minimum depends on the parameter β12 . For
increasing values of β12 the difference between the cost minima on different
information levels decreases, because the probability of a common failure of
both components increases and therefore extra information about a single
component is not profitable.

Table 5.1. β1 = 1, β2 = 3, β12 = 0.5, β̄1 = 1.5, β̄2 = 3.5, α = 0.08

                          Information level
  c       bl        F             A             B             bu
  0.01    6.453     6.813 (a)     9.910 (c)     11.003 (d)    20.506
  0.10    8.280     11.875 (a)    17.208 (c)    19.678 (d)    22.333
  0.50    16.402    28.543 (b)    28.543 (b)    30.455 (e)    30.455
  1.00    26.553    39.764 (b)    39.764 (b)    40.606 (e)    40.606
  2.00    46.856    60.900 (e)    60.900 (e)    60.900 (e)    60.900

Table 5.2. β1 = 1, β2 = 3, β̄1 = 1.5, β̄2 = 3.5, c = 0.1, α = 0.08

                          Information level
  β12     bl        F             A             B             bu
  0.00    1.505     5.000 (a)     10.739 (c)    13.231 (d)    16.552
  0.10    2.859     6.375 (a)     12.032 (c)    14.520 (d)    17.698
  1.00    15.067    18.750 (a)    23.688 (c)    26.132 (d)    28.235
  10.00   138.106   142.500 (b)   142.500 (b)   144.168 (e)   144.168
  50.00   687.677   689.448 (e)   689.448 (e)   689.448 (e)   689.448
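The F-level entries and the bounds in both tables follow from the closed-form expressions for c1, c2, x1*, x2*, bl and bu given earlier. The sketch below recomputes them. Here v = E[e^{−αT}] is obtained by writing 1 − H(t) from (5.17) as Σ a_i e^{−r_i t}, which gives v = Σ a_i r_i/(r_i + α). The function name and the returned labels (a, b, e, following the list above) are illustrative.

```python
def f_level_solution(c, b1, b2, bb1, bb2, b12, alpha):
    """Return (b_l, x_star, b_u, policy) on the complete-information level F, with k = 1."""
    g1, g2 = b1 + b2 - bb1, b1 + b2 - bb2
    beta = b1 + b2 + b12
    d = (b1 * bb2 + b2 * bb1 - bb1 * bb2) / (g1 * g2)
    terms = [(b2 / g1, bb1 + b12), (b1 / g2, bb2 + b12), (-d, beta)]  # 1 - H(t) = sum a exp(-r t)
    v = sum(a * r / (r + alpha) for a, r in terms)                   # v = E[exp(-alpha T)]
    b_l = c * v / (1 - v) + b12 / alpha
    b_u = (c + 1) * v / (1 - v)
    c1 = bb1 / (beta + alpha)
    c2 = bb2 / (beta + alpha) + b2 * (bb2 - bb1) / ((bb1 + b12 + alpha) * (beta + alpha))
    if c <= c1:                       # replace at the first component failure
        return b_l, (c * beta + b12) / alpha, b_u, "a"
    if c <= c2:                       # replace when component 1 fails
        x2 = (c * (b1 + b12) + b12
              + ((c + 1) * b2 * bb1 - c * b1 * b2) / (bb1 + b2 + b12 + alpha)) / alpha
        return b_l, x2, b_u, "b"
    return b_l, b_u, b_u, "e"         # no preventive replacement
```

For example, f_level_solution(0.5, 1.0, 3.0, 1.5, 3.5, 0.5, 0.08) returns policy "b" with x* ≈ 28.543, the corresponding F entry of Table 5.1.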

5.3.6 A Burn-In Model

Many manufactured items, for example, electronic components, tend either to last a relatively long time or to fail very early. A technique used to screen out the items with short lifelengths before they are delivered to the customer is the so-called burn-in. To burn in an item means that before the item is released, it undergoes a test during which it is examined under factory conditions or exposed to extra stress. After the test phase of (random) length τ, the item is put into operation.
Considering m produced items, and given some cost structure such as costs for failures during and after the test and gains per unit time for released items, one problem related to burn-in is to determine the optimal burn-in duration. This optimal burn-in time may either be fixed in advance, and hence be deterministic, or it may depend on the random information given by the lifelengths of the items failing during the test, yielding a random burn-in time.
We consider a semimartingale approach for solving the optimal stopping
problem. In our model, the lifelengths of the items need not be identically
distributed, and the stress level during burn-in may differ from the one after
burn-in. The information at time t consists of whether and when components
failed before t. Under these assumptions, we determine the optimal burn-in
time ζ.
Let Tj , j = 1, . . . , m, be independent random variables representing the
lifelengths of the items that are burned in. We assume that ETj < ∞ for all j.
We consider burn-in under severe conditions. That means that we assume the
items to have different failure rates during and after burn-in, λ0j (t) and λ1j (t),

respectively, where it is supposed that $\lambda^0_j(t) \ge \lambda^1_j(t)$ for all $t \ge 0$. We assume that the lifelength $T_j$ of the jth item admits the following representation:

$$I(T_j \le t) = \int_0^t I(T_j > s)\,\lambda^{Y_s}_j(s)\,ds + M_t(j), \quad j = 1, \dots, m, \tag{5.18}$$

where $Y_t = I(\tau < t)$, τ is the burn-in time and $M(j) \in \mathcal M$ is bounded in $L^2$.
This representation can also be obtained by modeling the lifelength of the jth item in the following way:

$$T_j = Z_j \wedge \tau + R_j\,I(Z_j > \tau), \tag{5.19}$$

where $Z_j, R_j$, $j = 1, \dots, m$, are independent random variables and $a \wedge b$ denotes the minimum of a and b; $Z_j$ is the lifelength of the jth item when it is exposed to a higher stress level and $R_j$ is the operating time of the item if it survived the burn-in phase. Let $F_j$ be the lifelength distribution, $H_j$ denote the distribution function of $Z_j$, $j = 1, \dots, m$, and let $H_j(0) = F_j(0) = 0$, $\bar H_j(t) = 1 - H_j(t)$, $\bar F_j(t) = 1 - F_j(t)$. Furthermore, we assume that $H_j$ and $F_j$ admit densities $h_j$ and $f_j$, respectively. It is assumed that the operating time $R_j$ follows the conditional survival distribution corresponding to $F_j$:

$$P(T_j \le t+s \mid \tau = t < Z_j) = P(R_j \le s \mid \tau = t < Z_j) = \frac{F_j(t+s) - F_j(t)}{\bar F_j(t)}, \quad t, s \in \mathbb R_+.$$

In order to determine the optimal burn-in time, we introduce the following cost and reward structure: there is a reward of c > 0 per unit operating time of released items. In addition there are costs for failures, $c_B > 0$ for a failure during burn-in and $c_F > 0$ for a failure after the burn-in time τ, where $c_F > c_B$. If we fix the burn-in time for a moment to τ = t, then the net reward is given by

$$Z_t = c\sum_{j=1}^m (T_j - t)^+ - c_B\sum_{j=1}^m I(T_j \le t) - c_F\sum_{j=1}^m I(T_j > t), \quad t \in \mathbb R_+. \tag{5.20}$$

Since we assume that the failure time of any item can be observed during the burn-in phase, the observation filtration, generated by the lifelengths of the items, is given by

$$\mathbb F = (\mathcal F_t),\ t \in \mathbb R_+, \quad \mathcal F_t = \sigma\bigl(I(T_j \le s), 0 \le s \le t, j = 1, \dots, m\bigr).$$

In order to determine the optimal burn-in time, we are looking for an F-stopping time $\zeta \in C^{\mathbb F}$ satisfying

$$EZ_\zeta = \sup\{EZ_\tau : \tau \in C^{\mathbb F}\}.$$

In other words, at any time t the observer has to decide whether to stop or to continue with burn-in with respect to the available information up to time t. Since Z is not adapted to F, i.e., $Z_t$ cannot be observed directly, we consider the conditional expectation

$$\hat Z_t = E[Z_t \mid \mathcal F_t] = c\sum_{j=1}^m I(T_j > t)\,E[(T_j - t)^+ \mid T_j > t] - mc_F + (c_F - c_B)\sum_{j=1}^m I(T_j \le t). \tag{5.21}$$

As an abbreviation we use

$$\mu_j(t) = E[(T_j - t)^+ \mid T_j > t] = \frac{1}{\bar F_j(t)}\int_t^\infty \bar F_j(x)\,dx, \quad t \in \mathbb R_+,$$

for the mean residual lifelength. The derivative with respect to t is given by $\mu_j'(t) = -1 + \lambda^1_j(t)\mu_j(t)$. We are now in a position to apply Theorem 5.9, p. 181, and formulate conditions under which the monotone case holds true.
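The derivative formula follows from the quotient rule. Assuming, as the stated formula presupposes, that $\bar F_j$ has density $f_j$ and failure rate $\lambda^1_j = f_j/\bar F_j$:

$$\mu_j'(t) = \frac{d}{dt}\,\frac{\int_t^\infty \bar F_j(x)\,dx}{\bar F_j(t)} = -\frac{\bar F_j(t)}{\bar F_j(t)} + \frac{f_j(t)\int_t^\infty \bar F_j(x)\,dx}{\bar F_j(t)^2} = -1 + \lambda^1_j(t)\,\mu_j(t).$$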
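Theorem 5.24. Suppose that the functions

$$g_j(t) = -c - c\,\mu_j(t)\bigl(\lambda^0_j(t) - \lambda^1_j(t)\bigr) + (c_F - c_B)\lambda^0_j(t)$$

satisfy the following condition:

$$\sum_{j \in J} g_j(t) \le 0 \ \text{ implies } \ g_j(s) \le 0 \quad \forall j \in J,\ \forall J \subseteq \{1, \dots, m\},\ \forall s \ge t. \tag{5.22}$$

Then

$$\zeta = \inf\Bigl\{t \in \mathbb R_+ : \sum_{j=1}^m I(T_j > t)\,g_j(t) \le 0\Bigr\}$$

is an optimal burn-in time:

$$EZ_\zeta = \sup\{EZ_\tau : \tau \in C^{\mathbb F}\}.$$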

Proof. In order to obtain a semimartingale representation for $\hat Z$ in (5.21), we derive such a representation for $I(T_j > t)\mu_j(t)$. Since $\mu_j(\cdot)$ and $I(T_j > \cdot)$ are right-continuous and of bounded variation on $[0, t]$, we can use the integration by parts formula for Stieltjes integrals (pathwise) to obtain

$$\mu_j(t)I(T_j > t) = \mu_j(0)I(T_j > 0) + \int_0^t \mu_j(s-)\,dI(T_j > s) + \int_0^t I(T_j > s)\,d\mu_j(s).$$

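Substituting

$$I(T_j > s) = 1 + \int_0^s \bigl(-I(T_j > x)\lambda^0_j(x)\bigr)dx + M_j(s)$$

in this formula and using the continuity of $\mu_j$, we obtain

$$\begin{aligned} \mu_j(t)I(T_j > t) &= \mu_j(0) + \int_0^t \bigl[-\mu_j(s)I(T_j > s)\lambda^0_j(s) + I(T_j > s)\mu_j'(s)\bigr]ds + \int_0^t \mu_j(s)\,dM_j(s)\\ &= \mu_j(0) + \int_0^t I(T_j > s)\bigl[-1 - \mu_j(s)\bigl(\lambda^0_j(s) - \lambda^1_j(s)\bigr)\bigr]ds + \tilde M_j(t), \end{aligned}$$

where $\tilde M_j$ is a martingale which is bounded in $L^2$. This yields the following semimartingale representation for $\hat Z$:

$$\begin{aligned} \hat Z_t &= -mc_F + c\sum_{j=1}^m \mu_j(0) + \int_0^t \sum_{j=1}^m c\,I(T_j > s)\bigl[-1 - \mu_j(s)\bigl(\lambda^0_j(s) - \lambda^1_j(s)\bigr)\bigr]ds\\ &\quad + (c_F - c_B)\int_0^t \sum_{j=1}^m I(T_j > s)\lambda^0_j(s)\,ds + L_t\\ &= -mc_F + c\sum_{j=1}^m \mu_j(0) + \int_0^t \sum_{j=1}^m I(T_j > s)\,g_j(s)\,ds + L_t \end{aligned}$$

with a uniformly integrable martingale

$$L = c\sum_{j=1}^m \tilde M_j + (c_F - c_B)\sum_{j=1}^m M_j \in \mathcal M.$$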

Since for all ω ∈ Ω and all t ∈ R+ there exists some J ⊆ {1, …, m} such that $\sum_{j=1}^m I(T_j > t)g_j(t) = \sum_{j \in J} g_j(t)$, condition (5.22) in the theorem ensures that the monotone case (MON), p. 181, holds true. Therefore we get the desired result by Theorem 5.9, and the proof is complete.

Remark 5.25. The structure of the optimal stopping time shows that high rewards per unit operating time lead to short burn-in times, whereas large differences cF − cB between the costs for failures in the two phases lead to long testing times, as expected.

Equivalent characterizations of condition (5.22) in Theorem 5.24 are given in the following lemma. The proof can be found in [87].

Lemma 5.26. Let $t_J = \inf\{t \in \mathbb R_+ : \sum_{j \in J} g_j(t) \le 0\}$ and denote $t_j = t_{\{j\}}$ for all j ∈ {1, …, m}. Then the following conditions are equivalent:

(i) $\sum_{j \in J} g_j(t) \le 0$ implies $g_j(s) \le 0$ ∀ j ∈ J, ∀ J ⊆ {1, …, m} and ∀ s ≥ t.
(ii) $t_J = \max_{j \in J} t_j$ ∀ J ⊆ {1, …, m} and $g_j(s) \le 0$ ∀ s ≥ $t_j$, ∀ j ∈ {1, …, m}.
(iii)
$$\Bigl|\sum_{j:\,g_j(t) \le 0} g_j(t)\Bigr| < \min_{j:\,g_j(t) > 0} g_j(t) \quad \forall\, t < \max_{j=1,\dots,m} t_j$$
and $g_j(s) \le 0$ ∀ s ≥ $t_j$, ∀ j ∈ {1, …, m}.

The following special cases illustrate the result of the theorem.

1. Burn-in forever. If $g_j(t) > 0$ for all $t \in \mathbb R_+$, $j = 1, \dots, m$, then $\zeta = \max\{T_1, \dots, T_m\}$, i.e., burn-in continues until all items have failed.
2. No burn-in. If $g_j(0) \le 0$, $j = 1, \dots, m$, then ζ = 0 and no burn-in takes place. This case occurs, for instance, if the costs for failures during and after burn-in are the same: $c_B = c_F$.
3. Identical items. If all failure rates coincide, i.e., $\lambda^0_1(t) = \dots = \lambda^0_m(t)$ and $\lambda^1_1(t) = \dots = \lambda^1_m(t)$ for all t ≥ 0, then $g_j(t) = g_1(t)$ for all $j \in \{1, \dots, m\}$ and condition (5.22) reduces to
$$g_1(s) \le 0 \quad \text{for } s \ge t_1 = \inf\{t \in \mathbb R_+ : g_1(t) \le 0\}.$$
If this condition is satisfied, the optimal stopping time is of the form $\zeta = t_1 \wedge \max\{T_1, \dots, T_m\}$, i.e., stop burn-in as soon as $g_1(t) \le 0$ or as soon as all items have failed, whichever occurs first.
4. The exponential case. If all failure rates are constant, equal to $\lambda^0_j$ and $\lambda^1_j$, respectively, then $\mu_j$ and therefore $g_j$ are constant, too, and $\zeta(\omega) \in \{0, T_1(\omega), \dots, T_m(\omega)\}$ if condition (5.22) is satisfied. If, furthermore, the items are “identical,” then we have ζ = 0 or $\zeta = \max\{T_1, \dots, T_m\}$.
5. No random information. In some situations the lifelengths of the items cannot be observed continuously. In this case one has to maximize the expectation function
$$EZ_t = E\hat Z_t = -mc_F + c\sum_{j=1}^m \bar H_j(t)\mu_j(t) + (c_F - c_B)\sum_{j=1}^m H_j(t)$$
in order to obtain the (deterministic) optimal burn-in time. This can be done using elementary calculus.
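The identical-items case can be illustrated with a hypothetical model chosen so that everything is explicit; it is not taken from the source. Assume the post-burn-in failure rate λ^1(t) = s/(θ + t), a Lomax law with survival (1 + t/θ)^{−s}, s > 1, hence decreasing failure rate and mean residual life μ(t) = (θ + t)/(s − 1), and assume burn-in stress multiplies the rate by a factor a ≥ 1, so λ^0 = aλ^1. Then g_1(t) = −c(1 + s(a − 1)/(s − 1)) + (c_F − c_B)as/(θ + t) is decreasing, so the condition of case 3 holds and t_1 has a closed form. Since μ' = −1 + λ^1μ and H̄' = −λ^0H̄, one can check that d/dt EZ_t = m H̄(t) g_1(t), so in this example the deterministic optimum of case 5 also equals t_1; the sketch verifies this numerically. Parameter values and names are illustrative.

```python
def make_g(theta, s, a, c, c_b, c_f):
    """g(t) of Theorem 5.24 for identical items in a hypothetical Lomax model.

    After burn-in: lambda1(t) = s / (theta + t), survival (1 + t/theta)^(-s), s > 1.
    During burn-in: lambda0(t) = a * lambda1(t) with a >= 1 (assumed proportional stress).
    """
    def g(t):
        lam1 = s / (theta + t)
        lam0 = a * lam1
        mu = (theta + t) / (s - 1)          # mean residual life under lambda1
        return -c - c * mu * (lam0 - lam1) + (c_f - c_b) * lam0
    return g


def t1_closed_form(theta, s, a, c, c_b, c_f):
    """Root of the decreasing function g(t) = -c(1 + s(a-1)/(s-1)) + (c_f - c_b) a s / (theta + t)."""
    return max(0.0, (c_f - c_b) * a * s / (c * (1 + s * (a - 1) / (s - 1))) - theta)


def expected_reward_per_item(t, theta, s, a, c, c_b, c_f):
    """E Z_t / m from case 5: -c_F + c * Hbar(t) * mu(t) + (c_F - c_B) * H(t)."""
    h_bar = (1 + t / theta) ** (-a * s)     # survival during burn-in (rate lambda0)
    mu = (theta + t) / (s - 1)
    return -c_f + c * h_bar * mu + (c_f - c_b) * (1 - h_bar)
```

With m observed items, the optimal rule of case 3 is then ζ = t_1 ∧ max{T_1, …, T_m}; in this particular example the random information only matters through the possibility that all items fail before t_1.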

5.4 Repair Replacement Models


In this section we consider models in which repairs are carried out in negligible time up to the time of a replacement. So the observation of the system does not end with a failure, as in the first sections of this chapter, but is continued until it is decided to replace the system by a new one. Given a certain cost structure, the optimal replacement time is derived with respect to the available information.

5.4.1 Optimal Replacement Under a General Repair Strategy

We consider a system that fails at times $T_n$ according to a point process $(N_t)$, $t \in \mathbb R_+$, with an intensity $(\lambda_t)$ adapted to some filtration F. At failures a repair is carried out at a cost c > 0, which takes negligible time. A replacement can be carried out at any time t at an additional cost k > 0. Following the average cost per unit time criterion, we have to find a stopping time σ, if one exists, with

$$K^* = K_\sigma = \inf\Bigl\{K_\tau = \frac{cEN_\tau + k}{E\tau} : \tau \in C^{\mathbb F}\Bigr\},$$

where $C^{\mathbb F} = \{\tau : \tau \text{ F-stopping time},\ E\tau < \infty\}$ is a suitable class of stopping times. To solve this problem we can adopt the procedure of Sect. 5.2.1 with some slight modifications. First of all we have $K_\tau = EZ_\tau / EX_\tau$ with SSM representations

$$Z_t = k + \int_0^t c\lambda_s\,ds + M_t, \tag{5.23}$$
$$X_t = \int_0^t ds.$$

Setting τ = T1, we derive the simple upper bound $b_u$:

$$b_u = \frac{c+k}{ET_1} \ge K^*.$$

The process Y corresponding to (5.10) on p. 187 now reads

$$Y_t = -k + \int_0^t (K^* - c\lambda_s)\,ds + R_t,$$

and therefore we know that, if there exists an optimal finite stopping time σ, then it is among the indexed stopping times

$$\rho_x = \inf\Bigl\{t \in \mathbb R_+ : \lambda_t \ge \frac{x}{c}\Bigr\}, \quad 0 \le x \le b_u,$$

provided λ has nondecreasing paths. We summarize this in a corollary to Theorem 5.18, p. 188.

Corollary 5.27. Let the martingale M in (5.23) be such that $(M_{t \wedge \rho_{b_u}})$ is uniformly integrable. If λ has nondecreasing paths and $E\rho_{b_u} < \infty$, then

$$\sigma = \rho_{x^*}, \quad \text{with } x^* = \inf\{x \in \mathbb R_+ : xE\rho_x - cEN_{\rho_x} \ge k\},$$

is an optimal stopping time and $x^* = K^*$.

Example 5.28. Considering a nonhomogeneous Poisson process with a nondecreasing deterministic intensity $\lambda_t = \lambda(t)$, we observe that the stopping times $\rho_x = \lambda^{-1}(x/c)$ are constants. If $\lambda^{-1}(b_u/c) < \infty$, then the corollary can be applied and the optimal stopping time σ is a finite constant.
The simplest case is that of a Poisson process with constant rate λ > 0. In this case we have $b_u = c\lambda + k\lambda > c\lambda$ and $\rho_{b_u} = \infty$, so that the corollary does not apply. But in this case it is easily seen that additional stopping (replacement) costs do not pay, and we get that σ = ∞ is optimal with $K^* = c\lambda$.
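For Example 5.28 with a hypothetical power-law (Weibull-type) intensity λ(t) = (β/θ)(t/θ)^{β−1}, β > 1, everything is explicit: EN_t = (t/θ)^β, ρ_x = λ^{−1}(x/c), and ET1 = θΓ(1 + 1/β), since the first failure time has cumulative hazard (t/θ)^β. The sketch below finds x* of Corollary 5.27 by bisection (the map x ↦ xρ_x − cEN_{ρ_x} is increasing, its derivative being ρ_x) and compares the result with the closed form T* = θ(k/(c(β − 1)))^{1/β}, which one obtains by minimizing (cEN_T + k)/T over constant T. The parameter names shape and scale stand for β and θ to avoid a clash with other symbols; they and the function name are illustrative.

```python
import math


def corollary_527_power_law(shape, scale, c, k, iters=200):
    """x* and rho_{x*} of Corollary 5.27 for an NHPP with intensity
    lambda(t) = (shape/scale) * (t/scale)**(shape - 1), shape > 1 (nondecreasing)."""
    def lam_inv(y):                      # inverse of the intensity function
        return scale * (y * scale / shape) ** (1 / (shape - 1))

    def mean_count(t):                   # E N_t = (t/scale)**shape
        return (t / scale) ** shape

    def excess(x):                       # x E rho_x - c E N_{rho_x} - k, increasing in x
        rho = lam_inv(x / c)
        return x * rho - c * mean_count(rho) - k

    lo, hi = 0.0, 1.0
    while excess(hi) < 0:                # bracket the root
        hi *= 2.0
    for _ in range(iters):               # bisection for inf{x : excess(x) >= 0}
        mid = 0.5 * (lo + hi)
        if excess(mid) < 0:
            lo = mid
        else:
            hi = mid
    return hi, lam_inv(hi / c)
```

For shape = 2, scale = 1, c = 1, k = 4 this gives x* = K* = 4 and ρ_{x*} = 2, and K* ≤ b_u = (c + k)/ET1 holds as required.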

Example 5.29. Consider the shock model with state-dependent failure probability of Sect. 3.3.4 in which shocks arrive according to a Poisson process with rate ν (cf. Example 3.47, p. 89). The failure intensity is of the form

$$\lambda_t = \nu\int_0^\infty p(X_t + y)\,dF(y),$$

where $p(X_t + y)$ denotes the probability of a failure at the next shock if the accumulated damage is $X_t$ and the next shock has amount y. Here we assume that this probability function p does not depend on the number of failures in the past. Obviously $\lambda_t$ is nondecreasing, so that Corollary 5.27 applies provided that the integrability conditions are met.

A variety of point process models as described in Sect. 3.3 can be used in this set-up. More general cost structures could also be applied, for example, random costs k = (kt), if k admits an SSM representation. Other modifications (discounted cost criterion, different information levels) can be worked out easily, apart from some technical problems.

5.4.2 A Markov-Modulated Repair Process: Optimization with Partial Information

In this section a model with a given reward structure is investigated in which an optimal operating time of a system has to be found that balances some flow of rewards and the increasing cost rate due to (minimal) repairs. Consider a one-unit system that fails from time to time according to a point process. After failure a minimal repair is carried out that leaves the state of the system unchanged. The system can work in one of m unobservable states. State “1” stands for new or in good condition and “m” is defective or in bad condition. Aging of the system is described by a link between the failure point process and the unobservable state of the system. The failure or minimal repair intensity may depend on the state of the system. There is some constant flow of income, on the one hand, and on the other hand, each minimal repair incurs a random cost amount. The question is when to stop operating the system and carry out an inspection or a renewal in order to maximize some reward functional.
For the basic set-up we refer to Example 3.14, p. 65 and Sect. 3.3.9. Here we recapitulate the main assumptions of the model:
The basic probability space (Ω, F, P) is equipped with a filtration F, the complete information level, to which all processes are adapted, and S = {1, …, m} is the set of unobservable environmental states. The changes of the states are driven by a homogeneous Markov process Y = (Yt), t ∈ R+, with values in S and infinitesimal parameters qi, the rate to leave state i, and qij, the rate to reach state j from state i. The time points of failures (minimal repairs) 0 < T1 < T2 < ⋯ form a point process and N = (Nt), t ∈ R+, is the corresponding counting process:

$$N_t = \sum_{n=1}^\infty I(T_n \le t).$$

It is assumed that N has a stochastic intensity $\lambda_{Y_t}$ that depends on the unobservable state, i.e., N is a so-called Markov-modulated Poisson process with representation

$$N_t = \int_0^t \lambda_{Y_s}\,ds + M_t,$$

where M is an F-martingale and $0 < \lambda_i < \infty$, $i \in S$.
Furthermore, let (Xn), n ∈ N, be a sequence of positive i.i.d. random variables, independent of N and Y, with common distribution F and finite mean μ. The cost caused by the nth minimal repair at time Tn is described by Xn.
There is an initial capital u and an income of constant rate c > 0 per unit time.
Now the process R, given by

$$R_t = u + ct - \sum_{n=1}^{N_t} X_n,$$

describes the available capital at time t as the difference of the income and the total amount of costs for minimal repairs up to time t.
The process R is well known in other branches of applied probability like queueing or collective risk theory, where the time to ruin $\tau = \inf\{t \in \mathbb R_+ : R_t < 0\}$ is investigated (cf. Sect. 3.3.9). Here the focus is on determining the optimal operating time with respect to the given reward structure. To achieve this goal one has to estimate the unobservable state of the system at time t, given the history of the process R up to time t. This can be done using results in filtering theory, as is shown below. Stopping at a fixed time t results in the net gain

$$Z_t = R_t - \sum_{j=1}^m k_j U_t(j),$$

where $U_t(j) = I(Y_t = j)$ is the indicator of the state at time t and $k_j \in \mathbb R$, $j \in S$, are stopping costs (for inspection and replacement), which may depend on the stopping state. The process Z cannot be observed directly because only the failure time points and the costs for minimal repairs are known to an observer. The observation filtration $\mathbb A = (\mathcal A_t)$, $t \in \mathbb R_+$, is given by

$$\mathcal A_t = \sigma(N_s, X_i, 0 \le s \le t, i = 1, \dots, N_t).$$

Let $C^{\mathbb A} = \{\tau : \tau \text{ is a finite A-stopping time},\ EZ_\tau^- < \infty\}$ be the set of feasible stopping times in which the optimal one has to be found. As usual $a^- = -\min\{0, a\}$ denotes the negative part of $a \in \mathbb R$. So the problem is to find $\tau^* \in C^{\mathbb A}$ which maximizes the expected net gain:

$$EZ_{\tau^*} = \sup\{EZ_\tau : \tau \in C^{\mathbb A}\}.$$

For the solution of this problem an F-semimartingale representation of the process Z is needed, where it is assumed that the complete information filtration F is generated by Y, N, and (Xn):

$$\mathcal F_t = \sigma(Y_s, N_s, X_i, 0 \le s \le t, i = 1, \dots, N_t).$$
Such a representation can be obtained by means of an SSM representation for the indicator process $U_t(j)$,

$$U_t(j) = U_0(j) + \int_0^t \sum_{i=1}^m U_s(i)q_{ij}\,ds + m_t(j), \quad m(j) \in \mathcal M_0, \tag{5.24}$$

as follows (see [95] for details):

$$Z_t = u - \sum_{j=1}^m k_j U_0(j) + \int_0^t \sum_{j=1}^m U_s(j)r_j\,ds + M_t, \quad t \in \mathbb R_+, \tag{5.25}$$

where $M = (M_t)$ is an F-martingale and the constants $r_j$ are defined by

$$r_j = c - \lambda_j\mu - \sum_{\nu \ne j}(k_\nu - k_j)q_{j\nu}.$$

These constants can be interpreted as net gain rates in state j:


• c is the income rate.
• λj , the failure rate in state j, is the expected number of failures per unit
of time, μ is the expected repair cost for one minimal repair. So λj μ is the
repair cost rate.
• The remaining sum is the stopping cost rate by leaving state j.

Since the state indicators U(j), and therefore Z, cannot be observed, a projection to the observation filtration A is needed. As described in Sect. 3.1.2, such a projection from the F-level (5.25) to the A-level leads to the following conditional expectations:

$$\hat Z_t = E[Z_t \mid \mathcal A_t] = u - \sum_{j=1}^m k_j\hat U_0(j) + \int_0^t \sum_{j=1}^m \hat U_s(j)r_j\,ds + \bar M_t, \quad t \in \mathbb R_+. \tag{5.26}$$

The integrand $\sum_{j=1}^m \hat U_s(j)r_j$ with $\hat U_s(j) = E[U_s(j) \mid \mathcal A_s] = P(Y_s = j \mid \mathcal A_s)$ is the conditional expectation of the net gain rate at time s given the observations up to time s. If this integrand has nonincreasing paths, then we know that we are in the “monotone case” (cf. p. 181) and the stopping problem can be solved under some additional integrability conditions. To state monotonicity conditions for the integrand in (5.26), an explicit representation of $\hat U_t(j)$ is needed, which can be obtained by means of results in filtering theory (see [50], p. 98, [93]) in the form of “differential equations”:
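• Between the jumps of N, $T_n \le t < T_{n+1}$:
$$\hat U_t(j) = \hat U_{T_n}(j) + \int_{T_n}^t \sum_{i=1}^m \hat U_s(i)\bigl\{q_{ij} + \hat U_s(j)(\lambda_i - \lambda_j)\bigr\}ds, \quad q_{jj} = -q_j, \quad \hat U_0(j) = P(Y_0 = j),\ j \in S. \tag{5.27}$$
• At jumps:
$$\hat U_{T_n}(j) = \frac{\lambda_j\hat U_{T_n-}(j)}{\sum_{i=1}^m \lambda_i\hat U_{T_n-}(i)}, \tag{5.28}$$
where $\hat U_{T_n-}(j)$ denotes the left limit.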
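The filter (5.27) and (5.28) is straightforward to approximate numerically. The following sketch is an illustration under stated assumptions rather than the method of [93]: it integrates (5.27) with a simple explicit Euler scheme between observed failure times, applies the Bayes update (5.28) at each failure, and reports the first grid time at which Σ_j Û_t(j) r_j ≤ 0, a discretized version of the candidate stopping time (5.30) considered below. The three-state generator, failure rates, cost parameters and failure times are hypothetical; they are chosen so that the aging conditions stated next and the condition q_im > λ_m − λ_i of Theorem 5.30 hold.

```python
def net_gain_rates(Q, lam, mu, c, k):
    """r_j = c - lambda_j * mu - sum over v != j of (k_v - k_j) * q_jv, as defined after (5.25)."""
    m = len(lam)
    return [c - lam[j] * mu - sum((k[v] - k[j]) * Q[j][v] for v in range(m) if v != j)
            for j in range(m)]


def filter_path(Q, lam, p0, failure_times, t_end, dt=1e-3):
    """Approximate U-hat_t(j) = P(Y_t = j | A_t) on [0, t_end].

    Between failures: explicit Euler steps of (5.27); at failures: Bayes update (5.28).
    Q is the generator matrix (Q[j][j] = -q_j, rows sum to zero).
    """
    m = len(lam)
    p = list(p0)
    jumps = sorted(failure_times)
    times, path = [0.0], [p[:]]
    t, n = 0.0, 0
    while t_end - t > 1e-12:
        step = min(dt, t_end - t)
        at_jump = n < len(jumps) and jumps[n] <= t + step
        if at_jump:
            step = max(0.0, jumps[n] - t)
        lam_bar = sum(p[i] * lam[i] for i in range(m))
        drift = [sum(p[i] * Q[i][j] for i in range(m)) + p[j] * (lam_bar - lam[j])
                 for j in range(m)]
        p = [p[j] + step * drift[j] for j in range(m)]
        if at_jump:                       # failure observed at jumps[n]: apply (5.28)
            t = jumps[n]
            norm = sum(lam[i] * p[i] for i in range(m))
            p = [lam[j] * p[j] / norm for j in range(m)]
            n += 1
        else:
            t += step
        times.append(t)
        path.append(p[:])
    return times, path


def tau_star(times, path, r):
    """First grid time with sum_j U-hat_t(j) * r_j <= 0 (discretized (5.30)); None if not reached."""
    for t, p in zip(times, path):
        if sum(pj * rj for pj, rj in zip(p, r)) <= 0:
            return t
    return None


# Hypothetical three-state example: q_13 = 1.2 > lambda_3 - lambda_1, q_23 = 0.8 > lambda_3 - lambda_2.
LAM = [1.0, 1.5, 2.0]
Q = [[-1.5, 0.3, 1.2],
     [0.0, -0.8, 0.8],
     [0.0, 0.0, 0.0]]
```

In such a run the partial sums Û_t(1) and Û_t(1) + Û_t(2) should be nonincreasing along the whole path, which is exactly the monotonicity used in part (a) of the proof of Theorem 5.30.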


The following conditions ensure that the system ages, i.e., it moves from
the “good” states with high net gains and low failure rates to the “bad” states
with low and possibly negative net gains and high failure rates, and it is never
possible to return to a “better” state:

qi > 0, i = 1, . . . , m − 1, qij = 0 for i > j, i, j ∈ S,


r1 ≥ r2 ≥ · · · ≥ rm = c − λm μ, rm < 0, (5.29)
0 < λ1 ≤ λ2 ≤ · · · ≤ λm .

A reasonable candidate for an optimal A-stopping time is

$$\tau^* = \inf\Bigl\{t \in \mathbb R_+ : \sum_{j=1}^m \hat U_t(j)r_j \le 0\Bigr\}, \tag{5.30}$$

the first time the conditional expectation of the net gain rate falls below 0.

Theorem 5.30. Let τ* be the A-stopping time (5.30) and assume that conditions (5.29) hold true. If, in addition, $q_{im} > \lambda_m - \lambda_i$, i = 1, …, m − 1, then τ* is optimal:

$$EZ_{\tau^*} = \sup\{EZ_\tau : \tau \in C^{\mathbb A}\}.$$

Proof. Because of $EZ_\tau = E\hat Z_\tau$ for all $\tau \in C^{\mathbb A}$, we can apply Theorem 5.9, p. 181, taking the A-SSM representation (5.26) of $\hat Z$. We will proceed in two steps:
(a) First, we prove that the monotone case holds true.
(b) Second, we show that the martingale part $\bar M$ in (5.26) is uniformly integrable.
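(a) We start by showing that the integrand $\sum_{j=1}^m \hat U_s(j)r_j$ has nonincreasing paths. A simple rearrangement gives

$$\sum_{j=1}^m \hat U_s(j)r_j = r_m + (r_{m-1} - r_m)\sum_{j=1}^{m-1}\hat U_s(j) + \dots + (r_1 - r_2)\hat U_s(1).$$

Since we have from (5.29) that $r_{k-1} - r_k \ge 0$, k = 2, …, m, it remains to show that $\sum_{\nu=1}^j \hat U_s(\nu)$ is nonincreasing in s for j = 1, …, m − 1. Denoting $\bar\lambda(s) = \sum_{j=1}^m \hat U_s(j)\lambda_j$, we get from (5.27) between jumps, $T_n < s < T_{n+1}$, where $T_0 = 0$,

$$\begin{aligned} \frac{d}{ds}\Bigl(\sum_{\nu=1}^j \hat U_s(\nu)\Bigr) &= \sum_{\nu=1}^j\Bigl(\sum_{i=1}^m \hat U_s(i)\bigl\{q_{i\nu} + \hat U_s(\nu)(\lambda_i - \lambda_\nu)\bigr\}\Bigr)\\ &= \sum_{i=1}^m\sum_{\nu=1}^j \hat U_s(i)q_{i\nu} + \sum_{\nu=1}^j \hat U_s(\nu)\bigl(\bar\lambda(s) - \lambda_\nu\bigr)\\ &= \sum_{i=1}^j \hat U_s(i)\Bigl(-\sum_{k=j+1}^m q_{ik} + \bar\lambda(s) - \lambda_i\Bigr), \end{aligned}$$

using $q_{ij} = 0$ for i > j and $q_{ii} = -\sum_{k=i+1}^m q_{ik}$, i = 1, …, m − 1. From $q_{im} > \lambda_m - \lambda_i \ge \bar\lambda(s) - \lambda_i$ it follows that

$$\frac{d}{ds}\Bigl(\sum_{\nu=1}^j \hat U_s(\nu)\Bigr) \le 0, \quad j = 1, \dots, m-1.$$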

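At jumps $T_n$ we have from (5.28)

$$\sum_{\nu=1}^j \bigl(\hat U_{T_n}(\nu) - \hat U_{T_n-}(\nu)\bigr) = \sum_{\nu=1}^j \hat U_{T_n-}(\nu)\,\frac{\lambda_\nu - \bar\lambda(T_n-)}{\bar\lambda(T_n-)}.$$

The condition $\lambda_1 \le \dots \le \lambda_m$ ensures that the latter sum is not greater than 0. This is obvious in the case $\lambda_j \le \bar\lambda(T_n-)$; otherwise, if $\lambda_j > \bar\lambda(T_n-)$, this follows from

$$0 = \sum_{\nu=1}^m \hat U_{T_n-}(\nu)\,\frac{\lambda_\nu - \bar\lambda(T_n-)}{\bar\lambda(T_n-)} \ge \sum_{\nu=1}^j \hat U_{T_n-}(\nu)\,\frac{\lambda_\nu - \bar\lambda(T_n-)}{\bar\lambda(T_n-)}.$$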
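For the monotone case to hold it is also necessary that

$$\bigcup_{t \in \mathbb R_+}\Bigl\{\sum_{j=1}^m \hat U_t(j)r_j \le 0\Bigr\} = \Omega,$$

or equivalently τ* < ∞. From (5.24) we obtain by means of the projection theorem

$$\hat U_t(m) = \hat U_0(m) + \int_0^t \sum_{i=1}^{m-1}\hat U_s(i)q_{im}\,ds + \bar m_t(m)$$

with a nonnegative integrand. This shows that $\hat U_t(m)$ is a bounded submartingale. Thus, the limit

$$\hat U_\infty(m) = \lim_{t \to \infty}\hat U_t(m) = E[U_\infty(m) \mid \mathcal A_\infty]$$

exists and is identical to 1, since $\lim_{t \to \infty} Y_t = m$ and hence $U_\infty(m) = 1$.
Because $r_m < 0$, it is possible to choose some ε > 0 such that $(1 - \varepsilon)r_m + \varepsilon\sum_{i=1}^{m-1}|r_i| < 0$. Therefore, we have

$$\tau^* = \inf\Bigl\{t \in \mathbb R_+ : \sum_{j=1}^m \hat U_t(j)r_j \le 0\Bigr\} \le \inf\bigl\{t \in \mathbb R_+ : \hat U_t(m) \ge 1 - \varepsilon\bigr\} < \infty.$$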

(b) To show that $\bar M$ is uniformly integrable, we consider a decomposition of the drift term of the F-SSM representation of Z:

$$\int_0^t \sum_{j=1}^m U_s(j)r_j\,ds = \int_0^t \sum_{j=1}^m U_s(j)(r_j - r_m)\,ds + t\,r_m,$$

where $t\,r_m$ is obviously A-adapted. We use the projection Theorem 3.19, p. 69, in the extended version. To this end we have to show that
1. $Z_0 = u - \sum_{j=1}^m k_jU_0(j)$ and $\int_0^\infty \bigl|\sum_{j=1}^m U_s(j)(r_j - r_m)\bigr|\,ds$ are square integrable, and that
2. M is square integrable.
The details of these parts are omitted here and can be found in [93, 95].
To sum up, by (a) the monotone case holds true for $\hat Z$ with a martingale part $\bar M$, which is by (b) square integrable and hence uniformly integrable. The monotone stopping Theorem 5.9 can then be applied, and the assertion of the theorem follows.


5.4.3 The Case of m=2 States

For two states the stopping problem can be reformulated as follows. At an unobservable random time, say σ, there occurs a switch from state 1 to state 2. Detect this change as well as possible (with respect to the given optimization criterion) by means of the failure process observations. The conditions (5.29) now read

$$\begin{aligned} &q_1 = q_{12} = q > 0, \quad q_2 = q_{21} = 0,\\ &r_1 = c - \lambda_1\mu - q(k_2 - k_1) > 0 > r_2 = c - \lambda_2\mu,\\ &0 < \lambda_1 \le \lambda_2. \end{aligned} \tag{5.31}$$

The conditional distribution of σ can be obtained explicitly as the solution of the above differential equations. To obtain this explicit solution we assume in addition $P(Y_0 = 1) = 1$. The result of the (lengthy) calculations is

$$\hat U_t(2) = P(\sigma \le t \mid \mathcal A_t) = 1 - \frac{e^{-g_n(t)}}{d_n + (\lambda_2 - \lambda_1)\int_{T_n}^t e^{-g_n(s)}\,ds}, \quad T_n \le t < T_{n+1},$$
$$\hat U_{T_n}(2) = \frac{\lambda_2\hat U_{T_n-}(2)}{\lambda_1 + (\lambda_2 - \lambda_1)\hat U_{T_n-}(2)},$$

where $d_n = \bigl(1 - \hat U_{T_n}(2)\bigr)^{-1}$ and $g_n(t) = \bigl(q - (\lambda_2 - \lambda_1)\bigr)(t - T_n)$. The stopping time τ* in (5.30) can now be written as

$$\tau^* = \inf\{t \in \mathbb R_+ : \hat U_t(2) > z^*\}, \quad z^* = \frac{r_1}{r_1 - r_2}.$$
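The explicit two-state solution can be cross-checked against a direct numerical integration. For m = 2, (5.27) reduces between failures to the Riccati equation dÛ_t(2)/dt = (1 − Û_t(2))(q − (λ2 − λ1)Û_t(2)). The sketch below implements the closed form and the jump formula above and compares them with an Euler integration of that equation; the parameter values and failure times are hypothetical. With q < λ2 − λ1, as chosen here, the posterior is pulled toward the level q/(λ2 − λ1) between failures.

```python
import math


def u2_segment(t, p_start, t_start, q, lam1, lam2):
    """Closed form for U-hat_t(2) on [T_n, T_{n+1}) given U-hat_{T_n}(2) = p_start."""
    dl = lam2 - lam1
    a = q - dl                                   # g_n(t) = a * (t - T_n)
    s = t - t_start
    integral = s if a == 0 else (1 - math.exp(-a * s)) / a   # integral of exp(-g_n(u)) over [T_n, t]
    d_n = 1 / (1 - p_start)
    return 1 - math.exp(-a * s) / (d_n + dl * integral)


def u2_jump(p_minus, lam1, lam2):
    """Bayes update of U-hat(2) at a failure time."""
    return lam2 * p_minus / (lam1 + (lam2 - lam1) * p_minus)


def u2_closed(t, failures, q, lam1, lam2):
    """P(sigma <= t | A_t), assuming P(Y_0 = 1) = 1; failures at or before t are included."""
    p, t_n = 0.0, 0.0
    for f in sorted(failures):
        if f > t:
            break
        p = u2_jump(u2_segment(f, p, t_n, q, lam1, lam2), lam1, lam2)
        t_n = f
    return u2_segment(t, p, t_n, q, lam1, lam2)


def u2_euler(t, failures, q, lam1, lam2, steps=50_000):
    """Cross-check: Euler steps of dp/ds = (1 - p)(q - (lam2 - lam1) p) plus the Bayes jumps."""
    ends = sorted(f for f in failures if f <= t) + [t]
    p, s = 0.0, 0.0
    for i, end in enumerate(ends):
        h = (end - s) / steps
        for _ in range(steps):
            p += h * (1 - p) * (q - (lam2 - lam1) * p)
        s = end
        if i < len(ends) - 1:                    # a failure occurred at `end`
            p = u2_jump(p, lam1, lam2)
    return p
```

For λ1 = λ2 the closed form collapses to 1 − e^{−qt}, in agreement with Remark 5.32.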

For 0 < q < λ2 − λ1 , Ût (2) increases as long as Ût (2) < q/(λ2 − λ1 ) = r. When
Ût (2) jumps above this level, then between jumps Ût (2) decreases but not
below the level r. So even in this case under conditions (5.31) the monotone
case holds true if z ∗ ≤ q/(λ2 − λ1 ). As a consequence of Theorem 5.30 we
have the following corollary.

Corollary 5.31. Assume conditions (5.31) with stopping rule $\tau^* = \inf\{t \in \mathbb R_+ : \hat U_t(2) > z^*\}$. Then τ* is optimal in $C^{\mathbb A}$ if either $q > \lambda_2 - \lambda_1$ or $z^* \le q/(\lambda_2 - \lambda_1)$.
Remark 5.32. If the failure rates in both states coincide, i.e., $\lambda_1 = \lambda_2$, the observation of the failure time points should give no additional information about the change time point from state 1 to state 2. Indeed, in this case the conditional distribution of σ is deterministic,

$$P(\sigma \le t \mid \mathcal{A}_t) = P(\sigma \le t) = 1 - \exp\{-qt\},$$

and $\tau^*$ is a constant. As is to be expected, random observations are useless in this case.
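The filter formulas above are straightforward to evaluate numerically. The following Python sketch (standard library only; all function names are illustrative and not from the text) computes $\hat U_t(2)$ from a list of observed failure times. Between failures it uses the closed form, and at each failure time it applies the Bayes-type update, assuming $P(Y_0 = 1) = 1$. The stopping time $\tau^*$ is approximated by scanning a time grid, which simplifies the continuous-time infimum.

```python
import math

def between_jumps(u_tn, t, t_n, q, lam1, lam2):
    """U_t(2) for T_n <= t < T_{n+1}, given U_{T_n}(2) = u_tn < 1 (closed form)."""
    a = q - (lam2 - lam1)                 # g_n(t) = a * (t - T_n)
    dt = t - t_n
    d_n = 1.0 / (1.0 - u_tn)
    # integral_{T_n}^{t} exp(-g_n(s)) ds, evaluated in closed form
    integral = dt if abs(a) < 1e-12 else (1.0 - math.exp(-a * dt)) / a
    return 1.0 - math.exp(-a * dt) / (d_n + (lam2 - lam1) * integral)

def at_jump(u_before, lam1, lam2):
    """Update U_{T_n}(2) from the left limit U_{T_n-}(2) at a failure time."""
    return lam2 * u_before / (lam1 + (lam2 - lam1) * u_before)

def posterior(t, failure_times, q, lam1, lam2):
    """U_t(2) = P(sigma <= t | A_t), assuming P(Y_0 = 1) = 1, i.e. U_0(2) = 0."""
    u, t_n = 0.0, 0.0
    for s in sorted(failure_times):
        if s > t:
            break
        u = at_jump(between_jumps(u, s, t_n, q, lam1, lam2), lam1, lam2)
        t_n = s
    return between_jumps(u, t, t_n, q, lam1, lam2)

def stopping_time(failure_times, q, lam1, lam2, r1, r2, horizon, dt=1e-3):
    """Grid approximation of tau* = inf{t : U_t(2) > z*}, z* = r1 / (r1 - r2)."""
    z_star = r1 / (r1 - r2)
    for k in range(int(horizon / dt) + 1):
        t = k * dt
        if posterior(t, failure_times, q, lam1, lam2) > z_star:
            return t
    return math.inf

print(posterior(1.0, [0.5], 0.3, 1.0, 2.0))
```

With $\lambda_1 = \lambda_2$ the update at a failure time leaves $\hat U_t(2)$ unchanged, so `posterior` reduces to $1 - e^{-qt}$, which agrees with Remark 5.32.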
5.5 Maintenance Optimization Models Under Constraints 215

In general, the value of the stopping problem $\sup\{EZ_\tau : \tau \in C^{\mathbb{A}}\}$, the best possible expected net gain, cannot be determined explicitly. It is, however, possible to determine bounds for this value. For this, the semimartingale representation turns out to be useful again, because it allows, by means of the projection theorem, comparisons of different information levels. The constant stopping times are contained in $C^{\mathbb{A}}$, and $C^{\mathbb{A}} \subset C^{\mathbb{F}}$. Therefore, the following inequality applies:

$$\sup\{EZ_t : t \in \mathbb{R}_+\} \le \sup\{EZ_\tau : \tau \in C^{\mathbb{A}}\} \le \sup\{EZ_\tau : \tau \in C^{\mathbb{F}}\}.$$

At the complete information level $\mathbb{F}$ the change time point σ can be observed, and it is obvious that under conditions (5.31) the $\mathbb{F}$-stopping time σ is optimal in $C^{\mathbb{F}}$. Thus, we have the following lower and upper bounds $b_l$ and $b_u$:

$$b_l \le \sup\{EZ_\tau : \tau \in C^{\mathbb{A}}\} \le b_u$$

with

$$b_l = \sup\{EZ_t : t \in \mathbb{R}_+\}, \qquad b_u = \sup\{EZ_\tau : \tau \in C^{\mathbb{F}}\} = EZ_\sigma.$$

Some elementary calculations yield

$$b_l = u - k_2 + \frac{1}{q}(c - \lambda_1\mu) - \frac{r_2}{q}\ln\frac{-r_2}{r_1 - r_2},$$

$$b_u = u - k_2 + \frac{1}{q}(c - \lambda_1\mu).$$

For $\lambda_1 = \lambda_2$ the optimal stopping time is deterministic, so in this case the lower bound is attained.
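Because both bounds are explicit, they can be evaluated directly. In the sketch below, the parameter values ($u$, $k_1$, $k_2$, $c$, $\mu$, $q$, $\lambda_1$, $\lambda_2$) are hypothetical and only need to satisfy (5.31). The function name is illustrative.

```python
import math

def stopping_value_bounds(u, k1, k2, c, mu, q, lam1, lam2):
    """Return (b_l, b_u) for the two-state model; raises if (5.31) is violated."""
    r1 = c - lam1 * mu - q * (k2 - k1)
    r2 = c - lam2 * mu
    if not (q > 0 and r1 > 0 > r2 and 0 < lam1 <= lam2):
        raise ValueError("conditions (5.31) are not satisfied")
    b_u = u - k2 + (c - lam1 * mu) / q
    # b_l = b_u - (r2 / q) * ln(-r2 / (r1 - r2)); the log term is negative and r2 < 0
    b_l = b_u - (r2 / q) * math.log(-r2 / (r1 - r2))
    return b_l, b_u

bl, bu = stopping_value_bounds(u=10.0, k1=1.0, k2=2.0, c=3.0, mu=1.0, q=0.5, lam1=1.0, lam2=5.0)
print(bl, bu)
```

For these illustrative values $r_1 = 1.5$ and $r_2 = -2$. The gap $b_u - b_l = (r_2/q)\ln(-r_2/(r_1 - r_2))$ is positive because both factors are negative.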

5.5 Maintenance Optimization Models Under Constraints
In this section we consider two models: the first one is a so-called delay time
model with safety constraints. The aim is to determine optimal inspection in-
tervals minimizing the expected discounted costs under the safety constraints.
The second model is also about optimal inspection but here the system is
represented by a monotone (coherent) structure function. The state of the
components and the system is only revealed through inspections.

5.5.1 A Delay Time Model with Safety Constraints

In many cases, the presence of a fault in a system does not lead to an immediate system failure; the system stays in a “defective” state. There will be a time lapse between the occurrence of the fault and the failure of the system, a “delay time”. This is the idea of the delay time models, which have been thoroughly discussed in the literature. See the Bibliographic Notes at the end of the chapter.
The delay time models are used as bases for determining monitoring strate-
gies for detecting system defects or faults. The state of the system is revealed
by inspections, except for failures which are observed. The basic delay time
model was introduced for analyzing inspection policies for systems regularly
inspected each T time units. If an inspection is carried out during the delay
time period, the defect is identified and removed. Thus, the delay time model
is based on the simplest monitoring framework possible: a defective state and
a nondefective state. In most of the models, the objective of the delay time
analysis is to determine optimal inspection times that minimize the (expected)
long-run average costs or downtimes.
The framework in the present analysis is the basic delay time model sub-
ject to regular inspections every T units of time. If a defect is detected by
an inspection, a preventive replacement is performed. If the system fails, a
corrective replacement is carried out. A replacement brings the system back
to the initial state. A cost is incurred at each inspection.
Furthermore, safety constraints are introduced, related to two important safety aspects: the number of failures of the system and the time spent in the defective state (the delay time). These quantities are controlled by bounding the probability that at least one system failure occurs during a given interval of time, and by bounding the long-run fraction of time the system spends in the defective state.
The objective of the analysis is to determine an optimal inspection interval
T that minimizes the total expected discounted costs under the two safety
constraints.
If α is a positive discount factor, a cost C at time t has a value of $Ce^{-\alpha t}$ at time 0. Let $T_i$ be the length of the ith replacement cycle and $C_i$ the total discounted costs associated with the ith replacement cycle. Then the total expected discounted costs incurred can be written as (see Sect. 5.3.3)

$$\frac{EC_1}{1 - E[e^{-\alpha T_1}]}. \tag{5.32}$$
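Formula (5.32) can be checked by simulation. For illustration only, the sketch below assumes exponentially distributed cycle lengths with rate 2 and a cost of 5 paid at the end of each cycle. Then $C_1 = 5e^{-\alpha T_1}$, $EC_1 = 5\cdot 2/(2+\alpha)$ and $E[e^{-\alpha T_1}] = 2/(2+\alpha)$, so (5.32) equals $5\cdot 2/\alpha = 25$ for α = 0.4. The infinite horizon is truncated after 150 cycles.

```python
import math
import random

def renewal_discounted_cost(mean_c1, mean_discount_t1):
    """Formula (5.32): E C_1 / (1 - E[exp(-alpha T_1)])."""
    return mean_c1 / (1.0 - mean_discount_t1)

def simulate_discounted_cost(alpha, rate, cost, n_paths=4000, n_cycles=150, seed=1):
    """Monte Carlo estimate of the total expected discounted cost of a renewal process
    with Exp(rate) cycle lengths and a payment `cost` at the end of every cycle."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n_paths):
        t = total = 0.0
        for _ in range(n_cycles):          # exp(-alpha * t) is negligible after 150 cycles here
            t += rng.expovariate(rate)
            total += cost * math.exp(-alpha * t)
        acc += total
    return acc / n_paths

alpha, rate, cost = 0.4, 2.0, 5.0
exact = renewal_discounted_cost(cost * rate / (rate + alpha), rate / (rate + alpha))
estimate = simulate_discounted_cost(alpha, rate, cost)
print(exact, estimate)
```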

To explicitly take into account risk and uncertainties we introduce two safety
constraints. Below these are defined and the results are compared.
In practice we may consider different levels for the safety constraint. The
optimization produces decision support by providing information about the
consequences of imposing various safety-level requirements.
Before we search for an optimal inspection time T , we need to specify
the optimization model in detail.

Problem Definition and Formulation

We consider a system subject to failures and make the following assumptions.


1. The failure of the system is revealed immediately, and the system is re-
placed. The replacement time is negligible and the cost of this corrective
maintenance is Cc .
2. Before failure occurs, the system passes through a defective state. Let X
be a random variable representing the time to the occurrence of a fault
and Y a random variable representing the time in the defective state,
in case of no replacement of the system. We denote by F and G the
distributions of X and Y , respectively. We assume that F and G have
densities f and g, respectively. Furthermore, we assume that X and Y
have finite expectations.
3. All random variables X and Y are independent.
4. Whether or not the system is in a defective state can only be determined
by inspection.
5. An inspection takes place every T units of time, and the cost of each
inspection is CI . These inspections are perfect in the sense that if the
system is in a defective state, this will be identified by the inspection. If
a defect is identified at an inspection, the system will be replaced by a
new one. The replacement time is negligible. The cost of this preventive
maintenance is Cp , where 0 < CI < Cp < Cc < ∞.
The assumption CI < Cp < Cc is justified by the following type of argu-
ments. The inspection tasks are assumed to be rather straightforward activi-
ties, whereas preventive maintenance tasks are more extensive operations that
involve repairs and replacements of the units. Hence it is reasonable to assume
CI < Cp . Furthermore, the corrective maintenance tasks cost more than the
preventive maintenance tasks as the replacement of the system is unplanned;
hence Cp < Cc .
Consider a replacement cycle defined by the time interval between replacements of the system caused by a preventive maintenance or by a corrective maintenance. Let $X_T$ be a random variable representing the time between replacements of the system, i.e., for k = 0, 1, 2, . . .,

$$X_T = \begin{cases} (k+1)T, & kT < X < (k+1)T \le X + Y,\\ X + Y, & kT < X < X + Y \le (k+1)T. \end{cases}$$
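The cycle length $X_T$ and the type of replacement can be simulated directly from this definition. In the sketch below, the Weibull samplers and their parameters are illustrative choices. Python's `random.weibullvariate(alpha, beta)` takes a scale parameter, here $1/\lambda$, and a shape parameter β. The number of inspections in a cycle is k + 1 for a preventive replacement and k for a corrective one.

```python
import math
import random

def simulate_cycle(T, sample_x, sample_y, rng):
    """One replacement cycle of the basic delay time model with inspections every T.
    Returns (X_T, kind of replacement, number of inspections performed in the cycle)."""
    x, y = sample_x(rng), sample_y(rng)    # time to defect, delay time
    k = math.floor(x / T)                  # kT <= x < (k+1)T
    next_inspection = (k + 1) * T
    if x + y >= next_inspection:           # defect still present at (k+1)T: preventive replacement
        return next_inspection, "preventive", k + 1
    return x + y, "corrective", k          # failure before the next inspection

def weibull_sampler(lam, beta):
    """Sampler for the survival function exp(-(lam t)^beta)."""
    return lambda rng: rng.weibullvariate(1.0 / lam, beta)

rng = random.Random(0)
print([simulate_cycle(0.5, weibull_sampler(1.0, 2.0), weibull_sampler(1.0, 3.0), rng) for _ in range(3)])
```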

Let $\bar F_T$ be the survival function of $X_T$. By conditioning on $X = u$, we see that

$$\bar F_T(t) = \bar F(t) + \int_{[t/T]T}^{t} f(u)\bar G(t-u)\,du, \qquad t \ge 0, \tag{5.33}$$

where [x] denotes the integer part of x. From (5.33) we obtain the following lemma:
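Formula (5.33) can be evaluated by numerical integration and compared with simulated cycles. The sketch uses a simple composite Simpson rule. The Weibull parameters (λ = 1, β = 2 for X; λ = 1, β = 3 for Y) and T = 0.5 are illustrative assumptions.

```python
import math
import random

def simpson(h, a, b, n=200):
    """Composite Simpson rule for h on [a, b] (n is rounded up to an even number)."""
    if b <= a:
        return 0.0
    n += n % 2
    step = (b - a) / n
    s = h(a) + h(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * h(a + i * step)
    return s * step / 3

def weibull(lam, beta):
    """Survival function and density for the survival function exp(-(lam t)^beta)."""
    def surv(t):
        return math.exp(-(lam * t) ** beta) if t > 0 else 1.0
    def dens(t):
        return beta * lam * (lam * t) ** (beta - 1) * math.exp(-(lam * t) ** beta) if t > 0 else 0.0
    return surv, dens

def survival_XT(t, T, Fbar, f, Gbar):
    """Formula (5.33) for P(X_T > t)."""
    lower = math.floor(t / T) * T
    return Fbar(t) + simpson(lambda u: f(u) * Gbar(t - u), lower, t)

def sample_XT(T, lam1, b1, lam2, b2, rng):
    x = rng.weibullvariate(1 / lam1, b1)
    y = rng.weibullvariate(1 / lam2, b2)
    nxt = (math.floor(x / T) + 1) * T
    return nxt if x + y >= nxt else x + y

Fbar, f = weibull(1.0, 2.0)
Gbar, _ = weibull(1.0, 3.0)
T = 0.5
rng = random.Random(11)
samples = [sample_XT(T, 1.0, 2.0, 1.0, 3.0, rng) for _ in range(20000)]
for t in (0.3, 0.8, 1.3):
    empirical = sum(s > t for s in samples) / len(samples)
    print(t, survival_XT(t, T, Fbar, f, Gbar), empirical)
```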
lemma:

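Lemma 5.33.

$$1 - E\left[e^{-\alpha X_T}\right] = \int_0^\infty \alpha e^{-\alpha t}\bar F(t)\,dt + \sum_{k=0}^\infty \int_{kT}^{(k+1)T} f(u)\,\alpha e^{-\alpha u}\left(\int_0^{(k+1)T-u} e^{-\alpha v}\bar G(v)\,dv\right)du.$$

Proof. Denoting by $f_T$ the density function of $X_T$, one obtains, integrating by parts,

$$1 - E\left[e^{-\alpha X_T}\right] = 1 - \int_0^\infty e^{-\alpha t} f_T(t)\,dt = \int_0^\infty \alpha e^{-\alpha t}\bar F_T(t)\,dt.$$

Furthermore, using (5.33), we see that $1 - E[e^{-\alpha X_T}]$ can be written as

$$\begin{aligned}
&\int_0^\infty \alpha e^{-\alpha t}\bar F(t)\,dt + \int_0^\infty \alpha e^{-\alpha t}\left(\int_{[t/T]T}^{t} f(u)\bar G(t-u)\,du\right)dt\\
&= \int_0^\infty \alpha e^{-\alpha t}\bar F(t)\,dt + \sum_{k=0}^\infty \int_{kT}^{(k+1)T} \alpha e^{-\alpha t}\left(\int_{kT}^{t} f(u)\bar G(t-u)\,du\right)dt\\
&= \int_0^\infty \alpha e^{-\alpha t}\bar F(t)\,dt + \sum_{k=0}^\infty \int_{kT}^{(k+1)T} f(u)\left(\int_u^{(k+1)T} \alpha e^{-\alpha t}\bar G(t-u)\,dt\right)du\\
&= \int_0^\infty \alpha e^{-\alpha t}\bar F(t)\,dt + \sum_{k=0}^\infty \int_{kT}^{(k+1)T} \alpha e^{-\alpha u} f(u)\left(\int_u^{(k+1)T} e^{-\alpha(t-u)}\bar G(t-u)\,dt\right)du\\
&= \int_0^\infty \alpha e^{-\alpha t}\bar F(t)\,dt + \sum_{k=0}^\infty \int_{kT}^{(k+1)T} \alpha e^{-\alpha u} f(u)\left(\int_0^{(k+1)T-u} e^{-\alpha t}\bar G(t)\,dt\right)du,
\end{aligned}$$

which shows that the lemma holds.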
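The following is a numerical check of Lemma 5.33 under the same illustrative Weibull assumptions. The right-hand side is evaluated by nested Simpson integration. The infinite integrals and the sum over k are truncated at t = 8, where $\bar F$ is negligible for these parameters (an assumption). The result is compared with a Monte Carlo estimate of $1 - E[e^{-\alpha X_T}]$.

```python
import math
import random

def simpson(h, a, b, n=200):
    """Composite Simpson rule for h on [a, b] (n is rounded up to an even number)."""
    if b <= a:
        return 0.0
    n += n % 2
    step = (b - a) / n
    s = h(a) + h(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * h(a + i * step)
    return s * step / 3

def weibull(lam, beta):
    """Survival function and density for the survival function exp(-(lam t)^beta)."""
    def surv(t):
        return math.exp(-(lam * t) ** beta) if t > 0 else 1.0
    def dens(t):
        return beta * lam * (lam * t) ** (beta - 1) * math.exp(-(lam * t) ** beta) if t > 0 else 0.0
    return surv, dens

def one_minus_laplace_XT(alpha, T, Fbar, f, Gbar, horizon=8.0, n=100):
    """Right-hand side of Lemma 5.33, truncated at `horizon`."""
    first = simpson(lambda t: alpha * math.exp(-alpha * t) * Fbar(t), 0.0, horizon, 400)

    def inner(s):                          # integral_0^s e^{-alpha v} G-bar(v) dv
        return simpson(lambda v: math.exp(-alpha * v) * Gbar(v), 0.0, s, 60)

    second, k = 0.0, 0
    while k * T < horizon:
        end = (k + 1) * T
        second += simpson(lambda u: f(u) * alpha * math.exp(-alpha * u) * inner(end - u), k * T, end, n)
        k += 1
    return first + second

Fbar, f = weibull(1.0, 2.0)
Gbar, _ = weibull(1.0, 3.0)
alpha, T = 0.4, 0.5
lemma_value = one_minus_laplace_XT(alpha, T, Fbar, f, Gbar)

rng = random.Random(4)
draws = []
for _ in range(20000):
    x, y = rng.weibullvariate(1.0, 2.0), rng.weibullvariate(1.0, 3.0)
    nxt = (math.floor(x / T) + 1) * T
    draws.append(1.0 - math.exp(-alpha * (nxt if x + y >= nxt else x + y)))
mc_value = sum(draws) / len(draws)
print(lemma_value, mc_value)
```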




From the assumptions of the model, a cost $C_p$ is incurred whenever a preventive maintenance is performed. Hence, the expected discounted costs associated with the preventive maintenance in a replacement cycle are given by

$$C_p \sum_{k=0}^\infty e^{-\alpha(k+1)T}\int_{kT}^{(k+1)T} f(u)\bar G((k+1)T - u)\,du, \tag{5.34}$$

noting that if $X = u$ and $kT < u \le (k+1)T$, the system is replaced at $(k+1)T$ if the delay time exceeds $(k+1)T - u$.
Analogously, we obtain that the expected discounted costs associated with the corrective maintenance in a replacement cycle equal

$$C_c \sum_{k=0}^\infty \int_{kT}^{(k+1)T} f(u)\left(\int_u^{(k+1)T} g(v-u)e^{-\alpha v}\,dv\right)du, \tag{5.35}$$

observing that if $X = u$ and $kT < u \le (k+1)T$, the system is replaced at v if the delay time is $v - u$ and $v < (k+1)T$.
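Furthermore, a cost $C_I$ is incurred at each inspection, and the expected discounted costs associated with these actions equal

$$C_I \sum_{k=0}^\infty \sum_{i=1}^{k+1} e^{-\alpha iT}\int_{kT}^{(k+1)T} f(u)\bar G((k+1)T-u)\,du + C_I \sum_{k=1}^\infty \sum_{i=1}^{k} e^{-\alpha iT}\int_{kT}^{(k+1)T} f(u)G((k+1)T-u)\,du,$$

or, since $G + \bar G = 1$, rewritten as

$$C_I \sum_{k=0}^\infty e^{-\alpha(k+1)T}\int_{kT}^{(k+1)T} f(u)\bar G((k+1)T-u)\,du + C_I \sum_{k=1}^\infty \sum_{i=1}^{k} e^{-\alpha iT}\int_{kT}^{(k+1)T} f(u)\,du. \tag{5.36}$$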

Notice that the expression

$$\sum_{k=0}^\infty e^{-\alpha(k+1)T}\int_{kT}^{(k+1)T} f(u)\bar G((k+1)T-u)\,du,$$

which appears in (5.34) and (5.36), can be expressed as

$$\sum_{k=0}^\infty \int_0^T f(u+kT)e^{-\alpha(u+kT)}e^{-\alpha(T-u)}\bar G(T-u)\,du,$$

and finally, as a consequence of the Monotone Convergence Theorem (see Appendix A.2.3), we obtain

$$\sum_{k=0}^\infty e^{-\alpha(k+1)T}\int_{kT}^{(k+1)T} f(u)\bar G((k+1)T-u)\,du = \int_0^T h_T(u)e^{-\alpha(T-u)}\bar G(T-u)\,du,$$
where, for T > 0, $h_T(u)$ is equal to

$$h_T(u) = \sum_{k=0}^\infty f(u+kT)e^{-\alpha(u+kT)}, \qquad 0 \le u \le T. \tag{5.37}$$
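In practice the series (5.37) converges quickly. The sketch below truncates it after a fixed number of terms (an assumption). It checks the result against the closed form obtained when X is exponential with rate μ, where the geometric series gives $h_T(u) = \mu e^{-(\mu+\alpha)u}/(1 - e^{-(\mu+\alpha)T})$. It also checks that $\int_0^T h_T(u)\,du = E[e^{-\alpha X}]$, which shows how $h_T$ folds all inspection intervals into a single one.

```python
import math

def h_T(u, T, alpha, f, n_terms=200):
    """Truncated version of (5.37): sum over k < n_terms of f(u + kT) exp(-alpha (u + kT))."""
    return sum(f(u + k * T) * math.exp(-alpha * (u + k * T)) for k in range(n_terms))

def simpson(h, a, b, n=200):
    """Composite Simpson rule for h on [a, b] (n is rounded up to an even number)."""
    if b <= a:
        return 0.0
    n += n % 2
    step = (b - a) / n
    s = h(a) + h(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * h(a + i * step)
    return s * step / 3

mu, alpha, T = 1.0, 0.4, 0.5                       # illustrative exponential example
f = lambda t: mu * math.exp(-mu * t)
print(h_T(0.2, T, alpha, f), simpson(lambda u: h_T(u, T, alpha, f), 0.0, T))
```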

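We denote by $C_d(T)$ the total expected discounted costs in $[0, \infty)$. By (5.32) we can focus on the first cycle. From Lemma 5.33, (5.34), (5.35) and (5.36) we obtain the following expression for $C_d(T)$:

$$C_d(T) = \frac{C_I \sum_{k=1}^\infty \sum_{i=1}^{k} e^{-\alpha iT}\int_{kT}^{(k+1)T} f(u)\,du + \int_0^T h_T(u)c(T-u)\,du}{1 + \int_0^T h_T(u)\left(D(T-u) - 1\right)du}, \tag{5.38}$$

where $h_T(u)$ is given by (5.37) and, for $0 \le u \le T$,

$$c(u) = (C_p + C_I)e^{-\alpha u}\bar G(u) + C_c\int_0^u g(v)e^{-\alpha v}\,dv, \tag{5.39}$$

$$D(u) = \int_0^u \alpha e^{-\alpha v}\bar G(v)\,dv. \tag{5.40}$$

With this definition of D, the denominator of (5.38) equals $1 - E[e^{-\alpha X_T}]$ as given in Lemma 5.33.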
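Expression (5.38) can be evaluated numerically and compared with a direct simulation of the discounted cost process. In the sketch, the denominator is computed as $1 + \int_0^T h_T(u)\left(\alpha\int_0^{T-u} e^{-\alpha v}\bar G(v)\,dv - 1\right)du$, which equals $1 - E[e^{-\alpha X_T}]$ by Lemma 5.33. The inspection term uses $\sum_{i=1}^k e^{-\alpha iT} = e^{-\alpha T}(1 - e^{-\alpha kT})/(1 - e^{-\alpha T})$ and $\int_{kT}^{(k+1)T} f(u)\,du = \bar F(kT) - \bar F((k+1)T)$. The costs $C_p = 400$, $C_c = 1000$, $C_I = 100$, the discount factor α = 0.4 and the Weibull shapes β1 = 2, β2 = 3 are taken from the numerical example later in this section. The truncation horizon and the quadrature sizes are assumptions.

```python
import math
import random

def simpson(h, a, b, n=200):
    """Composite Simpson rule for h on [a, b] (n is rounded up to an even number)."""
    if b <= a:
        return 0.0
    n += n % 2
    step = (b - a) / n
    s = h(a) + h(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * h(a + i * step)
    return s * step / 3

def weibull(lam, beta):
    """Survival function and density for the survival function exp(-(lam t)^beta)."""
    def surv(t):
        return math.exp(-(lam * t) ** beta) if t > 0 else 1.0
    def dens(t):
        return beta * lam * (lam * t) ** (beta - 1) * math.exp(-(lam * t) ** beta) if t > 0 else 0.0
    return surv, dens

def expected_discounted_cost(T, alpha, cp, cc, ci, Fbar, f, Gbar, g, horizon=8.0):
    """C_d(T) from (5.38)-(5.39); sums over k are truncated once kT exceeds `horizon`."""
    K = int(horizon / T) + 1

    def h_T(u):                            # (5.37), truncated
        return sum(f(u + k * T) * math.exp(-alpha * (u + k * T)) for k in range(K + 1))

    def c(s):                              # (5.39)
        failure_part = simpson(lambda v: g(v) * math.exp(-alpha * v), 0.0, s, 60)
        return (cp + ci) * math.exp(-alpha * s) * Gbar(s) + cc * failure_part

    def d_alpha(s):                        # alpha * integral_0^s e^{-alpha v} G-bar(v) dv
        return simpson(lambda v: alpha * math.exp(-alpha * v) * Gbar(v), 0.0, s, 60)

    q = math.exp(-alpha * T)
    inspections = ci * sum(q * (1 - q ** k) / (1 - q) * (Fbar(k * T) - Fbar((k + 1) * T))
                           for k in range(1, K + 1))
    numerator = inspections + simpson(lambda u: h_T(u) * c(T - u), 0.0, T)
    denominator = 1.0 + simpson(lambda u: h_T(u) * (d_alpha(T - u) - 1.0), 0.0, T)
    return numerator / denominator

def simulate_cost(T, alpha, cp, cc, ci, lam1, b1, lam2, b2, n_paths=4000, t_max=60.0, seed=7):
    """Monte Carlo estimate of C_d(T): discounted costs of successive cycles up to t_max."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n_paths):
        start = total = 0.0
        while start < t_max:               # exp(-alpha * t_max) is negligible for alpha = 0.4
            x = rng.weibullvariate(1.0 / lam1, b1)
            y = rng.weibullvariate(1.0 / lam2, b2)
            k = math.floor(x / T)
            # inspections at start + iT, i = 1..k, find the system non-defective
            total += ci * sum(math.exp(-alpha * (start + i * T)) for i in range(1, k + 1))
            if x + y >= (k + 1) * T:       # defect detected at the next inspection
                start += (k + 1) * T
                total += (cp + ci) * math.exp(-alpha * start)
            else:                          # failure before the next inspection
                start += x + y
                total += cc * math.exp(-alpha * start)
        acc += total
    return acc / n_paths

Fbar, f = weibull(1.0, 2.0)
Gbar, g = weibull(1.0, 3.0)
cd = expected_discounted_cost(1.0, 0.4, 400.0, 1000.0, 100.0, Fbar, f, Gbar, g)
mc = simulate_cost(1.0, 0.4, 400.0, 1000.0, 100.0, 1.0, 2.0, 1.0, 3.0)
print(cd, mc)
```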

Two safety conditions are introduced in this model. The first one is related to
the occurrences of system failures, whereas the second is related to the time
spent in a defective state.

Safety Constraint 1: Bound on the Probability of a System Failure

The first constraint is implemented by bounding the probability of occurrence of one or more failures of the system in an interval [0, A]. Denoting by $N_{c,T}(A)$ the number of failures of the system in [0, A] with inspections every T time units, the safety constraint is expressed as

$$P(N_{c,T}(A) \ge 1) \le \omega_1,$$

with $0 < \omega_1 < 1$, or equivalently

$$1 - P(N_{c,T}(A) = 0) \le \omega_1.$$

Let $X_{c,T}$ be the time between successive corrective maintenances. Then

$$P(N_{c,T}(A) = 0) = \bar F_{c,T}(A),$$

where $\bar F_{c,T}$ represents the survival function of $X_{c,T}$. The following lemma gives an analytical expression for $\bar F_{c,T}$.
Lemma 5.34. The survival function $\bar F_{c,T}$ of $X_{c,T}$, representing the time between successive corrective maintenances, can be written in the following way:

$$\bar F_{c,T}(t) = \sum_{i=0}^{k} B_{i,T}\left(\bar F(t - iT) + \int_{kT}^{t} f(u - iT)\bar G(t-u)\,du\right), \qquad kT \le t \le (k+1)T,\ k = 0, 1, 2, \ldots, \tag{5.41}$$

where the coefficient $B_{i,T}$ equals the probability of a preventive maintenance at $iT$ and is obtained using the recursive formulas

$$B_{0,T} = 1, \qquad B_{k+1,T} = \sum_{i=0}^{k} B_{i,T}\int_{kT}^{(k+1)T} f(u - iT)\bar G((k+1)T - u)\,du, \quad k = 0, 1, 2, \ldots$$

Proof. Notice that we can express $\bar F_{c,T}(t)$ as

$$\bar F_{c,T}(t) = \sum_{i=0}^{k} B_{i,T}P_{k,i,T}(t), \qquad kT \le t \le (k+1)T,$$

where $B_{i,T}$ represents the probability of a preventive maintenance at $iT$, $1 \le i \le k$, and $P_{k,i,T}(t)$ represents the probability that the system does not fail in $(iT, t]$ and no preventive maintenance is performed in this interval. If no preventive maintenance is performed in $(iT, t]$, then either no defect of the system arises in $(iT, t]$, or a defect arises in $[kT, t)$ but does not lead to a failure before t. Hence,

$$P_{k,i,T}(t) = \bar F(t - iT) + \int_{kT}^{t} f(u - iT)\bar G(t-u)\,du, \qquad kT \le t \le (k+1)T,\ 0 \le i \le k.$$

The probabilities $B_{i,T}$ are obtained recursively as follows. For i = 0, $B_{0,T}$, the probability of a preventive maintenance at 0, is equal to 1. For i = 1, $B_{1,T}$ represents the probability of a preventive maintenance at T, and it is equal to

$$B_{1,T} = \int_0^T f(u)\bar G(T-u)\,du.$$

Analogously, for i = 2, $B_{2,T}$ represents the probability of a preventive maintenance at 2T. If a preventive maintenance is performed at 2T, then the first preventive maintenance is at T or at 2T. If the first preventive maintenance is at T and the second one is at 2T, then a fault arises at some time $u < T$ without leading to a failure before T, and a fault of the renewed system arises at some time $v \in (T, 2T)$ without leading to a failure before 2T. This event has the probability

$$\left(\int_0^T f(u)\bar G(T-u)\,du\right)\left(\int_T^{2T} f(v-T)\bar G(2T-v)\,dv\right).$$

If the first preventive maintenance is performed at 2T, the system fault arises at some time $u \in (T, 2T)$ but does not lead to a failure before 2T. The associated probability is equal to

$$\int_T^{2T} f(u)\bar G(2T-u)\,du.$$

Summing over these mutually exclusive events, we obtain

$$\begin{aligned}
&\left(\int_0^T f(u)\bar G(T-u)\,du\right)\left(\int_T^{2T} f(u-T)\bar G(2T-u)\,du\right) + \int_T^{2T} f(u)\bar G(2T-u)\,du\\
&= B_{0,T}\int_T^{2T} f(u)\bar G(2T-u)\,du + B_{1,T}\int_T^{2T} f(u-T)\bar G(2T-u)\,du\\
&= \sum_{i=0}^{1} B_{i,T}\int_T^{2T} f(u-iT)\bar G(2T-u)\,du = B_{2,T},
\end{aligned}$$

which is the desired result.

A preventive maintenance at $(k+1)T$ occurs if and only if, for some $0 \le i \le k$, there is a preventive maintenance at $iT$, no fault of the system in $(iT, kT)$, and a defect in $[kT, (k+1)T)$ which does not lead to a failure before $(k+1)T$. Following the same type of arguments as above, this event has the probability

$$\sum_{i=0}^{k} B_{i,T}\int_{kT}^{(k+1)T} f(u-iT)\bar G((k+1)T-u)\,du.$$

Hence the result holds.
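The recursion for $B_{i,T}$ and formula (5.41) translate directly into code. This sketch uses illustrative Weibull parameters, T = 1, and Simpson quadrature. It compares $\bar F_{c,T}(t)$ with a simulation of the time to the first corrective maintenance, in which each preventive replacement restarts the system as new. Function names are hypothetical.

```python
import math
import random

def simpson(h, a, b, n=200):
    """Composite Simpson rule for h on [a, b] (n is rounded up to an even number)."""
    if b <= a:
        return 0.0
    n += n % 2
    step = (b - a) / n
    s = h(a) + h(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * h(a + i * step)
    return s * step / 3

def weibull(lam, beta):
    """Survival function and density for the survival function exp(-(lam t)^beta)."""
    def surv(t):
        return math.exp(-(lam * t) ** beta) if t > 0 else 1.0
    def dens(t):
        return beta * lam * (lam * t) ** (beta - 1) * math.exp(-(lam * t) ** beta) if t > 0 else 0.0
    return surv, dens

def pm_probabilities(T, k_max, f, Gbar, n=200):
    """B_{0,T}, ..., B_{k_max,T} from the recursion in Lemma 5.34."""
    B = [1.0]
    for k in range(k_max):
        end = (k + 1) * T
        B.append(sum(B[i] * simpson(lambda u: f(u - i * T) * Gbar(end - u), k * T, end, n)
                     for i in range(k + 1)))
    return B

def survival_corrective(t, T, Fbar, f, Gbar, n=200):
    """(5.41): P(X_{c,T} > t), the probability of no failure in [0, t]."""
    k = math.floor(t / T)
    B = pm_probabilities(T, k, f, Gbar, n)
    return sum(B[i] * (Fbar(t - i * T) + simpson(lambda u: f(u - i * T) * Gbar(t - u), k * T, t, n))
               for i in range(k + 1))

def sample_first_failure(T, lam1, b1, lam2, b2, rng):
    """Time of the first corrective maintenance; preventive replacements renew the system."""
    start = 0.0
    while True:
        x = rng.weibullvariate(1 / lam1, b1)
        y = rng.weibullvariate(1 / lam2, b2)
        nxt = (math.floor(x / T) + 1) * T
        if x + y < nxt:
            return start + x + y
        start += nxt

Fbar, f = weibull(1.0, 2.0)
Gbar, _ = weibull(1.0, 3.0)
T = 1.0
rng = random.Random(5)
times = [sample_first_failure(T, 1.0, 2.0, 1.0, 3.0, rng) for _ in range(20000)]
for t in (0.7, 1.5, 3.0):
    print(t, survival_corrective(t, T, Fbar, f, Gbar), sum(s > t for s in times) / len(times))
```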




Using (5.41), the safety constraint can be formulated as

$$a_A(T) \le \omega_1, \tag{5.42}$$

where $0 < \omega_1 < 1$ and

$$a_A(T) = \begin{cases} 1 - \sum_{i=0}^{[A/T]} B_{i,T}\left(\bar F(A - iT) + \int_{[A/T]T}^{A} f(u - iT)\bar G(A-u)\,du\right), & A \ge T,\\[1ex] \int_0^A f(u)G(A-u)\,du, & A < T. \end{cases} \tag{5.43}$$
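Here is a sketch of (5.43) that reuses the recursion of Lemma 5.34, together with a hypothetical helper that checks a grid of inspection intervals against (5.42). Because $a_A(T)$ need not be monotone in T, the check is done pointwise on the grid; this only approximates the exact feasible set. The test also confirms that the two branches of (5.43) agree as T crosses A. The Weibull parameters are illustrative.

```python
import math

def simpson(h, a, b, n=200):
    """Composite Simpson rule for h on [a, b] (n is rounded up to an even number)."""
    if b <= a:
        return 0.0
    n += n % 2
    step = (b - a) / n
    s = h(a) + h(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * h(a + i * step)
    return s * step / 3

def weibull(lam, beta):
    """Survival function and density for the survival function exp(-(lam t)^beta)."""
    def surv(t):
        return math.exp(-(lam * t) ** beta) if t > 0 else 1.0
    def dens(t):
        return beta * lam * (lam * t) ** (beta - 1) * math.exp(-(lam * t) ** beta) if t > 0 else 0.0
    return surv, dens

def pm_probabilities(T, k_max, f, Gbar, n=200):
    """B_{0,T}, ..., B_{k_max,T} from the recursion of Lemma 5.34."""
    B = [1.0]
    for k in range(k_max):
        end = (k + 1) * T
        B.append(sum(B[i] * simpson(lambda u: f(u - i * T) * Gbar(end - u), k * T, end, n)
                     for i in range(k + 1)))
    return B

def a_A(T, A, Fbar, f, Gbar, n=200):
    """(5.43): probability of at least one system failure in [0, A] with inspections every T."""
    if A < T:
        return simpson(lambda u: f(u) * (1.0 - Gbar(A - u)), 0.0, A, n)
    k = math.floor(A / T)
    B = pm_probabilities(T, k, f, Gbar, n)
    survival = sum(B[i] * (Fbar(A - i * T) + simpson(lambda u: f(u - i * T) * Gbar(A - u), k * T, A, n))
                   for i in range(k + 1))
    return 1.0 - survival

def feasible_points(T_grid, A, omega1, Fbar, f, Gbar):
    """Grid points T satisfying a_A(T) <= omega1, i.e. constraint (5.42)."""
    return [T for T in T_grid if a_A(T, A, Fbar, f, Gbar) <= omega1]

Fbar, f = weibull(1.0, 2.0)
Gbar, _ = weibull(1.0, 3.0)
grid = [0.1 * j for j in range(1, 26)]
print(feasible_points(grid, 2.0, 0.2, Fbar, f, Gbar))
```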

Safety Constraint 2: Bound on the Limiting Fraction of Time Spent in a Defective State

The second safety constraint is related to the time spent in the defective state. What we would like to control is the proportion of time the system is in such a state. This is implemented by considering the asymptotic limit b(T), which is equal to the expected time that the system is in the defective state in a replacement cycle divided by the expected length of a replacement cycle (see Appendix B.2). Hence we can formulate the safety criterion as

$$b(T) = \frac{E\int_0^{X_T} 1_d(u)\,du}{E[X_T]} \le \omega_2,$$

where $0 < \omega_2 < 1$ and $1_d(\cdot)$ denotes the indicator function which equals 1 if the system is defective at time u and 0 otherwise. From (5.33), the expected length of a replacement cycle for this model is equal to

$$E[X_T] = \int_0^\infty \bar F_T(t)\,dt = E[X] + \sum_{k=0}^\infty \int_{kT}^{(k+1)T} f(u)\left(\int_0^{(k+1)T-u} \bar G(v)\,dv\right)du.$$

It follows that this second safety constraint can be expressed as

$$b(T) \le \omega_2, \tag{5.44}$$

where b(T) is given by

$$b(T) = \frac{\sum_{k=0}^\infty \int_{kT}^{(k+1)T} f(u)\left(\int_0^{(k+1)T-u} \bar G(v)\,dv\right)du}{E[X] + \sum_{k=0}^\infty \int_{kT}^{(k+1)T} f(u)\left(\int_0^{(k+1)T-u} \bar G(v)\,dv\right)du}, \qquad 0 < T \le \infty. \tag{5.45}$$
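Below is a sketch of (5.45) with truncated integrals. It is checked in two ways: against the long-run fraction of defective time in simulated cycles, where the defective time in a cycle is $X_T - X$, and against the limit $E[Y]/(E[X] + E[Y])$ for very long inspection intervals. For the Weibull laws used here, $E[X] = \Gamma(1 + 1/\beta_1)/\lambda_1$. The parameters and the truncation horizon are illustrative assumptions.

```python
import math
import random

def simpson(h, a, b, n=200):
    """Composite Simpson rule for h on [a, b] (n is rounded up to an even number)."""
    if b <= a:
        return 0.0
    n += n % 2
    step = (b - a) / n
    s = h(a) + h(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * h(a + i * step)
    return s * step / 3

def weibull(lam, beta):
    """Survival function and density for the survival function exp(-(lam t)^beta)."""
    def surv(t):
        return math.exp(-(lam * t) ** beta) if t > 0 else 1.0
    def dens(t):
        return beta * lam * (lam * t) ** (beta - 1) * math.exp(-(lam * t) ** beta) if t > 0 else 0.0
    return surv, dens

def defective_fraction(T, Fbar, f, Gbar, mean_x, horizon=8.0, n=100):
    """b(T) from (5.45); integrals are cut at `horizon`, where F-bar and G-bar are negligible here."""
    s, k = 0.0, 0
    while k * T < horizon:
        end = (k + 1) * T
        # expected defective time contributed by defects arising in [kT, (k+1)T)
        s += simpson(lambda u: f(u) * simpson(Gbar, 0.0, min(end - u, horizon), 120),
                     k * T, min(end, horizon), n)
        k += 1
    return s / (mean_x + s)

Fbar, f = weibull(1.0, 2.0)
Gbar, _ = weibull(1.0, 3.0)
mean_x, mean_y = math.gamma(1.5), math.gamma(1.0 + 1.0 / 3.0)

# Monte Carlo: long-run fraction = sum of defective times / sum of cycle lengths (T = 0.5)
rng = random.Random(2)
defective = total = 0.0
for _ in range(20000):
    x, y = rng.weibullvariate(1.0, 2.0), rng.weibullvariate(1.0, 3.0)
    nxt = (math.floor(x / 0.5) + 1) * 0.5
    xt = nxt if x + y >= nxt else x + y
    defective += xt - x
    total += xt
print(defective_fraction(0.5, Fbar, f, Gbar, mean_x), defective / total)
```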

Optimization

The problem is to find a value of T that minimizes $C_d(T)$ given by (5.38) under the safety constraint (5.42) or (5.44), that is, to find a value $T_{opt}$ such that

$$C_d(T_{opt}) = \inf\{C_d(T) : T \in \Upsilon\},$$

where Υ is the set of inspection intervals satisfying inequality (5.42) or (5.44), i.e.,

$$\Upsilon = \{T > 0 : a_A(T) \le \omega_1\}$$

or

$$\Upsilon = \{T > 0 : b(T) \le \omega_2\},$$

where $a_A(T)$ and $b(T)$ are given by (5.43) and (5.45), respectively.
Analyzing the terms in the function $C_d(T)$ given by (5.38), we will show that $C_d(T)$ is a continuous function of T, with

$$\lim_{T \to 0} C_d(T) = \infty.$$

To show the continuity of the function $C_d(T)$, we need to assume that the density function f of X is continuous. Then $h_T(u)$, given by (5.37), is continuous in u and in T, and hence

$$\int_0^T h_T(u)c(T-u)\,du \quad\text{and}\quad 1 + \int_0^T h_T(u)\left(D(T-u) - 1\right)du,$$

where c and D are given by (5.39) and (5.40), are continuous functions of T. Moreover,

$$\int_0^T h_T(u)c(T-u)\,du \le (C_p + C_I + C_c)\int_0^T h_T(u)\,du = (C_p + C_I + C_c)\int_0^\infty f(u)e^{-\alpha u}\,du,$$

and consequently

$$\lim_{T \to 0}\int_0^T h_T(u)c(T-u)\,du < \infty$$

and

$$\lim_{T \to 0}\left(1 + \int_0^T h_T(u)\left(D(T-u) - 1\right)du\right) = \int_0^\infty \alpha e^{-\alpha u}\bar F(u)\,du < \infty,$$

using that E[X] is finite.


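Furthermore, notice that

$$\begin{aligned}
\sum_{k=1}^\infty \sum_{i=1}^{k} e^{-\alpha iT}\int_{kT}^{(k+1)T} f(u)\,du &= \sum_{k=1}^\infty \frac{e^{-\alpha T} - e^{-\alpha(k+1)T}}{1 - e^{-\alpha T}}\int_{kT}^{(k+1)T} f(u)\,du\\
&= \frac{e^{-\alpha T}}{1 - e^{-\alpha T}}\left(1 - \sum_{k=0}^\infty e^{-\alpha kT}\int_{kT}^{(k+1)T} f(u)\,du\right)
\end{aligned}$$

is continuous in T and

$$\lim_{T \to 0}\sum_{k=1}^\infty \sum_{i=1}^{k} e^{-\alpha iT}\int_{kT}^{(k+1)T} f(u)\,du = \infty,$$

since, as $T \to 0$, the sum in parentheses tends to $E[e^{-\alpha X}] < 1$ while the factor $e^{-\alpha T}/(1 - e^{-\alpha T})$ tends to infinity.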
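The identity

$$\sum_{k\ge1}\sum_{i=1}^{k} e^{-\alpha iT}\int_{kT}^{(k+1)T} f(u)\,du = \frac{e^{-\alpha T}}{1 - e^{-\alpha T}}\left(1 - \sum_{k\ge0} e^{-\alpha kT}\int_{kT}^{(k+1)T} f(u)\,du\right)$$

can be checked numerically. The sketch below assumes, for illustration, an exponential X with rate 1, so that $\int_{kT}^{(k+1)T} f(u)\,du = \bar F(kT) - \bar F((k+1)T)$ is explicit. It also shows that the inspection term grows as T decreases.

```python
import math

def inspection_term_direct(T, alpha, Fbar):
    """Left-hand side: double sum over k >= 1 and i = 1..k, truncated where Fbar is negligible."""
    k_max = int(60.0 / T) + 1
    total = 0.0
    for k in range(1, k_max + 1):
        inner = sum(math.exp(-alpha * i * T) for i in range(1, k + 1))
        total += inner * (Fbar(k * T) - Fbar((k + 1) * T))
    return total

def inspection_term_closed(T, alpha, Fbar):
    """Right-hand side, with the last sum starting at k = 0."""
    k_max = int(60.0 / T) + 1
    s = sum(math.exp(-alpha * k * T) * (Fbar(k * T) - Fbar((k + 1) * T)) for k in range(k_max + 1))
    q = math.exp(-alpha * T)
    return q / (1.0 - q) * (1.0 - s)

Fbar = lambda t: math.exp(-t)              # illustrative: X exponential with rate 1
alpha = 0.4
for T in (0.1, 0.5, 2.0):
    print(T, inspection_term_direct(T, alpha, Fbar), inspection_term_closed(T, alpha, Fbar))
```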

Taking these properties into account, the function $C_d(T)$ given by (5.38) is a continuous function of T and $\lim_{T \to 0} C_d(T) = \infty$. Hence the minimum of $C_d(T)$ in the unconstrained case exists if we include the delay-time policy for $T = \infty$, i.e., a delay-time policy without inspections, for which the corresponding expected discounted costs are given by

$$\lim_{T \to \infty} C_d(T) = \frac{C_c\int_0^\infty f(u)e^{-\alpha u}\,du\int_0^\infty g(v)e^{-\alpha v}\,dv}{\int_0^\infty \alpha e^{-\alpha u}\bar F(u)\,du + \int_0^\infty \alpha e^{-\alpha u}f(u)\,du\int_0^\infty e^{-\alpha v}\bar G(v)\,dv}.$$

We see that $C_d(\infty) < \infty$.


Let $T^*$ be an optimal value of T in the unconstrained case, i.e.,

$$C_d(T^*) = \inf\{C_d(T) : T > 0\}.$$

Clearly, if $T^* \in \Upsilon$, then $T_{opt} = T^*$, i.e., $T^*$ is also an optimal solution to the constrained optimization problem.
The analytical optimization of $C_d(T)$ is not straightforward, as the function $C_d(T)$ is not of the standard form seen for many maintenance models (nonincreasing up to a minimum value and then nondecreasing), even when F and G are assumed to be increasing failure rate distributions. As we will show later, $C_d(T)$ can have several local minima. The safety constraint functions $a_A(T)$ and $b(T)$ can also have rather irregular forms, compared with the common increasing shapes seen for other maintenance optimization models.

Numerical Examples

In this section we present some numerical examples of the above model. The
aim is to find a value of T that minimizes Cd (T ) given by (5.38) under the two
safety constraints based on the occurrence of failures in an interval (5.42) and
the fraction of time in a defective state (5.44). We refer to these constraints
as criterion 1 and criterion 2, respectively.
We assume that the random variables X and Y follow Weibull distributions with nondecreasing failure rates, i.e.,

$$\bar F(t) = \exp\{-(\lambda_1 t)^{\beta_1}\}, \qquad \bar G(t) = \exp\{-(\lambda_2 t)^{\beta_2}\}, \qquad t \ge 0,$$

where $\beta_i > 1$ for i = 1, 2.

Intuitively we may think that the proportion of time that the system is in a defective state is increasing with respect to T. However, this is not true in general. A counterexample, based on rather extreme failure rates, is given in the following.

Let λ1 = 1, λ2 = 1, β1 = 20 and β2 = 30 be the parameters of the Weibull distributions. For these parameters

$$E[X] = 0.9735, \qquad E[Y] = 0.9818.$$

Figure 5.1 shows a simulation of the long-run proportion of time that the system is in a defective state as a function of T. The simulation has been carried out using 500 points between 0.2 and 2.2 with 500,000 realizations in each point. We see from the figure that b(T) in this case has a rather irregular form, with many local minima and maxima.

Fig. 5.1. Function b(T) versus T

A similar case is observed for the function $a_A(T)$ given by (5.43). This function represents the probability of occurrence of at least one failure in [0, A]. For the same numerical example as above, $a_A(T)$ is not monotone, as we can see from Fig. 5.2, which displays a simulation of $a_A(T)$ for A = 2.
In the case λ1 = 1, β1 = 20 and λ2 = 1, β2 = 30, the distributions of X and Y are highly concentrated in the interval [0.8, 1.1], i.e.,

$$P[0.8 \le X \le 1.1] = 0.9873, \qquad P[0.8 \le Y \le 1.1] = 0.9988.$$

We focus on the function $a_2(T)$, the probability of occurrence of one or more failures in [0, 2]. For T = 1.5, the system is “always” in the defective state at the first inspection, and the inspection avoids a corrective maintenance. Hence, $a_2(1.5) \approx 0$. However, for inspection intervals near 1, the system may or may not be in a defective state at the first inspection. If it is not, the next inspection happens at time 2T ≈ 2, and a corrective maintenance could occur in this period. Hence $a_2(1) > a_2(1.5)$, and $a_2(T)$ is not monotone.

Fig. 5.2. Function a2(T) versus T
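The means and interval probabilities quoted for the two Weibull examples follow from $E[X] = \Gamma(1 + 1/\beta_1)/\lambda_1$ and $P[a \le X \le b] = \bar F(a) - \bar F(b)$. The following check uses Python's `math.gamma`; the function names are illustrative.

```python
import math

def weibull_mean(lam, beta):
    """E[X] = Gamma(1 + 1/beta) / lam for the survival function exp(-(lam t)^beta)."""
    return math.gamma(1.0 + 1.0 / beta) / lam

def weibull_prob_between(lam, beta, a, b):
    """P(a <= X <= b) = exp(-(lam a)^beta) - exp(-(lam b)^beta)."""
    return math.exp(-(lam * a) ** beta) - math.exp(-(lam * b) ** beta)

extreme = (weibull_mean(1.0, 20), weibull_mean(1.0, 30))          # beta1 = 20, beta2 = 30
realistic = (weibull_mean(1.0, 2), weibull_mean(1.0, 3))          # beta1 = 2, beta2 = 3
concentration = (weibull_prob_between(1.0, 20, 0.8, 1.1), weibull_prob_between(1.0, 30, 0.8, 1.1))
print([round(v, 4) for v in extreme + realistic + concentration])
```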

Next, we specify the costs. Let $C_p = 400$, $C_c = 1000$ and $C_I = 100$ be the costs incurred for a preventive maintenance, a corrective maintenance and an inspection, respectively. Furthermore, let α = 0.4 be the discount factor.
For λ1 = 1, λ2 = 1, β1 = 20 and β2 = 30, Fig. 5.3 displays a simulation of the total expected discounted costs versus T. This simulation has been performed using 500 points between 0.2 and 2.5 with 500,000 realizations in each point. As we can see, for this numerical example $C_d(T)$ has several local minima. The global minimum of $C_d(T)$ is reached for $T^* = 1.79$, with expected discounted costs $C_d(1.79) = 397.68$.
Finally, we specify the safety constraints, starting with criterion 1. We assume that $\omega_1 = 0.2$ and A = 2, i.e., the probability of occurrence of one or more failures in two units of time should not exceed 0.2, that is,

$$P(N_c(2) \ge 1) \le 0.2.$$

Figure 5.4 shows the total expected discounted costs $C_d(T)$ along with the function $a_2(T)$. We find that

$$\Upsilon = \{T > 0 : a_2(T) \le 0.2\} = (0, 1.898].$$

In this case, $T^* = 1.79 \in \Upsilon$, and hence the optimal value for the constrained optimization problem under criterion 1 is $T_{opt} = 1.79$ with a value of $C_d(1.79) = 397.68$.
Fig. 5.3. Total expected discounted costs Cd(T) versus T

Consider now the constrained optimization problem under criterion 2. We assume that $\omega_2 = 0.15$, i.e., the proportion of time that the system is in a defective state should not exceed 0.15. Figure 5.5 shows the total expected discounted costs and the function b(T) for this problem. In this case

$$\Upsilon = \{T > 0 : b(T) \le 0.15\} = (0, 0.291] \cup [0.3272, 0.3823] \cup [0.508, 0.5727] \cup [1.041, 1.1454].$$

By inspection, the optimal value for the constrained optimization problem is $T_{opt} = 1.1454$ with a value of $C_d(1.1454) = 687$.
In the following example we use a more realistic set of parameter values for the Weibull distributions: λ1 = 1, λ2 = 1, β1 = 2 and β2 = 3. In this case

$$E[X] = 0.8862, \qquad E[Y] = 0.8930.$$

Let $C_p = 400$, $C_c = 1000$ and $C_I = 100$ be the costs incurred, with α = 0.4 the discount factor. The functions $C_d(T)$, $a_A(T)$ and $b(T)$ are shown in Figs. 5.6–5.8.
Figure 5.6 shows a simulation of the total expected discounted costs $C_d(T)$ versus T for this example. The function $C_d(T)$ is of the standard form: nonincreasing up to T = 1.1511 and nondecreasing for T ≥ 1.1511. Hence $T^* = 1.1511$. The corresponding expected discounted costs equal $C_d(1.1511) = 804.0365$.
We analyze the constrained optimization problem for each safety requirement. As above, we put $\omega_1 = 0.2$ for criterion 1. From Fig. 5.7 we find that

$$\Upsilon = \{T > 0 : a_2(T) \le 0.2\} = (0, 0.975].$$

Fig. 5.4. (a) Total expected discounted costs Cd(T) versus T. (b) Function a2(T) versus T

Due to the form of $C_d(T)$, the optimal value for the constrained optimization is $T_{opt} = 0.975$ with a value of $C_d(0.975) = 813.55$.
For criterion 2, we suppose $\omega_2 = 0.15$. From Fig. 5.8,

$$\Upsilon = \{T > 0 : b(T) \le 0.15\} = (0, 0.313],$$

and using the same reasoning as above, the optimal value of $C_d(T)$ is reached for $T_{opt} = 0.313$ with a value of $C_d(0.313) = 1372$. Comparing the expected costs for the unconstrained and the constrained problem, we see that implementing the safety constraint introduces a rather large cost.
Both constraints can be used to control the safety level. However, we prefer criterion 1, as it is more directly related to the failures of the system.

5.5.2 Optimal Test Interval for a Monotone Safety System

In this section we consider a safety system represented by a monotone (coherent) structure function of n components. The components and the system can be in one of several states. The state of the components and the system is only revealed through inspections, which are carried out at intervals of length T. If the inspection shows that the system is in a critical state or has failed, it is overhauled and all components are restored to a good-as-new condition. The system is in a critical state if further deterioration of a component (component i jumps from state j to state j − 1) induces system failure. As the system is a safety system in standby position, the state of the system and its components is revealed only by testing. The aim of testing and overhaul is to prevent the system from failing and remaining in the failure state for a long period. However, this goal has to be balanced against the costs of inspections and overhauls: too frequent inspections would not be cost optimal. Costs are associated with tests, system downtime, and repairs. The optimization criterion is the expected long-run cost per unit of time.

Fig. 5.5. (a) Total expected discounted costs Cd(T) versus T. (b) Function b(T) versus T
Below we present a formal set-up for this problem and show how an optimal
T can be determined. A special case where the components have three states
is given special attention. It corresponds to a “delay time type system” where
the presence of a fault in a component does not lead to an immediate failure;
there will be a “delay time” between the occurrence of the fault and the failure
of the component. We refer to Sect. 5.5.1.

Model and Problem Definition

We consider a safety system comprising n components, numbered consecutively from 1 to n. The state of component i at time t, t ≥ 0, is denoted $X_t(i)$, i = 1, 2, . . . , n, where $X_t(i)$ can be in one of $M_i + 1$ states, 0, 1, . . . , $M_i$. The paths $X_\cdot(i)$ are assumed to be right-continuous. The states represent different levels of performance, from the worst, 0, to the best, $M_i$. At time t = 0, all components are in the best state, i.e., $X_0(i) = M_i$, i = 1, 2, . . . , n. The random duration time in state $M_i$ is denoted $U_{iM_i}$. The component then jumps to state $M_i - 1$ for a random time $U_{i(M_i-1)}$, and so on until the component reaches the absorbing state 0. All sojourn times are positive random variables. The probability distribution of $U_{ij}$ is denoted $F_{ij}$. The distributions $F_{ij}$ are assumed absolutely continuous, with finite means. The density and “jump rate” of $F_{ij}(t)$ are denoted $f_{ij}(t)$ and $r_{ij}(t)$, respectively, i = 1, 2, . . . , n and j = 1, 2, . . . , $M_i$. The jump rate $r_{ij}(t)$ is defined as usual as

$$r_{ij}(t) = \lim_{h \to 0}\frac{1}{h}P(U_{ij} \le t + h \mid U_{ij} > t).$$

Hence $r_{ij}(t)h$ (h a small positive number) is approximately equal to the conditional probability that component i makes a jump to state j − 1 in the interval (t, t + h], given that the component has stayed in state j during the interval [0, t]. The sojourn times $U_{iM_i}, U_{i(M_i-1)}, \ldots, U_{i1}$, i = 1, 2, . . . , n, are assumed independent. The distribution of the vector U of all $U_{ij}$ is denoted $F_U$.

Fig. 5.6. Total expected discounted costs Cd(T) versus T
We denote by G(t, x) the distribution of the vector of component states $X_t = (X_t(1), X_t(2), \ldots, X_t(n))$, i.e.,

$$G(t, x) = P(X_t(1) = x_1, X_t(2) = x_2, \ldots, X_t(n) = x_n).$$

Here $x = (x_1, x_2, \ldots, x_n)$, where $x_i \in \{0, 1, \ldots, M_i\}$. The state of the system at time t is denoted $\Phi_t$ and is a function of the states of the components, i.e.,

$$\Phi_t = \phi(X_t),$$

where φ is the structure function of the system. We assume that Φ and φ are binary, equal to 1 if the system is functioning and 0 otherwise (see Sect. 2.1). The system is a monotone system (see Sect. 2.1.2), i.e., its structure function φ is nondecreasing in each argument, and

$$\phi(0, 0, \ldots, 0) = 0 \quad\text{and}\quad \phi(M_1, M_2, \ldots, M_n) = 1.$$


Fig. 5.7. (a) Total expected discounted costs Cd(T) versus T. (b) Function a2(T) versus T

Since at time t = 0 all components are in the best state, $\Phi_0 = 1$. The components deteriorate and at time τ the system fails, i.e.,

$$\tau = \inf\{t > 0 : \phi(X_t) = 0\}.$$

The deterioration of the components and the system failure are revealed by inspections. It is assumed that the system is inspected every T units of time. If the system is found to be in the failure state, a complete overhaul is carried out, meaning that all components are repaired to a good-as-new condition. Furthermore, a preventive policy is introduced: if the system is found to be in a critical state, a complete overhaul is also conducted. The system is said to be in a critical state if the system is functioning and there exists at least one i such that the system fails if component i jumps to the state $X_t(i) - 1$. Let $\tau_C$ be the time at which the system first becomes critical. Then

$$\tau_C = \inf\{t \ge 0 : \phi(X_t) = 1,\ \phi((X_t(i) - 1)_i, X_t) = 0 \text{ for at least one } i\},$$

where $\phi(\cdot_i, x) = \phi(x_1, \ldots, x_{i-1}, \cdot, x_{i+1}, \ldots, x_n)$. We assume $\tau_C > 0$, i.e., the system is not critical at time 0.
The distribution of $\tau_C$ is denoted $F_{\tau_C}$. The times τ and $\tau_C$ are functions of the duration times $U_{ij}$. Let g and $g_C$ be defined by

$$\tau = g(U) \quad\text{and}\quad \tau_C = g_C(U).$$

The inspections and overhauls are assumed to take negligible time.
To further characterize the critical states, we introduce the concept of a critical path vector for system level 1:

Fig. 5.8. (a) Total expected discounted costs Cd(T) versus T. (b) Function b(T) versus T

Definition 5.35. A state vector x is a critical path vector for system level
1 (the functioning state of the system) if and only if φ(x) = 1 and φ((xi −
1)i , x) = 0 for at least one i.
From this definition we introduce a maximal critical path vector:
Definition 5.36. A critical path vector x is a maximal critical path vector for
system level 1 if it cannot be increased without losing its status as a critical
path vector.
Note that these concepts are different from the common defined path vectors
and minimal path vectors in a monotone system; see Sect. 2.1.2.
Based on the maximal critical minimal path vectors we introduce a new
structure function, φC (x), which is equal to 1 if and only if there exists no
maximal critical path vector xk such that the state x is below or equal to
xk , i.e. 
φC (x) = (1 − I(x ≤ xk )),
k

where k runs trough all maximal critical path vectors for the system at level
1. We see that the system φC fails as soon as a system state becomes critical.
As an example, consider a binary parallel system of two components. Then the maximal critical path vectors are (1,0) and (0,1), and $\phi_C(x) = x_1 x_2$: as soon as one component fails, the system state becomes critical.
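For small systems the maximal critical path vectors, and hence $\phi_C$, can be found by brute-force enumeration of the state space. The Python sketch below is our own illustration, not an algorithm from the text. It assumes integer component states $0, \ldots, M_i$, a structure function passed as a callable `phi`, and it reads "critical" as requiring some component with $x_i \ge 1$ whose lowering by one level makes $\phi$ drop to 0. All function names are hypothetical.

```python
from itertools import product


def critical_path_vectors(phi, M):
    """All x with phi(x) = 1 such that phi((x_i - 1)_i, x) = 0 for some i.

    M[i] is the top state of component i; states are 0, 1, ..., M[i].
    Only components with x_i >= 1 can be lowered (our assumption).
    """
    critical = []
    for x in product(*(range(m + 1) for m in M)):
        if phi(x) != 1:
            continue
        for i, xi in enumerate(x):
            if xi >= 1:
                lowered = x[:i] + (xi - 1,) + x[i + 1:]
                if phi(lowered) == 0:
                    critical.append(x)
                    break
    return critical


def leq(x, y):
    """Componentwise order x <= y."""
    return all(a <= b for a, b in zip(x, y))


def maximal_critical_path_vectors(phi, M):
    """Critical path vectors that cannot be increased and remain critical."""
    crit = critical_path_vectors(phi, M)
    return [x for x in crit if not any(y != x and leq(x, y) for y in crit)]


def phi_C(x, maximal):
    """phi_C(x) = prod_k (1 - I(x <= x^k)) over the maximal vectors x^k."""
    return 0 if any(leq(x, xk) for xk in maximal) else 1


# Binary parallel system of two components (M_i = 1).
parallel = lambda x: 1 - (1 - x[0]) * (1 - x[1])
mx = maximal_critical_path_vectors(parallel, (1, 1))
print(mx)                                                         # [(0, 1), (1, 0)]
print([phi_C(x, mx) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])   # [0, 0, 0, 1]
```

The enumeration grows like $\prod_i (M_i + 1)$, so it is only meant for small examples such as the two- and three-component systems of this section.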
A counting process $N$ is introduced that jumps to 1 at the time of system failure, i.e.,
$$N_t = I(\tau \le t).$$
234 5 Maintenance Optimization

Let $V_{ij,t}$ be the virtual age of component $i$ in state $j$ at time $t$. Then the intensity $\lambda_t$ of $N$ is given by
$$\lambda_t = \sum_{i=1}^{n} \sum_{j=1}^{M_i} r_{ij}(V_{ij,t})\, I(X_t(i) = j)\, \phi(X_t)\bigl(1 - \phi((j-1)_i, X_t)\bigr),$$
noting that the rate is $r_{ij}(V_{ij,t})$ at time $t$ for component $i$ to cause system failure by jumping from state $j$ to state $j-1$. A formal proof can be given following the approach in Sect. 3.2.2. By introducing $\phi_{ij}(x) = I(x_i = j)\phi(x)\bigl(1 - \phi((j-1)_i, x)\bigr)$, the intensity $\lambda_t$ can be expressed as
$$\lambda_t = \sum_{i=1}^{n} \sum_{j=1}^{M_i} r_{ij}(V_{ij,t})\, \phi_{ij}(X_t).$$

Analogously, we define a counting process $N_C$ for the process $\phi_C$. This counting process jumps to 1 at the time the system becomes critical, i.e.,
$$N_{C,t} = I(\tau_C \le t).$$
The intensity $\lambda_{C,t}$ of $N_C$ is given by
$$\lambda_{C,t} = \sum_{i=1}^{n} \sum_{j=1}^{M_i} r_{ij}(V_{ij,t})\, I(X_t(i) = j)\, \phi_C(X_t)\bigl(1 - \phi_C((j-1)_i, X_t)\bigr).$$
Similarly to $\phi_{ij}$ we define $\phi_{ijC}(x) = I(x_i = j)\phi_C(x)\bigl(1 - \phi_C((j-1)_i, x)\bigr)$, and hence the intensity $\lambda_{C,t}$ can be expressed as
$$\lambda_{C,t} = \sum_{i=1}^{n} \sum_{j=1}^{M_i} r_{ij}(V_{ij,t})\, \phi_{ijC}(X_t).$$
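For a fixed state $x$ both intensities are finite sums that are easy to evaluate. The sketch below assumes constant rates, as in case (5.51) below, stored in a hypothetical nested list `rates[i][j-1]` $= r_{ij}$; for age-dependent rates one would pass the values $r_{ij}(V_{ij,t})$ at the current virtual ages instead. As test case it uses the three-component delay-time system studied later in this section, with $\phi_C$ written out from its maximal critical path vectors (0,1,2), (1,0,2), and (2,2,1). Component indices are 0-based in the code.

```python
def phi_ij(phi, x, i, j):
    """phi_ij(x) = I(x_i = j) * phi(x) * (1 - phi((j - 1)_i, x)); i is 0-based."""
    if x[i] != j:
        return 0
    lowered = x[:i] + (j - 1,) + x[i + 1:]
    return phi(x) * (1 - phi(lowered))


def intensity(phi, x, rates):
    """sum_i sum_{j=1}^{M_i} r_ij * phi_ij(x), with rates[i][j-1] = r_ij.

    With phi the system structure this is lambda_t in state x;
    with phi_C it is lambda_{C,t}.
    """
    return sum(
        r * phi_ij(phi, x, i, j)
        for i, r_i in enumerate(rates)
        for j, r in enumerate(r_i, start=1)
    )


# Delay-time system: phi(x) = I(x1 + x2 >= 1) I(x3 >= 1).
phi = lambda x: int(x[0] + x[1] >= 1) * int(x[2] >= 1)
MAXIMAL = [(0, 1, 2), (1, 0, 2), (2, 2, 1)]
phi_C = lambda x: int(not any(all(a <= b for a, b in zip(x, m)) for m in MAXIMAL))
rates = [[1.0, 0.5], [1.0, 0.5], [0.5, 1 / 3]]   # [r_i1, r_i2] for i = 1, 2, 3

print(intensity(phi, (1, 0, 2), rates))    # lambda_t in state (1,0,2): equals r_11
print(intensity(phi_C, (1, 1, 2), rates))  # lambda_C,t in (1,1,2): r_11 + r_21 + r_32
```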

The following cost structure is assumed: the cost of a complete overhaul is $c_p$, whereas the cost of each inspection is $c_I$. If the system is not functioning, a cost $c$ is incurred per unit of time. All costs are positive numbers.
The problem is to find an optimal $T$ minimizing the long-run expected cost per unit of time.

Optimization

For a fixed test interval length $T$, $0 < T < \infty$, the system is overhauled at time $\tau^T$, where $\tau^T$ is the time of the first inspection following a critical state, i.e.,
$$\tau^T = T\bigl([\tau_C/T]_I + 1\bigr),$$
where $[x]_I$ equals the integer part of $x$. This inspection represents a renewal for the cost and time processes, and using the renewal reward theorem (see Appendix B.2), it follows that the long-run (expected) cost per unit time, $B^T$, can be written
$$B^T = \frac{EC^T}{E\tau^T}, \qquad (5.46)$$
where $E\tau^T$ is the expected length of the first renewal cycle (the time until renewal) and $EC^T$ is the expected cost associated with this cycle.
Both $E\tau^T < \infty$ and $EC^T < \infty$, since $E\tau^T \le \sum_{i,j} EU_{ij} + T$ and $EC^T \le Tc + c_p + c_I(E\tau^T/T + 1)$. Theorem 5.37 establishes explicit formulae for $E\tau^T$ and $EC^T$, and hence for $B^T$.
Theorem 5.37. Under the above model assumptions, with $\tau = g(U)$ and $\tau_C = g_C(U)$, we have
$$E\tau^T = T \sum_{k=0}^{\infty} (k+1) \int_{u:\, kT < g_C(u) \le (k+1)T} dF_U(u), \qquad (5.47)$$
$$EC^T = \sum_{k=0}^{\infty} \int_{u:\, kT < g_C(u) \le (k+1)T} \Bigl[c_I(k+1) + c_p + c\, I(g(u) \le (k+1)T)\{(k+1)T - g(u)\}\Bigr]\, dF_U(u). \qquad (5.48)$$

Proof. To establish (5.47), we write
$$\tau^T = \sum_{k=0}^{\infty} I(kT < \tau_C \le (k+1)T)(k+1)T.$$
Then taking expectation we obtain
$$E\tau^T = E\sum_{k=0}^{\infty} I(kT < g_C(U) \le (k+1)T)(k+1)T = T\sum_{k=0}^{\infty} (k+1)\int_{u:\, kT < g_C(u) \le (k+1)T} dF_U(u),$$
which proves (5.47). To establish (5.48), we use a similar approach, writing the cost $C^T$ as a function of $\tau_C$ and $\tau$:
$$C^T = \sum_{k=0}^{\infty} I(kT < \tau_C \le (k+1)T)\bigl[c_I(k+1) + c_p + c\, I(\tau \le (k+1)T)\{(k+1)T - \tau\}\bigr], \qquad (5.49)$$
noting that the system is down for a period $(k+1)T - \tau$ if the system enters a critical state in the interval $(kT, (k+1)T]$ and fails before the inspection at time $(k+1)T$. Taking expectations we obtain (5.48). □

In the following theorem we establish more explicit formulae for $E\tau^T$ and $EC^T$ by using counting process theory. Then we do not need the distribution $F_U(u)$, but only the distribution of $X_t$, $G(t, x)$. We consider two special cases:
– The system is a binary system with binary components, i.e., $M_i = 1$ for $i = 1, 2, \ldots, n$. (5.50)
– The rates $r_{ij}$ are independent of $t$, i.e., the sojourn times are all exponentially distributed. (5.51)

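Theorem 5.38. Let
$$H_{ij}(t, x) = \int_0^t r_{ij}(s)\, G(s, x)\, ds.$$
For the cases (5.50) and (5.51), we then have
$$E\tau^T = T \sum_{k=0}^{\infty} (k+1) \sum_{i=1}^{n} \sum_{j=1}^{M_i} \sum_{x} \phi_{ijC}(x)\bigl[H_{ij}((k+1)T, x) - H_{ij}(kT, x)\bigr], \qquad (5.52)$$
where $\phi_{ijC}(x) = I(x_i = j)\phi_C(x)\bigl(1 - \phi_C((j-1)_i, x)\bigr)$. Furthermore, if $G_s(t, x \mid x')$ denotes the conditional distribution of $X(t)$ given $X(s) = x'$ $(t > s)$, we have
$$\begin{aligned}
EC^T &= \sum_{k=0}^{\infty} [c_I(k+1) + c_p] \sum_{i=1}^{n} \sum_{j=1}^{M_i} \sum_{x} \phi_{ijC}(x)\bigl[H_{ij}((k+1)T, x) - H_{ij}(kT, x)\bigr] \\
&\quad + \sum_{k=0}^{\infty} \sum_{x'} \phi_C(x')\, G(kT, x') \sum_{i=1}^{n} \sum_{j=1}^{M_i} \sum_{x} \phi_{ij}(x) \int_{kT}^{(k+1)T} c\,((k+1)T - t)\, r_{ij}(t)\, G_{kT}(t, x \mid x')\, dt. \qquad (5.53)
\end{aligned}$$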

Proof. To establish (5.52), we write
$$\tau^T = \sum_{k=0}^{\infty} I(kT < \tau_C \le (k+1)T)(k+1)T = \sum_{k=0}^{\infty} \int_{kT}^{(k+1)T} (k+1)T\, dN_{C,t}.$$
Then taking expectation, using that $N_{C,t}$ has intensity $\lambda_{C,t}$, and noting that we can write $r_{ij}(V_{ij,t}) = r_{ij}(t)$, we obtain
$$\begin{aligned}
E\tau^T &= E\sum_{k=0}^{\infty} \int_{kT}^{(k+1)T} (k+1)T\, dN_{C,t} \\
&= T\sum_{k=0}^{\infty} (k+1) \int_{kT}^{(k+1)T} E\lambda_{C,t}\, dt \\
&= T\sum_{k=0}^{\infty} (k+1) \int_{kT}^{(k+1)T} \sum_{i=1}^{n} \sum_{j=1}^{M_i} r_{ij}(t)\, E\phi_{ijC}(X_t)\, dt \\
&= T\sum_{k=0}^{\infty} (k+1) \sum_{i=1}^{n} \sum_{j=1}^{M_i} \sum_{x} \phi_{ijC}(x) \int_{kT}^{(k+1)T} r_{ij}(t)\, G(t, x)\, dt \\
&= T\sum_{k=0}^{\infty} (k+1) \sum_{i=1}^{n} \sum_{j=1}^{M_i} \sum_{x} \phi_{ijC}(x)\bigl[H_{ij}((k+1)T, x) - H_{ij}(kT, x)\bigr],
\end{aligned}$$
which proves (5.52). To establish (5.53), we rewrite (5.49) to obtain
$$EC^T = E\sum_{k=0}^{\infty} I(kT < \tau_C \le (k+1)T)[c_I(k+1) + c_p] + E\sum_{k=0}^{\infty} I(kT < \tau_C \le (k+1)T)\, c\, I(\tau \le (k+1)T)\{(k+1)T - \tau\}.$$
Similarly to the above analysis for $E\tau^T$, it is seen that the first term of this expression for $EC^T$ equals
$$\sum_{k=0}^{\infty} [c_I(k+1) + c_p] \sum_{i=1}^{n} \sum_{j=1}^{M_i} \sum_{x} \phi_{ijC}(x)\bigl[H_{ij}((k+1)T, x) - H_{ij}(kT, x)\bigr].$$
Hence it remains to establish the desired expression for the downtime costs, the second term. This term can be expressed as
$$E\sum_{k=0}^{\infty} \phi_C(X_{kT}) \int_{kT}^{(k+1)T} c\,((k+1)T - t)\, dN_t,$$
as $\phi_C(X_t)$ equals 1 as long as $t < \tau_C$. Then, using that $N_t$ has intensity $\lambda_t$, we obtain that this expected cost term equals
$$\begin{aligned}
E\sum_{k=0}^{\infty} \phi_C(X_{kT}) \int_{kT}^{(k+1)T} c\,((k+1)T - t)\, dN_t
&= E\sum_{k=0}^{\infty} \phi_C(X_{kT}) \int_{kT}^{(k+1)T} c\,((k+1)T - t)\, \lambda_t\, dt \\
&= E\sum_{k=0}^{\infty} \phi_C(X_{kT}) \int_{kT}^{(k+1)T} c\,((k+1)T - t) \sum_{i=1}^{n} \sum_{j=1}^{M_i} \phi_{ij}(X_t)\, r_{ij}(t)\, dt \\
&= \sum_{k=0}^{\infty} \sum_{x'} \phi_C(x')\, G(kT, x') \sum_{i=1}^{n} \sum_{j=1}^{M_i} \sum_{x} \phi_{ij}(x) \int_{kT}^{(k+1)T} c\,((k+1)T - t)\, r_{ij}(t)\, G_{kT}(t, x \mid x')\, dt.
\end{aligned}$$
Equation (5.53) follows, and the theorem is proved. □




We seek an optimal $T_{opt}$ minimizing $B^T$ given by (5.46) and the expressions for $EC^T$ and $E\tau^T$ in Theorems 5.37 and 5.38. Such a minimum always exists if we include the "perform no testing and overhaul" policy $T = \infty$, as $B^T$ is a continuous function and $\lim_{T \to 0} B^T = \infty$. We have $B^\infty = \lim_{T \to \infty} B^T = c$: the expected average long-run cost per unit of time when there is no testing and overhaul equals $c$. If we perform very frequent testing, the long-run expected average cost will be very high due to the large number of inspections.
To find $T_{opt}$ it is convenient to search for values of $T$ minimizing the functions
$$B^T(\delta) = EC^T - \delta E\tau^T.$$
If $T_\delta$ minimizes $B^T(\delta)$ and $B^{T_\delta}(\delta) = 0$, then $T_\delta$ minimizes $B^T$, i.e., $T_\delta$ is optimal, and $\delta = B^{T_\delta} = \inf_{0 < T \le \infty} B^T$. This result is well known from the literature; see Aven and Bergman [19]. We also refer to (5.9).
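The $\delta$-criterion can be turned into an iterative search over a finite grid of candidate intervals. The sketch below uses a Dinkelbach-type update $\delta \leftarrow EC^{T_\delta}/E\tau^{T_\delta}$; this particular iteration is our choice for illustration and is not claimed to be the procedure of [19]. `EC` and `Etau` are assumed to be callables returning $EC^T$ and $E\tau^T$ (for instance evaluated via Theorem 5.37 or 5.38); the toy functions in the usage line are hypothetical.

```python
def minimize_ratio(grid, EC, Etau, max_iter=100):
    """Find T on `grid` minimizing B^T = EC(T) / Etau(T) via B^T(delta)."""
    delta = 0.0
    T = grid[0]
    for _ in range(max_iter):
        # T_delta minimizes B^T(delta) = EC^T - delta * E tau^T on the grid
        T = min(grid, key=lambda t: EC(t) - delta * Etau(t))
        new_delta = EC(T) / Etau(T)     # unchanged iff B^{T_delta}(delta) = 0
        if abs(new_delta - delta) <= 1e-12 * max(1.0, abs(delta)):
            return T, new_delta
        delta = new_delta
    return T, delta


# Hypothetical toy cycle functions: B^T = 1/T + T, minimized at T = 1 with value 2.
grid = [0.1 * k for k in range(1, 31)]
T_opt, B_opt = minimize_ratio(grid, EC=lambda t: 1 + t * t, Etau=lambda t: t)
print(T_opt, B_opt)   # 1.0 2.0
```

Each step only requires minimizing a difference rather than a ratio, which is the practical advantage of working with $B^T(\delta)$; on a finite grid the sequence of $\delta$ values decreases and stops at $\min_T B^T$.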

Special Case: Parallel System of Two Components

Assume that $\phi(x) = 1 - (1 - x_1)(1 - x_2)$, i.e., the system is a binary parallel system composed of two components. The time until the system first becomes critical, $\tau_C$, can then be expressed as
$$\tau_C = \min\{U_{11}, U_{21}\},$$
noting that if a component fails, the system is functioning if and only if the other component is functioning. Furthermore, the time to system failure, $\tau$, equals the maximum component lifetime, i.e.,
$$\tau = \max\{U_{11}, U_{21}\}.$$

It follows that
$$\begin{aligned}
E\tau^T &= T\sum_{k=0}^{\infty} (k+1)\bigl[F_{\tau_C}((k+1)T) - F_{\tau_C}(kT)\bigr] \\
&= T\sum_{k=0}^{\infty} (k+1)\bigl[\bar F_{11}(kT)\bar F_{21}(kT) - \bar F_{11}((k+1)T)\bar F_{21}((k+1)T)\bigr],
\end{aligned}$$
where $\bar F = 1 - F$. By similar arguments, first considering the costs $c_I$ and $c_p$, and then, for the cost $c$, conditioning on $U_{11} = u_1$ and $U_{21} = u_2$, we obtain
$$\begin{aligned}
EC^T &= E\sum_{k=0}^{\infty} I(kT < \tau_C \le (k+1)T)\bigl[c_I(k+1) + c_p + c\, I(\tau \le (k+1)T)\{(k+1)T - \tau\}\bigr] \\
&= E\sum_{k=0}^{\infty} I(kT < \tau_C \le (k+1)T)[c_I(k+1) + c_p] + E\sum_{k=0}^{\infty} I(kT < \tau_C \le (k+1)T)\, c\, I(\tau \le (k+1)T)\{(k+1)T - \tau\} \\
&= \sum_{k=0}^{\infty} \bigl[\bar F_{11}(kT)\bar F_{21}(kT) - \bar F_{11}((k+1)T)\bar F_{21}((k+1)T)\bigr][c_I(k+1) + c_p] \\
&\quad + \sum_{k=0}^{\infty} \int_0^\infty \!\! \int_0^\infty I(kT < \min\{u_1, u_2\} \le (k+1)T)\, c\, I(\max\{u_1, u_2\} \le (k+1)T)\{(k+1)T - \max\{u_1, u_2\}\}\, dF_{11}(u_1)\, dF_{21}(u_2).
\end{aligned}$$
The last term, due to system downtime, can be simplified to
$$\sum_{k=0}^{\infty} \int_{kT}^{(k+1)T} c\{(k+1)T - u_1\}\bigl[F_{21}(u_1) - F_{21}(kT)\bigr]\, dF_{11}(u_1) + \sum_{k=0}^{\infty} \int_{kT}^{(k+1)T} c\{(k+1)T - u_2\}\bigl[F_{11}(u_2) - F_{11}(kT)\bigr]\, dF_{21}(u_2).$$
These results are presented in Proposition 5.39.

Proposition 5.39. For a parallel system of two binary components, the expected renewal cycle length and the expected associated costs are given by
$$E\tau^T = T\sum_{k=0}^{\infty} (k+1)\bigl[\bar F_{11}(kT)\bar F_{21}(kT) - \bar F_{11}((k+1)T)\bar F_{21}((k+1)T)\bigr],$$
$$\begin{aligned}
EC^T &= \sum_{k=0}^{\infty} \bigl[\bar F_{11}(kT)\bar F_{21}(kT) - \bar F_{11}((k+1)T)\bar F_{21}((k+1)T)\bigr][c_I(k+1) + c_p] \\
&\quad + \sum_{k=0}^{\infty} \int_{kT}^{(k+1)T} c\{(k+1)T - u_1\}\bigl[F_{21}(u_1) - F_{21}(kT)\bigr]\, dF_{11}(u_1) \\
&\quad + \sum_{k=0}^{\infty} \int_{kT}^{(k+1)T} c\{(k+1)T - u_2\}\bigl[F_{11}(u_2) - F_{11}(kT)\bigr]\, dF_{21}(u_2).
\end{aligned}$$
An optimal $T$ can then be determined.
Similar expressions can easily be derived based on Theorem 5.38. Note that $\phi_{i1C}(x)$ is equal to 1 only if $x_1 = 1$ and $x_2 = 1$.
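Proposition 5.39 lends itself to direct numerical evaluation: truncate the sums over $k$ once $P(\tau_C > kT)$ is negligible and compute the downtime integrals by quadrature. The sketch below assumes lifetimes with densities, passed as hypothetical callables `F1`, `f1` (for $U_{11}$) and `F2`, `f2` (for $U_{21}$), and uses composite Simpson integration; the truncation tolerance and number of Simpson subintervals are arbitrary choices. Minimizing the returned ratio over a grid of $T$ values, or with the $\delta$-criterion above, then gives an approximately optimal $T$.

```python
import math


def simpson(g, a, b, m=1000):
    """Composite Simpson rule on [a, b] with an even number m of subintervals."""
    h = (b - a) / m
    s = g(a) + g(b)
    for k in range(1, m):
        s += (4 if k % 2 else 2) * g(a + k * h)
    return s * h / 3


def parallel_cycle(T, c, c_I, c_p, F1, f1, F2, f2, tol=1e-12):
    """(E tau^T, EC^T) of Proposition 5.39; sums stop once P(tau_C > kT) < tol."""
    S = lambda t: (1 - F1(t)) * (1 - F2(t))        # P(tau_C > t)
    E_len = E_cost = 0.0
    k = 0
    while S(k * T) >= tol:
        a, b = k * T, (k + 1) * T
        p = S(a) - S(b)                            # P(kT < tau_C <= (k+1)T)
        E_len += T * (k + 1) * p
        E_cost += p * (c_I * (k + 1) + c_p)
        # downtime terms: U_21 <= U_11 and U_11 < U_21, both in (kT, (k+1)T]
        E_cost += simpson(lambda u: c * (b - u) * (F2(u) - F2(a)) * f1(u), a, b)
        E_cost += simpson(lambda u: c * (b - u) * (F1(u) - F1(a)) * f2(u), a, b)
        k += 1
    return E_len, E_cost


def exponential(rate):
    """Distribution function and density of an Exp(rate) lifetime."""
    return (lambda t: 1 - math.exp(-rate * t)), (lambda t: rate * math.exp(-rate * t))


F1, f1 = exponential(1.0)
F2, f2 = exponential(1.0)
for T in (0.25, 0.5, 1.0, 2.0):
    E_len, E_cost = parallel_cycle(T, 100, 1, 5, F1, f1, F2, f2)
    print(T, E_cost / E_len)                       # B^T from (5.46)
```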

Special Case: Delay Time Model with Three Components

We consider a system comprising $n = 3$ components, with $M_i = 2$, i.e., each component has three states. State 2 is a perfect functioning state, whereas state 1 is a "partly defective" state, the result of a "fault." There will be a time lapse between the occurrence of the fault and the failure of the component, a "delay time." To simplify the mathematical analysis, we assume that all sojourn times $U_{ij}$ are exponentially distributed. The constant rates are denoted $r_{ij}$. Components 1 and 2 are assumed to have the same rates. The rates for different arrival states $j$ are assumed different, i.e., $r_{i2} \ne r_{i1}$, for $i = 1, 2, 3$.
The state of the system is given by the structure function
$$\phi(x) = I(x_1 + x_2 \ge 1)\, I(x_3 \ge 1).$$
Hence the system is functioning if at least one of components 1 and 2 is in state 1 or better, and component 3 is in state 1 or better. We may think of the system as a parallel system comprising components 1 and 2, in series with component 3, with each component having a delay time before failure occurs.
The maximal critical path vectors for level 1 are (0,1,2), (1,0,2), and (2,2,1), and these define $\phi_C(x)$ and $\phi_{ijC}(x)$. We see that $\phi_C(x) = 1$ for $x = (1,1,2)$ and $x > (1,1,2)$, as well as for $x = (0,2,2)$ and $x = (2,0,2)$, and that $\phi_{32C}(x_1, x_2, 2) = 1$ if $x_i \ge 1$, $i = 1, 2$.
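For two distribution functions $F_1$ and $F_2$, let $\bar F_1 * F_2(t) = \int_0^t \bar F_1(t - s)\, dF_2(s)$. Then the distribution $G(t, x)$ can be expressed as
$$\begin{aligned}
G(t, (2,2,2)) &= \bar F_{12}(t)\bar F_{22}(t)\bar F_{32}(t) = e^{-t\sum_{i=1}^{3} r_{i2}}, \\
G(t, (1,2,2)) &= [\bar F_{11} * F_{12}(t)]\, \bar F_{22}(t)\bar F_{32}(t) = \frac{r_{12}}{r_{12} - r_{11}}\bigl(e^{-r_{11}t} - e^{-r_{12}t}\bigr)\, e^{-t\sum_{i=2}^{3} r_{i2}}, \\
G(t, (1,1,2)) &= [\bar F_{11} * F_{12}(t)][\bar F_{21} * F_{22}(t)]\, \bar F_{32}(t) = \frac{r_{12}}{r_{12} - r_{11}}\bigl(e^{-r_{11}t} - e^{-r_{12}t}\bigr)\frac{r_{22}}{r_{22} - r_{21}}\bigl(e^{-r_{21}t} - e^{-r_{22}t}\bigr)\, e^{-t r_{32}}, \\
G(t, (0,2,2)) &= [F_{12} * F_{11}(t)]\, \bar F_{22}(t)\bar F_{32}(t) = \Bigl[1 - e^{-r_{12}t} - \frac{r_{12}}{r_{12} - r_{11}}\bigl(e^{-r_{11}t} - e^{-r_{12}t}\bigr)\Bigr] e^{-t\sum_{i=2}^{3} r_{i2}}.
\end{aligned}$$
From these expressions compact formulae can be derived for $H_{ij}(t, x) = r_{ij}\int_0^t G(s, x)\, ds$.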
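Because the components are independent and all start in state 2, $G(t, x)$ is a product of one-component marginals, which makes the formulas above easy to evaluate and check. The sketch below codes these marginals with the rates of the numerical example that follows (an assumption; any rates with $r_{i2} \ne r_{i1}$ work) and approximates $H_{ij}(t, x) = r_{ij}\int_0^t G(s, x)\, ds$ by Simpson's rule. The names `marginal`, `G`, and `H` are hypothetical.

```python
import math
from itertools import product

# r[i] = (r_i2, r_i1): rates of leaving states 2 and 1 of component i.
r = {1: (0.5, 1.0), 2: (0.5, 1.0), 3: (1 / 3, 0.5)}


def marginal(i, state, t):
    """P(X_t(i) = state) for exponential sojourns starting in state 2."""
    a, b = r[i]                                                # a = r_i2, b = r_i1
    p2 = math.exp(-a * t)                                      # bar F_i2(t)
    p1 = a / (a - b) * (math.exp(-b * t) - math.exp(-a * t))   # bar F_i1 * F_i2 (t)
    return {2: p2, 1: p1, 0: 1.0 - p1 - p2}[state]             # state 0: F_i2 * F_i1 (t)


def G(t, x):
    """G(t, x) = P(X_t = x); independent components give a product."""
    return math.prod(marginal(i, xi, t) for i, xi in enumerate(x, start=1))


def H(i, j, t, x, m=400):
    """H_ij(t, x) = r_ij * int_0^t G(s, x) ds for constant rates (Simpson rule)."""
    rate = r[i][0] if j == 2 else r[i][1]
    h = t / m
    s = G(0.0, x) + G(t, x)
    s += sum((4 if k % 2 else 2) * G(k * h, x) for k in range(1, m))
    return rate * s * h / 3


t = 1.0
print(sum(G(t, x) for x in product(range(3), repeat=3)))   # total probability
print(G(t, (1, 2, 2)), H(3, 2, t, (2, 2, 2)))
```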
Similar equations can be established for $G_s(t, x \mid x')$, the conditional distribution of $X(t)$ given $X(s) = x'$. We need to compute the conditional probabilities $P(X_t(i) = j_2 \mid X_s(i) = j_1)$ for $j_2 \le j_1$, $j_1 = 1, 2$, $i = 1, 2$. We see that $P(X_t(i) = 2 \mid X_s(i) = 2) = \bar F_{i2}(t - s)$, $P(X_t(i) = 1 \mid X_s(i) = 2) = \bar F_{i1} * F_{i2}(t - s)$, and $P(X_t(i) = 1 \mid X_s(i) = 1) = \bar F_{i1}(t - s)$. Furthermore, $P(X_t(i) = 0 \mid X_s(i) = 1) = F_{i1}(t - s)$ and $P(X_t(i) = 0 \mid X_s(i) = 2) = F_{i2} * F_{i1}(t - s)$. From these formulae we see, for example, that
$$\begin{aligned}
G_s(t, (2,2,2) \mid (2,2,2)) &= \bar F_{12}(t - s)\bar F_{22}(t - s)\bar F_{32}(t - s) = e^{-(t-s)\sum_{i=1}^{3} r_{i2}}, \\
G_s(t, (1,2,2) \mid (2,2,2)) &= [\bar F_{11} * F_{12}(t - s)]\, \bar F_{22}(t - s)\bar F_{32}(t - s) = \frac{r_{12}}{r_{12} - r_{11}}\bigl(e^{-r_{11}(t-s)} - e^{-r_{12}(t-s)}\bigr)\, e^{-(t-s)\sum_{i=2}^{3} r_{i2}}.
\end{aligned}$$
In this way all terms in $E\tau^T$ and $EC^T$ can be derived and an optimal $T$ determined.

Numerical Example

We assume that the failure rates are as follows: $r_{12} = r_{22} = 0.5$, $r_{11} = r_{21} = 1.0$, and $r_{32} = 1/3$, $r_{31} = 1/2$. Hence the expected times to failure for the three components are $2 + 1 = 3$, $2 + 1 = 3$, and $3 + 2 = 5$, respectively. The following costs are assumed: $c = 100$, $c_I = 1$, and $c_p = 5$, i.e., the cost of an overhaul is five times the inspection cost and the unit downtime cost is 100 times the inspection cost. Then we can compute the $B^T$ function and determine an optimal inspection interval. Figure 5.9 shows $B^T$ as a function of $T$, computed using Maple 10. By inspection, an optimal value is obtained for $T = 0.43$. Sensitivity analyses should be performed to see the effect of changes in the input data. Figure 5.10 shows an example where the unit downtime cost is increased by a factor of 10, from 100 to 1,000, to reflect the serious safety risk caused by downtime. The optimal inspection interval is then reduced to 0.18.
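As an independent cross-check of such numerical results, $B^T$ can also be estimated by simulation, using (5.49) and the renewal-reward ratio (5.46). The sketch below is our own illustration, not the Maple computation behind the figures. It samples each component's exponential sojourns in states 2 and 1 with the rates above, tracks the state vector over the ordered jump times, records $\tau_C$ (first time $\phi_C = 0$, using the maximal critical path vectors (0,1,2), (1,0,2), (2,2,1)) and $\tau$ (first time $\phi = 0$), and averages cycle lengths and costs. Sample sizes and the seed are arbitrary.

```python
import math
import random

R = {1: (0.5, 1.0), 2: (0.5, 1.0), 3: (1 / 3, 0.5)}     # (r_i2, r_i1)
MAXIMAL = [(0, 1, 2), (1, 0, 2), (2, 2, 1)]              # maximal critical path vectors


def phi(x):
    return int(x[0] + x[1] >= 1) * int(x[2] >= 1)


def phi_C(x):
    return int(not any(all(a <= b for a, b in zip(x, m)) for m in MAXIMAL))


def sample_taus(rng):
    """Simulate one path from (2, 2, 2); return (tau_C, tau)."""
    jumps = []
    for i, (r2, r1) in R.items():
        t1 = rng.expovariate(r2)              # 2 -> 1 (fault occurs)
        t0 = t1 + rng.expovariate(r1)         # 1 -> 0 (component fails)
        jumps += [(t1, i), (t0, i)]
    jumps.sort()
    x, tau_C = [2, 2, 2], None
    for t, i in jumps:
        x[i - 1] -= 1
        if tau_C is None and phi_C(tuple(x)) == 0:
            tau_C = t
        if phi(tuple(x)) == 0:
            return tau_C, t
    raise RuntimeError("system never failed")  # cannot happen: all states reach 0


def estimate_B(T, c=100.0, c_I=1.0, c_p=5.0, n=100_000, seed=1):
    """Renewal-reward estimate of B^T = EC^T / E tau^T."""
    rng = random.Random(seed)
    total_len = total_cost = 0.0
    for _ in range(n):
        tau_C, tau = sample_taus(rng)
        k = math.floor(tau_C / T)
        tau_T = (k + 1) * T                   # first inspection after tau_C
        cost = c_I * (k + 1) + c_p
        if tau <= tau_T:
            cost += c * (tau_T - tau)         # downtime until that inspection
        total_len += tau_T
        total_cost += cost
    return total_cost / total_len


for T in (0.2, 0.4, 0.6):
    print(T, estimate_B(T, n=10_000))
```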

Fig. 5.9. The $B^T$ function for the base case example with $c = 100$

Final Remarks

The optimization of $B^T$ needs to be carried out by numerical methods. For the numerical example considered in the previous section, the optimization criterion has the standard form seen for many maintenance models (nonincreasing up to a minimum value and then nondecreasing). In general, however, this is not the case for the model studied in this chapter. Examples can be constructed where the optimization function has several local minima, in line with the examples for a one-component system in Sect. 5.5.1.
The model can be extended in many ways, for example, by allowing a more general cost structure. As an example, we may distinguish between the cost of an overhaul when the system is in a critical state and when it has failed. The calculation of $EC^T$ in (5.49) then needs to be modified by considering a cost term $c_p + c_p' I(\tau < (k+1)T)$, where $c_p'$ is the additional overhaul cost if the system has failed compared to being in a critical state. The further analysis is analogous to the one carried out for $EC^T$. The next step would be to allow the overhaul cost to depend on the state vector. The analysis would then become more complicated, but still within the framework and approach presented.

Fig. 5.10. The $B^T$ function for $c = 1000$



Bibliographic Notes. A fundamental reference for basic replacement models is Barlow and Proschan [31]. There is an extensive literature about preventive replacement models, which is surveyed in the overviews of Pierskalla and Voelker [130], Sherif and Smith [144], Valdez-Flores and Feldman [158], and Jensen [94]. Block and Savits [47] and Boland and Proschan [49] give overviews of comparison methods and stochastic order in reliability theory. Shaked and Szekli [142] and Last and Szekli [116] compare replacement policies via point process methods. A good source for overviews of the rapidly growing literature on replacement and maintenance optimization models is the book Reliability and Maintenance of Complex Systems, edited by Özekici, in the NATO ASI Series.
The presentation in Sect. 5.2 follows the lines of [96]. A general set-up for cost-minimizing problems is introduced in Jensen [96], similar to Bergman [38] and Aven and Bergman [19]. It allows for specialization in different directions. As an example, the model presented by Aven [18] covering the total expected discounted costs criterion is included. What goes beyond the results in [19] is the possibility to take different information levels into account.
There are many multivariate extensions of the univariate exponential distribution; for an overview see Hutchinson and Lai [91] or Basu [33], which also cover the models of Freund [68] and Marshall and Olkin [121]. A detailed derivation, statistical properties, and methods of parameter estimation of the combined exponential distribution can be found in [83]. The optimization problem, also for more general cost structures, is treated in Heinrich and Jensen [85]. An alternative approach to solve optimization problems of the kind treated in this chapter is to use Markov decision processes. It has not been within the scope of this book to develop this theory here. An introduction to this theory can be found in the books of Puterman [131], Bertsekas [41], Davis [59], and Van der Duyn Schouten [159], which also contains applications in reliability.
An overview of several problems related to burn-in and the corresponding literature is given in the review articles by Block and Savits [46], Kuo and Kuo [113], and Leemis and Beneke [118]. The problem of sequential burn-in, where the failures of the items are observed and the burn-in time depends on these failures, is treated in the article of Marcus and Blumenthal [120]. In the papers of Costantini and Spizzichino [56] and Spizzichino [149] the assumption that the component lifelengths are independent is dropped and replaced by certain dependence models.
The problem of finding optimal replacement times for general repair processes has been treated by Aven in [12, 15]. The presentation of Markov-modulated minimal repair processes follows the lines of [93, 95], which include the technical details. A similar model considering interest rates has been investigated by Schöttl [137].
Section 5.5 is based on Aven and Castro [20] and Aven [10]. For reviews of the literature on delay time models, see Baker and Christer [29], Christer and Redmond [54], and Christer [53].
A Background in Probability and Stochastic Processes

This appendix serves as background for Chaps. 3–5. The focus is on stochastic processes on the positive real time axis $\mathbb{R}_+ = [0, \infty)$. Our aim is to present those parts of the measure-theoretic framework that are necessary to make the text intelligible and accessible to readers who are not familiar with the general theory of stochastic processes. For detailed presentations of this framework we recommend texts like Dellacherie and Meyer [61, 62], Rogers and Williams [133], and Kallenberg [101]. Point process theory is treated in Karr [103], Daley and Vere-Jones [58], and Brémaud [50]. A "nontechnical" introduction to parts of the general theory, accompanied by comprehensive historical and bibliographic remarks, can be found in Chap. II of the monograph of Andersen et al. [2]. A good introduction to basic results of probability theory is Williams [164].

A.1 Basic Definitions

We use the following notation:
$\mathbb{N} = \{1, 2, \ldots\}$
$\mathbb{N}_0 = \{0, 1, 2, \ldots\}$
$\mathbb{Z} = \{0, +1, -1, +2, -2, \ldots\}$ set of integers
$\mathbb{Q} = \{p/q : p \in \mathbb{Z},\ q \in \mathbb{N}\}$ set of rationals
$\mathbb{R} = (-\infty, +\infty)$ set of real numbers
$\mathbb{R}_+ = [0, \infty)$ set of nonnegative real numbers
$f \vee g$ and $f \wedge g$ denote $\max\{f, g\}$ and $\min\{f, g\}$, respectively, where $f$ and $g$ can be real-valued functions or real numbers. We denote $f^+ = f \vee 0$ and $f^- = -(f \wedge 0)$.
$\inf \emptyset = \infty$, $\sup \emptyset = 0$. Ratios of the form $0/0$ are set equal to 0.
A function f from a set A to a set B is denoted by f : A → B and f (a) is
the value of f at a ∈ A. To simplify the notation we also speak of f (a) as a
function.

T. Aven and U. Jensen, Stochastic Models in Reliability, Stochastic Modelling 245


and Applied Probability 41, DOI 10.1007/978-1-4614-7894-2,
© Springer Science+Business Media New York 2013
246 A Background in Probability and Stochastic Processes

For a function $f : \mathbb{R} \to \mathbb{R}$ we denote the left and right limits at $a$ (when they exist) by
$$f(a-) = \lim_{t \to a-} f(t) = \lim_{h \to 0,\, h > 0} f(a - h), \qquad f(a+) = \lim_{t \to a+} f(t) = \lim_{h \to 0,\, h > 0} f(a + h).$$
For two functions $f, g : \mathbb{R} \to \mathbb{R}$ we write $f(h) = o(g(h))$, $h \to h_0$, for some $h_0 \in \mathbb{R} \cup \{\infty\}$, if
$$\lim_{h \to h_0} \frac{f(h)}{g(h)} = 0;$$
we write $f(h) = O(g(h))$, $h \to h_0$, for some $h_0 \in \mathbb{R} \cup \{\infty\}$, if
$$\limsup_{h \to h_0} \frac{|f(h)|}{|g(h)|} < \infty.$$
An integral $\int f(s)\, ds$ of a real-valued measurable function is always an integral with respect to Lebesgue measure. Integrals over finite intervals, $\int_a^b$ with $a \le b$, are always integrals $\int_{[a,b]}$ over the closed interval $[a, b]$.
The indicator function of a set $A$, taking only the values 1 and 0, is denoted $I(A)$. This notation is preferred to $I_A$ or $I_A(a)$ when sets $A$ are described by means of random variables.
In the following we always refer to a basic probability space $(\Omega, \mathcal{F}, P)$, where
• $\Omega$ is a fixed nonempty set.
• $\mathcal{F}$ is a σ-algebra or σ-field on $\Omega$, i.e., a collection of subsets of $\Omega$ including $\Omega$, which is closed under countable unions and finite differences.
• $P$ is a probability measure on $(\Omega, \mathcal{F})$, i.e., a σ-additive, $[0, 1]$-valued function on $\mathcal{F}$ with $P(\Omega) = 1$.
If $\mathcal{A}$ is a collection of subsets of $\Omega$, then $\sigma(\mathcal{A})$ denotes the smallest σ-algebra containing $\mathcal{A}$, the σ-algebra generated by $\mathcal{A}$.
If $S$ is some set and $\mathcal{S}$ a σ-algebra of subsets of $S$, then the pair $(S, \mathcal{S})$ is called a measurable space. Let $S$ be a metric space (usually $\mathbb{R}$ or $\mathbb{R}^n$) and $\mathcal{O}$ the collection of its open sets. Then the σ-algebra generated by $\mathcal{O}$ is called the Borel σ-algebra and denoted $\mathcal{B}(S)$; in particular, we denote $\mathcal{B} = \mathcal{B}(\mathbb{R})$.
If $\mathcal{A}$ and $\mathcal{C}$ are two sub-σ-algebras of $\mathcal{F}$, then $\mathcal{A} \vee \mathcal{C}$ denotes the σ-algebra generated by the union of $\mathcal{A}$ and $\mathcal{C}$. The product σ-algebra of $\mathcal{A}$ and $\mathcal{C}$, generated by the sets $A \times C$, where $A \in \mathcal{A}$ and $C \in \mathcal{C}$, is denoted $\mathcal{A} \otimes \mathcal{C}$.

A.2 Random Variables, Conditional Expectations


A.2.1 Random Variables and Expectations
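On the fixed probability space $(\Omega, \mathcal{F}, P)$ we consider a mapping $X$ into the measurable space $(\mathbb{R}, \mathcal{B})$. If $X$ is measurable (or, more exactly, $\mathcal{F}$-$\mathcal{B}$-measurable), i.e., $X^{-1}(\mathcal{B}) = \{X^{-1}(B) : B \in \mathcal{B}\} \subset \mathcal{F}$, then it is called a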
A.2 Random Variables, Conditional Expectations 247

random variable. The σ-algebra $\sigma(X) = X^{-1}(\mathcal{B})$ is the smallest one with respect to which $X$ is measurable. It is called the σ-algebra generated by $X$.
Definition A.1 (Independence).
(i) Two events $A, B \in \mathcal{F}$ are called independent if $P(A \cap B) = P(A)P(B)$.
(ii) Suppose $\mathcal{A}_1$ and $\mathcal{A}_2$ are subfamilies of $\mathcal{F}$: $\mathcal{A}_1, \mathcal{A}_2 \subset \mathcal{F}$. Then $\mathcal{A}_1$ and $\mathcal{A}_2$ are called independent if $P(A_1 \cap A_2) = P(A_1)P(A_2)$ for all $A_1 \in \mathcal{A}_1$, $A_2 \in \mathcal{A}_2$.
(iii) Two random variables $X$ and $Y$ on $(\Omega, \mathcal{F})$ are called independent if $\sigma(X)$ and $\sigma(Y)$ are independent.
The expectation $EX$ (or $E[X]$) of a random variable is defined in the usual way as the integral $\int X\, dP$ with respect to the probability measure $P$. If the expectation $E|X|$ is finite, we call $X$ integrable. The law or distribution of $X$ on $(\mathbb{R}, \mathcal{B})$ is given by $F_X(B) = P(X \in B)$, $B \in \mathcal{B}$, and $F_X(t) = F_X((-\infty, t])$ is the distribution function. Often the index $X$ in $F_X$ is omitted when it is clear which random variable is considered. Let $g : \mathbb{R} \to \mathbb{R}$ be a measurable function and suppose that $g(X)$ is integrable. Then
$$Eg(X) = \int_\Omega g(X)\, dP = \int_{\mathbb{R}} g(t)\, dF_X(t).$$
If $X$ has a density $f_X : \mathbb{R} \to \mathbb{R}_+$, i.e., $P(X \in B) = \int_B f_X(t)\, dt$, $B \in \mathcal{B}$, then the expectation can be calculated as
$$Eg(X) = \int_{\mathbb{R}} g(t) f_X(t)\, dt.$$
The variance of a random variable $X$ with $E[X^2] < \infty$ is denoted $\mathrm{Var}[X]$ and defined by $\mathrm{Var}[X] = E[(X - EX)^2]$.
We now present some classical inequalities:
• Markov inequality: Suppose that $X$ is a random variable and $g : \mathbb{R}_+ \to \mathbb{R}_+$ a measurable nondecreasing function such that $g(|X|)$ is integrable. Then for any real $c > 0$
$$Eg(|X|) \ge g(c)P(|X| \ge c).$$
• Jensen's inequality: Suppose that $g : \mathbb{R} \to \mathbb{R}$ is a convex function and that $X$ is a random variable such that $X$ and $g(X)$ are integrable. Then
$$g(EX) \le Eg(X).$$
• Hölder's inequality: Let $p, q \in \mathbb{R}$ such that $p > 1$ and $1/p + 1/q = 1$. Suppose $X$ and $Y$ are random variables such that $|X|^p$ and $|Y|^q$ are integrable. Then $XY$ is integrable and
$$E|XY| \le E[|X|^p]^{1/p}\, E[|Y|^q]^{1/q}.$$
Taking $p = q = 2$, this inequality reduces to Schwarz's inequality.
• Minkowski's inequality: Suppose that $X$ and $Y$ are random variables such that $|X|^p$ and $|Y|^p$ are integrable for some $p \ge 1$. Then we have the triangle law
$$E[|X + Y|^p]^{1/p} \le E[|X|^p]^{1/p} + E[|Y|^p]^{1/p}.$$
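These inequalities are easy to check numerically, because the empirical distribution of a finite sample (mass $1/n$ at each point) is itself a probability measure; the inequalities therefore hold exactly for sample averages, up to floating-point rounding. The sketch below draws arbitrary test samples (our choice of distributions, exponents, and threshold) and evaluates each inequality.

```python
import math
import random

rng = random.Random(0)
n = 10_000
X = [rng.gauss(0.0, 1.0) for _ in range(n)]
Y = [rng.expovariate(1.0) for _ in range(n)]


def E(values):
    """Expectation under the empirical measure: the sample mean."""
    return sum(values) / len(values)


def norm(values, p):
    """E[|V|^p]^(1/p) under the empirical measure."""
    return E([abs(v) ** p for v in values]) ** (1 / p)


c, p, q = 1.5, 3.0, 1.5                               # 1/p + 1/q = 1
checks = {
    "Markov (g(t) = t^2)": E([x * x for x in X]) >= c * c * E([abs(x) >= c for x in X]),
    "Jensen (g = exp)": math.exp(E(X)) <= E([math.exp(x) for x in X]),
    "Hoelder": E([abs(x * y) for x, y in zip(X, Y)]) <= norm(X, p) * norm(Y, q),
    "Minkowski": norm([x + y for x, y in zip(X, Y)], p) <= norm(X, p) + norm(Y, p),
}
for name, holds in checks.items():
    print(name, holds)
```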

At the end of this section we list some types of convergence of real-valued random variables. Let $X, X_n$, $n \in \mathbb{N}$, be random variables carried by the triple $(\Omega, \mathcal{F}, P)$ and taking values in $(\mathbb{R}, \mathcal{B})$ with distribution functions $F, F_n$. Then the following forms of convergence $X_n \to X$ are fundamental in probability theory.
• Almost sure convergence: We say $X_n \to X$ almost surely ($P$-a.s.) if
$$P\bigl(\lim_{n \to \infty} X_n = X\bigr) = 1.$$
• Convergence in probability: We say $X_n \xrightarrow{P} X$ in probability if for every $\varepsilon > 0$,
$$\lim_{n \to \infty} P(|X_n - X| > \varepsilon) = 0.$$
• Convergence in distribution: We say $X_n \xrightarrow{D} X$ in distribution if for every $x$ in the set of continuity points of $F$,
$$\lim_{n \to \infty} F_n(x) = F(x).$$
• Convergence in the $p$th mean or convergence in $L^p$: We say $X_n \to X$ in the $p$th mean, $p \ge 1$, or in $L^p$, if $|X|^p$, $|X_n|^p$ are integrable and
$$\lim_{n \to \infty} E|X_n - X|^p = 0.$$
The relationships between these forms of convergence are the following:
$$X_n \to X\ P\text{-a.s.} \Rightarrow X_n \xrightarrow{P} X, \qquad X_n \to X \text{ in } L^p \Rightarrow X_n \xrightarrow{P} X, \qquad X_n \xrightarrow{P} X \Rightarrow X_n \xrightarrow{D} X.$$

A.2.2 $L^p$-Spaces and Conditioning

We introduce the vector spaces $L^p = L^p(\Omega, \mathcal{F}, P)$, $p \ge 1$, of (equivalence classes of) random variables $X$ such that $|X|^p$ is integrable, without distinguishing between random variables $X, Y$ with $P(X = Y) = 1$. With the norm $\|X\|_p = (E|X|^p)^{1/p}$ the space $L^p$ becomes a complete space in that for any Cauchy sequence $(Y_n)$, $n \in \mathbb{N}$, there exists a $Y \in L^p$ such that $\|Y_n - Y\|_p \to 0$ for $n \to \infty$. A sequence $(Y_n)$ is called a Cauchy sequence if
$$\sup_{r, s \ge k} \|Y_r - Y_s\|_p \to 0 \quad \text{for } k \to \infty.$$
$L^p$ is a complete normed vector space, or Banach space. For $1 \le p \le q$ and $X \in L^q$ it follows by Jensen's inequality that
$$\|X\|_p \le \|X\|_q.$$
So $L^q$ is a subspace of $L^p$ if $q \ge p$. For $p = 2$ we define the scalar product $\langle X, Y \rangle = E[XY]$, which makes $L^2$ a Hilbert space, i.e., a Banach space with a norm induced by a scalar product.
We have introduced $L^p$-spaces to be able to look at conditional expectations from a geometrical point of view. Before we give a formal definition of conditional expectations, we consider the orthogonal projection in Hilbert spaces.

Theorem A.2. Let $K$ be a complete vector subspace of $L^2$ and $X \in L^2$. Then there exists $Y$ in $K$ such that
(i) $\|X - Y\|_2 = \inf\{\|X - Z\|_2 : Z \in K\}$,
(ii) $X - Y \perp Z$, i.e., $E[(X - Y)Z] = 0$, for all $Z \in K$.
Properties (i) and (ii) are equivalent, and if $Y^*$ shares either property (i) or (ii) with $Y$, then $P(Y = Y^*) = 1$.

The short proof of this result can be found in Williams [164]. The theorem states that there is a unique element in the subspace $K$ that has the shortest distance from a given element in $L^2$, and that the projection direction is orthogonal to $K$. A similar projection can be carried out from $L^1(\Omega, \mathcal{F}, P)$ onto $L^1(\Omega, \mathcal{A}, P)$, where $\mathcal{A} \subset \mathcal{F}$ is some sub-σ-algebra of $\mathcal{F}$. Of course, any $\mathcal{A}$-measurable random variable of $L^1(\Omega, \mathcal{A}, P)$ is also in $L^1(\Omega, \mathcal{F}, P)$. Thus, for a given $X$ in $L^1(\Omega, \mathcal{F}, P)$, we are looking for the "best" approximation in $L^1(\Omega, \mathcal{A}, P)$. A solution to this problem is given by the following fundamental theorem and definition.

Theorem A.3. Let $X$ be a random variable in $L^1(\Omega, \mathcal{F}, P)$ and let $\mathcal{A}$ be a sub-σ-algebra of $\mathcal{F}$. Then there exists a random variable $Y$ in $L^1(\Omega, \mathcal{A}, P)$ such that
$$\int_A Y\, dP = \int_A X\, dP \quad \text{for all } A \in \mathcal{A}. \qquad (A.1)$$
If $Y^*$ is another random variable in $L^1(\Omega, \mathcal{A}, P)$ with property (A.1), then $P(Y = Y^*) = 1$.
A random variable $Y \in L^1(\Omega, \mathcal{A}, P)$ with property (A.1) is called (a version of) the conditional expectation $E[X|\mathcal{A}]$ of $X$ given $\mathcal{A}$. We write $Y = E[X|\mathcal{A}]$, noting that equality holds $P$-a.s.

The standard proof of this theorem uses the Radon–Nikodym theorem (cf., for example, Billingsley [42]). A more constructive proof is via the Orthogonal Projection Theorem A.2. In the case that $EX^2 < \infty$, i.e., $X \in L^2(\Omega, \mathcal{F}, P)$, we can use Theorem A.2 directly with $K = L^2(\Omega, \mathcal{A}, P)$. Let $Y$ be the projection of $X$ in $K$. Then property (ii) of Theorem A.2 yields $E[(X - Y)Z] = 0$ for all $Z \in K$. Take $Z = I_A$, $A \in \mathcal{A}$. Then $E[(X - Y)I_A] = 0$ is just condition (A.1), which shows that $Y$ is a version of the conditional expectation $E[X|\mathcal{A}]$. If $X$ is not in $L^2$, we split $X$ as $X^+ - X^-$ and approximate both parts by sequences $X_n^+ = X^+ \wedge n$ and $X_n^- = X^- \wedge n$, $n \in \mathbb{N}$, of $L^2$-random variables. A limiting argument for $n \to \infty$ yields the desired result (see [164] for a complete proof).
Conditioning with respect to a σ-algebra is in general not very concrete, so the idea of projecting onto a subspace may give some additional insight. Another point of view is to look at conditioning as an averaging operator. The sub-σ-algebra $\mathcal{A}$ lies between the extremes $\mathcal{F}$ and $\mathcal{G} = \{\emptyset, \Omega\}$, the trivial σ-field. As can easily be verified from the definition, the corresponding conditional expectations of $X$ are $X = E[X|\mathcal{F}]$ and $EX = E[X|\mathcal{G}]$. So for $\mathcal{A}$ with $\mathcal{G} \subset \mathcal{A} \subset \mathcal{F}$ the conditional expectation $E[X|\mathcal{A}]$ lies "between" $X$ (no averaging, complete information about the value of $X$) and $EX$ (overall average, no information about the value of $X$). The more events of $\mathcal{F}$ are included in $\mathcal{A}$, the more $E[X|\mathcal{A}]$ varies and the closer this conditional expectation is to $X$, in a sense made precise in the following proposition.

Proposition A.4. Suppose $X \in L^2(\Omega, \mathcal{F}, P)$ and let $\mathcal{A}_1$ and $\mathcal{A}_2$ be sub-σ-algebras of $\mathcal{F}$ such that $\mathcal{A}_1 \subset \mathcal{A}_2 \subset \mathcal{F}$. Then, denoting $Y_i = E[X|\mathcal{A}_i]$, $i = 1, 2$, we have the following inequalities:
(i) $\|X - Y_2\|_2 \le \|X - Y_1\|_2 \le \|X - Y_2\|_2 + \|Y_2 - Y_1\|_2$.
(ii) $\|Y_1 - EX\|_2 \le \|Y_2 - EX\|_2 \le \|Y_1 - EX\|_2 + \|Y_2 - Y_1\|_2$.

Proof. The right-hand inequalities are just special cases of the triangle law for the $L^2$-norm, i.e., of Minkowski's inequality. So we need to prove the left-hand inequalities.
(i) Since $Y_2$ is the projection of $X$ on $L^2(\Omega, \mathcal{A}_2, P)$ and
$$Y_1 \in L^2(\Omega, \mathcal{A}_1, P) \subset L^2(\Omega, \mathcal{A}_2, P),$$
we can use Theorem A.2 to obtain
$$\|X - Y_2\|_2 = \inf\{\|X - Z\|_2 : Z \in L^2(\Omega, \mathcal{A}_2, P)\} \le \|X - Y_1\|_2.$$
(ii) Denoting $\tilde Y_i = Y_i - EX$, we see that $\tilde Y_1$ is the projection of $\tilde Y_2$ on $L^2(\Omega, \mathcal{A}_1, P)$. Again from Theorem A.2 it follows that $\tilde Y_2 - \tilde Y_1$ and $\tilde Y_1$ are orthogonal. The Pythagorean theorem then takes the form
$$\|\tilde Y_2\|_2^2 = \|\tilde Y_2 - \tilde Y_1 + \tilde Y_1\|_2^2 = \|\tilde Y_2 - \tilde Y_1\|_2^2 + \|\tilde Y_1\|_2^2,$$
which gives $\|\tilde Y_1\|_2 \le \|\tilde Y_2\|_2$. □




Remark A.5. 1. Using some of the properties of conditional expectations stated below, all the inequalities but the first in (i) of the proposition can be shown to hold also in the $L^p$-norm, $p \ge 1$, provided that $X \in L^p$.
2. If we view $E[X|\mathcal{A}]$ as a predictor of the unknown $X$, then Proposition A.4 says that the closer $\mathcal{A}$ is to $\mathcal{F}$, the better in the mean square sense is this estimate and the bigger is the variance $\mathrm{Var}[E[X|\mathcal{A}]]$ of this random variable.
If $\mathcal{A}$ is generated by a finite or countable partition of $\Omega$, then the conditional expectation can be given explicitly.
Theorem A.6. Let $X$ be an integrable random variable, i.e., $X \in L^1$, and let $\mathcal{A}$ be a sub-σ-algebra of $\mathcal{F}$ generated by a finite or countable partition $A_1, A_2, \ldots$ of $\Omega$. Then
$$E[X|\mathcal{A}](\omega) = \frac{1}{P(A_i)} \int_{A_i} X\, dP = \frac{E[I_{A_i} X]}{P(A_i)}, \qquad \omega \in A_i,\ P(A_i) > 0.$$
If $P(A_i) = 0$, the value of $E[X|\mathcal{A}]$ over $A_i$ is set to 0.
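On a finite probability space Theorem A.6 is a direct recipe. The sketch below uses a fair six-point space, $X(\omega) = \omega^2$, and two nested partitions (all hypothetical choices); it computes $E[X|\mathcal{A}]$ blockwise and then checks numerically the defining property (A.1), which is also the orthogonality in Theorem A.2(ii), successive conditioning (property 8 in Sect. A.2.3), and the left-hand inequalities of Proposition A.4.

```python
P = {w: 1 / 6 for w in range(6)}          # fair six-point space
X = {w: float(w * w) for w in P}
A1 = [{0, 1, 2}, {3, 4, 5}]               # coarse partition
A2 = [{0, 1}, {2}, {3, 4}, {5}]           # finer partition: sigma(A1) lies in sigma(A2)


def cond_exp(Z, partition):
    """E[Z | sigma(partition)] via Theorem A.6 (value 0 on null blocks)."""
    Y = {}
    for block in partition:
        p = sum(P[w] for w in block)
        value = sum(P[w] * Z[w] for w in block) / p if p > 0 else 0.0
        for w in block:
            Y[w] = value
    return Y


def E(Z):
    return sum(P[w] * Z[w] for w in P)


def dist(Z, W):
    """L2 distance ||Z - W||_2."""
    return E({w: (Z[w] - W[w]) ** 2 for w in P}) ** 0.5


Y1, Y2 = cond_exp(X, A1), cond_exp(X, A2)
EX = {w: E(X) for w in P}                 # the constant EX as a random variable
for block in A1:                          # (A.1): E[(X - Y2) I_A] = 0 for A in sigma(A2)
    print(sum(P[w] * (X[w] - Y2[w]) for w in block))
print(max(abs(cond_exp(Y2, A1)[w] - Y1[w]) for w in P))   # successive conditioning
print(dist(X, Y2) <= dist(X, Y1), dist(Y1, EX) <= dist(Y2, EX))
```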

A.2.3 Properties of Conditional Expectations

Here and in the following relations like <, ≤, = between random variables are
always assumed to hold with probability one and the term P -a.s. is suppressed.
All random variables in this subsection are assumed to be integrable, i.e., to
be elements of L1 (Ω, F , P ). Let A and C denote sub-σ-algebras of F . Then
the following properties for conditional expectations hold true.
1. If $Y$ is any version of $E[X|\mathcal A]$, then $EY = EX$.
2. If $X$ is $\mathcal A$-measurable ($\sigma(X) \subset \mathcal A$), then $E[X|\mathcal A] = X$.
3. Linearity. $E[aX + bY|\mathcal A] = aE[X|\mathcal A] + bE[Y|\mathcal A]$, $a, b \in \mathbb R$.
4. Monotonicity. If $X \le Y$, then $E[X|\mathcal A] \le E[Y|\mathcal A]$.
5. Monotone Convergence. If $(X_n)$ is an increasing sequence and $X_n \to X$ P-a.s., then $E[X_n|\mathcal A]$ converges almost surely:
$$\lim_{n\to\infty} E[X_n|\mathcal A] = E[X|\mathcal A].$$
6. Dominated Convergence. If $(X_n)$ is a sequence of random variables such that $\sup_n |X_n|$ is integrable and $X_n \to X$ P-a.s., then $E[X_n|\mathcal A]$ converges almost surely:
$$\lim_{n\to\infty} E[X_n|\mathcal A] = E[X|\mathcal A].$$
7. Jensen's Inequality. If $g: \mathbb R \to \mathbb R$ is convex and $g(X)$ is integrable, then
$$E[g(X)|\mathcal A] \ge g(E[X|\mathcal A]),$$
in particular $\|X\|_p \ge \|E[X|\mathcal A]\|_p$ for $p \ge 1$.
8. Successive Conditioning. If $\mathcal H$ is a sub-σ-algebra of $\mathcal A$, then
$$E[E[X|\mathcal A]|\mathcal H] = E[X|\mathcal H].$$
252 A Background in Probability and Stochastic Processes

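9. Factoring. Let the random variable $Z$ be $\mathcal A$-measurable and suppose that $ZX$ is integrable. Then
$$E[ZX|\mathcal A] = Z\,E[X|\mathcal A].$$
10. Independent Conditioning. Let $\mathcal C$ and $\mathcal A$ be sub-σ-algebras of $\mathcal F$ such that $\mathcal C$ is independent of $\sigma(X) \vee \mathcal A$. Then
$$E[X|\mathcal C \vee \mathcal A] = E[X|\mathcal A].$$
In particular, if $X$ is independent of $\mathcal C$, then $E[X|\mathcal C] = EX$.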
The proofs of all these properties are mainly based on the definition of
the conditional expectation and follow the ideas of the corresponding proofs
for unconditional expectations, e.g., for monotone and dominated convergence
(cf. Williams [164], pp. 89–90).
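Successive conditioning (property 8) and factoring (property 9) can be checked exactly on a small finite space where every σ-algebra is generated by a partition. A minimal sketch, assuming two fair coin tosses, $\mathcal A$ generated by the first toss and $\mathcal H$ the trivial σ-algebra; the variables $X$, $Z$ and the helper `cond_exp` are hypothetical.

```python
from fractions import Fraction

# Hypothetical space: two fair coin tosses, each outcome with probability 1/4.
omega = ["HH", "HT", "TH", "TT"]
prob = {w: Fraction(1, 4) for w in omega}
X = {w: w.count("H") ** 2 for w in omega}        # squared number of heads
Z = {w: 1 if w[0] == "H" else 0 for w in omega}  # A-measurable: depends on toss 1 only

part_A = [{"HH", "HT"}, {"TH", "TT"}]  # A = sigma(first toss)
part_H = [set(omega)]                  # H = trivial sigma-algebra, a subset of A

def cond_exp(V, partition):
    """E[V | sigma(partition)] on a finite space without null blocks."""
    out = {}
    for block in partition:
        p = sum(prob[w] for w in block)
        v = sum(prob[w] * V[w] for w in block) / p
        out.update({w: v for w in block})
    return out

E_X_A = cond_exp(X, part_A)
tower = cond_exp(E_X_A, part_H) == cond_exp(X, part_H)                  # property 8
ZX = {w: Z[w] * X[w] for w in omega}
factoring = cond_exp(ZX, part_A) == {w: Z[w] * E_X_A[w] for w in omega}  # property 9
print(tower, factoring, E_X_A["HT"], E_X_A["TT"])   # True True 5/2 1/2
```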

A.2.4 Regular Conditional Probabilities


We define the conditional probability of an event $A \in \mathcal F$ given a sub-σ-algebra $\mathcal A$ as
$$P(A|\mathcal A) = E[I_A|\mathcal A].$$
By the monotonicity, linearity, and monotone convergence properties we clearly have
$$0 \le P(A|\mathcal A) \le 1, \qquad P(\Omega|\mathcal A) = 1,$$
and
$$P\Big(\bigcup_{n=1}^{\infty} A_n \,\Big|\, \mathcal A\Big) = \sum_{n=1}^{\infty} P(A_n|\mathcal A)$$
for a fixed sequence $A_1, A_2, \ldots$ of disjoint events of $\mathcal F$. From this we cannot conclude that for almost all $\omega \in \Omega$ the map $A \mapsto P(A|\mathcal A)(\omega)$ defines a probability on $\mathcal F$. Although we often dispense with a discussion of P-null sets, it is important here. For example, the last equation, expressing the σ-additivity of conditional probability, only holds with probability 1. Except in trivial cases, there are uncountably many sequences of disjoint events, and each of these sequences determines an exceptional P-null set. The union of all these exceptional sets need not have probability 0 (it need not even be an element of $\mathcal F$). Fortunately, for most cases encountered in applications there exists a so-called regular conditional probability.
Definition A.7. A map $Q: \Omega \times \mathcal F \to [0, 1]$ is called a regular conditional probability given $\mathcal A \subset \mathcal F$ if
(i) for all $A \in \mathcal F$, $\omega \mapsto Q(\omega, A)$ is a version of $E[I_A|\mathcal A]$;
(ii) there exists some $N_0 \in \mathcal F$ with $P(N_0) = 0$ such that the map $A \mapsto Q(\omega, A)$ is a probability measure on $\mathcal F$ for all $\omega \notin N_0$.

A.2.5 Computation of Conditional Expectations

Besides the simple case of a sub-σ-algebra $\mathcal A$ generated by a countable partition of $\Omega$ mentioned in Theorem A.6, we consider two further ways to determine conditional expectations $E[X|\mathcal A]$.
1. If there exists a regular conditional probability $Q$ given $\mathcal A$, we can determine the conditional distribution $Q_X$ of a random variable $X$ given $\mathcal A$: $Q_X(\omega, B) = Q(\omega, X^{-1}(B))$. Then for any measurable function $g: \mathbb R \to \mathbb R$ such that $g(X)$ is integrable,
$$\int_{\mathbb R} g(x)\, Q_X(\omega, dx)$$
is a version of $E[g(X)|\mathcal A]$.
2. We consider two random variables $X$ and $Y$ and a measurable function $g$ such that $g(X)$ is integrable. We write
$$E[g(X)|Y] = E[g(X)|\sigma(Y)]$$
for the conditional expectation of $g(X)$ given $Y$. By definition $E[g(X)|Y]$ is $\sigma(Y)$-measurable, and by Doob's representation theorem (cf. [61], p. 12) there exists a measurable function $h: Y(\Omega) \to \mathbb R$ such that
$$E[g(X)|Y] = h(Y).$$
If we know such a function $h$, we can also determine $h(y) = E[g(X)|Y = y]$, $y \in \mathbb R$, the conditional expectation of $g(X)$ given that $Y$ has realization $y$. Of course, if $P(Y = y) > 0$, we have
$$h(y) = E[g(X)|Y = y] = \frac{1}{P(Y = y)} \int_{\{Y = y\}} g(X)\,dP.$$
But even if the set $\{Y = y\}$ has probability 0, we are now able to determine the conditional expectation of $g(X)$ given that $Y$ takes the value $y$ (provided we know $h$). Consider the case that a joint density $f_{XY}(x, y)$ of $X$ and $Y$ is known. Let $f_Y(y) = \int_{\mathbb R} f_{XY}(x, y)\,dx$ be the density of the (marginal) distribution of $Y$ and
$$f_{X|Y}(x|y) = \begin{cases} f_{XY}(x, y)/f_Y(y) & \text{if } f_Y(y) \ne 0,\\ 0 & \text{otherwise} \end{cases}$$
the elementary conditional density of $X$ given $Y$. A natural choice for the function $h$ would then be
$$h(y) = \int_{\mathbb R} g(x)\, f_{X|Y}(x|y)\,dx.$$

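We claim that $h(Y)$ is a version of the conditional expectation $E[g(X)|Y]$. To prove this, note that the elements of the σ-algebra $\sigma(Y)$ are of the form $Y^{-1}(B) = \{\omega : Y(\omega) \in B\}$, $B \in \mathcal B$. Therefore, we have to show that
$$E[g(X) I_B(Y)] = \iint g(x) I_B(y) f_{XY}(x, y)\,dx\,dy$$
equals
$$E[h(Y) I_B(Y)] = \int h(y) I_B(y) f_Y(y)\,dy$$
for all $B \in \mathcal B$. But this follows directly from Fubini's Theorem, since $f_{X|Y}(x|y) f_Y(y) = f_{XY}(x, y)$ whenever $f_Y(y) \ne 0$, while the set of $y$ with $f_Y(y) = 0$ contributes nothing to either integral. This proves the assertion.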
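The density recipe can be checked numerically. A minimal sketch, assuming $(X, Y)$ is standard bivariate normal with an illustrative correlation `RHO = 0.6`, for which the standard results $E[X|Y=y] = \rho y$ and $E[X^2|Y=y] = 1-\rho^2+\rho^2y^2$ hold; the midpoint-rule helper `integrate` and the name `h_func` are hypothetical.

```python
import math

RHO = 0.6  # assumed correlation of a standard bivariate normal (X, Y)

def f_xy(x, y, rho=RHO):
    """Joint density of a standard bivariate normal with correlation rho."""
    c = 1.0 / (2 * math.pi * math.sqrt(1 - rho**2))
    q = (x * x - 2 * rho * x * y + y * y) / (2 * (1 - rho**2))
    return c * math.exp(-q)

def integrate(fun, a=-10.0, b=10.0, n=4000):
    """Midpoint rule on [a, b]; adequate for the rapidly decaying integrands here."""
    step = (b - a) / n
    return step * sum(fun(a + (k + 0.5) * step) for k in range(n))

def h_func(y, g=lambda x: x):
    """h(y) = integral of g(x) f_{X|Y}(x|y) dx, with f_{X|Y} = f_XY / f_Y (0 if f_Y(y) = 0)."""
    f_y = integrate(lambda x: f_xy(x, y))   # marginal density f_Y(y)
    if f_y == 0.0:
        return 0.0
    return integrate(lambda x: g(x) * f_xy(x, y)) / f_y

for y in (-1.0, 0.5, 2.0):
    print(y, round(h_func(y), 6), RHO * y)  # h(y) should agree with rho * y
```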

A.3 Stochastic Processes on a Filtered Probability Space


Definition A.8. 1. A stochastic process is a family X = (Xt ), t ∈ R+ , of
random variables all defined on the same probability space (Ω, F , P ) with
values in a measurable space (S, S).
2. For $\omega \in \Omega$ the mapping $t \mapsto X_t(\omega)$ is called a path.
3. Two stochastic processes X, Y are called indistinguishable, if P -almost all
paths are identical: P (Xt = Yt , ∀t ∈ R+ ) = 1.
If it is claimed that a process is unique, we mean uniqueness up to indis-
tinguishability. Also for conditional expectations no distinction will be made
between one version of the conditional expectation and the equivalence class of
P -a.s. equal versions. A real-valued process is called right- or left-continuous,
nondecreasing, of bounded variation on finite intervals etc., if P -almost all
paths have this property, i.e., if the process is indistinguishable from a pro-
cess, the paths of which all have that property. In particular a process is
called cadlag (continu à droite, limité à gauche), if almost all paths are right-
continuous and left-limited.
If not otherwise mentioned, we always refer in the following to real-valued
stochastic processes, i.e., to processes X = (Xt ) for which the Xt take values
in (S, S) = (R, B), where B = B(R) is the Borel σ-algebra on R.
Definition A.9. A stochastic process $X$ is called
1. integrable, if $E|X_t| < \infty$ for all $t \in \mathbb R_+$;
2. square integrable, if $EX_t^2 < \infty$ for all $t \in \mathbb R_+$;
3. bounded in $L^p$, $p \ge 1$, if $\sup_{t\in\mathbb R_+} E|X_t|^p < \infty$;
4. uniformly integrable, if $\lim_{c\to\infty} \sup_{t\in\mathbb R_+} E[|X_t| I(|X_t| > c)] = 0$.
Deviating from our notation some authors call an L2 -bounded stochastic
process square integrable.
Uniform integrability plays an important role in martingale theory. There-
fore, we look for criteria for this property. A very useful one is given in the
following proposition.

Proposition A.10. A stochastic process $X$ is uniformly integrable if and only if there exists a positive increasing convex function $G: \mathbb R_+ \to \mathbb R_+$ such that
1. $\lim_{t\to\infty} G(t)/t = \infty$ and
2. $\sup_{t\in\mathbb R_+} EG(|X_t|) < \infty$.
In particular, taking $G(t) = t^p$, we see that a process $X$ which is bounded in $L^p$ for some $p > 1$ is uniformly integrable. A process bounded in $L^1$ is not necessarily uniformly integrable. The property of uniform integrability links convergence in probability with convergence in $L^1$.
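A standard example of a sequence that is bounded in $L^1$ but not uniformly integrable is $X_n = n\,I(U < 1/n)$ with $U$ uniform on $(0,1)$; it is used here as a hypothetical illustration. The sketch below evaluates the quantities of Definition A.9 exactly: $E|X_n| = 1$ for all $n$, yet the tail expectation $E[|X_n| I(|X_n| > c)]$ equals 1 for every $n > c$, so its supremum over $n$ does not tend to 0.

```python
from fractions import Fraction

# Hypothetical example: X_n = n * I(U < 1/n) with U ~ Uniform(0, 1),
# so X_n = n with probability 1/n and X_n = 0 otherwise.

def l1_norm(n):
    """E|X_n| = n * P(U < 1/n) = 1 for every n."""
    return n * Fraction(1, n)

def tail_expectation(n, c):
    """E[|X_n| I(|X_n| > c)]: equals E|X_n| if n > c and 0 otherwise."""
    return n * Fraction(1, n) if n > c else Fraction(0)

ns = range(1, 1001)
print(max(l1_norm(n) for n in ns))   # 1, so (X_n) is bounded in L^1
for c in (10, 100, 500):
    # the supremum over n stays 1 for every c, so the limit in Definition A.9 is not 0
    print(c, max(tail_expectation(n, c) for n in ns))
```

Note that $X_n \to 0$ almost surely (for fixed $\omega$, $X_n(\omega) = 0$ once $1/n \le U(\omega)$), while $EX_n = 1$ for all $n$.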
Theorem A.11. Let $(X_n)$, $n \in \mathbb N$, be a sequence of integrable random variables that converges in probability to a random variable $X$, i.e., $P(|X_n - X| > \varepsilon) \to 0$ as $n \to \infty$ for all $\varepsilon > 0$. Then
$$X \in L^1 \text{ and } X_n \xrightarrow{L^1} X, \text{ i.e., } E|X_n - X| \to 0 \text{ as } n \to \infty,$$
if and only if $(X_n)$ is uniformly integrable.


So if $X_n \to X$ P-a.s. and the sequence is uniformly integrable, then $EX_n \to EX$ as $n \to \infty$. At first sight it seems reasonable that, under uniform integrability, almost sure convergence also carries over to conditional expectations $E[X_n|\mathcal A]$ for a sub-σ-algebra $\mathcal A \subset \mathcal F$. But (surprisingly) this does not hold in general; for a counterexample see Jensen [97]. The condition $\sup_n |X_n| \in L^1$ in the dominated convergence theorem for conditional expectations as stated above is necessary for the convergence result and cannot be weakened.
To describe the information that is gathered observing some stochastic
phenomena in time, we introduce filtrations.
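Definition A.12. 1. A family $\mathbb F = (\mathcal F_t)$, $t \in \mathbb R_+$, of sub-σ-algebras of $\mathcal F$ is called a filtration if it is nondecreasing, i.e., if $s \le t$, then $\mathcal F_s \subset \mathcal F_t$. We denote $\mathcal F_\infty = \bigvee_{t\in\mathbb R_+} \mathcal F_t = \sigma\big(\bigcup_{t\in\mathbb R_+} \mathcal F_t\big)$.
2. If $\mathbb F = (\mathcal F_t)$ is a filtration, then we write
$$\mathcal F_{t+} = \bigcap_{h>0} \mathcal F_{t+h} \quad\text{and}\quad \mathcal F_{t-} = \sigma\Big(\bigcup_{h>0} \mathcal F_{t-h}\Big).$$
3. A filtration $(\mathcal F_t)$ is called right-continuous if $\mathcal F_{t+} = \mathcal F_t$ for all $t \in \mathbb R_+$.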
4. A probability space (Ω, F , P ) together with a filtration F is called a stochas-
tic basis: (Ω, F , F, P ).
5. A stochastic basis (Ω, F , F, P ) is called complete, if F is complete, i.e., F
contains all subsets of P -null sets, and if each Ft contains all P -null sets
of F .
6. A filtration F is said to fulfill the usual conditions, if it is right-continuous
and complete.
256 A Background in Probability and Stochastic Processes

The σ-algebra Ft is often interpreted as the information gathered up to


time t, or more precisely, the set of events of F , which can be distinguished at
time t. If a stochastic process X = (Xt ), t ∈ R+ , is observed, then a natural
choice for a corresponding filtration would be Ft = FtX = σ(Xs , 0 ≤ s ≤
t), which is the smallest σ-algebra such that all random variables Xs , 0 ≤
s ≤ t, are Ft -measurable. Here we assume that FtX is augmented so that
the generated filtration fulfills the usual conditions. Such an augmentation is
always possible (cf. Dellacherie and Meyer [61], p. 115).
Remark A.13. Sometimes it is discussed whether such an augmentation affects
the filtration too strongly. Indeed, if we consider, for example, two mutually
singular probability measures, say P and Q on the measurable space (Ω, F )
such that P (A) = 1 − Q(A) = 1 for some A ∈ F , then completing each Ft
with all P and Q negligible sets may result in Ft = F for all t ∈ R+ , which is
a rather uninteresting case destroying the modeling of the evolution in time.
But in the material we cover in this book such cases are not essential and we
always assume that a stochastic basis is given with a filtration meeting the
usual conditions.
Definition A.14. A stochastic process X = (Xt ), t ∈ R+ , is called adapted
to a filtration F = (Ft ), if Xt is Ft -measurable for all t ∈ R+ .
Definition A.15. A stochastic process X is F-progressive or progressively
measurable, if for every t, the mapping (s, ω) → Xs (ω) on [0, t] × Ω is mea-
surable with respect to the product σ-algebra B([0, t]) ⊗ Ft , where B([0, t]) is
the Borel σ-algebra on [0, t].
Theorem A.16. Let $X$ be a real-valued stochastic process. If $X$ is left- or right-continuous and adapted to $\mathbb F$, then it is $\mathbb F$-progressive. If $X$ is $\mathbb F$-progressive, then so is $\int_0^t X_s\,ds$.
A further measurability restriction is needed in connection with stochastic
processes in continuous time. This is the fundamental concept of predictability.
Definition A.17. Let F be a filtration on the basic probability space and let
P(F) be the σ-algebra on (0, ∞) × Ω generated by the system of sets
(s, t] × A, 0 ≤ s < t, A ∈ Fs , t > 0.
P(F) is called the F-predictable σ-algebra on (0, ∞) × Ω. A stochastic process
X = (Xt ) is called F-predictable, if X0 is F0 -measurable and the mapping
(t, ω) → Xt (ω) on (0, ∞) × Ω into R is measurable with respect to P(F).
Theorem A.18. Every left-continuous process adapted to F is F-predictable.
In all applications, we will be concerned with predictable processes that
are left-continuous. Note that F-predictable processes are also F-progressive. A
property that explains the term predictable is given in the following theorem.
Theorem A.19. Suppose the process X is F-predictable. Then for all t > 0
the variable Xt is Ft− -measurable.

A.4 Stopping Times


Suppose we want to describe a point in time at which a stochastic process
first enters a given set, say when it hits a certain level. So this point in time
is a random time because it depends on the random evolution of the process.
Observing this stochastic process, it is possible to decide at any time t whether
this random time has occurred or not. Such random times, which are based on
the available information not anticipating the future, are defined as follows.
Definition A.20. Suppose F = (Ft ), t ∈ R+ , is a filtration on the measurable
space (Ω, F ). A random variable τ : Ω → [0, ∞] is said to be a stopping time
if for every t ∈ R+ ,
{τ ≤ t} = {ω : τ (ω) ≤ t} ∈ Ft .
In particular, a constant random variable τ = t0 ∈ R+ is a stopping time.
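Since we assume that the filtration is right-continuous, we can equivalently describe stopping times by the condition $\{\tau < t\} \in \mathcal F_t$: If $\{\tau < t\} \in \mathcal F_t$ for all $t \in \mathbb R_+$, then
$$\{\tau \le t\} = \bigcap_{n\in\mathbb N} \Big\{\tau < t + \frac1n\Big\} \in \bigcap_{n\in\mathbb N} \mathcal F_{t+\frac1n} = \mathcal F_{t+} = \mathcal F_t.$$
Conversely, if $\{\tau \le t\} \in \mathcal F_t$ for all $t \in \mathbb R_+$, then
$$\{\tau < t\} = \bigcup_{n\in\mathbb N} \Big\{\tau \le t - \frac1n\Big\} \in \mathcal F_t \ \text{ for } t > 0 \quad\text{and}\quad \{\tau < 0\} = \emptyset \in \mathcal F_0.$$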

Proposition A.21. Suppose $\sigma$ and $\tau$ are stopping times. Then $\sigma \wedge \tau$, $\sigma \vee \tau$, and $\sigma + \tau$ are stopping times. Let $(\tau_n)$, $n \in \mathbb N$, be a sequence of stopping times. Then $\sup_n \tau_n$ and $\inf_n \tau_n$ are also stopping times.
Proof. First we show that $\sigma + \tau$ is a stopping time and consider the complement of the event $\{\sigma + \tau \le t\}$:
$$\{\sigma + \tau > t\} = \{\sigma > t\} \cup \{\tau > t\} \cup \{\sigma \ge t, \tau > 0\} \cup \{0 < \sigma < t, \sigma + \tau > t\}.$$
The first three events of this union are clearly in $\mathcal F_t$. The fourth event
$$\{0 < \sigma < t, \sigma + \tau > t\} = \bigcup_{r\in\mathbb Q\cap[0,t)} \{r < \sigma < t, \tau > t - r\}$$
is a countable union of events of $\mathcal F_t$, and therefore $\sigma + \tau$ is a stopping time.
The remaining assertions follow from
$$\{\sup_n \tau_n \le t\} = \bigcap_{n\in\mathbb N} \{\tau_n \le t\} \in \mathcal F_t, \qquad \{\inf_n \tau_n < t\} = \bigcup_{n\in\mathbb N} \{\tau_n < t\} \in \mathcal F_t,$$
using the fact that for a right-continuous filtration it suffices to show $\{\inf_n \tau_n < t\} \in \mathcal F_t$. □


For a sequence of stopping times (τn ) the random variables sup τn , inf τn
are stopping times, so that lim sup τn , lim inf τn and lim τn (if it exists) are
also stopping times.
We now define the σ-algebra of the past of a stopping time τ .
Definition A.22. Suppose τ is a stopping time with respect to the filtration
F. Then the σ-algebra Fτ of events occurring up to time τ is

Fτ = {A ∈ F∞ : A ∩ {τ ≤ t} ∈ Ft for all t ∈ R+ }.

We note that τ is Fτ -measurable and that for a constant stopping time


τ = t0 ∈ R+ we have Fτ = Ft0 .
Theorem A.23. Suppose σ and τ are stopping times.
(i) If σ ≤ τ , then Fσ ⊂ Fτ .
(ii) If A ∈ Fσ , then A ∩ {σ ≤ τ } ∈ Fτ .
(iii) Fσ∧τ = Fσ ∩ Fτ .
Proof. (i) For B ∈ Fσ and t ∈ R+ we have

B ∩ {τ ≤ t} = B ∩ {σ ≤ t} ∩ {τ ≤ t} ∈ Ft ,

which proves (i).


(ii) Suppose A ∈ Fσ . Then

A ∩ {σ ≤ τ } ∩ {τ ≤ t} = A ∩ {σ ≤ t} ∩ {τ ≤ t} ∩ {σ ∧ t ≤ τ ∧ t}.

Now A ∩ {σ ≤ t} and {τ ≤ t} are elements of Ft by assumption and the


random variables σ ∧ t and τ ∧ t are both Ft -measurable. This shows that
{σ ∧ t ≤ τ ∧ t} ∈ Ft .
(iii) Since σ ∧ τ ≤ σ and σ ∧ τ ≤ τ we obtain from (i)

Fσ∧τ ⊂ Fσ ∩ Fτ .

Conversely, for A ∈ Fσ ∩ Fτ we have

A ∩ {σ ∧ τ ≤ t} = (A ∩ {σ ≤ t}) ∪ (A ∩ {τ ≤ t}) ∈ Ft ,

which proves (iii). 



This theorem shows that some of the properties known for fixed time points $s, t$ also hold for stopping times $\sigma, \tau$. Next we consider the link between a stochastic process $X = (X_t)$, $t \in \mathbb R_+$, and a stopping time $\sigma$. It is natural to investigate variables $X_{\sigma(\omega)}(\omega)$ with random index and the stopped process $X^\sigma_t(\omega) = X_{\sigma\wedge t}(\omega)$ on $\{\sigma < \infty\}$. To ensure that $X_\sigma$ is a random variable, we need $X_t$ to fulfill a measurability requirement in $t$.
Theorem A.24. If $\sigma$ is a stopping time and $X = (X_t)$, $t \in \mathbb R_+$, is an $\mathbb F$-progressive process, then $X_\sigma$ is $\mathcal F_\sigma$-measurable and $X^\sigma$ is $\mathbb F$-progressive.

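Proof. We must show that for any Borel set $B \in \mathcal B$, $\{X_\sigma \in B\} \cap \{\sigma \le t\}$ belongs to $\mathcal F_t$. This intersection equals $\{X_{\sigma\wedge t} \in B\} \cap \{\sigma \le t\}$, so we need only show that $X^\sigma$ is progressive. Now $\sigma \wedge t$ is $\mathcal F_t$-measurable. Hence $(s, \omega) \mapsto (\sigma(\omega) \wedge s, \omega)$ is $\mathcal B([0, t]) \otimes \mathcal F_t$-measurable. Therefore, the map $(s, \omega) \mapsto X_{\sigma(\omega)\wedge s}(\omega)$ is measurable, as it is the composition of two measurable maps. Hence $X^\sigma$ is progressive. □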


Most important for applications are those random times σ that are defined
as first entrance times of a stochastic process X into a Borel set B: σ = inf{t ∈
R+ : Xt ∈ B}. In general, it is very difficult to show that σ is a stopping time.
For a discussion of the usual conditions in this connection, see Rogers and
Williams [133], pp. 183–191. For a complete proof of the following theorem
we refer to Dellacherie and Meyer [61], p. 116.

Theorem A.25. Let X be an F-progressive process with respect to the com-


plete and right-continuous filtration F and B ∈ B a Borel set. Then

σ(ω) = inf{t ∈ R+ : Xt (ω) ∈ B}

is an F-stopping time.

Proof. We only show the simple case where $X$ is right-continuous and $B$ is an open set. Then the right-continuity implies that
$$\{\sigma < t\} = \bigcup_{r\in\mathbb Q\cap[0,t)} \{X_r \in B\} \in \mathcal F_t.$$
Using the right-continuity of $\mathbb F$, it follows that $\sigma$ is an $\mathbb F$-stopping time. □




Note that the right-continuity of the paths was used to express {σ < t}
as the union of events {Xr ∈ B} and that we could restrict ourselves to a
countable union because B is an open set.
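The defining property of a first entrance time, namely that $\{\sigma \le t\}$ is decided by the path up to time $t$, can be illustrated on a discretised path. This is a minimal sketch under stated assumptions: the path is a Gaussian random walk on a time grid (a crude stand-in for Brownian motion), $B = (1, \infty)$ is an assumed open set, and the helpers `first_entrance_time` and `has_occurred_by` are hypothetical. It illustrates the idea only; it does not prove Theorem A.25 in continuous time.

```python
import random

def first_entrance_time(path, times, in_B):
    """Return inf{t_k : X_{t_k} in B} for a path sampled on a grid (inf of empty set = inf)."""
    for t, x in zip(times, path):
        if in_B(x):
            return t
    return float("inf")

def has_occurred_by(path, times, in_B, t):
    """Decide {tau <= t} using only the observed values X_s with s <= t."""
    return any(in_B(x) for s, x in zip(times, path) if s <= t)

in_B = lambda x: x > 1.0                # open set B = (1, inf), an assumed example
rng = random.Random(1)
times = [k * 0.01 for k in range(1001)]
path = [0.0]
for _ in times[1:]:
    path.append(path[-1] + rng.gauss(0.0, 0.1))  # increments with variance equal to the step 0.01

tau = first_entrance_time(path, times, in_B)
# {tau <= t} is determined by the path up to time t alone, for every grid time t.
assert all(has_occurred_by(path, times, in_B, t) == (tau <= t) for t in times)
print(tau)
```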

A.5 Martingale Theory


An overview of the historical development of martingale theory can be found in monographs such as Andersen et al. [2], pp. 115–120, or Kallenberg [101], pp. 464–485. We fix a stochastic basis $(\Omega, \mathcal F, \mathbb F, P)$ and define stochastic processes with certain properties, which are known as the stochastic analogues of constant, increasing, and decreasing functions.

Definition A.26. An integrable F-adapted process X = (Xt ), t ∈ R+ , is called


a martingale if
Xt = E[Xs |Ft ] (A.2)
for all s ≥ t, s, t ∈ R+ . A supermartingale is defined in the same way, except
that (A.2) is replaced by
Xt ≥ E[Xs |Ft ],

and a submartingale is defined with (A.2) being replaced by

Xt ≤ E[Xs |Ft ].

Taking expectations on both sides of the (in)equality, we obtain $EX_t = EX_s$ (respectively $EX_t \ge EX_s$, $EX_t \le EX_s$) for $s \ge t$, which shows that a martingale is constant on average, whereas a supermartingale decreases and a submartingale increases on average.

Example A.27. Let X be an integrable F-adapted process. Suppose that the


increments Xs −Xt are independent of Ft for all s > t, s, t ∈ R+ . If these incre-
ments have zero expectation (thus the expectation function EXt is constant),
then X is a martingale:

E[Xs |Ft ] = E[Xt |Ft ] + E[Xs − Xt |Ft ] = Xt .

Of particular importance are the following cases.


(i) If X is continuous, X0 = 0, and the increments Xs − Xt are normally
distributed with mean 0 and variance s − t, then X is an F-Brownian
motion. In addition to X, also the process Yt = Xt2 − t is a martingale:

E[Ys |Ft ] = E[(Xs − Xt )2 |Ft ] + 2Xt E[Xs − Xt |Ft ] + Xt2 − s


= s − t + 0 + Xt2 − s = Yt .

(ii) If X0 = 0 and the increments Xs − Xt follow a Poisson distribution with


mean s−t, for s > t, then X is a Poisson process. Now X is a submartingale
because of

E[Xs |Ft ] = Xt + E[Xs − Xt |Ft ] = Xt + s − t ≥ Xt

and Xt − t is a martingale.
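The martingale claims in Example A.27 have the consequence $E[(Y_s - Y_t)\,\varphi] = 0$ for every bounded $\mathcal F_t$-measurable $\varphi$. The following Monte Carlo sketch checks this for $Y_t = X_t^2 - t$ (Brownian motion) and for the compensated Poisson process $N_t - t$. The time points, sample size, seed, test functions $\varphi$ (the sign of $X_t$, resp. $N_t$ itself), and the Knuth-style helper `poisson` are assumptions for illustration; such a simulation only supports, and does not prove, the martingale property.

```python
import math
import random

rng = random.Random(2024)          # fixed seed so the check is reproducible
t, s, n_sim = 1.0, 2.0, 100_000    # assumed time points t < s and sample size

def poisson(lam):
    """Poisson(lam) sample by Knuth's multiplication method (fine for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

sums = {"bm": 0.0, "bm_weighted": 0.0, "poisson": 0.0, "poisson_weighted": 0.0}
for _ in range(n_sim):
    # (i) Brownian motion: Y_u = X_u**2 - u, increment Y_s - Y_t.
    x_t = rng.gauss(0.0, math.sqrt(t))
    x_s = x_t + rng.gauss(0.0, math.sqrt(s - t))        # independent increment
    dy = (x_s**2 - s) - (x_t**2 - t)
    sums["bm"] += dy
    sums["bm_weighted"] += dy * (1.0 if x_t > 0 else -1.0)  # times a function of X_t
    # (ii) Poisson process: increment of the compensated process N_u - u.
    n_t = poisson(t)
    dm = poisson(s - t) - (s - t)
    sums["poisson"] += dm
    sums["poisson_weighted"] += dm * n_t                    # times N_t

estimates = {key: value / n_sim for key, value in sums.items()}
print(estimates)   # all four entries should be close to 0
```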

Example A.28. Let $Y$ be an integrable random variable and define $M_t = E[Y|\mathcal F_t]$. Then $M$ is a martingale because of the successive conditioning property:
$$E[M_s|\mathcal F_t] = E[E[Y|\mathcal F_s]|\mathcal F_t] = E[Y|\mathcal F_t] = M_t, \quad s \ge t.$$
So $M_t$ is a predictor of $Y$ given the information $\mathcal F_t$ gathered up to time $t$. Furthermore, $M$ is a uniformly integrable martingale. To see this, we have to show that $\lim_{c\to\infty} \sup_{t\in\mathbb R_+} E[|M_t| I(|M_t| > c)] = 0$. By Jensen's inequality for conditional expectations, and since $\{|M_t| > c\} \in \mathcal F_t$, we obtain
$$E[|M_t| I(|M_t| > c)] \le E\big[E[|Y|\,|\,\mathcal F_t]\, I(|M_t| > c)\big] = E[|Y| I(|M_t| > c)].$$
Since $cP(|M_t| > c) \le E|M_t| \le E|Y|$, it follows that $P(|M_t| > c) \to 0$ as $c \to \infty$, uniformly in $t$. Because $Y$ is integrable, $E[|Y| I_A]$ is uniformly small whenever $P(A)$ is small, so the right-hand side tends to 0 uniformly in $t$, which shows that $M$ is uniformly integrable.
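In discrete time, the martingale $M_n = E[Y|\mathcal F_n]$ of Example A.28 can be computed by brute force. A minimal sketch, assuming $N = 6$ fair coin tosses, $Y$ the total number of heads, and $\mathcal F_n$ generated by the first $n$ tosses; the names are hypothetical. Averaging over all continuations of a prefix gives $M_n = S_n + (N-n)/2$ with $S_n$ the heads observed so far, and averaging $M_{n+1}$ over the next toss returns $M_n$.

```python
from fractions import Fraction
from itertools import product

N = 6  # number of fair coin tosses (hypothetical example)

def Y(outcome):
    return sum(outcome)  # number of heads; 1 = head, 0 = tail

def M(prefix):
    """M_n = E[Y | F_n]: average of Y over all equally likely continuations of the prefix."""
    conts = list(product((0, 1), repeat=N - len(prefix)))
    return Fraction(sum(Y(prefix + c) for c in conts), len(conts))

# Check the closed form and the martingale property E[M_{n+1} | F_n] = M_n.
for n in range(N):
    for prefix in product((0, 1), repeat=n):
        assert M(prefix) == sum(prefix) + Fraction(N - n, 2)
        assert (M(prefix + (0,)) + M(prefix + (1,))) / 2 == M(prefix)
print(M(()), M((1, 1, 1)))   # 3 9/2
```

At $n = N$ the prefix is the whole outcome and $M_N = Y$, matching the closing property for $\mathcal F_\infty$-measurable $Y$.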

Concerning the regularity of the paths of a supermartingale, the following


result holds true.

Lemma A.29. Suppose X is a supermartingale such that t → EXt is


right-continuous. Then X has a modification with all paths cadlag, i.e., there
exists a process Y with cadlag paths such that Xt = Yt P -a.s. for all t ∈ R+ .

So for a martingale, a submartingale, or a supermartingale with right-


continuous expectation function, we can assume that it has cadlag paths. From
now on we make the general assumption that all martingales, submartingales,
and supermartingales are cadlag unless stated otherwise.

Lemma A.30. Let M be a martingale and consider a convex function g :


R → R such that X = g(M ) is integrable. Then X is a submartingale.
If g is also nondecreasing, then the assertion remains true for submartin-
gales M .

Proof. Let M be a martingale. Then by Jensen’s inequality we obtain for s ≥ t

Xt = g(Mt ) = g(E[Ms |Ft ]) ≤ E[g(Ms )|Ft ] = E[Xs |Ft ],

which shows that X is a submartingale.


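If $M$ is a submartingale and $g$ is nondecreasing, then $M_t \le E[M_s|\mathcal F_t]$ and Jensen's inequality give
$$g(M_t) \le g(E[M_s|\mathcal F_t]) \le E[g(M_s)|\mathcal F_t],$$
so the conclusion remains valid. □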




The last lemma is often applied with functions $g(x) = |x|^p$, $p \ge 1$. So, if $M$ is a square integrable martingale, then $X = M^2$ defines a submartingale.
One key result in martingale theory is the following convergence theorem
(cf. [62], p. 72).

Theorem A.31. Let $X$ be a supermartingale (martingale). Suppose that
$$\sup_{t\in\mathbb R_+} E|X_t| < \infty,$$
a condition that is equivalent to $\lim_{t\to\infty} EX_t^- < \infty$. Then the random variable $X_\infty = \lim_{t\to\infty} X_t$ exists and is integrable.
If the supermartingale (martingale) X is uniformly integrable, X∞ exists
and closes X on the right in that for all t ∈ R+

Xt ≥ E[X∞ |Ft ] (respectively Xt = E[X∞ |Ft ]).

As a consequence we get the following characterization of the convergence


of martingales.

Theorem A.32. Suppose M is a martingale. Then the following conditions


are equivalent:

(i) M is uniformly integrable.


(ii) There exists a random variable M∞ such that Mt converges to M∞ in
L1 : limt→∞ E|Mt − M∞ | = 0.
(iii) Mt converges P -a.s. to an integrable random variable M∞ , which closes
M on the right: Mt = E[M∞ |Ft ].

Example A.33. If in Example A.28 we assume that $Y$ is $\mathcal F_\infty$-measurable, then we can conclude that the martingale $M_t = E[Y|\mathcal F_t]$ converges to $Y$ P-a.s. and in $L^1$.
In Example A.27 (i), Brownian motion $(X_t)$ is not uniformly integrable: for any $c > 0$ we can find a $t > 0$ such that $P(|X_t| > c) \ge \varepsilon$ for a fixed $\varepsilon$, $0 < \varepsilon < 1$, so that $\sup_t E[|X_t| I(|X_t| > c)] \ge c\varepsilon$ does not tend to 0 as $c \to \infty$. In this case we can conclude that $X_t$ converges to a random variable as $t \to \infty$ neither P-a.s. nor in $L^1$.

Next we consider conditions under which the (super-)martingale property


also extends from fixed time points s, t to stopping times σ, τ .

Theorem A.34. (Optional Sampling Theorem). Let X be a supermartin-


gale and let σ and τ be two stopping times such that σ ≤ τ . Suppose either
that τ is bounded or that (Xt ) is uniformly integrable. Then Xσ and Xτ are
integrable and
Xσ ≥ E[Xτ |Fσ ]

with equality if X is a martingale.
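The Optional Sampling Theorem with a bounded stopping time can be verified exactly for a symmetric random walk $S$. This sketch assumes the horizon `T = 10` and the level `LEVEL = 3` (both illustrative) and takes $\tau = \min(\text{first } n \text{ with } |S_n| = 3,\ T)$ and $\sigma = 0$. Since $S$ and $S_n^2 - n$ are martingales, the theorem gives $ES_\tau = 0$ and $E[S_\tau^2 - \tau] = 0$; enumerating all $2^T$ equally likely step sequences computes both expectations exactly.

```python
from fractions import Fraction
from itertools import product

T, LEVEL = 10, 3   # horizon bounding tau and absorbing level (assumed values)

def stopped_value(steps):
    """Return (tau, S_tau) for tau = min(first n with |S_n| = LEVEL, T)."""
    s = 0
    for n, step in enumerate(steps, start=1):
        s += step
        if abs(s) == LEVEL:
            return n, s
    return T, s

paths = list(product((-1, 1), repeat=T))   # all 2**T equally likely step sequences
weight = Fraction(1, len(paths))
e_s_tau = sum(weight * stopped_value(p)[1] for p in paths)
e_sq_minus_tau = sum(weight * (stopped_value(p)[1] ** 2 - stopped_value(p)[0]) for p in paths)
print(e_s_tau, e_sq_minus_tau)   # 0 0
```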

An often used consequence of Theorem A.34 is the following: If X is a


uniformly integrable martingale, then setting σ = 0 we obtain EX0 = EXτ
for all stopping times τ (all quantities are related to the same filtration F). A
kind of converse is the following proposition.

Proposition A.35. Suppose X is an adapted cadlag process such that for any
bounded stopping time τ the random variable Xτ is integrable and EX0 =
EXτ . Then X is a martingale.

A further consequence of the Optional Sampling Theorem is that a stopped


(super-) martingale remains a (super-) martingale.

Corollary A.36. Let $X$ be a right-continuous supermartingale (martingale) and $\tau$ a stopping time. Then the stopped process $X^\tau = (X_{t\wedge\tau})$ is a supermartingale (martingale). If either $X$ is uniformly integrable or $I(\tau < \infty)X_\tau$ is integrable and $\lim_{t\to\infty} \int_{\{\tau > t\}} |X_t|\,dP = 0$, then $X^\tau$ is uniformly integrable.

Martingales are often constructed by subtracting an increasing process from a submartingale (cf. Example A.27 (ii), p. 260). This fact emanates from the celebrated Doob–Meyer decomposition, which is a cornerstone of modern probability theory.

Theorem A.37. (Doob–Meyer decomposition). Let the process X be


right-continuous and adapted. Then X is a uniformly integrable submartin-
gale if and only if it has a decomposition

X = A + M,

where $A$ is a right-continuous, predictable, nondecreasing, and integrable process with $A_0 = 0$ and $M$ is a uniformly integrable martingale. The decomposition is unique up to indistinguishability.

Remark A.38. 1. Several proofs of this and more general results, not res-
tricted to uniformly integrable processes, are known (cf. [62], p. 198 and
[101], p. 412). Some of these also refer to local martingales, which are not
needed for the applications we have presented and which are therefore not
introduced here.
2. The process $A$ in the theorem above is often called the compensator of $X$.
3. In the case of discrete time such a decomposition is easily constructed in
the following way. Let (Xn ), n ∈ N0 , be a submartingale with respect to
a filtration (Fn ), n ∈ N0 . Then we define

Xn = An + Mn ,

where

An = An−1 + E[Xn |Fn−1 ] − Xn−1 , n ∈ N, A0 = 0,


Mn = Xn − An , n ∈ N0 .

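The process $M$ is a martingale and $A$ is nondecreasing and predictable in that $A_n$ is $\mathcal F_{n-1}$-measurable for $n \in \mathbb N$. This decomposition is unique: for a second decomposition $X_n = \tilde A_n + \tilde M_n$ with the same properties, we must have $M_n - \tilde M_n = \tilde A_n - A_n$, which is a predictable martingale. Therefore,
$$\tilde A_n - A_n = E[\tilde A_n - A_n|\mathcal F_{n-1}] = \tilde A_{n-1} - A_{n-1}, \quad n \in \mathbb N,$$
and since $A_0 = \tilde A_0 = 0$, it follows that $A_n = \tilde A_n$ for all $n$.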
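The discrete-time construction in item 3 translates directly into a recursion. A minimal sketch, assuming a fair $\pm1$ random walk $S$ and the submartingale $X_n = |S_n|$ (a convex function of a martingale, cf. Lemma A.30); the function name `doob_decomposition` and the sample path are hypothetical. Here $E[X_n|\mathcal F_{n-1}] - X_{n-1}$ equals 1 if $S_{n-1} = 0$ and 0 otherwise, so the compensator $A_n$ counts the visits of $S$ to 0 before time $n$.

```python
from fractions import Fraction
from itertools import product

def doob_decomposition(steps, x):
    """Discrete Doob decomposition X_n = A_n + M_n along one path of a fair +/-1 walk.

    Hypothetical setting: X_n = x(S_n) for a function x making X a submartingale.
    A_n = A_{n-1} + E[X_n | F_{n-1}] - X_{n-1} with A_0 = 0, and M_n = X_n - A_n.
    """
    s, a = 0, Fraction(0)
    A, M = [a], [Fraction(x(0))]
    for step in steps:
        cond_exp = Fraction(x(s - 1) + x(s + 1), 2)   # E[X_n | F_{n-1}] for a fair step
        a += cond_exp - x(s)                          # A_n is known at time n-1 (predictable)
        s += step
        A.append(a)
        M.append(x(s) - a)
    return A, M

A, M = doob_decomposition((1, -1, -1, 1, 1, 1), abs)
print([str(v) for v in A])   # ['0', '1', '1', '2', '2', '3', '3']

# Martingale check: averaging M_{n+1} over both possible next steps returns M_n.
for prefix in product((-1, 1), repeat=4):
    m_now = doob_decomposition(prefix, abs)[1][-1]
    m_up = doob_decomposition(prefix + (1,), abs)[1][-1]
    m_down = doob_decomposition(prefix + (-1,), abs)[1][-1]
    assert (m_up + m_down) / 2 == m_now
```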

The continuous time result needs much more care and uses several lemmas, one of which is interesting in its own right and will be presented here.

Lemma A.39. A process $M$ is a predictable martingale of integrable variation, i.e., $E\big[\int_0^\infty |dM_s|\big] < \infty$, if and only if $M_t = M_0$ for all $t \in \mathbb R_+$.

We will now use the Doob–Meyer decomposition to introduce two types of (co-)variation processes. For this we recall that $\mathcal M$ ($\mathcal M_0$) denotes the class of cadlag martingales (with $M_0 = 0$) and denote by $\mathcal M^2$ ($\mathcal M^2_0$) the set of martingales in $\mathcal M$ ($\mathcal M_0$) which are bounded in $L^2$, i.e., $\sup_{t\in\mathbb R_+} EM_t^2 < \infty$.

Definition A.40. For $M \in \mathcal M^2$ the unique compensator of $M^2$ in the Doob–Meyer decomposition, denoted $\langle M, M\rangle$ or $\langle M\rangle$, is called the predictable variation process. For $M_1, M_2 \in \mathcal M^2$ the process
$$\langle M_1, M_2\rangle = \frac14\big(\langle M_1 + M_2\rangle - \langle M_1 - M_2\rangle\big)$$
is called the predictable covariation process of $M_1$ and $M_2$.
Proposition A.41. Suppose that $M_1, M_2 \in \mathcal M^2$. Then $A = \langle M_1, M_2\rangle$ is the unique predictable cadlag process with $A_0 = 0$ such that $M_1 M_2 - A \in \mathcal M$.
Proof. The assertion follows from the Doob–Meyer decomposition and
$$\begin{aligned} M_1 M_2 - \langle M_1, M_2\rangle &= \frac14\big[(M_1 + M_2)^2 - (M_1 - M_2)^2\big] - \langle M_1, M_2\rangle \\ &= \frac14\big[(M_1 + M_2)^2 - \langle M_1 + M_2\rangle\big] - \frac14\big[(M_1 - M_2)^2 - \langle M_1 - M_2\rangle\big]. \end{aligned}$$
□



To understand what predictable variation means, we give a heuristic explanation. Recall that for a martingale $M$ we have, for all $0 < h < t$,
$$E[M_t - M_{t-h}|\mathcal F_{t-h}] = 0,$$
or in heuristic form:
$$E[dM_t|\mathcal F_{t-}] = 0.$$
Since $M^2 - \langle M\rangle$ is a martingale and $\langle M\rangle$ is predictable, we obtain
$$E[dM_t^2|\mathcal F_{t-}] = E[d\langle M\rangle_t|\mathcal F_{t-}] = d\langle M\rangle_t.$$
Furthermore,
$$dM_t^2 = M_t^2 - M_{t-}^2 = (M_{t-} + dM_t)^2 - M_{t-}^2 = (dM_t)^2 + 2M_{t-}\,dM_t,$$
yielding
$$d\langle M\rangle_t = E[(dM_t)^2|\mathcal F_{t-}] + 2M_{t-} E[dM_t|\mathcal F_{t-}] = E[(dM_t)^2|\mathcal F_{t-}] = \mathrm{Var}[dM_t|\mathcal F_{t-}].$$
This indicates (and it can be proved) that $\langle M\rangle_t$ is the stochastic limit of sums of the form
$$\sum_{i=1}^n \mathrm{Var}[M_{t_i} - M_{t_{i-1}}|\mathcal F_{t_{i-1}}]$$
as $n \to \infty$ and the span of the partition $0 = t_0 < t_1 < \ldots < t_n = t$ tends to 0.
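In discrete time the heuristic above becomes exact: the predictable variation is the sum of conditional variances of the increments. A minimal sketch, assuming a hypothetical martingale $M_k = M_{k-1} + \sigma(M_{k-1})\,\xi_k$ with fair signs $\xi_k = \pm1$ and an illustrative step size `sigma(m) = 1 + |m|`. Then $\mathrm{Var}[M_k - M_{k-1}|\mathcal F_{k-1}] = \sigma(M_{k-1})^2$, the sum $\langle M\rangle_n$ of these terms is predictable, and exact enumeration confirms $E[M_n^2 - \langle M\rangle_n] = 0$, as it must since $M^2 - \langle M\rangle$ is a martingale starting at 0.

```python
from fractions import Fraction
from itertools import product

def sigma(m):
    """Hypothetical state-dependent step size: larger steps when M is far from 0."""
    return 1 + abs(m)

def paths_stats(n):
    """Enumerate all 2**n fair sign sequences; return (M_n, <M>_n) for each path.

    M_k = M_{k-1} + sigma(M_{k-1}) * xi_k with xi_k = +/-1, so
    Var[M_k - M_{k-1} | F_{k-1}] = sigma(M_{k-1})**2 and <M>_n sums these terms.
    """
    out = []
    for signs in product((-1, 1), repeat=n):
        m, pv = 0, 0
        for xi in signs:
            pv += sigma(m) ** 2      # predictable: uses M_{k-1} only
            m += sigma(m) * xi
        out.append((m, pv))
    return out

stats = paths_stats(8)
w = Fraction(1, len(stats))
print(sum(w * m for m, _ in stats), sum(w * (m * m - pv) for m, pv in stats))  # 0 0
```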

Definition A.42. Two martingales $M, L \in \mathcal M^2$ are called orthogonal if their product is a martingale: $ML \in \mathcal M$.
For two martingales $M, L$ of $\mathcal M^2$ that are orthogonal we must have $\langle M, L\rangle = 0$. If we equip $\mathcal M^2$ with the scalar product
$$(M, L)_{\mathcal M^2} = E[M_\infty L_\infty]$$
inducing the norm $\|M\| = (EM_\infty^2)^{1/2}$, then $\mathcal M^2$ becomes a Hilbert space. Because of $ML - \langle M, L\rangle \in \mathcal M$ and $\langle M, L\rangle_0 = 0$, it follows that
$$(M, L)_{\mathcal M^2} = E[M_\infty L_\infty] = E\langle M, L\rangle_\infty + EM_0 L_0.$$
So two orthogonal martingales $M, L$ of $\mathcal M^2_0$ are also orthogonal in the Hilbert space $\mathcal M^2$ (cf. Elliott [67], p. 88).
The set of continuous martingales in $\mathcal M^2_0$, denoted $\mathcal M^{2,c}_0$, is a complete subspace of $\mathcal M^2_0$, and $\mathcal M^{2,d}_0$ is the space orthogonal to $\mathcal M^{2,c}_0$. The martingales in $\mathcal M^{2,d}_0$ are called purely discontinuous. As an immediate consequence we obtain that any martingale $M \in \mathcal M^2_0$ has a unique decomposition $M = M^c + M^d$, where $M^c \in \mathcal M^{2,c}_0$ and $M^d \in \mathcal M^{2,d}_0$.
d 2,d

A process strongly connected to predictable variation is the so-called square bracket process introduced in the following definition.
Definition A.43. Suppose $M \in \mathcal M^2_0$ and $M = M^c + M^d$ is the unique decomposition with $M^c \in \mathcal M^{2,c}_0$ and $M^d \in \mathcal M^{2,d}_0$. The increasing cadlag process $[M]$ with
$$[M]_t = \langle M^c\rangle_t + \sum_{s\le t} (\Delta M_s)^2$$
is called the quadratic variation of $M$, where $\Delta M_t = M_t - M_{t-}$ denotes the jump of $M$ at time $t > 0$ ($\Delta M_0 = M_0$). For martingales $M, L \in \mathcal M^2_0$ we define the quadratic covariation $[M, L]$ by
$$[M, L] = \frac14\big([M + L] - [M - L]\big).$$
The following proposition helps to explain the name quadratic covariation.
Proposition A.44. Suppose $M, L \in \mathcal M^2_0$.
1. Let $(t^n_i)$ be a sequence of partitions $0 = t^n_0 < t^n_1 < \ldots < t^n_n = t$ such that the span $\sup_i (t^n_{i+1} - t^n_i)$ tends to 0 as $n \to \infty$. Then
$$\sum_i (M_{t^n_{i+1}} - M_{t^n_i})(L_{t^n_{i+1}} - L_{t^n_i})$$
converges P-a.s. and in $L^1$ to $[M, L]_t$ for all $t > 0$.
2. $ML - [M, L]$ is a martingale.
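Proposition A.44 can be illustrated by simulation. A minimal sketch, assuming two independent standard Brownian motions $W$ and $B$ stopped at time $t = 1$ (so both lie in $\mathcal M^{2,c}_0$); by standard theory $[W]_1 = 1$ and $[W, B]_1 = 0$. The sums over a uniform grid with $n$ steps are simulated from independent Gaussian increments; the grid sizes, seed, and function name are assumptions, and the output is a Monte Carlo illustration, not a proof.

```python
import math
import random

def realized_covariation(n, t=1.0, seed=7):
    """Sums of squared / cross increments of independent Brownian motions W, B
    on the uniform grid t_i = i * t / n, simulated from Gaussian increments."""
    rng = random.Random(seed)
    sd = math.sqrt(t / n)
    qv_w, cov_wb = 0.0, 0.0
    for _ in range(n):
        dw, db = rng.gauss(0.0, sd), rng.gauss(0.0, sd)
        qv_w += dw * dw      # approximates [W]_t = t
        cov_wb += dw * db    # approximates [W, B]_t = 0
    return qv_w, cov_wb

for n in (10, 1_000, 100_000):
    print(n, realized_covariation(n))   # first entry approaches 1, second approaches 0
```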

A.6 Semimartingales
A decomposition of a stochastic process into a (predictable) drift part and a martingale, as presented for submartingales in the Doob–Meyer decomposition, also holds for more general processes. We start with the motivating example of a sequence $(X_n)$, $n \in \mathbb N_0$, of integrable random variables adapted to the filtration $(\mathcal F_n)$. This sequence admits a decomposition
$$X_n = X_0 + \sum_{i=1}^n f_i + M_n$$
with a predictable sequence $f = (f_n)$, $n \in \mathbb N$ (i.e., $f_n$ is $\mathcal F_{n-1}$-measurable), and a martingale $M = (M_n)$, $n \in \mathbb N_0$, $M_0 = 0$. We can take
$$f_n = E[X_n - X_{n-1}|\mathcal F_{n-1}], \qquad M_n = \sum_{i=1}^n \big(X_i - E[X_i|\mathcal F_{i-1}]\big).$$
This decomposition is unique because a second decomposition of this type, say with a sequence $\tilde f$ and a martingale $\tilde M$, would imply that
$$M_n - \tilde M_n = \sum_{i=1}^n (\tilde f_i - f_i)$$
defines a predictable martingale, i.e., $M_n - \tilde M_n = E[M_n - \tilde M_n|\mathcal F_{n-1}] = M_{n-1} - \tilde M_{n-1} = \ldots = M_0 - \tilde M_0 = 0$, which shows the uniqueness.
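The decomposition also works for processes that are neither sub- nor supermartingales. A minimal sketch, assuming a fair $\pm1$ random walk $S$ and the hypothetical process $X_n = S_n^3$: since $(s+1)^3 + (s-1)^3 = 2s^3 + 6s$, the drift is $f_n = 3S_{n-1}$, which takes both signs. The code computes $f$ and $M$ by the formulas above and checks $X_n = X_0 + \sum_i f_i + M_n$ together with the martingale property on all paths of length 5.

```python
from itertools import product

def cube(s):
    return s ** 3

def decompose(steps):
    """Return (f, M) with X_n = X_0 + sum_{i<=n} f_i + M_n for X_n = S_n**3,
    S a fair +/-1 random walk started at 0 (hypothetical example)."""
    s, f, M = 0, [], [0]
    for step in steps:
        expected_next = (cube(s + 1) + cube(s - 1)) // 2  # E[X_n | F_{n-1}] = s**3 + 3*s, exact
        f.append(expected_next - cube(s))                 # predictable drift f_n = 3 * S_{n-1}
        s += step
        M.append(M[-1] + cube(s) - expected_next)         # increment X_n - E[X_n | F_{n-1}]
    return f, M

f, M = decompose((1, 1, -1, 1))
print(f, M)   # [0, 3, 6, 3] [0, 1, 5, -8, -4]

# Check the decomposition and the martingale property on all paths of length 5.
for path in product((-1, 1), repeat=5):
    f_p, M_p = decompose(path)
    assert sum(path) ** 3 == sum(f_p) + M_p[-1]
    up, down = decompose(path + (1,))[1][-1], decompose(path + (-1,))[1][-1]
    assert up + down == 2 * M_p[-1]
```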
Unlike the time-discrete case, corresponding decompositions cannot be
found for all integrable processes in continuous time. The role of increasing
processes in the Doob–Meyer decomposition will now be taken by processes
of bounded variation.

Definition A.45. For a cadlag function $g: \mathbb R_+ \to \mathbb R$ the variation is defined as
$$V_g(t) = \lim_{n\to\infty} \sum_{k=1}^n |g(tk/n) - g(t(k-1)/n)|.$$
The function $g$ is said to have finite variation if $V_g(t) < \infty$ for all $t \in \mathbb R_+$. The class of cadlag processes $A$ with finite variation starting in $A_0 = 0$ is denoted $\mathcal V$.

For any $A \in \mathcal V$ there is a decomposition $A_t = B_t - C_t$ with increasing processes $B, C \in \mathcal V$ and
$$B_t + C_t = V_A(t) = \int_0^t |dA_s|.$$

Definition A.46. A process Z is a semimartingale if it has a decomposition

Zt = Z0 + At + Mt ,

where A ∈ V and M ∈ M0 .

There is a rich theory based on semimartingales that relies on the remark-


able property that semimartingales are stable under many sorts of operations,
e.g., changes of time, of probability measures, and of filtrations preserve the
semimartingale property, also products and convex functions of semimartin-
gales are semimartingales (cf. Dellacherie and Meyer [62], pp. 212–252). The
importance of semimartingales also lies in the fact that stochastic integrals
$$\int_0^t H_s\,dZ_s$$
of predictable processes $H$ with respect to a semimartingale $Z$ can be introduced in place of Stieltjes integrals. It is beyond the scope of this book to present the whole theory of semimartingales; we confine ourselves to the case that the process $A$ in the semimartingale decomposition is absolutely continuous (with respect to Lebesgue measure). The class of such processes is rich enough to contain most processes of interest in applications and allows the development of a kind of “differential” calculus.

Definition A.47. A semimartingale $Z$ with decomposition $Z_t = Z_0 + A_t + M_t$ is called a smooth semimartingale (SSM) if $Z$ is integrable and $A$ has the form
$$A_t = \int_0^t f_s\,ds,$$
where $f$ is a progressive process and $A$ has locally integrable variation, i.e.,
$$E\int_0^t |f_s|\,ds < \infty$$
for all $t \in \mathbb R_+$. Short notation: $Z = (f, M)$.

Just as submartingales can be considered stochastic analogues of increasing functions, smooth semimartingales can be seen as the stochastic counterpart of differentiable functions. Some of the above-mentioned operations are considered in the following.

A.6.1 Change of Time

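Let $(\tau_t)$, $t \in \mathbb{R}_+$, be a family of stopping times with respect to $\mathbb{F} = (\mathcal{F}_t)$ such that for all $\omega$, $\tau_t(\omega)$ is nondecreasing and right-continuous as a function of $t$. Then for an $\mathbb{F}$-semimartingale $Z$ we consider the transformed process $\tilde Z_t = Z_{\tau_t}$, which is adapted to $\tilde{\mathbb{F}} = (\tilde{\mathcal{F}}_t)$, where $\tilde{\mathcal{F}}_t = \mathcal{F}_{\tau_t}$.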

Theorem A.48. If Z is an F-semimartingale, then Z̃ is an F̃-semimartingale.

One example of such a change of time is stopping a process at some fixed stopping time $\tau$:
$$\tau_t = t \wedge \tau.$$
If we consider an SSM $Z = (f, M)$, then the stopped process $Z^\tau = \tilde Z = (\tilde f, \tilde M)$ is again an SSM with
$$\tilde f_t = I(\tau > t) f_t.$$

A.6.2 Product Rule

It is known that the product of two semimartingales is a semimartingale (cf. [62], p. 219). However, this does not hold true in general for SSMs. As an example consider a martingale $M \in \mathcal{M}_0^2$ with a predictable variation process $\langle M\rangle$ that is not continuous. Then $Z = M$ is an SSM with $f = 0$, but $Z^2 = M^2$ has a decomposition
$$Z_t^2 = \langle M\rangle_t + R_t$$
with some martingale $R$. Since this decomposition into a predictable process of finite variation and a martingale is unique, and the discontinuous process $\langle M\rangle$ cannot be written as $\int_0^t f_s\,ds$, $Z^2$ is not an SSM. To establish conditions under which a product rule for SSMs holds true, we first recall the integration by parts formula for ordinary functions.

Proposition A.49. Let $a$ and $b$ be cadlag functions on $\mathbb{R}_+$ which are of finite variation. Then for each $t \in \mathbb{R}_+$
$$\begin{aligned}
a(t)b(t) &= a(0)b(0) + \int_0^t a(s-)\,db(s) + \int_0^t b(s)\,da(s)\\
&= a(0)b(0) + \int_0^t a(s-)\,db(s) + \int_0^t b(s-)\,da(s) + \sum_{0<s\le t} \Delta a(s)\,\Delta b(s),
\end{aligned}$$
where $a(s-)$ is the left limit at $s$ and $\Delta a(s) = a(s) - a(s-)$.
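Proposition A.49 can be checked exactly on step functions, for which every Stieltjes integral reduces to a sum over jump times. The Python sketch below represents a cadlag step function by its value at 0 and a dictionary of jumps (a representation assumed here for illustration) and evaluates $a(t)b(t)$ together with both right-hand sides using exact rational arithmetic.

```python
from fractions import Fraction as Fr
from typing import Dict


class Step:
    """Cadlag step function: value x0 at time 0 plus jumps {time: size}, times > 0."""

    def __init__(self, x0: Fr, jumps: Dict[Fr, Fr]):
        self.x0, self.jumps = x0, jumps

    def at(self, s: Fr) -> Fr:        # right-continuous value a(s)
        return self.x0 + sum((d for u, d in self.jumps.items() if u <= s), Fr(0))

    def left(self, s: Fr) -> Fr:      # left limit a(s-)
        return self.x0 + sum((d for u, d in self.jumps.items() if u < s), Fr(0))

    def delta(self, s: Fr) -> Fr:     # jump a(s) - a(s-)
        return self.jumps.get(s, Fr(0))


def ibp_forms(a: Step, b: Step, t: Fr):
    """Return (a(t)b(t), first form, second form) of Proposition A.49."""
    times = sorted(u for u in set(a.jumps) | set(b.jumps) if u <= t)
    base = a.x0 * b.x0
    a_minus_db = sum((a.left(s) * b.delta(s) for s in times), Fr(0))
    b_da = sum((b.at(s) * a.delta(s) for s in times), Fr(0))
    b_minus_da = sum((b.left(s) * a.delta(s) for s in times), Fr(0))
    cross = sum((a.delta(s) * b.delta(s) for s in times), Fr(0))
    return (a.at(t) * b.at(t),
            base + a_minus_db + b_da,
            base + a_minus_db + b_minus_da + cross)


if __name__ == "__main__":
    a = Step(Fr(1), {Fr(1): Fr(2), Fr(3): Fr(-1)})
    b = Step(Fr(-2), {Fr(1): Fr(5), Fr(2): Fr(1, 2)})   # common jump at s = 1
    print(ibp_forms(a, b, Fr(4)))   # all three entries are equal
```

The common jump at $s = 1$ is what makes the extra sum $\sum \Delta a(s)\Delta b(s)$ necessary in the second form.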

Replacing $a$ and $b$ by SSMs $Z$ and $Y$ in this integration by parts formula, we need to give
$$\int_0^t Y_{s-}\,dZ_s$$
a meaning. The finite variation part can be defined as an ordinary (pathwise) Stieltjes integral. It remains to define $\int_0^t Y_{s-}\,dM_s$, where $M$ is a martingale possibly of unbounded variation. Because we do not want to develop the theory of stochastic integration, we only quote the following theorem stating conditions to be used in the product formula we aim at.

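Theorem A.50. Suppose $M \in \mathcal{M}_0^2$ and let $X$ be a predictable process such that
$$E\int_0^\infty X_s^2\,d\langle M\rangle_s < \infty.$$
Then there exists a unique process $\int_0^t X_s\,dM_s \in \mathcal{M}_0^2$ with the characterizing property
$$\left\langle \int_0^{\cdot} X_s\,dM_s,\ L \right\rangle_t = \int_0^t X_s\,d\langle M, L\rangle_s$$
for all $L \in \mathcal{M}_0^2$.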

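For two SSMs $Z$ and $Y$ with martingale parts $M$ and $L$, respectively, $M, L \in \mathcal{M}_0^2$, we define the covariation $[Z, Y]$ by
$$\begin{aligned}
[Z, Y]_t &= \langle M^c, L^c\rangle_t + \sum_{s\le t} \Delta Z_s\,\Delta Y_s\\
&= \langle M^c, L^c\rangle_t + Z_0 Y_0 + \sum_{s\le t} \Delta M_s\,\Delta L_s\\
&= Z_0 Y_0 + [M, L]_t,
\end{aligned}$$
where $M^c$ and $L^c$ denote the continuous martingale parts of $M$ and $L$, and the jumps at time 0 are $\Delta Z_0 = Z_0$, $\Delta Y_0 = Y_0$. Since the finite variation parts of $Z$ and $Y$ are continuous, $\Delta Z_s = \Delta M_s$ and $\Delta Y_s = \Delta L_s$ for $s > 0$.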

After these preparations the following product rule can be established.

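Theorem A.51. Let $Z = (f, M)$ and $Y = (g, L)$ be $\mathbb{F}$-SSMs with orthogonal martingales $M, L \in \mathcal{M}_0^2$, i.e., $ML \in \mathcal{M}_0$. Assume that
$$E\int_0^t \left(|Z_s g_s| + |Y_s f_s|\right)ds < \infty, \qquad E|Z_0 Y_0| < \infty,$$
$$E\int_0^\infty Y_{s-}^2\,d\langle M\rangle_s < \infty, \qquad E\int_0^\infty Z_{s-}^2\,d\langle L\rangle_s < \infty.$$
Then $ZY$ is an $\mathbb{F}$-SSM with representation
$$Z_t Y_t = Z_0 Y_0 + \int_0^t (Y_s f_s + Z_s g_s)\,ds + R_t,$$
where $R = (R_t)$ is a martingale in $\mathcal{M}_0$.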

Proof. To prove the product rule we use a form of integration by parts for semimartingales, which is an application of Itô's formula (see [67], p. 140):
$$Z_t Y_t = \int_{(0,t]} Z_{s-}\,dY_s + \int_{(0,t]} Y_{s-}\,dZ_s + [Z, Y]_t.$$
The definition of stochastic integrals implies
$$\int_{(0,t]} Z_{s-}\,dY_s = \int_{(0,t]} Z_{s-}\,d\left(\int_0^s g_u\,du\right) + \int_{(0,t]} Z_{s-}\,dL_s.$$

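The second term of the sum is a martingale in $\mathcal{M}_0^2$ by Theorem A.50, by virtue of
$$E\int_0^\infty Z_{s-}^2\,d\langle L\rangle_s < \infty.$$
The first term of the sum is an ordinary Stieltjes integral with respect to an absolutely continuous integrator. Since the paths of $Z$ have at most countably many jumps, $Z_{s-} = Z_s$ for Lebesgue-almost all $s$, and it follows that
$$\int_{(0,t]} Z_{s-}\,d\left(\int_0^s g_u\,du\right) = \int_0^t Z_s g_s\,ds.$$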

The second integral in the integration by parts formula is treated in the same way. It remains to show that in $[Z, Y]_t = Z_0 Y_0 + [M, L]_t$ the second term of the sum is a martingale. From Proposition A.44, p. 265, we know that $ML - [M, L]$ is a martingale. By virtue of the assumption that $ML \in \mathcal{M}_0$, the square bracket process $[M, L]$ must also have the martingale property. Altogether the product semimartingale has the representation
$$Z_t Y_t = Z_0 Y_0 + \int_0^t (Z_s g_s + Y_s f_s)\,ds + R_t,$$
where
$$R_t = \int_{(0,t]} Z_{s-}\,dL_s + \int_{(0,t]} Y_{s-}\,dM_s + [M, L]_t$$
is a martingale in $\mathcal{M}_0$. This completes the proof.



Sometimes the product rule is used for a product one factor of which is the one-point process $I(\zeta \le t)$ with a stopping time $\zeta$. Because of the special structure of this factor, less restrictive conditions suffice to establish a product rule.

Proposition A.52. Let $Z = (f, M)$ be an $\mathbb{F}$-SSM and $\zeta > 0$ a (totally inaccessible) $\mathbb{F}$-stopping time with
$$Y_t = I(\zeta \le t) = \int_0^t g_s\,ds + L_t.$$
Furthermore, it is assumed that for all $t \in \mathbb{R}_+$
$$E\int_0^t |Z_s g_s|\,ds < \infty, \qquad E\int_0^t |Z_{s-}|\,|dL_s| < \infty$$
and $\Delta M_\zeta = 0$. Then $ZY$ is an SSM with representation
$$Z_t Y_t = \int_0^t (Z_s g_s + Y_s f_s)\,ds + R_t,$$
where $R \in \mathcal{M}_0$.
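A small Monte Carlo sketch of Proposition A.52, under assumptions that are not made in the text: $\zeta$ is exponential with rate $\lambda$, so that in the filtration generated by $Y$ the intensity is $g_s = \lambda I(\zeta > s)$, and $Z_t = t$, which is an SSM with $f \equiv 1$ and $M = 0$ (so $\Delta M_\zeta = 0$ trivially). Taking expectations in the representation removes $R_t$, so $E[tI(\zeta \le t)]$ should equal $E\int_0^t \left(s\lambda I(\zeta > s) + I(\zeta \le s)\right)ds$; a direct calculation shows both equal $t(1 - e^{-\lambda t})$.

```python
import math
import random


def check_product_rule(lam: float, t: float, n: int, seed: int = 1):
    """Monte Carlo estimates of E[Z_t Y_t] and E[int_0^t (Z_s g_s + Y_s f_s) ds]
    for the illustrative choice Z_s = s (f = 1, M = 0), Y_s = I(zeta <= s),
    zeta ~ Exp(lam), g_s = lam * I(zeta > s)."""
    rng = random.Random(seed)
    lhs = rhs = 0.0
    for _ in range(n):
        zeta = rng.expovariate(lam)
        lhs += t * (zeta <= t)                      # Z_t * Y_t
        # int_0^t s * lam * I(zeta > s) ds = lam * min(zeta, t)**2 / 2
        # int_0^t I(zeta <= s) ds          = max(t - zeta, 0)
        rhs += lam * min(zeta, t) ** 2 / 2 + max(t - zeta, 0.0)
    return lhs / n, rhs / n


if __name__ == "__main__":
    lam, t = 1.0, 2.0
    print(check_product_rule(lam, t, 200_000), t * (1 - math.exp(-lam * t)))
```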

Proof. The product $ZY$ can be represented in the form
$$Z_t Y_t = Z_t - Z_{t\wedge\zeta} + \int_0^t Z_s\,dY_s$$
with the pathwise defined Stieltjes integral
$$\int_0^t Z_s\,dY_s = \int_0^t Z_s g_s\,ds + \int_0^t Z_s\,dL_s.$$
The second term in this sum can be decomposed as
$$\int_0^t Z_s\,dL_s = \int_0^t Z_{s-}\,dL_s + \sum_{s\le t} \Delta M_s\,\Delta L_s.$$
The sum of jumps is 0, since $L$ is continuous outside $\{(t, \omega) : \zeta(\omega) = t\}$ and $\Delta M_\zeta = 0$. The martingale $L$ is of finite variation and the condition $E\int_0^t |Z_{s-}|\,|dL_s| < \infty$ implies that the integral of the predictable process $Z_{s-}$ with respect to $L$ is a martingale (cf. [101]). To sum up we get
$$Z_t Y_t = \int_0^t \left(f_s - I(\zeta > s) f_s + Z_s g_s\right)ds + M_t - M_t^\zeta + \int_0^t Z_{s-}\,dL_s,$$
which proves the assertion, since $f_s - I(\zeta > s) f_s = Y_s f_s$.



B
Renewal Processes

In this appendix we present some definitions and results from the theory of renewal processes, including renewal reward processes and regenerative processes. Key references are [1, 8, 44, 58, 135, 156].
The purpose of this appendix is not to give an all-inclusive presentation of the theory. Only definitions and results needed for establishing the results of Chaps. 1–5 (in particular Chap. 4) are covered.

B.1 Basic Theory of Renewal Processes


Let $T, T_j$, $j = 1, 2, \ldots$, be a sequence of nonnegative independent identically distributed (i.i.d.) random variables with distribution function $F$. To avoid trivialities, we assume that $P(T = 0) < 1$. From the nonnegativity of $T$ it follows that $ET$ exists, although it may be infinite, and we denote
$$\mu = ET = \int_0^\infty P(T > t)\,dt.$$
The variance of $T$ is denoted $\sigma^2$. Let
$$S_0 = 0, \qquad S_j = \sum_{i=1}^{j} T_i, \quad j \in \mathbb{N},$$
and define
$$N_t = \sup\{j : S_j \le t\},$$
or equivalently,
$$N_t = \sum_{j=1}^{\infty} I(S_j \le t). \tag{B.1}$$
The processes $(N_t)$, $t \in \mathbb{R}_+$, and $(S_j)$, $j \in \mathbb{N}_0$, are both called a renewal process. We say that a renewal occurs at $t$ if $S_j = t$ for some $j \ge 1$.

T. Aven and U. Jensen, Stochastic Models in Reliability, Stochastic Modelling 273


and Applied Probability 41, DOI 10.1007/978-1-4614-7894-2,
© Springer Science+Business Media New York 2013

The random variable $N_t$ represents the number of renewals in $[0, t]$. Since the interarrival times $T_j$ are independent and identically distributed, it follows that after each renewal the process restarts.

Let $M(t) = EN_t$, $0 \le t < \infty$. The function $M(t)$ is called the renewal function. It can be shown that $M(t)$ is finite for all $t$. From (B.1) we see that
$$M(t) = \sum_{j=1}^{\infty} F^{*j}(t), \tag{B.2}$$
where $F^{*j}$ denotes the $j$-fold convolution of $F$. If, for example, $F$ is a Gamma distribution with parameters 2 and $\lambda$, i.e., $F(t) = 1 - e^{-\lambda t} - \lambda t e^{-\lambda t}$, it can be shown that
$$M(t) = \frac{\lambda t}{2} - \frac{1 - e^{-2\lambda t}}{4}.$$
Refer to [1, 31, 32] for more general formulas for the renewal function of the Gamma distribution and for expressions and bounds for other distributions. In Proposition B.1 we show how $M$ can be determined (at least in theory) from $F$. It turns out that $M$ uniquely determines $F$.
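The closed form for the Gamma$(2, \lambda)$ renewal function can be cross-checked against the series (B.2). The sketch below uses the fact that $F^{*j}$ is the Gamma$(2j, \lambda)$ distribution function, so $F^{*j}(t) = P(X \ge 2j)$ for a Poisson variable $X$ with mean $\lambda t$; summing over $j$ gives $M(t) = E\lfloor X/2 \rfloor$. The truncation point of the Poisson sum is an arbitrary choice of this sketch.

```python
import math


def renewal_gamma2_series(t: float, lam: float) -> float:
    """M(t) = sum_j F^{*j}(t) for Gamma(2, lam) interarrivals, evaluated as
    E[floor(X / 2)] with X ~ Poisson(lam * t)."""
    mean = lam * t
    kmax = int(mean + 20 * math.sqrt(mean) + 50)   # truncation (illustrative choice)
    p = math.exp(-mean)                             # P(X = 0)
    total = 0.0
    for k in range(1, kmax + 1):
        p *= mean / k                               # P(X = k) from P(X = k - 1)
        total += (k // 2) * p
    return total


def renewal_gamma2_closed(t: float, lam: float) -> float:
    """Closed form M(t) = lam*t/2 - (1 - exp(-2*lam*t))/4 quoted in the text."""
    return lam * t / 2 - (1 - math.exp(-2 * lam * t)) / 4


if __name__ == "__main__":
    for t in (0.5, 2.0, 10.0):
        print(t, renewal_gamma2_series(t, 1.5), renewal_gamma2_closed(t, 1.5))
```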

Proposition B.1. There is a one-to-one correspondence between the interarrival distribution $F$ and the renewal function $M$.

Proof. We introduce the Laplace transform $L_B(s) = \int_0^\infty e^{-sx}\,dB(x)$, where $B : \mathbb{R}_+ \to \mathbb{R}_+$ is a nondecreasing and right-continuous function. By taking the Laplace transform $L$ on both sides of formula (B.2) we obtain
$$L_M(s) = \sum_{j=1}^{\infty} L_{F^{*j}}(s) = \sum_{j=1}^{\infty} \left(L_F(s)\right)^j = \frac{L_F(s)}{1 - L_F(s)}. \tag{B.3}$$
The geometric series converges for $s > 0$, since $L_F(s) < 1$ when $P(T = 0) < 1$. Equivalently,
$$L_F(s) = \frac{L_M(s)}{1 + L_M(s)}.$$
Hence $L_F$ is determined by $M$, and since the Laplace transform determines the distribution, it follows that $F$ is also determined by $M$.
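As a worked check of (B.3), take the Gamma distribution with parameters 2 and $\lambda$ from the example above, whose Laplace–Stieltjes transform is $L_F(s) = (\lambda/(\lambda+s))^2$. Formula (B.3) gives the transform of $M$, which can be compared with the transform of the closed-form renewal function, whose derivative is $M'(x) = \frac{\lambda}{2}(1 - e^{-2\lambda x})$:

```latex
\begin{aligned}
L_M(s) &= \frac{L_F(s)}{1 - L_F(s)}
        = \frac{\lambda^2}{(\lambda+s)^2 - \lambda^2}
        = \frac{\lambda^2}{s(s+2\lambda)}
        = \frac{\lambda}{2}\left(\frac{1}{s} - \frac{1}{s+2\lambda}\right),\\
\int_0^\infty e^{-sx}\,dM(x)
       &= \int_0^\infty e^{-sx}\,\frac{\lambda}{2}\left(1 - e^{-2\lambda x}\right)dx
        = \frac{\lambda}{2}\left(\frac{1}{s} - \frac{1}{s+2\lambda}\right).
\end{aligned}
```

The two transforms agree for all $s > 0$, as Proposition B.1 requires.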


The function $M(t)$ satisfies the following integral equation:
$$M(t) = F(t) + \int_0^t M(t - x)\,dF(x),$$
i.e., $M = F + M * F$, where $*$ means convolution. This equation is referred to as the renewal equation, and is seen to hold by conditioning on the time of the first renewal. Upon doing so we obtain
$$M(t) = \int_0^\infty E[N_t \mid T_1 = x]\,dF(x) = \int_0^t [1 + M(t - x)]\,dF(x) = F(t) + (M * F)(t),$$
noting that if the first renewal occurs at time $x$, $x \le t$, then from this point on the process restarts, and thus the expected number of renewals in $[0, t]$ is just 1 plus the expected number to arrive in a time $t - x$ from an equivalent renewal process. A more formal proof is the following:
$$\begin{aligned}
M(t) = EN_t &= E\sum_{j=1}^{\infty} I(S_j \le t) = F(t) + E\sum_{j=2}^{\infty} I(S_j \le t)\\
&= F(t) + E\sum_{j=2}^{\infty} I(S_j - S_1 \le t - S_1)\\
&= F(t) + \int_0^t E\sum_{j=2}^{\infty} I(S_j - S_1 \le t - s)\,dF(s)\\
&= F(t) + \int_0^t M(t - s)\,dF(s).
\end{aligned}$$
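The renewal equation also suggests a simple way of computing $M$ numerically when no closed form is available. The Python sketch below is an illustrative left-endpoint discretisation, not a method taken from the text. It assumes $F(0) = 0$ and replaces the Stieltjes integral by a sum over a grid of width $h$. For exponential interarrival times with rate 1 the exact answer is $M(t) = t$. The recursion then reduces to a discrete renewal equation with geometric interarrival steps and returns $i(1 - e^{-h})$ at grid point $i$, so the error is of order $h$.

```python
import math
from typing import Callable, List


def solve_renewal_equation(F: Callable[[float], float], t_max: float, h: float) -> List[float]:
    """Approximate M on the grid 0, h, 2h, ..., t_max via
    M_i = F(t_i) + sum_{k=1}^{i} M_{i-k} * (F(t_k) - F(t_{k-1})),
    a left-endpoint discretisation of M = F + M * F (O(n^2) work, assumes F(0) = 0)."""
    n = int(round(t_max / h))
    Fg = [F(i * h) for i in range(n + 1)]
    dF = [0.0] + [Fg[k] - Fg[k - 1] for k in range(1, n + 1)]
    M = [0.0] * (n + 1)                       # M_0 = F(0) = 0
    for i in range(1, n + 1):
        M[i] = Fg[i] + sum(M[i - k] * dF[k] for k in range(1, i + 1))
    return M


if __name__ == "__main__":
    M = solve_renewal_equation(lambda x: 1 - math.exp(-x), t_max=5.0, h=0.01)
    print(M[-1])   # close to the exact value M(5) = 5
```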

To generalize the renewal equation, we write
$$g(t) = h(t) + (g * F)(t), \tag{B.4}$$
where $h$ and $F$ are known and $g$ is an unknown function to be determined as a solution to (B.4). The solution of this equation is given by the following result.

Theorem B.2. If the function $g$ satisfies (B.4) and $h$ is bounded on finite intervals, then
$$g(t) = h(t) + (h * M)(t)$$
is a solution to (B.4) and the unique solution which is bounded on finite intervals.

Proof. A proof of this result is given in Asmussen [8], p. 113. A simpler proof can, however, be given in the case where the Laplace transforms of $h$ and $g$ exist. Taking Laplace transforms in (B.4) yields
$$L_g(s) = L_h(s) + L_g(s) L_F(s),$$
and it follows that
$$L_g(s) = \frac{L_h(s)}{1 - L_F(s)} = L_h(s)\left(1 + \frac{L_F(s)}{1 - L_F(s)}\right) = L_h(s) + L_h(s) L_M(s) = L_{h + h*M}(s),$$
where the second last equality follows from (B.3). Since the Laplace transform uniquely determines the function, this gives the desired result.


Using the (strong) law of large numbers, many results related to renewal processes can be established, including the following.

Theorem B.3. With probability one,
$$\frac{N_t}{t} \to \frac{1}{\mu} \quad \text{as } t \to \infty.$$

Proof. By definition of $N_t$, it follows that
$$S_{N_t} \le t < S_{N_t + 1}.$$
Hence, for $N_t \ge 1$,
$$\frac{S_{N_t}}{N_t} \le \frac{t}{N_t} < \frac{S_{N_t + 1}}{N_t}.$$
Now the strong law of large numbers states that with probability one, $S_j/j \to \mu$ as $j \to \infty$. As can easily be shown, $N_t \to \infty$ as $t \to \infty$, and thus
$$\frac{S_{N_t}}{N_t} \to \mu \quad \text{as } t \to \infty \ (P\text{-a.s.}).$$
By the same argument, we also see that with probability one,
$$\frac{S_{N_t + 1}}{N_t} = \frac{S_{N_t + 1}}{N_t + 1} \cdot \frac{N_t + 1}{N_t} \to \mu \cdot 1 = \mu \quad \text{as } t \to \infty.$$
The result follows.
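Theorem B.3 is easy to observe in simulation. The Python sketch below generates one path with interarrival times uniform on $[0, 2]$ (an arbitrary illustrative choice with $\mu = 1$ and $\sigma^2 = 1/3$) and counts renewals up to $t$ directly from the definition $N_t = \sup\{j : S_j \le t\}$.

```python
import random
from typing import Callable


def renewal_count(t: float, draw: Callable[[random.Random], float],
                  rng: random.Random) -> int:
    """N_t = sup{j : S_j <= t}, where S_j is the sum of the first j interarrival
    times produced by draw(rng). Relies on P(T = 0) < 1 for termination."""
    s, n = 0.0, 0
    while True:
        s += draw(rng)          # s is now S_{n+1}
        if s > t:
            return n            # S_n <= t < S_{n+1}, hence N_t = n
        n += 1


if __name__ == "__main__":
    rng = random.Random(42)
    t = 20_000.0
    n = renewal_count(t, lambda r: r.uniform(0.0, 2.0), rng)   # mu = 1
    print(n / t)   # should be close to 1/mu = 1
```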


We now formulate some limiting results, without proof, including the Elementary Renewal Theorem, the Key Renewal Theorem, Blackwell's Theorem, and the Central Limit Theorem for renewal processes. Refer to Alsmeyer [1], Asmussen [8], Daley and Vere-Jones [58], and Ross [135] for proofs; see also Birolini [44]. Some of the results require that the distribution $F$ is not periodic (lattice). We say that $F$ is periodic if there exists a constant $c > 0$ such that $T$ takes only values in $\{0, c, 2c, 3c, \ldots\}$.

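Theorem B.4. (Elementary Renewal Theorem)
$$\lim_{t\to\infty} \frac{M(t)}{t} = \frac{1}{\mu}.$$

Theorem B.5. (Tightened Elementary Renewal Theorem). Assume that $\sigma^2 = \mathrm{Var}[T] < \infty$. If the distribution $F$ is not periodic, then
$$\lim_{t\to\infty} \left(M(t) - \frac{t}{\mu}\right) = \frac{\sigma^2 - \mu^2}{2\mu^2}.$$

Theorem B.6. Assume that $\sigma^2 = \mathrm{Var}[T] < \infty$. If the distribution $F$ is not periodic, then
$$\lim_{t\to\infty} \frac{\mathrm{Var}[N_t]}{t} = \frac{\sigma^2}{\mu^3}.$$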
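For the Gamma distribution with parameters 2 and $\lambda$ we have $\mu = 2/\lambda$ and $\sigma^2 = 2/\lambda^2$, so Theorem B.5 predicts $M(t) - t/\mu \to (\sigma^2 - \mu^2)/(2\mu^2) = -1/4$. This agrees with the closed form $M(t) = \lambda t/2 - (1 - e^{-2\lambda t})/4$ from Sect. B.1, since $t/\mu = \lambda t/2$. The short Python check below evaluates both sides; the value $\lambda = 3$ is arbitrary.

```python
import math


def gamma2_renewal(t: float, lam: float) -> float:
    """Closed-form renewal function for Gamma(2, lam) interarrival times (Sect. B.1)."""
    return lam * t / 2 - (1 - math.exp(-2 * lam * t)) / 4


def tightened_limit(mu: float, var: float) -> float:
    """Right-hand side of Theorem B.5."""
    return (var - mu ** 2) / (2 * mu ** 2)


if __name__ == "__main__":
    lam = 3.0
    mu, var = 2 / lam, 2 / lam ** 2           # mean and variance of Gamma(2, lam)
    for t in (1.0, 5.0, 20.0):
        m = gamma2_renewal(t, lam)
        print(t, m / t, m - t / mu)           # tends to 1/mu = 1.5 and to -1/4
    print(tightened_limit(mu, var))           # -0.25 up to rounding
```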

Before we state the Key Renewal Theorem, we need a definition. Let $g$ be a function defined on $\mathbb{R}_+$ and for $h > 0$ let
$$g_-^h(x) = \inf_{0\le\delta\le h} g(x - \delta), \qquad g_+^h(x) = \sup_{0\le\delta\le h} g(x - \delta).$$
We say that $g$ is directly Riemann integrable if for any $h > 0$
$$h\sum_{n=1}^{\infty} |g_-^h(nh)| \quad \text{and} \quad h\sum_{n=1}^{\infty} |g_+^h(nh)|$$
are finite, and
$$\lim_{h\to 0+} h\sum_{n=1}^{\infty} g_-^h(nh) = \lim_{h\to 0+} h\sum_{n=1}^{\infty} g_+^h(nh).$$
In particular, a nonnegative, nonincreasing and integrable function is directly Riemann integrable. See [58, 88] for some other sufficient conditions for a function to be directly Riemann integrable.
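For a nonnegative nonincreasing function the infimum and supremum in the definition are attained at the right and left ends of each interval, so $g_-^h(nh) = g(nh)$ and $g_+^h(nh) = g((n-1)h)$. The Python sketch below evaluates the two sums for the assumed example $g(x) = e^{-x}$. It truncates the series at a finite $x_{\max}$, which is an approximation introduced here. Both sums approach $\int_0^\infty e^{-x}\,dx = 1$ as $h \to 0$.

```python
import math
from typing import Callable, Tuple


def dri_sums_nonincreasing(g: Callable[[float], float], h: float,
                           x_max: float) -> Tuple[float, float]:
    """Lower and upper sums h*sum g_-^h(nh) and h*sum g_+^h(nh) for a NONINCREASING g,
    using g_-^h(nh) = g(nh) and g_+^h(nh) = g((n-1)h); truncated at nh <= x_max."""
    n_max = int(x_max / h)
    lower = h * sum(g(n * h) for n in range(1, n_max + 1))
    upper = h * sum(g((n - 1) * h) for n in range(1, n_max + 1))
    return lower, upper


if __name__ == "__main__":
    for h in (0.1, 0.01, 0.001):
        print(h, dri_sums_nonincreasing(lambda x: math.exp(-x), h, 50.0))
```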
Theorem B.7. (Key Renewal Theorem). Assume that the distribution $F$ is not periodic and $g$ is a directly Riemann integrable function. Then
$$\lim_{t\to\infty} \int_0^t g(t - s)\,dM(s) = \frac{1}{\mu}\int_0^\infty g(s)\,ds.$$

Remark B.8. An alternative formulation of the Key Renewal Theorem is the following: If $g$ is bounded and integrable with $g(t) \to 0$ as $t \to \infty$, then $\lim_{t\to\infty} \int_0^t g(t - s)\,dM(s) = (1/\mu)\int_0^\infty g(s)\,ds$ provided that $F$ is spread out. A distribution function is spread out if there exists an $n$ such that $F^{*n}$ has a nonzero absolutely continuous component with respect to Lebesgue measure, i.e., we can write $F^{*n} = G_1 + G_2$, where $G_1, G_2$ are nonnegative measures on $\mathbb{R}_+$, and $G_1$ has a density with respect to Lebesgue measure.

The Key Renewal Theorem is equivalent to Blackwell’s Theorem below.


Theorem B.9. (Blackwell's Theorem). For a renewal process with a nonperiodic distribution $F$,
$$\lim_{t\to\infty} [M(t) - M(t - s)] = \frac{s}{\mu}.$$

If $F$ has a density $f$, then $M$ has a density $m$, and
$$m(t) = \sum_{j=1}^{\infty} f^{*j}(t),$$
where $f^{*1} = f$ and
$$f^{*j}(t) = \int_0^t f^{*(j-1)}(t - s) f(s)\,ds, \quad j = 2, 3, \ldots.$$
Under certain conditions the renewal density $m(t)$ converges to $1/\mu$ as $t \to \infty$.


Theorem B.10. (Renewal Density Theorem). Assume that $F$ has a density $f$ with $f(t)^p$ integrable for some $p > 1$, and $f(t) \to 0$ as $t \to \infty$. Then $M$ has a density $m$ such that
$$\lim_{t\to\infty} m(t) = \frac{1}{\mu}.$$

Remark B.11. The conclusion of the theorem also holds true if $F$ has a density $f$ which is directly Riemann integrable, or if $F$ has finite mean and a bounded density $f$ satisfying $f(t) \to 0$ as $t \to \infty$.
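Theorem B.12. (Central Limit Theorem). Assume that $\sigma^2 = \mathrm{Var}[T] < \infty$. Then $N_t$, suitably standardized, tends to a normal distribution as $t \to \infty$, i.e.,
$$\lim_{t\to\infty} P\left(\frac{N_t - t/\mu}{\sqrt{t\sigma^2/\mu^3}} \le x\right) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-u^2/2}\,du.$$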

Next we formulate the limiting distribution of the forward and backward recurrence times $\alpha_t$ and $\beta_t$, defined by
$$\alpha_t = S_{N_t + 1} - t, \qquad \beta_t = t - S_{N_t}.$$
The recurrence times $\alpha_t$ and $\beta_t$ are the time intervals from $t$ forward to the next renewal point and backward to the last renewal point (or to the time origin), respectively. Let $F_{\alpha_t}$ and $F_{\beta_t}$ denote the distribution functions of $\alpha_t$ and $\beta_t$, respectively. The following result is a consequence of the Key Renewal Theorem.

Theorem B.13. Assume that the distribution $F$ is not periodic. Then the asymptotic distributions of the forward and backward recurrence times are given by
$$\lim_{t\to\infty} F_{\alpha_t}(x) = \lim_{t\to\infty} F_{\beta_t}(x) = \frac{\int_0^x \bar F(s)\,ds}{\mu}.$$
This asymptotic distribution of $\alpha_t$ and $\beta_t$ is called the equilibrium distribution.
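The equilibrium distribution can be compared with simulated recurrence times. In the Python sketch below the interarrival times are uniform on $[0, 2]$, an illustrative nonperiodic choice with $\mu = 1$. For this choice $\int_0^x \bar F(s)\,ds/\mu = x - x^2/4$ on $[0, 2]$, so both limits at $x = 1$ equal $0.75$. The time $t = 50$ and the sample size are arbitrary.

```python
import random
from typing import Tuple


def recurrence_times(t: float, rng: random.Random) -> Tuple[float, float]:
    """One draw of (alpha_t, beta_t) for Uniform(0, 2) interarrival times."""
    s_prev, s = 0.0, 0.0
    while s <= t:
        s_prev, s = s, s + rng.uniform(0.0, 2.0)
    return s - t, t - s_prev          # s = S_{N_t+1}, s_prev = S_{N_t}


def equilibrium_cdf(x: float) -> float:
    """(1/mu) * int_0^x (1 - s/2) ds for Uniform(0, 2) interarrivals (mu = 1)."""
    x = min(max(x, 0.0), 2.0)
    return x - x * x / 4


if __name__ == "__main__":
    rng = random.Random(7)
    draws = [recurrence_times(50.0, rng) for _ in range(20_000)]
    p_alpha = sum(a <= 1.0 for a, _ in draws) / len(draws)
    p_beta = sum(b <= 1.0 for _, b in draws) / len(draws)
    print(p_alpha, p_beta, equilibrium_cdf(1.0))   # estimates should be close to 0.75
```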
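A simple formula exists for the mean forward recurrence time $E\alpha_t = ES_{N_t + 1} - t$; we have
$$ES_{N_t + 1} = \mu(1 + M(t)). \tag{B.5}$$
Formula (B.5) is a special case of Wald's equation (see, e.g., Ross [135]), and follows by writing
$$\begin{aligned}
ES_{N_t + 1} &= E\sum_{k\ge1} S_k I(N_t + 1 = k) = E\sum_{k\ge1}\sum_{j=1}^{k} T_j I(N_t + 1 = k)\\
&= E\sum_{j\ge1} T_j I(N_t + 1 \ge j) = E\sum_{j\ge1} T_j I(S_{j-1} \le t)\\
&= \sum_{j\ge1} ET_j\,EI(S_{j-1} \le t) = \mu\sum_{j\ge0} F^{*j}(t) = \mu(1 + M(t)),
\end{aligned}$$
where we used that $\{N_t + 1 \ge j\} = \{S_{j-1} \le t\}$, that $T_j$ is independent of $S_{j-1}$, and that $F^{*0}(t) = 1$ for $t \ge 0$.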
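For exponential interarrival times with rate $\lambda$ we have $M(t) = \lambda t$, so (B.5) reads $ES_{N_t + 1} = t + 1/\lambda$, i.e., $E\alpha_t = 1/\lambda$ for all $t$, in line with the memoryless property. A Monte Carlo sketch of this special case (rate, horizon and sample size are arbitrary choices):

```python
import random


def first_renewal_after(t: float, lam: float, rng: random.Random) -> float:
    """S_{N_t + 1}: the first renewal epoch strictly after t, Exp(lam) interarrivals."""
    s = 0.0
    while s <= t:
        s += rng.expovariate(lam)
    return s


if __name__ == "__main__":
    lam, t, n = 2.0, 3.0, 100_000
    rng = random.Random(3)
    est = sum(first_renewal_after(t, lam, rng) for _ in range(n)) / n
    print(est, (1 / lam) * (1 + lam * t))   # estimate vs mu * (1 + M(t)) = 3.5
```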

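Finally in this section we prove a result used in the proof of Theorem 4.19, p. 122.

Proposition B.14. Let $g$ be a real-valued function which is bounded on finite intervals. Assume that
$$\lim_{t\to\infty} g(t) = g.$$
Then
$$\lim_{t\to\infty} \frac{1}{t}\int_0^t g(s)\,dM(s) = \frac{g}{\mu}.$$

Proof. To prove this result we use a standard argument. Given $\varepsilon > 0$, there exists a $t_0$ such that $|g(t) - g| < \varepsilon$ for $t \ge t_0$. Hence for $t > t_0$ we have
$$\frac{1}{t}\int_0^t |g(s) - g|\,dM(s) \le \frac{1}{t}\int_0^{t_0} |g(s) - g|\,dM(s) + \frac{\varepsilon}{t}\int_{t_0}^t dM(s).$$
Since $t_0$ is fixed, applying the Elementary Renewal Theorem gives
$$\limsup_{t\to\infty} \frac{1}{t}\int_0^t |g(s) - g|\,dM(s) \le \frac{\varepsilon}{\mu}.$$
As $\varepsilon > 0$ is arbitrary and, again by the Elementary Renewal Theorem, $\frac{1}{t}\int_0^t g\,dM(s) = g M(t)/t \to g/\mu$ for the constant $g$, the desired conclusion follows.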


B.2 Renewal Reward Processes


Let $(T, Y), (T_1, Y_1), (T_2, Y_2), \ldots$ be a sequence of independent and identically distributed pairs of random variables, with $T, T_j \ge 0$. We interpret $Y_j$ as the "reward" ("cost") associated with the $j$th interarrival time $T_j$. The random variable $Y_j$ may depend on $T_j$. Let $Z_t$ denote the total reward earned by time $t$. If the reward is earned at the time of the renewal, we see that
$$Z_t = \sum_{j=1}^{N_t} Y_j.$$
The limiting value of the average return is established using the law of large numbers and is given by the following result (cf. [135]).

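Theorem B.15. If $E|Y|$ is finite, then

(i) with probability 1,
$$\frac{Z_t}{t} \to \frac{EY}{ET} \quad \text{as } t \to \infty,$$
(ii)
$$\frac{EZ_t}{t} \to \frac{EY}{ET} \quad \text{as } t \to \infty.$$

Remark B.16. The conclusions of Theorem B.15 also hold true if $Y \ge 0$, $EY = \infty$ and $ET < \infty$.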
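A simulation sketch of Theorem B.15 (i), under the illustrative assumption that $T$ is uniform on $[0, 2]$ and the reward of a cycle is $Y = T^2$, so that $Y$ depends on $T$. Then $ET = 1$ and $EY = 4/3$, and $Z_t/t$ should approach $4/3$.

```python
import random


def reward_rate(t: float, rng: random.Random) -> float:
    """Z_t / t for i.i.d. cycles of length T ~ Uniform(0, 2) with reward Y = T**2
    earned at the end of each cycle (illustrative choice; Y depends on T)."""
    s, z = 0.0, 0.0
    while True:
        length = rng.uniform(0.0, 2.0)
        if s + length > t:
            return z / t      # the cycle in progress at t has not yet paid its reward
        s += length
        z += length ** 2


if __name__ == "__main__":
    rng = random.Random(11)
    print(reward_rate(100_000.0, rng), 4 / 3)   # EY / ET = (4/3) / 1
```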

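Many results from renewal theory can be generalized to renewal reward processes. For example, if $F$ is not periodic, Blackwell's Theorem holds in the form
$$\lim_{t\to\infty} [EZ_t - EZ_{t-s}] = \frac{s\,EY}{ET}.$$
The following theorem, which is a reformulation of Theorem 3.2, p. 136, in [8], generalizes the Central Limit Theorem for renewal processes, Theorem B.12.

Theorem B.17. Suppose $\mathrm{Var}[Y] < \infty$ and $\mathrm{Var}[T] < \infty$. Then as $t \to \infty$
$$\sqrt{t}\left(\frac{Z_t}{t} - \frac{EY}{ET}\right) \xrightarrow{D} N\left(0, \frac{\tau^2}{ET}\right),$$
where
$$\tau^2 = \mathrm{Var}\left[Y - \frac{EY}{ET}T\right] = \mathrm{Var}[Y] + \left(\frac{EY}{ET}\right)^2 \mathrm{Var}[T] - 2\frac{EY}{ET}\mathrm{Cov}[Y, T].$$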
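The two expressions for $\tau^2$ agree because $EY/ET$ is a constant. The Python sketch below evaluates both from simulated pairs, assuming for illustration that $T$ is uniform on $[0, 2]$ and $Y = T^2$. For this choice, a direct calculation with $ET^k = 2^k/(k+1)$ gives $\tau^2 = 64/45 + (16/9)(1/3) - (8/3)(2/3) = 32/135$.

```python
import random
from typing import List, Tuple


def tau_squared(ys: List[float], ts: List[float]) -> Tuple[float, float]:
    """Estimate tau^2 of Theorem B.17 from paired samples (Y_i, T_i) in two ways:
    directly as Var[Y - r T] and via the expanded formula, with r = mean(Y)/mean(T).
    All moments use divisor n, so the two results agree up to rounding."""
    n = len(ys)
    my, mt = sum(ys) / n, sum(ts) / n
    r = my / mt
    var_y = sum((y - my) ** 2 for y in ys) / n
    var_t = sum((t - mt) ** 2 for t in ts) / n
    cov = sum((y - my) * (t - mt) for y, t in zip(ys, ts)) / n
    d = [y - r * t for y, t in zip(ys, ts)]
    md = sum(d) / n
    direct = sum((x - md) ** 2 for x in d) / n
    expanded = var_y + r ** 2 * var_t - 2 * r * cov
    return direct, expanded


if __name__ == "__main__":
    rng = random.Random(5)
    ts = [rng.uniform(0.0, 2.0) for _ in range(100_000)]
    ys = [t * t for t in ts]                  # reward Y = T^2 (illustrative)
    print(tau_squared(ys, ts), 32 / 135)      # both estimates should be close to 32/135
```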
ET ET

B.3 Regenerative Processes


The stochastic process $(X_t)$ is called regenerative if there exists a renewal process $(T_j)$ such that for $k \in \mathbb{N}$, $(X_t)_{t\ge0} \overset{D}{=} (X_{t+S_k})_{t\ge0}$, and
$$\left((X_{t+S_k})_{t\ge0},\ (T_j),\ j > k\right) \quad \text{and} \quad \left((X_t)_{0\le t\le S_k},\ T_1, T_2, \ldots, T_k\right)$$
are stochastically independent. Thus the continuation of the process beyond $S_k$ is a probabilistic replica of the whole process starting at 0. The random times $S_k$ are said to be regenerative points for the process $(X_t)$ and the time interval $[S_{k-1}, S_k)$ is called the $k$th cycle of the process.
In the following, assume that the state space of $(X_t)$ equals $\mathbb{N}_0 = \{0, 1, 2, \ldots\}$. Let
$$P_k(t) = P(X_t = k), \quad k \in \mathbb{N}_0.$$
The following result, taken from Ross [135], is stated without proof.

Theorem B.18. If the distribution of $T_1$ has an absolutely continuous component and $ET_1 < \infty$, then
$$\lim_{t\to\infty} P_k(t) = \frac{E\int_0^{T_1} I(X_t = k)\,dt}{ET_1}, \quad k \in \mathbb{N}_0.$$
Remark B.19. We see that if $\lim_{t\to\infty} P_k(t) = P_k$ exists, then
$$\lim_{t\to\infty} \frac{1}{t} E\int_0^t I(X_s = k)\,ds = P_k.$$
The quantity $(1/t)E\int_0^t I(X_s = k)\,ds$ represents the expected portion of time the process is in state $k$ in $[0, t]$. Since
$$\frac{1}{t} E\int_0^t I(X_s = k)\,ds = \frac{1}{t}\int_0^t EI(X_s = k)\,ds = \frac{1}{t}\int_0^t P_k(s)\,ds,$$
this quantity is also equal to the average probability that the process is in state $k$.
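In reliability applications a typical regenerative process is the state of a repairable unit that alternates between operating (state 1) and under repair (state 0), with regeneration at the start of each operating period. Writing $U$ and $D$ for an up time and a down time, so that $T_1 = U + D$, Theorem B.18 gives the limiting availability $P_1 = EU/(EU + ED)$. The Python sketch below assumes, for illustration only, that $U$ is exponential with mean 1 and $D$ is uniform on $[0, 0.5]$, so that $P_1 = 1/1.25 = 0.8$.

```python
import random


def state_at(t: float, rng: random.Random) -> int:
    """Simulate X_t for a unit that is up (state 1) for an Exp(1) time and then
    down (state 0) for a Uniform(0, 0.5) time, repeatedly; cycles (regeneration
    points) start at the beginning of each up period. Illustrative assumptions."""
    s = 0.0
    while True:
        s += rng.expovariate(1.0)        # end of the current up period
        if t < s:
            return 1
        s += rng.uniform(0.0, 0.5)       # end of the down period = end of the cycle
        if t < s:
            return 0


if __name__ == "__main__":
    rng = random.Random(13)
    n = 20_000
    p_up = sum(state_at(30.0, rng) for _ in range(n)) / n
    print(p_up, 1.0 / (1.0 + 0.25))      # estimate vs EU / (EU + ED) = 0.8
```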

B.4 Modified (Delayed) Processes


Consider a renewal process $(S_j)$ as defined in Sect. B.1, but assume now that the first interarrival time $T_1$ has a distribution $\tilde F$ that is not necessarily identical to $F$. The process is referred to as a modified renewal process (or a delayed renewal process). Similarly, we define a modified (delayed) renewal reward process and a modified (delayed) regenerative process. For the modified renewal reward process the distribution of the pair $(Y_1, T_1)$ is not necessarily the same as that of the pairs $(Y_i, T_i)$, $i = 2, 3, \ldots$.

It can be shown that all the asymptotic results presented in the previous sections of this appendix still hold true for the modified processes. If we take the first distribution to be equal to the asymptotic distribution of the recurrence times, given by Theorem B.13, p. 279, the renewal process becomes stationary in the sense that the distribution of the forward recurrence time $\alpha_t$ does not depend on $t$. Furthermore, for the renewal function $M$ of this modified process,
$$M(t + h) - M(t) = h/ET \quad \text{for all } t, h \ge 0.$$
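To see the stationarity statement at work, the Python sketch below estimates the expected number of renewals in $(t, t + h]$ for $t = 0.3$ and $h = 0.5$. It rests on illustrative assumptions: $F$ is uniform on $[0, 2]$, so $ET = 1$, and by Theorem B.13 the first interval then has distribution function $x - x^2/4$ on $[0, 2]$, sampled here by inversion. The stationary version should give about $h/ET = 0.5$. For comparison, the ordinary process gives $M(0.8) - M(0.3) = e^{0.4} - e^{0.15} \approx 0.33$, using $M(t) = e^{t/2} - 1$ for $t \le 2$, which solves the renewal equation for this $F$.

```python
import math
import random


def count_in_window(t: float, h: float, rng: random.Random, stationary: bool) -> int:
    """Number of renewals in (t, t+h] for Uniform(0, 2) interarrival times; if
    stationary, the first interval is drawn from the equilibrium distribution
    F_e(x) = x - x**2/4 by inversion, F_e^{-1}(u) = 2 - 2*sqrt(1 - u)."""
    if stationary:
        s = 2.0 - 2.0 * math.sqrt(1.0 - rng.random())
    else:
        s = rng.uniform(0.0, 2.0)
    count = 0
    while s <= t + h:
        if s > t:
            count += 1
        s += rng.uniform(0.0, 2.0)
    return count


if __name__ == "__main__":
    rng = random.Random(17)
    n = 100_000
    stat = sum(count_in_window(0.3, 0.5, rng, True) for _ in range(n)) / n
    ordi = sum(count_in_window(0.3, 0.5, rng, False) for _ in range(n)) / n
    print(stat, 0.5)                                       # h / ET
    print(ordi, math.exp(0.4) - math.exp(0.15))            # ordinary renewal process
```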
References

[1] Alsmeyer, G. (1991) Erneuerungstheorie. Teubner Skripten zur Mathematischen Stochastik. B.G. Teubner, Stuttgart.
[2] Andersen, P. K., Borgan, Ø., Gill, R. and Keiding, N. (1992) Statistical
Models Based on Counting Processes. Springer, New York.
[3] Arjas, E. (1993) Information and reliability: A Bayesian perspective.
In: Barlow, R., Clarotti, C. and Spizzichino, F. (eds.): Reliability and
Decision Making. Chapman & Hall, London, pp. 115–135.
[4] Arjas, E. (1989) Survival models and martingale dynamics. Scand. J. Statist. 16, 177–225.
[5] Arjas, E. (1981) A stochastic process approach to multivariate reliability
systems: Notions based on conditional stochastic order. Mathematics of
Operations Research 6, 263–276.
[6] Arjas, E. (1981) The failure and hazard processes in multivariate relia-
bility systems. Mathematics of Operations Research 6, 551–562.
[7] Arjas, E. and Norros, I. (1989) Change of life distribution via hazard
transformation: An inequality with application to minimal repair. Math-
ematics of Operations Research 14, 355–361.
[8] Asmussen, S. (1987) Applied Probability and Queues. Wiley, New York.
[9] Asmussen, S. (1984) Approximations for the probability of ruin within finite time. Scand. Actuarial J., 31–57.
[10] Aven, T. (2009) Optimal test interval for a monotone safety system. J.
Applied Probability 46, 1–12.
[11] Aven, T. (1996) Availability analysis of monotone systems. In: S. Özekici
(ed.): Reliability and Maintenance of Complex Systems. NATO ASI
Series F, Springer, Berlin, pp. 206–223.
[12] Aven, T. (1996) Condition based replacement times - a counting process
approach. Reliability Engineering and System Safety. Special issue on
Maintenance and Reliability 51, 275–292.
[13] Aven, T. (1992) Reliability and Risk Analysis. Elsevier Applied Science,
London.


[14] Aven, T. (1990) Availability evaluation of flow networks with varying throughput-demand and deferred repairs. IEEE Trans. Reliability 38, 499–505.
[15] Aven, T. (1987) A counting process approach to replacement models.
Optimization 18, 285–296.
[16] Aven, T. (1985) A theorem for determining the compensator of a count-
ing process. Scand. J. Statist. 12, 69–72.
[17] Aven, T. (1985) Reliability evaluation of multistate systems of multi-
state components. IEEE Trans. Reliability 34, 473–479.
[18] Aven, T. (1983) Optimal replacement under a minimal repair strategy
− A general failure model. Adv. Appl. Prob. 15, 198–211.
[19] Aven, T. and Bergman, B. (1986) Optimal replacement times, a general
set-up. J. Appl. Prob. 23, 432–442.
[20] Aven, T. and Castro, I. T. (2008) A delay time model with safety con-
straint. Reliability Engineering and System Safety 94, 261–267.
[21] Aven, T. and Dekker, R. (1997) A useful framework for optimal replace-
ment models. Reliability Engineering and System Safety 58, 61–67.
[22] Aven, T. and Haukås, H. (1997) Asymptotic Poisson distribution for the
number of system failures of a monotone system. Reliability Engineering
and System Safety 58, 43–53.
[23] Aven, T. and Haukås, H. (1997) A note on the steady state availability
of monotone systems. Reliability Engineering and System Safety 59,
269–276.
[24] Aven, T. and Jensen, U. (1998) A general minimal repair model.
Research report, University of Ulm.
[25] Aven, T. and Jensen, U. (1998) Information based hazard rates for ruin
times of risk processes. Research Report, University of Ulm.
[26] Aven, T. and Jensen, U. (1997) Asymptotic distribution of the downtime
of a monotone system. Mathematical Methods of Operations Research.
Special issue on Stochastic Models of Reliability, 45, 355–375.
[27] Aven, T. and Opdal, K. (1996) On the steady state unavailability
of standby systems. Reliability Engineering and System Safety 52,
171–175.
[28] Aven, T. and Østebø, R. (1986) Two new component importance mea-
sures for a flow network system. Reliability Engineering 14, 75–80.
[29] Baker, R. D. and Christer, A. H. (1994) Review of delay-time OR modelling of engineering aspects of maintenance. European Journal of Operational Research 73, 407–422.
[30] Barlow, R. and Hunter, L. (1960) Optimum preventive maintenance
policies. Operations Res. 8, 90–100.
[31] Barlow, R. and Proschan, F. (1965) Mathematical Theory of Reliability.
Wiley, New York.
[32] Barlow, R. and Proschan, F. (1975) Statistical Theory of Reliability and
Life Testing. Holt, Rinehart and Winston, New York.

[33] Basu, A. (1988) Multivariate exponential distributions and their applications in reliability. In: Krishnaiah, P. R. and Rao, C. R. (eds.): Handbook of Statistics 7. Quality Control and Reliability. North-Holland, Amsterdam, pp. 99–111.
[34] Baxter, L. A. (1981) Availability measures for a two-state system. J.
Appl. Prob. 18, 227–235.
[35] Beichelt, F. (1993) A unifying treatment of replacement policies with
minimal repair. Nav. Res. Log. Q. 40, 51–67.
[36] Beichelt, F. and Franken, F. (1984) Zuverlässigkeit und Instandhaltung.
Carl Hanser Verlag, München.
[37] Berg, M. (1996) Economics oriented maintenance analysis and the
marginal cost approach. In: Özekici, S. (ed.): Reliability and Main-
tenance of Complex Systems. NATO ASI Series F, Springer, Berlin,
pp. 189–205.
[38] Bergman, B. (1978) Optimal replacement under a general failure model.
Adv. Appl. Prob. 10, 431–451.
[39] Bergman, B. (1985) On reliability theory and its applications. Scand. J. Statist. 12, 1–41.
[40] Bergman, B. and Klefsjö, B. (1994) Quality. Studentlitteratur, Lund.
[41] Bertsekas, D. (1995) Dynamic Programming and Optimal Control. Vol.
1 and 2. Athena Scientific, Belmont.
[42] Billingsley, P. (1979) Probability and Measure. Wiley, New York.
[43] Birnbaum, Z. W. (1969) On the importance of different components
in a multicomponent system. In: Krishnaiah, P. R. (ed.) Multivariate
Analysis II, Academic Press, pp. 581–592.
[44] Birolini, A. (1994) Quality and Reliability of Technical Systems.
Springer, Berlin.
[45] Birolini, A. (1985) On the use of Stochastic Processes in Modeling Relia-
bility Problems. Lecture notes in Economics and Mathematical Systems
252, Springer, Berlin.
[46] Block, H. W. and Savits, T. H. (1997) Burn-In. Statistical Science 12,
1–19.
[47] Block, H. W. and Savits, T. H. (1994) Comparison of maintenance poli-
cies. In: Shaked, M. and Shanthikumar, G. (eds.): Stochastic Orders and
their Applications. Academic Press, Boston, pp. 463–484.
[48] Block, H. W., Borges, W. and Savits, T. H. (1985) Age-dependent
minimal repair. J. Appl. Prob. 22, 370–385.
[49] Boland, P. and Proschan, F. (1994) Stochastic order in system reliability
theory. In: Shaked, M. and Shanthikumar, G. (eds.): Stochastic Orders
and their Applications. Academic Press, Boston, pp. 485–508.
[50] Brémaud, P. (1981) Point Processes and Queues. Martingale Dynamics.
Springer, New York.
[51] Brown, M. and Proschan, F. (1983) Imperfect repair. J. Appl. Prob. 20,
851–859.

[52] Butler, D. A. (1979) A complete importance ranking for components of binary coherent systems, with extensions to multi-state systems. Nav. Res. Log. Q. 26, 565–578.
[53] Christer, A. H. (1999) Developments in delay time analysis for mod-
elling plant maintenance. Journal of the Operational Research Society
50, 1120–1137.
[54] Christer, A. H. and Redmond D. F. (1992) Revising models of mainte-
nance and inspection. International Journal of Production Economics
24, 227–234.
[55] Çinlar, E. (1975) Superposition of point processes. In: Lewis, P. (ed.)
Stochastic Point Processes. Wiley, New York, pp. 549–606.
[56] Constantini, C. and Spizzichino, F. (1997) Explicit solution of an opt-
imal stopping problem: The burn-in of conditionally exponential com-
ponents. J. Appl. Prob. 34, 267–282.
[57] Csenki, A. (1994) Cumulative operational time analysis of finite semi-
Markov reliability models. Reliability Engineering and System Safety
44, 17–25.
[58] Daley, D. J. and Vere-Jones, D. (1988) An Introduction to the Theory
of Point Processes. Springer, Berlin.
[59] Davis, M. H. A. (1993) Markov Models and Optimization. Chapman &
Hall, London.
[60] Delbaen, F. and Haezendonck, J. (1985) Inversed martingales in risk
theory. Insurance: Mathematics and Economics 4, 201–206.
[61] Dellacherie, C. and Meyer, P. A. (1978) Probabilities and Potential A.
North-Holland, Amsterdam.
[62] Dellacherie, C. and Meyer, P. A. (1980) Probabilities and Potential B.
North-Holland, Amsterdam.
[63] Dekker, R. (1996) A framework for single-parameter maintenance acti-
vities and its use in optimisation, priority setting and combining. In:
Özekici, S. (ed.): Reliability and Maintenance of Complex Systems.
NATO ASI Series F, Springer, Berlin, pp. 170–188.
[64] Dekker, R. and Groenendijk, W. (1995) Availability assessment methods
and their application in practice. Microelectron. Reliab. 35, 1257–1274.
[65] Donatiello, L. and Iyer, B. R. (1987) Closed-form solution for system
availability distribution. IEEE Trans. Reliability 36, 45–47.
[66] Dynkin, E. B. (1965) Markov Processes. Springer, Berlin.
[67] Elliott, R. (1982) Stochastic Calculus and Applications. Springer, New
York.
[68] Freund, J. E. (1961) A bivariate extension of the exponential distribu-
tion. J. Amer. Stat. Ass. 56, 971–977.
[69] Funaki, K. and Yoshimoto, K. (1994) Distribution of total uptime during
a given time interval. IEEE Trans. Reliability 43, 489–492.
[70] Gaede, K.-W. (1977) Zuverlässigkeit, Mathematische Modelle. Carl
Hanser Verlag, München.

[71] Gandy, A. (2005) Effects of Uncertainties in Components on the Survival of Complex Systems with given Dependencies. In: Wilson, A., Limnios, N., Keller-McNulty, S. and Armijo, Y. (eds.): Modern Statistical and Mathematical Methods in Reliability. World Scientific, New Jersey, pp. 177–189.
[72] Gåsemyr, J. and Aven, T. (1999) Asymptotic distributions for the down-
times of monotone systems. J. Appl. Prob., to appear.
[73] Gaver, D. P. (1963) Time to failure and availability of paralleled systems
with repair. IEEE Trans. Reliability 12, 30–38.
[74] Gertsbakh, I. B. (1989) Statistical Reliability Theory. Marcel-Dekker,
New York.
[75] Gertsbakh, I. B. (1984) Asymptotic methods in reliability: A review.
Adv. Appl. Prob. 16, 147–175.
[76] Gnedenko, B. V. and Ushakov, I. A. (1995), edited by Falk, J. A. Prob-
abilistic Reliability Engineering. Wiley, Chichester.
[77] Grandell, J. (1991) Aspects of Risk Theory. Springer, New York.
[78] Grandell, J. (1991) Finite time ruin probabilities and martingales.
Informatica 2, 3–32.
[79] Griffith, W. S. (1980) Multistate reliability models. J. Appl. Prob. 15,
735–744.
[80] Grimmett, G. R. and Stirzaker, D. R. (1992) Probability and Random Processes. 2nd ed. Oxford Science Publication, Oxford.
[81] Haukås, H. and Aven, T. (1997) A general formula for the downtime of
a parallel system. J. Appl. Prob. 33, 772–785.
[82] Haukås, H. and Aven, T. (1996) Formulae for the downtime distribution
of a system observed in a time interval. Reliability Engineering and
System Safety 52, 19–26.
[83] Heinrich, G. and Jensen, U. (1995) Parameter estimation for a bivariate
lifetime distribution in reliability with multivariate extensions. Metrika
42, 49–65.
[84] Heinrich, G. and Jensen, U. (1996) Bivariate lifetime distributions and
optimal replacement. Mathematical Methods of Operations Research 44,
31–47.
[85] Heinrich, G. and Jensen, U. (1992) Optimal replacement rules based on
different information levels. Nav. Res. Log. Q. 39, 937–955.
[86] Henley, E.J. and Kumamoto, H. (1981) Reliability Engineering and Risk
Assessment. Prentice Hall, New Jersey.
[87] Herberts, T. and Jensen, U. (1998) Optimal stopping in a burn-in model.
Research report, University of Ulm.
[88] Hinderer, H. (1987) Remarks on directly Riemann integrable functions.
Mathematische Nachrichten 130, 225–230.
[89] Hokstad, P. (1997) The failure intensity process and the formulation of
reliability and maintenance models. Reliability Engineering and System
Safety 58, 69–82.

[90] Høyland, A. and Rausand, M. (1994) System Reliability Theory. Wiley, New York.
[91] Hutchinson, T. P. and Lai, C. D. (1990) Continuous Bivariate Distribu-
tions, Emphasising Applications. Rumbsby Scientific Publishing, Ade-
laide.
[92] Jacod, J. (1975) Multivariate point processes: predictable projection, Radon–Nikodym derivatives, representation of martingales. Z. für Wahrscheinlichkeitstheorie und Verw. Gebiete 31, 235–253.
[93] Jensen, U. and Hsu, G. (1993) Optimal stopping by means of point pro-
cess observations with applications in reliability. Mathematics of Oper-
ations Research 18, 645–657.
[94] Jensen, U. (1996) Stochastic models of reliability and maintenance: an
overview. In: S. Özekici (ed.): Reliability and Maintenance of Complex
Systems. NATO ASI Series F, Springer, Berlin, pp. 3–36.
[95] Jensen, U. (1997) An optimal stopping problem in risk theory. Scand.
Actuarial J. 149–159.
[96] Jensen, U. (1990) A general replacement model. ZOR-Methods and Mod-
els of Operations Research 34, 423–439.
[97] Jensen, U. (1990) An example concerning the convergence of conditional
expectations. Statistics 21, 609–611.
[98] Jensen, U. (1989) Monotone stopping rules for stochastic processes in
a semimartingale representation with applications. Optimization 20,
837–852.
[99] Joe, H. (1997) Multivariate Models and Dependence Concepts.
Chapman & Hall, Boca Raton.
[100] Kallianpur, G. (1980) Stochastic Filtering Theory. Springer, New York.
[101] Kallenberg, O. (1997) Foundations of Modern Probability. Springer, New
York.
[102] Kaplan, N. (1981) Another look at the two-lift problem. J. Appl. Prob.
18, 697–706.
[103] Karr, A. F. (1986) Point Processes and their Statistical Inference. Mar-
cel Dekker, New York.
[104] Keilson, J. (1966) A limit theorem for passage times in ergodic regen-
erative processes. Ann. Math. Stat. 37, 866–870.
[105] Keilson, J. (1979) Markov Chain Models – Rarity and Exponentiality.
Springer, Berlin.
[106] Keilson, J. (1987) Robustness and exponentiality in redundant
repairable systems. Annals of Operations Research 9, 439–447.
[107] Kijima, M. (1989) Some results for repairable systems. J. Appl. Prob.
26, 89–102.
[108] Koch, G. (1986) A dynamical approach to reliability theory. Proc. Int.
School of Phys. “Enrico Fermi,” XCIV. North-Holland, Amsterdam,
pp. 215–240.
[109] Kovalenko, I. N. (1994) Rare events in queueing systems – a survey.
Queueing Systems 16, 1–49.
[110] Kovalenko, I. N., Kuznetsov, N. Y., and Pegg, P. A. (1997) Mathematical
Theory of Reliability of Time Dependent Systems with Practical
Applications. Wiley, New York.
[111] Kovalenko, I. N., Kuznetsov, N. Y., and Shurenkov, V. M. (1996) Models
of Random Processes. CRC Press, London.
[112] Kozlov, V. V. (1978) A limit theorem for a queueing system. Theory of
Probability and its Application 23, 182–187.
[113] Kuo, W. and Kuo, Y. (1983) Facing the headaches of early failures:
a state-of-the-art review of burn-in decisions. Proceedings of the IEEE
71, 1257–1266.
[114] Lam, T. and Lehoczky, J. (1991) Superposition of renewal processes.
Adv. Appl. Prob. 23, 64–85.
[115] Last, G. and Brandt, A. (1995) Marked Point Processes on the Real
Line - The Dynamic Approach. Springer, New York.
[116] Last, G. and Szekli, R. (1998) Stochastic comparison of repairable sys-
tems. J. Appl. Prob. 35, 348–370.
[117] Last, G. and Szekli, R. (1998) Time and Palm stationarity of repairable
systems. Stoch. Proc. Appl., to appear.
[118] Leemis, L. M. and Beneke, M. (1990) Burn-in models and methods: a
review. IIE Transactions 22, 172–180.
[119] Lehmann, A. (1998) Boundary crossing probabilities of Poisson counting
processes with general boundaries. In: Kahle, W., Collani, E., Franz, J.,
and Jensen, U. (eds.): Advances in Stochastic Models for Reliability,
Quality and Safety. Birkhäuser, Boston, pp. 153–166.
[120] Marcus, R. and Blumenthal, S. (1974) A sequential screening procedure.
Technometrics 16, 229–234.
[121] Marshall, A. W. and Olkin, I. (1967) A multivariate exponential distri-
bution. J. Amer. Stat. Ass. 62, 30–44.
[122] Métivier, M. (1982) Semimartingales, a Course on Stochastic Processes.
De Gruyter, Berlin.
[123] Müller, A. and Stoyan, D. (2002) Comparison Methods for Stochastic Models
and Risks. John Wiley & Sons, New York.
[124] Natvig, B. (1990) On information-based minimal repair and the reduc-
tion in remaining system lifetime due to the failure of a specific module.
J. Appl. Prob. 27, 365–375.
[125] Natvig, B. (1988) Reliability: Importance of components. In: Johnson,
N. and Kotz, S. (eds.): Encyclopedia of Statistical Sciences, vol. 8, Wiley,
New York, pp. 17–20.
[126] Natvig, B. (1994) Multistate coherent systems. In: Johnson, N. and
Kotz, S. (eds.): Encyclopedia of Statistical Sciences, vol. 5. Wiley, New
York.
[127] Nelsen, R. B. (2006) An Introduction to Copulas. Springer, New York.
[128] Osaki, S. (1985) Stochastic System Reliability Modeling. World Scien-
tific, Philadelphia.
[129] Phelps, R. (1983) Optimal policy for minimal repair. J. Opl. Res. Soc. 34,
425–427.
[130] Pierskalla, W. and Voelker, J. (1976) A survey of maintenance models:
The control and surveillance of deteriorating systems. Nav. Res. Log. Q.
23, 353–388.
[131] Puterman, M. L. (1994) Markov Decision Processes: Discrete Stochastic
Dynamic Programming. Wiley, New York.
[132] Rai, S. and Agrawal, D. P. (1990) Distributed Computing network reli-
ability. 2nd ed. IEEE Computer Soc. Press, Los Alamitos, California.
[133] Rogers, L. C. G. and Williams, D. (1994) Diffusions, Markov Processes and
Martingales, Vol. 1, 2nd ed. Wiley, Chichester.
[134] Rolski, T., Schmidli, H., Schmidt, V. and Teugels, J. (1999) Stochastic
Processes for Insurance and Finance. Wiley, Chichester.
[135] Ross, S. M. (1970) Applied Probability Models with Optimization Appli-
cations. Holden-Day, San Francisco.
[136] Ross, S. M. (1975) On the calculation of asymptotic system reliability
characteristics. In: Barlow, R. E., Fussell, J. B. and Singpurwalla, N. D.
(eds.) Fault Tree Analysis. Society for Industrial and Applied
Mathematics, SIAM, Philadelphia, PA.
[137] Schöttl, A. (1997) Optimal stopping of a risk reserve process with
interest and cost rates. J. Appl. Prob. 35, 115–123.
[138] Serfozo, R. (1980) High-level exceedances of regenerative and semi-
stationary processes. J. Appl. Prob. 17, 423–431.
[139] Shaked, M. and Shanthikumar, G. (1993) Stochastic Orders and their
Applications. Academic Press, Boston.
[140] Shaked, M. and Shanthikumar, G. (1991) Dynamic multivariate aging
notions in reliability theory. Stoch. Proc. Appl. 38, 85–97.
[141] Shaked, M. and Shanthikumar, G. (1986) Multivariate imperfect repair.
Oper. Res. 34, 437–448.
[142] Shaked, M. and Szekli, R. (1995) Comparison of replacement policies
via point processes. Adv. Appl. Prob. 27, 1079–1103.
[143] Shaked, M. and Zhu, H. (1992) Some results on block replacement poli-
cies and renewal theory. J. Appl. Prob. 29, 932–946.
[144] Sherif, Y. and Smith, M. (1981) Optimal maintenance models for sys-
tems subject to failure. A review. Nav. Res. Log. Q. 28, 47–74.
[145] Smith, M. (1998) Insensitivity of the k out of n system. Probability in
the Engineering and Informational Sciences, to appear.
[146] Smith, M. (1997) On the availability of failure prone systems. PhD thesis,
Erasmus University, Rotterdam.
[147] Smith, M., Aven, T., Dekker, R. and van der Duyn Schouten, F.A.
(1997) A survey on the interval availability of failure prone sys-
tems. In: Proceedings ESREL’97 conference, Lisbon, 17–20 June, 1997,
pp. 1727–1737.
[148] Solovyev, A.D. (1971) Asymptotic behavior of the time to the first
occurrence of a rare event. Engineering Cybernetics 9 (6), 1038–1048.
[149] Spizzichino, F. (1991) Sequential burn-in procedures. J. Statist. Plann.
Inference 29, 187–197.
[150] Srinivasan, S.K. and Subramanian, R. (1980) Probabilistic Analysis of
Redundant Systems. Lecture Notes in Economic and Mathematical Sys-
tems 175, Springer, Berlin.
[151] Stadje, W. and Zuckerman, D. (1991) Optimal maintenance strategies
for repairable systems with general degree of repair. J. Appl. Prob. 28,
384–396.
[152] Szász, D. (1977) A problem of two lifts. The Annals of Probability 5,
550–559.
[153] Szász, D. (1975) On the convergence of sums of point processes with
integer marks. In: Lewis, P. (ed.) Stochastic Point Processes. Wiley,
New York, pp. 607–615.
[154] Takács, L. (1957) On certain sojourn time problems in the theory of
stochastic processes. Acta Math. Acad. Sci. Hungar. 8, 169–191.
[155] Thompson, W. A. (1988) Point Process Models with Applications to
Safety and Reliability. Chapman and Hall, New York.
[156] Tijms, H. C. (1994) Stochastic Modelling and Analysis: A Computa-
tional Approach. Wiley, New York.
[157] Ushakov, I. A. (ed.) (1994) Handbook of Reliability Engineering. Wiley,
Chichester.
[158] Valdez-Flores, C. and Feldman, R. (1989) A survey of preventive
maintenance models for stochastically deteriorating single-unit systems. Nav.
Res. Log. Q. 36, 419–446.
[159] Van der Duyn Schouten, F. A. (1983) Markov Decision Processes with
Continuous Time Parameter. Math. Centre Tracts 164, Amsterdam.
[160] Van Heijden, M. and Schornagel, A. (1988) Interval uneffectiveness dis-
tribution for a k-out-of-n multistate reliability system with repair. Eur-
opean Journal of Operational Research 36, 66–77.
[161] Van Schuppen, J. (1977) Filtering, prediction and smoothing observa-
tions, a martingale approach. SIAM J. Appl. Math. 32, 552–570.
[162] Voina, A. (1982) Asymptotic analysis of systems with a continuous com-
ponent. Kibernetika 18, 516–524.
[163] Wendt, H. (1998) A model describing damage processes and resulting
first passage times. Research report, University of Magdeburg.
[164] Williams, D. (1991) Probability with Martingales. Cambridge University
Press, Cambridge.
[165] Yashin, A. and Arjas, E. (1988) A note on random intensities and con-
ditional survival functions. J. Appl. Prob. 25, 630–635.
[166] Yearout, R. D., Reddy, P., and Grosh, D. L. (1986) Standby redundancy
in reliability – a review. IEEE Trans. Reliability 35, 285–292.
Index

Accumulated failure rate, 36
Age replacement, 175
Alternating renewal process, 107, 161
Alternating renewal process, 14
Applications
  availability analysis of gas compression system, 162
  availability analysis of gas compression system, 13
  reliability analysis of a nuclear power plant, 11
Associated variables, 30
Asymptotic results
  backward recurrence time, 113
  compound Poisson process, 152
  distribution of number of failures, 113, 125, 136
  distribution of time to failure, 126
  downtime distribution, 119, 145
  downtime distribution, interval, 153
  forward recurrence time, 113
  highly available systems, 127
  mean number of failures, 122
  multistate monotone system, 162
  number of failures, 116
  parallel system, 139
  series system, 142
Availability, 106
  bound, 109, 114
  demand availability, 160
  interval (un)availability, 106
  interval reliability, 106
  limiting (un)availability, 109, 120, 168
  long run average, 121
  point availability, 106, 108, 120
  steady-state (un)availability, 109, 120
  throughput availability, 160
Availability, 8
Backward recurrence time, 108, 113, 278, 279
Binomial distribution, 22
Birnbaum’s measure, 28
Birth and death process, 168
Bivariate exponential distribution, 197
Blackwell’s theorem, 278
Block replacement, 177
Bounded in Lp, 254
Bridge structure, 25
Brownian motion, 5, 67, 71
Burn-in, 202
Cadlag, 254
Central limit theorem, 280
Change of time, 267
Closure theorem, 38
Coefficient of variation, 126, 137, 154, 163, 167, 171
Coherent system, 20
Common mode failures, 30
Compensator, 62
Complex system
  binary monotone system, 2, 17
  hazard rate process, 73
  multistate monotone system, 31
Complex systems
  Copula models, 42

T. Aven and U. Jensen, Stochastic Models in Reliability, Stochastic Modelling and Applied Probability 41, DOI 10.1007/978-1-4614-7894-2, © Springer Science+Business Media New York 2013

Compound Poisson process, 4, 67, 152
Concordant, 46
Conditional expectation, 249
Control limit rule, 194
Copula, 42
  Archimedean, 49
Counting process, 7, 62, 114
  compensator, 62
  intensity, 63
  predictable intensity, 64
Cox process, 92
Critical component, 29
Critical path vector, 232
Cut set, 20
Cut vector, 32
Damage models, 3
Decreasing
  (a,b)-decreasing, 188
Delay time model, 215
Delayed renewal process, 281
Demand availability, 160
Demand rate, 160
Dependence structure, 43
Dependent components, 30
  failure rate process, 73
  optimal replacement, 197
DFR (Decreasing Failure Rate), 5, 35
DFRA (Decreasing Failure Rate Average), 36
Discounted cost, 195
Doob–Meyer decomposition, 263
Downtime
  distribution bounds, 118
  distribution given failure, 145
  distribution of the ith failure, 149
  distribution, interval, 118, 151
  mean, interval, 116
  steady-state distribution, 145
Elementary renewal theorem, 277
Equilibrium distribution, 113, 279
Erlang distribution, 163
Expectation, 247
Exponential distribution, 131
  asymptotic limit, 127
  mean number of system failures, 121
  parallel system, 139
  regenerative process, 124
  renewal density, 114
  standby system, 169
  unavailability bound, 115
Exponential formula, 80
Factoring algorithm, 25
Failure rate, 1, 6, 12, 14, 26, 36, 64
  accumulated, 36
  process, 6, 65
  process, monotone, 77
  system, 123
Filtration, 57, 255
  complete, 255
  subfiltration, 69
Finite variation, 266
Flow network, 32
Forward recurrence time, 108, 113, 146, 278, 279
Gas compression system, 13, 162
General repair strategy, 207
Harvesting problem, 182
Hazard function (cumulative), 1
Hazard rate, 1, 64
Hazard rate process, 70
IFR (Increasing Failure Rate), 3, 5, 35
IFR closure theorem, 40
IFRA (Increasing Failure Rate Average), 3, 36
IFRA closure theorem, 39
Inclusion–exclusion method, 23, 33
Increasing
  (a,b)-increasing, 188
Independence, 247
Indicator process, 70
Indistinguishable, 254
Infinitesimal generator, 66
Infinitesimal look ahead, 181
Information levels, 4
  change of, 78
Information-based replacement, 194
Innovation martingale, 63
Inspection, 229
Integrability, 58
Integrable, 254
Intensity, 63
  marked point process, 83
Interval (un)availability, 106
Interval reliability, 106, 110, 129
Inverse Gaussian distribution, 3, 5
k-out-of-n system, 19
  reliability, 22
Key renewal theorem, 277
Lp-space, 248
Laplace transform, 128, 141, 274
Lifetime distribution, 1, 26, 34
Long run average cost, 195
Lost throughput distribution, 159
Maintenance, 7
Marginal cost analysis, 179
Marked point process, 4, 81
Markov modulated repair process, 208
Markov process, 66
  pure jump process, 66
Markov theory, 168
Markov modulated Poisson process, 65
Marshall-Olkin distribution, 52
Martingale, 59, 259
  innovation, 63
  orthogonal, 265
  submartingale, 260
  supermartingale, 259
Minimal cut set, 20
Minimal cut vector, 32
Minimal path set, 20
Minimal path vector, 32
Minimal repairs, 90
  black box, 91
  optimal operating time, 208
  physical, 91
  statistical, 91
Modified renewal process, 281
Monotone case, 181
Monotone system, 2, 17, 231
  distribution of number of system failures, 125
  downtime distribution, 148
  k-out-of-n system, 19
  mean number of system failures, 121
  multistate, 158
  parallel system, 18
  point availability, 120
  series system, 142
  series system, 18
  steady-state availability, 120
Monte Carlo simulation, 10
MTTF (Mean Time To Failure), 8, 107
MTTR (Mean Time To Repair), 8, 14, 107
Multistate monotone system, 31, 158, 168
Multivariate point process, 62
NBU (New Better than Used), 37
NBUE (New Better than Used in Expectation), 37
Normal distribution, 114, 119, 136, 157
Number of system failures
  asymptotic results, 135
  distribution, 109, 125
  limiting mean, 116
  mean, 121
  standby system, 171
Number of system failures
  mean, 109
NWU (New Worse than Used), 37
NWUE (New Worse than Used in Expectation), 37
Optimal replacement, 9
  age replacement, 175
  block replacement, 177
  complex system, 194
  general repair strategy, 207
Optimal stopping problem, 180
Optimization criterion, 180
Optional Sampling, 262
Optional sampling theorem, 67
Parallel system, 139
  downtime distribution, 146
  downtime distribution of first failure, 150
  downtime distribution, interval, 153
  repair constraints, 165
Parallel system, 6, 18
  optimal replacement, 9
  reliability, 22
Partial information, 197, 208
Path set, 20
Path vector, 32
Performance measures, 14, 105, 168
Phase-type distribution, 125, 163
PLOD (positive lower orthant dependent), 46
Point process, 62
  compound, 87
  marked point process, 81
  multivariate, 62
Poisson approximation, 8, 125
Poisson distribution, 136, 143
Poisson process, 4, 65
  doubly stochastic, 92
  Markov modulated, 92
  nonhomogeneous, 92
Predictable
  variation, 264
Predictable
  projection, 63
Predictable
  intensity, 64
Predictable process, 58, 256
Preventive replacement, 175
Probability space, 57, 246
Product rule, 268
Progressively measurable, 256
Progressively measurable process, 58
PUOD (positive upper orthant dependent), 46
Quadratic variation, 265
Random variable, 247
Regenerative process, 124, 167, 281
Regular conditional expectation, 252
Reliability, 21
Reliability block diagram, 18
Reliability engineer, 9
Reliability importance measure, 27, 34
  Birnbaum’s measure, 28
  Improvement potential, 28
  Vesely–Fussell’s measure, 28
Reliability modeling, 9
Renewal density, 114, 121, 148
Renewal density theorem, 278
Renewal equation, 275
Renewal function, 274
Renewal process, 273
  alternating, 107
  delayed, 281
  modified, 281
Renewal process, 64
  alternating, 82
  intensity, 65
Renewal reward process, 280
Repair models, 81
  minimal repairs, 90
  varying degrees, 97
Repair replacement model, 207
Replacement model, 175
Risk process, 98
Ruin time, 99
Safety constraint, 216
Safety system, 229
Semi-Markov theory, 169
Semimartingale, 267
  change of filtration, 69
  product rule, 68
  semimartingale representation, 59
  smooth semimartingale (SSM), 59
  transformations, 68
Series system, 13, 18
  lifetime distribution, 27
  reliability, 21
Shock model, 185, 193
Shock models, 86
Shock process, 4
Simpson’s paradox, 4
Standby system, 166
  ample repair facilities, 172
  one repair facility, 169
Stationary process, 119
Steady-state, 119
Stochastic comparison, 40
Stochastic order, 46
Stochastic process
  predictable, 58
  progressively measurable, 58
Stopping problem, 183
Stopping time, 59, 257
  predictable, 72
  totally inaccessible, 72
Structural importance, 29
Structure function, 18
Subadditive, 49
Subfiltration, 69, 190
Submartingale, 59
Supermartingale, 59
Survival probability, 1, 13
System failure rate, 123, 127, 136
System failures, 85
System reliability, 21
Throughput availability, 160
Time to system failure
  asymptotic distribution, 126
  parallel system, 140
Unavailability, 109
Uniformly integrable, 58, 254
Usual conditions, 255
Wiener process, 3, 5
