Terje Aven · Uwe Jensen

Stochastic Models in Reliability

Second Edition

Stochastic Modelling and Applied Probability (formerly: Applications of Mathematics)

Terje Aven, University of Stavanger, Stavanger, Norway
Uwe Jensen, Fakultät Naturwissenschaften, Institut für Angewandte Mathematik und Statistik, Universität Hohenheim, Stuttgart, Germany

ISSN 0172-4568
ISBN 978-1-4614-7893-5
ISBN 978-1-4614-7894-2 (eBook)
DOI 10.1007/978-1-4614-7894-2
Springer New York Heidelberg Dordrecht London
Library of Congress Control Number: 2013942488

Preface to the Second Edition
In this second edition of the book, two major topics have been added to the
original version. The first one relates to copula models (Sect. 2.3), which are
used to study the effects of structural dependencies on system reliability. We
believe that an introduction to the fundamental ideas and concepts of copula
models is important when reviewing basic reliability theory. The second new
topic we have included is maintenance optimization models under constraints
(Sect. 5.5). These models have been addressed in some recent publications to
meet the demand for models that adequately balance economic criteria and
safety. We consider two specific models. The first is the so-called delay time
model where the aim is to determine optimal inspection intervals minimiz-
ing the expected discounted costs under some safety constraints. The second
model is also about optimal inspection, but here the system is represented
by a monotone (coherent) structure function. In addition, we have made a
number of minor adjustments to increase precision and we have also corrected
misprints.
We received positive feedback on the first edition from friends and colleagues. Their hints and suggestions have been incorporated into this second
edition. We thank all who contributed, by whatever means, to preparing the
new edition.
Preface to the First Edition
two appendices have been included summarizing the mathematical basis and
some key results. Appendix A gives a general introduction to probability and
stochastic process theory, whereas Appendix B gives a presentation of results
from renewal theory. Appendix A also summarizes basic notation and symbols.
Although conceived mainly as a research monograph, this book can also
be used for graduate courses and seminars. It primarily addresses probabilists
and statisticians with research interests in reliability. But at least parts of it
should be accessible to a broader group of readers, including operations re-
searchers and engineers. A solid basis in probability and stochastic processes
is required, however. In some countries many operations researchers and reli-
ability engineers now have a rather comprehensive theoretical background in
these topics, so that it should be possible to benefit from reading the more
sophisticated theory presented in this book. To bring the reliability field for-
ward, we believe that more operations researchers and engineers should be
familiar with the probabilistic framework of modern reliability theory. Chap-
ters 1 and 2 and the first part of Chaps. 4 and 5 are more elementary and do
not require the more advanced theory of stochastic processes.
References are kept to a minimum throughout, but readers are referred to
the bibliographic notes following each chapter, which give a brief review of
the material covered and related references.
Acknowledgments
We express our gratitude to our institutions, the Stavanger University College,
the University of Oslo, and the University of Ulm, for providing a rich intel-
lectual environment, and facilities indispensable for the writing of this book.
The authors are grateful for the financial support provided by the Norwegian
Research Council and Deutscher Akademischer Austauschdienst. We would
also like to acknowledge our indebtedness to Jelte Beimers, Jørund Gåsemyr,
Harald Haukås, Tina Herberts, Karl Hinderer, Günter Last, Volker Schmidt,
Richard Serfozo, Marcel Smith, Fabio Spizzichino and Rune Winther for mak-
ing helpful comments and suggestions on the manuscript. Thanks for TEXnical
support go to Jürgen Wiedmann.
We especially thank Bent Natvig, University of Oslo, for the great deal
of time and effort he spent reading and preparing comments. Thanks also go
to the three reviewers for providing advice on the content and organization
of the book. Their informed criticism motivated several refinements and im-
provements. Of course, we take full responsibility for any errors that remain.
We also acknowledge the editing and production staff at Springer for their
careful work. In particular, we appreciate the smooth cooperation of John
Kimmel.
Contents

1 Introduction
1.1 Lifetime Models
1.1.1 Complex Systems
1.1.2 Damage Models
1.1.3 Different Information Levels
1.1.4 Simpson's Paradox
1.1.5 Predictable Lifetime
1.1.6 A General Failure Model
1.2 Maintenance
1.2.1 Availability Analysis
1.2.2 Optimization Models
1.3 Reliability Modeling
1.3.1 Nuclear Power Station
1.3.2 Gas Compression System
References
Index
1 Introduction
This chapter gives an introduction to the topics covered in this book: failure
time models, complex systems, different information levels, maintenance and
optimal replacement. We also include a section on reliability modeling, where
we draw attention to some important factors to be considered in the modeling
process. Two real life examples are presented: a reliability study of a system
in a power plant and an availability analysis of a gas compression system.
For a lifetime T with distribution function F and density f, the failure rate is

λ(t) = f(t)/F̄(t),

with the survival function F̄(t) = 1 − F(t). The failure rate λ(t) measures the proneness to failure at time t in that λ(t)Δt ≈ P(T ≤ t + Δt | T > t) for small Δt. The (cumulative) hazard function is denoted by Λ,

Λ(t) = ∫_0^t λ(s) ds = − ln{F̄(t)}.

The inversion formula

F̄(t) = exp{−Λ(t)}

establishes the link between the cumulative hazard and the survival function.
Modeling in reliability theory is mainly concerned with additional information
about the state of a system, which is gathered during the operating time of
the system. This additional information leads to updated predictions about
proneness to system failure. There are many ways to introduce such additional
information into the model. In the following sections some examples of how to
introduce additional information and how to model the lifetime T are given.
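The link F̄(t) = exp{−Λ(t)} between the cumulative hazard and the survival function can be checked numerically. A minimal sketch, assuming a Weibull lifetime with shape 2 and scale 1 (our illustrative choice, not from the text), for which λ(t) = 2t and F̄(t) = e^{−t²}:

```python
import math

# Illustrative Weibull(shape=2, scale=1) lifetime: failure rate lambda(t) = 2t,
# so the survival function should come out as exp(-t^2).
def failure_rate(t):
    return 2.0 * t

def cumulative_hazard(t, steps=10000):
    # trapezoidal integration of the failure rate over [0, t]
    h = t / steps
    s = 0.5 * (failure_rate(0.0) + failure_rate(t))
    for k in range(1, steps):
        s += failure_rate(k * h)
    return s * h

def survival(t):
    # Fbar(t) = exp(-Lambda(t))
    return math.exp(-cumulative_hazard(t))

for t in (0.5, 1.0, 2.0):
    assert abs(survival(t) - math.exp(-t * t)) < 1e-6
```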
Example 1.1. As a simple example the following system with three compo-
nents is considered, which is intact if component 1 and at least one of the
components 2 or 3 are intact:
[Reliability block diagram: component 1 in series with the parallel structure of components 2 and 3.]
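The intact/failed logic of this three-component system can be written as a structure function, anticipating the notation of Chap. 2; a minimal sketch:

```python
def phi(x1, x2, x3):
    """Structure function of Example 1.1: component 1 in series with
    the parallel structure of components 2 and 3."""
    return x1 * (1 - (1 - x2) * (1 - x3))

# the system is intact iff component 1 works and at least one of 2, 3 works
assert phi(1, 1, 0) == 1
assert phi(1, 0, 1) == 1
assert phi(1, 0, 0) == 0
assert phi(0, 1, 1) == 0
```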
Additional information about the lifetime T can also be introduced into the
model in a quite different way. If the state or damage of the system at time
t ∈ R+ can be observed and this damage is described by a random variable Xt ,
then the lifetime of the system may be defined as
T = inf{t ∈ R+ : Xt ≥ S},
i.e., as the first time the damage hits a given level S. Here S can be a constant or, more generally, a random variable independent of the damage process. Some
examples of damage processes X = (Xt ) of this kind are described in the
following subsections.
Wiener Process
The damage process is a Wiener process with positive drift starting at 0 and
the failure threshold S is a positive constant. The lifetime of the system is
then known to have an inverse Gaussian distribution. Models of this kind are
especially of interest if one considers different environmental conditions under
which the system is working, as, for example, in so-called burn-in models.
An accelerated aging caused by additional stress or different environmental
conditions can be described by a change of time. Let τ : R+ → R+ be an
increasing function. Then Zt = Xτ (t) denotes the actual observed damage.
The time transformation τ drives the speed of the deterioration. One possible
way to express different stress levels in time intervals [ti , ti+1 ), 0 = t0 < t1 <
. . . < tk , i = 0, 1, . . . , k − 1, k ∈ N, is the choice
τ(t) = ∑_{j=0}^{i−1} β_j (t_{j+1} − t_j) + β_i (t − t_i),   t ∈ [t_i, t_{i+1}),   β_v > 0.
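The piecewise-linear time transformation above can be implemented directly; a sketch with illustrative interval endpoints and stress levels β_i (the numerical values are our own):

```python
# Time transformation tau for stress levels beta_0, ..., beta_{k-1} on the
# intervals [t_0, t_1), ..., [t_{k-1}, t_k); all numbers are illustrative.
def make_tau(ts, betas):
    # ts = [t_0, t_1, ..., t_k] with t_0 = 0; betas = [beta_0, ..., beta_{k-1}]
    def tau(t):
        acc = 0.0
        for i in range(len(betas)):
            if t < ts[i + 1]:
                return acc + betas[i] * (t - ts[i])
            acc += betas[i] * (ts[i + 1] - ts[i])
        return acc  # beyond the last interval the accumulated value is returned
    return tau

tau = make_tau([0.0, 1.0, 3.0, 6.0], [1.0, 2.0, 0.5])
# tau is continuous and piecewise linear with slope beta_i on [t_i, t_{i+1})
assert tau(1.0) == 1.0          # beta_0 * (1 - 0)
assert tau(2.0) == 1.0 + 2.0    # plus beta_1 * (2 - 1)
assert tau(4.0) == 5.0 + 0.5    # 1 + beta_1 * 2, plus beta_2 * (4 - 3)
```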
Processes of this kind describe so-called shock processes where the system is
subject to shocks that occur from time to time and add a random amount
to the damage. The successive times of occurrence of shocks, Tn , are given
by an increasing sequence 0 < T1 ≤ T2 ≤ . . . of random variables, where the
inequality is strict unless Tn = ∞. Each time point Tn is associated with a
real-valued random mark Vn , which describes the additional damage caused by
the nth shock. The marked point process is denoted (T, V ) = (Tn , Vn ), n ∈ N.
From this marked point process the corresponding compound point process
X with
X_t = ∑_{n=1}^{∞} I(T_n ≤ t) V_n   (1.2)
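A hedged simulation sketch of such a shock process, taking Poisson shock times and exponentially distributed marks as illustrative choices (the text does not fix these distributions), which returns the first time the accumulated damage X_t reaches a level S:

```python
import random

random.seed(1)

def first_passage_time(rate, mean_mark, S, horizon=10_000.0):
    """Simulate shocks at Poisson(rate) times T_n with exponential marks V_n
    (illustrative choices) and return the first time sum of marks >= S."""
    t, damage = 0.0, 0.0
    while t < horizon:
        t += random.expovariate(rate)                  # next shock time
        damage += random.expovariate(1.0 / mean_mark)  # additional damage V_n
        if damage >= S:
            return t
    return float("inf")

# rough sanity check: the mean damage accumulated per unit time is
# rate * mean_mark = 1, so hitting S = 50 should take around 50 time units
times = [first_passage_time(rate=2.0, mean_mark=0.5, S=50.0) for _ in range(2000)]
avg = sum(times) / len(times)
assert 40.0 < avg < 60.0
```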
It was pointed out above in what way additional information can lead to a reli-
ability model. But it is also important to note that in one and the same model
different observation levels are possible, i.e., the amount of actual available
information about the state of a system may vary. The following examples
will show the effect of different degrees of information.
This paradox says that if one compares the death rates in two countries, say
A and B, then it is possible that the crude overall death rate in country A is
higher than in B although all age-specific death rates in B are higher than in A.
This can be transferred to reliability in the following way. Considering a two-
component parallel system, the failure rate of the system lifetime may increase
although the component lifetimes have decreasing failure rates. The following
proposition, which can be proved by some elementary calculations, yields an
example of this.
This example shows that it makes a great difference whether only the sys-
tem lifetime can be observed (aging property: IFR) or additional information
about the component lifetimes is available (aging property: DFR). The aging
property of the system lifetime of a complex system depends not only on
the joint distribution of the component lifetimes but also, of course, on
the structure function. Instead of a two-component parallel system, consider
a series system where the component lifetimes have the same distributions as
in Proposition 1.2. Then the failure rate of Tser = T1 ∧ T2 decreases, whereas
Tpar = T1 ∨ T2 has an IFR.
[Fig. 1.1. Predictable stopping time: the damage process X_t increases toward the level S; τ_n is the first time X_t reaches S − 1/n, and the sequence (τ_n) announces the hitting time T.]
The general failure model considered in Chap. 3 uses elements of the theory
of stochastic processes and particularly some martingale theory. Some of the
readers might wonder whether sophisticated theory like this is necessary and
suitable in reliability, a domain with engineering applications. Instead of a
comprehensive justification we give a motivating example.
needs knowledge about the random component lifetimes Ti . Now the failure
rate λt is a stochastic process and the information about the status of the
components at time t is represented by a filtration. The model allows for
changing the information level and the ordinary failure rate can be derived
from λt on the lowest level possible, namely no information about the com-
ponent lifetimes.
extending the classical failure rate λ(t) of the system. To apply the set-up,
focus should be placed on the failure rate process (λt ). When this process
has been determined, the model has basically been established. Using the
above interpretation of the failure rate process, it is in most cases rather
straightforward to determine its form. The formal proofs are, however, often
quite difficult.
If we go one step further and consider a model in which the system can
be repaired or replaced at failure, then attention is paid to the number Nt of
system failures in [0, t]. Given certain conditions, the counting process N =
(Nt ), t ∈ R+ , has an “intensity” that as an extension of the failure rate process
can be derived as the limit of conditional expectations
λ_t = lim_{h→0+} (1/h) E[N_{t+h} − N_t | F_t],
where Ft denotes the history of the system up to time t. Hence we can interpret
λt as the (conditional) expected number of system failures per unit of time at
time t given the available information at that time. Chapter 3 includes several
special cases that demonstrate the broad spectrum of potential applications.
1.2 Maintenance
To prolong the lifetime, to increase the availability, and to reduce the prob-
ability of an unpredictable failure, various types of maintenance actions are
being implemented. The most important maintenance actions include:
• Preventive replacements of parts of the system or of the whole system
• Repairs of failed units
If the system or parts of it are repaired or replaced when failures occur, the
problem is to characterize the performance of the system. Different measures
of performance can be defined as, for example,
• The probability that the system is functioning at a certain point in time
(point availability)
• The mean time to the first failure of the system
• The probability distribution of the downtime of the system in a given time
interval.
Traditionally, focus has been placed on analyzing the point availability and
its limit (the steady-state availability). For a single component, the steady-
state formula is given by M T T F/(M T T F + M T T R), where M T T F and
M T T R represent the mean time to failure and the mean time to repair (mean
repair time), respectively. The steady-state probability of a system compris-
ing several components can then be calculated using the theory of complex
(monotone) systems.
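The steady-state formula can be sketched in code, together with the product rule for a series system of independent components (an elementary special case of the monotone-system theory mentioned above):

```python
def steady_state_availability(mttf, mttr):
    """Long-run point availability of a single repairable component:
    MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

def series_availability(components):
    """For a series system of independent components, the steady-state
    availability is the product of the component availabilities."""
    a = 1.0
    for mttf, mttr in components:
        a *= steady_state_availability(mttf, mttr)
    return a

assert steady_state_availability(99.0, 1.0) == 0.99
comps = [(99.0, 1.0), (49.0, 1.0)]  # illustrative MTTF/MTTR pairs
assert abs(series_availability(comps) - 0.99 * 0.98) < 1e-12
```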
Often, performance measures related to a time interval are used. Such
measures include the distribution of the number of system failures, and the
distribution of the downtime of the system, or at least the mean of these dis-
tributions. Measures related to the number of system failures are important
from an operational and safety point of view, whereas measures related to the downtime are more interesting from a production point of view. Information about the probability of having a long downtime in a time interval
is important for assessing the economic risk related to the operation of the
system. For production systems, it is sometimes necessary to use a multistate
representation of the system and some of its components, to reflect different
production levels.
Compared to the steady-state availability, it is of course more complicated
to compute the performance measures related to a time interval, in particu-
lar the probability distributions of the number of system failures and of the
downtime. Using simplifications and approximations, it is however possible to
establish formulas that can be used in practice. For highly available systems,
a Poisson approximation for the number of system failures and a compound
Poisson approximation for the downtime distribution are useful in many cases.
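As a hedged Monte Carlo illustration of the Poisson approximation (not the book's derivation), one can simulate a highly available alternating renewal process with exponential up and repair times, our illustrative choice, and check that the failure count over a long interval behaves approximately like a Poisson variable:

```python
import random

random.seed(2)

def count_failures(t_end, mttf, mttr):
    """Number of failures in [0, t_end] of an alternating renewal process
    with exponential up times and repair times (illustrative choice)."""
    t, n = 0.0, 0
    while True:
        t += random.expovariate(1.0 / mttf)   # operating period
        if t > t_end:
            return n
        n += 1
        t += random.expovariate(1.0 / mttr)   # repair period

# highly available component: failures in [0, t] are approximately Poisson
# with mean t / (MTTF + MTTR)
t_end, mttf, mttr = 10_000.0, 100.0, 1.0
counts = [count_failures(t_end, mttf, mttr) for _ in range(500)]
mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / len(counts)
expected = t_end / (mttf + mttr)
assert abs(mean - expected) / expected < 0.05
assert 0.7 < var / mean < 1.3   # Poisson-like: variance roughly equals mean
```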
These topics are addressed in Chap. 4, which gives a detailed analysis
of the availability of monotone systems. Emphasis is placed on performance
measures related to a time interval. Sufficient conditions are given for when
the Poisson and the compound Poisson distributions are asymptotic limits.
Example 1.4. We resume Example 1.3, p. 6, and consider the simple two-
component parallel system with independent Exp(αi ) distributed component
lifetimes Ti , i = 1, 2, with the system lifetime T = T1 ∨ T2 . We now allow
preventive replacements at costs of c units to be carried out before failure,
and a replacement upon system failure at cost c + k. It seems intuitive that
T1 ∧ T2 , the time of the first component failure, should be a candidate for an
optimal replacement time with respect to some cost criterion, at least if c is
“small” compared to k. How can we prove that this random time T1 ∧ T2 is
optimal among all possible replacement times? How can we characterize the
set of all possible replacement times?
These questions can only be answered in the framework of martingale
theory and are addressed in Chap. 5.
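Pending the martingale treatment of Chap. 5, the two policies of Example 1.4 can at least be compared by the elementary renewal reward argument (long-run cost rate = expected cycle cost / expected cycle length); the cost and rate values below are illustrative:

```python
# Long-run cost rates for the two policies of Example 1.4 with independent
# Exp(alpha_i) component lifetimes, via the renewal reward theorem.
def cost_rate_preventive(c, a1, a2):
    # replace at T1 ^ T2 ~ Exp(a1 + a2): cycle cost c, mean cycle 1/(a1 + a2)
    return c * (a1 + a2)

def cost_rate_corrective(c, k, a1, a2):
    # replace at system failure T1 v T2: cycle cost c + k,
    # E[T1 v T2] = 1/a1 + 1/a2 - 1/(a1 + a2)
    mean_T = 1.0 / a1 + 1.0 / a2 - 1.0 / (a1 + a2)
    return (c + k) / mean_T

# illustrative numbers: preventive replacement wins when the preventive
# cost c is small relative to the extra failure cost k
c, k, a1, a2 = 1.0, 10.0, 1.0, 1.0
preventive = cost_rate_preventive(c, a1, a2)
corrective = cost_rate_corrective(c, k, a1, a2)
assert preventive < corrective
```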
One can imagine that thousands of models (and papers) can be created by
combining the different types of lifetime models with different maintenance
actions. The general optimization framework formulated in Chap. 5 incorpo-
rates a number of such models. Here the emphasis is placed on determining
the optimal replacement time of a deteriorating system. The framework is
based on the failure model of Chap. 3, which means that rather complex and
very different situations can be studied. Special cases include monotone sys-
tems, (minimal) repair models, and damage processes, with different informa-
tion levels.
[Fig. 1.3. Reliability block diagram of the RHR system: valves V1 and V2 in parallel, in series with the components LP, RHR, SWS, CCWS, HP, and V3.]

When the system is needed, it is possible that single components
or the whole system fails to start on demand. In this case, to calculate the
probability of a failure on demand, we have to take all components in the
reliability block diagram into consideration. Two of the valves, V1 and V2 ,
are in parallel. Therefore, the RHR system fails on demand if either V1 and
V2 fail or at least one of the remaining components LP,. . . , HP, V3 fails.
We assume that the time from a check of a component until a failure in the
idle state is exponentially distributed. The failure rates are λv1 , λv2 , λv3
for the valves and λp1 , λp2 , λp3 , λp4 , λh for the other components. If the
check (inspection or operating period) dates t time units back, then the
probability of a failure on demand is given by
1 − {1 − (1 − e^{−λ_{v1} t})(1 − e^{−λ_{v2} t})} e^{−(λ_{p1}+λ_{p2}+λ_{p3}+λ_{p4}+λ_h+λ_{v3}) t}.
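The failure-on-demand formula translates directly into code; the rate values below are illustrative, not from the study:

```python
import math

def prob_failure_on_demand(t, lv1, lv2, lv3, lp1, lp2, lp3, lp4, lh):
    """Probability that the RHR system fails on demand t time units after
    the last check, following the formula above."""
    parallel_valves_ok = 1 - (1 - math.exp(-lv1 * t)) * (1 - math.exp(-lv2 * t))
    series_ok = math.exp(-(lp1 + lp2 + lp3 + lp4 + lh + lv3) * t)
    return 1 - parallel_valves_ok * series_ok

# sanity checks: no failure immediately after a check, and the probability
# increases with the time since the last check (illustrative failure rates)
rates = dict(lv1=1e-4, lv2=1e-4, lv3=1e-4,
             lp1=1e-5, lp2=1e-5, lp3=1e-5, lp4=1e-5, lh=1e-5)
assert prob_failure_on_demand(0.0, **rates) == 0.0
assert prob_failure_on_demand(100.0, **rates) < prob_failure_on_demand(1000.0, **rates)
```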
This example outlines various aspects of the modeling process related to the
design of a gas compression system.
A gas producer was designing a gas production system, and one of the most
critical decisions was related to the design of the gas compression system.
At a certain stage of the development, two alternatives for the compression
system were considered:
Compressor–turbine 10 12
Cooler 2 50
Scrubber 1 20
Uncertainty related to the input quantities used was not considered. Ins-
tead, sensitivity studies were performed with the purpose of identifying how
sensitive the results were with respect to variations in input parameters.
Of the results obtained, we include the following examples:
• The gas train is down 2.7% of the time in the long run.
• For alternative (i), the average system failure rate, i.e., the average number of system failures per year, equals 13. For alternative (ii) a distinction is made between failures resulting in production below 100% and below 50%. The
average system failure rates for these levels are approximately 26 and 0.7,
respectively. Alternative (ii) has a probability of about 50% of having one
or more complete shutdowns during a year.
• The mean lost production equals 2.7% for both alternatives. The proba-
bility that the lost production during 1 year is more than 4% of demand is
approximately equal to 0.16 for alternative (i) and 0.08 for alternative (ii).
This last result is based on assumptions concerning the variation of the
repair times. Refer to Sect. 4.7.1, p. 162, where the models and methods used
to compute these measures are summarized.
The results obtained, together with an economic analysis, gave the man-
agement a good basis for choosing the best alternative.
Structural Properties
[Fig. 2.1. Reliability block diagram of a series system with components 1, 2, . . . , n.]
(The term binary variable refers to a variable taking on the values 0 or 1.)
Similarly, the binary variable Φ indicates the state of the system:

Φ = 1 if the system is in the functioning state,
Φ = 0 if the system is in the failure state.
We assume that
Φ = Φ(x),
where x = (x1 , x2 , . . . , xn ), i.e., the state of the system is determined com-
pletely by the states of the components. We refer to the function Φ(x) as the
structure function of the system, or simply the structure. In the following we
will often use the phrase structure in place of system.
A series structure can be illustrated by the reliability block diagram in Fig. 2.1.
“Connection between a and b” means that the system functions.
Example 2.2. A system that is functioning if and only if at least one compo-
nent is functioning is called a parallel system. The corresponding reliability
block diagram is shown in Fig. 2.2.
The structure function is given by

Φ(x) = 1 − (1 − x1)(1 − x2) · · · (1 − xn) = 1 − ∏_{i=1}^{n} (1 − xi).   (2.1)
The expression on the right-hand side in (2.1) is often written ∐_{i=1}^{n} xi.

[Fig. 2.2. Reliability block diagram of a parallel system with n components.]

Thus, a parallel system with two components has structure function

Φ(x) = 1 − (1 − x1)(1 − x2) = ∐_{i=1}^{2} xi,

which we also write as Φ(x) = x1 ∐ x2.
Example 2.3. A system that is functioning if and only if at least k out of n
components are functioning is called a k-out-of-n system. A series system is an
n-out-of-n system, and a parallel system is a 1-out-of-n system. The structure
function for a k-out-of-n system is given by
Φ(x) = 1 if ∑_{i=1}^{n} xi ≥ k, and Φ(x) = 0 if ∑_{i=1}^{n} xi < k.
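The structure functions introduced so far can be sketched as follows, with the k-out-of-n function containing series (k = n) and parallel (k = 1) systems as special cases:

```python
def phi_series(x):
    """Series system: works iff every component works."""
    return 1 if all(x) else 0

def phi_parallel(x):
    """Parallel system: works iff at least one component works."""
    return 1 if any(x) else 0

def phi_k_out_of_n(x, k):
    """k-out-of-n system: works iff at least k components work."""
    return 1 if sum(x) >= k else 0

x = (1, 0, 1, 1)
# a series system is n-out-of-n, a parallel system is 1-out-of-n
assert phi_series(x) == phi_k_out_of_n(x, len(x))
assert phi_parallel(x) == phi_k_out_of_n(x, 1)
assert phi_k_out_of_n(x, 3) == 1 and phi_k_out_of_n(x, 4) == 0
```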
[Fig. 2.3. Reliability block diagram of a 2-out-of-3 system, with minimal path sets {1, 2}, {1, 3}, and {2, 3}.]
The vector (·i , x) denotes a state vector where the state of the ith
component is equal to 1 or 0; (1i , x) denotes a state vector where the state
of the ith component is equal to 1, and (0i , x) denotes a state vector where
the state of the ith component is equal to 0; the state of component j, j = i,
equals xj . If we want to specify the state of some components, say i ∈ J
(J ⊂ {1, 2, . . . , n}), we use the notation (·J , x). For example, (0J , x) denotes
the state vector where the states of the components in J are all 0 and the
state of component i, i ∉ J, equals xi.
Definition 2.6. (Minimal cut set). A cut set K is a set of components that
by failing causes the system to fail, i.e., Φ(0K , 1) = 0. A cut set is minimal
if it cannot be reduced without losing its status as a cut set.
Example 2.8. Consider the reliability block diagram presented in Fig. 2.4. The
minimal cut sets of the system are: {1, 5}, {4, 5}, {1, 2, 3}, and {2, 3, 4}. Note
that, for example, {1, 4, 5} is a cut set, but it is not minimal. The minimal
path sets are {1, 4}, {2, 5}, and {3, 5}. In the following we will refer to this
example as the “5-components example.”
[Fig. 2.4. Reliability block diagram of the 5-components example: the series pair {1, 4} in parallel with the module ((2 or 3) in series with 5).]
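Minimal cut sets can be found by brute force for small systems. A sketch for the 5-components example, assuming our reading of Fig. 2.4 (the series pair {1, 4} in parallel with the module ((2 or 3) in series with 5)), which reproduces the four minimal cut sets listed in Example 2.8:

```python
from itertools import combinations

def phi(x):
    """Structure function of the 5-components example, as we read Fig. 2.4:
    series pair {1, 4} in parallel with ((2 or 3) in series with 5)."""
    x1, x2, x3, x4, x5 = x
    upper = x1 * x4
    lower = (1 - (1 - x2) * (1 - x3)) * x5
    return 1 - (1 - upper) * (1 - lower)

def minimal_cut_sets(phi, n):
    """Enumerate component sets whose joint failure fails the system and
    that contain no smaller cut set."""
    cuts = []
    for size in range(1, n + 1):
        for K in combinations(range(1, n + 1), size):
            x = tuple(0 if i in K else 1 for i in range(1, n + 1))
            if phi(x) == 0 and not any(set(c) <= set(K) for c in cuts):
                cuts.append(K)
    return cuts

cuts = minimal_cut_sets(phi, 5)
assert sorted(cuts) == [(1, 2, 3), (1, 5), (2, 3, 4), (4, 5)]
```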
Example 2.12. Consider again the reliability block diagram in Fig. 2.4. The
system can be viewed as a parallel structure of two independent modules: the
structure comprising the components 1 and 4, and the structure comprising
the components 2, 3, and 5. The reliability of the former structure equals p1 p4 ,
whereas the reliability of the latter equals (1 − (1 − p2)(1 − p3))p5. Thus the system reliability is given by

h = 1 − (1 − p1 p4){1 − (1 − (1 − p2)(1 − p3)) p5}.
and
g = P(⋃_{j=1}^{k} A_j).
Furthermore, let
w_1 = ∑_{j=1}^{k} P(A_j),
w_2 = ∑_{i<j} P(A_i A_j),
. . .
w_r = ∑_{1≤i_1<i_2<···<i_r≤k} P(⋂_{j=1}^{r} A_{i_j}).
w2 = q1 q4 q5 + q1 q2 q3 q5 + q1 q2 q3 q4 q5 + q1 q2 q3 q4 q5 + q2 q3 q4 q5 + q1 q2 q3 q4
= 2.2 · 10−6 .
There exist also other bounds and approximations for the system reliability.
For example, it can be shown that
1 − ∏_{j=1}^{k} (1 − ∏_{i∈K_j} q_i) = 1 − ∏_{j=1}^{k} ∐_{i∈K_j} p_i
is an upper bound for g, and a good approximation for small values of the
component unreliabilities qi ; see Barlow and Proschan [32], p. 35. This bound
is always as good as or better than w1 . In the following we sketch some alter-
native methods for reliability computation.
and by multiplying out the right-hand side of this expression, we can find
an exact expression of h (or g). As an illustration consider a 2-out-of-3 system. Then

Φ = (X1 ∐ X2) · (X1 ∐ X3) · (X2 ∐ X3).

Multiplying out and using X_i² = X_i, this reduces to

Φ = X1 X2 + X1 X3 + X2 X3 − 2 X1 X2 X3,

and hence

h = p1 p2 + p1 p3 + p2 p3 − 2 p1 p2 p3.
[Reliability block diagram: bridge structure with components 1 and 4 on the upper path, 2 and 5 on the lower path, and component 3 as the bridge.]
h = EΦ(X) = ∑_x Φ(x) P(X = x) = ∑_{x:Φ(x)=1} ∏_{i=1}^{n} p_i^{x_i} (1 − p_i)^{1−x_i}.
This method, however, is not suitable for larger systems, since the number of
terms in the sum can be extremely large, up to 2n − 1.
[Reliability block diagrams of the two conditional structures: given x3 = 1, the series connection of the parallel pairs {1, 2} and {4, 5}; given x3 = 0, the parallel connection of the series pairs {1, 4} and {2, 5}.]
These two structures are both of series–parallel form, and we see that

h(1_3, p) = (p1 ∐ p2)(p4 ∐ p5),
h(0_3, p) = p1 p4 ∐ p2 p5.
Thus a formula for the exact computation of h(p) is established. Note that it
was sufficient to perform only one pivotal decomposition in this case. If the
structure given x3 = 1 had not been in a series–parallel form, we would have
had to perform another pivotal decomposition, and so on.
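The pivotal decomposition h(p) = p3 h(1_3, p) + (1 − p3) h(0_3, p) for the bridge structure can be cross-checked against full enumeration; a sketch (the reliability values are illustrative):

```python
from itertools import product

def coproduct(a, b):
    # a "coproduct" b = 1 - (1 - a)(1 - b)
    return 1 - (1 - a) * (1 - b)

def phi_bridge(x1, x2, x3, x4, x5):
    """Bridge structure: 1-4 upper path, 2-5 lower path, 3 the bridge."""
    if x3 == 1:
        return coproduct(x1, x2) * coproduct(x4, x5)
    return coproduct(x1 * x4, x2 * x5)

def h_bridge(p):
    """System reliability by pivotal decomposition on component 3."""
    p1, p2, p3, p4, p5 = p
    h1 = coproduct(p1, p2) * coproduct(p4, p5)   # h(1_3, p)
    h0 = coproduct(p1 * p4, p2 * p5)             # h(0_3, p)
    return p3 * h1 + (1 - p3) * h0

# cross-check against brute-force enumeration of E[phi(X)]
p = (0.9, 0.8, 0.7, 0.95, 0.85)
h_enum = 0.0
for x in product((0, 1), repeat=5):
    prob = 1.0
    for xi, pi in zip(x, p):
        prob *= pi if xi else (1 - pi)
    h_enum += prob * phi_bridge(*x)
assert abs(h_bridge(p) - h_enum) < 1e-12
```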
Time Dynamics
The above theory can be applied to different situations, covering both rep-
airable and nonrepairable systems. As an example, consider a monotone sys-
tem in a time interval [0, t0 ], and assume that the components of the system
are “new” at time t = 0 and that a failed component stays in the failure
state for the rest of the time interval. Thus the component is not repaired or
replaced. This situation, for example, can describe a system with component
failure states that can only be discovered by testing or inspection. We assume
that the lifetime of component i is determined by a lifetime distribution Fi (t)
having failure rate function λi (t). To calculate system reliability at a fixed
point in time, i.e., the reliability function at this point, we can proceed as
above with qi = Fi (t) and pi = F̄i (t). Thus, for a series system the reliability
at time t takes the form
at time t takes the form

h = ∏_{i=1}^{n} F̄_i(t).   (2.8)
But F̄i (t) can be expressed by means of the failure rate λi (t):
F̄_i(t) = e^{−∫_0^t λ_i(u) du}.   (2.9)
g = {1 − p1 p4}{1 − p5 (p2 + p3 − p2 p3)} ≈ w1,
w1 = q1 q5 + q4 q5 + q1 q2 q3 + q2 q3 q4
   = 0.02 · 0.01 + 0.01 · 0.01 + 0.02 · 0.02 · 0.02 + 0.02 · 0.02 · 0.01
   = 3 · 10^{−4}.
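A quick check of this approximation (our own verification sketch, using the q-values above):

```python
def system_unreliability(p1, p2, p3, p4, p5):
    """Exact unreliability of the 5-components example: parallel connection
    of the series pair {1, 4} and the module ((2 or 3) in series with 5)."""
    return (1 - p1 * p4) * (1 - p5 * (p2 + p3 - p2 * p3))

q1, q2, q3, q4, q5 = 0.02, 0.02, 0.02, 0.01, 0.01
p = [1 - q for q in (q1, q2, q3, q4, q5)]
g = system_unreliability(*p)

# first-order inclusion-exclusion approximation from the minimal cut sets
# {1,5}, {4,5}, {1,2,3}, {2,3,4}
w1 = q1 * q5 + q4 * q5 + q1 * q2 * q3 + q2 * q3 * q4
assert abs(w1 - 3.12e-4) < 1e-9
assert g <= w1                      # w1 overcounts (Bonferroni)
assert abs(g - w1) / w1 < 0.01      # and is an accurate approximation here
```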
I_i^B = ∂h/∂p_i.
Thus Birnbaum’s measure equals the partial derivative of the system reliability
with respect to pi . The approach is well known from classical sensitivity anal-
yses. We see that if IiB is large, a small change in the reliability of component
i will give a relatively large change in system reliability.
Birnbaum’s measure might be appropriate, for example, in the operation
phase where possible improvement actions are related to operation and main-
tenance parameters. Before looking closer into specific improvement actions of
the components, it will be informative to measure the sensitivity of the system
reliability with respect to small changes in the reliability of the components.
To compute IiB the following formula is often used:
IiB = h(1i , p) − h(0i , p). (2.12)
This formula is established using (2.6), p. 25.
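A sketch of (2.12) applied to the 2-out-of-3 system from earlier in the chapter (the choice of example and component reliabilities is ours):

```python
def h_two_out_of_three(p1, p2, p3):
    """Reliability of a 2-out-of-3 system (illustrative example)."""
    return p1 * p2 + p1 * p3 + p2 * p3 - 2 * p1 * p2 * p3

def birnbaum(h, p, i):
    """Birnbaum measure via formula (2.12): I_i^B = h(1_i, p) - h(0_i, p)."""
    hi = list(p); hi[i] = 1.0
    lo = list(p); lo[i] = 0.0
    return h(*hi) - h(*lo)

p = (0.9, 0.8, 0.7)
# for this system dh/dp1 = p2 + p3 - 2*p2*p3, which (2.12) should reproduce
assert abs(birnbaum(h_two_out_of_three, p, 0) - (0.8 + 0.7 - 2 * 0.8 * 0.7)) < 1e-9
# here component 1 (whose partners are the least reliable) ranks highest
measures = [birnbaum(h_two_out_of_three, p, i) for i in range(3)]
assert max(range(3), key=lambda i: measures[i]) == 0
```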
We see that for this example the Birnbaum measure gives the same ranking
of the components as the measure IiA . However, this is not true in general.
Dependent Components
In the following some remarks on systems with dependent components are
made. A more systematic treatment concerning copula models can be found
in the last subsection of this chapter.
One of the most difficult tasks in reliability engineering is to analyze dependent components (often referred to as common mode failures). It is difficult to formulate the dependency in a mathematically rigorous way while at the same time obtaining a realistic model and providing data for it. Whether we
succeed in incorporating a “correct” contribution from common mode failures
is very much dependent on the modeling ability of the analyst. By defining
the components in a suitable way, it is often possible to preclude dependency.
For example, common mode failures that are caused by a common external
cause can be identified and separated out so that the components can be con-
sidered as independent components. Another useful method for “elimination”
of dependency is to redefine components. For example, instead of including
a parallel structure of dependent components in the system, this structure
could be represented by one component. Of course, this does not remove the
dependency, but it moves it to a lower level of the analysis. Special techniques,
such as Markov modeling, can then be used to analyze the parallel structure
itself, or we can try to estimate/assign reliability parameters directly for this
new component.
Although it is often possible to “eliminate” dependency between compo-
nents by proper modeling, it will in many cases be required to establish a
model that explicitly takes into account the dependency. Refer to Chap. 3 for
examples of such models.
Another way of taking into account dependency is to obtain bounds to the
system reliability, assuming that the components are associated and not neces-
sarily independent. Association is a type of positive dependency, for example,
as a result of components supporting loads. The precise mathematical defini-
tion is as follows (cf. [32]):
where Sj equals the jth minimal path set, j = 1, 2, . . . , s and Kj equals the
jth minimal cut set, j = 1, 2, . . . , k. This method usually leads to very wide
intervals for the reliability.
In this section parts of the theory presented in Sect. 2.1.1 will be generalized to
include multistate systems where components and system are allowed to have
an arbitrary (finite) number of states/levels. Multistate monotone systems are
used to model, e.g., production and transportation systems for oil and gas,
and power transmission systems.
We consider a system comprising n components, numbered consecutively from 1 to n. As in the binary case, xi represents the state of component i,
i = 1, 2, . . . , n, but now xi can be in one out of Mi + 1 states,
xi0 , xi1 , xi2 , . . . , xiMi (xi0 < xi1 < xi2 < · · · < xiMi ).
The set comprising these states is denoted Si . The states xij represent, for
example, different levels of performance, from the worst, xi0 , to the best,
xiMi . The states xi0 , xi1 , . . . , xi,Mi −1 are referred to as the failure states of
the components.
Similarly, Φ = Φ(x) denotes the state (level) of the system. The various
values Φ can take are denoted
[Fig. 2.6. Flow network: flow is transmitted from a to b through the parallel components 1 and 2, in series with component 3.]
Example 2.21. Figure 2.6 shows a simple example of a flow network model.
The system comprises three components. Flow (gas/oil) is transmitted from a
to b. The components 1 and 2 are binary, whereas component 3 can be in one
out of three states: 0, 1, or 2. The states of the components are interpreted
as flow capacity rates for the components. The state/level of the system is
defined as the maximum flow that can be transmitted from a to b, i.e.,
Φ = Φ(x) = min{x1 + x2 , x3 }.
System level 2
Minimal cut vectors: (0, 1, 2), (1, 0, 2), and (1, 1, 1)
Minimal path vectors : (1, 1, 2)
System level 1
Minimal cut vectors: (0, 0, 2) and (1, 1, 0)
Minimal path vectors : (0, 1, 1) and (1, 0, 1).
We call hj the reliability of the system at system level j. For the flow network example above, a represents the expected throughput (flow) relative to the maximum throughput (flow) level.
The problem is to compute hj for one or more values of j, and a,
based on the probabilities pij . We assume that the random variables Xi are
independent.
h2 = P (X1 = 1, X2 = 1, X3 = 2)
= 0.96 · 0.96 · 0.97 = 0.894;
h1 = P (X1 = 1 ∪ X2 = 1, X3 ≥ 1)
= P (X1 = 1 ∪ X2 = 1) P (X3 ≥ 1)
= {1 − P (X1 = 0) P (X2 = 0)} P (X3 ≥ 1)
= 0.9984 · 0.99 = 0.988;
a = (0.094 · 1 + 0.894 · 2)/2 = 0.941.
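The numbers above can be reproduced by brute-force state enumeration. The following Python sketch is illustrative only; the marginal P(X_3 = 1) = 0.02 is inferred from P(X_3 ≥ 1) = 0.99 and P(X_3 = 2) = 0.97 used in the computation.

```python
from itertools import product

# Marginal state probabilities of the three components.
probs = [
    {0: 0.04, 1: 0.96},           # component 1 (binary)
    {0: 0.04, 1: 0.96},           # component 2 (binary)
    {0: 0.01, 1: 0.02, 2: 0.97},  # component 3 (states 0, 1, 2)
]

def phi(x):
    """Structure function of the flow network: max flow from a to b."""
    return min(x[0] + x[1], x[2])

# Distribution of the system level by full state enumeration.
dist = {}
for x in product(*[list(p) for p in probs]):
    p = 1.0
    for i, xi in enumerate(x):
        p *= probs[i][xi]
    dist[phi(x)] = dist.get(phi(x), 0.0) + p

h2 = dist.get(2, 0.0)                            # P(Phi >= 2)
h1 = dist.get(1, 0.0) + h2                       # P(Phi >= 1)
a = sum(lvl * p for lvl, p in dist.items()) / 2  # expected relative throughput
```

Running the enumeration returns the values h_2 ≈ 0.894, h_1 ≈ 0.988 and a ≈ 0.941 obtained above.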
For the above example it is easy to calculate the system reliability directly
by using elementary probability rules. For larger systems it will be very time-
consuming (in some cases impossible) to perform these calculations if special
techniques or algorithms are not used. If the minimal cut vectors or path
vectors for a specific level are known, the system reliability for this level can
be computed exactly, using, for example, the algorithm described in [17]. For
highly reliable systems, which are most common in practice, simple approxi-
mations can be used as described in the following.
Analogous to the binary case, approximations can be established based on
the inclusion–exclusion method. For example, we have
\[
1 - h_j = \sum_r \prod_{i=1}^{n} P(X_i \le z_i^r) - \epsilon, \tag{2.13}
\]
where $(z_1^r, z_2^r, \dots, z_n^r)$ represents the $r$th cut vector for level $j$ and $\epsilon$ is a positive
error term satisfying
\[
\epsilon \le \sum_{r<l} \prod_{i=1}^{n} P(X_i \le \min\{z_i^r, z_i^l\}).
\]
34 2 Basic Reliability Theory
We can conclude that the approximations are quite good for this example.
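This can be checked numerically. The sketch below (illustrative; the cumulative probability P(X_3 ≤ 1) = 0.03 is inferred from the marginals used above) evaluates the first-order sum over the minimal cut vectors, which bounds the exact unreliability from above because the error term is positive.

```python
# Cumulative marginal probabilities P(X_i <= k) for the flow network example.
cdf = [
    {0: 0.04, 1: 1.0},
    {0: 0.04, 1: 1.0},
    {0: 0.01, 1: 0.03, 2: 1.0},
]

def approx_unreliability(cut_vectors):
    """First-order approximation of 1 - h_j: sum over minimal cut vectors
    of prod_i P(X_i <= z_i^r)."""
    total = 0.0
    for z in cut_vectors:
        p = 1.0
        for i, zi in enumerate(z):
            p *= cdf[i][zi]
        total += p
    return total

# Level 2: minimal cut vectors (0,1,2), (1,0,2), (1,1,1).
u2 = approx_unreliability([(0, 1, 2), (1, 0, 2), (1, 1, 1)])  # 0.11 vs exact 0.106
# Level 1: minimal cut vectors (0,0,2), (1,1,0).
u1 = approx_unreliability([(0, 0, 2), (1, 1, 0)])             # 0.0116 vs exact 0.0116
```

For level 1 the approximation 0.0116 is very close to the exact value 1 − h_1 = 0.0116, and for level 2 the upper bound 0.11 compares well with 1 − h_2 = 0.106.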
The problem of determining the probabilities pij will, as in the binary case,
depend on the particular situation considered. Often it will be appropriate to
define pij by the limiting availabilities of the component, cf. Chap. 4.
Discussion
where Φ(ji , X) equals the state of the system given that Xi = xij .
is the failure or hazard rate, where as usual F̄(t) = 1 − F(t) denotes the
survival probability. Here and in the following we sometimes simplify the
notation and define a mapping by its values to avoid constructions like
λ : D → R_+, D = R_+ \ {t ∈ R_+ : F̄(t) = 0}, t ↦ λ(t) = f(t)/F̄(t), if
there is no fear of ambiguity. Interpreting T as the lifetime of some com-
ponent or system, the failure rate measures the proneness to failure at time
t: λ(t)Δt ≈ P(T ≤ t + Δt | T > t). The well-known relation
\[
\bar F(t) = \exp\left(-\int_0^t \lambda(s)\,ds\right)
\]
shows that F is uniquely determined by the failure rate. One notion of aging
could be an increasing failure rate (IFR). However, this IFR property is in
some cases too strong and other intuitive notions of aging have been sug-
gested. Among them are the increasing failure rate average (IFRA) property
and the notions of new better than used (NBU) and new better than used
in expectation (NBUE). In the following subsection these concepts are intro-
duced formally and the relationships among them are investigated.
Furthermore, these notions should be applied to complex systems. If we
consider the time dynamics of such systems, we want to investigate how the
reliability of the whole system changes in time if the components have one of
the mentioned aging properties.
Another question is how different lifetime (random) variables and their cor-
responding distributions can be compared. This leads to notions of stochastic
ordering. The comparison of the lifetime distribution with the exponential
distribution leads to useful estimates of the system reliability.
We first define the IFR and decreasing failure rate (DFR) properties of a
lifetime distribution F by means of the conditional survival probability
of the existence of a density f (failure rate λ). But if a density exists, then
the IFR property is equivalent to a nondecreasing failure rate, which can
immediately be seen as follows. From
\[
\lambda(t) = \lim_{x \to 0+} \frac{1}{x}\left(1 - \frac{\bar F(t+x)}{\bar F(t)}\right)
\]
we obtain that the IFR property implies that λ is nondecreasing. Conversely,
if λ is nondecreasing, then we can conclude that
\[
P(T > t + x \mid T > t) = \exp\left(-\int_t^{t+x} \lambda(s)\,ds\right)
\]
is nonincreasing in t, which is exactly the IFR property.
Next we will introduce two aging notions that are related to the residual
lifetime of a component of age t. Let T ∼ F be a positive random variable
with finite expectation. Then the distribution of the remaining lifetime after
t ≥ 0 is given by
\[
P(T - t > x \mid T > t) = \frac{\bar F(x+t)}{\bar F(t)}
\]
with expectation
\[
\mu(t) = E[T - t \mid T > t] = \frac{1}{\bar F(t)} \int_0^\infty \bar F(x+t)\,dx = \frac{1}{\bar F(t)} \int_t^\infty \bar F(x)\,dx. \tag{2.15}
\]
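Formula (2.15) is easy to evaluate numerically. The following sketch (an illustration, not from the text) approximates the integral by the trapezoidal rule; for the memoryless exponential distribution the mean residual life is constant, μ(t) = 1/λ, while for an IFR Weibull distribution it decreases in t.

```python
import math

def mrl(surv, t, upper=50.0, n=100000):
    """Mean residual life mu(t) = (1/surv(t)) * integral_t^upper surv(x) dx,
    trapezoidal rule; `upper` truncates the infinite integral."""
    h = (upper - t) / n
    xs = [t + k * h for k in range(n + 1)]
    integral = h * (sum(surv(x) for x in xs) - 0.5 * (surv(xs[0]) + surv(xs[-1])))
    return integral / surv(t)

rate = 2.0
exp_surv = lambda x: math.exp(-rate * x)  # Exp(2): mu(t) = 0.5 for all t
wb = lambda x: math.exp(-x * x)           # Weibull, shape 2: IFR, decreasing mu(t)

m0 = mrl(exp_surv, 0.0)  # about 0.5
m1 = mrl(exp_surv, 1.0)  # about 0.5 (memorylessness)
```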
Remark 2.28. (i) The corresponding notions for “better” replaced by “worse,”
NWU and NWUE, are obtained by reversing the inequality signs.
(ii) These properties are intuitive notions of aging. F is NBU means that
the probability of surviving x further time units for a component of age t
decreases in t. For NBUE distributions the expected remaining lifetime for
a component of age t is less than the expected lifetime of a new component.
Now we want to establish the relations between these four notions of aging.
Theorem 2.29. Let T ∼ F be a positive random variable with finite expecta-
tion. Then we have
Examples can be constructed which show that none of the above implica-
tions can be reversed.
where $p^\alpha = (p_1^\alpha, \dots, p_n^\alpha)$.
Proof. We prove the result for binary structures, which are nondecreasing in
each argument (nondecreasing structures) but do not necessarily satisfy Φ(0) = 0
and Φ(1) = 1. We use induction on n, the number of components in the
system. For n = 1 the assertion is obviously true. The induction step is carried
out by means of the pivotal decomposition formula:
\[
h(p^\alpha) = p_n^\alpha\, h(1_n, p^\alpha) + (1 - p_n^\alpha)\, h(0_n, p^\alpha).
\]
Theorem 2.32. Let X and Y be two independent random variables with IFR
distributions. Then X + Y has an IFR distribution.
By induction this property extends to an arbitrary finite number of random
variables. This shows, for example, that the Erlang distribution is of IFR type
because it is the distribution of the sum of exponentially distributed random
variables.
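The Erlang case can be verified directly: for the sum of two Exp(λ) lifetimes the failure rate λ²t/(1 + λt) is available in closed form and is strictly increasing. A small numeric check (illustrative, not from the text):

```python
import math

def erlang2_hazard(t, lam=1.0):
    """Failure rate of Erlang(2, lam), the distribution of the sum of two
    independent Exp(lam) random variables: f(t)/surv(t) = lam^2 t / (1 + lam t)."""
    dens = lam ** 2 * t * math.exp(-lam * t)
    surv = (1.0 + lam * t) * math.exp(-lam * t)
    return dens / surv

grid = [0.1 * k for k in range(1, 100)]
rates = [erlang2_hazard(t) for t in grid]
increasing = all(a < b for a, b in zip(rates, rates[1:]))  # IFR property
```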
Proof. The inequality for the survival probability follows from Lemma 2.33
with α = 1/μ, where in the degenerate case t∗ = μ we have tμ = t∗ = μ.
It remains to show tμ ≥ μ. To this end we first confine ourselves to the
continuous case and assume that F has no jump at t∗ . Then F (T ) has a
uniform distribution on [0, 1] and we obtain E[ln F̄(T)] = −1. Writing
$\bar F(t) = \exp\{-\Lambda(t)\}$ with the cumulative hazard Λ, which is convex
by the IFR property, Jensen's inequality yields
\[
1 = E[-\ln \bar F(T)] \ge -\ln \bar F(\mu).
\]
A lot of other bounds for the survival probability can be set up under
various conditions (see the references listed in the Bibliographic Notes). Next
we want to give one example of how such bounds can be carried over to
monotone systems. As an immediate consequence of the last lemma we obtain
the following corollary.
Actually the inequality holds true for t < min{tμ1 , . . . , tμn }. The idea of
this inequality is to give a bound on the reliability of the system at time t
only based on h and μi and the knowledge that the Fi are of IFR type. If the
reliability function h is unknown, then it could be replaced by that of a series
system to yield
\[
\bar F(t) \ge h\big(e^{-t/\mu_1}, \dots, e^{-t/\mu_n}\big) \ge \prod_{i=1}^{n} e^{-t/\mu_i} = \exp\left(-t \sum_{i=1}^{n} \frac{1}{\mu_i}\right).
\]
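The series-system bound can be checked numerically for components whose survival function is known in closed form. The sketch below (illustrative; it uses two i.i.d. Erlang(2, 1) components, which are IFR with mean μ = 2) verifies the bound on (0, μ):

```python
import math

def erlang2_surv(t, lam=1.0):
    """Survival function of Erlang(2, lam); IFR with mean mu = 2/lam."""
    return (1.0 + lam * t) * math.exp(-lam * t)

mu = 2.0  # mean of each component lifetime
ok = True
for k in range(1, 200):
    t = 0.005 * k * mu                  # t runs through (0, mu)
    exact = erlang2_surv(t) ** 2        # series system: product of survivals
    bound = math.exp(-t * (1.0 / mu + 1.0 / mu))
    ok = ok and (exact >= bound - 1e-12)
```

For exponential components the bound holds with equality, so no sharper general bound of this form is possible.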
H(t1 , . . . , tn ) = P (T1 ≤ t1 , . . . , Tn ≤ tn ).
Since this formula is rather complex we will explain it in more detail for
the case n = 2 and give some examples.
Let Y1 , Y2 be random variables each uniformly distributed on [0, 1] with
joint distribution C(t1 , t2 ) = P (Y1 ≤ t1 , Y2 ≤ t2 ), t1 , t2 ∈ [0, 1] and induced
probability measure C̃. For the sets $D_1 = B_0^{t_1} \times B_0^{t_2}$, $D_2 = B_0^{t_1} \times B_1^{t_2}$, $D_3 = B_1^{t_1} \times B_1^{t_2}$, $D_4 = B_1^{t_1} \times B_0^{t_2}$ in Fig. 2.7 we get
[Fig. 2.7: the unit square partitioned by the lines t_1 and t_2 into the rectangles D_1 (lower left), D_2 (upper left), D_3 (upper right), and D_4 (lower right).]
Example 2.38.
(i) In the case of a parallel system with n components, the structure function
is given by $\Phi(x_1, \dots, x_n) = 1 - \prod_{i=1}^{n}(1 - x_i)$, which is 0 if and only if
x = (0, . . . , 0). Therefore, the sum in $G_{\Phi,C}$ extends over all possible x
except the null vector, yielding
\[
G_{\Phi,C}(t_1, \dots, t_n) = 1 - \tilde C\left( \times_{i=1}^{n} B_0^{t_i} \right) = 1 - C(t_1, \dots, t_n).
\]
\[
G_{\Phi,C}(t_1, \dots, t_n) = \sum_{x \in \{0,1\}^n} \Phi(x) \prod_{i=1}^{n} t_i^{1-x_i} (1 - t_i)^{x_i},
\]
\[
\bar F^S(t) = \sum_{x \in \{0,1\}^n} \Phi(x) \prod_{i=1}^{n} (\bar F_i(t))^{x_i} (F_i(t))^{1-x_i},
\]
the well-known formula that results from the state enumeration method
(see Chap. 2.1, p. 25).
• the expectation
\[
E(F^S) = \int_0^\infty \bar F^S(t)\,dt,
\]
For n = 2, parts (i) and (ii) of the above definition are equivalent as can
be seen from the relation between H and H̄. This does not hold true in higher
dimensions. To compare two distributions H, G ∈ D(F1 , . . . , Fn ) with fixed
marginals it is, of course, enough to compare their corresponding copulas.
For n = 2 random variables X, Y with continuous distribution functions
F, G and copula C, there are well-known measures of the degree of dependence,
such as Kendall's tau $\tau_{X,Y}$ and Spearman's rho $\rho_{X,Y}$, which can be
expressed in terms of the copula C:
\[
\tau_{X,Y} = 4 \int_{[0,1]^2} C(u,v)\,dC(u,v) - 1, \qquad \rho_{X,Y} = 12 \int_{[0,1]^2} C(u,v)\,du\,dv - 3.
\]
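The formula for Spearman's rho can be evaluated numerically for the three basic copulas; the following sketch (illustrative, not from the text) uses a midpoint rule and recovers ρ = 0 for the product copula, ρ = 1 for the upper Fréchet–Hoeffding bound M, and ρ = −1 for the lower bound W:

```python
def spearman_rho(C, n=400):
    """rho = 12 * (double integral of C over [0,1]^2) - 3, midpoint rule."""
    h = 1.0 / n
    s = 0.0
    for i in range(n):
        u = (i + 0.5) * h
        for j in range(n):
            s += C(u, (j + 0.5) * h)
    return 12.0 * s * h * h - 3.0

pi_cop = lambda u, v: u * v                 # independence (product) copula
M_cop = lambda u, v: min(u, v)              # upper Fréchet–Hoeffding bound
W_cop = lambda u, v: max(u + v - 1.0, 0.0)  # lower Fréchet–Hoeffding bound

rho_pi = spearman_rho(pi_cop)  # close to 0
rho_M = spearman_rho(M_cop)    # close to +1
rho_W = spearman_rho(W_cop)    # close to -1
```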
[Figure: (a) a parallel system and (b) a series system, each with components c_1, …, c_n.]
(i) $q(F_{C_2}^S) \le q(F_{C_1}^S)$;
(ii) $q(F_{C_1}^S) \le q(F_{C_2}^S)$.
Proof. (i) For a parallel system, note that according to Example 2.38(i) it holds
that
\[
F_{C_i}^S(t) = C_i(F_1(t), \dots, F_n(t)),
\]
where i = 1, 2 and $F_1, \dots, F_n \in D$. It is clear that $F_{C_1}^S(t) \le F_{C_2}^S(t)$ for all
$t \ge 0$, since $C_1 \prec_{cL} C_2$. That means $F_{C_2}^S \le_s F_{C_1}^S$. Because of the monotonicity
of q we get the assertion
\[
q(F_{C_2}^S) \le q(F_{C_1}^S).
\]
The proof of (ii) is similar: For a series system we have
The above theorem will now be applied to the three functionals mentioned ear-
lier, namely the system reliability $R_t(F^S) = \bar F^S(t)$, the expectation
$E(F^S) = \int_0^\infty \bar F^S(t)\,dt$, and the quantile $Q_p(F^S) := \inf\{t \in R_+ : F^S(t) \ge p\}$, $0 < p \le 1$.
Note that these functionals are all nondecreasing with respect to the usual
stochastic ordering.
W ≺cL C ≺cL M.
\[
(-1)^k \frac{d^k}{dt^k} \varphi^{-1}(t) \ge 0, \quad t \ge 0,\ k = 0, 1, 2, \dots
\]
The function $C : [0, 1]^n \to [0, 1]$ defined by
According to the definition, $C_1 \prec_{cL} C_2$ holds true if and only if for all
$x_1, \dots, x_n \in [0, 1]$
\[
\varphi_1^{-1}(\varphi_1(x_1) + \cdots + \varphi_1(x_n)) \le \varphi_2^{-1}(\varphi_2(x_1) + \cdots + \varphi_2(x_n)). \tag{2.17}
\]
Substituting $t_i = \varphi_2(x_i)$ and setting $f = \varphi_1 \circ \varphi_2^{-1}$, this is equivalent to
\[
\varphi_1^{-1}(f(t_1) + \cdots + f(t_n)) \le \varphi_2^{-1}(t_1 + \cdots + t_n) \tag{2.18}
\]
for all $t_1, \dots, t_n \ge 0$.
Applying the strictly decreasing function $\varphi_1$ to both sides of (2.18) one gets
with generator $\varphi_\theta(t) = \frac{1}{\theta}(t^{-\theta} - 1)$. Is this family positively ordered in the
sense that for $\theta_1 \le \theta_2$ we have $C_{\theta_1} \prec_c C_{\theta_2}$? Note that in the case n = 2 the
PLOD- and PUOD-orderings coincide and are equivalent to the concordance
ordering $\prec_c$. To check whether the Clayton family is positively ordered we can
use Corollary 2.43 part (iii). The generator $\varphi_\theta$ is continuously differentiable
on (0, 1) with $\varphi_\theta'(t) = -t^{-\theta-1}$. The ratio $\varphi_{\theta_1}'/\varphi_{\theta_2}' = t^{\theta_2-\theta_1}$ is nondecreasing
on (0, 1) for $\theta_1 \le \theta_2$, which is sufficient for $C_{\theta_1} \prec_c C_{\theta_2}$, i.e., the degree of
dependence increases with θ. The extreme cases θ = −1 and θ → ∞ are the
Fréchet–Hoeffding bounds $C_{-1} = W$ and $C_\infty = M$. The limiting case θ → 0
yields the product copula $C_0 = \Pi$ (independence).
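The positive ordering of the Clayton family can also be confirmed pointwise on a grid. The sketch below (illustrative) uses the closed form of the bivariate Clayton copula obtained from the generator above:

```python
def clayton(u, v, theta):
    """Bivariate Clayton copula; theta = 0 is taken as the product-copula limit."""
    if theta == 0.0:
        return u * v
    return max(u ** (-theta) + v ** (-theta) - 1.0, 0.0) ** (-1.0 / theta)

# Check C_0 <= C_{1/2} <= C_2 pointwise on a grid of the open unit square,
# i.e., the degree of dependence increases with theta.
grid = [0.05 * k for k in range(1, 20)]
ordered = all(
    clayton(u, v, 0.0) <= clayton(u, v, 0.5) + 1e-12
    and clayton(u, v, 0.5) <= clayton(u, v, 2.0) + 1e-12
    for u in grid for v in grid
)
```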
Parallel System
Series System
The lifetime $T = T_1 \wedge T_2$ of a series system has distribution function
\[
F_{C_\theta}^{ser}(t) = P(T \le t) = F_1(t) + F_2(t) - C_\theta(F_1(t), F_2(t))
\]
according to Example 2.38. For the expectation of the system lifetime we get
\[
E(F_{C_\theta}^{ser}) = E(T_1) + E(T_2) - E(T_1 \vee T_2).
\]
Therefore, the properties of the expectation can be transferred from the par-
allel system:
• θ = −1, $C_{-1} = W$: $E(F_W^{ser}) = E(T_1) + E(T_2) - \int_0^\infty (1 - W(F_1(t), F_2(t)))\,dt$.
In the exponential case we get
\[
E(F_W^{ser}) = \frac{2}{\lambda} - \frac{1}{\lambda}(1 + \ln 2) = (1 - \ln 2)\,\frac{1}{\lambda}.
\]
• θ = 0, $C_0 = \Pi$: $E(F_\Pi^{ser}) = E(T_1) + E(T_2) - \int_0^\infty (1 - F_1(t)F_2(t))\,dt$.
In the exponential case we get
\[
E(F_\Pi^{ser}) = \frac{2}{\lambda} - \frac{3}{2}\cdot\frac{1}{\lambda} = 0.5 \cdot \frac{1}{\lambda}.
\]
• θ = ∞, $C_\infty = M$: $E(F_M^{ser}) = E(T_1) + E(T_2) - \int_0^\infty [1 - M(F_1(t), F_2(t))]\,dt$.
In the exponential case we get
\[
E(F_M^{ser}) = \frac{1}{\lambda}.
\]
This shows that the expected system lifetime of a series system can be reduced
to about 30% [(1 − ln 2) · 100] of the expected lifetime of one component.
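The three exponential-case expectations can be verified by Monte Carlo simulation. The sketch below (illustrative; λ = 1, and the uniform samples are clipped away from 0 and 1, a purely numerical safeguard to keep the logarithms finite) couples two Exp(λ) lifetimes through the copulas Π, M, and W:

```python
import math, random

random.seed(1)
n = 200_000
lam = 1.0

def runif():
    # uniform sample clipped away from 0 and 1
    return min(max(random.random(), 1e-12), 1.0 - 1e-12)

inv = lambda u: -math.log(1.0 - u) / lam  # quantile function of Exp(lam)

s_ind = s_com = s_cnt = 0.0
for _ in range(n):
    u1, u2 = runif(), runif()
    s_ind += min(inv(u1), inv(u2))        # independent components (C = Pi)
    s_com += inv(u1)                      # comonotone components (C = M): T1 = T2
    s_cnt += min(inv(u1), inv(1.0 - u1))  # countermonotone components (C = W)

e_ind, e_com, e_cnt = s_ind / n, s_com / n, s_cnt / n
```

The estimates are close to 0.5/λ, 1/λ, and (1 − ln 2)/λ ≈ 0.307/λ, respectively.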
and
C1,1 (u1 , u2 ) = M (u1 , u2 ) = u1 ∧ u2 .
This implies that the limits $\lambda_1 \to \infty$, $\lambda_2 \to \infty$ or the case $\lambda_{12} = 0$ result in the
product copula, whereas the limit $\lambda_{12} \to \infty$ or the case $\lambda_1 = \lambda_2 = 0$ yields the upper
Fréchet–Hoeffding bound. The family $C_{\alpha,\beta}$, $0 \le \alpha, \beta \le 1$, is positively ordered
with respect to the concordance ordering in α (β fixed) as well as in β (α fixed).
For $0 \le \alpha, \beta \le 1$ we get
\[
\Pi \prec_c C_{\alpha,\beta} \prec_c M.
\]
\[
R_t(F_{C_{\alpha,\beta}}^{par}) = \bar F^S(t) = 1 - C_{\alpha,\beta}(F_1(t), F_2(t))
= 1 - \min\{(1-F_1(t))^{1-\alpha}(1-F_2(t)),\ (1-F_1(t))(1-F_2(t))^{1-\beta}\} - F_1(t) - F_2(t) + 1
= e^{-(\lambda_1+\lambda_{12})t} + e^{-(\lambda_2+\lambda_{12})t} - e^{-(\lambda_1+\lambda_2+\lambda_{12})t}, \quad t \ge 0.
\]
The reliability functions for different copulas with the same marginals Fi (t) =
1 − exp(−10t), i = 1, 2, are displayed graphically in Fig. 2.9.
The dotted line in Fig. 2.9 represents the independence case with λ1 =
10, λ2 = 10, λ12 = 0. The dashed line corresponds to λ1 = 5, λ2 = 5, λ12 = 5,
whereas the solid line represents the upper Fréchet–Hoeffding bound with
λ1 = 0, λ2 = 0, λ12 = 10.
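The closed-form reliability can be checked against a direct simulation of the Marshall–Olkin shock model (an illustration, not from the text; the dashed-line parameters λ_1 = λ_2 = λ_12 = 5 are used at t = 0.1):

```python
import math, random

random.seed(7)
lam1, lam2, lam12, t = 5.0, 5.0, 5.0, 0.1
n = 200_000

def mo_parallel_rel(t):
    """Closed form: P(T1 v T2 > t) for Marshall-Olkin exponential lifetimes."""
    return (math.exp(-(lam1 + lam12) * t) + math.exp(-(lam2 + lam12) * t)
            - math.exp(-(lam1 + lam2 + lam12) * t))

hits = 0
for _ in range(n):
    z1 = random.expovariate(lam1)    # shock killing component 1
    z2 = random.expovariate(lam2)    # shock killing component 2
    z12 = random.expovariate(lam12)  # common shock
    T = max(min(z1, z12), min(z2, z12))  # parallel system lifetime
    hits += T > t

est = hits / n
theory = mo_parallel_rel(t)
```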
Figure 2.9 shows that with increasing dependence between the
component lifetimes, here increasing $\lambda_{12}$, the reliability of a parallel system
decreases. For example, for t = 0.1, the reliability ranges from $R_{0.1} =
0.60$ ($\lambda_{12} = 0$) to $R_{0.1} = 0.37$ ($\lambda_{12} = 10$), i.e., the reliability may decrease
to about 60% of the reliability in the independence case due to correlation
between the component lifetimes.
As before, the dotted line in Fig. 2.10 represents the independence case
with λ1 = 10, λ2 = 10, λ12 = 0. The dashed line corresponds to λ1 = 5, λ2 =
5, λ12 = 5, whereas the solid line represents the upper Fréchet–Hoeffding
bound with λ1 = 0, λ2 = 0, λ12 = 10.
2.3 Copula Models of Complex Systems in Reliability 55
A general set-up should include all basic failure time models, should take
into account the time-dynamic development, and should allow for different
information and observation levels. Thus, one is led in a natural way to the
theory of stochastic processes in continuous time, including (semi-) martingale
theory, in the spirit of Arjas [3, 4] and Koch [108]. As was pointed out in
Chap. 1, this theory is a powerful tool in reliability analysis. It should be
stressed, however, that the purpose of this chapter is to present and introduce
ideas rather than to give a far reaching excursion into the theory of stochastic
processes. So the mathematical technicalities are kept to the minimum level
necessary to develop the tools to be used. Also, a number of remarks and
examples are included to illustrate the theory. Yet, to benefit from reading
this chapter a solid basis in stochastics is required. Section 3.1 summarizes the
mathematics needed. For a more comprehensive and in-depth presentation of
the mathematical basis, we refer to Appendix A and to monographs such as
by Brémaud [50], Dellacherie and Meyer [61, 62], Kallenberg [101], or Rogers
and Williams [133].
Proof. We have to show that with $(f_t)$ from condition C1 the right-continuous
process $M_t = Z_t - Z_0 - \int_0^t f_s\,ds$ is an F-martingale, i.e., that for all $A \in \mathcal F_t$
and $s \ge t$, $s, t \in R_+$, $E[I_A M_s] = E[I_A M_t]$, where $I_A$ denotes the indicator
variable. This is equivalent to
\[
E[I_A(M_s - M_t)] = \int_A \left( Z_s - Z_t - \int_t^s f_u\,du \right) dP = 0.
\]
For all r, $t \le r \le s$, and $A \in \mathcal F_t$, $I_A$ is $\mathcal F_r$-measurable. This yields
\[
\frac{1}{h} E[I_A(Z_{r+h} - Z_r)] = \frac{1}{h} E\big[E[I_A(Z_{r+h} - Z_r) \mid \mathcal F_r]\big]
= E\left[ I_A \frac{1}{h} E[Z_{r+h} - Z_r \mid \mathcal F_r] \right] = E[I_A D(r, h)].
\]
3.1 Notation and Fundamentals 61
where the second equality follows from Fubini’s theorem. Then (3.2) and (3.3)
together yield
\[
E[I_A(Z_s - Z_t)] = \int_t^s E[I_A f_u]\,du = E\left[ I_A \int_t^s f_u\,du \right],
\]
which proves the assertion.
Remark 3.7. (i) In the terminology of Dellacherie and Meyer [62] an SSM
Z = (f, M) is a special semimartingale because the drift term $\int_0^t f_s\,ds$ is
continuous and therefore predictable. Hence the decomposition of Z is unique
P-a.s., because a second decomposition Z = (f', M') leads to the continuous
and therefore predictable martingale M − M' of integrable variation, which
is identically 0 (cf. Appendix A.5, Lemma A.39, p. 263). (ii) It can be shown
that if Z = (f, M) is an SSM and for some constant c > 0 the family of
random variables $\{|h^{-1}\int_t^{t+h} f_s\,ds| : 0 < h \le c\}$ is bounded by some integrable
random variable Y, then the conditions C1–C3 hold true, i.e., C1–C3 are
under this boundedness condition not only sufficient but also necessary for a
semimartingale representation. The proof of the main part (C2) is based on
the Radon/Nikodym theorem. The details are of technical nature, and they
are therefore omitted and left to the interested reader. (iii) For applications
it is often of interest to find an SSM representation for point processes, i.e.,
to determine the compensator of such a process (cf. Definition 3.4 on p. 62).
For such and other more specialized processes, specifically adapted methods
to find the compensator can be applied, see below and [16, 50, 58, 103, 115].
One of the simplest examples of a process with an SSM representation is
the Poisson process (Nt ), t ∈ R+ , with constant rate λ > 0. It is well-known
and easy to see from the definition of a martingale that Mt = Nt −λt defines a
martingale with respect to the internal filtration FtN = σ(Ns , 0 ≤ s ≤ t). If we
consider conditions C1–C3, we find that D(t, h) = λ for all t, h ∈ R+ because
the Poisson process has independent and stationary increments: E[Nt+h −
Nt |FtN ] = E[Nt+h − Nt ] = ENh = hλ. Therefore, we see that C1–C3 are
satisfied with $f_t = \lambda$ for all $\omega \in \Omega$ and all $t \in R_+$, which results in the
representation $N_t = \int_0^t \lambda\,ds + M_t = \lambda t + M_t$.
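The martingale property of M_t = N_t − λt can be observed empirically: averaged over many simulated Poisson paths, M_t has mean close to 0 at every t. A small illustrative simulation (not from the text):

```python
import random

random.seed(3)
lam, n_paths = 2.0, 40_000

def poisson_count(t):
    """Number of Poisson(lam) arrivals in [0, t], via Exp(lam) interarrival gaps."""
    s, k = 0.0, 0
    while True:
        s += random.expovariate(lam)
        if s > t:
            return k
        k += 1

# Sample mean of M_t = N_t - lam * t over many paths; theory: E[M_t] = 0.
means = {}
for t in (1.0, 2.5, 5.0):
    means[t] = sum(poisson_count(t) - lam * t for _ in range(n_paths)) / n_paths
```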
The Poisson process is a point process as well as an example of a Markov
process, and the question arises under which conditions point and Markov
processes admit an SSM representation.
62 3 Stochastic Failure Models
Let us now consider a univariate point process (Tn ), n ∈ N, and its associ-
ated counting process (Nt ), t ∈ R+ , with ENt < ∞ for all t ∈ R+ on a filtered
probability space (Ω, F , F, P ). The traditional definition of the compensator
of a point process is the following.
Definition 3.8. Let N be an integrable point process adapted to the filtra-
tion F. The unique F-predictable increasing process A = (At ), such that
\[
E\left[ \int_0^\infty C_s\,dN_s \right] = E\left[ \int_0^\infty C_s\,dA_s \right] \tag{3.4}
\]
holds for all nonnegative F-predictable processes C, is called the F-compensator of N,
which gives
E[IB (Ns − As )] = E[IB (Nt − At )].
Hence, N − A is a martingale.
Conversely, if N − A is a martingale, then A is integrable and we obtain
(3.5). In the general case, (3.4) can be established using the monotone class
theorem.
So, by an increment dt in time from t on, the increment A(dt) is what we can
predict from the information gathered in [0, t) about the increase of Nt , and
dMt = dNt − A(dt) is what remains unforeseen. Thus, sometimes M is called
an innovation martingale and A(dt) the (dual) predictable projection.
In many cases (which are those we are mostly interested in) the F-
compensator A of a counting process N can be represented as an integral
of the form
\[
A_t = \int_0^t \lambda_s\,ds
\]
Remark 3.11. (i) To speak of the intensity is a little bit misleading (but harm-
less) because it is not unique. It can be shown (see Brémaud [50], p. 31) that
if one can find a predictable intensity, then it is unique except on a set of
measure 0 with respect to the product measure of P and Lebesgue measure.
On the other hand, if there exists an intensity, then one can always find a
predictable version. (ii) The heuristic interpretation
λt dt = E[dNt |Ft− ]
Theorem 3.9 and Definition 3.10 link the point process to the semimartin-
gale representation, and using the definition of the compensator, it is possible
to verify formally that a process λ is the F-intensity of the point process N .
We have to show that
\[
E\left[ \int_0^\infty C_s\,dN_s \right] = E\left[ \int_0^\infty C_s \lambda_s\,ds \right]
\]
explicit form. The proof of the following theorem can be found in Jacod [92]
and in Brémaud [50], p. 61. Regular conditional distributions are introduced
in Appendix A.2, p. 252.
(ii) If the conditional distribution Gn admits a density gn for all n, then the
FN -intensity λ is given by
\[
\lambda_t = \sum_{n \ge 0} \frac{g_n(t - T_n)}{1 - \int_0^{t - T_n} g_n(x)\,dx}\, I(T_n < t \le T_{n+1}).
\]
Example 3.13. (Renewal process). Let the interarrival times Un+1 = Tn+1 −
Tn , n ∈ N0 , T0 = 0, be i.i.d. random variables with common distribution
function F , density f and failure rate r: r(t) = f (t)/(1 − F (t)). Then it
follows from Theorem 3.12 that with respect to the internal history FtN =
σ(Ns , 0 ≤ s ≤ t) the intensity on {Tn < t ≤ Tn+1 } is given by λt = r(t − Tn ).
This results in the SSM representation N = (λ, M ),
\[
N_t = \int_0^t \lambda_s\,ds + M_t.
\]
This corresponds to our supposition that the intensity at time t is the failure
rate of the last renewed item before t at an age of t − Tn .
Example 3.14. (Markov-modulated Poisson process). A Poisson process can
be generalized by replacing the constant intensity with a randomly varying
intensity, which takes one of the m values λi , 0 < λi < ∞, i ∈ S = {1, . . . , m},
m ∈ N. The changes are driven by a homogeneous Markov chain Y = (Yt ), t ∈
R+ , with values in S and infinitesimal parameters qi , the rate to leave state
i, and qij , the rate to reach state j from state i:
\[
q_i = \lim_{h \to 0+} \frac{1}{h} P(Y_h \ne i \mid Y_0 = i), \qquad
q_{ij} = \lim_{h \to 0+} \frac{1}{h} P(Y_h = j \mid Y_0 = i), \quad i, j \in S,\ i \ne j,
\]
\[
q_{ii} = -q_i = -\sum_{j \ne i} q_{ij}.
\]
Markov Processes
defines a martingale (cf., e.g., [101], p. 328). This shows that a function
Zt = f (Xt ) of a homogeneous Markov process has an SSM representation
if f ∈ D(A).
Example 3.15 (Markov pure jump process). A homogeneous Markov process
X = (Xt ) with right-continuous paths, which are constant between isolated
jumps, is called a Markov pure jump process. As before, $P^x$ denotes the probability law conditioned on $X_0 = x$ and $\tau_x = \inf\{t \in R_+ : X_t \ne x\}$ the
exit time of state x. It is known that τx follows an Exp(λ(x)) distribution if
0 < λ(x) < ∞ and that P x (τx = ∞) = 1 if λ(x) = 0, for some suitable map-
ping λ on the set of possible outcomes of X0 with values in R+ . Let v(x, ·) be
the jump law or transition probability at x, defined by v(x, B) = P x (Xτx ∈ B)
for λ(x) > 0. If f belongs to the domain D(A) of the infinitesimal generator,
then we obtain (cf. Métivier [122])
\[
Af(x) = \lambda(x) \int (f(y) - f(x))\, v(x, dy). \tag{3.6}
\]
Let us now consider some particular cases. (i) Poisson process N = (Nt ) with
parameter λ > 0. In this case we have jumps of height 1, i.e., v(x, {x+1}) = 1.
For f (x) = x we get Af (x) ≡ λ. This again shows that Nt −λt is a martingale.
If we take f (x) = x2 , then we obtain Af (x) = λ(2x + 1) and for N 2 we have
the SSM representation
\[
N_t^2 = f(N_t) = \int_0^t \lambda(2N_s + 1)\,ds + M_t^f.
\]
\[
X_t = \sum_{n=1}^{N_t} Y_n
\]
if E x τ < ∞ and g ∈ D(A) (see Dynkin [66], p. 133). This formula can now be
extended to the more general case of SSMs. If Z = (f, M ) is an F-SSM with
(P -a.s.) bounded Z and f , then for all F-stopping times τ with Eτ < ∞ we
obtain
\[
E Z_\tau = E Z_0 + E\left[ \int_0^\tau f_s\,ds \right].
\]
For some R > 0 and |x| < R we consider the stopping time σ = inf{t ∈ R+ :
|Bt | ≥ R} with respect to the internal filtration, which is the first exit time
of the ball KR = {y ∈ Rk : |y| < R}. By means of the Dynkin formula we can
determine the expectation E x σ in the following way. Let us assume E x σ < ∞
and choose g(x) = |x|2 . Dynkin’s formula then yields
\[
E^x g(B_\sigma) = R^2 = |x|^2 + E^x\left[ \int_0^\sigma \frac{1}{2}\, 2k\,ds \right] = |x|^2 + k E^x \sigma,
\]
which is tantamount to E x σ = k −1 (R2 − |x|2 ). To show E x σ < ∞ we may
replace σ by τn = n ∧ σ in the above formula: E x τn ≤ k −1 (R2 − |x|2 ) and
together with the monotone convergence theorem the result is established.
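The identity $E^x\sigma = k^{-1}(R^2 - |x|^2)$ can be observed in simulation. The sketch below (illustrative; k = 2, R = 1, started at the origin, Euler discretization, which carries a small positive bias of order $\sqrt{dt}$) estimates the mean exit time, with theoretical value 1/2:

```python
import math, random

random.seed(11)

def exit_time(k=2, R=1.0, dt=1e-3):
    """First time |B_t| >= R for k-dimensional Brownian motion started at 0."""
    x = [0.0] * k
    t = 0.0
    sd = math.sqrt(dt)
    while sum(c * c for c in x) < R * R:
        x = [c + random.gauss(0.0, sd) for c in x]
        t += dt
    return t

n = 2000
est = sum(exit_time() for _ in range(n)) / n
theory = 1.0 ** 2 / 2  # (R^2 - |x|^2)/k = 1/2 for k = 2, x = 0
```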
Random Stopping
One example is the stopping of a process Z, i.e., the transformation from
Z = (Zt ) to the process Z ζ = (Zt∧ζ ), where ζ is some stopping time. If
Z = (f, M ) is an F-SSM and ζ is an F-stopping time, then Z ζ is again an
F-SSM with representation
\[
Z_t^\zeta = Z_0 + \int_0^t I(\zeta > s) f_s\,ds + M_{t \wedge \zeta}, \quad t \in R_+.
\]
This result is an immediate consequence of the fact that a stopped martingale
is a martingale.
A Product Rule
A second example of a transformation is the product of two SSMs. To see
under which conditions such a product of two SSMs again forms an SSM,
some further notations and definitions are required, which are presented in
Appendix A. Here we only give the general result. For the conditions and a
detailed proof we refer to Appendix A.6, Theorem A.51, p. 269.
Let Z = (f, M ) and Y = (g, N ) be F-SSMs with M, N ∈ M20 and M N ∈
M0 . Then, under suitable integrability conditions, ZY is an F-SSM with
representation
\[
Z_t Y_t = Z_0 Y_0 + \int_0^t (Y_s f_s + Z_s g_s)\,ds + R_t,
\]
where R = (Rt ) is a martingale in M0 .
Remark 3.17. (i) If Z = (f, M ) and Y = (g, N ) are two SSMs and f and
g are considered as “derivatives,” then Y f + Zg is the “derivative” of the
product ZY in accordance with the ordinary product rule. (ii) Martingales
M, N , for which M N is a martingale are called orthogonal. This property
can be interpreted in the sense that the increments of the martingales are
“conditionally uncorrelated,” i.e.,
E[(Mt − Ms )(Nt − Ns )|Fs ] = 0
for all 0 ≤ s ≤ t.
A Change of Filtration
is an A-SSM, where
(i) Ẑ is A-adapted with a.s. right-continuous paths with left-hand limits and
Ẑt = E[Zt |At ] for all t ∈ R+ ;
(ii) fˆ is A-progressively measurable with fˆt = E[ft |At ] for almost all t ∈ R+
(Lebesgue measure);
(iii) M̄ is an A-martingale.
If in addition $Z_0, \int_0^\infty |f_s|\,ds \in L^2$ and $M \in \mathcal M_0^2$, then $\hat Z_0, \int_0^\infty |\hat f_s|\,ds \in L^2$ and
$\bar M \in \mathcal M_0^2$.
In Sect. 5.2.1, Theorem 5.9, p. 181, conditions are given under which the stop-
ping problem for an SSM Z can be solved. If these conditions apply to Ẑ,
then we can solve this optimal stopping problem on the A-level according to
Theorem 5.9. If the stopping problem could be solved on the F-level, then we
would obtain a bound for the stopping value on the A-level in view of the inequality
The general lifetime model is then defined by the filtration F and the corre-
sponding F-SSM representation of the indicator process.
In the following we want to derive the hazard rate process for the lifetime T of
a complex system under fairly general conditions. We make no independence
assumption concerning the component lifetimes, and we allow two or more
components to fail at the same time with positive probability.
Let Ti , i = 1, . . . , n, be n positive random variables that describe the com-
ponent lifetimes of a monotone complex system with structure function Φ.
Our aim is to derive the failure rate process for the lifetime
T = inf{t ∈ R+ : Φ(Xt ) = 0}
Thus, the sequence (T(k) , J(k) ), k ∈ N, forms a multivariate point process. Now
we fix a certain failure pattern J ⊂ {1, . . . , n} and consider the time TJ of
occurrence of this pattern, i.e.,
\[
T_J = \begin{cases} T_{(k)} & \text{if } J_{(k)} = J \text{ for some } k, \\ \infty & \text{if } J_{(k)} \ne J \text{ for all } k. \end{cases}
\]
Example 3.27. We now consider the special case n = 2 in which (T1 , T2 ) follows
the bivariate exponential distribution of Marshall and Olkin (cf. [121]) with
parameters β1 , β2 > 0 and β12 ≥ 0. A plausible interpretation of this distribu-
tion is as follows. Three independent exponential random variables Z1 , Z2 , Z12
with corresponding parameters β1 , β2 , β12 describe the time points when a
shock causes failure of component 1 or 2 or all intact components at the same
time, respectively. Then the component lifetimes are given by T1 = Z1 ∧ Z12
and T2 = Z2 ∧ Z12 , and the joint survival probability is seen to be
The three different patterns to distinguish are {1}, {2}, {1, 2}. Note that
$T_{\{1\}} \ne T_1$, as we have, for example, $T_{\{1\}} = \infty$ on $\{T_1 = T_2\}$, i.e., on
$\{Z_{12} < Z_1 \wedge Z_2\}$. Calculations then yield
\[
\lambda_t(\{1\}) = \begin{cases} \beta_1 & \text{on } \{T_1 > t, T_2 > t\}, \\ \beta_1 + \beta_{12} & \text{on } \{T_1 > t, T_2 \le t\}, \\ 0 & \text{elsewhere,} \end{cases}
\]
Now we have the F-failure rate processes λ(J) at hand for each pattern J.
We are interested in deriving the F-failure rate process λ of T. The next
theorem shows how this process λ is composed of the single processes λ(J)
3.2 A General Lifetime Model 75
and denote by
Theorem 3.28. Let (λt (J)) be the F-failure rate process corresponding to TJ ,
J ⊂ {1, . . . , n}. Then for all t ∈ R+ on {T > t} :
\[
\lambda_t = \sum_{J \subset \{1,\dots,n\}} I(J \cap D(t) = \emptyset)\,(\Phi(1_J, X_t) - \Phi(0_J, X_t))\,\lambda_t(J) = \sum_{J \in \Gamma_\Phi(t)} \lambda_t(J).
\]
for all nonnegative predictable processes C. Since (λt (J)) are the F-failure
rate processes corresponding to TJ , we have for all J ⊂ {1, . . . , n}
\[
E\left[ \int_0^\infty C_s(J)\,dV_s(J) \right] = E\left[ \int_0^\infty C_s(J) I(T_J > s) \lambda_s(J)\,ds \right]
\]
and therefore
\[
E\left[ \int_0^\infty \sum_{J \subset \{1,\dots,n\}} C_s(J)\,dV_s(J) \right] = E\left[ \int_0^\infty \sum_{J \subset \{1,\dots,n\}} C_s(J) I(T_J > s) \lambda_s(J)\,ds \right] \tag{3.11}
\]
holds true for all nonnegative predictable processes (Ct (J)). If we especially
choose for some nonnegative predictable process C
Ct (J) = Ct ft− ,
Remark 3.29. (i) The proof follows the lines of Arjas (Theorem 4.1 in [6])
except the definition of the set ΓΦ (t) of the critical failure patterns at time
t. In [6] this set includes on {T > t} all cut sets, whereas in our definition
those cut sets J are excluded for which at time t “it is known” that TJ = ∞.
However, this deviation is harmless because in [6] only extra zeros are added.
(ii) We now have a tool that allows us to determine the failure rate process
corresponding to the lifetime T of a complex system in an easy way: Add at
time t the failure rates of those patterns that are critical at t.
\[
\lambda_t = \sum_{i=1}^{n} (\Phi(1_i, X_t) - \Phi(0_i, X_t))\,\lambda_t(i) = \sum_{\{i\} \in \Gamma_\Phi(t)} \lambda_t(i), \quad t \in R_+. \tag{3.12}
\]
λt ({2}) = βI(T1 ≤ t)
To see how formula (3.12) can be used we resume Example 3.22, p. 71.
Example 3.33. We now go back to the pair (T1 , T2 ) of random variables, which
follows the bivariate exponential distribution of Marshall and Olkin with par-
ameters β1 , β2 > 0 and β12 ≥ 0 and consider a parallel system with lifetime
T = T1 ∨ T2 . Then on {T > t} the critical patterns are
\[
\Gamma_\Phi(t) = \begin{cases} \{1, 2\} & \text{on } \{T_1 > t, T_2 > t\}, \\ \{1\} & \text{on } \{T_1 > t, T_2 \le t\}, \\ \{2\} & \text{on } \{T_1 \le t, T_2 > t\}. \end{cases}
\]
Using the results of Example 3.27, p. 74, the F-failure rate process of the
system lifetime is seen to be
\[
I(T > t)\lambda_t = \beta_{12} I(T_1 > t, T_2 > t) + (\beta_1 + \beta_{12}) I(T_1 > t, T_2 \le t) + (\beta_2 + \beta_{12}) I(T_1 \le t, T_2 > t),
\]
which can be rewritten as
\[
I(T > t)\lambda_t = \beta_{12} I(T > t) + \beta_1 I(T_1 > t, T_2 \le t) + \beta_2 I(T_1 \le t, T_2 > t).
\]
We have investigated under which conditions failure rate processes exist and
how they can be determined explicitly for complex systems. In reliability
it plays an important role whether failure rates are monotone increasing or
decreasing. So it is quite natural to extend such properties to F-failure rates
in the following way.
Definition 3.34. Let an F-SSM representation (3.8) hold true for the positive
random variable T with failure rate process λ. Then λ is called F-increasing
(F-IFR, increasing failure rate) or F-decreasing (F-DFR, decreasing failure
rate), if λ has P -a.s. nondecreasing or nonincreasing paths, respectively, for
t ∈ [0, T ).
Proof. Under the assumptions of the lemma no patterns with two or more
components are critical. Since the system is monotone, the number of elements
in ΓΦ (t) is nondecreasing in t. So from (3.12), p. 76, it can be seen that if all
component failure rates are nondecreasing, the F-failure rate process λ is also
nondecreasing for t ∈ [0, T ).
Such a closure theorem does not hold true for the ordinary failure rate of
the lifetime T as can be seen from simple counterexamples (see Sect. 2.2.1 or
[32], p. 83). From the proof of Proposition 3.36 it is evident that we cannot
draw an analogous conclusion for decreasing failure rates.
where λ̂t = E[λt |At ]. Note that, in general, this formula only holds for almost
all t ∈ R+ . In all our examples we can find A-progressive versions of the
conditional expectations. The projection theorem shows that it is possible
to obtain the failure rate on a lower information level merely by forming
conditional expectations under some mild technical conditions.
Remark 3.37. Unfortunately, monotonicity properties are in general not pre-
served when changing the observation level. As was noted above (see
Proposition 3.36), if all components of a monotone system have independent
lifetimes with increasing failure rates, then T is F-IFR on the component ob-
servation level. But switching to a subfiltration A may lead to a nonmonotone
failure rate process λ̂.
The following example illustrates the role of partial information.
Example 3.38. Consider a two-component parallel system with i.i.d. random
variables Ti , i = 1, 2, describing the component lifetimes, which follow an
exponential distribution with parameter α > 0. Then the system lifetime is
T = T1 ∨ T2 and the complete information filtration is given by
Now several subfiltrations can describe different lower information levels where
it is assumed that the system lifetime T can be observed on all observation
levels. Examples of partial information and the formal description via subfil-
trations A and A-failure rates are as follows:
a) Information about T until h, after h complete information.
A^a_t = σ(I(T ≤ s), 0 ≤ s ≤ t) for 0 ≤ t < h, and A^a_t = F_t for t ≥ h;

λ̂^a_t = 2α(1 − (2 − e^{−αt})^{−1}) for 0 ≤ t < h, and λ̂^a_t = λ_t for t ≥ h.
where Z c (Lc ) denotes the continuous part of Z(L) and ΔZs = Zs −Zs− (ΔLs =
Ls − Ls− ) denotes the jump height at s. This extended exponential formula
shows that possible jumps of the conditional survival probability are not
caused by jumps of the failure rate process but by (unpredictable) jumps
of the martingale part.
[Figure: sample path of a marked point process — marks V_1, V_2 attached to the time points T_1, T_2, T_3 on the time axis.]
N_t(A) = Σ_{n=1}^∞ I(V_n ∈ A) I(T_n ≤ t),
which counts the number of marked points up to time t with marks in A. This
family of counting processes N carries the same information as the sequence
(Tn , Vn ) and is therefore an equivalent description of the marked point process.
Example 3.41. A point process (Tn ) can be viewed as a marked point process
for which S consists of a single point. Another link between point and marked
point processes is given by the counting process N = (Nt ), Nt = Nt (S), which
corresponds to the sequence (Tn ).
T_n = Σ_{k=1}^{[(n+1)/2]} U_k + Σ_{k=1}^{[n/2]} R_k,  n = 1, 2, . . . ,
where [a] denotes the integer part of a. The mark sequence is deterministic
and alternating between 0 and 1:
V_n = ½(1 + (−1)^{n+1}).
We see that N_t({0}) counts the number of completed repairs and N_t({1}) the number of failures up to time t.
3.3 Point Processes in Reliability: Failure Time and Repair Models 83
Theorem 3.44. Let N be an integrable marked point process and F^N its internal filtration. Suppose that for each n there exists a regular conditional distribution of (U_{n+1}, V_{n+1}), U_{n+1} = T_{n+1} − T_n, given the past F^N_{T_n}, of the form

G_n(ω, ds, B) = g_n(ω, s, B) ds,

where g_n(ω, s, B) is, for fixed B, a measurable function and, for fixed (ω, s), a finite measure on (S, S). Then, for T_n < t ≤ T_{n+1} and C ∈ S, the process given by

λ_t(C) = g_n(t − T_n, C) / G_n([t − T_n, ∞), S) = g_n(t − T_n, C) / (1 − ∫_0^{t−T_n} g_n(s, S) ds)

is an F^N-intensity of N(C), i.e., N_t(C) − ∫_0^t λ_s(C) ds is an F^N-martingale.
is an F-martingale.
where M̄ is the stopped martingale, M̄_t = M_{t∧T_1}. The time to first failure
We resume Example 3.42, p. 82, and assume that the operating times Uk
follow a distribution F with density f and failure rate ρ(t) = f (t)/F̄ (t),
whereas the repair times follow a distribution G with density g and hazard rate
η(t) = g(t)/Ḡ(t). Note that the failure/hazard rate is always set to 0 outside
the support of the distribution. Then Nt ({0}) counts the number of failures
up to time t with an intensity λt ({0}) = ρ(t − Tn )X(t) on (Tn , Tn+1 ], where
X(t) = Vn on (Tn , Tn+1 ] indicates whether the system is up or down at t. The
corresponding internal intensity for Nt ({1}) is λt ({1}) = η(t − Tn )(1 − X(t)).
If the operating times are exponentially distributed with rate ρ > 0, the
expected number of failures up to time t is given by
EN_t({0}) = ρ ∫_0^t EX(s) ds.
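For exponential operating and repair times the mean EX(s) is the availability of a two-state Markov process, which has a standard closed form, so the formula above can be checked by simulation. In the sketch below the rates ρ, μ and the horizon are illustrative choices, not values from the text:

```python
import math
import random

rho, mu, t_end = 1.0, 4.0, 10.0   # failure rate, repair rate, horizon (illustrative)

# standard two-state Markov availability: EX(s) = mu/(rho+mu) + rho/(rho+mu) * e^{-(rho+mu)s}
integral = mu / (rho + mu) * t_end + rho / (rho + mu) ** 2 * (1 - math.exp(-(rho + mu) * t_end))
en_formula = rho * integral        # EN_t({0}) = rho * int_0^t EX(s) ds

random.seed(42)
def count_failures(horizon):
    # alternating renewal process starting in the operating state
    t, up, n = 0.0, True, 0
    while True:
        t += random.expovariate(rho if up else mu)
        if t > horizon:
            return n
        if up:
            n += 1
        up = not up

n_sim = 20_000
en_mc = sum(count_failures(t_end) for _ in range(n_sim)) / n_sim
print(en_formula, en_mc)   # the two numbers should be close
```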
Proof. We know that ρt (i)Xt (i) are intensities of Nt (i) and thus
M_t(i) = N_t(i) − ∫_0^t ρ_s(i) X_s(i) ds
defines a martingale (also with respect to the internal filtration of the super-
position because of the independence of the component processes). Define
and let ΔΦt− (i) be the left-continuous and therefore predictable version of
this process. Since at a jump of Nt (i) no other components change their status
(P -a.s.), we have
∫_0^t ΔΦ_s(i) dN_s(i) = ∫_0^t ΔΦ_{s−}(i) dN_s(i).
It follows that
N_t(Γ) − ∫_0^t λ_s(Γ) ds = Σ_{i=1}^m ∫_0^t ΔΦ_s(i) dM_s(i)
                         = Σ_{i=1}^m ∫_0^t ΔΦ_{s−}(i) dM_s(i).
But the last integral is the sum of integrals of bounded, predictable processes
and so by Theorem 3.45 is a martingale, which proves the assertion.
X_t = Σ_{n=1}^{N(t)} V_n
with N (t) = N (t, R), which gives the total damage up to t, and we want to
derive the infinitesimal characteristics or the “intensity” of this process, i.e.,
to establish an SSM representation. We might also think of repair models, in
which failures occur at random time points Tn . Upon failure, repair is per-
formed. If the cost for the nth repair is Vn , then Xt describes the accumulated
costs up to time t.
To derive an SSM representation of X, we first assume that we are given
a general intensity λt (C) of the marked point process with respect to some
filtration F. The main point now is to observe that
X_t = ∫_0^t ∫_S z N(ds, dz).
Then we can use Theorem 3.45, p. 84, with the predictable process H(s, z) = z
to see that
M_t^F = ∫_0^t ∫_S z (N(ds, dz) − λ_s(dz) ds)

is a martingale if E ∫_0^t ∫_S |z| λ_s(dz) ds < ∞. Equivalently, we see that X has
the F-SSM representation X = (f, M F ), with
f_s = ∫_S z λ_s(dz).
X_t = Σ_{n=1}^{N(t,S)} Y_n.
If the failure probability does not depend on the number of failures N and the shock magnitudes are deterministic, Y_n = 1, then we have

T = inf{T_n : W_n = 1},
where M is a martingale. The time to first failure admits a failure rate process, which is just the intensity of the counting process N.
The situation is as above; we only change the failure mechanism in that the
first time to failure T is defined as the first time the accumulated damage
reaches or exceeds a given threshold K ∈ R+ :
T = inf{t ∈ R_+ : Σ_{i=1}^{N(t,S)} Y_i ≥ K} = inf{T_n : Σ_{i=1}^n Y_i ≥ K}.
Then we get
Example 3.48. Let us again consider the compound Poisson case with shock
arrival rate ν and Fn = F for all n ∈ N0 . Since rn (s−Tn ) = ν and (K −XTn ) =
(K − Xt ) on {Tn < t < Tn+1 }, we get
I(T ≤ t) = ∫_0^t I(T > s) ν F̄((K − X_s)−) ds + M_t.
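A quick sanity check of the threshold mechanism: for deterministic unit shocks, Y_n ≡ 1, and threshold K = 3, the failure time T is simply the arrival time of the third shock, which is Gamma(3, ν) distributed with mean 3/ν. The parameter values below are arbitrary:

```python
import random

nu, K = 2.0, 3.0   # shock arrival rate and threshold (illustrative)
random.seed(7)

def hitting_time():
    # deterministic shocks Y_n = 1 arriving after Exp(nu) gaps; T = first time damage >= K
    t, damage = 0.0, 0.0
    while damage < K:
        t += random.expovariate(nu)
        damage += 1.0
    return t

n_sim = 100_000
mean_T = sum(hitting_time() for _ in range(n_sim)) / n_sim
print(mean_T)   # should be close to 3/nu = 1.5
```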
In the literature covering repair models special attention has been given to
so-called minimal repair models. Instead of replacing a failed system by a new
one, a repair restores the system to a certain degree. These minimal repairs
are often verbally described (and defined) as in the following:
• “The . . . assumption is made that the system failure rate is not disturbed
after performing minimal repair. For instance, after replacing a single tube
in a television set, the set as a whole will be about as prone to failure after
the replacement as before the tube failure” (Barlow and Hunter [30]).
• “A minimal repair is one which leaves the unit in precisely the condition
it was in immediately before the failure” (Phelps [129]).
The definition of the state of the system immediately before failure depends
to a considerable degree on the information one has about the system. So it
makes a difference whether all components of a complex system are observed
or only failure of the whole system is recognized. In the first case the lifetime
of the repaired component (tube of TV set) is associated with the residual
3.3 Point Processes in Reliability: Failure Time and Repair Models 91
system lifetime. In the second case the only information about the condition
of the system immediately before failure is the age. So a minimal repair in this
case would mean replacing the system (the whole TV set) by another one of
the same age that as yet has not failed. Minimal repairs of this kind are also
called black box or statistical minimal repairs, whereas the component-wise
minimal repairs are also called physical minimal repairs.
Example 3.49. We consider a simple two-component parallel system with inde-
pendent Exp(1) distributed component lifetimes X1 , X2 and allow for exactly
one minimal repair.
• Physical minimal repair. After failure at T = T1 = X1 ∨X2 the component
that caused the system to fail is repaired minimally. Since the component
lifetimes are exponentially distributed, the additional lifetime is given by
an Exp(1) random variable X3 independent of X1 and X2 . The total life-
time T1 + X3 has distribution
P(T1 + X3 > t) = e^{−t}(2t + e^{−t}).
• Black box minimal repair. The lifetime T = T1 = X1 ∨ X2 until the first failure of the system has distribution P(T1 ≤ t) = (1 − e^{−t})² and failure rate λ(t) = 2(1 − e^{−t})/(2 − e^{−t}). The additional lifetime T2 − T1 until the second failure is assumed to have conditional distribution

P(T2 − T1 ≤ x | T1 = t) = P(T1 ≤ t + x | T1 > t) = 1 − e^{−x} (2 − e^{−(t+x)})/(2 − e^{−t}).
Integrating leads to the distribution of the total lifetime T2 :
P(T2 > t) = e^{−t}(2 − e^{−t})(1 + t − ln(2 − e^{−t})).
It is (perhaps) no surprise that the total lifetime after a black box minimal
repair is stochastically greater than after a physical minimal repair:
P (T2 > t) ≥ P (T1 + X3 > t), for all t ≥ 0.
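The two closed-form survival functions can be compared numerically. The sketch below also checks the physical-repair formula by simulation; the sample size and the grid of time points are arbitrary choices:

```python
import math
import random

def surv_physical(t):
    # P(T1 + X3 > t) = e^{-t} (2t + e^{-t})
    return math.exp(-t) * (2 * t + math.exp(-t))

def surv_blackbox(t):
    # P(T2 > t) = e^{-t} (2 - e^{-t}) (1 + t - ln(2 - e^{-t}))
    return math.exp(-t) * (2 - math.exp(-t)) * (1 + t - math.log(2 - math.exp(-t)))

# Monte Carlo check of the physical-repair formula at one time point
random.seed(1)
n, t0 = 200_000, 1.0
hits = sum(max(random.expovariate(1), random.expovariate(1)) + random.expovariate(1) > t0
           for _ in range(n))
print(abs(hits / n - surv_physical(t0)))   # simulation error, should be small

# stochastic dominance of the black box repair on a grid of time points
print(all(surv_blackbox(t) >= surv_physical(t) for t in [0.1 * k for k in range(1, 60)]))
```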
Below we summarize some typical categories of minimal repair models, and
give some further examples. Let (Tn ) be a point process describing the failure
times at which instantaneous repairs are carried out and let N = (Nt ), t ∈ R+ ,
be the corresponding counting process
N_t = Σ_{n=1}^∞ I(T_n ≤ t).
For the parallel system in Example 3.49, one has λ(t) = 2(1 − e^{−t})/(2 − e^{−t}). If the intensity is a constant, λ_t ≡ λ, the times between successive repairs are independent Exp(λ) distributed random variables. This is the case in which repairs have the same effect as replacements.
(b) If in (a) the intensity is not deterministic but a random variable λ(ω), which is known at the time origin (λ is F_0-measurable), or, more generally, λ = (λ_t) is a stochastic process such that λ_t is F_0-measurable for all t ∈ R_+, i.e., F_0 = σ(λ_s, s ∈ R_+) and F_t = F_0 ∨ σ(N_s, 0 ≤ s ≤ t), then the process is called a doubly stochastic Poisson process or a Cox process. The process generalizes the basic model (a); the failure (minimal repair) times are not F^λ-stopping times, since F_t^λ = σ(λ) ⊂ F_0 and T_n is not F_0-measurable.
Also the Markov-modulated Poisson process of Example 3.14, p. 65, where
the intensity λt = λYt is determined by a Markov chain (Yt ), is an MRP.
Indeed, it is a slight modification of a doubly stochastic Poisson process
in that the filtration Ft = σ(Ns , Ys , 0 ≤ s ≤ t) does not include the
information about the paths of λ in F0 .
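A doubly stochastic Poisson process can be sketched by drawing the rate first and then running a Poisson process with that rate; the first failure time then has survival function E[e^{−λt}]. The two-point mixing distribution and the time point below are purely illustrative:

```python
import math
import random

rates, probs, t0 = (1.0, 3.0), (0.5, 0.5), 0.8   # illustrative F0-measurable rate and time point
p_exact = sum(p * math.exp(-lam * t0) for lam, p in zip(rates, probs))   # P(T1 > t0)

random.seed(3)
n_sim = 200_000
survived = 0
for _ in range(n_sim):
    lam = random.choices(rates, probs)[0]        # the rate is known at the time origin
    survived += random.expovariate(lam) > t0     # first point of a Poisson process with rate lam
p_mc = survived / n_sim
print(p_exact, p_mc)   # the mixture formula against the simulated frequency
```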
3.3 Point Processes in Reliability: Failure Time and Repair Models 93
(c) For the physical minimal repair in Example 3.49, λt = I(X1 ∧ X2 ≤ t).
In this case Fλ is generated by the minimum of X1 and X2 . The first failure
time of the system, T1 , equals X1 ∨ X2 , which is not an Fλ -stopping time.
The filtration generated by λt comprises no information about X1 ∨ X2 .
In the following we give another characterization of an MRP.
Theorem 3.51. Assume that P(T_n < ∞) = 1 for all n ∈ N and that there exist versions of the conditional probabilities F_t(n) = E[I(T_n ≤ t)|F_t^λ] such that, for each n ∈ N, (F_t(n)), t ∈ R_+, is an (F^λ-progressive) stochastic process.
(i) Then the point process (Tn ) is an MRP if and only if for each n ∈ N
there exists some t ∈ R+ such that
P (0 < Ft (n) < 1) > 0.
(ii) If furthermore (Ft ) = (Ft (1)) has P -a.s. continuous paths of bounded
variation on finite intervals, then
1 − F_t = exp(−∫_0^t λ_s ds).
Proof. (i) To prove (i) we show that P (Ft (n) ∈ {0, 1}) = 1 for all t ∈ R+
is equivalent to Tn being an Fλ -stopping time. Since we have F0 (n) = 0
and by the dominated convergence theorem for conditional expectations
lim_{t→∞} F_t(n) = 1,
the assumption that P (Ft (n) ∈ {0, 1}) = 1 for all t ∈ R+ is equivalent
to Ft (n) = I(Tn ≤ t) (P -a.s.). But as (Ft (n)) is adapted to Fλ this
means that Tn is an Fλ -stopping time. This shows that under the given
assumptions P(0 < F_t(n) < 1) > 0 is equivalent to T_n not being an F^λ-stopping time.
(ii) For the second assertion we apply the exponential formula (3.16) as de-
scribed on p. 80.
Theorem 3.53. If 0 < p(x) < 1 for all x holds true, then the point process
(Tn ) driven by the intensity (3.21) is an MRP.
Remark 3.54. (1) In the case p(x) = c for some c, 0 < c ≤ 1, the process is a
nonhomogeneous Poisson process with intensity λt = ν(t)c and therefore an
MRP. (2) The condition 0 < p(x) < 1 excludes the case of threshold models for which p(x) = 1 for x ≥ K and p(x) = 0 otherwise, for some constant K > 0. For
such a threshold model we have
T1 = inf{t ∈ R+ : λt ≥ ν(t)},
if P(Y_k ≤ x) > 0 for all x > 0. In this case T_1 is an F^λ-stopping time and consequently (T_n) is not an MRP.
This time to first system failure is governed by the hazard rate process λ for
t ∈ [0, T ) (cf. Corollary 3.30 on p. 76):
λ_t = Σ_{i=1}^m (Φ(1_i, X_t) − Φ(0_i, X_t)) λ_t(i).   (3.22)
Our aim is to extend the definition of λt also on {T1 ≤ t}. To this end
we extend the definition of Xt (i) on {Zi ≤ t} following the idea that upon
system failure the component which caused the failure is repaired minimally
in the sense that it is restored and operates at the same failure rate as if it had not failed before. So we define X_t(i) = 0 on {Z_i ≤ t} if the first failure
of component i caused no system failure, otherwise we set Xt (i) = 1 on
{Zi ≤ t} (note that in the latter case the value of Xt (i) is redefined for
t = Zi ). In this way we define Xt and by (3.22) the process λt for all t ∈
R+ . This completed intensity λt induces a point process (Nt ) which counts
the number of minimal repairs on the component level. The corresponding
complete information filtration F = (Ft ), t ∈ R+ , is given by
which describe the time when component i becomes critical, i.e., the time from
which on a failure of component i would lead to system failure. It follows that
λ_t = Σ_{i=1}^m I(Y_i ≤ t) λ_t(i),
it is the ordinary failure rate of T1 . For the time to the first system failure we
have the two representations
I(T_1 ≤ t) = ∫_0^t I(T_1 > s) λ_s ds + M_t   (F-level)
           = ∫_0^t I(T_1 > s) λ̂_s ds + M̄_t   (A^T-level).
describes the MRP on the A^T-level. Comparing these two information levels, Example 3.49 suggests EN_t ≥ EN̄_t for all positive t. A general comparison, also for arbitrary subfiltrations, seems to be an open problem (cf. [4, 124]).
As in the minimal repair section, let (Tn ) be a point process describing failure
times at which instantaneous repairs are carried out and let N = (Nt ), t ∈ R+ ,
be the corresponding counting process. We assume that N is adapted to some
filtration F and has F-intensity (λt ).
λ_t = Σ_{n=0}^∞ r(t − T_n + A_n) I(T_n < t ≤ T_{n+1}),  A_0 = T_0 = 0.
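Between repairs the system thus ages from the virtual age A_n it was left with after the nth repair. A minimal sketch: with the Weibull-type rate r(v) = βv^{β−1} the next interarrival time can be sampled exactly by inverting the integrated rate; the update A_n = q(A_{n−1} + U_n) used below is one common imperfect-repair rule and is only an illustrative assumption here, not the model of the text. For q = 1 the virtual age equals the true age, so the failure times form a nonhomogeneous Poisson process with EN_t = t^β.

```python
import random

beta = 2.0   # illustrative Weibull-type rate r(v) = beta * v**(beta - 1)

def next_gap(a):
    # solve int_a^{a+u} r(v) dv = E with E ~ Exp(1): u = (a^beta + E)^{1/beta} - a
    e = random.expovariate(1.0)
    return (a ** beta + e) ** (1.0 / beta) - a

def n_failures(q, horizon):
    t, a, n = 0.0, 0.0, 0
    while True:
        u = next_gap(a)
        if t + u > horizon:
            return n
        t, n = t + u, n + 1
        a = q * (a + u)   # virtual age after an imperfect repair (assumed rule)

random.seed(11)
n_sim = 20_000
mean_n = sum(n_failures(1.0, 2.0) for _ in range(n_sim)) / n_sim
print(mean_n)   # q = 1 (minimal repair): expected count on [0, 2] is 2**beta = 4
```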
We want to show that the indicator process Vt = I(τ (u) ≤ t) has a semi-
martingale representation
V_t = I(τ ≤ t) = ∫_0^t I(τ > s) h_s ds + M_t,  t ∈ R_+,   (3.23)
Ft = σ(Ns , Ys , Xi , 0 ≤ s ≤ t, i = 1, . . . , Nt ).
The failure rate process h = (ht ), t ∈ R+ , can be derived in the same way
as was done for shock models with failures of threshold type (cf. p. 89). Note
that ruin can only occur at a failure time; therefore, the ruin time is a hitting
time of a compound point process:
τ = inf{t ∈ R_+ : A_t = Σ_{n=1}^{N_t} B_n ≥ u} = inf{T_n : A_{T_n} ≥ u},
Lemma 3.56. Let τ = τ (u) be the ruin time and F the distribution of the
claim sizes, F̄ (x) = F ((x, ∞)) = P (X1 > x), x ∈ R. Then the F-failure rate
process h is given by
h_t = λ_{Y_t} F̄(R_t−) = Σ_{i=1}^m λ_i I(Y_t = i) F̄(R_t−),  t ∈ R_+.
Note that the paths of R have only countable numbers of jumps such that
under the integral sign Rs − can be replaced by Rs . Taking expectations on
both sides of (3.24) one gets by Fubini’s theorem
ψ(u, t) = ∫_0^t E[I(τ(u) > s) λ F̄(R_s)] ds   (3.25)
        = ∫_0^t (1 − ψ(u, s)) λ E[F̄(R_s) | τ(u) > s] ds.
This shows that the (possibly defective) distribution of τ (u) has the haz-
ard rate
λE[F̄ (Rt )|τ (u) > t].
Now let N^X be the renewal process generated by the sequence (X_i), i ∈ N, N_t^X = sup{k ∈ N_0 : Σ_{i=1}^k X_i ≤ t}, and A(u, t) = ∫_0^t a(u, s) ds, where a(u, s) = λP(N^X_{u+cs} = N_s). Then bounds for ψ(u, t) can be established.
Proof. For the lower bound we use the representation (3.26) and simply ob-
serve that E[F̄ (Rs )|τ (u) > s] ≥ F̄ (u + cs).
For the upper bound we start with formula (3.24). Since {τ (u) > t} ⊂
{Rt ≥ 0}, we have
V_t = ∫_0^t I(τ(u) > s) λ F̄(R_s) ds + M_t
    ≤ ∫_0^t I(R_s ≥ 0) λ F̄(R_s) ds + M_t
It remains to show that a(u, t) = λE[I(R_t ≥ 0) F̄(R_t)]. Denoting the k-fold convolution of F by F^{∗k} and T_k = Σ_{i=1}^k X_i, it follows by the independence of the claim arrival process and (X_i), i ∈ N, that

E[I(R_t ≥ 0) F̄(u + ct − Σ_{i=1}^{N_t} X_i)]
= Σ_{k=0}^∞ E[I(u + ct − Σ_{i=1}^k X_i ≥ 0) F̄(u + ct − Σ_{i=1}^k X_i)] P(N_t = k)
= Σ_{k=0}^∞ ∫_0^{u+ct} F̄(u + ct − x) dF^{∗k}(x) P(N_t = k)
= Σ_{k=0}^∞ {F^{∗k}(u + ct) − F^{∗(k+1)}(u + ct)} P(N_t = k)
= Σ_{k=0}^∞ P(N^X_{u+ct} = k) P(N_t = k)
= P(N^X_{u+ct} = N_t),
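The identity E[I(R_t ≥ 0) F̄(R_t)] = P(N^X_{u+ct} = N_t) is easy to illustrate by simulation: the right-hand event occurs exactly when the claims so far fit below u + ct while one further (independent) claim would overshoot it. Exponential claims and all parameter values below are arbitrary choices:

```python
import math
import random

lam, mu, u, c, t0 = 1.0, 2.0, 0.5, 0.5, 2.0   # claim rate, claim-size rate, reserve, premium, time
random.seed(5)

n_sim = 100_000
lhs = rhs = 0.0
for _ in range(n_sim):
    # total claim amount of a compound Poisson process up to t0
    total, s = 0.0, random.expovariate(lam)
    while s <= t0:
        total += random.expovariate(mu)
        s += random.expovariate(lam)
    r = u + c * t0 - total                                 # risk reserve R_t
    lhs += math.exp(-mu * r) if r >= 0 else 0.0            # I(R_t >= 0) * Fbar(R_t)
    rhs += 1.0 if r >= 0 and random.expovariate(mu) > r else 0.0   # event {N^X_{u+ct} = N_t}
print(lhs / n_sim, rhs / n_sim)   # two estimates of the same quantity
```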
In this chapter we establish methods and formulas for computing various per-
formance measures of monotone systems of repairable components. Emphasis
is placed on the point availability, the distribution of the number of failures
in a time interval, and the distribution of downtime of the system. A number
of asymptotic results are formulated and proved, mainly for systems having
highly available components.
The performance measures are introduced in Sect. 4.1. In Sects. 4.3–4.6 re-
sults for binary monotone systems are presented. Since many of these results
are based on the one-component case, we first give in Sect. 4.2 a rather compre-
hensive treatment of this case. Section 4.7 presents generalizations and related
models. Section 4.7.1 covers multistate monotone systems. In Sects. 4.2–4.5
and 4.7.1 it is assumed that there are at least as many repair facilities (chan-
nels) as components. In Sect. 4.7.2 we consider a parallel system having r
repair facilities, where r is less than the number of components. Attention is
drawn to the case with r = 1. Finally, in Sect. 4.7.3 we present models for
analysis of passive redundant systems.
In this chapter we focus on the situation in which the components have exponential lifetime distributions. See Sect. 4.7.1, p. 163, and Bibliographic Notes, p. 173, for some comments concerning the more general case of nonexponential lifetimes.
P (NJ ≤ k), k ∈ N0 ,
M (J) = ENJ ,
A[u, v] = P (Φt = 1, ∀t ∈ [u, v])
= P (Φu = 1, N(u,v] = 0).
P (YJ ≤ y), y ∈ R+ ,
A_D(J) = EY_J / |J|,
where |J| denotes the length of the interval J. The measure AD (J) is in
the literature sometimes referred to as the interval unavailability, but we
shall not use this term here.
The above performance measures relate to a fixed point in time or a finite
time interval. Often it is more attractive, in particular from a computational
point of view, to consider the asymptotic limit of the measure (as t, u or
v → ∞), suitably normalized (in most cases such limits exist). In the following
we shall consider both the above measures and suitably defined limits.
Fig. 4.1. Time evolution of a failure and repair process for a one-component system
starting at time t = 0 in the operating state
μF < ∞, μG < ∞.
S_n = T_1 + Σ_{k=1}^{n−1} (R_k + T_{k+1}),  n ∈ N,

and

S_n◦ = Σ_{k=1}^n (T_k + R_k),  n ∈ N.
By convention, S0 = S0◦ = 0, and sums over empty sets are zero. We see that
Sn represents the nth failure time, and Sn◦ represents the completion time of
the nth repair.
The Sn sequence generates a modified (delayed) renewal process N with
renewal function M . The first interarrival time has distribution F . All other
interarrival times have distribution F ∗G (convolution of F and G), with mean
μF + μG . Let H (n) denote the distribution function of Sn . Then
H (n) = F ∗ (F ∗ G)∗(n−1) ,
108 4 Availability Analysis of Complex Systems
(cf. (B.2), p. 274, in Appendix B). The Sn◦ sequence generates an ordinary
renewal process N ◦ with renewal function M ◦ . The interarrival times, Tk +Rk ,
have distribution F ∗G, with mean μF +μG . Let H ◦(n) denote the distribution
function of Sn◦ . Then
H ◦(n) = (F ∗ G)∗n .
Let αt denote the forward recurrence time at time t, i.e., the time from t to
the next event:
αt = SNt +1 − t on {Xt = 1}
and
α_t = S◦_{N◦_t+1} − t on {X_t = 0}.
Hence, given that the system is up at time t, the forward recurrence time αt
equals the time to the next failure time. If the system is down at time t, the
forward recurrence time equals the time to complete the repair. Let Fαt and
Gαt denote the conditional distribution functions of αt given that Xt = 1 and
Xt = 0, respectively. Then we have for x ∈ R
and

G_{α_t}(x) = P(α_t ≤ x | X_t = 0) = P(S◦_{N◦_t+1} − t ≤ x | X_t = 0).
Similarly for the backward recurrence time, we define βt , Fβt , and Gβt . The
backward recurrence time βt equals the age of the system if the system is up
at time t and the duration of the repair if the system is down at time t, i.e.,
β_t = t − S◦_{N◦_t} on {X_t = 1}

and

β_t = t − S_{N_t} on {X_t = 0}.
which gives

A(t) = EX_t = F̄(t) + Σ_{n=1}^∞ ∫_0^t F̄(t − x) dH^{◦(n)}(x)
     = F̄(t) + ∫_0^t F̄(t − x) dM◦(x).
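For exponential uptimes (rate λ) and downtimes (rate μ) the renewal density of N◦ is m◦(x) = λμ(1 − e^{−(λ+μ)x})/(λ + μ), and A(t) reduces to the standard two-state Markov form A(t) = μ/(λ+μ) + λe^{−(λ+μ)t}/(λ+μ). The sketch below checks the renewal-equation representation numerically; the rates and the time point are arbitrary:

```python
import math

lam, mu, t = 0.5, 2.0, 3.0   # failure rate, repair rate, time point (illustrative)

def m_dens(x):
    # renewal density of the ordinary renewal process with interarrival distribution F*G
    return lam * mu / (lam + mu) * (1 - math.exp(-(lam + mu) * x))

# A(t) = Fbar(t) + int_0^t Fbar(t - x) m(x) dx, evaluated by the trapezoidal rule
n = 20_000
h = t / n
integral = sum(math.exp(-lam * (t - i * h)) * m_dens(i * h) * (h if 0 < i < n else h / 2)
               for i in range(n + 1))
a_renewal = math.exp(-lam * t) + integral

a_closed = mu / (lam + mu) + lam / (lam + mu) * math.exp(-(lam + mu) * t)
print(a_renewal, a_closed)   # the two values should agree to high accuracy
```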
Ā(t) ≤ λμG ,
Some closely related results are stated below in Propositions 4.1 and 4.2.
Proposition 4.1. The probability of n failures occurring in [0, v] and the sys-
tem being up at time v is given by
P(N_v = n, X_v = 1) = ∫_0^v F̄(v − x) d(F ∗ G)^{∗n}(x),  n ∈ N_0.
Proof. The result clearly holds for n = 0. For n ≥ 1, the result follows by
observing that
Proposition 4.2. The probability of n failures occurring in [0, v] and the sys-
tem being down at time v is given by
P(N_v = n, X_v = 0) = ∫_0^v Ḡ(v − x) dH^{(n)}(x) for n ∈ N, and 0 for n = 0.
From Propositions 4.1 and 4.2 we can deduce several results, for example,
a formula for P(N_u = n | X_u = 1) using that

P(N_u = n | X_u = 1) = P(N_u = n, X_u = 1) / A(u).
and
Proof. To establish the formula for P (N(u,v] ≤ n), we condition on the state
of the system at time u:
P(N_{(u,v]} ≤ n) = Σ_{j=0}^1 P(N_{(u,v]} ≤ n | X_u = j) P(X_u = j).
From this equality the formula follows trivially for n = 0. For n ∈ N, we need
to show that the following two equalities hold true:
4.2 One-Component Systems 111
But (4.4) follows directly from (4.3) with the forward recurrence time distri-
bution given {Xu = 1} as the first operating time distribution. Formula (4.5)
is established analogously.
The formula for A[u, v] is seen to hold observing that
If the downtimes are much smaller than the uptimes in probability (which
is the common situation in practice), then N is close to a renewal process
generated by all the uptimes. Hence, if the times to failure are exponentially
distributed, the process N is close to a homogeneous Poisson process. Formal
asymptotic results will be established later, see Sect. 4.4.
In the following two propositions we relate the distribution of the forward
and backward recurrence times and the renewal functions M and M ◦ .
Proposition 4.4. The probability that the system is up (down) at time t and
the forward recurrence time at time t is greater than w is given by
Proposition 4.5. The probability that the system is up (down) at time t and
the backward recurrence time at time t is greater than w is given by
P(X_t = 1, β_t > w) = F̄(t) + ∫_0^{t−w} F̄(t − x) dM◦(x) for w ≤ t, and 0 for w > t.   (4.10)

P(X_t = 0, β_t > w) = ∫_0^{t−w} Ḡ(t − x) dM(x) for w ≤ t, and 0 for w > t.   (4.11)
Proof. The proof is similar to the proof of Proposition 4.4. Replace the indi-
cator function in the sums in (4.8) and (4.9) by
I(Sn◦ + Tn+1 > t, Sn◦ + w < t)
and
I(Sn + Rn > t, Sn + w < t),
respectively.
Theorem 4.6. The asymptotic distributions of the state process (Xt ) and the
forward (backward) recurrence times at time t are given by
lim_{t→∞} P(X_t = 1, α_t > w) = ∫_w^∞ F̄(x) dx / (μ_F + μ_G)

lim_{t→∞} P(X_t = 0, α_t > w) = ∫_w^∞ Ḡ(x) dx / (μ_F + μ_G)

lim_{t→∞} P(X_t = 1, β_t > w) = ∫_w^∞ F̄(x) dx / (μ_F + μ_G)   (4.12)

lim_{t→∞} P(X_t = 0, β_t > w) = ∫_w^∞ Ḡ(x) dx / (μ_F + μ_G).
Proof. The results follow by applying the Key Renewal Theorem (see
Appendix B, p. 277) to formulas (4.6), (4.7), (4.10), and (4.11).
Let us introduce
F_∞(w) = ∫_0^w F̄(x) dx / μ_F,   (4.13)

G_∞(w) = ∫_0^w Ḡ(x) dx / μ_G.   (4.14)
and

lim_{t→∞} Ḡ_{α_t}(w) = lim_{t→∞} Ḡ_{β_t}(w) = Ḡ_∞(w).   (4.15)
Proof. To establish these formulas, we use (4.2) (see p. 109), Theorem 4.6,
and identities like

P(α_t > w | X_t = 1) = P(X_t = 1, α_t > w) / A(t).
Proof. The result follows from the expression for the distribution of the num-
ber of failures given in Theorem 4.3, p. 110, combined with the limiting avail-
ability formula (4.2), p. 109, and Proposition 4.7.
The expected number of system failures can be found from the distribu-
tion function. Obviously, M (v) ≈ M ◦ (v) for large v. The exact relationship
between M (v) and M ◦ (v) is given in the following proposition.
Proposition 4.10. The difference between the renewal functions M (v) and
M ◦ (v) equals the unavailability at time v, i.e.,
Proof. Using that P (Nv ≤ n) = 1 − (F ∗ G)∗n ∗ F (v) (by (4.3), p. 109) and
the expression (4.1), p. 108, for the availability A(t), we obtain
M(v) = Σ_{n=1}^∞ P(N_v ≥ n) = Σ_{n=0}^∞ (F ∗ G)^{∗n} ∗ F(v)
     = F(v) + M◦ ∗ F(v) = M◦(v) + Ā(v),
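In the exponential case both renewal functions are explicit (m(t) = λA(t) gives M, and inverting the Laplace transform of F ∗ G gives M◦), so the proposition can be verified directly; the rates below are arbitrary:

```python
import math

lam, mu = 0.7, 3.0   # failure and repair rates (illustrative)
for v in (0.5, 1.0, 2.0, 10.0):
    s = lam + mu
    e = 1 - math.exp(-s * v)
    M = lam * mu * v / s + lam ** 2 * e / s ** 2    # delayed process, system starts up
    M0 = lam * mu * v / s - lam * mu * e / s ** 2   # ordinary process
    Abar = lam * e / s                              # unavailability at v
    assert abs((M - M0) - Abar) < 1e-12
print("M(v) - M0(v) = Abar(v) holds at all tested points")
```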
where λ is the failure rate function and βv is the backward recurrence time
at time v, i.e., the relative age of the system at time v, cf. Sect. 3.3.2, p. 85.
We have m(v) = Eηv , where m(v) is the renewal density of M (v). Thus if the
system has an exponential lifetime distribution with failure rate λ,
In general,

m(v) ≤ [sup_{s≤v} λ(s)] A(v).   (4.19)
This bound can be used to establish an upper bound also for the unavailability
Ā(t).
Proposition 4.11. The unavailability at time t, Ā(t), satisfies
Ā(t) ≤ sup_{s≤t} λ(s) ∫_0^t Ḡ(u) du ≤ [sup_{s≤t} λ(s)] μ_G.   (4.20)
It follows that

Ā(t) ≤ sup_{s≤t} λ(s) ∫_0^t Ḡ(t − x) dx = sup_{s≤t} λ(s) ∫_0^t Ḡ(u) du ≤ [sup_{s≤t} λ(s)] μ_G,
Proof. These results follow directly from renewal theory, see Appendix B,
pp. 276–278.
First we formulate and prove some results related to the mean of the downtime
in the interval [0, u]. As before (cf. Sect. 4.1, p. 106), we let Yu represent the
downtime in the interval [0, u].
Asymptotically, the (expected) portion of time the system is down equals the
limiting unavailability, i.e.,
lim_{u→∞} A_D(u) = lim_{u→∞} EY_u/u = Ā.   (4.27)

With probability one,

lim_{u→∞} Y_u/u = Ā.   (4.28)
= E ∫_0^u (1 − Φ_t) dt = ∫_0^u Ā(t) dt.
This proves (4.26). Formula (4.27) follows by using (4.26) and the limiting
availability formula (4.2), p. 109. Alternatively, we can use the Renewal Re-
ward Theorem (Theorem B.15, p. 280, in Appendix B), interpreting Yu as
a reward. From this theorem we can conclude that EYu /u converges to the
ratio of the expected downtime in a renewal cycle and the expected length of
a cycle, i.e., to the limiting unavailability Ā. The Renewal Reward Theorem
also proves (4.28).
Now we look into the problem of finding formulas for the downtime distri-
bution.
Let Nsop denote the number of system failures after s units of operational
time, i.e.,
N_s^{op} = Σ_{n=1}^∞ I(Σ_{k=1}^n T_k ≤ s).
Note that

N_s^{op} ≥ n ⇔ Σ_{k=1}^n T_k ≤ s,  n ∈ N.   (4.29)
Let Zs denote the total downtime associated with the operating time s, but
not including s, i.e.,
Z_s = Σ_{i=1}^{N^{op}_{s−}} R_i,
where

N^{op}_{s−} = lim_{u→s−} N_u^{op}.
Define
Cs = s + Zs .
We see that C_s represents the calendar time after an operating time of s time units and the completion of the repairs associated with the failures that occurred before s.
The following theorem gives an exact expression of the probability distri-
bution of Yu , the total downtime in [0, u].
This first equality follows by noting that the event Yu ≤ y is equivalent to the
event that the uptime in the interval [0, u] is equal to or longer than u−y. This
means that the point in time when the total uptime of the system equals u − y
must occur before or at u, i.e., Cu−y ≤ u. Now using a standard conditional
probability argument it follows that
P(Z_{u−y} ≤ y) = Σ_{n=0}^∞ P(Z_{u−y} ≤ y | N^{op}_{(u−y)−} = n) P(N^{op}_{(u−y)−} = n)
             = Σ_{n=0}^∞ G^{∗n}(y) P(N^{op}_{(u−y)−} = n)
             = Σ_{n=0}^∞ G^{∗n}(y) P(N^{op}_{u−y} = n).
We have used that the repair times are independent of the process Nsop and
that F is continuous. This proves (4.30). Formula (4.31) follows by using
(4.29).
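When F is exponential with rate λ the counts N^op_{u−y} are Poisson with mean λ(u − y), and when G is exponential with rate μ the convolutions G^{∗n} are Erlang distribution functions, so (4.30) can be evaluated and compared with a direct simulation of the downtime in [0, u]. All parameter values below are illustrative:

```python
import math
import random

lam, mu, u, y = 1.0, 5.0, 2.0, 0.5   # failure rate, repair rate, interval length, downtime level

def erlang_cdf(n, rate, x):
    # G^{*n}(x), CDF of the sum of n Exp(rate) variables; G^{*0} is the unit step at 0
    if n == 0:
        return 1.0 if x >= 0 else 0.0
    return 1.0 - sum(math.exp(-rate * x) * (rate * x) ** k / math.factorial(k) for k in range(n))

def poisson_pmf(k, m):
    return math.exp(-m) * m ** k / math.factorial(k)

# formula (4.30): P(Y_u <= y) = sum_n G^{*n}(y) P(N^op_{u-y} = n)
m = lam * (u - y)
p_formula = sum(erlang_cdf(n, mu, y) * poisson_pmf(n, m) for n in range(60))

# direct simulation of the downtime in [0, u], starting in the operating state
random.seed(0)
def downtime(horizon):
    t, down, up = 0.0, 0.0, True
    while t < horizon:
        d = random.expovariate(lam if up else mu)
        if not up:
            down += min(d, horizon - t)
        t += d
        up = not up
    return down

n_sim = 100_000
p_mc = sum(downtime(u) <= y for _ in range(n_sim)) / n_sim
print(p_formula, p_mc)   # the two probabilities should be close
```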
In the case that F is exponential with failure rate λ the following simple
bounds apply
The lower bound follows by including only the first two terms of the sum in
(4.30), observing that Ntop is Poisson distributed with mean λt, whereas the
upper bound follows by using (4.30) and the inequality
In the case that the interval is rather long, the downtime will be approximately
normally distributed, as is shown in Theorem 4.15 below.
Fig. 4.2. Time evolution of a failure and repair process for a one-component system
starting at time t = 0 in the failure state
where

τ² = (μ_F² σ_G² + μ_G² σ_F²) / (μ_F + μ_G)³.
Proof. The result follows by applying Theorem B.17, p. 280, in Appendix B,
observing that the length of the first renewal cycle equals S1◦ = T1 + R1 , the
downtime in this cycle equals YS1◦ = R1 and
The asymptotic results established above provide good approximations for the
performance measures related to a given point in time or an interval. Based
on the asymptotic values we can define a stationary (steady-state) process
having these asymptotic values as their distributions and means. To define
such a process in our case, we generalize the model analyzed above by allowing
X0 to be 0 or 1.
Thus the time evolution of the process is as shown in Fig. 4.2 or as shown
in Fig. 4.1 (p. 107) beginning with an uptime. The process is characterized
by the parameters A(0), F ∗ (t), F (t), G∗ (t), G(t), where F ∗ (t) denotes the
distribution of the first uptime provided that the system starts in state 1 at
time 0 (i.e., X0 = 1) and G∗ (t) denotes the distribution of the first downtime
provided that the system starts in state 0 at time 0 (i.e., X0 = 0). Now
assuming that F ∗ (t) and G∗ (t) are equal to the asymptotic distributions of
the recurrence times, i.e., F∞ (t) and G∞ (t), respectively, and A(0) = A, then
it can be shown that the process (Xt , αt ) is stationary; see Birolini [44]. This
means that we have, for example,
A(t) = A, ∀t ∈ R+ ,
∞
F̄ (x) dx
A[u, u + w] = w , ∀u, w ∈ R+ ,
μF + μG
w
M (u, u + w] = , ∀u, w ∈ R+ .
μF + μG
The following results show that the point availability (limiting availability) of
a monotone system is equal to the reliability function h with the component
reliabilities replaced by the component availabilities Ai (t) (Ai ).
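This substitution rule is easy to exercise numerically. As an illustration (the structure function and all rates are invented for this sketch), take component 1 in series with the parallel pair {2, 3}, so h(p) = p_1(1 − (1 − p_2)(1 − p_3)), and give every component exponential up- and downtimes:

```python
import math

def avail(lam, mu, t):
    # one-component availability for exponential up/down times
    s = lam + mu
    return mu / s + lam / s * math.exp(-s * t)

def h(p1, p2, p3):
    # reliability function: component 1 in series with the parallel pair {2, 3}
    return p1 * (1 - (1 - p2) * (1 - p3))

rates = [(0.1, 1.0), (0.2, 0.8), (0.3, 1.5)]   # (failure rate, repair rate), illustrative
t = 5.0
A_t = h(*(avail(l, m, t) for l, m in rates))       # system availability at time t
A_limit = h(*(m / (l + m) for l, m in rates))      # limiting system availability
print(A_t, A_limit)
```

Since every A_i(t) decreases to its limit A_i and h is nondecreasing in each argument, A_t stays above A_limit for all t.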
Theorem 4.16. The system availability at time t, A(t), and the limiting sys-
tem availability, limt→∞ A(t), are given by
4.3 Point Availability and Mean Number of System Failures 121
We first state some results established in Sect. 3.3.2, cf. formula (3.18), p. 86.
See also (4.17) and (4.18), p. 114.
= Σ_{i=1}^n ∫_0^u [h(1_i, A(t)) − h(0_i, A(t))] m_i(t) dt
= Σ_{i=1}^n ∫_0^u [h(1_i, A(t)) − h(0_i, A(t))] Eη_t(i) dt,
≤ uλ̃,

where λ̃ = Σ_{i=1}^n λ_i.
Theorem 4.19. The expected number of system failures per unit of time is
asymptotically given by
lim_{u→∞} EN_u/u = Σ_{i=1}^n [h(1_i, A) − h(0_i, A)] / (μ_{F_i} + μ_{G_i}).   (4.39)
Proof. To prove these results, we make use of formula (4.35). Dividing this
formula by u and using the Elementary Renewal Theorem (see Appendix
B, p. 277), formula (4.37) can be shown to hold noting that E[Φ(1i , Xt )
− Φ(0i , Xt )] → [h(1i , A) − h(0i , A)] as t → ∞. Let h∗i (t) = E[Φ(1i , Xt ) −
Φ(0i , Xt )] and h∗i its limit as t → ∞. Then we can write formula (4.35)
divided by u in the following form:
Σ_{i=1}^n { h_i^∗ M_i(u)/u + (1/u) ∫_0^u [h_i^∗(t) − h_i^∗] dM_i(t) }.
Definition 4.20. The limit of ENu /u, given by formula (4.37), is referred to
as the system failure rate and is denoted λΦ , i.e.,
where mi (t) denotes the renewal density of Mi (t), we see that EλΦ (t) →
λΦ as t → ∞ provided that mi (t) → 1/(μFi + μGi ). From renewal theory,
see Theorem B.10, p. 278, in Appendix B, we know that if the renewal
cycle lengths Tik + Rik have a density function h with h(t)p integrable for
some p > 1, and h(t) → 0 as t → ∞, then Mi has a density mi such that
mi (t) → 1/(μFi + μGi ) as t → ∞. See the remark following Theorem B.10
for other sufficient conditions for mi (t) → 1/(μFi + μGi ) to hold. If com-
ponent i has an exponential lifetime distribution with parameter λi , then
mi (t) = λi Ai (t), (cf. (4.18), p. 114), which converges to 1/(μFi + μGi ).
It is intuitively clear that the process X is regenerative if the components
have exponential lifetime distributions. Before we prove this formally, we for-
mulate a result related to ENu◦ : the expected number of visits to the best
state (1, 1, . . . , 1) in [0, u]. The result is analogous to (4.35) and (4.37).
Lemma 4.22. The expected number of visits to state (1, 1, . . . , 1) in [0, u] is
given by
$$EN_u^\circ = \sum_{i=1}^{n}\int_0^u \prod_{j\ne i} A_j(t)\,dM_i^\circ(t). \qquad(4.42)$$
Furthermore,
$$\lim_{u\to\infty}\frac{EN_u^\circ}{u} = \sum_{i=1}^{n}\frac{1}{\mu_{F_i}}\prod_{j=1}^{n}A_j. \qquad(4.43)$$
This is as expected noting that the number of visits to state (1, 1, . . . , 1) then
should be approximately equal to the average number of component failures
per unit of time. If a component fails, it will normally be repaired before any
other component fails, and, consequently, the process again returns to state
(1, 1, . . . , 1).
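Formula (4.43) can be checked by simulation. The following sketch (with assumed parameter values: two components, exponential lifetimes with rate λ = 0.5, constant repair times of length 1) counts the re-entries into state (1, 1) over a long horizon and compares the empirical rate with $\sum_i(1/\mu_{F_i})\prod_j A_j$:

```python
import random

def count_visits_to_all_up(lam, repair, horizon, rng):
    """Count re-entries into state (1,1) for two independent components with
    exponential lifetimes (rate lam) and constant repair times."""
    up = [True, True]
    nxt = [rng.expovariate(lam), rng.expovariate(lam)]  # absolute event times
    visits = 0
    while True:
        i = 0 if nxt[0] <= nxt[1] else 1
        t = nxt[i]
        if t >= horizon:
            return visits
        if up[i]:                        # failure of component i
            up[i] = False
            nxt[i] = t + repair
        else:                            # repair of component i completes
            up[i] = True
            nxt[i] = t + rng.expovariate(lam)
            if up[0] and up[1]:          # process re-enters the best state
                visits += 1

rng = random.Random(5)
lam, muG, u = 0.5, 1.0, 20000.0
muF = 1.0 / lam
A = muF / (muF + muG)                    # limiting component availability
rate = 2 * (1.0 / muF) * A * A           # formula (4.43) for two components
est = count_visits_to_all_up(lam, muG, u, rng) / u
print(est, rate)
```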
Theorem 4.23. If all the components have exponential lifetimes, then X is
a regenerative process.
Proof. Because of the memoryless property of the exponential distribution
and the fact that all component uptimes and downtimes are independent, we
can conclude that X is regenerative (as defined in Appendix B, p. 281) if we
can prove that $P(S < \infty) = 1$, where $S = \inf\{t > S' : X_t = (1, 1, \ldots, 1)\}$
and $S' = \min\{T_{i1} : i = 1, 2, \ldots, n\}$. It is clear that if X returns to the state
(1, 1, . . . , 1), then the process beyond S is a probabilistic replica of the process
starting at 0.
Suppose that $P(S < \infty) < 1$. Then there exists an $\varepsilon > 0$ such that
$P(S < \infty) \le 1 - \varepsilon$. Now let $\tau_i$ be the point in time of the ith visit of X to the
state (1, 1, . . . , 1), i.e., $\tau_1 = S$ and for $i \ge 2$,
$$\tau_i = \inf\{t > \tau_{i-1} + S_i' : X_t = (1, 1, \ldots, 1)\},$$
where $S_i'$ has the same distribution as $S'$. We define $\inf\{\emptyset\} = \infty$. Since $\tau_i < \infty$
is equivalent to $\tau_k - \tau_{k-1} < \infty$, $k = 1, 2, \ldots, i$ ($\tau_0 = 0$), we obtain
$$P(\tau_i < \infty) = [P(S < \infty)]^i \le (1-\varepsilon)^i.$$
For all t ∈ R+ ,
P (Nt◦ ≥ i) ≤ P (τi < ∞),
and it follows that
$$EN_t^\circ = \sum_{i=1}^{\infty}P(N_t^\circ \ge i) \le \sum_{i=1}^{\infty}(1-\varepsilon)^i = \frac{1-\varepsilon}{1-(1-\varepsilon)} = \frac{1-\varepsilon}{\varepsilon} < \infty.$$
This contradicts Lemma 4.22, by which $EN_t^\circ \to \infty$ as $t\to\infty$. Hence $P(S < \infty) = 1$.
Under the given set-up the regenerative property only holds true if the
lifetimes of the components are exponentially distributed. However, this can
be generalized by considering phase-type distributions with an enlarged state
space, which also includes the phases; see Sect. 4.7.1, p. 163.
If the repair times are small compared to the lifetimes and the lifetimes
are exponentially distributed with parameter λi , then clearly the number of
failures of component i in the time interval (u, u + w], Nu+w (i) − Nu (i), is
approximately Poisson distributed with parameter λi w. If the system is a
series system, and we make the same assumptions as above, it is also clear
that the number of system failures in the interval (u, u + w] is approximately
Poisson distributed with parameter $\big(\sum_{i=1}^{n}\lambda_i\big)w$. The number of system
failures in [0, t], $N_t$, is approximately a Poisson process with intensity $\sum_{i=1}^{n}\lambda_i$.
If the system is highly available and the components have constant failure
rates, the Poisson distribution (with the asymptotic rate λΦ ) will in fact also
produce good approximations for more general systems. As motivation, we
observe that EN(u,u+w] /w is approximately equal to the asymptotic system
failure rate λΦ , and N(u,u+w] is “nearly independent” of the history of N up
to u, noting that the process X frequently restarts itself probabilistically, i.e.,
X re-enters the state (1, 1, . . . , 1).
Refer to [22, 82] for Monte Carlo simulation studies of the accuracy of
the Poisson approximation. As an illustration of the results obtained in these
studies, consider a parallel system of two identical components where the
failure rate λ is equal to 0.05, the repair times are all equal to 1, and the
expected number of system failures is equal to 5. This means, as shown below,
that the time interval is about 1,000 and the expected number of component
failures is about 100. Using the definition of the system failure rate λΦ (cf.
(4.41), p. 123) with μG = 1, we obtain
$$\frac{EN_u}{u} = \frac{5}{u} \approx \lambda_\Phi = 2\bar A_1\,\frac{1}{\mu_{F_1}+\mu_{G_1}} = 2\,\frac{\mu_G}{\lambda^{-1}+\mu_G}\cdot\frac{1}{\lambda^{-1}+\mu_G} \approx 2\lambda^2\mu_G = 0.005.$$
Hence u ≈ 1,000, and the expected number of component failures $2EN_u(i) \approx 2\lambda u \approx 100$. Clearly, this is an approximate
steady-state situation, and we would expect that the Poisson distribution gives
an accurate approximation. The Monte Carlo simulations in [22] confirm this.
The distance measure, which is defined as the maximum distance between the
Poisson distribution (with mean λΦ u) and the “true” distribution obtained
by Monte Carlo simulation, is equal to 0.006. If we take instead λ = 0.2 and
ENu = 0.2, we find that the expected number of component failures is about
1. Thus, we are far away from a steady-state situation and as expected the
distance measure is larger: 0.02. But still the Poisson approximation produces
relatively accurate results.
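The setting just described is easy to reproduce by simulation. The sketch below is a Monte Carlo check (not the simulations of [22]; the run count and seed are arbitrary): it simulates the two-component parallel system with λ = 0.05 and unit repair times over u = 1,000, and computes the distance measure against the Poisson distribution with mean $\lambda_\Phi u$:

```python
import math, random

def count_system_failures(lam, repair, horizon, rng):
    """One realization of a two-component parallel system with exponential
    lifetimes (rate lam) and constant repair times; counts entries into
    the state where both components are down."""
    up = [True, True]
    nxt = [rng.expovariate(lam), rng.expovariate(lam)]
    failures, sys_up = 0, True
    while True:
        i = 0 if nxt[0] <= nxt[1] else 1
        t = nxt[i]
        if t >= horizon:
            return failures
        if up[i]:                        # component i fails
            up[i] = False
            nxt[i] = t + repair
        else:                            # repair of component i completes
            up[i] = True
            nxt[i] = t + rng.expovariate(lam)
        now_up = up[0] or up[1]
        if sys_up and not now_up:        # both down: system failure
            failures += 1
        sys_up = now_up

rng = random.Random(7)
lam, muG, u = 0.05, 1.0, 1000.0
Abar = lam * muG / (1 + lam * muG)       # component unavailability
lam_phi = 2 * Abar / (1 / lam + muG)     # system failure rate, cf. (4.41)
runs = 2000
counts = [count_system_failures(lam, muG, u, rng) for _ in range(runs)]
mean = lam_phi * u
dist = max(abs(counts.count(k) / runs - math.exp(-mean) * mean**k / math.factorial(k))
           for k in range(25))
print(mean, sum(counts) / runs, dist)
```

With these parameters the exact steady-state rate is $\lambda_\Phi \approx 0.0045$, slightly below the first-order approximation $2\lambda^2\mu_G = 0.005$.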
In the following we look at the problem of establishing formalized asymp-
totic results for the distribution of the number of system failures. We first
consider the interval reliability.
The above discussion indicates that, for highly available systems comprising
components with exponentially distributed lifetimes, the time to the first system
failure is approximately exponentially distributed, so that the interval reliability
A[0, u], defined by $A[0, u] = P(N_u = 0)$, is approximately of the form $e^{-\lambda u}$.
This result can also be formulated as a limiting result as shown in the
theorem below. It is assumed that the process X is a regenerative process
with regenerative state (1, 1, . . . , 1). The variable S denotes the length of the
first renewal cycle of the process X, i.e., the time until the process returns to
state (1, 1, . . . , 1). Let TΦ denote the time to the first system failure and q the
probability that a system failure occurs in a renewal cycle, i.e., $q = P(N_S \ge 1)$.
For q ∈ (0, 1), let P0 and P1 denote the conditional probability given NS = 0
and NS ≥ 1, i.e., P0 (·) = P (·|NS = 0) and P1 (·) = P (·|NS ≥ 1). The
corresponding expectations are denoted $E_0$ and $E_1$. Furthermore, let $c_{0S}^2 =
E_0S^2/(E_0S)^2 - 1$ denote the squared coefficient of variation of S under $P_0$.
The notation $\xrightarrow{P}$ is used for convergence in probability and $\xrightarrow{D}$ for convergence in distribution, cf. Appendix A, p. 248. We write Exp(t) for the exponential distribution with parameter t. Note that
$$\lambda_\Phi = \frac{EN_S}{ES} \qquad(4.44)$$
by the Renewal Reward Theorem (Theorem B.15, p. 280, in Appendix B).
For a highly available system we have ENS ≈ q and hence λΦ ≈ q/ES.
Results from Monte Carlo simulations presented in [22] show that the factors
q/E0 S, q/ES, and 1/ETΦ typically give slightly better results (i.e., better
fit to the exponential distribution) than the system failure rate λΦ . From a
computational point of view, however, λΦ is much more attractive than the
other factors, which are in most cases quite difficult to compute. We therefore
normally use λΦ as the normalizing factor.
The basic idea of the proof of the asymptotic exponentiality of αTΦ is as
follows: If we assume that X is a regenerative process and the probability
that a system failure occurs in a renewal cycle, i.e., q, is small (converges to
zero), then the time to the first system failure will be approximately equal
to the sum of a number of renewal cycles having no system failures; and this
number of cycles is geometrically distributed with parameter q. Now if q → 0
as j → ∞, the desired result follows by using Laplace transformations. The
result can be formulated in general terms as shown in the lemma below.
Note that series systems are excluded since such systems have q = 1. We
will analyze series systems later in this section; see Theorem 4.35, p. 143.
$$S^* = \sum_{i=1}^{\nu-1}S_i.$$
$$\le \frac{E[(qx/a)S]^2}{2q} = \frac{x^2 q}{2a^2}\,ES^2 = \frac{x^2}{2}\,q(1+c_S^2).$$
The desired conclusion (4.47) follows now since $q \to 0$ and $qc_S^2 \to 0$ (assumptions (4.45) and (4.46)).
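The heuristic behind this argument — the time to the first system failure is roughly a geometric number (parameter q) of failure-free cycles, and such a geometric sum, suitably normalized, is approximately exponential — can be illustrated directly. The sketch below uses an arbitrary cycle-length distribution with mean 1 and q = 0.01 (assumed toy values):

```python
import math, random

rng = random.Random(1)
q = 0.01                                  # probability of a system failure per cycle

def cycle_length(rng):
    return rng.uniform(0.0, 2.0)          # arbitrary cycle distribution, mean 1

samples = []
for _ in range(5000):
    t = 0.0
    while rng.random() >= q:              # cycle without system failure
        t += cycle_length(rng)
    samples.append(q * t)                 # normalized time to first failure

mean = sum(samples) / len(samples)
tail = sum(1 for x in samples if x > 1.0) / len(samples)
print(mean, tail, math.exp(-1))           # mean near 1, tail near e^{-1}
```

The normalized samples behave, up to terms of order q, like draws from Exp(1).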
Theorem 4.25. Assume that X is a regenerative process, and that Fij and
Gij change in such a way that the following conditions hold (as j → ∞) :
$$q \to 0, \qquad(4.48)$$
$$qc_{0S}^2 \to 0, \qquad(4.49)$$
$$\frac{qE_1S}{ES} \to 0, \qquad(4.50)$$
$$E_1(N_S - 1) \to 0. \qquad(4.51)$$
Then
$$A[0, u/\lambda_\Phi] \to e^{-u}, \quad\text{i.e.,}\quad \lambda_\Phi T_\Phi \xrightarrow{D} \mathrm{Exp}(1). \qquad(4.52)$$
Proof. Using Lemma 4.24, we first prove that under conditions (4.48)–(4.50)
we have
$$\frac{q\,T_\Phi}{E_0S} \xrightarrow{D} \mathrm{Exp}(1). \qquad(4.53)$$
Let ν denote the renewal cycle index associated with the time of the first
system failure, TΦ . Then it is seen that TΦ has the same distribution as
$$\sum_{k=1}^{\nu-1}S_{0k} + W_\nu,$$
where (S0k ) and (Wk ) are independent sequences of i.i.d. random variables
with
P (S0k ≤ s) = P0 (S ≤ s)
and
P (Wk ≤ w) = P1 (TΦ ≤ w).
Both sequences are independent of ν, which has a geometrical distribution
with parameter q = P (NS ≥ 1). Hence, (4.53) follows from Lemma 4.24
provided that
$$\frac{q\,W_\nu}{E_0S} \xrightarrow{P} 0. \qquad(4.54)$$
By a standard conditional probability argument it follows that
ES = (1 − q)E0 S + qE1 S,
from which
$$\frac{qE_1S}{ES(1-q)} \to 0, \qquad(4.55)$$
and
$$\frac{E_0S}{ES} = \frac{1 - qE_1S/ES}{1-q} \to 1$$
by (4.48) and (4.50). Hence the ratio of $\lambda_\Phi$ and $q/E_0S$ converges to 1. Combining this with (4.53), the conclusion of the theorem follows.
Then $ES^* = E_0S(1-q)/q$, observing that the mean of ν equals 1/q. It follows
that
$$ET_\Phi = E_0S(1-q)/q + E_1T_\Phi,$$
which can be rewritten as
$$qET_\Phi/E_0S = 1 - q + qE_1T_\Phi/E_0S.$$
We see that the right-hand side of this expression converges to 1, remembering (4.48), (4.50), and (4.55). Hence, $1/ET_\Phi$ is also a normalizing factor. Note that the condition (4.51) is not required if the normalizing factor equals either $q/E_0S$, $q/ES$, or $1/ET_\Phi$.
We can conclude that the ratio between any of these normalizing factors
converges to one if the conditions of the theorem hold true.
It is intuitively clear that if the components have constant failure rates, and
the component unavailabilities converge to zero, then the conditions of The-
orem 4.25 would hold. In Theorems 4.27 and 4.30 below this result will be
formally established. We assume, for the sake of simplicity, that no single
component is in series with the rest of the system. If there are one or more
components in series with the rest of the system, we know that the time to
failure of these components has an exact exponential distribution, and by in-
dependence it is straightforward to establish the limiting distribution of the
total system.
Define
$$d = \sum_{i=1}^{n}\lambda_i\mu_{G_i}, \qquad \tilde\lambda = \sum_{i=1}^{n}\lambda_i.$$
Theorem 4.27. Assume that the system has no components in series with
the rest of the system, i.e., Φ(0i , 1) = 1 for i = 1, 2, . . . , n. Furthermore,
assume that component i has an exponential lifetime distribution with failure
rate λi > 0, i = 1, 2, . . . , n.
If d → 0 and there exist constants c1 and c2 such that λi ≤ c1 < ∞
and ERi2 ≤ c2 < ∞ for all i, then the conditions (4.48),(4.49), and (4.50)
of Theorem 4.25 are met, and, consequently, $\alpha T_\Phi \xrightarrow{D} \mathrm{Exp}(1)$ for α equal to
$q/E_0S$, $q/ES$, or $1/ET_\Phi$.
where $S''$ represents the “busy” period of the renewal cycle, which equals the
time from the first component failure to the next regenerative point, i.e., to the
time when the process again visits state (1, 1, . . . , 1). (The term “busy” period
is taken from queueing theory. In the busy period at least one component is
under repair.) Let $S'$ be an exponentially distributed random variable with
parameter $\tilde\lambda$ representing the time to the first component failure. This means
that we can write
$$S = S' + S''.$$
Assume that we have already proved (4.56). Then this condition and (4.48)
imply (4.49) and (4.50), noting that
$$\frac{qE_1S}{ES} \le \tilde\lambda qE_1S = \tilde\lambda\big(qE_1S' + qE_1S''\big) = q + \tilde\lambda qE[S''\,|\,N_S \ge 1] = q + \tilde\lambda E[S''\,I(N_S \ge 1)] \le q + \tilde\lambda q^{1/2}\big[E(S'')^2\big]^{1/2} = q + q^{1/2}\big[\tilde\lambda^2E(S'')^2\big]^{1/2},$$
and
$$c_{0S}^2 \le \frac{E_0S^2}{(E_0S)^2} \le \tilde\lambda^2E_0S^2 = \tilde\lambda^2E[S^2I(N_S=0)]/(1-q) \le \tilde\lambda^2ES^2/(1-q) = \tilde\lambda^2\big\{E(S')^2 + E(S'')^2 + 2E[S'S'']\big\}/(1-q) \le \tilde\lambda^2\big\{(2/\tilde\lambda^2) + E(S'')^2 + 2\big(E(S')^2E(S'')^2\big)^{1/2}\big\}/(1-q) = \big\{2 + \tilde\lambda^2E(S'')^2 + 2\cdot2^{1/2}\big(\tilde\lambda^2E(S'')^2\big)^{1/2}\big\}/(1-q).$$
Now, to establish (4.48), we note that with probability λ̃i = λi /λ̃, the busy
period begins at the time of the failure of component i. If, in the interval of
repair of this component, none of the remaining components fails, then the
busy period comes to an end when the repair is completed. Therefore, since
there are no components in series with the rest of the system,
$$1-q \ge \sum_{i=1}^{n}\tilde\lambda_i\int_0^\infty e^{-t(\tilde\lambda-\lambda_i)}\,dG_i(t),$$
so that
$$q \le \sum_{i=1}^{n}\tilde\lambda_i\int_0^\infty\big(1-e^{-t(\tilde\lambda-\lambda_i)}\big)\,dG_i(t) \le \sum_{i=1}^{n}\int_0^\infty\lambda_i\,t\,dG_i(t) = d.$$
Consequently, d → 0 implies q → 0.
It remains to show (4.56). Clearly, the busy period will only increase if
we assume that the flow of failures of component i is a Poisson flow with
parameter λi , i.e., we adjoin failures that arise according to a Poisson process
on intervals of repair of component i, assuming that repair begins immediately
for each failure. This means that the process can be regarded as an M/G/∞
queueing process, where the Poisson input flow has parameter λ̃ and there are
an infinite number of devices with servicing time distributed according to the
law
$$G(t) = \sum_{i=1}^{n}\tilde\lambda_iG_i(t).$$
Note that the probability that a “failure is due to component i” equals λ̃i . It
is also clear that the busy period increases still more if, instead of an infinite
number of servicing devices, we take only one, i.e., the process is a queueing
process M/G/1. Thus, $E(S'')^2 \le E(\tilde S'')^2$, where $\tilde S''$ is the busy period in a
single-line system with a Poisson input flow $\tilde\lambda$ and servicing distribution G(t).
It is a well-known result from the theory of queueing processes (and branching
processes) that the second-order moment of the busy period (extinction time)
equals $ER_G^2/(1-\tilde\lambda ER_G)^3$, where $R_G$ is the service time having distribution
G; see, e.g., [80]. Hence, by introducing $d_2 = \sum_{i=1}^{n}\lambda_iER_i^2$ we obtain
$$\tilde\lambda^2E(S'')^2 \le \frac{\tilde\lambda d_2}{(1-d)^3} \le \frac{n^2c_1^2c_2}{(1-d)^3}.$$
The conclusion of the theorem follows.
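The busy-period moment quoted from [80] is easy to check by simulation. The sketch below uses assumed values $\tilde\lambda = 0.2$ and deterministic unit service times, so that the formula gives $E[B^2] = ER_G^2/(1-\tilde\lambda ER_G)^3 = 1/0.8^3$:

```python
import random

def busy_period(lam, service_time, rng):
    """Length of an M/G/1 busy period: one initial customer; Poisson(lam)
    arrivals occurring before the current end of the busy period each add
    a further service time."""
    end = service_time(rng)
    arrival = rng.expovariate(lam)
    while arrival < end:
        end += service_time(rng)
        arrival += rng.expovariate(lam)
    return end

rng = random.Random(3)
lam_tilde = 0.2
service = lambda r: 1.0                   # deterministic service: R_G = 1
rho = lam_tilde * 1.0                     # traffic intensity lam_tilde * E[R_G]
formula = 1.0 / (1.0 - rho)**3            # E[R_G^2] / (1 - lam_tilde*E[R_G])^3
n = 40000
estimate = sum(busy_period(lam_tilde, service, rng)**2 for _ in range(n)) / n
print(estimate, formula)
```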
For each component i, let
$$\breve\mu_i = \sup_{0\le t<t^*}\frac{\int_t^\infty\bar G_i(x)\,dx}{\bar G_i(t)},$$
where $t^* = \sup\{t\in\mathbb{R}_+ : \bar G_i(t) > 0\}$. We see that $\breve\mu_i$ expresses the maximum
expected residual repair time of component i. We might have $\breve\mu_i = \infty$, but
we shall in the following restrict attention to the finite case. We know from
Sect. 2.2, p. 37, that if $G_i$ has the NBUE property, then
$$\breve\mu_i \le \mu_{G_i}.$$
If the repair times are bounded by a constant c, i.e., $P(R_{ik}\le c) = 1$, then
$\breve\mu_i \le c$. Let
$$\tilde\mu = \sum_{i=1}^{n}\breve\mu_i.$$
Lemma 4.28. For k = 1, 2, . . . ,
$$P_1(N_S \ge k) \le (\tilde\lambda\tilde\mu)^{k-1}. \qquad(4.57)$$
Proof. The lemma will be shown by induction. We first prove that (4.57)
holds true for k = 2. Suppose the first system failure occurs at time t. Let
Lt denote the number of component failures after t until all components are
again functioning for the first time. Furthermore, let Rit denote the remaining
repair time of component i at time t (put Rit = 0 if component i is functioning
at time t). Finally, let Vt = maxi Rit and let GVt (v) denote the distribution
function of Vt . Note that Lt ≥ 1 implies that at least one component must fail
in the interval (t, t + Vt ) and that the probability of at least one component
failure in this interval increases if we replace the failed components at t by
functioning components. Using these observations and the inequality 1−e−x ≤
x, we obtain
$$P(L_t \ge 1) = \int_0^\infty P(L_t \ge 1\,|\,V_t = v)\,dG_{V_t}(v) \le \int_0^\infty\big(1-e^{-\tilde\lambda v}\big)\,dG_{V_t}(v) \le \tilde\lambda\int_0^\infty v\,dG_{V_t}(v) = \tilde\lambda EV_t \le \tilde\lambda E\sum_i R_{it} \le \tilde\lambda\tilde\mu.$$
Since $N_S \ge 2$ implies $L_t \ge 1$, formula (4.57) is shown for k = 2 under $P_1$,
conditional on the event that the first system failure occurs at time t. Integrating
over the failure time t, we obtain (4.57) for k = 2. Now assume that
$P_1(N_S \ge k) \le (\tilde\lambda\tilde\mu)^{k-1}$ for a $k \ge 2$. We must show that
$$P_1(N_S \ge k+1) \le (\tilde\lambda\tilde\mu)^k.$$
We have
$$P_1(N_S \ge k+1) = P_1(N_S \ge k+1\,|\,N_S \ge k)\,P_1(N_S \ge k) \le P_1(N_S \ge k+1\,|\,N_S \ge k)\cdot(\tilde\lambda\tilde\mu)^{k-1},$$
thus it remains to show that
$$P_1(N_S \ge k+1\,|\,N_S \ge k) \le \tilde\lambda\tilde\mu. \qquad(4.58)$$
Suppose that the kth system failure in the renewal cycle occurs at time t. Then
if at least one more system failure occurs in the renewal cycle, there must be
at least one component failure before all components are again functioning,
i.e., Lt ≥ 1. Repeating the above arguments for k = 2, the inequality (4.58)
follows.
Remark 4.29. The inequality (4.57) states that the number of system failures
in a renewal cycle when it is given that at least one system failure occurs is
bounded in distribution by a geometrical random variable with parameter $\tilde\lambda\tilde\mu$
(provided this quantity is less than 1).
Theorem 4.30. Assume that the system has no components in series with the
rest of the system. Furthermore, assume that component i has an exponential
lifetime distribution with failure rate $\lambda_i > 0$, i = 1, 2, . . . , n. If $d' \to 0$, where
$d' = \tilde\lambda\tilde\mu$, and there exist constants $c_1$ and $c_2$ such that $\lambda_i \le c_1 < \infty$ and
$ER_i^2 \le c_2 < \infty$ for all i, then the conditions (4.48)–(4.51) of Theorem 4.25
(p. 129) are all met, and, consequently, the limiting result (4.52) holds, i.e.,
$$\lambda_\Phi T_\Phi \xrightarrow{D} \mathrm{Exp}(1).$$
Proof. Since $d \le d'$, it suffices to show that condition (4.51) holds under the
given assumptions. But from (4.57) of Lemma 4.28 we have
$$E_1(N_S - 1) \le d'/(1-d'),$$
which converges to zero.
The above results show that the time to the first system failure is approxi-
mately exponentially distributed with parameter q/E0 S ≈ q/ES ≈ 1/ETΦ ≈
$\lambda_\Phi$. For a system comprising highly available components, it is clear that
$P(X_t = (1, 1, \ldots, 1))$ would be close to one; hence the above approximations
for the interval reliability can also be used for an interval (t, t + u].
For a highly available system, the downtimes will be small compared to the
uptimes, and the time from when the system has failed until it returns to the
state (1, 1, . . . , 1) will also be small. Hence, the above results also justify the
Poisson process approximation for N . More formally, it can be shown that
Nt/α converges in distribution to a Poisson distribution under the same as-
sumptions as the first system failure time converges to the exponential distri-
bution. Let TΦ∗ (k) denote the time between the (k − 1)th and the kth system
failure. From this sequence we define an associated sequence TΦ (k) of i.i.d.
variables, distributed as TΦ , by letting TΦ (1) = TΦ∗ (1), TΦ (2) be equal to the
time to the first system failure following the first regenerative point after the
first system failure, etc. Then it is seen that
$$T_\Phi(1) + T_\Phi(2)\big(1 - I(N_{(1)} \ge 2)\big) \le T_\Phi^*(1) + T_\Phi^*(2) \le T_\Phi(1) + T_\Phi(2) + S_\nu,$$
where $N_{(1)}$ equals the number of system failures in the first renewal cycle
having one or more system failures, and $S_\nu$ equals the length of this cycle (ν
denotes the renewal cycle index associated with the time of the first system
failure). For α being one of the normalizing factors (i.e., $q/E_0S$, $q/ES$, $1/ET_\Phi$,
or $\lambda_\Phi$), we will prove that $\alpha T_\Phi(2)I(N_{(1)} \ge 2)$ converges in probability to zero.
It is sufficient to show that $P(N_{(1)} \ge 2) \to 0$.
But
P (N(1) ≥ 2) = P1 (NS ≥ 2) ≤ E1 (NS − 1),
where the last expression converges to zero in view of (4.51), p. 129. The
distribution of Sν is the same as the conditional probability of the cycle length
given a system failure occurs in the cycle, cf. Theorem 4.25 and its proof. Thus,
if (4.48)–(4.51) hold, it follows that α(TΦ∗ (1)+TΦ∗ (2)) converges in distribution
to the sum of two independent exponentially distributed random variables
with parameter 1, i.e.,
$$P(N_{t/\alpha} \ge 2) = P\big(\alpha(T_\Phi^*(1)+T_\Phi^*(2)) \le t\big) \to 1 - e^{-t} - te^{-t}.$$
Similarly, we establish the general distribution. We summarize the result in
the following theorem.
Theorem 4.31. Assume that X is a regenerative process, and that Fij and
$G_{ij}$ change in such a way that (as j → ∞) the conditions (4.48)–(4.51) hold.
Then (as j → ∞)
$$N_{t/\alpha} \xrightarrow{D} \mathrm{Poisson}(t), \qquad(4.59)$$
where α is a normalizing factor that equals either q/E0 S, q/ES, 1/ETΦ or
λΦ .
Results from Monte Carlo simulations [22] indicate that the asymptotic
system failure rate $\lambda_\Phi$ is normally preferable as parameter in the Poisson
distribution when the expected number of system failures is not too small.
When the expected number of system failures is small (less than one), the
factor $1/ET_\Phi$ gives slightly better results. The system failure rate is, however,
easier to compute.
Asymptotic Normality
Now we turn to a completely different way to approximate the distribution
of Nt . Above, the up and downtime distributions are assumed to change such
that the system availability increases and after a time rescaling Nt converges to
a Poisson variable. Now we leave the up and downtime distribution unchanged
and establish a central limit theorem as t increases to infinity. The theorem
generalizes (4.16), p. 114.
Theorem 4.32. If X is a regenerative process with cycle length S, Var[S] <
∞ and Var[NS ] < ∞, then as t → ∞,
$$\sqrt{t}\left(\frac{N_{u+t}-N_u}{t} - \lambda_\Phi\right) \xrightarrow{D} N\big(0, \gamma_\Phi^2\big),$$
where
$$\gamma_\Phi^2\,ES = \mathrm{Var}[N_S - \lambda_\Phi S]. \qquad(4.60)$$
Below we argue that if the system failure rate is small, then we have
$\gamma_\Phi^2 \approx \lambda_\Phi$. We obtain
$$\gamma_\Phi^2 = \frac{\mathrm{Var}[N_S-\lambda_\Phi S]}{ES} = \frac{E(N_S-\lambda_\Phi S)^2}{ES} \approx \frac{EN_S^2}{ES} \approx \frac{EN_S}{ES} = \lambda_\Phi,$$
where the last approximation follows by observing that if the system failure
rate is small, then NS with a probability close to one is equal to the indicator
function I(NS ≥ 1). More formally, it is possible to show that under certain
conditions, γΦ2 /λΦ converges to one. We formulate the result in the following
proposition.
and
$$qc_S^2 \to 0, \qquad(4.63)$$
where $c_S^2$ denotes the squared coefficient of variation of S. Then (as j → ∞)
$$\frac{\gamma_\Phi^2}{\lambda_\Phi} \to 1.$$
Proof. Using (4.60) and writing N in place of $N_S$ we get
$$\frac{\gamma_\Phi^2}{\lambda_\Phi} = \frac{q^{-1}EN^2 + q^{-1}\lambda_\Phi^2ES^2 - 2q^{-1}\lambda_\Phi E[NS]}{q^{-1}\lambda_\Phi ES} = \frac{E_1N^2 + q^{-1}\lambda_\Phi^2ES^2 - 2q^{-1}\lambda_\Phi E[NS]}{q^{-1}\lambda_\Phi ES}.$$
Since the denominator converges to 1 (the denominator equals the ratio between two normalizing factors), the result follows if we can show that $E_1N^2$ converges to 1 and all the other terms of the numerator converge to zero.
$$\frac{\gamma_\Phi^2}{\lambda_\Phi} \to 1.$$
Proof. It is sufficient to show that conditions (4.62) and (4.63) hold. Condition
(4.62) follows by using that under $P_1$, N is bounded in distribution by a
geometrically distributed random variable with parameter $d' = \tilde\lambda\tilde\mu$, cf. (4.57)
of Lemma 4.28, p. 133. Note that for a variable N that has a geometrical
distribution with parameter $d'$ we have
$$E(N-1)^2 = \sum_{k=1}^{\infty}(k-1)^2(d')^{k-1}(1-d') = \frac{d'(1+d')}{(1-d')^2}.$$
$$c_S^2 \le \frac{ES^2}{(ES)^2} \le \tilde\lambda^2ES^2.$$
$$q = \lambda\mu_G - \frac{\lambda^2}{2}ER^2 + \lambda^3ER^3\,O(1) = \lambda\mu_G - \frac{(\lambda\mu_G)^2}{2}(1+c_G^2) + o\big((\lambda\mu_G)^2\big),$$
where $c_G^2$ denotes the squared coefficient of variation of G defined by
$c_G^2 = \mathrm{Var}\,R/\mu_G^2$. We can conclude that if $\lambda\mu_G$ is small, then comparing distributions G with the same mean, those with a large variance exhibit a small
probability q.
If we instead apply the Taylor formula $1-e^{-x} = x - x^2O(1)$, we can write
$$E_0S = \frac{1}{2\lambda} + \frac{1}{1-q}\int_0^\infty re^{-\lambda r}\,dG(r).$$
It can be shown that if the failure rate λ and the squared coefficient of variation
$c_G^2$ are bounded by a finite constant, then the normalizing factor $q/E_0S$ is
asymptotically given by
$$\frac{q}{E_0S} = 2\lambda^2\mu_G + o(\lambda\mu_G), \qquad \lambda\mu_G \to 0.$$
Now we will show that the system failure rate $\lambda_\Phi$, defined by (4.41), p. 123,
is also approximately equal to $2\lambda^2\mu_G$. First note that the unavailability of a
component, $\bar A$, is given by $\bar A = \lambda\mu_G/(1+\lambda\mu_G)$. It follows that
$$\lambda_\Phi = \frac{2\bar A}{\lambda^{-1}+\mu_G} = 2\lambda^2\mu_G + o(\lambda\mu_G), \qquad \lambda\mu_G \to 0, \qquad(4.64)$$
provided that the failure rate λ is bounded by a finite constant.
Next we will compute the exact distribution and mean of TΦ . Let us denote
this distribution by FTΦ (t). In the following FX denotes the distribution of any
random variable X and FiX (t) = Pi (X ≤ t), i = 0, 1, where P0 (·) = P (·|NS =
0) and $P_1(\cdot) = P(\cdot|N_S \ge 1)$. Observe that the length of a renewal cycle S can
be written as $S' + S''$, where $S'$ represents the time to the first failure of a
component, and $S''$ represents the “busy” period, i.e., the time from when one
component has failed until the process returns to the best state (1, 1). The
variables $S'$ and $S''$ are independent and $S'$ is exponentially distributed with
rate $\tilde\lambda = 2\lambda$. Now, assume a component has failed. Let R denote the repair
time of this component and let T denote the time to failure of the operating
component. Then
$$F_{1T}(t) = P(T\le t\,|\,T\le R) = \frac{1}{q}\int_0^\infty\big(1-e^{-\lambda(t\wedge r)}\big)\,dG(r),$$
where $a\wedge b$ denotes the minimum of a and b. Furthermore,
$$F_{0R}(t) = P(R\le t\,|\,R<T) = \frac{1}{\bar q}\int_0^t e^{-\lambda r}\,dG(r),$$
where $\bar q = 1 - q$. Now, by conditioning on whether a system failure occurs in
the first renewal cycle or not, we obtain
$$F_{T_\Phi}(t) = h(t) + \bar q\int_0^t F_{T_\Phi}(t-s)\,dF_{0S}(s), \qquad(4.65)$$
where
$$h(t) = q\int_0^t F_{1T}(t-s)\,dF_{S'}(s). \qquad(4.66)$$
Hence, FTΦ (t) satisfies a renewal equation with the defective distribution
q̄F0S (s), and arguing as in the proof of Theorem B.2, p. 275, in Appendix B,
it follows that
$$F_{T_\Phi}(t) = h(t) + \int_0^t h(t-s)\,dM_0(s), \qquad(4.67)$$
where the renewal function $M_0(s)$ equals
$$\sum_{j=1}^{\infty}\bar q^{\,j}F_{0S}^{*j}(s).$$
Noting that $F_{0S} = F_{S'} * F_{0R}$, the Laplace transform of $S'$ equals $2\lambda/(2\lambda+v)$,
$\bar q = L_G(\lambda)$ and $L_{F_{0R}}(v) = L_G(v+\lambda)/L_G(\lambda)$, we see that the Laplace transform
of $M_0$ takes the form
$$L_{M_0}(v) = \frac{\bar q\,\frac{2\lambda}{2\lambda+v}\,L_{F_{0R}}(v)}{1-\bar q\,\frac{2\lambda}{2\lambda+v}\,L_{F_{0R}}(v)} = \frac{\frac{2\lambda}{2\lambda+v}\,L_G(v+\lambda)}{1-\frac{2\lambda}{2\lambda+v}\,L_G(v+\lambda)}.$$
Now using (4.67) and (4.66) and the above expressions for the Laplace transform we obtain the following simple formula for $L_{F_{T_\Phi}}$:
$$L_{F_{T_\Phi}}(v) = \frac{2\lambda^2}{\lambda+v}\cdot\frac{1-L_G(v+\lambda)}{v+2\lambda\big(1-L_G(v+\lambda)\big)}.$$
The mean ETΦ can be found from this formula, or alternatively by using a
direct renewal argument. We obtain
$$ET_\Phi = ES' + E(T_\Phi - S') = \frac{1}{2\lambda} + E\min\{R,T\} + (1-q)ET_\Phi,$$
noting that the time one component is down before system failure occurs or
the renewal cycle terminates equals min{R, T}. If a system failure does not
occur, the process starts over again. It follows that
$$ET_\Phi = \frac{1}{2q\lambda} + \frac{E\min\{R,T\}}{q}.$$
Note that
$$E\min\{R,T\} = \int_0^\infty \bar F(t)\bar G(t)\,dt = \int_0^\infty e^{-\lambda t}\bar G(t)\,dt.$$
This gives
$$ET_\Phi = \frac{3}{2\lambda}\cdot\frac{1-\frac{2}{3}L_G(\lambda)}{1-L_G(\lambda)}.$$
Expanding $L_G(\lambda)$ for small $\lambda\mu_G$, where $c_G^2$ is the squared coefficient of variation of G, it can be shown
that the normalizing factor $1/ET_\Phi$ can be written in the same form as the
other normalizing factors:
$$\frac{1}{ET_\Phi} = 2\lambda^2\mu_G + o(\lambda\mu_G), \qquad \lambda\mu_G \to 0,$$
assuming that λ and $c_G^2$ are bounded by a finite constant.
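The exact mean $ET_\Phi$ derived above can be checked against a direct simulation. The sketch below uses assumed values λ = 0.2 and constant repair times with $\mu_G = 1$, so that $L_G(\lambda) = e^{-\lambda}$; the regeneration at state (1, 1) after each completed repair keeps the simulation of $T_\Phi$ simple:

```python
import math, random

def time_to_first_system_failure(lam, repair, rng):
    """Two-component parallel system, exponential lifetimes (rate lam),
    constant repair times. Uses the regeneration at state (1,1)."""
    t = 0.0
    while True:
        t += rng.expovariate(2 * lam)     # S': time to the first component failure
        T = rng.expovariate(lam)          # time to failure of the surviving component
        if T < repair:
            return t + T                  # second failure during repair: system fails
        t += repair                       # repair completes first: back to (1,1)

rng = random.Random(11)
lam, muG = 0.2, 1.0
LG = math.exp(-lam * muG)                 # Laplace transform of G at lam
ET_formula = (3 / (2 * lam)) * (1 - (2.0 / 3.0) * LG) / (1 - LG)
n = 40000
est = sum(time_to_first_system_failure(lam, muG, rng) for _ in range(n)) / n
print(est, ET_formula)
```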
We now return to the general asymptotic analysis. Remember that $d = \sum_i\lambda_i\mu_{G_i}$ and $\tilde\lambda = \sum_i\lambda_i$. So far we have focused on nonseries systems (series
systems have q = 1). Below we show that a series system also has a Poisson
limit under the assumption that the lifetimes are exponentially distributed.
We also formulate and prove a general asymptotic result for the situation that
we have some components in series with the rest of the system. A component
is in series with the rest of the system if $\Phi(0_i, 1) = 0$.
Theorem 4.35. Assume that Φ is a series system and the lifetimes are ex-
ponentially distributed. Let λi be the failure rate of component i. If d → 0 (as
j → ∞), then (as j → ∞)
$$N_{t/\tilde\lambda} \xrightarrow{D} \mathrm{Poisson}(t).$$
Proof. Let $N_t^P(i)$ be the Poisson process with intensity $\lambda_i$ generated by the
consecutive uptimes of component i. Then it is seen that
$$\sum_{i=1}^{n}N_{t/\tilde\lambda}^P(i) - D = N_{t/\tilde\lambda} \le \sum_{i=1}^{n}N_{t/\tilde\lambda}^P(i),$$
where
$$D = \sum_{i=1}^{n}N_{t/\tilde\lambda}^P(i) - N_{t/\tilde\lambda}.$$
Note that
$$E\sum_{i=1}^{n}N_{t/\tilde\lambda}^P(i) = \sum_{i=1}^{n}(t/\tilde\lambda)\lambda_i = t. \qquad(4.68)$$
Furthermore, since $m_i(s) = \lambda_iA_i(s)$ (cf. (4.18), p. 114), we have
$$EN_{t/\tilde\lambda} = \sum_{i=1}^{n}\int_0^{t/\tilde\lambda}\prod_{k\ne i}A_k(s)\,\lambda_iA_i(s)\,ds = \tilde\lambda\int_0^{t/\tilde\lambda}\prod_{k=1}^{n}A_k(s)\,ds.$$
Using this expression together with (4.68), the inequality $1-\prod_i(1-q_i) \le \sum_iq_i$, and the component unavailability bound (4.22) of Proposition 4.11,
p. 114 ($\bar A_i(t) \le \lambda_i\mu_{G_i}$), we find that
$$ED = \tilde\lambda\int_0^{t/\tilde\lambda}\Big[1-\prod_{i=1}^{n}A_i(s)\Big]ds \le \tilde\lambda\int_0^{t/\tilde\lambda}\sum_{i=1}^{n}\bar A_i(s)\,ds \le \tilde\lambda\,(t/\tilde\lambda)\sum_{i=1}^{n}\lambda_i\mu_{G_i} = td.$$
Hence $ED \to 0$ as $d \to 0$, and since $\sum_{i=1}^{n}N_{t/\tilde\lambda}^P(i)$ is Poisson distributed with parameter t (cf. (4.68)), the desired conclusion follows.
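A quick simulation illustrates the theorem. The sketch below uses assumed rates λ = (0.1, 0.2, 0.3) and constant repair times of length 0.01 (so d = 0.006); the mean number of system failures on $[0, t/\tilde\lambda]$ should then be close to t:

```python
import random

def series_failures(lams, repair, horizon, rng):
    """Number of system failures in [0, horizon] for a series system with
    exponential lifetimes and constant repair times."""
    n = len(lams)
    up = [True] * n
    nxt = [rng.expovariate(l) for l in lams]
    failures, sys_up = 0, True
    while True:
        i = min(range(n), key=lambda k: nxt[k])
        t = nxt[i]
        if t >= horizon:
            return failures
        if up[i]:
            up[i] = False
            nxt[i] = t + repair
        else:
            up[i] = True
            nxt[i] = t + rng.expovariate(lams[i])
        now_up = all(up)
        if sys_up and not now_up:         # any component down => system failure
            failures += 1
        sys_up = now_up

rng = random.Random(2)
lams, repair = [0.1, 0.2, 0.3], 0.01      # d = 0.006
lam_tilde = sum(lams)
t = 30.0
reps = 300
mean = sum(series_failures(lams, repair, t / lam_tilde, rng)
           for _ in range(reps)) / reps
print(mean, t)                            # mean should be close to t
```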
Remark 4.36. Arguing as in the proof of the theorem above it can be shown
that if $a_j \to a$ as j → ∞, then
$$N_{a_jt/\tilde\lambda} \xrightarrow{D} \mathrm{Poisson}(ta).$$
Observe that $\sum_{i=1}^{n}N_{a_jt/\tilde\lambda}^P(i)$ is Poisson distributed with parameter $a_jt$ and
as j → ∞ this variable converges in distribution to a Poisson variable with
parameter at.
cf. Theorem 4.31, p. 136. Theorem 4.30, p. 135, gives sufficient conditions for
(4.48)–(4.51).
$$N_{t/\alpha^B} \le N_{t/\alpha^B}^A + N_{t/\alpha^B}^B = N_{a_jt/\tilde\lambda^A}^A + N_{t/\alpha^B}^B,$$
where $a_j = \tilde\lambda^A/\alpha^B$. Now in view of Remark 4.36 above and the conditions of
the theorem, it is sufficient to show that $D^*$, defined as the expected number
of times system A fails while system B is down, or vice versa, converges to
zero. Note that the probability that system A (B) is not functioning is
less than or equal to d (the unreliability of a monotone system is bounded by
the sum of the component unreliabilities, which in its turn is bounded by d,
cf. (4.22), p. 115).
To find a suitable bound on $EN_{t/\alpha^B}^B$, we need to refer to the argumentation
in the proof of Theorem 4.43, formulas (4.88) and (4.93), p. 156. Using these
results we can show that $EN_{t/\alpha^B}^B \to t$. Hence, $D^* \to 0$ and the theorem is
proved.
assuming that the limit exists. It turns out that it is quite simple to establish
the asymptotic (steady-state) downtime distribution of a parallel system, so
we first consider this category of systems.
where $G_{\alpha t}$ denotes the distribution of the forward recurrence time in state 0
of a component, $\bar G_{\alpha t}(y) = P(\alpha_t(i) > y \mid X_i(t) = 0)$. But we know from (4.14)
and (4.15), p. 112, that the asymptotic distribution is given by
$$\lim_{t\to\infty}\bar G_{\alpha t}(y) = \frac{\int_y^\infty\bar G(x)\,dx}{\mu_G} = \bar G_\infty(y). \qquad(4.69)$$
Thus we have proved the following theorem.
Theorem 4.40. Let $m_i(t)$ be the renewal density function of $M_i(t)$, and assume that $m_i(t)$ is right-continuous and satisfies
$$\lim_{t\to\infty}m_i(t) = \frac{1}{\mu_{F_i}+\mu_{G_i}}. \qquad(4.71)$$
Then the asymptotic (steady-state) downtime distribution of the parallel system given system failure equals
$$G_\Phi(y) = \sum_{i=1}^{n}c_i\Big[1-\bar G_i(y)\prod_{k\ne i}\bar G_{k\infty}(y)\Big],$$
where
$$c_i = \frac{1/\mu_{G_i}}{\sum_{k=1}^{n}1/\mu_{G_k}}. \qquad(4.72)$$
Proof. The proof follows the lines of the proof of Theorem 4.39, the difference
being that we have to take into consideration which component causes system
failure and the probability of this event given system failure. Clearly, given
that system failure is caused by the failure of component i, the asymptotic
downtime distribution equals
$$1-\bar G_i(y)\prod_{k\ne i}\frac{\int_y^\infty\bar G_k(x)\,dx}{\mu_{G_k}}.$$
Since
$$\lambda_\Phi^{(i)} = \frac{1}{\mu_{F_i}+\mu_{G_i}}\prod_{k\ne i}\bar A_k$$
k =i
represents the expected number of system failures per unit of time caused
by failures of component i, an intuitive argument gives that the asymptotic
(steady-state) probability that component i causes system failure equals
$$\frac{\lambda_\Phi^{(i)}}{\lambda_\Phi} = \frac{\frac{1}{\mu_{F_i}+\mu_{G_i}}\prod_{k\ne i}\bar A_k}{\sum_{l=1}^{n}\frac{1}{\mu_{F_l}+\mu_{G_l}}\prod_{k\ne l}\bar A_k} = \frac{\frac{1}{\mu_{G_i}}\prod_{k=1}^{n}\bar A_k}{\sum_{l=1}^{n}\frac{1}{\mu_{G_l}}\prod_{k=1}^{n}\bar A_k} = c_i.$$
Then
$$c_i(t) = \lim_{h\to0+}\frac{P(N_{[t,t+h)}^c(i) = 1)}{P(N_{[t,t+h)}^c = 1)},$$
where
$$o_i(1) = E\big[N_{[t,t+h)}^c(i)\,I(N_{[t,t+h)}^c(i)\ge 2)\big]/h$$
and
$$o(1) = E\big[N_{[t,t+h)}^c\,I(N_{[t,t+h)}^c\ge 2)\big]/h.$$
Hence it remains to study the limit of the ratio of the first terms of (4.73).
Using that
$$EN_{[t,t+h)}^c(i) = \int_{[t,t+h)}\big(h(1_i,A(s)) - h(0_i,A(s))\big)\,m_i(s)\,ds,$$
Remark 4.41. 1. From renewal theory (see Theorem B.10, p. 278, in Ap-
pendix B) sufficient conditions can be formulated for the limiting result
(4.71) to hold true. For example, if the renewal cycle lengths Tik +Rik have
a density function h with h(t)p integrable for some p > 1, and h(t) → 0
as t → ∞, then Mi has a density mi such that mi (t) → 1/(μFi + μGi )
as t → ∞. If component i has an exponential lifetime distribution with
parameter λi , then we know that mi (t) = λi Ai (t) (cf. (4.18), p. 114),
which converges to 1/(μFi + μGi ).
2. From the above proof it is seen that the downtime distribution at time t,
$G_\Phi(y,t)$, is given by
$$G_\Phi(y,t) = \sum_{i=1}^{n}c_i(t)\Big[1-\bar G_i(y)\prod_{k\ne i}\bar G_{k\alpha t}(y)\Big].$$
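Combining the weights $c_i$ of (4.72) with the limiting forward recurrence distribution (4.69) (cf. item 2 of this remark), the asymptotic downtime distribution of a parallel system given system failure takes the form $G_\Phi(y) = \sum_i c_i\big[1-\bar G_i(y)\prod_{k\ne i}\bar G_{k\infty}(y)\big]$. A sketch of this computation follows; the two-component constant-repair case, for which $G_\Phi$ reduces to the uniform distribution on (0, 1), serves as a built-in check:

```python
def G_phi(y, comps):
    """Asymptotic downtime distribution of a parallel system given failure.
    comps: list of (Gbar, Gbar_inf, muG) per component, where Gbar is the
    repair-time survival function and Gbar_inf(y) = int_y^inf Gbar(x)dx / muG."""
    weights = [1.0 / muG for (_, _, muG) in comps]   # c_i proportional to 1/muG_i
    total = sum(weights)
    value = 0.0
    for i, (Gbar_i, _, _) in enumerate(comps):
        prod = 1.0
        for k, (_, Gbar_inf_k, _) in enumerate(comps):
            if k != i:
                prod *= Gbar_inf_k(y)
        value += (weights[i] / total) * (1.0 - Gbar_i(y) * prod)
    return value

# two identical components with constant repair time 1:
const = (lambda y: 1.0 if y < 1.0 else 0.0,   # Gbar
         lambda y: max(0.0, 1.0 - y),          # Gbar_inf
         1.0)                                  # muG
print(G_phi(0.3, [const, const]))              # -> 0.3: uniform on (0,1)
```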
Consider now an arbitrary monotone system comprising the minimal cut sets
Kk , k = 1, 2, . . . , k0 . No simple formula exists for the downtime distribution
in this case. But for highly available systems the following formula can be
used to approximate the downtime distribution:
$$G_\Phi(y) \approx \sum_k r_kG_{K_k}(y),$$
where
$$r_k = \frac{\lambda_{K_k}}{\sum_l\lambda_{K_l}}.$$
Here λKk and GKk denote the asymptotic (steady-state) failure rate of min-
imal cut set Kk and the asymptotic (steady-state) downtime distribution of
minimal cut set Kk , respectively, when this set is considered in isolation (i.e.,
we consider the parallel system comprising the components in Kk ). We see
that rk is approximately equal to the probability that minimal cut set Kk
causes system failure. Refer to [23, 72] for more detailed analyses in the general
case. In [72] it is formally proved that the asymptotic downtime distribution
exists and is equal to the steady-state downtime distribution.
The above asymptotic (steady-state) formulas for GΦ give in most cases good
approximations to the downtime distribution of the ith system failure, i ∈ N.
Even for the first system failure observed, the asymptotic formulas produce
relatively accurate approximations. This is demonstrated by Monte Carlo sim-
ulations in [23]. An example is given below. Let the distance measure Di (y)
be defined by
Di (y) = |GΦ (y) − Ĝi,Φ (y)|,
where Ĝi,Φ(y) equals the “true” downtime distribution of the ith system failure obtained by Monte Carlo simulations. In Fig. 4.3 the distance measures
of the first and second system failures have been plotted as functions of y
for a parallel system of two identical components with constant repair times
and exponential lifetimes. As we can see from the figure, the distance is quite
small; the maximum distance is about 0.012 for i = 1 and 0.004 for i = 2.
Fig. 4.3. The distance Di (y), i = 1, 2, as a function of y for a parallel system of two
components with constant repair times, μG = 1, λ = 0.1
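Such a comparison is easy to reproduce by simulation. The sketch below is our own illustration (not the code of [23]): it samples the downtime of the first system failure for the system of Fig. 4.3 (λ = 0.1, repair times identically equal to 1) and compares the empirical distribution with the steady-state formula, which for constant unit repair times reduces to GΦ(y) = y on [0, 1].

```python
import random

def down_intervals(lam, repair, horizon):
    """Alternating renewal process: Exp(lam) up-times, constant repair times."""
    t, out = 0.0, []
    while t < horizon:
        t += random.expovariate(lam)   # operating period
        out.append((t, t + repair))    # repair (down) period
        t += repair
    return out

def first_system_downtime(lam=0.1, repair=1.0, horizon=2000.0):
    """Downtime of the first failure of a two-component parallel system
    (the system is down iff both components are down)."""
    a = down_intervals(lam, repair, horizon)
    b = down_intervals(lam, repair, horizon)
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:                    # first overlap = first system failure
            return hi - lo
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return None                        # no system failure before the horizon

random.seed(1)
ys = [y for y in (first_system_downtime() for _ in range(2000)) if y is not None]
ecdf_half = sum(y <= 0.5 for y in ys) / len(ys)   # to be compared with 0.5
```

With λμG = 0.1 the empirical value at y = 0.5 lies within a few percentage points of GΦ(0.5) = 0.5, consistent with the small distances in Fig. 4.3.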
150 4 Availability Analysis of Complex Systems
Only for some special cases are explicit expressions for the downtime dis-
tribution of the ith system failure known. Below we present such expressions
for the downtime distribution of the first failure for a two-component parallel
system of identical components with exponentially distributed lifetimes.
Theorem 4.42. For a parallel system of two identical components with con-
stant failure rate λ and repair time distribution G, the downtime distribution
G1,p2(y) of the first system failure is given by

G1,p2(y) = 1 − Ḡ(y) [∫0^∞ ∫0^s Ḡ(y + s − x) dF(x) dF(s)] / [∫0^∞ ∫0^s Ḡ(s − x) dF(x) dF(s)]   (4.74)

= 1 − Ḡ(y) [∫y^∞ (1 − e^{−λ(r−y)}) dG(r)] / [∫0^∞ (1 − e^{−λr}) dG(r)].   (4.75)
Proof. Let Ti and Ri have distribution function F and G, respectively, i = 1, 2,
and let

Y = min_{1≤i≤2} (Ti + Ri) − max_{1≤i≤2} Ti.
It is seen that the downtime distribution G1,p2 (y) equals the conditional dis-
tribution of Y given that Y > 0. The equality (4.74) follows if we can show
that

P(Y > y) = Ḡ(y) ∫0^∞ ∫0^s 2Ḡ(y + s − x) dF(x) dF(s).   (4.76)
Consider the event that Ti = s, Tj = x, Ri > y, and Tj + Rj > y + s for x < s
and j = i. For this event it holds that Y is greater than y. The probability of
this event, integrated over all s and x, is given by
∫0^∞ ∫0^s Ḡ(y + s − x) Ḡ(y) dF(x) dF(s).
By taking the union over i = 1, 2, we find that (4.76) holds.
But the double integral in (4.76) can be written as

2 ∫0^∞ ∫0^s Ḡ(y + s − x) d(1 − e^{−λx}) d(1 − e^{−λs})
= 1 − ∫0^∞ ∫0^s G(y + s − x) 2λ² e^{−λ(x+s)} dx ds
= ∫0^∞ Ḡ(y + u) λ e^{−λu} du
= ∫y^∞ [1 − e^{−λ(r−y)}] dG(r),

where the third line follows by substituting u = s − x and integrating out s, and the last by integration by parts. Thus the formulas (4.75) and (4.74) in the theorem are identical. This completes the proof of the theorem.
Now what can we say about the limiting downtime distribution of the first
system failure as the failure rate converges to 0? Is it equal to the steady-
state downtime distribution GΦ ? Yes, for the above example we can show
that if the failure rate converges to 0, the distribution G1,p2 (y) converges to
the steady-state formula, i.e.,
lim_{λ→0} G1,p2(y) = 1 − Ḡ(y) [∫y^∞ Ḡ(r) dr] / μG = GΦ(y).
This is seen by noting that
lim_{λ→0} [∫y^∞ (1 − e^{−λ(r−y)}) dG(r)] / [∫0^∞ (1 − e^{−λr}) dG(r)]
= [∫y^∞ (r − y) dG(r)] / [∫0^∞ r dG(r)]
= [∫y^∞ Ḡ(r) dr] / [∫0^∞ Ḡ(r) dr].
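When the repair time is constant, G is a unit mass at μG and the dG-integrals in (4.75) reduce to single terms, so the convergence can be checked directly. The snippet below is our own illustration; the closed form it evaluates follows from (4.75) with R ≡ μG, and the limit is GΦ(y) = y/μG.

```python
import math

def G_1p2(y, lam, mu_G=1.0):
    """Formula (4.75) when G is a unit mass at mu_G (constant repair time)."""
    if y >= mu_G:
        return 1.0
    # Gbar(y) = 1 for y < mu_G; each dG-integral picks up the single atom at mu_G
    return 1.0 - (1.0 - math.exp(-lam * (mu_G - y))) / (1.0 - math.exp(-lam * mu_G))

# as lam -> 0 the values increase towards G_Phi(0.3) = 0.3
values = [G_1p2(0.3, lam) for lam in (1.0, 0.1, 0.001)]
```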
This result can be extended to general monotone systems, and it is not
necessary to establish an exact expression for the distribution of the first
downtime; see [72]. Consider the asymptotic set-up introduced in Sect. 4.4,
to study highly available components, with exponential lifetime distributions
Fij (t) = 1 − e−λij t and fixed repair time distributions Gi , and where we as-
sume λij → 0 as j → ∞. Then for a parallel system it can be shown that the
distribution of the ith system downtime converges as j → ∞ to the steady-
state downtime distribution GΦ . For a general system it is more complicated.
Assuming that the steady-state downtime distribution converges as j → ∞ to
G∗Φ (say), it follows that the distribution of the ith system downtime converges
to the same limit. See [72] for details.
the time interval [0, u]) in the case that the components are highly available,
utilizing that (Yu ) is approximately a compound Poisson process, denoted
(CPu ), and the exact one-unit formula (4.30), p. 118, for the downtime distri-
bution. Then we formulate some sufficient conditions for when the distribu-
tion of CPu is an asymptotic limit. The framework is the same as described
in Sect. 4.4.1, p. 126. Finally, we study the convergence to the normal distri-
bution.
We assume that the components have constant failure rate and that the com-
ponents are highly available, i.e., the products λi μGi are small. Then it can
be heuristically argued that the process (Yu ), u ∈ R+ , is approximately a
compound Poisson process,
Yu ≈ Σ_{i=1}^{Nu} Yi = CPu,   (4.77)

where (Nu) is approximately a Poisson process with rate

λΦ/h(A) = Σ_{i=1}^n [h(1i, A) − h(0i, A)] / [(μFi + μGi) h(A)].
To motivate this result, we note that the expected number of system failures
per unit of time when considering calendar time is approximately equal to the
asymptotic (steady-state) system failure rate λΦ , given by (cf. formula (4.41),
p. 123)
λΦ = Σ_{i=1}^n [h(1i, A) − h(0i, A)] / (μFi + μGi).
4.6 Distribution of the System Downtime in an Interval 153
Then observing that the ratio between calendar time and operational time is
approximately 1/h(A), we see that the expected number of system failures per
unit of time when considering operational time, EN^op(u, u + w]/w, is approximately equal to λΦ/h(A). Furthermore, N^op_{(u,u+w]} is “nearly independent” of
the history of N^op up to u, noting that the state process X frequently restarts
itself probabilistically, i.e., X re-enters the state (1, 1, . . . , 1). It can be shown
by repeating the proof of the Poisson limit Theorem 4.31, p. 136, and using
the fact that h(A) → 1 as λiμGi → 0, that N^op_{t/α} has an asymptotic Poisson
distribution with parameter t. The system downtimes given system failure are
approximately identically distributed with distribution function G(y), say, independent of N^op, and approximately independent observing that the state
process X with a high probability restarts itself quickly after a system failure.
The distribution function G(y) is normally taken as the asymptotic (steady-
state) downtime distribution given system failure or an approximation to this
distribution; see Sect. 4.5.
Considering the system as a one-unit system, we can now apply the ex-
act formula (4.30), p. 118, for the downtime distribution with the Poisson
parameter λΦ . It follows that
P(Yu ≤ y) ≈ Σ_{n=0}^∞ G^{*n}(y) [λΦ(u − y)]^n e^{−λΦ(u−y)} / n! = Pu(y),   (4.78)
where the equality is given by definition. Formula (4.78) gives good approximations for “typical real life cases” with small component unavailabilities; see
[82]. Figure 4.4 presents the downtime distribution for a parallel system of
two components with the repair times identically equal to 1 and μF = 10 using the
steady-state formula GΦ for G (formula (4.70), p. 146). The “true” distribu-
tion is found using Monte Carlo simulation. We see that formula (4.78) gives
a good approximation.
Fig. 4.4. P10 (y) and P (Y10 ≤ y) for a parallel system of two components with
constant repair times, μG = 1, λ = 0.1
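Formula (4.78) is straightforward to evaluate numerically. For the system of Fig. 4.4 the repair times are constant, so GΦ is the uniform distribution on [0, 1] and G^{*n} is the Irwin–Hall distribution; the rate λΦ follows from (4.41). The sketch below is our own illustration of this computation:

```python
from math import comb, exp, factorial

lam, mu_G, u = 0.1, 1.0, 10.0
mu_F = 1.0 / lam                          # exponential lifetimes
A = mu_F / (mu_F + mu_G)                  # component availability
# steady-state failure rate (4.41) for a two-component parallel system
lam_phi = 2 * (1 - A) / (mu_F + mu_G)

def G_conv(n, y):
    """n-fold convolution of the uniform distribution on [0,1] (Irwin-Hall CDF)."""
    if n == 0:
        return 1.0
    y = min(y, float(n))
    return sum((-1) ** k * comb(n, k) * (y - k) ** n
               for k in range(int(y) + 1)) / factorial(n)

def P_u(y):
    """Compound Poisson approximation (4.78) for P(Y_u <= y)."""
    m = lam_phi * (u - y)
    return sum(G_conv(n, y) * exp(-m) * m ** n / factorial(n) for n in range(40))
```

For instance Pu(0) = e^{−λΦ·u} ≈ 0.85, matching the left end of the P10 curve in the figure.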
Theorem 4.43. Assume that X is a regenerative process, and that Fij and
Gij change in such a way that the following conditions hold (as j → ∞) :
q → 0,   (4.79)

q c²0S → 0,   (4.80)

where c²0S = [E0S²/(E0S)²] − 1 denotes the squared coefficient of variation of S under P0,

q E1S / ES → 0,   (4.81)

E1(NS − 1) → 0,   (4.82)

Yi1 →D G*Φ.   (4.83)

Then (as j → ∞)

Yt/α →D CP(t, G*Φ),   (4.84)

where α = λΦ, q/E0S, q/ES, or 1/ETΦ.
Proof. First we will introduce two renewal processes, Ñ and N̂, having the
same asymptotic properties as Nt/α. From Theorem 4.31, p. 136, we know
that

Nt/α →D Poisson(t)

under conditions (4.79)–(4.82).
Let ν(1) equal the renewal cycle index associated with the first “fiasco”
renewal cycle, and let U1 denote the time to the starting point of this cycle,
i.e.,

U1 = Σ_{i=1}^{ν(1)−1} Si.

Note that if the first cycle is a “fiasco” cycle, then U1 = 0. Starting from the
beginning of the renewal cycle ν(1) + 1, we define U2 as the time to the starting
point of the next “fiasco” renewal cycle. Similarly we define U3, U4, . . .. The
random variables Ui are equal to the interarrival times of the renewal process
Ñt, i.e.,

Ñt = Σ_{k=1}^∞ I(Σ_{i=1}^k Ui ≤ t).
By repeating the proofs of Theorem 4.25 (p. 129) and Theorem 4.31 it is seen
that

Ñt/α →D Poisson(t).   (4.85)

Using that the process Ñt and the random variables Yi are independent, and
the fact that Yi1 →D G*Φ (assumption (4.83)), it follows that

Σ_{i=1}^{Ñt/α} Yi1 →D CP(t, G*Φ).   (4.86)
A formal proof of this can be carried out using Moment Generating Functions.
Next we introduce N̂t as the renewal process having interarrival times
with the same distribution as U1 + Sν(1), i.e., the renewal cycle also includes
the “fiasco” cycle. It follows from the proof of Theorem 4.25, using condition
(4.81), that N̂t/α has the same asymptotic Poisson distribution as Ñt/α.
It is seen that

N̂t ≤ Ñt ≤ N̂t + 1,   (4.87)

Σ_{i=1}^{N̂t} N(i) ≤ Nt ≤ Σ_{i=1}^{Ñt} N(i),   (4.88)

where N(i) equals the number of system failures in the ith “fiasco” cycle.
Note that Ñt is at least the number of “fiasco” cycles up to time t, including
the one that is possibly running at t, and N̂t equals the number of finished
“fiasco” cycles at time t without the one possibly running at t.
Now to prove the result (4.84) we will make use of the following inequalities:

Yt/α ≤ Σ_{i=1}^{Ñt/α} Yi1 + Σ_{i=1}^{Ñt/α} Yi2,   (4.89)

Yt/α ≥ Σ_{i=1}^{Ñt/α} Yi1 − Σ_{i=N̂t/α+1}^{Ñt/α} Yi1.   (4.90)
In view of (4.86), and the inequalities (4.89) and (4.90), we need to show that

Σ_{i=1}^{Ñt/α} Yi2 →P 0,   (4.91)

Σ_{i=N̂t/α+1}^{Ñt/α} Yi1 →P 0.   (4.92)

For each i we have

P(Yi2 > ε) ≤ P1(NS ≥ 2) ≤ E1(NS − 1) → 0

by (4.82), so that Yi2 →P 0. Using Moment Generating Functions it can then be shown that (4.91)
holds.
The key part of the proof of (4.92) is to show that (Ñt/α) is uniformly integrable in j (t fixed). If this result is established, then since Ñt/α →D Poisson(t)
by (4.85) it follows that

EÑt/α → t.   (4.93)

And because of the inequality (4.87), (N̂t/α) is also uniformly integrable so
that EN̂t/α → t, and we can conclude that (4.92) holds noting that

P(Ñt/α − N̂t/α ≥ 1) ≤ EÑt/α − EN̂t/α → 0.

Thus it remains to show that (Ñt/α) is uniformly integrable.
Let FU denote the probability distribution of U1 and let Vl = Σ_{i=1}^l Ui.
Then we obtain

E[Ñt/α I(Ñt/α ≥ k)] = Σ_{l=k}^∞ P(Ñt/α ≥ l) + (k − 1)P(Ñt/α ≥ k)
= Σ_{l=k}^∞ P(Vl ≤ t/α) + (k − 1)P(Vk ≤ t/α)
= Σ_{l=k}^∞ FU^{*l}(t/α) + (k − 1)FU^{*k}(t/α)
≤ Σ_{l=k}^∞ (FU(t/α))^l + (k − 1)(FU(t/α))^k
= (FU(t/α))^k / (1 − FU(t/α)) + (k − 1)(FU(t/α))^k.
Since FU(t/α) → 1 − e^{−t}, as j → ∞, it follows that for any sequence Fij, Gij
satisfying the conditions (4.79)–(4.82), (Ñt/α) is uniformly integrable. To see
this, let ε be given such that 0 < ε < e^{−t}. Then for j ≥ j0 (say) we have

sup_{j≥j0} E[Ñt/α I(Ñt/α ≥ k)] ≤ (1 − e^{−t} + ε)^k / (e^{−t} − ε) + (k − 1)(1 − e^{−t} + ε)^k.

Consequently,

lim_{k→∞} sup_j E[Ñt/α I(Ñt/α ≥ k)] = 0,

i.e., (Ñt/α) is uniformly integrable, and the proof is complete.
Remark 4.44. The conditions (4.79)–(4.82) of Theorem 4.43 ensure the asymptotic Poisson distribution of Nt/α, cf. Theorem 4.31, p. 136. Sufficient conditions for (4.79)–(4.82) are given in Theorem 4.27, p. 131.
Asymptotic Normality
By the Central Limit Theorem for renewal reward processes, the accumulated downtime Yt is asymptotically normally distributed,

(Yt − Āt)/√t →D N(0, τΦ²) as t → ∞,   (4.94)

where

τΦ² = Var[Y − ĀS]/ES,   (4.95)
Ā = EY/ES.   (4.96)
Proof. The result (4.94) follows by applying the Central Limit Theorem for
renewal reward processes, Theorem B.17, p. 280, in Appendix B.
For highly available components this yields the approximation

τΦ² ≈ λΦ EY1²,   (4.97)

where Y1 is the downtime of the first system failure (note that Y1 = Y11).
The idea used to establish (4.97) is the following: As before, let S be equal to
the time of the first return to the best state (1, 1, . . . , 1). Then (4.97) follows
by using (4.95), (4.96), Ā ≈ 0, the fact that Y ≈ Y1 if a system failure
occurs in the renewal cycle, the fact that the probability of two or more failures occurring
in the renewal cycle is negligible, and λΦ = ENS/ES (by the Renewal Reward
Theorem, p. 280).
we will use the word “demand rate” also in the general case. The state of the
system at time t, which we in the following refer to as the throughput rate,
is assumed to be a function of the states of the components and the demand
rate, i.e.,
Φt = Φ(Xt , Dt ).
If Dt is a constant, we write Φ(Xt ). The process (Φt ) takes values in
{Φ0 , Φ1 , . . . , ΦM }.
Performance Measures
The performance measures introduced in Sect. 4.1, p. 105, can now be gener-
alized to the above model.
(a) For a fixed time t we define point availabilities.
(b) Let NJ be defined as the number of times the system state is below demand in J. The following performance measures related to NJ are considered:

P(NJ ≤ k), k ∈ N0,
ENJ,
P(Φt ≥ Dt, ∀t ∈ J) = P(NJ = 0).

(c) Let YJ denote the lost volume (downtime) in J. The following performance measures related to YJ are considered:

P(YJ ≤ y), y ∈ R+,
EYJ/|J|,
E[∫J Φt dt] / E[∫J Dt dt],   (4.98)

where |J| denotes the length of the interval J. The measure (4.98) is called throughput availability.
(d) Let

ZJ = (1/|J|) ∫J I(Φt ≥ Dt) dt.

The random variable ZJ represents the portion of time the throughput rate equals (or exceeds) the demand rate. We consider the following performance measures related to ZJ:

P(ZJ ≤ y), y ∈ R+,
EZJ.
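For a realized, piecewise-constant throughput trace, the measures in (c) and (d) are simple time-weighted sums. The snippet below is a small illustration of ours with made-up segment data (duration, throughput rate) and a constant demand D:

```python
# hypothetical trace over J: (duration, throughput rate) segments, constant demand
segments = [(5.0, 100.0), (2.0, 50.0), (3.0, 100.0)]
D = 100.0

length_J = sum(d for d, _ in segments)
# realized Z_J: portion of time the throughput rate meets the demand rate
Z_J = sum(d for d, phi in segments if phi >= D) / length_J
# realized throughput availability, cf. (4.98)
throughput_avail = sum(d * phi for d, phi in segments) / (D * length_J)
```

Here ZJ = 0.8 while the throughput availability is 0.9: the system meets demand 80 % of the time but delivers 90 % of the demanded volume, illustrating that the two measures capture different aspects of performance.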
Computation
We now briefly look into the computation problem for some of the measures
defined above. To simplify the analysis we shall make the following assump-
tions:
Assumptions
1. J = [0, u].
2. The demand rate Dt equals the maximum throughput rate ΦM for all t.
4.7 Generalizations and Related Models 161
3. The limiting probabilities air = lim_{t→∞} pir(t) exist.
Arguing as in the binary case we can use results from regenerative and renewal
reward processes to generalize the results obtained in the previous sections.
To illustrate this, we formulate some of these extensions below. The proofs
are omitted. We will focus here on the asymptotic results. Refer to Theorems
4.16, p. 120, and 4.19, p. 122, for the analogous results in the binary case. We
need the following notation:
μi = ETim + ERim
Nt^k = N^k_{[0,t]} (k is fixed)
pir(t) = P(Xt(i) = xir); if t is fixed, we write pir and X(i)
p = (p10, . . . , pnMn)
a = (a10, . . . , anMn)
Φk(X) = I(Φ(X) ≥ Φk)
hk(p) = EΦk(X)
h(p) = EΦ(X)
(1ir, p) = p with pir replaced by 1 and pil = 0 for l ≠ r.
and let filr denote the expected number of times component i makes a tran-
sition from state xil to state xir during a cycle of component i. Assume
filr < ∞. Then the expected number of times the system state is below Φk
per unit of time in the long run equals
The limit (4.99) is denoted λΦ . If the random variables Tim are exponen-
tially distributed, then X is regenerative, cf. Theorem 4.23, p. 124.
It is also possible to extend the asymptotic results related to the distribu-
tion of the number of system failures at level k, and the distribution of the
lost volume (downtime). We can view the system as a binary system of binary
components, and the asymptotic results of Sects. 4.4–4.6 apply.
Consider the gas compression system example in Sect. 1.3.2, p. 13. Two design
alternatives were studied:
(i) One gas train with a maximum throughput capacity of 100 %.
(ii) Two trains in parallel, each with a maximum throughput capacity of 50 %.
Normal production is 100 %. Each train comprises compressor–turbine, cooler
and scrubber. To analyze the performance of the system it was considered
sufficient to use approximate methods developed for highly available systems,
as presented in this chapter. In the system analysis, each train was treated as
one component, having exponential lifetime distribution with a failure rate of
13 per year, and mean repair time equal to
From this we find that the asymptotic unavailability Ā, given by formula
(4.2), p. 109, for a train equals 0.027, assuming 8,760 h per year. The number
of system failures per unit of time is given by the system failure rate λΦ . For
alternative (i) there is only one failure level and λΦ = 13. For alternative (ii)
we must distinguish between failures resulting in production below 100 % and
below 50 %. The system in these two cases can be viewed as a series system
of the two trains and a parallel system of the two trains, respectively. Hence
the system failure rate for these levels is approximately equal to 26 and 0.7,
respectively. Note that for the latter case (cf. (4.64), p. 140),
λΦ ≈ 2 · Ā · 13.
The first term in the sum represents the contribution from failures leading to
50 % loss, whereas the second term represents the contribution from failures
leading to 100 % loss. The latter contribution is in practice negligible compared
to the former one. To compute the distribution of the lost production, we
need to know more about the distribution of the repair time R of the train.
It was assumed in this application that ER² = 1,000, which corresponds
to a squared coefficient of variation equal to 1.9 and a standard deviation
equal to 25.7. The unit of time is hours. This assumption makes it possible to
approximate the distribution of the lost production during a year, using the
normal approximation. We know the mean (EY = 0.027) and need to estimate
the variance of Y . To do this we make use of (4.97), p. 158, stating that the
variance in the binary case is approximately equal to λΦ EY1²/t, where t is the
length of the time period considered and Y1 is the downtime of the first system
failure. For alternative (i) we find that the variance equals approximately
and for alternative (ii) (we ignore situations with both components down so
that the lost production is approximately 50 % of the downtime)
From this we estimate, for example, that the probability that the lost produc-
tion during 1 year is more than 4 % of demand, to be 0.16 for alternative (i)
and 0.08 for alternative (ii).
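The reported probabilities 0.16 and 0.08 can be reproduced with a few lines of code. The sketch below is our reconstruction of the computation; it assumes, as in the text, that the loss per failure is the (scaled) repair time, with Var(Y) ≈ λΦ EY1²/t from (4.97):

```python
import math

hours = 8760.0      # hours per year
Abar = 0.027        # asymptotic unavailability of one train
ER2 = 1000.0        # second moment of the train repair time (h^2)

def p_loss_exceeds(x, failures_per_year, loss_fraction, EY):
    """Normal approximation for P(lost production fraction > x) over one year."""
    lam_phi = failures_per_year / hours               # failures per hour
    var = lam_phi * loss_fraction ** 2 * ER2 / hours  # cf. (4.97)
    z = (x - EY) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2))          # = 1 - Phi(z)

p_i = p_loss_exceeds(0.04, 13.0, 1.0, Abar)               # alternative (i)
p_ii = p_loss_exceeds(0.04, 26.0, 0.5, 2 * 0.5 * Abar)    # alternative (ii)
```

This gives p_i ≈ 0.16 and p_ii ≈ 0.08, in agreement with the values stated above.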
In the asymptotic analysis in Sects. 4.4–4.6 main emphasis has been placed
on the situation that the lifetimes are exponentially distributed. Using the
so-called phase-type approach, we can show that the multistate model also
covers more general lifetime distributions.
We can conclude that the set-up also covers mixtures of Erlang distributions,
and Theorems 4.25, 4.31, and 4.43 apply.
Note that we have not proved that the limiting results obtained in the
previous sections hold true for general lifetime distributions Fij . We have
shown that if the distributions Fij all belong to a certain class of mixtures of
Erlang distributions, then the results hold. Starting from general distributions
Fij , we can write Fij as a limit of Fijr , r → ∞, where Fijr are mixtures of
Erlang distributions. But interchanging the limits as j → ∞ and as r → ∞
is not justified in general. Refer also to Bibliographic Notes, p. 173, for some
comments related to the non-exponential case.
Consider the model as described in Sect. 4.3, p. 120, but assume now that
there are repair constraints, i.e., a maximum of r (r < n) components can
be repaired at the same time. Hence if i, i > r, components are down, the
remaining i − r components are waiting in a repair queue. We shall restrict
attention to the case r = 1, i.e., there is only one repair facility (channel)
available. The repair policy is first come first served. We assume exponentially
distributed lifetimes.
Consider first a parallel system of two components, and the set-up of
Sect. 4.4, p. 126. It is not difficult to see that ETΦ, q, and E0S are identical to the corresponding quantities when there are no repair constraints; see
the section on the parallel system of two identical components, p. 139. We can also
find explicit expressions for ES and λΦ. Since the time to the first component
failure is exponentially distributed with parameter 2λ, ES = 1/(2λ) + ES′,
where S′ equals the time from the first component failure until the process
again returns to (1, 1). Denoting the repair time of the failed component by
R, we see that

ES′ = μG + qE[S′ − R | NS ≥ 1].

But E[S′ − R | NS ≥ 1] = ES′, and it follows that

ES = 1/(2λ) + μG/(1 − q).

Hence

λΦ = ENS/ES = [q/(1 − q)]/ES = 2λq/(1 − q + 2λμG).
Alternatively, and more easily, we could have found λΦ by defining a cycle S as
the time between two consecutive visits to a state with just one component
functioning. Then it is seen that ES = μG + (1 − q)/(2λ) and ENS = q, resulting
in the same λΦ as above.
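The expression for λΦ is easily checked by simulation. The sketch below is our own illustration for constant repair times, in which case q = 1 − e^{−λμG}; it simulates the two-component system with a single repair channel and compares the observed system failure rate with 2λq/(1 − q + 2λμG):

```python
import math
import random

def system_failure_rate(lam, mu_G, horizon=500_000.0):
    """Two identical components, Exp(lam) lifetimes, constant repair time mu_G,
    one repair channel (first come first served); returns failures per unit time."""
    t, failures, ups = 0.0, 0, 2
    while t < horizon:
        if ups == 2:
            t += random.expovariate(2 * lam)   # time to next component failure
            ups = 1                            # its repair starts immediately
        else:
            # one component under repair: does the other fail before it ends?
            if random.expovariate(lam) < mu_G:
                failures += 1                  # both down: system failure
                t += mu_G                      # first repair completes; the
                ups = 1                        # queued repair then starts afresh
            else:
                t += mu_G                      # repair done, both up again
                ups = 2
    return failures / t

random.seed(2)
lam, mu_G = 0.1, 1.0
q = 1.0 - math.exp(-lam * mu_G)
theory = 2 * lam * q / (1 - q + 2 * lam * mu_G)
estimate = system_failure_rate(lam, mu_G)
```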
observing that if the state is i and the repair is completed at time s, then the
probability that the process jumps to state j, where j ≤ i ≤ n − 1, equals
the probability that i − j + 1 components fail before s and j − 1 components
survive s; and, furthermore, if the state is i and the repair is completed at
time s, then the probability that the process jumps to state i + 1 equals the
probability that i components survive s. Now if the process is in state n, it
stays there for an exponential time with rate nλ, and jumps to state n − 1.
Having established the transition probabilities, we can compute a number
of interesting performance measures for the system using results from semi-
Markov theory. For example, we have an explicit formula for the asymptotic
probability P(Φt = k) as t → ∞, which depends on the mean time spent
in each state and the limiting probabilities of the embedded discrete-time
Markov chain; see Ross [135], p. 104.
Model
Performance Measures
provided the limits exist. Clearly, the availability at time t, A(t), is given by
A(t) = pn (t) + pn−1 (t)
and the limiting availability, A, is given by
A = pn + pn−1 .
Computation
First, we focus on the limiting unavailability Ā, i.e., the expected portion of
time in the long run that at least two components are not functioning. Under
the assumption of constant failure and repair rates this unavailability can
easily be computed using Markov theory, noting that Φ is a birth and death
process. The probability p̃i of having i components down is given by (cf. [13],
p. 303)
p̃i = pn−i = zi / (1 + Σ_{j=1}^n zj),   (4.100)

where

zi = [(n − 1)(n − 1)!/(n − i)!] δ^i / Π_{l=1}^i ul,   i = 1, 2, . . . , n,
z0 = 1,
δ = μG/μF,
ul = 1 under repair regime R1 and ul = l under repair regime R2.
Note that if δ is small, then p̃i ≈ zi for i ≥ 1. Hence

Ā ≈ p̃2 ≈ [(n − 1)²/u2] δ².   (4.101)
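The exact probabilities (4.100) and the approximation (4.101) are easy to tabulate; the snippet below is our own illustration:

```python
from math import factorial

def down_probs(n, delta, regime="R1"):
    """Steady-state probabilities p~_i of having i components down, cf. (4.100)."""
    u = (lambda l: 1.0) if regime == "R1" else (lambda l: float(l))
    z = [1.0]                                  # z_0 = 1
    for i in range(1, n + 1):
        prod_u = 1.0
        for l in range(1, i + 1):
            prod_u *= u(l)
        z.append((n - 1) * factorial(n - 1) / factorial(n - i)
                 * delta ** i / prod_u)
    total = sum(z)
    return [zi / total for zi in z]

n, delta = 3, 0.01
p = down_probs(n, delta, regime="R2")
Abar_exact = sum(p[2:])                        # at least two components down
Abar_approx = (n - 1) ** 2 * delta ** 2 / 2    # (4.101) with u2 = 2 under R2
```

For δ = 0.01 the approximation is within about 2 % of the exact value.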
Ā = EY/ES.   (4.102)
Here system downtime corresponds to the time two or more of the components
are not functioning. Let us now look closer into the problem of computing Ā,
given by (4.102), under repair regime R1.
Ā = E[max{R − T, 0}] / (ET + E[max{R − T, 0}])   (4.103)
= (μG − w) / (μF + (μG − w)),   (4.104)

where

w = E[min{R, T}] = ∫0^∞ F̄(t)Ḡ(t) dt.

For small δ this leads to the approximation

Ā ≈ Ā′,   (4.105)

where

Ā′ = (λ²/2) ER² = (δ²/2)[1 + c²G].   (4.106)
This gives a simple approximation formula for computing Ā. The approximation (4.105) is established formally by the following proposition:

0 ≤ Ā′ − Ā ≤ (Ā′)² + (δ³/6)(ER³/μ³G).   (4.107)

Hence Ā′ overestimates the unavailability and the error term will be negligible provided that δ = μG/μF is sufficiently small.
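The error bound can be verified numerically. For constant repair times we have c²G = 0 and ER³ = μ³G, and the exact unavailability follows from (4.103)–(4.104). The snippet below is our own illustration:

```python
import math

lam, mu_G = 0.01, 1.0
mu_F = 1.0 / lam
delta = mu_G / mu_F

# exact unavailability (4.103)-(4.104): w = E[min(R, T)] with R constant
w = (1.0 - math.exp(-lam * mu_G)) / lam
A_exact = (mu_G - w) / (mu_F + (mu_G - w))

A_prime = lam ** 2 * mu_G ** 2 / 2        # (4.106) with c_G^2 = 0
bound = A_prime ** 2 + delta ** 3 / 6     # (4.107) with ER^3 / mu_G^3 = 1
```

Both inequalities in (4.107) hold here, and the error Ā′ − Ā is of the order δ³, i.e., negligible for small δ.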
Next, let us compare the approximation formula Ā′ with the standard
“Markov formula” ĀM = δ², obtained by assuming exponentially distributed
failure and repair times (replace c²G by 1 in the expression (4.106) for Ā′, or
use the Markov formula (4.101), p. 168). It follows that
Ā′ = ĀM · (1/2)[1 + c²G].   (4.110)
From this, we see that the use of the Markov formula when the squared
coefficient of variation of the repair time distribution, c²G, is not close to 1,
will introduce a relatively large error. If the repair time is a constant, then
c²G = 0 and the unavailability using the Markov formula is two times Ā′. If c²G
is large, say 2, then the unavailability using the Markov formula is 2/3 of Ā′.
Assume now n > 2. The repair regime is R1 as before. Assume that δ is
relatively small. Then it is possible to generalize the approximations obtained
above for n = 2.
Since δ is small, there will be a negligible probability of having Φ ≤ n − 3,
i.e., three or more components not functioning at the same time. By neglecting
this possibility we obtain a simplified process that is identical to the process
for the two-component system analyzed above, with failure rate (n − 1)λ.
Hence by replacing λ with (n − 1)λ, formula (4.105) is valid for general n, i.e.,
Ā ≈ Ā′, where

Ā′ = ([(n − 1)δ]²/2)[1 + c²G].
The error bounds are, however, more difficult to obtain, see [27].
The relation between the approximation formulas Ā′ and ĀM, given by
(4.101), p. 168, is the same for all n ≥ 2. Hence Ā′ = ĀM · (1/2)[1 + c²G] (formula
(4.110)) holds for n > 2 too.
Next we will establish results for the long run average number of sys-
tem failures. It follows from the Renewal Reward Theorem that ENt /t and
E[Nt+s − Nt ]/s converge to λΦ = EN/ES as t → ∞, where N equals the
number of system failures in one renewal cycle and S equals the length of
the cycle as before. With probability one, Nt /t converges to the same value.
Under repair regime R1, N ∈ {0, 1}. Hence EN equals the probability that
the system fails in a cycle, i.e., EN = q using the terminology of Sects. 4.3
and 4.4. Below we find expressions for λΦ in the case that the repair regime
is R1. The regenerative points are consecutive visits to state n − 1.
Proof. First note that EY equals the expected downtime in a cycle, EY = E[max{R − T, 0}], and that

ES = μF + EY.   (4.113)
Suppose the system has just jumped to state 1. We then have one component
operating and one undergoing repair. Now if a system failure occurs (i.e.,
T ≤ R), then the cycle length equals R, and if a system failure does not occur
(i.e., T > R), then the cycle length equals T . Consequently,
We see from (4.111) that if F (t) is exponential with rate λ and the com-
ponents are highly available, then
λΦ ≈ λ²μG.
If n > 2 and the repair regime is R1, it is not difficult to see that q
is given by (4.112) with F(t) replaced by 1 − e^{−(n−1)λt}. It is, however, more
difficult to find an expression for ES. For highly available components, we can
approximate the system with a two-state system with failure rate (n − 1)λ;
hence,

λΦ ≈ [(n − 1)λ]²μG,
ES ≈ 1/((n − 1)λ).
When the state process of the system jumps from state n to n − 1, it will return to state n with a high probability and the sojourn time in state n − 1 will
be relatively short; consequently, the expected cycle length is approximately
equal to the expected time in the best state n, i.e., 1/((n − 1)λ).
Repair Regime R2. Finally in this section we briefly comment on the repair
regime R2. We assume constant failure rates. It can be argued that if there
are ample repair facilities, i.e., the repair regime is R2, the steady-state unavailability is invariant with respect to the repair time distribution, cf., e.g., Smith
[145] and Tijms [156], p. 175. This means that we can use the steady-state
Markov formula (4.100), p. 168, also when the repair time distribution is not
exponential. The result only depends on the repair time distribution through
its mean value. However, a strict mathematical proof of this invariance result
does not seem to have been presented yet.
al. [111]. See also Lam and Lehoczky [114] and the references therein. These
results are not applicable for the availability problems studied in this book.
Sections 4.5 and 4.6 are to a large extent based on Gåsemyr and Aven
[72], Aven and Haukås [23], and Aven and Jensen [26]. Gåsemyr and Aven
[72] and Aven and Haukås [23] study the asymptotic downtime distribution
given system failure. Theorem 4.42 is due to Haukås (see [26, 81]) and Smith
[146]. Aven and Jensen [26] gives sufficient conditions for when a compound
Poisson distribution is an asymptotic limit for the distribution of the downtime
of a monotone system observed in a time interval. An alternative approach
for establishing the compound Poisson process limit is given by Serfozo [138].
There exist several asymptotic results in the literature linking the sums of
independent point processes with integer marks to the compound Poisson
process; see, e.g., [153]. It is, however, not possible to use these results for
studying the asymptotic downtime distributions of monotone systems.
Section 4.7.1 generalizes results obtained in the previous sections to mul-
tistate systems. The presentation on multistate systems is based on Aven
[11, 14]. For the analysis in Sect. 4.7.3 on standby systems, reference is given
to the work by Aven and Opdal [27].
In this chapter we have primarily focused on the situation that the com-
ponent lifetime distributions are exponential. In Sect. 4.7.1 we outlined how
some of the results can be extended to phase-type distributions. A detailed
analysis of the nonexponential case (nonregenerative case) is however outside
the scope of this book. Further research is needed to present formally proved
results for the general case. Presently, the literature covers only some partic-
ular cases. Intuitively, it seems clear that it is possible to generalize many of
the results obtained in this chapter. Consider, for example, the convergence
to the Poisson process for the number of system failures. As long as the com-
ponents are highly available, we would expect that the number of failures are
approximately Poisson distributed. But formal asymptotic results are rather
difficult to establish; see, for example, [102, 106, 112, 152, 162]. Strict con-
ditions have to be imposed to establish the results, to the system structure
and the component lifetime and downtime distributions. Also the general ap-
proach of showing that the compensator of the counting process converges in
probability (see Daley and Vere-Jones [58], p. 552), is difficult to apply in our
setting.
Of course, this chapter covers only a small number of availability mod-
els compared to the large number of models presented in the literature. We
have, for example, not included models where some components remain in
“suspended animation” while a component is being repaired/replaced, and
models allowing preventive maintenance. For such models, and other related
models, refer to the above cited references, Beichelt and Franken [36], Osaki
[128], Srinivasan and Subramanian [150], Van Heijden and Schornagel [160],
and Yearout et al. [166]. See also the survey paper by Smith et al. [147].
5 Maintenance Optimization
For this policy a replacement age s, s > 0, is fixed for each system at which
a preventive replacement takes place. If Ti , i = 1, 2, . . . , are the successive
lifetimes of the systems, then τi = Ti ∧ s denotes the operating time of the ith
system and equals the ith cycle length. The random variables Ti are assumed
to form an i.i.d. sequence with common distribution F , i.e., F (t) = P (Ti ≤ t).
The costs for one cycle are described by the stochastic process Z = (Zt), t ∈
R+, Zt = c + kI(T ≤ t). Clearly, the average cost per unit of time at time t is

Ct = (1/t) Σ_{i=1}^{Nt} Zτi,
Under this policy the item is replaced at times is, i = 1, 2, . . . and s > 0,
and at failures. The preventive replacements occur at regular predetermined
intervals at a cost of c, whereas failures within the intervals incur a cost of
c + k.
The advantage of this policy is the simple structure and administration
because the time points of preventive replacements are fixed and determined
in advance. On the other hand, preventive replacements are carried out, ir-
respective of the age of the processing unit, so that this policy is usually
applied to several units at the same time and only if the replacement costs c
are comparatively low.
For a fixed time interval s the long run average cost per unit time is

K_s = ((c + k)M(s) + c)/s,   (5.2)

where M is the renewal function M(t) = Σ_{j=1}^∞ F^{*j}(t) (see Appendix B, p.
274). If the renewal function is known explicitly, we can again use elementary
analysis to find the optimal s, i.e., to find s∗ with
In most cases the renewal function is not known explicitly. In such a case
asymptotic expansions like Theorem B.5, p. 277 in Appendix B or numerical
methods have to be used. As is to be expected in the case of an Exp(λ)
distribution, preventive replacements do not pay: M (s) = λs and s∗ = ∞.
Example 5.2. Let F be the Gamma distribution function with parameters λ >
0 and n = 2. The corresponding renewal function is

M(s) = λs/2 − (1/4)(1 − e^{−2λs})

(cf. [1], p. 274) and s* can be determined as the solution of

(d/ds) M(s) = M(s)/s + c/(s(c + k)).

The solution s* is finite if and only if c/(c + k) < 1/4, i.e., if failure replacements are at least four times more expensive than preventive replacements.
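Numerically, the optimality condition can be rewritten as s M′(s) − M(s) = c/(c + k); the left-hand side increases from 0 to 1/4, which is exactly the finiteness criterion above. A minimal sketch (parameter values are illustrative):

```python
import math

lam, c, k = 1.0, 1.0, 9.0    # illustrative; c/(c+k) = 0.1 < 1/4, so s* is finite

def M(s):   # renewal function for the Gamma(lam, n=2) lifetime
    return lam * s / 2.0 - 0.25 * (1.0 - math.exp(-2.0 * lam * s))

def dM(s):  # its derivative
    return 0.5 * lam * (1.0 - math.exp(-2.0 * lam * s))

def h(s):   # optimality condition rewritten as h(s) = 0
    return s * dM(s) - M(s) - c / (c + k)

lo, hi = 1e-9, 50.0          # h(lo) < 0 < h(hi); h is increasing
for _ in range(100):         # bisection
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if h(mid) < 0.0 else (lo, mid)
s_star = 0.5 * (lo + hi)

def K(s):   # long run average cost (5.2)
    return ((c + k) * M(s) + c) / s
```

The root s* of h is the unique stationary point of K_s, and K is decreasing before and increasing after it, so it is the minimizer.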
The age and block replacement policies will result in a finite optimal value
of s only if there is some aging and wear-out of the units, i.e., in probabilistic
terms the lifetime distribution F fulfills some aging condition like IFR, NBU,
or NBUE (see Chap. 2 for these notions). To judge whether it pays to follow
a certain policy and in order to compare the policies it is useful to consider
the number of failures and the number of planned preventive replacements in
a time interval [0, t].
Parts (i) and (ii) say that under the weak aging notion NBU it is useful
to apply a replacement strategy, since the number of failures is (stochasti-
cally) decreased under such a strategy. If, in addition, F has an increasing
failure rate, block replacement results in stochastically fewer failures than age
replacement, and it follows that EN_t^A(s) ≥ EN_t^B(s). On the other hand, for
any lifetime distribution F (irrespective of aging notions) block policies have
more removals than age policies.
This result says that IFR is characterized by the reasonable aging condi-
tion that the number of failures is growing with increasing replacement age.
Somewhat weaker results hold true for the block policy (see Shaked and Zhu
[143] for proofs):
Theorem 5.6. The expected value EN_t^B(s) is increasing in s for each t ≥ 0
if and only if the renewal function M (t) is convex.
(a) lim_{t→∞} r(t) = ∞   or   (b) lim_{t→∞} r(t) = a and lim_{t→∞} (at − R(t)) > c,
where c is the cost of a preventive replacement and r denotes the marginal de-
terioration cost rate. Again it can easily be seen that the basic age replacement
model (5.1) is a special case setting r(x) = kλ(x), where λ(x) = f (x)/F̄ (x) is
the failure rate. Now a very similar analysis can be carried out (see [63]) and
the same theorem holds true for this cost criterion except that condition (ii)
(b) has to be replaced by
This shows that behind these two quite different models the same optimization mechanism is at work. This has been exploited by Aven and Bergman
in [19] (see also [21]). They recognized that for many replacement models the
optimization criterion can be written in the form
E[∫_0^τ a_t h_t dt + c_0] / E[∫_0^τ h_t dt + p_0],   (5.5)
maximizes the expectation EZτ of some stochastic process Z. We will see that
the smooth semimartingale (SSM) representation of Z, as introduced in de-
tail in Sect. 3.1, is an excellent tool to carry out this optimization. Therefore,
we want to solve the stopping problem and to characterize optimal stopping
times for the case in which Z is an SSM and τ ranges in a suitable class of
stopping times, say
Without any conditions on the structure of the process Z one cannot hope to
find an explicit solution of the stopping problem. A condition called monotone
case in the discrete time setting can be transferred to continuous time as
follows.
Definition 5.8 (MON). Let Z = (f, M) be an SSM. Then the following
condition

{f_t ≤ 0} ⊂ {f_{t+h} ≤ 0} ∀ t, h ∈ R_+,   ⋃_{t∈R_+} {f_t ≤ 0} = Ω   (5.6)
ζ = inf{t ∈ R+ : ft ≤ 0}
EZζ = sup{EZτ : τ ∈ C F }.
Remark 5.10. The condition that the martingale is uniformly integrable can
be relaxed; in [98] it is shown that the condition may be replaced by

M_ζ ∈ L^1, ζ ∈ C^F,   lim_{t→∞} ∫_{{τ>t}} M_t^− dP = 0 ∀ τ ∈ C^F,
For the latter equality we make use of our agreement that σ(·) denotes the
completion of the generated σ-algebra so that, for instance, the event
{ρ = t} = ⋂_{n∈N} {t − 1/n < ρ ≤ t} is also included in σ(ρ ∧ t). Then we define
2. λ > 1. The monotone case holds true with the ILA stopping time ζ = 0.
It is not hard to show that in this case the martingale
M_t = Z_t − 1 − ∫_0^t Z_s (1 − λ) ds
so that the more general conditions mentioned in the above remark are
fulfilled with Mζ = 0. This yields
EZζ = 1 = sup{EZτ : τ ∈ C F }.
As in Sect. 3.1 we use the short notation Z = (f, M ) and X = (g, L). Almost
all of the stochastic processes used in applications without predictable jumps
admit such SSM representations. The following general assumption is made
throughout this section:
Assumption (A). Z = (f, M) and X = (g, L) are SSMs with EZ_0 > 0,
EX_0 ≥ 0, g_s > 0 for all s ∈ R_+, and M^T, L^T ∈ M_0 are uniformly
integrable martingales, where M_t^T = M_{t∧T}, L_t^T = L_{t∧T}.
Remember that all relations between real random variables hold (only) P -
almost surely. The first step to solve the optimization problem is to establish
bounds for K ∗ in (5.8).
Lemma 5.14. Assume that (A) is fulfilled and

q = inf{f_t(ω)/g_t(ω) : 0 ≤ t < T(ω), ω ∈ Ω} > −∞.

Then

b_l ≤ K* ≤ b_u

holds true, where the bounds are given by

b_u = EZ_T/EX_T,

b_l = E[Z_0 − qX_0]/EX_T + q   if E[Z_0 − qX_0] > 0,
b_l = EZ_0/EX_0                if E[Z_0 − qX_0] ≤ 0.
5.2 A General Replacement Model 185
Proof. Because T ∈ C_T^F, only the lower bound has to be shown. Since the
martingales M^T and L^T are uniformly integrable, the optional sampling theorem (see Appendix A, p. 262) yields EM_τ = EL_τ = 0 for all τ ∈ C_T^F and
therefore

K_τ ≥ (EZ_0 + qE[X_τ − X_0])/EX_τ = (EZ_0 − qEX_0)/EX_τ + q ≥ b_l.

The lower bound is derived observing that EX_0 ≤ EX_τ ≤ EX_T, which
completes the proof.
The following example gives these bounds for the basic age replacement
policy.
Example 5.15 (Continuation of Example 5.13). Let us return to the simple
cost process Z_t = c + kI(T ≤ t) with the natural filtration as before. Then
I(T ≤ t) has the SSM representation

I(T ≤ t) = ∫_0^t I(T > s)λ(s) ds + M_t,

where λ is the usual failure rate of the lifetime T. It follows that the processes
Z and X have representations

Z_t = c + ∫_0^t I(T > s)kλ(s) ds + M̃_t,   M̃_t = kM_t,

and

X_t = t = ∫_0^t ds.

Assuming the IFR property, we obtain with λ(0) = inf{λ(t) : t ∈ R_+} and
q = kλ(0) the following bounds for K* in the basic age replacement model:

b_u = EZ_T/EX_T = (c + k)/ET,

b_l = c/ET + kλ(0).
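The bounds are cheap to evaluate and compare with a direct minimization of the age replacement cost ratio EZ_τ/Eτ = (c + kF(s))/∫_0^s F̄(t) dt. The sketch below uses an illustrative linear failure rate λ(t) = a + bt, which is IFR with λ(0) = a; none of the parameter values come from the text.

```python
import math

a, b = 0.5, 2.0     # illustrative linear failure rate lam(t) = a + b t (IFR)
c, k = 1.0, 5.0

def Fbar(t):        # survival function exp(-int_0^t lam(u) du)
    return math.exp(-(a * t + 0.5 * b * t * t))

def int_Fbar(s, n=2000):   # trapezoidal rule for int_0^s Fbar(t) dt
    h = s / n
    return h * (0.5 * (Fbar(0.0) + Fbar(s)) + sum(Fbar(i * h) for i in range(1, n)))

ET = int_Fbar(10.0)        # E T; the tail beyond t = 10 is negligible here

def K(s):                  # age replacement cost ratio E Z_tau / E tau
    return (c + k * (1.0 - Fbar(s))) / int_Fbar(s)

K_star = min(K(0.05 * i) for i in range(1, 200))   # coarse grid search over s

b_u = (c + k) / ET
b_l = c / ET + k * a       # q = k * lam(0) for an IFR lifetime
```

With these values the minimum lies strictly between the bounds, i.e. preventive replacement pays.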
These bounds could also be established directly by using (5.1), p. 176. The
benefit of Lemma 5.14 lies in its generality, which also allows the bounds to
be found in more complex models as the following example shows.
Example 5.16 (Shock Model). Consider now a compound point process model
in which shocks arrive according to a marked point process (T_n, V_n) as was
outlined in Sect. 3.3.3. Here we assume that (T_n) is a nonhomogeneous Poisson process with a deterministic intensity λ(s) integrating to Λ(t) = ∫_0^t λ(s) ds
and that (V_n) forms an i.i.d. sequence of nonnegative random variables independent of (T_n) with V_n ∼ F. The accumulated damage up to time t is then
described by

R_t = Σ_{n=1}^{N_t} V_n,

where N_t = Σ_{n=1}^∞ I(T_n ≤ t) is the number of shocks that have arrived up
to time t. The lifetime of the system is modeled as the first time R_t reaches a
fixed threshold S > 0:

T = inf{t ∈ R_+ : R_t ≥ S}.
We stick to the simple cost structure of the basic age replacement model, i.e.,
Zt = c + kI(T ≤ t).
But now we want to minimize the expected costs per number of arrived shocks
in the long run, i.e.,
Xt = Nt .
This cost criterion is appropriate if we think, for example, of systems which
are used by customers at times Tn . Each usage causes some random damage
(shock). If the customers arrive with varying intensities governed by external
circumstances, e.g., different intensities at different periods of a day, it makes
no sense to relate the costs to time, and it is more reasonable to relate the
costs to the number of customers served.
The semimartingale representations with respect to the internal filtration
generated by the marked point process are (cf. Sect. 3.3.5, p. 89)

Z_t = c + ∫_0^t I(T > s) kλ(s) F̄((S − R_s)−) ds + M_t,

X_t = ∫_0^t λ(s) ds + L_t.

b_u = (c + k)/EX_T,

b_l = c/EX_T + kF̄(S−),
where EX_T = EΛ(T). Observe that X_T = inf{n ∈ N : Σ_{i=1}^n V_i ≥ S} and
{X_T > k} = {Σ_{i=1}^k V_i < S}. This yields

EX_T = Σ_{k=0}^∞ P(Σ_{i=1}^k V_i < S) ≤ Σ_{k=0}^∞ F^k(S−) = 1/F̄(S−),
if F(S−) < 1. In addition, using Wald's equation E[Σ_{n=1}^{X_T} V_n] = EX_T · EV_1 ≥ S,
we can derive the following alternative bounds:

b_u = (c + k) EV_1/S,

b_l = (c + k) F̄(S−),
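Both estimates of EX_T are easy to confirm by simulation; the sketch below uses illustrative choices (shock sizes V_n ∼ Uniform(0, 2), threshold S = 1.5) and checks the geometric upper bound 1/F̄(S−) together with the Wald lower bound S/EV_1.

```python
import random

def mean_shocks_to_threshold(sample_v, S, n_runs=50_000, seed=7):
    """Monte Carlo estimate of E[X_T], the number of shocks until the
    accumulated damage first reaches the threshold S."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_runs):
        damage, n = 0.0, 0
        while damage < S:
            damage += sample_v(rng)
            n += 1
        total += n
    return total / n_runs

S = 1.5
est = mean_shocks_to_threshold(lambda rng: 2.0 * rng.random(), S)  # V ~ U(0,2)

lower = S / 1.0      # Wald: E[X_T] E[V_1] = E[sum of the V_i] >= S, with E V_1 = 1
upper = 1.0 / 0.25   # E[X_T] <= 1/Fbar(S-), since P(V_1 >= 1.5) = 0.25
```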
Rt = K ∗ Lt − Mt .
The optimal stopping level x∗ for the ratio fs /gs can be determined from
EYσ = 0 and coincides with K ∗ as is shown in the following theorem.
Theorem 5.18. Assume (A) (see p. 184) and let ρ_x, x ∈ R, and the bounds
b_u, b_l be defined as above in (5.11) and in Lemma 5.14, p. 184, respectively. If
the process (rt ), t ∈ R+ , with rt = ft /gt has (bl , bu )-increasing paths on [0, T ),
then
σ = ρx∗ , with x∗ = inf{x ∈ R : xEXρx − EZρx ≥ 0}
is an optimal stopping time and x∗ = K ∗ .
Equally for x < K ∗ and v(x) = −EZ0 we have v(x) < v(K ∗ ) = 0 because of
EZ0 > 0. Therefore,
Remark 5.19. 1. If E[Z0 − qX0 ] < 0, then the lower bound bl in Lemma 5.14
is attained for σ = 0. So in this case K ∗ = EZ0 /EX0 is the minimum
without any further monotonicity assumptions.
2. If no monotonicity conditions hold at all, then x∗ = inf{x ∈ R : xEXρx −
EZρx ≥ 0} is the cost minimum if only stopping times of type ρx are
considered. But T = ρ∞ is among this restricted class of stopping times
so that x∗ is at least an improved upper bound for K ∗ , i.e., bu ≥ x∗ . From
the definition of x∗ we obtain x∗ ≥ Kρx∗ , which is obviously bounded
below by the overall minimum K ∗ : bu ≥ x∗ ≥ Kρx∗ ≥ K ∗ .
3. Processes r with (bl , bu )-increasing paths include especially unimodal or
bath-tub-shaped processes provided that r0 < bl .
Corollary 5.20. If (f_t) and (g_t) are deterministic with inverse of the ratio
r^{−1}(x) = inf{t ∈ R_+ : r_t = f_t/g_t ≥ x}, x ∈ R, and X_0 ≡ 0, then σ = t* ∧ T is
optimal with t* = r^{−1}(K*) ∈ R_+ ∪ {∞} and

K* = inf{ x ∈ R : ∫_0^{r^{−1}(x)} (x g_s − f_s) P(T > s) ds ≥ EZ_0 }.
As indicated in Sect. 3.2.4 in the context of the general lifetime model, the
semimartingale set-up has its advantage in opening new fields of applications.
One of these features is the aspect of partial information. In the framework
of stochastic process theory, the information is represented by a filtration, an
since all A-stopping times are also F-stopping times, and it has to be
investigated to what extent the information level influences the cost
minimum.
5.3 Applications
The general set-up to minimize the ratio of expectations allows for many
special cases covering a variety of maintenance models. Some few of these will
be presented in this section, which show how the general approach can be
exploited.
We first focus on the age replacement model with the long run average cost
per unit time criterion: find σ ∈ CTF with
K* = K_σ = EZ_σ/EX_σ = inf{K_τ : τ ∈ C_T^F},
b_l = (2α/3) c,   b_u = (2α/3)(c + 1).
Since Ab and Ac are subfiltrations of F and include Ad as a subfiltration, we
must have for the optimal stopping values
bl ≤ Ka∗ ≤ Kb∗ ≤ Kd∗ ≤ bu , Ka∗ ≤ Kc∗ ≤ Kd∗ ,
i.e., on a higher information level we can achieve a lower cost minimum. Let us
consider the complete information case in more detail. The failure rate process
is nondecreasing and the assumptions of Theorem 5.18, p. 188, are met. For
the stopping times ρx = inf{t ∈ R+ : λt ≥ x} ∧ T we have to consider values
of x in [b_l, b_u] and to distinguish between the cases 0 < x ≤ α and x > α:

• 0 < x ≤ α. In this case we have ρ_x = X_1 ∧ X_2, Eρ_x = 1/(2α), EZ_{ρ_x} = c, such
  that xEρ_x − EZ_{ρ_x} = 0 leads to x* = 2αc, where 0 < x* ≤ α is equivalent
  to c ≤ 1/2;
• α < x. In this case we have ρ_x = T, Eρ_x = 3/(2α), EZ_{ρ_x} = c + 1, such that
  x* = b_u; here x* > α is equivalent to c > 1/2.
The other information levels are treated in a similar way. Only case (b)
needs some special attention because the failure rate process λ̂^b is no longer
monotone but only piecewise nondecreasing. To meet the (b_l, b_u)-increasing
condition, we must have λ̂^b_h < b_l, i.e., 2α(1 − (2 − e^{−αh})^{−1}) < 2αc/3. This
inequality holds for all h ∈ R_+ if c ≥ 3/2, and for h < h(α, c) = −(1/α) ln((3 − 2c)/(3 − c))
if 0 < c < 3/2.
We summarize these considerations in the following proposition the proof
of which follows the lines above and is elementary but not straightforward.
c) K_c* = α√(2c),   σ_c = X_1 ∧ (−(1/α) ln(1 − √(2c)));
d) K_d* = 2α(√(c²/4 + c) − c/2),   σ_d = T ∧ (−(1/α) ln(1 − c/2 − √(c²/4 + c))).

For c > 1/2 we have on all levels K* = b_u and σ = T.
For decreasing c the differences between the cost minima increase. If the
costs c for a preventive replacement are greater than half of the penalty costs,
i.e., c > k/2 = 1/2, then extra information and preventive replacements are not
profitable.
Example 5.23. Under the above assumptions let (N_t) be a point process with
positive intensity (λ_s) and V_n ∼ Exp(ν). Then we get with F̄(x) = exp{−νx}
and EX_T = E[inf{n ∈ N : Σ_{i=1}^n V_i ≥ S}] = νS + 1 the bounds

b_l = c/(νS + 1) + k e^{−νS},

b_u = (c + k)/(νS + 1),

and the control-limit rules

ρ_x = inf{t ∈ R_+ : k exp{−ν(S − R_t)} ≥ x} ∧ T
    = inf{t ∈ R_+ : R_t ≥ (1/ν) ln(x/k) + S} ∧ T.

EX_{ρ_x} = νg(x) + 1,
EZ_{ρ_x} = c + kP(T = ρ_x) = c + k e^{−ν(S − g(x))} = c + x.
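In this exponential case EX_T = νS + 1 is exact, since the number of Exp(ν) summands needed to reach S is 1 plus a Poisson(νS) count, so the bounds can be written down directly. A short simulation confirms the mean; the parameter values are illustrative.

```python
import math
import random

nu, S, c, k = 2.0, 1.5, 1.0, 4.0   # illustrative parameters

EX_T = nu * S + 1.0                # exact: 1 + Poisson(nu*S) shocks reach S
b_l = c / EX_T + k * math.exp(-nu * S)
b_u = (c + k) / EX_T

rng = random.Random(11)
runs, total = 50_000, 0
for _ in range(runs):
    damage, n = 0.0, 0
    while damage < S:
        damage += -math.log(1.0 - rng.random()) / nu   # Exp(nu) shock size
        n += 1
    total += n
EX_T_mc = total / runs
```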
In this section the basic lifetime model for complex systems is combined with
the possibility of preventive replacements. A system with random lifetime T >
0 is replaced by a new equivalent one after failure. A preventive replacement
can be carried out before failure. There are costs for each replacement and an
additional amount has to be paid for replacements after failures. The aim is to
determine an optimal replacement policy with respect to some cost criterion.
Several cost criteria are known among which the long run average cost per unit
time criterion is by far the most popular one. But the general optimization
procedure also allows for other criteria. As an example the total expected
discounted cost criterion will be applied in this section. We will also consider
the possibility of taking different information levels into account. This set-up will
be applied to complex monotone systems for which in Sect. 3.2 some examples
of various degrees of observation levels were given. For the special case of a two-
component parallel system with dependent component lifetimes, it is shown
how the optimal replacement policy depends on the different information levels
and on the degree of dependence of the component lifetimes.
Consider a monotone system with random lifetime T, T > 0, with an F-semimartingale representation

I(T ≤ t) = ∫_0^t I(T > s) λ_s ds + M_t,   (5.12)
If all failure rates λt (i) are of IFR-type, then the F-failure rate process λ
and the ratio process r are nondecreasing. Therefore, Theorem 5.18, p. 188,
can be applied to yield σ = ρx∗ . So the optimal stopping time is among the
control-limit rules
ρ_x = inf{t ∈ R_+ : r_t ≥ x} ∧ T = inf{t ∈ R_+ : λ_t ≥ (α/k)(c + x)} ∧ T.
This means: replace the system the first time the sum of the failure rates of
critical components reaches a given level x∗ . This level has to be determined as
The effect of partial information is in the following only considered for the
case that no single component or only some of the n components are observed,
say those with index in a subset {i1 , i2 , . . . , ir } ⊂ {1, 2, . . . , n}, r ≤ n. Then the
subfiltration A is generated by T or by T and the corresponding component
lifetimes, respectively. The projection theorem yields a representation on the
corresponding observation level:
1 − Φ̂ = E[I{T ≤ t} | A_t] = I{T ≤ t} = ∫_0^t I(T > s) λ̂_s ds + M̄_t.
If the A-failure rate process λ̂t = E[λt |At ] is (bl , bu )-increasing, then the
stopping problem can also be solved on the lower information level by means
of Theorem 5.18. We want to carry out this in more detail in the next section,
allowing also for dependencies between the component lifetimes. To keep the
complexity of the calculations on a manageable level, we confine ourselves to
a two-component parallel system.
F̄ (x, y) = P (T1 > x, T2 > y) = P (Y1 > x, Y2 > y)P (Y12 > x ∨ y)
is then given by

F̄(x, y) = (β_1/γ_2) e^{−γ_2 x − (β̄_2 + β_{12})y} − ((β̄_2 − β_2)/γ_2) e^{−βy}   for x ≤ y,
F̄(x, y) = (β_2/γ_1) e^{−γ_1 y − (β̄_1 + β_{12})x} − ((β̄_1 − β_1)/γ_1) e^{−βx}   for x > y,   (5.16)

where here and in the following γ_i ≠ 0, i ∈ {1, 2}, is assumed. For β_i = β̄_i this
formula reduces to the Marshall–Olkin distribution and for β_{12} = 0 (5.16)
gives the Freund distribution. From (5.16) the distribution H of the system
lifetime T = T_1 ∨ T_2 can be obtained:
In the following it is assumed that βi ≤ β̄i , i ∈ {1, 2}, and β̄1 ≤ β̄2 ,
i.e., after failure of one component the stress placed on the surviving one is
increased. Without loss of generality the penalty costs for replacements after
failures are set to k = 1. The solution of the stopping problem will be outlined
in the following. More details are contained in [84].
The failure rate process λ on the F-observation level is given by (cf. Exam-
ple 3.27, p. 74)
Inserting q = −c + β_{12} α^{−1} in (5.15) we get the bounds for the stopping
value K*:

b_l = cv/(1 − v) + β_{12}/α   and   b_u = (c + 1)v/(1 − v),
where v = E[e−αT ] can be determined by means of the distribution H. Since
the failure rate process is monotone on [0, T ) the optimal stopping time can
be found among the control limit rules ρx = inf{t ∈ R+ : rt ≥ x} ∧ T :
ρ_x = 0            for x ≤ β_{12}/α − c,
ρ_x = T_1 ∧ T_2    for β_{12}/α − c < x ≤ (β̄_1 + β_{12})/α − c,
ρ_x = T_1          for (β̄_1 + β_{12})/α − c < x ≤ (β̄_2 + β_{12})/α − c,
ρ_x = T            for x > (β̄_2 + β_{12})/α − c.
Since the optimal value x* lies between the bounds b_l and b_u, the considerations can be restricted to the cases x ≥ b_l > β_{12} α^{−1} − c. In the first case, when
β_{12} α^{−1} − c < x ≤ (β̄_1 + β_{12})α^{−1} − c, one has ρ_x = T_1 ∧ T_2 and

E[1 − e^{−αρ_x}] = α/(β + α),

EZ_{ρ_x} = cE[e^{−αρ_x}] + E[I(T ≤ ρ_x) e^{−αρ_x}] = cβ/(β + α) + β_{12}/(β + α).

The solution of the equation

x* α/(β + α) − (cβ/(β + α) + β_{12}/(β + α)) = 0
is given by
The explicit formulas for the optimal stopping value were only presented here
to show how the procedure works and that even in seemingly simple cases
extensive calculations are necessary. The main conclusion can be drawn from
the structure of the optimal policy. For small values of c (note that the penalty
costs for failures are k = 1) it is optimal to stop and replace the system at the
first component failure. For mid-range values of c, the replacement should take
place when the “better” component with a lower residual failure rate (β̄1 ≤ β̄2 )
fails. If the “worse” component fails first, this results in a replacement after
system failure. For high values of c, preventive replacements do not pay, and
it is optimal to wait until system failure. In this case the optimal stopping
value is equal to the upper bound x∗ = bu .
The paths of the failure rate process λ depend only on the observable compo-
nent lifetime T1 and not on T2 . The paths are nondecreasing so that the same
procedure as before can be applied. For γ1 = β1 + β2 − β̄1 > 0 the following
results can be obtained:
ρ_{x*} = T_1 ∧ b*   for 0 < c ≤ c_1,
ρ_{x*} = T_1        for c_1 < c ≤ c_2,
ρ_{x*} = T          for c_2 < c,

x* = x*_1   for 0 < c ≤ c_1,
x* = x*_2   for c_1 < c ≤ c_2,
x* = x*_3   for c_2 < c.
The constants c1 , c2 and the stopping values x∗2 , x∗3 are the same as in the
complete information case. What is optimal on a higher information level and
can be observed on a lower information level must be optimal on the latter
too. So only the case 0 < c ≤ c1 is new. In this case the optimal replacement
time is T1 ∧ b∗ with a constant b∗ , which is the unique solution of the equation
Information About T
Numerical Examples
The following tables show the effects of changes of two parameters, the re-
placement cost parameter c and the “dependence parameter” β12 . To be able
to compare the cost minima K ∗ = x∗ , both tables refer to the same set of
parameters: β1 = 1, β2 = 3, β̄1 = 1.5, β̄2 = 3.5, α = 0.08. The optimal replace-
ment times are denoted:
a: ρ_{x*} = T_1 ∧ T_2   b: ρ_{x*} = T_1   c: ρ_{x*} = T_1 ∧ b*
d: ρ_{x*} = T ∧ b*      e: ρ_{x*} = T = T_1 ∨ T_2.
Table 5.1 shows the cost minima x∗ for different values of c. For small
values of c, the influence of the information level is greater than for moderate values. For c > 1.394 preventive replacements do not pay, and additional
information concerning T is not profitable.
Table 5.2 shows how the cost minimum depends on the parameter β12 . For
increasing values of β12 the difference between the cost minima on different
information levels decreases, because the probability of a common failure of
both components increases and therefore extra information about a single
component is not profitable.
respectively, where it is supposed that λ_{0j}(t) ≥ λ_{1j}(t) for all t ≥ 0. We assume
that the lifelength T_j of the jth item admits the following representation:

I(T_j ≤ t) = ∫_0^t I(T_j > s) λ_{Y_s j}(s) ds + M_t(j),   j = 1, . . . , m,   (5.18)
where Yt = I(τ < t), τ is the burn-in time and M (j) ∈ M is bounded in L2 .
This representation can also be obtained by modeling the lifelength of the
jth item in the following way:
Since we assume that the failure time of any item can be observed during the
burn-in phase, the observation filtration, generated by the lifelengths of the
items, is given by
EZζ = sup{EZτ : τ ∈ C F }.
In other words, at any time t the observer has to decide whether to stop or to
continue with burn-in with respect to the available information up to time t.
Since Z is not adapted to F, i.e., Zt cannot be observed directly, we consider
the conditional expectation
Ẑ_t = E[Z_t | F_t] = c Σ_{j=1}^m I(T_j > t) E[(T_j − t)^+ | T_j > t] − mc_F
      + (c_F − c_B) Σ_{j=1}^m I(T_j ≤ t).   (5.21)
As an abbreviation we use

μ_j(t) = E[(T_j − t)^+ | T_j > t] = (1/F̄_j(t)) ∫_t^∞ F̄_j(x) dx,   t ∈ R_+,

for the mean residual lifelength. The derivative with respect to t is given by
μ′_j(t) = −1 + λ_{1j}(t) μ_j(t). We are now in a position to apply Theorem 5.9,
p. 181, and formulate conditions under which the monotone case holds true.
Theorem 5.24. Suppose that the functions
Then
ζ = inf{ t ∈ R_+ : Σ_{j=1}^m I(T_j > t) g_j(t) ≤ 0 }
EZζ = sup{EZτ : τ ∈ C F }.
Substituting

I(T_j > s) = 1 + ∫_0^s (−I(T_j > x) λ_{0j}(x)) dx + M_j(s)

= −mc_F + c Σ_{j=1}^m μ_j(0) + ∫_0^t Σ_{j=1}^m I(T_j > s) g_j(s) ds + L_t
Remark 5.25. The structure of the optimal stopping time shows that high
rewards per unit operating time lead to short burn-in times whereas great
differences cF − cB between costs for failures in different phases lead to long
testing times, as expected.
and therefore we know that, if there exists an optimal finite stopping time σ,
then it is among the indexed stopping times
ρ_x = inf{t ∈ R_+ : λ_t ≥ x/c},   0 ≤ x ≤ b_u,
provided λ has nondecreasing paths. We summarize this in a corollary to
Theorem 5.18, p. 188.
Example 5.29. Consider the shock model with state-dependent failure proba-
bility of Sect. 3.3.4 in which shocks arrive according to a Poisson process with
rate ν (cf. Example 3.47, p. 89). The failure intensity is of the form

λ_t = ν ∫_0^∞ p(X_t + y) dF(y),
where p(Xt + y) denotes the probability of a failure at the next shock if the
accumulated damage is Xt and the next shock has amount y. Here we assume
that this probability function p does not depend on the number of failures in
the past. Obviously λt is nondecreasing so that Corollary 5.27 applies provided
that the integrability conditions are met.
the unobservable state of the system. The failure or minimal repair intensity
may depend on the state of the system. There is some constant flow of income,
on the one hand, and on the other hand, each minimal repair incurs a random
cost amount. The question is when to stop processing the system and carrying
out an inspection or a renewal in order to maximize some reward functional.
For the basic set-up we refer to Example 3.14, p. 65 and Sect. 3.3.9. Here
we recapitulate the main assumptions of the model:
The basic probability space (Ω, F , P ) is equipped with a filtration F,
the complete information level, to which all processes are adapted, and
S = {1, . . . , m} is the set of unobservable environmental states. The changes
of the states are driven by a homogeneous Markov process Y = (Yt ), t ∈ R+ ,
with values in S and infinitesimal parameters qi , the rate to leave state i, and
qij , the rate to reach state j from state i. The time points of failures (minimal
repairs) 0 < T1 < T2 < · · · form a point process and N = (Nt ), t ∈ R+ , is the
corresponding counting process:
N_t = Σ_{n=1}^∞ I(T_n ≤ t).
It is assumed that N has a stochastic intensity λ_{Y_t} that depends on the unobservable state, i.e., N is a so-called Markov-modulated Poisson process with
representation

N_t = ∫_0^t λ_{Y_s} ds + M_t,
where M is an F-martingale and 0 < λi < ∞, i ∈ S.
Furthermore, let (Xn ), n ∈ N, be a sequence of positive i.i.d. random
variables, independent of N and Y , with common distribution F and finite
mean μ. The cost caused by the nth minimal repair at time Tn is described
by Xn .
There is an initial capital u and an income of constant rate c > 0 per unit
time.
Now the process R, given by

R_t = u + ct − Σ_{n=1}^{N_t} X_n,
describes the available capital at time t as the difference of the income and
the total amount of costs for minimal repairs up to time t.
The process R is well-known in other branches of applied probability like
queueing or collective risk theory, where the time to ruin τ = inf{t ∈ R+ :
Rt < 0} is investigated (cf. Sect. 3.3.9). Here the focus is on determining the
optimal operating time with respect to the given reward structure. To achieve
this goal one has to estimate the unobservable state of the system at time t,
given the history of the process R up to time t. This can be done using results
The integrand Σ_{j=1}^m Û_s(j) r_j with Û_s(j) = E[U_s(j) | A_s] = P(Y_s = j | A_s) is the
conditional expectation of the net gain rate at time s given the observations
up to time s. If this integrand has nonincreasing paths, then we know that we
are in the “monotone case” (cf. p. 181) and the stopping problem could be
solved under some additional integrability conditions. To state monotonicity
conditions for the integrand in (5.26), an explicit representation of Ût (j) is
needed, which can be obtained by means of results in filtering theory (see [50],
p. 98, [93]) in the form of “differential equations”:
• Between the jumps of N: for T_n ≤ t < T_{n+1},

Û_t(j) = Û_{T_n}(j) + ∫_{T_n}^t Σ_{i=1}^m Û_s(i){q_{ij} + Û_s(j)(λ_i − λ_j)} ds,
q_{jj} = −q_j,   (5.27)
Û_0(j) = P(Y_0 = j),   j ∈ S.

• At jumps,

Û_{T_n}(j) = λ_j Û_{T_n−}(j) / Σ_{i=1}^m λ_i Û_{T_n−}(i),   (5.28)
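A rough way to see these equations at work is to discretize the drift in (5.27) with an Euler scheme and to apply the Bayes-type update (5.28) at the observed jump times. In the sketch below the generator, the intensities, and the jump times are all illustrative choices (the generator is taken upper triangular, so only upward state changes occur):

```python
def filter_path(q, lam, u0, jump_times, horizon, dt=1e-3):
    """Euler sketch of (5.27) between jumps with the update (5.28) at
    observed jumps of N; u[j] approximates P(Y_t = j | A_t)."""
    m = len(lam)
    u = list(u0)
    events = sorted(s for s in jump_times if 0.0 < s < horizon) + [horizon]
    t, out = 0.0, [(0.0, list(u0))]
    for t_next in events:
        n_steps = max(1, int(round((t_next - t) / dt)))
        h = (t_next - t) / n_steps
        for _ in range(n_steps):                     # drift of (5.27)
            lbar = sum(u[i] * lam[i] for i in range(m))
            du = [sum(u[i] * q[i][j] for i in range(m)) + u[j] * (lbar - lam[j])
                  for j in range(m)]
            u = [u[j] + h * du[j] for j in range(m)]
        t = t_next
        if t < horizon:                              # Bayes update (5.28) at a jump
            norm = sum(lam[i] * u[i] for i in range(m))
            u = [lam[j] * u[j] / norm for j in range(m)]
        out.append((t, list(u)))
    return out

# two unobservable states; q[i][j] is the i -> j rate, q[i][i] = -q_i
q = [[-0.5, 0.5], [0.0, 0.0]]
lam = [1.0, 3.0]
path = filter_path(q, lam, [1.0, 0.0], jump_times=[0.7, 1.1], horizon=2.0)
```

The drift in (5.27) preserves Σ_j Û_t(j) = 1 exactly, so the Euler iterates remain a probability vector up to floating-point error.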
the first time the conditional expectation of the net gain rate falls below 0.
Theorem 5.30. Let τ ∗ be the A-stopping time (5.30) and assume that con-
ditions (5.29) hold true. If, in addition, qim > λm − λi , i = 1, . . . , m − 1, then
τ ∗ is optimal:
EZτ ∗ = sup{EZτ : τ ∈ C A }.
Proof. Because of EZ_τ = EẐ_τ for all τ ∈ C^A we can apply Theorem 5.9, p.
181, taking the A-SSM representation (5.26) of Ẑ. We will proceed
in two steps:
(a) First, we prove that the monotone case holds true.
(b) Second, we show that the martingale part M̄ in (5.26) is uniformly inte-
grable.
(a) We start by showing that the integrand Σ_{j=1}^m Û_s(j) r_j has nonincreasing
paths. A simple rearrangement gives

Σ_{j=1}^m Û_s(j) r_j = r_m + (r_{m−1} − r_m) Σ_{j=1}^{m−1} Û_s(j) + · · · + (r_1 − r_2) Û_s(1).
= Σ_{i=1}^m Σ_{ν=1}^j Û_s(i) q_{iν} + Σ_{ν=1}^j Û_s(ν)(λ̄(s) − λ_ν)

= Σ_{i=1}^j Û_s(i) ( −Σ_{k=j+1}^m q_{ik} + λ̄(s) − λ_i ),

using q_{ij} = 0 for i > j and q_{ii} = −Σ_{k=i+1}^m q_{ik}, i = 1, . . . , m − 1.
From q_{im} > λ_m − λ_i ≥ λ̄(s) − λ_i it follows that

(d/ds) Σ_{ν=1}^j Û_s(ν) ≤ 0,   j = 1, . . . , m − 1.
Σ_{ν=1}^j (Û_{T_n}(ν) − Û_{T_n−}(ν)) = Σ_{ν=1}^j Û_{T_n−}(ν) (λ_ν − λ̄(T_n−))/λ̄(T_n−).
5.4 Repair Replacement Models 213
The condition λ_1 ≤ · · · ≤ λ_m ensures that the latter sum is not greater than
0. This is obvious in the case λ_j ≤ λ̄(T_n−); otherwise, if λ_j > λ̄(T_n−), this
follows from

0 = Σ_{ν=1}^m Û_{T_n−}(ν)(λ_ν − λ̄(T_n−))/λ̄(T_n−) ≥ Σ_{ν=1}^j Û_{T_n−}(ν)(λ_ν − λ̄(T_n−))/λ̄(T_n−).
For the monotone case to hold it is also necessary that

⋃_{t∈R_+} { Σ_{j=1}^m Û_t(j) r_j ≤ 0 } = Ω
with a nonnegative integrand. This shows that Ût (m) is a bounded submartin-
gale. Thus, the limit
Û_∞(m) = lim_{t→∞} Û_t(m) = E[U_∞(m) | A_∞]
∫_0^t Σ_{j=1}^m U_s(j) r_j ds = ∫_0^t Σ_{j=1}^m U_s(j)(r_j − r_m) ds + t r_m,
Û_t(2) = P(σ ≤ t | A_t) = 1 − e^{−g_n(t)} / ( d_n + (λ_2 − λ_1) ∫_{T_n}^t e^{−g_n(s)} ds ),   T_n ≤ t < T_{n+1},

Û_{T_n}(2) = λ_2 Û_{T_n−}(2) / ( λ_1 + (λ_2 − λ_1) Û_{T_n−}(2) ),

where d_n = (1 − Û_{T_n}(2))^{−1}, g_n(t) = (q − (λ_2 − λ_1))(t − T_n). The stopping
time τ* in (5.30) can now be written as

τ* = inf{t ∈ R_+ : Û_t(2) > z*},   z* = r_1/(r_1 − r_2).
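For this two-state case everything is explicit: when q ≠ λ_2 − λ_1 the integral of e^{−g_n(s)} is elementary, so the path of Û_t(2) between jumps can be evaluated in closed form. A minimal sketch (the rates are illustrative and chosen with q > λ_2 − λ_1, so Û_t(2) increases between jumps):

```python
import math

lam1, lam2, q = 1.0, 2.5, 2.0   # illustrative rates; here q > lam2 - lam1

def u_between_jumps(u_Tn, t_minus_Tn):
    """Closed-form conditional probability U_t(2) = P(sigma <= t | A_t)
    between jumps, started from the post-jump value u_Tn at T_n."""
    delta = lam2 - lam1
    kappa = q - delta                        # assumed nonzero here
    g = kappa * t_minus_Tn
    d_n = 1.0 / (1.0 - u_Tn)
    integral = (1.0 - math.exp(-g)) / kappa  # int_{T_n}^t e^{-g_n(s)} ds
    return 1.0 - math.exp(-g) / (d_n + delta * integral)

def u_at_jump(u_before):
    """Bayes update of U(2) at an observed jump of N."""
    return lam2 * u_before / (lam1 + (lam2 - lam1) * u_before)

u0 = 0.2
vals = [u_between_jumps(u0, s / 10.0) for s in range(11)]
```

At t = T_n the formula returns the starting value Û_{T_n}(2), and a jump with λ_2 > λ_1 always increases the conditional probability of the post-change state.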
For 0 < q < λ2 − λ1 , Ût (2) increases as long as Ût (2) < q/(λ2 − λ1 ) = r. When
Ût (2) jumps above this level, then between jumps Ût (2) decreases but not
below the level r. So even in this case under conditions (5.31) the monotone
case holds true if z ∗ ≤ q/(λ2 − λ1 ). As a consequence of Theorem 5.30 we
have the following corollary.
At the complete information level F the change time point σ can be observed,
and it is obvious that under conditions (5.31) the F-stopping time σ is optimal
in C F . Thus, we have the following upper and lower bounds bu and bl :
bl ≤ sup{EZτ : τ ∈ C A } ≤ bu
with
bl = sup{EZt : t ∈ R+ },
bu = sup{EZτ : τ ∈ C F } = EZσ .
In many cases, the presence of a fault in a system does not lead to an imme-
diate system failure; the system stays in a “defective” state. There will be a
time lapse between the occurrence of the fault and the failure of the system: a
“delay time”. This is the idea of the delay time models, which have been
thoroughly discussed in the literature. See the Bibliographic Notes at the end
of the chapter.
The delay time models are used as bases for determining monitoring strate-
gies for detecting system defects or faults. The state of the system is revealed
by inspections, except for failures which are observed. The basic delay time
model was introduced for analyzing inspection policies for systems regularly
inspected each T time units. If an inspection is carried out during the delay
time period, the defect is identified and removed. Thus, the delay time model
is based on the simplest monitoring framework possible: a defective state and
a nondefective state. In most of the models, the objective of the delay time
analysis is to determine optimal inspection times that minimize the (expected)
long-run average costs or downtimes.
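The mechanics of the basic model are easy to simulate: a defect arises at a random time U, the system fails a delay time V later, and the defect is removed at the first inspection epoch after U unless the failure comes first. The sketch below estimates the fraction of cycles ending in failure for two inspection intervals; the exponential choices for U and V and all parameter values are illustrative assumptions, not from the text.

```python
import math
import random

def simulate_delay_time(T, rate_u=1.0, rate_v=2.0, n_cycles=20_000, seed=3):
    """Fraction of replacement cycles ending in failure, for the basic
    delay time model with inspections every T time units."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_cycles):
        u = -math.log(1.0 - rng.random()) / rate_u   # time until the defect appears
        v = -math.log(1.0 - rng.random()) / rate_v   # delay time until failure
        next_inspection = (math.floor(u / T) + 1.0) * T
        if u + v < next_inspection:
            failures += 1                            # failure before detection
    return failures / n_cycles

frac_small_T = simulate_delay_time(0.1)
frac_large_T = simulate_delay_time(5.0)
```

Frequent inspections (small T) catch most defects during their delay time, while long inspection intervals let most defects run to failure.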
The framework in the present analysis is the basic delay time model sub-
ject to regular inspections every T units of time. If a defect is detected by
an inspection, a preventive replacement is performed. If the system fails, a
corrective replacement is carried out. A replacement brings the system back
to the initial state. A cost is incurred at each inspection.
Furthermore, safety constraints are introduced, related to two important
safety aspects: the number of failures of the system and the time spent in
the defective state (the delay time). The control of these quantities can be
obtained by bounding the probability of at least one system failure occurring
during a certain interval of time and by bounding the probability that the
delay times are larger than a certain number.
The objective of the analysis is to determine an optimal inspection interval
T that minimizes the total expected discounted costs under the two safety
constraints.
If α is a positive discount factor, a cost C at time t has a value of Ce^{−αt}
at time 0. Letting T_i be the length of the ith replacement cycle and C_i the
total discounted costs associated with the ith replacement cycle, the total
discounted costs incurred can be written (see Sect. 5.3.3)

$$\frac{E[C_1]}{1 - E\left[e^{-\alpha T_1}\right]}. \tag{5.32}$$
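The identity (5.32) can be checked by Monte Carlo under a simple illustrative assumption, not taken from the book: each cycle has an Exp(λ) length and a fixed cost C is paid at the end of each cycle, so that E[C₁] = Cλ/(λ+α). The function names and parameters are hypothetical:

```python
import math, random

def analytic_total(lam, alpha, C):
    """E[C_1] / (1 - E[exp(-alpha T_1)]) with T_1 ~ Exp(lam) and
    C_1 = C * exp(-alpha T_1): here E[exp(-alpha T_1)] = lam / (lam + alpha)."""
    q = lam / (lam + alpha)
    return C * q / (1.0 - q)

def mc_total_discounted_cost(lam, alpha, C, n_paths=4000, horizon=200.0, seed=7):
    """Direct Monte Carlo of the total discounted cost: a cost C is paid at
    the end of each i.i.d. Exp(lam) cycle, discounted at rate alpha."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        t, acc = 0.0, 0.0
        while True:
            t += rng.expovariate(lam)
            if t > horizon:                # discounting makes the tail negligible
                break
            acc += C * math.exp(-alpha * t)
        total += acc
    return total / n_paths
```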
To explicitly take into account risk and uncertainties we introduce two safety
constraints. Below these are defined and the results are compared.
In practice we may consider different levels for the safety constraint. The
optimization produces decision support by providing information about the
consequences of imposing various safety-level requirements.
Before we search for an optimal inspection time T , we need to specify
the optimization model in detail.
5.5 Maintenance Optimization Models Under Constraints 217
where [x] denotes the integer part of x. From (5.33) we obtain the following
lemma:
Lemma 5.33.

$$1 - E\left[e^{-\alpha X_T}\right] = \int_0^{\infty} \alpha e^{-\alpha t}\,\bar F(t)\,dt
+ \sum_{k=0}^{\infty}\int_{kT}^{(k+1)T} f(u)\,\alpha e^{-\alpha u}
\left(\int_0^{(k+1)T-u} e^{-\alpha v}\,\bar G(v)\,dv\right)du.$$
$$+\,C_I\sum_{k=1}^{\infty}\sum_{i=1}^{k} e^{-\alpha iT}\int_{kT}^{(k+1)T} f(u)\,G((k+1)T-u)\,du,$$

or rewritten,

$$C_I\sum_{k=0}^{\infty} e^{-\alpha(k+1)T}\int_{kT}^{(k+1)T} f(u)\,\bar G((k+1)T-u)\,du
+ C_I\sum_{k=1}^{\infty}\sum_{i=1}^{k} e^{-\alpha iT}\int_{kT}^{(k+1)T} f(u)\,du. \tag{5.36}$$
$$D(u) = \int_0^{u} e^{-\alpha v}\,\bar G(v)\,dv. \tag{5.40}$$
Two safety conditions are introduced in this model. The first one is related to
the occurrences of system failures, whereas the second is related to the time
spent in a defective state.
$$P(N_{c,T}(A) \ge 1) \le \omega_1,$$

or equivalently,

$$1 - P(N_{c,T}(A) = 0) \le \omega_1.$$
where F̄c,T represents the survival function of Xc,T . The following lemma
shows the analytical expression for the survival function F̄c,T .
Lemma 5.34. The survival function F̄c,T of Xc,T , representing the time be-
tween successive corrective maintenances, can be written in the following way:
$$\bar F_{c,T}(t) = \sum_{i=0}^{k} B_{i,T}\left[\bar F(t-iT) + \int_{kT}^{t} f(u-iT)\,\bar G(t-u)\,du\right],
\quad kT \le t \le (k+1)T,\ k = 0, 1, 2, \ldots, \tag{5.41}$$

where

$$B_{0,T} = 1,$$
$$B_{k+1,T} = \sum_{i=0}^{k} B_{i,T}\int_{kT}^{(k+1)T} f(u-iT)\,\bar G((k+1)T-u)\,du,
\quad k = 0, 1, 2, \ldots$$
$$\bar F_{c,T}(t) = \sum_{i=0}^{k} B_{i,T}\,P_{k,i,T}(t), \quad kT \le t \le (k+1)T,$$

$$= B_{2,T},$$

$$\sum_{i=0}^{k} B_{i,T}\int_{kT}^{(k+1)T} f(u-iT)\,\bar G((k+1)T-u)\,du.$$
aA (T ) ≤ ω1 , (5.42)
The second safety constraint is related to the time spent in a failure state.
What we would like to control is the proportion of time the system is in such
a state. This is implemented by considering the asymptotic limit b(T ), which
is equal to the expected time that the system is in the defective state in a
replacement cycle divided by the expected renewal cycle (see Appendix B.2).
Hence we can formulate the safety criterion as
$$b(T) = \frac{E\left[\int_0^{X_T} 1_d(u)\,du\right]}{E[X_T]} \le \omega_2,$$
where 0 < ω2 < 1 and 1d (·) denotes the indicator function which equals 1 if
the system is defective at time u and 0 otherwise. From (5.33), the expected
length of a replacement cycle for this model is equal to
$$E[X_T] = \int_0^{\infty}\bar F_T(t)\,dt
= E[X] + \sum_{k=0}^{\infty}\int_{kT}^{(k+1)T} f(u)\left(\int_0^{(k+1)T-u}\bar G(v)\,dv\right)du.$$
$$b(T) \le \omega_2, \tag{5.44}$$
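A minimal sketch of how b(T) can be estimated by simulation, assuming (for illustration only) exponential defect and delay time distributions; the parameters and the helper name `estimate_b` are not from the book:

```python
import random

def estimate_b(T, n=40000, seed=3):
    """Monte Carlo estimate of b(T): expected time per cycle spent in the
    defective state divided by the expected cycle length (illustrative
    Exp(1) defect times and Exp(2) delay times)."""
    rng = random.Random(seed)
    defective_time = cycle_time = 0.0
    for _ in range(n):
        x = rng.expovariate(1.0)               # defect appears
        y = rng.expovariate(2.0)               # delay until failure
        inspection = (int(x // T) + 1) * T     # first inspection after x
        end = min(inspection, x + y)           # replacement time
        defective_time += end - x              # system defective on (x, end]
        cycle_time += end
    return defective_time / cycle_time
```

Longer inspection intervals increase the fraction of time spent in the defective state, which is what the constraint bounds.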
Optimization
$$\Upsilon = \{T > 0 : a_A(T) \le \omega_1\},$$

or

$$\Upsilon = \{T > 0 : b(T) \le \omega_2\},$$
where aA (T ) and b(T ) are given by (5.43) and (5.45), respectively.
Analyzing the terms in the function Cd (T ) given by (5.38), we will show
that Cd (T ) is a continuous function in T , with
$$\lim_{T\to 0} C_d(T) = \infty.$$
where c and D are given by (5.39) and (5.40), are continuous functions in T .
Moreover,
$$\int_0^T h_T(u)\,c(T-u)\,du \le (C_p + C_I + C_c)\int_0^T h_T(u)\,du
= (C_p + C_c + C_I)\int_0^{\infty} f(u)\,e^{-\alpha u}\,du,$$

and consequently

$$\lim_{T\to 0}\int_0^T h_T(u)\,c(T-u)\,du < \infty,$$
and
$$\lim_{T\to 0}\left(1 + \int_0^T h_T(u)\,(D(T-u) - 1)\,du\right)
= \int_0^{\infty}\alpha e^{-\alpha u}\,\bar F(u)\,du < \infty,$$
is continuous in T and
$$\lim_{T\to 0}\sum_{k=1}^{\infty}\sum_{i=1}^{k} e^{-\alpha iT}\int_{kT}^{(k+1)T} f(u)\,du = \infty.$$
Numerical Examples
In this section we present some numerical examples of the above model. The
aim is to find a value of T that minimizes Cd (T ) given by (5.38) under the two
safety constraints based on the occurrence of failures in an interval (5.42) and
the fraction of time in a defective state (5.44). We refer to these constraints
as criterion 1 and criterion 2, respectively.
We assume that the random variables X and Y follow Weibull distributions
with nondecreasing failure rates.
Figure 5.4 shows the total expected discounted costs Cd(T) along with the
function aA(T). We find that
In this case, T ∗ = 1.79 ∈ Υ , and hence the optimal value for the con-
strained optimization problem under criterion 1 is Topt = 1.79 with a value of
Cd (1.79) = 397.68.
Consider now the constrained optimization problem under criterion 2. We
assume that ω2 = 0.15, i.e., the proportion of time that the system is in a
defective state should not exceed 0.15. Figure 5.5 shows the total expected
discounted costs and the function b(T ) for this problem. In this case
Let Cp = 400, Cc = 1000, and CI = 100 be the costs incurred, with α = 0.4 as
the discount factor. The functions Cd(T), aA(T), and b(T) are shown in Figs. 5.6–
5.8.
Figure 5.6 shows a simulation of the total expected discounted costs Cd (T )
versus T for this example. The function Cd (T ) is in standard form, nonincreas-
ing up to T = 1.1511 and nondecreasing for T ≥ 1.1511. Hence T ∗ = 1.1511.
The corresponding expected discounted costs equal Cd (1.1511) = 804.0365.
We analyze the constrained optimization problem for each safety require-
ment. As above we put ω1 = 0.2 for criterion 1. From Fig. 5.7 we find that
Fig. 5.4. (a) Total expected discounted costs Cd(T) versus T. (b) Function aA(T)
versus T
Due to the form of Cd (T ) the optimal value for the constrained optimization
is Topt = 0.975 with a value of Cd (0.975) = 813.55.
For criterion 2, we suppose ω2 = 0.15. From Fig. 5.8,
and using the same reasoning as above, the optimal value for Cd (T ) is reached
for Topt = 0.313 with a value of Cd (0.313) = 1372. By comparing the expected
costs for the unconstrained and the constrained problem, we see that a rather
large cost is introduced by implementing the safety constraint.
Both constraints can be used to control the safety level. However, we prefer
to use criterion 1 as it is more directly related to the failures of the system.
Fig. 5.5. (a) Total expected discounted costs Cd (T ) versus T . (b) Function b(T )
versus T
to prevent the system from failing and remaining in the failure state for a long period.
However, this goal has to be balanced against the costs of inspections and
overhauls. Too frequent inspections would not be cost optimal. Costs are as-
sociated with tests, system downtime, and repairs. The optimization criterion
is the expected long-run cost per unit of time.
Below we present a formal set-up for this problem and show how an optimal
T can be determined. A special case where the components have three states
is given special attention. It corresponds to a “delay time type system” where
the presence of a fault in a component does not lead to an immediate failure;
there will be a “delay time” between the occurrence of the fault and the failure
of the component. We refer to Sect. 5.5.1.
absolutely continuous, with finite means. The density and “jump rate” of Fij (t)
are denoted fij (t) and rij (t), respectively, i = 1, 2, . . . , n and j = 1, 2, . . . , Mi .
The jump rate r_{ij}(t) is defined as usual as

$$r_{ij}(t) = \lim_{h\to 0}\frac{1}{h}\,P(U_{ij} \le t + h \mid U_{ij} > t).$$
Hence rij (t)h (h a small positive number) is approximately equal to the condi-
tional probability that component i makes a jump to state j − 1 in the interval
(t, t + h] given that the component has stayed in state j during the interval
[0, t]. The sojourn times UiMi , Ui(Mi −1) , . . . , Ui1 , i = 1, 2, . . . , n, are assumed
independent. The distribution of the vector of all Uij s, U, is denoted FU .
We denote by G(t, x) the distribution of the vector of component states
Xt = (Xt(1), Xt(2), . . . , Xt(n)), i.e., G(t, x) = P(Xt = x). The state of the
system at time t is

$$\Phi_t = \phi(X_t),$$
where φ is the structure function of the system. We assume that Φ and φ are
binary, equal to 1 if the system is functioning and 0 otherwise (see Sect. 2.1).
The system is a monotone system (see Sect. 2.1.2), i.e., its structure function
φ is nondecreasing in each argument.
Fig. 5.7. (a) Total expected discounted costs Cd(T) versus T. (b) Function aA(T)
versus T
Since at time t = 0 all components are in the best state, Φ(0) = 1. The
components deteriorate and at time τ the system fails, i.e.,
τ = inf{t > 0 : φ(Xt ) = 0}.
The deterioration of the components and the system failure is revealed by
inspections. It is assumed that the system is inspected every T units of time.
If the system is found to be in the failure state, a complete overhaul is carried
out meaning that all components are repaired to a good-as-new condition.
Furthermore, a preventive policy is introduced: if the system is found to be
in a critical state, also a complete overhaul is conducted. The system is said
to be in a critical state if the system is functioning and there exists at least
one i such that the system fails if component i jumps to the state Xt (i) − 1.
Let τC be the first time the system becomes critical. Then
τC = inf{t ≥ 0 : φ(Xt ) = 1, φ((Xt (i) − 1)i , Xt ) = 0 for at least one i},
where φ(·i , x) = φ(x1 , . . . , xi−1 , ·, xi+1 , . . . , xn ). We assume τC > 0, i.e., the
system is not critical at time 0.
The distribution of τC is denoted FτC . The times τ and τC are functions
of the duration times Uij . Let g and gC be defined by
τ = g(U) and τC = gC (U).
The inspections and overhauls are assumed to take negligible time.
To further characterize the critical states, we introduce the concept of a
critical path vector for system level 1:
Fig. 5.8. (a) Total expected discounted costs Cd (T ) versus T . (b) Function b(T )
versus T
Definition 5.35. A state vector x is a critical path vector for system level
1 (the functioning state of the system) if and only if φ(x) = 1 and φ((xi −
1)i , x) = 0 for at least one i.
From this definition we introduce a maximal critical path vector:
Definition 5.36. A critical path vector x is a maximal critical path vector for
system level 1 if it cannot be increased without losing its status as a critical
path vector.
Note that these concepts are different from the commonly defined path vectors
and minimal path vectors in a monotone system; see Sect. 2.1.2.
Based on the maximal critical path vectors we introduce a new structure
function, φC(x), which is equal to 1 if and only if there exists no maximal
critical path vector x^k such that the state x is below or equal to x^k, i.e.,

$$\phi_C(x) = \prod_{k}\left(1 - I(x \le x^k)\right),$$
where k runs through all maximal critical path vectors for the system at level
1. We see that the system φC fails as soon as the system state becomes critical.
As an example, consider a binary parallel system. Then it is seen that the
maximal critical path vectors are (1, 0) and (0, 1), and φC(x) = x1 x2, since if
one component fails, the system state becomes critical.
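For small systems the critical path vectors of Definition 5.35, the maximal ones of Definition 5.36, and the induced structure function φC can be found by brute-force enumeration. A sketch; the function names are illustrative, and the search is exponential in the number of components, so it is only meant for toy examples:

```python
from itertools import product

def critical_path_vectors(phi, maxstates):
    """All critical path vectors for system level 1: states x with
    phi(x) = 1 such that lowering some component one level fails the system."""
    vecs = []
    for x in product(*[range(m + 1) for m in maxstates]):
        if phi(x) != 1:
            continue
        for i in range(len(x)):
            if x[i] > 0:
                y = list(x); y[i] -= 1
                if phi(tuple(y)) == 0:
                    vecs.append(x)
                    break
    return vecs

def maximal_critical_path_vectors(phi, maxstates):
    """Keep only the critical path vectors not dominated by another one."""
    vecs = critical_path_vectors(phi, maxstates)
    return [x for x in vecs
            if not any(x != z and all(a <= b for a, b in zip(x, z)) for z in vecs)]

def phi_C(x, maximal_vecs):
    """phi_C(x) = prod_k (1 - I(x <= x^k)) over maximal critical path vectors."""
    return 0 if any(all(a <= b for a, b in zip(x, xk)) for xk in maximal_vecs) else 1
```

For the binary parallel system φ(x) = max(x1, x2), this recovers the maximal critical path vectors (1, 0) and (0, 1) and φC(x) = x1 x2, as stated above.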
A counting process N is introduced that jumps to 1 at the time of system
failure, i.e.,
Nt = I(τ ≤ t).
Let Vij,t be the virtual age of component i in state j at time t. Then the
intensity λt of N is given by
$$\lambda_t = \sum_{i=1}^{n}\sum_{j=1}^{M_i} r_{ij}(V_{ij,t})\,I(X_t(i) = j)\,\phi(X_t)\left(1 - \phi((j-1)_i, X_t)\right),$$
noting that the rate is rij (Vij,t ) at time t for component i to cause sys-
tem failure by jumping from state j to state j − 1. A formal proof can be
given following the approach in Sect. 3.2.2. By introducing φij (x) = I(xi =
j)φ(x)(1 − φ((j − 1)i , x)), the intensity λt can be expressed as
$$\lambda_t = \sum_{i=1}^{n}\sum_{j=1}^{M_i} r_{ij}(V_{ij,t})\,\phi_{ij}(X_t).$$
$$\lambda_{C,t} = \sum_{i=1}^{n}\sum_{j=1}^{M_i} r_{ij}(V_{ij,t})\,I(X_t(i) = j)\,\phi_C(X_t)\left(1 - \phi_C((j-1)_i, X_t)\right).$$
Similarly to φij we define φijC (x) = I(xi = j)φC (x)(1 − φC ((j − 1)i , x)), and
hence the intensity λC,t can be expressed as
$$\lambda_{C,t} = \sum_{i=1}^{n}\sum_{j=1}^{M_i} r_{ij}(V_{ij,t})\,\phi_{ijC}(X_t).$$
Optimization
For a fixed test interval length T , 0 < T < ∞, the system is overhauled at time
τ T , where τ T is the time of the first inspection following a critical state, i.e.,
τ T = T ([τC /T ]I + 1),
where [x]I equals the integer part of x. This inspection represents a renewal
for the cost and time processes, and using the renewal reward theorem (see
Appendix B.2), it follows that the long-run (expected) cost per unit time, B T ,
can be written:
$$B^T = \frac{EC^T}{E\tau^T}, \tag{5.46}$$
where Eτ T expresses the expected length of the first renewal cycle (the time
until renewal) and EC T expresses the expected cost associated with
this cycle.
It is seen that Eτ^T < ∞ and EC^T < ∞, observing that Eτ^T ≤ Σij EUij + T
and EC^T ≤ cT + cp + cI(Eτ^T/T + 1). Theorem 5.37 establishes an explicit
formula for Eτ T and EC T , and hence for B T .
Theorem 5.37. Under the above model assumptions, with τ = g(U) and
τC = gC (U), we have
$$E\tau^T = T\sum_{k=0}^{\infty}(k+1)\int_{u\,:\,kT < g_C(u) \le (k+1)T} dF_U(u) \tag{5.47}$$

$$EC^T = \sum_{k=0}^{\infty}\int_{u\,:\,kT < g_C(u) \le (k+1)T}\big[c_I(k+1) + c_p + \cdots\big]\,dF_U(u)$$
where φijC(x) = I(xi = j)φC(x)(1 − φC((j − 1)i, x)). Furthermore, if
Gs(t, x|x′) denotes the conditional distribution of X(t) given X(s) = x′
(t > s), we have
$$EC^T = \sum_{k=0}^{\infty}\big[c_I(k+1) + c_p\big]\sum_{i=1}^{n}\sum_{j=1}^{M_i}\sum_{x}\phi_{ijC}(x)\big[H_{ij}((k+1)T, x) - H_{ij}(kT, x)\big]$$
$$+\ \sum_{k=0}^{\infty}\sum_{x'}\phi_C(x')\,G(kT, x')\sum_{i=1}^{n}\sum_{j=1}^{M_i}\sum_{x}\phi_{ij}(x)
\int_{kT}^{(k+1)T} c((k+1)T - t)\,r_{ij}(t)\,G_{kT}(t, x \mid x')\,dt. \tag{5.53}$$
Then taking expectations, using that NC,t has intensity λC,t, and noting that
we can write rij(Vij,t) = rij(t), we obtain:
$$E\tau^T = E\left[\sum_{k=0}^{\infty}\int_{kT}^{(k+1)T}(k+1)T\,dN_{C,t}\right]$$
$$= T\sum_{k=0}^{\infty}(k+1)\int_{kT}^{(k+1)T} E\lambda_{C,t}\,dt$$
$$= T\sum_{k=0}^{\infty}(k+1)\int_{kT}^{(k+1)T}\sum_{i=1}^{n}\sum_{j=1}^{M_i} r_{ij}(t)\,E\phi_{ijC}(X_t)\,dt$$
$$= T\sum_{k=0}^{\infty}(k+1)\sum_{i=1}^{n}\sum_{j=1}^{M_i}\sum_{x}\phi_{ijC}(x)\int_{kT}^{(k+1)T} r_{ij}(t)\,G(t, x)\,dt$$
$$= T\sum_{k=0}^{\infty}(k+1)\sum_{i=1}^{n}\sum_{j=1}^{M_i}\sum_{x}\phi_{ijC}(x)\big[H_{ij}((k+1)T, x) - H_{ij}(kT, x)\big],$$
Similarly to the above analysis for Eτ T it is seen that the first term of this
expression for EC T equals
$$\sum_{k=0}^{\infty}\big[c_I(k+1) + c_p\big]\sum_{i=1}^{n}\sum_{j=1}^{M_i}\sum_{x}\phi_{ijC}(x)\big[H_{ij}((k+1)T, x) - H_{ij}(kT, x)\big].$$
Hence it remains to establish the desired expression for the downtime costs,
the second term. This term can be expressed as
$$E\left[\sum_{k=0}^{\infty}\phi_C(X_{kT})\int_{kT}^{(k+1)T} c((k+1)T - t)\,dN_t\right],$$
$$B^T(\delta) = EC^T - \delta\,E\tau^T.$$
τC = min{U11 , U21 },
noting that if a component fails, the system is functioning if and only if the
other component is functioning. Furthermore, the time to system failure, τ ,
equals the maximum component lifetime, i.e.,
τ = max{U11 , U21 }.
It follows that
$$E\tau^T = T\sum_{k=0}^{\infty}(k+1)\big[F_{\tau_C}((k+1)T) - F_{\tau_C}(kT)\big]$$
$$= T\sum_{k=0}^{\infty}(k+1)\big[\bar F_{11}(kT)\,\bar F_{21}(kT) - \bar F_{11}((k+1)T)\,\bar F_{21}((k+1)T)\big],$$
Proposition 5.39. For a parallel system of two binary components, the ex-
pected renewal cycle and expected associated costs are given by:
$$E\tau^T = T\sum_{k=0}^{\infty}(k+1)\big[\bar F_{11}(kT)\,\bar F_{21}(kT) - \bar F_{11}((k+1)T)\,\bar F_{21}((k+1)T)\big]$$

$$EC^T = \sum_{k=0}^{\infty}\big[\bar F_{11}(kT)\,\bar F_{21}(kT) - \bar F_{11}((k+1)T)\,\bar F_{21}((k+1)T)\big]\big[c_I(k+1) + c_p\big]$$
$$+\ \sum_{k=0}^{\infty}\int_{kT}^{(k+1)T} c\big((k+1)T - u_1\big)\big[F_{21}(u_1) - F_{21}(kT)\big]\,dF_{11}(u_1)$$
$$+\ \sum_{k=0}^{\infty}\int_{kT}^{(k+1)T} c\big((k+1)T - u_2\big)\big[F_{11}(u_2) - F_{11}(kT)\big]\,dF_{21}(u_2).$$
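Since τ^T = T([τC/T]I + 1) and τC = min{U11, U21}, we have Eτ^T = T Σk (k+1) P(kT < τC ≤ (k+1)T). A sketch that evaluates this series and cross-checks it by simulation, under the illustrative assumption of exponential component lifetimes (so F̄i1(t) = e^{−ri t}); the function names and parameters are hypothetical:

```python
import math, random

def expected_renewal_cycle(T, r1, r2, kmax=500):
    """E[tau^T] = T * sum_k (k+1) P(kT < tau_C <= (k+1)T) for a parallel
    system of two binary components with Exp(r1), Exp(r2) lifetimes, where
    P(tau_C > t) = exp(-(r1 + r2) t)."""
    fbar = lambda t: math.exp(-(r1 + r2) * t)
    return T * sum((k + 1) * (fbar(k * T) - fbar((k + 1) * T))
                   for k in range(kmax))

def mc_renewal_cycle(T, r1, r2, n=100000, seed=11):
    """Monte Carlo check: the overhaul occurs at the first inspection after
    tau_C = min(U11, U21)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        tau_c = min(rng.expovariate(r1), rng.expovariate(r2))
        total += T * (int(tau_c // T) + 1)
    return total / n
```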
From these expressions compact formulae can be derived for

$$H_{ij}(t, x) = r_{ij}\int_0^t G(s, x)\,ds.$$
Similar equations can be established for Gs(t, x|x′), the conditional distribution
of X(t) given X(s) = x′. We need to compute the conditional probabilities
P(Xt(i) = j2 | Xs(i) = j1) for j2 ≤ j1, j1 = 1, 2, i = 1, 2, 3. We see that
P(Xt(i) = 2|Xs(i) = 2) = F̄i2(t − s), P(Xt(i) = 1|Xs(i) = 2) = F̄i1 ∗ Fi2(t − s),
and P(Xt(i) = 1|Xs(i) = 1) = F̄i1(t − s). Furthermore, P(Xt(i) = 0|Xs(i) = 1)
= Fi1(t − s) and P(Xt(i) = 0|Xs(i) = 2) = Fi2 ∗ Fi1(t − s). From these
formulae we see, for example, that

$$G_s(t, (2,2,2) \mid (2,2,2)) = \bar F_{12}(t-s)\,\bar F_{22}(t-s)\,\bar F_{32}(t-s) = e^{-(t-s)\sum_{i=1}^{3} r_{i2}}.$$
Numerical Example
We assume that the failure rates are as follows: r12 = r22 = 0.5, r11 = r21 =
1.0 and r32 = 1/3, r31 = 1/2. Hence the expected times to failure for the three
components are 2 + 1 = 3, 2 + 1 = 3, and 3 + 2 = 5, respectively. The following
costs are assumed: c = 100, cI = 1 and cp = 5, i.e., the cost of an overhaul
is five times the inspection cost and the unit downtime cost is 100 times the
inspection cost. Then we can compute the B T function and determine an
optimal inspection time. Figure 5.9 shows the B T function as a function of
T , computed using Maple 10. By inspection an optimal value is obtained for
T = 0.43. A number of sensitivity analyses should be performed to see the
effect of changes in the input data. Figure 5.10 shows an example where the
unit downtime cost is increased by a factor of 10, from 100 to 1,000, to reflect
the serious safety risk caused by downtime. The optimal inspection interval is
then reduced to 0.18.
Fig. 5.9. The B T function for the base case example with c = 100
Final Remarks
Fig. 5.10. The B T function for the case with unit downtime cost c = 1,000
This appendix serves as background for Chaps. 3–5. The focus is on stochastic
processes on the positive real time axis R+ = [0, ∞). Our aim is to present
the part of the measure-theoretic framework that is necessary to make the
text intelligible and accessible to those who are not familiar with the general
theory of stochastic processes. For detailed presentations of this framework
we recommend texts like Dellacherie and Meyer [61, 62], Rogers and Williams
[133], and Kallenberg [101]. The point process theory is treated in Karr [103],
Daley and Vere-Jones [58], and Brémaud [50]. A “nontechnical” introduction
to parts of the general theory accompanied by comprehensive historical and
bibliographic remarks can be found in Chap. II of the monograph of Andersen
et al. [2]. A good introduction to basic results of probability theory is Williams
[164].
For a function f : R → R we denote the left and right limit at a (in the
case of existence) by
$$f(a-) = \lim_{t\to a-} f(t) = \lim_{h\to 0,\,h>0} f(a-h),$$
$$f(a+) = \lim_{t\to a+} f(t) = \lim_{h\to 0,\,h>0} f(a+h).$$
random variable. The σ-algebra σ(X) = X −1 (B) is the smallest one with
respect to which X is measurable. It is called the σ-algebra generated by X.
Definition A.1 (Independence).
(i) Two events A, B ∈ F are called independent, if P (A ∩ B) = P (A)P (B).
(ii) Suppose A1 and A2 are subfamilies of F : A1 , A2 ⊂ F . Then A1 and A2
are called independent, if P (A1 ∩ A2 ) = P (A1 )P (A2 ) for all A1 ∈ A1 ,
A2 ∈ A2 .
(iii) Two random variables X and Y on (Ω, F ) are called independent, if σ(X)
and σ(Y ) are independent.
The expectation EX (or E[X]) of a random variable is defined in the usual
way as the integral ∫X dP with respect to the probability measure P. If the
expectation E|X| is finite, we call X integrable. The law or distribution of X
on (R, B) is given by FX (B) = P (X ∈ B), B ∈ B, and FX (t) = FX ((−∞, t])
is the distribution function. Often the index X in FX is omitted when it is
clear which random variable is considered. Let g : R → R be a measurable
function and suppose that g(X) is integrable. Then
$$Eg(X) = \int_{\Omega} g(X)\,dP = \int_{\mathbb R} g(t)\,dF_X(t).$$
If X has a density fX : R → R+, i.e., P(X ∈ B) = ∫_B fX(t) dt, B ∈ B, then
the expectation can be calculated as

$$Eg(X) = \int_{\mathbb R} g(t)\,f_X(t)\,dt.$$
The variance of a random variable X with E[X 2 ] < ∞ is denoted Var[X] and
defined by Var[X] = E[(X − EX)2 ].
We now present some classical inequalities:
• Markov inequality: Suppose that X is a random variable and g : R+ → R+
a measurable nondecreasing function such that g(|X|) is integrable. Then
for any real c > 0
Eg(|X|) ≥ g(c)P (|X| ≥ c).
• Jensen’s inequality: Suppose that g : R → R is a convex function and that
X is a random variable such that X and g(X) are integrable. Then
g(EX) ≤ Eg(X).
• Hölder’s inequality: Let p, q ∈ R such that p > 1 and 1/p + 1/q = 1. Suppose
X and Y are random variables such that |X|^p and |Y|^q are integrable. Then
XY is integrable and

$$E|XY| \le \left(E|X|^p\right)^{1/p}\left(E|Y|^q\right)^{1/q}.$$
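These inequalities can be checked empirically on simulated data, a sanity check rather than a proof. The choices g(x) = x with c = 2 for Markov, g(x) = x² for Jensen, and p = q = 2 (the Cauchy–Schwarz case) for Hölder are illustrative:

```python
import random

def empirical_checks(n=10000, seed=0):
    """Evaluate Markov (g(x)=x, c=2), Jensen (g(x)=x**2), and Hoelder
    (p = q = 2, the Cauchy-Schwarz case) on a simulated sample."""
    rng = random.Random(seed)
    xs = [rng.expovariate(1.0) for _ in range(n)]  # X ~ Exp(1)
    ys = [rng.random() for _ in range(n)]          # Y ~ U(0, 1)
    mean = sum(xs) / n
    markov = mean >= 2.0 * sum(x >= 2.0 for x in xs) / n
    jensen = mean ** 2 <= sum(x * x for x in xs) / n
    norm = lambda zs, p: (sum(abs(z) ** p for z in zs) / n) ** (1.0 / p)
    exy = sum(abs(x * y) for x, y in zip(xs, ys)) / n
    hoelder = exy <= norm(xs, 2) * norm(ys, 2)
    return markov, jensen, hoelder
```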
• Convergence in probability: We say Xn → X in probability if, for every
ε > 0,

$$\lim_{n\to\infty} P(|X_n - X| > \varepsilon) = 0.$$
• Convergence in distribution: We say Xn → X in distribution if, for every
x in the set of continuity points of F,

$$\lim_{n\to\infty} F_n(x) = F(x).$$
$$\sup_{r,s\ge k}\|Y_r - Y_s\|_p \to 0 \quad \text{for } k \to \infty.$$
A.2 Random Variables, Conditional Expectations 249
$$\|X\|_p \le \|X\|_q.$$
The short proof of this result can be found in Williams [164]. The theo-
rem states that there is one unique element in the subspace K that has the
shortest distance from a given element in L2 and the projection direction is
orthogonal to K. A similar projection can be carried out from L1 (Ω, F , P )
onto L1 (Ω, A, P ), where A ⊂ F is some sub-σ-algebra of F . Of course, any
A-measurable random variable of L1 (Ω, A, P ) is also in L1 (Ω, F , P ). Thus,
for a given X in L1 (Ω, F , P ), we are looking for the “best” approximation in
L1 (Ω, A, P ). A solution to this problem is given by the following fundamental
theorem and definition.
The standard proof of this theorem uses the Radon–Nikodym theorem (cf.
for example Billingsley [42]). A more constructive proof is via the Orthogonal
Projection Theorem A.2. In the case that EX 2 < ∞, i.e., X ∈ L2 (Ω, F , P ), we
can use Theorem A.2 directly with K = L2 (Ω, A, P ). Let Y be the projection
250 A Background in Probability and Stochastic Processes
of X in K. Then property (ii) of Theorem A.2 yields E[(X − Y )Z] = 0 for all
Z ∈ K. Take Z = IA , A ∈ A. Then E[(X − Y )IA ] = 0 is just condition (A.1),
which shows that Y is a version of the conditional expectation E[X|A]. If X
is not in L2 , we split X as X + −X − and approximate both parts by sequences
Xn+ = X + ∧ n and Xn− = X − ∧ n, n ∈ N, of L2 -random variables. A limiting
argument for n → ∞ yields the desired result (see [164] for a complete proof).
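On a finite sample space the projection view is very concrete: if A is generated by a partition, a version of E[X|A] is the P-weighted average of X over each partition cell. A small worked sketch; the sample space and weights are invented for illustration:

```python
from fractions import Fraction

def cond_exp(X, P, partition):
    """A version of E[X|A] when A is generated by a partition of a finite
    sample space: on each cell the conditional expectation equals the
    P-weighted average of X over that cell."""
    Y = {}
    for cell in partition:
        p_cell = sum(P[w] for w in cell)
        avg = sum(P[w] * X[w] for w in cell) / p_cell
        for w in cell:
            Y[w] = avg
    return Y

# A 4-point sample space with unequal weights.
P = {"a": Fraction(1, 2), "b": Fraction(1, 6), "c": Fraction(1, 6), "d": Fraction(1, 6)}
X = {"a": 4, "b": 0, "c": 3, "d": 9}
partition = [["a", "b"], ["c", "d"]]
Y = cond_exp(X, P, partition)

# Averaging property: E[E[X|A]] = E[X].
EX = sum(P[w] * X[w] for w in P)
EY = sum(P[w] * Y[w] for w in P)

# Projection property: E[(X - Y)Z] = 0 for A-measurable Z,
# e.g. Z = indicator of the first partition cell.
orth = sum(P[w] * (X[w] - Y[w]) for w in partition[0])
```

Exact rational arithmetic (`Fraction`) makes the orthogonality identity hold exactly rather than up to rounding.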
Conditioning with respect to a σ-algebra is in general not very concrete,
so the idea of projecting onto a subspace may give some additional insight.
Another point of view is to look at conditioning as an averaging operator.
The sub-σ-algebra A lies between the extremes F and G = {∅, Ω}, the trivial
σ-field. As can be easily verified from the definition, the corresponding condi-
tional expectations of X are X = E[X|F ] and EX = E[X|G]. So for A with
G ⊂ A ⊂ F the conditional expectation E[X|A] lies “between” X (no aver-
aging, complete information about the value of X) and EX (overall average,
no information about the value of X). The more events of F are included in
A, the more E[X|A] varies and the closer this conditional expectation is to X,
in a sense made precise in the following proposition.
Proof. The right-hand side inequalities are just special cases of the triangle
law for the L2 -norm or Minkowski’s inequality. So we need to prove the left-
hand inequalities.
(i) Since Y2 is the projection of X on L2 (Ω, A2 , P ) and
Y1 ∈ L2 (Ω, A1 , P ) ⊂ L2 (Ω, A2 , P ),
$$\|X - Y_2\|_2 = \inf\{\|X - Z\|_2 : Z \in L^2(\Omega, \mathcal A_2, P)\} \le \|X - Y_1\|_2.$$
$$\|\tilde Y_2\|_2^2 = \|\tilde Y_2 - \tilde Y_1 + \tilde Y_1\|_2^2 = \|\tilde Y_2 - \tilde Y_1\|_2^2 + \|\tilde Y_1\|_2^2,$$
Here and in the following relations like <, ≤, = between random variables are
always assumed to hold with probability one and the term P -a.s. is suppressed.
All random variables in this subsection are assumed to be integrable, i.e., to
be elements of L1 (Ω, F , P ). Let A and C denote sub-σ-algebras of F . Then
the following properties for conditional expectations hold true.
1. If Y is any version of E[X|A], then EY = EX.
2. If X is A-measurable (σ(X) ⊂ A), then E[X|A] = X.
3. Linearity. E[aX + bY |A] = aE[X|A] + bE[Y |A], a, b ∈ R.
4. Monotonicity. If X ≤ Y , then E[X|A] ≤ E[Y |A].
5. Monotone Convergence. If Xn is an increasing sequence and Xn → X
P -a.s., then E[Xn |A] converges almost surely:
lim E[Xn |A] = E[X|A].
n→∞
$$E[g(X)\,|\,\mathcal A] \ge g(E[X\,|\,\mathcal A]),$$

in particular

$$\|X\|_p \ge \|E[X\,|\,\mathcal A]\|_p \quad \text{for } p \ge 1.$$
8. Successive Conditioning. If H is a sub-σ-algebra of A, then
E[E[X|A]|H] = E[X|H].
is a version of E[g(X)|A].
2. We consider two random variables X and Y and a measurable function g
such that g(X) is integrable. We write
E[g(X)|Y ] = E[g(X)|σ(Y )]
E[g(X)|Y ] = h(Y ).
But even if the set {Y = y} has probability 0, we are now able to deter-
mine the conditional expectation of g(X) given that Y takes the value y
(provided we know h). Consider the case that a joint density fXY (x, y)
of X and Y is known. Let fY (y) = ∫_R fXY (x, y) dx be the density of the
(marginal) distribution of Y and

$$f_{X\mid Y}(x \mid y) = \begin{cases} f_{XY}(x, y)/f_Y(y) & \text{if } f_Y(y) > 0, \\ 0 & \text{otherwise.} \end{cases}$$
equals

$$E[h(Y)\,I_B(Y)] = \int h(y)\,I_B(y)\,f_Y(y)\,dy$$
for all B ∈ B. But this follows directly from Fubini’s Theorem, which
proves the assertion.
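A sketch of this recipe in the case of a known joint density: h(y) = E[X | Y = y] is computed from the conditional density by numerical integration. The joint density fXY(x, y) = x + y on [0, 1]² and the function name are illustrative; for it, E[X | Y = y] = (1/3 + y/2)/(1/2 + y):

```python
def cond_exp_given_y(f_xy, y, lo=0.0, hi=1.0, n=20000):
    """E[X | Y = y] from a joint density via the conditional density
    f_{X|Y}(x|y) = f_XY(x, y) / f_Y(y), using a midpoint Riemann sum."""
    h = (hi - lo) / n
    num = den = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        num += x * f_xy(x, y) * h
        den += f_xy(x, y) * h      # this sum approximates the marginal f_Y(y)
    return num / den

# Illustrative joint density on the unit square (it integrates to 1).
f = lambda x, y: x + y
```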
using the fact that for a right-continuous filtration it suffices to show {inf τn <
t} ∈ Ft .
For a sequence of stopping times (τn ) the random variables sup τn , inf τn
are stopping times, so that lim sup τn , lim inf τn and lim τn (if it exists) are
also stopping times.
We now define the σ-algebra of the past of a stopping time τ .
Definition A.22. Suppose τ is a stopping time with respect to the filtration
F. Then the σ-algebra Fτ of events occurring up to time τ is
Fτ = {A ∈ F∞ : A ∩ {τ ≤ t} ∈ Ft for all t ∈ R+ }.
B ∩ {τ ≤ t} = B ∩ {σ ≤ t} ∩ {τ ≤ t} ∈ Ft ,
A ∩ {σ ≤ τ } ∩ {τ ≤ t} = A ∩ {σ ≤ t} ∩ {τ ≤ t} ∩ {σ ∧ t ≤ τ ∧ t}.
Fσ∧τ ⊂ Fσ ∩ Fτ .
A ∩ {σ ∧ τ ≤ t} = (A ∩ {σ ≤ t}) ∪ (A ∩ {τ ≤ t}) ∈ Ft ,
Most important for applications are those random times σ that are defined
as first entrance times of a stochastic process X into a Borel set B: σ = inf{t ∈
R+ : Xt ∈ B}. In general, it is very difficult to show that σ is a stopping time.
For a discussion of the usual conditions in this connection, see Rogers and
Williams [133], pp. 183–191. For a complete proof of the following theorem
we refer to Dellacherie and Meyer [61], p. 116.
is an F-stopping time.
Note that the right-continuity of the paths was used to express {σ < t}
as the union of events {Xr ∈ B} and that we could restrict ourselves to a
countable union because B is an open set.
Xt ≤ E[Xs |Ft ].
and Xt − t is a martingale.
E[|Mt |I(|Mt | > c)] ≤ E[E[|Y |I(|Mt | > c)|Ft ]] = E[|Y |I(|Mt | > c)].
The last lemma is often applied with functions g(x) = |x|p , p ≥ 1. So, if
M is a square integrable martingale, then X = M 2 defines a submartingale.
One key result in martingale theory is the following convergence theorem
(cf. [62], p. 72).
a condition that is equivalent to limt→∞ EXt− < ∞. Then the random variable
X∞ = limt→∞ Xt exists and is integrable.
If the supermartingale (martingale) X is uniformly integrable, X∞ exists
and closes X on the right in that for all t ∈ R+
Proposition A.35. Suppose X is an adapted cadlag process such that for any
bounded stopping time τ the random variable Xτ is integrable and EX0 =
EXτ . Then X is a martingale.
X = A + M,
Remark A.38. 1. Several proofs of this and more general results, not restricted
to uniformly integrable processes, are known (cf. [62], p. 198 and
[101], p. 412). Some of these also refer to local martingales, which are not
needed for the applications we have presented and which are therefore not
introduced here.
2. The process A in the theorem above is often called compensator.
3. In the case of discrete time such a decomposition is easily constructed in
the following way. Let (Xn ), n ∈ N0 , be a submartingale with respect to
a filtration (Fn ), n ∈ N0 . Then we define
Xn = An + Mn ,
where
The continuous time result needs much more care and uses several lemmas,
one of which is interesting in its own right and will be presented here.
M^{2,d}_0 are called purely discontinuous. As an immediate consequence we obtain
that any martingale M ∈ M^2_0 has a unique decomposition M = M^c + M^d,
where M^c ∈ M^{2,c}_0 and M^d ∈ M^{2,d}_0.
[M] with

$$[M]_t = \langle M^c\rangle_t + \sum_{s\le t}(\Delta M_s)^2$$
A.6 Semimartingales
A decomposition of a stochastic process into a (predictable) drift part and a
martingale, as presented for submartingales in the Doob–Meyer decomposi-
tion, also holds true for more general processes. We start with the motivating
example of a sequence (Xn ), n ∈ N0 , of integrable random variables adapted
to the filtration (Fn ). This sequence admits a decomposition
$$X_n = X_0 + \sum_{i=1}^{n} f_i + M_n$$
The function g is said to have finite variation if Vg (t) < ∞ for all t ∈ R+ .
The class of cadlag processes A with finite variation starting in A0 = 0 is
denoted V.
Zt = Z0 + At + Mt ,
where A ∈ V and M ∈ M0 .
Proof. To prove the product rule we use a form of integration by parts for
semimartingales, which is an application of Ito’s formula (see [67], p. 140):
$$Z_t Y_t = \int_{(0,t]} Z_{s-}\,dY_s + \int_{(0,t]} Y_{s-}\,dZ_s + [Z, Y]_t.$$
The first term of the sum is an ordinary Stieltjes integral. Since the paths of
Z have at most countably many jumps, it follows that
$$\int_{(0,t]} Z_{s-}\,d\left(\int_0^s g_u\,du\right) = \int_0^t Z_s\,g_s\,ds.$$
where
$$R_t = \int_{(0,t]} Z_{s-}\,dL_s + \int_{(0,t]} Y_{s-}\,dM_s + [M, L]_t$$
where R ∈ M0 .
In this appendix we present some definitions and results from the theory of
renewal processes, including renewal reward processes and regenerative pro-
cesses. Key references are [1, 8, 44, 58, 135, 156].
The purpose of this appendix is not to give an all-inclusive presentation
of the theory. Only definitions and results needed for establishing the results
of Chaps. 1–5 (in particular Chap. 4) are covered.
and define
$$N_t = \sup\{j : S_j \le t\},$$

or equivalently,

$$N_t = \sum_{j=1}^{\infty} I(S_j \le t). \tag{B.1}$$
The processes (Nt ), t ∈ R+ , and (Sj ), j ∈ N0 , are both called a renewal process.
We say that a renewal occurs at t if Sj = t for some j ≥ 1. The random variable
Nt represents the number of renewals in [0, t]. Since the interarrival times Tj
are independent and identically distributed, it follows that after each renewal
the process restarts.
Let M (t) = ENt , 0 ≤ t < ∞. The function M (t) is called the renewal
function. It can be shown that M (t) is finite for all t. From (B.1) we see that
$$M(t) = \sum_{j=1}^{\infty} F^{*j}(t), \tag{B.2}$$
or equivalently
$$L_F(s) = \frac{L_M(s)}{1 + L_M(s)}.$$
Hence LF is determined by M and since the Laplace transform determines
the distribution, it follows that F also is determined by M .
noting that if the first renewal occurs at time x, x ≤ t, then from this point
on the process restarts, and thus the expected number of renewals in [0, t] is
just 1 plus the expected number to arrive in a time t − x from an equivalent
renewal process. A more formal proof is the following:
$$M(t) = EN_t = E\left[\sum_{j=1}^{\infty} I(S_j \le t)\right] = F(t) + E\left[\sum_{j=2}^{\infty} I(S_j \le t)\right]$$
$$= F(t) + E\left[\sum_{j=2}^{\infty} I(S_j - S_1 \le t - S_1)\right]$$
$$= F(t) + \int_0^t E\left[\sum_{j=2}^{\infty} I(S_j - S_1 \le t - s)\right] dF(s)$$
$$= F(t) + \int_0^t M(t - s)\,dF(s).$$
Proof. A proof of this result is given in Asmussen [8], p. 113. A simpler proof
can, however, be given in the case where the Laplace transforms of h and g
exist: taking Laplace transforms in (B.4) yields
Using the (strong) law of large numbers, many results related to renewal
processes can be established, including the following.
We now formulate some limiting results, without proof, including the Ele-
mentary Renewal Theorem, the Key Renewal Theorem, Blackwell’s Theorem,
and the Central Limit Theorem for renewal processes. Refer to Alsmeyer [1],
Asmussen [8], Daley and Vere-Jones [58], and Ross [135] for proofs; see also
Birolini [44]. Some of the results require that the distribution F is not periodic
(lattice). We say that F is periodic if there exists a constant c, c > 0, such
that T takes only values in {0, c, 2c, 3c, . . .}.
B.1 Basic Theory of Renewal Processes 277
where f^{*1} = f and

$$f^{*j}(t) = \int_0^t f^{*(j-1)}(t - s)\,f(s)\,ds, \quad j = 2, 3, \ldots.$$
$$\alpha_t = S_{N_t+1} - t,$$
$$\beta_t = t - S_{N_t}.$$
The recurrence times αt and βt are the time intervals from t forward to the
next renewal point and backward to the last renewal point (or to the time
origin), respectively. Let Fαt and Fβt denote the distribution functions of αt
and βt , respectively. The following result is a consequence of the Key Renewal
Theorem.
Theorem B.13. Assume that the distribution F is not periodic. Then the asymptotic distributions of the forward and backward recurrence times are given by
$$\lim_{t\to\infty} F_{\alpha_t}(x) = \lim_{t\to\infty} F_{\beta_t}(x) = \frac{1}{\mu}\int_0^x \bar F(s)\, ds.$$
This asymptotic distribution of αt and βt is called the equilibrium distribution.
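The equilibrium distribution lends itself to a quick Monte Carlo check. A sketch with our own illustrative choice of interarrival distribution (parameters and sample sizes are not from the book):

```python
import random

# Monte Carlo check of Theorem B.13: with Uniform(0,1) interarrival
# times, mu = 1/2 and the equilibrium CDF of the forward recurrence
# time alpha_t is F_e(x) = (1/mu) \int_0^x (1-s) ds = 2x - x^2 on [0,1];
# in particular F_e(0.5) = 0.75.
random.seed(1)
t, n_rep = 20.0, 20000
hits = 0
for _ in range(n_rep):
    s = 0.0
    while s <= t:
        s += random.random()       # renewal epochs S_1, S_2, ...
    if s - t <= 0.5:               # forward recurrence time alpha_t
        hits += 1
p_hat = hits / n_rep               # estimates F_e(0.5) = 0.75
```

At t = 20 (about forty mean interarrival times) the distribution of the forward recurrence time is already essentially in equilibrium.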
A simple formula exists for the mean forward recurrence time; we have
$$ES_{N_t+1} = \mu(1 + M(t)), \tag{B.5}$$
so that $E\alpha_t = \mu(1 + M(t)) - t$. Formula (B.5) is a special case of Wald's equation (see, e.g., Ross [135]), and follows by writing
$$\begin{aligned}
ES_{N_t+1} &= E\sum_{k\ge 1} S_k I(N_t+1=k) = E\sum_{k\ge 1}\sum_{j=1}^{k} T_j I(N_t+1=k)\\
&= E\sum_{j\ge 1} T_j I(N_t+1\ge j) = E\sum_{j\ge 1} T_j I(S_{j-1}\le t)\\
&= \sum_{j\ge 1} ET_j\, EI(S_{j-1}\le t) = \mu \sum_{j\ge 0} F^{*j}(t) = \mu(1+M(t)),
\end{aligned}$$
using in the last line that $T_j$ and $S_{j-1}$ are independent and that $F^{*0}(t) = 1$.
Finally in this section we prove a result used in the proof of Theorem 4.19,
p. 122.
Proposition B.14. Let g be a real-valued function which is bounded on finite intervals. Assume that
$$\lim_{t\to\infty} g(t) = g.$$
Then
$$\lim_{t\to\infty} \frac{1}{t}\int_0^t g(s)\, dM(s) = \frac{g}{\mu}.$$
Proof. To prove this result we use a standard argument. Given $\varepsilon > 0$, there exists a $t_0$ such that $|g(t) - g| < \varepsilon$ for $t \ge t_0$. Hence for $t > t_0$ we have
$$\frac{1}{t}\int_0^t |g(s) - g|\, dM(s) \le \frac{1}{t}\int_0^{t_0} |g(s) - g|\, dM(s) + \frac{\varepsilon}{t}\int_{t_0}^t dM(s).$$
Since $t_0$ is fixed, this gives, by applying the Elementary Renewal Theorem,
$$\limsup_{t\to\infty} \frac{1}{t}\int_0^t |g(s) - g|\, dM(s) \le \frac{\varepsilon}{\mu}.$$
The desired conclusion follows.
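A closed-form sanity check of Proposition B.14 is available in the Poisson case, where dM(s) reduces to λ ds (the choice of g and of λ below is our own illustration):

```python
import math

# Proposition B.14 in the Poisson case, dM(s) = lam ds: take
# g(s) = 1 + e^{-s}, so g(t) -> 1 and the claimed limit is
# g/mu = 1/(1/lam) = lam.
lam, t = 2.0, 1000.0
# (1/t) * \int_0^t (1 + e^{-s}) lam ds, computed in closed form
avg = lam * (t + 1.0 - math.exp(-t)) / t
err = abs(avg - lam)
```

The time average equals λ(1 + (1 − e^{−t})/t), which converges to λ at rate 1/t.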
$$Z_t = \sum_{j=1}^{N_t} Y_j.$$
The limiting value of the average return is established using the law of large
numbers and is given by the following result (cf. [135]).
where
$$\tau^2 = \operatorname{Var}\!\left[Y - \frac{EY}{ET}\, T\right] = \operatorname{Var}[Y] + \left(\frac{EY}{ET}\right)^2 \operatorname{Var}[T] - 2\, \frac{EY}{ET} \operatorname{Cov}[Y, T].$$
B.4 Modified (Delayed) Processes 281
It can be shown that all the asymptotic results presented in the previous sections of this appendix still hold true for the modified processes. If we take the first distribution to be equal to the asymptotic distribution of the recurrence times, given by Theorem B.13, p. 279, the renewal process becomes stationary in the sense that the distribution of the forward recurrence time αt does not depend on t. Furthermore,
$$M(t+h) - M(t) = h/ET.$$
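This stationarity property can also be checked by simulation, drawing the first interval from the equilibrium distribution of Theorem B.13 (the interarrival distribution and parameters below are our own illustration):

```python
import math, random

# Stationarity of the delayed renewal process: if the first interval has
# the equilibrium distribution, E[N(t+h) - N(t)] = h/ET for every t.
# With Uniform(0,1) interarrivals, F_e(x) = 2x - x^2 on [0,1] is
# inverted by x = 1 - sqrt(1 - u), and ET = 1/2.
random.seed(4)
t, h, n_rep = 1.0, 1.0, 20000
count = 0
for _ in range(n_rep):
    s = 1.0 - math.sqrt(1.0 - random.random())  # equilibrium first epoch
    while s <= t + h:
        if s > t:
            count += 1             # renewal falls in (t, t+h]
        s += random.random()       # subsequent interarrival times
mean_count = count / n_rep         # should be close to h/ET = 2
```

Note that t is small here; stationarity makes the expected increment exactly h/ET for all t, not just asymptotically.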
References
[129] Phelps, R. (1983) Optimal policy for minimal repair. J. Opl. Res. 34,
425–427.
[130] Pierskalla, W. and Voelker, J. (1976) A survey of maintenance models: The control and surveillance of deteriorating systems. Nav. Res. Log. Q. 23, 353–388.
[131] Puterman, M. L. (1994) Markov Decision Processes: Discrete Stochastic
Dynamic Programming. Wiley, New York.
[132] Rai, S. and Agrawal, D. P. (1990) Distributed Computing Network Reliability, 2nd ed. IEEE Computer Soc. Press, Los Alamitos, California.
[133] Rogers, C. and Williams, D. (1994) Diffusions, Markov Processes and
Martingales, Vol. 1, 2nd ed. Wiley, Chichester.
[134] Rolski, T., Schmidli, H., Schmidt, V. and Teugels, J. (1999) Stochastic
Processes for Insurance and Finance. Wiley, Chichester.
[135] Ross, S. M. (1970) Applied Probability Models with Optimization Appli-
cations. Holden-Day, San Francisco.
[136] Ross, S. M. (1975) On the calculation of asymptotic system reliability characteristics. In: Barlow, R. E., Fussel, J. B. and Singpurwalla, N. D. (eds.) Fault Tree Analysis. Society for Industrial and Applied Mathematics, SIAM, Philadelphia, PA.
[137] Schöttl, A. (1997) Optimal stopping of a risk reserve process with interest and cost rates. J. Appl. Prob. 35, 115–123.
[138] Serfozo, R. (1980) High-level exceedances of regenerative and semi-
stationary processes. J. Appl. Prob. 17, 423–431.
[139] Shaked, M. and Shanthikumar, G. (1993) Stochastic Orders and their
Applications. Academic Press, Boston.
[140] Shaked, M. and Shanthikumar, G. (1991) Dynamic multivariate aging notions in reliability theory. Stoch. Proc. Appl. 38, 85–97.
[141] Shaked, M. and Shanthikumar, G. (1986) Multivariate imperfect repair.
Oper. Res. 34, 437–448.
[142] Shaked, M. and Szekli, R. (1995) Comparison of replacement policies
via point processes. Adv. Appl. Prob. 27, 1079–1103.
[143] Shaked, M. and Zhu, H. (1992) Some results on block replacement policies and renewal theory. J. Appl. Prob. 29, 932–946.
[144] Sherif, Y. and Smith, M. (1981) Optimal maintenance models for systems subject to failure: A review. Nav. Res. Log. Q. 28, 47–74.
[145] Smith, M. (1998) Insensitivity of the k out of n system. Probability in
the Engineering and Informational Sciences, to appear.
[146] Smith, M. (1997) On the availability of failure prone systems. PhD thesis, Erasmus University, Rotterdam.
[147] Smith, M., Aven, T., Dekker, R. and van der Duyn Schouten, F. A. (1997) A survey on the interval availability of failure prone systems. In: Proceedings ESREL'97 Conference, Lisbon, 17–20 June, 1997, pp. 1727–1737.
[148] Solovyev, A.D. (1971) Asymptotic behavior of the time to the first
occurrence of a rare event. Engineering Cybernetics 9 (6), 1038–1048.