Ajit Kumar Verma, Srividya Ajit, Manoj Kumar
Dependability of Networked Computer-Based Systems

Prof. Ajit Kumar Verma
Department of Electrical Engineering
Indian Institute of Technology Bombay (IITB)
Powai, Mumbai 400076, India
e-mail: [email protected]

Dr. Manoj Kumar
System Engineering Section
Control Instrumentation Division
Bhabha Atomic Research Centre (BARC)
Trombay, Mumbai 400085, India
e-mail: [email protected]
ISSN 1614-7839
DOI 10.1007/978-0-85729-318-3
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as
permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced,
stored or transmitted, in any form or by any means, with the prior permission in writing of the
publishers, or in the case of reprographic reproduction in accordance with the terms of licenses issued
by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be
sent to the publishers.
The use of registered names, trademarks, etc., in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant laws and regulations and therefore free for general use.
The publisher makes no representation, express or implied, with regard to the accuracy of the
information contained in this book and cannot accept any legal responsibility or liability for any errors
or omissions that may be made.
Our Gurus
Bhagwan Sri Sathya Sai Baba
Paramhansa Swami Sathyananda Saraswati
Sri B. Jairaman & Smt Vijaya Jairaman
Dr. C.S. Rao & Smt Kasturi Rao
Our Teachers
Prof. A.S.R. Murthy (Reliability Engg., IIT Kharagpur)
Prof. M.A. Faruqi (Mechanical Engg., IIT Kharagpur)
Prof. N.C. Roy (Chemical Engg., IIT Kharagpur)
Foreword
book will serve as an invaluable guide for scholars, researchers and practitioners
interested and working in the field of critical applications where reliance on
automation is indispensable.
Preface

This book is meant for research scholars, scientists and practitioners involved with the application of computer-based systems in critical applications. Ensuring the dependability of systems used in critical applications is important because of the impact of their failures on human life, investment and the environment. The individual aspects of system dependability (reliability, availability, safety, timeliness and security) are the factors that determine application success. To answer the question of reliance on computers in critical applications, this book explores the integration of dependability attributes within practical, working systems. The book addresses the growing international concern for system dependability and reflects important advances in understanding how dependability manifests in computer-based systems.
Probability theory, which began in the seventeenth century, is now a well-established branch of mathematics and finds applications in various natural and social sciences, from weather prediction to predicting the risk of new medical treatments. The book begins with an elementary treatment of the basic definitions and theorems that form the foundation for the premise of this work. For a comprehensive appraisal, detailed information can be found in standard books on probability theory and stochastic processes. The mathematical techniques used have been kept as elementary as possible, and Markov chains, DSPN models and Matlab code are given where relevant.
Chapter 1 begins with an introduction to the premise of this book, where the dependability concepts are introduced. Chapter 2 provides the requisite foundation on the essentials of probability theory, followed by an introduction to stochastic processes and models in Chap. 3. Various dependability models of computer-based systems are discussed in Chap. 4. Markov models for systems considering safe failures, perfect and imperfect periodic proof tests, and demand rate are derived, along with closed-form solutions for the performance-based safety index and availability.

In Chap. 5, the medium access control (MAC) protocol mechanisms of three candidate networks are presented in detail. The MAC mechanism is responsible for access to the network medium, and hence affects the timing requirements.
Contents
1 Introduction
  1.1 Evolution of Computer-Based Systems
  1.2 Application Areas: Safety-Critical, Life-Critical
  1.3 A Review of System Failures
  1.4 Example: Comparison of System Reliability
  1.5 Dependability
    1.5.1 Basic Concepts
    1.5.2 Basic Definitions and Terminology
  1.6 Motivation
  1.7 Summary
  References

2 Probability Theory
  2.1 Probability Models
  2.2 Sample Space, Events and Algebra of Events
  2.3 Conditional Probability
  2.4 Independence of Events
  2.5 Exclusive Events
  2.6 Bayes' Rule
  2.7 Random Variables
    2.7.1 Discrete Random Variables
    2.7.2 Continuous Random Variables
  2.8 Transforms
    2.8.1 Probability Generating Function
    2.8.2 Laplace Transform
  2.9 Expectations
  2.10 Operations on Random Variables
  2.11 Moments
  2.12 Summary
[Figure: architecture of a networked computer-based system. Sensors (S1, ..., Si), processors or controllers (P1, ..., Pj) and actuators (A1, ..., Ak), each built from input/output module(s), processor module(s), memory module(s) and communication module(s), are interconnected by communication network(s).]
automobiles, etc. Their failure can cause damage to huge investments, effort, life and/or the environment. Based on their function(s) and the extent of failure, these real-time systems are categorized into the following three types:
1. Safety-critical systems: systems required to ensure the safety of the EUC (equipment under control), people and the environment. Examples include the shutdown system of a nuclear reactor and the digital flight control computer of an aircraft.
2. Mission-critical systems: systems whose failure results in failure of the mission, for example, the control and coding unit (CCU) of an avionic system or the navigation system of a spacecraft.
3. Economically-critical systems: systems whose failure results in unavailability of the EUC, causing massive loss of revenue, for example, the reactor control system of a nuclear power plant.
Dependability attributes [10] differ for different kinds of systems. For safety-critical systems, the dependability attribute of concern is safety [11]. Reliability is the appropriate dependability measure for mission-critical systems [12–14]. Similarly, for economically-critical systems the dependability measure of importance is availability [15].

Extensive literature exists on dependability modeling of programmable electronic systems and/or real-time systems: (i) safety [11, 16–21], (ii) reliability [7, 13, 22–27], and (iii) availability [13–15]. Reliability models for soft real-time systems are discussed in [28–30]. Networked real-time systems, like other real-time systems, may fail in the value domain or the time domain.
Consider a system with the following reliability requirement: the system reliability for a mission time of 10,000 h, with repair, shall be 0.9. There are three implementation options: (i) analog, (ii) processor-based, and (iii) network-based.

An analog implementation uses analog components. Analog systems do not share resources, so a dedicated resource is available for each function. They also have limited fault diagnosis and fault coverage. A processor-based implementation, on the other hand, is quite simple, as most of the complex functions are implemented on a single processor. Such systems have detailed fault diagnosis and very good coverage. They can easily be interfaced with an industry-standard communication link, through which fault information can be communicated, displayed and stored, and a detailed report generated; all of this helps lower the repair time. A network-based system has all the advantages of a processor-based system with much more modularity, i.e. further reduced repair time and cost. Reliability values considering hardware failures and repair time for the three options are given in Table 1.1.
From the table it appears that the network-based system gives the best reliability. The question, however, is whether all failure mechanisms have been considered for the processor- and network-based systems. These systems may also fail by missing timing deadlines. The problem of missing deadlines is manageable to some extent in a processor-based system, but in a network-based system it is a real challenge. So Table 1.1 is of limited use in deciding which system will give better reliability, i.e. failure-free operation, in the application.

Let the probability of meeting the deadline over the mission time for the processor- and network-based systems be 0.95 and 0.9, respectively, obtained by some means. The system reliability considering deadline misses is then given in Table 1.2. From Table 1.2 it is clear that all three systems provide a similar level of reliability. This makes the comparison uniform and simple.

This book provides methods to obtain the probability of deadline miss and to incorporate it into dependability models.
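The comparison above amounts to multiplying the hardware reliability by the probability of meeting the deadline over the mission time. A minimal sketch of that combination, using hypothetical hardware reliability values (the figures in Table 1.1 are not reproduced here):

```python
# Combined reliability = hardware reliability x probability of meeting
# the deadline over the mission time. All numbers below are hypothetical
# placeholders, not the values from Tables 1.1 and 1.2.
def combined_reliability(r_hw, p_deadline):
    """Reliability considering both hardware failures and deadline misses."""
    return r_hw * p_deadline

options = {
    "analog":    (0.90, 1.00),  # no deadline-miss mechanism assumed
    "processor": (0.95, 0.95),
    "networked": (0.99, 0.90),
}

for name, (r_hw, p_dl) in options.items():
    print(f"{name:10s} R = {combined_reliability(r_hw, p_dl):.3f}")
```

With such (assumed) numbers, a high hardware reliability can be offset by a lower probability of meeting the deadline, which is why the ranking in Table 1.1 alone is misleading.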
1.5 Dependability
Dependability is a collective term used to describe the ability to deliver service that can justifiably be trusted [10]. The service delivered by a system is its behavior as perceived by its user(s); a user is another system (physical or human) that interacts with the former at the service interface. The function of a system is what the system is intended for, and is described by the system specification.

As per Laprie et al. [10], the concept of dependability consists of three parts: the threats to, the attributes of, and the means by which dependability is attained.
1.5.1.1 Threats
The threats are mainly three: failure, error and fault [10]. A system failure is an
event that occurs when the delivered service deviates from correct service. A
failure is a transition from correct service to incorrect service. An error is that part
of system state that may cause a subsequent failure: a failure occurs when an error
reaches the service interface and alters the service. A fault is the adjudged or
hypothesized cause of an error. A fault is active when it produces an error,
otherwise it is dormant.
1.5.1.2 Attributes
1.5.1.3 Means
Fault tolerance is the ability of a system to continue to perform its tasks after the occurrence of faults. Fault tolerance requires fault detection, fault containment, fault location, fault recovery and/or fault masking [12]. These terms are defined as follows:
• Fault detection is the process of recognizing that a fault has occurred.
• Fault containment is the process of isolating a fault and preventing its effects from propagating throughout the system.
• Fault location is the process of determining where a fault has occurred so that an appropriate recovery can be implemented.
• Fault recovery is the process of remaining operational, or regaining operational status via reconfiguration, even in the presence of faults.
• Fault masking is the process that prevents faults in a system from introducing errors into the informational structure of that system. A system employing fault masking achieves fault tolerance by "hiding" faults. Such systems do not require that a fault be detected before it is tolerated, but they do require that the fault be contained.
Systems that do not use fault masking require fault detection, fault location and fault recovery to achieve fault tolerance. Redundancy is essential for achieving fault tolerance. Redundancy is simply the addition of information, resources, or time beyond what is needed for normal system operation. Redundancy can take any of the following forms:
1. Hardware Redundancy is the addition of extra hardware, usually for the pur-
pose of either detecting or tolerating faults.
2. Software Redundancy is the addition of extra software, beyond what is needed
to perform a given function, to detect and possibly tolerate faults.
3. Information Redundancy is the addition of extra information beyond what is
needed to implement a given function.
4. Time redundancy uses additional time to perform the functions of a system
such that fault detection and often fault tolerance can be achieved.
1. Passive techniques use the concept of fault masking to hide the occurrence of faults and prevent them from resulting in errors. Examples of passive redundancy are triple modular redundancy (TMR) and N-modular redundancy, where a majority vote or the median of the module outputs is taken to decide the final output and mask the fault of a module(s).
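The majority-voting idea behind TMR can be sketched as follows; this is an illustrative voter, not tied to any particular system discussed in the text:

```python
def tmr_vote(a, b, c):
    """Majority voter for triple modular redundancy (TMR).

    Returns the value agreed on by at least two of the three module
    outputs, so a single faulty module is masked. If all three outputs
    disagree, the fault cannot be masked and an error is raised.
    """
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise ValueError("no majority: more than one module faulty")

# A fault in one module is masked by the other two:
assert tmr_vote(42, 42, 7) == 42
```

Note that the voter masks the fault without detecting which module failed; fault location would require additional comparison logic.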
Time redundancy methods attempt to reduce the amount of extra hardware at the expense of using additional time. The above two methods clearly require the use of extra hardware, so time redundancy becomes important in applications where extra hardware cannot be added, but extra time can be provided by using devices of higher speed.
1. Transient Fault Detection: the basic concept of time redundancy is the repetition of computations in ways that allow faults to be detected. In transient fault detection, the same computation is performed two or more times and the results are compared to determine whether a discrepancy exists. If an error is detected, the computations can be performed again to see whether the disagreement remains or disappears. Such approaches are often good at detecting errors resulting from transient faults, but they cannot protect against errors resulting from permanent faults.
2. Permanent Fault Detection: one of the biggest potentials of time redundancy is the ability to detect permanent faults while using a minimum of extra hardware. Approaches for this are as follows:
a. Alternating Logic
b. Recomputing with Shifted Operands
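The recompute-and-compare idea of transient fault detection described above can be sketched as follows (a minimal sketch; the function being checked is an arbitrary illustrative computation):

```python
def run_checked(f, x, recomputes=2):
    """Time-redundant execution: compute, recompute, and compare.

    The same computation is performed `recomputes` times and the results
    are compared. A disagreement indicates a fault. As noted in the text,
    identical re-execution detects transient faults but cannot protect
    against a permanent fault, which corrupts every repetition alike.
    """
    results = [f(x) for _ in range(recomputes)]
    if all(r == results[0] for r in results):
        return results[0]
    raise RuntimeError("results disagree: fault detected")

# Fault-free deterministic computation passes the comparison:
assert run_checked(lambda v: 2 * v + 1, 10) == 21
```

Permanent-fault detection schemes such as recomputing with shifted operands differ precisely in that the repetitions are not identical, so a stuck hardware fault affects the runs differently and the comparison can expose it.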
Software redundancy may come in many forms, from a few lines of extra code to a complete replica of a program. It could be a few lines of code checking the magnitude of a signal, or a small routine used to periodically test a memory by writing and reading specific locations. The major software redundancy techniques are as follows:
1. Consistency checks use a priori knowledge about the characteristics of information to verify the correctness of that information.
2. Capability checks are performed to verify that a system possesses the capability expected.
3. N-Version Programming: software does not break as hardware does; instead, software faults are the result of incorrect designs or coding mistakes. Therefore, any technique that detects faults in software must detect design flaws. A simple duplication-and-comparison procedure will not detect software faults if the duplicated software modules are identical, because the design mistakes will appear in both modules. The concept of N-version programming allows certain design flaws in a software module to be detected by comparing the design, code and results of N versions of the same software.
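A consistency check in the sense of item 1 above can be sketched as follows; the sensor and its valid range are hypothetical examples of a priori knowledge, not taken from the text:

```python
def consistency_check(reading, low=0.0, high=350.0):
    """Verify a value against a priori knowledge of its valid range.

    The bounds are hypothetical, e.g. a temperature sensor that can
    physically report only 0-350 degrees. A reading outside the range
    is flagged as erroneous regardless of how it was produced.
    """
    return low <= reading <= high

assert consistency_check(120.5)          # plausible reading accepted
assert not consistency_check(9999.0)     # implausible reading flagged
```

Such checks need only a few lines of code, which is why software redundancy can be far cheaper than full replication when partial error coverage suffices.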
From the above discussion it is clear that the use of time redundancy, software redundancy and reconfigurable hardware redundancy reduces the total hardware needed for achieving fault tolerance. In distributed systems, processing units called nodes are distributed and communicate through communication channels; processing a task/job requires multiple nodes. The system can be made fault tolerant by transferring processing to a different node in case of error. Here the system reconfigures itself and makes use of time redundancy, as the new node has to process the tasks/jobs of the failed node in addition to its own.
1.6 Motivation
1.7 Summary
Real-time systems have one additional failure mechanism, i.e. failure due to deadline miss. When networked systems are used in real-time applications this failure mechanism becomes much more important. To ensure timeliness, mostly worst-case response-time guarantees are used, which are deterministic. Systems with different timing characteristics cannot be compared on a given dependability measure, as dependability models do not consider timeliness. The chapter points out this limitation with the help of an example. A refresher on the means to achieve dependability is given for the sake of completeness.
References
12. Johnson BW (1989) Design and Analysis of Fault-Tolerant Digital Systems. Addison Wesley
Publishing Company
13. Mishra KB (1992) Reliability Analysis and Prediction. Elsevier
14. Trivedi KS (1982) Probability and Statistics with Reliability, Queueing, and Computer
Science Applications. Prentice-Hall, Englewood Cliffs New Jersey
15. Mainkar V (1997) Availability analysis of transaction processing systems based on user perceived performance. In: Proceedings of 16th Symposium on Reliable Distributed Systems, Durham, NC
16. Zhang T, Long W, Sato Y (2003) Availability of systems with self-diagnostic components: applying Markov model to IEC 61508-6. Reliability Engineering and System Safety 80:133–141
17. Bukowski JV (2001) Modeling and analyzing the effects of periodic inspection on the
performance of safety-critical systems. IEEE Transaction Reliability 50(3):321–329
18. Choi CY, Johnson BW, Profeta III JA (1997) Safety issues in the comparative analysis of dependable architectures. IEEE Transaction Reliability 46(3):316–322
19. Summers A (2000) Viewpoint on ISA TR 84.0.02: simplified methods and fault tree analysis. ISA Transaction 39(2):125–131
20. Bukowski JV (2005) A comparison of techniques for computing PFD average. In: RAMS
2005 590–595
21. Goble WM, Bukowski JV (2001) Extending IEC 61508 reliability evaluation techniques to
include common circuit designs used in industrial safety systems. In: Proc. of Annual
Reliability and Maintainability Symposium 339–343
22. Khobare SK, Shrikhande SV, Chandra U, Govidarajan G (1998) Reliability analysis of micro
computer modules and computer based control systems important to safety of nuclear power
plants. Reliability Engineering and System Safety 59(2):253–258
23. Muppala JK, Ciardo G, Trivedi KS (1994) Stochastic reward nets for reliability prediction. Communications in Reliability, Maintainability and Serviceability 1(2):9–20
24. Kim H, Shin KG (1997) Reliability modeling of real-time systems with deadline information.
In: Proc. of IEEE Aerospace application Conference 511–523
25. Kim H, White AL, Shin KG (1998) Reliability modeling of hard real-time systems.
In: Proceedings of 28th Int. Symp. on Fault Tolerant Computing 304–313
26. Tomek L, Mainkar V, Geist RM, Trivedi KS (1994) Reliability modeling of life-critical, real-
time systems. Proceedings of the IEEE 82:108–121
27. Lindgren M, Hansson H, Norstrom C, Punnekkat S (2000) Deriving reliability estimates of distributed real-time systems by simulation. In: Proceedings of 7th International Conference on Real-time Computing Systems and Applications 279–286
28. Mainkar V, Trivedi KS (1994) Transient analysis of real-time systems using deterministic
and stochastic petri nets. In: Int’l Workshop on Quality of Communication-Based Systems
29. Mainkar V, Trivedi KS (1995) Transient analysis of real-time systems with soft deadlines.
In: Quality of communication based systems
30. Muppala JK, Trivedi KS (1991) Real-time systems performance in the presence of failures. IEEE Computer Magazine 37–47
31. Avizienis A, Laprie J-C, Randell B, Landwehr C (2004) Basic concepts and taxonomy of
dependable and secure computing. IEEE Transaction Dependable and Secure Computing
1(1):11–33
32. Thunem AP-J (2005) Security research from a multi-disciplinary and multi-sectoral perspective. Lecture Notes in Computer Science (LNCS 3688). Springer, Berlin/Heidelberg 381–389
33. Anderson RJ (2001) Security Engineering: A Guide to Building Dependable Distributed Systems. Wiley Computer Publishing, USA
34. MIL-STD-1553B: Aircraft internal time division command/response multiplex data bus, 30
April 1975.
35. AERB/SG/D-25: Computer based systems of pressurised heavy water reactor, 2001.
36. Safety guide NS-G-1.3 Instrumentation and control systems important to safety in nuclear
power plants, 2002.
37. IEC 60880-2.0: Nuclear power plants - instrumentation and control systems important to
safety - software aspects for computer-based systems performing category a functions, 2006.
38. Keidar I, Shraer A (2007) How to choose a timing model? In: Proc. 37th Annual IEEE/IFIP
Int. Conf. on Dependable Systems and Networks (DSN’07)
39. Yang H, Sikdar B (2007) Control loop performance analysis over networked control systems.
In: Proceedings of ICC 2007 241–246
40. Yang TC (2006) Networked control systems: a brief survey. IEE Proceedings - Control Theory and Applications 153(4):403–412
Chapter 2
Probability Theory
Probability theory deals with the study of events whose precise occurrence cannot be predicted in advance. Such events are termed random events. For example, in a toss of a coin the result may be either HEAD or TAIL; since the precise result cannot be predicted in advance, the event of tossing a coin is an example of a random event. Probability theory is usually discussed in terms of experiments and the possible outcomes of those experiments.
Probability is a positive measure associated with each simple event. From a strict mathematical point of view it is difficult to define the concept of probability. A relative-frequency approach, also called the a posteriori approach, is usually used to define probability.

In classical probability theory, all sample spaces are assumed to be finite, and each sample point is considered to occur with equal frequency. The probability P of an event A is then described by the relative frequency with which A occurs:

    P(A) = h / n                                    (2.1)

where h is the number of sample points in A and n is the total number of sample points. This is also called the relative-frequency definition of probability.
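The relative-frequency definition in Eq. 2.1 can be illustrated by simulation: the fraction h/n of trials in which an event occurs approaches its probability as n grows. A minimal sketch, using the coin toss from above:

```python
import random

def relative_frequency(event, trial, n, seed=0):
    """Estimate P(A) as h/n: h occurrences of event A in n trials."""
    rng = random.Random(seed)
    h = sum(1 for _ in range(n) if event(trial(rng)))
    return h / n

# Fair coin toss: the relative frequency of HEAD approaches 0.5.
p = relative_frequency(lambda outcome: outcome == "HEAD",
                       lambda rng: rng.choice(["HEAD", "TAIL"]),
                       n=100_000)
print(p)  # close to 0.5 for large n
```

The fixed seed only makes the run repeatable; the a posteriori character of the definition is that p is known only after the trials are performed.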
Probability theory is based on the concepts of set theory: sample space, events and the algebra of events. Before proceeding, these are briefly reviewed below.
Probability theory is the study of random experiments. A real-life experiment may consist of the simple process of noting whether a component is functioning properly or has failed, measuring the response time of a system, or the queueing time at a service station. The result may consist of simple observations such as 'yes' or 'no', a period of time, etc. These are called the outcomes of the experiment.

The totality of possible outcomes of a random experiment is called the sample space of the experiment and is denoted by the letter S. The sample space is not always determined by the experiment alone but also by the purpose for which the experiment is carried out.
It is useful to think of the outcomes of an experiment, the elements of the sample space, as points in a space of one or more dimensions. For example, if an experiment consists of examining the state of a single component, it may be functioning correctly or it may have failed; the sample space is one-dimensional. If a system consists of two components, there are four possible outcomes, forming a two-dimensional sample space. In general, if a system has n components with two states each, there are 2^n possible outcomes, each of which can be regarded as a point in n-dimensional space.

Sample spaces are conventionally classified according to the number of elements they contain. If the set of all possible outcomes of the experiment is finite, the associated sample space is a finite sample space. A finite sample space is also referred to as a countable or discrete sample space.
Measurement of time (response time, queueing time, time till failure) would have an entire interval of real numbers as possible values. Since an interval of real numbers cannot be enumerated, its points cannot be put into one-to-one correspondence with the natural numbers; such a sample space is said to be uncountable or non-denumerable. If the elements of a sample space constitute a continuum, such as all the points of a line, all the points on a line segment, or all the points in a plane, the sample space is said to be continuous.
A collection or subset of sample points is called an event; that is, any statement of conditions that defines such a subset defines an event. The set of all experimental outcomes (sample points) for which the statement is true defines the subset of the sample space corresponding to the event. A single performance of the experiment is known as a trial. The entire sample space is an event, called the universal event, and so is the empty set, called the null or impossible event. For a continuous sample space, consider an experiment of observing the time to failure of a component. The sample space in this case may be thought of as the set of all non-negative real numbers, i.e. the interval [0, ∞) = {t | 0 ≤ t < ∞}.
Consider an example of a computer system with five identical tape drives. One possible random experiment consists of checking the system to see how many tape drives are currently available. Each tape drive is in one of two states: busy (labeled 0) or available (labeled 1). An outcome of the experiment (a point in the sample space) can be denoted by a 5-tuple of 0's and 1's; a 0 in position i of the 5-tuple indicates that tape drive i is busy and a 1 indicates that it is available. The sample space S has 2^5 = 32 sample points.
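The 5-tuple sample space can be enumerated directly, which also confirms the count 2^5 = 32:

```python
from itertools import product

# Each tape drive is busy (0) or available (1); a sample point is a
# 5-tuple whose position i describes drive i.
S = list(product([0, 1], repeat=5))
print(len(S))       # 32 sample points

# Example event: "exactly one drive is available"
event = [s for s in S if sum(s) == 1]
print(len(event))   # 5 sample points, one per drive
```

Events such as "at least three drives are available" are just other subsets of S, selected by different conditions in the comprehension.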
A set is a collection of well-defined objects. In general, a set is denoted by a capital letter such as A, B, C, etc., and an element of the set by a lower-case letter such as a, b, c, etc.

Set theory is an established branch of mathematics. It has a number of operations, operators and theorems; the basic operators are union and intersection. Some of the theorems are given below:
1. Idempotent laws:

    A ∪ A = A,  A ∩ A = A                                  (2.2)

2. Commutative laws:

    A ∪ B = B ∪ A,  A ∩ B = B ∩ A                          (2.3)

3. Associative laws:

    A ∪ (B ∪ C) = (A ∪ B) ∪ C = A ∪ B ∪ C
    A ∩ (B ∩ C) = (A ∩ B) ∩ C = A ∩ B ∩ C                  (2.4)

4. Distributive laws:

    A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
    A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)                        (2.5)

5. Identity laws:

    A ∪ ∅ = A,  A ∩ ∅ = ∅
    A ∪ U = U,  A ∩ U = A                                  (2.6)

6. De Morgan's laws (Aᶜ denoting the complement of A):

    (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ
    (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ                                     (2.7)

7. Complement laws:

    A ∪ Aᶜ = U,  A ∩ Aᶜ = ∅
    (Aᶜ)ᶜ = A,  Uᶜ = ∅,  ∅ᶜ = U                            (2.8)

8. For any sets A and B:

    A = (A ∩ B) ∪ (A ∩ Bᶜ)                                 (2.9)
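These identities can be checked mechanically on concrete sets, with complement taken relative to a universal set U; for instance De Morgan's laws and the decomposition A = (A ∩ B) ∪ (A ∩ Bᶜ):

```python
U = frozenset(range(10))        # universal set for this check
A = frozenset({1, 2, 3, 4})
B = frozenset({3, 4, 5, 6})
comp = lambda X: U - X          # complement relative to U

# De Morgan's laws (Eq. 2.7)
assert comp(A | B) == comp(A) & comp(B)
assert comp(A & B) == comp(A) | comp(B)

# Decomposition of A (Eq. 2.9): A = (A ∩ B) ∪ (A ∩ Bᶜ)
assert A == (A & B) | (A & comp(B))
```

Python's set operators `|`, `&` and `-` correspond directly to union, intersection and set difference, so each law transcribes one-to-one.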
If the sample space has a finite number of points, it is called a finite sample space. If it has as many points as there are natural numbers, it is called a countably infinite sample space; finite and countably infinite sample spaces are both discrete sample spaces. If it has as many points as there are points in some interval, it is called a non-countably infinite sample space, or a continuous sample space.

An event is a subset of the sample space, i.e. it is a set of possible outcomes. An event which consists of one sample point is called a simple event.
Conditional probability deals with the relation or dependence between two or more events. The kind of question dealt with is 'the probability that one event occurs under the condition that another event has occurred'.

Consider an experiment: if it is known that an event B has already occurred, then the probability that the event A has also occurred is known as the conditional probability of A given B. This is denoted by P(A|B) and is defined by

    P(A|B) = P(A ∩ B) / P(B)                               (2.10)
Let there be two events A and B. It is possible for the probability of an event A to decrease, remain the same, or increase given that event B has occurred. If the probability of the occurrence of event A does not change whether or not event B has occurred, we conclude that the two events are independent, i.e. if and only if:

    P(A|B) = P(A)                                          (2.11)

From the definition of conditional probability, we have (provided P(A) ≠ 0 and P(B) ≠ 0):

    P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B)                     (2.12)

From this it can be concluded that the condition for the independence of A and B can also be given either as P(A|B) = P(A) or as P(A ∩ B) = P(A)P(B). Note that P(A ∩ B) = P(A)P(B|A) holds whether or not A and B are independent, but P(A ∩ B) = P(A)P(B) holds only when A and B are independent. Thus, events A and B are said to be independent if

    P(A ∩ B) = P(A)P(B)                                    (2.13)
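Independence can be verified by enumeration. For two fair dice (an assumed example, not from the text), the events A = "first die shows 6" and B = "second die shows 6" satisfy both Eq. 2.10 and Eq. 2.13 exactly:

```python
from itertools import product
from fractions import Fraction

S = list(product(range(1, 7), repeat=2))   # two fair dice: 36 outcomes
P = lambda E: Fraction(len(E), len(S))     # equally likely sample points

A = {s for s in S if s[0] == 6}            # first die is 6
B = {s for s in S if s[1] == 6}            # second die is 6

assert P(A & B) == P(A) * P(B)             # Eq. 2.13: independent
assert P(A & B) / P(B) == P(A)             # Eq. 2.10: P(A|B) = P(A)
```

Using exact `Fraction` arithmetic avoids floating-point noise, so the independence condition holds as an equality rather than an approximation.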
2.5 Exclusive Events
An event is a well-defined collection of sample points in the sample space. Two events A and B in a universal sample space S are said to be exclusive events provided A ∩ B = ∅. If A and B are exclusive events, then it is not possible for both events to occur on the same trial.

A list of events A1, A2, ..., An is said to be mutually exclusive if and only if:

    Ai ∩ Aj = Ai  if i = j,  and  Ai ∩ Aj = ∅  otherwise   (2.14)

So, a list of events is said to be composed of mutually exclusive events if no point in the sample space is common to more than one event in the list. A list of events A1, A2, ..., An is said to be collectively exhaustive if and only if:

    A1 ∪ A2 ∪ ... ∪ An = S                                 (2.15)
This relation is known as Bayes' rule. From it, the probabilities of the events A1, A2, ..., Ak which can cause A to occur can be established. Bayes' theorem makes it possible to obtain P(A|B) from P(B|A), which in general is not otherwise possible.
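Bayes' rule in action: obtaining the posterior P(A_i | B) from the likelihoods P(B | A_i) over a mutually exclusive, collectively exhaustive partition A_1, ..., A_k. The supplier mix and defect rates below are hypothetical illustration values:

```python
from fractions import Fraction as F

# Hypothetical partition: a component comes from one of three suppliers
# (prior probabilities); each supplier has its own defect probability.
prior  = {"s1": F(1, 2), "s2": F(3, 10), "s3": F(1, 5)}
defect = {"s1": F(1, 100), "s2": F(2, 100), "s3": F(5, 100)}

# Total probability: P(B) = sum_i P(A_i) P(B|A_i)
p_b = sum(prior[i] * defect[i] for i in prior)

# Bayes' rule: P(A_i | B) = P(A_i) P(B | A_i) / P(B)
posterior = {i: prior[i] * defect[i] / p_b for i in prior}

assert sum(posterior.values()) == 1
print({i: float(p) for i, p in posterior.items()})
```

Note how the least likely supplier (s3) becomes the most likely cause given a defect, because its defect rate dominates its small prior: this inversion from P(B|A_i) to P(A_i|B) is precisely what Bayes' rule provides.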
When a real number is assigned to each point of a sample space, i.e. each sample point has a single real value, a function is defined on the sample space. The result of an experiment which assumes these real values over the sample space is called a random variable; such a variable is really a function defined on the sample space.

A random variable defined on a discrete sample space is called a discrete random variable, and a random variable defined on a continuous sample space, taking on an uncountably infinite number of values, is called a continuous random variable. In general a random variable is denoted by a capital letter (e.g. X, Y) whereas the possible values are denoted by lower-case letters (e.g. x, y).

A random variable partitions its sample space into a mutually exclusive and collectively exhaustive set of events. Thus, for a random variable X and a real number x, define Ax to be the subset of S consisting of all sample points s to which the random variable X assigns the value x:

    Ax = {s ∈ S | X(s) = x}                                (2.18)

It is implied that Ax ∩ Ay = ∅ if x ≠ y, and that:

    ⋃ (x ∈ ℝ) Ax = S                                       (2.19)
When the state space is discrete and the random variable takes on values
from a discrete set of numbers, that set is either finite or countable.
Such random variables are known as discrete random variables. A random variable
defined on a discrete sample space will be discrete, but it is also possible to define a
discrete random variable on a continuous sample space. For example, for a
continuous sample space S, the random variable defined by X(s) = 1 for all s ∈ S is
discrete.
Let X be a discrete random variable which takes the values
{x_1, x_2, …, x_n}. Suppose these values are assumed with probabilities given by:

P{X = x_k} = f(x_k)   (2.20)
This is also known as the frequency (or mass) function. In general, a function f(x)
is a mass function if

f(x) ≥ 0   (2.21)

and

Σ_x f(x) = 1   (2.22)
Let us now compute the probability of the set {s | X(s) ∈ A} for some subset
A of ℝ other than a one-point set. It can be shown that:

{s | X(s) ∈ A} = ∪_{x_i ∈ A} {s | X(s) = x_i}   (2.24)

If f(x) denotes the probability mass function of random variable X, then from the
above equation we have:

P(X ∈ A) = Σ_{x_i ∈ A} f(x_i)   (2.25)
The above equations are valid when X takes values from a discrete sample space.
The cumulative distribution function contains most of the interesting information
about the underlying probabilistic system, and it is used extensively.
Often the concepts of sample space, event space, and probability measure, which
are fundamental in building the theory of probability, will fade into the
background, and functions such as the distribution function or the probability mass
function become the most important entities.
2. Binomial distribution
In a series of Bernoulli trials, the number of successes (or failures) out of the total
number of trials follows the binomial distribution. Consider a sequence of n
independent Bernoulli trials with probability of success equal to p on each trial.
Let Y_n denote the number of successes in n trials. The domain of the random
variable Y_n is the set of all n-tuples of 0's and 1's, and its image is {0, 1, …, n}. The
value assigned to an n-tuple by Y_n simply corresponds to the number of 1's in
the n-tuple.
p_k = P(Y_n = k) = C(n, k) p^k q^(n−k)  for 0 ≤ k ≤ n, and p_k = 0 otherwise   (2.32)
The above equation gives the probability of k ‘successes’ in n independent
trials, where each trial has probability p of success.
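The pmf of Eq. 2.32 can be computed directly; a minimal Python sketch (the function name is ours):

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(Y_n = k): probability of k successes in n independent Bernoulli trials."""
    if 0 <= k <= n:
        return comb(n, k) * p**k * (1 - p)**(n - k)
    return 0.0

# The pmf sums to one over k = 0..n, as any mass function must (Eq. 2.22).
total = sum(binomial_pmf(k, 10, 0.3) for k in range(11))
```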
3. Geometric distribution
Let us consider a sequence of Bernoulli trials, and count the number of trials until
the first "success" occurs. Let 0 denote a failure and 1 denote a success; then
the sample space of these trials consists of all binary strings with an arbitrary
number of 0's followed by a single 1:
S = {0^(i−1) 1 | i = 1, 2, 3, …}   (2.33)
Note that this sample space has a countably infinite number of sample points.
Let us define a random variable Z on this sample space so that the value assigned
to the sample point 0^(i−1) 1 is i. Thus Z is the number of trials up to and
including the first success. Therefore, Z is a random variable with image
{1, 2, 3, …}, which is a countably infinite set. To find the pmf of Z, we note
that the event [Z = i] occurs if and only if we have a sequence of i − 1 failures
followed by one success. This is a sequence of independent Bernoulli trials with
probability of success equal to p. Hence, we have:

p_Z(i) = q^(i−1) p = p(1 − p)^(i−1),  for i = 1, 2, 3, …   (2.34)
The geometric distribution has an important property, known as the Markov (or
memoryless) property. This is the only discrete distribution with this property.
To illustrate this property, consider a sequence of Bernoulli trials and let Z
represent the number of trials until the first success. Now assume that we have
observed a fixed number n of these trials and found them all to be failures. Let Y
denote the number of additional trials that must be performed until the first
success. Then Y = Z − n, and the conditional probability is:
q_i = P(Y = i | Z > n)
    = P(Z − n = i | Z > n)
    = P(Z = n + i | Z > n)
    = P(Z = n + i and Z > n) / P(Z > n)
    = P(Z = n + i) / P(Z > n)
    = p q^(n+i−1) / (1 − (1 − q^n))
    = p q^(n+i−1) / q^n
    = p q^(i−1)
    = p_Z(i)   (2.35)
We see that, conditioned on Z > n, the number of trials remaining until the first
success, Y = Z − n, has the same pmf as Z had originally.
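The memoryless property derived in Eq. 2.35 is easy to verify numerically; an illustrative sketch (helper names are ours):

```python
def geometric_pmf(i, p):
    """p_Z(i) = p * (1 - p)**(i - 1), i = 1, 2, 3, ...  (Eq. 2.34)"""
    return p * (1 - p)**(i - 1)

def conditional_pmf(i, n, p):
    """P(Y = i | Z > n) = P(Z = n + i) / P(Z > n), following Eq. 2.35."""
    q = 1 - p
    return (p * q**(n + i - 1)) / q**n

# The conditional pmf of the residual trial count equals the original pmf.
gap = abs(conditional_pmf(3, 5, 0.2) - geometric_pmf(3, 0.2))
```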
4. Negative binomial distribution
In the geometric pmf, Bernoulli trials until the first success are observed. If r
successes need to be observed, the process results in the negative binomial pmf,
which gives the probability that the rth success occurs on the nth trial:

C(n − 1, r − 1) p^r q^(n−r),  n = r, r + 1, …
5. Poisson distribution
Since the assumption that the probability of more than one arrival per interval
can be neglected is reasonable only if t/n is very small, we will take the limit of
the above probability mass function as n approaches ∞.
b(k; n, λt/n) = [n(n − 1)(n − 2) ⋯ (n − k + 1) / (k! n^k)] (λt)^k (1 − λt/n)^(n−k)
             = (n/n) ((n − 1)/n) ⋯ ((n − k + 1)/n) ((λt)^k / k!) (1 − λt/n)^(−k) (1 − λt/n)^n   (2.38)
As n approaches infinity, the first k factors approach unity, the next factor is
fixed, the next approaches unity, and the last factor becomes:

lim_{n→∞} [(1 − λt/n)^(−n/λt)]^(−λt)   (2.40)

Since the limit inside the bracket is the common definition of e, the binomial
probability mass function approaches:

e^(−λt) (λt)^k / k!,  k = 0, 1, 2, …   (2.42)
In general, the Poisson pmf with parameter α is written as:

f(k; α) = e^(−α) α^k / k!,  k = 0, 1, 2, …   (2.43)

The Poisson pmf also serves as an approximation to the binomial pmf for large n
and small p:

C(n, k) p^k (1 − p)^(n−k) ≈ e^(−α) α^k / k!,  where α = np   (2.44)
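The limiting relationship of Eq. 2.44 can be checked numerically for large n and small p; a sketch (function names are ours):

```python
from math import comb, exp, factorial

def poisson_pmf(k, a):
    """f(k; a) = e**(-a) * a**k / k!  (Eq. 2.43)"""
    return exp(-a) * a**k / factorial(k)

def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# With a = n*p held moderate, the binomial pmf approaches the Poisson pmf.
n, p = 1000, 0.002
a = n * p  # a = 2
max_err = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, a)) for k in range(10))
```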
6. Hypergeometric distribution
In the binomial distribution, the probability of occurrence of events remains the
same during each experiment. In experiments such as drawing samples from a
fixed set of samples, the binomial corresponds to 'sampling with replacement'.
But in some experiments, the chance of occurrence of events changes over the
course of the experiment. The hypergeometric distribution is obtained when
'sampling without replacement'.
Suppose we select a random sample of n components from a box containing N
components, d of which are known to be defective. For the first component
selected, the probability that it is defective is given by d/N; for the second
selection it remains d/N only if the first component is replaced. Otherwise, this
probability is (d − 1)/(N − 1) or d/(N − 1), depending on whether or not a defective
component was selected in the first draw. In this experiment the condition of
constant chance of occurrence, as in Bernoulli trials, is not satisfied. The
probability distributions of such experiments are referred to as hypergeometric.
The hypergeometric probability mass function, h(k; n, d, N), is defined to be the
probability of choosing k defective components in a random sample of n
components:

h(k; n, d, N) = C(d, k) C(N − d, n − k) / C(N, n)
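A sketch of the sampling-without-replacement pmf (the function name is ours; Python's `math.comb` conveniently returns zero for infeasible selections):

```python
from math import comb

def hypergeom_pmf(k, n, d, N):
    """h(k; n, d, N): probability of k defectives in a sample of size n drawn
    without replacement from N components of which d are defective."""
    return comb(d, k) * comb(N - d, n - k) / comb(N, n)

# probabilities over all feasible k sum to one
total = sum(hypergeom_pmf(k, 5, 4, 20) for k in range(0, 5))
```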
F(t) = Σ_{i=1}^{N} p_X(i)   (2.47)
It should be noted that, unlike the probability mass function, the values of the
pdf are not probabilities, and thus it is acceptable if f(x) > 1 at some points x.
f(t) = λ^r t^(r−1) e^(−λt) / (r − 1)!,  t > 0, λ > 0, r = 1, 2, 3, …   (2.53)

F(t) = 1 − Σ_{k=0}^{r−1} e^(−λt) (λt)^k / k!,  t ≥ 0, λ > 0, r = 1, 2, 3, …   (2.54)
If r takes non-integer values, the process results in the gamma distribution.
The density function is given as:

f(t) = λ^r t^(r−1) e^(−λt) / Γ(r),  t > 0, λ > 0, r > 0   (2.55)
∫_0^∞ x^(n−1) e^(−λx) dx = Γ(n) / λ^n   (2.57)
4. Weibull distribution
The Weibull distribution is widely used for statistical curve fitting of lifetime data.
The distribution has been used to describe fatigue failure, vacuum tube failure
and ball bearing failure. The density function is given as:

f(x; k, λ) = (k/λ) (x/λ)^(k−1) e^(−(x/λ)^k)  for x ≥ 0, and 0 for x < 0   (2.58)

where k > 0 is the shape parameter and λ > 0 is the scale parameter of the density
function.
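Eq. 2.58 translates directly into code; for shape k = 1 the Weibull reduces to the exponential distribution, which gives a convenient sanity check (a sketch, function names ours):

```python
from math import exp

def weibull_pdf(x, k, lam):
    """f(x; k, lam) = (k/lam) * (x/lam)**(k-1) * exp(-(x/lam)**k) for x >= 0."""
    if x < 0:
        return 0.0
    return (k / lam) * (x / lam)**(k - 1) * exp(-(x / lam)**k)

def weibull_cdf(x, k, lam):
    """F(x; k, lam) = 1 - exp(-(x/lam)**k) for x >= 0."""
    return 1.0 - exp(-(x / lam)**k) if x >= 0 else 0.0

# with shape k = 1 the density is exponential with rate 1/lam
check = weibull_pdf(2.0, 1.0, 2.0)   # equals 0.5 * exp(-1)
```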
5. Normal distribution
This distribution is extremely important in statistical applications because of the
central limit theorem, which states that, under very general assumptions, the
mean of a sample of n mutually independent random variables (having
distributions with finite mean and variance) is normally distributed in the limit
n → ∞. It has been observed that errors of measurement often possess this
distribution.
The normal density has the well-known bell-shaped curve and is given by:
f(x) = (1 / (σ √(2π))) e^(−(x−μ)² / (2σ²))   (2.59)
where −∞ < x, μ < ∞ and σ > 0. Here μ stands for the mean and σ for the standard
deviation. As the integral of the above function does not have a closed form, the
distribution function F(x) does not have a closed form. So for every pair of limits
a and b, probabilities relating to normal distributions are usually obtained
numerically or from normal tables.
The CDF of the normal distribution with zero mean (μ = 0) and unit standard
deviation (σ = 1) is given as:
F_X(x) = (1/√(2π)) ∫_{−∞}^{x} e^(−t²/2) dt   (2.60)
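Since Eq. 2.60 has no closed form, in practice it is evaluated numerically. Python's standard library exposes the error function, which relates to the standard normal CDF by F(x) = (1 + erf(x/√2))/2; a sketch (function names ours):

```python
from math import erf, sqrt

def std_normal_cdf(x):
    """F_X(x) of Eq. 2.60, via the identity F(x) = 0.5 * (1 + erf(x / sqrt(2)))."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    """CDF for a general mean mu and standard deviation sigma > 0."""
    return std_normal_cdf((x - mu) / sigma)

p_half = std_normal_cdf(0.0)   # symmetry of the bell curve gives one half
```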
2.8 Transforms
In many problems the PGF G_X(z) will be known or derivable without knowledge
of the pmf of X. It will be shown in later sections that interesting quantities such as
the mean and variance of X can be estimated from the PGF itself. One reason for the
usefulness of the PGF is found in the following theorem, which is quoted here
without proof.
Theorem 2.1 If two discrete random variables X and Y have the same PGFs, then
they must have the same distributions and pmfs.
It means that if a random variable has the same PGF as another random variable
with a known pmf, this theorem assures that the pmf of the original random
variable must be the same.
L(0) = ∫_0^∞ P_X(x) dx   (2.68)
2.9 Expectations
The distribution function F(x) or the density f(x) (or pmf for a discrete random
variable) completely characterizes the behavior of a random variable X. Frequently,
a more concise description, such as a single number or a few numbers rather than an
entire function, is desired. One such number is the expectation or the mean, denoted
by E[X]. Others are the median, mode, and variance. The mean, median and mode
are often called measures of central tendency of a random variable X.
Definition 2.1 The expectation, E[X], of a random variable X is defined by:

E[X] = Σ_i x_i p(x_i)  for a discrete X, and  E[X] = ∫_{−∞}^{∞} x f(x) dx  for a continuous X   (2.72)
Equation 2.72 is valid provided that the relevant sum or integral is absolutely
convergent; that is,

Σ_i |x_i| p(x_i) < ∞  and  ∫_{−∞}^{∞} |x| f(x) dx < ∞   (2.73)
If the sum or integral is not absolutely convergent, then E[X] does not exist.
Example 2.1 Let X be a continuous random variable with an exponential density
given by:
f(x) = λ e^(−λx),  x > 0   (2.74)

The expectation of X is evaluated as:

E[X] = ∫_{−∞}^{∞} x f(x) dx = ∫_0^∞ λ x e^(−λx) dx = 1/λ   (2.75)
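The integral in Eq. 2.75 evaluates to 1/λ, which a crude midpoint-rule integration confirms (an illustrative sketch; truncating the infinite upper limit at 50 is our choice and leaves negligible tail mass for the rates tested):

```python
from math import exp

def exponential_mean(lam, upper=50.0, steps=200_000):
    """Numerically evaluate E[X] = integral of lam * x * exp(-lam * x) dx
    over (0, upper); the analytic value is 1/lam."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h          # midpoint rule
        total += lam * x * exp(-lam * x) * h
    return total
```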
While dealing with random variables, situations often arise which require the
addition, maximum, minimum, mean, median, etc. of several random variables.
In this section, these are discussed.
Let us determine the distribution of the random variable Z, where
Z = X_1 + X_2 + ⋯ + X_n and X_1, X_2, … are independent random variables with
known distributions. On an n-dimensional event space, this event is represented
by all the event points on the plane X_1 + X_2 + ⋯ + X_n = t. The probability of
this event may be computed by adding the probabilities of all the event points on
this plane.
P(Z = t) = Σ_{x_1, x_2, …, x_n} P(X_1 = x_1, X_2 = x_2, …, X_n = x_n; x_1 + x_2 + ⋯ + x_n = t)

p_Z(t) = p_1(t) ∗ p_2(t) ∗ ⋯ ∗ p_n(t)   (2.79)

This summation turns out to be a convolution (discrete or continuous).
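The discrete convolution of Eq. 2.79 can be sketched with pmfs stored as dictionaries (names ours); convolving a Bernoulli pmf with itself yields the binomial pmf for n = 2:

```python
def convolve_pmf(p1, p2):
    """pmf of Z = X1 + X2 for independent X1, X2, each given as {value: probability}."""
    out = {}
    for x1, q1 in p1.items():
        for x2, q2 in p2.items():
            out[x1 + x2] = out.get(x1 + x2, 0.0) + q1 * q2
    return out

bern = {0: 0.7, 1: 0.3}                  # Bernoulli(p = 0.3)
two_trials = convolve_pmf(bern, bern)    # binomial with n = 2, p = 0.3
```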
Let Z_1, Z_2, …, Z_n be random variables obtained by permuting the set
X_1, X_2, …, X_n so as to be in increasing order:

Z_1 = min{X_1, X_2, …, X_n}  and  Z_n = max{X_1, X_2, …, X_n}   (2.80)

The random variable Z_k is called the kth order statistic. To derive the distribution
function of Z_k, note that the probability that exactly j of the X_i's lie in (−∞, z] and
(n − j) lie in (z, ∞) is C(n, j) F^j(z) [1 − F(z)]^(n−j), since the binomial distribution
with parameters n and p = F(z) is applicable. Then:
F_{Z_k}(z) = P(Z_k ≤ z)
          = P("at least k of the X_i's lie in the interval (−∞, z]")
          = Σ_{j=k}^{n} C(n, j) F^j(z) [1 − F(z)]^(n−j),  −∞ < z < ∞   (2.82)

F_{Z_1}(z) = 1 − [1 − F(z)]^n
F_{Z_n}(z) = [F(z)]^n   (2.83)
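Eq. 2.82 and its special cases in Eq. 2.83 can be cross-checked directly; a sketch (the function takes the common CDF value F(z) as input, and the names are ours):

```python
from math import comb

def order_stat_cdf(F, n, k):
    """F_{Z_k}(z) per Eq. 2.82: at least k of n i.i.d. X_i's lie in (-inf, z],
    where F is the common CDF evaluated at z."""
    return sum(comb(n, j) * F**j * (1 - F)**(n - j) for j in range(k, n + 1))

F, n = 0.4, 5
min_cdf = order_stat_cdf(F, n, 1)   # should equal 1 - (1 - F)**n
max_cdf = order_stat_cdf(F, n, n)   # should equal F**n
```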
2.11 Moments
and

E[Σ_{i=1}^{n} a_i X_i] = Σ_{i=1}^{n} a_i E[X_i]   (2.86)
2.12 Summary
3.1 Introduction
¹ The word "stochastic" is of Greek origin. In seventeenth-century English, the word
"stochastic" had the meaning "to conjecture, to aim at a mark". It is not quite clear how it
acquired the meaning it has today of "pertaining to chance".
For a given time t = t_0, X(t_0) is a simple random variable that describes
the state of the process at time t_0. For a fixed number x_1, the probability of the
event [X(t_0) ≤ x_1] gives the cumulative distribution function (CDF) of the random
variable X(t_0). Mathematically, this is given as:

F_{X(t_1)}(x_1) = P[X(t_1) ≤ x_1]   (3.1)
F(x_1; t_1) is known as the first-order distribution of the process X(t). Given two
time instants t_1 and t_2, X(t_1) and X(t_2) are two random variables on the same
probability space. Their joint distribution is known as the second-order distribution
of the process and is given by:

F(x_1, x_2; t_1, t_2) = P(X(t_1) ≤ x_1, X(t_2) ≤ x_2)   (3.2)

In general, the nth-order joint distribution of the stochastic process {X(t), t ∈ T}
is given by:

F(x; t) = P[X(t_1) ≤ x_1, …, X(t_n) ≤ x_n]   (3.3)
for all x = (x_1, …, x_n) ∈ ℝ^n and t = (t_1, …, t_n) ∈ T^n such that t_1 < t_2 < ⋯ < t_n.
Many processes of practical interest, however, permit a much simpler description.
The processes can be classified based on time-shift, independence and memory,
as follows:
1. A stochastic process {X(t)} is said to be stationary in the strict sense if for
n ≥ 1, its nth-order joint CDF satisfies the condition:

F(x; t) = F(x; t + τ)   (3.4)

for all vectors x ∈ ℝ^n and t ∈ T^n, and all scalars τ such that t_i + τ ∈ T. The
notation t + τ implies that the scalar τ is added to all components of vector t.
Let μ(t) = E[X(t)] denote the time-dependent mean of the stochastic process.
μ(t) is often called the ensemble average of the stochastic process. Applying the
definition of the strictly stationary process to the first-order CDF, F(x; t) = F(x; t + τ)
for all τ. It follows that a strict-sense stationary process has a time-independent
mean; that is, μ(t) = μ for all t ∈ T.
2. A stochastic process {X(t)} is said to be an independent process provided its
nth-order joint distribution satisfies the condition:

F(x; t) = Π_{i=1}^{n} F(x_i; t_i) = Π_{i=1}^{n} P[X(t_i) ≤ x_i]   (3.5)
The random walk has its origin in the study of the movement of a particle in a
fluid, but it has been used in a wide variety of applications such as the modeling of
insurance risk, the escape of comets from the solar system, the content of a dam,
and queueing systems. Consider a particle which can move only in one dimension,
i.e. along the x-axis. At time n = 1 the particle undergoes a step or jump Z_1, where
Z_1 is a random variable having a given distribution. At time n = 2 the particle
undergoes a jump Z_2, where Z_2 is independent of Z_1 and has the same
distribution, and so on. The particle moves along a straight line: after one jump it
is at the position X_0 + Z_1, after two jumps at X_0 + Z_1 + Z_2 and, in general, after
n jumps its position is given by X_n = X_0 + Z_1 + Z_2 + ⋯ + Z_n. Here {Z_i} is a
sequence of i.i.d. random variables. This can be represented
as X_n = X_{n−1} + Z_n for n = 1, 2, …
Now consider the particular case where the steps Z_i can only take the values
−1, 0, 1 with the probabilities:

P(Z_i = 1) = p
P(Z_i = −1) = q   (3.6)
P(Z_i = 0) = 1 − p − q
This particular process is a stochastic process in discrete time with a discrete
state space. If the particle continues to move indefinitely according to the above
relation, the random walk is said to be unrestricted. The motion of the particle may
be restricted by the use of barriers, which can be absorbing or reflecting
barriers. So far, we have restricted the discussion to one-dimensional jumps/
steps only. When the jumps are in two or three dimensions, the result is a two- or
three-dimensional random walk, respectively.
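The unrestricted one-dimensional walk with the step distribution of Eq. 3.6 is easy to simulate; an illustrative sketch (parameter names ours):

```python
import random

def random_walk(x0, n, p, q, seed=0):
    """Simulate X_n = X_{n-1} + Z_n with P(Z=1)=p, P(Z=-1)=q, P(Z=0)=1-p-q."""
    rng = random.Random(seed)
    x, path = x0, [x0]
    for _ in range(n):
        u = rng.random()
        if u < p:
            x += 1          # step right
        elif u < p + q:
            x -= 1          # step left
        path.append(x)      # Z = 0: stay put
    return path

path = random_walk(0, 1000, 0.3, 0.3)
```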
Example The escape of comets from the solar system. This example has been
taken from Cox [1]. The problem was originally studied by Kendall, who made
an interesting application of the random walk to the theory of comets. Comets
revolve around the sun, and during one revolution the energy of a
comet undergoes a change brought about by the disposition of the planets. In
successive revolutions the changes in the energy of the comet are assumed to be
independent and identically distributed random variables Z_1, Z_2, …. If initially the
comet has positive energy X_0, then after n revolutions the energy will be:
X_n = X_0 + Z_1 + Z_2 + ⋯ + Z_n   (3.7)
40 3 Stochastic Processes and Models
If at any stage the energy X_n becomes zero or negative, the comet escapes from the
solar system. Thus the energy level of the comet undergoes a random walk starting
at X_0 > 0 with an absorbing barrier at 0. Absorption corresponds to escape from
the solar system.
Definition 3.4 (Recurrent State) A state i is said to be recurrent if and only if, starting
from state i, the process eventually returns to state i with probability one [2].
Definition 3.5 (Mean recurrence time) The mean recurrence time of a recurrent state
x_j is [4]:

M_j = Σ_m m f_j(m)   (3.11)

where f_j(m) denotes the probability of leaving state x_j and first returning to that
same state in m steps.
In this case, the π_j are uniquely determined from the set of equations:

π_j = Σ_i π_i p_ij  subject to  Σ_i π_i = 1   (3.14)

where P is the transition probability matrix. The vector π is called the steady-
state solution of the Markov chain.
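The fixed-point equations (3.14) can be solved numerically by power iteration on the transition matrix; a sketch using a hypothetical two-state chain (all names ours):

```python
def steady_state(P, iters=10_000, tol=1e-12):
    """Solve pi = pi * P with sum(pi) = 1 for a finite DTMC by power iteration;
    P is a row-stochastic matrix given as a list of rows."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(pi, nxt)) < tol:
            return nxt
        pi = nxt
    return pi

# hypothetical two-state chain; the analytic solution is pi = (2/3, 1/3)
P = [[0.9, 0.1],
     [0.2, 0.8]]
pi = steady_state(P)
```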
In a discrete-time Markov chain, the process can change its state only at discrete
time points. The time the process spends in a given state will be investigated here.
Due to the Markov property, the next transition depends neither upon how this state
was reached nor upon how much time has passed in this state. Suppose the process
has already spent n_0 time quanta in a given state. Denote the random variable
'transition time' by X and the random variable 'transition time after n_0' by Y, with
Y = X − n_0. Let the conditional probability of Y = n, given that X > n_0, be
denoted by Z(n).
Z(n) = P(Y = n | X > n_0)
     = P(X − n_0 = n | X > n_0)
     = P(X = n + n_0 | X > n_0)
     = P(X = n + n_0 and X > n_0) / P(X > n_0)
     = P(X = n + n_0) / P(X > n_0)   (3.16)
So, in a discrete-time Markov process, the distribution of the residence time in a
state possesses a unique property: the fact that the process has already spent a
specified time n_0 in the state does not affect the distribution of the residual time.
In other words, the process has no time memory.
For a discrete-time Markov chain, the only sojourn time distribution which
satisfies this memoryless sojourn time condition is the geometric distribution.
If the conditional probability defined above is invariant with respect to the
time origin t_n, the Markov chain is said to be homogeneous; that is, the
conditional probability is the same for any t and t_n.
If the states in a Markov chain can change only at discrete time points, the Markov chain
is called a discrete-time Markov chain (DTMC). If the transitions between states
may take place at any instance, the Markov chain is called a continuous-time
Markov chain (CTMC).
In the last section we considered Markov processes with discrete states defined at
discrete time instants. Let the state space of the process remain discrete and the
parameter space be t = [0, ∞). As per the definition of a Markov process, a
discrete-state continuous-parameter (time) stochastic process {X(t), t ≥ 0} is a
Markov process if:

P[X(t) = x | X(t_n) = x_n, X(t_{n−1}) = x_{n−1}, …, X(t_0) = x_0] = P[X(t) = x | X(t_n) = x_n]   (3.18)
where

(1) πQ = 0
(2) Σ_i π_i = 1   (3.20)
For a finite state-space process, the sum of all state probabilities equals unity.
The evolution of a Markov process over time can be realized using the Chapman–
Kolmogorov (C–K) equation. The C–K equation enables one to build up conditional
pdfs over 'long' time intervals from those over 'short' time intervals. The transition
probabilities of a Markov chain {X(t), t ≥ 0} satisfy the C–K equation for all i, j ∈ I:

p_ij(v, t) = Σ_{k∈I} p_ik(v, u) p_kj(u, t)   (3.21)

for 0 ≤ v < u < t.
A Markov process transits from one state to another; these state transitions are
captured by the state-transition matrix. Unlike in a discrete-time Markov process,
in a continuous-time process a state transition may occur at any time. These two
conditions impose restrictions on the probability distributions a Markov process
may have. Considering the Markov property and these two conditions together
with the time of transition, it is clear that the time the process spends in a given
state before a transition does not depend on the time it has already spent in that state.
Suppose the process has already spent time t_0 in a given state. Denote the random
variable 'transition time' by X and the random variable 'transition time after t_0'
by Y, with Y = X − t_0. Let the conditional probability of Y ≤ t, given that X > t_0,
be denoted by Z(t).
As seen in the last section, in a Markov process the future evolution depends
only on the present state, i.e. it depends neither on how that state was reached nor
on how much time has already elapsed in that state. Any process not fulfilling these
properties is termed a non-Markovian process. Some non-Markov processes
with unique properties are of special interest; some of them are discussed in this
section.
Markov regenerative processes are processes with embedded Markov renewal
sequences. The formal definition is as follows.
Definition 3.7 A stochastic process {Z(t), t ≥ 0} is called a Markov regenerative
process if there exists a Markov renewal sequence {(Y_n, T_n); n ≥ 0} of random
variables such that all the conditional finite distributions of {Z(T_n + t), t ≥ 0}
given {Z(u), 0 ≤ u ≤ T_n; Y_n = i} are the same as those of {Z(t), t ≥ 0} given
Y_0 = i [1, 3, 5–7].
The above definition implies that:
Pr{Z(T_n + t) = j | Z(u), 0 ≤ u ≤ T_n; Y_n = i} = Pr{Z(t) = j | Y_0 = i}   (3.25)

It also implies that the future of the Markov regenerative process {Z(t), t ≥ 0}
from t = T_n onwards depends on the past {Z(t), 0 ≤ t ≤ T_n} only through Y_n.
Let v_{i,j}(t) = Pr{Z(t) = j | Y_0 = i}. The matrix V(t) = [v_{i,j}(t)] is referred to as
the conditional transient probability matrix of MRGPs. The following theorem gives
the generalized Markov renewal equation satisfied by V(t). For the sake of
conciseness, (K ∗ V)(t) denotes the matrix whose element (i, j) is defined as follows:

(K ∗ V)_{i,j}(t) = Σ_k ∫_0^t dK_{i,k}(s) v_{k,j}(t − s)
The proof of this theorem is shown in [6, 7]. Note that E(t) contains information
about the behavior of the MRGP over the first "cycle" (0, S_1]. Thus this theorem
relates the behavior of the process at time t to its behavior over the first cycle.
Consider an M/G/1 queuing system and let Z(t) be the number of customers at time
t. We can define the embedded Markov renewal sequence {(Y_n, S_n); n ≥ 0} with
S_0 = 0 and S_n the time of the nth customer departure, and Y_n = Z(S_n+) [6, 7].
Note that {Z(t), t ≥ 0} satisfies the property of Definition 3.7; hence, it is a Markov
regenerative process.
A Petri net is a directed, bipartite graph consisting of two kinds of nodes, called
places and transitions, where arcs run either from a place to a transition or from a
transition to a place [12–14]. Mathematically, a Petri net structure is defined as a
5-tuple N = {P, T, I, O, M_0}, where:
• P is a finite set of ‘‘places’’
• T is a finite set of ‘‘transitions’’
• I ⊆ (P × T) defines the input function
• O ⊆ (T × P) defines the output function
• M0 is the initial ‘‘marking’’ of the net, where a ‘‘marking’’ is the number of
‘‘tokens’’ contained in each place.
A transition t_i is said to be 'enabled' by a marking m if and only if I(t_i) is
contained in m. Any transition t_i enabled by a marking m_j can 'fire'. When it does,
a token is removed from each place in I(t_i) and added to each place in O(t_i). This
may result in a new marking m_k. If a marking enables more than one transition, the
enabled transitions are said to be in conflict. Any of the enabled transitions may fire
first. This firing may disable transitions which were previously enabled.
Petri nets are generally represented graphically. Places are drawn as circles and
transitions as bars. The input and output functions are represented by directed arcs
from places to transitions and from transitions to places, respectively. Tokens are
represented by black dots or numbers inside places.
An example of a Petri net model is given in Fig. 3.1. The PN consists of five
places {p_1, p_2, p_3, p_4, p_5} and four transitions {t_1, t_2, t_3, t_4}. The initial
marking is M_0 = (1, 0, 0, 1, 0). In this marking only transition t_2 is enabled.
Firing t_2 removes the token from place p_1 and deposits one token each in places
p_2 and p_3. One of the possible firing sequences is t_2, t_1, t_3, t_4, …
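The token game described above can be sketched in a few lines; the arc structure below is our reconstruction from the firing behavior stated in the text, not the book's formal specification of Fig. 3.1:

```python
def enabled(marking, transitions):
    """Transitions whose every input place holds at least one token."""
    return [t for t, (ins, outs) in transitions.items()
            if all(marking[p] >= 1 for p in ins)]

def fire(marking, transitions, t):
    """Remove one token from each input place of t; add one to each output place."""
    ins, outs = transitions[t]
    m = dict(marking)
    for p in ins:
        m[p] -= 1
    for p in outs:
        m[p] += 1
    return m

# assumed arcs, consistent with the firing sequence t2, t1, t3, t4 in the text
transitions = {
    't1': (['p2'], ['p1']),
    't2': (['p1'], ['p2', 'p3']),
    't3': (['p3', 'p4'], ['p5']),
    't4': (['p5'], ['p4']),
}
m0 = {'p1': 1, 'p2': 0, 'p3': 0, 'p4': 1, 'p5': 0}
m1 = fire(m0, transitions, 't2')   # only t2 is enabled in m0
```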
The analysis of Petri nets revolves around investigating the possible markings.
The Petri net semantics do not state which of multiple simultaneously enabled
transitions fires first, so a Petri net analysis must examine every possible firing order.
Petri nets can be used to capture the behavior of many real-world situations
including sequencing, synchronization, concurrency, and conflict. The main
feature which distinguishes PNs from queuing networks is the ability of the former to
represent concurrent execution of activities. If two transitions are simultaneously
enabled, this means that the activities they represent are proceeding in parallel.
Transition enabling corresponds to the starting of an activity, while transition
firing corresponds to the completion of an activity. When the firing of a transition
causes a previously enabled transition to become disabled, it means the interrupted
activity was aborted before being completed.
Many extensions to PNs have been proposed to increase either the class of
problems that can be represented or their capability to deal with the common
behavior of real systems [10, 15]. These extensions are aimed at increasing (1)
modeling power, (2) modeling convenience, and (3) decision power [10].
Modeling power is the ability of a formalism to capture the details of a system.
Modeling convenience is the practical ability to represent common behavior.
Decision power is defined to be the set of properties that can be analyzed. The
generally accepted conclusion is that increasing the modeling power decreases the
decision power. Thus each possible extension to the basic PN formalism requires
an in-depth evaluation of its effect upon modeling and decision power.
Extensions which affect only modeling convenience can be removed by
transforming an extended PN into an equivalent PN, so they can usually be adopted
without introducing any analytical complexity. These kinds of extensions provide a
powerful way to improve the ability of PNs to model real problems. Some
extensions of this type have proved so effective that they are now considered part of
the standard PN definition. They are [10, 15]:
• arc multiplicity
• inhibitor arcs
• transition priorities
• marking-dependent arc multiplicity
Arc multiplicity is a convenient extension for representing a case when more
than one token is to be moved to or from a place. The standard notation is to
denote multiple arcs as a single arc with a number next to it giving its multiplicity.
Inhibitor arcs are another useful extension of the standard PN formalism. An
inhibitor arc from place p to transition t disables t for any marking where p is not
empty. Graphically, inhibitor arcs connect a place to a transition and are drawn
with a small circle instead of an arrowhead. It is possible to use the arc multiplicity
extension in addition to inhibitor arcs; in this case a transition t is disabled
whenever place p contains at least as many tokens as the multiplicity of the
inhibitor arc. Inhibitor arcs are used to model contention for limited resources and
to represent situations in which one activity must have precedence over another.
Another way to represent the latter situation is by using transition priorities, an
extension in which an integer ‘‘priority level’’ is assigned to each transition. A
transition is enabled only if no higher priority transition is enabled. However, the
convenience of priorities comes at a price. If this extension is introduced, the
standard PN's ability to capture the entire system behavior graphically is partially lost.
Practical situations often arise where the number of tokens to be transferred
(or to enable a transition) depends upon the system state. These situations can
be easily managed by adopting marking-dependent arc multiplicity, which allows
the multiplicity of an arc to vary according to the marking of the net.
Marking-dependent arc multiplicities allow simpler and more compact PNs than
would otherwise be possible in many situations. When exhaustive state-space
exploration techniques are employed, their use can dramatically reduce the state space.
PNs lack the "concept of time" and "probability". Modeling power can be
increased by associating random firing times with either the places or the
transitions. When waiting times are associated with places, a token arriving into a
place enables a transition only after the place's waiting time has elapsed. When
waiting times are associated with transitions, an enabled transition fires only after
the waiting time has elapsed; this time is also referred to as the firing time.
Stochastic Petri net (SPN) models increase modeling power by associating
exponentially distributed random firing times with the transitions [11]. A
transition's firing time represents the amount of time required by the activity
associated with the transition. It is counted from the instant the transition is
enabled to the instant it actually fires, assuming that no other transition firing
affects it.
An SPN example is taken from Molloy [11]. The SPN model is shown in
Fig. 3.2. To illustrate the derivation of the CTMC (continuous-time Markov chain)
from this SPN, first all possible markings are enumerated. All possible markings
of the SPN of Fig. 3.2 are shown in Table 3.1.
Each marking corresponds to a state of the Markov chain. Possible transitions
from each state correspond to transitions of the Markov chain. The equivalent
Markov chain is shown in Fig. 3.3.
(Fig. 3.2 shows the SPN with places p2–p5 and transitions t1–t5; Fig. 3.3 shows the equivalent Markov chain over the markings M1–M5.)
Like basic PN models, SPN models can have more than one transition enabled
at a time. To specify which transition will fire among all of those enabled in a
marking, an 'execution policy' has to be specified. Two alternatives are the 'race
policy' and the 'pre-selection policy'. Under the race policy, the transition whose
firing time elapses first is the one that fires. Under the pre-selection policy, the
next transition to fire in a given marking is chosen from among the enabled
transitions using a probability distribution independent of their firing times. SPN
models use the race policy.
Generalized stochastic Petri nets (GSPNs), proposed by Marsan et al. [16], are an
extension of stochastic Petri nets obtained by allowing the transitions of the
underlying PN to be immediate as well as timed. Immediate transitions (drawn as
thin black bars) are assumed to fire in zero time once enabled. Timed transitions
(represented by rectangular boxes or thick bars) are associated with firing times
just as in SPNs.
When both immediate and timed transitions are enabled in a marking, only the
immediate transitions can fire; the timed transitions behave as if they were not
enabled. When a marking m enables more than one immediate transition, it is
necessary to specify a probability mass function according to which the selection
of the first transition to fire is made. The markings of a GSPN can be classified into
'vanishing' markings, in which at least one immediate transition is enabled, and
'tangible' markings, in which no immediate transitions are enabled. The
reachability graph of a GSPN can be converted into a CTMC by eliminating the
vanishing markings, and the CTMC can then be solved using known methods.
Stochastic Reward Nets (SRN) introduce a stochastic extension into SPNs con-
sisting of the possibility to associate reward rates with the markings. The reward
52 3 Stochastic Processes and Models
rate definitions are specified at the net level as a function of net primitives like the
number of tokens in a place or the rate of a transition. The underlying Markov
model is then transformed into a Markov reward model thus permitting evaluation
of not only performance and availability but also a combination of the two.
A stochastic reward net (SRN) is an extension of a stochastic Petri net (SPN). A
rigorous mathematical description of stochastic reward nets can be found in
Muppala et al. [15].
Petri nets in their original definition suffer from the problem of state-space
explosion. So, over time, various features such as guards, priority relationships,
and inhibitor arcs have been added to PNs to provide a concise description of a
given system.
Associating exponentially distributed firing times with the transitions of the PN
results in a stochastic Petri net (Molloy [11]). Allowing transitions to have either
zero firing times (immediate transitions) or exponentially distributed firing times
(timed transitions) gives rise to the generalized stochastic Petri net (GSPN)
(Ajmone Marsan et al. [17, 18]), as already seen.
By associating reward rates with the markings of the SPN, an SRN is obtained. An
SRN can be automatically converted into a Markov reward model, thus permitting
the evaluation of not only performance and availability but also their combination.
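As a sketch of the idea (the probabilities and rewards below are invented), once the steady-state probabilities of the underlying chain are known, the expected reward rate is simply the probability-weighted sum of the per-marking rewards:

```python
# Hypothetical steady-state probabilities of three markings, and the
# reward rate assigned to each marking (1 = system up, 0 = system down).
pi = [0.25, 0.50, 0.25]
reward = [1.0, 1.0, 0.0]   # the third marking is a "down" marking

# Expected steady-state reward rate: E[R] = sum_i reward_i * pi_i
expected_reward = sum(r * p for r, p in zip(reward, pi))
```

With a 0/1 reward this is exactly the steady-state availability; other reward choices yield throughput, capacity, or combined performability measures.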
Putting all this together, an SRN can formally be defined as follows:
SRN: A marked SRN is a tuple A = (P, T, D^I, D^O, D^H, G, >, λ, PS, M0, r)
[19] where:
P = {p1, p2, ..., pN} is a finite set of places
T = {t1, t2, ..., tM} is a finite set of transitions
∀pi ∈ P, ∀tj ∈ T, D^I_ij : ℕ^N → ℕ is the marking-dependent multiplicity of the
input arc from place pi to transition tj; if the multiplicity is zero, the input arc is
absent
∀pi ∈ P, ∀tj ∈ T, D^O_ij : ℕ^N → ℕ is the marking-dependent multiplicity of the
output arc from transition tj to place pi; if the multiplicity is zero, the output arc is
absent
∀pi ∈ P, ∀tj ∈ T, D^H_ij : ℕ^N → ℕ is the marking-dependent multiplicity of the
inhibitor arc from place pi to transition tj; if the multiplicity is zero, the inhibitor
arc is absent
∀tj ∈ T, Gj : ℕ^N → {0, 1} is the marking-dependent guard of transition tj
> is a transitive and irreflexive relation imposing a priority among transitions.
In a marking Mj, t1 is enabled iff it satisfies its input and inhibitor conditions, its
guard evaluates to 1, and no other transition t2 exists such that t2 > t1 and t2
satisfies all other conditions for enabling
∀tj ∈ T such that tj is a timed transition, λj : ℕ^N → ℝ+ is the marking-dependent
firing rate of transition tj, and λ = [λj]
∀tj ∈ T such that tj is an immediate transition, PS_tj : ℕ^N → [0, 1] is the marking-
dependent firing probability of transition tj, given that the transition is enabled
M0 ∈ ℕ^N is the initial marking
r is the reward rate function associating a reward rate with each marking
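The enabling conditions in this definition can be sketched in code; the following simplified illustration checks only the input, inhibitor and guard conditions (priorities are omitted, and all data structures and names are hypothetical):

```python
def enabled(marking, t, D_in, D_h, guard):
    """Check whether transition t is enabled in `marking`.
    D_in[p][t]: input-arc multiplicity from place p to t (0 = arc absent)
    D_h[p][t]:  inhibitor-arc multiplicity from p to t (0 = arc absent)
    guard:      marking-dependent predicate for t
    """
    for p, tokens in marking.items():
        need = D_in[p][t]
        if need and tokens < need:           # input condition violated
            return False
        inhibit = D_h[p][t]
        if inhibit and tokens >= inhibit:    # inhibitor condition violated
            return False
    return guard(marking)                    # guard must evaluate to true

marking = {"p1": 2, "p2": 0}
D_in = {"p1": {"t1": 1}, "p2": {"t1": 0}}
D_h  = {"p1": {"t1": 0}, "p2": {"t1": 1}}   # t1 inhibited if p2 is nonempty
guard = lambda m: True
ok = enabled(marking, "t1", D_in, D_h, guard)
```

Here t1 is enabled because p1 carries enough tokens, p2 is below the inhibitor threshold, and the guard holds; putting a token in p2 disables it.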
3.6 Higher Level Modeling Formalisms 53
Stochastic Petri nets (SPNs) are well suited for model-based performance and
dependability evaluation. Most commonly, the firing times of the transitions are
exponentially distributed, leading to an underlying CTMC (continuous-time Markov
chain). To increase modeling power, several classes of non-Markovian SPNs have
been defined, in which transitions may fire after a non-exponentially distributed
firing time.
A particular case of non-Markovian SPNs is the class of deterministic and
stochastic Petri nets (DSPNs) [19], which allows transitions with deterministic
firing times alongside transitions with exponentially distributed firing times.
DSPNs are commonly analyzed under the restriction that at any time at most one
deterministic transition may be enabled. When this condition is met, it has been
shown that the marking process corresponds to a Markov regenerative process
[6, 7]. Being non-Markovian, the popular analysis methods for solving DSPNs are
based on supplementary variables and the embedded Markov chain [1]. Stationary
analysis methods for DSPNs under this condition are presented in [20], and
transient analysis is addressed in [6].
Queueing networks are a widely used performance analysis technique for systems
that can be naturally represented as networks of queues. Systems that have been
successfully modeled with queueing networks include computer systems,
communication networks, and flexible manufacturing systems [4].
A queueing system consists of three types of components:
1. Service centers: a service center consists of one or more queues and one or
more servers. The servers represent the resources of the system available to
service customers. An arriving customer is served immediately if a free server
can be allocated to it or if a customer in service is preempted. Otherwise, the
customer must wait in one of the queues until a server becomes available.
2. Customers: customers are the entities that demand service from the service
centers and represent the load on the system.
3. Routes: routes are the paths that workloads follow through a network of
service centers. The routing of customers may depend on the state of the
network. If the routing is such that no customers may enter or leave the system,
the system is said to be closed. If customers arrive externally and eventually
depart, the system is said to be open. If some classes of customers are closed
and some are open, the system is said to be mixed.
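As a small worked example, the open M/M/1 queue (Poisson arrivals, one exponential server) shows how these parameters determine performance. The formulas are standard results; the rates are invented:

```python
# M/M/1 queue: Poisson arrivals at rate lam, exponential service at rate mu.
lam, mu = 2.0, 5.0

rho = lam / mu            # server utilization (stability requires rho < 1)
L   = rho / (1 - rho)     # mean number of customers in the system
W   = 1 / (mu - lam)      # mean time a customer spends in the system
```

Little's law, L = λW, ties the two performance measures together and holds for far more general queueing systems than M/M/1.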
To completely specify a queueing network, the following parameters are defined:
A process algebra (PA) is an abstract language which differs from the formalisms
we have considered so far because it is not based on a notion of flow. Instead,
systems are modeled as a collection of cooperating agents or processes which
execute atomic actions. These actions can be carried out independently or can be
synchronized with the actions of other agents.
Since models are typically built from smaller components using a small set of
combinators, process algebras are particularly suited to the modeling of large
systems with hierarchical structure. This support for compositionality is comple-
mented by mechanisms to provide abstraction and compositional reasoning.
Widely known process algebras are Hoare's Communicating Sequential Processes
(CSP) and Milner's Calculus of Communicating Systems (CCS). These algebras do
not include a notion of time, so they can only be used to determine qualitative
correctness properties of the system, such as freedom from races, deadlock and
livelock. Stochastic process algebras (SPAs) additionally allow for quantitative
performance/reliability analysis by associating a random variable, representing
duration, with each action/state. Several tools have been developed for SPAs, such
as PEPA, TIPP, MPA, SPADES and EMPA.
We will describe SPAs using the Markovian SPA PEPA. PEPA models are built
from components which perform activities of the form (α, r), where α is the action
type and r ∈ ℝ+ ∪ {⊤} is the exponentially distributed rate of the action. The special
symbol ⊤ denotes a passive activity that may only take place in synchrony with
another activity whose rate is specified [21].
Interaction between components is expressed using a small set of combinators,
which are briefly described below [21]:
Sequential composition: given a process P, (α, r).P represents a process that
performs an activity of type α, whose duration is exponentially distributed with
mean 1/r, and then evolves into P.
Constant: given a process Q, P = Q means that P is a process which behaves in
exactly the same way as Q.
Selection: given processes P and Q, P + Q represents a process that behaves
either as P or as Q. The current activities of both P and Q are enabled, and a race
condition determines into which component the process will evolve.
Synchronization: given processes P and Q and a set of action types L, P ⋈_L Q
defines the concurrent synchronized execution of P and Q over the cooperation set
L. No synchronization takes place for any activity α ∉ L, so such activities can take
place independently. However, an activity α ∈ L only occurs when both P and Q
are capable of performing it. The rate at which the action occurs is given by the
minimum of the rates at which the two components would have executed the
action in isolation.
Cooperation over the empty set, P ⋈_∅ Q, represents the independent concurrent
execution of processes P and Q and is denoted P||Q.
3.7 Tools
3.7.1 SPNP
Stochastic Petri Net Package (SPNP) [22] is a versatile modeling tool for
performance, dependability and performability analysis of complex systems. Input
models based on the theory of stochastic reward nets are solved by efficient and
numerically stable algorithms. Steady-state, transient, cumulative transient, time-
averaged and up-to-absorption measures can be computed. Parametric sensitivity
analysis of these measures is possible. Some logical analysis capability is also
available, in the form of assertion checking and reporting of the number and types
of markings in the reachability graph. Advanced constructs, such as marking-
dependent arc multiplicities, guards, and arrays of places and transitions, are
available. Modeling complexity can be reduced with these advanced constructs.
The most powerful feature of SPNP is the ability to assign reward rates at the net
level and subsequently compute the desired measures of the system being modeled.
SPNP Version 6.0 has the capability to specify non-Markovian SPNs and fluid
stochastic Petri nets (FSPNs). Such SPNs are solved using discrete-event simulation
rather than by analytic-numeric methods. Several types of simulation methods are
available: standard discrete-event simulation with independent replications or
batches, importance splitting techniques (splitting and RESTART), importance
sampling, regenerative simulation with or without importance sampling, and
thinning with independent replications, batches or importance sampling.
3.7 Tools 57
3.7.2 TimeNET
TimeNET [23, 24] is a graphical and interactive toolkit for modeling with stochastic
Petri nets. TimeNET has been developed at the Institut für Technische Informatik
of the Technische Universität Berlin, Germany. It provides a unified framework
for modeling and performance evaluation of non-Markovian stochastic Petri nets.
It uses a refined numerical solution algorithm for steady-state evaluation of DSPNs
with only one deterministic transition enabled in any marking. Expolynomially
distributed firing times are allowed for transitions. Different solution algorithms
can be used, depending on the net class. If the transitions with non-exponentially
distributed firing times are mutually exclusive, TimeNET can compute the steady-
state solution. DSPNs with more than one deterministic transition enabled in a
marking are called concurrent DSPNs; TimeNET provides an approximate analysis
technique for this class. If the mentioned restrictions are violated or the
reachability graph of a model is too complex, an efficient simulation component is
available. A master/slave concept with parallel replications, and techniques for
monitoring the statistical accuracy as well as for reducing the simulation length in
the case of rare events, are applied. Analysis, approximation and simulation can be
applied to the same model classes. For more details refer to the TimeNET user
manual [24].
3.8 Summary
Real-life systems are usually complex; to model them, a family of random
variables is required. Such a family of random variables is termed a stochastic
process, and stochastic processes have some unique characteristics. Working at the
lower level, i.e. with individual state transitions, becomes unmanageable for a
complex problem, so higher-level modeling formalisms are required. This is the
theme of this chapter.
References
1. Cox DR, Miller HD (1970) The theory of stochastic processes. Methuen, London
2. Trivedi KS (1982) Probability and statistics with reliability, queueing, and computer science
applications. Wiley, New York
3. Xinyu Z. (1999) Dependability modeling of computer systems and networks. Ph.D. thesis,
Department of Electrical and Computer Engineering, Duke University
4. IEC 60880-2.0: Nuclear power plants—instrumentation and control systems important to
safety—software aspects for computer-based systems performing category a functions, 2006
5. Çinlar E (1975) Introduction to stochastic processes. Prentice-Hall, Englewood Cliffs
6. Choi H, Kulkarni VG, Trivedi KS (1993) Transient analysis of deterministic and stochastic
petri nets. In: Proceedings of the 14th international conference on application and theory of
petri nets, pp 166–185
7. Choi H, Kulkarni VG, Trivedi KS (1994) Markov regenerative stochastic petri nets. Perform
Eval 20:337–357
8. Meyer JF (1980) On evaluating the performability of degradable computing systems. IEEE
Trans Comp C 29(8):720–731
9. Meyer JF (1982) Closed-form solutions of performability. IEEE Trans Comp C 31(7):648–657
10. Puliafito A, Telek M, Trivedi KS (1997) The evolution of stochastic petri nets. In:
Proceedings of World Congress of Systems and Simulation, WCSS 97:97
11. Molloy MK (1982) Performance analysis using stochastic petri nets. IEEE Trans Comp C
31(9):913–917
12. Murata T (1989) Petri nets: properties, analysis and applications. Proceedings IEEE
77(4):541–580
13. Peterson JL (1977) Petri nets. ACM Comput Surv 9(3)
14. Peterson JL (1981) Petri net theory and modeling of systems. PHI, Englewood Cliffs
15. Muppala JK, Ciardo G, Trivedi KS (1994) Stochastic reward nets for reliability prediction.
Commun Reliab Maintainabil Serviceabil 1(2):9–20
16. Marsan MA, Balbo G, Conte G (1984) A class of generalized stochastic Petri nets for the
performance evaluation of multiprocessor systems. ACM Trans Comp Syst 93:93–122
17. Ajmone Marsan M, Balbo G, Conte G (1984) A class of generalized stochastic petri nets for
the performance evaluation of multiprocessor systems. ACM Trans Comp Syst 2(2):93–122
18. Marsan MA, Balbo G, Bobbio A, Chiola G, Conte G, Cumani A (1989) The effect of
execution policies on the semantics and analysis of stochastic petri nets. IEEE Trans Softw
Eng 15(7):832–846
19. Bukowski JV (2001) Modeling and analyzing the effects of periodic inspection on the
performance of safety-critical systems. IEEE Trans Reliabil 50(3):321–329
20. Marsan MA, Chiola G (1987) On petri nets with deterministic and exponentially distributed
firing times. In: Advances in Petri Nets 1986, Lecture Notes in Computer Science 266,
pp 132–145
21. Diaz JL, Lopez JM, Gracia DF (2002) Probabilistic analysis of the response time in a real
time system. In: Proceedings of the 1st CARTS workshop on advanced real-time
technologies, October
22. Trivedi KS (2001) SPNP user’s manual. version 6.0. Technical report
23. Zimmermann A (2001) TimeNET 3.0 user manual
24. Zimmermann A, Knoke M (2007) TimeNET 4.0 user manual. Technical report, August
Chapter 4
Dependability Models of Computer-Based
Systems
4.1 Introduction
Combinatorial models are a class of reliability models that represent system failure
behavior in terms of combinations of component failures [4, 5]. Because of their
concise representation of system failure, combinatorial models have long been
used for reliability analysis. Reliability block diagrams, fault trees and reliability
graphs are three major types of combinatorial models. A brief description of these
is given below.
A reliability graph model consists of a set of nodes and edges, where the edges
represent components that can fail or the structural relationship between the
components. A node with no incoming edges is called a source, while a node with
no outgoing edges is termed a sink. Reliability graphs are similar to RBDs and
come under the broad category of structure-oriented models. Reliability graphs are
best suited for complex systems where the reliability-wise relationship among
blocks is more complicated than series-parallel. Graph-theoretic methods, such as
cut-set, path-set and BDD methods [4], can be used for their solution.
A fault tree is a graphical representation of the combination of events that can cause
the occurrence of an overall undesired event, e.g. system failure in the case of
reliability modeling. RBDs and RGs can only model hardware failures, while a fault
tree can model hardware failures as well as failures on account of software faults,
human errors, operation and maintenance errors, and environmental influences on
the system. A fault tree identifies relationships between an undesired system event
and the subsystem failure events that may contribute to its occurrence. Fault tree
development employs a top-down approach, descending from the system level to
more detailed subsystem and component levels. It is well suited to evaluating the
reliability considerations at each stage of the system design.
Fault trees support both qualitative and quantitative system reliability analysis.
Environmental and other external influences can easily be considered in fault tree
analysis, and it provides a visual and graphical aid to the analyst.
Combinatorial models are the simplest and most widely used methods for reliability
modeling. These methods are not suitable for modeling systems whose failure
depends upon the sequence in which failures occur. Combinatorial models give a
point estimate of reliability, i.e. for a given scenario, while a system may degrade
with time, and components may fail and be repaired and restored; the repair may
be perfect or imperfect.
To include failure sequences, fault trees are extended to dynamic fault trees.
Markov models are suitable for modeling failure sequences, degradation with time,
failure and repair, and reliability as a function of time. Brief descriptions of these
two are given below:
For example, in a standby sparing configuration, if the switch unit fails before the
active component fails, then the standby unit cannot be switched into active
operation and the system fails when the active component fails. Thus, the failure
criterion depends not only on the combination of events, but also on their sequence.
Dynamic fault trees eliminate this limitation of fault trees by incorporating, among
others, functional dependency and priority gates.
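The behavior of a priority gate can be sketched as follows; this illustrative priority-AND (PAND) function (names and event times invented) produces a failure only when its inputs fail in left-to-right order:

```python
def pand(failure_times):
    """Priority-AND gate: the output fails only if all inputs fail and
    the failures occur in the given (left-to-right) order.
    Returns the gate failure time, or None if the gate never fails."""
    if any(t is None for t in failure_times):
        return None                       # some input never failed
    if all(a <= b for a, b in zip(failure_times, failure_times[1:])):
        return failure_times[-1]          # gate fails with the last input
    return None                           # failures out of order

# Switch fails at t=3 before the active unit at t=7: system failure at t=7.
seq_fail = pand([3.0, 7.0])
# Active unit fails first: the switchover already succeeded, gate does not fire.
no_fail = pand([7.0, 3.0])
```

This captures exactly the standby example above: the same two component failures lead to system failure in one order but not in the other.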
Markov chains are widely used for modeling and analyzing problems of a
stochastic nature. A stochastic process is a Markov process if its future evolution
depends on its current state only; that is, the next state of the process is
independent of the history of the process. Equation 4.1 describes a Markov
process [7, 8].
Pr{X(t_n) = j | X(t_{n-1}) = i_{n-1}, ..., X(t_0) = i_0} = Pr{X(t_n) = j | X(t_{n-1}) = i_{n-1}}    (4.1)
Whether a particular system leads to a Markov process depends on how the
random variables specifying the stochastic process are defined. For example,
consider a component, such as an IC, which may fail. Let the component be checked
periodically and classified as being in one of three states: (i) satisfactory, (ii)
unsatisfactory and (iii) failed. Let these three states be termed states 0, 1 and 2,
respectively. The process is depicted in Fig. 4.2. The transition probabilities for
this example are given as:
           0    1    2
     0  [ p00  p01  p02 ]
P =  1  [  0   p11  p12 ]          (4.2)
     2  [  0    0    1  ]
In (4.2), rows correspond to the initial state while columns correspond to the final
state. The transition matrix element p00 gives the probability of remaining in state
0, while p01 gives the probability of a transition from state 0 to state 1. Using this
transition matrix, the next-state probabilities can be estimated using the following
relation:

p_{n+1} = p_n P    (4.3)
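Relation (4.3) can be applied numerically; in this sketch the entries of matrix (4.2) are given invented values:

```python
import numpy as np

# Transition matrix of the three-state component (state 2 is absorbing);
# the numeric probabilities are invented for illustration.
P = np.array([[0.90, 0.08, 0.02],
              [0.00, 0.85, 0.15],
              [0.00, 0.00, 1.00]])

p = np.array([1.0, 0.0, 0.0])   # start in state 0 (satisfactory)
for _ in range(10):             # apply p_{n+1} = p_n P ten times
    p = p @ P
```

Because state 0 is only reachable from itself, its probability after n steps is simply 0.9^n, and the probability mass gradually accumulates in the absorbing failed state.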
[Fig. 4.2: State-transition diagram of the three-state component, with self-loop probabilities p00 and p11, transitions p01, p02 and p12, and absorbing state 2]
From this modified problem definition, it is evident that the system has memory in
state 1, and by definition it cannot be modeled directly as a Markov process. The
new process is described as:
prob(p_{n+1} = 1 | p_n = 1, p_{n-1} = 0) = 1
prob(p_{n+1} = 1 | p_n = 1, p_{n-1} = 1) = 0    (4.4)
But a simple extension of the state space converts the problem into a Markov
process. The extension involves dividing the original state 1 into two states,
(1,0) and (1,1), where (1,0) is the state corresponding to p_n = 1, p_{n-1} = 0 and
(1,1) the state corresponding to p_n = 1, p_{n-1} = 1. The new process with four
states has the transition probability matrix:
               0   (1,0) (1,1)   2
       0   [ p00   p01    0    p02 ]
P = (1,0)  [  0     0     1     0  ]          (4.5)
    (1,1)  [  0     0     0     1  ]
       2   [  0     0     0     1  ]
The addition of these new states enables the problem to be modeled as a Markov
process.
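The effect of the state split can be checked numerically with invented values for p00, p01 and p02; the expanded matrix (4.5) is then an ordinary stochastic matrix with state 2 absorbing:

```python
import numpy as np

p00, p01, p02 = 0.7, 0.2, 0.1   # invented values; each row must sum to 1

# States in order: 0, (1,0), (1,1), 2 -- matrix (4.5)
P = np.array([[p00, p01, 0.0, p02],
              [0.0, 0.0, 1.0, 0.0],   # (1,0) -> (1,1) with certainty
              [0.0, 0.0, 0.0, 1.0],   # (1,1) -> 2 with certainty
              [0.0, 0.0, 0.0, 1.0]])  # state 2 is absorbing

p = np.array([1.0, 0.0, 0.0, 0.0])    # start in state 0
for _ in range(200):
    p = p @ P                          # memoryless propagation now valid
```

Because the split states encode the one-step history, ordinary memoryless propagation reproduces the "at most one step in state 1 then fail" behavior, and essentially all probability ends up in the absorbing state.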
A Markov process is characterized by its states and transitions. The time to
transition and the states may each be discrete or continuous, independently of each
other, so Markov processes can be divided into four classes. A Markov process with
discrete states and continuous time to transition is termed a continuous-time
Markov chain (CTMC), while a process with discrete states and discrete time is
termed a discrete-time Markov chain (DTMC). The CTMC is well suited for
reliability analysis of electronic equipment, as its failure process is characterized
as Markovian. In reliability analysis, the states of the CTMC depict the states of the
system and the transitions depict the failure rates. CTMCs are capable of taking
redundancy and repair activity into account.
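For example, a single repairable component with failure rate λ and repair rate μ yields a two-state CTMC whose instantaneous availability has the well-known closed form A(t) = μ/(λ+μ) + (λ/(λ+μ))·e^(−(λ+μ)t). A small sketch with invented rates:

```python
import math

lam, mu = 1e-3, 1e-1    # failure and repair rates (per hour), invented

def availability(t):
    """Instantaneous availability of a two-state (up/down) CTMC that
    starts in the up state at t = 0."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)

a0    = availability(0.0)    # A(0) = 1: the system starts up
a_inf = availability(1e6)    # converges to the steady-state value mu/(lam+mu)
```

The exponential term dies out at rate λ+μ, so the transient availability settles quickly to the steady-state value μ/(λ+μ).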
4.3 Reliability Models 65
[Figure: State-transition diagram of the expanded four-state process, with states 0, (1,0), (1,1) and 2, self-loop p00, transitions p01 and p02, and certain transitions (1,0) → (1,1) → 2]
1. Failure-rate models: these are based on modeling the software failure intensity
from software test data.
(a) Jelinski–Moranda model
(b) Schick–Wolverton model
(c) Jelinski–Moranda geometric model
(d) Goel–Okumoto debugging model [12]
2. NHPP software reliability models: these assume that faults are dormant and
that the times to uncover them follow a non-homogeneous Poisson process.
3. State-based models: these models use the control flow graph of the software to
represent the architecture of the system, which can be modeled as a DTMC,
CTMC or SMP (semi-Markov process).
(a) Littlewood model [13]
(b) Cheung model [14]
(c) Laprie model [15]
Availability is a measure for systems that are subject to failure and repair.
Availability refers to the fraction of time the system spends in the UP state.
Mathematically, it is described as:

A(t) = availability up to time t = (total time spent in UP state) / (total time t)    (4.6)
From (4.6), it is evident that once the system history (time spent in the UP and DN
states) is available, availability can be determined:

A(t) = tUP / (tUP + tDN)    (4.7)
Figure 4.4 gives a typical trace of system state with time. From this, the statistical
mean values of tUP and tDN can be determined. Viewed this way, availability
appears to be an a posteriori measure of system dependability. An estimate of
being in the UP state for a given duration can give an availability estimate a priori.
For computer-based systems, Markov models are widely used to estimate the time
a system spends in the UP state. For a repairable system, MTBF and MTTR can be
found from the system history or estimated using a model, in which case
availability is given as:

A = MTBF / (MTBF + MTTR)    (4.8)
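A quick numerical sketch of (4.8), with invented figures:

```python
mtbf = 10_000.0   # mean time between failures, hours (invented)
mttr = 8.0        # mean time to repair, hours (invented)

A = mtbf / (mtbf + mttr)              # steady-state availability, eq. (4.8)
downtime_per_year = (1 - A) * 8760    # expected hours of downtime per year
```

Even a high availability of roughly 0.9992 still corresponds to about seven hours of expected downtime per year, which is why availability figures are often translated into downtime when requirements are stated.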
4.4 Availability Models 67
[Fig. 4.4: A typical trace of system state (UP/DN) versus time, with transition epochs t1, t2, ..., t8 starting from t = 0]
Safety-critical systems are used for automatic shutdown of the EUC (equipment
under control) whenever the equipment or plant parameters go beyond acceptable
limits for more than an acceptable time. These kinds of systems are used in a
variety of industries, such as oil refining, nuclear power, and chemical and
pharmaceutical manufacturing. When the safety system is functioning correctly, it
permits the EUC to continue operating provided its parameters remain within safe
limits. If the parameters move outside the acceptable operating range for a
specified time, the safety system automatically shuts down the EUC in a safe
manner.
Safety systems generally have some redundancy and can tolerate some failures
while continuing to operate successfully. As discussed in [24–27], a system's
independent channels can fail, leading the system to the following states:
1. Safe failure (SF) state, where the system erroneously commands the shutdown
of properly operating equipment. Taking a channel off-line or shutting down a
channel is also referred to as a safe failure.
2. Fail dangerous detected (DD) state, where channel(s) have failed in a
dangerous mode, but the failure is detected by internal diagnostics and announced.
3. Fail dangerous undetected (DU) state, where channel(s) have failed in a
dangerous mode and the failure is not detected by internal diagnostics, hence not
announced.
The safety system can fail in two distinct ways [24–28]:
1. Safe failure (FS), a failure which does not have the potential to put the safety
system in a hazardous or fail-to-function state [24]. This occurs when more
than the tolerable number of channels are in safe failure. This type of failure is
referred to in a variety of ways, including fail safe [25, 28], false trip and false
alarm.
2. Dangerous failure (DF), a failure which has the potential to put the safety
system in a hazardous or fail-to-function state [24]. More than the tolerable
number of channels in the DD and/or DU states lead to this failure. The system
fails in such a way that it is unable to shut down the EUC properly when
shutdown is required (or demanded).
Dangerous failures are important from the safety point of view. A survey of recent
research related to safety quantification indicates that there are diverse safety
indices, methods and assumptions about safety systems. Safety indices used are
PFD (probability of failure on demand) [24, 26, 27, 29–33], MTTFD (mean time to
dangerous failure) [25, 28], MTTFsys (mean time to system failure) [34], MTTUF
(mean time to unsafe failure) and SSS (steady-state safety) [35], and MTTHE (mean
time to hazardous event) [36]. Simplified equations [24, 26, 29, 32, 33], Markov
models [25, 27, 28, 30, 31, 34–37] and fault trees [33] are the methods used for
safety quantification. The safety indices of [35, 36] consider only repair. Bukowski
[25] considers repair as well as periodic inspection to uncover undetected faults.
Refs. [24, 26, 27, 29, 32, 33] consider common cause failures (CCF) and periodic
inspection along with repair, and [37] considers the demand rate. Ref. [32] discusses
the CCF model (β factor) of [24] and suggests a generalization, the multiple beta
factor (MBF) model.
The β-factor model is a very simple method to model common cause failures
(CCF). The problem with the β-factor model is that it makes no distinction between
different voting logics. To overcome this problem, different β's, based on
heuristics, are used for different voting logics.
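The basic β-factor split can be sketched as follows (the rate and β values are invented): a fraction β of each channel's failure rate is attributed to a common cause that affects all channels simultaneously:

```python
lam_du = 2e-6   # dangerous undetected failure rate per channel (invented)
beta   = 0.1    # fraction of failures assumed to have a common cause

lam_ccf = beta * lam_du         # common-cause part: fails all channels at once
lam_ind = (1 - beta) * lam_du   # independent part: applies per channel
```

The two parts always sum back to the original channel rate; the model's weakness, as noted above, is that the same β is applied regardless of the voting logic of the redundant group.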
4.5.1.2 Multiple Beta Factor (MBF) Model for Common Cause Failures (CCF)
The safety index PFD [24] has already been published as part of a standard. As per
IEC 61508 [24], a typical trace of system states is given in Fig. 4.6. PFD is the
mean probability of being in the DD or DU state. In the figure, times marked tDi
denote the occurrence of the ith detected dangerous failure, tRj the completion of
the jth repair, tUk the occurrence of the kth undetected dangerous failure, and tpl
the time of the lth proof-test.
IEC 61508 [24] gives simplified equations for safety evaluation. Since the
inception of IEC 61508 [24], its concepts and methods for loss of safety have been
made clearer and substantiated by Markov models. A review of different techniques
by Rouvroye [38] suggests that Markov analysis covers most aspects of quantitative
safety evaluation. Bukowski [30] also compares various techniques for PFD
evaluation and defends Markov models. Zhang [27] provides a Markov model for
PFD evaluation without considering the demand rate or modeling imperfect
proof-tests.
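As an illustration of the simplified-equation approach, the widely quoted single-channel (1oo1) approximation estimates the average PFD from the dangerous undetected failure rate λDU and the proof-test interval τ as PFDavg ≈ λDU·τ/2 (the values below are invented):

```python
lam_du = 2e-6     # dangerous undetected failure rate, per hour (invented)
tau    = 8760.0   # proof-test interval: one year, in hours

# 1oo1 simplified approximation: the system, on average, sits half a
# proof-test interval in an undetected dangerous state after a DU failure.
pfd_avg = lam_du * tau / 2
```

Halving the proof-test interval halves this estimate, which is why the proof-test period is one of the main design levers for meeting a target safety integrity level.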
Demand refers to a condition when the safety system must shut down the EUC. The
condition arises when EUC parameters move outside the acceptable operating
range for a specified time. A trace of system states considering demand is shown in
Fig. 4.7. Downward arrows at the Demand Incidence line show the time epochs of
demand arrival. Marked epochs at the DEUC line denote successful action taken
by the safety system on demand arrival, while the differently marked epoch denotes
unsafe failure of the safety system and damage to the EUC.
Bukowski [37] proposes a Markov-based safety model, similar to PFD,
considering the demand rate. This model does not consider periodic proof-tests.
A detailed comparison of these models with the one proposed here is given in
Sect. 4.5.4.2.
The system model developed here is similar to the model of IEC 61508 [24]. It uses
a Markov model for analysis. This model explicitly considers periodic proof-tests
(perfect or imperfect), the demand rate and safe failures. Incorporation of safe failure
Fig. 4.6 A typical trace of system states as per IEC 61508 [24]. The system may make transitions
to states DD and DU, based on the type of failure. The system can be restored from the DD state
by means of repair, after the failure is detected. The DU state can be detected only during a
proof-test, so a system will remain in the DU state until the proof-test. The safety measure of
IEC 61508 [24], PFD, gives the mean probability of finding the system in state DD or DU
4.5 Safety Models 71
Fig. 4.7 A typical trace of system states considering demands. This trace is the same as Fig. 4.6,
with the addition of demand arrival epochs and the state DEUC. The EUC will be damaged if the
safety system is in the DD or DU state at demand arrival. The safety measure considering demand
is the probability of reaching the state DEUC
enables the modeling of all possible system states and the estimation of additional
measures such as availability (or the probability of being in one or more specified
states) for a specified amount of time.
The system description and the assumptions made to derive the Markov model
are given in the next section.
1. The overall failure rate of a channel is the sum of the dangerous failure rate
and the safe failure rate for that channel. Their values need not be equal. This
is a generalization of the assumption [24, 26] that these two failure rates are
equal in value.
2. At least one repair team is available to work on all known failures. This is a
generalization of the assumption in [24, 26] that one repair facility works on
only one known failure. The availability of a single repair crew in many cases
has been discussed in [30].
3. The fraction of failures specified by the diagnostic coverage is detected; the
corresponding channel is put into a safe state and restored thereafter. This
assumption is contrary to the assumption of IEC [24] for the low-demand case,
which assumes on-line repair; in the high-demand case, IEC [24] assumes the
system achieves a safe state after detecting a dangerous fault. With this
assumption, for 1oo1 and 1oo2 voted groups, the EUC is put into the safe state
on any detected fault.
4. A failure of any kind (SF, DD, DU), once it has occurred in a channel, cannot
change to another type without the channel being restored to the healthy state
[35]. This means that if a channel fails to the SF state, then unless it is repaired
back to the healthy state it cannot have failures of type DD or DU, and vice versa.
5. Proof-tests (inspections or functional tests) are conducted on line. A proof-test
of a healthy channel changes neither the system's state nor the EUC's, while a
channel with undetected faults is put into the safe state following a proof-test.
This is a new assumption, based mainly on the practice followed in the nuclear
industry.
6. Proof-tests are periodic, with negligible duration. The proof-test interval is at
least 3 orders of magnitude greater than the diagnostic test interval. This
assumption modifies the assumption of IEC [24], which puts the limit at 1 order
of magnitude. It is based on the fact that the diagnostic test interval is usually
of the order of tens of seconds or less, while proof-test intervals are not less
than a day.
7. The expected interval between demands is at least 3 orders of magnitude
greater than the diagnostic test interval. IEC [24] defines two different limits
for the low demand and high demand modes of operation. Here the limit for high
demand operation is taken, increased to 3 orders of magnitude. This is based on
the assumption that the expected interval between demands is not less than a day
[37], even in high demand mode.
8. On occurrence of a safe fault, the channel is put into the safe state,
independent of other channels. Hence all safe failures, even in voted groups, are
detectable.
9. The time between demands is assumed to follow an exponential distribution with
the demand rate as parameter, as in Bukowski [37].
10. The time to restart the EUC following a safety action by the safety system on
demand is assumed negligible.
11. Following a safe failure of the safety system, the EUC can be restarted as
soon as a sufficient number of channels of the safety system are operational.
12. The fractions of failures that have a common cause are assumed to be equal
for safe and dangerous undetected failures.
The state-transition diagram of a generic system is given in Fig. 4.8. System
state OK represents the healthy state of all its channels. The state in which
some channels are in the SF or DU state, but a sufficient number of channels are
healthy to take safety action on demand, is denoted by Dr. When more than the
tolerable number of channels are in the SF or DU state, the system goes to FS or
FDU, respectively. A demand for safety action when the system is in FDU leads to
DEUC.
4.5 Safety Models 73
Fig. 4.8 Generic state-transition diagram for a safety system. State OK depicts
the healthy state, Dr the degraded working state, FS the safe failure state, FDU
the unsafe failure state and DEUC the damage-to-EUC state. λi,j denotes the
transition from state i to j due to failure(s), λarr is the arrival rate of
demands, and iμ and jμ are the repair rates from the corresponding states. μP(t)
denotes the time-dependent proof-test event
All transitions in the Markov model of Fig. 4.8, except μP(t), are constant and
independent of time. Exclusion of this transition (i.e. μP(t) = 0 for all t)
transforms the state-transition diagram into a continuous-time Markov chain
(CTMC) [7, 42]. The infinitesimal generator matrix Q of the CTMC of Fig. 4.8 is
given by

    Q = | ΛTT  0 |
        | ΛTA  0 |                                                      (4.9)

The CTMC of Fig. 4.8 is absorbing, so its infinitesimal generator matrix Q is
singular, i.e. |Q| = 0. To analyze such a system analytically, the technique of
Darroch and Seneta [43] of considering only the transient states is used. ΛTT is
the infinitesimal generator matrix of the CTMC considering the transient states
only. All transient states of the CTMC communicate with the absorbing state,
which ensures that ΛTT is regular (i.e. |ΛTT| ≠ 0) [43]. The time-dependent
transient state probabilities are given by solving the following
Chapman–Kolmogorov equation [7, 42]:

    dπ(t)/dt = ΛTT π(t)
                                                                        (4.10)
    π(t) = [ P1(t)  P2(t)  P3(t)  P4(t) ]ᵀ
The solution of (4.10) gives the time-varying transient state probabilities
without periodic proof-tests. Incorporation of periodic proof-tests makes the
model non-Markovian. Marsan [44] proposed a method to analyze such a
non-Markovian system for steady state, which satisfies the following two
conditions:
process; the time instances, in the present context Tproof, 2Tproof, 3Tproof,
..., are called Markov regeneration epochs. State probabilities of the model can
be obtained by sequentially solving the Markov chain between Markov regeneration
epochs and redistributing the state probabilities at the regeneration epochs.
Bukowski [25] uses a similar method to determine MTTF under various conditions
and calls it the piecewise-CTMC method. Here, a Markov regenerative process based
analysis to determine state probabilities with periodic proof-tests has been
employed.
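The regeneration-epoch procedure can be sketched as follows. The two-state model, its rates and the redistribution matrix below are illustrative toys (not the book's example): within each proof-test interval the state vector is evolved with the matrix exponential of the transient generator, and at each epoch the probability mass in the undetected-fault state is redistributed.

```python
# Toy sketch of the piecewise-CTMC (Markov regenerative) iteration:
# evolve pi with e^(Lambda*tau) inside each proof-test interval, then
# redistribute with a matrix D at each regeneration epoch.

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def expm(a, terms=80):
    """e^A by truncated Taylor series (adequate for small, well-scaled A)."""
    n = len(a)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for p in range(1, terms):
        term = [[x / p for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result

lam, lam_arr = 1e-4, 1e-3        # OK -> DU failure rate, demand rate (toy)
tau = 1000.0                     # proof-test interval (h)
L_TT = [[-lam, 0.0], [lam, -lam_arr]]   # transient part; DEUC is absorbing
D = [[1.0, 0.9], [0.0, 0.1]]     # proof-test detects 90% of DU faults

step = expm([[x * tau for x in row] for row in L_TT])
pi = [[1.0], [0.0]]              # start with a healthy system
for _ in range(10):              # ten proof-test intervals
    pi = mat_mul(D, mat_mul(step, pi))   # evolve, then redistribute
pfad = 1.0 - (pi[0][0] + pi[1][0])       # probability absorbed in DEUC
```

Because the piecewise trajectories restart from the redistributed vector at every epoch, only one matrix exponential per interval length is needed, however many intervals are simulated.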
State probabilities for times up to the first regeneration epoch can be obtained
from (4.10). The probabilities just after each proof-test follow by applying the
probability redistribution matrix D, giving

    a = D e^(ΛTT τ)

where τ is the proof-test interval.
Let the system operate continuously for a time duration T; then the mean state
probabilities for this duration can be computed as:

    E[π(t)] = π̄ = ∫₀ᵀ π(t) dt / ∫₀ᵀ dt                                  (4.18)

    π̄ = (1/T) ( Σ_{j=1}^{n} ∫_{(j−1)τ}^{jτ} π(t) dt + ∫_{nτ}^{nτ+S} π(t) dt )   (4.19)

    π̄ = (1/T) ΛTT⁻¹ [ (e^(ΛTT τ) − I)(I − a)⁻¹(I − aⁿ)
                       + (e^(ΛTT S) − I) aⁿ ] π(0)                       (4.20)

where T = nτ + S and a = D e^(ΛTT τ).
Equation 4.20 gives the closed form solution for state probabilities considering
demand rate and periodic proof-test (perfect as well as imperfect).
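The closed form (4.20) can be checked numerically on a small illustrative model (all rates, the redistribution matrix and the interval counts below are toy values, not the book's parameters): the closed-form mean is compared against a direct Riemann-sum average of the piecewise trajectories.

```python
# Cross-check of the closed-form mean (4.20) on a 2x2 toy generator.

def mm(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def madd(a, b, s=1.0):
    return [[a[i][j] + s * b[i][j] for j in range(2)] for i in range(2)]

def inv2(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]

def expm(a, terms=60):
    r = [[1.0, 0.0], [0.0, 1.0]]
    t = [row[:] for row in r]
    for p in range(1, terms):
        t = [[x / p for x in row] for row in mm(t, a)]
        r = madd(r, t)
    return r

def mv(a, v):
    return [sum(a[i][k] * v[k] for k in range(2)) for i in range(2)]

def scale(a, s):
    return [[x * s for x in row] for row in a]

I2 = [[1.0, 0.0], [0.0, 1.0]]
L = [[-2e-4, 0.0], [2e-4, -1e-3]]    # toy transient generator
D = [[1.0, 0.9], [0.0, 0.1]]         # toy proof-test redistribution
tau, n, S = 500.0, 8, 200.0
T = n * tau + S

eLtau, eLS = expm(scale(L, tau)), expm(scale(L, S))
a = mm(D, eLtau)
an = [row[:] for row in I2]
for _ in range(n):
    an = mm(an, a)
inner = madd(mm(mm(madd(eLtau, I2, -1.0), inv2(madd(I2, a, -1.0))),
                madd(I2, an, -1.0)),
             mm(madd(eLS, I2, -1.0), an))
pi0 = [1.0, 0.0]
pi_bar = [x / T for x in mv(mm(inv2(L), inner), pi0)]   # closed form (4.20)

# Brute-force average over a fine grid, applying D at each epoch j*tau.
dt = 0.5
eLdt = expm(scale(L, dt))
steps, per_test = int(round(T / dt)), int(round(tau / dt))
pi, acc = pi0[:], [0.0, 0.0]
for k in range(1, steps + 1):
    acc = [acc[i] + pi[i] * dt for i in range(2)]
    pi = mv(eLdt, pi)
    if k % per_test == 0 and k * dt <= n * tau:
        pi = mv(D, pi)
avg = [x / T for x in acc]
```

The two estimates agree to within the Riemann-sum discretization error, which supports the term-by-term structure of (4.20): one geometric sum over the n full intervals plus one remainder term of length S.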
The sum of the transient state probabilities gives the probability that the
system has not reached the DEUC state, so

    PFaD(t) = 1 − 1ᵀ π(t)

where 1 is a vector of 1s equal in size to π. The Markov model being absorbing,
π(t) will decrease to 0 with increasing time. So PFaD(t), like a failure
distribution (the complement of reliability) [7], is a non-decreasing function
of time.
Table 4.2 Comparison of results, PFaD(t) and PFDPRS(t) [37]. The first column
shows MTBD (mean time between demands), the second column the value of the
safety index PFDPRS(t), the third and fourth columns PFaD(t) and its mean value,
and the fifth column the % relative difference between PFDPRS(t) and PFaD(t)

    MTBD        PFDPRS(t) [37]   PFaD(t)    mean PFaD   % difference
    1/day       0.0015           0.000995   0.000495    33.688667
    1/week      0.0014           0.000966   0.000467    31.007143
    1/month     0.00085          0.000856   0.000377    -0.6752941
    1/year      0.00022          0.000238   0.000083    -0.3636364
    1/10 years  0.00002          0.000028   0.000009    -0.39965
Let a safety system survive unsafe failures (i.e. DEUC) up to time t; then the
conditional state probabilities are given as:

    π̂i(t) = πi(t) / Σi πi(t) = πi(t) / (1 − PFaD(t))                     (4.23)

With this condition, the probability of not being in FS gives the availability
of the safety system. This availability is termed manifested availability (mAv).
The average mAv value up to time 't' is given as:

    avg.mAv(t) = 1 − Σi π̂i(t),  where i ranges over system states
                                 corresponding to FS                     (4.24)
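Equations (4.23) and (4.24) amount to a renormalization followed by a sum over the FS states; a minimal sketch (the probability vector and the FS index set below are illustrative toy values):

```python
# Conditional probabilities (4.23) and manifested availability (4.24).
pi = [0.92, 0.05, 0.02, 0.006]   # transient states: OK, Dr, FS, FDU (toy)
fs_states = {2}                  # index of the FS (safe shutdown) state

pfad = 1.0 - sum(pi)                             # absorbed (DEUC) probability
pi_hat = [p / (1.0 - pfad) for p in pi]          # conditional probs (4.23)
mav = 1.0 - sum(pi_hat[i] for i in fs_states)    # manifested avail. (4.24)
```

Conditioning on survival makes π̂ a proper distribution over the transient states, so mAv is simply one minus the conditional probability of safe shutdown.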
4.6 Examples
4.6.1 Example 1
Fig. 4.9 Composition of a channel. The number within braces shows the quantity
of such modules in a channel
Table 4.3 Module hazard rate and calculation of channel hazard rate

    S. No.  Module name  Quantity  Module hazard rate  Total hazard rate
    1       8687EURO     1         1.22E-05            1.22E-05
    2       SMM-256      1         4.42E-06            4.42E-06
    3       DIFIT        3         1.07E-05            3.21E-05
    4       RORB         2         3.03E-06            6.05E-06
    5       ADA-12       1         6.73E-06            6.73E-06
    6       DOSC         3         7.11E-06            2.13E-05
    7       WDT          1         2.80E-06            2.80E-06
    8       Backplane    1         0                   0
    Channel hazard rate (λ)                            8.56E-05
state, when a dangerous failure is detected in the channel. Module failure
(hazard) rates are taken from Khobare et al. [46]. Based on the module hazard
rates, the channel hazard rate is estimated assuming that any module failure
leads to channel failure.
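Under that series assumption the channel hazard rate is just the quantity-weighted sum of the module hazard rates, which reproduces the last row of Table 4.3:

```python
# Channel hazard rate as the quantity-weighted sum of module hazard rates
# (series model: any module failure fails the channel). Data from Table 4.3.
modules = {            # name: (quantity, module hazard rate per hour)
    "8687EURO":  (1, 1.22e-05),
    "SMM-256":   (1, 4.42e-06),
    "DIFIT":     (3, 1.07e-05),
    "RORB":      (2, 3.03e-06),
    "ADA-12":    (1, 6.73e-06),
    "DOSC":      (3, 7.11e-06),
    "WDT":       (1, 2.80e-06),
    "Backplane": (1, 0.0),
}
channel_rate = sum(q * r for q, r in modules.values())  # approx. 8.56e-05 /h
```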
The protection system is configured as 2oo3, i.e. TMR (Triple Modular
Redundant). The 2oo3 system configuration is shown in Fig. 4.12b. Three channels
operate independently and open their control switches to shut down the EUC.
Control switches from the individual channels are wired to form a 2oo3 majority
voting logic. This enables the system to tolerate one channel's failure of
either type, safe or unsafe.
The ratio of safe to unsafe failures and the coverage factor are taken from
[47]. Module hazard rate values are given in Table 4.3, diagnostic parameters in
Table 4.4 and the derived parameters required for the model in Table 4.5.
4.6.1.2 Model
The first step is derivation of the Markov model for the system, shown in
Fig. 4.10. States are marked with a 3-tuple (i, j, k): i gives the number of
healthy channels, j the number of channels in the safe failure state and k the
number of channels in the dangerous failure state. Transition rates are in terms
of the safe failure, dangerous failure and repair rates. MATLAB code for the
example is given below:
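The same construction can be sketched in Python (in place of the MATLAB listing; the numeric rates below are placeholders, not the book's parameter values, and CCF and some secondary transitions of the full model are omitted for brevity). It enumerates the 3-channel states of Fig. 4.10 and assembles the generator column by column:

```python
# Illustrative assembly of a Markov generator over (healthy, SF, DU) states
# for a 3-channel system, following the structure of Fig. 4.10 (simplified).
lam1, lam2 = 8.1e-5, 4.5e-6   # placeholder rates: safe+detected, undetected
mu, lam_arr = 0.125, 2.3e-5   # placeholder repair and demand rates (per hour)

states = [(i, j, k) for i in range(4) for j in range(4) for k in range(4)
          if i + j + k == 3]                    # (healthy, SF, DU) channels
idx = {s: n for n, s in enumerate(states)}
N = len(states)
Q = [[0.0] * (N + 1) for _ in range(N + 1)]     # extra row/column: DEUC

def add(src, dst, rate):
    Q[idx.get(dst, N)][idx[src]] += rate        # columns index source states
    Q[idx[src]][idx[src]] -= rate

for (i, j, k) in states:
    if i > 0:
        add((i, j, k), (i - 1, j + 1, k), i * lam1)   # safe/detected failure
        add((i, j, k), (i - 1, j, k + 1), i * lam2)   # undetected failure
    if j > 0:
        add((i, j, k), (i + 1, j - 1, k), mu)         # single repair crew
    if k >= 2:
        add((i, j, k), "DEUC", lam_arr)               # demand while in FDU
```

Every column of the assembled Q sums to zero, the defining property of an infinitesimal generator.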
4.6.1.3 Results

Mean PFaD values for different demand rates and proof-test intervals are plotted
in Fig. 4.11. Suppose the required value of PFaD is 1×10⁻⁴. From the figure it
can be observed that frequent proof-tests are required to meet this target
safety value. At design time the demand rate of a new EUC is not clearly known,
so this plot can be used to choose a proof-test interval which guarantees the
required PFaD value for the maximum anticipated demand.
(Transition rates in Fig. 4.10: λ1 = λS + C·λD, λ2 = (1 − C)·λD; λarr denotes
the demand arrival rate from the FDU states into DEUC.)
Fig. 4.10 Markov model for safety analysis of Example 1. This safety model is
based on the generic safety model shown in Fig. 4.8. State {(3,0,0)} corresponds
to the OK state, states {(2,1,0), (2,0,1), (1,1,1)} correspond to the Dr state
and states {(1,2,0), (0,3,0), (0,2,1)} to the FS state of Fig. 4.8. States
{(1,0,2), (0,1,2), (0,0,3)} correspond to the FDU states. The system may fail
dangerously from any of the FDU states. The transition rate λarr from each of
these states is the same due to the distribution property of transitions, i.e.
(P102 + P012 + P003)·λarr = P102·λarr + P012·λarr + P003·λarr
4.6.2 Example 2
Fig. 4.11 PFaD values with respect to proof-test interval (hours, log scale) at
different demand rates (2/yr, 1/yr, 1/5 yr and 1/10 yr). This plots the
relationship between mean PFaD and proof-test interval for 4 different demand
rates
open at demand, the other channel's switch opens and EUC shutdown is ensured. So
a 1oo2 system can endanger the EUC only if both channels are in the DU state at
demand; 1oo2 with output ORing can thus tolerate only one channel's DU failure.
Both channels can fail to SF or DU due to CCF. The beta-factor model [24, 32]
for CCF is used. The Markov model of the system with one repair station is shown
in Fig. 4.13.
The states of the Markov chain are denoted by a 3-tuple (i, j, k):
• i denotes the number of channels in the OK state
• j denotes the number of channels in the SF state
• k denotes the number of channels in the DU state
A 2oo3 system consists of 3 s-identical channels with a pair of output control
switches from each channel. These control switches are used in implementing the
majority voting logic. The 2oo3 architecture shuts down the EUC when two
channels go to SF; it endangers the EUC when 2 channels are in the dangerous
failure state (DU) at demand. This means 2oo3 can tolerate one channel's safe or
dangerous failure. To incorporate CCF we have used the MBF (multiple beta
factor) model of Hokstad [32]. The advantage of MBF is that it can model a
variety of cases: beta-factor, gamma-factor and the base case. The beta-factor
case allows only simultaneous failure of three channels, the gamma-factor case
allows simultaneous failure of two channels only, while the base case allows a
combination of simultaneous failures of two and three channels. The Markov model
of the system with one repair station is shown in Fig. 4.14.
Fig. 4.12 System architectures of Example 2: (a) 1oo2, with two channels and
their control switches (output ORing), and (b) 2oo3, with three channels (A, B,
C) whose control switches form the 2oo3 voting logic
Table 4.6 gives system parameter values such as the channel hazard rate and
repair rate, along with the proof-test interval or mission time used for the
example architectures. The probability redistribution matrix D for the two
architectures is derived from the discussion in Sect. 4.5.3. These are given
below:
For 1oo2

        | 1  0  0  0    0    0   |
        | 0  1  0  0.9  0    0   |
    D = | 0  0  1  0    0.9  0.9 |                                      (4.25)
        | 0  0  0  0.1  0    0   |
        | 0  0  0  0    0.1  0   |
        | 0  0  0  0    0    0.1 |
(Transition rates in Fig. 4.13: 2β̄λ1 and 2β̄λ2 for independent failures from
(2,0,0), where β̄ = 1 − β, λ1 = λS + DC·λD and λ2 = (1 − DC)·λD; λarr is the
demand arrival rate from (0,0,2) into DEUC.)
Fig. 4.13 Markov model of the 1oo2 system of Example 2, based on the generic
Markov model of Fig. 4.8. State {(2,0,0)} corresponds to OK, {(1,0,1)} to Dr,
{(1,1,0), (0,2,0), (0,1,1)} to FS and {(0,0,2)} to the FDU state of Fig. 4.8. β
represents the fraction of failures due to CCF. λ1 is the hazard rate of safe
failures, while λ2 is the hazard rate of dangerous failures
For 2oo3

        | 1  0  0  0  0    0    0    0    0    0   |
        | 0  1  0  0  0.9  0    0    0    0    0   |
        | 0  0  1  0  0    0.9  0    0.9  0    0   |
        | 0  0  0  1  0    0    0.9  0    0.9  0.9 |
        | 0  0  0  0  0.1  0    0    0    0    0   |
    D = | 0  0  0  0  0    0.1  0    0    0    0   |                    (4.26)
        | 0  0  0  0  0    0    0.1  0    0    0   |
        | 0  0  0  0  0    0    0    0.1  0    0   |
        | 0  0  0  0  0    0    0    0    0.1  0   |
        | 0  0  0  0  0    0    0    0    0    0.1 |
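Applying D at a proof-test epoch moves 90% of the probability mass in the DU states into the corresponding SF states (the degree of proof-test is 0.9, as in the matrices above) while conserving total probability, since every column of D sums to one. A sketch with the 1oo2 matrix of (4.25) and an illustrative (toy) pre-test probability vector:

```python
# Redistribution of state probabilities at a proof-test epoch, using the
# 1oo2 matrix D of (4.25). The pre-test probabilities are toy values.
D = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.9, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.9, 0.9],
    [0.0, 0.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.1],
]
# Pre-test probabilities over {(2,0,0),(1,1,0),(0,2,0),(1,0,1),(0,1,1),(0,0,2)}:
pi = [0.90, 0.02, 0.01, 0.05, 0.01, 0.01]
post = [sum(D[r][c] * pi[c] for c in range(6)) for r in range(6)]
```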
Fig. 4.14 Markov model of the 2oo3 system of Example 2, based on the generic
Markov model of Fig. 4.8. State {(3,0,0)} corresponds to OK, {(2,1,0), (2,0,1),
(1,1,1)} to Dr, {(1,2,0), (0,3,0), (0,2,1)} to FS and {(1,0,2), (0,1,2),
(0,0,3)} to the FDU states of Fig. 4.8. β and β2 represent the fractions of
failures due to CCF and α is the fraction of individual failures. λ1 is the
hazard rate of safe failures, while λ2 is the hazard rate of dangerous failures
Average PFaD and availability values for the specified period are evaluated with
the parameter values given in Table 4.6, for proof-test intervals in [100 h,
12,000 h] at increments of 100 h. The plots of average PFaD and mAv for the
1oo2 and 2oo3 architectures are given in Figs. 4.15 and 4.16, respectively. A
step-wise detailed calculation is given below for better clarity:
Case-1 1oo2 architecture.
Step 1 The first step is to determine the infinitesimal generator matrix Q,
which is composed of ΛTT and ΛTA. ΛTT is the infinitesimal generator matrix for
the transient states {(2,0,0), (1,1,0), (0,2,0), (1,0,1), (0,1,1), (0,0,2)}, and
ΛTA is the infinitesimal generator matrix for the absorbing state DEUC.

    Q = | ΛTT  0 |
        | ΛTA  0 |
Fig. 4.15 Variation of avg. PFaD w.r.t. Tproof. For both hardware architectures
mean PFaD is evaluated for proof-test intervals [100 h, 12,000 h] at an
increment of 100 h for an operating time of T = 87,600 h, so each point gives
the mean PFaD value for the corresponding proof-test interval τ (Tproof). The
saw-tooth behavior is observed because of 'S' of (4.20): with varying proof-test
interval, 'S' may take values from 0 up to approximately the proof-test
interval, and for proof-test values at which 'S' is zero, mean PFaD is a local
minimum
Fig. 4.16 Variation of mAv w.r.t. proof-test interval (hours) for the 1oo2 and
2oo3 architectures
ΛTT (rows and columns ordered P200, P110, P020, P101, P011, P002) is

    ΛTT =
    | −(β + 2β̄)(λ1 + λ2)   μ               0    0           0    0     |
    | 2β̄λ1                 −(λ1 + λ2 + μ)  μ    0           0    0     |
    | βλ1                  λ1              −μ   0           0    0     |
    | 2β̄λ2                 0               0    −(λ1 + λ2)  μ    0     |
    | 0                    λ2              0    λ1          −μ   0     |
    | βλ2                  0               0    λ2          0    −λarr |

where β̄ = 1 − β, and

    ΛTA = [ 0  0  0  0  0  λarr ]
From the parameter values given in Table 4.6, ΛTT and ΛTA are given as:

    ΛTT =
    | −1.9×10⁻⁵   1.25×10⁻¹    0            0           0            0           |
    | 1.71×10⁻⁵   −1.25×10⁻¹   1.25×10⁻¹    0           0            0           |
    | 9.5×10⁻⁷    9.5×10⁻⁶     −1.25×10⁻¹   0           0            0           |
    | 9.0×10⁻⁷    0            0            −1.0×10⁻⁵   1.25×10⁻¹    0           |
    | 0           5.0×10⁻⁷     0            9.5×10⁻⁶    −1.25×10⁻¹   0           |
    | 5.0×10⁻⁸    0            0            5.0×10⁻⁷    0            −2.283×10⁻⁵ |

Similarly

    ΛTA = [ 0  0  0  0  0  2.283×10⁻⁵ ]
and the initial probability vector is

    π(0) = [ 1  0  0  0  0  0 ]ᵀ
Step 3 Let the proof-test interval Tproof be 100 h; then the state probabilities
just before the 1st proof-test (e.g. at 99 h) are given by:

    π(99) = [ 9.99×10⁻¹  1.44×10⁻⁴  7.60×10⁻⁶  8.90×10⁻⁵  6.80×10⁻⁹
              4.94×10⁻⁶ ]ᵀ
Step 4 State probabilities just after the 1st proof-test, i.e. at t = 100 h:

    π(100) = [ 9.99×10⁻¹  2.25×10⁻⁴  1.22×10⁻⁵  8.99×10⁻⁶  6.86×10⁻¹⁰
               4.99×10⁻⁷ ]ᵀ
Step 5 State probabilities between the 1st and 2nd proof-tests are given by
(4.7), with a = D e^(ΛTT τ) relating the probabilities just after successive
proof-tests. Following this step progressively for the next proof-test intervals
enables computation of the state probabilities for any specified time.
For an operation time T = 87,600 h, n and S of the above equation are 876 and 0,
respectively, and the state probabilities at t = T are obtained as:

    π(87,600) = [ 9.20×10⁻¹  6.67×10⁻²  2.46×10⁻³  7.39×10⁻³  5.62×10⁻⁷
                  2.72×10⁻⁴ ]ᵀ
Step 6 Once the state probabilities with time are known, the mean probabilities
can be obtained using the mean operator:

    E[π(t)] = π̄ = ∫₀ᵀ π(t) dt / ∫₀ᵀ dt
            = (1/T) ΛTT⁻¹ [ (e^(ΛTT τ) − I)(I − a)⁻¹(I − aⁿ)
                             + (e^(ΛTT S) − I) aⁿ ] π(0)

where T = nτ + S. So for the example case it is given as:

    π̄ = [ 9.99×10⁻¹  1.52×10⁻⁴  8.00×10⁻⁶  5.49×10⁻⁵  4.23×10⁻⁹
          3.05×10⁻⁶ ]ᵀ
Case-2 2oo3 architecture.
Step 1 ΛTT is the 10×10 infinitesimal generator matrix over the transient states
{(3,0,0), (2,1,0), (1,2,0), (0,3,0), (2,0,1), (1,1,1), (0,2,1), (1,0,2),
(0,1,2), (0,0,3)}, constructed as in Case-1; for example, its first row (state
(3,0,0)) is

    [ −2.73×10⁻⁵  1.25×10⁻¹  0  0  0  0  0  0  0  0 ]
Similarly

    ΛTA = [ 0  0  0  0  0  0  0  2.28×10⁻⁵  2.28×10⁻⁵  2.28×10⁻⁵ ]
Step 2 Evolution of the transient states with time is given by:

    dπ(t)/dt = ΛTT π(t)

    π(t) = [ P300  P210  P120  P030  P201  P111  P021  P102  P012  P003 ]ᵀ
Step 3 State probabilities just before the 1st proof-test, i.e. at 99 h (for a
proof-test interval of 100 h):

    π(τ⁻) = e^(ΛTT τ) π(0),  τ = Tproof

    π(99) = [ 9.99×10⁻¹  2.07×10⁻⁴  1.82×10⁻⁵  2.28×10⁻⁶  1.23×10⁻⁴
              1.78×10⁻⁸  9.35×10⁻¹⁰  1.03×10⁻⁵  8.08×10⁻¹⁰  1.48×10⁻⁶ ]ᵀ
Step 4 The state probabilities just after the 1st proof-test, i.e. at 100 h (for
a proof-test interval of 100 h), are given as:

    π(100) = [ 9.99×10⁻¹  3.18×10⁻⁴  2.76×10⁻⁵  3.61×10⁻⁶  1.23×10⁻⁵
               1.78×10⁻⁹  9.35×10⁻¹¹  1.03×10⁻⁶  8.08×10⁻¹¹  1.48×10⁻⁷ ]ᵀ
Step 5 State probabilities between the 1st and 2nd proof-tests are given by
(4.7), with a = D e^(ΛTT τ) relating the probabilities just after successive
proof-tests. Following this step progressively for the next proof-test intervals
enables computation of the state probabilities for any specified time.
For an operation time T = 87,600 h, n and S of the above equation are 876 and 0,
respectively, and the state probabilities at t = T are obtained as:

    π(87,600) = [ 8.87×10⁻¹  8.88×10⁻²  5.30×10⁻³  6.60×10⁻⁴  9.85×10⁻³
                  1.42×10⁻⁶  7.49×10⁻⁸  5.85×10⁻⁴  4.45×10⁻⁸  7.30×10⁻⁵ ]ᵀ
Step 6 Once the state probabilities with time are known, the mean probabilities
can be obtained using the mean operator:

    E[π(t)] = π̄ = ∫₀ᵀ π(t) dt / ∫₀ᵀ dt
            = (1/T) ΛTT⁻¹ [ (e^(ΛTT τ) − I)(I − a)⁻¹(I − aⁿ)
                             + (e^(ΛTT S) − I) aⁿ ] π(0)

where T = nτ + S. So for the example case it is given as:

    π̄ = [ 9.99×10⁻¹  2.18×10⁻⁴  1.92×10⁻⁵  2.40×10⁻⁶  7.60×10⁻⁵
          1.11×10⁻⁸  5.80×10⁻¹⁰  6.41×10⁻⁶  5.09×10⁻¹⁰  9.15×10⁻⁷ ]ᵀ
Step 7 Determination of the mean safety index from the mean state probabilities.
4.6.2.3 Discussion

For systems with identical channels, the PFaD values of the 1oo2 architecture
are lower than those of the 2oo3 architecture for all proof-test intervals. PFaD
values for both architectures increase with increasing proof-test interval.
Manifested availability values of the 2oo3 architecture are high compared with
1oo2. From Fig. 4.16 it can be observed that there is no appreciable decrease in
availability of the 2oo3 architecture with proof-test interval, while the 1oo2
architecture shows decreasing availability with increasing proof-test interval.
One factor behind the lower PFaD value of the 1oo2 architecture is that it
spends more time (compared with 2oo3) in FS (i.e. safe shutdown). The lower
value of availability for 1oo2 supports this.
The safety index PFaD(t), along with the availability index mAv(t), can be very
useful during safety system design as well as in the operation phase. During the
design phase, these two indices can be evaluated for different design
alternatives (architecture, hazard rates, DC, CCF) with specified external
factors (proof-test interval, MTBD) for a specified time. The design alternative
which gives the lowest PFaD(t) (lower than required) and the maximum mAv(t) is
the best design option.
An example with three cases is taken; system parameter values, along with the
common environment parameter values for all three, are given in Table 4.7. Both
PFaD and availability are evaluated for all 3 cases for the 1oo2 and 2oo3
architectures at two different values of MTBD. The results are shown in
Table 4.8. Case-I with the 1oo2 architecture and case-III with the 2oo3
architecture give the lowest values of PFaD for both MTBD values compared to the
others. Examining the availability values shows that case-III with 2oo3 also
gives the highest availability. So, case-III with the 2oo3 architecture seems
preferable.
During the operational phase of a system, mainly the proof-test interval is
tuned to achieve the target safety. Frequent proof-tests increase safety but
decrease the availability of the safety system as well as of the EUC. These two
indices are helpful in deciding the maximum proof-test interval which meets the
required PFaD(t) value while maximizing mAv(t).
The safety model discussed here requires Markov models to be made manually and
various matrices to be deduced from the system parameters. Stochastic Petri net
based tools such as SPNP [48] and TimeNET [49] prove helpful here, as they
provide a graphical interface to specify the problem and numerically give the
desired measure. The safety model discussed here has a deterministic event, the
periodic proof-test. A class of SPN called DSPN can model and solve systems with
a combination of exponential and deterministic events. DSPNs have some
limitations; a detailed overview of Petri net based tools is given in [48, 49].
DSPN based safety models of 1oo2 and 2oo3 system architectures are shown in
Figs. 4.17 and 4.18, respectively.
Fig. 4.17 DSPN based safety model of the 1oo2 system. The number of tokens in
places OK, SF and DU represents the number of channels in the healthy state,
safe failure state and dangerous failure state respectively. Transition T1 (T4)
represents the safe (dangerous) hazard rate due to CCF. Transition T2 (T3)
represents the safe (dangerous) hazard rate without CCF. T5 depicts the repair
rate of a channel from the safe failure state. The demand arrival rate is shown
with transition T6. Places P0 and P1 along with deterministic transition Tproof
and immediate transition T7 model the periodic proof-test. When a token is in
place P0, transition Tproof is enabled and fires after a deterministic time; on
firing, a token is deposited in place P1. In this marking the immediate
transition T7 becomes enabled, fires immediately and removes all the tokens from
place DU. All the tokens of DU are deposited in place P2 and one token is
deposited in P0. From P2 tokens may go to place SF or DU based on the degree of
proof-test
4.9 Summary
Fig. 4.18 DSPN based safety model of the 2oo3 system. The number of tokens in
places OK, SF and DU represents the number of channels in the healthy state,
safe failure state and dangerous failure state respectively. Transitions T1, T2
(T5, T6) represent the safe (dangerous) hazard rates due to CCF. Transition T3
(T4) represents the safe (dangerous) hazard rate without CCF. T7 depicts the
repair rate of a channel from the safe failure state. The demand arrival rate is
shown with transition T8. Places P0 and P1 along with deterministic transition
Tproof and immediate transition T9 model the periodic proof-test. When a token
is in place P0, transition Tproof is enabled and fires after a deterministic
time; on firing, a token is deposited in place P1. In this marking the immediate
transition T9 becomes enabled, fires immediately and removes all the tokens from
place DU. All the tokens of DU are deposited in place P2 and one token is
deposited in P0. From P2 tokens may go to place SF or DU based on the degree of
proof-test
safety models from the literature is given. The safety models of IEC 61508 have
been extended to incorporate demand rate, and the method is illustrated in
detail.
The quantitative safety index PFD is published in the safety standard IEC 61508.
Various researchers have contributed to making the method clearer, more usable
and more relevant. Contributing in the same direction, a Markov model for
systems considering safe failures, periodic proof-tests (perfect as well as
imperfect) and demand rate has been derived. The analysis has been done to
derive a closed form solution for the performance based safety index PFaD and
for availability. The advantages of modeling safe failures are shown with the
help of an example.
Reliable data on process demands is needed to correctly estimate demand rate
and its distribution.
References
23. Everett W (1999) Software component reliability analysis. In: Proceeding of the symposium
on Application-Specific Systems and Software Engineering Technology (ASSET’99), Dallas,
TX, pp 204–211
24. IEC 61508: Functional safety of electric/electronic/programmable electronic safety-related
systems, Parts 0–7; October 1998–May 2000
25. Bukowski JV (2001) Modeling and analyzing the effects of periodic inspection on the
performance of safety-critical systems. IEEE Trans Reliab 50(3):321–329
26. Guo H, Yang X (2007) A simple reliability block diagram method for safety integrity
verification. Reliab Eng Syst Saf 92:1267–1273
27. Zhang T, Long W, Sato Y (2003) Availability of systems with self-diagnostic
components—applying Markov model to IEC 61508-6. Reliab Eng Syst Saf 80:133–141
28. Bukowski JV, Goble WM (2001) Defining mean time-to-failure in a particular failure-state
for multi-failure-state systems. IEEE Trans Reliab 50(2):221–228
29. Brown S (2000) Overview of IEC 61508: functional safety of electrical/electronic/
programmable electronic safety-related systems. Comput Control Eng J 11(1):6–12
30. Bukowski JV (2005) A comparison of techniques for computing PFD average. In: RAMS
2005, pp 590–595
31. Goble WM, Bukowski JV (2001) Extending IEC61508 reliability evaluation techniques to
include common circuit designs used in industrial safety systems. In: Proceeding of Annual
Reliability and Maintainability Symposium, pp 339–343
32. Hokstad P, Corneliussen K (2004) Loss of safety assessment and the IEC 61508
standard. Reliab Eng Syst Saf 83:111–120
33. Summers A (2000) Viewpoint on ISA TR84.0.02-simplified methods and fault tree analysis.
ISA Trans 39(2):125–131
34. Scherrer C, Steininger A (2003) Dealing with dormant faults in an embedded fault-tolerant
computer system. IEEE Trans Reliab 52(4):512–522
35. Delong TA, Smith T, Johnson BW (2005) Dependability metrics to assess safety-critical
systems. IEEE Trans Reliab 54(2):498–505
36. Choi CY, Johnson RW, Profeta JA III (1997) Safety issues in the comparative analysis of
dependable architectures. IEEE Trans Reliab 46(3):316–322
37. Bukowski JV (2006) Incorporating process demand into models for assessment of safety
system performance. In: RAMS 2006, pp 577–581
38. Rouvroye JL, Brombacher AC (1999) New quantitative safety standards: different
techniques, different results? Reliab Eng Syst Saf 66:121–125
39. Manoj K, Verma AK, Srividya A (2007) Analyzing effect of demand rate on safety of
systems with periodic proof-tests. Int J Autom Comput 4(4):335–341
40. Manoj K, Verma AK, Srividya A (2008) Modeling of demand rate and imperfect proof-test
and analysis of their effect on system safety. Reliab Eng Syst Saf 93:1720–1729
41. Manoj K, Verma AK, Srividya A (2008) Incorporating process demand in safety evaluation
of safety-related systems. In: Proceeding of Int Conf on Reliability, Safety and Quality in
Engineering (ICRSQE-2008), pp 378–383
42. Cox DR, Miller HD (1970) The theory of stochastic processes. Methuen & Co, London
43. Darroch JN, Seneta E (1967) On quasi-stationary distributions in absorbing continuous-time
finite markov chains. J Appl Probab 4:192–196
44. Marsan MA, Chiola G (1987) On Petri nets with deterministic and
exponentially distributed firing times. In: Advances in Petri Nets 1986, Lecture
Notes in Computer Science 266, pp 132–145
45. Varsha M, Trivedi KS (1994) Transient analysis of real-time systems using
deterministic and stochastic Petri nets. In: Int'l Workshop on Quality of
Communication-Based Systems
46. Khobare SK, Shrikhande SV, Chandra U, Govidarajan S (1998) Reliability analysis of micro
computer modules and computer based control systems important to safety of nuclear power
plants. Reliab Eng Syst Saf 59(2):253–258
47. Khobare SK, Shrikhande SV, Chandra U, Govidarajan G (1995) Reliability assessment of
standardized microcomputer circuit boards used in C&I systems of nuclear reactors.
Technical report BARC/1995/013
48. Trivedi KS (2001) SPNP user’s manual, version 6.0. Technical report
49. Zimmermann A, Knoke M (2007) TimeNET 4.0 user manual. Technical report,
August 2007
Chapter 5
Network Technologies for Real-Time
Systems
5.1 Introduction

The purpose of this chapter is to introduce basic terms and concepts of network
technology. The main emphasis is on schedulers and real-time analysis of these
networks. Networks used in critical applications, such as CAN and
MIL-STD-1553B, are discussed in detail.
Usually, the MAC protocol is considered a sublayer of the physical layer or the
data link layer.
message in order to reduce the risk of the same messages colliding again.
However, due to the possibility of successive collisions, the temporal behavior
of CSMA/CD networks can be hard to predict. 1-persistent CSMA/CD is used for
Ethernet [3].
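The retransmission delay after a collision in CSMA/CD Ethernet follows truncated binary exponential backoff, which is the root of the unpredictability: the waiting time is random and its range grows with each successive collision. A sketch with the classic 10 Mbit/s Ethernet parameters (512-bit slot time, window capped after 10 collisions, frame dropped after 16 attempts):

```python
# Truncated binary exponential backoff as used by CSMA/CD Ethernet.
import random

SLOT_TIME = 51.2e-6          # seconds: 512 bit times at 10 Mbit/s
MAX_BACKOFF_EXP = 10         # contention window stops growing after 10
MAX_ATTEMPTS = 16            # frame is dropped after 16 failed attempts

def backoff_delay(collisions, rng=random):
    """Random wait before retransmission after `collisions` collisions."""
    exp = min(collisions, MAX_BACKOFF_EXP)
    return rng.randrange(2 ** exp) * SLOT_TIME

random.seed(1)
delays = [backoff_delay(n) for n in range(1, MAX_ATTEMPTS)]
```

Because each delay is drawn from 0 to 2^min(n,10) − 1 slot times, the worst-case delivery time of a message over a sequence of collisions has no useful deterministic bound.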
networks is that they are not flexible, as messages cannot be sent at an
arbitrary time and changing the message table is somewhat difficult. A message
can only be sent in one of the message's predefined slots, which affects the
responsiveness of the message transmissions. Also, if a message is shorter than
its allocated slot, bandwidth is wasted, since the unused portion of the slot
cannot be used by another message. Examples of TDMA real-time networks are
MIL-STD-1553B [4] and TTP/C [5]; in both, the exchange tables are created
offline. One example of an online scheduled TDMA network is the GSM network.
Another fixed assignment MAC protocol is Flexible Time Division Multiple Access
(FTDMA). Like regular TDMA networks, FTDMA networks avoid collisions by dividing
time into slots. However, FTDMA networks use a mini-slotting concept in order to
make more efficient use of bandwidth compared to TDMA networks. FTDMA is similar
to TDMA, with the difference lying in the run-time slot size. In an FTDMA
schedule the size of a slot is not fixed, but varies depending on whether the
slot is used or not. In case all slots are used in an FTDMA schedule, FTDMA
operates the same way as TDMA. However, if a slot is not used within a small
time offset after its initiation, the schedule progresses to its next slot.
Hence, unused slots are shorter compared to a TDMA network, where all slots have
fixed size. Used slots, however, have the same size in both FTDMA and TDMA
networks. Variants of FTDMA can be found in Byteflight [6] and FlexRay [7].
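The bandwidth effect of mini-slotting can be made concrete with a toy cycle-length comparison (the slot and mini-slot lengths below are illustrative placeholders, not values from any particular protocol):

```python
# Toy comparison: TDMA spends a full slot per message whether or not it is
# sent; FTDMA collapses an unused slot to a short mini-slot.
SLOT = 100e-6        # full slot length in seconds (placeholder)
MINISLOT = 4e-6      # length of a skipped (unused) slot (placeholder)

def tdma_cycle(used):
    """Every message slot takes the full length, used or not."""
    return len(used) * SLOT

def ftdma_cycle(used):
    """Unused slots collapse to a mini-slot; used slots keep full length."""
    return sum(SLOT if u else MINISLOT for u in used)

pattern = [True, False, False, True, False, True, False, False]
```

With the pattern above, only three of eight slots carry messages, so the FTDMA cycle is much shorter; when every slot is used, the two cycles coincide, matching the text.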
5.3.7 Master/Slave
5.4 Networks
5.4.1 Ethernet
(synchronized time interval) that also removes collisions. The window protocol
is more dynamic and somewhat more efficient in its behavior compared to the
VTCSMA approach.

Without modification to the hardware or networking topology (infrastructure),
the usage of traffic smoothing [19, 20] can eliminate bursts of traffic, which
have a severe impact on the timely delivery of message packets on the Ethernet.
By keeping the network load below a given threshold, a probabilistic guarantee
of message delivery can be provided. Some more detail about Ethernet is given
below:
Ethernet is the most widely used local area network (LAN) technology for home
and office use in the world today and has been in existence for almost three
decades. A brief history of the evolution of Ethernet over the years is given
below.
bus idle | arbitration field | control | data | CRC | ACK | EOF | bus idle

Fig. 5.1 CAN message format. It does not have source/destination addresses; the
11-bit identifier is used for filtering at the receiver as well as to arbitrate
access to the bus. It has 7 control bits: RTR (remote transmission request), IDE
(identifier extension), a reserved bit for future extensions and 4 bits giving
the length of the data field, DLC (data length code). The data field can be 0 to
8 bytes in length and the CRC field contains a 15-bit code that can be used to
check frame integrity. Following the CRC field is the acknowledge (ACK) field,
comprising an ACK slot bit and an ACK delimiter bit
Controller Area Network (CAN) is a broadcast bus (a single pair of wires) to which a
number of nodes are connected. It employs carrier sense multiple access
with collision detection and arbitration based on message priority (CSMA/AMP)
[21]. The basic features [21, 22] of CAN are:
1. High-speed serial interface: CAN is configurable to operate from a few kilobits
per second to 1 Mbit per second.
2. Low cost physical medium: CAN operates over a simple inexpensive twisted
wire pair.
3. Short data lengths: the short data length of CAN messages means that CAN has
very low latency when compared to other systems.
4. Fast reaction times: The ability to transmit information without requiring a token
or permission from a bus arbitrator results in extremely fast reaction times.
5. Multi-master and peer-to-peer communication: using CAN it is simple to
broadcast information to all nodes or a subset of nodes on the bus, and just as easy to
implement peer-to-peer communication.
6. Error detection and correction: The high level of error detection and number of
error detection mechanisms provided by the CAN hardware means that CAN is
extremely reliable as a networking solution.
Data is transmitted as messages consisting of up to 8 bytes. The format of a CAN
message is shown in Fig. 5.1. Every message is assigned a unique identifier.
The identifier serves two purposes: filtering messages upon reception, and
assigning priority to the message.
The use of the identifier as priority is the most important aspect of CAN regarding
real-time performance. The identifier field of a CAN message is used to control
access to the bus after a collision by taking advantage of certain electrical characteristics.
When multiple stations transmit simultaneously, all stations will
see 0 if any one node puts a 0 bit (dominant) on the bus, while all stations will see 1 only if all
transmitting nodes put a 1 bit. So, during arbitration, by monitoring the bus a node
that sees a dominant 0 where it transmitted a recessive 1 knows that a higher priority
message is being sent, and withdraws from arbitration.
Fig. 5.2 a CAN's electrical interface (wired-OR), which enables priority-based arbitration;
b the arbitration mechanism when three nodes transmit simultaneously. Node A
(ID = 319 = 13F H = 00100111111) loses arbitration at its eighth bit and node B
(ID = 311 = 137 H = 00100110111) at its last bit, so node C (ID = 310 = 136 H = 00100110110)
wins, and the resulting bus level equals node C's identifier
5.4.3 MIL-STD-1553B
Fig. 5.3 MIL-STD-1553B bus topology: a trunk line (primary and secondary buses) with bus
terminators at both ends and bus couplers connecting, via stub lines, the bus controller and
remote terminals '1' to '31'
A typical network consisting of a bus controller and a remote terminal with dual
redundant bus is shown in Fig. 5.3.
The control, data flow, status reporting, and management of the bus are pro-
vided by three word types:
1. Command words
2. Data words
3. Status words
Word formats are shown in Fig. 5.4.
The primary purpose of the data bus is to provide a common medium for the
exchange of data between the terminals of the system. The exchange of data is based on
Fig. 5.4 MIL-STD-1553B word formats (20 bit-times each: a 3-bit-time SYNC, 16 information
bits, and a parity bit). The command word has a 5-bit terminal address, a transmit/receive bit, a
5-bit subaddress and a 5-bit word count; the data word carries 16 data bits; the status word has a
5-bit terminal address followed by the message error, instrumentation, service request, reserved,
broadcast command received, busy, subsystem flag, dynamic bus control acceptance and terminal
flag fields
Fig. 5.5 RT-RT information transfer format: receive command, transmit command, (response
time) status word, data words, (response time) status word, (inter-message gap) next command.
The symbol # denotes the inter-message gap and @ the response time
message transmission formats. The standard defines ten types of message transmission
formats, all based on the three word types defined above.
An RT-RT information transfer format is shown in Fig. 5.5.
The inter-message gap shown in Fig. 5.5 is the minimum gap time that the bus
controller shall provide between messages; its typical value is 4 μs. The response time
is the time period available for terminals to respond to a valid command word; this
period is 4–12 μs. A time-out occurs if a terminal does not respond within the
no-response time-out, defined as the minimum time that a terminal shall wait
before considering that a response has not occurred; it is 14 μs.
A real-time scheduler schedules real-time tasks sharing a resource. The goal of the
real-time scheduler is to ensure that the timing constraints of these tasks are
satisfied. The scheduler decides, based on the task timing constraints, which task to
execute or to use the resource at any given time.
Traditionally, real-time schedulers are divided into offline and online sched-
ulers. Offline schedulers make all scheduling decisions before the system is
executed. At run-time a simple dispatcher is used to activate tasks according to
the schedule generated before run-time. Online schedulers, on the other hand, make their scheduling decisions at run-time.
116 5 Network Technologies for Real-Time Systems
Time-driven schedulers [26] work in the following way: The scheduler creates a
schedule (exchange table). Usually the schedule is created before the system is
started (offline), but it can also be done during run-time (online). At run-time, a
dispatcher follows the schedule, and makes sure that tasks are executing at their
predetermined time slots.
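At run-time, a time-driven dispatcher is essentially a table lookup. A minimal sketch (the slot length and task names below are illustrative assumptions, not from the text):

```python
import itertools

# The schedule (exchange table) is built offline; at run-time the dispatcher
# only looks up the current slot and runs the task assigned to it.
SLOT_MS = 10
schedule = ["taskA", "taskB", "taskA", "idle"]   # one minor cycle

def dispatch(tick):
    """Return the task to run in the slot containing time `tick` (ms)."""
    slot = (tick // SLOT_MS) % len(schedule)
    return schedule[slot]

for t in itertools.islice(itertools.count(0, SLOT_MS), 6):
    print(t, dispatch(t))
```

The run-time cost is constant regardless of how complex the constraints were that shaped the table offline, which is precisely the appeal of time-driven scheduling.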
By creating a schedule offline, complex timing constraints, such as irregular task
arrival patterns and precedence constraints, can be handled in a predictable manner
that would be difficult to do online during run-time (tasks with precedence
constraints require a special order of task executions, e.g., task A must execute
before task B). The schedule that is created offline is the schedule that will be used
at run-time. Therefore, the online behavior of time-driven schedulers is very
predictable. Because of this predictability, time-driven schedulers are the schedulers
most commonly used in highly safety-critical applications.
However, since the schedule is created offline, flexibility is very limited, in the
sense that as soon as the system changes (due to added functionality or a
change of hardware), a new schedule has to be created and given to the dispatcher.
Creating a new schedule is non-trivial and sometimes very time consuming. This
motivates the usage of the priority-driven schedulers described below.
5.5 Real-Time Scheduling
Scheduling policies that make their scheduling decisions during run-time are
classified as online schedulers. These schedulers make their scheduling decisions
online based on the system’s timing constraints, such as, task priority. Schedulers
that base their scheduling decisions on task priorities are called priority-driven
schedulers.
Using priority-driven schedulers, flexibility is increased (compared to time-driven
schedulers), since the schedule is created online based on the properties of the
currently active tasks. Hence, priority-driven schedulers can cope with changes in
workload as well as the adding and removing of tasks and functions, as long as the
schedulability of the complete task set is not violated. However, the exact
behavior of priority-driven schedulers is hard to predict; therefore, these
schedulers are not used as often in the most safety-critical applications.
Priority-driven scheduling policies can be divided into Fixed Priority Sched-
ulers (FPS) and Dynamic Priority Schedulers (DPS). The difference between these
scheduling policies is whether the priorities of the real-time tasks are fixed or if
they can change during execution (i.e. dynamic priorities).
When using FPS, once priorities are assigned to tasks they are not changed. Then,
during execution, the task with the highest priority among all tasks that are available
for execution is scheduled for execution. Priorities can be assigned in many ways,
and depending on the system requirements some priority assignments are better than
others. For instance, using a simple task model with strictly periodic non-interfering
tasks with deadlines equal to the period of the task, a Rate Monotonic (RM) priority
assignment has been shown by Liu and Layland [27] to be optimal in terms of
schedulability. In RM, the priority is assigned based on the period of the task:
the shorter the period, the higher the priority assigned to the task.
The most well known DPS is the Earliest Deadline First (EDF) scheduling policy
[27]. Using EDF, the task with the nearest (earliest) deadline among all tasks ready
for execution gets the highest priority. Therefore, the priority is not fixed; it
changes dynamically over time. For simple task models, it has been shown that
EDF is an optimal scheduler in terms of schedulability. Also, EDF allows for
higher schedulability compared with FPS: in the simple scenario, schedulability is
guaranteed as long as the total load in the scheduled system is ≤ 100%, whereas
FPS in these simple cases has a schedulability bound of about 69%. For a good
comparison between RM and EDF, interested readers are referred to [24].
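The gap between the two bounds can be seen on a concrete example. The sketch below (task parameters are illustrative, not from the text) simulates EDF in discrete time over one hyperperiod for the set {(C=3, T=6), (C=4, T=9)}: its utilization of about 94% exceeds the FPS bound of about 69% yet is below 100%, and EDF meets every deadline.

```python
from math import lcm

tasks = [(3, 6), (4, 9)]              # (C, T) pairs, deadline = period
hyper = lcm(*(t for _, t in tasks))   # lcm(6, 9) = 18

# remaining[i]: outstanding work of task i; deadline[i]: its absolute deadline
remaining = [c for c, _ in tasks]
deadline = [t for _, t in tasks]
missed = False
for now in range(hyper):
    # release new jobs at period boundaries
    for i, (c, t) in enumerate(tasks):
        if now and now % t == 0:
            remaining[i], deadline[i] = c, now + t
    # run the ready job with the earliest absolute deadline for one tick
    ready = [i for i in range(len(tasks)) if remaining[i] > 0]
    if ready:
        j = min(ready, key=lambda i: deadline[i])
        remaining[j] -= 1
    # a job that still has outstanding work at its deadline has missed it
    for i in range(len(tasks)):
        if remaining[i] > 0 and now + 1 > deadline[i]:
            missed = True
print("deadline missed:", missed)
```

Under FPS with RM priorities the same set would miss a deadline, since its utilization lies above the two-task bound of about 83%.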
Other DPSs include Least Laxity First (LLF), sometimes also called Least Slack
Time first (LST) [28]. Here the priorities of the tasks are generated at scheduling
time from the amount of laxity (for LLF, or slack for LST) available before
the deadline is violated. Laxity (or slack time) is defined as the maximum time a
task can be delayed on its activation and still complete within its deadline [24].
Several server-based schedulers for FPS systems exist where the simplest one is
the Polling Server (PS) [32]. A polling server allocates a share of the CPU to its
users. This share is defined by the server’s period and capacity, i.e., the PS is
guaranteed to allow its users to execute within the server’s capacity during each
server period. The server is scheduled according to RM together with the normal
tasks (if any) in the system. However, a server never executes by itself; it only
mediates the right to execute for its users when some of them have requested to
use its capacity. If the PS is activated and no user is ready to use the server
capacity, the capacity is lost for that server period, and the server's users have to
wait until the next server period to be served. Hence, the worst-case service a user
can get is when it requests capacity right after the server is
activated (with its capacity replenished). The behavior of a PS server is in the
worst-case equal to a task with the period of the server’s period, and a worst-case
execution time equal to the server’s capacity. Hence, the analysis of a system
running PS is straightforward.
Another server-based scheduler for FPS systems that is slightly better than the
PS (in terms of temporal performance) is the Deferrable Server (DS) [34]. Here, the
server is also implemented as a periodic task scheduled according to RM together
with the other periodic tasks (if any). The difference from PS is that the server does
not poll its users, i.e., check each server period whether there are any pending users
and, if not, drop all its capacity; instead, the DS preserves its capacity throughout the
server period, allowing its users to use it at any time during the period. As with the
PS, the DS replenishes its capacity at the beginning of each server period.
period. In general, the DS is giving better response times than the PS. However, by
allowing the servers’ users to execute at any time during the servers’ period it
violates the rules govern by the traditional RM scheduling (where the highest
priority task has to execute as soon it is scheduled), lowering the schedulability
bound for the periodic task set. A trade-off to the DS allowing a higher schedula-
bility but a slight degradation in the response times is the Priority Exchange (PE)
algorithm. Here the servers’ capacities are preserved by exchanging it for the
execution time of a lower priority periodic task. Hence, the servers’ capacities are
not lost but preserved at the priority of the low priority task involved in the
exchange. Note that the PE mechanisms are computationally more complex than
the DS mechanisms, which should be taken into consideration in the trade-off.
By changing the way capacity is replenished, the Sporadic Server (SS) [32] is a
server-based scheduler for FPS systems that allows high schedulability without
degradation. Instead of replenishing capacity at the beginning of the server period,
SS replenishes its capacity once the capacity has been consumed by its users. Like
the DS, the SS violates traditional RM scheduling by not executing the highest priority
task once it is scheduled for execution. However, this violation does not impact
the schedulability, as the same schedulability bound is offered for a system running
with or without SS.
There are server-based schedulers for FPS systems having better performance
in terms of response-time. However, this usually comes at a cost of high com-
putational and implementation complexity as well as high memory requirements.
One of these schedulers is the Slack Stealer. It should be noted that there are no
optimal algorithms in terms of minimising the response time: the non-existence of
an algorithm that can both minimise the response time offered to users and at the
same time guarantee the schedulability of the periodic tasks has been proven in
[35]. Hence, there is a trade-off between response time and schedulability when
finding a suitable server-based scheduler for the intended target system.
Server-based scheduling has also been extended to EDF-based DPS systems, e.g., an extension of PE called the
Dynamic Priority Exchange (DPE) [36], and an extension of the SS called the
Dynamic Sporadic Server (DSS) [36]. A very simple (implementation wise) ser-
ver-based scheduler that provides faster response-times compared with SS yet not
violating the overall load of the system (causing other tasks to miss their dead-
lines) is the Total Bandwidth Server (TBS) [36]. TBS makes sure that the server
never uses more bandwidth than allocated to it (under the assumption that the users
do not consume more capacity than specified by their worst-case execution times),
yet providing a fast response time to its users (i.e., assigning its users with a close
deadline as the system is scheduled according to EDF). Also, TBS has been
enhanced by improving its deadline assignment rule [36]. A quite complex server-
based scheduler is the Earliest Deadline Late server (EDL) [36] (which is a DPS
version of the Slack Stealer). Moreover, there is an Improved Priority Exchange
(IPE) [36] which has similar performance as the EDL, yet being less complex
implementation wise. When the worst-case execution times are unknown, the
Constant Bandwidth Server (CBS) [24] can be used, guaranteeing that the server’s
users will never use more than the server’s capacity.
5.6 Real-Time Analysis
The task model notation used throughout this section is presented in Table 5.2.
Periodic tasks can be of two types, synchronous periodic tasks and asynchronous
periodic tasks, defined as follows:
Synchronous periodic tasks are a set of periodic tasks in which the first instances
of all tasks are released at the same time, usually taken to be time zero.
Asynchronous periodic tasks are a set of periodic tasks in which tasks can have
their first instances released at different times.
Seminal work on utilisation-based tests for both fixed priority schedulers and
dynamic priority schedulers has been presented by Liu and Layland [27].
In [27], a utilisation-based test for synchronous periodic tasks
using the Rate Monotonic (RM) priority assignment is presented (Liu and Layland
provided the formal proofs). The task model they use consists of independent
periodic tasks with deadlines equal to their periods. Moreover, all tasks are released
at the beginning of their period, have a known worst-case execution time, and
are fully pre-emptive. If the test succeeds, the tasks will always meet their
deadlines, given that all the assumptions hold. The test is as follows:
$$\sum_{i=1}^{N} \frac{C_i}{T_i} \le N\left(2^{1/N} - 1\right) \qquad (5.1)$$
This test only guarantees that a task set will not violate its deadlines if it passes
the test. The lower bound given by this test is around 69% as N approaches
infinity. However, there are task sets that may not pass the test, yet will
meet all their deadlines. Later on, Lehoczky showed that the average real
feasible utilization is about 88% when using randomly generated task sets.
Moreover, Lehoczky also developed an exact analysis. However, the test developed by
Lehoczky is a much more complex inequality compared to Inequality (5.1). It has
also been shown that, by having the tasks' periods harmonic (or near harmonic),
the schedulability bound is up to 100% [38]. Harmonic task sets have only task
periods that are multiples of each other.
Inequality (5.1) has been extended in various ways, e.g., by Sha et al. [39] to
also cover blocking time, i.e., to cover the case where higher priority tasks are blocked
by lower priority tasks. For a good overview of FPS utilisation-based tests, interested
readers are referred to [25].
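Inequality (5.1) is straightforward to evaluate; a small sketch (the task sets below are illustrative examples, not from the book):

```python
def rm_utilization_test(tasks):
    """Sufficient (not necessary) RM schedulability test of Liu and Layland:
    sum(C_i/T_i) <= N * (2**(1/N) - 1).  tasks: list of (C, T) pairs with
    deadlines equal to periods, independent and fully pre-emptive."""
    n = len(tasks)
    u = sum(c / t for c, t in tasks)
    bound = n * (2 ** (1 / n) - 1)
    return u, bound, u <= bound

# Two illustrative task sets: the first passes, the second fails the test
# (which does not by itself prove the second set unschedulable).
print(rm_utilization_test([(1, 4), (1, 5), (2, 10)]))
print(rm_utilization_test([(3, 6), (4, 9)]))
```

Note that failing the test is inconclusive: an exact (but more expensive) analysis, such as the response-time test later in this section, may still show the set schedulable.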
Liu and Layland [27] also present a utilisation-based test for EDF (with the same
assumptions as for Inequality (5.1)):
$$\sum_{i=1}^{N} \frac{C_i}{T_i} \le 1 \qquad (5.2)$$
The processor demand is a measure that indicates how much computation is
requested by the system's task set, with respect to timing constraints, in an
arbitrary time interval $t \in [t_1, t_2)$. The processor demand $h_{[t_1,t_2)}$ is given by
$$h_{[t_1,t_2)} = \sum_{t_1 \le r_k,\; d_k \le t_2} C_k \qquad (5.3)$$
where $r_k$ is the release time of task k and $d_k$ is the absolute deadline of task k, i.e.,
the processor demand in an arbitrary time interval is given by the tasks released
within (and having absolute deadlines within) this time interval.
Looking at synchronous task sets, (5.3) can be expressed as $h(t)$, given by
$$h(t) = \sum_{D_i \le t} \left(1 + \left\lfloor \frac{t - D_i}{T_i} \right\rfloor\right) C_i \qquad (5.4)$$
where $D_i$ is the relative deadline of task i. A task set is then feasible iff
$$h(t) \le t \quad \forall t > 0 \qquad (5.5)$$
In practice this condition needs to be checked only for a bounded set of values of t,
for which several approaches have been presented determining a valid (sufficient) t
[40, 41].
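Equations (5.3)-(5.5) translate directly into code. The sketch below checks h(t) ≤ t at the absolute deadlines up to the hyperperiod, one simple sufficient choice of test points for a synchronous set (the task set itself is an illustrative example):

```python
from math import floor, lcm

def demand(tasks, t):
    """Processor demand h(t) of a synchronous task set in [0, t]:
    tasks are (C, T, D) triples; a task contributes one job per period
    whose absolute deadline falls within [0, t]."""
    return sum((1 + floor((t - d) / p)) * c
               for c, p, d in tasks if d <= t)

def edf_feasible(tasks):
    """Check h(t) <= t at every absolute deadline up to the hyperperiod."""
    hyper = lcm(*(p for _, p, _ in tasks))
    points = sorted({k * p + d for _, p, d in tasks
                     for k in range(hyper // p) if k * p + d <= hyper})
    return all(demand(tasks, t) <= t for t in points)

# Illustrative set with deadlines shorter than periods: (C, T, D) triples.
print(edf_feasible([(1, 4, 3), (2, 6, 5)]))
```

h(t) only changes at absolute deadlines, so checking those points is enough; the cited works give tighter (smaller) sufficient sets of test points.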
By looking at the processor demand, Inequality (5.2) has been extended for
deadlines longer than the period [40]. Moreover, [41] presents a processor demand-based
feasibility test that allows for deadlines shorter than the period. Given that
Inequality (5.2) is satisfied, Spuri et al. [33] introduce a generalized processor
demand-based feasibility test that allows for non pre-emptive EDF scheduling.
Additional extensions covering sporadic tasks are discussed in [40].
Response-time tests calculate the behavior of the worst-case scenario that
can happen for any given task scheduled by a specific real-time scheduler. This
worst-case behavior is used to determine the worst-case response time for
that task.
Joseph and Pandya presented the first response-time test for real-time systems [42].
They present a response-time test for pre-emptive fixed-priority systems. The
worst-case response-time is calculated as follows:
$$R_i = I_i + C_i \qquad (5.6)$$
where $I_i$ is the interference from higher priority tasks:
$$I_i = \sum_{j \in hp(i)} \left\lceil \frac{R_i}{T_j} \right\rceil C_j \qquad (5.7)$$
where $hp(i)$ is the set of tasks with higher priority than task i.
For FPS scheduled systems, the critical instant is given by releasing task i with
all other higher priority tasks at the same time, i.e., the critical instant is generated
when using a synchronous task set [32]. Hence, the worst-case response-time for
task i is found when all tasks are released simultaneously at time 0.
The worst-case response time is found by investigating the processor's level-i
busy period, which is defined as the period preceding the completion of task i, i.e.,
the time in which task i and all other higher priority tasks have not yet
executed to completion. Hence, the processor's level-i busy period is given by
rewriting (5.6) to:
$$R_i^{n+1} = \sum_{j \in hp(i)} \left\lceil \frac{R_i^n}{T_j} \right\rceil C_j + C_i \qquad (5.8)$$
When task i can also be blocked by a lower priority task, a blocking term $B_i$ is
added, giving the recurrence
$$R_i^{n+1} = B_i + \sum_{j \in hp(i)} \left\lceil \frac{R_i^n}{T_j} \right\rceil C_j + C_i, \qquad B_i = \max_{k \in lep(i),\, k \ne i} (C_k - \epsilon) \qquad (5.9)$$
where $lep(i)$ is the set of tasks with priority less than or equal to that of task i, and
$\epsilon$ is the minimum time quantum, which, in computer systems, corresponds to one
clock cycle. Including $\epsilon$ makes the blocking expression less pessimistic by safely
removing one clock cycle from the worst-case execution time of the blocking task k.
The recurrence equation (5.9) is solved in the same way as (5.8). Note that in the
presence of blocking, the scenario giving the critical instant is redefined: the
maximum interference now occurs when task i and all other higher priority tasks
are released simultaneously just after the release of the longest lower priority task
(other than task i).
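The fixed-point iteration behind these recurrences is short to implement; a sketch (task parameters are illustrative, and any blocking term B_i is assumed to be precomputed and passed in):

```python
from math import ceil

def response_time(i, tasks, block=0):
    """Worst-case response time of task i by fixed-point iteration over the
    recurrence R = B_i + sum_{j in hp(i)} ceil(R/T_j)*C_j + C_i, starting
    from R = B_i + C_i.  tasks: (C, T) pairs sorted highest priority first;
    block: optional precomputed blocking term B_i."""
    c_i, t_i = tasks[i]
    r = block + c_i
    while True:
        nxt = block + c_i + sum(ceil(r / t) * c for c, t in tasks[:i])
        if nxt == r:
            return r
        if nxt > t_i:          # diverged past the deadline (= period here)
            return None        # unschedulable
        r = nxt

# Illustrative task set, (C, T) pairs in rate-monotonic priority order.
tasks = [(1, 4), (1, 5), (2, 10)]
print([response_time(i, tasks) for i in range(len(tasks))])
```

Each iterate is monotonically non-decreasing, so the loop either converges to the smallest fixed point (the worst-case response time) or exceeds the deadline.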
Table 5.3 Comparison of serial/field buses: CAN, TTCAN, Token Bus, Ethernet (CSMA/CD) and MIL-STD-1553B

Maximum data rate and distance: CAN: 1 Mbit/s up to 40 m; TTCAN: 1 Mbit/s up to 40 m;
Token Bus: 5 Mbit/s up to 1000 m; Ethernet: 10 Mbit/s up to 2500 m; MIL-STD-1553B:
1 Mbit/s, no distance limit.

Maximum data size with overheads: CAN: 8 bytes, overhead 47/65 bits; TTCAN: 8 bytes,
overhead 47/65 bits; Token Bus: 32000 bytes, overhead 80 bits minimum; Ethernet: 1500 bytes,
overhead 206 bits minimum; MIL-STD-1553B: 32 data words (512 actual data bits), overhead 208 bits.

Data flow: half duplex for CAN, TTCAN, Token Bus and MIL-STD-1553B; half or full duplex
for Ethernet.

MAC layer: CAN: CSMA/CD/AMP, arbitration on message priority; TTCAN: exclusive window
for one message, priority-based arbitration for the rest of the messages as in basic CAN;
Token Bus: token passing; Ethernet: CSMA/CD with the BEB algorithm; MIL-STD-1553B:
TDMA, command/response.

Clock synchronization: not required for CAN, Token Bus, Ethernet and MIL-STD-1553B;
TTCAN uses a reference message for local synchronization.

Delay and jitter: CAN: low delay and jitter, low priority messages have higher delay jitter;
TTCAN: improvement over CAN; Token Bus: worst-case delay is fixed, with low jitter;
Ethernet: unbounded delay and very high jitter; MIL-STD-1553B: worst-case delay is fixed,
with very low jitter.

Network size in terms of nodes: CAN: 120 nodes; TTCAN: 120 nodes; Token Bus: 100 nodes;
Ethernet: maximum segment length is 100 m, minimum length between nodes is 2.5 m, maximum
number of connected segments is 1024, maximum number of nodes per segment is 1 (star
topology); MIL-STD-1553B: 31 remote terminals.

Physical media: twisted pair for all five buses; CAN also supports single-wire operation (CSMA/CR).

Topology: CAN, TTCAN and MIL-STD-1553B: multidrop; Token Bus: logical ring; Ethernet:
10BASE5 uses a bus topology.

Redundancy: none for CAN, TTCAN, Token Bus and Ethernet; MIL-STD-1553B: dual or more
redundant buses.

Fault tolerant: only MIL-STD-1553B.

Fail silent: CAN and TTCAN: yes, a node can go into the BUS OFF state; Token Bus, Ethernet
and MIL-STD-1553B: no.

Frame/message check: CAN and TTCAN: acknowledgement bit, CRC, and incremental/decremental
counters for node state on the bus; Token Bus: CRC; Ethernet: CRC; MIL-STD-1553B: parity
and status word.

Node failure tolerance: CAN, TTCAN, Token Bus and Ethernet: yes; MIL-STD-1553B: yes, if
the bus controller remains functional.
in which only tasks with absolute deadlines smaller than or equal to $d_i$ are allowed
to execute.
Hence, in dynamic-priority systems the worst-case response time for an arbitrary
task i can be found, for the pre-emptive case, when all tasks but i are released
at time 0. Then, multiple scenarios have to be examined in which task i is released at
some time t.
Also for the non pre-emptive case, all tasks but i are released at time 0.
However, one task with an absolute deadline greater than that of task i (i.e., one lower
priority task) has initiated its execution at time $0 - \epsilon$. Then, as in the pre-emptive
case, multiple scenarios have to be examined in which task i is released at some time t.
Worst-case response-time equations for both pre-emptive and non pre-emptive
EDF scheduling are given by Spuri [33]. Furthermore, these have been extended
for response-time analysis of EDF scheduled systems to include offsets.
5.8 Summary
References
35. Tia T-S, Liu W-S, Shankar M (1996) Algorithms and optimality of scheduling soft aperiodic
requests in fixed-priority preemptive systems. Real-Time Syst 10(1):23–43
36. Spuri M, Buttazzo GC (1996) Scheduling aperiodic tasks in dynamic priority systems. Real-
Time Syst 10(2):179–210
37. Tindell KW, Burns A, Wellings AJ (1994) An extendible approach for analysing fixed
priority hard real-time tasks. Real-Time Syst 6(2):133–151
38. Sha L, Goodenough JB (1990) Real-time scheduling theory and ADA. IEEE Comput
23(4):53–62
39. Sha L, Rajkumar R, Lehoczky JP (1990) Priority inheritance protocols: An approach to real-
time synchronization. IEEE Trans Comput 39(9):1175–1185
40. Baruah SK, Mok AK, Rosier LE (1990) Preemptive scheduling hard real-time sporadic tasks
on one processor. In: Proceedings of the 11th IEEE Real-Time Systems Symposium
(RTSS’90), pp 182–190
41. Baruah SK, Rosier LE, Howell RR (1990) Algorithms and complexity concerning the
preemptive scheduling of periodic real-time tasks on one processor. Real-Time Syst
2(4):301–324
42. Joseph M, Pandya P (1986) Finding response times in a real-time system. Comput J
29(5):390–395
43. Audsley NC, Burns A, Richardson MF, Tindell K, Wellings AJ (1993) Applying new
scheduling theory to static priority pre-emptive scheduling. Softw Eng J 8(5):284–292
Chapter 6
Response-Time Models and Timeliness
Hazard Rate
6.1 Introduction
Theorem:
$$\lim_{k \to \infty} P[Y_k = n] = \pi_{N-1}(n), \qquad n \in S(N-1)$$
This theorem forms the basis of the ‘‘tagged customer’’ approach for computing
the response time distribution. It is also referred to as the Arrival Theorem, and
states that in a closed queueing network an arriving customer sees the
network in equilibrium with one less customer. Thus, in a network with N
customers, the tagged customer sees the network in equilibrium with N − 1
customers. The arrival theorem gives the probability distribution for the state of the
system as seen by the arriving customer. So, computing the response-time
distribution using the tagged customer approach is a two-step process [5]:
1. Compute the steady-state probabilities $\pi_{N-1}(n)$ for each of the states of the
queueing network with one less customer.
2. Use these probabilities to compute the response-time distribution, $P[R \le t]$.
Fig. 6.3 CTMC of the system without the tagged job: states (0,0), (1,0), (0,1) and (1,1), with
the corresponding arrival and service rates on the arcs
The first step is to derive CTMC of the system without the job of interest and
finding the steady-state probabilities. The CTMC is shown in Fig. 6.3.
Having found the steady state probability of the system without tagged customer,
we now construct the modified Markov chain from these state to state where tagged
customer leaves the system. The modified CTMC is shown in Fig. 6.4.
In the figure, the starting and absorbing states are obvious. The following set of
differential equations needs to be solved to obtain the passage times:
$$\frac{dP_1}{dx} = -0.175 P_1 + 0.025 P_2$$
$$\frac{dP_2}{dx} = 0.125 P_1 - 0.075 P_2$$
$$\frac{dP_3}{dx} = 0.050 P_1 - 0.125 P_3$$
$$\frac{dP_4}{dx} = 0.050 P_2 + 0.125 P_3 - 0.225 P_4$$
$$\frac{dP_5}{dx} = 0.025 P_4 - 0.2 P_5$$
$$\frac{dP_6}{dx} = 0.2 P_4$$
$$\frac{dP_7}{dx} = 0.2 P_5$$
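These equations can be integrated numerically; below is a forward-Euler sketch. The initial condition (all probability in the first state) is an assumption for illustration, since the true initial distribution comes from the steady-state probabilities seen at arrival; the last two states only receive probability, so their sum is the passage-time CDF.

```python
def step(p, h):
    """One forward-Euler step of the passage-time ODE system."""
    p1, p2, p3, p4, p5, p6, p7 = p
    dp = [-0.175 * p1 + 0.025 * p2,
          0.125 * p1 - 0.075 * p2,
          0.050 * p1 - 0.125 * p3,
          0.050 * p2 + 0.125 * p3 - 0.225 * p4,
          0.025 * p4 - 0.200 * p5,
          0.200 * p4,
          0.200 * p5]
    return [x + h * d for x, d in zip(p, dp)]

p = [1.0, 0, 0, 0, 0, 0, 0]   # assumed initial state, for illustration
h, t_end = 0.01, 50.0
cdf, t = [], 0.0
while t < t_end:
    p = step(p, h)
    t += h
    cdf.append((t, p[5] + p[6]))   # probability of having been absorbed
print(f"P[passage <= {t_end}] ~ {cdf[-1][1]:.4f}")
```

Note that the derivatives sum to zero in every state, so total probability is conserved exactly, a useful sanity check on the reconstructed rates.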
134 6 Response-Time Models and Timeliness Hazard Rate
Fig. 6.4 Modified CTMC used to obtain the passage times, constructed from the states seen at
arrival, with absorbing states where the tagged customer leaves the system
Fig. 6.5 Response-time distribution Prob(Resp ≤ t) versus time (ms): analytical result compared
with simulation result
Now, let us take a multi-server model of a computer system as shown in Fig. 6.6. It
is also an example of a closed queueing network. We have already assumed that the
service discipline at all queues is SPNP and that the service time distributions are job
dependent exponential. The service rates for a job i at the CPU, Disk-1 and Disk-2 are
$\mu_{0i}$, $\mu_{1i}$, and $\mu_{2i}$, respectively. When the customer finishes at the CPU, it will
access either Disk-1 or Disk-2, with probability $p_{1i}$ and $p_{2i}$, respectively. After
completing the disk access, the job rejoins the CPU queue. For this model we define the
response time as the amount of time elapsed from the instant at which the job enters
the CPU queue for service until the instant it leaves either of the disks.
Following the steps mentioned in Section 6.2.1, the CTMC of the system with job
'1' in the system is prepared. The steady-state probabilities are evaluated for job
'2' being at the CPU, Disk-1 and Disk-2. These are the probabilities job '1' may see
on its arrival.
Now, the CTMC is evolved from these states and sets of differential equations are
made. These equations are solved for three different initial conditions, i.e., at
arrival, job '1' finds job '2' at (1) the CPU, (2) Disk-1, or (3) Disk-2.
Fig. 6.7 Response-time distribution Prob(Resp ≤ t) of the highest priority job versus time (ms)
After this, these three conditional passage times are unconditioned using the
Arrival Theorem and the steady-state probabilities.
The response-time distribution of the highest priority job is plotted in Fig. 6.7.
The method illustrated is applicable if the model has following properties [2, 3]:
Fig. 6.8 The three steps used to determine the PDF of response-time graphically, illustrated on
example distributions $G_0$ and $G_1$
job runs, and $C_i$ is the required computation time, which is a random variable with
a known probability density function (PDF).
Consider the system shown in Table 6.1; it is required to obtain the PDF of
the response time of job $T_1$.
The three steps used to determine the PDF of response time graphically are
shown in Fig. 6.8. These three steps are applied iteratively to determine the PDF of
response time in complex systems involving several jobs.
It is an iterative method whose computational complexity is a function of the
number of jobs in the system and the maximum number of points defining the
computation times.
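The core operation in such iterative methods is the discrete convolution of computation-time distributions; a minimal sketch (the PMFs and their values are illustrative, not the book's example):

```python
def convolve(pmf_a, pmf_b):
    """Convolve two discrete PMFs given as {value: probability} dicts;
    the result is the PMF of the sum of the two independent r.v.s."""
    out = {}
    for a, pa in pmf_a.items():
        for b, pb in pmf_b.items():
            out[a + b] = out.get(a + b, 0.0) + pa * pb
    return out

# Computation times of two jobs that execute back to back (illustrative):
c1 = {2: 0.5, 4: 0.5}
c2 = {1: 0.25, 3: 0.75}
print(convolve(c1, c2))   # PMF of the combined completion time
```

The cost of each convolution grows with the number of points in the two PMFs, which is why the method's complexity depends on the number of jobs and the maximum number of points defining the computation times.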
6.3.1 CAN
where $J_i$ is the queueing jitter of the message, i.e., the maximum variation in the
queueing time relative to $T_i$; $hp(i)$ is the set of messages with priority higher than
i; and $\tau_{bit}$ (the bit-time) caters for the difference in arbitration start times at the
different nodes due to propagation delays and protocol tolerances. Equation 6.2 is a
recurrence relation for $q_i$. Considering the effect of external interference and errors,
the worst-case response time [11] can be given as:
$$R_i = B_i + \sum_{j \in hp(i)} \left\lceil \frac{q_i + J_j + \tau_{bit}}{T_j} \right\rceil C_j + C_i + E(q_i + C_i) \qquad (6.3)$$

Fig. 6.9 Response-time model for CAN. A token in place mesQueued depicts that a message of
interest is queued for transmission. The time taken by the token to reach place mesTxComp is the
response-time of the message. Immediate transitions bFree, hpMesBlock, arbSuc are the
probabilities of the bus being free, of blocking due to a higher priority message, and of successful
arbitration, respectively; immediate transitions labeled with ~ as prefix are complementary to the
transitions without this symbol. General transitions $t_B$, $t_q$ represent the time associated with
blocking and queueing, respectively; deterministic transition $t_C$ represents the time associated
with transmission
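A recurrence of this form can be solved by fixed-point iteration in the same way as (5.8); the sketch below drops the error term E and uses illustrative message parameters, with the blocking term taken as the longest lower priority frame.

```python
from math import ceil

def can_wcrt(i, msgs, tau_bit):
    """Worst-case response time of CAN message i by fixed-point iteration,
    with the error term E taken as zero.  msgs: (C, T, J) per message in
    priority order (highest first); B_i is the longest strictly lower
    priority frame that may block (CAN frames are non-preemptable)."""
    c_i, t_i, j_i = msgs[i]
    b_i = max((c for c, _, _ in msgs[i + 1:]), default=0)
    q = b_i
    while True:
        nxt = b_i + sum(ceil((q + j + tau_bit) / t) * c
                        for c, t, j in msgs[:i])
        if nxt == q:
            return j_i + q + c_i   # response time includes own transmission
        if nxt > t_i:              # crude divergence guard
            return None            # treated as unschedulable
        q = nxt

# Three illustrative messages: (C, T, J) in ms, priority order.
msgs = [(0.5, 10, 0.1), (0.5, 20, 0.1), (0.6, 50, 0.1)]
print([can_wcrt(i, msgs, tau_bit=0.002) for i in range(3)])
```

Even the highest priority message has a nonzero blocking term here, reflecting that a CAN frame already on the bus cannot be preempted.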
The expression (6.1) depicts the main parameters affecting the response time. The model
presented here is based on the following assumptions:
The DSPN model of a CAN message is shown in Fig. 6.9. The model is based on
Deterministic Stochastic Petri Nets (DSPN). It is analyzed analytically, and the
method is discussed in detail; the DSPN representation is chosen for better
representation and explanation of the analysis steps. A brief introduction to SPN,
GSPN and DSPN is given in Chapter 3.
In Fig. 6.9, a token in place mesQueued depicts that a message of interest is
queued for transmission. The time taken by the token to reach place mesTxComp is the
response-time of the message. Immediate transitions bFree, hpMesBlock, arbSuc are
the probabilities of the bus being free, of blocking due to a higher priority message, and of
successful arbitration, respectively. Immediate transitions labeled with ~ as prefix
are complementary to the transitions without this symbol. General transitions $t_B$, $t_q$
represent the time associated with blocking and queueing, respectively. Deterministic
transition $t_C$ represents the time associated with worst-case transmission.
To analyze this model, the values of all transitions, immediate (probabilities) and
timed (pdfs), are required. The model gives the response-time distribution of one
message only, so parameter values need to be calculated for every message
whose response-time distribution is required.
The set of messages of the system is denoted by M. Parameters are estimated for a
message m from the set M. In the parameter-estimation subsections, $i \in M$ means all
messages except the message m for which the parameter is being estimated.
Probability of finding bus free (bFree): probability that a message finds the
network free when it gets queued, is estimated based on the utilization of network.
This utilization is by other messages of network.
P^m_{free} = 1 - \sum_{i \in M} \frac{C_i}{T_i}   (6.4)

where

P^C_i = \Pr[\text{non-occurrence of } i\text{th message in time } s_w] = 1 - \frac{1}{T_i} s_w
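Equation (6.4) reduces to one minus the bus utilization of the other messages. A small sketch (the message parameters are hypothetical):

```python
def p_free(msgs, m):
    """Probability that message m finds the bus free when queued (Eq. 6.4):
    one minus the utilization of all other messages. msgs maps id -> (C, T)."""
    return 1.0 - sum(C / T for k, (C, T) in msgs.items() if k != m)

msgs = {1: (0.5, 10.0), 2: (0.5, 10.0), 3: (1.0, 20.0)}
print(p_free(msgs, 1))   # utilization of messages 2 and 3 is 0.1 -> 0.9
```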
6.3 Response-Time Models 141
Blocking time (t_B): a message in the queue can be blocked by any message under transmission by any of the other nodes, because CAN messages in transmission cannot be preempted. The pdf of this blocking time, p_b(t), is obtained by the following steps:

1. Find the ratio r_i of all the messages: r_i = \frac{1/T_i}{\sum_j 1/T_j}, for i, j \in M
3. The message can get ready at any time during the blocking time with equal probability. So, the effective blocking time is given by the following convolution:

p^m_b(t) = \frac{1}{\max(C_i)} \int_0^t p(s) \left[ U(t + s) - U(t + \max(C_i) + s) \right] ds   (6.7)
Blocking time by high priority message (t_q): when the ready node finds the bus free and starts transmission of the ready message, and within the collision time window another node starts transmitting a higher priority message, the node backs off and the message has to wait until this transmission completes. The pdf of the blocking time by high priority messages, p_bhp, is obtained by following steps 1-3 of the blocking time above, with one variation: instead of all messages, only the higher priority messages of the network are considered.

1. Find the ratio r^H_i of all the messages: r^H_i = \frac{1/T_i}{\sum_j 1/T_j}, for i, j \in M, i \in hp(m)
P^m_{TB} = \prod_{i \in M,\, i \in hp(m)} P^i_{TB}   (6.9)

where

g_i(t) = g_{i-1}(t) \circledast p^m_{bhp}(t), \qquad g(t) = p^m_{bhp}(t)

B^m_{hp}(t) = \sum_i B^m_{hp}(i, t)

The symbol \circledast is used to denote convolution.
In the same way, the time to reach place mesTxStarted in the ith attempt from mesReady is modeled as a single r.v. with pdf t_rdy(i, t):

t^m_{rdy}(i, t) = (1 - P_{free})^{i-1} P_{free}\, (B^m_{hp})^{i-1}(t)

where

(B^m_{hp})^{i}(t) = (B^m_{hp})^{i-1}(t) \circledast B^m_{hp}(t), \qquad (B^m_{hp})^{0}(t) = \delta(t)

t^m_{rdy}(t) = \sum_i t^m_{rdy}(i, t)
From the instant the message is queued, it can reach state mesReady either directly or via states mesBlocked and mesBlockedhp. State mesBlocked has an associated time delay. So, using the total probability theorem [12], the total queueing time is given as:

q^m(t) = P^m_{free}\, t^m_{rdy}(t) + \left(1 - P^m_{free}\right) P^m_{TB} \left[ p^m_b(t) \circledast t^m_{rdy}(t) \right] + \left(1 - P^m_{free}\right) \left(1 - P^m_{TB}\right) \left[ p^m_b(t) \circledast p^m_{bhp}(t) \circledast t^m_{rdy}(t) \right]   (6.11)
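Equation (6.11) is a probability-weighted mixture of convolved pdfs. With the pdfs discretized as probability masses on a common time grid, the composition can be sketched as follows (the mass vectors are toy values, chosen only to keep the arithmetic visible):

```python
import numpy as np

def queueing_pdf(p_free, p_tb, t_rdy, p_b, p_bhp):
    """Total queueing-time distribution per Eq. (6.11). Each input pdf is a
    vector of probability masses on a common time grid, so the convolution
    operator becomes np.convolve; results are truncated to the grid length."""
    n = len(t_rdy)
    c1 = np.convolve(p_b, t_rdy)[:n]                       # p_b ⊛ t_rdy
    c2 = np.convolve(np.convolve(p_b, p_bhp), t_rdy)[:n]   # p_b ⊛ p_bhp ⊛ t_rdy
    return (p_free * t_rdy
            + (1 - p_free) * p_tb * c1
            + (1 - p_free) * (1 - p_tb) * c2)

# Toy masses on a 5-bin grid (hypothetical values)
t_rdy = np.array([1.0, 0, 0, 0, 0])      # ready immediately
p_b   = np.array([0, 0.5, 0.5, 0, 0])    # blocking of 1 or 2 bins
p_bhp = np.array([0, 1.0, 0, 0, 0])      # hp blocking of 1 bin
q = queueing_pdf(0.6, 0.7, t_rdy, p_b, p_bhp)
print(q, q.sum())                        # masses still sum to 1
```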
Let t^m_d be the deadline for the message m. Then the value of the cumulative distribution at t^m_d gives the probability of meeting the deadline:

P\left(t \le t^m_d\right) = R^m\left(t^m_d\right)   (6.14)
6.3.1.3 Example
[Figure: pdfs of blocking time and blocking time due to high priority messages; probability (log scale) vs. time (ms)]
[Fig. 6.11: response-time distributions of messages ID 1, ID 9 and ID 17; probability vs. time (ms)]
[Figure: simulated (Sim) vs. analytical (Ana) response-time distributions for messages ID 1, ID 9 and ID 17; probability vs. time (ms)]
In Fig. 6.11 the offset on the time axis is due to blocking when the message is queued. It is the same for all messages irrespective of message priority, because CAN message transmissions are non-preemptive. The slopes of the response-time curves differ and depend on the message priority: the higher the message priority, the steeper the slope.
Response-times from worst-case analysis give an upper bound on response-time, so the probability at these times from the response-time distribution is expected to be very high, or even 1. The values in column 3 of Table 6.4 confirm this. The worst-case response-time from simulation is obtained from a limited simulation (2,000,000 ms [13]); hence there is no consistent probability at these response-times.
The response-time with probability 0.999 is comparable to the worst case for higher priority messages, while it is almost 25% of the worst case for lower priority messages. This is because worst-case analysis assumes all higher priority messages get queued deterministically, while the response-time distribution treats this probabilistically.
[Figure: DSPN model with places mesQueued, mesBlocked, mesTxStarted, mesTxComp and transitions bFree, ~ImesBlock, ~IhpMesBlock, IhpMesBlock, t_B, t_q, t_C]
The basic CAN model of Fig. 6.9 is analyzed again in view of the simulation results. It has been found that the probability of collision in the collision window is almost negligible for all messages, so considering it is of little importance.
In the basic CAN model, the time used to evaluate the probabilities mesBlock and hpmesBlock is the mean of the blocking times t_B and t_q, respectively. This time is changed from the mean to the maximum blocking time in the improved model. The improved CAN model is shown in Fig. 6.13. Computation of the parameters ImesBlock and IhpmesBlock is as below:
Probability of no new higher priority message arrival in t_q (~hpMesBlock): in the improved model the maximum of the blocking time is used.
P^m_{TB} = \prod_{i \in M,\, i \in hp(m)} P^i_{TB}   (6.15)

where

P^i_{TB} = \Pr[\text{non-occurrence of } i\text{th message in blocking time}] = 1 - \frac{1}{T_i} \max[t_B]
148 6 Response-Time Models and Timeliness Hazard Rate
P^m_{TBhp} = \prod_{i \in M,\, i \in hp(m)} P^i_{TBhp}   (6.16)

where

B^m_{hp}(i, t) = \left(1 - P^m_{TBhp}\right)^{i-1} P^m_{TBhp}\, g_i(t)

where

g_i(t) = g_{i-1}(t) \circledast p^m_{bhp}(t), \qquad g(t) = p^m_{bhp}(t)

B^m_{hp}(t) = \sum_i B^m_{hp}(i, t)
In the improved model, a token from place mesQueued can reach place mesTxStarted by following any of these paths: (1) directly, by firing of transition bFree; (2) by firing of transitions ~bFree, t_B and ImesBlock; and (3) by firing of transitions ~bFree, t_B and ~ImesBlock, the loop of transitions t_q and ~IhpMesBlock, and the escaping transition IhpMesBlock. The total queueing time for the improved model is given as:

q^m(t) = P^m_{free}\, \delta(t) + \left(1 - P^m_{free}\right) P^m_{TB}\, p^m_b(t) + \left(1 - P^m_{free}\right) \left(1 - P^m_{TB}\right) \left[ p^m_b(t) \circledast B^m_{hp}(t) \right]   (6.17)
[Figure: response-time distributions obtained with the improved CAN model; probability vs. time (ms)]
6.3.2 MIL-STD-1553B

As per [18], the delay for a cyclic service network can be simply modeled as a periodic function such that \tau^{SC}_k = \tau^{SC}_{k+1} and \tau^{CA}_k = \tau^{CA}_{k+1}, where \tau^{SC}_k and \tau^{CA}_k are the sensor-to-controller and controller-to-actuator delays in the kth cycle, respectively.

[Fig. 6.15: two nodes A and B connected over a MIL-STD-1553B bus with bus controller BC]
Consider a network with two nodes as shown in Fig. 6.15. The BC of the network controls the transfer of data on the network. The network delay for data transfer from node A to B is defined as

\tau^{AB} = t^A_{suc} - t^A_Q   (6.19)

In this equation \tau^{AB} is the network delay experienced by a message at node A for transfer to node B, t^A_{suc} is the time of successful transfer of data from node A, and t^A_Q is the time of queuing of data by node A for transfer to B.
Node A is allowed to transmit its data to B periodically under the command of the BC. As node A and the BC are not synchronized, the waiting time (queuing time to the time of actual start of transmission) will have a uniform distribution over the range [0, \tau^{AB}_{mil}], where \tau^{AB}_{mil} is the time period or cycle time of data transfer from A to B. Once node A gets its turn for message transfer to B, it starts putting the frame on the bus. The transmission delay has two components, framing time and propagation delay. Framing time is proportional to the number of bytes to be transferred, while propagation delay is due to the length of the media connecting nodes A and B. In terms of waiting time, frame time and propagation time, the network delay can be written as the sum of these components.
For a given data size and pair of nodes, framing time and propagation time are constant; the sum of the two is referred to as the transmission time. With the assumption that data is not corrupted during framing or propagation (i.e. no
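The delay components described above, a uniform waiting time over the transfer cycle plus a constant transmission time, can be sketched with a small Monte-Carlo estimate; the cycle, framing and propagation values below are hypothetical:

```python
import random

def delay_1553(cycle_ms, frame_ms, prop_ms):
    """One sample of the A->B network delay: uniform waiting time over the
    cycle (node A and the BC are unsynchronized) plus a constant
    transmission time (framing + propagation)."""
    wait = random.uniform(0.0, cycle_ms)
    return wait + frame_ms + prop_ms

random.seed(1)
samples = [delay_1553(10.0, 0.4, 0.01) for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(round(mean, 2))   # mean delay is cycle/2 + transmission ≈ 5.41
```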
6.3.3 Ethernet

We will start with a myth about Ethernet performance: that the Ethernet system saturates at 37% utilization. The 37% number was first reported by Bob Metcalfe and David Boggs in their 1976 paper titled "Ethernet: Distributed Packet Switching for Local Computer Networks". The paper considered a simple model of Ethernet, using the smallest frame size and assuming a constantly transmitting set of 256 stations. With this model, the system reached saturation at 1/e (36.8%) channel utilization. The authors warned that this was a simplified model of a constantly overloaded channel, and did not bear any relationship to normally functioning networks. Yet the myth persists that Ethernet saturates at 37% load.
Ethernet is a 1-persistent CSMA/CD protocol. 1-persistent means that a node with ready data senses the bus for a carrier and tries to acquire the bus as soon as it becomes free. This protocol has a serious problem following a collision: the nodes involved keep on colliding with this 1-persistent behavior. To resolve this, Ethernet has the BEB (binary exponential backoff) algorithm. Incorporation of BEB introduces random delays and makes the response time non-deterministic.
In this section, we propose a method to estimate the response-time distribution for a given system. The method is based on stochastic processes and operations, and requires the definition of model parameters.
[Figure: DSPN model of Ethernet message response-time, with places mesReady, mesBlocked, BusFree, BusAcq, CollisionCounter, MesTransComp and transitions bFree, ~bFree, t_BEB, t_tx, P_NC, ~P_NC]
Computation delay is negligible for some nodes (sensors, actuators); for controller nodes it is finite and, for analysis purposes, can be assumed constant. Sampling time due to phase difference is modeled as a uniform distribution:

\tau^i_{samp} \sim \mathrm{Unif}\left(0, t^i_{samp}\right)   (6.23)

So, the pdf of the sample-to-actuation delay is the convolution of all the variables of the above equation:

d^{a-s}(t) = d_{SC}(t) \circledast \mathrm{unif}\left(0, t^C_{samp}\right) \circledast \delta\left(t - t^C_{comp}\right) \circledast d_{CA}(t) \circledast \mathrm{unif}\left(0, t^A_{samp}\right)   (6.24)
[Figure: redundant networked architecture with input processing nodes (IPNs), a communication channel and 2/3-voting output processing nodes (OPNs)]
OPNs also run a cyclic program of acquiring outputs from their respective IPNs and generating commands for the actuators/manipulators. Similar to IPNs, the time elapsed from the instant of reception of commands at the communication interface to their acquisition is a random variable \tau^O_{acq} with domain (0, cycle time of OPN), and the time required for generation of commands for actuators is \tau^O_{gen} with domain \mathbb{R}.
Let us take a deterministic case, in which all the random variables have fixed values. The density function of a fixed random variable is the Dirac delta function, \delta(t). Response-time density functions of the random variables of channel A are given in Table 6.5.
6.4 System Response-Time Models 155
\dot{F}_A(s) = \dot{F}_1 \circledast \dot{F}_2 \circledast \dot{F}_3 \circledast \dot{F}_4 \circledast \dot{F}_5\,(s)

and

F^C(t) = \int_0^t \dot{F}^C(s)\, ds = U(t - 16)
Table 6.6 Notations and distributions of random variables

Notation        Mean value   Distribution   Density function
\dot{F}_1(t)    3            Uniform        \frac{1}{6}\left(U(t) - U(t-6)\right)
\dot{F}_2(t)    2            Exponential    \frac{1}{2}e^{-t/2}
\dot{F}_3(t)    5            Exponential    \frac{1}{5}e^{-t/5}
\dot{F}_4(t)    3            Uniform        \frac{1}{6}\left(U(t) - U(t-6)\right)
\dot{F}_5(t)    1            Exponential    e^{-t}
In this case the random variables for time to acquire inputs are assumed to be uniformly distributed over the cycle time. This assumption is based on the fact that the instant of change of an input and the instant of acquisition are statistically independent. The random variable for time to process (or transmit/generate) at the IPN (communication channel/OPN) is assumed to be exponentially distributed. This assumption is based on the fact that these systems (IPN/OPN/communication channel) have complex interactions within them, making them non-deterministic and memoryless. Memoryless means that the remaining time to process does not depend on how long the item has been processed. A memoryless random variable in the continuous domain is exponential.
Density functions of the random variables are given in Table 6.6. Using (3–3) and (3–4), response-time densities and distributions can be obtained; the response-time distribution for a channel, e.g. channel A, follows. Assuming the channel response-time distribution is the same for all three channels, the system response-time distribution can be calculated using (3–8).
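The channel response-time distribution of this example can be reproduced by numerical convolution of the Table 6.6 densities; a sketch, with the grid step and horizon chosen arbitrarily:

```python
import numpy as np

dt, T = 0.01, 60.0
t = np.arange(0, T, dt)

def uniform(a, b):                      # pdf of Unif(a, b) sampled on the grid
    return ((t >= a) & (t < b)).astype(float) / (b - a)

def expo(mean):                         # pdf of an exponential with given mean
    return np.exp(-t / mean) / mean

def conv(f, g):                         # density convolution on the grid
    return np.convolve(f, g)[:len(t)] * dt

# Table 6.6: F1 ~ Unif(0,6), F2 ~ Exp(mean 2), F3 ~ Exp(mean 5),
# F4 ~ Unif(0,6), F5 ~ Exp(mean 1)
f = uniform(0, 6)
for g in (expo(2), expo(5), uniform(0, 6), expo(1)):
    f = conv(f, g)
f /= f.sum() * dt                       # renormalize after discretization

F = np.cumsum(f) * dt                   # channel response-time CDF
mean = np.sum(t * f) * dt
print(round(mean, 1))                   # mean delay ≈ 3 + 2 + 5 + 3 + 1 = 14
```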
In networked systems, to make the system fault-tolerant, its node groups (sensor, controller and actuator) can have redundancy. This redundancy could be of any form and type: active/passive, hot, cold or warm [19], etc. Here MooN (M-out-of-N) redundancy is considered. A generic diagram is shown in Fig. 6.19.
[Figure: Pr[Response-time < t] vs. time for channel and system response-time distributions]
[Fig. 6.19: generic diagram of a networked system: a physical process with redundant sensor (S), controller (C) and actuator (A) node groups (nodes 1, 2, ..., n) connected over a network; each node has sampling (samp) and computation (comp) delays, with SC and CA network delays]
All nodes in a group are independent. Each sensor node of the sensor group samples the physical parameter independently. Parameters from the network are common to all the nodes of a group. For parameters from the network, nodes might wait for copies of the same message from multiple nodes of the sending group. For example, if the sensor group has triple redundancy, controller nodes might wait for data from two different nodes of the sensor group for majority voting. As all nodes are time-triggered, a controller node performs 2oo3 voting on the recent messages from the sensor group.
Let the redundancy of each group be denoted as M^i ooN^i, with i \in \{S, C, A\}. Receiving nodes of a data item wait for data from at least M^i nodes of the sending group i. To model response-time in this scenario, we define the group delay time. The group delay time of a group is defined as the time difference of successful

D^i(t) = \sum_{n = M^i}^{N^i} \binom{N^i}{n} \left[ D^i_1(t) \right]^n \left[ 1 - D^i_1(t) \right]^{N^i - n}   (6.27)
With this, the response-time model of the system in Fig. 6.19 is given as:

r(t) = d^S(t) \circledast d^C(t) \circledast d^A(t)   (6.28)
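Equation (6.27) is the binomial probability that at least M^i of the N^i independent nodes have delivered by time t. A pointwise sketch:

```python
from math import comb

def group_cdf(d1, M, N):
    """Group delay distribution D^i(t) per Eq. (6.27): probability that at
    least M of N independent nodes, each with single-node delay CDF value
    d1 = D_1^i(t), have delivered by time t."""
    return sum(comb(N, n) * d1**n * (1 - d1)**(N - n) for n in range(M, N + 1))

# 2oo3 group: if each node has delivered with probability 0.9 by time t,
# the group has with probability 3*0.81*0.1 + 0.729 = 0.972
print(round(group_cdf(0.9, 2, 3), 3))
```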
In the above analysis, we considered that nodes are asynchronous. The analysis is valid only for the initial or first response, as nodes and network controllers carry out their activities periodically. Local clocks of nodes as well as of network controllers (if any) are very stable. So, if the system response-time is x at the first cycle after startup, chances are very high that it will remain x in the consecutive cycle(s). Response-times of consecutive cycles are correlated. It can be inferred that any variation in consecutive response-times will be due to drift in clocks, failure of a node/controller, or restoration of a node/controller.
A networked real-time system able to meet its deadline at startup may fail after operating for time t, due to drift in clocks, or failure and restoration of nodes/network controllers.
6.4.3.1 CAN
Network delay on CAN varies mainly due to traffic and priority among messages. A message with lower priority has higher variation than higher priority messages. The network delay of a message in one cycle is not related to the delay in the previous cycle, as the interacting traffic is independent. So, the network delay in each cycle is independent and follows the pdf given by (6.12).
6.4.3.2 MIL-STD-1553B
Similarly,

d^{AB}(t; i+n) = p_{pre}\, d^{AB}(t; n-1) + \left(1 - p_{pre}\right) d^{AB}(t) = p^n_{pre}\, \delta(t - x) + \left(1 - p^n_{pre}\right) d^{AB}(t)   (6.31)

If p_{pre} = 1, i.e. there is no mismatch in clocks, then the nth cycle will also have the same network delay as the ith cycle. When p_{pre} < 1, the network delay in the nth cycle will be given by d^{AB}(t).
Similarly, if the receiving node has received the message in the present cycle before a specified time t_a, then the distribution of the delay time in the next cycle is given by the above equations. Let the conditional density be denoted as d^{AB}(t|t_a); the Dirac delta function in the above equations is replaced by this conditional density function.
Nodes and network channels may fail, be repaired and be restored back to operation. When a node or network channel is not available because of a failure, it might affect the system response-time distribution and the probability of meeting deadline(s).
Systems usually have robust control algorithms to tolerate message delays and drops. Message delays and drops have the effect of delaying the control action of networked real-time systems. A delay beyond a specified time in taking control action is considered a timeliness failure.
Let the system failure criterion be n consecutive deadline violations (or timeliness failures). When n = 1, the number of cycles at which timeliness failure occurs follows a geometric distribution [12].

P(Z = i) = p^{i-1} q   (6.32)

where Z is the random variable, q is the probability of occurrence of timeliness failure, and p = 1 - q.
The geometric distribution is memoryless in discrete time and is the counterpart of the exponential distribution in continuous time [12]. At a gross level (larger time scale), it can easily be converted to an exponential distribution. In the exponential distribution the characterizing parameter is the hazard rate, which in this case is referred to as the "timeliness hazard rate".
\lambda_T = \frac{1}{t} \ln\!\left( \frac{1}{P\left(Z > \frac{t}{t_C}\right)} \right)   (6.33)
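For n = 1, P(Z > t/t_C) = p^{t/t_C}, so (6.33) collapses to the constant rate -\ln(p)/t_C. A sketch using the Example 1 values (p = 0.99998 per cycle, 50 ms cycle time):

```python
import math

def timeliness_hazard(p, t_c):
    """Timeliness hazard rate per Eq. (6.33) for n = 1: with per-cycle
    success probability p, P(Z > t/t_c) = p**(t/t_c), so the rate reduces
    to -ln(p)/t_c, independent of t (i.e. exponential)."""
    return -math.log(p) / t_c

lam = timeliness_hazard(0.99998, 50e-3)   # per second
print(f"{lam:.3e}")                       # ≈ 4.0e-04 per second
```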
[Fig. 6.20: DTMC for the timeliness failure process: states 0, 1, ..., n-1 (OK) with failure-probability q transitions toward the absorbing state n (Fail) and success-probability p transitions back to state 0]
When n > 1, the number of cycles to timeliness failure does not follow a geometric distribution. This process has memory and cannot directly be modeled as a Markov process. Using the technique of additional states [20], a Markov model can still be used; a DTMC for this process is shown in Fig. 6.20. Clearly, a general equation for arbitrary n cannot be given, which creates a computational problem for higher values of n. Also, it is preferable to map this process to a continuous-time process, as that enables modeling of timeliness failures along with hardware failures in system dependability modeling.
One way to evaluate the timeliness hazard rate is by using (6.33). An alternate method is to estimate the hazard rate from its definition [12]: the hazard rate at any given time t is the conditional instantaneous probability of failure at time t, given that the system has survived up to time t [12]:

h(t) = \frac{f(t)}{R(t)} = \frac{f(t)}{1 - F(t)}   (6.34)

where F(t) is the failure distribution (CDF), f(t) the failure density (pdf) and R(t) the reliability.
Equation (6.34) requires F(t) to be a differentiable function, while the DTMC gives F(t) values at discrete points only. Using techniques of discrete mathematics, these discrete values are used to evaluate the timeliness hazard rate. MATLAB code for deriving the timeliness hazard rate is given in Appendix A.
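A Python sketch of the discrete-point approach (the book's Appendix A uses MATLAB): build the Fig. 6.20 DTMC, extract F at each cycle, and form the discrete hazard h_i = (F_i - F_{i-1}) / (1 - F_{i-1}):

```python
import numpy as np

def dtmc_cdf(q, n, cycles):
    """F(i): probability of n consecutive timeliness failures within i
    cycles, from the DTMC of Fig. 6.20. States 0..n-1 count the current
    run of consecutive misses; state n (Fail) is absorbing."""
    P = np.zeros((n + 1, n + 1))
    for s in range(n):
        P[s, 0] = 1 - q          # deadline met: the run resets
        P[s, s + 1] = q          # another consecutive miss
    P[n, n] = 1.0                # Fail is absorbing
    pi = np.zeros(n + 1)
    pi[0] = 1.0
    F = np.empty(cycles)
    for i in range(cycles):
        pi = pi @ P
        F[i] = pi[n]
    return F

F = dtmc_cdf(q=2e-5, n=2, cycles=2000)
# Discrete hazard per cycle: h_i = (F_i - F_{i-1}) / (1 - F_{i-1})
h = np.diff(F, prepend=0.0) / (1.0 - np.concatenate(([0.0], F[:-1])))
print(f"{h[-1]:.2e}")            # settles near q**2 = 4e-10, i.e. constant
```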
6.5.1 Example 1

Let the probability of meeting the specified deadline be p = 0.99998 per cycle, and the cycle time be 50 ms. The timeliness hazard rate \lambda_T for three timing failure criteria, n = 1, 2 and 3, as evaluated using (6.33), is given in Table 6.7.
Timeliness failure probabilities are estimated using the DTMC of Fig. 6.20 and using the exponential distribution with the hazard rates of Table 6.7.
A plot of the timeliness hazard rate with time does not show any trend, as shown in Fig. 6.21. All three hazard rates are constant with time, so the exponential distribution can be used to model the failure distribution [12, 20]. The difference in estimated probabilities is plotted in Fig. 6.22.

6.5 Timeliness Hazard Rate 163

[Fig. 6.21: timeliness hazard rate vs. time (x cycle time), log scale, for n = 1, 2 and 3]
[Fig. 6.22: difference in estimated probabilities vs. time (x cycle time)]
The process with more than one consecutive failure has memory at the micro-scale, i.e. at the cycle level, but at a larger time scale it may not. If there is no timeliness failure up to time t, then the chance of a timeliness failure in time t + dt is independent of t. This means the system state at t w.r.t. timeliness failures is as good as new.
[Fig. 6.23: schematic of a sensor-and-controller (SC) node and an actuator (A) node interconnected by three communication networks (1, 2, 3)]
6.5.2 Example 2

Consider a two-node system with three networks interconnecting them. One node acts as the sensor & controller (SC) node and the other acts as the actuator (A) node. The interconnecting network system is taken to be CAN in one case and MIL-STD-1553B in the other. The schematic of the system is shown in Fig. 6.23.
System components may fail and be restored back by means of repair. In degraded states (the system has some faulty components but is still functional) the response time changes. In all working states the system's response-time is computed using the method discussed earlier in this chapter.
In this example system, the message set of the previous example is taken. Messages with 2 bytes and 10 ms cycle time are considered for both cases. Other messages are taken on the network; the system has a total of 17 messages as per Table 6.2. Messages for the present case I have message IDs 8, 9 and 12 (with message ID 12 having a 10 ms period). For case I the following healthy and degraded states are considered:
[Figure: system response-time distributions (probability vs. time in ms) for case I, for the states: no fault; 1 SC node faulty; 1 A node faulty; 1 SC and 1 A node faulty]
[Figure: system response-time distributions (probability vs. time in ms) for case II]
affect the system response-time. For case II the following healthy and degraded states are considered:
6.6 Summary

Network-induced delays are important for NRT systems, as they are a cause of system degradation, failure and sometimes loss of stability. NRT control algorithms that consider probabilistic network delay have better control QoP. In this chapter, methods to probabilistically model the network-induced delay of three field-bus networks, CAN, MIL-STD-1553B and Ethernet, are proposed. CAN is a random access network; for its response-time analysis, various model parameters (probabilities and blocking-time pdfs) need to be evaluated from the message specifications. The effect of hot network redundancy on the system delay time of these networks is analyzed. The method is extended to evaluate sample-to-actuation delay and response-time. A fault-tolerant networked computer system has a number of nodes within its sensor, controller and actuator groups; the effect of this redundancy on system response-time is also analyzed. Assuming the probability of missing the deadline in each cycle is constant, and given a failure criterion, a method to derive the timeliness hazard rate is given. This method derives the hazard rate from a discrete-time process.
References
1. Wesly WC, Chi-Man S, Kin KL (1991) Task response time for real-time distributed systems
with resource contentions. IEEE Trans Softw Eng 17(10):1076–1092
2. Diaz JL, Gracia DF, Kim K, Lee C-Gun, Bello LL, Lopez JM, Min SL, and Mirabella O
(2002) Stochastic analysis of periodic real-time systems. In: Proceedings of the 23rd IEEE
real-time systems symposium (RTSS’02)
3. Diaz JL, Lopez JM, Gracia DF (2002) Probabilistic analysis of the response time in a real
time system. In: Proceedings of the 1st CARTS workshop on advanced real-time
technologies, October
4. Mitrani I (1985) Response time problems in communication networks. J R Statist Soc (Series
B) 47(3):396–406
5. Muppala JK, Varsha M, Trivedi KS, Kulkarni VG (1994) Numerical computation of
response-time distributions using stochastic reward nets. Ann Oper Res 48:155–184
6. Trivedi KS, Ramani S, Fricks R (2003) Recent advances in modeling response-time
distributions in real-time systems. Proc IEEE 91:1023–1037
7. Muppala JK, Trivedi KS (1991) Real-time systems performance in the presence of failures.
IEEE Comp Mag 37–47
8. Sevick KC, Mitrani I (1981) The distribution of queueing network states at input and output
instants. J ACM 28(2):353–471
9. Tindell K, Burns A, Wellings AJ (1995) Calculating controller area network (CAN) message
response times. Control Eng Prac 3(2):1163–1169
10. Tindell KW, Hansson H, Wellings AJ (1994) Analyzing real-time communications:
controller area network (CAN). In: Proceeding of real-time symposium, pp 259–263,
December
11. Nolte T, Hansson H, Norstrom C (2002) Minimizing can response-time jitter by message
manipulation. In: Proceedings of the 8th real-time and embedded technology and application
symposium (RTAS’02)
12. Trivedi KS (1982) Probability & Statistics with Reliability, Queueing, and Computer Science
Applications. Prentice-Hall, Englewood Cliffs
13. Nolte T, Hansson H, Norstrom C (2003) Probabilistic worst-case response-time analysis for
the controller area network. In: Proceedings of the 9th real-time and embedded technology
and application symposium (RTAS’03)
14. Law M, Kelton WD (2000) Simulation Modeling and Analysis. McGraw Hill, New York
15. Nolte T, Hansson H, Norstrom C, Punnekkat S (2001) Using bit-stuffing distributions in can
analysis. In: IEEE/IEE real-time embedded systems workshop (RTES’01), December
16. Hansson H, Norstrom C, Punnekkat S (2000) Integrating reliability and timing analysis of
can-based systems. In: Proceedings of WCFS’2000-3rd IEEE international workshop on
factory communication systems, pp 165–172, September
17. Lindgren M, Hansson H, Norstrom C, Punnekkat S (2000) Deriving reliability estimates of
distributed real-time systems by simulation. In: Proceeding of 7th international conference on
real-time computing system and applications, pp 279–286
18. Tipsuwan Y, Chow M-Y (2003) Control methodologies in networked control systems.
Control Eng Prac 11(10):1099–1111
19. Johnson BW (1989) Design and analysis of fault-tolerant digital systems. Addison Wesley,
Reading
20. Cox DR, Miller HD (1970) The theory of stochastic processes. Methuen, London
Chapter 7
Dependability of Networked
Computer-Based Systems
7.1 Introduction
7.2 Background
The system under discussion is a networked system. A networked system has two basic elements: (i) nodes, (ii) network. Nodes are the users of the network and perform the functional part of the system. The network provides a medium for communication between nodes and is responsible for the timely behavior of nodes. The network consists of network controller(s) (if any), cables, connectors, hubs/switches, etc. These elements can fail and might have different impacts on the overall system dependability attributes.
When only one processor is available, the task has to be executed sequentially, so the task completion time distribution F_1(t) will be hypoexponential [14] with parameters a_1 and a_2. This is given as [11]:

F_1(t) = \Pr[T_1 \le t] = \begin{cases} 1 - \frac{a_2}{a_2 - a_1} e^{-a_1 t} + \frac{a_1}{a_2 - a_1} e^{-a_2 t}, & a_1 \ne a_2 \\ 1 - e^{-a_1 t} - a_1 t\, e^{-a_1 t}, & a_1 = a_2 \end{cases}   (7.2)
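Equation (7.2) can be evaluated directly; note the separate branch for a_1 = a_2 (the Erlang-2 case):

```python
import math

def hypoexp_cdf(t, a1, a2):
    """F1(t) = Pr[T1 <= t] per Eq. (7.2): completion time of two sequential
    exponential stages with rates a1 and a2."""
    if a1 != a2:
        return 1 - (a2 / (a2 - a1)) * math.exp(-a1 * t) \
                 + (a1 / (a2 - a1)) * math.exp(-a2 * t)
    # Equal rates: Erlang-2 limit of the first branch
    return 1 - math.exp(-a1 * t) - a1 * t * math.exp(-a1 * t)

print(round(hypoexp_cdf(2.0, 1.0, 1.0), 3))   # 1 - 3*e**-2 ≈ 0.594
```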
The model was solved for system unavailability due to (i) deadline violations and (ii) arrivals when all processors are down.
The reliability model of the NRT system proposed here is conceptually similar to the above model. The key features of the proposed model are as follows:
1. task arrivals are periodic
2. task response-times follow general distributions
3. failure-repair activities at nodes are independent of each other
4. the system has shared communication links for information exchange; shared links have delays which may not be constant
5. the system may have different redundancy configurations
In Chap. 6, it was shown that timeliness failures are dependent on the system state. The reliability of NRT systems is evaluated using the following two-step process:
172 7 Dependability of Networked Computer-Based Systems
7.3.2 Analysis

Let the timeliness hazard rate from the various healthy states of the system be denoted as \lambda^T_{i,j,k,l}, where i, j, k, l denote the number of UP sensor, controller and actuator nodes and the number of UP network channels, respectively. The probability of not reaching the Failure state by time t due to the timeliness hazard rates from the healthy system states is the NRT system reliability R(t).

\frac{dF(t)}{dt} = \sum_{i \in S} \sum_{j \in C} \sum_{k \in A} \sum_{l \in N} \lambda^T_{i,j,k,l}\, p^H_{i,j,k,l}(t)   (7.4)

R(t) = 1 - F(t)   (7.5)
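Equations (7.4) and (7.5) can be integrated numerically once the UP-state probabilities p^H(t) and the per-state hazard rates are known. A minimal forward-Euler sketch with a hypothetical single always-UP state:

```python
def reliability(hazards, p_up, t_end, dt=0.1):
    """R(t) per Eqs. (7.4)-(7.5): accumulate dF/dt = sum_k lambda_k * p_k(t)
    over the UP states. hazards[k] is the constant timeliness hazard rate of
    state k; p_up(k, t) returns that state's probability at time t."""
    steps = int(round(t_end / dt))
    F = 0.0
    for i in range(steps):
        t = i * dt
        F += sum(lam * p_up(k, t) for k, lam in hazards.items()) * dt
    return 1.0 - F

# Hypothetical: one state, always UP, timeliness hazard 1e-4 per hour
hazards = {0: 1e-4}
R = reliability(hazards, lambda k, t: 1.0, 100.0)
print(round(R, 4))   # ≈ 1 - 1e-4 * 100 = 0.99
```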
7.3.2.1 Example 1

To illustrate the model, an example system with two node groups, sensor (the controller node is clubbed with the sensor node) and actuator, is considered. Each node group has 2oo3 redundancy. Figure 7.2 shows the node state-space considering hardware failures. States with more than the tolerable number of node failures (i.e. 2 in this case) are termed node failure states. Nodes cannot be repaired back from any of the failure states, while in the case of repairable systems, nodes can be repaired from the healthy set of system states. In Fig. 7.2 the repair rate \mu will be zero in the case of non-repairable mission-critical systems.
In this example, there are 2 types of functional nodes and the network is assumed to be failure free. So, the NRT system will have 4 UP states. From the technique
7.3 Reliability Modeling 173
[Fig. 7.2 Generic Markov models for node groups and networks. (a) State-space of node groups (S, C and A): states nF, nF-1, ..., mF, mF-1 with failure rates \lambda^* and repair rates \mu. nF denotes the total number of UP nodes at the beginning; mF denotes the number of nodes required to be UP for the functional group to be UP. From healthy states repair may be possible; from the failure state repair is not possible. (b) State-space of network: states nN, nN-1, ..., 0 with failure rates \lambda^* and repair rates \mu. nN denotes the total number of UP network channels at the beginning; repair of a network channel may be possible.]
developed in Chap. 6, the timeliness hazard rate for each state can be obtained. Once these hazard rates are available for each of the non-failure states, the state transition diagram for timeliness failure is as shown in Fig. 7.3.
Safety-critical systems differ from other computer-based systems (control and monitoring) in their mode of operation. Other computer-based systems may need to change their response continuously, while safety systems need to be in either of two states: (i) operate (i.e. allowing the EUC to operate), and (ii) safe (i.e. shutting down/stopping the EUC). That is, in the absence of any safety condition the safety system allows the EUC to be in the operational state, and on assertion of any safety condition the safety system takes the EUC to the safe state. So, in the case of detectable failures, the safety system shall take fail-safe action.
Safety-critical systems are designed to minimize the probability of unsafe failures by incorporating design principles such as fail-safe behavior and testability. To derive the safety model for safety-critical NRT systems, the following assumptions are made.
7.4.1 Assumptions

1. All nodes use an indulgent protocol. An indulgent protocol ensures safety even when messages arrive late or corrupted [16-18].
2. Safe (unsafe) failure of any group leads to safe (unsafe) failure of the system.
3. When the system is in the safe state, unsafe failure cannot happen.
4. Proof-tests are carried out at the system level as a whole.
[Fig. 7.3: state transition diagram for timeliness failure: healthy states from (nS, nC, nA, nN) down to (mS, mC, mA, mN) transition to the Fail state with rates \lambda_{ijkl}, ..., \lambda_{xyza}]
p^{DU} = p_{S = \{4\}}\, p_{C \in \{1,2,4\}}\, p_{A \in \{1,2,4\}}\, p^H_N + p_{S \in \{1,2,4\}}\, p_{C = \{4\}}\, p_{A \in \{1,2,4\}}\, p^H_N + p_{S \in \{1,2,4\}}\, p_{C \in \{1,2,4\}}\, p_{A = \{4\}}\, p^H_N   (7.6)
7.4 Safety Modeling 175
[Fig. 7.4: Markov model with states OK (1), Dr (2), FS (3) and FDU (4); transition rates \lambda_{1,2}, \lambda_{1,3}, \lambda_{2,3} and \lambda_{2,4}, repair rates i\mu and j\mu (0 < i < N), and proof-test repair rate \mu_P(t)]
When the NRT system is in a DU state, a demand arrival will lead to unsafe failure of the system. This is shown pictorially in Fig. 7.5.
The probability of the NRT system being in a DU state can be estimated from the independent models of the functional nodes and network. Once this probability is known, PFaD(t) can be estimated.

\frac{dP_{DEUC}(t)}{dt} = \lambda_{arr} \sum p^{DU}_{i,j,k,l}(t)

PFaD(t) = P_{DEUC}(t)   (7.7)
where i \in \{S_1, S_2, S_3, S_4\}, j \in \{P_1, P_2, P_3, P_4\}, k \in \{A_1, A_2, A_3, A_4\}, l \in \{N_H, N_F\}.

mAv(t) = 1 - \sum_{i,j,k,l} p_{(i,j,k,l)}(t)   (7.9)
A generic Markov model for NRT system availability is shown in Fig. 7.7. Continuing with the philosophy of independence of groups, and evaluation of the system state based on the individual group states, the availability of the NRT system can
7.5 Availability Modeling 177
Timeliness failures are of importance when the NRT system is in an UP state. For any working state of the system, the timeliness failure hazard rate can be evaluated using the technique discussed in the preceding chapter; the timeliness hazard rate for a given system state is constant, as derived there. Evaluation of the equivalent timeliness hazard rate for NRT systems can be modeled as a reward-rate problem [19, 20]. Here the system evolves because of hardware failures and repairs, and the timeliness hazard rate is taken as the reward rate of the corresponding states.

\lambda^T(t) = \frac{\sum_{k \in UP} \lambda^T_k\, p_k(t)}{\sum_{k \in UP} p_k(t)}   (7.10)
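Equation (7.10) is a reward-rate style weighted average over the UP states; a sketch with hypothetical state probabilities and hazard rates:

```python
def equivalent_hazard(lam, p):
    """Equivalent timeliness hazard rate per Eq. (7.10): the per-state
    hazards lam[k] averaged with the UP-state probabilities p[k] as
    weights (reward rates over the UP states)."""
    up = list(lam)
    num = sum(lam[k] * p[k] for k in up)
    den = sum(p[k] for k in up)
    return num / den

# Two UP states (hypothetical): healthy and degraded
lam = {"healthy": 1e-6, "degraded": 5e-5}
p = {"healthy": 0.98, "degraded": 0.01}
print(f"{equivalent_hazard(lam, p):.3e}")
```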
[Figure: generic Markov availability model with group failure rates n_S\lambda, n_P\lambda, n_A\lambda, n_N\lambda, repair rates \mu_S, \mu_P, \mu_A, \mu_N, degraded (Dr) states and expected reward rate E[\Lambda(t)]]
7.6 Example
[Figure: SPN model of the example system with places P0, P1, P3, P4, P6 and transitions T0-T9; output measures UC1 = P{#P0=3}, UC2 = P{#P3=3}, CombProb = P{(#P0=3) AND (#P3=3)}]
[Figure: probability vs. time (hr), log scale, for failure criteria n = 2 and n = 3]
[Figure: SPN model with subnets for the SC group, A group and demand on EUC (DEUC), including places P1, P2, P6, P7, P11, P21 and transitions T2, T3, T6, T7, T10, T11, T21, T31, T61, T71]
7.7 Summary
Models are derived for three dependability attributes of NRT systems: reliability, availability and safety. Appropriate engineering assumptions are made about NRT systems used in different applications. It was found that, since a timeliness failure does
[Fig. 7.11 panels: PFaD(t), distribution of undangerous failure, distribution of safe failure, and state probabilities on a logarithmic scale, each against time from 0 to 8000 hr]
Fig. 7.11 Plot of PFaD and other state probabilities with time
not require any repair and a system that has not failed can be restarted instantaneously, timeliness failures do not affect the safety and availability attributes: in the safety and availability models the system goes to the fail-safe or unavailable state only for a short (or negligible) time. In these models, timeliness failures affect only the EUC availability. The timeliness hazard rate is modeled as a reward rate, and the mean reward rate serves as an index of overall timeliness failures. In reliability modeling, timeliness failures are one source of system failure, so they are considered in the system reliability model.
[Figure: SPN model with subnets SC (P1, T1, T01), A (P11, T11, T02) and Net (P12, T12)]
References
1. Borcsok J, Schwarz MH, Holub P (2006) Principles of safety bus systems. In: UKACC
Control Conference, Universities of Glasgow and Strathclyde, UK, September 2006
2. Borcsok J, Ugljesa E, Holub P (2006) Principles of safety bus systems-part II. In: UKACC
Control Conference, Universities of Glasgow and Strathclyde, UK, September 2006
3. Elia A, Ferrarini L, Veber C (2006) Analysis of Ethernet-based safe automation networks
according to IEC 61508. In: IEEE Conf. on Emerging Technologies and Factory Automation
(ETFA ’06), pp 333–340, September 2006
4. Rushby J (2001) Bus architectures for safety-critical embedded systems. In: Embedded
Software, Lecture Notes in Computer Science 2211, Springer, pp 306–323
5. Rushby J (2001) A comparison of bus architectures for safety-critical embedded systems.
Technical report, June 2001
6. Vasko DA, Suresh Nair R (2003) CIP Safety: Safety networking for the future. In: 9th
International CAN Conference, CAN in Automation (iCC 2003), Munich, Germany
7. IEC 61508: Functional safety of electric/electronic/programmable electronic safety-related
systems, Parts 0–7; October 1998–May 2000
8. Avizienis A, Laprie J-C, Randell B (2000) Fundamental concepts of dependability. In: Proc.
of 3rd Information Survivability Workshop, pp 7–11, October 2000
9. Johnson BW (1989) Design and analysis of fault-tolerant digital systems. Addison Wesley,
New York
Appendix A: MATLAB Codes

MATLAB programs are used throughout the book for analysis and plotting of results. Source code of the important programs is attached here. The codes are arranged chapter-wise.
Code for evaluation of the safety measure PFaD and the manifested availability mAv for a 1oo2 system.
Code: 1
% ***********Input Parameters ********************
prmtr = load('parameters1oo2O.txt');
LSafe = prmtr(1,1);
LDang = prmtr(2,1);
MeanRepairTime = prmtr(3,1);
DiagCov = prmtr(4,1);
PrTestCov = prmtr(5,1);
Tproof = prmtr(6,1);
RunTime = prmtr(7,1);
MeanTimeBetweenDemands = prmtr(8,1);
CommonCause2 = prmtr(9,1);
CommonCause3 = prmtr(10,1);
NumberofRepairStn = prmtr(11,1);
% **********Derived Parameters *******************
% (mirrors the derived-parameter block of the 2oo3 listing below)
L1 = LSafe + DiagCov*LDang;
L2 = (1-DiagCov)*LDang;
M = 1/MeanRepairTime;
La = 1/MeanTimeBetweenDemands;
a = PrTestCov;
B = CommonCause2;
B2 = CommonCause3;
Tp = Tproof;
n = floor(RunTime/Tp);
s = RunTime - n*Tp;
N = NumberofRepairStn;
B2_ = 1 - B2;
B_ = 1 - B;
Alfa = 1-(2-B2)*B;
L = L1 + L2;
P0 = [1 0 0 0 0 0]';
% (The 6-state rate matrix Q and the proof-test matrix Delta are defined
% analogously to the 2oo3 listing below; their lines are not reproduced here.)
% ***************************************************
E = Delta*expm(Q*Tp);
I = eye(size(Q));
%EPn = (1/RunTime)*(inv(Q)*(expm(Q*s)-I)*inv(I-E)*(I-E^(n+1))*P0);
Temp = inv(Q)*(expm(Q*Tp)-I)*inv(I-E)*(I-E^n)*P0;
meanProb = Temp/RunTime; % time-averaged state probabilities (cf. the EPn line above)
anaPFaD = 1 - ones(size(P0'))*meanProb;
anaFs = [0 1 1 0 1 0]*meanProb;
anaS = [1 0 0 1 0 0]*meanProb;
for i = 0:RunTime/10,
Time(i+1) = 10*i;
n = floor(Time(i+1)/Tp);
s = Time(i+1)-n*Tp;
Pn(i+1) = (ones(size(P0')))*expm(Q*s)*E^n*P0;
F(i+1) = 1-Pn(i+1);
end
% PFaD_t = 1 - ones(size(Pn'))*Pn;
% PFaD = 1 - ones(size(EPn'))*EPn;
% [Tp PFaD]
plot(Time, F);
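The evaluation pattern of the listing, evolving the state probabilities with expm(Q·t) between proof tests and applying a restoration matrix Delta at each test, so that P(t) = expm(Q·s)·(Delta·expm(Q·Tp))^n·P0, can be sketched in Python/NumPy. The 2-state model and the perfect-repair Delta below are illustrative stand-ins, not the book's 1oo2 model:

```python
import numpy as np

def expm(M, terms=40):
    """Matrix exponential via truncated Taylor series; adequate for the
    small rate matrices used in this sketch."""
    out = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

lam = 1e-4                       # illustrative failure rate (1/hr)
Q = np.array([[-lam, 0.0],       # dP/dt = Q @ P
              [ lam, 0.0]])
Delta = np.array([[1.0, 1.0],    # perfect proof test: all probability
                  [0.0, 0.0]])   # returns to the working state
P0 = np.array([1.0, 0.0])
Tp = 1000.0                      # proof-test interval (hr)

def state_probs(t):
    """P(t) after n full proof-test intervals and a residual time s."""
    n, s = divmod(t, Tp)
    E = Delta @ expm(Q * Tp)
    return expm(Q * s) @ np.linalg.matrix_power(E, int(n)) @ P0

print(state_probs(1500.0))
```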
Code for evaluation of the safety measure PFaD and the manifested availability mAv for a 2oo3 system.
Code: 2
% ***********Input Parameters ********************
prmtr = load('parameters2oo3w.txt');
LSafe = prmtr(1,1);
LDang = prmtr(2,1);
MeanRepairTime = prmtr(3,1);
DiagCov = prmtr(4,1);
PrTestCov = prmtr(5,1);
Tproof = prmtr(6,1);
RunTime = prmtr(7,1);
MeanTimeBetweenDemands = prmtr(8,1);
CommonCause2 = prmtr(9,1);
CommonCause3 = prmtr(10,1);
% **********Derived Parameters *******************
L1 = LSafe + DiagCov*LDang;
L2 = (1-DiagCov)*LDang;
M = 1/MeanRepairTime;
La = 1/MeanTimeBetweenDemands;
a = PrTestCov;
B = CommonCause2;
B2 = CommonCause3;
Tp = Tproof;
n = floor(RunTime/Tp);
s = RunTime - n*Tp;
B2_ = 1 - B2;
B_ = 1 -B;
Alfa = 1-(2-B2)*B;
L = L1 + L2;
% **********State Transition Rate Matrix *********
% dP/dt = Q*P. The first row of the listing was truncated in the source;
% it is reconstructed here so that the columns of the non-demand states
% sum to zero (the demand states leak probability at rate La, which
% accumulates as PFaD).
Q = [-(3*Alfa+3*B2_*B+B2*B)*L M M M 0 0 0 0 0 0;
3*Alfa*L1 -(2*B_+B)*L-M M 0 0 0 0 0 0 0;
3*B2_*B*L1 2*B_*L1 -(L+M) M 0 0 0 0 0 0;
B2*B*L1 B*L1 L1 -M 0 0 0 0 0 0;
3*Alfa*L2 0 0 0 -(2*B_+B)*L M 0 0 0 0;
0 2*B_*L2 0 0 2*B_*L1 -(L+M) M 0 0 0;
0 0 L2 0 B*L1 L1 -M 0 0 0;
3*B2_*B*L2 0 0 0 2*B_*L2 0 0 -(L+La) M 0;
0 B*L2 0 0 0 L2 0 L1 -(M+La) 0;
B2*B*L2 0 0 0 B*L2 0 0 L2 0 -La];
P0 = [1 0 0 0 0 0 0 0 0 0]';
% ***************************************************
E = Delta*expm(Q*Tp);
I = eye(size(Q));
% EPn = (1/RunTime)*(inv(Q)*(expm(Q*s)-I)*inv(I-E)*(I-E^(n+1))*P0);
Temp = inv(Q)*(expm(Q*Tp)-I)*inv(I-E)*(I-E^n)*P0;
meanProb = Temp/RunTime; % time-averaged state probabilities (cf. the EPn line above)
anaPFaD = 1 - ones(size(P0'))*meanProb;
anaFs = [0 0 1 1 0 0 1 0 0 0]*meanProb;
for i = 0:RunTime/10,
Time(i+1) = 10*i;
n = floor(Time(i+1)/Tp);
s = Time(i+1)-n*Tp;
Pn(i+1) = (ones(size(P0')))*expm(Q*s)*E^n*P0;
F(i+1) = 1-Pn(i+1);
end
% PFaD_t = 1 - ones(size(Pn'))*Pn;
% PFaD = 1 - ones(size(EPn'))*EPn;
% [Tp PFaD]
plot(Time, F);
Basic CAN model: The program shown below performs response-time analysis of CAN messages and is based on the basic CAN model. The message set is defined in the file 'messlist.txt'.
Code: 3
% File name: RespCANBasic.m
messList = load(’messlist.txt’);
messInt = 9;
% Message ID for which response-time distribution is required
ProbDelivery = 0.9999; %0.99999;
ProbBnB = 0.9999;
bittime = 0.007745;
fractColl = 0.88;
% *********************************************************** %
Util = 0.0;
maxC = 0;
for i = 1: 17
C(i) = messList(i,2)*8 + 44 +floor((messList(i,2)*8 + 33)/4);
if i ~= messInt
Util = Util + (C(i)*bittime)/messList(i,3);
if C(i) > maxC
maxC = C(i);
end
end
end
% *********************************************************** %
% ************************************************************* %
Phigh=1;
for i = messInt+1: 17
Phigh = Phigh*(1-(fractColl*bittime)/messList(i,3));
%prob. of no collision
end
% ************************************************************* %
sumTi = 0.0;
for i = 1: 17
if i ~= messInt
sumTi = sumTi + 1/messList(i,3);
end
end
for i = 1: 17
if i == messInt
ri(i) = 0.0;
else
ri(i) = 1/(messList(i,3)*sumTi);
end
end
% ************************************************************ %
temppt = zeros(1,maxC);
for i = 1: 17
if i ~= messInt
temppt(C(i)) = temppt(C(i)) + ri(i);
end
end
pbtfinal = zeros(1,maxC);
for i = 1: length(WaitLen)
pbt(i,:) = WaitVal(i)*(1/WaitLen(i))* ...
[ones(1,WaitLen(i)),zeros(1,maxC-WaitLen(i))];
pbtfinal = pbtfinal+pbt(i,:); % pdf of blocking time
end
% ************************************************************* %
sumT = 0.0;
maxCnew = 0;
for i = messInt+1: 17
sumT = sumT + 1/messList(i,3);
if C(i) > maxCnew
maxCnew = C(i);
end
end
for i = 1: 17
if i > messInt
rbhp(i) = 1/(messList(i,3)*sumT);
else
rbhp(i) = 0.0;
end
end
% ************************************************************ %
pbhpt = zeros(1,maxCnew);
for i = 1: 17
if i > messInt
pbhpt(C(i)) = pbhpt(C(i)) + rbhp(i);
% timepbhp(i) = i*bittime; % ??
end
end
PB_hp = 1;
for i = messInt+1: 17
PB_hp = PB_hp*(1-(maxC*bittime)/messList(i,3));
%prob. of no new hp arrival in Blocking
end
% PB_hp = 1-PB_hp;
PBhp_hp = 1;
for i = messInt+1: 17
PBhp_hp = PBhp_hp*(1-(maxCnew*bittime)/messList(i,3));
%prob. of no new hp arrival in Blocking by new
end
% PBhp_hp = 1-PBhp_hp;
Temp1=[Temp1,zeros(1,max(length(Temp2),length(Temp3))-length(Temp1))];
Temp2=[Temp2,zeros(1,max(length(Temp2),length(Temp3))-length(Temp2))];
Temp3=[Temp3,zeros(1,max(length(Temp2),length(Temp3))-length(Temp3))];
CumResp(1)=Respt(1);
Time(1) = bittime;
for i = 2: length(Respt)
CumResp(i)= CumResp(i-1) + Respt(i);
Time(i) = i*bittime;
end
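The expression C(i) = 8·s + 44 + floor((8·s + 33)/4) used in the listings is the worst-case length, in bits, of a standard-format CAN frame carrying s data bytes: 8·s data bits, 44 fixed overhead bits, and the maximum number of stuff bits. A quick check of the formula in Python:

```python
def can_frame_bits(s):
    """Worst-case transmitted length (bits) of a standard-format CAN
    frame with s data bytes, including maximum bit stuffing."""
    return 8 * s + 44 + (8 * s + 33) // 4

for s in (0, 1, 8):
    print(s, can_frame_bits(s))
```

Multiplying by the bit time (0.007745 ms per bit at the bus rate assumed in the listing) gives the worst-case transmission time of each message.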
The basic CAN response-time model uses two functions to perform the iterative tasks. They are given below:
Function: “fBlock”
function [Q] = fBlock(p, C, x)
% Convolves the blocking-time pdf x with itself i times, weighted by p(i).
% The initialisations of n and prob are not reproduced in the source
% listing; they are reconstructed here as length(p) and p respectively.
n = length(p);
prob = p;
convBlock = 1;
mSize = n*length(x)-(n-1);
q = zeros(n,mSize);
Q = zeros(1,mSize);
for i = 1 : n
convBlock = conv(x,convBlock);
convBlockt = [convBlock, zeros(1,mSize-length(convBlock))];
q(i,:) = prob(i)*convBlockt;
Q = Q + q(i,:);
end
Function: “fReady”
function [Q] = fReady(p, C, x)
% Same reconstruction as in fBlock: n and prob are taken as length(p)
% and p, their original initialisations not being reproduced here.
n = length(p);
prob = p;
convBlock = 1;
mSize = (n-1)*length(x)-(n-2);
q = zeros(n,mSize);
Q = zeros(1,mSize);
for i = 1 : n
convBlockt = [convBlock, zeros(1,mSize-length(convBlock))];
q(i,:) = prob(i)*convBlockt;
Q = Q + q(i,:);
convBlock = conv(x,convBlock);
end
Code: 4
clear all;
messList = load('messlist.txt');
bitTime = 0.007745;
RespTime = zeros(17,200);
The CAN response-time simulation model uses one function to perform the iterative task. It is given below:
Function: “fCANrun”
%20.10.2008: saving time changed from 1000 to 2000
function [Resp] = fCANrun(bitTime, List, NextSch)
Resp = zeros(17,200);
Time = 0.0;
break,
end
end
end
% Update list
NextSch(SelectedMess)=NextSch(SelectedMess)+List(SelectedMess,1);
for i = 1: 17
C(i) = messList(i,2)*8 + 44 + floor((messList(i,2)*8 + 33)/4);
if i ~= messInt
% ************************************************************* %
sumTi = 0.0;
for i = 1: 17
if i ~= messInt
sumTi = sumTi + 1/messList(i,3);
end
end
for i = 1: 17
if i == messInt
ri(i) = 0.0;
else
ri(i) = 1/(messList(i,3)*sumTi);
end
end
temppt = zeros(1,maxC);
for i = 1: 17
if i ~= messInt
temppt(C(i)) = temppt(C(i)) + ri(i);
end
end
pbtfinal = zeros(1,maxC);
for i = 1: length(WaitLen)
pbt(i,:) = WaitVal(i)*(1/WaitLen(i))* ...
[ones(1,WaitLen(i)),zeros(1,maxC-WaitLen(i))];
pbtfinal = pbtfinal+pbt(i,:); % pdf of blocking time
end
% **************************************************** %
sumT = 0.0;
maxCnew = 0;
for i = messInt+1: 17
sumT = sumT + 1/messList(i,3);
if C(i) > maxCnew
maxCnew = C(i);
end
end
for i = 1: 17
if i > messInt
rbhp(i) = 1/(messList(i,3)*sumT);
else
rbhp(i) = 0.0;
end
end
pbhpt = zeros(1,maxCnew);
for i = 1: 17
if i > messInt
pbhpt(C(i)) = pbhpt(C(i)) + rbhp(i);
% timepbhp(i) = i*bittime; % ??
end
end
for i = messInt+1: 17
PB_hp = PB_hp*(1-(maxC*bittime)/messList(i,3));
%prob. of no new hp arrival in Blocking
end
% PB_hp = 1-PB_hp;
for i = messInt+1: 17
PBhp_hp = PBhp_hp*(1-(maxCnew*bittime)/messList(i,3));
%prob. of no new hp arrival in Blocking by new
end
% PBhp_hp = 1-PBhp_hp;
Temp1=[Temp1,zeros(1,max(length(Temp2),length(Temp3))-length(Temp1))];
Temp2=[Temp2,zeros(1,max(length(Temp2),length(Temp3))-length(Temp2))];
Temp3=[Temp3,zeros(1,max(length(Temp2),length(Temp3))-length(Temp3))];
CumResp(1)=Respt(1);
Time(1) = bittime;
for i = 2: length(Respt)
CumResp(i)= CumResp(i-1) + Respt(i);
Time(i) = i*bittime;
end
p = 0.99998;
q = 1-p;
cycTime = 1; %time in seconds
operTime = 1000; % time in hour
P1 = [p q; 0 1];
P2 = [p q 0; p 0 q; 0 0 1];
P3 = [p q 0 0; p 0 q 0; p 0 0 q; 0 0 0 1];
P10 = [1 0];
P20 = [1 0 0];
P30 = [1 0 0 0];
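The matrices P1, P2 and P3 above are discrete-time Markov chains in which the system fails after one, two or three consecutive missed cycles, respectively, with per-cycle success probability p. The failure probability after n cycles is the last entry of the initial distribution propagated through the chain; a sketch for P2:

```python
import numpy as np

p = 0.99998                       # per-cycle success probability
q = 1 - p
# Rows are "from" states: 0 = last cycle OK, 1 = one miss, 2 = failed
P2 = np.array([[p,   q,   0.0],
               [p,   0.0, q  ],
               [0.0, 0.0, 1.0]])
P20 = np.array([1.0, 0.0, 0.0])   # start in the OK state

n_cycles = 3600 * 1000            # 1000 h of 1-s cycles, as in the listing
dist = P20 @ np.linalg.matrix_power(P2, n_cycles)
print(dist[-1])                   # probability of two consecutive misses
```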