Fundamentals of Artificial Intelligence
K. R. Chowdhary
K. R. Chowdhary
Department of Computer Science and Engineering
Jodhpur Institute of Engineering and Technology
Jodhpur, Rajasthan, India
This Springer imprint is published by the registered company Springer Nature India Private Limited.
The registered company address is: 7th Floor, Vijaya Building, 17 Barakhamba Road, New Delhi 110 001, India.
To my wife
Suman Lata
Preface
The course of Artificial Intelligence (AI) is one of the standard courses in all undergraduate computer science curricula and in many master's programs; it is also taught in many multi-disciplinary majors, like Masters in Mathematics and Masters in Information Technology, and in non-CS programs like Electrical Engineering and Electronics and Communication Engineering. In some institutions it is taught under the name Intelligent Systems. The concepts and theories of AI are often applied in many engineering disciplines, like automation, production, packing, optimization, and planning; in the routing of information and goods (in transportation); in message communication and message filtering; in natural language translation, speech synthesis, speech recognition, search engines, information retrieval, information extraction, information summarization, and information filtering; in robotics and computer/machine vision systems; in expert systems for medical diagnosis, fault diagnosis in machines, surveillance, remote sensing, and automated vehicles; in computer games and combinatorial games; and, more recently, in online sales, online education, and many more areas.
The ACM Curricula 2013 defines the scope of Artificial Intelligence as follows (https://www.acm.org/binaries/content/assets/education/cs2013_web_final.pdf):
Intelligent Systems (IS): Artificial intelligence (AI) is the study of solutions for problems that are difficult or impractical to solve with traditional methods. It is used pervasively in support of everyday applications such as email, word-processing and search, as well as in the design and analysis of autonomous agents that perceive their environment and interact rationally with the environment. The solutions rely on a broad set of general and specialized knowledge representation schemes, problem solving mechanisms, and learning techniques. They deal with sensing (e.g., speech recognition, natural language understanding, computer vision), problem-solving (e.g., search, planning), and acting (e.g., robotics) and the architectures needed to support them (e.g., agents, multi-agents).
Contents
This book covers, in sufficient detail, all the topics recommended in the latest ACM "Computer Science Curricula 2013" (the latest available at the time of writing this manuscript). The topics as per the Curricula 2013 that are covered in this text are as follows:
Fundamental Issues
Basic Search Strategies
Basic Knowledge Representation and Reasoning
Basic Machine Learning
Advanced Search
Advanced Representation and Reasoning
Reasoning under Uncertainty
Agents
Natural Language Processing
Advanced Machine Learning
Perception and Computer Vision
In addition, considering their importance and scope in AI, this text covers the following additional topics:
Logic Programming and Prolog
Constraint Satisfaction Problems
Game Theory
Data Mining
Information Retrieval
Speech Recognition
This book is written keeping in mind the diversity of its readership: the concepts and fundamental principles are given the highest significance, and much of the explanation is based on figures and solved examples. This is supplemented with an exhaustive list of exercises at the end of each chapter, which aim to cover a wide range of problem-solving skills for the learner. Each chapter begins with an introduction to the topic of the chapter, its scope, and its relation to other fields of AI, supported with a list of applications to keep the learner constantly motivated from chapter to chapter, followed by the learning outcomes of that chapter. To maintain the student's interest, as well as for ease of understanding, each chapter is organized to build up gradually, from simple concepts to advanced theory and applications. For curious and interested readers, each chapter is supplemented with an exhaustive list of original source references, in the form of recent articles published in journals, transactions, and conference proceedings of AI and CS.
The text is designed for self-study: concept driven, in simple language, and with a large number of exercises, both solved and unsolved. The simplicity of the English used helps an average student grasp the material easily, and makes the book equally suitable for non-English-speaking students whose medium of instruction is English. It covers all the major areas of AI presently in focus, including recent developments in machine learning, machine vision, IR, agents, etc.
The contents of this text are based on the author's teaching of this course over many years at M.B.M. Engineering College, Jodhpur (India), as well as on the AI course taught to undergraduate students at Indian Institute of Technology Jodhpur (IITJ), as a visiting faculty, from the session 2011–12 to 2015–16, during which time the book took its present shape. The slides of the topics taught at IITJ are available to the reader on the author's website (http://www.krchowdhary.com/). The contents of this text have been classroom tested over the years, and have been refined and improved repeatedly over these years of teaching; all the exercises are also classroom tested.
The book is intended primarily for students majoring in computer science at the undergraduate and graduate levels, but will also be of interest as a foundation to researchers in any discipline of AI, either for self-study or for reference. In general, it is suitable for all AI curricula of undergraduate courses in engineering, particularly Computer Science and Engineering, Information Technology, and Electrical Engineering, and for non-engineering curricula, like Master in Computer Applications and Master of Science. Some of the chapters, particularly Chaps. 13–21, may form the course contents of graduate curricula. The chapters can be read in the order they appear in this text. Readers may write to the author (at [email protected] or [email protected]) to point out errors or to offer suggestions and criticism, so as to make improvements in the script for future editions.
The prerequisites for this course are mathematical logic, discrete mathematical structures, and data structures, with a basic level of programming. This course can be taught in the third or final year of an undergraduate program, under the name AI or Intelligent Systems.
Since all the solved, as well as many unsolved, problems in the book are classroom tested, they can go in parallel with the delivery of the respective chapter. A keen learner is recommended to refer to the exhaustive list of references provided at the end of each chapter for more details.
The exercises are designed to continuously develop and strengthen the ideas presented in the book's many proofs and examples. More than 250 exercises and solved examples are provided in the text, to rigorously practice the material and to consolidate the principles presented, strengthening the overall understanding of the topics.
First and foremost, I thankfully acknowledge all the authors mentioned in the bibliography/references of this text, whose publications were helpful for teaching this course of AI, as well as in bringing this text to its final shape.
One of the major contributions to be acknowledged is due to my students of the AI class over a duration of one and a half decades, who interacted in many ways while I was teaching this course: by asking interesting questions, sometimes as curious observers, and sometimes by raising their doubts. All of this led me to rethink, to analyze their point of view, and to discover new ways to present the ideas. It also helped me to understand the subject even better during 1997–2013, when I taught undergraduate batches of CSE, IT, and Master's students of computer applications at M.B.M. Engineering College, Jodhpur.
The major part of this text took shape during the time I taught at Indian Institute of Technology Jodhpur (IITJ) (with course nomenclatures CS365, CS 30002, CS 324, and CS323) during the academic years 2011–12 to 2015–16. Many thought-provoking questions and interactions by students during this time helped in organizing and presenting the material in a better and more digestible form, and many helpful and constructive comments in their feedback not only improved the delivery and the quality of the problem solutions and exercises, but were also a constant force to reshape the content from academic year to academic year.
I would particularly like to mention their names; the list is long, and many more names exist than are explicitly mentioned here. These include: Ashish Kumar,
Praneeth A. S., Sonu Mehta, Manu Agarwal, Akansha Saran, Ruchi Toshiwal,
Sonal Gupta, Tokla Sai Teja, Mahesh, Kalpnath Rao, Rishi Mishra, Syed Navaid
Ahmad, Jitendra Singh Garhwal, Pitta Divya Shree, Rishi Mishra, Jinak Jain, Avan
Jayendra Rathod, Nithin V., Pavan V. Sukalkar, Sumit Jangid, Siddarth Singh Rao,
Banoth Surya Prasad, Saurabh Kumar Gangwar, Abhishek Bassan, Arvind Pandey,
Shivam Verma, Dhanajit Brahma, Pranjal Singh, Ujjwal Anand, Jalaj Sharma, and
Shrey Maheshwari.
I am thankful to the former Director of IITJ, Prof. P. Kalara, and the next Director, Prof. C. V. R. Murty, for providing me the opportunity to continuously teach the courses of AI, semester after semester, at IIT Jodhpur during the academic years 2011–12 to 2015–16. Thanks are also due to the heads of the CS department of IITJ at that time, Dr. Venkata Ramana Badaria and Dr. Gaurav Harit, for their constant support while I was teaching at the CSE department of IITJ. My thanks to the IITJ librarian, Ms. Kshema Prakash, for providing the needed library support from time to time.
I am thankful to my guru Prof. V. S. Bansal, former Professor and Head of the Department of Electrical Engineering, Professor Emeritus, and former Dean of M.B.M. Engineering College, whose every meeting with me starts and ends with technical discussions on AI. Thanks also to Prof. (late) Rajesh Purohit, Prof. N. C. Barwar, and Prof. Anil Gupta of M.B.M. for their active involvement in discussing the advances in AI from time to time.
Completing the script for a book is, in general, a job of deep involvement over many years, and particularly so for this one, which took an enormous amount of time for the preparation of lecture notes, reference to original material in journals, and filtering and putting the text into digestible form for the intended audience. This naturally consumed a big share of the time that belonged to my life partner Suman Lata, and to my children Pratibha, Keerti, Prabhat, and Garvita (dauhitri). So, in many indirect ways, their abundant contribution exists in completing this script, and they are rightfully entitled to many thanks for their hidden contribution in bringing the book to this shape.
Chapter 1
Introducing Artificial Intelligence

Abstract The field of Artificial Intelligence (AI) has interfaces with almost every discipline of engineering, and beyond, and all of those can benefit from the use of AI. This chapter presents an introduction to AI: its roots and sub-domains, the Turing test to judge whether a given program is intelligent, the goals of AI for engineers and scientists, the basic requirements for AI, the symbol system, and the requirements for knowledge representation. It concludes with a chapter summary and an exhaustive list of exercises. In addition, it raises many questions to ponder over, such as consciousness, and whether it is possible to create artificial consciousness.
1.1 Introduction
The following are often-quoted definitions, all expressing this notion of intelligence, but with a different emphasis in each case:
Partly, the aim of Artificial Intelligence (AI) is to find the answers to these questions through the tools provided by AI. These tools matter because AI offers the medium, as well as the test-bed, for theories of intelligence: such theories can be expressed in the form of computer programs, and can be tested and verified by running those programs on computers.
Unlike physics and chemistry, AI is still a young field; hence, its structure, objectives, and procedures are less clearly defined, and not as clear as those in physics and chemistry. AI has been more concerned with expanding the limits of computers than with defining itself.
Learning Outcomes of this Chapter:
1. Defining AI. [Familiarity]
2. Describe the Turing test thought experiment. [Familiarity]
Fig. 1.1 The Turing test: an interrogator in a closed room communicates (e.g., "cat", "dog") through dumb terminals T1, T2, T3, one of which is connected to a computer running an intelligent program
The following are common arguments for and against the Turing test as a measure of intelligence:
1. It takes the human being as a reference for intelligent behavior, rather than debating over the true nature of intelligence: against.
2. Unmeasurable things are not considered, e.g., whether a computer uses internal structures, or whether the machine is conscious of its actions; such questions are currently not answerable: against.
3. It eliminates any bias toward human-oriented interaction mechanisms, since a computer terminal is used as the communication device: for.
4. It is biased towards only symbolic problem solving: against.
5. Perceptual skills or dexterity cannot be checked: against.
6. It unnecessarily constrains machine intelligence to human intelligence: against.
Though the counts against far outnumber those for, there is as yet no known test which is considered better than the Turing test.
In a partial success of the Turing test, a computer deceived humans into thinking it was human; the machine modeled the intelligence of a young boy, and became the first machine reported to pass the Turing test, in an experiment conducted in June 2014. In this experiment, five machines were tested at the Royal Society in central London, to check whether they could fool people into thinking they were human; the conversation was in the form of text. A computer program named "Eugene Goostman", developed to simulate a young boy, convinced one-third of the judges that it was human [5].
1.3 Goals of AI
AI is the area of computer science aiming at the design of intelligent computer systems, i.e., systems that exhibit the characteristics of intelligence that we observe in human behavior, for example, the ability to understand language(s), and the abilities of learning, reasoning, and problem solving [9].
For many researchers, the goal of AI is to emulate human cognition; for some, it is the creation of intelligence without considering any human characteristics. To many other researchers, AI is aimed at creating useful artifacts for the comforts and needs of humans, without any criterion based on an abstract notion of intelligence.
The above variation in aims is not necessarily wrong, as each approach uncovers new ideas and provides a base for pursuing research in AI. However, there is a convincing argument that, due to the absence of a proper definition of AI, it is difficult to establish what can and what cannot be done through AI.
One of the goals of studying AI is to create intelligence in machines as a general property, not necessarily based on any attribute of humans. When this is the goal, it also includes the objective of creating artifacts for human comforts and needs, which can be the driving force of technological development. However, this goal also requires a notion of intelligence to start with.
Further, the problem gets compounded, since an artifact's manufacturer may say their product is better in terms of saving labor and money, while cognitive scientists may say that their system correctly predicts human behavior. Apart from this dispute, without a proper theoretical base for AI, it is not wise to build a complex system that one is expected to have confidence in, and whose performance and errors can be analyzed.
The definition of AI should be such that, for any system, it covers the input, the output, and their relationship based on the structure of the system. Such a definition needs to be as general as possible, so that it can be uniformly applicable. In the absence of such a definition, one takes AI to be whatever exists in chess playing, in automated vehicle driving, or in a medical expert system for diagnosis; these approaches to the definition of AI vary from case to case.
The scientific goal of AI is to determine theories about knowledge representation, learning, rule-based systems, and search that explain various sorts of intelligence. The engineering goal of AI is to give the machine the ability to solve real-life problems. The basic techniques used by AI for this purpose are knowledge representation, machine learning, rule systems, and state-space search.
In the past, computer scientists and engineers used to be more concerned with the engineering goals, while psychologists, philosophers, and cognitive scientists were more keen on the scientific goals. In spite of these differing concerns, there are common techniques that the two approaches can feed to each other. Hence, we will proceed with both goals in mind.
1.4 Roots of AI
The field of AI does not live in isolation; it has significant roots in a number of older disciplines, particularly:
Philosophy,
Logic/Mathematics,
Computing,
Psychology/Cognitive Science,
Biology/Neuroscience, and
Evolution.
Among the above domains there is significant overlap, for example, between philosophy and logic, and between mathematics and computation. By looking at each of these in turn, we get a better understanding of their role in AI, and of how these fields have developed to play that role [6, 10].
1.4.1 Philosophy
The evidence of philosophy goes as far back as Socrates' time (∼400 BC), when he asked for an algorithm to distinguish piety (a reverence of a supreme being) from non-piety. In around 300 BC, Aristotle formulated various types of deductive reasoning approaches, to mechanically generate conclusions from initial premises. One approach of deductive reasoning he used was modus ponens, now also a standard technique of inference in predicate and propositional logic. It is stated as:
If "A is True" → "B is True", and "A holds True", then conclude that "B holds True".
As an example of this inference: "If it is raining then you get wet; it is raining; therefore you must have got wet".
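To see how such an inference rule can be mechanized, the following is a minimal sketch in Python (an illustration, not the book's code; the rule base and fact names are invented) that applies modus ponens repeatedly to a set of known facts:

    # Minimal modus ponens engine: rules are (antecedent, consequent) pairs.
    # From the known facts, conclude B whenever a rule A -> B exists and A holds.
    def modus_ponens(facts, rules):
        derived = set(facts)
        changed = True
        while changed:
            changed = False
            for antecedent, consequent in rules:
                if antecedent in derived and consequent not in derived:
                    derived.add(consequent)   # apply: from A and A -> B, infer B
                    changed = True
        return derived

    rules = [("raining", "wet"), ("wet", "need_towel")]
    print(modus_ponens({"raining"}, rules))   # {'raining', 'wet', 'need_towel'}

The repeated application of the single rule in the loop above is, in effect, a tiny forward-chaining inference procedure of the kind used by later rule-based systems.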
The philosopher Rene Descartes (1596–1650) introduced the concept of mind-body dualism, which says that part of the mind is exempted from following the physical laws; from this he drew the conclusion of free will. Even in the present time, when AI, machine learning, and data science are in dominant use, it is argued that machines cannot supersede humans in intelligence, as they do not have free will, and need to be assigned their goals by a human.
Gottfried Wilhelm Leibniz (1646–1716), a German philosopher and mathematician who supported the materialist nature of mind, said that the mind operates by ordinary physical processes. In the present context, this means that mental processes can be performed by machines.
1.4.2 Logic and Mathematics
Logic has a history of development from the time of the Greek philosophers Plato and Aristotle (∼300–400 BC); however, the more recent development has come at a rapid pace, due to the following:
• Earl Stanhope's Logic (1777), with which Stanhope demonstrated a machine capable of solving problems using the inference rule called the syllogism, numerical problems of a logical nature, and elementary questions based on the theory of probability.
• George Boole (1815–1864) introduced a language-based formal logic for drawing logical inferences in 1847, which later became popular under the name Boolean Algebra.
• Gottlob Frege (1848–1925) introduced the first-order logic that today forms the most common knowledge representation system, called FOPL (first-order predicate logic).
• Kurt Gödel (1906–1978), in 1931, demonstrated that there are limits to logic. Through his incompleteness theorem, he showed that in any formal logic powerful enough to describe the properties of the natural numbers, there exist true statements whose truth cannot be proved by any algorithm.
• Roger Penrose, in 1995, tried to prove that the human mind has non-computable capabilities.
1.4.3 Computation

1.4.4 Psychology and Cognitive Science
Cognitive Psychology, or Cognitive Science, is the study of the functioning of the mind, human behavior, and how the human brain processes information. An important consequence of human intelligence is human language. The early work on knowledge representation in AI concerned human language, and was produced through research in linguistics.
It is the human quest to understand how our own and other animals' brains lead to intelligent behavior, with the aim of ultimately building AI systems. On the other hand, the aim is also to explore the properties of artificial systems, such as computer models/simulations, to test our hypotheses concerning human systems.
Many people working in sub-fields of AI are in the process of building models of how the human system operates, and of using artificial systems to solve real-world problems.
1.4.5 Biology and Neuroscience
The field of neuroscience tells us that human brains, which provide intelligence, are made up of tens of billions of neurons, with each neuron connected to hundreds or thousands of other neurons. A neuron is an elementary processing unit that performs a function called firing, depending on the total amount of activity feeding into it. When a large number of neurons are connected together, they give rise to a very powerful computational device that can compute, as well as learn how to compute.
The concept of the human brain having the capability to compute, as well as to learn, is used to build artificial neurons in the form of electronic circuits, and to connect them as circuits (called ANNs, artificial neural networks) in large quantities, to build powerful AI systems. In addition, ANNs are used to model various human abilities.
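As a concrete illustration of a single artificial neuron, the following minimal Python sketch (an illustrative assumption, not taken from the book) fires when the weighted input activity crosses a threshold, and learns its weights with the classic perceptron update rule:

    # A single artificial neuron (perceptron): fire iff the weighted sum of
    # the input activity exceeds a threshold; learn weights from labeled data.
    def fire(weights, bias, inputs):
        activity = sum(w * x for w, x in zip(weights, inputs)) + bias
        return 1 if activity > 0 else 0

    def train(samples, epochs=20, lr=0.1):
        weights, bias = [0.0] * len(samples[0][0]), 0.0
        for _ in range(epochs):
            for inputs, target in samples:
                error = target - fire(weights, bias, inputs)  # perceptron rule
                weights = [w + lr * error * x for w, x in zip(weights, inputs)]
                bias += lr * error
        return weights, bias

    # Learn the logical AND function from its truth table.
    data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
    w, b = train(data)
    print([fire(w, b, x) for x, _ in data])   # [0, 0, 0, 1]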
The major difference between the functioning of neurons and the process of human reasoning is that neurons work at a sub-symbolic level, whereas much of conscious human reasoning appears to operate at a symbolic level; for example, we do most of our reasoning in the form of thoughts, which are manipulations of sentences.
Collections of neurons in the form of programs, called Artificial Neural Networks (ANNs), perform well at executing simple tasks and provide good models of many human abilities. However, there are many AI tasks at which ANNs are not so good, and for which other approaches are more promising. For example, for natural language processing (NLP) and reasoning, symbolic logic, in the form of predicate logic, is better suited.
1.4.6 Evolution
Unlike machines, human intelligence has a very long history of evolution, of millions of years, compared to less than a hundred years for electronic machines and computers. The first exhaustive document of human evolution, evolution by natural selection, is due to Charles Darwin (1809–1882). The idea is that fitter individuals naturally tend to live longer and produce more children (which may not be truly valid in the modern world). Hence, after many generations, a population will automatically emerge with good innate properties [3].
Due to this evolution, the structure of the human brain, and even some knowledge, are to a sufficient extent built in at the time of birth. This is an advantage over ANNs, which have no pre-stored knowledge, and hence need to acquire all their knowledge by learning alone. However, present-day computers are powerful enough that even evolution can be simulated on them, and AI systems can thereby be evolved. It has now become possible to evolve neural networks to some extent so that they are efficient at learning; but it may still be challenging to recreate the long history of human evolution in ANNs.
A field closely related to ANNs is genetic programming, which is concerned with writing programs that evolve over time, without needing to be modified by hand, as is done with usual programs when the system requirements change [4].
Right from the time automated machines like computers came into existence, it has been the quest of researchers to build machines that can compete in intelligence with humans. Looking at the current rate of progress of smart machines, like smartphones, it is believed that in the not very distant future it may be possible to build machines with intelligence comparable, if not superior, to that of humans. Using such machines, it may be possible to produce human-like consciousness in machines, called artificial consciousness.
On the contrary, even a far-off realization of artificial consciousness gives rise to several questions of a philosophical nature:
AI systems show a lot of variation: for example, rule-based systems are based on symbolic representations and work by inference, while, at the other extreme, ANN-based systems work through the interactions among neurons and their connection weights. In spite of all this variation, there are four features common to all of them.
Representation
All AI systems have the important feature of knowledge representation. Rule-based systems, frame-based systems, and semantic networks make use of sequences of if-then rules, while artificial neural networks make use of connections along with connection weights.
Learning
All AI systems have the capability of learning, using which they automatically build up knowledge from the environment, e.g., acquiring the rules for a rule-based expert system, or determining the appropriate connection weights in an artificial neural network.
Rules
The rules of an AI-based system can be implicit or explicit. When explicit, the rules are created by a knowledge engineer, say, for an expert system; when implicit, they can be, for example, in the form of connection weights in a neural network.
Search
Search can take many forms: for example, searching for the sequence of states that leads to a solution fastest, or searching for an optimum set of connection weights in an ANN by minimizing an error (cost) function.
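To make the first form of search concrete, here is a minimal breadth-first state-space search in Python (a sketch; the toy state graph and goal are invented for illustration), which returns the shortest sequence of states from a start state to the goal state:

    from collections import deque

    # Breadth-first search over a state space: returns the shortest path of
    # states from `start` to `goal`, or None if the goal is unreachable.
    def bfs(start, goal, successors):
        frontier = deque([[start]])
        visited = {start}
        while frontier:
            path = frontier.popleft()
            state = path[-1]
            if state == goal:
                return path
            for nxt in successors(state):
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append(path + [nxt])
        return None

    # Toy state graph: states are letters, edges are the legal moves.
    graph = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": ["F"], "E": ["F"]}
    print(bfs("A", "F", lambda s: graph.get(s, [])))   # ['A', 'B', 'D', 'F']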
1.7 Sub-fields of AI
Most of these sub-fields have both engineering and scientific aspects, and many of them are discussed in this text. The following is a brief introduction to some of these areas.
Fig. 1.2 Sound waves for the text "Hello" repeated five times by a five-year-old child
1.7.1 Speech Processing
The processing and understanding of human speech has a number of applications, some of which we come across quite often: speech recognition for dictation systems, speech production for automated announcements, voice-activated control, human–computer interfaces (HCI), and voice-activated transactions are a few examples.
One of the primary goals is: how do we get from sound waves to text streams, and vice versa? Figure 1.2 is an example, showing the sound wave pattern for the text "Hello" repeated five times.
To be precise, how should we go about segmenting the sound stream into words? How can we distinguish between "Recognize speech" and "Wreck a nice beach"?
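The segmentation problem can be illustrated even on plain text. The sketch below (illustrative only; the small dictionary is invented) enumerates every way a dictionary can carve an unsegmented string into words, which is exactly the kind of ambiguity a recognizer faces when carving a sound stream into words:

    # All ways to segment a spaceless string into dictionary words.
    DICT = {"wreck", "a", "an", "nice", "ice", "beach", "recognize", "speech"}

    def segmentations(s):
        if not s:
            return [[]]
        results = []
        for i in range(1, len(s) + 1):
            if s[:i] in DICT:                    # a word could end at position i
                for rest in segmentations(s[i:]):
                    results.append([s[:i]] + rest)
        return results

    print(segmentations("wreckanicebeach"))
    # [['wreck', 'a', 'nice', 'beach'], ['wreck', 'an', 'ice', 'beach']]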
1.7.2 Natural Language Processing
Consider the machine understanding and translation of simple sentences. Some of the common applications of Natural Language Processing (NLP) are [2]:
1.7.3 Planning
Any planning, and specifically robotic planning, is concerned with choosing (computing) the correct sequence of actions to complete a specific task. This requires a convenient and efficient representation of the problem domain. The plan steps, called states, are defined in a formal language (for example, a predicate language, or rules), depending on the type of planning used. A plan may be taken as a sequence of operations that ultimately transforms the initial state into the goal state; the latter is the solution. The best planning seeks to explore the best path of states for reaching the goal state. Hence, finding the best, or an optimum, path requires exploration, or searching, so as to find the best possible path efficiently in terms of the time and space required to run the planning algorithm.
Knowledge representation forms a major content of the following chapters. Various representation schemes are classified into two categories: symbolic representation and graphical representation. The first is propositional and predicate logic based, while the other is graph based, like semantic networks, frames, ontologies, and conceptual dependencies.
The primary aim of fuzzy, or soft, computing is to exploit the tolerance for imprecision and uncertainty to achieve tractability, robustness, and low-cost applications. Fuzzy systems can be integrated with other techniques such as neural networks and probabilistic reasoning. In fuzzy systems, memberships are partial (fuzzy/unclear). Examples of fuzzy sets are: old, cloudy, high speed, rainy, etc. This is in contrast to classical set-based systems, where the boundaries are crisp, like member of a computer science class, an Indian national, or a chair (member of the set of chairs), where all three represent full membership in the given set.
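A minimal sketch of partial membership, with invented breakpoints: a fuzzy set such as "old" can be encoded as a membership function returning a degree in [0, 1], rather than a crisp yes/no decision:

    # Fuzzy membership: a degree in [0, 1] instead of a crisp in/out decision.
    # The breakpoints (40 and 70 years) are chosen only for illustration.
    def membership_old(age):
        if age <= 40:
            return 0.0                # definitely not a member of 'old'
        if age >= 70:
            return 1.0                # full member of 'old'
        return (age - 40) / 30        # partial membership in between

    for age in (30, 55, 80):
        print(age, membership_old(age))   # 0.0, 0.5, 1.0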
The models of the human brain and of evolution correspond to two major approaches to AI. AI as a model of the brain corresponds to symbolic AI, which has remained the major field of AI. It has the property of a high level of mathematical abstraction, and is considered the macroscopic view of AI. Human psychology operates at a symbolic level, and the AI programming languages and early engineered systems fall into this type of AI.
The other approach to AI is like human evolution, being based on low-level biological and genetic models of living beings. Neural-based computing (artificial neural networks) and genetic algorithms derive from these concepts of life. These biological models of AI may not necessarily resemble their living counterparts verbatim, but AI techniques based on GAs (genetic algorithms) evolve solutions the way populations of humans, or other forms of life, evolve [7].
Neural networks, like expert systems, are modeled on the human brain, and learn by themselves from patterns. This learning can then be applied to classification, prediction, or control applications. GAs are computer models based on genetics and evolution. Their basic idea is that a genetic program works towards finding better and better solutions to a problem, just as species evolve to better adapt to their environments. A GA comprises three basic processes: reproduction of solutions based on their fitness, crossover of genes, and mutation for random change of genes. A broader field than GAs is evolutionary computing, which includes not only GAs but also classifier systems, genetic programming (in which each solution is a computer program), and a part of artificial life.
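The three processes named above can be seen working together in a minimal genetic algorithm. The toy sketch below (illustrative, not from the book) evolves 16-bit strings toward the all-ones string, with fitness-proportional reproduction, single-point crossover, and random mutation:

    import random

    # Toy GA: fitness = number of 1-bits; evolve toward the all-ones string.
    def fitness(s):
        return sum(s)

    def crossover(a, b):
        cut = random.randrange(1, len(a))      # single-point crossover of genes
        return a[:cut] + b[cut:]

    def mutate(s, rate=0.02):
        return [bit ^ (random.random() < rate) for bit in s]  # random gene flips

    population = [[random.randint(0, 1) for _ in range(16)] for _ in range(30)]
    for _ in range(100):
        # Reproduction: fitter solutions are more likely to be chosen as parents.
        parents = random.choices(population,
                                 weights=[fitness(s) + 1 for s in population],
                                 k=60)
        population = [mutate(crossover(parents[i], parents[i + 1]))
                      for i in range(0, 60, 2)]

    print(max(fitness(s) for s in population))   # typically close to 16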
These fields are concerned with vision, speech processing, and robotics. The basic theme is applications that make the machine sense (e.g., see, hear, or touch), then understand and think, and finally take action.
For example, the basic objective of machine vision may be to make the machine "understand" input consisting of reflected brightness values. Once this understanding is achieved, the results can be used for interpreting patterns, inspecting parts, directing the actions of robots, and so forth. Developing such understanding presents the same difficulty in all areas of AI, including knowledge-based systems: people understand what they see by integrating an optical image with complex background knowledge, built up over years of experiencing perceptions. Creating this type of information processing in the machine is a challenging task; however, some interesting applications have already appeared as evidence to support future progress.
Speech processing uses two major technologies. One of these focuses on input, or speech recognition, where the acoustic input, like the optical input in machine vision, is difficult to automate: people understand what they hear using complex background knowledge. Speech recognition technology includes signal detection, pattern recognition, and possibly semantics, a feature closely related to natural language understanding. The other technology concerns the creation of output, or text-to-speech (TTS) synthesis. Speech synthesis is easier than recognition, and its commercialization is well established.
The field of robotics integrates many techniques of sensing, and is one of the AI areas in which industrial applications have the longest and widest record of success. The abilities of these robots are, however, relatively limited; for example, they succeed only in limited tasks such as welding seams and installing windshields [7].
1.9 Physical Symbol System Hypothesis
Symbols are the basic requirement of intelligent activity; for humans, the symbols are the basic number systems, the alphabets of our languages, sign language, etc. The same is the case with the whole of computer science: languages, commands, and computations all have symbols as their base. When information is processed by computers, on completion of the task we measure the progress, as well as the quality of the results and the efficiency of the computation, only through the symbol content of the end results [8].
If we look at the entire knowledge of computer science, it is symbols that have been used to explain this knowledge at the most fundamental level. The explanation is nothing but a scientific proposition of nature, empirically derived over a long period of time through gradual development. Hence, symbols are at the root of artificial intelligence, and are also a primary topic of artificial intelligence.
For all the information processed by computers in the service of reaching end goals, the intelligence of the system (the computer) is its ability to reach those goals in the face of difficulty and complexity of the solution, as well as the complexity introduced by the environment. The fundamental requirement for achieving artificial intelligence is to store and manipulate symbols; however, there is no uniform or specific requirement on the storage structures, as these vary with the method used for implementing the AI, and are mostly variants of network-based and predicate-based representations.
The "physical systems" used have two important characteristics: (1) the operation of the systems is governed by the laws of physics, when they are realized as engineered systems made of engineered components; and (2) the "symbols" are not limited to the symbols used by human beings.
The "physical symbol system" hypothesis has its roots in Russell's formalization of logic, which sought to capture the basic conceptual notions of mathematics in logic, and to put proof and deduction on a sound basis. This notion, with effort, ultimately grew into mathematical logic: propositional logic, predicate logic, and their variants [13].
The second generation of computers brought the stored-program concept in the mid-forties, after the ENIAC computer. The arrival of these computers was considered a milestone in terms of conceptual progress, as well as the practical availability of systems. In such systems, programs are treated as data, which, in turn, can be processed by other programs; for example, a compiler processes another program as data to generate object code. Interestingly, this capability was already present, and verified, in the Turing machine, which came as early as 1936. The Turing machine is a model of computing given by Alan M. Turing, in which, in a universal Turing machine, an algorithm (another Turing machine) and its data reside on the very same tape. This idea was realized practically when machines were built with enough memory to make it practicable to store actual programs internally, along with the data on which the programs act, as well as the data produced as a result of executing the programs.
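The "program as data" idea can be made concrete in a few lines of Python. In the sketch below (an illustrative encoding, not the book's notation), the machine's transition table is itself just a data structure handed to a general interpreter, exactly as a stored program is data to the computer that runs it:

    # A minimal Turing machine interpreter. The "program" is the transition
    # table delta: (state, symbol) -> (new state, symbol to write, head move).
    def run_tm(delta, tape, state="q0", halt="halt", blank="_"):
        tape, head = list(tape), 0
        while state != halt:
            symbol = tape[head] if head < len(tape) else blank
            state, write, move = delta[(state, symbol)]
            if head == len(tape):
                tape.append(blank)
            tape[head] = write
            head += 1 if move == "R" else -1
        return "".join(tape)

    # A tiny machine that flips every bit and halts at the first blank.
    flip = {("q0", "0"): ("q0", "1", "R"),
            ("q0", "1"): ("q0", "0", "R"),
            ("q0", "_"): ("halt", "_", "R")}
    print(run_tm(flip, "10110"))   # 01001_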
1.10 Considerations for Knowledge Representation
As per the Webster English language dictionary, the following are the meanings of knowledge:
1. The act or state of knowing; clear perception of fact, truth, or duty; certain appre-
hension; familiar cognizance; cognition. [1913 Webster]
Knowledge, which is the highest degree of the speculative faculties, consists in
the perception of the truth of affirmative or negative propositions—Locke. [1913
Webster]
2. That which is or may be known; the object of an act of knowing; a cognition—
Chiefly used in the plural. [1913 Webster]
3. That which is gained and preserved by knowing; instruction; acquaintance;
enlightenment; learning; scholarship; erudition. [1913 Webster]
Knowledge is represented in a formal language, for example, a predicate-like language. Such a language has two important components:
Syntax
The syntax of a language defines the methods by which we, or a machine, can distinguish correct structures from incorrect ones, i.e., it makes it possible to identify the structurally valid sentences.
Semantics
The semantics of a language defines the world, or the facts in the world, of the concerned domain, and hence defines the meaning of a sentence with reference to that world.
A good knowledge representation system for any particular domain should possess the following properties:
Adequacy of Representation
The representation system should be able to represent all the kinds of knowledge needed in the concerned AI-based system.
Adequacy of Inference
The representation should be such that everything inferable by manipulating the given knowledge structures can be inferred by the system when needed.
Inference Efficiency
The knowledge structures in the representation should be organized so that the attention of the system, in the form of deductions, navigates in a direction that reaches the goal quickly.
Efficient Acquisition
The system should be able to acquire new information automatically and efficiently, as and when needed, and to update its knowledge regularly. In addition, there should be a provision for a knowledge engineer to update the information in the system.
We become aware of good and bad knowledge representation when we consider knowledge representation in English or any other natural language. The differences are due to factors like syntax, semantics, partial versus full knowledge of a subject, depth and breadth of knowledge, etc.
To carry out the analysis of any representation system, it is useful to break the entire representation into its smallest components, in their most fundamental form. Accordingly, the components of an AI representation are divided into four fundamental components:
Lexical components
The lexical components of a knowledge representation are the symbols and words of the vocabulary used for representation.
Syntactic/Structural components
These describe how the symbols can be arranged systematically to create meaningful sentences. These structures are the grammar of the language used for representation.
Semantic components
These help in associating real-world meaning with objects and entities.
Procedural components
These procedures are used for creating and modifying representations, and also for answering questions using them.
We humans are intelligent beings who make use of knowledge represented in the form of natural language (like English, Hindi, Chinese, etc.); we update that knowledge (i.e., acquisition), and do reasoning and inference using this representation. Of course, humans have many other types of knowledge representation and inference that are not symbol based, like those acquired through smell, touch, hearing, and taste (through the tongue). Hence, why not use natural language for knowledge representation in machines also? The following are the trade-offs of representation using natural language.
Advantages
There are very strong advantages in favor of using natural language for knowledge representation:
• Natural language is strong in expressiveness; using it we can represent almost everything (real-world situations, pictures, symbols, ideas, emotions) and carry out reasoning over those representations.
• It is the medium of knowledge representation most abundantly used by humans; for example, can we list any textbooks not written in natural language? It is hard to do!
Disadvantages
In spite of the strong points in favor of natural language-based representation, there are serious difficulties in realizing such representation for machines, for the following reasons:
• The syntax and semantics of natural language are very complex and not easily formalized; hence it becomes challenging and risky for machines to depend on them alone.
• Uniformity of representation is lacking: sentences carrying identical meaning can be represented in many different syntactic structures.
• There is a lot of ambiguity in natural language: a sentence or word may have many different meanings, and the meanings are context dependent. Hence, it is overly risky to use natural language for machines, unless the machines have intelligence at par with humans.
1.12 Summary
AI has inter-related goals in both scientific and engineering areas. Its roots lie in several historical disciplines, including philosophy, logic, computation, psychology, cognitive science, neuroscience, biology, and evolution.
The major sub-fields of AI now include: neural networks, machine learning, evolutionary computation, speech recognition, text-to-speech translation, fuzzy logic, genetic algorithms, vision systems and robotics, expert systems, natural language processing, and planning. Many of these domains are inter-dependent and inter-related; for example, neural networks are one of the techniques for machine learning. The common techniques used across these sub-fields are knowledge representation, search, and information manipulation.
Human brain and evolution are also the areas of AI modeling.
The study of logic and computers has demonstrated that intelligence lies in a physical symbol system (PSS), a collection of patterns and processes. A PSS needs the capability to manipulate patterns, i.e., it should be able to create patterns, modify patterns, and destroy patterns. The patterns have the important property that they can designate objects, processes, and other patterns. When patterns designate processes, the latter can be interpreted, i.e., the process steps can be performed. The two significant classes of symbol systems we are familiar with are those used by human beings and those used by computers; the latter use binary strings or patterns.
The PSSH (physical symbol system hypothesis) says that to achieve intelligence, it is sufficient to have three things:
i. a representation system, using which anything can be represented,
ii. a manipulation system, using which the symbols can be manipulated, and
iii. search, using which the solution can be searched for.
For the above, it is in fact not important whether the medium of storage is the human brain (neurons) or the electronic memory of computer systems.
Exercises
1. Try to analyze your own learning behavior, and list the goals of learning, in the
order of their difficulty level of learning.
2. List the living beings—human, dog, cow, camel, elephant, cat, birds, insects, in
the order of their intelligence levels. Also, justify your argument.
3. Suggest a method to combine a large number of human beings into a group, and formulate a method/algorithm by which they could perform pipelined or other fast computation.
4. Write an essay describing the various anticipated theories of how humans process information.
5. Consider a task requiring knowledge, like baking a cake. Using your imagination, suggest the knowledge requirements for completing this task.
6. Using your own reasoning, explain the distinction between knowledge and belief.
7. What is the basic difference between neural network-level processing and the processing carried out in human reasoning?
8. What are the major advantages of humans over modern computers?
9. List examples where the PSSH is not sufficient, or is not the basis for achieving intelligence. Justify your claims.
10. Write an essay describing how things (objects/activities) are memorized by:
a. Plants.
b. Birds.
c. Sea animals.
d. Land animals.
e. Human.
What can be the size of memories in each of the above cases?
11. Discuss the potential uses of AI in the following applications:
a. Word processing systems.
b. Smartphones.
c. Web-based auction sites.
d. Scanner machines.
e. Facebook.
f. Twitter.
g. Linkedin.
h. Amazon and Flipkart.
12. How do artificial neural networks (ANN) and genetic algorithms differ from each other in respect of the learning used for problem solution? (Note: You need to explore AI resources to answer this question.)
13. It is an accepted scientific fact that the physical characteristics of life are genetically transferred. Do you believe that information and knowledge are also genetically transferred? Justify your yes/no answer.
14. Are the beliefs of rebirth and reincarnation also goals of AI research? How, or how not?
References
1. Buttazzo G (2001) Artificial consciousness: utopia or real possibility? IEEE Comput 34(7):24–30
2. Church KW, Rau LF (1995) Commercial applications of natural language processing. Commun ACM 38(11)
3. Forrest S (1993) Genetic algorithms: principles of natural selection applied to computation. Science 261(13):872–878
4. Goldberg DE (1994) Genetic and evolutionary algorithms coming of age. Commun ACM 37(3):113–119
5. http://www.theguardian.com/technology/2014/jun/08/super-computer-simulates-13-year-old-boy-passes-turing-test. Accessed 19 Dec 2017
6. Luger GF (2009) Artificial intelligence: structures and strategies for complex problem solving, 5th edn. Pearson, New Delhi
7. Munakata T (1994) Commercial and industrial applications of AI. Commun ACM 37(3)
8. Newell A, Simon HA (1976) Computer science as an empirical inquiry: symbols and search. Commun ACM 19(3):113–126
9. Russell SJ (1997) Rationality and intelligence. Artif Intell 94:57–77
10. Russell SJ, Norvig P (2005) Artificial intelligence: a modern approach, 2nd edn. Pearson, New Delhi
11. Stanford Encyclopedia of Philosophy. http://plato.stanford.edu/entries/turing-test/. Accessed 19 Dec 2017
12. Turing AM (1950) Computing machinery and intelligence. Mind LIX(236):433–460
13. Whitehead AN, Russell B (1910) Principia mathematica, vol 1 (Part I. Mathematical logic). Merchant Books. ISBN 978-1-60386-182-3
Chapter 2
Logic and Reasoning Patterns
Abstract Logic is the foundation of AI, and the majority of AI's principles are based on logical or deductive reasoning. The chapter presents: the contributions of the pioneers of logic; argumentation theory, which is based on logic and has its roots in propositional logic; the process of validating propositional formulas; their syntax and semantics; and the interpretation of a logical expression through a semantic tableau, followed by the basic reasoning patterns used by humans and their formal notations. In addition, it presents the normal forms of propositional formulas and the application of the resolution principle to them for inference. Nonmonotonic reasoning and its significance are briefly described. At the end, the chapter presents the axiomatic system due to Hilbert and its limitations, and concludes with a chapter summary.
2.1 Introduction
The ancient Greeks are the source of modern logic; their education system emphasized competence in rhetoric (proficiency in language) and philosophy, and the words axiom and theorem are from Greek. Logic was used to formalize deduction, the derivation of true conclusions from true premises. Later it was formalized as set theory by the mathematician George Boole. Until the nineteenth century, logic remained of a more philosophical nature, rather than a mathematical and scientific tool. Later, since complex things could not be reasoned about with informal logic, logic became part of mathematics, where mathematical deduction became justifiable through a formalized system of logic, and this resulted in one very important breakthrough concerning the set of true statements, stated as: "the set of provable statements are only those that are true statements"; a statement is provable because some proof exists for it from other true statements.
At the beginning of the twentieth century, the mathematician David Hilbert introduced theories of the nature of logic, a far more general treatment of logic. But this generalization received a blow when another mathematician, Kurt Gödel, showed in 1931, through his incompleteness theorem, that there are true statements of arithmetic that are not provable.
Now, though mathematical logic remains a branch of pure mathematics, it is extensively applied to computer science and artificial intelligence in the form of propositional logic and predicate logic (first-order predicate logic, FOPL).
As per Newell and Simon's Physical Symbol System Hypothesis (PSSH), discussed in the previous chapter, knowledge representation is the first requirement for achieving intelligence. This chapter presents knowledge representation using propositional logic, introduces first-order predicate logic (FOPL), and covers the drawing of inferences using propositional logic.
Logic is a formal method for reasoning: using it, concepts can be translated into a symbolic representation that closely approximates their meaning. The symbolic structures can then be manipulated by computer programs to deduce facts, carrying out a form of automated reasoning [9].
The aim of logic is to learn the principles of valid reasoning: to discern good reasoning from bad, identify invalid arguments, distinguish inductive from deductive arguments, and identify, as well as avoid, fallacies.
The objective of logic is to equip oneself with various tools and techniques, i.e., decision procedures, for validating given arguments, and for detecting and avoiding the fallacies of a given deductive or inductive argument.
We study logic for the following reasons:
• Logic deals with what follows from what, for example, logical consequence, inference patterns, and validating such patterns;
• We want the computer to understand our language and do some intelligent tasks for us (knowledge representation);
• To engage in debates, solve puzzles, and handle game-like situations;
• To identify which argument is fallacious, and what type of fallacy it exhibits;
• To prove theorems through deduction, i.e., to find out whether whatever is proved is correct, and whether whatever is obviously true has a proof; and
• To solve some problems concerning the foundations of mathematics.
5. Apply formal logic proofs and/or informal, but rigorous, logical reasoning to real
problems, such as predicting the behavior of software or solving problems such
as puzzles. [Usage]
6. Explain the difference between rule-based and model-based reasoning techniques.
[Familiarity]
7. Describe the strengths and limitations of propositional logic. [Familiarity]
Argumentation theory is the study of how conclusions can be reached through logical reasoning, that is, whether claims are soundly based on premises or not (Fig. 2.1). It includes the arts and sciences of civil debate, dialog, conversation, and persuasion, and the study of rules of inference, logic, and procedural rules in both artificial and real-world settings.
An argumentation system comprises debate and negotiation, aimed at reaching a mutually agreeable conclusion. Argumentation may also consist of erroneous dialogs, where victory over an opponent is the only goal, without consideration of the truth. Argumentation theory is an art as well as a science; using it, people protect their self-interests and beliefs through rational dialog in everyday settings, and during the process of argumentation.
People also make use of argumentation theory in law, for example, in preparing an argument to be presented before a court of law, in debate in the court, in trials, and in testing the validity of certain kinds of evidence. Scholars of argumentation theory study the post-hoc rationalizations by which organizational actors try to justify decisions that were in fact made irrationally.
The simple block diagram for logical reasoning shown in Fig. 2.1 has an internal structure, comprising the following:
If the premises are P1, P2, ..., Pn, then they are conjoined, and their conjunction implies the conclusion C, i.e., P1 ∧ P2 ∧ · · · ∧ Pn → C.
An argument must have at least one premise and one conclusion. Often, classical logic is used as the method of reasoning, so that the conclusion follows logically from the assumptions or support. One challenge is that if the set of assumptions is
We discussed knowledge and logic in the sections above. Logic needs a base of knowledge from which to infer or conclude new knowledge. Knowledge is also used for learning, retrieval, and reasoning. Learning is not only the adding of new facts to an existing knowledge base: before the new data are put into storage, they need to be classified for ease of retrieval. Interaction and inference with the existing facts avoid redundancy and duplication of knowledge in the knowledge base. In addition, learning updates the existing facts.
Having stored the knowledge in the process of learning, one important objective is retrieval. The representation scheme used in the knowledge base has a critical effect on the efficiency of the retrieval system. As humans, we are very good at retrieval from our knowledge (memories), and many AI systems have used this for modeling AI learning.
The knowledge is also used for the reasoning process, i.e., to infer new facts from the existing facts in the knowledge base: for example, observing many birds flying, to infer that all birds fly; or, for a complex problem, to infer from sufficient facts that a customer financed by a bank will be able to repay the loan the bank has given him.
Propositional logic deals with individual propositions, which are viewed as atoms, i.e., they cannot be broken down further into smaller constituents. To build propositional logic, we first describe the logic with the help of formulas called well-formed formulas (wff, read as "woofs"). A formula is a syntactic concept: whether or not a string of symbols is a formula can be determined solely from its formal construction, i.e., whether it can be built according to the construction rules. Therefore, we are in a position to verify whether a sequence of symbols is a formula, as per the specified rules. This function of verification, in a compiler, is done by a parser, which verifies whether the formula belongs to the particular programming language or not. A parser also constructs a parse-tree of the given formula, which tells how the formula is constructed [2].
Meaning (semantics) is associated with each formula by defining its interpretation, which assigns a value true (T) or false (F) to every formula. The syntax is also used to define the concept of proof: the symbolic manipulation of formulas to deduce a given theorem. The important thing to note is that the provable formulas should be only those which are always true.
We start propositional logic with individual propositional variables. These variables are themselves formulas, which cannot be further analyzed. We represent them by English letters and subscripted letters p, q, r, s, t, p1, p2, q1, q2, etc. The propositions they stand for may have smaller constituents, but it is not the role of propositional logic to go into the details of their construction. The use of letters to represent propositions is not, in the true sense, the use of variables; the letters simply represent the propositions or statements in symbolic form, and are not variables in the sense used in predicate logic (to be discussed later), or in high-level languages like C or Fortran, where a variable stands for a domain of values. For example, an integer variable in a Fortran program stands for any integer number, as per the specifications of the language.
The other symbols of propositional logic are the following operators:
∧ conjunction operator,
∨ disjunction operator,
¬ not or inverting operator,
→ implication, i.e., the if . . . then . . . rule, and
⊥ contradiction (false).
Let the following be propositions:
p = The Sun is a star.
q = The Moon is a satellite.
We can construct the following formulas using the above propositions:
p ∧ q = The Sun is a star and the Moon is a satellite.
p ∨ q = The Sun is a star or the Moon is a satellite.
¬p ∨ q = The Sun is not a star or the Moon is a satellite.
¬p → q = If the Sun is not a star then the Moon is a satellite.
A formula in propositional logic can be recursively defined as follows:
(i) Each propositional variable, and the null formula ⊥, is a formula; therefore p, q, ⊥ are formulas,
(ii) If p, q are formulas, then p ∧ q, p ∨ q, ¬p, p → q, (p) are also formulas,
(iii) A string of symbols is a formula only as determined by finitely many applications of (i) and (ii) above, and
(iv) nothing else is a propositional formula.
This recursive form of the definition can be expressed in BNF (Backus-Naur Form) notation as follows:
formula ::= atomic formula | ¬formula | formula ∧ formula | formula ∨ formula | formula → formula | (formula)
atomic formula ::= p | q | r | p1 | . . . | ⊥
In the above notation, the symbols formula and atomic formula, which appear on the left-hand side, are called non-terminals and represent grammatical classes. The symbols p, q, r, ⊥, p1 , etc., which appear only on the right-hand side, are called terminals, and represent the symbols of the language.
A sentence in the propositional language is obtained through a derivation that starts with a non-terminal and repeatedly applies the substitution rules of the BNF notation until the terminals are reached [8].
The sequence of substitution rules applied to derive a formula, i.e., to establish that it is syntactically correct, constitutes its derivation.
The symbol atomic stands for atomic formula, and the symbol “⇒” stands for “implies”: the expression to the right of it is implied by the expression to its left.
The derivation can also be represented by a derivation-tree (parse-tree), shown in Fig. 2.2. From the derivation-tree we can obtain another tree, shown in Fig. 2.3, called the syntax-tree or formation-tree, by replacing each non-terminal by the operator child under it. There is always a unique syntax-tree for every formula.
Considering two propositions p, q, the interpretation (semantics) of the formulas constructed when they are joined using the binary operators (∨, ∧, →) is shown in the truth-table, Table 2.1.
The material conditional ‘→’ joins two simpler propositions, e.g., p → q, read as “if p then q”. The proposition to the left of the arrow is called the antecedent, and the one to the right the consequent. There is no such designation for the conjunction or disjunction operators, because they are commutative operations. The formula p → q expresses that q is true whenever p is true. Thus it is true in every case in Table 2.1 except row three, because that is the only case where p is true but q is not. Using “if p then q”,
we can express, for example, that “if it is raining outside then it is cold in Kashmir”. The material conditional is often confused with physical causation. The material conditional, however, only relates two propositions by their truth values, which is not the relation of cause and effect. It is contentious in the literature whether the material implication represents logical causation.
Logical consequence, or logically follows, is the central concept in the foundations of logic. It is much more interesting to assume that a set of formulas is true and then to investigate the consequences of these assumptions [1].
Assume that θ and ψ are formulas (sentences) of a set P, and I is an interpretation of P. A sentence θ of propositional logic is true under an interpretation I iff I assigns the truth value T to that sentence; θ is false under an interpretation I iff θ is not true under I .
Definition 2.3 (Logical consequence) A sentence ψ of propositional logic is a logical consequence of a sentence (or set of sentences) θ, represented as θ |= ψ, if every interpretation I that satisfies θ also satisfies ψ.
In fact, ψ need not be true in every possible interpretation, only in those interpretations which satisfy θ, i.e., those interpretations which satisfy every formula in θ.
In the formula ((p → q) ∧ p) ⊢ q, q is a logical consequence of ((p → q) ∧ p). The sign ‘⊢’ is the sign of deduction, and S ⊢ q is read as “S deduces q”, where S is a set of formulas and q is a formula.
A sentence of propositional logic is consistent iff it is true under at least one
interpretation. It is inconsistent if it is not consistent.
Example 2.2 Determine whether ψ = (p ∨ r) ∧ (¬q ∨ ¬r) is a logical consequence of θ = {p, ¬q}, i.e., find θ |= ψ, and check the validity of ψ.
Here ψ is a logical consequence of θ, denoted by θ |= ψ, because ψ is true under all the interpretations with I(p) = True and I(q) = False, and these are exactly the interpretations for which θ is satisfied.
However, ψ is not valid, since it is not true under the interpretation I(p) = F, I(q) = T, I(r) = T.
Further, note that θ → ψ is a valid statement, because the expression θ → ψ is always true.
The semantic tableau is a relatively efficient method for deciding the satisfiability of a formula of propositional calculus. The method (or algorithm) systematically searches for a model of the formula: if one is found, the formula is satisfiable, otherwise it is unsatisfiable. We start with the definition of some terms, and then analyze some formulas to motivate the construction of semantic tableaux [1].
Consider, for example, the formula A = p ∧ (¬q ∨ ¬p). For A to be satisfiable, an interpretation I must make both conjuncts true, i.e., I(p) = T and I(¬q ∨ ¬p) = T. Thus, either
1. I(p) = T and I(¬q) = T, or
2. I(p) = T and I(¬p) = T.
Hence A is satisfiable if either interpretation (1) or (2) holds. But (2) is not feasible, so A is satisfiable when interpretation (1) holds. Note that the satisfiability of the formula has been reduced to the satisfiability of literals.
It is clear that a set of literals is satisfiable if and only if it does not contain a complementary pair of literals. In the above case, the pair of literals {p, ¬p} in case (2) is a complementary pair, hence the formula is unsatisfiable under this interpretation. However, the first set, {p, ¬q}, contains no complementary pair, hence it is satisfiable.
From the above discussion, we have trivially constructed a model for the formula A by assigning True to positive literals and False to negative literals. Hence, p =
True, and q = False makes the set in (1) true; hence {p = T, q = F} is a model for formula A.
The above is a search process, and can be represented by the tree shown in Fig. 2.4. The leaves of the tree represent sets of literals that must be satisfied. A leaf containing a complementary pair of literals is marked closed by ×, while a satisfying leaf is marked as open.
The construction process of the tree can be represented as an algorithm, to find out whether some model exists for a formula, and what that model is.
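To make the construction concrete, the following is a minimal sketch (in Python, not from the book) of the tableau search for formulas in negation normal form; the tuple encoding of formulas and all function names are our own illustrative assumptions.

# Formulas: a literal is a string such as 'p' or '~p'; ('and', A, B) and
# ('or', A, B) denote conjunction and disjunction (negation normal form).

def complementary(literals):
    # A leaf (set of literals) is closed iff it contains both p and ~p.
    return any(('~' + l) in literals for l in literals if not l.startswith('~'))

def tableau(formulas, literals=frozenset()):
    # Expand the node; return a satisfying set of literals (open leaf), or None.
    if not formulas:
        return None if complementary(literals) else set(literals)
    first, rest = formulas[0], list(formulas[1:])
    if isinstance(first, str):                 # a literal: record it
        return tableau(rest, literals | {first})
    op, a, b = first
    if op == 'and':                            # alpha rule: one child node
        return tableau([a, b] + rest, literals)
    left = tableau([a] + rest, literals)       # beta rule ('or'): two branches
    return left if left is not None else tableau([b] + rest, literals)

A = ('and', 'p', ('or', '~q', '~p'))           # the formula A discussed above
print(tableau([A]))                            # {'p', '~q'}: open leaf, A is satisfiable

Running the sketch on A reproduces the model {p = T, q = F} found above.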
Definition 2.5 (Semantic Tableau) A semantic tableau is a tree, each node of which is labeled with a set of formulas; these formulas are inductively expanded towards the leaves, such that each leaf is finally marked as either open or closed (×).
Definition 2.6 (Completed tableau) A semantic tableau whose construction has terminated is called a completed tableau. A completed tableau is closed if all its leaves are marked closed; otherwise, it is open, i.e., some leaves are open.
Definition 2.7 (Unsatisfiable formula) A formula A is unsatisfiable if its completed tableau T is closed.
Corollary 2.1 (Method of semantic tableaux) A formula A is satisfiable if its completed tableau T is open. Thus the method of semantic tableaux is a decision algorithm for the satisfiability, and hence the validity, of propositional calculus formulas.
Example 2.4 Using the tableau method, find out whether (p ∨ q) ∧ (¬p ∧ ¬q) is satisfiable.
Let A = (p ∨ q) ∧ (¬p ∧ ¬q). For A to be satisfied, I(A) = True for some assignment. That is, I(p ∨ q) = True and I(¬p ∧ ¬q) = True. Thus, I(A) is True if either
• I(p) = True, I(¬p) = True, I(¬q) = True, or
• I(q) = True, I(¬p) = True, I(¬q) = True.
The two corresponding sets of literals are {p, ¬p, ¬q} and {q, ¬p, ¬q}. Since both contain complementary pairs, neither set of literals is satisfiable. So it is impossible to find a model for A, and A is unsatisfiable.
How can we reason about solving any problem? To a certain extent, it depends on the chosen knowledge representation. The following are, broadly, the methods by which humans perform reasoning [2].
Deductive Reasoning
It is a process by which general premises are used to obtain specific inferences. For example, we may have the following premises and conclusion:
Premise-I: I do shopping when the weather is good on weekends.
Premise-II: Today is Saturday and the sky is clear.
Conclusion: Therefore, I will go shopping today.
To perform deductive reasoning, the problem is first formulated in the way we did in the above example. Having done this, the conclusions must be valid whenever the premises are true. Beginning with a small set of axioms, postulates, and definitions, the Greek mathematician Euclid proved a total of 465 geometric propositions as the logical consequences of the input assumptions.
One of the most fundamental rules of inference is the modus ponens rule. We have the following example of modus ponens.
Premise-I: All men are mortal.
Premise-II: Socrates is a man.
Conclusion: Therefore, Socrates is mortal.
The new knowledge, “Socrates is mortal”, has been deduced from the two earlier sentences.
The enumeration table of all possible worlds for modus ponens is shown in Table 2.2. We note that it is a valid inference, as the sentence ((p → q) ∧ p) → q, with q as the implied inference, is true in all the rows.
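As a quick cross-check of Table 2.2, the following small Python sketch (our own illustration, not from the book) enumerates all possible worlds for p and q and confirms that ((p → q) ∧ p) → q is true in every row, i.e., valid.

from itertools import product

def implies(a, b):
    # Material conditional: a -> b is false only when a is true and b false.
    return (not a) or b

for p, q in product([True, False], repeat=2):
    row = implies(implies(p, q) and p, q)
    print(p, q, row)
# Every row prints True, so modus ponens is a valid inference.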
Other deductive reasoning approaches are modus tollens, syllogism, and abduction. Table 2.3 shows the formulas for these rules.
Abduction is a deduction-like form of reasoning, which provides only a “plausible inference.”
For example, given that “smoking causes lung cancer” and “Sam died due to lung cancer”, through abduction one would infer that “Sam was a smoker”. However, this conclusion is not necessarily true, because there are also causes of lung cancer other than smoking. When statistics and probability theory are used along with abduction, the result may be the most probable inference out of the many likely inferences. To illustrate how abduction-based reasoning works, we consider a logical system comprising a general rule and one specific proposition.
All successful enterprising industrialists are rich (general rule). Rajan is a rich person (specific
proposition). Therefore, a plausible inference can be that Rajan is a successful, enterprising
industrialist.
However, this conclusion can also be false, because there are many other paths to richness, such as a lottery, inherited property, coming across a treasure, and so on. If we had a table of all rich persons and of how they became rich, we could estimate the probability that this abductive inference about richness is true.
Inductive Reasoning
Inductive reasoning arrives at a conclusion about all members of a class. It is based on an examination of only a few members of the class, from which it generalizes to the entire class. It is, broadly, reasoning from the specific to the general. For example, suppose the traffic police learn the following about the nature of road accidents on a particular day:
the 1st accident was due to wrong-side driving,
the 2nd accident was due to wrong-side driving,
the 3rd accident was due to wrong-side driving.
One would logically infer that all the accidents are due to wrong-side driving.
Another example concerns birds and their flying attribute:
Crows fly,
peacocks fly,
pigeons fly.
Thus, we conclude that all birds fly.
Another example is the progressive sum of the first n odd integers:
1 = 1²
1 + 3 = 2²
1 + 3 + 5 = 3²
1 + 3 + 5 + 7 = 4²
Inductively, one concludes that the sum of the first n odd integers is n².
Rule-Based Reasoning
Rule-based reasoning is also called pattern matching, and uses forward and backward chaining. The implementation of a rule-based system makes use of modus ponens and other approaches. Consider the rule:
Rule 1: If the export rises, then prosperity increases.
Using modus ponens, if the premise, e.g., “The export rises”, is true, the conclusion of the rule is accepted as true. We call this accepting of the rule “rule firing”: a rule fires when all of its premises are satisfied. On firing, the resulting conclusion is stored in the assertion base, to be used for further firing of rules and generation of assertions. When a premise is not available as an assertion, it can be obtained by querying the user, or by firing other rules. Testing a rule's premise or conclusion is as simple as matching a symbol pattern.
Every rule in the knowledge base can be checked to see if its premises or conclusion can be satisfied by previously made assertions. When this process of matching is done from premises to conclusions, it is called forward chaining; if it is done from conclusions to premises, it is called backward chaining.
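The following is a minimal forward-chaining sketch in Python; the rule encoding and the second rule are illustrative assumptions, with Rule 1 taken from the example above.

def forward_chain(rules, assertions):
    # rules: list of (set-of-premises, conclusion); assertions: initial facts.
    assertions = set(assertions)
    fired = True
    while fired:                       # repeat until no rule can fire
        fired = False
        for premises, conclusion in rules:
            if premises <= assertions and conclusion not in assertions:
                assertions.add(conclusion)   # the rule "fires"
                fired = True
    return assertions

rules = [({"export rises"}, "prosperity increases"),
         ({"prosperity increases"}, "employment grows")]   # second rule assumed
print(forward_chain(rules, {"export rises"}))
# {'export rises', 'prosperity increases', 'employment grows'}

Backward chaining would instead start from a goal conclusion and recursively seek rules and assertions that establish its premises.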
Model-Based Reasoning
In the model-based approach, knowledge is represented directly by truth assignments and examples of the world. This is in contrast to the use of only formulas, as in first-order predicate logic, to describe the world. The other difference is that the model-based approach is motivated from a cognitive point of view: the forerunners of this approach to reasoning are cognitive psychologists who support “reasoning by examples.” When a model-based reasoning system is presented with a query, the reasoning is performed by evaluating the query on these models.
Let us suppose that a model-based knowledge base representation Γ and a query α are both given, and it is required to find out whether Γ implies α (i.e., Γ |= α). This we can determine in two steps: 1) evaluate α on all the models in the representation; 2) if there is a model of Γ that does not satisfy α, then Γ does not entail α (i.e., Γ ⊭ α); otherwise, we conclude that Γ |= α. This means that if the model-based representation contains all the models of Γ, then by definition this approach verifies the implication correctly, and produces the correct deduction.
However, there is a problem: a representation of Γ that explicitly holds all its models is not a plausible solution. The model-based approach is feasible only if Γ can be replaced by a small model-based representation which still correctly supports the deduction.
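The two-step check can be sketched as follows in Python, assuming (purely for illustration) that Γ is stored as an explicit list of models, each model being a truth assignment to the atoms.

def entails(models, alpha):
    # models: list of dicts atom -> bool; alpha: predicate over one model.
    # Gamma |= alpha iff alpha is true in every stored model of Gamma.
    return all(alpha(m) for m in models)

gamma_models = [{"p": True, "q": True},       # assumed models of Gamma
                {"p": True, "q": False}]
print(entails(gamma_models, lambda m: m["p"] or m["q"]))   # True
print(entails(gamma_models, lambda m: m["q"]))             # False: not entailed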
Various topics in reasoning are as follows:
Monotonic versus nonmonotonic reasoning,
Reasoning with uncertainty,
Shallow and deep representation of knowledge,
Semantic networks,
Blackboard approach,
Inheritance approach,
Pattern matching,
Conflict resolution.
These topics are discussed in detail in the current and the following chapters.
There are two different methods: one is through model checking, and the other is deduction-based. The first comprises enumeration of truth-tables, and is always exponential in n, where n is the size of the set of propositional symbols. The other, i.e., the deduction-based approach, is the repeated application of inference rules, where the inference rules are used as operators in a standard search algorithm; in fact, the application of the inference approach to proof is called searching for a solution. Proper selection of search directions is important here, as it eliminates many unnecessary paths that are not likely to lead to the goal. Consequently, the proof-based approach to reasoning is considered better and more efficient than the model enumeration/checking based method, which is exhaustive and exponential in n.
The fundamental property that such a logical system follows is monotonicity: if S ⊢ α, and β is an additional assertion, then S ∧ β ⊢ α. Thereby, the application of inference rules is a legitimate (sound) rule, which helps in the generation of new knowledge from the existing knowledge. If a search algorithm like DFS (depth-first search) is used, it will always be possible to find the proof, as it will search for the goal whatever the depth may be. Hence, the inference method in this case is also complete [7].
Before the inference rules are applied on the knowledge base, the existing sentences in the knowledge base (KB) need to be converted into some normal form.
2.6.2 Resolution
The necessary condition for the above is that C should not be a function of any
of the p1 , p2 , q1 .
An inference rule is a mechanical process for producing new facts from existing facts and rules. The semantics of predicate logic provides a basis for a formal theory of logical inference; it allows the creation of new facts from the existing facts and rules [5, 7].
An interpretation of a predicate statement is an assignment of a true or false value to that statement. An interpretation that makes a sentence true is said to satisfy the sentence. An interpretation that satisfies every member of a set is said to satisfy the set.
Definition 2.8 (Logically follows) If every interpretation that satisfies S also satisfies X, then we say the expression X logically follows from the set of expressions S (the knowledge base). In other words, the knowledge base S entails the sentence X if and only if X is true in all worlds where the knowledge base is true. If a sentence X logically follows from S, we represent this as S |= X.
The term logically follows simply means that X is true for every one of the (potentially infinitely many) interpretations that satisfy S. However, checking all interpretations is not practical. In fact, inference rules provide a computationally feasible way to determine whether an expression X logically follows from a set of premises S.
[(P → Q) ∧ P] → Q (2.3)
which is a valid statement (a tautology). Here, Q also logically follows from (is entailed by) (P → Q) ∧ P. That is, [(P → Q) ∧ P] |= Q.
When every inference X deduced from S also logically follows from S, the inference system is sound. This is expressed by
S ⊢ x ⇒ S |= x. (2.4)
The classical logic, or FOPL (first-order predicate logic), discussed so far is not always sufficient to model the real-world knowledge of the world we live in. The reasons are: things change from false to true or vice versa over time; the addition of new knowledge to the knowledge base may contradict the existing knowledge (e.g., the statement “the surface of the earth is curved” becomes false at the poles); things may be partially true instead of either true or false; and sometimes there is only a probability of something being true or false, and so on. Hence, there is a requirement for an altogether different approach and method of inferencing for real-world situations.
Nonmonotonic logic is the study of those ways of inferring from given knowledge that do not necessarily satisfy the monotonicity property satisfied by all methods based on classical logic.
2.8 Hilbert and the Axiomatic Approach

An axiomatic system comprises a set of axioms and a set of primitives, where the primitives are object names, but these objects are left undefined. The axioms are sentences that make assertions about the primitives. These assertions are not provided with any justifications, so they are neither true nor false. Subsequent or new assertions about the primitives, called theorems, are rigorous logical consequences of the axioms and previously proved theorems.
The formal definitions in an axiomatic system serve to simplify things, as they can be used to create new objects made of complex combinations of primitives and previously defined terms (objects and theorems). If a definite meaning is assigned to the primitives of an axiomatic system, called an interpretation, the theorems become meaningful assertions.
Following are some definitions of the axiomatic system.
Definition 2.11 (Model for an axiomatic system) If all the axioms are true for a given interpretation, then everything asserted by the theorems is also true. Such an interpretation is called a model for the axiomatic system.
Definition 2.14 (Isomorphic) If two models of the same axiom system can be proved
as structurally equivalent, then they are isomorphic to each other.
Definition 2.15 (Categorical Axioms) If all models of an axiom system are isomor-
phic then the axiom system is categorical.
Thus, for a categorical axiom system, there exists a model—the one and only
interpretation in which its theorems are all true.
The qualities of truth, logical necessity, consistency, and uniqueness were considered the base of classical Euclidean geometry. Until recently, it was accepted that Euclidean geometry is the only way to think about space. Now, axiomatic systems are taken as the basis of geometry, and, by extension, of all of mathematics, including computational mathematics and algorithms.
Hilbert's definition of an axiomatic system lays the foundation of a theory and requires that the system satisfy three main properties: independence, consistency, and completeness. He proposed that, just as in geometry, this kind of axiomatic analysis should be applied to other fields of knowledge, and in particular to physical theories. When we study any system of axioms from Hilbert's perspective, the focus of interest remains always on the disciplines themselves rather than on the axioms. The axioms are just a means to improve our understanding of the discipline, and are not aimed at turning mathematics into a formally axiomatized game. For example, in the case of geometry, a set of axioms was selected in such a way that they reflected the basic manifestations of the intuition of space [4].
To understand the role of axioms, we will discuss the axioms of sets, as they are useful in reasoning and inference. By analyzing mathematical arguments, logicians became convinced that the notion of “set” is the most fundamental concept of mathematics. For example, it can be shown that the notion of an integer can be derived from the abstract notion of a set. Thus, in our world all the objects are sets, and we do not postulate the existence of any more primitive objects. To support this intuition, we can think of our universe as all sets which can be built by successive collecting processes, starting from the empty set; and we allow the formation of infinite sets.
The first set of axioms for a general set theory was given by E. Zermelo in 1908, and was later developed by A. Fraenkel; it is hence usually referred to as Zermelo-Fraenkel (ZF) set theory, the one with which we are most concerned. Another system of axioms, which has only finitely many axioms but is less natural, was developed by von Neumann, Bernays, and Gödel; the latter is usually referred to as Gödel-Bernays (GB) set theory. Following are some of the important axioms of ZF set theory [3, 8].
1. Axiom of Extensionality.
∀x∀y(∀z(z ∈ x ↔ z ∈ y) → x = y) (2.7)
The above says that a set is determined by its members. We can define subsets as follows:
x ⊆ y ↔ ∀z(z ∈ x → z ∈ y). (2.8)
Also,
x ⊂ y ↔ x ⊆ y ∧ ¬x = y. (2.9)
2. Axiom of the Null Set.
∃x∀y(¬y ∈ x) (2.10)
The set defined by this axiom is the empty or null set, and we denote it by φ.
3. Axiom of Unordered Pairs.
∀x∀y∃z∀t(t ∈ z ↔ t = x ∨ t = y) (2.11)
We represent the set z by {x, y}. Also, {x} is {x, x}, and we put ⟨x, y⟩ = {{x}, {x, y}}. The set ⟨x, y⟩ is called the ordered pair of x and y.
Using the above, we can define a function as follows: a function is a set f of ordered pairs such that ⟨x, y⟩, ⟨x, z⟩ ∈ f → y = z. The set of x such that ⟨x, y⟩ ∈ f is called the domain, and the set of such y is called the range. We say f maps into a set u if the range of f is contained in u.
4. Axiom of Union.
∀x∃y∀z(z ∈ y ↔ ∃t(z ∈ t ∧ t ∈ x)) (2.12)
The above says that y is the union of all sets in x. Using the axiom Eq. 2.12, we can deduce that, given x and y, there exists z such that z = x ∪ y, that is, t ∈ z ↔ t ∈ x ∨ t ∈ y.
To motivate the next axiom: if x is an integer, the successor of x is defined as x ∪ {x}. The “axiom of infinity” then generates a set that contains all the integers, and is thus infinite.
5. Axiom of Infinity. It can be expressed as follows, and can be understood as a principle of induction:
∃x(φ ∈ x ∧ ∀y(y ∈ x → y ∪ {y} ∈ x)) (2.13)
6. Axiom of the Power Set. This axiom states that for each x there exists a set y consisting of all the subsets of x:
∀x∃y∀z(z ∈ y ↔ z ⊆ x). (2.14)
If the axiom of extensionality is dropped, the resulting system may contain atoms, i.e., sets x such that ∀y(¬y ∈ x) and yet the sets x are different. Indeed, one possible view is that integers are atoms and should not be taken as sets.
The first interesting axiom is the Axiom of Infinity. If we drop it, then we can take as a model for ZF the set of all finite sets which can be built from φ.
The axioms discussed above can be used to prove theorems such as mathematical induction and properties of invertible functions, and indeed the other theorems of set theory, as well as their corollaries; but these are not appropriate to cover here, and the curious reader is encouraged to refer to the literature given in the bibliography.
2.9 Summary
Logic is used for valid deductions, and it avoids fallacious reasoning. Logic is also useful in argumentation theory: the study of how conclusions can be reached through logical reasoning, that is, whether the claims are soundly based on the premises or not. Argumentation includes debate and negotiation, which are concerned with reaching mutually acceptable conclusions. Logic is used in proofs, and in the solution of games and puzzles. Arguments have an internal structure, comprising premises, a reasoning process, and a consequence.
The most commonly used logic, propositional logic, represents sentences using single symbols, called atoms, which are joined using the operators ∨, ∧, ¬, → to create compound sentences. The sign “→” in p → q is the material implication, also called the conditional join: if p then q. Propositional logic expressions are called sentences/statements; these are interpreted as true or false. The sentences are called wffs, and are defined recursively. A formula is a syntactic concept: whether or not a string of symbols is a formula is determined solely by its formal construction.
The meaning (semantics) is associated with each formula by defining its interpretation, which assigns a value true (T) or false (F) to every formula. Interpretation of a statement means the assignment of truth values to its atoms. A set of truth values assigned to the atoms in a statement is called its world. An assignment of truth values to the atoms of a statement which makes the statement true is called a model of the statement.
Model checking is the process of truth-table enumeration, and is exponential in n, the number of atoms in a statement. A derivation can also be represented by a derivation-tree (parse-tree).
A propositional formula A is satisfiable iff v(A) = True for some interpretation v. A satisfying interpretation is called a model for A. The formula A is called valid, denoted by |= A, iff v(A) = True for all interpretations v. A sentence is logically true (valid) iff it is true under every interpretation; |= θ means that θ is valid.
Reasoning in which the addition of new knowledge may produce inconsistency in the knowledge base is called nonmonotonic reasoning. As per the property of monotonicity, if S ⊢ α, and β is an additional assertion, then S ∧ β ⊢ α. Nonmonotonic logic is the study of those systems that do not satisfy the monotonicity property satisfied by all methods based on classical logic.
The reasoning patterns comprise inference methods: modus ponens, modus tollens, syllogism; and proof methods: the resolution theorem, model checking, and normal forms. Deducing new knowledge from the existing knowledge base is called inferencing. Modus ponens, modus tollens, and syllogism are inference rules, and soundness and completeness are the desirable properties of inference systems.
The semantic tableau is a method for deciding the satisfiability of a formula of propositional calculus, which systematically searches for a model of the formula: if one is found, the formula is satisfiable, else it is unsatisfiable. A semantic tableau is a tree, each node of which is labeled with a set of formulas; these formulas are inductively expanded to the leaves, such that each leaf is marked as either open or closed (×).
The resolution rule is an inference rule which uses the deduction approach; it is used in theorem proving.
If every interpretation that satisfies S also satisfies X, then we say the expression X logically follows from the set of expressions S (the knowledge base). Soundness means that you cannot prove anything that is wrong, and completeness means that you can prove anything that is right.
An axiomatic system comprises a set of axioms and primitives.
Exercises
References
1. Ben-Ari M (2008) Mathematical logic for computer science, 2nd edn. Springer International, London. ISBN 978-81-8128-344-3
2. Chowdhary KR (2012) Fundamentals of discrete mathematical structures, 2nd edn. EEE PHI India. ISBN 978-81-203-4506-5
3. Cohen PJ (2008) Set theory and the continuum hypothesis. Dover, New York
4. Davis M, Putnam H (1960) A computing procedure for quantification theory. J ACM 7(3):201-215. https://doi.org/10.1145/321033.321034
5. Gelder AV (1991) The well-founded semantics for general logic programs. J ACM 38(3):620-650
6. Hilbert D (2005) The foundations of geometry. Project Gutenberg Ebook #17384
7. Kaushik S (2002) Logic and prolog programming. New Age International, New Delhi
8. Suppes P (1972) Axiomatic set theory. Dover, New York
9. Shankar N (2009) Automated deduction for verification. ACM Comput Surv 41(4):20:1. https://doi.org/10.1145/1592434.1592437
Chapter 3
First Order Predicate Logic
Abstract The first-order predicate logic (FOPL) is the backbone of AI, as well as a method for the formal representation of Natural Language (NL) text. The Prolog language for AI programming has its foundations in FOPL. The chapter demonstrates how to translate NL to FOPL in the form of facts and rules; the use of quantifiers and variables; the syntax and semantics of FOPL; and the conversion of predicate expressions to clause forms. This is followed by the unification of predicate expressions using instantiations and substitutions, compositions of substitutions, and the unification algorithm and its analysis. The resolution principle is extended to FOPL, a simple algorithm of resolution is presented, and the use of resolution is demonstrated for theorem proving. The interpretation and inferences of FOPL expressions are briefly discussed, along with the use of Herbrand's universe and Herbrand's theorem. At the end, the most general unifier (mgu) and its algorithms are presented, and the chapter concludes with a summary.
3.1 Introduction
This chapter presents a formulation of first-order logic which is best suited as a basic theoretical instrument for a computer-based theorem-proving program. As per the requirements of the theory, an inference method should be sound: it allows only logical consequences of the premises to be deduced from the premises. In addition, it should be effective: it should be algorithmically decidable whether a claimed application of the inference principle is really an application of it. When the inference principle is carried out by a computer, the complexity of a single application of the inference principle is not, by itself, an issue; however, for more powerful principles, the combinatorial information processing needed for a single application may become dominant.
The system described in the following is based on an inference principle, the resolution principle, which is a machine-oriented rather than a human-oriented system. The resolution principle is quite powerful in the psychological sense also, as it performs in a single type of inference deductions that are often beyond the ability of a human to grasp. In the theoretical sense, it is a single inference principle that by itself forms a complete system of first-order logic. This latter property is not of much practical significance, but it is interesting in the sense that no other complete system of first-order logic is based on just one inference principle, unless one counts the device of introducing logical axioms by a schema as an inference principle. The principal advantage of resolution is its ability to avoid major combinatorial obstacles to efficiency, which used to be a serious problem in earlier theorem-proving procedures.
Learning Outcomes of this Chapter:
1. Translate a natural language (e.g., English) sentence into predicate logic state-
ment. [Usage]
2. Apply formal methods of symbolic predicate logic, such as calculating validity
of formula and computing normal forms. [Usage]
3. Use the rules of inference to construct proofs in predicate logic. [Usage]
4. Convert a logic statement into clause form. [Usage]
5. Describe the strengths and limitations of predicate logic. [Familiarity]
6. Apply resolution to a set of logic statements to answer a query. [Usage]
7. Implement a unification-based type-inference algorithm for a simple language.
[Usage]
8. Precisely specify the invariants preserved by a sound type system. [Familiarity]
First-Order Predicate Logic (FOPL) offers a formal approach to reasoning that has sound theoretical foundations. This aspect is important for mechanizing the automated reasoning process, where inferences must be correct and logically sound.
The statements of FOPL are flexible enough to permit the accurate representation of natural languages. The words sentence and well-formed formula will both indicate predicate statements. Following are some translations of English sentences into predicate logic:
• English sentence: If x and y are parents of a child z, and x is a man, then y is not a man.
∀x∀y[[parents(x, z) ∧ parents(y, z) ∧ man(x)] ⇒ ¬man(y)]
∀x∀y[(formula(x) ∧ mainconnective(x, ⇒ ))
∧ (formula(y) ∧ mainconnective(y, ∨))
→ x ≡ y].
3. If a program cannot be told a fact, then it cannot learn that fact. ∀x[(program(x) ∧
¬told (x, fact)) → ¬learn(x, fact)]
Consider physical objects, like a cuboid, a cone, and a cylinder, placed on a table-top with some relative positions, as shown in Fig. 3.1. There are four blocks on the table: a, c, d are cuboids, and b is a cone. Along with these there is a robot arm, which can lift one of the objects having a clear top.
Following is the set of statements about the blocks world (called the knowledge base):
cuboid(a).
cone(b).
cuboid(c).
cuboid(d).
onground(a).
onground(c).
ontop(b, a).
ontop(d, c).
topclear(b).
topclear(d).
∀x∀y[topclear(x) ∧ topclear(y) ∧ ¬cone(x) ⇒ puton(y, x)].
The knowledge specified in the blocks world indicates that the objects a, c, d are cuboids, b is a cone, a, c are placed on the ground, b, d are on top of a, c, respectively, and the tops of b, d are clear. These are called facts in knowledge representation. At the end, the rule says that for any objects x and y such that both have their tops clear and x is not a cone, y can be put on the object x.
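The applicability of the rule can be checked with plain set operations, as in the following Python sketch of the above knowledge base (the encoding is illustrative, not the book's):

cone = {"b"}
topclear = {"b", "d"}

# puton(y, x) is applicable when topclear(x), topclear(y), x != y,
# and x is not a cone.
applicable = [("puton", y, x) for x in topclear for y in topclear
              if x != y and x not in cone]
print(applicable)   # [('puton', 'b', 'd')]: only b can be put on d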
Bound and Free Variables
A variable in a wff is bound if it is within the scope of a quantifier naming the variable; otherwise the variable is free. For example, in ∀x(p(x) → q(x, y)), x is bound and y is free; in ∀x(p(x) → q(x)) → r(x), the x in r(x) is a free variable. In the latter case it is better to rename the variable to remove the ambiguity; hence we rephrase the statement as ∀x(p(x) → q(x)) → r(z). An expression can be evaluated only when all the variables in it are bound.
If F1 , F2 , . . . , Fn are wffs with ∧, ¬ as the operators in each Fi , then F1 ∨ F2 ∨ · · · ∨ Fn is called DNF (disjunctive normal form). Alternatively, if the operators in each Fi are ∨, ¬, then F1 ∧ F2 ∧ · · · ∧ Fn is called CNF (conjunctive normal form). Each expression Fi , called a term, consists only of literals. We will prefer CNF for predicate expressions; thus, for an inference to be carried out, it is necessary to convert a predicate expression to CNF. For example, ∃x[p(x) ⇒ q(x)] can be converted to ¬p(a) ∨ q(a), where a is an instance of the variable x. The expression ¬p(a) ∨ q(a) is a term of the CNF. A formula in CNF, comprising ∧, ∨, ¬ along with constants, variables, and predicates, is called clausal or clause form [2].
Two types of semantics are defined for programming languages: (1) operational semantics, and (2) fixpoint semantics. The operational semantics defines the input-output relation computed by a program in terms of the individual operations performed by the program inside a machine, like basic logical and arithmetic operations. The meaning of a program is the input-output relation obtained by executing the
than B. If A and B have the same length, then A has the alphabetically earlier symbol at the first position at which A and B have distinct symbols.
• Herbrand Universe. It is the set of ground terms associated with any set S of clauses. Let F be the set of all function symbols which occur in the clause set S. If F contains any function symbol of degree 0, then the functional vocabulary of S is F; otherwise, it is {a} ∪ F, where a is a ground term. In this case, the Herbrand universe of S is the set of all ground terms formed with only the symbols of the functional vocabulary of S.
• Models. A model is a set of ground literals having no complementary pair. If M is a model and S is a set of ground clauses, then M is a model of S if, for all C ∈ S, C contains a member of M. In general, if S is any set of clauses and H is the Herbrand universe of S, then M is a model of H(S).
• Satisfiability. A set S is satisfiable if there is a model of S; otherwise, S is unsatisfiable.
For every sentence S1 of first-order predicate logic there is always a sentence S2 in clausal form which is satisfiable if and only if S1 is satisfiable. In other words, for every non-clause-form sentence there is a logically equivalent clause-form sentence. Due to this, all questions concerning the validity or satisfiability of sentences in FOPL can be addressed to sentences in clausal form.
The procedure for obtaining the clausal form of any well-formed formula (wff) is discussed later in this chapter. In the above we have defined the part of the syntax of predicate logic which is concerned with the specification of well-formed formulas. The formalism we are going to use in the next section is based on the notions of unsatisfiability and refutation, rather than upon the notions of validity and proof. To work with the criteria of refutation and unsatisfiability, it is necessary to convert the given wff into clausal form.
To determine whether a finite set of sentences S of first-order predicate logic is satisfiable, it is sufficient to assume that each sentence in S is in clause form, and that there are no existential quantifiers in the prefix of S. In addition, the matrix of each sentence in S is assumed to be a disjunction of formulas, each of which is either an atomic formula or the negation of an atomic formula. Therefore, the syntax of S is designed such that the syntactical unit is a finite set of sentences in this special form, called clause form. Towards the end of the conversion process, the quantifier prefix is omitted from each sentence, since universal quantifiers necessarily bind each variable in the sentence. The matrix of each sentence is simply a set of disjuncts, and the order and multiplicity of the disjuncts are not important.
Following are the steps to convert a predicate formula into clausal form [2].
1. Eliminate the implication symbols, i.e., replace p → q by ¬p ∨ q.
2. Move the outer negation symbols into the atoms, for example, replace ¬∀x p(x) by ∃x¬p(x).
3. In an expression of nested quantifiers, existentially quantified variables not in the scope of universal quantifiers are replaced by constants: replace ∃x∀y(f(x) → f(y)) by ∀y(f(a) → f(y)).
4. Rename the variables if necessary. For example, in ∀x(p(x)) → q(x), rename the second, free variable x, giving ∀x(p(x)) → q(y).
5. Replace existentially quantified variables with Skolem functions, then eliminate the corresponding quantifiers. For example, for ∀x∃y[¬p(x) ∨ q(y)], we obtain ∀x[¬p(x) ∨ q(f(x))]. These newly created functions are called Skolem functions, and the process is called Skolemization.
6. Move the universal quantifiers to the left of the expression. For example, replace ∃x[¬p(x) ∨ ∀y q(y)] by ∃x∀y[¬p(x) ∨ q(y)].
7. Move the disjunctions down to the literals, i.e., the terms should be connected by conjunctions only.
8. Eliminate the conjunctions.
9. Rename the variables, if necessary.
10. Drop all the universal quantifiers, and write each term on a separate line.
The resulting sentence is in CNF, suitable for inferencing using resolution.
Example 3.5 Convert the expression ∃x∀y[∀z p(f(x), y, z) ⇒ (∃u q(x, u) ∧ ∃v r(y, v))] to clausal form.
The steps discussed above are applied in order to get the clausal form of the predicate formula.
1. Eliminate implication.
∀y[(¬p(f(a), y, g(y)) ∨ q(a, h(y))) ∧ (¬p(f(a), y, g(y)) ∨ r(y, l(y)))]
8. Eliminate conjunctions.
∀y[¬p(f(a), y, g(y)) ∨ q(a, h(y)), ¬p(f(a), y, g(y)) ∨ r(y, l(y))]
For [p(x, y) ⇒ q(y, x)] ∧ [q(y, x) ⇒ s(x, y)], by application of syllogism, the expression can be reduced to [p(x, y) ⇒ s(x, y)]. Thus, the original expression reduces to:
The following definitions are concerned with the operation of instantiation, i.e., the substitution of terms for variables in well-formed expressions and in sets of well-formed expressions [8].
Substitution Components
A substitution component is any expression of the form T /V , where V is any variable and T is any term different from V . Here T can be any constant, variable, function, predicate, or expression.
Substitutions
A substitution is any finite set (possibly empty) of substitution components, no two of which have the same variable. If P is any set of terms, and the terms of the components of the substitution θ are all in P, we say that θ is a substitution over P. We write the substitution whose components are T1 /V1 , . . . , Tk /Vk as θ = {T1 /V1 , . . . , Tk /Vk }, with the understanding that the order of the components is immaterial. We will use the lowercase Greek letters θ, λ, μ to denote substitutions.
Instantiations
If E is any finite string of symbols, and θ = {T1 /V1 , . . . , Tk /Vk } is any substitution, then the instantiation of E by θ is the operation of replacing each occurrence of the variable Vi , 1 ≤ i ≤ k, in E by the term Ti . The resulting string, denoted by Eθ, is called an instance of E by θ. That is, if E is the string E0 Vi1 E1 . . . Vin En , n ≥ 0, then Eθ is the string E0 Ti1 E1 . . . Tin En . Here, none of the substrings Ej of E contain occurrences of the variables V1 , . . . , Vk after substitution; some of the Ej are possibly null, and each Vij is an occurrence of one of the variables V1 , . . . , Vk .
If θ = {T1 /V1 , . . . , Tk /Vk } and λ are any two substitutions, then the composition of θ and λ, denoted by θλ, is the union θ′ ∪ λ′, defined as follows: θ′ is the set of all components Ti λ/Vi , 1 ≤ i ≤ k, such that Ti λ (λ substituted in θ) is different from Vi , and λ′ is the set of all components of λ whose variables are not among V1 , . . . , Vk .
Within a given scope, once a variable is bound, it may not be given a new binding in future unifications and inferences. If θ and λ are two substitution sets, then the composition of θ and λ, i.e., θλ, is obtained by applying λ to the elements of θ and adding the result to λ.
The following examples illustrate two different scenarios of composition of substitutions.
Example 3.7 Find the composition of {x/y, w/z}, {v/x}, and {A/v, f (B)/w}.
Let us assume that θ = {x/y, w/z}, λ = {v/x}, and μ = {A/v, f (B)/w}. Following are the steps:
1. To find the composition λμ, A is substituted for v, and v is then substituted for x. Thus, λμ = {A/x, f (B)/w}.
2. When the result of λμ is substituted in θ, we get the composition θλμ = {A/y, f (B)/z}.
Example 3.8 Find the composition of θ = {g(x, y)/z} and λ = {A/x, B/y, C/w, D/z}.
By composition,
θλ = {g(A, B)/z, A/x, B/y, C/w}.
The component {D/z} has not been included in the resultant substitution set, because otherwise there would be two terms for the variable z: one g(A, B), and the other D.
One important property of substitution is that if E is any string and σ = θλ, then Eσ = Eθλ. It is straightforward to verify that εθ = θε = θ for any substitution θ, where ε is the empty substitution. Also, composition enjoys the associative property (θλ)μ = θ(λμ), so we may omit the parentheses when writing multiple compositions of substitutions. Substitutions are not in general commutative, i.e., it is generally not the case that θλ = λθ, because for this Eθλ would have to be equal to Eλθ, which is not guaranteed. However, composition does have the distributive property.
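The composition rule above can be sketched in Python as follows; terms are encoded as nested tuples such as ('g', 'x', 'y'), variables as lowercase strings, and constants as uppercase strings, purely as an illustrative convention matching Example 3.7 and Example 3.8 above.

def apply_subst(term, s):
    # Apply substitution s (dict variable -> term) to a term.
    if isinstance(term, str):
        return s.get(term, term)
    return (term[0],) + tuple(apply_subst(a, s) for a in term[1:])

def compose(theta, lam):
    # Composition theta-lambda: E compose(theta, lam) == (E theta) lambda.
    out = {v: apply_subst(t, lam) for v, t in theta.items()}
    out = {v: t for v, t in out.items() if t != v}   # drop trivial components
    for v, t in lam.items():                         # keep lambda's components
        if v not in theta:                           # whose variables are not
            out[v] = t                               # among theta's variables
    return out

theta = {"z": ("g", "x", "y")}
lam = {"x": "A", "y": "B", "w": "C", "z": "D"}
print(compose(theta, lam))
# {'z': ('g', 'A', 'B'), 'x': 'A', 'y': 'B', 'w': 'C'} -- D/z is discarded,
# exactly as in Example 3.8 above.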
3.5.2 Unification
Consider, for example, the expressions
∃x mother(x, jill),
mother(susan, jill);
unifying these requires the substitution of susan for x. To perform unification, a variable can be replaced by any term, including another variable or function expressions of arbitrary complexity. This also includes function expressions that themselves contain variables. For example, the function expression mother(joe) may be substituted for x in human(x) to get human(mother(joe)).
Example 3.9 Find the substitution instances for foo(x, a, zoo(y)), given similar predicates with literal arguments.
1. foo(fred, a, zoo(z)), where fred is substituted for x and z for y, i.e., λ1 = {fred/x, z/y}. Thus,
We use the notation x/y to indicate that x is substituted for the variable y; we also call this a binding, so y is bound to x. A variable cannot be unified with a term containing that variable: p(x) cannot be substituted for x, because this would create an infinite regression, p(p(p(. . . x . . .))).
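A minimal sketch of syntactic unification with the occurs check, in Python; following the Prolog convention used later in this chapter, variables begin with an uppercase letter, and compound terms are tuples ('f', arg1, ...). The encoding is an illustrative assumption, not the book's algorithm.

def is_var(t):
    # Prolog convention: variables start with an uppercase letter.
    return isinstance(t, str) and t[:1].isupper()

def walk(t, s):
    # Follow variable bindings in substitution s to their current value.
    while is_var(t) and t in s:
        t = s[t]
    return t

def occurs(v, t, s):
    # Occurs check: does variable v occur inside term t (under s)?
    t = walk(t, s)
    if t == v:
        return True
    return isinstance(t, tuple) and any(occurs(v, a, s) for a in t[1:])

def unify(a, b, s=None):
    # Return a unifying substitution (dict), or None on failure.
    s = dict(s or {})
    a, b = walk(a, s), walk(b, s)
    if a == b:
        return s
    if is_var(a):
        return None if occurs(a, b, s) else {**s, a: b}
    if is_var(b):
        return unify(b, a, s)
    if isinstance(a, tuple) and isinstance(b, tuple) \
            and a[0] == b[0] and len(a) == len(b):
        for x, y in zip(a[1:], b[1:]):
            s = unify(x, y, s)
            if s is None:
                return None
        return s
    return None

print(unify(('mother', 'X', 'jill'), ('mother', 'susan', 'jill')))  # {'X': 'susan'}
print(unify('X', ('p', 'X')))   # None: rejected by the occurs check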
The resolution rule can be traced back to 1960, when it was introduced by Davis and Putnam. However, their algorithm required all ground instances of the given formula, which resulted in a combinatorial explosion. The source of the combinatorial explosion was eliminated in 1965, when J. A. Robinson introduced an algorithm for unification. Unification allowed the instantiation of the formula during the proof “on demand”, just as per the need, through the newly introduced most general unifier (mgu) algorithm [8].
The resolution method for (propositional) logic, due to J. A. Robinson (1965), is sound and complete, and is a well-known procedure for checking the unsatisfiability of a set of clauses. Resolution is machine-oriented rather than human-oriented. It is quite powerful, both in the psychological sense that it condones single inferences which are often beyond the ability of a human to grasp, and in the theoretical sense that it alone, as the sole inference principle, forms a complete system of FOPL. Because there is only one inference type, it allows us to avoid the combinatorial obstacles to efficiency.
Let us refresh the terminology we discussed in the beginning of this chapter. We will designate a literal by the symbol L; a literal is either a propositional symbol or the negation of one.
[Fig. 3.2: DAG for theorem proving, with root clauses {P, Q}, {P, ¬Q}, {¬P, Q}, {¬P, ¬Q}, intermediate resolvents {P}, {Q}, {¬P}, and the empty clause {} at the bottom.]
It can be easily verified that (¬P ∨ Q) ∧ (¬Q ∨ R) |= (¬P ∨ R); hence (¬P ∨ Q) ∧ (¬Q ∨ R) ⇒ (¬P ∨ R) is a valid statement. Thus, ¬P ∨ R is the inference, or the resolvent. Arriving at a proof in this way is called proof by refutation.
Resolution says that if there is an axiom of the form ¬P ∨ Q and another axiom of the form ¬Q ∨ R, then ¬P ∨ R logically follows; it is called the resolvent. Let us see why this is so. When ¬P ∨ Q is True, either ¬P is True or Q is True; and when ¬Q ∨ R is True, either ¬Q is True or R is True. Since Q and ¬Q cannot both be True, we can say that ¬P ∨ R is certainly True. This generalizes from two expressions to any number of expressions, so long as two literals of opposite signs are present.
To prove a theorem, one obvious strategy is to search forward from the axioms using sound rules of inference. Instead, we try to prove the theorem by refutation, which requires showing that the negation of the theorem cannot be True. The steps for a proof by resolution are:
1. Assume that the negation of the theorem is True.
2. Try to show that the axioms and the assumed negation of the theorem together cannot be True.
3. Conclude that the above leads to a contradiction.
4. Conclude that the theorem is True, because its negation cannot be True.
For example, given the two sentences winter ∨ summer and ¬winter ∨ cold, we:
1. identify the complementary literals winter and ¬winter in the two sentences, and
2. use the resolution rule to eliminate the complementary literals from both sentences to get
CNF: summer ∨ cold.
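A small sketch of proof by refutation with propositional resolution, in Python; clauses are frozensets of literals, and the winter/summer sentences above are used for illustration (the encoding is our own assumption).

def negate(lit):
    return lit[1:] if lit.startswith('~') else '~' + lit

def resolvents(c1, c2):
    # Yield every resolvent of the two clauses on a complementary pair.
    for lit in c1:
        if negate(lit) in c2:
            yield frozenset((c1 - {lit}) | (c2 - {negate(lit)}))

def refute(clauses):
    # True iff the clause set is unsatisfiable, i.e., the theorem is proved.
    clauses = set(clauses)
    while True:
        new = {r for a in clauses for b in clauses if a != b
                 for r in resolvents(a, b)}
        if frozenset() in new:        # empty clause: contradiction reached
            return True
        if new <= clauses:            # nothing new can be derived: satisfiable
            return False
        clauses |= new

kb = [frozenset({'winter', 'summer'}), frozenset({'~winter', 'cold'}),
      frozenset({'~summer'}), frozenset({'~cold'})]   # axioms + negated theorem
print(refute(kb))   # True: summer or cold follows from the axioms

The saturation loop above resolves every pair of clauses at each round, which is exactly the breadth-first strategy discussed next.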
The question is, how can you be so clever as to pick the right clauses to resolve? The answer is that you take advantage of two ideas:
1. You can be sure that every resolution involves the negated theorem, directly or indirectly.
2. You know where you are and where you are going, hence you can compute the difference, to help your intuition in the selection of clauses.
Consider that there are in total n clauses, c1 . . . cn . We can try to match c1 with c2 . . . cn , and at the next level match c2 with c3 . . . cn , and so on. This results in breadth-first search (BFS). Suppose that the resolvents generated by this matching are c1 . . . cm . Next, all the newly generated clauses are matched against the originals, and then they are merged into the original set. This process is repeated until a contradiction is reached, showing that the theorem is proved. Since the entire set of clauses is compared, the proof is bound to be found, if it exists at all. This gives completeness to the resolution proof.
The alternative is that nodes which are farther and farther away are matched before those which are closer to the root: c1 is matched with its first child c2 out of c2 . . . cn ; then c2 is matched with the first child it generates, and so on, resulting in the search process called depth-first search (DFS).
However, both of the above are brute-force algorithms, and are computationally expensive. Other methods are heuristics-based; indeed, it is difficult to express the required concepts in pure logic. One approach is to prefer the clauses having the smallest number of literals; another, to use the negated clauses.
The resolution search strategies are subject to the exponential-explosion problem. Due to this, proofs which require long chains of inferences will be exponentially expensive in time.
All resolution search strategies are also subject to a version of the halting problem, for the search is not guaranteed to terminate unless there actually is a proof. In fact, all complete proof procedures for the first-order predicate calculus are subject to the halting problem. Complete proof procedures are said to be semi-decidable, because they are guaranteed to tell you that an expression is a theorem only if the expression is indeed a theorem.
Theorem proving is suitable for certain problems, but not for all problems, due to
the following reasons:
To find out the satisfiability of the formula (3.4), we need to substitute (instantiate) a value for x (an instance of x) and check whether flies(x) is true. Until such an x is found, this may require instantiation with a large number of values. Similarly, to check whether Eq. (3.4) is valid may require an infinitely large number of values in the domain of x to be verified; if any one of them makes the formula false, the formula is not valid. Thus, checking the satisfiability as well as the validity of a formula in predicate logic is a complex process. The approaches of the truth table and the tableau method discussed in the previous chapter are applicable here also.
Given a predicate sentence with m predicates, each having one argument, where the domain size of all the arguments is n, in the worst case it will require a total of nᵐ substitutions to test for satisfiability, as well as for validity checking; whereas a sentence of m propositions requires in the worst case only 2ᵐ substitutions. Hence, satisfiability checking for predicate sentences is much more complex than for propositional logic. The same applies to expressions having existential quantifiers, like ∃x[bird(x) → flies(x)].
Thus, it is only through proof methods that logical deductions can be carried out in realistic time.
Example 3.11 Given “All men are mortal” and “Socrates is a man”, infer using predicate logic that “Socrates is mortal”.
∀x[man(x) ⇒ mortal(x)],
man(socrates). (3.5)
man(socrates) ⇒ mortal(socrates),
man(socrates). (3.6)
The query
?- man(X).
in Prolog (to be discussed later) will match (the matching is called unification or substitution) man(X) with man(socrates) under a unification set, say, θ = {socrates/X}. The substitution which returns man(socrates) is represented by man(X)θ.
mortal(socrates) :- man(socrates).
man(socrates).
Here, the sign ‘:-’ is read as ‘if’. The subexpression before the sign ‘:-’ is called the ‘head’ or procedure name, and the part after ‘:-’ is called the body of the rule. The sentence (3.6) can also be written in clause form (to be precise, in Horn clause form) as:
mortal(socrates) ∨ ¬man(socrates).
man(socrates). (3.7)
For a predicate program, the Herbrand universe and Herbrand base are countably infinite if the program contains a function symbol of positive arity. If the arity of all function symbols is zero, then both the Herbrand universe and base are finite [3].
In special cases, when the resolution proof is used on FOPL, it reduces the expressions to propositional form. If the set of clauses is A, its Herbrand universe is the set of all the ground terms formed using only the function symbols and constants in A. For example, if A has constants a, b, and a unary function symbol f, then the Herbrand universe is the infinite set
{a, b, f(a), f(b), f(f(a)), f(f(b)), . . . }.
The Herbrand base of A is the set of all ground clauses cθ, where c ∈ A and θ is a substitution that assigns the variables in c to terms in the Herbrand universe.
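The level-by-level construction of this infinite set can be sketched in Python as a generator (an illustration under our own string encoding of terms, not the book's):

from itertools import islice, product

def herbrand_universe(constants, functions):
    # functions: list of (name, arity). Yields ground terms as strings,
    # level by level: each level applies every function symbol to all
    # the terms constructed so far.
    seen = list(constants)
    yield from seen
    frontier = list(constants)
    while frontier:
        frontier = [name + "(" + ", ".join(args) + ")"
                    for name, arity in functions
                    for args in product(seen, repeat=arity)]
        frontier = [t for t in frontier if t not in seen]
        yield from frontier
        seen += frontier

print(list(islice(herbrand_universe(['a', 'b'], [('f', 1)]), 6)))
# ['a', 'b', 'f(a)', 'f(b)', 'f(f(a))', 'f(f(b))']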
Horn clauses
Consider the terms without variables, constructed using the constants and function symbols that occur in a set of clauses A. These terms form the data structures which are manipulated by the program built into the clauses A. The collection of all such terms, determined by A, is called the Herbrand universe. Every n-ary predicate symbol P occurring in A denotes an n-ary relation over the Herbrand universe of A. We call the n-tuples which belong to such relations input-output tuples, and the relations themselves input-output relations.
In an inference system, the operational semantics determine a unique denotation for a predicate P, such that an n-tuple (t1 , . . . , tn ) belongs to the denotation of P ∈ A iff A ⊢ P(t1 , . . . , tn ).
We first pick an arbitrary constant, say a, and then construct the variable-free terms. Formally, D(P) is inductively defined as follows:
1. a ∈ D(P), and
2. for every n-ary function symbol f occurring in P and every t1 , t2 , . . . , tn ∈ D(P): f (t1 , t2 , . . . , tn ) ∈ D(P).
Definition 3.4 (General logic program) It is a finite set of general rules, with both
positive and negative subgoals.
A general logic program comprises rules and facts. A general rule's format is: its head, or conclusion, is to the left of the symbol “←” (read “if”), and its subgoals (called the body) are to the right of the symbol “←”. Following is an example of a rule, where p(X) is the head, q(X) is a positive subgoal, and r(X) is a negative subgoal:
p(X) ← q(X), ¬r(X).
This rule may be read as “p(X) if q(X) and not r(X).” A Horn rule has no negative subgoals, and a Horn logic program is made only of Horn rules.
We will follow the conventions of Prolog for naming the objects: the logical
variables begin with an uppercase letter, while the constants, functions, and predicates
begin with a lowercase letter. For both the predicate and its relation, we will use the
same symbol, for example p.
The following may be the arguments of a predicate:
1. a constant/variable is a term;
2. a function symbol with terms as arguments is a term.
The terms may be viewed as the data structures of the program, with function symbols serving as record names. Often a constant is treated as a function symbol of arity zero.
Following are the definitions of some important terms.
Definition 3.7 (Complement of a set) For a set of literals L, its complement, represented by ¬L, is the set formed by complementing each literal in L.
Further,
– p is said to be inconsistent with L if p ∈ ¬L,
– sets of literals R and L are inconsistent if at least one literal in R is inconsistent with L, i.e., when R ∩ ¬L ≠ ∅,
– a set of literals is inconsistent if it is inconsistent with itself; otherwise it is consistent.
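These definitions can be transcribed almost directly into Prolog; the representation neg(A) for a negative literal and the predicate names below are our own choices:

% complement(L, C): C is the complement of the literal L
complement(neg(A), A) :- !.
complement(A, neg(A)).

% inconsistent(R, L): some literal of R has its complement in L
inconsistent(R, L) :-
    member(P, R),
    complement(P, Q),
    member(Q, L).

% ?- inconsistent([p, neg(q)], [q, r]).
% true, since neg(q) and q are complementary.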
When using the classical formulation, Herbrand's theorem relates the validity of a first-order formula in Skolem prenex form to the validity of one of its Herbrand extensions. That means the formula ∃x1 . . . ∃xn ψ(x1, . . . , xn) is valid if, and only if, the disjunction
ψ(t1,1, . . . , t1,n) ∨ · · · ∨ ψ(tm,1, . . . , tm,n)
is valid for some m ≥ 1 and some collection of ground Herbrand terms ti,j.
Since every classical first-order formula can be reduced to this prenex form through Skolemization while preserving its satisfiability, Herbrand's theorem provides a way to reduce the question of validity of first-order formulas to that of propositional logic formulas.
Although the required Herbrand extension and the terms ti,j cannot be computed recursively (for otherwise first-order logic would be decidable), this result is highly useful for automated reasoning, as it underlies some highly efficient proof methods such as resolution and resolution refutation.
Theorem 3.1 A closed formula F in Skolem form is satisfiable if and only if it has
a Herbrand model.
Proof If the formula has a Herbrand model, then it is satisfiable. For the other direction, let A = (UA, IA) be an arbitrary model of F. We define a Herbrand structure B = (UB, IB) as follows:
Universe: UB = D(F);
Function symbols: f B(t1, t2, . . . , tn) = f (t1, t2, . . . , tn);
Predicate symbols: (t1, . . . , tn) ∈ P B iff (A(t1), . . . , A(tn)) ∈ P A.
Claim: B is also a model of F.
We prove a stronger assertion: for every closed formula G in Skolem form such that G∗ contains only atomic formulas of F∗: if A |= G then B |= G. The proof is by induction on the number n of universal quantifiers of G.
Basis (n = 0): G then has no quantifiers at all, and it follows that A(G) = B(G), which proves the claim in this case.
To perform reasoning with the Herbrand base, unifiers are not required, and we have a sound and complete reasoning procedure which is guaranteed to terminate. The idea used in this approach is: the Herbrand base will typically be an infinite set of propositional clauses, but it will be finite when the Herbrand universe is finite (there are no function symbols and only finitely many constants appear in it). Sometimes we can keep the universe finite by considering the type of the arguments (say t) and the values of functions (f), and include a term like f(t) in the universe only if the type of t is appropriate for the function f. For example, f(t) may be birthday(john), which produces a date.
It is easy to procedurally interpret sets of clauses, say A, which contain at most one positive literal per clause; along with this, any number of negative literals may also exist. Such sets of clauses are called Horn sentences, or Horn clauses, or simply clauses. We distinguish three kinds of Horn clauses [3].
1. ‘[]’, the empty clause, containing no literals and denoting the truth value false, is interpreted as a halt statement.
2. B̄1 ∨ · · · ∨ B̄n, a clause consisting of no positive literals and n ≥ 1 negative literals, is interpreted as a goal statement. Note that the goal statement is negated and added into the knowledge base to obtain the proof through resolution refutation.
3. A ∨ B̄1 ∨ · · · ∨ B̄n, a clause consisting of exactly one positive literal and n ≥ 0 negative literals, is interpreted as a procedure declaration (i.e., a rule in a Prolog program). The positive literal A is the procedure name and the negative literals collectively form the procedure body. Each negative literal B̄i in the procedure body is interpreted as a procedure call. When n = 0, the procedure declaration has an empty body and is interpreted as an unqualified assertion of fact.
Suppose the name of procedure A matches the procedure call Ai, i.e., some substitution θ of terms for variables makes Ai and A identical. In such a case, resolution derives a new goal statement from formulas (3.11) and (3.12), as given below, subject to the substitution θ.
Suppose it is required to compute the result of appending the list cons(b, nil) to the list cons(a, nil). The goal statement is therefore,
← append(cons(a, nil), cons(b, nil), v),
where v (a variable) and a, b (constants) are the “atoms” of the lists. To prove using resolution, this goal statement (the negation of the goal) is added into the set of clauses. The program is activated by this goal statement to carry out the append operation. With this goal statement the program is deterministic, because only one choice is available for matching. The following computation follows, with a goal-directed theorem prover as the interpreter. The goal statement (3.18) matches the clause statement (3.15) with the matchings x = a, y = nil, z = cons(b, nil). Also, v = cons(x, u) = cons(a, u), i.e., there exists a unifier θ1 = {cons(a, w)/v}, where the variable u has been renamed as w. On unifying clauses (3.18) and (3.15), the next computation is the new goal statement
C2 = ← append(nil, cons(b, nil), w).
Resolving C2 against the clause for appending the empty list, with the unifier θ2 = {cons(b, nil)/w}, yields the final computation
C3 = []θ2. (3.20)
The substitutions then give,
v = cons(a, u)
= cons(a, w)
= cons(a, cons(b, nil)).
The above result is equal to the goal append(cons(a, nil), cons(b, nil), v).
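The same append relation runs directly as a Prolog program over cons/nil terms; append_list/3 is our own name, chosen to avoid a clash with the built-in append/3:

append_list(nil, Y, Y).
append_list(cons(X, U), Y, cons(X, V)) :-
    append_list(U, Y, V).

% ?- append_list(cons(a, nil), cons(b, nil), V).
% V = cons(a, cons(b, nil)).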
The following axioms, which are already in clausal form, are about the observed block relationships shown in Fig. 3.3.
on(cylinder, box).
on(box, table).
After going through the procedure for conversion to clausal form, the above axioms are transformed into clauses. Now, we manually run Algorithm 3.1 on the clauses (1)–(5), as well as on those newly created, unifying them according to the unification Algorithm 3.2, until we reach a null resolvent.
First we resolve clauses (2) and (5), binding x to ‘cylinder’ and z to ‘table’. Applying resolution, we get resolvent (6). The unifier for this is {cylinder/x, table/z}.
Next, we resolve clause (1) with (6), binding u to y and v to ‘table’; we get (7). The unifier for this is {y/u, table/v}.
We use (1) again with (7), with u bound to cylinder and v to y. The unifier for this is {cylinder/u, y/v}. On resolving we get (8).
Next, we use clauses (3) and (8), binding y to box, with the unifier {box/y}. We get (9) as the resolvent.
Since we have arrived at a contradiction, the negation of the theorem, ¬above(cylinder, table), must be false. Hence the theorem above(cylinder, table) must be true.
Consider the two clauses,
c1 = p(g(x), f (x), z)
and
c2 = ¬p(y, f (w), a).
These two clauses can be unified by more than one substitution, say by some θ1, and also by some other θ2.
Note that a constant, a variable, or a function term substitutes for a variable, and not the other way around.
We may very well be able to derive the empty clause using c1, c2 with the substitution θ1, followed by an application of resolution. But if we cannot, we will need to consider other substitutions like θ2.
The trouble is that both of these substitutions are overly specific. We can see that any unifier must give w the same value as x, and y the same value as g(x), but we do not need to commit yet to a value for x. The substitution,
θ3 = {g(x)/y, x/w, a/z},
unifies the two literals without making an arbitrary choice that might preclude a path to the empty clause. This θ3 is a most general unifier (mgu).
More precisely, an mgu θ of literals ρ1 and ρ2 is a unifier with the property that, for any other unifier θ′, there is a further substitution θ∗ such that θ′ = θθ∗. So, starting with θ, you can always get to any other unifier by applying additional substitutions. For example, given θ3, we can get to θ1 by further applying λ = {b/x}, so that θ1 = θ3λ. And we can get to θ2 by μ = {f (z)/x}, so that θ2 = θ3μ. Note that an mgu need not be unique. For example, θ4 = {g(w)/y, a/z, w/x} is also an mgu for c1 and c2.
The key fact about mgus is that we can limit the resolution rule to mgus without loss of completeness. This helps immensely in the search, since it dramatically reduces the number of resolvents that can be inferred from two input clauses.
Suppose you have two expressions p(x) and p(y). One way to unify these is to substitute some constant expression for x and y: S = {fred/x, fred/y}. But this is not the most general unifier, because if we substitute a variable for both x and y, we get a more general unifier: G = {z/x, z/y}. The first unifier is a valid unifier, but it would lessen the generality of the inferences that we might want to make.
So, given a unifier, there may exist a more general unifier. When the more general unifier is composed with a further substitution and used to instantiate the original expression E, you get the same instance as was obtained with the earlier unifier.
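Prolog's built-in ‘=’ computes exactly such an mgu, which makes the distinction easy to observe in a session (a minimal swi-prolog sketch):

% ?- p(X) = p(Y).
% X = Y.          % most general: the variables are merely aliased
% ?- p(X) = p(fred).
% X = fred.       % the more specific unifier {fred/X}
% ?- f(a) = g(a).
% false.          % no unifier exists: clash of function symbols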
3.9.1 Lifting
It is necessary to show that the general resolution principle is sound and complete. However, a technical difficulty arises in the completeness proof. Using Herbrand's theorem and semantic trees, we can prove that there is a ground resolution refutation of an unsatisfiable set of clauses. But this cannot be generalized to a proof for general resolution, because the concept of semantic trees does not generalize. The reason it cannot be generalized is the variables, which give rise to a potentially infinite number of elements in the Herbrand base, as we shall show shortly.
Fortunately, there is a technique, called “lifting”, to prove the completeness of the theorem. The steps for lifting are:
1. first prove the completeness of the system for a set of ground clauses; then,
2. as a second step, lift the proof to the non-ground case.
Let us assume that there are two non-ground clauses: 1. p(u, a) ∨ q1(u), and 2. ¬p(v, w) ∨ q2(v, w). If the signature contains function symbols, then these clauses have an infinite set of instances, as follows:
The above difficulty can be overcome by taking a ground resolution refutation and “lifting” it to a more abstract, general form.
Lifting is the idea of representing the infinite number of ground inferences of the form given in Eq. (3.21) by a single non-ground inference:
This lifting can be done using the most general unifier, which we discuss shortly.
Example 3.17 Find the lifting for the following clauses:
C1 = p(u) ∨ p(f (v)) ∨ p(f (w)) ∨ q(u)
C2 = ¬p(f (x)) ∨ ¬p(z) ∨ r(x)
Using the substitution θ = {f (a)/u, a/v, a/w, a/x, f (a)/z}, the above clauses become C1′ = p(f (a)) ∨ q(f (a)) and C2′ = ¬p(f (a)) ∨ r(a). Using C1′ and C2′, resolution gives C′ = q(f (a)) ∨ r(a). Lifting claims that there is a clause C = q(f (x)) ∨ r(x), a resolvent of clauses C1 and C2, such that C′ is a ground instance of C. This can be realized by using the unification algorithm to obtain a most general unifier (mgu) of the clauses C1 and C2, namely {f (x)/u, x/v, x/w, f (x)/z}, with which the two clauses resolve to C.
Once the disagreement is resolved through unification for this symbol position, there is no disagreement at this position. The process is repeated for the new first symbol position at which all the wffs in A do not have the same symbol, and so on, until A becomes a singleton.
Evidently, if A is nonempty and is not a singleton (a set with exactly one element), then the disagreement set of A is nonempty and not a singleton. Also, if θ unifies A, and A is not a singleton, then θ unifies the disagreement set of A.
If A is a finite nonempty set of well-formed expressions for which the substitution Algorithm 3.2 terminates with “return σA”, then the substitution σA available as the output of the unification algorithm is called the most general unifier (mgu) of A, and A is said to be most generally unifiable [8, 9].
Manually running Algorithm 3.2 for the disagreement set in (3.22), the stepwise computation of σk is as follows:
For k = 0, σ0 = ε,
σ2 = σ1 {a/k(y)}
= {k(y)/h(x, y)}{a/k(y)}.
The same process is repeated for the disagreement set of the 3rd argument in A, which results in the substitution {b/y}.
σ3 = σ2 {b/y}
= {k(y)/h(x, y)}{a/k(y)}{b/y}.
In the well-founded semantics, unfounded sets provide the basis for negative conclusions. Let P be a program (a set of rules and facts in FOPL), let H be its associated Herbrand base, and let I be a partial interpretation of P. Then a set A ⊆ H is called an unfounded set of P with respect to the interpretation I under the following condition: for each instantiated rule R ∈ P, at least one of the following holds (in the rules of P, we assume that p is the head and the qi are the corresponding subgoals):
1. Some positive or negative subgoal qi of the body of the rule is false in the interpretation I,
2. Some positive subgoal qi of the body occurs in the unfounded set A.
For a rule R with respect to I, a literal that makes condition 1 or 2 above true is called a witness of unusability.
Intuitively, the interpretation I is the intended model of P. The rules that satisfy condition 1 cannot be used for further derivations because their hypotheses are already known to be false.
Condition 2 above, called the unfoundedness condition, states that every rule which might still be usable to derive something in A has one of its positive subgoals in A. In other words, there is not a single atom in A that can be established to be true by the rules of P (as per the knowledge of the interpretation I). Therefore, if we infer that some or all atoms in A are false, there is no way available later by which we could infer that such an atom is true [4].
Hence, the well-founded semantics uses conditions 1 and 2 to draw negative conclusions, simultaneously inferring all atoms in A to be false. The following example demonstrates the construction of an unfounded set from a set of rules and facts.
Example 3.19 Unfounded set.
Assume that we have a program in predicate logic with instantiated atoms.
p(c).
q(a) ← p(d ).
p(a) ← p(c), ¬ p(b).
q(b) ← q(a).
p(b) ← ¬ p(a).
p(d ) ← q(a), ¬ q(b).
p(d ) ← q(b), ¬ q(c).
p(e) ← ¬ p(d ).
From the above rules, we see that A = {p(d), q(a), q(b), q(c)} is an unfounded set with respect to ∅ (the empty interpretation). Since A is unfounded, its subsets are also unfounded. The component {q(c)} is unfounded due to condition (1), because there is no rule available to establish its truth. The set {p(d), q(a), q(b)} is unfounded due to condition (2) (their subgoals, or bodies, appear in the unfounded set).
There is no way available to establish p(d) without first establishing q(a) or q(b). In other words, whether we can establish ¬q(b) to support the first rule for p(d) is irrelevant as far as the determination of unfoundedness is concerned.
Interestingly, there is also no way available to establish q(a) without first establishing p(d), and no way available to establish q(b) without first establishing q(a). Further, q(c) can never be proven. We note that among p(d), q(a), and q(b) as goals, none can be proved without the other two, or their negations, as subgoals.
The pair p(a), p(b), even though its members depend on each other, does not form an unfounded set, because the only dependence is through negation. One might conclude that any attempt to prove p(a) and p(b) will fail, but this claim is faulty.
The difference between the sets {p(d), q(a), q(b)} and {p(a), p(b)} is as follows: declaring any of p(d), q(a), or q(b) false (unfounded) does not create a proof that any other element of the set is true.
Finally, consider the set {p(a), p(b)}: if either of the elements p(a) or p(b) is taken to be false, it becomes possible to prove that the other is true. And if both are declared false together, there is an inconsistency.
3.11 Summary
Exercises
4. Show that a formula in CNF is valid if and only if each of its disjunctions contains
a pair of complementary literals P and ¬P.
5. Prove or disprove the following:
a. If S is a first-order formula, then S is valid iff S → ⊥ is a contradiction.
b. If S is a first-order formula and x is a variable, then S is a contradiction iff ∃x S is a contradiction.
6. Using the resolution principle, prove the validity of the following formula:
8. Consider a set of statements of FOPL that uses two 1-place predicates: Large and Small. The object constants are a, b. Find all possible models for this program. For each of the following sentences, find the models in which the sentence is true.
a. ∀x Large(x).
b. ∀x ¬Large(x).
c. ∃x Large(x).
d. ∃x ¬Large(x).
e. Large(a) ∧ Large(b).
f. Large(a) ∨ Large(b).
g. ∀x [Large(x) ∧ Small(x)].
h. ∀x [Large(x) ∨ Small(x)].
i. ∀x [Large(x) ⇒ ¬Small(x)].
9. Find the clauses for the following FOPL formulas.
a. ∃x∀y∃z(P(x) ⇒ (Q(y) ⇒ R(z))).
b. ∀x∀y((P(x) ∧ Q(y)) ⇒ ∃zR(x, y, z)).
10. Define the required predicates and represent the following sentences in FOPL.
a. Some students opted Sanskrit in fall 2015.
b. Every student who opts Sanskrit passes it.
c. Only one student opted Tamil in fall 2015.
d. The best score in Sanskrit is always higher than the best score in Tamil.
e. There is a barber in a village who shaves everyone in the village who does not shave himself/herself.
f. A person born in country X , each of whose parents is a citizen of X or a
resident of X , is also a resident of X .
11. Determine whether the expressions p and q unify with each other in each of the following cases. If so, give the mgu; if not, justify why. The lowercase letters are variables, and the uppercase letters are predicates, functions, and literals.
a. p = f (x1 , g(x2 , x3 ), x2 , b); q = f (g(h(a, x5 ), x2 ), x1 , h(a, x4 ), x4 ).
b. p = f (x, f (u, x)); q = f (f (y, a), f (z, f (b, z))).
c. p = f (g(v), h(u, v)); q = f (w, j(x, y)).
12. What can be the strategies for combining clauses in a resolution proof? For example, if there are N clauses, in how many ways can they be combined?
13. Why is resolution-based inference more efficient compared to modus ponens?
14. Let Γ be a knowledge base and α an inference from Γ. Give a comparison among the following inferences, in terms of their performance:
a. Proof by resolution, i.e., Γ ⊢ α,
b. Proof by modus ponens, i.e., Γ ⊢ α,
c. Proof by resolution refutation, i.e., Γ ∪ {¬α} ⊢ φ.
15. Given n clauses, draw a resolution proof tree to demonstrate combining them. Suggest any two strategies.
16. Given a knowledge base in clausal form, is it possible to extract answers from it making use of the resolution principle? For example, finding an answer like, “Where is the Tajmahal located?”
17. Represent the following set of statements in predicate logic, convert them to clause form, and then apply the resolution proof to answer the question: Did Ranjana kill Lekhi?
“Rajan owns a cat. Every cat owner is an animal lover. No animal lover ever kills an animal. Either Rajan or Ranjana killed a cat, called Lekhi.”
18. Explain:
a. Unification
b. Skolemization
c. Resolution principle versus resolution theorem proving.
19. Use resolution to show that the following set of clauses is unsatisfiable.
{p(a, z), ¬p(f (f (a)), a), ¬p(x, g(y)) ∨ p(f (x), y)}.
20. Derive ⊥ from the following set of clauses using the resolution principle.
Facts:
ontable(a), ontable(c), on(d , c), on(b, a), heavy(b),
cleartop(e), cleartop(d ), heavy(d ), wooden(b), on(e, b).
Rules:
All blocks with clear top are black.
All wooden blocks are black.
Every heavy and wooden block is big.
Every big and black block is on a green block.
Making use of the resolution theorem, find the block that is on the green block.
23. Given the following knowledge base:
If x is on top of y then y supports x.
If x is above y and they are touching each other then x is on top of y.
A phone is above a book.
A phone is touching a book.
Translate the above knowledge base into clause form, and use resolution to show
that the predicate “supports(book, phone)” is true.
24. How can resolution be used to show that a sentence is:
a. Valid?
b. Unsatisfiable?
25. “The application of the resolution principle for theorem proving is a non-deterministic approach.” Justify this statement.
26. a. Use Herbrand's method to show that the formula
∀x shaves(barber, x) → ¬shaves(x, x)
is unsatisfiable.
b. What is the Herbrand universe for S = {P(a), ¬P(f (x)) ∨ P(g(x))}?
27. Prove that ∀x¬p(x) and ¬∃x p(x) are equivalent statements.
28. Let S and T be unification problems. Also, let σ be a most general unifier for S
and θ be a most general unifier for σ (T ). Show that θ σ is a most general unifier
for S ∪ T .
29. Write the axioms describing predicates: grandchild, grandfather, grandmother,
soninlaw, fatherinlaw, brother, daughter, aunt, uncle, brotherinlaw, and first-
cousin.
30. For each pair of atomic sentences in the following, find out the most general
unifier.
a. knows(father(y), y) and knows(x, x).
b. {f (x, g(x)) = y, h(y) = h(v), v = f (g(z), w)}.
c. p(a, b, b) and p(x, y, z).
d. q(y, g(a, b)) and q(g(x, x), y).
e. older(father(y), y) and older(father(x), ram).
31. Explain what is wrong with the definition of the set-membership predicate ∈ given below:
∀x, s : x ∈ {x | s}
∀x, s : x ∈ s ⇒ ∀y : x ∈ {y | s}.
32. Consider the following riddle: “Brothers and sisters have I none, but that man's father is my father's son”. Use the rules of kinship relations to show who that man is.
33. Let the following be a set of facts and rules:
Rita, Sat, Bill, and Eden are the only members of a club.
Rita is married to Sat.
Bill is Eden’s brother.
Spouse of every married person in the club is also in the club.
References
1. Chowdhary KR (2015) Fundamentals of discrete mathematical structures, 3rd edn. EEE, PHI
India
2. Davis M, Putnam H (1960) A computing procedure for quantification theory. J ACM 7(3):201–
215. https://doi.org/10.1145/321033.321034
3. van Emden MH, Kowalski RA (1976) The semantics of predicate logic as a programming language. J ACM 23(4):733–742
4. Van Gelder A, Ross KA, Schlipf JS (1991) The well-founded semantics for general logic programs. J ACM 38(3):620–650
5. Luckham D, Nilsson NJ (1971) Extracting information from resolution trees. Artif Intell 2:27–54
6. Nilsson NJ (1980) Principles of artificial intelligence, 3rd edn. Narosa, New Delhi
7. Robinson JA (1963) Theorem-proving on the computer. J ACM 10(2):163–174
8. Robinson JA (1965) A machine-oriented logic, based on the resolution principle. J ACM
12(1):23–41
9. Stickel ME (1981) A unification algorithm for associative-commutative functions. J ACM
28(3):423–434
Chapter 4
Rule Based Reasoning
4.1 Introduction
Symbolic rules are popular for knowledge representation and reasoning. Their popularity stems mainly from their naturalness, which facilitates comprehension of the represented knowledge. The basic form of a rule is,
if ⟨conditions⟩ then ⟨conclusion or action⟩
In this chapter we will discuss the use of easily stated if-then rules to solve problems. In particular, you will learn forward chaining from assertions and backward chaining from hypotheses. Illustrative examples will explain the rule-firing sequences and graph search. The complexities of the algorithms for both forward and backward rule chaining are also analyzed and presented. In addition, the complexities of the preconditions of rules, and other popular types of expert systems, such as case-based reasoning and model-based reasoning, are presented.
Rule-based systems (RBSs) constitute one of the most common and simplest-to-implement means for codifying the problem-solving know-how of human experts. Experts tend to express most of their problem-solving techniques in terms of a set of situation-action rules, and this suggests that RBSs should be the method of choice for building knowledge-intensive expert systems. RBSs share the following key properties [3]:
1. They incorporate practical human knowledge in conditional if-then rules,
2. Their skill increases at a rate proportional to the enlargement of their knowledge
bases,
3. By selecting the relevant rules and combining them appropriately, it is possible
to solve a large range of problems, of varying complexities.
4. They can be designed to adaptively decide the best sequence of rules to execute,
and
5. The RBS can retrace the reasoning steps and explain the justification for the
conclusions drawn. In this retracing, it can show each rule executed, using natural
language.
Table 4.1 lists the major application areas addressed by RBS.
Roughly speaking, a rule-based system comprises a knowledge base and an inference engine (see Fig. 4.1). The knowledge base is a collection of facts and rules. Facts are simple statements with constant or variable object lists, while rules always express a conditional sentence, with one or more premises and a consequent component. Rules need to be interpreted: if the premises can be satisfied, the consequent is also satisfied. The consequent can be a conclusion or an action. When it is an action, the effect of satisfying the premises is to schedule that action for execution; when the consequent defines a conclusion, the effect is to infer the conclusion [2].
Several key techniques for organizing RBSs have emerged. Rules can be used to
express deductive knowledge, such as logical relationships, and thereby to support
inference, verification, or evaluation tasks. Conversely, rules can be used to express
goal-oriented knowledge that an RBS can apply in seeking problem solutions and
cite in justifying its own goal-seeking behavior. The rules can also be used to express
causal relationships, which an RBS can use to answer “what-if” questions, or to
determine possible causes for specified events.
(Fig. 4.1 shows the inference engine cycle: match, producing the conflict set; select; and execute.)
In a rule-based system, each if pattern may match one or more assertions in a collection of assertions. This collection of assertions is called the working memory (Fig. 4.1). The assertions collectively may match the premises of one or more rules in the knowledge base. This is done by the “match” block of the inference engine. All the rules matching the assertions in the working memory are put in the conflict set, from which one of the rules is “selected”, based on some preset conflict-resolution criteria, and then the selected rule is executed. The resulting consequences/conclusions (the then patterns) are put in the working memory to form new assertions, and the process continues until the desired result (goal) is reached.
A system like this is called a deduction system, as it deduces new inferences from the rules and assertions. A realistic example of a rule is given below.
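As an illustrative sketch of our own (in Prolog syntax; the predicate names are hypothetical), a diagnostic rule might read:

% if the patient has fever and the patient has rash,
% then conclude that measles is suspected
suspected(measles) :-
    symptom(fever),
    symptom(rash).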
Facts are the kind of data in a knowledge base that express assertions about properties, relations, propositions, etc. In contrast to rules, which the RBS interprets as imperatives, facts are usually static and implicitly inactive. Also, a fact is silent regarding the pragmatic value and dynamic utilization of its knowledge. Although in many contexts facts and rules are logically interchangeable, in RBSs they are quite distinct.
In addition to its static memory for facts and rules, an RBS uses a working-
memory to store temporary assertions. These assertions record earlier rule-based
inferences. We can describe the contents of working memory as problem-solving
state information. Ordinarily, the data in working memory adhere to the syntactic
conventions of facts. Temporary assertions thus correspond to dynamic facts.
The basic function of an RBS is to produce results. The primary output may be a problem solution, an answer to a question, or the result of an analysis of some data. Whatever the case, an RBS employs several key processes that determine its overall behavior.
In a forward chaining system, we start with the initial facts, and use the rules to
draw new conclusions (or take certain actions), given those facts. Forward chaining
systems are primarily data-driven. Whenever an if pattern is observed to match an
assertion, the antecedent is satisfied. When all the if patterns of a rule are satisfied,
the rule is triggered. When a triggered rule establishes a new assertion or performs
an action, it is fired. This procedure is repeated until no more rules are applicable
(Fig. 4.1).
The selection process is carried out in two steps:
1. Pre-selection: Determining the set of all the matching rules, also called the
conflict-set.
2. Selection: Selection of a rule from the conflict set by means of a conflict resolving
strategy.
The simple forward-chaining Algorithm 4.1 starts from the known facts in the knowledge base, triggers all the rules whose premises are among the known facts, and then adds the consequent of each into the knowledge base. This process is repeated until the query is answered or until no new conclusion is generated to be added into the knowledge base. We will use the symbols θ, λ, γ to represent substitutions. unify is a function that unifies the newly generated assertion q′ and the query α, and returns a unifying substitution λ if they unify, else returns null.
Let α be the goal. The forward-chaining Algorithm 4.1 picks any sentence s ∈ Γ, where Γ is the knowledge base, and checks all possible substitutions θ for s. Let the predicate form of s be p1 ∧ p2 ∧ · · · ∧ pn → q. Substituting θ results in (p1 ∧ p2 ∧ · · · ∧ pn)θ = (p1′ ∧ p2′ ∧ · · · ∧ pn′), such that each pi′ ∈ Γ. Then (p1′ ∧ p2′ ∧ · · · ∧ pn′) → q′, where q′ = qθ. This q′ is added as a new inference. If the inference q′ and the goal α unify, then their unifier λ is returned as the solution; else the algorithm continues.
The algorithm may fail to terminate when the raised query has no answer. For example, every natural number can be generated by recursively applying the successor operation on a natural number, assuming that 0 is a natural number; a query with no answer will then lead to an indefinite loop, generating ever larger numbers.
naturalnum(0).
∀x [naturalnum(x) → naturalnum(succ(x))].
The following example demonstrates the manual run of forward chaining Algo-
rithm.
Example 4.1 Produce the inference, given the knowledge base Γ = {man(x) → mortal(x), man(socrates)} and the query α = mortal(w), i.e., “Who is mortal?”
q′ = qθ
= mortal(x){socrates/x}
= mortal(socrates).
λ = unify(q′, α)
= unify(mortal(socrates), mortal(w))
= {socrates/w}.
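The same loop can be sketched in Prolog itself; the representation rule(Premises, Conclusion) and fact/1, and the predicate names, are our own assumptions:

:- dynamic fact/1.

rule([man(X)], mortal(X)).          % man(x) → mortal(x)
fact(man(socrates)).

all_hold([]).                       % all premises are known facts
all_hold([P|Ps]) :- fact(P), all_hold(Ps).

forward :-                          % fire any rule adding a new conclusion
    rule(Premises, Conclusion),
    all_hold(Premises),
    \+ fact(Conclusion),
    assertz(fact(Conclusion)),
    forward.
forward.                            % stop when no rule fires

% ?- forward, fact(mortal(W)).
% W = socrates.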
which is exponential. However, since the sizes and arities are constant in the real world, the complexity of expression (4.1) is polynomial in nature. To search the predicates for matching, they are indexed and a hash function is generated for quick search.
When we are doing data-directed reasoning, we may not want to fire all of the rules when more than one rule is applicable. In cases where we want to eliminate some applicable rules, some kind of conflict resolution is necessary for arriving at the most appropriate rule(s) to fire. In a deduction system all rules generally fire, but in a reaction system, when more than one rule is triggered at the same time, only one of the possible actions is desired [4].
The conflict-resolving strategy is used to avoid firing superfluous rules. The most obvious strategy is to choose a rule at random out of those that are applicable. Following are some common approaches for deciding the preference of the rule to fire.
Selection by order
Selection by order may be based either on the order in which the rule appears in the knowledge base, or on how recently the rule was used.
Order
Pick the first applicable rule in the order in which they have been presented. This is the type of strategy used by Prolog, and is one of the most common ones. Production-system programmers would take this strategy into account when formulating rule sets.
Recency
Select an applicable rule based on how recently it has been used. There are different
versions of this strategy, ranging from firing the rule that matches on the most recently
created (or modified) wff, to firing the rule that has been least recently used. The
former could be used to make sure that a problem solver stays focused on what it
was just doing (typical of depth-first search); the latter would ensure that every rule
gets a fair chance to influence the outcome (typical of breadth-first search).
Selection according to syntactic structure of the rule
There are many ways one can consider the syntax, e.g., it may be conditions and their
specificity, or the size of the rule: largest, smallest, or in increasing order of size.
Specificity Criteria
Select the applicable rule whose conditions are most specific. A set of conditions is taken to be more specific than another if the set of wffs that satisfy it is a subset of those that satisfy the other. For example, consider the following three rules:
(i) if x is a bird, then x can fly;
(ii) if x is a bird and x weighs more than 5 kg, then x cannot fly;
(iii) if x is a penguin, then x cannot fly.
Here, the second and third rules are both more specific than the first, because the conditions in (ii) and (iii) are more specific than in (i) (the condition of (i) is very general). If we have a bird that weighs more than 5 kg, or one that is a penguin, then the first rule applies, because the condition of “bird” is satisfied. However, in this case the other two rules should take precedence, as they are more specific. Note that if the bird is a heavy penguin, the second and third rules are both applicable, and another conflict-resolution criterion must be applied to decide between them.
Largest Rule First
Apply the syntactically largest rule, i.e., the rule which contains the most proposi-
tions.
Incremental in Size
Solutions to subproblems are constructed incrementally, and are cached to avoid recomputation.
Apart from the priority criteria, there are additional rules, called meta-rules, that control the selection.
The simplest way of implementing a forward-chained interpreter is to check the preconditions of all the rules, sort the applicable rules into an agenda (i.e., an ordered list) according to the criteria given by the conflict-resolving strategy, and then perform the action of the leading rule in the agenda, after which the previous agenda is deleted and the process begins anew.
Following are the criteria for the selection of rules in the implementation of the reasoning process.
Similarity Between Rules
The first improvement over the brute-force approach is based on the similarity between the preconditions of different rules. Using this, rules with common propositions may be eliminated altogether if these propositions do not hold. To achieve this, the rules are structured in a tangled hierarchy by constructing a network of interconnecting trees in which the internal decision nodes are formed by the propositions and the leaves by the rules. The path from the root of a tree to a rule includes all the propositions of the precondition of this rule.
Check Applicability of New Rules
This improvement is based on the fact that a completely new calculation of an agenda requires far more effort than a modification on the basis of changes. From one cycle to the next, the knowledge base is altered only by the action of the selected rule, which may add or delete portions of knowledge. Thus, it needs to be tested whether the deleted knowledge destroys the applicability of any rule in the old agenda, and whether the added knowledge makes any new rules applicable. The latter test is performed by running through the rule network with the new knowledge as the starting point.
Indexing
One way of improving over the brute-force Algorithm is to index all the rules so that
each parameter of the predicate produces a reference to its rule. When a parameter
takes on a new value or changes its value, the reference enables the rules concerned
to be found and checked efficiently.
Consider a rule whose premises are the conjuncts p(x) ∧ q(x, y), for which many conjunct orderings may be possible. Here, all the p(x) may be found first, and then q(x, y) for each such x and y; alternatively, for each x satisfying p(x), all q(x, y) may be found for different y. Which approach is better? Answering this is the solution of a conjunct-ordering problem, which is NP-hard. A useful heuristic is to choose the most constrained conjunct first, the one having the fewest satisfying values.
While forward chaining allows conclusions to be drawn only from a given knowledge base, a backward-chained rule interpreter is suitable for requesting still-unknown facts. In a backward-chaining system, you start with some hypothesis (goal) you are trying to prove, and keep looking for rules that would allow you to conclude that hypothesis, perhaps setting new subgoals to prove as you go.
For this, the goal expression is initially placed in the working memory, after which the system performs two operations:
1. it matches the rule's “consequent” with the goal, for example q in p1 ∧ p2 ∧ · · · ∧ pn → q;
2. it selects the matched rule p1 ∧ p2 ∧ · · · ∧ pn → q and places its premises (p1 ∧ · · · ∧ pn) in the working memory.
The second step above corresponds to the transformation of the problem into subgoals. The process continues in the next iteration, with these premises becoming the new goals to match against the consequents of other rules. Thus, backward-chaining systems perform, in the human sense, hypothesis testing.
Backward chaining uses a stack. First, the goal is put on the stack; then all the rules which result in this goal are searched and put on the stack. These become the subgoals, and the process goes on until all the facts are proved or are available as literals (ground clauses). Every time the goal and premises are decided, unification is performed.
The algorithm for backward chaining returns the set of substitutions (the unifier) which makes the goal true; this set is initially empty. The inputs to the algorithm are the knowledge base Γ, the goals α, and the current substitution θ (initially empty). The algorithm returns the substitution set λ for which the goal is inferred. Algorithm 4.2 is the backward-chaining algorithm.
Example 4.2 Apply the backward-chaining Algorithm for inferencing from a given
knowledge base.
θ = θγ
= {w/x}{socrates/w}
= {socrates/x}.
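Prolog itself executes by backward chaining, and the goal-to-subgoal reduction can be made explicit with the classic three-clause meta-interpreter; solve/1 is a common textbook sketch, and the dynamic declarations are our own, so that clause/2 may inspect the program:

:- dynamic man/1, mortal/1.

mortal(X) :- man(X).
man(socrates).

solve(true) :- !.                                % empty body: proved
solve((A, B)) :- !, solve(A), solve(B).          % split conjunctive subgoals
solve(Goal) :- clause(Goal, Body), solve(Body).  % reduce goal to a rule body

% ?- solve(mortal(W)).
% W = socrates.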
If the goal is not known in the knowledge base, the rule interpreter first decides whether it can be derived or must be requested from the user. For the derivation of the goal, all the rules which include the goal in their consequent part are processed. These rules can be identified efficiently if they are indexed according to their consequent parts.
Backward chaining therefore implicitly contains a dialogue control, where the order of the questions decides the order of the rules for deriving a parameter. This dependency on the order reduces the modularity of the rule-based system. The efficiency of a backward-chained rule interpreter is determined by the formulation of the goal: the more precise the goal, the smaller the search tree of rules to be checked and questions to be asked. For example, in medicine, a query may be “What is the name of the disease?” as against the alternative query “Is X the name of the disease?”.
Examples of backward-chained rule interpreters are MYCIN and Prolog.
Whenever the rules are such that a typical set of facts relates to a large number of conclusions, i.e., the fan-out is larger than the fan-in, it is better to use backward chaining. This avoids an explosive growth of the search space. On the other hand, when the rules are such that a typical hypothesis can lead to many facts, i.e., the fan-in is larger than the fan-out, it is better to use forward chaining. The reason is that going backward in a case where the fan-in is high would lead to an exponentially growing search space, so we prefer forward chaining for such knowledge.
If there is a situation in which all the required facts are known, and we want to obtain all the possible conclusions from them, forward chaining is better. However, if the rules are such that the facts are incompletely known, we cannot proceed from the facts to conclusions, and thus a goal-driven strategy should be used.
Forward chaining is often preferable when there are many rules with the same conclusions, because we choose rules whose premises are satisfied; backward chaining may lead to rules whose premises cannot be satisfied. The following examples demonstrate this.
Example 4.3 Given a knowledge base for an animal kingdom, infer animal(bruno) after adding dog(bruno). The taxonomy of the animal kingdom includes rules such as those shown in Table 4.2.
In addition to the methods discussed above, there exist reasoning methods such as model-based reasoning, case-based reasoning, and some hybrid combinations of these.
Case-based reasoning (CBR) uses a database of specific cases for problem solving. Consider the case of a medical practitioner, who uses the medical history of failed as well as successful “cases” for the diagnosis of future patients. If a new patient's case appears similar to an earlier patient's successful treatment, the knowledge history of the old case is useful. At the same time, if some old case has failed, it will not be tried again, given the available history.
Figure 4.3 shows the functional block diagram of case-based reasoning. Often, researchers combine such reasoning methods to exploit their respective advantages.
The situation is similar in the case of legal practitioners. Lawyers are allowed by law to quote old case histories in their legal proceedings if such a case matches the present situation. They can also give a different interpretation of the old case to deal with the new situation, even if the two do not fully match [1]. These are examples of case-based reasoning.
Many other cases can be counted in this category: programmers often reuse old code to fit a new situation with partial modification; architects reuse the designs of older successful architectures; war winners learn from the history of wars won; all of us often follow the success stories of others in life; and so on. Many times, the case where we need to apply the reasoning is not identical to the previous one, hence some transformations are required to fit it into the current situation. Thus, this reasoning is a kind of learning from analogy, called analogical learning.
For each new problem, case-based reasoning systems share the following common operations:
1. Retrieve appropriate cases from the storage (case database), as sketched after this list.
2. Modify the retrieved case to fit the current situation.
3. Apply the transformed case.
4. Save the solution along with the case details, whether a success or a failure, for future use.
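A minimal sketch of the retrieve step in Prolog; the representation case(Symptoms, Treatment) and the similarity measure (a count of shared symptoms) are our own illustrative choices:

case([fever, cough, headache], treatment_a).
case([fever, rash],            treatment_b).

shared([], _, 0).                   % count symptoms common to both lists
shared([S|Ss], Qs, N) :-
    shared(Ss, Qs, N0),
    ( member(S, Qs) -> N is N0 + 1 ; N = N0 ).

retrieve(Query, Treatment) :-       % pick the case sharing most symptoms
    findall(N-T, (case(Sym, T), shared(Sym, Query, N)), Scored),
    sort(0, @>=, Scored, [_-Treatment|_]).

% ?- retrieve([fever, rash], T).
% T = treatment_b.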
4.8 Summary
Some of the applications of rule-based expert systems are: component selection, equipment maintenance, product configuration, computer operation and troubleshooting, process control, and quality assurance. The basic form of a rule is,
if ⟨conditions⟩ then ⟨conclusion or action⟩
Exercises
R1: if A then E
R2: if B ∧ F then G
R3: if C ∧ E then F
a. Using these rules and facts, explain what is meant by “backward chaining”, and show explicitly how it can be used to determine the truth of G.
b. Explain what is meant by “forward chaining”, and show explicitly how it can be used in this case to determine new facts.
4. Consider the following possibilities, suggest the solution strategy to be adopted
if the system is implemented as a rule based system:
a. A subgoal literal is generated such that the higher goal is a subset of the
subgoal.
b. A subgoal literal is generated whose negation unifies with the higher-goal.
c. A subgoal S literal is generated that is equal to another goal G, and G is
neither higher nor lower to S.
5. Translate the following knowledge base into predicate form.
If x is on top of y then y supports x.
If x is above y and they are touching each other then x is on top of y.
A phone is above a book.
A phone is touching a book.
Count the number of triggering and rule firing in each of the above cases.
6. Given the propositions A, B, C, D and rules R1 to R4 ,
R1 = if A ∧ X ∧ Y then Z
R2 = if B ∧ V then Y
R3 = if C ∧ V then X
R4 = if D then C
13. Which type of rule chaining would you prefer in the following situations? Also identify the goal in each case.
a. To diagnose a case of malaria infection, based on the symptoms of cough, sore throat, regular fever, and shivering.
b. To identify a thief, based on the nature of the theft, fingerprints, and the goods stolen.
14. Suggest an approach to combine forward and backward chaining for the knowledge base shown in Table 4.2. (Hint: Try backward chaining to begin with; if it is not efficient, then switch over to forward chaining.)
15. To prove a theorem of geometry using rule-based systems, represent the follow-
ing statements as production rules:
a. Corresponding sides of two congruent triangles are also congruent.
b. Corresponding angles of two congruent triangles are also congruent.
c. If corresponding sides of two triangles are congruent then the triangles
are congruent.
d. If corresponding sides and the angle covered by them are equal then the
triangles are congruent.
e. Base angles of an isosceles triangle are congruent.
16. In what order should the rules be fired for inference efficiency? Discuss the merits and demerits of choosing a particular order of rule firing.
17. What are the criteria of selection of premises for firing the rules? Discuss this
based on specificity and generality of the premises.
18. Consider a network of applicability of rules. Imagine that due to firing of a
rule, a new inference is generated. This inference may contradict some fact or
goal. Explain, what should be the structure of dependency network so that any
inconsistency caused due to new inference is taken care of.
19. Suggest the strategy, as how a tree like rules structure can be used for reasoning
in forward direction. Explain or suggest your strategy for following:
a. Is breadth-first or depth-first search better for rule searching?
b. Is top-down or bottom-up search better for firing rule sequences?
c. How can one or both of the rules R1 and R2 be eliminated from consideration if a precondition common to both of them is false?
d. At the beginning of a reasoning process, would you prefer to fire a rule with a large number of preconditions, or a small number?
20. Analyze Algorithm 4.1 and justify that it is NP-hard in the general case, but polynomial in real-world situations.
21. Analyze the complexity of preconditions in the forward reasoning process.
22. How are the following situations implemented in a knowledge base for rule chaining? Explain.
a. “Is the pain intensifying with respirations?”
References
Abstract Prolog is a logic programming language for AI, based on predicate logic. This chapter discusses the structure, syntax, and semantics of the Prolog language; provides a comparison with procedural languages like C; gives the interpretation of predicate logic and of Prolog, both formally and through worked-out examples; explains how recursion serves as both the definition and the solution of a problem; and explains with simple examples how control sequencing takes place in Prolog. The use of two open-source Prolog compilers is demonstrated through simple worked-out examples. Each concept of Prolog and logic programming is explained with the help of worked-out examples. At the end, a brief summary gives a glimpse of the chapter, followed by a number of exercises to strengthen the learning.
5.1 Introduction
This chapter presents the basic concepts of logic programming, and Prolog language.
Prolog is a logic programming language, implemented in two parts:
1. Logic, which describes the problem, and
2. Control, provides the solution method.
• Computer algebra.
• Design of parallel computer architectures.
• Database systems.
• Natural-language processing.
• Compiler construction.
• Design of expert systems.
The popular compilers of Prolog are swi-prolog and gnu prolog; both are available in open source, and run on Windows and Linux platforms.
Learning Outcomes of this Chapter:
In the equation Algorithm = Logic + Control, the term ‘logic’ represents the descriptive part of an algorithm, and ‘control’ represents the solution part, which takes the description as the point
of departure. In other words, the logic component defines what the algorithm is
supposed to do, and the control component indicates how it should be done.
For solution through a logic program, a problem is always described in terms of the relevant objects and the relations between them. These relations are represented in a clausal form of logic, a restricted form of first-order predicate logic. The logic component for a problem is called a logic program, while the control component comprises a method of logical deduction (reasoning) for deriving new facts from the logic program. This results in solving the problem through deduction. The deduction method is designed as a general program, capable of dealing with any logic program that follows the clausal form of syntax.
There are a number of advantages in splitting an algorithm into a logic component and a control component:
• We can develop these two components of the program independent of each other.
For example, when writing the description of the problem as logic, we do not have
to be familiar with how the control component operates for that problem; thus
knowledge of the declarative reading of the problem specification suffices.
• A logic component may be developed using a method of stepwise refinement; we
have only to watch over the correctness of the specification.
• Changes to the control component affect (under certain conditions) only the effi-
ciency of the algorithm; they do not influence the solutions produced.
The implementation of Logic programming is explained in Fig. 5.1.
When all facts and rules have been identified, a specific problem may be looked upon as a query concerning the objects and their interrelationships. In summary, providing the specification of a logic program amounts to the following jobs:
• specify the facts about the objects and relations between them for the problem
being solved;
• specify the rules about the objects and their interrelationships;
• specify the queries to be posed concerning the objects and relations.
An algorithm for a logic program can be decomposed into the components shown in Fig. 5.2.
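For instance, a complete logic program in this style, anticipating the family-tree example used later in this chapter, might be written as the following sketch:

father(chandragupta, bindusara).      % facts about the objects
father(bindusara, ashoka).

parent(X, Y) :- father(X, Y).         % rules about interrelationships
grandparent(X, Z) :- parent(X, Y), parent(Y, Z).

% ?- grandparent(chandragupta, W).    % a query posed to the program
% W = ashoka.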
The reasoning process can be carried out as backward reasoning or forward reasoning. In backward chaining, reasoning is performed backwards from the conclusion, repeatedly reducing goals to subgoals, until ultimately all the subgoals are solved directly in the form of the original assertions.
In problem solving using the forward-chaining approach, we reason forwards from the hypotheses, repeatedly deriving new assertions from old ones, until eventually the original goal is solved directly by derived assertions [6].
For our reasoning using forward and backward chaining, we consider the family tree of the Mauryan Dynasty (India), as shown in Fig. 5.3 [5].
The problem of showing that chandragupta is a grandparent of ashoka can be solved by either backward chaining or forward chaining. In forward chaining, we start with the following assertions:
father(bindusara, ashoka) ←
father(chandragupta, bindusara) ←
Also, we shall use the clause parent(x, y) ← father(x, y) to derive new assertions,
by denying that any z is both a child of chandragupta and a parent of ashoka. Contin-
uing backward-chaining and considering both subgoals (either one at a time or both
simultaneously), we use the clause,
Different control strategies for the same logical representation generate different
behaviors. Also, the information about a problem-domain can be represented using
logic in different ways. Alternative representations can have a more significant impact
on the efficiency of an algorithm compared to alternative control strategies for the
same representation.
Consider the problem of sorting a list x to obtain the list y. In our representation, we can have a definition with an assertion consisting of two conditions: “y is a permutation of x”, and “y is ordered”, i.e.,
sort(x, y) ← permutation(x, y), ordered(y)
Here “←” is read as ‘if’ and ‘,’ is the logical AND operator. The first subgoal generates permutations of x, which are then tested for orderedness. Executing the procedure calls as coroutines, the procedure generates permutations one element at a time. Whenever a new element is generated, the generation of other elements is suspended while it is determined whether the new element preserves the orderedness of the existing partial permutation.
A program consists of a logic component and a control component. The logic component defines the part of the algorithm that is problem-specific; it not only determines the meaning of the algorithm but also decides the way the algorithm behaves. For the systematic development of well-structured programs using successive refinements, the logic component needs to be defined before the control component.
Due to these two components of a logic program, the efficiency of an algorithm can be improved through two different approaches: (1) by improving the efficiency of the logic component, or (2) through the control component. Note that the two improvements are additive, not alternative choices.
In a logic program, the specification of the control component is subordinate to the logic component. The control part can be explicitly specified by the programmer through a separate control-specification language, or the system itself can determine the control. When logic is used, as in relational calculus, to specify queries to a knowledge base (i.e., at a higher level of language), the control component is entirely determined by the system. Hence, for a higher-level programming language, such as queries, less effort is required for programming the control part, because in that case the system assumes more responsibility for efficiency, as well as for exercising control over the use of the given information.
Usually, a separate control-specifying language is preferred by advanced programmers to exercise control with higher precision. A higher system efficiency is possible if the programmer can communicate more precise information to the system to obtain finer control. Such information may concern a relation, for example F(x, y), where y can also be a function of x. This fact could be used by a backtracking interpreter to avoid looking for another solution to the first goal in the goal statement. This can be expressed, for example, by
← F(A, y), G(y),
when the second goal fails. Another example of such information can be that one procedure,
procedure,
S←T
may be more likely to solve problem S than another procedure,
S ← R.
This kind of cause-effect information is common in fault diagnosis, where, on the basis of past experience, it might be possible to estimate that symptom S is more likely to have been caused by T than by R.
In the above examples, the control information is problem-specific. However, if the control information is correct and the interpreter is correctly implemented, then the control information does not affect the meaning of the algorithm, which is decided by the corresponding logic component of the program.
In quick-sort, the predicates empty, first, rest, partitioning, and appending can be defined independently of the definition of sorting (see Eqs. 5.2 and 5.3). For this definition we assume that partitioning a list x2 by an element x1 gives a list u comprising the elements of x2 that are less than or equal to x1, and a list v of the elements of x2 that are greater than x1.
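A sketch of quick-sort in this style, written over ordinary Prolog lists; part/4 is our own name for the partitioning predicate just described:

quicksort([], []).
quicksort([X|Xs], Sorted) :-
    part(X, Xs, U, V),               % U: elements =< X, V: elements > X
    quicksort(U, SortedU),
    quicksort(V, SortedV),
    append(SortedU, [X|SortedV], Sorted).

part(_, [], [], []).
part(P, [X|Xs], [X|U], V) :- X =< P, part(P, Xs, U, V).
part(P, [X|Xs], U, [X|V]) :- X > P, part(P, Xs, U, V).

% ?- quicksort([3, 1, 2], S).
% S = [1, 2, 3].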
There is another advantage of data-structure-independent definitions: with well-chosen interface procedure names, data-structure-independent programs are virtually self-documenting. In conventional programs that mix procedures and data structures, the programmer needs to provide separate documentation for the data structures.
On the other hand, in spite of the arguments in support of separating procedures and data structures, programmers usually mix them together for the sake of run-time efficiency.
In the simplest backward-reasoning-based execution, the procedure calls are executed one at a time, in the order in which they are written. Typically, an algorithm can be made to run faster by executing the same procedure calls as coroutines or as communicating parallel processes. Consider an algorithm A1,
A1 = L + C1 (5.4)
and an algorithm A2 = L + C2, where the control strategy C1 is replaced by a new control strategy C2, and the logic L of the algorithm remains unchanged. For example, executing the procedure calls in sequence, the procedure,
sort(x, y) ← permutation(x, y), ordered(y) (5.6)
first generates permutations of x and then tests whether they are ordered. By executing
procedure calls as coroutines, the procedure generates permutations, one element at
a time. Whenever a new element is generated, the generation of other elements is
suspended while it is determined whether the new element preserves the orderedness
of the existing partial permutation.
In a similar way to procedure (5.6), the procedure calls in the body of quick-sort can be executed either as coroutines or as parallel/sequential processes. When they are executed in parallel, partitioning the rest of x can be initiated as soon as the first element of the rest is generated (see procedure (5.3) on p. 118). Note that the sorting of u and v can take place in parallel the moment the first elements of u and v are generated. And the appending of the lists can start as soon as the first elements of u, and the sorted output of u, are made available.
The high-level language SIMULA provides the facility of writing both the usual sequential algorithms and algorithms with coroutines. Here, the programmer can control when the coroutines should be suspended and resumed. However, in this language the logic and control are inextricably intertwined in the program text, as in other procedure-oriented languages. Hence, the control of an algorithm cannot be modified without rewriting the entire program.
In one way, the concept of separating logic and control is like separating data structures and procedures. When a procedure is kept separate from a data structure, we are able to distinguish clearly which functions are fulfilled by which data structures. On the other hand, when logic is separated from control, it becomes possible to distinguish what the program does (i.e., the logic) and how it does it (i.e., the control). In both cases it becomes obvious what the program does, and hence it can be more easily determined whether it correctly does the job it was intended for.
Recursive definitions are common in mathematics, where they are more likely to be understood as forward-chaining rather than backward-chaining. Consider, for example, the definition of factorial, given below.

factorial of 0 is 1 ←    (5.7)
factorial of x is u ← y plus 1 is x,
                      factorial of y is v,
                      x times v is u.    (5.8)

Executed by forward-chaining, these clauses generate the sequence of facts:

factorial of 0 is 1 ←
factorial of 1 is 1 ←
factorial of 2 is 2 ←
factorial of 3 is 6 ←

and so on.
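For comparison, the same definition runs under Prolog's backward-chaining control when written as follows (the predicate name factorial/2 is chosen for illustration):

factorial(0, 1).                  % factorial of 0 is 1
factorial(X, U) :-
    X > 0,
    Y is X - 1,                   % y plus 1 is x
    factorial(Y, V),              % factorial of y is v
    U is X * V.                   % x times v is u

The query ?- factorial(3, U). yields U = 6. The forward-chaining derivation above produces the same results from the same logic under a different control.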
The same algorithm A can often be obtained from different pairings of logic and control components:

A = L1 + C1 = L2 + C2.    (5.9)
Carrying out a bidirectional search from both the start node and the goal node results in a combination of backward and forward reasoning. In that case, whether a path-finding algorithm investigates one path at a time (depth-first) or develops all paths together (breadth-first) depends on the search strategy used.
← grandparent(chandragupta, ashoka)
f ather(chandragupta, bindusara) ←
f ather(bindusara, ashoka) ←
mother(durdhara, bindusara) ←
The combination indicated by Fig. 5.8, which represents the algorithm, has markings 1–3 as events, with the following meanings:
1. the algorithm waits until x is asserted to be the parent of z, then
2. the algorithm finds a child y of z, and finally
3. the algorithm asserts that x is the grandparent of y.
In a similar way, the algorithm takes care of the rest of the control for backward-chaining and forward-chaining.
Program files can be loaded and compiled using the predicate consult. The argument has to be a Prolog atom or quoted file name denoting the particular program file. For example, to compile the file socrates.pl, submit the following query to SWI-Prolog [8]:

?- consult('socrates.pl').
If the compilation is successful, Prolog will reply with 'Yes'. Otherwise a list of errors is displayed. SWI-Prolog can also run in a GUI environment on Windows. GNU Prolog running on Linux can load the program socrates.pl as follows [3]:
$ gprolog <enter>
GNU Prolog 1.3.0
Copyright ....
| ?- [socrates].   % filename without extension
mortal(X) :- man(X).
man(socrates).
The conclusion of the argument is 'Socrates is mortal', which can be expressed as the predicate mortal(socrates). After we have compiled the above program, we query it as follows:
?- mortal(socrates).
Yes
We notice that Prolog agrees with our own logical reasoning. But how did it come
to its conclusion? Let’s follow the goal execution step-by-step [7].
1. Prolog starts with the goal mortal(socrates) and searches the program for a matching fact or rule-head.
2. The goal matches the head mortal(X) of the rule mortal(X) :- man(X).
3. The variable X gets instantiated to socrates.
4. Prolog replaces the goal by the body of the rule, giving the new subgoal man(socrates).
5. Prolog executes the new subgoal by again trying to match it with a rule-head or a fact.
6. Obviously, the new subgoal man(socrates) matches the fact man(socrates), and the current subgoal succeeds.
7. This means that the initial goal succeeds, and Prolog responds with 'Yes'.
bigger(train, bus).
bigger(bus, car).
bigger(car, bicycle).
bigger(car, motorbike).
This is a syntactically correct program, and after having compiled it, we can ask the Prolog system questions (queries) about it. The query bigger(car, bicycle) (i.e., the question "Is a car bigger than a bicycle?") succeeds, because the fact bigger(car, bicycle) was previously communicated to the Prolog system. Our next query is, "Is a train bigger than a motorbike?"
?- bigger(train, motorbike).
No
Prolog replies "No". The reason is that the program says nothing about the relationship between train and motorbike. However, we note that the program says trains are bigger than buses, and buses are bigger than cars, which in turn are bigger than motorbikes. Thus, trains ought also to be bigger than motorbikes: in mathematical terms, the bigger-relation is transitive. But transitivity has not been defined in our program. The correct interpretation of Prolog's negative answer is: "from the information communicated to the program it cannot be proved that a train is bigger than a motorbike". As an exercise, we can try a proof by resolution refutation; it fails, because no chain of the given facts alone establishes the statement.
A solution would be to define a new relation, which we will call isbigger, as the transitive closure of bigger: X is bigger than Y if this has been stated as a fact; otherwise, there is some Z for which it has been stated as a fact that X is bigger than Z, and it can be shown that Z is bigger than Y. In Prolog such statements are called rules and are implemented as follows:

isbigger(X, Y) :- bigger(X, Y).                   % rule1
isbigger(X, Y) :- bigger(X, Z), isbigger(Z, Y).   % rule2

Here ':-' stands for 'if', and the comma (,) between bigger(X, Z) and isbigger(Z, Y) stands for 'AND' (a semicolon (;) would stand for 'OR'). If from now on we use isbigger instead of bigger in our queries, the program will work as intended.
?- isbigger(train, motorbike).
Yes
In rule1 above, the predicate isbigger(X, Y) is called the goal, and bigger(X, Y) is called the sub-goal. In rule2, isbigger(X, Y) is the goal and the expressions after the sign ':-' are the sub-goals. The goal is also called the head of the rule, and the expressions after the sign ':-' are called the body of the rule.
In fact, rule1 above corresponds to the predicate formula

isbigger(X, Y) ← bigger(X, Y),

or

bigger(X, Y) → isbigger(X, Y).
bigger(train, bus).
bigger(bus, car).
bigger(car, bicycle).
bigger(car, motorbike).
are called facts (or assertions). The facts and rules together make up the knowledge base (KB) of a program.
For the query isbigger(train, motorbike), Prolog still cannot find the fact bigger(train, motorbike) in its database, so it tries to use the second rule instead. This is done by matching the query with the head of the rule, which is isbigger(X, Y). When doing so, the two variables get bound: X = train and Y = motorbike. The rule says that in order to prove the goal isbigger(X, Y) (which, with the variable bindings, is equivalent to isbigger(train, motorbike)), Prolog needs to prove the two subgoals bigger(X, Z) and isbigger(Z, Y), with the same variable bindings. Hence, rule2 gets transformed to:

isbigger(train, motorbike) :- bigger(train, Z), isbigger(Z, motorbike).
By repeating the process recursively, the facts that make up the chain between train and motorbike are found, and the query ultimately succeeds. Our earlier Fig. 5.5 demonstrated a similar chain of actions.
Of course, we can do a slightly more exciting job than just asking yes/no questions. Suppose we want to know what things are bigger than a car. The corresponding query would be:

?- isbigger(X, car).

We could have chosen any other name in place of X, as long as it starts with an uppercase letter, which makes it a variable. The Prolog interpreter replies as follows:

X = bus ;
X = train ;
No
There are many more ways of querying the Prolog system about the contents of its database. For example, try to find out the answer for:

?- bigger(Who, Whom).

You will get many answers! Prolog treats the arguments Who and Whom as variables.
As a further example, we trace the execution of the query asking whether a bus is bigger than a motorbike:

?- isbigger(bus, motorbike).
1. Call: isbigger(bus, motorbike)?
2. Call: bigger(bus, motorbike)?
2. Fail: bigger(bus, motorbike)?
2. Call: bigger(bus, _80)?
2. Exit: bigger(bus, car)?
2. Call: isbigger(car, motorbike)?
3. Call: bigger(car, motorbike)?
3. Exit: bigger(car, motorbike)?
2. Exit: isbigger(car, motorbike)?
1. Exit: isbigger(bus, motorbike)?
True?
Yes
The trace can be verified to proceed exactly as per rule1 and rule2 discussed above.
Many times, when started from the goal, it may not be possible to reach the available facts. This shows that Prolog is incomplete as a theorem prover even for definite clauses, as it can fail to prove facts that follow from the knowledge base.
Built-in predicates can be used in a similar way as user-defined predicates. The important difference between the two is that a built-in predicate is not allowed to appear as the principal functor in a fact or in the head of a rule. This must be so, because using it in such a position would effectively mean changing its definition [1].
Equality. We write X = Y. Such a goal succeeds if the terms X and Y can be matched.

Output. Besides Prolog's replies to queries, if you wish your program to produce further output, you can use the write predicate. The argument can be any valid Prolog term; in the case of a variable argument, its value will be printed. Execution of the predicate nl causes the system to skip to a new line, as in the following session.
?- read(N), write('the number is '), write(N), nl.
5.
the number is 5
N = 5
Yes
Matching. The following are examples of matching. If two expressions match, the output is 'Yes'; otherwise it is 'No', showing that the two expressions cannot be matched.
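The concrete example queries are not reproduced here; a couple of typical, illustrative ones behave as follows:

?- foo(a, B) = foo(A, b).
A = a
B = b
Yes

?- f(X, X) = f(a, b).
No

The second query fails because X cannot be bound to both a and b at the same time.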
Sometimes there is more than one way of satisfying the current goal. Prolog chooses the first possibility (as determined by the order of clauses in the program), but the fact that alternatives exist is recorded. If at some point Prolog fails to prove a certain subgoal, the system can go back and try an alternative left behind in the course of executing the goal. This process is known as backtracking. The following example programs illustrate recursion and this backtracking behavior.
% finding factorial
fact(0, 1).                                               % base case
fact(N, F) :- N > 0, M is N - 1, fact(M, G), F is N * G.  % recursive case
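For example, the query below unwinds the recursion down to the base case and multiplies on the way back:

?- fact(4, F).
F = 24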
%gcd
gcd(X, X, X).
gcd(X, Y, Z) :- X > Y, D is X - Y, gcd(D, Y, Z).
gcd(X, Y, Z) :- X < Y, D is Y - X, gcd(X, D, Z).
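This is Euclid's algorithm by repeated subtraction; for example:

?- gcd(12, 18, Z).
Z = 6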
move(A,B):- nl,
write(’move top from ’),
write(A),
write(’ to ’),
write(B).
transfer(1,A,B,I) :- move(A,B).
transfer(N, A, B, I):- N > 1,
M is N -1,
transfer(M, A, I, B),
move(A, B),
transfer(M, I, B, A).
The algorithm uses the strategy: move n − 1 disks from A to I, then move a single disk from A to B, and finally move n − 1 disks from I to B. For the n − 1 disks it calls itself recursively. The predicate nl stands for new-line.
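For example, moving a two-disk tower from peg a to peg b via peg i (the peg names are chosen here for illustration):

?- transfer(2, a, b, i).
move top from a to i
move top from a to b
move top from i to b
Yes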
Prolog represents lists within square brackets, with the elements separated by commas. Here is an example (any list of terms will do; this one is illustrative):

[a, b, c, d]

Elements of lists may be any valid Prolog terms, i.e., atoms, numbers, variables, or compound terms. An element may also be another list. The empty list is denoted by []. The following is another, slightly more complex, illustrative example of a list:

[a, X, [], f(s, t), [1, 2, 3]]

Internally, lists are represented as compound terms using the functor '.' (dot). The empty list [] is an atom, and elements are added one by one. The list [a, b, c], for example, corresponds to the term:

.(a, .(b, .(c, [])))
% membership built-in
?- member(2, [a, b, c, 2, 4, 900]).
Yes

% membership program
ismember(X, [X|_]).                     % matches the 1st element
ismember(X, [_|R]) :- ismember(X, R).   % otherwise try the next element
The built-in predicate append appends two lists:

append([], L, L).
append([X|L1], L2, [X|L3]) :- append(L1, L2, L3).
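The derivation that follows arises from a query of the form:

?- append([1, 2], [3, 4], Z).
Z = [1, 2, 3, 4]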
The search, along with the unifications for appending the two lists, is shown in Fig. 5.10. The goal search shows alternate cycles of unification and calling of sub-goals. As a result of the recursion, the append operation is realized as follows. The terminal node is matched with the fact append([], L, L); consequently, L4 = L5 = [3, 4].
On back substitution,
L3 = [2|L4]
= [2|[3, 4]]
= [2, 3, 4].
Z = [1|L3]
= [1|[2, 3, 4]]
= [1, 2, 3, 4].
Head and Tail
The first element of a list is called its head and the remaining list is called the tail.
An empty list does not have a head. A list containing a single element has a head
(namely that particular single element) and its tail is the empty list. A variant of the
list notation allows for convenient addressing of both head and tail of a list. This is
done by using the separator | (bar) [1].
Notice that Head and Tail are just names for variables. We could have used X and Y or something else instead, with the same result. We can access the second element as well:
?- [bovi, jovi, kv, licet, quod, non] = [_, X | _].
X = jovi
?- 3 + 5 = 8.
No
?- X is 3 + 5.
X = 8
Yes

The terms 3 + 5 and 8 do not match, as the former is a compound term +(3, 5), whereas the latter is a number; the built-in predicate is evaluates its right-hand argument arithmetically before matching.
The following are the arithmetic relational predicates:

X > Y     (greater than)
X < Y     (less than)
X >= Y    (greater than or equal)
X =< Y    (less than or equal)
X =:= Y   (arithmetically equal)
X =\= Y   (arithmetically not equal)
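A couple of illustrative queries (both sides are evaluated before the comparison):

?- 2 + 3 =:= 5.
Yes
?- 7 =\= 2 * 3.
Yes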
The Prolog language has a number of predicates to explicitly control the backtracking behavior of its interpreter. In this respect Prolog deviates from the pure logic programming idea. For example, the predicate true takes no arguments and always succeeds. Some of the other control predicates of Prolog are discussed below.
Fail Predicate

The predicate fail also has no arguments; the goal fail never succeeds. The general application of fail is to enforce backtracking, as shown in the following clause:

a(X) :- b(X), fail.
When the query a(X) is entered, the Prolog interpreter first tries to find a match for b(X). Suppose that such a match is found, and that the variable X is instantiated to some term. Then, in the next step, fail, as a consequence of its failure, forces the interpreter to look for an alternative instantiation of X. If it succeeds in finding another instantiation for X, then fail will be executed again. This entire process is repeated until no further instantiations can be found; this way all possible instantiations for X will be tried. Note that if no side-effects are employed to record the instantiations of X in some way, the successive instantiations leave no trace. Evidently, in the end the query a(X) will be answered by no. But we have succeeded in backtracking, i.e., going back and trying all possible instantiations for X, which helps in searching all the values.
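A common idiom built on this is the fail-driven loop, sketched below with hypothetical facts; write records each solution as a side-effect before fail forces the next one:

city(jodhpur).
city(jaipur).
city(delhi).

print_all :- city(X), write(X), nl, fail.   % print a solution, then force backtracking
print_all.                                  % succeeds once all alternatives are exhausted

The query ?- print_all. prints all three cities and then answers Yes.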
Negation in Prolog is interpreted as negation as failure, as illustrated by the following example.

Example 5.10 Negation as failure.
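The program itself is not reproduced in the text; a minimal version consistent with the queries that follow might be (the facts are assumed, and \+ is Prolog's negation-as-failure operator):

bachelor(P) :- male(P), \+ married(P).   % P is a bachelor if P is male and not provably married
male(rajan).
male(dicken).
married(dicken).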
When run, the queries below respond as expected. In the third query, on backtracking to Who = dicken, married(Who) succeeds, so the negated goal fails and no further solution is reported.
?- bachelor(rajan).
yes
?- bachelor(dicken).
no
?- bachelor(Who).
Who = rajan ;
no
Cut Predicate

Sometimes it is desirable to selectively turn off backtracking. This is done by the cut (!). The cut, denoted by !, is a predicate without any arguments. It is used as a condition that can be confirmed only once by the Prolog interpreter: on backtracking it is not possible to confirm a cut a second time. Moreover, the cut has a significant side effect on the remainder of the backtracking process: it forces the interpreter to reject the clause containing the cut, and also to ignore all other alternatives for the procedure call that led to the execution of that clause.
a :- b,c,d.
c :- p,q,!,r,s.
c.
Suppose that upon executing the call a, the successive procedure calls b, p, q, the cut, and r have succeeded (the cut by definition always succeeds on first encounter). Furthermore, assume that no match can be found for the procedure call s. Then, as usual, the interpreter tries to find an alternative match for the procedure call r. For each alternative match for r, it again tries to find a match for the condition s. If no alternatives for r can be found, or similarly if all alternative matches have been tried, the interpreter would normally try to find an alternative match for q. However, since we have specified a cut between the procedure calls q and r, the interpreter will not look for alternative matches for the procedure calls preceding r in this clause. In addition, the interpreter will not try any alternatives for the procedure call c, so the third clause is ignored. Its first action after encountering the cut during backtracking is to look for alternative matches for the condition preceding the call c, i.e., for b.
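A common concrete use of the cut is to commit to the first applicable clause, as in the following max/3 sketch:

max(X, Y, X) :- X >= Y, !.   % if X >= Y, commit: the second clause is never tried
max(_, Y, Y).                % reached only when X < Y

The query ?- max(3, 7, M). gives M = 7 without leaving a spurious choice point behind.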
For a given goal, Prolog explores the premises of the rules in the knowledge base that can make the goal true. If a rule has premises p1 ∧ p2 ∧ · · · ∧ pn, it fully explores the premise pi (called a choice point) before proceeding to pi+1.
Prolog's execution rests on unification and the binding of variables, together with pushing onto and popping from the stack associated with backtracking. When a search fails, Prolog backtracks to the previous choice point, possibly unbinding some of the variables. It keeps track of the bindings that must be undone on backtracking in a structure called the trail, and in addition it has to manage indexes for fast searching of predicates. Accordingly, even the most efficient Prolog interpreters consume thousands of machine instructions for even simple unifications and matchings.
Because of the volume and nature of the computation required, conventional processors give poor performance on Prolog programs. Hence, Prolog programs are compiled into an intermediate form, the Warren Abstract Machine (WAM) code, which helps Prolog run faster as well as making parallel execution possible.
Prolog may also sometimes fall into nonterminating loops.
The idealized version of Prolog is called pure Prolog. It is obtained from a variation of the backward-chaining algorithm restricted to Horn clauses, with the following rules and conventions: clauses are tried in the order in which they are written, subgoals are tried left to right, and the search is depth-first with backtracking.

These conventions make for fast processing. But, unfortunately, the pure Prolog inference procedure is sound but not complete. This can be seen from the following example, in which we are unable to derive P(a, c) in Prolog because we get caught in an ever-deepening depth-first search.
P(a, b).
P(b, c).
P(Y, X) :- P(X, Y).
P(X, Z) :- P(X, Y), P(Y, Z).
Actual Prolog
Actual Prolog differs from pure Prolog in three major respects:
• There are additional functionalities besides theorem proving, such as predicates to assert statements, to do arithmetic, and to do I/O.
• The "cut" operator allows the user to prune branches of the search tree.
• The unification routine is not quite correct, in that it does not check for circular bindings, e.g., X → Y, Y → f(X) (the occurs check is omitted).
5.13 Summary
Exercises
1. Determine in which of the following cases the unification of the two lists succeeds, and where it fails. If it succeeds, write down the unifier. (Note: names starting with uppercase letters are variables.)
[a, d, z, c] and [H |T ]
[apple, pear, grape] and [A, pear |Rest]
[a|Rest] and [a, b, c]
[a, []] and [A, B|Rest]
[One] and [two|[]]
[one] and [Two]
[a, b, X ] and [a, b, c, d]
2. Give Prolog predicates for natural-number and plus that implement the Peano
axioms. Using plus give a Prolog program times for multiplication. Further, using
times give a Prolog program exp for exponentiation.
3. Given the following knowledge base in Prolog, find a female descendant of george by manually running the program.
parent(george,sam).
parent(george,andy).
parent(andy,mary).
male(george).
male(sam).
male(andy).
female(mary).
ancestor(X,Z) :- parent(X, Z).
ancestor(X,Z) :- parent(X,Y), ancestor(Y,Z).
4. Determine which of the following queries succeed:

?- [a,b,c,d] = [a,b,c,d|[]].
?- [a,b,c,d] = [a,b,c,[d]].
?- [a,b,c,d] = [a,b,[c,d]].
?- [a,b,c,d] = [a|[b,c,d]].
?- [a,b,c,d] = [a,b|[c,d]].
?- [a,b,c,d] = [a,b,c,d,[]].
?- [a,b,c,d] = [a,b,c|[d]].
?- [a,b,c,d] = [a,[b,c,d]].
5. Which of the following lists are syntactically correct for Prolog language? Find
out the number of elements in the lists that are correct.
[1, 2, 3, 4| []]
[1| [2| [3| [4]]]]
[1, 2, 3| []]
[[] | []]
[[1, 2] |4]
[1| [2, 3, 4]]
[[1, 2], [3, 4] | [5, 6, 7]]
[1|2, 3, 4]
6. Write a predicate second(S, List), that checks whether S is the second element
of List.
7. Consider the knowledge base comprising the following facts:
tran(ek, one).
tran(do, two).
tran(teen, three).
tran(char, four).
tran(panch, five).
tran(cha, six).
tran(saat, seven).
tran(aat, eight).
tran(no, nine).
11. Given the following facts and rules about a blocks world, represent them in rule form, then translate the rules into Prolog, and find out "which block is on the black block?"
Facts:
A is on table.
B is on table.
E is on B.
C is on A.
C is heavy.
D has top clear.
E has top clear.
E is heavy.
C is iron made.
D is on C.
Rules:
Every big, black block is on a red block.
Every heavy, iron block is big.
All blocks with top clear are black.
All iron made blocks are black.
References
1. Clocksin WF, Mellish CS (2009) Programming in Prolog, 3rd edn. Narosa, New Delhi
2. http://www.dtic.upf.edu/~rramirez/lc/Prolog.pdf. Cited 19 Dec 2017
3. http://www.gprolog.org/. Cited 19 Dec 2017
4. Bratko I (2007) PROLOG programming for artificial intelligence, 3rd edn. Pearson Education
5. Strong John S (1989) The Legend of King Asoka—a study and translation of Asokavadana.
Princeton University Press, Princeton, N.J.
6. Kowalski RA (1979) Algorithm = Logic + Control. Commun ACM 22(7):424–436
7. Van Emden MH, Kowalski RA (1976) The semantics of predicate logic as a programming
language. J ACM 23(4):733–742
8. http://www.swi-prolog.org/. Cited 19 Dec 2017
Chapter 6
Real-World Knowledge Representation
and Reasoning
Abstract Apart from its theoretical significance, AI must represent real-world knowledge and produce reasoning using it. Real-world things are collections of entities in different classes. This chapter presents representation structures for such knowledge, e.g., taxonomies, and reasoning based on them. Other real-world phenomena presented are action and change, commonsense reasoning, and ontology structures for different domains, like language and the world. Sowa's ontology for objects and processes, both concrete and abstract, is explained. The situation calculus is presented in its formal details, along with worked exercises. The more prevalent kinds of real-world reasoning, like nonmonotonic and default reasoning, are also treated in sufficient detail, along with supporting worked exercises. This is followed by a summary of the chapter and an exhaustive list of practice exercises.
6.1 Introduction
Who is taller, the father or the son held in the father's hands? Can you make tea out of salt? If you push a needle into a potato, does the needle make a hole in the potato or in the needle? We may find these questions absurd, but many tasks, such as machine vision, natural language processing, planning, and reasoning in general, require the same kind of real-world knowledge and reasoning capabilities. As another example, if a six-foot-tall person is holding a two-foot baby in his arms, and it is also given that they are father and son, we take no time to conclude which one is the father and which one is the son.
The use of real-world knowledge for natural language processing (NLP), particularly for all kinds of sense disambiguation and for machine translation, is a challenging task. Some of the ambiguities are resolved using simple rules, while the others can be resolved only through a rich understanding of the world.
The ACM curricula lists the following learning outcomes for this area:

1. Compare and contrast the most common models used for structured knowledge representation, highlighting their strengths and weaknesses. [Assessment]
2. Identify the components of non-monotonic reasoning and its usefulness as a rep-
resentational mechanism for belief systems. [Familiarity]
3. Compare and contrast the basic techniques for representing uncertainty.
[Assessment]
4. Compare and contrast the basic techniques for qualitative representation.
[Assessment]
5. Apply situation and event calculus to problems of action and change. [Usage]
6. Explain the distinction between temporal and spatial reasoning, and how they
interrelate. [Familiarity]
Fig. 6.1 A taxonomy: categories, subcategories, instances, and their properties, with individuals such as Bruno and Tweety under the category Animal
• All the events are deterministic in nature, i.e., the state of the world at the end of every event can be fully determined from the state before that event plus the event itself.
The problem of representation, and forms of reasoning such as prediction and planning, are largely understood for domains that satisfy the above-mentioned three constraints. However, there are other, harder domains: continuous domains, simultaneous events, probabilistic events, and imperfect-knowledge domains, where representation and reasoning are not so well understood yet. Some of these domains will be discussed in the following text.
The field of commonsense reasoning can be largely divided into two areas:
• knowledge-based approaches, and
• machine learning approaches over large data corpora, which are almost always text corpora.

However, there is very limited interaction between these approaches. There are other approaches, e.g., crowd-sourcing-based approaches, which try to construct a knowledge base as a combination of other knowledge bases and the participation of many non-experts [2].
The knowledge-based approaches are further divided into categories based on the following foundations (see Fig. 6.2):
1. mathematical logic or some other mathematical formalism,
2. informal approaches, in contrast to mathematical formalism,
3. approaches based on theories of cognition, and
4. other large-scale approaches, usually mathematical or informal, but mainly targeted at collecting a large amount of knowledge.
Fig. 6.2 Approaches to commonsense reasoning
6.4 Ontologies
Ontology is a particular theory about the nature of being or the kinds of existents. The task of intelligent systems in computer science is to formally represent these existents. A body of formally represented knowledge is based on a conceptualization.
A concept is something we can form a mental image of. A conceptualization consists of a set of objects, concepts, and other entities about which knowledge is being expressed, and of the relationships that hold among them. Every knowledge model is committed to some conceptualization, implicitly or explicitly [3].
The major applications of ontologies are in natural language processing, information retrieval, and modeling and simulation. The CYC (from enCYClopedia) ontology, for example, is used for a CYC natural language system, built for the purpose of translating natural language text into CycL [4].
An ontology is a systematic arrangement of all of the important categories of
objects or concepts that exist in some field of discourse, that shows the relations
between these concepts/objects. In a completed form, an ontology is a categorization
of all of the concepts in some field of knowledge, that includes all the objects,
their properties, relations, and functions needed to define the objects, and specify
their actions. A simplified ontology may contain only a hierarchical classification
of objects and processes (called taxonomy), which shows the subsumption relations
between concepts in the specific domain of knowledge [5].
An ontology may be visualized as an abstract graph with nodes representing the
objects and labeled arcs representing the relations between them, as shown in Fig. 6.3,
which is the upper level of the CYC (from enCYClopedia) project hierarchy.
The CYC Project is an example of an ontology. CYC contains more than 10,000 concept types used in the rules and facts encoded in its knowledge base. Its goal was to build a commonsense knowledge base containing millions of facts and their interrelationships: the knowledge that is seldom written down, which, for example, the reader of an encyclopedia is assumed to possess in order to understand its contents. This kind of knowledge is required as built-in in any system that is required to read even restricted natural language.
At the top of the hierarchy is the Thing "concept", which does not have any properties of its own. The hierarchy under Thing is quite tangled, and not all the subcategories are exclusive. In general, Thing is partitioned in three ways. The first partition is Represented Thing versus Internal Machine Thing; every CYC category must be an instance of one and only one of these sets. Internal-Machine-Thing is anything that is local to the platform CYC is running on (strings, numbers, and so on); Represented-Thing is everything else.
The concepts used in an ontology and their hierarchical ordering are to some extent arbitrary, and largely depend on the purpose for which the ontology is created. This is because the same objects are of varying importance when used for different purposes, and one or another set of properties of the objects is chosen as the criterion for their classification. Apart from this, different degrees of aggregation of concepts may be used, and what is important for one purpose may be of no concern for another. For example, a class is a collection of students, but when all of them are playing in the ground we call them a team. As another example, when both parents are teaching they are called teachers; when traveling together they are passengers; and at home they are spouses.
Fig. 6.3 The upper-level CYC ontology: Thing is partitioned into categories such as Individual Object, Represented Thing, Internal Machine Thing, Intangible, Tangible Object/Stuff, Process, Attribute, Attribute Value, Slot, and Something Existing
Compared with First-Order Predicate Logic (FOPL) for knowledge representation and reasoning, in ontologies we are concerned with the organization of knowledge as well as its contents. This approach supports generalization combined with exceptions. For example, "all birds fly" can be a rule, but there are some birds which do not fly, for example the penguin: this is an exception. We should be able to add such a concept to the knowledge, not as an exception, but as an extension. This requires categorization of objects.
Ontologies provide explicit representations of domain concepts by using a structure around which a knowledge base is built. Each ontology is a structure of concepts and the relations between them, such that all concepts are defined as well as interpreted using a declarative approach. The ontology system also defines the vocabulary of the problem domain, and a set of constructs specifying how the terms can be combined to model the domain. Figure 6.3 shows the general structure of an ontology.
Fig. 6.4 Language, ontology, and logic as representations, paired respectively with their applications: commonsense, type-checking, and reasoning
Figure 6.4 shows the interrelationship of language and logic with ontology, and how each of these is related to commonsense, reasoning, and type-checking. Language, ontology, and logic are representations, whereas the corresponding blocks, i.e., commonsense, type-checking, and reasoning, are respectively their applications. Consider the object "ball": we identify it as a member of the collection of balls by type-checking; when we say "plants have life", we place them in the category of all living beings (again a type-check). At the lowest level in Fig. 6.4 is logic, making use of binary operators; this logic supports the reasoning process. Commonsense is context-level knowledge, which is essential for language understanding. For example, the phrase "a good play" in a cricket ground is understood as a good shot, while in a theater it is taken as a performance by an artist. We resolve the phrase "a good play" by context.
There is a close relationship between natural language understanding (NLU) and knowledge representation and reasoning, in other words, between language and knowledge. In fact, understanding is a reasoning paradigm, as it has become quite clear that understanding natural language is, for the most part, a commonsense reasoning process at the pragmatic level. As an example, to illustrate this strong interplay between language understanding and commonsense reasoning, consider the following:

(a) "Jack defended a jailed activist in every state."
(b) "Jack knows a jailed activist in every state."

Through commonsense, we have no difficulty in inferring for (a) that Jack defended the same activist in every state; perhaps Jack traveled to the different states for the campaign. Whereas in (b), we hardly conceive of the same: it may mean "there is a jailed activist in every state", or that an activist has been jailed in every state. Such inferences lie beyond syntactic and semantic explanations, and in fact depend on our commonsense knowledge of the world (where we actually live).
As another example, consider resolving of the noun He in the following:
(c) Rajan shot a policeman. He instantly
(i) ran away.
(ii) collapsed.
It is obvious that the above references can only be resolved through commonsense knowledge. For example, typically, when shot(x, y) holds between some x and some y, x is the more likely "subject" to run away. So, in (c(i)), it is most likely Rajan who ran away. In (c(ii)), y is the more likely one to collapse, that is, the policeman. Note, however, that such inferences must always be considered defeasible, since quite often additional information might result in the retraction of previously made inferences. For example, (c(ii)) might be describing a situation in which Rajan, a ten-year-old who was shooting for practice, fell down. Similarly, (c(i)) might actually be describing a situation in which the policeman, upon being slightly injured, tried to run away, perhaps to escape further injuries!
There are a number of challenges, in terms of computational complexity, in reasoning with uncommitted (or underspecified) logical forms. However, an even bigger challenge is to make available a large body of commonsense knowledge, together with computationally effective reasoning engines.
6.5.3 WordNet
The CYC project, for example, comprises one of the largest numbers of axioms, on the order of 10^6.
John Sowa (1995, 1997) stated his fundamental principles for ontology design as "distinctions, combinations, and constraints" (see Fig. 6.7). He uses philosophical motivation as the basis for his categorization. There are three top-level distinctions:
Fig. 6.7 Sowa's top-level distinctions: Object versus Process, and Concrete versus Abstract, yielding categories such as Physical Object, Physical Process, Information Object, and Information Process
We have discussed above that ontologies are taxonomical structures of categories, subcategories, and members of concepts. As humans, we identify objects, whether physical or abstract, through this classification. For example, we can distinguish oranges from cricket balls because the surface of the former is soft and spongy, though their size and shape are quite similar. Structures, e.g., a university system, a building, a computer network, an operating system, or even an office chair, are decomposed into parts, so that we identify the object as a whole while the relationships of the parts remain explicit in the representation. Hence, it is due to these associated categorizations and sub-categorizations, along with restrictions, that we are able to identify real-world objects. Reasoning can be done using categories and objects, using physical decomposition of categories, using measurement criteria, or using object-oriented analysis.
Taxonomical hierarchies are common in government, the military, and other organizations. Classifying objects into categories is important for knowledge organization. Of course, interactions take place at the level of individuals, but relations are created at the level of categories. For example, the sentence "the basketball team has won" refers neither to a single ball nor to a single player, but to a relationship among the team members as a group.
Categories also help in making predictions about objects once they are classified. One perceives objects from perceptual inputs, infers category membership from the perceived properties of the objects, and then uses the category information to make predictions about the objects. For example, from its yellow colour, size, and shape, we identify that an object is a mango.
In FOPL there are two choices for representing categories: predicates and objects. Following are examples:

basketball(b);
member(b, basketballs);
b ∈ basketballs; (the object b is a member of the category basketballs)
subset(basketballs, balls); (a category is a subclass of another category)
basketballs ⊂ balls;
dogs ∈ domesticated animals. (categories themselves are members of other categories)
A category is defined by membership and the subset relation. We also note that categories organize knowledge and support inheritance. Thus 'mango' ∈ 'mangoes', and 'mangoes' ⊂ 'fruits'; hence mango inherits the property of 'taste' from fruits (fruits have 'tastes'). Similarly, we have categories: animals, birds, foods, institutions. A "set of institutions" ⊂ "institutions", and MBM ∈ "institutions". The subset relation organizes the categories into a taxonomic hierarchy.
An object can be part of another; for example, nose, eyes, and hands are parts of the body; the steering is part of a car; wheels are part of the wheel-assembly, and the wheel-assembly is part of the car. This can be represented by relations of physical decomposition:

partof(wheel, wheelassembly).
partof(wheelassembly, car).
partof(x, y) ∧ partof(y, z) ⇒ partof(x, z).

Note that the partof relation is transitive. It is also reflexive on individual objects, i.e., partof(x, x): x has one part, namely itself. The categories of composite objects are defined by the structural relations between parts and assemblies.
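Mirroring the transitive-closure pattern used for isbigger in Chap. 5, these relations can be sketched in Prolog (the predicate names are chosen for illustration):

partof(wheel, wheel_assembly).                  % direct part
partof(wheel_assembly, car).

part_of(X, Z) :- partof(X, Z).                  % direct parts
part_of(X, Z) :- partof(X, Y), part_of(Y, Z).   % transitive closure

The query ?- part_of(wheel, car). then answers Yes.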
6.6.3 Measurements

Physical objects have height, weight, mass, length, and other physical measures. To characterize a physical object, values are assigned to it in the form of its properties. Following are examples:

length(ipad) = inches(5) or length(ipad) = centimeters(12.5).
listprice(basketball) = rupees(500).
height(x) > height(y) ⇒ taller(x, y).
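The last rule can be sketched in Prolog with some hypothetical height facts (units assumed to be centimeters):

height(ram, 180).      % assumed facts, heights in centimeters
height(shyam, 165).

taller(X, Y) :- height(X, HX), height(Y, HY), HX > HY.

The query ?- taller(ram, shyam). answers Yes.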
The ontology development process is similar to that of object-oriented design and analysis. In both, the domain vocabulary is collected in the beginning, often drawn from the domain's generic nouns, verbs, and adjectives. The result of object-oriented analysis is a draft document for the domain ontology relevant to the application; however, analysts do not call the result an ontology. A designer in an object-oriented system defines things like objects, classes, interface functions, hierarchies, and the system behavior. An ontological engineer makes use of intermediate representations such as graphs, tables, and semantic networks to design hierarchies and other concept relationships, which holds true for object-oriented design also.
In object oriented analysis, the classes act as templates. Both types of specialists
make use of templates to specify product details. The classes are then merged or
refined, as with ontologies.
The LOOM tool takes two ontologies represented in OWL and produces pairs of related concepts from the ontologies. In order to identify the corresponding concepts, LOOM compares the preferred names and synonyms of the concepts in both ontologies. It identifies two concepts as similar if and only if their preferred names or synonyms are equivalent under modified string functions: the string comparison removes the delimiters from both strings, then uses approximate matching techniques to compare them, allowing a mismatch of at most one character in a specified length.
Some languages/tools used in building ontologies are based on translation approaches. Using these approaches it is possible to build ontologies directly at the knowledge level, which is then translated into the implementation level, thus eliminating the need to master the implementation languages [8].
The concept of action arises in at least two major subareas of artificial intelligence: (1) natural language processing and (2) problem solving. For the most part, the formalisms that have been suggested in each subarea are independent of each other and difficult to compare. So far, there exists no computational theory of actions powerful enough to capture the range of meanings and distinctions expressible in a natural language. The primary goal of the situation calculus is to provide a formalism expressive enough for the representation of actions, and to explore its use in defining the meanings of English sentences that describe actions and events [9].
The requirement on the formalism is that it should be a useful representation for reasoning about actions (i.e., problem solving). It has a very useful application: describing how this representation could be used for planning (of actions) or for a plan recognition system. This is essential to natural language processing as well, because natural language understanding ultimately aims at problem-solving capability.
The situation calculus is also the language of choice for investigations of various
technical problems that arise in theorizing about actions and their effects. It is being
taken as a foundation for practical work in planning, control, simulation, database
updates, agent programming and robotics. In parallel with these developments of
its applications, there have emerged axiomatizations for the situation calculus, and
explorations of some of their mathematical and computational properties [10].
The situation calculus is a logical formalism for representing and reasoning about dynamical domains. The basic concepts in the situation calculus are situations, actions, and fluents. A number of objects are typically involved in the description of the world.
The situation calculus is based on a sorted domain with three sorts: actions, situations, and objects, where the objects include everything that is not an action or a situation. Hence, the domain D is

D = ⟨A, S, O⟩    (6.2)

where A is the set of actions, S is the set of situations, and O is the set of objects, and variables of each sort can be used. The actions, situations, and objects are elements of the domain, but fluents are modeled either as predicates or as functions. The phrase "sorted domain" means that the domain is partitioned into these disjoint sorts, so that every term and variable belongs to exactly one sort.
A situation is a kind of (ako) state; however, it results from a chain of earlier situations. Actions are what make the dynamic world change from one situation to another when performed by agents. A fluent is a condition that can change over time. Fluents are situation-dependent functions used to describe the effects of actions. There are two kinds of them: relational fluents and functional fluents. The former take only two values, True or False, while the latter can take a range of values. For instance, one may have a relational fluent called handempty, which is true in a situation if the robot's hand is not holding anything; we may need a relation like this in a robot domain. One may also have a functional fluent, e.g., battery-level, whose value in a situation is an integer between 0 and 100 denoting the total battery power remaining in one's notebook computer.
The set of actions forms one sort of the domain. Actions can take variables, and these can be quantified; for example, in a robot world, possible action terms would be move(x, y) to model the robot moving to a new location (x, y), and pickup(o) to model the robot picking up an object o, and so on.
6.8.2 Formalism

The intended interpretation is that do(a, s) (or result(a, s)) denotes the successor situation resulting from performing action a in situation s. Accordingly, the basic formalism of the situation calculus is represented by

s′ = do(e, s)    (6.4)

where s′ is the situation that results when event (action) e occurs in situation s. The location of the robot can be modeled using a functional fluent location, which returns the location coordinates (x, y) of the robot in a particular situation.
To describe a dynamic domain using the situation calculus, one has to decide on
the set of actions available for the agents to perform, and the set of fluents needed
to describe the changes these actions will have on the world. For example, consider
the blocks world where some blocks of equal size can be arranged into some set of
towers on a table. The set of actions in this domain depends on what the imaginary
agent can do. If we imagine an agent to be a robot-arm that can be directed to grasp
any block that is on the top of a tower, and either add it to the top of another tower
or put it down on the table to make a new tower, then we can have the following
actions:
stack(x, y): put block x on block y, provided that the robot is holding x and y's top is clear. Being an action, it is read "stack x on y", and should not be confused with a predicate.
unstack(x, y): pick up block x from block y, provided that the robot's hand is empty, x is on y, and x has its top clear.
putdown(x): put block x down on the table, provided that the robot is holding x.
pickup(x): pick up block x from the table, provided that the robot's hand is empty, x is on the table, and its top is clear.
move(x, y): move block x to position y.
To describe the effects of these actions, we can use the following relational fluents:
handempty: True in a situation if the robot's hand is empty.
holding(x)—True in a situation if the robot’s hand is holding the block x.
on(x, y)—True in a situation if block x is on block y.
ontable(x)—True in a situation if block x is on the table.
clear (x)—True in a situation if block x has top clear.
For the action stack(x, y) to be performed in a situation, the fluents
– holding(x) must be true, and
– clear(y) must be true.
Also, after stack(x, y) is performed and it results in a new situation, the fluents
– on(x, y) will be true,
– handempty will be true,
– holding(x) will be false, and
– clear(y) will be false.
The action stack(x, y), together with the present values of the fluents and the resulting new values, can be formally expressed as an effect axiom of the form

holding(x, s) ∧ clear(y, s) → on(x, y, do(stack(x, y), s)) ∧ handempty(do(stack(x, y), s))
∧ ¬holding(x, do(stack(x, y), s)) ∧ ¬clear(y, do(stack(x, y), s)).    (6.5)

Now, for example, we can say that for the action stack(x, y) to be performed in a situation, holding(x) and clear(y) must be true, and that after stack(x, y) is performed, in the resulting new situation on(x, y) and handempty will both be True, while holding(x) and clear(y) will no longer be True.
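A minimal Prolog sketch of the positive effects of these axioms, using a holds/2 encoding (the predicate names are assumed; negative effects and the frame problem are ignored here):

% Effects of stack(X, Y): what becomes true in the successor situation.
holds(on(X, Y), do(stack(X, Y), S)) :-
    holds(holding(X), S),
    holds(clear(Y), S).
holds(handempty, do(stack(X, Y), S)) :-
    holds(holding(X), S),
    holds(clear(Y), S).

% Together with facts about an initial situation, e.g. holds(holding(a), s0),
% queries about successor situations can be answered by backward chaining.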
We only need the action move(x, y) to move block x to position y. This can happen if the agent (robot) moves a block from a clear position to another clear position. Block x being clear to move means that x is on the table with its top clear, or it is the topmost block in a stack. Accordingly, the move(x, y) action can be expressed by the axiom

∀x∀y[(clear(x) ∧ clear(y) ∧ x ≠ y ∧ move(x, y)) → (on(x, y) ∧ clear(x))].    (6.6)

For describing the effects of this action two fluents are sufficient: clear(x) and on(x, y). The action move(x, y) can be performed only when the situation "clear(x), clear(y), x ≠ y" is True. The consequence of this action is that x is no longer at the place where it was earlier, but instead at location y in the resulting new situation.
The axiomatic system discussed above is justified only in some specific context. With a little creative thought it is possible to construct a more general context as an axiomatic system. However, in such a generalized context the exact form of the axioms will not hold. We can appreciate this argument if we think of the human reasoning apparent in natural language. For example, consider axiomatizing the concept "on" so as to draw correct inferences from the information presented in an English sentence, in the world of a spacecraft on a Mars mission (as context), where a flight crew member answers the query of another crew member:

"The book is on the table."

A critic may quibble about the precise meaning of "on", raising difficulties about what can be between the book and the table, or about how much gravity there has to be in a spacecraft in order to use the word "on", and whether centrifugal force counts, if the spacecraft is orbiting the planet Mars! Thus, we encounter Socratic puzzles over the issue of what the concepts mean in complete generality, and come across examples that never arise in reality: "there simply is not a most general context"!
On the other hand, if a system is axiomatized at a sufficiently general level, the axioms will be too long to be usable even in special situations. Thus, we find it convenient to say "The book is on the table" and omit the reference to the exact time when it is on the table, its precise location on the table (on which part or corner it is located), and the location of the table in global (GPS) or universal coordinates. Hence, we conclude that the position of the book in the above example is stated at a sufficiently general level. How general should the specification be? This depends on whether the required general or commonsense knowledge has been expressed in the logic used for reasoning, in the corresponding program, or in some other formalism.
A possible solution to the question "how general?" is to formalize the idea of context and combine it with the circumscription procedure of nonmonotonic reasoning. This is done by adding context parameters to the predicates and functions in the axioms, because each axiom makes an assertion about a certain context. Apart from this, the axioms should express that facts are inherited by more restricted contexts unless exceptions are specified. Each assertion is assumed to apply, nonmonotonically, in any particular more general context, but for that too there are exceptions. For example, there is a rule that says birds fly. In this rule, we implicitly assume that an atmosphere is available; however, in a more general context, the existence of an atmosphere need not be assumed. Further, it still remains to be determined how inheritance to more general contexts differs from inheritance to more specific ones.
There are some predicates about contexts, as well as for dealing with time. One of the most important is holds, which asserts that a property holds (i.e., is true) during a time interval. Thus, the sentence

holds(p, t)

is true if and only if property p holds during time t. A subsequent axiom will state that this is intended to mean that p holds at every subinterval of t as well. Note that if we had introduced holds as a "modal operator" we would not have needed to introduce properties into our ontology.
Suppose that, whenever a sentence p is present in the memory of a computer, we consider it as being in a particular context and as an abbreviation for the sentence

holds(p, C)

where C is the name of a context. We create a relation expressing the generality of a context as follows: the relation c1 ≤ c2 means that context c2 is more general than context c1. Using this relation, we allow sentences like

holds(c1 ≤ c2, C)

which means that statements relating contexts (c1 ≤ c2) can themselves have contexts C.
A logical system using contexts might provide operations of entering and leaving a context, yielding what we might call unnatural deduction, allowing a sequence of reasoning such as the following:

holds(p, C)
ENTER C
p
...
q
LEAVE C
holds(q, C)
Example 6.1 Represent the following sentences in the formalism of the situation calculus.

"A man called Rex is standing at the gate of Ghana Bird Sanctuary wearing a black T-shirt. Rex loads his toy gun, waits for a few seconds, and shoots at a migratory crane."

The fluents, which are either true or false, in this sentence are:

standing(place): whether Rex is standing at the given place or not;
black: whether Rex is wearing a black T-shirt or not;
loaded: whether the gun is loaded or not;
alive: whether the crane is alive or not.
Now consider the first sentence, "if Rex shoots the gun and the gun is loaded, the crane is not alive," which is written as:

1. ∀s[holds(loaded, s) → ¬holds(alive, do(shoot, s))]

The remaining four sentences can be expressed in the situation calculus as follows:

2. ∀s[¬holds(loaded, do(shoot, s))]
3. ∀s[holds(loaded, do(load, s))]
4. ∀s[holds(loaded, s) → holds(loaded, do(wait, s))]
5. ∀s[¬holds(loaded, s) → ¬holds(loaded, do(wait, s))]
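The positive axioms translate almost directly into Prolog under a holds/2 encoding (a sketch; only the positive axioms are encoded, since negative literals need a separate representation or negation as failure, and s0 is an assumed initial situation):

holds(loaded, do(load, _)).                       % axiom 3: loading makes the gun loaded
holds(loaded, do(wait, S)) :- holds(loaded, S).   % axiom 4: waiting preserves loadedness

% e.g., ?- holds(loaded, do(wait, do(load, s0))).   succeeds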
Having represented the sentences in this FOPL form, reasoning is done using the conventional methods of predicate logic, e.g., resolution-based theorem proving.
Most of the logic we have studied so far falls in the category of monotonic logic, and the corresponding reasoning is monotonic reasoning. The property of monotonicity is satisfied by all methods that are based on classical (mathematical) logic. As per this property, if a conclusion is warranted on the basis of certain premises, no additional premises can ever invalidate the conclusion. Nonmonotonic logics are those ways of reasoning, used to infer additional information, that do not satisfy the monotonicity property of classical logic; that is, in the face of additional information, some of the conclusions drawn earlier may become invalid!
In everyday life, however, we often draw sensible conclusions from the information available to us. On receiving new or updated information, we often take back our previous conclusions, even when the new information has in no way indicated that it contradicts those conclusions. Consider the following example: we assume that most birds fly, and that penguins are birds which do not fly. On learning the new information that "Tweety is a bird", we infer that it flies. Further learning that "Tweety is a penguin" still does not affect our old belief that most birds fly, but it does require us to abandon the inference that "Tweety flies." It is quite appropriate to say that intelligent automated systems must have this kind of inference capability, called nonmonotonic inference.
Many systems perform such nonmonotonic inferences. The most common are: negation as failure, circumscription, the modal system, default logic, autoepistemic logic, and inheritance systems. Each of these systems is worth studying by itself, but a general framework in which these many examples can be compared and classified is necessary. The following section presents such a general framework, concentrating on the properties of the important families of nonmonotonic reasoning systems.
One of the features of commonsense reasoning that makes it different from traditional mathematical proofs is the use of defaults. A default is a proposition that is postulated to be true in the absence of information to the contrary. For instance, an intelligent agent may assume by default that its observation correctly reflects the state of the world, and be prepared to retract this assumption in the face of evidence that it is in error [11].
The logic systems we commonly come across, like classical logic, intuitionistic logic, and modal logic, are monotonic in nature. The name monotonic indicates that adding a new fact to these systems never results in the retraction of a conclusion derived before the addition of that fact. In fact, after the addition of new knowledge, the inference ability of these systems monotonically increases. Default logic is called nonmonotonic because a new fact may be an exception to one of the defaults included in the set, and in the face of that, the inference system should withdraw conclusions already drawn.
When an intelligent system (either computer-based or human) tries to solve a problem, it may rely on complete information about the problem, and its main task will be to draw correct conclusions using classical reasoning. Classical (predicate) logic may be sufficient in such cases. But in many situations, the system has only incomplete information at hand, and it is required to draw inferences with only this available information. The nonavailability of information may be due to the need to respond quickly: acquiring more information would take time, and the system cannot afford to wait for the collection of all the relevant data. Consider, for example, the evacuation required in a nuclear disaster, where we cannot first gather complete information and then decide whether to evacuate or not, as the delay would cause havoc. Sometimes the information available is unreliable or inaccurate and needs to be authenticated, which would require more time, and we cannot prolong the decision.
Classical logic possesses the ability to represent and reason with certain aspects of incomplete information. In special conditions, additional information needs to be "filled in" to overcome the incompleteness, as it is important to make certain decisions. In such circumstances, the system needs to make some plausible conjectures, which are based on rules of thumb, called defaults; this modified logic is called default logic, and the reasoning is called default reasoning. As an example, a doctor has to make some conjectures during an emergency about the most probable causes of the symptoms observed. Obviously, in such conditions it would not be appropriate to wait for the results of laboratory tests, as the possibly extensive and time-consuming test results may arrive too late to complete the diagnosis and begin the treatment.
However, when a medical diagnosis (i.e., decision) is based on assumptions, it may turn out to be wrong in the face of new information, and may require a modified diagnosis! The phenomenon of having to take back some previous conclusions is called nonmonotonic reasoning or nonmonotonicity, which can be stated as: if a statement φ follows from a set of premises M, i.e., M |= φ, and M ⊆ M′, then φ does not necessarily follow from M′ (see Example 6.2). Default logic provides formal methods to support this kind of reasoning.
Example 6.2 A rule used by football organizers in the Kashmir valley might be: "A football game shall take place, unless there is snow in the stadium." This rule of thumb is represented by the default

football : ¬snow
----------------                (6.8)
takesPlace
Consider this example to understand default reasoning. In the absence of information that there is going to be snowfall in the stadium, it is reasonable to assume ¬snow, and hence to conclude that the game will take place; accordingly, the preparations for the game can go on. But if there actually is a heavy snowfall during the night before the game is scheduled, then this assumption turns out to be wrong. When we are certain, based on definite information, that there is snow, we cannot assume ¬snow, and therefore the default cannot be applied. In this case we need to retract the previous conclusion (that the game will take place), so the reasoning is nonmonotonic: we withdraw a conclusion drawn previously [11].
Before proceeding with more examples, let us first explain why classical logic is not appropriate to model this situation. Of course, we could use the rule

football ∧ ¬snow → takesPlace.                (6.9)

The problem with this rule is that we have to definitively establish that there will be no snow in the stadium before applying the rule. But that would mean that no game could be scheduled in the winter, as it would not be possible to establish "no snow" before the night preceding the game, which is too late to decide whether to proceed with the preparations; and if we wait until the night before the match to make the decision, there is no time left for the preparations. It is necessary to analyze the difference between the two statements: having "to know that it will not snow", and being able to "assume that it will not snow." Defaults work on the second; they support drawing conclusions based on assumptions.
Defaults can be used to model prototypical reasoning, i.e., that most instances of a concept carry some property. One example is the statement "Typically, children have (living) parents", which can be expressed in default logic as

child(X) : hasParents(X)
------------------------                (6.10)
hasParents(X)

Another example is the commonly used principle of awarding justice in courts of law: "In the absence of evidence to the contrary, assume that the accused is innocent." In default form we write this as:

accused(X) : innocent(X)
------------------------                (6.11)
innocent(X)
In general, a default δ is an inference rule of the form

φ : ψ1, . . . , ψn
------------------                (6.12)
χ

where φ, ψ1, . . . , ψn, χ are closed formulas in predicate logic. The formula φ is called the prerequisite of the inference rule (6.12), ψ1, . . . , ψn are called its justifications, and χ is the consequent of the rule δ.

It is important to note that the formulas in a default must be closed. For example, the formula

bird(X) : flies(X)
------------------                (6.13)
flies(X)

is not a default according to this definition, since X occurs free. Let us call such a rule of inference an open default. An open default is interpreted as a default schema, and it may represent a set of defaults, possibly infinitely large in number.
A default schema is very much like a default, with the difference that φ, ψ1, . . . , ψn, χ are arbitrary predicate formulas (i.e., they may contain free variables). A default schema defines a set of defaults

φσ : ψ1σ, . . . , ψnσ
---------------------                (6.14)
χσ

for all ground substitutions σ that assign values to all free variables occurring in the schema. That means free variables are interpreted as being universally quantified over the whole default schema. For example, given the default schema

bird(X) : flies(X)
------------------
flies(X)
and the facts bird(tweety) and bird(sam), the default theory is represented as

T = (W, D), where
W = {bird(tweety), bird(sam)}, and
D = { bird(tweety) : flies(tweety) / flies(tweety),   bird(sam) : flies(sam) / flies(sam) }.
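As a small illustration (the function and variable names here are my own, not the book's), the grounding of an open default over the known individuals can be sketched in Python as follows:

# Turn the open default bird(X):flies(X)/flies(X) into ground defaults,
# one per individual, as the schema interpretation above prescribes.
individuals = ["tweety", "sam"]

def ground_defaults(schema, individuals):
    """schema: (prerequisite, justification, consequent) with '{X}' holes."""
    pre, just, con = schema
    return [(pre.format(X=i), just.format(X=i), con.format(X=i))
            for i in individuals]

schema = ("bird({X})", "flies({X})", "flies({X})")
for d in ground_defaults(schema, individuals):
    print("%s : %s / %s" % d)
# bird(tweety) : flies(tweety) / flies(tweety)
# bird(sam) : flies(sam) / flies(sam)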
Default logic is a simple and commonly used method of knowledge representation and reasoning in real-world situations. It has the following characteristics:

• The most important aspect is that it supports reasoning with incomplete information.
• Default inferences occur naturally in many applications, such as medical diagnostics, reasoning about legal issues, information retrieval, preparation of system specifications, and as a logic in software.
• Default logic is commonly used to model reasoning with incomplete information, which was the original motivation, as well as a formalism that enables compact representation of information.
• Important prerequisites for the development of successful applications in these domains are:
– an understanding of the basic concepts, and
– a powerful implementation of default logic.
Algorithm 6.1 for default reasoning computes the extensions of the formula set W. Let T = (W, D) be a closed default theory (i.e., one in which the defaults contain no free variables), with a finite set of default rules D and a set of formulas W. Let P be the set of all permutations of the elements of the default set D. If P = {}, i.e., D = {}, or if W is inconsistent, then Th(W) is returned as the only extension of T. The sets of justifications and beliefs are indicated by the variables JUST and BELIEF, respectively.
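The idea of trying the defaults in every possible order can be sketched in a few lines of Python. The sketch below is my own illustration, not the book's Algorithm 6.1: it restricts every formula to a single propositional literal, so that entailment reduces to set membership and consistency to the absence of complementary literals; a full implementation would need a theorem prover for W.

from itertools import permutations

def neg(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit

def consistent(lits):
    return all(neg(l) not in lits for l in lits)

def extensions(W, D):
    """W: set of literals; D: list of (prerequisite, justifications, consequent)."""
    found = set()
    for order in permutations(D):
        belief = set(W)          # BELIEF: current conclusions
        just = set()             # JUST: justifications assumed so far
        for pre, js, con in order:
            # a default applies if its prerequisite holds and each
            # justification is consistent with the current beliefs
            if pre in belief and all(consistent(belief | {j}) for j in js):
                belief.add(con)
                just.update(js)
        # keep the candidate only if no assumed justification was refuted
        if all(consistent(belief | {j}) for j in just):
            found.add(frozenset(belief))
    return found

# The Tweety/Sam theory from the text:
W = {"bird_tweety", "bird_sam"}
D = [("bird_tweety", ["flies_tweety"], "flies_tweety"),
     ("bird_sam", ["flies_sam"], "flies_sam")]
for ext in extensions(W, D):
    print(sorted(ext))

For this theory the sketch prints the single extension containing flies_tweety and flies_sam. Note that it makes one pass over each permutation; since all orders are tried, a default enabled late in one order gets applied early in another.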
6.11 Summary
Ontology is a theory about the nature of being or the kinds of existents. It is a systematic arrangement of all the important categories of objects or concepts that exist in some field of discourse, such that the arrangement shows the relations between them. Compared with First-Order Predicate Logic (FOPL), in ontologies we are particular about knowledge organization as well as content, whereas in predicate logic the emphasis is on knowledge content only.

Language, logic, and ontology are closely related to common sense, reasoning, and type checking, respectively: language, ontology, and logic are representations, whereas common sense, type checking, and reasoning are, respectively, the applications. For natural language, the "understanding is reasoning" paradigm applies. In comparing various ontologies, they can be viewed at three different levels: (1) the is-a taxonomy of concepts, (2) the internal concept structure and the relations between concepts, and (3) the presence or absence of explicit axioms.
John Sowa stated his fundamental top-level principles for ontology design as "distinctions, combinations, and constraints". The distinction is physical versus information; the combinations classify objects into firstness versus secondness versus thirdness, or Form versus Role versus Mediation; and the constraint is continuant versus occurrent, or Object versus Process.
Knowledge engineering is the process of knowledge representation, comprising: identifying the task, assembling the relevant knowledge, deciding on a vocabulary of predicates, functions, and constants, encoding knowledge about the domain, and deploying it. Ontological engineering is aimed at creating representations of general concepts—actions, time, physical objects, and beliefs.
Some of the languages and tools for implementing ontologies are Ontolingua, CycL, and LOOM, which use a frame-based formalism, a logic-based formalism, or both.

Classifying objects into categories is important for knowledge organization and reasoning. For example, the sentence "The basketball team has won" refers neither to a single ball nor to a single player, but to a relationship among the team members as a group. Categories also help in reasoning, e.g., in predicting the properties of objects once they are classified.
The concept of action arises in two major subareas of artificial intelligence: (1) natural language processing and (2) problem solving. However, the formalisms that have been suggested in each subarea are independent of each other and difficult to compare. The requirement on a formalism for the representation of actions is that it should be a useful representation for reasoning about actions (i.e., problem solving). It has an important application in describing how this representation can be used for the planning of actions or for plan recognition systems.

A situation is a consequence of a sequence of action occurrences. We make use of do (a binary function symbol), expressed as do(a, s), denoting the situation that results from performing action a in situation s.
Exercises
1. Give the top-level ontology of the following structures; represent the concepts using relations and attributes.
a. University system—consisting of faculties, departments, teachers, classes,
students, courses, etc.
10. A robot is to pick up n (n = 10) cuboids lying on a table and drop them one-by-one into a bucket available near the table. Write the statements for the situation calculus. What are the situations, actions, and fluents here?
11. Consider the eight-puzzle shown in Fig. 6.8 with initial and goal states (situations). The objective is to go from the initial situation to the goal situation. We are allowed to move a tile into the empty space if that tile is adjacent to the empty space (e.g., in the initial situation, tiles 6 and 8 are adjacent to the empty space). The locations are numbered 1–9, as shown in the initial situation, with number 9 denoting the empty tile. The tiles are numbered 1–8. There is a single action move(t, l), which indicates moving tile t to location l. Assume a predicate adjacent(l1, l2), which is true when location l2 is one move from l1. It is only possible to do a move action if the tile is in a location adjacent to an empty location. The only fluent is location(t, l, s), meaning tile t is in location l in situation s. Given this, write down:
• initial conditions,
• effect axioms,
• precondition axioms, and
• from the effect axioms derive the successor state axioms.
12. Given the following set of facts and default rules:
a. People typically live in the same city where they work (default: d1)
b. People typically live in the same city where their spouses are (default: d2)
c. John works in New Delhi (fact: f1)
d. John's spouse works in Mumbai (fact: f2)
15. Translate the following into first-order predicate logic, and check whether the given conclusion follows from it.
Typically, computer science students like computers. Female students who like computers are typically interested in cognitive science. Computer science students are typically female: for example, Anita, Babita, and Cathy; but Dorthy is an exception to this rule. Conclusion: Anita, Babita, and Cathy are interested in cognitive science; Dorthy is not interested in cognitive science.
16. Compute the default extensions of the following theories T = (M, D), where each default is written as prerequisite : justification / consequent:
a. M = {a}, D = { a : ¬b / c,   : ¬c / d,   : ¬d / e }
b. M = {a → c, b → c}, D = { : ¬b / a,   : ¬a / b,   : ¬d / e }
c. M = {}, D = { : ¬b / a,   : ¬a / b,   : ¬d / d }
d. M = {p ∧ q}, D = { b : a / a,   : ¬a / a,   : ¬a / ¬c,   : ¬q / b,   : ¬q / p }
17. Compute the default extensions of T = (M, D), where
M = {∀x[mynah(x) → ¬nests(x)],
     ∀x[penguin(x) → ¬flies(x)],
     ∀x[bird(x) ≡ mynah(x) ∨ penguin(x) ∨ canary(x)],
     bird(Tweety)},
D = { bird(x) : nests(x) / nests(x),   bird(x) : flies(x) / flies(x) }
References

1. Chowdhary KR (2004) Natural language processing for word-sense disambiguation and information extraction. PhD thesis, Department of Computer Science and Engineering, J.N.V. University, Jodhpur (India)
2. Davis E, Marcus G (2015) Commonsense reasoning and commonsense knowledge in artificial intelligence. Commun ACM 58(9):92–103
3. https://protege.stanford.edu/. Cited 19 Dec 2017
4. Douglas BL (1995) CYC: a large-scale investment in knowledge infrastructure. Commun ACM 38(11):33–38
5. Fridman-Noy N, Hafner CD (1997) The state of the art in ontology design: a survey and comparative review. AI Mag 53–74
6. http://clarity.princeton.edu/pub/wordnet/. Cited 19 Dec 2017
7. Sowa JF (1995) Distinctions, combinations, and constraints. In: Proceedings of the workshop on basic ontological issues in knowledge sharing, Montreal, Canada
8. Devedzic V (2002) Understanding ontological engineering. Commun ACM 45(4):136–144
9. Pinto JA (1994) Temporal reasoning in the situation calculus. PhD dissertation, Graduate Department of Computer Science, University of Toronto
10. Thielscher M (1999) From situation calculus to fluent calculus: state update axioms as a solution to the inferential frame problem. Artif Intell 111:277–299
11. Antoniou G (1999) A tutorial on default logics. ACM Comput Surv 31(3)
Chapter 7
Networks-Based Representation
7.1 Introduction
Semantic networks were developed with the goal of characterizing the knowledge and the reasoning of a system by means of network-shaped cognitive structures. Similar goals were later achieved through frame-based systems, which depend on the notion of a "frame" as a prototype and on the capability to express relationships between frames. Frames and semantic networks are quite different from each other, but both have cognitive features and allow navigation in their structures. Due to this, both of them can be classified as network-based structures, where the network is used for representing individuals and the relationships between them.
Due to their human-oriented origins, network-based systems are, from the practical point of view, more appealing and effective than logical systems based on predicate logic and its variants. However, these network-based systems were not accepted as a complete solution, due to their lack of semantic properties. Because of that, every system behaved differently from the others, despite their almost identical-looking components and identical-looking relationships. Hence, the need was felt to represent semantic characteristics in these structures. Semantic characteristics in network-based systems could be achieved by introducing the notion of hierarchical structures. Using hierarchical structures in semantic networks and frames, one gains both in ease of representation and in efficiency of reasoning.
Learning Outcomes of this Chapter:
1. Identify all of the data, information, and knowledge elements and related organi-
zations, for a computational science application. [Assessment]
2. Describe how to represent data and information for processing. [Familiarity]
3. Compare and contrast the most common models used for structured knowledge
representation, highlighting their strengths and weaknesses. [Assessment]
4. Identify the components of non-monotonic reasoning and its usefulness as a rep-
resentational mechanism for belief systems. [Familiarity]
5. Compare and contrast the basic techniques for qualitative representation. [Assess-
ment]
Answers to queries are retrieved through a network matcher or retriever. The most advanced systems of inference have an inference engine which can perform specialized inferences, tailored to treat certain functions, predicates, and constant symbols differently from others. This is achieved by building into the inference engine certain true sentences which involve these symbols, and providing control to handle these sentences. The inference engine is able to recognize the special conditions under which it makes use of the specialized machinery. This becomes possible by coupling the specialized knowledge to the form of situations that it can deal with.
Semantic networks, also called associative networks, model the semantics of the words of the English language. A system developed by their inventor, Quillian, evidenced that the meanings between words are found through the paths connecting them. These models carry an intuitive appeal: related information is clustered and bound together through relational links, and the knowledge required to perform a certain task is typically contained in a narrow domain, in the vicinity of the concerned task. This type of knowledge organization in some way resembles the way knowledge is stored and retrieved in the human brain [11].
Semantic networks include a lattice of concept types [6]. Originally they comprised different correlated nets, which were based on 56 relations, like: subtypes, instances, case relations, part-whole relations, kinship relations, and various types of attributes. A simple example of a semantic network is shown in Fig. 7.1, where ako (a-kind-of), has-parts, color, and has-property are binary relations.
There are several benefits of using semantic networks for representing knowledge. Semantic networks have been used for knowledge representation in various applications like natural language understanding, information retrieval, deductive databases, learning systems, computer vision, and speech generation systems.
Unlike predicate logic, there is no well-accepted syntax and semantics for semantic networks. The syntax for any given system is determined by the object and relation primitives chosen and by the rules used for connecting objects. However, there are some primitives which are quite established. We can define a semantic network by specifying its fundamental components:
1. Lexical part: The vocabulary of the network—the node names, the edge labels, and the other symbols from which the network is built.
2. Structural part: The nodes and the edges connecting them form directed graphs,
and the labels are placed on the edges, which represent the relation between nodes.
3. Semantic part: Meanings (semantics) are associated with the edges and node
labels, whose details depend on the application domains.
4. Procedural part: The constructors are part of the procedural part, they allow for
the creation of new edges (links) and nodes. The destructors allow the deletion of
edges and nodes, the writers allow the creation and alteration of labels, and the
readers can extract answers to questions. Clearly, there is plenty of flexibility in
creating these representations.
The word-symbols used for the representation are those which represent object constants and n-ary relation constants. The network nodes usually represent nouns (objects) and the arcs represent the relationships between objects. The direction of an arrow is taken from the first to the second object, in the order in which they appear in the relation. Figure 7.2 shows an is-a hierarchy representing a semantic network. In set-theoretic terms, is-a corresponds to the subset relation '⊆', and an instance corresponds to the membership relation '∈' (an object-class relation) [3].
The commonly used relations are: member-of, subset-of, ako (a-kind-of), has-parts, instance-of, agent, attributes, shaped-like, etc. The 'is-a' relationship occurs quite often, as in the sentences: "Rajan is a Professor", "Bill is a student", "A cat is a pet animal", "A tree is a plant", "A German shepherd is a dog", etc. The 'is-a' relation is most often used to state that an object is of a certain type, that an object is a subtype of another, or that an object is an instance of a class.
Figure 7.2 shows some important features of semantic networks. The representa-
tion makes it easy to retrieve the properties of any object efficiently due to hierarchy
of relations. These networks implement property of inheritance (a form of inference).
[Fig. 7.2: an is-a hierarchy rooted at "Living being", with subtypes such as human, vertebrate, Animal, Plant, pigeon, crow, Pet, Livestock, Tree, and Bush; one node carries the local attribute "cannot fly".]
The nodes which are members or subsets of other nodes may inherit properties from their higher-level ancestor nodes. For example, we can infer from Fig. 7.2 that dogs are animals and pigeons are vertebrates, and that both of them are living beings. Property inheritance of this kind is recognized as default reasoning: it is assumed that, unless there is information to the contrary, it is reasonable to inherit the information from the ancestor nodes. In Fig. 7.3, Pigeon inherits the property "can fly" from the vertebrates, while Ostrich has a locally installed attribute "cannot fly"; hence the property 'fly' will not be inherited by it.

The inference procedures for semantic networks can also parallel those of propositional and predicate logic. For example, if a class A of objects has property P, and 'a' is a member of A, we can infer that 'a' has property P. The inferences in these networks can be defined as per those in predicate logic, making use of unification, chaining, modus ponens, and resolution; however, the reasoning is of default type [13].
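This inheritance-with-exceptions behavior is easy to prototype. The sketch below (a toy illustration with made-up node names, not code from the text) walks up an is-a chain and lets a locally installed attribute override an inherited one:

# Default inheritance over an is-a hierarchy like Fig. 7.2/7.3: a local
# attribute (Ostrich: cannot fly) overrides the inherited one (vertebrate: can fly).
isa = {"ostrich": "vertebrate", "pigeon": "vertebrate",
       "vertebrate": "animal", "animal": "living_being"}
local = {"vertebrate": {"can_fly": True},
         "ostrich": {"can_fly": False}}     # local exception

def get_property(node, prop):
    """Walk up the is-a chain; the nearest value wins (default reasoning)."""
    while node is not None:
        if prop in local.get(node, {}):
            return local[node][prop]
        node = isa.get(node)                 # inherit from the ancestor
    return None                              # property unknown

print(get_property("pigeon", "can_fly"))     # True, inherited
print(get_property("ostrich", "can_fly"))    # False, local exception

Note that the lookup for the ostrich terminates one step earlier than for the pigeon, mirroring the path-length argument about human recall made in the next subsection.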
Semantic networks are based on the associationist theory, which defines the meaning of an object in terms of a network of associations with other objects. When a human perceives and reasons about an object, that perception is first mapped into a concept. This concept is part of our entire knowledge about the world and is connected through appropriate relationships to other concepts. These relationships form an understanding of the properties and behavior of objects such as snow. For example, through associations, we associate snow with other concepts like cold, white, snowman, slippery, and ice. Our understanding of snow, and of the truth of statements such as "snow is white" and "the snowman is white", manifests out of this network of associations.
There is experimental evidence that, in addition to associating concepts, humans also organize their knowledge hierarchically, such that information is kept at the highest applicable level, to be inherited by other concepts. The evidence has shown that if two concepts were far apart in the hierarchy, humans took relatively longer to verify their relation than for concepts in proximity to each other. For example, with reference to Fig. 7.2, the human response to infer that "a dog is an animal" will be faster than for the statement "a Pomeranian is a living being". This time difference is argued to be due to the fact that humans store information in a hierarchical way. The fastest recall occurred in cases where the traits were specific to the object. For example, the negative response to "Can an ostrich fly?" will be faster than the positive response to "Can pigeons fly?". This is because the reasoning along the first path, "Ostrich → cannot fly", is faster than that along "pigeon → vertebrate → fly", due to the difference in path lengths [1].
Inheritance-based systems allow knowledge to be stored at the highest level of abstraction. This reduces the size of the knowledge base and helps in removing inconsistencies during updates. The graphs explicitly represent relations using arcs and nodes, which helps in formalizing the knowledge of semantic networks. These networks can be used to answer, through inference, various queries related to the stored knowledge base.
Much of the research in network representation has been done in the field of Natural Language Understanding (NLU). Natural language understanding often requires an understanding of common sense: the ways in which physical objects behave, the interactions that occur between humans, and the ways in which human institutions are organized. A natural language understanding program must understand the intentions, beliefs, hypothetical reasoning, plans, and goals embedded in the natural language text. Due to this, language understanding has been a driving force for knowledge representation research.
The NLU programs define the words of the English language in terms of other
words, like the English dictionary does, rather than defining in terms of primitive
words or axioms. Thus, to understand the meaning of a word we traverse a network
of words until we understand the meaning of the required word.
7.2.4 Performance
Network structures have been used to provide intuitive and useful representations for modeling semantic knowledge. These models are required to fulfill the following four objectives:

1. Is it possible to organize human semantic knowledge using the general structural principles which are characteristic of semantic networks?
2. Is it possible that the human performance of semantic processing can be emulated in terms of general processes operating on semantic networks?
3. Is it possible to emulate the human processes of semantic retrieval and search on general structures of semantic networks?
4. Can semantic networks emulate the human processes of semantic acquisition and development?

For all the above questions, the required answer is not simply yes or no, but to what degree each is achievable. The answer also depends on the applications which exploit the features of these networks, and much is yet to be found out.
The best example of a semantic network is the semantic web, which enables people to access documents and services on the Internet. The interface to a service is represented in web pages written in natural language, which must be understood and acted on by humans. The semantic web augments the existing web with formalized knowledge and data that can be processed by computers.
7.3 Conceptual Graphs
A sentence in the semantic network is represented by a verb node, with various case links to this node representing the other participants in the action. The complete structure formed is called a case-frame. When a sentence is parsed, the algorithm identifies the verb node and retrieves the complete case-frame from the knowledge base. As a next step, the algorithm binds the values of the agent, object, etc., to the appropriate nodes of the case-frame.
Example 7.1 Represent the sentence "Rajan eats noodles with a fork" as a conceptual graph.

The given sentence is represented by the conceptual graph shown in Fig. 7.4, and the corresponding predicate formula is

(∃x)(∃y)(Eat(x) ∧ Person(Rajan) ∧ Food(Noodles) ∧ Fork(y) ∧ Agent(x, Rajan) ∧ Object(x, Noodles) ∧ Inst(x, y)).   (7.1)
The representation we have used here is called a typed or sorted version of logic; each of the four concepts (i.e., Rajan, noodles, eat, and fork) has a type label, representing the type of entity the concept refers to. The two concepts Rajan and Noodles have names, which identify the referent; e.g., the concept [PERSON: Rajan] has type PERSON and referent Rajan. The three conceptual relations carry the type labels Agent, Instrument, and Object. The conceptual graph (CG) as a whole indicates the semantics that the person "Rajan" is the agent of some instance of eating, the noodles are its object, and the fork is its instrument. Eat and Fork have no fields to refer to them by name, as they are generic concepts.

The concept symbols may represent entities, actions, properties, or events. For Fig. 7.4, a linear conceptual graph form, which is easier to represent as text, is

[Eat]— (Agent)→[Person: Rajan]
     — (Object)→[Food: Noodles]
     — (Inst)→[Fork].   (7.2)
Example 7.2 Represent the sentence "Rajan is going to Mumbai by bus" using a conceptual graph.

Figure 7.5 shows a CG for the sentence. In the CG, all four concepts—Person, Go, Mumbai, and Bus—have type labels for the type of entity referred to by the concept. Two of the concepts have referent names (Rajan and Mumbai). The verb "Go" is related to the remaining three concepts by the relations agent, destination, and instrument. The complete CG indicates that the person Rajan is the agent of some instance of "going", the city of Mumbai is the destination, and the bus is the instrument. Figure 7.5 can be translated into the predicate formula

(∃x)(∃y)(Go(x) ∧ Person(Rajan) ∧ City(Mumbai) ∧ Bus(y) ∧ Agent(x, Rajan) ∧ Dest(x, Mumbai) ∧ Inst(x, y)).   (7.3)
The only logical operators used in Fig. 7.5 are conjunction and the existential quantifier, as expressed in Eq. 7.3; these are the most common operators in translations from natural language to first-order predicate logic [14].
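The translation pattern of Eq. 7.3—generic concepts become existentially quantified variables, named referents become constants, and every concept node and relation edge becomes a conjunct—can be carried out mechanically. The sketch below is my own illustration (the data structures and names are assumptions, not the book's notation):

# Translate a CG, given as concept nodes and relation edges, into a
# first-order formula in the style of Eq. 7.3.
concepts = [("Go", None), ("Person", "Rajan"),
            ("City", "Mumbai"), ("Bus", None)]          # (type, referent)
relations = [("Agent", 0, 1), ("Dest", 0, 2), ("Inst", 0, 3)]

def cg_to_fopl(concepts, relations):
    names, conjuncts, variables = [], [], []
    for i, (ctype, referent) in enumerate(concepts):
        if referent is None:                 # generic concept -> variable
            var = "x%d" % i
            variables.append(var)
            names.append(var)
        else:                                # named referent -> constant
            names.append(referent)
        conjuncts.append("%s(%s)" % (ctype, names[i]))
    for rel, frm, to in relations:
        conjuncts.append("%s(%s, %s)" % (rel, names[frm], names[to]))
    prefix = "".join("(exists %s)" % v for v in variables)
    return prefix + "(" + " & ".join(conjuncts) + ")"

print(cg_to_fopl(concepts, relations))
# (exists x0)(exists x3)(Go(x0) & Person(Rajan) & City(Mumbai) & Bus(x3)
#  & Agent(x0, Rajan) & Dest(x0, Mumbai) & Inst(x0, x3))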
7.4 Frames and Reasoning

Frames are a structure-based knowledge representation technique, similar to semantic networks. The latter are based on the concept of human associative memory, but may simply be thought of as data structures of nodes—"concepts"—and links—"associations"—between them [3].
The concept of a frame was proposed in the 1970s by Minsky. As per Minsky, when we encounter a new situation, or when there is a substantial change in the present context, we select from memory a structure called a frame. This is a previously remembered framework, which is adapted to fit the current set of things as required for the new situation, with the necessary changes. The frame is a data structure that represents a stereotyped situation, like a certain kind of living room, a picnic, or a classroom. Several kinds of information are attached to each frame: some information may be about how the frame is to be used, other information may be about what one may expect to happen next, some information may be about what is to be done if these expectations are not met, and so on [8].
Consider representing the sentence: “Car #12 is red”
Approach 1: red(car12). With this representation, it is easy to ask “what is red”, but
we cannot ask “what is the color of car12?”
Approach 2: color(car12, red). In this approach, it is easy to ask “What is red?”
Also, we can ask, “What is the color of car12?” But, we cannot ask “What property
of car12 has value red?”
Approach 3: property(car12, color, red). With this, it is easy to ask all the above
questions.
We call this the object-property-value representation, with the format property(Object, Property, Value). To get an object-centered representation, we merge the many properties of the same object into one structure, as follows:
property(Object, Property1, Value1)
property(Object, Property2, Value2)
…
property(Object, Property-n, Value-n)
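A few lines of Python make the merge concrete (car12 and its properties are illustrative values, not from the text):

# From property(Object, Property, Value) triples to an object-centered record.
triples = [("car12", "color", "red"),
           ("car12", "owner", "Rajan"),
           ("car12", "wheels", 4)]

objects = {}
for obj, prop, val in triples:
    objects.setdefault(obj, {})[prop] = val   # merge by object

print(objects["car12"]["color"])              # "red": what is car12's color?
print([p for p, v in objects["car12"].items() if v == "red"])
# ["color"]: which property of car12 has value red?

With this structure, all three kinds of questions raised above can be answered directly.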
A frame is a network of nodes and relations whose "top levels" are fixed and represent the things that are always true concerning the supposed situation. The lower levels of this frame-based network comprise many terminals or "slots", which must be filled in by specific instances of data. For each terminal, the conditions under which the assignments may be made must also be specified. The assignments are usually smaller frames, called "sub-frames" (see Fig. 7.8). Simple conditions are indicated by markers, which might require a terminal assignment to be a person, an object of sufficient value, or a pointer to a sub-frame of a certain type. It is possible to specify relations among the things assigned to several terminals using more complex conditions.
The inheritance hierarchies serve the purpose of economical data storage. Instead of storing all the properties in each object, the objects are structured in a hierarchy: only the individual properties are stored in the object itself, while the general properties are attached to its predecessors and inherited by all the successors.
Every frame is identified by its individual name—the frame name—and consists of a set of attributes (slots) associated with it (see Fig. 7.8). For example, in the frame "Person", the slots may be: name, weight, height, and age; for the frame "Computer", the slots may be: model, processor, memory, and price. There is also a value attached to each attribute or slot. A frame-based approach provides a natural way of structured and compact knowledge representation: the knowledge is organized in slots that describe various attributes or properties of an object. This approach is well suited to object-oriented programming of expert systems.
The following is the typical information included in a slot:

1. Relationship: A frame records its relationships to other frames. The frame Hotel room (Fig. 7.8) can be a member of the frame class Room, which in turn can belong to the class Housing, thus providing the relationship with these other room types.
2. Slot value: The value of a slot can be numeric, symbolic, or Boolean (True/False). For example, in a frame 'Person', the slot 'Name' is symbolic, while the slots 'Age' and 'Height' hold numeric values. Slot values can be assigned dynamically during a session with the expert system, or they can be static, initialized when the slot is created.
3. Default value of a slot: A slot may contain a default value, used when the true value is not available and there is no evidence that the chosen value leads to any contradiction. For example, in a frame named Car, when slot values are not provided, the default values of the slots wheels-count and engine-count can be taken as 4 and 1, respectively.
4. Slot value's range: The range of a slot's value is useful in checking the bounds of the slot value—whether the provided value is within the prescribed limits. For example, the pressure of a car tire may range over 30–50 psi (pounds per square inch).
5. Slot procedure: A slot may have a procedure attached to it; when this procedure is called, it may read the value of the slot, or it may update the slot (write a value into it).
6. Facets: Facets provide an extension to the slot-value structure in a frame-based expert system. A facet provides extended knowledge about an attribute of a frame. It can be used for establishing the attribute value, it can control end-user queries, and it can direct the inference engine on how to process the attributes.
It is assumed that there are no multiple-inheritance conflicts among these slots. The frames of the hierarchy communicate by means of messages. The frames make use of methods (processes), which get executed when a message is received by the frame in which the method is declared, or when the message is received by its subclasses or instances.
A domain knowledge-base builder uses frame languages to describe the types of objects to be modeled by the system, and this representation can be provided efficiently by the frame languages. The description of a certain type may contain a prototype description of objects of that type, which is used for creating a default description of an object when its type becomes known in the model.
The frame-based languages are good at providing facilities for describing object attributes. For example, a frame representing a car might include descriptions for the length of the car, the number of seats, the size of the engine, the engine's horsepower, etc. These attribute descriptions can include partial descriptions of the attribute values, and are helpful in preserving the semantic integrity of a system's knowledge base by restricting the number and range of permitted attribute values. Frame languages do not provide a facility for describing declarative behavior; however, they have facilities for attaching procedural information expressed in some other language, e.g., LISP. This procedural capability enables behavioral models of objects, such as that of an expert in an application domain. It also provides a powerful tool for object-oriented programming, where frames are treated as objects and respond to messages.
There are two standard forms for attaching procedures: (1) methods and (2) active values. In languages like LISP and Prolog, methods are procedures that are attached to frames and respond to messages sent to the frames. Just like attributes, the methods are stored as values in slots, where they are recognized as responders to messages. The messages sent to the frames carry information about the target message-responder slot; these messages contain the arguments needed by the methods stored in the slot. The "active values" in the slots are either procedures or collections of production rules attached to the slot. These procedures or rules are invoked when the slot's values are accessed or when new values are stored in the slot. Thus, these slots behave like "daemons", and monitor the change and usage of the values.
The active values can also be used to dynamically compute values on a “when-
needed” basis. The methods and active values are so written that they apply to any
member of a class of objects, and are included by the knowledge base designer in
the class description as a part of the prototype description of a class member.
Many systems use the approach discussed above to control reasoning: functions or rule classes that behave like daemons are attached to the slots of frames. The attachments get invoked when the value of a slot changes; that way, they behave like sensors, alarms, or monitors. For example, an expert system used as an intelligent alarm facility calls a user-supplied function only when a slot's value crosses a threshold. The user may establish an alarm by providing critical boundaries and an alarm function. The system stores the boundaries and function as facets of the slot, and may attach a generic active value that checks for boundary crossing whenever the value of the slot changes.
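A minimal sketch of such an active value in Python follows (the Frame class, slot names, and the 30–50 psi bounds are illustrative, echoing the tire-pressure example above; none of this is from the text):

# An "active value": a daemon attached to a slot that fires when a newly
# stored value crosses user-supplied critical boundaries.
class Frame:
    def __init__(self, name):
        self.name, self.slots, self.daemons = name, {}, {}

    def attach_alarm(self, slot, low, high, alarm):
        self.daemons[slot] = (low, high, alarm)   # stored as facets of the slot

    def put(self, slot, value):
        self.slots[slot] = value
        if slot in self.daemons:                  # the active value fires
            low, high, alarm = self.daemons[slot]
            if not (low <= value <= high):
                alarm(self.name, slot, value)

tyre = Frame("car-tyre")
tyre.attach_alarm("pressure", 30, 50,
                  lambda f, s, v: print("ALARM: %s.%s = %s" % (f, s, v)))
tyre.put("pressure", 42)    # within bounds, silent
tyre.put("pressure", 55)    # ALARM: car-tyre.pressure = 55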
An example of the above phenomenon can be found in a knowledge system (see Fig. 7.9) that is a satellite diagnostic system. The system is meant to serve as an intelligent facility for human operators to perform diagnosis and trouble-shooting of satellite malfunctions. Such systems can also be used as simulators to train operators and diagnostic experts. An important aspect of the diagnosis task is that it requires detailed analysis of the satellite system by a diverse set of domain experts. The software architecture of the prototype, which can be built using an expert system, makes use of daemons attached to the slots to respond to messages sent to objects. The prototypes of experts have the property that they can be instantiated as well as deleted dynamically, as needed, during the operation of the system.
Significant events can thus be monitored and controlled. This system makes elaborate and effective use of frames, including prototype expert frames, which, like daemons, can be instantiated and deleted dynamically as required during operation of the system. The use of the frame-based approach to build the system's model allowed the designers to organize the frames in such a way that the knowledge could easily be accessible and comprehensible to diagnostic experts as well as to the satellite operators [9].
A satellite diagnostic system, e.g., STAR-PLAN, is designed with several requirements that result in an architecture based on the integration of frames and production rules, with the following important features:

1. First, the system's knowledge is made accessible and comprehensible to the satellite operators as well as to diagnostic experts.
2. The system is incrementally and progressively built up, as the descriptions of additional satellite modules become ready and as the needed experts become available.

When a monitor wakes itself up after a fixed amount of time, the rules will be invoked by a
backward-chaining rule interpreter that tests a specific hypothesis (goal) about the
problem. The backward-chaining system attempts to find a sequence of rule applications that concludes the hypothesis. When such a sequence is found, the rules are applied, so that the hypothesis is added into the knowledge base.

Based on the conclusions arrived at using the rules, the monitor will either put itself to sleep again or make recommendations to the operator. When the monitor concludes that the problem has been solved, it removes its daemons and then removes (deletes) itself, freeing the memory it occupied.
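The hypothesis test performed by such a monitor is ordinary backward chaining, which can be sketched in a few lines (the rules and facts below are invented for illustration, not STAR-PLAN's actual knowledge base):

# Backward chaining: a hypothesis holds if it is a known fact, or if the
# premises of some rule concluding it can all be proved recursively.
rules = {"problem_solved": [["fault_isolated", "fault_repaired"]],
         "fault_isolated": [["telemetry_ok", "module_identified"]]}
facts = {"telemetry_ok", "module_identified", "fault_repaired"}

def prove(goal):
    if goal in facts:
        return True
    for premises in rules.get(goal, []):      # try each rule for the goal
        if all(prove(p) for p in premises):
            facts.add(goal)                   # add the hypothesis to the KB
            return True
    return False

print(prove("problem_solved"))   # True: the monitor may now delete itself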
Because monitors are created and deleted dynamically, the STAR-PLAN system model handles satellite problems much the way human operators and experts do. The situation is like that in the real world: when a problem is identified, a suitable expert is called in, who works with the team until the problem is resolved, and then the expert leaves! A monitor's rule sets are organized as a problem-specific knowledge base about the satellite's problems. The module-based system and its organizational structure also make it easier for a (human) domain expert to create and debug the knowledge base of rules.
7.5 Description Logic

A Description Logic (DL) models concepts, roles, individuals, and their relationships. The fundamental modeling concept of a DL is the axiom: "a logical statement relating roles and/or concepts". DL is a family of formal knowledge representation languages that is more expressive than propositional logic and has more efficient decision properties than first-order predicate logic. It is used for formal reasoning on the concepts of an application domain (known as terminological knowledge), and for providing a logical formalism for ontologies and the Semantic Web. Modern ontology languages, such as OWL (Web Ontology Language), are based on DL.
Atomic concepts are thus interpreted as subsets of the interpretation domain, while the semantics of the other constructs is specified by defining the set of individuals denoted by each construct. For example, the concept C ⊓ D is the set of individuals obtained by intersecting the sets of individuals denoted by C and D, respectively; an instance is Female ⊓ Teacher. Similarly, the interpretation of ∀R.C is the set of individuals that are in the relationship R only with individuals belonging to the set denoted by the concept C. For example, ∀ResidentsOfJodhpur.Students represents all the students who are residents of Jodhpur.
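Since this semantics is purely set-theoretic, it can be demonstrated over a small finite interpretation. The sketch below (my own illustration; the domain, concept extensions, and role extensions are all invented) evaluates the constructors as set operations:

# Concepts are sets of individuals, roles are sets of pairs; the DL
# constructors then reduce to set operations over the interpretation.
domain = {"ram", "sita", "luv"}
Female = {"sita"}
Teacher = {"sita", "ram"}
hasChild = {("sita", "luv"), ("ram", "luv")}

def AND(*concepts):                      # C ⊓ D: intersection
    return set.intersection(*concepts)

def FORALL(role, concept):               # ∀R.C: all R-successors lie in C
    return {x for x in domain
            if all(y in concept for (x2, y) in role if x2 == x)}

def ATLEAST(n, role):                    # ∃n.R: at least n R-successors
    return {x for x in domain
            if len({y for (x2, y) in role if x2 == x}) >= n}

print(AND(Female, Teacher))              # {'sita'}: Female ⊓ Teacher
print(ATLEAST(1, hasChild))              # {'sita', 'ram'}: ∃1.hasChild
print(FORALL(hasChild, Female))          # {'luv'}: no child outside Female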
There is also some ambiguity, since natural language is the source, where many nouns can be used to refer to a category as well as to a relation. For example, child can be used as a category, i.e., a very young person; it can also be used to represent a relation, which stands for the inverse of parent.
An important feature of Description Logic is the definition of complex concepts in terms of simpler ones. This is achieved by means of the concept-forming operators ∃, ∀, AND, and FILLS. Complex concepts are defined as follows [4, 10]:

Every atomic concept is a concept;
If R is a role and C is a concept, then ∀R.C is a concept;
If R is a role and n is a positive integer, then ∃n.R is a concept;
If R is a role and C is a constant, then FILLS R.C is a concept; and
If C1, . . . , Cn are concepts, then AND(C1, . . . , Cn) is a concept.
The expression ∃n.R stands for the class of individuals in the domain that are related by the relation R to at least n other individuals. The following complex concepts can be created:

∃1.Child: all the individuals (the class of individuals) that have at least one child.
∃2.HasCar: all the individuals that have at least two cars.
∃6.HasWheels: all the individuals that have at least six wheels.
FILLS R.C stands for those individuals that are related (R-related) to the individual identified by C. For example, "all the individuals that have the car with plate RJC12" is represented by FILLS HasCar.RJC12.

∀R.C stands for those individuals that are R-related only to individuals of class C. For example, ∀BeingInThisRoom.PHDStudents represents "all the individuals that are in this room and are Ph.D. students."
Concept ::= Thing | ConceptName
         | AND(Concept1, Concept2, . . .)
         | ∀RoleName.Concept
         | ≤ integer.RoleName
         | ≥ integer.RoleName
         | FILLS RoleName.IndividualName
         | SameAs(Path, Path)
         | ∃IndividualName.Concept | ⊥
Path ::= [RoleName, . . .]
For example, the conjunction of the concepts Hunter and Shooter may be written in predicate notation as Hunter(X) ∧ Shooter(X), as the compound concept Hunter&Shooter(X), or in DL notation as AND(Hunter, Shooter). Similarly, Person ⊓ ¬Female denotes the persons who are not female, and Female ⊓ Male denotes the individuals who are both female and male (an empty concept).
The key characteristic features of DLs lie in the constructs for creating relationships between concepts. The most common and elementary one is the value restriction, written ∀R.C, which means that all the individuals having relationship R with the concept being described belong to concept C. Similarly, ∃R.C is the corresponding existential restriction, requiring some such individual.
Example 7.3 Represent the concepts of the semantic network shown in Fig. 7.10 using DL.

[Fig. 7.10: a semantic network relating Female, Person, Woman, Parent, and Mother through is-a links, with a hasChild role from Parent to Person.]
Assume that the atomic concepts are Female, Person, and Woman, and that hasChild and hasFemaleRelative are atomic roles. We use the operators intersection, union, and complement of concepts, interpreted as set operations.
A characteristic feature of Description Logics is their ability to represent kinds of relationships, beyond IS-A relationships, that can hold between concepts. For example, in Fig. 7.10, the concept Parent has a property that is usually called a "role", expressed by a link from the concept to a node for the role, labeled hasChild. The role has a "value restriction", denoted by the label v/r, which expresses a limitation on the range of types of objects that can fill the role. In addition, the node has a number restriction expressed as (1, NIL), where the first number is a lower bound on the number of children and the second element is the upper bound, with NIL denoting no restriction on the upper limit. Overall, the representation of the concept Parent here can be read as "a parent is a person having at least one child, and all of his/her children are persons"; thus the role link between Parent and Person in Fig. 7.10 can be expressed as the DL concept expression

∃1.hasChild ⊓ ∀hasChild.Person.
We observe that there may be implicit relationships between concepts. For example, if we define Woman as the concept of a female person, it is the case that every Mother is a Woman. It is the task of the knowledge representation system to find implicit relationships such as these (many are more complex than this one). Typically, such inferences have been characterized in terms of properties of the network. In this case, one might observe that both Mother and Woman are connected to both Female and Person, but the path from Mother to Person includes the node Parent, which is more specific than Person, thus enabling us to conclude that Mother is more specific than Person.
[Figure: architecture of a DL-based knowledge representation system—a knowledge base accessed by rules and application programs.]
There are some important common assumptions usually made about DL termi-
nologies:
• Only one definition for a concept name is allowed;
• Definitions are acyclic in the sense that concepts are neither defined in terms of
themselves nor in terms of other concepts that indirectly refer to them.
The ABox comprises extensional knowledge about the domain of interest, that is, assertions about individuals, called membership assertions. For example,

(Female ⊓ Person)(sita)

states that the individual named sita is a female person. Given the definition of Woman, one can derive from this assertion that sita is an instance of the concept Woman. Similarly,

hasChild(sita, luv)

indicates that sita has luv as a child. Assertions of the first kind are also called concept assertions, while those of the second kind are called role assertions.
Let us now turn our attention to role restrictions by looking first at the quantified
role restrictions and, subsequently, at what we call “number restrictions.” Most lan-
guages provide (full) existential quantification and value restriction that allows one
to describe, for example, the concept of “individuals having a female child” as
∃hasChild.Female, and to describe the concept of “individuals all of whose
children are female” by the concept expression ∀hasChild.Female. In order to
distinguish the function of each concept in the relationship, the individual object that
corresponds to the second argument of the role viewed as a binary predicate is called
a role filler. In the above expressions, which describe the properties of Parents
having female children, individual objects belonging to the concept Female are the
fillers of the role hasChild.
Another important kind of role restriction is given by number restrictions, which
restrict the cardinality of the sets of fillers of roles. For instance, the concept
(≥ 3 hasChild) ⊓ (≤ 2 hasFemaleRelative)
represents the concept of “individuals having at least three children and at most
two female relatives.” Number restrictions are sometimes viewed as a distinguishing
feature of Description Logics, although one can find some similar constructs in some
database modeling languages (notably Entity-Relationship models).
Beyond the constructs to form concept expressions, Description Logics provide constructs for roles, which can, for example, establish role hierarchies. As an illustration of concept definition, a cheese pizza may be defined as

CheesePizza ≡ Pizza ⊓ (∃hasTopping.CheeseTopping) ⊓ (∃hasPizzaBase.PizzaBase),

together with the hierarchy assertion CheeseTopping ⊑ Topping.
7.6 Conceptual Dependencies

Consider the sentence

"The big man steals the red book from the girl."

The analysis of the sentence is as follows. The article 'The' serves to connect sentences in paragraphs, i.e., 'The' also indicates that 'man' might have been mentioned previously. The adjective 'big' refers to the concept 'big', which cannot stand independently. The concept 'man', however, can stand alone, but in the above sentence it is conceptually modified by 'big', and in the network it is realized as a governor with its dependent. The verb 'steals' is an action, which depends on a concept doing the acting. A conceptualization (a statement about a conceptual actor) cannot be complete without an acting concept (or an attribute statement). Thus, to complete a two-way dependency, a link must exist between 'man' and 'steal'.
This indicates that they are dependent on each other, and also govern each other, as shown in Fig. 7.12.
It is mandatory that every conceptualization has a two-way dependency link. In Fig. 7.12, the concept 'book' governs the concept 'red' attributively (color is an attribute), and the whole entity is placed as objectively dependent on 'steals'. The construction "from the girl" is realized as being dependent on the action through the conceptual object. This is a prepositional type of dependency (denoted by ⇐). There are different forms of this prepositional dependency, each of which is expressed by writing the preposition over the link that indicates the prepositional relationship. A language may use inflections, or nothing at all, instead of prepositions to indicate prepositional dependency. Here we discuss a language-free system, which represents the relation of the parts conceptually.
The CDs are intended to be used in reasoning with natural language constructs, independent of any particular language or its phrases. This has resulted in a small number of primitive actions, about 10–12, and a set of dependencies which connect the primitive actions with each other and with their actors, objects, instruments, etc.
The CDs have two main objectives:

1. If the meanings of any two sentences are the same, they should be represented by the same CD, regardless of the particular words used in them.
2. Any information that is present implicitly in a sentence should be represented explicitly in the CD.

For item 1 above, examples are 'get' and 'receive': both will have the same CD representation. For item 2, the machine must extract the implicit part from the sentence.
The CDs comprise:

1. a set of primitive actions,
2. a set of states representing the results of actions,
3. a set of dependencies, or conceptual relationships, which can exist between primitives, states, and objects.
The representation of English sentences could be constructed by joining together
the building blocks to form a CD graph. The CD provides four conceptualization
primitives using which the world of meaning is built. These are:
ACTs: Actions
PPs: Picture Producers (objects)
AAs: Action Aiders (they modify the actions)
PAs: Picture Aiders (they modify the objects)

All the actions are assumed to reduce to one or more of the following primitive ACTs [7]:

PROPEL: apply physical pressure to an object (push)
MOVE: move a body part by its owner
GRASP: grab an object by an actor (grasp)
ATRANS: transfer of a relationship (give)
PTRANS: transfer of the physical location of an object (go)
INGEST: ingestion of an object by an animal (eat)
EXPEL: expulsion from an animal's body (cry)
MTRANS: transfer of mental information (tell)
MBUILD: mentally making new information (decide)
ATTEND: focus a sense organ
CONC: conceptualize or think about an idea (think)
SPEAK: produce sound (say)
At the conceptual level, a CD framework is responsible for representing the meaning of a piece of written language in language-free terms. In a conceptual dependency analysis, the conceptualization is written on a straight line. The dependents written perpendicular to the line are attributes of their governor, except when they are part of another conceptualization line. Whole conceptualizations can relate to other conceptualizations as actors or attributes.
Each primitive has a set of slots associated with it, drawn from the set of conceptual dependencies. Associated with each slot are restrictions on what sorts of objects can appear in that slot. For example, the following are the slots for PTRANS:

ACTOR: a human or other animate object that initiates the PTRANS
OBJECT: a physical object that is moved (PTRANSed)
FROM: the location at which the PTRANS begins
TO: the location at which the PTRANS ends
Figure 7.13 shows the basic roles of conceptualization primitives of CD, and
Fig. 7.14 shows a general case of primitives along with an example.
Inference rules are written over these primitives to make explicit the information which is only implicit in English or any other language text. For example, using the primitive PTRANS, it is possible to infer about the OBJECT that was PTRANSed (physically transferred) that it was initially at the FROM location and, after the PTRANS is carried out, is at the TO location. The same inferences are made no matter what kind of PTRANS was involved: flying, driving, walking, falling, etc. A small sketch of such a rule is given below.
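As an illustration, the following minimal Python sketch encodes a PTRANS conceptualization with its four slots, and one rule that makes the implicit location information explicit. The class and function names are illustrative assumptions, not part of the CD theory itself.

# A minimal, illustrative encoding of a CD primitive with slots,
# and one inference rule over it (names are hypothetical).

from dataclasses import dataclass

@dataclass
class PTrans:
    actor: str    # animate object that initiates the transfer
    obj: str      # physical object that is moved
    frm: str      # location where the PTRANS begins
    to: str       # location where the PTRANS ends

def infer_locations(cd: PTrans) -> list[str]:
    """Make explicit what the PTRANS leaves implicit:
    the object was at FROM before, and is at TO after."""
    return [
        f"before: at({cd.obj}, {cd.frm})",
        f"after:  at({cd.obj}, {cd.to})",
    ]

# "Rajan went from home to the park": flying, driving, or walking
# would all yield the same inferences.
cd = PTrans(actor="Rajan", obj="Rajan", frm="home", to="park")
print(infer_locations(cd))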
[Figs. 7.13 and 7.14: Basic roles of the CD conceptualization primitives, with examples. The dependency R links the recipient and donor of an object in an action; "Rajan ran" is a PTRANS with Rajan as actor; "Rajan's dog" is a poss-by (possessed-by) dependency between dog and Rajan.]
The CD framework can be used for natural language parsing. A CD-based system analyzes sentences into their conceptual representation by operating on pieces of each sentence and looking up potential conceptual cases. All the conceptualizations are checked against a list of expressions to see if that part of the concept has occurred before. Concepts which have never occurred are meaningless in the system. Consider the sentence: "tall boy went to the park with the dog." Part of the parser output is shown in Fig. 7.15.
In this sentence, the problem is where to attach the concept "with dog": to the "park" or to the "tall boy"? The problem can be solved by conceptual semantics. The semantics for 'go' contains a list of conceptual prepositions. Under the preposition 'with' there is a description, "any movable physical object", and since the dog is a physical object, the dependency is applicable. As per the sentence, the parse tree (1) in Fig. 7.16 is allowed, while (2) is rejected.

[Fig. 7.16: The two candidate parse trees (1) and (2) for "tall boy went to the park with the dog".]
The parser tries to analyze a sentence in a way analogous to the human method. It handles the input one word at a time as it is encountered, checks potential linkages against its own knowledge of the world and past experience, and puts its output into a language-free formalism that can be acted on. The CD parser is thus a conceptual analyzer rather than a syntactic parser.

Consider the sentence: "big boy gives apples to pig". The input sentence is processed word by word, and parsed into a CD step by step, as shown in Fig. 7.17.
[Fig. 7.17: Step-by-step assembly of the CD for "big boy gives apples to pig"; the assembled conceptualization links boy (actor, modified by big) to give, apples (object), and pig (recipient, via to).]
Representing text in a canonical form has the benefit that inference can be performed over that text, because the canonical form allows inference rules to be written as generally as possible. If the representation does not capture the similarities in the meaning of the text, the rules about what can be inferred from a given text need to be duplicated, one rule for every form of the representation, even though the forms may have the same meaning relevant to the inference. To illustrate how inference is facilitated by CD, inferences are used to build a "causal chain" connecting the events in a story, as in the following.

Simon hit Anne. Anne's mother took Anne to the hospital. Anne's mother called Simon's mother. Simon's mother slapped Simon.
An inference system could draw many inferences from this story, e.g., 1. Simon's mother slapped Simon because she was angry at him for hitting Anne; 2. Anne was taken to the hospital because she was hurt; 3. Anne's mother called Simon's mother because she wanted to complain. Inferences are based on a set of rules, organized around inference categories. The total number of categories in this case is 16, some of which are:
1. Causative inferences: these are hypothesized as the possible causes or preconditions of actions.
2. Specification inferences: these fill in missing 'slots' in a CD primitive, such as the ACTOR or INSTRUMENT of an ACT.
3. Resultative inferences: these are inferred as the likely results of actions.
4. Function inferences: these are inferred as the likely functions of objects.
The inference rules are applied in an undirected fashion. When the system reads a sentence, the inference rules for it are applied automatically, without a goal such as building a causal chain of events. The process is thus fortuitous (having no cause or apparent cause): the inferences are applied to the new representation in an undirected fashion, which sometimes results in the confirmation of another representation.

Undirected inferences are analogous to the spontaneous inferences made by people, which sometimes seem uncontrollable; this is convincing, as one cannot learn new facts without inferring things about them. However, this undirected behavior leads to problems. When processing a story, an expert system based on CD would not know which inferences are most likely to lead to a coherent causal chain representing the story. Hence, it faces a combinatorial explosion in the number of inferences it must consider to build that causal chain. In other words, the expert system lacks the commonsense knowledge of which inferences are most likely to be relevant in a given situation. For another example,
Simon picked up the menu. He decided on fish.
[Figure: CD conceptualizations for the restaurant situation, e.g., Customer PTRANS Customer to table; Customer ATTEND EYES; Customer MOVE Customer to sitting position.]
7.6.3 Scripts
C PTRANS C to table
C MOVE C to sitting position

In the above, C denotes a customer.
The conceptual dependency networks and their derivatives are often grouped into the family of semantic net representations, although the CD is a content theory, whereas the semantic networks are a structure theory. The distinction between these two types of theories lies in their emphasis: the semantic nets theory is about how knowledge should be organized, i.e., there are nodes with arcs connecting them. There is also some general notion about the structure's semantics, i.e., how to interpret a particular semantic network, and, apart from the knowledge organization, a general notion about inheritance. Both of these points concern structural information. The semantic networks say nothing about what will be represented, e.g., what labels should be used for nodes and arcs, and which arcs to use where. It is left to the user to decide these details; what a semantic network specifies is the (structural) form that the representation will take.
The CD theory was an attempt to enumerate the types of nodes and arcs which could be used to build representations. Instead of specifying the structure of representations, the CD theory specifies the contents. Note that the basic conventions used for drawing CD graphs also specify structures, but these graphs are not really the essence of the theory. The essence of CD is in the primitives, and in the types (names) of dependencies which can link the primitives together.

The highlighted distinctions between CDs and semantic nets become clear if one thinks of implementing each of them in first-order predicate logic (FOPL). For CDs, it is not difficult to imagine that the primitives (ACTs, dependencies, and states) would specify a set of predicates to be used when writing the predicate calculus statements that represent sentences. Some translation conventions would need to be adopted in order to make all assertions of FOPL type, but these would be quite straightforward.
But it does not make sense to implement a semantic net in FOPL, as the two representations are in competition with each other: each provides a different syntax for distinguishing between predicates, arguments, and relations. There would be nothing left of semantic nets if they were translated into predicate logic. Putting it another way, the semantic nets would not add anything to FOPL. This is in contrast to CD, which adds the CD primitives to the predicates used, as and when needed.
7.7 Summary
Exercises
11. Suppose Tom is unlike most cats and does not eat fish. How could one deal with
this in the semantic network?
12. Formulate solutions as to how semantic networks can be used in the following cases:
a. Natural language understanding
b. Information retrieval
c. Natural language translation
d. Learning systems
e. Computer vision
f. Speech generation system
13. “The inferencing in semantic networks makes use of unification, chaining, modus ponens, and resolution.” Justify each, using a suitable example.
14. Explain, using semantic networks, how we can map an object’s perception to
concepts, and identify these concepts. Give examples.
15. How do semantic networks help in understanding the meaning of words in natural language sentences? Explain.
16. Represent the following as a series of frames:
Dogs are Mammals        Birds have Wings
Mammals are Animals     Bats have Wings
Birds are Animals       Bats are Mammals
Fish are Animals        Dogs chase Cats
Worms are Animals       Cats eat Fish
Cats are Mammals        Birds eat Worms
Cats have Fur           Fish eat Worms
17. Express the following sentences in Description Logic:
a. All employees are humans.
b. A mother is a female who has a child.
c. A parent is a mother or a father.
d. A grandmother is a mother who has a child who is a parent.
e. Only humans have children that are humans.
18. Translate the logic expressed in Fig. 7.10 into DL.
19. Select one or more answers from the following. Also, justify the answer(s)
selected by you.
a. What type of reasoning is performed using semantic networks?
(A) Deductive (B) Default
(C) Inductive (D) Abductive
(E) Hierarchical
b. In Description Logic, the domain is always:
(A) Open world (B) Closed world
(C) Depends on the domain used (D) None of the above
References
1. Collins AM, Quillian MR (1969) Retrieval time from semantic memory. J Verbal Learn Verbal
Behav 8(2):240–247
2. Deliyanni A, Kowalski RA (1979) Logic and semantic networks. Commun ACM 22(3):184–
192
3. Faucher C (2001) Easy definition of new facets in the frame-based language Objlog+. Data
Knowl Eng 38:223–263
4. Harmelen FV et al (2008) Handbook of knowledge representation. Elsevier, pp 213–237
5. https://www.inf.unibz.it/~franconi/dl/course/dlhb/dlhb-02.pdf. Accessed 19 Dec 2017
6. http://www.jfsowa.com/pubs/semnet.htm. Accessed 12 Feb 2018
7. Lytinen SL (1992) Conceptual dependency and its descendants. Comput Math Appl 23(2–
5):51–73 Pergamon Press
8. Minsky M (1974) A framework for representing knowledge. MIT-AI Laboratory Memo-306
9. Fikes R, Kehler T (1985) The role of frame-based representation in reasoning. Commun ACM 28(9):904–920
10. Quillian MR (1967) Word concepts: a theory and simulation of some basic semantic capabilities.
Behav Sci 12(5):410–443
11. Quillian MR (1968) Semantic information processing. Cambridge, Mass., MIT Press, pp 216–
270
12. Simmons RF (1973) Semantic networks: their computation and use for understanding English
sentences. In: Schank RC, Colby KM (eds) Computer models of thought and language. W.H.
Freeman and Co, San Francisco, CA
13. Simmons RF, Chester D (1977) Inferences in quantified semantic networks. Proceedings of the
fifth international joint conference on artificial intelligence. MIT, pp 267–273
14. Sowa J (1976) Conceptual graphs for a data base interface. IBM J Res Develop 336–355
Chapter 8
State Space Search
Abstract State space search is one of the three fundamental requirements for achieving AI. This chapter presents the basic techniques, called uninformed search, for searching for the goal during problem solving. A search requires the representation of the state space in the form of a directed graph. The basic methods, depth-first search (DFS) and breadth-first search (BFS), are presented along with their algorithms, the analysis of these algorithms, and worked-out exercises. The improved techniques, iterative deepening DFS and bidirectional search, follow, together with a chapter summary and a large number of practice exercises at the chapter end.
8.1 Introduction
The search in AI differs from searching in conventional trees and graphs in the sense that there the space limits are fixed, i.e., the number of vertices and the edges connecting them are already known. In AI search problems, however, the alternative moves are generated from each vertex, as in a chess game, so the search tree must be generated and searched simultaneously. Hence, one cannot determine the size of the tree to be searched in advance.
Learning Outcomes of this Chapter:
1. Formulate an efficient problem space for a problem expressed in natural language
(e.g., English) in terms of initial and goal states, and operators. [Usage]
2. Describe the problem of combinatorial explosion of search space, search time,
and its consequences. [Familiarity]
3. Select and implement an appropriate uninformed search algorithm for a problem,
and characterize its time and space complexities. [Usage]
The search space is generally represented as a directed graph (in fact, a tree), with vertices as states and edges of the graph as transitions, i.e., moves to explore in the search for the goal state. The initial configuration of the problem description is the start state, at which we apply the specified rules to generate new states (vertices). Thus, it becomes similar to a tree, with the start configuration as the root node, then interior nodes, and finally child nodes at the lowest level having no further children. To understand state space search better, we consider the following example of a puzzle game [5]. The 8-puzzle game is shown in Fig. 8.1, with its initial configuration, goal configuration, and the transitions (i.e., moves) possible from each state. We treat the configuration of the game as the collective status of the tiles on the board, and call this configuration a state.
Using the allowed moves of a numbered tile (1-8), i.e., left, right, up, or down into the adjacent blank position, we generate new states A, B, C, ..., starting from the start state (S). As each state is generated, it is checked against the goal state. If it matches, we terminate the process and declare that the goal is reached. If not, the generated states are further expanded by applying the generation rules (moves), until the goal is reached. If we carry on expanding all the states generated, the goal state shown in the figure, with the tiles in reverse order from the first to the last row, will ultimately be reached, of course after a large number of transitions. A transition diagram like this is called a graph, with each configuration as a node and each move as an edge; finding the solution, i.e., the goal, is nothing but searching the graph by traversal for the goal state. A sketch of successor generation for the 8-puzzle is given below.
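As a rough illustration of the generation rules, the following Python sketch produces the successor states of an 8-puzzle configuration; representing a state as a flat tuple with 0 for the blank is an assumption made here for compactness.

# Sketch: generating successor states of the 8-puzzle. A state is a
# tuple of 9 entries, row by row, with 0 standing for the blank.

def successors(state):
    b = state.index(0)                  # position of the blank
    row, col = divmod(b, 3)
    moves = []
    if row > 0: moves.append(b - 3)     # slide the tile above down
    if row < 2: moves.append(b + 3)     # slide the tile below up
    if col > 0: moves.append(b - 1)     # slide the left tile right
    if col < 2: moves.append(b + 1)     # slide the right tile left
    children = []
    for t in moves:
        s = list(state)
        s[b], s[t] = s[t], s[b]         # swap the tile into the blank
        children.append(tuple(s))
    return children

start = (1, 2, 3, 4, 5, 6, 7, 8, 0)
for child in successors(start):
    print(child)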
There are many possible orders in which to generate the new child nodes. For example, one may generate the children at level k + 1 only after all the nodes at level k have been expanded; this is called BFS (breadth-first search). Alternatively, one may expand a node to its full depth before touching its siblings; this is DFS (depth-first search). The other orders are variants of DFS and BFS.

[Fig. 8.1: The 8-puzzle. The start state S has the tiles 1-8 in order with the blank at the bottom right; the goal state has the tiles in reverse order (8 7 6 / 5 4 3 / 2 1 blank). Moving a tile into the blank position generates the successor states A, B, C, D, ...]
Search strategies are evaluated along the following dimensions:

• Completeness: Does the strategy always find a solution if one exists?
• Optimality: Does it always find a least-cost solution, i.e., one with the minimum number of transitions?
• Time complexity: How many nodes are generated? Duplicate nodes generated due to multiple paths are also counted.
• Space complexity: What is the maximum number of nodes in memory at any time?
Let there be a directed graph G = (V, E), where V is a set of vertices and E is a set of ordered pairs (u, v), called edges, such that u is the tail and v the head of an edge. In an undirected graph, the edges are unordered pairs of vertices, and there is no concept of fixed heads and tails of edges. Graphs form abstractions for many problems, e.g., circuits in electrical engineering, molecular structures in chemistry, relations in social structures, and modeling in, for example, online sales. Due to their enormous range of applications, it is important that there be efficient algorithms for answering graph-theoretical questions.
From the previous section, it is clear that the process of searching can be automated by applying the rules of moves to cause transitions from one state to the next, until the goal state is reached. However, in the process, a state should not repeat in a path, as, for example, in v1 → v2 → v3 → v1 → v4 → v5 → vg. Such repeated vertices form a loop, and we would never reach the goal state. Since the previously visited vertex v1 is repeated, the path may instead be taken as v1 → v4 → v5 → vg.

Let us try to find the number of states to be visited in the worst case to reach the goal state in the search required for Fig. 8.1. Counting all the unique states generated (shown in the figure as A, B, C, ..., goal) on paths to the goal, in the worst case it is 9! = 362880.
If we consider a popular board game like chess, for a particular strategy of the opponent we definitely have some winning strategy, represented by a configuration. Our aim is to reach that configuration from the starting configuration step by step, where each step is a move by us followed by a move by the opponent. Each move transforms the chessboard from one configuration to another.
Example 8.2 Consider the graph shown in Fig. 8.2 for a small problem, where it is required to reach the goal state E from the start state A.

When the graph shown in Fig. 8.2a is searched from the start state A, the states generated are shown in the tree in Fig. 8.2b. Since the objective of search is to reach the goal, a search process terminates the moment the goal state is reached through any path. The tree in Fig. 8.2b shows the total possible states generated in the worst case, many of them repeated of course, but with no vertex repeated within a path.

Thus, we can conclude that problem solution through a search process is a tree search (even if the original structure is a graph) with specified start and goal nodes. Tree search can end up repeatedly visiting the same nodes unless it keeps track of all nodes visited. But this could take vast amounts of memory, more than most computers have, even for a small graph of, say, 50 nodes.
Complexity of the Travelling Salesman Problem
The travelling salesman problem (TSP) requires finding a best path covering all the nodes in a graph, with given start and destination nodes. Consider that the number of nodes is n, and assume there is a path from every node to every other node. All the possible paths are then the permutations of these n nodes. Given the first node, traversing all the nodes and returning to the first can be done in n! ways. This gives a worst-case complexity of O(n!).
[Fig. 8.2: (a) A weighted graph with vertices A-E, start node A and goal node E; (b) the search tree obtained when the graph is searched from A.]
For example, for the graph shown in Fig. 8.2, the set of possible paths from start node A to destination node E is {ABCE, ABDE, ACBDE, ACE}. Possible applications of the TSP include distribution networks, VLSI design, and other optimization problems. A brute-force enumeration along these lines is sketched below.
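A minimal sketch of this exhaustive O(n!) enumeration, assuming a small hypothetical distance matrix, is:

# Sketch: brute-force TSP by enumerating all tours from a fixed
# start city; workable only for very small n, illustrating O(n!).
from itertools import permutations

def tsp_brute_force(dist, start=0):
    n = len(dist)
    best_tour, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        if perm[0] != start:            # fix the start city
            continue
        tour = perm + (start,)          # return to the start
        cost = sum(dist[tour[i]][tour[i + 1]] for i in range(n))
        if cost < best_cost:
            best_tour, best_cost = tour, cost
    return best_tour, best_cost

# 4 hypothetical cities with symmetric distances
d = [[0, 1, 5, 4],
     [1, 0, 2, 6],
     [5, 2, 0, 3],
     [4, 6, 3, 0]]
print(tsp_brute_force(d))   # ((0, 1, 2, 3, 0), 10)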
Uninformed search is also called blind search, because we do not know which moves will lead to the goal faster. Since all the nodes may need to be searched, the search is called exhaustive search. The two important approaches to searching the entire state space are breadth-first search (BFS) and depth-first search (DFS); the other approaches that exist are variants of DFS and BFS. For example, depth-limited search and iterative deepening DFS are variants of DFS, while bidirectional search may be based on DFS as well as BFS.
To carry out BFS on a graph, or to solve a puzzle like the one shown in Fig. 8.1, the root node is first checked to see whether it is the goal; if yes, the process terminates with the declaration that the goal is found. If not, the children of the root node are generated and each is tested for the goal. Then further children are generated for each of them, and so on, until the goal is reached or the search is terminated. In the latter case, no new state can be generated on any search path. The BFS process is indicated in Fig. 8.3, where the dotted trace shows the order in which nodes are tested for the goal [2].
BFS can be implemented using a queue data structure, named here List. The front node of the queue is denoted List.Head. The method is shown as Algorithm 8.1, which checks all the paths of a given length before testing the paths of longer length. The inputs to this algorithm are the graph G, the start node S, and the goal node Goal. If the goal is reached, it returns success. If, after searching all the vertices (indicated by an empty List), the goal is still not found, the algorithm returns fail and exits.
If the BFS Algorithm 8.1 is applied in generate-and-search mode, a tree like the one shown in Fig. 8.3 gets constructed, and in the worst case this entire tree needs to be searched to find the goal state. The BFS order of this tree is S, A, B, C, D, E, F, G, H. Though we certainly locate the goal if it exists, the path to the goal from the start state is not remembered by Algorithm 8.1. This, however, can be arranged in the following way. In the queue data structure used for implementing BFS, each entry is stored as a pair (x, y), where x is the next child node to be explored and y is its parent node. The value y = 0 represents the root node. We call this queue the open-list, i.e., the nodes which have not yet been fully explored. When a node is deleted from the front of the open-list, we add it to a separate list, called the closed-list.
Table 8.1 Trace of state space search (BFS) for Fig. 8.3

Open-list                                         Closed-list
[(S,0)]                                           []
[(A,S), (B,S), (C,S)]                             [(S,0)]
[(B,S), (C,S), (D,A), (E,A)]                      [(S,0), (A,S)]
[(C,S), (D,A), (E,A), (F,B), (G,B)]               [(S,0), (A,S), (B,S)]
[(D,A), (E,A), (F,B), (G,B), (G,C), (H,C)]        [(S,0), (A,S), (B,S), (C,S)]
[(E,A), (F,B), (G,B), (G,C), (H,C)]               [(S,0), (A,S), (B,S), (C,S), (D,A)]
[(F,B), (G,B), (G,C), (H,C)]                      [(S,0), (A,S), (B,S), (C,S), (D,A), (E,A)]
[(G,B), (G,C), (H,C)]                             [(S,0), (A,S), (B,S), (C,S), (D,A), (E,A), (F,B)]
[Fig. 8.3: BFS tree with root S, children A, B, C at depth 1 and D, E, F, G, H at depth 2; the dotted trace shows the BFS order S, A, B, C, D, E, F, G, H.]
The entries in Table 8.1 show the search operations for finding the goal node G. Each row of the open-list is constructed by a deletion or append operation on the previous contents of the queue data structure [4].

The search process terminates when the goal node G is encountered as the next node in the open-list. After reaching the goal node, we can backtrack to find the path from the goal node to the start: "(G, B), (B, S)". Another path is "(G, C), (C, S)". We take the first, or, if path lengths are given, the shorter one. A sketch of this open/closed-list bookkeeping appears below.
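The following Python sketch captures the spirit of this bookkeeping (it is not the book's Algorithm 8.1 verbatim): an open-list queue of (node, parent) pairs, a closed-list of expanded nodes, and backtracking through parents to recover the path. The small adjacency list reproduces the parent structure of Table 8.1.

# Sketch of BFS with (node, parent) pairs in an open-list (queue)
# and a closed-list, so the goal path can be recovered by backtracking.
from collections import deque

def bfs(graph, start, goal):
    open_list = deque([(start, None)])
    closed = {}                           # node -> parent
    while open_list:
        node, parent = open_list.popleft()
        if node in closed:
            continue                      # already expanded
        closed[node] = parent
        if node == goal:                  # backtrack to build the path
            path = [node]
            while closed[path[-1]] is not None:
                path.append(closed[path[-1]])
            return list(reversed(path))
        for child in graph.get(node, []):
            if child not in closed:
                open_list.append((child, node))
    return None                           # goal not found

g = {"S": ["A", "B", "C"], "A": ["D", "E"],
     "B": ["F", "G"], "C": ["G", "H"]}
print(bfs(g, "S", "G"))   # ['S', 'B', 'G']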
Since BFS always explores shallow nodes before deeper ones, BFS finds the shallowest path leading to the goal state. A modified and improved search can be obtained by always expanding the lowest-cost node n; this is called uniform cost search. It modifies BFS so that nodes are expanded in order of non-decreasing path cost, hence the term uniform. Since a BFS algorithm is bound to find a goal if one exists, BFS is a complete inference system, as well as optimal (it finds the shortest path, as nearer nodes are searched before farther ones). A sketch of uniform cost search follows.
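A minimal sketch of uniform cost search, using a priority queue keyed on the path cost g(n); the weighted graph is hypothetical.

# Sketch of uniform cost search: BFS modified to always expand the
# open node with the lowest path cost g(n), via a priority queue.
import heapq

def uniform_cost(graph, start, goal):
    frontier = [(0, start, [start])]      # (g-cost, node, path)
    expanded = set()
    while frontier:
        g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if node in expanded:
            continue
        expanded.add(node)
        for child, w in graph.get(node, []):
            if child not in expanded:
                heapq.heappush(frontier, (g + w, child, path + [child]))
    return None

# Hypothetical weighted graph: (neighbor, edge cost) pairs
g = {"S": [("A", 1), ("B", 4)], "A": [("B", 2), ("C", 5)],
     "B": [("C", 1)]}
print(uniform_cost(g, "S", "C"))   # (4, ['S', 'A', 'B', 'C'])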
[Fig. 8.4: A search tree over the nodes S, A, B, C, D, E, F, G, H, with the trace showing the depth-first order in which the nodes are visited.]
Consider Figs. 8.3 and 8.4, and assume that the goal node is E. Using the DFS method, only four steps are needed to reach node E, whereas six steps are required if the BFS method is used. Thus, for a deeper goal node, DFS is considered better. If the node to be searched were C, BFS would require four steps, while DFS, which searches deeper nodes first and shallower ones later, needs eight comparisons in total. Thus, which approach is best depends on the position of the goal node. Also, if the branching factor b (the number of branches per node) is large, DFS is better suited and BFS is worse. Hence, the efficiency of search depends on the structure of the tree, the search method used, and the branching factor [2].
Let us assume that the tree constructed in a search has depth d, and that each node generates b children, b being called the branching factor of the tree. For a BFS tree, the worst-case time spent in any search is the maximum number of nodes visited to determine the goal; this is the worst-case time complexity of the search. The maximum number of nodes in the open-list at any one time determines the space complexity of the algorithm.

For a BFS search, the total number of nodes visited in a tree of depth d is

1 + b + b^2 + ... + b^d = O(b^d),    (8.1)

which is the worst-case time complexity, and the maximum number of nodes in the open-list occurs at the lowest level of the tree. Thus, the space complexity is O(b^d). Hence, in the case of BFS, both time and space complexity are O(b^d).
For a DFS search also, in the worst case all the nodes must be visited. Hence, the time complexity is the same as for BFS, equal to O(b^d). Since DFS needs to store only b nodes per level down to a depth of d, the total number of nodes to be stored in memory is b × d, hence the space complexity is O(bd). Usually the branching factor b is much smaller than d, which reduces the space complexity to O(d).
In BFS, since all the nodes at a given depth are stored in order to generate the nodes at the next depth, the maximum number of nodes that must be stored to search to depth d is b^d, hence the space complexity is O(b^d). As with time, the average-case space complexity is roughly one-half of this, which is still O(b^d). This space requirement of breadth-first search is its most critical drawback. As a practical matter, a breadth-first search of most problem spaces will exhaust the available memory long before an appreciable amount of time is used.

[Fig. 8.5: A chain of vertices v1, v2, ..., vn in which each consecutive pair of vertices is joined by two distinct edges, so that the edge branching factor is two while the node branching factor is one.]
Depth-first search avoids the memory limitation of BFS. It works by always generating a descendant of the most recently expanded node, until some depth cutoff is reached, and then backtracking to the next most recently expanded node and generating one of its descendants. Therefore, only the path of nodes from the initial node to the current node must be stored in order to execute the algorithm. If the depth cutoff is d, the space required by DFS is only O(d).
Since depth-first search needs to store only the current path at any given depth, it is bound to search all the paths down to the specified cutoff depth. A new parameter, called the edge branching factor (e), is defined to analyse its time complexity:

e = the average number of different operators applicable to a given state (i.e., node).

In the case of trees, the edge branching factor and the node branching factor are equal. This is because the number of branches out of the tail node v of an edge (u, v) is the same as the number of branches out of the edge (u, v) itself. But for graphs, the edge branching factor may exceed the node branching factor. For example, the graph shown in Fig. 8.5 has an edge branching factor of two, while its node branching factor is only one. Accordingly, the worst-case number of paths from node v1 to node vn is 2^(n-1), i.e., the worst-case time in which one can traverse the tree rooted at v1 is O(2^(n-1)), which equals O(2^n), an exponential time complexity!
We know that BFS traverses a graph by visiting nearer nodes before farther nodes; hence, when the graph of Fig. 8.5 is traversed by BFS, its nodes are visited in time linear in the number of nodes. DFS, however, which goes deeper and deeper by chaining the edges, may in general visit many nearer nodes far later than farther ones. Considering Fig. 8.5 for DFS, since it chains the edges up to the maximum depth, the worst-case number of paths is 2^(n-1), giving a time complexity of O(2^n), which is exponential. In general, the time complexity of a depth-first search for depth d and edge branching factor e is O(e^d). Note that since the space used by depth-first search grows only as the log of the time required, the algorithm is actually time-bound rather than space-bound. This is because the time is e^d, while the space is proportional to log(e^d) = d log e, i.e., O(d).
Consequently, for a system in which state repetition is possible (see Fig. 8.5), every generated state must be stored in a table, and every newly generated state must be looked up to check whether it has been generated before. In principle, this table lookup can be expensive, but an efficient approach can perform it in time logarithmic in the total number of states, or better. A simple way of doing this is a discrimination tree, where previously seen states are stored in the leaves of the tree, whose non-terminal nodes are labeled with discriminations.
Another drawback of DFS is that it requires an arbitrary cutoff depth. During the search, if branches are not cut off and duplicates are not checked, the algorithm may run forever. Generally, the depth at which the first goal appears is not known in advance; hence, if the cutoff is set too low, the algorithm may terminate without finding the goal in spite of its existence. If the goal estimate is set too deep, the algorithm may spend too much time before it reaches the goal or terminates without finding it.
It is to be noted that DFS is not good at finding the goal if the goal is on the opposite side of the tree being searched, even when the goal is shallow. So, often a depth-limited DFS is performed, i.e., searching the space to a predetermined depth only. A method which combines the advantages of both BFS and DFS is iterative deepening DFS, discussed next.
Depth-first iterative deepening (DFID) reaches shallow goals much faster than ordinary DFS. It first sets the tree depth d = 0 and performs DFS, thus checking the root node for the goal. It then discards all the nodes generated in the previous search, starts over, and does a depth-first search for d = 1. Next, it starts over and does a DFS for d = 2, and so on. In ordinary DFS, the goal node C (Fig. 8.6) would be found after 8 comparisons, whereas using iterative deepening DFS, the search is first carried out for tree depth d = 0, and then at d = 1 the node C gets located, requiring 1 + 4 = 5 comparisons in total [3].
The standard BFS and DFS algorithms both have serious limitations, which are overcome by the DFID algorithm. Iterative deepening is, however, a general technique, and can be applied to unidirectional search, bidirectional search, and heuristic searches like A∗, which we will discuss in Chap. 9.

Since DFID expands all the nodes at a given depth before going deeper, it is guaranteed to find a shallowest solution. The disadvantage of DFID is that it performs wasted computation before reaching the goal depth; in fact, at first glance it seems very inefficient.
However, this wasted computation does not affect the asymptotic growth of the run time for exponential searches. The intuitive reason is that almost all the work is done at the deepest level of the search.

[Fig. 8.6: A search tree with nodes A, B, C at depth 1 and D, E, F, G, K at depth 2.]
The worst-case complexities of this method remain the same as those of ordinary DFS; however, the average case improves. This is because, in the worst case, all nodes still need to be compared, taking time O(b^d), while the worst-case space requirement is only O(d). A sketch of DFID follows.
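A minimal recursive sketch of DFID over a small assumed adjacency-list graph; the cap max_depth is an assumption to keep the sketch terminating on graphs without the goal.

# Sketch of depth-first iterative deepening (DFID): repeated
# depth-limited DFS with the cutoff increased by one each round.
def depth_limited(graph, node, goal, limit, path):
    if node == goal:
        return path
    if limit == 0:
        return None
    for child in graph.get(node, []):
        if child not in path:             # avoid loops along the path
            found = depth_limited(graph, child, goal, limit - 1, path + [child])
            if found:
                return found
    return None

def dfid(graph, start, goal, max_depth=50):
    for limit in range(max_depth + 1):    # limit = 0, 1, 2, ...
        found = depth_limited(graph, start, goal, limit, [start])
        if found:
            return found
    return None

g = {"S": ["A", "B", "C"], "A": ["D", "E"], "B": ["F", "G"]}
print(dfid(g, "S", "G"))   # ['S', 'B', 'G']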
If the search is carried out in both directions, it can be sped up. Bidirectional search sacrifices space for time, i.e., it gains in time complexity at the cost of space complexity. The technique searches forward from the initial state and backward from the goal state, with both searches carried out together. The states generated are stored until a state common to both searches is found.

Consider a bidirectional search carried out with depth d and branching factor b. If each side progresses to the same depth, the search required from each side, using for example iterative deepening DFS, takes O(b^(d/2)) time and O(b × d/2) space. When the two opposite sides are combined, this becomes 2 × O(b^(d/2)) = O(b^(d/2)) for time, and 2 × O(b × d/2) = O(bd), which is O(d) for fixed b, for space.
Depth-first iterative deepening (DFID) can be applied to bidirectional search as follows. A single iteration consists of a DFS from one direction to a depth k, storing the states at depth k. Two DFS searches are then run from the other direction, one to depth k and the other to depth k + 1; these do not store states but simply match against the states stored from the first direction. Note that the search to depth k + 1 is necessary to find odd-length solutions. The process is repeated from k = 0 (solution at depth zero) to k = d/2. If some hashing scheme is assumed to perform the matching in constant time per node, the bidirectional algorithm finds the optimal solution of total depth d in time O(b^(d/2)) and space O(b^(d/2)). A sketch of the meeting-in-the-middle idea follows.
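The following sketch illustrates the meeting-in-the-middle idea with two alternating BFS frontiers over an undirected (hence trivially invertible) graph; it is a simplification of the DFID-based scheme described above, and the graph is hypothetical.

# Sketch of bidirectional search: alternate BFS layers from the start
# and from the goal, stopping when the two frontiers meet. Returns the
# number of frontier layers expanded, i.e., the length of the
# connecting path in this simple undirected setting.
def bidirectional(graph, start, goal):
    if start == goal:
        return 0
    front, back = {start}, {goal}
    seen_f, seen_b = {start}, {goal}
    depth = 0
    while front and back:
        depth += 1
        if len(front) > len(back):        # expand the smaller frontier
            front, back = back, front
            seen_f, seen_b = seen_b, seen_f
        nxt = set()
        for node in front:
            for child in graph.get(node, []):
                if child in seen_b:       # the frontiers meet
                    return depth
                if child not in seen_f:
                    seen_f.add(child)
                    nxt.add(child)
        front = nxt
    return None

g = {"S": ["A", "B"], "A": ["S", "C"], "B": ["S", "D"],
     "C": ["A", "G"], "D": ["B"], "G": ["C"]}
print(bidirectional(g, "S", "G"))   # 3: layers expanded until meeting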
However, for bidirectional search to be implemented, invertible functions must be available for generating nodes, and these do not exist in every problem. For example, to search for a descendant in a tree of many generations, one can generate the children of each parent, and if a path leads from a forefather node x to a descendant node y, we can say that y is a descendant of x. Similarly, we can travel in the tree backward, from a descendant y to its parents, then to their parents, and so on.
We consider first the depth-first search algorithm, as it has the minimal space requirements. DFS can be implemented using a last-in first-out (LIFO) stack data structure. Nodes are stored on, and taken off from, the top of the stack. Every time a node is taken from the top of the stack, it is expanded, and the resulting children are placed on the top of the stack. When a recursive implementation is used, every time a node is generated, DFS is called recursively on each of the child nodes. The memory requirement of DFS, in both the explicit-stack and the recursive implementation, is only linear in the maximum search depth, and is generally not of much significance. A stack-based sketch is given below.
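A minimal sketch of the explicit-stack implementation; keeping the whole path in each stack entry is an assumption made here so that the result can be printed directly, at some cost in memory.

# Sketch of DFS with an explicit LIFO stack; children of the node
# taken from the top are pushed back onto the top of the stack.
def dfs(graph, start, goal):
    stack = [(start, [start])]            # (node, path from start)
    while stack:
        node, path = stack.pop()          # take from the top
        if node == goal:
            return path
        for child in graph.get(node, []):
            if child not in path:         # no repeats along a path
                stack.append((child, path + [child]))
    return None

g = {"S": ["A", "B"], "A": ["C", "D"], "B": ["E"], "E": ["G"]}
print(dfs(g, "S", "G"))   # ['S', 'B', 'E', 'G']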
DFS has a number of drawbacks; e.g., in a graph with cycles, a pure DFS may not terminate, and even in a tree, the first solution found by DFS is not necessarily optimal. Both of these deficiencies are remedied by the depth-first iterative deepening (DFID) method. This method performs a series of depth-first searches, to successively greater depths, until the goal state is found. At that point, the current node stack is the solution, with the topmost node as the goal and the bottom-to-top sequence of nodes as the path from the start node to the goal node. If the recursive procedure is used, the recursive call stack contains the solution. The space complexity of DFID is linear, as in the case of DFS.
Another drawback of DFS cannot be so easily rectified. In a graph having multiple paths to the same state, DFID, like every DFS variant, can generate far more nodes than the number of actual states. This is due to the inability of DFS and DFID to detect duplicate nodes which correspond to the same states. As an example, consider a grid graph, where each node has four adjacent nodes (north, east, south, and west) as its neighbors.1 An efficient DFS would keep track of the operator used to generate a node from its parent, and would not apply its inverse when generating the node's children. This reduces the branching factor from four to three. Accordingly, when searching within a radius r of this graph, DFS would generate O(3^r) nodes. The large majority of these nodes are duplicates, because there are only O(r^2) unique states within a radius r. For example, applying East followed by North generates the same state as applying North followed by East. Hence, DFS can generate exponentially more nodes than there are actual states in the problem space [1].
As a general solution, a BFS-style search algorithm stores all the generated nodes and compares each newly generated node with the stored nodes to detect duplicates. BFS maintains two node lists: a Closed-list of nodes that have been expanded, and an Open-list of nodes that have been generated but not yet expanded. The Open-list is managed as a first-in first-out (FIFO) queue. If the head node of the Open-list is not the goal state, it is removed from the Open-list, expanded, marked as closed, and the children generated by the expansion are added to the tail of the Open-list (i.e., appended).
Apart from the above operations, as each node is generated, it is also added to a hash table if it is not already there. This storage in a hash table is needed to check quickly, in the future, whether the same node is reached again (through a different path). In an implementation, every generated node is typically stored in the hash table; the Open-list consists of pointers into the hash table, and membership of the Closed-list is indicated by a flag in the hash table entry. The main drawback of this method is the memory required to store the nodes, which is exponential. A sketch follows.
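A small sketch of this arrangement, with a Python dict standing in for the hash table; each entry holds a parent pointer and the closed flag, and the open-list holds keys into the table. The field names are assumptions for illustration.

# Sketch: a hash table (dict) as the single node store. Each entry
# keeps the parent pointer and a 'closed' flag; the open-list holds
# keys into this table, as described above.
from collections import deque

table = {}        # state -> {"parent": ..., "closed": bool}
open_list = deque()

def add_node(state, parent):
    if state not in table:                # duplicate check, O(1) average
        table[state] = {"parent": parent, "closed": False}
        open_list.append(state)

add_node("S", None)
add_node("A", "S")
add_node("A", "B")                        # duplicate: ignored
node = open_list.popleft()
table[node]["closed"] = True              # mark the node as expanded
print(table)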
Given a problem to be solved by AI search, we need to formulate its solution through a state space search. This requires identifying the start and goal states, which may be explicitly specified in the problem; otherwise we need to derive them from the problem specification. The next step is to find the possible moves that generate children at each state. Consider the following example of formulating a search problem.
A farmer (F) wants to move himself, a fox (X), a goose (G), and some grains (R),
across a river. Unfortunately, his boat is so tiny that he can take only one of his
possessions across on any trip. Worse yet, an unattended fox will eat a goose, and
an unattended goose will eat grains, so the farmer must not leave the fox alone with
the goose nor the goose alone with the grains. What is he to do?
We create a node for each safe position, then draw the possible next moves as links from one node to another. The end result is a graph with 10 nodes, shown in Fig. 8.7. The states are numbered 1-10. State 1 indicates that all four objects are on one side of the river, with the other side (separated by the symbol |) empty. Only one move is possible from state 1, caused by the farmer transporting the goose by boat to the other bank; this results in state 2. Then the farmer leaves the goose and returns, giving the new state 3. From state 3, two alternative moves are possible, to states 4 and 5, which lead to states 6 and 7, respectively. States 6 and 7 merge into a single state 8, which has one child state 9; the next state, 10, is the goal state, indicating that all objects are on the destination bank of the river. What one needs to do is find a shortest path from the state where all objects are on one side of the river to the goal state where all the objects have been transported to the other side.
The general approach is to search the paths until either the goal is reached or no path is left to be explored. We note that the goal state can be reached by several paths, but loops and cycles must be avoided. We also note that each move is invertible. When loops are not considered, there are two paths to the goal: 1-3, 4, 6, 8-10, and 1-3, 5, 7, 8-10. A sketch of the state encoding and safety check is given after the figure.
[Fig. 8.7: State graph of the puzzle. Each state shows the objects on the two banks separated by '|', e.g., state 3 is XRF|G, state 8 is G|XRF, state 9 is GF|XR, and the goal state 10 is -|XGRF.]
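A minimal sketch of the state encoding and the safety test, with the left bank represented as a set of the symbols F, X, G, R; the encoding is an assumption, and only the safety check and move generator are shown.

# Sketch: the farmer/fox/goose/grains puzzle. A state is the
# frozenset of objects on the left bank (F = farmer, X = fox,
# G = goose, R = grains); the rest are on the right bank.
ALL = frozenset("FXGR")

def safe(left):
    for bank in (left, ALL - left):
        if "F" not in bank:               # unattended bank
            if {"X", "G"} <= bank or {"G", "R"} <= bank:
                return False
    return True

def moves(left):
    here = left if "F" in left else ALL - left
    for cargo in [set()] + [{o} for o in here - {"F"}]:
        new_left = left ^ ({"F"} | cargo)  # farmer (+cargo) crosses
        if safe(new_left):
            yield frozenset(new_left)

start = ALL
print(sorted("".join(sorted(s)) for s in moves(start)))
# ['RX']: the only safe first move takes the goose across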
8.8 Summary
All the search strategies discussed in this chapter are called blind search. In the absence of any idea of which direction of search may lead to a solution faster, one needs to carry out an exhaustive search. In all these searches, the time complexity is exponential.

A graph search in AI turns out to be a tree search. Unlike conventional data searching in a tree whose physical dimensions are given, in an AI search one needs to generate and search simultaneously. The search space is called the state space, and the nodes of the tree are called states.
BFS (breadth-first search) has time and space complexities both O(b^d), while DFS (depth-first search) has these complexities as O(b^d) and O(d), respectively. Depth-first iterative deepening (DFID) provides a better average case, as it can locate shallow goals faster; however, its worst-case time and space complexities are the same as those of DFS. Bidirectional search halves the exponent of the time complexity, i.e., O(b^(d/2)), whether based on depth-first or breadth-first search.
To analyse the time complexity of DFS, a new parameter, called the edge branching factor (e), is important. It is the average number of different operators applicable to a given state. In the case of trees, the edge branching factor and the node branching factor are equal; for graphs in general, the edge branching factor may exceed the node branching factor. In a graph with multiple paths to the same state, any depth-first search, including DFID, can generate far more nodes than there are states. This is due to the inability of the DFS algorithm to detect most duplicate nodes that represent the same state. The following table gives the comparison in brief:

Method          Worst-case time   Worst-case space
BFS             O(b^d)            O(b^d)
DFS             O(b^d)            O(d)
DFID            O(b^d)            O(d)
Bidirectional   O(b^(d/2))        O(b^(d/2))
Exercises
1. Describe a state space in which iterative deepening search performs worse than
depth-first search.
2. Three missionaries (m1 , m2 , m3 ) and three cannibals (c1 , c2 , c3 ) are on one side
of a river, along with a boat that can hold one or two people. Suggest a solution
to get everyone to the other side, without ever leaving a group of missionaries
on any side of the river outnumbered by the cannibals in that place. Formulate
the solution for this, draw the graph, and demonstrate the goal search through
BFS.
3. Given a 5-litre jug full of water and an empty 2-litre jug, where the jugs have no markings or level indicators, the goal is to fill the 2-litre jug with exactly one litre of water. Give the corresponding state diagram. Assume moves such as: p5,2 (pour from the 5-litre jug into the 2-litre jug), e2 (empty the 2-litre jug), etc.
4. John has two jugs: one holds exactly 3 l and one exactly 4 l, but they have no subdivision markings. A recipe calls for exactly 5 l. He has a source of water but no other jug. Draw the state-space graph and determine the sequence of steps for getting exactly 5 l (into the recipe) that wastes the least amount of water. Suggest the state-space search approach to be followed.
5. Figure 8.8 shows a search tree with states A to O, along with the estimated distance from each state to the goal. A state with distance zero is itself a goal.
a. Use BFS, and show the steps along with the elements of the queue data structure, as well as the distance from the goal, at each step.
b. Use DFS, and show the steps along with the elements of the stack data structure, as well as the distance from the goal, at each step.
6. Manually run Algorithm 8.2 for the graph shown in Fig. 8.4, considering that the start node is S and the goal node is H. While running this algorithm, show the progressive contents of the stack.
7. Given the blocks world shown in Fig. 8.9, it is required to get to the goal state from the initial start state. Construct the search tree for:
a. BFS
b. DFS
Assume that appropriate rules for the moves of the robot arm are followed in carrying out this job.
[Fig. 8.8: A complete binary tree with estimated distances to the goal: A(20); B(15), C(17); D(14), E(12), F(11), G(13); H(10), I(9), J(0), K(5), L(8), M(0), N(7), O(5). States with distance 0 are goals.]

[Fig. 8.9: Blocks world. A robot arm must transform the start state into the goal state; both states are stackings of the blocks A, B, and C.]
8. Find the space and time complexity of DFS for parsing an NL sentence of 50 words. Suggest the assumptions you would make to reduce the space and time complexity.
9. Give examples of graphs having different edge and node branching factors.
10. For the graphs shown in Fig. 8.10a–d, with start node S and goal node G, find
out the average node branching factor and average edge branching factor.
11. For the following board games, find the worst-case path length to the goal, the total number of paths to the goal, the node branching factor, and the edge branching factor.
a. Rubik's cube
b. Tic-tac-toe
c. 8-puzzle
d. Sudoku
12. Consider that a large number of websites are available, with their domain names
in a domain-name-file. Each website has index-page (homepage) and other pages
linked directly or indirectly from the homepage. In addition, there may be links
from website pages to other website pages. Suggest a method and write an
algorithm to crawl (scan) these websites to create an index table as follows:
each row in the table has one entry corresponding to each domain name, comprising the domain name and its five most frequent keywords: {domain-name, key-word1, …, key-word5}. Your algorithm must follow a suitable search approach.
a. Give the justification for the type of search you have used.
b. Give the time and space complexity analysis.
[Fig. 8.10: Four graphs (a)-(d), each with start node S and goal node G.]
17. Imagine there are 10 million websites on the entire Internet. Suppose that a search engine uses certain strategies to index these websites on a regular basis, to maintain and update a very large database of indexes. Answer the following using your own reasoning, as well as by exploring them on the web.
a. What strategy should the search engine use to navigate through these websites via the links?
b. What strategy should be used to create the index of the home page of every website?
c. What should be the criteria for updating the indexes?
d. How can re-indexing of already indexed nodes be avoided?
18. Can every AI search be carried out using bidirectional search? What is the necessary condition for this, if any?
19. For a search tree of depth d and branching factor b, carry out the following analysis for DFS and BFS:
a. Average case time complexity.
b. Average case space complexity.
c. Best case time complexity.
d. Best case space complexity.
20. Suggest an architecture of a hash table for storing the nodes found in a graph search, allowing quick lookup when the same node is found again later in the search.
21. What is the memory requirement (in KB or MB) of the data structure for the 8-puzzle, for each of the following searches?
a. BFS
b. DFS
Assume that only the open-list is maintained to store the nodes. Each node corresponds to a state of the 8-puzzle board and occupies 9 bytes (in each byte, 4 bits encode the square number and 4 bits the tile number in that square, including the blank). Note that when head nodes are deleted, the memory is freed.
22. Suggest an approach to combine DFS and BFS in a single algorithm, so that at certain states you can choose DFS in place of BFS, or vice versa, depending on the specific situation encountered. For example, BFS can be used to check for shallow goals, while DFS can be tried for deep goals. What data structures would you use to implement such a search, as well as the two strategies?
References
3. Korf RE (1985) Depth-first iterative-deepening: an optimal admissible tree search. Artif Intell
27:97–109
4. Luger GF (2009) Artificial intelligence: structures and strategies for complex problem solving, 5th edn. Pearson Education, New Delhi
5. Nilsson NJ (1980) Principles of artificial intelligence, 3rd edn. Narosa Publishing
6. Tarjan R (1972) Depth-first search and linear graph algorithms. SIAM J Comput 1(2):146–160
Chapter 9
Heuristic Search
9.1 Introduction
The subject of combinatorial optimization comprises a set of problems that are central to the domain of computer science. The field of combinatorial optimization aims to develop efficient algorithms for finding the minimum or maximum values of a function of a large number of independent variables. The function, usually called the objective function or cost function, represents a quantitative measure of the "goodness" of some complex system; it depends on the complete configuration of the system, which comprises many parts.
The best-known example of a combinatorial optimization problem is the Traveling Salesman Problem (TSP). It is stated as: given a list of n cities, and the distance between every pair of cities, plan a salesman's route which passes through every city exactly once, covering all the cities, and finally returns to the starting point. The order of cities to be visited should be planned so that the total distance covering all the cities is minimum. Instead of distance, it may be the cost of travel from city to city, or some other parameter; for the sake of generality, we call this the cost. Problems of this type are common in the areas of scheduling and design. Two subsidiary problems in these areas are of general interest: 1. predicting the expected cost of the salesman's optimal route, averaged over some arrangement of cities, given the arrangement of cities and the distances between pairs of cities; 2. estimating or obtaining upper bounds on the computing effort necessary to determine the optimal route.
All the exact (also called exhaustive) methods known so far for determining the optimal route require computing effort that grows exponentially with the total number of cities n. Hence, to carry out the solution in realistic time, an exact solution can be attempted only for a small number of cities n, perhaps less than a hundred. The TSP belongs to a large class of problems called NP-complete (NP for nondeterministic polynomial time) problems. So far, no exact solution method whose computing effort is bounded by a polynomial power of n (say n^k, k ≥ 1) has been found for any of these problems.1
However, if such a solution were found for any NP-complete problem, say A, then it would be possible to map all the remaining NP problems to A, since every NP problem reduces to an NP-complete one. However, it is not yet known what features of the individual problems make them NP-complete and cause this level of difficulty!
Problems of the NP-complete class are commonplace in many situations of practical interest, hence the importance of their solutions. Fortunately, solution methods called heuristic methods have been developed, whose computational requirements are proportional to only limited powers of n (the size of the problem). Unfortunately, there is no universal heuristic method applicable to all types of problems. In fact, heuristics are not general but problem-specific, and there is no guarantee that a heuristic procedure for finding near-optimal solutions to one NP-complete problem will be effective for another.
Fundamentally, there exist two basic approaches to heuristics: 1. divide-and-conquer and 2. iterative improvement. The first approach is based on dividing the original problem into subproblems of manageable size, which are then solved individually. At the end, the solutions to the subproblems are patched back together to obtain the complete solution. For this method to produce a good solution, it is necessary that the subproblems be naturally disjoint and that the division of the original problem into subproblems be proper, in the sense that errors made in patching up the subproblems do not offset the gains obtained by applying more powerful methods to the subproblems.
The other approach to heuristics is based on iterative improvement, where one starts with a known configuration. A standard rearrangement operation is applied to all parts of the system in turn, until a rearranged configuration which improves the cost function is discovered. As a next step, this rearranged configuration becomes the new configuration of the system, and the process is repeated until we reach a configuration such that no further improvements

1 The NP-complete problems require computing effort exponential in n, i.e., k^n.
can be found. The iterative improvement process amounts to a search of the space of rearrangements; a flat region in this space indicates that an optimized solution has been reached, called the global maximum.

Instead of settling at the global maximum, the search quite often gets stuck at a local maximum. Due to this, it is usual to perform the search several times, starting from different randomly generated configurations, and save the best result, so as to reach the global maximum.
This chapter presents heuristic methods for AI search problems. These methods are better informed, and hence explore the state space in more promising directions. The analysis and complexities of these methods are also discussed.
1. Describe the role of heuristics and describe the trade-offs among completeness,
optimality, time complexity, and space complexity. [Familiarity]
2. Select and implement an appropriate informed search algorithm for a problem by
designing the necessary heuristic evaluation function. [Usage]
3. Evaluate whether a heuristic for a given problem is admissible (i.e., can guarantee
optimal solution). [Assessment]
4. Design and implement a genetic algorithm solution to a problem. [Usage]
5. Design and implement a simulated annealing schedule to avoid local minima in
a problem. [Usage]
6. Design and implement A∗ search to solve a problem. [Usage]
7. Compare and contrast genetic algorithms with classic search techniques. [Assess-
ment]
8. Compare and contrast various heuristic searches vis-a-vis applicability to a given
problem. [Assessment]
Search efficiency can improve tremendously, reducing the search space, if there is a way to order the nodes to be visited such that the most promising nodes are explored first. These approaches are called informed methods, in contrast to the uninformed or blind methods discussed in the previous chapter. These methods depend on some heuristic determined by the nature of the problem. The heuristic is defined in the form of a function, say f, which represents a mapping to the total distance between the start node and the goal node. For any given node n, the total distance between the start and goal nodes is f(n), such that

f(n) = g(n) + h(n),    (9.1)

where g(n) is the distance between the start node and node n, and h(n) is the distance between node n and the goal node. We note that g(n) can be easily determined
and can be taken as the shortest found so far. However, the distance to the goal, f(n), requires the computation of h(n), called the heuristic, which cannot be so easily determined. In deciding the next state each time, represented by node n, the state chosen is the one for which f(n) is minimum.
Consider the case of the traveling salesman problem, which is otherwise a combinatorially explosive problem with exponential time complexity O(n!) for n nodes: it reduces to only O(n^2) if at every step the next node selected is the nearest neighbor, that is, the one at the shortest distance from the current node. A sketch of this heuristic follows.
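A minimal sketch of the nearest-neighbour heuristic, reusing the kind of hypothetical distance matrix shown earlier:

# Sketch of the nearest-neighbour heuristic for the TSP: from the
# current city, always travel to the closest unvisited city. Runs in
# O(n^2) but does not guarantee the optimal tour.
def nearest_neighbour(dist, start=0):
    n = len(dist)
    tour, visited = [start], {start}
    while len(tour) < n:
        cur = tour[-1]
        nxt = min((c for c in range(n) if c not in visited),
                  key=lambda c: dist[cur][c])
        tour.append(nxt)
        visited.add(nxt)
    tour.append(start)                    # return to the start city
    return tour

d = [[0, 1, 5, 4],
     [1, 0, 2, 6],
     [5, 2, 0, 3],
     [4, 6, 3, 0]]
print(nearest_neighbour(d))   # [0, 1, 2, 3, 0]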
Similarly, in the 8-puzzle problem, the next move chosen is the one having minimum disagreement with the goal, i.e., the minimum number of misplaced tiles with respect to the goal, as sketched below.
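A one-function sketch of the misplaced-tiles heuristic, using the reverse-order goal of Fig. 8.1 and the tuple state encoding assumed earlier:

# Sketch of the "misplaced tiles" heuristic h(n) for the 8-puzzle:
# the number of tiles that disagree with the goal configuration
# (the blank, 0, is not counted).
def misplaced(state, goal):
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

goal  = (8, 7, 6, 5, 4, 3, 2, 1, 0)
state = (1, 2, 3, 4, 5, 6, 7, 8, 0)
print(misplaced(state, goal))   # 8: every numbered tile is misplaced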
Heuristic methods reduce the state space to be searched and are supposed to give the solution, but they may also fail.
The name hill-climbing comes from the fact that to reach the top of a hill, one selects the steepest path at every node, out of the number of alternatives available. Naturally, one has to sort the available slope values, pick the direction of the move having the highest angle, reach the next point (node) toward the hill top, and then repeat the process.

The hill-climbing Algorithm 9.1 is an improved variant of the depth-first search method. A hill-climbing method is called greedy local search: local, because it considers one node close to the current node at a time; and greedy, because it always selects the nearest neighbor without knowing the future consequences. The inputs to this algorithm are G, S, and Goal, which stand for the graph, the start (root) node, and the goal node, respectively.
Consider the graph shown in Fig. 9.1a, where the start node is A and the goal node is G. It is required to find the shortest path from node A to node G using the method of hill-climbing. Figure 9.1b shows the search tree for reaching goal node G from start node A, with the shortest path A, B, D, E, G of length 14. It can easily be worked out that this approach does not always lead to the shortest path. A sketch of greedy local search follows.
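The following sketch follows the greedy local search idea (it is not Algorithm 9.1 itself); the graph and the heuristic estimates are hypothetical, chosen so that the greedy walk succeeds.

# Sketch of hill-climbing as greedy local search: from the current
# node, always move to the neighbour with the best (lowest) heuristic
# estimate; stop at the goal or when no neighbour improves.
def hill_climb(graph, h, start, goal):
    path, node = [start], start
    while node != goal:
        neighbours = graph.get(node, [])
        if not neighbours:
            return None                   # dead end
        best = min(neighbours, key=h)
        if h(best) >= h(node):
            return None                   # stuck: local optimum/plateau
        node = best
        path.append(node)
    return path

# Hypothetical graph and heuristic values (not those of Fig. 9.1)
g = {"A": ["B", "C"], "B": ["D"], "D": ["E"], "E": ["G"], "C": ["E"]}
est = {"A": 9, "B": 6, "C": 7, "D": 4, "E": 2, "G": 0}
print(hill_climb(g, est.get, "A", "G"))   # ['A', 'B', 'D', 'E', 'G']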
Though simple, hill-climbing suffers from various problems, which are prevalent when hill-climbing is used to optimize a solution.

• Foothill problem: this occurs when there are secondary peaks, or local maxima, which are mistaken for the global maximum, leaving the user with a false sense of having achieved the goal.
• Plateau problem: this occurs when there is a flat region separating the peaks.
• Ridge problem: this is like a knife edge on top of a hill, with valleys on both sides. It again gives a false sense of being at the top of the hill, as no slope change appears.
Fig. 9.1 (a) Graph to be searched for goal 'G', with start node A and goal node G; (b) the corresponding search tree
Due to the above difficulties, a hill-climbing algorithm may not be able to find a goal, even
if the goal exists. Thus, hill-climbing algorithms are not complete.
Consider the following phenomena:
1. Rotating the brightness knob on the control panel of an analog TV does not improve
the quality of the picture,
2. While testing a program, running it again and again with different data sets does
not lead to the discovery of new errors,
3. Participating in a sport again and again (without new ideas and training) does
not improve performance further.
In all the above three cases, we strive for optimum performance. In case 1, the
adjustable parameter is the TV control; in the second, it is the choice of input data sets; and in the
third, it is added energy and preparedness. But it appears that an
optimum has been reached (a highest point in performance from which no improvement
takes place, an indication of a saturation point: a goal, but not the true goal).
The above scenarios are created due to either the foothill problem, the plateau,
or the ridge.
Local Versus Global Search
The preference for local search, where only one of the newly explored states is considered
for further expansion at a time, is good for memory efficiency in
comparison to expanding all the successor nodes, but it appears to be too extreme a
step. Thus, instead, the k best nodes out of the successors generated are considered for
further expansion. This gives rise to a new method, called local beam search.
However, local beam search too will lead to concentration in a specific direction,
toward those successors which generate more potential nodes. Consequently, this search
also ultimately becomes a local search. A better solution is to elect these k nodes
not as the best successors, but randomly, out of the next
generation of successor nodes. This new method is called stochastic beam search.
Thus, we grow a population (of nodes) and from it generate another
population by random selection, as well as on some criteria of merit. Thus, we come
close to the approach used in genetic algorithms-based search.
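A sketch of the stochastic variant is given below (ours); the expand and value callbacks are hypothetical problem-specific functions, and value is assumed to return a non-negative score, higher meaning more promising:

import random

def stochastic_beam_step(states, expand, value, k):
    """One generation: pool all successors, then sample k of them with
    probability proportional to their value (rather than taking the k best)."""
    pool = [s for state in states for s in expand(state)]
    if not pool:
        return []
    weights = [value(s) for s in pool]          # non-negative scores assumed
    return random.choices(pool, weights=weights, k=k)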
If a problem has a very large search space and can be solved by iteration (unlike
theorem proving),2 there is usually no alternative to using the iterative methods. In
such problems, there is a serious issue in bounding the effort to make the search
tractable. Due to this, the search should be kept limited in some way, e.g., in terms
of the total number of nodes to be expanded, or the maximum depth to which it may reach.
Since there is no guarantee that a goal node will ultimately be reached, an evaluation
function is invoked to help us decide the approximate distance to the goal for any
given node at the periphery of the search. This or a similar function can also be
used for deciding which tip node to sprout next. Hence, the effort put into proper design
of evaluation functions limits the later-stage difficulty in solving the problem. Note
that heuristics-based methods are called informed search methods, in contrast to the
uninformed methods discussed in the previous chapter.
Among all the problem-solving strategies based on heuristic search, one of
the most popular methods of exploiting heuristic information to cut down search
time is the best-first search strategy. This strategy follows the general philosophy of using
heuristic information to assess the "merit" latent in every candidate search avenue
exposed during the search, and then continues the exploration along the direction
of highest merit. There are no local maxima "traps" in best-first search as in hill-
climbing methods. The best-first search strategy is usually seen in the context of path-
searching problems, a formulation that represents many combinatorial problems
with practical applications, such as routing telephone traffic and the layout of printed circuit boards.
2 In problems such as theorem proving, the search must continue until a proof is found.
Fig. 9.2a shows the graph, and Fig. 9.2b shows the search tree using GBFS for reaching
goal node G.
Every node in the best-first search tree shows the node identification along with its distance
from the root node S. The order in which the nodes are explored is shown with a dotted
line. Since G is the goal, its path from the root is S, A, C, G, with shortest path length 11.
Note that any other path will be of longer or equal length.
Fig. 9.2 (a) The graph to be searched, with start node S and goal node G; (b) the best-first search tree, in which each node is labeled with its distance from the root S
As we have noted, the best-first searches tend to put the searching effort into
those subtrees that seem most promising (i.e., those most likely to provide the
best solution). However, the best-first searches require a great deal of bookkeeping
for keeping track of all competing nodes, contrary to the great efficiencies possible
in depth-first searches.
Depth-first searches, on the other hand, tend to be forced to stop at inappropriate
moments, thus giving rise to the horizon effect (the number of possible states is immense
and only a small portion can be searched). They also tend to investigate huge trees,
large parts of which have nothing to do with any solution (since every potential arc
of the losing side must be refuted). However, these large trees sometimes turn up
something that the evaluation functions would not have found were they guiding the
search. Sometimes the efficiencies and discovery potential of the depth-first methods
appear to outweigh what best-first methods have to offer. In fact, both methods have
some glaring deficiencies.
Optimizing Best-First Search
In practice, several shortcuts have been devised to simplify the computation of GBFS.
First, if the evaluation function f (n) used for node selection always provides opti-
mistic estimates of the final costs of the candidate paths evaluated, then we can
terminate the search as soon as the first goal node is selected for expansion, without
compromising the optimality of the solution found. This guarantee is called admissi-
bility and is, in fact, the basis of the branch-and-bound method. We can observe this,
for example, in a chess game, where the goal is a property of a configuration and not a
property of the path from the start node. Hence, once a winning configuration is reached,
there is no need to reach it from other paths.
Second, we are often able to purge from the tree T large sets of paths that are rec-
ognized at an early stage to be dominated by (i.e., inferior to) other paths in T. This
becomes particularly easy if the evaluation function f is order-preserving, that is,
if, for any two paths p1 and p2 , leading from S to n, and for any common extension
p3 of those paths, the following holds (Fig. 9.3) [2]:
f ( p1 ) ≥ f ( p2 ) ⇒ f ( p1 p3 ) ≥ f ( p2 p3 ). (9.2)
Fig. 9.3 Two paths p1 and p2 from S to node n, with a common extension p3
The best-first search allows revisiting decisions. This happens when a newly
generated state, obtained by expanding one of the states in the Open-list, is also found
in the Closed-list. The best-first search would retain the shorter path to this node and purge the other to
save space. However, in a variant of best-first called greedy best-first search, once a
state is visited the decision is final and the state is not visited again, thus possibly
accepting a suboptimal solution. This, however, does not require the Closed-list, thus
saving memory space significantly.
Special Cases of Best-First Search
Individual best-first search algorithms differ primarily in the cost function f (n).
If f (n) is the total depth of node n (not the weighted distance from the start), best-first search
becomes breadth-first search. Note that breadth-first searches all the nodes closer to
the start before farther nodes. If f (n) = g(n), where g(n) is the cost of the current path
from the start state to node n, then best-first search becomes Dijkstra's single-source
shortest-path algorithm.
If f (n) = g(n) + h(n), where h(n) is a heuristic estimate of the cost of reaching
a goal from node n, then best-first search becomes the A∗ algorithm.
Breadth-first search can terminate as soon as a goal node is generated, while
Dijkstra’s algorithm and A∗ must wait until a goal node is chosen for expansion to
guarantee optimality. Every node generated by a best-first search is stored in either
the Open- or Closed-lists, for the following reasons:
1. To detect when the same state has previously been generated. This is to prevent
expanding it more than once.
2. To generate the solution path once a goal is reached. This is done by saving with
each node a pointer to its parent node along an optimal path to the node, and then
tracing these pointers back from the goal state to the initial state.
3. To retain only the node at the shorter distance when a newly generated
node already exists in the Closed-list.
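These special cases can be made concrete with a small sketch (ours): the same best-first machinery is specialized purely by the choice of f, here assuming each node object carries attributes depth, g (path cost), and h (heuristic estimate):

def make_f(kind):
    """Return the node-ordering function f(n) for the given best-first variant."""
    if kind == "breadth-first":
        return lambda node: node.depth            # all shallower nodes first
    if kind == "dijkstra":
        return lambda node: node.g                # cheapest path cost so far
    if kind == "a-star":
        return lambda node: node.g + node.h       # cost so far + estimate to goal
    raise ValueError(kind)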
By far the most studied version of best-first search is the algorithm A∗ , which was
developed for additive cost measures, that is, where the cost of a path is defined as
the sum of the costs of its arcs. The A∗ is in fact a family of algorithms, as we will see
shortly. The algorithm makes use of ordered state-space search and the estimated
heuristic cost to determine the evaluation function f ∗ , a function which leads to the
goal state. This process is similar to the best-first search, but it is unique in the
sense that it defines f ∗ so as to provide a guarantee of an optimal path to the goal
state. The A∗ algorithm is in the class of branch-and-bound algorithms, which are
in common use in operations research for finding the solution of a problem in the
form of a shortest path in a graph [1].
The evaluation function f ∗ (n) estimates the quality of the solution path through
node n, and it is based on values returned from two components: 1. g ∗ (n), and 2.
h ∗ (n). The first component is the minimal cost of a path from a start state to n, and
the second component, called heuristic value, is a lower bound on the minimal cost
of a solution path from state n to goal state.
In graphs, g ∗ can err only in the direction of overestimating the minimal
cost; if a shorter path is found in later steps of the algorithm, the value of g ∗ is
readjusted downward. The function h ∗ carries the heuristic information and must
ensure that the value of h ∗ (n) is less than or equal to h(n). This latter
condition is essential to the optimality of the A∗ algorithm. The property that
h ∗ (n) never exceeds h(n) is called the admissibility condition.
To match this cost measure, A∗ employs an additive evaluation function f (n) =
g(n) + h(n), where g(n) is the cost of the currently evaluated path from S to n and
h is a heuristic estimate of the cost of the path remaining between n and some goal
node. Since g(n) is order-preserving and h(n) depends only on the description of
the node n, therefore f (n) is also order-preserving, and one is justified in discarding
all but one parent for each node.
The admissible search algorithm for A∗ (A-star) is given as Algorithm 9.3.
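For concreteness, the following is a compact Python sketch of A∗ (ours; Algorithm 9.3 in the text is the authoritative statement). The graph and heuristic table are those of Example 9.3 (Fig. 9.5) below, with the edge costs read off from the worked f-values; instead of an explicit Closed-list, stale priority-queue entries (whose f no longer matches the best known g) are discarded when popped:

import heapq

def a_star(graph, h, start, goal):
    """graph: node -> list of (neighbor, edge_cost); h: heuristic table."""
    open_list = [(h[start], start)]          # entries ordered by f = g + h
    g = {start: 0}                           # cheapest known cost from start
    parent = {start: None}                   # parent pointers for path recovery
    while open_list:
        f, node = heapq.heappop(open_list)
        if f > g[node] + h[node]:
            continue                         # stale: a shorter path was found later
        if node == goal:                     # goal selected for expansion: optimal
            path = []
            while node is not None:          # trace parent pointers back to start
                path.append(node)
                node = parent[node]
            return list(reversed(path)), g[goal]
        for nbr, cost in graph.get(node, []):
            new_g = g[node] + cost
            if new_g < g.get(nbr, float("inf")):   # shorter path found: reopen
                g[nbr], parent[nbr] = new_g, node
                heapq.heappush(open_list, (new_g + h[nbr], nbr))
    return None, float("inf")

graph = {'S': [('A', 1), ('B', 4)], 'A': [('B', 2), ('C', 5), ('G', 11)],
         'B': [('C', 2)], 'C': [('G', 3)]}
h = {'S': 7, 'A': 6, 'B': 2, 'C': 1, 'G': 0}
print(a_star(graph, h, 'S', 'G'))            # (['S', 'A', 'B', 'C', 'G'], 8)

The returned path and cost agree with the worked trace of Example 9.3.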
For any graph G and any set of goal nodes Goal, let us assume that f (n) is
the actual cost of an optimal path that is restricted to go through the node n,
i.e., from source S through n to a preferred goal node. Note that at the beginning of the A∗
search algorithm, the constrained node n is nothing but S; hence g(n) = g(S) = 0.
Therefore, f (S) = h(S) is the cost of an unconstrained optimal path from node S to a
preferred goal node, whatsoever it is. Actually, for every node n on an optimal path
the condition f (n) = f (S) holds, and for every node n not on an optimal path,
f (n) > f (S) holds. Thus, although f (n) may not be known in advance, it seems
reasonable to use the evaluation function f ∗ (n) as an estimate of f (n). This is
because determination of the true value of f (n) may be the main problem of interest.
In the following, we present some properties of the search algorithm A∗ , where the cost
f (n) of an optimal path through node n is estimated using an evaluation function
f ∗ (n). The function f (n) can be written as the sum of two parts, g(n) and h(n):

f (n) = g(n) + h(n).
In the above, g(n) is the actual cost of an optimal path from node S to n, and h(n)
is the actual cost of an optimal path from node n to a preferred goal node.
If we had the estimates of g and h, we could easily get the estimate of f (n) as
the simple addition of the two. An obvious choice for g ∗ (n) is the smallest-cost
path from S to n found by the algorithm so far. This implies g ∗ (n) ≥ g(n).
Through a simple example, we will show that this estimate is easy to
compute as the algorithm progresses through its computations [6].
Example 9.2 Consider the subgraph shown in Fig. 9.4, with start node S and three
other nodes n1 , n2 , n3 , with the costs on edges as the weights.
Given this, we trace the algorithm A∗ as it proceeds. With S as the start node,
n1 and n2 are the successor nodes. The estimates for g ∗ (n1 ) and g ∗ (n2 ) are then 4
and 9, respectively. Let A∗ expand the next node as n1 , obtaining the successors
n2 and n3 . At this stage g ∗ (n3 ) = 4 + 3 = 7, and g ∗ (n2 ) is 4 + 4 = 8. The value of
g ∗ (n1 ) ultimately remains equal to 4, irrespective of the goal node.
Fig. 9.4 A subgraph with start node S and nodes n1 , n2 , n3 ; edge costs: S–n1 = 4, S–n2 = 9, n1 –n2 = 4, n1 –n3 = 3
We have the following arguments for an estimate h ∗ (n) of h(n), which we are not
able to compute for this example, as no criterion for the heuristic is specified here. We
usually rely on information from the problem domain for heuristics. For example,
many problems can be represented as the problem of finding a mini-
mum-cost path through a graph that contains some "physical" information, and this
information is used as the basis for estimating h ∗ . When considering the
connection between cities through roads, the value h ∗ (n) might be the air distance from
city n to the goal city, because this distance is the shortest possible length of any road
connecting city n to the goal city; thus, it is a lower bound on h(n). However, this
conclusion is based on the assumption that air connectivity between any two
cities follows a straight-line rule.
As another example, in an 8-puzzle problem, at node n, the distance h ∗ (n) might
be equal to the number of tiles dislocated with respect to the goal state.
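This misplaced-tiles heuristic is a one-liner in Python; the sketch below (ours) treats a state as a 9-tuple read row by row, with 0 standing for the blank:

def h_misplaced(state, goal):
    """Number of non-blank tiles not in their goal position: a lower bound
    on the moves remaining, hence admissible."""
    return sum(1 for s, g in zip(state, goal) if s != g and s != 0)

goal  = (1, 2, 3, 4, 5, 6, 7, 8, 0)
state = (1, 2, 3, 4, 5, 6, 0, 7, 8)
print(h_misplaced(state, goal))   # 2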
We will discuss later the use of information from specific problem domains to
form the estimate of f ∗ . However, first we can prove that if h ∗ is any lower bound of h,
then the algorithm A∗ is admissible.
Example 9.3 A∗ -Search.
Figure 9.5 shows a graph, a table of the heuristic function h(n) for every node in the
graph, and the tree constructed by the A∗ search for the graph, for the given start state S and
goal state G. To expand the next node, the one having the smallest value of the function f is
chosen out of the nodes in the frontier. The function f for a node n is the sum of three
values: the g value of the parent of n, the distance from the parent of n to the node n,
and the heuristic value (estimated distance from n to the goal, given in the table) indicated
by h. In the case of a tie, i.e., two states having equal values of f , the one to the left
in the tree is chosen.
If, in addition, h(n) is a lower bound to the cost of any continuation path from
n to goal, then f (n) is an optimistic estimate of all possible solutions containing
the currently evaluated path. Then, terminating A∗ upon the selection of the first
goal node does not compromise its admissibility. Several other properties of A∗ can
be established if admissibility holds, such as the conditions for node expansion,
node reopening, and the fact that the number of nodes expanded decreases with
increasing h.
Based on the criterion set for A∗ (i.e., always expand the node having the smallest
value of f ), the estimated distance to goal ( f ) for each node n is: the distance from source S to the
parent, plus the distance from the parent to this node n, plus the distance h from this node n to the
goal. For the current node n = A, these distances are 0, 1, 6, respectively. The order in
which nodes have been expanded for Fig. 9.5a, as shown in the search tree in Fig. 9.5c
with start node S and goal node G, is: (S, 0), (B, 6), (A, 7), (B, 5), (C, 6), (C, 7)
(with parent A), (C, 7) (with parent B), (G, 8), (G, 9), (G, 12). Finally, we note that
the best path corresponds to goal (G, 8), and it is: S, A, B, C, G. Note that, in
the A∗ -tree, we followed the sequence (S, 0), (B, 6), and then (A, 7) and not (C, 7),
which are equally weighted: we chose the node in the left-side subtree. Had we
chosen (C, 7) in place of (A, 7), we would have reached (G, 9) as the next node,
which incidentally was not a good choice.
Fig. 9.5 (a) Graph, with start node S and goal node G; edge costs: S–A = 1, S–B = 4, A–B = 2, A–C = 5, A–G = 11, B–C = 2, C–G = 3. (b) Heuristic table: h(S) = 7, h(A) = 6, h(B) = 2, h(C) = 1, h(G) = 0.
(c) A∗ -search tree, annotated with f values at each node (e.g., f = 0 + 7 = 7 at S; f = 0 + 1 + 6 = 7 at A; f = 0 + 4 + 2 = 6 at B; and f = 5 + 3 + 0 = 8 at the goal G on the optimal path).
Fig. 9.5 Graph and A∗ -search tree
A search algorithm is called optimal if there does not exist any other search algo-
rithm performing the search in less time or space, or expanding
fewer nodes, with the same guarantee of solution quality. As noted,
if h ∗ (n) is a lower bound on h(n), then the solution of A∗ is admissible. This estimate
also implies that any open node n may even be arbitrarily close to a preferred goal
node.
In one way, we can define an optimal search algorithm as one that picks the correct
next node at each choice. However, this specification is not of much
use, as it is insufficient to design an algorithm. In fact, whether
such an algorithm can ever exist is an open question in itself [6].
The heuristic search that finds the shortest path to a goal, wherever it exists, is called
admissible. We may also like to know in what sense one heuristic is better than another;
this is called the informedness of the heuristic.
When a search is made, it is expected that the same node will not later be reachable
from a shorter path. This property is called monotonicity.
The breadth-first search algorithm is an admissible algorithm, because it searches
paths at level n before searching the paths at level n + 1; hence, if the goal exists at
level n, it will certainly be found. However, the BFS algorithm is too expensive
as a general-purpose algorithm.
The A∗ algorithm does not require g(n) = g ∗ (n). This shows that there may be
subgoals in the path, which are longer than g ∗ (n); this is due to monotonicity [8].
Definition 9.2 (Informedness) For a problem, suppose there are two A∗ heuristic
functions h 1 and h 2 . Then, if h 1 (n) ≤ h 2 (n) for all states n in the search space, the
heuristic h 2 is called more informed than h 1 .
For example, the criterion of the number of tiles out of place in the 8-puzzle is better
informed than the breadth-first or depth-first search methods. Similarly, a heuristic
that estimates the number of transitions needed to reach the goal is better informed than
one based only on the number of tiles out of place. In general,
the more informed an algorithm is, the less search space it expands [6].
(D, E, L , N , P, U ), (9.5)
where
The probability of accepting a move follows the Boltzmann distribution,

p = e^{ΔE/kT}, (9.6)

where ΔE is the change in energy level (the difference, i.e., current cost − new cost,
so ΔE is negative for an uphill move), T is the temperature on the Kelvin absolute tem-
perature scale, and k is Boltzmann's constant. As per this property, the probability
of a large uphill move is lower than that of a small move. Also, the prob-
ability that a large uphill move will take place decreases as the temperature (T )
decreases.
Fig. 9.6 Probability p of an uphill move in simulated annealing, as a function of (a) the temperature T and (b) the (negative) change in energy ΔE
Figure 9.6 shows the effect of increasing temperature and decreasing ΔE on the probability.
In other words, uphill moves are more probable when the temperature is high; as
the temperature decreases, only relatively small uphill moves are made, until finally the
process converges to a local minimum configuration.
The rate at which the system cools is called the annealing schedule. If cooling occurs
too fast, the physical system will form stable regions of high energy; that is, a local but
not global minimum is reached. If a slower schedule is used, a uniform crystalline
structure, which corresponds to a global minimum, will develop.
In search techniques, the change in E (ΔE) is equal to the change in the objective function. The
constant k represents the correspondence between the unit of temperature and the unit
of energy. Since it is constant, we take the probability p in Eq. (9.6) as

p = e^{ΔE/T}. (9.7)
Formal Approach
The Boltzmann acceptance test tends to return True at higher temperatures and
False at lower temperatures; thus, in essence, the search gradually shifts from a random
walk to a local hill-climb.
A simulated annealing algorithm is suitable for minimization of an objective
function f having the mapping f : S → R, where S is some finite search space
and R is the set of real numbers. Typically, the search spaces, designated Sn , comprise sets
of bit strings {0, 1}n of fixed length, or the set of all possible permutations over
the set {1, 2, ..., n}.
When considering the search space S, it is necessary to define some notion of
neighborhood N , which is a relation N ⊆ S × S. A function
N : S → P(S) (9.8)
algorithms, because the analysis is usually more difficult and the nature of the appro-
priate average to study is not always clear. However, as the size of optimization
problems increases, the worst-case analysis of a problem will become increasingly
irrelevant, and the average performance of algorithms will dominate the analysis of
practical applications. This large-number domain is the domain of statistical mechanics, and hence
of simulated annealing.
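A minimal sketch of the procedure (ours) is given below; the neighbor and cost callbacks, the starting temperature, and the geometric cooling schedule are all illustrative assumptions, and Eq. (9.7) gives the acceptance probability:

import math, random

def simulated_annealing(initial, neighbor, cost, t0=100.0, alpha=0.95, steps=1000):
    """Minimize cost(s) over states reachable via neighbor(s)."""
    current, t = initial, t0
    for _ in range(steps):
        candidate = neighbor(current)
        dE = cost(current) - cost(candidate)   # negative for an uphill move
        # Accept improvements always; accept uphill moves with prob. e^(dE/T),
        # which shrinks as T cools (geometric schedule below).
        if dE > 0 or random.random() < math.exp(dE / t):
            current = candidate
        t *= alpha
    return current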
Genetic Algorithms (GAs) are search procedures based on the process of natural
selection and genetics. They are increasingly used in difficult search
problems, optimization, and machine learning, across a wide spectrum of human
endeavor. A GA processes a finite population of fixed-length binary strings. In prac-
tice, the strings may be bit codes, k-ary codes, real (floating-point) codes, permutation
(order) codes, etc. Each of these has its place, but here we examine a simple GA
to better understand the basic mechanics and principles [3].
A simple GA consists of three operators: selection, crossover, and mutation. Selec-
tion is the survival of the fittest within the GA. Figure 9.7 shows the sequence of
operations on a population Pn , producing the next population Pn+1 . To understand
the operation of the genetic algorithm, a population of three individuals is taken.
Each is assigned a fitness value by the function F, called the fitness function. On the
basis of these fitnesses, the selection phase assigns the first individual (00111) zero
copies, the second (11100) two, and the third (01010) one copy. After selection,
the genetic operators are applied probabilistically; the first individual has its first bit
mutated from a 1 to a 0, and crossover combines the next two individuals into two
new ones. The resulting population is shown in the box labeled Tn+1 . Algorithm 9.5
shows the steps for search using a GA.
Fig. 9.7 Population at Tn : 00111 (F = 0.1), 11100 (F = 0.9), 01010 (F = 0.5); after selection: 11100, 11100, 01010; after mutation and crossover, applied probabilistically, population at Tn+1 : 01100, 11010, 01100
There are many ways to achieve effective selection, including ranking, tourna-
ment, and proportionate schemes, but the key notion is to give preference to better
individuals. Consider that individuals are strings of fixed length. In a two-party selection
tournament of such individuals, pairs of strings are drawn randomly from the parental
(original) population, and the better (fitter) individual places an identical copy into
the mating pool. When the whole population is selected in this manner, every indi-
vidual will participate in two tournaments; the best individual in the population will
win both trials, the median individual will typically win one trial, and the worst will
win none.
For selection to function, there must be some way of determining which individual
is fitter. This evaluation can come directly from a formal objective function,
or from the subjective judgment of a human observer. The ordering used is usually
a partial ordering.
The population holds representations of possible solutions. It usually has a fixed
size and is a multi-set. Selection operators usually take the whole population into account,
i.e., reproductive probabilities are relative to the current generation, and the diversity of a
population refers to the number of different fitnesses.
If we were to do nothing but selection, GAs would not be very interesting, because
the trajectory of populations could contain nothing but changing proportions of
strings contained in the original population. In fact, if run repeatedly, selection alone
is a fairly expensive way of, with high probability, filling a population with the best
structure of the initial population [4].
Under single-point crossover:
1. two strings are drawn from the mating pool,
2. a crossover site along the string length is chosen uniformly at random, and,
3. the values following the crossover site are exchanged between the two strings.
Let the two strings be A = 00000 and B = 11111. If the random choice of a cross
site turns up 2, the two new strings following the crossover will be A = 00111
and B = 11000. These resulting strings are placed in the new population pool, and
the process continues pair-by-pair from the original population until the new population
is completely filled with "offsprings" constructed from the bits and pieces of good
(selected) parents [5].
Note that selection and crossover are simple operators; they involve no more than
generating random numbers, copying strings, and exchanging partial strings. However,
their combined action accounts for much of the genetic algorithm's search ability. To
understand this, think of the processing we humans perform when we innovate.
Often we combine notions that worked well in one context with those that worked
well in another context, to generate possibly better ideas (new notions) of how to attack
the problem at hand. In a similar way, GAs juxtapose many different, highly fit
substrings (called notions) through the combined actions of selection and crossover to
form new strings (which may be called ideas).
If selection and crossover provide much of the innovative capability of a GA, what is
the role of the mutation operator? In a binary-coded GA, mutation is the occasional
(low-probability) alteration of a bit position; with other codes, a variety of diver-
sity-generating operators may be used. By itself, mutation induces a simple random
walk through string space. When used with selection alone, the combination forms a
parallel, noise-tolerant hill-climbing algorithm. When used together with selection
and crossover, mutation acts both as an insurance policy and as a hill-climber.
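The interplay of the three operators can be seen in the following compact sketch (ours, not the book's Algorithm 9.5); tournament selection, one-point crossover, and bit mutation operate on fixed-length binary strings, and all parameter values are illustrative only:

import random

def genetic_algorithm(fitness, n_bits=5, pop_size=4, generations=50,
                      p_cross=0.7, p_mut=0.01):
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: two-party tournaments fill the mating pool.
        pool = [max(random.sample(pop, 2), key=fitness) for _ in range(pop_size)]
        nxt = []
        for a, b in zip(pool[::2], pool[1::2]):
            if random.random() < p_cross:          # one-point crossover
                site = random.randint(1, n_bits - 1)
                a, b = a[:site] + b[site:], b[:site] + a[site:]
            nxt += [a, b]
        # Mutation: occasional low-probability bit flips.
        pop = [[bit ^ (random.random() < p_mut) for bit in ind] for ind in nxt]
    return max(pop, key=fitness)

# Example: maximize the number of 1-bits ("one-max").
print(genetic_algorithm(fitness=sum))

Run on the one-max fitness, the population reliably converges toward the all-ones string, illustrating how selection and crossover juxtapose fit substrings while mutation preserves diversity.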
9.8.4 GA Applications
The simplest GAs are discrete, nonlinear, stochastic, highly dimensional algorithms
operating on problems of infinite variety. Due to this, GAs are hard to design and
analyze [5].
Suppose we choose to solve the problem for N = 4. This means that the board size
is 42 = 16, and the number of queens we can fit on the board without attacking
each other is 4. A configuration of 4 queens can be represented as shown in Fig. 9.8,
using a 4-digit string made of decimal numbers in the range 1–4. Each digit in a string
represents the position of the queen in that column. Thus, all queens placed on the left-to-right
diagonal would be represented by the string 1234.
To solve this problem we take the initial population as [1234, 2342, 4312, 3431]. Let
us recombine by randomly choosing the crossover after digit position 2. We recom-
bine members 1, 2 and 3, 4 of the population, producing [1242, 2334, 4331, 3412]. When
this is combined with the original population, we get [1234, 2342, 4312, 3431, 1242,
2334, 4331, 3412]. Next, a random mutation is applied on members 3431 and 2334;
changing the third digit gives 3421, 2324. Thus the new population is [1234, 2342, 4312,
3421, 1242, 2324, 4331, 3412].
The fitness of a string is proportional to the inverse of the number of queens giving
check in each string. For example, in the configuration in Fig. 9.8, the total number of
checks of Q 1 . . . Q 4 is 2 + 3 + 2 + 1 = 8. Similarly, in configuration 1234 the total
number of checks is 12, because each queen attacks the remaining three. These numbers for
the remaining seven configurations are: 8, 4, 4, 4, 6, 8, 8; the fitness of each string
follows accordingly.
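A fitness function of this kind can be sketched as follows (ours); it counts, for each queen, how many others it checks, matching the book's ordered counting (12 for 1234, 8 for the configuration of Fig. 9.8):

def checks(config):
    """config: string like '1234'; digit i gives the row of the queen in column i.
    Counts ordered pairs of queens attacking along rows or diagonals."""
    n, total = len(config), 0
    rows = [int(c) for c in config]
    for i in range(n):
        for j in range(n):
            if i != j and (rows[i] == rows[j]
                           or abs(rows[i] - rows[j]) == abs(i - j)):
                total += 1
    return total

def fitness(config):
    return 1.0 / (1 + checks(config))   # 1 only when no queen gives check

print(checks("1234"))   # 12: every queen attacks the other three
print(checks("2341"))   # 8, as counted for the configuration of Fig. 9.8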
Fig. 9.8 A configuration of four queens, represented by the string 2341
9.9 Summary
A heuristic is defined in the form of a function f (n) = g(n) + h(n), where g(n) is
the distance between the start node and the node n, and h(n) is the distance
between node n and the goal node.
One of the heuristic methods is hill-climbing. The name comes from the fact that
to reach the top of a hill, one selects the steepest path at every node, out of the number
of alternatives available. A Hill-climbing method is called greedy local search.
Among all the heuristic-based problem-solving strategies, the informed best-first
search strategy is one of the most popular methods to exploit heuristic information
to cut down the search time. This method uses heuristic information to assess the merit of
each candidate search avenue and continues exploration along the direction of highest merit.
Several shortcuts help to simplify the computation of best-first search. First, if the
evaluation function f (n) used for node selection always provides optimistic estimates
of the final costs of the candidate paths evaluated, then we can terminate the search
as soon as the first goal node is selected for expansion, without compromising the
optimality of the solution used. This guarantee is called admissibility.
We are often able to purge from search tree T , large sets of paths that are recognized
at an early stage to be dominated by other paths in T . This becomes possible if the
evaluation function f is order-preserving, that is, if for any two paths p1 and p2 ,
leading from start node S to n, and for any common extension p3 of those paths, the
following holds:
f ( p1 ) ≥ f ( p2 ) ⇒ f ( p1 p3 ) ≥ f ( p2 p3 ) (9.11)
Genetic Algorithms (GAs) are search procedures based on natural selection
and genetics. A GA processes a finite population of fixed-length binary strings. In
practice, the strings may be bit codes, k-ary codes, real (floating-point) codes, permutation
(order) codes, etc.
A simple GA consists of three operators: selection, crossover, and mutation. Selec-
tion is the survival of the fittest within the GA. Each member of the population is assigned
a fitness value. On the basis of these fitnesses, the selection phase assigns zero or
more copies of each individual. After selection, the genetic operators of crossover
and mutation are applied probabilistically to the current population to produce the new
population.
The special features of GAs are as follows:
• GAs can solve hard problems quickly and reliably,
• GAs are easy to interface to existing simulations and models,
• GAs are extensible, and
• GAs are easy to hybridize.
Table 9.1 gives a comparison of various heuristic methods.
Exercises
5. Prove that, if the heuristic function h never overestimates by more than a constant
cost c, then algorithm A∗ making use of h returns a solution whose cost exceeds
that of the optimal solution by no more than c.
6. Give the name of the algorithms that results from each of the following special
cases:
a. Local beam search with k = 1.
b. Local beam search with k = ∞.
c. Simulated annealing with T = 0 at all times.
d. Genetic algorithm with population size N = 1.
7. Explain how you will use best-first search in each of the following cases. Give
the data structure and explain the logic.
a. Speech recognition
b. PCB design
c. Routing telephone traffic
d. Routing Internet traffic
e. Scene analysis
f. Mechanical theorem proving
8. What type of data structure is suitable for implementing best-first search, such
that each node in the frontier is directly accessible, and all the vertices behind it
remain in the order in which they were visited?
9. Answer the following in one sentence/one word.
a. How will you detect during the search of a graph that a particular node has
been already visited?
b. Is the best-first search optimal?
c. Is the best-first search order-preserving?
d. Is the best-first search admissible?
10. In the Traveling Salesperson Problem (TSP) one is given a fully connected,
weighted, undirected graph and is required to find the Hamiltonian cycle (a
cycle visiting all of the nodes in the graph exactly once) that has the least total
weight.
11. Show that if a heuristic is consistent, then it can never overestimate the cost
to reach the goal state. In other words, if a heuristic is monotonic, then it is
admissible.
12. Suggest an admissible heuristic that is not consistent.
13. Can GAs get stuck at local maxima? If not, how do GAs try to avoid them?
If yes, justify.
14. Explain the different data structures that can be used to implement the open list in
BFS, DFS, and best-first search.
15. Find out the worst-case memory requirements for best-first search.
16. If there is no solution, will A∗ explore the whole graph? Justify.
17. Define and describe the following terms related to heuristics:
Admissibility, monotonicity, informedness.
18. Show that:
a. The A∗ algorithm will ultimately terminate.
b. During the execution of the A∗ algorithm, there is always a node in the
open-list that lies on a path to the goal.
c. If there exists a path to the goal, the algorithm A∗ will terminate by finding a
path to the goal.
d. If there is no solution, the A∗ algorithm will explore the whole graph.
19. Discuss ways in which the h function in f (n) = g(n) + h(n) can be improved
during the search.
20. Why does the A∗ algorithm work properly on graph search even when the graph has
cycles?
21. For the graph shown in Fig. 9.5, find out whether the A∗ search for this graph is
a. Optimal?
b. Order-preserving?
c. Complete?
d. Sound?
22. Apply the BFS algorithm for robot path planning in the presence of obstacles for
the 8 × 8 square matrix given in Fig. 9.10. Write an algorithm to generate the
frontier paths. Assume that each move of the robot in the horizontal (H) and vertical
(V) direction covers a unit distance, and that the robot can take only H and V moves. The
start and goal nodes are marked as S and G. Shaded tiles indicate obstacles, i.e., the
robot cannot pass through these.
23. Redesign the robot problem shown in Fig. 9.10 for A∗ search. Assume that the
value of h is the number of squares given by (V − i) + (H − j), where V and H are
both 8.
Fig. 9.10 An 8 × 8 grid for robot path planning, with start S, goal G, and shaded obstacle tiles
24. Solve the 8-puzzle manually for 20 steps, where the heuristic is the number of tiles
out of place with respect to the goal state. Assume that f ∗ (n) = g ∗ (n) + h ∗ (n)
= 0 + h ∗ (n) = h ∗ (n), so that only the heuristic is the deciding factor for the next node.
Note that the algorithm shall be DFS. In case of ties, give preference to the nodes
which are to the left of the search tree.
25. Find an appropriate state-space representation for the following problems. Also,
suggest suitable heuristics for each.
a. Cryptarithmetic problems (e.g., TWO + TWO = FOUR)
b. Towers of Hanoi
c. Farmer, Fox, Goose, and Grain.
26. Suggest appropriate heuristics for each of the following problems:
a. Theorem proving using resolution method
b. Blocks world
27. If P = "the heuristic is consistent" and Q = "the heuristic is admissible", show
that P ⇒ Q. Demonstrate by counterexample that Q ⇏ P.
28. Consider the magic-square puzzle shown in Fig. 9.11. Suggest the formalism for search-
ing the goal state when started from the start state. (Note that in the goal state
all the rows, columns, and diagonals have sums equal to 15.)
29. Make use of a GA to solve the 4-puzzle (Fig. 9.12). A move consists of sliding one
of the tiles 1, 2, or 3 into the blank position. Such a move creates a blank tile at
a different position, and the process is repeated until the goal state is reached. The
solution requires not only reaching the goal state but also finding the trace of the path
to the goal. Construct a suitable fitness function to implement the search by
GA; the search should consider only those members of the population which
correspond to valid moves.
30. For the graph shown in Fig. 9.13, make use of DFS with a certain depth cut-off,
backtracking the search at that cut-off.
31. Use best-first search for Fig. 9.14 to find out if the search from start node A to
goal node G is
Fig. 9.11 Start state and goal state of the magic-square puzzle
Figs. 9.13 and 9.14 Graphs for Exercises 30 and 31 (Fig. 9.14 has start node A and goal node G)
a. Order-preserving
b. Admissible
c. Optimal
32. How will you apply simulated annealing in the following scenarios? For each
case, give the problem formulation so as to compute ΔE, the temperature T , the
states s, s ; the state space S, and the objective function f : S → R, and perform 5–10
iteration steps manually.
a. 8-puzzle
b. 8-Queen problem
c. Tic-tac-toe problem
33. One fine morning you find that your laptop is not booting. There can be innumer-
able reasons for this. Assume that you are an expert in the installation and maintenance
of laptops. Represent the search process for troubleshooting the laptop by con-
structing a search tree.
a. What search method do you consider most appropriate for this?
Justify the particular method you have chosen.
b. What heuristics would you suggest for making the search efficient?
c. What are the characteristics of this search? Comment on the admissibility,
monotonicity, and completeness of this solution.
34. Assume a population of 10 members, and apply a GA to find the solution
by performing five cycles of iteration, each having selection, mutation, and
crossover. Verify that after performing these iterations we are far closer to the solution
than we were at the beginning. Represent the members as bit strings {0, 1}n
for some integer n. Also, fix some criteria for the fitness function, as well as the
probability of mutation and the point of crossover, for:
a. 8-puzzle
b. square-root of a number
c. Factors of an integer
35. What are the consequences of the following special cases of GA-based search?
a. Only the selection operation is performed in each iteration, based on the
fitness value.
b. Only the crossover operation is performed in each iteration at a random
position.
c. Only the mutation operation is performed in each iteration at a random bit
position.
36. Simulated annealing is guided by a changing "temperature" value that determines
the likelihood of visiting nodes that appear to be worse than the current node.
How does the search behave for very low and very high temperature values, and
why does it behave so?
37. Select the best alternatives in each of the following questions.
i. The mutation operation is good for the following:
(a) noise tolerance (b) hill-climbing
(c) random walk (d) all above
ii. The following operation of GA has maximum contribution to search:
(a) mutation (b) selection
(c) crossover (d) fitness function
iii. What operation of GA is responsible for random walk?
(a) mutation (b) crossover
(c) none of the above (d) both a and b
iv. GAs are not good for the following purpose:
(a) finding exact global optimum (b) local search
(c) approximate solution (d) global search
v. GAs are good in environments which are:
(a) complex (b) noisy
(c) dynamic (d) all above
vi. GA is always:
(a) P (b) nondeterministic
(c) NP (d) deterministic
References
1. Bagchi A, Mahanti A (1985) Three approaches to heuristic search in networks. J ACM 32(1):1–27
2. Dechter R, Pearl J (1985) Generalized Best-First Search strategies and the optimality of A∗ . J
ACM 32(3):505–536
3. Forrest S (1993) Genetic algorithms: principles of natural selection applied to computation.
Science 261:872–878
4. Goldberg DE (1989) Genetic algorithms in search, optimization and machine learning. Addison-
Wesley, Reading
5. Goldberg DE (1994) Genetic and evolutionary algorithms come of age. Commun ACM
37(3):113–119
6. Hart PE et al (1968) A formal basis for the heuristic determination of minimum cost paths. IEEE Trans Syst Sci Cybern 4(2):100–107
7. Kirkpatrick S et al (1983) Optimization by simulated annealing. Science 220(4598):671–680
8. Korf RE et al (2005) Frontier search. J ACM 52(5):715–748
Chapter 10
Constraint Satisfaction Problems
10.1 Introduction
This chapter presents the theory of a special type of problems, called Constraint
Satisfaction Problems (CSP), where one needs to search the state space, but every move
is subject to the fulfillment of certain constraints; this is different from the selection
of the best-fitting state in best-first search or in hill-climbing. Many combinatorial search
problems can also be expressed in the form of CSPs, where the aim is to find an
assignment of values to a given set of variables, subject to some specified conditions
(constraints). For example, the SAT (satisfiability) problem may be viewed as a
CSP where variables are assigned only Boolean values, such that the assignment satisfies
the constraints given in the form of clauses.
A general case of a constraint satisfaction problem is known to be NP-hard.
However, a restricted version of the problem can be made tractable by
imposing restrictions on the form of constraint interconnections [7].
In an algorithm for the solution of a CSP, the constraints imposed on the previous
states also propagate to the next states. If there is a violation of constraints, we
need to backtrack and try other branches in the tree constructed. The constraints only
Applications of CSPs range from database queries, e.g., find all x, y, z such that x,
y are components, z is an assembly, and failure of x causes the failure of y, to scene
analysis, where it is required to analyze the scene by segmenting it; for example, the sky
region is labeled blue, vegetation regions are green, and cars are totally surrounded
by either grass or sky. Following are some applications of CSPs:
Consider the case of boundary relations between states of Indian territory, where
Madhya Pradesh (MP) is connected to Uttar Pradesh (UP), Rajasthan (RAJ), Gujrat
(GUJ), Maharastra (MAH), and Chattisgarh (CHG), all of which surround MP as shown
in Fig. 10.1.
Assume that we have been given the task of coloring each region either red,
green, or blue in such a way that no neighboring regions have the same color. To
formulate this as a CSP, we define the variables to be the regions: MP, RAJ , GU J ,
MAH , CH G, and U P. The domain of each variable is the set {red , green, blue}.
The constraints require neighboring regions to have distinct colors; for example, the
allowable combinations for RAJ and GU J form the set of pairs:
{(red, green), (red, blue), (green, red), (green, blue), (blue, red), (blue, green)}.
Fig. 10.1 Map of MP and its neighboring states UP, RAJ, GUJ, MAH, and CHG
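As a data structure, this formulation is just variables, domains, and binary "differ" constraints; the sketch below (ours) writes it out in Python, where the adjacency pairs beyond MP's own borders are our reading of Fig. 10.1:

variables = ['MP', 'UP', 'RAJ', 'GUJ', 'MAH', 'CHG']
domains = {v: {'red', 'green', 'blue'} for v in variables}
# Each pair is a constraint scope; the relation is "colors must differ".
# Adjacencies other than MP's five borders are assumptions for illustration.
neighbors = [('MP', 'UP'), ('MP', 'RAJ'), ('MP', 'GUJ'),
             ('MP', 'MAH'), ('MP', 'CHG'), ('RAJ', 'UP'),
             ('RAJ', 'GUJ'), ('GUJ', 'MAH'), ('MAH', 'CHG')]

def satisfied(assignment):
    """Check a (possibly partial) assignment against all constraints."""
    return all(assignment[a] != assignment[b]
               for a, b in neighbors if a in assignment and b in assignment)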
⟨V, D, C⟩, (10.1)
where,
V = {v1 , v2 , . . . , vn }, is a finite set of variables;
D = {D1 , D2 , . . . , Dn } is a finite set (domains) of values (which may be finite or
infinite); and
C = {C1 , C2 , . . . , Cn } is a finite set of constraints.
Associated with each variable v is a finite, discrete domain Dv . A constraint c
on the variable set Vk ⊆ V, in its extensional form, is a subset of the Cartesian
product of the domains of the affected variables. The expression var(c) denotes the
tuple of variables on which the constraint is defined, and the relation rel(c) ⊆ ×v∈var(c) Dv
is the relational information of the constraint c. The assignment of a
value d ∈ Dv to a variable v is denoted by v := d. A tuple t of assignments to
variables Vk ⊆ V satisfies a constraint c if and only if t[var(c)] ∈ rel(c).
Each constraint ci is a pair (Vi , Ri ) (usually represented as a matrix), where Vi ⊆ V
is a set of variables, called the constraint scope, and Ri is a set of (total) functions
from Vi to D, called the constraint relation (an n-ary relation). Thus an evaluation
of the variables is a function f from the set of variables to the domain of
values, f : V → D. The evaluation f satisfies a constraint ⟨(v1 , . . . , vn ), R⟩
if ( f (v1 ), . . . , f (vn )) ∈ R. A solution is an evaluation that satisfies all constraints.
Hence, given an instance of a constraint satisfaction problem, its solution is a
function f from the set of variables (V ) of that instance to the set of values (domain
D), i.e., f : V → D. This mapping is subject to the condition that for each constraint
(Vi , Ri ) ∈ C, the restriction of f to Vi , denoted f | Vi , is an element of Ri .
If we fix an ordering on the variables (w, u, v), then the same relation can be
represented as a set of 3-tuples of real numbers:
The specifications in the constraint relations indicate the scope for allowed com-
binations of simultaneous values for the different variables. The number of variables in
the scope of a constraint is called the arity of the constraint. For
example, a unary constraint specifies the allowed values for a single variable, while a
binary constraint specifies the allowed combinations of values for a pair of variables.
In addition, there is an empty constraint (i.e., ⟨∅, ∅⟩), having a scope of variables and
a relation that are both empty.
We represent constraints' relations as sets of functions, rather than as sets of tuples.
In fact, both representations are equivalent: by fixing the ordering of variables
in the scope of constraints, we can associate the functions with the corresponding
tuples. But we shall prefer to use functional representations because they simplify the
definitions, as the following examples demonstrate.
Let the variables take Boolean values, so the set of possible values for the variables
is {T , F}. Given this, the logical formula 'x1 ∧ x2 ∧ ¬x3 ' can be used to specify the
constraint with scope {x1 , x2 , x3 }, such that the following constraint relation holds.
In a similar way, when the set of possible values is the set of real numbers R, the
equation 'x1 + 2x2 = 7' can be used to specify the constraint with scope {x1 , x2 }, for which
the following constraint relation holds.
For the representation of constraints, such as the relationship between height and
weight of a person, one of the following possibilities can be used:
(actual weight)/(normal weight) < 0.8 (10.6)
In an electrical circuit, a resistance R is a constraint relating the voltages V1 and V2 (see Fig. 10.2).
Extending this logic, all the circuit components can be represented
as constraints and connected by common variables (voltages and currents) to form a
constraint network. If the voltage and current of the power source are known, the values
are propagated through the constraint network in the same way as a real current, so
that a value can be calculated for each variable.
In the following we will consider how restricting the allowed constraints affects
the complexity of this decision problem.
A CSP can also be considered as a decision problem: given a set of allowed
constraints, the question is whether an instance whose constraints are drawn from
this set has a solution.
It is helpful to visualize a CSP as a constraint graph (network). The nodes of the graph
correspond to variables of the problem and the edges correspond to constraints.
The representation of states in a CSP follows a standard pattern: a
set of variables with assigned values, with the successor function and goal test written in a
generic way. In addition, one can develop effective and generic heuristics that require
no additional domain-specific expertise.
It is fairly easy to see that a CSP can be cast as a standard search problem, as follows:
• Initial state: the empty assignment {}, in which all variables are unassigned.
• Successor function: a value can be assigned to any unassigned variable, provided
that it does not conflict with previously assigned variables and does not violate
the constraints.
• Goal test: whether the current assignment is a complete assignment.
• Path cost: a constant cost (e.g., 1) for every step.
All CSPs have similar steps for the solution. Every solution appears at depth
n for n variables, which indicates that a depth-first search approach fits. The path is
irrelevant, so a complete-state formulation can also be used. For domain size |D| = d,
the branching factor at the level where i variables are already assigned is

b = (n − i)d. (10.7)
Hence the naive search tree has

∏_{i=0}^{n−1} (n − i)d = n! d^n (10.8)

leaves, even though there are only d^n possible complete assignments.
In the figure, the nodes identified as {red green}1 and {red green}2 contain the possible
values for variables X1 and X2 . The loop symbol (⊂) at each node indicates that a
unary constraint is applicable to the node, and the link "not-same-color" between the
nodes corresponds to a binary constraint between them. The first level of inconsistency
in constraint networks is node inconsistency. Note that the potential domain values
in this example are red and green, while the unary constraint for each node is red.
Hence, we can immediately eliminate green from both the nodes X1 and X2 . On
doing this elimination, we get the following network.
not-same-color
⊂ {red}1 {red}2 ⊃
The second level of inconsistency is arc inconsistency. The arc between X1 and X2
is inconsistent because the value "red" in X1 does not satisfy the constraint
"red is not-same-color as red." To satisfy this constraint, we remove the color "red"
from both nodes. This cuts down the search space, but unfortunately, in this case it
reflects the fact that the problem is impossible to solve. Hence, there is
no global solution, and the network is "unsatisfiable."
1 A complete graph is one in which all possible edges between pairs of vertices are present.
It is quite possible for a network to have no arc inconsistencies and still be
unsatisfiable. Now, let us consider the problem of coloring a complete three-vertex
graph with two colors, represented in Fig. 10.3.
Let us assume that the set of possible color values for each variable is {red green}.
The binary constraint between each pair of nodes is the same: "is not the same color
as." This network is arc-consistent; e.g., given the value "red" for X1 , X2 may be
assigned "green", and "red is not the same as green" is satisfied. But there is no
way to choose single values a1 , a2 , a3 for X1 , X2 , X3 , respectively, so that all three
binary constraints are simultaneously satisfied. If we choose red for X1 , the binary
constraint forces us to choose green for X2 . Having chosen green for X2 , the binary
constraint forces us to choose red for X3 , and that forces us to choose green for X1 ,
but red has already been chosen for X1 .
Nevertheless, it may be helpful to remove arc inconsistencies from a network.
This involves comparing nodes with their neighbors, as we did above. Each node
must be so compared; however, comparisons can cause changes (deletions) in the
network, and so the comparisons must be iterated until a stable network is reached.
These iterations can propagate constraints some distance through the network. The
comparisons at each node can theoretically be performed in parallel and this parallel
pass iterated [3].
Thus, removing arc inconsistencies involves several distinct ideas: local con-
straints are globally propagated through iteration of parallel local operations. It
remains to be seen which aspects of this process are most significant to its appli-
cation. The parallel possibilities may prove to be particularly important.
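One concrete realization of this idea is the classical AC-3-style procedure; the following sketch (ours) deletes unsupported values and re-queues affected arcs, demonstrated on the unsatisfiable two-node example above:

from collections import deque

def revise(domains, x, y, constraint):
    """Remove values of x with no value of y satisfying constraint(vx, vy)."""
    removed = False
    for vx in set(domains[x]):
        if not any(constraint(vx, vy) for vy in domains[y]):
            domains[x].discard(vx)
            removed = True
    return removed

def ac3(domains, arcs, constraint):
    queue = deque(arcs)
    while queue:
        x, y = queue.popleft()
        if revise(domains, x, y, constraint):
            if not domains[x]:
                return False                 # wiped-out domain: unsatisfiable
            # Re-examine arcs into x, since deletions can propagate.
            queue.extend((z, x) for z, w in arcs if w == x and z != y)
    return True

# Two nodes, both constrained to red, linked by "not the same color":
domains = {'X1': {'red'}, 'X2': {'red'}}
arcs = [('X1', 'X2'), ('X2', 'X1')]
print(ac3(domains, arcs, lambda a, b: a != b))   # False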
A more powerful notion of inconsistency is called path inconsistency. A
network is path inconsistent if there are two nodes X1 and X2 such that 'a' satisfies
X1 , 'b' satisfies X2 , and a and b together satisfy the binary constraint between them,
yet there is some other path through the network from X1 to X2 such that there is
no set of values, one for each node along the path, which includes a and b and can
simultaneously satisfy all constraints along the path. For example, the network in
Fig. 10.3 is path inconsistent: red satisfies X1 , green satisfies X3 , and red is not the same
color as green; however, there is no value for X2 which will satisfy the constraints between
X1 and X2 and between X2 and X3 while X1 is red and X3 is green.
Consider the problem of coloring the complete four-vertex graph with three colors
(Fig. 10.4). Each node contains red, green, and blue, and each arc again represents
the constraint "is not the same color as." In particular, path consistency does not fully
determine the set of values satisfying the global constraint, which in this unsatisfiable
case is the empty set.
In conclusion, arc and path consistency algorithms may reduce the search space,
but do not in general fully synthesize the global constraint. When there are multiple
solutions, additional searches will be required to specify the several acceptable com-
binations of values. Even a unique solution may require further search to determine,
and the consistency algorithms may even fail to reveal that no solutions at all exist.
As the coloring problem suggests, the general problem of synthesizing the global
constraint is NP-complete, and thus unlikely to have an efficient (polynomial time)
solution. On the other hand, it is found that in specific applications it may be possible
to greatly facilitate the search for solutions.
There are two key observations that have motivated this approach.
1. Node, arc, and path consistency in a constraint network for n variables can be
generalized to a concept of k-consistency for any k ≤ n.
2. The given constraints can be represented by nodes, as opposed to links, in a
constraint network; we can add nodes representing k-ary constraints to a constraint
network for all k ≤ n (irrespective of whether a corresponding k-ary constraint
is given or not). These constraints are then propagated in the augmented net to
obtain higher levels of consistency.
Fig. 10.5 A constraint network over the variables X1 , X2 , X3 (node X2 has domain {e f })
b e d,
b f g.
But the consistency of all the variables has not yet been verified simultaneously;
this is covered using the concept of the augmented network.
Augmented Network
So far we have merely achieved a sort of "arc consistency" (though we indicated the
restriction of the pair bc (in C13 ), as well as of the elements a and c).
Fig. 10.6 The augmented network, with unary node {b}1, node {d g}3, and the binary node {bd bg}13 linked to the ternary node
Next, we add a node for the ternary constraint (i.e., the k-ary constraint with k ≤ n, k = 3,
n = 3). Note that no order-three constraint was given originally, so initially we
could assume there was no constraint of order three. But from the given binary and
unary constraints, we can construct the ternary constraint.
Note that C1 and C23 together allow only {bed, bfg} as the set of triples. This is
used as the ternary node, and we link it to the binary nodes as shown in Fig. 10.6, which
makes an augmented network.
We look at the new node from its neighbors and vice versa, as we did earlier, to
ensure consistency of the sort we obtained earlier between neighboring nodes. C13
is consistent with the new node: bd is part of bed , bg part of bf g. Similarly C12 and
C23 are consistent with the ternary node. If necessary, we could propagate deletions
around until local consistency is achieved on this augmented network. However,
in this case, the network is already stable; no further changes are required. The
ternary node represents the synthesis of the given constraints. There are two ways
to simultaneously satisfy the given constraints: X1 = b, X2 = e, X3 = d ; or X1 = b,
X2 = f , X3 = g.
In a similar way, nodes can be created for higher-order, e.g., quaternary, and k-ary
consistency checks.
Example 10.6 Applying the synthesizing algorithm to solve the coloring problem.
We apply the synthesizing algorithm discussed above for this purpose. There are a
total of six unary constraints C1 −C6 , and nine arc constraints. Each arc, which stands for
a binary constraint, indicates that Ci 's color "is not the same as" Cj 's. To begin with, all the states are
assigned rgb (Fig. 10.7a). After this, MP is constrained to r; therefore, r
is removed from the rest (Fig. 10.7b). In the next and final stage, node consistency is
ensured, starting from U P and propagating to the rest of the network, while simultaneously
maintaining arc consistency, as shown in Fig. 10.7c.
The progressive assignment of colors is also shown in Table 10.1.
Fig. 10.7 Stages (a)–(c) of solving the map-coloring problem by constraint propagation; in the final network, MP = {r}, RAJ = {b}, UP = {g}, etc.
variables. However, these constraints only specify the relations between the variables,
without any computational procedure to enforce such relations. The solution
to such a CSP is an algorithm which arrives at results while satisfying the constraint
relations. Finite-domain constraint satisfaction problems are typically solved
using some type of search. The most common techniques used for solutions to CSPs
comprise three things: some variant of backtracking search, constraint propagation,
and local search [4].
There are several variants of backtracking. Backtracking ensures that the solution is consistent, and it saves search time once it finds that a particular branch of the search is not going to lead to a solution. The backtrack is usually based on the value of a single variable, but it can be done on more than one variable. Learning through constraints infers and saves new constraints, which can later be used to avoid parts of the search. Through look-ahead, we can foresee the effects of choosing a variable or a value, and can determine in advance whether the resulting sub-problem can be satisfied or not.
The propagation of constraints is useful because it turns a problem into one that is equivalent but usually simpler to solve. In addition, it can prove satisfiability or unsatisfiability of the problem. The latter is not always guaranteed to happen, but it does happen for some forms of constraint propagation, for some kinds of problems, or both.
Constraint propagation in the course of solving a CSP modifies the CSP; in other words, constraint propagation enforces a local consistency, which is a set of conditions related to the consistency of a group of variables or constraints, or both.
Local search methods are incomplete satisfiability algorithms: they may find a solution to a problem, but they may also fail even if the problem is satisfiable. Local search works iteratively to improve an assignment of the variables. At each step of the iteration, the values of a small number of variables are changed, with the aim of increasing the number of constraints satisfied by the assignment.
A local search algorithm specific to CSPs is the min-conflicts algorithm. Local search appears to work well when the changes to variables are random in nature. Systematic search, when integrated with local search, is called hybrid search, and is found to give better results.
Yet another approach to CSP solution is neighborhood substitution. Under this approach, whenever a value a for a variable is such that it can be replaced in all constraints by another value b, the value a is eliminated. Neighborhood substitutions are useful for finding a single solution, as well as all possible solutions.
2. Analytical methods for the infinite domain of values D: Solutions here can be
found by analyzing the constraints as some (generalized) equation system, either
solving the constraints simultaneously or considering the constraints sequentially
one by one.
10.6 CSP Algorithms
A systematic search can be carried out in one of two ways: either using the generate-and-test approach, or through backtracking.
10.6.1 Generate and Test
We generate, one by one, all possible complete variable assignments and, for each one, test whether it satisfies all the constraints. The corresponding program structure is very simple: just nested loops, one per variable, with each constraint tested in the innermost loop. In most situations, this method is intolerably slow. The steps for generate-and-test are shown as Algorithm 10.1.
10.6.2 Backtracking
Using a computer algorithm to answer a question like "How many ways are there to do so and so?" or "List all possible ways to ..." usually requires an exhaustive search over a set of potential solutions. One technique for organizing such searches is backtracking, which works by continually trying to extend partial solutions. At each stage of the search, if an extension of the current partial solution is not possible or not found, we backtrack to a shorter partial solution and try a different extension of it. This method is used in a wide range of combinatorial problems, including parsing, game playing, and optimization [1].
We assume that the solution to a CSP consists of a vector (a1 , a2 , . . .) of undetermined length, which satisfies certain constraints on the components that make it a solution. Each ai (i = 1, 2, . . .) is a member of a finite, linearly ordered set Ai . Thus, an exhaustive search must consider all the elements of the set A1 × A2 × · · · × Ai , for i = 0, 1, 2, . . ., as potential solutions. At the beginning, we start with the null vector () as the partial solution, and the constraints tell us which members of A1 are candidates for a1 . Let this set be S1 ⊆ A1 ; the least element of S1 is taken as a1 , and the partial solution becomes (a1 ). In general, the various constraints that describe solutions tell us which subset Sk of Ak comprises the candidates for extending the partial solution (a1 , a2 , . . . , ak−1 ) to (a1 , a2 , . . . , ak−1 , ak ). If the partial solution (a1 , a2 , . . . , ak−1 ) allows no possibilities for ak , then Sk = ∅, and we backtrack to make a new choice for ak−1 . If no new choices exist for ak−1 , we backtrack still further and make a new choice for ak−2 , and so on, each time resuming with the value next after the previously selected one.
It is helpful to picture the above process as the tree traversal shown in Fig. 10.8. The searched subset of A1 × A2 × · · · × Ai , i = 0, 1, 2, . . ., can be represented as a tree as follows: the node at the 0th level (the root) is the null vector; the children of the root are the choices for a1 ; and, in general, the nodes at the kth level are the choices for ak , given the choices made for a1 , a2 , . . . , ak−1 , as indicated by the ancestors of these nodes. In the tree, the dotted lines show the backtracking traversal. Since the traversal goes as deep as possible in the tree before backing up to explore other parts of the tree, such a search is called depth-first search (DFS).
Backtracking Algorithm
Backtracking is implemented through recursion, maintaining a partial assignment of the variables. All variables are initially unassigned. At each step of the algorithm, a variable is chosen and all its possible values are assigned to it in turn. For each value assigned, the consistency of the partial assignment with the constraints is checked; for each consistent assignment, a recursive call is performed. When all values for a variable have been tried, the algorithm backtracks to the previous variable and tries a different value for it. In the basic backtracking algorithm, consistency is defined as satisfaction of all constraints whose variables are all assigned.
Since backtracking is no more than an "educated" exhaustive search procedure, it should be stressed that there exist numerous problems that even the most sophisticated application of backtracking will not solve in a reasonable length of time.
The assignment of values to variables is not checked in arbitrary order. The variables are tried in some order, placing in front those variables that are highly constrained or have smaller ranges. This ordering has an impact on the efficiency of solution algorithms: once the more constrained variables are satisfied, there is less chance of backtracking. In the process, we start assigning values to variables, check constraint satisfaction at the earliest possible time, and extend an assignment only if the constraints involving the currently bound variables are satisfied.
Algorithm 10.2 shows the steps for backtracking.
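The recursive scheme just described can be sketched in Python as follows; this is a hedged illustration of the idea behind Algorithm 10.2, with our own names, not the book's code.

def backtrack(assignment, variables, domains, consistent):
    # `assignment` maps variables to values; `consistent` checks all the
    # constraints whose variables are currently assigned.
    if len(assignment) == len(variables):
        return assignment                       # every variable assigned: a solution
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        assignment[var] = value
        if consistent(assignment):
            result = backtrack(assignment, variables, domains, consistent)
            if result is not None:
                return result
        del assignment[var]                     # undo, and try the next value
    return None                                 # exhausted: backtrack in the caller

# Usage: 3-coloring of a triangle graph; every adjacent pair must differ.
edges = [("X1", "X2"), ("X2", "X3"), ("X1", "X3")]
ok = lambda a: all(a[u] != a[v] for u, v in edges if u in a and v in a)
doms = {v: ["r", "g", "b"] for v in ("X1", "X2", "X3")}
print(backtrack({}, ["X1", "X2", "X3"], doms, ok))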
Variable and value ordering in backtracking are important. Note the following line contained in this backtracking algorithm:
v = select-unassigned-variable();
which selects the next unassigned variable. A static variable ordering rarely results in the most efficient search. We pick the variable that fails quickly (i.e., the one with the fewest legal values), to help faster exploration of the state space.
However, the following are problems with backtracking:
• Thrashing. Repeated failure due to the same reason, caused by earlier variable assignments.
• Redundancy. Rediscovering the same inconsistencies.
As an example, we use a backtracking algorithm to solve the problem of n × n non-attacking queens, i.e., in how many ways can n queens be placed on an n × n chessboard so that no two queens attack each other? The following example presents the solution of the n = 4 queens problem [10].
Example 10.7 Four/Eight Queens problem.
"Eight queens" is a classic problem of combinatorial analysis, which requires placement of eight "queens" on an 8 × 8 chessboard such that no pair of queens "attacks" each other, i.e., no two queens are on a common row, column, or diagonal. This problem is ideally suited for solution by a backtracking algorithm. Instead of 8 queens, here we use its scaled-down version: place four queens on a 4 × 4 chessboard so that no two queens attack each other.
Instead of beginning by examining all C(16, 4) = 1820 ways of selecting the locations for four queens on the 16 squares of a 4 × 4 board, the first observation is that each row must contain exactly one queen. Thus, to begin with, a queen is placed on the first square of the first row (Fig. 10.9a), and then another queen on the first available square of the second row (Fig. 10.9b). Observing that this prevents placing a queen anywhere in the third row (Fig. 10.9c), it is necessary to backtrack to the second row and move the queen there one square to the right (Fig. 10.9d), so that there is room for another queen in the third row.
We note that now there are no available squares (Fig. 10.9e) in the fourth row, as
every position in fourth row is crossed by some already placed queen. Therefore, it is
necessary to backtrack all the way to the first row (Fig. 10.9f). After this, there is one
available location in the second row (Fig. 10.9g), followed by suitable locations in the
third and fourth rows (Fig. 10.9h), which gives us a solution. Note that backtracking reveals that this solution is unique except for its mirror images, which are, of course, also solutions. The two mirror images can be obtained by reflecting the board about the X axis and the Y axis.
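A compact backtracking solver for this formulation, one queen per row, can be sketched in Python as follows (our own illustration; the names are not from the text). For n = 4, it finds exactly the two mirror-image solutions just discussed.

def queens(n, cols=()):
    # `cols` records the column chosen for each earlier row; a full tuple of
    # length n is a placement of n non-attacking queens.
    row = len(cols)
    if row == n:
        yield cols
        return
    for c in range(n):
        # The new queen is safe if it shares no column or diagonal with any
        # queen placed in an earlier row.
        if all(c != c2 and abs(c - c2) != row - r2
               for r2, c2 in enumerate(cols)):
            yield from queens(n, cols + (c,))

print(list(queens(4)))   # [(1, 3, 0, 2), (2, 0, 3, 1)]: a solution and its mirror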
Backtrack search techniques cut down the search space, since they need not test all combinations of possible assignments to variables: the search backtracks whenever it comes across an inconsistent assignment. However, backtracking often exhibits costly "thrashing" behavior.
Similar to the 4-queens case, the eight queens puzzle is the problem of placing eight chess queens on an 8 × 8 chessboard so that no two queens attack each other (see Fig. 10.10).
Although the backtracking algorithm does not try all the possible assignments, and hence is relatively efficient, there are a number of heuristics with which we can cut the cost further and make it more efficient. The following are ways to improve backtracking [1].
Branch merging
Branches of the search that are isomorphic to an already searched path should not be searched; instead, they should be merged with the already searched path. This reduces the total search time.
Preclusion
In the generation of solutions through backtracking, the backtrack should occur as soon as it is discovered that the current partial solution cannot lead to a solution, and would only add to the time required for finding one.
Search rearrangement
The search should be rearranged so that nodes of low degree occur at the beginning of the search and nodes of high degree occur in the latter part. Then the search tree does not grow explosively at the beginning, while high branching at the final nodes contributes little complexity. Since preclusion frequently occurs at a fixed depth, fewer nodes may need to be examined under this strategy. For example, when faced with the choice of several ways of extending the partial solution, e.g., which square to tile next or in which column to place the next queen, we choose the one that offers the fewest alternatives.
Branch and bound
When searching for a solution of minimum cost, once a solution is found all the partial
solutions with greater cost than the minimum are discarded. When this approach is
used, it is possible to get a good solution early in the search. Note that, this technique
is possible only when the costs are additive in nature.
10.7 Propagating of Constraints
The solution of a CSP requires propagation of constraints. Let us find out the basic mechanism for propagating them. Suppose we have to locally propagate a constraint CJ to a neighboring constraint CK ; for this, we should remove every aK ∈ CK that does not satisfy CJ .
Global propagation of a constraint is defined recursively. To globally propagate a constraint CJ , first locally propagate CJ to each neighboring constraint CK ; then, if anything was removed from CK due to the local propagation, globally propagate CK through all its neighbors. The rest of the propagation is similar to the arc consistency algorithm (see page 284).
In another variant, a constraint is looked at only when one of its variables is chosen, by selecting a variable that is yet unassigned. This also reduces the propagation overhead, since before that time the propagation is not required. In the following, we discuss the various techniques for the propagation of constraints.
We can drastically reduce the search space by looking at some of the constraints early, even before the search is started. This process is called forward checking. When a variable, say X , is assigned a value, the forward checking process checks each unassigned variable Y that is connected to X by some constraint, and deletes from Y 's domain every value that is inconsistent with the value chosen for X . To apply this to the map coloring of Fig. 10.7, we keep track of the remaining legal values of the unassigned variables, as shown in Table 10.1. When red is assigned to MP, the red color is dropped from all the rest, as it would otherwise conflict with MP's red. Hence, in the table, once MP is assigned R, the R is deleted from all the remaining states (i.e., the variables Y ). This reduces the branching on the variables altogether, by propagating information from MP to all the other states. In this process, we terminate an assignment whenever some variable is left with no legal value. Try the forward checking for the 4-queens problem yourself.
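A single forward-checking step can be sketched in Python as follows (a hedged illustration of the idea, with our own names, applied to a fragment of the map-coloring problem):

def forward_check(domains, var, value, neighbors):
    # Assign `value` to `var` and delete it from the domains of the
    # neighboring unassigned variables; return None on a dead end.
    new = {v: set(d) for v, d in domains.items()}
    new[var] = {value}
    for y in neighbors[var]:
        new[y].discard(value)        # the "not the same color" constraint
        if not new[y]:
            return None              # some variable has no legal value left
    return new

doms = {"MP": {"r", "g", "b"}, "RAJ": {"r", "g", "b"}, "UP": {"r", "g", "b"}}
nbrs = {"MP": ["RAJ", "UP"], "RAJ": ["MP"], "UP": ["MP"]}
print(forward_check(doms, "MP", "r", nbrs))   # r is removed from RAJ and UP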
A constraint graph is shown in Fig. 10.11, for the domain values {1, 2, 3, 4}. Follow-
ing are the constraints:
1. There are two unary constraints: Variable v1 cannot take values 3, 4, and variable
v2 cannot take value 4.
2. There are eight binary constraints, stating that variables connected by edges cannot
have the same value.
It is required to find the solution using the following heuristics: minimum value
heuristics (MVH), degree heuristics (DH), and forward checking (FC).
The first variable, v1 , is assigned the minimum value (i.e., 1) from its available set of values. Forward checking with MVH next assigns the minimum available value to v2 , while checking forward, i.e., against v3 and v5 , that there is no conflict. Next, using FC, MVH, and DH, we assign 3 to v3 . Using FC and MVH we assign 4 to v4 . Finally, the variable v5 is assigned 1, through forward checking.
In the above, the degree heuristic selects the variable that is involved in the largest number of constraints on the other unassigned variables; it is useful for tie-breaking. Complementary to it is least-constraining-value selection: given a variable, choose the value that constrains the remaining variables least, i.e., leaves the maximum flexibility for subsequent assignments.
10.8 Cryptarithmetics
Cryptarithmetic codes are typically used for encrypting some information, or for hiding an encryption key such that it can be recovered on solving the given constraints. Cryptarithmetic puzzles are CSPs that lie at the heart of studies of human and computer problem-solving. Usually, the problem to be solved is to find a unique digit assignment to each of the letters such that the numbers represented by the words satisfy the given addition or subtraction, or any other given arithmetic. This problem turns out to be complex, however: in decimal representation, the constraint of a unique digit mapping to a unique letter reduces the total number of states from 10^n to 10!/(10 − n)!, where n is the number of distinct letters in the problem.
An example of such a problem is:
  G E R A L D
+ D O N A L D
=============
  R O B E R T
whose solution is
  1 9 7 4 8 5
+ 5 2 6 4 8 5
=============
  7 2 3 9 7 0
Solving such a cryptarithmetic problem requires an agent to perform a search. The speed at which these problems can be solved depends on many factors, such as the initial conditions and the particular sequence of actions chosen by the agent while moving through the search space. The chosen sequence depends on knowledge about the problem/puzzle to be solved, as well as knowledge about which steps to take next.
A simple non-cooperative search strategy is generate-and-test, where an assignment is made to each letter and tested to see whether the problem is solved or partly solved. In the following example, we show the detailed steps of generate-and-test, and use backtracking to solve a similar problem [9].
Example 10.9 Cryptarithmetic puzzle.
Given the sum below, find the values of the 8 variables S, E, N , D, M , O, R, Y satisfying the validity of the sum, such that each variable gets a unique value from the set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.
S E N D
+ M O R E
========
M O N E Y
Writing v1 , . . . , v4 ∈ {0, 1} for the carries of the four column additions, the constraints are:
c1 : D + E = Y + 10 × v1
c2 : v1 + N + R = E + v2 × 10
c3 : v2 + E + O = N + v3 × 10
c4 : v3 + S + M = O + v4 × 10
c5 : M = v4
D, E, N , R, O, Y ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9},
M , S ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9}.
Since M = v4 and a carry is at most 1, we get M = 1; removing 1 from the domains of the remaining variables gives
D, E, N , R, O, Y ∈ {0, 2, 3, 4, 5, 6, 7, 8, 9}.
Assume that v3 = 1. Then, since v4 = M = 1, constraint
c4 : v3 + S + M = O + v4 × 10
becomes
1 + S + 1 = O + 1 × 10,
so S = O + 8; as S ≤ 9 and the digit 1 is already taken by M , this forces O = 0. Substituting O = 0 and v3 = 1 in
c3 : v2 + E + O = N + v3 × 10
gives
∴ v2 + E = N + v3 × 10 = 10 + N .
The above cannot be satisfied: since v2 ≤ 1 and E < 10, the left side is at most 10, which would force N = 0, while O = 0 already. This shows that the assumption v3 = 1 was wrong. Hence this and all the derived values must be reset, and we backtrack to select v3 = 0.
With v3 = 0, constraint c4 becomes 0 + S + 1 = O + 10, i.e.,
S = O + 9.
Since S ≤ 9, this forces O = 0 and S = 9. Then
c3 : v2 + E + O = N + v3 × 10
v2 + E + 0 = N + 0 × 10
∴ v2 + E = N .
Since E ≠ N , we must have v2 = 1, i.e., N = E + 1. Substituting in
c2 : v1 + N + R = E + v2 × 10
v1 + 1 + E + R = E + 1 × 10
∴ v1 + R = 9.
As S = 9 is already taken, R ≠ 9, so v1 = 1 and R = 8.
The digits 0, 1, 8, and 9 are now taken by O, M , R, and S, so
D, E, N , Y ∈ {2, 3, 4, 5, 6, 7}. (10.9)
Then constraint c1 gives D + E = Y + 10 × v1 , that is,
D + E = Y + 10. (10.10)
Within the domain (10.9), and with N = E + 1, the only consistent choice is E = 5 (hence N = 6), D = 7, and Y = 2, since D + E = 12 = Y + 10. The completed sum is:
S E N D 9 5 6 7
+ M O R E + 1 0 8 5
========= => =========
M O N E Y 1 0 6 5 2
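As a cross-check, a plain generate-and-test solver over the 10!/(10 − 8)! = 1,814,400 unique digit assignments finds the same solution; the following Python sketch is our own illustration, not the book's algorithm.

from itertools import permutations

def solve_send_more_money():
    letters = "SENDMORY"
    for digits in permutations(range(10), len(letters)):
        a = dict(zip(letters, digits))
        if a["S"] == 0 or a["M"] == 0:    # leading digits must be nonzero
            continue
        num = lambda w: int("".join(str(a[ch]) for ch in w))
        if num("SEND") + num("MORE") == num("MONEY"):
            return a

print(solve_send_more_money())   # S=9, E=5, N=6, D=7, M=1, O=0, R=8, Y=2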
CSPs are also studied in computational complexity theory and in finite model theory. CSPs are well known for their complexities, which are exponential in nature. To solve a CSP, one needs to establish a relation: mapping nodes in a graph to certain colors, satisfying some constraints; assigning queens to squares on a chessboard, with certain constraints; mapping letters to digits in a cryptarithmetic puzzle; and so on. In all these cases we have certain mapping relations, with certain constraints. From these observations, an important question can be asked: for each set of relations, is the class of all CSPs that can be represented using only relations chosen from that set in P, or NP-complete? If such a theorem is true, then CSPs provide one of the largest known subsets of NP. The classes of CSP that are known to be tractable are those where the hypergraph of constraints has bounded tree-width and there are restrictions on the set of constraint relations [7].
Some properties of CSPs are:
• Solving CSPs is in general NP-complete.
• Before a CSP is solved, it is important to identify restrictions that make the problem tractable.
• Each CSP can be converted into a binary CSP.
• Over-constrained CSP: a CSP that contains more constraints than required, which may be inconsistent and/or redundant.
• Under-constrained CSP: a CSP that cannot be solved uniquely.
10.10 Summary
In constraint satisfaction problems (CSPs), one needs to search the state space, but every move is subject to the fulfillment of certain constraints, which is different from the selection of the best-fitting state in best-first search or hill-climbing. The aim of their solution is to find an assignment of values to a given set of variables, subject to the specified constraints.
CSPs are common in job scheduling, supply chain management, CPU scheduling, and other optimization problems. A constraint satisfaction problem is defined as a triple
⟨V , D, C⟩
of variables, domains, and constraints.
Exercises
1. How many solutions exist for the map coloring of the Indian states (see Fig. 10.1)?
2. A problem called the "Minimal Bandwidth Ordering Problem" is described as follows: Given a graph G = (V , E), with V a set of vertices and E a set of edges (x, y), where x, y ∈ V . If we order the vertices, the bandwidth of a vertex x is the maximum distance between x and a vertex y that is adjacent to it. The bandwidth of an ordering is the maximum bandwidth of all the nodes under that ordering. The minimal bandwidth ordering problem is to find an ordering with the minimum bandwidth.
[Figure: a grid of variables X1–X9 accompanying the exercise.]
References
1. Bitner JR, Reingold EM (1975) Backtrack programming techniques. Commun ACM 18(11):651–655
2. Bordeaux L et al (2006) Propositional satisfiability and constraint programming: a comparative
survey. ACM Comput Surv 38(4):1–54. https://doi.org/10.1145/1177352
3. Cooper MC (1997) Fundamental properties of neighborhood substitution constraint satisfaction
problems. Artif Intell 90:1–24
4. Dorndorf U et al (2000) Constraint propagation techniques for the disjunctive scheduling
problem. Artif Intell 122:189–240
5. Freuder EC (1978) Synthesizing constraint expressions. Commun ACM 21(11):958–966
6. Freuder EC (1985) A sufficient condition for backtrack-bounded search. J ACM 32(4):755–761
7. Jerrum M (2010) Constraint satisfaction problems and computational complexity. Commun
ACM 53(9):98–106
8. Michalowski M (2008) A general approach to using problems instance data for model refine-
ment in constraint satisfaction problems. A Dissertation Presented to the Faculty of graduate
school, USC, Computer Science Department
9. Scott H (1991) Cooperative solution of constraint satisfaction problems. Science 254:1181–
1182
10. Golomb SW, Baumert LD (1965) Backtrack programming. J ACM 12(4):516–524
Chapter 11
Adversarial Search and Game Theory
Abstract Game theory is the formal study of conflict and cooperation. First introduced as far back as 1921 by the mathematician Emile Borel, advanced in 1928 by John von Neumann, developed further by von Neumann and Oskar Morgenstern, and much enriched by John Nash, it has enormous applications, including in business and even in the prediction of election results. Game playing is also a search process. The chapter presents the classes of games, as combinatorial games and games of chance, and further as zero-sum and non-zero-sum games; the prisoner's dilemma; game playing strategies; games of perfect information; arbitration schemes in games; minimax search in game playing; and the analysis of specific games like tic-tac-toe. The more efficient alpha-beta search processes are presented, as well as the alpha cutoff and beta cutoff methods to prune the search, which are analyzed, followed by the chapter summary and an exhaustive list of exercises, along with a number of multiple-choice questions at the end.
11.1 Introduction
The Game theory is the formal study of conflict and cooperation. Game theory’s
concepts apply whenever the actions of several agents are interdependent. These
agents may be individuals, groups, firms, or their combination. The concepts of
game theory provide a language to formulate, structure, analyze, and understand
strategic scenarios. Cooperative game theory concentrate on what agreements the
agents are likely to reach, while noncooperative game theory concentrates on what
strategies are likely to be adopted by the agents. Both of these traditions have adopted
models of bargaining in which each agent’s risk posture is conveyed by comparing
his preferences for risky and riskless alternatives.
Formal game theory was introduced in 1921 by the mathematician Emile Borel, and further advanced by the mathematician John von Neumann in 1928 in a paper titled "Theory of Parlor Games." Game theory became a field of study in mathematics following the publication of the book "Theory of Games and Economic Behavior" in 1944 by John von Neumann and the economist Oskar Morgenstern. John Nash, in the year 1950, demonstrated that finite games always have an equilibrium state, at which all the players choose the actions that are best for them given their opponents' choices. From that time onward, the concept of equilibrium has remained the focal point of analysis in noncooperative game theory. In the 1960s, the scope of game theory widened, and it began to be applied to problems related to war and politics. The whole subject caught special attention when the Nobel Prize in Economics was awarded jointly to John Nash, John Harsanyi, and Reinhard Selten in 1994.
Another branch of game theory, called combinatorial game theory (CGT), studies the strategies and mathematics of two-player games of perfect knowledge, such as chess or Go, but generally concentrates on simpler games like Nim, or on solving endgames and other special cases. An important difference between this subject (i.e., CGT) and classical game theory (a branch of economics, after von Neumann and Nash) is that in CGT, the players are assumed to move in sequence rather than simultaneously, so there is no scope for randomization or information-hiding strategies.
CGT does not study games of chance with hidden information, like poker (card games). The combinatorial games include chess, checkers, Go, Arimaa, Hex, and Connect; they also include one-player combinatorial puzzles, and even no-player automata.
In CGT, moves are represented in a game-tree, and the tree can be searched while making moves in the game. A game-tree is a special kind of semantic tree where nodes represent board configurations, and branches indicate how one board configuration can be transformed into another by a single move. The decisions regarding the next move are made by the two adversaries who play the game, hence the name adversarial search.
Since the number of nodes in a binary or higher order tree generally increases exponentially with the depth of the tree, it is usually infeasible to search a game-tree exhaustively. Search procedures called alpha-beta search are commonly used to guarantee correct results without exhaustively searching such trees, although in special situations even these procedures require a complete search of the tree. In fact, some game-trees are so large that they cannot fit even on a hard disk, making it practically impossible to search them completely.
Improved results can be obtained in game-playing computer programs by searching the tree only to a limited depth. This is done by using static evaluation functions to estimate the utility values at given nodes, and then computing the utility values at shallower nodes in the tree, assuming that the estimated values are correct. This technique, called heuristic game-tree search, is also what we use in "real-life" decision situations.
Games are classified into several categories based on certain significant features, the most obvious of which is the number of players involved. A game can thus be designated as one-person, two-person, or n-person (n > 2). A player, i.e., a participant in a game, need not be a single person: if each member of a group has the same feelings about how the game should progress and end, the members may be counted as a single player. A player may also be a corporation, a basketball team, or even a country.
In games of perfect information, such as chess, each player knows everything about the game at all times. Poker, on the other hand, is a game of imperfect information, because players do not know the cards the others hold.
The extent to which the players' interests are opposed or coincide is another basis for classifying games. Zero-sum, or more precisely constant-sum, games are completely competitive. Poker, for example, is zero-sum because the combined wealth of the players remains constant; if one player wins, another must lose. In non-zero-sum games, all players can be winners or all losers: in a dispute between management and labor, if some agreement is not reached, for example if a labor strike cannot be avoided, both will be losers.
In cooperative games, the players may communicate and make agreements binding on all of them, and in noncooperative games they may not. Examples of players in cooperative games are a car salesman versus a customer, or management versus employees, whereas bidders for government contracts are supposed to be noncooperative.
Games can further be classified as finite and infinite. A game is finite if each player has to make only a finite number of decisions and has only a finite number of alternatives; when either the decisions or the alternatives are infinite, the game is called infinite.
The simplest of all is a one-person game, with no opponent. As an example, a person going to the office may choose to go on foot, ride public transport, or hire a cab; these are the alternative moves. Each of us plays many one-person games daily: doing shopping, doing the homework given by a class teacher, managing the kitchen, cleaning the house, doing meditation; assuming that no other person or robot is involved in these tasks. Each of these games (tasks) has many steps or moves, and at each step one needs to choose among a number of available next moves. For example, while shopping, we may have the option of first collecting the vegetable items or the grocery items, and before the final move of paying the bill, we may choose to pay by cash or by card.
Suppose that in a board game, every unique board configuration can be converted into a unique single, overall quality value in the form of an integer. Further, consider that a positive number indicates a move in favor of one player, and a negative number a move in favor of his/her opponent; the degree of favoredness is the absolute value of that number. Let us call ourselves the maximizer player, and the opponent the minimizer. Also, let us imagine that there is a number (call it the static score, accumulated so far): the maximizer will make moves so that the score increases toward the maximum, while the opponent (adversary) makes moves so that, by the addition of quality numbers, the score is minimized.
Game playing is a favorite subject for AI search, because:
• it is a structured task, often a symbol of "intelligence",
• there is a clear definition of success and failure,
• it does not require large amounts of knowledge (at first glance),
• it allows a focus on games of perfect information, and
• it extends to multi-player and chance games.
The difference between games and ordinary search problems is that game playing is highly unpredictable, as one does not know what move the opponent is going to make; only after the opponent's move can the other player decide his own.1 The other difference is that the time limit for a move is realistic, and since one is unlikely to find a goal within that time, approximations are necessary for both players. The formal system behind these games is called game theory [6].
1 A player can learn from the opponent's past moves how and what strategy the opponent followed; but this requires the ability to learn. Hence, the game's moves would be based not on the current state alone, but on past states also. This has not been considered in the present discussion.
11.4 Two-Person Zero-Sum Games
The term "zero-sum" (or equivalently "constant-sum") means that players have diametrically opposed interests. The term comes from parlor games like poker, where there is a fixed amount of money around the table: if one player wins some money, the others have to lose an equivalent amount. Two nations trading make a non-zero-sum game, since both gain simultaneously. An equilibrium point is a stable outcome of the game associated with a pair of strategies [2].
Consider Fig. 11.1, showing a game played between you and your opponent. The numerical values in the cells indicate the amounts in dollars your opponent will pay to you. To start the game, assume that you select row B. Next, your opponent selects a column, say II; your opponent then pays you $4. Next, say you select row C, and your opponent selects column III. In this case, you pay your opponent one dollar, since the value at the intersection is −1. The two-player zero-sum game is a perfect information game.
A strategy in game theory is a complete plan of action that describes what a
player will do under all possible circumstances. There are poor strategies and good
strategies.
Other examples of two-person zero-sum games of perfect information are chess, checkers, and Japanese Go. Such games are strictly determined, i.e., rational players making use of all available information can deduce a strategy that is clearly optimal, so the outcome of such games is deterministic. In chess, for example, one of these three possibilities holds: (1) white has a winning strategy, (2) black has a winning strategy, or (3) black and white each have a strategy that guarantees at least a draw [9].
Two-Person Non-zero-sum Games
Most games that arise in practice are non-zero-sum games, where players have both common and opposed interests. For example, buyer versus seller is a non-zero-sum game: the buyer wants a lower price and the seller a higher one, but both want the deal to be executed. Similar is the case of hostile countries, which may disagree on many issues but try to avoid war. Many properties obvious in zero-sum games are absent in non-zero-sum games. One of the differences is the effect of communication on the game, which cannot help either player in a zero-sum game.
        I   II  III
YOU  B  6    4    2
     C  0    8   −1
Two criminals, A and B, are arrested and interrogated separately. Each suspect may either confess (defect against the other prisoner) in the hope of a lighter sentence, or refuse to talk (cooperate with the other prisoner). The police do not have sufficient information to convict the suspects unless at least one of them confesses. If both cooperate, both will be convicted of a minor offense and sentenced to a month in jail, in the absence of supporting evidence. If both defect, both will be sentenced to jail for six months. If A confesses and B does not, then A will be set free but B will be sentenced to one year in jail, and vice versa when A does not confess and B does.
The police explain these outcomes to both suspects and tell each one that the other suspect knows the deal as well. Each suspect must choose his action without knowing what the other will do. This is shown in the table in Fig. 11.2.
If we look closely at the outcomes of the different choices available to the suspects, it becomes clear that regardless of what one suspect chooses, the other is better off choosing to defect, which means he confesses. The conclusion is that both suspects choose to defect and stay six months in jail. This choice is clearly a less desirable outcome than the one month in jail that results if both cooperate with each other by not confessing.
We assume that there is no honor among the criminals, and each one's sole concern is to save himself. This game is called the prisoner's dilemma. Since a suspect must make his own decision without knowing that of the other, he must consider each of his partner's alternatives and anticipate their effect on him.
Suppose that A confesses; then B must go to jail for one year or six months. Alternatively, if A does not confess, then B is either set free or given one month in jail. In either case, it is better for B to confess! However, if the two cooperate with each other, both will not confess, and both get the minor term of one month in jail; if both confess, they go to jail for six months each.
Further, we conclude that both may be cooperative or both uncooperative. In the cooperative case, each does better than if both were uncooperative; yet, for any fixed strategy of the opponent, the other player does better by playing uncooperatively [2].
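The dominance argument above can be checked mechanically. Below is a small Python sketch; the payoff encoding is our own, taken from the jail terms stated in the text.

# Prisoner's dilemma payoffs as months in jail (lower is better), indexed by
# (A's action, B's action); the values are (A's months, B's months).
C, D = "cooperate", "defect"
payoff = {
    (C, C): (1, 1),     # both stay silent: one month each
    (C, D): (12, 0),    # A silent, B confesses: A one year, B free
    (D, C): (0, 12),
    (D, D): (6, 6),     # both confess: six months each
}

# Check that "defect" strictly dominates "cooperate" for player A:
# whatever B does, A serves less time by defecting.
for b_action in (C, D):
    a_coop = payoff[(C, b_action)][0]
    a_defect = payoff[(D, b_action)][0]
    print(f"B plays {b_action}: A gets {a_coop} by cooperating, "
          f"{a_defect} by defecting -> defect is better: {a_defect < a_coop}")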
Consider Fig. 11.3, which shows two business firms, A and B, trading in different regions, each spending either $1,000 (1k) or $2,000 (2k) on publicity. The total of publicity and sales is always $12,000 (12k). If both firms spend $1k on publicity, their sales are $5k each; if both spend $2k, the sales of each are $4k. Similarly, if A spends $1k and B spends $2k on publicity, A has sales of $3k while B's sales are $6k, and likewise in the reverse case. We assume that when both spend $2k on publicity, the extra market effort is wasted, and hence the sales come down.
If either party spends $2k one year, it will induce the other firm also to spend $2k the next time. A more optimistic strategy is to signal to the opponent your intent to spend only $1k, so that he also spends $1k, and this cooperation results in $5k sales for each. But two competing firms know that they will not do business together forever; hence they lose the cooperation, each spends $2k, and the return comes down to $4k each.
Conditions similar to the prisoner's dilemma may appear in water consumption by citizens. Imagine that there is a water shortage and citizens are urged to cut down their water consumption. Each citizen responds to these requests by considering his own interest, so no citizen will conserve water, on the prevailing reasoning that any saving by an individual has a negligible effect on the city's water supply. If everyone acts in his own interest, the results are catastrophic for all.
Or imagine two rival nations preparing their military budgets. Each nation wants to maintain a military advantage over the other by building a powerful army, and each spends accordingly. They ultimately end up with the same relative power, after a good deal of spending.
1. Every player in the market acts as per his/her self interest. The players pursue
well-defined exogenous objectives, i.e., they are rational. The players understand
and try to maximize their payoff functions.
2. While selecting a strategy/plan of action, a player estimates the potential response/
reactions of other players. The player also takes into account his/her knowledge
or expectation of other players’ behavior, that is, the player reasons strategically.
Dominant Strategies
A dominant strategy is one that provides the highest payoff for a particular player, irrespective of what strategies are employed by the other players [9].
A formal definition of a dominant strategy can be given as follows. Let there be a game played between n players, and let pi (s) be the payoff to player i as a function of the strategy profile s of all the players, where
s = (s1 , s2 , . . . , sn ). (11.1)
For the sake of simplicity, let us define s−i as the combined strategy of all the players except player i, i.e.,
s−i = (s1 , . . . , si−1 , si+1 , . . . , sn ).
The strategy si is called dominant for player i if, for every combined strategy s−i of the other players and every alternative strategy si′, the following payoff relation holds:
pi (si , s−i ) ≥ pi (si′ , s−i ).
[Fig. 11.4: payoff matrices of a two-player game, with A's strategies s1, s2 as rows and B's strategies s1, s2, s3 as columns; (a) shows A's payoffs (row s1: 2, 1, 4; row s2: 1, 0, 3) and (b) shows B's payoffs (row s1: 2, 4, 3; row s2: 1, 2, 1).]
In the above relation, pi (si , s−i ) is the payoff of player i for the strategy profile (si , s−i ).
By simply eliminating the strictly dominated strategies in a modified prisoner's dilemma, it is not always possible to predict the final outcome of the game, though doing so may simplify the solution; note that in such a case we may find only the (defect, defect) strategy. This is because, many a time, a game has no dominated strategies at all! Hence, there is a need for a stronger solution concept: a theory that constructs a notion of an equilibrium state to which a complex chain of thinking finally converges. In that case, the strategies of all players will be mutually
consistent, i.e., each player will choose his/her best response against the choices of
other players. For such a theory to work, the equilibrium state must exist. We will
discuss this in the Nash arbitration section.
An important question in game theory is whether all participants in a game are equally informed about the progress of the game, and about any external factors that might affect its payoffs or outcomes. In the case of the "prisoner's dilemma," common-sense knowledge forms the game rules, and these rules are known to everyone. Such games are the games of complete information.
On the other hand, in a game of incomplete information, at least one player is uncertain about another player's payoff function. For example, a sealed-bid quotation for software submitted by many bidders for a government purchase is a game of incomplete information, because a bidder does not know how much the other bidders have quoted. In other words, in a game of incomplete information, the bidders' payoff functions are not common knowledge. In game theory, such situations are called static games.
In contrast to static games, dynamic games have multiple stages, as in the game of chess or in bargaining while buying. Static games are also called simultaneous-move games, owing to the way they are played. However, this simultaneity is not reflected in the actual sequence of the moves in the game, but in the information available to the players2 when they make their choices. For example, in the case of a sealed-bid tender, the process of receiving bids may run for hours or days, and the participants are free to submit their bids at any time (online, at all hours of the day!), subject to some deadline time and date. Such bids are still called simultaneous-move games because the bidders are not affected by others' actions while they are choosing their own.
In the above discussions, the word strategy may be taken as a complete action plan
for all possible ways the game can proceed. For example, a player could delegate the
strategy to its representative, or write a computer code so that the representative or
computer would know exactly what to play in all possible situations, every time the
player is supposed to make a move [11].
The simplest two-person zero-sum games of imperfect information have saddle points. Such games have predetermined outcomes, provided play is rational; the predetermined outcome is called the value of the game. Consider the table given in Fig. 11.5. Two campaigning political parties, A and B, must each decide how to handle a disputed issue in a village: they can support it, oppose it, or evade it, and each party must make its decision without knowing what its rival will do. Each pair of decisions (x, y), where x and y are the decisions of parties A and B, determines the percentage of the votes that each party receives in this village, and each party wants to maximize its own percentage. The numbers in the matrix in Fig. 11.5 indicate the percentage share of the votes for party A; the balance (100 minus this value) is that of party B. Suppose party A opposes the issue and B supports it; then A gets 80% and B gets 20%.
A's decision seems difficult at first, because it depends on B's strategy: player A does best to support if B evades, best to oppose if B supports, and best to evade if B opposes. We note that party A must consider B's strategy before deciding its own. Whatever A decides, B gets the largest share by opposing the issue. Interestingly, once A realizes this, A prefers to settle for 30%! This 30%:70% division of the votes between A and B is called the game's saddle point [11].
A better way of finding the saddle point is to determine the maximin and minimax values. Using this method, A first determines the minimum percentage of votes it can obtain for each of its strategies, then selects the maximum of these three. This can be shown as a tree (see Fig. 11.6), where B's decisions are the nodes of a subtree whose root is A's decision. The minimum percentages A will get if it supports, opposes, or evades are 20, 25, and 30, respectively. The largest of these, i.e., 30, is the maximin (maximum of the minimums) value. Similarly, for each strategy B chooses, B determines the maximum percentage of votes A can win, i.e., the minimum B can win (similar to what we did for A). In this case, if B supports, opposes, or evades, the maximum A gets is 80, 30, or 90, respectively. Now, B obtains its highest percentage by minimizing A's maximum percentage of the vote: the smallest of A's maxima is 30, which is B's minimax (minimum of maximums) value. Because both the maximin and the minimax are 30, 30 is a saddle point. This selection process is shown in Fig. 11.6.
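The maximin/minimax computation is straightforward to express in code. In the Python sketch below, only a few matrix entries are stated in the text (the 80 for oppose/support, the row minima 20/25/30, and the column maxima 80/30/90); the remaining numbers are hypothetical values chosen by us to be consistent with those.

payoff = {                       # A's vote share; rows = A, columns = B
    "support": {"support": 45, "oppose": 20, "evade": 90},
    "oppose":  {"support": 80, "oppose": 25, "evade": 35},
    "evade":   {"support": 75, "oppose": 30, "evade": 45},
}

maximin = max(min(row.values()) for row in payoff.values())
minimax = min(max(payoff[r][c] for r in payoff)
              for c in ("support", "oppose", "evade"))
print(maximin, minimax)          # 30 30 -> a saddle point with value 30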
The players in a bargaining game often want to make the most favorable agreement they can, while making every attempt to avoid the risk of losing the agreement. For example, a seller does not want to lose the customer, but at the same time wants to keep the profit as high as possible; likewise, the customer would like to buy at the best price. A similar situation occurs when two fighting nations try to arrive at peace. In such situations, the problem is to reach an arbitration that somehow reflects the strengths of the players, so that one gets the effect of negotiation without the risk. The Nash arbitration scheme suggests the following procedure [10]:
Assume that two parties are in the process of negotiating a contract. The parties may be two nations, seller and customer, management versus labor, etc. Also assume that failure to agree, i.e., no trade, no sale, or the labor strike going on, would have utility zero to both parties. Nash then selects a single arbitrated outcome from all the agreements that are within the power of the players, such that the product of the players' utilities is maximized.
The disadvantage of the Nash scheme is that the utility functions of both players must be known before the arbitration is applied. In practice, the utility functions are often misrepresented by the players to turn the outcome to their advantage. Further, the Nash scheme is neither predictable nor enforceable; it is rather an a priori agreement obtained by abstracting away many relevant factors, such as the bargaining strengths of players A and B, cultural norms, etc. It is often found that Nash arbitration turns out to be unfair: the poor become poorer and the rich richer.
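The product-maximization rule is easy to demonstrate numerically. The Python sketch below splits a surplus of 100 between two players under hypothetical utility functions of our own choosing (a risk-averse square-root utility for the poorer player, a linear one for the richer); the disagreement point gives both a utility of zero.

import math

def nash_outcome(total=100.0, steps=10000):
    best_x, best_product = 0.0, -1.0
    for i in range(steps + 1):
        x = total * i / steps                 # the poorer player's share
        product = math.sqrt(x) * (total - x)  # u_poor(x) * u_rich(total - x)
        if product > best_product:
            best_x, best_product = x, product
    return best_x

x = nash_outcome()
print(f"poorer player gets {x:.1f}, richer player gets {100 - x:.1f}")

With these assumed utilities the arbitrated split is roughly 33.3 to 66.7, illustrating the text's point that the Nash outcome favors the richer (less risk-averse) party.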
The idea of a Nash equilibrium (NE) in a game played by two or more players is based on the fundamental principle of a system at rest.
Definition 11.4 (Nash equilibrium) A Nash equilibrium is a set of strategies for all n players in a game, such that every player is playing a best response to every other player's best response. Formally, an NE is a combined strategy s∗ = (s1∗ , . . . , sn∗ ) of the n players' strategies such that, for every player i and every alternative strategy si ,
pi (si∗ , s−i∗ ) ≥ pi (si , s−i∗ ).
Consider that a rich man and a poor man get one million dollars if they agree on how
they share it between them. If they fail to agree, they get nothing. In this case, the
Nash equilibrium will give more money to the richer person [9].
When offered more money than a person already possesses, people play it safe. Most people, unless they are very rich, prefer a sure thousand dollars to a chance at a million dollars; but a large insurance company would prefer the chance of losing one million and, in fact, gladly accepts much less attractive risks every day. This difference is reflected in the poor man's utility function more strongly than in the rich man's. A utility function that correctly reflects the poor man's situation would be a square-root function: $100 would be worth 10 utiles, $16 four utiles, and $1 only one utile. Thus, the poor man would be indifferent between a chance at $10,000 and a sure $2,000.
The general approach for solving imperfect-information games is shown in Fig. 11.7. To begin with, the original game is abstracted to generate a similar but smaller game, reduced to the extent that an equilibrium-finding algorithm can easily be applied to it. In the next step, the abstract game is solved for an equilibrium or near-equilibrium state.
A Nash equilibrium is based on the notion of rational play: it is a profile of strategies of all the players, with one strategy per player, such that no player can increase his payoff by unilaterally switching to a different strategy. A strategy for a player characterizes the play he/she carries out whenever his/her turn comes.
For each information set (i.e., collection of game states), a strategy gives the probability with which the player selects each of his available actions. Because the information of the other players is private, a player cannot distinguish between the states within an information set when it is his turn to move. At the end of the game, the strategies computed for the abstract game are mapped back to the original game [13].
There are two kinds of abstraction: (1) information abstraction, where information sets are bundled together so that a player is assumed not to know some information that he actually knows; and (2) lossless abstraction, an algorithm that results in an abstract game each of whose equilibria is also an equilibrium of the original game.
In the above discussion, because the graph is closed and the image of each point under the mapping is convex, the mapping has a fixed point in its image; therefore, an equilibrium point exists. In the case of the two-person zero-sum game, the "main theorem" and the "existence of an equilibrium point" are two equivalent statements, and in that case any two equilibrium points lead to the same expectations for the players. However, this need not occur in general [5].
Chess has certain attractive features that many more complex tasks do not have: the available options (moves) and the goal (checkmate) are sharply defined. The discrete model of chess is called a game-tree, and it is the general mathematical model on which the theory of two-player zero-sum games of perfect information is based. Since all the legal moves are known to both players at all times, chess is a game of perfect information; it is also a zero-sum game, because what one player loses, the other gains.
There is a root node at the top of a chess tree, representing the initial setup (configuration) of the board, say c0 . Corresponding to each opening move, there is an arc leading to another node that represents the new board configuration (say, c0i ) after this move is made. If there are at most n configurations to which a legal move can be taken from the root node, then the nodes corresponding to these configurations are c01 . . . c0n . This construction of new configurations, i.e., new nodes, goes on as the game progresses. Thus, a game-tree is a recursive structure: a root node with arcs pointing to the potential next states, each of which is itself the root of a smaller game-tree. The number of arcs/edges leaving a node is referred to as the branching factor, and the distance of a node from the root, in terms of the number of in-between nodes, is called its depth. If b and d are the average branching factor and depth of the tree, respectively, then the tree contains at least b^d nodes. When the current state of the game is a leaf node, the game terminates.
Each leaf node has a value associated with it, corresponding to the payoff of that particular outcome. A game can have any payoff associated with each outcome, which helps in choosing moves of higher payoff; the standard parlor games, however, have results like win, loss, or sometimes draw.
In two-player games, the players take alternate turns, each move choosing a state from among the children of the current state. If the two-player game is a zero-sum game, one player attempts to choose a move of maximum value, while the other always chooses a move of minimum value. An approach that tells a player what to choose next is called a strategy of the game; the chosen strategy (called the control strategy) controls the flow of the game. In principle, the decision of which node to choose next is a simple one: any state that is one move from the leaf nodes can be assigned the value of its best child, which is either the maximum or the minimum of the values of the children under the node.
Whether it is the maximum or the minimum depends on which of the two players is moving. The states that are two moves away from the leaves then take on the values of their best children, and this goes on until each child of the current node is assigned a value; the best move is chosen based on this strategy. In the two-player zero-sum game, the player that maximizes is called the maximizer, and the one that minimizes is called the minimizer. Accordingly, this method of assigning values and choosing moves is called the minimax algorithm, and it defines the optimal move to be made from each state in the game.
Complexities
The network of configurations and moves (links) forms a search tree, the game-tree. The start node is at level 0, and the other levels of nodes from the top are 1, 2, . . . , d for a game-tree of depth d. If b is the maximum branching factor, the worst-case time and space complexities for breadth-first search (BFS) are both O(b^d). For depth-first search (DFS), the time and space complexities are O(b^d) and O(d), respectively.
11.12 Minimax Search
Minimax search is based on the minimax theorem, the fundamental theorem of game theory, stated in 1928 by the mathematician John von Neumann.
The minimax theorem says that every two-person zero-sum game of a finite number of states has a solution in mixed strategies: if A and B are the players in such a game, there is a value V and mixed strategies for these players such that, if player A adopts his optimal mixed strategy, his expected payoff is at least V no matter what B does, and if B adopts her optimal mixed strategy, A's expected payoff is at most V no matter what A does.
[Figure: a two-level game-tree in which players P1 and P2 alternate moves at levels 1 and 2, used to illustrate the backing-up of minimax values.]
The minimax algorithm is shown as Algorithm 11.1. The idea of minimaxing is to translate a board into a single static number; however, the process may be expensive due to the generation of a large number of paths.
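To make the backup rule concrete, here is a minimal Python sketch of plain minimax; the function names and the toy tree are our own, not the book's Algorithm 11.1.

def minimax(state, is_max, children, evaluate):
    # Leaves are scored by a static evaluation; interior nodes take the
    # max or min of their children's backed-up values.
    kids = children(state)
    if not kids:                       # leaf: use the static evaluation
        return evaluate(state)
    values = [minimax(k, not is_max, children, evaluate) for k in kids]
    return max(values) if is_max else min(values)

# Usage on a tiny hand-built tree: nodes are nested lists, leaves are numbers.
tree = [[3, 5], [2, 9], [0, 7]]
kids = lambda s: s if isinstance(s, list) else []
print(minimax(tree, True, kids, lambda s: s))   # MAX of MINs: max(3, 2, 0) = 3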
[Figure: the first MAX move of a game whose start position is (7), leading to configurations such as (2, 1, 1, 1, 1, 1).]
11.13 Tic-tac-toe Game Analysis
Tic-tac-toe is a board game with 3 × 3 = 9 positions, played between two players in alternate moves. The game has a total of 765 different positions, and 26,830 possible games. The initial configuration is the empty board. One player, called MAX, makes the first move by placing an X , and the other player, called MIN, replies by placing an O. The player who is first to place all his X 's or O's in a row, column, or diagonal wins the game [8].
Figure 11.10 shows the initial configuration and some moves of this game. The game of tic-tac-toe generates all the possible moves at level i before generating the moves of level i + 1; hence it proceeds as breadth-first search (BFS). We note that
[Fig. 11.10: the first levels of the tic-tac-toe game-tree, alternating MAX's X moves and MIN's O replies. Fig. 11.11: a particular play of eight numbered moves, 1–8, won by O.]
in the start configuration (level 0), with all positions empty, the player MAX can move to any configuration at level 1. MAX will choose its moves so that they ultimately lead to a configuration in which all its X 's are contiguous in a line, i.e., in a row, column, or diagonal. At the same time, MAX will choose moves that make it difficult for MIN to later reach a configuration with all O's in a row, column, or diagonal. Thus, MAX and MIN each make the moves that are winning for themselves and blocking for the other player.
Figure 11.11 shows a total of 8 moves of this game, for a particular sequence of moves by the alternate players X and O, numbered 1–8. Finally, the player O wins the game.
To decide which configuration is superior to move into, we associate a weight function that the MAX player will try to maximize and the MIN player will try to minimize. If a configuration is denoted by c, then this weight function is defined as:

f(c) = m − n,    (11.8)

where m = the number of complete rows, columns, and diagonals that are still (fully) open for the next move of MAX, and n = the number of complete rows, columns, and diagonals that are still open for the next move of MIN.
If c is a winning configuration for MAX, then f(c) = ∞, and if c is a winning configuration for MIN, then f(c) = −∞. For a configuration like the one in Fig. 11.12, f(c) = 4 − 6 = −2. The m lines are shown as solid arrows (Fig. 11.12a), and the n lines as dotted arrows (Fig. 11.12b).
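As an illustration, the weight function of Eq. (11.8) can be coded as below; the 3 × 3 board representation ('X', 'O', or None per cell) is an assumption made for this sketch.

```python
import math

# The 8 complete lines of the board: 3 rows, 3 columns, 2 diagonals.
LINES = ([[(r, c) for c in range(3)] for r in range(3)] +
         [[(r, c) for r in range(3)] for c in range(3)] +
         [[(i, i) for i in range(3)], [(i, 2 - i) for i in range(3)]])

def weight(board):
    """f(c) = m - n of Eq. (11.8); board[r][c] is 'X', 'O', or None."""
    def wins(p):
        return any(all(board[r][c] == p for r, c in line) for line in LINES)
    if wins('X'):
        return math.inf          # winning configuration for MAX
    if wins('O'):
        return -math.inf         # winning configuration for MIN
    m = sum(all(board[r][c] != 'O' for r, c in line) for line in LINES)
    n = sum(all(board[r][c] != 'X' for r, c in line) for line in LINES)
    return m - n

board = [['X', None, None],
         [None, 'O',  None],
         [None, None, None]]
print(weight(board))             # 4 lines open for X minus 5 open for O = -1
```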
If we make use of symmetries in generating successor positions, then all the game states shown in Fig. 11.13 are identical.
For efficiency, early in the game the branching factor is kept small by exploiting symmetries, and late in the game it is kept small by the small number of open positions available. The first move is always important.
Figure 11.14 demonstrates the computation of this weight function at each node. To win the game, MAX at level 0 should move to the node at level 1 that produces the maximum value of the function f(c) at level 1. It is this maximum value that is backed up to level 0. In general, a MAX node at level i backs up the maximum of the values of its children at level i + 1, and a MIN node at level i + 1 makes a move that produces the minimum of the values at level i + 2. This process goes on until the maximizer sees the static values at the last (bottom) level of the tree. Similarly, the minimizer also chooses its next move by looking ahead to the static values at the last level.
The static values at the bottom nodes in this example are shown as 1, 3, 4, 0, −1. The values at the upper levels, i.e., the parents of the static nodes and their parents in turn, are called backed-up values. The backed-up values are based on a "look-ahead" into the game-tree, and they depend on the lower level nodes, in fact even on those that are near the end of the game-tree.
Some of the nodes may represent a win for MIN, hence they may have the value −∞. When evaluations are backed up, MAX's best move is the one that avoids its immediate defeat. In the case of MAX, a win corresponds to a node value of +∞.
3 Selecting the minimum or the maximum of the payoffs (depending on which player has the move) of the children of any node v_k and moving it to node v_k is called backing-up of values.
11.14 Alpha-Beta Search

[Fig. 11.15: a three-level game-tree with MAX position P at depth 0, MIN positions P1, P2, P3 at depth 1, and bottom-level static positions P1,1, . . . , P3,3 at depth 2, with static evaluations 5, 10, 12, 3, 0, 0, 7, 1, 5, 1; the value 5 is backed up to P, and positions not evaluated due to alpha cutoffs are marked X.]
An alpha cutoff occurs at P3,2, which eliminates P3,3. Thus, in Fig. 11.15 an alpha-beta search program would evaluate only six of the bottom-level successors, while a minimax program would evaluate all nine.
Although the example is given only for a tree of three levels, it is clear that the procedure works just the same below any Max position P_x, at any depth in a large tree, provided only that there are at least two levels below P_x. If there were more levels below P1,1 in the example, then we could use the backed-up value of P1,1 instead of its static value. If there were more levels above P, then we could back up the value of P. Moreover, it is possible to pass a value of alpha down from the top of a large tree. Thus, an alpha established at depth 1 could be used to produce cutoffs at depths 2, 4, and 6. These deep cutoffs are illustrated in Fig. 11.16.
Alpha is defined by the values of the successors of a Max position (i.e., at odd depths; here the nodes P1, P2, P3 in Fig. 11.15), while alpha cutoffs occur among the successors of a Min position (at even depths; here P1,1 . . . P3,3 in Fig. 11.15). It is possible to define another variable, beta, which is established at even depths and generates cutoffs at odd depths. The action of the beta cutoff is exactly the inverse of the alpha cutoff.
The effect of alpha-beta cutoffs is to make the search space grow more slowly with depth; this is the advantage over simple MINIMAX. Alpha-beta is about twice as good at maximum depth D_max = 3 and about thirty times as good at D_max = 6, which is typical. It is useful to have a depth-dependent measure of this gain, called the depth relation (DR):

DR = log N / log N_MM,    (11.9)

where N is the number of nodes at the bottom of the tree, and N_MM is the number of nodes at the bottom of the tree in minimax search. DR ∈ [0, 1] is the effective depth in comparison to the minimax procedure.
Alpha-beta search is equivalent to minimax in the sense that the two procedures will always choose the same successor as best and will always give the same value for that successor.
The benefit of alpha-beta pruning lies in the fact that branches of the search tree can be eliminated, so that the search can be limited to the more promising sub-trees and a deeper search can be performed in less time. Thus it belongs to the branch-and-bound class of algorithms.
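A minimal Python sketch of the pruning, using the same assumed `successors`/`static_value` interface as the earlier minimax sketch:

```python
import math

def alphabeta(state, depth, alpha, beta, maximizing, successors, static_value):
    """Minimax value of `state` with alpha-beta pruning.

    `alpha` is the best value the maximizer can already guarantee,
    `beta` the best the minimizer can; when alpha >= beta the remaining
    siblings cannot affect the result and are cut off.
    """
    children = list(successors(state))
    if depth == 0 or not children:
        return static_value(state)
    if maximizing:
        value = -math.inf
        for c in children:
            value = max(value, alphabeta(c, depth - 1, alpha, beta, False,
                                         successors, static_value))
            alpha = max(alpha, value)
            if alpha >= beta:   # beta cutoff among successors of a Max position
                break
        return value
    value = math.inf
    for c in children:
        value = min(value, alphabeta(c, depth - 1, alpha, beta, True,
                                     successors, static_value))
        beta = min(beta, value)
        if alpha >= beta:       # alpha cutoff among successors of a Min position
            break
    return value
```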
If the move ordering for alpha-beta is optimal (best moves are always searched first), then the total number of leaf nodes to be evaluated is O(b^(d/2)) = O(√(b^d)). The explanation is that all of the first player's moves must be studied to find the best one, but for each of them, only the best second-player move is needed to refute all but the first (and best) first-player move; alpha-beta ensures that no other second-player moves need to be considered. Thus, the complexity O(b^(d/2)) is achievable if the nodes are well ordered; if they are in random order, it is O(b^(3d/4)).
Thus, it can be demonstrated that the number of static evaluations needed to discover the best move in an optimally arranged tree is O(b^(d/2)). Let the static evaluation function be represented by a variable c, which is to be maximized and minimized at alternate levels. We note in Fig. 11.15 that the evaluation of c is not required everywhere at level 1 (shown with the values 5 and ≤3 at the minimizing level): we do not need to evaluate beyond ≤3, as the value 5 is carried forward to the root. Since static evaluations are required only at alternate levels, the effective depth of search reduces to half; therefore the time complexity is O(b^(d/2)).
[Figure: two groups of symmetric tic-tac-toe configurations, each identical under rotation and reflection.]
11.15 Sponsored Search

Due to the growth of business, commerce, and trading activities on the WWW, sponsored search is becoming very important. Sponsored search is based on input from many sources, like Web-search engines, users, and content providers. Here, content providers are companies or individuals owning the Website hosting the contents.
The keywords selected by the content provider are matched against the users' queries by the web search engines, and the corresponding link is displayed in highlighted form. In many cases the content providers pay the search engine company whenever a user enters one of these terms as a search and clicks on the sponsored link displayed. On occasions, the user is required to go one step further and perform some specified action on a web-page of the content provider's website. The content providers can identify those phrases from the user queries that are most likely to be related to the required Web sites. Often the sponsored search platforms provide the capability for content providers to tailor the material presented on the sponsored link so that it conforms to the user queries.
These simple approaches, however, do not always work. Consider a situation where many different content providers engage the search engine to respond to user queries, and all of them want the search engine to respond to the same term or phrase of the user query. In such a scenario, an electronic auction system ranks the sponsored links in such a way that the highest bidder gets the topmost rank, the next lower bidder gets the next link, and so on. That means the one who pays more will get the links for his/her products and services listed at the top, and largely they will get the customers, because customers usually do not open all the links.
The web search engines also adopt other ranking factors apart from the price paid by the bidder. For example, a sponsored link that received many more clicks in the past is likely to be displayed among the top links, or at the top. This is because a link that gets more clicks is likely to be more relevant, and when the link is relevant, the user is potentially a good customer for the content provider. A sponsored link with maximum clicks will generally produce a major part of the profit for the search engine. Hence, both the content providers and the search engines have strong monetary incentives for providing content to the user that is relevant.
The key-phrase selection can be conceptually viewed as a dynamic form of Web site meta-tagging focused on the user. The content providers can change one or more of the following:
phrases to be searched,
bid price for various phrases,
degree of matching of terms,
time-based restrictions (e.g., displaying at 10 AM, in the evening, or early in the morning),
geographical restrictions, and
the amount they bid for a given period.
From the above discussions, we conclude that the content providers become active
participants in the sponsored search process [3, 4].
11.16 Playing Chess with Computer
One of the well-known, landmark examples of game playing is the chess match played between grandmaster Garry Kasparov and the IBM Deep Blue computer in 1997. Following are some characteristics of this competition:
• The system used for this purpose comprised 30 IBM RS6000 machines running a search program, and a large number of custom VLSI chess processors that performed the job of chess-move generation and the ordering of these moves. This hardware searched the last few levels of the game-tree and evaluated the leaf nodes.
• Deep Blue searched more than 100 million nodes per second on average, with peak speeds as high as 330 million nodes per second.
• The software generated 30 million positions per move, quite often reaching a depth of 14.
• The heart of the search algorithm was a standard iterative-deepening alpha-beta search.
• In some cases, the search reached a depth of 40 plies.
• The evaluation function for a move comprised over 8000 features, many of them describing highly specific patterns of pieces.
• A large end-game database comprising solved positions was also used.
• The chess program also used an opening book of about 4000 positions, and a database of 700,000 grandmaster games that could be consulted for a consensus on the next moves.
With the above setup, IBM's Deep Blue beat the World Chess Champion in a six-game match.
In more recent chess-playing programs, a variety of pruning heuristics is commonly used to reduce the effective branching factor from 35 down to less than 3; hence they are more powerful even with far less processing capability.
11.17 Summary
Game theory is the formal study of conflict and cooperation. A branch of game theory called combinatorial game theory (CGT) studies the strategies and mathematics of two-player games of perfect knowledge, such as Chess or Go. In games of perfect information, such as Chess, each player knows everything about the game at all times. Poker is a game of imperfect information, as the other players do not know your cards. In zero-sum games, if one player loses, the other gains; Poker, Chess, and many parlor games are in this category. Games of bargaining, trade, and agreements are non-zero-sum games, where all the players gain on successful completion of the game and lose in the event of failure.
A strategy for a two-player game is: every unique board configuration is converted into a unique single integer number; while one player tries to maximize this number, the opponent does the reverse.
The prisoner's dilemma is a game of conflict and cooperation, and of complete information, in which each player is aware of the strategies of the other player.
A dominant strategy provides the highest payoff for a player, irrespective of what strategies are employed by the other players.
For games of bargain, the Nash arbitration scheme is used to arrive at an agreement, called a Nash equilibrium (state), which is a set of strategies for all n players such that each player is playing the best response to every other player's best response. The equilibrium state follows the principle of a system at rest.
The scoring in a two-player game is done using the MINIMAX algorithm, where the score at any level i is either the minimum or the maximum of what is available at level i + 1.
Games like tic-tac-toe and chess require search, which can be conducted using minimax search. However, this approach has exponential complexity for searching the game-tree.
An improved search method, called alpha-beta search, saves search time by pruning the game-tree through alpha and beta cutoffs.
Exercises
1. Demonstrate the minimax search for the tic-tac-toe puzzle (for 10 moves).
2. Demonstrate the alpha-beta search for the tic-tac-toe puzzle.
3. Apply alpha-beta search (from left to right) to the game-tree in Fig. 11.18. Show the backed-up value of each node, and mark with an X any branches that are not searched.
a. Identify these as alpha/beta cutoffs, and mark the best move with an arrow from the root node.
b. Compute the time complexity of the search with a worst-case branching factor of 3 and the height of the tree as h.
4. The game of nim is played as follows: two players alternately remove one, two, or three coins from a stack of five coins. Represent this game playing as a search tree. Suggest a suitable strategy for winning this game.
5. Given the tree in Fig. 11.19, explore this tree using the alpha-beta procedure. Indicate all parts of this tree that are cut off. Also, indicate the winning path(s), and strike out all the values that are not required to be computed.
a. Identify these as alpha/beta cutoffs, and mark the best moves with an arrow from the root node.
b. Compute the time complexity of the search with a worst-case branching factor of 3 and the height of the tree as h.
[Fig. 11.18: a game-tree with a minimizer root, a maximizer level below it, and leaf values 6, 4, 9, 6, 4, 0, 3, 3.]
6. Consider the min-max tree shown in Fig. 11.20, whose leaves are labeled with natural numbers; n and m are variables.
a. Assign values to n and m such that, in computing the value of the root node, no alpha-beta cutoff is possible. Compute the value of the root node.
b. Assign values to n and m such that, in computing the value of the root node, an alpha-beta cutoff is possible. Indicate the cut and compute the value of the root node.
[Fig. 11.20: a min-max tree with leaf labels 3, 4, n, m.]
7. The following are two-player or multi-player games, to which the standard minimax/alpha-beta search techniques are to be applied for game playing. Explain how well these techniques apply to each of these games. Give a formal approach for each, where possible.
a. Basketball
b. Badminton
c. Football
d. Soccer
e. Tennis
f. Kabaddi
8. Consider two firms A and B that serve the same market. The firms incur constant average costs of $2 per unit, and are free to choose a high price of $10 or a low price of $5 per unit. When both firms set a high price, the total demand is 5,000 units, which is split evenly between the two
firms. When both set a low price, the total demand is 10,000 units, which is again split evenly. If one firm sets a low price and the second a high price, the low-priced firm sells 8,000 units, and the high-priced firm only 1,000 units.
Analyze the pricing decisions of the firms A and B as a noncooperative game.
a. Construct the payoff matrix, where the elements of each cell of the matrix are the two firms' profits.
b. Derive the equilibrium set of strategies.
c. Is it a game of prisoners' dilemma? Justify.
9. Find the Nash equilibrium for the game shown in Fig. 11.21, which is played between two players A and B, having strategies s1, s2, s3 and s1', s2', s3', respectively.
10. Consider Table 11.1 for a two-player game. Friends A and B do not speak to each other, but have many common friends. Both of them want to invite these friends to a party, either on Saturday or Sunday, and both prefer Sunday over Saturday. If both schedule the party on the same day, it will be considered a disaster, with a (negative) payoff of $500 for both. If they plan the party on different days, the one proposing Sunday gets a payoff of $250, and the other of $200.
12. For the two-player game shown in Table 11.2, find all Nash equilibrium states, possible dominations, maximin moves, and best responses.
13. Find the saddle point for the game shown in Fig. 11.5.
14. What considerations can, for example, help to reach a Nash equilibrium faster in a buyer versus seller game?
15. Are we in a position to determine in advance that arbitration between two parties will result in an equilibrium state? Justify your answer.
16. Justify the following statement: "It is often found that Nash arbitration is unfair, in which the rich become richer and the poor become poorer."
17. For any minimax game with a game-tree of even height, show that the final result of the game shall be the same, irrespective of who plays first.
18. Apply the minimax algorithm to the problem of the prisoner's dilemma (Fig. 11.2), and demonstrate the backing-up of values from the static levels.
19. Why is the search in the game of tic-tac-toe BFS? Justify.
20. Write an algorithm to compute the weight of any general configuration of tic-tac-toe, given any number of X's and O's at arbitrary positions. Note that if 0 ≤ |X| ≤ 5 then 0 ≤ |O| ≤ 4, and vice versa.
21. In a game search tree with the root node at level 0, suppose an alpha cutoff occurs at level i. What is the minimum required depth of the tree? Justify your answer.
22. For a game-tree of depth d, with a uniform branching factor b in all the sub-trees, let an alpha cutoff occur at every alternate depth due to the second node from the left in each sub-tree. What is an expression for the time complexity of the search using only alpha cutoffs? Assume that the time complexity of the minimax algorithm is O(b^d).
23. Assume that in a game-tree, alpha and beta cutoffs occur at every alternate level. Find the time complexity of this search.
24. Show that if the static nodes are ordered by their static values, then the time complexity of alpha-beta search is O(b^(d/2)).
25. Show that if the static node weights are in random order, then the time complexity of alpha-beta search is O(b^(3d/4)).
26. In Fig. 11.15, find the alpha-beta cutoffs in each case, when the order of the sub-trees is P3 P2 P1, P2 P3 P1, and P2 P1 P3. What do you conclude from this change in order?
27. Are the winning results of minimax and alpha-beta search identical for a given search tree? Justify.
28. Show that the cost functions of symmetric configurations in tic-tac-toe are identical.
29. Two firms Alpha and Beta serve the same market. They have constant average costs of $2 per unit. The firms can choose either a high price ($10) or a low price ($5) per unit for their product. When both firms set a high price, the total demand is 10,000 units, which is split evenly between the two firms. When both set a low price, the total demand is 18,000 units, which is again split evenly. If one firm sets a low price and the other a high price, the low-priced firm sells 15,000 units, while the high-priced firm sells only 2,000 units. Analyze the pricing decisions of the two firms as a noncooperative game.
a. Using the normal-form representation of the scenarios mentioned above, construct the payoff matrix, where the elements of each cell of the matrix are the two firms' profits.
b. Derive the equilibrium set of strategies.
c. Explain why this problem is an example of the prisoners' dilemma game.
30. Write a parallel search algorithm for alpha-beta search to reduce the search space.
31. Select one or more than one, or True/False, or no answer in the following multiple-choice questions:
a. The prisoner's dilemma is the following type(s) of game:
(A) Noncooperative game (B) Non-zero-sum game
(C) Imperfect information game (D) All of the above
b. Poker is the following type(s) of game:
(A) Zero-sum (B) Perfect information
(C) Non-zero-sum (D) None of the above
c. Which of the following are static games?
(A) Prisoner's dilemma (B) Bidding to win a contract
(C) Chess (D) Poker
d. Which of the following are called simultaneous-move games?
(A) Prisoner's dilemma (B) Bidding to win a contract
(C) Chess (D) Poker
e. Tic-tac-toe is the following type of search:
(A) Iterative deepening (B) DFS
(C) Best-first search (D) BFS
f. Alpha is defined by the values of the successors of:
(A) a MAX position (B) a MIN position
(C) both MAX and MIN (D) None of the above
g. An alpha cutoff occurs among the successors of:
(A) a MIN position (B) a MAX position
(C) Depends on the structure of the tree (D) None of the above
References
1. Abramson B (1989) Control strategies for two-player games. ACM Comput Surv 21(2):137–
161
2. Davis MD (1983) Game theory—a nontechnical introduction. Dover, New York
3. Jansen BJ, Spink A (2007) Sponsored search: is money a motivator for providing relevant
results? Computer 8:52–57
4. Jansen BJ et al (2009) The components and impact of sponsored search. Computer 5:98–101
5. Nash JF Jr (1950) Equilibrium points in n-person games. Proc Natl Acad Sci 36(1):48–49
6. Nau DS (1983) Decision quality as a function of search depth on game trees. J ACM 30(4):687–
708
7. Neumann JV, Morgenstern O (2007) Theory of games and economic behavior (Commemorative edn). Princeton University Press
8. Nilsson NJ (1980) Principles of artificial intelligence, 3rd edn. Narosa Publishing, India
9. Prisner E (2014) Game theory through examples, Electronic edn. Mathematical Association of
America. ISBN 978-1-61444-115-1
10. Roth AE (1983) Towards a theory of bargaining: an experimental study in economics. Science
220:687–691
11. Schaeffer J, Herik HJ (2002) Games, computers, and artificial intelligence. Artif Intell 134:1–7
12. Slagle JR, Dixon JK (1969) Experiments with some programs that search game trees. J ACM
16(2):189–207
13. Sandholm T (2015) Solving imperfect-information games. Science 347(6218)
14. Stockman GC (1979) A minimax algorithm better than alpha-beta? Artif Intell 12:179–196
Chapter 12
Reasoning in Uncertain Environments
Abstract Real-life propositions are neither fully true nor fully false; there is always some uncertainty associated with them. Therefore, reasoning using real-life knowledge should also be in accord with this. This chapter aims to fulfill the above objectives. The chapter presents the prerequisites: the foundations of probability theory, conditional probability, the Bayes theorem, and Bayesian networks (graphical representations of conditional probability), along with the propagation of beliefs through these networks and the limitations of the Bayes theorem. The application of Bayesian probability is demonstrated on specific problems. Another theory, the Dempster–Shafer theory of evidence, which provides better results as the evidence increases, is presented and applied to worked examples. Reasoning using fuzzy sets, a theory where membership of sets is partial, is yet another approach for reasoning in uncertain environments. Inferencing using fuzzy relations and fuzzy rules is demonstrated, followed by the chapter summary and a set of exercises at the end.
12.1 Introduction
Real-life propositions are neither fully true nor fully false, and there is uncertainty associated with them; hence, the conclusions drawn from them are also probabilistic. This requires probabilistic reasoning for decision-making. This chapter presents two approaches for probabilistic reasoning: the Bayes theorem and the Dempster–Shafer theory. The first is implemented as Bayesian belief networks, where nodes (events) are connected using a directed graph whose edges represent cause–effect relationships; beliefs propagate in the network as per the rules of Bayes conditional probability. The Dempster–Shafer (D-S) theory is based on evidential reasoning, such that evidences are conjuncted, and as more and more evidence takes part in the reasoning, the ignorance interval decreases. Classical probability theory is a special case of the D-S theory of evidential reasoning.
Generally, the end user of an expert system identifies and ascertains the symptoms and their uncertainties, while the uncertainties in the evaluation of the symptoms are estimated by the experts. Because different people judge these uncertainties, they are likely to judge them differently, and there is no scope for normalization; consequently, the overall uncertainty is likely to be large. Often the reckoning methods are also simple, being some "ad hoc representations", on account of a lack of theoretical foundations.
Many different methods for representing and reasoning with uncertain knowledge have been developed during the last three decades, including the certainty factor calculus, Dempster–Shafer theory, possibilistic logic, fuzzy logic, and Bayesian networks (also called belief networks and causal probabilistic networks).
In the following, we list the commonly used terminology of probability theory.
Considering an uncertain event E, the probability of its occurrence is a measure of the degree of likelihood of its occurrence.
A sample space S is the name given to the set of all possible events.
A probability measure is a function P(E_i) that maps every event outcome E_1, E_2, . . . , E_n from the event space S to a real number in the range [0, 1].
These outcomes satisfy the following axioms (basic rules) of probability:
1. for any event E ⊆ S, 0 ≤ P(E) ≤ 1,
2. an outcome from the space S that is certain is expressed as P(S) = 1, and
3. when the events are mutually exclusive, i.e., E_i ∩ E_j = φ for all i ≠ j, the probabilities add: P(E_1 ∪ E_2 ∪ . . . ∪ E_n) = P(E_1) + P(E_2) + · · · + P(E_n), which equals 1 when E_1, . . . , E_n together exhaust S.
Using these three axioms and the rules of set theory, the basic laws of probability can be derived. However, the axioms alone are not sufficient to compute the probability of an outcome, because that requires an understanding of the underlying distribution of events. The distribution of events must be established using one of the following approaches:
1. collection of experimental data to make statistical estimates of the underlying distributions,
Some examples of the computation of probability are given below. The upper-case variable names represent sets.

P(A) = count of (all A) / count of (all events).    (12.1)
Therefore,

P(A|B) = P(B|A) P(A) / P(B).    (12.4)
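These frequency definitions translate directly into code. A small sketch follows; the outcome labels and counts are hypothetical, chosen only to illustrate Eqs. (12.1) and (12.4).

```python
from collections import Counter

# Hypothetical outcomes of repeated trials: 'ab' means both A and B occurred.
trials = ['ab', 'ab', 'ab', 'a', 'a', 'b', 'b', 'none']
A = {'a', 'ab'}                  # outcomes in which event A occurred
B = {'b', 'ab'}                  # outcomes in which event B occurred

counts = Counter(trials)
total = sum(counts.values())
P_A  = sum(v for k, v in counts.items() if k in A) / total       # Eq. (12.1)
P_B  = sum(v for k, v in counts.items() if k in B) / total
P_AB = sum(v for k, v in counts.items() if k in A and k in B) / total

P_B_given_A = P_AB / P_A
P_A_given_B = P_B_given_A * P_A / P_B                            # Eq. (12.4)
print(P_A_given_B, P_AB / P_B)   # both 0.6: Bayes agrees with direct counting
```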
Most of the knowledge we deal with is uncertain; hence, the conclusions based on the available evidence and past experience are incomplete. Many a time it is only possible to obtain partial knowledge about the outcomes.
Bayesian networks provide a powerful framework for modeling uncertain interactions between variables in any domain of concern. The interactions between the variables are represented in two ways: 1. qualitatively, using a directed acyclic graph, and 2. quantitatively, using a conditional probability distribution for every variable in the network. This probability distribution-based system allows one to express the relationships between the variables as functional, relational, or logical.
Along with the above, dynamic probability theory provides a mechanism for revising the probabilities of events in a coherent way as evidence becomes available. We represent the conditional probability, i.e., the probability of occurrence of an event A given that the event B has already occurred, as P(A|B). Note that if the events A and B are totally independent, then P(A|B) and P(A) will have the same value; otherwise, due to the dependency, the occurrence of B affects the probability of A, and accordingly the probability P(A|B). Here, the occurrence of event A is called the hypothesis, and the occurrence of event B is called the evidence. If we are counting sample points, we are interested in the fraction of events B for which A is also true,
i.e., what fraction of B is (B ∩ A)? Fraction here means that A and B are viewed as fractions of the universal set Ω. In (B ∩ A), both A and B are partly included, which is also represented as (A, B). From this it becomes clear that

P(A|B) = P(A, B) / P(B).    (12.5)
Multiplying both sides by P(B) gives

P(A, B) = P(A|B) P(B).    (12.6)

Equation (12.6) is also called the "product rule" or the simple form of the Bayes theorem. It is important to note that this form of the rule is often stated not as a definition but as a theorem, which can be derived from simpler assumptions. The term P(A|B) in the equation is called the posterior probability.
The Bayes theorem tells us how to obtain the posterior probability of a hypothesis A once we have observed some evidence B, given the prior probability P(A) of event A and the probability P(B|A), the likelihood of observing B were A given. The theorem is stated as

P(A|B) = P(B|A) P(A) / P(B).    (12.7)
The formula (12.7), though simple, has abundant practical importance in areas such as diagnosis, be it the diagnosis of a patient, fault diagnosis in software or an electric circuit, or any other similar situation. Note that it is often easier to estimate the probability of observing a symptom given a disease than the reverse, that is, of a disease given a symptom. Yet in practice it is the latter (i.e., finding the disease or fault given the symptom) which is required to be computed. Observe what doctors do: it is the disease (hypothesis) they try to find, given the symptom (evidence).
There are certain conditions under which the Bayes theorem is valid. Let us assume that there is only one symptom (S) and, corresponding to it, only one diagnosis (D); given this, we will try to extend the formula of the Bayes theorem. The conditional probability for this is expressed as P(S|D), which gives the relative frequency of occurrence of the symptom S when the diagnosis D is true. The term P(S|D) can be expressed by

P(S|D) = |S ∩ D| / |D|,    (12.8)

and the Bayes theorem takes the form

P(D|S) = P(S|D) P(D) / P(S).    (12.9)
The derivation is as follows:

P(D|S) = |S ∩ D| / |S|
       = (|D| / |D|) · (|S ∩ D| / |S|)
       = |D| · |S ∩ D| / (|D| · |S|)
       = |D| · (|S ∩ D| / |D|) · (1 / |S|)
       = P(S|D) · P(D) / P(S);   using Eq. (12.8)
Example 12.1 Suppose that the following statistics hold for some disease:
P(tuberculosis) = 0.012,
P(cough) = 0.10 (for a cough lasting over 3 weeks),
P(cough | tuberculosis) = 0.90, i.e., if a patient has tuberculosis, coughing is found in 90% of the cases.
Given this, compute the probability of having tuberculosis, given that coughing exists in those persons for over 3 weeks.
By the Bayes theorem (12.7), P(tuberculosis | cough) = (0.90 × 0.012)/0.10 = 0.108.
Hence, the relative probability that a patient who had tuberculosis was found to also have the cough is 8.33 (= 0.9/0.108) times the probability of finding tuberculosis in the coughing patients.
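In Python, the computation of Example 12.1 via Eq. (12.7) is a one-liner:

```python
p_tb = 0.012                 # P(tuberculosis)
p_cough = 0.10               # P(cough lasting over 3 weeks)
p_cough_given_tb = 0.90      # P(cough | tuberculosis)

p_tb_given_cough = p_cough_given_tb * p_tb / p_cough    # Eq. (12.7)
print(p_tb_given_cough)                      # 0.108
print(p_cough_given_tb / p_tb_given_cough)   # ~8.33, the factor quoted above
```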
Relative Probability
The Bayes theorem allows the determination of the most probable diagnosis, assuming the presence of the symptoms S_1 . . . S_m. The probability is determined using: 1. the a priori probabilities P(D_i) of the diagnoses in a set of diagnoses, and 2. the likelihoods P(S_j|D_i). The latter is the relative frequency of the occurrence of a symptom S_j given that the diagnosis is D_i. In Bayes probability, the absolute probability is not important; instead, a relative probability P_r is determined. P_r is the probability of a diagnosis compared to the other diagnoses, and is computed as given in Eqs. (12.10) and (12.11). An exception to the independence of symptoms is allowed when the symptoms are caused directly by the same diagnosis.
From Eqs. (12.10) and (12.11), it follows that

P(D|S_1 ∧ . . . ∧ S_m) = P(S_1 ∧ . . . ∧ S_m | D) P(D) / P(S_1 ∧ . . . ∧ S_m)
                       = P(S_1|D) . . . P(S_m|D) P(D) / P(S_1 ∧ . . . ∧ S_m).
The Bayes theorem in the above form is correct subject to the fulfillment of the following conditions:
1. The symptoms may depend only on the diagnoses, and must be independent of each other. This becomes a critical point when more and more symptoms are ascertained.
2. The set of diagnoses is complete.
3. Single-fault assumption, or mutual exclusion of the diagnoses: the presence of any one diagnosis automatically excludes the presence of the other diagnoses. This criterion is justified only for a relatively small set of diagnoses.
4. The requirement of correct and complete statistics of the prior probabilities P(D) of the diagnoses, and of the conditional symptom-diagnosis probabilities P(S|D).
Figure 12.1 shows a Bayesian network with many causes (c_1, c_2, . . . , c_n) and a single effect e. Due to this, the specification of its local distribution can be quite onerous in the general case. As per the definition of a Bayesian network, it is necessary to specify the probability distribution of the node conditional on every instance of its parents. For example, if a node x has n parents, each binary in nature (either 0 or 1 only), then we need to specify a total of 2^n probability distributions for the node x. We can reduce the complexity of this computation in such cases by adding more structure to the model. For example, we can convert the model into an n-way interaction model by associating with each cause node an inhibitory mechanism that prevents the cause from producing the effect. In such a model, the effect will be absent if and only if the inhibiting mechanisms associated with all the present cause nodes are active; if the inhibiting mechanism of any present cause node is inactive, that cause will produce the effect. A sketch of this idea appears below.
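The following is a minimal sketch of such an inhibitor-based model (commonly known as a noisy-OR gate); the inhibition probabilities `q` are hypothetical.

```python
from itertools import product

q = [0.2, 0.4, 0.7]    # hypothetical inhibition probability for each cause

def p_effect(causes):
    """P(effect present | causes) in an inhibitor (noisy-OR) model.

    Each present cause independently produces the effect unless its
    inhibitory mechanism (probability q[i]) is active; the effect is
    absent only if all present causes are inhibited.
    """
    p_all_inhibited = 1.0
    for present, qi in zip(causes, q):
        if present:                       # absent causes play no role
            p_all_inhibited *= qi
    return 1.0 - p_all_inhibited

# The full conditional table from only n parameters, instead of 2^n entries:
for causes in product([0, 1], repeat=3):
    print(causes, round(p_effect(causes), 3))
```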
A problem domain is essentially a set of variables. A Bayesian network for the domain {x_1, . . . , x_n} represents a joint probability distribution over these variables. This distribution consists of a set of local conditional probability distributions which, combined with a set of assertions of conditional independence, give the joint global distribution.
The composition is based on the chain rule of probability (see Fig. 12.2), which dictates that

P(x_1, . . . , x_n) = ∏_{i=1}^{n} P(x_i | x_1, . . . , x_{i−1}).    (12.12)
For each variable x_i, let us assume that S_i ⊆ {x_1, . . . , x_{i−1}} is a set of variables (the causes of x_i) that renders x_i and the rest of {x_1, . . . , x_{i−1}} conditionally independent. That is,

P(x_i | x_1, . . . , x_{i−1}) = P(x_i | S_i).    (12.13)

The idea behind this is that the distribution of x_i can be represented as conditional on a parent set S_i which is substantially smaller than {x_1, . . . , x_{i−1}}. Given these sets, a Bayesian network can be represented as a directed acyclic graph such that each variable x_1, . . . , x_n corresponds to a node in the graph, with the parents of that node corresponding to the variables in S_i. Note that since the graph coincides with the conditioning sets S_i, the assertions of conditional independence are directly encoded by the Bayesian network structure, as expressed in Eq. (12.13).
A conditional probability distribution P(x_i|S_i) is associated with each node x_i, such that there is one distribution for each instance of S_i. Combining Eqs. (12.12) and (12.13), we obtain a Bayesian network for {x_1, . . . , x_n} which uniquely determines a joint probability distribution for these variables. That is,

P(x_1, . . . , x_n) = ∏_{i=1}^{n} P(x_i | S_i).    (12.14)
Because the joint probability distribution for any network can be determined from the Bayesian network for a given domain, the network can in principle be used to compute any probability of interest. Consider a simple Bayesian network with the structure w → x → y → z, in which we want to find the probability "of occurrence of the event w given that the event z has already occurred," expressed as P(w|z). From the Bayes rule, and making use of the structure w . . . z, the equation can be expressed as follows:

P(w|z) = P(w, z) / P(z) = Σ_{x,y} P(w, x, y, z) / Σ_{w,x,y} P(w, x, y, z).    (12.15)
In the above, P(w, x, y, z) is the joint probability distribution computed for the Bayesian network w → x → y → z. On observing the numerator and denominator terms of Eq. (12.15), we see that this approach is not feasible as it stands, for the reason that it requires summing over an exponential number of terms. However, we can exploit the conditional independence encoded in the Bayesian network to arrive at a more efficient computation. Using this feature, Eq. (12.15) can be transformed to

P(w|z) = Σ_{x,y} P(w, x, y, z) / Σ_{w,x,y} P(w, x, y, z)
       = P(w) Σ_x P(x|w) Σ_y P(y|x) P(z|y) / [ Σ_w P(w) Σ_x P(x|w) Σ_y P(y|x) P(z|y) ].    (12.16)
For the situation shown in Fig. 12.3, the variables x and y are called marginally independent, but they are conditionally dependent given the variable z. It is possible to convince ourselves of this using the following example. Let the first variable have the assignment x = "rain", and the second y = "sprinkler on". When both are True, they will cause the lawn to become wet.
As long as we have made no observation about the lawn, the occurrences of x (rain) and y (sprinkler on) are independent. However, once it is observed that the lawn is wet (z), the two causes become dependent: learning that it has rained explains the wet lawn and makes it less likely that the sprinkler was on, and vice versa.
[Fig. 12.3: Rain (x) and Sprinkler on (y) are parent nodes of Wet lawn (z).]
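This explaining-away effect can be checked numerically; the CPT values below are hypothetical.

```python
from itertools import product

# Hypothetical CPTs for the rain/sprinkler/wet-lawn network of Fig. 12.3.
P_rain = {1: 0.2, 0: 0.8}
P_spr  = {1: 0.3, 0: 0.7}
P_wet  = {(1, 1): 0.99, (1, 0): 0.90, (0, 1): 0.80, (0, 0): 0.05}  # P(z=1|x,y)

def joint(x, y, z):
    pz1 = P_wet[(x, y)]
    return P_rain[x] * P_spr[y] * (pz1 if z else 1 - pz1)

def cond(query, evidence):
    """P(query | evidence); both are dicts over the names 'x', 'y', 'z'."""
    num = den = 0.0
    for x, y, z in product((0, 1), repeat=3):
        v = {'x': x, 'y': y, 'z': z}
        if any(v[k] != w for k, w in evidence.items()):
            continue
        p = joint(x, y, z)
        den += p
        if all(v[k] == w for k, w in query.items()):
            num += p
    return num / den

print(cond({'x': 1}, {'z': 1}))           # ~0.46: a wet lawn raises P(rain)
print(cond({'x': 1}, {'z': 1, 'y': 1}))   # ~0.24: the sprinkler explains it away
```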
The next example shows the cause–effect relationships of Fig. 12.4, a network over the variables x_1, . . . , x_6. In this case, the probability distribution is given by

P(x_1, . . . , x_6) = P(x_5 | x_4 ∧ x_3) P(x_4 | x_2 ∧ x_3) P(x_2 | x_1) P(x_3 | x_1) P(x_6 | x_3) P(x_1).    (12.21)
(ii) As another example, let W = "worn piston rings" cause both B = "blue exhaust" and L = "low oil level" (see Fig. 12.7). The joint probability distribution is given by

P(W, B, L) = P(W) P(B|W) P(L|W).    (12.23)

(iii) Consider another example, with the variable L = "low oil level", which can be caused either by C = "excessive oil consumption" or by E = "oil leak" (see Fig. 12.8). The joint probability distribution is given by

P(C, E, L) = P(C) P(E) P(L | C ∧ E).    (12.24)
Based on the above discussion, a Bayesian network is constructed by identifying the variables of the domain, adding a directed edge from each direct cause to its effect, and attaching to every node the conditional probability distribution of the node given its parents.
[Fig. 12.9: a Bayesian network over the binary variables Fire (F), Tampering (T), Smoke (S), Alarm (A), Leaving (L), and Report (R).]
Figure 12.9 shows a small network of six binary variables, with the conditional probabilities given in Tables 12.1 and 12.2. Such a network may be used to compute answers to probabilistic queries; for example, we may like to know the probability of fire given that people are reported to be leaving the room, or the answer to a query like: what is the probability of smoke given that the alarm is off?
A Bayesian network has two components: a structure in the form of a directed acyclic graph, and a set of conditional probability tables. The acyclic graph's nodes correspond to the variables of interest (see Fig. 12.9), and the graph's edges have a formal interpretation as direct causal influences. A Bayesian network must include a conditional probability table (CPT) for each variable, quantifying the relationship between the variable and its parents in the network.
Consider the variables A = "Alarm", T = "Tampering", and F = "Fire", each of which may be True or False. Figure 12.9 shows the acyclic graph, and Table 12.2 shows the conditional probability distribution of A given its parents "Fire" and "Tampering". As per this CPT, the probability of occurrence of the event A = True, given the evidence F = True and T = False (row 3), is P(A = True | F = True, T = False) = 0.99. This probability is called a network parameter.
Guaranteed consistency and completeness are the main features of Bayesian networks. The latter holds because there is one and only one probability distribution which satisfies the network constraints.
A Bayesian network with n binary variables induces a unique probability distribution over 2^n instantiations; for the network shown in Fig. 12.9 there are 64 such instantiations. This distribution provides sufficient information to predict the probability of each and every event we can express using the variables S, F, T, A, L, R appearing in the network. An event may be a combination of these variables with True/False values. An example of an event may be: "an Alarm and Tampering, given no smoke, and a report has indicated people leaving the building."
An important advantage of Bayesian networks is the availability of efficient algorithms for computing such probability distributions. In the absence of these networks, the required probability distributions would have to be generated explicitly, which is infeasible for a large number of variables. Interestingly, in areas such as genetics, reliability analysis, and information theory, already existing algorithms are subsumed by the more general algorithms for Bayesian networks.
Let us try to discover the central insight behind Bayesian networks, due to which it becomes possible to represent large probability distributions in a compact way. Consider Fig. 12.9 and the associated Conditional Probability Tables (CPTs), Tables 12.1 and 12.2. Each probability appearing in a CPT specifies a constraint which must be satisfied by the distribution induced by the network. For instance, the distribution must assign the probability 0.1 to the event of "having smoke without fire", i.e., P(S = True | F = False) = 0.1, where S and F are the variables for "smoke" and "fire", respectively. These constraints, however, are insufficient for concluding a unique probability distribution. So, what additional information do we need? The answer lies in the structure of a Bayesian network: the additional constraints are specified by the network in the form of probabilistic conditional independence. As per that, once the parents of a variable are known, the variable is assumed to be independent of its non-descendants. In Fig. 12.9, the variable L, which stands for "leaving the premises", is taken as independent of its non-descendants T, F, and S, once its parent A becomes known. Put differently, as soon as the variable A becomes known, the probability distribution of the variable L will not change on the availability of new information about the variables T, F, and S.
As a different case in Fig. 12.9, the variable A has no dependence on the variable S (a non-descendant of A); this independence holds as soon as the parents F and T of the variable A become known. These independence constraints with respect to non-descendants are called the Markovian assumptions of a Bayesian network [2].
From the above discussion, do we mean that whenever a Bayesian network is constructed, it is necessary to verify the conditional independencies? This, in fact, depends on the construction method adopted. There are three main approaches to the construction of Bayesian networks:
1. subjective construction,
2. construction based on synthesis from other specifications, and
3. construction by learning from data.
The first of these approaches is somewhat less systematic, as one rarely thinks about conditional independence while constructing a Bayesian network. Instead, one thinks about causality, i.e., adding an edge E_i → E_j from event E_i to event E_j whenever E_i is recognized as a direct cause of E_j. This results in a causal structure where the Markovian assumption reads: "each variable becomes independent of its non-descendants, once its direct causes are known."
The probability distribution induced by a Bayesian network also satisfies additional independencies beyond the Markovian ones. All of these independencies can be identified using a graphical test called d-separation.2 As per this test, any two variables E_i and E_j are considered independent if every path between E_i and E_j is blocked by a third variable E_k. For instance, consider the path α shown in Fig. 12.9:

α : S ← F → A ← T.    (12.25)
2 d-separation: A method to determine which variables are independent in a Bayes net [3].
Now, consider that the alarm A is triggered. The path in Eq. (12.25) can then be used to show a dependency between the variables S and T, as follows. Observing S (smoke as evidence) increases the likelihood that fire F has taken place, due to the direct cause–effect relation. Also, the increased likelihood that F has taken place explains away the tampering T as the cause of the alarm, i.e., there is less chance that the alarm is due to tampering. Hence, the path in Eq. (12.25) establishes a dependence between S and T: the variables S and T are not independent, due to the presence of an unblocked path between them.
However, if F (instead of A) is the given variable, this path cannot be used to show a dependency between S and T; in that case the path is blocked by F.
Hidden Markov Models (HMMs) are useful for modeling dynamic systems in which some states are not observable, and where it is required to make inferences about the changing states, given the sequence of outputs they generate. The potential applications of HMMs are those requiring temporal pattern recognition, like speech processing, handwriting recognition, gesture recognition, and bioinformatics [2].
Figure 12.10a shows an HMM: a model of a system with three states a, b, c, three outputs x, y, z, and possible transitions between the states. A probability is associated with each transition; e.g., there is a transition from state b to c with 20% probability. Each state can produce a certain output with some probability; e.g., state b can produce the output z with probability 10%. The HMM of Fig. 12.10a is represented by a Bayesian network in Fig. 12.10b. There are two variables, S_t (the state at time t) and O_t (the system output at time t), with t = 1, 2, . . . , n. The variable S_t takes the three values a, b, c, and O_t takes the three values x, y, z. Using d-separation on this network, one can derive the characteristic property of HMMs: once the system state at time t is known, the states and outputs at times later than t are independent of those at times earlier than t. Figure 12.10b is a simple dynamic Bayesian network.
[Fig. 12.10: (a) an HMM with states a, b, c and outputs x, y, z; (b) the corresponding dynamic Bayesian network with state variables S_1, . . . , S_n and output variables O_1, . . . , O_n.]
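A minimal forward-filtering sketch for such a dynamic network follows; the transition and emission probabilities below are illustrative, not those of Fig. 12.10.

```python
states = ['a', 'b', 'c']

# Hypothetical parameters; each row sums to 1.
T = {'a': {'a': 0.7, 'b': 0.3, 'c': 0.0},      # P(S_{t+1} | S_t)
     'b': {'a': 0.1, 'b': 0.7, 'c': 0.2},
     'c': {'a': 0.2, 'b': 0.0, 'c': 0.8}}
E = {'a': {'x': 0.8, 'y': 0.1, 'z': 0.1},      # P(O_t | S_t)
     'b': {'x': 0.2, 'y': 0.7, 'z': 0.1},
     'c': {'x': 0.1, 'y': 0.2, 'z': 0.7}}
prior = {'a': 1/3, 'b': 1/3, 'c': 1/3}

def forward(obs):
    """Filtering: belief over the hidden state given the outputs so far."""
    belief = dict(prior)                       # P(S_1)
    first = True
    for o in obs:
        if not first:                          # predict one step ahead
            belief = {s2: sum(belief[s1] * T[s1][s2] for s1 in states)
                      for s2 in states}
        first = False
        belief = {s: belief[s] * E[s][o] for s in states}   # weight by P(O_t|S_t)
        norm = sum(belief.values())
        belief = {s: p / norm for s, p in belief.items()}   # normalize
    return belief

print(forward(['x', 'y', 'z']))
```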
[Figs. 12.11 and 12.12: a computer system consisting of a power supply, two fans, two processors, and a hard disk drive, organized into subsystems, together with the corresponding Bayesian network built from AND/OR blocks.]
Figure 12.12 shows the Bayesian network and the Conditional Probability Tables (CPTs) of this system. Consider the variables as follows:
E is for the power supply, F_1 and F_2 for the fans, P_1 and P_2 for the processors, and D for the hard disk; let these variables represent the availability of the corresponding components. Let the variable S represent the availability of the whole system. The variables A_i and O_j represent logical AND and OR, respectively.
Network construction by learning from data
A third approach to the construction of Bayesian networks works on the principle of learning from data, like patients' medical records, airline ticket records, or the buying patterns of customers. Such data sets can help to learn the network parameters given the structure of the network, or to learn both the network and its parameters when the data set is complete; learning only the parameters is the easier task. Since learning is itself an inductive task, the learning process is guided by an induction principle. The two main principles of inductive learning are: 1. the maximum likelihood principle, and 2. learning using the Bayesian approach. The first approach favors those Bayesian networks that maximize the probability of observing the given data set, while the second approach uses some prior information, encoding preferences over Bayesian networks, along with the likelihood principle.
Let us assume that we are interested in learning the network parameters. Learning using the Bayesian approach allows one to put a prior distribution on the possible values of each network parameter. The data set and the prior distribution together induce a new distribution on the values of that parameter, called the posterior distribution. This posterior distribution is then used to pick a value for the parameter, e.g., the distribution mean. As an alternative, we can use all the different parameter values while computing answers to queries, by averaging over all possible parameter values according to their posterior probabilities. A small sketch of this idea follows.
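The sketch below illustrates Bayesian learning of a single binary network parameter, using a Beta prior (a standard conjugate choice; the data are hypothetical).

```python
# Beta(a, b) prior over a network parameter p = P(X = 1 | a parent instance).
a, b = 1.0, 1.0                    # uniform prior
data = [1, 0, 1, 1, 0, 1, 1, 1]    # hypothetical observations for that instance

# The Beta prior is conjugate: the posterior is Beta(a + ones, b + zeros).
ones = sum(data)
zeros = len(data) - ones
post_a, post_b = a + ones, b + zeros

posterior_mean = post_a / (post_a + post_b)   # one way to pick a point value
print(posterior_mean)                          # 7/10 = 0.7
```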
Given a Bayesian network as in Fig. 12.12, we can find the overall reliability of this system. Also, for example, given that the system is unavailable, we can find the most likely configuration of the two fans, or of the processors. Further, given a Bayesian network, we can answer questions such as: what single component can be replaced to increase the system reliability by, say, 10%? These are examples of the questions which can be answered in such domains using the principles of Bayesian probability distributions.
The Dempster–Shafer Theory (DST) of evidence can be used for modeling relations between several pieces of evidence and a single hypothesis, or between a single piece of evidence and multiple hypotheses. The DST is useful for the assessment of the uncertainty of a system in which actually only one hypothesis is true. Another approach to DST, called the reliability-oriented approach, treats the system with all its hypotheses, pieces of evidence, and data sources. A hypothesis set is a collection of all possible states (e.g., faults) of the system under consideration [5].
In DST, it is a precondition that all hypotheses are singletons (i.e., elements) of some frame of discernment: a finite universal set Ω with 2^|Ω| subsets, which we will write simply as 2^Ω. A subset of Ω is a single hypothesis or a conjunction of hypotheses. The subsets in 2^Ω are unique and not all of them are disjoint; however, it is mandatory that all the hypotheses are disjoint and mutually exclusive, in addition to being unique.
The pieces of evidence here are symptoms or events. An example of a symptom or evidence is a failure that has occurred or may occur in the system. Evidence is always related to a single hypothesis or to a set of hypotheses (multi-hypothesis). Though theoretically it can be debated, it is not allowed in DST that many different pieces of evidence lead to the same hypothesis or the same collection of hypotheses. A relation between a piece of evidence and a hypothesis is qualitative, corresponding to a chain of cause–effect relations: the evidence implies the existence of a hypothesis or hypotheses, as in the Bayes rule.
The DST represents a subjective viewpoint in the form of a computation for an unknown objective fact. The data sources used in DST are persons, organizations, or other entities that provide the information. Using a data source, the mapping in Eq. (12.26) assigns an evidential weight to a diagnosis set A ⊆ Ω. Since A ∈ 2^Ω, the set A contains a single hypothesis or a set of hypotheses. The weight of any diagnosis A is given by a function whose value is between 0 and 1:

m : 2^Ω → [0, 1].    (12.26)

Note that the difference between DST and probability theory is that the DST mapping clearly distinguishes between evidence measures and probabilities.
Definition 12.1 (Focal Element) Each diagnosis set A ⊆ Ω, for which m(A) > 0,
is called a focal element.
Definition 12.2 (Basic Probability Assignment) The function m is called a Basic Probability Assignment (BPA) if it fulfills the condition:

Σ_{A⊆Ω} m(A) = 1.    (12.27)
The meaning of the above condition is that the evidence presented by each data source carries equal weight, so all statements of a single data source must be normalized; in other words, no data source is more important than the others. We assume that if a data source is null, then its BPA is also null, i.e., m(φ) = 0.
A belief measure is given by the following function, which also ranges between 0 and 1:

bel(A) = Σ_{B⊆A, B≠φ} m(B).    (12.29)
As we add more and more evidence to the system, the uncertainty interval reduces, as will be seen in the examples that follow. This is a natural and logical property of DST.
The DST allows the combination of pieces of evidence to predict a hypothesis with a stronger belief. This is done with Dempster's rule of combination, which aggregates two independent bodies of evidence into one body of evidence, provided that they are defined within the same frame of discernment. Dempster's rule combines two pieces of evidence and computes the new basic probability assignment as

m(A) = (m_1 ⊗ m_2)(A) = Σ_{B∩C=A} m_1(B) m_2(C) / (1 − Σ_{B∩C=φ} m_1(B) m_2(C)).    (12.32)
In the above, A ≠ φ; the numerator represents the accumulated evidence for the sets B and C that supports the hypothesis A (since A = B ∩ C), and the quantity in the denominator is called the normalization factor [5].
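Equation (12.32) can be coded compactly over frozensets of hypotheses; the toy masses below, echoing Example 12.5 further on, are hypothetical.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule of combination, Eq. (12.32).

    m1, m2 map frozensets of hypotheses (focal elements) to masses.
    """
    raw = {}
    conflict = 0.0
    for (B, p1), (C, p2) in product(m1.items(), m2.items()):
        A = B & C
        if A:
            raw[A] = raw.get(A, 0.0) + p1 * p2
        else:
            conflict += p1 * p2            # mass falling on the empty set
    # Renormalize by the factor 1 - (mass on the empty set).
    return {A: p / (1.0 - conflict) for A, p in raw.items()}

U = frozenset({'malaria', 'typhoid', 'cold'})
m1 = {frozenset({'malaria'}): 0.4, U: 0.6}
m2 = {frozenset({'typhoid'}): 0.2, U: 0.8}
print(combine(m1, m2))
```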
The Dempster–Shafer Theory (also called the theory of belief functions) is a generalization of the Bayes theory of conditional probability. To carry out the computation of probability, the Bayes theory requires actual probabilities for each question of interest; in DST, however, the degree of belief in one question is based on the probabilities of related questions. The "degrees of belief" may or may not have the mathematical properties of probabilities; how much they actually differ from probabilities depends on how closely the two questions are related.
The DST is based on two ideas: 1. obtaining degrees of belief for one question from subjective probabilities of a related question, and 2. using Dempster's rule to combine such degrees of belief when they are based on independent items of evidence.
The DST differs from the application of the Bayes theorem in the following respects:
1. Unlike the Bayes theorem, the DST allows the representation of diagnoses in the form of classes and hierarchies, as shown in Fig. 12.13.
2. For a probability of X% for a diagnosis, the difference (100 − X)% is not taken as evidence against the diagnosis, but is interpreted as an uncertainty interval. That is, the probability of the diagnosis lies between X% and 100%, and as more and more evidence is added, the diagnosis may tend toward 100%.
3. The probability against a diagnosis is considered as the probability of the complement of the diagnosis. For example, if the event set is {A, B, C, D}, the probability against the event A is the probability of {B, C, D}, which will have a further distribution.
Consider a biased coin, where the probability of getting a head is P(H) = 0.45. Since the coin is biased, we do not know what the remaining 55% should be attributed to. In DS theory, this is called the ignorance level. Since we do not know where the rest of the probability goes, we assign it to the universe U = {H, T}; that is, it may be attributed to {T} or to {H, T}.
Now let us assume that there is evidence that the probability of getting a tail is 46%. Due to this new evidence, the balance probability assignment is 1 − 0.45 − 0.46 = 0.09, and this is assigned to the set {H, T}: it cannot be attributed to {H} or {T} unless some further evidence becomes available. The 0.09 probability assigned to {H, T} may stand, for example, for the coin standing on its edge when tossed.
Example 12.5 Using DST for finding the probabilities of certain diseases.
Let us assume that on a certain isolated island the universal set of diseases is Ω = {malaria, typhoid, common cold}. A team of doctors visits this island and performs a laboratory test (T_1) on a patient, which shows a belief of 40% for malaria. Hence, as per DS theory, the balance 60% goes to the ignorance level, i.e., to the entire universe {malaria, typhoid, common cold}.
Wanting more confidence, the doctors ask the patient for a second laboratory test, for some other parameters; the evidence now adds a belief of 30% for malaria. Hence, the remaining 30% goes to the entire universe {malaria, typhoid, common cold}; thus the ignorance is 30%.
The doctors then ask for a third laboratory test, to add further to the evidence, and its finding adds a belief of 20% for typhoid. This makes the total belief 90%, and the ignorance level reduces to 10%, which again is assigned to the universe set {malaria, typhoid, common cold}.
The above example shows that as more and more evidence is added, we get a stronger and stronger belief about the distribution of probability over the various diagnoses, and the ignorance level reduces. This, in fact, is how reasoning with probabilities ought to behave.
Let us assume that the kinds of errors which can occur in software testing are the set {data validation errors, computation errors, transmission errors, output errors}. In this case also we can use DST, similar to the cases discussed above, and compute the joint probability distribution of the errors.
Example 12.6 Find the distribution of probabilities for the diseases shown by the hierarchically structured diagnoses in Fig. 12.13. Consider that the following are the probabilities of certain events: against A = 40%, and evidence for Y = 70%.
In DST, when evidences are combined, the probabilities get multiplied and the resultant set is the intersection of the original sets. Table 12.3 shows the new distribution.
The probability of an event or a set of events is expressed not as an absolute value but as an uncertainty interval [a, b], where a and b are the lower and upper probability limits, respectively. The value a is computed as the sum of the probabilities of the set and its subsets, while b = 1 − c, where c is the sum of the probabilities of the complements of the set and of the subsets corresponding to a.
The probability range [a, b] for the diagnoses {A} and {B} is computed as follows:
[a, b] = [{A}, 1 − ({B, C, D} + {B, C} + {B, D} + {C, D} + {B} + {C} + {D} + {})] = [0, 1 − 0.6] = [0, 0.4].
[a, b] = [{B}, 1 − ({A, C, D} + {A, C} + {A, D} + {C, D} + {A} + {C} + {D} + {})] = [0.42, 1 − 0.0] = [0.42, 1].
After combining the new evidence, we eliminate the empty set and rescale the probabilities of the remaining sets so that they sum to 100%. This is done by dividing the probabilities of all non-empty sets by a factor equal to (1 − s), where s is the sum of the probabilities of the empty sets. For this example, s = 0.084 + 0.056 = 0.14, and 1 − s = 1 − 0.14 = 0.86. Now the sum of the probabilities over the non-empty sets is 1.0. Next, we obtain the following results using Dempster's rule of combination.
{B} = 0.336/0.86 = 39.0%
{D} = 0.06/0.86 = 7.0%
{A, B} = 0.224/0.86 = 26.0%
{B, C, D} = 0.144/0.86 = 16.8%
{A, B, C, D} = 0.096/0.86 = 11.2%.
The probability ranges for {A} and {B} are then computed as follows:
{A} = [{A}, 1 − ({B, C, D} + … + {B} + {D})] = [0, 0.372], i.e., 0–37.2%.
{B} = [{B}, 1 − ({A, C, D} + … + {D})] = [0.39, 0.93], i.e., 39–93.0%.
From the above exercise, we note that the uncertainty interval (the probability of {A, B, C, D}) has decreased after the addition of the new evidence E3. In addition, the probabilities of A and B have become more precise, i.e., their intervals are narrower.
However, note that the computation required in DST is combinatorial, since with n diagnoses a total of 2^n sets need to be computed [7].
Consider, for example, a fuzzy set of young persons graded by age: ({Anand/1, Babita/0.8, Choudhary/0.3}). That means, perhaps, Anand is 23 years old, Babita is 27 years old, and Choudhary is in his 50s. The membership function of a set maps each element of the universe to some degree of association with the set.
The fuzzy set concept is somewhat similar to the concept of a set in classical set theory. When we say, "the Sun is a star", it means the Sun belongs to the set of stars, and when we say, "the Moon is a satellite", it means the Moon belongs to the set of satellites of the Earth. Both these statements are true; hence, they both map to true, or 1. The statement "the Sun is a black hole" is false, as the Sun does not (yet) belong to the set of black holes; hence, the statement maps to false, or 0. Thus, a statement in classical logic maps to either 1 or 0, a crisp value. Since each statement in classical set theory has the logical value 0 or 1, such sets are called crisp sets, and each set has a logic (T/F) associated with it through the membership of its elements. Thus, sets and logic are two sides of the same coin; they go together. Various properties of fuzzy sets, like relations, logic, and inferences, have counterparts in classical sets [1].
To propose a formal representation, a fuzzy set A is written as A = { u/a(u) | u ∈ U }, where u is an element and a(u) is the membership function, called the characteristic function, which represents the degree of belongingness; U is the universe set. The notation u ∈ U here means "every u ∈ U".
The fuzzy operations have counterpart operations in classical set theory: union, intersection, complement, binary relation, and composition of relations. The common definitions of the fuzzy set operations are as follows.
$$A \cup B = \left\{ \frac{u}{\max(a(u), b(u))} \ \Big|\ u \in U \right\} \qquad (12.33)$$
$$A \cap B = \left\{ \frac{u}{\min(a(u), b(u))} \ \Big|\ u \in U \right\} \qquad (12.34)$$
(Figure: membership μ versus Age for the linguistic variable "young" (graded membership, declining between ages 20 and 40) and the crisp numerical variable "young" (sharp cutoff at age 25).)
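As an illustration of Eqs. (12.33) and (12.34), here is a minimal sketch over a discrete universe, with fuzzy sets represented as dicts from elements to membership degrees; the membership values for "young" and "middle" below are invented for illustration.

```python
def fuzzy_union(a, b):
    """Union (Eq. 12.33): pointwise max of the membership functions."""
    return {u: max(a.get(u, 0.0), b.get(u, 0.0)) for u in set(a) | set(b)}

def fuzzy_intersection(a, b):
    """Intersection (Eq. 12.34): pointwise min of the membership functions."""
    return {u: min(a.get(u, 0.0), b.get(u, 0.0)) for u in set(a) | set(b)}

# Hypothetical membership degrees by age (illustrative values only).
young  = {20: 1.0, 30: 0.6, 40: 0.1}
middle = {20: 0.1, 30: 0.5, 40: 0.9}
print(fuzzy_union(young, middle))         # 20 -> 1.0, 30 -> 0.6, 40 -> 0.9
print(fuzzy_intersection(young, middle))  # 20 -> 0.1, 30 -> 0.5, 40 -> 0.1
```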
Let us assume that A and B are fuzzy sets over universes U and V, respectively. The Cartesian product U × V is defined just as for ordinary sets. We define the Cartesian product A × B as
$$A \times B = \left\{ \frac{(u, v)}{\min(a(u), b(v))} \ \Big|\ u \in U,\ v \in V \right\}. \qquad (12.36)$$
Every fuzzy binary relation (or mapping relation, or fuzzy relation) from fuzzy set U to fuzzy set V is a fuzzy subset of the relation R defined below; here m(u, v) is called the membership function, having the range [0, 1]:
$$R : U \times V = \left\{ \frac{(u, v)}{m(u, v)} \ \Big|\ u \in U,\ v \in V \right\}. \qquad (12.37)$$
Fuzzy sets are extensions and more general forms of ordinary sets; accordingly, fuzzy logic is an extension of ordinary logic. Just as there are correspondences between ordinary sets and ordinary logic, there exists a correspondence between fuzzy set theory and fuzzy logic. For example, the set operations of union (OR), intersection (AND), and complement (NOT) exist in classical set theory (classical logic) as well as in fuzzy systems. The degree of belongingness of an element in a fuzzy set corresponds to the truth value of a proposition in fuzzy logic.
A fuzzy implication is viewed as describing a relation between two fuzzy sets. Using fuzzy logic we can represent a fuzzy implication such as A → B, i.e., "if A then B", where A and B are fuzzy sets. For example, if A is the fuzzy set young and B is the fuzzy set small, then A → B may mean: if young then small. A simple definition is A → B = A × B, where A × B is the Cartesian product of the fuzzy sets A and B.
Given:
Implication: "if A then B",
Premise: "X is true",
Derive:
Conclusion Y.
To derive a fuzzy inference, we perform the following steps (a minimal sketch of these two steps is given after the list):
1. Given "if A then B", compute the fuzzy implication as a fuzzy relation, R = A × B;
2. Induce Y by Y = X ◦ R.
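The following Python sketch implements the two steps over discrete fuzzy sets (dicts), assuming the min rule for implication and max-min composition; the fuzzy sets and membership values are illustrative, not from the text.

```python
def implication(a, b):
    """Step 1: fuzzy implication as a relation R = A x B (min rule)."""
    return {(u, v): min(au, bv) for u, au in a.items() for v, bv in b.items()}

def compose(x, r):
    """Step 2: max-min composition Y = X o R, inducing the conclusion Y."""
    y = {}
    for (u, v), muv in r.items():
        y[v] = max(y.get(v, 0.0), min(x.get(u, 0.0), muv))
    return y

# "if young then small", with invented membership values
young = {20: 1.0, 30: 0.5}        # A, over ages
small = {150: 0.8, 170: 0.3}      # B, over heights (cm)
r = implication(young, small)     # R = A x B
x = {20: 0.6, 30: 1.0}            # observed fuzzy premise X
print(compose(x, r))              # induced Y: {150: 0.6, 170: 0.3}
```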
We have noted that a linguistic variable in fuzzy logic behaves as a source of data compression, since a linguistic variable may stand for a large number of values. Fuzzy logic, in addition, provides the fuzzy if-then rule (or simply the fuzzy rule) and the fuzzy graph. The fuzzy rule and the fuzzy graph bear the same relation to numerical dependencies that the linguistic variable has to numerical values [8].
Figure 12.15 shows a fuzzy graph f* of a functional dependence f : X → Y, where X ∈ U and Y ∈ V are linguistic variables. Let the sets of values of the linguistic variables X and Y be Ai and Bi, respectively. The graph is an approximate, compressed representation of f in the form of f*, as
$$f^* = A_1 \times B_1 + A_2 \times B_2 + \dots + A_n \times B_n, \qquad (12.41)$$
which is equal to
$$f^* = \sum_{i=1}^{n} A_i \times B_i. \qquad (12.42)$$
(Figure 12.15: a fuzzy graph f* formed from fuzzy points Ai × Bi over the axis X.)
The membership function of f* is
$$\mu_{f^*}(u, v) = \bigvee_i \left( \mu_{A_i}(u) \wedge \mu_{B_i}(v) \right). \qquad (12.43)$$
Equivalently, f* corresponds to the fuzzy rule set
f*: if X is A1 then Y is B1
    if X is A2 then Y is B2
    …
    if X is An then Y is Bn.   (12.44)
This rule set can be represented equivalently as a fuzzy graph (see Fig. 12.16) using the fuzzy expression for f*.
An important property of fuzzy graphs is that any type of function (or relation) can be approximated by a fuzzy graph; for example, a normal probability distribution can be approximated in this way.
(Figure 12.16: the fuzzy graph f* plotted over the axis X.)
An interesting point of fuzzy calculus is that it aims to develop computational procedures for the basic operations on fuzzy graphs. These operations are generalizations of the basic operations on crisp sets and functions. The required computations can be simplified if the operation "∗" is taken as monotonically increasing; that is, if a, b, a′, and b′ are real numbers, then the following holds:
$$a' \ge a,\ b' \ge b \;\Rightarrow\; a' * b' \ge a * b$$
$$a' \le a,\ b' \le b \;\Rightarrow\; a' * b' \le a * b. \qquad (12.48)$$
From these operations, it can easily be deduced that "∗" is a distributive operator over "∨ (max)" and "∧ (min)". Accordingly,
$$a * (b \vee c) = (a * b) \vee (a * c)$$
$$a * (b \wedge c) = (a * b) \wedge (a * c). \qquad (12.49)$$
(Figure 12.17: basic operations on fuzzy functions, each plotted over X: assignment f(u), intersection of f and g, composition g(f(u)), and combination f(u) ∗ g(u). Figure 12.18: the intersection of the fuzzy graphs f* and g* in the plane U × V.)
Consider finding the intersection of the fuzzy graphs f* and g* (Fig. 12.18), where
$$f^* = \sum_i A_i \times B_i \qquad (12.52)$$
and
$$g^* = \sum_j C_j \times D_j. \qquad (12.53)$$
These operations on fuzzy functions and fuzzy relations are shown in Fig. 12.18.
12.7 Summary
This chapter presented approaches to reasoning in uncertain environments, which build on the notion of conditional probability:
$$P(A \mid B) = \frac{P(A, B)}{P(B)}. \qquad (12.56)$$
The fuzzy operations that have counterparts in classical set theory are union, intersection, complement, binary relation, and composition of relations.
Other branches of fuzzy set theory are fuzzy mathematical programming, fuzzy arithmetic, fuzzy topology, fuzzy graph theory, and fuzzy data analysis.
In fuzzy logic systems there is no learning capability, no memory, and no capability of pattern recognition; fuzzy systems combined with neural networks can have all three capabilities.
Different forms of hybrid systems are constructed using combinations of fuzzy logic with other areas: fuzzy-neuro systems, fuzzy-GA (fuzzy and genetic algorithms), and fuzzy-neuro-GA. The components complement each other's weaknesses, creating new systems that are more robust in solving problems.
Exercises
3. Consider that for a coin, the probability that when tossed the head appears up is
x and for tails it is 1 − x. Answer the following:
a. Given that the value of x is unknown, are the outcomes of successive flips
of this coin independent of each other? Justify your answer in the case of
head and tail.
b. Given that we know the value of x, are the outcomes of successive flips of
the coin independent of each other? Justify.
P(Y | Z ) = P(Y | X, Z ),
and
P(X | Z ) = P(X | Y, Z ).
5. A new nuclear power station has an alarm that sounds when a temperature gauge sensing the core temperature exceeds a given threshold. Let the Boolean variables be: A = alarm sounds, FA = alarm is faulty, and FG = gauge is faulty. The multivalued nodes are G = reading of the gauge, and T = actual core temperature.
a. Given that the gauge is more likely to fail when the core temperature goes too high, draw a Bayesian network for this domain.
b. Assume that the actual and the measured temperature each take just two possible values: normal and high. Let the probability that the gauge gives the correct reading be x when it is working, and y when it is faulty. Give the conditional probability table associated with G.
c. Assume that the alarm and the gauge are working correctly and the alarm sounds. Find an expression, in terms of the various conditional probabilities in the network, for the probability that the temperature of the core is too high.
d. Let the alarm work correctly unless it is faulty, in which case it never sounds. Give the conditional probability table associated with A.
6. Using the fuzzy membership graphs of Fig. 12.19 (for Small, Medium, and Tall), compute the graphical representation of the membership for the following set operations:
a. Small ∩ Tall
b. (Small ∪ Medium) − Tall
a. Ā
b. Ā ∩ A
c. Ā ∪ A
References
1. Chowdhary KR (2015) Fundamentals of discrete mathematical structures, 3rd edn. EEE, PHI
India
2. Darwiche A (2010) Bayesian networks. Commun ACM 53(12):80–90
3. http://web.mit.edu/jmn/www/6.034/d-separation.pdf. Cited 19 Dec 2017
4. Heckerman D, Wellman MP (1995) Bayesian networks. Commun ACM 38(3):27–30
5. Kay RU (2007) Fundamentals of the Dempster–Shafer theory and its applications to system safety and reliability modelling. J Pol Saf Reliab Assoc 2:283–295
6. Munakata T, Jani Y (1994) Fuzzy systems: an overview. Commun ACM 37(3):69–77
7. Puppe F (1993) Systematic introduction to expert systems. Springer
8. Zadeh LA (1994) Fuzzy logic, neural networks, soft computing. Commun ACM 37(3):77–84
Chapter 13
Machine Learning
13.1 Introduction
Learning is one of the most important activities of human beings, and of living beings in general, which helps us in adapting to the environment. Humans learn from nature as well as from special learning environments, through different techniques that vary in complexity. Learning requires transformations of ideas and information structures in the human mind.
This chapter presents the basic concepts and examples of learning techniques found in humans, and their implementation aspects for learning in machines. A common view holds that learning involves making changes in the learning system that improve it in some way. These changes are adaptive, in that the system performs a task or tasks more effectively the next time they are drawn from the same population of tasks.
Most of the knowledge we have of the world around us, and of explicit sets of tasks, is not formalized, and is not even available in textual form; the latter is a necessary requirement for it to be understood by any computer program. This is the reason it is not easy to write a program for a computer to do many tasks that we humans do so easily, such as understanding spoken sentences, images, or languages, or driving a car. Attempts to achieve this by organizing sets of facts into elaborate data structures, so as to enable computers to understand them, have achieved very little success.
The concept of learning is based on the principles of training computing machines, and enabling them to teach themselves. The tenet of these machines/programs relates to what we consider a good decision. For humans and animals, evolutionary principles dictate that decisions should lead to behaviors that optimize the chances of survival and reproduction. In human societies, a good decision might be one involving social interactions that bring status, or a sense of well-being. For a machine such as a self-driving car, such criteria cannot exist; so the quality of decision-making for a self-driving car is measured by how closely the autonomous vehicle imitates the behavior of trained human drivers.
It is important to note that the knowledge required to make a good decision in a particular situation is not necessarily clear to the extent that it can be coded into a computer program. Consider, for example, that a mouse has knowledge about its surroundings: it has an innate sense of where to sniff, how to find food or mates, and at the same time how to save itself from predators. It is not easy for any programmer to specify a step-by-step procedure or a set of instructions to produce such behaviors in an artificial mouse. Yet all of this remains encoded in the brain of a mouse in the form of knowledge.
Before we think about creating computers (programs) that can train themselves, it is important to answer the fundamental question of how we humans acquire knowledge. Partly, the knowledge in humans may be innate, but most of it is acquired through experience and observation. What we know intuitively has often been learned from examples and practice, a process which cannot be turned into a clear sequence of steps of a computer program. Since the 1950s, people have looked for ways and means, and tried to refine general principles, that allow animals or humans to acquire knowledge through experience. These are the subjects of discussion in this chapter. Machine learning aims to create theories and procedures (learning algorithms) that allow machines to learn. One such approach is learning from examples presented to the machine, called learning by examples [2].
1. List the differences among the three main styles of learning: supervised, rein-
forcement, and unsupervised. [Familiarity]
Rote Learning
Learning by Instruction
Learning by Deduction
Learning by Analogy
Learning by Induction (or Similarity)
Reinforcement Learning
Discovery-based Learning
In rote learning, knowledge is stored as-is, without any transformation by the learner. This kind of learning is used when we learn the multiplication tables in the early school classes. A major concern in this learning is how to index the stored knowledge for future retrieval.
Learning by Instruction
The learning which requires the next higher effort after rote learning is learning by instruction. This approach requires the knowledge to be transformed into an operational form before it can be integrated into the knowledge base. This learning takes place among the students in a class when the teacher presents a number of facts directly, in a properly organized form, to the students.
In learning by instruction (or learning by being told), the basic transformations performed by a learner are selection and reformulation (mainly at a syntactic level) of the information provided by the teacher.
Note that, depending on how the teacher presents the knowledge in the class, different types of learning take place; learning by instruction is not the only learning induced by a teacher in his/her students.
Learning by Deduction
Deductive learning is carried out through a sequence of deductive inference steps using known facts, by which new facts and knowledge are logically derived. For example, if we know the rule "a father's father is a grandfather," and we have the case that a is the father of b and b is the father of c, then we can say that a is the grandfather of c. Here we have deduced the result, hence the name deductive learning.
Learning by Analogy
Analogical learning provides learning of new concepts through the use of similar known concepts or solutions. This is used, for example, when we solve problems in a college examination based on the solutions of somewhat similar problems we have already worked at home. Likewise, learning to ride a motorbike when we already know how to ride a bicycle is an example of analogical learning.
Learning by Induction
We make use of inductive learning when, after seeing a number of instances or examples of a concept, we formulate a general concept.
In any human learning activity, a mix of the strategies discussed above is usually involved. It is important to understand the differences between these strategies when designing a learning system. Though most current artificial systems focus on a single learning strategy, one may expect that machine learning research will give increasing attention to multi-strategy approaches, which are closer to the human learning system. As learners, we almost always use a combination of strategies in our daily life. For example, in remembering a telephone number we use rote learning; on seeing someone wearing a new dress every day, we learn through inductive learning that the person is rich. Learning that students who complete their homework on time get a higher Cumulative Percentile Index (CPI) is also an example of inductive learning.
As another example, learning the Java language after having learned C++ is analogical learning. When we are asked to solve a new puzzle, it is discovery-based learning. Whenever a teacher appreciates a student's efforts by saying "very good", for asking an interesting question or answering one, it is reinforcement-based learning for the student.
When you conclude, based on some clues, that a teacher is not in the mood for teaching, most likely you have used deductive learning (or reasoning).
Any field of scientific study is best defined by the central questions it asks. In that way, the field of machine learning seeks answers to the following questions:
1. How can we build computer systems that automatically improve their performance with experience? and 2. What are the fundamental laws that govern the learning processes?
The faculty of learning, like the faculty of intelligence, involves such a vast range of processes that it is difficult to define precisely. A dictionary definition of learning includes phrases such as "to gain knowledge, or understanding of, or skill in, by study, instruction, or experience." Another dictionary definition says learning is the "modification of a behavioral tendency by experience." Zoologists and psychologists also study the learning processes in animals and humans.
This chapter focuses on learning in machines. There are several similarities
between learning in animals and in machines. Many techniques in machine learning
are derived from the theories of animal and human learning and are presented in
the form of computational/cognitive models by psychologists. It is quite likely that
the concepts and techniques being explored by researchers in machine learning may
illuminate certain aspects of biological learning.
Regarding machines, we say roughly that a machine learns whenever it changes its structure, program, or data, based on its input or in response to external information, in such a manner that its expected future performance improves. A change like the addition of a record to a database, though it changes a data structure, does not fall within the discipline of learning. On the other hand, when the performance of a speech-recognition machine improves after hearing several samples of a person's speech, we feel convinced that the machine has learned; this is clearly a learning process.
Machine learning usually refers to changes in systems that perform tasks such as recognition, diagnosis, planning, robot control, prediction, and their combinations. These tasks are usually associated with artificial intelligence (AI). The "changes" mentioned above are of one of two types: 1. enhancement of a task the system already performs, or 2. creation of a special new ability in the system.
(a) There are many tasks that cannot be defined well in advance, except by examples. Often we are able to specify input/output relations as pairs of data, but a precise relationship between inputs and desired outputs cannot be specified; for example, the weight-to-height ratio of a male as input and health as output. We would like that, based on these input-output sample examples, the machine be able to adjust its internal structure so that it produces correct outputs for new inputs, thus suitably constraining its input/output function to approximate the relationship implicit in the given examples.
(b) It is possible that among large sets of data (credit card records, medical records, weather data, astronomical data, etc.) there are hidden important relationships and correlations. Machine learning methods can often be used to extract these relationships and patterns through the process of data mining, a subfield of machine learning.
(c) Machines produced by human designers often do not work as desired in certain environments. In fact, when a machine is designed, certain characteristics of the working environment may not be completely known at design time, and hence cannot be incorporated in the design. This often happens in space exploration, such as missions to the Moon, Mars, Saturn, and even the Sun. Machine learning methods can improve or modify the performance of the existing machine design on site.
(d) The amount of knowledge available and required for certain tasks may be too large for explicit encoding by humans. But machines can be designed that learn this knowledge gradually over time, and might be able to capture it continuously, on the fly, as needed.
(e) There is no need to redesign machines that can adapt to a changing environment. Such machines, once designed, can learn from situations and function for a long time, for example in long planetary missions.
With the continuous redesign of AI systems, changes keep taking place in world knowledge as well as in vocabulary.
The questions in the domain of AI cover a broad range of learning tasks, such as: How do we design an autonomous mobile robot that learns to navigate from its own experience? How do we mine historical medical records to learn which treatments a given patient X will respond to best? How do we build search engines that automatically customize to the user's interests? And other similar questions.
The following is a more precise definition of machine learning: a machine learns with respect to a particular task T, performance metric P, and type of experience E, if the system reliably improves its performance P at task T, following experience E. Depending on how we specify T, P, and E, the learning task might also be called by names such as data mining, autonomous discovery, database updating, or programming by example.
The field of machine learning focuses on how to make computers program themselves through their experience, given some initial structure. This is in contrast to the rest of computer science, which has so far focused primarily on manually programming computers.
Other fields that are closely related to Machine Learning are Psychology and Neuroscience, which are concerned with the study of human and animal learning, and are collectively called Cognitive Science. The questions of how animals learn and how computers can be made to learn most probably have highly intertwined answers. So far, the insights into machine learning owe far more to contributions from the fields of statistics and computer science than to studies of human learning behavior; the main reason is the poor state of understanding of human learning processes. However, the relation between machine learning and human learning behavior is continuously becoming stronger, mainly due to new machine learning algorithms, such as temporal-difference learning, which are being suggested as explanations for neural signals observed during learning in animals.
The primary goal of machine learning is to design computers that can adapt and learn from their experience without intervention from programmers. Consider an example from present-day software engineering: data-intensive software construction and testing aims to construct programs that are correct and robust. That is, given a set of input data to be processed, we code functions that produce output data, such that the code functions correctly even when the input data changes and, in the face of incorrect data, rejects it rather than producing wrong results. A machine can be designed to learn from the pattern of the data and continuously improve its performance until it reaches the desired performance. The error (difference) between the expected performance and the actual performance can be fed back to tune the software, that is, to make it learn. This is the automation of the human-intensive task of software engineering that we call testing.
Machine learning methods for software engineering are appropriate whenever hand engineering of software is difficult and yet data is available to be analyzed by learning algorithms; in such situations, machine learning algorithms are expected to outperform humans.
The process of learning requires that new knowledge structures be created in response to some form of input stimulus. This new knowledge is then assimilated into the existing knowledge and tested for its utility and efficiency; that is, it is used in performing some task. Figure 13.1 shows a general learning model. The environment is part of the overall system for learning; it may be a teacher, or nature producing random stimuli, or a room with obstacles where a robot needs to navigate from one wall to another, or a classroom with a teacher and students who are taught using a blackboard and chalk. The communication language between a teacher and a learner may be the same as the one in which the knowledge base was created, or may be different.
The input to the learner component may be physical stimuli, some kind of description, or symbolic. The information received by the learner component creates or modifies the knowledge structures in the knowledge base, which is used by the performance component to carry out some task, like solving a problem or playing a game. The response produced by the performance component is then evaluated.
When a teacher is essentially present for learning to take place, it is supervised learning; learning achievable without a teacher is called unsupervised learning. A teacher here is not necessarily a physical teacher; even nature can be a teacher. For example, if we put our fingers in hot water, we get them burnt. Here the feedback is from the hot water, which teaches us that if we put our fingers in hot water any time in the future, they will be burnt in a similar way.
Supervised learning methods require an oracle that forecasts the classes (categories) to which the given examples belong, and then classifies the examples into those categories. These methods assume that the categories are known in advance.
Unsupervised learning eliminates the need for a teacher; the learner is solely responsible for forming its own concepts and evaluating them for learning. In fact, unsupervised methods have to discover the concept classes to which the given examples belong, i.e., to map the examples to concept classes.
For example, scientists are usually not blessed with a teacher to help them pursue their research; instead, they propose hypotheses to explain the observations they make, evaluate these hypotheses based on criteria like generality, simplicity, and elegance, and test them through experiments designed by themselves.
Another example of unsupervised learning is unsupervised word sense disambiguation (WSD), which is based on unlabeled corpora and does not exploit any manually sense-tagged corpus to provide a sense choice for a word in context.
Discovery-based learning is also a category of unsupervised learning.
The ability of human beings to meet complex demands is mainly based on the human ability to exploit previous experiences: if we have the previous experience of walking unaided through the Thar desert for five miles ten years ago, we will not hesitate to repeat the adventure now, even for a longer distance. Based on our previous experiences, we are able to predict the characteristics or relations of man-made or natural objects, we can reason about possible outcomes of actions, and we can apply previously successful routines and strategies to new tasks and problems, as in the desert-journey example. The basic process for expanding knowledge is to first construct a hypothesis from observed examples, for instance:
In situation, do action:
situation1 → action1
situation2 → action2
…
situation_gen → action_gen.
The above schema shows that inductive learning is generalization from a set of examples; with a little pondering, we note that this is one of the most fundamental types of learning among humans. Concept learning is a typical inductive learning approach: given examples of some concept, such as "cat", it allows us to correctly recognize future instances of that concept. Similarly, learning the concept that something is a "good stock investment" allows us to correctly recognize instances of this type in the future; i.e., if an investment has been made intelligently and found rewarding, there is a good chance that in the future we will adopt a similar strategy. Such instances are very common in life; hence, the most common application of inductive learning is in concept learning.
Definition 13.2 (Concept Learning) Concept learning is, given a number of positive and negative examples of some concept together with some background knowledge, to find a general description or hypothesis of the concept that covers all the positive examples and none of the negative examples.
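A classical instance of Definition 13.2 is the Find-S algorithm, which maintains the most specific conjunctive hypothesis consistent with the positive examples. The sketch below is a standard textbook construction, not taken from this chapter, and the ball examples are illustrative.

```python
def find_s(examples):
    """Find-S: compute the maximally specific conjunctive hypothesis
    covering every positive example. Each example is (attributes, label);
    '?' in the hypothesis means 'any value'. Negatives are not used."""
    h = None
    for attrs, positive in examples:
        if not positive:
            continue
        if h is None:
            h = list(attrs)                                # first positive
        else:
            h = [hi if hi == ai else "?" for hi, ai in zip(h, attrs)]
    return h

# (size, color, shape) examples of the concept "ball"
examples = [(("small", "red", "round"), True),
            (("large", "red", "round"), True),
            (("small", "red", "square"), False)]
print(find_s(examples))   # ['?', 'red', 'round']
```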
• Supervised learning,
• Unsupervised learning,
• Concept hierarchies, or
• Single/multi-label learning.
A challenge in machine learning is dealing with large spaces of possible hypotheses, and the search associated with such spaces. This problem is better addressed using expert domain knowledge to constrain the search. Further, the problem is somewhat simplified, as machine learning allows the use, to some extent, of general domain knowledge, which holds true for the entire domain. For constraining the search, it allows the use of "local" expert knowledge relating to specific situations, valid for the chosen learning examples. For this, a domain expert provides some learning examples, which, together with other examples, are input to the learning algorithms. These examples are explained in the form of arguments, for and against, and are called argumented examples; hence learning using such examples is called Argument-Based Learning (ABL) [5].
Definition 13.4 (Learning from examples) Given some examples (with arguments), find a theory that is consistent with them.
An algorithm, after learning from these examples, will induce a rule. Now, we want to see how this rule is modified when it faces the arguments. The modified rule explains Mrs. Red's example using the given argument: credit was approved because Rich = Yes; the rule now AB-covers Mrs. Red. Based on this discussion, we note that the use of arguments in learning contributes the following advantages:
1. They impose constraints over the space of possible hypotheses, thus reducing
both time and space complexity of each search, enabling faster and more efficient
induction of theories;
2. An induced theory is more meaningful for an expert system since it is consistent
with given arguments as well as examples.
In fact, from the perspective of a machine learning method, there may be many possible hypotheses that explain the given examples sufficiently well. But these hypotheses should be simple enough to be understandable to experts (and expert systems). Using arguments should lead to hypotheses that explain the given examples in terms similar to those used by an expert, and that correspond to full justifications.
Argument-based examples are a resource provided by the domain expert, and are added as partial domain knowledge to the learning system in advance.
A mutual online concept learning system is an agent-based system (see more details in Chap. 16) with a communication channel for the exchange of information among the agents (see Fig. 13.2). It is assumed that the communication channel is ideal, without any noise. A one-agent framework includes the basic elements: concept, instance-producing mechanism, instances, instance-interpreting mechanism, and concept-adaptation mechanism (online concept learning), with the communication channel shared among agents.
(Figure 13.2: mutual online concept learning. Agents i, j, and k each hold a concept; each agent produces instances onto a shared communication channel and interprets instances produced by the others, adapting its own concept accordingly.)
For the sake of simplicity, we first consider a single agent learning a concept online from a teacher/environment that is static. In contrast to a single-agent system, in a multi-agent online concept learning system the environment is continuously changing (not fixed), because the agents act on the environment they share.
The learning task to be performed is as follows: obtain the target concept or function f* : {0, 1}^n → {1, −1} that maps each instance1 to the correct label or class, in response to the learning instances and the feedback from the teacher [9].
A sequence of trials is needed for the online learning task to take place. In a trial,
the following order is followed by the events:
1. an instance is received by a learner from a teacher;
2. the instance is labeled as 1 or −1 by the teacher;
3. the teacher tells the learner whether the label is correct or not;
4. this feedback is used by the learner to update its concept.
In the above steps, each new trial begins when the previous trial ends.
A learning algorithm called the Perceptron algorithm takes real-valued inputs in the form of a vector, and outputs 1 if the result of its computation is greater than a threshold (θ), and −1 otherwise [3].
To be precise, if an input instance is $\vec{x} = (x_1, \ldots, x_n)$, the output computed by the perceptron is
$$f(\vec{x}) = \begin{cases} +1 & \text{if } \sum_{i=1}^{n} w_i x_i > \theta \\ -1 & \text{otherwise.} \end{cases} \qquad (13.5)$$
In the above, $\vec{w} = (w_1, \ldots, w_n) \in \mathbb{R}^n$ is called the current weight vector of the perceptron. For the sake of convenience we take the threshold $\theta$ as 0, and instead add a constant input $x_0 = 1$ with a weight $w_0$. For the sake of compactness, the perceptron function in Eq. (13.5) is written as
$$f(\vec{x}) = \mathrm{sgn}(\vec{w} \cdot \vec{x}).$$
Let $\vec{w} \cdot \vec{x} = y$. Now, we can rewrite Eq. (13.5) as
$$\mathrm{sgn}(y) = \begin{cases} +1 & \text{if } y > 0 \\ -1 & \text{otherwise.} \end{cases}$$
1 Instance and example are the same here; both are inputs to the function.
Initially, the weight vector is $\vec{w} = (0, \ldots, 0)$. On receiving an input instance $\vec{x} = (x_1, \ldots, x_n)$, the learning algorithm predicts the label of $\vec{x}$ to be $f(\vec{x})$.
If the predicted label is correct, no change is made to the weight vector. But if the predicted label is wrong, the weight vector is updated using the perceptron learning rule given below, where $\lambda$ is the learning rate and $y$ is the correct label supplied by the teacher:
$$\vec{w} = \vec{w} + \lambda\, y\, \vec{x}. \qquad (13.6)$$
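The following is a minimal sketch of the mistake-driven perceptron of Eqs. (13.5) and (13.6), with the constant input x0 = 1 absorbing the threshold; the training data is an invented, linearly separable concept.

```python
def perceptron_train(data, epochs=20, lr=1.0):
    """Train a perceptron. data is a list of (x, t) pairs, with x a
    feature tuple and t the correct label in {+1, -1}."""
    n = len(data[0][0]) + 1
    w = [0.0] * n                                   # w = (0, ..., 0)
    for _ in range(epochs):
        for x, t in data:
            xv = (1.0,) + tuple(x)                  # prepend x0 = 1
            y = 1 if sum(wi * xi for wi, xi in zip(w, xv)) > 0 else -1
            if y != t:                              # update only on a mistake
                w = [wi + lr * t * xi for wi, xi in zip(w, xv)]
    return w

# learn the OR concept (linearly separable)
data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
print(perceptron_train(data))
```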
Learning methods making use of propositional calculus are called attribute-value or propositional learners. They use objects with a fixed set of attributes, whose values are selected from a predefined set. Learning based on first-order relational descriptions (relational learning) induces descriptions of relations, and uses objects described in terms of their components and the relations among the components. As an example, if an object X has attributes a, b, c, then this can be represented as a tree with X as the root and a, b, c as leaves, showing X having a relation with each leaf; this relation is the propositional relation. The background knowledge of relations is formed using the language of examples and concept descriptions, which is typically first-order logic. In particular, learners that induce hypotheses in the form of logic programs, i.e., Horn clauses, are called inductive logic programming systems.
Inductive learning has application in the automatic construction of knowledge bases in expert systems, as an alternative to classic knowledge-acquisition systems. Inductive techniques are also used for the refinement of an existing knowledge base, since they facilitate the detection of inconsistencies, redundancies, and lack of knowledge, and are also helpful in simplifying the rules provided by the domain expert. Due to their capability to detect patterns present in the input examples, inductive methods are also used in the domains of biology, psychology, medicine, and genetics.
(Figure 13.4: a decision tree for classifying robots as friendly or unfriendly; internal nodes test the attributes "is smiling" (yes/no) and "holding" (sword, flag, balloon).)
Any unseen example is classified as follows: start from the root node, test the attributes at internal nodes, and repeat the process, proceeding to a terminal node. It is assumed that in any example belonging to a different class, at least one of the attributes has a different value.
In a decision tree, each internal node is labeled by an attribute, with links for its values, selected from a set of possible values. In the example of Fig. 13.4, one of the attributes is "is smiling", whose set of values is {yes, no}; another attribute is "holding", with the set of values {sword, flag, balloon}. Decision-tree construction has two important objectives: the first is selection of the most informative attribute (having the closest sense to the context in the example), and the second is minimizing the number of tests. Note that the number of tests is the height (or depth) of the tree, the path length from the root to a terminal node, and it directly determines the difficulty level of the problem [4]. A common measure of attribute informativeness is sketched below.
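One standard way to pick the most informative attribute is information gain, the expected reduction in entropy from splitting on that attribute (the measure used in ID3-style tree builders); the robot examples below are invented to match the spirit of Fig. 13.4.

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attr):
    """Expected entropy reduction from splitting on attr.
    examples: list of (attribute_dict, label) pairs."""
    labels = [lab for _, lab in examples]
    remainder = 0.0
    for v in {a[attr] for a, _ in examples}:
        subset = [lab for a, lab in examples if a[attr] == v]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

robots = [({"is_smiling": "yes", "holding": "flag"},    "friendly"),
          ({"is_smiling": "no",  "holding": "sword"},   "unfriendly"),
          ({"is_smiling": "no",  "holding": "balloon"}, "friendly"),
          ({"is_smiling": "yes", "holding": "sword"},   "friendly")]
for attr in ("is_smiling", "holding"):
    print(attr, round(information_gain(robots, attr), 3))
```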
(Figure 13.5: a decision tree for credit risk assessment. The root tests "Credit history?" (good, bad, unknown); the unknown branch further tests "Debt?" (high, low). The leaves assign high, moderate, or low risk, each annotated with the numbers of the training examples it covers.)
(Figure 13.6: part of a decision tree for disambiguating the word "bank". The node "bank of COUNTRY?" branches to the sense bank/RIVER on "no" and to bank/FINANCE on "yes"; terminal nodes marked "−" make no choice.)
Consider an example of a decision tree for Word Sense Disambiguation (WSD), i.e., for finding the correct sense of a given word in a phrase. We are interested in the word "bank" in the sentence "we sat on the bank of the Ganges." The process of disambiguation is demonstrated through the decision tree shown in Fig. 13.6.
The figure shows that the tree is traversed such that at every internal node a question is asked and answered; the answer at each branching is provided by the knowledge hidden in the sentence, as well as by commonsense knowledge. We note that the traversal follows the path no-yes-no from the root node to the terminal node, and the sense bank/RIVER is chosen. In the tree, the node "bank of COUNTRY" stands for, for example, "bank of India", "Bank of America", etc. Terminal nodes for which no choice can be made based on the specific feature values are indicated by empty values, "−".
Construction of a decision tree requires extensive access to the training database. Sometimes the training database is so huge that it does not fit in memory; in that case an efficient data access method is needed to achieve scalability.
The construction of a decision tree requires building statistics in memory for each node, which is carried out in a single scan. The split at any node is based on a database partition that satisfies the split criteria.
The statistics for splits are updated in memory at each node on a path from the root node to a terminal node. The partitioning at each node should satisfy the splitting criterion of that node, and sufficient statistical data, in the form of a corpus, is necessary for the construction of any learned decision tree.
Discovery-based learning requires the maximum learning effort among all types of learning. One of the earliest discovery-based programs is the Automated Mathematician (AM), which derived a number of concepts in mathematics. It is based on the concepts of set theory. The AM discovers the natural numbers by modifying a concept called "bag", which is a generalization of the concept of a set; a bag permits multiple occurrences of elements, e.g., {a, a, a, b, c, c, d, d}. Hence, the natural number 4 is represented as the bag {1, 1, 1, 1}. Addition of natural numbers is the bag union: {1, 1, 1, 1} ∪ {1, 1} = {1, 1, 1, 1, 1, 1} = 6. Multiplication is a series of additions: 3 ∗ 4 = {1, 1, 1} ∪ {1, 1, 1} ∪ {1, 1, 1} ∪ {1, 1, 1} = 12. Subtraction is the bag difference: 4 − 3 = {1, 1, 1, 1} − {1, 1, 1} = {1}. Division is obtained by repeated bag differences. A prime number is an integer having only two divisors, unity and the number itself.
One approach to discovery-based learning is learning by concept hierarchies, as explained in the following.
Learning by Discovering the Concept Hierarchies
One of the most general approaches to solving a complex problem is to decompose it into smaller, less complex, and manageable subproblems. This principle is the foundation of structured induction in machine learning, where a concept hierarchy is defined and rules are learned for each of the (sub)concepts. Instead of learning a single complex classification rule from examples, the concept hierarchy can be constructed manually, or its construction can be automated.
Consider that, given five Boolean attributes x1, …, x5, it is required to learn a Boolean function y of these attributes, expressed as Eq. (13.7).
For five attributes there is a space of 2^5 = 32 points. Consider that, out of these, a total of 24 randomly selected points (i.e., 75% of the total) are taken as examples for learning. These are called attribute-value vectors, and they hide the actual structure of the original expression. For Eq. (13.7) it is possible to justify that the best possible structure of sub-concepts is as given in Fig. 13.7, with the functions defined as f1 = ⊕, f2 = ∧, f3 = ∧, and f4 = ⊕.
An approach to learning by concept hierarchies is based on function decomposition. Consider decomposing a function y = f(X) into y = g(A, h(B)).
(Figure 13.7: the discovered concept hierarchy, with f4 at the root, children f2(x4, x5) and f3(x3, f1), and f1(x1, x2) below. Figure 13.8: (a) the original function y = f(x1, x2, …, xk, …, xm); (b) the decomposed form y = g(x1, x2, c) with intermediate concept c = h(xk, …, xm).)
The original function is shown in Fig. 13.8a, and Fig. 13.8b shows the decomposed function. In the original function, X = {x1, x2, ..., xk, ..., xm}; after decomposition, A = {x1, x2} and B = {xk, ..., xm} are the sets of input attributes, and y is the variable representing the class. Also, the input attributes satisfy X = A ∪ B. Here f, g, h are functions, partially specified by examples, i.e., by sets of attribute-value vectors with assigned classes. The functions g and h are determined in the decomposition process, such that their joint complexity is lower than the complexity of f. The attribute c is called an intermediate concept, defined as c = h(B), and is discovered in the decomposition process.
The function decomposition can be applied recursively on the functions g and h, which results in a hierarchy of concepts, as shown in Fig. 13.8b. For each concept in the hierarchy, like c = h(B), there is a corresponding function (h here) that determines the dependency of the concept (c here) on its immediate descendants in the hierarchy [11]. A small sketch of such a hierarchy follows.
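The sketch below encodes one possible reading of the hierarchy of Fig. 13.7; since the exact expression of Eq. (13.7) is not reproduced in this text, the composition used here is an assumption consistent with f1 = ⊕, f2 = ∧, f3 = ∧, f4 = ⊕.

```python
from itertools import product

# Sub-concept functions of Fig. 13.7: f1 = XOR, f2 = AND, f3 = AND, f4 = XOR.
f1 = lambda x1, x2: x1 ^ x2
f2 = lambda x4, x5: x4 & x5
f3 = lambda x3, c: x3 & c
f4 = lambda a, b: a ^ b

def y(x1, x2, x3, x4, x5):
    """The hierarchy read as y = f4(f2(x4, x5), f3(x3, f1(x1, x2)))."""
    return f4(f2(x4, x5), f3(x3, f1(x1, x2)))

# Enumerate the full space of 2^5 = 32 attribute-value vectors; a learner
# would see only a random 75% sample (24 vectors) of this table.
table = {x: y(*x) for x in product((0, 1), repeat=5)}
print(len(table))   # 32
```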
Our learning is through feedback from our actions in the real world. Human beings always learn by interactions with the environment or a teacher, through cause and effect relations. The environment, the world around us, is the teacher, but its lessons are often difficult to detect, grasp, or analyze, and hence hard to learn. The best example is the learning of a dog or a young child, where good actions are rewarded and bad actions are discouraged. Learning here is defined not by general examples, but by actions in, and feedback from, the world.
Programs that improve their performance at some task through rewards and punishments from the environment are examples of Reinforcement Learning (RL). RL has been found very successful for many automatic learning procedures for real-world tasks, like job-shop scheduling and elevator scheduling. In reinforcement learning, the total discounted reward the learner receives is optimized, i.e., a reward received after one time step is considered equivalent to a fraction of the same reward received immediately [1, 8].
There are four components of RL:
1. a policy,
2. a reward function,
3. a value mapping, and
4. a model of environment.
The learning agent's choices and methods of action are defined under a given policy. The policy is represented using a set of production rules; sometimes the policy may also be probabilistic in nature. The reward function is a relationship between the state and the goal: it maps each state into a reward measure, indicating the value of that state for achieving the goal.
Reinforcement-based learning differs from supervised learning in the sense that a reinforcement system may not have a teacher to respond to each action; instead, the learner itself creates a policy for interpreting the feedback. For example, a player learning that playing in a particular style leads to winning chess games is reinforcement learning.
RL is a commonly used framework for problems requiring a number of agents, whose presence does not add much complexity, as RL needs only limited feedback for learning. With the objective of maximizing the expected reward, RL algorithms attempt to learn policies that help in taking optimal sequential decisions. However, this learning may be intractably slow as the methods scale up to more realistic problems with large state spaces.
A commonly used approach to solving RL problems makes use of value function approximation, by which an agent tries to learn a value for each state. This value is the long-term expected reward from entering a particular state, and such values are used for deriving a policy.
The state space for real-world problems contains an extremely large number of possible state features. Hence, the designer of any such task must pick only the most relevant features. Consider the task "travel to Mumbai for work": the weather report for New Delhi is not likely to be relevant. From the relevant features we construct a feature set. In most real-world situations, the feature set for a problem is broken down into a number of subsets, such that each subset can learn a specific concept of the domain. In this case, some concepts (feature-set subsets) may be more important than others.
Using positions as features, the agent can learn the concept of position and basic path planning. The features raining, shoes, and umbrella are useful in learning how the weather governs the policy, and the features time and day of the week may be useful in learning to handle traffic, for example. A conventional approach to solving this problem through RL is to learn in a 6D space (position, raining, shoes, umbrella, time, weekday), where all the features are taken into account.
It is important to note that the order in which features are selected plays a significant role in the difficulty level of the problem and in the robustness of its solution.
In addition to the state change, the agent receives a reward r for arriving at state q, based on the reward function
$$R : S \times A \rightarrow \mathbb{R}. \qquad (13.9)$$
A minimal sketch of learning from such rewards follows.
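A widely used way to learn from rewards of the form R : S × A → ℝ is tabular Q-learning; the sketch below is a generic version (not a specific algorithm from this text), where step(s, a) -> (next_state, reward, done) is an assumed environment interface standing in for the transition model and the reward function.

```python
import random

def q_learning_episode(q, start, step, actions,
                       alpha=0.1, gamma=0.9, eps=0.1, horizon=100):
    """Run one episode of tabular Q-learning. q maps (state, action) to a
    value estimate (start with an empty dict); actions is the finite
    action set; step(s, a) -> (next_state, reward, done) is the assumed
    environment interface."""
    s = start
    for _ in range(horizon):
        if random.random() < eps:                       # explore
            a = random.choice(actions)
        else:                                           # exploit
            a = max(actions, key=lambda act: q.get((s, act), 0.0))
        s2, r, done = step(s, a)
        best_next = max(q.get((s2, act), 0.0) for act in actions)
        # temporal-difference update toward r + gamma * max_a' Q(s', a')
        td_target = r + gamma * best_next
        q[(s, a)] = q.get((s, a), 0.0) + alpha * (td_target - q.get((s, a), 0.0))
        if done:
            break
        s = s2
    return q
```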
(Figure: learning by analogy between water flow and electric current. From the known flow-rate relation Q3 = Q1 + Q2, one infers the analogous Kirchhoff's current law, I3 = I1 + I2.)
The old situations match the new situations in some respects and, therefore, may offer the promise of suggesting a solution to the new situation.
The difference between inductive learning and analogical learning is that induction, or similarity-based learning, is based on the observation of a number of training examples, while analogical and explanation-based learning require only one example, in the form of a known solution or a past experience, as a sufficient criterion for learning to take place.
Analogical Reasoning
Consider the statement: "I have a Pomeranian dog, who is friendly with children. Whenever I see another Pomeranian dog, I assume that it too will be friendly with children." We usually perform this type of reasoning in our daily life, yet it does not guarantee the truth: the existence of one or a few shared characteristics does not mean that all the remaining characteristics are identical.
For analogical reasoning to work, it is necessary that some relevant and known similarities between two objects or situations give a sufficient basis for believing that there are further similarities between them, which should help answer an open question. However, it is not always that way. At most, analogical thinking can provide a basis for thinking of some probabilities, and these may be the basis for a probabilistic judgment. The kind of thinking that takes place in analogical reasoning has a simple structure:
1. Object A possesses a set of characteristics, say X,
2. Object B shares the characteristics X,
3. Object A also has characteristic Y,
4. Because objects A and B share the characteristics X, we conclude what is not yet known, that is, that B shares Y as well; this is not necessarily correct.
In Fig. 13.10a, the analogical reasoning will work, while in (b) it may not.
$$\frac{A(X) \wedge B(X) \qquad A(Y)}{\therefore\ B(Y)} \qquad (13.11)$$
The set of common properties X is the similarity between A and B, and the
conclusion property Y is projected from A onto B. The process may be summarized
by the following statements.
Justification. The analogical reasoning method defined in Eq. (13.11) can be justified by a two-step process: 1. from the premises A(X) ∧ B(X), conclude the generalization ∀Z [A(Z) → B(Z)]; and 2. instantiate this generalization with the property Y and apply the inference rule of modus ponens, together with A(Y), to obtain the conclusion B(Y). In this process, only the first step is non-deductive. Hence, it appears as if the problem of justifying the analogy has been reduced to the problem of justifying a single-instance inductive generalization.
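The schema of Eq. (13.11) can be mimicked in a few lines, with objects represented as sets of property labels; this sketch (all names invented) returns only a plausibility flag, underscoring that the projection is not a logical entailment.

```python
def project(source, target, shared, conclusion):
    """Analogical projection per Eq. (13.11): if source and target share
    the properties in `shared` and source has `conclusion`, tentatively
    project `conclusion` onto target. Plausible only, not guaranteed."""
    return shared <= source and shared <= target and conclusion in source

pom1 = {"pomeranian", "dog", "friendly_with_children"}   # object A
pom2 = {"pomeranian", "dog"}                             # object B
print(project(pom1, pom2, {"pomeranian", "dog"}, "friendly_with_children"))
# True -> we *plausibly* expect B to be friendly; it may still be false
```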
A criterion for evaluating the strength of an (enumerative) induction states that, as the number of instances confirming the generalization increases, the inference increases in plausibility. If this were the only criterion used for analogical inference, then all the conclusions projected by any analogy without counterexamples would be plausible. But this fails. Consider an example: if the inspection of an Indian peacock reveals that its tail is more colorful than its legs, projecting this conclusion onto an unseen peacock is plausible. But projecting that a particular hanging feather seen on one peacock's tail will be observed on a second peacock is not plausible. Anyone who has closely looked at the tail of even a single peacock will have no counterexamples to either conclusion, and both conclusion properties can be projected; so the difference in acceptability must be accounted for by some other criterion. Hence, the problem of analogy is different from (enumerative) induction, because analogy requires a stronger criterion of plausibility.
Analogical learning is a nontrivial form of inference, i.e., conclusions based on analogies are not necessarily logical entailments of the previous (source) knowledge. Analogy is a form of plausible reasoning about the current situation. The steps in Eq. (13.11) require many other things, like criteria for the similarity measure, some form of indexing, an efficient recall mechanism, a flexible mapping between the base and the target domain, and various levels of abstraction.
The following are among the steps performed in analogical reasoning:
4. Determining analogy and matching. Here, two patterns are selected for their similarities and mapped from the base to the target domain.
5. Similarity measures. Various similarity measures are defined, and matchings are evaluated according to them. The access and recall mechanisms are useful features, whereby a similarity feature of a new problem serves as an index to a previously solved problem; this helps in recalling the previous problem when required.
6. Abstraction. The matchings are abstracted away to generalize them.
The newly formulated solution is validated for its applicability through some trial process, like a theorem prover or simulation. If the validation is supported, a generalized solution is formed which accounts for both the old and the new cases; this amounts to new learning.
When one case is inferred to be analogous to another on the basis of a unifying principle, and the inference is accepted without being tested against other possibilities, such analogical reasoning may be faulty. In addition, when some similarities between two cases are considered decisive on the basis of insufficient investigation of the relevant differences, such analogical reasoning may also go wrong.
Given the data, one way to characterize the learning problem is by the goal/target of the learner algorithm, which may be a concept, or a description of a class of objects. A learning algorithm may also acquire plans, heuristics for problem solving, or other forms of procedural knowledge, in any structure discussed so far, including predicate logic.
The set of operations performed by the algorithm may include generalization rules, heuristic rules, or a plan that satisfies the goal. Typical operations are generalization or specialization of symbolic expressions, adjusting the weights of a neural network, etc. In concept learning, a learner generalizes a definition from examples, as illustrated below.
There is a potential concept space, which the learner algorithm searches to find the concept. The complexity of this concept space is the main source of the complexity of the problem solution. The learning program/algorithm must commit to an order and direction of search, and should make use of the available training data and heuristics to carry out the search efficiently. Making use of such heuristics, the learner may, for example, learn the concept ball by using the first example as a candidate concept:

size(object1, small) ∧ color(object1, red) ∧ shape(object1, round)
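The following is a minimal Python sketch (an illustration, not the book's algorithm) of one such generalization step: attribute values shared by all positive examples are retained, and conflicting values are replaced by a "don't care" wildcard. The attribute names and the wildcard convention are assumptions.

```python
# A minimal sketch of generalizing a conjunctive candidate concept:
# values that differ across positive examples become the wildcard '?'.

def generalize(concept, example):
    """Keep attribute values shared by the concept and a new positive
    example; replace conflicting values with the wildcard '?'."""
    return {attr: (val if example.get(attr) == val else "?")
            for attr, val in concept.items()}

# The first positive example becomes the initial candidate concept:
# size(obj1, small) AND color(obj1, red) AND shape(obj1, round).
concept = {"size": "small", "color": "red", "shape": "round"}

# Further positive examples of "ball" force the concept to generalize.
for example in [{"size": "large", "color": "red", "shape": "round"},
                {"size": "small", "color": "blue", "shape": "round"}]:
    concept = generalize(concept, example)

print(concept)   # {'size': '?', 'color': '?', 'shape': 'round'}
```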
Explanation-Based Learning (EBL) begins with a target concept; the learner's task is to determine an effective definition of this concept, which is initially given only in a high-level description. Besides the target concept, EBL uses a training example, a domain theory, and operational criteria, each of which appears in the example below.
Consider using EBL to learn that a given specific object, "this cup", is a cup. The target concept is a rule in predicate form, which is to be used to infer whether a given object is in fact a cup. We express this as

premise(X) → cup(X),

where the premise is a conjunctive expression containing the variable X; a predicate expression part(X, Y) indicates that Y is a part of X. Let the domain knowledge include the rule

canbelifted(X) ∧ holdsliquid(X) → cup(X),

and let the training example comprise the following facts:
cup(thiscup).
small(thiscup).
part(thiscup, handle).
own(bob, thiscup).
part(thiscup, bottom).
part(thiscup, bowl).
color(thiscup, red ).
pointsup(bowl).
concave(bowl).
The operational criteria should be such that the target concept can be defined in terms of observable structural properties of the object, like the fact that it points up, and its parts: bottom, handle, bowl, etc. The domain rules are provided in such a way that they enable the learner (algorithm) to infer whether a description is operational or not. Making use of this approach, a theorem prover can construct an explanation as to why the given example is in fact an instance of the target concept—a proof that the target concept logically follows from the example. This is shown in Fig. 13.11 as a generalized proof tree.
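A small propositional sketch of the explanation step follows. The cup rule is the one given above; the two auxiliary rules (that a small object with a handle can be lifted, and that an upward-pointing concave bowl holds liquid) are assumed domain knowledge added for illustration.

```python
# A propositional sketch of explaining cup(thiscup) by forward chaining
# over the domain rules (the two auxiliary rules are assumptions).

facts = {"small(thiscup)", "part(thiscup, handle)", "part(thiscup, bowl)",
         "pointsup(bowl)", "concave(bowl)"}

rules = [  # (set of premises, conclusion)
    ({"canbelifted(thiscup)", "holdsliquid(thiscup)"}, "cup(thiscup)"),
    ({"small(thiscup)", "part(thiscup, handle)"}, "canbelifted(thiscup)"),
    ({"part(thiscup, bowl)", "concave(bowl)", "pointsup(bowl)"},
     "holdsliquid(thiscup)"),
]

explanation = []                 # records the proof steps, bottom-up
changed = True
while changed:
    changed = False
    for premises, conclusion in rules:
        if premises <= facts and conclusion not in facts:
            facts.add(conclusion)
            explanation.append((sorted(premises), conclusion))
            changed = True

for premises, conclusion in explanation:
    print(" AND ".join(premises), "=>", conclusion)
print("cup(thiscup) proved:", "cup(thiscup)" in facts)
```

Generalizing this explanation (replacing the constant thiscup by a variable) yields an operational definition of the target concept.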
EBL has the following advantages:
1. The training examples often contain much irrelevant information (or noise), such as the make of the cup. The domain theory allows the learner to select only the relevant aspects of the training example.
[Fig. 13.11: A generalized proof tree for cup(mycup), built from the facts small(mycup), part(mycup, handle), part(mycup, bowl), pointsup(bowl), and concave(bowl), via intermediate conclusions such as light(mycup).]
2. Usually, an example may allow many possible generalizations, most of which may not be useful. However, since the EBL inference logically follows from the example, the generalizations of EBL are logically consistent with the domain theory.
Several questions arise in the study of reinforcement learning:
1. In which situations is it preferable to represent and learn a policy directly, instead of learning a value function first? The difference is that a policy specifies what action should be chosen in each state, while a value function provides the accumulated long-term reward for each state. Given a value function, the best action is the one that leads to the state with the highest value. Some methods, like genetic algorithms, genetic programming, and parameterized policies, search directly in the policy space, while others, like temporal difference learning and real-time dynamic programming, construct value functions (see the sketch after this list).
2. What should be the properties of an ideal value function approximator?
3. What are the best ways to resolve the trade-off between exploitation (i.e., execu-
tion of the current policy) and exploration (i.e., search for better policies)?
4. What are the ways to solve large and partially observable Markov decision prob-
lems?
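A minimal sketch of tabular temporal-difference learning, TD(0), on a toy chain of states follows; the states, rewards, and parameters are illustrative assumptions. The greedy action in each state is then the one leading to the highest-valued successor, as described in item 1.

```python
import random

# TD(0) value-function learning on a toy 5-state chain; state 4 is the
# terminal goal. All numbers here are illustrative assumptions.

n_states = 5
V = [0.0] * n_states            # the value-function table
alpha, gamma = 0.1, 0.9         # learning rate and discount factor

def step(s, a):                 # a = -1 (left) or +1 (right)
    s2 = max(0, min(n_states - 1, s + a))
    return s2, (1.0 if s2 == n_states - 1 else 0.0)  # reward at goal only

for _ in range(2000):           # episodes under a random behavior policy
    s = 0
    while s != n_states - 1:
        s2, r = step(s, random.choice([-1, 1]))
        V[s] += alpha * (r + gamma * V[s2] - V[s])   # the TD(0) update
        s = s2

# Greedy policy: choose the action leading to the highest-valued state.
policy = {s: max([-1, 1], key=lambda a: V[step(s, a)[0]])
          for s in range(n_states - 1)}
print([round(v, 2) for v in V], policy)
```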
13.14 Summary
The following learning strategies were discussed in this chapter:
1. Rote Learning.
2. Learning by Instruction.
3. Learning by Deduction.
4. Learning by Analogy.
5. Learning by Induction.
6. Reinforcement Learning.
7. Discovery-based Learning.
These strategies are in increasing order of complexity of learning as well as the
magnitude of transformation required.
A general learning model comprises a learner unit, a knowledge base, performance evaluation, and a teacher component, which may or may not be a physical teacher.
The learning techniques are fundamentally classified as supervised learning and
unsupervised learning. The major difference between these is that the supervised
learning requires the presence of a teacher, while the unsupervised technique does
not require the teacher component.
Inductive learning is generalization carried out using a set of examples. It is one of the learning techniques most commonly used by humans. Learning concepts is a typical inductive learning process: given examples of some concept, such as "cat", we attempt to infer a definition that will allow us to correctly recognize future instances of the concept "cat". Note that inductive learning makes use of specific facts/examples, rather than general axioms as in deductive processes.
While searching for possible hypotheses in a large space, the space can be constrained using the domain knowledge of experts. In learning from examples, the task is: given the examples, find a theory that is consistent with them. Examples accompanied by explanations are called argumented examples; the explanations take the form of arguments for and against.
Based on whether a single concept or multiple concepts are to be learned, we speak of single-concept or multiple-concept learning. A concept is a function (a mapping), and the learning task is to discover this function, called the target function f*: {0, 1}ⁿ → {0, 1}. The function maps each instance to the correct label (also called its class). In the learning process, the system responds to learning instances and also to feedback received from the teacher.
The attribute-value learners or propositional learners are the methods that use
a formalism of propositional calculus, where objects are defined with a fixed set
of attributes. The learning methods based on first-order relational descriptions are
called relational learners; they induce descriptions of relations, and make use of
objects described in terms of their components and relations among components.
Automated Mathematician (AM) is one of the earliest discovery-based learning programs, deriving a number of concepts in mathematics. It is based on the concepts of set theory; AM discovers the natural numbers by modifying the concept of a "bag", which is a generalization of a set. One of the most general approaches to solving a problem is to decompose it into smaller, less complex, and more manageable subproblems. This principle is the foundation of learning by concept hierarchies, where the rules for each concept and sub-concept are learned by discovery learning.
Exercises
1. At the end of the day, try to recollect any five important events you have witnessed
this day. Try to associate the type of learning you have used in these events.
2. Analyze and propose the types of structures you consider appropriate for the following learning processes:
a. Rote learning
b. Inductive learning
c. Supervised learning
d. Unsupervised learning
e. Learning by example
f. Analogical learning
You may take the example of number tables and the indexing you need, so that one can recite the table of, say, "5", and also answer 5 × 6 = 30.
3. What is the difference between learning by induction and learning by examples?
4. Explain the major difference between analogical learning and inductive learning
in respect of the approach used for learning.
5. Suggest a learning method for each of the following, explaining why the sug-
gested method is appropriate, and provide logical steps to learn using it.
a. To learn how to drive a car after having observed a trained driver while riding along with him.
b. To learn how to drive a car after having learned to drive a bullock-cart.
c. To learn how to drive a car after having learned to drive a tractor.
d. To learn how to fly an aircraft after having very closely observed the birds
flying, like eagle, crane, and hawk.
e. Learning to keep your wallet protected after having lost it due to theft.
6. A helicopter takes off vertically, instead of making a run before takeoff; among birds, it has this similarity with the peacock. Explain a mathematical model that supports learning by analogy of a copter from a peacock.
Chapter 14
Statistical Learning Theory
Abstract A machine learning system, in general, learns from the environment, but statistical machine learning programs (systems) learn from data. This chapter presents techniques for statistical machine learning: Support Vector Machines (SVM) for recognizing and classifying patterns, prediction of structured objects using SVM, the k-nearest neighbor method for classification, and Naive Bayes classifiers. Artificial neural networks are presented, with brief introductions to error-correction rules, Boltzmann learning, the Hebbian rule, the competitive learning rule, and deep learning. Instance-based learning is treated in detail, with its algorithm and learning task. The chapter concludes with a summary and a set of practice exercises.
14.1 Introduction
Statistical machine learning systems help programs automatically learn from data. This is an attractive alternative to manually coding every rule in a program. Machine learning is used in computer science and beyond, e.g., in spam filters for email, web search, ad placement, stock trading, credit scoring, drug design, fraud detection, and many more applications. Statistical machine learning can be regarded as a kind of data mining.
The field of machine learning developed in a new direction during 1979–80, with innovations like decision trees and rule learning. These methods were applied in expert systems. In the late 1980s, there was a renewal of research interest in machine learning architectures that used the perceptron as the basic building block. This was particularly because the limitations highlighted by Minsky and Papert in the early days of AI were overcome by multilayer networks of simple computing elements. The latter used perceptron-like nodes and were called neural networks. These networks were trained using the biologically inspired backpropagation algorithm.
14.2 Classification
For email filtering, for instance, we use a classifier that classifies email messages into the "spam" and "no spam" classes. The input to this classifier may be a Boolean vector,
x = (x1 , . . . , xj , . . . , xd ), (14.1)
[Fig. 14.1: The machine learning workflow—training data and a machine learning algorithm produce a classifier; new data presented to the classifier yields a prediction (inference).]
where x1, . . . , xd correspond to words from some standard dictionary, such that the element xj = 1 (true) if the jth word of the dictionary is present in the email under consideration, and xj = 0 if it is absent. The aim is to classify emails based on these words. A learner (program) inputs a training set of examples comprising pairs (xi, yi), where xi = (xi,1, . . . , xi,d) is an observed input and yi is the corresponding output—the class of the email. The test that the learner program (i.e., the classifier) has successfully learned is whether the classifier produces the correct output yt for future examples xt, i.e., whether the spam filter correctly classifies never-before-seen email messages as spam or not spam, or into other classes as dictated by the classifier [6].
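The following toy sketch shows how such a Boolean word vector, Eq. (14.1), might be constructed; the dictionary and the messages are illustrative assumptions.

```python
# A toy sketch of the Boolean word-vector representation of Eq. (14.1):
# x_j = 1 if the j-th dictionary word occurs in the email, else x_j = 0.

dictionary = ["free", "winner", "meeting", "project", "money"]

def to_boolean_vector(email_text):
    words = set(email_text.lower().split())
    return [1 if w in words else 0 for w in dictionary]

print(to_boolean_vector("You are a winner claim your free money"))
# -> [1, 1, 0, 0, 1]
print(to_boolean_vector("Project meeting moved to Monday"))
# -> [0, 0, 1, 1, 0]
```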
Though there are thousands of learning algorithms today, the learning can be
expressed as comprising three parts: representation, evaluation, and optimization.
Representation
The classifier must be represented in some formal language. Choosing a representation amounts to choosing a set of classifiers that can possibly be learned; this set is called the hypothesis space of the learner, and it reflects those features of the inputs to which we give importance. A classifier that is not in the hypothesis space cannot be learned.
Evaluation
The evaluation function, also called the objective function or scoring function of a learning algorithm, should be able to distinguish good classifiers from bad ones.
Optimization
Optimization is the method for searching among the classifiers for the one that scores highest. Proper choice of the optimization technique is key to the efficiency of the learner.
Learners can be divided into two major types: 1. those whose representation has a fixed size, e.g., linear classifiers, and 2. those whose representation can grow with the data, e.g., decision trees.
[Fig. 14.2: An optimal separating hyperplane w·x = y, with bounding planes w·x = y + 1 and w·x = y − 1 on either side of the margin.]
The separating hyperplane lies between the boundary of the points with dependent attribute = 0 and the boundary of those with dependent attribute = 1. The variables x, y, and w are vectors.
The margin is the measure of "safety" in separating the two sets of points—the larger, the better. Figure 14.3 shows three separator lines between the sets of data values of two categories, with separator line 3 producing the maximum margin. Computing the optimal separating surface in the standard SVM formulation requires solving an optimization problem that is quadratic in nature [2, 3].
The SVMs are based on analogical learning, which was discussed in the previous chapter. The Support Vector Machine (or Support Vector Network, SVN) maps the input vectors into some high-dimensional feature space Z through a nonlinear mapping chosen a priori. A linear decision surface is constructed in this space, with special properties that ensure high generalization capability of the network. This approach gives rise to two problems: the first is conceptual in nature, while the other is technical. The conceptual problem is how to find a
separating hyperplane that will generalize well when the feature space has very high dimension. The technical problem is how to computationally handle such a high-dimensional space.
The conceptual part of the problem is solved for the case of optimal hyperplanes for separable classes. An optimal hyperplane is a linear decision function with maximal margin between the vectors of the two classes (see Fig. 14.2). To construct such a hyperplane, only a small subset of the training data is required—the support vectors, which determine the margin. The optimal hyperplane separates the training data without errors. The optimal hyperplane algorithm is described as follows.
The set of labeled training patterns

(y1, x1), (y2, x2), . . . , (yℓ, xℓ), yi ∈ {−1, 1}, (14.2)

is said to be linearly separable if there exist a vector w and a scalar b (the bias) such that the following inequalities are valid for all elements of the training set in Eq. (14.2):

w·xi + b ≥ 1 if yi = 1,
w·xi + b ≤ −1 if yi = −1. (14.3)

These can be combined into the single form

yi(w·xi + b) ≥ 1, i = 1, . . . , ℓ. (14.4)
The optimal hyperplane, which is the unique one separating the training data with
a maximal margin, can be expressed by
w0 .x + b0 = 0. (14.5)
To find it, one minimizes (1/2) w·w subject to the constraints (14.4); introducing Lagrange multipliers αi ≥ 0, this leads to the Lagrangian

L(w, b, Λ) = (1/2) w·w − Σ_{i=1}^{ℓ} αi [yi(xi·w + b) − 1]. (14.6)
[Fig. 14.4: An SVM viewed as a network—the input pattern x is compared with the support vectors xk through kernel units uk = K(xk, x), and a weighted combination with coefficients αk produces the decision output.]
As discussed above, SVMs attempt to find the support vectors that yield the maximum margin. An algorithm for this purpose finds a decision function for n-dimensional pattern vectors x belonging to one of two classes, A or B. The input to the training algorithm is a set of k examples x1, ..., xk with labels y1, ..., yk, respectively:
(x1 , y1 ), (x2 , y2 ), . . . , (xk , yk ) (14.7)
where
yi = +1 if xi ∈ class A,
yi = −1 if xi ∈ class B. (14.8)
Making use of these training examples during the learning phase, the SVM algorithm finds a decision function D(x), where x is an unknown pattern. Once the training phase is over, the class of an unknown pattern is predicted using the following rule:
x ∈ A, if D(x) > 0
x ∈ B, otherwise.
The decision functions D(x) must be linear in their parameters; however, they are not restricted to linear dependence on x. These functions can be expressed either in the direct space or in the dual space. The notation used for the direct space is identical to that used for perceptrons, i.e.,
D(x) = Σ_{i=1}^{N} wi φi(x) + b, (14.9)

where the φi are previously defined functions of x, and wi and b are adjustable parameters. In the dual space, the decision function takes the form

D(x) = Σ_{k=1}^{ℓ} αk K(xk, x) + b, (14.10)
where the coefficients αk are the parameters to be adjusted, and the xk are the training patterns. The function K is a predefined kernel, for example, a radial basis function or a potential function.
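In practice, such a kernel-based classifier can be obtained with any standard SVM implementation; the following sketch assumes scikit-learn is available and uses the radial basis function kernel mentioned above, on a tiny illustrative data set.

```python
# A sketch (assuming scikit-learn) of training an SVM with an RBF kernel.
from sklearn import svm

X = [[0.0, 0.0], [0.2, 0.1], [0.9, 1.0], [1.0, 0.8]]   # patterns x_i
y = [-1, -1, 1, 1]                                      # labels y_i

clf = svm.SVC(kernel="rbf", C=1.0)     # the kernel K is chosen a priori
clf.fit(X, y)                          # learning phase

print(clf.support_vectors_)            # the support vectors x_k
print(clf.predict([[0.1, 0.0], [0.95, 0.9]]))   # sign of D(x) -> class
```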
A potential drawback of SVMs and other statistical learning methods is that they treat the category structure as "flat" and do not consider the relationships between categories, which are commonly expressed as hierarchies or taxonomies. Taxonomies, or taxonomical structures, offer clear advantages in supporting tasks like browsing, searching, or visualization. Note that real-world concepts commonly have complex hierarchical structures. For example, in browsing, the list of answers to a query can be taken as the children of the root (the keyword); each answer link corresponds to a web page, which may itself contain many links, thus making a tree structure.
To better appreciate the above, consider the problem of natural language parsing, as shown in Fig. 14.5, where the parser receives as input a sentence of natural language, and the output is a parse tree decomposing the sentence into its constituents. While support vector machines are used in NLP applications such as Word Sense Disambiguation (WSD), parsing is not suited to being solved through SVMs as a classification problem. This is because, in parsing, the output is not a yes/no classification but a labeled tree representing the structure of the sentence. So the question is: how can we take an SVM and learn to predict a parse tree?
This question arises not only for predicting trees, but also for a variety of other structures, like DNA sequences, or sequence orderings of image segments, etc.
Let us assume that there is a multi-class SVM for document classification, where each document belongs to exactly one category. Let {(xi, yi)}, i = 1, . . . , n, be a set of n labeled training documents, where xi ∈ ℝ^d denotes a standard vector representation of the ith training document and d is the dimension of this representation. Each label yi belongs to Y ≡ {1, . . . , k}, where k is the total number of categories.
[Fig. 14.5: A parse tree decomposing a sentence into its constituents (S → NP VP, NP → Det N, etc.).]
[Fig. 14.6: Noun-phrase co-reference resolution for the passage: "[Musk's] campaigning in the [National elections] has motivated [many new young voters.] [His] [position] on [bitcoin money] was well received by [this group]." Co-referent phrases, such as Musk's/His and this group/many new young voters, form equivalence classes.]
Let us assume that there is a weight vector wy for every class 1 ≤ y ≤ k. We refer to the stacked vector of all weights as w = (w1, . . . , wk). Structured output prediction is the term used for this kind of prediction, where the objective is to learn a function h : X → Y that maps inputs x ∈ X to complex, structured outputs y ∈ Y. Structured output prediction tasks range from predicting an equivalence relation, say, in noun-phrase co-reference resolution, to predicting a correct and well-formed machine translation.
Consider a situation where a man called "Musk" contests the elections, campaigns to seek votes as a presidential candidate, and proposes an election manifesto of promoting bitcoin money if he wins. Figure 14.6 shows an example of resolving noun-phrase co-reference in this setting, where there is an equivalence relation between a noun phrase (e.g., "Musk's") and its co-reference "His", and similarly between "this group" and "many new young voters".
Problems of this type exist elsewhere also, for example, in image segmentation, where an equivalence relation y is determined over a matrix of pixels x, and in web search, where a document ranking y is predicted for a given query x. Structural SVMs (SSVMs) are suitable for addressing all of these, and a large range of other prediction tasks with structured output.
How can the input be mapped to a structured output? Basically, the task of predicting a structure is similar to a multi-class learning task. Each possible parse tree y ∈ Y may correspond to one class, and classifying a new example x amounts to predicting its correct class out of the many possible classes, as shown in Fig. 14.7.
However, the problem is that the number of classes in Y is very large. In the case of parsing, the number of possible parse trees is exponential in the length of the sentence. The situation is similar for most other prediction problems that output a structure. Hence, there is a need to find compact representations of output spaces that are large in size. In addition, making a single prediction for an example is computationally challenging, since it requires enumeration of all possible outputs. The third challenge is how to distinguish between two wrong parse trees, one of which may be closer to the correct one; this is captured by the notion of prediction error [9, 11].
We can derive a structural SVM from a multi-class SVM as follows: for each class y, the multi-class SVM uses a weight vector wy, and each input x obtains a score for each class y via the function

f(x, y) ≡ wy·Φ(x), (14.11)

where Φ is a function that extracts a vector Φ(x) of binary or numeric features from x. Hence, every feature has an additively weighted influence on the modeled compatibility between inputs x and classes y. For the classification of x, the prediction rule h(x) simply chooses the highest-scoring class as the predicted output. This is expressed by

h(x) = argmax_{y∈Y} f(x, y). (14.12)
[Fig. 14.7: Prediction of a structured output as multi-class classification—each candidate parse tree of the sentence "The dog chased the thief." (Class 1, Class 2, . . . , Class k) corresponds to one class.]
This results in a correct prediction of the output y for input x, provided the weights w = (w1, . . . , wk) are chosen such that the inequalities f(x, ȳ) < f(x, y) hold for all incorrect outputs ȳ ≠ y. For a given training sample (x1, y1), . . . , (xn, yn), this leads directly to a (hard-margin) formulation of the learning problem, requiring a fixed margin (= 1) separation of all training examples while using the norm of w as a regularizer:

min_w (1/2)‖w‖², such that f(xi, yi) − f(xi, ȳ) ≥ 1 (∀i, ∀ȳ ≠ yi). (14.13)

For a k-class problem, this optimization problem has a total of n(k − 1) inequalities, all linear in w, since one can expand

f(xi, yi) − f(xi, ȳ) = w_{yi}·Φ(xi) − w_{ȳ}·Φ(xi). (14.14)
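The prediction rule of Eqs. (14.11)–(14.12) can be sketched in a few lines of NumPy; the feature map and the weight values below are toy assumptions.

```python
import numpy as np

# A toy sketch of multi-class scoring, Eqs. (14.11)-(14.12):
# f(x, y) = w_y . Phi(x), and h(x) = argmax_y f(x, y).

k, d = 3, 4                        # k classes, d features
rng = np.random.default_rng(0)
W = rng.normal(size=(k, d))        # stacked weights w = (w_1, ..., w_k)

def phi(x):                        # feature map Phi (identity here)
    return np.asarray(x, dtype=float)

def h(x):                          # prediction: the highest-scoring class
    return int(np.argmax(W @ phi(x)))

print(h([1.0, 0.0, 2.0, -1.0]))
```

For structured outputs such as parse trees, the argmax ranges over an exponentially large set Y, which is why structural SVMs need compact output representations and efficient argmax computations, as discussed above.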
14.6 k-Nearest Neighbor Method

The nearest neighbor method is a statistical learning method, sometimes also called a memory-based method. The method is used for classifying objects based on similarity of their attributes. The k-nearest neighbor (k-NN), or simply nearest neighbor, method is an analogical type of learning. Given a training set θ of N labeled patterns, each with n attributes, a nearest neighbor algorithm decides that a new pattern x belongs to the same category as its closest neighbor in θ. More generally, a k-nearest neighbor algorithm assigns a new pattern x to the category to which the majority of its k closest neighbors belong.
Consider five cars manufactured by company A, with ten important attributes identified for every car, like engine size, color, wheel size, fuel used, etc. Then the training set size is N = 5, and the attribute count of each training pattern is n = 10. Let a new model be introduced by company B, and let the neighborhood size be k = 4. Suppose the car from company A having four attributes in common with the new car is, say, model Zeta. Then, Zeta being the nearest neighbor of the new model from company B, the new model is put in the class of model Zeta.
If k is relatively large, there is less chance of the decision going wrong due to a noisy training pattern close to x. However, large values of k also reduce the sharpness, or acuteness, of the method. The k-NN method can be thought of as estimating the probabilities of membership in each class, given an input pattern x.
The k-NN approach has been shown to be a useful non-parametric technique for regression and classification. In both cases, the input comprises the k closest neighbors of the vector. However, finding the k-NNs of a test sample among N design samples is time-consuming, particularly for large N. Ordering these samples requires N(N − 1)/2 pairwise distance computations, which is expensive for large values of N.
When used for object classification, the k-NN algorithm has many practical appli-
cations, e.g., in the areas of artificial intelligence, pattern recognition, statistics, cog-
nitive psychology, vision analysis, and medicine, to name a few. The decision rule
in k-NN provides a simple non-parametric procedure for assigning a class label to the input pattern, based on the class labels represented by the k closest (for example, in terms of Euclidean distance) neighbors of the vector.
Let us assume that we are given N design samples, {x1, . . . , xN}, each of which is an n-dimensional sample. It is required to compute the k nearest neighbors (k ≤ N) of a test sample y, as measured by an appropriate distance function d(y, xi) [7]. No preprocessing of the labeled sample set is needed for the nearest neighbor classifier. The nearest neighbor classification rule assigns an input sample vector y of unknown classification to the class of its nearest neighbor (see Algorithm 14.1). The idea extends to k nearest neighbors, with the vector y assigned to the class represented by a majority among its k nearest neighbors. In the algorithm, dk stands for the distance from y to xk, the kth nearest neighbor.
When more than one neighbor is taken into account, there may be a tie among the qualifying classes, i.e., classes with the maximum number of neighbors in the group of k nearest neighbors. A simple way to handle this problem is to restrict the possible values of k. For example, in a two-class problem, if k is restricted to odd values only, no tie can occur.
Alternatively, the occurrence of a tie can be handled using the following approach: the unclassified (sample) vector is assigned, among the tied classes, to the class for which the sum of distances from the sample to each neighbor in the class is minimum. Even then, there is a chance that a tie may remain.
For numerical attributes, the distance metric used in the nearest neighbor method is the Euclidean distance. Considering two patterns x1 = (x11, x12, . . . , x1m) and x2 = (x21, x22, . . . , x2m), the Euclidean distance between the two is given by

d(x1, x2) = √( Σ_{j=1}^{m} (x1j − x2j)² ). (14.15)

To keep the spread of attribute values along each dimension approximately the same, the above distance is usually modified by scaling the features. In that case the distance is given by the following equation, where aj² is the scale factor for dimension j:

d(x1, x2) = √( Σ_{j=1}^{m} aj² (x1j − x2j)² ). (14.16)
[Fig. 14.8: Patterns of two categories in the plane, with a circle enclosing the k nearest neighbors of a new pattern y, shown as a box; the two clusters of category 1 contain three and four dots, each labeled 1.]
The pattern y, represented by a box, is to be classified. Since the majority of the patterns within the enclosing circle are of category 1, the new pattern y is classified into category 1.
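A minimal sketch of this decision rule follows: Euclidean distance, Eq. (14.15), with a majority vote among the k closest patterns. The data are toy assumptions, and k is kept odd to avoid ties, as suggested earlier.

```python
import math
from collections import Counter

# A minimal k-NN sketch: Euclidean distance, Eq. (14.15), and a
# majority vote among the k closest labeled patterns.

def euclidean(x1, x2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_classify(train, x, k=3):     # train: list of (pattern, label)
    neighbors = sorted(train, key=lambda p: euclidean(p[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "1"), ((1.2, 0.9), "1"), ((0.8, 1.1), "1"),
         ((4.0, 4.2), "2"), ((4.1, 3.9), "2")]
print(knn_classify(train, (1.1, 1.0), k=3))   # -> "1"
```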
Bayesian networks are powerful tools for decision-making and reasoning under uncertainty. These networks are specified by two components: a directed acyclic graph (DAG) over the variables, and conditional probabilities quantifying the dependence of each node on its parents.
[Fig. 14.9: The structure of a Naive Bayes classifier—a single unobserved root (class) node with observed leaf nodes x1, x2, . . . , xn.]
A Naive Bayes classifier is composed of a DAG with one root node (the parent)—an unobserved node representing the class of the object to which the test instance belongs—and several leaf nodes, corresponding to observed nodes, with a strong assumption of independence among the leaf nodes in the context of their common parent. The leaf nodes represent the different attributes (features) specifying the object. Note that a Naive Bayes classifier has only two levels, as shown in Fig. 14.9, whereas a general Bayesian network may consist of many levels.
The decision of a Bayesian classifier is represented as a matrix of P(xj | yk), which specifies the probability of occurrence of each feature value (x1, . . . , xj, . . . , xn) given each class yk. To classify a new example having features among x1 . . . xn, we make use of Bayes' theorem as

argmax_{yi} P(yi | ∧xj) = argmax_{yi} [ P(∧xj | yi) P(yi) / Σ_k P(∧xj | yk) P(yk) ], (14.17)
which computes P(yi | ∧xj), the probability that the example is in class yi given the features xj. The subexpression ∧xj denotes a conjunction of attribute values, all occurring in one example. The summation is performed over the N classes (1 ≤ k ≤ N); the above probability is calculated for each class yi, and the class with the highest probability is selected. The probability P(yk) is estimated from the distribution of the training examples among the classes. If independence of all attributes is assumed under a given context (i.e., class), then P(∧xj | yk) can be calculated using

P(∧xj | yk) = Π_j P(xj | yk). (14.18)
The values P(xj | yk) are taken from the probability matrix. Here, ∧xj is the evidence on the attribute nodes, which can be decomposed into n pieces of feature evidence, x1, . . . , xj, . . . , xn. Because Naive Bayes works on the assumption that these attributes are independent of each other (given the parent node), their combined probability is obtained by substituting Eq. (14.18) into (14.17), which results in

argmax_{yi} P(yi | ∧xj) = argmax_{yi} [ Π_j P(xj | yi) P(yi) / Σ_k ( Π_j P(xj | yk) P(yk) ) ]. (14.19)
Note that there is no need to explicitly compute the denominator Σ_k [Π_j P(xj | yk) P(yk)], since, being a normalization constant, it is common to all the computations. In practice, a sum of logarithms is often computed instead of a product, because the probabilities involved can be very small.
The Bayesian learning method, as a classifier, builds the matrix P(xj | yk) from the training examples by examining the frequency counts in each class. It is possible to compute this matrix incrementally, incorporating one instance at a time. Alternatively, it can be constructed non-incrementally, using all the data at once.
Due to its simple structure, Naive Bayes has many advantages. It is efficient, as inference (i.e., classification) is achieved in linear time, whereas inference in a Bayesian network of general structure is NP-complete. Naive Bayes construction is incremental, in the sense that the classifier can easily be updated, e.g., by the addition of new cases. The major problem with Naive Bayes is its strong independence assumption: the assumed independence of the features in the context of the class is not always true, and this can negatively influence the inferred results.
Naive Bayes is most commonly used in text classification, where words are fea-
tures, and presence/absence of a word can be used to determine the topic of a docu-
ment.
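The following toy sketch shows a Naive Bayes text classifier in the spirit of Eq. (14.19). The log-sum form is used, and add-one (Laplace) smoothing is an added assumption, since unseen words would otherwise contribute zero probability.

```python
import math
from collections import Counter, defaultdict

# A toy Naive Bayes text classifier, Eq. (14.19), with log-probabilities
# and add-one (Laplace) smoothing (the smoothing is an assumption).

docs = [("win money now", "spam"), ("free money offer", "spam"),
        ("project meeting today", "ham"), ("meeting about the project", "ham")]

class_count = Counter(label for _, label in docs)
word_count = defaultdict(Counter)          # word_count[y][word]
vocab = set()
for text, label in docs:
    for w in text.split():
        word_count[label][w] += 1
        vocab.add(w)

def classify(text):
    best, best_lp = None, -math.inf
    for y in class_count:
        lp = math.log(class_count[y] / len(docs))        # log P(y)
        total = sum(word_count[y].values())
        for w in text.split():                           # + sum of log P(x_j|y)
            lp += math.log((word_count[y][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = y, lp
    return best

print(classify("free money"))         # -> "spam"
print(classify("project meeting"))    # -> "ham"
```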
Given N training examples, each with n attributes, the complexity of generating the probability matrix for the Bayes classifier is O(n·N). Hence, the classifier is substantially fast, as the runtime is independent of the decision "rule" generated. Apart from this, the basic operation is performed only once.
this gap, together with the amplitude of the signal, is called the synaptic strength. As a neuron learns, its synaptic strength increases, and in that situation, if it is stimulated by an electrical impulse, there is a better chance that it will send messages to its neighboring neurons.
The current neural-based learning algorithms need close involvement of humans to produce good results. The majority of these algorithms are based on supervised learning, where each training example carries a human-crafted label describing what is being learned. Consider the example of a picture of a sunrise associated with the caption "Sunrise". In that instance, the goal of the learning algorithm is to take the photograph as input and produce as output the name of the object in the image, i.e., "sunrise". The mathematical process of transforming an input into an output is a function; the synaptic strengths, which are numerical values, parameterize this function, which corresponds to the solution to learning with an ANN.
Note that such a mapping could be achieved through rote learning also, but that would not be useful. When we teach the algorithm "what a sunrise is", we want it to recognize any sunrise, even one on which it has not been trained. This is the ultimate goal of a machine learning algorithm.
The Artificial Neural Networks (ANN) possess the following important
properties:
Learning ability,
Massive parallelism,
Adaptability,
Fault tolerance,
Distributed representation and computation,
Generalization ability, and
Low energy consumption.
These properties make them candidates for many applications. Although the details of the proposals vary, the most common models for learning and computation take the neuron as the basic processing unit. Each processing unit is characterized by an activity level, an output value, input connections, a bias value, and output connections, and each of these aspects is represented mathematically by a real number.
Each connection of a neuron has an associated weight, called the synaptic strength, which influences the effect of the incoming input on the activity of the unit. This weight is either positive, called excitatory, or negative, called inhibitory.
The basic model of an artificial neuron with a binary threshold is shown in Fig. 14.10. The mathematical neuron computes the weighted sum of its n input
signals x1, . . . , xn, and generates an output of 1 if the sum is above a threshold value u, and 0 otherwise. This can be represented by

y = θ( Σ_{i=1}^{n} wi xi − u ), (14.20)
where θ(·) is the unit step function at 0, and wi is the synaptic weight associated with the ith input. For simplicity, we may treat the threshold u as another weight, w0 = −u, attached to the neuron with a constant input x0 = 1. Properly chosen weights allow a synchronous arrangement of such neurons to perform universal computations. There is a crude analogy between this neuron model and a biological neuron: the wires and interconnections model the axons and dendrites, the connection weights correspond to the synapses, and the threshold function approximates the activity in the soma. The model is, however, an overly simplified version of a true biological neuron.
ANNs can be considered as weighted directed graphs, in which artificial neurons act as nodes and directed weighted edges are the connections between neuron outputs and neuron inputs.
Based on the connection pattern, an ANN can be classified as:
1. a feedforward network, in which the directed graph has no loops, or
2. a recurrent (feedback) network, in which loops arise because of feedback connections.
[Fig. 14.11: A single-layer feedforward network and a single-layer feedback network.]
Figure 14.11 shows the single-layer feedforward and feedback neural networks.
The most common family of feedforward networks is the multilayer perceptron, in which the neurons are organized into layers with unidirectional connections between them. The feedforward networks are usually of a static type, as they produce only one set of output values, rather than a sequence of values, for a given input. These networks are memoryless, as their output is independent of the previous states of the network.
The feedback networks are called recurrent networks, and are dynamic systems.
Neurons’ output patterns are computed when a new input pattern is presented to these
networks. Due to feedback paths, the input to each neuron is then modified, which
causes the network to enter a new state.
The problem of learning in neural networks is simply the problem of finding
a set of connection strengths which allows the network to carry out the desired
computation [10].
An ANN usually learns the connection weights from the available training patterns; its performance improves over time through iterative updating of the weights. The ability of ANNs to automatically learn from examples makes them attractive and exciting. From a given collection of representative examples, ANNs learn the underlying rules in the form of an input–output relationship, instead of following rules specified by human experts. This is a major advantage of neural networks over traditional expert systems.
To understand this learning process, we first need a model of the learning environment in which a neural network operates, i.e., we must know what information is available to the network. This model is called the learning paradigm. The second requirement is to know how the weights are updated, i.e., which rule controls the updating process. A learning algorithm refers to a procedure that uses learning rules to adjust the weights.
All the three paradigms of learning exist in the neural networks, i.e., supervised
learning, unsupervised learning, and hybrid learning. Supervised learning requires
a teacher, where the network is provided with correct answers, i.e., output, for every
input pattern. The weights are so determined that they allow the network to produce
answers, which are as close as possible to known correct answers.
The unsupervised-learning-based neural networks explore the underlying structure of the data, or correlations between patterns in the data, and organize the patterns into categories based on these correlations.
A hybrid learning ANN combines both the supervised and unsupervised
approaches.
The learning rules in neural networks are of four basic types: error-correction, Boltzmann, Hebbian, and competitive learning. Their basic principles are presented in the following.
Learning based on error-correction rules uses a simple idea. During training, an input x1, . . . , xn is applied to the network, and the flow through the network generates a set of values at the output units y. Next, the actual output y is compared with the desired target d. If the output and the target match, no change is made to the weights. If they do not match, the weights of some of the connections are changed. The problem is to find out which connections in the
network were at fault and caused the error resulting in the mismatch. This is obviously a supervised learning paradigm. The principle of error-correction learning rules is to use the error signal (d − y) to modify the connection weights so as to gradually reduce the error magnitude.
Perceptron-based learning works on this principle of error correction. A perceptron comprises a single neuron with adjustable weights wi, i = 1, . . . , n, and a threshold value u, as shown in Fig. 14.10. Given an input vector x = (x1, . . . , xn)ᵗ (t denoting transpose), the net input to the neuron is
v = Σ_{i=1}^{n} wi xi − u. (14.21)
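The rule can be sketched as follows: compute the net input v of Eq. (14.21), threshold it as in Eq. (14.20), and adjust the weights in proportion to the error signal (d − y). The training data (the logical AND function) and the learning rate are toy assumptions; the threshold u is folded into the weights as w0 = −u with constant input x0 = 1, as described above.

```python
# A minimal perceptron sketch using the error-correction rule:
# y = theta(sum_i w_i x_i - u), with u folded in as w0 = -u, x0 = 1.

def output(w, x):
    v = sum(wi * xi for wi, xi in zip(w, x))   # net input, Eq. (14.21)
    return 1 if v > 0 else 0                   # unit step theta(.)

# Toy training set: learn the logical AND of two inputs.
data = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]

w = [0.0, 0.0, 0.0]
eta = 0.1                                      # learning rate
for _ in range(20):                            # training epochs
    for x, d in data:
        y = output(w, x)
        if y != d:                             # mismatch: correct weights
            w = [wi + eta * (d - y) * xi for wi, xi in zip(w, x)]

print(w, [output(w, x) for x, _ in data])      # converges to [0, 0, 0, 1]
```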
Learning in ANNs based on Boltzmann machines has several properties: the networks are symmetric and recurrent (i.e., have feedback), and consist of binary units (+1 for "on", and −1 for "off"). The symmetry property means that the weight on the connection from unit i to unit j is identical to the weight from unit j to unit i, formally wij = wji. The neurons are of two types: 1. visible neurons, a subset of the neurons, which interact with the environment, and 2. hidden neurons, the remaining neurons, which do not interact. Each neuron is a stochastic unit, which generates its output (or state) according to the Boltzmann distribution of statistical mechanics.
A Boltzmann machine operates in one of two modes: 1. the clamped mode, where the visible neurons are clamped onto specific states determined by the environment; and 2. the free-running mode, in which both the visible and hidden neurons operate freely.
The Boltzmann learning algorithm adjusts the connection weights in such a way that the states of the visible units satisfy a desired probability distribution. According to the Boltzmann learning rule, the change in the connection weight wij is given by

Δwij = η(ρ̄ij − ρij), (14.22)

where η is the learning rate, and ρ̄ij and ρij are the correlations between the states of units i and j when the network operates in the clamped mode and the free-running mode, respectively. The values of ρ̄ij and ρij are estimated using Monte Carlo experiments [8].
The Hebbian learning rule specifies the magnitude by which the weight of the connection between two units is increased or decreased, in proportion to the product of their activations. It builds on Hebb's learning rule, which states that the connection between two neurons is strengthened if the corresponding neurons fire simultaneously. The rule works well as long as the input patterns are uncorrelated; this condition places serious limitations on Hebbian learning. A Hebbian rule for an ANN is described as

wij(t + 1) = wij(t) + η yj(t) xi(t), (14.23)

where xi and yj are the output values of neurons i and j, respectively, which are connected by the weight wij, and η represents the learning rate. An important property of this rule is that learning is carried out locally, i.e., the change in synaptic weight depends only on the activities of the two neurons it connects. This simplifies the implementation of the circuit.
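A tiny sketch of the update in Eq. (14.23) follows; the activity values are illustrative. Note the locality: each weight change uses only the two activities at the ends of its connection.

```python
# A tiny sketch of the Hebbian rule, Eq. (14.23):
# w_ij(t+1) = w_ij(t) + eta * y_j * x_i  -- a purely local update.

eta = 0.5
x = [1.0, 0.0, 1.0]               # pre-synaptic activities x_i
y = [1.0, -1.0]                   # post-synaptic activities y_j
w = [[0.0] * len(y) for _ in x]   # weights w[i][j]

for i, xi in enumerate(x):
    for j, yj in enumerate(y):
        w[i][j] += eta * yj * xi  # strengthen co-active connections

print(w)   # [[0.5, -0.5], [0.0, 0.0], [0.5, -0.5]]
```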
A more powerful learning rule is the delta rule, which utilizes the discrepancy between the desired and actual outputs of each output unit to change the weights feeding into it.
The simplest possible competitive learning network has a single layer of output units, as shown in Fig. 14.11, where each output unit j connects to all the input units xi through the weights wij, i = 1, . . . , n. Each output unit also connects to all other output units via inhibitory weights, but has a self-feedback with an excitatory weight. Due to the competition, only the unit i* with the largest net input becomes the winner, i.e., the unit for which

wi* · x ≥ wi · x, for all i. (14.24)

In this learning, only the weights of the winner unit get updated.
Beginning around 2005, deep learning—a neural-network-based approach that draws its inspiration from brain science—began to come into its own, and has now become a singular force propelling AI research forward.
Deep learning is concerned with simulating ANNs that "learn" gradually, in areas like image processing, speech recognition and understanding, and even making decisions of their own. The basic technique relies on ANNs that do not precisely mimic how actual neurons work; instead, they are based on general mathematical principles that allow them to learn from examples—to recognize people or objects in a photograph, or to translate spoken language from one tongue to another. The technologies based on deep learning have transformed AI research and have produced far more accurate results in speech recognition, computer vision, natural language processing, and robotics.
To generalize successfully after observing a number of examples, a deep learning network needs more than just the examples; it depends on hypotheses about the data and assumptions about what a possible solution for a particular problem can be. A typical hypothesis built into a system might be that if the data inputs to a function in two situations are almost similar, the output should not change drastically; for example, altering a few pixels in the image of a dog will not transform it into a picture of a cat.
One type of neural network that builds in hypotheses about images is the convolutional neural network. These networks, when used in deep learning, have many layers of neurons organized in such a way that the output is less sensitive to deviations of the input image from the original object. For example, a face changes when viewed from different angles, yet we still recognize it correctly; a well-trained deep learning network does a similar job.
The design of convolutional networks takes its inspiration from the multilayered structure of the visual cortex—the part of the brain that receives input from the eyes. The many layers of virtual neurons in a convolutional neural network are what make the network "deep", and hence better able to learn about the world around it.
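At the heart of such networks is the convolution operation: one small filter is slid over the whole image, so a feature is detected wherever it occurs. The bare-bones NumPy sketch below illustrates this; the image and the filter are toy assumptions.

```python
import numpy as np

# A bare-bones sketch of the 2-D convolution (strictly, cross-correlation)
# used in convolutional layers: slide one shared filter over the image.

def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.zeros((5, 5))
image[2, 1:4] = 1.0                        # a short horizontal bar
edge = np.array([[ 1.0,  1.0,  1.0],
                 [-1.0, -1.0, -1.0]])      # a toy edge-detecting filter
print(conv2d(image, edge))                 # strong response along the bar
```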
14.9 Instance-Based Learning
Instance-based learning (IBL) makes use of the supervised approach, i.e., learning from examples. The input is a sequence of instances, where each instance is a set of n attribute–value pairs, thus defining an n-dimensional instance space. Exactly one of these attributes is the category attribute; the other attributes are predictor attributes. A category is the set of all instances in the instance space that have the same value for their category attribute. For simplicity, it is assumed that the categories are disjoint.
The main output of IBL algorithms is a concept, or concept description. The algorithm is a function that maps instances to categories: given an instance from the instance space, it produces a classification—a predicted value for the instance's category attribute.
An instance-based concept description comprises a set of stored instances and, possibly, some information about their past performance, e.g., the number of correct and incorrect classification predictions. The set of instances can change after each training instance has been processed. The concept descriptions are determined by how the similarity and classification functions of the IBL algorithm use the current set of saved instances. In all IBL algorithms, at least two of the following three components constitute these functions [1].
Similarity Function
This function computes the similarity between two instances: a training instance and an instance in the concept description. Similarities are represented as numeric values.
Classification Function
The inputs to this function are 1. the similarity function's results and 2. the classification performance records of the instances in the concept description; its output is the classification.
Concept Description Updater
The updater maintains records of classification performance and decides which instances to include in the concept description. Its inputs are the training instance, the similarity results, the classification results, and the current concept description; its output is the modified concept description.
The first two functions decide how the saved instances in the concept description are used to predict the category attribute. Thus, a concept description in IBL comprises not only a set of instances, but also these two functions.
There is an important difference between IBL algorithms and most other supervised learning methods: IBL algorithms do not construct explicit abstractions, like decision trees or rules. Most other learning algorithms produce generalizations from the instances and use simple matching procedures to classify future instances. IBL algorithms thereby avoid the need to store rigid generalizations as concept descriptions.
Algorithm 14.3 is a simple instance-based learning algorithm. The similarity function sim(x, y) used in the algorithm is expressed by

sim(x, y) = −Σ_{i=1}^{n} f(xi, yi), (14.25)

where the instances have n attributes. For numerical attributes, the relation used is

f(xi, yi) = (xi − yi)². (14.26)
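A short sketch of Eqs. (14.25)–(14.26) follows: the similarity is the negated sum of squared attribute differences, so larger (less negative) values mean more similar; the stored instances are toy assumptions.

```python
# A short sketch of IBL similarity, Eqs. (14.25)-(14.26): classify a new
# instance by the category of its most similar stored instance.

def sim(x, y):
    return -sum((xi - yi) ** 2 for xi, yi in zip(x, y))

stored = [((1.0, 2.0), "A"), ((5.0, 1.0), "B"), ((4.5, 1.5), "B")]

def classify(x):
    pattern, label = max(stored, key=lambda s: sim(s[0], x))
    return label

print(classify((4.8, 1.2)))   # -> "B"
```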
14.10 Summary
Machine learning systems help programs automatically learn from data, which can be regarded as a kind of data mining. Machine learning is used in computer science and beyond, e.g., in search engines, spam filters, advertisement placement, credit scoring, fraud detection, stock trading, drug design, and many other applications.
The most common models for learning and computation take the neuron as the basic processing unit. Each such processing unit has the following characteristics:
1. Activity level. The state of polarization of the neuron.
2. Output value. This depends on the firing rate of the neuron.
3. Input connections. The collection of synapses on the cell body and its dendrites.
4. Bias value. An internal resting level of the neuron.
5. Output connections. The neuron's axonal projections.
Artificial Neural Networks (ANNs) are weighted directed graphs, in which the nodes are artificial neurons and the directed edges (with weights) are the connections between neuron outputs and neuron inputs. The problem of learning in neural networks is the problem of finding a set of connection strengths that allows the network to carry out the desired computation. All three learning paradigms exist for neural networks: supervised, unsupervised, and hybrid.
Instance-based learning (IBL) approaches are supervised machine learning techniques. Several variants of instance-based approaches have been devised, e.g., memory-based learning, exemplar-based learning, and case-based reasoning. In IBL, the information provided by the stored examples is used directly to perform predictions; once a query has been answered, the prediction itself and the intermediate results are discarded.
As opposed to IBL, model-based or inductive learning methods derive predictions in an indirect way: in the first step, the observed data are used to induce a model, say a decision tree or a regression function. In the second step, predictions are obtained on the basis of this model, which can also serve other purposes, like explaining or justifying the inferences.
Exercises
[Fig. 14.12: Two training sets in the plane, (a) and (b), each containing points such as (0, 0) and (1, 0) belonging to the two classes.]
3. Consider the SVM for the training data given in ℝ² in Fig. 14.12a, b; find the separating hyperplanes in both cases.
4. Why is the Naive Bayes classification called Naive? What are the main goals
behind this classification?
5. Consider the data given in Table 14.1, and use it to train a Naive Bayes classifier with the designation attribute as the class label and all the remaining attributes as input. Once you have your Naive Bayes classifier, test the following unseen instances to find their class:
a. Marketing, 36 . . . 41, 51K . . . 55K
b. Sale, 36 . . . 41, 71K . . . 75K
References
7. Fukunaga K, Narendra PM (1975) A branch and bound algorithm for computing K-nearest
neighbors. IEEE Trans Comput 750–753
8. Jain AK et al (1996) Artificial neural networks: a tutorial. Computer 3:31–46
9. Joachims T (2009) Predicting structured objects with support vector machines. Commun ACM
52(11):97–104
10. Rummelhart DE et al (1994) The basic ideas in neural networks. Commun ACM 37(3):86–92
11. Shawe-Taylor J (2009) Machine learning for complex predictions. Commun ACM 52(11):96–96
Chapter 15
Automated Planning
Abstract Automated planning deals with the task of finding ordered sets of actions that allow a system to transform an initial state into a state satisfying a goal specification. The set of actions is a plan; finding such a plan is PSPACE-complete. Automated planning or scheduling generates a set of actions automatically. This chapter presents the nature of automated planning, the classical planning problem, the agent types that execute plans, and worked examples. It also covers the concepts and implementation aspects of forward planning, partial-order planning, planning languages, a case study of a general planning language—STRIPS—and search strategies. Planning with propositional logic, planning graphs, and hierarchical task network planning are demonstrated. Multiagent planning techniques are presented for goal and task refinement, decentralized planning, and coordination after planning. This is followed by the chapter summary and a set of exercises.
15.1 Introduction
and human agents. There is always a need for planning when an agent wants to control the evolution of its environment. An algorithm for a planning problem has the following inputs:
1. the possible courses of action,
2. a predictive model of the underlying dynamics, and
3. a performance measure for evaluating the courses of action.
The output, or solution, to a planning problem is one or more courses of action that satisfy the specified requirements for minimum performance. Hence, a planning problem involves deciding "what" actions to perform and "when" to perform them. The "when" part of the problem has traditionally been called the scheduling problem. The separation between "what" and "when" is motivated by computational considerations; algorithms for automated scheduling have been around for a long time, for example in operations research, as well as for automating complete planning problems.
The classical scheduling methods are concerned with planning problems where the resource constraints are static and prespecified in numerical form. This is in contrast to the AI methods for scheduling, which allow declarative specifications of both symbolic and numeric resource constraints, and handle dynamic changes in the constraints, i.e., during execution. Constraint Satisfaction Problems (CSPs) (see Chap. 10) form the canonical backbone of most AI scheduling methods, and several sophisticated heuristic search strategies are common for such problems [11].
Although the classical planning model has prevailed for a long time, a majority of research in planning is now directed toward planning in environments that are dynamic, stochastic, and partially observable. These requirements are not compatible with the classical planning assumptions of a deterministic, static, and fully observable world. To cope with partially observable environments, information gathering is designed as part of the planning activity, and existing classical planning techniques are extended to accommodate the interleaving of scheduling and planning tasks. Stochastic environments are modeled using Markov decision processes, and planning in such environments requires the construction of policies for the corresponding Markov decision processes [6].
All variants of the domain-independent planning problem are known to be computationally hard (PSPACE-complete or worse); efficiency can be improved only through exploitation of the problem distribution space and through knowledge of the domain structure. For improving the efficiency of plan generation, formal planning models are preferred. The following are the basic approaches for achieving efficiency enhancement:
1. split both the domain and the problem into tractable sub-components using decomposition techniques,
2. find control information using inductive and speedup learning techniques,
3. define abstractions for the expected problem distribution, and
4. define language(s) to express domain-specific search-control information.
and a language for describing problem states. The problem states correspond to instantaneous descriptions of the world, and operators correspond to actions that an agent can perform to change the state of the world [3].
Planning is about how an agent achieves its goals; to achieve even the simplest ones, an agent must reason about the future. Since a goal is often not achievable in a single step, it needs to be broken up into subtasks, each becoming a goal of its own; what an agent will do in the next step also depends on its past.
To complete each subtask, there are actions which need to be carried out; each action has a preceding state as well as a subsequent state. To keep things simple at the start of this subject, the following assumptions are made:
1. the actions are deterministic, i.e., the agent can determine the consequences of its actions,
2. the world is fully observable, i.e., the agent can observe the current state of the world, and
3. the closed world assumption holds, i.e., facts not described in the world are false.
2 Partial function: not every (state, action) pair necessarily results in a state.
15.3 The Basic Planning Problem
The classical planning problem is the simplest case of planning, where the environment is static and deterministic, and the planner has complete knowledge about the current state of the world. Two important issues are faced by classical planners: 1. how to model the actions and changes, and 2. how to organize the search for plans (i.e., sequences of actions) that are capable of achieving the goals. Logic programming and nonmonotonic reasoning are frequently used in planning and search; however, many implemented planners have used a variant of the action model called the STanford Research Institute Problem Solver (STRIPS). This model represents the state of the world in terms of state variables and their values, and actions as state-transforming functions, which are deterministic in nature. Most of the early planners modeled the exploration of the world-states as a state-space search, with the transitions between states representing the actions.
The state-exploration method worked well only for problems having a small search space, due to its complexity in space and time. Instead, manipulating partial plans during the search became popular, and this led to the design of algorithms that search in the space of partial plans.
The classical planning problem is defined as follows. Given
1. a description of the known part of the initial state of the world (in a formal language, usually propositional logic), denoted by I,
2. a description of the goal (i.e., a set of goal states), denoted by G, and
3. a description of the possible atomic actions (R, i.e., rules) that can be performed, modeled as state transformation functions,
determine a plan, i.e., a sequence of actions that transforms each of the states fitting the initial configuration of the world into one of the goal states. Thus, the classical planning problem is a tuple ⟨I, G, R⟩. Consider the following example of planning: "Transport a passenger by cab."
Suppose that initially (i.e., in all states of the world that match the description I),
there is a cab at a location A, represented by a binary state variable cab(A), and a
passenger at a location B, represented by passgr(B). In each of the states described
by G the passenger should be at a location C, denoted by passgr(C). Furthermore,
suppose that there are three actions (move, load, unload) that can transform (some
part of) the state of the world.
Agents are classified according to the techniques they employ in their decision making:
1. Reactive agents: they base their next decision solely on their current sensory input.
2. Planning agents: they base their course of action on anticipated future situations, possibly resulting from their own actions.
Fig. 15.2 The delivery robot's world: Robo moves among the mail room, coffee shop, office, and AI lab, clockwise (mc) or counterclockwise (mcc)
A delivery robot shown in Fig. 15.2 is responsible for some jobs. It can pick up mail from the mail room and deliver it to Shyam's office, and it can also pick up coffee from the coffee shop and deliver it to Shyam's office. The robot, called Robo, reaches these places by moving clockwise (mc) or counterclockwise (mcc). Assume that the mail it handles is postal mail (not email). The domain of the world is represented by the terminology described as follows [13].
Various locations are represented by the following symbols, each of which can be true/false:
cs: Robo is at the coffee shop
off: Robo is at Shyam's office
lab: Robo is at the AI lab
mr: Robo is at the mail room
The following are the various state variables of the world, which can also be true/false:
mw: Mail waiting in the mail room
rhc: Robo is holding coffee
swc: Shyam wants coffee
rhm: Robo is holding mail
The absence of a state variable is represented by its negation. For example, ¬rhc indicates that Robo is not holding coffee. The various actions performed by Robo are mc (move clockwise), mcc (move counterclockwise), puc (pick up coffee), dc (deliver coffee), pum (pick up mail), and dm (deliver mail). For example, the state
⟨lab, ¬rhc, swc, ¬mw, rhm⟩
indicates that Robo is in the AI lab, Robo has no coffee in hand, Shyam wants coffee, mail is not waiting, and Robo holds the mail. Another state,
⟨lab, rhc, swc, mw, ¬rhm⟩,
indicates that Robo is in the lab, Robo is holding coffee, Shyam is waiting for coffee, mail is waiting in the mail room, and Robo is not holding the mail. Table 15.3 shows the transitions for certain states. For example, in the third row, we note that after performing the action dm (deliver mail), ¬rhm in the newly created state indicates that the variable "Robo is holding mail" is false.
Table 15.3 Some mappings State × Action → State, for Fig. 15.2

State | Action | Resulting state
⟨lab, ¬rhc, swc, ¬mw, rhm⟩ | mc | ⟨mr, ¬rhc, swc, ¬mw, rhm⟩
⟨lab, ¬rhc, swc, ¬mw, rhm⟩ | mcc | ⟨off, ¬rhc, swc, ¬mw, rhm⟩
⟨off, ¬rhc, swc, ¬mw, rhm⟩ | dm | ⟨off, ¬rhc, swc, ¬mw, ¬rhm⟩
… | … | …
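The mapping of Table 15.3 is a partial function, and it can be made concrete in code. The following minimal Python sketch is illustrative only; the tuple layout and the helper name step are assumptions, not from the text.

# A sketch of Table 15.3 as a partial function State x Action -> State.
# A state is (location, rhc, swc, mw, rhm); True/False encode the variables.
transition = {
    (("lab", False, True, False, True), "mc"):  ("mr",  False, True, False, True),
    (("lab", False, True, False, True), "mcc"): ("off", False, True, False, True),
    (("off", False, True, False, True), "dm"):  ("off", False, True, False, False),
}

def step(state, action):
    # Return the successor state, or None: the mapping is a partial
    # function, so not every (state, action) pair has a result.
    return transition.get((state, action))

# After dm in the office, rhm ("Robo is holding mail") becomes false:
print(step(("off", False, True, False, True), "dm"))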
15.4 Forward Planning
One of the simplest approaches is to treat planning as a path-planning problem in the state-space graph. The nodes here are states, the transitions are actions, and the results of actions are again states. A forward planner searches the state-space graph from the start state for the goal state. Figure 15.3 shows the state-space graph for forward planning with the start state ⟨cs, ¬rhc, swc, mw, ¬rhm⟩ and three transitions from the start state, corresponding to the actions: pick up coffee (puc), move clockwise (mc), and move counterclockwise (mcc). If we choose the action puc, the next state is ⟨cs, rhc, swc, mw, ¬rhm⟩. The new state indicates that Robo remains at the coffee shop; since it picked up coffee, "robot holding coffee" is no longer false, while "mail waiting" remains true (unchanged). The new states resulting from the various actions are self-explanatory and follow this description [2, 13].
Note that, this being closed-world reasoning, we explicitly specify each of the variables as true or false.
The branching factor in Fig. 15.3 is 3, and the search can be done by DFS or BFS. Theoretically, since Robo can be at any of the four locations, and the other four parameters of a state can each be true/false, there are 4 × 2 × 2 × 2 × 2 = 64 possible states of the world in total. Obviously, not all of these states are reachable in the graph search.
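The forward planner just described can be sketched as a plain graph search. Below is a minimal breadth-first version in Python; states must be hashable (e.g., tuples), successors is assumed to enumerate (action, next-state) pairs from transitions such as those of Table 15.3, and all names are illustrative.

from collections import deque

def forward_plan(start, goal_test, successors):
    # Breadth-first forward planning on the state-space graph:
    # nodes are states, arcs are actions; returns a list of actions.
    frontier = deque([(start, [])])
    visited = {start}
    while frontier:
        state, plan = frontier.popleft()
        if goal_test(state):
            return plan
        for action, nxt in successors(state):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, plan + [action]))
    return None  # goal unreachable from the start state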
The representation above is simple and clear, but it is not suitable for larger problems, because the number of states, and with it the cost of searching the graph, grows exponentially with the number of state variables.
Fig. 15.3 Part of the state-space graph for forward planning: from the start state ⟨cs, ¬rhc, swc, mw, ¬rhm⟩ the actions puc, mc, and mcc lead to states at cs, mr, and off, from which further actions (dc, mc, mcc) lead onward to states at off, lab, and cs
15.5 Partial-Order Planning

Partial-order planning is a search carried out by refining partial plans through the addition of actions and orderings. Alternatively, a search may proceed by introducing abstract actions into the plan and replacing them with fragments capable of carrying out those actions; this kind of planning is called hierarchical planning. It is now well understood that even the state-space search methods are just other ways of refining partial plans: the refining is done by growing the prefix or suffix of the plan, and different ways of refining a partial plan can be suitably interleaved.
The forward planner enforces a total ordering on actions at all stages of the planning process. The idea of partial ordering between actions is to commit to an ordering between actions only when forced to. A partial ordering is a "≤" relation that is reflexive, transitive, and antisymmetric. A partial-order plan is a set of actions together with a partial ordering, representing a "before" relation on actions, such that any total ordering of the actions consistent with the partial ordering will achieve the goal from the initial state. Here, act0 < act1 means that action act0 appears before act1 in the partial order.
A partial-order plan comprises the following sets:
1. Set of steps. Each step maps to some operator, except the start and finish steps. The start step has no preconditions, and the start state is its post-condition. Similarly, the finish step has the goals as its preconditions and has no post-conditions.
2. Set of orderings of steps. Each ordering is a pair of steps, in the form of a previous and a next step. The start step is ordered before all other steps, and the finish step is ordered after all other steps.
3. Set of causal links. Each causal link is a pair of steps together with a proposition that is a post-condition of the first step and a precondition of the second. The first step is always ordered before the second step.
In fact, a partial-order planner works as follows: begin with the actions start and finish, and with the partial order start < finish. The planner maintains an agenda, a set of ⟨P, A⟩ pairs, where A is an action in the plan and P is a precondition of A. The agenda is initialized with the pairs ⟨G, finish⟩, where G is a precondition of finish (i.e., a goal). At each stage, a pair ⟨P, act1⟩ is chosen from the agenda, where P is a precondition of action act1. Then an action act0 is chosen to achieve P; it is either an action already in the plan (possibly start), or a new action that is added to the plan. The act0 must occur before act1, and this adds a new causal link. Any action that deletes P must happen either after act1 or before act0. If act0 is a new action, its preconditions are added to the agenda, and the process continues till the agenda is empty.
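This agenda loop can be sketched in code. The following simplified Python version is a sketch under strong assumptions: it is deterministic (no backtracking), it resolves threats only by ordering a deleter before act0, and the operator encoding (preconditions, adds, deletes) and all names are mine.

def pop_sketch(init, goal, operators):
    # Simplified agenda-driven partial-order planner.
    # `operators` maps a name -> (preconditions, adds, deletes), all sets.
    steps = {"start": (set(), set(init), set()),
             "finish": (set(goal), set(), set())}
    order = {("start", "finish")}            # the "before" relation
    links = set()                            # causal links (act0, P, act1)
    agenda = [(p, "finish") for p in goal]   # pairs <P, A>
    while agenda:
        P, act1 = agenda.pop()
        # choose act0 achieving P: prefer a step already in the plan
        act0 = next((s for s, (_, adds, _) in steps.items() if P in adds), None)
        if act0 is None:                     # else add a new action
            act0 = next(n for n, (_, adds, _) in operators.items() if P in adds)
            steps[act0] = operators[act0]
            agenda += [(q, act0) for q in operators[act0][0]]
            order.add(("start", act0))
        links.add((act0, P, act1))
        order.add((act0, act1))
        # threat resolution: a step deleting P is ordered before act0
        # (the text also allows ordering it after act1 instead)
        for s, (_, _, dels) in steps.items():
            if P in dels and s not in (act0, act1):
                order.add((s, act0))
    return order, links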
The STRIPS formulation of planning problems uses a simple and general format for specifying operators, with clear semantics. It employs propositional logic as a language for describing states of the world, with a set of conditions and Boolean variables describing the states. A complete assignment maps the whole set of conditions to values (e.g., truth values), while a partial assignment maps a subset of the conditions to values; a state is a complete assignment. An operator comprises two partial assignments: the first is the precondition, which decides the states in which the operator may be applied; the second is the post-condition, which decides the next state resulting when the operator is applied in a particular start state. In addition, the implicit frame axioms indicate that the value of any condition not mentioned in an operator's post-conditions is unchanged by the application of that operator.
A planning problem instance consists of a set of operators, an initial state, and a goal; the goal may also be a Boolean formula. Generally, a solution is in the form of a partially ordered multi-set of operators that satisfies the following condition: any total ordering consistent with the given partial order transforms the initial state assignment into a new assignment that satisfies the goal formula. The STRIPS formulation provides the semantic foundations for many extensions, including those handling external events, multiple agents, and probabilistic transformations, as well as variants that allow the agent to observe aspects of the current state and choose actions conditioned on the observations.
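Operationally, a STRIPS state can be represented as the set of conditions that are currently true (everything else is false), and operator application becomes set manipulation, with the frame axioms holding implicitly. A minimal Python sketch, using the cab example of Sect. 15.3 with an assumed move operator:

def applicable(state, op):
    # An operator applies when its preconditions hold in the state.
    return op["pre"] <= state

def apply_op(state, op):
    # Delete, then add; conditions not mentioned in the post-conditions
    # stay unchanged (the implicit frame axioms).
    assert applicable(state, op)
    return (state - op["del"]) | op["add"]

# Illustrative operator (assumed, not from the text): the cab moves A -> B.
move_AB = {"pre": {"cab(A)"}, "del": {"cab(A)"}, "add": {"cab(B)"}}
s0 = {"cab(A)", "passgr(B)"}
print(apply_op(s0, move_AB))   # {'cab(B)', 'passgr(B)'}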
STRIPS is an action-centric representation, based on the idea that most things are not affected by a single action. It specifies an action together with its precondition and effect. For example, for the action of picking up coffee (puc), we write

precondition: cs ∧ ¬rhc
effect: rhc.
All other features are unaffected by the above action. The action of delivering coffee (dc) is

precondition: off ∧ rhc
effect: ¬rhc ∧ ¬swc.

Apart from the above, one also needs to specify the initial state and the goal.
The actions, or rules, come in two formats: causal rules, which specify when a feature gets a new value as the result of an action, and frame rules, which specify that a feature keeps its value unless an action changes it. For example, whether "Robo holds coffee" in the resulting state depends on whether it was holding coffee in the previous state and on its action:

rhc′ ← rhc ∧ ¬dc
rhc′ ← puc

where the first is a frame rule (Robo keeps the coffee unless it delivers it), and the second is a causal rule (picking up coffee makes rhc true).
The problem space for the STRIPS language comprises the initial world model, a set of available operators with their effects on the world model, and a goal statement. The world model is represented by a set of well-formed formulas (wffs) of FOPL. For example, to describe a world model in which the robot is at location a and the boxes B, C are at locations b, c, respectively, we would use the following wffs to express this knowledge [4]:
ATR(a)
AT (B, b)
AT (C, c).
We might also use a wff to express the general rule that an object u at place x is not at a place y, i.e., an object cannot exist at two places:

∀u ∀x ∀y [AT(u, x) ∧ (x ≠ y) → ¬AT(u, y)].
We can represent a complex world model using first-order predicate logic, and we can use standard theorem-proving programs to answer questions about the model. The operators are grouped into families of operators, called schemata. For example, an operator goto for moving the robot from one point m on the floor to another point n is a schema. For this, the distinct operators (one for each pair of points) are grouped into a family of goto operators, and goto(m, n) represents a move from the initial position m to the final position n. The members of the goto schema are goto(m, a1), goto(a1, a2), ..., goto(ak, n). In STRIPS, specific constants will already have been chosen for the operator parameters by the time an operator is applied to a world model.
First of all, it is necessary to determine whether or not there is an instance of an operator schema applicable to the current world model. It is required that an instance of the corresponding wff (well-formed formula) schema exists and logically follows from the model. Each operator schema is defined by a description having two parts: the conditions under which the operator can be applied, and the effects of the application of that operator. The precondition of an operator schema is represented as a wff. The effects of the application of an operator are defined by a list of wffs that must be added to the model, and a list of wffs that are no longer true and hence must be deleted. As an example, consider the question of applying instances of the operator subschema goto(m, b) to a world model containing the wff ATR(a). If the precondition wff schema of goto(m, n) is ATR(m), then we note that the instance ATR(a) can be proved from the world model. Thus, an applicable instance of goto(m, b) is goto(a, b).
It is important to understand the difference between the parameters of a wff schema and existentially or universally quantified variables. The latter are used in the FOPL formulas of theorem-proving programs (e.g., resolution theorem provers) that handle wff schemata. For example, in resolution, ∀x∀y goto(x, y) will take a substitution such as {a/x, b/y} to make it a ground clause, whereas in STRIPS there is a chain of pairs (x, y) implementing goto(x, y); hence some modifications are required.
A goal statement is also in the form of a wff. For example, the task of moving the boxes B and C to locations d, e, respectively, might be expressed as the goal

AT(B, d) ∧ AT(C, e).
In summary, a problem in the state space for STRIPS comprises three entities:
1. Initial world model. A set of wffs describing the present state of the world.
2. Set of operators. The operators and their descriptions, in the form of preconditions and effects given as wffs.
3. Goal. A condition in the form of a wff.
The problem is taken as solved when a world model that satisfies the goal wff is deduced from the initial world model by the application of operators.
15.7 Planning with Propositional Logic

Consider the graph/tree shown in Fig. 15.4, with s0 as the start state and a0 . . . an−1 as the actions representing a path leading to the goal state sn. If s0, a0, . . . , an−1, sn are considered as propositional expressions, then [10]

p = s0 ∧ a0 ∧ s1 ∧ a1 ∧ · · · ∧ an−1 ∧ sn    (15.6)
is a propositional expression representing the path to the goal state from the start state. If there is only one goal, then p is the only path to the goal, and all other path propositions are false. Equation 15.6 can always be transformed into a CNF (conjunctive normal form) or SAT expression. Thus, planning through propositional logic means finding a satisfiable expression comprising the start state, the path, and the goal state, i.e., a logical sentence of the form

initial state ∧ successor-state axioms ∧ goal.
In other words, a model that satisfies this sentence will assign true to all actions that are part of a correct plan. If the planning problem is unsolvable, the sentence is not satisfiable. The following example demonstrates planning with propositional logic [12].
Initial Plan: Let us assume that initially the plane p1 is at DEL (Delhi) and p2 is at CAL (Calcutta). The actions should correspond to flying these planes so that the goal "p1 at CAL, and p2 at DEL" is satisfied.
The initial state (at time 0) is represented by

At(p1, DEL)0 ∧ At(p2, CAL)0,

where the index 0 denotes the time step. Since propositional logic makes no closed-world assumption, it is also necessary to state that initially the planes p1, p2 are not at CAL and DEL, respectively, i.e.,

¬At(p1, CAL)0 ∧ ¬At(p2, DEL)0.
First, the planner checks whether the goal is already satisfied at time T = 0. If that fails, it applies actions and tries again at time T = 1, and so on, up to time T = Tmax. This is shown in Algorithm 15.1, where at every iteration, for the next value of T (time), the problem is translated to a SAT (satisfiability) problem using the procedure translate-to-SAT. This results in a CNF expression and a mapping of the solution to the problem. The SAT-solver procedure returns an assignment for the above CNF expression. If the assignment is satisfying (i.e., not null), the solution is extracted using the present mapping and returned as the result. Otherwise, the procedure iterates until the time bound Tmax is reached [8].
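The iterative-deepening structure of Algorithm 15.1 can be sketched as follows. The four callables (problem encoding, translator, solver, and decoder) are placeholders for whatever encoding and SAT solver one plugs in; they are assumptions, not a real library API.

def satplan(problem, t_max, translate_to_sat, sat_solver, extract_solution):
    # Try horizons T = 0, 1, ..., t_max. At each T, encode
    # (initial state AND successor-state axioms AND goal at T) as CNF
    # and call the SAT solver; decode the plan on success.
    for t in range(t_max + 1):
        cnf, mapping = translate_to_sat(problem, t)
        assignment = sat_solver(cnf)          # None if unsatisfiable
        if assignment is not None:
            return extract_solution(mapping, assignment)
    return None  # no plan of length <= t_max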
We have a propositional symbol for each occurrence of each state variable at each time step. For example, for the plane p1 to be at CAL at time T = 1, there is the proposition At(p1, CAL)1. There ought to be a plan that achieves the goal at T = 1; the CNF sentence for this is

initial state ∧ successor-state axioms ∧ goal1.
15.7.2 Analysis
For a proposition such as fly, having the general form fly(p, o1, o2), with T time steps, the complexity is governed by the number of ground action symbols,

T × |Act| × |O|^p,

where T is the number of time steps, p is the arity of the action schema (here it is 2), |O| is the number of objects o1, o2, etc., and |Act| is the number of action schemata, like fly, in total [8]. We note that this count is exponential in the arity. For example, with T = 10 steps, the single schema fly, and 10 objects, there are 10 × 1 × 10² = 1000 ground action symbols.
Planning graphs give better heuristics. A planning graph consists of a sequence of levels corresponding to the time steps in the plan. Each level contains the literals that may have become true because of the previous actions, and these literals in turn serve as preconditions for the actions of the next level. Planning graphs represent actions as well as inactions (persistence). The following example demonstrates the application of planning graphs [5, 12].
Example 15.4 The problem of "have a pizza and eat a pizza".
init(have(pizza))
Goal(have(pizza) ∧ eaten(pizza))
Action(eat(pizza)
Precond : have(pizza)
Effect : ¬have(pizza) ∧ eaten(pizza))
Action(cook(pizza)
Precond : ¬have(pizza)
Effect : have(pizza)).
The planning graph for these actions is shown in Fig. 15.5. The boxes in the figure indicate the mutual exclusions of actions. In the planning graph, the action level Ai contains all the actions that are applicable in the state level Si, along with constraints saying which pairs of actions cannot be executed together. Every state level Si contains all the literals that could result from any choice of actions at Ai−1, along with constraints saying which pairs of literals cannot hold together.
Fig. 15.5 A planning graph for "have a pizza and eat a pizza": level S0 holds have(pizza), ¬eaten(pizza); A0 offers eat(pizza); S1 holds have(pizza), ¬have(pizza), eaten(pizza), ¬eaten(pizza); A1 offers cook(pizza), eat(pizza); S2 repeats the literals of S1
We note that Graphplan has lower complexity because it does not require choosing among all actions; it just records the impossibility of certain choices using mutex, i.e., mutual exclusion links. For example, when ¬have(pizza) is chosen, the literal have(pizza) is excluded. This results in a complexity that is a low-order polynomial in the number of actions and literals [11].
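One level of planning-graph expansion, including the inconsistent-effects mutex between actions, can be sketched as below. This is a simplified illustration: the '~' negation convention, the data layout, and all names are my own assumptions.

def neg(lit):
    # Complement of a literal: '~x' <-> 'x'.
    return lit[1:] if lit.startswith("~") else "~" + lit

def expand_level(literals, actions):
    # One expansion step: the actions applicable at state level S_i,
    # the literals of S_{i+1}, and mutex pairs of actions whose effects
    # conflict. `actions` maps name -> (preconditions, effects).
    applicable = {a: pe for a, pe in actions.items() if pe[0] <= literals}
    next_literals = set(literals)   # persistence (no-op) carries literals over
    for _, eff in applicable.values():
        next_literals |= eff
    mutex = {(a, b) for a in applicable for b in applicable
             if a < b and any(neg(x) in applicable[b][1]
                              for x in applicable[a][1])}
    return applicable, next_literals, mutex

# The pizza actions of Example 15.4:
acts = {"eat":  ({"have"}, {"~have", "eaten"}),
        "cook": ({"~have"}, {"have"})}
print(expand_level({"have", "~eaten"}, acts))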
In hierarchical task planning, a task at each level of the hierarchy is decomposed into smaller tasks at the next level. This is common in areas like military missions, administration, and program development, where a task is reduced to a small number of activities at the next level, so that the computational effort of arranging those activities is low. This can reduce the complexity from the original exponential time to linear time [13].
Consider the example of "building a house", where the task of house building can be decomposed into acquiring land, preparing a design map, obtaining the NOC (no objection certificate) from the municipal corporation, arranging for a house loan, hiring a builder, paying the builder, and so on, as shown in Fig. 15.6.

Fig. 15.6 Hierarchical decomposition of the task "build a house" (among its effects: House, ¬Money)
We understand that some of these activities can be done in parallel, but not all. Thus, there is a partial-order relation between those actions. The decomposition can be expressed as decompose(a, d), where action a is decomposed into a partial-order plan d. The various activities for building a house, as per the plan in Fig. 15.6, can be formally described by such decompositions; a sketch follows.
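The following hedged Python sketch shows what decompose(a, d) declarations for the house example might look like; the subtask names and the partial order are one illustrative reading of Fig. 15.6, not the book's exact encoding.

# decompose(a, d): an abstract action maps to a partial-order plan,
# given here as (steps, orderings). All names are illustrative.
decompositions = {
    "build_house": (
        ["buy_land", "prepare_design", "obtain_noc", "arrange_loan",
         "hire_builder", "pay_builder"],
        [("buy_land", "prepare_design"), ("prepare_design", "obtain_noc"),
         ("obtain_noc", "hire_builder"), ("arrange_loan", "pay_builder"),
         ("hire_builder", "pay_builder")],
    )
}

def decompose(action):
    # Return the partial-order plan refining an abstract action,
    # or None when the action is already primitive.
    return decompositions.get(action)

print(decompose("build_house"))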
15.10 Multiagent Planning Systems

There are a number of good reasons for having multiple agents creating plans:
1. The agents may represent real-life entities, whose privacy and autonomy may also need to be maintained.
2. Changing an existing distributed system into a multiagent system is less costly than changing it into a centralized system.
3. Creating and maintaining plans locally allows for more efficient reaction, especially when communication is limited, and
4. Dividing a planning problem into smaller pieces and solving those in parallel often turns out to be much more efficient. This is true particularly when the individual planning problems are not tightly coupled.
The solution to a multiagent planning problem is a plan: a partially ordered sequence of actions that, when executed successfully, results in a set of achieved goals for some of the agents. Most techniques can deal with problems where 1. the actions and goals of the agents are only weakly dependent upon each other, 2. the agents are cooperative, and 3. communication is reliable. However, in general, a multiagent planning problem may encounter a wide variety of situations along these three axes. Some characteristics are described below.
The general multiagent planning process consists of five phases: 1. allocate goals to agents, 2. refine goals into subtasks, 3. schedule the subtasks, adding resources including the agents, 4. communicate the planning choices, and 5. execute the plans. Planning proper is a combination of phases 2 and 3, which are often interleaved. Any of these steps could be performed by one agent or by some subset of the agents. Not all of the above phases need to be considered in every multiagent planning problem. For example, there is no need for phase 1 if there are no common or global goals for the agents. Some multiagent systems may combine different phases; for example, agents may coordinate with each other while constructing their plans, combining phases 2, 3, and 4. Alternatively, robots may postpone coordination until the execution phase (i.e., combining phases 4 and 5). This may be required, for example, when they unexpectedly encounter each other while following their planned routes.
In general, it is possible to interleave any combination of the five phases listed above, depending on the nature of the planning problem, resulting in a wide variety of possible problem sub-classes. In the following, we present the phases in more detail.
Centralized methods often take care of the assignment of goals and tasks to agents during planning. There are, however, many other methods to assign tasks in a more distributed way, giving the agents a higher degree of autonomy and privacy. For example, complex task-allocation protocols, auctions, or market simulations can be used.
One way to assign tasks to agents is through an auction, which assigns a task to the agent that attaches (bids) the highest value, or the lowest cost, to performing it, called its private value. A protocol called the Vickrey auction is a frequent example in multiagent systems: each agent makes a sealed bid, and the task is assigned to the highest bidder, not at the price of that bid, but at the price of the second-highest bid. The Vickrey protocol has the good property that the bidding agents will simply bid their true private values, with no need for additional reasoning about what the task is worth to the others.
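A minimal Python sketch of Vickrey (second-price, sealed-bid) task assignment; the agent names and values are illustrative.

def vickrey_assign(bids):
    # Assign the task to the highest bidder at the price of the
    # second-highest bid (second-price, sealed-bid auction).
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1] if len(ranked) > 1 else ranked[0][1]
    return winner, price

# a2 wins but pays a3's bid, so truthful bidding is a dominant strategy:
print(vickrey_assign({"a1": 40, "a2": 55, "a3": 48}))   # ('a2', 48)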
Economics and market simulations can also be the basis for the allocation of resources among agents; for example, it has been shown how costs can be turned into a coordination device. These methods are useful for task assignment (phase 2), and also for the coordination of agents after plan construction (phase 5). In value-oriented environments, such game-theoretic approaches become even more important, as the agents there reason about the cost of their decision-making (or communication).
Task assignment can be done through a single agent, using Hierarchical Task Networks or nonlinear planning. More than one planner, with more sophisticated models of temporal extent, can also be introduced, centralizing as well as combining phases 2 through 4.
Instead of one agent planning for the rest of the agents, the second and third phases can be implemented through local planning by each of the agents. In principle, any planning technique can be used in this setting, and different agents may even use different techniques. Some of the approaches join the individual planning (phases 2 and 3) with the coordination of the plans (phase 4). A distributed version of a planner can be used to integrate phases 1 through 4, planning for the single agents in parallel.
In one setting, each agent has partial knowledge of the plans of the other agents, using some specialized plan representation technique. In such an environment, coordination is achieved as follows: if an agent A informs another agent B about part of its own plan, then agent B merges this information into its own partial global plan. Agent B can then try to improve the global plan by, for example, eliminating redundancy in the plan, and this improved plan is shown to the other agents, who might reject, accept, or modify it. This process may run concurrently with the execution of the (first part of the) local plan.
One of the tasks is the coordination of plans after they have been constructed on an individual basis (phase 4), called plan merging. Its objective is the construction of a joint plan for a set of agents, given the individual plans of each of the participating agents. In coordinated planning, a pair of agents may help each other by changing the state of the world so that the conditions required by the other agent become satisfied. In this process, changing the state of the world may be helpful to these two, but it may also interfere with the required conditions of the remaining n − 2 agents, assuming a system of n agents.
To specify the constraints on plans, one approach is to use propositional temporal logic, which ensures that only feasible states of the environment are reached. On receiving these constraints, a theorem-proving algorithm generates a sequence of communication actions. In effect, these communication actions implement semaphores that guarantee that no event will fail. For resolving conflicts, restrictions need to be introduced on the individual plans in phase 3, which will also ensure efficient merging.
A different, distributed approach to plan merging aims to improve social welfare, defined as the sum of the benefits of all agents. This approach uses a process of group constraint aggregation, in which the agents incrementally construct an improved global plan by voting for joint actions. There are even algorithms to deal with insincere agents, and to interleave planning, coordination, and execution [11].
15.12 Summary
The general multiagent planning process comprises five phases:
1. Allocate goals to agents.
2. Refine goals into subtasks.
3. Schedule the subtasks, adding resources, including the agents.
4. Communicate the planning choices.
5. Execute the plans.
Exercises
1. Consider the standard Towers of Hanoi problem with 3 pegs and 4 disks (d1, d2, d3, d4, with d1 at the top). The disks are to be transferred from the start-peg to the end-peg, using the intermediate peg, one at a time, such that at no time does a larger disk rest on a smaller one. The disk d1 is the smallest and d4 is the largest. Make use of only three predicates: the unary predicate clear, and the binary predicates on and smaller; only one action, puton(x, y), needs to be used.
Write the domain of the problem, and make use of forward planning to plan the solution that moves all 4 disks from the start-peg to the end-peg.
2. Given the 3-SAT problem:
solve it using forward planning. (Hint: you need to assume some operators (i.e.,
actions) to assign the values to variables p1 . . . p3 .)
3. Given the table T, and the blocks A, B, C, D at different positions on the table, apply STRIPS to plan the solution of the following problem:
Initial state I:
on(A, T), on(B, T), on(C, T), on(D, T), clear(A), clear(B), clear(C), clear(D),
i.e., the blocks A . . . D are on the table, and their tops are clear.
Final state F:
on(A, B), on(B, C), on(C, D), on(D, T), clear(A).
Use the action puton(X, Y), X ≠ Y, where X is a block A . . . D and Y is either the table T or a block A . . . D. Give the forward planning to reach the state F starting from the state I.
4. Use STRIPS for planning the following problem: You are at home, you have money, and you are required to buy milk. Assume the necessary start and goal states, actions, preconditions, and results for this planning job.
5. Give the STRIPS representations of the actions pick up mail and deliver mail (ref. Fig. 15.2).
6. Suppose the robot (in Fig. 15.2) cannot carry both coffee and mail at the same time. Make use of some constraints to provide the planning for this situation.
Assume that the robot can carry a box in which it can place objects, so that it can carry the box while the box holds the mail and the coffee.
7. Modify the problem in Fig. 15.2 so that the robot has the job of cleaning the four rooms (mail room, office, coffee shop, lab). Assume that it will clean a room only when the room is unclean, and will not use more than one rotation (mcc or mc) to reach any of these rooms.
8. Using the method of hierarchical task network planning, provide the automated planning for the following problems:
a. Shopping for grocery items in a market.
b. Delivering a lecture on AI.
c. Robot path planning to cover the diagonal of a room.
9. Assume that you have three operators:
f1 : Precondition: a; effect: ¬a ∧ b
f2 : Precondition: a ∧ c; effect: ¬a ∧ b ∧ ¬c
f3 : Precondition: b ∧ c; effect: ¬c ∧ d
Show the first three layers (proposition, action, and proposition) of the graph plan
when the initial state is a ∧ c (a and c both are true). Include the mutual exclusions
and justify each of them.
References
Abstract Intelligent agents are being viewed as new theoretical models of computation that more closely reflect current computing reality, aimed as new-generation models for complex and distributed systems. An agent system can work as a single agent or as a multiagent system. Intelligent agents have many applications: they are used in software engineering, and in buying and selling, like online sales, bids, and trading; agents are also modeled for decision-making, with preferences and criteria for making decisions. This chapter also presents the classification of agents, agent system architecture, how agents coordinate among themselves, and the formation of coalitions between agents. Multiagents communicate with each other using agent communication languages, which are oriented towards performing actions. Another category of agents is the mobile agents: programs which can be moved to any far-off place and can communicate with the environment. The chapter ends with the chapter summary and a set of exercises.
16.1 Introduction
An artificial agent, or intelligent agent, is a recent term in computer science, and specifically in artificial intelligence. There are a number of definitions of agents. Agents are viewed as a new theoretical model of computation that reflects current computing reality in a better (more tangible) way than the existing model of the Turing machine. They are being projected as a next-generation model to engineer complex and distributed systems.
Among the many characterizations of agents, the following definition is the most common: An agent is an encapsulated computer system that is situated in some environment, and that is flexible and capable of autonomous action in that environment in order to meet its desired goals. There are a number of associated questions about this definition that require further explanation, and they become somewhat clearer through the extended definition of agents in the following:
1. Agents are entities for problem-solving, for clearly identifiable problems with well-defined boundaries and interfaces;
2. they receive inputs related to the state of their environment through sensors, when embedded in an environment, and act on the environment through effectors;
3. an agent is designed to fulfill a specific requirement, and it has a particular goal to achieve;
4. they have control over both their own behavior and their internal state, a property called autonomy;
5. agents have flexible problem-solving behavior: they are both reactive (able to respond in time to changes occurring in their environment) and proactive (able to act in anticipation of future goals).
Agents are also being used as a framework to bring together the various sub-areas of AI to design and build intelligent systems. In spite of the intense interest of the research community and the progress made in the area of agents, a number of fundamental questions about the nature and use of the agent-oriented approach remain unanswered, such as the following:
• What are the fundamental concepts and notions of agent-based computing?
• What makes the agent-based approach a natural and powerful computational
model?
• What are the implications of agent-based computing, in the wider perspectives,
for AI and computer science in general?
Autonomous
This is the property of an agent by which it proactively initiates activities in pursuit of its goal. An agent has its own thread of control, can act on behalf of its user, and does not necessarily depend on messages from other agents.
Mobile
An agent can move itself from one execution context to another. For this activity, it can move its code and carry on executing from the current point onward, or it can start afresh. Another mode of execution is the serialization of its code and state, so that it may continue its execution in a new context while retaining its old state.
Adaptable
An agent is adaptable to a new environment; its behavior can change after its deployment through its own learning, through downloading new capabilities, and through user customization.
Knowledgeable
A software agent has a reasoning capability, by which it can reason about the acquired information, about the knowledge of other agents, about its user, and about its goals.
Collaborative
Some agents are called collaborative agents; they can communicate with other agents and work in a cooperative manner. This collaboration can be formed either statically or dynamically. A collection of such agents is called a multiagent system.
Persistent
The infrastructure used by the agents allows them to retain their knowledge and states over an extended period of time. The agents have the property of robustness, i.e., they work correctly in the face of some failures at run time.
Many characteristics of intelligent agents are the result of capabilities like adaptability, cooperation, and proactivity. Figure 16.1 shows the agent taxonomy. The circles correspond to general agent capabilities, and the intersections correspond to agents having two or three of these capabilities. For example, an agent having the capabilities of being proactive and cooperative is called a collaborative agent.
16.2 Classification of Agents

Depending on the functions performed, agents can be classified into one of several major categories.
Adaptive agents
They can learn from their previous experience, change how they behave in a given situation, and also behave differently in similar situations.
Fig. 16.1 Agent taxonomy: the overlapping capabilities (adaptive, cooperative, proactive) give rise to personal, collaborative, and smart agents
Cooperative agents
They can communicate with other agents, and their actions can depend on the results of that communication.
Proactive agents
These agents can initiate actions on their own, i.e., without any prompting from the user or from other agents.
Personal agents
They are proactive and can interact directly with a user. While interacting with the user, they present some personality or character, can monitor and adapt to the user's activities, and can learn the user's style and preferences. They can automate or simplify certain rote tasks. Many software tool-kits, e.g., Microsoft Agent, offer a set of software services that support the presentation of software agents as interactive personalities, including natural language and animation capabilities.
Collaborative agents
These agents are proactive and cooperate with other agents. They communicate and interact in groups, often on behalf of a number of users, organizations, or services. Multiple agents exchange messages to negotiate or share information. Some of their applications are online auctions, planning, negotiation, logistics, supply-chain management, and telecommunication services.
Smart agents
The smart agents exhibit a combination of all capabilities, i.e., they are adaptive,
cooperate with other agents, and are proactive.
Mobile agents
These agents are sent to remote sites to collect information and forward it to a central or any other location. Before sending the results to a specified location, these agents can aggregate and analyze the data or perform some local control. They are typically implemented in languages such as Java, Java-based component technologies, VBScript, Perl, TCL, or Python. The data-intensive processing is usually performed at the source, as this avoids shipping bandwidth-consuming raw data. Examples of such applications are network management agents, Internet spiders, and NASA's mobile agents for human planetary exploration.
16.3 Multiagent Systems

Multiagent systems are used by organizations or people with different or conflicting goals and with proprietary information; in such settings a multiagent system is required to handle their interactions. As an example, consider a manufacturing scenario where company A produces launching pads for missiles but subcontracts to company B the production of the missiles. To build the whole system comprising the missile launcher and the missiles, the internals of both companies must be modeled. However, neither of these companies is ready to share its details with the other. Possibly the two companies may reach some agreement; otherwise, when they are not ready to share any details due to protocols imposed by the government, a multiagent system (MAS) can be created, with one agent for each company, representing the goals and interests of that company [10].
As another example, consider a teaching timetable system for a college. This domain requires different agents to represent the interests of different people in the college. The faculty want their classes to be evenly distributed throughout the week, possibly with all classes in as few rooms as possible; the management wants all the resources to be used fully; the students want no more than two or three theory classes per day. Similarly, the technicians will have their own requirements. In such a scenario, a multiagent system, where the different constraints are handled by separate agents, can create a static or dynamic timetable that best meets all the constraints.
A multiagent system creates parallelism by assigning different tasks to different agents, hence making the overall system fast in its response. In addition, many agent systems have redundant agents, which helps in building robustness into the system. This is possible because the control and the responsibilities are shared among the agents, so the system can tolerate failures of some of the agents and still work correctly and efficiently. Application areas that require graceful degradation at the time of failure, instead of sudden failure, are suitable domains for multiagent systems. In contrast, if a single entity, processor, or agent controls everything, then the entire system may crash on a single failure.
Multiagent systems also have the benefit of scalability. Because they are inherently modular, it is easier to add new agents to a multiagent system than to add new capabilities to a monolithic system. Due to this flexibility, it is also easier to program a multiagent system.
Fig. 16.2 A single agent embedded in its environment, with sensors (S), effectors (E), goals, actions, and domain knowledge
In a multiagent system, there are several agents, which are capable of modeling each other's goals and actions. In a general multiagent system, there may be direct interactions among the agents. This inter-agent communication is viewed as separate from the communication with the environment. A major difference from a single-agent system is that in a multiagent system the environment dynamics can also be determined by the other agents, which can affect the environment in an unpredictable way. Thus, all multiagent systems can be treated as having dynamic environments.
Figure 16.3 shows a multiagent environment, where each agent is part of the environment and can also be modeled as a separate entity. There may be any number of agents, with different degrees of heterogeneity and with or without the ability to communicate directly [10].
In the agent-oriented view of the world, it has been found that most problems require the participation of multiple agents to represent their distributed nature, their multiple loci of control, and their multiple competing interests. In addition, the agents need to interact with each other to manage the dependencies that result from their existence in a common environment, and to achieve their individual objectives. Such interactions may vary greatly: from simple information exchanges, to requests for particular actions (for coordination or cooperation), to negotiation, to the arrangement of interdependent activities.
Agent interactions differ in two respects from computational models like networking and shared computing: 1. agent-oriented interactions are conceptualized as taking place at the knowledge level, expressed in terms of which goals should be pursued, by whom, and at what time; 2. agents are flexible problem solvers, operating in an environment that is only partially observable and over which they have only partial control, so the interactions need to be handled in a similarly flexible manner.
Agents make use of computational models to make run-time decisions about the type and scope of their interactions, and also to initiate and respond to interactions that were not anticipated at the time the system was designed [5].
Fig. 16.4 Agent interactions: agent-level interactions and organizational interactions
Many complex and intelligent agents navigate the Internet, collect relevant data and process it, perform various tasks including data analysis and data communication, and make decisions on behalf of their users. The present generation of intelligent software agents can manage, organize, and communicate huge amounts of data on behalf of their users. For example, the agents in e-commerce applications can dynamically discover and compose e-services and mediate interactions. They can also serve as delegates to handle routine tasks, such as monitoring activities, setting up contracts, winning bids, executing business processes, and finding the best services [3].
Agents reside and execute in a conceptual and physical location called an agency (see Fig. 16.5), which provides facilities for locating and messaging mobile agents, as well as detached agents. The agency also provides facilities for collecting knowledge about agents. The core of the agency is an agent platform, with a component-model infrastructure, which provides the local services to the agents. The core also includes proxies to access remote services like security, agent management, communication, persistence, and naming. For mobile agents, the agent platform also provides agent transport. Some additional services provided by most agent systems come in the form of specialized agents that reside in some remote agency. There are some standard service agents, like a broker, an auctioneer, or a community maker, which augment the basic agent infrastructure. The agent platform, along with the service agents, monitors and controls the message exchange by detecting any violation of the rules while the agents engage in communication. The agent platform is thus the system's core, and in addition the component-model infrastructure empowers the agents with local services and proxy access to remote services.
An agent system comprises components with simple interfaces. The major part of the system's capability results from its loose coupling, which helps the agents to interact dynamically through the asynchronous exchange of messages. To communicate with each other, the agents must follow some common and well-defined communication language and protocol.
The benefits at the global level of an agent system are bound to improve if all agents cooperate. However, cooperation among the agents is difficult to realize, particularly in situations where the agents are self-interested. For example, if a number of agents are trying to get the same resource, say to download a specific file, the download speed is bound to decrease. Instead, if they work out a social decision that is mutually beneficial, it will be good for all of them. Therefore, designing mechanisms that promote cooperation among self-interested agents is important. In fact, several game-theoretic approaches have been found useful for the study of cooperation among agents; the Prisoner's Dilemma (PD) (see Chap. 11) is a well-known theoretical framework for this purpose. The PD is useful for understanding the role of local interactions in maintaining cooperation among the agents. It is based on the conflict of interest between what is best for the individual (i.e., defection) and what is best for the group (i.e., cooperation). This creates a situation of social dilemma. Therefore, specific mechanisms are required to evolve cooperation, to help the population overcome this dilemma.
There are three basic approaches to prevent social dilemmas and to promote and stabilize cooperation, as follows:
Coalition-based mechanism
The coalition-based approach is useful for establishing collaboration among agents, each being an individual with its own properties and objectives. These mechanisms use a tax model, through which the agents can achieve cooperation when coalitions are formed around some emerging leaders. To maintain the coalitions, the leaders charge some tax from their agents in exchange for some benefit (e.g., guaranteed cooperation, protection from cheaters, etc.). The concept of coalitions has been used in game theory for a long time, and it has proved useful in real-world economic scenarios. The dynamic coalition formation model considers a grid topology of agents for the cooperation between them, and makes use of the spatial prisoner's dilemma.
Coalitions facilitate cooperation between self-interested agents. In the first approach, the leader of a coalition is paid by the agents that are in the coalition. The coalition leader also imposes its decisions on the agents in the coalition to maximize cooperation. The decision making of each coalition is done in a centralized manner by a single entity, called the leading agent. The agents' cooperation with their coalition-mates also assumes some restrictions on the collaboration.
The coalition-based approach is a clear example of the known trade-off between the benefits and the costs of collaboration (e.g., taxes). This mechanism is called the dynamic coalition formation model, and also the tax model.
Partner switching mechanisms
In most real-world situations, the network topology changes frequently. There is empirical evidence, from games on dynamic topologies, that partner switching leads to cooperative behavior. A variant model of the prisoner's dilemma allows agents either to adjust their strategies or to switch away from their defecting partners, with the aim that partner switching may help stabilize cooperation.
Self-governing institutions
The resource allocation in the case of self-governing institutions is modeled on a network, based on a formal characterization of socio-economic principles. An agent should autonomously decide how to behave with respect to its coalition-mates and the agents outside its coalition. Although some mechanisms promote cooperation on different network topologies, these networks are static.
In dynamic coalition formation, an agent may join other agents only for a small amount of time. This time limitation is due to the fact that for n agents, the total number of possible coalition structures to be enumerated is of the order of O(n^n), which is too large unless n is a small number. For large n, the computation cannot be carried out in realistic time. Hence, it is necessary to make use of domain knowledge, together with mathematical games and some constraints, to solve the problem of coalition formation in an efficient way for a set of agents of any specified characteristics [8].
To form a dynamic coalition, decentralized procedures are required that allow self-interested agents to negotiate the formation of coalitions, as well as to divide the coalition payoffs. In real-world scenarios, the agents may turn out to be selfish and may focus on improving their own performance, but if they are cooperative, the performance of the whole system improves. Hence, the theory of non-cooperative games (i.e., games with selfish agents) is suitable to model the formation of coalitions and their dynamics. The prisoner's dilemma is a case of this category, since the prisoners are considered selfish (defecting is the dominant strategy) in this game. A variant of it, called the Iterated Prisoner's Dilemma (IPD) game, is widely used to model various social and economic phenomena, as well as the cooperation among agents. In the IPD, when the total number of rounds is random or unknown, sustained cooperation strategies are likely to emerge.
In addition, all the agents in a coalition behave like a unit, and together they decide how to behave with those in the coalition (called insiders) and those outside the coalition (called outsiders). Hence, the decision about how to treat insiders and outsiders is taken by the coalition as a whole.
Agents use certain trading strategies to trade among themselves, based on a model which can be called by the general name property ownership and trade (POT) model. The trading model is based on an extension of the Iterated Prisoner's Dilemma, in which the agents can cooperate or defect. The POT model comprises two types of players: 1. the possessors P, who own the resources, and 2. the traders T, who sell and buy resources.
The strategy of any agent pi ∈ P models the practice of ownership, but not trade. The behavior of pi depends on whether it owns a resource or not: if pi owns a resource, it acts as a defector; if it does not, it cooperates. This strategy is shown in Algorithm 16.2, where owns(pi, resource) indicates that the possessor pi owns some resource, defects(pi) means pi defects, and cooperates(pi) means pi cooperates.
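In outline, the possessor strategy is a one-line decision rule; the following Python sketch assumes the function and flag names, which are not from the text.

def possessor_move(owns_resource: bool) -> str:
    # Algorithm 16.2 in outline: a possessor defects when it owns
    # a resource and cooperates when it does not.
    return "defect" if owns_resource else "cooperate"

print(possessor_move(True), possessor_move(False))   # defect cooperate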
Algorithm 16.4 shows the basic strategy followed by the agents either to join a coalition or to leave one (changing to a new one). If an agent ai has the poorest payoff among all its neighboring agents after completion of the previous round of computations (line 1), then ai makes a new coalition with some agent aj (line 4) who is free and has the best payoff. If aj is already in some other coalition, then ai joins aj's coalition (line 6). This rule also enables any agent to change from one coalition to another, in case that agent receives poor payoffs in its former coalition [8].
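The join/leave rule can be sketched as follows; the data structures (the payoff and coalition dictionaries) are illustrative stand-ins for the state maintained by Algorithm 16.4.

def coalition_update(agent, neighbors, payoff, coalition):
    # Sketch of the join/leave rule: an agent with the poorest payoff
    # in its neighborhood pairs up with (or joins the coalition of)
    # its best-paid neighbor. `coalition` maps agent -> member set or None.
    if payoff[agent] > min(payoff[n] for n in neighbors):
        return coalition                     # not the poorest: no change
    best = max(neighbors, key=lambda n: payoff[n])
    if coalition.get(best) is None:          # the best neighbor is free
        coalition[agent] = coalition[best] = {agent, best}
    else:                                    # join its existing coalition
        coalition[best].add(agent)
        coalition[agent] = coalition[best]
    return coalition

pay = {"a1": 2, "a2": 9, "a3": 5}
print(coalition_update("a1", ["a2", "a3"], pay, {"a2": None, "a3": None}))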
In a dynamic network, the agents form coalitions to behave as a unit. An agent can belong to only one coalition at a time. All the agents belonging to a coalition are not required to be linked with each other, but they behave as a set and act together to maximize their performance. However, an agent must have at least one link to some agent belonging to its coalition; if that is not the case, it is an isolated agent, and hence it must be declared an independent agent (lines 9-10, Algorithm 16.4). This connection helps an agent to know its coalition's information and strategy, and to share and divide the gains. Note that if an agent changes a link, this does not imply that it changes its coalition: it simply rewires to change neighbors.
The agents that are in a coalition must agree on some specific behavior to play with the insiders (agents in the coalition) and also with the outsiders (agents outside the coalition). We assume a flat coalition, i.e., there is no leader or central authority to impose any policy. To decide the coalition's behavior in this situation, each agent votes for a strategy, either P or T (possessor or trader), to play with the insiders as well as with the outsiders (line 16, Algorithm 16.4).
To decide its vote, each agent uses a Learning Automaton (LA), trained from its trading history and payoffs (lines 12-15). The LA algorithm keeps two probability models: InProb and OutProb. The model [InProbT, InProbP] is used to assess the strategy to play against the insiders: InProbT is the probability of playing inside the coalition as a trader, and InProbP is the probability of playing inside the coalition as a possessor. In a similar way, [OutProbT, OutProbP] is used to assess the strategy to play against the outsiders.
Abstraction
Abstraction is the process of defining a simplified model of a system, such that only the necessary and important details or properties are emphasized, while all unnecessary details are suppressed.
Organization
The process of organization is concerned with identifying and managing the relationships between the various problem-solving components. Specifying and implementing organizational relationships helps to tackle complexity for two reasons: 1. it facilitates the grouping of a number of basic components, which are collectively treated as a unit at a higher level for the purpose of analysis; 2. by grouping the components as a unit, as well as specifying the relationships between them, a number of components may work together (cooperate) to provide a particular functionality.
Software agents were used much earlier for applications like filtering information, matching people having similar or identical interests, and automating repetitive behavior.
In the recent past, agents have found applications in e-commerce, conducting business-to-business, business-to-consumer, and consumer-to-consumer transactions. Consider an example of buying and selling, where a company willing to place an order for the procurement of stationery assigns to agents the task of monitoring the quantity and usage patterns of paper within the company, and launches the buying agents when the paper inventory is low. The buying agents would typically perform the following tasks, more or less in order [6]:
1. automatically collect information about the vendors and the required products which best fulfill the needs of the company,
2. evaluate the various offers from the vendors,
3. make a decision about which merchants and products require further investigation,
4. negotiate the terms of the transactions with these merchants, and finally
5. place the orders and make automated payments.
There are several descriptive theories and models that seek to capture buying behavior, e.g., the Nicosia model, the Howard-Sheth model, and the Engel-Blackwell model. All of these share six fundamental stages of the buying process:
1. Identification. The buyer can be motivated through product information, and hence becomes aware of some unmet need.
2. Product brokering. The buying process includes information retrieval (IR) to determine what to buy. This consists of evaluating product alternatives based on criteria provided by the buyer, the result of which is a set of products called the "consideration set."
3. Merchant brokering. This activity combines the "consideration set" with merchant-specific information to help the customer decide from where to buy the goods. This stage also comprises the evaluation of merchant alternatives based on buyer-provided criteria, typically price, warranty, delivery time, availability, and reputation, which are not necessarily in this order but vary from case to case.
4. Negotiation. This step considers how to settle on the terms of the transaction. Negotiations vary in duration and complexity, over price and other attributes.
5. Purchase and delivery. This step signals either the termination of the negotiation stage or occurs some time afterward.
6. Product service and evaluation. This involves post-purchase product service, customer service, and evaluation of the satisfaction of the overall buying experience.
In present-day online buying and selling, many of the processes and criteria discussed above are commonplace.
Continuously running personalized autonomous agents are well suited to mediate consumer behaviors such as information filtering, IR, personalized evaluations, time-based interactions, and complex coordination. Many agents perform constraint-based and collaborative filtering. Many online-shopping websites use rule-based techniques to personalize the product offerings for individual customers. Some websites use agents to apply data-mining techniques to discover patterns in customers' purchase behavior, exploit those patterns for sales, and also use them to help customers find other products that meet their true requirements.
The product alternatives are compared at the product brokering stage, whereas
the merchant alternatives are compared at the merchant brokering stage.
16.8 Modeling Agents as Decision Maker

For modeling agents as decision makers, it is necessary to have modeling methods that use formal notions of mental state to represent and reason about agents. The mental states may consist of mental attributes such as beliefs, knowledge, and preferences.
In multiagent systems, the success of one agent's actions and plans is governed by the actions of the other agents; mental level models can thus help in constructing plans that are likely to succeed. These models bring two informal properties: 1. They provide an abstract, implementation-independent way of representing agents, and 2. They are built using an intuitive approach, with attributes such as goals, beliefs, and intentions. The abstract nature of these models has the following practical implications [1]:
1. Structure. The structure holds the designer’s initial database of beliefs, goals,
intentions, which are manipulated by the agent.
2. Grounding. It is the base for the model construction process, which is essential
because we cannot directly observe the mental state of another agent.
3. Existence. Under what conditions will a model exist? The answer to this question is helpful in evaluating any proposal for mental level models. Therefore, it is necessary to know what assumptions are made, or what biases we introduce, when we model agents in this manner.
4. Choosing a model. How do we choose between different models that are consistent
with our data?
A model's mental level structure consists of three key components: beliefs, preferences, and decision criteria, which correspond, respectively, to the agent's perception of the world, its goals, and its method of choosing actions under uncertainty.
The agent's beliefs help in establishing which states of the world it considers plausible. As an example, the possible worlds of interest may be about weather conditions, rainy and non-rainy, and let the agent believe rainy to be plausible. The agent's preferences indicate how much it likes each outcome. The agent may have two possible actions: take an umbrella along to protect itself from rain, or do not take an umbrella along. The outcomes of these actions are shown in Table 16.1. The agent's preferences tell us how much significance it gives to these outcomes; we use real numbers to describe these values, such that larger numbers indicate better outcomes, as shown in Table 16.2 [1].
An agent chooses its action (take or do not take the umbrella) by applying its decision criteria to the outcomes of the different actions in the world. A commonly used decision criterion is maximin, where the action with the "best worst-case" outcome is chosen. For the outcome vectors (10, −4) and (−1, 8), the worst cases are −4 and −1, respectively. The best among the worst cases is −1, so the agent chooses the action "Do not take an umbrella", because the worst case −1 is better than the worst case −4.
Having gone through the above example, we are now in a position to consider grounding. We can view the problem of describing the mental state of an agent as a CSP (Constraint Satisfaction Problem): the state of the model should be such that it would have generated the observed behavior, and it must be consistent with the background knowledge.
In the above example, the background knowledge is the agent's preferences, given in Table 16.2, and the decision criterion is maximin. We observe that if the agent goes out without an umbrella, it believes that "no rain will come", for if it had other beliefs it would have taken a different action.
Once an agent's model is constructed, it can be used to predict its future behavior. To give this a formal shape, we consider that an agent $\mathcal{A}$ is described as a state machine, with a set of possible local states $L_A$, a set of possible actions $A_A$, and a program, which we call its protocol $P_A$. Thus an agent is a tuple

$\mathcal{A} = \langle L_A, A_A, P_A \rangle$    (16.1)
As an example, consider a thermostat agent with two local states: '−' for the state when the thermostat indicates that the temperature is less than the desired room temperature, and '+' for temperature greater than or equal to the desired room temperature. The thermostat's protocol is given below.

State:  −        +
Action: Turn-on  Turn-off

The thermostat's actions are modeled as $A_A = \{turn\text{-}on, turn\text{-}off\}$, and the environment's states are $L_E = \{cold, hot, ok\}$. For the sake of simplicity, we assume that the set of possible worlds is $L_A \times L_E$, displayed in the heading row of Table 16.3.
Given the set of possible worlds W, we can associate with each local state l of the agent (the thermostat) a subset W(l) ⊆ W, comprising all worlds in which the local state of the agent is l.
In the above example, the effects of an action on the environment do not affect the state of the thermostat. In addition, the static one-shot model makes some simplifying assumptions. The first assumption is that the room temperature is affected only by the thermostat and not by external influences. The second assumption is that the thermostat's actions do not affect its own state.
It may be noted that, while the thermostat knows its local state, it knows nothing about the room temperature. Consequently, we made all pairs of $L_A \times L_E$ possible worlds, including (−, hot), in which the thermostat's local state indicates low temperature while the environment's state is hot. On the one hand, this simplifies the system by admitting all possible worlds; on the other hand, it is a blessing, since it covers the situation where there may be a measurement error in the thermostat [1].
Definition 16.1 (Belief) A belief assignment function B may be defined as $B : L_A \to (2^S - \{\emptyset\})$, so that for all $l \in L_A$ we have $B(l) \subseteq W(l)$. The value B(l) is referred to as the set of worlds plausible at l.
16.8.3 Preferences
Beliefs make sense as part of a more detailed description of the agent's mental state, which has further associated aspects. One such aspect is the agent's preference order over the possible worlds, which can be taken as the agent's desires. Various assumptions can be made about the structure of the agent's preferences; a common one is a total order on the set of possible worlds W. However, in some cases we may need a richer algebraic structure, e.g., one in which addition is defined. We use a value function to represent the agent's preferences.

Definition 16.2 (Value function) A value function is a function $u : S \to \mathbb{R}$.

This numeric representation of the agent's preferences is most convenient: a state $s_1$ is at least as preferred as a state $s_2$ iff $u(s_1) \geq u(s_2)$.
Considering the example of the thermostat (as agent), the goal of the agent is to make the room temperature ok. Thus, the thermostat prefers any global state in which the environment's state is ok over any global state in which it is cold or hot, and is indifferent between cold and hot. It is also indifferent between states where the environment's states are identical, i.e., (+, ok) and (−, ok). This preference order over possible worlds can be represented by a value function that assigns zero to a global state in which the environment's state is either hot or cold, and assigns 1 where the environment's state is ok. This is represented in Table 16.4, where ∗ stands for either − or +.
If the exact state of the world were known to the thermostat, it would have no trouble selecting the proper action based on the value of its outcome; for instance, if the state is cold, the Turn-on action leads to the best outcome. However, when there is uncertainty, the thermostat must compare vectors of plausible outcomes instead of single outcomes. For example, for the belief assignment B(l) = {cold, ok}, the plausible outcome of the action Turn-on is (1, 0), and that of the action Turn-off is (0, 1).
Given the transition function τ, the belief assignment B, and an arbitrary, fixed enumeration of the elements of B(l), the plausible outcome of a protocol P in l is a tuple whose kth element is the value of the state generated by applying P starting at the kth state of B(l).
Table 16.4 Global outcome preferences for an agent (∗ stands for either − or +)

Action (↓) \ Worlds (→)   (∗, cold)   (∗, hot)   (∗, ok)
Turn-on                    1           0          0
Turn-off                   0           1          1
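The plausible-outcome computation can be made concrete with a small Python sketch. It is illustrative only: the transition function below is our assumption, chosen so that the produced values match Table 16.4 (in particular, turn-on in an ok room is assumed to overheat it).

# Sketch of the thermostat agent: protocol, value function, and the
# plausible outcomes of an action under a belief assignment B(l).
protocol = {"-": "turn-on", "+": "turn-off"}     # P_A
value = {"cold": 0, "hot": 0, "ok": 1}           # u over environment states

def next_env(env, action):
    # Assumed transition function tau, chosen to reproduce Table 16.4.
    if action == "turn-on" and env == "cold":
        return "ok"
    if action == "turn-on" and env == "ok":
        return "hot"                              # assumption: overheating
    if action == "turn-off" and env == "hot":
        return "ok"
    return env                                    # otherwise unchanged

def plausible_outcomes(action, belief):
    """Tuple of values of the states reached from each plausible world."""
    return tuple(value[next_env(env, action)] for env in belief)

B = ("cold", "ok")                                # belief assignment B(l)
print(plausible_outcomes("turn-on", B))           # (1, 0)
print(plausible_outcomes("turn-off", B))          # (0, 1)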
Single values can be compared easily; however, it is not immediately clear how to compare plausible-outcome vectors, and hence how to choose among protocols. A strategy for making choices under uncertainty is required, which depends on the agent's attitude toward risk. The strategy is represented by a decision criterion: a function that takes a set of plausible outcomes and returns the set of most preferred ones among them. For this we make use of the maximin criterion discussed earlier, reproducing the umbrella outcomes as Table 16.5.
Note that when both worlds are plausible, the two plausible outcomes are (10, −4) and (−1, 8). If the maximin criterion is used, the action "do not take umbrella" is the most preferred one, since its worst-case value −1 is better than −4. Under the principle of indifference, which compares the averages of the plausible outcomes (3 versus 3.5), the same action is preferred.
Accordingly, decision criteria can be defined as follows:
Definition 16.3 (Decision criteria) A decision criterion is a function

$\rho : \bigcup_{n \in \mathbb{N}} 2^{\mathbb{R}^n} \to \bigcup_{n \in \mathbb{N}} \left( 2^{\mathbb{R}^n} - \{\emptyset\} \right)$    (16.3)
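As a concrete illustration of Definition 16.3 (a sketch under our own naming, not code from the text), the maximin criterion and the principle of indifference can be applied to the umbrella outcome vectors as follows.

# Two decision criteria over plausible-outcome vectors.
def maximin(outcomes):
    """outcomes: dict action -> tuple of values, one per plausible world.
    Returns the actions whose worst-case value is largest."""
    worst = {a: min(v) for a, v in outcomes.items()}
    best = max(worst.values())
    return [a for a, w in worst.items() if w == best]

def indifference(outcomes):
    """Principle of indifference: compare averages over plausible worlds."""
    avg = {a: sum(v) / len(v) for a, v in outcomes.items()}
    best = max(avg.values())
    return [a for a, m in avg.items() if m == best]

outcomes = {"take umbrella": (10, -4), "do not take umbrella": (-1, 8)}
print(maximin(outcomes))        # ['do not take umbrella']: -1 beats -4
print(indifference(outcomes))   # ['do not take umbrella']: 3.5 beats 3.0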
16.9 Agent Communication Languages

Agents working together, irrespective of whether they are cooperating or competing, constitute a multiagent system. These systems provide a higher level of abstraction than traditional distributed computing. The abstractions are closer to the users' expectations, and allow the designers higher flexibility in determining the behavior.
For example, instead of hard-wiring a specific behavior into the agents, multiagent
system designers design the agents with the capability to negotiate among themselves and find the best course of action for a given situation. ACLs (Agent Communication Languages) must be flexible enough to accommodate abstractions such as negotiation. But the same flexibility makes it harder to understand their semantics [9].
For this reason, we must examine many elements to arrive at the meaning of a communication, including the type of meaning, perspective, basis (semantics or pragmatics), context, and coverage, i.e., the number of communicative actions included.
The formal study of languages comprises three parts: 1. Syntax, which is concerned with organizing the symbols to create the structure of the language's sentences, 2. Semantics, which deals with what sense is denoted by the sentences and their parts, and 3. Pragmatics, which is concerned with how the sentences are interpreted and used. The combined meaning of a sentence arises from its semantics and pragmatics. Pragmatics includes considerations external to the language itself, like the state of the agents and the environment in which the text exists. Therefore, pragmatics can restrict how the agents relate to one another and how they process the messages they send or receive. When agents are not fully cooperative, or cannot work out the implications, they cannot meet the pragmatic requirements.
Semantics versus pragmatics
A perspective can be combined with a type of meaning, either personal or conventional. In the personal case, the meaning of a communication is based on the intention of the sender and the interpretation of the receiver. The action "purge this file" is taken by the receiver as a directive, whereas "This is an old file" is taken as an assertion. In Fig. 16.6, the inform construct gives information to the receiving agent, the request construct requests for rain, and so on. In conventional meaning, the meaning of communication actions is based on usage conventions; a language is nothing but a system of conventions. By violating these conventions, the traditional approaches go against the wisdom of having different labels for communicative actions. The language KQML (Knowledge Query and Manipulation Language) has all acts as variants of tell, whereas for the communication language Arcol, it is inform.
Context
In general, we cannot understand a communication without context. For agents, the context is the agent's physical or simulated environment. The social context for agents is not as subtle as for humans, but they must still understand what an agent expects from others.
Coverage of communicative acts
When heterogeneous autonomous agents exchange information, the meaning of the
exchange is decided by communicative actions. All these actions fall into one of the
following categories:
• Assertive. This action is to inform. For example, "The door is shut."
• Directive. This is for a request, for example, "Shut the door". It can also be used for a query, e.g., "Can I shut the door?"
• Commissive. To promise something, e.g., "I will shut the door."
• Prohibitive. It bans something. For example, "Please do not shut the door."
• Declarative. It causes events in itself. For example, "This information is redundant."
• Expressive. To express emotions and evaluations, e.g., "I wish that hurricane will stop."
Communicative actions can be represented in stylized forms like "I hereby request …" or "I hereby declare …". The grammatical form emphasizes that through language one not only makes statements but performs actions; acting by speaking becomes the essence of communication. Figure 16.6 shows that all primitives of this agent language are assertive or directive. In the agent language Arcol, one can simulate commissives using other acts. All the acts can be reduced to the category of assertives, but these categories then have only restricted meanings. For example, a request in the Arcol language is the same as conveying to the receiver that the sender intends for it to perform the action.
Considering the code given in Fig. 16.6 for the agent Avi, each communicative act poses a challenge for a language that promotes mental agency. The traditional approaches ignore whether Bob really has the capability to cause rain when it is requested or allowed to do so, or whether it can stop the rain when it is prohibited from causing it. Similarly for whether Avi can make it rain when he promises, or whether Avi has the authority to permit or prohibit any of Bob's actions or to name weather conditions.
Finally, the ACL approaches conclude that if Avi's designer wants it to comply, then it does. This is quite unsatisfactory, because it means that agents cannot reason about their own limitations.
A platform that supports the creation and deployment of multiple software agents must have the capability to interoperate with a wide variety of custom-made as well as legacy software.
[Fig. 16.7: Architecture of an agent built around legacy data: incoming and outgoing messages pass through a security layer, and the agent uses meta-knowledge, an action base, action and security constraints, an action policy, and function calls to the underlying legacy data.]
Each body of software code (package) can be abstracted as a pair

$\mathcal{S} = (T_S, F_S)$    (16.4)

where,
– $T_S$ is the set of all data types manipulated by the program, closed under subtypes, i.e., if τ is a subtype of a type in $T_S$, then τ ∈ $T_S$, and,
– $F_S$ is the set of all pre-defined functions of the package that are provided by its API.
For example, for a relational database package, $T_S$ consists of attribute domains, tuples (elements of Cartesian products of domains), and relations on tuples, whereas $F_S$ is the set of all functions, i.e., all relational operations: select, project, Cartesian product, join, union, etc.
At any given point of time t, the state of an agent refers to a set $O_S(t)$ of objects of the types in $T_S$, managed by its internal software code. An agent may change its state by taking an action, which may be triggered internally or by processing a message received from another agent. An agent cannot directly change the state of another agent, but can do so indirectly by issuing a request message to that agent.
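The following Python sketch illustrates this agentization of a package $\mathcal{S} = (T_S, F_S)$; it is a minimal illustration under our own naming, not the API of an actual agent platform.

# An agentized package: the agent's state O_S(t) is a set of objects,
# changed only by the agent's own actions, which may be triggered by
# request messages from other agents.
class PackageAgent:
    def __init__(self, functions):
        self.functions = functions   # F_S: name -> callable (package API)
        self.objects = set()         # O_S(t): the current state

    def act(self, fname, *args):
        """Apply a package function from F_S to the current state."""
        self.objects, result = self.functions[fname](self.objects, *args)
        return result

    def on_message(self, msg):
        # Another agent cannot change our state directly; it sends a
        # request, which we may service by taking an action ourselves.
        if msg["type"] == "request":
            return self.act(msg["function"], *msg.get("args", ()))

# A tiny F_S for a set-valued store: insert and select.
fs = {"insert": lambda objs, x: (objs | {x}, x),
      "select": lambda objs, p: (objs, {o for o in objs if p(o)})}
agent = PackageAgent(fs)
agent.on_message({"type": "request", "function": "insert", "args": (42,)})
print(agent.act("select", lambda o: o > 10))    # {42}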
Semantic wrapper
A semantic wrapper contains a large collection of semantic information; its typical contents are indicated in Fig. 16.7.
Because the language Q is suited to interactions, it is used for scenario writing. Examples of primitives for interactions are cues and actions. A cue is an event that triggers interaction, while actions are requests to an agent that cause a change in the environment. Unlike conventional programming languages, the language Q does not define the semantics of cues and actions: since different agents execute cues and actions in different ways, their semantics depend on the corresponding agents. Example 16.2 demonstrates cues (preceded by a question mark) and actions (preceded by an exclamation mark).
Example 16.2 Cues and Actions.

(?hear "Hello" :from Tom)
(!walk :from class_room :to library)
(!speak "Hello" :to Tom)
(?see library :direction north)

In the above example, the following cues and actions appear:
1. The agent waits for Tom to say hello (?hear),
2. The agent walks from the classroom to the library (!walk),
3. The agent says hello to Tom (!speak), and
4. The agent observes whether the library is seen to the north (?see).
The above are synchronous actions: each one starts on completion of the previous. Asynchronous actions allow overlapped execution; in Example 16.2, walk can be an asynchronous action, since just as humans can walk and speak at the same time, so can agents. To mark an action for execution in asynchronous mode, we precede it with a double exclamation mark (!!), e.g., !!walk. With this, the agent may say hello to Tom just after it has begun the walk.
Example 16.3 Guarded commands.
(guard
  ((?hear "Hello" :from Tom)
   (!speak "Hello" :to Tom) ... )
  ((?see library :direction north)
   (!walk :from class_room :to library) ... )
  (otherwise
   (!send "I am still waiting" :to Dickens) ... ))
Just like common programming languages, the interaction language Q has commands for conditional branching and recursive calls. Apart from these, it has commands, called guarded commands, for situations that require observing multiple cues at the same time. A guarded command combines cues, actions, and forms. When one of the cues becomes true, the guarded command evaluates the corresponding form. If no cue is satisfied, it evaluates the otherwise clause, as shown in Example 16.3.
In the above code, if a cue is observed, e.g., "agent hears hello from Tom," the corresponding form is performed, i.e., the agent replies "hello" to Tom. If no cue is observed by the guard command, it performs the otherwise clause, and the agent sends the message "I am still waiting" to Dickens.
A collection of state transitions in the language Q constitutes a scenario. A scenario defines each state as a guarded command, and can include conditions. A program writer can draft scenarios in the form of simple state transitions, which can describe fairly complex tasks. Scenarios can be invoked recursively.
16.10 Mobile Agents

Not all agents are mobile agents. An agent sitting at a far-off place can communicate with its environment through the older mechanisms of RPC (remote procedure call) and messaging. Such agents are called stationary agents, and they execute only on the system on which they begin execution. If such agents need information that is not available on their own systems, or need to interact with an agent (program) residing on another system, they usually make use of a communication mechanism such as RPC or messaging.
However, a mobile agent is not bound to the system on which it begins execution; it is free to travel to other hosts in the network. Once created in one execution environment, the mobile agent can transport its state (including data and other information) and its code to another execution environment in the network and, once everything is delivered there, resume execution. The mobile agent is designed with the special ability to transport itself from one system to another in the same network. This ability allows it to move to the system containing an object with which it wants to interact, and then take advantage of being on the same host or network as that object. There are a number of benefits of using mobile agents rather than doing the same job by remote procedure calls and messaging. Some of the advantages are as follows.
Reduction in network traffic
Distributed systems need communication involving interactions with multiple destinations to perform a given task, which results in heavy traffic in the network. Mobile agents permit us to package a conversation and dispatch it to a destination host, where the interactions take place locally. This eliminates the flow of raw data in the network: when a large quantity of data is stored at a remote host, that data is processed locally by the transported mobile agent rather than transferred over the network. That is, the computation is moved to the data rather than the data to the computation.
Overcome network latency
Critical real-time systems, such as nuclear reactors and robots in manufacturing processes, are required to respond in real time to changes in their environments. Controlling such systems through a factory network introduces significant latency, for many reasons, such as the network being busy. For critical real-time systems, such latency is not tolerable. Mobile agents offer a solution by moving the required program and state to the place where they are needed.
Encapsulate protocols
When data are exchanged in a distributed environment, each host owns the code that implements the protocols. However, when protocols change to add new features, it becomes difficult to upgrade the protocol code everywhere. Since mobile agents can move to remote hosts to establish "channels" carrying their own protocol code, this problem does not occur.
Execute asynchronously and autonomously
Mobile devices often have to rely on expensive and fragile network connections. Tasks that require a continuously active connection between a mobile device and a fixed network are neither economically nor technically feasible. To solve this, tasks are dispatched into the network in the form of mobile agents, which can operate at their ease, move wherever CPU resources and memory are abundant, and operate asynchronously and autonomously. The device can reconnect at a later time to collect the agent back.
Adapt dynamically
Mobile agents have the capability to sense their execution environment and react autonomously to changes. A mobile multiagent system can distribute its agents geographically among the hosts in the network to perform a required task.
Robust and fault-tolerant
The mobile agents' ability to react dynamically to unfavorable situations makes it easier to build robust and fault-tolerant distributed systems. For example, if a host is being shut down, all agents executing on that machine are warned and given time to migrate and continue their operation on another host in the network.
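The essence of migration, carrying state and code to the data and resuming there, can be sketched in a few lines of Python. This is a toy illustration (serialization via pickle, transport abstracted as a function), not the API of any real mobile-agent platform.

# Toy migration: serialize the agent, "ship" it, and resume remotely.
import pickle

class MobileAgent:
    def __init__(self, wanted):
        self.wanted = wanted          # state carried along
        self.results = []

    def run(self, local_data):
        # Executes at the data, instead of pulling raw data over the net.
        self.results = [x for x in local_data if x in self.wanted]
        return self.results

def dispatch(agent, transport):
    """Serialize the agent (state plus class reference) and send it."""
    transport(pickle.dumps(agent))

def receive_and_run(blob, local_data):
    agent = pickle.loads(blob)        # re-create the agent on this host
    return agent.run(local_data)

sent = []                             # a loopback "network" for the demo
dispatch(MobileAgent({2, 3}), sent.append)
print(receive_and_run(sent[0], [1, 2, 3, 4]))   # [2, 3]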
16.11 Social Laws for Multiagent Systems

In an artificial social system, the society adopts a set of social laws; each agent is required to obey these laws and may assume that all other agents also follow them. These laws, on the one hand, constrain the plans available to agents and, on the other hand, guarantee certain behaviors on the part of other agents. A social law may include a communication protocol that leads to rational deals among multiple agents. The protocol may also include rationality constraints for cooperation. The social laws also include rules like those that exist for humans in driving, left-drive or right-drive as may be prevailing [7].
The idea of traffic rules for mobile robots highlights the important aspects of the artificial social system approach. Rather than having a centralized controller, or robots continuously negotiating in order to avoid collisions, the better approach is for each robot to follow a traffic rule such as "always keep to the left of the road".
To consider the applicability of social laws, assume that there is a multi-robot network, and we seek laws for agent mobilization in such a system. In such a network, coordination among the multiple agents (i.e., robots) is assumed. It is formally defined as follows.

Definition 16.4 (Multi-robot network) A multi-robot network consists of a graph G = (V, E), a set R of robots, and a strictly positive length function $\lambda : E \to \mathbb{R}^+$, such that λ associates with each edge (u, v) of G the distance a robot needs to travel to go from u to v. We assume that there exists a clock such that, at each point on the time scale, a robot is at some node or at some point between nodes.
The action of a robot (agent) consists of its direction and velocity; the velocity is the number of distance units it covers in a unit of time. The direction and velocity are decided by the robot when it is at a node of the graph. A robot can also observe other robots. The robots need to meet goals which arrive dynamically, and the goals must be met without collisions with other robots. A collision takes place if two robots are at the same node at the same point of time, or at the same point on an edge at the same point of time. Based on these facts, it is possible to define some social laws.
Definition 16.5 (Social law for robot movement) Given a graph G = (V, E), a social law for robot movement determines a subset A ⊆ E of edges in which robots are allowed to move, restricts the direction of movement along each edge of A, and restricts the velocity at which robots are allowed to move along each edge e ∈ A.
In the above definition, the social law is a traffic law, which should guarantee that each robot will be able to achieve its goals (say, reaching a destination node) without any collision with other robots, irrespective of what the other robots do.
Given a multi-robot system, a useful social law is one that guarantees a collision-free system, provided the robots enter the graph G at arbitrary nodes with an offset of at least one unit of time from each other and obey the social law. Such a system also guarantees that all the robots will eventually reach their targets.
Modeling social actions
The design of social laws can be reduced to the problem of finding a routing in a graph. For this purpose, consider a simple graph G = (V, E) (no self-loops and no parallel edges) that has no cut-vertex and at least three vertices. A graph with no cut-vertex is called a block.
Let N be the set of positive integers, and f : V → N a labeling of the vertices of G. For U ⊆ V, let min f(U) and max f(U) be, respectively, the smallest and largest labels of vertices in U. An f-minimal vertex of U is any u ∈ U for which f(u) = min f(U), and similarly for f-maximal.
The labeling f induces a directed graph $G_f = (V, A)$ on the same vertex set as G, whose arc set A ⊆ E is obtained from E by removing each edge (u, v) ∈ E with f(u) = f(v), and orienting each remaining edge (u, v) ∈ E with f(u) < f(v) in the direction (v, u) if v is f-maximal and u is f-minimal in the block containing (u, v), and in the direction (u, v) otherwise.
Given this setting, we have the following definition of a routing of a graph.

Definition 16.6 (Routing of a graph) A routing of a graph G = (V, E) is a labeling f : V → N of its vertices for which the induced graph $G_f$ is strongly connected. A routing under which there is a unique f-minimal vertex r ∈ V is called rooted, and r is called the root; assigning the root is called the rooting process.
The routing of a graph can serve as the basis for useful social laws in multi-robot systems. For this, the robots are required to enter the graph network from an f-minimal vertex, with an offset of one or more time units from one another. Additionally, the robots are required to move only along the arcs of the graph $G_f$ induced by the routing f. A velocity function $v : A \to \mathbb{R}$ is defined as follows: for e = (u, v) ∈ A, put v(e) = λ(e)/D if u is f-maximal and v is f-minimal, and put v(e) = λ(e)/(f(v) − f(u)) otherwise. Robots are required to move at these calculated velocities.
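A small Python sketch of this construction follows. It simplifies the per-block condition to the global f-minimal and f-maximal vertices (sufficient when the whole graph is a single block), and leaves the constant D as a parameter, since its definition is not reproduced in this excerpt.

# Orient the edges of a block by a routing labeling f, and assign the
# velocities prescribed by the social law.
def induced_arcs(edges, f, f_min, f_max):
    """Orient each edge (u, v) with f(u) < f(v) as (u, v), except the
    edge joining the f-maximal and f-minimal vertices, which is turned
    around to close the cycle; edges with f(u) == f(v) are dropped."""
    arcs = []
    for u, v in edges:
        if f[u] == f[v]:
            continue                  # removed from A
        if f[u] > f[v]:
            u, v = v, u               # ensure f(u) < f(v)
        if f[u] == f_min and f[v] == f_max:
            arcs.append((v, u))       # the back arc: maximal -> minimal
        else:
            arcs.append((u, v))
    return arcs

def velocity(u, v, f, length, f_min, f_max, D):
    """v(e) for arc (u, v) of length lambda(e)."""
    if f[u] == f_max and f[v] == f_min:
        return length / D             # the back arc
    return length / (f[v] - f[u])

edges = [(1, 2), (2, 3), (1, 3)]      # a triangle (a block)
f = {1: 1, 2: 2, 3: 3}
print(induced_arcs(edges, f, 1, 3))   # [(1, 2), (2, 3), (3, 1)]: a cycle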
16.12 Summary
Software agents navigate the Internet to collect relevant data, perform tasks, and make decisions autonomously. They can transfer enormous amounts of data on behalf of their users. Some agent systems also include standard service agents, such as a broker, auctioneer, or community maker.
Cooperation among multiple agents is difficult when agents are self-interested: if, say, everyone tries to download the same file at the same time, the speed comes down for all. Thus, agents need to communicate and cooperate, and cooperative agents should avoid prisoner's-dilemma situations. Among cooperative agents, coalitions need to be formed. There are two approaches for this: (1) the optimization-based approach, which finds an optimal coalition, and (2) the game-theoretic approach, which has applications in many real-world domains.
When agents are constantly changing coalitions, there is a need for the formation of dynamic coalitions. The number of possible coalitions turns out to be of the order of $O(n^n)$ for n agents; hence, the number of agents should be kept small. To study the coalition phenomenon, the agents are represented by the nodes of a graph, with edges indicating coalition links. The agents which are ready to provide resources are taken as sellers, and those receiving them as buyers. This becomes a structure on which to design a coalition algorithm.
The agents approach can also be applied to software engineering, where agents are treated as next-generation components; this is called agent-based software engineering. The complexity issues in software engineering can be tackled through decomposition, abstraction, and organization. Agents can be assigned the tasks of buying and selling. Software agents are used for filtering information, matching people with similar interests, and automating repetitive behavior.
There are theories to model buying agents, which share the six fundamental stages of the buying process: need identification, product brokering, merchant brokering, negotiation, purchase and delivery, and product service and evaluation. For modeling agents as decision makers, formal notions of mental attributes are used, such as beliefs, knowledge, and preferences; accordingly, the modeling is called mental level modeling.
When agents function together in cooperative or competitive mode, the multiagent system must provide the abstractions. Instead of being given one specific behavior, agents are designed to be flexible, and can be programmed using agent communication languages. These languages have syntax, semantics, and pragmatics.
Not all agents are mobile; a stationary agent sits at one place and communicates with its environment, for example, through remote procedure calls or messaging. A mobile agent, by contrast, can partly execute on one system, move to another along with its data, and continue executing the remaining part there. Mobile agents reduce network traffic, overcome network latency, encapsulate protocols, execute asynchronously and autonomously, adapt dynamically, and have features of robustness and fault tolerance.
A society adopts a set of social laws; each agent is required to obey these laws and may assume that all other agents also follow them. These laws, on the one hand, constrain the plans available to agents and, on the other hand, guarantee certain behaviors on the part of other agents.
Exercises
1. Label the following as an agent or not an agent. Explain your reasoning with
justification for each.
2. In a multiagent system, agents interact with the environment. How can you model, as a dynamic system, a situation where one agent modifies the environment and another perceives it?
3. How is the architecture of a computer system different from that of an agent system? Give the salient differences, and justify their significance.
4. A rat searches for food, and at the same time it has to save itself from its predators; expecting any such predator, it either runs away or hides. Can a single-agent system model of a rat succeed in protecting itself from predators as well as in searching for food?
5. Explain the coordination and coalition functions between agents. How do they differ from each other?
6. Write the coalition algorithm in your own language.
7. Give an example evidencing the prevailing use of agents in online buying from online stores.
8. Give a brief note on agent communication languages, and compare them with other high-level languages.
References
1. Brafman RI, Tennenholtz M (1997) Modeling agents as qualitative decision makers. Artif Intell
94:217–268
2. Eiter T et al (1999) Heterogeneous active agents, I: semantics. Artif Intell 108:179–255
3. Griss ML, Pour G (2001) Accelerating development with agent components. Computer 34(5):37–43
4. Ishida T (2002) Q: a scenario description language for interactive agents. Computer 35(11):42–47
5. Jennings NR (2000) On agent-based software engineering. Artif Intell 117:277–296
6. Maes P et al (1999) Agents that buy and sell. Commun ACM 42(3):81–91
7. Onn S, Tennenholtz M (1997) Determination of social laws for multi-agent mobilization. Artif
Intell 95:155–167
8. Peleteiro A (2014) Fostering cooperation through dynamic coalition formation and partner
switching. ACM Trans Auton Adapt Syst 9:1:1–1:31
9. Singh MP (1998) Agent communication languages: rethinking the principles. Computer 31(12):40–47
10. Stone P, Veloso M (1997) Multiagent systems: a survey from a machine learning perspective. Technical Report CMU-CS-97-193, Carnegie Mellon University
Chapter 17
Data Mining
17.1 Introduction
As Information Technology (IT) has progressed, there has been an abundant increase in the volumes of data collected, of all varieties. It is beyond the capabilities of humans to extract meaningful information from such vast amounts of data, and it has therefore become necessary to develop algorithms which can extract meaningful information from these vast stores of data. Searching for useful chunks of information in huge amounts of data is the field known as data mining. Data mining can be applied to data in all varieties of formats, including relational, transactional, and spatial databases, as well as large stores of unstructured data such as the World Wide Web.
The amount of data stored in digital form worldwide has on average doubled every nine months over many years, which is twice the rate of increase of computing power predicted by Moore's law. This doubling of stored information, called the storage law, is one of the motivations for data mining. Irrespective of whether this increasing volume of data will support exploration in commercial or scientific activity, the data is potentially valuable information [1].
Many established organizations have, over the years, accumulated large volumes of data about their employees, suppliers, customers, products, and services. To perform data mining, also known as knowledge discovery in databases, organizations make use of tools that sift through this vast amount of data to find trends, patterns, and correlations among the data sets, which can guide strategic decision-making.
The traditional algorithms for data analysis assumed input data sets with a limited number of records, for which the available main memory was sufficient. Current databases, however, are much too large to fit in the main memory of most computers. In addition, to be efficient, data mining techniques applied to very large databases must be highly scalable. An algorithm is considered scalable if, given a fixed size of main memory, its runtime increases linearly with the number of records in the input database.
Data mining is concerned with the identification of patterns and important structures in data; these structures may represent patterns, statistical or predictive models of the data, or relationships among parts of the data. In the context of data mining, terms like patterns, models, and relationships have definite meanings. A pattern is a compact summary of a subset of the data (such as: people who own a racing car are likely to participate in car races). A model can describe an entire data set, and it can be predictive: it can be used to predict future customer behavior (such as whether the customer may buy such-and-such a product), based on the historical data of interaction with some employees of the company. It can also be a joint probability distribution over some variables.
An algorithm can enumerate a large number of data patterns over a finite database, but it is important to find out what structure is suitable for a particular data set. Identifying interesting structures and useful data patterns among the large number of possibilities is the job of a good data mining algorithm, and it must do so fast, even for large data sets. Consider the sales transactions in the database of an online store, where some items' transactions are more frequent than others, e.g., smartphones. The variable values frequently occurring together in such databases could be used to answer, say, which items were bought together from the same store. Such an algorithm could also discover patterns in the database in terms of demography, with very good confidence.
Data Mining is a Science
The concept of unsupervised learning, that is, learning from basic facts or axioms, has long remained a curiosity. In the present scenario, knowledge discovery engines are commonly used to extract general inferences from facts or from training data. Using a more structured approach, statistical methods attempt to quantify vast amounts of data by known and intuitively understood models. This approach of problem solution by assembling knowledge from existing data sources is a radical shift from traditional approaches to problem solution.
The nature of typical data sets, in respect of their size, diversity, high dimensionality, distributed nature, and noisy contents, makes it challenging to give a formal specification of any problem. This lack of control gives scope for solutions that over-fit or have limited coverage, with missing or incorrect data coupled with high dimensionality. Once such a problem is specified, the solution techniques deal with the presentation and scalability of the problem, and the complexity of the solution. This complete process through which data mining makes its transitions is a science, called data science.
Due to the emergence of the WWW as a large distributed data repository, and the availability of large online databases that can be tapped for significant commercial gain, the interest of researchers and commercial organizations in data mining techniques has gone up exponentially. Many core mining techniques have been explored by researchers in the major domains, like classification, clustering, rule association, and time-series analysis, and much of the work on data mining has focused on scaling data mining algorithms to very large data sets. In the following sections of this chapter, we shall discuss algorithms concerning three classical data mining problems: market-basket analysis, clustering, and classification.
Learning Outcomes of this Chapter:
1. Compare and contrast different uses of data mining as evidenced in both research
and application. [Assessment]
2. Explain the value of finding associations in market-basket data. [Familiarity]
3. Characterize the kinds of patterns that can be discovered by association rule
mining. [Assessment]
4. Describe how to extend a relational system to find patterns using association rules.
[Familiarity]
5. Evaluate different methodologies for effective application of data mining. [Assess-
ment]
6. Identify and characterize sources of noise and redundancy in presented data.
[Assessment]
17.2 Perspectives of Data Mining

Induction
Induction is the most common perspective. It is based on the principle of proceeding from the specific to the general, and answers questions like: given ten specific examples of good tourist destinations, what are the characteristics of a favorite tourist attraction? Induction is typically implemented as a search through a space of possible hypotheses. Such searches usually employ some special characteristic or aspect to arrive at a good generalization, like "sand dunes are favorites."
Compression
Many a time, one set of data may correspond to a number of general concepts. In such cases, mining techniques typically look for the most easily described pattern. This principle is called Occam's Razor, and it effectively equates mining with compression: it becomes possible to describe the original data more compactly using the learned patterns, rather than exhaustively enumerating the original data itself, which requires much more space. This view is given a solid base by (1) models such as the Minimum Description Length (MDL) principle, and (2) computational learning theory. Most data mining systems use one of these views of compression to establish the effectiveness of mined patterns. For example, if a data set has ten items and the pattern mined from it turns out to be 20 features long, then this mining is not good at compression.
Querying
The query perspective comes from databases, because most business data reside in industrial and commercial databases and warehouses. Commercial database experts take data mining as a form of query. For mining the data (which often is in the form of text), it is often required to enhance the expressiveness of the query, as in "find all the customers with similar transactions." The other perspective of querying is to find a suitable database model, in place of the relational one, for data mining.
Approximation
This view of mining has the objective of finding an accurate model of the data, and then introducing some deliberate approximation into it to find hidden structures in the data. One technique that has been found useful for this is latent semantic indexing, which is useful in document retrieval. This technique makes use of transformations based on linear algebra and approximations of matrices to locate hidden structures of word usage, which means going beyond simple keyword search.
Search
Search is related to induction, but its focus is on efficiency; it uses forward-pruning of patterns, e.g., frequent itemsets, to restrict the overall pattern space.
Beyond what has been discussed above, there are many other approaches to clas-
sify the task of data mining, which fall in various categories:
• based on the data they operate on, e.g., discrete data, labeled data, continuous data,
and time-series data;
• based on application domains, e.g., finance models, economic models, web-log
mining, and semi-structured models;
• based on their induced representations, e.g., association rules, decision trees, and
correlations.
The field of data mining aims to explore very large data sets efficiently, using methods that are convenient, easy, and practical, and this without extensive training as well as without a large work force. All data mining applications share some common goals: identifying the patterns in the data, interpreting these patterns, and then performing prediction or description, either qualitatively or quantitatively, in general for all the data, including data that may be generated in the near future. Following are the major goals, as well as the nature of the applications, of data mining.
Scaling analysis to large databases
Scalability means the capability to handle large volumes of data which cannot fit in the memory of any single computer. The objective is to abstract the patterns away from the large databases, providing the information with which mining algorithms search for patterns.
Scaling to higher dimensional data and models
Normal statistical data analysis is a two-step process: first we formulate a model, then we use the data to fit this model. However, for human beings, formulating a hypothesis that results in a model is not possible when the data set has a very large number of variables, of the order of thousands, as arises in demographics, text document analysis, retail transactions, and web browsing. Through automated discovery from such a large volume of data, a model can be derived and then used in lower dimensional spaces, where the problem can be understood much more easily.
Automating search
The search requires enumerating large data sets and creating hypotheses, jobs beyond human capacity; these, however, are done by algorithms meant for this purpose.
Data mining is a data-driven field; mining is concerned with real-world data sets. In traditional data analysis, popular data sets consist of N objects, each described by a d-dimensional vector x of measurements (or attributes). Such data are called multivariate data, and are represented as an N × d data matrix [4].
Some of the classical data mining and analysis problems associated with multi-
variate data are the following:
• Clustering: learning a function which maps x into a set of categories, where the categories are not known in advance.
• Classification: learning a function (i.e., learning a mapping) from x to y, where y is a categorical, already defined scalar target variable of interest.
• Density estimation: estimating the probability density function (PDF) of the vector x, i.e., p(x).
• Regression: the same as classification, except that y takes real values.
Transactional (market-basket) data items may themselves have further attributes, like "price" for itemsets and "contents" for web pages, as well as inter-item relationships, like a laptop and its cover, or a web page and the home page it belongs to. In addition, the rows of the transactional data (i.e., baskets) can have further attributes, such as whether the purchase was general, weekend, monthly, or seasonal.
Transaction data, instead of being all available at one time, in fact arrive continuously as a stream, and as they flow by, they are available for mining only once. Similarly, web logs continue to grow as browsing continues over time, resulting in a stream of data. In such situations, the data miner's interest is also to study the evolution of the activity. Data streams pose the challenge of how to compute aggregates of the transaction data. One possible approach is to use an incremental learning model, such as a classification tree. Later sections of this text discuss data streams in more detail.
Text-based data can be represented in the form of a graph: the N objects are represented as nodes, and the edges represent relationships among the objects. Such "data graphs" can be used for representing web pages, where web pages are nodes and hyperlinks are the edges of the graph. These graphs can be represented by an adjacency matrix, where nodes label the rows and columns and edges are entries in the cells of the matrix. These matrices are, however, large and sparse. In such graphs, some nodes have extremely high degree (outgoing or incoming edges), while others may have degree one only. If the nodes are sorted according to their degrees, the result is a rule of the form

$\mathit{degree} \propto \dfrac{1}{\mathit{rank}^a}$    (17.1)

where the constant a is the exponent of this degree distribution.
Representing graphs by matrices has the benefit that many classical methods of linear algebra can be used to analyze the properties of the corresponding problems. For example, to discover importance from connectivity information, Google's PageRank algorithm makes use of a recursive system of equations that defines the importance of a page in terms of the importance of the pages pointing to it, as well as how many such pages point to it. The rank of each page can then be computed by solving this system of linear equations.
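A minimal power-iteration sketch of this recursive system follows (illustrative: the damping constant 0.85 is the commonly used value, and dangling pages are assumed to distribute their rank evenly).

# Power-iteration sketch of PageRank: each page distributes its rank
# among the pages it points to; iterating converges to a fixed point.
def pagerank(links, d=0.85, iters=50):
    """links: dict page -> list of pages it points to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            targets = outs if outs else pages   # dangling-page assumption
            share = rank[p] / len(targets)
            for q in targets:
                new[q] += d * share
        rank = new
    return rank

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))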
17.5 Classes of Data Mining Algorithms

17.5.1 Predictive Modeling
Predictive modeling algorithms use data sets of training records as their input. The goal of predictive modeling is to build a model that predicts the value of some specified attribute(s) from the values of the other attributes. Predictive methods are based on one or more of the following: Bayesian probability-based methods, decision trees, neural networks, and support vector machines. We elaborate two of these methods here; the Bayesian approach and neural nets have been discussed in previous chapters.
Linear Classifiers
A model resulting from a training procedure can be a rule set, or the whole training set as in the nearest-neighbor method. In the following we will see that a straight line can be a model as well. As an example, suppose the heights and weights of some persons are available, and medical experts have determined that some of them are over-weight or under-weight (see Table 17.1). Accordingly, the data have been divided into two classes, labeled as shown in Fig. 17.1. Consultation with experts may be expensive, so we would like to construct a model from the available data; then, for any new person of given weight and height, this model could easily predict whether he or she is over- or under-weight.
A model can be a rule like: If weight ≥ 60, then over-weight.
[Fig. 17.1: Training data plotted as height (140-190 cm) versus weight (40-90 kg), with over-weight and under-weight persons labeled]
[Fig. 17.2: The same data plotted as height (1.4-1.9 m) versus weight (40-90 kg), with a straight line separating the two classes]
We note that this rule does not make much sense, because some tall people may be thin even though their weights are more than 60. A better model may be the following rule:
If weight/(height)² ≥ 23, then over-weight.
The objective of classification is to identify a good model so that future predictions are accurate. Here "weight" and "height" are called features or attributes; in statistics, they are called variables. Each person is considered a data instance (or a data observation). Mathematically, we have x = [weight, height] as a data instance and y = 1 or −1 as the label of each instance. Here, we have six training instances $x_1, \ldots, x_6$ with corresponding class labels $y = [-1, -1, -1, 1, 1, 1]^T$ (where −1 and 1 mean under-weight and over-weight, respectively).
As discussed above, we now concentrate on the linear classifier, and show that a straight line can also be a model. Figure 17.2 shows a line separating the training data of over-weight and under-weight persons; this line can be expressed by

$0.2 \times \mathit{weight} - 10 \times \mathit{height} + 3 = 0.$    (17.2)

In general, such a separating line can be written as

$w^T x + b = 0,$    (17.3)

where $x = [\mathit{weight}, \mathit{height}]^T$, $w = [0.2, -10]^T$, and $b = 3$.
Then, for any new data instance x, we check whether it lies on one side of the line or the other. That is, we compute the sign of $w^T x + b$: a positive value classifies x as over-weight (y = 1), and a negative value as under-weight (y = −1).

[Figure: machine-based classification, showing the separating line $w^T x = y$ together with the lines $w^T x = y + 1$ and $w^T x = y - 1$]
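The resulting classifier is a one-line computation; the following sketch uses the line of Eq. (17.2), with height in meters.

# Linear classification by the sign of w.x + b, using Eq. (17.2).
w, b = (0.2, -10.0), 3.0         # weights for [weight, height(m)], bias

def classify(x):
    """Return +1 (over-weight) or -1 (under-weight)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1

print(classify([80, 1.5]))   # 0.2*80 - 10*1.5 + 3 = 4.0  -> +1
print(classify([50, 1.8]))   # 0.2*50 - 10*1.8 + 3 = -5.0 -> -1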
In a decision tree model, an input record is routed from the root downward: at each internal node, the branch taken is based on the splitting criterion, which evaluates a condition on the input record at the current node. At the final (leaf) node, the process associates the record with the unique leaf node of the tree.
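Prediction with such a tree is a simple walk from root to leaf; a minimal sketch follows (the tree and thresholds are illustrative, not from the text).

# Routing a record through a decision tree: internal nodes test a
# splitting condition, leaves carry class labels.
def predict(node, record):
    while isinstance(node, tuple):                 # internal node
        attribute, threshold, left, right = node
        node = left if record[attribute] < threshold else right
    return node                                    # the unique leaf label

tree = ("weight", 60,
        "under-weight",
        ("height", 1.8, "over-weight", "under-weight"))
print(predict(tree, {"weight": 75, "height": 1.6}))   # over-weight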
17.5.2 Clustering
The process of clustering partitions a set of data, according to some similarity measure, into several groups such that "similar" records are in the same group, each group representing a similar subpopulation in the data. As an example, each cluster could be a group of customers with similar purchase histories, interactions, or other factors or combinations thereof (Sect. 17.6 discusses clustering in more detail).
One technique to achieve scalability in clustering is to incrementally summarize regions of the data, denser regions first. Because a cluster corresponds to a denser region, the records within this region can be summarized collectively through a summarized representation called a "cluster feature" (CF). An example of a CF is a triple comprising the cluster centroid, the cluster radius, and the number of points in the cluster.
The CF-based approach is efficient for two reasons: (1) it consumes much less memory space, since all the objects in a cluster need not be maintained; and (2) if properly designed, CFs constitute sufficient information for computing all intra-cluster and inter-cluster measurements required for making clustering decisions. In addition, the distances between clusters, the CFs, and the radii of clusters, and hence other properties of merged clusters, can all be computed quickly from the CFs of the individual clusters.
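A BIRCH-style sketch of such a CF follows; our assumption here is the common triple of count, linear sum, and square sum (shown one-dimensional for brevity), from which centroid and radius derive and under which merging is just addition.

# Cluster feature CF = (N, LS, SS): centroid and radius are computable
# from it, and two CFs merge additively, so points need not be kept.
import math

class CF:
    def __init__(self, n=0, ls=0.0, ss=0.0):
        self.n, self.ls, self.ss = n, ls, ss   # count, linear sum, square sum

    def add_point(self, x):
        self.n += 1
        self.ls += x
        self.ss += x * x

    def merge(self, other):
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        """Root-mean-square distance of the points from the centroid."""
        return math.sqrt(max(self.ss / self.n - self.centroid() ** 2, 0.0))

a, b = CF(), CF()
for x in (1.0, 2.0, 3.0):
    a.add_point(x)
for x in (10.0, 12.0):
    b.add_point(x)
print(a.merge(b).centroid())   # 5.6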
Some points in clusters can be discarded, while the others can be compressed as
defined below.
Definition 17.4 (Compressible point) A point that is not discardable, but belongs
to a tight subcluster consisting of a set of points that always move between clusters
simultaneously, is called a compressible point.
The CFs are also useful for scaling iterative clustering algorithms, like the k-means algorithm; in doing so, three types of points (records) are identified: sets of discardable points, sets of compressible points, and a set of main-memory points. For a cluster, only the CF of all the discardable points is retained, and the actual points are discarded. Of the remaining points, the compressible points are compressed, and those still remaining, being neither discardable nor compressible, are designated as main-memory records. The iterative cluster-forming algorithm then moves only the main-memory points and the CFs of compressible points between clusters, until a criterion function is optimized, which concludes the formation of the clusters.
17.5.3 Association Rules
Under association rules, we use a concept called the market basket, a well-defined business activity: the collection of items purchased by a customer in an individual transaction. Such a transaction may result from a customer's purchase from, say, a grocery store, or an online purchase from a store such as Flipkart or Amazon.
By performing business activities over time, retailers accumulate huge collections of transactions. A common type of analysis performed on such transaction databases is to find sets of items, or itemsets, that appear together in many transactions. A pattern extracted through this analysis consists of an itemset (market basket) and the corresponding number of transactions that contain it, with the objective that the business can use knowledge of these patterns to improve the placement of items together in a store, to arrange mail-order catalog pages or web pages on a website, or to motivate, with attractive offers, the potential customers who may buy these itemsets.
The task of association rule mining is finding correlations between items in a data set. The initial research in rule mining was largely motivated by the analysis of market-basket data, the results of which allowed companies to better understand purchasing behaviors and, as a consequence, to better target market audiences [5].
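Support counting, the core of this analysis, is easy to sketch (a toy illustration; real miners such as Apriori prune the candidate space rather than enumerating it).

# Count, for each candidate itemset, the transactions containing it.
from itertools import combinations

transactions = [{"bread", "milk"}, {"bread", "butter", "milk"},
                {"milk", "butter"}, {"bread", "butter"}]

def support(itemset):
    return sum(itemset <= t for t in transactions)   # subset test per basket

items = sorted(set().union(*transactions))
pairs = [set(p) for p in combinations(items, 2)]
print([(sorted(p), support(p)) for p in pairs if support(p) >= 2])
# [(['bread', 'butter'], 2), (['bread', 'milk'], 2), (['butter', 'milk'], 2)]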
17.6 Clustering

Data mining and clustering are both exploratory activities, and hence clustering methods are well suited for data mining. Clustering is often used as an important initial step in several data mining processes. The data mining approaches that use clustering methods are predictive modeling, database segmentation, and visualization of large databases [6].
Data mining is often carried out on relational databases, where transactions have a
well-defined structure and the columns can be used as features. However, it is also
carried out on very large unstructured data collections like the WWW, where the
content is natural language text, mostly in HTML or XML format.
Clustering is in fact not a new field; it has long been used in machine learning,
statistics, and biology. However, scalability was not a design goal in those fields:
it was always assumed that a complete data set would fit in main memory, and the
focus remained on improving clustering quality rather than scalability. For this
historical reason, the majority of the clustering algorithms available today do not
scale to large data sets. Clustering methods are used in data mining for segmenting
databases into homogeneous groups, which serves the purpose of data compression,
because we then work with the clusters rather than with individual items. It also
helps to identify characteristics of subpopulations that are targeted for specific purposes
(e.g., marketing certain items aimed at a specific section of the population). Clusters
in large databases can also be visualized, to help analysts identify groups and
subgroups with similar characteristics.
By classifying and grouping objects based on common properties in some
meaningful way, we are able to analyze and describe the world. We human beings
are skilled at dividing objects into groups (i.e., clusters) and assigning particular
objects to these groups. For example, even young children can label objects as
flowers, animals, trees, books, etc. Hence, clustering is an important criterion
for understanding objects as different classes. Every new object is identified
based on whether or not it belongs to a particular class. For example, if an object
x ∈ A, where A is a class, say apple, then we say “x is an apple.” Naturally, whether
x ∈ A or not depends on certain attributes of x.
The clustering process groups data objects based only on information found in the
data that describes the objects and their relationships to objects in the same group.
The goal of clustering is that the objects within a group be similar (related) to one
another, and different from (unrelated to) the objects in other groups. The clustering
is better, or more distinct, the greater the similarity (homogeneity) within a group
and the greater the difference between groups.
The following example demonstrates a sample database as clusters. The input
patterns can be viewed as points in a multidimensional space; intuitively, the patterns
in the same cluster are more similar to each other than those in different clusters.
Figure 17.4a, b depicts an example of clustering based on patterns: the input patterns
are shown in (a), and the desired clusters formed are shown in (b), where points
belonging to the same cluster carry identical labels. Note that the measurement of
the patterns in this case is the (x, y) coordinate values.

[Fig. 17.4: a input patterns, b the desired clusters formed from them]
One technique for clustering is called supervised learning, while the other is called
unsupervised learning. In the supervised technique, a collection of labeled, i.e.,
pre-classified, patterns is already provided, and the task is to label newly encountered
unlabeled patterns. The provided labeled patterns, called training patterns, are used
to learn descriptions of the classes, which in turn are used to label new unlabeled
patterns. In the case of unsupervised learning, the task is to group a given collection
of unlabeled patterns into meaningful clusters. In fact, labels are associated with
the clusters in this case also, but these labels are data driven—obtained solely from
the data, not predefined.
Business
In business and commerce, a large amount of information is collected on current and
potential customers; clustering is then performed on this information to segment the
customers into a small number of groups, so that additional analysis can be performed
on each group, for example, to predict marketing potential.
Similarly, there are applications in image segmentation, pattern classification,
and data mining, discussed in the following. In many problems in these areas,
there is little prior information (or there are no statistical models) available about
the data, and the decision-maker must make as few assumptions about the data
as possible. It is under these conditions that the clustering approach is useful for
exploring the interrelationships among the data points to determine their structure.
[Figure: stages of a typical pattern clustering activity — feature selection/extraction yields pattern representations; an inter-pattern similarity measure yields proximities; grouping yields clusters, with a feedback loop back to the earlier stages]
A better quality of pattern representation will result in a clustering that is simple and
easily understood; on the contrary, a poor representation may result in a complex
clustering whose true structure may sometimes be difficult to recognize. For example,
if a standard technique like Cartesian coordinates is used to represent the patterns,
many clustering algorithms are likely to fragment the data into two or more clusters;
but if polar coordinates are used to represent the patterns, the radius coordinate
causes tight clustering and a one-cluster solution is easily obtained.
A pattern can represent a physical object or an abstract notion. A physical object
can be a chair, table, book, house, etc., while an abstract notion can be, e.g., a style
of writing, an attitude, or a belief. Both of these can be represented in the form of
multidimensional vectors, with one dimension per feature. The features of a pattern
can be either quantitative or qualitative. For example, if weight and color are the
two features used, then (black, 5) represents a black object with 5 units of weight.
$$d_2(x_i, x_j) = \left( \sum_{k=1}^{d} (x_{i,k} - x_{j,k})^2 \right)^{1/2} = \| x_i - x_j \|_2. \qquad (17.4)$$
Equation (17.4) is a special case, with p = 2, of the Minkowski metric, expressed by

$$d_p(x_i, x_j) = \left( \sum_{k=1}^{d} |x_{i,k} - x_{j,k}|^p \right)^{1/p} = \| x_i - x_j \|_p. \qquad (17.5)$$
The approach based on Euclidean distance has an intuitive appeal, and the method
is commonly used to evaluate the proximity of objects in two- and three-dimensional
spaces. The method works well when the data set comprises “isolated” or “compact”
clusters. However, it has a drawback: features with the largest scales tend to dominate
the other features. The solution is to normalize the continuous features (to a common
range or variance), or to apply some other weighting scheme.
Example 17.3 Consider a set of 2D data points as shown in Table 17.4, and given
a new data point, x = (2.5, 2.9) as a query, rank these database points based on
similarity with the query, using Euclidean distance.
Using Eq. (17.4), we compute the Euclidean distance for the 2D data points x1, …,
x5 with respect to the query x = (2.5, 2.9). The results are shown in Table 17.5.
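The ranking computation of this example is easy to express in code. The following Python sketch uses hypothetical stand-ins for the points of Table 17.4, whose values are not reproduced here; only the query point (2.5, 2.9) is taken from the example.

import math

def euclidean(p, q):
    # Eq. (17.4): d2(p, q) = (sum_k (p_k - q_k)^2)^(1/2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

points = {"x1": (2.0, 3.0), "x2": (3.0, 3.2), "x3": (5.0, 1.0),
          "x4": (2.4, 2.8), "x5": (4.0, 4.5)}      # hypothetical data
query = (2.5, 2.9)

for name, p in sorted(points.items(), key=lambda kv: euclidean(kv[1], query)):
    print(name, round(euclidean(p, query), 3))     # points ranked by proximity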
Since proximity between items plays an intuitive role in clustering, a method based
on nearest neighbor distances is an obvious choice as a basis for clustering
procedures. An iterative algorithm assigns each unlabeled pattern to the cluster of its
nearest labeled neighbor pattern, provided that the distance to that nearest pattern
is below a given threshold. This process continues until all the input patterns are
labeled.
To grow the clusters from the nearest neighbor, a concept called the mutual neighbor
distance, MN_d, can be used, which is expressed as

$$MN_d(x_i, x_j) = C_n(x_i, x_j) + C_n(x_j, x_i), \qquad (17.6)$$

where x_i, x_j are patterns and C_n(x_i, x_j) is the neighbor rank of x_j with respect
to x_i (1 if x_j is the nearest neighbor of x_i, 2 if it is the second nearest, and so
on). Figure 17.7 illustrates this concept. There are six patterns in total: P, Q, R, S,
T, and U. In part (a), the nearest neighbor of pattern P is Q, and Q's nearest
neighbor is P; thus C_n(P, Q) = C_n(Q, P) = 1, and hence MN_d(P, Q) = 2. If
C_n(Q, R) = 1 but C_n(R, Q) = 2, then MN_d(Q, R) = C_n(Q, R) + C_n(R, Q) = 3.
Figure 17.7b is obtained from Fig. 17.7a by adding three new patterns, S, T, and U.
Now MN_d(Q, R) = 3, but MN_d(P, Q) = 5. Note that MN_d between P and Q has
increased from 2 to 5 due to the three additional patterns S, T, and U, even though
the positions of P and Q have not changed.
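A minimal Python sketch of Eq. (17.6) follows, assuming Euclidean distance and treating C_n(x_i, x_j) as the neighbor rank of x_j with respect to x_i. The coordinates below are invented, chosen only to mimic the layout of Fig. 17.7a.

import math

def neighbor_rank(points, i, j):
    # C_n(x_i, x_j): 1 if x_j is the nearest neighbor of x_i,
    # 2 if the second nearest, and so on.
    by_distance = sorted((k for k in points if k != i),
                         key=lambda k: math.dist(points[i], points[k]))
    return by_distance.index(j) + 1

def mutual_neighbor_distance(points, i, j):
    # Eq. (17.6): MN_d(x_i, x_j) = C_n(x_i, x_j) + C_n(x_j, x_i)
    return neighbor_rank(points, i, j) + neighbor_rank(points, j, i)

pts = {"P": (1.0, 1.0), "Q": (1.5, 1.2), "R": (3.0, 2.5)}  # hypothetical
print(mutual_neighbor_distance(pts, "P", "Q"))   # 2: mutually nearest
print(mutual_neighbor_distance(pts, "Q", "R"))   # 3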
Fig. 17.7 Nearest neighbor clustering: a P and Q are more similar than P and R, b Q and R are
more similar than Q and P
Fig. 17.8 Construction steps of the k-nearest neighbor graph from the original data: a original data, b 1-nearest, c 2-nearest, d 3-nearest neighbor graphs
The algorithm discussed above has the drawback that the desired number of output
clusters must be chosen. However, it is computationally more efficient than
hierarchical clustering when used on a large data set.
Squared Error Algorithms
The squared error function approach is the most intuitive concept for partitional
clustering, as it is ideally suited for compact and isolated clusters. For an input
pattern set X, the squared error for a clustering C consisting of K clusters
(C_1, . . . , C_K) can be expressed as

$$e^2(\mathcal{X}, \mathcal{C}) = \sum_{j=1}^{K} \sum_{i=1}^{m_j} \| x_i^{(j)} - c_j \|^2. \qquad (17.7)$$

In Eq. (17.7), c_j is the centroid of the jth of the K clusters formed, m_j is the
number of patterns in the jth cluster, and x_i^{(j)} is the ith pattern in the jth cluster.
The steps of the squared error clustering algorithm are listed in Algorithm 17.1.
The repeat…until loop continues until convergence is achieved, i.e., until the cluster
membership is stable.
k-means partitional clustering
The k-means is a partitional clustering technique that tries to find k clusters, the
count k being specified by the user; the clusters are represented by their centroids.
It is the simplest and most commonly used algorithm employing the squared error
criterion. The k-means algorithm starts with a random initial partition and keeps
reassigning the patterns to clusters, based on the similarity between each pattern and
the cluster centers (centroid distances), until a convergence condition is reached:
no pattern is reassigned from one cluster to another, or equivalently, the squared
error decreases to some minimum threshold after a number of iterations.
The major advantages of the k-means algorithm are that (1) it is easy to implement,
and (2) its time complexity is O(n), where n is the number of patterns. However, its
major drawback is its sensitivity to the initial partition, because of which it may
converge only to a local minimum of the criterion function.
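A minimal self-contained Python sketch of the loop just described—random initial centroids, reassignment by centroid distance, and termination when the membership is stable—is given below; it is illustrative only, not the book's algorithm listing.

import math
import random

def kmeans(points, k, max_iters=100):
    centroids = random.sample(points, k)          # random initial partition
    assignment = None
    for _ in range(max_iters):
        # Assign each pattern to its nearest centroid.
        new_assignment = [min(range(k),
                              key=lambda j: math.dist(p, centroids[j]))
                          for p in points]
        if new_assignment == assignment:          # stable membership: converged
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its cluster's patterns.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return centroids, assignment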
A collection of all the clusters corresponding to a given input data set is generally
called a clustering. In the following discussions, we shall distinguish between
various types of clustering approaches, such as partitional (i.e., un-nested) versus
hierarchical (i.e., nested), monothetic versus polythetic, and hard versus fuzzy [7].
Partitional versus Hierarchical
At the top level, distinction is made between hierarchical and partitional approaches
for clustering, where hierarchical-based methods produce a nested series of partitions.
However, the partitional (un-nested) methods produce only one partition. In the
hierarchical clustering, the features are used sequentially, whereas in the creation of
partitional clustering they are used simultaneously.
The hierarchical clustering algorithms produce a sequence of clusters, which
are nested in nature. They have a single all-inclusive cluster at the top, and single
Fig. 17.10 a Data items in three clusters, b dendrogram for the clusters in (a)
point clusters at the bottom of the hierarchy. At the start, these algorithms, called
agglomerative hierarchical algorithms, treat each data point as a separate cluster.
After this initial step, each further step merges the two clusters that are most similar
(for a demonstration, read Fig. 17.10b from bottom to top). With n data points at
the beginning, each such step compares each cluster with every other cluster, hence
the worst-case cost of a step is O(n²). After each merge, the total number of clusters
reduces by one. The user can repeat these steps until the desired number of clusters
is obtained or the distance between the two closest clusters goes above a certain
threshold. Since in the worst case the n data points reduce to one cluster in a total
of n − 1 steps, the worst-case time complexity of this algorithm is O(n³).
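The merging loop just described can be sketched as below. This naive Python version recomputes all pairwise centroid distances at every step, which is exactly the behavior that leads to the O(n³) worst case noted above; the function name and the centroid-distance merging rule are illustrative choices.

import math
from itertools import combinations

def agglomerative(points, target_k):
    clusters = [[p] for p in points]          # each point starts as a cluster

    def centroid(c):
        return tuple(sum(x) / len(c) for x in zip(*c))

    while len(clusters) > target_k:
        # Find the pair of clusters whose centroids are most similar.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: math.dist(centroid(clusters[ij[0]]),
                                            centroid(clusters[ij[1]])))
        clusters[i].extend(clusters.pop(j))   # merge: cluster count drops by 1
    return clusters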
In some of the hierarchical methods, each cluster is represented by its centroid,
a point closest to the centre of that cluster. In centroid-based methods, the distance
between two clusters is measured by how similar the centroids of the clusters are.
However, this centroid-based scheme of distance calculation fails when some data
points in a cluster are closer to the centroid of another cluster than to the centroid
of their own cluster.
The working of a hierarchical clustering algorithm is illustrated in Fig. 17.10a,
using a 2D data set in the form of seven patterns labeled P, Q, R, S, T, U, and V.
After merging, three clusters (1, 2, and 3) are obtained. A hierarchical algorithm
results in a nested grouping of patterns, which produces a dendrogram, as illustrated
in Fig. 17.10b. The similarity levels at which groupings change are also marked in
Fig. 17.10b; any two clusters can be merged to form a larger cluster based on the
minimum distance criterion. In this example, clusters 2 and 3 can be merged into a
single bigger cluster.
A partitional clustering is created simply by dividing the set of input data objects
into non-overlapping subsets such that each data object is in exactly one subset,
i.e., one cluster. A simple partitional (monothetic) clustering considers the features
sequentially to divide the given collection of patterns (see Fig. 17.11).
The field of data mining has led to the development of clustering techniques that
minimize the number of scans through a pattern space, reduce the number of patterns
examined during execution, or reduce the size of the data structures used in the
algorithm. The main advantage of incremental algorithms is that they do not require
the entire pattern matrix to be stored in memory for repeated use; instead, only part
of it is stored at a time, which reduces the total space requirement. These algorithms
are generally not iterative, hence the execution time is also small.
Algorithm 17.2 illustrates a typical incremental clustering algorithm. Let D0 be
the first data item, and C0 be the first cluster.
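Algorithm 17.2 is not reproduced here, but a typical incremental (leader-style) scheme of the kind described—a single pass in which each item joins the nearest cluster if it lies within a threshold, and otherwise starts a new cluster—can be sketched in Python as follows; the function and its threshold rule are illustrative assumptions.

import math

def incremental_cluster(stream, threshold):
    leaders, members = [], []                 # one leader point per cluster
    for item in stream:                       # D0 starts cluster C0, and so on
        if leaders:
            j = min(range(len(leaders)),
                    key=lambda i: math.dist(item, leaders[i]))
            if math.dist(item, leaders[j]) <= threshold:
                members[j].append(item)       # join the nearest cluster
                continue
        leaders.append(item)                  # too far from all: new cluster
        members.append([item])
    return leaders, members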
17.9 Classification
The basic idea of the data classification problem can be simply described as follows:
given training data with known labels or classes (e.g., as shown in Table 17.6), we
would like to learn a model that can be used to predict the labels of new data.
Let us consider that we have identified some customers through clustering
of the aggregated purchase information about the currently existing customers of a
certain company (see Table 17.2, page no. 520). Further, we have also acquired a
mailing list of potential customers, with their demographic information. As the next
step, we would like to assign each person in the mailing list to one of the three
groups A, B, and C shown in Table 17.6, for the purpose of mailing them a catalog
of items tailored to the individual's buying patterns. This data mining task makes
use of historical information about current customers to help predict the cluster
membership of new customers.
Let us assume that the training database with historical information has records
with attributes: salary, age, employment, and group. The goal is to build a model that
takes as input the predictor attributes and outputs a value for the dependent attribute.
When the dependent attribute is a numerical value, the problem is called regression;
otherwise it is a classification problem. In the present discussion the dependent
attribute values (also called class labels) are A, B, and C, hence it is a classification
problem. Table 17.6 is a sample training database with three predictor attributes—
salary, age, and employment—and group as the dependent attribute.
There are many classification models for data mining applications, including
genetic algorithms, Bayesian networks, neural networks, log-linear methods,
statistical methods, and decision tables. Classification trees (i.e., decision trees) are
popular for the following reasons:
• Their representation is intuitive, which makes the classification model easy to
understand,
• An analyst does not need to supply any input parameter for the construction of a
decision tree,
• The prediction accuracy of decision trees is comparable to, or better than, that of
other classification methods,
• It is possible to construct decision trees from very large training databases using
algorithms that are scalable and fast.
Decision Trees
Decision trees are tree structures whose leaves are classifications and whose
branches are conjunctions of features that lead to those classifications. Their
significance lies in the fact that many data mining methods generate decision trees
as their output, the problem-solution methods learning these trees from the data.
One approach to learning a decision tree is to split the example set into subsets,
based on a value test of some attribute. This process is repeated recursively on the
subsets, with each split value becoming a subtree root. The splitting stops when a
subset becomes so small that further splitting is not possible, i.e., the subset contains
examples of only one classification. A split is considered best if it produces subsets
with the minimum mixture of classifications; subsequent learning then generates
smaller subtrees, which require less further splitting, in effect reducing the number
of steps in the solution of the problem [8].
A decision tree algorithm used for classification comprises two steps: tree
building and pruning. In the first step, most of the decision tree is grown top-down
in a greedy way. Starting with the root node, the database is examined by a method,
called “split selection”, that selects the split condition at each node; then the database
is partitioned and the procedure is applied recursively. In the pruning stage, the tree
constructed in the previous phase is pruned to control its size; the pruning methods
select the tree in a way that minimizes prediction errors. Some decision-tree
construction algorithms separate the processes of tree building and pruning, while
other algorithms interleave them to avoid the unnecessary expansion of some nodes.
Algorithm 17.3 shows a sample tree building phase for a node n at which the tree
is to be split. At the output, the algorithm provides a decision tree for the data
partition D, with the new node value n as the root of the decision tree.
The choice of splitting criteria determines the quality of the decision tree. If the
training database does not fit into memory, we need a scalable data access method.
Many scalable algorithms are designed with a built-in feature which ensures that
a small set of statistics, like aggregate measures and counts, is sufficient to
implement the split selection method. Since aggregated data is far smaller in size
than the actual data, it is possible to construct the statistics in memory for each node
in a single scan over the corresponding database partition. The nodes satisfy the
splitting criteria, so that in the process of recursive splitting we ultimately reach the
class-label nodes at the leaves.
[Fig. 17.12: decision tree with internal nodes labeled salary, age (≤ 40 / > 40), and employment, and leaf nodes labeled Group A, Group B, Group C, and Group D]
In a decision tree, each internal node is labeled with a predictor attribute, called the
splitting attribute, and each leaf node is labeled with a class label.
Example 17.5 Figure 17.12 shows the decision tree for a training data set as shown
in Table 17.6. The splitting attributes are salary, age, and employment, and the class
labels are Groups A, B, C, and D.
Every edge that originates from an internal node is labeled with a splitting predicate,
which involves only that node's splitting attribute. The splitting predicates in the
decision tree of Fig. 17.12 are ≤ 50K, > 50K, ≤ 40, > 40, Self, Academia, and
Industry. The splitting predicates have the property that any record takes a unique
path from the root to exactly one leaf node, which gives the class label of that record.
At a node, the combined information about the splitting attribute and splitting
predicates is called the splitting criterion.
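To illustrate how such a tree could be induced and applied, the sketch below trains a decision tree on a small, invented encoding of a Table 17.6-style training set, using the scikit-learn library (assumed to be available). The feature values, the numeric encoding of employment, and the group labels are all hypothetical.

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical records: salary (in K), age, employment
# (0 = Self, 1 = Academia, 2 = Industry), with group as the class label.
X = [[30, 35, 0], [45, 38, 1], [60, 30, 2],
     [70, 45, 1], [55, 50, 0], [80, 42, 2]]
y = ["C", "C", "B", "A", "A", "B"]

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["salary", "age", "employment"]))
print(tree.predict([[52, 41, 2]]))   # unique root-to-leaf path gives the label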
An insurance company, for example, could more effectively target the marketing of
policy Y toward those clients who hold policy X but not Y, to motivate them to buy
policy Y. In effect, the association rule represents knowledge about the purchasing
behavior of customers. Association rule mining has been effectively applied to many
different domains, including market-basket analysis in commercial environments,
astrophysics, crime prevention, fluid dynamics, and counter-terrorism—that is, to
all areas in which a relationship between objects can be distilled as useful
knowledge [5, 9].
Association rule analysis is a technique to uncover how items are associated with
each other. There are three commonly used measures of association, which indicate
its strength.
Definition 17.9 (Support) For a given set of data items, an association rule has
support s for some set of items X if s percent of transactions include all the items of
set X .
In the sales transactions, for example, if we discover that sale of some items beyond
a certain proportion tend to have a significant impact on the total profits, we might
consider using that proportion as support threshold.
Definition 17.10 (Confidence) “Confidence” says how likely the itemset Y is pur-
chased when itemset X is purchased, i.e., to determine the value of X → Y . For a
given set of data items, an association rule has confidence c if c percent of transactions
that contain itemset X also contain itemset Y .
“Confidence” is measured by the proportion of the transactions containing itemset
X in which Y also appears. Writing (X, Y) for the number of transactions containing
both X and Y, X for the number containing X, and T for the total number of
transactions, we compute c as

$$c = \frac{(X, Y)}{X} = \frac{(X, Y)/T}{X/T} = \frac{support(X, Y)}{support(X)}.$$
In association rule mining, usually the goal is to discover all association rules
having the value of support and confidence greater than some user-specified minimum
threshold value [8].
Definition 17.11 (Lift) “Lift” (l) measures how much more likely itemset Y is to be
purchased whenever itemset X is purchased, while controlling for how popular
itemset Y is.
The lift is measured as

$$l = \frac{support(X, Y)}{support(X) \times support(Y)}. \qquad (17.8)$$
Example 17.6 Given the sales transactions in Table 17.7, find out the “support” for
the following:
1. Laptop,
2. Smartphone, and
3. Laptop and Smartphone
and “confidence” of a music system with respect to
1. Laptop,
2. Smartphone, and
3. Laptop and Smartphone.
There are three transactions in this table: numbers 101, 102, and 103, stored in
a relational database system. We note that out of the three transactions, two have
Laptop, three have Smartphone, and two have Laptop and Smartphone combined.
Accordingly, their “support” is 67, 100, and 67 percent, respectively.
In the second part, we are interested in computing the confidence of “Laptop” →
“Music system”, i.e., the proportion of the transactions containing a Laptop in which
a Music system also appears.
The lift of 0.75 (< 1) indicates that item Y is unlikely to be bought by a customer
who is buying item X.
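All three measures are simple to compute from a transaction list. The snippet below uses a hypothetical reconstruction of the three transactions of Table 17.7 that is consistent with the figures quoted above (Laptop 67%, Smartphone 100%, Laptop and Smartphone 67%, and a lift of 0.75 for Laptop → Music system); the actual table contents are in the book.

def support(transactions, items):
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    # fraction of the transactions containing X that also contain Y
    return support(transactions, set(x) | set(y)) / support(transactions, x)

def lift(transactions, x, y):
    # Eq. (17.8)
    return (support(transactions, set(x) | set(y))
            / (support(transactions, x) * support(transactions, y)))

T = [{"Laptop", "Smartphone", "Music system"},     # hypothetical transactions
     {"Smartphone", "Music system"},
     {"Laptop", "Smartphone"}]
print(support(T, {"Laptop"}))                       # ~0.67
print(confidence(T, {"Laptop"}, {"Music system"}))  # 0.5
print(lift(T, {"Laptop"}, {"Music system"}))        # 0.75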
Most of the algorithms for association rule mining comprise two distinct steps:
1. Find all sets of items with minimum support. The data are of the order of millions
of transactions, and a mining algorithm needs to count a large number of candidate
itemsets to identify the frequent ones; hence this phase is computationally
expensive, and often time consuming.
2. Generate the inferences. The inferences or rules can be generated directly from
the frequent itemsets, and there is no need to scan the data sets again.
Of the two steps, most of the time is usually consumed by the first, hence that is
where scalability matters most. The techniques for scalability can be divided into
two groups: (1) those which reduce the total number of candidates to be counted;
and (2) those which make the counting of candidates more efficient. The first
category makes use of the property that all subsets of a frequent itemset must also
be frequent, employing it as a pruning technique to reduce the number of itemsets
to be counted.
Variations of the actual problem
Other approaches used in data mining algorithms focus on variations of the actual
problem. For example, for data sets and support values where the frequent itemsets
are very long, finding all frequent itemsets is intractable—an itemset with n items
has 2^n subsets, all potentially frequent. However, by looking ahead, the set of
maximal frequent itemsets can still be found efficiently, using the following criterion:
if an itemset is identified as frequent, all of its subsets must also be frequent, and
hence none of its subsets need be counted. A key to this solution is to maximize the
probability that the itemsets counted by looking ahead are actually frequent, which
requires good heuristics. One such heuristic is to bias candidate generation so that
the most frequent items appear in most of the candidate groups, making it more
likely for high-frequency items to be part of long, frequent itemsets.
Why is the finding of frequent itemsets a nontrivial problem at all? The first reason
is the number of customer transactions in the database, which can be so large that it
does not fit into the memory of a system. The second is that the potential number
of frequent itemsets is exponential in the number of different items, although the
actual number of frequent itemsets can be much smaller. For example, with four
different items there are 2^4 − 1 = 15 potential frequent itemsets; but if the minimum
support is taken as 60%, only five itemsets may actually be frequent. Thus, there is a
need for algorithms that are scalable with respect to the number of transactions and
that examine as few infrequent itemsets as possible.
3: Discard the common items from the remaining set (T − Td) (since they are known to be present).
4: Discard the items not present in any of the candidates.
5: Count each of the candidate groups.
This reduction in the number and size of the remaining transactions can result in
substantial improvements in the speed of counting.
A sequential pattern mining problem can be stated using the following examples.
• Can we find strong correlations between users making online purchases from a
web page and their behavior patterns—the sequences of pages visited and/or the
speed at which they browse the website's pages?
• A market basket is a record with fields: customer ID and a list of all items, with
most of the item fields False, and True only where the customer has bought the
item. Can we derive, through analysis, more useful information about the buying
pattern of a customer if information related to time and sequence is also included
in the market basket?
• Can we recognize a user as genuine or as a hacker, based on suspicious behavior
observed by analyzing the sequence of commands entered by that user?
• Can we declare certain sequences of actions as “best practices”, based on the
outcomes and analysis of the sequences of actions performed in an activity that
results in the given outcomes?
• Given the sequences of alerts and status messages issued by a system before it
failed, can we determine, through analysis, those sequence sets or subsequence
sets that are confirmed predictors of the failure? That is, is it possible to map
those sequences to failure?
In simple terms, given many significant or important events occurring over a scale
of time, space, or some other metric, it is required to learn from the data by
considering the ordered sequences appearing in it.
Consider that the input sequences are stored in a database D, where an input sequence
is an event record with the fields: sequence-id, event time, and the list of items. It is
assumed that the event time is unique within every sequence, hence this field can be
used as the event ID. We define the support (the frequency) of a sequence α as the
number of input sequences in the database that contain α, and express it as σ(α, D).
The support can also be calculated by a different method: for a user-specified
minimum support value min_supp (a frequency), a sequence is frequent if it occurs
more than min_supp times. We denote the frequent k-sequences by F_k. In addition,
a frequent sequence is called maximal if it is not a subsequence of any other frequent
sequence.
The above discussion can be termed the problem definition for sequence mining
algorithms, whose data are transaction databases or transaction data sets. In web
mining, for example, a single page access by a user constitutes an event. If, instead
of an event time, a time window is considered for web mining, then there can be
more than one item (multiple pages) per event, and in that case it becomes a general
case like the market basket, having many items in an event, where an event is in
the form of a time window.
Example 17.7 Sequential pattern mining to raise promotional sales in an online store.
As discussed above, when a user accesses web pages such that only one page is
accessed at a time, this gives rise to a sequence database in which each event in a
sequence's ordered event list contains only one item.
Let E = {a, b, c, d, e, f} be a set of events for six products' web pages, accessed
by different users in different orders in some online sales application; E also denotes
the set of the six products.
We assume that the web access database is a web log, and that for four users the
web access sequences are in the form of four records: [T1, adbce]; [T2, abecdf];
[T3, babfaec]; [T4, abfac]. From web-log pattern mining of this sequence database,
we find a frequent sequence abc, indicating that all the users who visit product a's
web page http://www.prompt-sale.com/a.htm subsequently visit product b's web
page http://www.prompt-sale.com/b.htm and then visit product c's page
http://www.prompt-sale.com/c.htm (these are only fictitious portal names).
Having the pattern that customers visit the web pages in a certain order—e.g.,
customers who visit the web page for product a will also visit the web pages of b
and c—it is possible to increase the sales of products b and c: by placing promotional
discounts on product a's web page, the first page visited in the sequence, the sale of
the other products b and c can be promoted.
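Counting the support of a candidate sequence such as abc amounts to testing, for each session, whether the pattern occurs as an ordered (not necessarily contiguous) subsequence. A small Python sketch over the four sessions above:

def is_subsequence(pattern, sequence):
    # True if the events of `pattern` appear in `sequence` in order.
    it = iter(sequence)
    return all(event in it for event in pattern)

def sequence_support(db, pattern):
    # number of input sequences in the database containing the pattern
    return sum(is_subsequence(pattern, seq) for seq in db)

web_log = ["adbce", "abecdf", "babfaec", "abfac"]   # the four user sessions
print(sequence_support(web_log, "abc"))             # 4: abc is frequent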
The web log can be maintained on the server where the web page is stored.
However, it can also be stored on the client side, or on a proxy server. Each choice
has its own advantages and disadvantages concerning finding the users’ relevant
patterns and navigational sessions.
The Apriori algorithm is used for mining frequent itemsets, and for association rule
learning over transactional databases. The algorithm works by identifying the
frequent individual items in the database and extending them to larger and larger
itemsets, as long as those itemsets appear sufficiently frequently in the database.
Apriori relies on the property that “any subset of a frequent itemset must also be
frequent.” The algorithm makes several rounds of computation to compute the
frequent itemsets, such that in the ith round it computes all frequent i-itemsets. A
round of the Apriori algorithm has the following three steps:
1. candidate generation,
2. candidate counting, and
3. pruning of the candidates whose support falls below the minimum threshold.
Let the total number of items after scanning in the first round be 5, i.e., C1 =
{a : 6, b : 6, c : 3, d : 3, e : 5}, where every item is represented with its support
in the form s : n, s being the sequence and n the support count. Let us assume
a minimum absolute support of min_supp = 4 (i.e., 67%). After pruning, the list
of frequent 1-sequences is F1 = {a : 6, b : 6, e : 5}. Using the join operation
⋈_ap-gen, Apriori generates the candidate sequences at k = 2 as C2 = F1 ⋈_ap-gen F1
= {aa : 3, ab : 5, ae : 4, ba : 3, bb : 4, be : 4, ea : 1, eb : 2, ee : 0}, where ⋈_ap-gen
stands for the Apriori “generate and set-join” operation. Note that sequences like
ac and ad are not considered, since their constituent items c and d are not frequent;
similar rules apply to the other cases. The support counts of the sequences in C2
are as per their frequency in the database.
Next, it prunes all sequences in C2 that have infrequent subsequences (i.e.,
subsequences that do not appear in F1), and so on. The final resulting frequent set
is f_s = ∪_k F_k = {a, b, e, ab, ae, bb, be, aba}, where ∪_k denotes the union over
the frequent k-sequence sets F_k. The pruning procedure performed on the itemsets
in C_k removes sequences that have infrequent subsequences.
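For the itemset (rather than sequence) setting, the generate–count–prune cycle just described can be sketched in Python as below. This is a minimal illustration of the level-wise idea, not an optimized implementation; the transactions reuse the hypothetical reconstruction from the earlier snippet.

from itertools import combinations

def apriori(transactions, min_supp):
    items = {i for t in transactions for i in t}
    frequent = {}
    current = [frozenset([i]) for i in sorted(items)]   # candidate 1-itemsets
    k = 1
    while current:
        # Candidate counting.
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_supp}
        frequent.update(level)
        # Candidate generation (join), pruning any candidate that has
        # an infrequent (k-1)-subset.
        k += 1
        current = list({a | b for a, b in combinations(level, 2)
                        if len(a | b) == k
                        and all(frozenset(s) in level
                                for s in combinations(a | b, k - 1))})
    return frequent

T = [{"Laptop", "Smartphone", "Music system"},      # hypothetical transactions
     {"Smartphone", "Music system"},
     {"Laptop", "Smartphone"}]
for itemset, count in apriori(T, min_supp=2).items():
    print(set(itemset), count)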
Example 17.9 Perform data mining using an Apriori-based algorithm to find the
frequent itemsets for the sales transaction data, for a minimum support of 60%.
Consider the transactional database shown in Table 17.7 (page number 539); for
convenience of reference it is reproduced as Table 17.8.
For a min_supp value of 60%, an itemset is frequent if it is present in at least
two transactions. As per the Apriori algorithm, we perform several rounds through
the given database. In the first round, all single items, which are the candidate
itemsets, are counted during the candidate counting step. The resulting frequent
1-itemsets are {Laptop (67%), Smartphone (100%), Music system (67%)}.
In the second round, pairs are counted as candidate itemsets; only those pairs are
candidates in which each individual item was frequent in round one, and a pair is
frequent if present in at least 60% of the transactions. As per these restrictions, the
non-qualifying 2-sequences are dropped, as are the customer sequences not
containing any L_itemsets; these customer sequences are, however, still considered
part of the database for the purpose of the customer count.
4. Sequence phase. The sequence phase mines the set of L_itemsets to find the
frequent subsequences. The algorithm makes multiple passes over the data, each
pass beginning with a seed set used to produce potentially large sequences (i.e.,
candidates); the support of these candidates is also calculated during the pass.
The sequences not meeting the minimum support criterion are pruned, and the
remaining ones become the seed set for the next pass. The process starts with the
large 1-sequences, and terminates when either no candidates are generated or
none meets the minimum support.
5. Maximal phase. This phase finds all the maximal sequences among the set of
large sequences. The process is similar to finding all subsets of a given itemset,
and the algorithm is similar as well.
Example 17.10 (Association rule mining) Consider the database having four trans-
actions as shown in Table 17.9. It is required to find the association rules if the
minimum support is 60% and minimum confidence is 80%.
storing data, mining new hypotheses, and confirming these with data or supplementing
them by experiment.” These results and data can be combined with clinical data to
achieve better results and resolution for treatments.
Telecommunications
Due to the wide use of telecommunications in the world, vast quantities of high-
quality data are already available—mainly call details collected at network switches
for billing—which can be used for data mining tasks such as the detection of toll
fraud and marketing to consumers. Based on data mining algorithms applied to
telecommunications data, the following benefits can be drawn:
• New architectures for networks. New-generation network infrastructures may
adapt dynamically to changes in traffic demands; this can be achieved through
data mining techniques that understand and predict the network load.
• Mobility and micro-billing. Consumer-related activities may have separate billing
patterns.
• Mobile services. Data mining may enable adaptive solutions for customers, so that
they can obtain them through a few keystrokes.
• Security. Data may be collected and maintained from the records of billing, travel,
and migration, to obtain information relevant to ensuring national security.
Geospatial Data
Geographical data includes digital data of all sorts, created, stored, processed, and
disseminated by government and private organizations—much of it produced by
high-resolution remote sensing devices, collected through geographical positioning
systems, or gathered by other position-aware devices like cellular phones.
New pattern mining and clustering algorithms that are highly scalable are useful
for discovering new and unexpected patterns, trends, and relationships embedded
in these data.
Climate Data and Earth’s Ecosystems
The climate data acquired through terrestrial observation, Earth-observation
satellites, and ecosystem models provide a great deal of data, which can be mined
for patterns to anticipate future ecological problems and their management,
concerning the ecology and even the health of planet Earth. Such data consist of
global snapshots of the Earth at regular intervals, with variables covering the
atmosphere, land, and ocean (sea surface temperature, precipitation), or the
accumulation of carbon.
There are two main components of Earth science data mining: (1) modeling
ecological data, and (2) designing efficient algorithms for discovering potential
patterns in these data. Through these patterns, as well as the existing ones, it may
be possible to predict effects such as El Niño and tornadoes.
17.13 Summary
Digital storage capacity has doubled every 9 months—this is the phenomenon that
has necessitated the mining of data. Data mining provides tools to sift through the
vast amounts of data accumulated by organizations over the years, to find the trends,
patterns, and correlations that can be helpful in planning.
The data mining process is confronted with the dimensionality, size, diversity,
noise, and distributed nature of data sets, which make the formal specification of the
problem a challenge. The solution techniques need to deal with complexity,
scalability, and presentation. The entire process through which data mining makes
its transitions is called data science. Data mining algorithms address three classical
problems: clustering, market basket analysis, and density estimation.
One important difference between traditional databases and data mining is the size
of the data in the latter—the data may not fit in the storage of a computer, therefore
the mining algorithms should be scalable. Data mining is concerned with the
identification of interesting structure in the data, which may be patterns, statistical
or predictive models, or relationship information. A model can be used to predict
future customer behavior, e.g., that customer “A” may buy such and such products
and goods.
The major goals of data mining are
• scaling analysis to large databases,
• scaling to higher dimensional data and models,
• automated search, and
• finding understandable patterns and models.
Five perspectives from the existing areas of machine learning, statistics, algorithms,
and databases are used in data mining. These comprise induction, compression,
querying, approximation, and search.
The data mining techniques are also classified based on the following criteria:
• their induced representations, e.g., decision trees, association rules, and correla-
tions;
• the data on which they operate, e.g., continuous, discrete, labeled, time series; or
• application domains, e.g., finance, economic models, web-log mining, and semi-
structured models.
The common form of data for data mining is records of individuals, like transaction
records or records in the form of requested web pages—the set of such records can
be viewed as an N rows × d columns matrix, where entry (i, j) is 1, for example,
if item j was purchased in session i. The set of items (d) in a row is called a basket.
In one approach to data mining, the objective is to find statistically significant
itemsets, that is, those whose frequency is higher than some baseline frequency.
The transaction data is often in the form of a continuous stream, like continuous
sales transactions or web logs; this poses the challenge of how to compute aggregates
of the data, and incremental learning algorithms are used in such cases.
Many of the traditional algorithms access the databases multiple times, or randomly,
for analysis. This is not possible when the databases are very large. The algorithms
used for such large databases are prediction methods, clustering, and association
rules. Clustering partitions a set of records, according to some similarity function,
into groups such that “similar” records are in the same group—an identification
of similar subpopulations in the data. Clustered records are represented by a
summarized representation called a cluster feature (CF), which is efficient as it
occupies less space; this summary is sufficient for inter-cluster and intra-cluster
measurements. A pattern x (which may be a feature vector, a datum, or an
observation) is a single data item used by the clustering algorithms. It is typically a
vector of d dimensions, represented as x = (x1, . . . , xd). In the feature vector x, the
individual scalar components xi are called features (or attributes).
Structured features can be represented using a tree data structure. The feature
selection techniques identify a subset of the existing features to be used later, while
feature extraction techniques compute new features from the original set. It is
common to measure the dissimilarity between two patterns by a distance between
the corresponding objects, say O1 and O2, determined by d(O1, O2).
Association rules, another mining method, make use of the market basket—the
collection of items purchased by a customer—to find correlations between itemsets,
helping to understand the purchasing behavior of customers. The clustering process
groups the data objects using the information found in the data items as well as the
relations between the data items. Clustering is also called unsupervised
classification, where it is required to group the given unlabeled patterns into
meaningful clusters. Supervised classification, in contrast, classifies unlabeled
patterns using pre-classified patterns.
General applications of clustering are pattern analysis, decision-making, and
machine learning. Typical applications are Information Retrieval (IR), analysis of
semantic information, and analysis of business (sales) transactions, e.g., to find
potential customers. Utilities like summarization, nearest neighbor search, and
compression can also be built on data clustering techniques. A typical pattern
clustering activity involves pattern representation, feature selection and extraction,
pattern proximity measures, and data grouping (clustering).
Proximity or similarity is measured by a distance function, such as the Euclidean
distance, and the grouping can be based on many criteria, like soft (fuzzy) versus
hard partitioning, or hierarchical versus partitional algorithms. Patterns for a
clustering algorithm may be represented using Cartesian or polar coordinates, and
a pattern can represent either a physical object, like a chair, or an abstract notion,
like a style of writing. Different types of clustering are distinguished by contrasting
characteristics, e.g., monothetic versus polythetic, hierarchical (nested) versus
partitional (un-nested), exclusive versus fuzzy, and complete versus partial. A
hierarchical clustering algorithm yields a nested grouping of patterns in the form
of a dendrogram, such that two clusters can be merged to form a larger cluster based
on the minimum distance criterion.
When the pattern set to be used is so large that constraints on memory size and
execution time affect the algorithm, we use incremental algorithms. These have the
advantage that they do not require the entire pattern matrix to be stored in memory,
hence the space requirement is less; and since these algorithms are not iterative, the
time required is also less.
Since proximity between items plays an intuitive role in clustering, the nearest
neighbor approach is a useful basis for clustering procedures: it iteratively assigns
each unlabeled pattern to the cluster of its nearest labeled pattern, provided that the
distance to it is below some prespecified threshold value. The partitional technique
of clustering produces clusters by optimizing a criterion function defined on a subset
of the patterns or on all the patterns. Its algorithm is run multiple times, each time
with different starting points, and the best outcome from all the runs is taken.
Many data mining approaches generate classification trees—trees whose leaves
are classifications and whose branches are conjunctions of features that lead to those
classifications. The decision-tree algorithm has two phases—tree building and tree
pruning. In the tree building phase, the algorithm examines the database using the
split selection method to compute the locally best criterion; this works recursively
to build the tree.
We can mine the rules in the data, called association rules. Rule mining
algorithms capture the set of important correlations present in the database. For a
given set of transactions, where each transaction is a set of items, an associative (or
association) rule can be an implication X ⇒ Y; in the case of sales transactions,
this may indicate that those who purchase itemset X may also purchase itemset Y.
Therefore, for selling itemset Y, one may target the people who bought X but not Y.
Association rule mining consists of two steps: (1) find all itemsets with minimum
support, and then (2) generate the inferences from these minimum-support itemsets.
The sequential pattern mining technique can be described thus: given a database of
sequences, where each sequence is a list of transactions ordered by transaction time,
discover all sequential patterns having a minimum support. A sequence in sequential
pattern mining is an ordered list of events, where an event is denoted i1, i2, . . . , ik,
and each ij is an item.
A sequential pattern mining algorithm mines the database of sequences, looking
for repeating patterns (known as frequent sequences). These sequences can be
analyzed to find associations between different items or events in the data, for
purposes like marketing campaigns, business reorganization, prediction, and
planning.
Web usage mining is an application of sequential pattern mining concerned with
finding user navigational patterns on the World Wide Web, by extracting knowledge
from web logs.
A commonly used family of algorithms, the Apriori algorithms, computes frequent
itemsets through several rounds, such that in round i it computes all frequent
i-itemsets. A round has two steps: (1) candidate generation, and (2) candidate
counting, i.e., retaining those candidates having support greater than min_supp.
Exercises
1. How is data mining different from querying databases, like Oracle or MySQL?
2. What were the trends in Information Technology (IT), which gave birth to the
field of data mining? Why did the field of data mining emerge so late compared
to databases?
3. Are the goals presented in this chapter identically applicable to all the data min-
ing domains, for example, mining of data generated due to collision in particle
accelerators versus online sales transactions versus Twitter data? Justify your
answer.
4. Given that the Apriori algorithm makes use of prior knowledge of subset support
properties,
a. Show that all non-empty subsets of a set of frequent items must also be frequent.
b. Show that the support of any non-empty subset R of itemset S must be at least
as large as the support of S.
5. The algorithms for frequent pattern mining consider only distinct items in a
transaction (market basket or shopping basket). However, multiple occurrences of
an item are common in a shopping basket, e.g., we often buy things like 10
eggs, 3 breads, 2 kg dalia, 4 kg oil, 2 kg milk, etc., and this can be important
in transaction data analysis. Suggest an approach for modifying the Apriori
algorithm, or propose an alternative method, to efficiently handle multiple
occurrences of items.
6. Assume nonnegative prices of items in a store, and find out the nature of the
constraint represented in each of the following cases. Also suggest how you would
mine the association rules under each constraint.
a. At least one Sudoku game.
b. Itemsets, whose sum of prices is less than $250.
c. There is one free item, and other items, whose sum of prices is equal or more
than $300.
d. The average price of all the items is between $200 and $500.
7. Given a decision tree, you have the following options:
a. Convert the decision tree into rules, and then prune the resulting rules,
b. Prune the decision tree and then convert the pruned tree into rules.
Critically analyze both the approaches, and discuss their merits and demerits.
8. Find the worst-case time complexity of the decision-tree algorithm. Assume that
the data set is D, each item has n attributes, and the number of training tuples is
|D|. (Hint: the answer is |D| · n · log |D|.)
9. It is required to cluster a given set of data into three clusters, where (x, y) represent
the location of the object, and the distance function is Euclidean distance. The
points are P1 (3, 12), P2 (3, 8), P3 (9, 5), Q 1 (4, 7), Q 2 (6, 4), Q 3 (7, 5), R1 (2, 3),
and R2 (5, 8). Use the k-means algorithm to show the three cluster centers after
the first round of execution, and after the final round of execution.
References
1. Fayyad U, Uthurusamy R (2002) Evolving data mining into solutions for insights. Commun
ACM 45(8):28–31
2. Han J et al (1999) Constraint-based multidimensional data mining. Computer 4:46–50
3. Ramakrishnan N, Ananth YG (1999) Data mining: from serendipity to science. Computer
8:34–37
4. Smyth P et al (2002) Data-driven evolution of data mining algorithms. Commun ACM
45(8):33–37
5. Bradley P et al (2002) Scaling mining algorithms to large databases. Commun ACM 45(8):38–
43
6. Karypis G et al (1999) Hierarchical clustering using dynamic modelling. Computer 4:68–
75
7. Jain AK et al (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323
8. Ganti V et al (1999) Mining very large databases. Computer 8:38–45
9. Ceglar A, Roddick JF (2006) Association mining. ACM Comput Surv 38(2):1–46
10. Nizar R et al (2010) A taxonomy of sequential pattern mining algorithms. ACM Comput Surv
43:1:3.1–3.41
11. Carl H et al (2013) Sequential pattern mining—approaches and algorithms. ACM Comput Surv
45(2):19:1–19:39
12. Han J et al (2002) Emerging scientific applications in data mining. Commun ACM 45(8):54–58
Chapter 18
Information Retrieval
18.1 Introduction
The problem of Information Retrieval is highly relevant in the present context, where
the volume of information generated is far more than individuals can easily digest.
Consequently, it is extremely difficult to search, locate, and disseminate precisely
the desired information from the storage media, irrespective of whether it is local
or globally distributed. Ever-increasing information needs to be continually added
to the storage systems. This process has been fueled further by the arrival of the
Internet and the World Wide Web, as well as digital libraries, research publication
repositories, and electronic editions of newspapers, journals, and magazines.
Given a set of documents (or an information collection) and an information need,
Information Retrieval (IR) identifies the documents or other units of information in
the collection that are relevant to that particular information need. The information
need is in the form of a set of questions one is interested in finding answers to. The
aim of IR is to locate the document or file which holds the desired information.
Sometimes it is also required to locate the actual position of the required information
within the selected document, when the document is of a large size.
Here are some examples of IR tasks: finding an article in the Times of India that
discusses the freedom struggle of India; searching the recent postings in Twitter that
are related to a particular model of smart phones; finding the entries referring to
butterflies in an online encyclopedia, etc. [2].
The early IR methods were based on keywords manually assigned to the documents,
and required complicated Boolean queries. These methods were primarily used by
retrieval experts. However, in the 1970s, automatic indexing and natural language
queries gained popularity, and the IR facility became more and more available to
non-experts. Documents were commonly indexed by automatically treating all the
terms in them as independent words, and were represented using this set of keywords,
known as the Bag-of-Words (BOW) representation. Query formulation also became
simplified, taking the form of short natural language statements. This added noise
to the documents' representation, but the basic methodology for indexing remained
the same. Consequently, non-expert users faced increasing trouble, due to the
“vocabulary problem”: the keywords they chose were often different from those
used by the authors of the relevant documents, so the systems' recall¹ rates came
down. In other cases, the contextual differences between ambiguous keywords were
not properly resolved by the BOW approach, which reduced the precision of the
results. These two problems are generally referred to as synonymy and polysemy,
respectively [16].
To solve the problem of polysemy,² automatic Word Sense Disambiguation
algorithms help to disambiguate the documents' contents as well as the query. The
disambiguation algorithms make use of resources like the WordNet thesaurus, text
corpora, or co-occurrence data to find the possible senses of a word and to map word
occurrences to the correct sense. The disambiguated senses are used in indexing and
in query processing, which helps to retrieve only the documents that match the
correct sense. Unfortunately, the inaccuracy and errors of automatic disambiguation
are sometimes more harmful than not using disambiguation at all [5].
The models of information retrieval assume each document is described by a set
of representative keywords, called index terms. An index term is a word from a
document which is supposed to represent the semantics of the document. In general
the index terms are nouns, as nouns carry the maximum meaning of a sentence. The
index terms in a document are not all equally important: some index terms describe
the given document better than others. An index term which appears in every
document in a given set of documents is of no use. However, a term which appears
in only, say, five documents out of thousands of documents should distinctly identify
those few documents.
¹ Precision and recall are parameters which represent the performance of an IR system.
² Polysemy is an ambiguity of an individual word or phrase that can be used (in different contexts)
to express two or more different meanings.
[Figures: a Venn diagram showing the retrieved documents and the relevant documents within the set of all documents, their intersection being the relevant retrieved documents; and a typical precision–recall curve, with precision falling as recall rises from 0 to 1]
• Probabilistic retrieval. A probability is computed for each term in the collection
of documents. For terms that match between a query and a document, the similarity
measure is computed as a combination of the probabilities of each of the matching
terms. This similarity is a measure of the relevance of the query to the document.
• Inference networks. A Bayesian network is used to infer the relevance of a
document to a query; the latter is a measure of similarity between the query and
the document.
• Fuzzy set-based retrieval. In this IR approach, a document is mapped to a fuzzy
set.³ For example, rain/0.9 versus rain/0.2 shows that in the first case the meaning
of the keyword rain indicates heavy rain, while in the second set it shows a very
mild rain [7].
The Boolean model for IR is a simple information retrieval model based on set theory and Boolean algebra. Since the concept of a set is quite intuitive, the Boolean model provides a framework that is easy for a common user of an IR system to grasp, as well as simple to implement. Also, Boolean queries, in the form of Boolean expressions, are precise in their semantics [14].
The Boolean model considers that the index terms, which form the representation of documents, are either present or absent in a document. Consequently, the index-term weights are taken as binary (1 for presence and 0 for absence); i.e., for a document $D_i$ the index term $t_j$ is either present or absent, with weight $w_{ij} \in \{0, 1\}$. A query Q is composed of index terms connected through the binary operators ∧, ∨, ¬ (and, or, not). A query can be expressed in conjunctive normal form (CNF) or disjunctive normal form (DNF). For example, the query $Q = t_a \land (t_b \lor \lnot t_c) = (t_a \land t_b) \lor (t_a \land \lnot t_c)$ can be written in DNF as

$$\vec{Q} = (1, 0, 0) \lor (1, 1, 0) \lor (1, 1, 1),$$

where each disjunct is a binary weight vector corresponding to the tuple $(t_a, t_b, t_c)$. Figure 18.3 represents this query using a Venn diagram.
IR is a process of matching the patterns in the user inquiry with the patterns in the prospective text documents. If the inquiry words are taken as a set of words X, and the text document words are taken as a set of words Y, then IR amounts to finding a binary relation over X × Y. However, words may appear in different morphological forms. It is therefore necessary that, before the matching is performed, the words in the inquiry as well as the words in the text document are reduced to their basic forms, called stem words, by a process called stemming.
Consider that X is a Boolean set of inquiry (or query) keywords and Y is a Boolean set representing the keywords in the document. With this, a binary relation from X to Y can be represented as a subset R ⊆ X × Y.
3 A set that contains not only the elements but also a number associated with each element, indicating the degree of membership of that element in the set.
18.4 Vector Space Model

The vector space model determines the measure of similarity (sim(Q, Di )) between
a query and a document using a vector that represents query Q, and document Di .
The model is based on the intuitive notion that meaning of a document is conveyed
by the words present in the document. Figure 18.4 represents the concepts of vectors
for a query Q and documents D1 , D2 , . . . , Dn . Here t is the total number of terms
in the collection [12].
[Fig. 18.4: the query and each document are represented as t-dimensional term vectors, e.g., document D2 = (D21, D22, ..., D2t), ..., document Dn = (Dn1, Dn2, ..., Dnt)]
The vector space model is based on a method which compares how close the query
vector is to the document vector. The traditional method of determining the closeness
between two vectors is the angle between them, where the angle is computed using
dot products of two vectors. In our context here, the similarity coefficient is used
instead of the angle. If a term is present, a 1 is placed in the corresponding position in the vector, else a 0. To account for multiple occurrences of a term in the document, the frequency of that term is used instead of merely its presence/absence. Thus, for a query ⟨a, b⟩, two document vectors ⟨1, 0⟩ and ⟨5, 1⟩ indicate that in the first document the terms a and b have frequencies of 1 and 0, respectively, while in the second these frequencies are 5 and 1, respectively.
The similarity between the query and the documents can be computed via the distance between the query vector and each of the document vectors. If a document has the same vector as the query, their distance is minimal and the similarity is highest.
Instead of merely specifying the list of terms in a query, a user is often interested in indicating that certain terms are more important than others. One approach is for the user to assign a higher weight to specific terms manually. The other approach is to automatically assign a weight equal to the number of times a term appears in a document (the term's frequency). But if a common term
appears in every document, it cannot be used to distinguish a rare document from the
rest. For example, if a term appears only in very few documents, say two documents
comprising such a term out of a thousand documents, then that term is an important
criterion to distinguish those two documents as relevant to that term. In this case,
the similarity is proportional to the ratio of the total number of documents (d) to the
documents count (say d f j , the document frequency) which have the occurrence of
this rare term. The ratio is called inverse document frequency, id f .
The important terms encountered in the above discussion are defined and listed below:

t: the number of distinct terms (keywords) in the entire collection of documents.
$tf_{ij}$: term frequency, the number of occurrences of the term $t_i$ in the document $D_j$.
$df_i$: document frequency, the number of documents that contain the term $t_i$.
$idf_i$: inverse document frequency, equal to $\log(d/df_i)$, where d is the total number of documents.
Each document is represented as a vector, with a total t number of components,
and each entry in the vector corresponds to a distinct term from the entire collection
set. In addition, each component in a vector is filled with weights computed for the
corresponding term. This weight is automatically assigned based on how frequently
the term has occurred in the entire documents collection, and another weight is based
on how often the term has appeared in the particular document. The weight of a term
in a document increases, the more often it appears in that document, and decreases,
the more often it appears in other documents.
The weight of a term in a document is defined as a combination of term frequency
t f , and the inverse document frequency id f . Each term has a position in the vector,
if it is present in the document, that position is marked with the term's weight, else it is marked with weight 0. To compute the weight $dw_{ij}$ of a term $t_i$ in a document $D_j$, the following tf.idf equation is used:

$$dw_{ij} = tf_{ij} \times idf_i. \qquad (18.2)$$

The similarity between a query Q and a document $D_j$ is then computed as

$$sim(Q, D_j) = \sum_{i=1}^{n} qw_i \times dw_{ij}, \qquad (18.3)$$

where $qw_i$ is the weight of the term $t_i$ in the query Q, and n is the number of terms.
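The following sketch (with a hypothetical toy corpus) illustrates the tf-idf weighting of Eq. (18.2) and the similarity of Eq. (18.3):

```python
import math
from collections import Counter

# Toy corpus (hypothetical); each document is a list of index terms.
docs = {
    "D1": ["rain", "umbrella", "rain"],
    "D2": ["rain", "sun"],
    "D3": ["sun", "beach", "sun"],
}
d = len(docs)                                   # total number of documents
vocab = sorted({t for ts in docs.values() for t in ts})
df = {t: sum(t in ts for ts in docs.values()) for t in vocab}
idf = {t: math.log(d / df[t]) for t in vocab}   # idf_i = log(d / df_i)

def weights(terms):
    tf = Counter(terms)
    return {t: tf[t] * idf[t] for t in tf}      # dw_ij = tf_ij * idf_i

def sim(q_w, doc_w):                            # Eq. (18.3): sum over matching terms
    return sum(w * doc_w.get(t, 0.0) for t, w in q_w.items())

doc_w = {name: weights(ts) for name, ts in docs.items()}
q_w = weights(["rain", "umbrella"])
for name in sorted(docs, key=lambda n: -sim(q_w, doc_w[n])):
    print(name, round(sim(q_w, doc_w[name]), 3))   # documents ranked by similarity
```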
18.5 Indexing
The vector space model and other retrieval strategies make use of an inverted index file structure, to avoid a lengthy search through the keywords of every document for which relevancy is to be established. Instead of searching within each document, an inverted index is generated in advance for all the keywords in the document set. An entry for each of the n terms (t1, ..., tn) is stored in an index structure, like the one shown in Fig. 18.5. For each term ti, a pointer references a linked list, which contains an entry for each document containing this term, along with the term's frequency in that document. For example, row i contains entries (dj, tfij) forming a linked list for term ti, where dj is a document in which the term ti has frequency tfij. The figure shows that the term ti has a frequency of 7 in document d3. This indexing structure has the advantage that the retrieval system can quickly locate a term, as well as the documents in which the term appears [18].
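A minimal sketch of constructing such an inverted index in Python, for hypothetical documents:

```python
from collections import Counter, defaultdict

# Build an inverted index: term -> list of (document id, term frequency),
# mirroring the structure of Fig. 18.5.
docs = {
    "d1": "the lazy dog sleeps",
    "d2": "the quick brown fox is quick",
    "d3": "the lazy fox",
}
index = defaultdict(list)
for doc_id, text in docs.items():
    for term, tf in Counter(text.split()).items():
        index[term].append((doc_id, tf))

print(index["quick"])   # [('d2', 2)] -- found without scanning every document
```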
A key challenge in the construction of an index is the size of the data involved in the index itself (see Fig. 18.5): the index is a dynamic data structure, of the kind typically used for cross-reference generation, but it cannot be kept together in the memory of a typical system due to its size. The task to be performed here is effectively a matrix transposition, given that the document-term matrix is very sparse, and such matrices are not easy to manipulate directly as arrays. Therefore, the index construction makes use of a merge-based approach: partial indexes (runs) are built in memory and written to disk as the memory fills.
[Fig. 18.5: the inverted index structure; each term points to an inverted list of ⟨document, term frequency⟩ pairs, e.g., the list for term tt begins with (d7, 1)]
When all the documents have been processed, the runs available in the disk are
merged to get the final index. The merging process builds the final vocabulary on the
fly. Since the runs from the disk are read (into a buffer) for merging, a sufficiently
large size of the buffer will reduce the disk accesses, hence making the process further
faster. However, if the free disk space is limited, the final index can be written back to the disk space occupied by the runs, progressively as they are processed; this helps in representing the final inverted lists more efficiently. The latter becomes possible because the final index is typically smaller than the runs, since in it the vocabulary information is not duplicated.
The index construction using the merge-based approach is common and practical for data collections of all sizes. It is scalable and operates efficiently in a memory size as small as 100 MB. The disk-space overhead can be limited to a small fraction of the final index; the method requires only one parsing pass over the data, and it can be extended from keyword indexing to phrase indexing.
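The merge step itself can be sketched as follows; here the "runs" are small in-memory lists standing in for the sorted partial indexes that would be read back from disk:

```python
import heapq

# Each run is a sorted partial index of (term, doc_id, tf) triples, as would
# be written to disk when memory fills up; merging them streams out the
# final inverted lists in order, building the vocabulary on the fly.
run1 = [("dog", "d1", 1), ("lazy", "d1", 3)]
run2 = [("dog", "d3", 1), ("quick", "d2", 2)]
run3 = [("lazy", "d2", 2), ("quick", "d3", 1)]

final_index = {}
for term, doc_id, tf in heapq.merge(run1, run2, run3):
    final_index.setdefault(term, []).append((doc_id, tf))

print(final_index["dog"])    # [('d1', 1), ('d3', 1)]
```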
Inserting a new document into the text collection amounts to appending a few bytes to the end of every inverted list for a term in that document, i.e., the cost depends on the size of the document. Since a document may consist of hundreds to thousands of terms, such an insertion requires fetching and extending hundreds of inverted lists, and it may require 10–20 s in the worst case to rebuild the inverted lists for the new collection of documents. By contrast, the merge-based inversion approach can index thousands of documents per second, making it almost 10,000 times faster [12].
For fast insertion of terms into the inverted list, the disk-resident part of the list should not be accessed frequently, otherwise the overall speed is reduced. The practical solution is to amortize the update cost over a sequence of insertions. There are many properties of text databases which allow strategies for such cost amortization. The new documents need not be immediately available for searching; if they must be searchable, they can be made available through a temporary in-memory index, that is, the last subindex in the merging.
There are three broad categories of approaches for updating the index as new documents are added to the collection: rebuilding from scratch, merging an existing index with an index of the new documents, and incremental update.
Rebuild
There are applications where the index is not updated at all, but it is rebuilt from
scratch at some regular intervals. The presence of new documents is established
through crawling, and an update is not immediately needed for these documents.
This approach is considered economical even for gigabytes of data, where rebuilding
takes just a few minutes, rather than updating the existing index, which anyway will
require fetching the document as well as the index to be updated.
Intermittent Merge
A number of inverted lists can be maintained for the document collection in the memory of a system. With the lists in memory, it is easy (less complex) to insert new
documents into these lists when the new documents are discovered. This is carried
out using Algorithm 18.2 discussed on page 567, through the process of merging the
indexes. Alternatively, the existing documents’ index remains in a standard inverted
file, and new documents are indexed as an inverted file data structure in the memory.
With this, the two indexes can share a common vocabulary. In this process, both the
indexes are made available to queries, and the result of any query is the union of the
query results from both of these inverted indexes. When the size of the in-memory
index crosses a threshold size, it is merged into the index file on the disk.
Other approaches for indexing use incremental update, or a combination of alternative indexing strategies: suffix arrays, wavelet trees, Bayesian inference, predicate-based indexing, and probabilistic indexing [14].
18.6 Probabilistic Retrieval Model
[Figure: queries and documents are mapped, through the functions α and β, to their representations and object descriptions; the relevance function r maps query and document descriptions into the set H]
In fact, an IR system does not directly handle the documents and the queries,
but handles them indirectly, in the forms of their representations. For example, a
document is represented in the form of index terms, and a query is represented
as a Boolean expression comprising terms and Boolean operators (see page 562
for more detail). Let us assume that Q and D are representations of queries and
documents, respectively. These representations have a mapping from the original
query and documents through some functions, which are expressed as $\alpha_Q : Q \to Q'$ and $\alpha_D : D \to D'$. Hence, if two different documents are represented with the identical set of index terms, they will be mapped onto the identical representation.
A further mapping is introduced from the representations ($Q'$ and $D'$) to object descriptions, to make the model more general. This is done by supplementing the index-term forms of the queries' and documents' representations with a weight, which is a real number. Let us assume that these object descriptions, for the query set and document set respectively, are $Q''$ and $D''$, with the mappings $\beta_Q : Q' \to Q''$ and $\beta_D : D' \to D''$. Due to the introduction of weights, the new mapping expresses a more accurate relevance relation between the queries and their descriptors, and similarly for the documents and their descriptors. Therefore, the more precise form of the relevance function r is $r : Q'' \times D'' \to H$.
For a query $q_i \in Q''$ submitted to the IR system, the documents $d_j \in D''$ are ranked in decreasing order of $r(q_i, d_j)$, such that the document with the highest relevance rank is at the top. The job of an IR system that ranks the documents by their relevancy for a query $q_i$ is thus to calculate the relevance of, and rank, every document $d_j \in D''$. Often, for the sake of simplicity, the description and the representation are treated as the same, and both are represented as a set of terms [4].
18.7 Fuzzy Logic-Based IR

The fuzzy retrieval technique is based on fuzzy set theory and fuzzy logic, an extension of classical set theory. The technique is based on the idea that the word matching between the inquiry word set and the text word set should not be limited to perfect matches with the stem words: since the words in the inquiry may also match synonyms in the text documents, the matching should be graded, depending on the degree or level of matching, in the range from 0 to 1. The extremes of this range correspond, as a special case of fuzzy logic, to Boolean matching. Fuzzy logic is more realistic than Boolean logic, simply because it considers the exact as well as the vague matching, the latter being more frequently encountered in the real world [7].
In the relations over fuzzy sets, the elements of two sets have a degree of asso-
ciation as a form of relation rather than simply—related (binary 1) or not related
(binary 0). The degree of association ranges from 0 to 1, where 0 indicates the total
absence of the relation and 1 indicates its total presence; therefore, a fuzzy relation between the query keyword set and the document keyword set is

$$R : X \times Y \to [0, 1] \qquad (18.4)$$

and the value R(x, y) ranges from 0 to 1 depending on how closely $y \in Y$ is associated with $x \in X$.
Information Retrieval Using Fuzzy Sets
The membership value $R(x_i, y_j)$ specifies, for each $x_i \in X$, $y_j \in Y$, the grade of relevance of the inquiry term $x_i$ to the document term $y_j$. The grade of relevance depends on many factors [3]:
1. the position of the term $y_j$ in the text document: if the document is a research article and $y_j$ appears in the list of keywords, in the abstract, or in the conclusion, the relevance is higher;
2. the frequency of occurrence of $y_j$ in the document;
3. whether $x_i$ and $y_j$ are terms formed from the same basic stem word; and
4. whether $y_j$ is a synonym of $x_i$: the proximity of the meanings of $x_i$ and $y_j$ decides how close $R(x_i, y_j)$ is to 1.
Criteria (1) and (2) above are user-defined, and they can be programmed in the implementation according to the user's needs. The stem-word criterion (3) requires a data structure similar to the one shown in Fig. 18.7, which helps to locate the stem word for a given word; the stem word is then substituted in the original text before the retrieval technique is applied [4].
Another, more often used approach to stemming is a stemming algorithm, e.g., Porter's stemmer, which takes advantage of certain patterns in words to obtain the stem. For example, for words ending in "ing", the stem can be obtained by removing the "ing" part, like "book" from "booking". Similarly, removing "ed" at the end of the past tense of a verb yields its stem, say "book" and "look" from "booked" and "looked", respectively. Along with these, there are some more complex patterns, like "goose" from "geese", etc.
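The following toy stemmer illustrates the suffix-stripping idea; it is far simpler than Porter's actual algorithm, and it handles the irregular case through an explicit exception table:

```python
# A toy suffix-stripping stemmer in the spirit of (but far simpler than)
# Porter's algorithm; irregular forms such as geese -> goose need an
# exception table rather than suffix rules.
IRREGULAR = {"geese": "goose"}

def stem(word):
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in ("ing", "ed", "s"):
        # only strip if a reasonably long stem remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["booking", "booked", "looked", "geese", "book"]])
# ['book', 'book', 'look', 'goose', 'book']
```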
Another important relation for IR, based on criterion (4) above, is the fuzzy thesaurus, which plays a pivotal role in fuzzy IR. The fuzzy thesaurus shows the relationship between pairs of words based on their centrality or degree of relevance. An entry of a fuzzy thesaurus (T) has the structure

⟨WC1, WC2, RD⟩

where WC stands for word category and RD is the degree of relationship between the words WC1 and WC2. A typical example is given below (Eqs. 18.8–18.10).
The inquiry Q is augmented through composition with the thesaurus T, giving the augmented inquiry A:

$$A = Q \circ T, \qquad (18.5)$$
and the degree of fuzzy matching between the augmented inquiry and a text is

$$F = A * R \qquad (18.7)$$

where "∗" is a matching operator, which evaluates the degree of fuzzy matching by multiplying the fuzzy membership of the augmented inquiry with the fuzzy membership of the corresponding words in the text. Finally, the relevance measure of the text with respect to the inquiry under consideration is obtained by summing all the values of fuzzy matching for the text.
For example, consider an inquiry over the terms $x_1, x_2, x_3$:

$$Q = [\,1 \;\; 0.6 \;\; 0.8\,] \qquad (18.8)$$
where 1, 0.6, and 0.8 are called the centralities of x1 (Einstein’s), x2 (Scientific), x3
(Theory of Relativity), respectively. The centrality indicates the presence of certain
qualities, whose computations are modeled as a computation of fuzzy membership
degree. The relevant part of the fuzzy thesaurus T , restricted to the support of Q, is
given by the matrix:

$$T = \begin{array}{c|cccccc}
 & x_1 & x_2 & x_3 & x_4 & x_5 & x_6 \\ \hline
x_1 & 1 & .6 & .9 & 0 & .1 & 0 \\
x_2 & .6 & 1 & .8 & .6 & .5 & .8 \\
x_3 & .9 & .8 & 1 & 0 & 0 & .7
\end{array} \qquad (18.9)$$
Note that the term pairs $(x_1, x_1)$, $(x_2, x_2)$, and $(x_3, x_3)$ each have a fuzzy matching of 1. Using (18.5) with the data given by Eqs. (18.8) and (18.9), we get

$$A = [\,1 \;\; .8 \;\; .9 \;\; .6 \;\; .5 \;\; .7\,]. \qquad (18.10)$$
The values in the augmented query A are obtained through max–min composition, $A_j = \max_i \min(Q_i, T_{ij})$; for instance, the third element ($A_3$), i.e., 0.9, is computed as

$$A_3 = \max\{\min(1, .9),\; \min(.6, .8),\; \min(.8, 1)\} = \max\{.9, .6, .8\} = 0.9.$$
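A short sketch of this computation, assuming the max–min composition for the "∘" operator; it reproduces the full vector of Eq. (18.10):

```python
# Max-min composition A = Q o T (Eq. 18.5): A_j = max_i min(Q_i, T_ij).
Q = [1.0, 0.6, 0.8]                     # centralities of x1, x2, x3 (Eq. 18.8)
T = [                                   # fuzzy thesaurus rows for x1..x3 (Eq. 18.9)
    [1.0, 0.6, 0.9, 0.0, 0.1, 0.0],
    [0.6, 1.0, 0.8, 0.6, 0.5, 0.8],
    [0.9, 0.8, 1.0, 0.0, 0.0, 0.7],
]
A = [max(min(q, row[j]) for q, row in zip(Q, T)) for j in range(6)]
print(A)   # [1.0, 0.8, 0.9, 0.6, 0.5, 0.7] -- matches Eq. (18.10)
```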
18.8 Concept-Based IR
The keyword-based approach we discussed above is also called the BOW (Bag-of-
Words) approach. The concept-based information retrieval makes use of semantic
concepts for the representation of the documents and queries, instead of (or in addition
to) keywords; and the approach performs the retrieval in the concept space of query
and documents. Hence, this retrieval model is less dependent on the specific terms
(keywords) used, and yields a match even when the same notion is described by a
different set of terms in the query or in the target document, or in both. This helps
in eliminating the problem of synonymy. Since more relevant documents are likely
to be retrieved from the store, the recall rate increases. Similarly, if the concepts chosen for the words, particularly the ambiguous words, in the query and documents are accurate, the non-relevant documents that were retrieved with the BOW approach can be eliminated from the results. Since fewer non-relevant documents now appear among the retrieved documents, the precision rate increases. Note that in the keyword-based approach to IR, the share of non-relevant documents among the retrieved documents increases when the keywords have multiple meanings, e.g., "bank" (of a river, and of money). The problem of a word having many senses is called polysemy.
The concept-based methods can be characterized by the following three
parameters:
1. Concept representation. The concept-based IR makes use of real-life concepts
that closely resemble human perception. The concepts are based on the language
of the text of documents and queries.
2. Mapping method used. It is the method used for mapping the natural language text
to concepts. In fact, the ideal mechanism is the manual approach, which builds a
hand-crafted ontology of concepts along with a list of words to be assigned to each
concept. However, this approach is inefficient and time consuming. The automatic
mapping between concepts and words can also be done through machine learning,
but with loss of accuracy.
3. Use in IR. The concepts are used during the indexing and retrieval phases of the
documents. A simpler but less-accurate method applies the concept analysis in
one stage only. For example, it is better to apply the concept analysis in the query
expansion rather than document retrieval.
A large and diverse knowledge repository, like Wikipedia, can be used to create a powerful concept ontology, well-suited for concept-based IR. Wikipedia's wide-ranging and exhaustive coverage of topics, coupled with automatic ontology-building capability, can be exploited to build a highly fine-grained ontology. Additionally, an inverted index provides a mapping from the natural language terms of the entire Wikipedia collection to concepts, according to the context and sense of the terms. This entire process produces a powerful classifier that automatically maps any given text fragment to concepts of the ontology [9].
[Fig. 18.8: Wikipedia articles and the words in them (word1, ..., wordn) are processed into a weighted inverted index that maps each word to pairs ⟨ci, wi⟩, ⟨cj, wj⟩, ... of concepts and association weights]
associated with. Thus, each word appearing in the Wikipedia corpus can be seen as triggering one or more of the concepts $c_j$ it points to in the inverted index. A weight $w_j$ attached to the concept represents the degree of association between that word and the concept. This process is illustrated in Fig. 18.8, which shows how the semantic analysis of a Wikipedia article is carried out.4 The articles and the words in the articles are processed to build a weighted inverted index, which represents each word of an article as a vector in the space of all Wikipedia concepts, i.e., of the articles in Wikipedia itself.
In concept-based indexing, each document (i.e., article) in the corpus is mapped to a vector of weighted concepts, represented by ⟨c1, w1⟩, ⟨c2, w2⟩, ..., ⟨cn, wn⟩. Like Bag-of-Words (BOW) vectors, the concept-based vectors are sparse, so the concept weights are zero for the large majority of Wikipedia concepts. Since every word in the document to be indexed may be related to a large number of concepts, a document containing a collection of words is likely to be related to an even larger number of concepts; indexing the entire list of related concepts for every document is therefore not feasible. Hence, only the concepts having the highest weights (for example, "Bazaar", "Economy", etc., in Fig. 18.8) are used for indexing. In a sorted representation of the weighted vector, this subset of concepts is simply its prefix.
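A sketch of this indexing step, with a hypothetical word-to-concept weighted inverted index; only the s strongest concepts (the prefix of the sorted vector) are kept:

```python
# Map a text fragment to its strongest concepts using a hypothetical
# weighted inverted index (word -> {concept: association weight}); only
# the s highest-weighted concepts are retained for indexing.
word_to_concepts = {
    "market": {"Bazaar": 0.8, "Economy": 0.6},
    "price":  {"Economy": 0.9, "Auction": 0.3},
}

def concept_vector(words, s=2):
    scores = {}
    for w in words:
        for concept, weight in word_to_concepts.get(w, {}).items():
            scores[concept] = scores.get(concept, 0.0) + weight
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return ranked[:s]    # the prefix of the sorted weighted vector

print(concept_vector(["market", "price"]))  # e.g. [('Economy', 1.5), ('Bazaar', 0.8)]
```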
It is more difficult to map long documents in full into the concept space. For example, a small part of a long document might be relevant to a given query, but the semantics of this small part are not likely to be fully represented in the concept vector of the complete document. Note that a somewhat similar problem also exists in the Bag-of-Words approach, where the term frequency (tf) value is normalized to account for documents of varying lengths.
Due to the averaging effect of the representation of longer text fragments, and due to the practical limitation of using only a small subset of the concept representation, the challenge in the concept-based retrieval technique is even greater: the concepts generated for a relevant subset of a larger document may not be fully reflected in the concept vector of the document as a whole.
4 A Wikipedia article is an article about some topic; for example, we find articles in the Wikipedia collection on topics like websites, the Internet, the WWW, etc.
The retrieval algorithm (Algorithm 18.4) has a single parameter s that controls the cutoff of the query concept vector. The value of s can be chosen the same as that used during indexing, though this is not necessary. If the entire corpus is indexed with large cutoff values, the resulting computation and storage costs will be high, hence it is not advisable to index the entire corpus with large cutoffs. Since a query is much smaller, it does not incur the costs that the text corpus does, so a finer representation is beneficial there; it can be achieved using a higher value of s.
18.9 Automatic Query Expansion in IR
Most IR systems, particularly web search engines, have a standard user interface comprising an input box that accepts from the user a query in the form of keywords. The submitted keywords are matched against the collection of index terms to find the documents that contain those keywords. The results are then sorted by various methods. When there are many topic-specific keywords in the user's query, which accurately describe the user's information need, the system is likely to return good matches for the query. However, when the query is short, comprising 2–3 words, as is usually the case, there is likely ambiguity in the language of the query, and this simple retrieval model is sensitive to errors [1].
The most serious issue affecting retrieval effectiveness is the term mismatch problem: indexers and users often do not use the same keywords. For example, a document uses "tv" while the user submits the query with the keyword "television". This is known as the vocabulary problem. The problem is compounded by polysemy, i.e., the same word having different or multiple meanings, such as Java (the name of a language, and also the name of a place), and by synonymy, i.e., different words with identical or similar meanings, such as "tv" and "television". Synonymous words, along with word inflections (as in plural forms, "book" versus "books"), may result in a failure to retrieve relevant documents, which decreases recall. Polysemy may cause retrieval of erroneous or non-relevant documents, thus decreasing precision.
Several approaches have been proposed to deal with the vocabulary problem. The technique of query expansion is among the most natural and successful of them: the original query is expanded with other words that best capture the user's actual intent, or that simply produce a more useful query, one more likely to retrieve relevant documents.
As the use of search engines has increased over time, so has the length of users' queries. In the year 2009, an average query's length was 2.3 words. Although the number of long queries (of five or more words) per user or per session of interaction has increased, the most common queries are still those of one, two, and three words. When a query is short, the vocabulary problem is more serious, because the shortage of query terms limits the scope for handling synonymy; at the same time, the reduced context makes the effects of polysemy more severe. This motivated Automatic Query Expansion (AQE). Over the years, a number of AQE techniques have been developed that employ sophisticated methods for finding new features related to the query terms. Today,
there are firm theoretical foundations, and a better understanding of the utility and limitations of AQE: for example, which critical parameters affect the performance of the IR system, which types of AQE are useful and which are not, etc.
Along with AQE, the basic techniques are increasingly used in conjunction with mechanisms that increase their effectiveness, like method combinations, dynamic selection of the information source used for expansion, and discriminating policies for the application of methods. These advances have been supported by many experimental findings. The AQE methods have gained popularity due to the evaluation results obtained in the Text REtrieval Conference (TREC) series.
Document Ranking using AQE
The IR systems including search engines depend on computing the importance of
terms occurring in the query and documents to determine the relevance of docu-
ment(s) to the queries. The commonly used indicator of relevance is the similarity
measure between the query and the document. Considering that a query is repre-
sented by q and a document by d, the similarity measure between them, sim(q, d),
is expressed by
$$sim(q, d) = \sum_{t \in q \cap d} w_{t,q} \times w_{t,d} \qquad (18.13)$$
where wt,q and wt,d are the weights of term t in a query q and a document d,
respectively, according to some weighting criteria adopted. The weight of a term t
in a document is typically proportional to the term frequency (t f ) in that document
and to the inverse document frequency (id f ). The purpose of id f is to diminish the
effect of very frequent terms like “the” for which we are not interested to find the
relevance. But we want to increase the effect of terms, which occur rarely, in a few
documents only, for example, “blue” and “bird”, which might be occurring only in
a few documents. If N is the total number of documents and n is the number of
documents in which the rare term occurs, the id f for this term is log( Nn ). Note that
if a term like “the” occurs in every document, then id f is zero, hence it will not
contribute to the similarity measure.
The similarity computation between a document and a query representation, expressed in Eq. (18.13), can easily be modified to work for query expansion by abstracting away from the original underlying weighting model. In this revised model, the basic inputs to the AQE module are (1) the original query q, and (2) the source of data from which the expansion terms and their weights are computed. The output of the AQE module is the query q′ formed by the expansion of the query terms, together with the associated weights w′. These new weighted query terms ⟨q′, w′⟩ are used for computing the similarity between the expanded query q′ and the original document d. The new similarity computation is expressed by

$$sim(q', d) = \sum_{t \in q' \cap d} w_{t,q'} \times w_{t,d}. \qquad (18.14)$$
The data source commonly used for generating new query terms is the collection itself, the one into which we are searching, and sometimes a thesaurus, used to obtain synonyms of the keywords. The simplest approach to weighting the query expansion terms is to use the same weighting function that is used by the ranking system. When more complex features, like phrases, are used for query expansion, the underlying system must be able to handle such features. An example is "Jodhpur departmental store", which, though a phrase, can be treated as a single term for the purpose of indexing.
In the following we address some new areas, in addition to document ranking,
where AQE is used heavily; these are: question answering, multimedia IR, informa-
tion filtering, and cross-language IR. This introduction is followed by pointers to
more recent applications [1].
Question Answering
The goal of question answering (QA) is to provide concise responses (instead of
full-length documents) to certain types of natural language questions, such as “Who
built the Taj Mahal?” Like the Document Ranking (DR), the QA is faced with a
fundamental challenge of mismatch between the vocabularies of the question and
answer.
To improve on the early stage of QA, which is document retrieval, a common
strategy used is to expand the original question terms with terms that are expected to
appear in documents containing answers to this question. One important goal of QA
translation is to learn the associations between question words and the answer words,
which may be synonyms to the words in question. For example, for the question,
"Where is the Taj Mahal located?", the answer part embedded in a document may be "Taj Mahal is in Agra," or "Taj Mahal is situated in the city of Agra." Here, the word "situated" (and the phrase "is in") takes the place of the word "located". Different resources for AQE for QA include lexical ontologies like WordNet, a lexical database of synonym sets (synsets), and semantic parsing of questions based on roles, among other criteria [7].
Multimedia Information Retrieval
With the widespread use of digital media and digital libraries, the requirement of searching multimedia documents like images, speech, and video has become important. Until recently, multimedia IR systems performed only text-based search over the media metadata, like the annotations and captions surrounding the media in HTML/XML descriptions. However, when these metadata are absent, this method is not usable, and the IR relies on some form of multimedia content analysis, implemented in combination with AQE techniques. For example, in spoken document retrieval systems, a transcription is produced by an automatic speech recognition system and augmented with related terms in advance of query time. Since automatic speech transcriptions often contain mistakes, this form of document expansion is very useful for spoken document retrieval.
For image retrieval systems, a typical approach is making use of query examples
having visual features, like colors, textures, and shapes. The query is iteratively
refined through a relevance feedback.
For video retrieval systems, both the documents and queries are mostly multi-
modal, i.e., they have both textual as well as visual aspects. An expanded text query
is usually compared against the textual description of the visual concepts, and the
matched concepts are used for visual refinement.
Information Filtering
Information filtering (IF) is different from IR. It removes redundant or unwanted
information from an information stream prior to presentation to a human user. Its
main goal is the management of the information overload and increment of the
semantic signal-to-noise ratio. The documents arrive continuously and the user’s
information needs evolve over time, as per the experience of the user. Some examples
of information filtering are electronic news, blogs, e-commerce, and e-mails. There
are two main approaches to IF: 1. collaborative IF, which is based on the preferences
of like-minded users, and the other is 2. content-based IF. However, these techniques
are said to bear a strong conceptual similarity to IR, because the user profile can be
modeled as a query and the data stream can be treated as a set of collection of
documents. The user profiles (i.e., queries) are learned using relevance feedback
techniques, or other forms of query expansion, such as those based on similar users.
Cross-Language Information Retrieval
Cross-Language Information Retrieval (CLIR) is concerned with retrieving documents written in a language different from the language of the user's query. Earlier methods of such retrieval consisted of translating the query into the documents' language, and then using the standard techniques of IR. The query translation in these approaches was performed using machine-readable bilingual dictionaries, through machine translation, or using parallel corpora. However, no such translation is absolutely correct; regardless of the translation resource used, there are usually limitations due to insufficient coverage (for example, terms which cannot be translated, or do not require translation) and due to translation ambiguity from the source language to the target language.
To reduce the errors introduced by translation, one standard technique is query expansion: even when the translation is imperfect, the use of semantically similar terms yields better results than the literal translation of terms alone. The query expansion can be applied before the translation of the query, after it, or at both places. It has been found that query expansion before the translation of a query yields better results than doing it after the translation, and expansion done at both places provides even better results.
[Fig. 18.9: the steps of automatic query expansion: data source(s) undergo data preprocessing; candidate expansion features are generated and ranked from the query and the intermediate features; the expansion features are selected; and the query is reformulated into the expanded query]
The Automatic Query Expansion (AQE) is performed in a number of steps, the major
steps are
1. Preprocessing the source text,
2. Generating expansion terms (features) and ranking them,
3. Selection of expansion features, and
4. Reformulation of Query.
Figure 18.9 shows these steps of AQE.
The objective of preprocessing is to transform the raw data source used for expanding the user query into a format that can be processed more efficiently by the subsequent steps. The preprocessing task performs extraction of intermediate features, followed by the construction of data structures for easy access and manipulation of those features. The preprocessing is usually independent of the particular user query to be expanded, but it is specific to the type of data source and the expansion method used [1].
To compute the initial retrieval run, it is necessary to index the documents collec-
tion, and then run the query against this collection’s index. The process of indexing
consists of, in order, the following steps:
1. text extraction from documents, which are in a certain format, like HTML, PDF,
MS Word, where there is format information as well as text inside them,
2. tokenization of the extracted text,
3. stop-word removal from the tokenized text (removal of common words such as
articles and prepositions),
4. stemming (reduction of inflected or derivational words to their root form), e.g.,
reduce “tokenize, building, training” into “token, build, train”, etc.
5. word weighting (score is assigned to each token such that the weight reflects the
importance of the tokens in each document).
We take an example of a short HTML text fragment to illustrate the weights
associated with tokens.
In such an HTML document, first the text is extracted; then the stop words "the" and "an" are removed; then the text is stemmed using Porter's stemmer; and a weight is assigned to each word based on its frequency. Finally, we obtain the text representation:

automat 0.16, queri 0.33, expan 0.16, increase 0.16, semantic 0.16.
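The pipeline can be sketched as follows (a toy stand-in: the HTML fragment, stop-word list, and stemming rules are hypothetical, and the naive stemmer only approximates Porter's, so the exact stems differ slightly from the text above):

```python
import re
from collections import Counter

STOP_WORDS = {"the", "an", "a", "is", "of", "in"}

def naive_stem(word):
    # Stand-in for Porter's stemmer: strip a few common suffixes.
    for suffix in ("ing", "ion", "es", "s", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def preprocess(html):
    text = re.sub(r"<[^>]+>", " ", html)                 # 1. text extraction
    tokens = re.findall(r"[a-z]+", text.lower())         # 2. tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 3. stop-word removal
    stems = [naive_stem(t) for t in tokens]              # 4. stemming
    tf = Counter(stems)
    return {t: round(c / len(stems), 2) for t, c in tf.items()}  # 5. weighting

# A repeated term gets twice the weight of the others, as in the text above.
print(preprocess("<p>The query expansion increases an automatic semantic query</p>"))
```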
This is because the query terms are processed one at a time. AQE runs with expanded queries of practical size, for example, 10–20 word queries, were found to be much slower (by a factor of ten) than the original queries of 3–4 words.
The mismatch of words between queries and relevant documents has been an issue in IR for a long time. In the following, we discuss AQE with respect to alternative strategies for the vocabulary problem [1].
Interactive Query Refinement
In interactive query refinement, the system provides several suggestions for reformulation of the query, and it is the user who selects the best choice among them. With respect to the computations to be performed, Interactive Query Refinement (IQE) and AQE share the first two stages, i.e., data acquisition and candidate feature extraction; IQE does not follow the feature selection and query reformulation steps of AQE. One of the best-known examples of interactive query refinement is query completion, which offers real-time hints to complete a search query as the user types it in the input box, as we see in many search engines, including Google. IQE has the potential for better results than AQE, but it generally requires greater expertise on the part of the user. From a usability point of view, IQE provides the user with better control over query processing than AQE does.
Relevance Feedback
The relevance feedback takes two inputs: 1. results initially returned due to the given
query, and 2. feedback provided by the user about whether those results are relevant
or not. Based on these two inputs, the system submits a new query to the search engine
provided that previous results returned were not considered useful for fulfilling the
need of the user.
The features in the assessed documents are used to adjust the weights of terms in the original query and/or to add words to the query. Relevance feedback has the effect of reinforcing the system's original decision: the user steers the expanded query to look closer at the retrieved relevant documents. AQE, by contrast, tries to form a better match with the user's original intention, and does not give the user a second chance to rethink or to support/reject the results produced by the first query. The specific data sources from which the expansion features are generated in relevance feedback may be more reliable than the sources generally used by AQE; on the other hand, in relevance feedback the user must assess the relevance of the documents, which requires a better-trained user.
The relevance feedback has directly inspired one of the most popular AQE tech-
niques, called pseudo-relevance feedback, which has also provided foundations for
modeling query reformulation in a variety of AQE approaches.
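A minimal sketch of pseudo-relevance feedback in this spirit: treat the top-ranked documents of the first run as relevant and expand the query with their strongest terms (the Rocchio-like weights alpha and beta, and the documents, are hypothetical):

```python
from collections import Counter

# Pseudo-relevance feedback sketch: assume the top-k documents of the first
# retrieval run are relevant, and expand the query with their most frequent
# terms, weighted by beta; original query terms keep weight alpha.
def expand_query(query_terms, top_k_docs, n_new_terms=3, alpha=1.0, beta=0.5):
    weights = Counter({t: alpha for t in query_terms})
    pool = Counter()
    for doc in top_k_docs:                 # each doc is a list of terms
        pool.update(doc)
    for term, freq in pool.most_common():
        if term not in weights and n_new_terms > 0:
            weights[term] = beta * freq / len(top_k_docs)
            n_new_terms -= 1
    return dict(weights)

top_docs = [["tv", "television", "screen"], ["television", "hdtv", "screen"]]
print(expand_query(["tv"], top_docs))
# e.g. {'tv': 1.0, 'television': 0.5, 'screen': 0.5, 'hdtv': 0.25}
```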
18.10 Using Bayesian Networks for IR
For understanding the basics of Bayesian networks as well as Naive Bayes, the reader
may refer to the previous chapters (section 12.4, page no. 344, and section no. 14.7,
page no. 428). Using the conditional probability, the probability that a document dk
is relevant to the query q j can be expressed as P(R | q j , dk ). An accurate definition
of probability of relevance depends on the definition of relevance itself. The term relevance is to some extent subjective, and depends on many variables, which are functions of the documents, the user's information need, and the user. A perfect retrieval is not achievable in a true sense. However, it is possible to define an optimal retrieval for the probabilistic model of IR, and this optimality can be proved theoretically with respect to the representations (or descriptions) of documents and information needs [4].
Let us assume that collections of queries and documents are described by a set of
index terms. Let T = {t1 , t2 , ..., tn } be the set of terms in the documents’ collection,
and a query q j and document dk are taken as subsets of terms in T . For the sake of
retrieval, each document is described by the presence/absence of these index terms.
Therefore, any document $d_k$ can be represented using a binary vector

$$\vec{x} = (x_1, x_2, \ldots, x_n), \qquad (18.15)$$

and the probability of relevance given the description $\vec{x}$ follows from Bayes' rule:

$$P(R \mid \vec{x}) = \frac{P(\vec{x} \mid R)\, P(R)}{P(\vec{x})}. \qquad (18.16)$$
In the above,
$P(R \mid \vec{x})$ is called the posterior probability, i.e., the probability of relevance, given that the document's description is $\vec{x}$;
$P(\vec{x} \mid R)$ is called the likelihood function or probability of the evidence, which is the probability of randomly selecting a document with description $\vec{x}$ from the set R of relevant documents;
$P(R)$ is the prior probability of relevance, i.e., the probability that a randomly selected document from the entire collection is relevant; and
$P(\vec{x})$ is the probability that a randomly selected document has description $\vec{x}$; it is determined as a joint probability distribution over the n terms in the collection [6].
Equation (18.16) can be expressed in simple language as

$$P(p \mid q) = \frac{P(q \mid p)\, P(p)}{P(q)}. \qquad (18.17)$$
When the set of prior probabilities P(p) is given for the roots of a DAG that represents the document, the network can be used to compute the probability of belief associated with all the remaining nodes. Figure 18.10 shows a document di at the root of such an inference network.
[Fig. 18.10: an inference network in which the queries q1 and q2 are combined through AND nodes]
For IR using Bayes inference, a user specifies one or more topics of interest while keeping in mind some document features; the latter are used as evidence for the topics of interest. The task of IR using a Bayesian inference network is given as Algorithm 18.5, which requires building an inference network for the representation of query terms and document features (i.e., terms), and the computation of posterior probabilities (P(p|q) in Eq. 18.17) based on the prior probability of the document [6, 10].
(b) The conditional probability P( f i j |ti ) of each feature f i j is defined as the prob-
ability that feature f i j is present in the document, given that this document is
relevant to topic ti .
Given the above, the task of IR is to compute the posterior probability P(ti | f i1 ,
. . . , f im ), which is the probability that the document is relevant to ti , given that we
have observed the presence or absence of all of the features f i j , called evidences.
The Bayes rule can be directly applied for this computation. The network topology
shown in Fig. 18.11 is called Bayes inference, and can be expressed by

$$P(t_i \mid f_{i1}, \ldots, f_{im}) = \frac{P(t_i)\, P(f_{i1}, \ldots, f_{im} \mid t_i)}{P(f_{i1}, \ldots, f_{im})}. \qquad (18.18)$$
Generally, we are not interested in the absolute numerical values of the posterior probabilities, but just want to rank the documents by the posteriors. Thus, we can eliminate the denominator term $P(f_{i1}, \ldots, f_{im})$ in this equation, since the denominator remains the same as $t_i$ varies. Further, we can simplify this Bayes rule to the linear decision rule given in Eq. (18.19), where $I(f_{ij})$ is an indicator variable that equals 1.0 only if $f_{ij}$ is present in the document (and 0.0 otherwise), and w is a coefficient corresponding to a specific (feature, topic) pair:

$$g(t_i \mid f_{i1}, \ldots, f_{im}) = \sum_j I(f_{ij}) \times w(f_{ij}, t_i). \qquad (18.19)$$
[Fig. 18.11: a single-topic inference network, a topic node whose children are the feature nodes f1, f2, ..., fm]
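A sketch of this decision rule, with hypothetical feature weights w(f, t):

```python
# Linear decision rule of Eq. (18.19): score a document for a topic by
# summing the weights of the features observed in it; I(f) = 1 exactly
# when feature f is present in the document.
w = {                      # (feature, topic) -> weight, hypothetical values
    ("rain", "weather"): 2.0,
    ("cloud", "weather"): 1.5,
    ("goal", "sports"): 2.5,
}

def g(topic, doc_features, weights):
    return sum(weights.get((f, topic), 0.0) for f in doc_features)

doc = {"rain", "goal"}     # features observed in the document
for topic in ("weather", "sports"):
    print(topic, g(topic, doc, w))   # documents are ranked by these scores
```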
In the previous section, we assumed that a document being relevant to one topic does not affect one's belief about the relevance or non-relevance of that document to any other topic.
An IR network topology that removes this assumption, and explicitly represents the relationships among different topics, is shown in Fig. 18.12. We note that the Bayesian inference problem becomes more complicated with multiple topics.
The addition of relationships between topics in Fig. 18.12 requires two changes in specifying the network probabilities. First, we must specify the topics (t1, t2, ...) and compute the strength of the relevance relations between them. Second, we must compute the probability of each feature (fij) conditioned on each combination of its parent topics. Doing so gives rise to a number of probabilities that is combinatorial in size.
However, there is an approach that simplifies the multiple-topic network so that it looks like a single-topic network, whose range is all the possible presence/absence combinations of the topics t1, t2, etc. Thus, in a way, we translate the topology from one form to another. The modified topology of Fig. 18.12 is shown in Fig. 18.13, where the node S represents the compound topic. An advantage of this representation is that the same formulas (18.18) and (18.19) can still be used for the computation of probabilities. The disadvantage is that the compound topic contains $2^n$ states for n parent topics, which complicates the computations.
18.11 Semantic IR on the Web

The semantic web is a vision of the future WWW. It is an extension of the current web, in which information is given well-defined meaning, better enabling computers and people to work in cooperation. However, a universal implementation of the semantic web, a full substitution for the existing web, is still far from reality. Therefore, it is useful to have a system that analyzes documents from a contextual point of view for more accurate retrieval.
Definition 18.2 Ontology (Definition 2). An ontology defines the basic terms and relations constituting the vocabulary of a topic area, as well as the rules for combining terms and relations to define extensions to the vocabulary.

To solve the problem of polysemy,5 the concept of "word sense" from WordNet is used, which helps the user interacting with the system to associate with every concept a list of terms that are semantically related to it.
18.12 Distributed IR
When the size of the data is very large, or when it is required to support high query volumes, a single machine is not enough to support the IR load, even when a number of the enhancements discussed above are used. For example, already in mid-2004, the Google search engine processed more than 200 million queries a day against more than 20 GB of crawled data, using over 20,000 computers, and the requirement at present is a hundred times bigger. For handling large data loads, a combination of distribution and replication is necessary. Distribution means that the document collections and their indexes are split across multiple machines, so that the answer to every query is synthesized from many collection components. Replication means mirroring, which involves making enough identical copies of the system so that the required query load can be handled with an acceptable response time [18].
Document-Distributed Architectures
A very simple document-distributed architecture partitions the collection and allocates one sub-collection to each of the separate processors (see Fig. 18.14). A local index is built and maintained for each sub-collection, and when a query arrives, it is passed to every sub-collection and evaluated in parallel against every local index. The sub-collections' answer sets are then merged in some way to provide an overall answer. A main advantage of such a document-partitioned system is that collection growth can be handled by designating one of the hosts to hold the dynamic collection, so that only this host needs to rebuild its index. The computationally expensive parts of the process (searching the index, updating individual indexes, computing weights and idf for documents, etc.) are distributed evenly across all the hosts in the computer cluster.
Figure 18.14 shows a simple inverted index of a document-distributed retrieval system, with two kinds of index partition: (1) a term-based index partition, and (2) a document-based index partition. Elements of an inverted list have the format ⟨document number, term frequency⟩; for example, ⟨2, 2⟩ in the second row indicates that the term "quick" has frequency 2 in document number 2, while ⟨3, 1⟩ in the same row indicates that the term "quick" has frequency 1 in document number 3. Each of the two dashed regions in Fig. 18.14 shows one component of a document-distributed retrieval system; in this arrangement, one processor indexes all terms
[Fig. 18.14: inverted lists over document numbers 1–6, with elements ⟨document number, term frequency⟩:
the: ⟨6, 1⟩
quick: ⟨2, 2⟩, ⟨3, 1⟩
a: ⟨6, 1⟩
lazy: ⟨1, 3⟩, ⟨2, 2⟩, ⟨3, 3⟩, ⟨4, 1⟩, ⟨5, 3⟩, ⟨6, 2⟩
dog: ⟨1, 1⟩, ⟨3, 1⟩
today: ⟨4, 3⟩
A horizontal split of the rows gives a term-based index partition; a vertical split by document number gives a document-based index partition.]
that appear in the first two documents of the collection, creating a document-based
partition.
Term-Distributed Architectures
An alternative strategy for partitioning the index is term-based partitioning, where the index is split into components by partitioning the vocabulary; one possible partition is shown by the dotted horizontal split in Fig. 18.14. Each processor has full information about a subset of the terms, i.e., everything necessary to handle a query for those terms, hence only the relevant subset of the processors needs to respond to a query. Term partitioning has the advantage that it requires fewer disk seek and transfer operations during query evaluation than document partitioning, because each term's inverted list is stored contiguously on a single machine rather than being split into fragments across multiple machines.
On the other hand, each of these disk transfer operations involves more data, mainly because, in the term-partitioned arrangement we are discussing, the majority of the processing load falls on the coordinating machine; experiments have shown that it can easily become a bottleneck and starve the other processors of work.
Term-Based Versus Document-Based Partitioning
Compared to the term-based partitioning, the document-based partitioning typically
results in a better balance of workload and achieves higher throughput for queries.
Also, document-based partitioning (document distribution) allows for more natural index construction and document insertion. For a term-partitioned index, on the other hand, index construction involves first distributing the documents and building a document-partitioned index; then, once the vocabulary split has been agreed upon by the processors, the index fragments are exchanged between all pairs of processors.
The document-based partition also has the practical advantage of providing the
search service even when one of the hosts is offline for some reason, because any
answers not resident on that machine are still available to the system. For example, suppose there are 10 machines, $m_1 \ldots m_{10}$, and 100 documents $d_1 \ldots d_{100}$, with the documents distributed as follows:

machine $m_1$: $d_1 \ldots d_{10}$,
...
machine $m_{10}$: $d_{91} \ldots d_{100}$.

Now, if machine $m_5$ is offline, queries can still be answered for all documents except $d_{41} \ldots d_{50}$. Whereas, in the term-based distribution, since the index terms of every document are distributed uniformly over all the processors, a query cannot be fully answered if even one processor is offline. Moreover, in the term-distribution case, an offline or idle machine is immediately noticeable, because it affects queries involving almost every document.
Google's indexing uses document-based partitioning, with massive replication and redundancy at all levels, for example at the machine level, the cluster level, and the individual level. In addition, document partitioning remains effective even if the collaborating systems are independent and unable to exchange their index data. The distributed system makes use of a meta-searcher, with which the final answer list is synthesized from the possibly overlapping answer sets provided by a range of different services.
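A sketch of the document-distributed query flow: broadcast the query to every partition's local index and merge the ranked partial answers (the partition contents and scores below are hypothetical):

```python
# Document-distributed retrieval sketch: each host evaluates the query
# against its own local index; the merged, re-sorted union of the partial
# answer lists is the overall answer.
partitions = [
    {"d1": 0.9, "d2": 0.1},      # hypothetical per-host scores for a query
    {"d41": 0.7, "d50": 0.4},
    {"d91": 0.8},
]

def host_search(local_scores, query):
    # Stand-in for evaluating `query` against one host's local index;
    # here the scores are simply precomputed hypothetical values.
    return list(local_scores.items())

def search(query):
    merged = []
    for host in partitions:      # in practice, the hosts work in parallel
        merged.extend(host_search(host, query))
    return sorted(merged, key=lambda pair: -pair[1])

print(search("quick lazy"))      # [('d1', 0.9), ('d91', 0.8), ('d41', 0.7), ...]
```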
18.13 Summary
Recall is the ratio of the number of relevant documents retrieved to the total number of relevant documents in the entire collection, i.e., what fraction of the relevant documents have been retrieved.
The common retrieval strategies are Boolean Model, Vector Space Model, Prob-
abilistic Model, Inference networks, and Fuzzy set-based retrieval.
The Boolean model is a simple retrieval model based on set theory and Boolean algebra. The vector space model computes the measure of similarity sim(q, dj) between a query q ∈ Q and a document dj ∈ D, where q and dj are vectors for the query and document, and Q, D are the sets of queries and documents, respectively.
The probabilistic retrieval model computes the similarity measure (sim(q, d j ))
between the query q and a document d j as the probability that document d j is relevant
to q. This model estimates a query term's weight based on how often the term appears or does not appear in relevant and non-relevant documents, respectively. One class of
probabilistic approach for IR is Bayesian networks. These networks are annotated
directed graphs encoding probabilistic relationship among distinctions of interest, in
an uncertain reasoning problem.
The fuzzy retrieval technique is based on fuzzy set theory and fuzzy logic—an
extension of the classical set theory. The word matching between the query set and
the text word is not limited to the perfect matching, but, the matching is graded,
depending on the degree or level of matching, in the range from 0 to 1. In fuzzy set-based retrieval, a membership value $R(x_i, y_i)$ specifies, for each $x_i \in X$, $y_i \in Y$, the grade of relevance of the index term $x_i$ with the document $y_i$.
In a concept-based information retrieval, queries and documents are represented
using semantic concepts, instead of (or in addition to) keywords, and they perform
retrieval in that concept space. This results in a retrieval model that is less dependent
on the specific terms used, and yields matches even when the same sense is described
by different terms in the query and target documents.
The retrieval models make use of an inverted index file structure. Instead of searching through the documents, an inverted index is generated in advance for all the keywords in the document set. For each term, a pointer references a linked list, which contains an entry for each document containing that term, as well as the term's frequency in that document. The key problem with this index is that the volume of data involved cannot be held in main memory. To solve this problem, the indexes are constructed in parallel and merged at regular intervals.
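The construction of such an index can be sketched in a few lines of Python; the documents are toy examples, and each postings entry is a (document id, term frequency) pair:

# Build an in-memory inverted index: term -> [(doc_id, term_frequency), ...]
from collections import Counter, defaultdict

docs = {
    1: "the boy ate the icecream",
    2: "the dog bites the boy",
}

index = defaultdict(list)
for doc_id, text in docs.items():
    for term, tf in Counter(text.split()).items():
        index[term].append((doc_id, tf))

print(index["boy"])                   # [(1, 1), (2, 1)]
print(index["the"])                   # [(1, 2), (2, 2)]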
Yet another method for IR uses an approach based on a measure of semantic relatedness, applied to evaluate the relevance of a document with respect to a query in a given context. The approach makes use of structures like lexical chains, ontologies, and semantic networks. The semantic approach implements a context-based system, such that keywords are processed in the context of the information from which they are retrieved. This approach helps in resolving semantic ambiguity, and results in a more accurate retrieval based on the real-world interests of the user.
In most information retrieval settings, user queries are short and natural language is inherently ambiguous; consequently, the retrieval is prone to errors and omissions. The most critical language issues are polysemy and synonymy. To resolve this, the query is expanded by appending the synonyms of the query terms to the query, a technique called Automatic Query Expansion (AQE).
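As a sketch, AQE can be approximated with WordNet synonyms via the NLTK toolkit; this assumes NLTK is installed and the WordNet corpus has been downloaded, and the expansion below is deliberately naive and unweighted:

# Naive AQE: append WordNet synonyms of each query term to the query
from nltk.corpus import wordnet       # requires nltk.download('wordnet')

def expand(query):
    terms = set(query.split())
    for t in query.split():
        for syn in wordnet.synsets(t):
            terms.update(l.name().replace("_", " ") for l in syn.lemmas())
    return terms

print(expand("car price"))            # adds synonyms such as 'automobile', 'cost'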
When large volumes of data are involved, or when query volumes are high, one machine may be inadequate to support the users' query load, even when various enhancements and optimizations are carried out. To handle a heavy load of user queries, a combination of distribution and replication is required. Distribution means that the document collection and its indexes are split across multiple machines (servers), and answers to a query as a whole must be synthesized from the various collection components. Replication (or mirroring) involves making enough identical copies of the system so that the required query load can be handled with the required speed and accuracy.
Exercises
1. Show how the vector space model can be modeled using an inference network.
2. Consider a document collection of 100 documents. Given a query q, the set of documents relevant to the users is D∗ = {d4, d15, d34, d56, d98}. An IR system retrieves the following documents: D = {d4, d15, d35, d56, d66, d88, d95}
References
15. Rinaldi AM (2009) An ontology-driven approach for semantic information retrieval on the
web. Trans Internet Technol 9(3):10:1–10:24. https://doi.org/10.1145/1552291.1552293
16. Smith LC (1976) Artificial intelligence in information retrieval systems. Informat Process
Manage 12:189–222. Pergamon Press
17. Wright S (1921) Correlation and causation. J Agric Res 20:557–585
18. Zobel J, Moffat A (2006) Inverted files for text search engines. ACM Comput Surv 38(2):1–56
Chapter 19
Natural Language Processing
Abstract Although the abundant volume of natural language text in the connected world carries a large content of knowledge, it is becoming increasingly difficult for a human to sift through it and discover the knowledge or wisdom in it, specifically within any given time limit. Automated NLP aims to do this job effectively and with accuracy, the way a human does it (for a limited amount of text). This chapter presents the challenges of NLP, the progress made so far in this field, NLP applications, the components of NLP, and the grammar of the English language, the way a machine requires it. In addition, it covers specific areas like probabilistic parsing, ambiguities and their resolution, information extraction, discourse analysis, NL question-answering, commonsense interfaces, commonsense thinking and reasoning, causal-diversity, and various tools for NLP. Finally, the chapter summary and a set of relevant exercises are presented.
19.1 Introduction
languages, are finite machines: they have finite memory and finite processing power.
When we are interested in an infinite language set, say L, we can use finite devices called generating grammars to investigate the structure of L. In that scenario, the theory of language will contain the specification of a class of functions from which grammars for a particular language may be drawn. Natural language processing (NLP) is a collection of computational techniques for the automatic analysis and representation of human languages, motivated by theory. However, the automatic analysis of text on par with humans requires a far deeper understanding of natural language by machines, which is still far from reality. Many applications of NLP, like online information retrieval, aggregation, and question answering, have been based mainly on algorithms relying on the textual representation of web pages, and only to some extent on NLP. Such algorithms are very good at retrieving texts (IR), splitting them into parts, checking spelling, and word-level analysis, but they are not successful at analysis at the sentence and paragraph levels. Hence, when it comes to interpreting sentences and extracting meaningful information, the capabilities of these algorithms are still very limited.
NLP, in general, requires high-level symbolic capabilities, which include the following:
Access and acquisition of lexical, semantic, and episodic characteristics,
Creation and propagation of dynamic bindings,
Manipulation of constituent recursive structures,
Coordination of many processing and learning modules,
Identification of basic language constructs (e.g., objects and actions), and
Representation of abstract concepts.
All the above capabilities are needed to shift from mere NLP to what is usually called natural language understanding. The present approaches to NLP are based on the syntactic representation (also called syntactic structures) of text, i.e., relying on word co-occurrence frequencies. Such algorithms have the limitation that they can process only the information they can see in the text being processed, and cannot consider the background information that we humans do. For example, when we say that "Sachin Tendulkar is a good batsman," we understand this sentence due to the abundant information we have in our brains about the game of cricket, and about the successes of Sachin Tendulkar in many cricket matches. However, the present algorithms do not have all this background with them; hence their understanding of any given text is limited.
As human text processors, we do not have such limitations: every word we come across in a text activates a cascade of semantically related concepts, relevant episodes, and sensory experiences, all of which enable the completion of complex NLP tasks, such as word sense disambiguation, textual entailment, and semantic role labeling, in a quick and effortless way.
Many new computational models attempt to bridge this cognitive gap by emulating the processes recognized as being part of the human brain and used by humans for language processing. These approaches depend on semantic features
that cannot be explicitly expressed through the text. The computational models are
useful for theoretical purposes, e.g., scientific studies, such as exploring the nature
of linguistic communication and its properties, as well for practical and industrial
applications, such as enabling effective human–machine communications.
Challenges of NLP
Developing a program that understands natural language is a difficult problem. Natural languages are large: they contain infinitely many sentences. Also, there is much ambiguity in natural language. Many words have several meanings, such as can, bear, fly, and orange, and the same sentence often has different meanings in different contexts. Due to this, the creation of programs that understand a natural language is a challenging task.
The syntax of a language helps us decide how the words are combined to make larger meanings. For example, in the sentence "the dealer sold the merchant a dog," it is important to be clear about what is sold to whom. Some common examples are the following:
I saw the Golden gate bridge flying into San Francisco.
(Is the bridge flying?)
I ate dinner with a friend.
I ate dinner with a fork.
Can companies litter the environment
(Is this a statement or question?)
Finally, assuming that we have overcome the problems at the previous levels, we must create an internal representation, and then somehow use the information in an appropriate way. This is the level of semantics and pragmatics. Here, too, ambiguity is prevalent. Consider the following sentences.

Jack went to the store. He found the milk in aisle three. He paid for it and left.

Here the problem is deciding whether "it" in the last sentence refers to aisle three, the milk, or even the store.
The most important issue in the above is what the internal representation should be, so that these ambiguities in understanding do not occur and the machine understands the sentences the way a human being does.
5. Identify techniques for information retrieval, language translation, and text clas-
sification. [Familiarity]
6. Tools for NLP. [Usage]
Right from its start, NLP has focused on these tasks: natural language translation, information retrieval, information extraction, text summarization, question answering, topic modeling, and, more recently, opinion mining.
Syntax Versus Semantics
The NLP advances that took place after the 1960s paid attention to syntax, and partly to semantics, but the processing was mainly syntax-driven [1]. The semantic problems of natural language and its processing were obvious requirements from the beginning, but the strategy adopted was to tackle syntax first, for the more direct applications. Early works on NLP recognized the need for external knowledge for interpretation and reasoning about the input language (Minsky, 1968), explicitly emphasizing semantics, e.g., general-purpose semantics with case structures for representation and semantically driven processing [17].
First-Order Predicate Logic
One of the most popular natural language representation techniques is FOPL (First-Order Predicate Logic), a deduction-based logic consisting of axioms and inference rules. FOPL can be used to construct relationally rich predicate formulas, using quantification. Inferences can be drawn from a set of formulas, called a knowledge base, with the application of the resolution principle. FOPL supports the expression of syntax, semantics, and, to a certain extent, pragmatics. The syntax expresses a way of grouping symbols so that they are considered properly formed. The semantics specifies the meaning of these well-formed expressions. Pragmatics, which is more difficult to implement, specifies how to make use of context-related information to provide better correlations between different semantics. The latter is important in tasks like word sense disambiguation.
Default Logic
Default Logic was proposed [15] to formalize default assumptions, e.g., "all birds fly." However, problems arose when default logic formalized facts that are true in the majority of cases but false in exceptional cases, i.e., the "exceptions to these general rules," e.g., "penguins do not fly" [15].
Production Rules
The production-rule-based model is another popular model for the representation of natural languages (Chomsky, 1956). This model keeps ongoing assertions in a volatile working memory. A production rule comprises a set of conditions as its premise and a set of actions as its consequent (i.e., IF <conditions> THEN <actions>).
The importance of NLP is due to the fact that there is a huge amount of data on the WWW, at least 20 billion pages, which can be used as a big resource, provided that the important information can be found in it through NLP. Some of the well-known uses of this data are possible through the following applications [4]:
Indexing and searching large texts,
Information retrieval (IR),
Classification of text into categories,
Information extraction (IE),
Automatic language translation.
Syntax analysis performs two main functions in analyzing natural language text, as follows:
• Determining Structure. Given the sentences, the syntax analysis should identify
the subject and object of each verb, and determine what each modifying word is,
and what phrase it modifies. This is achieved by assigning a tree structure to the input, during a process called parsing.

[Figure: the scope of linguistics. Linguistics divides into Computational Linguistics and Theoretical Linguistics; Computational Linguistics comprises Language Analysis and Language Generation; Language Analysis comprises Syntax Analysis and Semantic Analysis.]
The objective of semantic analysis is to determine what a sentence means. Semantics seeks the conditions under which the sentence is true or, almost equivalently, the inference rules for the sentences of the language. Characterizing the semantics of interrogative and imperative sentences is more problematic.
Why is the meaning of a sentence important, as long as the sentence is syntactically correct? Here is one reason. Suppose we build a natural language system: our aim is not only to ensure that a sentence is correct, but usually we also want the system to do something in response to the input, like retrieve data, move a robot arm, etc. In general, this will mean translating the natural language input into the formal language of a database retrieval system, a robot command system, etc. In comparison with natural language, these formal languages have the following properties:
• unambiguous,
• simple rules of interpretation and inference, and
• logical structure determined by the form of the sentence.
Thus, there is a significant similarity between the tasks of translating into such a practical language and of translating into a logical formalism.
Formal Languages for Meaning Representation
One method for meaning representation is propositional logic. Using this logic, given that "If it is raining then the ground is wet" and "It is raining," we can infer that "the ground is wet." We observe that as long as we preserve the general structure of the argument, including the "if ... then ..." connective in the first part, we may change the sentences making up the argument and still have a valid argument. For example, given "If it is night then Ram is sleeping" and "It is night," we may conclude that "Ram is sleeping," and the overall argument is still valid.
The above rule is represented in propositional logic as:

Given that: P → Q, and P,
we may conclude: Q.

This rule is called Modus Ponens. The rule of Modus Ponens exists in predicate logic as well. For the details of these logics, refer to Chapters 2 and 3 of this book.
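As a small illustration, Modus Ponens over such rules can be applied by a few lines of forward chaining in Python; the proposition names below are illustrative:

# Forward chaining with Modus Ponens over rules of the form P -> Q
rules = [("it_is_raining", "ground_is_wet"),
         ("it_is_night", "ram_is_sleeping")]
facts = {"it_is_raining"}

changed = True
while changed:
    changed = False
    for p, q in rules:
        if p in facts and q not in facts:
            facts.add(q)        # Modus Ponens: from P -> Q and P, conclude Q
            changed = True

print(facts)                    # {'it_is_raining', 'ground_is_wet'}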
The other approaches for semantics are semantic networks and conceptual dependency, which are presented in detail in Chapter 7 of this book.
So far we have discussed the structure and meaning of individual sentences, but what remains is the meaning of the entire text, because the information conveyed by a text is clearly more than the sum of its parts, i.e., more than the meaning of its individual sentences. For example, if a text tells a story, describes a procedure, or presents an argument, we must understand the connections between the component sentences in order to have fully understood it. Discourse analysis is the study of these connections between the sentences. Since these connections are implicit in the text, identifying them may be a difficult task.
As a simple example of the challenge of discourse analysis, just consider the
following brief description of a naval encounter:
Just before dawn, the Vikrant sighted an unidentifiable blue-ship and fired two torpedoes. It sank swiftly, leaving one survivor.
The problem we face is: what does "it" stand for? There are four candidates in the first sentence: dawn, Vikrant, blue-ship, and torpedo. Semantic analysis helps to exclude "dawn" as a referent of "it", and number agreement excludes "torpedoes". But this still leaves two candidates, Vikrant and blue-ship, which are both ships. Our syntax and semantic analysis tools will not enable us to resolve which of these two "it" stands for. To get the noun for "it", we must understand the causal relationship between the firing of torpedoes and the sinking of a ship. Since Vikrant fired the torpedoes, it must have fired on the blue-ship (because Vikrant is not supposed to fire on itself!). Hence, "it" stands for the blue-ship.
Since much of the information conveyed by this text is implicit, we cannot claim
that the text is adequately understood unless the implicit information is recovered.
The role of discourse analysis is to recover this information.
As discussed, discourse is multi-sentence text, and it comes in many forms: it may describe a scene, give instructions, or present an argument. The tools for discourse analysis are frames, scripts, and conceptual dependencies, which were covered in detail in Chapter 7 of this book.
19.5 Grammars
The grammar of a language can be viewed as a theory of the structure of that language. Like any mathematical theory, the theory of a grammar is based on a finite set of observed sentences, but it projects this to an infinite set of grammatical sentences. This becomes possible by creating general laws (grammatical rules), framed in terms of such hypothetical constructs as the particular phonemes, words, and phrases of the spoken language under analysis. A properly formulated grammar should be able to determine unambiguously the set of all grammatical sentences in that language [2].
Let us assume that for most languages there are certain clear examples of grammatical and ungrammatical sentences. Consider the following sentences in English:

1. John ate a sandwich,
2. Sandwich ate a John.

Suppose a large corpus does not contain either of these sentences. In this scenario, the grammar determined for the corpus should nevertheless generate the first sentence and exclude the second. Even though such simple cases may not be adequate to confirm that all correct sentences of the language will be tested as correct and all wrong ones rejected, this still constitutes a strong test.
The first step toward the linguistic analysis of a language is to provide a finite system of representation for its sentences. By the grammar of a language L we mean a device that produces all of the strings that are sentences of L and no other strings. The description of phrase structure comprises the grouping of words into phrases, which are then grouped into smaller constituent phrases, and so on, until the ultimate constituents (generally morphemes) are reached. The phrases are typically classified as noun phrases (NP), verb phrases (VP), etc. For example, the sentence "the man took the book" might be analyzed as shown in Fig. 19.2.
Evidently, describing sentences in such a manner permits considerable simplification compared to the word-by-word model. This is because the composition of a complex class of expressions, such as NP, can be stated just once in the grammar, and this class can be used as a building block at various stages in the construction of sentences. We now ask: what form of grammar corresponds to this conception of linguistic structure? The answer follows.
[Fig. 19.2: phrase structure of the sentence "the man took the book".]

A grammar generates a sentence w from the start symbol S through a derivation of the form

S ⇒ α1 ⇒ α2 ⇒ · · · ⇒ αn = w. (19.1)
For example, consider the grammar:

S → NP VP
VP → Verb NP
NP → the man | the book
Verb → took (19.2)

Using this grammar, the sentence "the man took the book" has, among others, the derivation:

D1: S ⇒ NP VP
⇒ NP Verb NP
⇒ the man Verb NP
⇒ the man took NP
⇒ the man took the book

Also, we have:

D2: S ⇒ NP VP
⇒ NP Verb NP
⇒ the man Verb NP
⇒ the man Verb the book
⇒ the man took the book.
These derivations D1 and D2 are evidently equivalent; they differ only in the order in which the rules are applied. We can represent this equivalence graphically by constructing diagrams that correspond, in an obvious way, to derivations. Both D1 and D2 reduce to the diagram shown in Fig. 19.3, which gives the phrase structure of the terminal sentence "the man took the book." Note that "the man", "took", "the book", and "took the book" are phrases, as each is traceable back to some node. However, "man took" is not a phrase, as it is not traceable back to any node.
Example 19.1 Consider that various tuples of a grammar are given as follows:
V = {S, NP, N, VP, V, Art}
Σ = {boy, icecream, dog, bites, likes, ate, the, a}
P = {S → NP VP,
NP → N,
NP → Art N,
VP → V NP,
N → boy | icecream | dog,
V → ate | likes | bites,
Art → the | a}
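As a concrete illustration, the grammar of Example 19.1 can be encoded and run with the NLTK toolkit in Python; this is a sketch assuming NLTK is installed, and the test sentence is one the grammar happens to generate:

# The grammar of Example 19.1 in NLTK's CFG notation
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> N | Art N
VP -> V NP
N -> 'boy' | 'icecream' | 'dog'
V -> 'ate' | 'likes' | 'bites'
Art -> 'the' | 'a'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog bites a boy".split()):
    print(tree)   # (S (NP (Art the) (N dog)) (VP (V bites) (NP (Art a) (N boy))))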
Structural Representation
It is convenient to represent a sentence as a tree or a graph to help expose the structure of its constituent parts, called parts of speech (POS). For example, the sentence "The boy ate an icecream" can be represented as the tree shown in Fig. 19.4. For the purpose of computation, a tree must also be represented as a record, a list, or some similar data structure. For example, the tree in Fig. 19.4 can be represented as the list:
(S (NP (Art The) (N boy)) (VP (V ate) (NP (Art an) (N icecream))))
A more extensive English grammar can be obtained with the addition of other constituents such as prepositional phrases (PP), prepositions (Prep), adjectives (ADJ), determiners (DET), adverbs (ADV), auxiliary verbs (AUX), and many other features. Correspondingly, the additional rewrite rules are as follows:

PP → Prep NP
VP → V ADV | V PP | V NP PP | AUX V NP
Det → Art ADJ | Art
Grammars are classified in many ways: when based on their generating rules, they are called Chomskian grammars; when a grammar transforms passive-voice sentences to active voice, it is a transformational grammar; and when a sentence generated by a grammar has more than one meaning (sense), the grammar is called ambiguous. The following describes each in brief.

In fact, it is not always possible to formally characterize natural languages with a simple grammar as discussed above. Grammars are classified by the Chomsky hierarchy as types 0, 1, 2, and 3, with type-3 the simplest and type-0 the most complex. The typical rewrite rules of a type-3 (regular) grammar are, for example:
S → aS
S → a.
We note that the left-hand side is a single variable symbol, and the right-hand side is a single terminal symbol, or a terminal followed by a variable symbol.
The type-2 grammars, called context-free grammars (CFG), have a single variable symbol on the left-hand side, while the right-hand side may be any combination of variables and terminal symbols. All the following productions belong to type-2 grammars, where S, A, B are variable symbols, a, b are terminals, and α is any combination of variables and terminals.

S → aS | aSb | aB | aAB | α
A → a
B → b
The typical rewrite rules for type-0 grammars, called unrestricted grammars, and type-1 grammars, called context-sensitive grammars (CSG), are as follows:

S → aS | aAB
AB → BA
aA → ab | aa
α → β
The grammars discussed so far produce different structures for different sentences, even though the sentences may have the same meaning. For example,

Ram gave Shyam a book.
A book was given by Ram to Shyam.

In the above, the subject and object roles in the second sentence are switched: in the first, the subject is "Ram" and the object is "book", while in the second sentence it is the other way round. This is an undesirable feature for machine processing of a language; in fact, sentences having the same meaning should map to the same internal structures.
However, by adding some extra components, we can produce a single representation for sentences having the same meaning, through a series of transformations. This extended grammar is called a Transformational grammar. In addition, semantic and phonological components are added as new components; these help in interpreting the output of the syntactic component as meaning and sound sequences, respectively. The transformations are tree-manipulation rules, supported by a dictionary in which each lexical entry carries semantic features.
Using a transformational generative grammar, a sentence is analyzed in two stages: (1) the basic structure of the sentence is analyzed to determine the grammatical constituent parts, which provides the structure of the sentence; (2) this structure is transformed into another form, where a deeper semantic structure is determined.
Transformations are applied to change a sentence from passive voice to active voice, change a question to declarative form, handle negations, and provide subject–verb agreement. Figure 19.5a–c shows the three stages of conversion of a sentence from passive voice to active voice.
However, the transformational grammars are now rarely used as computational
models for language processing [14].
[Fig. 19.5: the three stages of transforming a passive-voice sentence into active voice.]
Fig. 19.6 a Grouping the words, b Parse-tree I: “He drove down the hill in the car”
Fig. 19.7 a Grouping the words, b Parse-tree II: “He drove down the hill in the car”
Figure 19.7a shows another grouping for the same sentence, and the corresponding parse-tree is shown in Fig. 19.7b. Since this sentence can be generated through two different parse-trees, the sentence is ambiguous, and hence its grammar is also ambiguous.
The sentence "Shakuntala eats kheer with a spoon" contains a syntactic ambiguity: in one reading, the PP "with spoon" is governed by the noun kheer, i.e., it is part of the NP "kheer with spoon" (see Fig. 19.8a). In the alternative reading, the PP "with spoon" is governed by the verb eats, i.e., it is a modifier of the verb, as indicated in Fig. 19.8b. Of these, the latter case of verb attachment (i.e., Fig. 19.8b) is, of course, the correct analysis.

Fig. 19.8 The prepositional phrase "with spoon," being governed by: a noun kheer, b verb eats
Much of the interest in PP attachment comes from its being a common phenomenon when parsing languages such as English, and a major cause of parser errors. The ambiguity due to PP is called attachment ambiguity.
Syntactic preferences alone do not provide solutions for dealing with PP attachment, and they are not very effective for predicting the difference in PP attachment
between noun attachment and verb attachment. This difficulty led to a shift from syntactic approaches in the 1980s toward AI-based approaches, the latter using world knowledge for resolving PP attachment ambiguity. For example, the knowledge that a spoon is an eating implement suggests a preference for verb attachment (Fig. 19.8b); in a similar way, the knowledge that a spoon is not a foodstuff suggests a preference against noun attachment (Fig. 19.8a).
An algorithm that resolves the attachment ambiguity defines constraints derived from semantic and world knowledge. These constraints are used for composing the meaning of the child unit, which is attached to the meaning of each of the possible parent syntactic units; this helps in attaching the child unit to the correct parent unit (see Fig. 19.8). These constraints are called selectional constraints, and are usually represented in the form of the permissible range of fillers of the slots in frames that represent the meanings. The potential filler, which is the meaning of a child unit, is compared against the constraint by a fuzzy match function. A popular way to do this fuzzy match is to compute a weighted distance between the two meanings in a semantic or ontological network of concepts: the closer the two are in the network, the higher the score assigned to the particular choice.
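A minimal sketch of this idea follows, using a toy semantic network in which graph distance serves as the fuzzy match score; all concepts and links here are illustrative, and a shorter distance between the filler and the constraint concept means a better match:

# Toy undirected semantic network (illustrative concepts and links)
from collections import deque

graph = {
    "spoon": {"eating_implement"},
    "eating_implement": {"spoon", "implement"},
    "implement": {"eating_implement", "artifact"},
    "artifact": {"implement"},
    "kheer": {"foodstuff"},
    "foodstuff": {"kheer", "substance"},
    "substance": {"foodstuff"},
}

def distance(a, b):
    # Breadth-first-search distance between two concepts
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, d = frontier.popleft()
        if node == b:
            return d
        for nbr in graph.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, d + 1))
    return float("inf")

# Verb attachment expects an instrument (an eating implement); noun
# attachment expects the modifier to be a foodstuff.
print(distance("spoon", "eating_implement"))  # 1 -> prefer verb attachment
print(distance("spoon", "foodstuff"))         # inf -> reject noun attachment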
Parsing is carried out to compute the structural description of a sentence. This structural description is assigned by the grammar of the language, and as a precondition it is assumed that the sentence is well-formed, i.e., grammatically correct. The process of parsing consists of at least the following activities:
• Mathematical characterization of derivations in a grammar, and those associated
with parsing algorithms.
• Computing the time and space complexities of parsing algorithms, as a function of the length of the input sentence.
• Comparing different grammar formalisms and showing equivalences among them whenever possible.
• Combining grammatical and statistical information for improving the efficiency of parsers, and for ranking the parses.
The structural descriptions provided by a grammar depend on the formalism to which the grammar belongs. For the well-known context-free grammar (CFG), the structural description is the phrase structure.
A parse-tree describes the structure of a sentence, and is also a record of the sequence of derivation steps for the sentence. Parse-trees are useful for the following reasons:
• parsing is an important intermediate stage in carrying out semantic analysis,
• a parse-tree checks the sentence grammatically, and
• parsing is useful in:
– mechanical translation,
– question answering, and
– information extraction.
A syntactic parser can be thought of as searching through the space of all possible parse-trees to find the correct parse-tree for a given sentence. Before we go through the steps of parsing, let us consider the following grammar rules.
Det → a | an | the
N → book | flight | meal
V → book | include | prefer
Aux → Does
Prep → from | to | on
Proper-N → Mumbai
Nom → Nom PP
The parse-tree for the sentence “Book that flight” is shown in Fig. 19.9.
Example 19.2 Set of production rules (P) and parsed sentences for different forms of verb phrases.

An NP can be a location, date, time, or other entity. The following are further examples of parts of speech (POS).
Pronoun → me | I | you | it
Proper-N → Mumbai | Delhi | India | USA
Adj → cheapest | non-stop | first | latest | other
Prep → from | to | on | near
Conj → and | or | but

The above substitution rules give the values for each part of speech to be substituted.
Making use of the above rules, Fig. 19.10 demonstrates the parsing of the sentence "I prefer morning flight."
S → VP
VP → V NP
S → Aux NP VP
Wh-NP → What
Many times, longer sentences are formed by conjoining sentences using connectives, e.g., "I will fly to Delhi and Mumbai." The corresponding rule is
VP → VP and VP.
In top-down parsing, the search starts from the root node; substitutions are carried out at every step, and the progressively built sentence is compared with the input text sentence to determine whether it matches the original. Figure 19.11 demonstrates the steps of top-down parsing for the sentence "Book that flight".
To carry out top-down parsing, we expand the tree at each level, as shown in the figure. At each level, the trees whose leaves fail to match the input sentence are rejected, leaving behind the trees that represent successful parses. Going this way, we ultimately construct the original sentence: "Book that flight."

Fig. 19.11 Top-down parsing for the sentence, "Book that flight"
Figure 19.11 shows that, first, the root node is created as S; then, in step (2), it is expanded to generate three possible sub-trees:

S → NP VP,
S → Aux NP VP,
S → VP.

The first of these is expanded at level 2, in step (3), by the rule NP → Det Nom, which, we know, will not match; neither does the expansion of NP by NP → PropN in the next sub-tree. So, we need to expand the next choice, S → Aux NP VP, of stage (2). This also does not work. Finally, what works is the following:
S → VP
VP → V NP
V → Book
NP → Det N
Det → that
N → flight.
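The procedure just traced can be captured in a few lines of Python. The following is a minimal recursive-descent sketch, with the grammar and lexicon restricted to the productions used above; the alternatives for S are tried in the same order as in the text:

# A minimal recursive-descent (top-down) parser for "Book that flight"
GRAMMAR = {
    "S":  [["NP", "VP"], ["Aux", "NP", "VP"], ["VP"]],
    "VP": [["V", "NP"]],
    "NP": [["Det", "N"]],
}
LEXICON = {"V": {"book"}, "Det": {"that"}, "N": {"flight"}, "Aux": {"does"}}

def parse(symbol, words, i):
    # Try to derive `symbol` from words[i:]; return (tree, next_i) or None
    if symbol in LEXICON:                      # pre-terminal: match one word
        if i < len(words) and words[i].lower() in LEXICON[symbol]:
            return (symbol, words[i]), i + 1
        return None
    for rhs in GRAMMAR.get(symbol, []):        # try each expansion in order
        children, j = [], i
        for sym in rhs:
            result = parse(sym, words, j)
            if result is None:
                break
            tree, j = result
            children.append(tree)
        else:                                  # all symbols of this rhs matched
            return (symbol, *children), j
    return None                                # every alternative failed

words = "Book that flight".split()
result = parse("S", words, 0)
if result and result[1] == len(words):         # require the whole input used
    print(result[0])
# ('S', ('VP', ('V', 'Book'), ('NP', ('Det', 'that'), ('N', 'flight'))))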
In the discussion above, we drew a sharp line between a correct and an incorrect parse, based on whether a terminal node matched or did not match the next word in the sentence. According to this view of parsing, a phrase is either acceptable or not. There are situations in which we may want to relax these requirements, e.g., when we cannot afford to try all alternatives exhaustively. For example, in the analysis of connected speech, text segmentation and the identification of words can never be done with complete certainty; at best, one can say that a certain sound is more probable than another. Consequently, one may associate a fractional number with each terminal node, indicating the probability or quality of the nodes below it, with respect to forming a grammatically correct sentence [5].
In probabilistic parsing, the non-terminal nodes are assigned values that are sophisticated enough to take syntactic and semantic restrictions into account. Hence, a parser that aims to deliver the best analysis, even if every analysis violates some constraint, must associate a measure of grammaticality (i.e., acceptability) with the analyses of portions of the sentence, and ultimately a measure with the analyses of the complete sentence. As a matter of rule, it is possible to generate an analysis of every sentence with a nonzero acceptability or matching probability, and then select the best analysis.
One simple parser of this type is the "best-first" parser, which is based on a modified form of the standard top-down serial parsing algorithm for context-free grammars. The standard algorithm tries to generate one possible parse-tree until it gets stuck (i.e., it generates a terminal node which does not match the next word in the sentence); when stuck, it "backs up" to try another alternative. A best-first procedure, like best-first search, tries all alternatives in parallel. A numerical measure is associated with each alternative path, indicating the likelihood that this analysis matches the sentence processed so far, and that it can be extended to a complete sentence analysis. At each step, the path with the highest likelihood is extended. In the process, if the measure of the current path falls below that of some other path, the parser shifts its attention to the other path.
A CFG (context-free grammar) consists of:

terminal words w1, w2, ..., wV,
non-terminals N1, N2, ..., Nn,
a start symbol N1, and
production rules Ni → αj,

where αj is a sequence of terminals and non-terminals. Given this, we can define a generative PCFG (probabilistic context-free grammar) with the same components, together with rule probabilities satisfying

∀i: Σj P(Ni → αj) = 1, (19.3)

which shows that for each set of productions having the same left-side variable Ni, the sum of the probabilities is unity.
We consider that a sentence is represented as a sequence of words w1 w2 ... wm, and that wab = wa ... wb is a subsequence. The fact that the non-terminal Ni dominates the subsequence wa ... wb is written Ni_ab, i.e., Ni is the root of a sub-tree whose leaf sequence is wa ... wb; equivalently, Ni ⇒∗ wa ... wb. We represent the probability of a sentence w1m as

P(w1m) = Σt P(w1m, t), (19.4)

where t ranges over the possible parse trees of the sentence.
Example 19.3 Construct a correct parse-tree for the sentence “I saw an astronomer
with telescope,” for the following PCFG:
P(t1) = 1.0 × 0.45 × 0.7 × 0.60 × 1.0 × 0.20 × 0.35 × 1.0 × 1.0 × 0.45 × 0.25 = 0.001488375
[Parse-tree t1: the PP "with telescope" attached to the NP "an astronomer" (noun attachment).]

[Parse-tree t2: the PP "with telescope" attached to the VP "saw an astronomer" (verb attachment).]
P(t2) = 1.0 × 0.45 × 0.3 × 0.60 × 1.0 × 0.35 × 1.0 × 0.45 × 1.0 × 0.15 × 0.25 = 0.00047840625
available from the statistics of semantics over a large corpus of natural language text.
Probabilistic parsing provides a solution to ambiguity in language and grammar, as is evident from the above example, though the solution may be partial rather than complete. The other benefit is that it gives a probabilistic language model.
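For illustration, the same kind of disambiguation can be carried out mechanically with NLTK's PCFG support; this is a sketch assuming NLTK is installed, and the rule probabilities below are illustrative, not those of Example 19.3:

# Resolving PP attachment with a toy PCFG and the Viterbi parser
import nltk

grammar = nltk.PCFG.fromstring("""
S -> NP VP [1.0]
NP -> Pronoun [0.3] | Art N [0.4] | NP PP [0.3]
VP -> V NP [0.6] | VP PP [0.4]
PP -> Prep NP [1.0]
Pronoun -> 'I' [1.0]
Art -> 'an' [0.5] | 'a' [0.5]
N -> 'astronomer' [0.6] | 'telescope' [0.4]
V -> 'saw' [1.0]
Prep -> 'with' [1.0]
""")

parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("I saw an astronomer with a telescope".split()):
    print(tree)   # the most probable parse, printed with its probability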
We consider the following small text, which we would like to use for Information
Extraction.
Firm XYZ is a full service advertising agency specializing in direct and interactive market-
ing. Located in Bigtown CA, Firm XYZ is looking for an Assistant Account Manager to help
manage and coordinate interactive marketing initiatives for a marquee automotive account.
Experience in online marketing, automotive and/or the advertising field is a plus. Assistant
Account Manager responsibilities ensures smooth implementation of programs and initia-
tives, helps manage the delivery of projects and key client deliverables …Compensation:
50,000–80,000, Hiring Organization: Firm XYZ.
INDUSTRY Advertising
POSITION Assistant Account Manager
LOCATION Bigtown, CA
COMPANY Firm XYZ
SALARY 50,000–80,000
The majority of current systems follow this general architecture, although specific systems are characterized by their own sets of modules. In general, the combination of such modules provides the functionality of an IE system, discussed below [21].
[Figure: generic architecture of an IE system: text filtering, partial parsing, pattern extraction (supported by a concept hierarchy and a pattern base), resolution of co-references, and discourse analysis.]
Typically, full parsing is replaced with a simple phrasal parser (parsing based on phrases); the latter finds the local phrases. In addition, the discourse module does the job of event pattern matching and merging of templates. This approach is useful because full parsing is expensive and not robust, produces ambiguous results, and cannot manage out-of-vocabulary conditions.
Current IE systems make use of partial parsing, such that the process of finding the constituent partial structure of a sentence consists of one or more cascaded steps applied to the text fragments. The generated constituents are tagged as parts of speech, like noun/verb, or as phrases, like prepositional phrases or others. After parsing the constituents, the system resolves domain-specific dependencies based on the semantic restrictions imposed by the extraction environment. The dependencies are resolved using one of the following two alternatives.
Pattern Matching
The majority of IE systems follow an approach based on pattern matching, using so-called extraction patterns or IE rules, to resolve the dependencies. In this approach, simplification of the syntax helps in reducing the semantic processing, which ultimately simplifies the pattern matching task. The patterns used are scenario-specific, and recognize both modifier and argument dependencies between constituents.
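As a toy illustration of pattern-based extraction, the following Python sketch fills some of the template slots of the job-advertisement text above with hand-written regular expressions; the patterns are illustrative and much cruder than those of a real IE system:

# Toy extraction patterns over (a condensed form of) the job-ad text above
import re

text = ("Firm XYZ is looking for an Assistant Account Manager to help manage "
        "interactive marketing initiatives. Located in Bigtown CA. "
        "Compensation: 50,000-80,000, Hiring Organization: Firm XYZ.")

patterns = {
    "POSITION": r"looking for an? ([A-Z]\w*(?: [A-Z]\w*)+)",
    "LOCATION": r"Located in ([A-Z]\w+,? [A-Z]{2})\b",
    "SALARY":   r"Compensation:\s*(\d[\d,]*-\d[\d,]*\d)",
    "COMPANY":  r"Hiring Organization:\s*([A-Z]\w*(?: [A-Z]\w*)*)",
}

template = {slot: (m.group(1) if (m := re.search(rx, text)) else None)
            for slot, rx in patterns.items()}
print(template)
# {'POSITION': 'Assistant Account Manager', 'LOCATION': 'Bigtown CA',
#  'SALARY': '50,000-80,000', 'COMPANY': 'Firm XYZ'}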
Grammatical Relations
For the representation of grammatical relations, a graph is constructed (similar to dependency grammars) using general rules of interpretation, such that previously detected chunks become nodes, and the relations among these nodes are the labeled arcs. Such graphs are useful for reasoning about IE. There are three different types of grammatical relations for IE, as follows.
• The first type is defined in the form of general relations of subject, object, and modifier.
• The second type is specialized modifier relations, which are specified as temporal relations and relations concerning locations.
• The third type is mediated relations, i.e., relations mediated by prepositional phrases.
The output template generation phase is the final stage of an IE system. This stage maps the extracted pieces of information to the required output format. Due to domain-specific restrictions on the output structure, some inferences can be drawn in this phase. Inference drawing can happen in the following situations:
1. a value from a predefined set is taken as it is by the output slots;
2. a forced instantiation of output slots;
3. when an extracted piece of information generates multiple output templates, it results in inference;
4. sometimes, the normalization of the output slots produces inferences, e.g., when a date in coded form is normalized, it produces, say, the day, the day of the month, or the week, etc. Similarly, when a products list is normalized, we may get the names of items instead of their codes.
Natural Language Question Answering (QA) derives common features from NLP (Natural Language Processing) and Information Retrieval (IR). QA (i.e., NL-QA) promises to deliver "answers", unlike the "hits" delivered by IR. Most QA is focused on fact-based questions like "When did the first human land on the Moon?", "Who invented the paper clip?", and "How far is the Moon from the Earth?" These are called "factoid" questions, which can typically be answered using named entities such as people, organizations, measures, dates, locations, etc. [7].
The commonly used approach for answering open-domain factoid questions requires the technique of named-entity recognition, together with IR. The major steps of a "traditional" ontology-driven question answering system, based primarily on IR and named-entity recognition technologies, are shown in Fig. 19.15. Usually, large ontologies, which relate question types to semantic classes of answers, drive the process of QA. The detailed description of the steps of QA is as follows:
1. Question analysis. The first step of QA focuses on the analysis of the question, where knowledge resources are explored to find out the expected semantic type of the answer.
2. Document retrieval. Next, in the document retrieval phase, candidate documents containing terms from the question are retrieved. This retrieval is often done using off-the-shelf IR engines.
3. Identification of passages. In the next stage, the system identifies passages within these restricted candidate documents that contain a concentration of terms from the question; these regions are likely to contain the answer.
4. Answer extraction. Finally, in the answer extraction stage, named-entity recognizers identify candidates of the correct semantic type [9].
[Fig. 19.15: pipeline of a traditional QA system: question analysis, document retrieval, passage retrieval, and answer extraction.]
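To make the data flow of these four stages concrete, they can be sketched as function stubs in Python; everything here is an illustrative stub, not a real QA implementation:

# Skeletal sketch of the traditional factoid-QA pipeline
def question_analysis(question):
    # Map the question to an expected answer type, e.g., "Who ..." -> PERSON
    return "PERSON" if question.lower().startswith("who") else "OTHER"

def document_retrieval(question, collection):
    # Retrieve candidate documents sharing terms with the question
    q_terms = set(question.lower().split())
    return [d for d in collection if q_terms & set(d.lower().split())]

def passage_retrieval(question, documents):
    # Keep passages with a concentration of question terms (here: whole docs)
    return documents

def answer_extraction(answer_type, passages):
    # A named-entity recognizer would pick spans of the expected type;
    # this stub simply returns the candidate passages
    return passages

collection = ["Edison invented the light bulb.",
              "The Moon is 384,400 km away."]
q = "Who invented the light bulb?"
print(answer_extraction(question_analysis(q),
                        passage_retrieval(q, document_retrieval(q, collection))))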
However, there are challenges to the QA approach discussed above. One is due to the complex many-to-many mapping between question types and answer types. As an example, the answer to a "who" question can be the name of a person (e.g., "Who invented the light bulb?") or an organization's name (e.g., "Who won the World Series in 2004?"), and there are other possibilities as well. On the other side, different question types may map onto a single semantic answer type. For example, the questions "Where was the world book fair held in 1900?" and "What city hosted the world book fair in 1900?" would map to the same answer. To overcome these challenges of multiple questions with a single answer, a system must have elaborate ontological resources that explicitly encode semantic relationships between questions and answers. More advanced techniques, such as abductive inferencing, feedback loops, and fuzzy matching of syntactic relations, are also used, but factoid question answering is still driven by information retrieval and named-entity recognition technologies, together with large ontologies.
The data-redundancy-based approach depends on statistical regularities, using which it is possible to extract "easy" answers to factoid questions from the web. For answering a question, the system makes some connection between the question and the passages containing the answers. If this connection is made at the lexical level, the rest of the job is simple: a lexical connection between the question and a likely answer-carrying passage means the presence of a large degree of overlap between these texts. But, in reality, this is not very often the case, because the richness of natural language and its expressive power make it possible to create varying textual forms that have the same meaning (semantics). Even in that case, though, there is a good degree of semantic resemblance between the question and the answer-carrying text. To appreciate this scenario, consider the following question, with two different answers, which, though lexically different, have the same semantics.
When did Sikkim become the state of India?
1. Sikkim became a state on 16 May 1975.
2. Sikkim was admitted to India on May 16, 1975.
In the above cases, both passages contain correct answers; however, it seems obvious that for a computer, it would be easier to extract the answer from passage 1 than from passage 2. We note that the task of answering factoid questions would be far easier for the computer if the answer passage carried the same words as the question.
From the point of view of text processing algorithms, answers can be expressed in a variety of different ways. Within a text collection, it is possible that the answer to the above question lies in passage 2, where the answer is not obviously stated. In such cases, since the answer shares few words with the question, sophisticated natural language processing may be required to find the relation between them. This processing may typically comprise the following [7]:
Recognizing syntactic alternations,
Collapsing of paraphrases,
Resolving the anaphora,2
Making commonsense inferences, and
Performing date calculations.
A special grammar, called SDG (Structured Description Grammar), can be used for information representation and extraction. The basic approach here is based on data redundancy. In this approach, a text sentence (S) is mapped into a transition graph or state diagram, as shown in Fig. 19.16. It shows seven types of structurally different sentences, labeled 1, 2, …, 7, based on the number of wh-pronouns and their positions, which can be mapped into this structure. The structure of each sentence explicitly indicates the positions of the interrogating wh-pronouns who, what, where, when, and why; the positions of these wh-pronouns are invariant [3].
The following examples demonstrate how English-language sentence structures map into the transition diagram of the SDG; it can easily be verified that the sentences (a) to (d) are of types 7, 7, 3, and 4, respectively.
2 Anaphora: making use of a pronoun or similar word instead of repeating a word used earlier.
[Fig. 19.16: SDG transition graph: a sentence S passes through the wh-pronoun positions who, what, where, when, and why, giving seven structurally different sentence types.]
To make our computers easier to use, it is required to make them understand the meaning of what we tell them. Usually such efforts fail, because meaning is not one thing but a combination of many: the activities of human thought engage an enormous collection of different structures and processes [10].
In human understanding, what something X means to us depends on how our representations of X (in our brains) connect, i.e., link, to the other things we know. If we understand something in only one way, then we know it very little, because when something goes wrong, we are left with no links to connect that object (sense) with others. But if we understand it through several features, each integrated with its own set of related pieces of knowledge, then if one of them fails we can switch to another. We implement this by turning our ideas around in our minds to examine the other pieces
of knowledge from different perspectives, and carry on this process until we find one
that works.
When our goal is that our computers must understand us, we will need to equip them with this kind of facility, i.e., they should connect each object/action (sense) to many other senses. For example, the sense "cat" is connected with other senses: color, size, height, number of legs, tail, eyes, ears, leopard-like shape, etc. In these ways, computers must be equipped with knowledge, like humans.
Present Limitations of Computers
For computers to understand like humans, there should be a program with commonsense like humans. This is a difficult task, because a typical program has only one way to deal with a problem, so if something goes wrong, the program is totally stuck. As humans, however, we search for alternatives in the event of the failure of one approach. A limitation of present-day computers is that they always start from scratch. To make computers more worthwhile, we need to supply them with a great library of commonsense knowledge, of the kind that even young children have.
Present-day computers are not designed with commonsense in them, nor do they have the capability to learn from experience. Programs can solve difficult problems in specialized subjects, but even today, computers cannot do the things which even young children do so easily. Current programs behave in a limited and inflexible way. Some of the critical differences between the capabilities of computers and humans are explained below.
Vitalist
Computers can do only what they are programmed to do. Computers have been programmed by people to speak, but in fact the computers do not know what those words mean. Meaning is an intuitive thing, and it cannot be reduced to zeros and ones; the present computers can only do logical things, and meanings are not necessarily logical.
Humanist
Machines are not humanist: machines have no goals or hopes, nor any fears or phobias. They do not even know that they exist, nor do they have a sense of accomplishment.
Logician
Since we want computers to have commonsense, it is important first to understand what commonsense is; we need to define it more clearly and precisely. That policy seems alright, but it goes wrong when it comes to psychology. Of course, we do not like to be imprecise, but sometimes definitions are not sufficient. So instead, we will take a different approach, and try to design (as opposed to define) machines that can do the things we want.
When we say "commonsense thought," we are referring to things that most people can accomplish, often without even knowing that they are doing them. Thus, when we hear a sentence like "Bill told the waiter he wanted some chips," we draw many inferences. Some of them are the following [10]:
– The word “he” means Bill, and not the waiter.
– This event possibly took place in a restaurant, where Bill was a customer and the waiter was close to him at the time. The waiter was working there, waiting for Bill's meal order.
– Bill wants potato chips, not memory or IC chips. No count of chips is mentioned, but it may be 20–30, not in the thousands.
– Bill will start eating the chips soon after he gets them.
– Bill and the waiter are both human beings; Bill is old enough to talk, and the waiter is old enough to work.
– This event happened after the date of invention of potato chips (i.e., year 1853).
– It is Bill's assumption that the waiter also infers all the abovementioned things.
Every child learns to use thousands of words at a very young age, but no computer knows the meaning of even some of them, so no computer can understand even a casual conversation. For example, if you mention a "string," any child would know what it means, because the child knows the many places it can be used; for example, it can be used to tie to a cart and pull it. With this, the child would also know things like these:
– An object can be pulled by a string, but cannot be pushed by it.
– A string is not good for eating, and you cannot construct a cart using a string.
– Before we put a string into a box, the box needs to be opened.
– And so on.
Let us estimate the size of a network of commonsense knowledge, i.e., a network of, say, 1000 words, each word having links with other knowledge structures in different ways. Some links are for objects, while others are for processes. Each such link will, in turn, lead to other links (all in a semantic network), so that whenever some process gets stuck, we can usually find some alternative. Language is not the only faculty with such abilities: for each expert skill a person possesses, there must be other structures of a similar order, say, for vision, for hearing, for perception, for speech, for all kinds of physical manipulations, and also for various kinds of social knowledge. So, it may be that we possess millions or billions of units of knowledge.
to solve. Not only must we learn, but we must also adopt efficient ways to retrieve and apply the relevant knowledge. Many processes get engaged in the activities of imagining, planning, predicting, and deciding, which also make use of many exceptions and rules. For these, knowledge about how to think is required, i.e., how to organize and control those processes, even when different representations describe the same situation. In addition, there must be a facility to convert new experiences into suitable memory structures to represent them. Some of the possible structures are: property lists, frames, frame-arrays, database query languages, explanation-based reasoning, logic programming, rule-based systems, semantic networks, scripts, and stories.
The first step in knowledge representation is to select a representation. Using any single specific representation will soon lead to limitations, which may quickly accumulate, ultimately bringing the reasoning to a halt. It is usually necessary to use several different representations for each fragment of the commonsense knowledge about each idea, thing, or relationship, and to switch swiftly from one method of representation to another, depending on which method is better for solving the problem at hand.
Finally, the following three are the important capabilities we need for characterizing problems so as to make use of commonsense reasoning.
Negative Expertise
This deals with the ways by which we recognize when each method is going to fail. If it becomes possible to recognize the way a particular thing went wrong, this becomes a clue for deciding what action should be taken next. Knowing how each method is likely to fail can be used to control higher level activity, the way the brain controls mental activity.
Knowledge Retrieval
Retrieving the relevant information from commonsense knowledge networks requires appropriate methods to identify which problems or situations in the past most closely resemble the context of the present problem. This means the systems need ways to describe what we are trying to do, and then reason about those descriptions. This is based on the skillful use of analogies.
Self-reflection
Our computers, when implementing commonsense reasoning, are required to keep records which describe the acts and thinking they have recently done, so that they are able to reflect on the results of what they tried to do. Also, they must be aware of what they are doing; this is what, in humans, we call consciousness.
How are programs for playing chess or proving theorems different from commonsense reasoning? Is the reasoning process in a child's brain, arranging blocks of different colors in a game, simpler than that of chess or theorem proving? The answer is no. There is a fundamental difference between these programs, for the chess game or theorem proving, versus playing the block games. In fact, the programs of expert systems, like a chess player or a theorem prover, require far fewer knowledge skills. The children, however, make use of thousands of skills: they manipulate all kinds of things, irrespective of their texture and shape; they stack things up and then knock them down, and learn the dynamics of how the stack got scattered. Even for building a small toy house of blocks, one needs to mix and match knowledge of many kinds: about shapes and colors, speed, space and time, support and balance, stress and strain, and the economics of self-management.
Do computers inherently lack the ability to learn all these things? Not likely. These limitations persist because we have learned to program in only certain ways. Our decades-old approach to solving some problem X has been: find the best way to represent X, find the best way to represent the knowledge needed to solve X, find the best way to solve X, and so on! These steps lead to writing specialized programs that cope with solving only that type of problems—called brittle programs! This has resulted in millions of specialized programs, each solving only one kind of problem, such as playing chess, playing cards, designing a bridge, banking, etc. In these programs, the quest for the "best" solution came at the price of a severely limited scope.
To make machines deal with commonsense reasoning, we must use multiple ways to represent knowledge, acquire a huge amount of knowledge in those representations, and find commonsense ways to reason with it. Consider the following example.
Mary was invited to Jack’s party.
She wondered if he would like a kite.
What leads us to infer that Mary was thinking of taking a kite as a gift, when there was no mention of "birthday" or "present" in the first sentence? There should be a suitable representation of the phrase "invited to party", which could help to infer that she is thinking of a kite as a suitable gift for Jack.
It is not well established which structure of representation is best suited for any specific purpose. On the contrary, the brain represents a problem in several ways, and likewise represents the required knowledge in several ways. When one method fails to solve a problem, we quickly switch over to another approach.
Now, to answer the above question, we do a small analysis. Consider that we are traveling by electric train, and there is a possibility of it being halted at any of the stations due to electricity failure. Assuming that there are 10 railway stations, the train may get stuck at any one of them. Therefore, it is not difficult to plan for food and shelter, as there can be at most 10 different solutions. So, for a single cause of the train halting, there are at most 10 possible effects.
Consider another example: the possibility of a meteorite of sufficient size striking the earth. Depending on where it strikes, different rescue operations and strategies are needed. For this single cause, there are hundreds or thousands of possible different measures needed (effects), or even more. Naturally, this second problem is more difficult to solve.
The cause-effect matrix (shown in Fig. 19.17) illustrates the relation cause × effect in the reasoning process, which can be used to arrive at a solution. In the cases when the causes are only a few, each with a small number of effects, as shown in the top-left corner cells of the matrix, the problems are easy to solve even by exhaustive search methods; sometimes there is no need to search at all, and the answer is simply recalled [10].
When the causes are many, each with a small effect, then statistical methods and neural networks may work well, as indicated by the top-right corner cell of the matrix. But such systems are likely to break down if those causes have different characters that interact in hard-to-predict ways.
Symbolic or logical reasoning, shown at the bottom-left corner cell of the matrix, can work well when the causes are few but the corresponding effects are large in number, except that the search may become exponential in the case of sequential processes.
It is a rare situation when many causes correspond to a large number of effects, indicated at the bottom-right of the matrix; such systems are intractable unless the system is linear. Sometimes it is possible to arrive at solutions by reformulating those difficult problems, switching to different representations that emphasize fewer and more pertinent features, so that we can work with simpler descriptions.
Except at the extreme corners of the matrix, there are multiple causes with a modest number of effects; heuristic programs succeed in these conditions, using knowledge combined with controllable search. This is the region of "classical AI" research, where the causes and effects are both moderate in number. Consider the cells adjacent to the bottom and to the right of classical AI. In these regions, analytical methods are usually not helpful, but it may be possible to use our knowledge and experience to construct and apply analogies, plans, or case-based reasoning methods. Such methods may not work perfectly, but they are frequently useful for practical purposes.
The conclusion of the above discussion is that no single type of approach is possible for all types of problems. Accordingly, one should not seek a uniform way to represent commonsense knowledge. In fact, there is frequent need to use several representations when we face a difficult problem, and then we will need additional knowledge about it. The causal-diversity method may help, but eventually
it must be replaced by more resourceful, knowledge-based systems that can generate new representations.
In fact, there is no single best way to represent knowledge. The present limitations of machine intelligence are largely due to our quest for unified theories capable of reasoning well in all situations. Human versatility can emerge from a large-scale architecture of representation, where each of several different representations can help overcome the deficiencies of the others.
Consequently, commonsense knowledge should be built into machines, representing knowledge about so many things: strings, houses, clothing, roads, books; in other words, everything that most children know. For such a commonsense knowledge base, we will need ways to link each unit of knowledge to the uses or functions that it can serve. Figure 19.18 shows some typical representation levels, which have been covered in previous chapters.
A number of tools exist for natural language processing and speech processing, either as open source, as research tools created by research laboratories, or as closed source. These tools have built-in functions to perform a number of commonly used tasks, which can be called directly, or scripts can be written using them to perform more complex NLP and speech processing jobs. For example, for NLP, they can tokenize a given text, do stemming and POS (parts-of-speech) tagging, find word frequencies in given documents, parse NL sentences, etc. These outputs can help to compute, for example, tf × idf, which is helpful in IR (information retrieval), IE (information extraction), text classification, etc. In the following part, we discuss some such tools, which are either open source or can be obtained on request from the respective research laboratories.
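As a small illustration of such a computation, the following toy Python sketch (with made-up documents) computes the tf × idf weight of a term in a document; real IR systems use various smoothed and normalized variants of this weighting.

import math

docs = ["the cat sat on the mat".split(),
        "the dog sat on the log".split(),
        "cats and dogs".split()]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)               # term frequency in doc
    df = sum(1 for d in docs if term in d)        # no. of docs containing term
    idf = math.log(len(docs) / df) if df else 0.0 # inverse document frequency
    return tf * idf

print(tf_idf("cat", docs[0], docs))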
19.12.1 NLTK
The NLTK (Natural Language Toolkit) is a collection of Python libraries and pro-
grams for symbolic and statistical natural language processing. It is suited to practitioners as well as to those who are learning natural language processing (NLP), and those who are conducting research in NLP or areas close to NLP, such as empirical linguistics, cognitive science, AI, IR, and machine learning.
The NLTK has also been used as a teaching tool, as a study tool for individuals, and as a prototyping and building platform for research systems. The Python language has been chosen due to its shallow learning curve, its transparent syntax and semantics, and its extraordinary capability for handling strings. Python is an interpreted language, which facilitates interactive exploration. It is an object-oriented language that allows data and methods to be encapsulated and reused easily. Python also comes with an extensive standard library, which includes tools for speech processing, natural language processing, numerical processing, and graphical programming [12].
Consider that the tasks of stemming and parts-of-speech (POS) tagging are independent, and both operate on sequences of tokens. If the stemming task is done first, the information required for POS tagging is lost. If the tagging task is performed first, the stemming process must be able to skip over the tags. If these two tasks are done independently of each other, it becomes difficult to align the resulting texts. Hence, as the combinations of tasks increase, it becomes extremely difficult to manage the data.
To address this problem, NLTK version 1.4 onward comes with a new architecture in which tokens are based on Python's native dictionary datatype, so that tokens can have an arbitrary number of named properties; Tag and Stem are examples of such properties. The NLTK even allows a whole sentence or document to be represented as a single token with a Sub-tokens attribute that holds sequences of smaller tokens.
A parse-tree can also be treated as a token, which has a special property/attribute called Children. The benefit of this type of architecture in NLTK is that it unifies many different data types and allows distinct tasks to be run independently. Of course, this architecture comes with an overhead for programmers, because the program needs to keep track of a growing number of property names.
The following commands in Python, with the NLP toolkit NLTK installed, demonstrate the tokenization of a given text into sentence tokens and word tokens. Since there is only one sentence, there is just one sentence token.
$ python
Python 2.7.14 (default, Sep 23 2017, 22:06:14)
[GCC 7.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more
information.
>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> text="Fundamentals of Artificial Intelligence."
>>> print(sent_tokenize(text))
['Fundamentals of Artificial Intelligence.']
>>> print(word_tokenize(text))
['Fundamentals', 'of', 'Artificial', 'Intelligence', '.']
>>>
Example 19.5 Stemming of a given set of words.
The following commands stem a set of words, reducing them to their stem words; they make use of the Porter stemming algorithm (the words chosen here are illustrative).
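>>> from nltk.stem import PorterStemmer
>>> ps = PorterStemmer()
>>> for w in ['pythoner', 'pythoning', 'pythoned']:
...     print(ps.stem(w))
...
python
python
python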
Parts-of-speech tagging (grammatical tagging), or disambiguation of word category, is the process of marking up each word in a text (corpus) as corresponding to a particular POS. This is carried out based on the word's definition as well as its context.3 The various POS in the English language are: noun, verb, adjective, adverb, pronoun, preposition, conjunction, and interjection.
POS tagging is carried out as part of computational linguistics using algorithms that associate discrete terms, as well as hidden parts of speech, with a set of descriptive tags. POS-tagging algorithms fall into two distinctive categories: rule-based and stochastic. For example, Brill's tagger, one of the first and most widely used English POS taggers, makes use of rule-based algorithms. The following example demonstrates POS tagging using NLTK.
Example 19.6 Parts-of-speech tagging.
3 Context: Relationship with adjacent and related words in a sentence, or phrase, or a paragraph.
>>> import nltk
>>> from nltk.tokenize import word_tokenize
>>> text = word_tokenize("Part of speech tagging and POS tagger")
>>> text
['Part', 'of', 'speech', 'tagging', 'and', 'POS', 'tagger']
>>> nltk.pos_tag(text)
[('Part', 'NN'), ('of', 'IN'), ('speech', 'NN'),
('tagging', 'NN'), ('and', 'CC'), ('POS', 'NNP'), ('tagger', 'NN')]
19.13 Summary
Exercises
7. Convert the following sentence from passive voice to active voice. Construct the necessary trees, and also write the steps.
The village was looted by dacoits.
S → NP VP
NP → N
NP → Det N
VP → V PP
PP → Prep NP
N → Rajan | bench
Det → the
Prep → on
8. Given the parse-tree in Fig. 19.19, construct the grammar for this.
9. Construct the grammars and parse-tree for the following sentences.
a. The boy who was sleeping was awakened.
b. The boy who was sleeping on the table was awakened.
c. Jack slept on the table.
10. Construct the parse-trees and resolve the ambiguities in the following sentences using the "selectional constraint". Also, specify whether the ambiguities are syntactic, semantic, or of some other kind.
a. "He saw the man with the horse."
b. "He saw the man with a gun."
c. "He saw the man with binoculars."
11. What are the different types of ambiguities in a natural language such as English?
References
1. Cambria E, White B (2014) Jumping NLP curves: a review of natural language processing research. IEEE Computational Intelligence Magazine 9(2):48–57
2. Chomsky N (1956) Three models for the description of language. IRE Trans Inform Theory
2(3):113–124
3. Chowdhary KR, Bansal VS (2006) Information extraction from natural language texts. Journal of the Institution of Engineers (India) 87:14–19
4. Chowdhary KR (2004) Natural Language Processing for Word Sense Disambiguation and
Information Extraction. PhD Thesis, JNV University, Jodhpur (India)
5. Jurafsky D, Martin JH (2002) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson Education Asia, ISBN 81-7808-594-1
6. Lenat D, Guha R (1989) Building Large Knowledge-Based Systems: Representation and Infer-
ence. The Cyc Project. Addison-Wesley, Boston, MA
7. Lin J (2007) An Exploration of the Principles Underlying Redundancy-Based Factoid Question
Answering. ACM Trans on Inf Sys 25(2):1–55
8. Miller GA (1995) WordNet: a lexical database for English. Communications of the ACM 38(11):39–41
9. Minsky M (1968) Semantic Information Processing. MA, MIT Press, Cambridge
10. Minsky M (2000) Commonsense-based Interfaces. Communications of the ACM 43(8):67–73
11. Mueller E (1998) Natural Language Processing with Thought-Treasure. Erik T. Mueller, New
York
12. https://www.nltk.org/, Cited 19 Dec 2017
13. Pearl J (1985) Bayesian networks: a model of self-activated memory for evidential reasoning. UCLA Computer Science Department, Tech Rep CSD-850017
14. Ralph G (1994) Computational Linguistics - An Introduction, Studies in Natural Language
Processing, Cambridge Univ Press
15. Reiter R (1980) A logic for default reasoning. Artificial Intelligence 13:81–132
16. Ronald C et al (1997) Survey of the state of art in human language technology – Studies in
Natural Language Processing, Cambridge Univ Press
17. Schank R (1975) Conceptual Information Processing. Elsevier Sc Inc, Amsterdam, The Nether-
lands
18. Singh P (2002) The open mind common sense project. http://www.kurzweilai.net/ Cited 19
Dec 2017
19. Sowa J (1987) Semantic networks. Encyclopedia of Artificial Intelligence, S. Shapiro edn,
Wiley, New York
20. Baldwin T et al (2009) Prepositions in applications: a survey and introduction to the special issue. Computational Linguistics 35(2):119–150
21. Turmo J et al (2006) Adaptive Information Extraction. ACM Computing Surveys 38(2):1–47
Chapter 20
Automatic Speech Recognition
Abstract There are basically two application modes for automatic speech recognition (ASR): using speech as spoken input, or using it as a knowledge source. Spoken input addresses applications like dictation systems and navigation (transactional) systems. Using speech as a knowledge source has applications like multimedia indexing systems. The chapter presents the stages of the speech recognition process, resources of ASR, the role and functions of a speech engine (e.g., the Julius speech recognition engine), voice-over-web resources, ASR algorithms, and language and acoustic models such as HMMs (hidden Markov models). Introductions to many open-source tools, like the Kaldi speech recognition toolkit, CMU-Sphinx, HTK, and Deep Speech, and guidelines for their usage are presented. These tools have interfaces with high-level languages like C/C++ and Python. This is followed by a chapter summary and a set of exercises.
20.1 Introduction
Speech has long been viewed as the future of computer interfaces, promising significant improvements in ease of use over the traditional keyboard and mouse. Basically, two application modes exist for speech recognition: 1. speech as spoken input to computers, and 2. speech used as a data or knowledge source. The first application mode comprises potential applications such as dictation systems, navigation systems, and transactional systems.
In a dictation application, a system transcribes the spoken words into written text, for dictating letters, reports, business correspondence, or e-mail messages to the computer/machine.
Speech can be used in the form of commands to navigate around applications: for example, selection of the main application, then one of its sub-applications, then a sub-sub-application, until you reach the command to execute the final application.
Speech can also be used for transactional applications, i.e., a command or command sequence spoken to cause a transaction to be performed. For example, speech-based transactions can be the purchase of stock, booking a flight ticket, reserving an itinerary for a tour, or performing a fund transfer.
The second mode of speech application, i.e., speech as a knowledge source has
applications like meeting capture, and knowledge management. These applications
are basically multimedia indexing systems that use speech recognition to transcribe
words verbatim from an audio file into text. Subsequently, the IR (Information
Retrieval) techniques are applied to the transcript to create an index with time offsets
into the audio. Users can access the index using text keywords to search a collection
of audio or video documents.
We understand both written and spoken language, and we also know that the skill of reading is learned much later. We now focus on spoken language. The problem of understanding spoken language can be divided into the major parts discussed below [6].
Phonological
The phonological processing step relates sounds to the words we recognize. A phone is the smallest unit of sound, and phones are aggregated into word sounds.
Morphological
This is lexical knowledge, which relates to word construction from basic units called morphemes. A morpheme is the smallest unit of meaning; for example, the construction of friendly from friend and ly.
Syntactic
This is knowledge about how words are organized to construct meaningful and correct sentences.
Pragmatics
This is high-level knowledge about how to use sentences in different contexts and how the contexts affect the meanings of the sentences.
World
This is knowledge useful in understanding the sentence and carrying out the conversation. It includes the other person's beliefs and goals.
Speech recognition by machine is important, as many problems get solved if our computer/laptop can recognize spoken words. Indeed, at present it is possible to search in some of the search engines by speaking rather than typing the keywords. In the following, we discuss some of the potential applications of automatic speech recognition (ASR).
Learning Outcomes of this Chapter:
1. Distinguish the goals of sound recognition, speech recognition, and speaker recog-
nition and identify how the raw audio signal will be handled differently in each
of these cases. [Familiarity]
At the simplest level, programs that are speech driven can be characterized by the words or phrases we say to a given application and how that application interprets them. An application's active vocabulary, i.e., what it listens for, determines what it understands. The speech recognition process makes use of a speech engine, which is language-independent, and what it recognizes can be from several domains. A domain comprises a vocabulary set, pronunciation models, and word-usage models that are associated with specific speech applications. An acoustic component is also present in the speech recognition engine, as part of the voice models used by the speech engine during recognition. The voice models can be speaker independent, or may be unique to the speaker.
Figure 20.1 shows the resources used by a typical speech engine during the process of speech recognition: domain-specific resources (vocabularies, pronunciations, and word-usage models for, e.g., a CS domain) and voice models (a speaker-independent voice model, or the user's own voice). The domain-specific resources can vary dynamically during recognition.
To have a voice facility over the web, a group of industry organizations, including AT&T, IBM, Lucent, and Motorola, established the VoiceXML Forum in March 1999 to develop and promote a new language: the Voice eXtensible Markup Language (http://www.w3.org/Voice/). The VoiceXML Forum was established with the objective of bringing content delivery to interactive voice response applications, using Web-based developments [7, 8].
VoiceXML (or VXML) is a digital document standard for specifying interactive media and voice dialogues between humans and machines. The language has applications in developing audio and voice response applications, such as banking systems, customer services, and automated customer service portals. VoiceXML applications are commonly used in industries and segments of commerce; some examples are as follows.
Speech synthesis,
Recognition of spoken and touch-tone key input,
Digitized audio,
Order inquiry,
Package tracking,
Flight tracking,
Voice access to email,
Emergency notification,
Audio news magazines,
Recording of spoken input,
Voice dialing, and
Directory assistance.
VoiceXML applications are developed and deployed in all the above major fields. These applications work analogously to how a web browser interprets and visually renders the Hypertext Markup Language (HTML) it receives from a web server: VoiceXML documents are interpreted by a voice browser, and users interact with voice browsers via a public network like the Internet. Like HTML, VoiceXML provides tags that instruct the voice browser to provide the functions of automatic speech recognition, speech synthesis, dialogue management, and audio playback.
Figure 20.2 shows the architecture of the VoiceXML model, which makes use of the standard client–server paradigm to integrate voice and data services. A voice service consists of a sequence of voice interaction dialogues between a user and an implementation platform. A document server, which can be external to the implementation platform, provides the dialogues. The overall service logic is maintained by the document servers or web servers, which also perform database and legacy system operations, and produce dialogues.
A VoiceXML document specifies each interaction dialogue that is conducted by the VoiceXML interpreter. The following is a minimal sketch of a VoiceXML document; its content is illustrative rather than taken from any deployed service:
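<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <!-- a single form that speaks one prompt to the caller -->
  <form id="greeting">
    <block>
      <prompt>Welcome to the voice portal.</prompt>
    </block>
  </form>
</vxml>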
A user's input affects dialogue interpretation, and the system collects this information in the form of requests, which it submits to a document server. The latter can reply with another VoiceXML document to continue the user's session with other dialogues. Grammar-based recognition vocabularies are commonly used to support voice services in VoiceXML applications.
The initial attempts at speech recognition tried to use expert knowledge of the speech production and perception processes. It was soon found that such knowledge was inadequate for capturing the complexities of continuous speech. Currently, statistical modeling techniques trained on many hours of speech have provided most speech recognition advancements [4].
The process of speech recognition starts with a sampled speech signal. This signal
has a good deal of redundancy because the physical constraints on the articulators
that produce speech—the glottis, tongue, lips, and so on—prevent them from mov-
ing quickly. Consequently, the ASR (Automatic Speech Recognition) system can
compress information by extracting a sequence of acoustic feature vectors from the
signal.
Typically, the system extracts a single multidimensional feature vector every 10 ms, consisting of 39 parameters. These feature vectors, which contain information about the local frequency content of the speech signal, are called acoustic observations, because they represent the quantities the ASR system actually observes. The system attempts to infer the spoken word sequence that could have produced the observed acoustic sequence.
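A sketch of such feature extraction, assuming the librosa library and a hypothetical input file, and using one common recipe for obtaining 39 parameters (13 MFCC coefficients plus their first- and second-order differences):

import librosa
import numpy as np

signal, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical file
hop = int(0.010 * sr)                                  # one frame every 10 ms
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13, hop_length=hop)
delta1 = librosa.feature.delta(mfcc)                   # first-order differences
delta2 = librosa.feature.delta(mfcc, order=2)          # second-order differences
features = np.vstack([mfcc, delta1, delta2]).T         # shape: (T, 39)
print(features.shape)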
To simplify the design, we assume that the speaker's vocabulary is known to the ASR system. This assumption helps in restricting the search to word sequences made up only of words listed in the ASR lexicon. The lexicon lists the vocabulary and provides, for each word, its phonemes: a set of basic units of words, usually individual speech sounds, used to pronounce the word.
Commercially available lexicons usually include tens of thousands of words. The length of the word sequence uttered by the speaker is not necessarily known; moreover, the same word sounds different when spoken by different speakers, or by the same speaker at two different times.
Consider that the length of the word sequence is N. If V is taken as the size of the lexicon, the ASR system can hypothesize V^N possible word sequences. Language constraints dictate that these word sequences are not equally likely to occur. For example, the word sequence "please call me" is more likely to occur than the sequence "please me call." In addition, the acoustic feature vectors extracted from the speech signal can provide important clues about the phonemes which produced them. The sequence of phonemes that corresponds to the acoustic observations can imply the word sequences that could have produced the sequence of these sounds. Hence, the acoustic observations provide an important source of information that can help further narrow down the space of possible word sequences.
ASR systems use the acoustic observations to compute the probability that the observed acoustic feature vectors were produced when the speaker uttered a particular word sequence. Essentially, the system efficiently computes these probabilities and outputs the most probable sequence of words as the decoded hypothesis.
Today's most successful speech recognition systems use a generative probabilistic model, shown as Eq. 20.1. The speech recognizer tries to find the word sequence \hat{w}_1^N (of N words) that maximizes the word sequence's probability, given some observed acoustic sequence y_1^T:

\hat{w}_1^N = \arg\max_{w_1^N} p(w_1^N \mid y_1^T) = \arg\max_{w_1^N} p(y_1^T \mid w_1^N)\, p(w_1^N).   (20.1)

This approach makes use of Bayes' theorem to compute the conditional probability of w_1^N given y_1^T. When the Bayes equation is expanded in the second line of Eq. 20.1, the denominator term p(y_1^T), which is common to all possible word sequences, is ignored. The equation maximizes the product of two terms: the probability of the acoustic observations given the word sequence, p(y_1^T | w_1^N), and the probability of the word sequence itself, p(w_1^N).
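A toy illustration of this decision rule, with invented numbers for the two candidate sequences discussed above:

candidates = {
    "please call me": (0.20, 1e-4),  # (acoustic likelihood, LM probability)
    "please me call": (0.22, 1e-7),  # acoustically similar, unlikely language
}
best = max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])
print(best)  # 'please call me': the language model outweighs the acoustics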
Figure 20.3 shows the process described by Eq. 20.1 as a block diagram. The lexicon, language model, and acoustic model components construct hypotheses for the most likely word sequence. These three basic components comprise the hypothesis search: a lexicon, a language model, and an acoustic model. Each is described in detail in the following [4].
20.5.1 Lexicon
A typical lexicon is shown in Table 20.1, with each word's possible pronunciations constructed from phonemes. The English language has 44 phonemes: despite there being just 26 letters, there are 44 unique sounds, and these are what distinguish one word or meaning from another. An individual word can have multiple pronunciations; for example, the word "the" has two pronunciations, as shown in Table 20.1. These multiple pronunciations complicate the process of recognition. The hypothesis search chooses the lexicon on the basis of the task, trading off vocabulary size against word coverage. Although a search can easily find phonetic representations for commonly used words in various sources, task-dependent jargon often requires writing out pronunciations by hand.
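Such a lexicon can be explored, for instance, with the CMU Pronouncing Dictionary available as an NLTK corpus; a minimal sketch (assuming the cmudict corpus can be downloaded) is:

import nltk
nltk.download('cmudict', quiet=True)   # fetch the corpus if missing
from nltk.corpus import cmudict

lexicon = cmudict.dict()               # maps word -> list of phoneme sequences
for pron in lexicon['the']:            # 'the' has multiple pronunciations
    print(pron)
# prints entries such as ['DH', 'AH0'] and ['DH', 'IY0']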
The search for the most likely word sequence corresponding to the sampled speech features requires the computation of the terms p(y_1^T | w_1^N) and p(w_1^N) in Eq. 20.1. The term p(w_1^N) is called the language model. Its computation requires the assignment of a probability to a sequence of words w_1^N. The simplest way one can imagine to determine such a probability is to compute the relative frequencies of different word sequences; as we discussed earlier, "please call me" is more probable than "please me call." Note, however, that the total number of different sequences can grow exponentially with the length of the sequence, making this approach computationally infeasible. Therefore, approximations are needed.
A typical approximation assumes that the probability of the current word in the sequence depends on the previous two words only (a trigram model, as against general n-grams). Under this assumption, the probability of the word sequence can be approximated as:

p(w_1^N) \approx p(w_1)\, p(w_2 \mid w_1) \prod_{i=3}^{N} p(w_i \mid w_{i-1}, w_{i-2}).   (20.2)
The term p(w_i \mid w_{i-1}, w_{i-2}) can be estimated by counting the relative frequencies of word trigrams, or triplets:

p(w_i \mid w_{i-1}, w_{i-2}) \approx \frac{N(w_i, w_{i-1}, w_{i-2})}{N(w_{i-1}, w_{i-2})}.   (20.3)
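A toy sketch of this relative-frequency estimate over a tiny made-up corpus:

from collections import Counter

corpus = "please call me when you can please call me later".split()
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))  # N(w_{i-2}, w_{i-1}, w_i)
bi = Counter(zip(corpus, corpus[1:]))               # N(w_{i-2}, w_{i-1})

def p_trigram(w, w1, w2):
    # p(w | w1 w2), where w1 and w2 are the two preceding words
    return tri[(w1, w2, w)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0

print(p_trigram("me", "please", "call"))  # 1.0 in this toy corpus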
An acoustic model computes the probability of the feature vector sequence y_1^T under the assumption that a particular word sequence w_1^N produced the vectors. In other words, the acoustic model is p(y_1^T | w_1^N).
[Fig. 20.4: a three-state HMM for a phoneme, with states s1, s2, s3 and transition probabilities p1, p2, p3; the complementary probabilities 1 − p1, 1 − p2, 1 − p3 label the remaining transitions]
Due to the inherently stochastic nature of speech, a speaker never utters a word exactly the same way twice. The variation in a word's or phoneme's pronunciation manifests itself in two ways: duration, and spectral content (the acoustic observations). In addition, a particular phoneme's spectral content is affected by the phonemes in the surrounding context, a phenomenon called the co-articulation effect. It is therefore necessary that the acoustic models used take these co-articulation effects into account. One of the popular acoustic models is based on the HMM (Hidden Markov Model).
Hidden Markov Models
A hidden Markov model offers a natural choice for modeling speech’s stochastic
aspects. HMMs function as probabilistic finite-state machines—the model has a set
of states, and its topology specifies the allowed transitions between them. At every
time frame, an HMM makes a probabilistic transition from one state to another and
emits a feature vector with each transition.
Figure 20.4 represents a phoneme's transitions using an HMM. The transitions in the waveform of a phoneme correspond to state transitions in the HMM. We may think of an HMM as a finite automaton with transitions governed by probabilities. Accordingly, we take p1, p2, and p3 as the state transition probabilities of the HMM, which govern the possible transitions between its states. These probabilities specify the probability of going from one state at time t to another state at time t + 1. The feature vectors emitted while making a particular transition represent the spectral characteristics of the speech waveform at that point. These feature vectors vary with varying pronunciations of the phoneme. A probability distribution or probability density function can model this variation. In Fig. 20.4, the functions p(y|1), p(y|2), p(y|3) may differ between transitions. Typically, these distributions are modeled as parametric distributions, namely mixtures of multidimensional Gaussian distributions.
The HMM in Fig. 20.4 has three states, representing the pronunciation of a phoneme starting at state s1. The phoneme then corresponds to a sequence of transitions, terminating at state s3. The duration of a phoneme is equal to the number of time frames required to complete the transition sequence. The transition probabilities p1 ... p3 implicitly define the probability distribution that governs the duration of the phoneme. If any of these states exhibits a high self-loop1 probability, the model spends more time in that state, consequently consuming a longer time
1 For example, the word "speech" can also be pronounced as "spee...ech", repeating the sound of 'e', which creates a self-loop.
to go from the first to the third state. Note that how long a self-loop may repeat is unknown and varies from speaker to speaker. A self-loop (a state) may typically repeat a few times, say 2–5; however, some words' phonemes may have an exceptionally long loop, for example, the chanting of Aum.2 The probability density functions associated with these three transitions govern the feature vector output.
A fundamental task to be performed with an HMM is the computation of the likelihood that it produced a given sequence of acoustic feature vectors. For example, assume that the system extracted T feature vectors from speech corresponding to the pronunciation of a single phoneme, and that the system seeks to infer which phoneme from a set of, say, 50 was spoken, given these feature vectors. The procedure for inferring the phoneme first assumes that the ith phoneme was spoken, then finds the likelihood that the HMM for this phoneme produced the observed feature vectors. The system then hypothesizes that the spoken phoneme is the one whose model has the highest likelihood of matching the observed sequence of feature vectors.
If the sequence of HMM states is known, we can easily compute the probability of a set of feature vectors: the system computes the likelihood of the tth feature vector y_t using the probability density function of the HMM state at time t, and the likelihood of the whole set of T feature vectors is then simply the product of the individual likelihoods of the y_t. Usually, however, the actual sequence of state transitions is not known, so for the computation of the likelihood a sum is taken over all possible state sequences. Given that the HMM dependencies are local, it is possible to derive efficient formulas for performing these calculations recursively.
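The following is a compact sketch of this recursive computation (the forward algorithm) for a small HMM; the transition matrix, start distribution, and per-state emission likelihoods are invented for illustration:

import numpy as np

A = np.array([[0.6, 0.4, 0.0],   # A[i, j] = p(next state j | current state i)
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0])   # the model starts in state s1

def forward_likelihood(obs_lik):
    # obs_lik[t, j] = p(y_t | state j); returns p(y_1 ... y_T | model)
    alpha = pi * obs_lik[0]               # initialization at t = 1
    for t in range(1, len(obs_lik)):
        alpha = (alpha @ A) * obs_lik[t]  # recursion: sum over predecessors
    return alpha.sum()                    # termination: sum over end states

obs_lik = np.array([[0.9, 0.1, 0.0],      # toy per-state emission likelihoods
                    [0.2, 0.7, 0.1],
                    [0.1, 0.3, 0.6]])
print(forward_likelihood(obs_lik))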
Parameter Estimation
Before an HMM can be used to compute likelihood values of feature vector sequences, it needs to be trained to estimate the model's parameters. The training process requires a large volume of training data in the form of mappings between "spoken word sequences" and the "feature vectors" extracted from the corresponding speech signals. The commonly used procedure for finding this correspondence is maximum likelihood (ml) estimation (\hat{θ}_{ml}). Given that the correct word sequence corresponding to the feature vector sequence is known, maximum likelihood estimation chooses those HMM parameters which maximize the likelihood of the training feature vectors given the correct word sequence. Let y_1^T represent the stream of T acoustic observations, and let w_1^N represent the correct word sequence; then the maximum likelihood estimate \hat{θ}_{ml} is,
2 Chanting of Aum is a common practice during meditation and yoga, where the sound is prolonged.
\hat{\theta}_{ml} = \arg\max_{\hat{\theta}} \sum_{t=1}^{T} \sum_{s_t} p_{\theta}(s_t \mid y_1^T)\, \log\big[p_{\hat{\theta}}(y_t \mid s_t)\big].   (20.5)
An iterative process is used to solve Eq. 20.5, with each iteration having two steps: 1. an expectation step, and 2. a maximization step. The expectation step computes the posterior probability p_θ(s_t | y_1^T), i.e., the count of a state. This posterior probability is conditioned on all the acoustic observations; the system makes use of the current parameter estimate of the HMM, and the computation is performed using the forward–backward algorithm. The maximization step then chooses the parameter \hat{θ} to maximize Eq. 20.5. When the probability functions are Gaussian, closed-form expressions can be derived for this step [4].
Before we proceed to some of the commonly available tools for speech recognition, let us try to understand some important terminology of speech recognition.
Automatic speech recognition (ASR), or simply speech recognition, or computer-based speech recognition, is the process of converting speech into a sequence of words, using algorithms implemented as programs. A standard way of doing this is to first split the utterances in the speech waveform with respect to certain parameters of speech recognition. Some of these are: presence of voice activity, duration, pitch, voice quality, voice intensity, S/N (signal-to-noise) ratio, and the strength of the Lombard effect.3 This is followed by the recognition of the segmented utterances.
3 Lombard effect: The involuntary tendency of speakers to increase their vocal effort, particularly when speaking against a loud background, to enhance the audibility of their voice. Due to the Lombard effect, not only does the loudness increase, but also other acoustic features such as pitch, rate, and duration of syllables. The Lombard effect also results in an increase in the signal-to-noise ratio of the speaker's signal.
Julius
Julius is an open-source speech recognition engine; it can take live input from a microphone or can use any audio device. The first pass of Julius segments the input at short pauses, and the second pass sequentially decodes these segments and appends them to the results. During the first pass, when a short pause has the maximum likelihood at a certain point of time, a break is placed at that point and the second pass is executed on that utterance segment. Due to this process, word constraints are preserved as the context within an utterance segment, and the first pass may continue over to the next utterance. Using the above sequence of steps, an input speech file with multiple sentences can be decoded.
Speech recognition is available in English and many other languages, and this feature is common in most smartphones, laptops, and PCs. In the following, we discuss basic tools available as open source.
Kaldi Speech Recognition Toolkit
Kaldi is an open-source speech recognition toolkit, written in C++ and released under the Apache license.
It is a finite-state transducer (FST) based framework, with linear algebra support. Figure 20.6 shows the different components of Kaldi: the library modules are grouped together and depend on two types of external libraries, 1. linear algebra libraries (the numerical algebra libraries) and 2. OpenFst (the finite-state framework). These two external libraries are also freely available as open source [3, 5].
Access to the library functionality is provided through command-line tools written in C++. These tools are called from a scripting language for building and running a speech recognizer. Each tool has a very specific functionality with a small set of command-line arguments; for example, there are separate commands to be executed for each stage of building and running the recognizer.
Deep Speech
The Deep Speech RNN has five layers of hidden units; for an input x, the hidden units at layer l are denoted h^{(l)}, with the input being h^{(0)}. The first three layers are not recurrent. For the first layer, the output at each time t depends on the spectrogram frame x_t along with a context of C frames on each side. The remaining non-recurrent layers operate on independent data for each time step. Thus, for each time t, the first three layers are computed by

h_t^{(l)} = g\big(W^{(l)} h_t^{(l-1)} + b^{(l)}\big),

where g(z) = min{max{0, z}, 20} is the clipped rectified-linear activation function, and W^{(l)}, b^{(l)} are the weight matrix and bias parameters for layer l.
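A small sketch of this activation and one such non-recurrent layer; the layer sizes and random weights below are stand-ins, not the values used by any particular system:

import numpy as np

def clipped_relu(z, cap=20.0):
    # g(z) = min{max{0, z}, 20}
    return np.minimum(np.maximum(0.0, z), cap)

rng = np.random.default_rng(0)
n_in, n_hidden = 39 * 11, 256           # hypothetical: 39-dim frames, C = 5
W1, b1 = rng.normal(size=(n_hidden, n_in)), np.zeros(n_hidden)
x_context = rng.normal(size=n_in)       # frame x_t plus 5 context frames each side
h1 = clipped_relu(W1 @ x_context + b1)  # first-layer hidden units h_t^(1)
print(h1.shape)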
HTK Tool
The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models. It is primarily used for speech recognition research and speech synthesis, as well as for character recognition and DNA sequencing. HTK consists of a set of library modules and tools available in ANSI C source form. These tools provide sophisticated facilities for speech analysis, HMM training, testing, and result analysis.
The statistical speech models used here are context-dependent hidden Markov models. Probabilities of word sequences are based on N-grams; the most probable word sequence is found using the language model (refer to Eqs. 20.1 and 20.2) and the acoustic model (refer to Eq. 20.1).
20.7 Summary
Exercises
1. Consider the alphabet Σ = {a, b, c, d}. Create finite automata (recognizers) for the following strings.
4. Text processing algorithms are usually written in Python, while the ASR algorithms, which produce the same text, are written in C/C++. Explain what could be the reason behind this.
5. What is the fundamental difference between the language model and the acoustic model? Why are they named so?
References
Chapter 21
Machine Vision
21.1 Introduction
The field of Machine Vision (also called Computer Vision) is growing rapidly. It is concerned with the analysis, modification, and understanding of images. The objective of this field is to understand what is happening in front of a camera, and to use that understanding either to control a computer or robotic system, or to provide people with new images that are more informative or aesthetically more pleasing than the originals. Machine vision has vast applications; some of its important application areas are the following: automotive systems, photography, video surveillance, biometrics, movie production, Web search, medicine, medical and health sciences, augmented reality, gaming, and new user interfaces with computers. In fact, there cannot be an exhaustive list of all possible areas, as their number is continuously increasing.
As part of the advances in machine vision, there are cameras that can focus automatically on our face, and can trigger the shutter when we smile. Likewise, using an optical text-recognition system, one can transform scanned documents into text, which can be analyzed or read aloud by a voice synthesizer.
It is common to have driver-assistance systems in newly built cars that help the driver to park the car, or give a warning signal to the driver in potentially dangerous situations. Intelligent video surveillance systems exist that can monitor the security of public areas. As new smartphones come progressively equipped with more processing power and better camera resolution, these phones are becoming a fertile field for potential computer vision applications. These devices have the built-in capability to merge a number of photographs into a high-resolution panorama; they can also read a quick-response code, recognize it, and retrieve information about the product from the Internet. Mobile computer vision technology is moving fast toward the touch interface. However, computer vision is still computationally too expensive to achieve all this in the ideal sense.
Computer vision applications have real use only when they can be executed in real time, for which the processing of each frame needs to be completed within a 30–40 ms duration. This is a challenging task, particularly when the computation takes place, e.g., in a smartphone or some other embedded computing environment. Generally, it is possible to trade quality for speed or vice versa. For example, in a panorama stitching algorithm, if time is provided more liberally, it is possible to find more matches in the source images and synthesize an image of higher quality. To meet the constraints on time and computation, one can either compromise on quality or spend longer optimizing the code for a given hardware architecture.
Further, one of the goals of computer vision or machine vision is to extract information from images. For example, a structure-from-motion algorithm can recover a 3D model of an object from a sequence of views. As other examples, it is possible to construct a 3D view for a robot grasping an object; to construct a 3D object through medical imaging; and to produce a 3D view through graphical modeling. Model-based recognition methods can determine the best matches of stored objects for image data, for use in visual inspection and image database searches. Through visual motion analysis, it is possible to recover image motion patterns for use in vehicle guidance and in processing digital video.
The field of computer vision has a close relation to the field of image processing. In image processing, the focus is on the transformation of images, while in computer vision, the focus is on extracting information from image data. Extracting 3D structure from 2D images is a case of computer vision, not of image processing; image filtering and color enhancement, in contrast, are tasks of image processing.
Though many other computer science areas (computation), and board game playing (chess, and tic-tac-toe), have progressed enough to far exceed human speed, the area of computer vision heavily lags behind human capabilities.
Originally, computer vision systems were designed with the aim of being used in industrial and military applications. Some of the common military applications of computer vision are: recognition of far-off targets, visual-based guidance systems for autonomous vehicles, and interpretation of images. In recent times, many computer vision applications have emerged in medical imaging and multimedia systems, like preoperative scans of patients in the operating room, complex surgery using robots, etc. Computer vision techniques are also used for virtual reality applications and image database retrieval systems. The successful applications of computer vision exist as a consequence of two important effects: 1. a technical push toward limiting the domain to those having sufficient structure or constraint that the many-to-one
1 In the AI literature, the words "computer vision" and "machine vision" are used interchangeably; hence in this chapter also, both terms mean the same.
inversion problem can be made tractable, and 2. an economic push toward having a problem of sufficient base to be financially viable. The possible solutions lie at the intersection of these two conditions. The following is a brief account of machine vision applications [4].
Visual Inspection
One of the most successful applications of computer vision is in industrial quality control, for example, in the automation of visual inspection tasks. This replaces human visual inspection, which is crucial to quality control across a broad spectrum of manufacturing, ranging from low-volume custom products to high-volume and bulk products.
Assembly and Material Handling
In the automatic assembly of products, the central driver is the uncertainty in the factory. To address this problem, robot vision is typically used. Robot vision is based on computer vision algorithms, which have applications in the navigation and path control of robot vehicles and robot manipulators. This application is governed by economic drivers as well as quality control.
Automatic Target Recognition (ATR)
Due to the growing sophistication of weapon systems, it has become necessary to provide rapid and accurate identification and tracking of targets. The main driving force toward progress in this direction is accuracy of recognition, with tracking of multiple targets at the same time.
Photo Interpretation
This work requires the digitization of image data and its interpretation using vision algorithms.
Extraction of 3D Structure
There are many engineering applications that require the construction of complex 3D geometric models and terrains; such applications include engineering simulation, numerically controlled machines, virtual reality, and advanced cartography. The inputs to such models are drawings, intensity images, and 3D sensors like laser triangulation. The algorithms used in these applications automatically derive a 3D solid model of an object, which drives a finite element or graphic simulation program.
One of the most basic and essential characteristics of living beings is the ability to recognize and identify objects. All higher animals depend on this ability for their survival; without it, they are unable to function even in a static and unchanging environment. Sight is our most impressive sense. It gives us, without conscious
effort, detailed information about the shape of the world around us. It is also a challenging area in AI.
The most natural place to begin our study is the transduction of physical phenomena into an internal representation. Vision starts with an eye, a device for capturing and focusing the light that bounces off objects. Every point on an object (other than a mirror) scatters incident light in all directions. An eye has a lens that directs the light from one point on an object to just one point on the retina (see Fig. 21.1). The image on the retina is upside down, but the human mind corrects it. We note that the image is 2D, while the object is 3D [1].
Definition 21.1 (Vision Problem) A vision problem may be stated as: given a 2D image, infer the objects that produced it, as well as their shapes, positions, sizes, and colors.
However, one imprecision exists in the above definition: how exactly these characteristics count toward an object. For monochrome images, an image may be thought of as a function giving a gray-level at every point on the image plane. The gray-level varies from 0 (maximum black) to 1 (maximum bright), the latter corresponding to the maximum response the eye can make. We assume that the surface on which the image is formed is flat in the x and y coordinates, which is appropriate for computer vision as well. We approximate the gray-level function by a 2D array image, which breaks the image into squares and records the average gray-level over each square, with the pixel at a point, say (i, j), being the cell image(i, j). The value of the gray-level function at each pixel is in the range [0, 1].
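A short sketch of this approximation, averaging a made-up gray-level function over square cells to obtain the pixel array image(i, j):

import numpy as np

def to_pixel_array(gray, cell=4):
    # average a fine-grained gray-level function over cell x cell squares
    h, w = gray.shape
    h, w = h - h % cell, w - w % cell                # crop to multiples of cell
    blocks = gray[:h, :w].reshape(h // cell, cell, w // cell, cell)
    return blocks.mean(axis=(1, 3))                  # image(i, j) in [0, 1]

gray = np.random.default_rng(1).random((64, 64))     # stand-in gray-level data
image = to_pixel_array(gray)
print(image.shape, image.min() >= 0.0, image.max() <= 1.0)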
It is a challenge to reason back from an image to the scene that created it. Figure 21.1 suggests that each spot in the image corresponds to one spot in the real world, namely, the piece of surface the light bounced off in order to make the image spot. The recognition of images is a process of pattern recognition. Recognition requires learning, which means mapping image patterns to some internal representation; this is the transformation of scenes into images. However, recognition of objects is the inversion of this process, i.e., the reconstruction of scene features from images, which is a complex task.
Though it is not yet fully established, it is hypothesized that humans follow a process along the following lines for the identification or classification of objects.
Recognition is, in fact, a process of establishing a close match between some new stimulus and a previously stored stimulus pattern. Object recognition is the task of finding and labeling parts of a 2D image of a scene that correspond to objects in the scene; for example, identifying from an aerial photo the various objects, like buildings, roads, forests, etc. It is a task as might be carried out by a human observer with a marking pen, marking and labeling the objects.
Object classification is closely related to recognition. The ability to classify various objects or groups of objects according to some commonly shared features is a form of class recognition. Classification is essential for decision-making, learning, and many other cognitive acts. Like recognition, classification depends on the ability to discover common patterns among objects. This ability must be acquired through some learning process: for learning to take place, the prominent feature patterns that characterize classes of objects must be discovered, generalized, and stored for subsequent recall and comparison.
To recognize an object, it is required first to establish a general description of each object to be recognized. Most often, a model includes texture, shape, and contextual knowledge about the occurrence of such objects in a scene. For instance, a mathematical description of a set of shaded rectangles may be used to model buildings as objects. A 3D building object could be modeled as a set of rectangular solids. Texture information might include colors or knowledge about the layout of a building's windows.
Corresponding to each occurrence or instance of a model in the image, a label is attached, called a model label, which can be thought of as a tag pinned to an area in the image that we consider to show an instance of the corresponding object model. The "roads" and "buildings/houses" in Fig. 21.2, shown by arrows, are examples of model labels. Note that a model may be 2D or 3D, but the labels always mark 2D model instances.
Fig. 21.2 An aerial image of a hilly town, with labels attached to buildings and roads

There are many important distinctions about the kind of information we are interested in within a digital image, corresponding to the scene of that image. The most elementary type of information is called syntactic information, which is concerned only with pixel values and their meanings. The other kind, called semantic information, deals with the knowledge and meaning of the information. Hence, a syntactic image operator's role is to blindly apply an algorithm to the pixel values, without concern for the meaning associated with those pixels. An example is a procedure that assembles groups of adjacent pixels which have, say, a high contrast with respect to their neighbors. On the contrary, a semantic operator uses models of the scene and the image production process. Such operators incorporate symbolic knowledge about how the image is organized, such as "a part may be lying on top of another."
The process of mechanized recognition of new objects is carried out through the following steps (a toy sketch of these steps follows the list):
1. The stimuli produced by objects are perceived by the sensory devices. The more prominent attributes, like size, shape, color, and texture, produce the stronger stimuli. The values of these attributes and their relations are used to characterize an object in the form of a pattern vector, as a string. This string is represented as a classification tree, a description graph, or some other form. The range of the characteristic attribute values is known as the measurement space.
2. A subset of attributes whose values provide cohesive object grouping or clustering, consistent with some goals associated with the object classifications, is selected. The range of this subset of attribute values is the feature space.
3. Using the selected attribute values, the object or class characterization models are learned by forming generalized prototype descriptions, classification rules, or decision functions. These models are stored for subsequent recognition. The range of decision function values or classification rules is the decision space.
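A toy sketch of these steps, with a hypothetical three-component pattern vector and invented class prototypes, classifying by the nearest stored prototype:

import numpy as np

prototypes = {                      # learned class prototypes (hypothetical)
    "building": np.array([0.8, 0.5, 0.6]),
    "road":     np.array([0.3, 0.9, 0.4]),
}

def classify(pattern):
    # return the class whose stored prototype is closest in feature space
    return min(prototypes, key=lambda c: np.linalg.norm(prototypes[c] - pattern))

print(classify(np.array([0.75, 0.55, 0.58])))  # -> 'building'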
Recognition of familiar objects is achieved through application of the rules learned in step 3 above, by comparison and matching of object features with the stored models.
The modeling of conceptual categories is concerned with some standard problems. One such problem is that perceptual commonsense concepts never correspond to any clear classical category which can be described in terms of necessary and sufficient conditions. For example, membership in perceptual concept categories is never a crisp 0-or-1 matter; usually there is a need to consider a prototype of the category.
The information available in a cognitive system strictly depends on data acquired through a measurement process. Hence, knowledge derived from this information at the conceptual level will also be affected by measurement errors. To solve this problem, a model mapping is carried out between the conceptual and linguistic levels using a connectionist2 device. The use of Neural Networks (NNs) eliminates the need for an exhaustive description of conceptual categories at the symbolic level. The latter becomes possible because a prototype is built based on the associative mapping taking place during the training phase of the NN. A measure of similarity between the prototype and a given object is implicit in the behavior of the NN, and is determined during the learning phase of the network.
For a creature, whether biological or mechanical (in the latter case, e.g., a robot), to interact effectively with its environment, it needs to know what the objects around it are, and where they exist. Computer vision provides the basic methods for making intelligent decisions about the environment of interaction, on the basis of the sensory inputs received. To support intelligent interaction with the environment, the vision system must have information about the objects in the world with which it interacts. This knowledge about the objects and their positions becomes known to the vision system only on the basis of measurements of the inputs received in the form of reflected brightness. To convert this brightness into measurements of the world, it is first necessary to know how the objects are mapped into the image brightness. The magnitude of light recorded by an individual sensor is the result of a complex interaction of the following parameters:
• Orientation and position of the object, as well as the position and orientation of
the light sources with respect to the sensor,
• 3D shape of the object,
• The reflective properties of the object's surface (i.e., the rules of physics that govern the reflection of light rays from the surface of the object),
• Spectral sensitivity and quantum efficiency of the sensor, and
• Spectral properties of the light sources.
Generating an image from a given set of scene parameters (i.e., mapping from scene to image) is a well-defined process. However, inverting an image to compute the scene parameters that gave rise to it is an ill-posed and challenging problem.
One approach to performing the inversion from an image to the scene that created it is to fix some scene parameters. For example, if we have information about the surface reflectivity3 of an isolated object, and about the position of the light source, there is a nonlinear equation available that directly relates image brightness at a point to the object's local surface orientation at that point. Since there are two unknown variables for each measured brightness, we need more constraints to determine the scene.
One method, called shape-from-shading, assumes that the surface is locally smooth, which is sufficient to provide a solution for the object's shape. Other methods are called photometric stereo methods. They work by taking several images from a single camera position but with different light sources; the set of brightnesses corresponding to a single point is then used to determine the local surface shape, as sketched below.
There is a class of methods for going from image to scene, i.e., image inversion, that is based on restricting the problem domain. Consider the case of printed character recognition, as in Optical Character Recognition (OCR), where the domain consists of 2D objects with known shapes and a restricted range of orientations and positions. The inversion problem, in this case, is basically one of separating the individual objects (letters, digits, and punctuation marks) in the presence of noisy brightnesses, and then matching the shapes of those objects against canonical models. There are in fact many successful commercial tools available for recognizing printed characters, but the problem becomes far more complex if the range of characters is extended to cursive scripts with all their possible variants!
3 The fraction of radiant energy that is reflected from a surface, also called coefficient of reflection.
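In the spirit of such a restricted domain, a minimal matching sketch follows (the file names and acceptance threshold are assumptions): a segmented glyph is compared against one canonical template with OpenCV's normalized cross-correlation; a commercial OCR system would of course do far more.

#include <opencv2/opencv.hpp>
#include <cstdio>

int main() {
    cv::Mat glyph = cv::imread("glyph.png", cv::IMREAD_GRAYSCALE);      // isolated character
    cv::Mat tmplA = cv::imread("template_A.png", cv::IMREAD_GRAYSCALE); // canonical model of 'A'
    // The template is assumed to be no larger than the glyph image.
    cv::Mat score;
    cv::matchTemplate(glyph, tmplA, score, cv::TM_CCOEFF_NORMED);
    double maxVal;
    cv::minMaxLoc(score, nullptr, &maxVal);
    bool isA = maxVal > 0.9;  // accept when the best correlation exceeds a chosen threshold
    std::printf("best score %.3f -> %s\n", maxVal, isA ? "match" : "no match");
    return 0;
}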
Yet another approach to image inversion is to acquire additional images. For example, using stereo vision, which obtains two views of a scene, it is possible to extract the 3D shapes of objects. The principle of stereo vision is as follows: if one can determine points in two images that are projections of a single point in the original scene, and if the relative orientation of the two cameras is known, then the method of triangulation can produce a 3D reconstruction of the shapes of the objects that exist in the real world. A similar principle holds for a motion sequence of images, where either the camera or the objects in the real world move between the images. This process again involves a matching problem, this time matching features between two images, instead of between an image and a model. Most machine vision algorithms depend on matching schemes that measure the similarity between features in the two images, and make use of multi-resolution methods to control the complexity of such matchings.
Once the 3D shape of the scene is extracted, recognition of the scene is made possible by matching 3D shape descriptions, while navigation in the 3D world is supported by identifying collision-free paths through the reconstructed world. A large variety of matching schemes exists, including methods that directly search the space of all possible pairings of model and image features, called the correspondence space. There are other methods that directly search the space of all possible poses of the objects, and there are hybrid methods that combine both types of spaces. Most of the methods rely on geometric constraints to reduce the complexity of the search, constraints that typically encode information about the objects' shapes. The constraints are also used to measure the feasibility of matching the data features to model features, such that they are consistent with the legal transformations of the object. The following are typically the difficult parts of this problem: 1. efficiently deciding which image features belong to a single object in the presence of scene clutter and occlusion, and 2. effectively matching partial descriptions of the object in the presence of uncertainty in the sensor input.
From the above discussion, we find it common that vision methods attempt to extract scene parameters, like surface material type and object shape, given the observed brightnesses, and then use these extracted parameters to match against known object models to support recognition.
21.6 Machine Vision Techniques
Machine vision systems operate on digital images, i.e., quantized collections of discrete values in 2D space with varying intensity. The methods used in machine vision are classified as low-level vision, middle-level vision, and high-level vision. Though this is not the only classification for vision systems, it provides a useful way of classifying computer vision problems [9].
The Low-level vision techniques operate directly on images and produce the output
images in the same coordinate system as that of the input. For example, an edge
detection algorithm may take an image with different intensity values as input, and
produce a binary output that indicates the edges in the picture.
The Middle-level vision techniques take as input an image or the output produced by a low-level vision algorithm, and may produce output that is something other than pixels in the image coordinate system. For example, a stereo-vision system may produce a 3D shape as output using two 2D images as input. As another example, a structure-from-motion algorithm takes as input a set of image features and produces as output the 3D coordinates of these features.
The High-level vision techniques take as input the results of low- or middle-level vision algorithms and produce as output some abstract data structures. For example, a model-based recognition system may take a set of image features as input, and provide as output the geometric transformations that map the models in its database to their respective locations in the image.
Low-Level Machine Vision
The low-level vision computations operate directly on images and produce pixel-based outputs that remain within the original image coordinates. Examples of these computations are: finding the intensity edges in an image, smoothing images using filters, computing visual motion fields, and analyzing color information in images [9].
In the following, we discuss edge detection and smoothing of images in more detail.
Local Edge Detection
Edge detection on an image of a real-world object is carried out to find the geometry of the image, for use in higher level image processing. Certain physical events cause intensity changes, i.e., the appearance of an edge, in images. For example, object boundaries produce intensity changes caused by a discontinuity in depth, or by a difference in surface color or texture. Surface boundaries produce intensity changes due to a difference in surface orientation. However, some phenomena produce intensity changes that do not reflect directly on the geometry of the object; examples are shadows and interreflections [9].
As an example, we represent a gray-level image by A(x, y), an intensity function of the image coordinates (x, y). Intensity edges in the image correspond to sudden changes in the value of the function A(x, y). For this image, we compute the gradient magnitude using the expression
∇A² = (∂A/∂x)² + (∂A/∂y)².    (21.1)
In simple words, where the squared gradient magnitude (i.e., ∇A²) is large, it is an indication of an edge. The Laplacian operator (∇²) is another local differential operator used for edge detection, expressed by
∇²A = ∂²A/∂x² + ∂²A/∂y².    (21.2)
This second-derivative operator retains information about which side of the edge is brighter: the zero crossings of ∇²A indicate the intensity edges of the image, and the sign on each side of a zero crossing indicates which side is brighter.
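A minimal OpenCV sketch of both operators follows (the file name and the edge threshold are assumptions): Sobel derivatives approximate ∂A/∂x and ∂A/∂y for the squared gradient magnitude of Eq. (21.1), and cv::Laplacian computes Eq. (21.2).

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat A = cv::imread("scene.png", cv::IMREAD_GRAYSCALE);
    cv::Mat Ax, Ay, lap;
    cv::Sobel(A, Ax, CV_64F, 1, 0);           // dA/dx
    cv::Sobel(A, Ay, CV_64F, 0, 1);           // dA/dy
    cv::Mat gradSq = Ax.mul(Ax) + Ay.mul(Ay); // squared gradient magnitude, Eq. (21.1)
    cv::Laplacian(A, lap, CV_64F);            // Eq. (21.2); zero crossings mark edges
    cv::Mat edges = gradSq > 2500.0;          // binary edge map: large squared gradient
    cv::imwrite("edges.png", edges);
    return 0;
}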
The Machine Vision Systems
The images input to machine vision systems are digitized with respect to both space and intensity, and are available as an array A[x, y] of discrete intensity values. Approximations based on finite differences are used to estimate the derivatives for computing the local differential operators. A discrete 1D sampled function is represented as a vector f[i]. The first-order derivative of this vector, df/dx, can be approximated as f[i + 1] − f[i], and the second-order derivative as f[i − 1] − 2f[i] + f[i + 1]. Hence, the gradient magnitude can be approximated as
(∂A/∂x)² + (∂A/∂y)² ≈ (A[j + 1, k + 1] − A[j, k])² + (A[j, k + 1] − A[j + 1, k])².    (21.3)
Images are smoothed by convolving them with a Gaussian function, given as
Gσ(x) = (1 / (√(2π) σ)) e^(−x² / 2σ²).    (21.5)
In a discrete system, samples at integer steps over some range, generally ±4σ, are used as approximations. These values are normalized so that they sum to 1, just as the continuous Gaussian integrates to 1.
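The sampling and normalization just described can be written out directly, as in this small sketch (σ is chosen arbitrarily); OpenCV's getGaussianKernel builds an equivalently normalized sampled kernel.

#include <opencv2/opencv.hpp>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    double sigma = 1.5;
    int half = static_cast<int>(std::ceil(4 * sigma));  // support of +/- 4*sigma
    std::vector<double> g(2 * half + 1);
    double sum = 0.0;
    for (int i = -half; i <= half; ++i) {
        g[i + half] = std::exp(-(double)(i * i) / (2 * sigma * sigma)); // Eq. (21.5), unnormalized
        sum += g[i + half];
    }
    for (double& v : g) v /= sum;  // normalize so the samples sum to 1
    // Library equivalent, normalized the same way:
    cv::Mat k = cv::getGaussianKernel(2 * half + 1, sigma, CV_64F);
    std::printf("center tap: %f vs %f\n", g[half], k.at<double>(half));
    return 0;
}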
Middle-Level Machine Vision
The middle-level vision techniques receive images or the results produced by low-level vision algorithms, and produce some output other than pixels within the coordinate system of the original image. One goal of middle-level vision is to extract 3D geometric information from images, which are in fact 2D; this is usually referred to as shape-from-X, i.e., recovering 3D structure from 2D images.
In fact, the presence of shading in an image reveals information about 3D objects. An example is perceiving the shape of a sphere from its image, which we conclude to be a solid rather than a circular disk. This perception is due to the uniform change in brightness taking place away from the light source [9].
Information about the 3D shapes of objects is also perceived from mirror-like reflections from the objects. Yet one more source of 3D shape information is the change in the location of an object from one image to another image of the same object's views. The commonly used techniques for extracting shape from multiple images are: 1. stereopsis and 2. structure-from-motion. Apart from extraction of shape from images, another goal of middle-level vision is to extract structural descriptions of images. The relations that exist between these structures can be found using grouping methods, called perceptual grouping. The latter is responsible for recovering non-accidental alignments of image primitives, like collinear line segments or co-circular arcs.
Stereopsis
Computer stereo vision is the extraction of 3D information from digital images, by comparing information about the same scene obtained from two vantage points. Human beings compare two images by having them superimposed in a stereoscopic device, such that the image from the right camera is shown to the observer's right eye and that from the left camera is shown to the left eye. In a basic stereo-vision system, two cameras are maintained for observing a scene. The idea behind this system is that objects closer to the cameras will appear more displaced between the two views than those that are farther away. By comparing the two images, it is possible to find the relative depth information in the form of a map. This map encodes the difference in horizontal coordinates of corresponding image points.
For any given point in a scene, the amount of this point's motion between the two views is called its disparity. If the camera system is calibrated in world coordinates, then the actual distance to a point can be found from the magnitude of its disparity. Given any two images of a scene, the disparity between them is often displayed as an intensity image whose bright points correspond to larger disparities, i.e., to things that are closer to the cameras.
In a simple stereo geometry,4 two cameras of focal length f have their optical centers ol and or separated by a baseline of length b, with the origin of the world coordinate frame midway between the cameras. A scene point p = (x, y, z) projects to the points pl and pr in the left and right images, whose x coordinates xl and xr satisfy
xl/f = (x + b/2)/z,    (21.7)
xr/f = (x − b/2)/z.    (21.8)
[Figure: simple stereo imaging geometry. Optical centers ol and or separated by baseline b, image planes at focal length f; the scene point p projects to pl and pr in the left and right images.]
4 The plane (x, y) is formed by the line (ol, or) and the axis y; the latter is perpendicular to the plane of the paper.
Similarly, the y coordinates satisfy
yl/f = yr/f = y/z.    (21.9)
Note that in this simple camera geometry, only the x location (xl and xr) of a projected point differs between the left and right images; the y location (yl and yr) of a given point in space is the same in both images. The disparity, the distance between pl and pr, is just xl − xr. From Eqs. 21.7 and 21.8, we have
(xl − xr)/f = b/z.    (21.10)
Hence, if b and f are known, the depth z of the point p can be computed from the disparity, giving the third dimension and hence creating a 3D view from 2D images. If b and f are unknown, the relative depths of points can still be determined, and we can grasp the relative 3D structure from the two images; but in this case, the absolute distances from the camera cannot be determined.
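A compact sketch of this computation follows (the baseline, focal length, and file names are assumed values): OpenCV's block-matching stereo produces a disparity map, from which Eq. (21.10) gives z = f·b/(xl − xr) per pixel.

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat left  = cv::imread("left.png",  cv::IMREAD_GRAYSCALE);
    cv::Mat right = cv::imread("right.png", cv::IMREAD_GRAYSCALE);
    cv::Ptr<cv::StereoBM> bm = cv::StereoBM::create(64, 15); // numDisparities, blockSize
    cv::Mat disp16, disp;
    bm->compute(left, right, disp16);            // fixed-point result: 16 * disparity
    disp16.convertTo(disp, CV_32F, 1.0 / 16.0);  // disparity xl - xr, in pixels
    double f = 700.0;  // focal length in pixels (assumed calibration)
    double b = 0.12;   // baseline in meters (assumed)
    cv::Mat depth = f * b / disp;                // Eq. (21.10): z = f*b/disparity
    // Pixels with zero or negative disparity are unmatched; mask them before use.
    return 0;
}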
High-Level Machine Vision
The high-level vision systems are used to make abstract decisions, or to perform classification of objects based on the visual data. These methods usually make use of the outputs produced by the low-level or middle-level vision algorithms discussed above. The main applications of high-level vision systems are object recognition systems and object tracking systems.
• Object recognition systems. These systems are used to find out whether or not a particular object is present in the scene, and also to determine the locations of the objects with respect to the camera;
• Object tracking systems. As the name implies, the object tracking systems follow
a moving object in a video sequence, and hence can guide a mobile robot or an
autonomous vehicle.
In the following, we will discuss the principles used in the first application.
Object Recognition Systems
Object recognition systems compare a new image to stored object models to determine whether any of the models is present in the image. Many object recognition systems perform both recognition and localization, i.e., identifying what object is present in the image, and also recovering its location in the image or world coordinate system. The location of an object is called its pose; it is generally specified by a transformation that maps the model coordinate system to the image or world coordinate system.
Object recognition systems can be classified based on the type of problems they solve. A simple recognition task requires identification of 2D objects that are fully unoccluded (all fully visible), appear against a uniform background, and are viewed under controlled lighting conditions, i.e., there are no reflections or shadows in the view. The large majority of industrial inspection tasks fall in this category of object recognition, and can be handled accurately using commercially available systems.
Object recognition problems become difficult under the following conditions:
• The background of the objects is highly textured,
• Lighting conditions are unknown and uncontrolled,
• The scene comprises too many objects,
• The number of stored object models is very large, and
• Some objects may be touching or occluding other objects.
In our discussion, we will consider methods to handle images with multiple objects, where the objects are partially occluded and there is some amount of clutter in the background. The methods that can solve object recognition in the presence of these challenges first extract geometric information, like intensity edges, from an image. The next step is then to compare the 2D image geometry with the 3D model geometry.
An alternative method is to compute invariant representations of the image geometry, i.e., representations that remain unchanged in spite of changes in viewpoint. Such representations can be used to create an index into a library of object models. One of the challenges in geometric recognition is to find out which portion of the image corresponds to a given object from the library collection. The recognition problem is often defined as recovering a correspondence between local features of an image and an object model. There are three classes of approaches to searching for a possible match between model and image features.
1. Transformation space methods. These methods use the space of possible trans-
formations that map the model to the image.
2. Correspondence methods. These methods search the space of possible correspondences between model and image features.
3. Hypothesize and test. The hypothesize and test methods consider k-tuples of
model and data features for matching with the image.
We will be discussing the correspondence-based methods in more detail.
Correspondence Search
Correspondence search is a tree search to find which model features match the image features. Considering a set of image features A = {a1, . . . , an} and a set of model features M = {m1, . . . , mr}, an interpretation of the image can be expressed as a set of pairs, N = {(mi, aj), . . .}. The interpretation states which model features correspond to which image features.
If some model features are unclear (hidden or occluded), and some image features do not arise from the model, then the set N is some subset of all pairs of model and image features. A search method called the interpretation tree approach makes use of pruned search in the otherwise exponential space of possible interpretations.
In these interpretations, the method considers all possible ways of pairing model and image features. In simple terms, the search method makes use of pairwise relations between features to prune a tree of possible model and image feature pairs. A traversal path in this tree corresponds to a sequence of interpretations.
To limit the pairs: for two pairings (mi, aj) and (mp, aq) to belong to the same interpretation, the distances ‖mp − mi‖ and ‖aq − aj‖ must be equal, i.e., must agree within the allowable error limits.
To search for the interpretations of an image, a pruned-tree search algorithm is used, which works as follows:
• Each level of the tree, other than the root node, corresponds to an image feature.
• Each branch at a node corresponds to a model feature, or to a special branch called the null face.
• Accordingly, each node of the tree specifies a pair: the image feature at that level, and a model feature taken from the previous level. This model feature may also be null.
• The tree search carried out is depth-first search (DFS). Any given node is expanded only when it is pairwise consistent with all the nodes along the path from the current node back to the root of the tree. That is, a given node is paired with each node along the path back to the root, and the node is expanded only if the distance constraint is satisfied for each such pair.
In the above, the null face branch is considered always consistent.
A path from the root to a leaf node of the search tree accounts for zero or more model features; a null branch on the path does not account for a model feature. A path that accounts for k model features is called a k-interpretation. Consider that a reference level, set at a threshold k0, is used to filter out any hypothesis that does not account for enough model features. In this case, a matcher algorithm reports only those k-interpretations with k > k0. All these k-interpretations are guaranteed to be pairwise consistent. Hence, an additional model-verification step is carried out. This step estimates the best transformation for each k-interpretation and checks that this transformation brings each model feature within some error range of its corresponding image feature. A sketch of this search is given below.
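The following toy sketch (assumed 2D point features; the error bound and threshold are arbitrary) implements this pruned depth-first search: NULL_FEATURE plays the role of the null-face branch, which is always consistent, and only interpretations with k > k0 are reported.

#include <cmath>
#include <cstdio>
#include <vector>

struct Pt { double x, y; };
const int NULL_FEATURE = -1;   // the "null face" branch
const double EPS = 0.5;        // allowable error in the distance constraint (assumed)

double dist(Pt a, Pt b) { return std::hypot(a.x - b.x, a.y - b.y); }

// assign[i] = model feature paired with image feature i (or NULL_FEATURE).
void search(const std::vector<Pt>& img, const std::vector<Pt>& model,
            std::vector<int>& assign, size_t level, int k0) {
    if (level == img.size()) {                    // leaf: a complete interpretation
        int k = 0;
        for (int m : assign) if (m != NULL_FEATURE) ++k;
        if (k > k0) std::printf("k-interpretation found, k = %d\n", k);
        return;
    }
    for (int m = 0; m < (int)model.size(); ++m) {
        bool consistent = true;                   // pairwise check against the path to the root
        for (size_t j = 0; j < level && consistent; ++j)
            if (assign[j] != NULL_FEATURE &&
                std::fabs(dist(model[m], model[assign[j]]) -
                          dist(img[level], img[j])) > EPS)
                consistent = false;
        if (consistent) {                         // expand only consistent nodes
            assign[level] = m;
            search(img, model, assign, level + 1, k0);
        }
    }
    assign[level] = NULL_FEATURE;                 // null branch: always consistent
    search(img, model, assign, level + 1, k0);
}

int main() {
    std::vector<Pt> model = {{0, 0}, {2, 0}, {0, 1}};
    std::vector<Pt> img   = {{5, 5}, {7, 5}, {5, 6}};  // a translated copy of the model
    std::vector<int> assign(img.size(), NULL_FEATURE);
    search(img, model, assign, 0, 1);                  // report interpretations with k > 1
    return 0;
}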
The significance of indexing in vision is to match images efficiently with the models stored in a repository. Having this facility, it is possible, in principle, to search for an object in a library of stored models with time complexity that is independent of the number of models stored. Carrying out the indexing requires finding geometric invariants in model recognition, i.e., those attributes of images that do not change with changes in the viewpoint of objects. One key difference between this model indexing and ordinary hashing is that various instances of the same object in different images will not generate exactly the same key, due to sensing uncertainty.
There are a number of methods for using invariants in object recognition. These are categorized based on geometric, differential, photometric, and thermal properties
of images. The indexing methods for structures make use of combinations of simple features such as points and segments. Other geometric techniques use invariant properties of curves and co-planar conics. The advantage of using curve features is lower combinatorial complexity, but they have the disadvantage of requiring feature extraction from noisy and cluttered images. However, the intensity information in images provides a richer description than geometric features like points and line segments.
Geometric hashing is used here to explain invariants in recognition. For this, a 2D affine transformation5 is used, with geometric hashing using three points to define a coordinate system; with respect to these points, the other points are encoded in an invariant manner. Geometric hashing comprises the following two basic steps:
a. Construction of a model hash table, and
b. Matching of the model to the image.
The hash table stores redundant, transformation-invariant representations of each object. For any given model, each ordered triple m1, m2, m3 forms an affine basis with origin o = m1 and axes x = m2 − m1, y = m3 − m1. For each such basis, every additional model point mi is expressed as (αi, βi), such that mi − o = αi x + βi y. The basis triple (o, x, y) and the point mi are then stored in a hash table using the affine-invariant indices (αi, βi). For r model points, the resulting table has O(r⁴) entries, i.e., the space complexity is combinatorial. The table is formed using buckets instead of a pure hashing scheme, because uncertainty in sensing real-life data makes it difficult to use exact values for retrieval.
At the time of recognition, the hash table is searched to find out which models are present in the image. The algorithm steps for recognition of an object are as follows (a coordinate sketch follows the list).
i. First, the image points are rewritten in terms of an image basis.
ii. Corresponding to an instance of a model present in the image, that model's basis will be retrieved from the table.
iii. An image basis is formed using each ordered triple of image points s1, s2, s3, such that the origin is O = s1 and the axes are X = s2 − s1, Y = s3 − s1.
iv. For this image basis, each additional image point si is written as (αi, βi), such that si − O = αi X + βi Y.
v. The indices (αi, βi) are used for retrieving matching model bases from the hash table. Along with retrieving each model basis, a corresponding counter is incremented in a histogram.
vi. When all the image points have been considered for an image basis, the histogram contains the votes for those model bases that could match the current image basis (O, X, Y).
vii. If the peak in the histogram for a given model basis (o, x, y) is above a threshold, the corresponding basis is selected as a potential match. When the next image basis is chosen, the histogram counts are reset.
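The computation common to both steps is expressing a point in the affine-invariant coordinates (αi, βi) of a basis triple. The short sketch below (the bucket size is an assumption) solves the 2 × 2 system mi − o = α x + β y and quantizes the result into the kind of bucket index stored in the hash table.

#include <opencv2/opencv.hpp>
#include <cmath>
#include <cstdio>

// Affine coordinates of point mi with respect to the basis (o, x, y).
cv::Vec2d affineCoords(cv::Point2d o, cv::Point2d x, cv::Point2d y, cv::Point2d mi) {
    cv::Matx22d B(x.x, y.x,
                  x.y, y.y);                 // columns are the basis axes
    cv::Vec2d rhs(mi.x - o.x, mi.y - o.y);
    return B.solve(rhs, cv::DECOMP_LU);      // (alpha, beta)
}

int main() {
    cv::Point2d m1(0, 0), m2(1, 0), m3(0, 1), m4(0.5, 0.25);
    cv::Vec2d ab = affineCoords(m1, m2 - m1, m3 - m1, m4);
    // Bucket indices for the hash table (bucket side 0.05, an assumed quantization).
    int bi = (int)std::floor(ab[0] / 0.05), bj = (int)std::floor(ab[1] / 0.05);
    std::printf("alpha=%.2f beta=%.2f -> bucket (%d,%d)\n", ab[0], ab[1], bi, bj);
    return 0;
}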
5 An affine transformation is a linear mapping method in geometry that preserves points, straight lines, and planes. For example, after an affine transformation, a set of parallel lines remains parallel, say, under a transformation from a parallelogram to a rectangle and vice versa. Affine transformations are typically used to correct for geometric distortions or deformations that occur with nonideal camera angles.
Kernel Tracking
Kernel tracking is performed by computing the motion of the kernel in consecutive frames. The motion is usually in the form of a parametric transformation, like rotation, translation, or affine.
Silhouette Tracking
Object tracking can also be performed by estimating the object's region in each frame. Silhouette tracking uses the information encoded inside an object's region, either in the form of appearance density or in the form of shape models (i.e., edge maps). Given an object model, the silhouettes are tracked either by matching shapes (Fig. 21.8b) or by contour evolution (Fig. 21.8c). Both of these methods can be viewed as object segmentation applied in the temporal domain.
The following are major applications of object tracking.
• Video indexing. It comprises automatic annotation and retrieval of the videos in
multimedia databases.
• Motion-based recognition. Motion-based recognition has applications in identifi-
cation of humans based on gait and in automatic object detection.
• Automated surveillance. It is concerned with monitoring a scene to detect suspi-
cious activities or unlikely events.
• Human–computer Interaction. The HCI is concerned with applications, like ges-
ture recognition and eye-gaze tracking, for data input to computers.
• Vehicle navigation. It comprises video-based path planning and obstacle-avoidance capabilities.
• Traffic monitoring. It is concerned with real-time gathering of traffic statistics to
direct the traffic flow.
21.9 Feature Selection and Object Detection
Feature selection is the process of deciding which attributes of a picture help in identifying the object in the picture frame, while object detection or object identification is the process of identifying the object in the frame based on these features. Therefore, it is necessary that the selected features give strong evidence of the corresponding object.
Feature Selection
To efficiently distinguish the objects in feature space, it is desirable that the visual features be unique. The selection of features is closely related to the object representation. For example, for histogram-based representations color is a required feature, and for contour-based representations object edges are the important features. Often, various combinations of features are useful. Some example features are given below [10].
Color
The color appearance of an object is primarily influenced by two physical factors: (1) the spectral power distribution of the light source, and (2) the surface reflectance properties of the object. The color space commonly used for an object is RGB (red, green, blue). Since RGB is not a perceptually uniform color space, HSV (hue, saturation, value), which is closer to a uniform color space, is often preferred as a representation to obtain uniform perception.
Edges
Since object boundaries cause strong changes in image intensities, edge detection is used to identify these changes. Edge intensities are also less sensitive to the level of illumination. Hence, algorithms that track the boundaries of objects make use of edges as the important representative features of the image.
Texture
The texture of an object's surface causes intensity variation in the light reflected from that surface, and hence quantifies properties of the surface such as smoothness and regularity. Texture requires an additional processing step to generate its descriptors.
An object's representative features usually depend on the application domain; however, automatic feature selection is common practice. Feature-selection methods are either filter based or wrapper based. Filter-based methods make use of general criteria, e.g., that the features should be uncorrelated. On the other hand, wrapper methods select features based on their usefulness in the problem domain at hand, for example, the classification performance obtained using a subset of features. An example of a filter method, Principal Component Analysis (PCA), is useful for reducing the number of features. PCA is based on the transformation of a number of possibly correlated variables into a smaller number of uncorrelated variables, called principal components. The first principal component should account for the maximum possible variability of the data, and each succeeding component should account for as much of the remaining variability as possible. The AdaBoost algorithm is an example of a wrapper-based method for selecting discriminatory features for tracking a class of objects. This method builds a strong classifier from a combination of moderately inaccurate, weak classifiers. Given a large set of features, it is possible to train one such weak classifier for each feature [3].
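A minimal cv::PCA sketch follows (random placeholder data, one feature vector per row): the projection keeps the leading principal components, which account for decreasing shares of the variance as described above.

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat data(100, 8, CV_32F);           // 100 samples, 8 possibly correlated features
    cv::randu(data, 0.0f, 1.0f);            // placeholder data
    data.col(1) = data.col(0) * 2.0f;       // make feature 1 depend on feature 0
    cv::PCA pca(data, cv::Mat(), cv::PCA::DATA_AS_ROW, 3); // keep 3 components
    cv::Mat reduced = pca.project(data);    // 100 x 3 matrix of decorrelated features
    CV_Assert(reduced.cols == 3);
    return 0;
}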
Among all the features used for object tracking, color is the most widely used. A color histogram can be used to represent the object's appearance. In spite of the popularity of color as an important feature, a large majority of color bands are sensitive to variation in illumination. Thus, in situations where this effect cannot be avoided, other features are considered for modeling the object's appearance.
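As an illustration of a color-histogram appearance model (the file name and bin count are assumptions), the sketch below converts the image to HSV, as suggested above, and builds a normalized hue histogram with cv::calcHist.

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat bgr = cv::imread("object.png"), hsv;
    cv::cvtColor(bgr, hsv, cv::COLOR_BGR2HSV);        // OpenCV stores color as BGR
    int histSize = 30;                                 // 30 hue bins (assumed)
    float range[] = {0, 180};                          // hue range in OpenCV
    const float* ranges[] = {range};
    int channels[] = {0};                              // channel 0 = hue
    cv::Mat hist;
    cv::calcHist(&hsv, 1, channels, cv::Mat(), hist, 1, &histSize, ranges);
    cv::normalize(hist, hist, 1.0, 0.0, cv::NORM_L1);  // histogram entries sum to 1
    return 0;
}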
Object Detection
A tracking method always requires a mechanism for object detection, either in every frame or when the object appears in the video for the first time. The commonly used approach for object detection is to use the information from a single frame. However, some methods make use of temporal information computed from a sequence of frames to reduce false detections. The temporal information is in the form of frame differencing, which highlights changing regions from frame to frame. Based on the object regions found in an image, the tracker then establishes the correspondence of each object from one frame to the next and generates the tracks [10].
Some commonly used methods for object detection are discussed below.
Point Detectors
Point detectors are applied to find interest points in an image, i.e., points that have an expressive texture in their localities. An important characteristic of an interest point is its invariance to changes in illumination and camera viewpoint. One procedure for finding interest points, Moravec's operator, computes the variation in image intensity over a 4 × 4 patch shifted in the horizontal, vertical, diagonal, and anti-diagonal directions, and selects the minimum of the four variations as the representative value for the window. A point is declared an interest point if this intensity variation is a local maximum within a 12 × 12 patch.
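A rough sketch of the operator follows, written with plain loops over a cv::Mat and the parameters stated above (the input file is an assumption); the local-maximum test is done by comparing the response against a dilated copy of itself.

#include <opencv2/opencv.hpp>
#include <algorithm>
#include <cfloat>

int main() {
    cv::Mat img = cv::imread("frame.png", cv::IMREAD_GRAYSCALE);
    img.convertTo(img, CV_32F);
    const int shifts[4][2] = {{1, 0}, {0, 1}, {1, 1}, {1, -1}}; // h, v, diagonal, anti-diagonal
    cv::Mat response(img.size(), CV_32F, cv::Scalar(0));
    for (int y = 6; y < img.rows - 6; ++y)
        for (int x = 6; x < img.cols - 6; ++x) {
            float vmin = FLT_MAX;
            for (auto& s : shifts) {                 // intensity variation per direction
                float v = 0;
                for (int dy = 0; dy < 4; ++dy)       // 4 x 4 patch
                    for (int dx = 0; dx < 4; ++dx) {
                        float d = img.at<float>(y + dy, x + dx)
                                - img.at<float>(y + dy + s[1], x + dx + s[0]);
                        v += d * d;
                    }
                vmin = std::min(vmin, v);            // keep the minimum of the four variations
            }
            response.at<float>(y, x) = vmin;
        }
    // Interest points: response is a local maximum within a 12 x 12 neighborhood.
    cv::Mat dil;
    cv::dilate(response, dil, cv::getStructuringElement(cv::MORPH_RECT, {12, 12}));
    cv::Mat interest = (response == dil) & (response > 0);
    return 0;
}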
Image Background Subtraction
It is possible to build a representation of a scene, called the background model, and then to find, for each incoming frame, the deviation from this model in order to detect objects. A significant change in an image region relative to the background model indicates a moving object. This approach can also be used to detect whether a new object has been introduced into the scene, or some object has left the scene. The pixels representing the regions that have undergone change are marked for further processing, and it is common to use a connected-component algorithm to obtain the corresponding objects in the image. This process, as its name suggests, is called background subtraction.
Figure 21.9 shows background subtraction results: part (a) of the figure shows an input image of a highway, with moving vehicles, pedestrians, and nearby houses as objects; part (b) is the reconstructed image after projecting the input image onto the eigenspace; and part (c) is the image obtained as the difference of the two images.
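OpenCV's Gaussian-mixture background subtractor is one common realization of such a background model (the book's Fig. 21.9 uses an eigenspace model instead). A minimal sketch, with an assumed input video, follows.

#include <opencv2/opencv.hpp>
#include <cstdio>

int main() {
    cv::VideoCapture cap("highway.mp4");  // assumed input video
    cv::Ptr<cv::BackgroundSubtractor> bg = cv::createBackgroundSubtractorMOG2();
    cv::Mat frame, fgmask, labels;
    while (cap.read(frame)) {
        bg->apply(frame, fgmask);         // pixels deviating from the background model
        // Connected components group the changed pixels into candidate objects
        // (the threshold drops MOG2's shadow label, 127).
        int n = cv::connectedComponents(fgmask > 200, labels);
        std::printf("%d moving components\n", n - 1);  // label 0 is the background
    }
    return 0;
}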
Image Segmentation
The aim of an image-segmentation algorithm is to partition an image into perceptually similar regions. Segmentation comprises the solution of two problems: 1. what are the criteria for a good partition, and 2. what is the procedure for achieving an efficient partitioning? In the following, we discuss some commonly used segmentation techniques that are useful for object tracking.
Fig. 21.9 Background subtraction: a input image with objects, b reconstructed image, c difference image
In graph-based segmentation, the image is represented as a graph whose vertex set V is partitioned into N regions Ai such that
A1 ∪ A2 ∪ · · · ∪ AN = V,    (21.11)
and
Ai ∩ Aj = ∅, for i ≠ j.    (21.12)
In any graph, the total weight of the pruned edges between its two subgraphs is called a cut. The weights are typically computed based on the color of the image, its brightness, and the texture similarity between the vertices. In one approach, the minimum cut (mincut) criterion is used, with the aim of finding the partition that minimizes a cut. The weights in this approach are defined based on color similarity. However, this approach has the disadvantage of a bias toward over-segmenting the image, because the cost of a cut increases with the number of edges crossing the partitioned segments.
The problem of over-segmentation can be solved by using a normalized cut, where the cut depends on the sum of the edge weights in the cut, as before, but in addition also depends on the ratio of the total connection weight of the vertices in each partition to the weight of all vertices of the graph taken together. For image-based segmentation, the weights between vertices are defined by the product of spatial proximity and color similarity. Once the weights are computed between each pair of vertices, a weight matrix W and a diagonal matrix D are computed, such that
Di,i = Σj Wi,j,    (21.13)
where the sum runs over j = 1, . . . , N.
The segmentation is performed by first computing the eigenvectors and eigenvalues of the generalized eigensystem
(D − W) y = λ D y.    (21.14)
Next, the second-smallest eigenvector is used for dividing the image into two segments. For each new segment, this process is performed recursively until a threshold is reached.
In segmentation based on the normalized cut, solving the generalized eigensystem for large images can be expensive in terms of both memory and time. However, this method requires far fewer manually selected parameters than the mean-shift clustering-based method.
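A compact numerical sketch of Eqs. (21.13) and (21.14) follows, on a tiny assumed affinity matrix rather than one computed from an image: the substitution z = D^(1/2) y turns the generalized system into a symmetric eigenproblem that cv::eigen can solve, after which the signs of the second-smallest eigenvector bipartition the vertices.

#include <opencv2/opencv.hpp>
#include <cmath>

int main() {
    // Assumed 3-vertex affinity matrix W (symmetric, zero diagonal).
    cv::Mat W = (cv::Mat_<double>(3, 3) << 0.0, 1.0, 0.1,
                                           1.0, 0.0, 0.1,
                                           0.1, 0.1, 0.0);
    cv::Mat D   = cv::Mat::zeros(3, 3, CV_64F);  // diagonal matrix of Eq. (21.13)
    cv::Mat Dms = cv::Mat::zeros(3, 3, CV_64F);  // D^(-1/2)
    for (int i = 0; i < 3; ++i) {
        double d = cv::sum(W.row(i))[0];
        D.at<double>(i, i) = d;
        Dms.at<double>(i, i) = 1.0 / std::sqrt(d);
    }
    cv::Mat L = Dms * (D - W) * Dms;             // symmetric form of Eq. (21.14)
    cv::Mat evals, evecs;
    cv::eigen(L, evals, evecs);                  // eigenvalues descending; eigenvectors in rows
    cv::Mat y = Dms * evecs.row(evecs.rows - 2).t(); // second-smallest eigenvector, y = D^(-1/2) z
    // sign(y_i) assigns vertex i to one of the two segments
    return 0;
}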
21.10 Supervised Learning for Object Detection
Supervised learning can be used to learn different object views automatically from a set of examples, in order to perform object detection. Learning the different object views removes the need to store a complete set of templates. Supervised learning makes use of a set of examples to generate a function that maps any input to a desired output; a standard formulation of supervised learning is thus as a function-approximation problem.
21.11 Axioms of Vision
The term axiom (also called a primitive) is borrowed from its mathematical usage: using axioms as basic building blocks, it is possible to derive new constructs. This approach has the advantage of abstracting away the representation, type, and algorithm details within the components, so that the components can be combined together. Among the axioms of vision, each axiom performs a subset of tasks, such that these tasks can be grouped together to accomplish a more complex task. The axioms are low-level enough that the majority of subsets of vision problems can be represented by them, while combining them results in high-level solutions [5].
The axioms share common elements defined globally, such as images and elementary data types. Time, images, and colors are treated as continuous signals in the axioms of vision. For example, color is RGB, with each channel represented in the interval [0, 1]. The width and height of an image are represented in a similar way, with an additional value for the aspect ratio. Since time is treated as a continuous signal, it allows different inputs to the vision system.
The classes of common axioms for vision are: mathematical axioms, source axioms, model axioms, and construct axioms. These are based on a problem-centric taxonomy. Each axiom is further divided into three parts: a. a description of its tasks, b. its representations, and c. the processing it performs when implemented.
Source Axioms
The source axioms are used for encapsulating source information. Thus, they describe the acquired data or the source of the acquired data, such as cameras and images, and properties of images like noise and illumination.
Noise
This axiom provides the description of noise in the data.
Light
Lighting models are required to approximate the lighting in a scene. This may involve simple systems as in computer graphics, or the variation of intensity across images, adding the effect of a flash, combining images, etc.
Camera
The camera axiom provides a general description of the camera, and also includes other models and specific calibration parameters, like focal length, principal point, etc.
Image
The image axiom comprises descriptions of low-level image processing techniques and of image representation, like resize and filter operations.
Blur
Blur is due to motion of the camera or of the object; defocus may also be used deliberately for creative effect, or for measurement of depth from focus.
Model Axioms
The model axioms are used for representing abstract concepts used within computer vision, in a way that makes them accessible.
Model
This axiom contains descriptions of models used in vision, like geometric models, probabilistic models, or example-based models. A model unit can provide conversion from one format to another.
Object
This axiom provides types and the descriptions of objects.
Shape
This axiom finds similarly shaped regions within an image, which is carried out in conjunction with the matching and similarity axioms.
Texture
This axiom performs synthesis and analysis with the help of a texture descriptor. It also finds regions of similar texture, with the help of the matching and similarity axioms.
Construct Axioms
The construct axioms represent constructs used within vision for the purpose of modeling, and are used as output to other modeling packages.
Mesh
The mesh axiom gives a centralized description of a mesh, using triangles, quads, etc. A mesh unit provides an input-output system for the construction of meshes from an image base or from an arbitrary basis in 3D, for reconstruction of output.
Grid
The grid axiom provides a description of N-dimensional grids for use in vision. For example, a 2D grid would be a discretized image, and a 3D grid would be a set of voxels. A grid can be indexed at the intersections of grid lines. A grid unit also provides methods for grid construction, conversion, and I/O.
Algorithm Composition
The vision axioms can be combined through loose coupling to form more sophisticated algorithms. The basic axioms already encapsulate sophisticated concepts and algorithms; combining them, we can create higher level systems such as correspondence systems, tracking, 3D reconstruction, etc. As an example, for creating a simple 3D reconstruction such as a visual hull, a camera can provide calibrations, a grid can provide volumetric constructs, a transformation gives projected grid points, and a mesh can create a model for output from the volumetric grid.
It is possible to create a simple tracking system with direct use of the axioms: initialization of the system is done using the object, matching, and similarity axioms, and then an object can be tracked through a sequence of images using the move and optimization axioms.
However, algorithm composition is not well suited for direct use by developers, because it requires expert knowledge of vision. Instead, we look at the topmost level, the problem itself: the problem is decomposed into smaller parts, and we use this decomposition as the basis for solving it. This simplifies the implementation of computer vision using the axioms.
21.12 Computer Vision Tools
The open source computer vision library OpenCV is available under a Berkeley Software Distribution (BSD) license and is free for use by both academic and commercial users. OpenCV has interfaces to C++, C, Python, and Java, and it is supported by all major operating systems. OpenCV was designed with computational efficiency as one of its goals, and with a strong focus on real-time applications. Since OpenCV is written in optimized C/C++ code, the library can take advantage of multicore processing in the computer running the library programs. OpenCV is enabled with OpenCL, so it can take advantage of the hardware acceleration of the underlying heterogeneous computing platform. Usage of OpenCV ranges from interactive art, to mine inspection, to stitching maps on the web, through to advanced robotics [6].
OpenCV is also designed with the goal of building tools for solving computer-vision problems. It comprises a mixture of low-level image-processing functions and high-level algorithms for applications such as face recognition, face detection, pedestrian detection, feature matching, and object tracking.
The following is an example of a simple program in OpenCV to load and display a given image file, whose name is provided as input at the Linux command line [7].
Example 21.2 Load and display an image file.
#include <opencv2/opencv.hpp>
#include <iostream>
using namespace cv;
using namespace std;

int main(int argc, char** argv)
{
    Mat imagemat;
    // Read the image file named on the command line
    // (in OpenCV 3 and later, use IMREAD_COLOR in place of CV_LOAD_IMAGE_COLOR)
    imagemat = imread(argv[1], CV_LOAD_IMAGE_COLOR);
    // Check for invalid input
    if (!imagemat.data) {
        cout << "error" << endl;
        return -1;
    }
    // Create a display window
    namedWindow("Display Window", WINDOW_AUTOSIZE);
    // Show the image inside the window
    imshow("Display Window", imagemat);
    // Press any key to terminate the program
    waitKey(0);
    return 0;
}
After compiling the program (e.g., with g++ and the OpenCV libraries), it is run at the Linux command line as:
./a.out krc.jpg
21.13 Summary
The vision system has significance due to the fact that it is one of the essential characteristics of living beings; in particular, all the higher animals depend on this ability for their survival. Vision gives us detailed information about ourselves and our surroundings, without much conscious effort.
The goal of computer vision or machine vision is to extract information from images, so that one can produce structure from an image, e.g., a 3D view from source images that are 2D. Having this available, one can recover the 3D model of an object, which in turn can be useful in medical imaging, say, to recover a 3D tumor. The methods can also find the best matches of stored objects in image data; for example, we can search for an individual person in a photographic library or in a video collection.
Initially, vision systems were designed for military applications, like visual guidance for autonomous vehicles, target recognition, etc. However, in recent times many applications have emerged in medicine, for example, preoperative scans of patients, as well as in virtual reality, image-based data retrieval, assembly and material handling, visual inspection, and photo interpretation.
Every object, other than a mirror, scatters the light incident on it in all directions; this reflected light is projected on the retina of the eye through a lens, and we are able to recognize the object through the interpretation carried out by the brain. The vision problem related to this statement is: given the 2D image, how do we infer the shape of the object that produced this image? It is a challenge to reason back from the image to the scene that created it. Recognition requires learning, i.e., mapping image patterns to some internal representation, the transformation of scene to image. This requires the construction of models, i.e., a general description of each object that is to be recognized. A model includes shape, texture, and context knowledge.
Going from image to scene is a challenging task. A vision system receives only the reflected brightness as input, in different intensities. To convert these brightnesses into world measurements, we first need to understand how objects are mapped into image brightness.
Support vector machines and adaptive boosting are examples of supervised learning.
The complexity of vision systems where the basic primitives are identical, for example, in 3D objects and in the dynamics of objects, can be simplified by defining and constructing axioms of vision. These are the basic building blocks from which we derive new constructs. The axioms provide representations, correspond to certain algorithms within them, and are useful for common subtasks in vision systems.
The common axioms in vision are: mathematical axioms (transformation, opti-
mization, linearity), source axioms (camera, image, light), model axioms (object,
shape, texture), and construct axioms (mesh, grid, algorithm).
One of the important tools for vision-system research is the open source computer vision library OpenCV. Released under a BSD license, it is free for academic as well as commercial use. It provides C++, C, Python, and Java interfaces, and has been designed for computational efficiency with a focus mainly on real-time applications.
Exercises
1. Suggest some experimental observations about human vision that support the idea that vision is graphics: what we see is explicable only partly by the optical image itself, and is more strongly determined by top-down knowledge, model building, and inference processes.
2. Explain in your own words, with supporting geometry, how the human vision system inverts the image formed on the retina of the eye(s) to visualize the object.
3. How is vision a learning and cognition process? Justify.
4. Why is it deterministically not possible to recognize an object given an image?
5. We can always map the coordinates from an object to its image, but mapping the image coordinates to the object is challenging. Why?
6. Suggest any five new applications of machine vision, which are not discussed in
this chapter.
7. Most smartphones available now have a face-detection feature. Why is this feature important? Explain the process of face detection in your own words.
8. What are the syntax and semantics information in vision? Explain in brief.
9. How are the axioms of vision helpful in machine vision?
10. What is image inversion? Explain some of the techniques for this, and the significance of each of them.
11. Give proper examples of low-level, middle-level, and high-level vision.
12. Like two stereo cameras, our eyes are able to find the depth of vision, e.g., roughly how far an object is from the eyes. Write an algorithm to compute this depth of vision.
13. Does the distance between the centers of the human eyes make any difference to the accuracy of computing depth of vision? Give a proof for this, based on some computations.
14. Given an image containing a number of human faces, write an algorithm to count the number of faces in the image.
15. Write an algorithm for panorama stitching for a given number n of images.
16. Suggest various approaches to speed up the processing of image transformations.
References