Computational Thinking
A Primer for Programmers and Data Scientists
G Venkatesh
Madhavan Mukund
Contents
Preface
Acknowledgements
1 Introduction
  1.1 What is Computational Thinking?
  1.2 Sample datasets
  1.3 Organisation of the book
  1.4 A guide to the digital companion
2 Iterator
  2.1 Going through a dataset
  2.2 Flowcharts
  2.3 Iterator flowchart for cards
3 Variables
  3.1 Generic flowchart for iterator with variables
  3.2 Counting
  3.3 Sum
  3.4 Average
  3.5 Accumulator
4 Filtering
  4.1 Selecting cards: Iterator with filtering
  4.2 Examples
5 Datatypes
  5.1 Sanity of data
  5.2 Basic datatypes
  5.3 Compound datatypes
  5.4 Subtypes
  5.5 Transforming the data element
  5.6 Datatypes for the elements in the dataset
7 Pseudocode
  7.1 Basic iteration
  7.2 Filtering
  7.3 Compound conditions
  7.4 Filtering with dynamic conditions
11 Lists
  11.1 List as a set or collection
  11.2 Using the list collections
14 Dictionaries
16 Graphs
18 Recursion
23 Concurrency
Preface
How do we introduce computing to a lay person? This was the question that
confronted us when we volunteered to offer such a course in the foundation year
of the online BSc degree in Programming and Data Science at IIT Madras. We
searched for a suitable book or any material to use, but found nothing useful. Our
conception of what needs to be taught in the course seemed to differ widely from
what is currently on offer. We figured that "computational thinking" means different
things to different people. For some it is just another name for programming
in a suitable language. Those with some knowledge of computer science see it as
the study of algorithms (as distinct from the programs used to implement them).
Then there are those who equate it to solving logical or mathematical puzzles
using computers. None of these descriptions seem satisfactory.
Luckily, one of us had experimented with teaching computational thinking to
school students under the banner of his startup company Mylspot Education Ser-
vices. These sessions used data sets with each data element written out on a
separate card. Students try to find the answer to a computational question by work-
ing through the cards by hand. For instance, they could look at students’ marks
cards to award prizes, or look at shopping bills to identify the loyal customers,
or look at cards containing words from a paragraph to identify which person a
pronoun refers to.
The idea of using this hands-on method to explain the key concepts in computational
thinking was appealing on many counts. The practical utility becomes
immediately obvious when interesting questions about commonly occurring and
familiar data sets can be answered. Many of the key abstractions underlying
computational thinking seem to surface naturally during the process of moving
cards and noting down intermediate values. Finally, the difference in complexity
of two alternate solutions can be directly experienced when tried out by hand on a
real data set.
The challenge for us was to do all this in an online course format with recorded
videos, where there is no possibility of interacting with students during the session.
Our response to this was to employ a dialogue format, where the two of us would
work together, posing questions and finding answers by working through the cards.
The NPTEL team at IIT Madras rose to the challenge by setting up the studio with
one camera facing down on the work area and another recording our faces. The
feedback we got from students who watched the first few recordings was excellent,
and this gave us encouragement that the method seemed to be working.
It was at this point that we felt that we should write up all the material created
for the recorded sessions in the form of a book. This could form a companion
or reference to the students who take the online degree course. But our intent
was to make it available to a much wider audience beyond the students of the
online degree. The question was: is it even possible to provide the same level of
excitement of the hands-on experience through a static book? Our answer was yes,
it could, provided we were able to create an interactive digital version of the book
that can provide the hands-on experience. We thus decided to also simultaneously
develop an interactive version of the book that would be launched on the Mylspot
platform for practising problem solutions.
Just as with the online degree course, the audience for this book is likely to be
very diverse. This would mean that we have to reach out not only to those being
introduced to the world of computing for the first time, but also to those with many
years of practical programming experience looking to strengthen their foundations.
It would also mean that we have to be able to enthuse those with no background
(or interest) in theory, without putting off those looking to gain an understanding
of theoretical computer science. Accordingly, we have organised the material in
such a way that there is a good mix of easy and challenging problems to work on.
At the end of each chapter, we have provided some selected nuggets of information
on algorithms, language design or on programming, which can be skipped by
the first time learners, but should hopefully be of interest to the more advanced
learners.
We had a wonderful team at IIT Madras supporting us during the production of
this course. We thoroughly enjoyed creating, recording, writing and coding all of
the material. We sincerely hope that you, the readers, will find the result of this
effort both enjoyable and practically useful.
Acknowledgements
Several people have contributed to the creation of the material on which this
book is based. At the outset, we have to thank Andrew Thangaraj and Prathap
Haridoss, both Professors at IIT Madras, who are responsible for the online degree
programme, and hence bear the risk of our experimentation with material and
methodology for the computational thinking course. Then there is Bharathi Balaji,
the fearless leader of the NPTEL and online degree teams, without whose 24x7
support (and constant nagging), we would never have completed the material on
time. It was through the efforts of these three that we were able to quickly put
together the camera arrangements in the studio for recording our lectures.
The online degree content team for the course was led ably by Dhannya S M,
who kept the clock ticking. We have to thank Prathyush P, the overall content
team leader, for putting together most of the datasets, and recording the first few
tutorials. Omkar Joshi arranged the review sessions, provided editorial support for
the videos, and suggested many exercises for the course. He was ably supported
in this activity by Vijay R and Karthik T. Sribalaji R has to be thanked for
accommodating our urgent requests to schedule the recordings and patiently and
professionally managing these long sessions. He was supported in this activity
by Karthik B, Ravichandran V, Mahesh Kumar K P, Vignesh B, Komathi P and
Akshai Kumar R.
Part I
1. Introduction
we have to travel to a different city. This looks similar to running a wire between
one room and another in our house. More surprisingly, this may also resemble
the process of using our friends network to contact someone who can provide the
correct medical or career advice.
If we look carefully into each of these examples of similar activities, we will
observe a shared pattern consisting of a sequence of atomic actions. In this book,
we identify computational thinking with the search for such shared patterns
among commonly occurring computational activities.
Why do we need to study computational thinking? For one, it helps make our
thought process clearer when we are dealing with activities of this sort. Would
not our productivity increase greatly if we were able to efficiently retrieve and
re-apply a previously learnt procedure rather than re-invent it each time? For
another, it helps us communicate these procedures to others, so that they can
quickly learn from us how to do something. If we can communicate the procedure
to another person, surely we could communicate it to a computer as well, which
would explain the use of the adjective "computational".
In short, computational thinking gives us the basis to organise our thought process,
and a language to communicate such organisation to others including a computer.
The organising part relates to the study of algorithms, the language part to
the design of programming languages and systems, and the communication part
to the activity of coding in a given programming language. Thus computational
thinking provides the foundation for learning how to translate any real-world
problem into a computer program or system through the use of algorithms,
languages and coding.
The basic unit of the computational activity is some kind of step. Scanning a
shelf in a supermarket requires the basic step where we visually examine one item
(sometimes we may need to pick up the item and look at its label carefully - so the
examine step may itself be composed of a number of smaller steps). The scanning
process requires us to move from one item to the next (which may or may not be
adjacent to the item scanned earlier). When we have scanned one shelf, we move
to the next one on the same rack. We may then move from one rack to another,
and then from one aisle (part of the supermarket) to another aisle (another part of
the supermarket). In essence we carry out the examine step many times, moving
from item to item in a systematic manner so that we can avoid the examination of
the same item twice.
What can we call a step? Is it enough for the step to be stated at a high level? Or
does it need to be broken down to the lowest atomic level? This depends on who
we are communicating the step to. If the person on the other side can understand
the high level step, it is sufficient to provide instructions at a high level. If not, we
may have to break down the high level step into more detailed steps which the
other person can understand.
Suppose that we need to arrange the furniture in a room for a conference or a party.
If the person taking the instructions understands how such arrangements are to be
done, we could simply tell the person - "I am having a party tomorrow evening at
6 pm for about 100 people. Please arrange the furniture in the hall accordingly."
On the other hand, if the person has never seen or carried out such an organisation
before, we may have to say - "One round table can accommodate 5 people and I
have 100 people, so we need 20 round tables and 100 chairs. Distribute these 20
tables around the room, each with 5 chairs."
Once we have defined the different kinds of steps that we can use, the compu-
tational activity is then described as a sequence of these steps. It is wasteful to
write down all the steps one by one. For instance, if we say how to organise one
table with 5 chairs, then that same method can be applied to all the tables - we
do not need to describe it for each table separately. So, we need some method or
language to describe such grouping of steps.
Let's go back to the problem of picking clothes out of the bucket and hanging them
out to dry on more than one clothes line. We could pick out the first cloth item
and hang it on the first line. Then pick another cloth item and hang it on the first
line somewhere where there is space. If the next cloth item we pick cannot fit
into any of the free spaces on the line, we move to the next clothes line. We
can immediately see that there is a problem with this method - the first line has
clothes of random sizes placed in a random manner on the line leaving a lot of
gaps between the cloth items. The next cloth item picked may be a large one not
fitting into any of these gaps, so we move to another line wasting a lot of space on
the first line. A better procedure may be to pick the largest cloth item first, place it
on the leftmost side of the first line. Then pick the next largest item, hang it next
to the first one, etc. This will eliminate the wasted space. When we are near the
end of the space on the first line, we pick the cloth item that best fits the remaining
space and hang that one at the end of the line.
What we have described above is a pattern. It consists of doing, for each clothes
line, the following sequence of steps starting from one end: fit cloth items one
next to another in decreasing order of size till the remaining space becomes small,
then pick the cloth item that best fits the remaining space on the line. This
"pattern" will be the same for arranging books on multiple shelves, which consists
of doing, for each book shelf, the following sequence of steps starting from one end:
fit books one next to another in decreasing order of size till the remaining space
becomes small, then pick the book that best fits the remaining space on the
shelf. Note that the only difference is that we have replaced clothes line with book
shelf and cloth item with book; otherwise everything else is the same.
We can re-apply the pattern to the travel items and suitcases example. We do,
for each suitcase, the following sequence of steps starting from one end: fit travel
items one next to another in decreasing order of size till the remaining space
becomes small, then pick the travel item that best fits the remaining space in
the suitcase. Can you see how the pattern is the same? Except for the items that
are being moved, the places where they are being moved to, and the definition of size
and small, everything else looks exactly the same between the three.
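The shared pattern can also be sketched in code. The sketch below is our own illustration, not from the text: items are represented by their sizes, each container (clothes line, shelf or suitcase) by its capacity.

```python
def pack(items, capacities):
    # Place item sizes into containers, largest first; finish each
    # container with the best-fitting remaining item.
    items = sorted(items, reverse=True)       # largest first
    placement = []                            # one list of sizes per container
    for cap in capacities:
        used, placed = 0, []
        # fit items one next to another in decreasing order of size
        while items and items[0] <= cap - used:
            placed.append(items.pop(0))
            used += placed[-1]
        # then pick the item that best fits the remaining space
        fitting = [x for x in items if x <= cap - used]
        if fitting:
            best = max(fitting)
            items.remove(best)
            placed.append(best)
        placement.append(placed)
    return placement

print(pack([5, 4, 3, 2, 1], [9, 7]))   # [[5, 4], [3, 2, 1]]
```

The same function covers all three situations; only the meaning of item, container and size changes.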
We have picked some commonly used datasets that are simple enough to represent,
but yet have sufficient structure to illustrate all the concepts that we wish to
introduce through this book. The data could be represented using physical cards
(if the suggested procedures are to be carried out by hand as an activity in a
classroom). But we have chosen to use a tabular representation of the dataset for
the book, since this is more compact and saves precious book pages. There is a
direct correspondence between a row in the table and a card, and we can work
with table rows much as we do with physical cards. The accompanying digital
companion will make this explicit. It will use the tabular form representation,
with the key difference being that the individual table rows are separate objects
like cards that can be moved around at will.
A data element in this dataset is the marks card of one student (shown below):
The basic data element in this dataset is the shopping bill generated when a
customer visits a shop and buys something (as shown below):
to one shop)?
• Which customers are similar to each other in terms of their shopping behaviour?
The entire data set is contained in the following table:

Product     Category     Quantity  Unit price  Cost
Chocolates  Packed/Food  1         10          10
Bananas     Fruits/Food  12        8           96
Eggs        Food         1         45          45
No.  Word  Part of speech  Letter count
9 his Pronoun 3
10 eyes. Noun 4
11 He Pronoun 2
12 considered Verb 10
13 Monday Noun 6
14 specially Adverb 9
15 unpleasant Adjective 10
16 in Preposition 2
17 the Article 3
18 calendar. Noun 8
19 After Preposition 5
20 the Article 3
21 delicious Adjective 9
22 freedom Noun 7
23 of Preposition 2
24 Saturday Noun 8
25 And Conjunction 3
26 Sunday, Noun 6
27 it Pronoun 2
28 was Verb 3
29 difficult Adjective 9
30 to Preposition 2
31 get Verb 3
32 into Preposition 4
33 the Article 3
34 Monday Noun 6
35 mood Noun 4
36 of Preposition 2
37 work Noun 4
38 and Conjunction 3
39 discipline. Noun 10
40 He Pronoun 2
41 shuddered Verb 9
42 at Preposition 2
43 the Article 3
44 very Adverb 4
45 thought Noun 7
46 of Preposition 2
47 school: Noun 6
48 the Article 3
49 dismal Adjective 6
50 yellow Adjective 6
51 building; Noun 8
52 the Article 3
53 fire-eyed Adjective 8
54 Vedanayagam, Noun 11
55 his Pronoun 3
56 class Noun 5
57 teacher, Noun 7
58 and Conjunction 3
59 headmaster Noun 10
60 with Preposition 4
61 his Pronoun 3
62 thin Adjective 4
63 long Adjective 4
64 cane . . . Noun 4
Summary of chapter
Exercises
2. Iterator
The most common form of computation is a repetition. To find the total of the
physics marks of all the students, we can start with a total value of zero, then
repeatedly select one student and add the student’s physics marks to the total.
Similarly, to determine the number of occurrences of a certain word W in the
paragraph, we can repeatedly examine the words one by one, and increment the
count if the examined word is W. Consider now that we have to arrange students
in a class by their heights. We can do this by making them stand in a line and
repeating the following step: pick any two students who are out of order (taller
person is before a shorter person), and simply ask them to exchange their places.
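The height-arranging procedure can be sketched as repeated swapping of out-of-order pairs. The sketch below is our own rendering, restricting "any two students" to adjacent pairs for simplicity:

```python
def arrange_by_height(heights):
    # Repeatedly find a taller person standing before a shorter one
    # and swap them, until no out-of-order pair remains.
    line = list(heights)
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(line) - 1):
            if line[i] > line[i + 1]:     # out of order: taller before shorter
                line[i], line[i + 1] = line[i + 1], line[i]
                swapped = True
    return line

print(arrange_by_height([170, 150, 160]))   # [150, 160, 170]
```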
The pattern of doing something repetitively is called an iterator. Any such
repetition will need a way to go through the items systematically, making sure that
every item is visited, and that no item is visited twice. This in turn may require
the items to be arranged or kept in a certain way. For instance, to arrange students
by height, we may first have to make them stand in one line, which makes it easier
for us to spot students who are out of order with respect to height. We have to
know what to do at any step, and also how to proceed from one item to the next
one. Finally, we will need to know when to stop the iterator.
2.1.1 Starting
The steps involved in setting up the required context for the iterator to function
are collectively called initialisation. When we work with cards, the initialisation
typically consists of arranging all the cards in a single pile (which may or may not
be in some order). For instance, the marks card set can be in any order at the start,
while for the paragraph words cards set, the words would need to be in the same
sequence as they appear in the paragraph.
We may also need to maintain other cards where we write down what we are
seeing as we go through the repetition. But this forms part of the discussion in the
next chapter. Once we have set up the context for the iterator, we can proceed to
the steps that need to be repeated.
Within the repeated step, we will first have to pick one element to work on. For an
arbitrarily ordered pile of cards, any one card can be picked from the pile - the top
card, the bottom card, or just any card drawn from the pile at random. If the card
pile is ordered (as in the words data set), then the top card has to be picked
from the pile.
The card that is picked is kept aside for examination. Specific fields in the card
may be read out and their values processed. These values may be combined with
other values that are written down on other cards (which we will come to in the
next chapter). The values on the card could also be altered, though we do not
consider examples of this kind in this book, since altering the original data items
from the data set can lead to many inadvertent errors.
When we use datasets in tabular form (rather than physical cards), picking a card
corresponds to extracting one row of the table. Extracting would mean that the
table is reduced by one row (the one extracted). In the case of the ordered datasets,
we will always extract the top row of the table.
Once we are done examining the chosen element, we need to move on to the next
data element to process. But that would be exactly the same as what we did in the
previous step. For instance with cards, we just have to pick another card from the
pile and examine it. So can we just repeat the previous step many times?
In principle we could, but we have not really said what we would do with the card
(or extracted table row) that we have kept aside for examination. Would we just
return that card to the pile of cards after doing whatever we need to do with its
field values? If we did that, then we will eventually come back to the same card
that we have picked earlier and examine it once again. Since we don’t want to
repeat the examination of any card, this means that we cannot return the examined
card to the original pile.
What we can do is to create a second pile into which we can move the card that
we have examined. Think about this second pile as the pile of "seen" cards, while
the original pile would be the pile of "unseen" cards. This ensures that we will
never see the same card twice, and also ensures that all the cards in the dataset will
definitely be seen. Thus the simple process of maintaining two piles of cards - the
original "unseen" pile and the second "seen" pile ensures that we systematically
visit all the cards in the pile, without visiting any card twice.
Note that the second pile will also need to be initialised during the initialisation
step of the iterator. If we are working with cards, this just consists of making some
free space available to keep the seen cards. If we are working with the tabular
form, it means that there is a second empty "seen" table kept next to the original
"unseen" table of rows, and at each step we extract (remove) one row from the
"unseen" table and add the row to the "seen" table.
2.1.4 Stopping
Finally, we have to know when we should stop the iteration. With cards, this
naturally happens when there are no further cards to examine. Since we are
moving cards from one pile (the "unseen" pile) to another (the "seen" pile), the
original "unseen" pile keeps getting smaller and smaller and eventually there will
be no more cards in it to examine. We have to stop at this stage as the step of
picking a card to examine cannot be performed anymore.
In the tabular form, extracting a row makes the table smaller by one row. Eventually,
the table will run out of rows and we have to stop because we cannot perform
the extraction step.
To summarise, the iterator will execute the following steps one after the other:
• Step 1: Initialisation step: arrange all the cards in an "unseen" pile
• Step 2: Continue or exit: if there are no more cards in the "unseen" pile,
we exit; otherwise we continue.
• Step 3: Repeat step: pick an element from the "unseen" pile, do whatever
we want to with this element, and then move it to another "seen" pile.
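The three steps map directly onto a loop over two piles. A minimal sketch in Python (our own rendering, with each pile as a list and `examine` standing in for whatever we want to do with a card):

```python
def iterate(cards, examine):
    unseen = list(cards)        # Step 1: all cards start in the "unseen" pile
    seen = []                   #         the "seen" pile starts empty
    while unseen:               # Step 2: exit when "unseen" is empty
        card = unseen.pop()     # Step 3: pick an element,
        examine(card)           #         do whatever we want with it,
        seen.append(card)       #         and move it to the "seen" pile
    return seen

visited = []
iterate(["A", "B", "C"], visited.append)   # each card is seen exactly once
```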
2.2 Flowcharts
The step-wise description of the iterator that we saw in the previous section can be
visualised nicely using a diagram called a flowchart. The flowchart was developed
to describe manual decision making processes - for instance to diagnose what
went wrong with a piece of electrical equipment or in an instruction manual to
describe the steps for installing a piece of equipment. They are also useful to
explain a protocol or procedure to another person, who could be a customer or a
business partner.
Our eye is trained to look for visually similar patterns (this is how we identify
something to be a tree or a cat for example). The flowchart thus gives us a visual
aid that may be useful to spot similar looking patterns much more easily. This in
turn may allow us to re-use a solution for one problem to another similar problem
situation from a different domain.
We will mainly be using only 4 symbols from those that are typically found in
flowcharts. These symbols are shown in Figure 2.1 along with the explanation of
where they need to be used. The box with rounded edges is a terminal symbol
used to denote the start and end of the flowchart. The rectangular box is used
to write any activity consisting of one or more steps. The diamond is used to
represent a decision, where a condition is checked, and if the condition is true,
then one of the branches is taken, otherwise the other branch is taken. The arrow
is used to connect these symbols.
Using the flowchart symbols in Figure 2.1 we can now draw a generic flowchart
for the iterator using the description in Section 2.1.5. The result is shown in Figure
2.2. After start comes the initialisation step, followed by a decision if the iteration
is to continue or stop. If it is to stop it goes to the end terminal. Otherwise, it goes
to the repeat step after which it returns to the decision box.
We can easily visualise the iteration loop in the diagram, which goes from the
decision box to the repeated steps and back. Any diagram which has such a loop
can be modeled using an iterator pattern.
[Figure 2.2: generic iterator flowchart. Start, then the initialisation step; the decision "Continue iteration?" exits to End on No (False), and on Yes (True) leads to the step to be repeated, which loops back to the decision.]
[Figure 2.3: iterator flowchart for cards. On the Yes (True) branch, the repeated step is: examine one card from the "unseen" pile and move it into the "seen" pile.]
Summary of chapter
We can iterate over any collection. Specifically the collection could be a set,
or a relation (set of ordered pairs), or a more complex multi-dimensional
relation (set of ordered tuples). For instance, we could iterate over the
collection of all pairs of cards drawn from the marks card data set. A more
interesting example may be iteration over all possible subsets of the marks
card data set.
Iterators and complexity: examining the structure of the iterator immedi-
ately gives us some clue about the order of complexity of the procedure we
are implementing using the iterator. If we are examining all elements of the
card set, it will be O(N). If over all pairs, it will be O(N^2); over all k-tuples,
it will be O(N^k); and over all subsets, it will be O(2^N).
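The growth of these collections can be seen directly by generating them; the snippet below uses Python's itertools purely as an illustration:

```python
from itertools import chain, combinations

cards = ["A", "B", "C", "D"]
n = len(cards)

pairs = list(combinations(cards, 2))     # all pairs: grows as O(N^2)
subsets = list(chain.from_iterable(      # all subsets: grows as O(2^N)
    combinations(cards, k) for k in range(n + 1)))

print(len(pairs))     # 6, i.e. n*(n-1)/2
print(len(subsets))   # 16, i.e. 2**n
```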
Exercises
3. Variables
In the previous chapter, we saw how the iterator can be used to repeat some activity
many times. Specifically, it can be used to go through the cards systematically,
ensuring that we visit each card exactly once. We now look at doing something
with the iterator. In order to get any result through iteration, we will need to
maintain some extra information. The main vehicle to do this is the variable.
When we work with cards, we can use an extra card to note down this intermediate
information, so the extra card could be independently numbered or named and
would represent a variable. Alternatively, to make better use of the card,
we can note down multiple pieces of intermediate information by giving each
variable a name, and noting down the intermediate information against each of
the named variables on the card. Unlike a field of the card, whose value does not
change, and so is called a constant, the value of the variable can keep changing
during the iteration, which would explain why we call it a variable.
If we are using tables to represent the information, then we need some extra space
somewhere to record the variable names and their intermediate values.
will need to initialise the variables before the iteration loop starts, and update the
variables using the values in the chosen card at each repeat step inside the iteration
loop. The result is shown in Figure 3.1 below.
[Figure 3.1: generic iterator flowchart with variables. On the Yes (True) branch, the repeated step is: pick one card X from the "unseen" pile, move it into the "seen" pile, and update the variables.]
3.2 Counting
Let's start with the simplest of problems: counting the number of cards in the pile.
To find the number of cards, we can go through the cards one at a time (which the
iterator will do for us). As we go through each card, we keep track of the count of
the cards we have seen so far. How do we maintain this information? We can use
a variable for this (written on an extra card). Let us name this variable count.
The iterator has an initialisation step, a repeat step and a way to determine when to
stop. What would we initialise the variable count to at the start of the iteration?
Since the variable count stores the number of cards seen so far, it needs to be
initialised to 0, since at the start we have not seen any cards. At the repeat step,
we need to update the value of the variable count by incrementing it (i.e. by
increasing its value by 1). There is nothing extra to be done to determine when to
stop, since we will stop exactly when there are no more cards to be seen.
The resulting flowchart is shown in Figure 3.2.
[Figure 3.2: flowchart for counting. Initialise count to 0; on the Yes (True) branch, the repeated step is: pick one card X from the "unseen" pile, move it into the "seen" pile, and increment count.]
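As a sketch in Python (our own rendering, with the pile as a list), the counting flowchart becomes:

```python
def count_cards(cards):
    unseen = list(cards)       # initialisation: the "unseen" pile
    count = 0                  # initialise count to 0
    while unseen:              # stop when no cards remain
        card = unseen.pop()    # pick one card X and move it to "seen"
        count = count + 1      # increment count
    return count

print(count_cards(["marks card 1", "marks card 2", "marks card 3"]))   # 3
```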
3.3 Sum
The counting example in the previous section did not require us to examine the
contents of the cards at all. We merely moved the cards from one pile to another
while keeping track of how many we have seen in a variable called count. We
now look at something a bit more interesting - that of finding the sum of the values
of some field on the card. In the case of the classroom dataset, we could find
the sum of the total marks scored by all students put together, or we could find
the total marks of all students in one subject - say Maths. In the shopping bill
dataset, we could find the total spend of all the customers put together. In the
words dataset, we could find the total number of letters in all the words.
[Figure 3.3: flowchart for sum. Initialise sum to 0; on the Yes (True) branch, the repeated step is: pick one card X from the "unseen" pile and move it into the "seen" pile.]
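A sketch of the sum flowchart in Python, assuming each card is represented as a dictionary with named fields (a representation of our own choosing):

```python
def total_of(cards, field):
    unseen = list(cards)
    total = 0                     # initialise sum to 0
    while unseen:
        card = unseen.pop()       # pick one card X, move it to "seen"
        total += card[field]      # add X's field value to the sum
    return total

marks = [{"physics": 46}, {"physics": 88}, {"physics": 92}]
print(total_of(marks, "physics"))   # 226
```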
3.3.2 Find the total distance travelled and total travel time of all the trains
3.4 Average
Suppose now that we wish to find the average value of some field from all the
cards in the dataset. For example we could find the average physics marks of all
students, or the average bill amount, or the average letter count.
How do we determine the average? We need to find the total value of the field from
all the cards and divide this total by the number of cards. We used the flowchart
in Figure 3.3 to find the value of the total represented through the variable sum.
Likewise, the number of cards was represented by the variable count whose value
was found using the flowchart in Figure 3.2. We can get the average now by
simply dividing sum by count.
The problem with the procedure described above is that it requires us to make two
passes through the cards - once to find sum and once again to find count. Is it
possible to do this with just one pass through the cards? If we maintain both
variables sum and count at the same time while iterating through the cards, we
should be able to determine their values together. The flowchart shown in Figure
3.4 below does this.
[Figure 3.4: flowchart maintaining sum and count together. On the Yes (True) branch, the repeated step is: pick one card X from the "unseen" pile and move it into the "seen" pile.]
The flowchart above works, but seems a bit unsatisfactory. The computation of
average is inside the iteration loop and so keeps track of the average of all the
cards seen so far. This is unnecessary. We can just find the average after all the
cards are seen. The generic flowchart for finding average shown in Figure 3.5
does exactly this.
[Figure 3.5: Generic flowchart for finding the average - sum/count is stored in average only after the iteration ends]
We can use this generic flowchart for finding the average physics marks of all
students - just replace X.F in the flowchart above by X.physicsMarks. Similarly
to find the average letter count, we can replace it by X.letterCount.
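The single-pass average flowchart can be sketched in Python; the loop below plays the role of the iterator, and the division happens once, after all the cards are seen. The sample marks are invented for illustration.

```python
def average(values):
    total = 0
    count = 0
    for x in values:       # "pick one card X from the unseen pile"
        total = total + x  # update sum
        count = count + 1  # update count
    return total / count   # computed once, after the iteration ends

physics_marks = [46, 88, 92, 55, 88, 75]  # made-up sample data
print(average(physics_marks))  # 74.0
```

Replacing the bare `x` by a field access such as `x.letterCount` adapts the same sketch to other fields.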
What are the intermediate values that we can store in a variable? Is it only numbers,
or can we also store other kinds of data in the variables? We will discuss this
in more detail in Chapter 5. But for now, let us consider just the example of
collecting field values into a variable using what is called a list. An example of a
list consisting of marks is [46, 88, 92, 55, 88, 75]. Note that the same element can
occur more than once in the list.
The flowchart shown below in Figure 3.6 collects all the Physics marks from all
the cards in the list variable marksList. It is initialised to [] which represents the
empty list. The operation append M to marksList adds a mark M to the end of
the list - i.e. if marksList is [a1, a2, a3, ..., ak] before we append M, then it will
be [a1, a2, a3, ..., ak, M] after the append operation.
[Figure 3.6: Flowchart collecting all the Physics marks into the list variable marksList, initialised to []]
3.5 Accumulator
Note the similarity between the flowchart for sum and that for collecting the list of
physics marks. In both cases we have a variable that accumulates something - sum
accumulates (adds) the total marks, while marksList accumulates (appends) the
physics marks. The variables were initialised to the value that represents empty -
for sum this was simply 0, while for marksList this was the empty list [].
In general, a pattern in which something is accumulated during the iteration is
simply called an accumulator and the variable used in such a pattern is called an
accumulator variable.
We have seen two simple examples of accumulation - addition and collecting
items into a list. As another example, consider the problem of finding the product
of a set of values. This could be easily done through an accumulator in which
the accumulation operation will be multiplication. But in all of these cases, we
have not really done any processing of the values that were picked up from the
data elements, we just accumulated them into the accumulator variable using the
appropriate operation (addition, appending or multiplication).
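The shared pattern can be sketched as a single Python helper; the `accumulate` function below is a hypothetical illustration (not from the text), parameterised by the "empty" value and the accumulation operation.

```python
def accumulate(values, empty, op):
    acc = empty          # initialise to the value representing "empty"
    for x in values:
        acc = op(acc, x) # the accumulation step
    return acc

marks = [46, 88, 92]  # invented sample data
total     = accumulate(marks, 0,  lambda a, x: a + x)    # sum: empty is 0
collected = accumulate(marks, [], lambda a, x: a + [x])  # list: empty is []
product   = accumulate(marks, 1,  lambda a, x: a * x)    # product: empty is 1
```

Note that each operation comes with its own "empty" value: 0 for addition, [] for appending, and 1 for multiplication.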
The general accumulator will also allow us to first process each element through
an operation (sometimes called the map operation) which is then followed by the
accumulation operation (sometimes also called the reduce operation).
For instance consider the problem of finding the total number of items purchased
by all the customers in the shopping bill dataset. Each shopping bill has a list
of items bought, of which some may have fractional quantities (for example 1.5
kg of Tomatoes or 0.5 litres of milk), and the remaining are whole numbers (e.g.
2 shirts). Since it is meaningless to add 1.5 kg of tomatoes to 0.5 litres of milk
or to 2 shirts, we could simply take one item row in the shopping bill to be a
single packed item - so the number of item rows will be exactly the number of
such packed items purchased. This is not very different from what happens at a
supermarket when we buy a number of items which are individually packed and
sealed. The watchman at the exit gate of the supermarket may check the bags to
see if the number of packages in the bill match the number of packages in the
shopping cart.
In this example, the map operation will take one bill and return the number of
packages (item rows) in the bill. The reduce operation will add the number of
packages to the accumulator variable, so that the final value of the accumulator
variable will be the total number of packed items purchased.
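A minimal Python sketch of this map/reduce accumulator follows; the bills below are hypothetical stand-ins for the shopping bill dataset.

```python
# Invented sample bills: each has a list of item rows (name, quantity)
bills = [
    {"customerName": "Ahmed", "items": [("Tomatoes", 1.5), ("Milk", 0.5)]},
    {"customerName": "Banu",  "items": [("Shirts", 2)]},
]

def num_packages(bill):
    # the "map" step: one bill -> the number of item rows (packed items)
    return len(bill["items"])

total_packages = 0
for bill in bills:
    total_packages += num_packages(bill)  # the "reduce" step: accumulate
print(total_packages)  # 3
```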
Summary of chapter
Exercises
4. Filtering
When we counted the cards or found the sum of some field, we did not have to
discriminate between the cards. We will now consider situations where we need
to do some operation only for selected cards from the dataset. The process of
selecting only some cards for processing is called filtering.
For instance consider the problem of finding the sum of only the girls’ marks.
This will require us to examine the card to check if the gender on the card is M
(i.e. a boy) or F (i.e. a girl). If it is M, then we can ignore the card. Similarly, if
we wish to find the total spend of one customer C, then we check the card to see if
the customer name field is C. If not, we just ignore the card.
How do we modify the generic iterator flowchart with variables in Figure 3.1 to
include filtering? The update step has to be done only for those satisfying the filter
condition, so the check needs to be done for each card that is picked from the
"unseen" pile just before the variable update step. This ensures that if the check
condition turns out false then we can ignore the card and go back to the start of
the iteration loop. The resulting flowchart is shown in the Figure 4.1. Note that
the No (False) branch from the "Select card?" condition check takes us back to
the beginning of the loop.
[Figure 4.1: Generic iterator flowchart with filtering - the "Select card X?" check sits between picking a card and updating the variables; its No (False) branch returns to the top of the loop]
4.2 Examples
Let us now consider examples of applying filtering to find something useful from
the datasets.
[Figure 4.2: Flowchart for finding the total spend of the customer Ahmed - when X.customerName is Ahmed, X.billAmount is added to ahmedSpend]
We are looking only for verbs and can ignore all other cards. So the filtering
condition should be: X.partOfSpeech is Verb ? The update operation is pretty
straightforward, we just increment verbCount (or equivalently, we add 1 to
verbCount). The resulting flowchart is shown in Figure 4.3.
[Figure 4.3: Flowchart for counting verbs - the filter checks whether X.partOfSpeech is Verb, and the update increments verbCount]
Suppose now that we want to find the total marks of the boys and girls separately.
One approach would be to do a filtered iteration checking for boys (flowchart is
similar to Figure 4.3), filter condition being X.gender is M. We store the sum in
a variable boysTotalMarks. We then proceed to find the girlsTotalMarks in
exactly the same way, except that we check if X.gender is F.
The issue with this method is that it needs two passes through the cards, once for
the boys and once again for the girls. Can we find both in a single pass? For this,
we first note that when we check for the filtering condition (X.gender is M), we
are using only one branch - the branch for Yes (True). The other branch for No
(False) is not used, it simply takes us back to the beginning of the loop. Now if
the result of the check X.gender is M returns No (False), then it must be the case
that the gender is F (since gender can only have one of two values M or F). So we
could put some activity on that branch. This is illustrated in Figure 4.4.
[Figure 4.4: Flowchart for finding total marks for both boys and girls - the Yes (True) branch of "X.gender is M?" adds X.totalMarks to boysTotalMarks, and the No (False) branch adds it to girlsTotalMarks]
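Using both branches of the gender check, the two totals can be found in one pass, sketched in Python below with invented sample cards.

```python
# Hypothetical sample cards from the classroom dataset
students = [
    {"gender": "M", "totalMarks": 240},
    {"gender": "F", "totalMarks": 260},
    {"gender": "F", "totalMarks": 225},
]

boys_total = 0
girls_total = 0
for x in students:
    if x["gender"] == "M":
        boys_total += x["totalMarks"]   # Yes (True) branch
    else:                               # No (False) branch: gender must be F
        girls_total += x["totalMarks"]
```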
In the shopping bills dataset, we could try to find the total revenues earned by each
shop. Note that there are three shops - SV Stores, Big Bazaar and Sun General,
so we have to find the total value of all the bills generated from each of these
shops. Let svr, bbr and sgr be variables that represent the total revenues earned
by SV Stores, Big Bazaar and Sun General respectively. We can run one pass
with multiple filtering conditions, one following another, first check for SV Stores,
then Big Bazaar, and finally for Sun General. This is shown in Figure 4.5.
[Figure 4.5: Flowchart for finding the revenues of all the shops - the shop name checks are chained one after another, accumulating into svr, bbr and sgr]
Note that to make the diagram easier to read, the Yes (True) and No (False)
branches are switched as compared to the single customer spend flowchart we saw
in Figure 4.2.
Observe the flowchart carefully. The straight path through all the filtering condi-
tions in which the shop name is not SV Stores, Big Bazaar or Sun General is not
actually possible. However, for completeness it is always good to express all the
conditions in the flowchart. If for instance the shop name is wrongly written, or a
new shop name is introduced, the flowchart will still work correctly.
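A Python sketch of this chained-conditions pass follows; an unrecognised shop name simply falls through, matching the "straight path" in the flowchart. The bill amounts are invented.

```python
# Hypothetical sample cards from the shopping bill dataset
bills = [
    {"shopName": "SV Stores",   "billAmount": 300},
    {"shopName": "Big Bazaar",  "billAmount": 450},
    {"shopName": "Sun General", "billAmount": 125},
    {"shopName": "SV Stores",   "billAmount": 200},
]

svr = bbr = sgr = 0
for x in bills:
    if x["shopName"] == "SV Stores":
        svr += x["billAmount"]
    elif x["shopName"] == "Big Bazaar":
        bbr += x["billAmount"]
    elif x["shopName"] == "Sun General":
        sgr += x["billAmount"]
    # any other shop name is ignored, as in the flowchart
```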
[Figure 4.6: Flowchart maintaining a balanced counter - when X.totalMarks < ST the Yes (True) branch decrements balanced, otherwise the No (False) branch increments it]
Let's consider a slightly harder example now: finding the number of words in all
the sentences of the words dataset.
There are many ways in which this example differs from the ones we have seen
so far. Firstly, when we pick a card from the pile, we will always need to pick
the topmost (i.e. the first) card. This is to ensure that we are examining the cards
in the same sequence as the words in the paragraph. Without this, the collection
of words will not reveal anything and we cannot even determine what is a sentence.
Secondly, how do we detect the end of a sentence? We need to look for a word
that ends with a full stop symbol. Any such word will be the last word in a
sentence. Finally, there are many sentences, so the result we are expecting is not a
single number. As we have seen before, we can use a list variable to hold multiple
numbers. Let's try to put all this together now.
We use a count variable to keep track of the words we have seen so far within
one sentence. How do we do this? We initialise count to 0 at the start of each
sentence, and increment it every time we see a word. At the end of the sentence
(check if the word ends with a full stop), we append the value of count into
a variable swcl which is the sentence word count list. Obviously, we have to
initialise the list swcl to [] during the iterator initialisation step.
The resulting flowchart is shown in the Figure 4.7 below.
[Figure 4.7: Flowchart for finding the word count of all sentences - the first card is picked from the "unseen" pile, count is incremented, and when the word ends with a full stop, count is appended to swcl and reset to 0]
Note that we increment count before the filtering condition check, since the
word count in the sentence does not depend on whether we are at the end of the
sentence or not. The filtering condition decides whether the count needs to be
reset to 0 (to start counting words of the next sentence), but before we do the reset,
we first append the existing count value to the list of word counts swcl.
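This logic can be sketched in Python as below; the word cards form a made-up two-sentence paragraph.

```python
# Invented sample words, in paragraph order
words = ["The", "cat", "sat.", "It", "slept."]

swcl = []    # sentence word count list
count = 0
for w in words:           # cards must be taken in paragraph order
    count += 1            # increment before the end-of-sentence check
    if w.endswith("."):   # filtering condition: last word of a sentence
        swcl.append(count)
        count = 0         # start counting the next sentence
print(swcl)  # [3, 2]
```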
In the previous examples, the filtering was done on a single field of the card. In
many situations, the filtering condition may involve multiple fields of the card.
We now look at examples of such compound conditions.
As the first example, let's say we want to find the total marks of all the girls from
Chennai in the classroom dataset. This requires us to check for two conditions -
gender is F and town/city is Chennai. We can build the flowchart for this in the
same way that we did the shop revenues flowchart in Figure 4.5, where the conditions
are checked one after another. The resulting flowchart is shown in Figure 4.8, where
the accumulator variable cgSum holds the required sum of the Chennai girls' total
marks. Filtering conditions of this kind can be called nested since the check for
one condition is contained within a branch of the other.
[Figure 4.8: Flowchart for the Chennai girls' total marks using nested conditions - "X.gender is F?" is checked first, and only on its Yes (True) branch is "X.townCity is Chennai?" checked before adding X.totalMarks to cgSum]
As we saw in the Chennai girls sum example in Figure 4.8, when we need to check
two different fields of the card, we can use nested conditions where one condition
is checked after another. This however seems unnecessarily complicated for
something that looks a lot simpler. Can we not simply check for both conditions
together within the filter decision box?
The way to check for more complex decisions (maybe using multiple data fields)
is to use compound conditions. Compound conditions are created from the basic
conditions by using the operators AND, OR or NOT (called Boolean operators).
The redrawn flowchart for finding the sum of all Chennai girls’ total marks using
a compound condition in the filtering decision box is shown in Figure 4.9 below.
[Figure 4.9: Flowchart for the Chennai girls' total marks using a single compound filtering condition in the decision box]
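In a Python sketch, the nested checks collapse into one compound condition built with AND; the sample cards are invented.

```python
# Hypothetical sample cards from the classroom dataset
students = [
    {"gender": "F", "townCity": "Chennai", "totalMarks": 260},
    {"gender": "M", "townCity": "Chennai", "totalMarks": 240},
    {"gender": "F", "townCity": "Madurai", "totalMarks": 225},
]

cg_sum = 0
for x in students:
    # AND of the two basic conditions, checked in a single decision
    if x["gender"] == "F" and x["townCity"] == "Chennai":
        cg_sum += x["totalMarks"]
print(cg_sum)  # 260
```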
[Figure 4.10: Flowchart for the number of boys born in the first half of the year, counted in countFBH using a compound condition]
[Figure 4.11: Flowchart counting high scorers in countHS using a compound condition]
different conditions behaves like an OR. Consider for example the flowchart stub
in Figure 4.12. This is clearly the same as taking the OR of the three conditions
as shown in Figure 4.13.
[Figure 4.12: Stub with three checks in sequence - X.mathsMarks > 90, X.physicsMarks > 90, X.chemistryMarks > 90 - each setting highScorer to True on its Yes (True) branch]
[Figure 4.13: Stub with a single compound OR condition whose Yes (True) branch sets highScorer to True]
We can quickly check that the two flowchart stubs in Figures 4.12 and 4.13 are exactly
the same. In both cases, the Boolean variable highScorer is set to True if at least
one of the three conditions holds.
Now consider the sequence of conditions shown in stub 4.14. Is this the same
as the compound condition shown in stub 4.15?
[Figure 4.14: Stub with the same three checks in sequence, each incrementing countHS on its Yes (True) branch]
[Figure 4.15: Stub with a single compound OR condition whose Yes (True) branch increments countHS]
The answer is No! The flowchart stubs are clearly not equivalent. To see this,
consider the case where a student has scored more than 90 in two subjects - say
Maths and Physics. Then the stub 4.14 will increment countHS twice: once for
the condition X.mathsMarks > 90 and again for the condition X.physicsMarks
> 90. What will the stub in 4.15 do? It will increment countHS only once (since
there is only one compound condition that is checked). So the resulting countHS
of 4.14 will be 1 more than that of 4.15.
Why is it that the update to a Boolean variable highScorer worked, while
incrementing the variable countHS did not? The difference between the two is
that setting the Boolean variable highScorer to True twice is just the same as
setting it once, whereas incrementing the variable countHS twice is not the same
as incrementing it once.
This argument also tells us that if the conditions are mutually disjoint (i.e. it is
never possible for any two of them to be true at the same time), then the sequence
of conditions with exactly the same update operation can be replaced by an OR of
the conditions.
In general, a sequence of conditions may or may not behave like an OR - we have
to analyse the updates more carefully to determine whether they are the same or
not.
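The contrast between the two stubs can be sketched in Python for a single invented card; the sequence of checks and the single OR condition give different counts.

```python
# One hypothetical card: above 90 in two subjects
x = {"mathsMarks": 95, "physicsMarks": 93, "chemistryMarks": 70}

# Stub 4.14 style: each condition performs its own update
count_a = 0
if x["mathsMarks"] > 90:
    count_a += 1
if x["physicsMarks"] > 90:
    count_a += 1
if x["chemistryMarks"] > 90:
    count_a += 1

# Stub 4.15 style: one compound condition, one update
count_b = 0
if x["mathsMarks"] > 90 or x["physicsMarks"] > 90 or x["chemistryMarks"] > 90:
    count_b += 1

print(count_a, count_b)  # 2 1 -- the stubs are not equivalent
```

Replacing the increments by `high_scorer = True` makes both versions agree, since setting a Boolean twice is the same as setting it once.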
Suppose we want to search for any one card that satisfies some property specified
through a condition.
We could in principle just iterate through all the cards looking for a card which
satisfies the desired condition. When such a card is found, it is set aside, but we
continue through all the remaining cards till the iterator finishes its job. While this
works, it seems wasteful to continue with the iteration when we have found what
we want. Can’t we just stop the iterator when we have finished what we want to
do, which is finding some card? The problem is that we have not said how we can
stop an iterator - we can start it, we can do something in each step and we can
wait till it is over. So far, we are not allowed to stop it midway.
Let us look at this through an example.
Let us say that we want to search for a high scoring student, defined as someone
who has scored more than 90 in all the subjects. The filtering condition we are
looking at can be written as X.mathsMarks > 90 AND X.physicsMarks > 90 AND
X.chemistryMarks > 90, which does not change across iterations (except of course
that we are applying the same condition to a different X each time). We can write
a simple filtered iterator that checks for this condition and accumulates all the
student names that satisfy this condition in an accumulator variable that is a list
called highscorers. But this is wasteful as we are not asking to find all students
that satisfy the condition, we are only asking for any one student who does.
If we want to stop the flowchart when the condition is satisfied, all we will need to
do is to exit the loop in the iteration and go from there straight to the End terminal
in the flowchart. This is shown in the Figure 4.16 below.
[Figure 4.16: Flowchart that appends the first matching X.name to highScorers and then exits straight to the End terminal instead of returning to the loop]
Note that rather than go back to the start of the iteration after the append step in
the last activity box, we have exited the iteration loop by going directly to the End
terminal.
Can we make the exit happen at the same decision box where the iterator checks
for end of iteration? For this we need to extend the exit condition to include
something that checks if the required card has been found. We can do this by
using a new Boolean variable called found, which is initialised to False and turns
True when the required card is found. We then add found to the exit condition
box. This is illustrated in the flowchart in Figure 4.17.
[Figure 4.17: Flowchart in which the Boolean variable found is part of the loop's exit condition - it is initialised to False and set to True when the required card is found]
We can now write the generic flowchart for searching for a card, which can be
adapted for other similar situations. The generic flowchart is shown in Figure
4.18.
[Figure 4.18: Generic flowchart for searching for a card - when "X is the required card?" is Yes (True), the variables are updated and found is set to True, which makes the extended exit condition stop the loop]
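The generic search can be sketched in Python with a `while` loop whose exit condition also tests `found`, so iteration stops at the first qualifying card. The sample cards are invented.

```python
# Hypothetical sample cards from the classroom dataset
students = [
    {"name": "Ravi",  "mathsMarks": 80, "physicsMarks": 95, "chemistryMarks": 91},
    {"name": "Meena", "mathsMarks": 95, "physicsMarks": 93, "chemistryMarks": 92},
    {"name": "Arun",  "mathsMarks": 99, "physicsMarks": 98, "chemistryMarks": 97},
]

found = False
result = None
i = 0
while i < len(students) and not found:  # extended exit condition
    x = students[i]                     # pick the next "unseen" card
    i += 1
    if (x["mathsMarks"] > 90 and x["physicsMarks"] > 90
            and x["chemistryMarks"] > 90):
        result = x["name"]
        found = True                    # makes the loop exit

print(result)  # Meena (Arun is never examined)
```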
Summary of chapter
Exercises
5. Datatypes
In the datasets that we have considered so far, the field names and values were
all correctly written. This is usually not the case in real life, where the actual data
is entered by humans who are prone to making mistakes. Field names may be
misspelt, data values may have errors in them, and the field names and values may
not be properly matched.
The problem with erroneous data is that it creates a significant overload on the
computational process. Before we process any data element, we would need to
first check the sanity of the data in the card or table row. Every such possible error
will need a decision box in the flowchart to identify the error. Then we will need
to know what to do if an error is found - do we just ignore the card, or do we try to
rectify the error and then go ahead? What are the rules of rectification, i.e. which
values would replace the wrong ones?
The flowcharts or pseudocode will be so full of all these kinds of decisions that the
main logic/pattern of the problem solution will be lost in all the noise. Worse, when
so much checking has to be done, it is quite possible that some will be left out, or
the wrong checking logic is implemented in the flowchart or pseudocode. This
would let an erroneous data element pass through the flowchart or pseudocode
and give rise to all kinds of unexpected behaviour.
Even if the data elements are correctly written, we could have made errors in how
the values in the data element are used within our procedures. For instance, we
could attempt to add two name fields, or multiply one date field value with another,
or subtract one word from another - all of which are meaningless operations that
are very likely to produce some nonsense values. These values may be used later
in the procedure, resulting in totally unpredictable outcomes.
Would it not be nice if we could simply specify what it means for the data
element to be sane, and what operations are allowed to be performed on it? The
development environment or programming system that processes our procedures
should be able to take these specifications and automatically check that they are
complied with. If not, it can flag errors at processing time, so that the errors
can be rectified. This would greatly simplify the writing of the flowcharts or
pseudocode, which would no longer need to check for any of these kinds of errors.
In this chapter, we will first look at the different conditions that we can place on
the field values in our datasets and the operations allowed to be performed on
them. This will lead us to the concept of a datatype, which lets us define clear
specifications to be written about the data items and the operations permitted on
them.
a space or some special character (like a full stop). We can also see items like
the date field which seem to be like numbers, but are written with non-numeric
characters - for example Jun 1. We now go through all of these data elements
systematically, identifying valid values and operations for each of them.
this does not ensure that we catch all such errors though - for instance the Maths
and Physics marks may get erroneously interchanged. This will not be caught by
providing the total, since the total remains the same.
The most natural way of preventing wrong data entries or operations on the data
fields is to associate what is called a datatype with each of them.
A datatype (or simply type) is a way of telling the computer (or another person)
how we intend to use a data element:
• What are the values (or range of values) that the element can take ?
• What are the operations that can be performed on the data element ?
Thus when we specify that a variable is of a specific type, we are describing the
constraints placed on that variable in terms of the values it can store, and the
operations that are permitted on it.
There are three basic datatypes that we have used in our datasets. These are:
• Boolean: for holding for instance the True or False values of the filtering
conditions
• Integer: for holding numerical values like count, marks, price, amount etc
• Character: for holding single character fields like gender which has M and
F as values
We now look at each of these in turn.
5.2.1 Boolean
An element of the Boolean datatype takes one of only two values - True or False.
So any attempt to store any value other than True or False in a variable of Boolean
datatype should result in an error.
What are the operations that should be allowed on elements of this datatype? We
need to be able to combine Boolean datatype values and also check if a given
Boolean value is True or False. The table below lists some of the operations
possible on the Boolean datatype.
5.2.2 Character
5.2.3 Integer
An Integer datatype element takes the values ..., -3, -2, -1, 0, 1, 2, 3, ... (i.e. the
integer can be a negative number, zero or a positive number).
We can add, subtract, multiply integers. We can also compare two integers. It may
or may not be possible to divide one integer by another, unless we define a special
integer operation that takes only the quotient after division. These operations are
shown in the table below:
We can construct more involved datatypes from the basic datatypes using datatype
constructors. The resulting datatypes are called compound datatypes. The three
main kinds of compound datatypes that we have used in our datasets are:
• strings: are just sequences of characters of arbitrary length. Our datasets
have names of people, shops and cities, shopping item names, categories,
and words from a paragraph - all of which are strings.
• lists: are a sequence of elements, typically all of the same datatype. Our
datasets have for example lists of items in the shopping bill.
• records: are a set of named fields, each with an associated value of some
specific datatype. Each card in our dataset is an example of a record
- the student marks card, the shopping bill, the word from the paragraph.
We examine each of them in more detail below.
5.3.1 Strings
80 Chapter 5. Datatypes
5.3.2 Lists
Note that the last two operations head and tail are possible only if the list is not
empty.
5.3.3 Records
Unlike lists, a record is a collection of named fields, with each field having a
name and a value. The value can be of any other datatype. Typically there is no
restriction of any kind in terms of the number of fields or on the datatypes of the
fields.
What operations would we want to perform on the record? The one that we will
need to use for our datasets is that of picking out any field using its name. This
operation is simply denoted by ".", where X.F returns the value of the field F from
the record X. The result type is the same as the type of the field F.
5.4 Subtypes
A subtype of a datatype is defined by placing restrictions on the range of values
that can be taken, limiting the kinds of operations that can be performed, and
maybe adding more constraints on the element to ensure that there is sufficient
sanity in the data values.
We now look at each of the subtypes that we have used in our datasets.
We have more or less taken care of most of the fields present in our datasets. But
there are a few tricky ones that we have not yet dealt with - the date of birth field
in the classroom dataset, the fractional quantity and prices in the shopping bills.
Even for the marks field which consists of whole numbers, the average marks
could be fractional.
In each of these cases, we have many options for storing the data. Some of these
options may not really be suitable as they may not permit the kinds of operations that
we want to perform on them. We may have to first transform the data element
to ensure that we can easily store it, while allowing the operations we want to
perform on it.
5.5.1 Date
Let us first consider the date of birth field. We have chosen only to represent the
month and date (within the month) values, and have ignored the year of birth. This
is sufficient for the questions we want to ask (mostly to determine birthday within
the year).
Now, a typical value of this field, for instance "6 Dec", looks like a string. So it is
probably natural for us to store this field as a string, with suitable restrictions on
values that can be taken. However, using a string is not very satisfactory since it
will not allow us to compare one date D1 with another D2 to check if D1 < D2
(which would mean that D1 comes earlier in the year than D2). We may also want
to subtract D2 from D1 (to find how many days are there between birth date D1
and D2). All of this means that we need something other than string to represent
date.
We can apply the following simple transformation to the date field to make it
suitable for the comparison and subtraction operations. Since we are only looking
at the month and date within the month, we can represent the date using a single
whole number representing the number of days starting from Jan 1, which is
taken as day 0. So Jan 6 would be represented by the number 5, Jan 31 would
be represented by 30 and Feb 1 would be represented by 31. What should be the
number representing Mar 1? This depends on whether the year is a leap year (in
which Feb will have 29 days) or not. This additional bit of information is required
to correctly store the date of birth as a whole number.
Let us say it is not a leap year, so that Feb has 28 days. Then Mar 1 will be
represented by the number 59 (31 days in Jan and 28 days in Feb adds to 59).
What is the number that should be associated with Dec 6? Note that the number
used to represent a date is just the number of days in the calendar year preceding
that date. We can thus determine the associated date for all the dates in the calendar.
Dec 31 will have the number 364 associated with it (since that is the number of
days preceding Dec 31 in the calendar). So Dec 6 should be represented by 339.
Since date is stored as a whole number, we can perform comparison operations on
it easily. To check if one date D1 precedes another D2, we just have to check that
the number corresponding to D1 is less than that corresponding to D2. Similarly
to find the number of days between two calendar dates, we can just subtract their
associated whole numbers.
We will also need two convenience operations to go between strings and numbers.
The first store(D) will convert a date string D (say "6 Dec") into its number
equivalent - this will require us to add up all the days in the months Jan to
Nov and add 5 to it to get the number 339 representing "6 Dec". In the other
direction, print(N) will take a number in the range 0 to 364, and print the date
string corresponding to it. This is important to do, otherwise any external user
will have to go through the tedious and error-prone task of finding out which date
we are referring to using a number. Thus, store("6 Dec") = 339, and print(339) =
"6 Dec".
Note now that after this transformation, date just becomes an integer, so we can
perform any integer-like operations on it. Of course, multiplying two dates does
not make any sense, so we should place restrictions on the operations that can be
performed. The values also have to be constrained to lie within the range 0 to 364.
To take care of these constraints, we could define date as a subtype of integer. All
the operations needed to work on the date field can now be shown in the table
below:
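The two convenience operations can be sketched in Python as below. Following the text, the month-length table assumes a non-leap year (Feb has 28 days); the second function is named `print_date` here only because `print` is a Python built-in.

```python
# Days per month in a non-leap year, in calendar order
DAYS = {"Jan": 31, "Feb": 28, "Mar": 31, "Apr": 30, "May": 31, "Jun": 30,
        "Jul": 31, "Aug": 31, "Sep": 30, "Oct": 31, "Nov": 30, "Dec": 31}
MONTHS = list(DAYS)

def store(d):
    """Convert a date string like "6 Dec" to its day number, with Jan 1 = 0."""
    day, month = d.split()
    preceding = sum(DAYS[m] for m in MONTHS[:MONTHS.index(month)])
    return preceding + int(day) - 1

def print_date(n):
    """Convert a day number in the range 0..364 back to a date string."""
    for m in MONTHS:
        if n < DAYS[m]:
            return f"{n + 1} {m}"
        n -= DAYS[m]

print(store("6 Dec"))   # 339
print(print_date(339))  # 6 Dec
```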
which is called float. However, this is a very elaborate datatype that can allow a
very wide range of values, from the very small to the very large (both of which
would need exponents to write). We have some simple values with at most two
fractional decimal places, so the use of float may be excessive for us. Is there
something simple we could do?
The simplest way of dealing with this would be to simply multiply the data values
by 100. This will make the value into an integer (for instance 62.75 will be
converted into 6275, which is an integer). Like in the date example, we will need
two convenience functions store(62.75) which will return 6275 to store it as an
integer subtype, and print(6275) which will convert the integer subtype into a
string "62.75" which can be presented when needed.
We can now define Quantity, Price, Cost and Amount as subtypes of Integer.
What operations should be allowed on these subtypes? Clearly addition should be
allowed, since we can accumulate quantities and costs. Subtraction could also be
allowed to determine for instance price difference between two items.
But multiplication and division do not make any sense for these quantities. Luckily,
this is rather convenient for us, because if we needed to multiply two fractional
numbers, then our convenient mechanism of multiplying them by 100 would fail!
Note that print(store(a) × store(b)) is not the fractional value a × b: since both
a and b are multiplied by 100, their product is multiplied by 10000 in the subtype,
and print only reduces this product by 100, so the result is 100 times too large.
Division does exactly the reverse, it makes the number 100 times too small after
division. Since we don't need to multiply or divide, we do not have to worry about
these anomalies introduced due to our transformation.
We can also compare any two of these subtypes using =, < or >. The operations
allowed on the subtypes Quantity, Price, Cost and Amount are shown in the table
below:
86 Chapter 5. Datatypes
We note that the card carries the following fields, so we can create a record
datatype StudentMarks whose field names along with their respective datatypes is
shown below:
• uniqueId: SeqNo
• studentName: Names
• gender: Gender
• dateOfBirth: Date
• townCity: City
• mathsMarks: Marks
• physicsMarks: Marks
• chemistryMarks: Marks
• totalMarks: TMarks
What should be the datatype for the entire classroom dataset? Clearly it is simply
the compound datatype: List of StudentMarks.
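For concreteness, the StudentMarks record could be sketched as a Python dataclass. The subtype names from the chapter appear as comments; plain int and str are stand-ins that do not enforce the range constraints discussed earlier.

```python
# A sketch of the StudentMarks record datatype and the dataset type.
from dataclasses import dataclass
from typing import List

@dataclass
class StudentMarks:
    uniqueId: int        # SeqNo
    studentName: str     # Names
    gender: str          # Gender
    dateOfBirth: int     # Date, held as a day number as described earlier
    townCity: str        # City
    mathsMarks: int      # Marks
    physicsMarks: int    # Marks
    chemistryMarks: int  # Marks
    totalMarks: int      # TMarks

# The entire classroom dataset is then simply a List of StudentMarks:
Classroom = List[StudentMarks]
```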
Let us now turn to the Words dataset, an element of which is pictured below.
Clearly, we can make a record datatype ParagraphWord for this, whose field names
and respective types are as shown below:
• uniqueId: SeqNo
• word: Words
• partOfSpeech: WordCategory
• letterCount: Count
What should be the datatype for the entire Words dataset? Clearly it is simply the
compound datatype: List of ParagraphWord.
We can attempt to make a record for this. The fields which are obvious are shown
below along with their respective datatypes.
• uniqueId: SeqNo
• storeName: Names
• customerName: Names
• billAmount: Amount
But what do we do with the items purchased? Note that there are a number of
purchased items, and that the number may vary from bill to bill. The obvious
datatype to use for this is a List. But a list of what type?
Each purchased item has the item name, category name/sub category name,
quantity purchased, price per unit and cost of item purchased. So the item itself
looks like a record. Let us try to make a record datatype called BillItem with fields
and datatypes as shown below:
• itemName: Items
• category: ItemCategory
• quantity: Quantity
• price: Price
• cost: Cost
Now that we have the datatype for a line in the shopping bill, we can make a
List of BillItem and use that as a field in the shopping bill. The final record
ShoppingBill will have the following fields:
• uniqueId: SeqNo
• storeName: Names
• customerName: Names
• items: List of BillItem
• billAmount: Amount
What should be the datatype for the entire shopping bill dataset? Clearly it is
simply the compound datatype: List of ShoppingBill. Note that while the shopping
bill dataset is a list, each of its elements carries within it another list, of
type List of BillItem. So, in effect, it is a list of lists.
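The nesting can be sketched with two Python dataclasses, continuing the convention used above; the price, cost and amount fields are held as integers (hundredths), per the earlier store/print transformation.

```python
# A sketch of the nested ShoppingBill record: each bill holds its own
# list of BillItem records, so a list of bills is a list of lists.
from dataclasses import dataclass
from typing import List

@dataclass
class BillItem:
    itemName: str   # Items
    category: str   # ItemCategory
    quantity: int   # Quantity subtype
    price: int      # Price subtype (in hundredths, as described earlier)
    cost: int       # Cost subtype

@dataclass
class ShoppingBill:
    uniqueId: int           # SeqNo
    storeName: str          # Names
    customerName: str       # Names
    items: List[BillItem]   # the inner list, one entry per bill line
    billAmount: int         # Amount subtype

# The whole dataset is then a List of ShoppingBill.
```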
Summary of chapter
Exercises
6.1 Maximum . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.1.1 Collect all the max elements . . . . . . . . . . . . . 92
6.2 Minimum . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.3 Examples using maximum or minimum . . . . . . . . . . 96
6.3.1 Find the length of the longest sentence . . . . . . . . 96
6.3.2 Harder Example: Find the boundary rectangles . . . 97
6.4 Combining static, dynamic and state conditions . . . . . 97
6.4.1 Largest noun in the third sentence . . . . . . . . . . 98
6.4.2 Checking if a customer is loyal . . . . . . . . . . . . 100
6.4.3 Example: . . . . . . . . . . . . . . . . . . . . . . . 102
6.1 Maximum
We first consider the problem of finding the maximum value of some (numerical)
field in the dataset. The method we are going to use for finding the maximum is
this: we keep a variable called max which will carry the maximum value seen so
far (for the desired field). At the end of the iteration, all the cards will have been
seen, and max will carry the maximum value of the field. When we pick up a card
X, we check if its field X.F has a value higher than the current value of max. If so,
max will need to be updated with this new high value X.F. Note that this means
that the value of max can only keep increasing as we proceed through the cards.
What should the initial value of max be? We could start with something that
is lower than all the field values. In our dataset, we don’t have any fields with
negative values, so we can safely initialise max to 0 at the start.
The generic flowchart for finding the maximum of field F is shown in Figure 6.1.
[Figure 6.1: Initialise max to 0; while cards remain, pick one card X from the "unseen" pile and move it into the "seen" pile; if X.F > max, set max to X.F; otherwise move on to the next card.]
Note that the filtering condition is not static anymore, since the card’s field is
compared with a variable (max in this case), and not with a constant. The way to
view this is: the filtering condition in the beginning admits all cards (i.e. filters
out nothing). As the iteration progresses, it filters out cards with field values lower
than max, and picks only those that contribute to increasing the max.
We can use the generic flowchart above to find the maximum value of different
(numerical) fields by substituting that field in place of F in the generic flowchart
above - for instance to find the maximum Maths marks, we use X.mathsMarks in
place of X.F in the flowchart. Similarly, to find the bill with the highest spend, we
use X.billAmount and to find the longest word, we can use X.letterCount.
The flowchart in Figure 6.1 produces the maximum value of field F, but it does
not tell us which cards contribute to this max value. For example, it will give us
the maximum total marks, but will not say which students scored these marks.
To remedy this, we can keep a second variable called maxCardId, which keeps
track of the unique sequence number of the card that holds the maximum field
value out of the cards seen so far. What should maxCardId be initialised to? Any
value that will definitely not occur in the dataset will do. The simplest such
integer is -1, which can never be a card's sequence number, since sequence
numbers are always 0 or positive. The required flowchart is shown below in Figure 6.2.
[Flowchart: initialise max to 0 and maxCardId to -1; while cards remain, pick one card X from the "unseen" pile and move it into the "seen" pile; if X.F > max, set max to X.F and maxCardId to X.uniqueId.]
Figure 6.2: Finding the card that has the maximum value of a field
Is this flowchart correct? Suppose that the first 4 cards have fields shown below:
UniqueId        9   2   7   11
ChemistryMarks  65  82  77  82
Then max takes value 0 after initialisation and 65, 82, 82, 82 after the first, second,
third and fourth cards are read and processed. The first card sets max to 65. At the
second card, X.chemistryMarks > max, so max gets set to 82. At the third and
fourth cards, the Chemistry marks are not greater than 82, so those cards are skipped.
What will the value of maxCardId be? It will be -1 after initialisation. After
the first card, it is set to 9. The second card sets it to 2, and since the third and
fourth card are skipped (filtered out), they don’t change the value of maxCardId.
So the procedure is correct in the limited sense that it identifies one student
whose Chemistry marks are the highest. But this is not entirely satisfactory,
since the fourth card has exactly the same marks as the second card and should
receive equal treatment. This means we have to keep both cards, not just one. We
need a list variable to hold multiple cards - let us call this list maxCardIds.
The modified flowchart is shown in Figure 6.3.
[Figure 6.3: as in Figure 6.2, but with a list maxCardIds; if X.F > max, set max to X.F and maxCardIds to [X.uniqueId]; if instead X.F equals max, append X.uniqueId to maxCardIds.]
Note that the list variable maxCardIds needs to be initialised to the empty list []
at the start. If we find a card N with a bigger field value than the ones we have
seen so far, the old list needs to be discarded and replaced by [N]. But as we have
seen above, if we see another card M with field value equal to max (82 in the
example above), then we would need to append [M] to the list maxCardIds.
We can apply this flowchart to find all the words with the greatest number of
letters in the Words dataset. For this, replace X.F with X.letterCount.
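A possible Python rendering of this iteration, with cards modelled as dictionaries and the field name passed as a parameter (our representation, not the book's notation):

```python
# A sketch of the iteration in Figure 6.3: find the maximum value of a
# field and collect the uniqueIds of all cards that attain it.
# Assumes field values are positive, so initialising max_value to 0 is safe.

def find_max_with_ties(cards, field):
    max_value = 0
    max_card_ids = []              # starts as the empty list
    for x in cards:
        if x[field] > max_value:
            max_value = x[field]   # new maximum: discard the old list
            max_card_ids = [x["uniqueId"]]
        elif x[field] == max_value:
            max_card_ids.append(x["uniqueId"])  # a tie: keep this card too
    return max_value, max_card_ids
```

On the four-card Chemistry example above, this returns the maximum 82 together with the list [2, 11], so the fourth card now receives equal treatment.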
6.2 Minimum
In the last section, we saw how to find the maximum value of some field, and
the cards that hold that maximum value. Suppose now that we want to find the
minimum value instead of the maximum value. Does the same method work?
In principle it should. Instead of keeping a variable called max, we can keep
a variable called min which holds the minimum value of the required field. In
the update step, instead of checking whether the field F has a value larger than
max, we now check if the field value is lower than min. So far it looks pretty
straightforward.
The key question is: what do we initialise min with? In the case of max, we could
initialise with 0, since in our dataset there are no negative values in the numeric
fields. Since we know that min will keep decreasing as we go through the cards,
we could in principle start with the maximum value of the field across all the
cards - but this requires us to find the maximum first! Rather than do that, we
just set min to some high enough value MAX that is guaranteed to be higher than
the maximum value of the field. Usually, the field’s subtype would have some
constraint on the range of values allowed - so we can simply take the upper end of
the range as the value of MAX.
In the classroom dataset, we saw that the subject marks have a maximum value of
100, so MAX would be 100 if we are trying to find the minimum marks in a subject,
while the total marks have a maximum value of 300, so MAX can be set to 300 if
we are trying to find the minimum total marks.
The flowchart in Figure 6.4 produces the minimum value of field F and the list of
cards carrying this minimum value.
[Figure 6.4: initialise min to MAX and minCardIds to []; while cards remain, pick one card X from the "unseen" pile and move it into the "seen" pile; if X.F < min, set min to X.F and minCardIds to [X.uniqueId]; if instead X.F equals min, append X.uniqueId to minCardIds.]
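The minimum version is symmetric; a Python sketch (same dictionary representation as before, with the sentinel MAX passed in explicitly):

```python
# A sketch of the iteration in Figure 6.4: the minimum value of a field
# and all cards attaining it. MAX must exceed every value of the field,
# e.g. 100 for subject marks, 300 for total marks.

def find_min_with_ties(cards, field, MAX):
    min_value = MAX
    min_card_ids = []
    for x in cards:
        if x[field] < min_value:
            min_value = x[field]   # new minimum: discard the old list
            min_card_ids = [x["uniqueId"]]
        elif x[field] == min_value:
            min_card_ids.append(x["uniqueId"])  # a tie
    return min_value, min_card_ids
```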
6.3.1 Find the length of the longest sentence
Let us now consider a slightly more difficult problem - finding the length of
the longest sentence (i.e. the sentence with the largest number of words). For this,
we have to count the words in each sentence, and as we do this we have to keep
track of the maximum value of this count. The flowchart in Figure 4.7 gives us the
word count of all the sentences. Can we modify it to give us just the maximum
count rather than the list of all the sentence word counts? The required flowchart
is shown in Figure 6.5.
[Figure 6.5: initialise count and max to 0; while cards remain, pick the first card X from the "unseen" pile, move it into the "seen" pile, and increment count; if X.word ends with a full stop, set max to count if count > max, and reset count to 0.]
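The same logic can be sketched in Python, treating a word ending in a full stop as the end of a sentence (cards as dictionaries, as in the earlier sketches):

```python
# A sketch of Figure 6.5: count words per sentence and keep only the
# maximum count. A sentence ends when a word ends with a full stop.

def longest_sentence_length(cards):
    max_count = 0
    count = 0
    for x in cards:
        count += 1
        if x["word"].endswith("."):
            if count > max_count:
                max_count = count
            count = 0            # start counting the next sentence afresh
    return max_count
```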
We now look at finding something (for example the maximum or minimum) only
within a range of cards and not for the whole set of cards.
The obvious approach is to pick cards which satisfy some (static) filtering condition - for
example we could find the largest verb in the paragraph by iterating over the cards,
filtering to check if the word’s part of speech is Verb, and within the verbs, we
find the word with the highest letter count. All we will need to do is to replace the
filtering condition X.F > max in Figure 6.1 with the condition X.partOfSpeech
is Verb AND X.letterCount > max. This is thus an example of filtering using
a combination of a static condition (X.partOfSpeech is Verb) along with a
dynamic condition (X.letterCount > max). As another example consider the
problem of finding the bill with the highest spend for a given customer named N.
We could check for the customer name (static condition X.customerName is N)
along with the dynamic condition X.billAmount > max.
Sometimes, we have to go beyond using a static condition to select the range of
cards to process. Suppose that as we go over the cards, the iteration passes through
distinct phases (each of which we can call a state). The state of the iteration
can come about because of certain cards that we have seen in the past. Without
state, we will have to use filtering only to check the values of the current card
(independent of what we have seen in cards in the past). In the examples that
follow, we will look at a few examples of the use of state - for instance to keep
track of the sentence number we are currently in, or to keep track of the last bill
seen of a certain customer.
A good example of the use of state is the Boolean variable found, used for
exiting in the middle of an iteration, which we saw in Figure 4.18. The iteration
goes through two states - the state before the item was found (when found is
False) and the state after it is found (when found is True). However, in that
example, we did not do anything in the latter state except exit the iterator, so
state was not put to much use there.
We could argue that keeping track of max is also like maintaining a state (the value
of these variables). But this is not very useful as a way to understand what is
going on - there are too many states (values of max), and the transition from one
state to another is not very logical (max can change to anything depending on the
card we are seeing). Compare this with the case of the sentence number - there
are only a few states (number of sentences in the paragraph), and the transition is
quite meaningful (sentence number increments by one during each transition).
So maintaining the state of the iteration in a variable (or variables) is a way
of simplifying complex dynamic conditions (wherever possible) by separating the
part of the condition that depends on history (the state) from the part that
depends only on the current card.
6.4.1 Largest noun in the third sentence
Suppose we want to find the largest (longest) noun in the third sentence. This
requires us to select all the nouns - an example of static filtering (since we
compare the field partOfSpeech with the constant "Noun"). But it also requires
us to keep track of which sentence we are currently in (the state), for which we
maintain a variable snum for the sentence number, incremented each time we see a
full stop. And it requires us to find the maximum length word - a dynamic condition.
Since we are not interested in looking beyond the third sentence, we should exit
the iteration loop once we have finished with the third sentence. The exit condition
needs to take care of this (as we did in Figure 4.18 for exit after finding something).
The flowchart for doing all this is shown in Figure 6.6.
[Figure 6.6: while cards remain and snum < 3, pick the first card X from the "unseen" pile and move it into the "seen" pile; if snum = 2 AND X.partOfSpeech is Noun AND X.letterCount > max, set max to X.letterCount; if X.word ends with a full stop, increment snum.]
Observe the flowchart closely. There are a number of tricks that have been used.
Firstly, to exit the flowchart after the third sentence, we could have used a Boolean
variable found as we did in the generic find flowchart in Figure 4.18. However,
this is not really necessary, as there is a good equivalent in the form of checking a
condition of the state (snum < 3). Note that the state represents if we are in the
first sentence (snum=0), second (snum=1) or third sentence (snum=2). Secondly,
note that all the conditions are being checked in a single decision box - check for
a state (snum=2 meaning that we are in the third sentence), the word is a noun
(static condition), and that the word is longer than max (the dynamic condition).
Only if all these are satisfied will max be updated. Thirdly, both branches of this
decision box converge to the check for end of sentence, which if true causes snum
to be incremented.
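Putting the three kinds of condition together, a Python sketch of this combined check might look as follows (dictionary cards, as before):

```python
# A sketch of Figure 6.6: the state variable snum tracks which sentence
# we are in; the loop exits once the third sentence (snum == 2) is done.

def longest_noun_in_third_sentence(cards):
    snum = 0      # 0, 1, 2 = first, second, third sentence
    max_len = 0
    for x in cards:
        if snum >= 3:
            break                  # past the third sentence: exit the loop
        # state condition AND static condition AND dynamic condition
        if (snum == 2 and x["partOfSpeech"] == "Noun"
                and x["letterCount"] > max_len):
            max_len = x["letterCount"]
        if x["word"].endswith("."):
            snum += 1              # a full stop moves us to the next sentence
    return max_len
```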
In the case of the words dataset, the words were assumed to be in an order
according to their occurrence in the paragraph. So using the count of the number
of sentences as a state variable made some sense. What kind of state information
can we carry when the cards are not in any order?
6.4.2 Checking if a customer is loyal
Let us now consider an example where the cards may not be in any order - the
shopping bill dataset. Say we want to find out if a certain customer named N is
loyal, defined as follows: N is loyal if all the bills that have N as the
customer name share the same shop name. This would mean that customer N shops
only in one shop (i.e. is loyal to that shop). How do we check if the given customer N is
loyal? As we iterate through the cards, we keep looking for the customer name
N. If we find such a card, we change the state (variable shop initially empty) to
the name of the shop on the card (say S). As we go through the remaining cards,
we keep looking for a card with customer name N (static condition) and also for
state not being empty (i.e. we have seen a previous card of the same customer),
and check if the card’s shop name will change the state again. If it will, then the
customer is not loyal and we can exit the iteration.
The flowchart is shown below in Figure 6.7. The Boolean variable loyal starts
with the value True, and becomes False when a second shop is encountered for the
same customer name.
[Figure 6.7: while cards remain and loyal is True, pick the first card X from the "unseen" pile and move it into the "seen" pile; if X.customerName is N, then if shop is not "" and shop differs from X.storeName, set loyal to False; in either case set shop to X.storeName.]
A few points to note about the flowchart. The exit condition is strengthened with a
check for loyal. If we have found that the customer is not loyal, then the iteration
exits (nothing further to do). Within the iterator, we first check for the static
condition - whether the customer name is N. If not, we skip the card. If it is N,
then we check whether the state shop was previously set and, if so, whether it
differs from the current shop name - in which case loyal is set to False. In any event, we set
the state shop to the shop name on the card. This is merely for convenience (so
we don’t have to write another decision box). If shop was "" earlier, then it will
take the shop name from the card. If the shop name remains the same, assigning it
to shop will not change the state. If the shop name is different, then shop is set to
the latest shop name (i.e. the state changes), but this is never used again because
we will exit the loop.
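The whole check can be sketched in Python as follows (bills as dictionaries, as in the earlier sketches; the trick of unconditionally re-setting shop is kept):

```python
# A sketch of Figure 6.7: customer N is loyal if every bill bearing N's
# name carries the same shop name. shop is the state variable, "" until
# a first bill of N is seen.

def is_loyal(cards, N):
    shop = ""
    loyal = True
    for x in cards:
        if not loyal:
            break                         # a second shop was found: exit early
        if x["customerName"] == N:
            if shop != "" and shop != x["storeName"]:
                loyal = False             # the state would change: not loyal
            shop = x["storeName"]         # set (or harmlessly re-set) the state
    return loyal
```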
6.4.3 Example:
Summary of chapter
Exercises
7. Pseudocode
[Flowchart: initialise count to 0; while cards remain, pick one card X from the "unseen" pile, move it into the "seen" pile, and increment count.]
Counting cards
Step 0 Start
Step 1 Initialize count to 0
Step 2 Check cards in Pile 1
Step 3 If no more cards, go to Step 8
Step 4 Pick a card X from Pile 1
Step 5 Move X to Pile 2
Step 6 Increment count
Step 7 Go back to Step 2
Step 8 End
However, this text notation does not emphasize the logical structure of the compu-
tation. For instance, Steps 2, 3 and 7 together describe how we iterate over the
cards till all of them are processed. Steps 4, 5 and 6 are the steps that we perform
repeatedly, in each iteration.
Instead of using plain English to describe the steps, it is better to develop
specialized notation that captures the basic building blocks of computational processes.
Start
count = 0
while (Pile 1 has more cards) {
Pick a card X from Pile 1
Move X to Pile 2
Increment count
}
End
Pseudocode 7.1: Counting cards
With minor changes, we can modify this pseudocode to add up the values in a
particular field F across all the cards. Instead of incrementing a variable count
with each iteration, we update a variable sum by adding the value X.F to sum for
each card X that we read. Here is the pseudocode for the flowchart from Figure 3.3.
Start
sum = 0
while (Pile 1 has more cards) {
Pick a card X from Pile 1
Move X to Pile 2
Add X.F to sum
}
End
Pseudocode 7.2: Generic sum
We can merge and extend these computations to compute the average, as we had
earlier seen in the flowchart of Figure 3.5.
Start
count = 0
sum = 0
while (Pile 1 has more cards) {
Pick a card X from Pile 1
Move X to Pile 2
count = count + 1
sum = sum + X.F
}
average = sum/count
End
Pseudocode 7.3: Average
We have made one more change with respect to Pseudocode 7.1 and 7.2. The
increment to count and the update to sum are also now described using assignment
statements. The statement “count = count + 1” is not to be read as an equation—
such an equation would never be true for any value of count! Rather, it represents
an update to count. The expression on the right hand side reads the current value
of count and adds 1 to it. This new value is then stored back in count. In the
same way, the assignment “sum = sum + X.F” takes the current value of sum, adds
X.F to it, and stores the resulting value back in sum.
7.2 Filtering
The next step is to add a conditional statement that will allow us to do filtering.
Consider the flowchart for finding total marks for both boys and girls, shown in
Figure 4.4. Here is the corresponding pseudocode.
Start
boysTotalMarks = 0
girlsTotalMarks = 0
while (Pile 1 has more cards) {
Pick a card X from Pile 1
Move X to Pile 2
if (X.gender == M) {
boysTotalMarks = boysTotalMarks + X.totalMarks
}
else {
girlsTotalMarks = girlsTotalMarks + X.totalMarks
}
}
End
Pseudocode 7.4: Total marks of both boys and girls
If the condition in the if statement is true, the block following it is
executed. The block labelled else is executed if the condition is false. There may
be no else block. For instance, here is an example where we only sum the total
marks of girls. If X.gender is not equal to F, we don't perform any action within
the body of the iteration.
Start
girlsTotalMarks = 0
while (Pile 1 has more cards) {
Pick a card X from Pile 1
Move X to Pile 2
if (X.gender == F) {
girlsTotalMarks = girlsTotalMarks + X.totalMarks
}
}
End
Pseudocode 7.5: Total marks of girls
Start
cgSum = 0
while (Pile 1 has more cards) {
Pick a card X from Pile 1
Move X to Pile 2
if (X.gender == F AND X.townCity == Chennai) {
cgSum = cgSum + X.totalMarks
}
}
End
Pseudocode 7.6: Total marks of girls from Chennai
Start
max = 0
maxCardId = -1
while (Pile 1 has more cards) {
Pick a card X from Pile 1
Move X to Pile 2
if (X.F > max) {
max = X.F
maxCardId = X.uniqueId
}
}
End
Pseudocode 7.7: Card with maximum value of a field
Summary of chapter
Exercises
We have seen that some patterns are repeated across different computations.
Understanding the structure of these patterns helps us build up abstractions such
as filtered iterations, which can be applied in many contexts.
At a more concrete level, variants of the same computation may be required in
the same context. For example, the computation described in Figure 6.1 allows us
to find the maximum value of any field F in a pile of cards. We can instantiate F to
get a concrete computation—for instance, for the Scores dataset, if we choose F
to be chemistryMarks, we would get the maximum marks in Chemistry across
all the students.
We can think of such a generic pattern as a unit of computation that can be
parcelled out to an external agent. Imagine a scenario where we have undertaken
a major project, such as building a bridge. For this, we need to execute some
subtasks, for which we may not have the manpower or the expertise. We outsource
such tasks to a subcontractor, by clearly specifying what is to be performed,
which includes a description of what information and material we will provide the
subcontractor and what we expect in return.
In the same way, we can package a unit of computation as a “contract” to be
performed on our behalf and invoke this contract whenever we need it. This
packaged code that we contract out to be performed independently is called
a procedure. We pass the information required to execute the computation as
parameters—for instance, to tell a procedure that computes a generic maximum
which field F we are interested in. The procedure then returns the value that
corresponds to our expectation of the task to be performed, in this case, the
maximum value of field F across all the cards.
Let us write the flowchart in Figure 6.1 to illustrate our pseudocode for procedures.
Procedure findMax(field)
max = 0
while (Pile 1 has more cards) {
Pick a card X from Pile 1
Move X to Pile 2
if (X.field > max) {
max = X.field
}
}
return(max)
End findMax
Pseudocode 8.1: Procedure for generic maximum
We use the word Procedure to announce the start of the procedure and signal the
end of the code with the word End. The words Procedure and End are tagged
by the name of the procedure—in this case, findMax. We indicate the parameters
to be passed in brackets, after the name of the procedure in the first line. At the
end of its execution, the procedure has to send back its answer. This is achieved
through the return statement which, in this case, sends back the value of the
variable max.
Here is a typical statement invoking this procedure.
maxChemistryMarks = findMax(chemistryMarks)
The right hand side calls the procedure with argument chemistryMarks. This
starts off the procedure with the parameter field set to chemistryMarks. At the
end of the procedure, the value max is returned as the outcome of the procedure
call, and is assigned in the calling statement to the variable maxChemistryMarks.
Thus, the call to the procedure is typically part of an expression that is used to
compute a value to be stored in a variable.
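As a point of comparison, findMax could be rendered in Python as below. The book's procedure works on the pile of cards implicitly; in this sketch we pass the cards explicitly, and mimic the field parameter with a dictionary key (getattr would play the same role for record objects).

```python
# A Python sketch of Pseudocode 8.1: a generic maximum, with the field
# of interest passed as a parameter. Cards are dictionaries here.

def findMax(cards, field):
    max_value = 0                   # safe: no negative field values
    for x in cards:
        if x[field] > max_value:
            max_value = x[field]
    return max_value

# A typical invocation, mirroring the text:
# maxChemistryMarks = findMax(cards, "chemistryMarks")
```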
8.2 Parameters
In the previous example, the parameter passed was a field name. We can also
pass a value as a parameter. Suppose we want to compute the maximum value
in a particular field for students of a particular gender. This would require two
parameters, one specifying the field to examine and the other describing the gender,
as shown below.
Procedure findMax(field,genderValue)
max = 0
while (Pile 1 has more cards) {
Pick a card X from Pile 1
Move X to Pile 2
if (X.field > max AND X.gender == genderValue) {
max = X.field
}
}
return(max)
End findMax
Pseudocode 8.2: Procedure for maximum of a specific gender
To compute the maximum Physics marks among the boys in the class, we would
write a statement like the following.
maxPhysicsMarksBoys = findMax(physicsMarks,M)
The arguments are substituted by position for the parameters in the procedure,
so field is assigned the name physicsMarks and genderValue takes the
value M. As before, the value returned by the procedure is stored in the vari-
able maxPhysicsMarksBoys.
Start
maxMaths = findMax(mathsMarks)
maxPhysics = findMax(physicsMarks)
maxChemistry = findMax(chemistryMarks)
maxTotal = findMax(totalMarks)
subjTotal = maxMaths + maxPhysics + maxChemistry
if (maxTotal == subjTotal) {
singleTopper = True
}
else {
singleTopper = False
}
End
Pseudocode 8.3: Check if there is a single outstanding student
8.4 Side-effects
What happens when a variable we pass as an argument is modified within a
procedure? Consider the following procedure that computes the sum 1 + 2 + ··· + n
for any upper bound n. The procedure decrements the parameter value from n
down to 1 and accumulates the sum in the variable sum.
Procedure prefixSum(upperLimit)
sum = 0
while (upperLimit > 0) {
sum = sum + upperLimit
upperLimit = upperLimit - 1
}
return(sum)
End prefixSum
Start
n = 10
sumToN = prefixSum(n)
End
Pseudocode 8.5: Calling prefix sum with a variable as argument
What is the value of n after prefixSum executes? Is it still 10, the value we
assigned to the variable before calling the procedure? Or has it become 0, because
prefixSum decrements its input parameter to compute the return value?
Clearly, we would not like the value of n to be affected by what happens within
the procedure. If we use the metaphor that the procedure is executed by a sub-
contractor, the data that we hold should not be affected by what happens within
a computation whose details are not our concern. Imagine if the procedures we
wrote to compute the average of some field in a pile of cards overwrote the values
on the cards!
When a procedure modifies a value and this has an impact outside the computation
of the procedure, we say that the procedure has a side effect. In general, side
effects are not desirable. Getting back to the idea of drawing up a contract with
the procedure, one of the clauses we impose is that none of the values passed as
parameters are modified globally by the procedure. Implicitly, each argument is
copied into a local variable when the procedure starts, so all updates made within
the procedure are only to the local copy.
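Python behaves the same way for a call like prefixSum(n): the parameter name upperLimit is local to the procedure, so decrementing it leaves the caller's variable untouched. A small sketch:

```python
# Rebinding a parameter inside a procedure does not change the caller's
# variable: upperLimit is a local name inside prefixSum.

def prefixSum(upperLimit):
    total = 0
    while upperLimit > 0:
        total = total + upperLimit
        upperLimit = upperLimit - 1   # updates only the local copy
    return total

n = 10
sumToN = prefixSum(n)
# n is still 10 here; the procedure call had no side effect on it
```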
However, in the examples we have seen, some data is shared with the procedure
without explicitly passing a parameter. For instance, we never pass the pile of
cards to work on as a parameter; the procedure implicitly has direct access to the
pile of cards. When the procedure moves cards from one pile to another, this will
typically have the side effect of rearranging the pile.
In some situations, the outcome we want from a procedure is the side effect, not
a return value. For instance, we will see later how to rearrange a pile of cards
in ascending order with respect to some field value on the card. A procedure to
execute this task of sorting cards would necessarily have to modify the order of
the pile of cards.
We will return to the topic of side effects and examine it in more detail in Chap-
ter 12.
Summary of chapter
Exercises
Part II
9. Element ←→ Dataset
There are many ways to make sense of the data that is given to us. In descriptive
statistics, we try to create some summary information about the dataset using
aggregate values such as the mean (same as average) of the data. This is usually
accompanied by information about how dispersed the data is around the mean,
and whether the data is symmetric around the mean or is skewed to one side.
As we have seen in Part I, computational methods allow us to collect such aggre-
gate values (and perhaps many more) from the entire dataset, by making one or
more passes through it using iterators and (accumulator) variables to store the
results.
In this section, we will see how we can make further sense of the dataset by
identifying where each element stands with respect to these aggregate values
collected from the entire dataset.
As an example, consider the procedure to find the maximum Chemistry marks
and the element(s) that contribute to this value. The procedure not only finds the
maximum Chemistry marks but also creates a relationship between the maximum
elements and the rest of the dataset - all the elements will have Chemistry marks
below these maximum elements. Likewise, when we find the minimum, all the
elements will be positioned above the elements with minimum marks.
As another example, say we have collected the average total marks of all the
students, then we can separate the students into two groups - those who are above
average and those who are below average. We could also form three groups -
those around the average marks (say within average plus or minus 5 marks), those
significantly above average (with total marks that is at least 5 above average),
and those significantly below average (with total marks that is at least 5 below
average).
If we know the maximum, minimum and average, we could make many more
groups - let's say groups A, B, C, D and E (which are grades that can be awarded
to the students). Those within average ±5 can be given a C grade. We can divide
the range between the maximum and average + 5 into two equal parts, with the
higher half getting A and the lower half getting B. Similarly, the range between
minimum and average − 5 can be divided into two equal halves, with the higher
being awarded D and the lower E.
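A minimal sketch of this grading scheme in Python; the band edges are our reading of the text (C within average ±5, the A/B split halfway between average + 5 and the maximum, the D/E split halfway between the minimum and average − 5), and whether the boundaries themselves are inclusive is our choice.

```python
# A sketch of the five-grade scheme built from the minimum (lo),
# average (avg) and maximum (hi) of the marks.

def grade(marks, lo, avg, hi):
    if avg - 5 <= marks <= avg + 5:
        return "C"                      # around the average
    if marks > avg + 5:
        mid = (hi + avg + 5) / 2        # halfway between avg+5 and the maximum
        return "A" if marks > mid else "B"
    mid = (lo + avg - 5) / 2            # halfway between the minimum and avg-5
    return "D" if marks > mid else "E"
```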
We can also use filtering to find averages for subsets of the data. For example, say
we find the average of the boys' total marks and the average of the girls' total marks.
Then, by comparing these averages, we can say whether the girls are doing better
than the boys in this class. This may not be entirely satisfactory, because there
could be one boy with high marks who skews the boys' average to the higher side.
A better method may be to find the number of boys who are above the overall
average and compare it with the number of girls who are above average. But even
this is not quite accurate, since the class may have many more boys than girls - so
what we have to do is to find the percentage of girls who are above average and
compare it with the percentage of boys who are above average.
Similarly, in the shopping bill dataset, we can find the average bill value and
then proceed to categorise a customer as a low spender, an average spender or a
high spender depending on whether the average bill value for that customer (found
using filtering) is much below average, around the average or much above average
respectively.
In the words dataset, we could find the average frequency and average letter count
of the words. Then we could determine which words occur frequently (i.e. above
the average frequency) or are long (i.e. above the average letter count).
As we saw in the examples above, to find the relation between a data element
and the entire dataset, we will need at least two iterations - executed in sequence.
The first iteration is used to find the aggregate values such as the average, and the
second (or later) iterations are used to position an element (or subset) in relation
to the entire dataset.
[Flowchart: sequenced iterations - the first iterator stub ("pick one card X from the
"unseen" pile and move it into the "seen" pile") exits into the initialisation of the
variables for the second iterator stub, which repeats the same card-moving steps]
If you observe the flowchart carefully and compare it with the flowcharts that we
saw in Part I, you will notice that there are two iteration stubs embedded in the
flowchart (identified through the initialisation, decision to end the iteration, and
update steps). Whereas in most of the iterators seen earlier the iterator exited and
went straight to the end terminal, in this flowchart, the first iteration exits and goes
to the initialisation of the second iteration. It is the second iteration that exits by
going to the end terminal. That is precisely why we say that the two iterations
are in sequence. The second iteration will start only after the first iteration is
complete.
Start
Arrange all cards in Pile 1
Initialise the aggregate variables
while (Pile 1 has more cards) {
Pick a card X from Pile 1
Move X to Pile 2
Update aggregate variables using the field values of X
}
Arrange all cards in Pile 1
Initialise the variables
while (Pile 1 has more cards) {
Pick a card X from Pile 1
Move X to Pile 2
Update variables using the aggregate variables and the field values
of X
}
End
Pseudocode 9.1: Sequenced iterations
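In a programming language, the two sequenced iterations become two loops run one after the other over the same collection. A minimal Python sketch, with an invented list of records standing in for the card pile:

```python
cards = [{"totalMarks": t} for t in (120, 250, 180, 90)]

# First iteration: compute the aggregate values.
total = 0
count = 0
for card in cards:
    total += card["totalMarks"]
    count += 1
average = total / count

# Second iteration: position each element using the aggregate.
labels = []
for card in cards:
    labels.append("above" if card["totalMarks"] > average else "not above")

print(average, labels)
```

The second loop can only start once the first has finished, because it needs the value of `average` - exactly the dependency that puts the two iterations in sequence.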
We could also determine the average only for a subset of the data elements, and compare
these with the average of the whole, which may reveal something about that
subset.
Start
Arrange all cards in Pile 1
sum = 0, count = 0
while (Pile 1 has more cards) {
Pick a card X from Pile 1
Move X to Pile 2
sum = sum + X.totalMarks
count = count + 1
}
average = sum/count
Arrange all cards in Pile 1
aboveList = []
while (Pile 1 has more cards) {
Pick a card X from Pile 1
Move X to Pile 2
if (X.totalMarks > average) {
Append [X] to aboveList
}
}
End
Pseudocode 9.2: Above average students
overall average and compare it with the percentage of boys who are above average.
This is better than just checking if the girls' average is higher than the boys' average,
since the girls' average could be skewed higher by a single high-scoring girl
student (or, alternatively, the boys' average may be skewed lower by a single
low-scoring boy student).
The pseudocode for finding the above average students can be modified using
filters to find the count of the above average boys and above average girls as shown
below:
Start
Arrange all cards in Pile 1
sum = 0, count = 0, numBoys = 0, numGirls = 0
while (Pile 1 has more cards) {
Pick a card X from Pile 1
Move X to Pile 2
sum = sum + X.totalMarks
count = count + 1
if ( X.gender == Girl ) {
numGirls = numGirls + 1
} else {
numBoys = numBoys + 1
}
}
average = sum/count
Arrange all cards in Pile 1
boysAbove = 0, girlsAbove = 0
while (Pile 1 has more cards) {
Pick a card X from Pile 1
Move X to Pile 2
if (X.totalMarks > average) {
if ( X.gender == Girl ) {
girlsAbove = girlsAbove + 1
} else {
boysAbove = boysAbove + 1
}
}
}
girlsAreDoingBetter = False
if (girlsAbove/numGirls > boysAbove/numBoys) {
girlsAreDoingBetter = True
}
End
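The percentage comparison at the end of this computation can be sketched in Python as follows; the field names and sample records are invented for illustration:

```python
students = [
    {"gender": "Girl", "totalMarks": 250},
    {"gender": "Boy", "totalMarks": 150},
    {"gender": "Girl", "totalMarks": 140},
    {"gender": "Boy", "totalMarks": 260},
    {"gender": "Boy", "totalMarks": 120},
]

average = sum(s["totalMarks"] for s in students) / len(students)

num_girls = sum(1 for s in students if s["gender"] == "Girl")
num_boys = len(students) - num_girls
girls_above = sum(1 for s in students
                  if s["gender"] == "Girl" and s["totalMarks"] > average)
boys_above = sum(1 for s in students
                 if s["gender"] == "Boy" and s["totalMarks"] > average)

# Compare percentages, not raw counts, so that having many more boys
# than girls (or vice versa) does not skew the answer.
girls_doing_better = girls_above / num_girls > boys_above / num_boys
print(girls_doing_better)
```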
Start
Arrange all cards in Pile 1
max = 0, min = 300
in others, so a person who is merely above average in all three subjects ends up
having a higher total than students who excel in individual subjects. To take this
into account, we add another constraint: to get a prize, a student must not only be
within the top three total marks, but must also be in the top three in at least one
subject.
To solve this problem, we have to first find the top three marks in each individual
subject. Once we have this data, we can iterate over all the students and keep track
of the top three total marks, checking each time that the student is also within the
top three in one of the subjects.
How do we find the top three values in a set of cards? We have seen how to
compute the maximum value: initialize a variable max to 0, scan all the cards, and
update max each time we see a larger value.
To find the two highest values, we maintain two variables, max and secondmax.
When we scan a card, we have three possibilities. If the new value is smaller than
secondmax, we discard it. If it is bigger than secondmax but smaller than max,
we update secondmax to the new value. Finally, if it is larger than max, we save
the current value of max as secondmax and then update max to the new value.
We can extend this to keep track of the three largest values. We maintain three
variables, max, secondmax and thirdmax. For each card we scan, there are
four possibilities: the value on the card is below thirdmax, or it is between
secondmax and thirdmax, or it is between secondmax and max, or it is greater
than max. Depending on which case holds, we update the values max, secondmax
and thirdmax in an appropriate manner.
Here is a procedure to compute the top three values of a field in a pile of cards.
Procedure topThreeMarks(field)
max = 0
secondmax = 0
thirdmax = 0
while (Pile 1 has more cards) {
Pick a card X from Pile 1
Move X to Pile 2
if (X.field > max) {
thirdmax = secondmax
secondmax = max
max = X.field
}
else if (X.field > secondmax) {
thirdmax = secondmax
secondmax = X.field
}
else if (X.field > thirdmax) {
thirdmax = X.field
}
}
return(thirdmax)
end topThreeMarks
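The same single-pass tracking of the three largest values can be written in Python. This sketch is our own, with an invented list of marks:

```python
def top_three(values):
    """Track the three largest values in one pass, as in the procedure above."""
    first = second = third = 0
    for v in values:
        if v > first:
            # New maximum: the old first and second values shift down.
            first, second, third = v, first, second
        elif v > second:
            second, third = v, second
        elif v > third:
            third = v
    return first, second, third

print(top_three([72, 95, 61, 88, 90]))  # (95, 90, 88)
```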
Procedure subjectTopper(X, mathsThird, physicsThird, chemistryThird)
if (X.mathsMarks >= mathsThird OR
X.physicsMarks >= physicsThird OR
X.chemistryMarks >= chemistryThird) {
return(True)
}
else {
return(False)
}
end subjectTopper
We can now write the code for the main iteration to compute the three top students
overall.
We have to maintain not only the three highest marks, but also the identities of the
students with these marks, so we have variables max, secondmax and thirdmax
to record the marks and variables maxid, secondmaxid and thirdmaxid to
record the identities of the corresponding students.
We record the third highest mark in each subject by calling the procedure topThreeMarks
three times, once per subject. We then iterate over all the cards. We first check if
the card represents a subject topper; if not we don’t need to process it. If the card
is a subject topper, we compare its total marks with the three highest values and
update them in the same way that we had done in the procedure topThreeMarks,
except that we also need to update the corresponding identities.
Start
max = 0
secondmax = 0
thirdmax = 0
maxid = -1
secondmaxid = -1
thirdmaxid = -1
mathsThird = topThreeMarks(mathsMarks)
physicsThird = topThreeMarks(physicsMarks)
chemistryThird = topThreeMarks(chemistryMarks)
while (Pile 1 has more cards) {
Pick a card X from Pile 1
Move X to Pile 2
if (subjectTopper( X, mathsThird,
physicsThird, chemistryThird)) {
if (X.totalMarks > max) {
thirdmax = secondmax
thirdmaxid = secondmaxid
secondmax = max
secondmaxid = maxid
max = X.totalMarks
maxid = X.id
}
if (max > X.totalMarks AND X.totalMarks > secondmax) {
thirdmax = secondmax
thirdmaxid = secondmaxid
secondmax = X.totalMarks
secondmaxid = X.id
}
if (secondmax > X.totalMarks AND
X.totalMarks > thirdmax) {
thirdmax = X.totalMarks
thirdmaxid = X.id
}
}
}
End
Pseudocode 9.7: Top three students overall
We could add more criteria for awarding the top three prizes. For instance, we
could insist that there must be at least one boy and one girl in the top three—it
should not be the case that all three are boys or all three are girls. To handle this,
we could examine the top three students we have found using the process we have
described and check their gender. If there are both boys and girls in the list, we
are done. Otherwise, we discard the third highest card and repeat the computation
for the reduced pile.
If we repeat this process enough times, are we guaranteed to find a suitable set
of three students to award prizes? What if all the subject toppers are of the same
gender? What if there are ties within the top three totals?
As you can see, framing the requirement for a computation can also be a compli-
cated process. We need to anticipate extreme situations—sometimes called corner
cases—and decide how to handle them. If we do not do a thorough analysis
and our code encounters an unanticipated corner case, it will either get stuck or
produce an unexpected outcome.
Summary of chapter
Exercises
In the last chapter, we used two iterations one after another to find relations
between elements and the entire dataset. In the examples that we considered, we
saw that establishing such a relationship positions each element relative to the
entire dataset. For example, we could say whether a student was above average,
or we could assign a grade to the student based upon his relative position using
the minimum and maximum mark values.
But what if the field of the dataset does not allow numerical operations such
as addition to be performed? Then we cannot determine the average, nor the
minimum or maximum. Even if the field admits numerical operations, we may
not be happy with the coarse positioning of the element relative to the average, or
some intermediate points based on the minimum, maximum and average. We may
want to position the element much more precisely relative to some other elements.
In this chapter, we look at directly establishing pair-wise relationships between
elements. As a simple example, consider the problem of determining if two
students share the same birthday (i.e. are born in the same month and on the same
day of the month). This will require us to compare every pair of students to check
for this. As another example, we may want to determine if a customer has made
two purchases of the same product from two different shops. We have to compare
each bill of that customer with other bills of the same customer to check for this.
A somewhat more complicated problem is that of resolving the personal pronouns
(for example the word "he") in a paragraph - i.e. to identify the name of the (male)
person that the personal pronoun ("he") refers to.
To compare every pair of elements, we have to generate all the pairs first. How
do we do that? The basic iterator pattern allows us to go through all the elements
(cards) systematically, without visiting any element twice. To compare an
element picked in the first iteration with all other elements, we will need a second
iteration that sits inside the first iteration, and so this pattern is called a nested
iteration (one iteration within another).
In general, both the pairs (A,B) and (B,A) can be produced through the nested
iteration. For most practical applications (for example those in which we will
need to only compare the fields of A and B), it is enough if only one of these two
pairs, either (A,B) or (B,A), is produced - we don't need both.
What is the consequence of this? Go back to the nested iterations and look closely.
Every element moves from the "unseen" pile to the "seen" pile in the outer iteration.
So if we simply compare the element that is being moved to the "seen" pile with
all the elements in the "unseen" pile, we will have produced exactly one out of
(A,B) or (B,A). Can you see why this is so? If it was A that was picked in the outer
iteration before B, then B is still in the "unseen" pile when A is picked, and so
the pair (A,B) is produced. When B is picked, A is already moved into the "seen"
pile, so we will never compare B with A - and so (B,A) will never be produced.
The generic flowchart for nested iterations discussed above is depicted in Figure
10.1 below.
[Figure 10.1: Generic flowchart for nested iterations - the outer iteration picks one
card X from the "unseen" pile; the inner iteration picks cards Y from the "temp"
pile and moves them back into the "unseen" pile]
Observe the flowchart carefully. Unlike in the case of the sequenced iterations,
where the exit from the first iteration took us to the initialisation step of the second
iteration, here it is in reverse. The exit of the second (inner) iteration takes us to
the exit condition box of the first (outer) iteration.
Note also that we are properly keeping track of the cards that have been visited.
The "unseen" pile is supposed to keep track of cards that remain to be processed
in the outer iteration. We now also need to keep track of cards visited during
the inner iteration. To do that, we move all the cards from the "unseen" pile to
a "temp" pile. So for the inner iteration, the "temp" pile holds all the unvisited
cards, and the "unseen" pile holds the cards visited in the inner iteration, but not
yet visited in the outer iteration. At the end of the inner iteration, all the cards are
transferred back to the "unseen" pile, and the outer iteration can then proceed as if
the "unseen" pile had never been disturbed.
Start
Arrange all cards in Pile 1, Initialise the variables
while (Pile 1 has more cards) {
Pick a card X from Pile 1
Move all the remaining cards from Pile 1 to Pile 3
Move X to Pile 2
while (Pile 3 has more cards) {
Pick a card Y from Pile 3
Move Y to Pile 1
Update variables using the field values of X and Y
}
}
End
Pseudocode 10.1: Nested iterations
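In Python, the pile bookkeeping of the nested iteration reduces to pairing each element with the elements that come after it in the list (the sample data below is invented):

```python
cards = ["A", "B", "C", "D"]

pairs = []
for i, x in enumerate(cards):   # outer iteration: pick X
    for y in cards[i + 1:]:     # inner iteration: the cards still "unseen"
        pairs.append((x, y))

print(pairs)
```

Each unordered pair appears exactly once: (A,B) is produced, but (B,A) is not, because by the time B is picked in the outer loop, A has already moved to the "seen" part of the list.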
The generic flowchart for nested iterations discussed above is depicted in Figure
10.2 below.
[Figure 10.2: Alternative flowchart for nested iterations - the outer iteration picks
one card X from the "unseen" pile; the inner iteration picks cards Y from the
"temp" pile and moves them into the "seen" pile]
Observe the difference between this flowchart and the earlier one. Here, we move
all the cards (except the current card X) from the "seen" pile to a "temp" pile. So
for the inner iteration, the "temp" pile holds all the unvisited cards. At the end of
the inner iteration, all the cards are transferred back from the "temp" pile to the
"seen" pile. The outer iteration can then proceed as if the "seen" pile were never
disturbed.
Start
Arrange all cards in Pile 1, Initialise the variables
while (Pile 1 has more cards) {
Pick a card X from Pile 1
Move all cards from Pile 2 to Pile 3
Move X to Pile 2
while (Pile 3 has more cards) {
Pick a card Y from Pile 3
Move Y to Pile 2
Update variables using the field values of X and Y
}
}
End
Pseudocode 10.2: Another way to do nested iterations
Start
Arrange all cards in Pile 1, Initialise pairsList to []
while (Pile 1 has more cards) {
Pick a card X from Pile 1
Move all cards from Pile 2 to Pile 3
Move X to Pile 2
while (Pile 3 has more cards) {
Pick a card Y from Pile 3
Move Y to Pile 2
Append [ (X,Y) ] to pairsList
}
}
End
Pseudocode 10.3: List of all the pairs
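Python's standard library can generate this pairs list directly: `itertools.combinations` yields each unordered pair exactly once, just as the pseudocode above does:

```python
from itertools import combinations

cards = ["A", "B", "C", "D"]

# Every unordered pair, each produced exactly once.
pairs_list = list(combinations(cards, 2))
print(len(pairs_list))  # 6, i.e. 4 * 3 / 2
```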
Let us now consider the concrete example of checking if the classroom dataset
has a pair of students who share the same birthday (i.e. the same month of birth
and the same date within the month). The simplest way to do this may be to first
generate all the pairs of students (pairsList) from the dataset as above, and then
iterate through this list checking each pair (A,B) to see if A and B share the same
birthday. But this requires two passes over all pairs of students. Can’t we do
the comparison within the nested iteration itself? This way, we can also exit the
iteration when we find a pair that meets our desired condition. To exit from the
loop early, we will use a Boolean variable found, as we saw in Figure 4.18.
The modified pseudocode for checking if two students have the same birthday is
shown below (using the generic pseudocode for nested iteration):
Start
Arrange all cards from the classroom dataset in Pile 1
found = False
while (Pile 1 has more cards AND NOT found) {
Pick a card X from Pile 1
Move all remaining cards from Pile 1 to Pile 3
Move X to Pile 2
while (Pile 3 has more cards AND NOT found) {
Pick a card Y from Pile 3
Move Y to Pile 1
if (X.dateOfBirth == Y.dateOfBirth) {
found = True
}
}
}
End
Pseudocode 10.4: Same birthday
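A Python sketch of the same early-exit search; the dateOfBirth values below are invented (month, day) pairs:

```python
from itertools import combinations

students = [
    {"name": "Asha", "dateOfBirth": (3, 14)},
    {"name": "Ravi", "dateOfBirth": (7, 2)},
    {"name": "Meena", "dateOfBirth": (3, 14)},
    {"name": "Kiran", "dateOfBirth": (11, 30)},
]

found = False
for x, y in combinations(students, 2):
    if x["dateOfBirth"] == y["dateOfBirth"]:
        found = True
        break  # one break exits the whole search
print(found)
```

Because `combinations` flattens the two nested loops into a single loop over pairs, one `break` plays the role that the `found` flag plays in both exit conditions of the pseudocode.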
Note that we exit both the outer and the inner iterations if a pair is found with the
same birthday (by adding an AND NOT found in both the exit conditions). When
found becomes True, it will cause an exit of the inner iteration, which takes us
directly to the exit condition of the outer iteration, which will also exit.
How many comparisons are required to be done before we can conclusively say
whether a pair of students share a birthday? If there is no such pair, then we have
to examine all the pairs before we can conclude this to be the case. The entire
nested iteration will exit with found remaining False. How many comparisons
were carried out?
If there are N elements in the classroom dataset, then the number of all pairs
is N × N = N². However, we will never compare an element with itself, so the
number of meaningful comparisons is reduced by N, and so becomes N × N − N,
which can be rewritten as N × (N − 1). We saw that our pseudocode will avoid
duplicate comparisons (if (A,B) is produced, then the reverse (B,A) will not be
produced due to the trick we employed in maintaining the piles). So the number
of comparisons is reduced to half of the number of meaningful comparisons, i.e.
to N × (N − 1) / 2.
But is this the same as the number of comparisons produced by the generic nested
iteration in Figure 10.1? Let us check. The first element picked in the outer
iteration will be compared with all the "unseen" elements (there are N − 1 of these
after the first element is picked). The second element will be compared with only
N − 2 elements, since two elements would have been moved to the "seen" pile
from the "unseen" pile. So, we can see that the number of comparisons will be
(N − 1) + (N − 2) + ... + 2 + 1, which can be written in reverse as
1 + 2 + ... + (N − 2) + (N − 1), and which we know from our school algebra to be
equal to N × (N − 1) / 2.
What about the alternate nested iteration method in Figure 10.2? The first element
from the outer loop will have nothing to be compared with (since "seen" will be
empty). The second element will be compared with 1 element (the first element).
At the end, the last element will be compared with the entire "seen" pile, which
will have all the remaining elements, i.e. N − 1 elements. So the number of
comparisons is just 1 + 2 + ... + (N − 2) + (N − 1), which is equal to
N × (N − 1) / 2. So it checks out!
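We can also confirm the formula by brute force. This small Python check counts the comparisons the nested iteration actually performs:

```python
def count_comparisons(n):
    """Count the comparisons done by the nested iteration over n elements."""
    comparisons = 0
    for i in range(n):
        for j in range(i + 1, n):  # only the elements still "unseen"
            comparisons += 1
    return comparisons

for n in (2, 5, 10):
    assert count_comparisons(n) == n * (n - 1) // 2
print(count_comparisons(10))  # 45, matching 10 * 9 / 2
```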
The problem with nested iterations is that the number of comparisons can grow
really large when the dataset size N is large, since it is a quadratic in N, as shown
in Figure 10.3.
[Figure 10.3: Quadratic growth of the number of comparisons N × (N − 1) / 2 with N]
The following table shows how the number of comparisons grows with N. As we
can see, when N is one million, the number of comparisons becomes really really
large.
N          N × (N − 1) / 2
2          1
3          3
4          6
5          10
6          15
7          21
8          28
9          36
10         45
100        4950
1000       499500
10000      49995000
100000     4999950000
1000000    499999500000
Start
Arrange all cards from the classroom dataset in Pile 1
Initialise Bin[i] to [] for all i from 1 to 12
while (Pile 1 has more cards) {
Pick a card X from Pile 1
Move X to Pile 2
Append [X] to Bin[month(X)]
}
found = False, i = 1
while (i < 13 AND NOT found) {
found = processBin(Bin[i])
i = i + 1
}
End
Pseudocode 10.5: Same birthday using binning
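A Python sketch of the binning approach (invented sample records; dateOfBirth is a (month, day) pair):

```python
from itertools import combinations

students = [
    {"name": "Asha", "dateOfBirth": (3, 14)},
    {"name": "Ravi", "dateOfBirth": (7, 2)},
    {"name": "Meena", "dateOfBirth": (3, 14)},
]

# First pass: drop each student into the bin for their birth month.
bins = {m: [] for m in range(1, 13)}
for s in students:
    bins[s["dateOfBirth"][0]].append(s)

# Second pass: compare pairs only within each bin, never across bins -
# students born in different months cannot share a birthday.
found = False
for m in range(1, 13):
    for x, y in combinations(bins[m], 2):
        if x["dateOfBirth"] == y["dateOfBirth"]:
            found = True
print(found)
```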
Start
Arrange all cards from the classroom dataset in Pile 1
studyPairs = []
while (Pile 1 has more cards) {
Pick a card X from Pile 1
Move all remaining cards from Pile 1 to Pile 3
Move X to Pile 2
while (Pile 3 has more cards) {
Pick a card Y from Pile 3
Move Y to Pile 1
if (X.mathsMarks - Y.mathsMarks > 9 AND
Y.physicsMarks - X.physicsMarks > 9) {
Append [(X, Y)] to studyPairs
}
if (Y.mathsMarks - X.mathsMarks > 9 AND
X.physicsMarks - Y.physicsMarks > 9) {
Append [(Y, X)] to studyPairs
}
}
}
End
Procedure processBin(bin)
Arrange all cards from bin in Pile 1
binPairs = []
while (Pile 1 has more cards) {
Pick a card X from Pile 1
The pseudocode for finding study pairs using binning is now shown below.
Start
Arrange all cards from the classroom dataset in Pile 1
Initialise binA to [], binB to [], binC to [], binD to []
while (Pile 1 has more cards) {
Pick a card X from Pile 1
Move X to Pile 2
if (X.mathsMarks + X.physicsMarks < 125) {
Append [X] to binD
}
else if (X.mathsMarks + X.physicsMarks < 150) {
Append [X] to binC
}
else if (X.mathsMarks + X.physicsMarks < 175) {
Append [X] to binB
}
else {
Append [X] to binA
}
}
studyPairs = []
Append processBin (binA) to studyPairs
Append processBin (binB) to studyPairs
Append processBin (binC) to studyPairs
Append processBin (binD) to studyPairs
End
Pseudocode 10.8: Study group pairing using binning
Note that through the use of a simplifying heuristic (that we are more likely to find
pairs within the bins rather than across the bins), we can reduce the comparisons
substantially.
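A Python sketch of the whole binning pipeline for study pairs; the bin thresholds and the "more than 9 marks apart" condition follow the pseudocode above, while the names and marks are invented:

```python
from itertools import combinations

students = [
    {"name": "Asha", "mathsMarks": 90, "physicsMarks": 60},
    {"name": "Ravi", "mathsMarks": 60, "physicsMarks": 92},
    {"name": "Meena", "mathsMarks": 40, "physicsMarks": 45},
]

def bin_label(s):
    """Bin by combined maths + physics marks, as in Pseudocode 10.8."""
    total = s["mathsMarks"] + s["physicsMarks"]
    if total < 125:
        return "D"
    if total < 150:
        return "C"
    if total < 175:
        return "B"
    return "A"

bins = {"A": [], "B": [], "C": [], "D": []}
for s in students:
    bins[bin_label(s)].append(s)

def process_bin(cards):
    """Pair students who are more than 9 marks apart in opposite directions."""
    pairs = []
    for x, y in combinations(cards, 2):
        if (x["mathsMarks"] - y["mathsMarks"] > 9
                and y["physicsMarks"] - x["physicsMarks"] > 9):
            pairs.append((x["name"], y["name"]))
        if (y["mathsMarks"] - x["mathsMarks"] > 9
                and x["physicsMarks"] - y["physicsMarks"] > 9):
            pairs.append((y["name"], x["name"]))
    return pairs

study_pairs = []
for label in "ABCD":
    study_pairs.extend(process_bin(bins[label]))
print(study_pairs)
```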
Start
Arrange all word cards in Pile 1 in order
possibleMatches = []
while (Pile 1 has more cards) {
Pick a card X from Pile 1
if (X.partOfSpeech == Pronoun AND
personalPronoun(X.word)) {
Move all cards from Pile 2 to Pile 3 (retaining the order)
while (Pile 3 has more cards) {
Pick the top card Y from Pile 3
Move Y to the bottom of Pile 2
if (Y.partOfSpeech == Noun AND
isProperName(Y.word)) {
Append [(X,Y)] to possibleMatches
}
}
}
Move X to the top of Pile 2
}
End
Pseudocode 10.9: Possible matches for each pronoun
Secondly, when we are at such a personal pronoun, we start looking in the "seen"
pile (Pile 2 moved to Pile 3). To make sure that we are looking through the "seen"
pile in such a way that we are reading backwards in the paragraph, Pile 2
has to be maintained in reverse order of the words in Pile 1. This is ensured by
placing the card picked up on the top of Pile 2. While transferring cards from Pile
2 to Pile 3, we need to preserve this reverse order. The inner iteration should not
disturb the order of cards, so when we are done examining a card from Pile 3, we
place it at the bottom of Pile 2, so that after the inner iteration is done, Pile 2 will
be exactly as before. The code to move card X to Pile 2 is moved after the inner
iteration, so that it does not affect the "seen" pile (Pile 2) over which we need to
search. It also needs to be done whether we are executing the inner loop or not.
Finally, note that we do not stop when we find the proper name noun Y matching
the pronoun X, since this match may in some cases not be the correct one (but we
keep the pair as the first possible match in our list). If we are interested only in
finding the first possible match, we can use a Boolean variable found which is set
to True when the match is found, and the inner iteration can exit on this condition.
Summary of chapter
Exercises
11. Lists
11.1.4 Example: Collect all the items sold of one category
11.1.5 Example: Collect the first verb from each sentence
11.2.3 Example: Which shop has the most loyal customers?
Summary of chapter
Exercises
Summary of chapter
Exercises
13.1.3 Example: Find transit station for going from one city to another
13.1.4 Example: Find the subject noun for a verb in a sentence
Summary of chapter
Exercises
14. Dictionaries
If we take a look at the whole paragraph in the words dataset, we can observe
that there are quite a number of short words, and some of these seem to occur
more than once. So this leads us to ask the hypothetical question: "Are the high
frequency words mostly short?". How do we determine this? What does "mostly"
mean?
The usual way to settle questions of the kind "Do elements with property A
satisfy property B?" is to put all the elements into 4 buckets - A and B, A and
not B, not A and B, not A and not B. This is shown in the table below, where the
top row is B and the bottom row is not B, the left column is not A and the right
column is A:

            not A                A
B           not A and B          A and B
not B       not A and not B      A and not B
We then count the number of elements for each box of the table, and represent them
using percentages of the total number of elements. If the majority of elements fall
along the diagonal (the sum of the percentages of the boxes (not A and not B) and
(A and B) is much higher than the sum of percentages in the boxes (A and not B)
and (not A and B)), then we know that A and B mostly go together, and so the
hypothesis is supported.
We are now ready to carry out this procedure as shown in the pseudocode below.
The variables HFHL, HFLL, LFHL and LFLL represent high frequency and high
length, high frequency and low length, low frequency and high length and low
frequency and low length, where high and low represent above average and below
average values respectively.
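This bucketing can be sketched in Python. The word list below is invented, and we take "high" and "low" to mean above and below the average computed over the distinct words (an assumption; the book's averaging convention is not spelled out here):

```python
from collections import Counter

words = ["the", "of", "computational", "thinking", "the", "a", "frequency"]

# Frequency and length of each distinct word.
freq = Counter(words)
avg_freq = sum(freq.values()) / len(freq)
avg_len = sum(len(w) for w in freq) / len(freq)

# HF/LF = above/below average frequency, HL/LL = above/below average length.
buckets = {"HFHL": 0, "HFLL": 0, "LFHL": 0, "LFLL": 0}
for w, f in freq.items():
    key = ("HF" if f > avg_freq else "LF") + ("HL" if len(w) > avg_len else "LL")
    buckets[key] += 1

print(buckets)
```

If the counts concentrate in the HFLL and LFHL cells, then high-frequency words tend to be short, which is the hypothesis under test.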
Summary of chapter
Exercises
Summary of chapter
Exercises
16. Graphs
Summary of chapter
Exercises
Summary of chapter
Exercises
Part III
18. Recursion
Summary of chapter
Exercises
Summary of chapter
Exercises
Summary of chapter
Exercises
Summary of chapter
Exercises
Summary of chapter
Exercises
23. Concurrency
Summary of chapter
Exercises
Summary of chapter
Exercises
Summary of chapter
Exercises
Summary of chapter
Exercises