CO3002/CO7002 Analysis and Design of Algorithms: School of Computing and Mathematical Sciences
CO3002/CO7002
Analysis and Design of Algorithms
Lecture Notes
2021/22
Dr. Stanley Fung
Contents
1 Basic Concepts 2
1.1 What is an algorithm? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Why study algorithm design? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Efficiency of algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Asymptotic complexity of algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Upper and lower bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.6 Analysing time complexities of simple algorithms . . . . . . . . . . . . . . . . . . 11
1.7 Revision: Logarithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
5 Lower Bounds 40
5.1 Decision trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2 Optimal algorithm for n coins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3 Lower bound for sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6 Greedy Algorithms 46
6.1 An interval selection problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.2 Principles of greedy algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.3 Analysis of earliest finishing first . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.4 The knapsack problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
9 Dynamic Programming 72
9.1 Weighted interval selection and separated array sum . . . . . . . . . . . . . . . . 72
9.2 Principles of dynamic programming . . . . . . . . . . . . . . . . . . . . . . . . . . 77
9.3 Sequence comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
9.3.1 Longest common subsequence . . . . . . . . . . . . . . . . . . . . . . . . . 78
9.3.2 Sequence alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
9.4 Negative edges and cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
9.5 All-pairs shortest paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
9.5.1 The Floyd-Warshall algorithm . . . . . . . . . . . . . . . . . . . . . . . . 86
10 Network Flow 92
10.1 Flows and networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
10.2 The Ford-Fulkerson algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
10.2.1 Running time of Ford-Fulkerson . . . . . . . . . . . . . . . . . . . . . . . 95
10.3 Minimum cut . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
10.4 Bipartite matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Foreword
There are already many well-written algorithm textbooks and lecture notes online, and what
we teach in this module is mostly fairly standard material. Hence writing a complete set of
lecture notes for this module feels a bit like reinventing the wheel, but every now and then people
ask for it, so here it is. There are inevitably many mistakes; please do point them out to me.
The references [CLRS], [DPV], [KT], [SSS] refer to the following textbooks:
CLRS T. H. Cormen, C. E. Leiserson, R. L. Rivest and C. Stein, Introduction to Algorithms, 3rd edition, MIT Press, 2009.
DPV S. Dasgupta, C. H. Papadimitriou and U. V. Vazirani, Algorithms, McGraw-Hill, 2006.
KT J. Kleinberg and É. Tardos, Algorithm Design, Pearson, 2005.
SSS S. S. Skiena, The Algorithm Design Manual, 2nd edition, Springer, 2008.
The best way to learn algorithms is to attempt a lot of exercises. The exercises at the end
of each chapter will be discussed in tutorial sessions each week. Please note the following
important points:
• You are expected to spend time outside of scheduled class sessions on the questions.
We will not cover every question in class. Solution outlines will be on the web after
each tutorial session.
• Questions marked # test your understanding of basic concepts and ability to exe-
cute given algorithms. Everyone is expected to be able to do them, and they will
form the majority of your final exam.
• Questions marked * are more challenging questions, mainly concerned with design-
ing algorithms for new problems. The take-home coursework will be of a similar
nature. They will also form a small part of the final exam.
Chapter 1
Basic Concepts
Related reading:
[CLRS] Chapters 1, 2.2, 3
[KT] Chapters 2.1, 2.2, 4
[DPV] Chapter 0
[SSS] Chapter 2
We will not define precisely what is meant by “computational problems” here. This requires
formal notions such as formal languages and Turing machines. But informally, a computational
problem is something that can be solved using a set of standard computational operations (such
as those found in the instruction set of a CPU). We distinguish two types of problems:
• Decision problems, where the answer is just yes or no. An example is: given a map,
determine whether it is possible to colour all the regions with at most three colours, with
the usual restriction that countries sharing borders have to be assigned different colours.
• Optimisation problems, where we want to find the maximum or minimum value of some-
thing under certain constraints. An example is: given a set of tasks, each with a deadline
and a profit, find a way to schedule them to maximise the total profit of tasks completed
before their deadlines.
Often these problems can be converted into one another: for example, “find the minimum
number of colours required to colour this map” can be solved by a series of yes/no questions
“can this map be coloured in k colours,” with k = 1, 2, . . .
It should be emphasised that an algorithm must consist of a finite sequence of well-defined
operations. Without them being well-defined, it would not be possible to analyse their correct-
ness and efficiency.
Programs vs. algorithms. An algorithm expresses the high level idea to solve a problem;
the idea almost always does not depend on the choice of the programming language, or the
architecture of the underlying machine. Thus algorithm design is not concerned with these
“little details.” In a sense, an algorithm is what is left unchanged when you translate from one
programming language to another. At this moment this is perhaps all very abstract; hopefully
by the end of the module, when you have seen many examples, you will see what this means.
In principle, an algorithm, being just an “idea,” can be expressed in any way, for example
just plain English. However, this often lacks precision, and so to express our ideas clearly we
sometimes use pseudocode: something that looks like programming languages, but with no
rigid syntax as long as it is clear what it means. The following is some example pseudocode:
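The pseudocode figure itself is missing from this copy of the notes; a hypothetical fragment in the same spirit, scanning an array A[1..n] to find its largest element, might look like:

```
max ← A[1]
for i ← 2 to n do
    if A[i] > max then
        max ← A[i]
return max
```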
Even though it probably uses syntax different from the programming languages you know,
it is not difficult to see what this algorithm is trying to do. In fact, this is more “code” than
“algorithm,” as the “idea” here is really just “scan the array element by element.” Later,
when we deal with more complicated algorithms, we will not spell out this level of detail when
describing algorithms.
You may have learned ways to speed up code by low-level tweaks, such as moving a statement
inside or outside a loop, etc. This does not improve the scalability of the algorithm, something
we will explain later. What we do instead in algorithm design is to use a different way of
thinking – “algorithmic thinking,” which requires a very different mindset. To quote the author
of one of your textbooks, Steven Skiena:
Designing the right algorithm for a given application is a major creative act... being
a successful algorithm designer requires more than book knowledge. It requires a
certain attitude – the right problem-solving approach.
Worst-case running time. Even if we fix one algorithm, the running time is not unique,
as it typically depends on the input to the algorithm. For example, consider the search problem
of finding a target element from a given array A of size n. One algorithm, the sequential search
or linear search, simply checks each element in A one by one until either the element is found
or all elements are checked. Sometimes, if you are lucky, the algorithm only needs to check A[1]
and it matches the target and so it finishes in just one “step.” This is the best case scenario.
The worst case, however, happens when the target is at the last position of A (or not in A at
all), in which case it takes n “steps.”
We are interested in the worst-case running time of an algorithm. In other words, we want
to know the maximum number of steps the algorithm takes over all possible inputs (of a
certain size).3 There are a number of reasons why we are interested in this. Worst-case
analysis gives an upper limit on the running time for any input, so we have a guarantee as to
when the execution will stop and you do not need to wait forever. Other measures, such as
average-case running time, are possible, but have their own difficulties. It is not always easy to
define what “average” is – there needs to be knowledge or assumption of the input distribution.
For example, what is the “average” time to search an array? Is it roughly n/2 steps since
“on average” the target is in the middle? It may well be, but maybe not, depending on the
particular problem. Average-case running time is also often more difficult to analyse. For these
reasons, unless otherwise stated, we are always interested in worst-case running times.
3 This is the complexity-theoretic version or extension of the Church-Turing thesis.
We will use the following running example to illustrate these concepts. First we define the
problem:
Problem P: There are 8 coins, identical in appearance, except that one of them is
counterfeit and lighter than the rest. Using only a pan balance, find the counterfeit coin.
Here, the computational model is not a computer, but a pan balance. A “step” is one
application of it: put some coins on either side and observe the outcome, which is one of
three possibilities: either the left side is heavier, or the right side is heavier, or they are
equal. You are not allowed to use other things like a digital scale, since that would be a
different computational model. We want an efficient algorithm, i.e., one that takes the
fewest steps or weighings.
Algorithm A1: Name the coins A, B, . . . , H. Weigh A : B. If one side is lighter, report it and
finish. Otherwise we know that both are genuine. Repeat the procedure for C : D, E : F, G : H.
Algorithm A2: Weigh ABCD : EFGH. One side must be lighter, and it contains the counterfeit
coin. Suppose ABCD is lighter (the opposite case is similar). In the second step weigh AB :
CD. Again, one side is lighter, say AB. In the third step weigh A : B.
We care about how the running time grows because we want to know how scalable an algorithm is. Naturally, when the input gets bigger,
most algorithms will take a longer time to run, but how much more? Ideally we want it to grow
only proportionally (i.e., if the input size doubles then the time also doubles). In many (but not
all) cases, this is the best thing one can hope for. But there are less scalable – sometimes much
less scalable – algorithms. To quantify this, we indicate the running time or time complexity
of an algorithm as a function of its input size. The input size is usually denoted by n. For
example, an algorithm may take at most 10000n steps for any input of size n (remember we
want worst-case running times). Another one might take 500n^2 steps. A third one might take
2^n steps. Which is “the most scalable”? To see this, suppose we run them on a machine that
executes 100,000 steps per second. Here is how they fare with different input sizes:

Table 1.1: Running times on a machine executing 100,000 steps per second.

      n |  10000n |  500n^2 |  2^n
      1 |    0.1s |  0.005s |  0.00002s
     10 |      1s |    0.5s |  0.01024s
     50 |      5s |   12.5s |  357 years
    100 |     10s |     50s |  4 × 10^17 years
Clearly, the first one grows proportionally to the input size. The second one is somewhat
worse; it is initially faster but then becomes slower than the first one. The last one grows
crazily even with very reasonably-sized inputs, and thus scales very badly.
The big-O notation. Recall the big-O notation that you learned in your first year? You
might not have been told, back then, what it was for. Well, this is what it is for. You are going
to see it in almost every page here. It captures this notion of scalability, also known as the
order of growth or the asymptotic time complexity. Here is its formal definition:
Definition 1.1 For two functions f(n) and g(n), we say f(n) is O(g(n)), or informally4 f(n) =
O(g(n)), if there exist constants c and n0 such that

f(n) ≤ c · g(n) for all n ≥ n0

Intuitively, this means “for large enough n, f(n) is at most a constant times larger than
g(n).” In the asymptotic sense, this means “f grows slower than (or the same as) g.”5 For example,
Figure 1.1 shows (not to scale) f(n) = 0.5n + 4 and g(n) = n^2 + 3, and f(n) = O(g(n)) because
apart from an initial segment, g is always bigger than f.
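To make the definition concrete, here is one possible choice of constants for this example (many other choices also work). Take c = 1 and n0 = 2. For all n ≥ 2,

g(n) − f(n) = (n^2 + 3) − (0.5n + 4) = n^2 − 0.5n − 1,

which is increasing for n ≥ 1 and already equals 2 at n = 2, so it is nonnegative from then on. Hence f(n) ≤ 1 · g(n) for all n ≥ 2, i.e. f(n) = O(g(n)).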
The big-O notation is ideally suited to describe the time complexities of algorithms, because
of its following properties:
4 Strictly speaking this is incorrect as O(g(n)) is a set of functions. However, this informal use of the = sign
is very commonplace. You will even see it as part of normal arithmetic like n^2 + O(n). The = sign is therefore
not symmetric: for example, don’t write “O(n^2) = 3n^2 + 2n − 1.” Also, as the use of big-O is to simplify
expressions, don’t write “n^2 = O(3n^2 + 2n − 1)” even though it is technically correct.
5 Here is an unfortunate confusion of terminology: f grows slower than g because f’s values increase much
more slowly than g’s; but if these values are running times, it means f represents the time complexity of a faster
algorithm! Also, note that big-O can be used for other things, not just time complexities.
• It ignores lower order terms. In an expression like 2n^2 + 3n − 1, anything other than the
highest power term 2n^2, such as 3n and −1, is called a lower order term.
Why do we want to ignore lower order terms? Except for the most trivial algorithms, the
time complexity of an algorithm is often very difficult to state precisely down to something
like 2n^3 + 3n^2 + 4n + 5. Furthermore, such detail is pointless. An algorithm with time
complexity 2n^3 + 3n^2 + 4n + 5 and another one with 2n^3 + 7n^2 + 8n + 9 have practically
the same efficiency when n is large, because the first term 2n^3 “dominates” the value of
the expression when n is large and the other terms are insignificant. (Exercise 1-3 asks
you to experiment with this using Excel or other similar programs.)
• It ignores constant factors. This refers to the c in the definition. For example, 2n^2 and
3n^2 are both O(n^2) and are therefore “the same.” These constants do not measure the
performance of algorithms well, because they are affected by external factors like machine
speeds. They also do not measure scalability: both 2n^2 and 3n^2 quadruple when n
doubles. If you want to improve the performance by a constant factor, you can always
buy a faster processor, but you cannot improve the scalability (e.g. from O(n^2) to O(n))
by doing that. Indeed, this is precisely why we need algorithm design.
• It ignores what happens when n is small. This refers to the n0 in the definition. We only
care about performance when n is large; any poor performance for “small” input
sizes is ignored.
The following are some common time complexities (in increasing order):
• O(1): Constant, i.e. independent of the input size.
• O(log n): Logarithmic, faster than linear (in fact much faster).
• O(n): Linear, i.e. proportional to the input size.
• O(n log n): A time complexity that somehow arises very often.
• O(n^2): Quadratic. Slower but usually acceptable except with really large inputs.
The big-O notation can also be used with multiple variables. For example, if the input
consists of two arrays of sizes m and n, or a graph with n vertices and m edges, then the time
complexities could be O(m + n) (linear), O(mn), etc.
Tractability. We usually say the running time is efficient if it is a polynomial in the input
size, i.e. O(n^k) for constant k. Otherwise (e.g. exponential, O(2^n)), we regard it as inefficient.
Looking back at Table 1.1 you will see that exponential functions do grow much faster than
polynomial functions. This division between efficient and inefficient algorithms may seem a bit
arbitrary: after all, a polynomial running time of n^100 is slower than an exponential running
time of 1.001^n for all but the most colossal-sized inputs. But this definition has withstood the test
of time and turned out to be a good measure, both theoretically and practically. Also, it is very
rare to see these kinds of exotic running times: often, if someone discovers a (say) O(n^14) time
algorithm, it gets quickly improved to some lower-order polynomial.
Problems that do not have polynomial time algorithms are called intractable. Usually, nothing
significantly better than brute force (searching all possibilities) is known. This also relates to
the class of problems known as NP-complete, which we believe have no efficient algorithms,
although no one has managed to prove this.
All we have discussed so far refer to the time taken to solve a problem by running an
algorithm. These are upper bounds:
Definition 1.2 An upper bound (on the time complexity) of an algorithm is the worst-case
number of operations sufficient to solve a problem by that algorithm.
This is the worst-case complexity of an algorithm: under any input, the algorithm will finish
after that many steps. For example, we can say that Algorithm A1 for the coin weighing
problem has an upper bound of 4, and Algorithm 1.1 has an upper bound of O(n), because
they never take more than that number of steps.
As we said before, often there are many algorithms for the same problem, and obviously we
want “good” (efficient) algorithms, which means we want the upper bound to be as small as
possible.
There are a number of confusing technicalities about upper bounds. Upper bounds may
be “loose” (i.e., not tight). Suppose someone proved that a certain algorithm A has an upper
bound of O(n^3). It is also technically correct to say that A has an upper bound of O(n^4), or
indeed O(n^1000), since that many steps are clearly sufficient (a lot more than sufficient) for A
to finish. They are rather pointless, but technically correct, statements.
But this kind of deliberate silliness is not the only reason for such looseness. It is often
difficult to analyse algorithms exactly, and, for example, we may only be able to give a “loose”
proof that A takes O(n^3) time, when in fact it may never require more than O(n^2) time on any
input; it is not always easy to figure out what the worst input of an algorithm is. Indeed it may
be possible that in the future someone else proves that A in fact takes O(n^2) time.
Other times, we can establish the tightness of the upper bound: for example we prove that A
runs in O(n^2) time on all inputs and that A does require n^2 steps on some input. In this case
we sometimes (confusingly) refer to this as the lower bound of an algorithm, i.e., the number of
operations really taken by that algorithm on some input. (This is not to be confused with the
lower bound of a problem, described next.)
We often use the same concept on problems rather than algorithms:
Definition 1.3 An upper bound (on the time complexity) of a problem is the worst-case
number of operations sufficient to solve the problem by some (known) algorithm.
So for example, as you will see later, there are sorting algorithms that run in O(n log n) time
and O(n^2) time, respectively. We can then say that sorting has an upper bound of O(n log n).
(Similar to the reasoning we have just seen, we can also say that sorting has an upper bound
of O(n^2) – it is a correct upper bound, just not the best upper bound.)
As an algorithm designer, we pursue better and better upper bounds. But when do we stop?
It is natural that problems have an inherent complexity and require a certain minimum number
of steps to solve, no matter how clever the algorithms are. This is what we call lower bounds:
Definition 1.4 A lower bound (on the time complexity) of a problem is the worst-case
number of operations necessary to solve the problem by any algorithm.
The lower bound is for a problem; it has nothing to do with any specific algorithm. A
problem does not have a lower bound x just because a known algorithm for it takes x steps,
or that all known algorithms take at least x steps. It is a proof that it is impossible to have a
more efficient algorithm, including yet-to-be-discovered ones. How can we possibly prove such
a thing? We will cover this in Chapter 5.
An algorithm has optimal time complexity, or is simply called optimal, if its upper bound
matches the lower bound of the problem. The job of an algorithm designer is to design algorithms
as close to optimal as possible; this means reducing the gap between upper and lower bounds.
If they match, then we know that the algorithm we have is optimal. This also means we do not
just reduce the upper bounds (design better algorithms), but also try to prove better, stronger
lower bounds – and a lower bound is better if it is bigger, because it closes the gap.6
In an ideal world, all problems would have matching upper and lower bounds. However,
many important and sometimes simple problems do not yet have matching bounds. We do not
even know what the optimal algorithm for multiplying two n-digit integers is!
Look at the coin weighing problem and the two algorithms A1 and A2 again. A1 has an
upper bound of 4, and A2 has an upper bound of 3. A common mistake is then to simply
say that P has an upper bound of 4 and a lower bound of 3. What’s wrong with this?
6 This is perhaps counterintuitive to beginners, who think “surely it is bad news that something is hard to do,
and a bigger lower bound must be worse news?” But by doing this we are getting closer to the “truth” – what
the optimal time complexity of the problem is.
For the upper bound: clearly 3 steps are sufficient to solve the problem P (since A2 can
do that). Hence P has an upper bound of 3. (Technically 4 is also an upper bound, but
it is not the best possible upper bound.)
For the lower bound: we know nothing about the lower bound of P (for now; wait till
Chapter 5...) There might be algorithms better than A1 or A2 waiting to be discovered.
But suppose someone told you that they proved a lower bound of 2 for P. So now there
is a gap between the best upper bound so far (3) and the best lower bound so far (2).
So we should either try to design a better algorithm (to reduce the upper bound to 2)
or to find a stronger lower bound proof (to raise the lower bound to 3) so that the two
bounds match. Until then, we do not know whether the upper/lower bounds we have are
optimal.
Here’s a slightly different way of describing all this which may help to explain the terminology
“upper” and “lower” bounds. Every problem has a “true” optimal number of steps / time
complexity out there, waiting to be discovered. For example, to find the counterfeit coin from
among 1000 coins; or to sort n numbers in increasing order. Until you actually find it, the truth
is hidden from you. That truth is also fixed and you are not “improving” it when designing a
new algorithm; you are merely getting closer to the truth. If you found an algorithm for the
1000-coin problem that takes 10 steps, for instance, it gives an upper limit on what the true
optimal number of steps is (since we now know 10 steps is certainly possible, but maybe it could
be 9, 8, ...). Similarly, if we somehow proved that at least 3 steps are needed no matter what
algorithm you use, it gives a lower limit on the true number of steps (which may now be 3, 4,
5, ...). So we design better algorithms to push the upper bound down, and prove better
lower bounds to push the lower bound up, until they meet at the true optimal number of steps.
We want to have notations like big-O to denote asymptotic lower bounds, but big-O is not
suitable because the direction of the inequality is wrong. So we have these notions:
Definition 1.5 For two functions f(n) and g(n), we say f(n) is Ω(g(n)) if there exist constants
c and n0 such that
f (n) ≥ c · g(n) for all n ≥ n0
Definition 1.6 For two functions f(n) and g(n), we say f(n) is Θ(g(n)) if there exist constants
c1, c2 and n0 such that

c1 · g(n) ≤ f(n) ≤ c2 · g(n) for all n ≥ n0
Informally, f (n) = Ω(g(n)) if f grows faster than (or same as) g, and f (n) = Θ(g(n)) if f
grows at the same rate as g. Big-Ω is usually used to express lower bounds, while Big-Θ is used
to express optimal bounds or when we want to emphasise that certain quantities grow exactly
at a certain rate (and not just loosely upper- or lower-bounded).
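As a worked example, using the two-constant (“sandwich”) characterisation c1 · g(n) ≤ f(n) ≤ c2 · g(n): the claim is that 2n^2 + 3n − 1 = Θ(n^2). Take c1 = 1, c2 = 3 and n0 = 3. For the lower side, n^2 ≤ 2n^2 + 3n − 1 holds for all n ≥ 1, since the difference n^2 + 3n − 1 is positive there. For the upper side, 2n^2 + 3n − 1 ≤ 3n^2 holds whenever n^2 − 3n + 1 ≥ 0, which is true for all n ≥ 3. So both inequalities hold for all n ≥ 3, and in particular the same function is also O(n^2) and Ω(n^2).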
As a very rough analogy, we have

f(n) = O(g(n)) ⇐⇒ f ≤ g
f(n) = Ω(g(n)) ⇐⇒ f ≥ g
f(n) = Θ(g(n)) ⇐⇒ f = g

but the (in)equality signs must be interpreted in an asymptotic sense.
1.6 Analysing time complexities of simple algorithms
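The algorithm analysed in this section is missing from this copy of the notes. A hypothetical algorithm consistent with the line numbers referenced in the analysis below (two nested loops, with line 5 innermost; here computing the largest row sum of an n × n array A) is:

```
1   s ← 0
2   for i ← 1 to n do
3       t ← 0
4       for j ← 1 to n do
5           t ← t + A[i][j]
6       end for
7       if t > s then
8           s ← t
9       end if
10  end for
11  return s
```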
First, remember that a “step” is a basic operation that takes constant time. We can assume
that all standard primitive operations supported by a CPU (or programming language) take
constant time. This includes arithmetic (+, −, ×, /), comparisons (<, =, >), reading from and
writing into a given memory location (assuming no seek time is needed), etc. For example, lines
1, 3, 5, on their own, take constant time.7
A program consists of a sequence of statements; naturally, for consecutive statements, the
time to execute them is the sum of the times to execute each of them, since they are run
sequentially.
Loops are also simple: we just need to know how many times the loop is executed, then
multiply this by the time it takes to execute one iteration of the content of the loop.
A statement of the form “If A then B else C” has a time complexity that is the maximum of
the time for A plus B and the time for A plus C. Remember that testing condition A takes time;
also, we won’t go into both branches, just one of B or C, and we take the worst case.
Finally, if there are function calls, then the time it takes can be analysed separately.
Now let’s apply these to the above algorithm. There are two nested for loops. Starting with
the innermost, line 5 on its own takes O(1) time. But the j loop is iterated n times, so the time
complexity of lines 4–6 (on their own, ignoring what’s outside) is O(n). The outer for loop repeats
n times, and its contents include a constant time statement (line 3), the j loop that takes O(n)
time, and lines 7–9 that also take O(1) time. So in total lines 2–10 take n × O(n) = O(n^2) time.
Lines 1 and 11 take O(1) time. Thus in total the algorithm takes O(n^2) time.
Since this is a first example, we have done it slowly, step-by-step. With more experience
you will be able to simply “see” that this is O(n^2) “because” it has two for loops, and this kind
of elaborate explanation is not needed.
7 Please do not say line 5 takes O(2) or O(3) time just because you think there are two steps, + and ←, or three
steps if you count the memory access, for two reasons: first, as we said, O(1) is a special notation. Second,
what constitutes one “step,” as we also said, depends on factors like the instruction set and is not important for
the asymptotic analysis of algorithms.
Exercises
# 1-1 (Upper and lower bounds)
Three algorithms A1, A2 and A3 for a certain problem P are known. Their worst-case
time complexities on inputs of size n are 5n^2, 2n + 4 and 3n respectively. Is each of the
following statements true or false?
1-2 (Logarithms)
(a) If we start with a number n and in each step we reduce it by half, how many steps
does it take to make it equal to (or smaller than) 1?
(b) Let f(n) = log n and g(n) = 2^n. What are f(g(n)) and g(f(n))?
Chapter 2
Some Elementary Data Structures and Algorithms
Related reading:
[CLRS] Chapters 2.1, 10.1-10.3
[SSS] Chapters 3.1-3.2, 4.1-4.2, 4.9
Almost always, our algorithms are there to process data. We need efficient ways to represent
the data we are going to process, or the intermediate working data that the algorithm produces.
This is not just about the space usage, but also the time it takes to operate on the data.
The efficiency of algorithms is highly dependent on the use of appropriate data structures.
Computer scientist Niklaus Wirth even authored a book titled “Algorithms + Data Structures
= Programs.”
Different algorithms require different numbers of each type of operation. Some common
operations on data include insertion, deletion, editing and searching. Different data structures
support different operations with different time complexities. There is often a tradeoff with the
various competing objectives like space usage and time efficiency, and the choice depends on the
particular algorithm in question. For example, if an algorithm requires a lot of insertion and
deletion but not much else, then it makes sense to choose a data structure that supports these
operations efficiently, perhaps at the expense of being slower at other operations or in its space
usage. Another algorithm that never requires insertion or deletion, for example, can choose a
data structure that does not support that, or is inefficient at doing that.
Some common data structures that we discuss in this chapter are: arrays, linked lists, stacks
and queues.
struct node {
int data; // assume our data are integers
node *next; // pointer to next node
};
A linked list does not support direct access. To get to the i-th element, you have to travel
along or traverse the list step by step. The following is how to search, which takes O(n) time
for a list with n elements:
node* search(int x) {
node *temp = head; // head points to the first node of the list
while (temp != nullptr && temp->data != x)
temp = temp->next;
return temp; // nullptr if not found
}
The advantage of linked lists is that insertion or deletion can be done in constant time,
assuming we already have a pointer to the location to insert to or delete from (for example
as the result of a previous search operation). This is because we only need to reassign a few
pointers. See Figure 2.1.
(Note that the element is merely “bypassed” and not actually “deleted,” and still occupies
memory space. In C/C++ you would need a free/delete operation, and in Java this is handled
by garbage collection.)
For simplicity, in the above code we have not fully handled error checking or boundary
conditions, e.g. what happens when it is already at the end of the list and there is no “next”
element etc. These can be added without affecting the time complexity.
We can also have doubly linked lists, where each element has both a pointer to the next
one as well as a pointer to the previous one; or circular linked lists, where the end of the list
points back to the head. They support easier navigation through the list, but take more time
to update after insertion/deletion and also take up more space (both by a constant factor).
A stack is a last-in-first-out (LIFO) data structure: insertions (push) and deletions (pop)
are both made at the same end, called the top. In contrast, a queue is a first-in-first-out
(FIFO) data structure. Insertion can only be made at one end (the tail), and deletion only
at the other end (the head). The operations supported are:
• Enqueue(x): insert element x at the tail of the queue.
• Dequeue(): remove and return the element at the head of the queue.
Stacks and queues are examples of abstract data structures. They are specified by the set
of operations they can perform and their running times, without details of how they are done.
Essentially it is like classes in object-oriented languages that encapsulate the inner working
details from the programmer, and only expose an “interface.” Algorithm designers only need
to look at the interface when considering which data structure to use.
Abstract data structures are implemented using other data structures. Sometimes they can
be implemented in more than one way. For example, both stacks and queues can be implemented
using arrays or linked lists in a way that supports constant time per operation. In the following
we discuss their implementations using arrays. (Exercise 2-1 asks you to implement them using
linked lists.) For convenience (to ignore issues of memory allocation), we assume the stack/queue
has a fixed maximum size (100).
class Stack {
int S[0..99];
int top = -1;
void push(int x) {
if (top == 99) print "stack overflow";
else { top++; S[top] = x; }
}
int pop() {
if (top == -1) { print "stack underflow"; return -1; } // -1 as an error value
else { top--; return S[top+1]; }
}
}
The above is the stack implementation. It uses a top pointer to point to (i.e. record the
index of) the position of the top of the stack. S[0..top] are the cells occupied by elements.
Using a similar idea for a queue may sound straightforward at first, but it requires more
care. The problem is with deletion: after the head element has left, we cannot shift all the
remaining elements to the left by one position, since that takes O(n) time. Instead we use two
pointers, h and t, to point to the head and tail of the queue respectively, with the head moving
to the next position when an element is deleted. Thus the elements do not always occupy the
array from the beginning, but the range Q[h..t−1]. But after a while there will be a lot of wasted empty space
at the beginning of Q. So we need to “wrap around” when the tail reaches the last element
(imagine the array being circular). This also means we may have a somewhat counterintuitive
situation where the tail is in front of the head. The following pseudocode implements this idea.
class Queue {
int Q[0..99]; // NB: size 100 but stores only 99 elements
int h = 0, t = 0;
void Enqueue(int x) {
if ((t+1)%100 == h) print "queue full";
else {
Q[t] = x;
t = (t+1)%100;
}
}
int Dequeue() {
if (t == h) print "queue empty";
else {
x = Q[h];
h = (h+1)%100;
return x;
}
}
}
Note that although the array has size 100, the queue only stores a maximum of 99 elements.
Otherwise, you cannot distinguish between a full queue and an empty queue. (Pause for a
moment and consider why.)
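As a runnable counterpart to the pseudocode, here is a Python sketch of the same circular-buffer queue. It raises exceptions instead of printing messages, and the capacity parameter n generalises the fixed size 100 used above.

```python
class Queue:
    """Array-based circular queue; an array of size n holds at most n-1 elements."""

    def __init__(self, n=100):
        self.n = n
        self.q = [None] * n
        self.h = 0          # head index: next element to dequeue
        self.t = 0          # tail index: next free slot

    def enqueue(self, x):
        if (self.t + 1) % self.n == self.h:
            raise OverflowError("queue full")
        self.q[self.t] = x
        self.t = (self.t + 1) % self.n   # wrap around at the end of the array

    def dequeue(self):
        if self.t == self.h:
            raise IndexError("queue empty")
        x = self.q[self.h]
        self.h = (self.h + 1) % self.n
        return x
```

Note how the wrap-around works: once the tail reaches the last cell, the next enqueue stores into a cell freed by an earlier dequeue.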
2.3 Searching
Searching is a very fundamental problem: given an array A with n elements, report the location
of an element x in A, or that it does not appear in A. A trivial algorithm is simply to search
one by one:
This is linear search. Clearly, its time complexity is O(n). It can be shown that this is
optimal if nothing else is known about the input. However, linear-time search
is often unacceptable, since we usually search within a vast collection of data.
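The one-by-one scan can be sketched in Python as follows (an illustration; the function name is our own):

```python
def linear_search(A, x):
    """Return the index of x in A, or -1 if x does not appear. O(n) time."""
    for i, v in enumerate(A):
        if v == x:
            return i
    return -1
```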
If the elements are already sorted, we can do better. The idea of binary search is to reduce
the search space by half by checking the middle element. Suppose the elements are sorted in
increasing order, and we are looking for an element x. If the middle element is > x, then x
can only be in the first half; if the middle element is < x, then x can only be in the second
half. We are now faced with the same problem but of smaller (half) size. We can then repeat
the same idea on the remaining half, either via recursion or by a controlled iteration. We will
explain more about recursion in the next chapter; here we present a non-recursive version. We
use two indices lo and hi to indicate the range of the array we are searching; at any point of
the execution, A[lo..hi] is where the target may still possibly be.
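The non-recursive version can be sketched in Python as follows (0-indexed, so the initial range covers A[0..n−1]; this is an illustrative sketch rather than the notes' own pseudocode):

```python
def binary_search(A, x):
    """A must be sorted in increasing order. Return an index of x, or -1."""
    lo, hi = 0, len(A) - 1
    while lo <= hi:                 # A[lo..hi] is where x may still be
        mid = (lo + hi) // 2
        if A[mid] == x:
            return mid
        elif A[mid] > x:
            hi = mid - 1            # x can only be in the first half
        else:
            lo = mid + 1            # x can only be in the second half
    return -1
```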
What is the time complexity of binary search? Each iteration of the while loop takes O(1)
time. So the question amounts to how many times the while loop is
executed. Observe that each iteration reduces the range (hi − lo + 1) by about half.
The value hi − lo + 1 is n at the beginning, and 1 at the end. Thus, assuming n is a power of
2, the value decreases as the sequence n → n/2 → n/4 → . . . → 2 → 1 which we know (from
Exercise 1-2) has log2 n steps. Thus the overall time complexity is O(log n). It is therefore
much better than linear search, which has a linear time complexity.
2.4 Sorting
Another natural and very fundamental problem in computer science is sorting: given n items
that are “comparable” (i.e., one can assign <, = or > to any two given elements), arrange them
in ascending (or descending) order. The most natural example is of course sorting numbers, but
many other things like strings can be sorted as long as an ordering can be assigned to them.
Here we study two simple sorting algorithms. Later we will introduce more advanced
algorithms that are more efficient.
Selection sort. This is based on the idea of repeatedly finding the smallest element, swapping it
into the first position, and repeating for the remaining subarray. Or in pseudocode:
The algorithm maintains the invariant that when the outer loop finishes for a certain i,
A[1..i] contains the i smallest numbers in sorted order.
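A 0-indexed Python rendering of the same idea (an illustrative sketch):

```python
def selection_sort(A):
    """In-place selection sort. O(n^2) comparisons."""
    n = len(A)
    for i in range(n - 1):            # outer loop: n-1 iterations
        m = i                         # index of smallest element seen in A[i..n-1]
        for j in range(i + 1, n):     # inner loop: scan the unsorted suffix
            if A[j] < A[m]:
                m = j
        A[i], A[m] = A[m], A[i]       # swap the smallest into position i
    return A
```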
Its time complexity is easy to analyse. There are two for loops: The outer i for loop is
executed n − 1 times. The inner j for loop is executed at most n − 1 times, for each value of
i. (To be precise it is executed n − i times, but this is at most n − 1. It can be shown that
even when counting precisely with n − i, the big-O time complexity is unaffected.) The contents
inside the nested loops take O(1) time per execution. Thus the total time complexity is O(n^2).
Insertion sort. Here we use a different idea: we process elements one by one, while main-
taining an initial part A[1..i] of A to be sorted. Each time we consider the next element A[i + 1],
and insert it in the correct position within A[1..i] to create a bigger sorted array A[1..i + 1].
Repeat the procedure to further extend the sorted part until the whole array is sorted.
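A 0-indexed Python sketch (here the element being inserted is A[i], and the while loop scans the sorted prefix A[0..i−1] from its end):

```python
def insertion_sort(A):
    """In-place insertion sort; A[0..i-1] is kept sorted as i advances."""
    for i in range(1, len(A)):
        key = A[i]                    # element to insert into the sorted prefix
        j = i - 1
        while j >= 0 and A[j] > key:
            A[j + 1] = A[j]           # shift right to open a gap
            j -= 1
        A[j + 1] = key                # place the element into the gap
    return A
```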
The while loop considers the elements in A[1..i − 1], starting from the end, and for each element
checks whether it is larger than A[i]. If so, the insertion point is before this element, so we move
it one position to the right to create a space for possible insertion. We keep doing this until we
encounter an element smaller than A[i], which means we should now place A[i] into this space.
It is easy to see that the time complexity is again O(n^2): the outer for loop clearly executes
at most n times, and for each such iteration, the inner while loop iterates at most n times.
Some discussions. Before we leave this topic we discuss a few minor points.
First, in insertion sort, we do not really need O(n) time to find the correct position to insert.
The subarray A[1..i] is already sorted; hence we can simply use binary search for the correct
position! Incorporating this idea gives us binary insertion sort. The total number of comparisons
made is reduced to O(n log n). However, since we still need to move the elements after insertion
(which takes O(n) time), the total time is still O(n^2). But this might be useful if comparison
is a very expensive operation.
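A sketch of binary insertion sort in Python, using the standard-library bisect module to find each insertion point; note that the list-slice shift still costs O(n) per insertion, so the overall time remains O(n^2):

```python
import bisect

def binary_insertion_sort(A):
    """Insertion sort with binary search for the insertion point.
    O(n log n) comparisons in total, but still O(n^2) data moves."""
    for i in range(1, len(A)):
        key = A[i]
        pos = bisect.bisect_right(A, key, 0, i)  # O(log i) comparisons
        A[pos + 1:i + 1] = A[pos:i]              # shift: O(i) moves
        A[pos] = key
    return A
```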
Both selection sort and insertion sort take O(n^2) time, but they have different properties.
In selection sort, once a position is found for an element, it remains there; this is not true for
insertion sort. But insertion sort benefits from partially-sorted input: if the input is “more or
less” sorted, it is typically faster, since the inner while loop does not need to go far before
finding the correct insertion point. Insertion sort can also be done on-line, meaning that it can
run without waiting for all the input to arrive. In other words, it can be run while the data is being
“streamed in.” This is not the case for selection sort.
Exercises
2-1 (Stacks and queues)
Implement stacks and queues using linked lists so that all operations are supported in
O(1) time.
2-2 (Sorting)
What is the worst-case input for sorting 5 numbers in ascending order using insertion
sort? How many comparisons are needed? What if it is n numbers instead of 5?
∗ 2-3 (Pancake sorting)
There is a stack of n pancakes of different sizes. We want to arrange them in increasing
order of size from top to bottom. The only operation allowed is to insert a spatula
immediately below a certain pancake, and flip all the pancakes above it (see picture). This
counts as one step.
Give an algorithm that sorts any stack of n pancakes with at most 2n − 3 flips in the
worst case. (Hint: consider selection sort and/or recursion.) 1
∗ 2-4 (A fatal search algorithm)
The Computer Science department is moving into a new building with n floors, and the
professors are given a task of determining the “lethal height” of the building, i.e., the floor
x where people falling from it or above will die while people falling from floor x − 1 or
below will not. They have one resource: their own lives.
(a) A simple algorithm is to ask a professor to try jumping out of each floor (starting
from the bottom) until he or she dies. What is the worst-case number of jumps
required?
(b) The above method is too slow. Assuming many professors are willing to help, give
an algorithm that only takes O(log n) jumps. What is the maximum number of lives
it will take?
(c) The method (a) above uses only one life but is too slow, whereas the method (b) is
faster but may take many lives. Give an algorithm that uses at most two professors
and is asymptotically faster than that in (a).
¹ In 1979 Bill Gates, the Microsoft founder, improved this upper bound to 5n/3, and gave a lower bound of
17n/16. Another 30 years passed before someone else improved the upper bound further to 18n/11. It is still
far away from the best lower bound currently known, 15n/14. Determining the optimal bound remains an open
problem.
Chapter 3
Recursion and Recurrences
Related reading:
[CLRS] Chapters 4.3-4.5
[KT] Chapter 5.2
[DPV] Chapter 2.2
You should know from your programming classes that functions/methods can call themselves.
Similarly, a recursive algorithm is an algorithm that calls itself. People new to programming
often find the idea very confusing. Well, people didn’t invent recursion just for fun or just
to give students headaches; it is a very powerful tool in algorithm design. In fact, most of the
algorithmic techniques that will be introduced in this module can be interpreted as recursion,
and if you manage to get your head round it, it is actually a simple idea and leads to simple
proofs.
• Only one disc, from the top of a peg, can be moved in each step
• A disc can never be placed on top of a smaller disc
There is a simple recursive algorithm for the problem. It is based on the observation that,
if you know how to solve the problem with n − 1 discs, you can solve it with n discs:
1. Recursively move the top n−1 discs from A to B (leaving the bottom disc at A untouched,
as if it is just part of the table/ground)
2. Move the remaining (biggest) disc from A to C
3. Recursively move the n − 1 discs from B to C (again leaving the bottom disc at C untouched)
Clearly, this algorithm gives a correct solution, in the sense that the moves are legal, that
it terminates, and that the final configuration is what we want. But does it complete it in the
minimum number of moves? What is the number of moves made?
Working out the algorithm on small values of n gives 1, 3, 7, 15 and 31 moves for n = 1, 2, 3, 4 and 5 respectively.
We cannot use the methods in Chapter 1 to analyse the time complexity (or in this case,
the number of moves) of the Tower of Hanoi algorithm. It recursively calls itself, so its time
complexity, as a function of n, also depends on itself! So first, let’s try to write down a formula
to relate the time complexity to itself, based on the workings of the algorithm.
Let T (n) be the number of moves made by the algorithm when given n discs. The algorithm
first invokes itself to move the top n − 1 discs, thus making T (n − 1) moves (we don’t yet know
what this number is, but just carry on...); then it makes a single move to move the biggest disc;
finally it invokes itself again, costing another T (n − 1) moves. Therefore we have the formula
T (n) = 2T (n − 1) + 1 (3.1)
This allows us to work out the number of moves made by the algorithm for any given n. For
example, T (6) = 2T (5) + 1 = 2(31) + 1 = 63, since we already know that T (5) = 31. However,
it would be very inefficient to work it out one by one like this. What if we want to find T (100)?
Do we have to work through all 99 previous values? We want to be able to express T (n) purely
as a function of n, and not as a function of itself. For example, for the above formula, it turns
out that the correct solution is T(n) = 2^n − 1. You can check it against the values for small n
to see that it is indeed correct.
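The algorithm and the closed form can be checked with a short Python sketch (the peg names are arbitrary labels):

```python
def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Return the list of moves transferring n discs from src to dst."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, src, dst, aux, moves)   # move the top n-1 discs: src -> aux
    moves.append((src, dst))             # move the biggest disc: src -> dst
    hanoi(n - 1, aux, src, dst, moves)   # move the n-1 discs: aux -> dst
    return moves
```

For each n, the number of moves produced is exactly 2^n − 1, matching the solved recurrence.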
Formulas like (3.1) are called recurrences. A recurrence expresses a function, which is usually a time
complexity for our purposes, in terms of itself, but with smaller arguments. It must
also come with a base case, which is where the recursion stops, because it cannot continue
indefinitely. Usually, the base case is reached when n is sufficiently small that it is trivial
to solve the problem without recursion, and the number of steps will then also be obvious.
In Tower of Hanoi, we should specify the base case T(1) = 1 (or T(0) = 0). The process of
obtaining an explicit solution from the recurrence is called solving the recurrence.
There are two things you really should know how to do, when given a recursive algorithm:
first, to write down a recurrence formula that expresses the time complexity of the algorithm;
and second, to solve the recurrence.
As an example, consider a recursive version of binary search. Each recursive call must work on a smaller portion of the array, so we need to pass different
parameters each time; this can be done by passing a different subarray or (like below) passing
the indices to indicate the range of the array to be considered. It therefore means that you
need to separately specify how we begin running the algorithm; for the one below, we begin by
calling Binary-Search(A, 1, n, x).
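A 0-indexed Python sketch of the recursive version (the line numbers mentioned in the analysis below refer to the original pseudocode, not to this sketch):

```python
def binary_search(A, lo, hi, x):
    """Search for x in the sorted subarray A[lo..hi] (inclusive bounds)."""
    if lo > hi:
        return -1                     # empty range: x is not present
    mid = (lo + hi) // 2
    if A[mid] == x:
        return mid
    elif A[mid] > x:
        return binary_search(A, lo, mid - 1, x)   # recurse on the first half
    else:
        return binary_search(A, mid + 1, hi, x)   # recurse on the second half
```

The initial call is binary_search(A, 0, len(A) − 1, x).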
The recurrence for the number of element comparisons made by this algorithm is T (n) =
T (n/2) + 1, since it makes one comparison first (line 5), then invokes one recursive call, of size
(number of elements) half of the original. Note that despite the appearance of two recursive
calls (lines 8 and 10), only one of them is invoked every time.
We will later see that the solution of this recurrence is T (n) = O(log n), which is unsurprising
given that we already know the non-recursive version of binary search runs in that amount of
time. To solve this, and many other similar recurrences, in the next section we introduce general
methods for solving recurrences.
T (n) is expressed in terms of T (n − 1), so the argument (the thing inside the bracket) gets
smaller. The formula also means that we can express T (n − 1) in terms of T (n − 2), i.e.
T (n − 1) = T (n − 2) + (n − 1)
and similarly
T (n − 2) = T (n − 3) + (n − 2)
People often get confused by this, because they do not see that n is a “dummy” variable.
We can replace every occurrence of n by n − 1, or indeed anything; in effect (3.2) is not
one formula but many in one:
T(2) = T(1) + 2,
T(3) = T(2) + 3,
T(4) = T(3) + 4, . . .
However, please be careful that the replacement must be carried out faithfully. For
example, replacing n by n/2 in the formula T(n) = T(n/2) + (1/3)n^2 gives
T(n/2) = T(n/4) + (1/3)(n/2)^2
Another common problem is making mistakes with indices, brackets, etc.
Please make sure you are aware of the following (and apply them correctly):
The process of iterative expansion can also be represented by a recursion tree. The
following is the recursion tree for the recurrence T (n) = 2T (n/2) + n:
The leaves are the base cases and the extra summation terms are on the right. The solution
of the recurrence is the sum of all rightmost and bottommost terms. This may be helpful for
some kinds of recurrences.
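For this particular recurrence, the recursion tree has log₂ n levels and each level contributes a total of n, giving n log₂ n overall. A tiny Python check, assuming the base case T(1) = 0, confirms this for powers of 2:

```python
def T(n):
    """Directly evaluate T(n) = 2T(n/2) + n with assumed base case T(1) = 0,
    for n a power of 2."""
    return 0 if n == 1 else 2 * T(n // 2) + n

# Each of the log2(n) levels of the recursion tree contributes n,
# so the total is n * log2(n):
for k in range(1, 11):
    n = 2 ** k
    assert T(n) == n * k
```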
Many recurrences arising from divide and conquer algorithms have the form T(n) = aT(n/b) + O(n^d),
where a, b, d are constants. Because the process of solving recurrences is so tedious and error-prone,
someone decided to solve it once and for all and developed a general formula. This is
what is called the master theorem. It says:⁴
Theorem 3.1 (Master Theorem) If T(n) = aT(n/b) + O(n^d), where a, b, d are constants,
then
T(n) = O(n^(log_b a))  if d < log_b a
T(n) = O(n^d log n)    if d = log_b a
T(n) = O(n^d)          if d > log_b a
We will not give a proof here, but basically it is just a big iterative substitution with different
limiting cases.
Hint: An easy way to memorise the formula: the answer is simply n to the power
of the larger of the two exponents, d or log_b a; when the two are equal, we
throw in an extra log n factor.
Worked example. Consider the binary search recurrence T(n) = T(n/2) + 1. We have
a = 1, b = 2, d = 0 (note that n^0 = 1, hence d = 0), and log_b a = log_2 1 = 0, which is equal to d.
Hence the second case applies and the answer is O(n^0 log n) = O(log n).
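The case analysis can be mechanised in a short Python helper (a sketch; the returned strings are our own formatting):

```python
import math

def master(a, b, d):
    """Return the big-O bound given by the (weak) Master Theorem for
    T(n) = a*T(n/b) + O(n^d)."""
    crit = math.log(a, b)               # the critical exponent log_b(a)
    if math.isclose(d, crit):
        return f"O(n^{d} log n)"        # equal exponents: extra log factor
    elif d < crit:
        return f"O(n^log_{b}({a}))"     # recursion dominates
    else:
        return f"O(n^{d})"              # the O(n^d) work dominates
```

For binary search, master(1, 2, 0) gives the O(n^0 log n) = O(log n) bound derived above.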
Still, you need to know the more elementary ways of solving recurrences, because:
• Not all recurrences can be solved by the master theorem. For example, T (n) = T (n−1)+1
cannot be solved by it because T (n − 1) is not of the form T (n/b), i.e. n divided by a
constant.
• It only gives big-O answers, so if you require an exact answer you will need other methods.
• To make sure you know how to solve recurrences “manually” we will sometimes explicitly
ban you from using it.
Definition 3.2 The floor of a real number x, denoted by ⌊x⌋, is the largest integer smaller
than or equal to x. The ceiling of a real number x, denoted by ⌈x⌉, is the smallest integer
larger than or equal to x.
In other words, floor is rounding down and ceiling is rounding up; thus for example ⌊3.7⌋ = 3,
⌈5.2⌉ = 6, and ⌈7⌉ = ⌊7⌋ = 7.
⁴ Actually what we present here is a weak form of the theorem; the full one can be found in [CLRS].
So, for example, if an algorithm divides the problem into two halves and solves them
recursively, it may have a recurrence like T(n) = T(⌈n/2⌉) + T(⌊n/2⌋) + O(n) instead of just
T(n) = 2T(n/2) + O(n). However, we will almost always omit these complications; we may
state that we assume n is a power of 2. This makes n divisible by 2 at every step of the
algorithm. Without this, applying iterative substitution methods would be impossible. In almost
all cases, it can be shown that the assumption does not affect the result. Another way of looking at
this is to consider the assumption as increasing the input size by at most a constant factor.
For example, an array with 17 elements can be rounded up to 32 elements (by inserting some
sentinel values at the end) to become a power of 2, and 32 is less than double of 17.
Exercises
# 3-1 (Writing down recurrences)
For each of the following, write down a recurrence expressing the time complexity of the
algorithm described.
(a) The algorithm performs n steps, then recursively solves a subproblem of size n/3.
(b) The algorithm recursively solves four subproblems each of size n/2, then performs
some O(n) time computation afterwards.
(c) The algorithm recursively solves two subproblems of size n/3 and 2n/3 respectively,
then performs some constant time computation afterwards.
(d) The algorithm performs one step, then recursively solves a subproblem of size √n.
# 3-2 (Solving recurrences)
Solve the following recurrences using repeated substitution, giving the answer in big-O
notation. Verify your solution using the Master Theorem.
Related reading:
[KT] Chapters 5.1, 5.5
[DPV] Chapters 2.1, 2.3, 2.4
[CLRS] Chapters 4.1, 7, 9
[SSS] Chapters 4.5, 4.6, 4.10
In this chapter we introduce the first of our major algorithm design techniques, divide and
conquer, and illustrate it with a number of practical problems, including sorting and integer
multiplication.
The general principle of divide and conquer is very simple. If we don’t know how to solve a
big problem, break it down into smaller subproblems which hopefully are easier to solve:
But the distinguishing feature of this paradigm when applied to the design of algorithms is
that, we partition the problem in such a way that the subproblems are just smaller versions of
the same problem. Hence we simply apply recursion to the subproblems. We therefore do not
need to care how the smaller subproblems are solved. What actually happens is that they will
be reduced to even smaller subproblems, and so on. Of course, we cannot divide into smaller
and smaller subproblems forever; when they become small enough, we reach the base cases
where they can be solved “trivially.”
A moment of thought would reveal that in fact 2n − 3 comparisons are enough: after finding
the maximum, there are only n − 1 numbers left from which to find the minimum. But this
improvement is tiny. Can we still do better? Let’s try applying the divide and conquer
paradigm. First, divide the n numbers into two equal-sized halves, in some arbitrary way. Then
we recursively find the maximum and minimum within each of the two halves. The solutions of
these subproblems give us the solution of the original problem: the minimum of the n numbers
must be the minimum in the half that it went to, and so we only need to compare the two
minima to find the global minimum. The same goes for the maxima. We therefore have the
algorithm:
(Here we assume for convenience that the function returns a pair of numbers.)
Let’s analyse the number of comparisons, T (n), made by this recursive algorithm. Assume n
is a power of 2. We have T (n) = 2T (n/2) + 2 since there are two recursive calls, each involving
n/2 numbers. Afterwards we make two extra comparisons (lines 13 and 18), hence the +2.
There are two base cases, T(1) = 0, T(2) = 1, corresponding to the cases in lines 1 and 3.
T(n) = 2T(n/2) + 2
     = 2(2T(n/4) + 2) + 2
     = 2^2 T(n/2^2) + 2^2 + 2
     = . . .
     = 2^k T(n/2^k) + (2^k + 2^(k−1) + . . . + 2)
     = (n/2) T(2) + 2(n/2 − 1)    (taking k = log n − 1, so that n/2^k = 2)
     = 3n/2 − 2
And thus we reduced the number of comparisons from approximately 2n to about 1.5n.
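A Python sketch of the algorithm that also counts the comparisons (assuming, as in the analysis, that the length of the input is a power of 2):

```python
def min_max(A):
    """Return (minimum, maximum, number of comparisons) for a list A
    whose length is a power of 2."""
    n = len(A)
    if n == 1:
        return A[0], A[0], 0
    if n == 2:
        return (A[0], A[1], 1) if A[0] <= A[1] else (A[1], A[0], 1)
    half = n // 2
    min1, max1, c1 = min_max(A[:half])   # solve the first half recursively
    min2, max2, c2 = min_max(A[half:])   # solve the second half recursively
    # two extra comparisons combine the sub-solutions:
    return min(min1, min2), max(max1, max2), c1 + c2 + 2
```

For 8 elements this makes 3·8/2 − 2 = 10 comparisons, matching the solved recurrence.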
Now let’s consider the actual merging. Imagine you have two piles of exam scripts, both
sorted in student numbers, and you want to merge them into one big sorted pile. The first
one should clearly come from either the top of the first pile or the top of the second pile. So
you move the smaller one to the output pile, revealing a new exam script in the pile you just
removed from. Then you simply repeat the procedure. So this is the merge algorithm:
² Why do we need two base cases? Part of the reason is that if n is odd then it cannot be divided evenly.
But see Exercise 4-5 for another reason.
Clearly, everything inside the while loop can be done in O(1) time (per iteration). How
many times is the while loop executed? Observe that each time the loop is executed, one of i or
j is increased by exactly one. They start at 1, and one of them exceeds n when the loop finishes.
Hence the loop must have been executed at most O(n) times. (The actual number varies from
about n to about 2n, depending on how i and j advance; for example i may reach the end first
with j hardly moved, or they may increment in an “interleaved” way.) The copying after the
while loop also takes O(n) time. Hence the overall time complexity is O(n).
The time complexity of merge sort is therefore given by the recurrence T (n) = 2T (n/2) +
O(n), where the two T (n/2) are the time taken by the two recursive calls and the O(n) is for the
merge. It therefore follows (from the Master Theorem) that T (n) = O(n log n). It is therefore
more efficient than the O(n^2) time sorting algorithms from Chapter 2.
Later we will see that this is in fact the best possible time complexity for sorting. However,
merge sort is not really that good in practice. One reason is that it involves copying elements
to a different array; in other words it is not in-place.
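Merge and merge sort can be sketched in Python as follows (an illustration; it returns a new list rather than sorting in place, reflecting the remark that merge sort is not in-place):

```python
def merge(L, R):
    """Merge two sorted lists into one sorted list in O(n) time."""
    out, i, j = [], 0, 0
    while i < len(L) and j < len(R):
        if L[i] <= R[j]:
            out.append(L[i]); i += 1
        else:
            out.append(R[j]); j += 1
    out.extend(L[i:])   # copy whatever remains of either list
    out.extend(R[j:])
    return out

def merge_sort(A):
    """T(n) = 2T(n/2) + O(n), which solves to O(n log n)."""
    if len(A) <= 1:
        return A
    mid = len(A) // 2
    return merge(merge_sort(A[:mid]), merge_sort(A[mid:]))
```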
4.3 Quicksort
Sometimes the same algorithm design technique can be applied in different ways to give different
algorithms for the same problem. Here we devise a different sorting algorithm using divide and
conquer differently.
In merge sort, most of the computations are spent in the merge() step. The divide step is
trivial. Can we find a more efficient way to merge things, perhaps at the expense of dividing
more slowly? Here is an idea: instead of dividing into two halves arbitrarily, we move the “small”
numbers to one end of the array, and the “large” numbers to the other end. Recursively solve
these two subproblems. Once the two subarrays are sorted, they can just be left there, next to
each other, and no merging is needed!
We call this more careful divide step partition. More specifically, we choose a pivot
element, against which we decide whether an element is “large” or “small.” The pivot can be
any element of the array, but (for the convenience of our specific variant of Partition below) we
choose the last element of the array. Then we move all the small elements toward the head of
the array, and the large ones toward the end. The pivot ends up separating the two parts.
There are numerous ways this can be done; the easiest is perhaps to copy each element
to a separate array. However, below we use one particular variant, which is possibly better in practice:
x ← A[r] // pivot
i←p−1
for j = p to r − 1 do
if A[j] ≤ x then
i++
swap(A[i], A[j])
end if
end for
swap(A[i + 1], A[r])
return i + 1
Intuitively, i and j in the pseudocode maintain two regions of the array: when outside the
if block, A[p..i] stores the small numbers, A[i + 1..j − 1] stores the large numbers, and the rest
are unprocessed. Each time, a new number A[j] is considered. If it is large, then it simply
stays there as it is already adjacent to the end of the large region, and we increase the size
of the large region by incrementing j (which we do anyway since we also advance to the next
unprocessed element in the next step). If it is small, we have to move it, but in order to avoid
shifting everything, we swap it with the first number in the large region. See the figure.
After the for loop is finished, we perform one final step to move the pivot so that it sits between
the small and large regions (again without shifting the whole region).
It is easy to see that its running time is O(n).
Now that we have an efficient partition procedure,
the sorting algorithm follows immediately:
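A Python sketch of Partition (last element as pivot, as above) and the resulting Quicksort (an illustration, not the notes' own pseudocode):

```python
def partition(A, p, r):
    """Partition A[p..r] around pivot A[r]; return the pivot's final index."""
    x = A[r]                          # pivot: the last element
    i = p - 1                         # A[p..i] holds the small elements
    for j in range(p, r):
        if A[j] <= x:                 # small element: swap it into the small region
            i += 1
            A[i], A[j] = A[j], A[i]
    A[i + 1], A[r] = A[r], A[i + 1]   # place the pivot between the two regions
    return i + 1

def quicksort(A, p=0, r=None):
    if r is None:
        r = len(A) - 1
    if p < r:
        q = partition(A, p, r)
        quicksort(A, p, q - 1)        # small side
        quicksort(A, q + 1, r)        # large side
    return A
```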
Unlike in merge sort, there is no guarantee that the two subproblems are of the same size. It depends on what the pivot is. If the pivot is “the middle
element,” then indeed the two subproblems have size roughly n/2. If this happens at every level
of recursion, then we have the recurrence
T (n) = 2T (n/2) + O(n)
which gives a solution of T (n) = O(n log n).
But what about an unbalanced partition? The worst situation is when the pivot is the
smallest or the largest element, resulting in one subproblem having n − 1 elements and the
other one empty. Again, if it happens at every level of recursion, then
T (n) = T (n − 1) + O(n)
which gives a solution of T(n) = O(n^2). As usual, we are interested in the worst-case time
complexity, so it turns out that in the worst case, Quicksort is less efficient than merge sort and
is in the same category as selection or insertion sort.
Quicksort (with our version of Partition) also has the peculiar property that its worst-case
performance is triggered when the input is already sorted. This is because the largest number
is chosen as the pivot, and hence the partition is the most unbalanced; the subproblem is again
a sorted array and hence this goes on at every level of recursion. For this and other reasons, in
practice some other variations of Quicksort are used.
But it is called Quicksort; surely it got this name for a reason?
• In merge sort, the two subproblems always have equal size; but in Quicksort, the two
subproblems may not have equal size.
• In merge sort, the divide step is trivial (any split would do) and the combine step is the difficult
part (merging sorted lists); whereas in Quicksort, divide is difficult (finding a good split), but
combine is trivial (just put the parts together).
• Merge sort has optimal worst-case time complexity O(n log n), yet it is not that good in practice;
on the other hand, although Quicksort has a worst-case time complexity of O(n^2),
its average time complexity is O(n log n) and it works better in practice.
2 4 1 7 5
+ 3 4 8 9
-----------
2 7 6 6 4
What about multiplication? If anybody still remembers, this is how we multiply two n-bit
(or n-digit) numbers back in school:
      3 4 1 5
    x 1 2 1 3
-------------
3 4 1 5
  6 8 3 0
    3 4 1 5
    1 0 2 4 5
-------------
4 1 4 2 3 9 5
(The exact presentation may differ from what you learnt in school, but the
basic idea should be the same.) We produce n intermediate rows, one for the multiplication of
one digit of one number by all n digits of the other number. Thus each row takes
O(n) time to produce. The rows then have to be added up: there are up to 2n columns, and n rows, so
each column of addition takes O(n) time and in total it takes O(n^2) time to get the answer.³
A = 2^(n/2) a + b
For example (switching to base 10), the 6-digit number 123456 is related to its two 3-digit
components 123 and 456 by
123456 = 103 (123) + 456
So, if we want to multiply an n-bit number A = 2^(n/2) a + b with another n-bit number
B = 2^(n/2) c + d, we can rewrite the product as
AB = 2^n (ac) + 2^(n/2) (ad + bc) + bd
Each of ac, ad, bc and bd is an n/2-bit by n/2-bit multiplication. Once we have the answers
to these subproblems (by recursion), the final answer can be computed from the above formula
using some extra additions and multiplications by 2^x for some x. Note that multiplication
by 2^x (and the calculation of 2^x itself) is not really a multiplication; it merely appends x zeros
to the number (just like in base 10, where 10^x · y is simply y with x zeros appended), which can
be done with bit shifts. These all take only O(n) time. Thus the time complexity of
this divide and conquer algorithm is given by the recurrence
T(n) = 4T(n/2) + O(n)
Unfortunately, its solution is O(n^2), so after going through all this trouble, it is not faster than
the basic algorithm!
³ So, if you have always thought at school that multiplication is more complicated and time-consuming than
addition, you were right even though you had no concept of asymptotic complexity back then...
ad + bc = (a + b)(c + d) − ac − bd
(simply expand the bracket on the right hand side to see this). We are going to compute ac
and bd anyway; thus, instead of two more multiplications, we need only one more:
the product of a + b and c + d. This gives us the value of ad + bc using the above formula. Of
course, it involves some extra additions and subtractions, but they take only linear time, and
we are spending linear time on other parts anyway.
Since we now have three subproblems instead of four, the recurrence becomes
T(n) = 3T(n/2) + O(n)
whose solution, by the Master Theorem, is O(n^(log_2 3)) = O(n^1.585...), better than O(n^2).
⁴ The idea can be further extended to give even faster algorithms, by dividing each number into more than
two parts. But this approach is not currently the best one. For a problem as basic as this, it is not known
what the optimal time complexity is; the currently best algorithms are all extremely complicated and have time
complexity given by some very complicated functions.
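The three-multiplication idea can be sketched in Python on machine integers, using bit shifts for multiplication by 2^h (the base-case threshold of 16 is an arbitrary choice):

```python
def karatsuba(A, B):
    """Multiply non-negative integers with three half-size products."""
    if A < 16 or B < 16:
        return A * B                      # base case: small numbers
    n = max(A.bit_length(), B.bit_length())
    h = n // 2
    a, b = A >> h, A & ((1 << h) - 1)     # A = 2^h * a + b
    c, d = B >> h, B & ((1 << h) - 1)     # B = 2^h * c + d
    ac = karatsuba(a, c)
    bd = karatsuba(b, d)
    ad_bc = karatsuba(a + b, c + d) - ac - bd   # equals ad + bc
    return (ac << (2 * h)) + (ad_bc << h) + bd  # 2^(2h)ac + 2^h(ad+bc) + bd
```

(Python's built-in * would of course do this for us; the sketch is only to make the recursion structure concrete.)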
Exercises
# 4-1 (Merge sort)
Show the execution of the merge sort algorithm when the input is [11, 5, 19, 17, 2, 7, 3,
13].
# 4-2 (Partition)
Show the execution of the Partition algorithm for each of the following input arrays.
Assume the last element of the array is chosen as the pivot.
T(n) = . . .
     = 2^k T(n/2^k) + 2(2^k − 1)
     = nT(1) + 2(n − 1)    when k = log n
     = n(0) + 2(n − 1)
     = 2n − 2
Note that the buying date must be before the selling date, and so the problem is not
trivial; you cannot simply search for the minimum and maximum values, which would not
work if for example A = [3, 6, 4, 3, 2, 5].
If there is no such way of making profit (for example, the price decreases throughout all
n days) then no buy/sell activity should take place, and the profit is 0.
(a) A straightforward approach is to try all possible buy and sell dates combinations,
compute the profits, and find the maximum one. What is the time complexity of
this approach?
A divide and conquer approach can be used to solve the problem. Call the days between
buying and selling (including the actual buy and sell dates) the holding period. For
example if we buy at day 2 and sell at day 5 then the holding period covers days 2 to 5.
Observe that one of the following must be true for the maximum-profit strategy: either
the holding period is entirely within days 1 to n/2; or the holding period is entirely within
days (n/2 + 1) to n; or the holding period crosses the boundary between days n/2 and
(n/2 + 1).
(a) A direct approach is to try all possible subarrays, compute their sums and take the
maximum. Show how to do this in O(n^2) time. (Hint: look at Exercise 1-4 again.)
Consider using divide and conquer and divide A into two subarrays A1 = A[1..n/2] and
A2 = A[n/2 + 1..n]. Note that the maximum-sum subarray of A must either (i) lie entirely
within A1 , or (ii) lie entirely within A2 , or (iii) cross the boundary between A1 and A2 .
(b) Give an O(n) time algorithm to find the maximum-sum subarray crossing the bound-
ary. (Hint: “grow” on each side of the boundary.)
(c) Hence give an O(n log n) time algorithm for the problem. Assume n is a power of 2.
(Actually, this question is “equivalent” to the stock market question; see [CLRS].)
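If you want to check your answers to (b) and (c), here is one possible Python sketch (my own solution outline, so try the exercise first; the recursion follows cases (i)–(iii) above and does not actually need n to be a power of 2):

```python
def max_crossing(A, lo, mid, hi):
    """Best sum of a subarray crossing the boundary between A[mid]
    and A[mid+1]: "grow" left and right from the boundary."""
    best_left, s = float('-inf'), 0
    for i in range(mid, lo - 1, -1):       # grow leftwards
        s += A[i]
        best_left = max(best_left, s)
    best_right, s = float('-inf'), 0
    for i in range(mid + 1, hi + 1):       # grow rightwards
        s += A[i]
        best_right = max(best_right, s)
    return best_left + best_right

def max_subarray(A, lo=0, hi=None):
    if hi is None:
        hi = len(A) - 1
    if lo == hi:
        return A[lo]
    mid = (lo + hi) // 2
    return max(max_subarray(A, lo, mid),        # (i) entirely in A1
               max_subarray(A, mid + 1, hi),    # (ii) entirely in A2
               max_crossing(A, lo, mid, hi))    # (iii) crossing the boundary

print(max_subarray([1, -2, 3, 4, -1]))   # 7
```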
Chapter 5
Lower Bounds
Related reading:
[CLRS] Chapter 8.1
Given: 8 coins, one of which is counterfeit and is lighter; a pan balance (that shows only
which side is heavier)
Goal: to find the counterfeit coin using as few weighings as possible
We described some algorithms that can solve this problem in worst-case 4 weighings or 3
weighings, respectively. These are upper bounds, and we always want to improve (decrease)
the upper bound (find better algorithms). Can we do better? What about 2 weighings? 1
weighing?
In fact, 2 weighings are sufficient, and here is the algorithm, represented as a tree:
Remember, we never end our pursuit of better algorithms until we find the optimal one.
But how do we know we cannot do even better? This is about proving lower bounds.
• The number of leaves: this corresponds to the total number of different outcomes that we
want to distinguish.
• The branching factor, the number of branches coming out of a node: this corresponds to
the number of different possible actions after one “step.”
Each particular path from the root to a leaf of a given decision tree corresponds to one
particular sequence of actions taken by that algorithm, and the number of steps taken (i.e.
time) is equal to the length (number of edges) of this path. In Figure 5.1, every possible path
has the same length of 2; but in general different paths may have different lengths. The worst-
case number of steps taken by the algorithm, therefore, corresponds to the height of the tree, i.e.,
the longest of all paths, or equivalently how “deep” the deepest node is. As we want algorithms
with good (small) worst-case number of steps, we want algorithms that have a small height. In
other words we prefer fat and shallow trees rather than slim and tall trees. Finding a lower
bound on the worst-case number of steps taken by an algorithm is equivalent to finding a lower
bound on the height of any tree that satisfies the two fixed properties (number of leaves and
branching factor).
It is in fact easy to establish a relationship between those two fixed quantities and the height
of a tree. Consider a tree with branching factor b, height h and n leaves. At the first level (under
the root) there can be at most b nodes. Each of these can have up to b children of their own, so
at the second level there are up to b^2 nodes. In general, level k has at most b^k nodes. To have
n leaves we therefore need b^h ≥ n, which means

h ≥ ⌈log_b n⌉
where the ceiling is because the height must be an integer. This formula allows us to easily
work out a lower bound simply by finding out what the total number of different outcomes (n)
and the branching factor (b) are.
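Since n and b are integers, the bound can be computed without floating-point logarithms by simply growing b^h until it reaches n (a small Python helper of my own; the function name is an assumption):

```python
def height_lower_bound(n, b):
    """Smallest integer h with b**h >= n, i.e. ceil(log_b n).
    Integer arithmetic avoids floating-point log rounding issues."""
    h, reach = 0, 1
    while reach < n:          # a tree of height h has at most b**h leaves
        reach *= b
        h += 1
    return h

# 8 outcomes, 3-way branching (as in the coin problem): at least 2 steps
assert height_lower_bound(8, 3) == 2
```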
In our coin weighing problem, we have n = 8 and b = 3, so the lower bound is ⌈log_3 8⌉ = 2,
so two weighings are necessary and our algorithm is therefore optimal.
Note: you are not expected to know the exact value of logs like log_3 8 ≈ 1.8927, but
you should know its integer range (i.e. between 1 and 2 in this case, because 3^1 = 3 and
3^2 = 9).
• Whenever you are asked to prove a lower bound, do not draw me a tree like the
one in Figure 5.1. I cannot emphasise this enough. A tree shows one specific
algorithm, but a lower bound proof is an argument about all possible algorithms,
i.e., all possible trees. Showing one tree with a height of 2 says nothing about the
height of other trees.
• Having a lower bound of 2 does not mean there exists an algorithm that takes 2
steps. All it proves is that 1 step is not sufficient, and 2 steps might be sufficient.
• Decision trees are not the only method to prove a lower bound. Sometimes the
bound it manages to prove is rather weak, i.e., not the best possible, and other
techniques may prove stronger bounds.
The number of weighings T(n) made by this algorithm can be analysed by solving the
recurrence T(n) = T(n/3) + 1, T(1) = 0:

T(n) = T(n/3) + 1
     = T(n/9) + 1 + 1
     = ...
     = T(n/3^k) + k
     = T(1) + log_3 n      when k = log_3 n
     = log_3 n
Question: the lower bound is ⌈log_3 n⌉ and the upper bound is log_3 n, which means if log_3 n
is not an integer, the lower bound is higher! But this is impossible. What happened?
restricted situation is also a valid lower bound for the more general scenario, since an algorithm
for the more general scenario must also apply to the special case, so the special case cannot be
more difficult. The reason we do this is that it simplifies the counting in the next step.
We need to work out the number of different outcomes (leaves). This is not just the number
of elements n itself; rather, it is the number of ways that n elements can be permuted, since this
is what a correct sorting algorithm needs to be able to distinguish. For example, for 3 elements
there are 6 outcomes (a > b > c, a > c > b, b > a > c, b > c > a, c > a > b, c > b > a). This
figure shows one possible algorithm to distinguish them:
n! = 1 × 2 × . . . × n ≤ n × n × . . . × n = n^n

Hence log n! ≤ log n^n = n log n. But this is not the direction we want: we want to show the
number of steps is at least some constant times n log n. Instead, consider the following:
n! = 1 × 2 × . . . × (n/2) × . . . × n
   ≥ 1 × 1 × . . . × 1 × (n/2) × . . . × (n/2)
   = (n/2)^(n/2)
We get the second line by replacing each term in the first half of the chain (1, 2, . . . , n/2) with
1, and each term in the second half (n/2 + 1, . . . , n) with n/2. Clearly this can only make the
product smaller. Yes, we are giving up a lot, but it turns out we are still left with enough: n/2
copies of n/2. Therefore log n! ≥ log(n/2)n/2 = (n/2) log(n/2) = (n/2)(log n − 1) = Ω(n log n).
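A quick numerical sanity check of the two bounds (a throwaway Python snippet): log₂ n! should sit between (n/2) log₂(n/2) and n log₂ n.

```python
import math

for n in (4, 8, 16, 64, 256):
    log_fact = math.log2(math.factorial(n))
    # lower bound (n/2) log(n/2) and upper bound n log n derived above
    assert (n / 2) * math.log2(n / 2) <= log_fact <= n * math.log2(n)
```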
(Footnote: There are more “direct” ways to lower-bound n!, for example using something called Stirling’s approximation.
You are not required to know that, or to come up with this “creative” way of lower-bounding the value of n!.)
Exercises
# 5-1 (Coin weighing decision tree)
Draw the decision trees that correspond to the coin weighing algorithms A1 and A2 in
Chapter 1.
# 5-2 (Coin weighing variants)
In each of the following variations to the coin weighing problem, give optimal upper and
lower bounds:
(a) List all the possible outcomes. (Some possible outcomes are a < b < c, b = c < a, a =
b = c, etc.)
(b) What is the lower bound on the number of comparisons needed?
(c) Give an optimal algorithm for solving this problem. Represent your algorithm using
a decision tree.
∗ 5-4 (Heavier/lighter variant)
Suppose there are 12 coins, exactly one of which has slightly different weight than the rest,
but we do not know whether it is heavier or lighter than the others. We want to identify
this coin, and to determine whether it is heavier or lighter than other coins. Again we are
only using a pan balance.
(a) Show that 3 weighings are always necessary in the worst case, no matter what algo-
rithm is used.
(b) Show that 3 weighings are not enough in the worst case if there are 13 coins, no
matter what algorithm is used. (Note that the decision tree argument by itself is not
enough to prove it!)
∗∗ (c) Going back to 12 coins, is it actually possible to have an algorithm using at most 3
weighings in the worst case?
Chapter 6
Greedy Algorithms
Related reading:
[KT] Chapter 4.1
[CLRS] Chapters 16.1-16.2
In this chapter we introduce the second of our major algorithm design techniques: greedy
algorithms. In Chapter 8 we will see further problems that can be solved with the greedy
principle.
For such a small example, it is easy to see – by visual inspection – that the best solution
is lecture + lab class + pub, for a total of 3 activities. But can we have a systematic way (in
other words, design an algorithm) to solve the problem?
Before we describe possible algorithms, let’s give a slightly more abstract way of describing
the problem. Each activity can be considered as an interval. An interval i has a starting time
s(i) and a finishing time f (i). Two intervals i and j are in conflict if s(i) < s(j) < f (i) or
s(j) < s(i) < f (j). Given a set of intervals, our objective is to identify the largest subset of
intervals such that no two of them are in conflict.
Very often, an abstract formalisation of the problem is useful. Problems arising from different
contexts may turn out the same or very similar. Of course, in this particular case, it is pretty
much just a change of wording from “activity” to “interval,” but there are more complicated
transformations (or reductions), and sometimes completely different looking problems turn out
to be equivalent.
Now let’s consider some natural attempts to get an algorithm:
This seems a natural choice: basically we just take on whatever arrives if we are free.
But it is not a correct algorithm. On the previous example, it chooses lecture and then football,
which is not optimal. In fact, its solution can be very bad: consider this input
The algorithm would choose the single long interval, but the optimal solution chooses all
the n − 1 short intervals.
This also sounds reasonable: a short interval seems less likely to be in conflict with other
intervals. But it is easy to come up with examples to show that it is not optimal; for example
two long, adjacent (non-overlapping) intervals and a short interval overlapping both of them.
This sounds very similar to Algorithm 6.1 (which chooses the earliest starting time). The
idea here is to leave as much time as possible for other intervals, which again might be reasonable.
It can be checked that it produces the correct solution for all the previous examples. Of
course, returning the correct solution for a few examples does not mean the algorithm is always
correct. Even if you spent a long time and failed to find an example that makes it fail, it does
not mean there isn’t one. To prove that an algorithm is correct is therefore often a non-trivial
task. But to prove it is not correct, you merely need to demonstrate one counterexample.
It turns out that Algorithm 6.3 is indeed correct. We will revisit Algorithm 6.3 in a moment,
but first let’s take a look at this “class” of algorithms we have just proposed.
• They usually have a fast running time (e.g. O(n log n)).
• They often do not produce optimal solutions, because of their “short-sightedness” (imagine
playing chess by only maximising the “advantage” you gain at every move, without looking
further ahead).
• They sometimes give “good” solutions in the sense that they are not too far away from the
optimal solution, e.g. by a constant factor, but sometimes the solutions can be terribly
bad.
We have seen examples where seemingly reasonable greedy algorithms are not correct, and
some others are in fact correct. It is therefore not always obvious whether a given greedy algorithm
is correct, and we need proofs of correctness. (Exercise 6-3 gives another example
of such a problem.)
There actually are some subtleties in the above pseudocode. The variable k indexes the last
interval added to A. This allows the algorithm to avoid naively checking the new interval with
every other interval in A for overlapping. Each time a new interval S[i] is being considered,
it checks if s(S[i]) ≥ f (S[k]), which if true would mean S[i] does not overlap S[k] (and also
does not overlap any other interval in A – since they are all “in front of” S[k]) and so S[i] can
be chosen. Otherwise, if s(S[i]) < f (S[k]), then S[i] must overlap S[k]. (It cannot be “wholly
before” S[k] and not overlap with it, because that means f (S[i]) < f (S[k]) and S[i] would have
been considered before S[k] in the sorted order.)
Next we analyse the running time of the algorithm. Sorting the intervals by finishing time
takes O(n log n) time. The for loop is executed O(n) times, each doing only a constant amount
of work (ignoring the issue of data structure for A). Therefore, in total it takes O(n log n) time.
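As a concrete sketch, here is Algorithm 6.3 (earliest finishing first) in Python, with intervals as (start, finish) pairs (the example intervals below are made up, not the lecture/lab/pub example):

```python
def earliest_finishing_first(intervals):
    """Greedy interval selection: sort by finishing time, then keep
    any interval whose start is no earlier than the last chosen finish."""
    S = sorted(intervals, key=lambda iv: iv[1])   # sort by f(i): O(n log n)
    A = []
    for s, f in S:
        # compare only with the last chosen interval, as in the pseudocode
        if not A or s >= A[-1][1]:
            A.append((s, f))
    return A

# hypothetical input: three non-conflicting intervals can be chosen
print(earliest_finishing_first([(9, 11), (10, 12), (11, 13), (12, 14), (13, 15)]))
# [(9, 11), (11, 13), (13, 15)]
```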
Correctness. Finally we give an informal argument on the correctness of the algorithm, i.e.,
why it does return the maximum number of intervals. Let I1 be the interval with the earliest
finishing time. We split the n intervals into two groups: S1 , those that start before I1 finishes;
and S2 , those that start after I1 finishes. See the figure below.
Observe that any feasible solution (in particular, the optimal solution S*) can choose at
most one interval from S1. And choosing I1 is “safe,” because it is not in conflict with any
interval in S2. In other words, if S* chose some other interval in S1, it can be replaced with
I1 with no harm. Similarly, if S* does not include anything from S1, it can add I1 with no
harm. S* is then left with a smaller set of intervals, S2. The greedy algorithm also picks I1, and
since everything in S1 overlaps with I1, it will also next choose intervals from S2. Thus both
S* and the greedy algorithm pick I1 and then find a solution from S2. We can then repeat the
argument on S2 to show that it is correct for the greedy algorithm to pick the earliest-finishing
interval in S2, and so on.
Using the same example as before, with W = 6kg and the 3 items sorted in decreasing order
of density:
The algorithm takes all of (3kg, $100), all of (2kg, $60), and 1kg from (5kg, $120). The
total value it gets is $(100 + 60 + 120/5) = $184.
We will not give a full proof of correctness here. In the greedy algorithm, the vector of
quantities (as a fraction) of items chosen, in descending order of value/weight ratio, is always
(1, 1, . . . , 1, x, 0, 0, . . . , 0) where 0 ≤ x ≤ 1. (For the above example this vector is (1, 1, 1/5).)
So, roughly speaking, if another solution picks less than 1 for some of the high-density items
(and a higher fraction in some other items), we can always exchange for higher density items
to increase the total value while keeping the same total weight.
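The greedy fractional knapsack can be sketched in a few lines of Python; it reproduces the $184 of the worked example above:

```python
def fractional_knapsack(items, W):
    """items: list of (weight, value) pairs. Take items in decreasing
    value/weight (density) order, splitting the last one if needed."""
    total = 0.0
    remaining = W
    for w, v in sorted(items, key=lambda it: it[1] / it[0], reverse=True):
        take = min(w, remaining)       # the whole item, or whatever fits
        total += v * take / w          # proportional value for a fraction
        remaining -= take
        if remaining == 0:
            break
    return total

# the example above: W = 6kg, items (3kg, $100), (2kg, $60), (5kg, $120)
print(fractional_knapsack([(3, 100), (2, 60), (5, 120)], 6))   # 184.0
```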
Exercises
# 6-1 (Knapsack)
Consider a simpler version of the knapsack problem where the value of an object is simply
its weight, i.e. vi = wi for all i. You cannot take part of an object. For each of the following
greedy algorithms, either give a simple explanation on why it returns an optimal solution,
or give a counterexample to show it doesn’t. Make the counterexamples as bad as you
can.
Chapter 7

Elementary Graph Algorithms

Related reading:
[KT] Chapter 3
[CLRS] Chapters 22.1-22.3, 22.5
[DPV] Chapters 3, 4.1-4.2
[SSS] Chapter 5
Graphs are very important mathematical objects in computer science. Their usefulness comes
from their ability to model many situations that involve “relations” between “entities,” such as road
networks, computer networks, the hyperlink relation between web pages, social acquaintances, and
many more less obvious ones.
Most of the remaining part of this module is related to graphs. In this chapter we recap on
basic graph terminologies and introduce a few elementary algorithms.
Two simple graph properties. We state two simple properties that will be useful later.
The minimum n − 1 is attained when the graph is just a line (path); each new vertex added
to it requires an edge. The maximum occurs when it is a complete graph, i.e., there is an edge
between any two vertices, so the number of edges is n(n − 1)/2 (“n choose 2”). (Another way to prove this is: every
vertex has n − 1 incident edges and there are n vertices, but each edge is counted twice.)
This simple formula also states an important fact: the number of edges is always between
linear (Θ(n)) and quadratic (Θ(n^2)) in the number of vertices. This allows us to classify graphs
CHAPTER 7. ELEMENTARY GRAPH ALGORITHMS 53
into sparse graphs, where there are few edges (say m = Θ(n)), and dense graphs, where there
are many edges (say m = Θ(n^2)).
Proposition 7.2 The sum of degrees of all vertices equals twice the number of edges.
The proof of this is extremely simple. Each edge e = (vi , vj ) contributes 1 to the degree of
vi and 1 to the degree of vj . So, as we add up the degrees of all vertices: deg(v1 ) + deg(v2 ) +
. . . + deg(vn ), each edge contributes twice to this sum.
Figure 7.1: Left: an example graph. Middle: adjacency list. Right: adjacency matrix.
Adjacency list. It consists of n lists, one for each vertex. Each is a linked list of all the
neighbours of that vertex.
It uses O(n + m) space: there are n head nodes, and the total number of non-head nodes
is equal to the total degree of the vertices, which we know (Proposition 7.2) is 2m. It is
therefore relatively good for sparse graphs: graphs with fewer edges take up less memory.
To list all the neighbours of a node, simply traverse the list, in O(1) time per neighbour.
To check whether two nodes are adjacent, it takes time proportional to the degree of one
of the vertices, by traversing its list and looking for the other vertex.
Adjacency matrix. It is a 2-dimensional n by n array, where the (i, j)-th entry is 1 if there
is an edge from vertex i to vertex j, and 0 otherwise.
Clearly, it takes O(n^2) space, irrespective of how many edges there are; it is therefore
(comparatively) good for dense graphs.
Listing all the neighbours of a vertex takes O(n) time, by scanning the corresponding row
of the matrix and identifying the entries with 1s.
(Footnote: This is not a precise definition...)
Checking whether two given nodes are adjacent can be done in O(1) time, simply by
checking whether that matrix entry is 1 or 0.
Both representations can be naturally generalised to directed graphs and weighted graphs
(for example putting the edge weight instead of 0/1 in the adjacency matrix).
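Both representations can be sketched in a few lines of Python (the example graph is made up, with 0-indexed vertices):

```python
n = 4
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]          # a small example graph

# adjacency list: O(n + m) space
adj_list = [[] for _ in range(n)]
for u, v in edges:
    adj_list[u].append(v)
    adj_list[v].append(u)                          # undirected: both directions

# adjacency matrix: O(n^2) space
adj_matrix = [[0] * n for _ in range(n)]
for u, v in edges:
    adj_matrix[u][v] = adj_matrix[v][u] = 1

print(adj_list[2])        # neighbours of 2, O(1) per neighbour: [0, 1, 3]
print(adj_matrix[1][3])   # adjacency test in O(1): 0 (no edge 1-3)
```

Note that the sum of the list lengths is exactly 2m, as Proposition 7.2 says.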
7.3.1 BFS
Breadth-first search, as the name suggests, prefers breadth over depth. It finishes visiting
vertices close to the starting vertex before visiting those further away. More precisely, vertices
are visited in order of their distance (number of edges) from the starting vertex: first those at
one hop, then two hops, and so on. Imagine a wavefront spreading out from the starting vertex
(see Figure 7.2).
Figure 7.2: Left: a graph with an “extending wavefront.” Right: BFS tree.
Each vertex will then be assigned a level (the Li ’s) which is its distance from s. Note that,
for example, node 7 is at level 2, because it was discovered via 1-3-7, and not the longer path
1-2-6-7.
The order and the way the vertices are discovered encodes a parent-child relationship among
the vertices: if a vertex v is first discovered while visiting u (as u’s neighbour), we can view v
as the “child” of u. A BFS tree is a tree that represents this relationship. It also shows the
levels of nodes of a BFS traversal. See Figure 7.2.
We can also describe BFS in more detail, specifying the data structure used. It turns
out that BFS can be easily implemented using a queue. Recall that a queue is a FIFO data
structure (Chapter 2). Each time a new vertex is visited, its neighbours are added to the end of
the queue (if not already inserted). This way, we ensure that the order in which we visit the vertices is
consistent with what BFS requires. Note that no particular order is specified when inserting
the neighbours of one vertex into the queue. We also need an additional array to keep track of
whether a vertex was discovered already.
Running it on the graph in Figure 7.2, the contents of the queue change as follows:
What is the running time of BFS? The while loop executes n times (each node is enqueued
exactly once and dequeued exactly once). For each iteration of the while loop, the inner for
loop seems like it will execute at most m times, one per neighbour. Thus it gives a bound of
O(nm). It is correct but we can get a tighter analysis. Consider the total number of for loop
iterations over the entire course of the algorithm (not just one iteration of the while loop). Each
vertex is enqueued and dequeued exactly once. When it is dequeued we check all its neighbours,
which (using an adjacency list) can be done in time proportional to the number of neighbours.
Thus the total number of iterations is equal to the sum of degrees of all vertices, which we know
(Proposition 7.2) is 2m. Each enqueue and dequeue operation takes O(1) time, for O(n) in total
over the n vertices. Therefore in total it takes O(n + m) time, which is linear in the size of the graph.
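The queue-based BFS just described can be sketched in Python (using collections.deque as the FIFO queue; the example graph is my own, not the one in Figure 7.2):

```python
from collections import deque

def bfs_levels(adj, s):
    """Return level[v] = distance (in edges) from s, or None if unreachable.
    adj is an adjacency list: adj[u] lists u's neighbours."""
    level = [None] * len(adj)
    level[s] = 0
    queue = deque([s])                 # FIFO queue of discovered vertices
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if level[v] is None:       # discovered for the first time
                level[v] = level[u] + 1
                queue.append(v)
    return level

# small example: edges 0-1, 0-2, 1-3, 2-3, 3-4
adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
print(bfs_levels(adj, 0))   # [0, 1, 1, 2, 3]
```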
7.3.2 DFS
Another way of traversing a graph is called depth-first search. Here the idea is to go as deep
into a graph as possible until a dead end is reached, at which point we backtrack and visit some
other branch. Essentially, you are just excited to meet new neighbours and every time you find
a new one, you go to find their own neighbours and so on, ignoring the old ones (until later).
The following is a recursive way to implement this idea correctly.
Procedure DFS(G, u)
Mark u as explored
for each edge (u, v) do
if v is unexplored then
DFS(G, v)
end if
end for
Figure 7.3: Left: A possible DFS traversal of the same graph. Right: DFS tree.
Those older neighbours that you know are there but haven’t met yet, you keep them “in the
back of your head” to be revisited later. It turns out that a stack is the right data structure for
this task. The algorithm is almost exactly the same as BFS, just replacing the queue with the
stack:
Running it on the graph in Figure 7.3, the contents of the stack change as follows: (here
we deliberately insert vertices in reverse numerical order into the stack so that they pop out in
the “natural” order; as in BFS, no specific order is prescribed by the algorithm)
The running time of DFS is analysed similarly to BFS. Essentially, each execution of the
for loop corresponds to an edge, and hence it is executed only O(m) times over the course of
the algorithm. The total runtime is again O(m + n).
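The recursive pseudocode above translates almost line by line into Python (the example graph and the returned exploration order are my own additions):

```python
def dfs(adj, u, explored=None):
    """Recursive DFS as in the pseudocode above: mark u explored,
    then recurse into each unexplored neighbour."""
    if explored is None:
        explored = [False] * len(adj)
    explored[u] = True
    order = [u]
    for v in adj[u]:
        if not explored[v]:
            order.extend(dfs(adj, v, explored))
    return order                      # vertices in the order first explored

adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
print(dfs(adj, 0))   # [0, 1, 3, 2, 4]
```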
Similar to BFS, we can construct a DFS tree (Figure 7.3).
(Footnote: You might notice that there is another difference: the line that assigns true to the Explored[] array is in a
different place (hence also the different name, Discovered[] vs. Explored[]). This means the same vertex can be in
the stack multiple times. This is necessary to reflect correctly the order in which the vertices are processed. For example,
(in this particular traversal) node 3 is “discovered” very early on as a neighbour of 1, but is “explored” as a child
of 6. There are, actually, further subtleties in this issue, which we would rather not go into here.)
Exercises
# 7-1 (Graph traversal)
The figure below shows a maze where S is the entrance and T is the exit. Show a BFS
traversal of the maze by giving the BFS tree and the contents of the queue after each step.
Repeat using DFS and a stack.
S A B
E F
G H
Most algorithms on graphs take O(n^2) time or more when the input is given as an ad-
jacency matrix, simply because we have to read the entire matrix. However this is not
always true. Given the adjacency matrix representation of a directed graph, we want to
determine whether the graph has a sink, i.e. a vertex with n − 1 incoming edges and no
outgoing edges.
(c) What is the relationship with the celebrity problem? What is an efficient algorithm
for this problem?
Chapter 8

Greedy Algorithms on Graphs
Related reading:
[DPV] Chapters 4.4-4.5
[KT] Chapters 4.4-4.6
[CLRS] Chapters 23, 24.3, 6
[SSS] Chapters 6.1, 6.3
In this chapter we consider two important graph problems that can be solved in a greedy
way: minimum spanning trees and shortest paths.
CHAPTER 8. GREEDY ALGORITHMS ON GRAPHS 60
Kruskal’s algorithm. At a high level, it can be described by this pseudocode (there are many
details to be filled in later):
Consider the example in Figure 8.1. The edges are added to T in this order: (a, g), (c, d),
(b, g), (c, g), (d, e), (a, f ). Edges like (a, b) are considered but not added. Note that during the
algorithm there are multiple disconnected parts of the tree that will eventually be connected.
Correctness. Recall that the correctness of greedy algorithms is not always obvious. We
need to prove that this algorithm does always produce the MST. It clearly produces a spanning
tree (it has n − 1 edges so must span all vertices); the question is whether it is the minimum
one. We can prove that by contradiction. The following is a sketch of the proof.
Suppose there is a graph G where Kruskal’s algorithm produces a tree T_K, the actual
minimum spanning tree is T*, and T_K and T* are not the same. Therefore, there must be an
edge e in T* that is not in T_K. Edge e separates T* into two parts, S1 and S2. Let x and y
be the endpoints of e in S1 and S2, respectively. Since e is not in T_K, T_K must connect x and
y via another path P. In this path P, there is at least one edge e′ connecting S1 and S2. See
Figure 8.2.
(Footnote: See if you can find the details we glossed over...)
This edge e′ must have smaller weight than e, since Kruskal’s algorithm adds edges in increasing order of
weight and chose e′ ahead of e. Hence we can remove e from T* and add e′ to T* to get another
spanning tree with smaller weight; this contradicts the optimality of T*.
Running Time. Sorting the m edges by weight takes O(m log m) time, which is the same
as O(m log n) (recall that m = O(n^2), so log m ≤ 2 log n = O(log n)). Inside the while loop, the time
complexity depends on two important operations:
complexity depends on two important operations:
1. Checking whether two vertices are already connected (i.e., in the same component): This
is performed O(m) times in total, over all executions of the loop, since we test each edge
one by one.
2. Connecting two vertices (components) together when an edge is added: this is performed
O(n) times in total, over all executions of the loop, since the final MST has n − 1 edges.
Note that the above is counting the number of times the steps are executed; we are yet to
find out the time complexity of performing each individual step. So how can we perform each of
these operations efficiently? To answer this question, we take a brief detour to look at something
called disjoint set data structures.
These data structures are also called union-find data structures. Assuming we have such
data structures, we can use them to support Kruskal’s algorithm as follows. The ground set
is the set of vertices. At any stage of the algorithm, each disjoint set represents the vertices
connected by some MST edges. Initially each vertex begins in its own set. To check whether
an edge e = (u, v) should be added, i.e., whether u, v are not in the same component, just
check whether Find(u) = Find(v). If not, we add that edge, which means we need to union the
disjoint sets with Union(u, v).
For example, with Figure 8.1 again, the disjoint sets evolve as follows:
Initially: {a}{b}{c}{d}{e}{f }{g}
Add (a, g): {a, g}{b}{c}{d}{e}{f }
Add (c, d): {a, g}{b}{c, d}{e}{f }
Add (b, g): {a, b, g}, {c, d}{e}{f }
...
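Putting the pieces together, here is a sketch of Kruskal’s algorithm in Python, using a simple tree-based union-find with union by size (one possible implementation; the example graph and weights are made up):

```python
def kruskal(n, edges):
    """edges: list of (weight, u, v) triples. Returns the MST edges.
    Uses a simple union-find with union by size."""
    parent = list(range(n))
    size = [1] * n

    def find(x):                      # walk up to the set representative
        while parent[x] != x:
            x = parent[x]
        return x

    def union(a, b):                  # a, b are representatives
        if size[a] < size[b]:
            a, b = b, a               # hang the smaller set under the larger
        parent[b] = a
        size[a] += size[b]

    mst = []
    for w, u, v in sorted(edges):     # increasing order of weight
        ru, rv = find(u), find(v)
        if ru != rv:                  # different components: no cycle created
            mst.append((w, u, v))
            union(ru, rv)
    return mst

# hypothetical weighted graph on 4 vertices
edges = [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3)]
print(kruskal(4, edges))   # [(1, 0, 1), (2, 1, 2), (4, 2, 3)]
```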
Below we consider some possible union-find data structures and their time complexities.
Attempt 1: Array. In this approach we simply use an array A to keep track of the set
names of the elements: A[i] stores the set name of element i. For example, if the disjoint sets
are {1,2,7} {3,4} {5} {6}, then A could be [1, 1, 3, 3, 5, 6, 1].
With such an array, Find(i) is easy: just return A[i], which can be done in O(1) time. But
Union(i, j) is more difficult: it needs to change all entries that currently have the value of either
A[i] or A[j] to a same value. For example, if we perform Union(1,3) on the above example, A
needs to become [1, 1, 1, 1, 5, 6, 1]. The entire array needs to be scanned, hence it takes O(n)
time.
Attempt 2: Linked list. Put all elements in the same disjoint set in a linked list. The
element at the head of the list is the set representative.
Find(u) requires O(n) time, as we have to go through all lists to find where u appears.
Union(u, v) requires stitching the two lists together, which can be done in O(1) time (if we use
doubly-circular linked lists, for example).
At the right of the table we visualise the linked lists recorded in the second row (single-element
lists are not shown for clarity). Note that it is only a visualisation for your understanding;
it does not need to be separately recorded, as it is already in the array.
Similarly, after Union(3,4):
name 1 2 3 3 5 6 1
next 7 0 4 0 0 0 0 1 → 7, 3 → 4
size 2 1 2 / 1 1 /
After Union(2,7):
name 1 1 3 3 5 6 1
next 2 7 4 0 0 0 0 1 → 2 → 7, 3 → 4
size 3 / 2 / 1 1 /
When we merge two sets of unequal sizes, as in the above, we apply the weighted-union
heuristic: we only spend time proportional to the size of the smaller set. This requires some
trickery: we cannot scan the entire array to change the set names (as in Attempt 1); and when
we join the linked lists we must not spend time traversing the long list to find its end. Hence
we use a somewhat special procedure:
1. Start with the head of the longer list. (We know which is longer since we have the size
field.)

2. Note where the head’s next pointer currently points (i.e., the second element of the longer
list).

3. Change the next pointer to point to the head of the shorter list.

4. Travel along the shorter list (i.e., use the “next” value to identify the next column),
modifying the names (first row) of those elements along the way.

5. When we reach the end of the shorter list, set its next value to where the head was
pointing originally (i.e., the second element in the longer list).
As a more complicated example, consider the next step Union(3,7):
name 1 1 1 1 5 6 1
next 3 7 4 2 0 0 0 1→3→4→2→7
size 5 / / / 1 1 /
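The name/next/size bookkeeping can be coded directly. The sketch below (Python, 1-indexed arrays with index 0 unused, and 0 playing the role of a null pointer) reproduces the sequence of unions in the tables above:

```python
n = 7
name = list(range(n + 1))        # name[i]: representative (head) of i's set
nxt  = [0] * (n + 1)             # nxt[i]: next element in i's list (0 = none)
size = [1] * (n + 1)             # size[h]: size of the set headed by h

def find(u):
    return name[u]               # O(1): just read the array

def union(u, v):
    a, b = find(u), find(v)
    if a == b:
        return
    if size[a] < size[b]:
        a, b = b, a              # make a the head of the longer list
    old = nxt[a]                 # remember where the head points now
    nxt[a] = b                   # splice the shorter list in after the head
    x = b
    while True:                  # rename along the shorter list
        name[x] = a
        if nxt[x] == 0:
            break
        x = nxt[x]
    nxt[x] = old                 # reconnect the rest of the longer list
    size[a] += size[b]

for u, v in [(1, 7), (3, 4), (2, 7), (3, 7)]:
    union(u, v)
print(name[1:])   # [1, 1, 1, 1, 5, 6, 1], as in the last table above
```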
Worked example. The following tables show the execution of the algorithm for the graph
in Figure 8.1, with f as the starting vertex. Initially
v a b c d e g
D[v] 11 ∞ ∞ ∞ 12 13
The smallest entry is 11, which means we connect to vertex a and now consider its outgoing
edges. For example the distance to b is improved, while that to e isn’t. The updated values are
v a b c d e g
D[v] / 5 ∞ ∞ 12 1
We pick the new smallest value, 1, and consider outgoing edges of g, and update the table
again. This repeats until all vertices are processed.
The running time of Prim’s algorithm depends on the data structure we use to support two
key operations used by the algorithm:
• Find the minimum D value (line 6): this is performed O(n) times.
• Change the D values (line 9): this is performed O(m) times. Line 9 in fact only needs
to be considered for neighbours v of u. Thus the for loop takes time proportional to the
degree of u, and the total number of times line 9 is executed is equal to the sum of degrees
of all vertices, which is 2m.
If we simply use an array to store D, finding the minimum takes O(n) time, and updating a
value takes O(1) time. Thus the total time is O(m · 1 + n · n) = O(n^2). But in the following we
will introduce the heap data structure, which can support either of these operations in O(log n)
time. So the total time is O(m log n + n log n) = O(m log n).
Which of these two (array or heap) is better therefore depends on the number of edges: for
sparse graphs where m = O(n), the heap is better, but for dense graphs where m = Θ(n^2), the
array is better.
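A sketch of Prim’s algorithm with a binary heap in Python. It uses the standard library heapq; since heapq offers no decrease-value operation, the common workaround of inserting a duplicate entry and skipping stale ones is used, a small deviation from the pseudocode. The example graph is made up:

```python
import heapq

def prim(adj, s=0):
    """Prim's algorithm with a binary heap. adj[u] = list of (v, weight)
    pairs. Returns the total MST weight (graph assumed connected)."""
    n = len(adj)
    in_tree = [False] * n
    D = [float('inf')] * n           # best known edge weight into each vertex
    D[s] = 0
    heap = [(0, s)]                  # (D value, vertex); may hold stale entries
    total = 0
    while heap:
        d, u = heapq.heappop(heap)   # find-minimum in O(log n)
        if in_tree[u]:
            continue                 # stale entry: u was already added
        in_tree[u] = True
        total += d
        for v, w in adj[u]:
            if not in_tree[v] and w < D[v]:
                D[v] = w             # "decrease value": push a fresh entry
                heapq.heappush(heap, (w, v))
    return total

# hypothetical graph: edges 0-1 (1), 1-2 (2), 0-2 (3), 2-3 (4)
adj = [[(1, 1), (2, 3)], [(0, 1), (2, 2)], [(0, 3), (1, 2), (3, 4)], [(2, 4)]]
print(prim(adj))   # 7
```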
• A binary tree is complete if each level is full except possibly the last, and the nodes at
the last level are filled left to right.
A complete binary tree is balanced, i.e., the depth of any two leaves differ by at most 1.
As a result, the height of an n-node heap is O(log n).
• The heap property requires that the element at a node must not be larger than either of
its children. Note that the order of the two children is distinguishable, but there is no
requirement that the smaller child has to be on the left.
4 If you have heard this term from memory management in programming languages: no, it is a completely different thing.
5 This is for min-heaps; it is possible to have the opposite type, the max-heap, where the parent must not be smaller. Unless otherwise specified, all heaps here are assumed to be min-heaps since this is what we need.
Here is how the heap supports each of the operations in O(log n) time (See Figure 8.3):
1. Decrease value: (For our purposes we only need to handle the reduction of values, but
a heap can also handle increase of values in a similar way.) Updating a value is easy:
just write to that array entry. But once a value is reduced, it may become smaller than
its parent, violating the heap property. If that happens, we swap that element with its
parent. But this “bubble up” may need to happen again at the next level since the element
may still be smaller than its new parent. This keeps going until either it is not smaller
than its parent, or it reaches the root. Each “swap” takes O(1) time and since the height
of the heap is O(log n), the total time is also O(log n).
2. Insertion: We first put the new element at the next available position (i.e., to the right of
the rightmost node at the bottommost level, unless it is full in which case we start a new
level). This new element may violate the heap property with its parent, so we perform a
similar “bubble up” process, again in O(log n) time.
3. Delete minimum: Finding the minimum is easy: it is always at the root. But we cannot
just remove it and leave an empty space. What we do is to move the last (bottommost
level, rightmost) element to the position of the root first; this new root may now violate
the heap property with one or both of its children. We then “sift down” by repeated
swapping until the violation is fixed. Note that if the element is larger than both its
children, it should be swapped with the smaller of the two.
Figure 8.3: Heap operations. Top left: decrease value. Top right: insertion. Bottom: delete
minimum.
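The three operations can be sketched in Python as follows. This is an illustrative array-based min-heap, not a reference implementation; in the 0-indexed form used here, the children of index i sit at 2i+1 and 2i+2, and the parent at (i−1)//2.

```python
class MinHeap:
    """Array-based min-heap sketch illustrating decrease value,
    insertion and delete minimum, each in O(log n) time."""
    def __init__(self):
        self.a = []

    def _bubble_up(self, i):
        # swap upwards while smaller than the parent
        while i > 0 and self.a[i] < self.a[(i - 1) // 2]:
            p = (i - 1) // 2
            self.a[i], self.a[p] = self.a[p], self.a[i]
            i = p

    def insert(self, x):
        self.a.append(x)                   # next available position
        self._bubble_up(len(self.a) - 1)

    def decrease_value(self, i, x):
        assert x <= self.a[i]
        self.a[i] = x                      # write, then restore the heap property
        self._bubble_up(i)

    def delete_min(self):
        m = self.a[0]                      # the minimum is always at the root
        last = self.a.pop()                # move the last element to the root...
        if self.a:
            self.a[0] = last
            i, n = 0, len(self.a)
            while True:                    # ...then sift down
                l, r, smallest = 2 * i + 1, 2 * i + 2, i
                if l < n and self.a[l] < self.a[smallest]:
                    smallest = l
                if r < n and self.a[r] < self.a[smallest]:
                    smallest = r
                if smallest == i:
                    break
                self.a[i], self.a[smallest] = self.a[smallest], self.a[i]
                i = smallest
        return m
```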
Figure 8.4: Top: exponentially many paths. Bottom: greedily selecting shortest outgoing edge
does not work: it chooses the path 1-4-8-5 but the optimal one is 2-4-2-3.
Alternatively, in some purely greedy sense, one might just pick the shortest outgoing edge
at every node, starting at s.6 But this is not a correct algorithm. Figure 8.4 (bottom) shows
an example.
Before we move on, we describe the optimal substructure property satisfied by the short-
est path problem. It means that, if P is a shortest path from s to t, and it goes through an
intermediate vertex x, then the sub-path of P from s to x (call it P1 ) must itself be a shortest
6 Many people mistake this for Dijkstra’s algorithm – it is not!
path from s to x. (And similarly for the sub-path from x to t.) The reason is that, if there is an
alternative s-x path P′ that is better than P1, we can simply replace P1 with P′, and continue
onto the original x-t path, to give a better path than P, contradicting its optimality. This can
also be seen in Figure 8.4, where the path 1-7-1-5 is not optimal because 1-7 can be replaced
by 2-4 (and 1-5 can also be replaced by 2-3).
This sounds completely obvious, and it is a property common
to problems that can be solved greedily, and is also key to the
technique of dynamic programming which we will learn in the
next chapter. But note that this seemingly obvious property is
not enjoyed by other, similar, problems. For example, consider
the problem of finding the longest simple path in a graph. It does
not have optimal substructure, as can be seen in this graph: the
longest path from a to d is a-b-c-d, but the sub-path c-d itself is
not the longest path from c to d.
There is no known polynomial time algorithm for the longest
path problem (it is in fact NP-complete), despite it looking like just the opposite of shortest
paths.
The D array stores the provisional best distances to each of the vertices. Each time, we look
for the vertex with the smallest D value, i.e., the vertex closest to s that is being reached for
the first time. (Imagine pouring water down at s and watching how it spreads.) For this vertex
u, the algorithm checks all its neighbours, to see if we now have a better distance to each
neighbour v by reaching it via u: spend a distance of D[u] from s to u, then add a distance of
d(u, v) following the edge (u, v). If this is shorter, update the D value of v.
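This core loop can be sketched in Python as follows (the array version: the minimum is found by a linear scan over unprocessed vertices). The adjacency-list input format is an assumption for illustration.

```python
INF = float('inf')

def dijkstra(adj, s):
    """Dijkstra's algorithm with the simple array data structure.
    adj[u] is a list of (v, weight) pairs; returns (D, P), the
    distance and predecessor arrays."""
    D = {v: INF for v in adj}
    P = {v: None for v in adj}
    D[s] = 0
    done = set()
    while len(done) < len(adj):
        # pick the unprocessed vertex with the smallest D value
        u = min((v for v in adj if v not in done), key=lambda v: D[v])
        done.add(u)
        for v, w in adj[u]:
            if D[u] + w < D[v]:        # better path via u: update D and predecessor
                D[v] = D[u] + w
                P[v] = u
    return D, P
```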
Figure 8.5: Example graph for Dijkstra. The red (thick) edges are edges of the shortest path
tree.
Worked example. The following tables show an example execution of the graph in Figure 8.5,
with a as the starting vertex. Initially
v      b    c    d    e    f    g
D[v]   5    ∞    ∞    ∞   11    1
P[v]   a    /    /    /    a    a
First g has the minimum value, so we consider all neighbours of g to see what can be
updated. For example, D[b] should be updated since D[g] + d(g, b) = 1 + 3 = 4 is better than
the old D[b]. But D[f ] should not be updated as D[g] + d(g, f ) > D[f ]. After one round we get
v      b    c    d    e    f    g
D[v]   4    5    8   10   11    1
P[v]   g    g    g    g    a    a
Then we find the smallest D value (excluding g that has been considered), which is b, and
repeat the process. On this occasion the table does not change as no better path was found.
Then this process goes on. In the end, we have
v      b    c    d    e    f    g
D[v]   4    5    7   10   11    1
P[v]   g    g    c    g    a    a
When the algorithm finishes, the D array stores the shortest distances to each of the vertices.
But this is just a number, not a path. This is where P[], the predecessor array, comes
into play. P[v] records the vertex just before reaching the destination v for the current best
path found. Each time D is updated, a better path to v is identified, and u is the “stop” just
before the destination. This is sufficient to reconstruct the actual shortest paths: if in the end
P[v1] = v4, for example, it means the shortest path to v1 has v4 as its last stop. So we check
Exercises
# 8-1 (Kruskal’s algorithm)
Consider the graph given by the adjacency matrix below, where non-zero entries represent
edge weights, zero or infinity means there is no edge. Find the minimum spanning tree
of this graph using Kruskal’s algorithm. You only need to show the order of edges being
considered and added to the solution.
      v1   v2   v3   v4   v5
v1     0    1    3    ∞    5
v2     1    0    3    ∞    6
v3     3    3    0    4    5
v4     ∞    ∞    4    0    2
v5     5    6    5    2    0
# 8-2 (Prim’s algorithm)
Repeat the above using Prim’s algorithm, starting at vertex v1 . Use an array to store the
distance information. Show the contents of the data structure after each step.
8-5 (Heapsort)
How can a heap be used to perform sorting? What is the time complexity?
# 8-6 (Dijkstra’s algorithm)
Consider the graph below. Find the shortest paths from s to all other vertices using Dijk-
stra’s algorithm with the array data structure. Show the contents of your data structure
after each step.
a 2 b
2 7
8
s t
3 6
10 1
c 4 d
Chapter 9
Dynamic Programming
Related reading:
[KT] Chapters 6.1-6.3, 6.6, 6.8
[CLRS] Chapters 15, 24.1-24.2, 25.1-25.2
[DPV] Chapters 4.6, 6.3, 6.6
[SSS] Chapter 8
In this chapter we introduce our third major algorithm design technique, dynamic programming
(DP). The “programming” here does not mean writing computer programs; the term predates
computer programs and refers to a way of tabulating numbers to compute solutions. The best
way to learn it is to see how it is used; so we will see how
this technique is applied to a number of important problems, including sequence comparisons
and shortest paths.
CHAPTER 9. DYNAMIC PROGRAMMING 73
Here we assume that an empty subset has sum 0. For example, suppose A = [2, 4, 1, 3, 1, 6, 5, 4, 2].
We cannot pick both A[6] and A[7] (6 and 5) because they are “too close” to each other. The
best solution for this example is to pick A[2], A[6] and A[9], giving a sum of 4 + 6 + 2 = 12.
It might seem natural to use a greedy algorithm that repeatedly picks the largest element,
as long as it is not “too close” to elements already chosen. But this is not correct: a simple
counterexample is the array [6, 1, 7, 6]. Greedy would have picked 7, but the optimal solution is
6+6. Other greedy approaches also fail to work. It seems we need some new ideas.
Yes, this is a completely trivial, tautological statement. But pause here for a moment and
make sure you truly understand what it means (and why it is trivial)! This is the kind of
observation that underlies the design of all dynamic programming algorithms.
This trivial observation allows us to reduce the problem to a smaller subproblem of the same
form, and hence apply recursion:
• If A[n] is part of the solution, then A[n − 1] and A[n − 2] cannot be; consider A[1..n − 3]
in the next step
• If A[n] is not part of the solution, consider A[1..n − 1] in the next step
But we don’t know which one is correct – so we simply try both and see which solution is
better!
Let’s formulate this a bit more formally. First we have to define subproblems. Let S(i)
be the value (sum of chosen numbers) of the solution considering only the first i elements, i.e.
A[1..i]. The reason we do this is because the recursion (the way we define it) always removes
elements from the end, and the subproblems are therefore always of this form. What we want to
find in the end is S(n). Based on the idea we described, we can write down a recursive formula:

    S(i) = max( S(i − 3) + A[i], S(i − 1) )

It is a maximum of two choices, the first corresponding to picking A[i], and the second
corresponding to not picking A[i].
Like all recursion we need base cases. Here we have S(0) = 0 because an empty array
has sum zero; S(1) = max(0, A[1]) because with just one element, we will pick it unless it is
negative, in which case we would rather have an empty sum; and S(2) = max(0, A[1], A[2])
because we can pick at most one of A[1] or A[2] (or neither). We need three base cases because
the recursive formula may “jump” from i to i − 3; this way, we guarantee that the recursion
will be “caught.”
It is straightforward to turn this formula into a recursive algorithm: (call S(n) to get the
solution.)
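The recursive algorithm can be sketched in Python like this (assuming, for convenience, that A[0] is an unused dummy so that the array is effectively 1-indexed, as in the notes):

```python
def S_naive(A, i):
    """Direct recursive implementation of
    S(i) = max(S(i-3) + A[i], S(i-1)) with the three base cases.
    A[0] is an unused dummy (1-indexed array). Exponential time."""
    if i == 0:
        return 0
    if i == 1:
        return max(0, A[1])
    if i == 2:
        return max(0, A[1], A[2])
    return max(S_naive(A, i - 3) + A[i], S_naive(A, i - 1))
```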
Unfortunately, this algorithm is very inefficient. In fact it takes exponential time! The
reason is that it will produce exponentially many subproblems; see Figure 9.1.
Figure 9.1: The tree of recursive calls made. Notice the redundancy.
Memorising solutions. However, we are just one trick away from getting an efficient algo-
rithm. Observe that many of the subproblems, and indeed entire subtrees, are repeated in the
figure. There is no point in recomputing those subproblems so many times, since they obviously
have the same answers (and this is precisely why the algorithm is slow). Instead, we can
memorise previously computed solutions: use an array to store the solutions of subproblems
computed before, and only compute afresh if they have not been computed previously. In
pseudocode:
Procedure S(i):
if M [i] == ∞ then // not computed before
M [i] ← max(S(i − 3) + A[i], S(i − 1))
end if
return M [i]
Here ∞ (the mathematical symbol for infinity) does not actually have to be infinity; any
indicator that the entry has not been computed before will do.
Removing recursion. We can simplify the algorithm even further and remove the recursion
to convert it into an equivalent iterative algorithm. This comes from the observation that the
recursion is only called on smaller values of i. Hence, if we compute the array M in the “correct”
order (in this case, increasing i), every time the necessary array entry is required (according to
the recursive formula), it must have been computed already. Hence we can eliminate recursion
completely and obtain the following extremely simple algorithm:
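A Python sketch of this iterative algorithm (same convention as before, with A[0] an unused dummy):

```python
def S_bottom_up(A):
    """Iterative version: fill M[0..n] in increasing order of i, so
    every entry needed by the formula is already computed.
    A[1..n] holds the input (A[0] is a dummy). O(n) time and space."""
    n = len(A) - 1
    M = [0] * (n + 1)                      # M[0] = 0
    if n >= 1:
        M[1] = max(0, A[1])
    if n >= 2:
        M[2] = max(0, A[1], A[2])
    for i in range(3, n + 1):
        M[i] = max(M[i - 3] + A[i], M[i - 1])
    return M
```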
Time and space complexity. The time complexity of this last algorithm is clearly O(n),
since it has a for loop that executes O(n) times and the content inside takes O(1) time per
iteration.
In dynamic programming algorithms we are also interested in the space complexity of the
algorithm, i.e., the amount of working memory used, since we are, in effect, trading space for
time. The algorithm uses an array M of size n, and thus it takes O(n) space.
Finding the actual solution. The above algorithm computes the value of S(), which is the
value of the objective function (i.e. the sum), but not the actual solution (i.e. which elements
to pick). But it is easy to get that information. First, add to the above code an additional
array to record the information on which of the two cases is chosen at every step:
If we apply this method to the array A = [2, 4, 1, 3, 1, 6, 5, 4, 2], these are the computed
values of M [] and T ake[]:
i      0   1   2   3   4   5   6   7   8   9
A          2   4   1   3   1   6   5   4   2
M      0   2   4   4   5   5  10  10  10  12
Take   F   T   T   F   T   F   T   F   F   T
Note that the Take array (despite its name) does not tell you directly the solution of the
problem of size n; in other words, the solution is not to simply pick those items with “true” in
that array. Rather, we have to “trace back”: since Take[9] is true, we pick A[9] and recursively
look at Take[6]. It is again true, so pick A[6] and recursively look at Take[3]. It is false, so do
not take A[3], and recursively look at Take[2], and so on.
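Putting the pieces together, the following Python sketch computes M and Take bottom-up and then traces back as just described. Ties are broken towards "not taking" (strict inequality), which matches the worked example above; A[0] is again an unused dummy.

```python
def trace_solution(A):
    """Compute M and Take bottom-up, then trace back from i = n:
    if Take[i] is true, pick A[i] and jump to i - 3, else move to i - 1.
    Returns (optimal sum, sorted list of picked indices)."""
    n = len(A) - 1
    M = [0] * (n + 1)
    Take = [False] * (n + 1)
    if n >= 1:
        M[1] = max(0, A[1])
        Take[1] = A[1] > 0
    if n >= 2:
        M[2] = max(0, A[1], A[2])
        Take[2] = A[2] > 0 and A[2] >= A[1]
    for i in range(3, n + 1):
        if M[i - 3] + A[i] > M[i - 1]:     # strictly better to pick A[i]
            M[i] = M[i - 3] + A[i]
            Take[i] = True
        else:
            M[i] = M[i - 1]
    picked, i = [], n
    while i > 0:                           # traceback
        if Take[i]:
            picked.append(i)
            i -= 3
        else:
            i -= 1
    return M[n], sorted(picked)
```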
includes Ii and discards all intervals that overlap it, leaving a subproblem with fewer intervals;
or discards Ii and leaves a subproblem S(i − 1). Thus this is similar to the array sum problem,
except that the “gap” is not necessarily 2 but depends on how the intervals overlap. This is also
why they have to be ordered by finishing times: if Ij is the “first” (going backwards from Ii−1 )
interval that does not overlap with Ii , then all intervals before Ij will also not overlap with it.
Thus we only need to check for one gap, although it is variable-sized.
1. A recursive formulation. The problem must be such that it can be solved building on
solutions of smaller subproblems of the same form.
3. Overlapping subproblems. Typically, the recursion will only reduce the size of the
problem by a small amount; in addition, many of the subproblems will appear repeatedly
over the course of the recursion. This differs from divide and conquer. This is also why
we can use a table to memorise solutions of subproblems; it would not have provided
any advantage if the same subproblems are not repeatedly encountered. This is what we
observed in Figure 9.1.
All dynamic programming algorithms (that you will encounter in this module) follow the
same basic structure. The only part that is specific for each individual problem is the recursive
formulation; the rest are fairly standard stuff. Therefore, to design a dynamic programming
algorithm we typically follow these steps:
Step 1: Define a recursive formulation that utilises the structure of the optimal solution. This
is the difficult part. For this module, often we will define the subproblem for you, but you
still need to work out the recursive formula.
Step 2: Compute the numerical value of the optimal solution using a table. Note that this table
always only stores the values (cost/profit) we are trying to minimise/maximise. There are
two approaches:
In the top-down approach, we start the recursive call from the largest subproblem, de-
scending to smaller subproblems. Each time we first check whether we have computed
that subproblem previously by looking up the values from a table. The first time a sub-
problem is solved, its value is stored in the table. As subproblems are only solved when
they are required, in theory not all table entries are always filled.
In the bottom-up approach, we iteratively fill the table starting from the smallest sub-
problems, in an order such that recursion is not required because the necessary smaller
subproblems must be solved already. This means there is no recursion overhead.
Most of the time, we will describe our algorithms using the bottom-up approach.
Step 3: Construct the actual solution from the table using traceback. The traceback part is
essentially identical for every DP algorithm and thus we will often omit it. First an extra
table stores the choice made at each step; from this we can trace back the path (the list
of subproblems) followed and reconstruct the solution step by step.
Otherwise, if A[m] ≠ B[n] (doesn’t match), then at least one of them, or both, cannot be
part of the LCS, because otherwise a crossing would happen. This again allows us to reduce to
a smaller subproblem, where one or both of the strings is shortened by one.
To turn this observation into a recursive formulation, let LCS(i, j) be the length of LCS of
A[1..i] and B[1..j], i.e., the first i letters of A and the first j letters of B. So we have, for any
i > 0 and j > 0,
    LCS(i, j) = LCS(i − 1, j − 1) + 1                        if A[i] = B[j]
    LCS(i, j) = max( LCS(i − 1, j), LCS(i, j − 1) )          if A[i] ≠ B[j]
What are the base cases? We cannot apply the recursion if one of the strings becomes empty.
When that happens, the length of the LCS must be zero: the only common subsequence between
an empty string and another string is the empty string. Hence LCS(0, j) = 0 and LCS(i, 0) = 0.
This immediately gives us a recursive algorithm; like before, we need a table to store solutions
and first check the table to see if the subproblem was computed previously (top-down approach).
Because the subproblems are parameterised by i and j, where each of them can take on values
from 0 to m (or n), we need a 2-dimensional table of size m + 1 by n + 1.
Bottom-up algorithm. Just like the previous problem, we can develop a bottom-up, non-
recursive algorithm. It is important to note the right order to fill in the entries in the table,
so that every time an element is needed, it is already computed. Since this is a 2-dimensional
problem, the situation is slightly more complicated. But observe each table entry (i, j) depends
on at most 3 other entries: (i − 1, j − 1), (i, j − 1), (i − 1, j). All these would have been computed
already if we use a standard double-for-loop like this:
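Such a double-for-loop can be sketched in Python (0-indexed strings, so L[i][j] is the length of the LCS of the prefixes A[:i] and B[:j]):

```python
def lcs_length(A, B):
    """Bottom-up LCS table: L[i][j] is the length of the LCS of
    A[:i] and B[:j]. Filled row by row, so the three entries each
    cell depends on are already computed. O(mn) time and space."""
    m, n = len(A), len(B)
    L = [[0] * (n + 1) for _ in range(m + 1)]   # row 0 / column 0 are the base cases
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if A[i - 1] == B[j - 1]:            # letters match: 1 + top-left entry
                L[i][j] = L[i - 1][j - 1] + 1
            else:                               # max of the entries above and to the left
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L
```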
Figure 9.2 is an example of the L array computed when the two strings are abcbdab and
bdcaba. The table is filled row-by-row, column-by-column. The recursive formula means that if
A[i] = B[j], then the entry is simply 1 plus the element immediately to the top-left of the current
cell; if A[i] ≠ B[j], the entry is equal to either the one immediately above, or immediately to
the left, whichever is bigger. See the two cells with red arrows as examples. Thus it can be
calculated quite easily even manually.
It is convenient to write down the letters alongside the first row/column, so it is easy to see
which letters you are comparing. Furthermore, this neatly shows the meaning of the values in
the table: the value in a cell is the length of the LCS of the two strings up to the corresponding
positions. For example, the blue circled entry in Figure 9.2 corresponds to the length of the
LCS between abcbd and bd.
Time and space complexity. Clearly, we used a table with m + 1 rows and n + 1 columns,
hence the space complexity is O(mn).
As for the time complexity, there are two nested for loops, and the content inside the for
loop takes only O(1) time to compute since it only involves a constant number of entries. Hence
the time complexity is O(mn).
Traceback. The table stores the lengths of the LCSs; it does not tell you what the actual
LCSs are. To do that, we need to store, at each cell, which of the choices we have picked, or
equivalently which direction the entry of this cell comes from. As an exercise on paper, we can
draw arrows to indicate where the entries come from, as in Figure 9.2. In a real implementation
of course there won’t be arrows, but a separate 2-D array with entries that indicate the direction
to follow.
Starting from the bottom-right corner of the table, we can follow the arrows to trace back
a path to the beginning. This defines a LCS: each diagonal move means the corresponding
characters in the two strings are “matched,” and each horizontal or vertical move means a
letter in one of the strings is skipped. The yellow shaded cells in Figure 9.2 form one possible
path, leading to the LCS bcba on the right.
Note that there can be multiple paths that can be traced (because sometimes different choices
give the same value), and these correspond to different LCSs (all with the same, maximum,
length).
Exercise: find another LCS of the same length from this table.
The base cases are D(i, 0) = i and D(0, j) = j, because when one of the sequences is empty,
the best way to transform to the other sequence is by a series of insertions/deletions.
A dynamic programming algorithm then follows easily; just replace the recursive formula
(and the base cases) of Algorithm 9.6 with the ones above.
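As a sketch, here is the unit-cost version in Python. The recurrence used (a match costs nothing; otherwise take 1 plus the best of delete, insert, or substitute) is the standard one and is an assumption here, since only the base cases are quoted above; the structure otherwise mirrors the LCS table-filling.

```python
def edit_distance(A, B):
    """Unit-cost edit distance (insertions, deletions and substitutions
    all cost 1), with the base cases D(i, 0) = i and D(0, j) = j."""
    m, n = len(A), len(B)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                            # delete all of A[:i]
    for j in range(n + 1):
        D[0][j] = j                            # insert all of B[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if A[i - 1] == B[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # delete A[i]
                          D[i][j - 1] + 1,         # insert B[j]
                          D[i - 1][j - 1] + sub)   # match or substitute
    return D[m][n]
```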
The following is an example execution of the algorithm.
In this section we develop an algorithm for finding shortest paths when the graph has
negative edges. Not only is it more general, but it turns out that similar ideas can be found
in algorithms for decentralised settings, i.e., no node has global information (e.g. distributed
routing algorithms).
Before that, we have to get something out of the way first: negative cycles. Now that we
allow negative edges, we may have introduced negative cycles, i.e., cycles with negative total
edge weight. This causes a problem because a graph with negative cycles does not have a
well-defined shortest path: each time you go around a negative cycle, the length of the path is
reduced so one can simply go round a negative cycle indefinitely to get a −∞ length. But for
those graphs with negative edges but not negative cycles, shortest paths are still well-defined.
So from now on, we assume that the graphs we consider have no negative cycles.
We first establish two simple but essential properties of shortest paths:
This is true because, if a shortest path contains a positive (or zero) cycle, we can remove
it and the total weight is not worsened. And we already assumed that negative cycles do not
exist.
This is because a path that does not visit the same vertex twice (if it does it would form
a cycle, but the previous property already ruled that out) must be visiting a new vertex for
each new edge along the path. Thus, if a path has more than n − 1 edges, then it must reach
more than n vertices (including the starting vertex), which contradicts that the graph has only
n vertices.
The for loop runs up to n − 1 only because we know that a shortest path can have at most
n − 1 edges (Proposition 9.2). Figure 9.4 shows an example execution.
This algorithm runs in O(n3 ) time. To see this, the two for loops each have at most n
iterations. But line 4 requires not O(1) but O(n) time, because it has to find the minimum
among n quantities according to the recursive formula. This is your first example where filling
in an entry in the table takes more than constant time (you will see more examples later!).
The space complexity is O(n2 ), due to the 2-D table.
Simplifying the algorithm. Observe that S(i, ∗) depends on S(i − 1, ∗) only; there is there-
fore no need to keep the earlier entries S(i−2, ∗), . . . , S(0, ∗). We only need the current column
and the previous column, which allows reuse of memory and cuts the space complexity
down to O(n). In fact, we can take this idea further and simplify the recursive formula: let
M [v] be a 1-D array. Then
We can further simplify the algorithm to give the Bellman-Ford algorithm3 below. We omit
its proof of correctness here.
As in other shortest path algorithms, the D array only stores the distances, not the paths,
but the paths can be reconstructed by a predecessor array (P in the above).
It is easy to see that the time complexity of this algorithm is O(mn) and its space complexity
is O(n).
Note that in this algorithm, i no longer means the number of edges; it has no obvious
meaning. Also, note that the same D (and P ) value can be updated more than once in one
iteration.
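A Python sketch of Bellman-Ford as described (vertices numbered 0 to n−1, an arbitrary but fixed edge order, and n−1 rounds of edge updates); the edge-list input format is an assumption for illustration:

```python
INF = float('inf')

def bellman_ford(n, edges, s):
    """Bellman-Ford sketch: apply the edge update procedure to every
    edge, n - 1 times, in an arbitrary fixed order. edges is a list
    of (u, v, weight) triples. O(mn) time, O(n) space."""
    D = [INF] * n
    P = [None] * n
    D[s] = 0
    for _ in range(n - 1):
        for u, v, w in edges:
            if D[u] + w < D[v]:        # edge update: a better path to v via u
                D[v] = D[u] + w
                P[v] = u
    return D, P
```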
Worked example. The following shows an example execution. The (arbitrary) edge order
used was: (sa), (sb), (sc), (se), (ca), (ad), (bd), (eb), (dc), (de)
3 Some people/books use the name “Bellman-Ford” to refer to Algorithm 9.7, which here we call proto-Bellman-Ford (not an official name). Here we follow [CLRS] and call this one Bellman-Ford.
Another way of interpreting this algorithm is as follows. Recall that in Dijkstra’s algorithm
(Algorithm 8.3), the core part is to check for shorter distances to each vertex:
This is actually a key operation, which we call the edge update procedure, that appears again
and again in all shortest paths algorithms that we study. This operation is always “safe”: if
there is a D[v] value, it always means there is a path of that distance to v, and it is updated
(to a smaller value) only if there is a better path with that shorter distance.
In fact, most shortest path algorithms simply apply this edge update procedure “in some
right order” to all the edges. In Dijkstra, this correct order is based on the D values, i.e., how
far the vertex is away from s. Here, it also applies the edge update procedure to all edges,
but we don’t know what is “the right order”, so we choose an arbitrary one and do it a “large
enough” number of times – it turns out that n − 1 times are enough.
This gives an O(n4 ) time algorithm: a 3-D n × n × n table where each entry takes O(n) time
to fill. There are some tricks that can improve this to O(n3 log n), but we will not cover them
here.
We can arrange the values Dij(k) in matrices, where D(k) is an n × n matrix, k = 0, 1, . . . , n,
and Dij(k) is the entry in the i-th row, j-th column of D(k).
The above algorithm implements the idea. As before, the D matrices only store the lengths
of the shortest paths, and to recover the actual paths we use additional matrices P which store
the predecessor information. In the case where Dij(k−1) is better, the predecessor is unchanged
(same as Pij(k−1)); but in the case where Dik(k−1) + Dkj(k−1) is better, the predecessor is not
necessarily k but the last stop in the second leg of this path, Pkj(k−1).
D(3):   0   3   8   4  −4        P(3):   /   1   1   2   1
        ∞   0   ∞   1   7                /   /   /   2   2
        ∞   4   0   5  11                /   3   /   2   2
        2  −1  −5   0  −2                4   3   4   /   1
        ∞   ∞   ∞   6   0                /   /   /   5   /

D(4):   0   3  −1   4  −4        P(4):   /   1   4   2   1
        3   0  −4   1  −1                4   /   4   2   1
        7   4   0   5   3                4   3   /   2   1
        2  −1  −5   0  −2                4   3   4   /   1
        8   5   1   6   0                4   3   4   5   /

D(5):   0   1  −3   2  −4        P(5):   /   3   4   5   1
        3   0  −4   1  −1                4   /   4   2   1
        7   4   0   5   3                4   3   /   2   1
        2  −1  −5   0  −2                4   3   4   /   1
        8   5   1   6   0                4   3   4   5   /
It is easy to see that the time complexity of the algorithm is O(n3 ) because of the three
nested for loops. The space complexity can be only O(n2 ), since each D and P array has n × n
elements and there is no need to keep all previous arrays; keeping just the most recent one
suffices.
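A Python sketch of the algorithm, keeping only the most recent D matrix as just discussed; the predecessor matrices are omitted for brevity. Updating D in place is safe because entries in row k and column k do not change during round k.

```python
INF = float('inf')

def floyd_warshall(D0):
    """Floyd-Warshall sketch. D0 is the n x n matrix of direct edge
    weights (0 on the diagonal, INF for missing edges). Returns the
    matrix of all-pairs shortest distances; O(n^3) time, O(n^2) space."""
    n = len(D0)
    D = [row[:] for row in D0]         # work on a copy, reused across rounds
    for k in range(n):                 # allow vertex k as an intermediate stop
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
    return D
```

The test input below is inferred from the worked example's matrices (an assumption, since the notes show only D(3) onward here).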
Exercises
9-1 (Fibonacci numbers)
The Fibonacci numbers are defined by the formulas

    F0 = 0,   F1 = 1,   and   Fn = Fn−1 + Fn−2   for n ≥ 2.

That is, each number is the sum of the previous two. So the first few Fibonacci numbers
are 0, 1, 1, 2, 3, 5, 8, 13, 21.
(a) Give a simple recursive algorithm to compute the n-th Fibonacci number Fn .
(b) Consider instead the following algorithm:
F [0] ← 0; F [1] ← 1
for i = 2 to n do
F [i] ← F [i − 2] + F [i − 1]
end for
print F [n]
What is the time and space complexity of the algorithm?
(c) Consider the following algorithm:
x ← 0; y ← 1
for i = 1 to n do
x←x+y
y ←x+y
end for
print x and y
Trace the execution of the algorithm for n = 4, showing the values of x and y
immediately after the contents of the for loop is executed every time. What is this
algorithm doing? What is its time and space complexity?
# 9-2 (Longest common subsequence)
Find the longest common subsequence of the sequences TTACG and ATACAG. Show the
contents of the dynamic programming table.
# 9-3 (Edit distance)
Find the edit distance between the two sequences TTACG and ATACAG, if the cost of
each insertion or deletion is 2 and the cost of each substitution is 3. You should write down
the recursive formula for the optimal cost D(i, j) and show the contents of the dynamic
programming table.
(3) Otherwise (A[1] ≠ B[1]), let x be the earliest position at which A[1] appears in B, and
y be the earliest position at which B[1] appears in A. If x ≤ y, match A[1] to B[x] and
recursively solve the subproblem A[2..m], B[x + 1..n]. If x > y, then match B[1] to A[y]
and recursively solve the subproblem A[y + 1..m], B[2..n].
Show that this algorithm does not always return the longest common subsequence.
# 9-5 (Bellman-Ford algorithm)
Find the distances of the shortest paths from s to all other vertices in the graph below
using the Bellman-Ford algorithm. Show the contents of the data structure after each
round.
5
b d
−2
6
−4
8 7
s
−3
7
a 9 c
−2
v1 v2
8
5 2
−1 4
v3
(a) Let N (i, j) denote the minimum number of coins required to make an exact amount
of i, using only coins from the set c1 , c2 , . . . , cj . Give a recursive formula for N (i, j).
Also, write down the base case. (Hint: either you use one (or more) coin of value cj ,
or you use none. Either way you reduce at least one of i or j.)
(b) Give a dynamic programming algorithm for the problem. Analyse the time and space
complexity of the algorithm, in terms of n and k.
(c) Illustrate how your algorithm works with n = 6 and coin denominations {1, 3, 4},
by showing the contents of the dynamic programming table.
∗ 9-8 (Sentence segmentation)
You are given a string of length n which is possibly formed from a sequence of valid words
but without spaces and punctuation marks. An example is
isitherealready
Your task is to determine whether the string can be broken down into a sequence of valid
words. Sometimes there can be more than one way; sometimes there can be none. (Note
that here we only care about whether each word is valid, but not whether the whole
sentence makes sense or is grammatically correct.)
You have a function Dict(s) which takes a string s and returns true if s is a valid word in
a dictionary, and false otherwise. Assume this function takes constant time.
(a) Let S(i) denote the truth value (true or false) of whether the first i characters in the
string can be decomposed into a sequence of valid words. Give a recursive formula
for S(i). Also, write down the base case. (Note that since S() records either true or
false, your recursive formula will need to use logical operators like AND, OR rather
than the usual +, max, min etc.)
(b) Hence, give a dynamic programming algorithm for the problem.
(c) Analyse the time and space complexity of your algorithm.
(d) In order to find the “best” way of breaking down the sentence, suppose instead of
just a truth value, we have a function Score(s) which takes a string s and returns
a score that reflects how “good” that word is; a higher score means a more likely
correct word. For example it could be that Score(already) = +12, Score(a) = +1
and Score(alre) = −20. Assume this function always returns in constant time. The
objective of breaking the string nicely becomes finding the break with the highest
total score of the individual words.
Develop an efficient algorithm for this problem.
∗ 9-9 (Maximum contiguous subarray revisited)
Recall the following problem discussed in Exercise 4-7: given an array A[1..n] with n
elements, we want to find the maximum-sum subarray, i.e. the subarray A[i..j] starting
with A[i] and ending with A[j] such that the sum A[i]+A[i+1]+· · ·+A[j] is maximum over
all possible subarrays. For example, if A = [2, −10, 6, 4, −1, 7, 3, −8], then the maximum-
sum subarray is [6, 4, −1, 7, 3] with sum 19. In Exercise 4-8, we used divide and conquer
and gave an O(n log n) time algorithm. In fact, we can use dynamic programming to give
an O(n) time algorithm!
Let G(i) denote the maximum sum of a contiguous subarray within A[1..i], and let S(i)
denote the maximum sum of a contiguous subarray within A[1..i] with the additional
restriction that the subarray must end with A[i] (this includes the case of an empty
subarray, with sum 0). Derive recursive formulas for G(i) and S(i). Hence give an O(n)
time algorithm for the problem.
Chapter 10
Network Flow
Related reading:
[KT] Chapters 7.1-7.2, 7.5
[CLRS] Chapters 26.1-26.3
[DPV] Chapters 7.2-7.3
[SSS] Chapter 6.5
In this chapter we consider another graph problem, of finding the maximum flow. The
solution does not belong to the three big types of algorithmic techniques we discussed, but it is
itself a very important topic. We also illustrate the idea of problem reduction or transformation
by showing how network flows can be used to solve a kind of matching problem.
Figure 10.1: Left: an example flow network. Right: Same network with a possible flow.
More formally, we are given a directed graph. Each edge e = (u, v) has a capacity ce , which
means at most ce units of flow can pass through the edge from u to v. Note that this is a
directed graph, so this edge does not allow any flow from v to u. Each edge can admit a flow, a
number fe that is not larger than the edge’s capacity. In other words, fe ≤ ce for every e: this
is called the capacity constraint.
The above is only about one edge, but of course in a network, the edges are connected
together at vertices. All vertices (except two special ones below) satisfy the conservation con-
straints: the total incoming flow into the vertex must be equal to the total outgoing flow. In
other words, flows do not appear out of nowhere or disappear, and there should be no “traffic
congestion.” But the traffic has to originate from somewhere: there is a special vertex called
the source s where all flows originate from. And there is the opposite sink vertex t which, as
the name suggests, is a vertex with only incoming flows and where the flows “drain away.”
We use the notation f /c on an edge to denote a flow f on an edge with capacity c. Figure 10.1
also shows a possible flow in the network. It can be checked that the flow is indeed conserved.
The value of a flow is the total amount of flow that goes from s to t. Because of the
conservation constraints, the total amount of flow coming out from s is equal to the total
amount of flow going into t; thus this is well-defined. In Figure 10.1, the flow value is 5.
Given such a network, our objective is to find the maximum flow from the source to the
sink. We want to find both the maximum flow value (a number) as well as the actual flow, i.e.,
the flow on each edge. The flow in Figure 10.1 is not the maximum and can be improved (how?)
Greedy algorithm fails. One tempting idea might be to increase flows on edges incremen-
tally in some order. In other words, we greedily (locally) add flows: start with an all-zero flow,
find some path from s to t that has positive remaining capacity, assign some flow along this
path, and repeat. For example, in Figure 10.1 we can add one more unit of flow along the path
1/4, 0/4, 4/5.
Unfortunately this is not a correct algorithm. For example, consider the graph in Figure 10.2
(top left). One could easily have chosen the three straight edges in the middle as a path, and
assigned a flow of 20 along this path. Afterwards, nothing more can be done and we are “stuck.”
But the maximum flow is in fact 30, by diverting 10 units into the lower arc and also pushing
10 units in the upper arc. The incremental method gets stuck because we need to reduce the
flow on the middle edge (from 20/30 to 10/30) to increase overall flow!
Figure 10.2: Top left: after assigning 20 units to the straight edges we are stuck. Top right:
residual network from this flow. Bottom left: augmenting path. Bottom right: updated residual
network. There are no more augmenting paths.
Residual networks. Given a flow on the network, we construct the residual network of that flow as follows:
(i) the nodes are the same as the original network; and
(ii) for each edge (u, v) with flow f and capacity c, create two edges in the residual network:
(u, v) with capacity c − f , and (v, u) with capacity f .
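Rules (i) and (ii) can be sketched directly in code. The following is a minimal sketch; the edge-list representation and the function name are my own choices, not from the notes:

```python
def residual_network(edges, flow):
    """Build the residual network of a flow.

    edges: list of (u, v, capacity) triples for the original network;
    flow: dict mapping (u, v) to the flow assigned on that edge.
    Returns the residual edges as (u, v, residual capacity) triples,
    omitting edges whose residual capacity is zero.
    """
    res = []
    for u, v, c in edges:
        f = flow.get((u, v), 0)
        if c - f > 0:
            res.append((u, v, c - f))   # forward edge: remaining capacity
        if f > 0:
            res.append((v, u, f))       # reverse edge: flow that can be undone
    return res
```

For example, an edge (s, a) with capacity 4 carrying 1 unit of flow yields a forward edge (s, a) with capacity 3 and a reverse edge (a, s) with capacity 1.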
Intuitively, the forward edge (in the same direction as in the original network) records the
remaining capacity, i.e., how much more flow you can “push” along the edge. The reverse edge
allows one to push some flow in the reverse direction, which effectively “undoes” a previously
assigned flow. The maximum amount of flow you can undo is the flow that is already assigned;
hence this is the capacity of the reverse edge. This allows us to recover from the situation of
“getting stuck;” for example, in Figure 10.2 the residual network corresponding to the flow in
the top-left is shown in the top-right. (Edges with zero capacity are not drawn.) There is now
a path (the red, thick one in bottom left) from s to t in the residual network; this is called
an augmenting path. We can now increase the flow by adding a flow along this path. The
amount that can be increased is determined by the minimum of all edge capacities on this
path, because that is what constrains the flow. In other words, if an augmenting path has
edge capacities c1, c2, . . . , ck, a flow of value min(c1, c2, . . . , ck) can be pushed along this path.
With the increased flow, the residual network should then be updated to reflect the new flow.
Then we simply repeat the process, i.e., find a new augmenting path in this updated residual
network. Eventually, no more augmenting paths can be found, and the algorithm ends. This is
the Ford-Fulkerson algorithm: start with the all-zero flow; while the residual network of the current flow contains an augmenting path, push the bottleneck amount of flow along it and update the residual network.
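The loop just described can be sketched as follows. This is a minimal sketch, not the notes' own pseudocode: the adjacency-dictionary representation, the function names, and the choice of breadth-first search to find augmenting paths are my own assumptions.

```python
from collections import deque

def max_flow(n, edges, s, t):
    """Ford-Fulkerson with BFS augmenting paths.

    n: number of vertices, labelled 0..n-1; edges: list of (u, v, capacity);
    s, t: source and sink. Returns the maximum flow value from s to t.
    """
    # residual[u][v] = remaining capacity on edge (u, v)
    residual = [dict() for _ in range(n)]
    for u, v, c in edges:
        residual[u][v] = residual[u].get(v, 0) + c
        residual[v].setdefault(u, 0)           # reverse edge, initially 0

    def bfs_augmenting_path():
        """Return a parent map describing an s-t path, or None if none exists."""
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    if v == t:
                        return parent
                    queue.append(v)
        return None                            # no augmenting path: flow is maximum

    flow = 0
    while (parent := bfs_augmenting_path()) is not None:
        # collect the edges of the path by walking back from t to s
        path = []
        v = t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        # the bottleneck is the minimum residual capacity along the path
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:                      # push flow and update the residual network
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        flow += bottleneck
    return flow
```

On the "stuck" example of Figure 10.2 (modelled here as vertices 0..3 with edges (0,1,20), (1,2,30), (2,3,20), (0,2,10), (1,3,10)), this returns 30, since the reverse edges let it undo flow on the middle edge.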
The algorithm does not specify which augmenting path to use, or how to find one. Any such
path can be used; for example, breadth-first search gives one such path.
It can be shown that, if there is no more augmenting path in the residual network, then the
flow is maximum.
Worked example. Figure 10.3 shows a slightly larger example. In the end, the dashed line
divides those vertices reachable from s from those unreachable from s, after no more augmenting
paths can be found.
The final flow on each edge can be recovered from the final residual network.
Theorem 10.1 (Integrality Theorem) If all the edge capacities are integers, then there is
a maximum flow with all flow values being integers.
The execution of the algorithm is already a proof of this theorem – if all capacities are
integers, the capacity of any augmenting path must also be an integer, and the flow added at
each step is an integer, and hence the new residual capacities are all integers, and so on.
Note that the theorem does not say that all maximum flows must have this property – just
one of them does. (There could be multiple flows with the same maximum value.)
With this, we can state that the running time of the algorithm is O(mF ), where m is the
number of edges and F is the maximum flow value found. The reason is as follows. Each iteration
of the while loop can be completed in O(m) time, because constructing the residual network
takes O(m) time, finding one augmenting path takes O(m + n) = O(m) time, and updating the
residual network also takes O(m) time. How many iterations does the while loop make? Since
all capacities are integers, each augmenting path found must have capacity at least 1, and hence
increases the flow value by at least one. So there can be at most F iterations.
Note that this running time is somewhat different from the others we have seen, because it
depends on the value of the answer we will find. We call this output-sensitive. This is not ideal
– if one multiplies all capacities of a network by 10, this suggests the running time
also increases by a factor of 10, which is of course quite unreasonable. But this bound is actually
tight, i.e., the time complexity can actually be that bad. Consider for example the network
below. Obviously, the maximum flow is 200, by passing 100 along the top and 100 along the
bottom. But Ford-Fulkerson could, in principle, repeatedly find augmenting paths that use
the edge with capacity 1 (its direction flips back and forth in the successive residual networks);
it will then take 200 iterations to complete!
If we select augmenting paths more carefully, we can prove a better running time, and indeed
there are faster algorithms, but we won’t cover them here.
An s-t cut partitions the vertices into two sets, one containing s and the other containing t;
its capacity is the total capacity of the edges going from the s side to the t side, and a
minimum cut is a cut of smallest capacity. You might notice that this minimum cut has the
same value as the maximum flow in Figure 10.3. In fact, the line that divides the two parts
in the minimum cut is also the same line in Figure 10.3. This is not a coincidence:
Theorem 10.2 (Max-flow-min-cut) The maximum flow value of a network is the same as
the value of the minimum cut.
Again, we do not prove it here (it is integral to the proof that Ford-Fulkerson is optimal),
but we can give some intuition. Draw any s-t cut in the graph. Any flow must pass through this
cut, from the s side to the t side, using edges of this cut. Therefore, if this cut has capacity x, no
flow of value larger than x can pass through it. (Imagine the border between two countries and
a finite number of crossing points, each with a certain capacity.) But this argument is true for
any cut, and so also for the minimum cut; therefore the maximum flow is never larger than the
minimum cut. In some sense, a minimum cut is a “bottleneck” of the graph, the “narrowest”
part that constrains how much traffic can flow through. (This explains why max flow is at most
min cut, but why max flow is indeed equal to min cut (and not strictly smaller) requires another
argument.)
The Ford-Fulkerson algorithm also gives us one way of finding a location of the min cut
(not just its value). When the algorithm finishes, some vertices are reachable from s using the
remaining capacities and some vertices are unreachable; the boundary between these two sets
of vertices is one such location. Note that the location of a min cut may not be unique and
there can be others (not found by this method).
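The reachability check just described can be sketched as follows (a minimal sketch; the nested-dictionary representation of the final residual network and the function name are my own assumptions):

```python
from collections import deque

def min_cut_side(residual, s):
    """Vertices reachable from s using positive residual capacities.

    residual[u] maps each neighbour v to the remaining capacity on (u, v).
    When the residual network comes from a maximum flow, the returned set
    is the s-side of one minimum cut.
    """
    seen = {s}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v, cap in residual.get(u, {}).items():
            if cap > 0 and v not in seen:   # only cross edges with spare capacity
                seen.add(v)
                queue.append(v)
    return seen
```

The cut itself then consists of the original edges going from the returned set to its complement.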
In a bipartite graph, a matching is a set of edges no two of which share a vertex, and a
maximum matching is one with as many edges as possible. It turns out that we can solve the
maximum matching problem by reducing (transforming) it into a network flow problem. First,
add a source vertex s and
connect it to all vertices on one side of the bipartite graph. Similarly, add a sink vertex t and
connect all vertices on the other side to it. Make sure all edges are oriented to form a directed
graph from s to t. Assign a capacity of 1 to all edges in the graph.
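The construction above can be sketched as follows (a minimal sketch; the labels 's' and 't' and the dictionary representation are my own assumptions):

```python
def matching_network(left, right, edges):
    """Capacities of the flow network for bipartite matching.

    left, right: vertex labels on the two sides (assumed not to contain
    the labels 's' and 't'); edges: (u, v) pairs with u in left, v in right.
    Returns a dict mapping each directed edge to its capacity.
    """
    cap = {}
    for u in left:
        cap[('s', u)] = 1      # source to every left-side vertex
    for v in right:
        cap[(v, 't')] = 1      # every right-side vertex to the sink
    for u, v in edges:
        cap[(u, v)] = 1        # middle edges, oriented left to right
    return cap
```

Running any maximum flow algorithm on this network and picking the middle edges that carry flow 1 then gives a maximum matching.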
We claim that we can solve the matching problem by solving the max flow problem in this
network. Specifically, we mean that the max flow of the network equals the value of the max
matching. It is obvious that a matching of size x always gives a flow of value x; just follow the
chosen edges in the matching and add those edges connecting s and t. Since it is a matching,
none of the chosen edges in the matching share any vertices and hence the paths formed in the
flow are disjoint (except at s and t), and each path gives a flow of 1.
The opposite direction needs a bit more attention. A flow of value x always gives a matching
of size x because of the following. By the integrality theorem, there exists a max flow where
all the flow values on edges are either 0 or 1 (because all edge capacities are 1). We pick all
edges with flow 1 on them, and those in the middle (bipartite) part of the graph must define
a valid matching. Since all source edges have capacity 1, the flow coming into a left-side vertex
has value at most 1, and by the integrality theorem it cannot be split into two or more edges
with fractional flow values; it can only continue along one edge. Similarly, on the sink side, there cannot
be multiple incoming flows going into a vertex because it will then lead to a larger-than-1 flow
along the edge that connects to the sink, which is not possible. Thus a one-to-one matching is
defined.
The above describes the most basic version of matching. There are many other variations of
matching problems that can be modelled as flows, for example by using different edge capacities.
Here we show one example. Suppose there are several depots with different amounts of supplies
of goods, and several retail sites with certain demands. We want to match the supplies to
the demands so that the maximum amount is matched. As shown in the figure below, we can
construct a similar bipartite graph as before. The supply and demand requirements are modelled
using edge capacities: a depot that has a supply of x units can be modelled by an edge from s
to that vertex with capacity x, because that means no more than x units of flow can come out
of that vertex. Note that the flow that comes out may be split (shared) between different retail
sites. The same happens on the t side. Now, those edges in the “middle” do not need any
constraints, but we still need to specify a capacity for this to be a proper network flow problem.
We can therefore specify a capacity of infinity.
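This variation can be sketched in the same style as the matching construction (again a minimal sketch with my own names; since no finite number works as "infinity" in a dictionary of integers, the sketch uses the total supply, which no middle edge can ever exceed):

```python
def supply_demand_network(supplies, demands, links):
    """Capacities of the flow network for matching supplies to demands.

    supplies: dict depot -> units available; demands: dict site -> units needed;
    links: (depot, site) pairs that are allowed to ship.
    Returns a dict mapping each directed edge to its capacity.
    """
    inf = sum(supplies.values())    # effectively unlimited in this network
    cap = {}
    for d, x in supplies.items():
        cap[('s', d)] = x           # at most x units can leave depot d
    for r, y in demands.items():
        cap[(r, 't')] = y           # at most y units can enter site r
    for d, r in links:
        cap[(d, r)] = inf           # middle edges are unconstrained
    return cap
```

The value of a maximum flow in this network is then the maximum amount of goods that can be shipped while respecting every supply and demand.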
Exercises
# 10-1 (Ford-Fulkerson algorithm)
Find the maximum flow from s to t in the network in the figure below using the Ford-
Fulkerson algorithm. Show the residual network and augmenting path in each step.
[Figure: a flow network with source s, sink t, and intermediate vertices a, b, c, d, e, f, with edge capacities between 2 and 5.]
(a) Describe in words how this can be transformed into a network flow problem. You
should state clearly what the vertices, edges and edge capacities are.
(b) Construct the network according to (a) for this particular example.
(c) Find the maximum flow and hence solve the blood transfusion problem.