Data Structures
A data structure represents a particular way of organizing data in the computer memory together with a set of operations on this data. The problem of designing efficient data structures is tightly linked with algorithm design. While some data structures are very specific to particular algorithms, a set of data structures are so common that it is important to know them and to understand their pros and cons.
In this chapter we will recall the most common data structures, which you should have seen in introductory programming courses, and their runtime complexities for common operations. We will see that for a given set of operations, we usually have different possible implementations, each having its own benefits and limitations.
Lists
A list is a very common data structure used to store values where each value is associated with an index between 0 and n − 1, with n the number of elements in the list (of course, indices could start at 1 instead of 0, this is a convention).

Operation            Description
at(i)                return the element at index i
previous(e)          return the element before the element e
next(e)              return the element after the element e
insert_after(e, ne)  insert the new element ne after the given element e
insert_front(ne)     insert the new element ne at the front of the list
insert_back(ne)      insert the new element ne at the end of the list
remove(e)            remove the element e from the list
extend(l)            insert all the elements of the list l at the end of the current list
The most common ways to implement those operations are dynamic arrays and linked lists.
Note
In object-oriented programming languages, the data structure List is often represented by an interface or an abstract class. Dynamic arrays and linked lists are then two different classes that implement this interface or that inherit from this abstract class. For example in Java, List is the interface, ArrayList is a class implementing this interface as a dynamic array and LinkedList is a class implementing this interface as a linked list.
Dynamic array
The dynamic array is a very common implementation of lists, readily available in the standard library of many programming languages: ArrayList in Java, std::vector in C++, list in Python, List in C#, Array in JavaScript…
In this case, the list elements are simply stored in an array. The size of the underlying array is called the capacity; it may be larger than or equal to the number of elements n in the list: the additional free space is used when new elements are added to the list.
Fig. 7 A dynamic array, with 3 elements and a capacity of 6.
class List:
    def __init__(self):
        self.array = [None]  # underlying array, the current capacity is the length of the array
        self.size = 0        # number of elements in the list
Elements in a dynamic array can be accessed in constant time O(1) ( at , previous , next ). Adding or removing an element at the front or somewhere in the middle of the array requires moving all the elements after the position of the insertion or deletion, leading to a linear worst-case runtime complexity Θ(n).
Fig. 8 Inserting an element in the middle of a dynamic array requires moving all the elements after the position of insertion.
For example, the pseudocode for removing the front element can be:
class List:
    ...
    def remove_front(self):
        # move all the elements to the left, the first one is discarded
        for i in range(1, self.size):
            self.array[i - 1] = self.array[i]
        self.size = self.size - 1
Because the elements of a dynamic array are packed at the beginning, removing and adding elements at the end of the array is simpler. However, a difficulty arises when the size of the list reaches its capacity: inserting a new element then requires increasing the size of the underlying array. The pseudocode of the insert_back function can be:

class List:
    ...
    def insert_back(self, ne):
        # if the underlying array is full, reallocate a larger one
        if self.size == len(self.array):
            self.increase_array_size()
        self.array[self.size] = ne
        self.size = self.size + 1
In practice, these two operations, insert_back and remove_back , are done in constant amortized time Θ(1). The term amortized here means that over a sequence of k insertions at the back of the array, the average complexity per insertion is constant; for a single insertion, the complexity might indeed be worse (linear in this case).
import math

growth_factor = 1.5  # any constant > 1 works

class List:
    ...
    def increase_array_size(self):
        # allocate a new, larger array
        new_capacity = math.ceil(len(self.array) * growth_factor)
        new_array = [None] * new_capacity  # new array of size new_capacity
        # copy the old array into the new array
        for i in range(self.size):
            new_array[i] = self.array[i]
        # replace the old array by the new one
        self.array = new_array
Let’s consider a simple example, where we start with an empty list of capacity 1 and a growth factor of 2. Assume that we have a sequence of 2^p insertions, with p > 1. Reallocations occur when the capacity is equal to 1, 2, 4, 8, 16, …, 2^(p−1), and at each reallocation we have to copy all the elements of the array (so we have to copy 1, 2, 4, 8, 16 elements…). Thus, to insert our 2^p elements, the total number of element copies will be:

1 + 2 + 4 + 8 + … + 2^(p−1) = 2^p − 1.

Therefore, on average, the number of copies per insertion is constant. Some insertions still take a linear time Θ(n), but those costly insertions are rare and their cost is amortized by all the insertions that are done in constant time Θ(1). This result can be generalized to any sequence of insertions and removals.
The growth factor is a trade-off between memory and runtime: a large growth factor will lead to fewer reallocations but will waste more memory.
Finding an element in a dynamic array can be done in linear time Θ(n) with the linear search algorithm that was described in the
introductory example Algorithm complexity.
Merging two dynamic arrays is also done in linear time with respect to the size of the two lists.
Dynamic arrays are extremely common in practice because they are simple to implement and efficient for most operations. It is
thus essential to be comfortable with using them correctly in algorithms.
Propose an algorithm unique(A) that returns the list of the distinct elements of a list A, without duplicates. For example: a possible solution of unique([4,3,22,3,11,4,-1,3,-1]) is [-1,3,4,11,22] . Note that the order of the elements in unique(A) does not matter.

A simple strategy to do this is to sort the elements of A and then browse the elements of the sorted list: whenever an element is different from the previous one, it means that we have found a new unique element.
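A possible sketch of this strategy (using Python's built-in sorted, which runs in Θ(n log(n))):

def unique(a):
    # sort a copy of the list so that equal elements become adjacent
    sorted_a = sorted(a)
    result = []
    for i, e in enumerate(sorted_a):
        # keep an element only if it differs from the previous one
        if i == 0 or e != sorted_a[i - 1]:
            result.append(e)
    return result

print(unique([4, 3, 22, 3, 11, 4, -1, 3, -1]))  # [-1, 3, 4, 11, 22]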
Propose an algorithm inter(A, B) that returns the list of the elements that appear in both lists A and B. For example, if we have the two lists [3, 1, 5, 2, 4] and [2, 7, 8, 5, 3] , the intersection list, denoted inter([3, 1, 5, 2, 4], [2, 7, 8, 5, 3]) , should be [3, 5, 2] . Note that the order of the elements in the intersection list is not important.

A simple strategy to do this is to sort the elements of A and B, and then to browse the elements of the two sorted arrays in parallel (a sketch is given after this list):
- if the current elements in A and B are the same, we have found a new common element; if this element is different from the last found element, then we add it to the intersection list;
- if the current element of A (resp. B) is smaller than the current element of B (resp. A), then we move to the next element in A (resp. B).
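A possible sketch of this strategy:

def inter(a, b):
    sa, sb = sorted(a), sorted(b)
    i = j = 0
    result = []
    while i < len(sa) and j < len(sb):
        if sa[i] == sb[j]:
            # common element found; skip it if it was already added
            if not result or result[-1] != sa[i]:
                result.append(sa[i])
            i += 1
            j += 1
        elif sa[i] < sb[j]:  # advance in the list with the smallest current element
            i += 1
        else:
            j += 1
    return result

print(inter([3, 1, 5, 2, 4], [2, 7, 8, 5, 3]))  # [2, 3, 5]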
Give the worst-case runtime complexity of your algorithm in terms of the size of the two lists n and m.
Linked list
Linked lists do not rely on a contiguous array to store elements; instead, they store each element in an individual node. In order to keep track of all the nodes in the list, each node also stores a link (pointer or reference) toward the next node in the list, and possibly the previous one. A linked list that only stores a link toward the next node is called a singly linked list; a list where each node contains a link toward the previous and the next node is called a doubly linked list.
Fig. 9 A singly and a doubly linked list with 3 elements.
The first node of a linked list is called the head and the last one the tail. We assume that it is possible to know the size of singly and doubly linked lists in constant time Θ(1) (in practice, the linked list data structure stores an integer which is incremented/decremented whenever an element is inserted/removed). Contrary to dynamic arrays, it is possible to insert an element anywhere in a linked list in constant time Θ(1) because there is no need to reorganize the other elements of the list: only a few links need to be updated. Removing an element can be done in constant time anywhere in a doubly linked list; however, in a singly linked list, we need to know the previous node in order to update its reference to the next node: if this node is known, the removal can be done in constant time, otherwise it takes linear time.
Fig. 10 Inserting an element in the middle of a linked list implies only local changes to the next (and, in the case of doubly linked lists, previous) nodes.
However, with linked lists, we lose efficient random access: to get the k-th element of a linked list, one must browse the first k cells of the list from its head, leading to a linear time Θ(n) access. Accessing the next element is done in constant time by following the link to the next node. With a doubly linked list, accessing the previous element is also done in constant time; however, with a singly linked list, as we don't have a link to the previous node, one has to browse the list from its head to find the previous element, leading to a linear time operation.
Finding an element in a linked list can be done in linear time Θ(n) with the linear search algorithm that was described in the
introductory example Algorithm complexity.
Extending a linked list can be done in constant time by just gluing the tail of the first list with the head of the second list.
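As an illustration, here is a minimal sketch of a singly linked list node and of a constant-time insertion after a given node (the Node class and insert_after function below are illustrative, not a reference implementation):

class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next  # link toward the next node (None for the tail)

def insert_after(node, value):
    # only the link of the given node is updated: Θ(1)
    node.next = Node(value, node.next)

# build the list 3 -> 5 -> 2, then insert 7 after the head
head = Node(3, Node(5, Node(2)))
insert_after(head, 7)  # the list is now 3 -> 7 -> 5 -> 2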
Summary
In summary, the worst-case runtime complexities of the operations of these 3 possible implementations of list are:

                    at    previous  next  insert_after  remove  insert_front  insert_back  extend
Dynamic array       Θ(1)  Θ(1)      Θ(1)  Θ(n)          Θ(n)    Θ(n)          Θ(1)∗        Θ(m)∗
Singly linked list  Θ(n)  Θ(n)      Θ(1)  Θ(1)          Θ(n)    Θ(1)          Θ(n)         Θ(n)
Doubly linked list  Θ(n)  Θ(1)      Θ(1)  Θ(1)          Θ(1)    Θ(1)          Θ(1)         Θ(1)

with n the size of the list (and m the size of the second list in extend ) and ∗ denotes amortized complexity.
Exercice 3 : Choosing the right implementation
The following algorithms take a list as parameter:
def algorithm_1(list):
    for e in list:
        if list.previous(e) is defined:
            list.previous(e) = e

def algorithm_2(list):
    for e in list:
        if list.next(e) is defined:
            list.next(e) = e / 2

def algorithm_3(list):
    for i in range(len(list) // 2):
        list[i] = list[len(list) - 1 - i]

def algorithm_4(list):
    for e in list:
        if e < 0:
            list.remove(e)

def algorithm_5(list):
    x = list[0]
    while x is defined:
        if x > 0:
            list.insert_front(x)
        x = list.next(x)
Give the worst case complexity of each of these algorithms, for each of the three list implementations described above:
algorithm_1
algorithm_2
algorithm_3
algorithm_4
algorithm_5
Stacks
A stack, or LIFO (Last In First Out) data structure, is a container with only 2 operations:

Operation  Description
push(e)    insert the element e at the top of the stack
pop()      remove and return the element at the top of the stack

A third operation called top(), which returns the top element without removing it, is also often proposed for efficiency (note that it can be obtained by combining pop and push ). A stack can therefore only be accessed from one end.
Fig. 11 Representation of a stack: the elements can only be retrieved in the reverse order of the order in which they were placed on the stack (image source: Wikipedia).
Propose an algorithm that evaluates an arithmetic expression given in postfix notation, where each operator follows its two operands. For example, given the expression 3 4 + 5 * , you should return 35 (since 3 + 4 = 7 and 7 * 5 = 35 ).

A simple algorithm can solve this problem in linear time Θ(n) with a stack. The idea is to browse the expression from left to right (a sketch is given after this list):
- when we encounter a number, we push it on the stack;
- when we encounter an operator, we pop its two operands from the stack, apply the operator, and push the result back on the stack.

At the end of the expression, the stack contains a single element: the value of the expression.
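A possible sketch of this algorithm, assuming that the tokens of the expression are separated by spaces:

def eval_postfix(expression):
    stack = []
    for token in expression.split():
        if token in ("+", "-", "*", "/"):
            right = stack.pop()  # the operands come out in reverse order
            left = stack.pop()
            if token == "+":
                stack.append(left + right)
            elif token == "-":
                stack.append(left - right)
            elif token == "*":
                stack.append(left * right)
            else:
                stack.append(left / right)
        else:
            stack.append(float(token))  # a number: push it on the stack
    return stack.pop()  # the single remaining element is the value

print(eval_postfix("3 4 + 5 *"))  # 35.0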
We are given a string that may contain parentheses () and square brackets [] , and the goal is to determine whether its parentheses and square brackets are “balanced” or not.

For example:
- “[()[]]” is balanced
- “[()” is not balanced (the first left square bracket has no corresponding right square bracket)
- “[(])” is not balanced (although the numbers of left and right parentheses and square brackets match, they are interleaved)

A simple algorithm can solve this problem with a stack. The idea is the following one: we scan the string from left to right, and each time we encounter a left parenthesis or a left square bracket, we push it on the stack. Each time we encounter a right parenthesis or a right square bracket, we pop the stack and check that the popped element is the corresponding left parenthesis or left square bracket. The string is not balanced if the stack is empty when we try to pop, or if the stack is not empty at the end of the scan.
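A possible sketch of this algorithm:

def is_balanced(s):
    matching = {")": "(", "]": "["}
    stack = []
    for c in s:
        if c in "([":
            stack.append(c)
        elif c in ")]":
            # a right symbol must match the most recent unmatched left symbol
            if not stack or stack.pop() != matching[c]:
                return False
    return not stack  # unmatched left symbols may remain

print(is_balanced("[()[]]"))  # True
print(is_balanced("[("))      # False
print(is_balanced("[(])"))    # False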
Queues
A queue, or FIFO (First In First Out) data structure, is a container with only 3 operations:

Operation   Description
enqueue(e)  insert the element e at the back of the queue
dequeue()   remove and return the element at the front of the queue
front()     return the element at the front of the queue without removing it

Elements are thus inserted at one end of the queue and retrieved at the other end.
Fig. 12 Representation of a queue: the elements are retrieved in their order of insertion.
Propose an algorithm that prints the binary representations of the numbers from 1 to n, in increasing order. For example, for n = 10 , the algorithm should print: 1, 10, 11, 100, 101, 110, 111, 1000, 1001, 1010

A simple algorithm can solve this problem with a queue (a sketch is given below). The idea is the following one: we start with the number 1, and we enqueue it. Then, we dequeue the first number x in the queue, print it, and enqueue its two successors x0 and x1 in the queue (for example, if x = 10, we enqueue 100 and 101). We repeat this operation until we have printed n numbers.
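A possible sketch of this algorithm, using collections.deque as the queue:

from collections import deque

def print_binary(n):
    queue = deque(["1"])
    for _ in range(n):
        x = queue.popleft()  # dequeue the first number and print it
        print(x, end=" ")
        queue.append(x + "0")  # enqueue the two successors of x
        queue.append(x + "1")

print_binary(10)  # 1 10 11 100 101 110 111 1000 1001 1010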
Dequeues
A dequeue (double-ended queue), pronounced "deck" (like "cheque"), is a container with 6 operations:

Operation        Description
insert_front(e)  insert the element e at the front
insert_back(e)   insert the element e at the back
remove_front()   remove and return the element at the front
remove_back()    remove and return the element at the back
front()          return the element at the front without removing it
back()           return the element at the back without removing it

Dequeues can be implemented with dynamic arrays or linked lists; an efficient implementation is proposed in an optional exercise below.
Priority queues
In a priority queue, each element is associated with a priority, i.e. a number which indicates the importance of the element. Instead of being retrieved in the order of insertion, like in a classical queue, the elements are retrieved in order of priority: the most important element is served first. By convention, we assume that the lower the priority value, the more important the element is (for example, an element with priority 2 comes before an element with priority 8).

Operation           Description
insert(e, p)        insert the new element e with priority p
find_min()          return the element with the smallest priority value
delete_min()        remove and return the element with the smallest priority value
decrease_key(e, p)  decrease the priority value of the element e to p
merge(pq)           merge the priority queue pq with the current priority queue
Priority queues are usually implemented with heaps, which can themselves rely on a partially sorted contiguous array (they are like a dynamic array with an additional heap property) or on a generalization of linked lists where nodes can have several successors.
The most common variants of heaps and the worst case runtime complexity of their associated priority queue operations are the
following:
heap\operation  find-min  delete-min  insert    decrease_key  merge
Binary heap     Θ(1)      Θ(log n)    Θ(log n)  Θ(log n)      Θ(n)
Binomial heap   Θ(log n)  Θ(log n)    Θ(1)∗     Θ(log n)      Θ(log n)
Fibonacci heap  Θ(1)      Θ(log n)∗   Θ(1)      Θ(1)∗         Θ(1)
Where n is the size of the heap and ∗ indicates amortized time complexity. While Fibonacci heaps have the best runtime complexity, they are also the most complex to implement, with a large runtime and memory overhead. That's why, in practice, it may be more interesting to use simpler heap implementations, especially the binary heap, which only requires a dynamic array, in particular when the merge operation is not needed.
Programming exercise : Implement heap sort in the python notebook and verify that the experimental runtime matches the theoretical complexity.
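For reference, a minimal sketch of heap sort based on Python's heapq module (a binary min-heap); the notebook presumably asks for a from-scratch heap, this version only illustrates the principle: build a heap from the array, then repeatedly extract the minimum.

import heapq

def heap_sort(a):
    heap = list(a)       # copy, so that a is not modified
    heapq.heapify(heap)  # build a binary min-heap
    # n successive delete-min operations
    return [heapq.heappop(heap) for _ in range(len(a))]

print(heap_sort([4, 3, 22, 3, 11]))  # [3, 3, 4, 11, 22]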
Assume that we use a binary heap in the above algorithm: what is the worst case runtime complexity of this algorithm? (n is the size of the array)
Can we improve the worst case runtime complexity of heap-sort by using binomial or Fibonacci heaps?
We want to find the k smallest elements of a list of n elements.
1. Propose a first algorithm in pseudocode with a linearithmic runtime Θ(n log(n)) using heap-sort.
2. Propose a second algorithm in pseudocode with a runtime Θ(n + k log(n)) using a binomial heap.
3. Propose a third algorithm in pseudocode with a runtime Θ(n log(k)) using a binary heap (you can consider that the heap gives access to the maximal element instead of the minimal one). Hint: the heap doesn't need to hold more than k elements. A sketch of this approach is given after this list.
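A possible sketch of the third approach: a max-heap bounded to k elements, emulated here with heapq (a min-heap) by negating the values:

import heapq

def k_smallest(a, k):
    # max-heap containing the k smallest elements seen so far (values are negated)
    heap = [-x for x in a[:k]]
    heapq.heapify(heap)  # Θ(k)
    for x in a[k:]:  # n - k iterations
        if x < -heap[0]:  # compare with the largest of the k kept elements
            heapq.heapreplace(heap, -x)  # Θ(log(k))
    return sorted(-x for x in heap)

print(k_smallest([4, 3, 22, 3, 11, 7], 3))  # [3, 3, 4]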
Dictionaries
A dictionary is a data structure where each stored element is associated with a unique key (no two elements can have the same key). In some way, dictionaries are a generalization of lists: in a list, each element is associated with a unique integer, its index, between 0 and the length of the list minus 1; in a dictionary, indices can be (mostly) anything and are called keys. Dictionaries are also called associative arrays or maps.
Operation   Description
put(k, ne)  insert the new element ne associated with the key k
get(k)      return the element associated with the key k
remove(k)   remove the element associated with the key k

The two most common implementations of dictionaries are hash maps and search trees.
Hash maps rely on an auxiliary hash function that associates an index with any key, which allows storing the elements in a classical array at the index given by the hash of their associated key. However, it is possible that two different keys have the same hash value and are hence associated with the same index: in this case, we say that there is a collision between the two keys, and special techniques must be used to handle this problem. If we assume that the hash function is good (somehow, it produces few collisions), then the three operations put , get and remove have a constant average runtime Θ(1). However, the worst case is linear O(n) with respect to the size n of the hash map.
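To make the collision handling concrete, here is a minimal, hypothetical hash map using separate chaining, where all the keys hashing to the same index share a bucket (real implementations are considerably more sophisticated):

class HashMap:
    def __init__(self, capacity=8):
        self.buckets = [[] for _ in range(capacity)]  # one chain per index

    def put(self, k, ne):
        bucket = self.buckets[hash(k) % len(self.buckets)]
        for i, (key, _) in enumerate(bucket):
            if key == k:  # the key already exists: replace its element
                bucket[i] = (k, ne)
                return
        bucket.append((k, ne))  # a collision simply extends the chain

    def get(self, k):
        bucket = self.buckets[hash(k) % len(self.buckets)]
        for key, value in bucket:
            if key == k:
                return value
        raise KeyError(k)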
Search trees, on the other hand, assume that the keys are totally ordered: for any two keys k1 and k2, we have either k1 ≤ k2 or k2 ≤ k1. The three operations put , get and remove have a logarithmic average and worst-case runtime O(log(n)) (for balanced search trees). Search trees are thus less efficient than hash maps on average, but they behave better in the worst case.
Warning
Dictionaries have become very popular as general purpose containers with high level interpreted languages such as Python and JavaScript. Existing implementations of dictionaries manage to have good runtime complexities on average, but they are complex data structures with a significant constant overhead, both in terms of runtime and memory. While this overhead can be ignored in high level interpreted scripts, it becomes important when we consider the efficient implementation of algorithms; it is thus a good idea to avoid dictionaries when simpler data structures can be used.
True or false?
- A doubly linked list is a good container to store a static (that does not change over time) collection of items.
- A dynamic array is a good container to store a static (that does not change over time) collection of items.
- We can insert/remove elements efficiently at the front and at the back of a dequeue.
- Searching if an element exists in a list is efficient.
Going further
Exercice 10 : Dynamic array insertions
Compare the effect of two different resizing strategies for dynamic arrays:
1. Linear: the capacity is increased by a constant k = 5 each time the array is full.
2. Geometric: the capacity is multiplied by a constant k = 2 each time the array is full.
Programming exercise : Implement the two functions new_size and insert_back of the class dynamic array in the
python notebook and compare the runtime of the two strategies for a sequence of insertions.
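A possible sketch of the two strategies (the function name new_size comes from the notebook statement):

def new_size_linear(capacity, k=5):
    # linear strategy: add a constant amount of space
    return capacity + k

def new_size_geometric(capacity, k=2):
    # geometric strategy: multiply the capacity by a constant factor
    return capacity * k

With the linear strategy, a sequence of n insertions triggers about n/k reallocations, each copying up to n elements, hence Θ(n²) total work; the geometric strategy keeps the total work linear, as shown in the amortized analysis above.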
A dequeue can be implemented with a dynamic array in which the elements are not necessarily packed at the front: the index of the first element is stored in a variable start, leaving free space at both ends.

Fig. 13 A dequeue with 3 elements (2, 3, 5) and a capacity of 5. The elements in the array are not packed at the front: the first value is located at the index start (1 here).
If the space becomes insufficient at the front or at the back, a larger array is allocated and the elements of the old array are copied into the middle of the new one. If the new capacity is c and the number of elements in the dequeue is n, the new start index is simply ⌊(c − n)/2⌋.
Fig. 14 Insertion at the front of a dequeue with no space left at the front: (1) insert_front is called, (2) the array size is increased and the elements are recentered, (3) the new element is inserted before start.
Programming exercise : Implement, in the python notebook, a dequeue container based on a dynamic array; all operations must have a constant (amortized) runtime.
For the array container, we will use a Numpy array (to avoid using Python lists which are already dynamic arrays). We will
use a growth factor of 2. For simplicity we never shrink the array when elements are removed.
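A partial sketch of such a container (only insert_front and the reallocation are shown; the details are illustrative, not the notebook's reference solution):

import numpy as np

class Dequeue:
    def __init__(self, capacity=4):
        self.array = np.empty(capacity, dtype=object)
        self.start = capacity // 2  # index of the first element
        self.size = 0

    def insert_front(self, e):
        if self.start == 0:  # no space left at the front
            self.increase_array_size()
        self.start -= 1
        self.array[self.start] = e
        self.size += 1

    def increase_array_size(self):
        # allocate a twice larger array and recenter the elements
        new_array = np.empty(2 * len(self.array), dtype=object)
        new_start = (len(new_array) - self.size) // 2
        new_array[new_start:new_start + self.size] = self.array[self.start:self.start + self.size]
        self.array, self.start = new_array, new_start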
Propose a non-recursive algorithm that reverses a singly linked list of n elements in linear time Θ(n) using only a constant amount of memory O(1) (you cannot allocate an array with the same size as the list, for example).
Programming exercise : Implement the singly linked list reverse function in the python notebook
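A possible iterative solution, assuming nodes expose a next link as in the earlier singly linked list sketch:

def reverse(head):
    previous = None
    current = head
    while current is not None:
        # flip the link of the current node, then advance
        current.next, previous, current = previous, current, current.next
    return previous  # the new head is the old tail

head = Node(3, Node(5, Node(2)))
head = reverse(head)  # the list is now 2 -> 5 -> 3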