Comp 272 Notes
An interface tells us nothing about how the data structure implements these operations; it
only provides a list of supported operations along with specifications about what types of
arguments each operation accepts and the value returned by each operation.
A data structure implementation, on the other hand, includes the internal representation of
the data structure as well as the definitions of the algorithms that implement the
operations supported by the data structure. Thus, there can be many implementations of a
single interface. For example, in Chapter 2, we will see implementations of the List
interface using arrays and in Chapter 3 we will see implementations of the List interface
using pointer-based data structures. Each implements the same interface, List, but in
different ways.
Use mathematical concepts required to understand data structures and algorithms.
We use the w-bit word-RAM model. RAM stands for Random Access Machine. In this
model, we have access to a random access memory consisting of cells, each of which stores a
w-bit word. This implies that a memory cell can represent, for example, any integer in the set
{0, ..., 2^w - 1}.
In the word-RAM model, basic operations on words take constant time. This includes arithmetic
operations (+, -, *, /, %), comparisons (<, >, <=, >=, =), and bitwise boolean
operations (bitwise-AND, OR, and exclusive-OR).
Any cell can be read or written in constant time. A computer's memory is managed by a memory
management system from which we can allocate or deallocate a block of memory of any size we
would like. Allocating a block of memory of size k takes O(k) time and returns a reference (a
pointer) to the newly-allocated memory block. This reference is small enough to be represented
by a single word.
The word-size, w, is an important parameter of this model. The only assumption we
make about w is the lower-bound w >= log n, where n is the number of elements stored in
any of our data structures. This is a fairly modest assumption, since otherwise a word is not even
big enough to count the number of elements stored in the data structure.
Space is measured in words, so that when we talk about the amount of space used by a data
structure, we are referring to the number of words of memory used by the structure. All of our
data structures store values of a generic type T, and we assume an element of type T occupies one
word of memory.
Apply correctness, time complexity, and space complexity to data structures and algorithms.
Correctness:
The data structure should correctly implement its interface.
Time complexity:
The running times of operations on the data structure should be as small as possible.
Space complexity:
The data structure should use as little memory as possible.
Worst-case running times:
These are the strongest kind of running time guarantees. If a data structure operation has a worst-case running time of f(n), then a sequence of m such operations takes at most m*f(n) time.
Amortized running times:
If we say that the amortized running time of an operation is f(n), then this means that the cost of a typical operation is at most f(n): some individual operations may take longer, but a sequence of m operations takes at most m*f(n) time in total.
Expected running times:
If we say that the expected running time of an operation is f(n), this means that the actual running time is a random variable and the expected value of that random variable is at most f(n).
Linked lists store their elements in a sequence of nodes, so accessing the i-th element requires
walking through the list one element at a time until we reach the i-th element. The primary
advantage is that they are more dynamic: Given a reference to any list node u, we can delete u
or insert a node adjacent to u in constant time, no matter where u is in the list.
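To illustrate the constant-time removal mentioned above, here is a minimal doubly-linked-list sketch (the class and field names are illustrative, not the textbook's implementation):

// Removing a node from a doubly-linked list in constant time,
// given only a reference to the node.
class DLList<T> {
    static class Node<T> {
        T x;
        Node<T> prev, next;
        Node(T x) { this.x = x; }
    }

    Node<T> head, tail;
    int n;

    // Unlink node u; works no matter where u is in the list.
    void remove(Node<T> u) {
        if (u.prev != null) u.prev.next = u.next; else head = u.next;
        if (u.next != null) u.next.prev = u.prev; else tail = u.prev;
        u.prev = u.next = null;
        n--;
    }
}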
Unit 4: Skiplists
Implement skiplists.
Unit 5: Hash Tables
Explain hash functions (division, multiplication, folding, radix transformation, digit
rearrangement, length-dependent, mid-square).
Estimate the effectiveness of hash functions (division, multiplication, folding, radix
transformation, digit rearrangement, length-dependent, mid-square).
Differentiate between various hash functions (division, multiplication, folding, radix
transformation, digit rearrangement, length-dependent, mid-square).
Division Hashing
Division hashing using a prime number is quite popular. Here is a list of prime numbers that you
can use and the range in which they are effective:
53: effective for table sizes from 2^5 to 2^6
97: effective for table sizes from 2^6 to 2^7
193: effective for table sizes from 2^7 to 2^8
389: effective for table sizes from 2^8 to 2^9
769: effective for table sizes from 2^9 to 2^10
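A minimal sketch of division hashing in Java, using one of the primes listed above as the table size (the class and method names are illustrative):

// Division hashing: map a non-negative integer key into a table of prime size.
public class DivisionHash {
    static final int TABLE_SIZE = 97; // prime chosen from the list above

    static int hash(int key) {
        return key % TABLE_SIZE;      // slot in the range 0..96
    }

    public static void main(String[] args) {
        System.out.println(hash(1234)); // 1234 % 97 = 70
    }
}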
The search keys that we intend to hash can range from simple numbers to the entire text content
of a book. The original object could be textual or numerical or in any medium. We need to
convert each object into its equivalent numerical representation in preparation for hashing. That
numerical representation is then used as the input to the hash function.
Digit Rearrangement
Here, some of the search key's digits, say those in positions 2 through 4, are manipulated (for
example, reversed), resulting in a new search key.
For example, if our key is 1234567, we might select the digits in positions 2 through 4, yielding
234. The manipulation can then take many forms:
reversing the digits (432), resulting in a key of 1432567
performing a circular shift to the right (423), resulting in a key of 1423567
performing a circular shift to the left (342), resulting in a key of 1342567
swapping each pair of digits (324), resulting in a key of 1324567.
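As a concrete sketch of the first manipulation above (reversal), written with assumed helper names:

// Digit rearrangement on key 1234567: take digits at positions 2..4 ("234"),
// reverse them, and splice the result back into the key.
public class DigitRearrangement {
    static String rearrange(String key, int from, int to) { // 1-indexed, inclusive
        String mid = key.substring(from - 1, to);            // "234"
        String reversed = new StringBuilder(mid).reverse().toString(); // "432"
        return key.substring(0, from - 1) + reversed + key.substring(to);
    }

    public static void main(String[] args) {
        System.out.println(rearrange("1234567", 2, 4)); // prints 1432567
    }
}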
Length-Dependent Hashing
In this method, the key and the length of the key are combined in some way to form either the
index itself or an intermediate version. For example, if our key is 8765, we might multiply the
first two digits by the length and then divide by the last digit, yielding 69 (87 * 4 / 5 = 69). If our
table size is 43, we then take 69 mod 43 = 26 as the home slot.
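A small sketch of this worked example (the method name and 43-slot table are from the example above; the helper structure is illustrative):

// Length-dependent hashing for the worked example: key 8765, length 4.
// First two digits (87) * length (4) / last digit (5) = 69; 69 mod 43 = 26.
public class LengthDependentHash {
    static int hash(String key, int tableSize) {
        int firstTwo = Integer.parseInt(key.substring(0, 2));
        int lastDigit = Character.getNumericValue(key.charAt(key.length() - 1));
        int intermediate = firstTwo * key.length() / lastDigit; // 87*4/5 = 69
        return intermediate % tableSize;                        // 69 % 43 = 26
    }

    public static void main(String[] args) {
        System.out.println(hash("8765", 43)); // prints 26
    }
}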
Collisions happen when two search keys are hashed into the same slot in the table. There
are many ways to resolve collisions in hashing. Alternatively, one can discover a hash
function that is perfect, meaning that it maps each search key into a different hash value.
Unfortunately, perfect hash functions are effective only in situations where the inputs are
fixed and known in advance. A sub-category of perfect hash is minimal perfect hash,
where the range of the hash values is also limited, yielding a compact hash table.
If we are able to develop a perfect hashing function, we do not need to be concerned
about collisions or table size. However, often we do not know the size of the input dataset
and are not able to develop a perfect hashing function. In these cases, we must choose a
method for handling collisions.
For almost all hash functions, it is possible that more than one key is assigned to the same
table slot. For example, if the hash function computes the slot based just on the first letter
of the key, then all keys starting with the same letter will be hashed to the same slot,
resulting in a collision.
Collisions can be partially avoided by choosing another hash function, one that computes
the slot based on the first two letters of the key. However, even if a hash function is chosen in
which all the letters of the key participate, there is still a possibility that a number of keys
may hash to the same slot in the hash table.
Another factor that can be used to avoid collision of multiple keys is the size of the hash
table. A larger size will result in fewer collisions, but that will also increase the access
time during retrieval.
A number of strategies have been proposed to handle collisions of multiple keys.
Strategies that look for another open position in the table other than the one to which the
slot is originally hashed are called open addressing strategies. We will examine three
open addressing strategies: linear probing, quadratic probing, and double hashing.
Linear Probing
When a collision takes place, you should search for the next available position in the
table by making a sequential search. Thus the hash values are generated by
after-collision: h(k, i) = (h(k) + p(i)) mod TableSize,
where p(i) is the probing function after the ith probe. The probing function is one that
looks for the next available slot in case of a collision. The simplest probing function is
linear probing, for which p(i) = i, so each successive probe moves one more slot past the home position.
Consider a simple example with table of size 10, hence mod 10. After hashing keys 22, 9,
and 43, the table is shown below. Note that initially a simple division hashing function,
h(k) = k mod TableSize, works fine. We will use the modified hash function, h(k, i) =
[h(k) + i ] mod TableSize, only when there is a collision.
slot: 0 | 1 | 2  | 3  | 4 | 5 | 6 | 7 | 8 | 9
key:    |   | 22 | 43 |   |   |   |   |   | 9
When keys 32 and 65 arrive, they are stored as follows. Note that the search key 32
results in a hash value of 2, but slot 2 is already occupied. Thus, using the modified hash
function, with i = 1, a new hash value of 3 is obtained. However, slot 3 is also occupied,
so we reapply the modified hash function. This results in a slot value of 4 that houses the
search key 32. The search key 65 directly hashes to slot 5.
slot: 0 | 1 | 2  | 3  | 4  | 5  | 6 | 7 | 8 | 9
key:    |   | 22 | 43 | 32 | 65 |   |   |   | 9
Suppose we have another key with value of 54. The key 54 cannot be stored in its
designated place because it collides with 32, so a new place for it is found by linear
probing to position 6, which is empty at this point:
slot: 0 | 1 | 2  | 3  | 4  | 5  | 6  | 7 | 8 | 9
key:    |   | 22 | 43 | 32 | 65 | 54 |   |   | 9
When the search reaches end of the table, it continues from the first location again. Thus
the key 59 will be stored as follows:
slot: 0  | 1 | 2  | 3  | 4  | 5  | 6  | 7 | 8 | 9
key:  59 |   | 22 | 43 | 32 | 65 | 54 |   |   | 9
In linear probing, the keys start forming clusters, which have a tendency to grow fast
because more and more collisions take place and the new keys get attached to one end of
the cluster. These are called primary clusters.
The problem with such clusters is that they slow down searches, especially unsuccessful ones:
an unsuccessful search must probe through the entire cluster, possibly reaching the end of the
table and continuing from the beginning.
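The insertion procedure sketched above might look like this in Java; this is a minimal open-addressing table with no deletion or resizing, and the names are illustrative:

// Linear probing insertion: h(k, i) = (h(k) + i) mod TableSize.
public class LinearProbingTable {
    private final Integer[] table;

    LinearProbingTable(int size) { table = new Integer[size]; }

    int hash(int key) { return key % table.length; }

    // Returns the slot where the key was stored, or -1 if the table is full.
    int insert(int key) {
        for (int i = 0; i < table.length; i++) {
            int slot = (hash(key) + i) % table.length; // probe sequence
            if (table[slot] == null) {
                table[slot] = key;
                return slot;
            }
        }
        return -1; // table full
    }

    public static void main(String[] args) {
        LinearProbingTable t = new LinearProbingTable(10);
        for (int k : new int[]{22, 9, 43, 32, 65, 54, 59}) {
            System.out.println(k + " -> slot " + t.insert(k));
        }
        // 22->2, 9->9, 43->3, 32->4, 65->5, 54->6, 59->0 (wraps past slot 9)
    }
}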
Quadratic Probing
To overcome the primary clustering problem, quadratic probing places the elements
further away rather than in immediate succession.
Let h(k) be the hash function that maps a search key k to an integer in [0, m - 1], where m is
the size of the table. The modified hash function is used to probe only after a collision has been
observed. One choice is the following quadratic function for the ith probe:
after collision: h(k, i) = (h(k) + c1*i + c2*i^2) mod TableSize, where c2 is not equal to 0.
If c2 = 0, then this hash function degenerates to a linear probe. For a given hash table, the
values c1 and c2 remain constant. For m = 2^n, a good choice for the constants is c1 = c2 = 1/2.
For a prime m > 2, most choices of c1 and c2 will make h(k, i) distinct for i in [0, (m - 1)/2].
Such choices include c1 = c2 = 1/2; c1 = c2 = 1; and c1 = 0, c2 = 1.
Although using quadratic probing gives much better results than using linear probing, the
problem of cluster buildup is not avoided altogether. Such clusters are called secondary
clusters.
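A small sketch of the quadratic probe sequence with c1 = c2 = 1/2, a common choice when the table size is a power of two (names are illustrative):

// Quadratic probing: h(k, i) = (h(k) + c1*i + c2*i^2) mod m with c1 = c2 = 1/2,
// i.e. offset (i + i*i)/2, which visits every slot when m is a power of two.
public class QuadraticProbe {
    static int probe(int homeSlot, int i, int m) {
        return (homeSlot + (i + i * i) / 2) % m;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 8; i++) {
            System.out.print(probe(3, i, 8) + " "); // prints 3 4 6 1 5 2 0 7
        }
    }
}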
Double Hashing
The problem of secondary clustering is best addressed with double hashing. A second
hash function is used to resolve collisions.
Like linear probing, double hashing uses one hash value as a starting point and then
repeatedly steps forward an interval until the desired value is located, an empty location
is reached, or the entire table has been searched. But the resulting interval is decided
using a second, independent hash function; hence the name double hashing. Given
independent hash functions h1 and h2, the jth probing for value k in a hash table of size m
is
h(k, j) = (h1(k) + j * h2(k)) mod m
Whatever scheme is used for hashing, it is obvious that the search time depends on how
much of the table is filled up. The search time increases with the number of elements in
the table. In the worst case, one may have to go through all the table entries.
Similar to other open addressing techniques, double hashing becomes linear as the hash
table approaches maximum capacity. Also, it is possible for the secondary hash function
to evaluate to zero, for example, if we choose k = 5 with the following function: h2(k) =
5 - (k mod 7).
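A minimal sketch of the double-hashing probe sequence; here the second hash function is deliberately chosen so that it can never evaluate to zero, and the function choices are illustrative:

// Double hashing: h(k, j) = (h1(k) + j * h2(k)) mod m.
// h2 is defined as 1 + (k mod (m - 1)) so it is never zero.
public class DoubleHashProbe {
    static int h1(int k, int m) { return k % m; }
    static int h2(int k, int m) { return 1 + (k % (m - 1)); }

    static int probe(int k, int j, int m) {
        return (h1(k, m) + j * h2(k, m)) % m;
    }

    public static void main(String[] args) {
        int m = 11; // a prime table size keeps probe sequences well distributed
        for (int j = 0; j < 5; j++) {
            System.out.print(probe(5, j, m) + " "); // prints 5 0 6 1 7
        }
    }
}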
Separate Chaining
A popular and space-efficient alternative to the above schemes is separate chaining
hashing. Each position of the table is associated with a linked list or chain of structures
whose data field stores the keys. The hashing table is a table of references to these linked
lists. For example, with a table of size 10, the keys 78, 8, 38, 28, and 58 would all hash to the same position, position 8, in
the reference hash table.
In this scheme, the table can never overflow, because the linked lists are extended only
upon the arrival of new keys. A new key is always added to the front of the linked list,
thus minimizing insertion time. Many unsuccessful searches end at empty lists,
which makes them faster than in the open addressing schemes above. This is of course at the expense
of extra storage for linked-list references. While searching for a key, you must first locate
the slot using the hash function and then search through the linked list for the specific
entry.
In this example, John Smith and Sandra Dee end up in the same bucket: table entry 152.
Entry 152 points first to the John Smith object, which is linked to the Sandra Dee object.
Insertion of a new key requires appending to either end of the list in the hashed slot.
Deletion requires searching the list and removing the element.
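A minimal separate-chaining table using Java's LinkedList for the chains; this is an illustrative sketch with no resizing and integer keys only:

import java.util.LinkedList;

// Separate chaining: each table slot holds a linked list (chain) of keys.
public class ChainedHashTable {
    private final LinkedList<Integer>[] table;

    @SuppressWarnings("unchecked")
    ChainedHashTable(int size) {
        table = (LinkedList<Integer>[]) new LinkedList[size];
        for (int i = 0; i < size; i++) table[i] = new LinkedList<>();
    }

    private int hash(int key) { return Math.floorMod(key, table.length); }

    void insert(int key) { table[hash(key)].addFirst(key); } // add at the front

    boolean contains(int key) { return table[hash(key)].contains(key); }

    public static void main(String[] args) {
        ChainedHashTable t = new ChainedHashTable(10);
        for (int k : new int[]{78, 8, 38, 28, 58}) t.insert(k); // all chain at slot 8
        System.out.println(t.contains(38)); // true
    }
}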
Study carefully and thoroughly the section titled Separate Chaining in
http://en.wikipedia.org/wiki/Separate_chaining#Separate_chaining, particularly the two
different types of separate chaining: separate chaining with list heads and separate
chaining with other structures.
This web page also introduces you to coalesced hashing, Robin Hood hashing, cuckoo
hashing, and hopscotch hashing, which may help you understand how they differ from
each other.
Coalesced hashing
A hybrid of chaining and open addressing, coalesced hashing links together chains of nodes
within the table itself.[12] Like open addressing, it achieves space usage and (somewhat
diminished) cache advantages over chaining. Like chaining, it does not exhibit clustering effects;
in fact, the table can be efficiently filled to a high density. Unlike chaining, it cannot have more
elements than table slots.
Cuckoo hashing
Another alternative open-addressing solution is cuckoo hashing, which ensures constant lookup
time in the worst case, and constant amortized time for insertions and deletions. It uses two or
more hash functions, which means any key/value pair could be in two or more locations. For
lookup, the first hash function is used; if the key/value is not found, then the second hash
function is used, and so on. If a collision happens during insertion, then the key is re-hashed with
the second hash function to map it to another bucket. If all hash functions are used and there is
still a collision, then the key it collided with is removed to make space for the new key, and the
old key is re-hashed with one of the other hash functions, which maps it to another bucket. If that
location also results in a collision, then the process repeats until there is no collision or the
process traverses all the buckets, at which point the table is resized. By combining multiple hash
functions with multiple cells per bucket, very high space utilization can be achieved.
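A simplified sketch of cuckoo insertion with two hash functions and a single table; a real implementation would rehash or resize when the displacement loop gives up, and the hash function choices here are illustrative only:

// Simplified cuckoo hashing: each key lives in one of its two candidate slots;
// insertion displaces ("kicks out") an occupant when both candidates are taken.
public class CuckooTable {
    private final Integer[] table;

    CuckooTable(int size) { table = new Integer[size]; }

    private int h1(int k) { return Math.floorMod(k, table.length); }
    private int h2(int k) { return Math.floorMod(k * 31 + 7, table.length); }

    boolean contains(int k) {
        Integer a = table[h1(k)], b = table[h2(k)];
        return (a != null && a == k) || (b != null && b == k);
    }

    // Returns false if insertion fails; a full rebuild/resize would then be needed.
    boolean insert(int k) {
        if (contains(k)) return true;
        int slot = h1(k);
        for (int kicks = 0; kicks <= table.length; kicks++) {
            if (table[slot] == null) { table[slot] = k; return true; }
            int evicted = table[slot];   // kick out the current occupant
            table[slot] = k;
            k = evicted;
            // the evicted key moves to its *other* candidate slot
            slot = (slot == h1(k)) ? h2(k) : h1(k);
        }
        return false; // too many displacements: table should be resized/rehashed
    }
}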
Hopscotch hashing
Another alternative open-addressing solution is hopscotch hashing,[13] which combines the
approaches of cuckoo hashing and linear probing, yet seems in general to avoid their limitations.
In particular it works well even when the load factor grows beyond 0.9. The algorithm is well
suited for implementing a resizable concurrent hash table.
The hopscotch hashing algorithm works by defining a neighborhood of buckets near the original
hashed bucket, where a given entry is always found. Thus, search is limited to the number of
entries in this neighborhood, which is logarithmic in the worst case, constant on average, and
with proper alignment of the neighborhood typically requires one cache miss. When inserting an
entry, one first attempts to add it to a bucket in the neighborhood. However, if all buckets in this
neighborhood are occupied, the algorithm traverses buckets in sequence until an open slot (an
unoccupied bucket) is found (as in linear probing). At that point, since the empty bucket is
outside the neighborhood, items are repeatedly displaced in a sequence of hops. (This is similar
to cuckoo hashing, but with the difference that in this case the empty slot is being moved into the
neighborhood, instead of items being moved out with the hope of eventually finding an empty
slot.) Each hop brings the open slot closer to the original neighborhood, without invalidating the
neighborhood property of any of the buckets along the way. In the end, the open slot has been
moved into the neighborhood, and the entry being inserted can be added to it.
Robin Hood hashing
One interesting variation on double-hashing collision resolution is Robin Hood hashing.[14][15] The
idea is that a new key may displace a key already inserted, if its probe count is larger than that of
the key at the current position. The net effect of this is that it reduces worst case search times in
the table. This is similar to ordered hash tables[16] except that the criterion for bumping a key does
not depend on a direct relationship between the keys. Since both the worst case and the variation
in the number of probes is reduced dramatically, an interesting variation is to probe the table
starting at the expected successful probe value and then expand from that position in both
directions.[17] External Robin Hood hashing is an extension of this algorithm where the table is stored
in an external file and each table position corresponds to a fixed-sized page or bucket with B
records.[18]
Unit 6: Recursion
Define recursion.
One of the most notable features of modern programming languages like C++, C#, and Java
(as well as many others) is that they allow you to define methods that reference
themselves; such methods are said to be recursive. One of the biggest advantages of recursive
methods is that they usually result in more readable and compact solutions to problems.
A recursive method, then, is one that is defined in terms of itself. Generally, a recursive algorithm
has two main properties:
One or more base cases that can be solved directly, without recursion.
One or more recursive cases that break the problem into smaller instances of the same problem, making progress toward a base case.
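For example, a minimal recursive method (the classic factorial) showing both properties:

// A recursive method has one or more base cases and a recursive case that
// makes progress toward a base case.
public class Factorial {
    static long factorial(int n) {
        if (n <= 1) return 1;          // base case
        return n * factorial(n - 1);   // recursive case: smaller subproblem
    }

    public static void main(String[] args) {
        System.out.println(factorial(5)); // 120
    }
}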
Unit 7: Binary Trees
Define binary tree.
A binary tree is a connected, undirected, finite graph with no cycles and no vertex of degree
greater than three; one designated node, of degree at most two, is the root, r, of the tree. The
depth of a node u is the length of the path from u to r. If a node w is on the path from u to r,
then w is called an ancestor of u and u a descendant of w. The subtree of a node u is the binary
tree that is rooted at u and contains all of u's descendants. The height of a node u is the length
of the longest path from u to one of its descendants. The height of a tree is the height of its
root. A node u is a leaf if it has no children.
We sometimes think of the tree as being augmented with external nodes. Any node that
does not have a left child has an external node as its left child, and, correspondingly, any
node that does not have a right child has an external node as its right child (see
Figure 6.2.b).
Define binary search tree.
A BinarySearchTree is a special kind of binary tree in which each node, u, also stores a data
value, u.x, from some total order. The data values in a binary search tree obey the binary
search tree property: for a node, u, every data value stored in the subtree rooted at u.left is
less than u.x, and every data value stored in the subtree rooted at u.right is greater than u.x.
Examine a binary tree and binary search tree.
Implement a binary tree and binary search tree.
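A minimal binary search tree sketch with add and find, following the binary search tree property described above (names are illustrative; no balancing or removal):

// Minimal (unbalanced) binary search tree of ints: smaller values go left,
// larger values go right, duplicates are ignored.
public class SimpleBST {
    static class Node {
        int x;
        Node left, right;
        Node(int x) { this.x = x; }
    }

    private Node root;

    boolean find(int x) {
        Node u = root;
        while (u != null) {
            if (x < u.x) u = u.left;
            else if (x > u.x) u = u.right;
            else return true;
        }
        return false;
    }

    void add(int x) {
        if (root == null) { root = new Node(x); return; }
        Node u = root;
        while (true) {
            if (x < u.x) {
                if (u.left == null) { u.left = new Node(x); return; }
                u = u.left;
            } else if (x > u.x) {
                if (u.right == null) { u.right = new Node(x); return; }
                u = u.right;
            } else return; // duplicate
        }
    }
}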
Define AVL tree.
An AVL tree is a self-balancing binary search tree in which the heights of the two child subtrees
of any node differ by at most 1; therefore, it is also said to be height-balanced. Lookup, insertion,
and deletion all take O(log n) time in both the average and worst cases, where n is the number of
nodes in the tree prior to the operation. Insertions and deletions may require the tree to be
rebalanced by one or more tree rotations.
The balance factor of a node is the height of its right subtree minus the height of its left subtree,
and a node with balance factor -1, 0, or +1 is considered balanced. A node with any other balance
factor is considered unbalanced and requires rebalancing the tree. The balance factor is either
stored directly at each node or computed from the heights of the subtrees.
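A sketch of computing the balance factor from subtree heights, together with a right rotation (one of the local rebalancing steps); the node fields are illustrative and heights are recomputed here rather than cached as a real AVL tree would do:

// Balance factor = height(right subtree) - height(left subtree).
// A node is balanced when the factor is -1, 0, or +1; a right rotation is one
// of the restructurings used to restore balance.
public class AvlSketch {
    static class Node {
        int x;
        Node left, right;
        Node(int x) { this.x = x; }
    }

    static int height(Node u) {
        if (u == null) return -1; // an empty subtree has height -1
        return 1 + Math.max(height(u.left), height(u.right));
    }

    static int balanceFactor(Node u) {
        return height(u.right) - height(u.left);
    }

    // Rotate right around u and return the new root of this subtree.
    static Node rotateRight(Node u) {
        Node w = u.left;
        u.left = w.right;
        w.right = u;
        return w;
    }
}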
AVL trees are often compared to red-black trees (see Unit 9) because they support the same set
of operations and because red-black trees also take O(log n) time for the basic operations. AVL
trees perform better than red-black trees for lookup-intensive applications. AVL trees, red-black
trees, and (2,4) trees, to be introduced in Unit 9 and Chapter 9 of Morin's book, share a number
of good properties, but AVL trees and (2,4) trees may require extra operations to deal with
restructuring (rotations), fusing, or splitting, whereas red-black trees do not have these
drawbacks.
Unit 8: Scapegoat Trees
Define scapegoat tree.
A ScapegoatTree is a BinarySearchTree that, in addition to keeping track of the number, n, of
nodes in the tree, also keeps a counter, q, that maintains an upper bound on the number of
nodes. At all times, n <= q and q <= 2n.
In addition, a ScapegoatTree has logarithmic height; at all times, the height of the scapegoat tree
does not exceed log_{3/2}(q) <= log_{3/2}(2n).
When an insertion makes some node too deep, we may have to reduce the height. This isn't a big
job; there is only one node, namely the newly inserted node, u, whose depth exceeds the limit. To
find a scapegoat, w, we walk from u back up toward the root until we find a node w for which
size(w.child) > (2/3) * size(w), where w.child is the child of w on the path from the root to u. The
subtree rooted at the scapegoat is then rebuilt into a perfectly balanced tree.
Implement a scapegoat tree.
Unit 9: Red-Black Trees
Define red-black tree.
In a red-black tree, every node is coloured either red or black, and we think of the external (nil)
nodes as black nodes.
This way, every real node, u, of a red-black tree has exactly two children, each with a well-defined
colour. Furthermore, the black-height property guarantees that every root-to-leaf
path in the tree contains the same number of black nodes; if we merge each red node into its
black parent, the result is a 2-4 tree! The height of a red-black tree with n
nodes is at most 2*log(n).
Unit 10: Heaps
A heap is a complete binary tree in which each node's value is ordered relative to its children's
values; heaps come in two kinds, max-heaps and min-heaps. Conventionally, min-heaps are used
because they are readily applicable for use in priority queues.
Note that the ordering of siblings in a heap is not specified by the heap property, so the two
children of a parent can be freely interchanged, as long as this does not violate the shape and
heap properties.
The binary heap is a special case of the d-ary heap in which d = 2.
In the array representation of a binary heap, the left child of the node at index i is at index
2i + 1, the right child of the node at index i is at index 2i + 2, and the parent of the
node at index i is at index (i - 1)/2.
A BinaryHeap implements the (priority) Queue interface.
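A sketch of the array arithmetic and the add (bubble-up) operation for a min-heap of ints; this is illustrative, not the textbook's generic BinaryHeap, and remove with its trickle-down step is omitted:

import java.util.Arrays;

// Array-based min-heap: children of index i are at 2i+1 and 2i+2,
// parent of index i is at (i-1)/2.
public class MinHeapSketch {
    private int[] a = new int[1];
    private int n; // number of elements currently stored

    static int left(int i)   { return 2 * i + 1; }
    static int right(int i)  { return 2 * i + 2; }
    static int parent(int i) { return (i - 1) / 2; }

    void add(int x) {
        if (n == a.length) a = Arrays.copyOf(a, 2 * a.length); // grow backing array
        a[n++] = x;
        int i = n - 1;
        // bubble up until the heap property is restored
        while (i > 0 && a[i] < a[parent(i)]) {
            int tmp = a[i]; a[i] = a[parent(i)]; a[parent(i)] = tmp;
            i = parent(i);
        }
    }

    int min() { return a[0]; } // the smallest element sits at the root
}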
A MeldableHeap is a randomized heap in which all operations are defined in terms of the
merge(h1, h2) operation. This operation takes two heap nodes h1 and h2
and merges them, returning a heap node that is the root of a heap that contains all
elements in the subtree rooted at h1 and all elements in the subtree rooted at h2.
Unit 11: Sorting Algorithms
Merge Sort
The merge-sort algorithm is a classic divide and conquer algorithm: it splits the input array
into two halves. We recursively sort the two halves and then merge the (now sorted) halves
into a single sorted array.
Quick sort
The quicksort algorithm is another classic divide and conquer algorithm. Unlike mergesort, which does merging after solving the two subproblems, quicksort does all of its
work upfront.
Quicksort is simple to describe: Pick a random pivot element, x, from a; partition a
into the set of elements less than x, the set of elements equal to x, and the set of
elements greater than x; and, finally, recursively sort the first and third sets in this
partition.
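A compact quicksort sketch using a random pivot and three-way partitioning, matching the description above (names are illustrative):

import java.util.Random;

// Quicksort with a random pivot and three-way partitioning: elements less than
// the pivot, equal to it, and greater than it; the first and third parts are
// sorted recursively.
public class QuickSort {
    private static final Random rand = new Random();

    static void sort(int[] a) { sort(a, 0, a.length - 1); }

    private static void sort(int[] a, int lo, int hi) {
        if (lo >= hi) return;
        int pivot = a[lo + rand.nextInt(hi - lo + 1)];
        int lt = lo, i = lo, gt = hi;
        while (i <= gt) {
            if (a[i] < pivot)      swap(a, lt++, i++);
            else if (a[i] > pivot) swap(a, i, gt--);
            else                   i++;
        }
        sort(a, lo, lt - 1); // elements less than the pivot
        sort(a, gt + 1, hi); // elements greater than the pivot
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}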
Heap Sort
The heap-sort algorithm is another in-place sorting algorithm. Heap-sort uses the binary heaps
discussed in Section 10.1. Recall that the BinaryHeap data structure represents a heap using a
single array. The heap-sort algorithm converts the input array into a heap and then repeatedly
extracts the minimum value.
More specifically, a heap stores n elements in an array, a, at array locations a[0], ..., a[n-1],
with the smallest value stored at the root, a[0]. After transforming a into a BinaryHeap, the
heap-sort algorithm repeatedly swaps a[0] and a[n-1], decrements n, and calls trickleDown(0)
so that a[0], ..., a[n-2] once again form a valid heap.
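An in-place heap-sort sketch along these lines; note that it builds a max-heap so the array ends up sorted in ascending order, whereas the min-heap version described above would produce the output in the opposite order unless reversed:

// In-place heap-sort: build a max-heap, then repeatedly swap the root (current
// maximum) with the last unsorted position and restore the heap by sifting down.
public class HeapSort {
    static void sort(int[] a) {
        int n = a.length;
        for (int i = n / 2 - 1; i >= 0; i--) siftDown(a, i, n); // heapify
        for (int end = n - 1; end > 0; end--) {
            swap(a, 0, end);     // move the current maximum into its final place
            siftDown(a, 0, end); // restore the heap on the remaining prefix
        }
    }

    private static void siftDown(int[] a, int i, int n) {
        while (true) {
            int largest = i, l = 2 * i + 1, r = 2 * i + 2;
            if (l < n && a[l] > a[largest]) largest = l;
            if (r < n && a[r] > a[largest]) largest = r;
            if (largest == i) return;
            swap(a, i, largest);
            i = largest;
        }
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}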
Counting Sort
Suppose the input is an array a consisting of n integers, each in the range 0, ..., k-1. The
counting-sort algorithm sorts a using an auxiliary array c of counters and outputs the sorted
result as an auxiliary array b. The output consists of c[0] occurrences of 0, followed by c[1]
occurrences of 1, followed by c[2] occurrences of 2, ..., followed by c[k-1] occurrences of k-1.
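A direct counting-sort sketch for integers in the range 0..k-1, following the description above:

// Counting sort: count occurrences of each value 0..k-1, then write the output
// as c[0] copies of 0, c[1] copies of 1, and so on.
public class CountingSort {
    static int[] sort(int[] a, int k) {
        int[] c = new int[k];
        for (int x : a) c[x]++;            // count occurrences of each value
        int[] b = new int[a.length];
        int pos = 0;
        for (int v = 0; v < k; v++) {
            for (int j = 0; j < c[v]; j++) b[pos++] = v;
        }
        return b;
    }
}

This simplified version only handles bare integers; the textbook's version uses prefix sums so that the sort is stable, which is what radix sort relies on.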
Radix-Sort
Counting-sort is very efficient for sorting an array of integers when the length, n, of the array is
not much smaller than the maximum value, k-1, that appears in the array. The radix-sort
algorithm uses several passes of counting-sort to sort integers d bits at a time. More precisely,
radix sort first sorts the integers by their least significant d bits, then their next significant d
bits, and so on until, in the last pass, the integers are sorted by their most significant d bits.
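A radix-sort sketch that sorts 32-bit non-negative integers d = 8 bits at a time, using a stable counting pass for each group of bits (the parameter choices are assumptions for illustration):

// Radix sort: stable counting-sort passes over successive groups of d bits,
// starting with the least significant bits.
public class RadixSort {
    static final int W = 32; // bits per integer
    static final int D = 8;  // bits sorted per pass

    static int[] sort(int[] a) {
        int[] cur = a.clone();
        for (int shift = 0; shift < W; shift += D) {
            int[] count = new int[1 << D];
            for (int x : cur) count[(x >>> shift) & ((1 << D) - 1)]++;
            // prefix sums give, for each digit value, one past its last output index
            for (int v = 1; v < count.length; v++) count[v] += count[v - 1];
            int[] next = new int[cur.length];
            for (int i = cur.length - 1; i >= 0; i--) { // right to left keeps it stable
                int digit = (cur[i] >>> shift) & ((1 << D) - 1);
                next[--count[digit]] = cur[i];
            }
            cur = next;
        }
        return cur;
    }
}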
Quick sort
When quicksort is called to sort an array containing the integers 0, ..., n-1, the expected number of
times element i is compared to a pivot element is at most H_{i+1} + H_{n-i}, where H_j denotes the
jth harmonic number.
A little summing up of harmonic numbers gives us the following theorem about the running time
of quicksort:
Theorem 11.2: When quicksort is called to sort an array containing n distinct elements, the
expected number of comparisons performed is at most 2n ln n + O(n).
The mergeSort(a) method runs in O(n log n) time and performs at most n log n comparisons.
The heapSort(a) method runs in O(n log n) time and performs at most 2n log n + O(n)
comparisons.
Counting sort
The countingSort(a, k) method can sort an array a containing n integers in the set {0, ..., k-1}
in O(n + k) time.
Radix sort
For any integer d > 0, the radixSort(a, k) method can sort an array a containing n w-bit integers
in O((w/d)(n + 2^d)) time. If we think of the array elements as being in the range
{0, ..., n^c - 1} and take d = ceil(log n), this running time simplifies to O(c*n).
Compare sorting algorithms.
The following table summarizes the three comparison-based sorting algorithms when sorting an
array containing n integer values:

Algorithm  | comparisons                    | in-place
Merge-sort | n log n (worst-case)           | No
Quicksort  | 1.38 n log n + O(n) (expected) | Yes
Heap-sort  | 2n log n + O(n) (worst-case)   | Yes
Each of these comparison-based algorithms has its advantages and disadvantages. Merge-sort
does the fewest comparisons and does not rely on randomization. Unfortunately, it uses an
auxiliary array during its merge phase. Allocating this array can be expensive and is a potential
point of failure if memory is limited. Quicksort is an in-place algorithm and is a close second in
terms of the number of comparisons, but is randomized, so this running time is not always
guaranteed. Heap-sort does the most comparisons, but it is in-place and deterministic.
There is one setting in which merge-sort is a clear winner; this occurs when sorting a linked list.
In this case, the auxiliary array is not needed; two sorted linked lists are very easily merged into a
single sorted linked-list by pointer manipulations (see Exercise 11.2).
The counting-sort and radix-sort algorithms described here are due to Seward [66, Section 2.4.6].
However, variants of radix-sort have been used since the 1920s to sort punch cards using
punched card sorting machines. These machines can sort a stack of cards into two piles based on
the existence (or not) of a hole in a specific location on the card. Repeating this process for
different hole locations gives an implementation of radix-sort.
Finally, we note that counting sort and radix-sort can be used to sort other types of numbers
besides non-negative integers. Straightforward modifications of counting sort can sort integers
in any interval {a, ..., b} in O(n + b - a) time. Both of these algorithms can also be used to sort
floating point numbers in the IEEE 754 floating point format. This is because the IEEE format is
designed to allow the comparison of two floating point numbers by comparing their values as if
they were integers in a signed-magnitude binary representation.
Unit 12: Graphs
Mathematically, a (directed) graph is a pair G = (V, E) where V is a set of vertices and E is a
set of ordered pairs of vertices called edges. An edge (i, j) is directed from i to j; i is called the
source of the edge and j is called the target. A path in G is a sequence of vertices v0, ..., vk such
that, for every i in {1, ..., k}, the edge (v_{i-1}, v_i) is in E. A path v0, ..., vk is a cycle if,
additionally, the edge (vk, v0) is in E. A path (or cycle) is simple if all of its vertices are unique.
If there is a path from some vertex vi to some vertex vj, then we say that vj is reachable from vi.
Figure 12.1: A graph with twelve vertices. Vertices are drawn as numbered circles and edges are
drawn as pointed curves pointing from source to target.
An AdjacencyMatrix and AdjacencyLists represent a graph with n vertices and m edges using
O(n^2) and O(n + m) bytes of memory, respectively. When given as input a Graph, g, implemented
using adjacency lists, the breadth-first-search algorithm bfs(g, r) runs in O(n + m) time.
A particularly useful application of the breadth-first-search algorithm is, therefore, in computing
shortest paths
When given as input a Graph, g, implemented using adjacency lists, the depth-first-search
algorithm dfs(g, r) also runs in O(n + m) time.
Implement those search algorithms for traversing a graph in pseudo-code or other programming
languages, such as Java, C, or C++, etc.
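A breadth-first-search sketch over an adjacency-list graph that also records shortest-path distances (in edges) from the source; the List<List<Integer>> representation and names are illustrative rather than any particular textbook class:

import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

// Breadth-first search from vertex r over an adjacency-list graph.
// dist[v] ends up holding the length (in edges) of a shortest path from r to v,
// or -1 if v is unreachable.
public class Bfs {
    static int[] bfs(List<List<Integer>> adj, int r) {
        int n = adj.size();
        int[] dist = new int[n];
        Arrays.fill(dist, -1);
        Queue<Integer> q = new ArrayDeque<>();
        dist[r] = 0;
        q.add(r);
        while (!q.isEmpty()) {
            int u = q.remove();
            for (int v : adj.get(u)) {
                if (dist[v] == -1) {      // not yet visited
                    dist[v] = dist[u] + 1;
                    q.add(v);
                }
            }
        }
        return dist;
    }
}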
Unit 13: Binary Trie
Define trie.
A BinaryTrie encodes a set of w-bit integers in a binary tree. All leaves in the tree have depth w,
and each integer is encoded as a root-to-leaf path. The path for the integer x turns left at level i
if the ith most significant bit of x is a 0 and turns right if it is a 1. In a BinaryTrie, each of the
add(x), remove(x), and find(x) methods runs in O(w) time.
Examine a binary trie.
Explain binary trie.
Each node, u, also contains an additional pointer, u.jump. If u's left child is missing, then u.jump
points to the smallest leaf in u's subtree; if u's right child is missing, then u.jump points to the
largest leaf in u's subtree.
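A tiny sketch of how the bits of an integer determine its root-to-leaf path in a w-bit binary trie (0 bit = go left, 1 bit = go right); the output format is purely illustrative:

// Print the root-to-leaf path taken by integer x in a w-bit binary trie:
// at depth i we branch on the ith most significant bit of x.
public class TriePath {
    static String path(int x, int w) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < w; i++) {
            int bit = (x >>> (w - i - 1)) & 1;
            sb.append(bit == 0 ? 'L' : 'R');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(path(13, 4)); // 13 = 1101 in binary, so "RRLR"
    }
}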