CS2604:
Data Structures and File Processing
C++ Edition
Clifford A. Shaffer
Department of Computer Science
Virginia Tech
Copyright © 1995, 1996, 1998
The Need for Data Structures
[A primary concern of this course is efficiency.]
Data structures organize data
⇒ more efficient programs. [You might
believe that faster computers make it unnecessary to be
concerned with efficiency. However...]
• More powerful computers ⇒ more complex
applications.
• More complex applications demand more
calculations.
• Complex computing tasks are unlike our
everyday experience. [So we need special
training]
Any organization for a collection of records can
be searched, processed in any order, or
modified. [If you are willing to pay enough in time delay.
Ex: Simple unordered array of records.]
• The choice of data structure and algorithm
can make the difference between a program
running in a few seconds or many days.
Efficiency
A solution is said to be efficient if it solves the
problem within its resource constraints. [Alt:
Better than known alternatives ("relatively" efficient)]
• space [These are typical constraints for programs]
• time
[This does not mean always strive for the most efficient
program. If the program operates well within resource
constraints, there is no benefit to making it faster or smaller.]
The cost of a solution is the amount of
resources that the solution consumes.
Selecting a Data Structure
Select a data structure as follows:
1. Analyze the problem to determine the
resource constraints a solution must meet.
2. Determine the basic operations that must
be supported. Quantify the resource
constraints for each operation.
3. Select the data structure that best meets
these requirements.
[Typically want the "simplest" data structure that will meet
requirements.]
Some questions to ask: [These questions often help
to narrow the possibilities]
• Are all data inserted into the data structure
at the beginning, or are insertions
interspersed with other operations?
• Can data be deleted? [If so, a more complex
representation is typically required]
• Are all data processed in some well-defined
order, or is random access allowed?
Data Structure Philosophy
Each data structure has costs and benefits.
Rarely is one data structure better than
another in all situations.
A data structure requires:
• space for each data item it stores, [Data +
Overhead]
• time to perform each basic operation,
• programming effort. [Some data
structures/algorithms are more complicated than others]
Each problem has constraints on available
space and time.
Only after a careful analysis of problem
characteristics can we know the best data
structure for the task.
Bank example:
• Start account: a few minutes
• Transactions: a few seconds
• Close account: overnight
Goals of this Course
1. Reinforce the concept that there are costs
and benefits for every data structure. [A
worldview to adopt]
2. Learn the commonly used data structures.
These form a programmer's basic data
structure "toolkit." [The "nuts and bolts" of the
course]
3. Understand how to measure the
effectiveness of a data structure or
program.
• These techniques also allow you to judge
the merits of new data structures that
you or others might invent. [To prepare
you for the future]
Definitions
A type is a set of values.
[Ex: Integer, Boolean, Float]
A data type is a type and a collection of
operations that manipulate the type.
[Ex: Addition]
A data item or element is a piece of
information or a record.
[Physical instantiation]
A data item is said to be a member of a data
type.
A simple data item contains no subparts.
[Ex: Integer]
An aggregate data item may contain several
pieces of information.
[Ex: Payroll record, city database record]
Abstract Data Types
Abstract Data Type (ADT): a definition for a
data type solely in terms of a set of values and
a set of operations on that data type.
Each ADT operation is defined by its inputs
and outputs.
Encapsulation: hide implementation details
A data structure is the physical
implementation of an ADT.
• Each operation associated with the ADT is
implemented by one or more subroutines in
the implementation.
Data structure usually refers to an
organization for data in main memory.
File structure: an organization for data on
peripheral storage, such as a disk drive or tape.
An ADT manages complexity through
abstraction: metaphor. [Hierarchies of labels]
[Ex: transistors → gates → CPU. In a program, implement an
ADT, then think only about the ADT, not its implementation]
Logical vs. Physical Form
Data items have both a logical and a physical
form.
Logical form: definition of the data item within
an ADT. [Ex: Integers in mathematical sense: +, −]
Physical form: implementation of the data item
within a data structure. [16/32 bit integers: overflow]
[Figure: relationship between a data type and its data items: an ADT (type, operations) corresponds to data items in their logical form.]
Algorithms and Programs
Algorithm: a method or a process followed to
solve a problem. [A recipe]
An algorithm takes the input to a problem
(function) and transforms it to the output. [A
mapping of input to output]
A problem can have many algorithms.
An algorithm possesses the following properties:
1. It must be correct. [Computes proper function]
2. It must be composed of a series of
concrete steps. [Executable by that machine]
3. There can be no ambiguity as to which
step will be performed next.
4. It must be composed of a finite number of
steps.
5. It must terminate.
A computer program is an instance, or
concrete representation, for an algorithm in
some programming language.
[We frequently interchange use of "algorithm" and "program"
though they are actually different concepts]
Mathematical Background
[Look over Chapter 2, read as needed depending on your
familiarity with this material.]
Set concepts and notation [Set has no duplicates,
sequence may]
Recursion
Induction proofs
Logarithms [Almost always use log to base 2. That is our
default base.]
Summations
Estimation Techniques
Known as "back of the envelope" or "back of
the napkin" calculation.
1. Determine the major parameters that affect
the problem.
2. Derive an equation that relates the
parameters to the problem.
3. Select values for the parameters, and apply
the equation to yield an estimated solution.
Example:
How many library bookcases does it take to
store books totaling one million pages?
Estimate:
• pages/inch [guess 500]
• feet/shelf [guess 4 (actually, 3)]
• shelves/bookcase [guess 5 (actually, 7)]
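The three estimation steps can be traced on the bookcase example. The sketch below is only an illustration: the function name `bookcases_needed` and its parameter names are assumptions, and the numbers fed to it are the guesses listed above.

```cpp
#include <cassert>
#include <cmath>

// Estimate how many bookcases are needed to hold `pages` pages.
// Parameter names are illustrative; the caller supplies the guesses.
int bookcases_needed(double pages, double pages_per_inch,
                     double feet_per_shelf, double shelves_per_bookcase) {
  assert(pages_per_inch > 0 && feet_per_shelf > 0 && shelves_per_bookcase > 0);
  double inches = pages / pages_per_inch;             // shelf length needed
  double shelves = inches / (12.0 * feet_per_shelf);  // 12 inches per foot
  return static_cast<int>(std::ceil(shelves / shelves_per_bookcase));
}
```

With the guessed values (500 pages/inch, 4 feet/shelf, 5 shelves/bookcase) the estimate is 9 bookcases; with the actual shelf figures (3 feet, 7 shelves) it is 8.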
Algorithm Efficiency
There are often many approaches (algorithms)
to solve a problem. How do we choose between
them?
At the heart of computer program design are
two (sometimes conflicting) goals:
1. To design an algorithm that is easy to
understand, code and debug.
2. To design an algorithm that makes efficient
use of the computer's resources.
Goal (1) is the concern of Software
Engineering.
Goal (2) is the concern of data structures and
algorithm analysis.
When goal (2) is important, how do we
measure an algorithm's cost?
How to Measure Efficiency?
1. Empirical comparison (run programs).
[Difficult to do "fairly." Time consuming.]
2. Asymptotic Algorithm Analysis.
Critical resources:
[Time. Space (disk, RAM). Programmer's effort. Ease of use
(user's effort).]
Growth Rate Graph
[2^n is an exponential algorithm. 10n and 20n differ only by a
constant.]
[Figure: running time of 2^n, 2n^2, 5n log n, 20n, and 10n plotted against input size n, once for 0 ≤ n ≤ 50 (vertical axis to 1400) and once enlarged for 0 ≤ n ≤ 15 (vertical axis to 400).]
Best, Worst and Average Cases
Not all inputs of a given size take the same
time.
Sequential search for K in an array of n
integers:
• Begin at first element in array and look at
each element in turn until K is found.
Best Case: [Find at first position: 1 compare]
Worst Case: [Find at last position: n compares]
Average Case: [(n + 1)/2 compares]
While average time seems to be the fairest
measure, it may be difficult to determine.
[Depends on distribution. Assumption for above analysis:
Equally likely at any position.]
When is worst case time important?
[Real time algorithms]
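The three cases can be observed directly by counting comparisons. This is a minimal sketch, assuming integer keys in a std::vector; the function name `sequential_search` and the `compares` out-parameter are illustrative choices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential search that also reports how many comparisons it made.
// Returns the index of the first element equal to K, or -1 if K is absent.
int sequential_search(const std::vector<int>& a, int K,
                      std::size_t& compares) {
  compares = 0;
  for (std::size_t i = 0; i < a.size(); i++) {
    compares++;                       // one key comparison per element visited
    if (a[i] == K) return static_cast<int>(i);
  }
  return -1;
}
```

Searching once for each of the n stored keys costs 1 + 2 + ... + n comparisons in total, which is where the (n + 1)/2 average comes from under the equally-likely assumption.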
Faster Computer or Algorithm?
What happens when we buy a computer 10
times faster? [How much speedup? 10 times. More
important: How much increase in problem size for same time?
Depends on growth rate.]
T(n)       n      n'      Change              n'/n
10n        1,000  10,000  n' = 10n            10
20n        500    5,000   n' = 10n            10
5n log n   250    1,842   √10 n < n' < 10n    7.37
2n^2       70     223     n' = √10 n          3.16
2^n        13     16      n' = n + 3          --
[For 2^n, if n = 1000, then n' would be 1003]
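The effect of a tenfold speedup can be explored by finding the largest problem size that fits in a fixed operation budget. The sketch below assumes a budget of 10,000 basic operations on the old machine (the value implied by the 10n row) and 100,000 on the new one; `largest_n` and the cost-function names are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Largest n whose cost T(n) does not exceed `budget` operations.
// T must be non-decreasing in n.
template <typename Cost>
std::uint64_t largest_n(Cost T, double budget) {
  std::uint64_t n = 0;
  while (T(n + 1) <= budget) n++;
  return n;
}

double linear(std::uint64_t n)    { return 10.0 * static_cast<double>(n); }
double quadratic(std::uint64_t n) {
  return 2.0 * static_cast<double>(n) * static_cast<double>(n);
}
double exponential(std::uint64_t n) {
  return std::pow(2.0, static_cast<double>(n));
}
```

For a linear cost the solvable size grows by the full factor of 10, for a quadratic cost by about √10, and for an exponential cost only by an additive constant.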
Asymptotic Analysis: Big-oh
Definition: For T(n) a non-negatively valued
function, T(n) is in the set O(f (n)) if there
exist two positive constants c and n0 such that
T(n) ≤ cf (n) for all n > n0.
Big-oh Example
Example 1. Finding value X in an array. [Average
case]
T(n) = c_s n/2. [c_s is a constant. Actual value is irrelevant]
For all values of n > 1, c_s n/2 ≤ c_s n.
Therefore, by the definition, T(n) is in O(n) for
n0 = 1 and c = c_s.
Big-Omega
Definition: For T(n) a non-negatively valued
function, T(n) is in the set Ω(g(n)) if there
exist two positive constants c and n0 such that
T(n) ≥ cg(n) for all n > n0.
Therefore, T(n) is in Ω(n^2) by the definition.
Want greatest lower bound.
Theta Notation
When big-Oh and Ω meet, we indicate this by
using Θ (big-Theta) notation.
Definition: An algorithm is said to be Θ(h(n))
if it is in O(h(n)) and it is in Ω(h(n)).
[For polynomial equations on T(n), we always have Θ. There
is no uncertainty, a "complete" analysis.]
Simplifying Rules:
1. If f (n) is in O(g(n)) and g(n) is in O(h(n)),
then f (n) is in O(h(n)).
2. If f (n) is in O(kg(n)) for any constant
k > 0, then f (n) is in O(g (n)). [No constant]
3. If f1(n) is in O(g1(n)) and f2(n) is in
O(g2(n)), then (f1 + f2)(n) is in
O(max(g1(n), g2(n))). [Drop low order terms]
4. If f1(n) is in O(g1(n)) and f2(n) is in
O(g2(n)) then f1(n)f2(n) is in
O(g1(n)g2(n)). [Loops]
Running Time of a Program
[Asymptotic analysis is defined for equations. Need to convert
program to an equation.]
Example 1: a = b;
This assignment takes constant time, so it is
Θ(1). [Not Θ(c): notation by tradition]
Example 2:
sum = 0;
for (i=1; i<=n; i++)
sum += n;
[Θ(n) (even though sum is n^2)]
Example 3:
sum = 0;
for (j=1; j<=n; j++) // First for loop
for (i=1; i<=j; i++) // is a double loop
sum++;
for (k=0; k<n; k++) // Second for loop
A[k] = k;
[First statement is $\Theta(1)$. Double for loop is $\sum_{i=1}^{n} i = \Theta(n^2)$. Final
for loop is $\Theta(n)$. Result: $\Theta(n^2)$.]
More Examples
Example 4.
sum1 = 0;
for (i=1; i<=n; i++) // First double loop
for (j=1; j<=n; j++) // do n times
sum1++;
sum2 = 0;
for (i=1; i<=n; i++) // Second double loop
for (j=1; j<=i; j++) // do i times
sum2++;
[First loop, sum is n^2. Second loop, sum is (n + 1)(n)/2. Both
are Θ(n^2).]
Example 5.
sum1 = 0;
for (k=1; k<=n; k*=2)
for (j=1; j<=n; j++)
sum1++;
sum2 = 0;
for (k=1; k<=n; k*=2)
for (j=1; j<=k; j++)
sum2++;
[First is $\sum_{k=1}^{\log n} n = \Theta(n \log n)$. Second is $\sum_{k=0}^{\log n - 1} 2^k = \Theta(n)$.]
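The two growth rates in Example 5 can be checked by counting how often the innermost statement runs. This is a small sketch; the function names `count_first` and `count_second` are illustrative.

```cpp
#include <cassert>

// Count executions of sum1++ in Example 5: the outer loop doubles k,
// and the inner loop always runs n times.
long count_first(long n) {
  long count = 0;
  for (long k = 1; k <= n; k *= 2)
    for (long j = 1; j <= n; j++)
      count++;
  return count;
}

// Count executions of sum2++ in Example 5: the inner loop runs k times,
// so the total is a sum of powers of two.
long count_second(long n) {
  long count = 0;
  for (long k = 1; k <= n; k *= 2)
    for (long j = 1; j <= k; j++)
      count++;
  return count;
}
```

For n = 1024 the first count is 11 × 1024 = 11264, while the second is only 2047, just under 2n.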
Binary Search
Position 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Key 11 13 21 26 29 36 40 41 45 51 54 56 65 72 77 83
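The slide shows only the sorted key array. A minimal binary search over such an array might look like the sketch below; the function name `binary_search_pos` and the use of std::vector are assumptions rather than the course's own code.

```cpp
#include <cassert>
#include <vector>

// Return the position of K in the sorted vector `a`, or -1 if K is absent.
int binary_search_pos(const std::vector<int>& a, int K) {
  int low = 0;
  int high = static_cast<int>(a.size()) - 1;
  while (low <= high) {
    int mid = low + (high - low) / 2;  // middle of the remaining range
    if (a[mid] == K) return mid;
    if (a[mid] < K) low = mid + 1;     // K can only be in the upper part
    else high = mid - 1;               // K can only be in the lower part
  }
  return -1;
}
```

Each iteration halves the remaining range, so the 16 keys above need at most 5 probes.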
Other Control Statements
while loop: analyze like a for loop.
if statement: Take greater complexity of
then/else clauses.
[If probabilities are independent of n.]
switch statement: Take complexity of most
expensive case.
[If probabilities are independent of n.]
Subroutine call: Complexity of the subroutine.
Analyzing Problems
[Typically do a lot of this in a senior algorithms course.]
Upper bound: Upper bound of best known
algorithm.
Lower bound: Lower bound for every possible
algorithm.
[The examples so far have been easy in that exact equations
always yield Θ. Thus, it was hard to distinguish Ω and O.
Following example should help to explain the difference:
bounds are used to describe our level of uncertainty about an
algorithm.]
Example: Sorting
1. Cost of I/O: Ω(n)
2. Bubble or insertion sort: O(n^2)
3. A better sort (Quicksort, Mergesort,
Heapsort, etc.): O(n log n)
4. We prove later that sorting is Ω(n log n)
Multiple Parameters
[Ex: 256 colors (8 bits), 1000 × 1000 pixels]
Compute the rank ordering for all C pixel
values in a picture of P pixels.
for (i=0; i<C; i++) // Initialize count
count[i] = 0;
for (i=0; i<P; i++) // Look at all of the pixels
count[value(i)]++; // Increment proper value count
sort(count); // Sort pixel value counts
Space Bounds
Space bounds can also be analyzed with
asymptotic complexity analysis.
Time: Algorithm
Space: Data Structure
Space/Time Tradeoff Principle:
One can often achieve a reduction in time if
one is willing to sacrifice space, or vice versa.
• Encoding or packing information
Boolean flags
• Table lookup
Factorials
Disk Based Space/Time Tradeoff Principle:
The smaller you can make your disk storage
requirements, the faster your program will run.
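The factorial table lookup mentioned above can be sketched as follows. The names `FACT_TABLE`, `factorial_lookup`, and `factorial_compute` are illustrative, and the table stops at 12! because that is the largest factorial that fits in 32 bits.

```cpp
#include <cassert>
#include <cstdint>

// Space for time: 13 stored words replace a multiplication loop.
const std::uint32_t FACT_TABLE[13] = {
  1, 1, 2, 6, 24, 120, 720, 5040, 40320, 362880,
  3628800, 39916800, 479001600
};

std::uint32_t factorial_lookup(int n) {
  assert(n >= 0 && n <= 12);
  return FACT_TABLE[n];              // constant time
}

// Time for space: recompute the value on every call.
std::uint32_t factorial_compute(int n) {
  assert(n >= 0 && n <= 12);
  std::uint32_t result = 1;
  for (int i = 2; i <= n; i++) result *= static_cast<std::uint32_t>(i);
  return result;
}
```

Both functions return the same values; the choice between them is purely a question of which resource is scarcer.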
Lists
[Students should already be familiar with lists. Objectives: use
alg analysis in familiar context, compare implementations.]
A list is a finite, ordered sequence of data
items called elements.
[The positions are ordered, NOT the values.]
Each list element has a data type.
The empty list contains no elements.
The length of the list is the number of
elements currently stored.
The beginning of the list is called the head,
the end of the list is called the tail.
Sorted lists have their elements positioned in
ascending order of value, while unsorted lists
have no necessary relationship between element
values and positions.
Notation: ( a_0, a_1, ..., a_{n−1} )
What operations should we implement?
[Add/delete elem anywhere, find, next, prev, test for empty.]
List ADT
class List { // List class ADT
public:
List(int =LIST_SIZE); // Constructor
~List(); // Destructor
void clear(); // Remove all Elems
void insert(const Elem); // Insert Elem at curr
void append(const Elem); // Insert Elem at tail
Elem remove(); // Remove and return Elem
void setFirst(); // Set curr to first pos
void prev(); // Move curr to prev pos
void next(); // Move curr to next pos
int length() const; // Return current length
void setPos(int); // Set curr to position
void setValue(const Elem); // Set current value
Elem currValue() const; // Return current value
bool isEmpty() const; // TRUE if list is empty
bool isInList() const; // TRUE if curr in list
bool find(int); // Find value
};
[This is an example of an ADT. Our list implementations will
match. Note that the generic type "Elem" is being used for
the element type.]
List ADT Examples
List: ( 12, 32, 15 )
MyList.insert(99);
Array-Based List Insert
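[Figure: inserting 23 at the head of the array-based list ( 13, 12, 20, 8, 3 ): (a) the original list in positions 0 to 4; (b) all elements shifted one position toward the tail; (c) 23 stored in position 0.]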
Array-Based List Class
class List { // Array-based list class
private:
int msize; // Maximum size of list
int numinlist; // Actual number of Elems
int curr; // Position of "current"
Elem* listarray; // Array of list Elems
public:
List(int =LIST_SIZE); // Constructor
~List(); // Destructor
void clear(); // Remove all Elems
void insert(const Elem); // Insert Elem at curr
void append(const Elem); // Insert Elem at tail
Elem remove(); // Remove and return Elem
void setFirst(); // Set curr to first pos
void prev(); // Move curr to prev pos
void next(); // Move curr to next pos
int length() const; // Return current length
void setPos(int); // Set curr to position
void setValue(const Elem); // Set current value
Elem currValue() const; // Return current value
bool isEmpty() const; // TRUE if list is empty
bool isInList() const; // TRUE if curr in list
bool find(int); // Find value
};
Array-Based List Implementation
List::List(int sz) // Constructor
{ msize = sz; numinlist = 0; curr = 0;
listarray = new Elem[sz]; }
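Only the constructor is shown here. A possible insert for the array-based list, which shifts the elements from curr to the tail up by one position, is sketched below. The class name `ArrayList`, the `at` accessor, and the choice of int for Elem are assumptions made so the fragment stands alone.

```cpp
#include <cassert>

typedef int Elem;   // assumption: int elements for this sketch

class ArrayList {   // hypothetical stand-alone version of the list class
private:
  int msize;        // Maximum size of list
  int numinlist;    // Actual number of Elems
  int curr;         // Position of "current"
  Elem* listarray;  // Array of list Elems
public:
  ArrayList(int sz)
    : msize(sz), numinlist(0), curr(0), listarray(new Elem[sz]) {}
  ~ArrayList() { delete [] listarray; }
  ArrayList(const ArrayList&) = delete;             // copying not needed here
  ArrayList& operator=(const ArrayList&) = delete;
  // Insert item at curr; elements from curr to the tail move up by one.
  void insert(const Elem item) {
    assert(numinlist < msize && curr >= 0 && curr <= numinlist);
    for (int i = numinlist; i > curr; i--)
      listarray[i] = listarray[i-1];
    listarray[curr] = item;
    numinlist++;
  }
  int length() const { return numinlist; }
  Elem at(int pos) const {           // hypothetical accessor for checking
    assert(pos >= 0 && pos < numinlist);
    return listarray[pos];
  }
};
```

Inserting at position 0 of an n-element list moves all n elements, which is the source of the Θ(n) cost for array-based insertion.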
Link Class
Dynamic allocation of new list elements.
class Link { // Singly-linked node
public:
Elem element; // Elem value for node
Link *next; // Pointer to next node
Link(const Elem elemval, Link* nextval =NULL)
{ element = elemval; next = nextval; }
Link(Link* nextval =NULL) { next = nextval; }
};
Linked List Position
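[Figure: linked list ( 20, 23, 12, 15 ) with head, curr, and tail pointers, curr pointing to the current node: (a) before and (b) after inserting 10 ahead of 12.]
[Naive approach: Point to current node. Current is 12. Want
to insert node with 10. No access available to node with 23.
How can we do the insert?]
[Figure: the same list with curr pointing to the node that precedes the current element: (a) before and (b) after inserting 10.]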
Linked List Class
class List { // Linked list class
private:
Link* head; // Pointer to list header
Link* tail; // Pointer to last Elem
Link* curr; // Pos of "current" Elem
public:
List(int =LIST_SIZE); // Constructor
~List(); // Destructor
void clear(); // Remove all Elems
void insert(const Elem); // Insert at current pos
void append(const Elem); // Insert at tail of list
Elem remove(); // Remove/return Elem
void setFirst(); // Set curr to first pos
void prev(); // Move curr to prev pos
void next(); // Move curr to next pos
int length() const; // Return length
void setPos(int); // Set current pos
void setValue(const Elem); // Set current value
Elem currValue() const; // Return current value
bool isEmpty() const; // TRUE if list is empty
bool isInList() const; // TRUE if now in list
bool find(int); // Find value
};
Linked List Insertion
// Insert Elem at current position
void List::insert(const Elem item) {
assert(curr != NULL); // Must be pointing to Elem
curr->next = new Link(item, curr->next);
if (tail == curr) // Appended new Elem
tail = curr->next;
}
[Figure: inserting 10 between 23 and 12, with curr at the node holding 23: (a) before; (b) after, with the three pointer steps numbered 1 to 3.]
Linked List Remove
Elem List::remove() { // Remove/return Elem
assert(isInList()); // Must be valid pos
Elem temp = curr->next->element; // Remember value
Link* ltemp = curr->next; // Remember link
curr->next = ltemp->next; // Remove from list
if (tail == ltemp) tail = curr; // Set tail
delete ltemp; // Free link
return temp; // Return value
}
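[Figure: removing the node holding 10, which follows curr (the node holding 23): (a) before; (b) after, with the value saved in "it" (step 1) and curr's next pointer redirected to 15 (step 2).]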
Freelists
System new and delete are slow.
class Link { // Singly-linked node
public: // with freelist
Elem element; // Elem value for node
Link* next; // Pointer to next node
static Link* freelist; // Link class freelist
Link(const Elem elemval, Link* nextval =NULL)
{ element = elemval; next = nextval; }
Link(Link* nextval =NULL) { next = nextval; }
void* operator new(size_t); // Overloaded new
void operator delete(void*); // Overloaded delete
};
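One way the overloaded operators might be written is sketched below. The class is renamed `FreeLink` (a hypothetical name) so the fragment stands alone, Elem is assumed to be int, and the bodies are an illustration of the freelist idea rather than the course's exact code.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

typedef int Elem;  // assumption: int elements for this sketch

class FreeLink {   // hypothetical stand-alone node with a freelist
public:
  Elem element;               // Elem value for node
  FreeLink* next;             // Pointer to next node
  static FreeLink* freelist;  // Nodes available for reuse
  FreeLink(const Elem elemval, FreeLink* nextval = NULL)
    { element = elemval; next = nextval; }
  void* operator new(std::size_t sz) {
    if (freelist == NULL) return ::operator new(sz);  // Freelist empty
    FreeLink* temp = freelist;                        // Reuse first node
    freelist = freelist->next;
    return temp;
  }
  void operator delete(void* ptr) {
    if (ptr == NULL) return;
    FreeLink* node = static_cast<FreeLink*>(ptr);
    node->next = freelist;    // Push the node onto the freelist
    freelist = node;
  }
};
FreeLink* FreeLink::freelist = NULL;
```

Nodes placed on the freelist are never returned to the system, so the program's memory use only grows; that is the price paid for avoiding repeated calls to the system allocator.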
Comparison of List Implementations
Array-Based Lists: [Average and worst cases]
• Insertion and deletion are Θ(n).
• Array must be allocated in advance.
• No overhead if all array positions are full.
Linked Lists:
• Insertion and deletion are Θ(1);
prev and direct access are Θ(n).
• Space grows with number of elements.
• Every element requires overhead.
Doubly Linked Lists
Simplify insertion and deletion: Add a prev
pointer.
class Link { // Doubly-linked node
public: // with freelist
Elem element; // Node Elem value
Link* next; // Pointer to next node
Link* prev; // Pointer to prev node
static Link* freelist; // Link class freelist
Link(const Elem Elemval, Link* nextp =NULL,
Link* prevp =NULL)
{ element = Elemval; next = nextp; prev = prevp;}
Link(Link* nextp =NULL, Link* prevp = NULL)
{ next = nextp; prev = prevp; }
void* operator new(size_t); // Overloaded new
void operator delete(void*); // Overloaded delete
};
[Figure: doubly linked list ( 20, 23, 12, 15 ) with head, curr, and tail pointers.]
Doubly Linked List Operations
[Figure: inserting 10 between 23 and 12 in a doubly linked list, with curr at the node holding 23: (a) before; (b) after, with the five pointer updates numbered 1 to 5.]
// Insert Elem at current position
void List::insert(const Elem item) {
assert(curr != NULL);
curr->next = new Link(item, curr->next, curr);
if (curr->next->next != NULL)
curr->next->next->prev = curr->next;
if (tail == curr) tail = curr->next;
}
Stacks
LIFO: Last In, First Out
Restricted form of list: Insert and remove only
at front of list.
Notation:
• Insert: PUSH
• Remove: POP
• The accessible element is called TOP.
Array-Based Stack
Define top as first free position.
class Stack { // Array-based stack class
private:
int size; // Maximum size of stack
int top; // Index for top Elem
Elem *listarray; // Array holding stack Elems
public:
Stack(int sz =LIST_SIZE) // Constructor: initialize
{ size = sz; top = 0; listarray = new Elem[sz]; }
~Stack() // Destructor: free array
{ delete [] listarray; }
void clear() // Remove all Elems
{ top = 0; }
void push(const Elem item) // Push Elem onto stack
{ assert(top < size); listarray[top++] = item; }
Elem pop() // Pop Elem from stack top
{ assert(!isEmpty()); return listarray[--top]; }
Elem topValue() const // Return value of top Elem
{ assert(!isEmpty()); return listarray[top-1]; }
bool isEmpty() const // Return TRUE if empty
{ return top == 0; }
};
[Figure: two stacks, with tops top1 and top2, stored in a single array.]
Linked Stack
class Stack { // Linked stack class
private:
Link *top; // Pointer to top Elem
public:
Stack(int sz =LIST_SIZE) // Constructor:
{ top = NULL; } // initialize
~Stack() { clear(); } // Destructor
void clear(); // Remove stack Elems
void push(const Elem item) // Push Elem onto stack
{ top = new Link(item, top); }
Elem pop(); // Pop Elem from stack
Elem topValue() const // Get value of top Elem
{ assert(!isEmpty()); return top->element; }
bool isEmpty() const // Return TRUE if empty
{ return top == NULL; }
};
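The class above declares pop and clear without showing their bodies. A possible version is sketched below as a stand-alone class; the names `LinkedStack` and `Node`, and the choice of int for Elem, are assumptions.

```cpp
#include <cassert>
#include <cstddef>

typedef int Elem;  // assumption: int elements for this sketch

struct Node {      // hypothetical singly-linked node
  Elem element;
  Node* next;
  Node(const Elem e, Node* n) : element(e), next(n) {}
};

class LinkedStack {  // hypothetical stand-alone linked stack
private:
  Node* top;         // Pointer to top Elem
public:
  LinkedStack() : top(NULL) {}
  ~LinkedStack() { clear(); }
  LinkedStack(const LinkedStack&) = delete;
  LinkedStack& operator=(const LinkedStack&) = delete;
  void clear() {                    // Free every node
    while (top != NULL) {
      Node* temp = top;
      top = top->next;
      delete temp;
    }
  }
  void push(const Elem item) { top = new Node(item, top); }
  Elem pop() {                      // Remove and return the top Elem
    assert(!isEmpty());
    Elem temp = top->element;       // Remember value
    Node* ltemp = top->next;        // Remember rest of stack
    delete top;
    top = ltemp;
    return temp;
  }
  bool isEmpty() const { return top == NULL; }
};
```

Push and pop touch only the head of the list, so both are Θ(1), matching the array-based stack.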
Queues
FIFO: First In, First Out
Restricted form of list:
Insert at one end, remove from other.
Notation:
• Insert: Enqueue
• Delete: Dequeue
• First element: FRONT
• Last element: REAR
Queue Implementations
Array-Based Queue
[Figure: array-based queue with front and rear markers: (a) holding 20, 5, 12, 17; (b) holding 12, 17, 3, 30, 4 after two elements were removed and three added.]
Full and Complete Binary Trees
Full binary tree: each node either is a leaf or is
an internal node with exactly two non-empty
children.
Complete binary tree: If the height of the tree
is d, then all levels except possibly level d are
completely full. The bottom level has all nodes
to the left side.
[Figure: (a) a full binary tree; (b) a complete binary tree.]
Full Binary Tree Theorem
Theorem: The number of leaves in a
non-empty full binary tree is one more than the
number of internal nodes.
[Relevant since it helps us calculate space requirements.]
Proof (by Mathematical Induction):
• Base Case: A full binary tree with 1
internal node must have two leaf nodes.
• Induction Hypothesis: Assume any full
binary tree T containing n − 1 internal
nodes has n leaves.
• Induction Step: Given tree T with n
internal nodes, pick internal node I with
two leaf children. Remove I's children, call
resulting tree T'. By induction hypothesis,
T' is a full binary tree with n leaves.
Restore I's two children. The number of
internal nodes has now gone up by 1 to
reach n. The number of leaves has also
gone up by 1.
Full Binary Tree Theorem Corollary
Theorem: The number of NULL pointers in a
non-empty binary tree is one more than the
number of nodes in the tree.
Proof: Replace all null pointers with a pointer
to an empty leaf node. This is a full binary tree.
Binary Tree Node Class
class BinNode { // Binary tree node class
public:
Belem element; // The node’s value
BinNode* left; // Pointer to left child
BinNode* right; // Pointer to right child
static BinNode* freelist;
// Two constructors: with and without initial values
BinNode() { left = right = NULL; }
BinNode(Belem e, BinNode* l =NULL, BinNode* r =NULL)
{ element = e; left = l; right = r; }
~BinNode() { } // Destructor
BinNode* leftchild() const { return left; }
BinNode* rightchild() const { return right; }
Belem value() const { return element; };
void setValue(Belem val) { element = val; }
bool isLeaf() const // TRUE if is a leaf
{ return (left == NULL) && (right == NULL); }
void* operator new(size_t); // Overload new
void operator delete(void*);// Overload delete
};
Traversals
Any process for visiting the nodes in some
order is called a traversal.
Any traversal that lists every node in the tree
exactly once is called an enumeration of the
tree's nodes.
Preorder traversal: Visit each node before
visiting its children.
Postorder traversal: Visit each node after
visiting its children.
Inorder traversal: Visit the left subtree, then
the node, then the right subtree.
void preorder(BinNode* rt) // rt is root of a subtree
{
if (rt == NULL) return; // Empty subtree
visit(rt); // visit performs desired action
preorder(rt->leftchild());
preorder(rt->rightchild());
}
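Inorder and postorder differ from the preorder routine only in where the visit happens. The sketch below uses a hypothetical `TNode` struct and appends each node's label to a string instead of calling visit, so the visiting order can be inspected.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct TNode {     // hypothetical minimal binary tree node
  char value;
  TNode* left;
  TNode* right;
  TNode(char v, TNode* l = NULL, TNode* r = NULL)
    : value(v), left(l), right(r) {}
};

// Inorder: left subtree, then the node, then the right subtree.
void inorder(const TNode* rt, std::string& out) {
  if (rt == NULL) return;          // Empty subtree
  inorder(rt->left, out);
  out += rt->value;                // "Visit" the node
  inorder(rt->right, out);
}

// Postorder: both subtrees first, then the node.
void postorder(const TNode* rt, std::string& out) {
  if (rt == NULL) return;          // Empty subtree
  postorder(rt->left, out);
  postorder(rt->right, out);
  out += rt->value;                // "Visit" the node
}
```

For a root A with left child B and right child C, inorder yields B A C and postorder yields B C A.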
Binary Tree Implementation
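[Figure: binary tree with nodes A through I, every node storing a value and two child pointers.]
[Leaves are the same as internal nodes. Lots of wasted
space.]
[Figure: expression tree with operators at the internal nodes and the operands 4, x, a, 2, x, and c at the leaves.]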
Union Implementation
enum Nodetype {leaf, internal}; // Enumerate node types
class VarBinNode { // Generic node class
public:
Nodetype mytype; // Stores type for this node
union {
struct { // Structure for internal node
VarBinNode* left; VarBinNode* right; // Children
Operator opx; // Internal node value
} intl;
Operand var; // Leaves just store a value
};
VarBinNode(const Operand& val) // Constructor: leaf
{ mytype = leaf; var = val; }
// Constructor: Internal
VarBinNode(const Operator& op,
VarBinNode* l, VarBinNode* r) {
mytype = internal; intl.opx = op;
intl.left = l; intl.right = r; }
bool isLeaf() { return mytype == leaf; }
VarBinNode* leftchild() { return intl.left; }
VarBinNode* rightchild() { return intl.right; }
};
Space Overhead
From Full Binary Tree Theorem:
Half of pointers are NULL.
If leaves only store information, then overhead
depends on whether tree is full.
All nodes the same, with two pointers to
children:
Total space required is (2p + d)n.
Overhead: 2pn.
If p = d, this means 2p/(2p + d) = 2/3 overhead.
[The following is for full binary trees:]
Eliminate pointers from leaf nodes:
$\frac{\frac{n}{2}(2p)}{\frac{n}{2}(2p) + dn} = \frac{p}{p + d}$
[Half the nodes have 2 pointers, which is overhead.]
This is 1/2 if p = d.
2p/(2p + d) if data only at leaves ⇒ 2/3
overhead.
Some method is needed to distinguish leaves
from internal nodes. [This adds overhead.]
Array Implementation
[This is a good example of logical representation vs. physical
implementation.]
For complete binary trees.
[Figure: (a) complete binary tree of 12 nodes numbered 0 to 11 level by level; (b) the same nodes stored in array positions 0 to 11.]
• Parent(r) = [⌊(r − 1)/2⌋ if r ≠ 0 and r < n.]
• Leftchild(r) = [2r + 1 if 2r + 1 < n.]
• Rightchild(r) = [2r + 2 if 2r + 2 < n.]
• Leftsibling(r) = [r − 1 if r is even, r > 0 and r < n.]
• Rightsibling(r) = [r + 1 if r is odd, r + 1 < n.]
[Since the complete binary tree is so limited in its shape,
(only one shape for tree of n nodes), it is reasonable to
expect that space efficiency can be achieved.]
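The index formulas translate directly into code. In this sketch each function returns -1 when the requested relative does not exist; that convention and the lowercase function names are assumptions.

```cpp
#include <cassert>

// Index arithmetic for a complete binary tree of n nodes stored in an
// array. Integer division supplies the floor in the parent formula.
int parent(int r, int n) {
  return (r > 0 && r < n) ? (r - 1) / 2 : -1;
}
int leftchild(int r, int n) {
  return (r >= 0 && 2 * r + 1 < n) ? 2 * r + 1 : -1;
}
int rightchild(int r, int n) {
  return (r >= 0 && 2 * r + 2 < n) ? 2 * r + 2 : -1;
}
int leftsibling(int r, int n) {
  return (r > 0 && r < n && r % 2 == 0) ? r - 1 : -1;
}
int rightsibling(int r, int n) {
  return (r >= 0 && r % 2 == 1 && r + 1 < n) ? r + 1 : -1;
}
```

In the 12-node tree above, node 5 has a left child (11) but no right child, and node 11 has no right sibling.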
Huffman Coding Trees
ASCII Codes: 8 bits per character.
Fixed length coding.
Can take advantage of relative frequency of
letters to save space.
Variable length coding.
Letter:     Z  K  F   C   U   D   L   E
Frequency:  2  7  24  32  37  42  42  120
Build the tree with minimal external path
weight.
Huffman Tree Construction
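[Each step joins the two lowest-weight trees under a new root; the list shows the remaining roots in weight order.]
Step 1: 2 (Z), 7 (K), 24 (F), 32 (C), 37 (U), 42 (D), 42 (L), 120 (E)
Step 2: Z and K joined under 9: 9, 24 (F), 32 (C), 37 (U), 42 (D), 42 (L), 120 (E)
Step 3: 9 and F joined under 33: 32 (C), 33, 37 (U), 42 (D), 42 (L), 120 (E)
Step 4: C and 33 joined under 65: 37 (U), 42 (D), 42 (L), 65, 120 (E)
Step 5: U and D joined under 79: 42 (L), 65, 79, 120 (E)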
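The construction repeatedly merges the two smallest weights, which a min-priority queue does naturally. The sketch below tracks only the weights of the internal nodes it creates, not the tree itself; the function name `huffman_merges` is an assumption.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Return the weights of the internal nodes in the order they are created.
std::vector<long> huffman_merges(const std::vector<long>& freqs) {
  std::priority_queue<long, std::vector<long>, std::greater<long> >
      heap(freqs.begin(), freqs.end());     // min-heap of tree weights
  std::vector<long> merges;
  while (heap.size() > 1) {
    long a = heap.top(); heap.pop();        // smallest weight
    long b = heap.top(); heap.pop();        // second smallest weight
    merges.push_back(a + b);                // new internal node
    heap.push(a + b);
  }
  return merges;
}
```

For the letter frequencies above the merges are 9, 33, 65, 79, 107, 186, and 306. Their sum, 785, equals the total weighted path length of the finished tree, since each leaf's weight is counted once for every internal node above it.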
Assigning Codes
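[Figure: the completed Huffman tree with root weight 306; each left branch is labeled 0 and each right branch 1. Reading the labels from the root to each leaf gives:]
Letter  Freq  Code    Bits
E       120   0       1
U       37    100     3
D       42    101     3
L       42    110     3
C       32    1110    4
F       24    11111   5
Z       2     111100  6
K       7     111101  6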
[Figure: two binary search trees, (a) and (b), holding the same values in different shapes.]
BST Search
class BST {
private:
BinNode* root;
void clearhelp(BinNode*); // Private
void inserthelp(BinNode*&, const Belem); // functions
BinNode* deletemin(BinNode*&);
void removehelp(BinNode*&, int);
Belem findhelp(BinNode*, int) const;
void printhelp(const BinNode*, int) const;
public:
BST() { root = NULL; }
~BST() { clearhelp(root); }
void clear() { clearhelp(root); root = NULL; }
void insert(const Belem val) {inserthelp(root, val);}
void remove(const int val) { removehelp(root, val); }
Belem find(const int val) const
{ return findhelp(root, val); }
bool isEmpty() const { return root == NULL; }
void print() const {
if (root == NULL) cout << "The BST is empty.\n";
else printhelp(root, 0);
}
};
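The class declares findhelp without showing it. A search of this kind might be written as below; to keep the fragment stand-alone it uses a hypothetical `BNode` struct with an int key and returns the matching node (or NULL) instead of a Belem.

```cpp
#include <cassert>
#include <cstddef>

struct BNode {     // hypothetical minimal BST node
  int key;
  BNode* left;
  BNode* right;
  BNode(int k) : key(k), left(NULL), right(NULL) {}
};

// Return the node whose key equals val, or NULL if there is none.
const BNode* findhelp(const BNode* rt, int val) {
  if (rt == NULL) return NULL;                        // Empty subtree
  if (val < rt->key) return findhelp(rt->left, val);  // Check left
  if (val > rt->key) return findhelp(rt->right, val); // Check right
  return rt;                                          // Found it
}
```

Each call descends one level, so the cost is the depth of the node found (or of the empty position where the search ends).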
[Figure: BST with root 37 containing 24, 42, 7, 32, 40, 42, 2, 35, and 120.]
Alternate Approach
BinNode* BST::inserthelp(BinNode* rt, const Belem val) {
if (rt == NULL)
return new BinNode(val, NULL, NULL);
if (key(val) < key(rt->value()))
rt->left = inserthelp(rt->left, val);
else rt->right = inserthelp(rt->right, val);
return rt;
}
Remove Minimum Value
BinNode* BST::deletemin(BinNode*& rt) {
assert(rt != NULL); // Must be a node to delete
if (rt->left != NULL) // Continue left
return deletemin(rt->left);
else // Found it
{ BinNode* temp = rt; rt = rt->right; return temp; }
}
[Figure: deletemin example showing rt and the node values 10, 5, and 20.]
BST Remove
void BST::removehelp(BinNode*& rt, int val) {
if (rt == NULL) cout << val << " is not in tree.\n";
else if (val < key(rt->value())) // Check left
removehelp(rt->left, val);
else if (val > key(rt->value())) // Check right
removehelp(rt->right, val);
else { // Found it: remove
BinNode* temp = rt;
if (rt->left == NULL) // Only a right -
rt = rt->right; // point to right
else if (rt->right == NULL) // Only a left -
rt = rt->left; // point to left
else { // Both non-empty
temp = deletemin(rt->right); // Replace with min
rt->setValue(temp->value()); // in right subtree
}
delete temp; // Free up space
}
}
[Figure: removing the root value 37 from the BST: it is replaced by 40, the minimum value in its right subtree.]
Cost of BST Operations
Find:
Insert:
Remove:
[All cost depth of the node in question. Worst case: Θ(n).
Average case: Θ(log n).]
Heaps
Heap: Complete binary tree with the
Heap Property:
• Min-heap: all values less than child values.
• Max-heap: all values greater than child
values.
The values in a heap are partially ordered.
Heap representation: normally the array based
complete binary tree representation.
Building the Heap
[Max Heap]
[Figure: two max-heaps built from the values ( 1, 2, 3, 4, 5, 6, 7 ): (a) results in ( 7, 4, 6, 1, 2, 3, 5 ); (b) results in ( 7, 5, 6, 4, 2, 1, 3 ).]
Siftdown
For fast heap construction:
• Work from high end of array to low end.
• Call siftdown for each item.
• Don't need to call siftdown on leaf nodes.
void heap::buildheap() // Heapify contents
{ for (int i=n/2-1; i>=0; i--) siftdown(i); }
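buildheap relies on siftdown, which is not shown. A possible max-heap version over a std::vector is sketched below; the free-function form and parameter names are assumptions, not the course's heap class.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Push the value at position pos down until neither child is larger.
void siftdown(std::vector<int>& heap, int pos) {
  int n = static_cast<int>(heap.size());
  while (2 * pos + 1 < n) {                  // Stop when pos is a leaf
    int child = 2 * pos + 1;                 // Left child
    if (child + 1 < n && heap[child + 1] > heap[child])
      child++;                               // Right child is larger
    if (heap[pos] >= heap[child]) return;    // Heap property holds
    std::swap(heap[pos], heap[child]);
    pos = child;                             // Continue one level down
  }
}

// Heapify: sift down every internal node, from the last one to the root.
void buildheap(std::vector<int>& heap) {
  for (int i = static_cast<int>(heap.size()) / 2 - 1; i >= 0; i--)
    siftdown(heap, i);
}
```

Starting from ( 1, 2, 3, 4, 5, 6, 7 ), this produces ( 7, 5, 6, 4, 2, 1, 3 ).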
General Trees
A tree T is a finite set of one or more nodes
such that there is one designated node r called
the root of T, and the remaining nodes in
(T − {r}) are partitioned into k ≥ 0 disjoint
subsets T1, T2, ..., Tk, each of which is a tree,
and whose roots r1, r2, ..., rk, respectively, are
children of r.
[Note: disjoint because a node cannot have two parents.]
[Figure: general tree terminology for a node V: the root R, the ancestors of V, its parent P, its siblings S1 and S2, its children C1 and C2, and the subtree rooted at V.]
General Tree ADT
class GTNode {
public:
GTNode(const Elem); // Constructor
~GTNode(); // Destructor
Elem value(); // Return node’s value
bool isLeaf(); // TRUE if is a leaf
GTNode* parent(); // Return parent
GTNode* leftmost_child(); // Return first child
GTNode* right_sibling(); // Return right sibling
void setValue(Elem); // Set node’s value
void insert_first(GTNode* n); // Insert first child
void insert_next(GTNode* n); // Insert right sibling
void remove_first(); // Remove first child
void remove_next(); // Remove right sibling
};
class GenTree {
public:
GenTree(); // Constructor
~GenTree(); // Destructor
void clear(); // Free nodes
GTNode* root(); // Return root
void newroot(Elem, GTNode*, GTNode*); // Combine
};
General Tree Traversal
void print(GTNode* rt) { // Preorder traverse from root
if (rt->isLeaf()) cout << "Leaf: ";
else cout << "Internal: ";
cout << rt->value() << "\n"; // Print or take action
GTNode* temp = rt->leftmost_child();
while (temp != NULL)
{ print(temp); temp = temp->right_sibling(); }
}
[Figure: tree with root R, whose children are A and B; A has children C, D, and E; B has child F.]
[Preorder: R A C D E B F]
Parent Pointer Implementation
[Figure: two trees: R with children A and B (A has children C, D, E; B has child F), and W with children X, Y, Z.]
Node Index      0  1  2  3  4  5  6  7  8  9  10
Label           R  A  B  C  D  E  F  W  X  Y  Z
Parent's Index  -  0  0  1  1  1  2  -  7  7  7
Equivalence Classes
When joining equivalence classes, want to keep
depth small.
Weighted Union Rule: join the tree with fewer
nodes to the tree with more nodes.
Limits depth to log n for n nodes.
[Less than half of the nodes increase depth by 1.]
Path Compression: Make all nodes visited point
to root. [Nearly constant cost.]
class GTNode { // General tree node
public:
GTNode* par; // Parent pointer
GTNode() { par = NULL; } // Constructor
GTNode* parent() { return par; } // Return parent
};
UNION-FIND
Gentree::Gentree(int sz) { // Constructor
size = sz;
array = new GTNode[sz]; // Create node array
}
Gentree::~Gentree() { // Destructor
delete [] array; // Free node array
}
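Only the constructor and destructor appear here. The weighted union rule and path compression can be sketched over a plain parent array as below; the class name `UnionFind`, the method names, and the use of -1 to mark a root are assumptions (the method is called `unite` because `union` is a C++ keyword).

```cpp
#include <cassert>
#include <vector>

class UnionFind {          // hypothetical array-based parent pointer trees
private:
  std::vector<int> par;    // par[i] == -1 marks a root
  std::vector<int> count;  // Number of nodes in the tree (valid at roots)
public:
  explicit UnionFind(int sz) : par(sz, -1), count(sz, 1) {}
  // Return the root of curr's tree, compressing the path on the way back.
  int find(int curr) {
    if (par[curr] == -1) return curr;
    par[curr] = find(par[curr]);   // Point directly at the root
    return par[curr];
  }
  bool differ(int a, int b) { return find(a) != find(b); }
  // Weighted union: the tree with fewer nodes joins the larger tree.
  void unite(int a, int b) {
    int ra = find(a);
    int rb = find(b);
    if (ra == rb) return;          // Already in the same class
    if (count[ra] < count[rb]) { par[ra] = rb; count[rb] += count[ra]; }
    else                       { par[rb] = ra; count[ra] += count[rb]; }
  }
};
```

Processing the pairs (A,B), (C,H), (G,F), (D,E), (I,F), (H,A), (E,G), (H,E) with A to J numbered 0 to 9 leaves A through I in one class and J alone. Which root survives a tie is a detail of this sketch and may differ from the figures.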
Equivalence Processing Example
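(a) Initial state: each of the nodes A to J is the root of its own tree.
Node    A  B  C  D  E  F  G  H  I  J
Index   0  1  2  3  4  5  6  7  8  9
Parent  -  -  -  -  -  -  -  -  -  -
[Process (A,B), (C, H), (G,F), (D, E), (I, F)]
(b) Trees: A (child B); C (child H); F (children G, I); D (child E); J.
Node    A  B  C  D  E  F  G  H  I  J
Index   0  1  2  3  4  5  6  7  8  9
Parent  -  0  -  -  3  -  5  2  5  -
[Process (H, A), (E, G)]
(c) Trees: A (children B, C; C has child H); F (children G, I, D; D has child E); J.
Node    A  B  C  D  E  F  G  H  I  J
Index   0  1  2  3  4  5  6  7  8  9
Parent  -  0  0  5  3  -  5  2  5  -
[Process (H, E)]
(d) Trees: F (children A, G, I, D; A has children B, C; C has child H; D has child E); J.
Node    A  B  C  D  E  F  G  H  I  J
Index   0  1  2  3  4  5  6  7  8  9
Parent  5  0  0  5  3  -  5  2  5  -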
Path Compression Example
5 0 0 5 5 5 0 5 F J
A B C D E F G H I J
0 1 2 3 4 5 6 7 8 9 A G I D E
B C H
86
Lists of Children
Index Val Par
0 R 1 3
1 A 0 2 4 6
2 C 1
3 B 0 5
4 D 1
5 F 3
6 E 1
7
87
Leftmost Child/Right Sibling
Left Val Par Right
1 R
R 0
3 A 0 2
6 B 0
R X C 1 4
D 1 5
A B E 1
F 2
C D E F 8 R 0
X 7
[Note: Two trees share same array.]
Left Val Par Right
1 R 7 8
R0
3 A 0 2
6 B 0
R X C 1 4
D 1 5
A B E 1
F 2
C D E F 0 R 0
-1
X 7
88
Linked Implementations
Val Size
R 2
R
A 3 B 1
A B
C D E F C 0 D 0 E 0 F 0
(a) (b)
R A B
A B
C D E F C D E F
(a) (b)
89
Sequential Implementations
List node values in the order they would be
visited by a preorder traversal.
Saves space, but allows only sequential access.
Need to retain tree structure for reconstruction.
90
Convert to Binary Tree
Left Child/Right Sibling representation
essentially stores a binary tree.
Use this process to convert any general tree to
a binary tree.
A forest is a collection of one or more general
trees.
root
(a) (b)
91
Graphs
A graph G = (V, E) consists of a set of
vertices V, and a set of edges E, such that
each edge in E is a connection between a pair
of vertices in V.
The number of vertices is written |V|, and the
number of edges is written |E|.
A sequence of vertices v1, v2, ..., vn forms a path
of length n − 1 if there exist edges from vi to
vi+1 for 1 ≤ i < n.
A path is simple if all vertices on the path are
distinct.
A cycle is a path of length 3 or more that
connects vi to itself.
A cycle is simple if the path is simple, except
for the first and last vertices being the same.
92
Graph Definitions (Cont)
An undirected graph is connected if there is at
least one path from any vertex to any other.
The maximal connected subgraphs of an
undirected graph are called
connected components.
A graph without cycles is acyclic.
A directed graph without cycles is a
directed acyclic graph or DAG.
A free tree is a connected, undirected graph
with no simple cycles. Equivalently, a free tree
is connected and has |V| − 1 edges.
0 2
4 1
3 4 7
1
2
1 3
(a) (b) (c)
93
Connected Components
0 2 6 7
1 3 5
94
Graph Representations
Adjacency Matrix: Θ(|V|^2) space.
Adjacency List: Θ(|V| + |E|) space.
0 1 2 3 4
0 2 0 1 1
1 1
4 2 1
3 1
1 3 4 1
(a) (b)
0 1 4
1 3
2 4
3 2
4 1
(c)
0 1 2 3 4
0 2 0 1 1
1 1 1 1
4 2 1 1
3 1 1
1 3 4 1 1 1
(a) (b)
0 1 4
1 0 3 4
2 3 4
3 1 2
4 0 1 2
(c)
99
Graph Traversals
Some applications require visiting every vertex
in the graph exactly once.
Application may require that vertices be visited
in some special order based on graph topology.
Example: Artificial Intelligence
• Problem domain consists of many "states."
• Need to get from Start State to Goal State.
• Start and Goal are typically not directly
connected.
To ensure visiting all vertices:
void graph_traverse(Graph& G) {
for (int v=0; v<G.n(); v++)
G.Mark[v] = UNVISITED; // Initialize mark bits
for (int v=0; v<G.n(); v++)
if (G.Mark[v] == UNVISITED) // Start a traversal in each unvisited component
do_traverse(G, v);
}
[Two traversals we will talk about: DFS, BFS.]
100
Depth First Search
void DFS(Graph& G, int v) { // Depth first search
PreVisit(G, v); // Take appropriate action
G.setMark(v, VISITED);
for (Edge w = G.first(v); G.isEdge(w); w = G.next(w))
if (G.getMark(G.v2(w)) == UNVISITED)
DFS(G, G.v2(w));
PostVisit(G, v); // Take appropriate action
}
A B A B
C C
D D
F F
E E
(a) (b)
101
Breadth First Search
Like DFS, but replace stack with a queue.
Visit the vertex's neighbors before continuing
deeper in the tree.
void BFS(Graph& G, int start) {
Queue Q(G.n());
Q.enqueue(start);
G.setMark(start, VISITED);
while (!Q.isEmpty()) {
int v = Q.dequeue();
PreVisit(G, v); // Take appropriate action
for (Edge w = G.first(v); G.isEdge(w); w=G.next(w))
if (G.getMark(G.v2(w)) == UNVISITED) {
G.setMark(G.v2(w), VISITED);
Q.enqueue(G.v2(w));
}
PostVisit(G, v); // Take appropriate action
}}
A B A B
C C
D D
F F
E E
(a) (b)
102
Topological Sort
Problem: Given a set of jobs, courses, etc.
with prerequisite constraints, output the jobs in
an order that does not violate any of the
prerequisites.
J6
J1 J2 J5 J7
J3 J4
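One way to produce such an order is a queue-based approach: repeatedly output a job whose prerequisites have all been output. The sketch below is an illustration under stated assumptions, not the course's code: the graph is a plain adjacency list in which an edge u -> v means "u is a prerequisite of v", and the name `topsort` is hypothetical.

```cpp
#include <queue>
#include <vector>

// adj[u] lists the vertices that have u as a prerequisite (edge u -> v).
// Returns one valid order, or an empty vector if the graph contains a cycle.
std::vector<int> topsort(const std::vector<std::vector<int>>& adj) {
  int n = static_cast<int>(adj.size());
  std::vector<int> indegree(n, 0);
  for (int u = 0; u < n; u++)
    for (int v : adj[u]) indegree[v]++;        // Count prerequisites of each vertex

  std::queue<int> ready;                       // Vertices with no pending prerequisites
  for (int v = 0; v < n; v++)
    if (indegree[v] == 0) ready.push(v);

  std::vector<int> order;
  while (!ready.empty()) {
    int u = ready.front();
    ready.pop();
    order.push_back(u);
    for (int v : adj[u])                       // u is done: release its dependents
      if (--indegree[v] == 0) ready.push(v);
  }
  if (static_cast<int>(order.size()) != n) order.clear();  // Cycle: no valid order
  return order;
}
```

A depth-first alternative outputs vertices in reverse postorder; the queue version is shown here only because it detects cycles with no extra bookkeeping.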
104
Shortest Paths Problems
Input: A graph with weights or costs
associated with each edge.
Output: The list of edges forming the shortest
path.
Sample problems:
• Find the shortest path between two
specified vertices.
• Find the shortest path from vertex S to all
other vertices.
• Find the shortest path between all pairs of
vertices.
Our algorithms will actually calculate only
distances.
105
Shortest Paths Definitions
d(A, B) is the shortest distance from vertex A
to B.
w(A, B) is the weight of the edge connecting
A to B.
• If there is no such edge, then w(A, B) = ∞.
B 5
10 D
20
A
2 11
3
15
C E
106
Single Source Shortest Paths
Given start vertex s, find the shortest path
from s to all other vertices.
Try 1: Visit all vertices in some order, compute
shortest paths for all vertices seen so far, then
add the shortest path to next vertex x.
Problem: Shortest path to a vertex already
processed might go through x.
Solution: Process vertices in order of distance
from s.
107
Dijkstra's Algorithm Example
            A    B    C    D    E
Initial     0    ∞    ∞    ∞    ∞
Process A   0   10    3   20    ∞
Process C   0    5    3   20   18
Process B   0    5    3   10   18
Process D   0    5    3   10   18
Process E   0    5    3   10   18
B 5
10 D
20
A
2 11
3
15
C E
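The distances in this example can be checked with a short array-based sketch. It rests on assumptions that are not stated on the slide: the figure's edges are read as directed A→B (10), A→C (3), A→D (20), B→D (5), C→B (2), C→E (15), D→E (11); the graph is a plain weight matrix rather than the course's Graph class; and `dijkstra` and `INF` are illustrative names.

```cpp
#include <limits>
#include <vector>

const int INF = std::numeric_limits<int>::max();  // Stands in for "no edge" / "unreached"

// w[u][v] is the weight of edge u -> v, or INF if there is no such edge.
// Returns the shortest distance from s to every vertex (INF if unreachable).
std::vector<int> dijkstra(const std::vector<std::vector<int>>& w, int s) {
  int n = static_cast<int>(w.size());
  std::vector<int> D(n, INF);
  std::vector<bool> visited(n, false);
  D[s] = 0;
  for (int i = 0; i < n; i++) {
    int v = -1;                                   // Unvisited vertex with smallest D
    for (int u = 0; u < n; u++)
      if (!visited[u] && (v == -1 || D[u] < D[v])) v = u;
    if (D[v] == INF) break;                       // Remaining vertices unreachable
    visited[v] = true;
    for (int u = 0; u < n; u++)                   // Relax every edge leaving v
      if (w[v][u] != INF && D[u] > D[v] + w[v][u])
        D[u] = D[v] + w[v][u];
  }
  return D;
}
```

With vertices A through E numbered 0 through 4 and the edges listed above, calling `dijkstra(w, 0)` yields the distances 0, 5, 3, 10, 18, matching the final row of the table.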
108
Dijkstra's Algorithm: Array
void Dijkstra(Graph& G, int s) { // Use array
int D[G.n()];
for (int i=0; i<G.n(); i++) // Initialize
D[i] = INFINITY;
D[s] = 0;
for (int i=0; i<G.n(); i++) { // Process vertices
int v = minVertex(G, D);
if (D[v] == INFINITY) return; // Unreachable
G.setMark(v, VISITED);
for (Edge w = G.first(v); G.isEdge(w); w=G.next(w))
if (D[G.v2(w)] > (D[v] + G.weight(w)))
D[G.v2(w)] = D[v] + G.weight(w);
}
}
0 1 1 3
2 2 11
12
1
[0,3 is a 0-path. 2,0,3 is a 1-path. 0,2,3 is a 3-path, but not a
2 or 1 path. Everything is a 4 path.]
111
Floyd's Algorithm
void Floyd(Graph& G) { // All-pairs shortest paths
int D[G.n()][G.n()]; // Store distances
for (int i=0; i<G.n(); i++) // Initialize D
for (int j=0; j<G.n(); j++)
D[i][j] = G.weight(i, j);
for (int k=0; k<G.n(); k++) // Compute all k paths
for (int i=0; i<G.n(); i++)
for (int j=0; j<G.n(); j++)
if (D[i][j] > (D[i][k] + D[k][j]))
D[i][j] = D[i][k] + D[k][j];
}
112
Minimum Cost Spanning Trees
Minimum Cost Spanning Tree (MST) Problem:
• Input: An undirected, connected graph G.
• Output: The subgraph of G that 1) has
minimum total cost as measured by
summing the values for all of the edges in
the subset, and 2) keeps the vertices
connected.
A B
7 5
C
9 1 2 6
D 2
F
E 1
113
Prim's MST Algorithm
void Prim(Graph& G, int s) { // Prim’s MST alg
int D[G.n()]; // Distance vertex
int V[G.n()]; // Who’s closest
for (int i=0; i<G.n(); i++) // Initialize
D[i] = INFINITY;
D[s] = 0;
for (int i=0; i<G.n(); i++) { // Process vertices
int v = minVertex(G, D);
if (D[v] == INFINITY) return; // Rest unreachable
G.setMark(v, VISITED);
if (v != s) AddEdgetoMST(V[v], v); // Add to MST
for (Edge w = G.first(v); G.isEdge(w); w=G.next(w))
if (D[G.v2(w)] > G.weight(w)) {
D[G.v2(w)] = G.weight(w); // Update distance,
V[G.v2(w)] = v; // who came from
}}}
115
Proof of Prim's MST Algorithm
Theorem 14.1 Prim's algorithm produces a
minimum cost spanning tree.
Proof by contradiction:
Order vertices by how they are added to the
MST by Prim's algorithm. v1, v2, ..., vn.
Let edge ei connect (vx, vi+1), x < i.
Let ej be the lowest numbered (first) edge
added by the algorithm such that the set of
edges selected so far cannot be extended to
form an MST for G.
Let V1 = (v1, ..., vj ). Let V2 = (vj +1, ..., vn).
[Figure: marked vertices v_i, i < j, on one side and unmarked vertices v_i, i ≥ j, on the other. Prim's edge e_j joins v_p to v_j; the "correct" edge e' joins v_u to v_w.]
116
Kruskal's MST Algorithm
void Kruskal(Graph& G) { // Kruskal’s MST algorithm
Gentree A(G.n()); // Equivalence class array
Elem E[G.e()]; // Array of edges for min-heap
int edgecnt = 0;
for (int i=0; i<G.n(); i++) // Put edges on array
for (Edge w = G.first(i);
G.isEdge(w); w = G.next(w)) {
E[edgecnt].weight = G.weight(w);
E[edgecnt++].edge = w;
}
heap H(E, edgecnt, edgecnt); // Heapify the edges
int numMST = G.n(); // Init w/ n equiv classes
for (int i=0; numMST>1; i++) { // Combine equiv classes
Elem temp = H.removemin(); // Get next cheap edge
Edge w = temp.edge;
int v = G.v1(w); int u = G.v2(w);
if (A.differ(v, u)) { // If different equiv classes
A.UNION(v, u); // Combine equiv classes
AddEdgetoMST(G.v1(w), G.v2(w)); // Add to MST
numMST--; // One less MST
}
}
}
C
Step 1 A B E F
1
Process edge (C, D)
D
C F
Step 2 A B
1 E 1
Process edge (E, F)
D
C
Step 3 A B
1 2
Process edge (C, F)
D
F
E 1
118
Sorting
Each record contains a field called the key.
Linear order: comparison.
[a < b and b < c ⇒ a < c.]
The Sorting Problem
Given a sequence of records R1, R2, ..., Rn with
key values k1, k2, ..., kn, respectively, arrange the
records into any order s such that records
Rs1 , Rs2 , ..., Rsn have keys obeying the property
ks1 ≤ ks2 ≤ ... ≤ ksn .
[Put keys in ascending order.]
Measures of cost:
• Comparisons
• Swaps
119
Insertion Sort
void inssort(Elem* array, int n) { // Insertion Sort
for (int i=1; i<n; i++) // Insert i’th record
for (int j=i; (j>0) &&
(key(array[j])<key(array[j-1])); j--)
swap(array, j, j-1);
}
i=1 2 3 4 5 6 7
42 20 17 13 13 13 13 13
20 42 20 17 17 14 14 14
17 17 42 20 20 17 17 15
13 13 13 42 28 20 20 17
28 28 28 28 42 28 23 20
14 14 14 14 14 42 28 23
23 23 23 23 23 23 42 28
15 15 15 15 15 15 15 42
122
Pointer Swapping
Key = 42 Key = 42
Key = 5 Key = 5
(a) (b)
123
Exchange Sorting
Summary
                Insertion   Bubble    Selection
Comparisons:
  Best Case     Θ(n)        Θ(n^2)    Θ(n^2)
  Average Case  Θ(n^2)      Θ(n^2)    Θ(n^2)
  Worst Case    Θ(n^2)      Θ(n^2)    Θ(n^2)
Swaps:
  Best Case     0           0         Θ(n)
  Average Case  Θ(n^2)      Θ(n^2)    Θ(n)
  Worst Case    Θ(n^2)      Θ(n^2)    Θ(n)
All of these sorts rely on exchanges of
adjacent records.
What is the average number of exchanges
required?
[n^2/4: average distance from a record to its sorted position.]
124
Shellsort
void shellsort(Elem* array, int n) { // Shellsort
for (int i=n/2; i>2; i/=2) // For each increment
for (int j=0; j<i; j++) // Sort each sublist
inssort2(&array[j], n-j, i);
inssort2(array, n, 1);
}
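The shellsort code above relies on a helper, inssort2, that is not shown. A plausible sketch, assuming integer keys and assuming the third parameter is the increment between elements of one sublist, is:

```cpp
#include <utility>

// Assumed behavior of inssort2: insertion sort restricted to every
// incr'th element of array[0..n-1], starting at position 0.
void inssort2(int* array, int n, int incr) {
  for (int i = incr; i < n; i += incr)            // Insert the i'th record
    for (int j = i; (j >= incr) && (array[j] < array[j - incr]); j -= incr)
      std::swap(array[j], array[j - incr]);       // Move it toward the front of its sublist
}
```

Calling `inssort2(&array[j], n-j, i)` for each j from 0 to i-1 sorts the i interleaved sublists in turn, and `inssort2(array, n, 1)` is an ordinary insertion sort.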
[8 lists of length 2]
36 20 11 13 28 14 23 15 59 98 17 70 65 41 42 83
[4 lists of length 4]
28 14 11 13 36 20 17 15 59 41 23 70 65 98 42 83
[2 lists of length 8]
11 13 17 14 23 15 28 20 36 41 42 70 59 83 65 98
[1 list of length 16]
11 13 14 15 17 20 23 28 36 41 42 59 65 70 83 98
O(n^1.5) [Any increments will work, provided the last is 1.
Shellsort takes advantage of inssort's best case performance.]
125
Quicksort
Divide and Conquer: divide list into values less
than pivot and values greater than pivot.
[Initial call: ]
qsort(array, 0, n-1);
126
Quicksort Partition
int partition(Elem* array, int l, int r, int pivot) {
do { // Move the bounds inward until they meet
while (key(array[++l]) < pivot); // Move right
while (r && (key(array[--r]) > pivot));// Move left
swap(array, l, r); // Swap out-of-place vals
} while (l < r); // Stop when they cross
swap(array, l, r); // Reverse wasted swap
return l; // Return first pos in right partition
}
Initial 72 6 57 88 85 42 83 73 48 60
l r
Pass 1 72 6 57 88 85 42 83 73 48 60
l r
Swap 1 48 6 57 88 85 42 83 73 72 60
l r
Pass 2 48 6 57 88 85 42 83 73 72 60
l r
Swap 2 48 6 57 42 85 88 83 73 72 60
l r
Pass 3 48 6 57 42 85 88 83 73 72 60
r l
Swap 3 48 6 57 85 42 88 83 73 72 60
r l
Reverse Swap 48 6 57 42 85 88 83 73 72 60
r l
The cost for Partition is Θ(n).
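The driver that calls partition is not shown on these slides. The sketch below is one way it might look, under stated assumptions: keys are plain ints, the pivot is the middle element (the hypothetical `findpivot`), the pivot is parked at the right end before partitioning, and the function is named `quicksort` here to avoid clashing with the C library's `qsort`. The partition routine is repeated with int keys so the block is self-contained.

```cpp
#include <utility>

// Assumed pivot rule: take the middle position of the subarray.
int findpivot(int* /*array*/, int i, int j) { return (i + j) / 2; }

// Same logic as the partition routine above, specialized to int keys.
int partition(int* array, int l, int r, int pivot) {
  do {                                     // Move the bounds inward until they meet
    while (array[++l] < pivot) {}          // Move l right
    while (r && (array[--r] > pivot)) {}   // Move r left
    std::swap(array[l], array[r]);         // Swap out-of-place values
  } while (l < r);                         // Stop when they cross
  std::swap(array[l], array[r]);           // Reverse the last, wasted swap
  return l;                                // First position in right partition
}

void quicksort(int* array, int i, int j) { // Sort array[i..j]
  int pivotindex = findpivot(array, i, j);
  std::swap(array[pivotindex], array[j]);  // Park the pivot at the end
  int k = partition(array, i - 1, j, array[j]);
  std::swap(array[k], array[j]);           // Put the pivot in its final place
  if ((k - i) > 1) quicksort(array, i, k - 1);   // Sort left partition
  if ((j - k) > 1) quicksort(array, k + 1, j);   // Sort right partition
}
```

For the ten-element example (72 6 57 88 60 42 83 73 48 85), the middle element 60 is chosen, swapped to the end, and the first partition call then follows the trace shown above.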
127
Quicksort Example
72 6 57 88 60 42 83 73 48 85
Pivot = 60
48 6 57 42 60 88 83 73 72 85
Pivot = 6 Pivot = 73
6 42 57 48 72 73 85 88 83
Pivot = 57 Pivot = 88
Pivot = 42 42 48 57 85 83 88 Pivot = 85
42 48 83 85
6 42 48 57 60 72 73 83 85 88
Final Sorted Array
128
Cost for Quicksort
Best Case: Always partition in half.
Worst Case: Bad partition.
Average Case:
$$T(n) = n + 1 + \frac{1}{n-1} \sum_{k=1}^{n-1} \bigl(T(k) + T(n-k)\bigr) = \Theta(n \log n)$$
129
Mergesort
List mergesort(List inlist) {
if (inlist.length() <= 1) return inlist;
List l1 = half of the items from inlist;
List l2 = other half of the items from inlist;
return merge(mergesort(l1), mergesort(l2));
}
36 20 17 13 28 14 23 15
20 36 13 17 14 28 15 23
13 17 20 36 14 15 23 28
13 14 15 17 20 23 28 36
130
Mergesort Implementation
Mergesort is tricky to implement.
void mergesort(Elem* array, Elem* temp,
int left, int right) {
int mid = (left+right)/2;
if (left == right) return; // List of one ELEM
mergesort(array, temp, left, mid); // Sort 1st half
mergesort(array, temp, mid+1, right);// Sort 2nd half
for (int i=left; i<=right; i++) // Copy to temp
temp[i] = array[i];
// Do the merge operation back to array
int i1 = left; int i2 = mid + 1;
for (int curr=left; curr<=right; curr++) {
if (i1 == mid+1) // Left sublist exhausted
array[curr] = temp[i2++];
else if (i2 > right) // Right sublist exhausted
array[curr] = temp[i1++];
else if (key(temp[i1]) < key(temp[i2]))
array[curr] = temp[i1++];
else array[curr] = temp[i2++];
}}
[Note: This requires a second array.]
Mergesort cost: [Θ(n log n)]
Mergesort is good for sorting linked lists.
[Send records to alternating linked lists, mergesort each, then
merge.]
131
Optimized Mergesort
void mergesort(Elem* array, Elem* temp,
int left, int right) {
int i, j, k, mid = (left+right)/2;
if (left == right) return;
mergesort(array, temp, left, mid); // Sort 1st half
mergesort(array, temp, mid+1, right);// Sort 2nd half
132
Heapsort
Heapsort uses a max-heap.
void heapsort(Elem* array, int n) { // Heapsort
heap H(array, n, n); // Build the heap
for (int i=0; i<n; i++) // Now sort
H.removemax(); // Value placed at end of heap
}
Cost of Heapsort: [Θ(n log n)]
Cost of finding k largest elements: [Θ(k log n + n).
Time to build heap: Θ(n).
Time to remove largest element: Θ(log n).]
[Compare to sorting with BST: this is expensive in space
(overhead), potential bad balance, BST does not take
advantage of having all records available in advance.]
[Heap is space efficient, balanced, and building initial heap is
efficient.]
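The cost of finding the k largest elements can be seen in a short sketch that uses the standard library's heap algorithms in place of the course's heap class (a substitution made only for illustration; `kLargest` is a hypothetical name). Building the heap is linear, and each of the k removals costs logarithmic time.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the k largest values of a, largest first.
std::vector<int> kLargest(std::vector<int> a, std::size_t k) {
  std::make_heap(a.begin(), a.end());     // Build a max-heap in linear time
  std::vector<int> result;
  while (result.size() < k && !a.empty()) {
    std::pop_heap(a.begin(), a.end());    // Move the maximum to the back
    result.push_back(a.back());
    a.pop_back();                         // Shrink the heap by one
  }
  return result;
}
```

Running the loop n times instead of k times gives heapsort itself, with the values emitted in descending order.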
133
Heapsort Example
Original Numbers 73
73 6 57 88 60 42 83 72 48 85 6 57
88 60 42 83
72 48 85
Build Heap 88
88 85 83 72 73 42 57 6 48 60 85 83
72 73 42 57
6 48 60
Remove 88 85
85 73 83 72 60 42 57 6 48 88 73 83
72 60 42 57
6 48
Remove 85 83
83 73 57 72 60 42 48 6 85 88 73 57
72 60 42 48
6
Remove 83 73
73 72 57 6 60 42 48 83 85 88 72 57
6 60 42 48
134
Binsort
A simple, efficient sort:
for (i=0; i<n; i++)
B[key(A[i])] = A[i];
135
Radix Sort
Initial List: 27 91 1 97 17 23 84 28 72 5 67 25
First pass Second pass
(on right digit) (on left digit)
0 0 1 5
1 91 1 1 17
2 72 2 23 25 27 28
3 23 3
4 84 4
5 5 25 5
6 6 67
7 27 97 17 67 7 72
8 28 8 84
9 9 91 97
136
Cost of Radix Sort
void radix(Elem* A, Elem* B, int n, int k, int r,
int* count) {
// Count[i] stores number of records in bin[i]
137
Radix Sort Example
Initial Input: Array A 27 91 1 97 17 23 84 28 72 5 67 25
0 1 2 3 4 5 6 7 8 9
First pass values for Count. 0 2 1 1 1 2 0 4 1 0
rtok = 1.
0 1 2 3 4 5 6 7 8 9
Count array: 0 2 3 4 5 7 7 11 12 12
Index positions for Array B.
0 1 2 3 4 5 6 7 8 9
Second pass values for Count. 2 1 4 0 0 0 1 1 1 2
rtok = 10.
0 1 2 3 4 5 6 7 8 9
Count array: 2 3 7 7 7 7 8 9 10 12
Index positions for Array B.
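The counting scheme in this example can be sketched as follows. This is an illustration with a simplified signature, not the truncated course routine: it assumes non-negative integer keys, allocates the second array and the count array internally, and takes k as the number of digits and r as the radix.

```cpp
#include <algorithm>
#include <vector>

// Sort A in place, assuming keys are non-negative and have at most k digits in base r.
void radix(std::vector<int>& A, int k, int r) {
  std::vector<int> B(A.size());
  std::vector<int> count(r);
  for (int i = 0, rtok = 1; i < k; i++, rtok *= r) {  // One pass per digit
    std::fill(count.begin(), count.end(), 0);
    for (int x : A) count[(x / rtok) % r]++;          // count[d] = records with digit d
    for (int j = 1; j < r; j++)                       // Running totals: count[d] is now one
      count[j] += count[j - 1];                       // past the last slot of bin d in B
    for (int j = static_cast<int>(A.size()) - 1; j >= 0; j--)
      B[--count[(A[j] / rtok) % r]] = A[j];           // Fill bins from the back (stable)
    A.swap(B);                                        // Output of this pass feeds the next
  }
}
```

On the twelve-element input above with k = 2 and r = 10, the first pass computes exactly the Count values shown (0 2 1 1 1 2 0 4 1 0, then running totals 0 2 3 4 5 7 7 11 12 12).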
138
Empirical Comparison
Algorithm 10 100 1,000 10K 30K
Insert. Sort .00200 .1833 18.13 1847.0 16544
Bubble Sort .00233 .2267 22.47 2274.0 20452
Selec. Sort .00167 .0967 2.17 900.3 8142
Shellsort .00233 .0600 1.00 17.0 59
Shellsort/O .00233 .0500 .93 16.3 65
QSort .00367 .0500 .63 7.3 24
QSort/O .00200 .0300 .43 5.7 18
Merge .00700 .0700 .87 10.7 35
Merge/O .00133 .0267 .37 5.0 16
Heapsort .00900 .1767 2.67 36.3 122
Rad Sort/1 .02433 .2333 2.30 23.3 69
Rad Sort/4 .00700 .0600 .60 6.0 17
Rad Sort/8 .00967 .0333 .30 3.0 8
Algorithm 10 100 1,000 10K 100K
Insert. Sort .0002 .0170 1.68 168.8 23382
Bubble Sort .0003 .0257 2.55 257.2 41874
Selec. Sort .0003 .0273 2.65 267.5 40393
Shellsort .0003 .0027 0.12 1.9 40
Shellsort/O .0003 .0060 0.11 1.8 33
QSort .0004 .0057 0.08 0.9 12
QSort/O .0002 .0040 0.06 0.8 10
Merge .0009 .0130 0.17 2.3 30
Merge/O .0003 .0067 0.11 1.5 21
Heapsort .0010 .0173 0.26 3.5 49
Rad Sort/1 .0123 .1197 1.21 12.5 135
Rad Sort/4 .0035 .0305 0.30 3.2 34
Rad Sort/8 .0047 .0183 0.16 1.6 18
139
Sorting Lower Bound
Want to prove a lower bound for all possible
sorting algorithms.
Sorting is O(n log n).
Sorting I/O takes Ω(n) time.
Will now prove Ω(n log n) lower bound.
Form of proof:
• Comparison based sorting can be modeled
by a binary tree.
• The tree must have Ω(n!) leaves.
• The tree must be Ω(n log n) levels deep.
140
Decision Trees
XYZ
XYZ YZX
XZY ZXY
YXZ ZYX
Yes A[1]<A[0]? No
(Y<X?)
YXZ XYZ
YXZ XYZ
YZX XZY
ZYX ZXY
Yes No Yes No
A[2]<A[1]? A[2]<A[1]?
YZX (Z<X?) YXZ XZY (Z<Y?) XYZ
YZX XZY
ZYX ZXY
Yes No Yes No
A[1]<A[0]? A[1]<A[0]?
ZYX (Z<Y?) YZX ZXY (Z<X?) XZY
There are n! permutations, and at least 1 node
for each permutation.
A tree with n nodes has at least log n levels.
Where is the worst case in the decision tree?
log n! = Ω(n log n).
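The last step can be justified without Stirling's approximation. A short derivation, keeping only the largest half of the terms in the sum:

```latex
\log n! \;=\; \sum_{i=1}^{n} \log i
        \;\ge\; \sum_{i=\lceil n/2 \rceil}^{n} \log i
        \;\ge\; \frac{n}{2} \log \frac{n}{2}
        \;=\; \Omega(n \log n)
```

The second inequality holds because the remaining sum has at least n/2 terms, each at least log(n/2). Since the decision tree has at least n! leaves, its depth, and hence the worst-case number of comparisons, is at least log n!.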
141
Primary vs. Secondary Storage
Primary Storage: Main memory (RAM)
Secondary Storage: Peripheral devices
• Disk Drives
• Tape Drives
Medium               Price   Price per Mbyte
32MB RAM             $225    $7.00/MB
1.4MB floppy disk    $.50    $0.36/MB
2.1GB disk drive     $210    $0.10/MB
1GB JAZ cassette     $100    $0.10/MB
2GB cartridge tape   $20     $0.01/MB
RAM is usually volatile.
RAM is about 1/4 million times faster than
disk.
142
Golden Rule of File Processing
Minimize the number of disk accesses!
1. Arrange information so that you get what
you want with few disk accesses.
2. Arrange information so as to minimize future disk
accesses.
An organization for data on disk is often called
a file structure.
Disk based space/time tradeoff: Compress
information to save processing time by reducing
disk accesses.
143
Disk Drives
Boom
(arm)
Platters
Read/Write Track
Spindle Heads
(a) (b)
Sectors
Intersector
Gaps Bits of data
[CD-ROM: Spiral with equally spaced dots, variable speed
rotation.]
144
Sectors
8 1 6 1
7 2 3 4
6 3 8 7
5 4 5 2
(a) (b)
146
Disk Access Cost Example
675 Mbyte disk drive
• 15 platters ⇒ 45 Mbyte/platter
• 612 tracks/platter
• 150 sectors/track ⇒ 512 bytes/sector
• 8 sectors/cluster (4K bytes/cluster) ⇒ 18
clusters/track
• Interleaving factor of 3 ⇒ 3 revolutions to
read one track (50.1 msec)
How long to read a file of 128 Kbytes divided
into 256 records of 512 bytes?
Number of Clusters: 128K/4K = 32.
If file fills minimum number of tracks:
150 sectors of one track, 106 of the next
Total time:
612/3 ∗ 0.08 + 3 + 3.5 ∗ 16.7 + 0.08 + 3 +
3.5 ∗ 16.7 = 139.3 msec.
If clusters are spread randomly across disk:
32 ∗ (612/3 ∗ 0.08 + 3 + 16.7/2 + 24/150 ∗ 16.7)
= 32 ∗ 30.3 = 969.6 msec.
147
Magnetic Tape
Example: 9 track tape at 6250 bytes per inch
(bpi).
At 2400 feet, this yields 170 Mbytes for $20, or
$0.12/Mbyte.
Workstation/PC cartridge tape is similar.
Magnetic tape requires sequential access.
Magnetic tape has two speeds:
• High speed for "skipping."
• Low speed for "reading."
148
Buffers
Read time for one track:
612/3 ∗ 0.08 + 3 + 3.5 ∗ 16.7 = 77.8 msec.
Read time for one sector:
612/3∗0.08+3+16.7/2+16.7/150 = 27.8 msec.
Read time for one byte:
612/3 ∗ 0.08 + 3 + 16.7/2 = 27.7 msec.
Nearly all disk drives read/write one sector at
every I/O access.
• Also called a page.
151
C/C++File Functions
FILE *fopen(const char *filename, const char *mode);
int fclose(FILE *stream);
Mode examples:
• "rb": open a binary file, read-only.
• "w+t": create a text file for reading and
writing.
size_t fread(void *ptr, size_t size, size_t n,
FILE *stream);
if(numrec !=
fread(recarr, sizeof rec, numrec, myfile))
its_an_error();
153
Model of External Computation
Secondary memory is divided into equal-sized
blocks (512, 2048, 4096 or 8192 bytes are
typical sizes).
The basic I/O operation transfers the contents
of one disk block to/from main memory.
Under certain circumstances, reading blocks of
a file in sequential order is more efficient.
(When?) [1) Adjacent logical blocks of file are physically
adjacent on disk. 2) No competition for I/O head.]
Typically, the time to perform a single block
I/O operation is sufficient to Quicksort the
contents of the block.
Thus, our primary goal is to minimize the
number of block I/O operations.
Most workstations today must do all sorting on
a single disk drive.
[So, the algorithm presented here is general for these
conditions.]
154
Key Sorting
Often records are large while keys are small.
• Ex: Payroll entries keyed on ID number.
155
External Sort: Simple Mergesort
Quicksort requires random access to the entire
set of records.
Better: Modified Mergesort algorithm
• Process n elements in Θ(log n) passes.
20 13 14 15 13 17 15 23 14 15 23 28
157
Breaking a file into runs
General approach:
• Read as much of the file into memory as
possible.
• Perform an in-memory sort.
• Output this group of records as a single run.
158
Replacement Selection
1. Break available memory into an array for
the heap, an input buffer and an output
buffer.
2. Fill the array from disk.
3. Make a min-heap.
4. Send the smallest value (root) to the
output buffer.
5. If the next key in the file is greater than the
last value output, then
Replace the root with this key.
else
Replace the root with the last key in the
array.
Add the next record in the file to a new
heap (actually, stick it at the end of the
array).
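The steps above can be sketched in memory as follows. This is a simplified model rather than a disk-based implementation: the input file is a vector, each run is returned as a vector, records held over for the next run are kept in a separate list instead of at the end of the heap array, and a key equal to the last value output is allowed to join the current run. The name `replacementSelection` and the parameter `memSize` (heap capacity in records) are assumptions.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Break input into sorted runs using a min-heap that holds at most memSize records.
std::vector<std::vector<int>> replacementSelection(const std::vector<int>& input,
                                                   std::size_t memSize) {
  std::vector<std::vector<int>> runs;
  std::priority_queue<int, std::vector<int>, std::greater<int>> heap;  // Min-heap
  std::vector<int> held;                     // Records set aside for the next run
  std::size_t next = 0;                      // Next unread input record
  while (next < input.size() && heap.size() < memSize)
    heap.push(input[next++]);                // Fill the array from "disk"
  while (!heap.empty()) {
    std::vector<int> run;
    while (!heap.empty()) {
      int smallest = heap.top();             // Send the root to the output
      heap.pop();
      run.push_back(smallest);
      if (next < input.size()) {
        int incoming = input[next++];
        if (incoming >= smallest) heap.push(incoming);  // Still fits in this run
        else held.push_back(incoming);                  // Must wait for the next run
      }
    }
    runs.push_back(run);
    for (int x : held) heap.push(x);         // The held records seed the next run
    held.clear();
  }
  return runs;
}
```

With a heap of 7 records and input that is already in random order, runs come out longer than 7 records whenever incoming keys can still join the current run.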
[Figure: Input File → Input Buffer → RAM → Output Buffer → Output Run File]
159
Example of Replacement Selection
Input Memory Output
12
16 19 31 12
25 21 56 40
16
29 19 31 16
25 21 56 40
29
19 31
25 21 56 40
19
14 21 31 19
25 29 56 40
40
21 31
25 29 56 14
21
35 25 31 21
40 29 56 14
160
Benefit from Replacement Selection
Use double buffering to overlap input,
processing and output.
How many disk drives for greatest advantage?
Snowplow argument:
• A snowplow moves around a circular track
onto which snow falls at a steady rate.
• At any instant, there is a certain amount of
snow S on the track. Some falling snow
comes in front of the plow, some behind.
• During the next revolution of the snowplow,
all of this is removed, plus 1/2 of what falls
during that revolution.
• Thus, the plow removes 2S amount of
snow.
[Figure: falling snow and snowplow movement, beginning at start time T.]
161
Simple Mergesort may not be Best
Simple Mergesort: Place the runs into two files.
• Merge the first two runs to output file, then
next two runs, etc.
This process is repeated until only one run
remains.
• How many passes for r initial runs? [log r]
162
Multiway Merge
With replacement selection, each initial run is
several blocks long.
Assume that each run is placed in a separate
disk file.
We could then read the first block from each
file into memory and perform an r-way merge.
When a buffer becomes empty, read a block
from the appropriate run file.
Each record is read only once from disk during
the merge process.
In practice, use only one file and seek to
appropriate block.
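An r-way merge can be sketched with a min-heap holding the current front record of each run. The version below works on in-memory vectors and ignores block buffering, so it only illustrates the merge order; `multiwayMerge` is a hypothetical name.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Merge any number of sorted runs into one sorted sequence.
std::vector<int> multiwayMerge(const std::vector<std::vector<int>>& runs) {
  using Entry = std::pair<int, std::size_t>;       // (key, index of the run it came from)
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
  std::vector<std::size_t> pos(runs.size(), 0);    // Read position within each run
  for (std::size_t r = 0; r < runs.size(); r++)
    if (!runs[r].empty()) heap.push(Entry(runs[r][0], r));
  std::vector<int> out;
  while (!heap.empty()) {
    Entry e = heap.top();                          // Smallest front record overall
    heap.pop();
    out.push_back(e.first);
    std::size_t r = e.second;
    if (++pos[r] < runs[r].size())                 // Refill from the same run
      heap.push(Entry(runs[r][pos[r]], r));
  }
  return out;
}
```

Each record passes through the heap exactly once, which mirrors the point that each record is read only once from disk during the merge.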
Input Runs: 5 10 15 ... | 6 7 23 ... | 12 18 20 ...
Output Buffer: 5 6 7 10 12 ...
163
Limits to Single Pass Multiway
Merge
Assume working memory is b blocks in size.
How many runs can be processed at one time?
The runs are 2b blocks long (on average).
[Because of replacement selection.]
How big a file can be merged in one pass?
[2b^2 blocks]
165
Search
Given: Distinct keys k1, k2, ... kn and
collection T of n records of the form
(k1, I1), (k2, I2), ..., (kn, In)
where Ij is information associated with key kj
for 1 ≤ j ≤ n.
Search Problem: For key value K , locate the
record (kj , Ij ) in T such that kj = K .
Searching is a systematic method for locating
the record (or records) with key value kj = K .
A successful search is one in which a record
with key kj = K is found.
An unsuccessful search is one in which no
record with kj = K is found (and presumably
no such record exists).
166
Approaches to Search
1. Sequential and list methods (lists, tables,
arrays).
2. Direct access by key value (hashing).
3. Tree indexing methods.
167
Searching Ordered Arrays
Sequential Search
Binary Search
int binary(int K, int* array, int left, int right) {
// Return pos of ELEM in array (if any) with value K
int l = left-1;
int r = right+1; // l, r beyond bounds of array
while (l+1 != r) { // Stop when l and r meet
int i = (l+r)/2; // Look at middle of subarray
if (K < array[i]) r = i; // In left half
if (K == array[i]) return i; // Found it
if (K > array[i]) l = i; // In right half
}
return UNSUCCESSFUL; // Search value not in array
}
Position 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Key 11 13 21 26 29 36 40 41 45 51 54 56 65 72 77 83
Dictionary Search
168
Lists Ordered by Frequency
Order lists by (expected) frequency of
occurrence.
• Perform sequential search.
$$C_n = \sum_{i=1}^{n} \frac{i}{i H_n} = \frac{n}{H_n} \approx \frac{n}{\log_e n}$$
80/20 rule: 80% of the accesses are to 20% of
the records.
For distributions following the 80/20 rule,
C_n ≈ 0.122n.
170
Self-Organizing Lists
Self-organizing lists modify the order of records
within the list based on the actual pattern of
record access.
Self-organizing lists use a rule called a heuristic
for deciding how to reorder the list. These
heuristics are similar to the rules for managing
buffer pools.
[Buffer pools can be viewed as a form of self-organizing list.]
• Order by actual historical frequency of
access. (Similar to LFU buffer pool
replacement strategy.)
• When a record is found, swap it with the
first record on list.
• Move-to-Front: When a record is found,
move it to the front of the list. [Not worse
than twice "best arrangement."]
• Transpose: When a record is found, swap
it with the record ahead of it. [A bad example:
keep swapping last two elements.]
171
Example of Self-Organizing Tables
Application: Text compression.
Keep a table of words already seen, organized
via Move-to-Front Heuristic.
If a word not yet seen, send the word.
Otherwise, send the (current) index in the
table.
The car on the left hit the car I left.
The car on 3 left hit 3 5 I 5.
This is similar in spirit to Ziv-Lempel coding.
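The encoding in this example can be reproduced with a short sketch. It makes several assumptions that the slide does not state: words are compared without regard to case, a new word is placed at the front of the table, table positions are counted from 1, and punctuation has already been stripped. The name `mtfEncode` is hypothetical.

```cpp
#include <algorithm>
#include <cctype>
#include <iterator>
#include <list>
#include <string>
#include <vector>

// Encode a word sequence with a Move-to-Front table.
std::vector<std::string> mtfEncode(const std::vector<std::string>& words) {
  std::list<std::string> table;              // Lowercased words, most recent first
  std::vector<std::string> out;
  for (const std::string& w : words) {
    std::string key = w;
    std::transform(key.begin(), key.end(), key.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    auto it = std::find(table.begin(), table.end(), key);
    if (it == table.end()) {
      out.push_back(w);                      // Not yet seen: send the word itself
    } else {
      out.push_back(std::to_string(std::distance(table.begin(), it) + 1));
      table.erase(it);                       // Seen: send its current 1-based index
    }
    table.push_front(key);                   // Either way, the word moves to the front
  }
  return out;
}
```

Under these assumptions, the ten words of the sample sentence encode to The, car, on, 3, left, hit, 3, 5, I, 5, which matches the second line of the example.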
172
Searching in Sets
For dense sets (small range, many elements in
set):
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0 1 1 0 1 0 1 0 0 0 1 0 1 0 0
173
Hashing
Hashing: The process of mapping a key value
to a position in a table.
A hash function maps key values to positions.
It is denoted by h.
A hash table is an array that holds the
records. It is denoted by T .
The hash table has M slots, indexed from 0 to
M − 1.
174
Hashing (continued)
Hashing is appropriate only for sets (no
duplicates).
Good for both in-memory and disk based
applications.
Answers the question "What record, if any, has
key value K?"
[Not good for range queries.]
Example: Store the n records with keys in
range 0 to n − 1.
• Store the record with key i in slot i.
• Use hash function h(K ) = K.
175
Collisions
More reasonable example:
• Store about 1000 records with keys in
range 0 to 16,383.
• Impractical to keep a hash table with
16,384 slots.
• We must devise a hash function to map the
key range to a smaller table.
Given: hash function h and keys k1 and k2.
β is a slot in the hash table.
If h(k1) = β = h(k2), then k1 and k2 have a
collision at β under h.
Search for the record with key K :
1. Compute the table location h(K ).
2. Starting with slot h(K ), locate the record
containing key K using (if necessary) a
collision resolution policy.
Collisions are inevitable in most applications.
• Example: in a group of 23 people, two are
more likely than not to share a birthday.
176
Hash Functions
A hash function MUST return a value within
the hash table range.
To be practical, a hash function SHOULD
evenly distribute the records stored among the
hash table slots.
Ideally, the hash function should distribute
records with equal probability to all hash table
slots. In practice, success depends on the
distribution of the actual records stored.
If we know nothing about the incoming key
distribution, evenly distribute the key range
over the hash table slots while avoiding obvious
opportunities for clustering.
If we have knowledge of the incoming
distribution, use a distribution-dependent hash
function.
177
Example Hash Functions
int h(int x) {
return(x % 16);
}
178
ELF Hash
From Executable and Linking Format (ELF),
UNIX System V Release 4.
int ELFhash(char* key) {
unsigned long h = 0;
while(*key) {
h = (h << 4) + *key++;
unsigned long g = h & 0xF0000000L;
if (g) h ^= g >> 24;
h &= ~g;
}
return h % M;
}
179
Open Hashing
What to do when collisions occur?
Open hashing treats each hash table slot as a
bin.
0 1000 9530
1
2
3 3013
4
5
6
7 9877 2007 1057
8
9 9879
180
Bucket Hashing
Divide the hash table slots into buckets.
• Example: 8 slots/bucket.
181
Closed Hashing
Closed hashing stores all records directly in the
hash table.
Each record i has a home position h(ki).
If i is to be inserted and another record already
occupies i's home position, then another slot
must be found to store i.
The new slot is found by a
collision resolution policy.
Search must follow the same policy to find
records not in their home slots.
182
Collision Resolution
During insertion, the goal of collision resolution
is to find a free slot in the table.
Probe Sequence: the series of slots visited
during insert/search by following a collision
resolution policy.
Let β0 = h(K ). Let (β0, β1, ...) be the series of
slots making up the probe sequence.
void hashInsert(Elem R) { // Insert R into hash table T
int home; // Home position for R
int pos = home = h(key(R));// Initial pos on sequence
for (int i=1; key(T[pos]) != EMPTY; i++) {
pos = (home + p(key(R), i)) % M; // Next slot
if (key(T[pos]) == key(R)) ERROR; //No duplicates
}
T[pos] = R; // Insert R
}
183
Linear Probing
Use the probe function
int p(int K, int i) { return i; }
184
Linear Probing Example
0 1001 0 1001
1 9537 1 9537
2 3016 2 3016
3 3
4 4
5 5
6 6
7 9874 7 9874
8 2009 8 2009
9 9875 9 9875
10 10 1052
(a) (b)
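The table in this example is consistent with a hash function h(K) = K mod 11 and linear probing, although the slide does not state the hash function. A minimal sketch under that assumption, with integer keys, a hypothetical EMPTY marker of -1, and a simplified insert that returns the slot used:

```cpp
#include <vector>

const int EMPTY = -1;   // Assumed marker for an unused slot

// Insert key into table T using h(K) = K mod M and the probe function p(K, i) = i.
// Returns the slot used, or -1 if the key is a duplicate or the table is full.
int hashInsert(std::vector<int>& T, int key) {
  int M = static_cast<int>(T.size());
  int home = key % M;                       // Home position
  for (int i = 0; i < M; i++) {
    int pos = (home + i) % M;               // i'th slot on the probe sequence
    if (T[pos] == key) return -1;           // No duplicates
    if (T[pos] == EMPTY) {                  // Free slot found
      T[pos] = key;
      return pos;
    }
  }
  return -1;                                // Every slot was probed: table is full
}
```

Inserting 1001, 9537, 3016, 9874, 2009, 9875 in that order into an 11-slot table gives the layout in (a); inserting 1052 then probes slots 7, 8, 9 and lands in slot 10, as in (b). The long run of occupied slots from 7 to 10 is an instance of primary clustering.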
186
Pseudo Random Probing
The ideal probe function would select the next
slot on the probe sequence at random.
An actual probe function cannot operate
randomly. (Why?)
Pseudo random probing:
• Select a (random) permutation of the
numbers from 1 to M − 1:
r1, r2, ..., rM−1
• All insertions and searches use the same
permutation.
Example: Hash table of size M = 101
• r1 = 2, r2 = 5, r3 = 32.
• h(k1) = 30, h(k2) = 28.
• Probe sequence for k1 is: [30, 32, 35, 62]
• Probe sequence for k2 is: [28, 30, 33, 60]
187
Quadratic Probing
Set the i'th value in the probe sequence as
(h(K) + i^2) mod M.
Example: M = 101.
• h(k1) = 30, h(k2) = 29.
• Probe sequence for k1 is: [30, 31, 34, 39]
• Probe sequence for k2 is: [29, 30, 33, 38]
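The quadratic probe sequence follows directly from the formula, counting the home slot as probe i = 0. A small helper (hypothetical name `quadraticProbes`) that lists the first few slots visited:

```cpp
#include <vector>

// First `count` slots visited by quadratic probing from a given home position.
std::vector<int> quadraticProbes(int home, int M, int count) {
  std::vector<int> seq;
  for (int i = 0; i < count; i++)
    seq.push_back((home + i * i) % M);   // i'th probe is (h(K) + i^2) mod M
  return seq;
}
```

For M = 101, a home position of 30 gives 30, 31, 34, 39 and a home position of 29 gives 29, 30, 33, 38. The two sequences share slot 30 but diverge immediately afterward, which is how quadratic probing avoids primary clustering.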
188
Double Hashing
Pseudo random probing eliminates primary
clustering.
If two keys hash to same slot, they follow the
same probe sequence. This is called
secondary clustering.
To avoid secondary clustering, need a probe
sequence to be a function of the original key
value, not just the home position.
Double hashing:
p(K, i) = i ∗ h_2(K) for 0 ≤ i ≤ M − 1.
[Figure: curves labeled Insert and Delete; horizontal axis from 0 to 1.0.]
190
Deletion
1. Deleting a record must not hinder later
searches.
2. We do not want to make positions in the
hash table unusable because of deletion.
Both of these problems can be resolved by
placing a special mark in place of the deleted
record, called a tombstone.
A tombstone will not stop a search, but that
slot can be used for future insertions.
Unfortunately, tombstones do add to the
average path length.
Solutions:
1. Local reorganizations to try to shorten the
average path length.
2. Periodically rehash the table (by order of
most frequently accessed record).
191
Indexing
Goals:
• Store large files.
• Support multiple search keys.
• Support efficient insert, delete and range
queries.
Entry sequenced file: Order records by time
of insertion. [Not practical as a database organization.]
Use sequential search.
Index file: Organized, stores pointers to actual
records. [Could be a tree or other data structure.]
Primary key: A unique identifier for records.
May be inconvenient for search.
Secondary key: an alternate search key, often
not unique for each record. Often used for
search key.
192
Linear Indexing
Linear Index: an index file organized as a
simple sequence of key/record pointer pairs
where the key values are in sorted order.
If the index is too large to fit in main memory,
a second level index may be used.
Linear indexing is good for searching variable
length records. [Also good for indexing an entry
sequenced file.]
Linear indexing is poor for insert/delete.
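A minimal sketch of a linear index held in memory, assuming integer keys and using positions in a vector as stand-ins for record pointers. The type and function names are illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One key/record pointer pair. recordPos is the record's position
// in the entry sequenced file.
struct IndexEntry {
    int key;
    std::size_t recordPos;
};

// Builds the index from record keys listed in order of insertion.
std::vector<IndexEntry> buildLinearIndex(const std::vector<int>& recordKeys) {
    std::vector<IndexEntry> index;
    for (std::size_t pos = 0; pos < recordKeys.size(); ++pos)
        index.push_back({recordKeys[pos], pos});
    std::sort(index.begin(), index.end(),
              [](const IndexEntry& a, const IndexEntry& b) { return a.key < b.key; });
    return index;
}

// Binary search on the sorted keys. Returns the record position, or -1.
long findRecord(const std::vector<IndexEntry>& index, int key) {
    auto it = std::lower_bound(index.begin(), index.end(), key,
                               [](const IndexEntry& e, int k) { return e.key < k; });
    if (it != index.end() && it->key == key)
        return static_cast<long>(it->recordPos);
    return -1;
}
```

For records entered in the order 73, 52, 98, 37, 42, the index keys come out as 37, 42, 52, 73, 98, and findRecord for key 37 returns position 3. Inserting a record means inserting into the middle of the sorted sequence, which is why the linear index is poor for insert/delete.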
[Figure:
  Linear Index (keys in sorted order):  37 42 52 73 98
  Database Records (in entry order):    73 52 98 37 42
Each index entry points to its record. This is an entry sequenced file.]
[Figure fragment: primary keys AX33, AX35, ZX45, ZQ99.]
Inverted List (Continued)
Secondary key index:
  Secondary Key   Index
  Jones           0
  Smith           1
  Zukowski        3

Primary key list: [A linked list.]
  Index   Primary Key   Next
  0       AA10          4
  1       AX33          6
  2       ZX45
  3       ZQ99
  4       AB12          5
  5       AB39          7
  6       AX35          2
  7       FF37
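The table can be read as an array-based linked list: the secondary key gives the index of the first entry, and each Next field gives the index of the following entry. A minimal sketch, assuming -1 marks the end of a list (the slide simply leaves the field blank); all names are illustrative.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

const int kEnd = -1;  // assumed end-of-list marker

struct ListEntry {
    std::string primaryKey;
    int next;  // index of the next entry with the same secondary key
};

// The primary key list from the slide, indexed 0..7.
std::vector<ListEntry> exampleEntries() {
    return {{"AA10", 4}, {"AX33", 6}, {"ZX45", kEnd}, {"ZQ99", kEnd},
            {"AB12", 5}, {"AB39", 7}, {"AX35", 2},    {"FF37", kEnd}};
}

// The secondary key index from the slide.
std::map<std::string, int> exampleIndex() {
    return {{"Jones", 0}, {"Smith", 1}, {"Zukowski", 3}};
}

// Follows the chain for one secondary key and collects its primary keys.
std::vector<std::string> primaryKeysFor(const std::map<std::string, int>& index,
                                        const std::vector<ListEntry>& entries,
                                        const std::string& secondaryKey) {
    std::vector<std::string> result;
    auto it = index.find(secondaryKey);
    if (it == index.end()) return result;
    for (int i = it->second; i != kEnd;
         i = entries[static_cast<std::size_t>(i)].next)
        result.push_back(entries[static_cast<std::size_t>(i)].primaryKey);
    return result;
}
```

Looking up Smith visits entries 1, 6 and 2 in turn, giving AX33, AX35, ZX45.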
Tree Indexing
Linear index is poor for insertion/deletion.
Tree index can efficiently support all desired
operations:
• Insert/delete
• Multiple search keys [Multiple tree indices.]
• Key range search
[Figure: two binary search trees.
  (a)        5                (b)        4
           3   7                       2   6
          2 4 6                       1 3 5 7 ]
2-3 Tree
A 2-3 Tree has the following properties:
1. A node contains one or two keys.
2. Every internal node has either two children
(if it contains one key) or three children (if
it contains two keys).
3. All leaves are at the same level in the tree,
so the tree is always height balanced.
The 2-3 Tree also has a search tree property
analogous to the BST.
The advantage of the 2-3 Tree over the BST is
that it can be updated at low cost.
                      [18 33]
      [12]            [23 30]             [48]
   [10] [15]    [20 21] [24] [31]    [45 47] [50 52]
2-3 Tree Insertion
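[Insert 14]
                      [18 33]
      [12]            [23 30]             [48]
   [10] [14 15]  [20 21] [24] [31]   [45 47] [50 52]

[Second tree: 55 is now present. Leaf 50 52 has split and 52 has moved up into the parent.]
                      [18 33]
      [12]            [23 30]             [48 52]
   [10] [15]    [20 21] [24] [31]    [45 47] [50] [55]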
2-3 Tree Splitting
[Insert 19]
(a)        [20 23 30]
       [19] [21] [24] [31]

(b)           [23]
          [20]    [30]
       [19] [21] [24] [31]

(c)                     [23]
              [18]                [33]
          [12]   [20]         [30]   [48]
      [10] [15] [19] [21] [24] [31] [45 47] [50 52]
B-Trees
The B-Tree is an extension of the 2-3 Tree.
The B-Tree is now the standard file
organization for applications requiring insertion,
deletion and key range searches.
1. B-Trees are always balanced.
2. B-Trees keep related records on a disk
page, which takes advantage of locality of
reference.
3. B-Trees guarantee that every node in the
tree will be full at least to a certain
minimum percentage. This improves space
efficiency while reducing the typical number
of disk fetches necessary during a search or
update operation.
B-Trees (Continued)
A B-Tree of order m has the following
properties.
• The root is either a leaf or has at least two
children.
• Each node, except for the root and the
leaves, has between ⌈m/2⌉ and m children.
• All leaves are at the same level in the tree,
so the tree is always height balanced.
A B-Tree node is usually selected to match the
size of a disk block.
A B-Tree node could have hundreds of children.
B-Tree Example
Search in a B-Tree is a generalization of search
in a 2-3 Tree.
1. Perform a binary search on the keys in the
current node. If the search key is found,
then return the record. If the current node
is a leaf node and the key is not found,
then report an unsuccessful search.
2. Otherwise, follow the proper branch and
repeat the process.
                              [24]
          [15 20]                        [33 45 48]
   [10 12] [18] [21 23]      [30 31] [38] [47] [50 52 60]
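The two search steps can be sketched as follows, assuming integer keys and nodes held entirely in memory (a real B-Tree would read each node from a disk block). The node layout and names are illustrative, and a vector of the node's own type as a member relies on C++17.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct BNode {
    std::vector<int> keys;        // sorted keys stored in this node
    std::vector<BNode> children;  // empty for a leaf, else keys.size() + 1
};

BNode leaf(std::vector<int> keys) { return BNode{std::move(keys), {}}; }

// Step 1: binary search within the node. Step 2: follow the proper branch.
bool bTreeSearch(const BNode& node, int key) {
    auto it = std::lower_bound(node.keys.begin(), node.keys.end(), key);
    if (it != node.keys.end() && *it == key) return true;  // found here
    if (node.children.empty()) return false;               // unsuccessful
    std::size_t branch = static_cast<std::size_t>(it - node.keys.begin());
    return bTreeSearch(node.children[branch], key);
}

// The example tree from this slide.
BNode exampleTree() {
    BNode left{{15, 20}, {leaf({10, 12}), leaf({18}), leaf({21, 23})}};
    BNode right{{33, 45, 48},
                {leaf({30, 31}), leaf({38}), leaf({47}), leaf({50, 52, 60})}};
    return BNode{{24}, {left, right}};
}
```

Searching the example tree for 47 visits the root, then the node 33 45 48, then the leaf holding 47. A search for 25 ends unsuccessfully at the leaf 30 31.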
B+-Trees
The most commonly implemented form of the
B-Tree is the B+-Tree.
Internal nodes of the B+-Tree do not store
records, only key values to guide the search.
Leaf nodes store records or pointers to records.
A leaf node may store more or fewer records than
an internal node stores keys.
[Assume leaves can store 5 values, internal nodes 3 (4
children).]
                               [33]
            [18 23]                         [48]
  [10 12 15] [18 19 20 21 22] [23 30 31]   [33 45 47] [48 50 52]
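A minimal sketch of B+-Tree search, assuming integer keys that stand in for records and nodes held in memory. Unlike the B-Tree search, a match on an internal key does not end the search, because records live only in the leaves. In the example tree a key equal to an internal key is found in the subtree to that key's right, so the branch is chosen with std::upper_bound. The names are illustrative, and the self-referential vector member relies on C++17.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct BPlusNode {
    std::vector<int> keys;            // guide keys (internal) or record keys (leaf)
    std::vector<BPlusNode> children;  // empty for a leaf
};

BPlusNode makeLeaf(std::vector<int> keys) {
    return BPlusNode{std::move(keys), {}};
}

// Always descends to a leaf; only a leaf can confirm that a record exists.
bool bPlusContains(const BPlusNode& node, int key) {
    if (node.children.empty())
        return std::binary_search(node.keys.begin(), node.keys.end(), key);
    auto it = std::upper_bound(node.keys.begin(), node.keys.end(), key);
    std::size_t branch = static_cast<std::size_t>(it - node.keys.begin());
    return bPlusContains(node.children[branch], key);
}

// The example tree from this slide.
BPlusNode exampleBPlusTree() {
    BPlusNode left{{18, 23}, {makeLeaf({10, 12, 15}),
                              makeLeaf({18, 19, 20, 21, 22}),
                              makeLeaf({23, 30, 31})}};
    BPlusNode right{{48}, {makeLeaf({33, 45, 47}), makeLeaf({48, 50, 52})}};
    return BPlusNode{{33}, {left, right}};
}
```

Searching for 33 passes through the root key 33 and the internal key 48 before finding the record in the leaf 33 45 47.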
B+-Tree Insertion
[Note special rule for root: May have only two children.]
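(a)
  [10 12 23 33 48]

(b) [Add 50.]
              [33]
     [10 12 23] [33 48 50]

(c) [Add 45, 52, 47 (split), 18, 15, 31 (split), 21, 20.]
                    [18 33 48]
  [10 12 15] [18 20 21 23 31] [33 45 47] [48 50 52]

(d) [30 is now present: the second leaf and then the internal node have split, creating a new root.]
                             [33]
            [18 23]                       [48]
  [10 12 15] [18 20 21] [23 30 31]   [33 45 47] [48 50 52]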
B+-Tree Deletion
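[Simple delete: delete 18 from original example.]
                               [33]
            [18 23]                         [48]
  [10 12 15] [19 20 21 22] [23 30 31]      [33 45 47] [48 50 52]

[Second tree: 12 is gone and the first leaf has taken 18 from its sibling, so the parent key 18 becomes 19.]
                               [33]
            [19 23]                         [48]
  [10 15 18] [19 20 21 22] [23 30 31]      [33 45 47] [48 50 52]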
B-Tree Space Analysis
B+-Tree nodes are always at least half full.
The B*-Tree splits two pages into three, and
combines three pages into two. In this way,
nodes are always at least 2/3 full.
Asymptotic cost of search, insertion and
deletion of records from B-Trees, B+-Trees and
B*-Trees is Θ(log n). (The base of the log is
the (average) branching factor of the tree.)
Example: Consider a B+-Tree of order 100
with leaf nodes containing 100 records.
1 level B+-Tree: [Max: 100 records.]
2 level B+-Tree: [Min: 2 leaves of 50 for 100 records.
Max: 100 leaves with 100 for 10,000 records.]
3 level B+-Tree: [Min: 2 × 50 leaves of 50 for 5000
records. Max: 100^3 = 1,000,000 records.]
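The arithmetic in this example can be sketched as two small functions. The assumptions are the ones the example uses: the root of a multi-level tree has at least two children, every other internal node has at least half the order as children, and leaves are at least half full. The function names are illustrative.

```cpp
#include <cstdint>

// Raises base to a small non-negative power.
std::uint64_t ipow(std::uint64_t base, unsigned exp) {
    std::uint64_t result = 1;
    while (exp-- > 0) result *= base;
    return result;
}

// Maximum records: every internal node has `order` children and every
// leaf is full. Requires levels >= 1.
std::uint64_t maxRecords(std::uint64_t order, std::uint64_t leafCapacity,
                         unsigned levels) {
    return ipow(order, levels - 1) * leafCapacity;
}

// Minimum records for a tree with levels >= 2: the root has 2 children,
// other internal nodes have ceil(order / 2) children, leaves are half full.
std::uint64_t minRecords(std::uint64_t order, std::uint64_t leafCapacity,
                         unsigned levels) {
    std::uint64_t minFanout = (order + 1) / 2;
    std::uint64_t minLeaf = (leafCapacity + 1) / 2;
    return 2 * ipow(minFanout, levels - 2) * minLeaf;
}
```

For order 100 and leaves of 100 records, a 3 level tree holds between minRecords(100, 100, 3) = 5000 and maxRecords(100, 100, 3) = 1,000,000 records.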