The Joys of Hashing: Hash Table Programming with C
1st Edition

Thomas Mailund
Aarhus N, Denmark
Table of Contents (excerpt)

Chapter 4: Resizing
    Amortizing Resizing Costs
    Resizing Chained Hash Tables
Chapter 8: Conclusions
Bibliography
Index
About the Author
Thomas Mailund is an associate professor in bioinformatics at Aarhus
University, Denmark. He has a background in math and computer science.
For the past decade, his main focus has been on genetics and evolutionary
studies, particularly comparative genomics, speciation, and gene flow
between emerging species.
He is the author of Domain-Specific Languages in R, Beginning Data
Science in R, Functional Programming in R, and Metaprogramming in R, all
from Apress, as well as other books.
About the Technical Reviewer
Michael Thomas has worked in software
development for over 20 years as an individual
contributor, team lead, program manager, and
vice president of engineering. Michael has over
10 years of experience working with mobile
devices. His current focus is on the medical sector, using mobile devices
to accelerate information transfer between patients and health care
providers.
Acknowledgments
I am very grateful to Rasmus Pagh for comments on the manuscript,
suggestions for topics to add, and for correcting me when I have been
imprecise or downright wrong. I am also grateful to Anders Halager for
many discussions about implementation details and bit fiddling, and to
Shiella Balbutin for proofreading the book.
CHAPTER 1: The Joys of Hashing

Figure 1-1. Values map to hash keys that then map to table bins
The typical solution to this is to keep N large but have a second step
that reduces the hash key range down to a smaller bin range of size m with
m ∈ O(n); in the example, you use m = 8. If you keep m small, as in O(n),
you can allocate and initialize it in linear time, and you can get any bin in
it in constant time. To insert, check, or delete an element in the table, you
map the application value to its hash key and then map the hash key to a
bin index.
You reduce values to bin indices in two steps because the first step,
mapping data from your application domain to a number, is
program-specific and cannot be part of a general hash table
implementation.[1] Mapping from a large integer interval to a smaller one,
however, can be implemented as part of the hash table. If you resize the
table to adapt it to the number of keys you store in it, you need to change
m. You do not want the application programmer to provide separate
functions for each m.
You can think of the hash key space, [N] = [0, ... , N – 1], as the interface
between the application and the data structure. The hash table itself can
map from this space to indices in an array, [m] = [0, ... , m – 1].
The primary responsibility of the first step is to reduce potentially
complicated application values to simpler hash keys, such as mapping
application-relevant information, like positions on a game board or
connections in a network, down to integers. These integers can then be
handled by the hash table data structure. A second responsibility of the
function is to make the hash keys uniformly distributed in the range [N].
The binning strategy for mapping hash keys to bins relies on the hash
keys being uniformly distributed to spread keys evenly across the bins.
If this assumption is violated, the data structure does not guarantee
(expected) constant-time operations. Here, you can add a third, middle
step that maps from
[1] In some textbooks, you will see the hashing step and the binning step combined
and called hashing. Then you have a single function that maps
application-specific keys directly to bins. I prefer to consider this as two or
three separate functions, and it usually is implemented as such.
[N] → [N] and scrambles the application hash keys to hash keys with a
better distribution; see Figure 1-2. These functions can be application-
independent and part of a hash table library. You will return to such
functions in Chapter 6 and Chapter 7. Having a middle step does not
eliminate the need for application hash functions. You still need to map
complex data into integers. The middle step only alleviates the need for
an even distribution of keys. The map from application keys to hash keys
still has some responsibility for this, though. If it maps different data to
the same hash keys, then the middle step cannot do anything but map the
same input to the same output.
Figure 1-2. If the application maps values to keys, but they are not
uniformly distributed, then a hashing step between the application
and the binning can be added
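To make the middle step concrete, here is a sketch of an application-independent scrambling function for 32-bit hash keys. The specific mixer below (a variant of Thomas Wang's integer hash) is my illustration, not necessarily the function the book itself uses:

```c
#include <stdint.h>

// A [N] -> [N] scrambling step for N = 2^32: every operation below is
// invertible modulo 2^32, so the whole function is a bijection. It never
// introduces new collisions; it only redistributes keys more uniformly.
uint32_t scramble(uint32_t k)
{
    k = (k ^ 61) ^ (k >> 16);
    k = k + (k << 3);        // same as k *= 9; 9 is odd, so invertible
    k = k ^ (k >> 4);
    k = k * 0x27d4eb2d;      // multiplication by an odd constant
    k = k ^ (k >> 15);
    return k;
}
```

Because the function is a bijection, two distinct hash keys always scramble to two distinct values; collisions can only arise later, in the binning step.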
This book assumes uniformly distributed hash keys, as these are easiest to work with when analyzing theoretical
performance. The runtime results referred to in Chapter 3 assume this,
and therefore, we will as well. In Chapter 7, you will learn techniques for
achieving similar results without the assumption.
The book is primarily about implementing the hash table data
structure and only secondarily about hash functions. The concerns when
implementing hash tables are these: given hash keys with application
values attached to them, how do you represent the data such that you
can update and query tables in constant time? The fundamental idea
is, of course, to reduce hash keys to bins and then use an array of bins
containing values. In the purest form, you can store your data values
directly in the array at the index that the hash and binning functions
provide. But if m is relatively small compared to the number of data values,
then you are likely to have collisions: cases where two hash keys map to the
same bin. Although different values are unlikely to hash to the same key in
the range [N], this does not mean that collisions are unlikely in the range
[m] if m is smaller than N (and as the number of keys you insert in the
table, n, approaches m, collisions are guaranteed). Dealing with collisions
is a crucial aspect of implementing hash tables, and a topic we will deal
with for a sizeable part of this book.
CHAPTER 2: Hash Keys, Indices, and Collisions
#include <stdint.h> // for the uint32_t type

struct bin {
    unsigned int is_free : 1; // nonzero when the bin holds no key
                              // (unsigned: a plain int 1-bit field can
                              // only hold 0 and -1)
    uint32_t key;             // the hash key stored in the bin
};

struct hash_table {
    struct bin *table; // array of size bins
    uint32_t size;     // the number of bins, m
};
Functions for allocating and deallocating tables can then look like this:
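The original listing is not preserved in this excerpt. The following is a reconstructed sketch of what such functions could look like for the structs above; the names empty_table and delete_table are my assumptions:

```c
#include <stdint.h>
#include <stdlib.h>

struct bin {
    unsigned int is_free : 1;
    uint32_t key;
};

struct hash_table {
    struct bin *table;
    uint32_t size;
};

// Allocate a table with `size` bins, all initially marked free.
struct hash_table *empty_table(uint32_t size)
{
    struct hash_table *table = malloc(sizeof *table);
    table->table = malloc(size * sizeof *table->table);
    for (uint32_t i = 0; i < size; i++)
        table->table[i].is_free = 1;
    table->size = size;
    return table;
}

// Free the bin array and then the table structure itself.
void delete_table(struct hash_table *table)
{
    free(table->table);
    free(table);
}
```

Note that allocating and initializing the bins takes time linear in size, which is why you want the table size to be in O(n).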
The operations you want to implement on hash tables are the insertion
and deletion of keys and queries to test if a table holds a given key. You use
this interface to the three operations:
You then use that index to access the array. Assuming that you
never have collisions when doing this, the implementation of the three
operations would then be as simple as this:
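The listings themselves are not preserved in this excerpt. Below is a reconstructed sketch of both the interface and the collision-free implementations, assuming the structs shown earlier and a power-of-two table size; the function names are my choice:

```c
#include <stdbool.h>
#include <stdint.h>

struct bin {
    unsigned int is_free : 1;
    uint32_t key;
};

struct hash_table {
    struct bin *table;
    uint32_t size;
};

// Map a hash key to an index in the table; assumes size is a power of two.
static uint32_t key_index(struct hash_table *table, uint32_t key)
{
    return key & (table->size - 1);
}

// Store the key in its bin and mark the bin as occupied.
void insert_key(struct hash_table *table, uint32_t key)
{
    uint32_t index = key_index(table, key);
    table->table[index].key = key;
    table->table[index].is_free = false;
}

// A bin holds the key if it is occupied and stores that exact key.
bool contains_key(struct hash_table *table, uint32_t key)
{
    uint32_t index = key_index(table, key);
    return !table->table[index].is_free && table->table[index].key == key;
}

// Mark the key's bin as free again.
void delete_key(struct hash_table *table, uint32_t key)
{
    uint32_t index = key_index(table, key);
    if (table->table[index].key == key)
        table->table[index].is_free = true;
}
```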
When inserting an element, you place the value at the index given by
the mapping from the space of hash keys to the range of the array. Deleting
a key is similarly simple: you flag the bin as free again. To check if the
table contains a key, you check that the bin is not free and that it contains
the right key. If you assume that the only way a key can end up at a
given index is if you have the key value, this would be correct. However,
you usually also check that the application keys match the key in the
hash table, not just that the hash keys match. In this implementation, the
application keys and hash keys are the same, so you only check that the
hash keys are identical. This matters because, of course, you could have a
situation where two
different hash keys would map to the same bin index, even if the hash keys
never collide. The space of bin indices, after all, is much smaller than the
space of keys.
Collisions of hash values are rare events if they are the result of a
well-designed hash function. Although collisions of hash keys are rare,
this does not imply that you cannot get collisions in the indices. The range
[N] is usually vastly larger than the array indices in the range [m]. Two
different hash keys can easily end up in the same hash table bin; see
Figure 2-1. Here, you have hash keys in a space of size N = 64 and only
m = 8 bins. The numbers next to the hash keys are written in octal, and
you map keys to bins by extracting the lower three bits of the key, which
correspond to the last digit in the octal representation. The keys 8 and
16, or 10 and 20 in octal, both map to bin number 0, so they collide in
the table.
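You can verify the collision in the figure with a couple of lines of C; this is just an illustration, using an octal literal for the mask to mirror the figure's notation:

```c
#include <stdint.h>

// With m = 8 bins, the bin index is the lower three bits of the key,
// which is exactly the last digit of the key written in octal.
static uint32_t bin_of(uint32_t key)
{
    return key & 07; // 07 is octal for 7, i.e. the three-bit mask 0b111
}
```

Both bin_of(010) and bin_of(020), that is, keys 8 and 16, yield bin 0.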
The figure is slightly misleading since the hash space is only a factor of
eight larger than the size of the hash table. In any real application, the keys
range over a much wider interval than could ever be represented in a table.
In the setup in this book, the range [N] maps over all possible unsigned
integers, which in most C implementations means all possible memory
addresses on your computer. This space is much larger than what you
could reasonably use for an array; if you had to use your entire computer
memory for a hash table, you would have no space for your actual
computer program. Each value might map to a unique hash key, but when
you have to map the hash keys down to a smaller range to store values in a
table, you are likely to see collisions.
Risks of Collisions
Assuming a uniform distribution of hash keys, let's do some
back-of-the-envelope calculations of collision probabilities. The chances of
collisions are surprisingly high once the number of values approaches
even a small fraction of the number of indices we can hit. To figure
out the chances of collisions, let's use the birthday paradox
(https://en.wikipedia.org/wiki/Birthday). In a room of n people, what is the
probability that two or more have the same birthday? Ignoring leap
years, there are 365 days in a year, so how many people do we need before
the chance that at least two share a birthday rises above one half?
This number, n, turns out to be very low. If we assume that each date is
equally likely as a birthday, then with only 23 people we would expect a
50% chance that at least two share a birthday.
Let’s phrase the problem of “at least two having the same birthday”
a little differently. Let’s ask “what is the probability that all n people have
different birthdays?” The answer to the first problem will then be one
minus the answer to the second.
To answer the second problem, we can reason like this: out of the n
people, the first birthday hits 1 out of 365 days without any collisions. The
second person, if we avoid collisions, has to hit 1 of the remaining 364
days. The third person has to have his birthday on 1 of the 363 remaining
days. Continuing this reasoning, the probability of no collisions in the
birthdays of n people is

    (365/365) × (364/365) × (363/365) × … × ((365 − n + 1)/365)

where 1 minus this product is then the risk of at least one collision when
there are n people in the room. Figure 2-2 shows this probability as a
function of the number of people. The curve crosses the point of 50%
collision risk between 22 and 23.
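The calculation is easy to check in a few lines of C; this is a sketch of the product derived above, not code from the book:

```c
// Probability that at least two of n people share a birthday,
// assuming 365 equally likely birthdays and no leap years.
double collision_risk(int n)
{
    double p_distinct = 1.0; // probability that all n birthdays differ
    for (int i = 0; i < n; i++)
        p_distinct *= (365.0 - i) / 365.0;
    return 1.0 - p_distinct;
}
```

Here collision_risk(22) is roughly 0.476 and collision_risk(23) roughly 0.507, matching where the curve crosses 50%.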
[Figure 2-2. Collision risk as a function of the number of people; the risk
crosses 0.5 between 22 and 23 people.]
    p(n | m) = 1 − m! / (m^n · (m − n)!)
[Figure: collision risk as a function of the number of inserted elements,
for table sizes m = 1,000, 10,000, and 20,000.]
In practice, you are less interested in when the risk of collision reaches
any particular probability than in how many items you can put into a table
of size m before you get the first collision. Let K denote the random
variable that represents the first time you get a collision when inserting
elements into the table. The probability that the first collision happens
when you add item number k is
    Pr(K = k | m) = [m! / (m^(k−1) · (m − k + 1)!)] × [(k − 1) / m]
where the first term is the probability that there were no collisions in the first
k – 1 insertions and the second term is the probability that the kth element
hits one of the k – 1 slots already occupied. The expected number of inserts
you can do until you get the first collision can then be computed as
    E[K | m] = Σ_{k=1}^{m+1} k · Pr(K = k | m)
    p(k | m) ≈ k² / (2m)
[Figure: collision risk as a function of the number of inserted elements k,
for table sizes m = 1,000, 10,000, and 20,000.]
    m ≈ k² / (2 · p(k | m))
The formula says that to keep the collision risk low, m has to be
proportional to the square of k, with a coefficient that is inversely
proportional to how low you want the risk.
This formula is potentially bad news. If you need to initialize the hash
table before you can use it,[1] then you automatically have a quadratic
time algorithm on your hands. That is a hefty price to pay for constant
[1] It is technically possible to use the array in the table without initializing it,
but it requires some trickery that incurs overhead.
time access to the elements you put into the table. Since hash tables are
used everywhere, this should tell you that in practice they do not rely on
avoiding collisions entirely; they obviously have to deal with them, and
most of this book is about how to do so.
and shrink them. You can easily combine modulus primes with this idea. If
you pick a prime p > m, you can index bins as h(x) mod p mod m. Taking
the key modulo p reduces problems caused by regularity in the keys, and if
m is a power of two, you can grow and shrink the table easily and mask to
get the bins.
If your keys are randomly distributed, you can easily pick table sizes
that are powers of two. If so, you can use the lower bits of keys as bin
indices and modulus can be replaced by bit masking. If the keys are
random, the lower bits will also be random.
You can mask out the lower bits of a key like this:
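The listing is not preserved in this excerpt; a minimal sketch of the masking computation could look like this (the function name is mine):

```c
#include <stdint.h>

// Compute a bin index from a hash key by masking. This assumes the
// table size m is a power of two, so that m - 1 is an all-ones bit mask
// covering exactly the lower bits.
static uint32_t mask_bin(uint32_t hash_key, uint32_t m)
{
    return hash_key & (m - 1);
}
```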
Subtracting one from the table size m gives you a mask of the lower k
bits, provided m = 2^k is a power of two, and ANDing the key with that
mask gives you the index. Masking is a faster operation than modulo. In my
experiments, I see about
a factor of five in the speed difference. Compilers can optimize modulus to
masking if they know that m is a power of two, but if m is a prime (and larger
than two), this is of little help. How much of an issue this is depends on your
application and choice of hash function. Micro-optimizations will matter very
little if you have hash functions that are slow to compute.
There is, however, a trick to avoid modulo and replace it with
multiplication.[2] Multiplication is faster than modulus, though not as fast
as masking. Still, if you need m to be a prime, it is better than modulo.
The first idea is this: you do not need to map keys x to x mod m but you
can map them to any bin as long as you do so fairly. By fairly I mean that
each bin in [m] will contain the same number of keys if you map all of the
keys in [N] down to the keys in [m]. This is what the expression x mod m
does, but you can do it any way that you like.[3]
[2] https://tinyurl.com/yazblf4o
[3] You cannot make a perfectly fair such mapping if m does not divide N; x
mod m cannot do this either. However, if N ≫ m, then you can get very close.
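The trick referenced here is widely known as the multiply-shift or "fastrange" mapping; the sketch below is my rendering of it, not necessarily the book's exact code:

```c
#include <stdint.h>

// Map a 32-bit hash key x fairly onto [0, m) without modulo: interpret
// x as the fraction x / 2^32 of the full key range and scale it by m.
// Each bin receives either floor(2^32 / m) or ceil(2^32 / m) of the
// possible keys, which is as fair as x mod m.
static uint32_t fastrange(uint32_t x, uint32_t m)
{
    return (uint32_t)(((uint64_t)x * (uint64_t)m) >> 32);
}
```

One 64-bit multiplication and one shift replace the division that a modulus would require.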
This avoids the modulo operation and relies only on fast arithmetic
and bit operations, which is much faster than computing a modulus.
CHAPTER 3: Collision Resolution, Load Factor, and Performance
Collisions are inevitable when using a hash table, at least if you want the
table size, and thus the initialization time for the table, to be linear in the
number of keys you put into it. Therefore, you need a way to deal with
collisions so you can still insert a key when the bin you map it to is already
occupied. There are two classical approaches to collision resolution:
chaining (where you use linked lists to store colliding keys) and open
addressing (where you find alternative empty slots to store values in when
keys collide).
You can download the complete code for this chapter from
https://github.com/mailund/JoyChapter3.
Chaining
One of the most straightforward approaches to resolve collisions is to
put colliding keys in a data structure that can hold them, and the most
straightforward data structure is a linked list.
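As a preview of the idea, here is a minimal sketch of chained bins (reconstructed for illustration; the chapter's actual listings are in the repository linked above):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

// Each bin is the head of a singly linked list of colliding keys.
struct linked_list {
    uint32_t key;
    struct linked_list *next;
};

struct hash_table {
    struct linked_list **table; // array of m list heads
    uint32_t size;              // m, assumed a power of two here
};

static uint32_t key_index(struct hash_table *table, uint32_t key)
{
    return key & (table->size - 1);
}

// Insert by prepending a new link to the bin's list; colliding keys
// simply share the list instead of overwriting one another.
void insert_key(struct hash_table *table, uint32_t key)
{
    uint32_t index = key_index(table, key);
    struct linked_list *link = malloc(sizeof *link);
    link->key = key;
    link->next = table->table[index];
    table->table[index] = link;
}

// Search the bin's list for the key.
bool contains_key(struct hash_table *table, uint32_t key)
{
    for (struct linked_list *link = table->table[key_index(table, key)];
         link; link = link->next)
        if (link->key == key)
            return true;
    return false;
}
```

With this representation, a lookup costs time proportional to the length of the bin's list, which is why the load factor matters for performance.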