
Reconfigurable Accelerator for the Word-Matching Stage of BLASTN


Abstract
BLAST is one of the most popular sequence analysis tools used by molecular
biologists. It is designed to efficiently find similar regions between two sequences that
have biological significance. However, because the size of genomic databases is growing
rapidly, the computation time of BLAST, when performing a complete genomic database
search, is continuously increasing. Thus, there is a clear need to accelerate this process.
In this paper, we present a new approach for genomic sequence database scanning
utilizing reconfigurable field programmable gate array (FPGA)-based hardware. In order
to derive an efficient structure for BLASTN, we propose a reconfigurable architecture to
accelerate the computation of the word-matching stage. The experimental results show
that the FPGA implementation achieves a speedup of around one order of magnitude compared to the NCBI BLASTN software running on a general-purpose computer.

INTRODUCTION
Scanning genomic sequence databases is a common and often repeated task in molecular
biology. The need for speeding up these searches comes from the rapid growth of these
gene banks: every year their size is scaled by a factor of 1.5 to 2. The aim of a scan
operation is to find similarities between the query sequence and a particular genome
sequence, which might indicate similar functionality from a biological point of view.
Dynamic programming-based alignment algorithms can guarantee to find all important
similarities. However, as the search space is the product of the two sequences, which
could be several billion bases in size, it is generally not feasible to use a direct
implementation. One frequently used approach to speed up this time-consuming
operation is to use heuristics in the search algorithm. One of the most widely used
sequence analysis tools to use heuristics is the basic local alignment search tool (BLAST)
[2]. Although BLAST's algorithms are highly optimized for similarity search, the ever-growing databases outpace the speed improvements that BLAST can provide on a general-purpose PC. BLASTN, a version of BLAST specifically designed for DNA sequence
searches, consists of a three-stage pipeline.
Stage 1: Word-Matching detects seeds (short exact matches of a certain length between the query sequence and the subject sequence). The inputs to this stage are strings of DNA bases, which typically use the alphabet {A, C, G, T}.
Stage 2: Ungapped Extension extends each seed in both directions allowing substitutions
only and outputs the resulting high-scoring segment pairs (HSPs). An HSP [3] indicates
two sequence fragments with equal length whose alignment score meets or exceeds an
empirically set threshold (or cutoff score).
Stage 3: Gapped Extension uses the Smith-Waterman dynamic programming algorithm
to extend the HSPs allowing insertions and deletions.
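To make the word-matching stage concrete, the following is a minimal software sketch of seed detection, not the FPGA design proposed here: it indexes every fixed-length word (w-mer) of the query and then streams the subject sequence against that index. The word length w = 11 is the usual BLASTN default; the function name and the toy sequences are illustrative only.

def find_seeds(query, subject, w=11):
    # Index every w-mer of the query by its starting positions.
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    # Stream the subject sequence and report every exact w-mer match (a seed).
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in words.get(subject[j:j + w], []):
            seeds.append((i, j))  # (query offset, subject offset)
    return seeds

# Each reported seed would be handed to the ungapped-extension stage.
print(find_seeds("ACGTACGTACGTA", "TTACGTACGTACGTAGG"))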
The basic idea underlying a BLASTN search is filtration. Although each stage in
the BLASTN pipeline is becoming more sophisticated, the exponential increase in the
volume of data makes it important that measures are taken to reduce the amount of data
that needs to be processed. Filtration discards irrelevant fractions as early as possible,
thus reducing the overall computation time. Analysis of the various stages of the
BLASTN pipeline (see Table I) reveals that the word-matching stage is the most time-
consuming part. Therefore, accelerating the computation of this stage will have the
greatest effect on the overall performance.

EXISTING SYSTEM

BASIC LOCAL ALIGNMENT SEARCH TOOL
A new approach to rapid sequence comparison, basic local alignment search tool
(BLAST), directly approximates alignments that optimize a measure of local similarity,
the maximal segment pair (MSP) score. Recent mathematical results on the stochastic
properties of MSP scores allow an analysis of the performance of this method as well as
the statistical significance of alignments it generates. The basic algorithm is simple and
robust; it can be implemented in a number of ways and applied in a variety of contexts
including straight-forward DNA and protein sequence database searches, motif searches,
gene identification searches, and in the analysis of multiple regions of similarity in long
DNA sequences. In addition to its flexibility and tractability to mathematical analysis,
BLAST is an order of magnitude faster than existing sequence comparison tools of
comparable sensitivity.
A RECONFIGURABLE BLOOM FILTER ARCHITECTURE FOR BLASTN
Efficient seed-based filtration methods exist for scanning genomic sequence
databases. However, current solutions require a significant scan time on traditional
computer architectures. These scan time requirements are likely to become even more
severe due to the rapid growth in the size of databases. In this paper, we present a new
approach to genomic sequence database scanning using reconfigurable field-
programmable gate array (FPGA)-based hardware. To derive an efficient mapping onto
this type of architecture, we propose a reconfigurable Bloom filter architecture. Our
experimental results show that the FPGA implementation achieves an order of magnitude
speedup compared to the NCBI BLASTN software running on a general purpose
computer.
EFFICIENT HARDWARE HASHING FUNCTIONS FOR HIGH
PERFORMANCE COMPUTERS
Hashing is critical for high performance computer architecture. Hashing is used
extensively in hardware applications, such as page tables, for address translation. Bit
extraction and exclusive ORing hashing methods are two commonly used hashing
functions for hardware applications. However, the performance of these functions has not been studied systematically, nor has their practical performance been compared with the theoretical performance predicted for hashing schemes. In this
paper, we show that, by choosing hashing functions at random from a particular class,
called H3, of hashing functions, the analytical performance of hashing can be achieved in
practice on real-life data. Our results about the expected worst case performance of
hashing are of special significance, as they provide evidence for earlier theoretical
predictions.
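As an illustration of the H3 class referred to above, the hash of a b-bit key can be computed as the XOR of the rows of a random bit matrix selected by the key's set bits, i.e., a linear transformation over GF(2). The sketch below is a plain software rendering under assumed parameters (22-bit keys, 16-bit hashes), not the hardware design discussed in these papers.

import random

def make_h3(key_bits, hash_bits, seed=0):
    # One random hash_bits-wide row per key bit: the matrix Q of an H3 function.
    rng = random.Random(seed)
    rows = [rng.getrandbits(hash_bits) for _ in range(key_bits)]
    def h(key):
        value = 0
        for i in range(key_bits):
            if (key >> i) & 1:      # key bit i selects row i of Q
                value ^= rows[i]    # XOR of selected rows = key * Q over GF(2)
        return value
    return h

h = make_h3(key_bits=22, hash_bits=16, seed=42)  # e.g., an 11-base DNA word packed into 22 bits
print(h(0b1010110011001010101011))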
AN APPROACH FOR MINIMAL PERFECT HASH
FUNCTIONS FOR VERY LARGE DATABASES
We propose a novel external memory based algorithm for constructing minimal
perfect hash functions h for huge sets of keys. For a set of n keys, our algorithm outputs h
in time O(n). The algorithm needs a small vector of one byte entries in main memory to
construct h. The evaluation of h(x) requires three memory accesses for any key x. The
description of h takes a constant number of bits for each key (up to 9 bits), which is optimal and close to the theoretical lower bound, i.e., around 2 bits per key. In our experiments, we used a collection of 1 billion URLs collected from the web, each URL 64 characters long on average. For this collection, our algorithm (i) finds a minimal perfect hash
function in approximately 3 hours using a commodity PC, (ii) needs just 5.45 megabytes
of internal memory to generate h and (iii) takes 8.1 bits per key for the description of h.
MERCURY BLAST DICTIONARIES: ANALYSIS AND PERFORMANCE
MEASUREMENT
This report describes a hashing scheme for a dictionary of short bit strings. The
scheme, which we call near-perfect hashing, was designed as part of the construction of
Mercury BLAST, an FPGA-based accelerator for the BLAST family of biosequence
comparison algorithms.
Near-perfect hashing is a heuristic variant of the well-known displacement
hashing approach to building perfect hash functions. It uses a family of hash functions
composed from linear transformations on bit vectors and lookups in small precomputed
tables, both of which are especially appropriate for implementation in hardware logic. We
show empirically that for inputs derived from genomic DNA sequences, our scheme
obtains a good tradeoff between the size of the hash table and the time required to compute
it from a set of input strings, while generating few or no collisions between keys in the
table.
One of the building blocks of our scheme is the H_3 family of hash functions,
which are linear transformations on bit vectors. We show that the uniformity of hashing
performed with randomly chosen linear transformations depends critically on their rank,
and that randomly chosen transformations have a high probability of having the
maximum possible uniformity. A simple test is sufficient to ensure that a randomly
chosen H_3 hash function will not cause an unexpectedly large number of collisions.
Moreover, if two such functions are chosen independently at random, the second function
is unlikely to hash together two keys that were hashed together by the first.
Hashing schemes based on H_3 hash functions therefore tend to distribute their
inputs more uniformly than would be expected under a simple uniform hashing model,
and schemes using pairs of these functions are more uniform than would be assumed for
a pair of independent hash functions.

PROPOSED SYSTEM
In this paper, we propose a computationally efficient architecture to accelerate the
data processing of the word-matching stage based on field programmable gate arrays
(FPGA). FPGAs are suitable candidate platforms for high-performance computation due
to their fine-grained parallelism and pipelining capabilities.


BLOOM FILTERS
Introduction
Bloom filters [2] are compact data structures for probabilistic representation of a set in
order to support membership queries (i.e. queries that ask: Is element X in set Y?). This
compact representation is the payoff for allowing a small rate of false positives in
membership queries; that is, queries might incorrectly recognize an element as member
of the set.
We briefly survey the uses of Bloom filters to date in the next section, then describe Bloom filters in detail and give a precise picture of the space/computing time/error-rate tradeoffs.
Usage
Since their introduction in [2], Bloom filters have seen various uses:
Web cache sharing ([3]) Collaborating Web caches use Bloom filters (dubbed cache
summaries) as compact representations for the local set of cached files. Each cache
periodically broadcasts its summary to all other members of the distributed cache.
Using all summaries received, a cache node has a (partially outdated, partially wrong)
global image about the set of files stored in the aggregated cache. The Squid Web
Proxy Cache [1] uses Cache Digests based on a similar idea.
Query filtering and routing ([4, 6, 7]) The Secure wide-area Discovery Service [6], a subsystem of the Ninja project [5], organizes service providers in a hierarchy. Bloom
filters are used as summaries for the set of services offered by a node. Summaries are
sent upwards in the hierarchy and aggregated. A query is a description for a specific
service, also represented as a Bloom filter. Thus, when a member node of the hierarchy
generates/receives a query, it has enough information at hand to decide where to forward
the query: downward, to one of its descendants (if a solution to the query is present in the
filter for the corresponding node), or upward, toward its parent (otherwise).
The OceanStore [7] replica location service uses a two-tiered approach: it first initiates an inexpensive, probabilistic search (based on Bloom filters, similar to Ninja) to try to find a replica. If this fails, the search falls back on an (expensive) deterministic algorithm (based on the Plaxton replica location algorithm). Alas, their description of the probabilistic search algorithm is laconic. (An unpublished text [11] from members of the same group gives some more details, but this approach does not seem to work well when resources are dynamic.)
Compact representation of a differential file ([9]). A differential file contains a
batch of database records to be updated. For performance reasons the database is
updated only periodically (e.g., at midnight) or when the differential file grows above a
certain threshold. However, in order to preserve integrity, each reference/query to the
database has to access the differential file to see if a particular record is scheduled to be
updated. To speed-up this process, with little memory and computational overhead, the
differential file is represented as a Bloom filter.
Free text searching ([10]). Basically, the set of words that appear in a text is succinctly represented using a Bloom filter.
Constructing Bloom Filters
Consider a set A = {a_1, a_2, ..., a_n} of n elements. Bloom filters describe membership information of A using a bit vector V of length m. For this, k hash functions, h_1, h_2, ..., h_k, with h_i : X -> {1..m}, are used as described below.

The following procedure builds an m-bit Bloom filter, corresponding to a set A and using hash functions h_1, h_2, ..., h_k:
Procedure BloomFilter(set A, hash_functions, integer m)
returns filter
    filter = allocate m bits initialized to 0
    foreach a_i in A:
        foreach hash function h_j:
            filter[h_j(a_i)] = 1
        end foreach
    end foreach
    return filter

Therefore, if a_i is a member of a set A, in the resulting Bloom filter V all bits obtained corresponding to the hashed values of a_i are set to 1. Testing for membership of an element elm is equivalent to testing that all corresponding bits of V are set:
Procedure MembershipTest(elm, filter, hash_functions)
returns yes/no
    foreach hash function h_j:
        if filter[h_j(elm)] != 1 return No
    end foreach
    return Yes

Nice features: filters can be built incrementally; as new elements are added to a set, the corresponding positions are computed through the hash functions and the bits are set in the filter. Moreover, the filter expressing the union of two sets is simply computed as the bit-wise OR applied over the two corresponding Bloom filters.
Bloom Filters: the Math (this follows the description in [3])
One prominent feature of Bloom filters is that there is a clear tradeoff between the size of
the filter and the rate of false positives. Observe that after inserting n keys into a filter of
size m using k hash functions, the probability that a particular bit is still 0 is:
p_0 = (1 - 1/m)^(kn) ≈ e^(-kn/m).    (1)
(Note that we assume perfect hash functions that spread the elements of A evenly
throughout the space {1..m}. In practice, good results have been achieved using MD5
and other hash functions [10].)
Hence, the probability of a false positive (the probability that all k bits have been
previously set) is:
p_err = (1 - p_0)^k = (1 - (1 - 1/m)^(kn))^k ≈ (1 - e^(-kn/m))^k.    (2)
In (2), p_err is minimized for k = (m/n) ln 2 hash functions. In practice, however, only a small number of hash functions are used. The reason is that the computational overhead of each additional hash function is constant, while the incremental benefit of adding a new hash function decreases after a certain threshold (see Figure 1).
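For illustration, the tradeoff plotted in Figure 1 can be reproduced numerically from formula (2); this is a small sketch under the figure's assumption of m/n = 32 bits per entry, not part of any referenced implementation.

import math

def false_positive_rate(bits_per_entry, k):
    # Formula (2): p_err ≈ (1 - e^(-k / (m/n)))^k
    return (1.0 - math.exp(-k / bits_per_entry)) ** k

for k in (2, 4, 8, 16, 22, 32):
    print(k, false_positive_rate(32, k))
# The rate is lowest near k = 32 * ln 2 ≈ 22, but improves only marginally beyond k ≈ 10.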

Figure 1: False positive rate as a function
of the number of hash functions used. The
size of the Bloom filter is 32 bits per entry
(m/n=32). In this case using 22 hash
functions minimizes the false positive rate.
Note however that adding a hash function
does not significantly decrease the error
rate when more than 10 hashes are already
used.
Figure 2: Size of Bloom filter (bits/entry)
as a function of the error rate desired.
Different lines represent different numbers
of hash keys used. Note that, for the error
rates considered, using 32 keys does not
bring significant benefits over using only 8
keys.

[Figure 1 axes: false positive rate (log scale) vs. number of hash functions. Figure 2 axes: bits per entry vs. error rate (log scale), for k = 2, 4, 8, 16, 32.]
(2) is the base formula for engineering Bloom filters. It allows, for example, computing the minimal memory requirements (filter size) and the number of hash functions given the maximum acceptable false positive rate and the number of elements in the set (as we detail in Figure 2).
m/n = -k / ln(1 - e^(ln(p_err)/k))    (bits per entry)    (3)
To summarize: Bloom filters are compact data structures for probabilistic representation of a set
in order to support membership queries. The main design tradeoffs are the number of hash
functions used (driving the computational overhead), the size of the filter and the error (collision)
rate. Formula (2) is the main formula to tune parameters according to application requirements.
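As a worked example of this tuning, the sketch below sizes a filter from the number of elements and the acceptable false positive rate, assuming the optimal number of hash functions; the function name is illustrative.

import math

def dimension_bloom_filter(n, p_err):
    # With optimal k: m = -n * ln(p_err) / (ln 2)^2 bits, k = (m/n) * ln 2 hashes.
    m = math.ceil(-n * math.log(p_err) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

m, k = dimension_bloom_filter(n=1_000_000, p_err=0.01)
print(m, k, m / 1_000_000)  # roughly 9.6 bits per entry and k = 7 for a 1% error rate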
Compressed Bloom filters
Some applications that use Bloom filters need to communicate these filters across the network.
In this case, besides the three performance metrics we have seen so far, namely (1) the computational overhead to look up a value (related to the number of hash functions used), (2) the size of the filter in memory, and (3) the error rate, a fourth metric can be used: the size of the filter
transmitted across the network. M. Mitzenmacher shows in [8] that compressing Bloom filters
might lead to significant bandwidth savings at the cost of higher memory requirements (larger
uncompressed filters) and some additional computation time to compress the filter that is sent
across the network. We do not detail here all theoretical and practical issues analyzed in [8].

A Bloom filter, conceived by Burton Howard Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; i.e., a query returns either
"inside set (may be wrong)" or "definitely not in set". Elements can be added to the set, but not
removed (though this can be addressed with a "counting" filter). The more elements that are
added to the set, the larger the probability of false positives.
Bloom proposed the technique for applications where the amount of source data would
require an impracticably large hash area in memory if "conventional" error-free hashing
techniques were applied. He gave the example of a hyphenation algorithm for a dictionary of
500,000 words, of which 90% could be hyphenated by following simple rules but all the
remaining 50,000 words required expensive disk access to retrieve their specific patterns. With
unlimited core memory, an error-free hash could be used to eliminate all the unnecessary disk
access. But if core memory was insufficient, a smaller hash area could be used to eliminate most
of the unnecessary access. For example, a hash area only 15% of the error-free size would still
eliminate 85% of the disk accesses (Bloom (1970)).
More generally, fewer than 10 bits per element are required for a 1% false positive probability,
independent of the size or number of elements in the set (Bonomi et al. (2006)).

Algorithm description


An example of a Bloom filter, representing the set {x, y, z}. The colored arrows show the
positions in the bit array that each set element is mapped to. The element w is not in the set {x, y,
z}, because it hashes to one bit-array position containing 0. For this figure, m=18 and k=3.
An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash
functions defined, each of which maps or hashes some set element to one of the m array positions
with a uniform random distribution.
To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at
all these positions to 1.
To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions are 0, the element is definitely not in the set: if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits have by chance been set to 1 during the insertion of other elements, resulting in a false positive. In a simple Bloom filter, there is no way to distinguish between the two cases, but more advanced techniques can address this problem.
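The add and query operations just described fit in a few lines of Python. This is a minimal sketch that derives the k positions from salted SHA-256 digests, which is an assumption made here for illustration rather than a statement about any particular implementation.

import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)  # m-bit array, initially all 0

    def _positions(self, item):
        # Derive k array positions from k salted hashes of the item.
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # False means definitely absent; True means possibly present (may be a false positive).
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

bf = BloomFilter(m=1024, k=7)
bf.add("ACGTACGTACG")
print("ACGTACGTACG" in bf, "TTTTTTTTTTT" in bf)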
The requirement of designing k different independent hash functions can be prohibitive for large k. For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields. Alternatively, one can pass k different initial values (such as 0, 1, ..., k - 1) to a hash function that takes an initial value; or add (or append) these values to the key. For larger m and/or k, independence among the hash functions can be relaxed with a negligible increase in the false positive rate (Dillinger & Manolios (2004a), Kirsch & Mitzenmacher (2006)). Specifically, Dillinger & Manolios (2004b) show the effectiveness of deriving the k indices using enhanced double hashing or triple hashing, variants of double hashing that are effectively simple random number generators seeded with the two or three hash values.
Removing an element from this simple Bloom filter is impossible because false negatives are not
permitted. An element maps to k bits, and although setting any one of those k bits to zero suffices
to remove the element, it also results in removing any other elements that happen to map onto
that bit. Since there is no way of determining whether any other elements have been added that
affect the bits for an element to be removed, clearing any of the bits would introduce the
possibility for false negatives.
One-time removal of an element from a Bloom filter can be simulated by having a second Bloom
filter that contains items that have been removed. However, false positives in the second filter
become false negatives in the composite filter, which may be undesirable. In this approach re-
adding a previously removed item is not possible, as one would have to remove it from the
"removed" filter.
It is often the case that all the keys are available but are expensive to enumerate (for example,
requiring many disk reads). When the false positive rate gets too high, the filter can be
regenerated; this should be a relatively rare event.
Space and time advantages


Bloom filter used to speed up answers in a key-value storage system. Values are stored on a disk
which has slow access times. Bloom filter decisions are much faster. However some unnecessary
disk accesses are made when the filter reports a positive (in order to weed out the false
positives). Overall answer speed is better with the Bloom filter than without the Bloom filter.
Use of a Bloom filter for this purpose, however, does increase memory usage.
While risking false positives, Bloom filters have a strong space advantage over other data
structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or
simple arrays or linked lists of the entries. Most of these require storing at least the data items
themselves, which can require anywhere from a small number of bits, for small integers, to an
arbitrary number of bits, such as for strings (tries are an exception, since they can share storage
between elements with equal prefixes). Linked structures incur an additional linear space
overhead for pointers. A Bloom filter with 1% error and an optimal value of k, in contrast,
requires only about 9.6 bits per element regardless of the size of the elements. This advantage
comes partly from its compactness, inherited from arrays, and partly from its probabilistic nature.
The 1% false-positive rate can be reduced by a factor of ten by adding only about 4.8 bits per
element.
However, if the number of potential values is small and many of them can be in the set, the
Bloom filter is easily surpassed by the deterministic bit array, which requires only one bit for
each potential element. Note also that hash tables gain a space and time advantage if they begin
ignoring collisions and store only whether each bucket contains an entry; in this case, they have
effectively become Bloom filters with k = 1 [1].

Bloom filters also have the unusual property that the time needed either to add items or to check
whether an item is in the set is a fixed constant, O(k), completely independent of the number of
items already in the set. No other constant-space set data structure has this property, but the
average access time of sparse hash tables can make them faster in practice than some Bloom
filters. In a hardware implementation, however, the Bloom filter shines because its k lookups are
independent and can be parallelized.
To understand its space efficiency, it is instructive to compare the general Bloom filter with its
special case when k = 1. If k = 1, then in order to keep the false positive rate sufficiently low, a
small fraction of bits should be set, which means the array must be very large and contain long
runs of zeros. The information content of the array relative to its size is low. The generalized
Bloom filter (k greater than 1) allows many more bits to be set while still maintaining a low false
positive rate; if the parameters (k and m) are chosen well, about half of the bits will be set, and
these will be apparently random, minimizing redundancy and maximizing information content.
Probability of false positives


The false positive probability as a function of the number n of elements in the filter and the filter size m. An optimal number of hash functions has been assumed.
Assume that a hash function selects each array position with equal probability. If m is the number of bits in the array, and k is the number of hash functions, then the probability that a certain bit is not set to 1 by a certain hash function during the insertion of an element is

1 - 1/m.

The probability that it is not set to 1 by any of the hash functions is

(1 - 1/m)^k.

If we have inserted n elements, the probability that a certain bit is still 0 is

(1 - 1/m)^(kn);

the probability that it is 1 is therefore

1 - (1 - 1/m)^(kn).

Now test membership of an element that is not in the set. Each of the k array positions computed by the hash functions is 1 with a probability as above. The probability of all of them being 1, which would cause the algorithm to erroneously claim that the element is in the set, is often given as

(1 - (1 - 1/m)^(kn))^k ≈ (1 - e^(-kn/m))^k.

This is not strictly correct, as it assumes independence for the probabilities of each bit being set. However, assuming it is a close approximation, we have that the probability of false positives decreases as m (the number of bits in the array) increases, and increases as n (the number of inserted elements) increases. For a given m and n, the value of k (the number of hash functions) that minimizes the probability is

k = (m/n) ln 2,

which gives a false positive probability of

(1/2)^k ≈ 0.6185^(m/n).

The required number of bits m, given n (the number of inserted elements) and a desired false positive probability p (and assuming the optimal value of k is used), can be computed by substituting the optimal value of k in the probability expression above:

p = (1 - e^(-kn/m))^k  with  k = (m/n) ln 2,

which can be simplified to:

ln p = -(m/n) (ln 2)^2.

This results in:

m = -(n ln p) / (ln 2)^2.
This means that for a given false positive probability p, the length of a Bloom filter m is proportional to the number of elements being filtered n [2]. While the above formula is asymptotic (i.e., applicable as m and n go to infinity), the agreement with finite values of m and n is also quite good; the false positive probability for a finite Bloom filter with m bits, n elements, and k hash functions is at most

(1 - e^(-k(n + 0.5)/(m - 1)))^k.

So we can use the asymptotic formula if we pay a penalty for at most half an extra element and at most one fewer bit [3].

Approximating the number of items in a Bloom filter
Swamidass & Baldi (2007) showed that the number of items in a Bloom filter can be approximated with the following formula,

n* = -(N/k) ln(1 - X/N),

where n* is an estimate of the number of items in the filter, N is the length of the filter, k is the number of hash functions per item, and X is the number of bits set to one.
The union and intersection of sets
Bloom filters are a way of compactly representing a set of items. It is common to try to compute the size of the intersection or union between two sets. Bloom filters can be used to approximate the size of the intersection and union of two sets. Swamidass & Baldi (2007) showed that for two Bloom filters of length N (built with the same k hash functions), the counts of their underlying sets A and B can be estimated as

n(A) ≈ -(N/k) ln(1 - X_A/N)   and   n(B) ≈ -(N/k) ln(1 - X_B/N),

where X_A and X_B are the numbers of bits set to one in the respective filters. The size of their union can be estimated as

n(A ∪ B) ≈ -(N/k) ln(1 - X_union/N),

where X_union is the number of bits set to one in either of the two Bloom filters (their bitwise OR). The intersection can then be estimated as

n(A ∩ B) ≈ n(A) + n(B) - n(A ∪ B),

using the three formulas together.
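A short sketch of these estimators, assuming two filters that share the same length N (in bits) and the same k hash functions and are given here as raw byte arrays; it also assumes neither filter is completely full.

import math

def estimate_count(bits_set, N, k):
    # Swamidass & Baldi estimate of the number of inserted items.
    return -(N / k) * math.log(1.0 - bits_set / N)

def estimate_union_and_intersection(filter_a, filter_b, N, k):
    x_a = sum(bin(byte).count("1") for byte in filter_a)                   # popcount of A
    x_b = sum(bin(byte).count("1") for byte in filter_b)                   # popcount of B
    x_or = sum(bin(a | b).count("1") for a, b in zip(filter_a, filter_b))  # popcount of A OR B
    n_a = estimate_count(x_a, N, k)
    n_b = estimate_count(x_b, N, k)
    n_union = estimate_count(x_or, N, k)
    return n_union, n_a + n_b - n_union  # (estimated union size, estimated intersection size)

a = bytearray([0b10110010] * 8)  # two toy 64-bit filters built with k = 3
b = bytearray([0b10010011] * 8)
print(estimate_union_and_intersection(a, b, N=64, k=3))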
Interesting properties
Unlike a standard hash table, a Bloom filter of a fixed size can represent a set with an arbitrary
large number of elements; adding an element never fails due to the data structure "filling up."
However, the false positive rate increases steadily as elements are added until all bits in the filter
are set to 1, at which point all queries yield a positive result.
Union and intersection of Bloom filters with the same size and set of hash functions can be
implemented with bitwise OR and AND operations, respectively. The union operation on Bloom
filters is lossless in the sense that the resulting Bloom filter is the same as the Bloom filter
created from scratch using the union of the two sets. The intersect operation satisfies a weaker
property: the false positive probability in the resulting Bloom filter is at most the false-positive
probability in one of the constituent Bloom filters, but may be larger than the false positive
probability in the Bloom filter created from scratch using the intersection of the two sets. There
are also more accurate estimates of intersection and union that are not biased in this way.

Some kinds of superimposed code can be seen as a Bloom filter implemented with
physical edge-notched cards.
Examples
Google BigTable and Apache Cassandra use Bloom filters to reduce the disk lookups for non-existent rows or columns. Avoiding costly disk lookups considerably increases the performance of a database query operation [4].
The Google Chrome web browser uses a Bloom filter to identify malicious URLs. Any URL is first checked against a local Bloom filter, and only upon a hit is a full check of the URL performed [5].
The Squid Web Proxy Cache uses Bloom filters for cache digests [6].
Bitcoin uses Bloom filters to verify payments without running a full network node [7][8].
The Venti archival storage system uses Bloom filters to detect previously stored data [9].
The SPIN model checker uses Bloom filters to track the reachable state space for large verification problems [10].
The Cascading analytics framework uses Bloom filters to speed up asymmetric joins, where one of the joined data sets is significantly larger than the other (often called a Bloom join [11] in the database literature) [12].

Alternatives
Classic Bloom filters use 1.44 log2(1/ε) bits of space per inserted key, where ε is the false positive rate of the Bloom filter. However, the space that is strictly necessary for any data structure playing the same role as a Bloom filter is only log2(1/ε) per key (Pagh, Pagh & Rao 2005). Hence Bloom filters use 44% more space than a hypothetical equivalent optimal data structure. The number of hash functions used to achieve a given false positive rate is proportional to log(1/ε), which is not optimal, as it has been proved that an optimal data structure would need only a constant number of hash functions independent of the false positive rate.
Stern & Dill (1996) describe a probabilistic structure based on hash tables, hash compaction,
which Dillinger & Manolios (2004b) identify as significantly more accurate than a Bloom filter
when each is configured optimally. Dillinger and Manolios, however, point out that the
reasonable accuracy of any given Bloom filter over a wide range of numbers of additions makes
it attractive for probabilistic enumeration of state spaces of unknown size. Hash compaction is,
therefore, attractive when the number of additions can be predicted accurately; however, despite
being very fast in software, hash compaction is poorly suited for hardware because of worst-case
linear access time.
Putze, Sanders & Singler (2007) have studied some variants of Bloom filters that are either faster or use less space than classic Bloom filters. The basic idea of the fast variant is to locate the k hash values associated with each key into one or two blocks having the same size as the processor's memory cache blocks (usually 64 bytes). This will presumably improve performance by reducing the number of potential memory cache misses. The proposed variants have, however, the drawback of using about 32% more space than classic Bloom filters.
The space-efficient variant relies on using a single hash function that generates for each key a value in the range [0, n/ε], where ε is the requested false positive rate. The sequence of values is then sorted and compressed using Golomb coding (or some other compression technique) to occupy a space close to n log2(1/ε) bits. To query the Bloom filter for a given key, it will suffice to check if its corresponding value is stored in the Bloom filter. Decompressing the whole Bloom filter for each query would make this variant totally unusable. To overcome this problem the sequence of values is divided into small blocks of equal size that are compressed separately. At query time, only half a block will need to be decompressed on average. Because of decompression overhead, this variant may be slower than classic Bloom filters, but this may be compensated by the fact that only a single hash function needs to be computed.
Another alternative to classic Bloom filter is the one based on space efficient variants of cuckoo
hashing. In this case once the hash table is constructed, the keys stored in the hash table are
replaced with short signatures of the keys. Those signatures are strings of bits computed using a
hash function applied on the keys.
Extensions and applications
Counting filters
Counting filters provide a way to implement a delete operation on a Bloom filter without
recreating the filter afresh. In a counting filter the array positions (buckets) are extended from
being a single bit to being an n-bit counter. In fact, regular Bloom filters can be considered as
counting filters with a bucket size of one bit. Counting filters were introduced by Fan et al.
(1998).
The insert operation is extended to increment the value of the buckets and the lookup operation
checks that each of the required buckets is non-zero. The delete operation, obviously, then
consists of decrementing the value of each of the respective buckets.
Arithmetic overflow of the buckets is a problem and the buckets should be sufficiently large to
make this case rare. If it does occur then the increment and decrement operations must leave the
bucket set to the maximum possible value in order to retain the properties of a Bloom filter.
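A minimal counting-filter sketch along these lines, with 4-bit saturating counters; the salted-hash scheme is again an illustrative assumption.

import hashlib

class CountingBloomFilter:
    MAX = 15  # 4-bit counters: saturate rather than overflow

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, item):
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            if self.counters[pos] < self.MAX:      # a saturated bucket stays at MAX
                self.counters[pos] += 1

    def remove(self, item):
        for pos in self._positions(item):
            if 0 < self.counters[pos] < self.MAX:  # never decrement a saturated bucket
                self.counters[pos] -= 1

    def __contains__(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))

cbf = CountingBloomFilter(m=1024, k=4)
cbf.add("chr1:1042")
cbf.remove("chr1:1042")
print("chr1:1042" in cbf)  # False again after deletion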
The size of counters is usually 3 or 4 bits. Hence counting Bloom filters use 3 to 4 times more
space than static Bloom filters. In theory, an optimal data structure equivalent to a counting
Bloom filter should not use more space than a static Bloom filter.
Another issue with counting filters is limited scalability. Because the counting Bloom filter table
cannot be expanded, the maximal number of keys to be stored simultaneously in the filter must
be known in advance. Once the designed capacity of the table is exceeded, the false positive rate
will grow rapidly as more keys are inserted.
Bonomi et al. (2006) introduced a data structure based on d-left hashing that is functionally
equivalent but uses approximately half as much space as counting Bloom filters. The scalability
issue does not occur in this data structure. Once the designed capacity is exceeded, the keys
could be reinserted in a new hash table of double size.
The space efficient variant by Putze, Sanders & Singler (2007) could also be used to implement
counting filters by supporting insertions and deletions.
Data synchronization
Bloom filters can be used for approximate data synchronization as in Byers et al. (2004).
Counting Bloom filters can be used to approximate the number of differences between two sets
and this approach is described in Agarwal & Trachtenberg (2006).
Bloomier filters
Chazelle et al. (2004) designed a generalization of Bloom filters that could associate a value with
each element that had been inserted, implementing an associative array. Like Bloom filters, these
structures achieve a small space overhead by accepting a small probability of false positives. In
the case of "Bloomier filters", a false positive is defined as returning a result when the key is not
in the map. The map will never return the wrong value for a key that is in the map.
Compact approximators
Boldi & Vigna (2005) proposed a lattice-based generalization of Bloom filters. A compact
approximator associates to each key an element of a lattice (the standard Bloom filters being
the case of the Boolean two-element lattice). Instead of a bit array, they have an array of lattice
elements. When adding a new association between a key and an element of the lattice, they
compute the maximum of the current contents of the k array locations associated to the key with
the lattice element. When reading the value associated to a key, they compute the minimum of
the values found in the k locations associated to the key. The resulting value approximates from
above the original value.
Stable Bloom filters
Deng & Rafiei (2006) proposed Stable Bloom filters as a variant of Bloom filters for streaming
data. The idea is that since there is no way to store the entire history of a stream (which can be
infinite), Stable Bloom filters continuously evict stale information to make room for more recent
elements. Since stale information is evicted, the Stable Bloom filter introduces false negatives,
which do not appear in traditional bloom filters. The authors show that a tight upper bound of
false positive rates is guaranteed, and the method is superior to standard bloom filters in terms of
false positive rates and time efficiency when a small space and an acceptable false positive rate
are given.

Scalable Bloom filters
Almeida et al. (2007) proposed a variant of Bloom filters that can adapt dynamically to the
number of elements stored, while assuring a minimum false positive probability. The technique
is based on sequences of standard bloom filters with increasing capacity and tighter false positive
probabilities, so as to ensure that a maximum false positive probability can be set beforehand,
regardless of the number of elements to be inserted.
Attenuated Bloom filters
An attenuated bloom filter of depth D can be viewed as an array of D normal bloom filters. In the
context of service discovery in a network, each node stores regular and attenuated bloom filters
locally. The regular or local bloom filter indicates which services are offered by the node itself.
The attenuated filter of level i indicates which services can be found on nodes that are i-hops
away from the current node. The i-th value is constructed by taking a union of local bloom filters
for nodes i-hops away from the node.


Let's take the small network shown in the graph below as an example. Say we are searching for a service A whose id hashes to bits 0, 1, and 3 (pattern 11010). Let node n1 be the starting point. First, we check whether service A is offered by n1 by checking its local filter. Since the patterns don't match, we check the attenuated Bloom filter in order to determine which node should be the next hop. We see that n2 doesn't offer service A but lies on the path to nodes that do. Hence, we move to n2 and repeat the same procedure. We quickly find that n3 offers the service, and hence the destination is located.
By using attenuated Bloom filters consisting of multiple layers, services at more than one hop
distance can be discovered while avoiding saturation of the Bloom filter by attenuating (shifting
out) bits set by sources further away.
HASH TABLE


A small phone book as a hash table
In computing, a hash table (also hash map) is a data structure used to implement an associative
array, a structure that can map keys to values. A hash table uses a hash function to compute
an index into an array of buckets or slots, from which the correct value can be found.
Ideally, the hash function should assign each possible key to a unique bucket, but this ideal situation is rarely achievable in practice (unless the hash keys are fixed; i.e., new entries are never added to the table after it is created). Instead, most hash table designs assume that hash collisions (different keys that are assigned by the hash function to the same bucket) will occur and must be accommodated in some way.
In a well-dimensioned hash table, the average cost (number of instructions) for each lookup is
independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key-value pairs, at (amortized [2]) constant average cost per operation [3][4].

In many situations, hash tables turn out to be more efficient than search trees or any
other table lookup structure. For this reason, they are widely used in many kinds of
computer software, particularly for associative arrays, database indexing, caches, and sets.
Hashing
Main article: Hash function
The idea of hashing is to distribute the entries (key/value pairs) across an array of buckets. Given
a key, the algorithm computes an index that suggests where the entry can be found:
index = f(key, array_size)
Often this is done in two steps:
hash = hashfunc(key)
index = hash % array_size
In this method, the hash is independent of the array size, and it is then reduced to an index (a number between 0 and array_size - 1) using the modulus operator (%).
In the case that the array size is a power of two, the remainder operation is reduced to masking, which improves speed but can increase problems with a poor hash function.
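A small sketch of both reductions; Python's built-in hash is used purely for illustration.

key = "John Smith"
h = hash(key) & 0xFFFFFFFF     # take a 32-bit unsigned view of the hash value

index_mod = h % 1000           # general table size: reduce with the modulus operator
index_mask = h & (1024 - 1)    # power-of-two table size: masking gives the same result as % 1024
print(index_mod, index_mask)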
Choosing a good hash function
A good hash function and implementation algorithm are essential for good hash table
performance, but may be difficult to achieve.
A basic requirement is that the function should provide a uniform distribution of hash values. A
non-uniform distribution increases the number of collisions and the cost of resolving them.
Uniformity is sometimes difficult to ensure by design, but may be evaluated empirically using
statistical tests, e.g., a Pearson's chi-squared test for discrete uniform distributions [5][6].
The distribution needs to be uniform only for table sizes s that occur in the application. In particular, if one uses dynamic resizing with exact doubling and halving of s, the hash function needs to be uniform only when s is a power of two. On the other hand, some hashing algorithms provide uniform hashes only when s is a prime number [7].

For open addressing schemes, the hash function should also avoid clustering, the mapping of two
or more keys to consecutive slots. Such clustering may cause the lookup cost to skyrocket, even
if the load factor is low and collisions are infrequent. The popular multiplicative hash [3] is claimed to have particularly poor clustering behavior [7].

Cryptographic hash functions are believed to provide good hash functions for any table size s,
either by modulo reduction or by bit masking. They may also be appropriate if there is a risk of
malicious users trying to sabotage a network service by submitting requests designed to generate
a large number of collisions in the server's hash tables. However, the risk of sabotage can also be
avoided by cheaper methods (such as applying a secret salt to the data, or using a universal hash
function).
Some authors claim that good hash functions should have the avalanche effect; that is, a single-bit change in the input key should affect, on average, half the bits in the output. Some popular hash functions do not have this property.

Perfect hash function
If all keys are known ahead of time, a perfect hash function can be used to create a perfect hash
table that has no collisions. If minimal perfect hashing is used, every location in the hash table
can be used as well.
Perfect hashing allows for constant time lookups in the worst case. This is in contrast to most
chaining and open addressing methods, where the time for lookup is low on average, but may be
very large (proportional to the number of entries) for some sets of keys.
Key statistics
A critical statistic for a hash table is called the load factor. This is simply the number of entries
divided by the number of buckets, that is, n/k where n is the number of entries and k is the
number of buckets.
If the load factor is kept reasonable, the hash table should perform well, provided the hashing is
good. If the load factor grows too large, the hash table will become slow, or it may fail to work
(depending on the method used). The expected constant time property of a hash table assumes
that the load factor is kept below some bound. For a fixed number of buckets, the time for a
lookup grows with the number of entries and so does not achieve the desired constant time.
Second to that, one can examine the variance of number of entries per bucket. For example, two
tables both have 1000 entries and 1000 buckets; one has exactly one entry in each bucket, the
other has all entries in the same bucket. Clearly the hashing is not working in the second one.
A low load factor is not especially beneficial. As load factor approaches 0, the proportion of
unused areas in the hash table increases, but there is not necessarily any reduction in search cost.
This results in wasted memory.
Collision resolution
Hash collisions are practically unavoidable when hashing a random subset of a large set of
possible keys. For example, if 2,500 keys are hashed into a million buckets, even with a perfectly
uniform random distribution, according to the birthday problem there is a 95% chance of at least
two of the keys being hashed to the same slot.
Therefore, most hash table implementations have some collision resolution strategy to handle
such events. Some common strategies are described below. All these methods require that the
keys (or pointers to them) be stored in the table, together with the associated values.
Separate chaining


Hash collision resolved by separate chaining.
In the method known as separate chaining, each bucket is independent, and has some sort
of list of entries with the same index. The time for hash table operations is the time to find the
bucket (which is constant) plus the time for the list operation. (The technique is also called open
hashing or closed addressing.)
In a good hash table, each bucket has zero or one entries, and sometimes two or three, but rarely
more than that. Therefore, structures that are efficient in time and space for these cases are
preferred. Structures that are efficient for a fairly large number of entries are not needed or
desirable. If these cases happen often, the hashing is not working well, and this needs to be fixed.
Separate chaining with linked lists
Chained hash tables with linked lists are popular because they require only basic data structures
with simple algorithms, and can use simple hash functions that are unsuitable for other methods.
The cost of a table operation is that of scanning the entries of the selected bucket for the desired
key. If the distribution of keys is sufficiently uniform, the average cost of a lookup depends only on the average number of keys per bucket, that is, on the load factor.
Chained hash tables remain effective even when the number of table entries n is much higher
than the number of slots. Their performance degrades more gracefully (linearly) with the load
factor. For example, a chained hash table with 1000 slots and 10,000 stored keys (load factor 10)
is five to ten times slower than a 10,000-slot table (load factor 1); but still 1000 times faster than
a plain sequential list, and possibly even faster than a balanced search tree.
For separate-chaining, the worst-case scenario is when all entries are inserted into the same
bucket, in which case the hash table is ineffective and the cost is that of searching the bucket data
structure. If the latter is a linear list, the lookup procedure may have to scan all its entries, so the
worst-case cost is proportional to the number n of entries in the table.
The bucket chains are often implemented as ordered lists, sorted by the key field; this choice approximately halves the average cost of unsuccessful lookups, compared to an unordered list. However, if some keys are much more likely to come up than others, an
unordered list with move-to-front heuristic may be more effective. More sophisticated data
structures, such as balanced search trees, are worth considering only if the load factor is large
(about 10 or more), or if the hash distribution is likely to be very non-uniform, or if one must
guarantee good performance even in a worst-case scenario. However, using a larger table and/or
a better hash function may be even more effective in those cases.
Chained hash tables also inherit the disadvantages of linked lists. When storing small keys and
values, the space overhead of the next pointer in each entry record can be significant. An
additional disadvantage is that traversing a linked list has poor cache performance, making the
processor cache ineffective.
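A compact sketch of separate chaining, using Python lists as the per-bucket chains (an illustrative stand-in for linked lists, with phone-book-style toy data).

class ChainedHashTable:
    def __init__(self, num_buckets=1024):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite its value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # otherwise chain a new entry onto the bucket

    def get(self, key, default=None):
        for k, v in self._bucket(key):   # scan only the selected bucket
            if k == key:
                return v
        return default

table = ChainedHashTable()
table.put("Lisa Smith", "521-8976")
table.put("John Smith", "521-1234")
print(table.get("John Smith"), table.get("Sam Doe"))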
Separate chaining with list heads


Hash collision by separate chaining with head records in the bucket array.
Some chaining implementations store the first record of each chain in the slot array itself [4]. The number of pointer traversals is decreased by one for most cases. The purpose is to increase cache efficiency of hash table access.
The disadvantage is that an empty bucket takes the same space as a bucket with one entry. To
save memory space, such hash tables often have about as many slots as stored entries, meaning
that many slots have two or more entries.
Separate chaining with other structures
Instead of a list, one can use any other data structure that supports the required operations. For
example, by using a self-balancing tree, the theoretical worst-case time of common hash table
operations (insertion, deletion, lookup) can be brought down to O(log n) rather than O(n).
However, this approach is only worth the trouble and extra memory cost if long delays must be
avoided at all costs (e.g. in a real-time application), or if one must guard against many entries
hashed to the same slot (e.g. if one expects extremely non-uniform distributions, or in the case of
web sites or other publicly accessible services, which are vulnerable to malicious key
distributions in requests).
The variant called array hash table uses a dynamic array to store all the entries that hash to the
same slot. Each newly inserted entry gets appended to the end of the dynamic array that is
assigned to the slot. The dynamic array is resized in an exact-fit manner, meaning it is grown
only by as many bytes as needed. Alternative techniques such as growing the array by block
sizes or pages were found to improve insertion performance, but at a cost in space. This variation
makes more efficient use of CPU caching and the translation lookaside buffer (TLB), because
slot entries are stored in sequential memory positions. It also dispenses with the next pointers
that are required by linked lists, which saves space. Despite frequent array resizing, space
overheads incurred by the operating system, such as memory fragmentation, were found to be small.
An elaboration on this approach is the so-called dynamic perfect hashing [11], where a bucket that contains k entries is organized as a perfect hash table with k^2 slots. While it uses more memory (n^2 slots for n entries in the worst case and n*k slots in the average case), this variant has guaranteed constant worst-case lookup time, and low amortized time for insertion.
Open addressing


Hash collision resolved by open addressing with linear probing (interval=1). Note that "Ted Baker" has a unique hash, but nevertheless collided with "Sandra Dee", which had previously collided with "John Smith".
In another strategy, called open addressing, all entry records are stored in the bucket array itself.
When a new entry has to be inserted, the buckets are examined, starting with the hashed-to slot
and proceeding in some probe sequence, until an unoccupied slot is found. When searching for
an entry, the buckets are scanned in the same sequence, until either the target record is found, or
an unused array slot is found, which indicates that there is no such key in the table [12]. The name "open addressing" refers to the fact that the location ("address") of the item is not determined by its hash value. (This method is also called closed hashing; it should not be confused with "open hashing" or "closed addressing" that usually mean separate chaining.)
Well-known probe sequences include:
- Linear probing, in which the interval between probes is fixed (usually 1)
- Quadratic probing, in which the interval between probes is increased by adding the
successive outputs of a quadratic polynomial to the starting value given by the original hash
computation
- Double hashing, in which the interval between probes is computed by another hash function
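A brief open-addressing sketch with linear probing (interval 1), storing entries directly in the slot array; deletion is omitted because it needs tombstones or re-insertion, and resizing is left out for brevity.

class LinearProbingTable:
    def __init__(self, num_slots=16):
        self.slots = [None] * num_slots        # each slot holds (key, value) or None

    def _probe(self, key):
        # Yield slot indices starting at the hashed-to slot, wrapping around the array.
        start = hash(key) % len(self.slots)
        for i in range(len(self.slots)):
            yield (start + i) % len(self.slots)

    def put(self, key, value):
        for idx in self._probe(key):
            if self.slots[idx] is None or self.slots[idx][0] == key:
                self.slots[idx] = (key, value)
                return
        raise RuntimeError("table full: a resize would be needed here")

    def get(self, key, default=None):
        for idx in self._probe(key):
            entry = self.slots[idx]
            if entry is None:                   # an empty slot ends the probe sequence
                return default
            if entry[0] == key:
                return entry[1]
        return default

t = LinearProbingTable()
t.put("John Smith", "521-1234")
t.put("Sandra Dee", "521-9655")
print(t.get("Sandra Dee"), t.get("Ted Baker"))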
A drawback of all these open addressing schemes is that the number of stored entries cannot
exceed the number of slots in the bucket array. In fact, even with good hash functions, their
performance dramatically degrades when the load factor grows beyond 0.7 or so. Thus a more
aggressive resize scheme is needed. Separate chaining works correctly with any load factor,
although performance is likely to be reasonable if it is kept below 2 or so. For many applications,
these restrictions mandate the use of dynamic resizing, with its attendant costs.
Open addressing schemes also put more stringent requirements on the hash function: besides
distributing the keys more uniformly over the buckets, the function must also minimize the
clustering of hash values that are consecutive in the probe order. Using separate chaining, the
only concern is that too many objects map to the same hash value; whether they are adjacent or
nearby is completely irrelevant.
Open addressing only saves memory if the entries are small (less than four times the size of a
pointer) and the load factor is not too small. If the load factor is close to zero (that is, there are
far more buckets than stored entries), open addressing is wasteful even if each entry is just two
words.


This graph compares the average number of cache misses required to look up elements in tables
with chaining and linear probing. As the table passes the 80%-full mark, linear probing's
performance drastically degrades.
Open addressing avoids the time overhead of allocating each new entry record, and can be
implemented even in the absence of a memory allocator. It also avoids the extra indirection
required to access the first entry of each bucket (that is, usually the only one). It also has
better locality of reference, particularly with linear probing. With small record sizes, these
factors can yield better performance than chaining, particularly for lookups.
Hash tables with open addressing are also easier to serialize, because they do not use pointers.
On the other hand, normal open addressing is a poor choice for large elements, because these
elements fill entire CPU cachelines (negating the cache advantage), and a large amount of space
is wasted on large empty table slots. If the open addressing table only stores references to
elements (external storage), it uses space comparable to chaining even for large records but loses
its speed advantage.
Generally speaking, open addressing is better used for hash tables with small records that can be
stored within the table (internal storage) and fit in a cache line. They are particularly suitable for
elements of one word or less. If the table is expected to have a high load factor, the records are
large, or the data is variable-sized, chained hash tables often perform as well or better.
Ultimately, used sensibly, any kind of hash table algorithm is usually fast enough; and the
percentage of a calculation spent in hash table code is low. Memory usage is rarely considered
excessive. Therefore, in most cases the differences between these algorithms are marginal, and
other considerations typically come into play.

Coalesced hashing
A hybrid of chaining and open addressing, coalesced hashing links together chains of nodes within the table itself.[12] Like open addressing, it achieves space usage and (somewhat diminished) cache advantages over chaining. Like chaining, it does not exhibit clustering effects; in fact, the table can be efficiently filled to a high density. Unlike chaining, it cannot have more elements than table slots.
Cuckoo hashing
Another alternative open-addressing solution is cuckoo hashing, which ensures constant lookup
time in the worst case, and constant amortized time for insertions and deletions. It uses two or
more hash functions, which means any key/value pair could be in two or more locations. For
lookup, the first hash function is used; if the key/value is not found, then the second hash
function is used, and so on. If a collision happens during insertion, then the key is re-hashed with
the second hash function to map it to another bucket. If all hash functions are used and there is
still a collision, then the key it collided with is removed to make space for the new key, and the
old key is re-hashed with one of the other hash functions, which maps it to another bucket. If that
location also results in a collision, then the process repeats until there is no collision or the
process traverses all the buckets, at which point the table is resized. By combining multiple hash
functions with multiple cells per bucket, very high space utilisation can be achieved.
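The following is a minimal sketch of cuckoo insertion in C with two tables and two illustrative hash functions (neither the names nor the constants come from the text above); both tables are assumed to be initialized to EMPTY, and a real implementation would rehash or resize when the displacement bound is hit.

/* Cuckoo hashing sketch: two tables, two hash functions, bounded displacement. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CUCKOO_N  1024                /* size of each table (2^10) */
#define MAX_KICKS 32                  /* displacement bound before giving up */
#define EMPTY     INT64_MIN

static int64_t t1[CUCKOO_N], t2[CUCKOO_N];

static size_t h1(int64_t k) { return (size_t)(((uint64_t)k * 0x9E3779B97F4A7C15ULL) >> 54); }
static size_t h2(int64_t k) { return (size_t)(((uint64_t)k * 0xC2B2AE3D27D4EB4FULL) >> 54); }

/* Returns false when MAX_KICKS displacements did not find room;
   the caller must then rehash with new functions or resize the tables. */
bool cuckoo_insert(int64_t key) {
    for (int kick = 0; kick < MAX_KICKS; kick++) {
        size_t i = h1(key);
        if (t1[i] == EMPTY) { t1[i] = key; return true; }
        int64_t evicted = t1[i];      /* displace the resident of table 1 ... */
        t1[i] = key;
        size_t j = h2(evicted);
        if (t2[j] == EMPTY) { t2[j] = evicted; return true; }
        key = t2[j];                  /* ... and push table 2's victim back toward table 1 */
        t2[j] = evicted;
    }
    return false;
}

/* Lookup inspects at most two slots, which is the worst-case constant promised above. */
bool cuckoo_contains(int64_t key) {
    return t1[h1(key)] == key || t2[h2(key)] == key;
}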
Robin Hood hashing
One interesting variation on double-hashing collision resolution is Robin Hood hashing.[13] The idea is that a new key may displace a key already inserted, if its probe count is larger than that of the key at the current position. The net effect of this is that it reduces worst case search times in the table. This is similar to Knuth's ordered hash tables except that the criterion for bumping a key does not depend on a direct relationship between the keys. Since both the worst case and the variation in the number of probes are reduced dramatically, an interesting variation is to probe the table starting at the expected successful probe value and then expand from that position in both directions.[14] External Robin Hood hashing is an extension of this algorithm where the table is stored in an external file and each table position corresponds to a fixed-sized page or bucket with B records.[15]
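A sketch of the Robin Hood rule on top of linear probing follows; it reuses slots[], TABLE_SIZE, EMPTY and hash() from the earlier linear-probing sketch, and it is only an illustration of the displacement criterion, with deletion and resizing omitted.

/* Robin Hood insertion: the key that is farther from its home slot keeps the slot. */
void robin_hood_insert(int64_t key) {
    size_t dist = 0;                                    /* displacement of the key in hand */
    for (size_t step = 0; step < TABLE_SIZE; step++, dist++) {
        size_t idx = (hash(key) + dist) & (TABLE_SIZE - 1);
        if (slots[idx] == EMPTY) { slots[idx] = key; return; }
        /* How far is the resident key from its own home slot? */
        size_t resident_dist = (idx - hash(slots[idx])) & (TABLE_SIZE - 1);
        if (resident_dist < dist) {                     /* resident is "richer": swap roles */
            int64_t tmp = slots[idx];
            slots[idx]  = key;
            key  = tmp;                                 /* keep inserting the displaced key */
            dist = resident_dist;
        }
    }
    /* Table full: a real implementation would resize here. */
}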

2-choice hashing
2-choice hashing employs two different hash functions, h1(x) and h2(x), for the hash table. Both hash functions are used to compute two table locations. When an object is inserted in the table, it is placed in the table location that contains fewer objects (with the default being the h1(x) table location if there is equality in bucket size). 2-choice hashing employs the principle of the power of two choices.
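A minimal sketch of 2-choice insertion over chained buckets, assuming per-bucket counters and two illustrative hash functions (all names here are hypothetical, not from the text): the new node goes to whichever of the two candidate buckets currently holds fewer entries, with ties going to the h1 bucket.

/* 2-choice hashing sketch over an array of chained buckets. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define NBUCKETS 1024

struct tc_node { int64_t key; struct tc_node *next; };

static struct tc_node *tc_bucket[NBUCKETS];
static size_t          tc_count[NBUCKETS];     /* number of entries per bucket */

static size_t tc_h1(int64_t k) { return (size_t)(((uint64_t)k * 0x9E3779B97F4A7C15ULL) >> 54); }
static size_t tc_h2(int64_t k) { return (size_t)(((uint64_t)k * 0xC2B2AE3D27D4EB4FULL) >> 54); }

/* Insert into the less loaded of the two candidate buckets (ties favour h1). */
void tc_insert(int64_t key) {
    size_t a = tc_h1(key), b = tc_h2(key);
    size_t target = (tc_count[b] < tc_count[a]) ? b : a;
    struct tc_node *n = malloc(sizeof *n);     /* error handling omitted in this sketch */
    n->key  = key;
    n->next = tc_bucket[target];
    tc_bucket[target] = n;
    tc_count[target]++;
}

/* Lookup must check both candidate buckets. */
bool tc_contains(int64_t key) {
    size_t idx[2] = { tc_h1(key), tc_h2(key) };
    for (int t = 0; t < 2; t++)
        for (struct tc_node *n = tc_bucket[idx[t]]; n; n = n->next)
            if (n->key == key) return true;
    return false;
}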
Hopscotch hashing
Another alternative open-addressing solution is hopscotch hashing,[16] which combines the approaches of cuckoo hashing and linear probing, yet seems in general to avoid their limitations. In particular it works well even when the load factor grows beyond 0.9. The algorithm is well suited for implementing a resizable concurrent hash table.
The hopscotch hashing algorithm works by defining a neighborhood of buckets near the original
hashed bucket, where a given entry is always found. Thus, search is limited to the number of
entries in this neighborhood, which is logarithmic in the worst case, constant on average, and
with proper alignment of the neighborhood typically requires one cache miss. When inserting an
entry, one first attempts to add it to a bucket in the neighborhood. However, if all buckets in this
neighborhood are occupied, the algorithm traverses buckets in sequence until an open slot (an
unoccupied bucket) is found (as in linear probing). At that point, since the empty bucket is
outside the neighborhood, items are repeatedly displaced in a sequence of hops. (This is similar
to cuckoo hashing, but with the difference that in this case the empty slot is being moved into the
neighborhood, instead of items being moved out with the hope of eventually finding an empty
slot.) Each hop brings the open slot closer to the original neighborhood, without invalidating the
neighborhood property of any of the buckets along the way. In the end, the open slot has been
moved into the neighborhood, and the entry being inserted can be added to it.
Dynamic resizing
To keep the load factor under a certain limit, e.g. under 3/4, many table implementations expand
the table when items are inserted. For example, in Java's HashMap class the default load factor
threshold for table expansion is 0.75.
Since buckets are usually implemented on top of a dynamic array and any constant proportion
for resizing greater than 1 will keep the load factor under the desired limit, the exact choice of
the constant is determined by the same space-time tradeoff as for dynamic arrays.
Resizing is accompanied by a full or incremental table rehash whereby existing items are
mapped to new bucket locations.
To limit the proportion of memory wasted due to empty buckets, some implementations also shrink the size of the table, followed by a rehash, when items are deleted. From the point of view of space-time tradeoffs, this operation is similar to the deallocation in dynamic arrays.
Resizing by copying all entries
A common approach is to automatically trigger a complete resizing when the load factor exceeds some threshold rmax. Then a new larger table is allocated, all the entries of the old table are removed and inserted into this new table, and the old table is returned to the free storage pool. Symmetrically, when the load factor falls below a second threshold rmin, all entries are moved to a new smaller table.
If the table size increases or decreases by a fixed percentage at each expansion, the total cost of
these resizings, amortized over all insert and delete operations, is still a constant, independent of
the number of entries n and of the number m of operations performed.
For example, consider a table that was created with the minimum possible size and is doubled each time the load ratio exceeds some threshold. If m elements are inserted into that table, the total number of extra re-insertions that occur in all dynamic resizings of the table is at most m - 1. In other words, dynamic resizing roughly doubles the cost of each insert or delete operation.
Incremental resizing
Some hash table implementations, notably in real-time systems, cannot pay the price of enlarging
the hash table all at once, because it may interrupt time-critical operations. If one cannot avoid
dynamic resizing, a solution is to perform the resizing gradually:
- During the resize, allocate the new hash table, but keep the old table unchanged.
- In each lookup or delete operation, check both tables.
- Perform insertion operations only in the new table.
- At each insertion also move r elements from the old table to the new table.
- When all elements are removed from the old table, deallocate it.
To ensure that the old table is completely copied over before the new table itself needs to be
enlarged, it is necessary to increase the size of the table by a factor of at least (r + 1)/r during
resizing.
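A sketch of the gradual migration described above, for a chained table in C; the names (old_t, new_t, R) are illustrative, and the allocation of the new, larger bucket array at the start of a resize is omitted. Each insertion goes into the new table and also moves up to R entries out of the old one, and lookups consult both tables while the resize is in progress.

/* Incremental (gradual) resizing sketch for a chained hash table. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define R 4                                      /* entries migrated per insertion */

struct ir_node  { int64_t key; struct ir_node *next; };
struct ir_table { struct ir_node **bucket; size_t nbuckets; size_t scan; };

static struct ir_table old_t, new_t;             /* old_t.bucket == NULL when no resize is running */

static size_t ir_slot(const struct ir_table *t, int64_t k) {
    return (size_t)((uint64_t)k * 0x9E3779B97F4A7C15ULL % t->nbuckets);
}

static void ir_put(struct ir_table *t, struct ir_node *n) {
    size_t i = ir_slot(t, n->key);
    n->next = t->bucket[i];
    t->bucket[i] = n;
}

/* Move up to R entries from the old table into the new one. */
static void ir_migrate_step(void) {
    size_t moved = 0;
    while (old_t.bucket && moved < R) {
        if (old_t.scan == old_t.nbuckets) {      /* old table drained: retire it */
            free(old_t.bucket);
            old_t.bucket = NULL;
            break;
        }
        struct ir_node *n = old_t.bucket[old_t.scan];
        if (!n) { old_t.scan++; continue; }
        old_t.bucket[old_t.scan] = n->next;      /* unlink and re-home the entry */
        ir_put(&new_t, n);
        moved++;
    }
}

/* Insertions go only to the new table; each one also pays for a little migration. */
void ir_insert(int64_t key) {
    struct ir_node *n = malloc(sizeof *n);       /* error handling omitted */
    n->key = key;
    ir_put(&new_t, n);
    ir_migrate_step();
}

/* While the resize is in progress, lookups must consult both tables. */
bool ir_contains(int64_t key) {
    for (struct ir_node *n = new_t.bucket[ir_slot(&new_t, key)]; n; n = n->next)
        if (n->key == key) return true;
    if (old_t.bucket)
        for (struct ir_node *n = old_t.bucket[ir_slot(&old_t, key)]; n; n = n->next)
            if (n->key == key) return true;
    return false;
}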
Monotonic keys
If it is known that key values will always increase (or decrease) monotonically, then a variation
of consistent hashing can be achieved by keeping a list of the single most recent key value at
each hash table resize operation. Upon lookup, keys that fall in the ranges defined by these list
entries are directed to the appropriate hash functionand indeed hash tableboth of which can
be different for each range. Since it is common to grow the overall number of entries by
doubling, there will only be O(lg(N)) ranges to check, and binary search time for the redirection
would be O(lg(lg(N))). As with consistent hashing, this approach guarantees that any key's hash,
once issued, will never change, even when the hash table is later grown.
Other solutions
Linear hashing[17] is a hash table algorithm that permits incremental hash table expansion. It is implemented using a single hash table, but with two possible look-up functions.
Another way to decrease the cost of table resizing is to choose a hash function in such a way that the hashes of most values do not change when the table is resized. This approach, called consistent hashing, is prevalent in disk-based and distributed hashes, where rehashing is prohibitively costly.
Performance analysis
In the simplest model, the hash function is completely unspecified and the table does not resize.
For the best possible choice of hash function, a table of size k with open addressing has no
collisions and holds up to k elements, with a single comparison for successful lookup, and a table
of size k with chaining and n keys has the minimum max(0, n-k) collisions and O(1 + n/k)
comparisons for lookup. For the worst choice of hash function, every insertion causes a collision,
and hash tables degenerate to linear search, with Θ(n) amortized comparisons per insertion and up to n comparisons for a successful lookup.
Adding rehashing to this model is straightforward. As in a dynamic array, geometric resizing by a factor of b implies that only n/b^i keys are inserted i or more times, so that the total number of insertions is bounded above by bn/(b - 1), which is O(n). By using rehashing to maintain n < k, tables using both chaining and open addressing can have unlimited elements and perform successful lookup in a single comparison for the best choice of hash function.
In more realistic models, the hash function is a random variable over a probability distribution of hash functions, and performance is computed on average over the choice of hash function. When this distribution is uniform, the assumption is called "simple uniform hashing" and it can be shown that hashing with chaining requires Θ(1 + n/k) comparisons on average for an unsuccessful lookup, and hashing with open addressing requires Θ(1/(1 - n/k)).[18] Both these bounds are constant if we maintain n/k < c using table resizing, where c is a fixed constant less than 1.
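As a rough worked example of these bounds (a sketch using the standard textbook forms hidden inside the Θ, not figures from the source above), write α = n/k for the load factor, so an unsuccessful lookup costs about 1 + α comparisons with chaining and about 1/(1 - α) probes with open addressing:

\alpha = 0.5:\quad 1 + \alpha = 1.5 \text{ comparisons (chaining)} \quad\text{vs.}\quad \frac{1}{1-\alpha} = 2 \text{ probes (open addressing)}

\alpha = 0.9:\quad 1 + \alpha = 1.9 \text{ comparisons (chaining)} \quad\text{vs.}\quad \frac{1}{1-\alpha} = 10 \text{ probes (open addressing)}

This is one way to see why open-addressing tables are usually resized well before the load factor approaches 1, while chained tables degrade much more gently.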
Features
Advantages
The main advantage of hash tables over other table data structures is speed. This advantage is
more apparent when the number of entries is large. Hash tables are particularly efficient when
the maximum number of entries can be predicted in advance, so that the bucket array can be
allocated once with the optimum size and never resized.
If the set of key-value pairs is fixed and known ahead of time (so insertions and deletions are not
allowed), one may reduce the average lookup cost by a careful choice of the hash function,
bucket table size, and internal data structures. In particular, one may be able to devise a hash
function that is collision-free, or even perfect (see below). In this case the keys need not be
stored in the table.
Drawbacks
Although operations on a hash table take constant time on average, the cost of a good hash
function can be significantly higher than the inner loop of the lookup algorithm for a sequential
list or search tree. Thus hash tables are not effective when the number of entries is very small.
(However, in some cases the high cost of computing the hash function can be mitigated by
saving the hash value together with the key.)
For certain string processing applications, such as spell-checking, hash tables may be less
efficient than tries, finite automata, or Judy arrays. Also, if each key is represented by a small
enough number of bits, then, instead of a hash table, one may use the key directly as the index
into an array of values. Note that there are no collisions in this case.
The entries stored in a hash table can be enumerated efficiently (at constant cost per entry), but
only in some pseudo-random order. Therefore, there is no efficient way to locate an entry whose
key is nearest to a given key. Listing all n entries in some specific order generally requires a
separate sorting step, whose cost is proportional to log(n) per entry. In comparison, ordered
search trees have lookup and insertion cost proportional to log(n), but allow finding the nearest
key at about the same cost, and ordered enumeration of all entries at constant cost per entry.
If the keys are not stored (because the hash function is collision-free), there may be no easy way
to enumerate the keys that are present in the table at any given moment.
Although the average cost per operation is constant and fairly small, the cost of a single
operation may be quite high. In particular, if the hash table uses dynamic resizing, an insertion or
deletion operation may occasionally take time proportional to the number of entries. This may be
a serious drawback in real-time or interactive applications.
Hash tables in general exhibit poor locality of reference: that is, the data to be accessed is
distributed seemingly at random in memory. Because hash tables cause access patterns that jump
around, this can trigger microprocessor cache misses that cause long delays. Compact data
structures such as arrays searched with linear search may be faster, if the table is relatively small
and keys are integers or other short strings. According to Moore's Law, cache sizes are growing
exponentially and so what is considered "small" may be increasing. The optimal performance
point varies from system to system.
Hash tables become quite inefficient when there are many collisions. While extremely uneven
hash distributions are extremely unlikely to arise by chance, a malicious adversary with
knowledge of the hash function may be able to supply information to a hash that creates worst-case behavior by causing excessive collisions, resulting in very poor performance, e.g. a denial of service attack.[19] In critical applications, universal hashing can be used; a data structure with better worst-case guarantees may be preferable.[20]

Uses
Associative arrays
Hash tables are commonly used to implement many types of in-memory tables. They are used to
implement associative arrays (arrays whose indices are arbitrary strings or other complicated
objects), especially in interpreted programming languages like AWK, Perl, and PHP.
When storing a new item into a multimap and a hash collision occurs, the multimap
unconditionally stores both items.
When storing a new item into a typical associative array and a hash collision occurs, but the
actual keys themselves are different, the associative array likewise stores both items. However, if
the key of the new item exactly matches the key of an old item, the associative array typically
erases the old item and overwrites it with the new item, so every item in the table has a unique
key.
Database indexing
Hash tables may also be used as disk-based data structures and database indices (such as in dbm)
although B-trees are more popular in these applications.
Caches
Hash tables can be used to implement caches, auxiliary data tables that are used to speed up the
access to data that is primarily stored in slower media. In this application, hash collisions can be
handled by discarding one of the two colliding entries, usually erasing the old item that is
currently stored in the table and overwriting it with the new item, so every item in the table has a
unique hash value.
Sets
Besides recovering the entry that has a given key, many hash table implementations can also tell
whether such an entry exists or not.
Those structures can therefore be used to implement a set data structure, which merely records
whether a given key belongs to a specified set of keys. In this case, the structure can be
simplified by eliminating all parts that have to do with the entry values. Hashing can be used to
implement both static and dynamic sets.
Object representation
Several dynamic languages, such as Perl, Python, JavaScript, and Ruby, use hash tables to
implement objects. In this representation, the keys are the names of the members and methods of
the object, and the values are pointers to the corresponding member or method.
Unique data representation
Hash tables can be used by some programs to avoid creating multiple character strings with the
same contents. For that purpose, all strings in use by the program are stored in a single hash
table, which is checked whenever a new string has to be created. This technique was introduced
in Lisp interpreters under the name hash consing, and can be used with many other kinds of data
(expression trees in a symbolic algebra system, records in a database, files in a file system,
binary decision diagrams, etc.)
Implementations
In programming languages
Many programming languages provide hash table functionality, either as built-in associative
arrays or as standard library modules. In C++11, for example, the unordered_map class provides
hash tables for keys and values of arbitrary type.
In PHP 5, the Zend 2 engine uses one of the hash functions from Daniel J. Bernstein to generate
the hash values used in managing the mappings of data pointers stored in a hash table. In the
PHP source code, it is labelled as DJBX33A (Daniel J. Bernstein, Times 33 with Addition).
Python's built-in hash table implementation, in the form of the dict type, as well as Perl's hash
type (%) are highly optimized as they are used internally to implement namespaces.
In the .NET Framework, support for hash tables is provided via the non-generic Hashtable and
generic Dictionary classes, which store key-value pairs, and the generic HashSet class, which
stores only values.
Independent packages
- SparseHash (formerly Google SparseHash): an extremely memory-efficient hash_map implementation, with only 2 bits/entry of overhead. The SparseHash library has several C++ hash map implementations with different performance characteristics, including one that optimizes for memory use and another that optimizes for speed.
- SunriseDD: an open source C library for hash table storage of arbitrary data objects with lock-free lookups, built-in reference counting and guaranteed order iteration. The library can participate in external reference counting systems or use its own built-in reference counting. It comes with a variety of hash functions and allows the use of runtime-supplied hash functions via a callback mechanism. Source code is well documented.
- uthash: an easy-to-use hash table for C structures.

Reconfigurable computing
Reconfigurable computing is a computer architecture combining some of the flexibility
of software with the high performance of hardware by processing with very flexible high speed
computing fabrics like field-programmable gate arrays (FPGAs). The principal difference when
compared to using ordinary microprocessors is the ability to make substantial changes to
the datapath itself in addition to the control flow. On the other hand, the main difference with
custom hardware, i.e. application-specific integrated circuits (ASICs) is the possibility to adapt
the hardware during runtime by "loading" a new circuit on the reconfigurable fabric.
History and properties
The concept of reconfigurable computing has existed since the 1960s, when Gerald Estrin's landmark paper proposed the concept of a computer made of a standard processor and an array of "reconfigurable" hardware.[1][2] The main processor would control the behavior of the reconfigurable hardware. The latter would then be tailored to perform a specific task, such as image processing or pattern matching, as quickly as a dedicated piece of hardware. Once the task was done, the hardware could be adjusted to do some other task. This resulted in a hybrid computer structure combining the flexibility of software with the speed of hardware; unfortunately, the idea was far ahead of the electronic technology needed to realize it.
In the 1980s and 1990s there was a renaissance in this area of research, with many proposed reconfigurable architectures developed in industry and academia,[3] such as COPACOBANA, Matrix, Garp,[4] Elixent, PACT XPP, Silicon Hive, Montium, Pleiades, Morphosys, and PiCoGA.[5] Such designs were feasible due to the constant progress of silicon technology that let complex designs be implemented on one chip. The world's first commercial reconfigurable computer, the Algotronix CHS2X4, was completed in 1991. It was not a commercial success, but was promising enough that Xilinx (the inventor of the field-programmable gate array, FPGA) bought the technology and hired the Algotronix staff.[6]

Reconfigurable computing as a paradigm shift: using the Anti Machine
Computer scientist Reiner Hartenstein describes reconfigurable computing in terms of an anti machine that, according to him, represents a fundamental paradigm shift away from the more conventional von Neumann machine.[7] Hartenstein calls it the Reconfigurable Computing Paradox: software-to-configware migration (software-to-FPGA migration) results in reported speed-up factors of up to more than four orders of magnitude, as well as a reduction in electricity consumption by up to almost four orders of magnitude, even though the technological parameters of FPGAs are behind the Gordon Moore curve by about four orders of magnitude and the clock frequency is substantially lower than that of microprocessors. This paradox is due to a paradigm shift, and is also partly explained by the von Neumann syndrome.
The fundamental model of the reconfigurable computing machine paradigm, the data-stream-based anti machine, is well illustrated by its differences from the machine paradigms introduced earlier, as shown by Nick Tredennick's classification scheme of computing paradigms (see "Table 1: Nick Tredennick's Paradigm Classification Scheme").[8]

The fundamental model of a Reconfigurable Computing Machine, the data-stream-based anti machine (also called Xputer), is the counterpart of the instruction-stream-based von Neumann machine paradigm. This is illustrated by a simple reconfigurable system (not dynamically reconfigurable), which has no instruction fetch at run time. The reconfiguration (before run time) can be considered a kind of super instruction fetch. An anti machine does not have a program counter; it has data counters instead, since it is data-stream-driven. Here the definition of the term data streams is adopted from the systolic array scene, which defines at which time which data item has to enter or leave which port of the reconfigurable system, which may be fine-grained (e.g. using FPGAs), coarse-grained, or a mixture of both.
The systolic array scene, originally (early 1980s) mainly mathematicians, defined only one half of the anti machine: the data path, i.e. the systolic array (see also super systolic array). They did not define or model the data sequencer methodology, considering it outside their scope to specify where the data streams come from or end up. The data sequencing part of the anti machine is modeled as distributed memory, preferably on chip, which consists of auto-sequencing memory (ASM) blocks. Each ASM block has a sequencer including a data counter. An example is the Generic Address Generator (GAG), which is a generalization of the DMA.
Example of a streaming model of computation
Problem: We are given 2 character arrays of length 256: A[] and B[]. We need to compute the
array C[] such that C[i]=B[B[B[B[B[B[B[B[A[i]]]]]]]]]. Though this problem is hypothetical,
similar problems exist which have some applications.
Consider a software solution (C code) for the above problem:
/* A[], B[] and C[] are the three length-256 arrays from the problem statement
   (declared, e.g., as unsigned char A[256], B[256], C[256]). */
for (int i = 0; i < 256; i++) {
    unsigned char a = A[i];        /* unsigned char keeps the index in 0..255 */
    for (int j = 0; j < 8; j++)    /* apply the table lookup B[] eight times */
        a = B[a];
    C[i] = a;                      /* C[i] = B[B[B[B[B[B[B[B[A[i]]]]]]]]] */
}
This program will take about 256*10*CPI cycles for the CPU, where CPI is the number of
cycles per instruction.

Now, consider the hardware implementation (shown in the accompanying figure), say on an FPGA. Here, one element from the array 'A' is 'streamed' by a microprocessor into the circuit every cycle. The array 'B' is implemented as a ROM, perhaps in the BRAMs of the FPGA. The wires going into the ROM labelled 'B' are the address lines, and the wires coming out are the values stored in the ROM at that address. The blue boxes are registers used for storing temporary values. Clearly, this is a pipeline that outputs one value (a useful C[i] value) per cycle after the eighth cycle. Hence the output is also a 'stream'.
The hardware implementation takes 256+8 cycles. Hence, we can expect a speedup of about
10*CPI over the software implementation. However, the speedup is much less than this value
due to the slow clock of the FPGA.
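Taking the cycle counts above at face value, the expected speedup can be written out explicitly (a back-of-the-envelope estimate, where f_CPU and f_FPGA denote the two clock frequencies):

\text{speedup} \;=\; \frac{t_{\mathrm{CPU}}}{t_{\mathrm{FPGA}}}
\;=\; \frac{256 \cdot 10 \cdot \mathrm{CPI} \,/\, f_{\mathrm{CPU}}}{(256 + 8) \,/\, f_{\mathrm{FPGA}}}
\;\approx\; 9.7 \cdot \mathrm{CPI} \cdot \frac{f_{\mathrm{FPGA}}}{f_{\mathrm{CPU}}}

So the roughly 10*CPI figure is reached only when the two clocks are comparable; with an FPGA clock several times slower than the CPU clock, the realized speedup shrinks by the same factor.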

CONCLUSION
In this paper, we have presented an FPGA-based reconfigurable architecture to accelerate the word-matching stage of BLASTN, a bio-sequence search tool of high importance to Bioinformatics research. Our design consists of three substages: a parallel Bloom filter, an off-chip hash table, and a match redundancy eliminator. Different techniques are applied to optimize the performance of each substage. Comparing the performance of our word-matching accelerator to that of NCBI BLASTN shows a speedup of around one order of magnitude with only modest resource utilization. As FPGA-based designs exhibit high performance for parallel computing and fine-grained pipelining, we expect comparable performance improvements for other applications in Bioinformatics. Therefore, we are also planning to design an architecture for Stage 2 of the BLASTN pipeline (ungapped extension) in order to further improve overall application performance.









REFERENCES
[1] GenBank Statistics at NCBI [Online]. Available: http://www.ncbi.nlm.nih.gov/genbank/genbankstats.html
[2] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic local alignment search tool," J. Molecular Biol., vol. 215, pp. 403-410, Feb. 1990.
[3] BLAST Algorithm [Online]. Available: http://en.wikipedia.org/wiki/BLAST
[4] P. Krishnamurthy, J. Buhler, R. Chamberlain, M. Franklin, K. Gyang, A. Jacob, and J. Lancaster, "Biosequence similarity search on the Mercury system," J. VLSI Signal Process. Syst., vol. 49, no. 1, pp. 101-121, 2007.
[5] Z. Zhang, S. Schwartz, L. Wagner, and W. Miller, "A greedy algorithm for aligning DNA sequences," J. Comput. Biol., vol. 7, nos. 1-2, pp. 203-214, 2000.
[6] W. J. Kent, "BLAT: the BLAST-like alignment tool," Genome Res., vol. 12, pp. 656-664, Mar. 2002.
[7] B. Ma, J. Tromp, and M. Li, "PatternHunter: Faster and more sensitive homology search," Bioinformatics, vol. 18, no. 3, pp. 440-445, 2002.
[8] M. Li, B. Ma, D. Kisman, and J. Tromp, "PatternHunter II: Highly sensitive and fast homology search," J. Bioinf. Comput. Biol., vol. 2, no. 3, pp. 417-439, 2004.
[9] K. Muriki, K. D. Underwood, and R. Sass, "RC-BLAST: Toward a portable, cost-effective open source hardware implementation," in Proc. 19th Int. Parallel Distrib. Process. Symp., vol. 8, 2005, pp. 1-8.
[10] E. Sotiriades, C. Kozanitis, and A. Dollas, "FPGA based architecture for DNA sequence comparison and database search," in Proc. 20th Int. Parallel Distrib. Process. Symp., 2006, p. 8.
[11] D. Lavenier, G. Georges, and X. Liu, "A reconfigurable index FLASH memory tailored to seed-based genomic sequence comparison algorithms," J. VLSI Signal Process. Syst., Special Issue Comput. Archit. Accelerat. Bioinf. Algorithms, vol. 48, no. 3, pp. 255-269, 2007.
[12] J. Buhler, J. Lancaster, A. Jacob, and R. Chamberlain, "Mercury BLASTN: Faster DNA sequence comparison using a streaming architecture," in Proc. Reconfig. Syst. Summer Inst., Jul. 2007, pp. 1-7.
[13] B. Bloom, "Space/time trade-offs in hash coding with allowable errors," Commun. ACM, vol. 13, no. 7, pp. 422-426, 1970.
[14] S. Dharmapurikar and J. Lockwood, "Fast and scalable pattern matching for network intrusion detection systems," IEEE J. Sel. Areas Commun., vol. 24, no. 10, pp. 1781-1792, Oct. 2006.
[15] M. Nourani and P. Katta, "Bloom filter accelerator for string matching," in Proc. 16th Int. Conf. Comput. Commun. Netw., 2007, pp. 185-190.
[16] I. Moraru and D. G. Andersen, "Exact pattern matching with feedforward Bloom filter," in Proc. Workshop Algorithm Eng. Experim., 2011, pp. 1-12.
[17] DRC Coprocessor Information [Online]. Available: http://www.drccomputer.com
[18] Y. Chen, B. Schmidt, and D. L. Maskell, "A reconfigurable Bloom filter architecture for BLASTN," in Proc. ARCS 22nd Int. Conf. Archit. Comput. Syst., 2009, pp. 40-49.
[19] M. Ramakrishna, E. Fu, and E. Bahcekapili, "Efficient hardware hashing functions for high performance computers," IEEE Trans. Comput., vol. 46, no. 12, pp. 1378-1381, Dec. 1997.
[20] F. C. Botelho, Y. Kohayakawa, and N. Ziviani, "An approach for minimal perfect hash functions for very large databases," Dept. Comput. Sci., Univ. Federal de Minas Gerais, Belo Horizonte, Brazil, Tech. Rep., 2006.
[21] R. Pagh and F. F. Rodler, "Cuckoo hashing," J. Algorithms, vol. 51, no. 2, pp. 122-144, 2004.
[22] BLAST Programs at NCBI [Online]. Available: http://www.ncbi.nlm.nih.gov/BLAST/
[23] Xilinx Virtex-7 Family [Online]. Available: http://www.xilinx.com/product/silicon-devices/fpga/virtex-7/index.htm
[24] J. Buhler, "Mercury BLAST dictionaries: Analysis and performance measurement," Dept. Comput. Sci. Eng., Washington Univ., St. Louis, MO, Tech. Rep., 2007.
