Awesome Big Data Algorithms
http://xkcd.com/1185/
Welcome!
More of a computational scientist than a computer scientist; will be using simulations to demo & explore algorithm behavior.
Features
I will be using Python rather than C++, because Python is easier to read.
I apologize in advance for not covering your favorite data structure or algorithm.
Outline
The basic idea
Three examples:
  Skip lists (a fast key/value store)
  HyperLogLog counting (counting distinct elements)
  Bloom filters and CountMin sketches
Folding, spindling, and mutilating DNA sequence
References and further reading
Skip lists
A randomly indexed improvement on linked lists.
Each node can belong to one or more vertical levels, which allow fast search/insertion/deletion: ~O(log(n)) typically!
(wikipedia)
Skip lists
A randomly indexed improvement on linked lists.
Very easy to implement; asymptotically good behavior.
From reddit: "if someone held a gun to my head and asked me to implement an efficient set/map storage, I would implement a skip list." (Response: does this happen to you a lot??)
(wikipedia)
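As a rough illustration (my own sketch, not code from the talk), here is a minimal skip list in Python, assuming a promotion probability of 0.5 per level and storing keys only:

import random

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * (level + 1)   # one forward pointer per level

class SkipList:
    MAX_LEVEL = 16   # illustrative cap on tower height
    P = 0.5          # chance of promoting a new node one level up

    def __init__(self):
        self.head = Node(None, self.MAX_LEVEL)
        self.level = 0   # highest level currently in use

    def _random_level(self):
        # Flip coins: each HEAD raises the new node's tower by one level.
        lvl = 0
        while lvl < self.MAX_LEVEL and random.random() < self.P:
            lvl += 1
        return lvl

    def insert(self, key):
        update = [self.head] * (self.MAX_LEVEL + 1)
        node = self.head
        # Walk right on each level, dropping down when the next key is too big.
        for i in range(self.level, -1, -1):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        new = Node(key, lvl)
        for i in range(lvl + 1):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def __contains__(self, key):
        node = self.head
        for i in range(self.level, -1, -1):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key

sl = SkipList()
for x in (3, 1, 4, 1, 5, 9):
    sl.insert(x)
print(4 in sl, 7 in sl)   # True False

Both search and insert walk right and then drop down a level, so the expected cost is ~O(log n) with no explicit rebalancing.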
Channel randomness!
If you can construct or rely on randomness, then you can easily get good typical behavior.
Note: a good hash function is essentially the same as a good random number generator.
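A quick way to see that claim in practice (my own illustration, not from the slides): hash a stream of distinct keys and check that the results spread evenly across buckets, just as a good random number generator would:

import hashlib

BUCKETS = 10
counts = [0] * BUCKETS
for i in range(100_000):
    # Take the first 8 bytes of a SHA-1 digest as a pseudo-random integer.
    h = int.from_bytes(hashlib.sha1(f"key-{i}".encode()).digest()[:8], "big")
    counts[h % BUCKETS] += 1
print(counts)   # each bucket should hold roughly 10,000 keys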
Relevant digression:
Flip some unknown number of coins.
Q: what is something simple to track that will tell you roughly how many coins you've flipped?
A: the longest run of HEADs. Long runs are very rare and are correlated with how many coins you've flipped.
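A small simulation (my addition) makes the correlation concrete: the longest run of HEADs grows roughly like log2 of the number of flips, which is the same trick HyperLogLog applies to hashed elements to estimate distinct counts:

import random

def longest_head_run(n_flips):
    """Flip n_flips fair coins and return the longest run of HEADs."""
    longest = current = 0
    for _ in range(n_flips):
        if random.random() < 0.5:   # HEAD
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest

# Seeing a longest run of length k suggests on the order of 2**k flips.
for n in (100, 10_000, 1_000_000):
    runs = [longest_head_run(n) for _ in range(20)]
    print(n, sum(runs) / len(runs))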
Bloom filters
A set membership data structure that is probabilistic but only yields false positives.
Trivial to implement; hash function is the main cost; extremely memory efficient.
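A minimal Bloom filter sketch in Python (my own illustration; the bit-array size and number of hash functions below are placeholders, not tuned values):

import hashlib

class BloomFilter:
    def __init__(self, num_bits=2**20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions by hashing the item with k different salts.
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # All k bits set -> "probably present"; any bit clear -> definitely absent.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("ATGGACCAGATAGGGAGAGCCAGG")
print("ATGGACCAGATAGGGAGAGCCAGG" in bf)   # True
print("GATTACA" in bf)                    # False (barring a false positive)

The structure may claim an item is present when it was never added (a false positive), but it never reports an added item as absent.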
My research applications
Biology is fast becoming a data-driven science.
http://www.genome.gov/sequencingcosts/
Shotgun sequencing is like feeding books into a paper shredder, digitizing the shreds, and reconstructing the book.
(Although for books, we often know the language and not just the alphabet :-)
Shotgun sequencing
[Diagram: an unknown genome (a row of X's) sampled by randomly chosen reads, which contain errors]
Coverage is simply the average number of reads that overlap each true base in the genome. Here the coverage is ~10: draw a line straight down from the top of the diagram and count the reads it crosses.
Typically 10-100x coverage is needed for robust recovery (~300 Gbp for a human genome).
But this data is massively redundant!! Only ~5x systematic coverage is needed; all the stuff above the red line in the diagram is unnecessary!
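As a rough back-of-the-envelope check (assuming a ~3 Gbp human genome), the 300 Gbp figure is just genome size times coverage:

genome_size_bp = 3_000_000_000    # ~3 Gbp human genome (assumed)
coverage = 100                    # 100x sampling
print(genome_size_bp * coverage)  # 300,000,000,000 bp = 300 Gbp of reads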
Digital normalization
[Diagram, built up across several slides: an unknown true sequence with randomly sequenced reads accumulating over it; once a region is well covered, additional reads there add no new information]
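A simplified sketch of the digital normalization idea (my own illustration: a plain dict stands in for a memory-efficient counter such as a CountMin sketch, and the k-mer size and coverage cutoff are placeholder values):

from collections import defaultdict
from statistics import median

K = 20        # k-mer size (placeholder)
CUTOFF = 5    # keep a read only while its region still looks under-covered

kmer_counts = defaultdict(int)   # in practice: CountMin sketch / counting Bloom filter

def kmers(read, k=K):
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def keep_read(read):
    """Keep the read only if its estimated coverage is below CUTOFF."""
    counts = [kmer_counts[km] for km in kmers(read)]
    if not counts or median(counts) >= CUTOFF:
        return False
    for km in kmers(read):       # only accepted reads add to the counts
        kmer_counts[km] += 1
    return True

# Stream the reads once, keeping a smaller, less redundant subset.
reads = ["ACGTACGTACGTACGTACGTACGT"] * 8 + ["TTGCATTGCATTGCATTGCATTGCA"]
kept = [r for r in reads if keep_read(r)]
print(len(kept), "of", len(reads), "reads kept")

Redundant reads stop being accepted once their region looks covered, so data volume drops sharply while (almost) all of the information is retained.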
Storing data this way takes less space than the best possible exact (information-theoretic) storage, which is only possible because the representation is probabilistic and lossy.
Smaller problems are pretty much solved.
Just beginning to explore threading, multicore, etc. (BIG DATA grant proposal).
Goal is to scale to 50 Tbp of data (~5-50 TB RAM currently).
Concluding thoughts
Channel randomness.
Embrace streaming.
Live with minor uncertainty.
Don't be afraid to discard data.
(Also, I'm an open source hacker who can confer PhDs, in exchange for long years of low pay living in Michigan. E-mail me! And don't talk to Brett Cannon about PhDs first.)
References
Skip lists: Wikipedia, and John Shipman's code: http://infohost.nmt.edu/tcc/help/lang/python/examples/pyskip/pyskip.pdf
HyperLogLog: https://github.com/svpcom/hyperloglog