Foundation in Data Engineering Full Notes∗
Ilaria Battiston
∗ All notes are collected with the aid of material provided by T. Neumann. All images have been retrieved from the slides.
Contents
1 Tools and techniques for large-scale data processing
3 Performance spectrum
4 SQL
5 Query decorrelation
6 Recursive CTEs
7 Window functions
8 Database clusters
10 Hashing
12 NoSQL databases
13 Key-value models
15 SPARQL
16 Graph databases
1 Tools and techniques for large-scale data processing
1.2 Velocity
Velocity refers to the problematic speed of data: it is generated by an endless stream of events, with no time for heavy indexing, which has led to the development of new data-stream technologies.
Storage is relatively cheap, but accessing and visualizing the data is next to impossible. The cause of this is repeated observations, such as the locations of mobile phones, which motivate stream algorithms.
1.3 Variety
Variety represents big data being inconclusive and incomplete: to cope with this, simple queries are ineffective and a machine-learning approach is required (data mining, data cleaning, text analysis). This creates technical complications and also complicates cluster data processing, due to the difficulty of partitioning data evenly.
Big data typically obeys a power law: modeling the head is easy, but may not be representative of
the full population. Most items take a small amount of time to process, but a few require a lot of
time; understanding the nature of data is the key.
Distributed computation is a natural way to tackle big data. MapReduce operates over a cluster of machines, encouraging sequential, disk-based, localized processing of data. The power law applies here too, causing an uneven allocation of data to nodes (the head would go to one or two workers, making them extremely slow) and turning parallel algorithms into sequential ones.
2 File manipulation tools
2.1 Command line tools
2.1.1 cat
Syntax: cat [OPTION] [FILE]
cat concatenates files to standard output. When no file is specified, it reads from standard input. It can also be used with multiple files and different formats (binary, compressed). It is mainly used as input for other commands, since displaying large data on the terminal can be computationally expensive.
2.1.2 less
Syntax: less [OPTIONS]
less is named as the opposite of the analogous command more, which it extends with additional features. It does not have to read the entire input file before starting, so with large files it starts up faster than text editors or cat.
Commands may be preceded by decimal numbers indicating the number of rows to display/scroll,
and basic pattern matching can be performed as well.
2.1.3 grep
Syntax: grep [OPTION] PATTERNS [FILE]
grep can search (using the Boyer-Moore algorithm) for a pattern in any type of string or file, or even in lists of files and the output of other commands. It uses basic regular expressions as well as plain strings.
There are also the variations egrep (for extended regular expressions), fgrep (fixed, to only search strings) and rgrep (recursive, for folders).
Regular expressions syntax:
• [] enclose lists of characters (and);
2.1.4 sort
Syntax: sort [OPTION] [FILE]
sort sorts the lines of text files, writing the result to standard output. When no file is specified, it reads standard input. Some ordering options are -n (numeric), -r (reverse), -k (sort by a specific key) and -u (remove duplicates).
It can handle files larger than main memory, and it is very useful to prepare input for other algo-
rithms.
2.1.6 uniq
Syntax: uniq [OPTION] [INPUT [OUTPUT]]
uniq handles (adjacent) duplicates in sorted input (similarly to SQL DISTINCT), reporting or omitting them. It also has the option -c to count duplicates and -u to print only unique lines. With no options, matching lines are merged into the first occurrence.
Its main applications are grouping and counting input.
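For instance, a typical grouping-and-counting pipeline (assuming a hypothetical plain-text file words.txt with one item per line) is:
Example: sort words.txt | uniq -c | sort -rn
This sorts the input, counts each distinct line and lists the most frequent items first.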
2.1.7 cut
Syntax: cut [OPTION] [FILE]
cut removes sections from each line of a file. This can be seen as selecting columns from a CSV, where the relevant parts of delimited input get redirected to the output.
Example: cat file | cut -f 1,3
Returns specific fields of file; the delimiter can be specified with -d.
2.1.8 wc
Syntax: wc [OPTION] [FILE]
wc prints newline, word and byte counts for each file (and a total line if more than one file is
specified). A word is a non-zero-length sequence of characters delimited by white space.
Options may be used to select which counts are printed: -c for bytes, -m for chars, -l for lines and
so on.
2.1.9 shuf
Syntax: shuf [OPTION] [FILE]
This command generates random permutations of the input lines to standard output. It is not so
common, although very useful when extracting random samples and running performance tests.
2.1.10 sed
Syntax: sed [OPTION] script [FILE]
sed is a stream editor to perform basic text transformations on an input stream (or file). It works
by making only one pass over the input, therefore it is more efficient than scripts. It can also filter
text in a pipeline.
This command is mostly used to replace text in a file. The parts of the substitution expression are separated by slashes, and the trailing letters (flags) represent options.
Example: cat file | sed ’s/oldText/newText/’
Replaces oldText with newText.
Example: cat file | sed ’s/file[0-9]*.png/"&"/I’
Case insensitive matching and replacing.
2.1.11 join
Syntax: join [OPTION] FILE1 FILE2
join joins the lines of two files on a common field. For each pair of input lines with identical join
fields, it writes a line to standard output. The default join field is the first, delimited by blanks. One of the two files can be given as -, in which case standard input is read.
Unmatched lines are removed, or preserved with -a.
The field number can be specified with -j or with -1 FIELD -2 FIELD. Values must be sorted before joining, and the tool is not very powerful compared to other tools.
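For instance, joining two hypothetical comma-separated files on their first column (both must already be sorted on it) could look like:
Example: join -t ',' -1 1 -2 1 a.csv b.csv
-t sets the delimiter, while -1 and -2 select the join field of each file.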
2.1.12 awk
Syntax: gawk [POSIX or GNU style options] -f program-file [--] file
Gawk is the GNU project’s implementation of the AWK programming language, according to the
POSIX standards. The command line takes as input the AWK program text.
Gawk also has an integrated debugger and profiling statistics for the execution of programs.
AWK’s general structure is:
awk ’BEGIN {init-code} {per-line-code} END {done-code}’ file
Example: cat file | awk ’BEGIN {x=0; y=0} {x=x+$2; y=y+1} END {print x/y}’
Computes the average of column 2 (columns are referenced with the dollar sign).
3 Performance spectrum
Analyzing performance is relevant to understand whether an operation is taking too long, comparing it with respect to time and input size. It can improve both speed and energy consumption.
The performance spectrum involves many variables: theoretical limits, programming tools (interpreted vs. compiled languages) and methods, or hardware (clustering, scaling). Real input is complex, yet just analyzing simple text queries using benchmarks can be highly informative.
For instance, a program written in AWK or Python will be slower than C++ since those two are
interpreted languages and require extra operations.
Example bandwidth for different input processing:
• 1 GB ethernet: 100 MB/s;
• Rotating disk (hard disk): 200 MB/s;
• SATA SSD: 500 MB/s;
• PCI SSD (uses multiple lanes to connect to the motherboard): 2 GB/s;
• DRAM (dynamic RAM, has the fastest path to the CPU thanks to semiconductors): 20 GB/s.
Reasoning about text-manipulation performance often completely ignores CPU costs: these are not important when reading from disk, but very relevant when reading data from DRAM, since at the highest bandwidths they have the biggest impact on the computation time.
The first time a program is executed, data is read into main memory; after that, it also gets copied into the caches, which are quicker to access, making the following executions faster (and CPU bound).
Naive optimization can have downsides, making the situation worse by sacrificing portability or creating messy code: before touching a program, tools such as perf can be used to analyze time and performance.
An efficient way to optimize code is memory mapping: files are mapped into the address space of the program and treated as arrays, accessed in the same way as dynamic memory (better than explicit buffering).
A memory-mapped file is in fact a segment of virtual memory which has been assigned a byte-for-byte correlation with some resource, increasing I/O performance through faster access. Memory maps are always aligned to the page size (typically 4 KiB).
Memory-mapped files can also be used to share memory between multiple processes without causing page faults or segmentation violations. Executions in C++ with a cold cache are slightly slower, but with a warm cache they are faster, since data is accessed directly without copying.
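As a rough illustration of the idea, a minimal C++ sketch of scanning a file through a memory map on a POSIX system (error handling reduced to a minimum; the file name is hypothetical):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <iostream>

int main() {
    // Open the (hypothetical) input file and query its size.
    int fd = open("data.txt", O_RDONLY);
    if (fd < 0) return 1;
    struct stat st;
    if (fstat(fd, &st) < 0) return 1;

    // Map the whole file into the address space; it can now be read like an array.
    void* mapped = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mapped == MAP_FAILED) return 1;
    const char* data = static_cast<const char*>(mapped);

    // Example scan: count the newlines (i.e. the lines) without any explicit read() calls.
    std::size_t lines = 0;
    for (off_t i = 0; i < st.st_size; ++i)
        if (data[i] == '\n') ++lines;
    std::cout << lines << " lines\n";

    munmap(mapped, st.st_size);
    close(fd);
    return 0;
}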
Another useful technique consists in using block operations: searching can for instance be performed through the encoding of special characters, since there is no way to split the data just by directly scanning memory.
Loop nest optimization applies a set of loop transformations (increasing the speed of loops and
improving cache performance) partitioning iteration spaces into smaller chunks, ensuring data stays
in cache until it is reused.
Files can be stored in binary format to speed up processing: text files imply storing values as character sequences of more than 4 bytes, and parsing with delimiters is slower than just copying the bit stream to memory.
SSE instructions are an extension of x86 which defines faster (vectorized) floating-point operations compared to the standard architecture, with wide application in digital signal and graphics processing. This method, although fast, is not portable.
Optimization of mathematical operations is done using several strategies:
• Choosing the fastest register operations (64-bit is faster than 32-bit, performed automatically by the compiler);
• constexpr instead of const;
• Using assembly language to convert floating point to integer;
• Dividing and multiplying by powers of two (shifting bits); division by a constant c can be rewritten as a/c = (a · (2^64/c)) >> 64, as sketched after this list;
• Speeding up division and modulo by a constant using unsigned integers.
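A sketch of this multiply-and-shift trick (restricted here to 32-bit dividends, for which a single 64-bit magic constant gives exact results; unsigned __int128 is a GCC/Clang extension and FastDiv is a hypothetical helper name):

#include <cstdint>

// Precomputed "magic" reciprocal for a constant divisor c > 1: magic ≈ 2^64 / c.
struct FastDiv {
    uint64_t magic;
    explicit FastDiv(uint32_t c) : magic(UINT64_MAX / c + 1) {}  // ceil(2^64 / c)

    // a / c computed as (a * (2^64 / c)) >> 64: one multiplication and one shift.
    uint32_t divide(uint32_t a) const {
        return static_cast<uint32_t>((static_cast<unsigned __int128>(a) * magic) >> 64);
    }
};

For example, FastDiv(7).divide(100) yields 14, i.e. 100 / 7, without ever issuing a division instruction at runtime.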
Different data structures also have different performance: arrays have worst-case linear time search,
while maps and sets are implemented using trees for better performance.
Multithreading (processing on multiple CPU cores) consistently increases performance, but network speed and remote CPUs can cause connection problems. Warm-cache workloads parallelize well, while cold-cache workloads get slower.
Different input representation is not always an option, yet when possible a columnar binary format
can reduce touched data up to a factor of 50. It requires an additional conversion step, but it pays
off for repeated queries.
Using all those performance improvements can make a simple program improve bandwidth up to 5
GB/s.
3.1 Clustering
Using multiple machines to perform operations can seem a feasible option, but it depends on which part of the program has to be sent over the network.
Shipping data with a bandwidth of 100 MB/s having a local bandwidth of up to 5 GB/s is inefficient
and would slow down computation.
On the other hand, if only the computation can be shipped, the performance is determined by the latency and the number of nodes:
time ≈ 20 ms + |W| / (5 GB/s · n)
where n is the number of nodes, 20 ms is the assumed per-request network latency and |W| is the size of the problem. |W| has to be larger than about 113 MB for 10 nodes for this system to be efficient.
In the end, since a single machine can handle many gigabytes efficiently and network speed can
be a bottleneck, scaling up is often preferred over scaling out: the latter, because of its significant
overhead, only pays off in the long run.
4 SQL
SQL is a declarative language (it states what a query should return without specifying how to execute it) used to interact with databases and retrieve data. Optimization happens not only at the query level, but also for storage and data types: for instance, long strings can be managed by storing a pointer together with their size.
Various SQL dialects have slight semantic differences, but general standards exist, dating back to the older versions of the language. Different databases might also differ in type management, available operations and speed.
Another common problem with databases is handling null values while evaluating boolean expressions: NULL AND FALSE evaluates to FALSE, so results may give an inaccurate representation of the information. On the other hand, NOT NULL columns are beneficial for optimization (nullable columns need reserved space to track this possibility).
Some PostgreSQL data types:
• Numeric, slow arbitrary-precision numbers with unbounded size;
• Float, single-precision floating point (4 bytes) or double precision (8 bytes);
• Strings of variable length (pre-allocated or dynamic);
• Other common types such as bytea, timestamp and interval.
1. UNION, which vertically merges two tables with the same attributes;
2. EXCEPT, which performs the set difference A - B;
3. INTERSECT, to only keep rows in common between two tables.
The default option implies removing duplicates; to keep them, specifying ALL is necessary.
4.5 Indexing
Postgres provides several index types, of which the most common is the B-tree, a self-balancing data
structure allowing logarithmic-time operations (the most important is searching).
It is useful because nodes can have more than two children: the node size can in fact be chosen as a multiple of the file-system block size. The structure is optimized to require few accesses to disk, avoiding loading unnecessary data within reads.
Further improvements can be made creating auxiliary indexes containing the first record in each
disk block, narrowing the search in the main database.
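A minimal example (standard Postgres syntax; the table is a TPC-H-style orders relation like the one used in the examples later in these notes):

-- B-tree is the default index type in Postgres; USING btree makes it explicit.
create index orders_orderdate_idx
    on orders using btree (o_orderdate);

Such an index supports both equality lookups and range scans on o_orderdate in logarithmic time.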
4.6 Counting
Counting, unless an index is present, is not optimizable: to obtain an exact result, the DBMS must perform a linear scan of the table, also to account for possible transaction conflicts.
Counting distinct values or grouping before joining, on the other hand, is faster since it involves an
approximation.
The total number of rows can be approximated as well thanks to sketches: probabilistic data structures that maintain frequency tables of events in a stream of data.
They use hash functions to map events to frequencies, and at any time the sketch can be queried for the frequency of a particular element.
To reduce variance and obtain a more accurate estimate, many independent copies can be maintained in parallel and their outputs combined.
4.7.1 HyperLogLog
HyperLogLog is one of the most widely used algorithms to estimate the number of distinct elements (the cardinality of a set) without materializing the set.
It is a probabilistic algorithm, able to estimate cardinalities of the order of 10^9 with a typical standard error around 2%, using very little memory.
The process derives from the Flajolet-Martin algorithm: given a hash function which distributes its output uniformly, the position of the least-significant set bit is taken from every binary representation.
Since the hashes are uniformly distributed, the probability of an output ending with a one followed by k zeros is 2^(−(k+1)). The cardinality is then estimated using the maximum number of such trailing zeros observed: with n zeros, the number of distinct elements is roughly 2^n.
HyperLogLog also supports adding new elements, counting, and merging two sketches to obtain the estimate for the union of two sets.
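A minimal C++ sketch of the underlying idea (a single Flajolet-Martin-style estimator based on the maximum number of trailing zeros; real HyperLogLog keeps many such registers, indexed by a prefix of the hash, and combines them):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <functional>
#include <string>

struct TrailingZeroEstimator {
    int max_zeros = 0;

    void add(const std::string& value) {
        // std::hash is used only for illustration; a well-mixing hash function is assumed.
        uint64_t h = std::hash<std::string>{}(value);
        if (h == 0) return;                 // an all-zero hash carries no information
        int zeros = __builtin_ctzll(h);     // trailing zeros = least-significant set bit (GCC/Clang builtin)
        max_zeros = std::max(max_zeros, zeros);
    }

    // With n trailing zeros observed, the number of distinct elements is roughly 2^n.
    double estimate() const { return std::pow(2.0, max_zeros); }
};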
4.8 Sampling
Sampling is usually performed through the hypergeometric distribution, a formula composed of three binomial coefficients. Computing the factorials involved, even when approximated with exponentiation, can be a heavy operation.
Factorials can be estimated using Stirling's approximation, log(n!) ≈ n log n − n + (1/2) log(2πn), which is cheaper and works for sufficiently big n, but is still not efficient for very large numbers.
5 Query decorrelation
Often queries are easier to formulate using subqueries, yet this can take a toll on performance: since
SQL is a declarative language, the best evaluation strategy is up to the DBMS.
A subquery is correlated if it refers to tuples from the outer query. A correlated subquery performs poorly, since the inner query gets computed for every tuple of the outer one, causing a quadratic running time.
Correlation can be recognized by observing fields in the subquery that are not bound to its own attributes but to those of the outer query. In SQL, they usually appear in WHERE or EXISTS clauses.
Example (from the TPC-H dataset):
select avg(l_extendedprice)
from lineitem l1
where l_extendedprice =
(select min(l_extendedprice)
from lineitem l2
where l1.l_orderkey = l2.l_orderkey);
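A decorrelated formulation (one standard rewrite, sketched here; it computes the per-order minimum once and joins it back instead of re-evaluating the subquery per tuple):

select avg(l1.l_extendedprice)
from lineitem l1,
     (select l_orderkey, min(l_extendedprice) as min_price
      from lineitem
      group by l_orderkey) m
where l1.l_orderkey = m.l_orderkey
  and l1.l_extendedprice = m.min_price;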
The new query is much more efficient, yet not as intuitive. Compilers sometimes manage to auto-
matically decorrelate, but correlations can be complex (inequalities, disjunctions).
Instead of running a function every time, the subset is pre-computed, evaluating all the possible choices so that later only a lookup is needed.
A join dependency allows to recreate a table by joining other tables which have subsets of the
attributes (relationships are independent of each other). In other words, a table subject to join
dependency can be decomposed in other tables and created again joining them.
Dependent joins rely on nested loop evaluations, which are very expensive: the goal of unnesting is
to eliminate them all.
The easiest practice consists in pulling predicates up and creating temporary tables inside FROM
instead of WHERE, evaluating the subquery for all possible bindings simultaneously.
The subquery in some cases is a set without duplicates which allows equivalence between dependent
join and bindings of free variables, therefore dependency can be removed.
Sometimes attributes of the set can be inferred, computing a superset of the values and eliminating
them with the final join. This avoids computing the subquery, but causes a potential loss of pruning
the evaluation tree.
The approach always works at the level of relational algebra, not necessarily at the SQL level: it can add memory overhead which cannot always be removed, even though the computational time is ideally linear.
To decorrelate subqueries, a general method can be used which is effective even with more complicated queries containing inequalities and several conditions.
The procedure can be executed several times, always starting from the highest level of nesting:
1. Moving the correlated subquery to FROM;
2. Since there might be multiple keys with the same value, adding DISTINCT helps to reduce cardinality;
3. Joining is performed in the outer query, and if there is more than one value, an OR condition
is applied so that either can be possible;
4. Values are grouped to maintain the same domain, according to the elements in DISTINCT
and JOIN;
5. Other conditions taken from outside the subquery are applied after pre-aggregation.
The inner table should always have a unique key and some grouping, which can then be selected in the outer query.
Example:
select o_orderpriority ,count (*) as order_count
from orders
where o_orderdate >= date ’1993-07-01’
and o_orderdate < date ’1993-10-01’
and exists (
select *
from lineitem
where l_orderkey = o_orderkey
and l_commitdate < l_receiptdate)
group by o_orderpriority
order by o_orderpriority;
Decorrelated query:
select o_orderpriority, count(*) as order_count
from orders, (
select o2.o_orderkey
from lineitem, (
select distinct o_orderkey
from orders) o2
where l_orderkey = o2.o_orderkey
and l_commitdate < l_receiptdate
group by o2.o_orderkey) preagg
where o_orderdate >= date ’1993-07-01’
and o_orderdate < date ’1993-10-01’
and orders.o_orderkey = preagg.o_orderkey
group by o_orderpriority
order by o_orderpriority;
6 Recursive CTEs
Recursive CTEs iteratively traverse hierarchical data of arbitrary length, and similarly to algorithms
contain a base case (stop condition) and its recursive step. The number of times to iterate does not
have to be constant.
with recursive r (i) as (
select 1
-- non-recursive term
union all
-- recursive term
select i + 1
from r
where i < 5)
select * from r;
Recursive expressions can be used to manipulate data with a hierarchical structure. The non-recursive part defines the base case, with possible deterministic checks, while the recursive term traverses the hierarchy, applying conditions at each step.
Every recursive query can be defined by a simple evaluation algorithm:
workingTable = evaluateNonRecursive()
output workingTable
while workingTable is not empty:
    workingTable = evaluateRecursive(workingTable)
    output workingTable
It is important to notice that the working table gets fully replaced at every iteration of the second
part.
It is also possible to increment counters within RCTEs and to build trees. To avoid infinite loops in cyclic structures, using UNION instead of UNION ALL removes duplicates, checking each value before writing it again.
To avoid loops within graphs, a NOT EXISTS condition can be added to check whether a node has
already been discovered.
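As a sketch (assuming a hypothetical table edges(src, dst) and start node 1): since Postgres only exposes the rows of the previous iteration to the recursive term, one common way to implement the "already discovered" check is to carry the visited path along and test membership, which plays the same role as the NOT EXISTS check described above:

with recursive reachable (node, path) as (
    select 1, array[1]                         -- start node
    union all
    select e.dst, r.path || e.dst
    from reachable r
    join edges e on e.src = r.node
    where e.dst <> all(r.path)                 -- skip nodes already on the current path
)
select distinct node from reachable;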
Best practices to deal with graph-like data involve knowing its cardinality, to be aware of the
maximum distance, yet most problems are Turing complete.
Postgres only allows one mention of the recursive relation in the recursive subquery.
7 Window functions
Window functions are versatile tools to implement time series analysis, percentiles and cumulative
sums.
A window function can be seen as a special case of aggregate, evaluated after every clause except ORDER BY; unlike aggregation, however, it does not collapse the result: only additional columns are computed.
The term window describes the set of rows on which the window operates, returning values from it.
Example (from Postgres documentation):
select product_name,
price,
group_name,
avg(price) over (
partition by group_name
)
from products
inner join product_groups
using (group_id);
The previous query returns the product name, the price, product group name, along with the average
prices of each product group.
AVG works as window function, operating on the subset of rows specified by OVER.
PARTITION BY distributes the rows of the result set into groups, each one of them subject to AVG,
returning the average price.
Calculations on window functions are always performed on the result set after the other clauses but
ORDER BY.
General window functions syntax:
To summarize:
• The window function is an aggregate function, such as SUM or AVG;
• The (optional) partitioning clause divides the rows into multiple groups;
• The ordering clause specifies the order of internal rows;
• The framing clause defines a subset of rows in the current partition to which the window
function is applied.
Multiple window functions can also be applied in the same query. Some other examples which ignore
framing are:
• Ranking:
– rank() and dense_rank(), returning the rank of the current row with or without gaps;
– row_number();
– ntile(n), distributing data over n buckets;
• Distribution:
– percent_rank() and cume_dist(), returning the relative rank of the current row or peer group (rows with equal partitioning);
• Navigation:
– lag(expr, offset, default) and lead(expr, offset, default), evaluating the expression on the preceding and following row respectively.
Framing can follow only a few, not too complex, specifications (see the example after this list):
• current row;
• unbounded preceding or unbounded following, the first and last row in the partition;
• range between unbounded preceding and current row, the default frame when an order is specified;
• range between unbounded preceding and unbounded following, the default frame when no order is specified;
• aggregates(), computing aggregates over all tuples in the current frame;
• first_value(), last_value(), nth_value(), evaluating an expression on specific rows of the frame.
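For instance, a running total per group (a sketch reusing the products table from the earlier example; the frame clause is written out explicitly — with ROWS each row is counted individually, whereas the default RANGE frame would also include peer rows with an equal price):

select product_name,
       price,
       group_name,
       sum(price) over (
           partition by group_name
           order by price
           rows between unbounded preceding and current row
       ) as running_total
from products
inner join product_groups
using (group_id);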
Querying a static segment tree allows knowing which of the stored segments contains a given element,
using logarithmic time operations.
The set of existing values gets divided into possible interval endpoints, sorted from left to right. All
leaves in the tree correspond to the elementary intervals induced by endpoints, while internal nodes
are the union of two intervals.
Any interval is stored in the canonical set for at most two nodes at the same depth, therefore the
required storage is O(n log n) with n intervals.
Space for the leaves is linear, and aggregate functions can be calculated on intervals very quickly.
Segment trees are useful for window functions, since intervals can be seen as windows, either static or
sliding. For each call of a window function, a new segment tree gets built: the time for construction
is just linear, since ordering is already performed by the ORDER BY clause.
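A minimal C++ sketch of the idea for sum aggregates (a static, array-based segment tree over already-ordered values; real window-function implementations are more general, this only illustrates the linear build and the logarithmic range query):

#include <cstddef>
#include <vector>

// Static segment tree for range sums over a fixed, already-sorted sequence of values.
class SegmentTree {
    std::size_t n;
    std::vector<double> tree;   // tree[1] is the root; children of node i are 2i and 2i+1
public:
    explicit SegmentTree(const std::vector<double>& values)
        : n(values.size()), tree(2 * values.size(), 0.0) {
        if (n == 0) return;                                               // assumes non-empty input otherwise
        for (std::size_t i = 0; i < n; ++i) tree[n + i] = values[i];      // leaves
        for (std::size_t i = n - 1; i > 0; --i) tree[i] = tree[2 * i] + tree[2 * i + 1];
    }

    // Sum over the half-open index range [l, r), computed in O(log n).
    double query(std::size_t l, std::size_t r) const {
        double sum = 0.0;
        for (l += n, r += n; l < r; l >>= 1, r >>= 1) {
            if (l & 1) sum += tree[l++];
            if (r & 1) sum += tree[--r];
        }
        return sum;
    }
};

Building the tree is linear, and each window aggregate then needs only logarithmic work per frame.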
8 Database clusters
Storing a large quantity of data on a single machine can have limitations: performance is restricted
to hardware, and in case of failure transactions get lost.
Using multiple machines is therefore a valid solution to handle databases: clusters are transparent
and present themselves as a single system, still maintaining consistency between nodes.
Since node crashes in a distributed system are independent, the commit protocol needs to be extended to ensure atomicity of distributed transactions: a standard method is two-phase commit (2PC).
It is characterized by the presence of a coordinator node which allows the n agents of a system A1, A2, ..., An to either all persist the changes of a transaction T or all discard them.
This is a commitment protocol that coordinates all processes participating in a distributed atomic transaction, achieving its purpose by logging the states to aid recovery procedures.
The two phases of the protocol in a normal execution are:
• Commit-request (voting) phase, in which the coordinator attempts to prepare all the processes participating in the transaction, asking them to take the necessary steps and vote on whether the operation can be committed;
• Commit phase, in which, based on the votes, the outcome is communicated to the participants and the needed actions to commit or abort are taken.
Message flow in 2PC with 4 agents:
Issues of this method include the crash of nodes (both coordinator and agents) and potentially lost messages.
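As a compact illustration of the coordinator's decision rule only (a sketch; real implementations add logging, timeouts and recovery, which is exactly where the issues above arise):

#include <vector>

enum class Vote { Commit, Abort };
enum class Decision { Commit, Abort };

// Phase-two decision of a 2PC coordinator: commit only if every agent voted to commit.
Decision decide(const std::vector<Vote>& votes) {
    for (Vote v : votes)
        if (v == Vote::Abort)
            return Decision::Abort;   // a single abort vote (or a missing vote) aborts T
    return Decision::Commit;
}

In the commit-request phase the coordinator collects one vote per agent; the decision is then broadcast, and each agent persists or discards the changes of T accordingly.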
8.4 Replication
Replication is a technique consisting in keeping a copy of the dataset in each machine. Redundant
resources improve query throughput, can accommodate node failures and solve locality issues.
Access to a replicated entity is typically uniform with access to a single non-replicated entity, and
copying should be transparent to external users.
Systems can also use a master-slave scheme, predominant in high availability clusters, where a single
node is designated to process all requests.
8.5 Joins
Joins in a distributed system are no different from local ones in a replicated environment, yet they can become complicated when data is sharded across multiple machines.
8.5.4 Considerations
All the previous methods still need to consider the size of the tables and the available bandwidth: some operations might be faster when reading data from remote machines rather than from local disks, since cache size is limited.
In this case, data can be sharded across multiple nodes, each of them storing a partition of the table and computing local results which are then aggregated.
Postgres supports read replicas, kept in sync for consistent reads: each application can send data
either to the master or the slaves, and the master itself can also trigger replication.
However, it is not natively a distributed database: its alternative Greenplum (still based on Postgres)
allows data distribution and partitioning, both of data and schema.
Distribution and partitioning schemes can be specified when creating a table, adding DISTRIBUTED BY to define how data is spread, or PARTITION BY for partitioning within a segment.
Distribution may be round-robin (as even as possible, non-deterministic) or hashed (rows dispersed according to a hash function).
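A sketch of how such a definition might look (hypothetical table and columns; the clauses follow the Greenplum syntax described above):

create table sales (
    sale_id   bigint,
    region    text,
    sale_date date,
    amount    numeric
)
distributed by (sale_id)                      -- rows hashed on sale_id across segments
partition by range (sale_date) (
    start (date '2020-01-01') inclusive
    end   (date '2021-01-01') exclusive
    every (interval '1 month')
);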
MemSQL is another distributed database system based on SQL, which also allows replication, distributed and co-located joins.
A MemSQL cluster consists of aggregator and leaf nodes: aggregators handle metadata, monitoring
and results of queries, while leaves act as storage layer and execute queries.
By default, MemSQL shards tables by their primary key, while manual sharding needs to be specified.
Distributed systems overall have more resources, but their synchronization might be expensive.
10 Hashing
To make Maths people feel better about this exam, with the courtesy of Harald Räcke.
The purpose of hashing is having a data structure that tries to directly compute the memory location
from the given key, obtaining constant search time.
Hash functions and their associated hash tables are used in data storage and retrieval applications
to access and store information thanks to their efficiency.
Given a universe U of keys and a set S ⊆ U of used keys to be distributed over an array T[0, ..., n−1], an optimal hash function h : U → [0, ..., n−1] should be fast to evaluate, have small storage requirements and distribute elements well over the whole table.
Ideally, the hash function would map all keys to different memory locations (direct addressing), yet
this is unrealistic since the size of the table would be much larger than available memory.
If the set S of actual keys were known in advance, it would be possible to design a simple function that maps all of them to different memory locations (perfect hashing); but this is often not the case.
The best expectation is then a function which distributes keys evenly across the table, yet this leads to the problem of collisions: usually the universe U is much larger than the table size n, hence there may be two elements k1, k2 from the same set S such that h(k1) = h(k2), which therefore get mapped to the same memory location.
Collisions clearly cannot be avoided once the size of the set S of actual keys gets close to n, but even before that the probability of having a collision when hashing m elements under uniform hashing is at least
1 − e^(−m(m−1)/(2n)) ≈ 1 − e^(−m²/(2n)).
Practically, hash tables should be larger than the number of keys to store, for instance the next power of two. There are also scaling coefficients such as 1.5 or 2, to avoid resizing too often and wasting space. Precomputing prime numbers is a useful option, but requires a range about 10% larger than the number of keys.
There are methods to deal with collisions: open addressing, hashing with chaining, or just not
resolving them at all.
Hashing with chaining arranges elements that map to the same position in a linear list, inserting at the front of the list. The average time required for an unsuccessful search is A− = 1 + m/n, while for a successful search it is A+ ≤ 1 + (m/n)/2.
Open addressing, in particular, stores all objects in the table itself, defining a probe function h(i, j) that determines the position through a permutation of all possible positions. If an insertion fails, a new position is searched (either through a different function or by looking for the first free slot).
A good choice for h(i, j), in order to have a uniform distribution over the possible values, is a modulo with a prime number.
Introducing prime numbers is relevant because the size of the hash table should not have factors in common with the linear component of the function, otherwise keys would be mapped to the same positions. Increments also have to be odd numbers, to reduce the possibility of a zero remainder.
Prime numbers can be generated through algorithms, of which the most basic is the prime sieve: a
list of integers up to a desired number is created, and composite numbers are progressively removed
until only primes are left.
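A minimal C++ sketch of this basic prime sieve (sieve of Eratosthenes):

#include <cstddef>
#include <vector>

// Returns all primes up to and including limit.
std::vector<std::size_t> sieve(std::size_t limit) {
    std::vector<bool> composite(limit + 1, false);
    std::vector<std::size_t> primes;
    for (std::size_t i = 2; i <= limit; ++i) {
        if (composite[i]) continue;            // i was already crossed out by a smaller prime
        primes.push_back(i);
        for (std::size_t j = i * i; j <= limit; j += i)
            composite[j] = true;               // cross out all multiples of i
    }
    return primes;
}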
For large primes used in cryptography, a randomly chosen range of odd numbers of the desired size is sieved against a number of relatively small primes, and the remaining candidates are subjected to primality testing (Miller-Rabin).
To keep the distribution uniform even when new elements are added, rehashing might be necessary, or the prime can be increased.
The golden ratio has a close relationship with Fibonacci numbers: the ratio between consecutive Fibonacci numbers, in fact, converges to it.
The nth Fibonacci number is obtained as Fn = (φ^n − ψ^n) / √5, where φ = (1 + √5)/2 and ψ = (1 − √5)/2.
In the context of Fibonacci hashing, the reciprocal of φ is used: hashing is performed by multiplying the key k by an integer close to φ^(−1) · 2^w (chosen relatively prime to 2^w, i.e. odd, for a word size of w bits) and keeping the most significant bits.
Consecutive keys are spread uniformly over the table, therefore it is an efficient method.
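A sketch of Fibonacci hashing for 64-bit keys (the multiplier is the odd integer closest to 2^64/φ, and m is the number of index bits, i.e. the table size is 2^m):

#include <cstdint>

// Fibonacci hashing: multiply by ~2^64/phi and keep the top m bits as the table index.
// Assumes 1 <= m <= 63.
inline uint64_t fibonacci_hash(uint64_t key, unsigned m) {
    const uint64_t k = 11400714819323198485ull;   // closest odd integer to 2^64 / phi
    return (key * k) >> (64 - m);                 // high m bits give the slot
}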
10.2 Neojoin
Hashing is relevant in a distributed systems architecture because it can help determining how to
locate relations on nodes.
While performing a distributed join, tuples may join with tuples on remote nodes: relations can be
repartitioned and redistributed for local joins, so that tuples will only join with the corresponding
partition.
The partitioning scheme defines how to assign partitions to nodes, using for instance a hash function.
There are several options to implement a scheme:
1. Minimizing network traffic, assigning a partition to the node which owns its largest part. This
way, only the small fragments of a partition are sent over the network, but a schedule with
minimal network traffic may have high duration;
2. Minimizing response time, i.e. the time from request to result. This is dominated by the network duration, which depends on the maximum straggler;
3. Minimizing the maximum straggler, formulated as an NP-hard linear programming problem where nodes receive the most similar possible amount of work.
Satisfying those constraints may lead to unused resources, for instance nodes not doing any work.
Nodes and keys are given a hashed identifier of m bits, and using the Chord lookup protocol nodes are placed on a circle that has at most 2^m positions.
Each node has a successor and a predecessor, going in clockwise direction, but normally there are holes in the sequence. In general, the successor of a key k is the first node whose identifier is equal to or greater than k.
Lookup is performed passing the query through the successors of a node, if the key cannot be found
locally. This leads to linear worst-case performance, but can be avoided implementing a finger table
in each node.
A finger table contains up to m entries, and entry i of node n contains the successor of key (n + 2^(i−1)) mod 2^m. At every lookup, the query is passed to the closest successor or predecessor of k in the finger table, narrowing worst-case performance to logarithmic.
Each peer is responsible for all keys greater than its predecessor's identifier up to its own. Since the finger distances grow as powers of two (2^k), each hop roughly halves the remaining distance, which keeps the number of hops logarithmic.
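A small sketch of how a node's finger-table targets are computed (identifiers assumed to be m-bit integers with m < 64; the protocol then stores, for each target key, the first node whose identifier is equal to or follows it on the ring):

#include <cstdint>
#include <vector>

// Finger-table targets of node n on a Chord ring with 2^m identifiers:
// entry i (1-based) points towards key (n + 2^(i-1)) mod 2^m.
std::vector<uint64_t> finger_targets(uint64_t n, unsigned m) {
    const uint64_t ring = 1ull << m;
    std::vector<uint64_t> targets;
    for (unsigned i = 1; i <= m; ++i)
        targets.push_back((n + (1ull << (i - 1))) % ring);   // wrap around the ring
    return targets;
}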
11.1 MapReduce
MapReduce is a programming model with associated implementation for processing and generating
big data sets with a parallel distributed algorithm on a cluster.
It is composed of:
1. A Map procedure, which performs filtering and sorting, applied by each worker to the local
data and collecting output to a temporary storage;
2. A Shuffle phase, where workers redistribute data based on the same output keys, so that all
data with the same key is located on the same node;
3. A Reduce method, which executes a summary operation (such as counting) on local data in
parallel.
The model is inspired by functional programming, but its key contributions are the scalability and
fault tolerance achieved by optimizing the execution engine, especially on multi-threaded implemen-
tations.
Maps can be performed in parallel, and a set of reducers can perform the reduction phase. Parallelism
offers possibility of recovering from partial failure of servers or storage: the work can be rescheduled
without recomputing all of it.
Logically, Map takes one pair of data with a type in one data domain, and returns a list of pairs in
a different domain:
Map(k1, v1) → list(k2, v2)
The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain:
Reduce(k2, list(v2)) → list(v3)
Each Reduce call typically produces either one key-value pair or an empty return.
Example (from Wikipedia):
function map(String name, String document):
  // name: document name
  // document: document contents
  for each word w in document:
    emit(w, 1)
This MapReduce performs a word count on a document. Each document is split into words, and
every word is counted by Map; then all pairs with the same key are put together and sent to Reduce,
which sums the relative counts.
The MapReduce implementation can be applied to many algorithms, such as the join of two relations or PageRank.
11.1.1 PageRank
PageRank is an algorithm used by Google Search to rank web pages in their search engine results, measuring their relevance.
It works by assigning a numerical weight to each element of a hyperlinked set of documents, with the purpose of obtaining its relative importance within the set.
It outputs a probability distribution used to represent the likelihood that a person randomly clicking
on links will arrive at any particular page. Links from a page to itself are ignored.
The PageRank transferred from a given page to the targets of its outbound links upon the next
iteration is divided equally among all outbound links.
The value for a page v depends on the value of each page u in the set of pages linking to v, divided by the number of outbound links of u.
Pseudo-code for PageRank:
1. Retrieve the total number of nodes n;
2. Write the initial value in each node, computed as 1/n;
3. Initialize the next value of each node as (1 − d)/n, where d is the damping (correction) factor;
4. For each iteration of the algorithm:
(a) compute the outgoing PageRank fraction of each node;
(b) add the fraction to all its neighbors;
5. Swap next and current values;
6. Reset next, reassigning it the value (1 − d)/n.
This algorithm looks at each node in the graph, computes the fraction of rank it has to pass on to
neighbors, and writes it.
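A compact single-machine C++ sketch of the iteration just described (adjacency given as outgoing-edge lists; d is the damping factor; dangling nodes are simply skipped for brevity):

#include <algorithm>
#include <cstddef>
#include <vector>

// PageRank over an adjacency list `out` (out[u] = targets of node u).
std::vector<double> pagerank(const std::vector<std::vector<std::size_t>>& out,
                             int iterations, double d = 0.85) {
    const std::size_t n = out.size();
    std::vector<double> rank(n, 1.0 / n), next(n);
    for (int it = 0; it < iterations; ++it) {
        std::fill(next.begin(), next.end(), (1.0 - d) / n);   // reset the "next" values
        for (std::size_t u = 0; u < n; ++u) {
            if (out[u].empty()) continue;                     // dangling node: its fraction is dropped
            double share = d * rank[u] / out[u].size();       // outgoing fraction of u
            for (std::size_t v : out[u]) next[v] += share;    // add it to every neighbor
        }
        rank.swap(next);                                      // swap next and current
    }
    return rank;
}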
For better performance, the value can be computed in reverse, collecting all fractions from reverse
neighbors.
The sequential reads are well suited to modern computers, but the random accesses to the next nodes exhibit performance problems. Caches help alleviate this, since they can store the most-visited nodes.
If the memory requirements exceed the available main memory, the system will over-allocate memory and run into many page faults. Sorting nodes by degree places the often-accessed nodes on the same pages.
The distributed version sends computed values to other nodes, but can cause overhead and network
congestion. MapReduce assigns the Shuffle phase the duty of message passing.
Computing PageRank with MapReduce can also be done by using the framework to repeatedly multiply the rank vector by the matrix representing how the distribution changes after every single step.
11.2 Hadoop
Hadoop is an open-source implementation of the MapReduce framework, written in Java (as opposed to Google's proprietary implementation in C++). It is now an Apache project used on very large clusters to handle big data sets.
11.2.1 MapReduce
Hadoop MapReduce is an application on the cluster that negotiates resources with YARN: a Job-
Tracker globally manages jobs, and every node has its own TaskTracker.
1. An user submits job configuration to JobTracker;
2. JobTracker submits mapper tasks to TaskTrackers, scheduling computation on all available
cluster nodes;
3. Each task works on the data local to its node;
4. The Shuffler moves the output produced by Map among the nodes in the cluster;
5. Once shuffling is complete, JobTracker schedules Reduce to all nodes.
Since the datasets are much larger than the main memory and processing happens on very big
clusters, intermediate results must be written to disk and shuffled results should be ordered.
MapReduce is however a framework not suitable for all uses: not everything can be expressed through
the functions it offers, plus it is not a good fit for workloads with a lot of computation and smaller
datasets.
11.2.2 HDFS
Hadoop relies on a distributed file system (HDFS), the primary data storage system used by its applications. It supports the rapid transfer of data between compute nodes, breaking information down into separate blocks and distributing them.
The file system ensures durability and data distribution, storing data read by Map, and pre-sorting
and saving its output.
Shuffling moves data between nodes and merges pre-sorted data, and results of Reduce are again
written to HDFS.
HDFS is highly fault tolerant: the system replicates data multiple times, placing at least one copy
on a different server rack than the others.
Files are split into blocks of 64MB, and by default each block has 3 replicas, so that missing data
can be reconstructed.
It follows a master-slave architecture: Name node manages file system and regulates access, while
data nodes manage attached storage.
I/O is boosted by striping blocks of data over multiple disks, writing in parallel. HDFS exposes
append-only files to users.
The JobTracker is responsible for coping with failing nodes: when it notices that a TaskTracker is not responsive, its tasks get reassigned to another node.
If the job is in the Reduce phase, the corresponding function is restarted on other TaskTrackers.
11.2.3 YARN
YARN stands for Yet Another Resource Negotiator, and is a redesigned resource manager and
large-scale distributed operating system for Hadoop.
YARN splits up the functionalities of resource management and job scheduling into separate daemons, to obtain a global ResourceManager and a per-application master.
The ResourceManager and the NodeManager form the data computation framework. A NodeM-
anager is the agent in every machine which is responsible for containers, monitoring their resource
usage and reporting results to the ResourceManager.
The Scheduler is responsible for allocating resources to the various running applications subject to
constraints such as capacity, performing no tracking of application status.
11.3 Optimizations
To optimize MapReduce it is possible to add a combiner, an optional class that operates by accepting input from Map and passing it to Reduce after summarizing records with the same key.
It operates on each Map output key, producing the same key-value output type as the Reducer, and it helps to segregate data into multiple groups.
Folding is a function to save disk space: it analyzes a recursive data structure, and returns recombined
results by recursively processing its constituent parts.
Programs working well on large clusters require a deep understanding of warehouse-sized computers:
different settings need different techniques. In general, HDFS should be portable.
Overall, Hadoop considers hardware failure to be the norm, with applications needing streaming access to large datasets. Its model therefore aims to minimize writes and keeps computation local to the data.
11.5 Spark
Spark is an alternative to Hadoop which provides a more expressive, fault-tolerant runtime for distributed batch data analytics. It can handle more complex analyses, interactive queries and the processing of real-time streams.
The purpose of the new programming model is to extend the existing naive algorithm implementations:
• Data can be outputted by sorted key;
• Locality can be increased using different kinds of partitioning;
• Window queries can be answered sorting by group.
The MapReduce model is extended too, with complex data pipelines, while the recovery mechanisms are lightweight and require little additional computation.
Spark offers additional APIs for more functionality, such as Spark SQL, GraphX and MLlib.
11.5.1 RDDs
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant
collection of elements that can be operated on in parallel.
RDD is considered the evolution of the general Spark model, offering datasets that can be efficiently
cached and manipulated.
RDDs support two types of operations: transformations, which create a new dataset from an existing
one, and actions, which return a value to the driver program after running a computation on the
dataset.
All transformations in Spark are lazy, in that they do not compute their results right away. Instead,
they just remember the transformations applied to some base dataset (e.g. a file).
They are only computed when an action requires a result to be returned to the driver program, and
the dataset gets only loaded when operations are performed on it. This design enables Spark to run
more efficiently.
Pipelines allow chaining transformations; intermediate steps do not need to be materialized, since they can simply be recomputed on node failure.
11.5.2 DataFrames
DataFrames are dynamically, strongly typed distributed collections of data organized into columns, conceptually equivalent to a table in a relational database.
They can be interacted with using SQL functions, allowing query optimization and avoiding inter-
pretation overhead, although with a limited set of functions.
11.5.3 Datasets
A Dataset is a combination of DataFrame and RDD, optionally weakly typed, to retain code compilation. It is still mapped to an RDD.
Datasets can be created, analyzed (limited, filtered) and be subject of exploration, management or
aggregation just like tables.
11.5.4 Optimization
RDDs track lineage information to rebuild lost data, logging everything in a graph to recompute
missing or damaged partitions. Spark offers support for DAGs.
Snapshots are automatically created after repartitioning transformations and managed using an LRU cache to speed up recovery.
The optimizer builds a logical plan and applies transformation rules, to then generate physical plans and select the cheapest one according to a cost model.
Spark’s runtime focuses on three main points:
• Memory Management and Binary Processing, leveraging semantics to eliminate the overhead
of deserialization and garbage collection;
• Cache-aware computation, using columnar storage and compression to exploit memory hier-
archy;
• Code generation to optimize usage of modern compilers and CPU.
Reading data, managing ETL pipelines, training models and querying are all performed within the same framework.
12 NoSQL databases
NoSQL databases are an alternative to traditional relational databases in which data is placed in
tables and data schema is carefully designed before the database is built.
Relational databases were seen as slow and complex, not suited to web usage, therefore NoSQL systems aim to provide a simpler and faster solution, horizontally scaled to clusters of machines.
NoSQL databases are especially useful for working with large sets of real-time distributed data, providing high throughput and availability.
They do not provide ACID guarantees, relying instead on BASE principles: the trade-off is more limited query functionality; for instance, many NoSQL systems do not support joins.
Giving up on consistency implies also giving up on additional protocols such as 2PC, but often
compromises are taken: groups of entities that share affinities are co-located, providing transaction
support within entity groups.
To avoid data loss, write-ahead logging (WAL) saves changes to a log before applying them, allowing
updates of a database to be done in-place.
12.2 MongoDB
MongoDB is one of the most popular document-oriented databases, storing JSON documents indexed by a unique identifier.
It provides high availability with replica sets, writing to and reading from the primary by default. Horizontal scaling is achieved with sharding.
Aggregation can be performed in three ways: through a pipeline, MapReduce or single aggregation operations, possibly chained together.
• Tunable consistency, giving up on availability according to CAP theorem but always keeping
partition tolerance;
• Linearizable consistency, using a serial isolation level for lightweight transactions, performed
when uninterrupted sequential scans are not dependent on concurrency.
Cassandra uses the Paxos consensus protocol, where all operations are quorum-based and updates
will incur a performance hit. A consistency level of SERIAL has been added to support linearizable
consistency.
Full row-level isolation is in place (but no more), which means that writes to a row are isolated to
the client performing the write and are not visible to any other user until they are complete.
Data is partitioned, with partitions mapped to nodes in the cluster; scale out is achieved adding
more nodes and redistributing data.
Models can be efficiently stored through a partition key, controlling how data is spread, and a
clustering key, deciding how data is sorted on disk.
13 Key-value models
This kind of storage has a lot of benefits, including the lack of schema: syntax is defined with human
readable languages (XML, SVG) for structured and semi-structured data. The possible formats
vary from free text to relational models, the latter less common; data can also be compressed or
uncompressed.
For instance, possible choices for configuration files are:
1. JSON, although kind of hard to write long files;
2. YAML, really complicated;
3. TOML, with minimal syntax therefore easy.
13.1 XML
XML documents have a rooted hierarchical structure named Document Object Model (DOM), where elements are declared between tags with optional attributes. Such documents allow storing reasonably large amounts of information, queried with declarative languages (XPath, XQuery) which guarantee that results appear in the same order as they were saved.
Validation requires constraints which imply the choice between a simple grammar or a more powerful
one. Grammar is only defined for tags and attributes, not contents.
An XPath expression navigates the tree containing the elements subject to a query: it exploits information about preceding nodes, siblings, descendants, parents and ancestors in the DOM. It can be navigated in both directions, retrieving the set related to the matching node (for instance .. indicates the parent).
axis::node[predicate]/@attr
XQuery is the official query language for XML, using functional programming to evaluate sequences: ordered lists of items which can be either nodes or atomic values.
Its syntax can be embedded in HTML, allowing to declare variables and to show elements according to conditions.
Grammar definition and DTD are defined in the headers of a document, although parsing can still
be difficult due to malfunctioning web pages.
13.2 JSON
JSON is a format similar to XML, but closer to a relational database: its structure consists of objects and arrays recursively nested in arbitrarily complex ways. It can be used for a broad range of data types and evaluated like JavaScript (not a good practice, for security reasons). Keys are put between quotes to avoid referencing variables.
Data can be accessed through indexes and JavaScript-like syntax, and queried using jq:
.key[index] | {Name1: .key1.key2, Name2: .key3.key4} | select(.key == "something")
15 SPARQL
SPARQL is the RDF query language, working with pattern matching over triples and returning the
matching ones. It only contains basic CRUD functionalities with three types of data: URI, literal
and blank node. The latter represents a variable with local scope, not appearing in SELECT.
The subject and the predicate must always be URIs, while objects can also be literals. Selection can be over DISTINCT or REDUCED values, which allows converting an inner join into a left outer join, ignoring how many matches there are.
SELECT ?n
WHERE {
?p rdf:type dbo:Person .
?p dbo:hairColor "Black" .
?p dbo:eyeColor "Green" .
?p dbp:name ?n
}
SPARQL permits advanced SELECT queries such as FILTER, containment check or output as a
graph. Examples of semantic information that can be queried are found on DBPedia.
DESCRIBE allows to perform descriptive queries, retrieving all information regarding the matched
sources and returning a graph.
Similarly to domain calculus, there is no explicit equality condition (when joining, for instance), since the check is implicit in using the same variable twice, requiring the same identifier. Queries can be translated to relational algebra, binding every WHERE clause with a join just above it in the tree.
This property makes it easier to iteratively change data and schema, connecting them to other
sources, but on the other side it’s harder to optimize storage and queries.
RDF triples are closely correlated, so independence assumptions do not hold. Soft functional dependencies are indeed very common, with bound triple patterns making the others unselective, which is not captured by histograms.
Cardinality estimation is not well suited for RDF data: the independence assumption does not hold
(correlation is the norm), and leads to severe underestimation. Multi-dimensional histograms would
help, but are really expensive to compute due to exponential growth of dimensions.
RDF is also really unfriendly for sampling, since there is no schema: triples can be diverse and
unrepresentative, therefore a sample needs to be huge to be useful.
Tuple classification can be obtained using characteristic sets: unique values grouped by behavior and occurrence, assuming that triples are more homogeneous within a group. The number of characteristic sets is much lower than the number of triples.
It is also possible to take advantage of correlation in a graph format: outgoing edges represent
behavior, and nodes with similar properties can be extracted.
Selecting distinct subjects gives an exact computation of how many sets there are, allowing to
estimate a large number of joins in one step.
Without using distinct, there are two possibilities if values are bound to a constant: either manually looking at the data (slow) or making estimates relying on indexes to see how common each value is.
Occurrence annotations are no longer exact, but very accurate: the maximum amount of information
about correlation is exploited, and characteristic sets are used to cover the query (finding operations
concerning the same data).
Star-shaped subqueries are hard to optimize, because of the correlation and the selectivities. In this
case, counting distinct values and estimating cardinalities allow to rank the subsets.
The rarest subsets are joined first, leaving the biggest for the last step, to reduce the number of
computations.
The top-down greedy algorithm to optimize runs in linear time, but does not use independence
between predicates. As a remedy, lightweight statistics are constructed counting how often subgraphs
appear together.
Two characteristic sets connected by an edge are defined a characteristic pair, identified with a linear
scan. The problem is that the number of pairs grows quadratically with different sets, but if the
pair is rare the independence assumption holds.
Optimal plans are identified and removed, to then run the dynamic programming algorithm to know
selectivities between subqueries.
16 Graph databases
A graph G = (V, E) is a set of edges and vertices which can have different properties. Edges can
also have intermediate ramifications between nodes.
RDF is a special type of graph database; general ones can store more variants of data and have less
standards.
Examples of interesting properties regarding a graph are:
16.1 Cypher
Cypher is one of the most supported query languages for graphs. Its syntax allows nodes and edges
to have a type along with some attributes.
Its functions are similar to SQL, implementing aggregation and, in addition, pattern matching and graph properties.
Neo4j, implementing Cypher, is a graph database offering dynamic visualization and machine learn-
ing algorithms, but it kind of sucks (not scalable).
16.2 Pregel/Giraph
Pregel is Google’s iterative graph processing system for distributed data, while Giraph is its open
source alternative.
They work using Gremlin, a functional graph traversal language, which with one operation lets a
vertex modify its state and propagate information.
The reason why these tools have been developed is that MapReduce systems do not allow operations at the vertex level; in addition, MapReduce requires transferring state to all nodes.