TARDIS: Distributed Indexing Framework for Big Time Series Data

Liang Zhang, Noura Alghamdi, Mohamed Y. Eltabakh, Elke A. Rundensteiner
Worcester Polytechnic Institute, Worcester, MA 01609, USA
(lzhang6, nalghamdi, meltabakh, rundenst)@wpi.edu

Abstract—The massive amounts of time series data continuously generated and collected by applications warrant the need for large-scale distributed time series processing systems. Indexing plays a critical role in speeding up time series similarity queries, on which various analytics and applications rely. However, the state-of-the-art indexing techniques, which are iSAX-based structures, do not scale well due to the small adopted fan-out (binary), which leads to a very deep index tree and expensive search through many internal nodes. More seriously, the iSAX character-level cardinality adopted by these indices suffers from poor maintenance of the proximity relationships among the time series objects, which leads to severe accuracy degradation for approximate similarity queries. In this paper, we propose the TARDIS distributed indexing framework to overcome the aforementioned limitations. TARDIS introduces a novel iSAX index tree that is based on a new word-level variable cardinality. The proposed index ensures a compact structure, efficient search and comparison, and good preservation of the similarity relationships. TARDIS is suitable for indexing and querying billion-scale time series datasets. TARDIS is composed of one centralized global index and local distributed indices, one per data partition across the cluster. TARDIS uses both the global and local indices to efficiently support exact match and kNN approximate queries. The system is implemented using Apache Spark, and extensive experiments are conducted on benchmark and real-world datasets. Evaluation results demonstrate that for a dataset of over one billion time series (TB scale), the construction of a clustered index is about 83% faster than with the existing techniques. Moreover, the average response time of exact match queries is decreased by 50%, and the accuracy of kNN approximate queries is increased more than 10-fold (from 3% to 40%) compared to the existing techniques.

I. INTRODUCTION

Many emerging applications in science, manufacturing, and social domains generate time series data at an explosive speed. For example, the sensors on a Boeing 787 produce around half a terabyte of time series data per flight [1]. As a result, data mining techniques on such big time series data, e.g., similarity search, clustering, classification, motif discovery, outlier patterns, and segmentation, have drawn a lot of recent interest [2], [3]. In particular, similarity search operations are of paramount importance since they form the basis of virtually all of the more complex operations mentioned above [4]. Since full scans on large-scale time series data are prohibitively expensive, indexing techniques become a critical backbone to make such similarity queries practical. Unfortunately, as will be shown in this paper, the state-of-the-art indexing techniques over big time series data lack both the desired scalability and accuracy.

Since time series are inherently high-dimensional data, a common approach before indexing the data is to first apply a dimensionality reduction technique to extract key features, and then index these features instead. Many summarization and feature extraction techniques have been proposed, including Discrete Fourier Transforms (DFT) [5], Discrete Wavelet Transforms (DWT) [6], Piecewise Aggregate Approximation (PAA) [7], [8], and Symbolic Aggregate approXimation (SAX) [9]. Typically, these representations are then indexed by Spatial Access Methods (SAMs) like the R-tree or its variants. However, SAMs are known to degrade quickly due to the curse of dimensionality. Indexable Symbolic Aggregate approXimation (iSAX) [10] was proposed as an index-friendly feature extraction technique. It first divides a time series into equal-size segments, and then uses characters of variable cardinalities to represent the means of these segments, which results in representing a given time series as a sequence of characters. The iSAX Binary Tree [11] was later proposed as a binary index structure over the iSAX representation.

Unfortunately, all implementations of the above techniques are designed for centralized systems and assume the dataset is small enough to fit in one machine. For this reason, it takes about 400 hours to build an index for one TB of data [11]. More recently, some distributed time series management systems have been proposed to transfer, store, or analyze time series data (refer to a recent survey in [3]). These systems mainly leverage mathematical models and sampling to support approximate Select-Project-Transform (SPT) queries with or without error bounds. A system more relevant to our work, in that it supports similarity search, is the DPiSAX index [12]. However, its index tree (we refer to it as "iBT") suffers from too many internal nodes and large depths of the leaf nodes due to the inherent binary fan-out. Moreover, since iBTs are based on the direct iSAX character-level cardinalities, the comparisons are shown to be very expensive, and the accuracy of the returned results of kNN approximate queries tends to be poor (below 10% in many cases).

In this paper, we propose a novel iSAX-based distributed indexing framework for big time series data, called "TARDIS" (named after the time machine in the Doctor Who TV series), for supporting exact match and approximate kNN queries. The index still adopts the iSAX representation, but with a new word-level variable cardinality instead of the character-level cardinality. The word-level cardinalities enable better in-parallel processing, which suits the target distributed systems. In addition, we propose the iSAX-Transpose (iSAX-T) as a string-like signature to get rid of the costly conversions during the comparisons. On top of these signatures, we introduce a
K-ary index tree (called sigTree) to overcome the limitations of the former binary trees. sigTrees enable a compact index structure with fewer internal nodes and shorter paths to the leaf nodes.

TARDIS uses the sigTrees to construct a single centralized global index based on statistics collected from the data. The global index acts as a skeleton (or partitioning scheme) to re-partition the time series data across the cluster machines so as to localize similar objects together. Then, each partition is locally indexed, using sigTrees as well, for faster access within a given partition. To better support exact match queries, each local index is augmented with a partition-level Bloom Filter index, which is synchronously generated with the local index, to avoid many unnecessary accesses to the actual partition. Moreover, the combination of the word-level cardinality and the compactness of the sigTrees significantly enhances the accuracy of kNN approximate queries, mainly because they preserve the proximity of similar time series objects much better than the current techniques.

In summary, the contributions of this paper are as follows:
• Identifying the core limitations of the state-of-the-art indexing techniques in processing big time series data, and then proposing TARDIS, a scalable distributed indexing framework to address these limitations. TARDIS consists of a centralized global index and distributed local indices to facilitate efficient similarity query processing.
• Proposing a new iSAX-T signature scheme that dramatically reduces the cardinality conversion cost, and a sigTree that constructs a compact index structure based on word-level similarity.
• Introducing efficient algorithms for answering exact match and kNN approximate queries. We introduce different query processing strategies to greatly improve the accuracy of the approximate queries.
• Conducting extensive experiments on benchmark, synthetic, and real-world datasets to compare TARDIS with the state-of-the-art techniques. The results show significant improvement in index construction time (≈ 8x speedup) and, more critically, more than 10x accuracy improvement in some of the kNN approximate queries.

The rest of this paper is organized as follows. We review the background in Section 2. The new iSAX signature scheme and the index tree are defined in Section 3. TARDIS index construction is presented in Section 4, and the query processing algorithms are discussed in Section 5. The experimental evaluation is presented in Section 6. Finally, we review related work in Section 7 and present the concluding remarks in Section 8.

II. PRELIMINARIES

A. Key Concepts of Time Series

Definition 1: [Time Series Dataset] A time series X = ⟨x₁, x₂, ..., xₙ⟩, xᵢ ∈ R, is an ordered sequence of n real-valued variables. Without loss of generality, we assume that the readings arrive at fixed time granularities, and hence the timestamps are implicit and need not be stored. A time series dataset DB = {X₁, X₂, ..., Xₘ} is a collection of m time series objects, all of the same length n.

Definition 2: [Euclidean Distance (ED)] Given two time series X = ⟨x₁, x₂, ..., xₙ⟩ and Y = ⟨y₁, y₂, ..., yₙ⟩, their Euclidean distance is defined as:

ED(X, Y) = √( Σ_{i=1}^{n} (xᵢ − yᵢ)² )    (1)

Similar to existing techniques, TARDIS supports two fundamental similarity queries, namely exact match and kNN approximate queries. Exact kNN queries tend to be very expensive and time consuming, and most applications, especially those working with big datasets, typically prefer faster responses even at some accuracy loss.

Definition 3: [Exact Match Query] Given a query time series Q = ⟨q₁, q₂, ..., qₙ⟩ of length n, and a time series dataset DB = {X₁, X₂, ..., Xₘ}, the exact match query finds the complete set S = {Xᵢ ∈ DB} such that ∀Xᵢ ∈ S, ED(Xᵢ, Q) = 0, and there is no Yⱼ ∉ S with ED(Yⱼ, Q) = 0.

Definition 4: [kNN Approximate Query] Given a query time series Q = ⟨q₁, q₂, ..., qₙ⟩, a time series dataset DB = {X₁, X₂, ..., Xₘ}, and an integer k, the query finds a set S = {Xᵢ ∈ DB} such that |S| = k. The error ratio of S, which represents the approximation accuracy, is defined as

(1/k) Σ_{i=1}^{k} ED(Xᵢ, Q) / ED(Yᵢ, Q) ≥ 1, with Xᵢ ∈ S and Yᵢ ∈ T,

where T = {Y₁, Y₂, ..., Y_k} is the ground-truth kNN answer set.

B. iSAX-Representation Overview

iSAX [10] is based on Piecewise Aggregate Approximation (PAA) [7] and Symbolic Aggregate approXimation (SAX) [9]. Figure 1 illustrates an example of how these techniques summarize a time series.

PAA(T,w): Given a time series, say T in Figure 1(a), PAA divides T into equal-length segments and represents each segment by the mean of its values. The number of segments is called the "word length" (w), which is an input to the technique, and the entire representation vector is called a "word". For example, the PAA of word length 4 of T is PAA(T,4) = [-1.5, -0.4, 0.3, 1.5], as illustrated in Figure 1(b).

SAX(T,w,c): SAX takes the PAA representation of a time series T as its input, and then discretizes it into characters or binary alphabet labels. This discretization is achieved by dividing the value space (the y-axis) into horizontal stripes. The number of stripes equals an input parameter referred to as the "cardinality" (c), which is typically a power of 2. For example, in Figures 1(c) and (d), the cardinality is set to 4 (2 bits) and 8 (3 bits), respectively. The authors in [9] proposed an algorithm to decide on the boundaries of each stripe. For example, in Figure 1(c), stripes "11" and "01" have the boundaries [0.67, ∞] and [-0.67, 0], respectively. Then, each stripe is assigned a character label, which can be an arbitrary character or binary bits. Finally, each segment in the time series is assigned its corresponding stripe label.
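To make the two summarization steps above concrete, here is a minimal plain-Python sketch (not the authors' code); the c = 4 stripe boundaries [-0.67, 0, 0.67] are the ones quoted from Figure 1(c), and the full breakpoint tables are defined in [9]:

```python
from bisect import bisect

def paa(ts, w):
    """Mean of each of the w equal-length segments of ts (assumes len(ts) % w == 0)."""
    seg = len(ts) // w
    return [sum(ts[i * seg:(i + 1) * seg]) / seg for i in range(w)]

def sax(ts, w, bits, breaks):
    """Label each PAA mean with the index of its stripe, written in binary.
    `breaks` holds the 2^bits - 1 stripe boundaries; the lowest stripe is 0...0."""
    return [format(bisect(breaks, v), "0%db" % bits) for v in paa(ts, w)]

# A 12-point series whose PAA(T,4) is [-1.5, -0.4, 0.3, 1.5], mirroring
# the running example of Figure 1 (the raw values here are illustrative).
T = [-1.5] * 3 + [-0.4] * 3 + [0.3] * 3 + [1.5] * 3
print(sax(T, 4, 2, [-0.67, 0.0, 0.67]))   # -> ['00', '01', '10', '11']
```

With 3-bit labels and the corresponding 7 breakpoints, the same routine yields the SAX(T,4,8) word of Figure 1(d).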
Figures 1(c) and (d) illustrate SAX(T, 4, 4) and SAX(T, 4, 8) for the time series T presented in Figure 1(b). Two main observations to highlight for the SAX representation:
• Fixed Cardinality: A drawback of the SAX representation is that a time series representation is fixed, i.e., each segment is represented by the number of bits corresponding to the cardinality parameter. This means that, for large datasets, a high cardinality must be used to increase the possibility of creating enough distinct representations among the time series objects.
• Lower-Bound Distance: A nice property of the SAX representation is that it guarantees, for two time series T1 and T2, that their Euclidean distance in the SAX domain (calculated based on the boundaries of the SAX stripes) is smaller than or equal to their true distance in the data space. That is: ED(T1.SAX, T2.SAX) ≤ ED(T1, T2). This property is effective in pruning many candidates during a similarity search query, e.g., range or kNN, based only on the SAX representation and without checking the raw time series values.

iSAX(T,w,c): iSAX maintains the nice lower-bound distance property of SAX. However, unlike SAX, iSAX uses a variable cardinality for each segment in the time series. This is achieved by first enforcing the representation of the stripes' labels to be binary bits (not arbitrary characters), and then leveraging these bits to allow for a variable cardinality for each segment. For example, Figure 1(e) illustrates iSAX(T, w=4, c=4) for time series T. iSAX takes the same input as SAX; however, each segment can be independently represented by a number of bits up to the maximum identified by parameter "c". The figure shows two possible representations for T. Figure 1(f) also illustrates two possibilities when the cardinality is set to 8 (3 bits). The decision of how many bits to use for a given segment is dynamically determined while indexing the time series data and building the iSAX Binary Tree index (iBT), overviewed next.

Fig. 1: PAA, SAX, and iSAX Representations. (a) Raw time series T, |T| = 12. (b) PAA(T, w=4) = [-1.5, -0.4, 0.3, 1.5]. (c) SAX(T, w=4, c=4) = [00, 01, 10, 11]; e.g., stripe "11" has boundaries [0.67, ∞] and stripe "01" has boundaries [-0.67, 0]. (d) SAX(T, w=4, c=8) = [000, 010, 101, 111]. (e) iSAX(T, w=4, c=4) has variable cardinality: each segment can be represented by either 1 or 2 bits (2 is the max based on parameter "c"); valid representations include 1 bit per segment [0₁, 0₁, 1₁, 1₁], or 2 bits for the 3rd segment [0₁, 0₁, 10₂, 1₁]. (f) iSAX(T, w=4, c=8) has variable cardinality: each segment can be represented by 1, 2, or 3 bits (3 is the max based on parameter "c"); valid representations include [0₁, 0₁, 1₁, 1₁] and [00₂, 010₃, 10₂, 1₁].

Fig. 2: The iBT Index and iSAX Map Table. (a) The iBT index tree. (b) Mapping each leaf node in the iBT to its related pointer.

C. iSAX Binary Tree Index (iBT)

The iBT index [10] is an unbalanced binary tree, with the exception of its 1st level (see Figure 2(a)). It starts with one root node and a set of leaf nodes in its 1st level using a 1-bit representation for each segment, i.e., the number of nodes in the 1st level is 2^w, where w is the word length. The time series objects are inserted one at a time into the corresponding leaf node based on their iSAX representation. Once the number of time series contained by a leaf node exceeds a threshold, which is an input parameter, the node switches to an internal node and splits into two child leaf nodes. The splitting is performed by increasing the cardinality of one of the segments, which means using more bits to represent this segment. This will probably distribute the node's time series objects over the two child nodes. For example, the internal node [0₁, 11₂, 0₁] in Figure 2(a) is divided into the two leaf nodes [0₁, 11₂, 01₂] and [0₁, 11₂, 00₂] by extending the cardinality of the 3rd segment (also called character) from 1 bit to 2 bits.

The round-robin split policy initially proposed in [10] to determine the split character has been shown to perform excessive and unnecessary subdivision. An optimized policy is proposed in [11] to pick the character with a high probability of splitting the leaf node equally. Ultimately, the cardinality increase over any segment cannot exceed the max cardinality c. The work in [11] also proposes a bulk-loading mechanism for time series data that first determines the shape of the iBT tree, and then routes each time series to its leaf node.

Limitations of iBT: Although it is an interesting structure and performs well for small datasets, iBT indices suffer from severe limitations under big datasets, which include:
• Loose structure and long traversal: The superabundance of internal nodes caused by the binary fan-out leaves many leaf nodes at a great depth, and thus increases the tree height and its traversal cost at query time.
• Large initial cardinality: To guarantee that leaf nodes have the same granularity, the conversion from time series to iSAX needs to set aside a large enough initial cardinality for the split mechanism, due to the uncertainty of segment
skewness and the amount of data. Hence, it results in unnecessary conversion and storage.
• High matching overhead: The alternative that resolves the long tree paths is to convert the iBT into a map table [12] (see Figure 2(b)). The signature of each leaf node becomes a key in the table, and a pointer is maintained to either the tree node (if the tree is kept) or the actual partition holding the data. However, given a query time series Q, the search for matches within the map table is complex and very expensive due to the variable character cardinality in the keys. It requires creating all possible signatures from Q and then performing repetitive searches in the map table, which is a clear bottleneck.
• Weak proximity preservation and poor search accuracy: iBT uses character-level variable cardinality to solve the segment skewness problem and construct the hierarchical tree. However, character-level matching is not efficient in preserving the proximity of relatively similar objects, and these objects may end up in far-away leaf nodes. This results in poor accuracy for approximate queries.

Fig. 3: Similarity of Time Series. (a) Character-level: B and C are covered by [0₁, 0₁, 010₃, 1₁]. (b) Word-level: A and C are covered by [01₂, 01₂, 01₂, 10₂].

Example 1: Referring to Figure 3(a), assume a character-level variable cardinality of (1,1,3,1) bits. In this case, the time series A, B, C are represented as [0₁, 0₁, 011₃, 1₁], [0₁, 0₁, 010₃, 1₁] and [0₁, 0₁, 010₃, 1₁], respectively. Under this representation, the closest series to "C" is "B" (their distance in the iSAX space is zero). However, it is clear that the closest to "C" is "A".

D. Distributed iSAX Time Series Systems

To the best of our knowledge, DPiSAX [12] is the only iSAX-based distributed system in the literature to support index construction and kNN approximate queries. It constructs a global index and local indices. It leverages the cluster machines to sample a subset of the time series and convert them into iSAX signatures. These signatures are then sent to the master node to construct the global index, which is a partition table instead of the loose iBT structure. For local index construction, all time series are converted into iSAX signatures with a large initial cardinality of 512 to guarantee the split requirement. Then, for each iSAX signature (not the raw time series), a lookup over the partition table is performed (with high matching overhead) to re-partition the signatures. Finally, all workers concurrently build iBTs as local indices over their partitions.

Given a query Q, DPiSAX converts Q into its iSAX signature. It then matches the partition signature in the global index to identify the corresponding partition. Then, a worker loads this partition and traverses the local index to find the leaf node(s) for post-processing. DPiSAX is an un-clustered index, i.e., the original time series data remain un-partitioned; the leaf nodes in the local indices only store the iSAX signatures and the record ids of the corresponding time series.

Limitations of DPiSAX:
• Inheriting the limitations of iBT: Although it achieves relative scalability over the iBT indices by supporting distributed processing, DPiSAX is still based on iBTs and inherits their limitations as highlighted above.
• Additional degradation in result accuracy: To speed up the creation of the DPiSAX index, it builds an un-clustered index. However, answering queries based only on the iSAX representation, without a final refine phase, further degrades the accuracy of the results. On the other hand, retrieving the raw time series to apply the refine phase involves expensive random I/O operations across the cluster machines.

III. TARDIS BUILDING-BLOCK STRUCTURES

To address the aforementioned limitations, we propose a new iSAX-based signature scheme and its accompanying index tree to optimize index construction and similarity queries over massive time series datasets. The frequently used notations in this paper are listed in Table I.

TABLE I: Frequently Used Notations.
(ts,rid): a time series and its record id
w: word length
b: number of cardinality bits, i.e., cardinality = 2^b
pid: partition id
isaxt(n): iSAX-T signature with 2^n cardinality
freq(n): frequency of isaxt(n)
Tardis-G: TARDIS global index
Tardis-L: TARDIS local indices
G-MaxSize: split threshold for Tardis-G leaf nodes
L-MaxSize: split threshold for Tardis-L leaf nodes

A. iSAX-Transposition Signature (iSAX-T)

The objective of the indexable Symbolic Aggregate approXimation Transposition (iSAX-T) is to simplify the representation conversion from a higher cardinality, e.g., 5 bits, to a lower cardinality, e.g., 3 or 4 bits, which is a common operation during both index construction and query search. This guarantees the efficiency of the parallel process. Unlike iSAX, iSAX-T utilizes a word-level variable cardinality, defined as follows:
• Word-Level Variable Cardinality: In this representation scheme, all characters in one word, i.e., the characters across all segments of a time series, must use the same cardinality. This cardinality is decided by the level of the index tree in which the time series resides.
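The proximity failure that Example 1 illustrates, and the word-level remedy just defined, can be reproduced with a toy sketch (signatures copied from Figure 3; the Hamming distance over segment labels is only an illustrative stand-in for the iSAX lower-bound distance):

```python
# Character-level signatures of A, B, C from Example 1, cardinality bits (1,1,3,1):
char_level = {"A": ("0", "0", "011", "1"),
              "B": ("0", "0", "010", "1"),
              "C": ("0", "0", "010", "1")}
# Word-level signatures of the same series at the uniform 2-bit level (Figure 3(b)):
word_level = {"A": ("01", "01", "01", "10"),
              "B": ("00", "00", "01", "11"),
              "C": ("01", "01", "01", "10")}

def nearest(sigs, q):
    """Closest series to q, counting how many segment labels differ."""
    others = [k for k in sigs if k != q]
    return min(others, key=lambda k: sum(a != b for a, b in zip(sigs[k], sigs[q])))

print(nearest(char_level, "C"))   # -> B  (the proximity of A and C is lost)
print(nearest(word_level, "C"))   # -> A  (word-level cardinality preserves it)
```

Under the character-level labels, B and C collide while A is pushed away; at the uniform word level, A and C collide, matching the true proximity.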
Fig. 4: iSAX-T Signature. (a) SAX(T,4,16) = [1100₄, 1101₄, 0110₄, 0001₄]: transposing the matrix of per-segment bit codes and hex-encoding each resulting row gives 1100 → C, 1110 → E, 0010 → 2, 0101 → 5. (b) iSAX-T signatures for different cardinalities: SAX(T,4,2) = {1, 1, 0, 0} = C; SAX(T,4,4) = {11, 11, 01, 00} = CE; SAX(T,4,8) = {110, 110, 011, 000} = CE2; SAX(T,4,16) = {1100, 1101, 0110, 0001} = CE25.

Fig. 5: sigTree with fan-out = 2³. (a) Binary alphabet labels; (b) string-like signatures.

Example 2: Referring to Figure 3(b), assume the time series A, B, C reside in a leaf node at the 2nd level of the index tree. In this case, all characters (segments) use a 2-bit cardinality, and are thus represented as [01₂, 01₂, 01₂, 10₂], [00₂, 00₂, 01₂, 11₂] and [01₂, 01₂, 01₂, 10₂], respectively. Compared to Example 1, the closest series to "C" is now "A". This is mainly because the word-level cardinality is intended to preserve the proximity of similar time series better than the character-level cardinality.

iSAX-T adopts a string-like signature based on matrix transposition to speed up the cardinality conversion operation. Thanks to the uniform word-level cardinality for a whole word, the binary signature can be considered as a binary matrix, as presented in Figure 4(a). After transposing this matrix and transforming the binary into hexadecimal, the signature is represented as a string.

As a consequence, the conversion is simplified to a string drop-right operation. Equation 2 shows how to calculate the drop-right letter number n, where hc, lc and w represent the high cardinality, the low cardinality, and the word length, respectively (see Figure 4(b) as an example).

n = (log₂ hc − log₂ lc) × w/4    (2)

B. iSAX-T K-ary Tree (sigTree)

sigTrees are hierarchical K-ary trees based on the cardinality of the iSAX-T signature. Each node has no more than 2^w children (this bound results from increasing the cardinality representation by 1 bit over the w characters (segments) of the time series). For example, referring to Figure 5(a), the node with signature [0₁, 0₁, 1₁] in the 1st layer has been expanded to its children by adding an additional bit to each of the three characters, which results in having 8 children in the 2nd layer. Three classes of nodes are involved in sigTrees:
• Root Node: It represents the entire space, and only contains the number of time series in the whole tree and pointers to its children nodes.
• Internal Nodes: They designate splits in the sub-tree space. When the quantity of time series contained in a leaf node exceeds a given split threshold, this leaf node gets promoted to an internal node and splits all its data entries into at most 2^w leaf nodes by increasing the cardinality by 1 bit for all characters. Each internal node stores its iSAX-T signature, the number of time series in its sub-tree, and pointers to its children nodes.
• Leaf Nodes: They are the storage nodes at the bottom. They store the iSAX-T signatures and the number of time series they hold. Moreover, they store additional content that differs depending on the index type they belong to, i.e., global or local index, as will be explained later.

Figure 5 shows a sigTree with internal and leaf nodes represented by binary and compact string-like signatures. To insert a time series into the tree, we iteratively move down the tree based on its iSAX-T signature until a leaf node is found. If a split is needed, it is performed as mentioned above. In addition, each node is able to reach all sibling nodes with the same cardinality through its parent node, as we keep the nodes doubly linked (they point to their parents as well as their children).

Example 3: Assume inserting a time series T = [0110₄, 0011₄, 1011₄] into the sigTree in Figure 5(b). First, T is converted into its iSAX-T signature "1473" according to Figure 4(b). Then, starting from the root node, it drops 3 letters down to the 1-bit cardinality to match the internal node "1" in the 1st layer. This process repeats downward until it finally reaches the leaf node "147" in the 3rd layer.

Benefits: The careful design of our iSAX-T representation and sigTree solutions offers the following benefits for massive time series processing in a distributed infrastructure:
• Compact structure: Compactness means fewer internal nodes and a shorter depth of leaf nodes, due to the large fan-out of up to 2^w.
• Small initial cardinality: The short height can be achieved with a small cardinality, which saves conversion costs and storage space.
• Efficient signature conversion: The conversion is simplified to a string drop-right operation. Given the frequency of this operation during the index construction and query processing phases, the cumulative time savings are considerable.
• Word-level similarity: iSAX-T effectively preserves the proximity relationships of similar time series due to the used word-level cardinalities.

IV. TARDIS INDEXING STRUCTURE

Based on the sigTree structure, we now introduce the design of the TARDIS indexing framework. For ease of presentation, we start by giving an overview of the whole framework, and then present the construction of the global and local indices in detail.
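Equation 2 and the transposition of Figure 4 can be sketched as follows; this is a minimal illustration rather than the TARDIS implementation, and it assumes the word length w is a multiple of 4 so that each bit level maps to whole hex letters:

```python
def isaxt(sax_word):
    """iSAX-T signature (Figure 4): transpose the per-segment bit codes,
    then hex-encode each resulting bit level as w/4 letters."""
    w = len(sax_word)
    sig = ""
    for level in range(len(sax_word[0])):            # one transposed row per bit level
        row = "".join(code[level] for code in sax_word)
        sig += format(int(row, 2), "0%dX" % (w // 4))
    return sig

def drop_right(sig, hc, lc, w):
    """Cardinality conversion hc -> lc by dropping the rightmost
    n = (log2(hc) - log2(lc)) * w/4 letters (Equation 2)."""
    n = (hc.bit_length() - lc.bit_length()) * w // 4  # bit_length difference = log2 ratio
    return sig[:len(sig) - n]

print(isaxt(["1100", "1101", "0110", "0001"]))   # -> CE25
print(drop_right("CE25", 16, 4, 4))              # -> CE
```

Dropping letters never re-reads the underlying bits, which is why repeated conversions during construction and search stay cheap.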
Fig. 6: TARDIS Key Components and Search Process (a query descends the master node's Tardis-G to the target partition, then a worker node's Tardis-L over the indexed time series).

Fig. 7: Tardis-G structure; the word length is 8, so the signature uses 2 hex letters for each bit of cardinality. The root records the number of partitions (pid nbr: 12), and internal and leaf nodes store isaxt, freq, and pid entries, e.g., isaxt: "0202", freq: 100,550, pid: 6,7.
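As a rough illustration of the node layout drawn in Figure 7, the following hypothetical structure stores the isaxt/freq/pid fields of Table I and descends Tardis-G by signature prefixes (the actual system is implemented on Apache Spark; class and function names here are our own):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SigNode:
    """A Tardis-G sigTree node as drawn in Figure 7 (hypothetical layout)."""
    isaxt: str                                       # iSAX-T signature, e.g. "0202"
    freq: int = 0                                    # time series under this sub-tree
    pids: List[int] = field(default_factory=list)    # partition ids (leaf nodes)
    children: Dict[str, "SigNode"] = field(default_factory=dict)

def lookup(root: SigNode, sig: str, letters_per_level: int) -> SigNode:
    """Descend by prefixes of the query's iSAX-T signature until no child
    matches: the level-l key is the first l * letters_per_level letters."""
    node = root
    for l in range(1, len(sig) // letters_per_level + 1):
        key = sig[: l * letters_per_level]
        if key not in node.children:
            break
        node = node.children[key]
    return node

# A tiny tree mirroring Figure 7's "02" branch.
root = SigNode("", 12)
root.children["02"] = SigNode("02", 350_000, pids=[5, 6, 7])
root.children["02"].children["0202"] = SigNode("0202", 100_550, pids=[6, 7])
print(lookup(root, "020201", 2).isaxt)   # -> 0202
```

Because every level key is a prefix of the full signature, the descent is again just the drop-right view of the same string.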

A. TARDIS Overview

TARDIS consists of two levels of indices, as illustrated in Figure 6. The TARDIS Global Index (Tardis-G) is a centralized index maintained in the master node. It is used to efficiently identify the target partition in the distributed system. The compactness of its index tree structure is controlled by a split threshold G-MaxSize. The TARDIS Local Index (Tardis-L) is a distributed local structure that indexes the data entries within a single partition. Both structures are sigTree-based indices that differ only in the content of their leaf nodes. The leaf nodes of Tardis-G store the partition information, i.e., pointers to where the partitions are located in the cluster, whereas the leaf nodes of Tardis-L store the actual time series objects. The overall index framework is constructed given an initial cardinality, a word length, and two split thresholds for Tardis-G and Tardis-L leaf nodes (these notations are summarized in Table I).

B. TARDIS Global Index (Tardis-G)

Tardis-G is a lightweight sigTree structure that resides in the master node of the cluster. It is the entry point for the index search. Unlike other iSAX-based indices, which are constructed based on the representations of time series as they are gradually loaded, it is built from statistics collected from the cluster nodes in a distributed way. The construction consists of the following steps.

Data Preprocessing: The dataset is sampled at the block level, and each time series is converted to a pair of values consisting of the iSAX-T signature and its frequency. This is performed by the worker nodes in parallel. A percentage of blocks is randomly chosen to reduce the disk access. The generation of pairs is completed by a single map-reduce job. All time series (ts,rid) are transformed to (isaxt(b), freq:1) in the map phase, and then aggregated to [(isaxt(b), freq(b))] in the reduce phase, where b represents the initial cardinality.

Node Statistics: The node statistics are collected for each layer in ascending order. The [(isaxt(b), freq(b))] pairs generated previously are processed to retrieve [(isaxt(i), freq(i))], the collected result for the ith layer; (3) Judge then decides whether to stop the collection: if max[freq(i)] exceeds G-MaxSize, filter out the (isaxt(b), freq(b)) entries contained by leaf nodes in the ith layer and continue with the (i+1)th layer; otherwise, stop and finish this step. Note that the entries in [(isaxt(b), freq(b))] are filtered out layer by layer, though the whole size is small. Finally, we get several groups of [(isaxt(i), freq(i))] with their respective layer ids.

Skeleton Building: The index structure is constructed and the collected node information is put in the right places. All collected information is sent to the master. The construction is completed layer by layer in ascending order using a tree insertion mechanism. The root node is the entry point, and each inserted node recursively matches the internal node at each layer. Take the node (isaxt(3): "0202ff", freq(3): 550) in Figure 7 for example: it starts from the root node, finds the matched node isaxt: "02" in the 1st layer, then isaxt: "0202" in the 2nd layer, and finally reaches its position in the 3rd layer. During this process, we observe that isaxt(3): "0202ff" is converted twice along this traversal path. The master node is not a bottleneck in this process due to the small size of the tree. To facilitate retrieving sibling information from the parent node, all nodes are doubly linked.

Partition Assignment: The goal is to package all under-utilized sibling leaf nodes into as few partitions as possible to facilitate parallel processing. Distributed infrastructures, like Hadoop and Spark [13], prefer to launch parallel tasks over large files rather than many tasks over small files, because big data is processed in units of a partition or a block. Assembling sibling leaf nodes together has two benefits: (1) all records are similar at the parent-node level; (2) the partition is represented by the signature of the parent node, which facilitates pruning the search space. In addition, sibling leaf nodes are indexed at a finer granularity by the Tardis-L of each partition, even though they are assembled together. Our problem can be considered as a partition packing problem.

Definition 5: [Leaf Partitions Packing] Given a list of n leaf nodes under an internal (or root) node L = {l₁, l₂, ..., lₙ}
each entry is the information of a node in the ith layer.
and a partition capacity C, the Partition Packing problem is
isaxt(i) means the iSAX-T signature of a node and freq(i)
to group leaf nodes into as few partitions as possible, such
means the frequency of this signature. In other words, the
that the sum of each partition is no greater than C.
quantity of time series under this node. The procedure of
the ith layer involves the following operations: (1) Map that Since it is an NP-hard optimization problem [14], the exact
converts (isaxt(b), freq(b)) to (isaxt(i), freq(b)); (2) Reduce that algorithms typically leverage the branch-and-bound strategy
aggregates (isaxt(i), freq(b)) to [(isaxt(i), freq(i))] which is the which uses approximate algorithms to compute the bounds.
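The first fit decreasing (FFD) heuristic that the paper adopts for this packing problem can be sketched in a few lines of Python; the leaf-node sizes and the capacity below are hypothetical values for illustration:

```python
def pack_leaf_nodes(leaf_sizes, capacity):
    """First fit decreasing (FFD): sort leaf nodes by record count
    (descending), then place each one into the first partition that
    still has room; open a new partition only when none fits."""
    partitions = []  # each partition: [remaining capacity, [packed sizes]]
    for size in sorted(leaf_sizes, reverse=True):
        for part in partitions:
            if part[0] >= size:      # first partition with enough room
                part[0] -= size
                part[1].append(size)
                break
        else:                        # no existing partition fits
            partitions.append([capacity - size, [size]])
    return [p[1] for p in partitions]

# e.g. pack_leaf_nodes([7, 5, 4, 3, 1], capacity=10) -> [[7, 3], [5, 4, 1]]
```

The descending sort dominates the cost, matching the O(n log n) complexity stated for FFD in the text.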
Fig. 8: Pipeline of Tardis-L Construction. (Stages: Read → Map → Shuffle → MapPartition → Persist → Save; records flow from (ts, rid) to (isaxt, ts, rid) and into the per-partition (Bloom Filter, Local Index).)

We adopt first fit decreasing (FFD) [14], the best known approximation algorithm with time complexity O(n log n) and worst-case performance ratio 3/2, to solve this problem. It sorts all leaf nodes by record count in descending order and then inserts each leaf node into the first partition with sufficient remaining space. After packing finishes, the partition ids in the descendant nodes are synchronized to the id lists of the ancestor nodes to facilitate future retrieval of sibling nodes’ information.

C. TARDIS Local Index (Tardis-L)

Tardis-L is a sigTree-based structure that indexes the data entries within each partition. TARDIS leverages the high I/O rate and powerful in-memory computation of distributed infrastructures to construct the local structures for all partitions in parallel. The data pipeline in Figure 8 shows the overall procedure, which involves the following steps:

Data Shuffle: Each record is shuffled to the target partition based on its assigned partition id. This step is finished in one map-reduce job by the workers. Before starting the job, the master broadcasts the Tardis-G to all workers as the partitioner for the reduce operation; it resides in the memory of the workers until the job is done. In the map phase, each time series (ts, rid) in Table I is read and converted to (isaxt(b), ts, rid). In the reduce phase, records are aggregated into their matched partitions in two sub-steps: (3.1) each time series obtains its partition id by traversing the partitioner; and (3.2) each record is shuffled to the target partition using the distributed infrastructure. The data entries within partitions are out of order after repartitioning.

Local Structure Construction: The Tardis-L is constructed within each partition to organize the data entries. This step is implemented in the mapPartition operation in Figure 8 using the tree insertion mechanism: each data entry (isaxt(b), ts, rid) enters at the root node of the local index and then traverses to the matched leaf node. If a leaf node contains more data entries than the given split threshold, it is promoted into an internal node and all of its entries are split among ≤ 2^w leaf nodes. Meanwhile, each node along the traversal path increases its record count by one. Note that both Tardis-G and Tardis-L employ a similar insertion mechanism to construct the sigTree, but three key differences exist: (1) Scope: Tardis-G covers the complete dataset whereas Tardis-L corresponds to one partition; (2) Element: Tardis-G inserts the information of nodes whereas Tardis-L inserts time series entries; (3) Split: Tardis-G finishes node splits in the statistics collection phase whereas Tardis-L splits nodes in the index construction phase.

Bloom Filter Index Construction: A small local index is built for the Exact-Match query. A Bloom Filter [15] is a space-efficient probabilistic data structure used to test whether an element is a member of a set; it can raise false positives but never false negatives. The iSAX-T signature is used as the input to the Bloom Filter because: (1) it is already contained in each data entry, so no extra conversion is needed; and (2) a given initial cardinality and a high fan-out guarantee a high probability of a one-to-one relationship between each time series and its signature. The Bloom Filter index is generated synchronously with the Tardis-L in the mapPartition operation in Figure 8: when each data entry is inserted during Tardis-L construction, its isaxt(b) is encoded into the Bloom Filter data structure at the same time.

V. TARDIS QUERY PROCESSING

TARDIS supports classical Exact-Match queries and kNN-Approximate queries. While for some problems finding the exact match is essential, for many data mining applications over massive time series [10] approximate query processing may be all that is required. Take big data visualization [16] as an example: Tableau [17] takes 71 minutes to plot a scatter-plot for a dataset with 2 billion tuples, whereas practically the same visualization can be produced in only 3 seconds by carefully choosing 1 million tuples. In this spirit, TARDIS supports very fast approximate searches, as only a single disk access is required.

A. Exact Match Query

The Exact-Match algorithm harnesses the TARDIS index framework to fetch leaf nodes for validation. It is composed of the following steps: (1) convert the query time series to its iSAX-T signature; (2) traverse the Tardis-G to identify the partition; (3) test the existence of the signature in the Bloom Filter index of the partition; if the test result is negative, the query terminates with zero results; (4) if the Bloom Filter test is positive, load the partition, traverse the Tardis-L to retrieve the leaf node, and then look up the query time series. A failed traversal in either Tardis-G or Tardis-L means a non-existent result.

The algorithm leverages the Bloom Filter index in step (3) above to avoid high-latency disk access in the non-existence case, at a low false positive rate. Distributed infrastructures prefer to store data in large files; for example, the default block size used by Hadoop and Spark is 64 MB or 128 MB, so loading such a file incurs high latency. For the Exact-Match query, the query time series either exists or does not exist in the dataset. In the first case, access to a partition is unavoidable; in the second case, however, it may not be necessary. The algorithm uses the Bloom Filter index to test whether the partition contains the query time series; due to its small size, the filter resides in memory or is read from disk with low latency. Furthermore, we also provide an Exact-Match algorithm without the Bloom Filter, which takes more time with the same query accuracy because a partition always has to be loaded once it is identified in step (2) above.
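A minimal sketch of this Bloom-Filter-guarded Exact-Match flow in Python; the filter parameters and the partition loader are simplified stand-ins for the per-partition Bloom Filter index and the Tardis-L lookup described above:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter over iSAX-T signatures: false positives are
    possible, false negatives are not."""
    def __init__(self, m=8192, k=4):
        self.m, self.k, self.bits = m, k, bytearray(m // 8)

    def _positions(self, key):
        # k independent positions derived from a salted hash of the key.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

def exact_match(isaxt, bloom, load_partition):
    """Consult the filter (step 3) before the high-latency partition
    load and local lookup (step 4)."""
    if not bloom.might_contain(isaxt):
        return None                  # definitely absent: no disk access
    partition = load_partition()     # stand-in for loading + Tardis-L
    return partition.get(isaxt)
```

A negative filter answer short-circuits the query, which mirrors how TARDIS avoids loading a 64-128 MB block for non-existent series.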
Algorithm 1 kNN Approximate: Multi-Partitions Access
Input: qts, k, pth
Output: topK list(dist, rid)
 1: • Traverse Tardis-G to identify the partition in the master
 2: isaxt = convertToiSaxt(qts)
 3: pid = tardisG.fetchPid(isaxt)
 4: pidList = tardisG.fetchFromParent(isaxt)
 5: if size(pidList) > pth then
 6:     pidList = randomSelect(pidList, pth)
 7: end if
 8: • Load all partitions by workers
 9: partitions = spark.readHdfsBlock(pidList)
10: • Get the threshold from the partition by one worker
11: partition = partitions.select(pid)
12: node = partition.fetchKnnNode(isaxt, k)
13: records = node.fetchRecords().calEuSort(qts)
14: th = records.take(k).last.distance
15: • Scan partitions using the threshold in parallel
16: candidates = partitions.scan(th).calEuSort(qts)
17: return candidates.take(k)

TABLE II: Experimental Configuration.
Parameters                             | Value
HDFS block size                        | 128 MB
Word length                            | 8
Sampling percentage                    | 10%
L-MaxSize                              | 1,000
Initial cardinality (TARDIS)           | 64
Initial cardinality (Baseline)         | 512
Multi-Partition Access threshold: pth  | 40

Fig. 9: Datasets Distribution.
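A plain-Python sketch of the control flow of Algorithm 1; the partition layout and helper names here are illustrative stand-ins rather than the actual TARDIS API, and the Spark-parallel scan of lines 15-17 is shown sequentially:

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_multi_partition(qts, k, pth, home_pid, sibling_pids, partitions):
    """partitions: pid -> list of (rid, time_series) records."""
    # Cap the number of sibling partitions loaded (lines 5-7).
    pids = sibling_pids
    if len(pids) > pth:
        pids = random.sample(pids, pth)
    if home_pid not in pids:
        pids = [home_pid] + list(pids)
    # Threshold = distance of the k-th nearest record in the home
    # partition (lines 11-14).
    home = sorted(euclidean(qts, ts) for _, ts in partitions[home_pid])
    th = home[min(k, len(home)) - 1]
    # Scan all loaded partitions, keeping candidates within the
    # threshold, then take the k closest (lines 15-17).
    candidates = [(euclidean(qts, ts), rid)
                  for pid in pids
                  for rid, ts in partitions[pid]
                  if euclidean(qts, ts) <= th]
    return sorted(candidates)[:k]
```

The threshold from the home partition prunes the sibling partitions, so extending the candidate scope does not degenerate into a full scan.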

B. kNN Approximate Query

Target Node Access leverages Tardis-G and Tardis-L to fetch the target node, which is the leaf or internal node with more data entries than k at the lowest position of Tardis-L. Note that if it is an internal node, every child node contains fewer than k data entries. The process is composed of the following steps: (1) convert the query time series to its iSAX-T signature; (2) traverse the Tardis-G to identify the partition; (3) load the partition and traverse the Tardis-L to the target node; (4) fetch all candidates under this node and take the k closest records as the result.

Besides the Target Node Access algorithm, two optimized algorithms are proposed based on the intuition that the candidate scope can be extended by reducing the word-level cardinality of the iSAX signatures to loosen the bounds. Because of the approximate nature of iSAX-based representations, the larger the candidate scope is, the more accurate the result is. The One Partition Access algorithm scans the Tardis-L of the loaded partition to extend the scope, whereas the Multi-Partitions Access algorithm harnesses the parallel processing power of the distributed infrastructure to concurrently exploit sibling partitions. Both methods use the lower-bound feature of iSAX-T to prune the search space, and PAA is used to obtain a tighter bound since the query time series is provided.

One Partition Access uses the distance of the kth time series obtained in step (4) of Target Node Access as the threshold to prune the search space of Tardis-L from top to bottom. It collects all candidates in the residual nodes and takes the first k closest records as the result. Unlike Target Node Access and One Partition Access, Multi-Partitions Access fetches the partition list of all sibling partitions in the parent node at step (2). This list may be large in the upper layers of Tardis-G; for example, the list size corresponds to the total number of partitions in Figure 7 because the parent node of the leaf node isaxt(1):“ff” is the root node. In response, a partition threshold pth is set to control the maximum quantity of partitions loaded; if the list size exceeds it, pth elements are randomly chosen from the list. After loading all these partitions, Multi-Partitions Access uses the pruning method above to process all partitions in parallel. It collects all candidates in the residual nodes and takes the first k closest records as the result. Algorithm 1 shows the detailed strategy.

VI. EXPERIMENT

We first introduce the implementation and the experimental setup, and then empirically evaluate the performance of the index construction and the query processing.

A. Implementation & Setup Details

Implementation. Since the core features of TARDIS are infrastructure-independent, they are applicable to big data engines generally. As a proof of concept, the TARDIS prototype has been realized on top of Apache Spark 2.0.2 [13]. We opt for Spark due to its efficient main-memory caching of intermediate data and the flexibility it offers for caching hot data. An important design choice of TARDIS is not to touch the internals of the core Spark engine, so as to remain portable; this allows easy migration of TARDIS to future Spark releases. We implement our approach for both clustered and un-clustered indices at the local structure. All data structures and algorithms of TARDIS are built from scratch in Scala, and we extend DPiSAX [12] to support a clustered index, Exact-Match queries, and kNN-Approximate queries as the baseline of the evaluation.

Cluster Setup. All experiments were conducted on a cluster consisting of 2 nodes. Each node has 56 Intel Xeon E5-2690 processors, 500GB RAM, and a 7TB SATA hard drive, and runs Ubuntu 16.04.3 LTS with Spark-2.0.2 and Hadoop-2.7.3. The Spark cluster is deployed in standalone mode.

Datasets. We use one benchmark and three real-world datasets from different domains for the evaluation. The RandomWalk Benchmark Dataset is extensively used as the benchmark for time series indexing in other projects [9]–[11], [18]; it is generated as 1 billion time series with 256 points each. The Texmex Corpus Dataset [19] is an image dataset which contains 1 billion SIFT feature vectors of 128 points each.
Fig. 10: Index Construction Time. (T: TARDIS, B: Baseline, Rw: RandomWalk, Tx: Texmex, Dn: DNA, Na: Noaa; (a) RandomWalk Dataset, (b) Dataset Comparison (200m))

Fig. 11: Global Index Construction Time Breakdown. (T: TARDIS, B: Baseline, Rw: RandomWalk, Tx: Texmex, Dn: DNA, Na: Noaa; (a) RandomWalk Dataset, (b) Dataset Comparison (200m))

The DNA Dataset [20] contains assemblies of the human genome collected from 2000 to 2013. Each DNA string is divided into subsequences of length 192 and then converted into time series [11]; it contains 200 million time series with 192 points each. The Noaa Dataset [21] involves weather and climate data from 20,000 global stations from 1901 to the present; the temperature feature is extracted into 200 million time series with 64 points each. Each dataset is z-normalized before being indexed. The datasets are chosen to cover a wide range of skewness with respect to the values’ occurrence frequencies, as illustrated in Figure 9.

Fig. 12: Bloom Filter Index Construction (RandomWalk).

As shown in Table II, TARDIS and the baseline system adopt the same configuration except for the initial cardinality: it is 64 for TARDIS, whereas the default value of the baseline is 512. For reproducibility, all source code, the cluster configuration, and a technical report are provided [22].

B. Index Construction

1) Clustered Index: The capacity of an HDFS block is set as the Tardis-G threshold G-MaxSize in terms of the indexed time series. Take the 1 billion time series RandomWalk dataset as an example: it needs about 10,189 partitions if each partition holds 110,000 data entries. As shown in Figure 10(a), TARDIS takes 334 mins to finish the index construction process for the 1 billion dataset, whereas the baseline takes 2,323 mins. From 200 million to 1 billion time series in the RandomWalk dataset, the index construction time of TARDIS increases 7.6 times, whereas that of the baseline increases 16 times. Figure 10(b) shows the performance on all datasets; the differences between datasets are caused by the time series length and value distribution. Our new system demonstrates excellent scalability for large partition numbers because the sigTree structure of Tardis-G has a short height for its leaf nodes and the iSAX-T signature simplifies the cardinality conversion needed to identify the partition for the shuffle operation, whereas the partition table derived from the iBT for the global index introduces high look-up costs and the iSAX signature needs expensive cardinality conversion.

For the global index, TARDIS takes 10 mins for 1 billion time series whereas the baseline takes about 46 mins in Figure 10(a). TARDIS leverages block-level sampling to reduce the data reading time in the sampling step, and harnesses the workers to collect the node statistics. Figure 11(a) shows that TARDIS finishes the node statistics, index tree building, and partition assignment in a few minutes even for large datasets; it shows good scalability to construct the global index quickly at different dataset scales. Note that the master node is not the bottleneck even though the index tree construction and partition assignment are completed by it alone. In contrast, the time the baseline takes to build its index tree increases linearly as the dataset size increases. Figure 11(b) shows the global index construction time on all datasets.

For the local index in Figure 10, the difference between the two systems is the time to read and convert the data, caused by the partition id assignment completed by the partitioner, since both systems spend the same time on reading data. For each record, the baseline takes the character selection into consideration in the lookup of the partition table; the table matching process and cardinality conversion make this procedure time-consuming. In contrast, TARDIS leverages the Tardis-G and the iSAX-T signature to finish this process within a short time. TARDIS takes 66 mins to read and convert the data for the 1 billion dataset, whereas the baseline takes 2,007 mins. Note that this time includes the fixed overhead of reading the dataset. In our technical report [22], we also include the breakdown of construction time for the local indices.

For the Bloom Filter index construction, the persistence of data in memory impacts the performance. If the intermediate data is persisted in memory, no obvious overhead exists because the cost only corresponds to dumping this small index, 66 KB per partition, to disk. As shown in Figure 12, when the RandomWalk dataset is smaller than 400 million, the difference is negligibly small. In contrast, if the intermediate data cannot be persisted entirely in memory, it has to be persisted in memory and on disk. Taking the 1 billion dataset as an example, the construction takes an extra 97 mins, of which 57 mins are spent dumping the intermediate result and 40 mins reading it back.

2) Index Size: The global index is impacted by the index structure and the dataset size, while the local index is impacted also by the settings for the indexed data, such as the initial cardinality. For the global index, TARDIS keeps the whole sigTree structure as the index, whereas the baseline saves all leaf nodes as the partition table. For 1 billion time series in Figure 13(a), TARDIS uses
Fig. 13: Index Size (RandomWalk). ((a) Global Index, (b) Local Index)

Fig. 14: Exact Match Average Query Time. ((a) Dataset, (b) Dataset Size (RandomWalk))

20 MB while the baseline uses 1 MB. However, the Tardis-G is still lightweight enough to be broadcast to the worker nodes and reside in RAM. In consideration of the improved efficiency of index construction and query processing, the trade-off of increased index size is reasonable. For the local index, which excludes the indexed data in Figure 13(b), the difference is mainly caused by the different initial cardinality. Since the sigTree leverages a large fan-out to reduce the depth of the leaf nodes, TARDIS uses a small value, 64 here, while the baseline uses a large initial value, 512 by default, to guarantee enough cardinalities for splitting. For 1 billion RandomWalk time series, TARDIS uses 34.9 GB while the baseline uses 43.5 GB.

C. Query Performance

1) Exact-Match Query Processing Performance: We evaluate the query performance on all datasets. Each experiment involves 100 time series queries with exactly the same length as the time series in the dataset. In particular, 50% are randomly selected from the dataset while the other 50% are guaranteed not to exist in the dataset. The average query time is measured because the recall rates are all 100%. Although all queries need to identify the partition in the master and determine existence in the workers, the key difference is caused by the operations in the workers, because the time taken by the master node is negligibly small compared with the operations in the workers.

As shown in Figure 14, Tardis-NoBF takes less time than the baseline because of the smaller depth of the leaf nodes and the fewer records stored in the leaf nodes of the Tardis-L, though both need to read one partition. In Figure 14(a), Noaa takes more time because more time series are stored in each partition due to the short series length. On the RandomWalk dataset, Tardis-BF has the best performance with 4 sec, about half of the baseline’s 9 sec, because the queries for non-existent time series avoid loading partitions. Therefore, the Bloom Filter index effectively prevents the disk access for such queries. In Figure 14(b), the scale of the dataset has no obvious impact on the performance since each query only accesses one partition.

2) kNN-Approximate Query Processing Performance: The ground truth is critical for the evaluation. However, the naive method, which calculates the distance between the query time series and every record in the dataset to obtain the top k nearest neighbors, is impractical due to the prohibitive time cost. We instead leverage TARDIS to quickly figure out the ground truth: for each qi in Q = {q1, · · · , qp}, we use the lower-bound feature of Tardis-G to filter out “large” partitions and then use this feature of Tardis-L to filter out nodes in the residual partitions; if the number of candidates within the residual nodes equals or exceeds k, we take the top k nearest neighbors as the ground truth for qi. Note that a threshold (7.5 in our paper) is given for the above filtering processes in advance.

In contrast with DPiSAX [12], which considers the query answering time for 10-NN, we study the effect of the query process, the k value, and the dataset size to evaluate the search quality and search speed. Search quality is measured by recall and error ratio, which are standard metrics in high-dimensional nearest neighbor queries [23], [24]. Search speed is measured by the average query time. Given a query q, the set of exact K nearest neighbors is G(q) = {g1, · · · , gk} and the query result is R(q) = {r1, · · · , rk}. recall is defined as:

recall = |G(q) ∩ R(q)| / |G(q)|   (5)

Obviously, the recall score is at most 100%; in the ideal case, 100% means all the k nearest neighbors are returned. error ratio is defined as:

error ratio = (1/k) · Σ_{j=1..k} ED(q, rj) / ED(q, gj)   (6)

It measures how close the distances of the K nearest neighbors found are to those of the exact K nearest neighbors. The value is at least 1.0, and the ideal case is 1.0.

As shown in Figures 15 and 16, TARDIS achieves better performance in both recall and error ratio than the baseline. This is credited to TARDIS’s word-level similarity and the extended candidate scope. First, we study the effect of the different query processes using the case of the 400 million dataset and k value 500 in Figure 15. For recall, the baseline is 1.5% while Target Node Access is about 6.7%, One Partition Access is 18.9%, and Multi-Partitions Access is 43.4%. For error ratio, the baseline is 1.42 while Target Node Access is about 1.19, One Partition Access is 1.07, and Multi-Partitions Access is 1.03. For average query time, the baseline takes about 9.8 sec while Target Node Access takes about 7.5 sec, One Partition Access 7.7 sec, and Multi-Partitions Access 9.9 sec. One Partition Access performs better than Target Node Access because scanning the partition results in a larger candidate scope. Note that Multi-Partitions Access has an obvious advantage because its extended candidate scope derives from the sibling partitions. Even though more partitions are loaded, the average query time of Multi-Partitions Access is similar to that of the baseline because of concurrent processing.

As shown in Figure 16 (left), the performance under different dataset sizes follows the same pattern as aforementioned, particularly for the error ratio and the average query time.
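The two search-quality metrics of Equations (5) and (6) translate directly into code; this is a sketch assuming both neighbor lists are sorted by distance to the query:

```python
import math

def recall(ground_truth_ids, result_ids):
    """Eq. (5): fraction of the exact K-NN that the query returned."""
    return len(set(ground_truth_ids) & set(result_ids)) / len(ground_truth_ids)

def error_ratio(q, ground_truth, result):
    """Eq. (6): mean ratio of returned vs. exact neighbor distances
    (ED = Euclidean distance); 1.0 is the ideal value."""
    ed = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    k = len(ground_truth)
    return sum(ed(q, r) / ed(q, g)
               for r, g in zip(result, ground_truth)) / k
```

For example, a result sharing half of the exact neighbor ids scores a recall of 0.5 regardless of how close the remaining neighbors are, which is why the paper reports both metrics together.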
Fig. 15: kNN Approximate Performance in Different Datasets (k: 500). ((a) recall, (b) error ratio, (c) average query time; series: Baseline, Target Node Access, One Partition Access, Multi-Partitions Access)

Fig. 16: Impact on kNN-Approximate Performance (RandomWalk); left: different dataset sizes with k value 5000; right: different k values for 400 million. ((a)(b) recall, (c)(d) error ratio, (e)(f) average query time)

Fig. 17: Impact of Sampling Percentage. ((a) Ratio of construction time, (b) Ratio of Tardis-G size, (c) MSE of partition distribution, (d) Ratio of error ratio (Top 500))

The recall of Multi-Partitions Access decreases faster than the others, while its error ratio behaves the same as the others. This tendency results from the fact that the distances between records in one partition tend to be small, but the ground truth disperses over more partitions. The error ratio increases and the recall decreases because the large dataset size leads to a dispersedness of the ground truth. The key point is that any iSAX-based signature is an approximate rather than an exact representation. The average query time does not change significantly because the number of partitions loaded by all query processes has no change.

In Figure 16 (right), we evaluate the effect of different k values on the benchmark dataset. The performance is affected greatly by the candidate scope determined by the granularity of the target node, whose lowest level is a leaf node. For the same L-MaxSize of 1,000, however, the average leaf node size of TARDIS is 32 whereas the baseline’s is 634, caused by the different fan-out. The recall of the baseline shows no obvious change. The slight increase of Target Node Access means that the target node of TARDIS more effectively holds similar time series because of the word-level similarity, while both One Partition Access and Multi-Partitions Access decrease due to the dispersedness of the ground truth for larger k values. Note that Multi-Partitions Access keeps the best accuracy even as the k value varies. For the error ratio, the turning point of the baseline in Figure 16(d) is caused by the promotion of the target node position from a leaf node to an internal node in the iBT structure: when k is less than 500, the target node is a leaf node with the large candidate scope of 634. One Partition Access is the upper bound of recall and the lower bound of error ratio for Target Node Access, because the candidate scope of One Partition Access becomes similar to that of Target Node Access as k increases, while One Partition Access is the best case of Target Node Access. The average query time does not change significantly because the number of partitions loaded by all query processes stays consistent for different k values.

D. Impact of Sampling

The sampling percentage impacts the estimation of the iSAX-T representation distribution, which determines the construction and index quality. The construction quality is measured by the construction time and the global index size, while the index quality is measured by the error ratio, which evaluates the cohesiveness of the partitions, and the MSE of the partition size distribution, which evaluates the partition size distribution estimation. Like a histogram method, the last metric computes the probability distribution of the partition sizes, with a 15-megabyte bucket interval, and then calculates the mean squared error (MSE).
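The histogram-style MSE metric described above can be sketched as follows; the 15 MB bucket width follows the text, while the size lists and the sampled-versus-full comparison are illustrative assumptions:

```python
def size_histogram(sizes_mb, bucket_mb=15):
    """Probability distribution of partition sizes over fixed-width buckets."""
    n_buckets = max(int(s // bucket_mb) for s in sizes_mb) + 1
    hist = [0.0] * n_buckets
    for s in sizes_mb:
        hist[int(s // bucket_mb)] += 1
    return [c / len(sizes_mb) for c in hist]

def distribution_mse(sampled_sizes, full_sizes, bucket_mb=15):
    """MSE between the partition-size distribution estimated from a
    sample and the one obtained from the full data."""
    p = size_histogram(sampled_sizes, bucket_mb)
    q = size_histogram(full_sizes, bucket_mb)
    if len(p) < len(q):                    # pad to a common bucket range
        p += [0.0] * (len(q) - len(p))
    else:
        q += [0.0] * (len(p) - len(q))
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)
```

A perfect estimate yields an MSE of 0; larger values indicate that the sampled distribution misjudges how many partitions fall into each size bucket.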
We compare the 1%, 5%, 10%, 20%, and 40% sampling percentages with the 100% case. In Figure 17(a), the sampling method greatly decreases the global index construction time on all datasets. In Figure 17(b), the smaller the percentage used, the smaller the generated index, because the sampling method only observes part of the representations. However, the Tardis-G generated from 1% of the data is still able to support the shuffle operation for the whole dataset. In Figure 17(c), the small percentages have large MSE values, and 10% has a similar effect as 100% on all datasets. We run top-500 kNN-Approximate queries to obtain the error ratio of Multi-Partitions Access; in Figure 17(d), the small percentages lead to a high ratio of the error ratio.

VII. RELATED WORK

Recent advances in sensor, network, and storage technologies have sped up the process of generating and collecting time series. Similarity querying, the fundamental problem of time series data mining, relies on the summarization and indexing of massive datasets. The literature on these topics is vast; see paper [2] and the references therein for useful surveys and empirical comparisons. The iSAX-based indexing family, such as iSAX [10], iSAX 2.0 [11], and the Adaptive Data Series Index (ADS) [25], demonstrates good scalability due to its bulk-loading mechanism on a centralized machine. The round-robin split policy [10] and the statistics-based split policy [11] are proposed for the binary split. While both iSAX and iSAX 2.0 build indices over the dataset up-front and leverage these indices for query processing, ADS [25] shifts the costly index creation steps from the initialization time to the query processing time. It interactively and adaptively builds parts of the index only

…up the compact index construction in distributed environments. Particularly, the word-level similarity feature of the sigTree effectively keeps a better similarity of time series. Additionally, the optimized similarity query processes leverage this flexible framework to improve the performance. Our experiments over the synthetic and real-world datasets validate that our new approach dramatically reduces the index construction time and substantially improves the query accuracy.

REFERENCES
[1] J. Ronkainen and A. Iivari, “Designing a data management pipeline for pervasive sensor communication systems,” PCS, 2015.
[2] P. Esling and C. Agon, “Time-series data mining,” CSUR, vol. 45, 2012.
[3] S. K. Jensen, T. B. Pedersen, and C. Thomsen, “Time series management systems: A survey,” TKDE, vol. 29, no. 11, pp. 2581–2600, 2017.
[4] T. Palpanas, “The parallel and distributed future of data series mining,” in HPCS. IEEE, 2017, pp. 916–920.
[5] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos, Fast subsequence matching in time-series databases. ACM, 1994, vol. 23.
[6] K.-P. Chan and A. W.-C. Fu, “Efficient time series matching by wavelets,” in ICDE. IEEE, 1999, pp. 126–133.
[7] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra, “Dimensionality reduction for fast similarity search in large time series databases,” KAIS, vol. 3, pp. 263–286, 2001.
[8] K. Chakrabarti, E. Keogh, S. Mehrotra, and M. Pazzani, “Locally adaptive dimensionality reduction for indexing large time series databases,” TODS, vol. 27, no. 2, pp. 188–228, 2002.
[9] J. Lin, E. Keogh, L. Wei, and S. Lonardi, “Experiencing sax: a novel symbolic representation of time series,” DMKD, vol. 15, 2007.
[10] J. Shieh and E. Keogh, “isax: indexing and mining terabyte sized time series,” in SIGKDD. ACM, 2008, pp. 623–631.
[11] A. Camerra, T. Palpanas, J. Shieh, and E. Keogh, “isax 2.0: Indexing and mining one billion time series,” in ICDM. IEEE, 2010, pp. 58–67.
[12] D.-E. Yagoubi, R. Akbarinia, F. Masseglia, and T. Palpanas, “Dpisax: Massively distributed partitioned isax,” in ICDM, 2017, pp. 1–6.
[13] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets.” HotCloud, 2010.
for the subsets of data on which the users pose queries. All [14] M. Delorme, M. Iori, and S. Martello, “Bin packing and cutting stock
aforementioned methods are based on a centralized machine. problems: Mathematical models and exact algorithms,” EJOR, 2016.
The authors [26] propose a distributed system which con- [15] B. H. Bloom, “Space time trade-offs in hash coding with allowable
errors,” Commun ACM, vol. 13, pp. 422–426, 1970.
structs vertical inverted tables and horizontal segment trees [16] Y. Park, M. Cafarella, and B. Mozafari, “Visualization-aware sampling
based on the PAA summarization of time series data. However, for very large databases,” in ICDE. IEEE, 2016, pp. 755–766.
the work in [26] cannot handle our target scale of billions of [17] Tableau. [Online]. Available: http://www.tableau.com
[18] H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. Keogh,
time series objects (they only experimented with 1K, 10K, “Querying and mining of time series data: experimental comparison of
and 100K objects). Moreover, it is explicitly stated in [26] for representations and distance measures,” PVLDB, vol. 1, 2008.
large k > 50, their kNN query performance degrades quickly [19] H. Jegou, M. Douze, and C. Schmid, “Product quantization for nearest
neighbor search,” TPAMI, vol. 33, pp. 117–128, 2011.
and converges to the brute force search. in contract, TARDIS [20] U. G. Institute. [Online]. Available: https://genome.ucsc.edu/
is designed for scalable k as well, e.g., k in thousands. [21] NCEI. [Online]. Available: https://www.ncdc.noaa.gov/cdo-web/datasets
Several other recent distributed systems have been proposed [22] Tardis report. [Online]. Available: https://github.com/lzhang6/TARDIS
[23] A. Gionis, P. Indyk, R. Motwani et al., “Similarity search in high
for managing different aspects of time series data. For exam- dimensions via hashing.” VLDB, 1999, pp. 518–529.
ple, Druid [27] and Gorilla [28] focus only on the storage and [24] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, “Multi-probe
compression aspects in a distributed environment. In contrast, lsh: efficient indexing for high-dimensional similarity search.” VLDB,
2007, pp. 950–961.
SciDB [29] focus on distributed linear algebra and statistical [25] K. Zoumpatianos, S. Idreos, and T. Palpanas, “Ads: the adaptive data
operations on time series data, and BTrDB [30] addresses series index,” VLDB Journal, vol. 25, pp. 843–866, 2016.
primitive operations on big time series data such as selection, [26] X. Wang, Z. Fang, P. Wang, R. Zhu, and W. Wang, “A distributed
multi-level composite index for knn processing on long time series,”
projection, and simple aggregations. All of these systems op- in DASFAA. Springer, 2017.
erate at the record-level, e.g., they support insertion, deletion, [27] F. Yang, E. Tschetter, X. Léauté, N. Ray, G. Merlino, and D. Ganguli,
and update of time series records. TARDIS is fundamentally “Druid: A real-time analytical data store,” in SIGMOD. ACM, 2014.
[28] T. Pelkonen, S. Franklin, J. Teller, P. Cavallaro, Q. Huang, J. Meza,
different from these systems as it a batch oriented and designed and K. Veeraraghavan, “Gorilla: A fast, scalable, in-memory time series
for other complex operations such as kNN queries. database,” PVLDB, pp. 1816–1827, 2015.
[29] M. Stonebraker, P. Brown, D. Zhang, and J. Becla, “Scidb: A database
VIII. C ONCLUSIONS management system for applications with complex analytics,” CiSE, pp.
54–62, 2013.
In this paper, we propose TARDIS, a scalable distributed in- [30] M. P. Andersen and D. E. Culler, “Btrdb: Optimizing storage system
dexing framework to index and query massive time series. We design for timeseries processing.” in FAST, 2016, pp. 39–52.
introduce sigTree and iSAX-T signature to simplify and speed
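The two quality metrics used in the sampling evaluation — the bucketed partition-size MSE and the error ratio of approximate kNN answers — can be sketched as follows. This is our illustrative sketch, not code from the TARDIS implementation: the function names and toy inputs are assumptions; only the 15 MB bucket interval follows the paper's setup.

```python
# Sketch (our assumptions, not TARDIS code): compare the partition-size
# distribution of an index built from a sample against the 100% case
# using 15 MB histogram buckets and Mean Squared Error (MSE), and
# compute the error ratio of an approximate kNN answer.

BUCKET_MB = 15  # bucket interval from the paper's setup

def bucketize(sizes_mb):
    """Histogram of partition sizes, bucketed in 15 MB intervals."""
    hist = {}
    for size in sizes_mb:
        b = int(size // BUCKET_MB)
        hist[b] = hist.get(b, 0) + 1
    return hist

def mse(hist_a, hist_b):
    """Mean squared error between two bucketed distributions."""
    buckets = set(hist_a) | set(hist_b)
    return sum((hist_a.get(b, 0) - hist_b.get(b, 0)) ** 2
               for b in buckets) / len(buckets)

def error_ratio(approx_dists, exact_dists):
    """Average ratio of approximate to exact kNN distances
    (1.0 means the approximate answer matches the exact one)."""
    return sum(a / e for a, e in zip(approx_dists, exact_dists)) / len(exact_dists)

# Toy example: partition sizes (in MB) from the full data vs. a sample.
full = [14, 16, 15, 30, 45, 15, 16]
sampled = [14, 15, 16, 29, 44, 16, 15]
print(mse(bucketize(full), bucketize(sampled)))  # small MSE -> similar layout
print(error_ratio([2.0, 4.0], [2.0, 2.0]))       # 1.5 -> 50% average error
```

A lower MSE indicates that the sampled index partitions the data nearly as evenly as the index built from the full dataset, which is why 10% sampling suffices in Figure 17(c).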
