Performance Tradeoffs in Read-Optimized Databases

Stavros Harizopoulos    Velen Liang    Samuel Madden    Daniel J. Abadi

ABSTRACT

Database systems have traditionally optimized performance for write-intensive workloads. Recently, there has been renewed interest in architectures that optimize read performance by using column-oriented data representation and light-weight compression. This previous work has shown that under certain broad classes of workloads, column-based systems can outperform row-based systems. Previous work, however, has not characterized the precise conditions under which a particular query workload can be expected to perform better on a column-oriented database.

In this paper we first identify the distinctive components of a read-optimized DBMS and describe our implementation of a high-performance query engine that can operate on both row- and column-oriented data. We then use our prototype to perform an in-depth analysis of the tradeoffs between column- and row-oriented architectures. We explore these tradeoffs in terms of disk bandwidth, CPU cache latency, and CPU cycles. We show that for most database workloads, a carefully designed column system can outperform a carefully designed row system, sometimes by an order of magnitude. We also present an analytical model to predict whether a given workload on a particular hardware configuration is likely to perform better on a row- or column-based system.

1. INTRODUCTION

A number of recent papers [21][7] have investigated column-oriented physical database designs (column stores), in which relational tables are stored by vertically partitioning them into single-column files. At first blush, the primary advantage of a column-oriented database is that it makes it possible to read just the subset of columns that are relevant to a query rather than requiring the database to read all of the data in a tuple, and the primary disadvantage is that it requires updates to write to a number of distinct locations on disk (separated by a seek) rather than to just a single file.

Surprisingly, these initial observations turn out not to be trivially true. For example, to obtain a benefit over a row store when reading several columns in a table, a column store must ensure that it can read large sequential portions of each column, since the cost of seeking between two fields in the same tuple (stored separately on disk) can be prohibitively expensive. Similarly, writes can actually be made relatively inexpensive as long as many inserts are buffered together so they can be done sequentially. In this paper, we carefully study how column orientation affects the performance characteristics of a read-optimized database storage manager, using scan-mostly queries.

1.1 Read-Optimized DBMS Design

While column-oriented systems often have their own sets of column-wise operators that can provide additional performance benefit to column stores [1], we focus in this paper on the differences between column and row stores related solely to the way data is stored on disk. To this end, we implement both a row- and a column-oriented storage manager from scratch in C++ and measure their performance with an identical set of relational operators. As data is brought into memory, normal row-store tuples are created in both systems, and standard row-store operations are performed on these tuples. This allows for a fixed query plan in our experiments and a more direct comparison of column and row systems from a data layout perspective.

Both of our systems are read-optimized, in the sense that the disk representation we use is tailored for read-only workloads rather than update-intensive workloads. This means, for example, that tuples on disk are dense-packed on pages, rather than being placed into a slotted-tuple structure with a free-list per page. Figure 1 shows the basic components of a generalized read-optimized DBMS, upon which we base both of our systems (solid lines show what we have actually implemented for the purposes of this paper). In this design we assume a staging area (the "write-optimized store") where updates are done, and a "read-optimized" store where tuples are permanently stored on disk. Tuples are periodically moved in bulk from the write store to the read store. A compression advisor and a materialized view (MV) advisor …

[Figure 1. Basic components of a read-optimized DBMS. The figure shows writes entering a write-optimized store that is periodically merged into a read-optimized store (DB storage with materialized views), a query engine with compression-aware operators reading from the read-optimized store, and the compression and MV advisors.]

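To make the dense-packed layout described above concrete, the sketch below is a minimal illustration of the idea (our own, not the paper's storage manager; all names and sizes are illustrative): a read-only page of fixed-width tuples needs only a count, whereas a write-friendly slotted page carries a slot directory and free-space bookkeeping.

    #include <cstdint>
    #include <vector>

    // Read-optimized page: fixed-width tuples packed back to back from offset 0.
    // Tuple i lives at data + i * tuple_width, so no slot directory or free-list
    // is needed and the page can be scanned sequentially.
    struct DensePackedPage {
        uint32_t tuple_width;                 // fixed width known from the schema
        uint32_t tuple_count;                 // tuples stored on this page
        char     data[64 * 1024];             // illustrative 64KB page body

        const char* tuple(uint32_t i) const { return data + i * tuple_width; }
    };

    // Write-optimized (conventional) slotted page, for contrast: each tuple is
    // reached through a slot entry so it can be inserted, deleted, or moved.
    struct SlottedPage {
        struct Slot { uint16_t offset; uint16_t length; };
        uint16_t          free_space_offset;  // start of unused space in data[]
        std::vector<Slot> slots;              // one indirection per tuple
        char              data[64 * 1024];
    };
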
[Figure 6. Baseline experiment (10% selectivity, LINEITEM). Left: total elapsed time (solid lines) and CPU time (dashed lines) for the column and row store, plotted against selected bytes per tuple; the total elapsed time is equal to I/O time since CPU time is overlapped, and the x-axis is spaced by the width of the selected attributes. Right: CPU time breakdowns (sys, usr-L2, usr-uop, usr-rest) versus the number of attributes selected (see Fig. 5 for the type of each attribute); the first two bars are for the row store, the rest are for the column store.]

(v) System load. The system load (disk utilization and CPU utilization) can significantly affect a single query's response time. Obviously, a disk-bound query will see a big increase in the response time if it is competing with other disk-bound queries. Competing disk and CPU traffic may again have different effects in column stores than in row stores.

We base all experiments on a variant of the following query:

    select A1, A2 … from TABLE
    where predicate (A1) yields variable selectivity

Since the number of selected attributes per query is the most important factor, we vary that number on the x-axis through all experiments. Table 1 summarizes the parameters considered, what the expected effect is in terms of time spent on disk, memory bus, and CPU, and which section discusses their effect.

Table 1: Expected performance trends in terms of elapsed disk, memory transfer, and CPU time (arrows facing up mean increased time), along with related experimentation sections.

    parameter                                        section
    selecting more attributes (column store only)    4.1
    decreased selectivity                            4.2
    narrower tuples                                  4.3
    compression                                      4.4
    larger prefetch                                  4.5
    more disk traffic                                4.5
    more CPUs / more disks                           5

4.1 Baseline experiment

    select L1, L2 … from LINEITEM
    where predicate (L1) yields 10% selectivity

As a reminder, a LINEITEM tuple is 150 bytes wide, contains 16 attributes, and the entire relation takes 9.5GB of space. Figure 6 shows the elapsed time for the above query for both row and column data. The graph on the left of the figure shows the total time (solid lines) and the CPU time separately (dashed lines) as we vary the number of selected attributes on the x-axis. Both systems are I/O-bound in our default configuration (1 CPU, 3 disks, no competition), and therefore the total time reflects the time it takes to retrieve data from disk. Both systems are designed to overlap I/O with computation (as discussed in Section 2). As expected, the row store is insensitive to projectivity (since it reads all data anyway), and therefore its curve remains flat. The column store, however, performs better most of the time, as it reads less data. Note that the x-axis is spaced by the width of the selected attributes: when selecting 8 attributes, for example, the column store reads 26 bytes per LINEITEM row, whereas for 9 attributes it reads 51 bytes (see Figure 5 for detailed schema information).

The "crossover" point at which the column store starts performing worse than the row store is when selecting more than 85% of a tuple's size. The reason it performs worse in that region is that it makes poorer utilization of the disk. A row store, for a single scan, enjoys full sequential bandwidth, whereas column stores need to seek between columns. The more columns they select, the more time they spend seeking (in addition to the time they spend reading the columns). Our prefetch buffer (48 I/O units) amortizes some of the seek cost. A smaller prefetch buffer would lower the performance of the column store in this configuration, whereas additional disk activity from other processes would move the crossover point all the way to the right. We show these two scenarios later, in Section 4.5.

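Using the tuple width given above, this break-even point can be expressed directly in bytes (a rough figure read off the measurements, not derived from first principles):

    \frac{\text{selected bytes per tuple}}{150\ \text{bytes}} > 0.85
    \;\Longrightarrow\;
    \text{selected bytes per tuple} \gtrsim 128\ \text{bytes}

In other words, with the 48-unit prefetch buffer and no competing traffic, only queries touching nearly the whole LINEITEM tuple favor the row store.
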
While this specific configuration is I/O-bound, it still makes sense to analyze CPU time (dashed lines in the left graph of Figure 6), as it can affect performance in CPU-bound configurations, or when the relations are cached in main memory. The left graph of Figure 6 shows the total CPU time for both systems, and we provide a breakdown of CPU costs in the graph on the right of Figure 6. The first two bars correspond to the row store, selecting 1 and 16 attributes (the two ends of our experiment). The rest of the bars belong to the column store, selecting from 1 to 16 attributes. The height of each stacked bar is the total CPU time in seconds. The bottom area (dark color) is the time (in seconds) spent in system mode; this is CPU time spent while Linux was executing I/O requests, and we do not provide any further breakdown of it. For the row store, the system time is the same regardless of the number of selected attributes. For the column store it keeps …

[Figure 11. Repeating the previous experiment for prefetch sizes of 48, 8, and 2 units, this time in the presence of another concurrent scan; each panel plots elapsed time (sec) against selected bytes per tuple for the "Row-2", "Column-2", and "Column-2 slow" configurations. See text for an explanation of the "slow" curve.]

… one step ahead allows the column system to be more aggressive in its submission of disk requests and, in our Linux system, to get favored by the disk controller. As a reference, we implemented a version of the column system that waits until a disk request from one column is served before submitting a request from another column. The results (the "slow" line in Figure 11) for this system are now closer to our initial expectations.

5. ANALYSIS

The previous section gave us enough experimental input to guide the construction of a simple set of equations to predict relative performance differences between row and column stores under various configurations. We are primarily interested in predicting the rate (in tuples/sec) at which row and column systems will operate for a given query and a given configuration. We summarize in Table 2 the parameters we use in our analysis, along with the different configurations these parameters can model.

Our analysis focuses on a setting where scan nodes continuously read input tuples which are then passed on to other relational operators in a pipelined fashion. We assume (as in our implementation) that CPU and I/O are well overlapped. For simplicity, we do not model disk seeks (we assume that the disks read sequentially the majority of the time). The rate (tuples/sec) at which a query can process its input is simply the minimum of the rate at which the disks can provide data and the rate at which the CPUs can process these data:

    R = \min(R_{DISK}, R_{CPU})    (1)

Table 2: Summary of parameters used in the analysis.

    parameter                        what it can model
    SizeFile, TupleWidth, Columns    various database schemas
    MemBytesCycle                    various speeds for the memory bus
    f                                number of attributes selected by a query (projection)
    I                                CPU work of each operator (can model various
                                     selectivities for scanners, or various
                                     decompression schemes)
    cpdb                             more/fewer disks; more/fewer CPUs;
                                     competing traffic for disk / CPU

Throughout the analysis, "DISK" and "CPU" refer to all disks and all CPUs made available for executing the query. Parallelizing a query is orthogonal to our analysis. If a query can run on three CPUs, for example, we will treat it as one that has three times the CPU bandwidth of a query that runs on a single CPU.

Disk analysis. We model the disk rate R_DISK (in tuples/sec) as the sum of the rates of all files read, weighted by the size of each file (in bytes):

    R_{DISK} = R_{File_1} \cdot \frac{SizeFile_1}{SizeFile_{ALL}} + R_{File_2} \cdot \frac{SizeFile_2}{SizeFile_{ALL}} + \ldots    (2)

For example, in the case of a merge-join, if File1 is 1GB and File2 is 10GB, then the disks process on average one byte from File1 for every ten bytes from File2. The individual file rates are defined as:

    R_{File_N} = \frac{DiskBW}{TupleWidth_N}

DiskBW is simply the available bandwidth from the disks in bytes/sec, which we divide by the width of the tuples to obtain the rate in tuples/sec. DiskBW is always the full sequential bandwidth (we assume large prefetching buffers that minimize the time spent in disk seeks, as shown in the previous section). Note that:

    SizeFile_i = N_i \cdot TupleWidth_i

where N_i is the cardinality of relation i. We can now rewrite (2) as:

    R_{DISK} = DiskBW \cdot \frac{N_1 + N_2 + \ldots}{SizeFile_{ALL}}    (3)

For a column store, we can derive a similar equation to the one above:

    R_{DISK} = DiskBW \cdot \frac{N_1 f_1 + N_2 f_2 + \ldots}{SizeFile_{ALL}}    (4)

where f1, f2, etc. are the factors by which a regular row tuple is larger than the total size of the attributes needed by a query. For example, if a column system needs to read only two integers (8 bytes) from ORDERS (32-byte tuples), the factor f is 4 (= 32 / 8).

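As a quick sanity check on equations (3) and (4), the sketch below evaluates both rates for a single-table scan. It is our own illustration: the bandwidth, cardinality, and byte counts are assumed, LINEITEM-like numbers, not measurements from the paper.

    #include <cstdio>

    int main() {
        // Assumed numbers for illustration only.
        double disk_bw     = 180e6;   // aggregate sequential bandwidth, bytes/sec
        double n_tuples    = 60e6;    // cardinality N of the single table scanned
        double tuple_width = 150;     // row-store tuple width in bytes
        double selected    = 26;      // bytes per tuple actually needed by the query

        double size_all = n_tuples * tuple_width;      // SizeFileALL
        double f        = tuple_width / selected;      // factor f from Section 5

        double r_disk_row = disk_bw * n_tuples / size_all;       // equation (3)
        double r_disk_col = disk_bw * n_tuples * f / size_all;   // equation (4)

        printf("row store:    %.2f Mtuples/sec\n", r_disk_row / 1e6);
        printf("column store: %.2f Mtuples/sec\n", r_disk_col / 1e6);
        // The ratio of the two rates is simply f: the column store reads 1/f of the bytes.
        return 0;
    }
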
CPU analysis. To model the CPU rate, we assume a cascaded connection of all relational operators and scanners. If an operator processes Op_1 tuples/sec, and is connected to another operator with rate Op_2, which in turn is connected to an operator with rate Op_3, and so on, the overall CPU rate R_CPU is:

    \frac{1}{R_{CPU}} = \frac{1}{Op_1} + \frac{1}{Op_2} + \frac{1}{Op_3} + \ldots    (5)

The above formula resembles the formula for computing the equivalent resistance of a circuit where resistors are connected in parallel; we adopt the same notation (two parallel bars) and rewrite (5), this time including the rates of the various scanners:

    R_{CPU} = Op_1 \,\|\, Op_2 \,\|\, \ldots \,\|\, Scan_1 \,\|\, Scan_2 \,\|\, \ldots    (6)

As an example, consider one operator processing 4 tuples/sec, connected to an operator that processes 6 tuples/sec. The overall rate of tuple production in the system is:

    Op_1 \,\|\, Op_2 = \frac{Op_1 \cdot Op_2}{Op_1 + Op_2}

or 2.4 tuples/sec.

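In code, the "parallel resistor" combination of equations (5) and (6) is just a reciprocal sum; this small sketch (ours) reproduces the 4 || 6 = 2.4 tuples/sec example above.

    #include <cstdio>

    // Combine pipelined operator rates as in equations (5)-(6):
    // 1/R = 1/Op1 + 1/Op2 + ...  (the "parallel resistor" rule).
    double par(double a, double b) { return (a * b) / (a + b); }

    int main() {
        double r = par(4.0, 6.0);                         // operators at 4 and 6 tuples/sec
        printf("overall rate = %.1f tuples/sec\n", r);    // prints 2.4
        return 0;
    }
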
The rate of an operator is approximated as:

    Op = \frac{clock}{I_{Op}}    (7)

where clock is the available cycles per second from the CPUs (e.g., for our single-CPU machine, clock is 3.2 billion cycles per second). I_Op is the total number of CPU instructions it takes the operator to process one tuple. Note that we approximate the rate by assuming 1 CPU cycle per instruction. If the actual ratio is available, we can replace it in the above formula.

To compute the rate of a scanner, we have to take into consideration the CPU-system time, the CPU-user time, and the time it takes the memory to deliver tuples to the L2 cache. We treat CPU-system and CPU-user as two different operators. Further, we compute the CPU-user rate as the minimum of the pure computation rate and the rate at which the memory can provide tuples to the L2 cache. The latter is equal to the memory bandwidth divided by the tuple width. We compute memory bandwidth as clock times the number of bytes that arrive per CPU cycle (MemBytesCycle).

Therefore we can write the rate Scan of a scanner as:

    Scan = \frac{clock}{I_{system}} \,\Big\|\, \min\!\left(\frac{clock}{I_{user}},\ \frac{clock \cdot MemBytesCycle}{TupleWidth}\right)    (8)

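Equation (8) reads as "system-mode cost in series with the slower of pure computation and memory delivery." The sketch below (ours; the per-tuple instruction counts and memory rate are hypothetical) spells out that structure.

    #include <algorithm>
    #include <cstdio>

    double par(double a, double b) { return (a * b) / (a + b); }

    // Scanner rate per equation (8). i_system and i_user are instructions
    // (approximately cycles) per tuple; mem_bytes_cycle is bytes delivered to
    // the L2 cache per CPU cycle.
    double scan_rate(double clock, double i_system, double i_user,
                     double mem_bytes_cycle, double tuple_width) {
        double cpu_user = std::min(clock / i_user,
                                   clock * mem_bytes_cycle / tuple_width);
        return par(clock / i_system, cpu_user);
    }

    int main() {
        // Hypothetical numbers: 3.2 GHz CPU, 400 system + 600 user instructions
        // per 150-byte tuple, 1 byte per cycle from memory.
        double r = scan_rate(3.2e9, 400, 600, 1.0, 150);
        printf("scanner rate = %.2f Mtuples/sec\n", r / 1e6);
        return 0;
    }
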
Speedup of columns over rows. We are now ready to compute the speedup of columns over rows by dividing the corresponding rates (using (1), (3), (4), (6), (7) and (8)):

    Speedup = \frac{\min\!\left(\frac{N_1 f_1 + N_2 f_2 + \ldots}{SizeFile_{ALL}},\ cpdb \cdot \left(\frac{1}{I_{op}} \,\|\, \frac{1}{I_{ScanC}} \,\|\, \ldots\right)\right)}{\min\!\left(\frac{N_1 + N_2 + \ldots}{SizeFile_{ALL}},\ cpdb \cdot \left(\frac{1}{I_{op}} \,\|\, \frac{1}{I_{ScanR}} \,\|\, \ldots\right)\right)}

To derive the above formula we divided all members by DiskBW, and replaced the following quantity:

    cpdb = \frac{clock}{DiskBW}

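Putting the pieces together, the following sketch (our own; every numeric value is an assumption chosen for illustration, not a measurement from the paper) evaluates the speedup formula for a single-table scan with one downstream operator. The costs I_ScanC and I_ScanR stand for the per-tuple cycle costs that equation (8)-style accounting would produce for the column and row scanners.

    #include <algorithm>
    #include <cstdio>

    double par(double a, double b) { return (a * b) / (a + b); }

    // Speedup of a column store over a row store for a single-table scan,
    // following the formula at the end of Section 5. Both rates are divided by
    // DiskBW, so the disk term is "tuples per byte" and the CPU term is
    // cpdb * (tuples per cycle).
    double speedup(double n, double size_all, double f, double cpdb,
                   double i_op, double i_scan_col, double i_scan_row) {
        double col = std::min(n * f / size_all,
                              cpdb * par(1.0 / i_op, 1.0 / i_scan_col));
        double row = std::min(n / size_all,
                              cpdb * par(1.0 / i_op, 1.0 / i_scan_row));
        return col / row;
    }

    int main() {
        // Assumed workload: 60M tuples of 150 bytes, query touches 26 bytes per
        // tuple (f ~ 5.8); the downstream operator, the column scanner, and the
        // row scanner cost roughly 1000, 1200, and 1000 cycles per tuple.
        double n = 60e6, width = 150, size_all = n * width, f = width / 26;
        printf("cpdb = 18 (1 CPU / 3 disks): speedup = %.2f\n",
               speedup(n, size_all, f, 18, 1000, 1200, 1000));
        printf("cpdb = 54 (1 CPU / 1 disk) : speedup = %.2f\n",
               speedup(n, size_all, f, 54, 1000, 1200, 1000));
        return 0;
    }

With these assumed numbers the column store's advantage grows as cpdb grows, i.e., as CPU cycles become plentiful relative to disk bytes, which is the trend discussed next.
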
cpdb (cycles per disk byte) combines into a single parameter the available disk and CPU resources for a given configuration. It shows how many (aggregate) CPU cycles elapse in the time it takes the disks to sequentially deliver a byte of information. For example, the machine used in this paper (one CPU, three disks) is rated at 18 cpdb. By operating on a single disk, the cpdb rating jumps to 54.

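For reference, the 18-cpdb rating quoted above for the experimental machine (3.2 GHz CPU, three disks) implies an aggregate sequential bandwidth of roughly

    DiskBW = \frac{clock}{cpdb} = \frac{3.2 \times 10^9\ \text{cycles/s}}{18\ \text{cycles/byte}} \approx 178\ \text{MB/s},

i.e., about 60 MB/s per disk; running on a single disk cuts DiskBW to a third, which is why cpdb jumps to 54.
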
Parameters such as competing traffic and the number of disks/CPUs can be modeled through the cpdb rating. Since competing CPU traffic fights for cycles, the cpdb rating for a given query drops. On the other hand, competing disk traffic causes cpdb to increase. Looking up trends in CPU [3] and disk speed^8, we find that, for a single CPU over a single disk, cpdb has been slowly growing, from 10 in 1995 to 30 in 2005. With the advent of multicore chips, we expect cpdb to grow faster. When calculating the cpdb rating of an arbitrary configuration, note that disk bandwidth is limited by the maximum bandwidth of the disk controllers.

    8. http://www.hitachigst.com/hdd/technolo/overview/chart16.html

We use the speedup formula to predict the relative performance of column systems over row systems for various configurations by changing the cpdb rating. In disk-bound systems (when the disk rate is lower than the CPU rate), column stores outperform row stores by the same ratio as the total bytes selected over the total size of the files. In CPU-bound systems, either system can be faster, depending on the cost of the scanners. In the previous section we saw some conditions (low selectivity, narrow tuples) under which row stores outperform column stores. Note that a high-cost relational operator lowers the CPU rate, so the difference between columns and rows in a CPU-bound system becomes less noticeable. The formula can also be used to examine whether a specific query on a specific configuration is likely to be I/O- or CPU-bound.

The graph shown in the introduction (Figure 2) is constructed from the speedup formula, filling in actual CPU rates from our experimental section. In that figure we use a 50% projection of the attributes in the relation and 10% selectivity. The graph shows that row stores outperform column stores only in very limited settings.

6. RELATED WORK

The decomposition storage model (DSM) [8] was one of the first proposed models for a column-oriented system designed to improve I/O by reducing the amount of data that needs to be read off disk. It also improves cache performance by maximizing inter-record spatial locality [6]. DSM differs from the standard row-store N-ary storage model (NSM) by decomposing a relation with n attributes into n vertically partitioned tables, each containing one column from the original relation and a column of TupleIDs used for tuple reconstruction. The DSM model attempts to improve tuple reconstruction performance by maintaining a clustered index on TupleID; however, complete tuple reconstruction remained slow. As a result, a variety of optimizations and hybrid NSM/DSM schemes have been proposed in subsequent years: (a) partitioning relations based on how often attributes appear together in a query [9]; (b) using different storage models for different mirrors of a database, with queries that perform well on NSM data or DSM data being sent to the appropriate mirror [19]; and (c) optimizing the choice of horizontal and vertical partitioning given a database workload [2].

Recently there has been a reemergence of pure vertically partitioned column-oriented systems, as modern computer architecture trends favor the I/O and memory bandwidth efficiency that column stores have to offer [7][21]. PAX [4] proposes a column-based layout for the records within a database page, taking advantage of the increased spatial locality to improve cache performance, similarly to column-based stores. However, since PAX does not change the actual contents of the page, its I/O performance is identical to that of a row store. The Fates database system [20] organizes data on disk in a column-based fashion and relies on clever data placement to minimize the seek and rotational delays involved in retrieving data from multiple columns.

While it is generally accepted that generic row stores are preferred for OLTP workloads, in this paper we complement recent proposals by exploring the fundamental tradeoffs between column- and row-oriented DBMSs on read-mostly workloads. Ongoing work [12] compares row- and column-based pages using the Shore storage manager. The authors examine 100% selectivity and focus on 100% projectivity, both of which are the least favorable workloads for pipelined column scanners, as we showed earlier.

7. CONCLUSIONS

In this paper, we compare the performance of read-intensive column- and row-oriented database systems in a controlled implementation. We find that column stores, with appropriate prefetching, can almost always make better use of disk bandwidth than row stores, but that in a limited number of situations their CPU performance is not as good. In particular, they do less well when they process very narrow tuples, use long projection lists, or apply non-selective predicates. We use our implementation to derive an analytical model that predicts query performance for a particular disk and CPU configuration, and we find that current architectural trends suggest that column stores, even without other advantages (such as the ability to operate directly on compressed data [1] or vectorized processing [7]), will become an even more attractive architecture with time. Hence, a column-oriented database seems to be an attractive design for future read-oriented database systems.

8. ACKNOWLEDGMENTS

We thank David DeWitt, Mike Stonebraker, and the VLDB reviewers for their helpful comments. This work was supported by the National Science Foundation under Grants 0520032, 0448124, and 0325525.

9. REFERENCES

[1] D. J. Abadi, S. Madden, and M. Ferreira. "Integrating Compression and Execution in Column-Oriented Database Systems." In Proc. SIGMOD, 2006.
[2] S. Agrawal, V. R. Narasayya, and B. Yang. "Integrating Vertical and Horizontal Partitioning Into Automated Physical Database Design." In Proc. SIGMOD, 2004.
[3] A. Ailamaki. "Database Architecture for New Hardware." Tutorial. In Proc. VLDB, 2004.
[4] A. Ailamaki, D. J. DeWitt, et al. "Weaving Relations for Cache Performance." In Proc. VLDB, 2001.
[5] A. Ailamaki, D. J. DeWitt, et al. "DBMSs on a modern processor: Where does time go?" In Proc. VLDB, 1999.
[6] P. A. Boncz, S. Manegold, and M. L. Kersten. "Database Architecture Optimized for the New Bottleneck: Memory Access." In Proc. VLDB, 1999.
[7] P. Boncz, M. Zukowski, and N. Nes. "MonetDB/X100: Hyper-Pipelining Query Execution." In Proc. CIDR, 2005.
[8] G. P. Copeland and S. Khoshafian. "A Decomposition Storage Model." In Proc. SIGMOD, 1985.
[9] D. W. Cornell and P. S. Yu. "An Effective Approach to Vertical Partitioning for Physical Design of Relational Databases." IEEE Transactions on Software Engineering, 16(2):248-258, 1990.
[10] J. Goldstein, R. Ramakrishnan, and U. Shaft. "Compressing Relations and Indexes." In Proc. ICDE, 1998.
[11] G. Graefe and L. D. Shapiro. "Data compression and database performance." In Proc. ACM/IEEE-CS Symposium on Applied Computing, Kansas City, MO, 1991.
[12] A. Halverson, J. L. Beckmann, J. F. Naughton, and D. J. DeWitt. "A Comparison of C-Store and Row-Store in a Common Framework." Technical Report TR1566, University of Wisconsin-Madison, Department of Computer Sciences, 2006.
[13] S. Harizopoulos, V. Shkapenyuk, and A. Ailamaki. "QPipe: A Simultaneously Pipelined Relational Query Engine." In Proc. SIGMOD, 2005.
[14] J. L. Hennessy and D. A. Patterson. "Computer Architecture: A Quantitative Approach." 2nd ed., Morgan Kaufmann, 1996.
[15] K. Keeton, D. A. Patterson, Y. Q. He, R. C. Raphael, and W. E. Baker. "Performance Characterization of a Quad Pentium Pro SMP Using OLTP Workloads." In Proc. ISCA-25, 1998.
[16] P. J. Mucci, S. Browne, C. Deane, and G. Ho. "PAPI: A Portable Interface to Hardware Performance Counters." In Proc. Department of Defense HPCMP Users Group Conference, Monterey, CA, June 1999.
[17] S. Padmanabhan, T. Malkemus, R. Agarwal, and A. Jhingran. "Block Oriented Processing of Relational Database Operations in Modern Computer Architectures." In Proc. ICDE, 2001.
[18] M. Poss and D. Potapov. "Data Compression in Oracle." In Proc. VLDB, 2003.
[19] R. Ramamurthy, D. J. DeWitt, and Q. Su. "A Case for Fractured Mirrors." In Proc. VLDB, 2002.
[20] M. Shao, J. Schindler, S. W. Schlosser, A. Ailamaki, and G. R. Ganger. "Clotho: Decoupling memory page layout from storage organization." In Proc. VLDB, 2004.
[21] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. J. O'Neil, P. E. O'Neil, A. Rasin, N. Tran, and S. B. Zdonik. "C-Store: A Column-oriented DBMS." In Proc. VLDB, 2005.
[22] T. Westmann, D. Kossmann, S. Helmer, and G. Moerkotte. "The Implementation and Performance of Compressed Databases." SIGMOD Record, 29(3):55-67, Sept. 2000.
[23] J. Zhou and K. A. Ross. "Buffering Database Operations for Enhanced Instruction Cache Performance." In Proc. SIGMOD, 2004.
[24] M. Zukowski, S. Heman, N. Nes, and P. Boncz. "Super-Scalar RAM-CPU Cache Compression." In Proc. ICDE, 2006.