0% found this document useful (0 votes)
8 views

Column-vs-Row databases

The document compares column-stores and row-stores, highlighting their differences in data storage, efficiency for read-only queries, and optimization techniques. It discusses the advantages of column-stores in terms of I/O efficiency and compression, while also noting potential drawbacks for certain applications. Additionally, it explores methods to emulate column-store benefits in row-stores through vertical partitioning and materialized views.

Uploaded by

Usama Riaz
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views

Column-vs-Row databases

The document compares column-stores and row-stores, highlighting their differences in data storage, efficiency for read-only queries, and optimization techniques. It discusses the advantages of column-stores in terms of I/O efficiency and compression, while also noting potential drawbacks for certain applications. Additionally, it explores methods to emulate column-store benefits in row-stores through vertical partitioning and materialized views.

Uploaded by

Usama Riaz
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 12

11/30/2024

Column-Stores vs. Row-Stores

Contents

 Column-store introduction
 Column-store data model
 Emulation of Column Store in Row store
 Column store optimization
 Experiment and Results
 Conclusion

1
11/30/2024

Row Store and Column Store

Figure taken form [2]

 In row store data are stored in the disk tuple by tuple.


 Where in column store data are stored in the disk
column by column

Row Store and Column Store


 Most of the queries does not process all the attributes of a
particular relation.

 For example the query


Select c.name and c.address
From CUSTOMES as c
Where c.region=abc;

 Only process three attributes of the relation CUSTOMER. But


the customer relation can have more than three attributes.

 Column-stores are more I/O efficient for read-only queries as


they read, only those attributes which are accessed by a query.

2
11/30/2024

Row Store and Column Store

Row Store Column Store

(+) Easy to add/modify a record (+) Only need to read in relevant data

(-) Might read in unnecessary data (-) Tuple writes require multiple accesses

 So column stores are suitable for read-mostly, read-


intensive, large data repositories

Why Column Stores?


 Can be significantly faster than row stores for some
applications
 Fetch only required columns for a query
 Better cache effects
 Better compression (similar attribute values within a column)
 But can be slower for other applications
 OLTP with many row inserts, ..

3
11/30/2024

Column Stores - Data Model

 Standard relational logical data model


 EMP(name, age, salary, dept)
 DEPT(dname, floor)
 Table – collection of projections
 Projection – set of columns
 Horizontally partitioned into segments with segment
identifier

Column Stores - Data Model

 To answer queries, projections are joined using Storage


keys and join indexes
 Storage Keys:
 Within a segment, every data value of every column is
associated with a unique Skey
 Values from different columns with matching Skey belong to
the same logical row

4
11/30/2024

Column Stores – Data Model

 Join Indexes
 T1 and T2 are projections on T
 M segments in T1 and N segments in T2
 Join Index from T1 to T2 is a table of the form:
 (s: Segment ID in T2, k: Storage key in Segment s)
 Each row in join index matches corresponding row in T1
 Join indexes are built such that T could be efficiently
reconstructed from T1 and T2

Compression
 Trades I/O for CPU
 Increased column-store opportunities:
 Higher data value locality in column stores
 Techniques such as run length encoding far more useful
 Schemes
 Null Suppression
 Dictionary encoding
 Run Length encoding
 Bit-Vector encoding
 Heavyweight schemes

10

5
11/30/2024

Query Execution - Operators

 Select: Same as relational algebra, but produces a bit


string
 Project: Same as relational algebra
 Join: Joins projections according to predicates

 Aggregation: SQL like aggregates


 Sort: Sort all columns of a projection

11

Query Execution - Operators


 Decompress: Converts compressed column to
uncompressed representation
 Mask(Bitstring B, Projection Cs) => emit only those
values whose corresponding bits are 1
 Concat: Combines one or more projections sorted in
the same order into a single projection
 Permute: Permutes a projection according to the
ordering defined by a join index
 Bitstring operators: Band – Bitwise AND, Bor – Bitwise
OR, Bnot – complement

12

6
11/30/2024

Row Store Vs Column Store


 Now the simplistic view about the difference in storage
layout leads to that one can obtain the performance
benefits of a column-store using a row-store by making
some changes to the physical structure of the row store.

 This changes can be


 Vertically partitioning
 Using index-only plans
 Using materialized views

13

Vertical Partitioning
 Process:
 Full Vertical partitioning of each relation
 Each column =1 Physical table
 This can be achieved by adding integer position column to every table
 Adding integer position is better than adding primary key
 Join on Position for multi column fetch

 Problems:
 “Position” - Space and disk bandwidth
 Header for every tuple – further space wastage
 e.g. 24 byte overhead in PostgreSQL

14

7
11/30/2024

Index-only plans: Example

15

Materialized Views
 Process:
 Create ‘optimal' set of MVs for given query workload
 Objective:
 Provide just the required data
 Avoid overheads
 Performs better
 Expected to perform better than other two approach

 Problems:
 Practical only in limited situation
 Require knowledge of query workloads in advance

16

8
11/30/2024

Materialized Views: Example

 Select F.custID
from Facts as F
where F.price>20

17

Optimizing Column oriented Execution

 Different optimization for column oriented database


 Compression
 Late Materialization
 Block Iteration
 Invisible Join

18

9
11/30/2024

Compression

 Low information entropy (high data value locality) leads


to High compression ratio
 Advantage
 Disk Space is saved
 Less I/O
 CPU cost decrease if we can perform operation without
decompressing
 Light weight compression schemes do better

19

Compression

 If data is sorted on one column that column will be


super-compressible in row store

 eg. Run length encoding

Figure taken form [2]


20

10
11/30/2024

Late Materialization

 Most query results entity-at-a-time not column-at-a-time


 So at some point of time multiple column must be
combined
 One simple approach is to join the columns relevant for a
particular query
 But further performance can be improve using late-
materialization

21

Late Materialization

 Delay Tuple Construction


 Might avoid constructing it altogether
 Intermediate position lists might need to be constructed
 Eg: SELECT R.a FROM R WHERE R.c = 5 AND R.b = 10
 Output of each predicate is a bit string
 Perform Bitwise AND
 Use final position list to extract R.a

22

11
11/30/2024

Late Materialization

 Advantages
 Unnecessary construction of tuple is avoided
 Direct operation on compressed data
 Cache performance is improved (PAX)

23

Thank You!

24

12

You might also like