ClickHouse_grokking

Uploaded by

vanvan99

ClickHouse

DBMS for data analytics

Hung Vo
Introduction

An open source column-oriented database
management system capable of real time
generation of analytical data reports using SQL
queries.
Introduction

Blazing Fast

Linearly Scalable

Hardware Efficient

Fault Tolerant

Feature Rich

Highly Reliable

Simple and Handy
Key Features

True column-oriented storage

Vectorized query execution

Data compression

Parallel and distributed query execution

Real time query processing

Real time data ingestion

On-disk locality of reference

Cross-datacenter replication

High availability

SQL support
Key Features

Local and distributed joins

Pluggable external dimension tables

Arrays and nested data types

Approximate query processing

Probabilistic data structures

Full support of IPv6

Features for web analytics

State-of-the-art algorithms

Detailed documentation

Clean documented code
Feature Rich

ClickHouse features a user-friendly SQL query dialect with a number of built-in
analytics capabilities. For example, it includes probabilistic data structures for fast
and memory-efficient calculation of cardinalities and quantiles. There are
functions for working with dates, times and time zones, as well as some specialized ones
for handling URLs and IPs (both IPv4 and IPv6), and many more.
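To give a feel for how a probabilistic structure can estimate cardinality in bounded memory, here is a minimal Python sketch of a "k minimum values" estimator. It is an illustration only and is not the algorithm ClickHouse uses for its uniq* functions; the names hash01 and approx_distinct are hypothetical.

```python
import hashlib
import heapq


def hash01(value: str) -> float:
    """Map a value to a roughly uniform float between 0 and 1 (stable hash)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def approx_distinct(values, k: int = 1024) -> float:
    """Estimate the number of distinct values, remembering at most k hashes."""
    kept = set()  # the hashes currently kept, for duplicate detection
    heap = []     # negated hashes, so heap[0] is the largest kept hash
    for v in values:
        h = hash01(v)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the largest kept one: swap them.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct values: exact count
    kth_smallest = -heap[0]
    # If n uniform hashes fall in [0, 1), the k-th smallest sits near k / n.
    return (k - 1) / kth_smallest
```

Memory stays bounded by k regardless of input size; a larger k gives a tighter estimate at the cost of more memory.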

Data organizing options available in ClickHouse, such as arrays, array joins, tuples
and nested data structures, are extremely efficient for managing denormalized
data.
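The idea behind an array join can be sketched in a few lines of Python: every row that carries an array is unrolled into one row per array element. The function name array_join and the sample data are hypothetical. Like a plain ARRAY JOIN in ClickHouse, this sketch drops rows whose array is empty.

```python
def array_join(rows, array_column):
    """Yield one output row per element of each row's array column."""
    for row in rows:
        for element in row[array_column]:
            out = dict(row)              # copy the non-array columns
            out[array_column] = element  # replace the array by one element
            yield out


visits = [
    {"user": "alice", "tags": ["news", "sport"]},
    {"user": "bob", "tags": []},
    {"user": "carol", "tags": ["music"]},
]
flat = list(array_join(visits, "tags"))
```

Here flat holds three rows: two for alice, none for bob, one for carol.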

ClickHouse can join both distributed data and co-located data, as the
system supports local joins and distributed joins. It also offers
external dictionaries, dimension tables loaded from an external source, for seamless
joins with simple syntax.

ClickHouse supports approximate query processing: you can trade some accuracy for
speed, which is indispensable when dealing with terabytes and petabytes of data.
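One common way to trade accuracy for speed is to run a query on a deterministic sample of the data and scale the result. The Python sketch below illustrates that idea under simple assumptions; in_sample and approx_count are hypothetical names, and the hashing scheme is not ClickHouse's own.

```python
import hashlib


def in_sample(key: str, fraction: float) -> bool:
    """Decide deterministically whether a key belongs to the sample."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


def approx_count(keys, fraction: float) -> float:
    """Count only the sampled keys, then scale up by the sampling fraction."""
    sampled = sum(1 for key in keys if in_sample(key, fraction))
    return sampled / fraction
```

Because membership depends only on the key's hash, the same key is always in or out of the sample, so repeated queries see a consistent subset.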

Conditional aggregate functions and the calculation of totals and extremes
let you get results with a single query instead of running several.
When to use ClickHouse
For analytics over a stream of clean, well-structured and immutable events or
logs. It is recommended to put each such stream into a single wide fact table
with pre-joined dimensions.

Web and App analytics

Advertising networks and RTB

Telecommunications

E-commerce and finance

Information security

Monitoring and telemetry

Time series

Business intelligence

Online games

Internet of Things
When NOT to use

Transactional workloads (OLTP): ClickHouse has no
UPDATE statement and no full-featured transactions.

Key-value access with a high request rate: If you need a high load
of small single-row queries, use another system.

Blob store, document-oriented: ClickHouse is intended for
vast amounts of fine-grained data.

Over-normalized data: It is better to build a single wide fact
table with pre-joined dimensions.
Interfaces

HTTP REST

clickhouse-client

JDBC (production), ODBC (beta)

Languages

Python, PHP, Perl, Go,

Node.js, Ruby, C++, .NET, Scala, R, Julia, Rust

Input/Output data formats



CSV, TSV, JSON, CapnProto, XML
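As a small illustration of the HTTP interface, the Python sketch below builds a query URL. It assumes a server on localhost listening on ClickHouse's default HTTP port 8123; the helper name build_query_url is hypothetical.

```python
from urllib.parse import urlencode


def build_query_url(sql, host="localhost", port=8123, fmt=None):
    """Build a URL that passes a query to the HTTP interface."""
    if fmt:
        sql = f"{sql} FORMAT {fmt}"  # ask for a specific output format
    return f"http://{host}:{port}/?{urlencode({'query': sql})}"


url = build_query_url("SELECT 1", fmt="CSV")
```

With a server running, the URL could be fetched with urllib.request.urlopen(url).read() or any other HTTP client.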
Data types

UInt8, UInt16, UInt32, UInt64, Int8, Int16, Int32, Int64

Float32, Float64

Boolean: stored as UInt8, restricted to the values 0 and 1

String, FixedString(N)

Date/DateTime

Enum

Array

AggregateFunction

Tuple

Nested data structure
SQL

SELECT

INSERT INTO

CREATE DATABASE/TABLE/[MATERIALIZED] VIEW

ALTER: Column Manipulations, Partitions/Parts

ALTER TABLE … DELETE WHERE …

ATTACH, DROP, DETACH, RENAME, USE, SET,
OPTIMIZE, KILL QUERY
SQL SELECT
SQL Functions

Arithmetic, Rounding, Mathematical

Comparison, Logical, Conditional

Type conversion

Dates/Times

String

Bit, Hash, Array

URLs, IP, JSON

Geographical coordinates

Higher-order: lambda, arrayMap/Filter...
Aggregate Functions

Normal Aggregate functions:
– count, min, max, any*, sum*, avg, median
– stddev*, var*, covar*, corr
– uniq*, quantile*, topK

Aggregate function combinators: -If, -Array, -State, -Merge,
-MergeState, -ForEach

Parametric aggregate functions: sequenceMatch,
sequenceCount, windowFunnel, retention, uniqUpTo.
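The combinators are easier to picture with a small model. The Python sketch below imitates two of them: -If (aggregate only the rows that satisfy a condition) and -State / -Merge (keep an intermediate aggregation state and combine states later). The names sum_if, AvgState and avg_state are hypothetical, and the sketch says nothing about how ClickHouse stores states internally.

```python
from dataclasses import dataclass


def sum_if(rows, value, condition):
    """Model of sumIf: sum value(row) over rows where condition(row) holds."""
    return sum(value(row) for row in rows if condition(row))


@dataclass
class AvgState:
    """Intermediate state of an average: enough to merge and finalize later."""
    total: float = 0.0
    count: int = 0

    def merge(self, other: "AvgState") -> "AvgState":
        return AvgState(self.total + other.total, self.count + other.count)

    def result(self) -> float:
        return self.total / self.count if self.count else float("nan")


def avg_state(values) -> AvgState:
    """Model of avgState: return the state instead of the final value."""
    values = list(values)
    return AvgState(float(sum(values)), len(values))


# Two partial states, e.g. from two parts or two servers, merged afterwards.
merged = avg_state([1, 2, 3]).merge(avg_state([10]))
```

Merging states rather than final values matters: averaging the two partial averages (2 and 10) would give 6, while the merged state gives the correct 4.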
Table engines

MergeTree family

TinyLog, Log, Memory, Buffer, External data

Distributed, Dictionary, Merge, File, URL

View, MaterializedView

Integrations: Kafka, MySQL
Table engines - MergeTree

Stores data sorted by primary key: This allows you to create a
small sparse index that helps find data faster.

Supports partitions if the partitioning key is
specified: ClickHouse supports certain operations with partitions
that are more effective than general operations on the same
data with the same result. ClickHouse also automatically cuts
off the partition data when the partitioning key is specified in
the query, which also improves query performance.

Data replication support: The family of ReplicatedMergeTree
tables is used for this.

Data sampling support.
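The sparse index idea can be sketched in Python: because rows are sorted by key, it is enough to remember the first key of every block of rows (a "granule") and binary-search those marks to find which blocks a key range can touch. The granule size of 4 is chosen for readability (ClickHouse's default index_granularity is 8192), and the function names are hypothetical.

```python
import bisect

GRANULE = 4  # rows per granule; tiny on purpose for the illustration


def build_sparse_index(sorted_keys, granule=GRANULE):
    """Keep only the first key of each granule."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granule)]


def granules_for_range(index, lo, hi):
    """Return the granule numbers that may contain keys in [lo, hi]."""
    # The granule before the first mark >= lo may still hold keys >= lo.
    first = max(bisect.bisect_left(index, lo) - 1, 0)
    # Granules whose first key is greater than hi cannot match.
    last = bisect.bisect_right(index, hi)
    return range(first, last)


keys = list(range(0, 40, 2))      # 20 sorted keys: 0, 2, ..., 38
index = build_sparse_index(keys)  # 5 marks instead of 20 keys
hits = list(granules_for_range(index, 10, 18))
```

Only the selected granules need to be read and filtered row by row; all other granules are skipped without being touched. The index can rule granules out but cannot prove a match, so a selected granule may still contain no matching rows.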
MergeTree family

MergeTree

ReplacingMergeTree: removes duplicate entries with the same
primary key value

SummingMergeTree: totals data while merging

AggregatingMergeTree: the merge combines the states of
aggregate functions stored in the table for rows with the same
primary key value

CollapsingMergeTree: allows automatic deletion ("collapsing")
of certain pairs of rows when merging.

GraphiteMergeTree: designed for rollup (thinning and
aggregating/averaging) Graphite data
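What "while merging" means for two of these engines can be modelled in Python. The sketch below treats a merge as a pass over rows sorted by key: the Replacing variant keeps the last inserted row per key, the Summing variant adds up the values per key. It is a simplified model with hypothetical function names, covering a single numeric column and no version column.

```python
from itertools import groupby
from operator import itemgetter


def merge_replacing(rows):
    """Keep the last inserted (key, value) row for every key."""
    ordered = sorted(rows, key=itemgetter(0))  # stable: insertion order kept per key
    return [list(group)[-1] for _, group in groupby(ordered, key=itemgetter(0))]


def merge_summing(rows):
    """Collapse rows with the same key into one row holding the summed value."""
    ordered = sorted(rows, key=itemgetter(0))
    return [
        (key, sum(value for _, value in group))
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


inserted = [("a", 1), ("b", 2), ("a", 3)]
replaced = merge_replacing(inserted)  # [("a", 3), ("b", 2)]
summed = merge_summing(inserted)      # [("a", 4), ("b", 2)]
```

Since merges run in the background, rows with the same key can coexist until a merge happens, so queries should not assume the data is already fully merged.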
THANK YOU!!
