MODULE 2 and 3


Projection

• In Vertica, a table is a logical entity. A projection is a collection of table columns.

• Projections are the physical storage for table data. There can be any number of projections for a single table.

• Data in a projection is compressed and encoded, making it optimized for query execution.

• Vertica automatically creates a projection on the first data load into a table. We can also create projections using the Vertica Database Designer or create them manually.

• At query execution, Vertica's optimizer chooses the best projection for the query.
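The table/projection split can be inspected from the catalog. A minimal sketch, assuming the v_catalog.projections system view and these column names are available in your Vertica version ('sales' is a hypothetical table name):

```sql
-- List the projections that physically store a table's data.
-- is_super_projection marks the superprojection Vertica
-- creates automatically on first load.
SELECT projection_name, anchor_table_name, is_super_projection
FROM v_catalog.projections
WHERE anchor_table_name = 'sales';
```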
Vertica Object Hierarchy
Query Execution

• SQL queries are written against tables.
• To execute a query, the Vertica database generates a query plan: the sequence of steps used to determine the execution path and the resource cost for each step.
• The cost calculated at each step in a query plan is an estimate of the resources used, based on factors such as:
– Data distribution statistics
– Disk space
– Network bandwidth
– CPU speed
– Data segmentation across the cluster
• When you submit a query, the initiator node chooses the projections to use, then optimizes and plans the query execution. Planning and optimization are quick, requiring at most a few milliseconds.
• Based on the projections chosen, the query plan that the optimizer produces is decomposed into "mini-plans."
• These mini-plans are distributed to the other nodes, known as executors. The nodes process the mini-plans in parallel.
• Query execution proceeds with intermediate result sets (rows) flowing through network connections between the nodes as needed.
• In the final stages of executing a query plan, some wrap-up work is done at the initiator, such as:
– Combining results in a grouping operation
– Merging multiple sorted partial result sets from all the executors
– Formatting the results to return to the client
• Some small queries can be executed locally on the initiator, which avoids unnecessary network communication.
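You can see the optimizer's plan, including which projection it chose and the estimated cost of each step, without running the query. A minimal sketch using the trades table created later in this module:

```sql
-- Print the query plan instead of executing the query.
EXPLAIN SELECT stock, ask FROM trades WHERE bid > 100;
```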
Vertica Transactions

• A transaction is a sequence of operations ending with COMMIT.

• With every transaction, a transaction number (called an epoch) is created. An epoch is a 64-bit number that represents a logical timestamp for the data in Vertica.

• The epoch advances when data is committed by a DML operation (INSERT, UPDATE, MERGE, COPY, or DELETE).
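The epoch counters can be inspected from the SYSTEM monitoring table. A sketch, assuming the column names used in recent Vertica releases (verify against your version):

```sql
-- current_epoch advances as DML commits occur.
SELECT current_epoch, last_good_epoch FROM system;
```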
Hybrid data store – WOS & ROS
• Vertica is unique in many ways, one of which can be seen in its data
storage model.
• To understand the Vertica storage model, we first need to
understand these three elements:
– ROS (Read-Optimized Column Store)
– WOS (Write-Optimized Row Store)
– Tuple Mover
• Vertica uses two distinct structures for storing data: WOS and ROS.
• Vertica moves data from WOS to ROS using the Tuple Mover.
• The Tuple Mover performs two operations: Moveout and Mergeout.
Hybrid Data Store: WOS & ROS

DEPARTMENT OF CSE, 20NHOP01
WOS
• A memory-resident data store.
• Temporarily storing data in primary memory speeds up the loading process and reduces fragmentation on disk.

ROS
• A disk-resident data store.
• When the Tuple Mover's Moveout task moves data into ROS, ROS containers are created and the data is organized into projections on the hard disk.

Tuple Mover:

Moveout
• The Tuple Mover operation that moves data from memory (WOS) to disk (ROS).

Mergeout
• Combines ROS containers on the disk.
Why this Hybrid Model?
To support different types of load:
• For small, frequent (trickle) loads:
– Load to WOS.
– Data in WOS is still available for query results.
– Size of WOS: min{25% of available memory, 2 GB RAM}.

• If data loaded to WOS exceeds this size, the data automatically spills over to ROS.
– i.e., for bulk or large loads, the best practice is to load directly to ROS.
Projection Design
Projection fundamentals:

• Data in a projection is compressed and encoded, making it optimized for query execution.
• We can create projections using the Vertica Database Designer or create them manually.
• At query execution, Vertica's optimizer chooses the best projection for the query.
• Vertica stores projection data in ROS containers.
• The logical grouping of all data files (data file + metadata file) on a node for a projection is called a container.
• There can be more than one container for a single projection on a particular node. This happens because each insert or delete operation creates new data files rather than appending to the older files. The proliferation is temporary: roughly every 10 minutes the Tuple Mover performs a Mergeout, merging the containers of a single projection into a bigger one.
• There can be at most 1024 containers per projection per node.
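Container counts per projection per node can be observed from a monitoring view. A sketch, assuming the v_monitor.storage_containers view and column names of recent Vertica releases:

```sql
-- Count ROS containers per projection per node; after Mergeout
-- runs, these counts shrink back toward one per projection.
SELECT node_name, projection_name, COUNT(*) AS ros_containers
FROM v_monitor.storage_containers
GROUP BY node_name, projection_name;
```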
Projection Example
Projection Types

Projections are mainly classified into 5 types:

1. Superprojection
2. Query-specific Projection
3. Buddy Projection
4. Live Aggregate Projection
5. Pre-join Projection
Superprojection

• A projection that Vertica automatically creates when you initially load data into a table using the INSERT or COPY commands.

• A superprojection contains every column of the table in the logical schema, ensuring that all of the data is available for queries.
Query-specific Projection

• Optimized for a specific query or class of queries. It contains only the subset of table columns needed to process a given query.

Buddy Projection

• A projection with the same columns and segmentation, stored on different nodes to provide high availability.

Pre-join Projection

• Manually created.
• Multiple tables are joined and stored in the form of a projection.
Live Aggregate Projection

• A live aggregate projection contains columns with values that are aggregated from columns in its table.
• Manually created.

Functions Supported for Live Aggregate Projections:

• SUM
• MAX
• MIN
• COUNT
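A live aggregate projection is created with CREATE PROJECTION plus a GROUP BY over the anchor table. A sketch using a hypothetical clicks table (Vertica requires the anchor table to be segmented on the GROUP BY columns for this to work):

```sql
-- Hypothetical anchor table, segmented on the grouping column.
CREATE TABLE clicks (user_id INT, click_time TIMESTAMP, clicks INT)
  SEGMENTED BY HASH(user_id) ALL NODES;

-- Live aggregate projection: the SUM is maintained
-- incrementally as new rows are loaded.
CREATE PROJECTION clicks_agg AS
  SELECT user_id, SUM(clicks) AS total_clicks
  FROM clicks
  GROUP BY user_id;
```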
Projection properties
1. Encoding/compression: Each column is always encoded, compressed, or both. Vertica can work directly with encoded data, but not with compressed data; compressed data must first be uncompressed.
2. Sorting (ORDER BY): Every projection contains at least one column in its ORDER BY clause. The same six rows stored under three different sort orders (rows that tie on the sort key may appear in any order):

ORDER BY A      ORDER BY A, B   ORDER BY B
A B C           A B C           B A C
1 2 b           1 1 a           1 1 a
1 1 a           1 2 b           1 2 d
1 3 c           1 3 c           2 1 b
2 1 d           2 1 d           2 2 e
2 3 f           2 2 e           3 2 f
2 2 e           2 3 f           3 1 c
Projection Properties

• Encoding/compression

• Sorting – order by

• K-safety

• Replication and Segmentation

Replication and Segmentation
• Vertica distributes data evenly on different nodes. There are two
methods of distribution:

1. Replication: the process of copying the full projection to each node. This method is used for small projections, such as projections with fewer than 1 million records.

2. Segmentation: the process of segmenting the projection data and distributing it across multiple nodes. This method is used for large projections.
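The two distribution methods map to the UNSEGMENTED and SEGMENTED clauses of CREATE PROJECTION. A sketch against the trades table used later in this module:

```sql
-- Replication: a full copy of the projection on every node
-- (suited to small tables).
CREATE PROJECTION trades_rep AS
  SELECT * FROM trades
  UNSEGMENTED ALL NODES;

-- Segmentation: rows distributed across nodes by a hash of stock
-- (suited to large tables).
CREATE PROJECTION trades_seg AS
  SELECT * FROM trades
  SEGMENTED BY HASH(stock) ALL NODES;
```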
Buddy Projection
Creating projections
• AUTOMATICALLY
– Vertica automatically creates projections on the first data load into a table.
– These projections are considered unoptimized superprojections.

• RUNNING DBD
– Optimized superprojections are created when you run the Database Designer.
– You also have the option of creating query-specific projections in the DBD wizard.

• MANUALLY
– Using the CREATE PROJECTION statement.
MANUALLY
• CREATE PROJECTION statement:

CREATE PROJECTION [ IF NOT EXISTS ] projection-name
  [ ( { projection-col | grouped-clause }
      [ ENCODING encoding-type ] [,...] ) ]
  AS SELECT select-list from-clause
  [ ORDER BY column-expr [,...] ]
  [ segmentation-spec ]
  [ KSAFE [ k-num ] ]
MANUALLY
• CREATE PROJECTION projection_name (projection_col, ...)
  AS SELECT table_col, ... FROM existing_table ...

• Example:

CREATE TABLE trades (stock CHAR(5), bid INT, ask INT);

CREATE PROJECTION tradeproj (stock ENCODING RLE,
    GROUPED(bid ENCODING DELTAVAL, ask))
AS (SELECT * FROM trades) KSAFE 1;
Hands-on (Module 2)
1. Creation of schema, tables and execution of
SQL statements on Vertica Database

2. Hands-on projections
Connect to Remote Vertica Server
• Click on PuTTY (available on the desktop)
• Host name: 10.10.26.11
• Login as: hp1
• Password: ROOT@123
(log in to the server with the above user name and password)
• Connect as dbadmin by running vsql:
/opt/vertica/bin/vsql -U dbadmin -w vertica123
dbadmin=>
(now we can create tables and run all the queries)
Database Designer(DBD)
• Vertica's Database Designer is a tool that:
 Analyses your logical schema, sample data, and, optionally, your sample queries.
 Creates a physical schema design (a set of projections) that can be deployed automatically or manually.
 Can be used by anyone without specialized database knowledge; even business users can run Database Designer.
 Can be run and re-run at any time for additional optimization without stopping the database.
• The projections that Database Designer creates provide excellent query performance within physical constraints while using disk space efficiently.

DBD can run in 2 modes:

1. Comprehensive mode
2. Incremental mode
DBD Advantages

1. Creates a physical schema design (a set of projections) that can be deployed automatically or manually.
2. Can be used by anyone without specialized database knowledge.
3. Can be run and re-run at any time for additional optimization without stopping the database.
4. Provides excellent query performance while using disk space efficiently.
5. Accepts up to 100 queries in the query input file for an incremental design.
6. Accepts unlimited queries for a comprehensive design.
COPY
• Loads data from files into Vertica.
• A faster approach to loading a data warehouse.
• Uses the AUTO method with spill-over by default: loads to WOS and spills to ROS if needed.
• By default, WOS = 25% of RAM or 2 GB, whichever is less.
• Used for trickle loading.

Syntax:

COPY table [ ( column [,...] ) ]
  FROM { 'file' | STDIN }
  [ DELIMITER 'char' ]
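A minimal COPY sketch for the trades table from this module (the file path is hypothetical):

```sql
-- Load a delimited file using the default AUTO method:
-- rows go to WOS first and spill to ROS if WOS fills up.
COPY trades FROM '/home/dbadmin/trades.csv' DELIMITER ',';
```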
COPY DIRECT
• Best for infrequent or bulk loads.
• More efficient than AUTO for large loads.
• No WOS involved: data goes straight to ROS.
• Can create ROS fragmentation if used for small, frequent loads.
• COPY commits automatically by default.
3 Ways: COPY
• Depending on the data you are loading, the COPY statement supports several load methods.

You can choose from three load methods:

• COPY AUTO
• COPY DIRECT
• COPY TRICKLE
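The load method is given as a keyword at the end of the COPY statement. A sketch (file paths hypothetical; behavior as described in the Vertica documentation):

```sql
-- AUTO (default): WOS first, spill to ROS when WOS is full.
COPY trades FROM '/home/dbadmin/small.csv' DELIMITER ',' AUTO;

-- DIRECT: straight to ROS; best for large, infrequent bulk loads.
COPY trades FROM '/home/dbadmin/bulk.csv' DELIMITER ',' DIRECT;

-- TRICKLE: WOS only; errors out if WOS fills, for frequent small loads.
COPY trades FROM '/home/dbadmin/tiny.csv' DELIMITER ',' TRICKLE;
```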
MERGE
• MERGE statements combine INSERT and UPDATE operations.
• The source table can include new and existing data.
• If the target table does not include any of the source table's records (new or existing), MERGE inserts all source data into the target.
• A MERGE statement has two options: update matching rows (WHEN MATCHED THEN UPDATE ...) or insert non-matching rows (WHEN NOT MATCHED THEN INSERT ...).
• MERGE can update on the order of one million records in about two seconds.
Syntax:

MERGE INTO [[db-name.]schema.]target-table [ alias ]
  USING [[db-name.]schema.]source-table [ alias ]
  ON ( condition )
  [ WHEN MATCHED THEN UPDATE
      SET column1 = value1 [, column2 = value2 ... ] ]
  [ WHEN NOT MATCHED THEN INSERT ( column1 [, column2 ...] )
      VALUES ( value1 [, value2 ... ] ) ]
Example (the matched branch takes its values from the source table SRC):

MERGE INTO target TGT
USING source SRC
ON SRC.A = TGT.A
WHEN MATCHED THEN
  UPDATE SET A = SRC.A, B = SRC.B, C = SRC.C, D = SRC.D, E = SRC.E
WHEN NOT MATCHED THEN
  INSERT VALUES (SRC.A, SRC.B, SRC.C, SRC.D, SRC.E);
Purge
• In HP Vertica, delete operations do not remove rows from physical storage.
• The DELETE command marks rows as deleted.
• Purge is the process of removing the deleted data from disk.
• Purging permanently removes deleted data from physical storage so that the disk space can be reused.
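Purging is done through meta-functions. A sketch using the trades table from this module (function name as in the Vertica documentation; verify against your version):

```sql
-- Delete some rows: they are only marked as deleted on disk.
DELETE FROM trades WHERE ask = 0;
COMMIT;

-- Reclaim the space: purge deleted rows from all of the
-- table's projections.
SELECT PURGE_TABLE('public.trades');
```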
Partitioning
• HP Vertica supports data partitioning at the table level, which
divides one large table into smaller pieces.
• Partitions are a table property that apply to all projections for a
given table. The Vertica partitioning capability divides one large
table into smaller pieces based on values in one or more columns.
• A common use for partitions is to split data by time. For instance, if a table contains decades of data, you can partition it by year; if the table holds a year of data, you can partition it by month.
• Partitions segregate data on each node to facilitate dropping
partitions.
• Partitions can make data lifecycle management easier and improve
the performance of queries.
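Partitioning is declared in CREATE TABLE, and old partitions are dropped with a meta-function. A sketch with a hypothetical orders table (DROP_PARTITIONS is the name in recent Vertica releases; older releases use DROP_PARTITION):

```sql
-- Hypothetical table partitioned by the year of the order date.
CREATE TABLE orders (
    order_id   INT,
    order_date DATE,
    amount     NUMERIC(10,2)
)
PARTITION BY EXTRACT(YEAR FROM order_date);

-- Drop the 2015 partition from every projection of the table.
SELECT DROP_PARTITIONS('orders', '2015', '2015');
```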
Differences Between Partitioning and
Segmentation
• There is a distinction between partitioning at the table level and segmenting a projection:

• Partitioning: defined by the table, for fast data purges and query performance. Table partitioning segregates data on each node. You can drop partitions.

• Segmentation: defined by the projection, for distributed computing. Segmenting distributes projection data across multiple nodes in a cluster.
The following diagram illustrates the flow of segmentation
and partitioning on a four-node database cluster:
1. Example table data
2. Data segmented by HASH(order_id)
3. Data segmented by hash across four nodes
4. Data partitioned by year on a single node

Hands-on (Module 3)
1. Loading data files from different sources to
Vertica database.
2. Verifying the log files after loading the data
into Vertica database.
3. Hands-on partitions.
