
Hive Notes

File Formats

Apache Hive supports various file formats for storing and querying data in Hadoop. Each
format has its pros and cons, and the choice of format can impact the performance of Hive
queries. Here’s a detailed explanation of the main file formats used with Hive, including
examples, advantages, disadvantages, and performance considerations:

1. TextFile
- Description: This is the default file format for Hive. It stores data as plain text files, with
each line representing a row, and fields separated by a delimiter (commonly a comma or
tab).
- Example:
```
1,John,Doe,5000
2,Jane,Smith,7000
```
- Pros:
- Easy to use and understand.
- Suitable for small datasets and when compatibility with other tools is needed.
- No complex configurations are required.
- Cons:
- Large storage space is needed since data is stored as plain text.
- Slower read and write performance due to the lack of compression.
- No support for schema evolution.
- Performance:
- Text files result in higher storage costs and slower query performance.
- They are not ideal for large-scale data processing.

2. SequenceFile
- Description: A binary file format that stores data as key-value pairs. It is commonly used
in Hadoop for intermediate data storage.
- Example:
- Key: Row number (0, 1, 2, etc.)
- Value: The corresponding row data.
- Pros:
- Supports compression (e.g., Snappy, Gzip) which reduces storage space.
- Faster read/write performance than TextFile.
- Cons:
- Not human-readable.
- Not columnar, which can impact performance for specific queries (e.g., columnar
aggregation).
- Performance:
- Suitable for moderate-sized datasets.
- Efficient for use cases involving key-value pair processing.
3. RCFile (Record Columnar File)
- Description: Combines the benefits of row-based and column-based storage by storing
data in a columnar fashion but grouping columns in row splits.
- Example:
- Data is stored in row groups with columns grouped together within each row.
- Pros:
- Better compression than TextFile and SequenceFile.
- Columnar storage allows for efficient querying on specific columns.
- Cons:
- Read and write performance can be slower than newer columnar formats like ORC and
Parquet.
- More complex than SequenceFile and TextFile.
- Performance:
- Suitable for analytical queries that need columnar access.
- Query performance is better than TextFile and SequenceFile but not as good as ORC
and Parquet.

4. ORC (Optimized Row Columnar)
- Description: A columnar storage format optimized for Hive, designed to improve the
performance of reading, writing, and processing data.
- Example:
- Data is stored in columnar format with built-in indexes for fast data retrieval.
- Pros:
- High compression ratio reduces storage requirements.
- Supports complex types (e.g., arrays, maps).
- Built-in indexes and statistics allow for efficient query execution.
- Supports predicate pushdown, which filters data during scanning, improving
performance.
- Cons:
- Slower write performance due to the overhead of indexing and compression.
- Compatibility may be an issue when integrating with non-Hadoop tools.
- Performance:
- Ideal for large datasets and analytical workloads.
- Provides the best performance for Hive queries compared to other formats.
- Reduces I/O by reading only the required columns and rows.
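
ORC's predicate pushdown and row/column skipping are governed by Hive settings; a minimal sketch, assuming a standard Hive setup (the `employees_orc` table name is illustrative):

```sql
-- Enable predicate pushdown and ORC row-group index filtering (typically on by default)
SET hive.optimize.ppd = true;
SET hive.optimize.index.filter = true;

-- On an ORC table, the filter below can be checked against ORC min/max statistics,
-- so stripes and row groups that cannot match are skipped entirely.
SELECT id, name FROM employees_orc WHERE salary > 6000;
```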

5. Parquet
- Description: A columnar storage format that is platform-independent and optimized for
large-scale data processing frameworks like Apache Spark and Hadoop.
- Example:
- Data is stored in a columnar format, with each column stored separately.
- Pros:
- High compression ratios, resulting in lower storage requirements.
- Columnar storage allows for fast query execution.
- Supports nested data structures (e.g., structs, arrays).
- Widely used across various data processing frameworks (e.g., Hive, Spark, Impala).
- Cons:
- Slower write performance due to compression.
- Requires more complex configurations compared to TextFile.
- Performance:
- Comparable to ORC for read operations.
- Works well with large datasets and supports complex queries.
- Preferred format for interoperability between Hive and other processing tools.

6. AVRO
- Description: A row-based storage format suitable for data serialization and
deserialization, supporting schema evolution.
- Example:
- Data is stored in JSON-like format with a separate schema file.
- Pros:
- Supports schema evolution, making it suitable for ETL processes.
- Lightweight and efficient for data exchange between different systems.
- Well-supported in the Hadoop ecosystem.
- Cons:

- Row-based format results in slower performance for analytical queries.
- Compression ratios may not be as high as ORC or Parquet.
- Performance:
- Suitable for use cases where schema evolution is required.
- Better read and write performance than TextFile and SequenceFile.

Summary Table

| File Format | Type | Compression | Read Performance | Write Performance | Ideal Use Case |
|---|---|---|---|---|---|
| TextFile | Row-based | No | Slow | Slow | Small datasets, simple use cases |
| SequenceFile | Row-based | Yes | Medium | Medium | Intermediate data storage in Hadoop |
| RCFile | Hybrid | Yes | Medium | Medium | Columnar processing, moderate-sized data |
| ORC | Columnar | Yes | Fast | Medium | Large datasets, analytics in Hive |
| Parquet | Columnar | Yes | Fast | Medium | Interoperability, complex nested data |
| AVRO | Row-based | Yes | Medium | Medium | Schema evolution, data exchange |

Choosing the Right Format


- For large datasets with analytics: Use ORC or Parquet for efficient compression and fast
query performance.
- For schema evolution and ETL processes: AVRO is suitable.
- For simple text data: TextFile can be used, but it is not efficient for large datasets.
- For key-value data processing: Consider SequenceFile.
- For moderate-sized datasets requiring columnar processing: RCFile is an option, though
ORC and Parquet are generally better.

Each file format has its unique advantages, and the choice depends on the specific
requirements of your data processing workflow in Hive.

CSV:
CREATE TABLE my_csv_table (
column1 INT,
column2 STRING,
column3 DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

TSV:
CREATE TABLE your_table_name (
column1 data_type1,
column2 data_type2,
...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

ORC:
CREATE TABLE my_orc_table (
column1 INT,
column2 STRING,
column3 DOUBLE
)
STORED AS ORC;

Parquet:
CREATE TABLE IF NOT EXISTS my_table (
column1 datatype1,
column2 datatype2,
-- Add more columns as needed
)
STORED AS PARQUET;

AVRO :
CREATE TABLE avro_table (
column1 datatype1,
column2 datatype2,
-- Add more columns as needed
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='file:///path/to/your/avro_schema.avsc');

Sequence :
CREATE TABLE my_sequence_table (
id INT,
name STRING
)
STORED AS SEQUENCEFILE;

JSON :
CREATE TABLE my_json_table (
column1 datatype1,
column2 datatype2,
...
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;

XML:
CREATE TABLE xml_table (
col1 STRING,
col2 INT
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.col1"="/root/col1/text()",
"column.xpath.col2"="/root/col2/text()"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat';
Compression codecs :

In Hive, various file formats support different compression codecs, which help reduce
storage space and improve data transfer performance. Here is a comparison of popular
compression codecs used in Hive, focusing on their speed, compression ratio, and use
cases:

1. Snappy
- Speed: Fast (Both compression and decompression)
- Compression Ratio: Moderate
- Use Case: Optimized for speed rather than space, making it ideal for real-time queries
and quick data access. Often used in ORC and Parquet files.
- Supported Formats: ORC, Parquet, Avro, SequenceFile
- Pros:
- High decompression speed
- Low CPU overhead

- Cons:
- Lower compression ratio compared to GZIP and ZLIB

2. GZIP
- Speed: Slow (Compression), Moderate (Decompression)
- Compression Ratio: High
- Use Case: When compression ratio is a higher priority than speed. Used where disk
space or network bandwidth is a concern.
- Supported Formats: TextFile, ORC, Parquet, Avro
- Pros:
- High compression ratio
- Suitable for archiving data
- Cons:
- Slower compression and decompression speeds

3. ZLIB
- Speed: Moderate (Compression), Moderate (Decompression)
- Compression Ratio: High (Similar to GZIP)
- Use Case: Offers a balance between compression speed and size. It is widely used in
ORC for batch processing where size reduction is important.
- Supported Formats: ORC, Avro, SequenceFile
- Pros:
- Good balance between compression speed and size
- Supported in multiple Hive file formats
- Cons:
- Not as fast as Snappy for decompression

4. LZO
- Speed: Fast (Decompression), Moderate (Compression)
- Compression Ratio: Moderate
- Use Case: Suitable for large-scale data where decompression speed is important. Often
used in SequenceFile format and Hadoop applications.
- Supported Formats: TextFile, SequenceFile
- Pros:
- Fast decompression
- Effective for distributed systems like Hadoop
- Cons:
- Requires separate installation of LZO libraries
- Lower compression ratio than GZIP

5. BZIP2
- Speed: Slow (Both compression and decompression)
- Compression Ratio: Very high
- Use Case: Used where maximum compression is needed, though rarely due to its slow
speed. Suitable for archival purposes.
- Supported Formats: TextFile

- Pros:
- Highest compression ratio among common codecs
- Cons:
- Very slow compression and decompression speed
- Not recommended for real-time queries or frequent access

6. DEFLATE
- Speed: Moderate
- Compression Ratio: High (Similar to ZLIB)
- Use Case: Mostly used in ORC and other formats requiring balance between
compression and performance. It's a universal standard compression algorithm.
- Supported Formats: ORC, Avro, SequenceFile
- Pros:
- High compression ratio
- Good for general-purpose use
- Cons:
- Slower than Snappy for real-time data access

Split Ability Overview

- Splittable codecs (such as BZIP2, LZO with indexing, and Snappy inside container formats) allow Hive or Hadoop to split large files into smaller parts. This enables efficient parallel processing across multiple nodes in a distributed environment like Hadoop, improving query performance and reducing job completion times.

- Snappy: Raw Snappy files are not splittable on their own, but when Snappy is used inside container formats such as ORC, Parquet, SequenceFile, or Avro, the container provides the split points, making it highly suitable for distributed processing and large datasets.
- LZO: Supports splitting but requires an index file (`.index` file) to allow Hadoop to split the
compressed file. Without the index, LZO is not split-able. However, when indexed properly,
it's a good choice for distributed data processing.

- BZIP2: Despite its slower compression and decompression speeds, it supports splitting,
which can be advantageous in distributed systems where high compression is needed
without sacrificing parallelism.

- Non-Split-able formats (like GZIP, ZLIB, and DEFLATE) do not support splitting, meaning
that each compressed file must be processed by a single node. This limitation can reduce
parallelism and increase processing time when working with large datasets.

- GZIP: Does not support splitting. When a GZIP file is processed, the entire file is assigned
to one node, which can become a bottleneck in distributed environments.

- ZLIB: Similar to GZIP, ZLIB also doesn't support splitting. It's more suited for smaller files or cases where parallelism is not critical.

- DEFLATE: Not split-able, so it's less suitable for very large datasets that need to be
processed in parallel.
Key Considerations:
- If you're working with large datasets and distributed processing is important, it's best to choose a splittable option such as Snappy inside ORC/Parquet, LZO (with indexing), or BZIP2.
- For archival purposes or when maximizing storage efficiency is the goal, GZIP or ZLIB may
be better, but keep in mind that they do not allow file splitting.

Recommendations Based on Split Ability:


1. For distributed processing (with Hive and Hadoop):


- Snappy and LZO (with indexing) are the best choices because they balance good
compression performance with split-ability, enabling efficient parallelism.
- BZIP2 can also be used for splitting but is generally slow.

2. For archival or low-frequency access:


- GZIP and ZLIB offer higher compression ratios but are not split-able, which may limit
performance when processing large files in distributed systems.

Comparison Table
| Compression Codec | Compression Speed | Decompression Speed | Compression Ratio | Split Ability | Best Use Case | Supported File Formats |
|---|---|---|---|---|---|---|
| Snappy | Fast | Fast | Moderate | Yes (in container formats) | Real-time queries, low CPU usage | ORC, Parquet, Avro, SequenceFile |
| GZIP | Slow | Moderate | High | No | Archiving, bandwidth-constrained | TextFile, ORC, Parquet, Avro |
| ZLIB | Moderate | Moderate | High | No | Batch processing, disk savings | ORC, Avro, SequenceFile |
| LZO | Moderate | Fast | Moderate | Yes (with indexing) | Large datasets, distributed systems | TextFile, SequenceFile |
| BZIP2 | Very Slow | Very Slow | Very High | Yes | Maximum compression, archival | TextFile |
| DEFLATE | Moderate | Moderate | High | No | General-purpose compression | ORC, Avro, SequenceFile |

Key Takeaways:
- Snappy is great for fast decompression and lower CPU overhead, making it suitable for
quick access to data.
- GZIP and ZLIB provide higher compression ratios at the cost of slower performance.
- LZO offers a middle ground, with fast decompression for large datasets, often used in
distributed systems.
- BZIP2 gives the highest compression ratio but is impractically slow for most real-time use
cases.

Recommendation:
- Real-time querying: Use `Snappy`.
- Batch processing or archival: Use `GZIP` or `ZLIB` for a good balance between
compression size and performance.
- Large-scale distributed systems: Use `LZO` for fast decompression.

Supported by ORC :
-NONE: No compression is applied. Data is stored as is.
-ZLIB: This is a widely used compression algorithm that provides good compression ratios at
the cost of higher CPU usage.
-SNAPPY: Snappy is a fast compression algorithm that provides good compression ratios
with low CPU usage. It's a popular choice for tasks where speed is a priority.
-LZO: LZO (Lempel-Ziv-Oberhumer) is a compression algorithm known for its fast
compression and decompression speeds. However, it's not as space-efficient as some other
codecs.
-LZ4: LZ4 is a very fast compression algorithm that is designed for speed rather than high
compression ratios. It's commonly used for tasks where decompression speed is crucial.
-ZSTD: Zstandard is a modern compression algorithm that aims to provide a good balance
between compression ratio and speed. It's designed to be competitive with other popular
codecs like Snappy and zlib.
-LZ4_FRAME: This is another variant of the LZ4 algorithm specifically tailored for streaming
applications.
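
To use one of the codecs above with ORC, compression is normally chosen per table through the `orc.compress` table property; a minimal sketch (table and column names are illustrative):

```sql
CREATE TABLE sales_orc (
  id INT,
  amount DOUBLE
)
STORED AS ORC
-- Accepted values include NONE, ZLIB, SNAPPY (and LZ4/ZSTD on newer ORC versions)
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
```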

Supported by Parquet:
-Gzip: This is a widely used compression algorithm. It provides good compression ratios but
may not be as fast to decompress as some other codecs.
-Snappy: Snappy is a fast compression and decompression library. It provides good
compression ratios and is relatively fast in terms of decompression speed.
-LZO (Not natively supported): LZO is a popular compression algorithm that is known for its
high compression and decompression speeds. However, it is not natively supported by
Parquet and would require additional configurations or libraries.
-Brotli: Brotli is a newer compression algorithm developed by Google. It provides high
compression ratios and is designed to be fast in terms of both compression and
decompression.
-LZ4: LZ4 is another compression algorithm known for its high-speed compression and
decompression. It doesn't provide as high compression ratios as some other algorithms like
Gzip, but it is very fast.
-Zstandard (zstd): Zstandard is a relatively new compression algorithm that aims to provide a
good balance between compression ratios and speed.
-None: You can choose not to use any compression. This is suitable for scenarios where you
prioritize speed and do not need to minimize storage space.
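
For Parquet tables in Hive, the codec is usually selected with the `parquet.compression` table property; a minimal sketch (names are illustrative):

```sql
CREATE TABLE events_parquet (
  event_id BIGINT,
  payload STRING
)
STORED AS PARQUET
-- Common values are SNAPPY, GZIP, and UNCOMPRESSED
TBLPROPERTIES ('parquet.compression' = 'SNAPPY');
```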

Supported by AVRO:
-Deflate: This is a commonly used compression algorithm. It provides good compression ratios and is relatively fast.
-Snappy: This is a popular compression codec known for its speed and reasonable
compression ratios. It is often used in conjunction with Avro.
-Gzip: Gzip provides good compression ratios but tends to be slower than some other
codecs.
-Bzip2: This provides very good compression ratios but is generally slower than other
codecs.
-LZO: LZO is a high-speed compression algorithm that is suitable for real-time data
processing.
-Zstandard (Zstd): This is a modern compression algorithm that provides a good balance
between compression ratios and speed.
-LZ4: LZ4 is an extremely fast compression algorithm. It may not achieve the highest
compression ratios, but it is very fast to compress and decompress.
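
For Avro-backed Hive tables, compression is typically enabled at the session level when writing; a hedged sketch, assuming the `avro.output.codec` setting is honored by your Hive and Avro versions (the `avro_table` target reuses the DDL shown earlier, `source_table` is illustrative):

```sql
SET hive.exec.compress.output = true;
SET avro.output.codec = snappy;  -- alternatives such as deflate or bzip2, depending on the Avro libraries

INSERT OVERWRITE TABLE avro_table
SELECT column1, column2 FROM source_table;
```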

Supported by TEXT based formats:


-Gzip: This is a widely used compression codec that provides good compression ratios. It's
supported by Hive for TEXTFILE storage format.
-Bzip2: Another popular compression codec known for its high compression ratios. It's also
supported by Hive for TEXTFILE storage format.
-Deflate: This is the same compression algorithm used in ZIP files. It's supported by Hive for
TEXTFILE storage format.
-Snappy: Although commonly associated with ORC file format, Snappy compression can
also be used with TEXTFILE format in Hive.
-LZO: While not part of the default distribution, you can enable LZO compression in Hive by
installing the necessary libraries and configuring them properly.
-Zlib: This is a compression library used in various software applications. It can be used with
Hive for TEXTFILE storage format.
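
Because TEXTFILE compression is applied when files are written, it is usually driven by session settings rather than DDL; a minimal sketch using the standard Hadoop codec classes (the `my_csv_table` target reuses the CSV DDL above, `staging_table` is illustrative):

```sql
SET hive.exec.compress.output = true;
SET mapreduce.output.fileoutputformat.compress = true;
SET mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.GzipCodec;

-- Files produced by this INSERT will be Gzip-compressed text
INSERT OVERWRITE TABLE my_csv_table
SELECT column1, column2, column3 FROM staging_table;
```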
Hive Transactional Tables

# Overview
Hive transactional tables enable ACID (Atomicity, Consistency, Isolation, Durability)
operations, allowing you to perform INSERT, UPDATE, DELETE, and MERGE operations,
which are typically absent in traditional Hive tables. This capability is crucial for managing
transactional data and slowly changing dimensions in big data systems.

# Relevance
Transactional tables in Hive are especially useful in scenarios where data mutability is
required, like when you need to:
- Update existing records in data warehouses.
- Delete specific data (e.g., GDPR-related requirements).
- Perform incremental loads with MERGE operations.

Prior to the introduction of transactional tables, Hive was primarily an append-only system,
meaning data could only be added but not modified or deleted, limiting its usefulness in
certain data warehousing scenarios.
# Use Cases
1. Incremental Data Loading: Performing MERGE operations to update existing records based on new incoming data, typically in ETL pipelines (see the MERGE sketch after this list).
2. Slowly Changing Dimensions (SCD): Managing changes to dimensions over time, where
you need to update or delete records while keeping history.
3. GDPR Compliance: Deleting or updating sensitive customer data in compliance with data
privacy regulations.
4. Data Deduplication: In scenarios where duplicate data needs to be removed.
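
As referenced in use case 1, a minimal MERGE sketch for an incremental load (Hive 2.2+ syntax; `employee` matches the transactional table created later in these notes, while `employee_updates` is an illustrative staging table):

```sql
MERGE INTO employee AS t
USING employee_updates AS s
ON t.emp_id = s.emp_id
WHEN MATCHED THEN UPDATE SET emp_salary = s.emp_salary
WHEN NOT MATCHED THEN INSERT VALUES (s.emp_id, s.emp_name, s.emp_salary);
```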

# Pros
- ACID Properties: Ensures consistency and reliability in transactional data operations, even
in a distributed environment like Hadoop.
- Mutability: Allows performing UPDATE, DELETE, and MERGE operations, making Hive
more suitable for enterprise data warehouses that require changes to historical data.
- Efficient Storage: Utilizes compaction to optimize storage after a series of small inserts or
updates, ensuring that storage space is not unnecessarily consumed.
- Fine-grained Access: Allows fine-grained access to data, improving data governance.

# Cons
- Performance Overhead: Transactional tables require write locks, and compactions
introduce overhead, making them slower than traditional append-only Hive tables.
- More Resources: The need for compaction and locking mechanisms can result in higher
memory and CPU usage.
- Configuration Complexity: Requires proper configuration of Hive metastore and Tez or MR
engines to handle ACID properties.
- Limitations on File Formats: Currently, ORC is the only supported format for fully
ACID-compliant tables, restricting flexibility.

Compaction in Hive Transactional Tables

Compaction in Hive transactional tables is a process designed to optimize storage and improve performance by consolidating small, fragmented files into larger, more manageable files. It plays a crucial role in maintaining the efficiency of ACID transactional tables by reducing the overhead caused by multiple INSERT, UPDATE, and DELETE operations.

# Why Compaction is Needed


When working with Hive transactional tables, operations like INSERT, UPDATE, and
DELETE create small delta files that record changes. Over time, as more transactional
operations are performed, the number of these small files grows. This leads to several performance and storage issues:
1. Small File Problem: Too many small files can overwhelm the NameNode in Hadoop, as it
has to manage a large number of file metadata, leading to performance degradation.
2. Query Performance: Each query must process all delta files, increasing the time taken to
retrieve the data.
3. Increased Storage Overhead: A large number of small files lead to inefficient use of
storage space.

Compaction addresses these issues by merging the small delta files and base files into
fewer, larger files, optimizing both query performance and storage usage.

---
Types of Compaction
There are two main types of compaction in Hive: Minor Compaction and Major Compaction.

# 1. Minor Compaction
- Purpose: Combines delta files generated by INSERT, UPDATE, and DELETE operations
into a single delta file.
- What It Does: Merges all the delta files into a single delta file without touching the base file
(i.e., the original data). This helps reduce the overhead of scanning multiple small delta files
during queries.

For example, if you have multiple delta files:


```
delta_0001
delta_0002
delta_0003
```
Minor compaction will merge these into a single file:
```
delta_0001_0003
```

- Trigger: Typically triggered automatically when a certain number of delta files are created,
but it can also be triggered manually if needed.

```sql
ALTER TABLE table_name COMPACT 'MINOR';
```

# 2. Major Compaction
- Purpose: Combines both the base file and all associated delta files into a single, optimized
file.
- What It Does: This process merges all the delta files into the base file, creating a new base
file that incorporates all changes made by the transactions (INSERTs, UPDATEs, and
DELETEs).

For example, if you have a base file and multiple delta files:
```
base_0000
delta_0001
delta_0002
delta_0003
```
Major compaction will merge these into a single base file:
```
base_0003
```
- Trigger: Major compaction can be triggered manually or automatically based on the configuration settings.

```sql
ALTER TABLE table_name COMPACT 'MAJOR';
```

---

Automatic vs. Manual Compaction

- Automatic Compaction: Hive automatically triggers compaction based on certain thresholds (e.g., the number of delta files or time since the last compaction). The automatic compaction process runs in the background and ensures that the table is periodically optimized.

You can configure automatic compaction in the `hive-site.xml` file:

```xml
<property>
<name>hive.compactor.initiator.on</name>
<value>true</value>
</property>
<property>
<name>hive.compactor.worker.threads</name>
<value>1</value>
</property>
```

- `hive.compactor.initiator.on`: Enables automatic compaction.


- `hive.compactor.worker.threads`: Specifies the number of threads that will perform
compaction. More threads can speed up the process.

- Manual Compaction: Compaction can be triggered manually by the user. This is useful
when automatic compaction is disabled, or you want to force compaction at a specific time
(e.g., during off-peak hours).

```sql
ALTER TABLE table_name COMPACT 'MINOR';
ALTER TABLE table_name COMPACT 'MAJOR';
```

---

Compaction Scheduling and Execution

Compaction in Hive is carried out by the Compaction Initiator and Compaction Worker
processes:
- Compaction Initiator: Periodically scans the Hive Metastore for tables that need compaction
based on the thresholds configured.
- Compaction Worker: Performs the actual compaction tasks. Each worker takes a
compaction job from the queue, reads the delta and base files, merges them, and writes the
output back to the table’s storage location.
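
The compaction queue handled by these processes can be inspected directly from HiveQL; a minimal sketch:

```sql
-- Lists queued, running, and recently completed compactions,
-- including the table/partition, type (MINOR/MAJOR), and state
SHOW COMPACTIONS;

-- Related commands for inspecting the transaction system
SHOW TRANSACTIONS;
SHOW LOCKS;
```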

---

Configuration Settings for Compaction

In addition to enabling automatic compaction, you can fine-tune the compaction process with
the following settings in the `hive-site.xml` file:

1. Compaction Frequency:
```xml
<property>
<name>hive.compactor.delta.num.threshold</name>
<value>10</value>
</property>
```
- Controls how many delta files must accumulate before triggering compaction. In this
case, 10 delta files will trigger a compaction.

2. Aborted Transaction Threshold:


```xml
<property>
<name>hive.compactor.abortedtxn.threshold</name>
<value>1000</value>
</property>
```
- Specifies the number of aborted transactions involving a table or partition before a compaction is automatically triggered to clean up the data they left behind.

3. Compaction Delay:
```xml
<property>
<name>hive.compactor.initiator.honor.transactions</name>
<value>true</value>
</property>
```
- Delays compaction until all active transactions on the table are completed.

---

Compaction Use Cases


- ETL Workflows: In typical ETL processes, data is updated frequently (e.g., daily
incremental loads), creating many small delta files. Compaction helps by reducing the
overhead of scanning these small files.
- Transactional Data Processing: When working with ACID tables where UPDATE, DELETE,
and INSERT operations are frequent, compaction improves performance by consolidating
these small changes.
- Long-running Queries: In scenarios where long-running queries scan large datasets,
compaction ensures that queries don't have to scan multiple small delta files, reducing query
execution times.

---

Compaction Pros and Cons

# Pros
- Improved Query Performance: Queries run faster after compaction because fewer files
need to be scanned.
- Efficient Storage Utilization: Reduces the number of small files, improving storage
efficiency and reducing the burden on the NameNode in Hadoop.
- Better Resource Utilization: Compaction optimizes resources by minimizing the CPU and
I/O overhead during queries.

# Cons
- Resource-Intensive Process: Compaction itself is resource-intensive and can affect system
performance during execution, especially if not properly scheduled.
- Additional Configuration: Requires fine-tuning and monitoring to ensure compactions are
running efficiently without impacting other operations.
- Potential Delays: Compaction might not keep up with very high-frequency data updates,
leading to increased overhead if not managed properly.

---

Summary
Compaction is a critical process for Hive transactional tables, helping to maintain
performance and storage efficiency in an environment where multiple small files are
generated by ACID operations. By merging delta files and base files, it ensures that Hive can
continue to process queries efficiently without being bogged down by small-file overhead.
Proper configuration of compaction settings and balancing between automatic and manual
compaction is essential for achieving optimal performance in transactional Hive tables.

Hive Materialized Views

# Overview
Materialized Views in Hive are pre-computed views that store the result of a query physically
on disk. When a query uses a materialized view, Hive will try to use the pre-computed data
instead of re-executing the query, significantly improving query performance for repetitive or
complex operations.

# Relevance
Materialized views are highly relevant in big data analytics and data warehousing where
complex queries with joins, aggregations, and filters are common. They reduce the query
execution time by avoiding the recomputation of expensive operations and allow for query
acceleration.

# Use Cases
1. Reporting: Use materialized views to pre-compute data that is frequently queried in
reports, dashboards, or BI tools, improving response times.
2. ETL Workflows: Speed up ETL processes by creating materialized views of intermediate
steps.
3. Query Optimization: In scenarios where queries repeatedly access the same data with
similar filters or joins, materialized views can reduce computation time.
4. Data Aggregation: Create materialized views for complex aggregations that need to be
reused across multiple queries.
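
For the reporting and query-optimization cases above, Hive can also rewrite matching queries to read from a materialized view automatically; a minimal sketch, assuming Hive 3.x with materialized view rewriting available and an ACID `sales` source table (names are illustrative):

```sql
SET hive.materializedview.rewriting = true;

CREATE MATERIALIZED VIEW mv_sales_by_region AS
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region;

-- With rewriting enabled, this query can be answered from mv_sales_by_region
-- instead of rescanning the sales table.
SELECT region, SUM(amount) FROM sales GROUP BY region;
```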

# Pros
- Query Optimization: Significantly reduces the time taken to run complex queries by reusing
precomputed data instead of executing the query from scratch.
- Automatic Refresh: Materialized views can be refreshed automatically when the underlying
tables are updated, ensuring the view remains up-to-date.
- Improved Query Performance: Since the data is already computed, materialized views
reduce the CPU and memory required to run frequent queries.
- Partitioning Support: Materialized views can take advantage of table partitioning, allowing
better performance on large datasets.

# Cons
- Storage Overhead: Since materialized views store the precomputed data on disk, they
consume extra storage space.
- Staleness of Data: Materialized views can become stale if not refreshed regularly, which
may result in outdated query results.
- Maintenance Complexity: Requires periodic maintenance and monitoring of view
freshness, especially in environments where data changes frequently.
- Not Suitable for Real-Time Data: Materialized views are precomputed, so they don't work
well for real-time data needs where immediate freshness is critical.

---

Comparison and Summary

| Feature | Hive Transactional Tables | Hive Materialized Views |
|---|---|---|
| Purpose | ACID compliance, allowing for `INSERT`, `UPDATE`, `DELETE`, `MERGE` | Precompute and store the results of complex queries |
| Use Case | Managing mutable data (updates, deletes) | Optimizing query performance for repeated complex queries |
| Best For | Slowly changing dimensions, data deduplication, GDPR compliance | Frequent reporting queries, query acceleration for aggregations |
| Performance Impact | Higher write overhead due to locking, compaction | Reduced query times but requires additional storage |
| File Format | Primarily ORC | Supports all Hive table formats |
| Data Freshness | Data is current as transactions are directly applied | Data might be stale unless the view is refreshed |
| Storage Requirement | No additional storage requirements (except for compaction) | Requires additional storage for storing precomputed data |
| Maintenance | Requires configuration and management of locking, compaction | Requires refreshing when source tables change |
| Scalability | Good scalability but with some performance trade-offs | Scales well for read-heavy workloads |

When to Use Which?


- Hive Transactional Tables: Use when data mutability is needed. For example, in data
warehouses where records need to be updated or deleted over time, transactional tables are
essential.
- Materialized Views: Use when you need query optimization for complex, frequently run
queries. Materialized views significantly improve the performance of queries with
aggregations, joins, and filters.

Creating Transactional Tables and Materialized Views

1. `hive.support.concurrency = true`: Enables concurrency control for Hive transactions, allowing multiple clients to read and write to the same table simultaneously.

2. `hive.enforce.bucketing = true`: Enforces bucketing, which is a technique to distribute data evenly across files and partitions, improving query performance.

3. `hive.exec.dynamic.partition.mode = nonstrict`: Specifies the dynamic partition mode for Hive. In non-strict mode, dynamic partitioning can be used even if the user fails to specify all partition columns.

4. `hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager`: Sets the transaction manager for Hive to manage transactions in the database using the `DbTxnManager` implementation.

5. `hive.compactor.initiator.on = true`: Enables the compactor initiator, which is responsible for identifying tables that need to be compacted based on the transactional properties and initiating the compaction process.

6. `hive.compactor.worker.threads = 1`: Specifies the number of threads to be used by the compactor worker for compacting tables. In this case, only one thread is allocated for compacting tables. Increasing the number of threads can improve compaction performance, but it also increases resource consumption.

SET hive.support.concurrency = true;
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on = true;
SET hive.compactor.worker.threads = 1;

CREATE TABLE employee (id int, name string, salary int)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

INSERT INTO employee VALUES
(1, 'Jerry', 5000),
(2, 'Tom', 8000),
(3, 'Kate', 6000);

-- Create a Hive transactional table called employee
CREATE TABLE IF NOT EXISTS employee (
emp_id INT,
emp_name STRING,
emp_salary DECIMAL(10, 2)
)
CLUSTERED BY (emp_id) INTO 4 BUCKETS
STORED AS ORC TBLPROPERTIES('transactional'='true');

-- Insert data into the employee table
INSERT INTO TABLE employee VALUES
(1, 'John Doe', 50000.00),
(2, 'Jane Smith', 60000.00),
(3, 'Alice Johnson', 70000.00);

-- Update data in the employee table
UPDATE employee SET emp_salary = 55000.00 WHERE emp_id = 1;

-- Delete data from the employee table
DELETE FROM employee WHERE emp_id = 2;

-- Insert data from a simple Hive table into the transactional table
INSERT INTO TABLE employee
SELECT emp_id, emp_name, emp_salary FROM simple_hive_table;

To disable the `DbTxnManager` in Hive, you need to set the `hive.txn.manager` property to a value that corresponds to a different transaction manager or to no transaction manager at all. You can achieve this by setting it to an empty string or to `org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager`.

Here's how you can disable the `DbTxnManager`:

SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager;

By setting `hive.txn.manager` to `org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager`, you effectively disable transaction management in Hive. This means Hive won't support ACID (Atomicity, Consistency, Isolation, Durability) properties and won't perform any transaction-related operations.

Remember to restart Hive services or sessions after making this change for it to take effect.

-- Create a simple employee table
CREATE TABLE IF NOT EXISTS employee (
emp_id INT,
emp_name STRING,
emp_salary DECIMAL(10, 2)
);

-- Insert some data into the employee table
INSERT INTO TABLE employee VALUES
(1, 'John Doe', 50000.00),
(2, 'Jane Smith', 60000.00),
(3, 'Alice Johnson', 70000.00);

-- Create a view on the employee table
CREATE VIEW IF NOT EXISTS employee_view AS
SELECT * FROM employee;

-- Query the view to fetch all employees
SELECT * FROM employee_view;

-- Query the view to fetch employees with a salary greater than 60000
SELECT * FROM employee_view WHERE emp_salary > 60000;

-- Query the view to fetch employees with 'John' in their name
SELECT * FROM employee_view WHERE emp_name LIKE '%John%';

-- Query the view to fetch the count of employees
SELECT COUNT(*) FROM employee_view;

-- Query the view to fetch the average salary of employees
SELECT AVG(emp_salary) FROM employee_view;
To create a materialized view in Hive and perform queries on it based on a transactional
`employee` table, you would typically follow these steps:

1. Create the materialized view.
2. Rebuild (refresh) the materialized view whenever the underlying transactional table changes.
3. Query the materialized view.

Here's an example code:



-- Step 1: Create a materialized view


CREATE MATERIALIZED VIEW IF NOT EXISTS mv_employee AS
SELECT * FROM employee;

-- Step 2: Rebuild the materialized view so it reflects the latest data in the transactional table
-- Note: A Hive materialized view is refreshed with ALTER MATERIALIZED VIEW ... REBUILD;
-- it cannot be loaded with INSERT statements.
ALTER MATERIALIZED VIEW mv_employee REBUILD;

-- Step 3: Query the materialized view


-- Example query: Get the count of employees
SELECT COUNT(*) AS employee_count FROM mv_employee;

-- Example query: Get the average salary of employees


SELECT AVG(emp_salary) AS avg_salary FROM mv_employee;

-- Example query: Get the details of employees earning more than $60,000
SELECT * FROM mv_employee WHERE emp_salary > 60000;

In this example:
- We first create a materialized view called `mv_employee` using a `SELECT * FROM employee` statement.
- We then refresh the materialized view with the latest data from the transactional table by running `ALTER MATERIALIZED VIEW mv_employee REBUILD`.
- Finally, we perform various queries on the materialized view, such as getting the count of employees, calculating the average salary, and fetching details of employees earning more than $60,000.

Make sure to adjust the queries according to your specific requirements and schema.

Vectorization in Hive

# Overview
Vectorization in Hive is a performance optimization technique where operations are
performed on batches of rows (blocks of data) rather than processing rows one-by-one. This
approach significantly speeds up query execution by reducing CPU overhead and making
better use of modern CPU architectures, which can handle multiple data points in parallel.

Traditionally, Hive processes data in a row-by-row fashion, which involves numerous iterations and function calls for each row. In contrast, vectorized execution processes
multiple rows (typically 1,024 rows) at once, minimizing overhead and improving
performance for large datasets.

---

Relevance

Vectorization is crucial in Hive for query optimization and is particularly useful in data
analytics where large-scale aggregation, filtering, and join operations are common. It speeds
up processing, especially on ORC-formatted tables, by leveraging SIMD (Single Instruction,
Multiple Data) CPU instructions that operate on multiple data points simultaneously.
---

Use Cases
1. Data Analytics: When running complex analytical queries that involve large scans,
aggregations, and joins, vectorization improves query performance by reducing the time
taken to process the data.
2. ETL Pipelines: In Extract-Transform-Load (ETL) pipelines, where large volumes of data
need to be transformed, filtered, or aggregated before loading into the target system,
vectorized operations can reduce the time required for each transformation step.
3. Business Intelligence Reports: Queries run by BI tools often involve large datasets with
aggregations and joins. Vectorization enhances query speed, improving the responsiveness
of dashboards and reports.
4. Big Data Workloads: Any workload that processes large datasets in Hadoop or Hive, such
as e-commerce platforms, financial data analysis, or social media analytics, can benefit from
vectorized query execution.

---

How Vectorization Works


- Row-Based Processing: In traditional Hive, rows are processed one by one. For each row,
a function call is made, which can lead to high CPU overhead.
- Vectorized Processing: Instead of processing one row at a time, Hive processes a batch of
rows (default is 1,024) in one go. This batch is stored in columnar format in memory,
allowing Hive to apply operations on an entire column of data simultaneously, reducing the
number of function calls and CPU cache misses.
---

Prerequisites for Vectorization

1. Hive Version:
- Vectorization was introduced in Hive 0.13, but it is significantly optimized in later versions,
especially Hive 2.x and Hive 3.x.

2. Table Format:
- ORC (Optimized Row Columnar) is the most commonly supported file format for
vectorized query execution. Parquet files also support vectorization in Hive.

3. Hive Configuration Settings:


- Vectorization must be enabled in Hive’s configuration. To enable vectorization, set the
following property in `hive-site.xml` or during the session:

```sql
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
```
- `hive.vectorized.execution.enabled`: Enables vectorization for Map tasks.
- `hive.vectorized.execution.reduce.enabled`: Enables vectorization for Reduce tasks.

4. Execution Engines:
- Vectorization works with Tez and MapReduce execution engines, but Tez is generally
recommended due to better performance.
5. Data Types:
- Vectorization supports most primitive data types (e.g., `INT`, `BIGINT`, `FLOAT`,
`DOUBLE`, `STRING`, etc.), but there is limited support for complex data types (e.g.,
`ARRAY`, `MAP`, `STRUCT`).

---

Pros of Vectorization

1. Performance Improvements:
- Faster Query Execution: Vectorization can lead to up to 2-10x improvement in query
execution times, especially for aggregation, joins, and filtering on large datasets.
- Better CPU Utilization: By processing batches of rows in a columnar format, vectorization
reduces the CPU overhead caused by row-by-row processing, resulting in more efficient use
of CPU caches and instructions.

2. Reduced Memory and CPU Overhead:


- By processing data in batches, vectorization reduces the number of function calls and
context switches between the CPU and memory, leading to improved performance.
3. Scalability:
- Since vectorization handles large data efficiently, it scales well with big data workloads
and can significantly reduce the time taken for processing queries over petabytes of data.

4. Better Performance on Columnar Data:


- Vectorization is particularly effective with columnar storage formats (e.g., ORC and
Parquet), as it allows operations to be performed on entire columns rather than rows,
leveraging the strengths of columnar file formats.

---

Cons of Vectorization

1. Limited Support for Complex Data Types:


- Currently, complex data types like `ARRAY`, `MAP`, and `STRUCT` are not fully
supported in vectorized execution. For queries involving complex data types, Hive may fall back to traditional row-based execution, reducing the performance benefits.

2. Resource Intensive:
- Although vectorization reduces CPU overhead, it can increase memory consumption due
to the need to store data in batches for columnar processing. This can lead to increased
memory pressure, especially in large-scale data processing jobs.

3. Not Always Beneficial for Small Datasets:


- For small datasets or simple queries, the benefits of vectorization may be negligible. The
overhead of setting up batch processing might outweigh the performance gains, making
vectorization more useful for large datasets or complex queries.
4. Initial Setup:
- Proper configuration is required for vectorization, and misconfigurations (e.g., disabling
vectorization for certain tasks) may lead to suboptimal performance.

---

Performance Impact
Vectorization is highly effective for:
- Aggregation Queries: Queries that involve sum, count, average, and other aggregations
over large datasets.

Example:
```sql
SELECT department, AVG(salary) FROM employees GROUP BY department;
```

- Join Operations: When joining large tables, vectorization reduces the time taken for
performing joins by processing rows in bulk.

Example:
```sql
SELECT a.emp_id, b.department FROM employees a JOIN departments b ON a.dept_id =
b.dept_id;
```

- Filter Operations: For queries with WHERE clauses, vectorization applies the filter across
the batch of rows in one go, reducing the time spent filtering individual rows.

Example:
```sql
SELECT * FROM employees WHERE salary > 50000;
```

---

Configuration and Tuning

To ensure that vectorization is optimally configured, consider the following:

- Tuning Batch Size: The default batch size for vectorization is 1,024 rows, but this can be
tuned based on memory and CPU characteristics.

```sql
set hive.vectorized.execution.batch.size = 2048;
```

- Execution Engine: Using Tez as the execution engine can provide better performance
compared to MapReduce when combined with vectorization.
```xml
<property>
<name>hive.execution.engine</name>
<value>tez</value>
</property>
```

- Query Rewriting: Ensure that your queries are written in a way that takes advantage of
vectorized operations. Simple aggregations, filters, and joins are the best candidates.
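
To confirm whether a query actually runs vectorized, newer Hive releases (assuming Hive 2.3+) provide an EXPLAIN VECTORIZATION mode; a minimal sketch:

```sql
-- Shows, per operator, whether vectorized execution is used and, if not, why
EXPLAIN VECTORIZATION DETAIL
SELECT department, AVG(salary)
FROM employees
GROUP BY department;
```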

---
Summary of Vectorization

| Feature | Description |
|----------------------|----------------------------------------------------------------|
| Purpose | Process multiple rows (batches) at once, improving performance|
| Best for | Large-scale queries with aggregations, joins, and filtering |
| File Formats | Primarily effective with ORC and Parquet formats |
| Supported Data Types | Supports most primitive types; limited support for complex types|
| Performance Impact | Up to 2-10x speedup for large queries |
| Prerequisites | Hive version 0.13+ (better in 2.x+), ORC/Parquet formats, Tez or MR|
| Pros | Faster queries, better CPU utilization, scalable |
| Cons | Not suitable for small data, limited complex type support, memory usage|

When to Use Vectorization


- Use vectorization when working with large datasets in columnar formats like ORC or
Parquet.
- Ideal for complex queries involving aggregations, joins, and filters.
- Not beneficial for small datasets or queries that don't involve operations on large amounts of data.