Hive Notes (1)
File Formats
Apache Hive supports various file formats for storing and querying data in Hadoop. Each
format has its pros and cons, and the choice of format can impact the performance of Hive
queries. Here’s a detailed explanation of the main file formats used with Hive, including
examples, advantages, disadvantages, and performance considerations:
1. TextFile
- Description: This is the default file format for Hive. It stores data as plain text files, with
each line representing a row and fields separated by a delimiter (commonly a comma or tab).
- Example:
```
1,John,Doe,5000
2,Jane,Smith,7000
```
- Pros:
- Easy to use and understand.
- Suitable for small datasets and when compatibility with other tools is needed.
- No complex configurations are required.
- Cons:
- Large storage space is needed since data is stored as plain text.
- Slower read and write performance due to the lack of compression.
- No support for schema evolution.
- Performance:
- Text files result in higher storage costs and slower query performance.
- They are not ideal for large-scale data processing.
2. SequenceFile
- Description: A binary file format that stores data as key-value pairs. It is commonly used
in Hadoop for intermediate data storage.
- Example:
- Key: Row number (0, 1, 2, etc.)
- Value: The corresponding row data.
- Pros:
- Supports compression (e.g., Snappy, Gzip) which reduces storage space.
- Faster read/write performance than TextFile.
- Cons:
- Not human-readable.
- Not columnar, which can impact performance for specific queries (e.g., columnar
aggregation).
- Performance:
- Suitable for moderate-sized datasets.
- Efficient for use cases involving key-value pair processing.
3. RCFile (Record Columnar File)
- Description: Combines the benefits of row-based and column-based storage by storing
data in a columnar fashion but grouping columns in row splits.
- Example:
- Data is stored in row groups with columns grouped together within each row.
- Pros:
- Better compression than TextFile and SequenceFile.
- Columnar storage allows for efficient querying on specific columns.
- Cons:
- Read and write performance can be slower than newer columnar formats like ORC and
Parquet.
- More complex than SequenceFile and TextFile.
- Performance:
- Suitable for analytical queries that need columnar access.
- Query performance is better than TextFile and SequenceFile but not as good as ORC
and Parquet.
4. ORC (Optimized Row Columnar)
- Description: A columnar storage format optimized for Hive, designed to improve the
performance of reading, writing, and processing data.
- Example:
- Data is stored in columnar format with built-in indexes for fast data retrieval.
- Pros:
- High compression ratio reduces storage requirements.
- Supports complex types (e.g., arrays, maps).
- Built-in indexes and statistics allow for efficient query execution.
- Supports predicate pushdown, which filters data during scanning, improving
performance.
- Cons:
- Slower write performance due to the overhead of indexing and compression.
- Compatibility may be an issue when integrating with non-Hadoop tools.
- Performance:
- Ideal for large datasets and analytical workloads.
- Provides the best performance for Hive queries compared to other formats.
- Reduces I/O by reading only the required columns and rows.
5. Parquet
- Description: A columnar storage format that is platform-independent and optimized for
large-scale data processing frameworks like Apache Spark and Hadoop.
- Example:
- Data is stored in a columnar format, with each column stored separately.
- Pros:
- High compression ratios, resulting in lower storage requirements.
- Columnar storage allows for fast query execution.
- Supports nested data structures (e.g., structs, arrays).
- Widely used across various data processing frameworks (e.g., Hive, Spark, Impala).
- Cons:
- Slower write performance due to compression.
- Requires more complex configurations compared to TextFile.
- Performance:
- Comparable to ORC for read operations.
- Works well with large datasets and supports complex queries.
- Preferred format for interoperability between Hive and other processing tools.
6. AVRO
- Description: A row-based storage format suitable for data serialization and
deserialization, supporting schema evolution.
- Example:
- Data is serialized in Avro's compact binary format, with the schema defined in JSON (often kept in a separate `.avsc` file).
- Pros:
- Supports schema evolution, making it suitable for ETL processes.
- Lightweight and efficient for data exchange between different systems.
- Well-supported in the Hadoop ecosystem.
- Cons:
- Row-based format results in slower performance for analytical queries.
- Compression ratios may not be as high as ORC or Parquet.
- Performance:
- Suitable for use cases where schema evolution is required.
- Better read and write performance than TextFile and SequenceFile.
Summary Table
| File Format | Type | Compression | Read Performance | Write Performance | Ideal Use Case |
|----------------|-------------|-------------|------------------|-------------------|--------------------------------------------|
| TextFile | Row-based | No | Slow | Slow | Small datasets, simple use cases |
| SequenceFile | Row-based | Yes | Medium | Medium | Intermediate data storage in Hadoop |
| RCFile | Hybrid | Yes | Medium | Medium | Columnar processing, moderate-sized data |
| ORC | Columnar | Yes | Fast | Medium | Large datasets, analytics in Hive |
| Parquet | Columnar | Yes | Fast | Medium | Interoperability, complex nested data |
| AVRO | Row-based | Yes | Medium | Medium | Schema evolution, data exchange |
Each file format has its unique advantages, and the choice depends on the specific
requirements of your data processing workflow in Hive.
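For example, migrating an existing table to a more efficient format is usually a one-statement job with CTAS. A minimal sketch, assuming a TextFile-backed table named `sales_text` (the table names here are illustrative):
```sql
-- Rewrite the data of a TextFile table into a new ORC table.
-- The new table takes its schema from the SELECT; the data is re-encoded as ORC.
CREATE TABLE sales_orc
STORED AS ORC
AS
SELECT * FROM sales_text;
```
The same pattern works for Parquet or Avro by changing the STORED AS clause.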
CSV:
CREATE TABLE my_csv_table (
column1 INT,
column2 STRING,
column3 DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
TSV:
CREATE TABLE your_table_name (
column1 data_type1,
column2 data_type2,
...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
ORC :
CREATE TABLE my_orc_table (
column1 INT,
column2 STRING,
column3 DOUBLE
)
STORED AS ORC;
Parquet :
CREATE TABLE IF NOT EXISTS my_table (
column1 datatype1,
column2 datatype2,
-- Add more columns as needed
)
STORED AS PARQUET;
AVRO :
CREATE TABLE avro_table (
column1 datatype1,
column2 datatype2,
-- Add more columns as needed
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='file:///path/to/your/avro_schema.avsc');
Sequence :
CREATE TABLE my_sequence_table (
id INT,
name STRING
)
STORED AS SEQUENCEFILE;
JSON :
CREATE TABLE my_json_table (
column1 datatype1,
column2 datatype2,
...
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
XML:
CREATE TABLE xml_table (
col1 STRING,
col2 INT
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.col1"="/root/col1/text()",
"column.xpath.col2"="/root/col2/text()"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
"xmlinput.start"="<root>",
"xmlinput.end"="</root>"
);
Compression codecs :
In Hive, various file formats support different compression codecs, which help reduce
storage space and improve data transfer performance. Here is a comparison of popular
compression codecs used in Hive, focusing on their speed, compression ratio, and use
cases:
1. Snappy
- Speed: Fast (Both compression and decompression)
- Compression Ratio: Moderate
- Use Case: Optimized for speed rather than space, making it ideal for real-time queries
and quick data access. Often used in ORC and Parquet files.
- Supported Formats: ORC, Parquet, Avro, SequenceFile
- Pros:
- High decompression speed
- Low CPU overhead
- Cons:
- Lower compression ratio compared to GZIP and ZLIB
2. GZIP
- Speed: Slow (Compression), Moderate (Decompression)
- Compression Ratio: High
- Use Case: When compression ratio is a higher priority than speed. Used where disk
space or network bandwidth is a concern.
- Supported Formats: TextFile, ORC, Parquet, Avro
- Pros:
- High compression ratio
- Suitable for archiving data
- Cons:
- Slower compression and decompression speeds
3. ZLIB
- Speed: Moderate (Compression), Moderate (Decompression)
- Compression Ratio: High (Similar to GZIP)
- Use Case: Offers a balance between compression speed and size. It is widely used in
ORC for batch processing where size reduction is important.
- Supported Formats: ORC, Avro, SequenceFile
- Pros:
- Good balance between compression speed and size
- Supported in multiple Hive file formats
- Cons:
- Not as fast as Snappy for decompression
4. LZO
- Speed: Fast (Decompression), Moderate (Compression)
- Compression Ratio: Moderate
- Use Case: Suitable for large-scale data where decompression speed is important. Often
used in SequenceFile format and Hadoop applications.
- Supported Formats: TextFile, SequenceFile
- Pros:
- Fast decompression
- Effective for distributed systems like Hadoop
- Cons:
- Requires separate installation of LZO libraries
- Lower compression ratio than GZIP
5. BZIP2
- Speed: Slow (Both compression and decompression)
- Compression Ratio: Very high
- Use Case: Used where maximum compression is needed, though rarely due to its slow
speed. Suitable for archival purposes.
- Supported Formats: TextFile
- Pros:
- Highest compression ratio among common codecs
- Cons:
- Very slow compression and decompression speed
- Not recommended for real-time queries or frequent access
6. DEFLATE
- Speed: Moderate
- Compression Ratio: High (Similar to ZLIB)
- Use Case: Mostly used in ORC and other formats requiring balance between
compression and performance. It's a universal standard compression algorithm.
- Supported Formats: ORC, Avro, SequenceFile
- Pros:
- High compression ratio
- Good for general-purpose use
- Cons:
- Slower than Snappy for real-time data access
- Split-able formats (like Snappy, LZO, and BZIP2) allow Hive or Hadoop to split large files
into smaller parts. This enables efficient parallel processing across multiple nodes in a
distributed environment like Hadoop, improving query performance and reducing job
completion times.
- Snappy: Splittable in practice when used inside container formats such as ORC, Parquet, Avro, or SequenceFile, making it highly suitable for distributed processing and large datasets.
- LZO: Supports splitting but requires an index file (`.index` file) to allow Hadoop to split the
compressed file. Without the index, LZO is not split-able. However, when indexed properly,
it's a good choice for distributed data processing.
- BZIP2: Despite its slower compression and decompression speeds, it supports splitting,
which can be advantageous in distributed systems where high compression is needed
without sacrificing parallelism.
- Non-Split-able formats (like GZIP, ZLIB, and DEFLATE) do not support splitting, meaning
that each compressed file must be processed by a single node. This limitation can reduce
parallelism and increase processing time when working with large datasets.
- GZIP: Does not support splitting. When a GZIP file is processed, the entire file is assigned
to one node, which can become a bottleneck in distributed environments.
- ZLIB: Similar to GZIP, ZLIB also doesn't support splitting. It's more suited for smaller files or cases where parallelism is not critical.
- DEFLATE: Not split-able, so it's less suitable for very large datasets that need to be
processed in parallel.
Key Considerations:
- If you're working with large datasets and distributed processing is important, it's best to
choose a split-able codec like Snappy, LZO (with indexing), or BZIP2.
- For archival purposes or when maximizing storage efficiency is the goal, GZIP or ZLIB may
be better, but keep in mind that they do not allow file splitting.
Comparison Table
| Compression Codec | Compression Speed | Decompression Speed | Compression Ratio | Split Ability | Best Use Case | Supported File Formats |
|-------------------|-------------------|---------------------|-------------------|---------------------|-------------------------------------|----------------------------------|
| Snappy | Fast | Fast | Moderate | Yes | Real-time queries, low CPU usage | ORC, Parquet, Avro, SequenceFile |
| GZIP | Slow | Moderate | High | No | Archiving, bandwidth-constrained | TextFile, ORC, Parquet, Avro |
| ZLIB | Moderate | Moderate | High | No | Batch processing, disk savings | ORC, Avro, SequenceFile |
| LZO | Moderate | Fast | Moderate | Yes (with indexing) | Large datasets, distributed systems | TextFile, SequenceFile |
| BZIP2 | Very Slow | Very Slow | Very High | Yes | Maximum compression, archival | TextFile |
| DEFLATE | Moderate | Moderate | High | No | General-purpose compression | ORC, Avro, SequenceFile |
Key Takeaways:
- Snappy is great for fast decompression and lower CPU overhead, making it suitable for
quick access to data.
- GZIP and ZLIB provide higher compression ratios at the cost of slower performance.
- LZO offers a middle ground, with fast decompression for large datasets, often used in
distributed systems.
- BZIP2 gives the highest compression ratio but is impractically slow for most real-time use
cases.
Recommendation:
- Real-time querying: Use `Snappy`.
- Batch processing or archival: Use `GZIP` or `ZLIB` for a good balance between
compression size and performance.
- Large-scale distributed systems: Use `LZO` for fast decompression.
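As a rough sketch of applying these recommendations to a text-based table, output compression can be switched on per session. The properties below are standard Hive/Hadoop output-compression settings; `staging_table` is an illustrative name, and `my_csv_table` is the TextFile table defined earlier:
```sql
-- Compress the files written by the next INSERT with GZIP
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;

-- The rows land as GZIP-compressed text files in the table's directory
INSERT OVERWRITE TABLE my_csv_table
SELECT * FROM staging_table;
```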
Supported by ORC:
- NONE: No compression is applied. Data is stored as is.
- ZLIB: This is a widely used compression algorithm that provides good compression ratios at the cost of higher CPU usage.
- SNAPPY: Snappy is a fast compression algorithm that provides good compression ratios with low CPU usage. It's a popular choice for tasks where speed is a priority.
- LZO: LZO (Lempel-Ziv-Oberhumer) is a compression algorithm known for its fast compression and decompression speeds. However, it's not as space-efficient as some other codecs.
- LZ4: LZ4 is a very fast compression algorithm that is designed for speed rather than high compression ratios. It's commonly used for tasks where decompression speed is crucial.
- ZSTD: Zstandard is a modern compression algorithm that aims to provide a good balance between compression ratio and speed. It's designed to be competitive with other popular codecs like Snappy and zlib.
- LZ4_FRAME: This is another variant of the LZ4 algorithm specifically tailored for streaming applications.
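In practice the ORC codec is usually picked with the `orc.compress` table property; a minimal sketch with illustrative table and column names:
```sql
-- ORC table compressed with Snappy; swap the value for ZLIB, LZ4, ZSTD, etc.
CREATE TABLE events_orc (
  event_id BIGINT,
  event_type STRING,
  event_time TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
```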
Supported by Parquet:
- Gzip: This is a widely used compression algorithm. It provides good compression ratios but may not be as fast to decompress as some other codecs.
- Snappy: Snappy is a fast compression and decompression library. It provides good compression ratios and is relatively fast in terms of decompression speed.
- LZO (not natively supported): LZO is a popular compression algorithm that is known for its high compression and decompression speeds. However, it is not natively supported by Parquet and would require additional configurations or libraries.
- Brotli: Brotli is a newer compression algorithm developed by Google. It provides high compression ratios and is designed to be fast in terms of both compression and decompression.
- LZ4: LZ4 is another compression algorithm known for its high-speed compression and decompression. It doesn't provide as high compression ratios as some other algorithms like Gzip, but it is very fast.
- Zstandard (zstd): Zstandard is a relatively new compression algorithm that aims to provide a good balance between compression ratios and speed.
- None: You can choose not to use any compression. This is suitable for scenarios where you prioritize speed and do not need to minimize storage space.
Supported by AVRO:
- Deflate: This is a commonly used compression algorithm. It provides good compression ratios and is relatively fast.
- Snappy: This is a popular compression codec known for its speed and reasonable compression ratios. It is often used in conjunction with Avro.
- Gzip: Gzip provides good compression ratios but tends to be slower than some other codecs.
- Bzip2: This provides very good compression ratios but is generally slower than other codecs.
- LZO: LZO is a high-speed compression algorithm that is suitable for real-time data processing.
- Zstandard (Zstd): This is a modern compression algorithm that provides a good balance between compression ratios and speed.
- LZ4: LZ4 is an extremely fast compression algorithm. It may not achieve the highest compression ratios, but it is well suited where speed matters most.
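For Avro-backed tables, compression is typically controlled per session; a minimal sketch using the `avro_table` defined earlier (`staging_table` is an illustrative source name):
```sql
-- Enable compressed output and select the Avro codec (deflate, snappy, bzip2, ...)
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;

-- Rows written by this INSERT are stored as Snappy-compressed Avro blocks
INSERT OVERWRITE TABLE avro_table
SELECT * FROM staging_table;
```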
Hive Transactional Tables
# Overview
Hive transactional tables enable ACID (Atomicity, Consistency, Isolation, Durability)
operations, allowing you to perform INSERT, UPDATE, DELETE, and MERGE operations,
which are typically absent in traditional Hive tables. This capability is crucial for managing
transactional data and slowly changing dimensions in big data systems.
# Relevance
Transactional tables in Hive are especially useful in scenarios where data mutability is
required, like when you need to:
- Update existing records in data warehouses.
- Delete specific data (e.g., GDPR-related requirements).
- Perform incremental loads with MERGE operations.
Prior to the introduction of transactional tables, Hive was primarily an append-only system,
meaning data could only be added but not modified or deleted, limiting its usefulness in
certain data warehousing scenarios.
# Use Cases
1. Incremental Data Loading: Performing MERGE operations to update existing records
based on new incoming data, typically in ETL pipelines.
2. Slowly Changing Dimensions (SCD): Managing changes to dimensions over time, where
you need to update or delete records while keeping history.
3. GDPR Compliance: Deleting or updating sensitive customer data in compliance with data
privacy regulations.
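As a sketch of the MERGE pattern from use case 1 above, assuming a transactional target table `customers` and a staging table `customer_updates` (names and columns are illustrative):
```sql
-- Upsert staged changes into an ACID (ORC, transactional) target table
MERGE INTO customers AS t
USING customer_updates AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET email = s.email, city = s.city
WHEN NOT MATCHED THEN INSERT VALUES (s.customer_id, s.email, s.city);
```
A WHEN MATCHED ... THEN DELETE branch can be added for records flagged for removal (e.g., GDPR deletes).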
# Pros
- ACID Properties: Ensures consistency and reliability in transactional data operations, even
in a distributed environment like Hadoop.
- Mutability: Allows performing UPDATE, DELETE, and MERGE operations, making Hive
more suitable for enterprise data warehouses that require changes to historical data.
- Efficient Storage: Utilizes compaction to optimize storage after a series of small inserts or
updates, ensuring that storage space is not unnecessarily consumed.
- Fine-grained Access: Allows fine-grained access to data, improving data governance.
# Cons
- Performance Overhead: Transactional tables require write locks, and compactions
introduce overhead, making them slower than traditional append-only Hive tables.
- More Resources: The need for compaction and locking mechanisms can result in higher
memory and CPU usage.
- Configuration Complexity: Requires proper configuration of Hive metastore and Tez or MR
engines to handle ACID properties.
- Limitations on File Formats: Currently, ORC is the only supported format for fully
ACID-compliant tables, restricting flexibility.
Hive Compaction
# Overview
Each INSERT, UPDATE, and DELETE on a Hive transactional table produces small delta files. As these delta files accumulate, they cause performance and storage issues:
1. Small File Problem: Too many small files can overwhelm the NameNode in Hadoop, as it
has to manage a large number of file metadata, leading to performance degradation.
2. Query Performance: Each query must process all delta files, increasing the time taken to
11
retrieve the data.
3. Increased Storage Overhead: A large number of small files lead to inefficient use of
storage space.
Compaction addresses these issues by merging the small delta files and base files into
fewer, larger files, optimizing both query performance and storage usage.
---
Types of Compaction
There are two main types of compaction in Hive: Minor Compaction and Major Compaction.
# 1. Minor Compaction
- Purpose: Combines delta files generated by INSERT, UPDATE, and DELETE operations
into a single delta file.
- What It Does: Merges all the delta files into a single delta file without touching the base file
(i.e., the original data). This helps reduce the overhead of scanning multiple small delta files
during queries.
- Trigger: Typically triggered automatically when a certain number of delta files are created,
but it can also be triggered manually if needed.
```sql
ALTER TABLE table_name COMPACT 'MINOR';
```
# 2. Major Compaction
- Purpose: Combines both the base file and all associated delta files into a single, optimized
file.
- What It Does: This process merges all the delta files into the base file, creating a new base
file that incorporates all changes made by the transactions (INSERTs, UPDATEs, and
DELETEs).
For example, if you have a base file and multiple delta files:
```
base_0000
delta_0001
delta_0002
delta_0003
```
Major compaction will merge these into a single base file:
```
base_0001
```
```sql
ALTER TABLE table_name COMPACT 'MAJOR';
```
---
- Automatic Compaction: Hive triggers compaction automatically when the compactor is enabled in `hive-site.xml`:
```xml
<property>
<name>hive.compactor.initiator.on</name>
<value>true</value>
</property>
<property>
<name>hive.compactor.worker.threads</name>
<value>1</value>
</property>
```
- Manual Compaction: Compaction can be triggered manually by the user. This is useful
when automatic compaction is disabled, or you want to force compaction at a specific time
(e.g., during off-peak hours).
```sql
ALTER TABLE table_name COMPACT 'MINOR';
ALTER TABLE table_name COMPACT 'MAJOR';
```
---
Compaction in Hive is carried out by the Compaction Initiator and Compaction Worker
processes:
- Compaction Initiator: Periodically scans the Hive Metastore for tables that need compaction
based on the thresholds configured.
- Compaction Worker: Performs the actual compaction tasks. Each worker takes a
compaction job from the queue, reads the delta and base files, merges them, and writes the
output back to the table’s storage location.
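The state of this queue can be inspected from the Hive CLI or Beeline; a brief sketch:
```sql
-- Lists pending, running, and completed compactions along with their type (MINOR/MAJOR)
SHOW COMPACTIONS;

-- Helpful companions when diagnosing why a compaction has not run yet
SHOW TRANSACTIONS;
SHOW LOCKS;
```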
---
In addition to enabling automatic compaction, you can fine-tune the compaction process with
the following settings in the `hive-site.xml` file:
1. Compaction Frequency:
```xml
<property>
<name>hive.compactor.delta.num.threshold</name>
<value>10</value>
</property>
```
- Controls how many delta files must accumulate before triggering compaction. In this
case, 10 delta files will trigger a compaction.
2. Compaction Delay:
```xml
<property>
<name>hive.compactor.initiator.honor.transactions</name>
<value>true</value>
</property>
```
- Delays compaction until all active transactions on the table are completed.
---
# Use Cases
- Transactional Data Processing: When working with ACID tables where UPDATE, DELETE,
and INSERT operations are frequent, compaction improves performance by consolidating
these small changes.
- Long-running Queries: In scenarios where long-running queries scan large datasets,
compaction ensures that queries don't have to scan multiple small delta files, reducing query
execution times.
---
# Pros
- Improved Query Performance: Queries run faster after compaction because fewer files
need to be scanned.
- Efficient Storage Utilization: Reduces the number of small files, improving storage
efficiency and reducing the burden on the NameNode in Hadoop.
- Better Resource Utilization: Compaction optimizes resources by minimizing the CPU and
I/O overhead during queries.
# Cons
- Resource-Intensive Process: Compaction itself is resource-intensive and can affect system
performance during execution, especially if not properly scheduled.
- Additional Configuration: Requires fine-tuning and monitoring to ensure compactions are
running efficiently without impacting other operations.
- Potential Delays: Compaction might not keep up with very high-frequency data updates,
leading to increased overhead if not managed properly.
---
Summary
Compaction is a critical process for Hive transactional tables, helping to maintain
performance and storage efficiency in an environment where multiple small files are
generated by ACID operations. By merging delta files and base files, it ensures that Hive can
continue to process queries efficiently without being bogged down by small-file overhead.
Proper configuration of compaction settings and balancing between automatic and manual
compaction is essential for achieving optimal performance in transactional Hive tables.
Hive Materialized Views
# Overview
Materialized Views in Hive are pre-computed views that store the result of a query physically
on disk. When a query uses a materialized view, Hive will try to use the pre-computed data
instead of re-executing the query, significantly improving query performance for repetitive or
complex operations.
# Relevance
Materialized views are highly relevant in big data analytics and data warehousing where
complex queries with joins, aggregations, and filters are common. They reduce the query
execution time by avoiding the recomputation of expensive operations and allow for query
acceleration.
# Use Cases
1. Reporting: Use materialized views to pre-compute data that is frequently queried in
reports, dashboards, or BI tools, improving response times.
2. ETL Workflows: Speed up ETL processes by creating materialized views of intermediate
steps.
3. Query Optimization: In scenarios where queries repeatedly access the same data with
similar filters or joins, materialized views can reduce computation time.
4. Data Aggregation: Create materialized views for complex aggregations that need to be
reused across multiple queries.
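As a hedged sketch of the native syntax (Hive 3.x; the view, table, and column names are illustrative, and the source table is assumed to be transactional):
```sql
-- Pre-compute an aggregation that reports query repeatedly
CREATE MATERIALIZED VIEW mv_sales_by_region AS
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region;

-- Refresh the stored result after the underlying table has changed
ALTER MATERIALIZED VIEW mv_sales_by_region REBUILD;
```
When automatic rewriting is enabled, Hive can transparently answer matching queries from the materialized view instead of rescanning `sales`.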
# Pros
- Query Optimization: Significantly reduces the time taken to run complex queries by reusing
precomputed data instead of executing the query from scratch.
- Automatic Refresh: Materialized views can be refreshed automatically when the underlying
tables are updated, ensuring the view remains up-to-date.
- Improved Query Performance: Since the data is already computed, materialized views
reduce the CPU and memory required to run frequent queries.
- Partitioning Support: Materialized views can take advantage of table partitioning, allowing
better performance on large datasets.
# Cons
- Storage Overhead: Since materialized views store the precomputed data on disk, they
consume extra storage space.
- Staleness of Data: Materialized views can become stale if not refreshed regularly, which
may result in outdated query results.
- Maintenance Complexity: Requires periodic maintenance and monitoring of view
freshness, especially in environments where data changes frequently.
- Not Suitable for Real-Time Data: Materialized views are precomputed, so they don't work
well for real-time data needs where immediate freshness is critical.
---
Comparison and Summary
| Aspect | Transactional Tables | Materialized Views |
|--------|----------------------|--------------------|
| Data Freshness | Data is current as transactions are directly applied | Data might be stale unless the view is refreshed |
| Storage Requirement | No additional storage requirements (except for compaction) | Requires additional storage for storing precomputed data |
| Maintenance | Requires configuration and management of locking, compaction | Requires refreshing when source tables change |
| Scalability | Good scalability but with some performance trade-offs | Scales well for read-heavy workloads |
- Transactional Tables: Use when data mutability is required. In data warehouses where records need to be updated or deleted over time, transactional tables are essential.
- Materialized Views: Use when you need query optimization for complex, frequently run
queries. Materialized views significantly improve the performance of queries with
aggregations, joins, and filters.
Example: enabling ACID support and creating a transactional table:
SET hive.support.concurrency = true;
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on = true;
SET hive.compactor.worker.threads = 1;
CREATE TABLE employee (id int, name string, salary int)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
-- Insert data from a simple Hive table into the transactional table
INSERT INTO TABLE employee
SELECT emp_id, emp_name, emp_salary FROM simple_hive_table;
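With the table declared transactional, row-level changes become possible; a small sketch (the predicate values are illustrative):
```sql
-- Row-level modifications are only allowed on ACID (ORC + transactional) tables
UPDATE employee SET salary = 8000 WHERE id = 2;
DELETE FROM employee WHERE id = 1;
```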
To disable the `DbTxnManager` in Hive, you need to set the `hive.txn.manager` property to a
value that corresponds to a different transaction manager or to no transaction manager at
all. You can achieve this by setting it to an empty string or to
`org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager`.
Here's how you can disable the `DbTxnManager`:
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager;
With `DummyTxnManager` in place, Hive will not enforce ACID (Atomicity, Consistency, Isolation, Durability) properties and won't perform any transaction-related operations.
Remember to restart Hive services or sessions after making this change for it to take effect.
-- Query the view to fetch employees with a salary greater than 60000
SELECT * FROM employee_view WHERE emp_salary > 60000;
SELECT AVG(emp_salary) FROM employee_view;
To create a materialized view in Hive and perform queries on it based on a transactional
`employee` table, you would typically follow these steps:
```sql
-- Step 1: Create the materialized view from the transactional table
CREATE MATERIALIZED VIEW mv_employee AS
SELECT * FROM employee;

-- Step 2: Refresh the materialized view with the latest data from the transactional table
-- Note: re-run this whenever the view should pick up new data from the transactional table.
INSERT OVERWRITE TABLE mv_employee
SELECT * FROM employee;

-- Example query: Get the details of employees earning more than $60,000
SELECT * FROM mv_employee WHERE emp_salary > 60000;
```
In this example:
- We first create a materialized view called `mv_employee` using a `SELECT * FROM
employee` statement.
- We then refresh the materialized view with the latest data from the transactional table by using `INSERT OVERWRITE TABLE` with a `SELECT` statement.
- Finally, we perform various queries on the materialized view, such as getting the count of
employees, calculating the average salary, and fetching details of employees earning more
than $60,000.
Make sure to adjust the queries according to your specific requirements and schema.
Vectorization in Hive
# Overview
Vectorization in Hive is a performance optimization technique where operations are
performed on batches of rows (blocks of data) rather than processing rows one-by-one. This
approach significantly speeds up query execution by reducing CPU overhead and making
better use of modern CPU architectures, which can handle multiple data points in parallel.
---
Relevance
Vectorization is crucial in Hive for query optimization and is particularly useful in data
analytics where large-scale aggregation, filtering, and join operations are common. It speeds
up processing, especially on ORC-formatted tables, by leveraging SIMD (Single Instruction,
Multiple Data) CPU instructions that operate on multiple data points simultaneously.
---
Use Cases
1. Data Analytics: When running complex analytical queries that involve large scans,
aggregations, and joins, vectorization improves query performance by reducing the time
taken to process the data.
2. ETL Pipelines: In Extract-Transform-Load (ETL) pipelines, where large volumes of data
TG
need to be transformed, filtered, or aggregated before loading into the target system,
vectorized operations can reduce the time required for each transformation step.
3. Business Intelligence Reports: Queries run by BI tools often involve large datasets with
aggregations and joins. Vectorization enhances query speed, improving the responsiveness
of dashboards and reports.
4. Big Data Workloads: Any workload that processes large datasets in Hadoop or Hive, such
as e-commerce platforms, financial data analysis, or social media analytics, can benefit from
vectorized query execution.
---
Prerequisites
1. Hive Version:
- Vectorization was introduced in Hive 0.13, but it is significantly optimized in later versions,
especially Hive 2.x and Hive 3.x.
2. Table Format:
- ORC (Optimized Row Columnar) is the most commonly supported file format for
vectorized query execution. Parquet files also support vectorization in Hive.
3. Configuration: Vectorized execution is enabled with the following settings:
```sql
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
```
- `hive.vectorized.execution.enabled`: Enables vectorization for Map tasks.
- `hive.vectorized.execution.reduce.enabled`: Enables vectorization for Reduce tasks.
4. Execution Engines:
- Vectorization works with Tez and MapReduce execution engines, but Tez is generally
recommended due to better performance.
5. Data Types:
- Vectorization supports most primitive data types (e.g., `INT`, `BIGINT`, `FLOAT`, `DOUBLE`, `STRING`, etc.), but there is limited support for complex data types (e.g., `ARRAY`, `MAP`, `STRUCT`).
---
Pros of Vectorization
1. Performance Improvements:
- Faster Query Execution: Vectorization can lead to up to 2-10x improvement in query
execution times, especially for aggregation, joins, and filtering on large datasets.
- Better CPU Utilization: By processing batches of rows in a columnar format, vectorization
reduces the CPU overhead caused by row-by-row processing, resulting in more efficient use
of CPU caches and instructions.
---
Cons of Vectorization
1. Limited Support for Complex Types: Queries that rely on unsupported data types or expressions fall back to traditional row-based execution, reducing the performance benefits.
2. Resource Intensive:
- Although vectorization reduces CPU overhead, it can increase memory consumption due
to the need to store data in batches for columnar processing. This can lead to increased
memory pressure, especially in large-scale data processing jobs.
3. Not Suitable for Small Datasets: For small data volumes, the overhead of batching rows can outweigh the benefits of vectorized processing.
4. Initial Setup:
- Proper configuration is required for vectorization, and misconfigurations (e.g., disabling
vectorization for certain tasks) may lead to suboptimal performance.
---
Performance Impact
Vectorization is highly effective for:
- Aggregation Queries: Queries that involve sum, count, average, and other aggregations
over large datasets.
Example:
```sql
SELECT department, AVG(salary) FROM employees GROUP BY department;
```
- Join Operations: When joining large tables, vectorization reduces the time taken for
performing joins by processing rows in bulk.
Example:
```sql
SELECT a.emp_id, b.department FROM employees a JOIN departments b ON a.dept_id =
b.dept_id;
```
- Filter Operations: For queries with WHERE clauses, vectorization applies the filter across
the batch of rows in one go, reducing the time spent filtering individual rows.
Example:
```sql
SELECT * FROM employees WHERE salary > 50000;
```
---
To ensure that vectorization is optimally configured, consider the following:
- Tuning Batch Size: The default batch size for vectorization is 1,024 rows, but this can be
tuned based on memory and CPU characteristics.
```sql
set hive.vectorized.execution.batch.size = 2048;
```
- Execution Engine: Using Tez as the execution engine can provide better performance
compared to MapReduce when combined with vectorization.
```xml
<property>
<name>hive.execution.engine</name>
<value>tez</value>
</property>
```
- Query Rewriting: Ensure that your queries are written in a way that takes advantage of
vectorized operations. Simple aggregations, filters, and joins are the best candidates.
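To check whether a given query will actually be vectorized, Hive 2.3+ can annotate the query plan; a hedged sketch reusing the earlier aggregation example:
```sql
-- Reports, per operator, whether vectorization is enabled and the reason if it is not
EXPLAIN VECTORIZATION DETAIL
SELECT department, AVG(salary)
FROM employees
GROUP BY department;
```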
---
Summary of Vectorization
| Feature | Description |
|---------|-------------|
| Purpose | Process multiple rows (batches) at once, improving performance |
| Best for | Large-scale queries with aggregations, joins, and filtering |
| File Formats | Primarily effective with ORC and Parquet formats |
| Supported Data Types | Supports most primitive types; limited support for complex types |
| Performance Impact | Up to 2-10x speedup for large queries |
| Prerequisites | Hive version 0.13+ (better in 2.x+), ORC/Parquet formats, Tez or MR |
| Pros | Faster queries, better CPU utilization, scalable |
| Cons | Not suitable for small data, limited complex type support, memory usage |
Overall, vectorization delivers the biggest gains on queries that scan and aggregate large volumes of data.