BIG DATA 4
1. FILTER
The FILTER operator selects tuples from a dataset based on a condition.
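A minimal example (assuming a relation A with an age field):
B = FILTER A BY age > 25;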
2. GROUP
The GROUP operator groups the tuples of a dataset by one or more fields.
B = GROUP A BY age;
3. UNION
The UNION operator merges the contents of two or more datasets into one.
C = UNION A, B;
4. JOIN
The JOIN operator is used to join two datasets based on a common field.
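A minimal example (assuming relations A and B that share an id field):
C = JOIN A BY id, B BY id;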
5. SPLIT
The SPLIT operator splits a dataset into multiple datasets based on conditions.
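A minimal example (assuming a relation A with an age field):
SPLIT A INTO B IF age < 30, C IF age >= 30;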
6. FOREACH
The FOREACH operator applies an expression to every record of a dataset and generates new output fields.
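A minimal example (assuming a relation A with name and age fields):
B = FOREACH A GENERATE name, age * 2;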
7. DISTINCT
The DISTINCT operator removes duplicate tuples from a dataset.
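Example:
B = DISTINCT A;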
8. LIMIT
The LIMIT operator restricts the number of tuples in the output.
B = LIMIT A 10;
9. ORDER BY
The ORDER BY operator sorts a dataset by one or more fields.
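A minimal example (assuming a relation A with an age field):
B = ORDER A BY age DESC;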
10. MAX
The MAX function computes the highest value of a numeric field and is applied after grouping (assuming A has an age field):
B = GROUP A ALL;
C = FOREACH B GENERATE MAX(A.age);
2. Explain the features of PIG. Also explain the datatypes of PIG. Explain the architecture of PIG.
Features of Pig
1. Ease of Use:
o Pig provides a high-level scripting language, Pig Latin, which is simpler to write and
understand compared to Java-based MapReduce code.
o Pig scripts describe the flow of data transformations and operations step-by-step.
2. Extensibility:
o Allows users to create their own functions (UDFs) to process and analyze data.
o Supports complex and nested data types like tuples, bags, and maps, enabling
processing of semi-structured data like JSON, XML, and log files.
3. Optimization:
o Pig automatically optimizes execution plans, making scripts efficient without requiring
user intervention.
4. Fault Tolerance:
o Pig runs on top of Hadoop and inherits its fault-tolerance capabilities. Failed tasks are
retried automatically.
5. Schema-Free Processing:
o Users can define schemas as needed or choose to process data without them.
Scalar Types:
1. int: A 32-bit signed integer.
2. long: A 64-bit signed integer.
3. float: A 32-bit floating-point number.
4. double: A 64-bit floating-point number.
5. chararray: A string of characters in UTF-8 format.
6. bytearray: Represents a sequence of bytes (the default type when the schema is not defined).
Complex Types:
1. Tuple:
o An ordered set of fields, e.g., (Alice, 25).
2. Bag:
o A collection of tuples, which is unordered and allows duplicates, e.g., {(Alice, 25), (Bob, 30)}.
3. Map:
o A set of key-value pairs, e.g., [name#Alice, age#25].
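A sketch of how these types appear in a LOAD schema (the file name and field names are assumed for illustration):
A = LOAD 'students' AS (name:chararray, grades:bag{t:tuple(score:int)}, info:map[]);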
Architecture of Pig
Apache Pig works on top of Hadoop, and its architecture is designed for efficient execution of data
processing tasks. The architecture includes the following components:
1. Parser:
Parses and validates Pig Latin scripts, producing a logical plan.
2. Optimizer:
Optimizes the logical plan before it is compiled into a series of MapReduce jobs.
3. Compiler:
Converts Pig Latin scripts into MapReduce jobs or other execution models (e.g., Tez, Spark).
4. Execution Modes:
Local Mode: Runs on a single machine, suitable for testing small datasets.
MapReduce Mode: Runs on a Hadoop cluster, suitable for large-scale data processing.
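For example, the execution mode can be selected when launching Pig (the script name is assumed):
pig -x local myscript.pig
pig -x mapreduce myscript.pig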
5. MapReduce Framework:
Executes the jobs generated by the Pig compiler. Each job corresponds to a specific
transformation or operation defined in the Pig script.
3. With a relevant diagram, explain the architecture of Hive, along with the three kinds of Metastore.
Architecture of Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop. It provides a SQL-like interface to
query and manage large datasets in HDFS. Below is the detailed explanation of its architecture along
with a diagram.
1. User Interface (UI):
o Allows users to submit queries through the CLI, Web Interface, or JDBC/ODBC clients.
2. Metastore:
o Stores metadata such as table schemas, partitions, column types, and the location of data in HDFS.
3. Driver:
o Acts as the controller, receiving queries from the UI and driving their execution.
4. Compiler:
o Parses the query and generates an execution plan, which includes MapReduce or other
execution models.
5. Optimizer:
o Optimizes the logical plan by applying rules like predicate pushdown, join optimization,
and query rewriting.
6. Execution Engine:
o Converts the execution plan into tasks and coordinates their execution using the Hadoop
framework.
In addition to these components, Hive provides a SQL-like query language (HiveQL) for data manipulation and querying. The overall architecture is shown below:
+------------------+       +-------------------------+
|  User Interface  |       |        Metastore        |
+------------------+       +-------------------------+
         |                              ^
         v                              |
+------------------+       +-------------------------+
|      Driver      |<----->|  Compiler / Optimizer   |
+------------------+       +-------------------------+
         |
         v
+------------------+
| Execution Engine |
+------------------+
         |
         v
+-------------------------+
|           HDFS          |
+-------------------------+
Three Kinds of Metastore
1. Embedded Metastore:
o The metastore service and the Derby database run inside the same JVM as Hive. Simple to set up, but only one session can connect at a time.
2. Local Metastore:
o The metastore service runs in the Hive JVM, but the metadata database (e.g., MySQL) runs in a separate process, allowing multiple concurrent sessions.
3. Remote Metastore:
o The metastore service runs in its own JVM, and Hive connects to it over Thrift. This allows the metastore to be shared by multiple Hive clients and improves security.
4. Explain the Data units of Hive and also infer the usage of Partitioning and Bucketing with an example
and diagram.
1. Database:
o A collection of tables. It is used to organize tables logically and avoid naming conflicts.
2. Table:
o Stores data in rows and columns, similar to a table in a relational database.
o Example: CREATE TABLE customers (id INT, name STRING, age INT);
3. Partition:
o A division of a table based on the values of one or more columns; each partition is stored as a separate directory.
o Example: A table partitioned by region will have separate directories for region=US,
region=EU, etc.
4. Bucket:
o A further division of data within a partition. Data is distributed into buckets based on a
hash of a column.
o Example: Data in a partition can be divided into buckets using the customer_id.
Partitioning in Hive
Partitioning organizes data into smaller, more manageable chunks by dividing the table
into directories based on the value of one or more columns.
Example (the table name and non-partition columns are assumed for illustration):
CREATE TABLE sales (
    product_id INT,
    amount DOUBLE
)
PARTITIONED BY (region STRING)
STORED AS ORC;
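With this layout, a query that filters on the partition column reads only the matching directory (partition pruning), e.g.:
SELECT product_id, amount FROM sales WHERE region = 'US';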
Bucketing in Hive
Bucketing further divides data within each partition into smaller files (buckets) based on a hash of a
column. This improves query performance, especially for joins.
Example (extending the sales table above; the bucket count is illustrative):
CREATE TABLE sales_bucketed (
    customer_id INT,
    product_id INT,
    amount DOUBLE
)
PARTITIONED BY (region STRING)
CLUSTERED BY (customer_id) INTO 4 BUCKETS
STORED AS ORC;
The resulting directory layout in HDFS looks like:
/sales_bucketed/region=US/bucket_00000
/sales_bucketed/region=US/bucket_00001
/sales_bucketed/region=EU/bucket_00000
/sales_bucketed/region=EU/bucket_00001
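Bucketing also enables efficient sampling, for example (using the table sketched above):
SELECT * FROM sales_bucketed TABLESAMPLE(BUCKET 1 OUT OF 4 ON customer_id);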
5. Explain the data types of Hive.
Hive supports the following data types:
Primitive Types:
1. INT: Whole numbers. Example: 10, -250
2. BIGINT: Very large whole numbers. Example: 9000000000
3. FLOAT: Numbers with decimals, but less precise. Example: 3.14, -0.5
4. DOUBLE: Numbers with decimals, with higher precision. Example: 3.14159265
5. STRING: Text values. Example: 'Alice'
6. BOOLEAN: True or false values. Example: true
7. TIMESTAMP: Date and time values. Example: 2024-01-01 10:30:00
8. DATE: Date values only. Example: 2024-01-01
9. DECIMAL: Numbers with fixed precision and scale. Example: 10.5, -3.14
Complex Types:
1. ARRAY: Ordered list of values of the same type. Example: ['a', 'b', 'c']
2. MAP: Collection of key-value pairs. Example: {'city':'Paris'}
3. STRUCT: Group of fields, each with its own type. Example: {'name':'Alice', 'age':25}
6. What is Hive? What are the tasks performed by Hive? Also explain the features of Hive.
What is Hive?
Hive is a data warehouse tool built on top of Hadoop. It enables querying, analyzing, and managing large
datasets stored in Hadoop’s HDFS using a SQL-like language called HiveQL (Hive Query Language). Hive
makes it easy for users familiar with SQL to work with big data without needing to write complex
MapReduce programs.
Tasks Performed by Hive
1. Query Data:
o Hive allows users to query large datasets using SQL-like syntax (HiveQL).
2. Analyze Data:
o Perform aggregations, filtering, sorting, and grouping of data (see the example after this list).
3. Store Data:
o Data is stored in tables that can be partitioned and bucketed for better organization and
faster querying.
4. Manage Metadata:
o Hive’s Metastore keeps track of table structures, schema details, and data locations.
5. Data Transformation:
o Hive can transform data by joining, filtering, and applying user-defined functions (UDFs).
6. Batch Processing:
o Hive is designed for batch processing of large datasets rather than real-time analytics.
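A short HiveQL sketch of the query and analysis tasks above (table and column names are assumed):
SELECT region, COUNT(*) AS orders, SUM(amount) AS total_sales
FROM sales
WHERE amount > 0
GROUP BY region
ORDER BY total_sales DESC;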
Features of Hive
1. SQL-like Query Language:
o Provides an easy-to-use SQL-like query language (HiveQL) for users familiar with traditional
databases.
2. Schema on Read:
o Data is not validated at the time of storage but during query execution, allowing
flexibility with data formats.
3. Scalability:
o Scales with Hadoop’s ability to handle large amounts of data across distributed nodes.
4. Partitioning and Bucketing:
o Helps organize and retrieve data more efficiently by dividing it into partitions and
buckets.
5. Support for Complex Data Types:
o Works with structured, semi-structured, and nested data using data types like arrays,
maps, and structs.
6. Extensibility:
o Allows developers to write custom functions (UDFs) using Java, Python, or other languages.
7. Interoperability:
o Works with tools like Spark, Pig, and MapReduce for diverse data processing needs.
8. Fault Tolerance:
o Inherits Hadoop’s fault-tolerance capabilities; failed tasks are retried automatically.
File Formats in Hive
Hive supports multiple file formats for storing and managing data in tables. Each format has its own
advantages and use cases. The file format affects how data is stored in HDFS and how Hive processes it.
1. Text File Format
Description: Stores data as plain text files, with fields separated by a delimiter (e.g., comma or tab).
Advantages:
o Human-readable.
o Simple and compatible with almost any tool.
Disadvantages:
o No built-in compression, so it uses more storage and queries run slower.
Usage:
STORED AS TEXTFILE;
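A fuller sketch of how this clause fits into a complete statement (table and column names are assumed):
CREATE TABLE logs (
    log_time STRING,
    level STRING,
    message STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;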
2. Sequence File Format
Description: A binary file format used for compressing and storing data. It is a key-value format.
Advantages:
o Supports compression.
o Splittable, so it works well with MapReduce.
Disadvantages:
o Not human-readable.
Usage:
STORED AS SEQUENCEFILE;
3. ORC (Optimized Row Columnar) Format
Description: A highly optimized columnar file format that compresses data and improves query
performance.
Advantages:
o Excellent compression.
o Fast reads for analytical queries on selected columns.
Disadvantages:
o Not human-readable.
o Higher write overhead than plain text.
Usage:
STORED AS ORC;
4. Parquet Format
Description: A columnar storage format, well suited to analytical queries that read only a few columns.
Advantages:
o High query performance for specific columns.
o Efficient compression.
Disadvantages:
o Not human-readable.
o Slower for write-heavy workloads.
Usage:
STORED AS PARQUET;
5. Avro Format
Description: A row-based storage format that supports schema evolution (changing the schema
without reloading data).
Advantages:
o Supports schema evolution.
o Compact binary encoding, good for exchanging data between systems.
Disadvantages:
o Row-based, so column-oriented analytical queries are slower than with ORC or Parquet.
Usage:
STORED AS AVRO;
6. RC File (Record Columnar) Format
Description: Stores data in rows within blocks, but each block stores values column-wise.
Advantages:
o Combines row and column storage, giving good compression and faster column reads than plain text.
Usage:
STORED AS RCFILE;