BIG DATA 4

1. Explain the relational operators in Pig with examples.

1. FILTER

The FILTER operator selects the rows of a relation that satisfy a given condition.

Eg: A = LOAD 'data' AS (name:chararray, age:int);

B = FILTER A BY age > 25;

2. GROUP

The GROUP operator groups the data by a specified field.

Eg: A = LOAD 'data' AS (name:chararray, age:int);

B = GROUP A BY age;

3. UNION

The UNION operator combines the contents of two datasets.

Eg: A = LOAD 'data1' AS (name:chararray, age:int);

B = LOAD 'data2' AS (name:chararray, age:int);

C = UNION A, B;

4. JOIN

The JOIN operator joins two datasets on a common field.

Eg: A = LOAD 'data1' AS (name:chararray, age:int);

B = LOAD 'data2' AS (name:chararray, salary:int);

C = JOIN A BY name, B BY name;

5. SPLIT

The SPLIT operator splits a dataset into multiple datasets based on conditions.

Eg: A = LOAD 'data' AS (name:chararray, age:int);

SPLIT A INTO B IF age > 25, C IF age <= 25;

6. FOREACH

The FOREACH operator transforms each row of a relation, generating new fields.

Eg: A = LOAD 'data' AS (name:chararray, age:int);

B = FOREACH A GENERATE name, age * 2;

7. DISTINCT

The DISTINCT operator removes duplicate rows.

Eg: A = LOAD 'data' AS (name:chararray, age:int);


B = DISTINCT A;

8. LIMIT

The LIMIT operator limits the number of rows returned.

Eg: A = LOAD 'data' AS (name:chararray, age:int);

B = LIMIT A 10;

9. ORDER BY

The ORDER BY operator sorts a relation by one or more fields.

Eg: A = LOAD 'data' AS (name:chararray, age:int);

B = ORDER A BY age DESC;

10. MAX

MAX is a built-in function (used with GROUP and FOREACH) that calculates the maximum value of a column.

Eg: A = LOAD 'data' AS (name:chararray, age:int);

B = GROUP A ALL;

C = FOREACH B GENERATE MAX(A.age);
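A note on running these examples: Pig evaluates relations lazily, so the statements above only define a dataflow. To execute it, print a relation with DUMP or write it to HDFS with STORE. A minimal sketch; the output path 'results' is illustrative:

DUMP C;

STORE C INTO 'results' USING PigStorage(',');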

2. Explain the features of PIG. Also explain the datatypes of PIG. Explain the architecture of PIG.


Features of Apache Pig

1. Ease of Use:

o Pig provides a high-level scripting language, Pig Latin, which is simpler to write and
understand compared to Java-based MapReduce code.

o Supports SQL-like operations for data processing.

2. Data Flow Approach:

o Pig scripts describe the flow of data transformations and operations step-by-step.

3. Extensibility:

o Allows users to create their own functions (UDFs) to process and analyze data (a short sketch follows this list).

4. Handles Both Structured and Unstructured Data:

o Supports complex and nested data types like tuples, bags, and maps, enabling
processing of semi-structured data like JSON, XML, and log files.
5. Optimization:

o Pig automatically optimizes execution plans, making scripts efficient without requiring
user intervention.

6. Fault Tolerance:

o Pig runs on top of Hadoop and inherits its fault-tolerance capabilities. Failed tasks are
retried automatically.

7. Schema-Free Processing:

o Users can define schemas as needed or choose to process data without them.
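As a small illustration of extensibility, here is a hedged sketch of registering and using a UDF in Pig Latin. The jar myudfs.jar and the class com.example.pig.ToUpper are hypothetical names:

-- Register the jar containing the UDF and give the class a short alias
REGISTER myudfs.jar;
DEFINE TO_UPPER com.example.pig.ToUpper();

A = LOAD 'data' AS (name:chararray, age:int);
B = FOREACH A GENERATE TO_UPPER(name), age;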

Datatypes in Apache Pig

Pig has the following scalar types and complex types:

Scalar Types:

1. int: Represents a 32-bit signed integer.


Example: 10, -5

2. long: Represents a 64-bit signed integer.


Example: 1234567890

3. float: Represents a 32-bit floating-point number.


Example: 3.14, -0.5

4. double: Represents a 64-bit floating-point number.


Example: 3.14159

5. chararray: Represents a string of characters.


Example: "Hello", "Pig"

6. bytearray: Represents a sequence of bytes. (Default type when the schema is not defined).

7. boolean: Represents true/false values.


Example: true, false

Complex Types:

1. Tuple:

o A fixed-length ordered collection of fields.

o Example: (1, 'Alice', 25)

2. Bag:
o A collection of tuples, which is unordered and allows duplicates.

o Example: {(1, 'Alice'), (2, 'Bob')}

3. Map:

o A set of key-value pairs where the keys are unique.

o Example: ['name'#'Alice', 'age'#25] (a combined sketch follows this list)
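The sketch below shows how these complex types nest in a LOAD schema; the input file 'students' and its layout are assumptions:

-- Each record has a name, a bag of (subject, mark) tuples, and a map of extra info
A = LOAD 'students' AS (name:chararray,
                        scores:bag{t:tuple(subject:chararray, mark:int)},
                        info:map[chararray]);

-- Look up a map key with # and unnest the bag with FLATTEN
B = FOREACH A GENERATE name, info#'city', FLATTEN(scores);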

Architecture of Apache Pig

Apache Pig works on top of Hadoop, and its architecture is designed for efficient execution of data
processing tasks. The architecture includes the following components:

1. Pig Latin Script:

 User writes a script in Pig Latin to define data processing tasks.

2. Pig Execution Environment:

 Parses, validates, and optimizes Pig Latin scripts into a series of MapReduce jobs.

3. Compiler:

 Converts Pig Latin scripts into MapReduce jobs or other execution models (e.g., Tez, Spark).

4. Execution Modes:

 Local Mode: Runs on a single machine, suitable for testing small datasets.

 MapReduce Mode: Runs on a Hadoop cluster, suitable for large-scale data processing (commands for both modes are sketched at the end of this section).

5. HDFS (Hadoop Distributed File System):

 Acts as the storage layer for the data being processed.

6. MapReduce Framework:

 Executes the jobs generated by the Pig compiler. Each job corresponds to a specific
transformation or operation defined in the Pig script.
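The two execution modes correspond to the -x flag of the pig command. A quick sketch, assuming a script named myscript.pig:

# Local mode: single machine, local filesystem (good for testing)
pig -x local myscript.pig

# MapReduce mode (the default): runs on the Hadoop cluster
pig -x mapreduce myscript.pig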
3. With a relevant diagram, explain the architecture of Hive, along with the three kinds of Metastore.

Architecture of Hive

Apache Hive is a data warehouse infrastructure built on top of Hadoop. It provides a SQL-like interface to
query and manage large datasets in HDFS. Below is the detailed explanation of its architecture along
with a diagram.

Hive Architecture Components

1. User Interfaces (UI):

o Provides interaction with Hive through:

 CLI (Command Line Interface)

 Web Interface

 JDBC/ODBC Driver for external applications.

2. Metastore:

o Stores metadata about databases, tables, columns, partitions, etc.

o Hive uses this metadata for query planning and execution.

o The Metastore can be one of three types: Embedded, Local, or Remote.

3. Driver:

o Acts as the controller, receiving queries from the UI and driving their execution.

o Manages query lifecycle, including parsing, planning, optimization, and execution.

4. Compiler:

o Parses the query and generates an execution plan, which includes MapReduce or other
execution models.

5. Optimizer:

o Optimizes the logical plan by applying rules like predicate pushdown, join optimization,
and query rewriting.

6. Execution Engine:

o Converts the execution plan into tasks and coordinates their execution using the Hadoop
framework.

7. Hadoop Distributed File System (HDFS):


o The storage layer for Hive data. Hive operates directly on files in HDFS.

8. Query Language (HiveQL):

o Hive provides a SQL-like query language for data manipulation and querying.

Diagram of Hive Architecture

Here’s a simplified representation:


+------------------+          +-------------------------+
|  User Interface  | <------> |         Driver          |
| (CLI, Web, ODBC) |          | (Parser, Compiler, etc) |
+------------------+          +-------------------------+
         |                                 |
         v                                 v
+------------------+          +-------------------------+
|    Metastore     | <------> |    Execution Engine     |
| (Metadata Store) |          |    (MR, Tez, Spark)     |
+------------------+          +-------------------------+
                                           |
                                           v
                              +-------------------------+
                              |          HDFS           |
                              |  (Data Storage Layer)   |
                              +-------------------------+

Three Types of Metastore in Hive

1. Embedded Metastore:

o Runs in the same JVM as the Hive service.

o Uses an embedded Derby database.


o Suitable for testing and development.

o Not recommended for production due to single-user limitations.

2. Local Metastore:

o Connects to a separate database (e.g., MySQL, Postgres) using JDBC.

o Still runs in the same JVM as the Hive service.

o Suitable for single-node deployments or lightweight production environments.

3. Remote Metastore:

o Runs as a standalone service, separate from the Hive service.

o Allows multiple Hive services to share the same metadata store.

o Recommended for production use (a configuration sketch follows).
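For concreteness, here is a minimal hive-site.xml sketch for pointing Hive at a remote Metastore service; the host name metastore-host is a placeholder, and 9083 is the default Thrift port:

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>

For an embedded or local Metastore, javax.jdo.option.ConnectionURL would instead point at the backing database (the embedded Derby database or, say, a MySQL JDBC URL).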

4. Explain the data units of Hive, and illustrate the usage of Partitioning and Bucketing with an example and diagram.

Data Units in Hive

Hive organizes data into the following hierarchical units:

1. Database:

o A collection of tables. It is used to organize tables logically and avoid naming conflicts.

o Example: CREATE DATABASE sales;

2. Table:

o Stores data in a structured format, similar to a relational database table.

o Example: CREATE TABLE customers (id INT, name STRING, age INT);

3. Partition:

o Subdivision of a table based on a column's value. Each partition is stored as a separate directory in HDFS.

o Example: A table partitioned by region will have separate directories for region=US, region=EU, etc.

4. Bucket:

o A further division of data within a partition. Data is distributed into buckets based on a
hash of a column.

o Example: Data in a partition can be divided into buckets using the customer_id.

Partitioning in Hive

Partitioning organizes data into smaller, more manageable chunks by dividing the table into directories based on the value of one or more columns.

CREATE TABLE sales (
  product_id INT,
  amount DOUBLE
)
PARTITIONED BY (region STRING)
STORED AS ORC;
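A brief usage sketch for the partitioned table; the values are illustrative, and INSERT ... VALUES requires Hive 0.14 or later:

-- Write one row into the region=US partition
INSERT INTO TABLE sales PARTITION (region = 'US')
VALUES (101, 250.00);

-- Filtering on the partition column reads only the region=US directory
SELECT * FROM sales WHERE region = 'US';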

Bucketing in Hive

Bucketing further divides data within each partition into smaller files (buckets) based on a hash of a
column. This improves query performance, especially for joins.

CREATE TABLE sales (
  product_id INT,
  customer_id INT,
  amount DOUBLE
)
PARTITIONED BY (region STRING)
CLUSTERED BY (customer_id) INTO 4 BUCKETS
STORED AS ORC;
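A hedged note on populating the bucketed table: on Hive versions before 2.x, bucketing must be enforced explicitly, and data is typically loaded from a staging table (sales_staging here is an illustrative name):

SET hive.enforce.bucketing = true;

INSERT INTO TABLE sales PARTITION (region = 'US')
SELECT product_id, customer_id, amount
FROM sales_staging
WHERE region = 'US';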

Diagram:

/sales/                      <-- Table
    /region=US/              <-- Partition
        /bucket_00000        <-- Bucket
        /bucket_00001
    /region=EU/              <-- Partition
        /bucket_00000
        /bucket_00001

5. Explain the datatypes of Hive in a simpler way.

Here are the data types in Hive, explained simply:

Primitive Data Types

1. INT: Whole numbers without decimals. Example: 10, -5


2. BIGINT: Larger whole numbers. Example: 1234567890

3. FLOAT: Numbers with decimals, but less precise. Example: 3.14, -0.5

4. DOUBLE: More precise numbers with decimals. Example: 3.14159

5. STRING: Text or characters. Example: "Hello", "Hive"

6. BOOLEAN: True or false values. Example: true, false

7. TINYINT: Small whole numbers. Example: 127

8. SMALLINT: Medium-sized whole numbers. Example: 32767

9. DECIMAL: Numbers with fixed precision and scale. Example: 10.5, -3.14

10. DATE: Date values. Example: 2024-12-25

11. TIMESTAMP: Date and time values. Example: 2024-12-25 10:30:00

12. BINARY: Sequence of bytes. Example: binary data

Complex Data Types

1. ARRAY: Ordered collection of elements of the same type. Example: [1, 2, 3]

2. MAP: Collection of key-value pairs. Example: {'name':'Alice', 'age':25}

3. STRUCT: Group of named fields, each with its own type. Example: {name:'Alice', age:25}, with fields accessed as column.field

4. UNIONTYPE: A field that can hold any one of several declared types. Example: a value that may be either an int (1) or a string ('Hello')
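A small sketch tying the complex types together in HiveQL; the table and column names are illustrative:

CREATE TABLE employees (
  name    STRING,
  skills  ARRAY<STRING>,
  ratings MAP<STRING, INT>,
  address STRUCT<city:STRING, zip:STRING>
);

-- Index an array, look up a map key, and read a struct field
SELECT name, skills[0], ratings['hive'], address.city FROM employees;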

6. What is Hive? What are the tasks performed by Hive? Also explain the features of Hive in a simpler way.


What is Hive?

Hive is a data warehouse tool built on top of Hadoop. It enables querying, analyzing, and managing large
datasets stored in Hadoop’s HDFS using a SQL-like language called HiveQL (Hive Query Language). Hive
makes it easy for users familiar with SQL to work with big data without needing to write complex
MapReduce programs.

Tasks Performed by Hive

1. Query Data:

o Hive allows users to query large datasets using SQL-like syntax (HiveQL).

o Example: SELECT * FROM sales WHERE region = 'US';

2. Analyze Data:
o Perform aggregations, filtering, sorting, and grouping of data.

o Example: SELECT region, SUM(amount) FROM sales GROUP BY region;

3. Store and Organize Data:

o Data is stored in tables that can be partitioned and bucketed for better organization and
faster querying.

4. Manage Metadata:

o Hive’s Metastore keeps track of table structures, schema details, and data locations.

5. Data Transformation:

o Hive can transform data by joining, filtering, and applying user-defined functions (UDFs).

6. Batch Processing:

o Hive is designed for batch processing of large datasets rather than real-time analytics.

Features of Hive

1. SQL-Like Interface (HiveQL):

o Provides an easy-to-use SQL-like query language for users familiar with traditional
databases.

2. Handles Large Datasets:

o Processes petabytes of data efficiently using Hadoop's distributed computing.

3. Schema on Read:

o Data is not validated at the time of storage but during query execution, allowing
flexibility with data formats.

4. Scalability:

o Scales with Hadoop’s ability to handle large amounts of data across distributed nodes.

5. Partitioning and Bucketing:

o Helps organize and retrieve data more efficiently by dividing it into partitions and
buckets.

6. Supports Complex Data Types:

o Works with structured, semi-structured, and nested data using data types like arrays,
maps, and structs.

7. Extensibility:
o Allows developers to write custom functions using Java, Python, or other languages
(UDFs).

8. Interoperability:

o Works with tools like Spark, Pig, and MapReduce for diverse data processing needs.

9. Fault Tolerance:

o Inherits Hadoop’s fault tolerance, ensuring tasks are retried on failure.

7. Explain the file formats of Hive in a simpler way.

File Formats in Hive

Hive supports multiple file formats for storing and managing data in tables. Each format has its own
advantages and use cases. The file format affects how data is stored in HDFS and how Hive processes it.

1. Text File Format

 Description: Stores data as plain text files, with fields separated by a delimiter (e.g., comma or
tab).

 Advantages:

o Simple and easy to use.

o Human-readable.

 Disadvantages:

o Consumes more storage space.

o Slower for queries due to lack of optimization.

 Usage:


CREATE TABLE my_table (id INT, name STRING)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY ','

STORED AS TEXTFILE;

2. Sequence File Format

 Description: A binary file format used for compressing and storing data. It is a key-value format.
 Advantages:

o Faster read and write compared to text files.

o Supports compression.

 Disadvantages:

o Not human-readable.

 Usage:


CREATE TABLE my_table (id INT, name STRING)

STORED AS SEQUENCEFILE;

3. ORC (Optimized Row Columnar) Format

 Description: A highly optimized columnar file format that compresses data and improves query
performance.

 Advantages:

o High compression, reduces storage cost.

o Optimized for read-heavy operations like analytics.

o Stores metadata for faster queries.

 Disadvantages:

o Slightly slower write performance.

 Usage:


CREATE TABLE my_table (id INT, name STRING)

STORED AS ORC;

4. Parquet Format

 Description: A columnar storage format designed for distributed systems.

 Advantages:
o High query performance for specific columns.

o Efficient compression.

o Interoperable with tools like Spark and Impala.

 Disadvantages:

o More complex to implement compared to simpler formats.

 Usage:


CREATE TABLE my_table (id INT, name STRING)

STORED AS PARQUET;

5. Avro Format

 Description: A row-based storage format that supports schema evolution (changing schema
without reloading data).

 Advantages:

o Compact and efficient for row-based processing.

o Interoperable with multiple programming languages.

 Disadvantages:

o Less optimized for column-based analytics.

 Usage:


CREATE TABLE my_table (id INT, name STRING)

STORED AS AVRO;

6. RC (Row Columnar) Format

 Description: Splits data into row groups (blocks); within each block, values are stored column-wise.

 Advantages:

o Better compression and query performance than text or sequence files.


 Disadvantages:

o Not as optimized as ORC or Parquet.

 Usage:


CREATE TABLE my_table (id INT, name STRING)

STORED AS RCFILE;
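A practical note: an existing table can be converted from one format to another with CREATE TABLE AS SELECT. A sketch, assuming the text-format my_table from the examples above:

CREATE TABLE my_table_orc
STORED AS ORC
AS SELECT * FROM my_table;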
