REAL-TIME ANALYTICAL PLATFORM (RTAP)
Real-time analytics in Big Data allows organizations to quickly gain insights
from large datasets as they are being generated.
Instead of focusing only on past data, it helps businesses analyze live data
streams.
This improves decision-making and enables businesses to adapt swiftly to
market changes, customer needs, and operational issues.
APPLICATIONS OF RTAP
• Healthcare Feedback
• Entertainment analytics
• Customer support optimization
• E-Commerce Insights
• Customer Feedback Analysis
• Event Monitoring
• Social Media Sentiment Monitoring
• Travel & Hospitality Feedback
• Stock Market Prediction
• Retail Store Sentiment Analysis
• Political Sentiment Analysis
• Sentiment-driven Marketing
• Crisis Management
• Product Development
• News Sentiment Analysis
RTAP FOR STOCK MARKET PREDICTION
A typical pipeline:
Data Source → Data Ingestion → Stream Processing → Real-Time Querying → Data Storage → Sentiment Analysis → Machine Learning → Visualization
APACHE PIG
Apache Pig is a high-level platform for processing and analyzing large data
sets in Hadoop.
It uses a scripting language called Pig Latin, which simplifies the task of
writing MapReduce programs.
Pig Latin borrows operators such as FILTER, GROUP, and JOIN from SQL, but unlike
SQL's declarative queries, a Pig Latin script expresses data transformation and
processing logic as an explicit step-by-step data flow.
Pig is designed to handle structured, semi-structured, and unstructured
data, making it highly flexible for diverse data analysis tasks.
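A minimal Pig Latin script gives a feel for the language; this is a sketch, assuming a hypothetical comma-separated log file access_log.csv with (user, url) columns:

```pig
-- sketch: count page views per user (access_log.csv is a hypothetical input)
logs = LOAD 'access_log.csv' USING PigStorage(',') AS (user:chararray, url:chararray);
by_user = GROUP logs BY user;
views = FOREACH by_user GENERATE group AS user, COUNT(logs) AS page_views;
STORE views INTO 'views_out' USING PigStorage(',');
```

Each statement defines one step of the pipeline; Pig compiles the whole script into MapReduce jobs, so no mapper or reducer classes are written by hand.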
FEATURES
High-Level Language (Pig Latin)
Flexibility: handles different data types and supports multiple languages.
Handles Complex Data Flows: ETL (Extract, Transform, Load), data preparation, and ad hoc querying.
Extensibility
Rich set of data processing operators
Optimized for Hadoop
Interactive and Batch Modes: Local Mode & Hadoop Mode
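The two execution modes above are selected with the `-x` flag on the `pig` command line; a sketch (the script name is hypothetical):

```shell
pig -x local myscript.pig       # local mode: runs against the local filesystem
pig -x mapreduce myscript.pig   # Hadoop mode: runs on the cluster, reading from HDFS
```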
COMPONENTS OF APACHE PIG
Pig Latin Scripts
Pig Runtime
Execution Environment: Local & Hadoop Mode

Sample script:
raw_data = LOAD '[Link]' USING PigStorage(',') AS (name:chararray, salary:int); -- schema added so salary can be referenced by name
high_salary = FILTER raw_data BY salary > 50000;
grouped_data = GROUP high_salary BY name;
result = FOREACH grouped_data GENERATE group, COUNT(high_salary);
STORE result INTO 'output' USING PigStorage(',');
INTERNAL ARCHITECTURE OF APACHE PIG
A Pig Latin script passes through the Parser (which builds a logical plan), the
Optimizer, the Compiler (which produces a series of MapReduce jobs), and the
Execution Engine (which submits those jobs to Hadoop).
PIG LATIN DATA MODELS
ATOM · TUPLE · BAG · MAP · RELATION
SAMPLE CODE
-- relation: an outer bag of tuples, loaded with a schema
students = LOAD '[Link]' AS (name:chararray, age:int, city:chararray);
-- tuple: an ordered set of fields
first_student = ('Alice', 30, 'New York');
-- bag: an unordered collection of tuples
student_bag = {('Alice', 30), ('Bob', 25), ('Charlie', 28)};
-- map: a set of key-value pairs
student_map = ['name' -> 'Alice', 'age' -> 30, 'city' -> 'New York'];
-- atom: a single scalar value
student_age = 30;
DUMP students;
-- note: the literal assignments above illustrate each data model; in a real
-- script such literals appear inside expressions, not as standalone statements
DATA PROCESSING OPERATORS
Relational Operators
Diagnostic Operators
Expression Operators
Built-in Functions
User-Defined Functions
RELATIONAL OPERATORS
LOAD STORE FILTER FOREACH GROUP
COGROUP JOIN CROSS UNION SPLIT
ORDER DISTINCT LIMIT
SAMPLE CODE
data = LOAD '[Link]' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);
filtered_data = FILTER data BY age > 25;
-- city is kept in the projection so the later GROUP/JOIN BY city still work
transformed_data = FOREACH filtered_data GENERATE name, age * 2 AS age_doubled, city;
grouped_data = GROUP transformed_data BY city;
data2 = LOAD '[Link]' USING PigStorage(',') AS (city:chararray, state:chararray);
cogrouped_data = COGROUP grouped_data BY group, data2 BY city;
joined_data = JOIN transformed_data BY city, data2 BY city;
crossed_data = CROSS transformed_data, data2;
unioned_data = UNION transformed_data, transformed_data;
SPLIT data INTO young IF age < 30, old IF age >= 30;
ordered_data = ORDER data BY age ASC;
distinct_data = DISTINCT data;
limited_data = LIMIT data 10;
STORE limited_data INTO '[Link]' USING PigStorage(',');
DIAGNOSTIC OPERATORS
DESCRIBE · EXPLAIN · ILLUSTRATE · DUMP
SAMPLE CODE
Input file sample_data.txt:
1,John,28,50000.0
2,Jane,35,60000.0
3,Mike,40,55000.0
4,Susan,22,45000.0
5,Tom,35,70000.0

A = LOAD 'sample_data.txt' USING PigStorage(',') AS (id: int, name: chararray, age: int, salary: float);
DESCRIBE A;
B = FILTER A BY age > 30;
DUMP B;
C = GROUP A BY age;
EXPLAIN C;
D = FOREACH C GENERATE group AS age, AVG([Link]) AS avg_salary;
ILLUSTRATE D;
STORE D INTO 'output' USING PigStorage(',');

DESCRIBE A output:
A: {id: int, name: chararray, age: int, salary: float}

DUMP B output:
(2,Jane,35,60000.0)
(3,Mike,40,55000.0)
(5,Tom,35,70000.0)

EXPLAIN C output (simplified):
Logical Plan: Load -> Group -> Store

ILLUSTRATE D output:
(22,{(4,Susan,22,45000.0)}) -> (22,45000.0)
(28,{(1,John,28,50000.0)}) -> (28,50000.0)
(35,{(2,Jane,35,60000.0),(5,Tom,35,70000.0)}) -> (35,65000.0)
(40,{(3,Mike,40,55000.0)}) -> (40,55000.0)
EXPRESSION OPERATORS
ARITHMETIC · COMPARISON & RELATIONAL · LOGICAL · TEXT
SAMPLE CODE
A = LOAD '[Link]' USING PigStorage(',') AS (id: int, name: chararray, age: int, salary: float);
B = FILTER A BY age > 30;                                          -- comparison operator
C = FOREACH A GENERATE id, name, age, salary * 1.1 AS new_salary;  -- arithmetic operator
D = FOREACH A GENERATE CONCAT(name, ' Employee') AS full_name;     -- text operator
E = FILTER A BY salary IS NULL;                                    -- null check
DUMP B;
DUMP C;
DUMP D;
DUMP E;
BUILT-IN FUNCTIONS
STRING FUNCTIONS · MATH FUNCTIONS · BAG & TUPLE FUNCTIONS · AGGREGATE FUNCTIONS · DATE & TIME FUNCTIONS · UTILITY FUNCTIONS
MATCH THE FOLLOWING
Function Name | Description
CONCAT | a. Converts string to lowercase.
SUBSTRING | b. Splits a string into a tuple.
TRIM | c. Removes leading and trailing whitespaces.
TOLOWER | d. Replaces a substring with another.
TOUPPER | e. Extracts a substring.
REPLACE | f. Concatenates two strings.
STRSPLIT | g. Converts string to uppercase.
INDEXOF | h. Returns the index of the first occurrence of a substring.
STRING FUNCTIONS
Function | Description | Example
CONCAT | Concatenates two strings. | CONCAT('Hello', 'World') → HelloWorld
SUBSTRING | Extracts a substring. | SUBSTRING('HelloWorld', 0, 5) → Hello
TRIM | Removes leading and trailing whitespaces. | TRIM(' Hello ') → Hello
TOLOWER | Converts string to lowercase. | TOLOWER('HELLO') → hello
TOUPPER | Converts string to uppercase. | TOUPPER('hello') → HELLO
REPLACE | Replaces a substring with another. | REPLACE('HelloWorld', 'World', 'Pig') → HelloPig
STRSPLIT | Splits a string into a tuple. | STRSPLIT('a,b,c', ',') → (a, b, c)
INDEXOF | Returns the index of the first occurrence of a substring. | INDEXOF('HelloWorld', 'World') → 5
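A short sketch applying some of these string functions inside a FOREACH; the input file employees.csv and its single name column are hypothetical:

```pig
-- sketch, assuming a hypothetical employees.csv with one name per line
emp = LOAD 'employees.csv' USING PigStorage(',') AS (name:chararray);
formatted = FOREACH emp GENERATE
    TRIM(name)                       AS clean_name,     -- strip surrounding whitespace
    TOUPPER(TRIM(name))              AS shout_name,     -- uppercase variant
    CONCAT('Employee: ', TRIM(name)) AS labelled_name;  -- prefixed label
DUMP formatted;
```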
MATCH THE FOLLOWING
Function | Description
ABS | Produces a different decimal between 0 and 1 each time you call it.
SQRT | Moves a number down to the previous whole number, no matter the decimal.
ROUND | Returns the mathematical constant approximately equal to 3.14159.
CEIL | Adjusts a decimal number to the closest whole number.
FLOOR | Finds the number which, when multiplied by itself, equals the input.
RANDOM | Gives you the distance of a number from zero, ignoring its sign.
PI | Moves a number up to the next whole number, no matter the decimal.
MATH FUNCTIONS
Function | Description | Example
ABS | Returns the absolute value. | ABS(-5) → 5
SQRT | Returns the square root of a number. | SQRT(16) → 4.0
ROUND | Rounds a number to the nearest integer. | ROUND(4.6) → 5
CEIL | Returns the smallest integer greater than or equal to the number. | CEIL(4.1) → 5
FLOOR | Returns the largest integer less than or equal to the number. | FLOOR(4.9) → 4
RANDOM | Returns a random number between 0 and 1. | RANDOM() → 0.3456 (varies)
PI | Returns the value of π. | PI() → 3.14159
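The math functions above are typically applied field by field in a FOREACH; a sketch, assuming a hypothetical readings.csv of (sensor, value) rows:

```pig
-- sketch, assuming a hypothetical readings.csv of sensor,value rows
readings = LOAD 'readings.csv' USING PigStorage(',') AS (sensor:chararray, value:double);
derived = FOREACH readings GENERATE
    sensor,
    ABS(value)   AS magnitude,    -- distance from zero
    ROUND(value) AS nearest,      -- nearest integer
    CEIL(value)  AS upper_bound,  -- round up
    FLOOR(value) AS lower_bound;  -- round down
DUMP derived;
```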
MATCH THE FOLLOWING
Function Description
SIZE a. Turns multiple fields into a collection of tuples.
TOTUPLE b. Finds the number of elements in a collection.
c. Extracts a specified number of top records from
TOBAG
data.
FLATTEN d. Changes fields into a single tuple.
TOP e. Breaks down a collection into individual elements.
BAG & TUPLE FUNCTIONS
Function | Description | Example
SIZE | Returns the size of a bag or tuple. | SIZE({(1,2),(3,4)}) → 2
TOTUPLE | Converts fields into a tuple. | TOTUPLE(1,2,3) → (1,2,3)
TOBAG | Converts fields into a bag. | TOBAG(1,2,3) → {(1),(2),(3)}
FLATTEN | Flattens a bag or tuple into separate fields. | FLATTEN({(1,2),(3,4)}) → (1,2),(3,4)
TOP | Returns the top N tuples from a bag based on a comparator. | top_records = TOP(3, score, data);
AGGREGATE FUNCTIONS
Function | Description | Example
COUNT | Counts the number of elements. | COUNT(A) → 5
SUM | Sums up numeric values. | SUM([Link]) → 250000.0
AVG | Calculates the average of numeric values. | AVG([Link]) → 50000.0
MIN | Finds the minimum value. | MIN([Link]) → 22
MAX | Finds the maximum value. | MAX([Link]) → 40
GROUP | Groups data based on a field (an operator, typically used with these functions). | GROUP A BY age
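Aggregate functions operate on the bags produced by GROUP; a sketch, assuming a hypothetical employees.csv of id,name,age,salary rows:

```pig
-- sketch, assuming a hypothetical employees.csv of id,name,age,salary rows
A = LOAD 'employees.csv' USING PigStorage(',') AS (id:int, name:chararray, age:int, salary:float);
by_age = GROUP A BY age;                  -- one bag of tuples per distinct age
stats = FOREACH by_age GENERATE
    group         AS age,
    COUNT(A)      AS num_employees,       -- size of each age group
    AVG(A.salary) AS avg_salary,
    MAX(A.salary) AS max_salary;
DUMP stats;
```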
DATE TIME FUNCTION
Function Description Example
ToDate Converts a string to a date. ToDate('2025-01-01', 'yyyy-MM-dd')
AddDuration Adds a duration to a date. AddDuration('2025-01-01', '5d')
SubtractDuratio Subtracts a duration from a SubtractDuration('2025-01-01',
n date. '5d')
Returns the current
CurrentTime CurrentTime()
timestamp.
UTILITY FUNCTIONS
Function | Description | Example
DIFF | Finds the difference between two bags. | DIFF(A, B)
IsEmpty | Checks if a bag or tuple is empty. | IsEmpty(A)
NVL | Replaces null with a specified value. | NVL([Link], 0)
TOKENIZE | Splits a string into a bag of words. | TOKENIZE('Hello Pig') → {(Hello), (Pig)}
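The classic word-count script combines TOKENIZE with FLATTEN (from the bag and tuple functions above) and the aggregate COUNT; a sketch, assuming a hypothetical plain-text input file lines.txt:

```pig
-- sketch: word count over a hypothetical lines.txt of free text
lines = LOAD 'lines.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;  -- one row per word
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
```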
MCQ
1. Which Pig function calculates the total sum of the numeric field sales in dataset data?
   a) SUM([Link])   b) SUM(data)   c) SUM(sales)
2. Which function extracts the year from a date field order_date?
   a) GET_YEAR(order_date)   b) YEAR(order_date)   c) TO_YEAR(order_date)
3. To get the current date and time in Pig, which function do you use?
   a) NOW()   b) CURRENT_TIME()   c) GET_DATE()
4. Which function returns the number of distinct customers in bag customers?
   a) COUNT(customers)   b) DISTINCT(customers)   c) COUNT_DISTINCT(customers)
5. How do you round the value of field price to the nearest integer?
   a) ROUND(price)   b) CEIL(price)   c) FLOOR(price)
6. Which function converts string '2024-05-17' into date format?
   a) ToDate('2024-05-17')   b) ToDate('2024-05-17', 'yyyy-MM-dd')   c) ToDate('yyyy-MM-dd', '2024-05-17')
7. Which function breaks bag {(1,2),(3,4)} into individual fields?
   a) FLATTEN()   b) EXPLODE()   c) UNPACK()
8. To get the number of elements inside a bag orders, which function is correct?
   a) SIZE(orders)   b) LENGTH(orders)   c) COUNT(orders)
APACHE HIVE
Apache Hive is a data warehouse infrastructure built on top of Apache
Hadoop for providing data summarization, query, and analysis.
It enables querying large datasets stored in Hadoop's HDFS using a SQL-
like language called HiveQL.
Originally developed by Facebook, Hive translates SQL queries into
MapReduce jobs for distributed processing.
It supports features like partitioning, indexing, and user-defined functions
to optimize big data analysis.
Hive is designed for batch processing and data warehousing, not for real-
time transactional workloads (OLTP).
It is widely used for data summarization, ad-hoc queries, and ETL in big
data ecosystems.
KEY FEATURES OF HIVE
SQL-like Interface (HiveQL)
Scalability with Hadoop
Schema on Read
Data Warehouse Integration
Table Partitioning
Bucketing for efficient querying
Built-in Functions (math, string, date, etc.)
Support for Multiple File Formats
Compatibility with Hadoop Ecosystem
Extensibility with UDFs, UDAFs, and UDTFs
Indexing for faster queries
Security (authentication and authorization)
Easy Integration with BI Tools
Query Optimization (cost-based and rule-based)
Fault Tolerance via Hadoop
Hive Metastore for metadata management
PROBLEM STATEMENTS
Create a database called test.
Display the existing databases
Select the newly created database
Create an employee table which contains employeeid, name, age, and
department.
Insert 10 records under it.
Retrieve all records from the employees table.
Write a query to display the names of employees in the IT department.
Write a query to retrieve details of all employees whose age is greater than
25.
PROBLEM STATEMENTS
Find employees whose age is between 25 and 35.
Write a query to display all employees sorted by age in ascending order.
Write a query to list all departments that have employees aged over 30.
Identify Understaffed Departments (fewer than 5 employees)
Predict New Salaries with 10% Increment
Create a query to summarize the total number of employees, average age
per department.
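A HiveQL sketch of solutions for some of the statements above; the table and column names follow the problem statements, and the headcount threshold is as stated:

```sql
-- create and select the database
CREATE DATABASE test;
SHOW DATABASES;
USE test;

-- employee table per the problem statement
CREATE TABLE employee (
  employeeid INT,
  name STRING,
  age INT,
  department STRING
);

-- names of employees in the IT department
SELECT name FROM employee WHERE department = 'IT';

-- employees aged between 25 and 35
SELECT * FROM employee WHERE age BETWEEN 25 AND 35;

-- understaffed departments (fewer than 5 employees)
SELECT department, COUNT(*) AS headcount
FROM employee
GROUP BY department
HAVING COUNT(*) < 5;

-- total employees and average age per department
SELECT department, COUNT(*) AS total_employees, AVG(age) AS avg_age
FROM employee
GROUP BY department;
```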
CREATE DATA IN CSV
cat > sales_data.csv <<EOL
order_id,product_name,quantity,price,sale_date
101,Laptop,2,50000,2023-05-01
102,Mouse,5,500,2023-05-02
103,Keyboard,3,1500,2023-05-03
104,Monitor,1,10000,2023-05-04
EOL
CREATE TABLE
CREATE TABLE sales (
  order_id INT,
  product_name STRING,
  quantity INT,
  price FLOAT,
  sale_date STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ('skip.header.line.count'='1');  -- skip the header row in the CSV created above
LOAD THE DATA & READ THE CONTENT
LOAD DATA LOCAL INPATH '/home/shalini/sales_data.csv' INTO TABLE sales;
SELECT * FROM sales;
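With the data loaded, a few illustrative HiveQL queries over the sales table (the columns are those defined in the CREATE TABLE above):

```sql
-- total revenue per product
SELECT product_name, SUM(quantity * price) AS revenue
FROM sales
GROUP BY product_name;

-- orders on or after 2023-05-02 (ISO date strings compare correctly)
SELECT order_id, product_name, sale_date
FROM sales
WHERE sale_date >= '2023-05-02';

-- highest-value single order
SELECT order_id, quantity * price AS order_value
FROM sales
ORDER BY order_value DESC
LIMIT 1;
```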
THANK YOU