RTAP Applications in Real-Time Analytics

The document discusses Real-Time Analytical Platforms (RTAP) and their applications in various sectors such as healthcare, e-commerce, and stock market prediction. It also covers Apache Pig, a high-level platform for processing large datasets in Hadoop, detailing its features, components, and data processing operators. Additionally, it introduces Apache Hive, a data warehouse infrastructure for querying large datasets using HiveQL, highlighting its key features and problem statements for practical implementation.


REAL TIME ANALYTICAL PLATFORM (RTAP)
 Real-time analytics in Big Data allows organizations to quickly gain insights
from large datasets as they are being generated.
 Instead of focusing only on past data, it helps businesses analyze live data
streams.
 This improves decision-making and enables businesses to adapt swiftly to
market changes, customer needs, and operational issues.
APPLICATIONS OF RTAP
• Healthcare Feedback
• Entertainment Analytics
• Customer Support Optimization
• E-Commerce Insights
• Customer Feedback Analysis
• Event Monitoring
• Social Media Sentiment Monitoring
• Travel & Hospitality Feedback
• Stock Market Prediction
• Retail Store Sentiment Analysis
• Political Sentiment Analysis
• Sentiment-Driven Marketing
• Crisis Management
• Product Development
• News Sentiment Analysis
RTAP FOR STOCK MARKET PREDICTION
 Data Source
 Data Ingestion
 Stream Processing
 Real Time Querying
 Data Storage
 Sentiment Analysis
 Machine Learning
 Visualization
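The stages above can be sketched end to end in miniature. The following Python sketch is purely illustrative: the tick data, keyword lexicon, window size, and trading rule are all invented for demonstration, not part of any real RTAP product.

```python
from collections import deque

# Hypothetical keyword lexicon for naive headline sentiment scoring.
POSITIVE = {"beats", "surges", "record", "growth"}
NEGATIVE = {"misses", "falls", "lawsuit", "recall"}

def sentiment(headline: str) -> int:
    """Score a headline: +1 per positive word, -1 per negative word."""
    words = headline.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def process_stream(ticks, window=3):
    """Consume (price, headline) ticks one at a time and emit a signal
    per tick: the stream-processing and real-time-querying stages."""
    prices = deque(maxlen=window)          # bounded window = real-time state
    signals = []
    for price, headline in ticks:
        prices.append(price)
        avg = sum(prices) / len(prices)    # query over the live window
        s = sentiment(headline)            # sentiment-analysis stage
        # Toy rule: buy when price dips below the moving average
        # and headline sentiment is not negative.
        signals.append("BUY" if price < avg and s >= 0 else "HOLD")
    return signals

ticks = [
    (100.0, "ACME beats estimates"),
    (98.0, "ACME growth continues"),
    (97.0, "ACME faces lawsuit"),
]
print(process_stream(ticks))  # → ['HOLD', 'BUY', 'HOLD']
```

In a real deployment each stage would be a separate system (e.g. a message queue for ingestion, a stream processor for the window, a model for sentiment); the loop above only shows how the stages hand data to one another.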
APACHE PIG
 Apache Pig is a high-level platform for processing and analyzing large data
sets in Hadoop.
 It uses a scripting language called Pig Latin, which simplifies the task of
writing MapReduce programs.
 Pig Latin is a procedural dataflow language with SQL-like operators, used to express data
transformation and processing logic.
 Pig is designed to handle structured, semi-structured, and unstructured
data, making it highly flexible for diverse data analysis tasks.
FEATURES
 High-Level Language
 Flexibility: handles different data types and supports multiple languages
 Handles complex data flows: ETL (Extract, Transform, Load), data preparation, and ad hoc querying
 Extensibility
 Rich set of data processing operators
 Optimized for Hadoop
 Interactive and batch modes: Local Mode & Hadoop Mode
COMPONENTS OF APACHE PIG
 Pig Latin Scripts
 Pig Runtime
 Execution Environment: Local & Hadoop Mode

raw_data = LOAD '[Link]' USING PigStorage(',') AS (name:chararray, salary:int);
high_salary = FILTER raw_data BY salary > 50000;
grouped_data = GROUP high_salary BY name;
result = FOREACH grouped_data GENERATE group, COUNT(high_salary);
STORE result INTO 'output' USING PigStorage(',');
INTERNAL ARCHITECTURE OF APACHE PIG
PIG LATIN DATA MODELS

ATOM TUPLE BAG MAP RELATION
SAMPLE CODE
 students = LOAD '[Link]' AS (name:chararray, age:int, city:chararray); -- relation
 first_student = ('Alice', 30, 'New York'); -- tuple
 student_bag = {('Alice', 30), ('Bob', 25), ('Charlie', 28)}; -- bag
 student_map = ['name' -> 'Alice', 'age' -> 30, 'city' -> 'New York']; -- map
 student_age = 30; -- atom
 DUMP students;
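The four nested models map naturally onto familiar container types. A rough Python analogy (the values mirror the Pig sample above and are purely illustrative; Python has no exact bag type, so a list of tuples stands in):

```python
# Atom: a single value (int, float, chararray, ...).
student_age = 30

# Tuple: an ordered set of fields, possibly of mixed types.
first_student = ("Alice", 30, "New York")

# Bag: an unordered collection of tuples; a list of tuples
# is the closest Python stand-in.
student_bag = [("Alice", 30), ("Bob", 25), ("Charlie", 28)]

# Map: key-value pairs where keys are chararrays.
student_map = {"name": "Alice", "age": 30, "city": "New York"}

# Relation: the outermost structure, a bag of tuples with a schema;
# this is what LOAD produces and what operators consume.
students = [("Alice", 30, "New York"), ("Bob", 25, "Chicago")]
print(len(students))  # → 2
```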
DATA PROCESSING
OPERATORS
 Relational Operators
 Diagnostic Operators
 Expression Operators
 Built-in Functions
 User Defined Functions
RELATIONAL OPERATORS

LOAD STORE FILTER FOREACH GROUP

COGROUP JOIN CROSS UNION SPLIT

ORDER DISTINCT LIMIT


SAMPLE CODE
data = LOAD '[Link]' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);
filtered_data = FILTER data BY age > 25;
transformed_data = FOREACH filtered_data GENERATE name, age * 2 AS age_doubled;
grouped_data = GROUP transformed_data BY city;
data2 = LOAD '[Link]' USING PigStorage(',') AS (city:chararray, state:chararray);
cogrouped_data = COGROUP grouped_data BY group, data2 BY city;
joined_data = JOIN transformed_data BY city, data2 BY city;
crossed_data = CROSS transformed_data, data2;
unioned_data = UNION transformed_data, transformed_data;
SPLIT data INTO young IF age < 30, old IF age >= 30;
ordered_data = ORDER data BY age ASC;
distinct_data = DISTINCT data;
limited_data = LIMIT data 10;
STORE limited_data INTO '[Link]' USING PigStorage(',');
DIAGNOSTIC OPERATORS

DESCRIBE EXPLAIN

DUMP ILLUSTRATE
SAMPLE CODE
A = LOAD 'sample_data.txt' USING PigStorage(',') AS (id: int, name: chararray, age: int, salary: float);
DESCRIBE A;
B = FILTER A BY age > 30;
DUMP B;
C = GROUP A BY age;
EXPLAIN C;
D = FOREACH C GENERATE group AS age, AVG([Link]) AS avg_salary;
ILLUSTRATE D;
STORE D INTO 'output' USING PigStorage(',');

Input (sample_data.txt):
1,John,28,50000.0
2,Jane,35,60000.0
3,Mike,40,55000.0
4,Susan,22,45000.0
5,Tom,35,70000.0

DESCRIBE A:
A: {id: int, name: chararray, age: int, salary: float}

DUMP B:
(2,Jane,35,60000.0)
(3,Mike,40,55000.0)
(5,Tom,35,70000.0)

EXPLAIN C (simplified logical plan):
Load -> Group -> Store

ILLUSTRATE D:
(22,{(4,Susan,22,45000.0)}) -> (22,45000.0)
(28,{(1,John,28,50000.0)}) -> (28,50000.0)
(35,{(2,Jane,35,60000.0),(5,Tom,35,70000.0)}) -> (35,65000.0)
(40,{(3,Mike,40,55000.0)}) -> (40,55000.0)
EXPRESSION OPERATORS

ARITHMETIC COMPARISON

LOGICAL & RELATIONAL TEXT
SAMPLE CODE
A = LOAD '[Link]' USING PigStorage(',') AS (id: int, name: chararray, age: int, salary: float);
B = FILTER A BY age > 30; -- comparison operator
C = FOREACH A GENERATE id, name, age, salary * 1.1 AS new_salary; -- arithmetic operator
D = FOREACH A GENERATE CONCAT(name, ' Employee') AS full_name; -- text operation
E = FILTER A BY salary IS NULL; -- null check
DUMP B;
DUMP C;
DUMP D;
DUMP E;
BUILT-IN FUNCTIONS

STRING FUNCTIONS  MATH FUNCTIONS  BAG & TUPLE FUNCTIONS

AGGREGATE FUNCTIONS  DATE & TIME FUNCTIONS  UTILITY FUNCTIONS
MATCH THE FOLLOWING
Function Name    Description
CONCAT           a. Converts string to lowercase.
SUBSTRING        b. Splits a string into a tuple.
TRIM             c. Removes leading and trailing whitespaces.
TOLOWER          d. Replaces a substring with another.
TOUPPER          e. Extracts a substring.
REPLACE          f. Concatenates two strings.
STRSPLIT         g. Converts string to uppercase.
INDEXOF          h. Returns the index of the first occurrence of a substring.
STRING FUNCTIONS
Function    Description                                                Example
CONCAT      Concatenates two strings.                                  CONCAT('Hello', 'World') → HelloWorld
SUBSTRING   Extracts a substring.                                      SUBSTRING('HelloWorld', 0, 5) → Hello
TRIM        Removes leading and trailing whitespaces.                  TRIM(' Hello ') → Hello
TOLOWER     Converts string to lowercase.                              TOLOWER('HELLO') → hello
TOUPPER     Converts string to uppercase.                              TOUPPER('hello') → HELLO
REPLACE     Replaces a substring with another.                         REPLACE('HelloWorld', 'World', 'Pig') → HelloPig
STRSPLIT    Splits a string into a tuple.                              STRSPLIT('a,b,c', ',') → (a, b, c)
INDEXOF     Returns the index of the first occurrence of a substring.  INDEXOF('HelloWorld', 'World') → 5
MATCH THE FOLLOWING
Function    Description
ABS         Produces a different decimal between 0 and 1 each time you call it.
SQRT        Moves a number down to the previous whole number, no matter the decimal.
ROUND       Returns the mathematical constant approximately equal to 3.14159.
CEIL        Adjusts a decimal number to the closest whole number.
FLOOR       Finds the number which, when multiplied by itself, equals the input.
RANDOM      Gives you the distance of a number from zero, ignoring its sign.
PI          Moves a number up to the next whole number, no matter the decimal.
MATH FUNCTIONS
Function    Description                                                        Example
ABS         Returns the absolute value.                                        ABS(-5) → 5
SQRT        Returns the square root of a number.                               SQRT(16) → 4.0
ROUND       Rounds a number to the nearest integer.                            ROUND(4.6) → 5
CEIL        Returns the smallest integer greater than or equal to the number.  CEIL(4.1) → 5
FLOOR       Returns the largest integer less than or equal to the number.      FLOOR(4.9) → 4
RANDOM      Returns a random number between 0 and 1.                           RANDOM() → 0.3456 (varies)
PI          Returns the value of π.                                            PI() → 3.14159
MATCH THE FOLLOWING
Function    Description
SIZE        a. Turns multiple fields into a collection of tuples.
TOTUPLE     b. Finds the number of elements in a collection.
TOBAG       c. Extracts a specified number of top records from data.
FLATTEN     d. Changes fields into a single tuple.
TOP         e. Breaks down a collection into individual elements.
BAG & TUPLE FUNCTIONS
Function    Description                                                 Example
SIZE        Returns the size of a bag or tuple.                         SIZE({(1,2),(3,4)}) → 2
TOTUPLE     Converts fields into a tuple.                               TOTUPLE(1,2,3) → (1,2,3)
TOBAG       Converts fields into a bag.                                 TOBAG(1,2,3) → {(1),(2),(3)}
FLATTEN     Flattens a bag or tuple into separate fields.               FLATTEN({(1,2),(3,4)}) → (1,2),(3,4)
TOP         Returns the top N tuples from a bag based on a comparator.  top_records = TOP(3, score, data);
AGGREGATE FUNCTIONS
Function    Description                                Example
COUNT       Counts the number of elements.             COUNT(A) → 5
SUM         Sums up numeric values.                    SUM([Link]) → 250000.0
AVG         Calculates the average of numeric values.  AVG([Link]) → 50000.0
MIN         Finds the minimum value.                   MIN([Link]) → 22
MAX         Finds the maximum value.                   MAX([Link]) → 40
GROUP       Groups data based on a field.              GROUP A BY age
DATE & TIME FUNCTIONS

Function          Description                        Example
ToDate            Converts a string to a date.       ToDate('2025-01-01', 'yyyy-MM-dd')
AddDuration       Adds a duration to a date.         AddDuration(ToDate('2025-01-01', 'yyyy-MM-dd'), 'P5D')
SubtractDuration  Subtracts a duration from a date.  SubtractDuration(ToDate('2025-01-01', 'yyyy-MM-dd'), 'P5D')
CurrentTime       Returns the current timestamp.     CurrentTime()
UTILITY FUNCTIONS

Function    Description                              Example
DIFF        Finds the difference between two bags.   DIFF(A, B)
IsEmpty     Checks if a bag or tuple is empty.       IsEmpty(A)
NVL         Replaces null with a specified value.    NVL([Link], 0)
TOKENIZE    Splits a string into a bag of words.     TOKENIZE('Hello Pig') → {(Hello),(Pig)}
MCQ
1. Which Pig function calculates the total sum of the numeric field sales in dataset data?
   Options: SUM([Link]) | SUM(data) | SUM(sales)
2. Which function extracts the year from a date field order_date?
   Options: GET_YEAR(order_date) | YEAR(order_date) | TO_YEAR(order_date)
3. To get the current date and time in Pig, which function should be used?
   Options: NOW() | CURRENT_TIME() | GET_DATE()
4. Which function returns the number of distinct customers in bag customers?
   Options: COUNT(customers) | DISTINCT(customers) | COUNT_DISTINCT(customers)
5. How do you round the value of field price to the nearest integer?
   Options: ROUND(price) | CEIL(price) | FLOOR(price)
6. Which function converts the string '2024-05-17' into date format?
   Options: ToDate('2024-05-17') | ToDate('2024-05-17', 'yyyy-MM-dd') | ToDate('yyyy-MM-dd', '2024-05-17')
7. Which function breaks bag {(1,2),(3,4)} into individual fields?
   Options: FLATTEN() | EXPLODE() | UNPACK()
8. To get the number of elements inside a bag orders, which function is correct?
   Options: SIZE(orders) | LENGTH(orders) | COUNT(orders)
APACHE HIVE
 Apache Hive is a data warehouse infrastructure built on top of Apache
Hadoop for providing data summarization, query, and analysis.
 It enables querying large datasets stored in Hadoop's HDFS using a SQL-
like language called HiveQL.
 Originally developed by Facebook, Hive translates SQL queries into
MapReduce jobs for distributed processing.
 It supports features like partitioning, indexing, and user-defined functions
to optimize big data analysis.
 Hive is designed for batch processing and data warehousing, not for real-
time transactional workloads (OLTP).
 It is widely used for data summarization, ad-hoc queries, and ETL in big
data ecosystems.
KEY FEATURES OF HIVE
SQL-like Interface (HiveQL)
Scalability with Hadoop
Schema on Read
Data Warehouse Integration
Table Partitioning
Bucketing for efficient querying
Built-in Functions (math, string, date, etc.)
Support for Multiple File Formats
Compatibility with Hadoop Ecosystem
Extensibility with UDFs, UDAFs, and UDTFs
Indexing for faster queries
Security (authentication and authorization)
Easy Integration with BI Tools
Query Optimization (cost-based and rule-based)
Fault Tolerance via Hadoop
Hive Metastore for metadata management
PROBLEM STATEMENTS
 Create a database called test.
 Display the existing databases
 Select the newly created database
 Create an employee table which contains employeeid, name, age, and department.
 Insert 10 records under it.
 Retrieve all records from the employees table.
 Write a query to display the names of employees in the IT department.
 Write a query to retrieve details of all employees whose age is greater than
25.
PROBLEM STATEMENTS
 Find employees whose age is between 25 and 35.
 Write a query to display all employees sorted by age in ascending order.
 Write a query to list all departments that have employees aged over 30.
 Identify Understaffed Departments (fewer than 5 employees)
 Predict New Salaries with 10% Increment
 Create a query to summarize the total number of employees, average age
per department.
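Before writing the HiveQL, the final summary can be prototyped on a handful of rows in plain Python to check the expected shape of the result. The sample employees below are invented for illustration; the function mirrors a GROUP BY department with COUNT(*) and AVG(age):

```python
from collections import defaultdict

# Hypothetical employee rows: (employeeid, name, age, department).
employees = [
    (1, "Asha", 26, "IT"),
    (2, "Ravi", 31, "IT"),
    (3, "Meena", 29, "HR"),
    (4, "John", 41, "HR"),
    (5, "Lina", 24, "Sales"),
]

def summarize(rows):
    """Per-department employee count and average age, the same
    aggregation as SELECT department, COUNT(*), AVG(age)
    FROM employees GROUP BY department."""
    groups = defaultdict(list)
    for _, _, age, dept in rows:
        groups[dept].append(age)
    return {d: (len(ages), sum(ages) / len(ages)) for d, ages in groups.items()}

print(summarize(employees))
# → {'IT': (2, 28.5), 'HR': (2, 35.0), 'Sales': (1, 24.0)}
```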
CREATE DATA IN CSV
cat > sales_data.csv <<EOL
order_id,product_name,quantity,price,sale_date
101,Laptop,2,50000,2023-05-01
102,Mouse,5,500,2023-05-02
103,Keyboard,3,1500,2023-05-03
104,Monitor,1,10000,2023-05-04
EOL
CREATE TABLE
CREATE TABLE sales (
order_id INT,
product_name STRING,
quantity INT,
price FLOAT,
sale_date STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
LOAD THE DATA & READ
THE CONTENT
 LOAD DATA LOCAL INPATH '/home/shalini/sales_data.csv' INTO TABLE sales;
 SELECT * FROM sales;
THANK YOU
