Lecture 2 - Hadoop 221
Big Data Analytics in Business - Importance
● Importance:
○ Improved Decision Making: Data-driven insights can
help businesses make more informed decisions and
reduce the risk of errors and biases.
○ Increased Efficiency: Big Data Analytics can help
businesses optimize processes, improve workflows,
and reduce operational costs.
○ Enhanced Customer Experience: Big Data Analytics
can provide valuable insights into customer behavior
and preferences, enabling businesses to personalize
experiences and improve customer satisfaction.
Example: Domino's
○ Competitive Advantage: Big Data Analytics can
provide businesses with a competitive advantage by
enabling them to identify trends, anticipate market
changes, and innovate.
○ Revenue Growth: Big Data Analytics can help
businesses identify new revenue streams, cross-sell
and upsell opportunities, and improve pricing
strategies.
Big Data Analytics in Business - Use Cases
● Definition: the process of analyzing large and complex
datasets to uncover insights and knowledge
● Use Cases:
○ Customer Analytics: analyzing customer behavior and
preferences to improve marketing strategies, personalized
offers, and customer experience. Example: Amazon
○ Fraud Detection: detecting fraudulent activities in financial
transactions, insurance claims, and healthcare billing
○ Supply Chain Optimization: optimizing the supply chain to
reduce costs, improve delivery times, and increase
efficiency
○ Predictive Maintenance: predicting equipment failure and
maintenance needs to improve uptime and reduce costs.
Example: Tesla
○ Risk Management: analyzing data to identify and mitigate
risks in financial investments, insurance claims, and
cybersecurity. Example: investment banking
○ Sales Forecasting: using historical data and machine
learning algorithms to predict sales trends and forecast
demand
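As a rough illustration of the Sales Forecasting use case, the sketch below fits a straight-line trend to past sales with ordinary least squares and extends it into future periods. This is only one very simple technique chosen for illustration; the function names `fit_linear_trend` and `forecast` and the sample figures are hypothetical, and real forecasting systems typically use richer models and far more data.

```python
from typing import List, Tuple


def fit_linear_trend(sales: List[float]) -> Tuple[float, float]:
    """Fit sales[t] ~ intercept + slope * t using ordinary least squares."""
    n = len(sales)
    if n < 2:
        raise ValueError("need at least two historical periods")
    periods = range(n)
    mean_t = sum(periods) / n
    mean_s = sum(sales) / n
    # Covariance of (period, sales) divided by variance of period gives the slope.
    cov = sum((t - mean_t) * (s - mean_s) for t, s in zip(periods, sales))
    var = sum((t - mean_t) ** 2 for t in periods)
    slope = cov / var
    intercept = mean_s - slope * mean_t
    return slope, intercept


def forecast(sales: List[float], periods_ahead: int) -> List[float]:
    """Extend the fitted trend line over the next `periods_ahead` periods."""
    slope, intercept = fit_linear_trend(sales)
    n = len(sales)
    return [intercept + slope * (n + k) for k in range(periods_ahead)]


if __name__ == "__main__":
    history = [100.0, 110.0, 120.0, 130.0]  # made-up sales for four past periods
    print(forecast(history, 2))
```

A straight line ignores seasonality and promotions, so it should be read as a baseline that more capable models are compared against.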
Big Data Analytics Case Study
● Overview: Netflix is a streaming service that uses Big Data Analytics to drive
decision making and improve customer experience.
● Use Cases:
○ Personalized Recommendations: Netflix's recommendation engine
analyzes viewing history, ratings, and user behavior to personalize
movie and TV show recommendations for each user.
○ Content Creation: Netflix uses Big Data Analytics to identify popular
genres, actors, and storylines to create original content that resonates
with viewers.
○ Pricing Strategy: Netflix uses data to optimize pricing, test different
pricing models, and offer promotions to attract and retain customers.
● Results:
○ Improved Customer Experience: Personalized recommendations have
increased customer engagement and satisfaction, resulting in higher
retention rates.
○ Increased Revenue: Netflix's data-driven approach to content creation
and pricing has helped the company grow its subscriber base and
increase revenue.
○ Subscriber Growth: As of Q4 2022, Netflix had over 214 million subscribers
globally, up from 103.95 million in Q2 2016, roughly doubling its subscriber base.
Data Collection and Pre-processing
Importance of Data Collection
• Foundation of Analysis: Quality insights stem from
quality data.
• Accuracy & Relevance: Ensures that the analysis is on
target.
• Informed Decisions: Data-driven strategies are more
likely to succeed.
• Understanding Trends: Spot market trends and
customer behaviors.
• Risk Management: Proper data can warn about
potential future issues.
Primary and Secondary Data
Methods of How Do We Collect Data?
• Hadoop Architecture is designed to enable distributed processing of large datasets across a cluster of
computers, making it an ideal solution for businesses that need to process and analyze large amounts of data.
HDFS - Hadoop Distributed File System
Hadoop Distributed File System (HDFS) is a key component of Hadoop
architecture that provides a distributed and reliable way to store and
manage large datasets. Here are some key features of HDFS:
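Two behaviours behind the "distributed and reliable" description, splitting a file into fixed-size blocks and keeping several copies of each block on different machines, can be sketched as follows. The 128 MB block size and replication factor of 3 match common HDFS defaults, but everything else here is an assumption made for illustration: the function `plan_blocks`, the node names, and the round-robin placement are hypothetical, and real HDFS uses a rack-aware placement policy rather than this simple rotation.

```python
import math
from typing import Dict, List

BLOCK_SIZE_MB = 128  # common HDFS default block size
REPLICATION = 3      # common HDFS default replication factor


def plan_blocks(
    file_size_mb: int,
    nodes: List[str],
    block_size_mb: int = BLOCK_SIZE_MB,
    replication: int = REPLICATION,
) -> Dict[int, List[str]]:
    """Map each block index to the nodes that would hold a copy of it.

    Placement is a simplified round-robin, NOT the real HDFS policy.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement: Dict[int, List[str]] = {}
    for block_id in range(num_blocks):
        # Pick `replication` distinct nodes, starting at a rotating offset.
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    cluster = ["node1", "node2", "node3", "node4"]  # hypothetical node names
    for block_id, holders in plan_blocks(300, cluster).items():
        print(block_id, holders)
```

Because every block lives on several nodes, losing one machine does not lose data, and different blocks of the same file can be read in parallel.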
MapReduce (Mapper and Reducer)
MapReduce is used in Hadoop to process large datasets in parallel. MapReduce operates in two phases:
the Map phase and the Reduce phase. Here's how it works:
● Map phase: In the Map phase, data is read from HDFS and presented to the
Map function as key-value pairs. The Map function performs some operations
on each input pair and produces intermediate key-value pairs as output.
● Shuffle and Sort: The intermediate key-value pairs produced in the Map
phase are grouped and sorted by key before being passed to the Reduce
phase.
● Reduce phase: In the Reduce phase, the Reduce function receives each key
together with all of the values grouped under it and produces output in the
form of key-value pairs.
● Final output: The final output of MapReduce is written back to HDFS as
key-value pairs.
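The phases above can be imitated in a few lines of plain Python using the classic word-count task. This is a single-process sketch of the data flow only, not Hadoop's API: the function names `map_phase`, `shuffle_and_sort`, `reduce_phase`, and `run_word_count` are hypothetical, and an in-memory list of lines stands in for input read from HDFS.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map: emit an intermediate (word, 1) pair for every word seen."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle_and_sort(
    pairs: Iterable[Tuple[str, int]]
) -> List[Tuple[str, List[int]]]:
    """Shuffle and Sort: group values by key, then order the keys."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(
    grouped: Iterable[Tuple[str, List[int]]]
) -> Iterator[Tuple[str, int]]:
    """Reduce: collapse each key's list of values into a single count."""
    for key, values in grouped:
        yield (key, sum(values))


def run_word_count(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Chain the three steps and return the final key-value pairs."""
    return list(reduce_phase(shuffle_and_sort(map_phase(lines))))


if __name__ == "__main__":
    sample = ["big data needs big tools", "hadoop stores big data"]
    for word, count in run_word_count(sample):
        print(f"{word}\t{count}")
```

In a real cluster the Map and Reduce functions run on many machines at once and the shuffle moves data over the network; the sketch keeps only the logical order of the steps.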