
Data & Hadoop – Lecture 2

Big Data Analytics in Business - Importance

● Importance:
○ Improved Decision Making: Data-driven insights can help businesses make more informed decisions and reduce the risk of errors and biases.
○ Increased Efficiency: Big Data Analytics can help businesses optimize processes, improve workflows, and reduce operational costs.
○ Enhanced Customer Experience: Big Data Analytics can provide valuable insights into customer behavior and preferences, enabling businesses to personalize experiences and improve customer satisfaction. E.g., Domino's.
○ Competitive Advantage: Big Data Analytics can provide businesses with a competitive advantage by enabling them to identify trends, anticipate market changes, and innovate.
○ Revenue Growth: Big Data Analytics can help businesses identify new revenue streams, cross-sell and upsell opportunities, and improve pricing strategies.
Big Data Analytics in Business - Use Cases

● Definition: the process of analyzing large and complex datasets to uncover insights and knowledge.
● Use Cases:
○ Customer Analytics: analyzing customer behavior and preferences to improve marketing strategies, personalized offers, and customer experience. E.g., Amazon.
○ Fraud Detection: detecting fraudulent activities in financial transactions, insurance claims, and healthcare billing.
○ Supply Chain Optimization: optimizing the supply chain to reduce costs, improve delivery times, and increase efficiency.
○ Predictive Maintenance: predicting equipment failure and maintenance needs to improve uptime and reduce costs. E.g., Tesla.
○ Risk Management: analyzing data to identify and mitigate risks in financial investments, insurance claims, and cybersecurity. E.g., investment banking.
○ Sales Forecasting: using historical data and machine learning algorithms to predict sales trends and forecast demand.
Big Data Analytics Case Study - Netflix

Overview: Netflix is a streaming service that uses Big Data Analytics to drive decision making and improve customer experience.

● Use Cases:
○ Personalized Recommendations: Netflix's recommendation engine analyzes viewing history, ratings, and user behavior to personalize movie and TV show recommendations for each user.
○ Content Creation: Netflix uses Big Data Analytics to identify popular genres, actors, and storylines to create original content that resonates with viewers.
○ Pricing Strategy: Netflix uses data to optimize pricing, test different pricing models, and offer promotions to attract and retain customers.
● Results:
○ Improved Customer Experience: Personalized recommendations have increased customer engagement and satisfaction, resulting in higher retention rates.
○ Increased Revenue: Netflix's data-driven approach to content creation and pricing has helped the company grow its subscriber base and increase revenue.
○ As of Q4 2022, Netflix had over 214 million subscribers globally, up from 103.95 million in Q2 2016 (more than double).
Data Collection and Pre-processing
Importance of Data Collection

• Foundation of Analysis: Quality insights stem from quality data.
• Accuracy & Relevance: Ensures that the analysis is on target.
• Informed Decisions: Data-driven strategies are more likely to succeed.
• Understanding Trends: Spot market trends and customer behaviors.
• Risk Management: Proper data can warn about potential future issues.
Methods of Data Collection

Primary and Secondary Data

How Do We Collect Data?
• Surveys & Questionnaires: Direct feedback from individuals.
• Observations: Direct or participant observations for qualitative data.
• Interviews: One-on-one sessions, telephonic, or in person.
• Web Scraping: Extracting data from websites automatically (see the sketch below).
• Logs & Transaction Data: System-generated, e.g., e-commerce transactions.
• Sensors & IoT Devices: Physical data from the environment.
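As a rough illustration of automated web scraping, the minimal Python sketch below fetches a page and extracts text matching a CSS selector. The URL and the `.product-name` selector are hypothetical placeholders, and it assumes the `requests` and `beautifulsoup4` packages are installed; always check a site's terms of use and robots.txt before scraping.

```python
# Minimal web-scraping sketch using requests and BeautifulSoup.
# The URL and the CSS selector are placeholders; adapt them to a site
# you are permitted to scrape.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/products"  # hypothetical page listing products

response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early if the request failed

soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every element matching a (hypothetical) CSS class.
product_names = [tag.get_text(strip=True) for tag in soup.select(".product-name")]

for name in product_names:
    print(name)
```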
Data Cleaning and Transformation

Making Data Usable: Cleaning & Transformation
• Data Cleaning: Removing inconsistencies, errors, and duplicates.
  • E.g., removing invalid email addresses or standardizing date formats.
• Data Transformation: Converting data into a suitable format or structure.
  • E.g., normalizing scales or encoding categorical variables.
• Handling Missing Data: Using techniques to fill in or omit missing values.
• Feature Engineering: Enhancing data by constructing new relevant features.
• Tool example: OpenRefine. (A minimal pandas sketch of these steps follows below.)
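For readers who prefer code over a GUI tool such as OpenRefine, here is a minimal pandas sketch of the steps listed above. The file name and the column names (`email`, `signup_date`, `plan`, `amount`) are hypothetical placeholders, not part of the lecture material.

```python
# Minimal pandas sketch of cleaning, transformation, missing-data handling,
# and feature engineering. All names below are hypothetical.
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical input file

# Data cleaning: drop duplicates and rows with obviously invalid emails.
df = df.drop_duplicates()
df = df[df["email"].str.contains("@", na=False)]

# Standardize date formats (unparseable dates become NaT).
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

# Handling missing data: fill missing amounts with the column median.
df["amount"] = df["amount"].fillna(df["amount"].median())

# Data transformation: encode a categorical column as one-hot indicator columns.
df = pd.get_dummies(df, columns=["plan"])

# Feature engineering: derive a new feature from an existing one.
df["signup_year"] = df["signup_date"].dt.year

print(df.head())
```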


Analytics Tools and Techniques
Overview of Analytical Tools

• Business Intelligence Tools: Tools like Microsoft Power BI and QlikView, which allow for data preparation, dashboard development, and data warehousing.
• Statistical Analysis Software: Tools like SPSS and Stata for deep statistical analysis.
• Data Management Platforms: Such as Alteryx and Talend, which assist in data integration, cleansing, and enrichment.
Visualization Tools like Tableau, Power BI

• Tableau: Interactive visualization software that allows you to create dashboards and stories.
  • Features: Drag-and-drop interface, data blending, real-time analysis, and collaboration.
• Power BI: Microsoft's BI tool that integrates seamlessly with other Microsoft products.
  • Features: DAX scripting, drill-down & drill-through capabilities, custom visuals, and integration with Azure.
• Benefits: Transform raw data into comprehensible visuals; help in identifying patterns, trends, and outliers.
Analytical Platforms like SAS, R, Python

• SAS: A software suite used for advanced analytics, multivariate analysis, business intelligence, and data management.
  • Features: Reliable, with industry-specific solutions and a GUI for non-coders.
• R: A programming language and environment used for statistical computing and graphics.
  • Features: Rich ecosystem of packages, integrated data handling.
• Python: A high-level, general-purpose programming language.
  • Features: Libraries like Pandas and NumPy for data manipulation, and Scikit-learn for machine learning.
• Use Cases: Predictive modeling, text analysis, data mining, and optimization (a small predictive-modeling sketch follows below).
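To make the "predictive modeling" use case concrete, here is a minimal sketch using pandas, NumPy, and scikit-learn. The data is synthetic and the feature and label names (`monthly_spend`, `support_tickets`, `churned`) are invented purely for illustration.

```python
# Minimal predictive-modeling sketch with pandas, NumPy, and scikit-learn
# on synthetic customer data (all names are hypothetical).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic data: two numeric features and a binary "churned" label.
df = pd.DataFrame({
    "monthly_spend": rng.normal(50, 15, 500),
    "support_tickets": rng.poisson(2, 500),
})
df["churned"] = ((df["support_tickets"] > 3) & (df["monthly_spend"] < 45)).astype(int)

# Hold out a test set, fit a simple classifier, and report accuracy.
X_train, X_test, y_train, y_test = train_test_split(
    df[["monthly_spend", "support_tickets"]], df["churned"],
    test_size=0.2, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```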
Ethics in Big Data Analytics
Importance of Ethics

• Trustworthiness: Ethical handling establishes trust with customers and stakeholders.
• Regulatory Compliance: Meeting legal and policy requirements.
• Reputation: Avoiding bad publicity and reputational damage due to unethical practices.
• Sustainability: Ensuring long-term business survival and relevance.
• Responsibility: Recognizing the impact of data on real people and society.
Best Practices for Ethical Data Handling

• Informed Consent: Always get explicit permission before data collection.
• Data Minimization: Only collect what is absolutely necessary.
• Transparency: Be open about how, why, and where data is used.
• Continuous Audits: Regular checks to ensure ethical compliance.
• Diversity in Teams: Diverse teams can better spot and rectify biases.
• User Empowerment: Allow users to control their data and its usage.
Hadoop - An Introduction
• Hadoop is an open-source framework for processing and
storing large datasets.
• It is designed to handle big data and distribute the
processing workload across a cluster of computers.
• Hadoop consists of two main components: the Hadoop
Distributed File System (HDFS) and MapReduce.
• HDFS is a distributed file system that allows data to be
stored across multiple machines.
• MapReduce is a programming model for processing large
datasets in parallel across a cluster of computers.
• Hadoop is used by many organizations to process and
analyze large datasets, including Facebook, Yahoo!, and
Amazon.
• Hadoop is an essential tool for big data analytics and has
revolutionized the field of data processing and storage.
Hadoop - Benefits
• Cost-effectiveness: Hadoop offers a cost-effective solution for processing and storing large datasets compared to traditional data storage methods.
• Faster Data Processing: Hadoop enables faster data processing by distributing the workload across a cluster of computers, reducing the time required for data analysis and decision-making.
• Flexibility: Hadoop allows businesses to store and process data of any format and size, including structured, semi-structured, and unstructured data.
• Scalability: Hadoop supports scalability and can handle datasets of any size, making it ideal for businesses that need to store and process large amounts of data.
• Reliability and Fault Tolerance: Hadoop is designed to handle failures and ensure data availability, making it a reliable solution for businesses.
• Insights into Customer Behavior: Hadoop enables businesses to gain insights into customer behavior and preferences, allowing for targeted marketing and improved customer experiences.
• Optimization of Business Operations: Hadoop can be used to optimize business operations, including supply chain management, fraud detection, and risk analysis.
• Identification of New Opportunities for Growth and Revenue: Hadoop can help businesses identify new opportunities for growth and revenue by analyzing large datasets and identifying patterns and trends.
Hadoop Architecture - Overview
• Hadoop is a distributed system that consists of the following components:
● Hadoop Distributed File System (HDFS): HDFS is a distributed file system that stores data across a
cluster of computers.
● Yet Another Resource Negotiator (YARN): YARN is a resource management framework that manages
resources in a Hadoop cluster and schedules jobs.
● MapReduce: MapReduce is a programming model used for processing and analyzing large datasets in a
distributed environment.
● Hadoop Common: Hadoop Common provides the libraries and utilities needed by other Hadoop
modules.
● Hadoop Clients: Hadoop Clients are the tools used to interact with the Hadoop cluster, including
Hadoop command-line tools and web-based user interfaces.
● Hadoop Ecosystem: Hadoop Ecosystem consists of various tools and technologies that work with
Hadoop to enhance its capabilities, including Apache Spark, Hive, Pig, and HBase.

• Hadoop Architecture is designed to enable distributed processing of large datasets across a cluster of
computers, making it an ideal solution for businesses that need to process and analyze large amounts of data.
HDFS (Hadoop Distributed File System)

Hadoop Distributed File System (HDFS) is a key component of the Hadoop architecture that provides a distributed and reliable way to store and manage large datasets. Here are some key features of HDFS:

● Distributed storage: HDFS distributes data across multiple nodes in a cluster for better performance and reliability.
● Fault tolerance: HDFS maintains multiple copies of each data block to ensure that data is not lost in case of node failure.
● Scalability: HDFS can scale horizontally by adding more nodes to the cluster to accommodate growing data needs.
● High throughput: HDFS is optimized for large sequential reads and writes, making it ideal for applications that require high throughput.
● Data locality: HDFS tries to store data on nodes where it will be processed, minimizing network traffic and improving performance.
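In practice, files usually reach HDFS through the standard `hdfs dfs` file-system commands. The sketch below drives those commands from Python via `subprocess`; it assumes a configured Hadoop client is on the PATH, and the local and HDFS paths are hypothetical placeholders.

```python
# Sketch of basic HDFS interaction from Python by calling the standard
# "hdfs dfs" shell commands. Assumes a configured Hadoop client on PATH;
# all paths below are hypothetical.
import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' file-system command and return its stdout."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Create a directory in HDFS and upload a local file into it.
hdfs("-mkdir", "-p", "/user/demo/input")
hdfs("-put", "-f", "local_data.txt", "/user/demo/input/")

# List the directory; each file is stored as replicated blocks across nodes.
print(hdfs("-ls", "/user/demo/input"))
```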
MapReduce (Mapper and Reducer)

MapReduce is a programming model and processing framework that is widely used in Hadoop to process large datasets in parallel. MapReduce operates in two phases: the Map phase and the Reduce phase. Here's how it works:

● Map phase: In the Map phase, data is read from HDFS and processed into key-value pairs. The Map function takes these key-value pairs as input, performs some operations on them, and produces intermediate key-value pairs as output.
● Shuffle and Sort: The intermediate key-value pairs produced in the Map phase are shuffled and sorted by key before being passed to the Reduce phase.
● Reduce phase: In the Reduce phase, the sorted key-value pairs are processed by the Reduce function, which takes the key-value pairs as input and produces output in the form of key-value pairs.
● Final output: The final output of MapReduce is written back to HDFS as key-value pairs.

MapReduce allows processing of large datasets in parallel by dividing them into smaller chunks and processing them on multiple nodes in a cluster. This distributed processing enables scalability, fault tolerance, and high throughput.
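To illustrate only the data flow (not how Hadoop actually distributes the work), here is a toy single-process Python sketch of the map, shuffle/sort, and reduce phases, summing sale amounts per product. The input lines and product names are invented for the example.

```python
# Toy single-process simulation of the MapReduce data flow:
# map -> shuffle/sort -> reduce. Hadoop runs these phases across a cluster;
# this sketch only shows how key-value pairs move through the phases.
from itertools import groupby
from operator import itemgetter

lines = ["laptop,1200", "phone,800", "laptop,900", "phone,650"]

def map_phase(line):
    # Map: parse one raw input line into an intermediate (key, value) pair.
    product, amount = line.split(",")
    return product, int(amount)

intermediate = [map_phase(line) for line in lines]

# Shuffle and sort: bring all pairs with the same key together, ordered by key.
intermediate.sort(key=itemgetter(0))

# Reduce: aggregate the values for each key.
totals = {key: sum(amount for _, amount in group)
          for key, group in groupby(intermediate, key=itemgetter(0))}

print(totals)  # {'laptop': 2100, 'phone': 1450}
```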
How does Hadoop Work?

Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs:

• Data is initially divided into directories and files. Files are divided into uniformly sized blocks of 128 MB or 64 MB (preferably 128 MB).
• These files are then distributed across various cluster nodes for further processing.
• HDFS, being on top of the local file system, supervises the processing.
• Blocks are replicated for handling hardware failure.
• Checking that the code was executed successfully.
• Performing the sort that takes place between the map and reduce stages.
• Sending the sorted data to a certain computer.
• Writing the debugging logs for each job.
The Word Count example is a simple yet powerful demonstration of Hadoop's
MapReduce framework. In this example, the goal is to count the number of
occurrences of each word in a large dataset of text files.
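The canonical Word Count is usually written in Java against the MapReduce API; an equivalent that is easier to show briefly uses Hadoop Streaming, which lets any program that reads stdin and writes stdout act as the mapper and reducer. The two hypothetical scripts below sketch that approach.

```python
# mapper.py - one possible word-count mapper for Hadoop Streaming.
# Reads lines of text from stdin and emits "word<TAB>1" pairs on stdout.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py - word-count reducer for Hadoop Streaming.
# The framework delivers mapper output sorted by key, so counts for the same
# word arrive on consecutive lines and can be summed with a running total.
import sys

current_word, current_count = None, 0

for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

On a cluster, such scripts are typically submitted through the Hadoop Streaming jar shipped with the distribution (the exact jar path varies by installation), passing `-input`, `-output`, `-mapper`, and `-reducer` options; the framework then handles splitting the input, sorting the mapper output by key, and writing the reducer output back to HDFS.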
HDFS and Map Reduce Flow
