
PRESENTATION ON BIG DATA ANALYSIS AND PROCESSING

Introduction to Big Data Processing

Introduction
Big data processing refers to the methods and technologies used to
handle large volumes of data that traditional data processing applications
can't manage efficiently. This data typically comes from various sources
such as social media, sensors, machines, transactions, and more. The
three main characteristics of big data, often referred to as the three Vs,
are volume, velocity, and variety:
• Volume: Big data involves large amounts of data, often ranging from
terabytes to petabytes or even exabytes.
• Velocity: Data streams in at high speeds and needs to be processed
quickly to derive insights or take actions in real-time or near real-time.
• Variety: Data comes in various formats and types, including structured
data (like databases), semi-structured data (like XML files), and
unstructured data (like text, images, and videos).
Components of Big Data Processing

• Storage Systems: Big data storage solutions such as the Hadoop Distributed File System (HDFS), Amazon S3, and Google Cloud Storage are used to store massive amounts of data across distributed systems.

• Processing Frameworks: Frameworks like Apache Hadoop, Apache Spark, and Apache Flink provide distributed computing capabilities for processing large datasets across clusters of computers.

• Data Processing Languages: Programming languages such as Java, Python, Scala, and SQL are commonly used for big data processing tasks. Each language has its strengths depending on the task at hand.
• Data Integration Tools: Tools like Apache NiFi, Apache Kafka, and
Apache Flume are used for ingesting data from various sources
into the processing pipeline.

• Data Analysis and Machine Learning: Techniques such as data mining, machine learning, and predictive analytics are applied to extract valuable insights from big data.

• Distributed Computing: Big data processing often involves distributing processing tasks across multiple nodes in a cluster to parallelize computation and improve performance.

• Fault Tolerance and Scalability: Big data systems need to be fault-tolerant and scalable to handle hardware failures and accommodate growing data volumes.
Introduction to Hadoop

What is Hadoop?

Hadoop is an open-source framework designed for
distributed storage and processing of large datasets across
clusters of commodity hardware. At its core, it comprises
two main components: Hadoop Distributed File System
(HDFS) for storing data across multiple machines with fault
tolerance, and MapReduce for parallel processing of data.
It provides a scalable and cost-effective solution for
organizations to handle Big Data, allowing them to store,
process, and analyze vast amounts of data to derive
valuable insights and make data-driven decisions.
Components of the Hadoop Ecosystem

• Hadoop Distributed File System (HDFS): HDFS is the primary storage system used by Hadoop. It is designed to store large files across multiple machines in a reliable and fault-tolerant manner. HDFS divides large files into smaller blocks and distributes them across a cluster of commodity hardware.

• Hadoop MapReduce: MapReduce is a programming model and processing engine for parallel processing of large datasets across a distributed cluster. It consists of two main phases: the Map phase, where data is processed in parallel across multiple nodes, and the Reduce phase, where the results from the Map phase are aggregated.

• YARN (Yet Another Resource Negotiator): YARN is the resource management layer of Hadoop. It is responsible for managing and allocating resources (CPU, memory, etc.) to the various applications running on the cluster.
• Hadoop Common: Hadoop Common contains libraries and
utilities that support other Hadoop modules. It includes common
utilities, configuration files, and libraries used by various
components in the Hadoop ecosystem.

• Hadoop HBase: HBase is a distributed, scalable, and column-oriented NoSQL database built on top of Hadoop. It provides real-time read/write access to large datasets and is suitable for applications requiring random, real-time read/write access to Big Data.

• Hadoop Hive: Hive is a data warehouse infrastructure built on top of Hadoop that provides a SQL-like interface (HiveQL) for querying and managing large datasets stored in HDFS. Hive translates SQL queries into MapReduce or Tez jobs, allowing users familiar with SQL to analyze Big Data without needing to write low-level MapReduce code.
The Hadoop Distributed File System (HDFS) architecture
comprises two main components: the NameNode and
DataNodes. The NameNode, serving as the master node,
manages metadata such as the file system structure and block
locations, while DataNodes, acting as slave nodes, store data
blocks and communicate with the NameNode for read/write
operations. HDFS stores large files by dividing them into
blocks and replicating those blocks across multiple DataNodes
for fault tolerance. This design enables parallel processing of
data across the cluster and ensures scalable, reliable, and
efficient storage and retrieval of data in distributed Hadoop
environments.
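
To make the block and replication mechanics above concrete, here is a minimal Python sketch that estimates how a file is laid out in HDFS; the 128 MB block size and replication factor of 3 are common configurable defaults assumed for illustration, not values taken from the slides.

    import math

    # Assumed, configurable defaults used only for illustration.
    BLOCK_SIZE_MB = 128
    REPLICATION_FACTOR = 3

    def hdfs_layout(file_size_mb: float) -> dict:
        """Estimate how many blocks a file occupies and the raw storage it consumes."""
        num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
        raw_storage_mb = file_size_mb * REPLICATION_FACTOR  # every block is stored on 3 DataNodes
        return {"blocks": num_blocks, "raw_storage_mb": raw_storage_mb}

    # A 1 GB file splits into 8 blocks and consumes roughly 3 GB of raw cluster storage.
    print(hdfs_layout(1024))  # {'blocks': 8, 'raw_storage_mb': 3072}
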
Hadoop MapReduce is a parallel processing framework designed to
efficiently process vast amounts of data across distributed clusters. It
operates through two main components: the JobTracker and
TaskTrackers. The JobTracker coordinates job execution, managing
tasks such as task scheduling, monitoring, and task failure recovery.
TaskTrackers, deployed on individual cluster nodes, execute specific
map and reduce tasks assigned by the JobTracker. MapReduce
employs a map phase for data transformation and a reduce phase for
aggregation, enabling data processing in parallel across nodes. Its
fault-tolerant design, facilitated by data replication and task re-
execution, ensures resilience in the face of node failures. Hadoop
MapReduce facilitates scalable and distributed data processing,
making it a fundamental component of the Hadoop ecosystem for large-scale batch processing.
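
As a concrete illustration of the map and reduce phases described above, the following Python sketch runs the classic word count locally; on a real cluster, Hadoop would execute many mapper and reducer tasks in parallel and handle the shuffle and sort between them. The sample input is made up.

    from itertools import groupby

    def mapper(lines):
        # Map phase: emit a (word, 1) pair for every word in the input split.
        for line in lines:
            for word in line.strip().split():
                yield word.lower(), 1

    def reducer(pairs):
        # Reduce phase: sum the counts for each word once the pairs are grouped by key.
        for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
            yield word, sum(count for _, count in group)

    sample = ["big data needs big clusters", "hadoop and spark process big data"]
    for word, count in reducer(mapper(sample)):
        print(word, count)
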
Apache Spark: An Overview

What is Apache Spark?
Apache Spark is an open-source, distributed computing framework
that provides an in-memory processing engine for fast and
efficient data processing. It offers a versatile set of libraries and
APIs for various tasks including batch processing, real-time
streaming, machine learning, and graph processing. Spark's
resilient distributed dataset (RDD) abstraction allows for fault-
tolerant parallel processing of data across distributed clusters. Its
unified architecture combines multiple processing workloads,
enabling seamless integration and faster data analysis compared
to traditional disk-based processing frameworks like Hadoop
MapReduce. Spark's ease of use, scalability, and rich ecosystem
make it a popular choice for big data processing and analytics.
Advantages of Apache Spark over Hadoop MapReduce
• In-Memory Processing: Spark performs in-memory processing, reducing
the need for disk I/O and improving processing speed significantly
compared to MapReduce, which relies heavily on disk storage for
intermediate data.

• Unified Computing Engine: Spark provides a unified platform for various data processing tasks including batch processing, interactive queries, real-time streaming, machine learning, and graph processing, whereas MapReduce is primarily suited for batch processing.

• Ease of Use: Spark offers a more user-friendly and expressive API compared to the low-level programming model of MapReduce, allowing developers to write complex data processing pipelines more efficiently and with fewer lines of code.
• Fault Tolerance with Resilient Distributed Datasets (RDDs): Spark's RDD
abstraction provides built-in fault tolerance by tracking the lineage of
transformations applied to the data, allowing lost data to be recomputed
from the original source. This makes Spark more resilient to failures
compared to MapReduce.

• Efficient Caching and Data Reuse: Spark allows datasets to be cached in memory across multiple operations, enabling iterative and interactive processing with reduced latency by avoiding repetitive data reads from disk, which is not efficiently supported in MapReduce (see the caching sketch after this list).

• Optimized Execution Plan with Directed Acyclic Graph (DAG) Scheduler: Spark optimizes the execution of data processing tasks using a DAG scheduler, which generates an optimal execution plan based on the dependencies between tasks, resulting in better performance compared to the static execution plan of MapReduce.
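
The PySpark sketch below illustrates the caching advantage referenced in the list above: a filtered dataset is cached once and reused by two actions instead of being re-read from disk. The input path is a placeholder, and the example is a sketch rather than a benchmark.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("caching-demo").getOrCreate()

    # Hypothetical input location; replace with a real dataset path.
    logs = spark.read.text("hdfs:///data/access_logs")

    errors = logs.filter(logs.value.contains("ERROR")).cache()  # kept in memory after first use

    # Both actions reuse the cached data instead of re-reading it from disk,
    # which is where Spark gains over MapReduce's disk-based intermediate results.
    print("error lines:", errors.count())
    errors.filter(errors.value.contains("timeout")).show(5)

    spark.stop()
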
Spark Architecture

Spark Core
Spark Core serves as the foundational engine of the Apache
Spark framework, providing the distributed computing
infrastructure for processing large-scale data sets. At its core,
Spark Core introduces the concept of Resilient Distributed
Datasets (RDDs), immutable distributed collections of data
objects that allow for fault-tolerant parallel processing across a
cluster. With its in-memory processing capabilities, Spark Core
significantly enhances processing speed by minimizing disk I/O
overhead. Additionally, it offers a rich set of transformations and
actions, fault tolerance through lineage information, and
distributed task execution, making it a versatile and efficient
engine for a wide range of data processing tasks in Spark.
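
A minimal PySpark sketch of the RDD model described above: transformations are recorded lazily in the lineage, and an action triggers distributed execution. The data is made up for illustration.

    from pyspark import SparkContext

    sc = SparkContext(appName="rdd-demo")

    numbers = sc.parallelize(range(1, 11))        # an RDD partitioned across the cluster
    squares = numbers.map(lambda x: x * x)        # transformation: added to the lineage, not run yet
    evens = squares.filter(lambda x: x % 2 == 0)  # another lazy transformation

    # The action below forces evaluation; a lost partition is recomputed from the
    # lineage (parallelize -> map -> filter) rather than restored from replicas.
    print(evens.collect())  # [4, 16, 36, 64, 100]

    sc.stop()
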
Spark SQL
Spark SQL is a component of Apache Spark designed to
facilitate seamless interaction with structured data using
SQL queries, DataFrame API, and SQL-compatible functions.
It extends Spark's capabilities to include SQL queries and
manipulation of structured data, enabling integration with
existing SQL-based tools and expertise. With Spark SQL,
users can perform complex data analysis, join multiple data
sources, and execute SQL queries directly against data
stored in various formats such as Parquet, JSON, CSV, and
Hive tables, thus bridging the gap between traditional
relational databases and big data processing frameworks.
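
To show the interplay of the DataFrame API and SQL described above, here is a short PySpark sketch; the JSON path and the country/amount fields are assumptions made for the example.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

    # Hypothetical JSON input with 'country' and 'amount' fields.
    sales = spark.read.json("hdfs:///data/sales.json")
    sales.createOrReplaceTempView("sales")  # expose the DataFrame to SQL

    top = spark.sql("""
        SELECT country, SUM(amount) AS total
        FROM sales
        GROUP BY country
        ORDER BY total DESC
        LIMIT 5
    """)
    top.show()

    spark.stop()
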
Spark Streaming
Spark Streaming is an extension of the Apache Spark core that
enables scalable, fault-tolerant processing of live data streams. It
provides high-level abstractions like DStream (Discretized
Stream), allowing developers to process continuous data streams
in real-time using the same programming model as batch
processing. Spark Streaming ingests data from various sources
such as Kafka, Flume, and TCP sockets, and divides it into micro-
batches for parallel processing, offering low-latency stream
processing with fault tolerance and exactly-once semantics. With
its seamless integration with Spark's ecosystem, Spark Streaming
empowers developers to build robust and scalable stream
processing applications for a wide range of use cases.
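
A minimal DStream sketch of the micro-batch model described above, counting words from a TCP socket; the host and port are placeholders, and a Kafka or Flume receiver would be wired in similarly. Newer applications often use Structured Streaming instead, but the classic API matches the DStream abstraction in the text.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="streaming-demo")
    ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

    lines = ssc.socketTextStream("localhost", 9999)  # placeholder source

    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()  # print a sample of word counts for each micro-batch

    ssc.start()
    ssc.awaitTermination()
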
Spark MLlib
Spark MLlib, part of the Apache Spark ecosystem, is a scalable
machine learning library designed for distributed data processing.
It provides a rich set of algorithms and utilities for common
machine learning tasks such as classification, regression,
clustering, collaborative filtering, and dimensionality reduction.
Leveraging Spark's distributed computing capabilities, MLlib
enables efficient processing of large-scale datasets, parallel
model training, and distributed model inference. With its user-
friendly APIs and seamless integration with other Spark
components, MLlib empowers data scientists and developers to
build and deploy scalable machine learning pipelines for real-world applications, accelerating the development and deployment of machine learning solutions.
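
As an illustration of MLlib's high-level API, the sketch below fits a logistic regression on a tiny in-memory dataset; the feature values and column names are invented for the example.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

    # Tiny, made-up training set: two numeric features and a binary label.
    train = spark.createDataFrame(
        [(0.0, 1.1, 0.0), (1.5, 0.3, 1.0), (2.0, 0.1, 1.0), (0.2, 1.4, 0.0)],
        ["f1", "f2", "label"],
    )

    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    model = LogisticRegression(maxIter=10).fit(assembler.transform(train))

    # On a cluster, training and prediction are distributed across the worker nodes.
    model.transform(assembler.transform(train)).select("label", "prediction").show()

    spark.stop()
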
Spark GraphX

Spark GraphX is a component of the Apache Spark
ecosystem designed for scalable graph processing and
analytics. It provides an API for constructing and
manipulating graphs, along with a set of distributed graph
algorithms for tasks such as graph traversal, pattern
matching, and graph analytics. GraphX leverages Spark's
distributed computing framework to efficiently process large-
scale graphs in parallel, making it suitable for analyzing
social networks, recommendation systems, and other graph-
structured data. With its seamless integration with other
Spark components, GraphX enables developers to build and
execute complex graph algorithms on distributed datasets.
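
GraphX itself exposes Scala and Java APIs; to stay consistent with the Python sketches above, the example below uses the separate GraphFrames package (assumed to be installed, and not part of GraphX) to run PageRank on a tiny, made-up graph. It illustrates the same vertex/edge model and distributed graph algorithms.

    from pyspark.sql import SparkSession
    from graphframes import GraphFrame  # third-party package, not part of core Spark

    spark = SparkSession.builder.appName("graph-demo").getOrCreate()

    # Tiny, made-up graph: vertices need an 'id' column, edges need 'src' and 'dst'.
    vertices = spark.createDataFrame([("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
    edges = spark.createDataFrame([("a", "b"), ("b", "c"), ("c", "a")], ["src", "dst"])

    graph = GraphFrame(vertices, edges)
    ranks = graph.pageRank(resetProbability=0.15, maxIter=10)
    ranks.vertices.select("id", "pagerank").show()

    spark.stop()
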
Performance Comparison

• Processing Speed: Spark generally outperforms Hadoop
MapReduce due to its in-memory processing capability,
which reduces the need for disk I/O and improves
processing speed significantly, especially for iterative and
interactive workloads.

• Resource Utilization: Spark's ability to cache data in memory and reuse it across multiple operations leads to better resource utilization compared to Hadoop MapReduce, which relies heavily on disk storage for intermediate data.
• Fault Tolerance: Hadoop relies on data replication and task
re-execution, while Spark employs RDD lineage and
distributed execution for faster recovery from failures.

• Scalability: Both systems scale horizontally, but Spark's unified engine and efficient processing make it more scalable, especially for real-time streaming and interactive analysis.

• Ease of Use: Spark's user-friendly API enables developers to write complex data processing pipelines more efficiently and with less code compared to the lower-level programming model of Hadoop MapReduce.
Hadoop vs. Spark: Examples of Big Data Analytics Platforms for Batch and Streaming Computing
• Amazon Elastic MapReduce (EMR): Offers Hadoop clusters
distributed across multiple Amazon EC2 instances. It
supports processing frameworks like Apache Spark and
HBase and integrates with various data stores within Amazon
Web Services (AWS).

• Google Cloud Dataproc: Provides Apache Spark services on Hadoop clusters for batch processing, querying, streaming, and machine learning, enabling scalable and efficient data processing tasks in the cloud environment.

• Microsoft Azure HDInsight: Offers an HDP Hadoop distribution in the cloud, providing scalable and reliable Hadoop clusters for processing large-scale data workloads, along with integration with other Azure services for enhanced analytics.
