
PRESENTATION ON BIG DATA ANALYSIS AND PROCESSING

Introduction to Big Data Processing

Introduction
Big data processing refers to the methods and technologies used to
handle large volumes of data that traditional data processing applications
can't manage efficiently. This data typically comes from various sources
such as social media, sensors, machines, transactions, and more. The
three main characteristics of big data, often referred to as the three Vs,
are volume, velocity, and variety:
• Volume: Big data involves large amounts of data, often ranging from
terabytes to petabytes or even exabytes.
• Velocity: Data streams in at high speeds and needs to be processed
quickly to derive insights or take actions in real-time or near real-time.
• Variety: Data comes in various formats and types, including structured
data (like databases), semi-structured data (like XML files), and
unstructured data (like text, images, and videos).
Components of Big Data Processing

• Storage Systems: Big data storage solutions such as the Hadoop Distributed File System (HDFS), Amazon S3, and Google Cloud Storage are used to store massive amounts of data across distributed systems.

• Processing Frameworks: Frameworks like Apache Hadoop, Apache Spark, and Apache Flink provide distributed computing capabilities for processing large datasets across clusters of computers.

• Data Processing Languages: Programming languages such as Java, Python, Scala, and SQL are commonly used for big data processing tasks. Each language has its strengths depending on the task at hand.
• Data Integration Tools: Tools like Apache NiFi, Apache Kafka, and
Apache Flume are used for ingesting data from various sources
into the processing pipeline.

• Data Analysis and Machine Learning: Techniques such as data mining, machine learning, and predictive analytics are applied to extract valuable insights from big data.

• Distributed Computing: Big data processing often involves distributing processing tasks across multiple nodes in a cluster to parallelize computation and improve performance.

• Fault Tolerance and Scalability: Big data systems need to be fault-tolerant and scalable to handle hardware failures and accommodate growing data volumes.
Introduction to Hadoop

What is Hadoop?

Hadoop is an open-source framework designed for
distributed storage and processing of large datasets across
clusters of commodity hardware. At its core, it comprises
two main components: Hadoop Distributed File System
(HDFS) for storing data across multiple machines with fault
tolerance, and MapReduce for parallel processing of data.
It provides a scalable and cost-effective solution for
organizations to handle Big Data, allowing them to store,
process, and analyze vast amounts of data to derive
valuable insights and make data-driven decisions.
Components of the Hadoop Ecosystem

• Hadoop Distributed File System (HDFS): HDFS is the primary storage system used by Hadoop. It is designed to store large files across multiple machines in a reliable and fault-tolerant manner. HDFS divides large files into smaller blocks and distributes them across a cluster of commodity hardware.

• Hadoop MapReduce: MapReduce is a programming model and processing engine for parallel processing of large datasets across a distributed cluster. It consists of two main phases: the Map phase, where data is processed in parallel across multiple nodes, and the Reduce phase, where the results from the Map phase are aggregated.

• YARN (Yet Another Resource Negotiator): YARN is the resource management layer of Hadoop. It is responsible for managing and allocating resources (CPU, memory, etc.) to the various applications running on the cluster.
• Hadoop Common: Hadoop Common contains libraries and
utilities that support other Hadoop modules. It includes common
utilities, configuration files, and libraries used by various
components in the Hadoop ecosystem.

• Hadoop HBase: HBase is a distributed, scalable, and column-oriented NoSQL database built on top of Hadoop. It provides real-time read/write access to large datasets and is suitable for applications requiring random, real-time read/write access to Big Data.

• Hadoop Hive: Hive is a data warehouse infrastructure built on top of Hadoop that provides a SQL-like interface (HiveQL) for querying and managing large datasets stored in HDFS. Hive translates SQL queries into MapReduce or Tez jobs, allowing users familiar with SQL to analyze Big Data without needing to write low-level MapReduce code.
The Hadoop Distributed File System (HDFS) architecture
comprises two main components: the NameNode and
DataNodes. The NameNode, serving as the master node,
manages metadata such as the file system structure and block
locations, while DataNodes, acting as slave nodes, store data
blocks and communicate with the NameNode for read/write
operations. HDFS stores large files by dividing them into
blocks and replicating those blocks across multiple DataNodes
for fault tolerance. This design enables parallel processing of
data across the cluster and ensures scalable, reliable, and
efficient storage and retrieval of data in distributed Hadoop
environments.
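
To make the block and replication mechanics above concrete, here is a minimal Python sketch that estimates how a file is laid out in HDFS; the 128 MB block size and replication factor of 3 are common configurable defaults assumed for illustration, not values taken from the slides.

    import math

    # Assumed, configurable defaults used only for illustration.
    BLOCK_SIZE_MB = 128
    REPLICATION_FACTOR = 3

    def hdfs_layout(file_size_mb: float) -> dict:
        """Estimate how many blocks a file occupies and the raw storage it consumes."""
        num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
        raw_storage_mb = file_size_mb * REPLICATION_FACTOR  # every block is stored on 3 DataNodes
        return {"blocks": num_blocks, "raw_storage_mb": raw_storage_mb}

    # A 1 GB file splits into 8 blocks and consumes roughly 3 GB of raw cluster storage.
    print(hdfs_layout(1024))  # {'blocks': 8, 'raw_storage_mb': 3072}
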
Hadoop MapReduce is a parallel processing framework designed to
efficiently process vast amounts of data across distributed clusters. It
operates through two main components: the JobTracker and
TaskTrackers. The JobTracker coordinates job execution, managing
tasks such as task scheduling, monitoring, and task failure recovery.
TaskTrackers, deployed on individual cluster nodes, execute specific
map and reduce tasks assigned by the JobTracker. MapReduce
employs a map phase for data transformation and a reduce phase for
aggregation, enabling data processing in parallel across nodes. Its
fault-tolerant design, facilitated by data replication and task re-
execution, ensures resilience in the face of node failures. Hadoop
MapReduce facilitates scalable and distributed data processing,
making it a fundamental component of the Hadoop ecosystem for large-scale batch processing.
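
As a concrete illustration of the map and reduce phases described above, the following Python sketch runs the classic word count locally; on a real cluster, Hadoop would execute many mapper and reducer tasks in parallel and handle the shuffle and sort between them. The sample input is made up.

    from itertools import groupby

    def mapper(lines):
        # Map phase: emit a (word, 1) pair for every word in the input split.
        for line in lines:
            for word in line.strip().split():
                yield word.lower(), 1

    def reducer(pairs):
        # Reduce phase: sum the counts for each word once the pairs are grouped by key.
        for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
            yield word, sum(count for _, count in group)

    sample = ["big data needs big clusters", "hadoop and spark process big data"]
    for word, count in reducer(mapper(sample)):
        print(word, count)
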
Apache Spark: An Overview

What is Apache Spark?
Apache Spark is an open-source, distributed computing framework
that provides an in-memory processing engine for fast and
efficient data processing. It offers a versatile set of libraries and
APIs for various tasks including batch processing, real-time
streaming, machine learning, and graph processing. Spark's
resilient distributed dataset (RDD) abstraction allows for fault-
tolerant parallel processing of data across distributed clusters. Its
unified architecture combines multiple processing workloads,
enabling seamless integration and faster data analysis compared
to traditional disk-based processing frameworks like Hadoop
MapReduce. Spark's ease of use, scalability, and rich ecosystem
make it a popular choice for big data processing and analytics.
Advantages of Apache Spark over Hadoop MapReduce
• In-Memory Processing: Spark performs in-memory processing, reducing
the need for disk I/O and improving processing speed significantly
compared to MapReduce, which relies heavily on disk storage for
intermediate data.

• Unified Computing Engine: Spark provides a unified platform for various data processing tasks including batch processing, interactive queries, real-time streaming, machine learning, and graph processing, whereas MapReduce is primarily suited for batch processing.

• Ease of Use: Spark offers a more user-friendly and expressive API compared to the low-level programming model of MapReduce, allowing developers to write complex data processing pipelines more efficiently and with fewer lines of code.
• Fault Tolerance with Resilient Distributed Datasets (RDDs): Spark's RDD
abstraction provides built-in fault tolerance by tracking the lineage of
transformations applied to the data, allowing lost data to be recomputed
from the original source. This makes Spark more resilient to failures
compared to MapReduce.

• Efficient Caching and Data Reuse: Spark allows datasets to be cached in memory across multiple operations, enabling iterative and interactive processing with reduced latency by avoiding repetitive data reads from disk, which is not efficiently supported in MapReduce (see the caching sketch after this list).

• Optimized Execution Plan with Directed Acyclic Graph (DAG) Scheduler: Spark optimizes the execution of data processing tasks using a DAG scheduler, which generates an optimal execution plan based on the dependencies between tasks, resulting in better performance compared to the static execution plan of MapReduce.
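
The PySpark sketch below illustrates the caching advantage referenced in the list above: a filtered dataset is cached once and reused by two actions instead of being re-read from disk. The input path is a placeholder, and the example is a sketch rather than a benchmark.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("caching-demo").getOrCreate()

    # Hypothetical input location; replace with a real dataset path.
    logs = spark.read.text("hdfs:///data/access_logs")

    errors = logs.filter(logs.value.contains("ERROR")).cache()  # kept in memory after first use

    # Both actions reuse the cached data instead of re-reading it from disk,
    # which is where Spark gains over MapReduce's disk-based intermediate results.
    print("error lines:", errors.count())
    errors.filter(errors.value.contains("timeout")).show(5)

    spark.stop()
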
Spark Architecture

Spark Core
Spark Core serves as the foundational engine of the Apache
Spark framework, providing the distributed computing
infrastructure for processing large-scale data sets. At its core,
Spark Core introduces the concept of Resilient Distributed
Datasets (RDDs), immutable distributed collections of data
objects that allow for fault-tolerant parallel processing across a
cluster. With its in-memory processing capabilities, Spark Core
significantly enhances processing speed by minimizing disk I/O
overhead. Additionally, it offers a rich set of transformations and
actions, fault tolerance through lineage information, and
distributed task execution, making it a versatile and efficient
engine for a wide range of data processing tasks in Spark.
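
A minimal PySpark sketch of the RDD model described above: transformations are recorded lazily in the lineage, and an action triggers distributed execution. The data is made up for illustration.

    from pyspark import SparkContext

    sc = SparkContext(appName="rdd-demo")

    numbers = sc.parallelize(range(1, 11))        # an RDD partitioned across the cluster
    squares = numbers.map(lambda x: x * x)        # transformation: added to the lineage, not run yet
    evens = squares.filter(lambda x: x % 2 == 0)  # another lazy transformation

    # The action below forces evaluation; a lost partition is recomputed from the
    # lineage (parallelize -> map -> filter) rather than restored from replicas.
    print(evens.collect())  # [4, 16, 36, 64, 100]

    sc.stop()
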
Spark SQL
Spark SQL is a component of Apache Spark designed to
facilitate seamless interaction with structured data using
SQL queries, DataFrame API, and SQL-compatible functions.
It extends Spark's capabilities to include SQL queries and
manipulation of structured data, enabling integration with
existing SQL-based tools and expertise. With Spark SQL,
users can perform complex data analysis, join multiple data
sources, and execute SQL queries directly against data
stored in various formats such as Parquet, JSON, CSV, and
Hive tables, thus bridging the gap between traditional
relational databases and big data processing frameworks.
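
To show the interplay of the DataFrame API and SQL described above, here is a short PySpark sketch; the JSON path and the country/amount fields are assumptions made for the example.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

    # Hypothetical JSON input with 'country' and 'amount' fields.
    sales = spark.read.json("hdfs:///data/sales.json")
    sales.createOrReplaceTempView("sales")  # expose the DataFrame to SQL

    top = spark.sql("""
        SELECT country, SUM(amount) AS total
        FROM sales
        GROUP BY country
        ORDER BY total DESC
        LIMIT 5
    """)
    top.show()

    spark.stop()
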
Spark Streaming
Spark Streaming is an extension of the Apache Spark core that
enables scalable, fault-tolerant processing of live data streams. It
provides high-level abstractions like DStream (Discretized
Stream), allowing developers to process continuous data streams
in real-time using the same programming model as batch
processing. Spark Streaming ingests data from various sources
such as Kafka, Flume, and TCP sockets, and divides it into micro-
batches for parallel processing, offering low-latency stream
processing with fault tolerance and exactly-once semantics. With
its seamless integration with Spark's ecosystem, Spark Streaming
empowers developers to build robust and scalable stream
processing applications for a wide range of use cases.
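
A minimal DStream sketch of the micro-batch model described above, counting words from a TCP socket; the host and port are placeholders, and a Kafka or Flume receiver would be wired in similarly. Newer applications often use Structured Streaming instead, but the classic API matches the DStream abstraction in the text.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="streaming-demo")
    ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

    lines = ssc.socketTextStream("localhost", 9999)  # placeholder source

    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()  # print a sample of word counts for each micro-batch

    ssc.start()
    ssc.awaitTermination()
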
Spark MLlib
Spark MLlib, part of the Apache Spark ecosystem, is a scalable
machine learning library designed for distributed data processing.
It provides a rich set of algorithms and utilities for common
machine learning tasks such as classification, regression,
clustering, collaborative filtering, and dimensionality reduction.
Leveraging Spark's distributed computing capabilities, MLlib
enables efficient processing of large-scale datasets, parallel
model training, and distributed model inference. With its user-
friendly APIs and seamless integration with other Spark
components, MLlib empowers data scientists and developers to
build and deploy scalable machine learning pipelines for real-world applications, accelerating the development and deployment of machine learning solutions.
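
As an illustration of MLlib's high-level API, the sketch below fits a logistic regression on a tiny in-memory dataset; the feature values and column names are invented for the example.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

    # Tiny, made-up training set: two numeric features and a binary label.
    train = spark.createDataFrame(
        [(0.0, 1.1, 0.0), (1.5, 0.3, 1.0), (2.0, 0.1, 1.0), (0.2, 1.4, 0.0)],
        ["f1", "f2", "label"],
    )

    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    model = LogisticRegression(maxIter=10).fit(assembler.transform(train))

    # On a cluster, training and prediction are distributed across the worker nodes.
    model.transform(assembler.transform(train)).select("label", "prediction").show()

    spark.stop()
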
Spark GraphX

Spark GraphX is a component of the Apache Spark
ecosystem designed for scalable graph processing and
analytics. It provides an API for constructing and
manipulating graphs, along with a set of distributed graph
algorithms for tasks such as graph traversal, pattern
matching, and graph analytics. GraphX leverages Spark's
distributed computing framework to efficiently process large-
scale graphs in parallel, making it suitable for analyzing
social networks, recommendation systems, and other graph-
structured data. With its seamless integration with other
Spark components, GraphX enables developers to build and
execute complex graph algorithms on distributed datasets.
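
GraphX itself exposes Scala and Java APIs; to stay consistent with the Python sketches above, the example below uses the separate GraphFrames package (assumed to be installed, and not part of GraphX) to run PageRank on a tiny, made-up graph. It illustrates the same vertex/edge model and distributed graph algorithms.

    from pyspark.sql import SparkSession
    from graphframes import GraphFrame  # third-party package, not part of core Spark

    spark = SparkSession.builder.appName("graph-demo").getOrCreate()

    # Tiny, made-up graph: vertices need an 'id' column, edges need 'src' and 'dst'.
    vertices = spark.createDataFrame([("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
    edges = spark.createDataFrame([("a", "b"), ("b", "c"), ("c", "a")], ["src", "dst"])

    graph = GraphFrame(vertices, edges)
    ranks = graph.pageRank(resetProbability=0.15, maxIter=10)
    ranks.vertices.select("id", "pagerank").show()

    spark.stop()
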
Performance Comparison

• Processing Speed: Spark generally outperforms Hadoop
MapReduce due to its in-memory processing capability,
which reduces the need for disk I/O and improves
processing speed significantly, especially for iterative and
interactive workloads.

• Resource Utilization: Spark's ability to cache data in memory and reuse it across multiple operations leads to better resource utilization compared to Hadoop MapReduce, which relies heavily on disk storage for intermediate data.
• Fault Tolerance: Hadoop relies on data replication and task
re-execution, while Spark employs RDD lineage and
distributed execution for faster recovery from failures.

• Scalability: Both systems scale horizontally, but Spark's unified engine and efficient processing make it more scalable, especially for real-time streaming and interactive analysis.

• Ease of Use: Spark's user-friendly API enables developers to write complex data processing pipelines more efficiently and with less code compared to the lower-level programming model of Hadoop MapReduce.
Hadoop vs. Spark: Examples of Big Data Analytics Platforms for Batch and Streaming Computing
• Amazon Elastic MapReduce (EMR): Offers Hadoop clusters
distributed across multiple Amazon EC2 instances. It
supports processing frameworks like Apache Spark and
HBase and integrates with various data stores within Amazon
Web Services (AWS).

• Google Cloud Dataproc: Provides Apache Spark services on Hadoop clusters for batch processing, querying, streaming, and machine learning, enabling scalable and efficient data processing tasks in the cloud environment.

• Microsoft Azure HDInsight: Offers an HDP Hadoop distribution in the cloud, providing scalable and reliable Hadoop clusters for processing large-scale data workloads, along with integration with other Azure services for enhanced analytics.
