

BIG DATA ANALYTICS

Lecture 3
Tools for Big Data Analysis
BIG DATA PROCESSES
TYPES OF DATA PROCESSING TECHNIQUES
TOOLS FOR COLLECTING, PREPROCESSING AND ANALYZING BIG DATA
 A data warehouse is used to store and retrieve large datasets.
 Data is stored in a data warehouse using either a dimensional approach or a
normalized approach.
 In the dimensional approach, data are divided into fact tables and dimension
tables that support the fact tables.
 In the normalized approach, data are divided into entities, creating several
tables in a relational database.
 A Relational Database Management System (RDBMS) is the traditional
method of managing structured data.
 An RDBMS uses a relational database and schema for the storage and retrieval
of data.
 Structured Query Language (SQL) is the most commonly used database
query language.
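As a small, hedged illustration of the dimensional approach and of SQL querying in an RDBMS, the following Python sketch uses the standard-library sqlite3 module to build a hypothetical fact table and dimension table and join them; all table and column names are invented for the example, not taken from the lecture.

```python
import sqlite3

# In-memory relational database used only for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: descriptive attributes that support the fact table.
cur.execute("""
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        name TEXT,
        category TEXT
    )
""")

# Fact table: numeric measurements keyed by dimension identifiers.
cur.execute("""
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        quantity INTEGER,
        revenue REAL
    )
""")

cur.execute("INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics')")
cur.execute("INSERT INTO fact_sales VALUES (1, 1, 3, 2400.0)")

# A typical analytical query: join the fact table to its dimension table.
cur.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.category
""")
print(cur.fetchall())   # [('Electronics', 2400.0)]
conn.close()
```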
NOSQL (NOT ONLY SQL)
 Due to the Atomicity, Consistency, Isolation and Durability (ACID) constraints,
traditional relational databases are difficult to scale to very large volumes of data.
 These properties are fundamental principles in database management
systems (DBMS), ensuring data integrity and reliability.
 Atomicity:
 Atomicity refers to the property that a transaction or operation is
treated as a single, indivisible unit.
 If any part of the operation fails, the entire operation is rolled back,
ensuring that the data remains in a consistent state.
 In the realm of big data, achieving atomicity can be challenging due to
the distributed nature of data and the potential for partial failures.

 Consistency:
 Consistency ensures that data transitions from one valid state to
another, maintaining data integrity and adhering to defined business
rules.
 In big data systems, maintaining consistency across distributed data
sources is complex.
 Isolation:
 Isolation guarantees that multiple concurrent transactions can occur
without interfering with each other.
 In the context of big data, achieving strong isolation can be challenging
due to the distributed nature of data processing.

 Durability:
 Durability ensures that once a transaction is committed, its changes
are permanent and will survive any subsequent system failures.
 In big data, durability remains a critical aspect, and systems need to
replicate data across multiple nodes to achieve fault tolerance and
durability.
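As a minimal single-node illustration of atomicity and durability (using Python's built-in sqlite3 module, which sidesteps the distributed-systems challenges noted above), the sketch below wraps two updates in one transaction and rolls both back when a simulated failure occurs; the accounts table and the transfer limit are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL NOT NULL)")
conn.execute("INSERT INTO accounts VALUES (1, 100.0), (2, 50.0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
            # Simulated failure after a partial update (illustrative rule only).
            if amount > 100:
                raise ValueError("transfer limit exceeded")
    except ValueError:
        pass  # the partial updates above were rolled back

transfer(conn, 1, 2, 30)    # succeeds and is committed
transfer(conn, 1, 2, 500)   # simulated failure -> rolled back as a whole
print(conn.execute("SELECT * FROM accounts ORDER BY id").fetchall())
# [(1, 70.0), (2, 80.0)] -- only the committed transfer is visible
```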

RDBMS is incapable of handling semi-structured and unstructured data.
NOSQL
 NoSQL stores and manages unstructured data.
 These databases are also known as “schema-free” databases, since they
allow the structure of the data to be changed quickly without table rewrites.
 NoSQL supports document stores, key-value stores, BigTable-style column
stores and graph databases.
 It uses a looser consistency model than traditional databases.
 Data management and data storage functions are separate in NoSQL
databases.
 This separation allows the data to be scaled out easily.
 A few examples of NoSQL databases are HBase, MongoDB, and Dynamo.
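To make the “schema-free” idea concrete, here is a tiny in-memory sketch of a document-style store in plain Python; it is not modelled on any particular NoSQL product, and the keys and documents are invented for illustration.

```python
# A toy, in-memory document store: each document is looked up by an id
# and may have a completely different structure from its neighbours.
store = {}

def put(doc_id, document):
    store[doc_id] = document

def get(doc_id):
    return store.get(doc_id)

# Documents under the same key space need not share a schema.
put("user:1", {"name": "Alice", "email": "alice@example.com"})
put("user:2", {"name": "Bob", "followers": 120, "tags": ["ml", "iot"]})

# Adding a new field later requires no table rewrite or migration.
doc = get("user:2")
doc["verified"] = True
put("user:2", doc)

print(get("user:1"))
print(get("user:2"))
```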
HADOOP
 Hadoop is an open-source framework and ecosystem for processing,
storing, and analyzing large and complex datasets across clusters
of commodity hardware.
 It was conceived and implemented on the basis of the Google File System
and the MapReduce programming paradigm.
 Hadoop enables organizations to harness the power of distributed
computing to handle massive amounts of data efficiently.
 The core components of the Hadoop ecosystem are the Hadoop
Distributed File System (HDFS) and the MapReduce programming
model.
HDFS
 The Hadoop Distributed File System (HDFS) is a distributed file storage
system designed to store and manage very large datasets across
clusters of machines.
Key features:
 Distributed Storage: HDFS distributes data across multiple machines
in a cluster, allowing for efficient storage and retrieval of large datasets.
This distribution also provides fault tolerance, as data is replicated
across nodes to ensure that data remains available even if a node fails.

 Blocks:
 Data in HDFS is divided into fixed-size blocks (typically 128 MB or 256
MB).
 These blocks are distributed across the cluster and can be stored on
different machines.
 This block structure enables parallel processing and efficient data
management.
 Replication:
 HDFS replicates data blocks across multiple nodes in the cluster.
 The default replication factor is usually set to three, which means that
each data block is stored on three different nodes.
 This replication provides fault tolerance: if a node fails, the data can
still be accessed from other replicas (see the block-and-replica sketch after
this feature list).
 Master-Slave Architecture:
 HDFS follows a master-slave architecture.
 The master node is called the NameNode, which stores metadata
about the file system hierarchy and the location of data blocks.
 DataNodes are the slave nodes that store the actual data blocks.
 Data Integrity:
 HDFS ensures data integrity through checksums.
 Each data block has a checksum associated with it, and the client
verifies the checksum when reading data.
 If a checksum mismatch occurs, the system knows that data
corruption has occurred.

 High Throughput:
 HDFS is optimized for high throughput data access rather than low-
latency access.
 It is well-suited for batch processing workloads, such as those commonly
found in big data analytics.

 Data Locality:
 HDFS tries to place computation close to the data.
 This means that when processing data, tasks are scheduled on nodes
where the data resides, reducing the need for network transfers.
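To make the Blocks and Replication features above concrete, here is a minimal Python sketch (not the real HDFS implementation) that splits a file into fixed-size blocks and assigns each block to a configurable number of distinct DataNodes. The node names, the round-robin placement policy, and the 300 MB example file are invented for illustration; the 128 MB block size and replication factor of three mirror the defaults mentioned above. Real HDFS uses rack-aware placement, which this sketch does not attempt.

```python
BLOCK_SIZE = 128 * 1024 * 1024    # 128 MB, a typical HDFS block size
REPLICATION_FACTOR = 3            # default: each block stored on 3 nodes

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_id, length) pairs covering the whole file."""
    blocks = []
    offset, block_id = 0, 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((block_id, length))
        offset += length
        block_id += 1
    return blocks

def place_replicas(blocks, datanodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct DataNodes, round-robin."""
    placement = {}
    for block_id, _ in blocks:
        chosen = [(block_id + i) % len(datanodes) for i in range(replication)]
        placement[block_id] = [datanodes[i] for i in chosen]
    return placement

datanodes = ["dn1", "dn2", "dn3", "dn4"]                  # hypothetical DataNodes
blocks = split_into_blocks(file_size=300 * 1024 * 1024)   # a 300 MB file
print(blocks)                     # 3 blocks: 128 MB, 128 MB, 44 MB
print(place_replicas(blocks, datanodes))
# block 0 -> ['dn1', 'dn2', 'dn3'], block 1 -> ['dn2', 'dn3', 'dn4'], ...
```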
MAPREDUCE
 MapReduce is a programming model and processing framework for
processing and generating large datasets in parallel.
 It divides tasks into smaller subtasks and distributes them across
nodes in the cluster.
 MapReduce is used for batch processing and was fundamental to
early big data analytics.
The two distinct phases of MapReduce are:
1) Map Phase:
 In the Map phase, the workload is divided into smaller sub-workloads.
 Each sub-workload is assigned to a Mapper, which processes its unit block
of data to produce a sorted list of (key, value) pairs.
 This list, the output of the mapper, is passed to the next phase; the
transfer and grouping of mapper output by key is known as shuffling.

2) Reduce Phase:
In the Reduce phase, the shuffled input is analyzed and merged to produce
the final output, which is written to HDFS in the cluster.
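As a rough, single-machine illustration of these two phases and the shuffle between them (plain Python, not Hadoop itself), the sketch below counts words with a map step, a shuffle step, and a reduce step; the sample documents are invented for the example.

```python
from collections import defaultdict

def map_phase(documents):
    """Mapper: emit a (key, value) pair for every word occurrence."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group values by key and sort the keys, as Hadoop does
    before handing data to the reducers."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(grouped):
    """Reducer: merge the values for each key into the final output."""
    for key, values in grouped:
        yield (key, sum(values))

documents = ["big data needs big tools", "data tools process data"]
result = dict(reduce_phase(shuffle(map_phase(documents))))
print(result)
# {'big': 2, 'data': 3, 'needs': 1, 'process': 1, 'tools': 2}
```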
LIMITATIONS OF HADOOP
 Multiple Copies of Data: The inefficiency of HDFS storage leads to the
creation of multiple copies of the data (a minimum of three).
 Limited SQL support: Hadoop offers only limited SQL support and lacks
basic functions such as sub-queries and “group by” analytics.
 Inefficient execution: The lack of a query optimizer leads to inefficient
cost-based execution plans, resulting in a larger cluster than a comparable
database would need.
 Challenging Framework: Complex transformational logic cannot easily be
expressed in the MapReduce framework.
 Lack of Skills: Knowledge of algorithms and skills for distributed
MapReduce development are required for proper implementation.
