Ms. Bijayalaxmi Mishra (Assistant Professor, CSE)
WHAT IS DATA
• Data is a collection of facts, statistics, or information, typically gathered
through observation, measurement, or research, which can be
processed and analyzed to gain insights or make decisions.
• Data can take various forms, including numbers, text, images, videos,
or any other type of input that can be stored and interpreted by a
system or human.
KEY CHARACTERISTICS OF DATA
• ACCURACY: data must be correct and precise to be reliable
• COMPLETENESS: all necessary data should be present for analysis
• TIMELINESS: data should be up to date and available when needed
• CONSISTENCY: data should remain uniform across different sources
or systems
• RELEVANCE: data must be relevant to the context and purpose
• VALIDITY: data should conform to the defined formats and rules
• GRANULARITY: refers to the level of detail or depth of the data
TYPES OF DATA
• Structured Data: Organized in a predefined manner, usually in rows and
columns (e.g., databases, spreadsheets). This type of data is easily
searchable and analyzable using SQL-based systems.
• EXAMPLE: Transaction records, sensor data, customer information.
• Unstructured Data: lacks a specific structure or format, making it more
challenging to process and analyze. It often includes human generated
content and multimedia.
• EXAMPLE: text files, emails, images, videos, social media posts
• Semi-structured Data: Contains elements of both structured and
unstructured data. While it does not conform to a strict format, it has some
organization through tags or markers.
• EXAMPLE: XML files, JSON files, log files
• Meta Data: data that describes other data. It provides information
about the content, format, or context of a dataset.
• Example: File size, creation date, author, GPS location in images.
• Binary Data: Data stored in binary formats, typically used by machines
or software; it includes everything from system files to multimedia data.
• EXAMPLE: Executable files, audio/video files, and system files.
• Real-Time Data: data that is generated and processed instantaneously,
often in live applications like financial markets, IoT devices, or social media
feeds.
• Big Data: extremely large and complex data sets, often a mix of
structured, unstructured, and semi-structured data. Big data requires
specialized tools for storage, management, and analysis.
• EXAMPLE: Clickstream data, transaction logs, large-scale social media analytics
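To make these categories concrete, here is a minimal Python sketch (the records, post, and file details are invented for illustration) contrasting structured rows, a semi-structured JSON document, and metadata that describes a file.

```python
import json
from datetime import datetime

# Structured data: fixed fields per record, like rows and columns in a database table
transactions = [
    {"id": 1, "customer": "Asha", "amount": 250.0},
    {"id": 2, "customer": "Ravi", "amount": 99.5},
]

# Semi-structured data: JSON with tags/keys but no rigid, uniform schema
social_post = json.loads('{"user": "asha", "text": "Great product!", "tags": ["review"]}')

# Metadata: data that describes other data (here, a file)
file_metadata = {
    "filename": "report.pdf",
    "size_bytes": 204800,
    "created": datetime(2024, 1, 15).isoformat(),
    "author": "CSE Dept",
}

print(transactions[0]["amount"])  # structured: easy to query by column
print(social_post["tags"])        # semi-structured: navigate by key/tag
print(file_metadata["created"])   # metadata: describes the file, not its content
```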
What is BIG DATA
• Big data refers to extremely large data sets that are complex and difficult to process
using traditional data management tools due to their size, speed and variety.
• It involves a massive amount of data generated continuously from various sources, such
as social media, IoT devices, sensors, and business transactions.
• The data is generated at high speed requiring rapid processing for real time or near real
time analysis and decision making.
• Big data is challenging to manage due to diverse types of data and the need to ensure
data integrity, consistency, and quality.
• Big data comes in multiple formats: structured (databases), semi-structured (XML, JSON),
and unstructured (text, images, videos).
• Big data is used for advanced analytics, including predictive modelling, machine learning,
and AI, to uncover patterns, trends, and correlations.
• It is crucial across industries like healthcare, finance, marketing, logistics, and
government, where insights from data can optimize operations, improve services, and
drive innovation.
The 5 Vs of BIG DATA
VOLUME • VELOCITY • VARIETY • VERACITY • VALUE
• Volume: Big data refers to the vast amounts of data that are generated every day from various sources
such as social media, sensors, IoT devices and more. The sheer volume of this data is massive, and it
continues to grow exponentially.
• Velocity: Big data is generated at a high speed, and it continues to flow in at an incredible pace. This
stream of data is constant and fast-paced, making it challenging to process and analyze in real time.
• Variety: Big data comes in all shapes and sizes, and it encompasses a wide range of data types, including
structured, semi-structured and unstructured data. This variety makes it difficult to manage and analyze
using traditional data processing tools.
• Veracity: Big data is often noisy, and its accuracy and quality can be questionable. Ensuring the veracity
of big data is crucial to extract meaningful insights and make informed decisions
• Value: Big data has the potential to create significant value for organizations, but only if it is properly
analyzed and interpreted. The value of big data lies in its ability to provide insights that can drive
business decisions, improve operations and create new opportunities.
WHY DO WE NEED TO ANALYZE BIG DATA
• Insights and decision-making: Big data analysis provides valuable insights that can inform business decisions,
improve operations, and drive innovation. By analyzing large datasets, organizations can identify patterns,
trends, and correlations that may not be apparent through traditional data analysis methods.
• Competitive Advantage: in today's data-driven economy, organizations that can effectively analyze and
leverage big data have a significant competitive advantage over those that do not. By gaining insights from
big data, companies can stay ahead of the competition, identify new opportunities, and improve their
market position.
• Cost Savings: big data analysis can help organizations reduce costs by identifying areas of inefficiency,
optimizing processes, and improving resource allocation. For example, analyzing sensor data from
manufacturing equipment can help companies predict maintenance needs, reducing downtime.
• Risk Management: Big data analysis can help organizations identify and mitigate risks by
analyzing large datasets for patterns and anomalies. For example, analyzing financial transaction data can
help companies detect fraudulent activity and prevent losses.
• Innovation and R&D: big data analysis can drive innovation and R&D by providing insights that can lead to
new products, services and business models. By analyzing large datasets, organizations can identify
opportunities for innovation and create new revenue streams
• Operational Efficiency: Big data analysis can help organizations optimize their operations
by identifying areas of inefficiency, improving supply chain management, and streamlining
processes. For example, analyzing logistics data can help companies optimize their delivery
routes and reduce transportation costs.
• Healthcare and Public Health: Big data analysis can improve healthcare outcomes by
analyzing large datasets to identify patterns and trends in patient data. This can lead to
better disease diagnosis, treatment, and prevention.
• Environmental Sustainability: Big data analysis can help organizations reduce their
environmental impact by analyzing data on energy consumption, waste management, and
resource usage. This can lead to more sustainable practices and reduced carbon
emissions.
CONCEPTS OF BIG DATA
• Data Ingestion: The process of collecting and transporting data from various sources to a
centralized location for analysis.
• Data Wrangling: The process of cleaning, transforming and preparing data for analysis.
• Data Visualization: The process of creating graphical representation of data to
communicate insights and trends
• Machine Learning: A subset of artificial intelligence that involves training algorithms to
learn from data and make predictions or decisions
• Predictive Analytics: The use of statistical models and machine learning techniques to
forecast future outcomes based on historical data.
• Data Mining: The process of automatically discovering patterns and relationships in large
datasets.
• Text Analytics: The process of extracting insights and meaning from unstructured text data.
• Sentiment Analysis: The process of determining the emotional tone or sentiment behind
text data.
• Network Analysis: The study of relationships and connections between people,
organizations, and devices.
• Clustering: A technique used to group similar data points or customers based on their
characteristics.
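As a small illustration of the clustering concept above, the sketch below groups a handful of invented customer records by annual spend and visit count using scikit-learn's KMeans; the library choice and the numbers are assumptions made only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Each row is one customer: [annual spend, visits per year] (toy values)
customers = np.array([
    [200, 2], [220, 3], [250, 2],      # low spend, few visits
    [900, 15], [950, 18], [1000, 20],  # high spend, frequent visits
])

model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(customers)
print(model.labels_)           # cluster assignment for each customer
print(model.cluster_centers_)  # centroid (typical profile) of each group
```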
Methodology Of Big Data Analysis
1. Data Collection
• Identify the sources of data, such as social media, sensors, IoT devices, and more
• Determine the type of data to collect, such as structured, semi-structured or unstructured data
• Collect the data using various tools and techniques, such as APIs, web scraping, and data ingestion
tools.
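A minimal sketch of the collection step in Python using a REST API; the endpoint URL and query parameters are hypothetical.

```python
import requests

# Pull one batch of records from a (hypothetical) REST API
response = requests.get(
    "https://api.example.com/v1/posts",
    params={"limit": 100},
    timeout=30,
)
response.raise_for_status()   # fail fast on HTTP errors
records = response.json()     # parse the JSON payload into Python objects
print(f"Collected {len(records)} records")
```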
2. Data Preprocessing
• Clean and preprocess the data to remove noise, handle missing values, and transform the data
into a suitable format
• Perform data quality checks to ensure accuracy, completeness, and consistency
• Transform the data into a format suitable for analysis, for example by aggregating values or converting data
types.
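A small preprocessing sketch with pandas (one possible tool choice; the sample values are invented): it drops rows missing a key field, converts data types, and fills remaining missing values.

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["Asha", "Ravi", None, "Meera"],
    "amount": ["250", "99.5", "310", None],
})

df = df.dropna(subset=["customer"])                           # remove rows missing a key field
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")   # convert data types
df["amount"] = df["amount"].fillna(df["amount"].median())     # handle missing values
print(df)
```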
3. Data Storage and Management
• Store the preprocessed data in a scalable and efficient data storage system, such as Hadoop or NoSQL
databases
• Manage the data using data governance policies, data security measures and data access controls
4. Exploratory Data Analysis
• Use statistical and visual methods to understand the characteristics of the data, such as distributions,
correlations, and outliers
• Identify patterns, trends, and relationships in the data using techniques such as clustering, decision
trees and regression analysis
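This step can be sketched in a few lines of pandas: summary statistics for distributions, a correlation matrix, and a simple z-score check for outliers (the sample values are invented).

```python
import pandas as pd

df = pd.DataFrame({"sales": [10, 12, 11, 13, 120], "visits": [100, 110, 105, 115, 400]})

print(df.describe())   # distribution summary: mean, std, quartiles
print(df.corr())       # pairwise correlations between columns

z = (df - df.mean()) / df.std()        # standardize each column
print(df[(z.abs() > 2).any(axis=1)])   # rows that look like outliers
```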
5. Modelling and Machine learning
• Develop predictive models using machine learning algorithms, such as supervised, unsupervised, and
reinforcement learning
• Train and test the models using various techniques, such as cross-validation and hyperparameter tuning
• Evaluate the performance of the models using metrics such as accuracy, precision, and recall
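A compact scikit-learn sketch of this step on a synthetic dataset (an assumption for illustration): train/test split, cross-validation, and evaluation with accuracy, precision, and recall.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary classification data stands in for a real prepared dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
print(cross_val_score(model, X_train, y_train, cv=5))  # 5-fold cross-validation scores

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred),
      precision_score(y_test, y_pred),
      recall_score(y_test, y_pred))
```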
6. Insights Generation and Visualization
• Use the models to generate insights and predictions from the data
• Visualize the insights using various tools and techniques, such as dashboards, reports, and data visualization
software
• Communicate the insights to stakeholders using clear, concise language
7. Deployment and Monitoring
• Deploy the models and insights into production systems, such as recommendation engines or predictive analytics
platforms
• Monitor the performance of the models and insights over time, using techniques such as model drift detection
and data quality monitoring
• Refine and update the models and insights based on new data and feedback from stakeholders
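Model drift detection can be illustrated with a simple sketch: compare the distribution of a feature seen at training time with what the production system currently receives. The data and threshold below are invented, and real monitoring pipelines are considerably more elaborate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
training_amounts = rng.normal(loc=100, scale=15, size=1000)  # stand-in for training-time data
live_amounts = rng.normal(loc=130, scale=15, size=1000)      # stand-in for current production data

# Two-sample Kolmogorov-Smirnov test: has the feature's distribution shifted?
result = stats.ks_2samp(training_amounts, live_amounts)
if result.pvalue < 0.01:
    print("Possible data drift detected: consider retraining the model")
```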
BIG DATA ANALYTICS (UNIT 1)
HISTORY OF HADOOP
• 2002: Doug Cutting and Mike Cafarella begin work on Apache Nutch, a web search engine
project.
• 2003: Google publishes the paper on the Google File System (GFS), influencing the design
of Hadoop.
• 2004: Google releases the MapReduce paper, which further inspires Hadoop's
architecture.
• 2005: Hadoop is created as a sub-project of Apache Nutch, with its initial focus on
distributed storage and processing.
• 2006: Hadoop becomes an Apache project, and the first version is released.
• 2008: Hadoop is promoted to a top-level Apache project, signaling its maturity and
stability.
• 2010s: Rapid adoption in the industry, leading to the development of a rich ecosystem
(e.g., Hive, Pig, HBase).
• Present: Continues to evolve with contributions from various organizations and remains a
cornerstone of big data processing.
APACHE HADOOP
Apache Hadoop is an open-source software framework used for storing and processing large
amounts of data in a distributed computing environment. It is designed to handle big data
and is based on the MapReduce programming model, which allows for the parallel
processing of large datasets.
Key Features of Apache Hadoop
•Scalability: Apache Hadoop is highly scalable and can handle large
amounts of data.
•Flexibility: Apache Hadoop can handle a variety of data formats,
including structured, semi-structured, and unstructured data.
•Cost-effectiveness: Apache Hadoop is a cost-effective solution for
storing and processing large amounts of data.
•Fault-tolerance: Apache Hadoop is designed to be fault-tolerant,
which means that it can continue to operate even if one or more
nodes in the cluster fail.
CORE HADOOP COMPONENTS
1. Hadoop Common
• Definition: the foundational library containing utilities and Java libraries used by other Hadoop
modules.
• Purpose: provides shared utilities, file system abstraction, and OS-level integration.
2. Hadoop Distributed File System (HDFS):
• Definition: A scalable, fault-tolerant distributed file system designed for large-scale data storage.
• Stores data across multiple nodes
• Ensures reliability by replicating data blocks (Default is 3 copies).
• Works with large files in a write-once-read-many pattern.
3. Hadoop MapReduce
• Definition: a programming model and processing engine for large-scale data processing in parallel.
• Processes data as key-value pairs
• Uses a Map function to filter and categorize data, and a Reduce function to aggregate and summarize
results
4. Yet Another Resource Negotiator (YARN)
• Definition: A resource management framework for job scheduling and cluster resource management.
• Manages computational resources across the cluster
• Allocates resources to various applications dynamically.
Hadoop Versions
Hadoop has evolved through three major versions—Hadoop 1.x, 2.x, and 3.x. Each version
introduced new features and addressed the limitations of its predecessors.
1. Hadoop 1.x
• The first stable version of Hadoop, designed for batch processing using the MapReduce framework.
• Release Timeline: Early 2006 to 2012.
Key Features:
• MapReduce 1 (MR1):
• Used a monolithic architecture with JobTracker and TaskTracker for resource management and job
scheduling.
• HDFS (Hadoop Distributed File System):
• Supported distributed storage with replication for fault tolerance.
• Scalability: Limited to clusters with around 4,000 nodes.
Limitations:
• Single point of failure in JobTracker.
• Inflexible—supported only MapReduce for data processing.
• Resource utilization inefficiencies.
2. Hadoop 2.x
• A major update that introduced YARN and significantly improved Hadoop’s scalability and flexibility.
• Release Timeline: 2013 to 2017.
Key Features:
• YARN (Yet Another Resource Negotiator):
• Decoupled resource management and job scheduling from MapReduce.
• Allowed multiple frameworks (e.g., Apache Spark, Tez) to run on Hadoop.
• HDFS Federation:
• Introduced support for multiple NameNodes to improve scalability.
• Backward Compatibility:
• Existing MapReduce applications could run with minimal modifications.
• High Availability (HA):
• Added support for active-passive NameNode configurations to eliminate the single point of failure.
• Scalability:
• Supported clusters with 10,000+ nodes.
• Support for Non-MapReduce Frameworks:
• Enabled the execution of real-time processing, graph processing, and streaming applications.
3. Hadoop 3.x
• A modernized version of Hadoop with enhanced efficiency, cost-effectiveness, and support for emerging
technologies.
• Release Timeline: December 2017 to the present.
Key Features:
• Erasure Coding:
• Reduced storage overhead compared to traditional replication (uses ~50% less storage).
• Docker Container Support:
• Introduced native support for running tasks in containers, enabling better isolation and resource management.
• Enhanced Scalability:
• Support for more than two NameNodes in high-availability setups.
• Improved YARN:
• Added scheduling enhancements and support for GPUs and machine learning workloads.
• Default Java Version:
• Transitioned to Java 8 and later versions.
• Storage Optimization:
• Block IDs now support larger clusters with smaller storage overhead.
• Improved Performance:
• Introduced tools like Timeline Service v.2 to optimize monitoring and troubleshooting.
• Support for HDFS Namenode Federation:
• Further enhanced scalability and fault tolerance.
HADOOP COMMON
• Hadoop Common is a collection of shared libraries and utilities that support the core functionality of the Hadoop
Framework.
• It provides the essential building blocks required by other Hadoop modules (HDFS, YARN, and MapReduce) to function
correctly.
• These components ensure seamless integration and communication within the Hadoop ecosystem.
KEY FEATURES
• CORE LIBRARIES: provides Java libraries essential for starting and running Hadoop services like HDFS, MapReduce, and
YARN.
• FILE SYSTEM ABSTRACTION: supports both local and distributed file systems (e.g., HDFS) with tools to interact with data
storage efficiently.
• SERIALIZATION AND DATA STRUCTURES: offers tools for serializing and deserializing data to facilitate data exchange
between Hadoop modules.
• CONFIGURATION MANAGEMENT: provides a flexible configuration API to handle cluster settings, environment
variables, and runtime properties.
• ERROR HANDLING AND FAULT TOLERANCE: implements mechanisms to handle node failures and retry logic to ensure
robust distributed processing.
• SECURITY FRAMEWORK: includes integration with Kerberos for authentication and tools for secure data transfer.
• UTILITIES AND COMMANDS: offers essential utilities, including file manipulation (e.g., copying/moving files in HDFS)
and debugging tools for cluster management.
HADOOP DISTRIBUTED FILE SYSTEM (HDFS):
• The Hadoop Distributed File System (HDFS) is a scalable, fault-tolerant, and high-performance distributed file system
designed to handle large datasets across multiple machines.
• It is the primary storage system of Hadoop, enabling data storage and retrieval in a distributed computing environment.
• HDFS splits large files into blocks and stores them across a cluster of nodes, ensuring both redundancy and parallel
processing capabilities.
KEY FEATURES
• Data blocks are replicated across multiple nodes (default replication factor is 3), ensuring high availability even in case of node
failures.
• HDFS is designed to scale out by adding more nodes to the cluster, handling petabytes of data without compromising
performance.
• Optimized for delivering high throughput on large datasets, making it suitable for batch processing tasks like MapReduce
• Files are divided into fixed-size blocks (typically 128 MB or 256 MB), with each block stored across multiple nodes in the
cluster.
• HDFS follows a write-once-read-many model, where files are generally written once and then read many times,
optimizing for large file storage.
• HDFS follows a master-slave architecture, with the NameNode (master) managing metadata and DataNodes (slaves) storing
the actual data blocks.
• HDFS is accessible via command-line tools and APIs, and integrates seamlessly with other Hadoop ecosystem components (like
MapReduce, Hive and Spark).
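A minimal sketch of working with HDFS from Python by invoking the standard `hdfs dfs` command-line tool; the directory and file names are hypothetical.

```python
import subprocess

def hdfs(*args):
    """Run an `hdfs dfs` subcommand and raise if it fails."""
    subprocess.run(["hdfs", "dfs", *args], check=True)

hdfs("-mkdir", "-p", "/user/student/sales")             # create a directory in HDFS
hdfs("-put", "sales_2024.csv", "/user/student/sales/")  # upload a local file; HDFS splits it into blocks and replicates them
hdfs("-ls", "/user/student/sales")                      # list the files in the directory
hdfs("-cat", "/user/student/sales/sales_2024.csv")      # stream the file contents back
```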
HADOOP MAPREDUCE
• Hadoop MapReduce is a distributed processing framework for handling large-scale data in parallel across a
Hadoop cluster. It follows a programming model where data is processed in two stages:
• MAP: processes input data into intermediate key-value pairs
• REDUCE: aggregates and summarizes these key-value pairs to produce the final output.
• MapReduce breaks tasks into smaller chunks and distributes them across cluster nodes, enabling efficient
computation over vast datasets while ensuring scalability and fault tolerance.
KEY FEATURES
• Splits tasks into smaller units and distributes them across multiple nodes, ensuring parallel execution for
faster processing.
• It can process terabytes or petabytes of data efficiently by adding more nodes to the cluster
• It automatically detects and retries failed tasks on other nodes, ensuring reliable processing
• It works with key-value pairs, making it flexible for processing both structured and unstructured data.
• Managed by a JobTracker (master) and TaskTrackers (slaves) in Hadoop 1.x (YARN takes over this role in later
versions), coordinating task assignment and monitoring
• It supports multiple programming languages such as Java, Python, and Ruby through APIs, offering flexibility for
developers.
• It works seamlessly with HDFS and other Hadoop ecosystem tools like Hive and Pig to store and
analyze large datasets.
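The classic word count illustrates the two stages. The sketch below runs locally in plain Python to show the key-value flow; on a cluster the same map and reduce logic would be submitted to Hadoop (for example via Hadoop Streaming), which handles distribution, sorting of intermediate keys, and fault tolerance.

```python
from itertools import groupby

def map_phase(lines):
    # MAP: emit intermediate (word, 1) key-value pairs
    for line in lines:
        for word in line.strip().lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    # REDUCE: aggregate counts per key; Hadoop guarantees pairs arrive grouped by key
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

lines = ["big data needs big tools", "hadoop processes big data"]
print(dict(reduce_phase(map_phase(lines))))
# {'big': 3, 'data': 2, 'hadoop': 1, 'needs': 1, 'processes': 1, 'tools': 1}
```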
HADOOP YARN
• YARN (Yet Another Resource Negotiator) is a resource management and job scheduling framework within the Hadoop
ecosystem. It decouples resource management from data processing, allowing multiple data processing engines (e.g.,
MapReduce, Spark) to run on a single Hadoop cluster.
• YARN dynamically allocates cluster resources (CPU, memory) to applications, ensuring efficient utilization and
scalability.
• YARN replaces the older JOBTRACKER and TASKTRACKER system, providing enhanced flexibility and support for non-
MapReduce workloads.
KEY FEATURES
• YARN efficiently manages cluster resources, allocating them dynamically based on application needs.
• It schedules and monitors applications through a ResourceManager and NodeManager architecture, optimizing
job execution.
• Supports running multiple types of workloads (MapReduce, Spark, Hive, etc.) simultaneously on the same cluster
• Designed to scale out to thousands of nodes, enabling large-scale data processing and storage.
• Detects node failures and reassigns tasks to other healthy nodes to ensure job completion
• Allows fine-grained resource allocation by breaking cluster resources into containers, improving overall cluster
efficiency.
• It supports a wide variety of processing engines, including batch (MapReduce), stream processing (Spark, Flink), and
machine learning.
APACHE HADOOP ECOSYSTEM
• The Apache Hadoop Ecosystem refers to a suite of open-source tools and technologies that work
together to enable distributed storage, processing, and management of large-scale data across clusters
of commodity hardware.
• The Hadoop ecosystem is a collection of open-source tools, frameworks, and technologies built around the
core Hadoop components (HDFS, YARN, and MapReduce).
• It provides an integrated environment for storing, processing, and analyzing large-scale datasets
efficiently in a distributed manner.
• The ecosystem addresses diverse big data needs, including data storage, real-time data processing,
querying, machine learning, and workflow management.
Key Ecosystem Components
• Apache Hive: Data warehouse software for querying and analyzing large datasets stored in HDFS using
HiveQL.
• Apache Pig: A platform for processing and analyzing large data sets using Pig Latin, a high-level
language.
• Apache HBase: A NoSQL database running on HDFS to support real-time read/write access to data.
• Apache Sqoop: Facilitates data transfer between HDFS and relational databases.
• Apache Flume: A service designed for efficiently collecting, aggregating, and transferring large amounts
of log data into HDFS.
• Apache Spark: A fast, in-memory data processing engine designed for real-time data processing and
iterative workloads.
Apache Hive
1. Apache Hive is an open-source data warehouse framework built on top of Hadoop, designed to manage, query, and
analyze large datasets stored in the Hadoop Distributed File System (HDFS). It uses a SQL-like query language called
HiveQL, making it accessible to users familiar with relational database systems.
2. Hive simplifies big data processing by abstracting the complexities of writing MapReduce jobs, enabling efficient
querying and analysis.
3. HiveQL allows users to query data using familiar SQL syntax while also supporting custom MapReduce for advanced
operations.
4. It integrates seamlessly with other Hadoop ecosystem tools like HDFS, MapReduce, and HBase.
5. Optimized for handling large-scale, read-intensive, and batch-oriented workloads rather than real-time queries.
6. Built to process petabyte-scale datasets, leveraging Hadoop's distributed computing capabilities.
7. Supports various data formats such as CSV, JSON, ORC, and Parquet, and applies schema at query time (schema on
read).
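A sketch of querying Hive from Python, assuming the PyHive client library and a HiveServer2 endpoint on localhost; the table and column names are hypothetical.

```python
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="student")
cursor = conn.cursor()

# HiveQL looks like SQL, but Hive compiles it into distributed jobs over data in HDFS
cursor.execute("""
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region
    ORDER BY total_sales DESC
""")

for region, total in cursor.fetchall():
    print(region, total)
```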
APACHE PIG
1. Apache Pig is a high-level data processing tool built on top of Hadoop for analyzing large datasets. It uses a
scripting language called Pig Latin, which simplifies the writing of data transformation and analysis tasks
without requiring Java programming.
2. Designed to handle large-scale, complex data transformations and analysis efficiently. It abstracts the
complexity of writing MapReduce jobs.
3. Pig Latin is a simple, easy-to-learn dataflow language, with operators for filtering, sorting, grouping, and joining data.
4. Works with both structured and unstructured data, supporting formats like JSON, XML, and text files.
5. Pig integrates with Hadoop ecosystem tools like HDFS and Hive, and supports custom UDFs written in Java,
Python, or other languages.
6. Suitable for processing large data volumes in a batch mode rather than real-time.
APACHE HBASE
1. Apache HBase is a distributed, non-relational, column-oriented database built on top of Hadoop. It is
designed for real-time read and write access to large datasets, enabling fast operations on structured and
semi-structured data.
2. HBase provides random, real-time access to data stored in HDFS. Unlike traditional relational databases, it
supports sparse, wide tables with billions of rows and millions of columns.
3. Supports horizontal scaling, making it ideal for applications requiring high throughput and low latency.
4. Works seamlessly with Hadoop's ecosystem, including MapReduce for batch processing and tools like Hive
and Pig for querying.
5. Commonly used for handling time-series data, storing metadata, and supporting online transaction
processing (OLTP) systems.
6. HBase is particularly effective for real-time analytics and is optimized for applications where massive
datasets must be quickly queried or updated.
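A sketch of real-time reads and writes against HBase, assuming the happybase Python client and an HBase Thrift server on localhost; the table name, row keys, and column family are hypothetical.

```python
import happybase

connection = happybase.Connection("localhost")
table = connection.table("sensor_readings")

# Write: the row key encodes sensor id + timestamp so related readings sort together
table.put(b"sensor42-20240115T1200",
          {b"data:temperature": b"21.5", b"data:humidity": b"40"})

# Read a single row back with low latency
row = table.row(b"sensor42-20240115T1200")
print(row[b"data:temperature"])
```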
Apache Sqoop
1. Apache Sqoop is a data transfer tool designed to efficiently transfer large amounts of data between
relational databases (like MySQL, PostgreSQL, Oracle) and the Hadoop ecosystem (HDFS, Hive, HBase). The
name "Sqoop" is derived from "SQL-to-Hadoop."
2. Sqoop simplifies the process of importing data from structured databases into Hadoop for analysis and
exporting processed data back to databases.
3. Leverages database-specific connectors for fast and secure data movement.
4. Supports HDFS, Hive, and HBase, allowing seamless integration into big data workflows.
5. Uses MapReduce for parallel processing, enabling the transfer of large datasets efficiently.
6. Offers options for selective data import/export using SQL queries, specifying delimiters, or defining table
partitions.
7. Provides a simple command-line interface, making it accessible to users familiar with SQL.
8. Sqoop is ideal for organizations that need to integrate structured data with big data platforms for analytics
or ETL processes.
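A sketch of launching a Sqoop import from Python; the JDBC URL, credentials, table, and target directory are hypothetical placeholders.

```python
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/shop",  # source relational database
    "--username", "student", "--password", "secret",
    "--table", "customers",                        # table to import into Hadoop
    "--target-dir", "/user/student/customers",     # destination directory in HDFS
    "--num-mappers", "4",                          # parallel map tasks used for the transfer
], check=True)
```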
Apache Flume
1. Apache Flume is a reliable, distributed, and available service designed to collect, aggregate, and move
large amounts of log data or event data from multiple sources to a centralized data store like HDFS or
HBase for analysis.
2. Built for efficiently handling streaming data, especially log data, from distributed systems.
3. Works on a simple, flexible, and scalable model comprising sources, channels, and sinks, forming a data
flow pipeline.
4. Sources collect data from various systems such as log files, network streams, or applications.
5. Channels act as a buffer between sources and sinks, ensuring reliable data delivery even during failures.
6. Sinks deliver the aggregated data to its destination, such as HDFS, HBase, or Kafka.
7. Ensures fault tolerance with a transactional approach during data transfer.
8. Easily scales horizontally by adding more agents to handle increasing data volumes
9. Allows custom components like custom sources and sinks for specific use cases.
Apache Spark
1. Apache Spark is an open-source unified analytics engine designed for large-scale data processing. It
provides fast, in-memory data computation and supports a wide range of workloads, including batch
processing, real-time streaming, machine learning, and graph processing.
2. Spark processes data up to 100x faster than traditional MapReduce by leveraging in-memory computation
and optimized query execution.
3. It supports APIs in Java, Python, Scala, R, and SQL, making it accessible for developers and data scientists.
4. Handles diverse workloads, including real-time streaming (Spark Streaming), machine learning (MLlib),
SQL-based querying (Spark SQL), and graph analytics (GraphX).
5. Works seamlessly with Hadoop ecosystems, accessing data from HDFS, Apache Hive, Apache HBase,
Cassandra, and more.
6. Built for distributed computing, Spark scales efficiently from a single machine to thousands of nodes.
7. Ensures reliability by recovering lost data and computations automatically in case of failures using Directed
Acyclic Graph (DAG) execution.
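A PySpark sketch of Spark's DataFrame API reading a file from HDFS and aggregating it in memory; the input path and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SalesByRegion").getOrCreate()

# Read a CSV file stored in HDFS into a distributed DataFrame
df = spark.read.csv("hdfs:///user/student/sales/sales_2024.csv",
                    header=True, inferSchema=True)

summary = (
    df.groupBy("region")
      .agg(F.sum("amount").alias("total_sales"))
      .orderBy(F.desc("total_sales"))
)
summary.show()  # lazy transformations execute here, distributed across the cluster
spark.stop()
```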