hadoop.pptx

The document provides an overview of Hadoop and its ecosystem, including key components such as Flume, Sqoop, Hive, HBase, Pig, Mahout, Oozie, Zookeeper, and YARN, which facilitate data ingestion, storage, processing, and management. It also discusses NoSQL databases, sharding techniques, and the importance of data visualization and exploratory data analysis (EDA) in understanding and interpreting data. Additionally, it highlights Amazon S3 as a cloud storage solution for data management.

Uploaded by ashmakhan8855

Unit-5

Hadoop
NoSQL: non-SQL or non-relational
databases
• NoSQL databases are non-tabular and store data differently from relational tables.
• NoSQL databases come in a variety of types based on their data model; the main
types are document, key-value, wide-column, and graph.
• They provide flexible schemas and scale easily with large amounts of data
and high user loads.
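As an illustrative sketch (not tied to any particular NoSQL product, and with made-up record names), the same user record can be modeled in the key-value style or the document style:

```python
# Key-value model: an opaque value looked up by a single key.
# The store cannot query inside the value.
kv_store = {
    "user:42": '{"name": "Asha", "city": "Pune"}',
}

# Document model: the value is a structured, schema-flexible document
# whose individual fields are addressable.
doc_store = {
    "user:42": {"name": "Asha", "city": "Pune", "tags": ["admin"]},
}

print(doc_store["user:42"]["city"])  # → Pune
```

The flexible schema shows up in practice as documents in the same store carrying different fields, with no table-wide schema change required.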
Sharding

• Sharding is a type of database partitioning that separates large databases into
smaller, faster, more easily managed parts.
• "Shard" means "a small part of a whole"; sharding means dividing a larger
whole into smaller parts.
• In a DBMS, sharding partitions a large database into smaller datasets spread
across different nodes.
• These shards are not only smaller, but also faster and hence more easily
manageable.
Sharding architectures and types:
1. ranged/dynamic sharding
2. algorithmic/hashed sharding
3. entity/relationship-based sharding
4. geography-based sharding.
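Of the types above, algorithmic/hashed sharding is the easiest to sketch: the shard holding a record is computed from a hash of its key, so no routing table is needed. A minimal in-memory sketch (shard count and key names are made up for the example):

```python
# Minimal sketch of algorithmic/hashed sharding: each shard is a dict,
# and a stable hash of the key picks which shard a record lives on.
import hashlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_for(key: str) -> int:
    # md5 gives a stable digest, so the same key always maps
    # to the same shard on every node.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

put("user:1", "Alice")
put("user:2", "Bob")
assert get("user:1") == "Alice"  # reads are routed to the same shard as writes
```

The trade-off versus ranged sharding is that hashed keys spread load evenly but make range scans (e.g. "all users from user:1 to user:100") expensive, since the range is scattered across all shards.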
What is Hadoop?

• Hadoop is an open-source framework from the Apache Software
Foundation, used to store, process, and analyze data that is very
large in volume.
• Hadoop is written in Java.
Hadoop Ecosystem
Key Components of the Hadoop Ecosystem
1. Data Ingestion and Transfer:
• Flume: Seamless Streaming Data Collection
• Sqoop: Your Data Import/Export Wizard
2. Data Storage and Querying:
• Hive: Your Gateway to Structured Big Data
• HBase: Real-Time NoSQL Database for Quick Access
3. Data Processing and Analysis:
• Pig: Simplifying the Data Processing Journey
4. Machine Learning and Analytics:
• Mahout: Unleashing Machine Learning on Big Data
5. Workflow Coordination and Management:
• Oozie: Orchestrating Workflows with Ease
6. Coordination and Consistency:
• Zookeeper: Keeping Distributed Systems in Sync
7. Resource Management:
• YARN: Efficient Resource Management for Hadoop
• Flume: Seamless Streaming Data Collection
Flume is your data highway, ensuring a smooth flow of streaming data into
the Hadoop ecosystem. It acts as the bridge connecting various data sources
to Hadoop, working hand-in-hand with HDFS for efficient collection and
transfer of streaming data.

• Sqoop: Your Data Import/Export Wizard
Sqoop is like a data import/export superhero, helping you seamlessly move
data between Hadoop and relational databases. Tightly integrated with
Hadoop, Sqoop allows effortless transfer of data to and from HDFS,
connecting Hadoop’s distributed processing power with traditional
relational databases.
• Hive: Your Gateway to Structured Big Data
Hive is like a translator for big data, allowing you to speak SQL and get
meaningful insights from massive datasets. It simplifies data analysis by
converting SQL-like queries into operations that Hadoop can understand,
utilizing Hadoop Distributed File System (HDFS) for efficient storage and
retrieval.

• HBase: Real-Time NoSQL Database for Quick Access
HBase is your go-to solution for real-time access to large datasets without
compromising on scalability. Integrated with Hadoop, HBase complements
HDFS by providing fast and random read/write access to your data, making
it suitable for low-latency operations.
• Pig: Simplifying the Data Processing Journey
Pig is your scripting buddy, making data processing on Hadoop a breeze without
the need for complex programming. Pig scripts abstract the intricacies of
MapReduce programming, running on Hadoop to process large datasets stored in
HDFS, enabling you to focus on the logic of your data transformations.

• Mahout: Unleashing Machine Learning on Big Data
Mahout is your ticket to the world of machine learning on big data, helping you
make sense of vast datasets for predictive analytics and recommendations. Mahout
seamlessly integrates with Hadoop, utilizing its parallel processing capabilities to
efficiently execute machine learning algorithms on distributed datasets.
• Oozie: Orchestrating Workflows with Ease
Oozie is your workflow conductor, ensuring that Hadoop jobs dance in
harmony according to a well-defined sequence. Oozie acts as the manager for
workflows, coordinating the execution of various tasks in Hadoop, providing
a structured way to manage and schedule complex data processing workflows.

• Zookeeper: Keeping Distributed Systems in Sync
Zookeeper is your guardian of coordination, ensuring that distributed systems
within Hadoop remain in harmony. It plays a crucial role in maintaining
coordination and consensus among different components in the Hadoop
ecosystem, ensuring processes are synchronized, and data consistency is
maintained.
• YARN: Efficient Resource Management for Hadoop
YARN is like a traffic manager for Hadoop, efficiently allocating resources
to different applications running on the cluster. YARN enhances the
performance of Hadoop by managing resources dynamically, allowing
various processing engines, including MapReduce, to share resources
effectively and optimize overall cluster performance.
Hadoop HDFS

• Data is stored in a distributed manner in HDFS. HDFS has two types of
components: the NameNode and DataNodes. While there is only one NameNode,
there can be multiple DataNodes.
Hadoop MapReduce
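MapReduce processes data in two phases: a map phase that emits key-value pairs, and a reduce phase that aggregates all values sharing a key. A conceptual sketch of word counting in plain Python (this illustrates the model, not the Hadoop Java API):

```python
# Word count in the MapReduce style: map emits (word, 1) pairs,
# shuffle groups pairs by key, reduce sums each group.
from collections import defaultdict

def map_phase(line):
    # Emit one (word, 1) pair per word in the input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each key's values into a final count.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big ideas", "big data"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # → {'big': 3, 'data': 2, 'ideas': 1}
```

In real Hadoop the map tasks run in parallel on the DataNodes holding each block of the file, and the shuffle moves data across the network between map and reduce tasks.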
Hadoop YARN

• Hadoop YARN stands for Yet Another Resource Negotiator. It is the
resource management unit of Hadoop, introduced as a component of
Hadoop version 2.
• YARN acts like an operating system for Hadoop: it is a resource-management
layer that sits on top of HDFS.
• It is responsible for managing cluster resources to make sure you don't
overload one machine.
• It performs job scheduling to make sure that jobs are scheduled in the
right place.
Hive:
The Hadoop ecosystem component Apache Hive is an open-source data warehouse
system for querying and analyzing large datasets stored in Hadoop files. Hive
performs three main functions: data summarization, query, and analysis.
Hive uses a language called HiveQL (HQL), which is similar to SQL. HiveQL
automatically translates SQL-like queries into MapReduce jobs that execute
on Hadoop.
Apache Pig
Apache HBase
S3 (Simple Storage Service)
• It is a cloud storage service provided by Amazon Web Services (AWS)
for storing and retrieving data in the cloud.
• It's an object storage service, meaning data is stored as objects
within containers called "buckets".
• S3 offers high durability, availability, security, and scalability,
making it suitable for use cases such as storing data for internet
applications, backups, disaster recovery, and data archiving.
How it Works:

1. Create a Bucket: Users first create a bucket in S3, which acts as a
container for their data.
2. Upload Objects: Data is uploaded as objects (files and metadata)
to the specified bucket.
3. Access Data: Data can be accessed from anywhere with internet
connectivity using the S3 API or the AWS Console.
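A hedged sketch of this workflow using boto3, the AWS SDK for Python. The bucket and key names are made-up examples, and real use requires AWS credentials; the client is passed in as a parameter so the helpers can be exercised without a live AWS account.

```python
# Helpers for the upload/read steps of the S3 workflow.
# `s3` is expected to behave like a boto3 S3 client.

def upload_object(s3, bucket, key, data: bytes):
    # Step 2: store data as an object under `key` in the bucket.
    s3.put_object(Bucket=bucket, Key=key, Body=data)

def read_object(s3, bucket, key) -> bytes:
    # Step 3: fetch the object back via the S3 API.
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()

# In real code (assumes default AWS credentials are configured):
#   import boto3
#   s3 = boto3.client("s3")
#   s3.create_bucket(Bucket="my-example-bucket")            # Step 1
#   upload_object(s3, "my-example-bucket", "notes.txt", b"hello")
#   print(read_object(s3, "my-example-bucket", "notes.txt"))
```

Because S3 is an object store rather than a file system, each `put_object` replaces the whole object; there is no in-place partial update.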
Data visualization and visual data
analysis techniques
• Data visualization and visual data analysis techniques translate data into visual
formats like charts, graphs, and maps, making it easier to understand and interpret.

• Interactive techniques allow users to manipulate and explore the visualizations, while
visual analytics tools help identify patterns and insights.

• These methods facilitate communication, decision-making, and data exploration.


Visual Data Analysis Techniques
Interacting Techniques
Visual Analytics
EDA – Exploratory Data Analysis
• EDA is a method used by data scientists to analyze datasets and
summarize their main characteristics, often using visualizations.
• It is a crucial first step in any data analysis project, helping to
understand the data structure, identify patterns and anomalies,
and formulate questions for further analysis.
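A minimal EDA sketch using only the Python standard library (in practice one would typically reach for pandas); the dataset here is a made-up sample containing one suspicious value:

```python
# Summarize a small sample and flag anomalies, as a first EDA pass.
import statistics

values = [12, 15, 14, 10, 110, 13, 14]  # note the suspicious 110

summary = {
    "count": len(values),
    "mean": round(statistics.mean(values), 2),
    "median": statistics.median(values),
    "min": min(values),
    "max": max(values),
}
print(summary)

# Robust anomaly check: flag points far from the median, measured in
# units of the median absolute deviation (MAD).
median = statistics.median(values)
mad = statistics.median([abs(v - median) for v in values])
outliers = [v for v in values if abs(v - median) > 5 * mad]
print(outliers)  # → [110]
```

The mean (pulled up by the outlier) versus the median already hints at the anomaly, which is exactly the kind of pattern a histogram or box plot would surface visually.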
Key Aspects of EDA
Why is EDA important?
Common EDA Techniques
Analysis for Unstructured data
Data visualization before analysis
• Data visualization before analysis is crucial for gaining an
intuitive understanding of the data and identifying patterns,
trends, and outliers that might be missed when analyzing the
data in its raw form.
• It helps in formulating a better analysis strategy by providing
insights into the data's structure and characteristics, enabling
more effective data handling and model selection.
