
Big Data and Hadoop Framework

Agenda

• Big Data Challenges
• Hadoop Framework
• HDFS (Hadoop Distributed File System)
• MapReduce Programming Concept
• Demo: Hadoop and MapReduce programming concepts

Big Data Challenges

A new paradigm has emerged: data-intensive scientific discovery (DISD), also known as the Big Data problem. Its core challenges are:
– Data capture
– Data storage
– Data analysis
– Data visualization

Big Data in five domains


● Health care
● Public sector administration
● Retail transaction data
● Global manufacturing
● Personal location data

James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, and Angela Hung Byers, "Big Data: The Next Frontier for Innovation, Competition, and Productivity," McKinsey Global Institute, 2011.

Goals of Big Data Analytics

Big Data Analytics: Open Source Solutions

A Brief History

Hadoop

Existing tools were not designed to handle such large amounts of data.

"The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing."
– http://hadoop.apache.org

● Processes Big Data on clusters of commodity hardware
● Vibrant open-source community
● Many products and tools reside on top of Hadoop

History of Hadoop
Started as a sub-project of Apache Nutch
● Nutch's job is to index the web and expose it for searching
● An open-source alternative to Google
● Started by Doug Cutting

– In 2004, Google published its Google File System (GFS) and MapReduce framework papers
– In 2005, Doug Cutting and the Nutch team implemented Google's frameworks in Nutch
– In 2006, Yahoo! hired Doug Cutting to work on Hadoop with a dedicated team
– In 2008, Hadoop became an Apache Top Level Project
– http://hadoop.apache.org

Naming Conventions

Doug Cutting drew inspiration from his family:
● Lucene: his wife's middle name
● Nutch: a word for "meal" that his son used as a toddler
● Hadoop: a yellow stuffed elephant named by his son

Hadoop Current Status

April 1, 2016 marked the 10-year anniversary of the first Hadoop code release.

The vast majority of Fortune 500 companies are using Hadoop in production.

The global Hadoop market was valued at $1.5 billion in 2012 and is expected to grow at a CAGR of 58.2% from 2013 to 2020, reaching $50.2 billion by 2020.

Apache Spark is a successor system that is more powerful and flexible than Hadoop MapReduce.

Spark is intended to enhance, not replace, the Hadoop stack.

A recent trend is processing big data in cloud computing environments.

Real-time analysis (Business Intelligence)

http://searchdatamanagement.techtarget.com/news/4500272078/Co-creator-Cutting-assesses-Hadoop-future-present-and-past
https://www.alliedmarketresearch.com/hadoop-market
https://www.dezyre.com/article/hadoop-mapreduce-vs-apache-spark-who-wins-the-battle/83
https://databricks.com/blog/2014/01/21/spark-and-hadoop.html

Source: https://opensource.com/life/14/8/intro-apache-hadoop-big-data
Hadoop: Assumptions
• Hardware will fail.
• Processing will be run in batches. Thus there is an emphasis
on high throughput as opposed to low latency.
• Applications that run on HDFS have large data sets. A typical
file in HDFS is gigabytes to terabytes in size.
• It should provide high aggregate data bandwidth and scale to
hundreds of nodes in a single cluster. It should support tens
of millions of files in a single instance.
• Applications need a write-once-read-many access model.
• Moving Computation is Cheaper than Moving Data.
• Portability is important.
Hadoop Distribution

• MapR Distribution
• Hortonworks Data Platform (HDP)
• Apache BigTop Distribution
• Greenplum HD Data Computing Appliance
• Cloudera Distribution for Hadoop (CDH)
– http://www.cloudera.com/hadoop
– 100% open source; integrates most of the popular Hadoop products: HDFS, MapReduce, HBase, Hive, Mahout, Oozie, Pig, Sqoop, Whirr, ZooKeeper, Flume
Hadoop Cluster


● A set of "cheap" commodity hardware
● Networked together
● Residing in the same location
  – A set of servers in a set of racks in a data center

Typical Hadoop Cluster

The Apache Hadoop framework - Modules


● Hadoop Common: contains the libraries and utilities needed by other Hadoop modules

● Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster

● Hadoop YARN: a resource-management platform responsible for managing compute resources in clusters and using them to schedule users' applications

● Hadoop MapReduce: a programming model for large-scale data processing (see the sketch below)
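To make the MapReduce programming model concrete, here is a minimal sketch of the classic WordCount job written against the org.apache.hadoop.mapreduce API. It mirrors the standard Apache tutorial example; input and output paths come from the command line, and on a real cluster the class would be packaged into a jar and submitted with 'hadoop jar'.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // reuse the reducer as a combiner
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The map phase emits a (word, 1) pair per token, the framework shuffles and groups pairs by key, and the reduce phase sums the counts per word; reusing the reducer as a combiner pre-aggregates counts on the map side to cut shuffle traffic.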
HDFS Architecture
NameNode
– Records attributes like permissions, modification and access times, namespace and disk space quotas
– Keeps the entire namespace image in RAM
– The file content is split into large blocks (typically 128 MB, but user-selectable file-by-file)
– Each block of the file is independently replicated at multiple DataNodes (typically three, but user-selectable file-by-file); see the sketch below
– Maintains the namespace tree and the mapping of blocks to DataNodes
– Persists the namespace as an image, a journal, and checkpoints
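As a small illustration of the "user-selectable file-by-file" point above, the HDFS Java client exposes a create() overload that takes a per-file replication factor and block size; the path and the values in this sketch are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithLayout {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/NIM/example.txt");  // hypothetical path
    short replication = 2;               // instead of the cluster default of 3
    long blockSize = 64L * 1024 * 1024;  // 64 MB instead of the default 128 MB
    int bufferSize = 4096;

    // create(path, overwrite, bufferSize, replication, blockSize)
    try (FSDataOutputStream out =
             fs.create(file, true, bufferSize, replication, blockSize)) {
      out.writeBytes("block size and replication are chosen per file\n");
    }
  }
}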

HDFS Architecture (Continue…)

DataNodes
– Each block replica on a DataNode is represented by two files in the local native filesystem.
– The first file contains the data itself; the second records the block's metadata, including checksums for the data and the generation stamp.

CheckpointNode

BackupNode

HDFS (Continue...)

HDFS is a block-structured file system (default block size: 128 MB)

HDFS Daemons

Hadoop high-level architecture

HDFS Architecture (Continue...)

HDFS Write

HDFS Read

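The read path illustrated on this slide can also be exercised from the client API; a minimal sketch, reusing the /user/NIM/hadoop/purchases.txt path from the shell exercises later in the deck:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/NIM/hadoop/purchases.txt");

    // The client asks the NameNode for block locations, then streams
    // each block directly from a DataNode holding a replica.
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}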
start-all.sh


jaiprakash@jaiprakash-HP-ProBook-4440s:/$ start-all.sh
jaiprakash@jaiprakash-HP-ProBook-4440s:/$ jps
3538 SecondaryNameNode
3381 DataNode
7240 Jps
3786 TaskTracker
3629 JobTracker
3231 NameNode

Hadoop Web Interfaces

Hadoop comes with several web interfaces, which are by default available at these locations:

http://localhost:9870/dfshealth.html#tab-overview
– web UI of the NameNode daemon

http://localhost:8088/cluster
– web UI of the YARN ResourceManager

These web interfaces provide concise information about what's happening in the Hadoop cluster.

Hadoop 2.x Features over 1.x

● Allows working not only with MapReduce but also with other distributed computing models like Spark, Hama, Giraph, Message Passing Interface (MPI) & HBase coprocessors.

● YARN (Yet Another Resource Negotiator) does cluster resource management, and processing is done using different processing models.

● Better scalability: scales up to 10,000 nodes per cluster.

● Works on the concept of containers; containers can run generic tasks.

● Multiple NameNode servers manage multiple namespaces.

● Overcomes the NameNode single point of failure (SPOF) with a standby NameNode; in case of NameNode failure, it can be configured for automatic recovery.

● The MR API requires additional files for a program written for Hadoop 1.x to execute on Hadoop 2.x.

● Can serve as a platform for a wide variety of data analytics: it is possible to run event processing, streaming and real-time operations.

● The components of the Hadoop stack – Hive, Pig, HBase, etc. – are all equipped to handle NameNode failure.

● Added support for Microsoft Windows.
Hadoop and Spark Cluster Configuration – Cluster of SMP Nodes – ISRO Respond Project
Shell Commands

Interact with the file system by executing shell-like commands.

Usage: $ hdfs dfs -<command> -<option>
– Example: $ hdfs dfs -ls /

Most commands behave like UNIX commands
– ls, cat, du, etc.

Supports HDFS-specific operations
– e.g., changing replication

List supported commands
– $ hdfs dfs -help

Display detailed help for a command
– $ hdfs dfs -help <command_name>
Shell Commands (Continue...)
1. Print the Hadoop version
• hadoop version

2. List the contents of the root directory in HDFS
• hadoop fs -ls /

3. Report the amount of space used and available on the currently mounted filesystem
• hadoop fs -df hdfs:/

4. Count the number of directories, files and bytes under the paths that match the specified file pattern
• hadoop fs -count hdfs:/

5. Run a DFS filesystem checking utility
• hadoop fsck /

6. Run a cluster balancing utility
• hadoop balancer

Shell Commands (Continue...)
7. Create a new directory named "hadoop" below the /user/NIM directory in HDFS. /user/NIM is your home directory in HDFS.
• hadoop fs -mkdir /user/NIM/hadoop

8. Add a text file from the local directory named "jaiprakash" to the new directory you created in HDFS during the previous step.
• hadoop fs -put /home/jaiprakash/sample.txt /user/NIM/hadoop

9. List the contents of this new directory in HDFS.
• hadoop fs -ls /user/NIM/hadoop

10. Add the entire local directory called "retail" to the /user/NIM/hadoop directory in HDFS.
• hadoop fs -put /home/jaiprakash/retail /user/NIM/hadoop

11. Since /user/NIM is your home directory in HDFS, any command that does not have an absolute path is interpreted as relative to that directory. The next command will therefore list your home directory, and should show the items you've just added there.
• hadoop fs -ls

12. See how much space this directory occupies in HDFS.
• hadoop fs -du -s -h /user/NIM/hadoop

Shell Commands (Continue...)
13. Delete the file 'customers' from the "retail" directory.
• hadoop fs -rm /user/NIM/hadoop/retail/customers

14. Ensure this file is no longer in HDFS.
• hadoop fs -ls /user/NIM/hadoop/retail

15. Delete all files from the "retail" directory using a wildcard.
• hadoop fs -rm /user/NIM/hadoop/retail/*

16. To empty the trash
• hadoop fs -expunge

17. Finally, remove the entire retail directory and all of its contents in HDFS.
• hadoop fs -rmr /user/NIM/hadoop/retail

18. List the hadoop directory again
• hadoop fs -ls /user/NIM/hadoop
Shell Commands (Continue...)
19. Add the purchases.txt file from the local directory "/home/jaiprakash/" to the hadoop directory you created in HDFS.
• hadoop fs -copyFromLocal /home/jaiprakash/purchases.txt /user/NIM/hadoop/

20. View the contents of the text file purchases.txt, which is present in your hadoop directory.
• hadoop fs -cat /user/NIM/hadoop/purchases.txt

21. Copy the purchases.txt file from the "hadoop" directory in HDFS to the "data" directory in your local filesystem.
• hadoop fs -copyToLocal /user/NIM/hadoop/purchases.txt /home/jaiprakash/data

22. cp is used to copy files between directories present in HDFS.
• hadoop fs -cp /user/NIM/*.txt /user/NIM/hadoop

23. The '-get' command can be used as an alternative to '-copyToLocal'.
• hadoop fs -get /user/NIM/hadoop/TotalWords.txt /home/jaiprakash/data

24. Display the last kilobyte of the file "purchases.txt" to stdout.
• hadoop fs -tail /user/NIM/hadoop/purchases.txt
Shell Commands (Continue...)
25. Default file permissions are 666 in HDFS. Use the '-chmod' command to change the permissions of a file.
• hadoop fs -ls /user/NIM/hadoop/purchases.txt
  hadoop fs -chmod 600 /user/NIM/hadoop/purchases.txt

26. Use '-chown' to change the owner name and group name simultaneously.
• hadoop fs -ls /user/NIM/hadoop/purchases.txt
  hadoop fs -chown root:root /user/NIM/hadoop/purchases.txt

27. Use the '-chgrp' command to change the group name.
• hadoop fs -ls /user/NIM/hadoop/purchases.txt
  hadoop fs -chgrp training /user/NIM/hadoop/purchases.txt

28. Move a directory from one location to another.
• hadoop fs -mv /user/NIM/hadoop /user/apache_hadoop

29. The default replication factor for a file is 3. Use the '-setrep' command to change the replication factor of a file (see the sketch below).
• hadoop fs -setrep -w 2 /user/NIM/TotalWords.txt
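The same replication change can also be made programmatically; a minimal sketch with the HDFS Java API, reusing the TotalWords.txt path from item 29. Note that unlike '-setrep -w', setReplication() does not wait for re-replication to finish.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Equivalent of: hadoop fs -setrep 2 /user/NIM/TotalWords.txt
    // (returns true if the replication factor was successfully changed)
    boolean changed = fs.setReplication(
        new Path("/user/NIM/TotalWords.txt"), (short) 2);
    System.out.println("Replication changed: " + changed);
  }
}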

Shell Commands (Continue...)
30. Copy a directory from one cluster to another. Use the 'distcp' command to copy, the -overwrite option to overwrite existing files, and the -update option to synchronize both directories.
• hadoop distcp hdfs://namenodeA/apache_hadoop hdfs://namenodeB/hadoop

31. Command to make the NameNode leave safe mode.
• hadoop dfsadmin -safemode leave

32. List all the Hadoop file system shell commands.
• hadoop fs

33. Last but not least, always ask for help!
• hadoop fs -help


Questions?

