1 - Big Data and Hadoop Framework
9/15/23 1
Agenda
Big Data Challenges
Big Data in five domains
● Health care
● Public sector administration
● Retail transaction data
● Global manufacturing
● Personal location data

James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, and Angela Hung Byers, Big Data: The Next Frontier for Innovation, Competition, and Productivity, McKinsey Global Institute, 2012.
Goals of Big Data Analytics
Big Data Analytics: Open Source Solutions
A Brief History
Hadoop
History of Hadoop
● Started as a sub-project of Apache Nutch
  – Nutch's job is to index the web and expose it for searching
● In 2004, Google published its Google File System (GFS) and MapReduce framework papers
● In 2006, Yahoo! hired Doug Cutting to work on Hadoop with a dedicated team
Naming Conventions
Hadoop Current Status
● April 1, 2016 marked the 10-year anniversary of the first Hadoop code release.
● The vast majority of Fortune 500 companies use Hadoop in production.
● The global Hadoop market was valued at $1.5 billion in 2012 and is expected to grow rapidly.
● Apache Spark is a successor system that is more powerful and flexible than Hadoop MapReduce.
● Spark is intended to enhance, not replace, the Hadoop stack.
● A recent trend is processing big data in cloud computing environments.
● Real-time analysis (business intelligence)

Sources:
● http://searchdatamanagement.techtarget.com/news/4500272078/Co-creator-Cutting-assesses-Hadoop-future-present-and-past
● https://www.alliedmarketresearch.com/hadoop-market
● https://www.dezyre.com/article/hadoop-mapreduce-vs-apache-spark-who-wins-the-battle/83
● https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
Source: https://opensource.com/life/14/8/intro-apache-hadoop-big-data
Hadoop: Assumptions
• Hardware will fail.
• Processing will be run in batches. Thus there is an emphasis
on high throughput as opposed to low latency.
• Applications that run on HDFS have large data sets. A typical
file in HDFS is gigabytes to terabytes in size.
• It should provide high aggregate data bandwidth and scale to
hundreds of nodes in a single cluster. It should support tens
of millions of files in a single instance.
• Applications need a write-once-read-many access model.
• Moving Computation is Cheaper than Moving Data.
• Portability is important.
Hadoop Distribution
• MapR Distribution
• Hortonworks Data Platform (HDP)
• Apache BigTop Distribution
• Greenplum HD Data Computing Appliance
• Cloudera Distribution for Hadoop (CDH)
– http://www.cloudera.com/hadoop
– 100% open source; integrates most of the popular Hadoop projects, such as HDFS, MapReduce, HBase, Hive, Mahout, Oozie, Pig, Sqoop, Whirr, ZooKeeper, and Flume
Hadoop Cluster
● A set of "cheap" commodity hardware
● Networked together
● Resides in the same location
  – A set of servers in a set of racks in a data center
Typical Hadoop Cluster
The Apache Hadoop framework – Modules
● Hadoop Common: contains libraries and utilities needed by other Hadoop modules
● Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster
● Hadoop YARN: a resource-management platform responsible for managing compute resources in clusters and using them for scheduling users' applications
● Hadoop MapReduce: a programming model for large-scale data processing
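The MapReduce programming model above can be illustrated with a tiny in-memory word-count sketch. This is pure Python with no Hadoop involved; the function names and sample data are illustrative, not Hadoop APIs:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values per key (here, sum the counts)."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data", "big data and Hadoop"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts == {"big": 2, "data": 2, "and": 1, "hadoop": 1}
```

In real Hadoop, the map and reduce functions run in parallel across the cluster and the shuffle moves data between nodes; the data flow, however, is exactly this map → shuffle → reduce pipeline.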
HDFS Architecture
● NameNode
HDFS Architecture (Continue...)
● DataNodes
  – Each block replica on a DataNode is represented by two files in the local native filesystem.
  – The first file contains the data itself; the second records the block's metadata, including checksums for the data and the generation stamp.
● CheckpointNode
● BackupNode
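The per-chunk checksums stored in the block's metadata file can be sketched with CRC32. This is a simplified illustration, not HDFS code; the 512-byte chunk size mirrors HDFS's default, but the function names are invented here:

```python
import zlib

CHUNK = 512  # bytes covered by each checksum, mirroring HDFS's default

def block_checksums(data: bytes):
    """Compute one CRC32 per fixed-size chunk, like the block's metadata file."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data: bytes, checksums):
    """Re-compute checksums on read and compare, detecting a corrupted replica."""
    return block_checksums(data) == checksums

payload = b"x" * 1500                       # spans three 512-byte chunks
sums = block_checksums(payload)
assert verify(payload, sums)                # intact replica passes
assert not verify(payload[:-1] + b"y", sums)  # a single flipped byte is caught
```

When a DataNode serves a corrupted block, the reading client detects the checksum mismatch and fetches the block from another replica.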
HDFS Continue...
● HDFS is a block-structured file system (default block size: 128 MB)
● HDFS Daemons
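The block arithmetic implied by the 128 MB default is worth making concrete: a file is split into fixed-size blocks plus a final, possibly partial, block. A minimal sketch (the helper is illustrative; the block size is configurable in real clusters via dfs.blocksize):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size, in bytes

def num_blocks(file_size_bytes: int) -> int:
    """Number of HDFS blocks a file occupies; the last block may be partial."""
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

print(num_blocks(1024 ** 3))        # a 1 GiB file -> 8 full 128 MiB blocks
print(num_blocks(200 * 1024 ** 2))  # a 200 MiB file -> 2 blocks (128 + 72 MiB)
```

Note that a partial final block only consumes its actual size on disk; the block size is an upper bound, not a fixed allocation unit.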
Hadoop high-level architecture
HDFS Architecture (Continue...)
HDFS Write
HDFS Read
start-all.sh
jaiprakash@jaiprakash-HP-ProBook-4440s:/$ start-all.sh
jaiprakash@jaiprakash-HP-ProBook-4440s:/$ jps
3538 SecondaryNameNode
3381 DataNode
7240 Jps
3786 TaskTracker
3629 JobTracker
3231 NameNode
Hadoop Web Interfaces
Hadoop 2.x Features over 1.x
● Allows working with MapReduce as well as other distributed computing models such as Spark, Hama, Giraph, MPI (Message Passing Interface), and HBase coprocessors.
● YARN (Yet Another Resource Negotiator) handles cluster resource management; processing is done using different processing models.
● Overcomes the single point of failure (SPOF) with a standby NameNode; in case of NameNode failure, it is configured for automatic recovery.
● The MR API requires additional files for a program written in Hadoop 1.x to execute on Hadoop 2.x.
● Can serve as a platform for a wide variety of data analytics: event processing, streaming, and real-time operations.
● The Hadoop stack (Hive, Pig, HBase, etc.) is equipped to handle NameNode failure.
● Added support for Microsoft Windows.
Hadoop and Spark Cluster Configuration – Cluster of SMP Nodes – ISRO Respond Project
Shell Commands
● Interact with the file system by executing shell-like commands
● Usage: $ hdfs dfs -<command> -<option>
  – Example: $ hdfs dfs -ls /
● Most commands behave like UNIX commands
  – ls, cat, du, etc.
● Supports HDFS-specific operations
  – e.g., changing the replication factor
● List supported commands
  – $ hdfs dfs -help
● Display detailed help for a command
  – $ hdfs dfs -help <command_name>
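These shell commands can also be driven from scripts. A minimal sketch of a generic command runner follows; the `hdfs dfs` call in the comment assumes a running cluster, so the live example uses a stand-in command:

```python
import shlex
import subprocess

def run_cli(command: str):
    """Run a command line and return (exit code, captured stdout)."""
    result = subprocess.run(shlex.split(command),
                            capture_output=True, text=True)
    return result.returncode, result.stdout

# On a machine with Hadoop installed you would call, for example:
#   code, listing = run_cli("hdfs dfs -ls /")
code, out = run_cli("echo hdfs-demo")  # stand-in so the sketch runs anywhere
print(code, out.strip())
```

Checking the returned exit code matters in practice: `hdfs dfs` commands signal failure (e.g., a missing path) through a non-zero exit code rather than an exception.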
Shell Commands (Continue...)
1. Print the Hadoop version
   • hadoop version
3. Report the amount of space used and available on the currently mounted file system
   • hadoop fs -df hdfs:/
4. Count the number of directories, files and bytes under the paths that match the specified file pattern
   • hadoop fs -count hdfs:/
Shell Commands (Continue...)
7. Create a new directory named "hadoop" below the /user/NIM directory in HDFS. /user/NIM is your home directory in HDFS.
   • hadoop fs -mkdir /user/NIM/hadoop
8. Add a sample.txt text file from the local directory named "jaiprakash" to the new directory you created in HDFS during the previous step.
   • hadoop fs -put /home/jaiprakash/sample.txt /user/NIM/hadoop
10. Add the entire local directory called "retail" to the /user/NIM/hadoop directory in HDFS.
   • hadoop fs -put /home/jaiprakash/retail /user/NIM/hadoop
11. Since /user/NIM is your home directory in HDFS, any command that does not have an absolute path is interpreted as relative to that directory. The next command will therefore list your home directory, and should show the items you've just added there.
   • hadoop fs -ls
Shell Commands (Continue...)
13. Delete the file 'customers' from the "retail" directory.
   • hadoop fs -rm /user/NIM/hadoop/retail/customers
15. Delete all files from the "retail" directory using a wildcard.
   • hadoop fs -rm /user/NIM/hadoop/retail/*
17. Finally, remove the entire retail directory and all of its contents in HDFS.
   • hadoop fs -rmr /user/NIM/hadoop/retail
20. View the contents of the text file purchases.txt in your HDFS "hadoop" directory.
   • hadoop fs -cat /user/NIM/hadoop/purchases.txt
21. Copy the purchases.txt file from the "hadoop" directory in HDFS to the directory "data" in your local file system.
   • hadoop fs -copyToLocal /user/NIM/hadoop/purchases.txt /home/jaiprakash/data
26. Use '-chown' to change owner name and group name simultaneously.
   • hadoop fs -ls /user/NIM/hadoop/purchases.txt
   • hadoop fs -chown root:root /user/NIM/hadoop/purchases.txt
Shell Commands (Continue...)
30. Copy a directory from one node in the cluster to another. Use the 'distcp' tool to copy; the -overwrite option overwrites existing files, and the -update option synchronizes both directories.
   • hadoop distcp hdfs://namenodeA/apache_hadoop hdfs://namenodeB/hadoop
31. Make the NameNode leave safe mode.
   • hadoop dfsadmin -safemode leave
   (Note: hadoop fs -expunge empties the trash; it is unrelated to safe mode.)
32. List all the Hadoop file system shell commands.
   • hadoop fs
33. Last but not least, always ask for help!
   • hadoop fs -help
Questions?