3 Hadoop
Techniques
Samatrix Consulting Pvt Ltd
Hadoop Installation
https://www.oreilly.com/content/hadoop-with-python/
• NameNode: Requires a large amount of RAM, because it holds the file system metadata in memory, but it does not need much hard disk space.
• Secondary NameNode: Its memory requirement is not as high as that of the primary NameNode.
• DataNodes: Each DataNode requires about 16 GB of memory. Because they store the actual data blocks, they need a lot of hard disk space and usually have multiple drives.
• Hadoop Cluster at Facebook:
• Facebook uses Hadoop to store copies of internal log and dimension data sources and uses it as a source for reporting, analytics, and machine learning.
• Currently, Facebook has two major clusters: an 1,100-machine cluster with 8,800 cores and about 12 PB of raw storage, and a 300-machine cluster with 2,400 cores and about 3 PB of raw storage. Each commodity node has 8 cores and 12 TB of storage.
• Facebook makes heavy use of streaming and the Java API and has used Hive to build a higher-level data warehousing framework. It has also developed a FUSE application over HDFS.
• Map/Reduce and HDFS are the primary components of Hadoop cluster.
• MapReduce: MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. The following is the Map/Reduce master-slave architecture.
• Master: JobTracker
• Slaves: {TaskTracker}……{TaskTracker}
• The Master {JobTracker} is the interaction point between users and the Map/Reduce framework. The JobTracker places pending jobs in a queue and schedules them in FIFO (first-come-first-served) order, assigning map and reduce tasks to the TaskTrackers.
• The Slaves {TaskTracker} execute tasks based on the instructions of the Master {JobTracker} and handle the data exchange between the map and reduce phases.
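• To make the programming model concrete, the following is a minimal word-count sketch written against the standard Hadoop Java MapReduce API (the class names WordCount, TokenizerMapper, and IntSumReducer are illustrative, not taken from the material above): the map phase emits a (word, 1) pair for every word, and the reduce phase sums the counts per word.
// Minimal word-count sketch using the Hadoop Java MapReduce API.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map phase: split each input line into words and emit (word, 1).
  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
• In the MRv1 architecture described above, the JobTracker schedules this job's map and reduce tasks onto the TaskTrackers; on a YARN-based cluster the same code runs, with scheduling handled by the ResourceManager and NodeManagers instead.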
• HDFS (Hadoop Distributed File System): HDFS is part of the Apache Hadoop project and is designed as a fault-tolerant file system that can run on commodity hardware.
• The following is the HDFS Master-slave architecture.
• Master: NameNode
• Slaves: {DataNode}…..{DataNode}
• The Master {NameNode} manages file system namespace operations such as opening, closing, and renaming files and directories, determines the mapping of blocks to DataNodes, and regulates client access to files.
• The Slaves {DataNodes} serve read and write requests from the file system's clients and perform tasks such as block creation, deletion, and replication based on the instructions of the Master {NameNode}.
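• To illustrate the client interaction described above, the following is a minimal sketch using Hadoop's Java FileSystem API (the NameNode address hdfs://localhost:9000 matches the core-site.xml example later in this material; the file path is an illustrative assumption): namespace operations go to the NameNode, while the file's bytes are streamed to and from the DataNodes.
// Minimal HDFS client sketch using the Hadoop FileSystem Java API.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed NameNode address; matches the fs.default.name value used later in this material.
    FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

    Path file = new Path("/user/test/hello.txt");  // illustrative path

    // Write: the NameNode allocates blocks, the data is streamed to DataNodes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello hdfs");
    }

    // Read: the NameNode returns block locations, the bytes come from DataNodes.
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }

    fs.close();
  }
}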
Where Not to Use HDFS
• Low-latency data access: Applications that need very fast access to the first record should not use HDFS, because HDFS is optimized for high throughput over the whole data set rather than for the time to fetch the first record.
• Lots of small files: The NameNode holds the metadata of all files in memory, so a very large number of small files consumes a disproportionate amount of NameNode memory, which is not feasible.
• Multiple writes: HDFS is designed for write-once, read-many access; it should not be used when files have to be written or modified multiple times.
Hadoop Operation Modes
• Once you have downloaded Hadoop, you can operate your Hadoop
cluster in one of the three supported modes −
• Local/Standalone Mode − After downloading Hadoop, the system is configured in standalone mode by default and runs as a single Java process.
• Pseudo-Distributed Mode − This is a distributed simulation on a single machine. Each Hadoop daemon, such as HDFS, YARN, and MapReduce, runs as a separate Java process. This mode is useful for development.
• Fully Distributed Mode − This mode is fully distributed, with a cluster of at least two machines. We will cover this mode in detail in the coming chapters.
Installing Hadoop in Standalone Mode
• Here we will discuss the installation of Hadoop in standalone mode.
• There are no daemons running and everything runs in a single JVM.
• Standalone mode is suitable for running MapReduce programs during
development, since it is easy to test and debug them.
$ hadoop version
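As a quick check of standalone mode, the example jobs shipped with Hadoop can be run directly against local files; a typical smoke test (the jar location and version wildcard are assumptions that depend on the release layout) is:
$ mkdir input
$ cp $HADOOP_HOME/etc/hadoop/*.xml input
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep input output 'dfs[a-z.]+'
$ cat output/*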
Installing Hadoop in Pseudo Distributed Mode
• In pseudo-distributed mode, each Hadoop daemon (for HDFS, YARN, and MapReduce) runs as a separate Java process on a single machine, so a real cluster can be simulated on one node.
• This mode is therefore well suited for development and testing before moving to a fully distributed cluster.
• Setting it up requires editing a small set of Hadoop configuration files, as described below.
The following XML files must be reconfigured in order to set up Hadoop in pseudo-distributed mode:
• core-site.xml
• hdfs-site.xml
• yarn-site.xml
• mapred-site.xml
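These files normally live in the Hadoop configuration directory; in recent releases this is $HADOOP_HOME/etc/hadoop (older releases used a conf/ directory), so they can be located and edited with, for example:
$ cd $HADOOP_HOME/etc/hadoop
$ ls *-site.xml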
core-site.xml
• The core-site.xml file informs the Hadoop daemons where the NameNode runs in the cluster. It contains the configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
• The core-site.xml file contains information such as the port number used for the Hadoop instance, the memory allocated for the file system, the memory limit for storing data, and the size of the read/write buffers.
• Open core-site.xml and add the following properties between the <configuration> and </configuration> tags.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
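Note that in current Hadoop releases the fs.default.name property is deprecated in favour of fs.defaultFS; if you prefer the non-deprecated name, the equivalent setting would be:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>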
hdfs-site.xml
• The hdfs-site.xml file contains information such as the replication factor, the NameNode path, and the DataNode paths on your local file system, that is, the locations where you want to store the Hadoop infrastructure.
• The hdfs-site.xml file contains the configuration settings for the HDFS daemons:
• NameNode
• Secondary NameNode
• DataNodes
• Here, we can configure hdfs-site.xml to specify the default block replication and permission checking on HDFS. The actual number of replicas can also be specified when a file is created; the default is used if no replication factor is specified at create time.
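For a pseudo-distributed setup, a minimal hdfs-site.xml might look like the following sketch (a replication factor of 1 suits a single node; the dfs.namenode.name.dir and dfs.datanode.data.dir paths are illustrative assumptions and should point to directories on your local file system):
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
</property>
</configuration>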
mapred-site.xml
• The mapred-site.xml file contains the configuration settings for the MapReduce daemons: the JobTracker and the TaskTrackers.
• This file is used to specify which MapReduce framework we are using. By default, Hadoop ships only a template of this file, so first copy mapred-site.xml.template to mapred-site.xml using the following command.
• $ cp mapred-site.xml.template mapred-site.xml
• Open the mapred-site.xml file and add the following properties between the <configuration> and </configuration> tags.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
yarn-site.xml
• This file is used to configure YARN in Hadoop. Open the yarn-site.xml file and add the following properties between the <configuration> and </configuration> tags.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Starting HDFS
Initially, you have to format the configured HDFS file system and then start the HDFS daemons (the NameNode and DataNodes) by executing the following command.
$ start-dfs.sh
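The format step mentioned above is performed once, before the first start; after the HDFS daemons are up, the YARN daemons are started with a separate script, and jps lists the Java daemons that are running (this assumes the Hadoop bin and sbin directories are on the PATH):
$ hdfs namenode -format   # run once, before the first start-dfs.sh
$ start-yarn.sh           # starts the ResourceManager and NodeManager
$ jps                     # should list NameNode, DataNode, ResourceManager, NodeManager, etc.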
HDFS Basic File Operations
• Putting data to HDFS from local file system
• First create a folder in HDFS where data can be put from the local file system.
• $ hadoop fs -mkdir /user/test
• Copy the file "data.txt" from the local folder /usr/home/Desktop to the HDFS folder /user/test
• $ hadoop fs -copyFromLocal /usr/home/Desktop/data.txt /user/test
• Display the content of HDFS folder
• $ hadoop fs -ls /user/test
• Copying data from HDFS to local file system
• $ hadoop fs -copyToLocal /user/test/data.txt /usr/bin/data_copy.txt
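• Two other frequently used shell operations, shown here on the same example file, are displaying its contents and deleting it
• Display the content of an HDFS file
• $ hadoop fs -cat /user/test/data.txt
• Remove a file from HDFS
• $ hadoop fs -rm /user/test/data.txt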