
Data Science – Tools and Techniques
Samatrix Consulting Pvt Ltd
Hadoop Installation
https://www.oreilly.com/content/hadoop-with-python/
• NameNode: Requires a large amount of RAM, because it holds the file system metadata in memory, but it does not need much hard disk space.
• Secondary NameNode: Its memory requirement is not as high as that of the primary NameNode.
• DataNodes: Each DataNode typically has around 16 GB of RAM. Because they store the actual data, DataNodes need large hard disk capacity, usually spread across multiple drives.
• Hadoop Cluster at Facebook:
• Facebook uses Hadoop to store copies of internal log and dimension data sources and uses it as a source for reporting, analytics, and machine learning.
• Facebook has two major clusters: an 1100-machine cluster with 8,800 cores and about 12 PB of raw storage, and a 300-machine cluster with 2,400 cores and about 3 PB of raw storage. Each commodity node has 8 cores and 12 TB of storage.
• Facebook makes heavy use of streaming and the Java API and has used Hive to build a higher-level data warehousing framework. It has also developed a FUSE application over HDFS.
• Map/Reduce and HDFS are the primary components of a Hadoop cluster.
• MapReduce: MapReduce is a programming model and an associated implementation for processing and generating big data sets with parallel, distributed algorithms on a cluster. The following is the Map/Reduce master-slave architecture.
• Master: JobTracker
• Slaves: {TaskTracker}……{TaskTracker}
• The Master {JobTracker} is the point of interaction between the Map/Reduce framework and its users. The JobTracker queues pending jobs for execution in FIFO (first-come-first-served) order, assigns map and reduce tasks to the TaskTrackers, and manages the mapping of tasks to nodes.
• Slaves {TaskTracker} execute tasks on the instructions of the Master {JobTracker} and handle the data exchange between the map and reduce phases; a minimal Python sketch of the model follows this list.
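
The deck does not include code for this model, so the following is a minimal word-count sketch written for Hadoop Streaming (an assumption on my part; the file names mapper.py and reducer.py and the word-count task are illustrative only). The mapper emits one (word, 1) pair per word; the reducer sums the counts for each word, relying on the framework sorting its input by key.

#!/usr/bin/env python
# mapper.py -- emit "<word>\t1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        sys.stdout.write("%s\t1\n" % word)

#!/usr/bin/env python
# reducer.py -- sum the counts per word; input arrives sorted by key
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            sys.stdout.write("%s\t%d\n" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    sys.stdout.write("%s\t%d\n" % (current_word, current_count))

Assuming both scripts are executable, they would be submitted with the Hadoop Streaming jar (its exact path depends on the installation), for example:
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /user/test -output /user/test_out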
• HDFS (Hadoop Distributed File System): HDFS is an Apache Software Foundation project designed to provide a fault-tolerant file system that can run on commodity hardware.
• The following is the HDFS master-slave architecture.
• Master: NameNode
• Slaves: {DataNode}…..{DataNode}
• The Master {NameNode} manages the file system namespace operations such as opening, closing, and renaming files and directories, determines the mapping of blocks to DataNodes, and regulates access to files by clients.
• Slaves {DataNodes} serve read and write requests from the file system's clients and perform tasks such as block creation, deletion, and replication on the instructions of the Master {NameNode}.
Where to use HDFS

• Very Large Files: Files should be hundreds of megabytes, gigabytes, or larger.
• Streaming Data Access: The time to read the whole data set matters more than the latency of reading the first record. HDFS is built around a write-once, read-many-times pattern.
• Commodity Hardware: It works on low-cost hardware.
Where not to use HDFS

• Low-Latency Data Access: Applications that need very fast access to the first record should not use HDFS, because it is optimised for the throughput of the whole data set rather than the time to fetch the first record.
• Lots of Small Files: The NameNode holds the metadata of all files in memory, so a very large number of small files consumes a disproportionate amount of NameNode memory, which is not feasible.
• Multiple Writes: HDFS should not be used when a file has to be written to multiple times; it follows the write-once pattern.
Hadoop Operation Modes
• Once you have downloaded Hadoop, you can operate your Hadoop
cluster in one of the three supported modes −
• Local/Standalone Mode − After downloading Hadoop in your system, by default, it is configured in standalone mode and can be run as a single Java process.
• Pseudo-Distributed Mode − This is a distributed simulation on a single machine. Each Hadoop daemon, such as HDFS, YARN, and MapReduce, runs as a separate Java process. This mode is useful for development.
• Fully Distributed Mode − This mode is fully distributed, with a minimum of two machines forming a cluster. We will cover this mode in detail in the coming chapters.
Installing Hadoop in Standalone Mode
• Here we will discuss the installation of Hadoop in standalone mode.
• There are no daemons running and everything runs in a single JVM.
• Standalone mode is suitable for running MapReduce programs during development, since they are easy to test and debug.
• After installation, verify the version:
$ hadoop version
Installing Hadoop in Pseudo Distributed Mode
• In pseudo-distributed mode, each Hadoop daemon (HDFS, YARN, MapReduce) runs as a separate Java process on a single machine.
• Setting it up requires editing a handful of XML configuration files so that the daemons know where to find each other, after which HDFS is formatted and the daemons are started.
The following XML files must be reconfigured in order to set up Hadoop in pseudo-distributed mode:
• core-site.xml
• hdfs-site.xml
• yarn-site.xml
• mapred-site.xml
core-site.xml
• The core-site.xml file informs the Hadoop daemons where the NameNode runs in the cluster. It contains the configuration settings for Hadoop Core, such as the I/O settings that are common to HDFS and MapReduce.
• The core-site.xml file contains information such as the port number used for the Hadoop instance, the memory allocated for the file system, the memory limit for storing data, and the size of the read/write buffers.
• Open the core-site.xml and add the following properties in between <configuration>,
</configuration> tags.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml
• The hdfs-site.xml file contains information such as the replication value and the NameNode and DataNode paths on your local file system, that is, the locations where you want to store the Hadoop infrastructure.
• The hdfs-site.xml file contains the configuration settings for the HDFS daemons:
• NameNode,
• Secondary NameNode,
• DataNodes.
• Here, we can configure hdfs-site.xml to specify the default block replication and permission checking on HDFS. The actual number of replicas can also be specified when a file is created; the default is used if no replication factor is specified at create time. A sample configuration is sketched below.
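
The deck does not show an hdfs-site.xml snippet, so the following is a minimal sketch assuming Hadoop 2.x property names; the local paths are placeholders that must point to directories existing on your machine, and a replication factor of 1 is typical for a single-node setup.

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
</property>
</configuration>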
mapred-site.xml
• The mapred-site.xml file contains the configuration settings for the MapReduce daemons: the JobTracker and the TaskTrackers.
• This file is used to specify which MapReduce framework we are using. By default, Hadoop ships only a template of this file, so it is first required to copy mapred-site.xml.template to mapred-site.xml using the following command.

• $ cp mapred-site.xml.template mapred-site.xml
• Open mapred-site.xml file and add the following properties in between the <configuration>,
</configuration>tags in this file.

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
yarn-site.xml
• This file is used to configure YARN in Hadoop. Open the yarn-site.xml file and add the following properties in between the <configuration>, </configuration> tags in this file.

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Starting HDFS
Initially, you have to format the configured HDFS file system. Open the NameNode (HDFS server) and execute the following command.

$ hadoop namenode -format

After formatting HDFS, start the distributed file system. The following command starts the NameNode as well as the DataNodes as a cluster.

$ start-dfs.sh
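
The deck configures yarn-site.xml above but does not show starting YARN; assuming the standard Hadoop 2.x sbin scripts are on the PATH, the ResourceManager and NodeManagers would typically be started with:

$ start-yarn.sh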
HDFS Basic File Operations
• Putting data into HDFS from the local file system
• First create a folder in HDFS where data can be put from the local file system.
• $ hadoop fs -mkdir /user/test
• Copy the file "data.txt" from the local folder /usr/home/Desktop to the HDFS folder /user/test
• $ hadoop fs -copyFromLocal /usr/home/Desktop/data.txt /user/test
• Display the content of the HDFS folder
• $ hadoop fs -ls /user/test
• Copying data from HDFS to the local file system
• $ hadoop fs -copyToLocal /user/test/data.txt /usr/bin/data_copy.txt

• Compare the files and see that both are the same


• $ md5 /usr/bin/data_copy.txt /usr/home/Desktop/data.txt
• Recursive deleting
• hadoop fs -rmr <arg>
• Example:
• hadoop fs -rmr /user/sonoo/
HDFS Commands
• The following notation is used in the commands below:
• "<path>" means any file or directory name.
• "<path>..." means one or more file or directory names.
• "<file>" means any filename.
• "<src>" and "<dest>" are path names in a directed operation.
• "<localSrc>" and "<localDest>" are paths as above, but on the local
file system
• put <localSrc><dest>
Copies the file or directory from the local file system identified by localSrc to
dest within the DFS.
• copyFromLocal <localSrc><dest>
Identical to -put
• moveFromLocal <localSrc><dest>
Copies the file or directory from the local file system identified by localSrc to
dest within HDFS, and then deletes the local copy on success.
• get [-crc] <src><localDest>
Copies the file or directory in HDFS identified by src to the local file system
path identified by localDest.
• cat <filen-name>
Displays the contents of filename on stdout.
• moveToLocal <src><localDest>
Works like -get, but deletes the HDFS copy on success.
• setrep [-R] [-w] rep <path>
Sets the target replication factor for files identified by path to rep. (The actual
replication factor will move toward the target over time)
• touchz <path>
Creates a file at path containing the current time as a timestamp. Fails if a file
already exists at path, unless the file is already size 0.
• test -[ezd] <path>
Returns 1 if path exists, has zero length, or is a directory; returns 0 otherwise.
• stat [format] <path>
Prints information about path. Format is a string which accepts file size in blocks
(%b), filename (%n), block size (%o), replication (%r), and modification date (%y,
%Y).
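
As a quick illustration of a few of these commands (a sketch reusing the /user/test example from earlier; the paths are placeholders):

$ hadoop fs -touchz /user/test/marker
$ hadoop fs -setrep -w 2 /user/test/data.txt
$ hadoop fs -test -e /user/test/data.txt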
Python-Snakebite library to
operate Hadoop
Snakebite
• Snakebite is a Python package created by Spotify.
• It provides a Python client library for accessing HDFS programmatically from a Python application.
• The client library uses protobuf messages to communicate directly with the NameNode. Snakebite also includes a command-line interface for HDFS that is based on the client library.
Installation

• Snakebite currently supports only Python 2 and requires python-protobuf 2.4.1 or higher; Python 3 is not currently supported.
• Snakebite is distributed through PyPI and can be installed using pip:
$ pip install snakebite
• Client library
• The client library is written in Python, uses protobuf messages, and implements the Hadoop RPC protocol to communicate with the NameNode.
• This allows Python applications to communicate directly with HDFS without having to make system calls to the hdfs dfs command-line client.
• List directory information
• Example 1-1 (python/HDFS/list_directory.py) uses the Snakebite client to list the contents of the HDFS root directory:

from snakebite.client import Client

# Connect to the NameNode on localhost:9000 and list the HDFS root directory
client = Client('localhost', 9000)
for x in client.ls(['/']):
    print x
• The Client constructor accepts the following parameters (an illustrative call follows this list):
• host (string): The hostname or IP address of the NameNode.
• port (int): The RPC port number of the NameNode.
• hadoop_version (int): The version of the Hadoop protocol to use.
• use_trash (boolean): Use the trash (recycle bin) when deleting files.
• effective_user (string): The effective user for the HDFS operations.
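
The following sketch shows how these parameters might be passed and how the client's methods return generators; the keyword values and the /user/test path are illustrative assumptions, not requirements.

from snakebite.client import Client

# Connect with explicit keyword arguments; adjust host and port to your NameNode
client = Client('localhost', 9000, use_trash=False)

# mkdir() and ls() return generators of result dictionaries
for result in client.mkdir(['/user/test/snakebite'], create_parent=True):
    print result
for entry in client.ls(['/user/test']):
    print entry['path']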
CLI Client
Snakebite's CLI client is a Python command-line HDFS client built on the client library. When running the Snakebite CLI, the hostname or IP address and the RPC port of the NameNode must be specified, for example as shown below.
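
A minimal usage sketch, assuming the NameNode from the earlier core-site.xml setting (hdfs://localhost:9000) and the CLI's -n/-p options:

$ snakebite -n localhost -p 9000 ls /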
Thanks
Samatrix Consulting Pvt Ltd
