
SATHYABAMA UNIVERSITY

COURSE MATERIAL - BIG DATA (SITA1603)


FACULTY OF COMPUTING

COURSE OBJECTIVES

1) To understand the dominant software systems and algorithms for coping with Big Data.
2) To apply appropriate analytics techniques and tools to analyze big data, create statistical
models, and identify insights.
3) To explore the ethical implications of big data research, particularly as they relate
to the web.

Unit I – Introduction

Introduction to Big Data:


Big Data deals with large and complex datasets that can be structured,
semi-structured, or unstructured and that typically will not fit into memory to be
processed.

Issues and Challenges in traditional systems:


While big data has many advantages, it does present some challenges that organizations must be
ready to tackle when collecting, managing, and taking action on such an enormous amount of data.
The most commonly reported big data challenges include:
Lack of data talent and skills. Data scientists, data analysts, and data engineers are in short supply—and are
some of the most highly sought after (and highly paid) professionals in the IT industry. Lack of big data skills
and experience with advanced data tools is one of the primary barriers to realizing value from big data
environments.
Speed of data growth. Big data, by nature, is always rapidly changing and increasing. Without a solid
infrastructure in place that can handle your processing, storage, network, and security needs, it can become
extremely difficult to manage.
Problems with data quality. Data quality directly impacts the quality of decision-making, data analytics, and
planning strategies. Raw data is messy and can be difficult to curate. Having big data doesn’t guarantee
results unless the data is accurate, relevant, and properly organized for analysis. This can slow down
reporting, but if not addressed, you can end up with misleading results and worthless insights.
Compliance violations. Big data contains a lot of sensitive data and information, making it a tricky task to
continuously ensure data processing and storage meet data privacy and regulatory requirements, such as data
localization and data residency laws.
Integration complexity. Most companies work with data siloed across various systems and applications across
the organization. Integrating disparate data sources and making data accessible for business users is complex,
but vital, if you hope to realize any value from your big data.
Security concerns. Big data contains valuable business and customer information, making big data stores high-
value targets for attackers. Since these datasets are varied and complex, it can be harder to implement
comprehensive strategies and policies to protect them.

Big Data Characteristics:


Big Data refers to amounts of data too large to be processed by traditional data storage and processing
systems. It is used by many multinational companies to run the data-driven side of their business. By some
estimates, global data flows exceed 150 exabytes per day before replication.
There are five V's of Big Data that explain its characteristics.

5 V's of Big Data


o Volume
o Veracity
o Variety
o Value
o Velocity

Volume
The name Big Data itself refers to enormous size. Big Data is a vast volume of data generated daily
from many sources, such as business processes, machines, social media platforms, networks,
human interactions, and many more.
For example, Facebook generates approximately a billion messages, records around 4.5 billion clicks of the
"Like" button, and receives more than 350 million new posts each day. Big data technologies are designed to
handle such large amounts of data.
Variety
Big Data can be structured, semi-structured, or unstructured, collected from many different sources.
In the past, data was collected mainly from databases and spreadsheets, but today it arrives in an
array of forms: PDFs, emails, audio, social media posts, photos, videos, and more.

The data is categorized as below:


1. Structured data: Structured data follows a fixed schema with all the required columns and is in tabular
form. Structured data is stored in a relational database management system.
2. Semi-structured data: In semi-structured data the schema is not strictly defined, e.g., JSON, XML, CSV,
TSV, and email. By contrast, OLTP (Online Transaction Processing) systems are built to work with
structured data stored in relations, i.e., tables.
3. Unstructured data: Unstructured files such as log files, audio files, and image files fall into this
category. Many organizations have a large amount of such data available but do not know how
to derive value from it because the data is raw.
4. Quasi-structured data: This is textual data with inconsistent formats that can be structured only with
time, effort, and suitable tools.
Example: web server logs, i.e., log files created and maintained by a server that contain a list
of activities. A small Java sketch for parsing such a log line follows.
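As a small illustration of quasi-structured data, the following Java sketch parses one assumed Apache-style access-log line with a regular expression. The log format, the sample line, and the LogLineParser class name are examples chosen here, not part of any standard.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineParser {
  // Assumed layout: client IP, identity, user, [timestamp], "request", status, bytes.
  private static final Pattern LOG =
      Pattern.compile("^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]+)\" (\\d{3}) (\\d+|-)");

  public static void main(String[] args) {
    String line = "192.168.1.10 - - [12/Mar/2024:10:15:32 +0530] \"GET /index.html HTTP/1.1\" 200 5321";
    Matcher m = LOG.matcher(line);
    if (m.find()) {
      // The inconsistent text becomes a few well-defined fields.
      System.out.println("client  = " + m.group(1));
      System.out.println("time    = " + m.group(2));
      System.out.println("request = " + m.group(3));
      System.out.println("status  = " + m.group(4));
    }
  }
}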
Veracity
Veracity refers to how reliable and trustworthy the data is, and to the ability to filter, translate, and
manage the data efficiently. Without veracity, Big Data cannot reliably support business development
and decision-making.
For example, Facebook posts with hashtags vary widely in quality and trustworthiness.
Value
Value is an essential characteristic of big data. What matters is not simply the data that we process or store; it
is the valuable and reliable data that we store, process, and analyze.

Velocity
Velocity plays an important role compared to the other characteristics. Velocity refers to the speed at which
data is created, often in real time. It covers the rate at which data sets arrive, their rate of change, and
bursts of activity. A primary aspect of Big Data is delivering the data that is demanded rapidly.
Big data velocity deals with the speed at which data flows in from sources such as application logs, business
processes, networks, social media sites, sensors, and mobile devices.

Big Data Storage:


Let us first discuss leading Big Data Technologies that come under Data Storage:
o Hadoop: When it comes to handling big data, Hadoop is one of the leading technologies that come into
play. It is based on the MapReduce architecture and is mainly used for batch processing; it is capable of
processing tasks in batches. The Hadoop framework was introduced to store and process data in a
distributed processing environment, in parallel, on commodity hardware, using a simple programming and
execution model.
Hadoop is also well suited to storing and analyzing data from many machines at high speed and low cost,
which is why it is regarded as one of the core components of big data technologies. Apache Hadoop reached
its 1.0 release from the Apache Software Foundation in December 2011. Hadoop is written in the Java
programming language.
o MongoDB: MongoDB is another important big data technology for storage. Relational and RDBMS
properties do not apply to MongoDB because it is a NoSQL database; unlike traditional RDBMS databases
that use structured query languages, MongoDB stores documents with a flexible schema.
The structure of data storage in MongoDB is also different from traditional RDBMS databases, which
enables MongoDB to hold massive amounts of data. It is based on a simple, cross-platform, document-
oriented design: the database stores JSON-like documents with a flexible schema. This suits operational
data storage, as seen in many financial organizations. As a result, MongoDB is replacing traditional
mainframes and offering the flexibility to handle a wide range of high-volume data types in distributed
architectures. A short document-insertion sketch in Java is given after this list.
MongoDB Inc. introduced MongoDB in February 2009. It is written in a combination of C++, Python,
JavaScript, and Go.
o RainStor: RainStor is a database management system designed to manage and analyze organizations'
Big Data requirements. It uses deduplication strategies that help to store and handle vast amounts of data
for reference.
RainStor was developed in 2004 by the RainStor software company. It can be queried much like SQL.
Companies such as Barclays and Credit Suisse have used RainStor for their big data needs.
o Hunk: Hunk is mainly helpful when data needs to be accessed in remote Hadoop clusters using virtual
indexes. It lets us use the Splunk Search Processing Language (SPL) to analyze data, and it allows us to
report on and visualize vast amounts of data from Hadoop and NoSQL data sources.
Hunk was introduced in 2013 by Splunk Inc. It is based on the Java programming language.
o Cassandra: Cassandra is one of the leading big data technologies among the top NoSQL databases. It is
open source, distributed, and uses wide-column storage; it is freely available and provides high availability
without a single point of failure. This helps it handle data efficiently on large groups of commodity servers.
Cassandra's essential features include fault tolerance, scalability, MapReduce support, a distributed design,
eventual and tunable consistency, its own query language, and multi-datacenter replication.
Cassandra was originally developed at Facebook in 2008 for its inbox search feature and is now maintained
by the Apache Software Foundation. It is based on the Java programming language.
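As a minimal sketch of the document-oriented storage described for MongoDB above, the following Java snippet inserts and reads back one JSON-like document using the MongoDB Java (sync) driver. The connection string, database, collection, and field names are illustrative assumptions, not part of the course material.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class MongoDocumentExample {
  public static void main(String[] args) {
    // Assumed local MongoDB instance; adjust the URI for a real deployment.
    try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
      MongoDatabase db = client.getDatabase("bank");                   // hypothetical database
      MongoCollection<Document> txns = db.getCollection("transactions");
      // Documents are JSON-like and need no predefined schema.
      txns.insertOne(new Document("account", "A-1001")
          .append("amount", 250.75)
          .append("currency", "INR"));
      Document first = txns.find(new Document("account", "A-1001")).first();
      System.out.println(first == null ? "not found" : first.toJson());
    }
  }
}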

Enabling Big Data Technologies:


Big Data deals with data sets that are too large or complex to be handled by traditional data-processing
application software. It involves three key concepts: volume, variety, and velocity. Volume concerns the size
of the data; variety concerns the type of data, such as images, PDFs, audio, and video; and velocity concerns
the speed of data transfer and the speed of processing and analyzing the data. Big data works on large data
sets that can be unstructured, semi-structured, or structured. Working with big data includes capturing data,
search, data storage, sharing of data, transfer, data analysis, visualization, and querying. For analysis,
techniques such as A/B testing, machine learning, and natural language processing are used. For
visualization, charts and graphs are used. The underlying technologies include business intelligence, cloud
computing, and databases.

Here is an overview of each of these big data technologies.
1. Apache Cassandra: It is a NoSQL database that is highly scalable and highly available. Data can be
replicated across multiple data centers. In Cassandra, fault tolerance is a major factor: failed nodes can be
replaced without any downtime.
2. Apache Hadoop: Hadoop is one of the most widely used big data technologies. It handles large-scale data
and large file systems through the Hadoop Distributed File System (HDFS) and provides parallel processing
through its MapReduce framework. Hadoop is a scalable system capable of handling very large capacities
and workloads. For example, NextBio uses Hadoop MapReduce and HBase to process multi-terabyte data
sets of the human genome.
3. Apache Hive: It is used for data summarization and ad hoc querying, which means querying and analyzing
Big Data easily. It is built on top of Hadoop and provides data summarization, ad hoc queries, and analysis
of large datasets using an SQL-like language called HiveQL. It is not a relational database and is not
intended for real-time queries. Its features include: designed for OLAP, an SQL-type language called HiveQL,
and being fast, scalable, and extensible.
4. Apache Flume: It is a distributed and reliable system that is used to collect, aggregate, and move
large amounts of log data from many data sources toward a centralized data store.
5. Apache Spark: The main objective of Spark, which was introduced by the Apache Software Foundation, is
to speed up Hadoop's computational processing. Apache Spark can work independently because it has its
own cluster management; it is not an updated or modified version of Hadoop. Spark can be combined with
Hadoop in two ways, for storage and for processing, but in practice Spark uses Hadoop mainly for storage,
because Spark has its own cluster management and computation. Spark's key features include interactive
queries, stream processing, and in-memory cluster computing.
6. Apache Kafka: It is a distributed publish-subscribe messaging system; more specifically, it provides a
robust queue that can handle a high volume of data and lets you pass messages from one endpoint to
another, from sender to receiver. Message consumption can happen in both offline and online modes. To
prevent data loss, Kafka messages are replicated within the cluster. For real-time streaming data analysis it
integrates with Apache Storm and Spark, and it is built on top of the ZooKeeper synchronization service.
A minimal producer sketch in Java is shown after this list.
7. MongoDB: It is cross-platform and works with the concepts of collections and documents. It provides
document-oriented storage, meaning data is stored as JSON-like documents, and any attribute can be
indexed. Its features include high availability, replication, rich queries, professional support by MongoDB,
auto-sharding, and fast in-place updates.
8. Elasticsearch: It is a real-time distributed, open-source full-text search and analytics engine. It is highly
scalable and can handle structured and unstructured data up to petabytes; it can be used as a replacement
for document stores such as MongoDB or RavenDB. To improve search performance it uses denormalization.
In real use it serves as an enterprise search engine, and large organizations such as Wikipedia and GitHub
use it.
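As a minimal sketch of Kafka's publish side, the snippet below sends one record to a topic using the Kafka Java client. The broker address, topic name, key, and value are illustrative assumptions.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClickstreamProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Publish one (key, value) record to the hypothetical "clickstream" topic.
      producer.send(new ProducerRecord<>("clickstream", "user-42", "page=/home"));
      producer.flush();
    }
  }
}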

Hadoop Ecosystem:
Apache Hadoop is an open-source framework intended to make interaction with big data easier. For those
who are not acquainted with this technology, one question arises: what is big data? Big data is a term for
data sets that cannot be processed efficiently with traditional approaches such as an RDBMS. Hadoop has
made its place in industries and companies that need to work on large data sets that are sensitive and need
efficient handling. Hadoop is a framework that enables processing of large data sets that reside in the form
of clusters. Being a framework, Hadoop is made up of several modules that are supported by a large
ecosystem of technologies.
Introduction: The Hadoop Ecosystem is a platform or suite that provides various services to solve big data
problems. It includes Apache projects and various commercial tools and solutions. There are four major
elements of Hadoop, i.e., HDFS, MapReduce, YARN, and Hadoop Common utilities. Most of the other tools
or solutions are used to supplement or support these major elements. All these tools work collectively to
provide services such as absorption (ingestion), analysis, storage, and maintenance of data.
Following are the components that collectively form a Hadoop ecosystem:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing
• Spark: In-Memory data processing
• PIG, HIVE: Query based processing of data services
• HBase: NoSQL Database
• Mahout, Spark MLLib: Machine Learning algorithm libraries
• Solr, Lucene: Searching and Indexing
• Zookeeper: Managing cluster
• Oozie: Job Scheduling

Hadoop – Architecture:

Hadoop is a framework written in Java that utilizes a large cluster of commodity hardware to maintain and
store big data. Hadoop works on the MapReduce programming model, which was introduced by Google.
Today, many big-brand companies such as Facebook, Yahoo, Netflix, and eBay use Hadoop to deal with big
data. The Hadoop architecture mainly consists of 4 components.

• MapReduce
• HDFS(Hadoop Distributed File System)
• YARN(Yet Another Resource Negotiator)
• Common Utilities or Hadoop Common

Let’s understand the role of each of these components in detail.

1. MapReduce

MapReduce is essentially an algorithmic framework, built on top of YARN, for processing data. Its major
feature is performing distributed processing in parallel across a Hadoop cluster, which is what makes
Hadoop so fast; when you are dealing with Big Data, serial processing is no longer of any use. MapReduce
has two main tasks, divided phase-wise: in the first phase Map is utilized, and in the next phase Reduce
is utilized.

Here, the input is provided to the Map() function, its output is used as the input to the Reduce() function,
and after that we receive our final output. Let’s understand what Map() and Reduce() do.
The input provided to Map() is a set of data blocks. The Map() function breaks these data blocks into tuples,
which are nothing but key-value pairs. These key-value pairs are then sent as input to Reduce(). The
Reduce() function combines these tuples based on their key, forms a new set of tuples, and performs an
operation such as sorting or summation, which is then sent to the final output node. Finally, the output is
obtained.
The data processing done in the Reducer always depends on the business requirement of that industry.
This is how first Map() and then Reduce() are utilized, one after the other. A minimal word-count sketch in
Java follows.
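To make the Map() and Reduce() roles concrete, here is a minimal word-count sketch using the Hadoop MapReduce Java API: the Mapper emits (word, 1) pairs and the Reducer sums the counts for each key. The class names are illustrative, and the job-submission boilerplate is omitted.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map phase: split each input line into words and emit (word, 1).
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);            // key = word, value = 1
      }
    }
  }

  // Reduce phase: for each word, sum all the 1s produced by the mappers.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));   // key = word, value = total count
    }
  }
}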
Let’s understand the Map Task and Reduce Task in detail.
Map Task:

• RecordReader: The purpose of the RecordReader is to break the input into records. It is responsible for
providing key-value pairs to the Map() function: the key is the record's positional (location) information
and the value is the data associated with it.
• Map: A map is a user-defined function whose job is to process the tuples obtained from the RecordReader.
The Map() function may generate no key-value pairs at all or may generate multiple pairs of these tuples.
• Combiner: The Combiner is used for grouping data in the Map workflow. It is similar to a local reducer;
the intermediate key-value pairs generated in the Map phase are combined with its help. Using a combiner
is optional.
• Partitioner: The Partitioner is responsible for routing the key-value pairs generated in the Mapper phase.
It generates the shards (partitions) corresponding to each reducer: the hash code of each key is taken and
its modulus with the number of reducers is computed, i.e., key.hashCode() % (number of reducers). A small
sketch matching this default behaviour is shown after this list.
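The following small sketch shows a partitioner that matches the default hash-based rule described above (non-negative hash of the key modulo the number of reducers); the class name is illustrative.

import org.apache.hadoop.mapreduce.Partitioner;

// Routes each key to a reducer using (non-negative hash of key) % number of reducers,
// which is the same rule as Hadoop's default HashPartitioner.
public class KeyHashPartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}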
Reduce Task

• Shuffle and Sort: The Reducer's task starts with this step. The process in which the Mapper generates
intermediate key-value pairs and transfers them to the Reducer task is known as shuffling. Through the
shuffling process the system can sort the data by key.
Shuffling begins as soon as some of the Map tasks are done, rather than waiting for every Mapper to finish,
which is why it is a faster process.
• Reduce: The main task of Reduce is to gather the tuples generated by Map and then perform sorting and
aggregation-type operations on those key-value pairs, depending on the key element.
• OutputFormat: Once all the operations are performed, the key-value pairs are written into a file with the
help of the RecordWriter, each record on a new line, with the key and value separated by a space.

2. HDFS
HDFS (Hadoop Distributed File System) is the storage layer of Hadoop. It is mainly designed to work on
commodity hardware (inexpensive devices), using a distributed file system design. HDFS is designed in such
a way that it prefers storing data in large blocks rather than in many small blocks.
HDFS provides fault tolerance and high availability to the storage layer and the other devices present in the
Hadoop cluster. The data storage nodes in HDFS are:

• NameNode(Master)
• DataNode(Slave)
NameNode: The NameNode works as the Master in a Hadoop cluster and guides the DataNodes (Slaves).
The NameNode mainly stores the metadata, i.e., the data about the data. Metadata can be the transaction
logs that keep track of user activity in the Hadoop cluster.
Metadata can also be the file name, its size, and information about the location (block number, block IDs)
on the DataNodes, which the NameNode stores so it can find the closest DataNode for faster communication.
The NameNode instructs the DataNodes with operations such as delete, create, and replicate.
DataNode: DataNodes work as Slaves. DataNodes are mainly used for storing the data in a Hadoop cluster;
the number of DataNodes can range from one to 500 or even more. The more DataNodes there are, the more
data the Hadoop cluster can store, so it is advisable that DataNodes have a high storage capacity to hold a
large number of file blocks.
High Level Architecture Of Hadoop


File Block in HDFS: Data in HDFS is always stored in terms of blocks. A single file is divided into multiple
blocks of 128 MB, which is the default size, and you can also change it manually.
Let's understand this concept of breaking a file into blocks with an example. Suppose you have uploaded a
file of 400 MB to HDFS; the file gets divided into blocks of 128 MB + 128 MB + 128 MB + 16 MB = 400 MB,
meaning 4 blocks are created, each of 128 MB except the last one. Hadoop does not know or care about
what data is stored in these blocks, so it simply treats the final, smaller block as a partial block. In the Linux
file system, the size of a file block is about 4 KB, which is very much smaller than the default block size in
the Hadoop file system. Hadoop is mainly configured for storing very large data, on the petabyte scale,
which is what makes the Hadoop file system different from other file systems: it can be scaled. Nowadays,
block sizes of 128 MB to 256 MB are typical in Hadoop.
Replication in HDFS: Replication ensures the availability of the data. Replication means making a copy of
something, and the number of copies made of that particular thing is its Replication Factor. As we saw with
file blocks, HDFS stores data in the form of blocks, and Hadoop is also configured to make copies of those
file blocks.
By default, the Replication Factor in Hadoop is set to 3, which can be configured, meaning you can change
it manually as per your requirement. In the example above we made 4 file blocks, which means that 3
replicas (copies) of each file block are made, so a total of 4 × 3 = 12 blocks are stored in the cluster.
This is because Hadoop runs on commodity hardware (inexpensive system hardware) that can crash at any
time; we are not using supercomputers for our Hadoop setup. That is why we need a feature in HDFS that
can make copies of the file blocks for backup purposes; this is known as fault tolerance.
One thing to note is that making so many replicas of our file blocks uses a lot of extra storage, but for large
organizations the data is far more important than the storage, so nobody minds this extra storage. You can
configure the Replication Factor in your hdfs-site.xml file. A small Java sketch for inspecting and changing
these settings for one file follows.
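As a small illustration, the Java sketch below uses the HDFS FileSystem API to read the block size and replication factor of one file and then raise that file's replication factor. The file path and the new factor are assumptions made for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/input/file.txt");    // hypothetical file used as an example
    FileStatus status = fs.getFileStatus(file);
    long blockSize = status.getBlockSize();          // e.g. 128 MB by default
    short replication = status.getReplication();     // e.g. 3 by default
    System.out.println("Block size (bytes): " + blockSize);
    System.out.println("Replication factor: " + replication);
    // Raise the replication factor of this one file to 4; the cluster default stays unchanged.
    fs.setReplication(file, (short) 4);
    fs.close();
  }
}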
Rack Awareness: A rack is nothing but a physical collection of nodes in the Hadoop cluster (maybe 30 to 40).
A large Hadoop cluster consists of many racks. With the help of this rack information, the NameNode
chooses the closest DataNode, achieving maximum performance while performing read/write operations and
reducing network traffic.
HDFS Architecture

3. YARN(Yet Another Resource Negotiator)

YARN is the framework on which MapReduce works. YARN performs two operations: job scheduling and
resource management. The purpose of the job scheduler is to divide a big task into small jobs so that each
job can be assigned to various slaves in the Hadoop cluster and processing can be maximized. The job
scheduler also keeps track of which job is important, which job has higher priority, dependencies between
jobs, and other information such as job timing. The resource manager's role is to manage all the resources
that are made available for running the Hadoop cluster.
Features of YARN

• Multi-Tenancy
• Scalability
• Cluster-Utilization
• Compatibility

4. Hadoop common or Common Utilities


Hadoop Common, or the common utilities, is nothing but the Java libraries and Java files (or scripts) that
are needed by all the other components present in a Hadoop cluster. These utilities are used by HDFS,
YARN, and MapReduce for running the cluster. Hadoop Common assumes that hardware failure in a Hadoop
cluster is common, so failures need to be handled automatically, in software, by the Hadoop framework.

Operations of HDFS:
Starting HDFS
Initially you have to format the configured HDFS file system, open namenode (HDFS server), and execute
the following command.
$ hadoop namenode -format
After formatting the HDFS, start the distributed file system. The following command will start the namenode
as well as the data nodes as cluster.
$ start-dfs.sh

Listing Files in HDFS


After loading the information into the server, we can find the list of files in a directory, or the status of a file,
using 'ls'. Given below is the syntax of ls; you can pass a directory or a filename as an argument.
$ $HADOOP_HOME/bin/hadoop fs -ls <args>

Inserting Data into HDFS


Assume we have data in a file called file.txt in the local system which ought to be saved in the HDFS file
system. Follow the steps given below to insert the required file into the Hadoop file system.
Step 1
You have to create an input directory.
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input
Step 2
Transfer and store a data file from local systems to the Hadoop file system using the put command.
$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input
Step 3
You can verify the file using ls command.
$ $HADOOP_HOME/bin/hadoop fs -ls /user/input
Retrieving Data from HDFS
Assume we have a file in HDFS called outfile. Given below is a simple demonstration for retrieving the
required file from the Hadoop file system.
Step 1
Initially, view the data from HDFS using cat command.
$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile
Step 2
Get the file from HDFS to the local file system using get command.
$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/
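The same insert and retrieve operations can also be performed programmatically. The sketch below mirrors the put and get commands above using the HDFS FileSystem Java API; the paths simply reuse the examples from this section.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutGet {
  public static void main(String[] args) throws Exception {
    // Configuration picks up core-site.xml / hdfs-site.xml from the classpath.
    FileSystem fs = FileSystem.get(new Configuration());
    // Equivalent of: hadoop fs -put /home/file.txt /user/input
    fs.copyFromLocalFile(new Path("/home/file.txt"), new Path("/user/input/file.txt"));
    // Equivalent of: hadoop fs -get /user/output/outfile /home/hadoop_tp/
    fs.copyToLocalFile(new Path("/user/output/outfile"), new Path("/home/hadoop_tp/outfile"));
    fs.close();
  }
}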

Shutting Down the HDFS


You can shut down the HDFS by using the following command.
$ stop-dfs.sh

Configuring Hadoop in a local or distributed environment:

Hadoop is an open-source framework which is mainly used for storing, maintaining, and analyzing a large
amount of data (datasets) on clusters of commodity hardware, which means it is essentially a data
management tool. Hadoop also possesses a scale-out storage property, which means that we can scale the
number of nodes up or down as per our future requirements, which is a really useful feature.

Hadoop Mainly works on 3 different Modes:


Standalone Mode
Pseudo-distributed Mode
Fully-Distributed Mode

1. Standalone Mode
In Standalone Mode none of the daemons run, i.e., NameNode, DataNode, Secondary
NameNode, JobTracker, and TaskTracker. (We use the JobTracker and TaskTracker for processing
purposes in Hadoop 1; in Hadoop 2 we use the Resource Manager and Node Manager.) Standalone
Mode also means that we are installing Hadoop on only a single system. By default, Hadoop is
made to run in this Standalone Mode, also called Local Mode. We mainly use
Hadoop in this mode for learning, testing, and debugging.
Hadoop runs fastest in this mode among the three modes. HDFS (Hadoop Distributed File System), one of
the major components of Hadoop, which is used for storage, is not utilized in this mode. You can think of
HDFS as analogous to the file systems available for Windows, i.e., NTFS (New Technology File System) and
FAT32 (File Allocation Table, which uses 32-bit allocation table entries). When Hadoop works in this mode
there is no need to configure the files hdfs-site.xml, mapred-site.xml, and core-site.xml for the Hadoop
environment. In this mode, all of your processes run in a single JVM (Java Virtual Machine), and this mode
can only be used for small development purposes.
2. Pseudo Distributed Mode (Single Node Cluster)
In Pseudo-distributed Mode we also use only a single node, but the main thing is that the cluster is
simulated, which means that all the processes inside the cluster run independently of each other. All the
daemons, i.e., NameNode, DataNode, Secondary NameNode, Resource Manager, Node Manager, etc., run as
separate processes in separate JVMs (Java Virtual Machines), or we can say as different Java processes;
that is why it is called pseudo-distributed.
One thing we should remember is that, since we are using only a single-node setup, all the Master and
Slave processes are handled by the single system. The NameNode and Resource Manager act as Masters,
and the DataNode and Node Manager act as Slaves. The Secondary NameNode also acts as a Master; its
purpose is simply to keep a periodic (e.g., hourly) backup of the NameNode metadata. In this mode,
Hadoop is used for both development and debugging purposes.
Our HDFS (Hadoop Distributed File System) is utilized for managing the input and output processes.
We need to change the configuration files mapred-site.xml, core-site.xml, and hdfs-site.xml to set up the
environment.

3. Fully Distributed Mode (Multi-Node Cluster)


This is the most important mode, in which multiple nodes are used: a few of them run the master
daemons, NameNode and Resource Manager, and the rest of them run the slave daemons, DataNode and
Node Manager. Here Hadoop runs on a cluster of machines (nodes), and the data that is used is distributed
across the different nodes. This is the production mode of Hadoop; let's understand this mode in more
physical terms.
When you download Hadoop as a tar or zip file, you can install it on a single system and run all the
processes there, but in fully distributed mode we extract this tar or zip file onto each of the nodes in the
Hadoop cluster and then use a particular node for a particular process. Once you distribute the processes
among the nodes, you define which nodes work as masters and which work as slaves.

HDFS Commands

HDFS is the primary or major component of the Hadoop ecosystem which is responsible for storing large
data sets of structured or unstructured data across various nodes and thereby maintaining the metadata in
the form of log files.

Commands:
1. ls: This command is used to list all the files. Use lsr for a recursive listing; it is useful when we want the
hierarchy of a folder.
Syntax:
bin/hdfs dfs -ls <path>
Example:
bin/hdfs dfs -ls /
It will print all the directories present in HDFS. The bin directory contains executables, so bin/hdfs means we
want the executables of hdfs, particularly the dfs (Distributed File System) commands.

2. mkdir: To create a directory. In Hadoop dfs there is no home directory by default. So let’s first create it.
Syntax:
bin/hdfs dfs -mkdir <folder name>

creating the home directory:

bin/hdfs dfs -mkdir /user

bin/hdfs dfs -mkdir /user/username -> write the username of your computer
Example:
bin/hdfs dfs -mkdir /geeks => '/' means absolute path
bin/hdfs dfs -mkdir geeks2 => Relative path -> the folder will be
created relative to the home directory.
3. touchz: It creates an empty file.
Syntax:
bin/hdfs dfs -touchz <file_path>
Example:
bin/hdfs dfs -touchz /geeks/myfile.txt

4. copyFromLocal (or) put: To copy files/folders from local file system to hdfs store. This is the most
important command. Local filesystem means the files present on the OS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
Example: Let’s suppose we have a file AI.txt on Desktop which we want to copy to folder geeks present on
hdfs.
bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /geeks

(OR)

bin/hdfs dfs -put ../Desktop/AI.txt /geeks

5. cat: To print file contents.


Syntax:
bin/hdfs dfs -cat <path>
Example:
// print the content of AI.txt present
// inside geeks folder.
bin/hdfs dfs -cat /geeks/AI.txt

6. copyToLocal (or) get: To copy files/folders from hdfs store to local file system.
Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
Example:
bin/hdfs dfs -copyToLocal /geeks ../Desktop/hero

(OR)

bin/hdfs dfs -get /geeks/myfile.txt ../Desktop/hero

7. moveFromLocal: This command will move a file from the local file system to hdfs.


Syntax:
bin/hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>
Example:
bin/hdfs dfs -moveFromLocal ../Desktop/cutAndPaste.txt /geeks

8. cp: This command is used to copy files within hdfs. Let's copy the folder geeks to geeks_copied.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -cp /geeks /geeks_copied

9. mv: This command is used to move files within hdfs. Let's cut-paste the file myfile.txt from the geeks folder
to geeks_copied.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied

10. rmr: This command deletes a file or directory from HDFS recursively. It is a very useful command when
you want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
Example:
bin/hdfs dfs -rmr /geeks_copied -> It will delete all the content inside the directory then the directory itself.
