SITA1603_Big Data - UNIT 1- Material
COURSE OBJECTIVES
1) To understand the dominant software systems and algorithms for coping with Big Data.
2) To apply appropriate analytics techniques and tools to analyze big data, create statistical
models, and identify insights.
3) To explore the ethical implications of big data research, particularly as they relate to the web.
Unit I – Introduction
Volume
The name Big Data itself refers to enormous size. Big Data is the vast volume of data generated daily
from many sources, such as business processes, machines, social media platforms, networks, human
interactions, and more.
For example, Facebook generates approximately a billion messages, records around 4.5 billion clicks of
the "Like" button, and receives more than 350 million new posts every day. Big data technologies are
built to handle such large amounts of data.
Variety
Big Data can be structured, semi-structured, or unstructured, and it is collected from many different
sources. In the past, data was gathered mainly from databases and spreadsheets; today it arrives in a
wide array of forms, including PDFs, emails, audio files, social media posts, photos, videos, and more.
Velocity
Velocity refers to the speed at which data is created, often in real time, and the speed at which it must
be processed. It covers the arrival rate of incoming data sets, their rate of change, and bursts of activity.
A primary requirement of Big Data systems is to make this rapidly arriving data available on demand.
Velocity therefore deals with the speed at which data flows in from sources such as application logs,
business processes, networks, social media sites, sensors, and mobile devices.
The following is a brief overview of several widely used big data technologies.
1. Apache Cassandra: A highly scalable, highly available NoSQL database. It supports replication of
data across multiple data centers, and its fault tolerance allows failed nodes to be replaced without
any downtime.
2. Apache Hadoop: Hadoop is one of the most widely used big data technologies. It handles large-scale
data through its distributed file system, HDFS, and provides parallel processing through the MapReduce
framework. Hadoop scales out across commodity machines, giving it very large storage capacity and
processing capability. As a real use case, NextBio uses Hadoop MapReduce and HBase to process
multi-terabyte data sets of human genome data.
3. Apache Hive: Hive is built on top of Hadoop and is used for data summarization, ad hoc querying,
and the analysis of large datasets using an SQL-like language called HiveQL, which makes Big Data easy
to query and analyze. It is not a relational database and is not intended for real-time queries. Its key
features include a design oriented toward OLAP, the SQL-like HiveQL language, and good speed,
scalability, and extensibility.
4. Apache Flume: A distributed and reliable system used to collect, aggregate, and move large amounts
of log data from many data sources to a centralized data store.
5. Apache Spark: Spark was introduced by the Apache Software Foundation to speed up Hadoop's
computational processing. It is not an updated or modified version of Hadoop; it can run independently
because it has its own cluster management, and combining it with Hadoop is only one way to deploy it.
Spark can use Hadoop in two ways, for storage and for processing, but since Spark has its own cluster
management for computation, it often uses Hadoop only for storage (HDFS). Key features of Spark
include interactive queries, stream processing, and in-memory cluster computing. A short word-count
sketch in Java is given after this list.
6. Apache Kafka: A distributed publish-subscribe messaging system; more specifically, it provides a
robust queue that can handle high volumes of data and pass messages from one endpoint (a sender)
to another (a receiver). Messages can be consumed in both offline and online modes. To prevent data
loss, Kafka messages are replicated within the cluster. Kafka is built on top of the ZooKeeper
synchronization service and integrates with Apache Storm and Spark for real-time streaming data
analysis. A minimal producer sketch in Java is given after this list.
7. MongoDB: A cross-platform, document-oriented database built around the concepts of collections
and documents; data is stored in a JSON-like format, and an index can be created on any attribute.
Its features include high availability, replication, rich queries, auto-sharding, and fast in-place updates.
A small Java example is given after this list.
8. Elasticsearch: A real-time, distributed, open-source full-text search and analytics engine. It is highly
scalable and can handle structured and unstructured data up to petabytes, and because it uses
document-based storage it can serve as a replacement for document stores such as MongoDB or
RavenDB. To improve search performance, it uses denormalization. As a real use case, it powers
enterprise search at large organizations such as Wikipedia and GitHub.
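To make a few of the technologies above more concrete, the following short Java sketches show typical usage. They are illustrative only; the addresses, topic, database, collection, and file names used (localhost:9092, app-logs, bigdata, posts, input.txt, and so on) are assumptions for the examples, not values taken from this material.

A minimal Spark word count (item 5), assuming the Spark Java API (spark-core) and a local master:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // Run locally for the example; a real deployment would point the master at a cluster.
        SparkConf conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("input.txt");           // hypothetical input path
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);                         // in-memory aggregation
            counts.saveAsTextFile("spark-output");                      // hypothetical output path
        }
    }
}

A minimal Kafka producer (item 6), assuming the kafka-clients library and a broker on localhost:9092:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // "app-logs" is a hypothetical topic; messages with the same key go to the same partition.
            producer.send(new ProducerRecord<>("app-logs", "web-01", "GET /index.html 200"));
        }
    }
}

A minimal MongoDB insert and query (item 7), assuming the mongodb-driver-sync library and a local server:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class MongoDemo {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("bigdata");              // hypothetical database
            MongoCollection<Document> posts = db.getCollection("posts");   // hypothetical collection
            posts.insertOne(new Document("user", "alice").append("likes", 42));  // JSON-like document
            posts.createIndex(Indexes.ascending("user"));                  // index on any attribute
            Document first = posts.find(new Document("user", "alice")).first();
            System.out.println(first == null ? "not found" : first.toJson());
        }
    }
}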
Hadoop Ecosystem:
Apache Hadoop is an open-source framework intended to make working with big data easier. For those
not acquainted with the technology, one question arises: what is big data? Big data is a term for data
sets that cannot be processed efficiently with traditional approaches such as an RDBMS. Hadoop has
made its place in industries and companies that need to work on large, sensitive data sets that require
efficient handling. Hadoop is a framework that enables the processing of large data sets stored across
clusters of machines. Being a framework, Hadoop is made up of several modules that are supported by
a large ecosystem of technologies.
Introduction: The Hadoop Ecosystem is a platform, or suite, that provides various services to solve big
data problems. It includes Apache projects as well as various commercial tools and solutions. There are
four major elements of Hadoop: HDFS, MapReduce, YARN, and Hadoop Common utilities. Most of the
other tools or solutions are used to supplement or support these major elements, and all of them work
collectively to provide services such as ingestion, analysis, storage, and maintenance of data.
Following are the components that collectively form a Hadoop ecosystem:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing
• Spark: In-Memory data processing
• PIG, HIVE: Query based processing of data services
• HBase: NoSQL Database
• Mahout, Spark MLLib: Machine Learning algorithm libraries
• Solr, Lucene: Searching and Indexing
• Zookeeper: Cluster management
• Oozie: Job Scheduling
Hadoop – Architecture:
Hadoop is a framework written in Java that uses a large cluster of commodity hardware to store and
maintain big data. It works on the MapReduce programming model, which was introduced by Google.
Today many big-brand companies, such as Facebook, Yahoo, Netflix, and eBay, use Hadoop to deal with
big data. The Hadoop architecture mainly consists of four components:
• MapReduce
• HDFS(Hadoop Distributed File System)
• YARN(Yet Another Resource Negotiator)
• Common Utilities or Hadoop Common
1. MapReduce
MapReduce is a programming model and processing engine that, in Hadoop 2, runs on top of the YARN
framework. Its major feature is performing distributed processing in parallel across a Hadoop cluster,
which is what makes Hadoop so fast; when you are dealing with Big Data, serial processing is no longer
of any use. MapReduce has two main tasks, divided phase-wise: in the first phase Map is used, and in
the next phase Reduce is used.
Here, the input is provided to the Map() function, its output is used as the input to the Reduce()
function, and after that we receive the final output. Let's understand what Map() and Reduce() do.
The input to Map() is a set of data blocks. The Map() function breaks these blocks into tuples, which
are simply key-value pairs. These key-value pairs are then sent as input to Reduce(). The Reduce()
function combines the tuples based on their key, performs an operation such as sorting or summation,
and sends the result to the final output node, where the output is obtained.
The processing done in the Reducer depends on the business requirement of the industry. This is how
Map() is used first and then Reduce(), one after the other.
Let's understand the Map Task and Reduce Task in detail; a complete WordCount example is given
after the task descriptions below.
Map Task:
• RecordReader: The purpose of the RecordReader is to break the input into records and provide key-
value pairs to the Map() function. The key is the record's locational information (its offset) and the
value is the data associated with it.
• Map: A map is a user-defined function that processes the tuples obtained from the RecordReader.
The Map() function may generate no key-value pairs at all, or it may generate multiple pairs.
• Combiner: The combiner groups the data in the Map workflow. It is similar to a local reducer: the
intermediate key-value pairs generated by the Map are combined with its help. Using a combiner is
optional.
• Partitioner: The partitioner fetches the key-value pairs generated in the Mapper phase and generates
the shards (partitions) corresponding to each reducer. It takes the hashcode of each key and computes
its modulus with the number of reducers (key.hashCode() % numberOfReducers) to decide which
reducer receives each pair.
Reduce Task:
• Shuffle and Sort: The reducer's task starts with this step. The process in which the mapper's
intermediate key-value pairs are transferred to the reducer is known as shuffling. Using the shuffling
process, the system can sort the data by key. Shuffling begins as soon as some of the map tasks are
done, rather than waiting for all mappers to finish, which makes the overall process faster.
• Reduce: The main task of Reduce is to gather the tuples generated by Map and then perform sorting
and aggregation on those key-value pairs according to their key.
• OutputFormat: Once all operations are performed, the key-value pairs are written into a file with the
help of a RecordWriter, each record on a new line, with the key and value separated by a delimiter
(a tab by default).
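To make the Map and Reduce phases concrete, here is a minimal sketch of the classic WordCount job written against the Hadoop MapReduce Java API. Input and output paths are passed as command-line arguments; class and variable names are chosen for the example only.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: the RecordReader hands each line to map(); we emit (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce phase: after shuffle and sort, all counts for one word arrive together and are summed.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local reducer, as described above
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}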
2. HDFS
HDFS (Hadoop Distributed File System) provides the storage layer. It is designed to run on commodity
hardware (inexpensive devices) using a distributed file system design, and it favors storing data in
large blocks rather than in many small blocks. HDFS provides fault tolerance and high availability to
the storage layer and the other nodes in the Hadoop cluster. The data storage nodes in HDFS are:
• NameNode(Master)
• DataNode(Slave)
NameNode: The NameNode works as the master in a Hadoop cluster and guides the DataNodes (slaves).
It mainly stores metadata, i.e. data about the data, such as transaction logs that track user activity in
the cluster. Metadata also includes the file name, size, and the location information (block number,
block IDs) of the DataNodes, which the NameNode uses to find the closest DataNode for faster
communication. The NameNode instructs the DataNodes to perform operations such as delete, create,
and replicate.
DataNode: DataNodes work as slaves and are mainly used to store the data in a Hadoop cluster. The
number of DataNodes can range from 1 to 500 or even more; the more DataNodes there are, the more
data the cluster can store. It is therefore advisable that each DataNode has high storage capacity so it
can hold a large number of file blocks.
HDFS operations:
Starting HDFS
Initially you have to format the configured HDFS file system, open namenode (HDFS server), and execute
the following command.
$ hadoop namenode -format
After formatting the HDFS, start the distributed file system. The following command will start the
NameNode as well as the DataNodes as a cluster.
$ start-dfs.sh
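Beyond the shell, HDFS can also be used programmatically through Hadoop's Java FileSystem API. The sketch below mirrors the shell operations shown later in this unit; the NameNode address hdfs://localhost:9000 and the paths /geeks and AI.txt are assumptions for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");    // assumed NameNode address (normally from core-site.xml)

        try (FileSystem fs = FileSystem.get(conf)) {
            Path dir = new Path("/geeks");                    // hypothetical HDFS directory
            fs.mkdirs(dir);                                   // same effect as: hdfs dfs -mkdir /geeks
            fs.copyFromLocalFile(new Path("AI.txt"), dir);    // same effect as: hdfs dfs -copyFromLocal
            for (FileStatus status : fs.listStatus(dir)) {    // same effect as: hdfs dfs -ls /geeks
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
        }
    }
}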
File Block In HDFS: Data in HDFS is always stored in terms of blocks. A single file is divided into
multiple blocks of 128MB each (the default size, which can also be changed manually).
Let's understand this concept of breaking a file into blocks with an example. Suppose you upload a
400MB file to HDFS. The file is divided into blocks of 128MB + 128MB + 128MB + 16MB = 400MB, i.e.
four blocks are created, each of 128MB except the last one. Hadoop does not know or care about what
data is stored in these blocks, so it simply treats the final, smaller block as a partial block. In the Linux
file system, the size of a file block is about 4KB, which is much smaller than the default block size in
the Hadoop file system. Hadoop is configured for storing very large data sets, up to petabytes, and this
large, scalable block size is part of what distinguishes HDFS from other file systems; nowadays block
sizes of 128MB to 256MB are typically used in Hadoop.
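The block layout of a file can also be inspected from the Java API. Below is a small sketch, assuming a hypothetical 400MB file /geeks/big-file.dat is already stored in HDFS and that the cluster configuration is on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            FileStatus status = fs.getFileStatus(new Path("/geeks/big-file.dat"));  // hypothetical file
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            System.out.println("Block size  : " + status.getBlockSize());   // 128MB by default
            System.out.println("Block count : " + blocks.length);           // 4 for a 400MB file
            for (BlockLocation block : blocks) {
                // Offset and length of each block; the last block of a 400MB file is only 16MB.
                System.out.println(block.getOffset() + " - " + (block.getOffset() + block.getLength()));
            }
        }
    }
}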
Replication In HDFS: Replication ensures the availability of the data. Replication means making copies
of something, and the number of copies made is its replication factor. As we saw with file blocks, HDFS
stores data as blocks, and Hadoop is also configured to make copies of those blocks.
By default, the replication factor in Hadoop is 3, and it can be changed manually as per your
requirement. In the example above we had 4 file blocks, which means 3 replicas (copies) of each block
are kept, for a total of 4 × 3 = 12 blocks stored for backup purposes.
This is because Hadoop runs on commodity hardware (inexpensive system hardware) that can crash at
any time; we are not using supercomputers for the Hadoop setup. That is why HDFS needs a feature
that keeps copies of file blocks for backup purposes; this is known as fault tolerance.
Note that making so many replicas of the file blocks consumes extra storage, but for large organizations
the data is far more important than the storage cost, so the overhead is accepted. The replication factor
can be configured in the hdfs-site.xml file.
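The replication setting can also be adjusted programmatically. Below is a small sketch that reuses the hypothetical file /geeks/AI.txt from the earlier examples; dfs.replication and FileSystem.setReplication() are the standard Hadoop property and API call for this.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");   // cluster-wide default, normally set in hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf)) {
            // Override the replication factor for one (hypothetical) file only.
            fs.setReplication(new Path("/geeks/AI.txt"), (short) 2);
        }
    }
}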
Rack Awareness: A rack is simply a physical collection of nodes in the Hadoop cluster (perhaps 30 to
40). A large Hadoop cluster consists of many racks. With this rack information, the NameNode chooses
the closest DataNode for read/write operations, which maximizes performance and reduces network
traffic.
3. YARN (Yet Another Resource Negotiator)
YARN is the framework on which MapReduce runs. YARN performs two operations: job scheduling and
resource management. The purpose of the job scheduler is to divide a big task into smaller jobs so that
each job can be assigned to various slaves in the Hadoop cluster and processing can be maximized. The
job scheduler also keeps track of which jobs are important, which jobs have higher priority, the
dependencies between jobs, and other information such as job timing. The resource manager manages
all the resources that are made available for running the Hadoop cluster. A small YarnClient sketch is
given after the feature list below.
Features of YARN
• Multi-Tenancy
• Scalability
• Cluster-Utilization
• Compatibility
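As a small illustration of the ResourceManager's role, the sketch below uses the YarnClient API to list the applications the ResourceManager is currently tracking. It assumes a running YARN cluster whose yarn-site.xml is on the classpath.

import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());   // reads the ResourceManager address from yarn-site.xml
        yarnClient.start();

        // Ask the ResourceManager for the applications it is tracking.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + "  " + app.getName()
                    + "  " + app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}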
Hadoop – Modes of Operation:
Hadoop is an open-source framework mainly used for storing, maintaining, and analyzing large
amounts of data (datasets) on clusters of commodity hardware; in other words, it is a data management
tool. Hadoop also possesses a scale-out storage property, which means the number of nodes can be
scaled up or down as per future requirements, which is a very useful feature.
1. Standalone Mode
In Standalone Mode none of the daemons run, i.e. NameNode, DataNode, Secondary NameNode,
JobTracker, and TaskTracker. (JobTracker and TaskTracker are used for processing in Hadoop 1; in
Hadoop 2 the ResourceManager and NodeManager are used instead.) Standalone Mode also means
that Hadoop is installed on a single system. By default, Hadoop is configured to run in this Standalone
Mode, also called Local Mode, and it is mainly used for learning, testing, and debugging.
Hadoop runs fastest in this mode among the three modes because HDFS (Hadoop Distributed File
System), the storage component of Hadoop, is not used; instead, the local file system is used, much
like the file systems available on Windows, e.g. NTFS (New Technology File System) or FAT32 (File
Allocation Table). When Hadoop works in this mode there is no need to configure the files
hdfs-site.xml, mapred-site.xml, or core-site.xml for the Hadoop environment. In this mode, all
processes run in a single JVM (Java Virtual Machine), so it can only be used for small development
purposes.
2. Pseudo Distributed Mode (Single Node Cluster)
In Pseudo-Distributed Mode we also use only a single node, but the key point is that the cluster is
simulated: all the daemons, i.e. NameNode, DataNode, Secondary NameNode, ResourceManager, and
NodeManager, run as separate processes in separate JVMs (Java Virtual Machines), i.e. as different
Java processes, which is why it is called pseudo-distributed.
One thing to remember is that since we are using a single-node setup, all the master and slave
processes are handled by the same system. The NameNode and ResourceManager act as masters,
while the DataNode and NodeManager act as slaves. The Secondary NameNode also acts as a master;
its purpose is to keep periodic (by default, hourly) checkpoints of the NameNode's metadata. In this
mode, Hadoop is used for both development and debugging purposes.
HDFS (Hadoop Distributed File System) is used for managing the input and output processes in this
mode. The configuration files mapred-site.xml, core-site.xml, and hdfs-site.xml need to be changed to
set up the environment.
HDFS Commands
HDFS is the primary component of the Hadoop ecosystem. It is responsible for storing large data sets
of structured or unstructured data across various nodes and for maintaining the metadata in the form
of log files.
Commands:
1. ls: This command is used to list all the files. Use -ls -R (or the older -lsr) for a recursive listing,
which is useful when you want the hierarchy of a folder.
Syntax:
bin/hdfs dfs -ls <path>
Example:
bin/hdfs dfs -ls /
It will print all the directories present in HDFS. The bin directory contains executables, so bin/hdfs
means we want the hdfs executable, and dfs selects the Distributed File System commands.
2. mkdir: To create a directory. In Hadoop dfs there is no home directory by default. So let’s first create it.
Syntax:
bin/hdfs dfs -mkdir <folder name>
3. copyFromLocal (or) put: To copy files/folders from the local file system to HDFS. This is one of the
most important commands. Local file system means the files present on the OS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
Example: Suppose we have a file AI.txt on the Desktop which we want to copy to the folder geeks
present on HDFS.
bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /geeks
(OR)
bin/hdfs dfs -put ../Desktop/AI.txt /geeks
4. copyToLocal (or) get: To copy files/folders from HDFS to the local file system.
Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
Example:
bin/hdfs dfs -copyToLocal /geeks ../Desktop/hero
(OR)
bin/hdfs dfs -get /geeks ../Desktop/hero
5. cp: This command is used to copy files within HDFS. Let's copy the folder geeks to geeks_copied.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -cp /geeks /geeks_copied
6. mv: This command is used to move files within HDFS. Let's move (cut-paste) the file myfile.txt from
the geeks folder to geeks_copied.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied
7. rmr: This command deletes a file or directory from HDFS recursively. It is very useful when you want
to delete a non-empty directory. (In newer Hadoop releases, -rm -r is the preferred replacement for the
deprecated -rmr.)
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
Example:
bin/hdfs dfs -rmr /geeks_copied -> deletes all the content inside the directory and then the directory itself.