Big Data 3rd Module
1. Hadoop History
Hadoop was started by Doug Cutting and Mike Cafarella in 2002, when they began
working on the Apache Nutch project. Apache Nutch aimed to build a search engine
system that could index one billion pages. After a lot of research on Nutch, they
concluded that such a system would cost around half a million dollars in hardware,
along with a monthly running cost of approximately $30,000, which is very expensive.
They realized that their project architecture would not be capable of handling billions
of pages on the web, so they started looking for a feasible solution that could reduce
the implementation cost as well as solve the problem of storing and processing large
datasets.
In 2003, they came across a paper published by Google that described the architecture
of Google’s distributed file system, GFS (Google File System), for storing very large
data sets. They realized that this paper could solve their problem of storing the very
large files being generated by the web crawling and indexing processes. But this paper
was only half the solution to their problem.
In 2004, Google published another paper, on the MapReduce technique, which was the
solution for processing those large datasets. This paper was the other half of the
solution for Doug Cutting and Mike Cafarella’s Nutch project. Both techniques
(GFS and MapReduce) existed only on paper outside Google; Google had not released
implementations of them. Doug Cutting knew from his work on Apache Lucene (a free
and open-source information retrieval software library, originally written in Java by
Doug Cutting in 1999) that open source is a great way to spread a technology to more
people. So, together with Mike Cafarella, he started implementing Google’s techniques
(GFS and MapReduce) as open source in the Apache Nutch project.
In 2005, Cutting found that Nutch was limited to clusters of only 20 to 40 nodes. He
soon realized two problems:
(a) Nutch wouldn’t achieve its potential until it ran reliably on larger clusters, and
(b) that looked impossible with just two people (Doug Cutting and Mike Cafarella).
The engineering task in the Nutch project was much bigger than he had realized, so he
started looking for a job with a company interested in investing in the effort. He found
Yahoo!, which had a large team of engineers eager to work on the project.
So in 2006, Doug Cutting joined Yahoo, taking the Nutch project with him. He wanted
to provide the world with an open-source, reliable, scalable computing framework, with
the help of Yahoo. At Yahoo he first separated the distributed computing parts from
Nutch and formed a new project, Hadoop. (He gave it the name Hadoop after his son’s
yellow toy elephant; it was easy to pronounce and was a unique word.) He then wanted
to make Hadoop work well on thousands of nodes, so with GFS and MapReduce as the
blueprint he started to work on Hadoop.
In 2007, Yahoo successfully tested Hadoop on a 1000-node cluster and started using it.
In January 2008, Yahoo released Hadoop as an open-source project to the ASF (Apache
Software Foundation), and in July 2008 the Apache Software Foundation successfully
tested a 4000-node cluster with Hadoop.
In 2009, Hadoop was successfully tested to sort a petabyte (PB) of data in less than
17 hours, handling billions of searches and indexing millions of web pages. Doug
Cutting then left Yahoo and joined Cloudera to take on the challenge of spreading
Hadoop to other industries.
In December 2011, the Apache Software Foundation released Apache Hadoop
version 1.0. Later, in August 2013, version 2.0.6 became available. Currently we have
Apache Hadoop version 3.0, which was released in December 2017.
2. HDFS Overview
The Hadoop File System (HDFS) was developed using a distributed file system design.
It runs on commodity hardware. Unlike other distributed systems, HDFS is highly
fault-tolerant and designed to use low-cost hardware.
HDFS holds very large amounts of data and provides easy access. To store such huge
data, files are stored across multiple machines. These files are stored in a redundant
fashion to protect the system from possible data loss in case of failure. HDFS also
makes applications available for parallel processing.
Features of HDFS
HDFS Architecture
Described below is the architecture of the Hadoop File System. HDFS follows a
master-slave architecture and has the following elements.
Namenode
The namenode is commodity hardware that contains the GNU/Linux operating system
and the namenode software; the namenode itself is software that can run on commodity
hardware. The system hosting the namenode acts as the master server and performs the
following tasks −
• Manages the file system namespace.
• Regulates clients’ access to files.
• Executes file system operations such as renaming, closing, and opening files and
directories.
Data node
The datanode is commodity hardware having the GNU/Linux operating system and the
datanode software. For every node (commodity hardware/system) in a cluster, there
will be a datanode. These nodes manage the data storage of their system.
• Datanodes perform read-write operations on the file system, as per client request.
• They also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode. (A short command-line sketch of these
client interactions follows this list.)
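
The division of labour between the namenode and datanodes can be observed from the
client side with the standard hdfs dfs shell commands. The short Python sketch below
simply wraps those commands with subprocess; the running cluster, the hdfs binary on
the PATH, and the local file sample.txt are assumptions made for illustration, not
details from this text.

import subprocess

def hdfs(*args):
    """Run one 'hdfs dfs' command and return whatever it printed (sketch only)."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Namespace operations -- served by the namenode:
hdfs("-mkdir", "-p", "/user/demo")
print(hdfs("-ls", "/user/demo"))

# Data operations -- block contents are written to and read from datanodes,
# while the namenode only tells the client where the blocks live:
hdfs("-put", "-f", "sample.txt", "/user/demo/sample.txt")
print(hdfs("-cat", "/user/demo/sample.txt"))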
Block
Generally, user data is stored in the files of HDFS. A file in the file system is divided
into one or more segments, which are stored in individual datanodes. These file
segments are called blocks. In other words, the minimum amount of data that HDFS
can read or write is called a block. The default block size is 64 MB (128 MB in
Hadoop 2.x and later), but it can be increased as needed by changing the HDFS
configuration.
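
The arithmetic behind blocks is simple, and the hedged sketch below only illustrates it:
a hypothetical 200 MB file is split into fixed-size blocks, and changing the configured
block size changes how many blocks the file occupies (in a real cluster this is set
through the HDFS configuration, e.g. the dfs.blocksize property).

def split_into_blocks(file_size, block_size):
    """Return the block sizes (in bytes) a file of file_size bytes would occupy in HDFS."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])

MB = 1024 * 1024
file_size = 200 * MB  # a hypothetical 200 MB file

# With a 64 MB block size the file needs 4 blocks; with 128 MB it needs only 2.
print([b // MB for b in split_into_blocks(file_size, 64 * MB)])   # [64, 64, 64, 8]
print([b // MB for b in split_into_blocks(file_size, 128 * MB)])  # [128, 72]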
Goals of HDFS
• Fault detection and recovery − Since HDFS includes a large number of commodity
hardware components, failure of components is frequent. Therefore, HDFS should
have mechanisms for quick and automatic fault detection and recovery.
• Huge datasets − HDFS should scale to hundreds of nodes per cluster to manage
applications having huge datasets.
• Hardware at data − A requested task can be done efficiently when the computation
takes place near the data. Especially where huge datasets are involved, this reduces
network traffic and increases throughput.
3. COMPONENTS OF HADOOP
Apache Pig: software for analysing large data sets that consists of a high-level
language, similar to SQL, for expressing data analysis programs, coupled with
infrastructure for evaluating these programs. It contains a compiler that produces
sequences of MapReduce programs.
Hive: a data warehousing application that provides an SQL interface and relational
model. The Hive infrastructure is built on top of Hadoop and helps in providing
summarization, query, and analysis.
Bigtop: used for packaging and testing the Hadoop ecosystem.
Oozie: a Java-based web application that runs in a Java servlet container. Oozie uses a
database to store the definition of a workflow, which is a collection of actions. It
manages Hadoop jobs.
Hadoop itself brings many advantages: the framework allows the user to quickly write
and test distributed systems. It is efficient, and it automatically distributes the data and
work across the machines and, in turn, utilizes the underlying parallelism of the CPU
cores. Hadoop does not rely on hardware to provide fault tolerance and high
availability (FTHA); rather, the Hadoop library itself has been designed to detect and
handle failures at the application layer.
4. ANALYSING BIG DATA WITH HADOOP
Big data analytics is the process of examining large volumes of many different
data types, or big data, in an effort to uncover hidden patterns, unknown
correlations and other useful information.
Big data analysis allows market analysts, researchers and business users to
develop deep insights from the available data, resulting in numerous business
advantages. Business users are able to make a precise analysis of the data and
the key early indicators from this analysis can mean fortunes for the business.
Some of the exemplary use cases are as follows:
• Whenever users browse travel portals or shopping sites, search for flights or
hotels, or add a particular item to their cart, ad-targeting companies can analyze
this wide variety of data and activity and provide better recommendations to the
user regarding offers, discounts and deals based on the user’s browsing history
and product history.
• In the telecommunications space, if customers are moving from one service
provider to another, then by analyzing huge volumes of call data records the
various issues faced by the customers can be unearthed. Issues could be as
wide-ranging as a significant increase in call drops or network congestion
problems. By analyzing these issues, it can be identified whether a telecom
company needs to place a new tower in a particular urban area, or whether it
needs to revive the marketing strategy for a particular region because a new
player has come up there. That way, customer churn can be proactively
minimized.
Terminology
• PayLoad − Applications implement the Map and the Reduce functions, and
form the core of the job.
• Mapper − Mapper maps the input key/value pairs to a set of intermediate
key/value pairs (a small sketch of this flow follows this list).
• NameNode − Node that manages the Hadoop Distributed File System (HDFS).
• DataNode − Node where data is presented in advance before any
processing takes place.
• MasterNode − Node where JobTracker runs and which accepts job requests
from clients.
• SlaveNode − Node where Map and Reduce program runs.
• JobTracker − Schedules jobs and tracks the assigned jobs to the TaskTracker.
• TaskTracker − Tracks the tasks and reports status to the JobTracker.
• Job − A program that is an execution of a Mapper and Reducer across a dataset.
• Task − An execution of a Mapper or a Reducer on a slice of data.
• Task Attempt − A particular instance of an attempt to execute a task on a
SlaveNode.
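
To make the Mapper/Reducer terminology concrete, here is a purely conceptual
word-count sketch that simulates the map, shuffle, and reduce phases in memory; it is
not the Hadoop API, just an illustration of how input key/value pairs become
intermediate pairs and then final results.

from collections import defaultdict

def mapper(key, value):
    """Map one input record (here: byte offset, line of text) to intermediate (word, 1) pairs."""
    for word in value.split():
        yield word.lower(), 1

def reducer(key, values):
    """Reduce all intermediate values collected for one key to a final (word, count) pair."""
    yield key, sum(values)

# A toy dataset: each record is an input key/value pair (line offset, line text).
records = [(0, "Hadoop stores data"), (19, "Hadoop processes data")]

# Map phase: the Mapper turns input pairs into intermediate pairs.
intermediate = defaultdict(list)
for key, value in records:
    for out_key, out_value in mapper(key, value):
        intermediate[out_key].append(out_value)

# The grouping above stands in for the shuffle/sort; Reduce phase:
results = dict(pair for key, values in intermediate.items() for pair in reducer(key, values))
print(results)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}

In a real job, the framework performs the grouping (shuffle/sort) across machines and
runs many mapper and reducer tasks in parallel.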
Apache Pig
Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze
large sets of data by representing them as data flows. Pig is generally used with
Hadoop; we can perform all the data manipulation operations in Hadoop using
Apache Pig.
To write data analysis programs, Pig provides a high-level language known as Pig
Latin. This language provides various operators using which programmers can
develop their own functions for reading, writing, and processing data.
To analyze data using Apache Pig, programmers need to write scripts using Pig
Latin language. All these scripts are internally converted to Map and Reduce
tasks. Apache Pig has a component known as Pig Engine that accepts the Pig
Latin scripts as input and converts those scripts into MapReduce jobs.
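
As a hedged sketch of what such a script looks like, the snippet below writes a tiny
Pig Latin word-count script and runs it with the command-line client in local mode;
the input file name, the script contents, and the availability of the pig command are
assumptions for illustration only.

import subprocess

# A tiny Pig Latin word-count script; 'input.txt' and the field name are hypothetical.
script = """
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group, COUNT(words);
DUMP counts;
"""

with open("wordcount.pig", "w") as f:
    f.write(script)

# '-x local' runs the script on the local machine; without it, the Pig Engine
# compiles the same script into MapReduce jobs on the cluster.
subprocess.run(["pig", "-x", "local", "wordcount.pig"], check=True)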
Why Do We Need Apache Pig?
Programmers who are not so good at Java normally used to struggle working with
Hadoop, especially while performing MapReduce tasks. Apache Pig is a boon for all
such programmers.
• Using Pig Latin, programmers can perform MapReduce tasks easily
without having to type complex codes in Java.
• Apache Pig uses a multi-query approach, thereby reducing the length of the code.
For example, an operation that would require you to type 200 lines of code (LoC)
in Java can be done by typing as few as 10 LoC in Apache Pig. Ultimately,
Apache Pig reduces the development time by almost 16 times.
• Pig Latin is an SQL-like language, and it is easy to learn Apache Pig when you
are familiar with SQL.
• Apache Pig provides many built-in operators to support data operations like
joins, filters, ordering, etc. In addition, it also provides nested data types
like tuples, bags, and maps that are missing from MapReduce.
Features of Pig
Apache Pig comes with the following features −
• Rich set of operators − It provides many operators to perform operations
like join, sort, filter, etc.
• Ease of programming − Pig Latin is similar to SQL and it is easy to write a
Pig script if you are good at SQL.
• Optimization opportunities − The tasks in Apache Pig optimize their
execution automatically, so the programmers need to focus only on
semantics of the language.
• Extensibility − Using the existing operators, users can develop their own
functions to read, process, and write data.
• UDFs − Pig provides the facility to create User Defined Functions in other
programming languages such as Java, and to invoke or embed them in Pig
scripts.
• Handles all kinds of data − Apache Pig analyzes all kinds of data, both
structured as well as unstructured. It stores the results in HDFS.
Apache Pig Vs MapReduce
Listed below are the major differences between Apache Pig and MapReduce.
• Apache Pig: Any novice programmer with a basic knowledge of SQL can work
conveniently with Apache Pig. MapReduce: Exposure to Java is a must to work with
MapReduce.
• Apache Pig: Uses a multi-query approach, thereby reducing the length of the code to
a great extent. MapReduce: Requires almost 20 times more lines of code to perform
the same task.
• Apache Pig: There is no need for compilation; on execution, every Apache Pig
operator is converted internally into a MapReduce job. MapReduce: MapReduce jobs
have a long compilation process.
Apache Pig Vs SQL
• Apache Pig: Schema is optional; we can store data without designing a schema
(values are referenced positionally as $0, $1, etc.). SQL: Schema is mandatory.
• Apache Pig: The data model is nested relational. SQL: The data model is flat
relational.
• Apache Pig: Provides limited opportunity for query optimization. SQL: There is
more opportunity for query optimization.
Apache Pig Vs Hive
• Apache Pig uses a language called Pig Latin, originally created at Yahoo. Hive uses
a language called HiveQL, originally created at Facebook.
• Apache Pig can handle structured, unstructured, and semi-structured data. Hive is
mostly for structured data.
Hive is not:
• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates
Components of Hive
• User Interface − Hive is a data warehouse infrastructure software that can create
interaction between the user and HDFS. The user interfaces that Hive supports are
Hive Web UI, Hive command line, and Hive HD Insight (on Windows Server).
• Meta Store − Hive chooses respective database servers to store the schema or
metadata of tables, databases, columns in a table, their data types, and the HDFS
mapping.
• HiveQL Process Engine − HiveQL is similar to SQL for querying the schema
information in the Metastore. It is one of the replacements for the traditional approach
of writing a MapReduce program: instead of writing a MapReduce program in Java,
we can write a HiveQL query for the MapReduce job and process it (a small query
sketch follows this list).
• Execution Engine − The conjunction part of the HiveQL Process Engine and
MapReduce is the Hive Execution Engine. The execution engine processes the query
and generates the same results as MapReduce. It uses the flavor of MapReduce.
• HDFS or HBASE − The Hadoop Distributed File System or HBASE are the data
storage techniques used to store data into the file system.
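
As a sketch of the HiveQL entry above, the snippet below submits a query to
HiveServer2 from Python using the third-party PyHive library. The host, port,
database, and the hypothetical employees table are assumptions; Hive itself turns the
query into the execution flow described in the next section.

from pyhive import hive  # third-party HiveServer2 client: pip install pyhive

# Connection details are assumptions; 10000 is HiveServer2's default port.
conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# Hive compiles this HiveQL into an execution plan (ultimately MapReduce jobs);
# the 'employees' table is hypothetical.
cursor.execute("SELECT dept, COUNT(*) AS headcount FROM employees GROUP BY dept")

for dept, headcount in cursor.fetchall():
    print(dept, headcount)

cursor.close()
conn.close()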
Working of Hive
The following steps describe the workflow between Hive and Hadoop, and how Hive
interacts with the Hadoop framework:
1. Execute Query − The Hive interface, such as the Command Line or Web UI, sends
the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2. Get Plan − The driver takes the help of the query compiler, which parses the query
to check the syntax and the query plan or the requirement of the query.
3. Get Metadata − The compiler sends a metadata request to the Metastore.
4. Send Metadata − The Metastore sends the metadata as a response to the compiler.
5. Send Plan − The compiler checks the requirement and resends the plan to the
driver. Up to here, the parsing and compiling of the query is complete.
6. Execute Plan − The driver sends the execute plan to the execution engine.
7. Execute Job − Internally, the execution of the plan is a MapReduce job. The
execution engine sends the job to the JobTracker, which is in the Name node, and it
assigns this job to the TaskTracker, which is in the Data node. Here, the query
executes the MapReduce job.
8. Fetch Result − The execution engine receives the results from the Data nodes.
9. Send Results − The execution engine sends those result values to the driver.
10. Send Results − The driver sends the results to the Hive interfaces.
5. Hadoop streaming
Hadoop streaming is a utility that comes with the Hadoop distribution. This utility
allows you to create and run Map/Reduce jobs with any executable or script as the
mapper and/or the reducer. The utility will create a Map/Reduce job, submit the
job to an appropriate cluster, and monitor the progress of the job until it
completes.
When a script is specified for mappers, each mapper task will launch the script as
a separate process when the mapper is initialized. As the mapper task runs, it
converts its inputs into lines and feeds the lines to the standard input (STDIN) of
the process. In the meantime, the mapper collects the line-oriented outputs from
the standard output (STDOUT) of the process and converts each line into a
key/value pair, which is collected as the output of the mapper. By default, the
prefix of a line up to the first tab character is the key and the rest of the line
(excluding the tab character) will be the value. If there is no tab character in the
line, then the entire line is considered the key and the value is null. However,
this can be customized as per one's needs.
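
A minimal sketch of such a mapper, here a Python word-count mapper (the language
and the task are illustrative choices, not requirements of Hadoop streaming): it reads
raw lines from STDIN and writes one tab-separated key/value line per word to
STDOUT.

#!/usr/bin/env python3
# mapper.py -- reads raw input lines from STDIN and emits "word<TAB>1" lines
# on STDOUT; the text before the first tab becomes the key, the rest the value.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")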
When a script is specified for reducers, each reducer task will launch the script as
a separate process when the reducer is initialized. As the reducer task runs, it
converts its input key/values pairs into lines and feeds the lines to the standard
input (STDIN) of the process. In the meantime, the reducer collects the line-
oriented outputs from the standard output (STDOUT) of the process, converts
each line into a key/value pair, which is collected as the output of the reducer. By
default, the prefix of a line up to the first tab character is the key and the rest of
the line (excluding the tab character) is the value. However, this can be
customized as per specific requirements.
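
And a matching reducer sketch: by the time lines reach the reducer's STDIN the
framework has already sorted them by key, so equal words arrive together and a
running total per word can be emitted.

#!/usr/bin/env python3
# reducer.py -- reads "word<TAB>count" lines, already sorted by word, from STDIN
# and emits "word<TAB>total" lines on STDOUT.
import sys

current_word, current_total = None, 0

for line in sys.stdin:
    word, _, count = line.rstrip("\n").partition("\t")
    if word != current_word:
        # A new key has started: flush the total for the previous one.
        if current_word is not None:
            print(f"{current_word}\t{current_total}")
        current_word, current_total = word, 0
    current_total += int(count or 0)

# Flush the final key.
if current_word is not None:
    print(f"{current_word}\t{current_total}")

A job built from these two scripts would typically be submitted with the
hadoop-streaming jar, passing the scripts via the -mapper and -reducer options
(and shipping them to the cluster with -files), along with -input and -output paths.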