Big Data SV Publication
B.Sc.
(DATA SCIENCE)
III YEAR VI SEMESTER
PAPER-VII (A)
BIG DATA
Mr. G. Venkata Subba Reddy
Head, Department of Computer Science
S.V. PUBLICATIONS
HYDERABAD.
SYLLABUS
Unit - I
Getting an overview of Big Data: Introduction to Big Data, Structuring Big Data,
Types of Data, Elements of Big Data, Big Data Analytics, Advantages of Big Data
Analytics.
Introducing Technologies for Handling Big Data: Distributed and Parallel
Computing for Big Data, Cloud Computing and Big Data, Features of Cloud
Computing, Cloud Deployment Models, Cloud Services for Big Data, Cloud
Providers in Big Data Market.
Unit - II
Understanding Hadoop Ecosystem: Introducing Hadoop, HDFS and
MapReduce, Hadoop functions, Hadoop Ecosystem.
Hadoop Distributed File System- HDFS Architecture, Concept of Blocks in
HDFS Architecture, Namenodes and Datanodes, Features of HDFS, MapReduce.
Introducing HBase- HBase Architecture, Regions, Storing Big Data with HBase,
Combining HBase and HDFS, Features of HBase, Hive, Pig and Pig Latin, Sqoop,
ZooKeeper, Flume, Oozie.
Unit - III
Understanding MapReduce Fundamentals and HBase: The MapReduce
Framework, Exploring the features of MapReduce, Working of
MapReduce, Techniques to optimize MapReduce Jobs, Hardware/Network
Topology, Synchronization, File system, Uses of MapReduce, Role of HBase in
Big Data Processing, Characteristics of HBase.
Understanding Big Data Technology Foundations: Exploring the Big Data
Stack, Data Sources Layer, Ingestion Layer, Storage Layer, Physical Infrastructure
Layer, Platform Management Layer, Security Layer, Monitoring Layer,
Visualization Layer.
Unit - IV
Storing Data in Databases and Data Warehouses: RDBMS and Big Data, Issues
with Relational Model, Non – Relational Database, Issues with Non Relational
Database, Polyglot Persistence, Integrating Big Data with Traditional Data
Warehouse, Big Data Analysis and Data Warehouse.
NoSQL Data Management: Introduction to NoSQL, Characteristics of NoSQL,
History of NoSQL, Types of NoSQL Data Models- Key Value Data Model,
Column Oriented Data Model, Document Data Model, Graph Databases,
Schema-Less Databases, Materialized Views, CAP Theorem.
Contents
UNIT-I
Getting an overview of Big Data, Introducing Technologies for
Handling Big Data
1.1 Introduction to Big Data 4 – 7
1.2 Structuring Big Data 7 – 8
1.3 Types of Data 8 – 9
1.4 Elements of Big Data 9 – 10
1.5 Big Data Analytics & Advantages of Big Data Analytics 10 – 12
1.6 Distributed and Parallel Computing for Big Data 13 – 17
1.7 Cloud Computing and Big Data 17 – 19
1.8 Deployment Models 19 - 21
1.9 Cloud Deployment and Service Models 21 – 22
1.10 Cloud Providers in Big Data Markets 23
Multiple Choice Questions, Fill in the Blanks, Very Short Questions 24 – 26
UNIT-II
Understanding Hadoop Ecosystem, Hadoop Distributed File System,
Introducing HBase
2.1 Introducing Hadoop 31 – 32
2.2 Hadoop Ecosystem & Components 32 – 35
2.2.1 Basic HDFS DFS Commands 35 – 36
2.2.2 Features of HDFS 36 – 37
2.3 MapReduce 37 - 39
2.5 Regions 41 – 42
UNIT-III
Understanding MapReduce Fundamentals and HBase, Understanding
Big Data Technology Foundations
3.1 The MapReduce Frame Work 57
3.2 Exploring the features of MapReduce 57 -58
3.3 Working of MapReduce 58 – 61
3.4 Techniques to Optimize MapReduce Jobs 61 – 62
3.5 Uses of MapReduce 62 - 63
3.6 Role of HBase in Big Data Processing 63 – 64
3.7 Exploring the Big Data Stack 65
3.8 Data Source Layer 65
3.9 Ingestion Layer 65 – 66
3.10 Storage Layer 66 – 67
3.11 Physical Infrastructure 67 – 68
3.12 Security Layer 68 – 69
3.13 Monitoring Layer 69 – 70
3.14 Visualization Layer 70 – 71
Multiple Choice Questions, Fill in the Blanks, Very Short Questions 72 - 73
UNIT-IV
Storing Data in Databases and Data Warehouses, NoSQL Data
Management
4.1 RDBMS and Big Data 79 – 82
4.2 Issues with the Relational Model 82
4.3 Non-Relational Database 83
4.4 Issues with Non-Relational Databases 83 – 84
4.5 Polyglot Persistence 84
4.6 Integrating Big Data with Traditional Data Warehouses 84 – 86
4.7 Big Data Analysis and Data Warehouse 86 - 87
4.7.1 Changing Deployment Models in Big Data Era 87 – 89
4.8 Introduction to NoSQL 90 – 91
4.9 Characteristics of NoSQL 91 – 92
4.10 History of NoSQL 92
4.11 Types of NoSQL Data Models- Key Value Data Model 92 – 93
4.12 Column Oriented Data Model 93 - 94
4.13 Document Data Model 94 – 95
4.14 Graph Databases 95 – 96
4.15 Schema-Less Databases 96
4.16 Materialized Views 96 – 97
4.17 CAP Theorem 97 - 98
Multiple Choice Questions, Fill in the Blanks, Very Short Questions 99 - 100
LAB PRACTICAL A - HH
Model Papers
FACULTY OF SCIENCE
B.Sc. (CBCS) VI - Semester Examination
Model Paper-I
Subject: Data Science
Paper – VII(A): BIG DATA
Time: 3 Hours Max. Marks: 80
PART - A
Note: Answer any EIGHT questions. (8 x 4 = 32 Marks)
1. What is Parallel Computing?
2. What is structured and unstructured data in big data?
3. Discuss about Cloud Delivery Models in Big Data
4. What is MapReduce?
5. List few HDFS Features.
6. Difference Between Hadoop and HBase.
7. Short notes on Security Layer.
8. What are the functions of Ingestion Layer?
9. Why is the order in which a function is executed very important in MapReduce?
10. What are the 5 features of NoSQL?
11. Discuss about the CAP Theorem.
12. Explain about Graph Data Model?
PART – B
Note: Answer ALL the questions. (4 x 12 = 48 Marks)
13. What are types of data in big data?
OR
What is parallel and distributed processing in big data?
14. What is a Hadoop ecosystem? What are the main components of Hadoop ecosystem?
OR
a) Which storage is used by HBase?
b) How does HDFS maintain Data Integrity?
15. a) What is Hadoop MapReduce and how does it work?
b) What techniques are used to optimize MapReduce jobs?
OR
c) What is redundant physical infrastructure?
d) Why is monitoring your big data pipeline important?
16. a) What problems do big data solutions solve?
b) What is the main problem with relational database for processing big data?
OR
c) What is a column-family data store in NoSQL Database? List the features of
column family database.
d) What is the CAP theorem? Explain.
FACULTY OF SCIENCE
B.Sc. (CBCS) VI - Semester Examination
Model Paper-II
Subject: Data Science
Paper – VII(A): BIG DATA
Time: 3 Hours Max. Marks: 80
PART - A
Note: Answer any EIGHT questions. (8 x 4 = 32 Marks)
1. What is the structure of big data? Why is structuring of big data needed?
2. Which tools are used by data analyst?
3. What are cloud providers give example?
4. How Does Hadoop Work?
5. Explain the MapReduce architecture in detail.
6. How are regions created in HBase?
7. Characteristics of HBase.
8. What is big data stack? Explain.
9. Why is the order in which a function is executed very important in MapReduce?
10. What problems do big data solutions solve?
11. What is a non-relational database?
12. What is a NoSQL key-value database?
PART – B
Note: Answer ALL the questions. (4 x 12 = 48 Marks)
13. a) What is Big Data Analytics and Why is it Important?
b) Which tools are used by data analyst?
OR
c) What is Big Data cloud in Cloud Computing?
d) What Are the Most Common Cloud Computing Service Delivery Models?
14. How does HDFS maintain Data Integrity?
OR
What is ZooKeeper? Explain the purpose of ZooKeeper in the Hadoop ecosystem.
15. What are the different use cases of MapReduce?
OR
What is the main purpose of security? Why is data security important in big data?
16. How To Apply Big Data Concepts to a Traditional Data Warehouse?
OR
What are the basic characteristics of a NoSQL database?
Objective
PART – A
Short Type Questions
Q1. What is Parallel Computing?
Ans:
Parallel computing is also known as parallel processing. It utilizes several processors, and
each of the processors completes the tasks allocated to it. In other words, parallel computing
involves performing numerous tasks simultaneously. Either a shared-memory or a
distributed-memory system can be used for parallel computing. In shared-memory systems,
all CPUs share a common memory; in distributed-memory systems, each processor has its
own local memory and processors exchange data by passing messages.
There are various benefits of using parallel and distributed computing: it enables scalability,
makes it simpler to share resources, and improves the efficiency of computation processes.
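To make the idea concrete, here is a minimal, illustrative Java sketch (not part of the
book's material) of shared-memory parallel computing: a thread pool splits an array into
chunks, each worker sums its chunk independently, and the partial results are combined.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

// Illustrative only: shared-memory parallel computing with a thread pool.
// Each worker sums one chunk of the array; the results are combined at the end.
public class ParallelSum {
    public static void main(String[] args) throws Exception {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;

        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        int chunk = data.length / workers;

        List<Future<Long>> partials = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            final int start = w * chunk;
            final int end = (w == workers - 1) ? data.length : start + chunk;
            // Each task is independent of the others, so all tasks can run simultaneously.
            partials.add(pool.submit(() -> {
                long sum = 0;
                for (int i = start; i < end; i++) sum += data[i];
                return sum;
            }));
        }

        long total = 0;
        for (Future<Long> f : partials) total += f.get(); // combine the partial results
        pool.shutdown();
        System.out.println("Total = " + total);
    }
}
```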
Q2. What is structured and unstructured data in big data?
Ans:
Structured Data: An example of structured data is the data stored in a relational database,
which can be managed using SQL (Structured Query Language). For example, employee data
(name, ID, designation, and salary) can be stored in a tabular format.
Unstructured Data: Unstructured data is data that does not have any structure. It can be
in any form; there is no pre-defined data model. We cannot store it in traditional databases,
and it is complex to search and process. Also, the volume of unstructured data is very high.
Examples of unstructured data are e-mail bodies, audio, video, images, archived documents, etc.
For SaaS to work, the infrastructure (IaaS) and the platform (PaaS) must be in place.
PART – B
Essay Type Questions
GETTING AN OVERVIEW OF BIG DATA
Big Data
Big Data is a collection of data that is huge in volume and grows exponentially with
time. Its size and complexity are so large that none of the traditional data management
tools can store or process it efficiently. In short, big data is still data, but of enormous size.
Skilled professionals: In order to develop, manage and run the applications that
generate insights, organizations need professionals who possess a high level of
proficiency in data science.
Other challenges: Other challenges of big data relate to the capture, storage,
search, analysis, transfer and security of big data.
Visualization: Big data refers to datasets whose size is typically beyond the storage
capacity of traditional database software tools. There is no explicit definition of how
big a data set should be for it to be considered big data. Data visualization (computer
graphics) is becoming popular as a separate discipline, and there are very few data
visualization experts.
From 1944 to 1980, many articles and presentations appeared that observed the
'information explosion' and the arising need for storage capacity.
1980:
In 1980, the sociologist Charles Tilly used the term big data in one sentence, "none of the
big questions has actually yielded to the bludgeoning of the big-data people," in his paper
"The old-new social history and the new old social history".
However, the term as used in this sentence is not in the context of the present meaning of
Big Data today. Now, moving on to 1997-1998, we see the actual use of big data in its
present context.
1997:
In 1997, Michael Cox and David Ellsworth published the article "Application-controlled
demand paging for out-of-core visualization" in the Proceedings of the IEEE 8th Conference
on Visualization.
The big data term appears in the sentence: "Visualization provides an interesting challenge
for computer systems: data sets are generally quite large, taxing the capacities of main memory,
local disk, and even remote disk. We call this the problem of big data. When data sets do not
fit in main memory (in core), or when they do not fit even on local disk, the most common
solution is to acquire more resources."
It was the first article in the ACM digital library that uses the term big data in its
modern context.
1998:
In 1998, John Mashey, who was Chief Scientist at SGI, presented a paper titled "Big
Data… and the Next Wave of Infrastress" at a USENIX meeting. John Mashey used this
term in his various speeches, and that is why he is credited with coining the term Big Data.
In the paper, he stated that "Recently, much good science, whether physical, biological,
or social, has been forced to confront—and has often benefited from—the 'Big Data'
phenomenon.
Big Data refers to the explosion in the quantity (and sometimes, quality) of available and
potentially relevant data, largely the result of recent and unprecedented advancements in
data recording and storage technology."
He is the one who linked big data term explicitly to the way we understand big data
today.
2001:
In 2001, Doug Laney, who was an analyst with the Meta Group (Gartner), presented a
research paper titled "3D Data Management: Controlling Data Volume, Velocity, and
Variety." The 3Vs have become the most accepted dimensions for defining big data.
This is by far the most widely understood form of the definition of Big Data today.
In 2005, Yahoo used Hadoop to process petabytes of data; Hadoop has since been made
open source under the Apache Software Foundation. Many companies now use Hadoop to
crunch Big Data.
Various sources generate a variety of data such as images, text, audio, etc. All such
different types of data can be structured only if they are sorted and organized in some logical
pattern. Thus, the process of structuring data requires one to first understand the various
types of data available today.
Traditional customer feedback systems are being replaced by new systems designed
with Big Data technologies. In these new systems, Big Data and natural language processing
technologies are used to read and evaluate consumer responses. This allows:
Early identification of risk to the product/services, if any
Better operational efficiency
Big Data technologies can be used for creating a staging area or landing zone for new
data before identifying what data should be moved to the data warehouse. In addition, such
integration of Big Data technologies and data warehouse helps an organization to offload
infrequently accessed data.
Technologies such as business intelligence (BI) tools and systems help organizations take
the unstructured and structured data from multiple sources. Users (typically employees)
input queries into these tools to understand business operations and performance. Big data
analytics uses the four data analysis methods to uncover meaningful insights and derive
solutions.
There are four main types of big data analytics that support and inform different
business decisions.
1. Descriptive analytics
Descriptive analytics refers to data that can be easily read and interpreted. This data
helps create reports and visualize information that can detail company profits and sales.
Example: During the pandemic, a leading pharmaceuticals company conducted data
analysis on its offices and research labs. Descriptive analytics helped them identify
unutilized spaces and departments that were consolidated, saving the company millions of
dollars.
2. Diagnostic analytics
Diagnostic analytics helps companies understand why a problem occurred. Big data
technologies and tools allow users to mine and recover data that helps dissect an issue and
prevent it from happening in the future.
Example: A clothing company's sales have decreased even though customers continue to
add items to their shopping carts. Diagnostic analytics helped the company understand that
the payment page had not been working properly for a few weeks.
3. Predictive analytics
Predictive analytics looks at past and present data to make predictions. With artificial
intelligence (AI), machine learning, and data mining, users can analyze the data to predict
market trends.
Example: In the manufacturing sector, companies can use algorithms based on historical
data to predict if or when a piece of equipment will malfunction or break down.
plays a critical role in measuring COVID-19 outcomes on a global scale. It informs health
ministries within each nation's government on how to proceed with vaccinations and
devises solutions for mitigating pandemic outbreaks in the future.
Q6. Which tools are used by data analyst?
Ans:
Tools used in big data analytics
Harnessing all of that data requires tools. Thankfully, technology has advanced so that
there are many intuitive software systems available for data analysts to use.
Hadoop: An open-source framework that stores and processes big data sets. Hadoop
is able to handle and analyze structured and unstructured data.
Spark: An open-source cluster computing framework used for real-time processing
and analyzing data.
Data integration software: Programs that allow big data to be streamlined across
different platforms, such as MongoDB, Apache, Hadoop, and Amazon EMR.
Stream analytics tools: Systems that filter, aggregate, and analyze data that might be
stored in different platforms and formats, such as Kafka.
Distributed storage: Databases that can split data across multiple servers and have
the capability to identify lost or corrupt data, such as Cassandra.
Predictive analytics hardware and software: Systems that process large amounts of
complex data, using machine learning and algorithms to predict future outcomes,
such as fraud detection, marketing, and risk assessments.
Data mining tools: Programs that allow users to search within structured and
unstructured big data.
NoSQL databases: Non-relational data management systems ideal for dealing with
raw and unstructured data.
Data warehouses: Storage for large amounts of data collected from many different
sources, typically using predefined schemas.
The components interact with one another so that they can reach a common purpose.
Maintaining component concurrency, overcoming the lack of a global clock, and controlling
the independent failure of components are three crucial issues of distributed systems. When one
component of the system fails, the whole system does not fail. Peer-to-peer applications, SOA-based
systems, and massively multiplayer online games are all examples of distributed systems.
lowering its price. The new software was developed to use this hardware by
automating activities such as load balancing and optimization over a large cluster of
nodes.
Built-in rules in the software recognized that certain workloads demanded a specific
level of performance. Using the virtualization technology, software regarded all
nodes as one giant pool of processing, storage, and networking assets. It shifted
processes to another node without interruption if one failed.
It wasn't that firms didn't want to wait for the results they required; it was simply
not financially feasible to purchase enough computing power to meet these new
demands. Because of the costs, many businesses would merely acquire a subset of
data rather than trying to gather all of it. Analysts wanted all of the data, but they
had to make do with snapshots to capture the appropriate data at the right time.
Consider our example program that detects cats in images. In a distributed computing
approach, a managing computer would send the image information to each of the worker
computers and each worker would report back their results.
In more complex architectures, worker nodes must communicate with other worker
nodes. This is necessary when using distributed computing to train a deep learning network.
Cluster computing has its own limitations; setting up a cluster requires physical space,
hardware operations expertise, and of course, money to buy all the devices and networking
infrastructure.
Fortunately, many companies now offer cloud computing services which give
programmers everywhere access to managed clusters. The companies manage the hardware
operations, provide tools to upload programs, and charge based on usage.
Distribution of functionality
Another form of distributed computing is to use different computing devices to execute
different pieces of functionality.
For example, imagine a zoo with an array of security cameras. Each security camera
records video footage in a digital format. The cameras send their video data to a computer
cluster located in the zoo headquarters, and that cluster runs video analysis algorithms to
detect escaped animals. The cluster also sends the video data to a cloud computing server
which analyzes terabytes of video data to discover historical trends.
This form of distributed computing recognizes that the world is filled with a range of
computing devices with varying capabilities, and ultimately, some problems are best solved
by utilizing a network of those devices.
Every application that uses the Internet is an example of distributed computing, but each
application makes different decisions about how it distributes the computing. For another
example, smart home assistants do a small amount of language processing locally to
determine that you've asked them for help but then send your audio to high-powered
servers to parse your full question.
In cloud computing, all data is gathered in data centres and then distributed to the end-
users. Further, automatic backup and recovery of data is also ensured for business
continuity; all such resources are available in the cloud.
We do not know the exact physical location of the resources provided to us. You just need
dummy terminals like desktops, laptops, phones, etc., and an internet connection.
The Roles & Relationship Between Big Data & Cloud Computing
Cloud Computing providers often utilize a "software as a service" model to allow
customers to easily process data. Typically, a console that can take in specialized commands
and parameters is available, but everything can also be done from the site's user interface.
Some products that are usually part of this package include database management systems,
cloud-based virtual machines and containers, identity management systems, machine
learning capabilities, and more.
In turn, Big Data is often generated by large, network-based systems. It can be in either a
standard or non-standard format. If the data is in a non-standard format, artificial
intelligence from the Cloud Computing provider may be used in addition to machine
learning to standardize the data.
From there, the data can be harnessed through the Cloud Computing platform and
utilized in a variety of ways. For example, it can be searched, edited, and used for future
insights.
This cloud infrastructure allows for real-time processing of Big Data. It can take huge
"blasts" of data from intensive systems and interpret it in real-time. Another common
relationship between Big Data and Cloud Computing is that the power of the cloud allows
Big Data analytics to occur in a fraction of the time it used to.
Cloud computing has many features that make it one of the fastest growing industries at
present. The flexibility offered by cloud services in the form of their growing set of tools and
technologies has accelerated its deployment across industries. The following are the
essential features of cloud computing.
servers you're utilizing and who controls them are defined by a cloud deployment model. It
specifies how your cloud infrastructure will look, what you can change, and whether you
will be given services or will have to create everything yourself. Relationships between the
infrastructure and your users are also defined by cloud deployment types.
Hybrid Cloud
Hybrid Cloud is a combination of the public cloud and the private cloud. We can say:
Hybrid Cloud = Public Cloud + Private Cloud
Hybrid cloud is partially secure because the services which are running on the public
cloud can be accessed by anyone, while the services which are running on a private cloud
can be accessed only by the organization's users.
Community Cloud
Community cloud allows systems and services to be accessible by a group of several
organizations to share the information between the organization and a specific community.
It is owned, managed, and operated by one or more organizations in the community, a third
party, or a combination of them.
Types of Cloud Computing
Most cloud computing services fall into three broad categories:
1. Software as a service (SaaS)
2. Platform as a service (PaaS)
3. Infrastructure as a service (IaaS)
These are sometimes called the cloud computing stack because they are built on top of
one another. Knowing what they are and how they are different, makes it easier to
accomplish your goals. These abstraction layers can also be viewed as a layered architecture
where services of a higher layer are composed of services of the underlying layer; that is,
SaaS can be built on top of PaaS, which in turn runs on IaaS.
Infrastructure as a Service (IaaS)
IaaS is also known as Hardware as a Service (HaaS). It is a computing infrastructure
managed over the internet. The main advantage of using IaaS is that it helps users to avoid
the cost and complexity of purchasing and managing the physical servers.
Characteristics of IaaS
There are the following characteristics of IaaS -
Resources are available as a service
Services are highly scalable
Dynamic and flexible
GUI and API-based access
Automated administrative tasks
In addition, there are many startups that have interesting products in the cloud space. Here
we have a list of major vendors of cloud computing. A few of the cloud providers are Google,
Citrix, Netmagic, Red Hat, Rackspace, etc. Amazon Web Services (AWS) is the leading cloud
provider amongst all. Microsoft also provides cloud services, which are called Azure.
Internal Assessment
Q7. List Type of Cloud Deployment Model
Ans:
Here are some important types of Cloud Deployment models:
Private Cloud: Resource managed and used by the organization.
Public Cloud: Resource available for the general public under the Pay as you go
model.
Community Cloud: Resource shared by several organizations, usually in the same
industry.
Hybrid Cloud: This cloud deployment model is partly managed by the service
provider and partly by the organization.
Objective
PART – A
Short Type Questions
Q1. What is Hadoop?
Ans:
Hadoop, as a Big Data framework, provides businesses with the ability to distribute data
storage, process data in parallel, and handle data of higher volume, velocity, variety, value,
and veracity. HDFS, MapReduce, and YARN are its three major components.
Hadoop HDFS uses NameNodes and DataNodes to store extensive data. MapReduce
processes the data stored on these nodes, and YARN acts as an operating system for Hadoop
in managing cluster resources.
Q2.What is MapReduce?
Ans:
MapReduce is a data processing tool which is used to process data in parallel in a
distributed form. It was developed in 2004, on the basis of the paper titled "MapReduce:
Simplified Data Processing on Large Clusters," published by Google.
MapReduce is a paradigm which has two phases, the mapper phase and the
reducer phase. In the Mapper, the input is given in the form of a key-value pair. The output
of the Mapper is fed to the reducer as input. The reducer runs only after the Mapper is over.
The reducer too takes input in key-value format, and the output of the reducer is the final
output.
HBase can store massive amounts of data, from terabytes to petabytes. The tables present
in HBase consist of billions of rows having millions of columns. HBase is built for low-
latency operations and has some specific features compared to traditional relational
models.
Often referred to as a key-value store or column-family-oriented database, it stores
versioned maps of maps.
Fundamentally, it is a platform for storing and retrieving data with random access.
It doesn't care about data types (you can store an integer in one row and a string in
another for the same column).
It doesn't enforce relationships within your data.
It is designed to run on a cluster of computers, built using commodity hardware.
PART – B
Essay Type Questions
UNDERSTANDING HADOOP ECOSYSTEM, HDFS
2.1 Introducing Hadoop
Q1. What is Hadoop? Components of Hadoop and How Does It Work.
Ans:
Hadoop is an open-source framework from Apache and is used to store, process and
analyze data that is very huge in volume. Hadoop is written in Java and is not OLAP
(online analytical processing). It is used for batch/offline processing. It is used by
Facebook, Yahoo, Google, Twitter, LinkedIn and many more. Moreover, it can be scaled up
just by adding nodes to the cluster.
Modules of Hadoop
HDFS: Hadoop Distributed File System. Google published its paper GFS and on the
basis of that HDFS was developed. It states that the files will be broken into blocks
and stored in nodes over the distributed architecture.
Yarn: Yet Another Resource Negotiator is used for job scheduling and to manage the
cluster.
Map Reduce: This is a framework which helps Java programs to perform parallel
computation on data using key-value pairs. The Map task takes input data and
converts it into a data set which can be computed as key-value pairs. The output of
the Map task is consumed by the Reduce task, and then the output of the reducer
gives the desired result.
Hadoop Common: These Java libraries are used to start Hadoop and are used by
other Hadoop modules.
The primary function of Hadoop is to process the data in an organised manner across a
cluster of commodity hardware. The client submits the data or program that needs
to be processed. Hadoop HDFS stores the data; YARN and MapReduce divide the resources
and assign the tasks to the data. Let's look at the working of Hadoop in detail.
The client input data is divided into 128 MB blocks by HDFS. Blocks are replicated
according to the replication factor: various DataNodes house the blocks and their
duplicates. For example, a 500 MB file is split into four blocks, and with the default
replication factor of 3, twelve block replicas are stored across the DataNodes.
The user can process the data once all blocks have been put on HDFS DataNodes.
The client sends Hadoop the MapReduce program to process the data.
The user-submitted program is then scheduled by the ResourceManager on particular
cluster nodes.
The output is written back to HDFS once processing has been completed by all
nodes.
Applications built using HADOOP are run on large data sets distributed across clusters
of commodity computers. Commodity computers are cheap and widely available. These are
mainly useful for achieving greater computational power at low cost.
Similar to data residing in a local file system of a personal computer system, in Hadoop,
data resides in a distributed file system which is called as a Hadoop Distributed File system.
The processing model is based on the 'Data Locality' concept, wherein computational logic is
sent to cluster nodes (servers) containing data. This computational logic is nothing but a
compiled version of a program written in a high-level language such as Java. Such a
program, processes data stored in Hadoop HDFS.
Map phase filters, groups, and sorts the data. Input data is divided into multiple splits.
Each map task works on a split of data in parallel on different machines and outputs a key-
value pair. The output of this phase is acted upon in the Reduce phase: the reduce task
aggregates the data, summarises the result, and stores it on HDFS.
YARN
YARN or Yet Another Resource Negotiator manages resources in the cluster and
manages the applications over Hadoop. It allows data stored in HDFS to be processed and
run by various data processing engines such as batch processing, stream processing,
interactive processing, graph processing, and many more. This increases efficiency with the
use of YARN.
HBase
HBase is a Column-based NoSQL database. It runs on top of HDFS and can handle any
type of data. It allows for real-time processing and random read/write operations to be
performed in the data.
Pig
Pig was developed for analyzing large datasets and overcomes the difficulty to write
map and reduce functions. It consists of two components: Pig Latin and Pig Engine.
Pig Latin is the Scripting Language that is similar to SQL. Pig Engine is the execution
engine on which Pig Latin runs. Internally, the code written in Pig is converted to
MapReduce functions and makes it very easy for programmers who aren't proficient in Java.
Q3. How do you explain HDFS architecture with an example?
Ans:
HDFS
Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed
over several machines and replicated to ensure durability against failure and high
availability for parallel applications.
It is cost-effective as it uses commodity hardware. It involves the concepts of blocks,
DataNodes and the NameNode.
HDFS Features
Unlike other distributed file system, HDFS is highly fault-tolerant and can be deployed
on low-cost hardware. It can easily handle the application that contains large data sets.
Highly Scalable - HDFS is highly scalable as it can scale to hundreds of nodes in a
single cluster.
Replication - Due to some unfavorable conditions, the node containing the data may
be lost. So, to overcome such problems, HDFS always maintains a copy of the data on
a different machine.
Fault tolerance - In HDFS, fault tolerance signifies the robustness of the system in
the event of failure. HDFS is so fault-tolerant that if any machine fails, another
machine containing a copy of that data automatically becomes active.
Distributed data storage - This is one of the most important features of HDFS that
makes Hadoop very powerful. Here, data is divided into multiple blocks and stored
across nodes.
Portable - HDFS is designed in such a way that it can easily be ported from one
platform to another.
Hadoop Architecture
The Hadoop architecture is a package of the file system, MapReduce engine and the
HDFS (Hadoop Distributed File System). The MapReduce engine can be MapReduce/MR1
or YARN/MR2.
HDFS follows the master-slave architecture and it has the following elements.
NameNode
It is a single master server that exists in the HDFS cluster.
As it is a single node, it may become a single point of failure.
It manages the file system namespace by executing operations such as opening,
renaming and closing files.
It simplifies the architecture of the system.
DataNode
The HDFS cluster contains multiple DataNodes.
Each DataNode contains multiple data blocks.
These data blocks are used to store data.
It is the responsibility of the DataNode to serve read and write requests from the file
system's clients.
It performs block creation, deletion, and replication upon instruction from the
NameNode.
Job Tracker
The role of the Job Tracker is to accept MapReduce jobs from clients and process the
data by using the NameNode.
In response, the NameNode provides metadata to the Job Tracker.
Task Tracker
It works as a slave node for the Job Tracker.
It receives tasks and code from the Job Tracker and applies that code to the file. This
process can also be called a Mapper.
COMMAND DESCRIPTION
-get Copy a file/folder from HDFS to the local file system
-getmerge Merge multiple files in an HDFS directory into a single local file
-count Count the number of directories, number of files and file size
-setrep Change the replication factor of a file
-mv Move files from source to destination within HDFS
-moveFromLocal Move a file/folder from the local disk to HDFS
-moveToLocal Move a file from HDFS to the local file system
-cp Copy files from source to destination
-tail Display the last kilobyte of the file
-touch Create a file, or change and modify its timestamps
-touchz Create a new file on HDFS with size 0 bytes
-appendToFile Append content to a file which is present on HDFS
-copyFromLocal Copy a file from the local file system to HDFS
-copyToLocal Copy files from HDFS to the local file system
-usage Return the help for an individual command
-checksum Return the checksum information of a file
-chgrp Change the group association of a file or a path
-chmod Change the permissions of a file
-chown Change the owner and group of a file
-df Display free space
-head Display the first kilobyte of the file
-createSnapshot Create a snapshot of a snapshottable directory
-deleteSnapshot Delete a snapshot from a snapshottable directory
-renameSnapshot Rename a snapshot
-expunge Empty the trash by creating a new checkpoint and removing older ones
-stat Print statistics about the file/directory
-truncate Truncate all files that match the specified file pattern to the specified length
-find Find files that match a specified expression in HDFS
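The table above lists the shell interface; the same file operations can also be performed
programmatically. Below is a minimal, hedged Java sketch using the Hadoop FileSystem
API; the NameNode URI (hdfs://localhost:9000) and the file path are illustrative
assumptions, not values taken from this book.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS would normally come from core-site.xml; this URI is only an assumption.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/sample.txt"); // hypothetical path

        // Write: HDFS splits the file into blocks and replicates them across DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("hello hdfs\n");
        }

        // Read: the client asks the NameNode for block locations, then streams from DataNodes.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}
```

In practice the configuration is usually picked up from the cluster's core-site.xml rather
than being set in code as above.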
High Availability
Hadoop HDFS is a highly available file system. In HDFS, data gets replicated among the
nodes in the Hadoop cluster by creating a replica of the blocks on the other slaves present in
HDFS cluster. So, whenever a user wants to access this data, they can access their data from
the slaves which contain its blocks.
High Reliability
HDFS provides reliable data storage. It can store data in the range of 100s of petabytes.
HDFS stores data reliably on a cluster. It divides the data into blocks. Hadoop framework
stores these blocks on nodes present in HDFS cluster.
HDFS stores data reliably by creating a replica of each and every block present in the
cluster, and hence provides fault tolerance. If a node in the cluster containing data goes
down, a user can easily access that data from the other nodes.
HDFS by default creates 3 replicas of each block containing data present in the nodes. So,
data is quickly available to the users, and the user does not face the problem of data loss.
Replication
Data Replication is a unique feature of HDFS. Replication solves the problem of data loss
in unfavorable conditions like hardware failure, crashing of nodes, etc. HDFS maintains the
process of replication at regular intervals of time.
HDFS also keeps creating replicas of user data on different machines present in the
cluster. So, when any node goes down, the user can access the data from other machines.
Thus, there is no possibility of losing user data.
Scalability
Hadoop HDFS stores data on multiple nodes in the cluster. So, whenever requirements
increase you can scale the cluster. Two scalability mechanisms are available in HDFS:
Vertical and Horizontal Scalability.
Distributed Storage
All the features in HDFS are achieved via distributed storage and replication. HDFS
stores data in a distributed manner across the nodes. In Hadoop, data is divided into blocks
and stored on the nodes present in the HDFS cluster.
After that, HDFS creates a replica of each and every block and stores them on other nodes.
When a single machine in the cluster crashes, we can easily access our data from the
other nodes which contain its replica.
2.3 MapReduce
Q6. Explain MapReduce Architecture in Big Data with Example.
Ans:
MapReduce is a data processing tool which is used to process data in parallel in a
distributed form. It was developed in 2004, on the basis of the paper titled "MapReduce:
Simplified Data Processing on Large Clusters," published by Google.
MapReduce is a paradigm which has two phases, the mapper phase and the
reducer phase. In the Mapper, the input is given in the form of a key-value pair. The output
of the Mapper is fed to the reducer as input. The reducer runs only after the Mapper is over.
The reducer too takes input in key-value format, and the output of the reducer is the final
output.
Steps in Map Reduce
The map takes data in the form of <key, value> pairs and returns a list of <key, value>
pairs. The keys will not be unique in this case.
Using the output of Map, sort and shuffle are applied by the Hadoop architecture.
This sort and shuffle act on the list of <key, value> pairs and send out unique
keys and a list of values associated with each unique key: <key, list(values)>.
The output of sort and shuffle is sent to the reducer phase. The reducer performs a
defined function on the list of values for each unique key, and the final output <key, value>
will be stored/displayed.
Input Splits
An input to a MapReduce job in Big Data is divided into fixed-size pieces called input
splits. An input split is a chunk of the input that is consumed by a single map task.
Mapping
This is the very first phase in the execution of a map-reduce program. In this phase, data in
each split is passed to a mapping function to produce output values. In our example, the job
of the mapping phase is to count the number of occurrences of each word from the input
splits (input splits are described above) and prepare a list in the form of <word, frequency>.
Shuffling
This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant
records from the Mapping phase output. In our example, the same words are clubbed
together along with their respective frequencies.
Reducing
In this phase, output values from the Shuffling phase are aggregated. This phase
combines values from Shuffling phase and returns a single output value.
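Putting the three steps together, the classic word-count program shows how the Map,
Shuffle and Reduce phases map onto the Hadoop MapReduce Java API. This is a standard
illustrative sketch (not code from this book); the input and output HDFS directories are
passed as command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit <word, 1> for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: after shuffle and sort, sum the counts for each unique word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```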
INTRODUCING HBASE
2.4 Introduction to HBase and Architecture
Q7. What is HBase and its architecture?
Ans:
HBase is an open-source, column-oriented, distributed database system in a Hadoop
environment. It was modelled on Google's Bigtable and is primarily written in Java.
Apache HBase is needed for real-time Big Data applications.
HBase can store massive amounts of data, from terabytes to petabytes. The tables present
in HBase consist of billions of rows having millions of columns. HBase is built for low-
latency operations and has some specific features compared to traditional relational
models.
2.5 Regions
Q8. How regions are created in HBase?
Ans:
HBase tables are divided horizontally by row key range into "Regions."
A region contains all rows in the table between the region's start key and end key.
Regions are assigned to the nodes in the cluster, called "Region Servers," and these
serve data for reads and writes.
A region server can serve about 1,000 regions.
Each region is 1 GB in size (default).
The basic unit of scalability and load balancing in HBase is called a region. Regions are
essentially contiguous ranges of rows stored together. They are dynamically split by the
system when they become too large. Alternatively, they may also be merged to reduce their
number and required storage files. An HBase system may have more than one region
server.
Initially there is only one region for a table, and as we start adding data to it, the
system monitors it to ensure that you do not exceed a configured maximum size. If
you exceed the limit, the region is split into two at the middle key of the region,
creating two roughly equal halves.
Each region is served by exactly one region server, and each of these servers can
serve many regions at any time.
Rows are grouped in regions and may be served by different servers.
In HBase, the following key terms represent the table schema (a minimal client sketch
follows the list):
Table: A collection of rows.
Row: A collection of column families.
Column Family: A collection of columns.
Column: A collection of key-value pairs.
Namespace: A logical grouping of tables.
Cell: A {row, column, version} tuple that exactly specifies a cell in HBase.
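As an illustration of how these terms appear in client code, here is a minimal, hedged Java
sketch using the HBase client API. The table name "employee", the column family
"personal", the qualifier "name" and the row key "emp001" are made-up examples, not
objects defined in this book.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRowExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employee"))) { // hypothetical table

            // Put: write one cell identified by {row key, column family, column qualifier}.
            Put put = new Put(Bytes.toBytes("emp001"));                 // row key
            put.addColumn(Bytes.toBytes("personal"),                    // column family
                          Bytes.toBytes("name"),                        // column qualifier
                          Bytes.toBytes("Ravi"));                       // value
            table.put(put);

            // Get: random read of a single row by its key.
            Result result = table.get(new Get(Bytes.toBytes("emp001")));
            byte[] name = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}
```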
File descriptor shortage—Since we keep files open on a loaded cluster, it does not take
too long to run out of file descriptors on the system. For example, suppose we have a
cluster of three nodes, each running an instance of a DataNode and a region server,
and we are performing an upload on a table that presently has 100 regions and 10
column families. Suppose every column family has two store files. In that case, we will
have 100 * 10 * 2 = 2000 files open at any time. Add to this the miscellaneous
descriptors consumed by open scanners and Java libraries. Each open file
consumes no less than one descriptor on the remote DataNode. The default limit of
file descriptors is 1024.
Too few DataNode threads—Essentially, the Hadoop DataNode has an upper limit
of 256 threads it can run at any given time. Suppose we use the same table
measurements cited earlier. It is not difficult to see how we can surpass this figure
in the DataNode, since each open connection to a file block consumes a thread.
If you look in the DataNode log, you will see an error like "xceiver count 256 exceeds
limit of concurrent xceivers 256"; therefore, you need to be careful in using the
threads.
Bad blocks—The DFSClient class in the region server will have a tendency to mark
file blocks as bad if the server is heavily loaded. Blocks can be recreated only
three times, after which the region server moves on from recreating the blocks.
If this recreation is performed during a period of heavy loading, we may
have two of the three blocks marked as bad. In the event that the third block is
discovered to be bad, we will see an error stating "No live nodes contain current block"
in the region server logs. During startup, you may face many issues as regions are
opened and deployed.
UI—HBase runs a web server on the master to present a perspective on the
state of your running cluster. It listens on port 60010 by default. The master UI
shows a list of basic attributes, e.g., software versions, cluster load, request rates, a
list of cluster tables, and participating region servers. In the master UI, click a region
server and you will be directed to the web server running on that individual region
server. It displays a list of regions carried by this server and metrics like consumed
resources and request rate.
Schema design—HBase tables are similar to RDBMS tables, except that HBase tables
have versioned cells, sorted rows, and columns. The other thing to keep in mind is
that an important attribute of a column (family)-oriented database, like HBase, is
that it can host extensive and lightly populated tables at no extra incurred cost.
Row keys—Spend a good amount of time defining your row key. It can be utilized for
grouping information as a part of routes. If your keys are integers, utilize a binary
representation instead of a persistent string form of the number, as it requires less
space (see the sketch below).
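A small sketch of that last point about binary row keys, using the HBase Bytes utility
class: the string form of a nine-digit number needs nine bytes, while the binary form of an
int always needs four, whatever its magnitude.

```java
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyEncoding {
    public static void main(String[] args) {
        int id = 987654321;

        // String form of the number: one byte per digit (9 bytes here).
        byte[] asString = Bytes.toBytes(Integer.toString(id));

        // Binary form: always 4 bytes for an int, regardless of its magnitude.
        byte[] asBinary = Bytes.toBytes(id);

        System.out.println("string key length = " + asString.length); // 9
        System.out.println("binary key length = " + asBinary.length); // 4
    }
}
```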
DataNodes are sometimes also called block servers. A block server primarily stores
data in a file system and maintains the metadata of a block. A block server carries out the
following functions:
Storage (and retrieval) of data on a local file system. HDFS supports different
operating systems and provides similar performance on all of them.
Storage of metadata of a block on the local file system on the basis of a similar
template on the NameNode.
Conduct of periodic validations for file checksums.
Intimation about the availability of blocks to the NameNode by sending reports
regularly.
On-demand supply of metadata and data to the clients where client application
programs can directly access data nodes.
Movement of data to connected nodes on the basis of the pipelining model.
High Availability: HBase offers LAN and WAN support with failover and
recovery. Basically, there is a master server, at the core, which handles monitoring of
the region servers as well as all metadata for the cluster.
Client API: Through Java APIs, it also offers programmatic access.
Scalability: In both linear and modular form, HBase supports scalability. In addition,
we can say it is linearly scalable.
Hadoop/HDFS integration: HBase runs on top of HDFS and can also run on top of
other file systems.
Distributed storage: This feature of HBase supports distributed storage such as
HDFS.
Data Replication: HBase supports data replication across clusters.
Failover Support and Load Sharing: By using multiple block allocation and
replications, HDFS is internally distributed and automatically recovered and HBase
runs on top of HDFS, hence HBase is automatically recovered. Also using
RegionServer replication, this failover is facilitated.
API Support: Because of Java APIs support in HBase, clients can access it easily.
MapReduce Support: For parallel processing of large volume of data, HBase
supports MapReduce.
Backup Support: In HBase “Backup support” means it supports back-up of Hadoop
MapReduce jobs in HBase tables.
Sorted Row Keys: Since searching is done on ranges of rows and HBase stores row
keys in lexicographical order, we can build optimized requests by using these sorted
row keys and timestamps.
Real-time Processing: In order to perform real-time query processing, HBase
supports block cache and Bloom filters.
Faster Lookups: While it comes to faster lookups, HBase internally uses Hash tables
and offers random access, as well as it stores the data in indexed HDFS files.
Type of Data: For both semi-structured as well as structured data, HBase supports
well.
Schema-less: There is no concept of fixed columns schema in HBase because it is
schema-less. Hence, it defines only column families.
High Throughput: Due to high security and easy management characteristics of
HBase, it offers unprecedented high write throughput.
Initially Hive was developed by Facebook, later the Apache Software Foundation took it
up and developed it further as an open source under the name Apache Hive. It is used by
different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Q14. What is Apache Pig? What is Pig Latin in Pig?
Ans:
Pig and Pig Latin
Pig is defined as "a platform for analysing large data sets that consists of a high-level
language for expressing data analysis programs, coupled with infrastructure for evaluating
these programs. The salient property of Pig programs is that their structure is amenable to
substantial parallelization, which in turn enables them to handle very large data sets."
Pig is used as an ETL tool for Hadoop. It makes Hadoop more approachable and usable
for non-technical persons. It opens an interactive and script-based execution environment
for non-developers with its language, Pig Latin. Pig Latin loads and processes input data
using a series of operations and transforms that data to produce the desired output. Pig can
execute in the following two modes:
Local—In this mode, all the scripts are executed on a single machine. This mode does
not require Hadoop MapReduce or HDFS.
MapReduce—In this mode, all the scripts are executed on a given Hadoop cluster.
This is termed the MapReduce mode.
Similar to Hadoop, Pig operates by creating a set of map and reduce jobs. The user need
not be concerned about writing code, compiling, packaging, and submitting the jobs to the
Hadoop cluster.
Pig Latin facilitates the extraction of the required information from Big Data in an
abstract way by focusing on the data and not on the structure of any custom software
program. Pig programs can be executed in three ways, all of which are compatible with the
local and Hadoop MapReduce modes of execution:
Script—Pig Latin commands are contained in a file having the .pig suffix. Pig
interprets and executes these commands in sequential order.
Grunt—It is a command interpreter. The commands are fed on the Grunt command
line and interpreted and executed by Grunt.
Embedded—This enables the execution of Pig programs as a part of a Java program.
Pig Latin comes with a very rich syntax and supports the following operations:
Loading and storing data
Streaming data
Filtering data
Grouping and joining data
Sorting data
Combining and splitting data
Apart from these, a wide variety of types, expressions, functions, diagnostic operators,
macros, and file system commands are also supported by Pig Latin.
ZooKeeper helps you to maintain configuration information, naming, and group services
for distributed applications. It implements various protocols on the cluster so that
applications do not have to implement them on their own. It provides a single coherent
view of multiple machines (a minimal client sketch follows the feature list below).
Features of Zookeeper
Synchronization − Mutual exclusion and co-operation between server processes.
Ordered Messages - The strict ordering means that sophisticated synchronization
primitives can be implemented at the client.
Reliability - The reliability aspects keep it from being a single point of failure.
Atomicity − Data transfer either succeeds or fails completely, but no transaction is
partial.
High performance - The performance aspects of Zookeeper means it can be used in
large, distributed systems.
Distributed, High availability, Fault-tolerant, Loose coupling, Partial failure.
High throughput and low latency - data is stored in memory as well as on disk.
Replicated.
Automatic failover - When a ZooKeeper server dies, the session is automatically
migrated over to another ZooKeeper server.
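The following is a minimal, hedged Java sketch of a ZooKeeper client maintaining a piece
of shared configuration. The connection string "localhost:2181", the znode path "/config"
and its value are assumptions for illustration only; a production client would also wait for
the connection event before issuing requests.

```java
import org.apache.zookeeper.*;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (the host:port is an illustrative assumption).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000,
                event -> System.out.println("Event: " + event.getState()));

        String path = "/config";                 // hypothetical znode holding shared configuration
        byte[] data = "replication=3".getBytes();

        // Create the znode if it does not exist; the write either succeeds or fails completely.
        if (zk.exists(path, false) == null) {
            zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any client in the cluster reads the same, single coherent view of this value.
        byte[] stored = zk.getData(path, false, null);
        System.out.println(new String(stored));
        zk.close();
    }
}
```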
APACHE FLUME
Ingesting data is an important part of our Hadoop Ecosystem.
Flume is a service which helps in ingesting unstructured and semi-structured
data into HDFS.
It gives us a solution which is reliable and distributed and helps us in collecting,
aggregating and moving large amount of data sets.
It helps us to ingest online streaming data from various sources like network traffic,
social media, email messages, log files etc. in HDFS.
Now, let us understand the architecture of Flume.
A Flume agent ingests the streaming data from various data sources into HDFS. A web
server typically acts as the data source, and Twitter is among the famous sources for
streaming data.
Sqoop
Sqoop is a tool used to transfer bulk data between Hadoop and external datastores, such
as relational databases (MS SQL Server, MySQL).
To process data using Hadoop, the data first needs to be loaded into Hadoop clusters
from several sources. However, it turned out that the process of loading data from several
heterogeneous sources was extremely challenging. The problems administrators
encountered included:
Maintaining data consistency
Ensuring efficient utilization of resources
Loading bulk data to Hadoop was not possible
Loading data using scripts was slow
The solution was Sqoop. Using Sqoop in Hadoop helped to overcome all the challenges
of the traditional approach and it could load bulk data from RDBMS to Hadoop with ease.
Internal Assessment
Objective
PART – A
Short Type Questions
Q1. What are the strengths of MapReduce?
Ans:
Apache Hadoop usually has two parts, the storage part and the processing part.
MapReduce falls under the processing part. Some of the various advantages of Hadoop
MapReduce are:
Scalability – The biggest advantage of MapReduce is its level of scalability, which is
very high and can scale across thousands of nodes.
Parallel nature – One of the other major strengths of MapReduce is that it is parallel
in nature, making it well suited to processing both structured and unstructured data at the
same time.
Memory requirements – MapReduce does not require large memory compared to
other Hadoop ecosystem components. It can work with a minimal amount of memory and still
produce results quickly.
Cost reduction – As MapReduce is highly scalable, it reduces the cost of storage and
processing in order to meet the growing data requirements.
Q2. Why is the order in which a function is executed very important in MapReduce?
Ans:
The point of MapReduce is to enable parallelism, or at least concurrency, among actions
within each of the major phases of the algorithm. And the key to increasing concurrency is
to remove ordering dependencies.
1. During the map phase, multiple workers can independently apply a mapping
function to each datum. The order doesn't matter here, as each element's mapping is
independent of all the others.
2. Based on the mapping, elements get sorted into groups for reduction. Each of the
workers arranges for the elements it mapped to end up in the right reduction group.
3. During the reduction phase, multiple workers can independently process each of the
groups, reducing the elements in that group to a single value representing the group.
So the only real ordering here is between the three phases. You need to do all of phase 1,
followed by all of phase 2, followed by all of phase 3. Within each phase, all of the work
executes in a somewhat arbitrary order.
Q3. List the Application Of MapReduce?
Ans:
Entertainment: To discover the most popular movies, based on what you like and
what you watched, Hadoop MapReduce can help you out. It mainly focuses
on users' logs and clicks.
E-commerce: Numerous e-commerce providers, like Amazon, Walmart, and eBay,
utilize the MapReduce programming model to identify favourite items
based on customers' preferences or purchasing behavior. This includes making
item-recommendation mechanisms for e-commerce inventories, and examining website records,
purchase histories, user interaction logs, etc.
Data Warehouse: We can utilize MapReduce to analyze large data volumes in data
warehouses while implementing specific business logic for data insights.
Fraud Detection: Hadoop and MapReduce are utilized in financial enterprises,
including organizations like banks, insurance providers, and payment companies, for
fraud detection, pattern identification, and business metrics
through transaction analysis.
PART – B
Essay Type Questions
UNDERSTANDING MAPREDUCE FUNDAMENTALS AND HBASE
3.1 The MapReduce Frame Work
Q1. What is a MapReduce program in Big Data? Or what type of framework does
MapReduce provide?
Ans:
MapReduce is a software framework that enables you to write applications that process vast
amounts of data, in parallel, on large clusters of commodity hardware, in a reliable and
fault-tolerant manner. This section also describes the architectural components of
MapReduce and lists the benefits of using MapReduce.
Prior to Hadoop 2.0, MapReduce was the only way to process data in Hadoop.
A MapReduce job usually splits the input data set into independent chunks, which
are processed by the map tasks in a completely parallel manner.
The framework sorts the outputs of the maps, which are then inputted to the reduce
tasks.
Typically, both the input and the output of the job are stored in a file system.
The framework takes care of scheduling tasks, monitors them, and re-executes the
failed tasks.
MapReduce Framework
MapReduce is a software framework that enables you to write applications that
process large amounts of data, in parallel, on large clusters of commodity hardware, in a
reliable and fault-tolerant manner. It integrates with HDFS and provides the same benefits
for parallel data processing. It sends computations to where the data is stored. The
framework:
– Schedules and monitors tasks, and re-executes failed tasks.
– Hides complex "housekeeping" and distributed computing tasks from
the developer.
– Divides the records into smaller chunks for efficiency, and each chunk is
executed serially on a particular compute engine.
– The output of the Map phase is a set of records that are grouped by the mapper
output key. Each group of records is processed by a reducer (again, these are
logically in parallel).
– The output of the Reduce phase is the union of all records that are produced by the
reducers.
separated into manageable blocks. The subtasks are executed independently from each other,
and then the results from all independent executions are combined to provide the complete
output.
A Map Task is a single instance of a MapReduce app. These tasks determine which
records to process from a data block. The input data is split and analyzed, in parallel, on the
assigned compute resources in a Hadoop cluster. This step of a MapReduce job prepares the
<key, value> pair output for the reduce step.
A Reduce Task processes an output of a map task. Similar to the map stage, all reduce
tasks occur at the same time, and they work independently. The data is aggregated and
combined to deliver the desired output. The final result is a reduced set of <key, value>
pairs which MapReduce, by default, stores in HDFS.
The final output we are looking for is: How many times the words Apache, Hadoop,
Class, and Track appear in total in all documents.
For illustration purposes, the example environment consists of three nodes. The input
contains six documents distributed across the cluster. We will keep it simple here, but in real
circumstances, there is no limit. You can have thousands of servers and billions of
documents.
1. First, in the map stage, the input data (the six documents) is split and distributed
across the cluster (the three servers). In this case, each map task works on a split
containing two documents. During mapping, there is no communication between the
nodes. They perform independently.
2. Then, map tasks create a <key, value> pair for every word. These pairs show how
many times a word occurs. A word is a key, and a value is its count. For example,
one document contains three of four words we are looking for: Apache 7 times, Class
8 times, and Track 6 times. The key-value pairs in one map task output look like this:
<apache, 7>
<class, 8>
<track, 6>
This process is done in parallel tasks on all nodes for all documents and gives a
unique output.
3. After input splitting and mapping completes, the outputs of every map task are
shuffled. This is the first step of the Reduce stage. Since we are looking for the
frequency of occurrence for four words, there are four parallel Reduce tasks. The
reduce tasks can run on the same nodes as the map tasks, or they can run on any
other node.
The shuffle step ensures the keys Apache, Hadoop, Class, and Track are sorted
for the reduce step. This process groups the values by keys in the form of <key,
value-list> pairs.
4. In the reduce step of the Reduce stage, each of the four tasks processes a <key, value-
list> to provide a final key-value pair. The reduce tasks also happen at the same time
and work independently.
In our example, each reduce task aggregates the counts for its word across all the documents to produce one final <key, value> pair per word.
The example we used here is a basic one. MapReduce performs much more complicated
tasks.
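To make the phases concrete, the listing below is a minimal sketch of the classic word-count program written against the Hadoop Java MapReduce API. It follows the structure of the standard Hadoop WordCount example rather than any code from this book: the mapper emits <word, 1> pairs, the framework shuffles and sorts them by key, and the reducer sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit <word, 1> for every word found in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);      // e.g. <apache, 1>
      }
    }
  }

  // Reduce phase: after shuffle and sort, sum the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);      // e.g. <apache, 7>
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}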
implementations rely on a master-slave style of distribution, where the master node
stores all the metadata, access rights, mapping and location of files and blocks, and
so on. The slaves are nodes where the actual data is stored. All the requests go to the
master and then are handled by the appropriate slave node. As you contemplate the
design of the file system you need to support a MapReduce implementation, you
should consider the following:
Keep it warm: As you might expect, the master node could get overworked
because everything begins there. Additionally, if the master node fails, the entire
file system is inaccessible until the master is restored. A very important
optimization is to create a "warm standby" master node that can jump into
service if a problem occurs with the online master.
The bigger the better: File size is also an important consideration. Lots of small
files (less than 100MB) should be avoided. Distributed file systems supporting
MapReduce engines work best when they are populated with a modest number
of large files.
The long view: Because workloads are managed in batches, highly sustained
network bandwidth is more important than quick execution times of the mappers
or reducers. The optimal approach is for the code to stream lots of data when it is
reading and again when it is time to write to the file system.
Keep it secure: But not overly so. Adding layers of security on the distributed file
system will degrade its performance. The file permissions are there to guard
against unintended consequences, not malicious behavior. The best approach is
to ensure that only authorized users have access to the data center environment
and to keep the distributed file system protected from the outside.
Entertainment
Hadoop MapReduce assists end users in finding the most
popular movies based on their preferences and previous
viewing history. It primarily concentrates on their clicks and
logs.
Many e-commerce vendors use the MapReduce programming model to identify popular
products based on customer preferences or purchasing behavior. Making item proposals for
e-commerce inventory is part of it, as is looking at website records, purchase histories, user
interaction logs, etc., for product recommendations.
Social media
Nearly 500 million tweets, or roughly 6,000 per second, are sent daily on the microblogging
platform Twitter. MapReduce processes Twitter data, performing operations such as
tokenization, filtering, counting, and aggregating counters.
Tokenization: It creates key-value pairs from the tokenized tweets by mapping the
tweets as maps of tokens.
Filtering: The terms that are not wanted are removed from the token maps.
Counting: It creates a token counter for each word.
Aggregate counters: Comparable counter values are grouped into small, manageable
units using aggregate counters.
Data warehouse
Systems that handle enormous volumes of information are known as data warehouse
systems. The star schema, which consists of a fact table and several dimension tables, is the
most popular data warehouse model. In a shared-nothing architecture, storing all the
necessary data on a single node is impossible, so retrieving data from other nodes is
essential.
This results in network congestion and slow query execution speeds. If the dimensions
are not too big, users can replicate them over nodes to get around this issue and maximize
parallelism. Using MapReduce, we may build specialized business logic for data insights
while analyzing enormous data volumes in data warehouses.
Fraud detection
Conventional methods of preventing fraud are not always very effective. For instance,
data analysts typically manage inaccurate payments by auditing a tiny sample of claims and
requesting medical records from specific submitters. Hadoop is a system well suited for
handling the large volumes of data needed to create fraud detection algorithms. Financial
businesses, including banks, insurance companies, and payment providers, use Hadoop and
MapReduce for fraud detection, pattern recognition, and business analytics
through transaction analysis.
HBase is a data model that is similar to Google's BigTable, designed to provide quick
random access to huge amounts of structured data. It
leverages the fault tolerance provided by the
Hadoop File System (HDFS).
It is a part of the Hadoop ecosystem that provides
random real-time read/write access to data in the
Hadoop File System.
One can store the data in HDFS either directly or through HBase. Data consumer
reads/accesses the data in HDFS randomly using HBase. HBase sits on top of the
Hadoop File System and provides read and write access.
Characteristics of HBase
It is linearly scalable across various nodes as well as modularly scalable, as it is divided
across various nodes.
HBase provides consistent reads and writes.
It provides atomic reads and writes, which means that during one read or write process, all other
processes are prevented from performing any read or write operations.
It provides an easy-to-use Java API for client access.
It supports Thrift and REST APIs for non-Java front ends, with XML,
Protobuf, and binary data encoding options.
It supports a Block Cache and Bloom Filters for real-time queries and for high-
volume query optimization.
HBase provides automatic failover support between Region Servers.
It supports exporting metrics via the Hadoop metrics subsystem to files.
It doesn't enforce relationships within your data.
It is a platform for storing and retrieving data with random access.
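As a brief, hedged illustration of this random read/write access, the HBase shell session below creates a table with one column family, writes two cells, and reads them back; the table and column-family names are assumptions for illustration only.

create 'employee', 'personal'
put 'employee', 'row1', 'personal:name', 'Ravi'
put 'employee', 'row1', 'personal:city', 'Hyderabad'
get 'employee', 'row1'
scan 'employee'
disable 'employee'
drop 'employee'

The get command retrieves a single row by its row key (a random read), while scan iterates over all rows in the table.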
The data obtained from the data sources has to be validated and cleaned before
introducing it for any logical use in the enterprise. The task of validating, sorting, and
cleaning data is done by the ingestion layer. The removal of noise from the data also takes
place in the ingestion layer.
volume, high velocity, and a variety of data, the ingestion layer validates, cleanses,
transforms, reduces, and integrates the unstructured data into the Big Data stack for further
processing. The functioning of the ingestion layer:
Identification: Data is categorised into various
known data formats or unstructured data is
assigned with default formats.
Filtration: The information relevant for the
enterprise is filtered on the basis of the
Enterprise Master Data Management (MDM)
repository.
Validation: The filtered data is analyzed against
MDM metadata.
Noise reduction: Data is cleaned by removing
the noise and minimising the related
disturbances.
Transformation: Data is split or combined on the basis of its type, contents, and the
requirement of the organisation.
Compression: The size of the data is reduced without affecting its relevance for the
required process. It should be remembered that compression does not affect the
analysis results.
Integration: The refined data set is integrated with the Hadoop storage layer, which
consists of Hadoop Distributed File System (HDFS) and NOSQL database.
Data ingestion in the Hadoop world means ELT (Extract, Load and Transform) as
opposed to ETL (Extract, Transform and Load) in case of traditional warehouses.
HDFS is a file system that is used to store huge volumes of data across a large number of
commodity machines in a cluster. The data can be in terabytes or petabytes. HDFS stores
data in the form of blocks of files and follows the write-once-read-many model to access
data from these blocks of files. The files stored in HDFS are operated upon by many
complex programs, as per the requirement.
Consider an example of a hospital that used to perform a periodic review of the data
obtained from the sensors and machines attached to the patients. This review helped doctors
to keep a check on the condition of terminal patients as well as analyze the effects of various
medicines on them. With time, the growing volume of data made it difficult for the hospital
staff to store and handle it. To find a solution, the hospital consulted a data analyst who
suggested the implementation of HDFS as an answer to this problem. HDFS can be
Earlier, we needed to have different types of databases, such as relational and non-
relational, for storing different types of data. However, now, all these types of data storage
requirements can be addressed by a single concept known as Not Only SQL (NoSQL)
databases. Some examples of NoSQL databases include HBase, MongoDB, AllegroGraph,
and InfiniteGraph.
infrastructures can be costly, but you can control the costs with cloud services, where
you only pay for what you actually use.
Cost: What can you afford? Because the infrastructure is a set of components, you
might be able to buy the "best" networking and decide to save money on storage (or
vice versa). You need to establish requirements for each of these areas in the context
of an overall budget and then make trade-offs where necessary.
Infrastructure designers should plan for these expected increases and try to create
physical implementations that are "elastic." As network traffic ebbs and flows, so too does
the set of physical assets associated with the implementation. Your infrastructure should
offer monitoring capabilities so that operators can react when more resources are required to
address changes in workloads.
Analytics Engine
The role of an analytics engine is to analyze huge amounts of unstructured data. This
type of analysis is related to text analytics and statistical analytics. Some examples of
different types of unstructured data that are available as large datasets include the
following:
Documents containing textual patterns
Text and symbols generated by customers or users using social media forums, such
as Yammer, Twitter, and Facebook
Machine generated data, such as Radio Frequency Identification (RFID) feeds and
weather data.
Data generated from application logs about upcoming or down time details or about
maintenance and upgrade details
Some statistical and numerical methods used for analyzing various unstructured data
sources:
Natural Language Processing
Text Mining
Linguistic Computation
Machine Learning
Search and Sort Algorithms
Syntax and Lexical Analysis
The following types of engines are used for analyzing Big Data:
Search engines — Big Data analysis requires extremely fast search engines with
iterative and cognitive data discovery mechanisms for analyzing huge volumes of
data. This is required because the data loaded from various sources has to be indexed
and searched for Big Data analytics processing.
Real-Time Engines — These days, real-time applications generate data at a very high
speed, and even data that is a few hours old becomes obsolete and useless as new data
continues to flow in. Real-time analysis is required in the Big Data environment to
analyze this type of data. For this purpose, real-time engines and NoSQL stores are
used.
Internal Assessment
Objective
The discussion of databases and data warehouses in terms of their utility in data
storage begins by describing RDBMS and its role in managing Big Data.
Non-relational databases are then introduced.
The concept of polyglot persistence is explained.
The integration of Big Data with traditional data warehouses, as well as Big Data
analysis and data warehousing, are discussed in detail.
The changes in deployment models in the era of Big Data are elaborated.
Various aspects of data management in NoSQL are discussed in detail. After
introducing NoSQL, the types of NoSQL data models are explained, describing the
key-value, column-oriented, and document data models.
The reader is also made familiar with the concept of materialized views as well as
distribution models.
PART – A
Short Type Questions
Q1. List the Characteristics of Non-Relational Databases.
Ans:
Non-relational database technologies have the following characteristics in common:
Scalability —It refers to the capability to write data across multiple data clusters
simultaneously, irrespective of physical hardware or infrastructure limitations.
Seamlessness —Another important aspect that ensures the resiliency of non-
relational databases is their capability to expand/contract to accommodate varying
degrees of increasing or decreasing data flows, without affecting the end-user
experience.
Data and Query Model — Instead of the traditional row/column, key-value
structure, non-relational databases use frameworks to store data with a required set
of queries or APIs to access the data.
Persistence Design—Persistence is an important element in non-relational
databases, ensuring faster throughput of huge amounts of data by making use of
dynamic memory rather than conventional reading and writing from disks. This is a
vital element in non-relational databases. Due to the high variety, velocity, and
volume of data, these databases utilize different mechanisms to maintain persistency of
data. The highest performance option is "in-memory," where the entire database is
stored in a huge cache, so as to avoid time-consuming read-write cycles.
Eventual Consistency—While RDBMS use ACID (Atomicity, Consistency, Isolation,
Durability) for ensuring data consistency, non-relational DBMS use BASE (Basically
Available, Soft state, Eventual consistency) to ensure that inconsistencies are resolved
as data propagates between the nodes in a distributed system.
not relate to one another. If you permit table JOINs, you are again treading into
the world of "NewSQL" rather than "NoSQL" (e.g., MongoDB's $lookup operator).
3. Schema-optional or schemaless. SQL requires all tables have pre-defined schemas.
NoSQL permits you to have a schema (schema optional) or may be schemaless.
These are anathema to the SQL RDBMS world, where you have to pre-define your
schema, have strong typing, and you don’t want to hear about sparse data models.
Schema-optional or schemaless permit dealing with a wider variety of data, and with
rapidly evolving data models.
4. Horizontal scalability: The NoSQL world was designed (ideally) to scale to the web,
or the cloud, or Internet of Things (IoT). Whereas SQL was designed (ideally) with
the enterprise in mind — a Fortune 500 company, for instance. Summarily, SQL was
designed to scale vertically for an enterprise ("one very big box" like a mainframe),
whereas NoSQL was designed to scale horizontally ("many little boxes, all alike" on
commodity hardware). However, while this was generally true in the past — a rule
of thumb — there are a few NoSQL systems ported to mainframes, and now some
SQL systems designed to scale horizontally. Ideally a database can be architected to
scale horizontally and vertically (e.g., Scylla).
5. Availability-focused (vs. Consistency-focused): SQL RDBMS’s grew up in the
world of big banking and other commercial use cases that required consistency in
transaction processing. The money was either in your account, or it wasn’t. And if
you checked your balance, it needed to be precise. Reloading it should not change
what’s in there. But that means the database needs to take its own sweet time to
make sure your balance is right. Whereas for the NoSQL world, the database needed
to be available. No blocking transactions. Which means that data may be eventually
consistent. i.e., reload and you’ll eventually see the right answer. That was fine with
use cases that were less mission-critical like social media posting or caching
ephemeral data like browser cookies. However, again, this is a rule of thumb from
the early days. Now, many NoSQL systems support ACID, two-phase commits,
strong consistency, and so on. Still, the prevalence is for many NoSQL systems to be
more aligned with the "AP" (Availability/Partition Tolerant) side than the "CP"
(Consistency/Partition Tolerant) side of the CAP theorem.
A typical approach to data flows involves data warehouses and data marts.
Organizations will inevitably continue to use data warehouses to manage the type of
structured and operational data that characterizes systems of record. These data warehouses
will still provide business analysts with the ability to analyze key data, trends, and so on.
However, the advent of big data is both challenging the role of the data warehouse and
providing a complementary approach.
It's inevitable that operational and structured data will have to interact in the world of
big data, where the information sources have not (necessarily) been cleansed or profiled.
Increasingly, organizations understand that they have a business requirement to be able to
combine traditional data warehouses with their historical business data sources with less
structured and vetted big data sources. A hybrid approach supporting traditional and big
data sources can help to accomplish these business goals.
Graph data models are based on a topological network structure.
In graph theory, we have terms like nodes, edges, and properties; let us see what
they mean in the graph-based data model.
Nodes: These are the instances of data that represent the objects to be tracked.
Edges: As we already know, edges represent relationships between nodes.
Properties: These represent the information associated with nodes.
PART – B
Essay Type Questions
STORING DATA IN DATABASES AND DATA WAREHOUSES
4.1 RDBMS and Big Data
Q1. Why should I care about big data? Why do I need a big data solution?
Ans:
A simple Database Management System (DBMS) stores data in the form of schemas or
tables comprising rows and columns. The main goal of a DBMS is to provide a solution for
storing and retrieving information in a convenient and efficient manner. The most common
way of fetching data from these tables is by using Structured Query Language (SQL). A
Relational Database Management System (RDBMS) stores the relationships between these
tables in columns that serve as a reference for another table. These columns are known as
primary keys and foreign keys, both of which can be used to reference other tables so that
data can be related between the tables and retrieved as and when it is required.
Such a database system usually consists of several tables and relationships between
those tables, which help in classifying the information contained in them. These tables are
also stored in Boyce-Codd Normal Form (BCNF), and a relationship is sustained within
these tables’ primary/foreign keys.
In addition to data files, a Data Warehouse (DWH) is also used for handling large
amounts of data or Big Data. A DWH can be defined as the association of data from various
sources that are created for supporting planned decision making. "A data warehouse is a
subject-oriented, integrated, time-variant and non-volatile collection of data in support of
management's decision-making process."
The primary goal of a data warehouse is to provide a consistent picture of the business at
a given point of time. Using various data warehousing toolsets, employees in an
organization can efficiently execute online queries and mine data according to their
requirement.
RDBMS presumes your data to be essentially correct beforehand. Mapping real-world
problems to a relational database model comprises many improvised strategies, but the
textbook approach recommends following a three-step process: conceptual, logical, and
physical modeling. But simply mapping the problem in such a way doesn't work. It makes
some incorrect assumptions, such as your information system staying consistent and
static all the time. However, in the real world, data is only partially legible.
The Internet has grown by leaps and bounds in the past two decades. Numerous
domains have been registered, more than a billion gigabytes of Web-space has been
reserved. With the digital revolution of the early 1990s aiding the personal computer
segment, Web transactions have grown rapidly. With the advent of various search engines,
lure of easily available information, and freely distributable information, the social media
platform has recorded a threefold increase in the volume of transactions. Although solutions
for such situations occurring in corporate scenarios have been there, coping with Web-based
transactions has turned out to be a compelling factor while addressing database-related
issues with Big Data solutions.
Big Data mainly takes three Vs into account: Volume, Variety, and Velocity. These three
terms can be briefly explained as follows:
Volume of Data—Big Data is designed to store and process a few hundred terabytes,
or even petabytes or zettabytes of data.
Variety of Data—Data collected in formats that do not suit relational
database systems is stored in a semi-structured or an unstructured format.
Velocity of Data—The rate of data arrival might make an enterprise data warehouse
problematic, particularly where formal data preparation processes like conforming,
examining, transforming, and cleansing of the data need to be accomplished before
it is stored in data warehouse tables.
One of the biggest difficulties with RDBMS is that it is not yet near the demand levels of
big data. The volume of data handling today is rising at a fast rate.
Big Data primarily comprises semi-structured data, such as social media sentiment
analysis and text mining data, while RDBMSs are more suitable for structured data, such as
weblog, sensor, and financial data.
Big Data solutions are adjusted for loading huge amounts of data using simple file
formats and highly distributed storage mechanisms, with initial processing of the data
occurring at every storage node. This means that once the data is loaded on to the cluster
storage, the data bulk need not be moved over the network for processing.
Big Data is an important tool when you need to manage data that is arriving very
quickly in random formats, which you can process later. You can store the data in clusters in
its original format, and then process it when required using a query that extracts the
required result set and stores it in a relational database, or makes it available for reporting.
CAP Theorem
CAP Theorem is also known as Brewer's theorem. It states that it is not possible for a
distributed system to provide all the following three conditions at the same point of time:
Consistency—Same data is visible by all the nodes.
Availability— Every request is answered, whether it succeeds or fails.
Partition-tolerance—Despite network failures, the system continues to operate.
Another important class of non-relational databases is the one that does not support the
relational model, but uses and relies on SQL as the primary means of manipulating existing
data. Despite relational and non-relational databases having similar fundamentals, the
difference lies in how they individually achieve those fundamentals.
In many non-relational databases, security is weaker than in relational databases,
which is a major concern. For example, MongoDB and Cassandra both lack
encryption for data files, have very weak authentication systems, and offer only very
simple authorization.
A lot of corporations still use relational databases for some data, but the increasing
persistence requirements of dynamic applications are growing from predominantly
relational to a mixture of data sources.
Polyglot persistence uses the same ideas as polyglot programming, which is the practice
of writing applications using a mix of languages in order to take full advantage of the fact
that various languages are suitable for solving different problems.
Polyglot is defined as someone who can speak and write in several languages. In data
storage, the term persistence means that data survives after the process with which it was
created has ended. In other words, data is stored in non-volatile storage.
To build a useful app for a user, developers are required to make clever improvisations
with the existing ways of storing and querying data. They need tools to provide the required
context-based content to the users.
Although Big Data and the traditional data warehouse often confront each other, they are more
likely to coexist than to give rise to a new version of Big Data. Data
warehouses are highly structured and adjusted for custom purposes. The relational data
model is here to stay for a long time, since organizations will continue to use data
warehouses to manage the organized and operational types of data that characterize record
systems.
A relationship between Big Data and a data warehouse can be best described as a hybrid
structure in which the well-structured, optimized, and operational data remains in the
heavily guarded data warehouse.
A hybrid approach helps in accomplishing these business goals, which supports a cross
between traditional and Big Data sources. The main challenges confronting the physical
architecture of the next-gen data warehouse platform include data availability, loading
storage performance, data volume, scalability, assorted and varying query demands against
the data, and operational costs of maintaining the environment.
The key methods that we are going to explain in brief are as follows:
Data Availability—Data availability is a well-known challenge for any system
related to transforming and processing data for use by end-users, and Big Data is no
different. The challenge is to sort and load the data, which is unstructured and in
varied formats. Also, context-sensitive data involving several different domains may
require another level of availability check. The data present in the Big Data hierarchy
is not updated; reprocessing new data containing updates will create duplicate data,
and this needs to be handled to minimize the impact on availability.
Pattern Study—Pattern study is nothing but the centralization and localization of
data according to the demands. A global e-commerce website can centralize requests
and fetch directives and results on the basis of end-user locations, so as to return only
meaningful contextual knowledge rather than impart the entire data to the user.
Trending topics are one of the pattern-based data study models that are a popular
mode of knowledge gathering for all platforms. The trending pattern to be found for
a given regional location is matched for occurrence in the massive data stream, in
terms of keywords or popularity of links as per the hits they receive, and based on
the geographical diversifications and stream filtering methods, data with similar
characteristics is conjoined to form a pattern.
Data Incorporation and Integration—since no guidebook format or schema
metadata exists, the data incorporation process for Big Data is about just acquiring
the data and storing as files. Especially in case of big documents, images or videos, if
such requirements happen to be the sole architecture driver, a dedicated machine can
be allocated for this task, bypassing the guesswork involved in the configuration and
setup process.
Data Volumes and Exploration—Traffic spikes and volatile surge in data volumes
can easily dislocate the functional architecture of corporate infrastructure due to the
fundamental nature of the data streams. On each cycle of data acquisition
completion, retention requirements for data can vary depending on the nature and
the freshness of the data and its core relevance to the business. Data exploration and
mining is an activity responsible for Big Data procurements across organizations and
also yields large data sets as processing output. These data sets are required to be
preserved in the system by occasional optimization of intermediary data sets.
Compliance and Localized Legal Requirements —Various compliance standards
such as Safe Harbor, GLBA, and PCI regulations can have some impact on data
security and storage. Therefore, these standards should be judiciously planned and
executed. Moreover, there are several cases of transactional data sets that are not stored
online being required by the courts of law. Big Data infrastructure can be used as a storage
engine for such data types, but the data needs to comply with certain standards and
additional security. Large volumes of data can affect overall performance, and if such
data sets are processed on the Big Data platform, the appliance configurator can
provide administrators with tools/tips to mark the data in its own area of
infrastructure zoning, minimizing both risk and performance impact.
Storage Performance — In all these years, storage-based solutions didn't advance as
rapidly as their counterparts, processors, memories, or cores did. Disk performance
is a vital point to be taken care of while developing Big Data systems and appliance
architecture can throw better light on the storage class and layered architecture. If a
combination of Solid State Drive (SSD), in-memory, and traditional storage
architecture is intended for Big Data processing, the exchange and persistence of data
across the different layers can be time-consuming and vapid.
A few parallels can be drawn between a data warehouse and Big Data solution. Both
store a lot of data. Both can be used for reporting and are managed by electronic storage
devices.
Data warehousing, on the other hand, is a group of methods and software to enable data
collection from functional systems, integration, and synchronization of that data into a
A big client of Argon Technology wants a solution for analyzing the data of 100,000
employees across the world. Assessing the performance manually of each employee is a
huge task for the administrative department before rewarding bonuses or increasing
salaries. Argon Technology experts set up a data warehouse in which information related to
each employee is stored. The administrative department extracts that information with the
help of Argon's Big Data solution easily and analyzes it before providing benefits to an
employee.
Complexity of the data warehouse environment has risen dramatically in recent years
with the influx of data warehouse appliance architectures, NoSQL/Hadoop databases, and
several API-based tools for many forms of cutting-edge analytics or real-time tasks. A Big
Data solution is preferred because there is a lot of data that has to be handled manually and
relationally. In organizations handling Big Data, if the data is used to its potential, it
can provide much valuable information, leading to superior decision making which, in turn,
can lead to more profitability, revenue, and happier customers.
On comparing a data warehouse to a Big Data solution, we find that a Big Data solution
is a technology, whereas data warehousing is an architecture.
Big Data is not a substitute for a data warehouse. Data warehouses work with abstracted
data that has been operated on, filtered, and transformed into a separate database, using
analytics such as sales trend analysis or compliance reporting. That database is updated
gradually with the same filtered data, on a weekly or monthly basis.
Organizations that use data warehousing technology will continue to do so and those
that use both Big Data and data warehousing are future-proof from any further
technological advancements only up to the point where the thin line of separation starts to
blur. Conventional data warehouse systems are proven systems, and with investments
to the tune of millions of dollars in their development, those systems are not going
anywhere soon. Regardless of how good and profitable Big Data analytics is or turns out,
data warehousing will still continue to provide crucial database support to many
enterprises, and in all circumstances, will complete the lifecycle of current systems.
companies to optimize these warehouses and limit the scope and size of the data being
managed.
The inevitable trend is toward hybrid environments that address the following
enterprise Big Data necessities:
Scalability and Speed — The developing hybrid Big Data platform supports parallel
processing, optimized appliances and storage, workload management, and dynamic
query optimization.
Agility and Elasticity—The hybrid model is agile, which means it is flexible and
responds rapidly in case of changing trends. It also provides elasticity, which means
this model can be increased or decreased as per the demands of the user.
Affordability and Manageability—The hybrid environment will integrate flexible
pricing, including licensed software, custom designed appliances, and cloud-based
approaches for future proofing. The hybrid data warehouse has become a real-world method of
producing an enhanced environment to support the transition to new information
management.
Appliance Model (also known as Commodity Hardware) — Appliance is a system
used in data centers to optimize data storage and management. It can be easy and
quick to implement, and offers low cost in terms of operation and maintenance. It
also integrates logical engines and tools to shorten the process of examining data
from several sources. The appliance is thus a single-purpose machine that usually
includes different interfaces to easily connect to an existing data warehouse. Though
various backup systems and central-node transfer techniques have made the present-day
appliance model fairly reliable, some oddities still
remain.
Cloud Deployment—Big Data business applications with cloud-based approaches
have an advantage that other methods lack.
o On-demand self-service—Enables the customer to use a self-propelled cloud
service with minimal interaction involving the cloud service provider
o Broad network access — Allows Big Data cloud resources to be available
over the network and accessible across different client platforms
o Multi-user —Allows cloud resources to be allocated so that isolation can be
guaranteed to multiple users, their computations, and data from one another
o Elasticity and scalability—Allows cloud resources to be elastically, rapidly,
and automatically scaled out, up, and down as the need be
o Measured service—Enables remote monitoring and billing of Big Data cloud
resources.
Some of the challenges for a Big Data architecture and cloud computing are as follows:
The data involved, and its magnitude and location. Big Data may end up unrelated
and start out in different locations. These sites may or may not be serviceable by a
cloud service.
The type of processing required on the data. Continuous or burst mode? Can the
data be parallelized?
For example, Google and Facebook are collecting terabytes of data daily for their users.
Such databases do not require fixed schema, avoid join operations, and scale data
horizontally.
In a NoSQL database, tables are stored as ASCII files, with each tuple represented by a
row and fields separated by tabs. The database is manipulated through shell scripts that
can be combined into UNIX pipelines. As its name suggests, NoSQL doesn't use SQL as a
query language.
There are lots of data storage options available in the market. And yet, many more are
still coming up as the nature of data, platforms, user requirements, architectures, and
processes are constantly changing. We are living in the era of Big Data and are in search of
ways of handling it. Since 2009, this has given impetus to the need for creating schema-free
databases that can handle large amounts of data. These databases are scalable, provide
high availability to users, support replication, and are distributed and possibly open source. One
such database is NoSQL.
NoSQL databases are still in the development stages and are going through a lot of
changes. Software developers who work on databases have a mixed opinion about NoSQL.
Some find it useful, while others point out flaws in it. Some are uncertain, and believe that
it's just another hyped technology that will eventually vanish due to its immaturity.
A NoSQL database stands for "Not Only SQL" or "Not SQL." Though a better term would
be "NoREL", NoSQL caught on. Carlo Strozzi introduced the NoSQL concept in 1998.
Traditional RDBMS uses SQL syntax to store and retrieve data for further insights.
Instead, a NoSQL database system encompasses a wide range of database technologies that
can store structured, semi-structured, unstructured and polymorphic data.
To resolve this problem, we could "scale up" our systems by upgrading our existing
hardware, but this process is expensive.
This comes in particularly useful while dealing with non-uniform data and custom fields.
We see relational databases as only one option for data storage. This point of view is often
referred to as polyglot persistence, and simply means using different data stores for different
circumstances.
We will now look at some of the most common features that define a basic NoSQL
database.
Non-relational
NoSQL databases never follow the relational model
Never provide tables with flat fixed-column records
Work with self-contained aggregates or BLOBs
Doesn’t require object-relational mapping and data normalization
No complex features like query languages, query planners, referential integrity,
joins, or ACID
Schema-free
NoSQL databases are either schema-free or have relaxed schemas
Do not require any sort of definition of the schema of the data
Offers heterogeneous structures of data in the same domain
Simple API
Offers easy-to-use interfaces for storing and querying the data provided
APIs allow low-level data manipulation and selection methods
Text-based protocols are mostly used, with HTTP REST and JSON
Mostly use non-standards-based NoSQL query languages
Web-enabled databases running as internet-facing services
Distributed
Multiple NoSQL databases can be executed in a distributed fashion
Offers auto-scaling and fail-over capabilities
Often ACID concept can be sacrificed for scalability and throughput
Mostly no synchronous replication between distributed nodes; instead, asynchronous multi-
master replication, peer-to-peer replication, or HDFS replication is used
Only provide eventual consistency
Shared-Nothing Architecture, which enables less coordination and higher distribution.
Then the relational database was created by E.F. Codd, and these databases answered the
question of having no standard way to store data. But later, relational databases also faced the
problem that they could not handle Big Data; because of this, there was a need for a database
that could handle every type of problem, and so the NoSQL database was developed.
1998- Carlo Strozzi used the term NoSQL for his lightweight, open-source relational
database
2000- Graph database Neo4j is launched
2004- Google BigTable is launched
2005- CouchDB is launched
2007- The research paper on Amazon Dynamo is released
2008- Facebook open-sources the Cassandra project
2009- The term NoSQL was reintroduced
An efficient and compact index structure is used by the key-value store to be able to
rapidly and dependably find a value using its key. For example, Redis is a key-value
store used to track lists, maps, heaps, and primitive types (which are simple data structures)
in a constant database. Redis exposes a very basic interface to query and
manipulate value types; by supporting only a predetermined number of value types, and
when properly configured, it can deliver high throughput.
Features
One of the simplest kinds of NoSQL data models.
For storing, getting, and removing data, key-value databases utilize simple functions.
Querying language is not present in key-value databases.
Built-in redundancy makes this database more reliable.
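As a brief, hedged illustration with the Redis command-line client: the first two commands below store and fetch a plain string by key, the LPUSH/LRANGE pair keeps a list under a key, and HSET/HGETALL keep a small map (hash) under a key. The key names and values are assumptions for illustration only.

SET user:42 "Ravi"
GET user:42
LPUSH recent:42 "page1"
LPUSH recent:42 "page2"
LRANGE recent:42 0 -1
HSET profile:42 city "Hyderabad" age "30"
HGETALL profile:42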
The columnar (column-oriented) data model of NoSQL is important. NoSQL databases are different from
SQL databases because they use a data model with a different structure than the
previously followed row-and-column table model used with relational database
management systems (RDBMS). NoSQL databases use a flexible schema model which is
designed to scale horizontally across many servers and is used with large volumes of data.
While a relational database stores data in rows and also reads the data row by row, a
column store is organized as a set of columns. So, if someone wants to run analytics on a
small number of columns, one can read those columns directly without consuming memory
on the unwanted data. Columns are often of the same type and benefit from more
efficient compression, which makes reads faster. Examples of the columnar data
model: Cassandra and Apache HBase.
Column-family databases store data in column families as rows that have many columns
associated with a row key. Column families are groups of related data that is often accessed
together.
The idea stems from graph theory in mathematics, where graphs represent data sets
using nodes, edges, and properties.
Nodes or points are instances or entities of data which represent any object to be
tracked, such as people, accounts, locations, etc.
Edges or lines are the critical concepts in graph databases which represent
relationships between nodes. The connections have a direction that is either
unidirectional (one way) or bidirectional (two way).
Properties represent descriptive information associated with nodes. In some cases,
edges have properties as well.
The following table outlines the critical differences between graph and relational
databases:
Relational databases have an advantage here because their lack of aggregate structure
allows them to support accessing data in different ways. Furthermore, they provide a
convenient mechanism that allows you to look at data differently from the way it’s stored
Views provide a mechanism to hide from the client whether data is derived data or base
data—but can’t avoid the fact that some views are expensive to compute. To cope with this,
materialized views were invented, which are views that are computed in advance and
cached on disk. Materialized views are effective for data that is read heavily but can stand
being somewhat stale.
Although NoSQL databases don't have views, they may have precomputed and cached
queries, and they reuse the term "materialized view" to describe them. It's also much more
of a central aspect for aggregate-oriented databases than it is for relational systems, since
most applications will have to deal with some queries that don’t fit well with the aggregate
structure.
There are two rough strategies to building a materialized view. The first is the eager
approach where you update the materialized view at the same time you update the base
data for it. This approach is good when you have more frequent reads of the materialized
view than you have writes and you want the materialized views to be as fresh as possible.
If you don’t want to pay that overhead on each update, you can run batch jobs to update
the materialized views at regular intervals. Building materialized views outside of the
database by reading the data, computing the view, and saving it back to the database. More
often databases will support building materialized views themselves.
Materialized views can be used within the same aggregate. An order document might
include an order summary element that provides summary information about the order so
that a query for an order summary does not have to transfer the entire order document.
Using different column families for materialized views is a common feature of column-
family databases. An advantage of doing this is that it allows you to update the materialized
view within the same atomic operation.
The three letters in CAP refer to three desirable properties of distributed systems with
replicated data: consistency (among replicated copies), availability (of the system for read
and write operations) and partition tolerance (in the face of the nodes in the system being
partitioned by a network fault).
The CAP theorem states that it is not possible to guarantee all three of the desirable
properties – consistency, availability, and partition tolerance at the same time in a
distributed system with data replication.
The theorem states that networked shared-data systems can only strongly support two
of the following three properties:
Consistency – Consistency means that the nodes will have the same copies of a
replicated data item visible for various transactions. A guarantee that every node in a
distributed cluster returns the same, most recent and a successful write. Consistency
refers to every client having the same view of the data. There are various types of
consistency models. Consistency in CAP refers to sequential consistency, a very
strong form of consistency.
Availability – Availability means that each read or write request for a data item will
either be processed successfully or will receive a message that the operation cannot be
completed. Every non-failing node returns a response for all the read and write requests
in a reasonable amount of time. The key word here is "every". In simple terms, every
node (on either side of a network partition) must be able to respond in a reasonable
amount of time.
Partition Tolerance – Partition tolerance means that the system can continue
operating even if the network connecting the nodes has a fault that results in two or
more partitions, where the nodes in each partition can only communicate among
each other. That means, the system continues to function and upholds its consistency
guarantees in spite of network partitions. Network partitions are a fact of life.
Distributed systems guaranteeing partition tolerance can gracefully recover from
partitions once the partition heals.
ACID Property
ACID (Atomicity, Consistency, Isolation, and Durability) transactions guarantee four
properties:
Atomicity—Either all the operations in a transaction complete, or none of them do. If
any part of the transaction fails, the entire transaction fails.
Consistency — A transaction must leave the database in a consistent state. This
ensures that any completed transaction takes the database from one valid state to another.
Isolation—Transactions will not interfere with each other.
Durability—The successful completion of a transaction will not be reversed.
Internal Assessment
Objective
A single-node cluster means only one DataNode is running, with the NameNode,
DataNode, ResourceManager, and NodeManager all set up on a single machine.
Install Java
– Java JDK Link to download
https://www.oracle.com/java/technologies/javase-jdk8-downloads.html
– extract and install Java in C:\Java
– open cmd and type -> javac -version
Download Hadoop
– https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.3.0/hadoop-
3.3.0.tar.gz
– extract to C:\Hadoop
Configurations
– Edit the file C:/Hadoop-3.3.0/etc/hadoop/core-site.xml, paste the following XML code
into it and save the file:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
– Rename (if necessary) “mapred-site.xml.template” to “mapred-site.xml”, edit the file
C:/Hadoop-3.3.0/etc/hadoop/mapred-site.xml, paste the following XML code and save
the file:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
– Create folder “data” under “C:\Hadoop-3.3.0”
– Create folder “datanode” under “C:\Hadoop-3.3.0\data”
– Create folder “namenode” under “C:\Hadoop-3.3.0\data”
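– The “namenode” and “datanode” folders created above have to be referenced from the
HDFS configuration. The remaining configuration files are not shown in the steps above;
a typical single-node setup (the values below are assumptions based on a standard
Hadoop 3.x Windows installation, so adjust the paths to your layout) edits
C:/Hadoop-3.3.0/etc/hadoop/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\Hadoop-3.3.0\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\Hadoop-3.3.0\data\datanode</value>
</property>
</configuration>
and C:/Hadoop-3.3.0/etc/hadoop/yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>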
Hadoop Configurations
Download the pre-built Windows bin folder from one of the following:
– https://github.com/brainmentorspvtltd/BigData_RDE/blob/master/Hadoop%20Configuration.zip
– (for Hadoop 3) https://github.com/s911415/apache-hadoop-3.1.0-winutils
– Copy the bin folder from the download and replace the existing bin folder in C:\Hadoop-3.3.0\bin
– Format the NameNode
– Open cmd and type the command “hdfs namenode -format”
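– Before opening the web UIs, the Hadoop daemons have to be started (a step assumed
here, since the guide jumps straight to the URL below): open cmd, change to
C:\Hadoop-3.3.0\sbin and run start-dfs.cmd followed by start-yarn.cmd. Running jps
should then list NameNode, DataNode, ResourceManager and NodeManager.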
Open http://localhost:8088 in a browser to view the YARN ResourceManager web UI.
The term MapReduce actually refers to two separate and distinct tasks that Hadoop
programs perform. The first is the map job, which takes a set of data and converts it into
another set of data, where individual elements are broken down into tuples (key/value
pairs). The reduce job takes the output from a map as input and combines those data tuples
into a smaller set of tuples. As the order of the words in the name implies, the reduce job
is always performed after the map job.
1. Download MapReduceClient.jar
(Link: https://github.com/MuhammadBilalYar/HADOOP-INSTALLATION-ON-WINDOW-10/blob/master/MapReduceClient.jar)
2. Create (or download) a plain-text file named input_file.txt and place it in C:\
3. Create an input directory in HDFS.
hadoop fs -mkdir /input_dir
4. Copy the input text file named input_file.txt in the input directory (input_dir) of
HDFS.
hadoop fs -put C:/input_file.txt /input_dir
5. Verify input_file.txt available in HDFS input directory (input_dir).
hadoop fs -ls /input_dir/
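6. The steps above only stage the input. To actually run the word-count job and inspect its
result, something like the following is needed (the driver name wordcount is an
assumption about the jar linked above, and /output_dir must not already exist):
hadoop jar C:/MapReduceClient.jar wordcount /input_dir /output_dir
hadoop fs -cat /output_dir/*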
5. Install and run Pig, then write Pig Latin scripts to sort, group, join, project, and filter
your data (sample scripts are shown after the installation steps below).
Solution:
1. Hadoop Cluster Installation: Apache Pig is a platform built on top of Hadoop. You
can refer to the previously published article to install a Hadoop single-node cluster
on Windows 10.
2. 7zip is needed to extract .tar.gz archives we will be downloading in this guide.
3. Downloading Apache Pig: To download the Apache Pig, you should go to the
following link: https://downloads.apache.org/pig/
4. After the file is downloaded, we should extract it twice using 7zip (using 7zip: the
first time we extract the .tar.gz file, the second time we extract the .tar file). We will
extract the Pig folder into “E:\hadoop-env” directory.
5. Setting Environment Variables: After extracting the Pig archive, we should go to
Control Panel > System and Security > System, then click on “Advanced system
settings”.
– Create a new user variable PIG_HOME with the value “E:\hadoop-env\pig-0.17.0”
8. Now, we should edit the Path user variable to add the following path:
%PIG_HOME%\bin
9. Starting Apache Pig: After setting the environment variables, let's try to run Apache Pig.
10. Open a command prompt as administrator, and execute the following command:
pig -version
11. If this command fails with an error (with Hadoop 3.x the stock pig.cmd looks for the
Hadoop configuration scripts in the wrong directory), edit the pig.cmd file located in
the “pig-0.17.0\bin” directory and change the HADOOP_BIN_PATH value from
“%HADOOP_HOME%\bin” to “%HADOOP_HOME%\libexec”, then run
pig -version again.
The simplest way to write Pig Latin statements is the Grunt shell, an interactive tool
where we write a statement and get the desired output. There are two modes in which the
Grunt shell can be invoked:
Local: All scripts are executed on a single machine without requiring Hadoop
(command: pig -x local).
MapReduce: Scripts are executed on a Hadoop cluster (command: pig -x mapreduce).
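Once the Grunt shell is running (for example with pig -x local), the sorting, grouping,
joining, projecting and filtering asked for in the exercise can be written as Pig Latin
statements. A minimal sketch, assuming two comma-separated files emp.csv
(id, name, dept, salary) and dept.csv (dept, location) exist in the working directory:
emp = LOAD 'emp.csv' USING PigStorage(',') AS (id:int, name:chararray, dept:chararray, salary:int);
depts = LOAD 'dept.csv' USING PigStorage(',') AS (dept:chararray, location:chararray);
high_paid = FILTER emp BY salary > 25000;              -- filter
names_only = FOREACH emp GENERATE id, name;            -- project
by_dept = GROUP emp BY dept;                           -- group
avg_salary = FOREACH by_dept GENERATE group, AVG(emp.salary);
emp_dept = JOIN emp BY dept, depts BY dept;            -- join
sorted_emp = ORDER emp BY salary DESC;                 -- sort
DUMP sorted_emp;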
6. Write a Pig Latin script for finding TF-IDF value for book dataset (A corpus of eBooks
available at: Project Gutenberg)
Solution:
The following computes TF-IDF with PySpark RDD transformations over a small sample
corpus; run it in a pyspark shell, where the SparkContext sc is predefined (for the
Gutenberg dataset, replace data with (book_id, text) pairs loaded from the eBook files).
import math

# Sample corpus as (document id, text) pairs
data = [(1, 'i love dogs'),
        (2, "i hate dogs and knitting"),
        (3, "knitting is my hobby and my passion")]
lines = sc.parallelize(data)

# Term frequency: number of occurrences of each word in each document
map1 = lines.flatMap(lambda x: [((x[0], w), 1) for w in x[1].split()])
reduce1 = map1.reduceByKey(lambda x, y: x + y)              # ((doc, word), count)
tf = reduce1.map(lambda x: (x[0][1], (x[0][0], x[1])))      # (word, (doc, count))

# Document frequency: number of documents containing each word
map3 = reduce1.map(lambda x: (x[0][1], 1))
reduce2 = map3.reduceByKey(lambda x, y: x + y)              # (word, document frequency)

# Inverse document frequency
idf = reduce2.map(lambda x: (x[0], math.log10(len(data) / x[1])))

# Join TF with IDF and compute TF-IDF = count * idf for every (doc, word) pair
rdd = tf.join(idf)
rdd = rdd.map(lambda x: (x[1][0][0], (x[0], x[1][0][1], x[1][1], x[1][0][1] * x[1][1]))).sortByKey()
rdd = rdd.map(lambda x: (x[0], x[1][0], x[1][1], x[1][2], x[1][3]))   # (doc, word, tf, idf, tf-idf)
rdd.collect()
7. Install and Run Hive then use Hive to create, alter, and drop databases, tables, views,
functions, and indexes.
Solution:
1. Prerequisites
7zip
In order to extract tar.gz archives, you should install the 7zip tool.
Installing Hadoop
To install Apache Hive, you must have a Hadoop Cluster installed and running: You can
refer to our previously published step-by-step guide to install Hadoop 3.2.1 on Windows 10.
Apache Derby
In addition, Apache Hive requires a relational database to create its Metastore (where all
metadata will be stored). In this guide, we will use the Apache Derby database.
Since we have Java 8 installed, we must install Apache Derby version 10.14.2.0, which can
be downloaded from the following link:
https://downloads.apache.org//db/derby/db-derby-10.14.2.0/db-derby-10.14.2.0-bin.tar.gz
Once downloaded, we must extract twice (using 7zip: the first time we extract the .tar.gz
file, the second time we extract the .tar file) the content of the db-derby-10.14.2.0-bin.tar.gz
archive into the desired installation directory. Since in the previous guide we have installed
Hadoop within “E:\hadoop-env\hadoop-3.3.0\” directory, we will extract Derby into
“E:\hadoop-env\db-derby-10.14.2.0\” directory.
Cygwin
Since some Hive 3.1.2 tools (such as schematool) are not compatible with Windows, we
will need the Cygwin tool to run some Linux commands.
Downloading Apache Hive
Download the apache-hive-3.1.2-bin.tar.gz archive. When the download is complete, we
should extract it twice (as mentioned above) into the “E:\hadoop-env\apache-hive-3.1.2”
directory.
4. Configuring Hive
Copy Derby libraries
Now, we should go to the Derby libraries directory (E:\hadoop-env\db-derby-
10.14.2.0\lib) and copy all *.jar files.
Then, we should paste them within the Hive libraries directory (E:\hadoop-env\apache-
hive-3.1.2\lib).
Configuring hive-site.xml
Now, we should go to the Apache Hive configuration directory
(E:\hadoop-env\apache-hive-3.1.2\conf) and create a new file named “hive-site.xml”. We
should paste the following XML code within this file:
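A minimal hive-site.xml for a Derby-backed Metastore could look like the following (the
property values are assumptions matching the Derby installation above; they require the
Derby network server to be running, for example via startNetworkServer.bat in
E:\hadoop-env\db-derby-10.14.2.0\bin):
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.ClientDriver</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
</configuration>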
5. Starting Services
Hadoop Services
To start Apache Hive, open the command prompt utility as administrator. Then, start the
Hadoop services using the start-dfs.cmd and start-yarn.cmd commands (as illustrated in
the Hadoop installation guide).
Running the hive command may throw an error, since the Hive 3.x distribution does not
ship the Windows .cmd scripts (only some Hive 2.x versions do). To get things working,
we should download the necessary *.cmd files from the following link:
https://svn.apache.org/repos/asf/hive/trunk/bin/. Note that you should keep the folder
hierarchy (bin\ext\util). Alternatively, you can download all *.cmd files from the
following GitHub repository:
https://github.com/HadiFadl/Hive-cmd
7. Initializing Hive
Even after Apache Hive starts successfully, we may not be able to run any HiveQL
command, because the Metastore is not initialized yet. In addition, the HiveServer2
service must be running.
To initialize the Metastore, we need the schematool utility, which is not compatible with
Windows. To solve this problem, we will use the Cygwin utility, which allows executing
Linux commands from Windows.
We can add the required environment variable exports (for example HADOOP_HOME
and HIVE_HOME) to the “~/.bashrc” file so that we do not need to set them each time we
open Cygwin. Now, we should use the schematool utility to initialize the Metastore:
$HIVE_HOME/bin/schematool -dbType derby -initSchema
Starting HiveServer2 service
Now, open a command prompt and run the following command:
hive --service hiveserver2 start
We should leave this command prompt open, and open a new one where we should
start Apache Hive using the following command:
hive
To start the WebHCat server, we should open the Cygwin utility and execute the
following command:
$HIVE_HOME/hcatalog/sbin/webhcat_server.sh start
Drop Database
Drop a database by using the following command:
hive> drop database financials;
Check whether the database has been dropped or not:
hive> show databases;
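Create Database and Table
The Load Data step below assumes an internal table demo.employee already exists. A
minimal sketch for creating it (the column names and types are assumptions chosen to
match a comma-separated data file):
hive> create database if not exists demo;
hive> use demo;
hive> create table demo.employee (id int, name string, salary decimal(10,2))
      row format delimited fields terminated by ',';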
Load Data
Once the internal table has been created, the next step is to load data into it. In Hive, we
can easily load data from a file into a table.
Let's load the data of the file into the table by using the following command:
load data local inpath '/home/codegyani/hive/emp_details' into table demo.employee;
Alter Table
Most table properties can be altered with ALTER TABLE statements, which change
metadata about the table but not the data itself. These statements can be used to fix
mistakes in a schema, move partition locations (for external partitioned tables, for
example), and perform other operations.
Multiple partitions can be added in the same query when using Hive v0.8.0 and later. As
always, IF NOT EXISTS is optional and has the usual meaning.
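For example, assuming log_messages (used again below) is partitioned by year, month,
and day, two partitions can be added in one statement:
ALTER TABLE log_messages ADD IF NOT EXISTS
PARTITION (year = 2020, month = 1, day = 1) LOCATION '/logs/2020/01/01'
PARTITION (year = 2020, month = 1, day = 2) LOCATION '/logs/2020/01/02';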
Changing Columns
You can rename a column, change its position, type, or comment:
ALTER TABLE log_messages
CHANGE COLUMN hms hours_minutes_seconds INT
COMMENT 'The hours, minutes, and seconds part of the timestamp'
AFTER severity;
View
Views are logical constructs, similar to tables, that are generated based on requirements.
• We can save any result-set data as a view in Hive
• Usage is similar to views in SQL
• A view can be queried like a table, but it is read-only; it cannot be the target of load
or insert operations
Creation of View:
CREATE VIEW <view_name> AS SELECT ...
Example:
hive> CREATE VIEW Sample_View AS SELECT * FROM employees WHERE salary > 25000;
In this example, we create the view Sample_View, which will display all the rows with a
salary field greater than 25000.
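The exercise also asks for altering and dropping views; both follow the same pattern,
reusing the view created above:
hive> ALTER VIEW Sample_View AS SELECT * FROM employees WHERE salary > 30000;
hive> DROP VIEW IF EXISTS Sample_View;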
8. Install, Deploy & configure Apache Spark Cluster. Run apache spark applications
using Scala.
Solution:
Apache Spark is an open-source framework that processes large volumes of batch and
streaming data from multiple sources. Spark is used in distributed computing for
machine learning applications, data analytics, and graph-parallel processing.
Prerequisites
A system running Windows 10
iv. Near the bottom of the first setup dialog box, check off Add Python 3.8 to PATH.
Leave the other box checked.
v. Next, click Customize installation.
vi. You can leave all boxes checked at this step, or you can uncheck the options you do
not want.
vii. Click Next.
viii. Select the box Install for all users and leave other boxes as they are.
ix. Under Customize install location, click Browse and navigate to the C drive. Add a
new folder and name it Python.
x. Select that folder and click OK.
iv. A page with a list of mirrors loads where you can see different servers to download
from. Pick any from the list and save the file to your Downloads folder.
Step 4: Verify Spark Software File
i. Verify the integrity of your download by checking the checksum of the file. This
ensures you are working with unaltered, uncorrupted software.
ii. Navigate back to the Spark Download page and open the Checksum link, preferably
in a new tab.
iii. Next, open a command line and enter the following command:
certutil -hashfile c:\users\username\Downloads\spark-2.4.5-bin-hadoop2.7.tgz SHA512
iv. Change the username to your username. The system displays a long alphanumeric
code, along with the message Certutil: -hashfile completed successfully.
v. Compare the code to the one you opened in a new browser tab. If they match, your
download file is uncorrupted.
ii. Find the Download button on the right side to download the file.
iii. Now, create new folders Hadoop and bin on C: using Windows Explorer or the
Command Prompt.
iv. Copy the winutils.exe file from the Downloads folder to C:\hadoop\bin.
Step 7: Configure Environment Variables
Configuring environment variables in Windows adds the Spark and Hadoop locations to
your system PATH. It allows you to run the Spark shell directly from a command prompt
window.
i. Click Start and type environment.
ii. Select the result labeled Edit the system environment variables.
iii. A System Properties dialog box appears. In the lower-right corner, click Environment
Variables and then click New in the next window.
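iv. Create a new variable named SPARK_HOME with the value
C:\Spark\spark-2.4.5-bin-hadoop2.7 (this and the next step are assumptions inferred
from the paths used elsewhere in this guide).
v. In the same way, create a variable named HADOOP_HOME with the value C:\hadoop,
the folder that contains bin\winutils.exe.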
vi. In the top box, click the Path entry, then click Edit. Be careful with editing the system
path. Avoid deleting any entries already on the list.
vii. You should see a box with entries on the left. On the right, click New.
viii. The system highlights a new line. Enter the path to the Spark folder C:\Spark\spark-
2.4.5-bin-hadoop2.7\bin. We recommend using %SPARK_HOME%\bin to avoid
possible issues with the path.
vii. To exit Spark and close the Scala shell, press ctrl-d in the command-prompt window.
Test Spark
We will launch the Spark shell and use Scala to read the contents of a file. You can use an
existing file, such as the README file in the Spark directory, or you can create your own.
We created pnaptest with some text.
i. Open a command-prompt window and navigate to the folder with the file you want
to use and launch the Spark shell.
ii. First, define a value that loads the file through the Spark context, passing the name of
the file. Remember to add the file extension if there is one.
val x = sc.textFile("pnaptest")
iii. The output shows an RDD is created. Then, we can view the file contents by using
this command to call an action:
x.take(11).foreach(println)
iv. This command instructs Spark to print 11 lines from the file you specified. To
perform an action on this file (value x), add another value y, and do a map
transformation.
v. For example, you can print the characters in reverse with this command:
val y = x.map(_.reverse)
vi. The system creates a child RDD in relation to the first one. Then, specify how many
lines you want to print from the value y:
y.take(11).foreach(println)
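vii. As a slightly larger Scala example in the same shell, a classic word count can be built
from the value x loaded above (the output directory name is an assumption and must not
already exist):
val counts = x.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.take(10).foreach(println)
counts.saveAsTextFile("pnaptest-counts")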