
AS PER CBCS SYLLABUS

B.Sc.
(DATA SCIENCE)
III YEAR VI SEMESTER
PAPER-VII (A)

BIG DATA
Mr.G.Venkata Subba Reddy.
Head, Department of Computer Science

© Copyrights Reserved with the Publisher


Despite every effort taken to bring out this book without errors, some errors might have crept in. We do not take any legal responsibility for such errors and omissions. However, if they are brought to our notice, they will be corrected in the next edition.

S.V. PUBLICATIONS
HYDERABAD.


SYLLABUS
Unit - I
Getting an overview of Big Data: Introduction to Big Data, Structuring Big Data,
Types of Data, Elements of Big Data, Big Data Analytics, Advantages of Big Data
Analytics.
Introducing Technologies for Handling Big Data: Distributed and Parallel
Computing for Big Data, Cloud Computing and Big Data, Features of Cloud
Computing, Cloud Deployment Models, Cloud Services for Big Data, Cloud
Providers in Big Data Market.
Unit - II
Understanding Hadoop Ecosystem: Introducing Hadoop, HDFS and
MapReduce, Hadoop functions, Hadoop Ecosystem.
Hadoop Distributed File System- HDFS Architecture, Concept of Blocks in
HDFS Architecture, Namenodes and Datanodes, Features of HDFS, MapReduce.
Introducing HBase- HBase Architecture, Regions, Storing Big Data with HBase,
Combining HBase and HDFS, Features of HBase, Hive, Pig and Pig Latin, Sqoop,
ZooKeeper, Flume, Oozie.
Unit - III
Understanding MapReduce Fundamentals and HBase: The MapReduce
Framework, Exploring the features of MapReduce, Working of
MapReduce, Techniques to optimize MapReduce Jobs, Hardware/Network
Topology, Synchronization, File system, Uses of MapReduce, Role of HBase in
Big Data Processing, Characteristics of HBase.
Understanding Big Data Technology Foundations: Exploring the Big Data
Stack, Data Sources Layer, Ingestion Layer, Storage Layer, Physical Infrastructure
Layer, Platform Management Layer, Security Layer, Monitoring Layer,
Visualization Layer.
Unit - IV
Storing Data in Databases and Data Warehouses: RDBMS and Big Data, Issues
with Relational Model, Non – Relational Database, Issues with Non Relational
Database, Polyglot Persistence, Integrating Big Data with Traditional Data
Warehouse, Big Data Analysis and Data Warehouse.
NoSQL Data Management: Introduction to NoSQL, Characteristics of NoSQL,
History of NoSQL, Types of NoSQL Data Models- Key Value Data Model,
Column Oriented Data Model, Document Data Model, Graph Databases,
Schema-Less Databases, Materialized Views, CAP Theorem.


Contents
UNIT-I
Getting an overview of Big Data, Introducing Technologies for
Handling Big Data
1.1 Introduction to Big Data
1.2 Structuring Big Data
1.3 Types of Data
1.4 Elements of Big Data
1.5 Big Data Analytics & Advantages of Big Data Analytics
1.6 Distributed and Parallel Computing for Big Data
1.7 Cloud Computing and Big Data
1.8 Deployment Models
1.9 Cloud Deployment and Service Models
1.10 Cloud Providers in Big Data Markets
Multiple Choice Questions
Fill in the Blanks
Very Short Questions

UNIT-II
Understanding Hadoop Ecosystem, Hadoop Distributed File System,
Introducing HBase
2.1 Introducing Hadoop
2.2 Hadoop Ecosystem & Components
2.2.1 Basic HDFS DFS Commands
2.2.2 Features of HDFS
2.3 MapReduce
2.4 Introduction to HBase and Architecture
2.5 Regions
2.6 Storing Big Data with HBase
2.7 Combining HBase with HDFS
2.8 Features of HBase
2.9 HIVE, Pig and Pig Latin, Zookeeper, Flume, Oozie
Multiple Choice Questions
Fill in the Blanks
Very Short Questions


UNIT-III
Understanding MapReduce Fundamentals and HBase, Understanding
Big Data Technology Foundations
3.1 The MapReduce Framework
3.2 Exploring the Features of MapReduce
3.3 Working of MapReduce
3.4 Techniques to Optimize MapReduce Jobs
3.5 Uses of MapReduce
3.6 Role of HBase in Big Data Processing
3.7 Exploring the Big Data Stack
3.8 Data Source Layer
3.9 Ingestion Layer
3.10 Storage Layer
3.11 Physical Infrastructure
3.12 Security Layer
3.13 Monitoring Layer
3.14 Visualization Layer
Multiple Choice Questions
Fill in the Blanks
Very Short Questions

UNIT-IV
Storing Data in Databases and Data Warehouses, NoSQL Data
Management
4.1 RDBMS and Big Data
4.2 Issues with the Relational Model
4.3 Non-Relational Database
4.4 Issues with Non-Relational Databases
4.5 Polyglot Persistence
4.6 Integrating Big Data with Traditional Data Warehouses
4.7 Big Data Analysis and Data Warehouse
4.7.1 Changing Deployment Models in Big Data Era
4.8 Introduction to NoSQL
4.9 Characteristics of NoSQL


4.10 History of NoSQL
4.11 Types of NoSQL Data Models - Key Value Data Model
4.12 Column Oriented Data Model
4.13 Document Data Model
4.14 Graph Databases
4.15 Schema-Less Databases
4.16 Materialized Views
4.17 CAP Theorem
Multiple Choice Questions
Fill in the Blanks
Very Short Questions

LAB PRACTICALS
Model Papers


FACULTY OF SCIENCE
B.Sc. (CBCS) VI - Semester Examination
Model Paper-I
Subject: Data Science
Paper – VII(A): BIG DATA
Time: 3 Hours Max. Marks: 80
PART - A
Note: Answer any EIGHT questions. (8 x 4 = 32 Marks)
1. What is Parallel Computing?
2. What is structured and unstructured data in big data?
3. Discuss about Cloud Delivery Models in Big Data
4. What is MapReduce?
5. List few HDFS Features.
6. Difference Between Hadoop and HBase.
7. Short notes on Security Layer.
8. What are the functions of Ingestion Layer?
9. Why is the order in which a function is executed very important in MapReduce?
10. What are the 5 features of NoSQL?
11. Discuss about the CAP Theorem.
12. Explain the Graph Data Model.

PART – B
Note: Answer ALL the questions. (4 x 12 = 48 Marks)
13. What are types of data in big data?
OR
What is parallel and distributed processing in big data?
14. What is a Hadoop ecosystem? What are the main components of Hadoop ecosystem?
OR
a) Which storage is used by HBase?
b) How does HDFS maintain Data Integrity?
15. a) What is Hadoop MapReduce and how does it work?
b) What techniques are used to optimize MapReduce jobs?
OR
c) What is redundant physical infrastructure?
d) Why monitoring your big data pipeline is important?
16. a) What problems do big data solutions solve?
b) What is the main problem with relational database for processing big data?
OR
c) What is a column-family data store in NoSQL Database? List the features of
column family database.
d) What is the CAP theorem? Explain.


FACULTY OF SCIENCE
B.Sc. (CBCS) VI - Semester Examination
Model Paper-II
Subject: Data Science
Paper – VII(A): BIG DATA
Time: 3 Hours Max. Marks: 80
PART - A
Note: Answer any EIGHT questions. (8 x 4 = 32 Marks)
1. What is the structure of big data? Why is structuring of big data needed?
2. Which tools are used by data analyst?
3. What are cloud providers give example?
4. How Does Hadoop Work?
5. Explain the MapReduce architecture in detail.
6. How are regions created in HBase?
7. Characteristics of HBase.
8. What is big data stack? Explain.
9. Why is the order in which a function is executed very important in MapReduce?
10. What problems do big data solutions solve?
11. What is a non-relational database?
12. What is a NoSQL key-value database?
PART – B
Note: Answer ALL the questions. (4 x 12 = 48 Marks)
13. a) What is Big Data Analytics and Why is it Important?
b) Which tools are used by data analyst?
OR
c) What is Big Data cloud in Cloud Computing?
d) What Are the Most Common Cloud Computing Service Delivery Models?
14. How does HDFS maintain Data Integrity?
OR
What is ZooKeeper? Explain the purpose of ZooKeeper in the Hadoop ecosystem.
15. What are the different use cases of MapReduce?
OR
What is the main purpose of security? Why is data security important in big data?
16. How do you apply Big Data concepts to a traditional data warehouse?
OR
What are the basic characteristics of a NoSQL database?



UNIT I

Getting an Overview of Big Data
Introduction to Big Data, Structuring Big Data, Types of Data, Elements of Big Data, Big Data Analytics, Advantages of Big Data Analytics.

Introducing Technologies for Handling Big Data
Distributed and Parallel Computing for Big Data, Cloud Computing and Big Data, Features of Cloud Computing, Cloud Deployment Models, Cloud Services for Big Data, Cloud Providers in Big Data Market.

Objective

 Introduction to Big Data
 Big Data in today's real world and the IT industry
 Evolution of Big Data
 Various types of data: structured, unstructured and semi-structured
 Discussion about the 4 V's of Big Data
 Big Data in various domains
 Skills required to learn Big Data
 The concept of Distributed and Parallel Computing
 Role of Parallel Computing in Big Data
 Hadoop and its functioning
 Cloud Computing in detail
 Features of Cloud Computing
 Cloud Deployment Models and Delivery Models, Services to Big Data


PART – A
Short Type Questions
Q1. What is Parallel Computing?
Ans:
Parallel Computing is also known as parallel processing. It utilizes several processors, and each processor completes the tasks that have been allocated to it. In other words, parallel computing involves performing numerous tasks simultaneously. Either a shared memory or a distributed memory system can be used to support parallel computing. In shared memory systems, all CPUs share the same memory; in distributed memory systems, each processor has its own local memory.

Parallel computing provides numerous advantages. It helps to increase CPU utilization and improve performance because several processors work simultaneously. Moreover, the failure of one CPU has no impact on the functionality of the other CPUs. However, if one processor needs data or instructions from another processor, it may have to wait, which can introduce latency.
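To make the idea concrete, here is a minimal Python sketch (illustrative only, not part of these notes; the task and numbers are made up) that allocates a list of tasks to a pool of worker processes so they run simultaneously:

```python
# A minimal sketch of parallel processing, assuming a machine with several CPU cores.
from multiprocessing import Pool

def square(n):
    # Each worker process completes the task allocated to it independently.
    return n * n

if __name__ == "__main__":
    numbers = list(range(10))
    with Pool(processes=4) as pool:              # four processes working simultaneously
        results = pool.map(square, numbers)      # the tasks are split across the workers
    print(results)                               # [0, 1, 4, 9, ...]
```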

Q2. What is Distributed Computing?


Ans:
It comprises several software components that reside on different systems but operate as
a single system. A distributed system's computers can be physically close together and
linked by a local network or geographically distant and linked by a wide area network
(WAN). A distributed system can be made up of any number of different configurations,
such as mainframes, PCs, workstations, and minicomputers. The main aim of distributed
computing is to make a network work as a single computer.

There are various benefits of using distributed computing. It enables scalability and
makes it simpler to share resources. It also aids in the efficiency of computation processes.

Q3. What is structured and unstructured data in big data?


Ans:
Structured Data: The Data which has a proper structure or the one that can be easily stored
in a tabular form in any Relational DataBases like Oracle, SQL Server or MySQL is known as
Structured Data. We can process or analyze it easily and efficiently.

An example of Structured Data is the data stored in a Relational Database which can be
managed using SQL (Structured Query Language). For Example, Employee Data (Name, ID,
Designation, and Salary) can be stored in a tabular format.

In a traditional database, we can perform operations or process unstructured or semi-


structured data only after it is formatted or fit into the relational database. Examples of
Structured Data are ERP, CRM, etc.

Unstructured Data: Unstructured Data is the data that does not have any structure. It can be in any form; there is no pre-defined data model. We can't store it in traditional databases, and it is complex to search and process.

Also, the volume of Unstructured Data is very high. Examples of Unstructured Data are e-mail body text, audio, video, images, archived documents, etc.
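Returning to the structured-data employee example above, the following small sketch (not from the book; the table name and values are hypothetical) shows how such data sits in a relational table and is queried with SQL:

```python
# Illustrative sketch: structured data in a relational (SQL) table.
import sqlite3

conn = sqlite3.connect(":memory:")   # an in-memory relational database
conn.execute("CREATE TABLE employee (id INTEGER, name TEXT, designation TEXT, salary REAL)")
conn.execute("INSERT INTO employee VALUES (?, ?, ?, ?)", (101, "Asha", "Analyst", 55000.0))

# Because the data follows a fixed schema, it can be processed easily and efficiently.
for row in conn.execute("SELECT name, designation FROM employee WHERE salary > 50000"):
    print(row)
conn.close()
```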



Q4. What are the essential features of cloud services?
Ans:
The Essential Characteristics of Cloud Computing are:
 Low Cost: Cloud computing eliminates the capital expense of buying hardware and
software and setting up and running on-site data centers.
 On-Demand Service: This is one of the most important and valuable features of cloud computing. On-demand computing is a delivery model in which computing resources are made available to the user as needed.
 Global scale: The benefits of cloud computing services include the ability to scale
elastically. In cloud speak, that means delivering the right amount of IT resources-for
example, more or less computing power, storage, bandwidth-right when it is needed
and from the right geographic location.
 Reliability: Cloud computing makes data backup, disaster recovery and business
continuity easier and less expensive because data can be mirrored at multiple
redundant sites on the cloud provider's network.

Q5. Discuss about Cloud Delivery Models in Big Data


Ans:
Cloud services are categorized as below:
 Infrastructure as a service (IAAS): The complete infrastructure is provided to you. Maintenance-related tasks are done by the cloud provider, and you can use the infrastructure as per your requirement. It can be used in both public and private deployments. Examples of IaaS are virtual machines, load balancers, and network-attached storage.
 Platform as a service (PAAS): Here we get object storage, queuing, databases, runtime, etc. directly from the cloud provider. It is our responsibility to configure and use them; the provider gives us the resources, but connectivity to our database and other similar activities are our responsibility. Examples of PaaS are Windows Azure and Google App Engine (GAE).
 Applications or Software as a service (SAAS): e.g. Salesforce.com, Dropbox, Google Drive, etc. Here we do not have any responsibility; we simply use the application that is running on the cloud. All infrastructure setup is the responsibility of the service provider.

For SaaS to work, the infrastructure (IaaS) and the platform (PaaS) must be in place.


PART – B
Essay Type Questions
GETTING AN OVERVIEW OF BIG DATA

1.1 Introduction to Big Data


Q1. What is Big Data? Which factors contribute in evolution of big data?
Ans:
Data
The quantities, characters, or symbols on which operations are performed by a
computer, which may be stored and transmitted in the form of electrical signals and
recorded on magnetic, optical, or mechanical recording media.

Big Data
Big Data is a collection of data that is huge in volume, yet growing exponentially with
time. It is a data with so large size and complexity that none of traditional data management
tools can store it or process it efficiently. Big data is also a data but with huge size.

Normally we work with data of size MB (Word documents, Excel sheets) or at most GB (movies, code), but data in petabytes, i.e. 10^15 bytes in size, is called Big Data. It is stated that almost 90% of today's data has been generated in the past 3 years.

Sources of Big Data


These data come from many sources like
 Social networking sites: Facebook, Google, LinkedIn all these sites generates huge
amount of data on a day to day basis as they have billions of users worldwide.
 E-commerce site: Sites like Amazon, Flipkart, Alibaba generates huge amount of logs
from which users buying trends can be traced.
 Weather Station: All the weather station and satellite gives very huge data which are
stored and manipulated to forecast weather.
 Telecom company: Telecom giants like Airtel, Vodafone study the user trends and
accordingly publish their plans and for this they store the data of its million users.
 Share Market: Stock exchange across the world generates huge amount of data
through its daily transaction.

Challenges with Big Data


 Data volume: Data today is growing at an exponential rate. This high tide of data
will continue to rise continuously. The key questions are –
“will all this data be useful for analysis?”,
“Do we work with all this data or subset of it?”,
“How will we separate the knowledge from the noise?” etc.
 Storage: Cloud computing is the answer to managing infrastructure for big data as
far as cost-efficiency, elasticity and easy upgrading/downgrading is concerned. This
further complicates the decision to host big data solutions outside the enterprise.
 Data retention: How long should one retain this data? Some data may be required for long-term decisions, but some data may quickly become irrelevant and obsolete.


 Skilled professionals: In order to develop, manage and run those applications that
generate insights, organizations need professionals who possess a high-level
proficiency in data sciences.
 Other challenges: Other challenges of big data are with respect to the capture, storage, search, analysis, transfer and security of big data.
 Visualization: Big data refers to datasets whose size is typically beyond the storage capacity of traditional database software tools; there is no explicit definition of how big a data set should be for it to be considered big data. Data visualization (computer graphics) is becoming popular as a separate discipline, but there are very few data visualization experts.

History of Big Data


The history of big data starts many years before the present buzz around Big Data. Seventy years ago, the first attempt was made to quantify the growth rate of data in terms of the volume of data. That phenomenon has popularly been known as the "information explosion".

We will be covering some major milestones in the evolution of "big data".


1944:
Fremont Rider, based upon his observation, speculated that the Yale Library in 2040 will have "approximately 200,000,000 volumes, which will occupy over 6,000 miles of shelves… a cataloging staff of over six thousand persons."
He did not predict the digitization of libraries but predicted the information explosion.

From 1944 to 1980, many articles and presentations appeared that observed the 'information explosion' and the arising need for storage capacity.

1980:
In 1980, the sociologist Charles Tilly used the term big data in one sentence: "none of the big questions has actually yielded to the bludgeoning of the big-data people" ("The old-new social history and the new old social history").

But the term used in this sentence is not in the context of the present meaning of Big Data. Now, moving ahead to 1997-1998, we see the actual use of big data in its present context.
1997:
In 1997, Michael Cox and David Ellsworth published the article "Application-controlled demand paging for out-of-core visualization" in the Proceedings of the IEEE 8th Conference on Visualization.


The big data term appears in the sentence: "Visualization provides an interesting challenge for computer systems: data sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk. We call this the problem of big data. When data sets do not fit in main memory (in core), or when they do not fit even on local disk, the most common solution is to acquire more resources."

It was the first article in the ACM digital library that uses the term big data with its
modern context.

1998:
In 1998, John Mashey, who was Chief Scientist at SGI, presented a paper titled "Big Data… and the Next Wave of Infrastress" at a USENIX meeting. John Mashey used this term in his various speeches, and that is why he got the credit for coining the term Big Data.



2000:
In 2000, Francis Diebold presented a paper titled "'Big Data' Dynamic Factor Models for Macroeconomic Measurement and Forecasting" to the Eighth World Congress of the Econometric Society.

In the paper, he stated that "Recently, much good science, whether physical, biological, or social, has been forced to confront—and has often benefited from—the 'Big Data' phenomenon. Big Data refers to the explosion in the quantity (and sometimes, quality) of available and potentially relevant data, largely the result of recent and unprecedented advancements in data recording and storage technology."

He is the one who linked big data term explicitly to the way we understand big data
today.

2001:
In 2001, Doug Laney, who was an analyst with the Meta Group (Gartner), presented a research paper titled "3D Data Management: Controlling Data Volume, Velocity, and Variety." The 3 V's have become the most accepted dimensions for defining big data, and this is the most widely understood form of the Big Data definition today.

In 2005, Yahoo used Hadoop, which has since been made open source through the Apache Software Foundation, to process petabytes of data. Many companies are now using Hadoop to crunch Big Data.

1.2 Structuring Big Data


Q2. What is the structure of big data? Why is structuring of big data needed?
Ans:
Structuring of data, is arranging the available data in a manner so that it becomes easy to
study, analyze, and derive conclusion from it. But, why is structuring required? In daily life,
you may have come across questions like:
 How do I use to my advantage the vast amount of data and information I come
across?
 Which news articles should I read of the thousands I come across?
 How do I choose a book of the millions available on my favorite sites or stores?
 How do I keep myself updated about new events, sports, inventions, and discoveries
taking place across the globe?

Solutions to such questions can be found by information processing systems. These


systems can analyze and structure a large amount of data specifically for you on the basis of
what you searched, what you looked at, and for how long you remained at a particular page
or website, thus scanning and presenting you with the customized information as per your
behavior and habits. In other words, structuring data helps in understanding user behaviors,
requirements, and preferences to make personalized recommendations for every individual.
When a user regularly visits or purchases from online shopping sites, say eBay, each time
he/she logs in, the system can present a recommended list of products that may interest the
user on the basis of his/her earlier purchases or searches, thus presenting a specially
customized recommendation set for every user. This is the power of Big Data analytics.

Various sources generate a variety of data such as images, text, audios, etc. All such
different types of data can be structured only if it is sorted and organized in some logical
pattern. Thus, the process of structuring data requires one to first understand the various
types of data available today.

1.3 Types of Data


Q3. What are types of data in big data?
Ans:
Data that comes from multiple sources such as databases, Enterprise Resource Planning
(ERP) systems, weblogs, chat history, and GPS maps, varies in its format. However, different
formats of data need to be made consistent and clear to be used for analysis. Data is
obtained primarily from the following types of sources:
 Internal sources, such as organizational or enterprise data
 External sources, such as social data

Structures of big data


Big data structures can be divided into three categories – structured, unstructured, and
semi-structured. Let‘s have a look at them in detail.
1. Structured data: It is the data which follows a pre-defined format and is thus straightforward to analyze. It conforms to a tabular format, with relationships between different rows and columns. You can think of SQL databases as a common example. Structured data depends on how the data is stored, processed and accessed, and it is considered the most "traditional" type of data storage.
2. Unstructured data: This type of big data has no known form; it cannot be stored in traditional ways and cannot be analyzed unless it is transformed into a structured format. You can think of multimedia content such as audio, video and images as examples of unstructured data. It is important to understand that, these days, unstructured data is growing faster than other types of big data.
3. Semi-structured data: It is a type of big data that does not conform to a formal data model, but it comes with organizational tags or other markers that help to separate semantic elements and enforce hierarchies of fields and records within the data. You can think of JSON documents or XML files as this type of big data (see the short sketch after this list). This category exists because semi-structured data is significantly easier to analyze than unstructured data: a significant number of big data solutions and tools can read and process XML files or JSON documents, reducing the complexity of the analysis process.
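The short sketch below (illustrative only; the record is made up and not part of these notes) shows why semi-structured data such as JSON is easier to work with than free-form text: its tags let a program address individual fields without a fixed relational schema.

```python
# Illustrative sketch: parsing a semi-structured JSON document.
import json

record = '{"user": "u101", "action": "purchase", "items": ["book", "pen"], "meta": {"channel": "mobile"}}'
doc = json.loads(record)   # the tags/markers separate the semantic elements

# Fields are addressed by their tags rather than by table columns.
print(doc["user"], doc["action"], len(doc["items"]), doc["meta"]["channel"])
```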



While data analytics is not new, the emergence of big data has dramatically changed the nature of the work. It is important for businesses looking to make the most out of big data to adopt advanced tools and technologies to keep up with the pace at which the data is growing.

1.4 Elements of Big Data


Q4. What are the four elements of big data?
Ans:
4V's of Big Data
 Velocity: The data is increasing at a very fast rate. It is estimated that the volume of data will double every 2 years.
 Variety: Nowadays, data is not stored only in rows and columns. Data is structured as well as unstructured: log files and CCTV footage are unstructured data, while data that can be saved in tables, such as bank transaction data, is structured data.
 Volume: The amount of data we deal with is of very large size, on the order of petabytes.
 Variability: This refers to the inconsistency which can be shown by the data at times,
thus hampering the process of being able to handle and manage the data effectively.

Example of Big Data


1. Stock Exchange is an example of Big Data that generates about one terabyte of new
trade data per day.
2. Social Media: Statistics show that 500+ terabytes of new data get ingested into the databases of the social media site Facebook every day. This data is mainly generated in terms of photo and video uploads, message exchanges, posting comments, etc.
3. A single Jet engine can generate 10+terabytes of data in 30 minutes of flight time.
With many thousand flights per day, generation of data reaches up to many
Petabytes.
Advantages of Big Data Processing
The ability to process Big Data brings multiple benefits, such as:
 Businesses can utilize outside intelligence while taking decisions: access to social data from search engines and sites like Facebook and Twitter enables organizations to fine-tune their business strategies.
 Improved customer service:
Traditional customer feedback systems are getting replaced by new systems designed with Big Data technologies; in these new systems, Big Data and natural language processing technologies are being used to read and evaluate consumer responses.
 Early identification of risk to the product/services, if any.
 Better operational efficiency: Big Data technologies can be used for creating a staging area or landing zone for new data before identifying what data should be moved to the data warehouse. In addition, such integration of Big Data technologies and the data warehouse helps an organization to offload infrequently accessed data.

1.5 Big Data Analytics & Advantages of Big Data Analytics


Q5. What is Big Data Analytics and Why is it Important?
Ans:
Big data analytics is the process of collecting, examining, and analyzing large amounts of
data to discover market trends, insights, and patterns that can help companies make better
business decisions. This information is available quickly and efficiently so that companies
can be agile in crafting plans to maintain their competitive advantage.

Technologies such as business intelligence (BI) tools and systems help organizations take
the unstructured and structured data from multiple sources. Users (typically employees)
input queries into these tools to understand business operations and performance. Big data
analytics uses the four data analysis methods to uncover meaningful insights and derive
solutions.

There are four main types of big data analytics that support and inform different
business decisions.

1. Descriptive analytics
Descriptive analytics refers to data that can be easily read and interpreted. This data
helps create reports and visualize information that can detail company profits and sales.
Example: During the pandemic, a leading pharmaceuticals company conducted data
analysis on its offices and research labs. Descriptive analytics helped them identify
unutilized spaces and departments that were consolidated, saving the company millions of
dollars.
2. Diagnostics analytics
Diagnostics analytics helps companies understand why a problem occurred. Big data
technologies and tools allow users to mine and recover data that helps dissect an issue and
prevent it from happening in the future.
Example: A clothing company‘s sales have decreased even though customers continue to
add items to their shopping carts. Diagnostics analytics helped to understand that the
payment page was not working properly for a few weeks.

3. Predictive analytics
Predictive analytics looks at past and present data to make predictions. With artificial
intelligence (AI), machine learning, and data mining, users can analyze the data to predict
market trends.
Example: In the manufacturing sector, companies can use algorithms based on historical
data to predict if or when a piece of equipment will malfunction or break down.



4. Prescriptive analytics
Prescriptive analytics provides a solution to a problem, relying on AI and machine
learning to gather data and use it for risk management.
Example: Within the energy sector, utility companies, gas producers, and pipeline owners
identify factors that affect the price of oil and gas in order to hedge risks.
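To make the contrast between the first and third types above concrete, here is a toy sketch (not from the book; the sales figures are invented and the "prediction" is deliberately naive): descriptive analytics summarizes what happened, while predictive analytics extrapolates from it.

```python
# Illustrative sketch only: toy descriptive and predictive analytics on made-up monthly sales.
from statistics import mean

sales = [120, 135, 150, 160, 172, 185]   # hypothetical monthly sales figures

# Descriptive analytics: report what happened.
print("average monthly sales:", mean(sales))
print("best month:", max(sales))

# Predictive analytics (highly simplified): project the average month-on-month growth forward.
growth = mean(b - a for a, b in zip(sales, sales[1:]))
print("naive forecast for next month:", sales[-1] + growth)
```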

Benefits of big data analytics


There are quite a few advantages to incorporating big data analytics into a business or
organization. These include:
1. Cost reduction: Big data can reduce costs in storing all the business data in one
place. Tracking analytics also helps companies find ways to work more efficiently to
cut costs wherever possible.
2. Product development: Developing and marketing new products, services, or brands
is much easier when based on data collected from customers‘ needs and wants. Big
data analytics also helps businesses understand product viability and keep up with
trends.
3. Strategic business decisions: The ability to constantly analyze data helps businesses
make better and faster decisions, such as cost and supply chain optimization.
4. Customer experience: Data-driven algorithms help marketing efforts (targeted ads,
as an example) and increase customer satisfaction by delivering an enhanced
customer experience.
5. Risk management: Businesses can identify risks by analyzing data patterns and
developing solutions for managing those risks.
6. Entertainment: Providing a personalized recommendation of movies and music
according to a customer‘s individual preferences has been transformative for the
entertainment industry (think Spotify and Netflix).
7. Education: Big data helps schools and educational technology companies alike
develop new curriculums while improving existing plans based on needs and
demands.
8. Health care: Monitoring patients‘ medical histories helps doctors detect and prevent
diseases.
9. Government: Big data can be used to collect data from CCTV and traffic cameras,
satellites, body cameras and sensors, emails, calls, and more, to help manage the
public sector.
10. Marketing: Customer information and preferences can be used to create targeted
advertising campaigns with a high return on investment (ROI).
11. Banking: Data analytics can help track and monitor illegal money laundering.

Why is big data analytics important?


Big data analytics is important because it helps companies leverage their data to identify
opportunities for improvement and optimization. Across different business segments,
increasing efficiency leads to overall more intelligent operations, higher profits, and satisfied
customers. Big data analytics helps companies reduce costs and develop better, customer-
centric products and services.
Data analytics helps provide insights that improve the way our society functions. In
health care, big data analytics not only keeps track of and analyzes individual records, but

plays a critical role in measuring COVID-19 outcomes on a global scale. It informs health
ministries within each nation‘s government on how to proceed with vaccinations and
devises solutions for mitigating pandemic outbreaks in the future.
Q6. Which tools are used by data analyst?
Ans:
Tools used in big data analytics
Harnessing all of that data requires tools. Thankfully, technology has advanced so that
there are many intuitive software systems available for data analysts to use.
 Hadoop: An open-source framework that stores and processes big data sets. Hadoop
is able to handle and analyze structured and unstructured data.
 Spark: An open-source cluster computing framework used for real-time processing and analysis of data (see the short sketch after this list).
 Data integration software: Programs that allow big data to be streamlined across different platforms, such as MongoDB, Apache Hadoop, and Amazon EMR.
 Stream analytics tools: Systems that filter, aggregate, and analyze data that might be
stored in different platforms and formats, such as Kafka.
 Distributed storage: Databases that can split data across multiple servers and have
the capability to identify lost or corrupt data, such as Cassandra.
 Predictive analytics hardware and software: Systems that process large amounts of
complex data, using machine learning and algorithms to predict future outcomes,
such as fraud detection, marketing, and risk assessments.
 Data mining tools: Programs that allow users to search within structured and
unstructured big data.
 NoSQL databases: Non-relational data management systems ideal for dealing with
raw and unstructured data.
 Data warehouses: Storage for large amounts of data collected from many different
sources, typically using predefined schemas.
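For instance, a classic word count, the kind of batch job run with tools such as Spark from the list above, might look like the following sketch (illustrative only, assuming PySpark is installed; the file name "logs.txt" and the app name are hypothetical):

```python
# Illustrative sketch, assuming PySpark is installed; file and app names are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

lines = spark.read.text("logs.txt").rdd.map(lambda row: row[0])   # read raw text lines
counts = (lines.flatMap(lambda line: line.split())                # split lines into words
               .map(lambda word: (word, 1))                       # map each word to a count of 1
               .reduceByKey(lambda a, b: a + b))                  # sum the counts per word

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```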


INTRODUCING TECHNOLOGIES FOR HANDLING BIG DATA


1.6 Distributed and Parallel Computing for Big Data
Q7. What is parallel and distributed processing in big data?
Ans:
Distributed Computing
When solving problems, we don't need to limit our solutions to running on a single computer. Instead, we can use distributed computing to distribute the problem across multiple networked computing devices.

In computer science, Distributed Computing studies distributed systems. In a distributed system, the components are spread across multiple networked computers, which communicate and coordinate their actions by transferring messages from one system to the next.

The components interact with one another so that they can reach a common purpose.
Maintaining component concurrency, overcoming the lack of a global clock, and controlling
the independent failure of parts are three crucial issues of distributed systems. When one
system's component fails, the system does not fail. Peer-to-peer applications, SOA-based
systems, and massively multiplayer online games are all examples of distributed systems.

The use of distributed systems to address computational issues is frequently referred to


as distributed computing. A problem is divided into numerous jobs in distributed
computing, each of which is solved by one or more computers that communicate with one
another via message passing.

Need for Distributed Computing for Big Data


 Not every situation necessitates the use of distributed computing. Complicated
processing can be done remotely using a specialist service if there isn't a significant
time limitation. IT moves data to an external service or entity with plenty of free
resources for processing when corporations want to undertake extensive data
analysis.
 The data management industry was transformed by robust hardware and software
innovations. First, demand and innovation boosted the power of hardware while

lowering its price. The new software was developed to use this hardware by
automating activities such as load balancing and optimization over a large cluster of
nodes.
 Built-in rules in the software recognized that certain workloads demanded a specific
level of performance. Using the virtualization technology, software regarded all
nodes as one giant pool of processing, storage, and networking assets. It shifted
processes to another node without interruption if one failed.
 It was not that firms were content to wait for the results they required; it was simply not financially feasible to purchase enough computing power to meet these new demands. Because of the costs, many businesses would merely acquire a subset of data rather than trying to gather all of it. Analysts wanted all of the data, but they had to make do with snapshots to capture the appropriate data at the right time.

Distribution of parallel processes


Distributed computing is often used in tandem with parallel computing. Parallel
computing on a single computer uses multiple processors to process tasks in parallel,
whereas distributed parallel computing uses multiple computing devices to process those
tasks.

Consider our example program that detects cats in images. In a distributed computing
approach, a managing computer would send the image information to each of the worker
computers and each worker would report back their results.
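A hedged sketch of that manager-worker pattern follows (not from these notes: detect_cat() is a hypothetical stand-in for a real image-analysis routine, and the file names are invented):

```python
# Illustrative sketch: a managing process hands one image to each worker and collects the results.
from concurrent.futures import ProcessPoolExecutor

def detect_cat(image_name):
    # Placeholder "analysis": pretend the file name tells us whether a cat is present.
    return image_name, "cat" in image_name

if __name__ == "__main__":
    images = ["garden_cat.jpg", "street.jpg", "roof_cat.jpg"]
    with ProcessPoolExecutor(max_workers=3) as workers:
        for name, found in workers.map(detect_cat, images):   # each worker reports back its result
            print(name, "->", "cat detected" if found else "no cat")
```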

Evaluating the performance


Distributed computing can improve the performance of many solutions by taking advantage of hundreds or thousands of computers running in parallel. We can measure the gains by calculating the speedup: the time taken by the sequential solution divided by the time taken by the distributed parallel solution. If a sequential solution takes 60 minutes and a distributed solution takes 6 minutes, the speedup is 10.

The performance of distributed solutions can


also suffer from their distributed nature, however.
The computers must communicate over the
network, sending messages with input and output
values. Every message sent back and forth takes some amount of time, and that time adds to
the overall time of the solution. For a distributed computing solution to be worth the
trouble, the time saved by distributing the operations
must be greater than the time added by the
communication overhead.
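Both effects can be written out directly in a small illustrative sketch (not from these notes; the overhead figures are invented):

```python
# Illustrative sketch: speedup = sequential time / distributed time.
def speedup(sequential_minutes, distributed_minutes):
    return sequential_minutes / distributed_minutes

print(speedup(60, 6))   # 10.0, matching the example above

# If, say, 100 messages each add 0.01 minutes of network time, the distributed
# time grows and the speedup shrinks.
overhead = 100 * 0.01
print(round(speedup(60, 6 + overhead), 2))   # about 8.57
```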

In the simplest distributed computing architecture, the managing computer needs to communicate with each worker directly.

In more complex architectures, worker nodes must communicate with other worker
nodes. This is necessary when using distributed computing to train a deep learning network.



One way to reduce the communication time is to use cluster computing: co-located
computers on a local network that all work on similar tasks. In a computer cluster, a
message does not have to travel very far and more importantly, does not have to travel over
the public Internet.

Cluster computing has its own limitations; setting up a cluster requires physical space,
hardware operations expertise, and of course, money to buy all the devices and networking
infrastructure.

Fortunately, many companies now offer cloud computing services which give
programmers everywhere access to managed clusters. The companies manage the hardware
operations, provide tools to upload programs, and charge based on usage.

Distribution of functionality
Another form of distributed computing is to use different computing devices to execute
different pieces of functionality.

For example, imagine a zoo with an array of security cameras. Each security camera
records video footage in a digital format. The
cameras send their video data to a computer
cluster located in the zoo headquarters, and
that cluster runs video analysis algorithms to
detect escaped animals. The cluster also sends
the video data to a cloud computing server
which analyzes terabytes of video data to
discover historical trends.

Each computing device in this distributed


network is working on a different piece of the
problem, based on their strengths and
weaknesses. The security cameras themselves
don't have enough processing power to detect escaped animals or enough storage space for
the other cameras' footage (which could help an algorithm track movement). The local
cluster does have a decent amount of processing power and extra storage, so it can perform
the urgent task of escaped animal detection. However, the cluster defers the task which
requires the most processing and storage (but isn't as time sensitive) to the cloud computing
server.

This form of distributed computing recognizes that the world is filled with a range of
computing devices with varying capabilities, and ultimately, some problems are best solved
by utilizing a network of those devices.

In fact, you're currently participating in a giant example of distributed computing: the


web. Your computer is doing a lot of processing to read this website: sending HTTP requests
to get the website data, interpreting the JavaScript that the website loads, and constantly
updating the screen as you scroll the page. But our servers are also doing a lot of work while
responding to your HTTP requests, plus we send data out to high-powered analytics servers
for further processing.


Every application that uses the Internet is an example of distributed computing, but each
application makes different decisions about how it distributes the computing. For another
example, smart home assistants do a small amount of language processing locally to
determine that you've asked them for help but then send your audio to high-powered
servers to parse your full question.

The Internet enables distributed computing at a worldwide scale, both to distribute


parallel computation and to distribute functionality. Computer scientists, programmers, and
entrepreneurs are constantly discovering new ways to use distributed computing to take
advantage of such a massive network of computers to solve problems.

Difference between Parallel Computing and Distributed Computing:


 Parallel computing: many operations are performed simultaneously. Distributed computing: system components are located at different locations.
 Parallel computing requires a single computer. Distributed computing uses multiple computers.
 In parallel computing, multiple processors perform multiple operations. In distributed computing, multiple computers perform multiple operations.
 Parallel computing may have shared or distributed memory. Distributed computing has only distributed memory.
 In parallel computing, processors communicate with each other through a bus. In distributed computing, computers communicate with each other through message passing.
 Parallel computing improves system performance. Distributed computing improves system scalability, fault tolerance and resource-sharing capabilities.

Key differences between the Parallel Computing and Distributed Computing


Some of the key differences between parallel computing and distributed computing are
as follows:
 Parallel computing is a sort of computation in which various tasks or processes are
run at the same time. In contrast, distributed computing is that type of computing in
which the components are located on various networked systems that interact and
coordinate their actions by passing messages to one another.
 In parallel computing, processors communicate with another processor via a bus. On
the other hand, computer systems in distributed computing connect with one
another via a network.
 Parallel computing takes place on a single computer. In contrast, distributed
computing takes place on several computers.



 Parallel computing aids in improving system performance. On the other hand,
distributed computing allows for scalability, resource sharing, and the efficient
completion of computation tasks.
 The computer in parallel computing can have shared or distributed memory. In
contrast, every system in distributed computing has its memory.
 Multiple processors execute multiple tasks simultaneously in parallel computing. In
contrast, many computer systems execute tasks simultaneously in distributed
computing.

1.7 Cloud Computing and Big Data


Q8. What is Big Data cloud in Cloud Computing?
Ans:
The term cloud refers to a network or the internet. It is a technology that uses remote
servers on the internet to store, manage, and access data online rather than local drives. The
data can be anything such as files, images, documents, audio, video, and more.

There are the following operations that we can do


using cloud computing:
 Developing new applications and services
 Storage, back up, and recovery of data
 Hosting blogs and websites
 Delivery of software on demand
 Analysis of data
 Streaming videos and audios

In cloud computing, all data is gathered in data centres and then distributed to the end-users. Further, automatic backup and recovery of data is also ensured for business continuity; all such resources are available in the cloud.

We do not know the exact physical location of the resources provided to us. You just need a simple terminal such as a desktop, laptop or phone, and an internet connection.

There are multiple ways to access the cloud:


 Applications or software as a service (SAAS) ex. Salesforce.com, dropbox, google
drive etc.
 Platform as a service (PAAS)
 Infrastructure as a service (IAAS)

The Roles & Relationship Between Big Data & Cloud Computing
Cloud Computing providers often utilize a "software as a service" model to allow customers to easily process data. Typically, a console that can take in specialized commands and parameters is available, but everything can also be done from the site's user interface.
Some products that are usually part of this package include database management systems,
cloud-based virtual machines and containers, identity management systems, machine
learning capabilities, and more.

In turn, Big Data is often generated by large, network-based systems. It can be in either a
standard or non-standard format. If the data is in a non-standard format, artificial

intelligence from the Cloud Computing provider may be used in addition to machine
learning to standardize the data.

From there, the data can be harnessed through the Cloud Computing platform and
utilized in a variety of ways. For example, it can be searched, edited, and used for future
insights.

This cloud infrastructure allows for real-time processing of Big Data. It can take huge "blasts" of data from intensive systems and interpret it in real time. Another common relationship between Big Data and Cloud Computing is that the power of the cloud allows Big Data analytics to occur in a fraction of the time it used to.

Q9. What are the main features of cloud services?


Ans:
Features of Cloud Computing
Cloud computing is becoming popular day by day. Continuous business expansion and
growth requires huge computational power and large-scale data storage systems. Cloud
computing can help organizations expand and securely move data from physical locations
to the 'cloud' that can be accessed anywhere.

Cloud computing has many features that make it one of the fastest growing industries at present. The flexibility offered by cloud services, in the form of their growing set of tools and technologies, has accelerated their deployment across industries. The essential features of cloud computing are described below.

 Resources Pooling: Resource


pooling is one of the essential
features of cloud computing.
Resource pooling means that a cloud
service provider can share resources
among multiple clients, each
providing a different set of services
according to their needs. It is a multi-client strategy that can be applied to data
storage, processing and bandwidth-delivered services. The administration process of
allocating resources in real-time does not conflict with the client's experience.
 On-Demand Self-Service: It is one of the important and essential features of cloud
computing. This enables the client to continuously monitor server uptime,
capabilities and allocated network storage. This is a fundamental feature of cloud
computing, and a customer can also control the computing capabilities according to
their needs.
 Easy Maintenance: This is one of the best cloud features. Servers are easily
maintained, and downtime is minimal or sometimes zero. Cloud computing
powered resources often undergo several updates to optimize their capabilities and
potential. Updates are more viable with devices and perform faster than previous
versions.
 Scalability And Rapid Elasticity: A key feature and advantage of cloud computing
is its rapid scalability. This cloud feature enables cost-effective handling of
workloads that require a large number of servers but only for a short period. Many

customers have workloads that can be run very cost-effectively due to the rapid
scalability of cloud computing.
 Economical: This cloud feature helps in reducing the IT expenditure of organizations. In cloud computing, clients need to pay the provider only for the space used by them; there are no hidden or additional charges. Administration is economical, and more often than not, some space is allocated for free.
 Measured And Reporting Service: Reporting Services is one of the many cloud
features that make it the best choice for organizations. The measurement and
reporting service is helpful for both cloud providers and their customers. This
enables both the provider and the customer to monitor and report which services
have been used and for what purposes. It helps in monitoring billing and ensuring
optimum utilization of resources.
 Security: Data security is one of the best features of cloud computing. Cloud services
make a copy of the stored data to prevent any kind of data loss. If one server loses
data by any chance, the copied version is restored from the other server. This feature
comes in handy when multiple users are working on a particular file in real-time,
and one file suddenly gets corrupted.
 Automation: Automation is an essential feature of cloud computing. The ability of
cloud computing to automatically install, configure and maintain a cloud service is
known as automation in cloud computing. In simple words, it is the process of
making the most of the technology and minimizing the manual effort. However,
achieving automation in a cloud ecosystem is not that easy. This requires the
installation and deployment of virtual machines, servers, and large storage. On
successful deployment, these resources also require constant maintenance.
 Resilience: Resilience in cloud computing means the ability of a service to quickly
recover from any disruption. The resilience of a cloud is measured by how fast its
servers, databases and network systems restart and recover from any loss or damage.
Availability is another key feature of cloud computing. Since cloud services can be
accessed remotely, there are no geographic restrictions or limits on the use of cloud
resources.
 Large Network Access: A big part of the cloud's characteristics is its ubiquity. The
client can access cloud data or transfer data to the cloud from any location with a
device and internet connection. These capabilities are available everywhere in the
organization and are achieved with the help of internet. Cloud providers deliver that
large network access by monitoring and guaranteeing measurements that reflect how
clients access cloud resources and data: latency, access times, data throughput, and
more.

1.8 Deployment Models


Q10. What are the 4 different cloud computing deployment models?
Ans:
Deployment Models
The cloud deployment model identifies the specific type of cloud environment based on
ownership, scale, and access, as well as the cloud‘s nature and purpose. The location of the

servers you‘re utilizing and who controls them are defined by a cloud deployment model. It
specifies how your cloud infrastructure will look, what you can change, and whether you
will be given services or will have to create everything yourself. Relationships between the
infrastructure and your users are also defined by cloud deployment types.

Public Cloud Computing


A cloud platform that is based on the standard cloud computing model, in which a
service provider offers resources, applications and storage to customers over the internet, is
called public cloud computing. The hardware resources in a public cloud are shared among
similar users and are accessible over a public network such as the internet. Most of the
applications that are offered over the internet, such as Software as a Service (SaaS) offerings
like cloud storage and online applications, use the public cloud computing platform.
Budget-conscious startups and SMEs that are not keen on a high level of security features
and are looking to save money can opt for public cloud computing.

Private Cloud Computing


A cloud platform in which a secure cloud-based environment with dedicated storage and
hardware resources is provided to a single organization is called private cloud computing.
The private cloud can be either hosted within the company or outsourced to a trusted and
reliable third-party vendor. It offers the company greater control over privacy and data
security. The resources in a private cloud are not shared with others, and hence it offers better
performance compared to the public cloud. The additional layers of security allow the company to
process confidential data and sensitive work in the private cloud environment.

Based on the location and management, National Institute of Standards and Technology
(NIST) divide private cloud into the following two parts-
 On-premise private cloud
 Outsourced private cloud

Hybrid Cloud
Hybrid Cloud is a combination of the public cloud and the private cloud. We can say:
Hybrid Cloud = Public Cloud + Private Cloud

Hybrid cloud is partially secure because the services which are running on the public
cloud can be accessed by anyone, while the services which are running on a private cloud
can be accessed only by the organization's users.

Community Cloud
Community cloud allows systems and services to be accessible by a group of several
organizations to share the information between the organization and a specific community.
It is owned, managed, and operated by one or more organizations in the community, a third
party, or a combination of them.

1.9 Cloud Deploy and service Models


Q11. What Are the Most Common Cloud Computing Service Delivery Models?
Ans:
Cloud Computing can be defined as the practice of using a network of remote servers
hosted on the Internet to store, manage, and process data, rather than a local server or a
personal computer. Companies offering such kinds of cloud computing services are called
cloud providers and typically charge for cloud computing services based on usage. Grids
and clusters are the foundations for cloud computing.

Types of Cloud Computing
Most cloud computing services fall into three broad categories:
1. Software as a service (SaaS)
2. Platform as a service (PaaS)
3. Infrastructure as a service (IaaS)
These are sometimes called the cloud computing stack because they are built on top of
one another. Knowing what they are and how they are different makes it easier to
accomplish your goals. These abstraction layers can also be viewed as a layered architecture
where services of a higher layer are composed of services of the underlying layer, e.g., SaaS
can be built on top of PaaS, which in turn runs on IaaS.
Infrastructure as a Service (IaaS)
IaaS is also known as Hardware as a Service (HaaS). It is a computing infrastructure
managed over the internet. The main advantage of using IaaS is that it helps users to avoid
the cost and complexity of purchasing and managing the physical servers.
Characteristics of IaaS
There are the following characteristics of IaaS -
 Resources are available as a service
 Services are highly scalable
 Dynamic and flexible
 GUI and API-based access
 Automated administrative tasks

Platform as a Service (PaaS)


PaaS cloud computing platform is created for the programmer to develop, test, run, and
manage the applications.
Characteristics of PaaS
There are the following characteristics of PaaS -
 Accessible to various users via the same development application.
 Integrates with web services and databases.
 Builds on virtualization technology, so resources can easily be scaled up or down as
per the organization's need.
 Support multiple languages and frameworks.
 Provides an ability to "Auto-scale"

Software as a Service (SaaS)


SaaS is also known as "on-demand software". It is a software in which the applications
are hosted by a cloud service provider. Users can access these applications with the help of
internet connection and web browser.
Characteristics of SaaS
There are the following characteristics of SaaS -
 Managed from a central location
 Hosted on a remote server
 Accessible over the internet
 Users are not responsible for hardware and software updates. Updates are applied
automatically.
 The services are purchased on the pay-as-per-use basis

1.10 Cloud Providers in Big Data Markets


Q12. What are cloud providers give example?
Ans:
Providers in the Big Data Cloud Market
Cloud computing companies come in all shapes and sizes. All large software vendors
either have already launched offerings in the cloud space or are in the process of launching one.

In addition, there are many startups with interesting products in the cloud space. Below is
a list of major cloud computing vendors. A few of the cloud providers are Google, Citrix,
Netmagic, Red Hat, Rackspace, etc. Amazon Web Services (AWS) is the leading cloud provider
among them. Microsoft also provides cloud services under the name Azure.

Infrastructure as a Service cloud computing companies:


 Amazon‘s offerings include S3 (Data storage/file system), SimpleDB (non-relational
database) and EC2 (computing servers).
 Rackspace‘s offerings include Cloud Drive (Data storage/file system), Cloud Sites
(web site hosting on cloud) and Cloud Servers(computing servers).
 IBM‘s offerings include Smart Business Storage Cloud and Computing on Demand
(CoD).
 AT&T‘s provides Synaptic Storage and Synaptic Compute as a service.
Platform as a Service cloud computing companies
 Google's App Engine is a development platform that is built upon Python and Java.
 Salesforce.com's Force.com provides a development platform that is based upon Apex.
 Microsoft Azure provides a development platform based upon .Net.
Software as a Service companies
 In SaaS, Google provides space that includes Google Docs, Gmail, Google Calendar
and Picasa.
 IBM provides LotusLive iNotes, a web-based email service for messaging and
calendaring capabilities to business users.
 Zoho provides online products similar to Microsoft office suite.


Internal Assessment

MULTIPLE CHOICE QUESTIONS

1. How can Big data analytics help to prevent fraud? [ a ]


a) Analyze all the Data b) Detect Fraud in real time
c) Use Predictive Analytics d) All
2. Data in ____ bytes size is called big data [ d ]
a) Meta b) Giga c) Tera d) Peta
3. ___________ is a collection of data that is used in volume, yet growing [ d ]
exponentially with time
a) Big DataBase b)Big Data File c) BigDBMS d) Big Data
4. Total V‘s of big data is ____ [ c ]
a) 3 b) 4 c) 5 d) 2
5. Big data analysis does the following except? [ b ]
a) Spread Sheet b) Analyze Data c) Organize Data d) Collect Data
6. Who is the father of cloud computing? [ c ]
a) Sharon B. Codd b) Edgar Frank Codd
c) J.C.R. Licklider d) Charles Bachman
7. Which of the following is not a type of cloud server? [ d ]
a) Public Cloud Servers b) Private Cloud Servers
c) Dedicated Cloud Servers d) Merged Cloud Servers
8. Which of the following is a type of cloud computing service? [ c ]
a) Service-as-a-Software (SaaS) b) Software-and-a-Server (SaaS)
c) Software-as-a-Service (SaaS) d) Software-as-a-Server (SaaS)

FILL IN THE BLANKS


1. Parallel Computing is also known as _____________. (Parallel processing)
2. __________ is a collection of data that is huge in volume.(Big Data)
3. __________ of data, is arranging the available data in a manner so that it becomes
easy to study. (Structuring)
4. __________ is the process of collecting, examining, and analyzing large amounts of
data. (Big data analytics)
5. _______________ is an open-source framework that stores and processes big data
sets. (Hadoop)


6. Distributed Computing studies ______ (Distributed Systems)
7. In_________ , all data is gathered in data centres and then distributed to the end-
users (Cloud Computing)
8. ________ is a combination of the public cloud and the private cloud. (Hybrid Cloud)
9. ______ is also known as "on-demand software". (SaaS)
10. Cloud computing companies come in all ____________ & _______. (Shapes , Sizes)

Very Short Questions


Q1. What is distributed computing?
Ans:
Distributed computing is the branch of computer science that studies distributed
systems. A distributed system is one in which the components are spread across multiple
networked computers and communicate and coordinate their actions by passing messages
from one system to another.

Q2. Define Parallel Computer.


Ans:
Parallel Computing is also known as parallel processing. It utilizes several processors.
Each processor completes the tasks that have been allocated to it.

Q3. What is Unstructured Data?


Ans:
Unstructured data is data that does not have any structure. It can be in any form, and
there is no pre-defined data model, so it cannot be stored in traditional databases. It is
complex to search and process.

Q4. Define Public Cloud Computing.


Ans:
A cloud platform that is based on the standard cloud computing model, in which a service
provider offers resources, applications and storage to customers over the internet, is called
public cloud computing.

Q5. What is Private Cloud Computing?


Ans:
A cloud platform in which a secure cloud-based environment with dedicated storage
and hardware resources is provided to a single organization is called Private Cloud
Computing.

Q6. What is Community Cloud?


Ans:
Community cloud allows systems and services to be accessible by a group of several
organizations to share the information between the organization and a specific community.

Q7. List Type of Cloud Deployment Model
Ans:
Here are some important types of Cloud Deployment models:
 Private Cloud: Resource managed and used by the organization.
 Public Cloud: Resource available for the general public under the Pay as you go
model.
 Community Cloud: Resource shared by several organizations, usually in the same
industry.
 Hybrid Cloud: This cloud deployment model is partly managed by the service
provided and partly by the organization.

Understanding Hadoop Ecosystem


Introducing Hadoop, HDFS and MapReduce,
Hadoop functions, Hadoop Ecosystem.

UNIT Hadoop Distributed File System


HDFS Architecture, Concept of Blocks in HDFS
Architecture, Namenodes and Datanodes, Features
of HDFS. MapReduce.
Introducing HBase

II HBase Architecture, Regions, Storing Big Data with


HBase, Combining HBase and HDFS, Features of
HBase, Hive, Pig and Pig Latin, Sqoop, ZooKeeper,
Flume, Oozie.

Objective

 Hadoop Ecosystem and its Major Components


 Various aspects of Data storages in Hadoop
 Hadoop Functions
 Introduction to HDFS and its Architecture
 HBase and its Architecture
 Features of HBase
 HBase with HDFS
 Brief explanation about Components
 YARN,
 HBase,
 HIVE,
 Pig Latin,
 Sqoop,
 Zoo-Keeper,
 Flume and Oozie


PART – A
Short Type Questions
Q1. What is Hadoop?
Ans:
Hadoop, as a Big Data framework, provides businesses with the ability to distribute data
storage and parallel processing, and to process data of higher volume, velocity, variety,
value, and veracity. HDFS, MapReduce, and YARN are its three major components.

Hadoop HDFS uses name nodes and data nodes to store extensive data. MapReduce
manages these nodes for processing, and YARN acts as an Operating system for Hadoop in
managing cluster resources.

Q2.What is MapReduce?
Ans:
A MapReduce is a data processing tool which is used to process the data parallelly in a
distributed form. It was developed in 2004, on the basis of paper titled as "MapReduce:
Simplified Data Processing on Large Clusters," published by Google.

The MapReduce is a paradigm which has two phases, the mapper phase, and the
reducer phase. In the Mapper, the input is given in the form of a key-value pair. The output
of the Mapper is fed to the reducer as input. The reducer runs only after the Mapper is over.
The reducer too takes input in key-value format, and the output of reducer is the final
output.

Q3. What is HDFS?


Ans:
Hadoop Distributed File System (HDFS) is the primary data storage system used by
Hadoop applications. To implement a distributed file system that provides high-
performance access to data across highly scalable Hadoop clusters, HDFS uses the
NameNode and DataNode architecture. Apache Hadoop is an open-source framework for
managing data processing and storage for big data applications. HDFS is a crucial part of the
Hadoop ecosystem. It can manage big data pools and support big data analytics
applications. HDFS has two components, which are as follows:
1) Namenode
2) Datanode

Q4. List few HDFS Features


Ans:
Unlike other distributed file systems, HDFS is highly fault-tolerant and can be deployed
on low-cost hardware. It can easily handle applications that involve large data sets.
 Highly Scalable - HDFS is highly scalable as it can scale to hundreds of nodes in a
single cluster.
 Replication - Due to some unfavorable conditions, the node containing the data may
be lost. To overcome such problems, HDFS always maintains a copy of the data on
a different machine.
 Fault tolerance - In HDFS, fault tolerance signifies the robustness of the system in
the event of failure. HDFS is so fault-tolerant that if any machine fails, another
machine containing a copy of that data automatically becomes active.


 Distributed data storage - This is one of the most important features of HDFS that
makes Hadoop very powerful. Here, data is divided into multiple blocks and stored
on the nodes.
 Portable - HDFS is designed in such a way that it can easily be ported from one
platform to another.

Q5. What is HBase?


Ans:
HBase is an open-source, column-oriented distributed database system in a Hadoop
environment. It is modeled after Google's Bigtable and is primarily written in Java. Apache
HBase is needed for real-time Big Data applications.

HBase can store massive amounts of data, from terabytes to petabytes. The tables present
in HBase consist of billions of rows having millions of columns. HBase is built for low-latency
operations and has some specific features compared to traditional relational models.

Q6. What is HBase vs Hadoop? Or Difference Between Hadoop and HBase


Ans:
Hadoop                                                 HBase
Hadoop is a collection of software tools               HBase is a part of the Hadoop ecosystem
Stores data sets in a distributed environment          Stores data in a column-oriented manner
Hadoop is a framework                                  HBase is a NoSQL database
Data are stored in the form of chunks                  Data are stored in the form of key/value pairs
Hadoop does not allow run-time changes                 HBase allows run-time changes
A file can be written only once but read many times    A file can be read and written multiple times
Hadoop (HDFS) has high-latency operations              HBase has low-latency operations
HDFS can be accessed through MapReduce                 HBase can be accessed through shell commands,
                                                       Java API, REST

Q7. List few features of Hbase.


Ans:
Features of Hbase
 Horizontally scalable: You can add any number of columns anytime.
 Automatic Failover: Automatic failover is a resource that allows a system
administrator to automatically switch data handling to a standby system in the event
of system compromise
 Integration with the MapReduce framework: All the commands and Java code
internally implement MapReduce to do the task, and it is built over the Hadoop
Distributed File System.
 Sparse, distributed, persistent, multidimensional sorted map, which is indexed by
row key, column key, and timestamp.

 Often referred to as a key-value store or column-family-oriented database, or as
storing versioned maps of maps.
 Fundamentally, it is a platform for storing and retrieving data with random access.
 It does not care about datatypes (storing an integer in one row and a string in another
for the same column).
 It doesn't enforce relationships within your data.
 It is designed to run on a cluster of computers, built using commodity hardware.

PART – B
Essay Type Questions
UNDER STANDING HADOOP ECOSYSTEM, HDFS
2.1 Introducing Hadoop
Q1. What is Hadoop? Components of Hadoop and How Does It Work.
Ans:
Hadoop is an open-source framework from Apache that is used to store, process, and
analyze data which is very huge in volume. Hadoop is written in Java and is not an OLAP
(online analytical processing) system; it is used for batch/offline processing. It is used by
Facebook, Yahoo, Google, Twitter, LinkedIn and many more. Moreover, it can be scaled up
just by adding nodes to the cluster.

Modules of Hadoop
 HDFS: Hadoop Distributed File System. Google published its paper GFS and on the
basis of that HDFS was developed. It states that the files will be broken into blocks
and stored in nodes over the distributed architecture.
 Yarn: Yet Another Resource Negotiator is used for job scheduling and for managing
the cluster.
 Map Reduce: This is a framework which helps Java programs to do parallel
computation on data using key-value pairs. The Map task takes input data and
converts it into a data set which can be computed as key-value pairs. The output of
the Map task is consumed by the Reduce task, and then the output of the reducer
gives the desired result.
 Hadoop Common: These Java libraries are used to start Hadoop and are used by
other Hadoop modules.

How Does Hadoop Work?


Hadoop has a Master-Slave Architecture for data storage and distributed data
processing using MapReduce and HDFS methods.
 NameNode: The NameNode represents every file and directory that is used in the
namespace.
 DataNode: The DataNode helps you to manage the state of an HDFS node and allows
you to interact with the blocks.
 MasterNode: The master node allows you to conduct parallel processing of data
using Hadoop MapReduce.
 Slave node: The slave nodes are the additional machines in the Hadoop cluster which
allow you to store data and conduct complex calculations. Moreover, every slave node
comes with a Task Tracker and a DataNode. This allows the processes to be
synchronized with the NameNode and Job Tracker respectively.

The primary function of Hadoop is to process the data in an organised manner across a
cluster of commodity hardware. The client submits the data or program that needs to be
processed, Hadoop HDFS stores the data, and YARN and MapReduce divide the resources
and assign the tasks to the data. The working of Hadoop in detail is as follows:
 The client's input data is divided into 128 MB blocks by HDFS. Blocks are replicated
according to the replication factor: various DataNodes house the blocks and their
duplicates.
 The user can process the data once all blocks have been put on HDFS DataNodes.
 The client sends Hadoop the MapReduce program to process the data.
 The user-submitted program is then scheduled by the ResourceManager on particular
cluster nodes.
 The output is written back to the HDFS once processing has been completed by all
nodes.

2.2 Hadoop Ecosystem &Components


Q2. What is a Hadoop ecosystem? What are the main components of Hadoop ecosystem?
Ans:
Apache Hadoop is an open source software framework used to develop data processing
applications which are executed in a distributed computing environment.

Applications built using HADOOP are run on large data sets distributed across clusters
of commodity computers. Commodity computers are cheap and widely available. These are
mainly useful for achieving greater computational power at low cost.

Similar to data residing in a local file system of a personal computer system, in Hadoop,
data resides in a distributed file system which is called as a Hadoop Distributed File system.
The processing model is based on the 'Data Locality' concept, wherein computational logic is
sent to the cluster nodes (servers) containing the data. This computational logic is nothing but a
compiled version of a program written in a high-level language such as Java. Such a
program processes data stored in Hadoop HDFS.

HDFS (Hadoop Distributed File System)


 It is the storage component of Hadoop that stores data in the form of files.
 Each file is divided into blocks of 128MB (configurable) and stores them on different
machines in the cluster.


 It has a master-slave architecture with two main components: Name Node and Data
Node.
o Name node is the master node and there is only one per cluster. Its task is to
know where each block belonging to a file is lying in the cluster
o Data node is the slave node that stores the blocks of data and there are more
than one per cluster. Its task is to retrieve the data as and when required. It
keeps in constant touch with the Name node through heartbeats
MapReduce
 To handle Big Data, Hadoop relies on the MapReduce algorithm introduced by
Google and makes it easy to distribute a job and run it in parallel in a cluster. It
essentially divides a single task into multiple tasks and processes them on different
machines.
 In layman terms, it works in a divide-and-conquer manner and runs the processes on
the machines to reduce traffic on the network.
 It has two important phases: Map and Reduce.

Map phase filters, groups, and sorts the data. Input data is divided into multiple splits.
Each map task works on a split of data in parallel on different machines and outputs a key-
value pair. The output of this phase is acted upon by the reduce task and is known as the
Reduce phase. It aggregates the data, summarises the result, and stores it on HDFS.

YARN
YARN or Yet Another Resource Negotiator manages resources in the cluster and
manages the applications over Hadoop. It allows data stored in HDFS to be processed and
run by various data processing engines such as batch processing, stream processing,
interactive processing, graph processing, and many more. This increases efficiency with the
use of YARN.

HBase
HBase is a Column-based NoSQL database. It runs on top of HDFS and can handle any
type of data. It allows for real-time processing and random read/write operations to be
performed in the data.

Pig
Pig was developed for analyzing large datasets and overcomes the difficulty of writing
map and reduce functions. It consists of two components: Pig Latin and Pig Engine.

Pig Latin is the scripting language, which is similar to SQL, and Pig Engine is the execution
engine on which Pig Latin runs. Internally, the code written in Pig is converted to
MapReduce functions, which makes it very easy for programmers who are not proficient in Java.

Q3. How do you explain HDFS architecture with example?
Ans:
HDFS
Hadoop comes with a distributed file system called HDFS. In HDFS data is distributed
over several machines and replicated to ensure their durability to failure and high
availability to parallel application.

It is cost effective as it uses commodity hardware. It involves the concept of blocks, data
nodes and node name.

Where to use HDFS


 Very Large Files: Files should be of hundreds of megabytes, gigabytes or more.
 Streaming Data Access: The time to read whole data set is more important than
latency in reading the first. HDFS is built on write-once and read-many-times
pattern.
 Commodity Hardware: It works on low cost hardware.
Where not to use HDFS
 Low Latency data access: Applications that require very low latency access to the first
record should not use HDFS, as it gives importance to the whole data set rather than
the time taken to fetch the first record.
 Lots of Small Files: The NameNode keeps the metadata of files in memory, and if
there are lots of small files, this consumes a large amount of the NameNode's memory,
which is not feasible.
 Multiple Writes: It should not be used when we have to write multiple times.

HDFS Features
Unlike other distributed file systems, HDFS is highly fault-tolerant and can be deployed
on low-cost hardware. It can easily handle applications that involve large data sets.
 Highly Scalable - HDFS is highly scalable as it can scale to hundreds of nodes in a
single cluster.
 Replication - Due to some unfavorable conditions, the node containing the data may
be lost. To overcome such problems, HDFS always maintains a copy of the data on
a different machine.
 Fault tolerance - In HDFS, fault tolerance signifies the robustness of the system in
the event of failure. HDFS is so fault-tolerant that if any machine fails, another
machine containing a copy of that data automatically becomes active.
 Distributed data storage - This is one of the most important features of HDFS that
makes Hadoop very powerful. Here, data is divided into multiple blocks and stored
on the nodes.
 Portable - HDFS is designed in such a way that it can easily be ported from one
platform to another.

Hadoop Architecture
The Hadoop architecture is a package of the file system, MapReduce engine and the
HDFS (Hadoop Distributed File System). The MapReduce engine can be MapReduce/MR1
or YARN/MR2.


A Hadoop cluster consists of a single master and multiple slave nodes. The master node
includes Job Tracker, Task Tracker, NameNode, and DataNode whereas the slave node
includes DataNode and TaskTracker.

HDFS follows the master-slave architecture and it has the following elements.
NameNode
 It is a single master server that exists in the HDFS cluster.
 As it is a single node, it may become a single point of failure.
 It manages the file system namespace by executing operations like opening,
renaming and closing files.
 It simplifies the architecture of the system.
DataNode
 The HDFS cluster contains multiple DataNodes.
 Each DataNode contains multiple data blocks.
 These data blocks are used to store data.
 It is the responsibility of DataNode to read and write requests from the file system's
clients.
 It performs block creation,
deletion, and replication upon
instruction from the NameNode.
Job Tracker
 The role of Job Tracker is to
accept the MapReduce jobs from
client and process the data by
using NameNode.
 In response, NameNode provides
metadata to Job Tracker.
Task Tracker
 It works as a slave node for Job Tracker.
 It receives task and code from Job Tracker and applies that code on the file. This
process can also be called as a Mapper.

2.2.1 Basic HDFS DFS Commands


Q4. How to access HDFS files from command line?
Ans:
Below are basic hdfs dfs or hadoop fs Commands.
COMMAND DESCRIPTION
-ls List files with permissions and other details
-mkdir Creates a directory named path in HDFS
-rm To Remove File or a Directory
Removes the file that identified by path / Folder and
-rmr
subfolders
-rmdir Delete a directory
-put Upload a file / Folder from the local disk to HDFS
-cat Display the contents for a file
-du Shows the size of the file on hdfs.
-dus Directory/file of total size

-get Store file / Folder from HDFS to local file
-getmerge Merge Multiple Files in an HDFS
-count Count number of directory, number of files and file size
-setrep Changes the replication factor of a file
-mv HDFS Command to move files from source to destination
-moveFromLocal Move file / Folder from local disk to HDFS
-moveToLocal Move a File to HDFS from Local
-cp Copy files from source to destination
-tail Displays last kilobyte of the file
-touch create, change and modify timestamps of a file
-touchz Create a new file on HDFS with size 0 bytes
-appendToFile Appends the content to the file which is present on HDF
-copyFromLocal Copy file from local file system
-copyToLocal Copy files from HDFS to local file system
-usage Return the Help for Individual Command
-checksum Returns the checksum information of a file
Change group association of files/change the group of a file
-chgrp
or a path
-chmod Change the permissions of a file
-chown change the owner and group of a file
-df Displays free space
-head Displays first kilobyte of the file
-Create Snapshots Create a snapshot of a snapshottable directory
-Delete Snapshots Delete a snapshot of from a snapshottable directory
-Rename Snapshots Rename a snapshot
-expunge create new checkpoint
-Stat Print statistics about the file/directory
Truncate all files that match the specified file pattern to the
-truncate
specified length
-find Find File Size in HDFS
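
The same operations can also be performed programmatically through Hadoop's Java
file system API. The short sketch below is only an illustration of this idea; the directory
/user/student/demo and the file input.txt are hypothetical names chosen for the example. It
performs the equivalents of -mkdir, -put, -ls and -cat using the org.apache.hadoop.fs.FileSystem
class, and it assumes the cluster configuration files are available on the classpath.

// FsExample.java - an illustrative sketch of HDFS Java API equivalents of common hdfs dfs commands
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();                 // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                     // handle to the configured file system

        Path dir = new Path("/user/student/demo");                // hypothetical HDFS directory
        fs.mkdirs(dir);                                           // equivalent of: hdfs dfs -mkdir

        Path local = new Path("input.txt");                       // hypothetical file on the local disk
        fs.copyFromLocalFile(local, new Path(dir, "input.txt"));  // equivalent of: hdfs dfs -put

        for (FileStatus st : fs.listStatus(dir)) {                // equivalent of: hdfs dfs -ls
            System.out.println(st.getPath() + "  " + st.getLen() + " bytes");
        }

        try (BufferedReader reader = new BufferedReader(          // equivalent of: hdfs dfs -cat
                new InputStreamReader(fs.open(new Path(dir, "input.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}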

2.2.2 Features of HDFS


Q5. What is HDFS explain its features and goals?
Ans:
Fault Tolerance
Fault tolerance in Hadoop HDFS is the working strength of the system in unfavorable
conditions. HDFS is highly fault-tolerant: the Hadoop framework divides data into blocks and
then creates multiple copies of the blocks on different machines in the cluster.

High Availability
Hadoop HDFS is a highly available file system. In HDFS, data gets replicated among the
nodes in the Hadoop cluster by creating a replica of the blocks on the other slaves present in
HDFS cluster. So, whenever a user wants to access this data, they can access their data from
the slaves which contain its blocks.


At the time of unfavorable situations like a failure of a node, a user can easily access their
data from the other nodes. Because duplicate copies of blocks are present on the other nodes
in the HDFS cluster.

High Reliability
HDFS provides reliable data storage. It can store data in the range of 100s of petabytes.
HDFS stores data reliably on a cluster. It divides the data into blocks. Hadoop framework
stores these blocks on nodes present in HDFS cluster.

HDFS stores data reliably by creating a replica of each and every block present in the
cluster. Hence provides fault tolerance facility. If the node in the cluster containing data goes
down, then a user can easily access that data from the other nodes.

HDFS by default creates 3 replicas of each block containing data present in the nodes. So,
data is quickly available to the users. Hence user does not face the problem of data loss.

Replication
Data Replication is unique features of HDFS. Replication solves the problem of data loss
in an unfavorable condition like hardware failure, crashing of nodes etc. HDFS maintain the
process of replication at regular interval of time.

HDFS also keeps creating replicas of user data on different machine present in the
cluster. So, when any node goes down, the user can access the data from other machines.
Thus, there is no possibility of losing of user data.

Scalability
Hadoop HDFS stores data on multiple nodes in the cluster. So, whenever requirements
increase you can scale the cluster. Two scalability mechanisms are available in HDFS:
Vertical and Horizontal Scalability.

Distributed Storage
All the features in HDFS are achieved via distributed storage and replication. HDFS
store data in a distributed manner across the nodes. In Hadoop, data is divided into blocks
and stored on the nodes present in the HDFS cluster.

After that HDFS create the replica of each and every block and store on other nodes.
When a single machine in the cluster gets crashed we can easily access our data from the
other nodes which contain its replica.

2.3 MapReduce
Q6. Explain MapReduce Architecture in Big Data with Example.
Ans:
MapReduce is a data processing tool which is used to process the data in parallel in a
distributed form. It was developed in 2004, on the basis of a paper titled "MapReduce:
Simplified Data Processing on Large Clusters," published by Google.
The MapReduce is a paradigm which has two phases, the mapper phase, and the
reducer phase. In the Mapper, the input is given in the form of a key-value pair. The output
of the Mapper is fed to the reducer as input. The reducer runs only after the Mapper is over.
The reducer too takes input in key-value format, and the output of reducer is the final
output.

Steps in Map Reduce
 The map takes data in the form of pairs and returns a list of <key, value> pairs. The
keys will not be unique in this case.
 Using the output of Map, sort and shuffle are applied by the Hadoop architecture.
This sort and shuffle acts on the list of <key, value> pairs and sends out each unique
key with a list of the values associated with that unique key: <key, list(values)>.
 The output of sort and shuffle is sent to the reducer phase. The reducer performs a
defined function on the list of values for each unique key, and the final output
<key, value> is stored/displayed.

MapReduce Architecture in Big Data


The whole process goes through four phases of execution, namely splitting, mapping,
shuffling, and reducing. Consider, as an example, a MapReduce program that counts how
often each word occurs in its input text. The data goes through the following phases of
MapReduce in Big Data:

Input Splits
An input to a MapReduce in Big Data job is divided into fixed-size pieces called input
splits Input split is a chunk of the input that is consumed by a single map

Mapping
This is the very first phase in the execution of a map-reduce program. In this phase, data in
each split is passed to a mapping function to produce output values. In our example, the job of
the mapping phase is to count the number of occurrences of each word from the input splits
and prepare a list in the form of <word, frequency>.

Shuffling
This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant
records from the Mapping phase output. In our example, the same words are clubbed together
along with their respective frequencies.

Reducing
In this phase, output values from the Shuffling phase are aggregated. This phase
combines values from Shuffling phase and returns a single output value.
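
To make the word-count example above concrete, the following is a minimal sketch of a
complete MapReduce program written against the org.apache.hadoop.mapreduce API. It is an
illustration rather than prescribed code: the class name WordCount and the input/output
paths passed as command-line arguments are assumptions. The mapper emits <word, 1>
pairs, the framework performs the sort and shuffle, and the reducer sums the values for each
unique word and writes the final <word, frequency> output to HDFS.

// WordCount.java - an illustrative sketch of the classic word-count MapReduce job
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapping phase: for every line of an input split, emit <word, 1>
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducing phase: receives <word, list(1, 1, ...)> after sort/shuffle and sums the values
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);                        // final <word, frequency> pair
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);             // optional local aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));      // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));    // output directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);          // submits the job and waits
    }
}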


MapReduce Architecture explained in detail
1. One map task is created for each split which then executes map function for each
record in the split.
2. It is always beneficial to have multiple splits because the time taken to process a split
is small compared to the time taken to process the whole input. When the splits are
smaller, the processing is better load balanced since we are processing the splits in
parallel.
3. However, it is also not desirable to have splits that are too small in size. When splits
are too small, the overhead of managing the splits and of map task creation begins to
dominate the total job execution time.
4. For most jobs, it is better to make the split size equal to the size of an HDFS block
(64 MB by default in older Hadoop versions, 128 MB in Hadoop 2.x and later).
5. Execution of map tasks results into writing output to a local disk on the respective
node and not to HDFS.
6. Reason for choosing local disk over HDFS is, to avoid replication which takes place
in case of HDFS store operation.
7. Map output is intermediate output which is processed by reduce tasks to produce
the final output.
8. Once the job is complete, the map output can be thrown away. So, storing it in HDFS
with replication becomes overkill.
9. In the event of node failure, before the map output is consumed by the reduce task,
Hadoop reruns the map task on another node and re-creates the map output.
10. The reduce task doesn't work on the concept of data locality. The output of every map
task is fed to the reduce task. Map output is transferred to the machine where the reduce
task is running.
11. On this machine, the output is merged and then passed to the user-defined reduce
function.
12. Unlike the map output, the reduce output is stored in HDFS (the first replica is stored on
the local node and other replicas are stored on off-rack nodes). So, writing the reduce
output does consume network bandwidth.


INTRODUCING HBASE
2.4 Introduction to HBase and Architecture
Q7. What is HBase and its architecture?
Ans:
HBase is an open-source, column-oriented distributed database system in a Hadoop
environment. It is modeled after Google's Bigtable and is primarily written in Java. Apache
HBase is needed for real-time Big Data applications.

HBase can store massive amounts of data, from terabytes to petabytes. The tables present
in HBase consist of billions of rows having millions of columns. HBase is built for low-latency
operations and has some specific features compared to traditional relational models.

HBase Architecture and its Important Components


HBase architecture consists mainly of the following components:
 HMaster
 HRegionserver
 HRegions
 Zookeeper
 HDFS
HBase is an open-source, distributed key-value data storage system and column-oriented
database with high write throughput and low-latency random read performance. By using
HBase, we can perform online real-time analytics. The HBase architecture has strong random
readability. In HBase, data is sharded physically into what are known as regions. A single
region server hosts each region, and each region server is responsible for one or more
regions. The HBase architecture follows a master-slave model: the HBase cluster has one
master node, called HMaster, and several Region Servers (HRegionServers), and each
Region Server holds multiple regions.
These components are described below:
 HMaster – The implementation of Master Server in HBase is HMaster. It is a process
in which regions are assigned to region server as well as DDL (create, delete table)
operations. It monitors all Region Server instances present in the cluster. In a
distributed environment, Master runs several background threads. HMaster has
many features like controlling load balancing, failover etc.
 Region Server – HBase Tables are divided horizontally by row key range into
Regions. Regions are the basic building elements of HBase cluster that consists of the
distribution of tables and are comprised of Column families. Region Server runs on
HDFS DataNode which is present in Hadoop cluster. Regions of Region Server are
responsible for several things, like handling, managing, executing as well as reads
and writes HBase operations on that set of regions. The default size of a region is 256
MB.


 HBase Regions - HRegions are the basic building elements of HBase cluster that
consists of the distribution of tables and are comprised of Column families. It
contains multiple stores, one for each column family. It consists mainly of two
components, which are MemStore and HFile.
 Zookeeper – It is like a coordinator in HBase. It provides services like maintaining
configuration information, naming, providing distributed synchronization, server
failure notification etc. Clients communicate with region servers via zookeeper.

HBase Data Model


HBase Data Model is a set of components that consists of Tables, Rows, Column families,
Cells, Columns, and Versions. HBase tables contain column families and rows, with an
element defined as the Primary Key. A column in an HBase data model table represents an
attribute of the object.

HBase Data Model consists of following elements,


 Set of tables
 Each table with column families and rows
 Each table must have an element defined as Primary Key.
 Row key acts as a Primary key in HBase.
 Any access to HBase tables uses this Primary Key
 Each column present in HBase denotes an attribute corresponding to the object
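
As an illustration of this data model, the short Java sketch below writes and reads one
cell using the standard HBase client API. The table name students, the column family info,
the column qualifier name and the row key row1 are hypothetical, and the table is assumed
to have been created beforehand (for example, from the HBase shell).

// HBaseExample.java - an illustrative sketch of writing and reading a cell with the HBase client API
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();        // reads hbase-site.xml (ZooKeeper quorum, etc.)
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("students"))) {

            // Write: row key + column family + column qualifier identify the cell being stored
            Put put = new Put(Bytes.toBytes("row1"));             // "row1" acts as the primary key
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Asha"));
            table.put(put);

            // Read the same cell back; the most recent version is returned by default
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("info:name = " + Bytes.toString(value));
        }
    }
}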

2.5 Regions
Q8. How regions are created in HBase?
Ans:
 HBase Tables are divided horizontally by row key range into “Regions.”
 A region contains all rows in the table between the region‟s start key and end key.
 Regions are assigned to the nodes in the cluster, called “Region Servers,” and these
serve data for reads and writes.
 A region server can serve about 1,000 regions.
 Each region is 1GB in size (default)

The basic unit of scalability and load balancing in HBase is called a region. These are
essentially contiguous ranges of rows stored together. They are dynamically split by the
system when they become too large. Alternatively, they may also be merged to reduce their
number and required storage files. An HBase system may have more than one region
server.
 Initially there is only one region for a table, and as we start adding data to it, the
system monitors it to ensure that you do not exceed a configured maximum size. If
you exceed the limit, the region is split into two at the middle key (the row key in the
middle of the region), creating two roughly equal halves.
 Each region is served by exactly one region server, and each of these servers can
serve many regions at any time.
 Rows are grouped in regions and may be served by different servers

(Figure: A table is divided into regions.)

2.6 Storing Big Data with HBase


Q9. Which storage is used by HBase?
Ans:
HBase is highly configurable and gives great flexibility to address massive amounts of
data efficiently. Now let's understand how HBase can help address your big data
challenges.
 HBase is a columnar database. Like relational database management systems
(RDBMSs), it stores all data in tables with columns and rows.
 The intersection of a column and row is called a cell. Each cell value contains a
"version" attribute, which is essentially a timestamp that uniquely identifies that version
of the cell.
 Versioning tracks changes in the cell and makes it possible to retrieve any version of
the contents.
 HBase stores the cell data in decreasing order of timestamp, so a reader will always
first see the most recent values.
 Columns in HBase belong to a column family. The column family name is used to
identify its family members.
 The rows in HBase tables also have a key associated with them. The structure of the
key is very flexible. It can be a computed value, a string, or another data structure.


 The key is used to control access to the cells in the row, and they are stored in order
from low to high value.
 These features together make up the schema. New tables and column families can be
added after the database is up and running.

Storage Mechanism in HBase


HBase is a column-oriented database and data is stored in tables. The tables are sorted
by row ID. Each row, identified by its row key, is a collection of the column families that are
present in the table.
The column families that are present in the schema are key-value pairs. If we observe in
detail, each column family has multiple columns. The column values are stored on disk.
Each cell of the table has its own metadata, like a timestamp and other information.

Coming to HBase the following are the key terms representing table schema
 Table: Collection of rows present.
 Row: Collection of column families.
 Column Family: Collection of columns.
 Column: Collection of key-value pairs.
 Namespace: Logical grouping of tables.
 Cell: A {row, column, version} tuple exactly specifies a cell definition in HBase.

Column-Oriented Vs Row-oriented storages


Column-oriented and row-oriented storages differ in their storage mechanism. As we all know,
traditional relational models store data in a row-based format, i.e., in terms of rows of
data. Column-oriented storages store data tables in terms of columns and column families.

2.7 Combining HBase with HDFS


Q10. How does HBase work with HDFS?
Ans:
Combining HBase and HDFS
HBase's utilization of HDFS is different from how it is utilized by MapReduce. In
MapReduce, by and large, HDFS files are opened, their content is streamed through a map
task (which takes a set of data and converts it into another set of data), and the files are then
closed. In HBase, data files are opened on cluster startup and kept open so that the user does
not have to pay the file-open cost on each access. Hence, HBase tends to encounter issues
not commonly experienced by MapReduce clients, which are as follows:

 File descriptors shortage—Since files are kept open on a loaded cluster, it does not
take long to run out of file descriptors on the system. For example, suppose we have a
cluster having three nodes, each running an instance of a DataNode and a region server,
and we are performing an upload on a table that presently has 100 regions and 10
column families. Suppose every column family has two file records. In that case, we will
have 100 * 10 * 2 = 2000 files open at any time. Add to this the miscellaneous
descriptors consumed by outstanding scanners and Java libraries. Each open file
consumes no less than one descriptor on the remote DataNode. The default quota of
file descriptors per process is 1024.
 Not many DataNode threads—essentially, the Hadoop DataNode has an upper limit
of 256 threads it can run at any given time. Suppose we use the same table
measurements cited earlier. It is not difficult to see how we can surpass this figure
on the DataNode, since each open connection to a file block consumes a thread.
If you look in the DataNode log, you will see an error like "xceiver count 256
exceeds limit of concurrent xcievers 256"; therefore, you need to be careful in
using the threads.
 Bad blocks—The DFSClient class in the region server will have a tendency to mark
file blocks as bad if the server is heavily loaded. Blocks can be recreated only
three times, so the region server will move on to the recreation of the blocks.
But if this recreation is performed during a period of heavy loading, we will
have two of the three blocks marked as bad. If the third block is
discovered to be bad, we will see an error stating "No live nodes contain current block"
in the region server logs. During startup, you may face many issues as regions are
opened and deployed.
 UI—HBase runs a Web server on the master to present a view of the state of your
running cluster. It listens on port 60010 by default. The master UI shows a list of basic
attributes, e.g., software versions, cluster load, request rates, a list of cluster tables, and
the participating region servers. In the master UI, click a region server and you will be
directed to the Web server running on that individual region server. It displays a list of
the regions carried by this server and metrics like consumed resources and request rate.
 Schema design—HBase tables are similar to RDBMS tables, except that HBase tables
have versioned cells and sorted rows and columns. The other thing to keep in mind is
that an important attribute of a column-family-oriented database like HBase is that it
can host wide and sparsely populated tables at no extra incurred cost.
 Row keys—Spend a good amount of time defining your row key. It can be utilized for
grouping related information together. If your keys are integers, utilize a binary
representation instead of a persistent string form of the number, as it requires less
space.

Q11. How does HDFS maintain Data Integrity?


Ans:
HDFS ensures data integrity throughout the cluster with the help of the following
features:

 Maintaining Transaction Logs—HDFS maintains transaction logs in order to
monitor every operation and carry out effective auditing and recovery of data in
case something goes wrong.
 Validating Checksum—Checksum is an effective error-detection technique wherein
a numerical value is assigned to a transmitted message on the basis of the number of
bits contained in the message. HDFS uses checksum validation for verification of the
content of a file. The validations are carried out as follows (a small illustrative sketch
of the checksum idea is given after this list):
1. When a file is requested by a client, the contents are verified using the checksum.
2. If the checksums of the received and sent messages match, the file operations
proceed further; otherwise, an error is reported.
3. The message receiver verifies the checksum of the message to ensure that it is the
same as in the sent message. If a difference is identified between the two values, the
message is discarded, assuming that it has been tampered with in transit.
Checksum files are hidden to avoid tampering.
 Creating Data Blocks—HDFS maintains replicated copies of data blocks to avoid
corruption of a file due to failure of a server. The degree of replication, the number of
data nodes in the cluster, and the specifications of the HDFS namespace are
identified and implemented during the initial implementation of the cluster.
However, these parameters can be adjusted at any time during the operation of the
cluster.
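
The sketch below illustrates the checksum idea mentioned in the list above using a plain
CRC-32 code in Java. It is only a conceptual illustration and not HDFS's internal code; HDFS
computes CRC checksums over fixed-size chunks of each block and stores them alongside
the data.

// ChecksumDemo.java - a conceptual sketch of checksum-based error detection (not HDFS's internal code)
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumDemo {

    // Compute a CRC-32 checksum for a block of bytes
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] sent = "block contents written by the client".getBytes(StandardCharsets.UTF_8);
        long checksumAtWrite = checksum(sent);          // stored when the block is written

        // Simulate the block being read back later, possibly corrupted on disk or in transit
        byte[] received = sent.clone();
        received[0] ^= 0x01;                            // flip one bit to simulate corruption

        long checksumAtRead = checksum(received);
        if (checksumAtRead == checksumAtWrite) {
            System.out.println("Checksums match: block accepted");
        } else {
            System.out.println("Checksum mismatch: block discarded and read from another replica");
        }
    }
}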

DataNodes are sometimes also called block servers. A block server primarily stores
data in a file system and maintains the metadata of a block. A block server carries out the
following functions:
 Storage (and retrieval) of data on a local file system. HDFS supports different
operating systems and provides similar performance on all of them.
 Storage of metadata of a block on the local file system on the basis of a similar
template on the NameNode.
 Conduct of periodic validations for file checksums.
 Intimation about the availability of blocks to the NameNode by sending reports
regularly.
 On-demand supply of metadata and data to the clients where client application
programs can directly access data nodes.
 Movement of data to connected nodes on the basis of the pipelining model.

2.8 Features of HBase


Q12. What are the key Features of HBase?
Ans:
 Consistency: We can use this HBase feature for high-speed requirements because it
offers consistent reads and writes.
 Atomic Read and Write: During one read or write process, all other processes are
prevented from performing any read or write operations this is what we call Atomic
read and write.
 Sharding: In order to reduce I/O time and overhead, HBase offers automatic and
manual splitting of regions into smaller subregions as soon as a region reaches a
threshold size.

 High Availability: Moreover, it offers LAN and WAN which supports failover and
recovery. Basically, there is a master server, at the core, which handles monitoring
the region servers as well as all metadata for the cluster.
 Client API: Through Java APIs, it also offers programmatic access.
 Scalability: In both linear and modular form, HBase supports scalability. In addition,
we can say it is linearly scalable.
 Hadoop/HDFS integration: HBase can run on top of other file systems as well as like
Hadoop/HDFS integration.
 Distributed storage: This feature of HBase supports distributed storage such as
HDFS.
 Data Replication: HBase supports data replication across clusters.
 Failover Support and Load Sharing: By using multiple block allocation and
replications, HDFS is internally distributed and automatically recovered and HBase
runs on top of HDFS, hence HBase is automatically recovered. Also using
RegionServer replication, this failover is facilitated.
 API Support: Because of Java APIs support in HBase, clients can access it easily.
 MapReduce Support: For parallel processing of large volume of data, HBase
supports MapReduce.
 Backup Support: In HBase “Backup support” means it supports back-up of Hadoop
MapReduce jobs in HBase tables.
 Sorted Row Keys: Since searching is done on a range of rows and HBase stores row
keys in lexicographical order, an optimized request can be built by using these
sorted row keys and timestamps.
 Real-time Processing: In order to perform real-time query processing, HBase
supports block cache and Bloom filters.
 Faster Lookups: While it comes to faster lookups, HBase internally uses Hash tables
and offers random access, as well as it stores the data in indexed HDFS files.
 Type of Data: For both semi-structured as well as structured data, HBase supports
well.
 Schema-less: There is no concept of fixed columns schema in HBase because it is
schema-less. Hence, it defines only column families.
 High Throughput: Due to high security and easy management characteristics of
HBase, it offers unprecedented high write throughput.

2.9 HIVE, Pig and Pig Latin, Zookeeper, Flume, Oozie


Q13. What is Hive explain with example?
Ans:
HIVE
Hive is a data warehouse and an ETL tool which provides an SQL-like interface between
the user and the Hadoop distributed file system (HDFS) which integrates Hadoop. It is built
on top of Hadoop. It is a software project that provides data query and analysis. It facilitates
reading, writing and handling wide datasets that stored in distributed storage and queried
by Structure Query Language (SQL) syntax. It is not built for Online Transactional
Processing (OLTP) workloads. It is frequently used for data warehousing tasks like data
encapsulation, Ad-hoc Queries, and analysis of huge datasets. It is designed to enhance



scalability, extensibility, performance, fault-tolerance and loose-coupling with its input
formats.

Initially Hive was developed by Facebook, later the Apache Software Foundation took it
up and developed it further as an open source under the name Apache Hive. It is used by
different companies. For example, Amazon uses it in Amazon Elastic MapReduce.

The components of Apache Hive are as follows:


 Driver: The driver acts as a controller receiving HiveQL statements. It begins the
execution of statements by creating sessions. It is responsible for monitoring the life
cycle and the progress of the execution. Along with that, it also saves the important
metadata that has been generated during the execution of the HiveQL statement.
 Meta-store: A metastore stores metadata of all tables. Since Hive includes partition
metadata, it helps the driver in tracking the progress of various datasets that have
been distributed across a cluster, hence keeping track of data. In a metastore, the data
is saved in an RDBMS format.
 Compiler: The compiler performs the compilation of a HiveQL query. It transforms
the query into an execution plan that contains tasks.
 Optimizer: An optimizer performs many transformations on the execution plan for
providing an optimized DAG. An optimizer aggregates several transformations
together like converting a pipeline of joins to a single join. It can also split the tasks
for providing better performance.
 Executor: After the processes of compilation and optimization are completed, the
execution of the task is done by the executor. It is responsible for pipelining the tasks.

Hive organizes data using the following elements:


 Tables—Similar to RDBMS tables, Hive tables also consist of rows and columns. Since Hive is layered on Hadoop HDFS, Hive tables have a mapping with the directories in the file system. Hive also supports tables located in other native file systems.
 Partitions—Hive tables may support one or more partitions that are mapped to subdirectories in the file system. The partitions depict how data is distributed throughout the table. For instance, in a table autos having a key value 5678 and a maker value Icon, the partition will be located at hivewh/autos/ky=5678/Icon.
 Buckets—Data within a Hive table is divided into buckets, which are stored as
individual Files in the partition directories in the file system. In the preceding
example, the autos table can have a bucket called Focus that stores all the attributes
of a Ford Focus auto.

Hive metadata is maintained externally in the metastore, which is a relational database


that stores a detailed description of the Hive schema including types of columns, owners,
key and value data, table statistics, and so on. The metastore can also sync data with other
metadata services in the Hadoop ecosystem.
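A small sketch of how a client might work with these elements through Hive's JDBC interface. It assumes a running HiveServer2 at the hypothetical address below and the Hive JDBC driver on the classpath; the autos table mirrors the example above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");       // Hive JDBC driver
        String url = "jdbc:hive2://localhost:10000/default";     // hypothetical HiveServer2 endpoint
        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement()) {
            // A partitioned, bucketed table similar to the autos example above.
            stmt.execute("CREATE TABLE IF NOT EXISTS autos (vin STRING, maker STRING) "
                       + "PARTITIONED BY (ky INT) CLUSTERED BY (maker) INTO 4 BUCKETS");
            // An ad-hoc HiveQL query; the compiler and optimizer turn it into an execution plan.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT maker, COUNT(*) FROM autos GROUP BY maker")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
                }
            }
        }
    }
}

The statements above are plain HiveQL; any other client, such as the Beeline shell, could submit the same text.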

Q14. What is Apache Pig? What is Pig Latin in Pig?
Ans:
Pig and Pig Latin
Pig is defined as “a platform for analysing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.”

Pig is used as an ELT tool for Hadoop. It makes Hadoop more approachable and usable for non-technical persons. It opens an interactive and script-based execution environment for non-developers through its language, Pig Latin. Pig Latin loads and processes input data using a series of operations and transforms that data to produce the desired output. Pig can execute in the following two modes:
 Local—In this mode, all the scripts are executed on a single machine. This mode does not require Hadoop MapReduce or HDFS.
 MapReduce—In this mode, all the scripts are executed on a given Hadoop cluster. This mode is termed the MapReduce mode.

Similar to Hadoop, Pig operates by creating a set of map and reduce jobs. The user need not be concerned about writing code, compiling, packaging, and submitting the jobs to the Hadoop system.

Pig Latin facilitates the extraction of the required information from Big Data in an abstract way by focusing on the data and not on the structure of any custom software program. Pig programs can be executed in three ways, all of which are compatible with the local and Hadoop (MapReduce) modes of execution:
 Script—Pig Latin commands are contained in a file having the .pig suffix. Pig interprets and executes these commands in a sequential order.
 Grunt—It is a command interpreter. The commands are fed on the Grunt command
line and interpreted and executed by Grunt.
 Embedded—This enables the execution of Pig programs as a part of a Java program.

Pig Latin comes with a very rich syntax and supports the following operations:
 Loading and storing data
 Streaming data
 Filtering data
 Grouping and joining data
 Sorting data
 Combining and splitting data
Apart from these, a wide variety of types, expressions, functions, diagnostic operators, macros, and file system commands are also supported by Pig Latin.
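As an illustration of the Embedded mode described above, the following sketch runs a few Pig Latin statements from a Java program through Pig's PigServer API in local mode. The input file autos.csv and its schema are hypothetical.

import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class EmbeddedPigSketch {
    public static void main(String[] args) throws Exception {
        // Local mode: the script runs on a single machine and needs no HDFS.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Pig Latin statements registered one by one ('autos.csv' is a hypothetical input file).
        pig.registerQuery("autos = LOAD 'autos.csv' USING PigStorage(',') "
                        + "AS (vin:chararray, maker:chararray, price:int);");
        pig.registerQuery("expensive = FILTER autos BY price > 20000;");              // filtering data
        pig.registerQuery("by_maker = GROUP expensive BY maker;");                    // grouping data
        pig.registerQuery("counts = FOREACH by_maker GENERATE group, COUNT(expensive);");

        // Iterate over the final alias instead of storing it to a file.
        Iterator<Tuple> it = pig.openIterator("counts");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}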

Q15. What is Zookeeper? Explain the purpose of Zookeeper in the Hadoop ecosystem.


Ans:
Zookeeper is an open source distributed coordination service that helps to manage a
large set of hosts. Management and coordination in a distributed environment is tricky.



Zookeeper automates this process and allows developers to focus on building software features rather than worrying about the distributed nature of their application.

Zookeeper helps you maintain configuration information, naming, and group services for distributed applications. It implements different protocols on the cluster so that applications need not implement them on their own. It provides a single coherent view of multiple machines.

Features of Zookeeper
 Synchronization − Mutual exclusion and co-operation between server processes.
 Ordered Messages - The strict ordering means that sophisticated synchronization
primitives can be implemented at the client.
 Reliability - The reliability aspects keep it from being a single point of failure.
 Atomicity − Data transfer either succeeds or fails completely, but no transaction is
partial.
 High performance - The performance aspects of Zookeeper means it can be used in
large, distributed systems.
 Distributed, High availability, Fault-tolerant, Loose coupling, Partial failure.
 High throughput and low latency - data is stored in memory as well as on disk.
 Replicated.
 Automatic failover- When a Zookeeper dies, the session is automatically migrated
over to another Zookeeper.

Service Provided by Zookeeper


 Naming service − Identifying the nodes in a cluster by name. It is similar to DNS,
but for nodes.
 Configuration management − Latest and up-to-date configuration information of
the system for a joining node.
 Cluster management − Joining/leaving of a node in a cluster and node status in real
time.
 Leader election − Electing a node as leader for coordination purposes (see the client sketch below).
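A minimal sketch of the naming and configuration-management services using ZooKeeper's Java client API. The ensemble address, znode path, and payload are hypothetical, and a production client would react to the connection event from the watcher rather than sleep.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical ensemble address, 5-second session timeout, no-op watcher.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> { });
        Thread.sleep(1000);   // crude wait for the session; real code reacts to the watcher event

        // Configuration management: publish a small piece of cluster-wide configuration.
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "batch.size=64".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Naming service in miniature: any joining node can look the value up by path.
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));
        zk.close();
    }
}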

Q16. Write a short note on Oozie, Flume and Sqoop.


Ans:
OOZIE
Consider Apache Oozie as a clock and alarm service inside the Hadoop ecosystem. For Hadoop jobs, Oozie acts as a scheduler. It schedules Hadoop jobs and binds them together as one logical unit of work.

There are two kinds of Oozie jobs:


 Oozie workflow: This is a sequential set of actions to be executed. You can think of it as a relay race, where each athlete waits for the previous one to complete their part.
 Oozie Coordinator: These are the Oozie jobs which are triggered when the data is
made available to it. Think of this as the response-stimuli system in our body. In the
same manner as we respond to an external stimulus, an Oozie coordinator responds
to the availability of data and it rests otherwise.

APACHE FLUME
Ingesting data is an important part of our Hadoop Ecosystem.
 Flume is a service which helps in ingesting unstructured and semi-structured data into HDFS.
 It gives us a solution which is reliable and distributed and helps us in collecting,
aggregating and moving large amount of data sets.
 It helps us to ingest online streaming data from various sources like network traffic,
social media, email messages, log files etc. in HDFS.
The architecture of Flume is simple: a Flume agent ingests streaming data from various data sources (such as a web server) into HDFS. Twitter is among the famous sources of streaming data.

The flume agent has 3 components: source, sink and channel.


 Source: it accepts the data from the incoming streamline and stores the data in the
channel.
 Channel: it acts as the local storage or the primary storage. A Channel is a temporary
storage between the source of data and persistent data in the HDFS.
 Sink: Then, our last component i.e. Sink, collects the data from the channel and
commits or writes the data in the HDFS permanently.

Sqoop
Sqoop is a tool used to transfer bulk data between Hadoop and external datastores, such
as relational databases (MS SQL Server, MySQL).

To process data using Hadoop, the data first needs to be loaded into Hadoop clusters
from several sources. However, it turned out that the process of loading data from several
heterogeneous sources was extremely challenging. The problems administrators
encountered included:
 Maintaining data consistency
 Ensuring efficient utilization of resources
 Loading bulk data to Hadoop was not possible
 Loading data using scripts was slow
The solution was Sqoop. Using Sqoop in Hadoop helped to overcome all the challenges
of the traditional approach and it could load bulk data from RDBMS to Hadoop with ease.



Sqoop Features
Sqoop has several features, which makes it helpful in the Big Data world:
 Parallel Import/Export: Sqoop uses the YARN framework to import and export data.
This provides fault tolerance on top of parallelism.
 Import Results of an SQL Query: Sqoop enables us to import the results returned
from an SQL query into HDFS.
 Connectors For All Major RDBMS Databases: Sqoop provides connectors for
multiple RDBMSs, such as the MySQL and Microsoft SQL servers.
 Kerberos Security Integration: Sqoop supports the Kerberos computer network authentication protocol, which enables nodes communicating over an insecure network to authenticate users securely.
 Provides Full and Incremental Load: Sqoop can load the entire table or parts of the
table with a single command.


Internal Assessment

MULTIPLE CHOICE QUESTIONS


1. _____ is a platform for constructing data flows for extract, transform, and [ c ]
load (ETL) processing and analysis of large datasets.
a) Pig Latin b) Oozie c) Pig d) Hive
2. Hive also support custom extensions written in ____________ [ b ]
a) C# b) Java c) C d) C++
3. ___________ is general-purpose computing model and runtime [ a ]
system for distributed data analytics.
a) Mapreduce b) Drill c) Oozie d) None of the mentioned
4. The Pig Latin scripting language is not only a higher-level data flow [ a ]
language but also has operators similar to ____________
a) SQL b) JSON c) XML d) All of the mentioned
5. A ________ serves as the master and there is only one NameNode [ b ]
per cluster.
a) Data Node b) NameNode c) Data block d) Replication
6. HDFS works in a __________ fashion. [ a ]
a) master-worker b) master-slave c) worker/slave d) all
7. For ________ the HBase Master UI provides information about the [ a ]
HBase Master uptime.
a) HBase b) Oozie c) Kafka d) All of the mentioned
8. The Hadoop list includes the HBase database, the Apache Mahout _____ [ a ]
system, and matrix operations.
a) Machine learning b) Pattern recognition
c) Statistical classification d) Artificial intelligence

FILL IN THE BLANKS


1. _______ Jobs are optimized for scalability but not latency. (Hive)
2. Hive also support custom extensions written in ____________ (Java)
3. For ___ the Master UI provides information about the HBase Master uptime. (HBase)
4. Hadoop Development Tools is an effort undergoing incubation at ____ (ASF)
5. HDT provides plugin for inspecting ________ nodes. (HDFS)
6. HDT is used for listing running Jobs on __________ Cluster. (MapReduce)


Very Short Questions


Q1. What is HDFS?
Ans:
HDFS is a distributed file system that provides access to data across Hadoop clusters. A cluster is a group of computers that work together. Like other Hadoop-related technologies, HDFS is a key tool that manages and supports analysis of very large volumes of data, in the range of petabytes and zettabytes.

Q2. What type of database is HBase?


Ans:
HBase is a column-oriented, non-relational database, which means that data is stored in columns and indexed by a single row key. This architecture permits rapid retrieval of individual rows and columns and efficient scans over different columns within a table.

Q3. List Applications of HBase.


Ans:
 It is used whenever there is a need to write heavy applications.
 HBase is used whenever we need to provide fast random access to available data.
 Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.

Q4. What is Hive?


Ans:
Hive is an ETL and Data warehousing tool developed on top of Hadoop Distributed File
System (HDFS).

Q5. What is Flume in Hadoop?


Ans:
Apache Flume is a service designed for streaming logs into the Hadoop environment. Flume
is a distributed and reliable service for collecting and aggregating huge amounts of log data.
With a simple and easy to use architecture based on streaming data flows, it also has tunable
reliability mechanisms and several recovery and failover mechanisms.


UNIT III

Understanding MapReduce Fundamentals and HBase
The MapReduce Framework, Exploring the features of MapReduce, Working of MapReduce, Techniques to optimize MapReduce Jobs, Hardware/Network Topology, Synchronization, File system, Uses of MapReduce, Role of HBase in Big Data Processing, Characteristics of HBase.

Understanding Big Data Technology Foundations
Exploring the Big Data Stack, Data Sources Layer, Ingestion Layer, Storage Layer, Physical Infrastructure Layer, Platform Management Layer, Security Layer, Monitoring Layer, Visualization Layer.

Objective

 We learn about the MapReduce framework and its features
 Integration of different functions to sort, process and analyse Big Data
 The working of map and reduce functions
 Exploring HBase with MapReduce
 Integrating HBase with MapReduce and processing of Big Data
 Exploring the Big Data Stack architecture
 Discussion of the various layers and components in the Big Data architecture
 Data Sources Layer
 Ingestion Layer
 Storage Layer
 Physical Infrastructure Layer
 Platform Management Layer
 Security Layer
 Monitoring Layer
 Visualization Layer


PART – A
Short Type Questions
Q1. What are the strengths of MapReduce?
Ans:
Apache Hadoop usually has two parts, the storage part and the processing part.
MapReduce falls under the processing part. Some of the various advantages of Hadoop
MapReduce are:
 Scalability – The biggest advantage of MapReduce is its level of scalability, which is
very high and can scale across thousands of nodes.
 Parallel nature – One of the other major strengths of MapReduce is that it is parallel in nature, making it well suited to work with both structured and unstructured data at the same time.
 Memory requirements – MapReduce does not require large memory as compared to
other Hadoop ecosystems. It can work with minimal amount of memory and still
produce results quickly.
 Cost reduction – As MapReduce is highly scalable, it reduces the cost of storage and
processing in order to meet the growing data requirements.

Q2. Why is the order in which a function is executed very important in MapReduce?
Ans:
The point of MapReduce is to enable parallelism, or at least concurrency, among actions
within each of the major phases of the algorithm. And the key to increasing concurrency is
to remove ordering dependencies.
1. During the map phase, multiple workers can independently apply a mapping
function to each datum. The order doesn't matter here, as each element's mapping is
independent of all the others.
2. Based on the mapping, elements get sorted into groups for reduction. Each of the
workers arranges for the elements it mapped to end up in the right reduction group.
3. During the reduction phase, multiple workers can independently process each of the
groups, reducing the elements in that group to a single value representing the group.

So the only real ordering here is between the three phases. You need to do all of phase 1,
followed by all of phase 2, followed by all of phase 3. Within each phase, all of the work
executes in a somewhat arbitrary order.
Q3. List the Applications of MapReduce.
Ans:
 Entertainment: To discover the most popular movies based on what you like and what you have watched, Hadoop MapReduce helps you out. It mainly focuses on your logs and clicks.
 E-commerce: Numerous e-commerce providers, such as Amazon, Walmart, and eBay, use the MapReduce programming model to identify favourite items based on customers' preferences or purchasing behaviour. This includes building product recommendation mechanisms for e-commerce inventories and examining website records, purchase history, user interaction logs, etc.

 Data Warehouse: We can utilize MapReduce to analyze large data volumes in data warehouses while implementing specific business logic for data insights.
 Fraud Detection: Hadoop and MapReduce are used in financial enterprises, including banks, insurance providers, and payment platforms, for fraud detection, pattern identification, and business metrics through transaction analysis.

Q4. What is Data Sources Layer?


Ans:
 Data sources for big data architecture are all over the map. The bottom layer of the
stack is the foundation and is known as the data layer.
 Data can come from company servers and sensors, or from third-party data providers.
 The big data environment can ingest data in batch mode or in real time.
 The basic function of the data sources layer is to absorb and integrate the data
coming from various sources with different formats at varying velocity.

Q5. Short notes on Security Layer.


Ans:
 Big data projects are subject to security issues because of the distributed architecture.
 To implement a security baseline foundation, the minimum security design considerations are:
o Authenticates nodes using protocols like Kerberos
o Enable file-layer encryption
o Subscribes for trusted keys and certificates
o Use tools like Chef or Puppet for validation during deployment of data.
o Logs the communication between nodes, and use distributed logging
mechanism
o Ensure all communication between nodes is secure.

Q6. What are the functions of Ingestion Layer?


Ans:
The functioning of the ingestion layer:
 Identification: Data is categorised into various known data formats or unstructured
data is assigned with default formats.
 Filtration: The information relevant for the enterprise is filtered on the basis of the
Enterprise Master Data Management (MDM) repository.
 Validation: The filtered data is analysed against the MDM metadata.
 Noise reduction: Data is cleaned by removing the noise and minimising the related
disturbances.
 Transformation: Data is split or combined on the basis of its type, contents, and the
requirement of the organisation.
 Compression: The size of the data is reduced without affecting its relevance for the
required process. It should be remembered that compression does not affect the
analysis results.
 Integration: The refined data set is integrated with the Hadoop storage layer, which
consists of Hadoop Distributed File System (HDFS) and NOSQL database.


PART – B
Essay Type Questions
UNDERSTANDING MAPREDUCE FUNDAMENTALS AND HBASE
3.1 The MapReduce Frame Work
Q1. What is a MapReduce program in big data? Or: Which type of framework is supported by MapReduce?
Ans:
MapReduce is a framework that enables you to write applications that process vast amounts of data, in parallel, on large clusters of commodity hardware, in a reliable and fault-tolerant manner. This section also describes the architectural components of MapReduce and lists the benefits of using MapReduce.

It is a software framework that enables you to write applications that process vast
amounts of data, in-parallel on large clusters of commodity hardware in a reliable and fault-
tolerant manner.
 Prior to Hadoop 2.0, MapReduce was the only way to process data in Hadoop.
 A MapReduce job usually splits the input data set into independent chunks, which
are processed by the map tasks in a completely parallel manner.
 The framework sorts the outputs of the maps, which are then inputted to the reduce
tasks.
 Typically, both the input and the output of the job are stored in a file system.
 The framework takes care of scheduling tasks, monitors them, and re-executes the
failed tasks.

MapReduce Framework
MapReduce is a software framework that enables you to write applications that will
process large amounts of data, in- parallel, on large clusters of commodity hardware, in a
reliable and fault-tolerant manner. It integrates with HDFS and provides the same benefits
for parallel data processing. It Sends computations to where the data is stored. The
framework:
– Schedules and monitors tasks, and re-executes failed tasks.
– Hides complex “housekeeping” and distributed computing tasks from the developer.
– The records are divided into smaller chunks for efficiency, and each chunk is
executed serially on a particular compute engine.
– The output of the Map phase is a set of records that are grouped by the mapper
output key. Each group of records is processed by a reducer (again, these are
logically in parallel).
– The output of the Reduce phase is the union of all records that are produced by the
reducers.

3.2 Exploring the features of MapReduce


Q2. List the main features of MapReduce.
Ans:
MapReduce keeps all the processing operations separate for parallel execution. Problems
that are extremely large in size are divided into subtasks, which are chunks of data

separated into manageable blocks. The subtasks are executed independently of each other and then the results from all independent executions are combined to provide the complete output.

The principal features of MapReduce include the following:


 Scheduling—MapReduce involves two operations: map and reduce, which are executed by dividing large problems into smaller chunks. These chunks are run in parallel by different computing resources. The operation of breaking tasks into subtasks and running these subtasks independently in parallel is called mapping, which is performed ahead of the reduce operation. The mapping operation requires task prioritization based on the number of nodes in the cluster. In case nodes are fewer than tasks, then tasks are executed on a priority basis. The reduction operation cannot be performed until the entire mapping operation is completed. The reduction operation then merges independent results on the basis of priority. Hence, the MapReduce programming model requires scheduling of tasks.
 Synchronization—Execution of several concurrent processes requires
synchronization. The MapReduce program execution framework is aware of the
mapping and reducing operations that are taking place in the program. The
framework tracks all the tasks along with their timings and starts the reduction
process after the completion of mapping. A method known as shuffle and sort
produces the intermediate data, which is transmitted across the network. The shuffle
and sort mechanism is used for collecting the mapped data and preparing it for
reduction.
 Co-location of Code/Data (Data Locality)—The effectiveness of a data processing
mechanism depends largely on the location of the code and the data required for the
code to execute. The best result is obtained when both code and data reside on the
same machine. This means that the co-location of the code and data produces the
most effective processing outcome.
 Handling of Errors/Faults—MapReduce engines usually provide a high level of fault
tolerance and robustness in handling errors. The reason for providing robustness to
these engines is their high propensity to make errors or faults. There are high
chances of failure in clustered nodes on which different parts of a program are
running. Therefore, the MapReduce engine must have the capability of recognizing
the fault and rectify it. Moreover, the MapReduce engine design involves the ability
to find out the tasks that are incomplete and eventually assign them to different
nodes.
 Scale-Out Architecture — MapReduce engines are built in such a way that they can
accommodate more machines, as and when required. This possibility of introducing
more computing resources to the architecture makes the MapReduce programming
model more suited to the higher computational demands of Big Data.

3.3 Working of MapReduce


Q3. What is Hadoop MapReduce and how does it work?
Ans:
MapReduce facilitates concurrent processing by splitting petabytes of data into smaller
chunks, and processing them in parallel on Hadoop commodity servers. In the end, it



aggregates all the data from multiple servers to return a consolidated output back to the
application.

How MapReduce Works?


At the crux of MapReduce are two functions: Map and Reduce. They are sequenced one
after the other.
 The Map function takes input from the disk as <key,value> pairs, processes them,
and produces another set of intermediate <key,value> pairs as output.
 The Reduce function also takes inputs as <key,value> pairs, and produces
<key,value> pairs as output.

There are two types of tasks:


1. Map tasks (Splits & Mapping)
2. Reduce tasks (Shuffling, Reducing)

A Map Task is a single instance of a MapReduce app. These tasks determine which
records to process from a data block. The input data is split and analyzed, in parallel, on the
assigned compute resources in a Hadoop cluster. This step of a MapReduce job prepares the
<key, value> pair output for the reduce step.

A Reduce Task processes an output of a map task. Similar to the map stage, all reduce
tasks occur at the same time, and they work independently. The data is aggregated and
combined to deliver the desired output. The final result is a reduced set of <key, value>
pairs which MapReduce, by default, stores in HDFS.

The Map and Reduce stages have two parts each.


 The Map part first deals with the splitting of the input data that gets assigned to
individual map tasks. Then, the mapping function creates the output in the form of
intermediate key-value pairs.
 The Reduce stage has a shuffle and a reduce step. Shuffling takes the map output
and creates a list of related key-value-list pairs. Then, reducing aggregates the results
of the shuffling to produce the final output that the MapReduce application
requested.

Hadoop Map and Reduce Working Together


As the name suggests, MapReduce works by processing input data in two stages – Map
and Reduce. To demonstrate this, we will use a simple example with counting the number
of occurrences of words in each document.

The final output we are looking for is: How many times the words Apache, Hadoop,
Class, and Track appear in total in all documents.

For illustration purposes, the example environment consists of three nodes. The input
contains six documents distributed across the cluster. We will keep it simple here, but in real
circumstances, there is no limit. You can have thousands of servers and billions of
documents.

1. First, in the map stage, the input data (the six documents) is split and distributed
across the cluster (the three servers). In this case, each map task works on a split
containing two documents. During mapping, there is no communication between the
nodes. They perform independently.
2. Then, map tasks create a <key, value> pair for every word. These pairs show how
many times a word occurs. A word is a key, and a value is its count. For example,
one document contains three of four words we are looking for: Apache 7 times, Class
8 times, and Track 6 times. The key-value pairs in one map task output look like this:
 <apache, 7>
 <class, 8>
 <track, 6>
This process is done in parallel tasks on all nodes for all documents and gives a
unique output.
3. After input splitting and mapping completes, the outputs of every map task are
shuffled. This is the first step of the Reduce stage. Since we are looking for the
frequency of occurrence for four words, there are four parallel Reduce tasks. The
reduce tasks can run on the same nodes as the map tasks, or they can run on any
other node.

The shuffle step ensures the keys Apache, Hadoop, Class, and Track are sorted
for the reduce step. This process groups the values by keys in the form of <key,
value-list> pairs.
4. In the reduce step of the Reduce stage, each of the four tasks process a <key, value-
list> to provide a final key-value pair. The reduce tasks also happen at the same time
and work independently.

In our example, the reduce tasks get the following individual results:



 <apache, 22>
 <hadoop, 20>
 <class, 18>
 <track, 22>
5. Finally, the data in the Reduce stage is grouped into one output. MapReduce now
shows us how many times the words Apache, Hadoop, Class, and track appeared in
all documents. The aggregate data is, by default, stored in HDFS (a minimal Java version of this job is sketched below).
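The walkthrough above is the classic word-count job. A minimal Java version is sketched below; it assumes the Hadoop MapReduce client library is on the classpath, and the input and output paths are supplied as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit <word, 1> for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken().toLowerCase());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: after the shuffle groups pairs by key, sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // e.g. the directory of input documents
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // HDFS directory for the results
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}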

The example we used here is a basic one. MapReduce performs much more complicated
tasks.

Some of the use cases include:


 Turning Apache logs into tab-separated values (TSV).
 Determining the number of unique IP addresses in weblog data.
 Performing complex statistical modeling and analysis.
 Running machine-learning algorithms using different frameworks, such as Mahout.

3.4 Techniques to Optimize MapReduce Jobs


Q4. What techniques are used to optimize MapReduce jobs?
Ans:
Aside from optimizing the actual application code, you can use some optimization
techniques to improve the reliability and performance of your MapReduce jobs. They fall
into three categories: hardware/network topology, synchronization, and file system.
 Hardware/Network topology: Independent of the application, the fastest hardware
and networks will likely yield the fastest run times for your software. A distinct
advantage of MapReduce is the capability to run on inexpensive clusters of
commodity hardware and standard networks. If you don't pay attention to where
your servers are physically organized, you won't get the best performance and high
degree of fault tolerance necessary to support big data tasks. Commodity hardware
is often stored in racks in the data center. The proximity of the hardware within the
rack offers a performance advantage as opposed to moving data and/or code from
rack to rack. During implementation, you can configure your MapReduce engine to
be aware of and take advantage of this proximity. Keeping the data and the code
together is one of the best optimizations for MapReduce performance. In essence, the
closer the hardware processing elements are to each other, the less latency you will
have to deal with.
 Synchronization: Because it is inefficient to hold all the results of your mapping
within the node, the synchronization mechanisms copy the mapping results to the
reducing nodes immediately after they have completed so that the processing can
begin right away. All values from the same key are sent to the same reducer, again
ensuring higher performance and better efficiency. The reduction outputs are written
directly to the file system, so it must be designed and tuned for best results.
 File system: Your MapReduce implementation is supported by a distributed file
system. The major difference between local and distributed file systems is capacity.
To handle the huge amounts of information in a big data world, file systems need to
be spread across multiple machines or nodes in a network. MapReduce

implementations rely on a master-slave style of distribution, where the master node
stores all the metadata, access rights, mapping and location of files and blocks, and
so on. The slaves are nodes where the actual data is stored. All the requests go to the
master and then are handled by the appropriate slave node. As you contemplate the
design of the file system you need to support a MapReduce implementation, you
should consider the following:
 Keep it warm: As you might expect, the master node could get overworked because everything begins there. Additionally, if the master node fails, the entire file system is inaccessible until the master is restored. A very important optimization is to create a “warm standby” master node that can jump into service if a problem occurs with the online master.
 The bigger the better: File size is also an important consideration. Lots of small
files (less than 100MB) should be avoided. Distributed file systems supporting
MapReduce engines work best when they are populated with a modest number
of large files.
 The long view: Because workloads are managed in batches, highly sustained
network bandwidth is more important than quick execution times of the mappers
or reducers. The optimal approach is for the code to stream lots of data when it is
reading and again when it is time to write to the file system.
 Keep it secure: But not overly so. Adding layers of security on the distributed file
system will degrade its performance. The file permissions are there to guard
against unintended consequences, not malicious behavior. The best approach is
to ensure that only authorized users have access to the data center environment
and to keep the distributed file system protected from the outside.

3.5 Uses of MapReduce


Q5. What are the different use cases of MapReduce?
Ans:
By spreading out processing across numerous nodes and
merging or decreasing the results of those nodes, MapReduce
has the potential to handle large data volumes. This makes it
suitable for the following use cases:

Entertainment
Hadoop MapReduce assists end users in finding the most
popular movies based on their preferences and previous
viewing history. It primarily concentrates on their clicks and
logs.

Various OTT services, including Netflix, regularly release many web series and movies. It may have happened to you that you couldn't pick which movie to watch, so you looked at Netflix's recommendations and decided to watch one of the suggested series or films. Netflix uses Hadoop and MapReduce to suggest to the user some well-known movies based on what they have watched and which movies they enjoy. MapReduce can examine user clicks and logs to learn how they watch movies.



E-commerce
Several e-commerce companies, including Flipkart, Amazon, and eBay, employ
MapReduce to evaluate consumer buying patterns based on customers‘ interests or
historical purchasing patterns. For various e-commerce businesses, it provides product
suggestion methods by analyzing data, purchase history, and user interaction logs.

Many e-commerce vendors use the MapReduce programming model to identify popular
products based on customer preferences or purchasing behavior. Making item proposals for
e-commerce inventory is part of it, as is looking at website records, purchase histories, user
interaction logs, etc., for product recommendations.

Social media
Nearly 500 million tweets, roughly 6,000 per second, are sent daily on the microblogging platform Twitter. MapReduce processes Twitter data, performing operations such as tokenization, filtering, counting, and aggregating counters.
 Tokenization: The tweets are tokenized and mapped into key-value pairs of tokens.
 Filtering: The unwanted terms are removed from the token maps.
 Counting: A counter is generated for each token (word).
 Aggregate counters: Comparable counter values are grouped into small, manageable units (a toy sketch of these steps follows).
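A toy Java sketch of these four steps over a couple of in-memory sample tweets. The tweets and the stop-word list are invented; a production pipeline would run the same logic as MapReduce tasks over the full stream.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class TweetWordCountSketch {
    public static void main(String[] args) {
        List<String> tweets = Arrays.asList(
                "Hadoop makes big data simple",
                "Big data needs Hadoop and MapReduce");
        Set<String> stopWords = Set.of("and", "the", "makes", "needs");

        Map<String, Long> counts = tweets.stream()
                .flatMap(t -> Arrays.stream(t.toLowerCase().split("\\s+")))        // tokenization
                .filter(w -> !stopWords.contains(w))                               // filtering
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));    // counting + aggregation

        counts.forEach((word, n) -> System.out.println(word + " -> " + n));
    }
}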

Data warehouse
Systems that handle enormous volumes of information are known as data warehouse
systems. The star schema, which consists of a fact table and several dimension tables, is the
most popular data warehouse model. In a shared-nothing architecture, storing all the
necessary data on a single node is impossible, so retrieving data from other nodes is
essential.

This results in network congestion and slow query execution speeds. If the dimensions
are not too big, users can replicate them over nodes to get around this issue and maximize
parallelism. Using MapReduce, we may build specialized business logic for data insights
while analyzing enormous data volumes in data warehouses.
Fraud detection
Conventional methods of preventing fraud are not always very effective. For instance,
data analysts typically manage inaccurate payments by auditing a tiny sample of claims and
requesting medical records from specific submitters. Hadoop is a system well suited for
handling large volumes of data needed to create fraud detection algorithms. Financial
businesses, including banks, insurance companies, and payment locations, use Hadoop and
MapReduce for fraud detection, pattern recognition evidence, and business analytics
through transaction analysis.

3.6 Role of HBase in Big Data Processing


Q6. What is HBase? What are the common features of HBase?
Ans:
HBase is an open-source, sorted map datastore built on Hadoop. It is column-oriented and horizontally scalable.
 HBase is a distributed column-oriented database built on top of the Hadoop file
system. It is an open-source project and is horizontally scalable.

 HBase is a data model that is similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop Distributed File System (HDFS).
 It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System.
 One can store the data in HDFS either directly or through HBase. Data consumers read/access the data in HDFS randomly using HBase. HBase sits on top of the Hadoop File System and provides read and write access.

Characteristics of HBase
 It is linearly scalable across various nodes as well as modularly scalable, as data can be divided across various nodes.
 HBase provides consistent read and writes.
 It provides atomic read and write means during one read or write process, all other
processes are prevented from performing any read or write operations.
 It provides easy to use Java API for client access.
 It supports Thrift and REST API for non-Java front ends which supports XML,
Protobuf and binary data encoding options.
 It supports a Block Cache and Bloom Filters for real-time queries and for high
volume query optimization.
 HBase provides automatic failure support between Region Servers.
 It supports exporting metrics to files via the Hadoop metrics subsystem.
 It doesn't enforce relationships within your data.
 It is a platform for storing and retrieving data with random access (see the scan sketch below).
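As a sketch of the random-access and sorted-row-key characteristics above, the following Java program scans a bounded range of row keys. It assumes the HBase 2.x client API and a hypothetical events table whose row keys start with a date.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRangeScanSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("events"))) {   // hypothetical table
            // Row keys are stored in lexicographic order, so a bounded scan only
            // touches the regions that hold the requested key range.
            Scan scan = new Scan();
            scan.withStartRow(Bytes.toBytes("2024-01-01"));
            scan.withStopRow(Bytes.toBytes("2024-01-02"));
            scan.setCaching(100);   // rows fetched per RPC

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}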


INTRODUCING BIG DATA TECHNOLOGY FOUNDATIONS


3.7 Exploring the Big Data Stack
Q7. What is big data stack? Explain.
Ans:
Good design principles are critical when creating (or evolving) an environment to
support big data — whether dealing with storage, analytics, reporting, or applications. The
environment must include considerations for hardware, infrastructure software, operational
software, management software, well-defined application programming interfaces (APIs),
and even software developer tools. At a high level, the big data stack spans the following stages:
 Capture
 Integrate
 Organize
 Analyze
 Act

This is a comprehensive stack, and you may focus on certain aspects initially based on the specific problem you are addressing. However, it is important to understand the entire stack so that you are prepared for the future.

3.8 Data Source layer


Q8. What is data source in big data?
Ans:
Organizations generate a huge amount of data on a daily basis. The basic function of the
data sources layer is to absorb and integrate the data coming from various sources, at
varying velocity and in different formats. Before this data is considered for the big data stack,
we have to differentiate between the noise and relevant information.

The data obtained from the data sources, has to be validated and cleaned before
introducing it for any logical use in the enterprise. The task of validating, sorting, and
cleaning data is done by the ingestion layer. The removal of noise from the data also takes
place in the ingestion layer.

3.9 Ingestion Layer


Q9. What is Data Ingestion? How to Pick the Right Data Ingestion Tool?
Ans:
The role of the ingestion layer is to absorb the huge inflow of data and sort it out into different categories. This layer separates noise from relevant information. Capable of handling huge

volume, high velocity, and a variety of data, the ingestion layer validates, cleanses,
transforms, reduces, and integrates the unstructured data into the Big Data stack for further
processing. The functioning of the ingestion layer:
 Identification: Data is categorised into various
known data formats or unstructured data is
assigned with default formats.
 Filtration: The information relevant for the
enterprise is filtered on the basis of the
Enterprise Master Data Management (MDM)
repository.
 Validation: The filtered data is analyzed against the MDM metadata.
 Noise reduction: Data is cleaned by removing
the noise and minimising the related
disturbances.
 Transformation: Data is split or combined on the basis of its type, contents, and the
requirement of the organisation.
 Compression: The size of the data is reduced without affecting its relevance for the
required process. It should be remembered that compression does not affect the
analysis results.
 Integration: The refined data set is integrated with the Hadoop storage layer, which
consists of Hadoop Distributed File System (HDFS) and NOSQL database.

Data ingestion in the Hadoop world means ELT (Extract, Load and Transform) as
opposed to ETL (Extract, Transform and Load) in case of traditional warehouses.
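Purely as an illustration of the steps listed above, the following toy Java sketch applies filtration, validation, noise reduction, and transformation to a handful of in-memory records. The record contents and rules are invented; a real ingestion layer would operate on streams and hand the result to HDFS or a NoSQL store.

import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class IngestionPipelineSketch {
    public static void main(String[] args) {
        // Hypothetical raw feed: "source,reading" lines of varying quality.
        List<String> raw = Arrays.asList(
                "sensor-1, 98.6 ", "sensor-2,NaN", "ad-banner,42.0", "SENSOR-3,101.2");

        List<String> ingested = raw.stream()
                .map(String::trim)
                .filter(r -> r.toLowerCase(Locale.ROOT).startsWith("sensor"))      // filtration: keep relevant sources
                .filter(r -> r.matches("(?i)sensor-\\d+,\\s*\\d+(\\.\\d+)?"))      // validation: well-formed numeric reading
                .map(r -> r.replaceAll("\\s+", ""))                                // noise reduction: strip stray whitespace
                .map(r -> r.toLowerCase(Locale.ROOT))                              // transformation: normalise identifiers
                .collect(Collectors.toList());

        // In a real stack this list would be compressed and integrated into the storage layer.
        ingested.forEach(System.out::println);
    }
}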

3.10 Storage Layer


Q10. Which layer is used for storage in big data?
Ans:
Hadoop is an open source framework used to store large volumes of data in a
distributed manner across multiple machines. The Hadoop storage layer supports fault-
tolerance and parallelization, which enable high-speed distributed processing algorithms to
execute over large-scale data. There are two major components of Hadoop: a scalable
Hadoop Distributed File System (HDFS) that can support petabytes of data and a
MapReduce engine that computes results in batches.

HDFS is a file system that is used to store huge volumes of data across a large number of
commodity machines in a cluster. The data can be in terabytes or petabytes. HDFS stores
data in the form of blocks of files and follows the write-once-read-many model to access
data from these blocks of files. The files stored in the HDFS are operated upon by many
complex programs, as per the requirement.

Consider an example of a hospital that used to perform a periodic review of the data
obtained from the sensors and machines attached to the patients. This review helped doctors
to keep a check on the condition of terminal patients as well as analyze the effects of various
medicines on them. With time, the growing volume of data made it difficult for the hospital
staff to store and handle it. To find a solution, the hospital consulted a data analyst who
suggested the implementation of HDFS as an answer to this problem. HDFS can be



implemented in an organization at comparatively low costs than advanced technologies and
can easily handle the continuous streaming of data.

Earlier, we needed to have different types of databases, such as relational and non-relational, for storing different types of data. However, now, all these types of data storage requirements can be addressed by a single concept known as Not Only SQL (NoSQL) databases. Some examples of NoSQL databases include HBase, MongoDB, AllegroGraph, and InfiniteGraph.
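As a small illustration of the write-once-read-many model described above, the following Java sketch uses Hadoop's FileSystem API. The path is hypothetical; with no cluster configuration on the classpath the call falls back to the local file system, which is convenient for experimentation.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml if present
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/patients/readings.txt");   // hypothetical path

            // Write once: the file is written sequentially and split into blocks by HDFS.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("sensor-42,98.6\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read many: any number of analysis jobs can stream the same blocks.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[128];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
        }
    }
}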

3.11 Physical Infrastructure


Q11. What is redundant physical infrastructure?
Ans:
At the lowest level of the stack is the physical infrastructure — the hardware, network,
and so on. Your company might already have a data centre or made investments in physical
infrastructures, so you‘re going to want to find a way to use the existing assets. Big data
implementations have very specific requirements on all elements in the reference
architecture, so you need to examine these requirements on a layer-by-layer basis to ensure
that your implementation will perform and scale according to the demands of your
business. As you start to think about your big data implementation, it is important to have
some overarching principles that you can apply to the approach. A prioritized list of these
principles should include statements about the following:
 Performance: How responsive do you need the system to be? Performance, also
called latency, is often measured end to end, based on a single transaction or query
request. Very fast (high-performance, low- latency) infrastructures tend to be very
expensive.
 Availability: Do you need a 100 percent uptime guarantee of service? How long can
your business wait in the case of a service interruption or failure? Highly available
infrastructures are also very expensive.
 Scalability: How big does your infrastructure need to be? How much disk space is
needed today and in the future? How much computing power do you need?
Typically, you need to decide what you need and then add a little more scale for
unexpected challenges.
 Flexibility: How quickly can you add more resources to the infrastructure? How
quickly can your infrastructure recover from failures? The most flexible

infrastructures can be costly, but you can control the costs with cloud services, where
you only pay for what you actually use.
 Cost: What can you afford? Because the infrastructure is a set of components, you might be able to buy the “best” networking and decide to save money on storage (or
vice versa). You need to establish requirements for each of these areas in the context
of an overall budget and then make trade-offs where necessary.

Physical redundant networks


Networks should be redundant and must have enough capacity to accommodate the
anticipated volume and velocity of the inbound and outbound data in addition to the
“normal” network traffic experienced by the business. As you begin making big data an
integral part of your computing strategy, it is reasonable to expect volume and velocity to
increase.

Infrastructure designers should plan for these expected increases and try to create
physical implementations that are “elastic.” As network traffic ebbs and flows, so too does
the set of physical assets associated with the implementation. Your infrastructure should
offer monitoring capabilities so that operators can react when more resources are required to
address changes in workloads.

Managing hardware: Storage and servers


Likewise, the hardware (storage and server) assets must have sufficient speed and
capacity to handle all expected big data capabilities. It‘s of little use to have a high-speed
network with slow servers because the servers will most likely become a bottleneck.
However, a very fast set of storage and compute servers can overcome variable network
performance. Of course, nothing will work properly if network performance is poor or
unreliable.
Infrastructure operations
Another important design consideration is infrastructure operations management. The
greatest levels of performance and flexibility will be present only in a well-managed
environment. Data centre managers need to be able to anticipate and prevent catastrophic
failures so that the integrity of the data is maintained.

3.12 Security Layer


Q12. What is the main purpose of security? Why is data security important in big data?
Ans:
Security and privacy requirements for big data are similar to the requirements for
conventional data environments. The security requirements have to be closely aligned to
specific business needs. Some unique challenges arise when big data becomes part of the
strategy.
 Data access: User access to raw or computed big data has about the same level of
technical requirements as non-big data implementations. The data should be
available only to those who have a legitimate business need for examining or
interacting with it. Most core data storage platforms have rigorous security schemes
and are often augmented with a federated identity capability, providing appropriate
access across the many layers of the architecture.
 Application access: Application access to data is also relatively straightforward from
a technical perspective. Most application programming interfaces (APIs) offer



protection from unauthorized usage or access. This level of protection is probably
adequate for most big data Implementations.
 Data encryption: Data encryption is the most challenging aspect of security in a big
data environment. In traditional environments, encrypting and decrypting data
really stresses the systems' resources. With the volume, velocity, and varieties
associated with big data, this problem is exacerbated. The simplest (brute-force)
approach is to provide more and faster computational capability. However, this
comes with a steep price tag — especially when you have to accommodate resiliency
requirements. A more temperate approach is to identify the data elements requiring
this level of security and to encrypt only the necessary items (see the sketch after this list).
 Threat detection: The inclusion of mobile devices and social networks exponentially
increases both the amount of data and the opportunities for security threats. It is
therefore important that organizations take a multi-perimeter approach to security.
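The "encrypt only the necessary items" idea from the data encryption point above can be sketched with the standard javax.crypto API. The field name and value are hypothetical, and key management, which is the hard part in practice, is left out.

import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class SelectiveFieldEncryptionSketch {
    public static void main(String[] args) throws Exception {
        // Only the sensitive field (for example, a patient identifier) is encrypted;
        // the bulky sensor readings stay in the clear to keep processing cheap.
        String sensitiveField = "patient-12345";

        SecretKey key = KeyGenerator.getInstance("AES").generateKey();
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);

        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal(sensitiveField.getBytes(StandardCharsets.UTF_8));

        System.out.println("Encrypted field: " + ciphertext.length + " bytes (plus the 12-byte IV)");
    }
}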

3.13 Monitoring Layer


Q13. Why monitoring your big data pipeline is important?
Ans:
The monitoring layer consists of a number of monitoring systems. These systems remain
automatically aware of all the configurations and functions of different operating systems
and hardware. They also provide the facility of machine communication with the help of a
monitoring tool through high-level protocols, such as Extensible Markup Language (XML).
Monitoring systems also provide tools for data storage and visualization. Some examples of
open source tools for monitoring Big Data stacks are Ganglia and Nagios.

Analytics Engine
The role of an analytics engine is to analyze huge amounts of unstructured data. This
type of analysis is related to text analytics and statistical analytics. Some examples of
different types of unstructured data that are available as large datasets include the
following:
 Documents containing textual patterns
 Text and symbols generated by customers or users using social media forums, such
as Yammer, Twitter, and Facebook
 Machine generated data, such as Radio Frequency Identification (RFID) feeds and
weather data.
 Data generated from application logs about upcoming or downtime details or about maintenance and upgrade details

Some statistical and numerical methods used for analyzing various unstructured data
sources:
 Natural Language Processing
 Text Mining
 Linguistic Computation
 Machine Learning
 Search and Sort Algorithms
 Syntax and Lexical Analysis

The following types of engines are used for analyzing Big Data:
 Search engines — Big Data analysis requires extremely fast search engines with iterative and cognitive data discovery mechanisms for analyzing huge volumes of data. This is required because the data loaded from various sources has to be indexed and searched for Big Data analytics processing.
 Real-time engines — These days, real-time applications generate data at a very high speed, and even data that is a few hours old becomes obsolete and useless as new data continues to flow in. Real-time analysis is required in the Big Data environment to analyze this type of data. For this purpose, real-time engines and NoSQL stores are used.

3.14 Visualization Layer


Q14. What is the role of visual analytics in the world of big data?
Ans:
 The visualization layer handles the task of interpreting and visualizing Big Data.
 Visualization of data is done by data analysts to have a look at the different aspects
of the data in various visual modes.
 It can be described as viewing a piece of information from different perspectives, interpreting it in different manners, trying to fit it into different types of situations, and deriving different types of conclusions from it.
Process Flow of Visualization Layer:
 The visualization layer works on top of the aggregated data stored in traditional
Operational Data Stores (ODS), data warehouse and data marts.
 These ODS get the aggregated data through the data scoop.
 Examples of visualization tools are Tableau, Clickview, Spotfire, MapR and
revolution R.
 These tools work on top of the traditional components such as reports, dashboards
and queries.

Virtualization and Big Data


 Virtualization is a process that allows you to run the images of multiple operating
systems on a physical computer.
 These images of operating systems are called virtual machines.



 A virtual machine is basically a software representation of a physical machine that
can execute or perform the same functions as the physical machine.
 Each virtual machine contains a separate copy of the operating system with its own
virtual hardware resources, device drivers, services and applications.
 Although virtualization is not a requirement for Big Data analysis, the required
software frameworks such as MapReduce works very efficiently in a virtualized
environment.


Internal Assessment

MULTIPLE CHOICE QUESTIONS


1. _______ can change the maximum number of cells of a column family. [ c ]
a) set b) reset c) alter d) select
2. __________ class adds HBase configuration files to its object. [ a ]
a) Configuration b) Collector c) Component d) None of the mentioned
3. Correct and valid syntax for count command is ____________ [ b ]
a) count '<row number>' b) count '<table name>'
c) count '<column name>' d) none of the mentioned
4. Adding layers of security on the distributed file system will _______ its [ a ]
performance.
a) degrade b) Upgrade c) Both a& b d) None of the above
5. During _________, the filtered data is analyzed against MDM metadata. [ b ]
a) Verification b) validation c) Authorization d) Authentication
6. The role of an analytics engine is to analyze huge amounts of _____ data. [ b ]
a) Structured b) unstructured c) Semi structured Data d) None
7. Examples of visualization tools are [ d ]
a) Tableau b) QlikView c) Spotfire d) All

FILL IN THE BLANKS


1. An operational data source consisted of highly structured data managed by the line
of business in a ______________. (Relational database)
2. ________________ reduces the cost of storage and processing in order to meet the
growing data requirements. (Cost Reduction)
3. ____________ usually provide a high level of fault tolerance and robustness in
handling errors. (MapReduce Engines)
4. The role of the _________________ is to absorb the huge inflow of data and sort it out
in different categories. (Ingestion Layer)
5. A _________________ is a single instance of a MapReduce app. (Map Task)
6. A ________ processes an output of a map task. (Reduce Task)
7. _____________is the most challenging aspect of security in a big data environment.
(Data encryption)


Very Short Questions


Q1. What is map stage?
Ans:
The task of the map or mapper is to process the input data at this stage. In most cases, the
input data is stored in the Hadoop Distributed File System (HDFS) as a file or directory. The
mapper function receives the input file line by line, processes the data, and produces several
small chunks of data.

Q2. What is Mapper?


Ans:
The Mapper processes input records produced by the Record Reader and generates
intermediate key-value pairs. The intermediate output is completely different from the input
pair.
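
To make the map stage and the intermediate key-value pairs concrete, here is a minimal word-count style sketch in plain Python. The mapper and reducer functions, the sample lines, and the dictionary used to imitate the framework's shuffle-and-sort step are all illustrative; they are not part of any specific Hadoop API.

```python
from collections import defaultdict

def mapper(line):
    """Map stage: turn one input line into intermediate (key, value) pairs."""
    for word in line.split():
        yield word.lower(), 1                 # intermediate key-value pair

def reducer(key, values):
    """Reduce stage: aggregate all values emitted for one key."""
    return key, sum(values)

lines = ["big data needs map reduce", "map reduce processes big data"]

# Imitation of shuffle-and-sort: group intermediate pairs by key.
groups = defaultdict(list)
for line in lines:
    for key, value in mapper(line):
        groups[key].append(value)

print([reducer(k, v) for k, v in groups.items()])
# e.g. [('big', 2), ('data', 2), ('needs', 1), ('map', 2), ('reduce', 2), ('processes', 1)]
```

In a real Hadoop job, the grouping done here with a dictionary is performed by the framework between the map and reduce phases.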

Q3. What is Visualization Layer?


Ans:
A huge volume of Big Data can lead to information overload. If visualization is
incorporated early on as an integral part of the Big Data tech stack, it helps data analysts
and scientists gain insights faster and increases their ability to look at different aspects of
the data in various visual modes.

Q4. Define Data Processing


Ans:
Data processing is the second layer, responsible for collecting, cleaning, and preparing
the data for analysis. This layer is critical for ensuring that the data is high quality and ready
to be used in the future.

Q5. Define Data Ingestion.


Ans:
This layer is responsible for collecting and storing data from various sources. In Big
Data, data ingestion is the process of extracting data from various sources and loading it into a
data repository. Data ingestion is a key component of Big Data architecture because it
determines how data will be ingested, transformed, and stored.


Storing Data in Databases and Data Warehouses


RDBMS and Big Data, Issues with Relational Model,
Non – Relational Database, Issues with Non-Relational

UNIT Database, Polyglot Persistence, Integrating Big Data


with Traditional Data Warehouse, Big Data Analysis
and Data Warehouse.
NoSQL Data Management
Introduction to NoSQL, Characteristics of NoSQL,

IV History of NoSQL, Types of NoSQL Data Models- Key


Value Data Model, Column Oriented Data Model,
Document Data Model, Graph Databases, Schema-Less
Databases, Materialized Views, CAP Theorem.

Objective
 Discuss databases and data warehouses in terms of their utility in data storage,
beginning with RDBMS and its role in managing Big Data
 Introduce non-relational databases
 Explain the concept of polyglot persistence
 Describe the integration of Big Data with traditional data warehouses
 Discuss Big Data analysis and data warehousing in detail
 Elaborate on the changing deployment models in the era of Big Data
 Discuss various aspects of data management in NoSQL in detail
 Introduce NoSQL and explain the types of NoSQL data models
 Describe the key-value, column-oriented, and document data models
 Become familiar with the concept of materialized views as well as distribution models


PART – A
Short Type Questions
Q1. List the Characteristics of Non-Relational Databases.
Ans:
Non-relational database technologies have the following characteristics in common:
 Scalability —It refers to the capability to write data across multiple data clusters
simultaneously, irrespective of physical hardware or infrastructure limitations.
 Seamlessness —Another important aspect that ensures the resiliency of non-
relational databases, is their capability to expand/contract to accommodate varying
degrees of increasing or decreasing data flows, without affecting the end-user
experience.
 Data and Query Model — Instead of the traditional row/column, key-value
structure, non-relational databases use frameworks to store data with a required set
of queries or APIs to access the data.
 Persistence Design — Persistence is a vital element in non-relational databases,
ensuring faster throughput of huge amounts of data by making use of dynamic
memory rather than conventional reading and writing from disks. Due to the high
variety, velocity, and volume of data, these databases utilize different mechanisms to
maintain persistency of data. The highest-performance option is "in-memory," where
the entire database is stored in a huge cache, so as to avoid time-consuming
read-write cycles.
 Eventual Consistency — While RDBMSs use ACID (Atomicity, Consistency, Isolation,
Durability) for ensuring data consistency, non-relational DBMSs use BASE (Basically
Available, Soft state, Eventual consistency) to ensure that inconsistencies are resolved
when data is midway between the nodes in a distributed system.

Q2. Brief History of NoSQL Databases


Ans:
 1998 - Carlo Strozzi used the term NoSQL for his lightweight, open-source relational
database
 2000 - Graph database Neo4j is launched
 2004 - Google BigTable is launched
 2005 - CouchDB is launched
 2007 - The research paper on Amazon Dynamo is released
 2008 - Facebook open-sources the Cassandra project
 2009 - The term NoSQL was reintroduced

Q3. What are the 5 features of NoSQL?


Ans:
1. No Structured Query Language (SQL): You can use a similar language, like
Cassandra Query Language (CQL), or you can use a radically different API, such as
one using JSON. But if you can compliantly accept all SQL statements, verbatim, you
are not "NoSQL" but "NewSQL," because you literally are a SQL database.
2. No table joins. Remember the ‘relational’ part of RDBMS? One of the other big
differentiators was that NoSQL used to mean absolutely no table JOINs; the tables do

not relate to one another. If you can permit table JOINs (e.g., MongoDB's $lookup),
you are again treading into the world of "NewSQL" rather than "NoSQL."
3. Schema-optional or schemaless. SQL requires all tables have pre-defined schemas.
NoSQL permits you to have a schema (schema optional) or may be schemaless.
These are anathema to the SQL RDBMS world, where you have to pre-define your
schema, have strong typing, and you don’t want to hear about sparse data models.
Schema-optional or schemaless permit dealing with a wider variety of data, and with
rapidly evolving data models.
4. Horizontal scalability: The NoSQL world was designed (ideally) to scale to the web,
or the cloud, or the Internet of Things (IoT), whereas SQL was designed (ideally) with
the enterprise in mind — a Fortune 500 company, for instance. Summarily, SQL was
designed to scale vertically for an enterprise ("one very big box" like a mainframe),
whereas NoSQL was designed to scale horizontally ("many little boxes, all alike" on
commodity hardware). However, while this was generally true in the past — a rule
of thumb — there are a few NoSQL systems ported to mainframes, and now some
SQL systems designed to scale horizontally. Ideally, a database can be architected to
scale both horizontally and vertically (e.g., Scylla).
5. Availability-focused (vs. Consistency-focused): SQL RDBMSs grew up in the
world of big banking and other commercial use cases that required consistency in
transaction processing. The money was either in your account, or it wasn't. And if
you checked your balance, it needed to be precise; reloading it should not change
what's in there. But that means the database needs to take its own sweet time to
make sure your balance is right. Whereas for the NoSQL world, the database needed
to be available, with no blocking transactions. Which means that data may be eventually
consistent, i.e., reload and you'll eventually see the right answer. That was fine for
use cases that were less mission-critical, like social media posting or caching
ephemeral data like browser cookies. However, again, this is a rule of thumb from
the early days. Now, many NoSQL systems support ACID, two-phase commits,
strong consistency, and so on. Still, the prevalence is for many NoSQL systems to be
more aligned with the "AP" (Availability/Partition Tolerant) side than the "CP"
(Consistency/Partition Tolerant) side of the CAP theorem.

Q4. Discuss about the CAP Theorem


Ans:
The CAP theorem applies to any system which replicates data across multiple nodes,
and which requires consistency (however defined) for replicated data.
 Consistency - All the servers in the system will have the same data so users will get
the same copy regardless of which server answers their request.
 Availability - The system will always respond to a request (even if it's not the latest
data or consistent across the system or just a message saying the system isn't
working).
 Partition Tolerance - The system continues to operate as a whole even if individual
servers fail or can't be reached.
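
As a rough, simplified illustration of the trade-off behind the theorem, the toy sketch below simulates two replica nodes with asynchronous replication: a write is acknowledged immediately by one node (availability), while a read from the other node can briefly return stale data until replication catches up (eventual consistency). The Replica class and the data are invented for this sketch and do not model a real distributed system.

```python
class Replica:
    """A toy replica node: a dict of data plus a queue of pending updates."""
    def __init__(self):
        self.data = {}
        self.pending = []

    def apply_pending(self):
        """Apply replicated updates that have arrived from the other node."""
        for key, value in self.pending:
            self.data[key] = value
        self.pending.clear()

node_a, node_b = Replica(), Replica()

# A client writes to node A; A acknowledges at once and replicates asynchronously.
node_a.data["balance"] = 100
node_b.pending.append(("balance", 100))

print(node_b.data.get("balance"))   # None: a stale read, but the node stayed available
node_b.apply_pending()              # replication eventually catches up
print(node_b.data.get("balance"))   # 100: the replicas are now consistent
```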



Q5. How To Apply Big Data Concepts to a Traditional Data Warehouse?
Ans:
While the worlds of big data and the traditional data warehouse will intersect, they are
unlikely to merge anytime soon. Think of a data warehouse as a system of record for
business intelligence, much like a customer relationship management (CRM) or accounting
system. These systems are highly structured and optimized for specific purposes. In
addition, these systems of record tend to be highly centralized.

A typical approach to data flows routes operational data from source systems through the data warehouse into departmental data marts for reporting.

Organizations will inevitably continue to use data warehouses to manage the type of
structured and operational data that characterizes systems of record. These data warehouses
will still provide business analysts with the ability to analyze key data, trends, and so on.
However, the advent of big data is both challenging the role of the data warehouse and
providing a complementary approach.

It's inevitable that operational and structured data will have to interact in the world of
big data, where the information sources have not (necessarily) been cleansed or profiled.
Increasingly, organizations understand that they have a business requirement to be able to
combine traditional data warehouses with their historical business data sources with less
structured and vetted big data sources. A hybrid approach supporting traditional and big
data sources can help to accomplish these business goals.

Q6. Discuss about the Key-Value Data Model?


Ans:
A key-value data model or database is also referred to as a
key-value store. It is a non-relational type of database. In this, an
associative array is used as a basic database in which an
individual key is linked with just one value in a collection. For the
values, keys are special identifiers. Any kind of entity can be
valued. The collection of key-value pairs stored on separate
records is called key-value databases and they do not have an
already defined structure.

Q7. Explain about Graph Data Model?


Ans:
Graph Based Data Model in NoSQL is a type of Data Model which tries to focus on
building the relationship between data elements. As the name suggests Graph-Based Data
Model, each element here is stored as a node, and the association between these elements is
often known as Links. Association is stored directly as these are the first-class elements of
the data model. These data models give us a conceptual view of the data.

These data models are based on a topological network structure. In graph theory, we have
terms like nodes, edges, and properties; let's see what they mean in the graph-based data model.
 Nodes: These are the instances of data that represent objects which is to be tracked.
 Edges: As we already know edges represent relationships between nodes.
 Properties: It represents information associated with nodes.


PART – B
Essay Type Questions
STORING DATA IN DATABASES AND DATA WARE
HOUSES
4.1 RDBMS and Big Data
Q1. Why should I care about big data? Why do I need a big data solution?
Ans:
A simple Database Management System (DBMS) stores data in the form of schemas or
tables comprising rows and columns. The main goal of a DBMS is to provide a solution for
storing and retrieving information in a convenient and efficient manner. The most common
way of fetching data from these tables is by using Structured Query Language (SQL). A
Relational Database Management System (RDBMS) stores the relationships between these
tables in columns that serve as a reference for another table. These columns are known as
primary keys and foreign keys, both of which can be used to reference other tables so that
data can be related between the tables and retrieved as and when it is required.

Such a database system usually consists of several tables and relationships between
those tables, which help in classifying the information contained in them. These tables are
often stored in Boyce-Codd Normal Form (BCNF), and relationships are sustained through
the tables' primary/foreign keys.
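
The sketch below uses Python's built-in sqlite3 module to show, in miniature, how a primary key in one table is referenced by a foreign key in another and how a join retrieves related rows. The department and employee tables and their contents are invented purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Parent table: 'dept_id' is the primary key.
conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT)")

# Child table: 'dept_id' is a foreign key referencing the parent table.
conn.execute("""CREATE TABLE employee (
                    emp_id  INTEGER PRIMARY KEY,
                    name    TEXT,
                    dept_id INTEGER REFERENCES department(dept_id))""")

conn.execute("INSERT INTO department VALUES (1, 'Sales')")
conn.execute("INSERT INTO employee VALUES (10, 'Asha', 1)")

# The stored relationship lets related data be retrieved with a join.
rows = conn.execute("""SELECT e.name, d.name
                       FROM employee e JOIN department d
                       ON e.dept_id = d.dept_id""").fetchall()
print(rows)   # [('Asha', 'Sales')]
```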

In addition to data files, a Data Warehouse (DWH) is also used for handling large
amounts of data or Big Data. A DWH can be defined as the association of data from various
sources that are created for supporting planned decision making. "A data warehouse is a
subject-oriented, integrated, time-variant and non-volatile collection of data in support of
management's decision-making process."

The primary goal of a data warehouse is to provide a consistent picture of the business at
a given point of time. Using various data warehousing toolsets, employees in an
organization can efficiently execute online queries and mine data according to their
requirement.

RDBMS and Big Data


An RDBMS uses a relational model where all the data is stored using preset schemas.
These schemas are linked using the values in specific columns of each table. The data is
hierarchical, which means for data to be stored or transacted it needs to adhere to ACID
standards, namely:
 Atomicity—Ensures full completion of a database operation.
 Consistency—Ensures that data abides by the schema (table) standards, such as
correct data type entry, constraints, and keys.
 Isolation—Refers to the encapsulation of information. Makes only necessary
information visible.
 Durability—Ensures that transactions stay valid even after a power failure or errors.

In traditional database systems, every time data is accessed or modified, it must be
moved (indexed) to a central location for processing.

RDBMS presumes your data to be essentially correct beforehand. Mapping real-world
problems to a relational database model involves many improvised strategies, but the
textbook approach recommends following a three-step process: conceptual, logical, and
physical modeling. Simply mapping the problem in this way does not always work, because it
makes some incorrect assumptions, such as your information system staying consistent and
static all the time. In the real world, however, data is only partially clean and well-structured.

The Internet has grown by leaps and bounds in the past two decades. Numerous
domains have been registered, and more than a billion gigabytes of Web space has been
reserved. With the digital revolution of the early 1990s aiding the personal computer
segment, Web transactions have grown rapidly. With the advent of various search engines
and easily and freely available information, social media platforms have recorded a threefold
increase in the volume of transactions. Although solutions for such situations in corporate
scenarios have existed, coping with Web-based transactions has turned out to be a compelling
factor in addressing database-related issues with Big Data solutions.

Big Data mainly takes three Vs into account: Volume, Variety, and Velocity. These three
terms can be briefly explained as follows:
 Volume of Data—Big Data is designed to store and process a few hundred terabytes,
or even petabytes or zettabytes of data.
 Variety of Data—Collection of data, different from a format suiting relational
database systems, is stored in a semi-structured or an unstructured format.
 Velocity of Data—The rate of data arrival might make an enterprise data warehouse
problematic, particularly where formal data preparation processes like conforming,
examining, transforming, and cleansing of the data needs to be accomplished before
it is stored in data warehouse tables.

Q2. What problems do big data solutions solve?


Ans:
Big Data solutions are designed for storing and managing enormous amounts of data
using a simple file structure, formats, and highly distributed storage mechanisms with the
initial handling of the data occurring at each storage node, unlike RDBMS. This obviates the
need for data to be moved every time over the network for even a simple processing.

One of the biggest difficulties with RDBMS is that it is not yet near the demand levels of
big data. The volume of data handling today is rising at a fast rate.

Big Data primarily comprises semi-structured data, such as social media sentiment
analysis and text mining data, while RDBMSs are more suitable for structured data, such as
weblog, sensor, and financial data.

Feature                      | Relational Database Systems          | Big Data Solutions
Data types and formats       | Structured                           | Structured, semi-structured & unstructured
Data integrity               | High transactional updates           | Depends on the technology used — often follows consistent models
Schema                       | Static                               | Dynamic
Read and write pattern       | Fully repeatable read/write          | Write once, repeatable read
Storage volume               | Gigabytes to terabytes               | Terabytes, petabytes, and beyond
Scalability                  | Scale up with more powerful hardware | Scale out with additional servers
Data processing distribution | Limited or none                      | Distributed across clusters
Economics                    | Expensive hardware and software      | Commodity hardware and open-source software
A report of a survey done by Forrester on 60 clients brings out some interesting facts.
Three-quarters of the respondents indicate volume as the main reason for contemplating Big
Data solutions, while others state velocity, variety, as well as variability as the major cause of
shifting to Big Data technologies.

Big Data does not force a relational model on the stored data. Rather, the data can be of
any type — structured, semi-structured, or unstructured. A suitable schema can be applied
later on, when you query this data. Big Data stores data in its raw format and applies a
schema only when the data is read, which preserves all of the information within the data.

Big Data solutions are optimized for loading huge amounts of data using simple file
formats and highly distributed storage mechanisms, with initial processing of the data
occurring at every storage node. This means that once the data is loaded onto the cluster
storage, the bulk of the data need not be moved over the network for processing.

Q3. Will a big data solution replace my relational databases?


Ans:
Big Data solutions provide a way to avoid storage limitations and reduce the cost of
processing and storage, for immense volumes of data. Even in the case of RDBMS, with the
advancement of technology and development in hardware and software for such an
essential business function, capabilities for storing huge amounts of data are available.

Big Data is an important tool when you need to manage data that is arriving very
quickly in random formats, which you can process later. You can store the data in clusters in
its original format, and then process it when required using a query that extracts the
required result set and stores it in a relational database, or makes it available for reporting.


CAP Theorem
CAP Theorem is also known as Brewer's theorem. It states that it is not possible for a
distributed system to provide all the following three conditions at the same point of time:
 Consistency—Same data is visible by all the nodes.
 Availability— Every request is answered, whether it succeeds or fails.
 Partition-tolerance—Despite network failures, the system continues to operate.

4.2 Issues with the Relational Model


Q4. What is the main problem with relational database for processing big data?
Ans:
The main problems with RDBMS’s and their use are:
 Using RDBMS for use cases to which it does not apply
 Not using RDBMS for use cases to which it does apply
 Poor schema design for the use case
 Application design for applications that need data but for which the app design
preceded the schema design
 Badly written SQL
 Not taking advantage of numerous non-relational features in modern RDBMS
systems
 Not listening to DBAs and Database Architects (DAs)
 DBAs and DAs not involved in code review

The solutions flow from the problem definition:

 Tuple storage primarily — a log style that is ready to replicate
 Structural renderings of tuples for efficient querying — indexes, records, graphs
and more
 Distributed querying that uses a binary SQL, removing all problems that arise
from SQL text parsing
 Management of entities, with declarations of the intended use and statistics of use,
resulting in optimized use of hardware resources — particularly for disk choices, but
also spare RAM and varying ACID compliance levels


4.3 Non-Relational Database


Q5. What is a non-relational database? What are the main characteristics of a non-
relational database?
Ans:
A non-relational database is a database that does not use the tabular schema of rows and
columns found in most traditional database systems. Instead, non-relational databases use a
storage model that is optimized for the specific requirements of the type of data being
stored.

Another important class of non-relational databases is one that does not support the
relational model but still uses and relies on SQL as the primary means of manipulating existing
data. Despite relational and non-relational databases having similar fundamentals, the
difference lies in how each achieves those fundamentals.

Non-relational database technologies have the following characteristics in common:


 Scalability —It refers to the capability to write data across multiple data clusters
simultaneously, irrespective of physical hardware or infrastructure limitations.
 Seamlessness —Another important aspect that ensures the resiliency of non-
relational databases, is their capability to expand/contract to accommodate varying
degrees of increasing or decreasing data flows, without affecting the end-user
experience.
 Data and Query Model — Instead of the traditional row/column, key-value
structure, non-relational databases use frameworks to store data with a required set
of queries or APIs to access the data.
 Persistence Design — Persistence is a vital element in non-relational databases,
ensuring faster throughput of huge amounts of data by making use of dynamic
memory rather than conventional reading and writing from disks. Due to the high
variety, velocity, and volume of data, these databases utilize different mechanisms to
maintain persistency of data. The highest-performance option is "in-memory," where
the entire database is stored in a huge cache, so as to avoid time-consuming
read-write cycles.
 Eventual Consistency — While RDBMSs use ACID (Atomicity, Consistency, Isolation,
Durability) for ensuring data consistency, non-relational DBMSs use BASE (Basically
Available, Soft state, Eventual consistency) to ensure that inconsistencies are resolved
when data is midway between the nodes in a distributed system.

4.4 Issues with Non-Relational Databases


Q6. What are the problems with non-relational databases?
Ans:
The non-relational data model looks more like a concept of one entity, and all the data
that relates to that one entity is called a document.
 Non-relational databases are less reliable than the relational databases because
they compromise reliability for performance.
 Non-relational databases also compromise consistency for performance unless
manual support is provided.
 Non-relational databases use different query languages than relational databases.
So, people find it an overhead to learn new query languages.

 In many non-relational databases, security is weaker than in relational databases,
which is a major concern. For example, MongoDB and Cassandra both lack
encryption for data files, have very weak authentication systems, and provide only
very simple authorization.

4.5 Polyglot Persistence


Q7. What is polyglot persistence? Why is polyglot persistence considered a new
approach?
Ans:
Polyglot persistence is an enterprise storage term used to describe choosing different
data storage technologies (data stores) to support the various data types and their storage
needs. Polyglot persistence is essentially the idea that an application can use more than one
core database (DB) or storage technology.

A lot of corporations still use relational databases for some data, but the increasing
persistence requirements of dynamic applications are growing from predominantly
relational to a mixture of data sources.

Polyglot persistence uses the same ideas as polyglot programming, which is the practice
of writing applications using a mix of languages in order to take full advantage of the fact
that various languages are suitable for solving different problems.

Polyglot is defined as someone who can speak and write in several languages. In data
storage, the term persistence means that data survives after the process with which it was
created has ended. In other words, data is stored in non-volatile storage.

However, as in distributed or parallel computing, if the data is scattered over different
database nodes, you can at least be sure that not all functionality will be hampered in the
event of a failure.

Polyglot persistence is quite a common and popular paradigm among database
professionals. An important reason is that users expect applications to come studded with
rich experiences, where they can easily search for what they are looking for, such as friends,
nearby restaurants, or accurate information, without caring about the rest of the frills.

To build a useful app for a user, developers are required to make clever improvisations
with the existing ways of storing and querying data. They need tools to provide the required
context-based content to the users.
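
A minimal sketch of polyglot persistence, under the assumption that one application keeps each kind of data in the store that suits it: transactional orders in an embedded SQLite database, user sessions in an in-memory key-value dictionary (a stand-in for a cache such as Redis), and a flexible product catalogue as document-like Python dicts (a stand-in for a document store such as MongoDB). All names and data are invented for illustration.

```python
import sqlite3

# Relational store for transactional records (orders).
orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
orders_db.execute("INSERT INTO orders VALUES (1, 499.0)")

# Key-value store for user sessions (in-memory dictionary as a cache stand-in).
session_store = {"session:42": {"user": "asha", "cart": [1]}}

# Document store for a flexible product catalogue (list of dicts as a stand-in).
catalogue = [
    {"_id": 1, "name": "Laptop", "specs": {"ram_gb": 16, "ssd_gb": 512}},
    {"_id": 2, "name": "Headphones", "tags": ["audio", "wireless"]},
]

# Each query goes to the store best suited to it.
total = orders_db.execute("SELECT total FROM orders WHERE id = 1").fetchone()[0]
cart = session_store["session:42"]["cart"]
product = next(doc for doc in catalogue if doc["_id"] == cart[0])
print(total, product["name"])   # 499.0 Laptop
```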

4.6 Integrating Big Data with Traditional Data Warehouses


Q8. How To Apply Big Data Concepts to a Traditional Data Warehouse?
Ans:
After a feasibility study of the Big Data requirement, the identification and transition phase
commences. Planning a heterogeneous data environment is a task that involves multiple
phases of undo-redo, quality checks, and integrity assurances, whereas implementing a Big
Data solution in a previously non-existent environment includes key challenges to be taken
care of before any further commitments ensue.

Although Big Data and the traditional data warehouse often confront each other, they are
more likely to coexist than to merge into a new version of Big Data. Data warehouses are
highly structured and tuned for specific purposes. The relational data model is here to stay
for a long time, since organizations will continue to use data warehouses to manage the
organized, operational type of data that characterizes systems of record.

A relationship between Big Data and a data warehouse can be best described as a hybrid
structure in which the well-structured, optimized, and operational data remains in the
heavily guarded data warehouse.

A hybrid approach that supports a cross between traditional and Big Data sources helps
in accomplishing these business goals. The main challenges confronting the physical
architecture of the next-generation data warehouse platform include data availability,
loading and storage performance, data volume, scalability, assorted and varying query
demands against the data, and the operational costs of maintaining the environment.

The key methods that we are going to explain in brief are as follows:
 Data Availability — Data availability is a well-known challenge for any system
related to transforming and processing data for use by end users, and Big Data is no
different. The challenge is to sort and load the data, which is unstructured and in
varied formats. Also, context-sensitive data involving several different domains may
require another level of availability check. The data present in the Big Data hierarchy
is not updated; reprocessing new data containing updates will create duplicate data,
and this needs to be handled to minimize the impact on availability.
 Pattern Study — Pattern study is nothing but the centralization and localization of
data according to the demands. A global e-commerce website can centralize requests
and fetch directives and results on the basis of end-user locations, so as to return only
meaningful contextual knowledge rather than impart the entire data to the user.
Trending topics are one of the pattern-based data study models that are a popular
mode of knowledge gathering for all platforms. The trending pattern to be found for
a given regional location is matched for occurrence in the massive data stream, in
terms of keywords or popularity of links as per the hits they receive, and based on
geographical diversification and stream-filtering methods, data with similar
characteristics is conjoined to form a pattern.
 Data Incorporation and Integration — Since no guidebook format or schema
metadata exists, the data incorporation process for Big Data is about just acquiring
the data and storing it as files. Especially in the case of big documents, images, or
videos, if such requirements happen to be the sole architecture driver, a dedicated
machine can be allocated for this task, bypassing the guesswork involved in the
configuration and setup process.
 Data Volumes and Exploration—Traffic spikes and volatile surge in data volumes
can easily dislocate the functional architecture of corporate infrastructure due to the
fundamental nature of the data streams. On each cycle of data acquisition

completion, retention requirements for data can vary depending on the nature and
the freshness of the data and its core relevance to the business. Data exploration and
mining is an activity responsible for Big Data procurements across organizations and
also yields large data sets as processing output. These data sets are required to be
preserved in the system by occasional optimization of intermediary data sets.
 Compliance and Localized Legal Requirements —Various compliance standards
such as Safe Harbor, GLBA, and PCI regulations can have some impact on data
security and storage. Therefore, these standards should be judiciously planned and
executed. Moreover, there are several cases of transactional data sets not being stored
online required by the courts of law. Big Data infrastructure can be used as a storage
engine for such data types, but the data needs to comply with certain standards and
additional security. Large volumes of data can affect overall performance, and if such
data sets are processed on the Big Data platform, the appliance configurator can
provide administrators with tools/tips to mark the data in its own area of
infrastructure zoning, minimizing both risk and performance impact.
 Storage Performance — In all these years, storage-based solutions didn't advance as
rapidly as their counterparts, processors, memories, or cores did. Disk performance
is a vital point to be taken care of while developing Big Data systems and appliance
architecture can throw better light on the storage class and layered architecture. If a
combination of Solid State Drive (SSD), in-memory, and traditional storage
architecture is intended for Big Data processing, the exchange and persistence of data
across the different layers can be time-consuming and vapid.

A few parallels can be drawn between a data warehouse and Big Data solution. Both
store a lot of data. Both can be used for reporting and are managed by electronic storage
devices.

4.7 Big Data Analysis and Data Warehouse


Q9. What is big data analytics and big data warehousing?
Ans:
Big Data is analyzed to know the present behavior or trends and make future
predictions. Various Big Data solutions are used for extracting useful information from Big
Data. A simple working definition of a Big Data solution is that it is a technology that:
 Enables the storage of very large amounts of heterogeneous data
 Holds data in low-cost storage devices
 Keeps data in a raw or unstructured format
Big Data analytics performed by a Big Data solution helps organizations in the following
ways:
 Brings improvement in the tourism industry by analyzing the behavior of tourists
and their trends
 Enables improvement in technological research
 Enables improvement in the medical field for diagnosing diseases quickly
 Helps the defense sector by enabling better monitoring
 Helps the insurance industry by better customer relationship management

Data warehousing, on the other hand, is a group of methods and software to enable data
collection from functional systems, integration, and synchronization of that data into a



centralized database, and then the analytical visualization and report-tracking of key
performance indicators in a dash board-driven environment. The data warehousing process
comprises cleaning, integration, and consolidation of data. A data warehouse acts as a single
point of reference for a granular and integrated data of the company.

A big client of Argon Technology wants a solution for analyzing the data of 100,000
employees across the world. Assessing the performance manually of each employee is a
huge task for the administrative department before rewarding bonuses or increasing
salaries. Argon Technology experts set up a data warehouse in which information related to
each employee is stored. The administrative department extracts that information with the
help of Argon's Big Data solution easily and analyzes it before providing benefits to an
employee.

The complexity of the data warehouse environment has risen dramatically in recent years
with the influx of data warehouse appliance architectures, NoSQL/Hadoop databases, and
several API-based tools for many forms of cutting-edge analytics or real-time tasks. A Big
Data solution is preferred because there is a lot of data that would otherwise have to be
handled manually and relationally. In organizations handling Big Data, if the data is used to
its potential, it can provide much valuable information leading to superior decision making
which, in turn, can lead to more profitability, revenue, and happier customers.

On comparing a data warehouse to a Big Data solution, we find that a Big Data solution
is a technology and data warehousing is architecture.

They are two distinct things.


 Technology is just a medium to store and operate huge amounts of data.
 In a typical data warehouse, you will find a combination of flat files, relational
database tables, and non-relational sources.

Big Data is not a substitute for a data warehouse. Data warehouses work with abstracted
data that has been cleansed, filtered, and transformed into a separate database, using
analytics such as sales trend analysis or compliance reporting. That database is updated
gradually with the same filtered data, either on a weekly or a monthly basis.

Organizations that use data warehousing technology will continue to do so, and those
that use both Big Data and data warehousing are future-proofed against further
technological advancements only up to the point where the thin line of separation between
the two starts disappearing. Conventional data warehouse systems are proven systems, and
with investments to the tune of millions of dollars for their development, those systems are
not going anywhere soon. Regardless of how good and profitable Big Data analytics is or
turns out to be, data warehousing will still continue to provide crucial database support to
many enterprises and, in all circumstances, will complete the lifecycle of current systems.

4.7.1 Changing Deployment Models in Big Data Era


Q10. What are the different deployment models?
Ans:
Data management deployment models have been shifting to altogether different levels
ever since the inception of Big Data. From implementations of traditional data on a single
large system in the data center to distributed database nodes located across several
computers within the same data center, the cost-to-price ratio of such a model has led
companies to optimize these warehouses and limit the scope and size of the data being
managed.

The inevitable trend is toward hybrid environments that address the following
enterprise Big Data necessities:
 Scalability and Speed — The developing hybrid Big Data platform supports parallel
processing, optimized appliances and storage, workload management, and dynamic
query optimization.
 Agility and Elasticity—The hybrid model is agile, which means it is flexible and
responds rapidly in case of changing trends. It also provides elasticity, which means
this model can be increased or decreased as per the demands of the user.
 Affordability and Manageability — The hybrid environment will integrate flexible
pricing, including licensed software, custom-designed appliances, and cloud-based
approaches for future-proofing. The hybrid data warehouse has become a real-world
method of producing an enhanced environment to support the transition to new
information management.
 Appliance Model (also known as Commodity Hardware) — An appliance is a system
used in data centers to optimize data storage and management. It can be easy and
quick to implement, and offers low cost in terms of operation and maintenance. It
also integrates logical engines and tools to shorten the process of examining data
from several sources. The appliance is thus a single-purpose machine that usually
includes different interfaces to easily connect to an existing data warehouse. Though
various backup systems and central-node transfer techniques have made the
present-day appliance model fairly reliable, oddities still remain.
 Cloud Deployment—Big Data business applications with cloud-based approaches
have an advantage that other methods lack.
o On-demand self-service—Enables the customer to use a self-propelled cloud
service with minimal interaction involving the cloud service provider
o Broad network access — Allows Big Data cloud resources to be available
over the network and accessible across different client platforms
o Multi-user —Allows cloud resources to be allocated so that isolation can be
guaranteed to multiple users, their computations, and data from one another
o Elasticity and scalability—Allows cloud resources to be elastically, rapidly,
and automatically scaled out, up, and down as the need be
o Measured service—Enables remote monitoring and billing of Big Data cloud
resources.

Some of the challenges for a Big Data architecture and cloud computing are as follows:
 The data involved, and its magnitude and location. Big Data may start out in, and
end up scattered across, different locations. These sites may or may not be
serviceable by a cloud service.
 The type of processing required on the data. Continuous or burst mode? Can the
data be parallelized?



 Does data need to be moved to the processing environment or does the processing
environment need to be moved to where data is, depending on the data volumes
involved?
 Technical and supportability requirements of cloud-based Big Data
 Service models and deployment of Big Data in the cloud


NOSQL DATA MANAGEMENT


4.8 Introduction to NoSQL
Q10. What are NoSQL Databases?
Ans:
NoSQL is a non-relational database approach that differs from traditional relational
database systems. NoSQL databases are designed for distributed data stores where there is a
need for large-scale data storage.

For example, Google and Facebook collect terabytes of data daily for their users.
Such databases do not require a fixed schema, avoid join operations, and scale data
horizontally.

In the original NoSQL database (Carlo Strozzi's), tables are stored as ASCII files, with each
tuple represented by a row and fields separated by tabs. The database is manipulated through
shell scripts that can be combined into UNIX pipelines. As its name suggests, NoSQL doesn't
use SQL as a query language.

There are lots of data storage options available in the market, and yet many more are
still coming up as the nature of data, platforms, user requirements, architectures, and
processes keeps changing. We are living in the era of Big Data and are in search of ways of
handling it. Since 2009, this has given impetus to the creation of schema-free databases that
can handle large amounts of data. These databases are scalable, ensure high availability,
support replication, and are distributed and possibly open source. One such class of
databases is NoSQL.

NoSQL databases are still in the development stages and are going through a lot of
changes. Software developers who work on databases have mixed opinions about NoSQL.
Some find it useful, while others point out flaws in it. Some are uncertain and believe that
it is just another hyped technology that will eventually vanish due to its immaturity.

NoSQL stands for "Not Only SQL" or "Not SQL." Though a better term would
be "NoREL," NoSQL caught on. Carlo Strozzi introduced the NoSQL concept in 1998.

Traditional RDBMSs use SQL syntax to store and retrieve data for further insights.
Instead, a NoSQL database system encompasses a wide range of database technologies that
can store structured, semi-structured, unstructured and polymorphic data.



Why NoSQL?
The concept of NoSQL databases became popular with Internet giants like Google,
Facebook, Amazon, etc. who deal with huge volumes of data. The system response time
becomes slow when you use RDBMS for massive volumes of data.

To resolve this problem, we could "scale up" our systems by upgrading our existing
hardware. However, this process is expensive.

4.9 Characteristics of NoSQL


Q11. What are the basic characteristics of a NoSQL database?
Ans:
When the creators of NoSQL chose the name, they meant to communicate the fact that
these are databases that do not follow the principles of RDBMS. Another important
characteristic of NoSQL databases is that they are generally open-source projects, and the
term NoSQL is frequently applied to an open-source phenomenon. Most NoSQL databases
are driven by the need to run on clusters. This fact has an effect on their data model as well
as their approach to consistency.

This is particularly useful while dealing with non-uniform data and custom fields.
We see relational databases as just one option for data storage. This point of view is often
referred to as polyglot persistence, and simply means using different data stores for different
circumstances.

We will now look at some of the most common features that define a basic NoSQL
database.

Non-relational
 NoSQL databases never follow the relational model
 Never provide tables with flat fixed-column records
 Work with self-contained aggregates or BLOBs
 Doesn’t require object-relational mapping and data normalization
 No complex features like query languages, query planners, referential integrity joins,
or ACID
Schema-free
 NoSQL databases are either schema-free or have relaxed schemas
 Do not require any sort of definition of the schema of the data
 Offers heterogeneous structures of data in the same domain
Simple API
 Offers easy-to-use interfaces for storing and querying the data provided
 APIs allow low-level data manipulation and selection methods
 Text-based protocols mostly used with HTTP REST and JSON
 Mostly no standards-based NoSQL query language is used
 Web-enabled databases running as internet-facing services
Distributed
 Multiple NoSQL databases can be executed in a distributed fashion
 Offers auto-scaling and fail-over capabilities
 Often ACID concept can be sacrificed for scalability and throughput

 Mostly no synchronous replication between distributed nodes; instead, asynchronous
multi-master replication, peer-to-peer, or HDFS-style replication is used
 Only providing eventual consistency
 Shared-nothing architecture, which enables less coordination and higher distribution

4.10 History of NoSQL


Q12. Give a Brief History of Non-Relational Databases or NoSQL databases?
Ans:
In the early 1970s, flat file systems were used. Data was stored in flat files, and the biggest
problem with flat files was that each company implemented its own flat files and there were
no standards. It was very difficult to store data in files and retrieve data from them because
there was no standard way to store data.

Then the relational database was created by E.F. Codd, and these databases answered the
question of having no standard way to store data. But later, relational databases also ran into
the problem that they could not handle Big Data. Due to this problem, there was a need for a
database that could handle every type of problem, and so the NoSQL database was developed.
 1998 - Carlo Strozzi used the term NoSQL for his lightweight, open-source relational
database
 2000 - Graph database Neo4j is launched
 2004 - Google BigTable is launched
 2005 - CouchDB is launched
 2007 - The research paper on Amazon Dynamo is released
 2008 - Facebook open-sources the Cassandra project
 2009 - The term NoSQL was reintroduced

4.11 Types of NoSQL Data Models- Key Value Data Model


Q13. What are NoSQL key-value databases?
Ans:
NoSQL databases are mainly categorized into four types: key-value pair, column-oriented,
graph-based, and document-oriented. Every category has its unique attributes and
limitations. No single type of database is best for solving all problems; users should select the
database based on their product needs.

Types of NoSQL Databases:

 Key-value pair based
 Column-oriented
 Graph based
 Document-oriented

Key Value Pair Based DataModel


A key-value data model or database is also referred to as a
key-value store. It is a non-relational type of database. In this,
an associative array is used as a basic database in which an
individual key is linked with just one value in a collection. For
the values, keys are special identifiers. Any kind of entity can
be valued. The collection of key-value pairs stored on separate
records is called key-value databases and they do not have an already defined structure.



How do key-value databases work?
A key-value database associates a value, which can be a simple string or a complicated
entity, with a key that is used to track and look up that entity. Like a map object, array, or
dictionary in many programming paradigms, a key-value store is persisted and controlled by
a DBMS.

A key-value store uses an efficient and compact index structure to be able to rapidly and
dependably find a value using its key. For example, Redis is a key-value store used to track
lists, maps, and primitive types (which are simple data structures) in a persistent database.
By supporting a predetermined number of value types, Redis can expose a very basic
interface to query and manipulate values and, when properly configured, is capable of high
throughput.
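
The toy class below imitates the key-value model in memory: every key maps to an opaque value, and the only operations are put, get, and delete; there is no query language over the value's contents. It is a sketch of the idea, not the API of Redis or any other product, and all keys and values shown are invented.

```python
class KeyValueStore:
    """A toy key-value store: keys map to opaque values."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value             # overwrite if the key already exists

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:101", {"name": "Asha", "city": "Hyderabad"})   # any entity can be the value
store.put("page:home:views", 1543)

print(store.get("user:101"))     # values are found only by key, never by their contents
store.delete("page:home:views")
```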

Features
 One of the simplest kinds of NoSQL data models.
 For storing, getting, and removing data, key-value databases utilize simple functions.
 Querying language is not present in key-value databases.
 Built-in redundancy makes this database more reliable.

4.12 Column Oriented Data Model


Q14. What is a column-family data store in NoSQL Database? List the features of column
family database.
Ans:
Column-oriented databases work on columns and are based on BigTable paper by
Google. Every column is treated separately. Values of single column databases are stored
contiguously.

The columnar database is an important NoSQL category. NoSQL databases are different
from SQL databases because they use a data model with a different structure than the
row-and-column table model used with relational database management systems (RDBMS).
NoSQL databases use a flexible schema model that is designed to scale horizontally across
many servers and is used for large volumes of data.

Basically, a relational database stores data in rows and also reads the data row by row,
whereas a column store is organized as a set of columns. So, if someone wants to run
analytics on a small number of columns, one can read those columns directly without
consuming memory with the unwanted data. Columns within a column family are often of
the same type and benefit from more efficient compression, which makes reads faster.
Examples of the columnar data model: Cassandra and Apache HBase.

Column-family databases store data in column families as rows that have many columns
associated with a row key. Column families are groups of related data that is often accessed
together.

In the columnar data model, instead of organizing information into rows, it is organized
into columns. This makes them function the same way that tables work in relational
databases. This type of data model is much more flexible, obviously, because it is a type of
NoSQL database. The sketch below will help in understanding the columnar data model:
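
Since the book's original figure is not reproduced in these notes, the sketch below models a column-family layout with plain Python dictionaries: each row key maps to named column families, and each family groups related columns that are usually read together. The family and column names are invented and do not correspond to any particular Cassandra or HBase schema.

```python
# row key -> column family -> {column: value}
employee_table = {
    "emp:101": {
        "personal": {"name": "Asha", "city": "Hyderabad"},
        "work":     {"dept": "Sales", "grade": "B"},
    },
    "emp:102": {
        "personal": {"name": "Ravi", "city": "Pune"},
        # rows need not share the same columns: the schema stays flexible and sparse
        "work":     {"dept": "Analytics"},
    },
}

# An analytical read touches only the columns it needs, one family at a time.
cities = [row["personal"]["city"] for row in employee_table.values()]
print(cities)   # ['Hyderabad', 'Pune']
```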


There are several benefits that go along with columnar databases:


 Elastic scalability − Cassandra is highly scalable; it allows to add more hardware to
accommodate more customers and more data as per requirement.
 Always on architecture − Cassandra has no single point of failure and it is
continuously available for business-critical applications that cannot afford a failure.
 Fast linear-scale performance − Cassandra is linearly scalable, i.e., it increases your
throughput as you increase the number of nodes in the cluster. Therefore it
maintains a quick response time.
 Flexible data storage − Cassandra accommodates all possible data formats
including: structured, semi-structured, and unstructured. It can dynamically
accommodate changes to your data structures according to your need.
 Easy data distribution − Cassandra provides the flexibility to distribute data where
you need by replicating data across multiple data centers.
 Transaction support − Cassandra supports properties like Atomicity, Consistency,
Isolation, and Durability (ACID).
 Fast writes − Cassandra was designed to run on cheap commodity hardware. It
performs blazingly fast writes and can store hundreds of terabytes of data, without
sacrificing the read efficiency.

4.13 Document Data Model


Q15. What is a Document Database? Give some applications of Datamodel.
Ans:
A document is a record in a document database. A document typically stores
information about one object and any of its related metadata.
Documents store data in field-value pairs. The values can be a variety of types and
structures, including strings, numbers, dates, arrays, or objects. Documents can be stored in
formats like JSON, BSON, and XML.
Document Data Model
A document data model is quite different from other data models because it stores data in JSON, BSON, or XML documents. In this data model, documents can be nested inside other documents, and particular elements can be indexed so that queries run faster. Documents are stored and retrieved in a form that is close to the data objects used in applications, which means very little translation is required to use the data in application code. JSON is also a native format that is often used to store and query data.
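A minimal sketch, assuming a hypothetical book document with illustrative field names, of how one record and its related metadata live together in a single document:

import json

# One self-contained document: the object plus its related metadata.
book = {
    "_id": "bk-101",                              # document key
    "title": "Learning Big Data",
    "authors": ["A. Rao", "S. Iyer"],             # values may be arrays...
    "published": {"year": 2023, "publisher": "Example Press"},   # ...or nested objects
    "tags": ["hadoop", "nosql"],
}

# Documents are commonly exchanged and stored as JSON text.
print(json.dumps(book, indent=2))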



Document Database
A document database is a type of NoSQL database which stores data as JSON
documents instead of columns and rows. JSON is a native language used to both store and
query data. These documents can be grouped together into collections to form database
systems.

Applications of Document Data Model


 Content Management: These data models are widely used in building video streaming platforms, blogs, and similar services, because each item is stored as a single document and the database is much easier to maintain as the service evolves over time.
 Book Database: These are very useful in making book databases because this data model lets us nest related information inside a single document.
 Catalog: These data models are widely used for storing and reading catalog files because they provide fast reads even when catalog items have thousands of attributes.
 Analytics Platform: These data models are also commonly used in analytics platforms.

4.14 Graph Databases


Q16. What is graph database model? How a graph type NoSQL database stores data?
Ans:
The graph is a collection of nodes and edges where each node is used to represent an
entity and each edge describes the relationship between entities. A graph-oriented database,
or graph database, is a type of NoSQL database that uses graph theory to store, map and
query relationships. Graph databases are basically used for analyzing interconnections.

The idea stems from graph theory in mathematics, where graphs represent data sets
using nodes, edges, and properties.
 Nodes or points are instances or entities of data which represent any object to be
tracked, such as people, accounts, locations, etc.
 Edges or lines are the critical concepts in graph databases which represent
relationships between nodes. The connections have a direction that is either
unidirectional (one way) or bidirectional (two way).
 Properties represent descriptive information associated with nodes. In some cases,
edges have properties as well.
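A minimal sketch in Python, assuming a tiny illustrative social-network example (not tied to any particular graph database's API), of nodes, directed edges, and properties:

# Nodes: entities, each with descriptive properties.
nodes = {
    "alice": {"type": "person", "city": "Hyderabad"},
    "bob":   {"type": "person", "city": "Pune"},
    "acme":  {"type": "company"},
}

# Edges: directed relationships; edges can carry properties of their own.
edges = [
    ("alice", "FRIEND_OF", "bob",  {"since": 2019}),
    ("alice", "WORKS_AT",  "acme", {"role": "analyst"}),
]

# A simple traversal: follow every relationship that starts at "alice".
for source, relation, target, properties in edges:
    if source == "alice":
        print("alice -", relation, "->", target, properties)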

The critical differences between graph and relational databases can be summarized as follows:
 In a relational database, data is stored in tables of rows and columns; in a graph database, data is stored as nodes, edges, and properties.
 In a relational database, relationships are expressed through foreign keys and resolved with joins at query time; in a graph database, relationships are stored explicitly as edges and are traversed directly.
 Relational databases require a fixed, predefined schema, whereas graph databases have a flexible structure that can evolve easily.
 Graph databases are preferred for highly interconnected data (for example, social networks and recommendation systems), while relational databases suit well-structured tabular data.


4.15 Schema-Less Databases


Q17. What is a schemaless database?
Ans:
A schemaless database manages information without the need for a blueprint. Building a schemaless database does not require conforming to predefined fields, tables, or data model structures, and there is no Relational Database Management System (RDBMS) to enforce any specific kind of structure. In other words, it's a non-relational database that can handle
any database type, whether that be a key-value store, document store, in-memory, column-
oriented, or graph data model. NoSQL databases’ flexibility is responsible for the rising
popularity of a schemaless approach and is often considered more user-friendly than scaling
a schema or SQL database.

How does a schemaless database work?


With a schemaless database, you don’t need to have a fully-realized vision of what your
data structure will be. Because it doesn’t adhere to a schema, all data saved in a schemaless
database is kept completely intact. A relational database, on the other hand, picks and
chooses what data it keeps, either changing the data to fit the schema, or eliminating it
altogether. Going schemaless allows every bit of detail from the data to remain unaltered
and be completely accessible at any time. For businesses whose operations change according
to real-time data, it’s important to have that untouched data as any of those points can prove
to be integral to how the database is later updated. Without a fixed data structure,
schemaless databases can include or remove data types, tables, and fields without major
repercussions, like complex schema migrations and outages. Because they can withstand sudden changes and parse any data type, schemaless databases are popular in industries that run on real-time data, like financial services, gaming, and social media.
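A minimal sketch, assuming a hypothetical user collection, showing that records in a schemaless store need not share the same fields and can gain new fields at any time:

# Each record keeps whatever detail it arrived with; no table definition is enforced.
users = [
    {"id": 1, "name": "Asha", "email": "asha@example.com"},
    {"id": 2, "name": "Ravi", "phone": "+91-9000000000", "loyalty_points": 120},
    {"id": 3, "name": "Meena", "preferences": {"language": "te", "dark_mode": True}},
]

# A new attribute can appear later without any schema migration or outage.
users.append({"id": 4, "name": "John", "signup_channel": "mobile-app"})

for user in users:
    print(user["name"], "->", sorted(field for field in user if field != "name"))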

4.16 Materialized Views


Q18. What is the purpose of materialized view?
Ans:
When we talked about aggregate-oriented data models, we stressed their advantages: it is useful to have all the data for a single aggregate stored and accessed as a unit. But aggregate-orientation has a corresponding disadvantage: what happens if we want to know how much a particular item has sold over the last couple of weeks? Now the aggregate-orientation works against you, forcing you to potentially read every order in the database to answer the question. You can reduce this burden by building an index, but you're still working against the aggregate structure.

Relational databases have an advantage here because their lack of aggregate structure
allows them to support accessing data in different ways. Furthermore, they provide a
convenient mechanism that allows you to look at data differently from the way it's stored: views. A view is like a relational table, but it is defined by computation over the base tables.
When you access a view, the database computes the data in the view—a handy form of
encapsulation.

Views provide a mechanism to hide from the client whether data is derived data or base
data—but can’t avoid the fact that some views are expensive to compute. To cope with this,
materialized views were invented, which are views that are computed in advance and
cached on disk. Materialized views are effective for data that is read heavily but can stand
being somewhat stale.

Although NoSQL databases don't have views, they may have precomputed and cached queries, and they reuse the term "materialized view" to describe them. Materialized views are also much more of a central aspect for aggregate-oriented databases than for relational systems, since most applications will have to deal with some queries that don't fit well with the aggregate structure.

There are two rough strategies to building a materialized view. The first is the eager
approach where you update the materialized view at the same time you update the base
data for it. This approach is good when you have more frequent reads of the materialized
view than you have writes and you want the materialized views to be as fresh as possible.

If you don't want to pay that overhead on each update, you can run batch jobs to update the materialized views at regular intervals. You can also build materialized views outside of the database by reading the data, computing the view, and saving it back to the database; more often, though, databases support building materialized views themselves.
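A minimal sketch in Python, assuming a hypothetical stream of orders, contrasting the two strategies: an eager view that is updated together with the base data, and a batch job that recomputes the view at intervals:

from collections import defaultdict

orders = []                             # base data
sales_by_item = defaultdict(int)        # eagerly maintained materialized view

def record_order(item, qty):
    orders.append({"item": item, "qty": qty})
    sales_by_item[item] += qty          # eager strategy: update the view on every write

def rebuild_view():
    # Batch strategy: periodically recompute the whole view from the base data.
    view = defaultdict(int)
    for order in orders:
        view[order["item"]] += order["qty"]
    return view

record_order("keyboard", 2)
record_order("mouse", 5)
record_order("keyboard", 1)
print(dict(sales_by_item))              # always fresh, paid for on every write
print(dict(rebuild_view()))             # cheaper writes, possibly stale between runs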

Materialized views can be used within the same aggregate. An order document might
include an order summary element that provides summary information about the order so
that a query for an order summary does not have to transfer the entire order document.
Using different column families for materialized views is a common feature of column-
family databases. An advantage of doing this is that it allows you to update the materialized
view within the same atomic operation.

4.17 CAP Theorem


Q19. What is CAP theorem explain?
Ans:
The CAP theorem, originally introduced as the CAP principle, can be used to explain
some of the competing requirements in a distributed system with replication. It is a tool
used to make system designers aware of the trade-offs while designing networked shared-
data systems.

The three letters in CAP refer to three desirable properties of distributed systems with
replicated data: consistency (among replicated copies), availability (of the system for read
and write operations) and partition tolerance (in the face of the nodes in the system being
partitioned by a network fault).

The CAP theorem states that it is not possible to guarantee all three of the desirable
properties – consistency, availability, and partition tolerance at the same time in a
distributed system with data replication.

The theorem states that networked shared-data systems can only strongly support two
of the following three properties:
 Consistency – Consistency means that the nodes will have the same copies of a replicated data item visible for various transactions. It is a guarantee that every node in a distributed cluster returns the same, most recent, successful write. Consistency refers to every client having the same view of the data. There are various types of consistency models; consistency in CAP refers to sequential consistency, a very strong form of consistency.
 Availability – Availability means that each read or write request for a data item will either be processed successfully or will receive a message that the operation cannot be completed. Every non-failing node returns a response for all read and write requests in a reasonable amount of time. The key word here is "every": in simple terms, every node (on either side of a network partition) must be able to respond in a reasonable amount of time.
 Partition Tolerance – Partition tolerance means that the system can continue
operating even if the network connecting the nodes has a fault that results in two or
more partitions, where the nodes in each partition can only communicate among
each other. That means, the system continues to function and upholds its consistency
guarantees in spite of network partitions. Network partitions are a fact of life.
Distributed systems guaranteeing partition tolerance can gracefully recover from
partitions once the partition heals.
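A minimal sketch in Python of the trade-off: with two replicas and a network partition, a system must either reject the operation (favoring consistency) or accept it on the reachable replica and risk divergent copies (favoring availability). The node names and values are purely illustrative:

class Replica:
    def __init__(self, name):
        self.name = name
        self.value = None

def write(replicas, value, partitioned, prefer_consistency):
    if partitioned and prefer_consistency:
        return "write rejected: not all replicas reachable (CP behavior)"
    # AP behavior (or no partition): update whichever replicas are reachable.
    reachable = replicas[:1] if partitioned else replicas
    for replica in reachable:
        replica.value = value
    return "write accepted on " + ", ".join(r.name for r in reachable)

node1, node2 = Replica("node1"), Replica("node2")
print(write([node1, node2], "v1", partitioned=True, prefer_consistency=True))
print(write([node1, node2], "v1", partitioned=True, prefer_consistency=False))
print("replica values:", node1.value, node2.value)   # divergence means lost consistency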

Generally, it is impossible to sustain all three requirements at the same time. Therefore, according to the CAP theorem, only two out of the three can be guaranteed.
The three combinations are CA, CP, and AP:
 CA—Implies single site cluster in which all nodes communicate with each other.
 CP—Implies all the available data will be consistent or accurate, provided some data
may not be available.
 AP—Implies some data returned may be inconsistent.

ACID Property
ACID (Atomicity, Consistency, Isolation, and Durability) transactions guarantee four
properties:
 Atomicity – Either all the operations in a transaction complete or none of them do. If any part of the transaction fails, the entire transaction fails.
 Consistency – A transaction must leave the database in a consistent state. This ensures that any completed transaction moves the database from one valid state to another valid state.
 Isolation – Transactions will not interfere with each other.
 Durability – Once a transaction completes successfully, its effects are permanent and will not be reversed.


Internal Assessment

MULTIPLE CHOICE QUESTIONS

1. Which ACID property is related to the encapsulation of information? [ c ]


a) Atomicity b) Consistency c) Isolation d) Durability
2. APS stands for: [ d ]
a) Analytical Platform System b) Analytics Performance System
c) Analytics Platform Server d) Analytics Platform System
3. Which of the following is an example of a non-relational database? [ c ]
a) SQL b) ORACLE c) Mongo DB d) SQL SERVER 2012
4. Which of the following models is also known as commodity hardware? [ a ]
a) Appliance model b) Cloud model c) Hybrid model d) Private model
5. Which of the following is a disadvantage of relational databases? [ b ]
a) Concurrency b) Impedance mismatch
c) ACID transactions d) Normalization
6. Which of the following is a disadvantage of NoSQL databases? [ d ]
a) Scalability b) Performance
c) High availability d) Complex transaction support
7. Which of the following is not a NoSQL data model? [ a ]
a) Object relational b) Column-oriented store
c) Document-oriented store d) Graph database
8. Which type of format cannot be stored or retrieved in the document [ c ]
data model?
a) JSON b) XML c) DOC d) BSON
9. Which of the following data models can be used for social network [ d ]
Mining?
a) Document data model b) Schema-less data model
c) Key/value data model d) Graph data model


FILL IN THE BLANKS


1. Appliance is a system used in data centers to optimize ___________ & __________.
(Data storage ,Management)
2. _____________ refers to the encapsulation of information. (Isolation)
3. ___________ does not force a relational model on the stored data. (Big Data)
4. The database that does not integrate in the table/key model is known as a _____.
(non-relational database)
5. A ___________ database is often used to solve a complex problem. (polyglot)
6. Big Data is analyzed to know the present _____________ and make future
predictions. (behavior or trends)
7. Data Management tasks to be created easily in any __________________.
(programming language)
8. The term NoSQL is frequently applied to an _________ phenomenon. (open-source)
9. The values of a single column are stored _________. (contiguously)
10. Graph databases are mainly used for storing _____________________ &
_________________ between these entities. (entities, relationships)

Very Short Questions


Q1. What is Non-relational Database?
Ans:
The database that does not integrate in the table/key model is known as a non-relational
database.

Q2.What is Polyglot persistence?


Ans:
Polyglot persistence is an enterprise storage term used to describe choosing different
data storage/data stores.

Q3. Define Big Data Analysis.


Ans:
Big Data is analyzed to know the present behavior or trends and make future
predictions. Various Big Data solutions are used for extracting useful information from Big
Data.

Q4. Define Column-Oriented Data Model


Ans:
Column-oriented databases are based on columns and every column is considered
individually. The values of a single column are stored contiguously.

Q5. List the examples of Document Data Model.


Ans:
There are many types of document databases, such as XML, JSON, BSON, etc.


PRACTICAL LAB (BIG DATA)

Objective

 Installation and understanding of working of HADOOP


 Understanding of MapReduce program paradigm.
 Writing programs in Python using MapReduce
 Understanding working of Pig, Hive
 Understanding of working of Apache Spark Cluster
 Get familiar with Hadoop distributions, configuring Hadoop and performing
File management tasks
 Experiment with MapReduce in Hadoop frameworks
 Implement MapReduce programs in a variety of applications
 Explore MapReduce support for debugging
 Understand different approaches for building Hadoop MapReduce programs
for real-time applications.



1. Setting up and Installing Hadoop in its two operating modes:
 Pseudo distributed,
 Fully distributed.
Solution:

Steps to Install Hadoop


 Install Java JDK 1.8
 Download Hadoop and extract and place under C drive
 Set Path in Environment Variables
 Config files under Hadoop directory
 Create folder datanode and namenode under data directory
 Edit HDFS and YARN files
 Set Java Home environment in Hadoop environment
 Setup Complete. Test by executing start-all.cmd

Hadoop software can be installed in three modes of operation:


 Stand Alone Mode: Hadoop is distributed software and is designed to run on a cluster of commodity machines. However, we can install it on a single node in stand-alone mode. In this mode, Hadoop runs as a single monolithic Java process. This mode is extremely useful for debugging purposes. You can first test-run your MapReduce application in this mode on small data before actually executing it on a cluster with big data.
 Pseudo Distributed Mode: In this mode also, Hadoop software is installed on a
Single Node. Various daemons of Hadoop will run on the same machine as separate
java processes. Hence all the daemons namely NameNode, DataNode,
SecondaryNameNode, JobTracker, TaskTracker run on single machine.
 Fully Distributed Mode: In Fully Distributed Mode, the daemons NameNode,
JobTracker, SecondaryNameNode (Optional and can be run on a separate node) run
on the Master Node. The daemons DataNode and TaskTracker run on the Slave
Node.

There are two ways to install Hadoop, i.e.


 Single node
 Multi node

Single node cluster means only one DataNode running and setting up all the
NameNode, DataNode, ResourceManager and NodeManager on a single machine.

This is used for studying and testing purposes.


 So for testing whether the Oozie jobs have scheduled all the processes like collecting,
aggregating, storing and processing the data in a proper sequence, we use single
node cluster.
 It can easily and efficiently test the sequential workflow in a smaller environment as
compared to large environments which contains terabytes of data distributed across
hundreds of machines.



Setting up a single node Hadoop cluster
Prerequisites to install Hadoop on windows
 VIRTUAL BOX (For Linux): it is used for installing the operating system on it.
 OPERATING SYSTEM: You can install Hadoop on Windows or Linux based
operating systems. Ubuntu and CentOS are very commonly used.
 JAVA: You need to install the Java 8 package on your system.
 HADOOP: You require Hadoop latest version

Install Java
– Java JDK Link to download
https://www.oracle.com/java/technologies/javase-jdk8-downloads.html
– extract and install Java in C:\Java
– open cmd and type -> javac -version

Download Hadoop
– https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
– extract to C:\Hadoop

– Set the path JAVA_HOME Environment variable


– Set the path HADOOP_HOME Environment variable


Configurations
– Edit file C:/Hadoop-3.3.0/etc/hadoop/core-site.xml, paste the xml code in folder and
save
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
– Rename (If Necessary) “mapred-site.xml.template” to “mapred-site.xml” and edit
this file C:/Hadoop-3.3.0/etc/hadoop/mapred-site.xml, paste xml code and save this
file.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
– Create folder “data” under “C:\Hadoop-3.3.0”
– Create folder “datanode” under “C:\Hadoop-3.3.0\data”
– Create folder “namenode” under “C:\Hadoop-3.3.0\data”



– Edit file C:\Hadoop-3.3.0/etc/hadoop/hdfs-site.xml, paste xml code and save this
file.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/hadoop-3.3.0/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/hadoop-3.3.0/data/datanode</value>
</property>
</configuration>
– Edit file C:/Hadoop-3.3.0/etc/hadoop/yarn-site.xml, paste xml code and save this
file.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>

– Edit file C:/Hadoop-3.3.0/etc/hadoop/hadoop-env.cmd and replace the line
set "JAVA_HOME=%JAVA_HOME%" with set "JAVA_HOME=C:\Java" (the directory where the JDK was installed)

Hadoop Configurations
Download
– https://github.com/brainmentorspvtltd/BigData_RDE/blob/master/Hadoop%20Configuration.zip or (for hadoop 3)
– https://github.com/s911415/apache-hadoop-3.1.0-winutils
– Copy folder bin and replace existing bin folder in C:\Hadoop-3.3.0\bin
– Format the NameNode
– Open cmd and type command "hdfs namenode -format"



Testing
– Open cmd and change directory to C:\Hadoop-3.3.0\sbin
– type start-all.cmd

(Or you can start like this)


– Start namenode and datanode with this command
– type start-dfs.cmd
– Start yarn through this command
– type start-yarn.cmd

Make sure these apps are running


– Hadoop Namenode
– Hadoop datanode
– YARN Resource Manager
– YARN Node Manager

Open: http://localhost:8088



Open: http://localhost:9870

2. Implement the following file management tasks in Hadoop:


 Adding files and directories
 Retrieving files
 Deleting Files
Solution:
Hadoop provides a set of command line utilities that work similarly to the Linux file commands, and serve as your primary interface with HDFS. We're going to have a look into HDFS by interacting with it from the command line. We will take a look at the most common file management tasks in Hadoop, which include:
 Adding files and directories to HDFS
 Retrieving files from HDFS to local filesystem
 Deleting files from HDFS

Adding Files and Directories to HDFS


Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into HDFS first.

Start all the services before we perform any activity.


Dfs services and Yarn
C:\Users\admin>cd\
C:\>cd C:\hadoop-3.3.4\sbin
C:\hadoop-3.3.4\sbin>start-dfs
C:\hadoop-3.3.4\sbin>start-yarn
starting yarn daemons
Note: Don't close the command prompt windows



Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into HDFS first. This directory isn't automatically created for you, though, so let's create it with the mkdir command.
hadoop fs -mkdir /user/chuck
hadoop fs -put example.txt
hadoop fs -put example.txt /user/chuck
Retrieving Files from HDFS
The Hadoop command get copies files from HDFS back to the local filesystem. To retrieve example.txt, we can run the following command:
hadoop fs -get example.txt
To view the contents of the file directly from HDFS, we can use:
hadoop fs -cat example.txt

Deleting Files from HDFS


hadoop fs -rm example.txt
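The same three file management tasks can also be scripted; a minimal sketch in Python, assuming the hadoop command is on the PATH and the HDFS services are already running:

import subprocess

def hdfs(*args):
    # Run one HDFS shell command and raise an error if it fails.
    subprocess.run(["hadoop", "fs", *args], check=True)

hdfs("-mkdir", "-p", "/user/chuck")            # adding a directory
hdfs("-put", "example.txt", "/user/chuck")     # adding a file
hdfs("-get", "/user/chuck/example.txt", ".")   # retrieving the file to the local filesystem
hdfs("-rm", "/user/chuck/example.txt")         # deleting the file from HDFS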

3. Implementation of Word Count Map Reduce program


 Find the number of occurrence of each word appearing in the input file(s)
 Performing a MapReduce Job for word search count (look for specific keywords in
a file)
Solution:
MapReduce is the heart of Hadoop. It is this programming paradigm that allows for
massive scalability across hundreds or thousands of servers in a Hadoop cluster. The
MapReduce concept is fairly simple to understand for those who are familiar with clustered
scale-out data processing solutions.

The term MapReduce actually refers to two separate and distinct tasks that Hadoop
programs perform. The first is the map job, which takes a set of data and converts it into
another set of data, where individual elements are broken down into tuples (key/value
pairs). The reduce job takes the output from a map as input and combines those data tuples
into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job
is always performed after the map job.
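The steps below use a prebuilt MapReduceClient.jar, but the same word-count logic can also be written in Python and run with Hadoop Streaming. A minimal sketch (the file name wordcount.py and the streaming jar path are illustrative and depend on your installation):

#!/usr/bin/env python3
# wordcount.py - run as: python wordcount.py map    (mapper)
#                    or: python wordcount.py reduce (reducer)
# Example submission (the streaming jar path varies by installation):
#   hadoop jar hadoop-streaming-*.jar -files wordcount.py \
#     -mapper "python wordcount.py map" -reducer "python wordcount.py reduce" \
#     -input /input_dir -output /output_dir
import sys

def run_map():
    # Emit (word, 1) for every word read from standard input.
    for line in sys.stdin:
        for word in line.strip().split():
            print(word + "\t1")

def run_reduce():
    # Input arrives grouped and sorted by key, so counts can be summed in one pass.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print(current + "\t" + str(total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(current + "\t" + str(total))

if __name__ == "__main__":
    run_map() if sys.argv[1] == "map" else run_reduce()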

1. Download MapReduceClient.jar
(Link: https://github.com/MuhammadBilalYar/HADOOP-INSTALLATION-ON-WINDOW-10/blob/master/MapReduceClient.jar)
2. Create a text file named input_file.txt with some sample content and place it under C:\
3. Create an input directory in HDFS.
hadoop fs -mkdir /input_dir
4. Copy the input text file named input_file.txt in the input directory (input_dir) of
HDFS.
hadoop fs -put C:/input_file.txt /input_dir
5. Verify input_file.txt available in HDFS input directory (input_dir).
hadoop fs -ls /input_dir/

6. Verify content of the copied file.



hadoop dfs -cat /input_dir/input_file.txt

7. Run MapReduceClient.jar and also provide input and out directories.


hadoop jar C:/MapReduceClient.jar wordcount /input_dir /output_dir

8. Verify content for generated output file.


hadoop dfs -cat /output_dir/*
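For the second part of the task (a word search count that looks only for specific keywords), a small change to the mapper is enough; the reducer from the sketch above can be reused unchanged. The keyword list here is only an example:

#!/usr/bin/env python3
# keyword_mapper.py - emit counts only for the words we are searching for.
import sys

KEYWORDS = {"hadoop", "mapreduce", "hdfs"}      # illustrative search terms

for line in sys.stdin:
    for word in line.strip().lower().split():
        if word in KEYWORDS:
            print(word + "\t1")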


4. Map Reduce Program for Stop word elimination:


 Map Reduce program to eliminate stop words from a large text file.
Solution:
Save the file as SkipMapper.java
package com.hadoop.skipper;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SkipMapper extends Mapper<LongWritable, Text, Text, NullWritable>


{
private Text word = new Text();
private Set<String> stopWordList = new HashSet<String>();
private BufferedReader fis;
/* (non-Javadoc)
* @see
* org.apache.hadoop.mapreduce.Mapper#setup(org.apache.hadoop.mapreduce.
* Mapper.Context) */



@SuppressWarnings("deprecation")
protected void setup(Context context) throws java.io.IOException,InterruptedException
{
try
{
Path[] stopWordFiles = new Path[0];
stopWordFiles = context.getLocalCacheFiles();
System.out.println(stopWordFiles.toString());
if (stopWordFiles != null && stopWordFiles.length > 0)
{
for (Path stopWordFile : stopWordFiles)
{
readStopWordFile(stopWordFile);
}
}
}
catch (IOException e)
{
System.err.println("Exception reading stop word file: " + e);
}
}
/* Method to read the stop word file and get the stop words */
private void readStopWordFile(Path stopWordFile)
{
try
{
fis = new BufferedReader(new FileReader(stopWordFile.toString()));
String stopWord = null;
while ((stopWord = fis.readLine()) != null)
{
stopWordList.add(stopWord);
}
}
catch (IOException ioe)
{
System.err.println("Exception while reading stop word file"+ stopWordFile + ":"
+ ioe.toString());
}
}
/* (non-Javadoc)
* @see org.apache.hadoop.mapreduce.Mapper#map(KEYIN, VALUEIN,
* org.apache.hadoop.mapreduce.Mapper.Context) */



public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException
{
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens())
{
String token = tokenizer.nextToken();
if (stopWordList.contains(token))
{
context.getCounter(StopWordSkipper.COUNTERS.STOPWORDS).increment(1L);
}
else
{
context.getCounter(StopWordSkipper.COUNTERS.GOODWORDS)
.increment(1L);
word.set(token);
context.write(word, null);
}
}
}
}

Save the File as StopWordSkipper.java


package com.hadoop.skipper;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
@SuppressWarnings("deprecation")
public class StopWordSkipper
{
public enum COUNTERS
{
STOPWORDS,



GOODWORDS
}
public static void main(String[] args) throws Exception
{
Configuration conf = new Configuration();
GenericOptionsParser parser = new GenericOptionsParser(conf, args);
args = parser.getRemainingArgs();
Job job = new Job(conf, "StopWordSkipper");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setJarByClass(StopWordSkipper.class);
job.setMapperClass(SkipMapper.class);
job.setNumReduceTasks(0);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
List<String> other_args = new ArrayList<String>();
// Logic to read the location of stop word file from the command line
// The argument after -skip option will be taken as the location of stop word file
for (int i = 0; i < args.length; i++)
{
if ("-skip".equals(args[i])) {
DistributedCache.addCacheFile(new
Path(args[++i]).toUri(),job.getConfiguration());
if (i+1 < args.length) {
i++;
}
else {
break;
}
}
other_args.add(args[i]);
}
FileInputFormat.setInputPaths(job, new Path(other_args.get(0)));
FileOutputFormat.setOutputPath(job, new Path(other_args.get(1)));
job.waitForCompletion(true);
Counters counters = job.getCounters();
System.out.printf("Good Words: %d, Stop Words: %d\n", counters.findCounter
(COUNTERS.GOODWORDS).getValue(),counters.findCounter(COUNTERS.STOPW
ORDS).getValue());
}
}
Create a java project with the above java classes. Add the dependent java
libraries.(Libraries will be present in your hadoop installation). Export the project as a



runnable jar and execute. The file containing the stop words should be present in HDFS. The stop words should be added line by line in the stop word file (one word per line, for example: a, an, the, is). The job can then be run as follows:

hadoop jar <jar-name> -skip <stop-word-file-in hdfs> <input-data-location> <output-location>

5. Install and Run Pig then write Pig Latin scripts to sort, group, join, project, and filter
your data.
Solution:
1. Hadoop Cluster Installation: Apache Pig is a platform build on the top of Hadoop.
You can refer to our previously published article to install a Hadoop single node
cluster on Windows 10.
2. 7zip is needed to extract .tar.gz archives we will be downloading in this guide.
3. Downloading Apache Pig: To download Apache Pig, you should go to the following link: https://downloads.apache.org/pig/

4. After the file is downloaded, we should extract it twice using 7zip (the first time we extract the .tar.gz file, the second time the .tar file). We will extract the Pig folder into the "E:\hadoop-env" directory.
5. Setting Environment Variables: After extracting the Pig archive, we should go to Control Panel > System and Security > System. Then click on "Advanced system settings".



6. In the advanced system settings dialog, click on “Environment variables” button.

7. We should add the following user variables:

PIG_HOME: “E:\hadoop-env\pig-0.17.0”

8. Now, we should edit the Path user variable to add the following paths:
%PIG_HOME%\bin


9. Starting Apache Pig: After setting environment variables, let's try to run Apache Pig.
10. Open a command prompt as administrator, and execute the following command:
pig -version

11. If this command fails with an error (on Windows, Pig often cannot locate the Hadoop configuration scripts), edit the pig.cmd file located in the "pig-0.17.0\bin" directory by changing the HADOOP_BIN_PATH value from "%HADOOP_HOME%\bin" to "%HADOOP_HOME%\libexec", and then run the command again:
pig -version

The simplest way to write Pig Latin statements is using the Grunt shell, which is an interactive tool where we write a statement and get the desired output. There are two modes to invoke the Grunt shell:
 Local: All scripts are executed on a single machine without requiring Hadoop. (command: pig -x local)
 MapReduce: Scripts are executed on a Hadoop cluster. (command: pig -x mapreduce)



List the relational operators in Pig.
All Pig Latin statements operate on relations (and operators are called relational
operators). Different relational operators in Pig Latin are:
– COGROUP: Joins two or more tables and then perform GROUP operation on the
joined table result.
– CROSS: CROSS operator is used to compute the cross product (Cartesian product)
of two or more relations.
– DISTINCT: Removes duplicate tuples in a relation.
– FILTER: Select a set of tuples from a relation based on a condition.
– FOREACH: Iterate the tuples of a relation, generating a data transformation.
– GROUP: Group the data in one or more relations.
– JOIN: Join two or more relations (inner or outer join).
– LIMIT: Limit the number of output tuples.
– LOAD: Load data from the file system.
– ORDER: Sort a relation based on one or more fields.
– SPLIT: Partition a relation into two or more relations.
– STORE: Store data in the file system.
– UNION: Merge the content of two relations. To perform a UNION operation on two
relations, their columns and domains must be identical.

6. Write a Pig Latin script for finding TF-IDF value for book dataset (A corpus of eBooks
available at: Project Gutenberg)
Solution:
import math

# Toy corpus of (documentId, text) pairs; for the book dataset these lines would be loaded from the eBook files.
data=[(1,'i love dogs'),(2,"i hate dogs and knitting"),(3,"knitting is my hobby and my passion")]
lines=sc.parallelize(data)                       # sc is the SparkContext provided by the pyspark shell

# ((docId, word), 1) for every word occurrence
map1=lines.flatMap(lambda x: [((x[0],i),1) for i in x[1].split()])
# term frequency: number of occurrences of each word in each document
reduce1=map1.reduceByKey(lambda x,y:x+y)
# (word, (docId, tf))
tf=reduce1.map(lambda x: (x[0][1],(x[0][0],x[1])))

# (word, 1) once for every document that contains the word
map3=reduce1.map(lambda x: (x[0][1],1))
# number of documents containing each word
reduce2=map3.reduceByKey(lambda x,y:x+y)
# inverse document frequency: log10(total documents / documents containing the word)
idf=reduce2.map(lambda x: (x[0],math.log10(len(data)/x[1])))

# join TF and IDF on the word and compute TF-IDF = TF * IDF
rdd=tf.join(idf)
rdd=rdd.map(lambda x: (x[1][0][0],(x[0],x[1][0][1],x[1][1],x[1][0][1]*x[1][1]))).sortByKey()
rdd=rdd.map(lambda x: (x[0],x[1][0],x[1][1],x[1][2],x[1][3]))
rdd.toDF(["DocumentId","Token","TF","IDF","TF-IDF"]).show()

7. Install and Run Hive then use Hive to create, alter, and drop databases, tables, views,
functions, and indexes.
Solution:
1. Prerequisites
7zip
In order to extract tar.gz archives, you should install the 7zip tool.
Installing Hadoop
To install Apache Hive, you must have a Hadoop Cluster installed and running: You can
refer to our previously published step-by-step guide to install Hadoop 3.2.1 on Windows 10.

Apache Derby
In addition, Apache Hive requires a relational database to create its Metastore (where all metadata will be stored). In this guide, we will use the Apache Derby database.

Since we have Java 8 installed, we must install Apache Derby version 10.14.2.0, which can be downloaded from the following link:
https://downloads.apache.org//db/derby/db-derby-10.14.2.0/db-derby-10.14.2.0-bin.tar.gz

Once downloaded, we must extract twice (using 7zip: the first time we extract the .tar.gz
file, the second time we extract the .tar file) the content of the db-derby-10.14.2.0-bin.tar.gz
archive into the desired installation directory. Since in the previous guide we have installed
Hadoop within “E:\hadoop-env\hadoop-3.3.0\” directory, we will extract Derby into
“E:\hadoop-env\db-derby-10.14.2.0\” directory.

Cygwin
Since some Hive 3.1.2 tools (such as schematool) aren't compatible with Windows, we will need the Cygwin tool to run some Linux commands.

2. Downloading Apache Hive binaries


In order to download the Apache Hive binaries, you should go to the following website: https://downloads.apache.org/hive/hive-3.1.2/. Then, download the apache-hive-3.1.2-bin.tar.gz file.

When the file download is complete, we should extract twice (as mentioned above) the apache-hive-3.1.2-bin.tar.gz archive into the "E:\hadoop-env\apache-hive-3.1.2" directory.


3. Setting environment variables


After extracting Derby and Hive archives, we should go to Control Panel > System and
Security > System. Then Click on “Advanced system settings”.

In the advanced system settings dialog, click on “Environment variables” button.

Now we should add the following user variables:



 HIVE_HOME: “E:\hadoop-env\apache-hive-3.1.2\”
 DERBY_HOME: “E:\hadoop-env\db-derby-10.14.2.0\”
 HIVE_LIB: “%HIVE_HOME%\lib”
 HIVE_BIN: “%HIVE_HOME%\bin”
 HADOOP_USER_CLASSPATH_FIRST: “true”

Besides, we should add the following system variable:


HADOOP_USER_CLASSPATH_FIRST: “true”
Now, we should edit the Path user variable to add the following paths:
%HIVE_BIN%
%DERBY_HOME%\bin

4.Configuring Hive
Copy Derby libraries
Now, we should go to the Derby libraries directory (E:\hadoop-env\db-derby-
10.14.2.0\lib) and copy all *.jar files.

Then, we should paste them within the Hive libraries directory (E:\hadoop-env\apache-
hive-3.1.2\lib).


Configuring hive-site.xml
Now, we should go to the Apache Hive configuration directory (E:\hadoop-env\apache-hive-3.1.2\conf) and create a new file "hive-site.xml". We should paste the Metastore connection configuration XML code within this file.

5. Starting Services
Hadoop Services
To start Apache Hive, open the command prompt utility as administrator. Then, start the
Hadoop services using start-dfs and start-yarn commands (as illustrated in the Hadoop
installation guide).



Derby Network Server
Then, we should start the Derby network server on the localhost using the following
command:
E:\hadoop-env\db-derby-10.14.2.0\bin\StartNetworkServer -h 0.0.0.0

6. Starting Apache Hive


Now, let's try to open a command prompt, go to the Hive binaries directory (E:\hadoop-env\apache-hive-3.1.2\bin), and execute the following command:
hive
We will receive the following error:
'hive' is not recognized as an internal or external command, operable program or batch file.

This error is thrown since the Hive 3.x version is not built for Windows (only in some
Hive 2.x versions). To get things working, we should download the necessary *.cmd files
from the following link: https://svn.apache.org/repos/asf/hive/trunk/bin/. Note that you should keep the folder hierarchy (bin\ext\util). You can download all *.cmd files from the following GitHub repository:
https://github.com/HadiFadl/Hive-cmd

7. Initializing Hive
After ensuring that Apache Hive has started successfully, we may still not be able to run any HiveQL command. This is because the Metastore is not initialized yet, and the HiveServer2 service must also be running.

To initialize the Metastore, we need to use the schematool utility, which is not compatible with Windows. To solve this problem, we will use the Cygwin utility, which allows executing Linux commands from Windows.

Creating symbolic links


First, we need to create the following directories:
 E:\cygdrive
 C:\cygdrive
Now, open the command prompt as administrator and execute the following
commands:
 mklink /J E:\cygdrive\e\ E:\
 mklink /J C:\cygdrive\c\ C:\
These symbolic links are needed to work with Cygwin utility properly since Java may
cause some problems.



Initializing Hive Metastore
Open Cygwin utility and execute the following commands to define the environment
variables:
export HADOOP_HOME='/cygdrive/e/hadoop-env/hadoop-3.2.1'
export PATH=$PATH:$HADOOP_HOME/bin
export HIVE_HOME='/cygdrive/e/hadoop-env/apache-hive-3.1.2'
export PATH=$PATH:$HIVE_HOME/bin
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HIVE_HOME/lib/*.jar

We can add these lines to the "~/.bashrc" file so that we don't need to write them each time we open Cygwin. Now, we should use the schematool utility to initialize the Metastore:
$HIVE_HOME/bin/schematool -dbType derby -initSchema
Starting HiveServer2 service
Now, open a command prompt and run the following command:
hive --service hiveserver2 start
We should leave this command prompt open, and open a new one where we should
start Apache Hive using the following command:
hive

Starting WebHCat Service (Optional)


In the project we are working on, we need to execute HiveQL statement from SQL Server
Integration Services which can access Hive from the WebHCat server.

To start the WebHCat server, we should open the Cygwin utility and execute the
following command:
$HIVE_HOME/hcatalog/sbin/webhcat_server.sh start

Databases, Tables, Views, Functions and Indexes


Databases in Hive
The Hive concept of a database is essentially just a catalog or namespace of tables. However,
they are very useful for larger clusters with multiple teams and users, as a way of avoiding
table name collisions. It‟s also common to use databases to organize production tables into
logical groups.
hive> CREATE DATABASE financials;
Hive will throw an error if financials already exists.
hive> CREATE DATABASE IF NOT EXISTS financials;

hive> SHOW DATABASES;


default
financials

Drop Database
Drop the database by using the following command.
hive> drop database financials;
check whether the database is dropped or not.
hive> show databases;



default
Creating Tables
The CREATE TABLE statement follows SQL conventions, but Hive's version offers significant extensions to support a wide range of flexibility in where the data files for tables are stored, the formats used, and so on.

CREATE TABLE IF NOT EXISTS mydb.employees (


name STRING COMMENT 'Employee name',
salary FLOAT COMMENT 'Employee salary',
subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
deductions MAP<STRING, FLOAT> COMMENT 'Keys are deductions names, values
are percentages',
address STRUCT<street:STRING, city:STRING,
state:STRING, zip:INT> COMMENT 'Home address')
COMMENT 'Description of the table'
TBLPROPERTIES ('creator'='me', 'created_at'='2023-01-02 10:00:00', ...)
LOCATION '/user/hive/warehouse/mydb.db/employees';

hive> USE mydb;


hive> SHOW TABLES;
employees
table1
table2

Load Data
Once the internal table has been created, the next step is to load the data into it. So, in
Hive, we can easily load data from any file to the database.

Let's load the data of the file into the database by using the following command: -
load data local inpath '/home/codegyani/hive/emp_details' into table demo.employee;

Here, emp_details is the file name that contains the data.


select * from demo.employee;
Drop Table
Hive facilitates us to drop a table by using the SQL drop table command. Let's follow the
below steps to drop the table from the database.
hive> drop table employee;

Alter Table
Most table properties can be altered with ALTER TABLE statements, which change
metadata about the table but not the data itself. These statements can be used to fix mistakes
in schema, move partition locations (as we saw in External Partitioned Tables), and do other
operations.



Renaming a Table
Use this statement to rename the table log_messages to logmsgs:
ALTER TABLE log_messages RENAME TO logmsgs;
Adding, Modifying, and Dropping a Table Partition
As we saw previously, ALTER TABLE table ADD PARTITION … is used to add a new
partition to a table (usually an external table). Here we repeat the same command shown
previously with the additional options available:

ALTER TABLE log_messages ADD IF NOT EXISTS


PARTITION (year = 2011, month = 1, day = 1) LOCATION '/logs/2011/01/01'
PARTITION (year = 2011, month = 1, day = 2) LOCATION '/logs/2011/01/02'
PARTITION (year = 2011, month = 1, day = 3) LOCATION '/logs/2011/01/03'...;

Multiple partitions can be added in the same query when using Hive v0.8.0 and later. As
always, IF NOT EXISTS is optional and has the usual meaning.

Changing Columns
You can rename a column, change its position, type, or comment:
ALTER TABLE log_messages
CHANGE COLUMN hms hours_minutes_seconds INT
COMMENT 'The hours, minutes, and seconds part of the timestamp'
AFTER severity;

View
Views are similar to tables and are generated based on requirements.
• We can save any result set data as a view in Hive
• Usage is similar to views in SQL
• All types of DML operations can be performed on a view

Creation of View:
Create VIEW < VIEWNAME> AS SELECT
Example:
Hive>Create VIEW Sample_View AS SELECT * FROM employees WHERE
salary>25000
In this example, we are creating view Sample_View where it will display all the row
values with salary field greater than 25000.

8. Install, Deploy & configure Apache Spark Cluster. Run apache spark applications
using Scala.
Solution:
Apache Spark is an open-source framework that processes large volumes of stream data
from multiple sources. Spark is used in distributed computing with machine learning
applications, data analytics, and graph-parallel processing.

Prerequisites
 A system running Windows 10



 A user account with administrator privileges (required to install software, modify
file permissions, and modify system PATH)
 Command Prompt or Powershell
 A tool to extract .tar files, such as 7-Zip

Install Apache Spark on Windows


Installing Apache Spark on Windows 10 may seem complicated to novice users, but this
simple tutorial will have you up and running. If you already have Java 8 and Python 3
installed, you can skip the first two steps.

Step 1: Install Java 8


i. Apache Spark requires Java 8. You can check to see if Java is installed using the
command prompt.
ii. Open the command line by clicking Start > type cmd > click Command Prompt.
iii. Type the following command in the command prompt:
java -version
If you don’t have Java installed:
 Open a browser window, and navigate to https://java.com/en/download/.
Step 2: Install Python
i. To install the Python package manager, navigate to https://www.python.org/ in
your web browser.
ii. Mouse over the Download menu option and click Python 3.8.3. 3.8.3 is the latest
version at the time of writing the article.
iii. Once the download finishes, run the file.

iv. Near the bottom of the first setup dialog box, check off Add Python 3.8 to PATH.
Leave the other box checked.
v. Next, click Customize installation.
vi. You can leave all boxes checked at this step, or you can uncheck the options you do
not want.
vii. Click Next.
viii. Select the box Install for all users and leave other boxes as they are.
ix. Under Customize install location, click Browse and navigate to the C drive. Add a
new folder and name it Python.
x. Select that folder and click OK.



xi. Click Install, and let the installation complete.
xii. When the installation completes, click the Disable path length limit option at the
bottom and then click Close.
xiii. If you have a command prompt open, restart it. Verify the installation by checking
the version of Python:
python --version
Step 3: Download Apache Spark
i. Open a browser and navigate to https://spark.apache.org/downloads.html.
ii. Under the Download Apache Spark heading, there are two drop-down menus. Use
the current non-preview version.
 In our case, in Choose a Spark release drop-down menu select 2.4.5 (Feb 05 2020).
 In the second drop-down Choose a package type, leave the selection Pre-built for
Apache Hadoop 2.7.
iii. Click the spark-2.4.5-bin-hadoop2.7.tgz link.

iv. A page with a list of mirrors loads where you can see different servers to download
from. Pick any from the list and save the file to your Downloads folder.
Step 4: Verify Spark Software File
i. Verify the integrity of your download by checking the checksum of the file. This
ensures you are working with unaltered, uncorrupted software.
ii. Navigate back to the Spark Download page and open the Checksum link, preferably
in a new tab.
iii. Next, open a command line and enter the following command:
certutil -hashfile c:\users\username\Downloads\spark-2.4.5-bin-hadoop2.7.tgz SHA512
iv. Change the username to your username. The system displays a long alphanumeric
code, along with the message Certutil: -hashfile completed successfully.

v. Compare the code to the one you opened in a new browser tab. If they match, your
download file is uncorrupted.



Step 5: Install Apache Spark
Installing Apache Spark involves extracting the downloaded file to the desired location.
i. Create a new folder named Spark in the root of your C: drive. From a command line,
enter the following:
cd \
mkdir Spark
ii. In Explorer, locate the Spark file you downloaded.
iii. Right-click the file and extract it to C:\Spark using the tool you have on your system
(e.g., 7-Zip).
iv. Now, your C:\Spark folder has a new folder spark-2.4.5-bin-hadoop2.7 with the
necessary files inside.
Step 6: Add winutils.exe File
Download the winutils.exe file for the underlying Hadoop version for the Spark
installation you downloaded.
i. Navigate to this URL https://github.com/cdarlint/winutils and inside the bin
folder, locate winutils.exe, and click it.

ii. Find the Download button on the right side to download the file.
iii. Now, create new folders Hadoop and bin on C: using Windows Explorer or the
Command Prompt.
iv. Copy the winutils.exe file from the Downloads folder to C:\hadoop\bin.
Step 7: Configure Environment Variables
Configuring environment variables in Windows adds the Spark and Hadoop locations to
your system PATH. It allows you to run the Spark shell directly from a command prompt
window.
i. Click Start and type environment.
ii. Select the result labeled Edit the system environment variables.
iii. A System Properties dialog box appears. In the lower-right corner, click Environment
Variables and then click New in the next window.



iv. For Variable Name type SPARK_HOME.
v. For Variable Value type C:\Spark\spark-2.4.5-bin-hadoop2.7 and click OK. If you
changed the folder path, use that one instead.

vi. In the top box, click the Path entry, then click Edit. Be careful with editing the system
path. Avoid deleting any entries already on the list.

vii. You should see a box with entries on the left. On the right, click New.
viii. The system highlights a new line. Enter the path to the Spark folder C:\Spark\spark-
2.4.5-bin-hadoop2.7\bin. We recommend using %SPARK_HOME%\bin to avoid
possible issues with the path.

ix. Repeat this process for Hadoop and Java.


 For Hadoop, the variable name is HADOOP_HOME and for the value use the
path of the folder you created earlier: C:\hadoop. Add C:\hadoop\bin to the
Path variable field, but we recommend using %HADOOP_HOME%\bin.
 For Java, the variable name is JAVA_HOME and for the value use the path to
your Java JDK directory (in our case it‟s C:\Program Files\Java\jdk1.8.0_251).
x. Click OK to close all open windows.
Step 8: Launch Spark
i. Open a new command-prompt window using the right-click and Run as
administrator:
ii. To start Spark, enter:
C:\Spark\spark-2.4.5-bin-hadoop2.7\bin\spark-shell
If you set the environment path correctly, you can type spark-shell to launch Spark.
iii. The system should display several lines indicating the status of the application. You
may get a Java pop-up. Select Allow access to continue.



Finally, the Spark logo appears, and the prompt displays the Scala shell.

iv. Open a web browser and navigate to http://localhost:4040/.


v. You can replace localhost with the name of your system.
vi. You should see an Apache Spark shell Web UI. The example below shows the
Executors page.

vii. To exit Spark and close the Scala shell, press ctrl-d in the command-prompt window.
Test Spark
We will launch the Spark shell and use Scala to read the contents of a file. You can use an
existing file, such as the README file in the Spark directory, or you can create your own.
We created pnaptest with some text.
i. Open a command-prompt window and navigate to the folder with the file you want
to use and launch the Spark shell.
ii. First, state a variable to use in the Spark context with the name of the file. Remember
to add the file extension if there is any.
val x =sc.textFile("pnaptest")
iii. The output shows an RDD is created. Then, we can view the file contents by using
this command to call an action:
x.take(11).foreach(println)


iv. This command instructs Spark to print 11 lines from the file you specified. To
perform an action on this file (value x), add another value y, and do a map
transformation.
v. For example, you can print the characters in reverse with this command:
val y = x.map(_.reverse)
vi. The system creates a child RDD in relation to the first one. Then, specify how many
lines you want to print from the value y:
y.take(11).foreach(println)
