Data Analytics - Unit - 5
UNIT 5
Frame Works and Visualization
PART-1
Frame Works and Visualization : MapReduce, Hadoop
Questions-Answers
Que. Write short note on MapReduce.
Answer
1. MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters in a reliable manner.
2. MapReduce is a processing technique and a program model for distributed computing based on Java.
3. The MapReduce paradigm provides the means to break a large task into smaller tasks, run the tasks in parallel, and consolidate the outputs of the individual tasks into the final output.
4. MapReduce consists of two basic parts :
i. Map :
a. Applies an operation to a piece of data.
b. Provides some intermediate output.
ii. Reduce :
a. Consolidates the intermediate outputs from the map steps.
b. Provides the final output.
5. In a MapReduce program, Map() and Reduce() are two functions.
a. The Map function performs actions like filtering, grouping and sorting.
b. While the Reduce function aggregates and summarizes the result produced by the Map function.
c. The result generated by the Map function is a key-value pair (K, V) which acts as the input for the Reduce function.

Que 5.3. What are the activities that are required for executing MapReduce job ?
Answer
Executing a MapReduce job requires the management and coordination of several activities :
1. MapReduce jobs need to be scheduled based on the system's workload.
2. Jobs need to be monitored and managed to ensure that any encountered errors are properly handled so that the job continues to execute if the system partially fails.
3. Input data needs to be spread across the cluster.
4. Map step processing of the input needs to be conducted across the distributed system, preferably on the same machines where the data resides.
5. Intermediate outputs from the numerous map steps need to be collected and provided to the proper machines for the reduce step execution.
6. Final output needs to be made available for use by another user, another application, or perhaps another MapReduce job.

PART-2
Pig, Hive.
Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 5.4. Write short note on data access component of Hadoop system.
Answer
Data access components of Hadoop system are :
a. Pig (Apache Pig) :
1. Apache Pig is a high level language platform for analyzing and querying huge datasets that are stored in HDFS.
2. Apache Pig uses Pig Latin language which is similar to SQL.
3. It loads the data, applies the required filters and dumps the data in the required format.
4. For program execution, Pig requires Java run time environment.
5. Apache Pig consists of a data flow language and an environment to execute the Pig code.
6. The main benefit of using Pig is to utilize the power of MapReduce in a distributed system, while simplifying the tasks of developing and executing a MapReduce job.
7. Pig provides for the execution of several common data management and manipulations, such as inner and outer joins between two or more files (tables).
b. Hive :
1. HIVE is a data warehousing component which performs reading, writing and managing large datasets in a distributed environment using an SQL-like interface.
HIVE + SQL = HQL
2. The query language of Hive is called Hive Query Language (HQL), which is very similar to SQL.
3. It has two basic components :
i. Hive Command line : The Hive Command line interface is used to execute HQL commands.
ii. ...
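The Map and Reduce functions described in the MapReduce note of Part-1 can be illustrated with a small word count. This is a minimal single-process sketch in plain Python, not Hadoop code; the names map_phase, shuffle, reduce_phase and run_job are illustrative assumptions and do not come from any Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit an intermediate (key, value) pair for every word."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Collect intermediate values by key, as done between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: consolidate all values of one key into a final output pair."""
    return key, sum(values)


def run_job(lines):
    """Run map on every input line, group by key, then reduce each group."""
    intermediate = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["big data big cluster", "data node"]))
```

In a real cluster the map calls would run on the machines holding the input data and the grouped pairs would be sent to the reducers; here everything runs in one process only to show the (K, V) flow.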
Fig. 5.5.1. Hive architecture (blocks shown : Hive server, CLI, Hive web UI, Hive services, Hive driver, Metastore, MapReduce, HDFS).
Hive client : Hive allows writing applications in various languages, including Java, Python, and C++. It supports different types of clients such as :
1. Thrift Server : It is a cross-language service provider platform that serves the requests from all those programming languages that support Thrift.
2. JDBC Driver : It is used to establish a connection between Hive and Java applications. The JDBC Driver is present in the class org.apache.hadoop.hive.jdbc.HiveDriver.
5. Hive driver :
a. It receives queries from different sources like web UI, CLI, Thrift, and JDBC/ODBC driver.
b. It transfers the queries to the compiler.
6. Hive compiler :
a. The purpose of the compiler is to parse the query and perform semantic analysis on the different query blocks and expressions.
b. It converts HiveQL statements into MapReduce jobs.
7. Hive execution engine :
a. Optimizer generates the logical plan in the form of DAG of MapReduce tasks and HDFS tasks.
b. In the end, the execution engine executes the incoming tasks in the order of their dependencies.

Que 5.6. What are the conditions for using Hive ?
Answer
Hive is used when the following conditions exist :
1. Data easily fits into a table structure.
2. Data is already in HDFS.
3. Developers are comfortable with SQL programming and queries.
4. There is a desire to partition datasets based on time.
5. Batch processing is acceptable.
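The statement that the Hive execution engine runs the tasks of the DAG "in the order of their dependencies" can be pictured with a topological sort. The sketch below uses Python's standard graphlib module (Python 3.9 or later); the task names and the execution_order helper are hypothetical and are not part of Hive.

```python
from graphlib import TopologicalSorter


def execution_order(dependencies):
    """Return the tasks ordered so that each task follows its prerequisites.

    `dependencies` maps a task name to the set of tasks it depends on.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical plan: a map task, a reduce task that needs the map output,
# and an HDFS task that moves the reduce output to its final location.
plan = {
    "map_scan": set(),
    "reduce_group": {"map_scan"},
    "hdfs_move": {"reduce_group"},
}

order = execution_order(plan)
print(order)
```

A cycle in the plan would make TopologicalSorter raise graphlib.CycleError, which matches the requirement that the plan be a DAG.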
Schema is optional. | Schema is mandatory.
3. It uses scan-centric analytic workload. | It uses OLTP (Online Transaction Processing) workload.
5. Limited query optimization. | Significant opportunity for query optimization.

Que 5.9. What are the advantages and features of Apache Pig (or Pig).
Answer
Advantage of Apache Pig :
1. Pig Latin language is easy to program.
2. It decreases the development time.

Application of Apache Pig :
1. It is used to process huge data sources like web logs, streaming online data etc.
2. It supports ad hoc queries across large datasets.
3. Used to perform data processing in search platforms.
4. It is also used to process time sensitive data loads.
5. Apache Pig is generally used by data scientists for performing tasks like ad-hoc processing and quick prototyping.

PART-3
HBase, MapR, Sharding, NoSQL Databases.
Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 5.11. What is HBase ? Discuss architecture of HBase data model.
Answer
1. It is an open source, distributed database written in Java.
2. HBase is an essential part of Hadoop ecosystem. It runs on top of HDFS (Hadoop Distributed File System).
3. It can store massive amounts of data from terabytes to petabytes.
4. It is column oriented and horizontally scalable.
HBase architecture :
Fig. 5.11.1. HBase architecture (blocks shown : Client, HMaster, Zookeeper, Region servers each holding several Regions, HDFS).
HBase architecture has three main components :
1. HMaster :
a. The implementation of master server in HBase is HMaster.
b. It is the process that assigns regions to region servers and handles DDL (create, delete table) operations.
c. It monitors all region server instances present in the cluster.
2. Region server :
a. HBase tables are divided horizontally by row key range into regions.
b. Regions are the basic building elements of HBase cluster that consist of the distribution of tables and are comprised of column families.
c. Region server runs on HDFS data node which is present in Hadoop cluster.
d. The regions of a region server are responsible for several things, like handling, managing and executing read and write HBase operations on that set of regions. The default size of a region is 256 MB.
3. Zookeeper :
a. It is like a coordinator in HBase.
b. It provides services like maintaining configuration information, naming, providing distributed synchronization, server failure notification etc.
c. Clients communicate with region servers via zookeeper.

Que 5.12. Write the features of HBase.
Answer
Features of HBase :
1. It is linearly scalable across various nodes as well as modularly scalable, as it is divided across various nodes.
2. HBase provides consistent reads and writes.
3. It provides atomic read and write, which means that during one read or write process, all other processes are prevented from performing any read or write operations.
4. It provides easy to use Java API for client access.
5. It supports Thrift and REST API for non-Java front ends which supports XML, Protobuf and binary data encoding options.
6. It supports a Block Cache and Bloom Filters for real-time queries and for high volume query optimization.
7. HBase provides automatic failure support between region servers.
8. It supports exporting metrics with the Hadoop metrics subsystem to files.
9. It does not enforce relationship within data.
10. It is a platform for storing and retrieving data with random access.
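The idea that HBase tables are divided horizontally by row key range into regions, each served by a region server, can be sketched with a sorted list of split keys. This is only an illustration in plain Python; the RegionTable class, the split keys and the server names are assumptions made for the example and are not HBase client API.

```python
import bisect


class RegionTable:
    """Maps a row key to the region (and server) whose key range contains it."""

    def __init__(self, split_keys, servers):
        # split_keys: sorted row keys at which a new region starts.
        # servers: one server name per region, so one more than split_keys.
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need exactly one server per region")
        self.split_keys = list(split_keys)
        self.servers = list(servers)

    def locate(self, row_key):
        """Return (region index, server name) for the given row key."""
        index = bisect.bisect_right(self.split_keys, row_key)
        return index, self.servers[index]


# Three regions: keys below "g", keys from "g" up to "p", keys from "p" on.
table = RegionTable(["g", "p"], ["rs1", "rs2", "rs3"])
print(table.locate("apple"), table.locate("mango"), table.locate("zebra"))
```

Because the split keys are kept sorted, a lookup is a binary search, which is one reason range partitioning by row key supports fast random access.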
Que 5.13. Define sharding and explain the techniques for sharding.
Answer
1. Sharding is a type of database partitioning that splits very large databases into smaller, faster and more easily managed parts.
2. A database shard is a horizontal partition of data in a database or search engine. Each individual partition is referred to as a shard or database shard. Each shard is held on a separate database server instance, to spread load.
Following are the various techniques to apply sharding :
1. Use a key value to shard our data :
a. In this method user may use different locations to store data with the help of key value pair.
b. All data can be easily accessed by the key of that data.
c. It makes it easy to store data irrespective of its storage location.
2. Use load balancing to shard our data :
a. Database can take individual decisions for storing data in different locations.
b. Large shards can also be split into smaller shards, a decision that is reframed by the database itself.
3. Hash the key :
a. In this method, keys can be arranged by hashing their values.
b. All assignments can be hashed to store all documents.
c. Consistent hashing assigns documents with a particular key value to one of the servers in a hash ring.

Sharding helps in mitigating the impact of outages. With a sharded database, an outage is likely to affect only a single shard.
Drawback of sharding :
1. It is quite complex to implement a sharded database architecture.
2. The shards often become unbalanced.
3. Once a database has been sharded, it can be very difficult to return it to its unsharded architecture.
4. Sharding is not supported by every database engine.

Que 5.15. Write short notes on NoSQL database with its advantages.
Answer
1. NoSQL databases are non tabular, and store data differently than relational tables.
2. NoSQL databases come in a variety of types based on their data model.
3. The main types are document, key-value, wide-column, and graph.
4. They provide flexible schemas and scale easily with large amounts of data and high user loads.
5. NoSQL data models allow related data to be nested within a single data structure.
Advantage of NoSQL database :
1. Cheap and easy to implement.
2. Data are replicated to multiple nodes and can be partitioned.
3. Easy to distribute.
4. Do not require a schema.
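The "hash the key" technique from the sharding answer above, where consistent hashing assigns a key to one of the servers in a hash ring, can be sketched as follows. This is a minimal illustration using only the Python standard library; the HashRing class, the use of MD5, the number of virtual points per server and the server names are all assumptions chosen for the example.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent hashing: each key goes to the next server point on the ring."""

    def __init__(self, servers, replicas=100):
        if not servers:
            raise ValueError("at least one server is required")
        ring = []
        for server in servers:
            # Several virtual points per server spread the load more evenly.
            for i in range(replicas):
                ring.append((_hash(f"{server}#{i}"), server))
        ring.sort()
        self._points = [point for point, _ in ring]
        self._owners = [server for _, server in ring]

    def server_for(self, key):
        """Return the server owning the first ring point at or after the key."""
        index = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owners[index]


ring = HashRing(["shard-a", "shard-b", "shard-c"])
print(ring.server_for("customer:42"))
```

The benefit over a plain `hash(key) % number_of_servers` scheme is that removing or adding a server only moves the keys that belonged to that server's points; keys owned by the other servers stay where they are.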
Que 5.16. Explain the benefits of NoSQL database.
Data distribution : NoSQL supports distributed systems.
Reliability : NoSQL databases ensure high availability and uptime with native replication and built-in failover for self-healing, resilient database clusters.
Flexibility : NoSQL databases are better at allowing users to test new ideas and update data structures. For example, MongoDB stores data in flexible, JSON-like documents, meaning fields can vary from document to document and the data structures can be easily changed over time, as application requirements evolve.

Que 5.17. What are the types of NoSQL databases ?
Answer
a. Graph databases store data in nodes and edges.
b. Nodes typically store information about people, places, and things while edges store information about the relationships between the nodes.
c. Graph databases are commonly used when we need to traverse relationships to look for patterns such as social networks, fraud detection, and recommendation engines.
d. Neo4j and JanusGraph are examples of graph databases.

Que 5.18. Differentiate between SQL and NoSQL.
Answer
3. In visual data exploration, the visualizations of the data allow the user to gain insight into the data and come up with new hypotheses.
b. These patterns are visualized to make them interpretable for the data analyst.
Fig. 5.23.2. Subsequent visualization.
c. Subsequent visualizations enable the data analyst to specify feedbacks. Based on the visualization, the data analyst may want to return to the data-mining algorithm and use different input parameters to obtain better results.
3. Tightly Integrated Visualization (TIV) :
a. An automatic data-mining algorithm performs an analysis of the data but does not produce the final results.
b. A visualization technique is used to present the intermediate results of the data exploration process.
Fig. 5.23.3. Tightly integrated visualization (TIV) (blocks shown : Data, DM-algorithm step 1, DM-algorithm step n, Visualization, Interaction, Result, Knowledge).
c. The combination of some automatic data-mining algorithms and visualization techniques enables specified user feedback for the next data-mining run. Then, the data analyst identifies the interesting patterns in the visualization of the intermediate results based on his domain knowledge.

Que 5.24. What is the difference between data visualization and data analytics ?
Answer
Used for : The goal of data visualization is to communicate information clearly and efficiently to users by presenting it visually. Data analytics helps the business to make more informed decisions by analysing the data.
Relation : Data visualization helps to get better perception.
Industries : Data visualization technologies and techniques are widely used in finance, banking, healthcare, retailing etc.
Tools : DataHero, Tableau, Dygraphs, QlikView, ZingChart etc.
a. Line charts : It is a type of chart which displays information as a series of data points called markers connected by straight line segments.
b. Bar charts : It represents the categorical data with rectangular bars of heights and lengths proportional to the values they represent.
c. Scatter charts : It is a type of plot or mathematical diagram that displays values for typically two variables for a set of data using Cartesian coordinates.
d. Pie charts : It is a circular statistical graph which is divided into slices to illustrate numerical proportion.
2. Geometrically-transformed display technique :
a. Geometrically-transformed display techniques aim at finding "interesting" transformations of multi-dimensional data sets.
b. The class of geometric display methods includes techniques from exploratory statistics such as scatter plot matrices and a class of techniques that attempt to locate projections that satisfy some computable quality of interestingness.
c. Geometric projection techniques include : prosection views, only user-selected ...

Stacked display techniques :
a. Stacked display techniques are tailored to present data partitioned in a hierarchical fashion.
b. In the case of multi-dimensional data, the data dimensions to be used for partitioning the data and building the hierarchy have to be selected appropriately.
c. An example of a stacked display technique is dimensional stacking.
d. The basic idea is to embed one coordinate system inside another coordinate system, i.e., two attributes form the outer coordinate system, two other attributes are embedded into the outer coordinate system, and so on.
e. The display is generated by dividing the outermost level coordinate system into rectangular cells. Within the cells, the next two attributes are used to span the second level coordinate system.

Que 5.26. Explain type of data that are visualized.
Answer
The data type to be visualized may be :
One-dimensional data :