0% found this document useful (0 votes)
44 views

Data Analytics - Unit - 5

Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
44 views

Data Analytics - Unit - 5

Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 15

6-2J (CS-5/1T-6)

Frame Works and Visualization

PART-1

5
UNIT
Frame Works and
Visualizatio,
Frame Works and Visualization:

Questions-Answers
MapReduce, Hadoop

Long Answer Type and Medium Answer Type Questions


Que 5.1. Write short note on Hadoop and also write its
CONTENTS advantages.
Answer
Visualization : ******.voese
5-2J to 5-4]
Works and 1. Hadoop is an open-source software framework
scalable, reliable and distributed applications thatdeveloped
Frame
Part-1 : MapReduce, Hadoop
for creating
5 4J to of data.
process huge amount
5-8J
Part-2 : Pig. Hive 2. It is an open-source distributed, batch processing, fault tolerance system
Sharding, 5-8J to 5-14] which is capable ofstoring huge amount of data along with processing
HBase, MapR, on the same amount of data.
Part-3 t NoSQL Databases
. 5-14J to 5-17J Advantages of Hadoop :
Hadoop Distributed Fast :
Part-4 : S3,
File Systems
1.
a. In HDFS (Hadoop Distributed File System), the data distributed
Visualization:Visual Data
5-17J to 5-25J over the cluster and are mapped which helps in faster retrieval.
Part-5 : Analysis Techniques, b Even the tools to process the data are often on the same servers,
Interaction Techniques, thus reducing the processing time.
Systems and Applications
2 Scalable : Hadoop cluster can be extended by just adding nodes in the
to R: 5-26J to 5-30J cluster.
Part-6 : Introduction
R Graphical User 3. Cost effective : Hadoop is open source and uses commodity hardware
Interfaces, Data Import to store data so it really cost effective as compared to traditional
and Export, Attribute relational database management system.
and Data Types Resilient to failure : HDFS has the property with which it can
replicate data over the network, so if one node is down or some other
network failure happens, then hadoop takes the other copy of data
and uses it.
5. Flexible :
a. Hadoop enables businesses to easily access new data sources and
tap into different types of data to generate value from that data.
b. It help to derive valuable business insights from data source such
as social media, email conversations, data warehousing, fraud
detection and market campaign analysis.
5-1J(CS-5/1T-6)
Data Analytics
MapReduce.
63J(C8-5/TT-4, 1JCS-5/1T-6)

Intermediate
Frame Works and Visualization
outputs from the numerous map steps need to be
note on and provided the proper machines for the collected
Write short to
Que6 .
6 Einal output needs to be made available for usereduce
by
step execution.
can write applications
which welarge application, or perhaps another MapReduce job. another user, another
Answer framework using parallel, on clusters in a reliable
a
MapReduce isamnounts of data, in
1
processhuge PART-2
manner technique and a program model for Pig, Hive.
processing
MapReduce is a Java.
computingbased on to
distributed provides the means break alarge task
The
MapReduce paradigmntasks in parallel, andl consolidate the
outputa Questions-Answers
3 task8, run the the final output.
intosmaller Long Answer Type and Medium Answer Type Questions
the individual tasks into :
of two basic parts
MapReduceconsists of Que 5A. Write short note on data access component of HadooP
4
data
Map operation to a piece of system.
Applies an intermediate output
Provides some Answer
b.
ii. Reduce : intermediateoutputsfrom the map steps Data access component of Hadoop system are :
Consolidatesthe a. Pig (Apache Pig):
output
Provides the final two functione 1. Apache Pig isa high level language platform for analyzing and
Map) andReduce) are
b
program, query huge datasets that are stored in HDFS.
In a Map Reduce filtering, grouping ond
5.
function performs actions like 2 Apache Pig uses Pig Latin language which is similar to SQL.
The Map
a.
the roeutk 3 It loads the data, applies the required filters and dumps the required
sorting.
function aggregates and summarizes format.
While Reduce
b.
produced by Map function. For program execution, Pig requires Java run time environment.
function is a key-value nair
4

The result generated by the MapReduce function. 5. Apache Pig consists of a data flow language and an environment
C.
the input for to execute the Pig code.
(K, V) which acts as
are required for
executing 6. The main benefit of using Pig is to utilize the power of MapReduce
Que 5.3. What are the activities that in a distributed system, while simplifying the tasks of developing
MapReduce job ? and executing a MapReduce job.
common data
7. Pig provides for the execution of several more
Answer management and manipulations, Such as inner and outer joins bet ween two or
Executing a MapReduce job requires the files (tables).
activities :
coordination of several
based on the system's workload.
b. Hive :
MapReduce jo bs need to be scheduled 1 HIVE is a data warehousing component which
performs reading,
environment
managed to ensure that any encountered
1 a distributed
writing and managing large datasets in
2 Jobs need to be monitored and that the job continues to execute if the using SQ-like interface.
errors are properly handled so
system partially fails. HIVE + SQL = HQL
the cluster. Query Language (HQL),
3. Input data needs to be spread across 2. The query language of Hive is called Hive
processing of the input needs to be conducted across the which is very similar like SQL.
4. Map step same machines where the data
distributed system, preferably on the 3. It has two basic components:
resides.
ii. used
HQL
commanda,
65J(CS-ST4
Data Analytics Hive Command line : T'he Hive Command line interface i

to executedrivert Java Database Connectivity ((JDBC)


JDBC/ODBC
66J(CS-51T-6)

3. ODBC Driver :lt


protocol to connect toallows
Hive.the
Frame Works and Visualization

applications that support the ODBC


and Object Database Connectivity (ODBC) is used to establih se. aervices :The following are the
data storage. Hive CLI :The Hive CLI (Command services provided by Hive :
1.
connection from both Line Interface) is a shell where
can serve the we can execute Hive queries and
Hive is highly scalable. As, it Batch
processing(i.e. query processing)
processing),
purposes,and reali, 2
commands.
Hive Web User Interface :The Hive Web UI is an
CLL It provides a web-based GUI for exe alternative of Hive
large data set Interactive query cuting Hive queries and
time processing (i.e. data types ofSQL. commands.
primitive
5. Itsupportsall architecture of
3. Hive MetaStore :
Hive in
Que 5.5. Draw and discuss the detail. It is a central repository that stores all the st ruct ure
of various tables and partitions in the
warehouse.
information
b It also includes rnetadata of colurnn and its type information which
Answer architecture explains the flow of is used to read and write data and the corres ponding HDFS files
architecture : The following where the data is stored.
Hive Hive.
submission of queryinto 4. Hive server :

JDBC driver ODBC driver a It is referred to as Apache Thrift Server.


Thrif server
Hive b. It accepts the request from different clients and provides it to Hive
Client Driver.
5. Hive driver:

Hive
Hive server CLI a It receives queries from different sources like web UI, CLI, Thrift,
Hive web UI and JDBC/ODBC driver.
services
b. It transfers the queries to the compiler.
Hive driver
6. Hive compiler :
a. The purpose of the compiler is to parse the query and perform
semantic analysis on the different query blocks and expressions.
It converts HiveQL statements into Map Reduce jobs.
Metastore
b
7. Hive execution engine:
of DAG of
Mapreduce a. Optimizer generates the logical plan in the form
MapReduce tasks and HDFS tasks.
executes the incoming tasks in
b. In the end, the execution engine
HDFS the order of their dependencies.
using Hive ?
Que 5.6. What are the conditions for
Fig, 5.5.1. Hive architecture.
in various languages, including
Hive client : Hive allows writing applications Answer
different types of clients such as: con ditions exist :
Java, Python, and C++. It supports Hive is used when the following
cross-language service provider platform that Data easily fits into a table
structure.
1 Thrift Server : It is a programming languages that supports 1
serves the request from all those Data is already in HDFS.
2 programming and queries.
Developers are comfortable with SQL
Thrift.
a connection between Hive and 3
JDBC Driver : It is used to establish datasets based on time.
2
applications. The JDBC Driver is pre sent in the class 4 There is a desire to partition
Java
org.apache.hadoop.hive.jdbc.Hive Driver.
Data Analytics
acceptable.
-7JC8-5/1T4) -8J(CS-51T-6)

IL Can manage more complex data lows


Frame Worke and Visualization

Bateh processing is Apache Pig operates on the client side of a


Write some use
cases of Hive. ebas less number of lines of code by using cluster.
Que 5.7. 6.
It supports reusing the code.
multi-query approach.
6.
Die one of the best tools to make the large
is
Answer 7 structured data. unstructured data to
use cases:
Followving are some Hive |
analysis of HDFS data : Data can be queried, 8.
It is open source software.
1. Exploratory or ad-hoe analytical tools, such as R. It is procedural programming language so that we can control the
transformed, and exported to dash board.
9 execution of each and every step.
Extracts or data feeds to reporting systems,
queries can be Features of Apache Pig:
2
data repositories
such as HBase : Hive scheduled Rich set of operators: Apache pig has a rich collection set ofoperators
such periodic feeds. 1
in order to perform operations like join, filer, and sort.
provide
Combining external structured data to data already residing
2 Ease ofprogramming : Pig Latin is similar to SQL so it is very easy for
in HDFS: developers towrite a Pig script.
processing unstructured data, but often
a. Hadoop is excellent for residing in an RDBMS, such as Orael 3. Extensibility :Using the existing operators in Apache Pig, users can
there is structureddata joinedwith the data residing in HDBS developtheir own functions to read, process, and write data.
SQLServer, that needs to be tahl U'ser Define Functions (UDF's) : Apache Pig provides the facility to
data from an RDBMS can be periodically added to Hive 4
ereate user-de fined functions easily in other language like Java then
b The data in HDES. invoke them in Pig Latin Scripts.
existing
for querying with
SQL.
Handles all types of data :Apache Pig analyzes all types of data like
Que 5.8. Difference between Pig and structured, unstructured and semi-structured. It stores the results in
HDFS.
ETL (Extract Transform Load): Apache Pig extracts the huge data
6 required
set, performs operations on huge data and dumps the data in the
Answer
SQL format in HDFS.
S.No. Pig
It is a declarative language.
1 It is procedural
a language. Que 5.10. What are the applications of Apache Pig.
relational data It uses flat relational data model.
It uses nested
model. Answer

Scheme is optional.
Scheme is mandatory. Application of ApachePig:
3. web logs, streaming online
analytic It uses OLTP (Online Transaction 1. It is used to process huge data sources like
It uses scan-centre Processing) workload. data ete.
workload.
It supportsAd Hoc queries across
large dataset.
Significant opportunity for query 2.
platforms.
5. Limited query optimization.
optimization. 3 Used toperform data processing in search
sensitive data loads.
4 It is alsoused to process time
What are the advantages and features of Apache
Pig (or
Apache Pig is generally used by data scientists for performing tasks like
Que 5.9. 5.
prototyping.
ad-hoc processing and quick
Pig).
PART-3
Answer
Databases.
Advantage of Apache Pig: HBase, MapR, Sharding, NoSQL
1 Pig Latin language is easy to program.
2 It decreases the development time.
10J(CS-51T-6)

Data Analyties
5-9J (CS-5/1T-4, Frame Works and
Visualization
2
Region server :
HBase tables are divided
Questions-Answers horizontally by row key range into regions.
Regions are the basic building
Answer
b elements of HBase cluster that
consists of the distribution of tables
and Medium Type Questions and are comprised of column
Long An swer Type families.
Region server runs on HIDFS data node which is
C.
cluster. present in Hadoop
Discuss architecture of HBase data
Que 5.11. What is HBase? Regions of region server are responsible for
d
model. bandling, managing, executing as well as readsseveral things, like
and writes HBase
operations on that set of regions. The default size of a region is
Answer 256 MB.
Java.
distributed database written in Zookeeper:
1 It is an open source, 3
of Hadoop ecosystem, It runs on top of HDFs It is like a coordinator in HBase.
2 HBase is an essential part
System).
(Hadoop Distributed File t b. It provides services like maintaining configuration information,
amounts of data from terabytes to petabytes.
3 It can store massive horizontally scalable. naming, providing distributed synchronization, server failure
notification etc.
column oriented and
HBase architecture : C. Clients communicate with region servers via zookeeper.
Oue 5.12. Write the features of HBase.
Client
HMaster
Answer
Features of HIBase :
It is linearlyscalable across various nodes as well as modularly scalable,
Region Region Zookeeper
Region server 1
server
as it divided across various nodes.
server

Region Region
Region
2 HBase provides consistent read and writes.
Region Region Region 3 It providesatomic read and write means during one read or write process,
write
all other processes are prevented from performing any read or
operations.
HDFS
4. It provides easy to use Java API for client access.
ends which supports
5 It supports Thrift and REST API for non-Java front
Fig. 5.11.1. HBase architecture. XML, Protobuf and binary data encoding options.
for real-time queries and
HBase architecture has three main components :
6 It supports a Block Cache and Bloom Filters
for highvolume query optimization.
1. HMaster :
7 HBase provides automatic failure
support between region servers.
The implementation of master server in HBase is HMaster. with the Hadoop metrics subsystem to
8. It support for exporting metrics
b It is a process in which regions are assigned to region server as well files.
as DDL (create, delete table) operations. within data.
9. It doesnot enforce relationship
It monitors all region server instances present in the cluster. and retrieving data with random access.
10. It is a platform for storing
C.

d. In a distributed environment, master runs several


background
threads. HMaster has many features like controlling load balancing,
failover etc.
-12J(CS-/IT-6)
DataAnalytics
database shard.
11JCS-6T4 # makesan application more reliable by
Frame Works and

mitigating the
Visualization

Que
&.IB.Defie
techniques for sharding
sharding and Explain the wHh asharded database, an outage is likely to
Drawbanck of sharding:
impact of outages.
affect only a single shard.
is quite complex to implement a sharded database
1. architecture.
partitioningthat splits very large The shards ofen becorne unbalanced.

1
Anewee
Sharding is a type
faster
of
databae
easily managed
part.
and more partitionoffdata in a database or
database 2
3
adatabase has been sharded, it
its unsbharded architecture.
can be very difficult to return it to
into smaller horizontal search Sharding is not supported by every database engine.
is referred
shardin a to as a shard or
2 Adatabase individualpartition
engine. Eachahardia held on a
shard. Each
separate database server
instance,base
data to Que 6.15. What short notes on NoSQL database with
its
spread load techniques to apply sharding: advantages.
various
Following are the our data : Answer
value to shard
Use a key Use different locations to store data with NoSQL databases are non tabular, and store data differently than
this nmethod user may 1 relational tables,
In pair.
help ofkey value access by key of
that data. NoSQL databases come in a variety of types based on their data model.
data can be easily to its location storage.
2.
b All irrespective The main types are document, key-value, wide-column, and graph.
tostore data 3
It makes easy our data : They provide flexible schemas and scale easily with large amounts of
balancing to shard 4
data and high user loads.
Use toad
take individual decision foor storing datain different
Database can 5 NoSQL data models allow related data to be nested within a single data
locations. short sharding that reframe structure.
also split into
b Large sharding canitself. Advantage of NoSQLdatabase:
decision bydatabase 1 Cheap and easy to implenment.
Hash the key : by hashing its value Data arereplicated to multiple nodes and can be partitioned.
3
keys can be arranged document. 2
a. In this development, store all Easy to distribute.
hashed to 3
Allassignments can be value Do not require a schema.
b.
hashing assigns documents with a particular key 4
Consistent ring.
in a hash
Que 5.16. Explain the benefits of NoSQL database.
C.
servers
to one of the

benefits and drawback of sharding ?


Que 5.14. What are the Answer
Benefits of NoSQL database:
data models
Answer
L Data models : NoSQL databases often support different support simple
example, key-value databases
Benefits of sharding :
means adding more processing
and are purpose built. For
Horizontal scaling :Horizontal scaling or database to allow for more queries very efficiently.
1. server better than SQL
units or physical machines
to our 2 Performance :NoSQL databases can often perform
traffic and faster processing. databases. For example, if we are using a document database
relational
about an object in the same
2 Response time : and are storing all the information to one place for those queries.
document, the database only needs to go
a It speeds up query response times. designed to scale-out horizontally,
In the sharded database, queries
have to go over fewer rows and 3. Scalability : NoSQL databases areperformance as our workload grows
b
thus we get our result sets more quickly. makingit much easier to maintain
beyond the limits of a single server.
Data Analytics databases are
-13J(CS-STT4
designed to
14J(CS-S1T-6)
Graph based databases :
Frame Works and Visualization

distribution : NoSQL supporn Graph databases store data in nodes and edges.
Data
4 distributedsystems. high availability and
databases ensure for self-healing, uptime b Nodes typically store informatíon about people, places, and things
Reliability : NoSQL
with native
built-in failover
replication and ,resilient while edges store information about the relationahips
nodes.
between the
databaseclusters. at allowing users to test ew Graph databases are commonly used when we need to traverse
databases are better MongoDB, C

Flexibility : NoSQL
datastructures.
For example,
meaning fields can vary
stores data relationships to look for patterns such as social networks, fraud
detection, and recommendation engines.
update
ideas and JSON-like documents, from
structurescan be easily
in flexible, documentandthe data evolve.
documentto requirements
changed Neo4j and JanusGraph are ezamples of graph databases.
application Que B.18, Differentíatee between SQL and NoSQL
over timne, as databases ?
are the types ofNoSQL
Que 5.17. What Answer

Answer S.No. SQL NoSQL


database : It supports non-relational or
Document based documents similar to JSON 1. It supports Relational
1 store data in Database Management distributed database systen.
Document databases
Notation) objects. System (RDBMS).
JavaScriptObject and values,
containspairs of fields These databases have fixed They have dynamic schema.
Each document including things l
a variety of types
b 2
can typically be Because of or static or predefined
The values objects.
C booleans, arrays8, or can k schema.
strings, numbers, types and powerful query languages, it
variety of fieldvalue These databases are not These data bases are best suited for
purpose database. 3
used as a general scale-outto accommodate
large data vol suited for hierarchical data hierarchical data storage.
Theycan horizontally of documert storage.
Couchbase DB are example
MongoDB, Couch DB, 4
These databases are best These databases are not so good
databases. suited for complex queries. for complex queries.
Key-value based database:
database where each Vertically scalable. Horizontally scalable.
2
Key-value databases are a simpler type of 5.
a.
and values.
item contains keys value.
only be retrieved by referencing its PART-4
b Avalue can for use cases where we need to
Key-value databases are great do not need to perform complex S3, Hadoop Distributed File Systems
C but we
store large amounts of data
queries to retrieve it. databases.
example of key-value Questions-Answers
d Redis and DynanoDB are :
Wide-column based database Type Questions
3
Wide-column database store data
in tables, rows, and
dynamic
Long Answer Type and Medium Answer
columns. databases beca use each Distributed File System
offlexibility over relational Hadoop
b It provide a lot columns, Que 5.19. Explain architecture of
have the same
row is not required to Things data and (HDFS).
commonly used for storing Internet of OR
C. They are architecture and HDFS commands
user profile data.
wide-column databases. Define HDFS. Discuss the HDFS
d Cassandra and HBase are the example of in brief.
-14CS-TT6)
Eata Alyties S-16JCS-T6 Se The file in & file
Frame Works and
systes will be, divided ánto one
Visualization
eegments and/or stored in indiyidual data
9r more
Anewer nisthe eore connponent
System
or
the backbne 8egments are called as blocka nodes.These file
Dtributed File In other words, the
Hadep Feytem
fHatop store different minimum
resd or write is called a Block.amount of data that HEDES can
which itpossible to
makesunetructured and semi types of
one,
HDFN ie the etructured. structured data Ge 6.20 Diffetentiate1 between
lar data
eta tion aver the resources, from MapReduce and Apache Pig.
HDS
ef ahkracsingle
areates a levelIDES a
unit where Answer
& whole varioas nodes and
weean see the scross maintaining
storing eur data (mnetadata). ANo. MapReduce Apache Pig
t belp data
4 abeut the stored re is a low-levet data
the lng fle Meta data (ae, plca procesgtool
it is a high-level data flow tool.
homfndata,
Here, it is reqired to develop It is not required to develop
complex programs asingcomplex programs.
Java or Python.
It is difreult to perorm data It provides bult-in operators to
operations in MapReduce perfor data operations like
union, sorting and ordering.
e does not ailos nested data It provides nested data typeslke
Biocka types. tuple, beg, and map.
Data nodes
odes
Que 521- Write short note on Amazon 53(Simple Starage Service)
Padt 2
architecture. with its features.
Pig. 5.I9. 1. HDPS
Answer
HDFS has three core commponents
nl
Amazon S3 (Simple Storage ServiceiisSerices acloud laaS (infrastructure as
from Amazon Web for abyect starage via a
Name n ode : does not store the Aservice) solutaon
coDvenient web-based interface
the master node and
The name node is of 53 include industry-leading
actual data. iaformation about databases According to Amazon. the benefita and performance.
i RCalability. data availahility, security,
It contains metadata less storage and high
computational
Amazon 83 is the "abjectwhich pansista of a
Therefore, it requires The basic storage unit of
number and metadata
resourees 6le with an associated ID smary ofolders
Data node : HDPS. 4These objects are stored in bucketa, wheh funtionAWS regon af our cheice
act ual data in reside within the
Data node stores the or directoriesand which 2 stare,
store is the standard mechanisn
ALs alao calledslave daemon write operations as per the 6 The Amazon S3 object quantities of data in ARS
responible for read and retrieve, and share large
t i are:
reque oname nodel The features f Amazon 53 retrievig data
reoeives reqjest from for staring. lasting and
L Ohjeet stare model
Block :
the flesof HiDES.
ACenmerally the aer data isstored in
I8J(CS-MTR)
Data Analytics 5-17J (CS-T-A
terabytes, with many petabytes of verification of
visualization, but maythe
The Frame Works and
2. Support for objects up to 5
allowed in a single "bucket". data from statistics, pattern
also be hypotaceomplhesesishedcan byalso be doneVisualviainatidataon
3. Data in stored in Amazon
AWS regions.
S3 in buckets which are stored in
different Visual data exploration can
and noisy data.
recognieasilytion,deal machine
or
automatic
with learning techniques
different users. highly
intuitive and requires nonon-homogeneous
Buckets can be restricted to Visual data exploration is
Data stored in an Amaon 83 bucket is billed bnsed on the size of data complex mathematical oor
5.
Datalong it isinstored, and on operations accessing
this data statistical
Visunlization can provide a
understanding of
algorithms or ofparamet
overview the data,ers.allowing
how
isolatedqualitative
stored Amazon S3 can be backed up with Amazon Glacier. 7 data phenomena to be
6. for further
PART-5
Gue523 What are the quantitative analysis.
exploration process toapproaches to integrate the
Analysis Techniques, Interactiow data
wisual datamining? realize different kind of human in
Visualiza tion, Visual Data
Techniques, Systems and
Applications.
Answer
approaches to
Appronches to integrate the
Questions-Answers realize different kind of
human in
data
approaches to visualexploration process to
data mining:
Questione Preceding Visualization (PV):
Medium Answer Type L
Data is visualized in some vissual form
Long Answer Type and before running a
(DM) algorithm.
visual data exploratie By interaction with the raw data, the data
data-mining
Que 5.22. Explain data visualization and b analyst has full control
over the analysis in the search space
Interesting patterns s are discovered by
Answer exploring the data.
Data visualization : Data
chart, graph or
process of putting data into a
1. Data visualization is the helps inform analysis and interpretation
other visual format that analysis process. Visualization
of the data
critical tool in the data
2 Datavisualization is a generating fundamental distribution
Visualization tasks can range from variables in
3 interplay of complex influential DM-algorithm|
plots to understanding the
machine learning algorithms.
deal with the
data analysis can help to
Datavisualization and visual
Result
4
floud of information.
Knowledgo
Visual data exploration : involved in the data analysis
exploration, user is directly Fig. 5.23.1. Preceding Visualization
1. In visual data
process. of high value in Subsequent Visualization (SV):
techniques have proven to be
2. Visual data analysis 2
performs the data-mining task
An automatic data-mining algorithm dataset.
exploratory data analysis.
hypothesis generation process; by extracting patterns from a given
a.

exploration can be seen as a user to gain insight into the interpretable for the
3. Visual data
the data allow the These patterns are visualized to make them
the visualizations of
b.
hypotheses.
with new data analyst.
data and come up
s-90JCST6)
Frame Werks ht i
Data Analvtick dW
Bi93C8-5T4) Answer

Based on Data visualization


It is the graphieal
tis thData, snityto
Delinition
e o f f d a l l ritr DM-alarithuo teolqz s bInaly l
representation of proefttee
information and data data sets in

forminatadeciinforsmioatnisut
pictorial or u
e a n b t e o r i p bnd e sat graphical
lentatitnesp Tlofthedta
Visuitiitin
Usedfor The goal of the data
vísualization is to on the hve
1t will help
the
be

com municate
clearly and
Yhaké mnore
informáation busiriek Tofireie
ved
ni n d dt stannt efficientlythemto
users by presenting analyrine the tn
Visualization.azolpeahk visually.
deongn oPir. 5.23.2. Sabsequent
Subsequent visualizations enable the data analyst tto specify Relation Data visualization
C
feedbacks. Based on the visualization, the anddata analyst may Want better perception. helps to get Together dte ri
6toenreturn to the data-mining resuits. algorithm use diffetont-i and anslytie l re
parameters to obtain better lo bnal
Visualizatíon (TEV) :iueiY aribgea
3ashib
Coneloeinne
dataset
Data
echnologies visualization Data analyios te
3. Tightly Integrated Tndustres
performs an analysis of th
An automatie datá<mining algorithm and
data but does not produce the final results
pre sent the
ogi (M
intermediate ree
techniquesand etiet
are widely used in fnancs
BenkinE healtheare
isused to
lossnb liA visuali|ation technique l n aerinn lt 19vo etaing ete Neiei
of the data exploration processa
Data DenieDiatiaHaro, TaledTe
Dvaphs, ikviewgitdPoye
TingCHharr et
Vinualization
DM-algorithm step 1 Interaction THeeta eg
DM- algtthnr tep nt
Big inta
Reatt

CKnowlel Data
visualization (TIV).
Pig. 5.23.3. Tightly jntegrated
The combination of some automatic data-mining algorithms and
visualization techniques enables specified user feedback for the
dataaialynt identifies the
next data-mining nm. Thén, thei resulta
ed neiinteresting patterns in the visualization of the intermediate
based on his domain knowiedge,tegsteztz
data vÍsualization and
Que 5.24. What is the ifference between
data analytics ?
2J(CS-51T6)
6-21J (CS-5/1T-6) Frame Works and
Data Analytica
is n type
of data: ltpointa
Line eharts of chart which displays information
Stacked|display techniques:
StackedI display
Visualization
series called mnarkers connected by straight line
techniques are tailored to present data
in a hierarchical fashion.
Negments, In the case of partitioned
Bar charts : It represents the categorical data with rectangular
bars of heights and lengths proportional to the values they
b used for
aulti-dimensional
partitioning the data, the data
selected appropriately. data and building the dimensions to be
represent. An example offa stacked hierarchy have to be
Seatter charts: It is a type of plot or mathematical diagram that C
display technique is
display value for typically two varinbles for a Bet of data using d
The basic idea is to
coordinate system, embed
Le. two
one
coordinate dimensional
system stacking.
inside another
Cartesian coordinates. system, two oother attributes form the
system, and so on. attributes
statistical graph which decide into slices to are outer
Pie charts:It is circular emnbedded into the outer coordinate
d
illustrate numerical proportion
The display is generated itby
coordinate
Geometrically-transformed display technique : systeminto rectangular cells. dividing the outermost level
2
Geometrically-transformed display techniques aim at fndi are used to span the second
Within the cells, the next two coordinate
"interesting" transformations of multi-dimensional data sets level coordinate system. attributes
The class of geometric display
methods includes techniques from Que 5.26. Explain type of data that are
b
exploratory statistics such as scatter plot
matrices and a claee visualized.
that attempt to locate projections that satisfy sor Answer
techniques
computable quality of interestingness. The data type to be visualized may be:
include :
C. Geometric projection techniques One-dimensional data:
prosection views, only user-selected L

slices of the data are


:
i. Prosection views projected. In One-dimensional data usually have one dense
Atypical example of dimension.
ii. Parallel coordinates visualization
technique : Parallal b
One or multiple data
one-dimensional data is temporal data.
coordinate technique maps the k-dimensional parallel space onto the C. values may be associated with each point in
dimensions by usingk axes that are to eack time.
two display
other and are evenly spaced across the display. d Examples are time series of stock prices or time series of news data.
Two-dimensional data:
3. Icon-based display techniques : 2
values of a multi. Atwo-dimensionaldata is geographical data, where the two
In iconic display techniques, the attribute an icon. distinct
dimensional data item is presented in the form of
a.
dimensions are longitude and latitude.
as little faces, needle icons A standard method for visualizing
b Lcons may be defined arbitrarily such icons.
b. two-dimensional data are xy
icons, stick figure icons, color plots and maps are a special type of x-y plots for presenting two
star
attribute values of dimensional geographical data.
C The visualization is generated by mapping the
of the icons.
each data record to the features C. Example of two-dimensional data is geographical maps.
Multi-dimensional data :
4 Dense pixel display techniques :
to map each
The basic idea of dense pixel display techniques is belonging to Many data sets consist of more than three dimensiu ns and there fore
dimension value to a colored pixel and group the pixels do not allow a simnple visualization as 2-dimensional or 3-dimensional
each dimension into adjacent areas. plots.
the techniques
b Since in general it uses one pixel per data value, pOssible on
b Examples of multi-dimensional (or multivariate) data are tables
allow the visualization of the largest amount of data from relational databases, which often have tens to hundreds of
current displays. columns.
to provide
C. Dense pixel display techniques use different arrangements
dependencies, and hot
detailed information on local correlations,
spots.
JC8-6T-6)
Frame Works and
Data Analyties
hypertext t
21J(CS.TT4
importantdata types
anisgernent So, the data
ehe datarin an visualization
uncomplicated
Predict the data based on
way helps them in
Visualization
understanding
Text and
In the age ofthe as
Wide Web,
Ward multinedis web page éontérts, are text
their experience.
visualization
builds sort of period outlines for the
: The data
users which they visualization
hypertest, as well they ciannot be easily can ink using
and
mumbe,
These dnta
snd differ in that
therefore
typee deseribed
most of the vtandard viHualizationby
Que S.29, Explain different interaction
teehniqueseannot
be applied techniques,
Hierarehies and graphs relatioiship to other Answer
&
Data veords oten hrve
some
pieces
of Tnteractíon s
techniques allow the data'analyst to
visualizations and directly interact with
information ordered, hierarchical, or the dynamically change the visualizatíons
These
relationshipe may be
networks of relations. such interdependencies,
arbitrary tothe exploration
In addition,
objeetivex
they also make it
sndependerit visualizatíorie possible to relate ahd
according
combine multiple
used to represent
2

Craphs are widely objects, called nades. and


connections Different interaction techniques are :
aset of
consists of called edges or links. Dynamie projection :
d Agraph objects, interrelationships Dynamic projection is an automated navigation operation.
between these among people, their 1
The basic idea is to dynamically change the projections in order to
Examples are the e-mail structure of the hard disk, or the 2 esplore a multi-dimensional data set
the file
shopping behaviour, World Wide Web. Awel-known example is the GrandTour system which tries to
hyperlinks in the boifanty n 3.
software:enm show all interesting two-dimensional projections Gf a multi
Algorithms and algorithms andl software. oir dimensional data set as a series of scatter plots.
data is
Another class of The sequence of projections shown can be random, manual, pre
ofsoftware visualization
is support
to softwaredevelopment
algorithms, too enhance the
P4. computed, or data driven
b The goal
by helping to
understand
support the
programmer in understanding
debugging the ut Examples of dynamic projection techniques inelude XGobi
XLispStat, and ExplorN.
code and to
of writ ten BInteractivefiltering:
code. visualization? Interactive filtering is a combination of selection and view
advántages of data 1.
enhancement.
What are the partition
In exploring large data sets, it is important to interactively
Que 527.
2 subsets.
the data set into segments and focus on intéresting
bl a m n o w A
urbub l e n i d t subeet browsing)
Answer
visualization are
anol s seAOaib
n3. This can be done by adirect selection ofthe destredsubset fquerying).
Advantages of data pattern:A or by a specification of propertiesuSed ofthe desired
and its interactive filteringis the
understanding of the data increasing sales An example of a tool that can be for
Better
L. the flow of datalike 4 en
Magic Lens: TSwol s diw node ans tool similar toa magnifying
User can understand the sales report will reveal h
Magic Lens is to use a
a.
representation of basic 1dea of The data under
The line chart of thesales division of anyórganization 5.m The
glass to filter the data directly in the visualization.
b. and displayed in a
sales growth to the manager
data like
trendsTOiasmib-isloK
the magnifyingglass is processed by the filter
set.
Relevance of the hidden can be ideñtifed od la different way than the remaining data of the selected region,
while the
2. some unseen patterns which show a modified viewunafTected.lc
The datamay con tain 6. Magic Lens
remains
with data visualization. 3olg. HYsiqairest of the visuálization
techriques incudes InfoCrystal,
market may increase at filtering
ofany stock in share
For example, the data time. This period can be idehtified using the TExamples ofinteractivePolaris.
b
a particular period of Dynamic Queries, and widely
Zooming : modification technique that is
o eberbr data visualization.abstraction of data for users : C.
Zooming is a well known view
Encapsulation and by 1. applications.
are not understandable used in a number of
of verylarge size and which is a part of top
3.
The data sets are audience
everyone like non-technical
a
PART6
tily ng
eel Aner Type
Ehem
Tabiees acit brt mte o
uerial vae pgraming langagwi
rie nlades PAD- VELSine
cages llyched
ised fr datap
Eaaing and Linking aprucagndi
-fnetinal lagg which pries te fnti
s f the data se
ting the selected data to ather idiferenti daa
tcan toneese figueperoms
The i e iming snd brhig ito mie fputtigtgether asiea et gin nthe i the e
eche sbortemigs of iniidal tectigue
barbig can be apped to i a gmerade
Lakig E hasintegra fetinn for data handing e declartn nd
ipgtion Secitmies As arelt, the trubed
mng tpbie to det eftiom and t als upPps in-meryerge
ane hgged in al i i a t a ,
dependences nd eurreati Tt of data
uDPorta opertas omeeta of data ie setand metr
iuliation are automatia Merools are 2vaatle for data analyis ing
nteratie chnges made in one
sniualizatim
Val represemtatinn pruued ing Rcan be diplayed n the
refected in the other BCreen as We as can be printed
Sprogrglangunge is zvlable nline to ppert funti of
Dtortion is aiew moifcation technigue that supPporte the data Binmre simplifed maniner
ezploratinn prns by preservig a overview of the data duri Lerge mumbers of packages are avallabile in repotory for rios
drow operatin tinalities of data pruceingwith Rlanguge
data withahigh level of 7 Bsupperta the graghial trtin funtin fr dta y i ich
2 The boic idea is to show portioms of the level of deta
with a lower can also be exprted to eternal Ees in varius forta
deta wlle others are shown Bcan support end to endreqirements of data analytics. It can be
hyperbolic and spberieal
Popular distortion tectnigues are psed torapidly develop yanalymis
istortione
be
These are ofer sed on hierarchies or graphs but may also Gue 5.90. Write short note on R raphical user interfaces.
other viualization technigue
applied to any
include Bifocal Displavs. Amswer
Examples of distortion techniqgues Hyperbalic Visualization
5.
Perspective WallGraphical Fisheye Views, Rsoftare uses a command-line interface (CLI that is similar to the
and Hyperboz BASH shell in Linuz or the interactive versions of seriptimg languges
Buch as Python.
NomimalIattributes ane Prame VWorks andVializatinn
nd RatEthe termial
prony SOperatios pported4
Faremple ZIP coes, eomiered
y
catgrial
noial attribute atrib
terfaeUE provide& Ordinal:
DBumber,True or Palae natinality tt ameder,emgoye
are s.
and
Par
h of writicuting
written for R
Astributes
Ondinal
imply s mequenoe
attribtess are
nderGULs
oa hnve b HBtadie
Ratsle,ad GUe Poplar Operatins pported4 alsobry ciderd catgorical attributes
as
undimal
The for
window pnes
WarkepaceLite
a r as follow a
an r a te write
and nve Reode Far exxaplee:uality of d
earthquake diamonds,ttriute
are
cadenic , , 4
graden
eriptstSerea the ddutaeta dvrihlsin the Besiroen Enterval
aerated by the Kcode and nterval attribute define the
the plote s prdes Ioterval attributes are coidered itlersc
betweer
Plote iDieplaye
trightforwerd nisto export the
plote
Operatäons upported by a merie attribte
interval attribute
twalues
CansolesProvde a hieturyof t are , #,
data import and
export inP For example: Temgerature
in Caias or <,2*.
Write short notes on 4 latitudes
Parhit,calendar dates,
Ratio :
In ratio, both the diflerence and as the ratig of two
Answer
imported nto Rani the read cy 0funetion an in the defined valuns are
Ratio attributess are conidered
The d aode
followig
tt
2 umerie attribul
erations supported by ratio attribute
an the separatar
character n the directory an are,,z
ward
Kea forw
slah rexample :ge, temgeratue in Kalrin, oonte, length, weigct.
2 le path entionmakes ript fles oewhat ore prartable at the 4
ers, wo may e Explain data types in K
This c
jnitiad eonfusin on the part oWindoWs programming langunge.
Co ckalanhVas a separator
epOly to asing a ofmtipie fles with ong pat b names, the setwá) Amwer
imprt for the suboegent
4Tosplify thebe used o set the working directory E oode Various
data types of R are :
funti can operatios, as ehown in the follawing Vectons :
and export Vectors are a basie building block for data in RSimple Rvariabies are
import
etwd ldsta ar
whick ectors
table)and readdelim can only coneit ofvalues in the ame a
funetisinciude read. euch as TXT Avectar
tets for veetors can be condueted ing the is. vetor funetin
Other import fle types
other common import the file namey fles The
intended to importcasalno be sed to
iddes fanetionality that enaies the eay creatim and meniplatin
Thene funetions fveetors.
followin cde cY.header<TKUE, Sep For example z
ahowo in the read tableflename.
sles able amecyep #Creete a veetor
delien"le Pple <ered greenyellow
sales delin read attributes in R
proramming
What are different types of print appie yector
Que 592. #Get the class of the
orintelanapplel)
language ?
Output :
dreenyellow
Anewer types s
categorized into four ehsraeter
which can contain mary
dittenttypes o
Attributes c n he
ditinginh one from ansher Usts 1 Alist is an Robject
vetors, fanetione and eves another lit inide it
Nominal : labele that leoents ineide it like
represent
1 The valas
90J(CS-51T-6)

-29J (CS-5/TA) are created


Frame Works and
Data Analytics Factors
Rves Using the factor)
the count of levels.
function. The Visualization
For exomple :
#Create alist
13)
ForexAmple :
Createa vector. nlevelsfunctions
list 1 list(e(2,5,3),2 applecolors -c'green'
Print thelist
print(list 1) eCreate a factor object. 'green'yellow'
factor apple -factorlapple_colors)
; red;red' ;
red' 'green')
Output:
253 #Print the factor.
21.3 print(factor_apple)
data set.
Matrices :
two-dimensional rectangrular matrix function nt(nlevels(factor_apple)
A matrix is a using a vector input to the
b It can be created Output :
For example : green
green yellow red red red
byrow=TRUE green
# Create a mntrix.
matrix(e("a','a',b','c',b','a'), nrow = 2, ncol = 3, Levels: green red yellowy
M
print(M)
Output:
I.1] |.21 L.3)
[l] a" a" b
[2,] c b a
Arrays :
number of dimensions.
Arrays can be of any attribute which creates the
b The array function
takes a dim required
number of dimension.
For example:
Create an array.
array(c('green',yellow),dim = c(3,3,2))
a<-
print(a)
Output :
[,2) [,3)
"green"
[1]"green" yellow"
yellow"
[2]yellow" "green" "green"
[3,]"green" yellow"
Factors : a vector.
5
which are created using
a Factors are the R-objects distinct values of the elements in the
along with the
b It stores the vector
vector as labels.
whether it is numeric or
character irrespective of statistical
C. The labels are alwaysetc. in the input vector. They are useful in
character or Boolean
modeling.

You might also like