Bda Chapter 3 This Is The Notes of Bda

The document discusses the key differences between relational database management systems (RDBMS) and NoSQL databases. It covers aspects like schema, scalability, query capabilities, data structure support and examples of databases. It also talks about the four main business drivers that led to the emergence of NoSQL databases - volume, velocity, variability and agility.

BDA Chapter 3 - This is the notes of BDA

Computer engineering (A. P. Shah Institute of Technology)


Big Data Analytics (MU-Sem 7-COMP) (NoSQL)

3.1.7 Difference between RDBMS and NoSQL

UQ. Differentiate between an RDBMS and a NoSQL database. (MU-Dec 18)

Sr. No. | RDBMS | NoSQL
1. | Have a fixed (static, predefined) schema. | Have a dynamic schema.
2. | Not suited for hierarchical data storage. | Best suited for hierarchical data storage.
3. | Vertically scalable. | Horizontally scalable.
4. | Follow the ACID properties. | Follow CAP (consistency, availability, partition tolerance).
5. | Relational databases support transactions (also complex transactions with joins). | NoSQL databases don't support transactions (support only simple transactions).
6. | Relational databases manage only structured data. | NoSQL databases can manage structured, unstructured and semi-structured data.
7. | Relational databases have a single point of failure with failover. | NoSQL databases have no single point of failure.
8. | Relational databases support a powerful query language. | NoSQL databases support a very simple query language.
9. | Give only read scalability. | Give both read and write scalability.
10. | Transactions are written in one location. | Transactions are written in many locations.
11. | Support complex transactions. | Support simple transactions.
12. | Used to handle data coming in at low velocity. | Used to handle data coming in at high velocity.
13. | Examples: MySQL, Oracle, SQLite, PostgreSQL, MS-SQL etc. | Examples: MongoDB, RavenDB, Cassandra, BigTable, Redis, CouchDB, HBase, Neo4j etc.
3.1.8 NoSQL Business Drivers
The scientist-philosopher Thomas Kuhn coined the term paradigm shift to identify a recurring process he observed in science, where innovative ideas came in bursts and impacted the world in nonlinear ways. We'll use Kuhn's concept of the paradigm shift as a way to think about and explain the NoSQL movement and the changes in thought patterns, architectures, and methods emerging today.

Many organizations have relied on single-CPU relational systems that fulfilled their needs. Businesses have since found value in rapidly capturing and analysing large amounts of variable data and making immediate changes in their businesses based on the information they obtain. As each of these drivers applies pressure to the single-processor relational model, its foundation becomes less stable and in time no longer meets the organization's needs.

[Fig. 3.1.3: The business drivers volume, velocity, variability and agility applying pressure to a single-node RDBMS]

Fig. 3.1.3 shows how the business drivers volume, velocity, variability, and agility create pressure on a single-CPU system, resulting in cracks. Volume and velocity refer to the ability to handle large datasets that arrive quickly. Variability refers to how diverse data types do not fit into structured tables, and agility refers to how quickly an organization responds to business change.

There are 4 major business drivers for NoSQL:

(1) Volume (2) Velocity (3) Variability (4) Agility

1. Volume

Undoubtedly, the main factor forcing organizations to look at alternatives to their existing RDBMS is the need to query big data using clusters of commodity processors. Until around 2005, performance problems were resolved by purchasing faster processors. Over time, increasing processor speed was no longer an option.

As chip density increased, heat could no longer dissipate fast enough without the chip overheating. This phenomenon, known as the power wall, forced system designers to shift their focus from increasing speed on a single chip to using more processors working together.

The need to scale out (also known as horizontal scaling), rather than scale up (faster processors), moved organizations from serial to parallel processing where data problems are split into separate paths and sent to separate processors to divide and conquer the work.
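
The divide-and-conquer idea behind scaling out can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `split_into_chunks` and `parallel_sum` are hypothetical, and worker threads on one machine stand in for the separate processors of a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks contiguous parts of nearly equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def parallel_sum(items, n_workers=4):
    """Send each chunk to its own worker, then merge the partial results."""
    chunks = split_into_chunks(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))  # "divide"
    return sum(partial_sums)                        # "conquer" (merge step)


if __name__ == "__main__":
    print(parallel_sum(list(range(1, 101))))
```

The same split, process, merge shape applies when the workers are separate servers; what changes is that the chunks must then travel over a network.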

(New Syllabus w.e.f academic year 22-23) (M7-80) Tech-Neo Publications...A SACHIN SHAH Venture

2. Velocity

While big data issues are a consideration for many organizations moving away from RDBMSs, the ability of a single-processor system to read and write data quickly is also important.

Many single-processor RDBMSs cannot meet the demand for real-time inserts and online queries to databases created by public-facing websites.

RDBMSs frequently index multiple columns of each new row, a process that reduces system performance.

When single-processor RDBMSs are used as a back end in front of a web store, random bursts in web traffic reduce the response for everyone, and tuning the system can be expensive when both high read and write throughput is required.
3. Variability

Companies that seek to capture and report on exceptional data struggle when attempting to use the rigid database schema structure imposed by RDBMSs.

For example, if a business unit wants to capture some custom fields for a specific customer, all customer rows in the database must store this information even if it does not apply to all customers.

Adding new columns to an RDBMS requires shutting down the system and running the ALTER TABLE command. When the database is large, this process can affect the availability of the system, costing time and money.
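
The contrast between a fixed schema and schema-free records can be sketched as follows. This is an illustration only: the table name `customer`, the field `gift_wrap_pref` and the sample names are hypothetical, and SQLite is used purely as a convenient in-memory stand-in for an RDBMS (it does not itself require a shutdown to run this statement).

```python
import sqlite3

# Relational side: one customer's custom field becomes a column on every row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customer (id, name) VALUES (?, ?)",
    [(1, "Asha"), (2, "Ravi")],
)
conn.execute("ALTER TABLE customer ADD COLUMN gift_wrap_pref TEXT")
conn.execute("UPDATE customer SET gift_wrap_pref = ? WHERE id = ?", ("blue", 1))
rows = conn.execute(
    "SELECT id, name, gift_wrap_pref FROM customer ORDER BY id"
).fetchall()
# Row 2 now carries an empty (NULL) value for a field it never needed.

# Schema-free side: only the record that needs the field stores it.
documents = [
    {"id": 1, "name": "Asha", "gift_wrap_pref": "blue"},
    {"id": 2, "name": "Ravi"},
]
```

In the relational version the new column exists for every customer, while in the document version the extra field is simply absent where it does not apply.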
4. Agility

Agility is the ability to accept change easily and quickly.

Among the variety of agility dimensions such as model agility (ease and speed of changing data models), operational agility (ease and speed of changing operational aspects), and programming agility (ease and speed of application development), one of the most important is the ability to quickly and seamlessly scale an application to accommodate large amounts of data, users and connections.

The most complex part of building applications using an RDBMS is the process of putting data into and getting data out of the database.

If your data has nested and repeated subgroups of data structures, you need to include an object-relational mapping layer.

The responsibility of this layer is to generate the correct combination of INSERT, UPDATE, DELETE and SELECT SQL statements to move object data to and from the RDBMS persistence layer.

This process is not simple and is associated with the largest barrier to rapid change when developing new or modifying existing applications.

Generally, object-relational mapping requires experienced software developers.

Even with experienced staff, small change requests can cause slowdowns in development and testing schedules.

All these hurdles are best overcome by NoSQL databases.

These databases are schema-less and can be scaled easily.

They can accommodate application changes easily and can handle large volumes of data efficiently.

This agility has become a business driver for NoSQL databases.
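
The mapping work described under Agility can be illustrated with a nested object. The sketch below is hypothetical throughout (the `orders` and `order_items` tables, the `save_relational` and `load_relational` helpers, and the sample order are all assumptions made for illustration): the relational path needs hand-written statements for each table, while the document-style path stores the nested object as a single value.

```python
import json
import sqlite3

order = {
    "order_id": 7,
    "customer": "Asha",
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B4", "qty": 1}],
}


def save_relational(conn, order):
    """Hand-written mapping layer: one INSERT per row in each table."""
    conn.execute(
        "INSERT INTO orders (order_id, customer) VALUES (?, ?)",
        (order["order_id"], order["customer"]),
    )
    conn.executemany(
        "INSERT INTO order_items (order_id, sku, qty) VALUES (?, ?, ?)",
        [(order["order_id"], i["sku"], i["qty"]) for i in order["items"]],
    )


def load_relational(conn, order_id):
    """Rebuild the nested object from two SELECT statements."""
    customer = conn.execute(
        "SELECT customer FROM orders WHERE order_id = ?", (order_id,)
    ).fetchone()[0]
    items = [
        {"sku": sku, "qty": qty}
        for sku, qty in conn.execute(
            "SELECT sku, qty FROM order_items WHERE order_id = ? ORDER BY rowid",
            (order_id,),
        )
    ]
    return {"order_id": order_id, "customer": customer, "items": items}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute("CREATE TABLE order_items (order_id INTEGER, sku TEXT, qty INTEGER)")
save_relational(conn, order)

# Document-style storage: the whole nested object is written as one value.
document_store = {order["order_id"]: json.dumps(order)}
```

Any change to the shape of `order` touches both helper functions and both table definitions in the relational path, which is the kind of friction the text attributes to object-relational mapping.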


3.2 NoSQL DATA ARCHITECTURE PATTERNS

UQ. What are the different architectural patterns in NoSQL? Explain Graph data store and Column Family Store patterns with relevant examples. (MU-May 19, 10 Marks)

NoSQL databases were born out of the rigidity of traditional relational or SQL databases, which use tables, columns, and rows to establish relationships across data. Developers welcomed NoSQL databases because they didn't require an upfront schema design; they were able to go straight to development. And it's this flexibility, this "ad-hoc" approach to organizing data, that has arguably been NoSQL's greatest selling point, which continues to appeal to organizations that need to store, retrieve, and analyze either unstructured or rapidly changing data.

The data stored in NoSQL follows any of the four data architecture patterns.

(A) Key-Value Stores (B) Column Family (Bigtable) Stores

(C) Document Stores (D) Graph Stores

3.2.1 Key-Value Stores

UQ. Explain in detail key-value store NoSQL architectural pattern. Identify two applications that can use this pattern. (MU-May 18, 5 Marks)

This is one of the most basic NoSQL database models. As the name implies, the data is stored as key-value pairs. The key is typically a series of strings, integers or characters, but it can also be a more advanced form of data.

Usually, the value is connected or co-related to the key. Key-value pair storage databases typically store information as a hash table where each key is unique.

The value may be of any form (JavaScript Object Notation (JSON), Binary Large Object (BLOB), strings, etc.).

Application: This style of architecture is commonly used in shopping websites or e-commerce applications.

One of its important assets is its ability to handle wide volumes of data and heavy loads, and the ease with which keys are used to retrieve data.

Key | Value
user-1233 | John Doe
image-13P9 | <binary image file>
http://webpage-123.htm | <web page html>
file-123.pdf | <pdf document>

Fig. 3.2.1: An example of Key-Value store

Keys and values are flexible. Keys can be image names, web page URLs, or file names that point to values like binary images, HTML web pages, and PDF documents.

Constraints: A difficulty associated with key-value store databases is handling complex queries that attempt to include many key-value pairs, which may delay output and may cause data to clash with many-to-many relationships.
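
A key-value store of the kind described in this section can be sketched as a thin wrapper around a hash table. This is a minimal in-memory illustration, not the API of any particular product; the class name `KeyValueStore` and its `put`/`get`/`delete` methods are hypothetical.

```python
class KeyValueStore:
    """In-memory key-value store backed by a hash table (a Python dict)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Store a value under a unique key, replacing any previous value."""
        self._data[key] = value

    def get(self, key, default=None):
        """Look a value up by its key; the store never inspects the value."""
        return self._data.get(key, default)

    def delete(self, key):
        """Remove a key; return True if it existed."""
        if key in self._data:
            del self._data[key]
            return True
        return False


store = KeyValueStore()
store.put("user-1233", "John Doe")                # string value
store.put("file-123.pdf", b"%PDF-1.4 ...")        # opaque binary value (BLOB)
store.put("cart:user-1233", {"items": ["pen"]})   # JSON-like value
```

Because the only access path is the key, a question such as "which carts contain a pen?" has to scan every value in application code, which is the query limitation noted under Constraints.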
Features | Key-value store | Document store | Column oriented store | Graph store
Characteristics | A simple hash table indexed by key; the key is associated with data. | Multiple key-value pairs form a document. Documents are generally stored in JSON format. | Store data in columnar format. Each key is associated with multiple attributes. | Focused on modelling the structure of the data and connected data.
Suitable for | Storing session information, user profiles, preferences, shopping cart data. | Content management systems, blogging platforms, web analytics, real-time analytics, e-commerce applications. | Content management systems, blogging platforms, maintaining counters, expiring usage, heavy write volume such as log aggregation. | Space problems and connected data, such as social networks, spatial data, routing information for goods and money, recommendation engines.
Examples | Riak, Redis, MemcacheDB, Dynamo, Voldemort | MongoDB, CouchDB, ArangoDB, MarkLogic, RethinkDB | BigTable, HBase, Cassandra, Accumulo, Hypertable | Neo4j, OrientDB, Allegro, Virtuoso, InfiniteGraph
Pros | Very fast and scalable. Simple model. Horizontal scalability. | Schema free: unstructured data can be stored easily. Indexed. Easy to query. Simple, powerful data model. | Better for complex read queries. Fast querying of data. Storage of very large quantities of data. Better analytic performance. Improved data compression. | Powerful data model. Locally connected data. Handling complex relational information.
Cons | Stored data have no schema. Poor for complex data. All joins must be done in code. No foreign key constraints. No triggers. | Query model limited to keys and indexes. No standard query syntax. MapReduce for larger queries. Poor for interconnected data. | Very low-level API. Undefined data usage pattern. Increased disk seek time. Increased cost of inserts. Poor for interconnected data. | Traverse entire graph to give correct results. Sharding.

3.2.5 Variations of NoSQL Architectural Patterns

GQ. How do NoSQL data architecture patterns vary as you move from a single processor to multiple processors? (10 Marks)

The key-value store, graph store, Bigtable store, and document store patterns can be modified by focusing on a different aspect of system implementation. There are variations on the architectures that use RAM or solid state drives (SSDs), and the patterns can be used on distributed systems or modified to create enhanced availability. Finally, we'll look at how database items can be grouped together in different ways to make navigation over many items easier.

Customization for RAM or SSD stores

Some NoSQL products are designed to work specifically with one type of memory; for example, Memcache, a key-value store, was specifically designed to see if items are in RAM on multiple servers. A key-value store that only uses RAM is called a RAM cache; it is flexible and has general tools that application developers can use to store global variables, configuration files, or intermediate results of document transformations.
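
A RAM cache as described above can be sketched as a bounded key-value store that keeps only the most recently used items in memory. This is a single-process illustration and not Memcache's actual API; the class name `RamCache`, the `max_items` limit and the least-recently-used eviction policy are assumptions made for the example.

```python
from collections import OrderedDict


class RamCache:
    """Bounded in-memory key-value cache with least-recently-used eviction."""

    def __init__(self, max_items=2):
        self.max_items = max_items
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        """Return the cached value, or None on a cache miss."""
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as recently used
        return self._items[key]

    def set(self, key, value):
        """Cache a value, evicting the least recently used item if full."""
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.max_items:
            self._items.popitem(last=False)  # drop the oldest entry


cache = RamCache(max_items=2)
cache.set("config", {"theme": "dark"})   # e.g. a configuration file
cache.set("step-1", "<intermediate/>")   # e.g. an intermediate result
cache.get("config")                      # touch "config" so it stays fresh
cache.set("step-2", "<final/>")          # evicts "step-1"
```

On a miss the application would fall back to the slower backing store and then call `set`, so the cache only ever holds copies of data that lives elsewhere.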
2004 - Google BigTable is launched.
2005 - CouchDB is launched.
2007 - The research paper on Amazon Dynamo is released.
2008 - Facebook open sources the Cassandra project.
2009 - The term NoSQL was reintroduced.

3.1.3 Why NoSQL?

The concept of NoSQL databases became popular with internet giants like Google, Facebook, Amazon etc., who deal with huge volumes of data. The system response time becomes slow when we use an RDBMS for massive volumes of data. To resolve this problem, we could scale up our system by upgrading our existing hardware, but this process is expensive. A better alternative is to distribute our database load on multiple hosts whenever the load increases. This method is known as scaling out.

NoSQL databases are non-relational, so they scale out better than relational databases, as they are designed with web applications in mind.

So, to resolve the problems related to large volumes of data, NoSQL databases have emerged: exactly the type of database that can handle all sorts of structured, semi-structured, unstructured and rapidly changing data, and big data.

3.1.4 CAP Theorem

GQ. What is CAP Theorem? How is it applicable to NoSQL systems? (4 Marks)

The CAP theorem is also called Brewer's theorem, and it plays an important role in NoSQL databases. The theorem states that it is impossible for a distributed data store to offer more than two out of three guarantees: consistency, availability and partition tolerance.

So basically, some NoSQL databases offer consistency and partition tolerance, while some offer availability and partition tolerance. But partition tolerance is common, as NoSQL databases are distributed in nature, so based on the requirement we can choose which NoSQL database has to be used. Different types of NoSQL databases are available based on data models.

[Fig. 3.1.1: CAP properties: Consistency, Availability, Partition Tolerance, with the pairings CA, CP and AP]

Consistency

This means that the data in the database remains consistent after the execution of an operation. For example, after an update operation all clients see the same data.

Availability

This means that the system is always on (service guarantee availability), with no downtime.

Partition Tolerance

This means that the system continues to function even if the communication among the servers is unreliable, i.e., the servers may be partitioned into multiple groups that cannot communicate with one another.

Theoretically it is impossible to fulfil all 3 requirements. CAP provides the basic requirements for a distributed system to follow 2 of the 3 requirements. Therefore, all the current NoSQL databases follow different combinations of the C, A, P from the CAP theorem.

Here is a brief description of the three combinations CA, CP, AP.

CA - Single site cluster, therefore all nodes are always in contact. When a partition occurs, the system blocks.
CP - Some data may not be accessible, but the rest is still consistent/accurate.
AP - System is still available under partitioning, but some of the data returned may be inaccurate.

The use of the word consistency in CAP and its use in ACID do not refer to the same concept.

In CAP, the term consistency refers to the consistency of the values in different copies of the same data item in a replicated distributed system. In ACID, it refers to the fact that a transaction will not violate the integrity constraints specified on the database schema.

3.1.5 Characteristics / Features of NoSQL

UQ. Describe characteristics of a NoSQL database. (MU-Dec 17, 10 Marks)

1. Non-relational

NoSQL databases never follow the relational model

Never provide tables with flat fixed-column records
Work with self-contained aggregates or BLOBs
Doesn't require object-relational mapping and data normalization
No complex features like query languages, query planners, referential integrity joins, ACID

[Fig. 3.1.2: Non-relational, Schema-free, Simple API, Distributed]

2. Open-source

NoSQL databases don't require expensive licensing fees and can run on inexpensive hardware, rendering their deployment cost-effective.

3. Schema-free

NoSQL databases are either schema-free or have relaxed schemas
Do not require any sort of definition of the schema of the data
Offers heterogeneous structures of data in the same domain

4. Simple API

Offers easy to use interfaces for storage and querying of the data provided
APIs allow low-level data manipulation and selection methods
Text-based protocols, mostly used with HTTP REST with JSON
Mostly no standards-based query language is used
Web-enabled databases running as internet-facing services

5. Distributed

Multiple NoSQL databases can be executed in a distributed fashion
Offers auto-scaling and fail-over capabilities
Often the ACID concept can be sacrificed for scalability and throughput
Mostly no synchronous replication between distributed nodes: asynchronous multi-master replication, peer-to-peer, HDFS replication
Only providing eventual consistency
Shared Nothing Architecture. This enables less coordination and higher distribution.

3.1.6 Advantages and Disadvantages of NoSQL

Advantages of NoSQL

1. Scale (horizontal)
2. SQL databases are vertically scalable. This means that you can increase the load on a single server by increasing things like RAM, CPU or SSD. On the other hand, NoSQL databases are horizontally scalable. This means that you handle more traffic by sharding, or adding more servers in your NoSQL database.
3. Simple data model (fewer joins)
4. Streaming/volume
5. Reliability
6. Schema-less (no modelling or prototyping)
7. Rapid development
8. Flexible as it can handle semi-structured, unstructured and structured data
9. Cheaper than a relational database
10. Creates a caching layer
11. Wide data type variety
12. Uses large binary objects for storing large data
13. Bulk upload
14. Lower administration
15. Distributed storage
16. Real-time analysis

Disadvantages of NoSQL

1. ACID transactions
2. Cannot use SQL
3. Cannot perform searches
4. Data loss
5. No referential integrity
6. Lack of availability of expertise
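
The "asynchronous replication" and "eventual consistency" items listed under the Distributed feature can be made concrete with a small simulation. This is a sketch under simplifying assumptions: it uses one primary and one replica rather than multi-master replication, replication is triggered manually through a hypothetical `replicate()` call instead of a background process, and the class names `Primary` and `Replica` are invented for the example.

```python
from collections import deque


class Replica:
    """A node that serves reads from its own local copy of the data."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


class Primary(Replica):
    """A node that acknowledges writes before replicas have seen them."""

    def __init__(self, replicas):
        super().__init__()
        self.replicas = replicas
        self.pending = deque()  # changes not yet shipped to the replicas

    def write(self, key, value):
        self.data[key] = value             # acknowledged immediately
        self.pending.append((key, value))  # replicated later

    def replicate(self):
        """Ship all pending changes; afterwards every copy agrees again."""
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas:
                replica.data[key] = value


replica = Replica()
primary = Primary([replica])
primary.write("stock:pen", 10)
stale = replica.read("stock:pen")   # replica has not caught up yet
primary.replicate()
fresh = replica.read("stock:pen")   # replica now agrees with the primary
```

Between `write` and `replicate` a client reading from the replica sees old data. That window is what "eventual" refers to, and it corresponds to the AP trade-off described in Section 3.1.4.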

The MarkLogic architecture, moving queries to documents rather than moving documents to the query server, allowed them to achieve linear scalability with petabytes of documents.

MarkLogic found a demand for their products in US federal systems that stored terabytes of intelligence information and large publishing entities that wanted to store and search their XML documents. Since 2001, MarkLogic has matured into a general-purpose highly scalable document store with support for ACID transactions and fine-grained, role-based access control. Initially, the primary language of MarkLogic developers was XQuery paired with REST; newer versions support Java as well as other language interfaces.

MarkLogic is a commercial product that requires a software license for any datasets over 40 GB. NoSQL is associated with commercial as well as open-source products that provide innovative solutions to business problems.

3.3 NoSQL SOLUTION FOR BIG DATA

A big data class problem is any business problem that's so large that it can't be easily managed using a single processor. Big data problems force you to move away from a single-processor environment toward the more complex world of distributed computing. Though great for solving big data problems, distributed computing environments come with their own set of challenges.

[Fig. 3.3.1: One database (easy to understand, easy to set up and configure, easy to administer, single source of truth, limited scalability) versus many databases (data partitioning, replication, clustering, query distribution, load balancing, consistency/synchronization, latency/concurrency, clock synchronization, network bottlenecks/failures, multiple data centers, distributed backup, node failure, voting algorithms for error detection, administration of many systems, monitoring, scalable if designed correctly)]

Fig. 3.3.1 shows some of the challenges you face when you move from a single processor to a distributed computing system. Moving to a distributed environment is a nontrivial endeavour and should be done only if the business problem really warrants the need to handle large data volumes in a short period of time. This is why platforms like Hadoop are complex and require a complex framework to make things easier for the application developer.

Big data problems are a primary use case for NoSQL. But we consider that NoSQL includes concepts and use cases that can be managed by a single processor and have a positive impact on agility and data quality.

Here are some typical big data use cases:

(1) Bulk Image Processing (2) Public Web Page Data
(3) Remote Sensor Data (4) Event Log Data
(5) Mobile Phone Data (6) Social Media Data
(7) Game Data (8) Open Linked Data

(1) Bulk Image Processing

Organizations like NASA regularly receive terabytes of incoming data from satellites or even rovers on Mars. NASA uses a large number of servers to process these images and perform functions like image enhancement and photo stitching.

Medical imaging systems like CAT scans and MRIs need to convert raw image data into formats that are useful to doctors and patients. Custom imaging hardware has been found to be more expensive than renting a large number of processors on the cloud when they're needed.

For example, the New York Times converted 3.3 million scans of old newspaper articles into web formats using tools like Amazon EC2 and Hadoop for a few hundred dollars.

(2) Public Web Page Data

Publicly accessible pages are full of information that organizations can use to be more competitive. They contain news stories, RSS feeds, new product information, product reviews, and blog postings. Not all of the information is authentic.

There are millions of pages of fake product reviews created by competitors or third parties paid to disparage other sites. Finding out which product reviews are valid is a topic for careful analysis.

(3) Remote Sensor Data

Small, low-power sensors can now track almost any aspect of our world. Devices installed on vehicles track location, speed, acceleration, and fuel consumption, and tell your insurance company about your driving habits.

Road sensors can warn about traffic jams in real time and suggest alternate routes. You can even track the moisture in your garden, lawn, and indoor plants to suggest a watering plan for your home.

(4) Event Log Data

Computer systems create logs of read-only events from web page hits (also called clickstreams), email messages sent, or login attempts.

Each of these events can help organizations understand who's using what resources and when systems may not be performing according to specification.

Event log data can be fed into operational intelligence tools to send alerts to users when key indicators fall out of acceptable ranges.

(5) Mobile Phone Data

Every time users move to new locations, applications can track these events. You can see when your friends are around you or when customers walk through your retail store.

Although there are privacy issues involved in accessing this data, it's forming a new type of event stream that can be used in innovative ways to give companies a competitive advantage.

(6) Social Media Data

Social networks such as Twitter, Facebook, and LinkedIn provide a continuous real-time data feed that can be used to see relationships and trends.

Each site creates data feeds that you can use to look at trends in customer mood or get feedback on your own as well as competitor products.

(7) Game Data

Games that run on PCs, video game consoles, and mobile devices have back-end datasets that need to scale quickly. These games store and share high scores for all users as well as game data for each player.

Game site back ends must be able to scale by orders of magnitude if viral marketing campaigns catch on with their users.

(8) Open Linked Data

Not only is this data large, but it may require complex tools to reconcile, remove duplication, and find invalid items.

Some of these use cases include problems like image and signal processing. Their focus is on efficient and reliable data transformation at scale. These use cases don't need the query or transaction support provided by many NoSQL systems. They read and write to key-value stores or distributed filesystems like Amazon's Simple Storage Service (S3) or the Hadoop Distributed File System (HDFS) and may not need the advanced features of a document store or an RDBMS. Other use cases are more demanding and need more features. Big data problems like event log data and game data need to store their data directly into structures that can be queried and analysed, so they will need different NoSQL solutions.

To be a good candidate for a general class of big data problems, NoSQL solutions should:

1. Be efficient with input and output, and scale linearly with growing data size.
2. Be operationally efficient. Organizations can't afford to hire many people to run the servers.
3. Require that reports and analyses be performed by nonprogrammers using simple tools; not every business can afford a full-time Java programmer to write on-demand queries.
4. Meet the challenges of distributed computing, including consideration of latency between systems and eventual node failures.
5. Meet both the needs of overnight batch processing economy-of-scale and time-critical event processing.

(NoSOL)Pageno.(3.n Data
Analytlcs (MU-Som7-COMP) (NoSQL).Pagono.(3-31)
Big
7-COMP)
Biq Data
Analyics (MU-Sem
be
customized
to solve some
big data . he a web page click or an out-of-memorywarning a diak drive. In tho past, tun
on
and effort, to many
6 RDBMS can,
with enough
time

be
rewritten
to
distribute SQL queries
redesigned to processor
remo
move
cost
and amount of evont data producod wero no largo thut nany orgnnizntionn optod
... between tables that are physically located on different nodes. SQL systems can be configured to use replication and other data synchronization processes, yet these steps all take considerable time and money.

3.4 UNDERSTANDING THE TYPES OF BIG DATA PROBLEMS

There are many types of big data problems, each requiring a different combination of NoSQL systems. After you've categorized your data and determined its type, you'll find there are different solutions.

Fig. 3.4.1 is a good example of a high-level big data classification system.

Fig. 3.4.1: A sample of a taxonomy of big data types (big data splits into read-mostly and read-write; labels include event log, documents, graph, full-text, high availability and transactions)

Some ways you classify big data problems and see how NoSQL systems are changing the way organizations use data:

(1) Read-mostly: Read-mostly data is the most common classification. It includes data that is created once and rarely altered. This type of data is typically found in data warehouse applications, but it is also identified as a set of non-RDBMS items like images or video, event-logging data, published documents, or graph data. Event data includes things like retail sales events, hits on a website, system logging data, or real-time sensor data.

(2) Log events: When operational events occur in your enterprise, you can record them in a log file and include a timestamp so you know when each event occurred. In the past, many organizations opted not to gather or analyse this data. Today, NoSQL systems are changing companies' thoughts on the value of log data, as the cost to store and analyse it is more affordable.

The ability to cost-effectively gather and store log events from all computers in your enterprise has led to BI operational intelligence systems. Operational intelligence goes beyond analysing trends in your web traffic or retail transactions. It can integrate information from network monitoring systems so you can detect problems before they impact your customers. Cost-effective NoSQL systems can be part of good operations management solutions.

(3) Full-text documents: This category of data includes any document that contains natural-language text like the English language. An important aspect of document stores is that you can query the entire contents of your office documents in the same way you would query rows in your SQL system.

This means that you can create new reports that combine traditional data in RDBMSs as well as the data within your office documents. For example, you could create a single query that extracted all the authors and titles of PowerPoint slides that contained the keywords NoSQL or big data. The resulting list of authors could then be filtered with a list of titles in the HR database to show which people had the title of Data Architect or Solution Architect.

This is a good example of how organizations are trying to tap into the hidden skills that already exist within an organization for training and mentorship. Integrating documents into what can be queried is opening new doors in knowledge management and efficient staff utilization.

3.5 ANALYZING BIG DATA WITH A SHARED-NOTHING ARCHITECTURE

Explain Shared Nothing architecture in detail. (10 Marks)

There are three ways that resources can be shared between computer systems: shared RAM, shared disk, and shared nothing. Fig. 3.5.1 shows a comparison of these three distributed computing architectures. Of the three alternatives, a shared-nothing architecture is the most cost effective in terms of cost per processor when you're using commodity hardware.

The initial versions of Hadoop (frequently referred to as the 1.x versions) were designed to use a master-slave architecture, with the NameNode of a cluster being responsible for managing the status of the cluster. NameNodes usually don't deal with any MapReduce data themselves. Their job is to manage and distribute queries to the correct nodes on the cluster. Hadoop 2.x versions are designed to remove single points of failure from a Hadoop cluster.

Using the right distribution model will depend on your business requirements: if high availability is a concern, a peer-to-peer network might be the best solution. If you can manage your big data using batch jobs that run in off hours, then the simple master-slave model might be best. As we move to the next section, you'll see how MapReduce systems can be used in multiprocessor configurations to process your big data.

3.7 NOSQL SYSTEMS TO HANDLE BIG DATA PROBLEMS

1. Moving queries to the data, not data to the queries

With the exception of large graph databases, most NoSQL systems use commodity processors that each hold a subset of the data on their local shared-nothing drives.

When a client wants to send a general query to all nodes that hold data, it's more efficient to send the query to each node than it is to transfer large datasets to a central processor. This may seem obvious, but it's amazing how many traditional databases still can't distribute queries and aggregate query results.

This simple rule helps you understand how NoSQL databases can have dramatic performance advantages over systems that weren't designed to distribute queries to the data nodes.

Consider an RDBMS that has tables distributed over two different nodes. In order for the SQL query to work, information about rows on one table must all be moved across the network to the other node.

Larger tables result in more data movement, which results in slower queries. Think of all the steps involved. The tables must be extracted, serialized, sent through the network interface, transmitted over networks, reassembled, and then compared on the server with the SQL query.

Keeping all the data within each data node in the form of logical documents means that only the query itself and the final result need to be moved over a network. This keeps your big data queries fast.

2. Using hash rings to evenly distribute data on a cluster

One of the most challenging problems with distributed databases is figuring out a consistent way of assigning a document to a processing node.

Using a hash ring technique with a randomly generated 40-character key is a good way to evenly distribute big data loads, and therefore network load, over many servers.

Hash rings are common in big data solutions because they consistently determine how to assign a piece of data to a specific processor. Hash rings take the leading bits of a document's hash value and use this to determine which node the document should be assigned to. This allows any node in a cluster to know what node the data lives on and how to adapt to new assignment methods as your data grows.

Partitioning keys into ranges and assigning different key ranges to specific nodes is known as keyspace management. Most NoSQL systems, including MapReduce, use keyspace concepts to manage distributed computing problems.

The concept of a hash ring can also be extended to include the requirement that an item must be stored on multiple nodes. When a new item is created, the hash ring rules might indicate both a primary and a secondary copy of where an item is stored. If the node that contains the primary fails, the system can look up the node where the secondary copy is stored.

3. Using replication to scale reads

Fig. 3.7.1: Duplication of the data to increase the performance of NoSQL system (labels: client requests, read/write data nodes, replica data nodes, copy data)
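The hash ring idea described under item 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SHA-1 is used because its hex digest happens to be 40 characters long, the node names are hypothetical, the choice of 16 leading bits is arbitrary, and placing the secondary copy on the next node of the ring is one possible rule, not the rule of any particular NoSQL product.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names
LEADING_BITS = 16  # assumption: number of leading hash bits used to pick a node


def hash_key(doc_id: str) -> str:
    """Return a 40-character hex key (SHA-1 is an assumption of this sketch)."""
    return hashlib.sha1(doc_id.encode("utf-8")).hexdigest()


def leading_bits(key: str, bits: int = LEADING_BITS) -> int:
    """Interpret the first `bits` bits of the 160-bit key as an integer."""
    return int(key, 16) >> (160 - bits)


def assign(doc_id: str, nodes=NODES):
    """Return the (primary, secondary) nodes for a document."""
    slot = leading_bits(hash_key(doc_id))
    # Keyspace management: split the 2**LEADING_BITS slots into equal
    # key ranges, one range per node.
    index = slot * len(nodes) // (1 << LEADING_BITS)
    primary = nodes[index]
    secondary = nodes[(index + 1) % len(nodes)]  # next node on the ring
    return primary, secondary


def lookup(doc_id: str, down=frozenset(), nodes=NODES) -> str:
    """Return the node to read from, falling back to the secondary copy."""
    primary, secondary = assign(doc_id, nodes)
    return secondary if primary in down else primary
```

Because the assignment depends only on the document identifier and the node list, any node can compute `assign("report.docx")` and reach the same answer without consulting a central directory. If the primary is listed in `down`, `lookup` returns the secondary instead.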
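The replication arrangement of Fig. 3.7.1, together with the inconsistent read that replication lag can cause, can be sketched as follows. All class and method names here are hypothetical, replication is triggered by hand through `replicate()` so that the lag is visible, and the session simply pins reads to the primary for any key it has written. Real systems differ in how and when they copy updates.

```python
class Node:
    """A data node holding a simple key-value store."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class Cluster:
    """One read/write primary plus replicas updated only when replicate() runs."""

    def __init__(self, replica_count=2):
        self.primary = Node("primary")
        self.replicas = [Node(f"replica-{i}") for i in range(replica_count)]
        self.pending = []  # updates written to the primary but not yet copied

    def write(self, key, value):
        self.primary.data[key] = value
        self.pending.append((key, value))

    def replicate(self):
        """Simulate the end of the lag: copy pending updates to every replica."""
        for key, value in self.pending:
            for replica in self.replicas:
                replica.data[key] = value
        self.pending.clear()


class Session:
    """Application-layer session that reads its own writes from the primary."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.written = set()  # keys this session has written

    def write(self, key, value):
        self.cluster.write(key, value)
        self.written.add(key)

    def read(self, key):
        if key in self.written:
            # Read from the same node that accepted the write.
            return self.cluster.primary.data.get(key)
        # Other reads are served by a replica to scale read traffic.
        return self.cluster.replicas[0].data.get(key)
```

If one session writes `"cart"` and reads it back, it sees the new value at once. A second session reading `"cart"` from a replica gets `None` until `replicate()` has run, which is the inconsistent read. Pinning a session's reads to the write node is the application-layer remedy.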

Fig. 3.7.1 shows how you can replicate data to speed read performance in NoSQL systems. All incoming client requests enter from the left. All reads can be directed to any node, either a primary read/write node or a replica node. All write transactions can be sent to a central read/write node that will update the data and then automatically send the updates to replica nodes. The time between the write to the primary and the time the update arrives on the replica nodes determines how long it takes for reads to return consistent results.

There are only a few times when you must be concerned about the lag time between a write to the read/write node and a client reading that same record from a replica. One of the most common operations after a write is a read of that same record. If a client does a write and then an immediate read from that same node, there's no problem. The problem occurs if a read occurs from a replica node before the update happens. This is an example of an inconsistent read.

The best way to avoid this type of problem is to only allow reads to the same write node after a write has been done. This logic can be added to a session or state management system at the application layer. Almost all distributed databases relax database consistency rules when a large number of nodes permit writes. If your application needs fast read/write consistency, you must deal with it at the application layer.

4. Letting the database distribute queries evenly to data nodes

NoSQL systems move the query to a data node, but don't move data to a query node. In the example of Fig. 3.7.2, all incoming queries arrive at query analyzer nodes. These nodes then forward the queries to each data node. If the data nodes have matches, the documents are returned to the query node. The query won't return until all data nodes (or a replica responding on a node's behalf) have responded to the original query request. If a data node is down, a query can be redirected to a replica of that data node.

The approach shown in Fig. 3.7.2 is one of moving the query to the data rather than moving the data to the query. This is an important part of NoSQL big data strategies. In this instance, moving the query is handled by the database server, and distribution of the query and waiting for all nodes to respond is the sole responsibility of the database, not the application layer.

This approach is somewhat similar to the concept of federated search. Federated search takes a single query and distributes it to distinct servers and then combines the results together to give the user the impression they're searching a single system. In some cases, these servers may be in different geographic regions. In this case, you're sending your query to a single cluster that's not only performing search queries on a single local cluster but also performing update and delete operations.
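The role of a query analyzer node can be sketched as a small scatter-gather routine: send the query to every data node, wait for all of them, fall back to a replica when a node is down, and merge the partial results. The names `DataNode`, `query_one` and `distribute_query` are hypothetical, and a keyword match over in-memory strings stands in for a real query.

```python
from concurrent.futures import ThreadPoolExecutor


class DataNode:
    """A data node that evaluates a query against its own local documents."""

    def __init__(self, name, documents, replica=None, up=True):
        self.name = name
        self.documents = documents
        self.replica = replica  # optional DataNode holding a copy of the data
        self.up = up

    def search(self, keyword):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return [doc for doc in self.documents if keyword in doc]


def query_one(node, keyword):
    """Query one data node, redirecting to its replica if the node is down."""
    try:
        return node.search(keyword)
    except ConnectionError:
        if node.replica is None:
            raise
        return node.replica.search(keyword)


def distribute_query(nodes, keyword):
    """Send the query to every data node and merge the results."""
    if not nodes:
        return []
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        # list() blocks until every node (or its replica) has responded.
        partials = list(pool.map(lambda n: query_one(n, keyword), nodes))
    return sorted(doc for part in partials for doc in part)
```

For example, with one healthy node holding `["nosql intro", "sql basics"]` and one failed node whose replica holds `["big data and nosql"]`, `distribute_query(nodes, "nosql")` returns `["big data and nosql", "nosql intro"]`. Only the keyword and the matching documents cross the network; the full data sets stay on their nodes.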
In order to get high performance from queries that span multiple nodes, it's important to separate the concerns of query evaluation from query execution. Fig. 3.7.2 shows this structure.

Fig. 3.7.2: Distribute query to all the data nodes (labels: incoming queries, query analyzer, primary data nodes, replica data nodes, copy data)

Chapter Ends.