Bda Chapter 3 This Is The Notes of Bda
Big Data Analytics (MU-Sem 7-COMP)
3.1.7 Difference between RDBMS and NoSQL
UQ. Differentiate between an RDBMS and a NoSQL database. (MU - Dec. 18)
3.1.8 NoSQL Business Drivers
The scientist-philosopher Thomas Kuhn coined the term paradigm shift to identify a
recurring process he observed in science, where innovative ideas came in bursts and
impacted the world in nonlinear ways. We'll use Kuhn's concept of the paradigm shift as a
way to think about and explain the NoSQL movement and the changes in thought
patterns, architectures, and methods emerging today.
Many organizations support single-CPU relational systems that have fulfilled the needs of
their organizations as per the requirements. Businesses have found value in rapidly
capturing and examining huge quantities of variable data.
Fig. 3.1.3 : In this figure, we see how the business drivers volume, velocity, variability, and
agility create pressure on a single-CPU system, resulting in cracks. Volume and velocity
refer to the capability to handle and manage big datasets that arrive quickly.
Variability refers to how various data types do not fit within structured tables, and
agility refers to how fast an organization responds to business change.
The four business drivers are : (1) Volume (2) Velocity (3) Variability (4) Agility.
1. Volume
Undoubtedly, the main factor forcing organizations to look at alternatives to their
existing RDBMS is the need to analyze big data using clusters of commodity processors.
Until around 2005, performance problems were resolved by purchasing faster
processors. Over time, speeding up a single processor stopped being an option.
As chip density increases, heat can no longer dissipate rapidly enough
without the chip overheating. This phenomenon, known as the power wall, forced
system designers to shift their focus from increasing speeds on a single chip to
using multiple processors working together.
(New Syllabus w.e.f academic year 22-23) (M7-80) Tech-Neo Publications...A SACHIN SHAH Venture
2. Velocity
Though big data issues are a consideration for many organizations using RDBMSs, the
ability of a single-processor system to read and write data quickly is also important.
Many single-processor RDBMSs cannot meet the demand for real-time inserts and online
queries to the databases created by public-facing websites.
RDBMSs frequently index multiple columns of each new row, which reduces system performance.
When single-processor RDBMSs are used as a back end in front of a web store, random
outbursts in web traffic reduce the response for everyone, and tuning the system can be
expensive when both high read and write throughput is required.
3. Variability
Companies that seek to capture and report exceptional data run into conflict when
attempting to use the rigorous database schema structure imposed by an RDBMS.
For example, a business unit may want to capture some custom fields for a specific case.
4. Agility
The most complex part of building applications using an RDBMS is the process of
putting data into and getting data out of the database.
If your data has nested and repeated subgroups of data, you need to include an
object-relational mapping layer.
The data stored in NoSQL follows any of the four data architecture patterns.
Usually, the value is connected or correlated to the key. Databases for key-value
pair storage typically store information as a hash table where each key is unique.
The value can take any form (JavaScript Object Notation (JSON), Binary Large Object (BLOB), strings, etc.).
Application : This style of architecture is commonly used in shopping websites or e-commerce applications.
Among its important assets are its ability to handle heavy loads and wide volumes of data,
and the ease with which keys are used to retrieve data.
Key                        Value
user-1233                  John Do
image-13P9                 (binary image)
http://webpage-123.htm     (web page)
file-123.pdf               (PDF document)
Fig. 3.2.1 : An example of Key-Value
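The hash-table idea behind key-value storage can be sketched in a few lines of Python. This is only an illustrative sketch, not the API of any particular NoSQL product: the class name `KeyValueStore`, its methods, and the key `cart-1233` are hypothetical, while `user-1233` and `file-123.pdf` follow the figure above.

```python
class KeyValueStore:
    """Toy in-memory key-value store backed by a hash table (a Python dict)."""

    def __init__(self):
        self._table = {}  # dict keys are unique, like keys in a key-value store

    def put(self, key, value):
        # Writing to an existing key replaces its value; the key stays unique.
        self._table[key] = value

    def get(self, key, default=None):
        # Lookup by key is the only query this store supports.
        return self._table.get(key, default)

    def delete(self, key):
        # Returns True if the key existed and was removed, False otherwise.
        if key in self._table:
            del self._table[key]
            return True
        return False


store = KeyValueStore()
store.put("user-1233", "John Do")                        # string value
store.put("file-123.pdf", b"%PDF-1.4 ...")               # BLOB value (bytes)
store.put("cart-1233", {"items": ["pen"], "total": 2})   # JSON-like value (hypothetical key)
```

The store never inspects the value, which is why the same interface can hold strings, BLOBs, or JSON documents.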
[Table : pros and cons of the NoSQL data architecture patterns. Entries recoverable from the source : horizontal scalability, analytic performance, improved data compression, sharding, MapReduce for larger queries; very low-level API, query model limited to keys and indexes, no standard query syntax, stored data have no schema, undefined data usage pattern, poor for complex data, poor for interconnected data, must traverse the entire graph to give correct results, increased disk seek time, increased cost of inserts, all joins must be done in code, no foreign key constraints, no triggers.]
GQ. How do NoSQL data architecture patterns vary as you move from a single processor to multiple processors? (10 Marks)
The key-value store, graph store, Bigtable store, and document store patterns can
be modified by focusing on a different aspect of system implementation. We look at
variations on the architectures that use RAM or solid state drives (SSDs), and at how
the patterns can be used on distributed systems or modified to create enhanced
availability. Finally, we'll look at how database items can be grouped together in
different ways to make navigation over many items easier.
Customization for RAM or SSD stores
Some NoSQL products are designed to work specifically with one type of memory. For
example, Memcache, a key-value store, was specifically designed to see if items are in RAM
on multiple servers. A key-value store that only uses RAM is called a RAM cache; it
is flexible and has general tools that application developers can use to store global variables,
configuration files, or intermediate results of document transformations.
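A RAM cache that holds intermediate results of document transformations can be sketched as follows. This is a minimal single-process illustration and does not reproduce the Memcache API; `RamCache`, `get_or_compute`, `transform_document`, and the key `doc-1:upper` are hypothetical names chosen for the example.

```python
class RamCache:
    """Toy RAM-only cache: values live in a dict and vanish when the process exits."""

    def __init__(self):
        self._items = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        # Return the cached value if present; otherwise compute it once and keep it.
        if key in self._items:
            self.hits += 1
            return self._items[key]
        self.misses += 1
        value = compute()
        self._items[key] = value
        return value


def transform_document(text):
    # Stand-in for an expensive document transformation (hypothetical).
    return text.strip().upper()


cache = RamCache()
doc = "  nosql notes  "
first = cache.get_or_compute("doc-1:upper", lambda: transform_document(doc))
second = cache.get_or_compute("doc-1:upper", lambda: transform_document(doc))
```

The second call is served from RAM without running the transformation again, which is the benefit the paragraph above describes.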
2004 - Google BigTable is launched.
2005 - CouchDB is launched.
2007 - The research paper on Amazon Dynamo is released.
2008 - Facebook open sources the Cassandra project.
2009 - The term NoSQL was reintroduced.
3.1.3 Why NoSQL?
The concept of NoSQL databases became popular with internet giants like Google,
Facebook, Amazon, etc., who deal with huge volumes of data. The system response time
becomes slow when we use an RDBMS for massive volumes of data. To resolve this problem,
we could scale up our system by upgrading our existing hardware, but this process is
expensive. The alternative is to distribute our database load on multiple hosts.
So, to handle exactly this type of data, NoSQL databases emerged. As they were designed
with the web in mind, they can resolve all sorts of problems related to large volumes of
structured, semi-structured, and unstructured data, rapidly changing data, and big data.
3.1.4 CAP Theorem
GQ. What is CAP Theorem? How is it applicable to NoSQL systems? (4 Marks)
CAP theorem is also called Brewer's theorem. It plays an important role in NoSQL
databases. The theorem states that it is impossible for a distributed data store to offer
more than two out of three guarantees :
Consistency
This means that the data in the database remains consistent after the execution of an
operation. For example, after an update operation all clients see the same data.
Availability
This means that the system is always on (service guarantee availability), with no downtime.
Partition Tolerance
This means that the system continues to function even if the communication among the
servers is unreliable, i.e., the servers may be partitioned into multiple groups that
cannot communicate with one another.
Fig. 3.1.1 : CAP Property (Consistency, Availability, Partition Tolerance, with the combinations CA, CP, and AP)
So basically, some NoSQL databases offer consistency and partition tolerance, while some
offer availability and partition tolerance. But partition tolerance is common, as NoSQL
databases are distributed in nature. So, based on the requirement, we can choose which
NoSQL database has to be used. Different types of NoSQL databases are available
based on data models.
Current NoSQL databases follow different combinations of the C, A, P from the CAP
theorem. Here is a brief description of the three combinations CA, CP, AP.
CA - Single site cluster, therefore all nodes are always in contact. When a partition
occurs, the system blocks.
CP - Some data may not be accessible, but the rest is still consistent/accurate.
AP - System is still available under partitioning, but some of the data returned may
be inaccurate.
The use of the word consistency in CAP and its use in ACID do not refer to the same
concept. In CAP, the term consistency refers to the consistency of the values in different
copies of the same data item in a replicated distributed system. In ACID, it refers to the
fact that a transaction will not violate the integrity constraints specified on the
database schema.
3.1.5 Characteristics / Features of NoSQL
UQ. Describe characteristics of a NoSQL database. (MU - Dec. 17, 10 Marks)
1. Non-relational
NoSQL databases never follow the relational model
Never provide tables with flat fixed-column records
Work with self-contained aggregates or BLOBs
Doesn't require object-relational mapping and data normalization
No complex features like query languages, query planners, referential integrity joins, ACID
Fig. 3.1.2 (figure labels recoverable : Distributed, Schema-free, Supports new generation Web applications)
2. Open-source
NoSQL databases don't require expensive licensing fees and can run on inexpensive
hardware, rendering their deployment cost-effective.
3. Schema-free
NoSQL databases are either schema-free or have relaxed schemas
Do not require any sort of definition of the schema of the data
Offers heterogeneous structures of data in the same domain
4. Simple API
Offers easy to use interfaces for storage and querying of the data provided
APIs allow low-level data manipulation and selection methods
Text-based protocols, mostly used with HTTP REST with JSON
Mostly no standard-based query language is used
Web-enabled databases running as internet-facing services
5. Distributed
Multiple NoSQL databases can be executed in a distributed fashion
Mostly no synchronous replication between distributed nodes; instead asynchronous
multi-master replication, peer-to-peer, HDFS replication
Only providing eventual consistency
Advantages of NoSQL
1. Scale (horizontal)
2. SQL databases are vertically scalable. This means that you can increase the load
on a single server by increasing things like RAM, CPU or SSD. But on the other
hand, NoSQL databases are horizontally scalable. This means that you handle
more traffic by sharding, or adding more servers in your NoSQL database.
3. Simple data model (fewer joins)
4. Streaming/volume
5. Reliability
6. Schema-less (no modelling or prototyping)
7. Rapid development
8. Flexible as it can handle semi-structured, unstructured and structured data.
9. Cheaper than relational database
10. Creates a caching layer
11. Wide data type variety
12. Uses large binary objects for storing large data
15. Distributed storage
16. Real-time analysis
Disadvantages of NoSQL
1. ACID transactions
2. Cannot use SQL
3. Cannot perform searches
4. Data loss
5. No referential integrity
6. Lack of availability of expertise
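Horizontal scaling by sharding, listed among the advantages above, needs a rule for deciding which server holds each key. One common choice is a hash ring (consistent hashing); it is used here as an assumption for illustration, not as something this section prescribes. The class `HashRing`, the server names, and the `points_per_server` parameter are hypothetical.

```python
import bisect
import hashlib


def _hash(value):
    # Stable hash; Python's built-in hash() for str changes between processes.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Maps keys to servers so that adding a server moves only part of the keys."""

    def __init__(self, servers, points_per_server=100):
        self._points_per_server = points_per_server
        self._ring = []  # sorted list of (point on the ring, server name)
        for server in servers:
            self.add_server(server)

    def add_server(self, server):
        # Each server owns several points on the ring to spread keys more evenly.
        for i in range(self._points_per_server):
            bisect.insort(self._ring, (_hash(f"{server}#{i}"), server))

    def server_for(self, key):
        # Walk clockwise to the first point at or after the key's hash.
        index = bisect.bisect_left(self._ring, (_hash(key), ""))
        if index == len(self._ring):
            index = 0  # wrap around the ring
        return self._ring[index][1]


keys = [f"user-{n}" for n in range(1000)]
ring = HashRing(["server-a", "server-b", "server-c"])
before = {k: ring.server_for(k) for k in keys}

ring.add_server("server-d")  # scale out by adding one more server
after = {k: ring.server_for(k) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])
```

Every key that changes owner lands on the newly added server, and all other keys stay where they were, which is what makes adding servers cheaper than rehashing everything.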
This architecture, which moves queries to the documents rather than moving documents to
the query server, allowed them to scale to petabytes of documents.
MarkLogic found a demand for their products in US federal systems that stored
terabytes of intelligence information and in large publishing entities that wanted to
store and search their XML documents. Since 2001, MarkLogic has matured into a
general-purpose, highly scalable document store with support for ACID transactions and
fine-grained, role-based access control. Initially, the primary language of MarkLogic
developers was XQuery paired with REST; newer versions also support Java. MarkLogic
requires a software license for any datasets over 40 GB.
NoSQL SOLUTION FOR BIG DATA
NoSQL includes concepts that provide innovative solutions to business problems and have
a positive impact on agility and data quality, but we consider big data a primary use
case for NoSQL.
A big data class problem is any business problem that's so large that it can't be easily
managed using a single processor. Big data problems force you to move from a
single-processor environment toward the more complex world of distributed computing.
Though great for solving big data problems, distributed computing environments come with
their own set of challenges.
Figure : Some of the challenges you face when you move from a single processor to a distributed computing environment (labels recoverable : easy to understand, easy to set up and configure, easy to administer, single source of truth, limited scalability, bottlenecks/failures, data partitioning, replication, clustering, query distribution, load balancing, consistency/synchronization, latency/concurrency, multiple data centers, distributed failure/backup, voting algorithms, administration for error detection, monitoring of many systems)
Moving to a distributed computing platform is a demanding endeavour and should be done
only if the problem really warrants handling large volumes of data in a short period of
time. This is why such platforms need a framework to make things easier for the
application developer.
Use cases associated with NoSQL :
(1) Bulk Image Processing
(2) Public Web Page Data
(3) Remote Sensor Data
(4) Event Log Data
(5) Mobile Phone Data
(7) Game Data
(8) Open Linked Data
(1) Bulk Image Processing
Organizations like NASA regularly receive terabytes of incoming data from rovers on
Mars. NASA processes these images for tasks such as photo stitching.
Medical imaging systems like CAT scans and MRIs need to convert raw image data into
formats that are useful to doctors and patients.
For example, the New York Times converted 3.3 million scans of old newspaper
articles into web formats using tools like Amazon EC2 and Hadoop for a few
hundred dollars.
(2) Public Web Page Data
There are millions of pages of fake product reviews created by competitors or third
parties paid to disparage other sites. Finding out which product reviews are valid is a
topic for careful analysis.
(3) Remote Sensor Data
Small, low-power sensors can now track almost any aspect of our world. Devices
installed on vehicles track location, speed, acceleration, and fuel consumption, and tell
your insurance company about your driving habits. Sensors can also suggest alternate
routes in real time, or suggest a watering plan for your home.
(4) Event Log Data
Computer systems create logs of read-only events from web page hits (also called
clickstreams), email messages sent, or login attempts. Each of these events can help
organizations understand who's using what resources and when systems may not be
performing according to specification. Event log data can be fed into operational
intelligence tools to send alerts to users when key indicators fall out of acceptable ranges.
(5) Mobile Phone Data
(7) Game Data
Games that run on PCs, video game consoles, and mobile devices have back-end datasets
that need to scale quickly. These games store and share high scores for all users as well
as game data for each player. Game site back ends must be able to scale by orders of
magnitude if viral marketing campaigns catch on with their users.
(8) Open Linked Data
Some use cases include problems like image and signal processing. Their focus is on
efficient and reliable data transformation at scale. These use cases don't need the query
or transaction support provided by many NoSQL systems. They read and write to key-value
stores or distributed filesystems like Amazon's Simple Storage Service (S3) or Hadoop
Distributed File System (HDFS) and may not need the advanced features of a document
store or an RDBMS. Other use cases are more demanding and need more features. Big data
problems like event log data and game data need to store their data directly into
structures that can be queried and analysed, so they will need different NoSQL solutions.
An RDBMS can, with enough time and effort, be customized to solve some big data
problems. Applications can be rewritten to distribute SQL queries to many processors.
Databases can be redesigned to remove joins between tables that are physically located
on different nodes. These steps all take considerable cost and effort.
3.4 UNDERSTANDING THE TYPES OF BIG DATA PROBLEMS
There are many types of big data problems, each requiring a different combination of
NoSQL systems. After you've categorized your data and determined its type, you'll find
there are different solutions.
Fig. 3.4.1 is a good example of a high-level big data classification system.
Fig. 3.4.1 : A sample of a taxonomy of big data types (Big data : Read-mostly - Event-log, Documents, Graph; Read-write - High availability, Transactions)
Here are some ways you can classify big data problems and see how NoSQL systems are
changing the way organizations use data.
(1) Read-mostly : Read-mostly data is the most common classification. It includes data
that's created once and rarely altered. This type of data is typically found in data
warehouse applications but is also identified as a set of non-RDBMS items like images
or video, event-logging data, published documents, or graph data. Event data includes
things like retail sales events, hits on a website, system logging data, or real-time
sensor data.
(2) Log events : When operational events occur in your enterprise, you can record them
in a log file and include a timestamp so you know when the event occurred. A log event
may be a web page click or an out-of-memory warning on a disk drive. In the past, the
cost and amount of event data produced were so large that many organizations opted not
to gather or analyse it. Today, NoSQL systems are changing companies' thoughts on the
value of log data. Operational intelligence goes beyond analysing trends in your web
traffic or retail transactions. It can integrate information from network monitoring
systems so you can detect problems before they impact your customers. Cost-effective
NoSQL systems can be part of a good operations management solution.
(3) Full-text documents : This category of data includes any document that contains
natural-language text like the English language. An important aspect of document
stores is that you can query the entire contents of your office documents in the same
way you would query rows in your SQL system.
This means that you can create new reports that combine traditional data as well as
the data within your office documents. For example, you could create a single query
that extracted all the authors and titles of PowerPoint slides that contained the
keywords NoSQL or big data. The resulting list of authors could then be filtered with
a list of titles in the HR database to show which people had the title of Data Architect.
This is a good example of how organizations are trying to tap into the hidden skills
that already exist within an organization for training and mentorship. Integrating
documents into what can be queried is opening new doors in knowledge management and
efficient staff utilization.
3.5 ANALYZING BIG DATA WITH A SHARED-NOTHING ARCHITECTURE
GQ. Explain Shared Nothing architecture in detail. (10 Marks)
There are three ways that resources can be shared between computer systems : shared
RAM, shared disk, and shared nothing. Fig. 3.5.1 shows a comparison of these three
distributed computing architectures. Of the three alternatives, a shared-nothing
architecture is the most cost-effective in terms of cost per processor when you're using
commodity hardware.
The initial versions of Hadoop use a master-slave architecture in which NameNodes are
designed to manage the status of the cluster and to distribute queries and network load.
NameNodes usually don't deal with the data themselves. Hadoop 2.x versions are designed
to remove single points of failure, which matters if high availability is a business
requirement.
3.7 NoSQL SYSTEMS TO HANDLE BIG DATA PROBLEMS
1. Moving queries to the data, not data to the queries
With the exception of large graph databases, most NoSQL systems use commodity
processors that each hold a subset of the data on their local shared-nothing drives.
When a client wants to send a general query to all nodes that hold data, it's more
efficient to send the query to each node than it is to transfer large datasets to a
central processor. This may seem obvious, but it's amazing how many traditional
databases still can't distribute queries and aggregate query results.
This simple rule helps you understand how NoSQL databases can have dramatic
performance advantages over systems that weren't designed to distribute queries
to the data nodes.
Consider an RDBMS that has tables distributed over two different nodes. In order
for the SQL query to work, information about rows on one table must all be
moved across the network to the other node.
Larger tables result in more data movement, which results in slower queries.
Think of all the steps involved. The tables must be extracted, serialized, sent
through the network interface, transmitted over networks, reassembled, and then
compared on the server with the SQL query.
Keeping all the data within each data node in the form of logical documents
means that only the query itself and the final result need to be moved over a
network. This keeps your big data queries fast.
2. Using hash rings to distribute data on a cluster
One of the most challenging problems with distributed databases is figuring out a
consistent way of assigning a document to a processing node.
Hash rings are common in big data solutions because they consistently determine
how to assign a piece of data to a specific processor. Hash rings take the leading
bits of a document's hash value to determine which node the document should be
assigned to. Most NoSQL systems use keyspace concepts to manage distributed
computing problems.
The concept of a hash ring can also be extended to include the requirement that an
item must be stored on multiple nodes. When a new item is created, the hash ring rules
might indicate both a primary and a secondary copy of where an item is stored. If the
node that contains the primary fails, the system can look up the node where the
secondary item is stored.
3. Using replication to scale reads
Fig. 3.7.1 : Duplication of the data to increase the performance of NoSQL system (client requests reach read/write nodes and read replica nodes; writes are copied to the replicas)
Fig. 3.7.1 shows how you can replicate data to speed read performance in NoSQL
systems. All incoming client requests enter from the left. All reads can be directed to
either a primary read/write node or a replica node. All writes will update the
read/write node, which then copies the data to the replica nodes. If a read reaches a
replica node before the update happens, it returns the old data. This is an example of
an inconsistent read.
The best way to avoid this type of problem is to only allow reads to the same write
node after a write has been done. This logic can be added to a session or state
management system at the application layer. Almost all distributed databases relax
database consistency rules when a large number of nodes permit writes.
In order to get high performance from queries that span multiple nodes, it's
important to separate the concerns of query evaluation from query execution.
Fig. 3.7.2 shows this structure.
Fig. 3.7.2 : NoSQL systems move the query to a data node, but don't move data to a query node (query analyzer nodes, data nodes, and replica nodes to which data is copied)
In this example, all incoming queries arrive at query analyzer nodes, which forward
them to the data nodes. Distributing the queries is the responsibility of the database,
not the application layer.
This approach is somewhat similar to the concept of federated search. Federated
search takes a single query and distributes it to distinct servers and then
combines the results together to give the user the impression they're searching a
single system. In some cases, these servers may be in different locations.
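The separation between query evaluation and query execution can be sketched as below. This is a simplified, single-process illustration under stated assumptions: `DataNode`, `QueryAnalyzer`, and the sample documents are hypothetical, and a real system would send the query to the nodes over a network and in parallel.

```python
class DataNode:
    """Holds a subset of the documents and runs queries against them locally."""

    def __init__(self, name, documents):
        self.name = name
        self.documents = documents

    def execute(self, predicate):
        # Only the matching documents leave the node, never the whole dataset.
        return [doc for doc in self.documents if predicate(doc)]


class QueryAnalyzer:
    """Evaluates a query once, sends it to every data node, and merges the results."""

    def __init__(self, data_nodes):
        self.data_nodes = data_nodes

    def run(self, field, value):
        # Query evaluation: turn the request into a predicate.
        def predicate(doc):
            return doc.get(field) == value

        # Query execution: each data node applies the predicate to its own data.
        results = []
        for node in self.data_nodes:
            results.extend(node.execute(predicate))
        return results


nodes = [
    DataNode("node-1", [{"id": 1, "topic": "nosql"}, {"id": 2, "topic": "sql"}]),
    DataNode("node-2", [{"id": 3, "topic": "nosql"}]),
]
analyzer = QueryAnalyzer(nodes)
matches = analyzer.run("topic", "nosql")
```

Only the small predicate travels to the data nodes and only matching documents travel back, which mirrors the rule of moving queries to the data.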