NO SQL Data Management

The document discusses NoSQL data management in unit 2 of the course RCA E45 Big Data. It provides an introduction to NoSQL, explaining that NoSQL databases are non-relational and designed for large, distributed datasets. It then discusses why NoSQL databases are useful, including for application development productivity, large data, analytics, scalability, high write performance, flexible data models, and easier maintenance.

Uploaded by

Hirdesh Sharma
Copyright: © All Rights Reserved

Noida Institute of Engineering and Technology, Greater Noida

NO SQL Data Management

Unit: 2

RCA E45- Big Data


Hirdesh Sharma,
Department of MCA
MCA 5th Sem

Hirdesh Sharma RCA E45 Big Data Unit: 2


1
08/11/2021
Content

• Course Objective
• Course Outcome
• CO and PO Mapping
• Introduction to No SQL
• aggregate data models
• Aggregates
• key-value and document data models
• relationships
• graph databases
• Schema less databases
• materialized views
• distribution models



Content

• sharding
• master-slave replication
• peer-peer replication
• sharding and replication
• consistency
• relaxing consistency
• version stamps
• map-reduce
• partitioning and combining
• composing map-reduce calculations
• Summary



Course Objective

Upon completion of this course, students will be able to do the following:
• Explain what Big Data is and why it is used.
• Describe Hadoop and related open source technologies.
• Demonstrate a familiarity with NoSQL data management.
• Apply important concepts of Big Data and Hadoop to
unstructured data.
• Synthesize the use of HBase data models and implementation.

Course Outcome

After completing this course, students will be able to:

• CO1: Study paradigms and approaches used to analyze unstructured
data into semi-structured and structured data, and cloud and big data
mobile business intelligence in practice.
• CO2: Explain why the big data concept is used, the basics of the Hadoop
data format, analyzing data with Hadoop, scaling out, Hadoop streaming,
Hadoop pipes, and the design of the Hadoop distributed file system (HDFS).
• CO3: Apply industry examples of big data in real life and analyze how
to implement them.
• CO4: Explain the concept of NoSQL, aggregate data models,
aggregates, key-value and document data models, relationships,
partitioning and combining, and composing map-reduce calculations.
• CO5: Gather information about Hadoop-related tools, HBase, its data
model and implementations, HBase clients, HBase examples (praxis),
Cassandra, the Cassandra data model, and HiveQL queries.



Program Outcome

• PO1: Computational Knowledge: Develop knowledge of computing fundamentals,
computing specialization, mathematics and domain knowledge for solving real
world problems.
• PO2: Problem Analysis: Identify, formulate, review research literature and analyze
complex problems, reaching substantial conclusions using fundamental
principles of mathematics, computing science and the relevant domain discipline.
• PO3: Design/Development of Solutions: Ability to design and evaluate systems,
components or processes for complex computing problems that meet specified
needs with appropriate consideration for public health and safety, and for cultural,
societal and environmental considerations.
• PO4: Conduct investigations of complex Computing problems: Use research-
based knowledge and research methods including design of experiments, analysis
and interpretation of data, and synthesis of the information to provide valid
conclusions.
• PO5: Modern Tool Usage: Create, select, adapt and apply appropriate techniques,
resources, and modern computing tools including prediction and modeling to
complex computing activities, with an understanding of the limitations.
• PO6: Professional Ethics: Understand and commit to professional ethics and cyber
regulations, responsibilities, and norms of professional computing practices.
Program Outcome
• PO7: Life-long Learning: Recognize the need, and have the ability, to engage in
independent learning for continual preparation and development as a computing
professional in the broadest context of technological change.
• PO8: Project management and finance: Demonstrate knowledge and
understanding of the computing and management principles and apply these to
one’s own work, as a member and leader in a team, to manage projects and in
multidisciplinary environments.
• PO9: Communication Efficacy: Communicate effectively with the computing
community, and with society at large, about complex computing activities by being
able to comprehend and write effective reports, design documentation, make
effective presentations, and give and understand clear instructions.
• PO10: Societal and Environmental Concern: Understand and assess societal,
environmental, health, safety, legal, and cultural issues within local and global
contexts, and the consequential responsibilities relevant to professional computing
practices.
• PO11: Individual and Team Work: Function effectively as an individual and as a
member or leader in diverse teams and in multidisciplinary environments.
• PO12: Innovation and Entrepreneurship: Identify a timely opportunity and use
innovation to pursue that opportunity to create value and wealth for the
betterment of the individual and society at large.
CO-PO Mapping

Mapping of Course Outcomes (COs) and Program Outcomes (POs):



Unit 2 Objective

Upon completion of this unit, students will be able to do the following:
• Describe Hadoop and related open source technologies.
• Demonstrate a familiarity with NoSQL data management.

Prerequisite and Recap

• Big data is a popular term used to describe the exponential growth and
availability of data, both structured and unstructured.
• 3 dimensions / characteristics of Big data: 3Vs (volume, variety and
velocity)
• Web analytics is the measurement, collection, analysis and reporting of
web data for purposes of understanding and optimizing web usage.
• Fraud is intentional deception made for personal gain or to damage another
individual.
• Credit risk management is a critical function that spans a diversity of
businesses across a wide range of industries.
• HDFS is the storage system for a Hadoop cluster.
• MapReduce jobs are designed to continue to work in the face of system failures.



Topic Name (CO2)

• Introduction to NoSQL
• NoSQL Databases



Topic Objective (CO2)

After completion of this topic, students will be able to understand:


• What is NoSQL?

• NoSQL Databases



What is NOSQL? (CO2)

• A NoSQL database, also called Not Only SQL, is an approach to data
management and database design that's useful for very large sets of
distributed data.
• NoSQL is not a relational database. The reality is that a relational
database model may not be the best solution for all situations.
• The easiest way to think of NoSQL is as a database that does not
adhere to the traditional relational database management system
(RDBMS) structure.



Why Are NoSQL Databases Interesting? (CO2)

Why are NoSQL databases interesting? Why should we use NoSQL? When
should we use NoSQL?
There are several reasons why people consider using a NoSQL
database:
• Application development productivity
• Large data
• Analytics



Why Are NoSQL Databases Interesting? (CO2)

• Scalability: NoSQL databases are designed to scale; it’s one of the
primary reasons that people choose a NoSQL database.
• Massive write performance: This is probably the canonical usage
based on Google's influence, which implies key-value access,
MapReduce, replication, fault tolerance, consistency issues, and all
the rest. For faster writes, in-memory systems can be used.
• Fast key-value access: This is probably the second most cited
virtue of NoSQL in the general mindset.



Why Are NoSQL Databases Interesting? (CO2)

• Flexible data model and flexible datatypes: NoSQL products
support a whole range of new data types. We have: column-oriented,
graph, advanced data structures, document-oriented, and key-value.
Complex objects can be easily stored without a lot of mapping.
• Schema migration
• Write availability
• Easier maintainability, administration and operations



Why Are NoSQL Databases Interesting? (CO2)

• No single point of failure
• Generally available parallel computing
• Programmer ease of use: Accessing your data should be easy.
Programmers grok keys, values, JSON, JavaScript stored procedures,
HTTP, and so on. NoSQL is for programmers.



Why Are NoSQL Databases Interesting? (CO2)

• Use the right data model for the right problem: Different data
models are used to solve different problems.
• Distributed systems and cloud computing support: Not everyone
is worried about scale or performance over and above that which
can be achieved by non-NoSQL systems.



Difference between SQL and NoSQL (CO2)

• SQL databases are primarily called relational databases (RDBMS),
whereas NoSQL databases are primarily called non-relational or
distributed databases.
• SQL databases are table-based, whereas NoSQL databases
are document-based, key-value pairs, graph databases or
wide-column stores.



Difference between SQL and NoSQL (CO2)

• SQL databases are scaled vertically, by increasing the horsepower of the
hardware. NoSQL databases are scaled horizontally, by adding database
servers to the pool of resources to reduce the load.
• SQL database examples: MySQL, Oracle, SQLite, PostgreSQL and
MS-SQL. NoSQL database examples: MongoDB, BigTable, Redis,
RavenDB, Cassandra, HBase, Neo4j and CouchDB.



Difference between SQL and NoSQL (CO2)

• For the type of data to be stored: SQL databases are not the best fit for
hierarchical data storage. NoSQL databases fit hierarchical data
better, as they store data as key-value pairs, similar to JSON.
NoSQL databases are highly preferred for large data sets
(i.e., for big data).



Difference between SQL and NoSQL (CO2)

• For properties: SQL databases emphasize the ACID properties
(Atomicity, Consistency, Isolation and Durability), whereas
NoSQL databases follow Brewer's CAP theorem (Consistency,
Availability and Partition tolerance).
• For DB types: At a high level, we can classify SQL databases as
either open-source or closed-source from commercial vendors.
NoSQL databases can be classified by the way they store
data: graph databases, key-value store databases, document store
databases, column store databases and XML databases.



Type of NoSQL Database (CO2)

There are four general types of NoSQL databases, each with their
own specific attributes:
1. Key-Value storage

2. Document Databases

3. Column Storage
4. Graph Storage



Type of NoSQL Database (CO2)

• Key-value storage: This is the first category of NoSQL database.
Key-value stores have a simple data model, which allows clients to
put a map/dictionary and request a value per key. In key-value
storage, each key has to be unique to provide unambiguous
identification of values. For example:
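As a minimal sketch of the key-value model, a Python dictionary can stand in for the store. The class name `KeyValueStore`, its `put`/`get` methods and the sample keys are hypothetical, chosen only for illustration; they are not the API of any particular product.

```python
class KeyValueStore:
    """Toy in-memory key-value store (illustrative only)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # A later put with the same key overwrites the earlier value,
        # so each key identifies exactly one value.
        self._data[key] = value

    def get(self, key, default=None):
        # The store looks values up by key only; it never inspects the value.
        return self._data.get(key, default)


store = KeyValueStore()
store.put("user:1", b'{"name": "Martin"}')
store.put("user:2", b"\x00\x01 opaque binary blob")
```

Because the store treats values as opaque, the application is responsible for interpreting what it reads back.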



Type of NoSQL Database (CO2)

• Document databases: A document database stores documents,
typically in JSON format. Documents with completely different sets
of attributes can be stored together, so highly unstructured data can
be kept as named value pairs. This suits applications that look at
user behavior, actions, and logs in real time.



Type of NoSQL Database (CO2)

• Column storage: Columnar databases are almost like tabular
databases. Keys in wide-column stores can have many
dimensions, resulting in a structure similar to a multi-dimensional
associative array. A typical case is data stored in a wide-column
system using a two-dimensional key.
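The two-dimensional key can be pictured as a nested associative array: a row key selects a row, and a column name selects a value within it. The following is only a sketch under that assumption; the row keys, column names and the helper `read_cell` are hypothetical.

```python
from collections import defaultdict

# row key -> column name -> value; the pair (row key, column name)
# acts as the two-dimensional key.
table = defaultdict(dict)

table["customer:1"]["name"] = "Martin"
table["customer:1"]["city"] = "Chicago"
table["customer:2"]["name"] = "Pramod"   # rows need not share the same columns


def read_cell(row_key, column):
    """Return one value addressed by the two-dimensional key, or None."""
    return table.get(row_key, {}).get(column)
```

Real wide-column stores add further dimensions (for example column families or timestamps), which this sketch leaves out.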



Type of NoSQL Database (CO2)

• Graph storage: Graph databases are best suited for representing
data with a high, yet flexible, number of interconnections, especially
when information about those interconnections is at least as
important as the represented data. A graph database stores data
in a graph-like structure, so that connected data is easily
accessible. Graph databases are commonly used on
social networking sites.
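A small sketch of the idea, assuming a labelled adjacency list as the graph structure. The node names, the relationship labels `FRIEND_OF` and `LIKES`, and the helper functions are all hypothetical; a real graph database would offer its own query language for this.

```python
from collections import defaultdict

# Each edge carries its own label, because information about the
# connection matters as much as the nodes it connects.
edges = defaultdict(list)   # node -> list of (relationship, other node)


def connect(a, relationship, b):
    edges[a].append((relationship, b))


connect("Alice", "FRIEND_OF", "Bob")
connect("Bob", "FRIEND_OF", "Carol")
connect("Alice", "LIKES", "NoSQL Distilled")


def neighbours(node, relationship):
    """Nodes reachable from `node` over edges with the given label."""
    return [other for rel, other in edges.get(node, []) if rel == relationship]


def friends_of_friends(node):
    """Two-hop traversal, a typical social-network style query."""
    result = set()
    for friend in neighbours(node, "FRIEND_OF"):
        result.update(neighbours(friend, "FRIEND_OF"))
    result.discard(node)
    return result
```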



NoSQL Database Examples (CO1)



Pros and Cons of Relational Databases (CO2)

Advantages
– Data persistence
– Concurrency – ACID, transactions, etc.
– Integration across multiple applications
– Standard Model – tables and SQL

Disadvantages
– Impedance mismatch
– Integration databases vs. application databases
– Not designed for clustering



Database Impedance mismatch (CO2)

• Impedance mismatch means the difference between the relational data
model and the in-memory data structures.
• In general, impedance is a measure of how much one object resists (or
obstructs) the flow of another.
• The data representation in an RDBMS does not match the data
structures used in memory.
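To make the mismatch concrete, the sketch below flattens one nested in-memory object into rows for two relational tables. The class names, field names and the `to_rows` helper are hypothetical and only illustrate the translation step an application has to perform.

```python
from dataclasses import dataclass, field


@dataclass
class OrderItem:
    product_name: str
    price: float


@dataclass
class Order:
    order_id: int
    items: list = field(default_factory=list)


def to_rows(order):
    """Flatten one nested object into rows for an orders table and an items table."""
    order_rows = [(order.order_id,)]
    item_rows = [(order.order_id, i.product_name, i.price) for i in order.items]
    return order_rows, item_rows


order = Order(99, [OrderItem("NoSQL Distilled", 32.45)])
order_rows, item_rows = to_rows(order)
```

Reading the order back requires the reverse step: joining the rows and rebuilding the nested structure.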



Characteristics of NoSQL (CO2)

Some common characteristics of NoSQL include:

• Does not use the relational model (mostly)
• Generally open source projects (currently)
• Driven by the need to run on clusters
• Built for the need to run 21st century web properties
• Schema-less
• Polyglot persistence
• Auto-sharding



Polyglot Persistence (CO2)

• The point of view of using different data stores in different
circumstances is known as Polyglot Persistence.
• Polyglot persistence is commonly used to define this hybrid
approach.
• The definition of polyglot is “someone who speaks or writes several
languages.” The term polyglot is redefined for big data as a set of
applications that use several core database technologies.





Daily Quiz

• What license is Hadoop distributed under?


a) Apache License 2.0
b) Mozilla Public License
c) Shareware
d) Commercial
• Which of the following genres does Hadoop produce?
a) Distributed file system
b) JAX-RS
c) Java Message Service
d) Relational Database Management System



Recap

• A NoSQL database, also called Not Only SQL, is an approach to data
management and database design that's useful for very large sets of
distributed data.



Topic Name (CO2)

• NoSQL Data Models



Topic Objective (CO2)

After completion of this topic, students will be able to understand:


• NoSQL Data Models

• Aggregate Data Models



NoSQL Data Model (CO2)

• NoSQL databases have a very different model. For example, a
document-oriented NoSQL database takes the data you want to store
and aggregates it into documents using the JSON format.
• Each JSON document can be thought of as an object to be used by
your application.
• A JSON document might, for example, take all the data stored in a
row that spans 20 tables of a relational database and aggregate it into
a single document/object.





NoSQL Data Model (CO2)

• Another major difference is that relational technologies have rigid
schemas while NoSQL models are schemaless.
• A rigid schema is the exact opposite of the behavior desired in the
Big Data era, where application developers need to constantly, and
rapidly, incorporate new types of data to enrich their apps.



Aggregate Data Model in NoSQL (CO2)

• Data Model: A data model is the model through which we perceive
and manipulate our data.

• Relational Data Model: The relational model takes the information
that we want to store and divides it into tuples.

• Aggregate Model: Aggregate is a term that comes from
Domain-Driven Design; an aggregate is a collection of related objects.



Aggregate Data Model in NoSQL (CO2)

• Atomicity holds within an aggregate.
• Communication with data storage happens in units of aggregates.

Example of Relations and Aggregates

• Let’s assume we have to build an e-commerce website; we are going
to be selling items directly to customers over the web.
• We can use this scenario to model the data using a relational data store
as well as NoSQL data stores and talk about their pros and cons.





Aggregate Data Model in NoSQL (CO2)

• The following figure presents some sample data for this model.



Aggregate Data Model in NoSQL (CO2)

• In the relational model, everything is properly normalized. We also have
referential integrity. A realistic order system would naturally be
more involved than this.



Aggregate Data Model in NoSQL (CO2)

Again, we have some sample data, which we’ll show in JSON
format as that’s a common representation for data in NoSQL.

// in customers
{
  "id": 1,
  "name": "Martin",
  "billingAddress": [{"city": "Chicago"}]
}

// in orders
{
  "id": 99,
  "customerId": 1,
  "orderItems": [
    {
      "productId": 27,
      "price": 32.45,
      "productName": "NoSQL Distilled"
    }
  ],
  "shippingAddress": [{"city": "Chicago"}],
  "orderPayment": [
    {
      "ccinfo": "1000-1000-1000-1000",
      "txnId": "abelif879rft",
      "billingAddress": {"city": "Chicago"}
    }
  ]
}
Aggregate Data Model in NoSQL (CO2)

• In this model, we have two main aggregates: customer and order.
We’ve used the black-diamond composition marker in UML to show
how data fits into the aggregation structure.
• The customer contains a list of billing addresses; the order contains
a list of order items, a shipping address, and payments.
• The payment itself contains a billing address for that payment.
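The following sketch shows what working with the order aggregate as a single unit can look like. It reuses the sample order data from this section; the in-memory `orders` dictionary and the `order_total` helper are hypothetical stand-ins for a real data store and query.

```python
import json

order_json = """
{
  "id": 99,
  "customerId": 1,
  "orderItems": [
    {"productId": 27, "price": 32.45, "productName": "NoSQL Distilled"}
  ],
  "shippingAddress": [{"city": "Chicago"}],
  "orderPayment": [
    {"ccinfo": "1000-1000-1000-1000", "txnId": "abelif879rft",
     "billingAddress": {"city": "Chicago"}}
  ]
}
"""

# The whole order is stored and fetched as one unit, keyed by its id.
orders = {}
order = json.loads(order_json)
orders[order["id"]] = order


def order_total(order_id):
    """Sum item prices using only the data inside one order aggregate."""
    return sum(item["price"] for item in orders[order_id]["orderItems"])
```

Items, shipping address and payments all arrive with one lookup, whereas the customer is a separate aggregate referenced only by `customerId`.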



Aggregate Oriented Databases (CO2)

Aggregate-oriented databases work best when most data interaction
is done with the same aggregate.
Key-value databases
– Store data that is opaque to the database
– The database cannot see the structure of records
– The application needs to deal with this
– Allow flexibility regarding what is stored (e.g. text or binary
data)
Document databases
– Store data whose structure is visible to the database
– Impose limitations on what can be stored
– Allow more flexible access to data (e.g. partial records) via
querying



Aggregate Oriented Databases (CO2)

Both key-value and document databases consist of aggregate records
accessed by ID values.
Column-family databases
– Two levels of access to aggregates (and hence, two parts to the
“key” to access an aggregate’s data)
– The ID is used to look up the aggregate record
– The column name is either a label for a value (name) or a key to a list
entry (order id)
– Columns are grouped into column families



Relational Vs Aggregate Data Models (CO2)





Schemaless Databases (CO2)

• A common theme across all the forms of NoSQL databases is that
they are schemaless.
• When you want to store data in a relational database, you first have
to define a schema: a defined structure for the database which says
what tables exist, which columns exist, and what data types each
column can hold.
• Before you store some data, you have to have the schema defined
for it.



Schemaless Databases (CO2)

Why schemaless?
– A schemaless store makes it easier to deal with nonuniform
data.
– When starting a new development project you don't need to
spend the same amount of time on up-front design of the
schema.
– No need to learn SQL or database-specific tools.
– A relational database (RDBMS) has a rigid schema, so it can be
harder to push data into the DB as it has to perfectly fit the
schema.



Schemaless Databases (CO2)

Pros:
– More freedom and flexibility
– You can easily change your data organization
– You can deal with nonuniform data
Cons:
– A program that accesses data almost always relies on some form
of implicit schema: it assumes that certain fields are present and
carry data with a certain meaning
– The implicit schema is shifted into the application code that
accesses the data
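A short sketch of how the implicit schema ends up in application code. The two records, the field names `name` and `fullName`, and the `display_name` function are hypothetical; the point is only that the code, not the database, decides which fields it expects.

```python
# Two records in the same schemaless collection with different fields.
customers = [
    {"id": 1, "name": "Martin", "billingAddress": [{"city": "Chicago"}]},
    {"id": 2, "fullName": "Pramod"},   # differently shaped record
]


def display_name(record):
    # The implicit schema lives here: the code assumes a name field exists
    # and must decide what to do when a record does not match.
    if "name" in record:
        return record["name"]
    if "fullName" in record:
        return record["fullName"]
    raise KeyError("record has no recognised name field")


names = [display_name(c) for c in customers]
```

Every program that reads this collection has to carry the same assumptions, which is the cost that offsets the flexibility.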



Distribution Models (CO2)

Multiple servers:
– In NoSQL systems, data is distributed over large clusters.
Single server:
– The simplest model, with everything on one machine: the database
runs on a single machine that handles all the reads and writes to
the data store.



Orthogonal aspects of data distribution models (CO2)

Sharding:
• DB Sharding is nothing but horizontal partitioning of data. Different
people are accessing different parts of the dataset.
• In these circumstances we can support horizontal scalability by
putting different parts of the data onto different servers—a technique
that’s called sharding.





Sharding (CO2)

• Different parts of the data are placed onto different servers
– Horizontal scalability
– Ideal case: different users all talking to different server nodes
– Data accessed together is kept on the same node: the aggregate is
the unit of distribution
• Pros: it can improve both reads and writes
• Cons: clusters use less reliable machines, so resilience decreases
• Many NoSQL databases offer auto-sharding
– The database takes on the responsibility of sharding
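One simple way a database could assign aggregates to shards is by hashing the aggregate's id. This is a sketch under that assumption, not a description of how any specific product shards; the node names and the `shard_for`, `put` and `get` helpers are hypothetical.

```python
import hashlib

SHARDS = ["node-a", "node-b", "node-c"]   # hypothetical server names


def shard_for(aggregate_id):
    """Map an aggregate id to one shard using a stable hash."""
    digest = hashlib.sha256(str(aggregate_id).encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


# Each server holds only its own subset of the data.
cluster = {name: {} for name in SHARDS}


def put(aggregate_id, aggregate):
    cluster[shard_for(aggregate_id)][aggregate_id] = aggregate


def get(aggregate_id):
    return cluster[shard_for(aggregate_id)].get(aggregate_id)


for order_id in range(100):
    put(order_id, {"id": order_id})
```

Plain modulo hashing moves most keys when the number of shards changes, so it spreads load evenly but does not address the "place data close to where it is accessed" rule.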



Sharding (CO2)

Improving performance:

Main rules of sharding:


• Place the data close to where it’s accessed
– Orders for Boston: data in your eastern US data center
• Try to keep the load even
– All nodes should get equal amounts of the load
• Put together aggregates that may be read in sequence
– Same order, same node



Master Slave Replication (CO2)

Master
– is the authoritative source for the data
– is responsible for processing any updates to that data
– can be appointed manually or automatically

Slaves
– A replication process synchronizes the slaves with the master
– After a failure of the master, a slave can be appointed as new
master very quickly
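The roles above can be sketched with a toy cluster in which one node accepts writes and the others serve reads. The classes `Node` and `MasterSlaveCluster`, their method names and the sample phone number are hypothetical, and replication is modelled as an explicit `replicate()` call rather than a background process.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}


class MasterSlaveCluster:
    """Toy model: one master takes writes, slaves serve reads."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)

    def write(self, key, value):
        # All updates go through the master, the authoritative source.
        self.master.data[key] = value

    def replicate(self):
        # The replication process copies the master's state to each slave.
        for slave in self.slaves:
            slave.data = dict(self.master.data)

    def read(self, key, slave_index=0):
        # Reads are routed to a slave and may be stale until replicate() runs.
        return self.slaves[slave_index].data.get(key)

    def fail_over(self):
        # After a master failure, appoint the first slave as the new master.
        self.master = self.slaves.pop(0)


cluster = MasterSlaveCluster(Node("m"), [Node("s1"), Node("s2")])
cluster.write("phone", "555-0100")
stale = cluster.read("phone")        # slave not yet synchronized
cluster.replicate()
fresh = cluster.read("phone")
cluster.fail_over()
```

The stale read before `replicate()` previews the replication consistency issue discussed later in the unit.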





Master Slave Replication (CO2)

Pros and cons of Master-Slave Replication

Pros
– More read requests:
– Add more slave nodes
– Ensure that all read requests are routed to the slaves
Cons
– The master is a bottleneck
– Limited by its ability to process updates and to pass those
updates on
– Its failure does eliminate the ability to handle writes until:





Peer to Peer Replication (CO2)

• All the replicas have equal weight, they can all accept writes
• The loss of any of them doesn’t prevent access to the data store.
Pros and cons of peer-to-peer replication
Pros:
– you can ride over node failures without losing access to data
– you can easily add nodes to improve your performance
Cons:
– Inconsistency
– Slow propagation of changes to copies on different nodes



Sharding and Replication on Master-Slave (CO2)

• Replication and sharding are strategies that can be combined.
• If we use both master-slave replication and sharding, we have
multiple masters, but each data item only has a single master.

Two schemes:
– A node can be a master for some data and a slave for other data
– Nodes are dedicated to master or slave duties





Sharding and Replication on P2P (CO2)

• Using peer-to-peer replication and sharding is a common strategy
for column-family databases.
• Usually each shard is present on three nodes.





Replication (CO2)

Key Points:
– Sharding distributes different data across multiple servers, so
each server acts as the single source for a subset of data.
– Replication copies data across multiple servers, so each bit of
data can be found in multiple places.
A system may use either or both techniques. Replication comes in
two forms:
– Master-slave replication makes one node the authoritative copy
that handles writes while slaves synchronize with the master and
may handle reads.
– Peer-to-peer replication allows writes to any node; the nodes
coordinate to synchronize their copies of the data.



Daily Quiz

• The right number of reduces seems to be ____________
a) 0.90
b) 0.80
c) 0.36
d) 0.95
• Which of the following phases occur simultaneously?
a) Shuffle and Sort
b) Reduce and Sort
c) Shuffle and Map
d) All of the mentioned



Recap (CO2)

• Replication and sharding are strategies that can be combined.
• We have multiple masters, but each data item only has a single master.
Two schemes:
– A node can be a master for some data and a slave for other data
– Nodes are dedicated to master or slave duties



Topic Name (CO2)

• Consistency and Version Stamp
• Partitioning and Combining



Topic Objective (CO2)

After completion of this topic, students will be able to understand:


• Consistency
• Version Stamp
• Partitioning and Combining



Consistency (CO2)

• The consistency property ensures that any transaction will bring the
database from one valid state to another.
• Relational databases have strong consistency, whereas NoSQL
systems mostly have eventual consistency.
• ACID: A DBMS is expected to support “ACID transactions,”
processes that are:
– Atomicity: either the whole process is done or none is
– Consistency: only valid data are written
– Isolation: one operation at a time
– Durability: once committed, it stays that way
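Atomicity and consistency can be seen in a small sketch using Python's built-in `sqlite3` module. The `accounts` table, the account names and the `transfer` function are hypothetical examples; the `CHECK` constraint plays the role of "only valid data are written".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, "
             "balance INTEGER CHECK (balance >= 0))")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()


def transfer(amount):
    """Move money from alice to bob; both updates succeed or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("UPDATE accounts SET balance = balance + ? "
                         "WHERE name = 'bob'", (amount,))
            conn.execute("UPDATE accounts SET balance = balance - ? "
                         "WHERE name = 'alice'", (amount,))
        return True
    except sqlite3.IntegrityError:
        return False


def balances():
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok = transfer(30)        # valid: both updates are committed together
failed = transfer(500)   # would make alice negative: whole transfer rolls back
```

In the failed transfer the credit to bob has already been executed when the constraint fails, yet the rollback removes it, which is the "whole process or none" behavior.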



Various forms of Consistency (CO2)

Update Consistency (or write-write conflict):


• Martin and Pramod are looking at the company website and notice
that the phone number is out of date. Incredibly, they both have
update access, so they both go in at the same time to update the
number.



Various forms of Consistency (CO2)

Update Consistency (or write-write conflict):

Solutions:
– Pessimistic approach: prevent conflicts from occurring
Approaches:
– Conditional updates: test the value just before updating
– These do not work if there’s more than one server (peer-to-peer
replication)



Various forms of Consistency (CO2)

Read Consistency (or read-write conflict)


• Alice and Bob are using the Ticketmaster website to book tickets for a
specific show.
• Only one ticket is left for the show. Alice signs on to
Ticketmaster first, finds one ticket left, and finds it expensive. Alice
takes time to decide.





Various forms of Consistency (CO2)

Replication consistency
• Let’s imagine there’s one last hotel room for a desirable event. The
hotel reservation system runs on many nodes.
• This is another inconsistent read—but it’s a breach of a different
form of consistency we call replication consistency: ensuring that
the same data item has the same value when read from different
replicas.





Various forms of Consistency (CO2)

Eventual consistency:
• At any time, nodes may have replication inconsistencies but, if there
are no further updates, eventually all nodes will be updated to the
same value.
• In other words, eventual consistency is a consistency model used in
NoSQL databases to achieve high availability. It informally
guarantees that, if no new updates are made to a given data item,
eventually all accesses to that item will return the last updated value.
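
A toy simulation can make the idea concrete. The sketch below assumes every write carries a totally ordered version number and that a background `anti_entropy` step copies the newest version to every replica; both are simplifying assumptions for illustration, and the class and function names are hypothetical. Real systems use more elaborate mechanisms.

```python
class Replica:
    """Hypothetical replica holding one data item and its version number."""

    def __init__(self, name):
        self.name = name
        self.value = None
        self.version = 0

    def write(self, value, version):
        # Accept a write only if it is newer than what this replica holds.
        if version > self.version:
            self.value, self.version = value, version


def anti_entropy(replicas):
    """Propagate the newest known version to every replica."""
    newest = max(replicas, key=lambda r: r.version)
    for replica in replicas:
        replica.write(newest.value, newest.version)


replicas = [Replica("london"), Replica("mumbai"), Replica("new_york")]
replicas[0].write("room available", 1)
anti_entropy(replicas)

# A new update reaches only one node: reads are now inconsistent.
replicas[1].write("room booked", 2)
stale_reads = {r.name: r.value for r in replicas}

# With no further updates, replication eventually catches up.
anti_entropy(replicas)
converged_reads = {r.name: r.value for r in replicas}
```

Between the update and the next synchronization, different replicas return different values; afterwards every replica returns the last updated value.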



Version Stamp (CO2)

• A field that changes every time the underlying data in the record
changes.
• When you read the data you keep a note of the version stamp, so
that when you write data you can check to see if the version has
changed.
• You may have come across this technique with updating resources
with HTTP.



Version Stamp (CO2)

• In short:
– It helps you detect concurrency conflicts.
– When you read data, then update it, you can check the version
stamp to ensure nobody updated the data between your read and
write.
– Version stamps can be implemented using counters, GUIDs (a
large random number that’s guaranteed to be unique), content
hashes, timestamps, or a combination of these.
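
As a rough illustration of the read, check, then write cycle, the sketch below uses a content hash as the version stamp. The `Store` class, its method names, and the `VersionConflict` exception are hypothetical and exist only for this example; a counter or timestamp could be used in the same way.

```python
import hashlib


class VersionConflict(Exception):
    """Raised when the stored data changed between read and write."""


class Store:
    """Hypothetical in-memory store that stamps values with a content hash."""

    def __init__(self):
        self._data = {}

    @staticmethod
    def _stamp(value):
        return hashlib.sha256(value.encode("utf-8")).hexdigest()

    def create(self, key, value):
        self._data[key] = value

    def get(self, key):
        """Return the value together with its current version stamp."""
        value = self._data[key]
        return value, self._stamp(value)

    def put(self, key, value, expected_stamp):
        """Write only if the stamp noted at read time is still current."""
        if self._stamp(self._data[key]) != expected_stamp:
            raise VersionConflict(key)
        self._data[key] = value


store = Store()
store.create("phone", "555-0100")

_, stamp = store.get("phone")            # note the stamp when reading
store.put("phone", "555-0199", stamp)    # first writer succeeds

conflict_detected = False
try:
    store.put("phone", "555-0188", stamp)  # stale stamp from the old read
except VersionConflict:
    conflict_detected = True
```

One design caveat: a content hash cannot tell that a value was changed and then changed back to its original content, which is one reason counters or combinations of stamps are also used.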



Relaxing Consistency (CO2)

• The CAP Theorem: The basic statement of the CAP theorem is that,
given the three properties of Consistency, Availability, and Partition
tolerance, you can only get two.
– Consistency: all people see the same data at the same time
– Availability: if you can talk to a node in the cluster, it can read
and write data
– Partition tolerance: the cluster can survive communication
breakages that separate the cluster into partitions unable to
communicate with each other



Network Partition (CO2)
• The CAP theorem states that if you get a network partition, you have
to trade off availability of data versus consistency.
• Very large systems will “partition” at some point.



CA System (CO2)

A single-server system is the obvious example of a CA system:


– CA cluster: if a partition occurs, all the nodes would go down
– A failed, unresponsive node doesn’t imply a lack of CAP
availability

An example
– Ann is trying to book a room at the Ace Hotel in New York on a
node of a booking system located in London
– Pathin is trying to do the same on a node located in Mumbai



CA System (CO2)

Possible solutions
– CP: Neither user can book any hotel room, sacrificing
availability
– CAP: Designate the Mumbai node as the master for the Ace Hotel



Map Reduce (CO2)

• It is a way to take a big task and divide it into discrete tasks that can
be done in parallel.
• A common use case for Map/Reduce is in document databases.
• A Map Reduce program is composed of a Map() procedure that
performs filtering and sorting and a Reduce() procedure that
performs a summary operation.
• "Map" step
• "Reduce" step



Map Reduce (CO2)

Logical view
• The Map function is applied in parallel to every key-value pair in the
input dataset.
• Map(k1,v1) → list(k2,v2)
• The Reduce function is then applied in parallel to each group, which in
turn produces a collection of values in the same domain:
• Reduce(k2, list (v2)) → list(v3)
• Each Reduce call typically produces either one value v3 or an empty
return
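
The logical view above can be sketched as a small, single-process Python program. Word counting is used purely as an illustrative task, and `map_fn`, `reduce_fn`, and `map_reduce` are hypothetical names; a real framework would run the map and reduce calls in parallel across nodes.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word."""
    return [(word.lower(), 1) for word in text.split()]


def reduce_fn(word, counts):
    """Reduce(k2, list(v2)) -> list(v3): sum the counts for one word."""
    return [sum(counts)]


def map_reduce(inputs, map_fn, reduce_fn):
    # Group all mapped values by their intermediate key k2.
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Apply the reduce function once per group.
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}


docs = [(1, "big data big analytics"), (2, "data models")]
result = map_reduce(docs, map_fn, reduce_fn)
```

Each reduce call here returns a list with a single value, matching the note that a Reduce call typically produces one value or nothing.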



Map Reduce (CO2)
Multistage map-reduce calculations: Let us say that we have a set of
documents whose attributes have the following form:
{
"type": "post",
"name": "Raven's Map/Reduce functionality",
"blog_id": 1342,
"post_id": 29293921,
"tags": ["raven", "nosql"],
"post_content": "<p>...</p>",
"comments": [
{
"source_ip": "124.2.21.2",
"author": "martin",
"text": "excellent blog..."
}]
}
Map Reduce (CO2)

• Let us see how this works. We start by applying the map query to the
set of documents that we have, producing this output:



Map Reduce (CO2)
• The next step is to start reducing the results. In real Map/Reduce
algorithms, we partition the original input and work toward the
final result.
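
The map, partial reduce, and final reduce stages can be sketched as follows. This is an assumed example that counts comments per blog: the field names `blog_id` and `comments` come from the document form shown earlier, while the sample posts, the `comments_length` field, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical documents, trimmed to the fields the calculation needs.
posts = [
    {"blog_id": 1342, "post_id": 1, "comments": [{"author": "martin"}]},
    {"blog_id": 1342, "post_id": 2, "comments": [{"author": "a"}, {"author": "b"}]},
    {"blog_id": 2001, "post_id": 3, "comments": []},
    {"blog_id": 1342, "post_id": 4, "comments": [{"author": "c"}]},
]


def map_post(post):
    """Map: one row per post with the number of comments it has."""
    return {"blog_id": post["blog_id"], "comments_length": len(post["comments"])}


def reduce_rows(rows):
    """Reduce: total comments per blog; the output rows look like the input rows."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["blog_id"]] += row["comments_length"]
    return [{"blog_id": b, "comments_length": n} for b, n in sorted(totals.items())]


mapped = [map_post(p) for p in posts]

# Partition the mapped rows into batches and reduce each batch on its own.
batch_a, batch_b = mapped[:2], mapped[2:]
partial = reduce_rows(batch_a) + reduce_rows(batch_b)

# Final step: reduce the partial results once more.
final = reduce_rows(partial)
```

Because the reduce output has the same shape as its input, reducing the partial results gives the same answer as reducing all mapped rows in one pass.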



Map Reduce (CO2)

• And the final step is:



RDBMS compared to MapReduce (CO2)

• MapReduce is a good fit for problems that need to analyze the
whole dataset, in a batch fashion, particularly for ad hoc analysis.
• MapReduce suits applications where the data is written once, and
read many times, whereas a relational database is good for datasets
that are continually updated.



Partitioning and Combining (CO2)

• In the simplest form, we think of a map-reduce job as having a
single reduce function.
• The outputs from all the map tasks running on the various nodes are
concatenated together and sent into the reduce.



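Partitioning and Combining (CO2)

• Each reduce function operates on the results of a single key, so
multiple reducers can run in parallel. To take advantage of this, the
results of the mapper are divided up based on the key on each
processing node.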
• Typically, multiple keys are grouped together into partitions. The
framework then takes the data from all the nodes for one partition,
combines it into a single group for that partition, and sends it off to a
reducer.
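
A minimal sketch of this partitioning step is shown below. It assumes keys are assigned to partitions by a CRC32 hash modulo the number of partitions, which is one common choice and not necessarily what any given framework does; the function names and the sample (product, quantity) pairs are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Assign a key to a partition with a deterministic hash (an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(node_outputs, num_partitions):
    """Collect the map output of every node into per-partition groups by key."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for pairs in node_outputs:
        for key, value in pairs:
            partitions[partition_for(key, num_partitions)][key].append(value)
    return partitions


def reduce_partition(partition):
    """One reducer handles all keys of one partition."""
    return {key: sum(values) for key, values in partition.items()}


# Hypothetical map output from two nodes.
node_outputs = [
    [("puerh", 3), ("genmaicha", 1)],
    [("puerh", 2), ("dragonwell", 5)],
]

partitions = shuffle(node_outputs, num_partitions=2)
results = [reduce_partition(p) for p in partitions]
merged = {key: total for part in results for key, total in part.items()}
```

Since every occurrence of a key lands in the same partition, the reducers never need to talk to each other and can run in parallel.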



Partitioning and Combining (CO2)
• Reduce Partitioning Example:



Partitioning and Combining (CO2)

• Combinable Reducer Example:



Partitioning and Combining (CO2)

• Combinable Reducer:
A combiner function is, in essence, a reducer function—indeed, in
many cases the same function can be used for combining as the final
reduction. The reduce function needs a special shape for this to
work: Its output must match its input. We call such a function a
combinable reducer.
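
To illustrate the "output must match input" requirement, the sketch below uses a sum as the reduce function and runs it twice: first as a combiner on each node, then as the final reducer. The sample data and helper names are hypothetical.

```python
def sum_reducer(key, values):
    """Combinable: emits a (key, number) pair, the same shape as its input pairs."""
    return (key, sum(values))


def group(pairs):
    """Group values by key, preserving first-seen key order."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return grouped


def run_reducer(pairs):
    return [sum_reducer(key, values) for key, values in group(pairs).items()]


# Hypothetical map output on two nodes.
node_1 = [("puerh", 3), ("puerh", 4), ("genmaicha", 1)]
node_2 = [("puerh", 2), ("genmaicha", 6)]

# Combine locally on each node to shrink what is sent over the network.
combined_1 = run_reducer(node_1)
combined_2 = run_reducer(node_2)

# The same function performs the final reduction on the combined output.
final = dict(run_reducer(combined_1 + combined_2))

# Reference result without any combining step.
direct = dict(run_reducer(node_1 + node_2))
```

A reducer whose output differs in shape from its input, for example one that counts distinct values, could not be reused as a combiner in this way without changes.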



Daily Quiz

• Mapper and Reducer implementations can use the ________ to
report progress or just indicate that they are alive.
a) Partitioner
b) OutputCollector
c) Reporter
d) All of the mentioned

• _________ is the primary interface for a user to describe a
MapReduce job to the Hadoop framework for execution.
a) Map Parameters
b) JobConf
c) MemoryConf
d) None of the mentioned



Recap

• The consistency property ensures that any transaction will bring the
database from one valid state to another.
• Relational databases have strong consistency, whereas NoSQL
systems mostly have eventual consistency.



Faculty Video Links, Youtube & NPTEL Video Links and Online
Courses Details

• https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
• https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm
• https://www.sanfoundry.com/mapreduce-questions-answers/



Weekly Assignment 1

Q:1 Explain the concept of NoSQL in Big Data.
Q:2 Give the difference between a relational database and a NoSQL
database.
Q:3 Explain the various aggregate data models:
Key value storage
Document Storage
Graph Storage
Column value Storage
Q:4 Explain the schemaless database. Also explain the properties
of a schemaless database.



Weekly Assignment 2

Q:1 Write a short note on the following terms:
–Materialized Views
–Distribution Models
–Sharding
Q:2 Explain the following terms:
–Version Stamps
–Map Reduce Calculations
–Partitioning and Combining
–Consistency
Q:3 Explain Master-Slave Replication with the help of a suitable
example.
Q:4 Explain Peer-to-Peer Replication with the help of a suitable
example.



MCQs

• Point out the correct statement.
a) MapReduce tries to place the data and the compute as close as
possible
b) Map Task in MapReduce is performed using the Mapper() function
c) Reduce Task in MapReduce is performed using the Map() function
d) All of the mentioned
• Point out the correct statement.
a) Hadoop is an ideal environment for extracting and transforming
small volumes of data
b) Hadoop stores data in HDFS and supports data
compression/decompression
c) The Giraph framework is less useful than a MapReduce job to solve
graph and machine learning
d) None of the mentioned



MCQs

• What license is Hadoop distributed under?
a) Apache License 2.0
b) Mozilla Public License
c) Shareware
d) Commercial
• Which of the following genres does Hadoop produce?
a) Distributed file system
b) JAX-RS
c) Java Message Service
d) Relational Database Management System



Expected Questions for University Exam

Q:1 Write a short note on the following terms:
–Materialized Views
–Distribution Models
–Sharding
Q:2 Explain the following terms:
–Version Stamps
–Map Reduce Calculations
–Partitioning and Combining
–Consistency
Q:3 Explain Master-Slave Replication with the help of a suitable
example.
Q:4 Explain Peer-to-Peer Replication with the help of a suitable
example.



Expected Questions for University Exam

Q:5 Explain the concept of NoSQL in Big Data.
Q:6 Give the difference between a relational database and a NoSQL
database.
Q:7 Explain the various aggregate data models:
Key value storage
Document Storage
Graph Storage
Column value Storage
Q:8 Explain the schemaless database. Also explain the properties
of a schemaless database.



Summary

• A NoSQL database, also called Not Only SQL, is an approach to data
management and database design that's useful for very large sets of
distributed data.
• SQL databases have a predefined schema, whereas NoSQL databases
have a dynamic schema for unstructured data.
• There are four general types of NoSQL databases, each with its
own specific attributes:
–Key value storage
–Document Storage
–Graph Storage
–Column value Storage




Thank You