G7 - P3 - Big Data Concepts and Application - NoSQL Vs Relational DB - Key-Value Model

The document discusses big data concepts and applications, NoSQL databases as big data storage systems, and key-value model NoSQL systems. It covers topics like big data characteristics, distributed computing, big data analytics applications, big data technologies, and characteristics of NoSQL systems.

Big Data Concepts & Applications

NoSQL vs Relational DB
Key-Value model

Nguyễn Thị Kim Tuyên

Huỳnh Nguyễn Hồng Nhân


Agenda
● Big Data
○ Concepts
○ Applications
● NOSQL - Big Data Storage Systems
○ NoSQL vs Relational DB
● Categories of NOSQL Systems
○ Document-based NOSQL system
○ NOSQL key-value stores
○ Column-based or wide column NOSQL systems
○ Graph-based NOSQL systems
○ Hybrid NOSQL systems
○ Object databases
○ XML databases
● NoSQL Key-Value Stores
○ DynamoDB
○ Voldemort Key-Value distributed Data Store
○ Oracle key-value store ( Oracle NoSQL Database)
○ Redis key-value cache and store
○ Apache Cassandra
○ DEMO
Big Data Concepts
The McKinsey Global Institute's report (2012) on big data defines the term Big Data as datasets whose size exceeds the typical reach of a DBMS to capture, store, manage, and analyze that data.

Facts mentioned in the McKinsey report:
❏ A $600 disk can store all of the world's music today.
❏ Every month, 30 billion items of content are stored on Facebook.
❏ More data is stored in 15 of 17 sectors of the U.S. economy than is stored in the Library of Congress, which, as of 2011, stored 235 terabytes of data.
❏ There is currently a need for over 140,000 deep-data-analysis positions and over 1.5 million data-savvy managers in the United States. Deep data analysis involves more knowledge-discovery-type analyses.

https://www.domo.com/learn/infographic/data-never-sleeps-9
Big Data's Characteristics/Dimensions (Vs)

● Volume: refers to the size of data managed by the system.
○ Petabytes, Zettabytes
○ Enterprises' transactions
○ Sensors, scanning equipment, radio-frequency identification (RFID)
○ Industrial Internet of Things (IIoT/IoT)
○ Social networks
● Velocity: the speed at which data is created, accumulated, ingested, and processed. High velocity, e.g.:
○ Processing of streaming data for analysis (accumulating the likes of Twitter and Facebook)
○ Processing transactions to detect potential fraud
○ Transactions on a stock exchange
● Variety: the forms of data - structured, text, media. E.g. Internet data (clickstream, social media), research data (surveys, industry reports), location data (mobile device data, geospatial data), images (surveillance, satellites, medical scanning), e-mails, signal data (sensors, RFID devices), videos (YouTube)
○ Structured data
○ Semistructured data
○ Unstructured data
● Veracity: refers to the quality of the collected data. Because data varies in trustworthiness, it must go through some degree of quality testing and credibility analysis. Two built-in features:
○ The credibility of the source
○ The suitability of data for its target audience
Big Data’s Characteristics
Variety

Big Data Classification


Source: https://statswiki.unece.org/display/bigdata/Classification+of+Types+of+Big+Data
Distributed Computing for Big Data
There isn’t a single distributed
computing model because
computing resources can be
distributed in many ways.
E.g., you can distribute a set of
programs on the same
physical server and use
messaging services to enable
them to communicate and
pass information.
It is also possible to have
many different systems or
servers, each with its own
memory, that can work
together to solve one
problem.
Big Data Architecture
Big Data Applications - Analytics
The IBM (2014) book "Analytics Across the Enterprise: How IBM Realizes Business Value from Big Data and Analytics" describes various types of analytics applications:

"People respond to facts. Rational people will make rational decisions if you present them with the right data." —Linda Sanford, Senior Vice President, Enterprise Transformation, IBM Corporation

● Descriptive and predictive analytics:
○ Descriptive analytics relates to reporting what has happened, analyzing the data that contributed to it to figure out why it happened, and monitoring new data to find out what is happening now. E.g. a weather report.
○ Predictive analytics uses statistical and data mining techniques to make predictions about what will happen in the future. E.g. a weather forecast.
● Prescriptive analytics: refers to analytics that recommends actions. E.g. setting prices based on predicted price elasticity, advertising based on predicted views of an ad and predicted sales lift per viewer, targeting promotional offers (coupons) based on customer segmentation.
● Social media analytics: refers to sentiment analysis to assess public opinion on topics or events. Allows users to discover the behaviour patterns & tastes of individuals, which can measure the return on investment (ROI). E.g. a sports clothing manufacturer, an automobile manufacturer, among other industries.
● Entity analytics: a new area that groups data about entities of interest and learns more about the relationships between them. Focuses on sorting through data and grouping together data that relates to the same entity. E.g. customer records (a driver's license number and a credit card number can be combined into one record).
● Cognitive computing: refers to an area of developing computing systems that will interact with people to give them better insight and advice.

https://financesonline.com/big-data-statistics/
Big Data Applications - A Real-World View
● Streaming data with an Environment impact
○ Using sensors to provide real-time information about rivers and oceans
● Streaming data with a Public Policy impact
○ Every area of a city has the capability to collect data, whether in the form of taxes, sensors on buildings and bridges, traffic pattern monitoring, location data, or data about criminal activity. Creating workable policies that make cities safer, more efficient, and more desirable places to live and work requires the collection and analysis of huge amounts of data from a variety of sources.
○ As a result, city leaders have an abundance of information about how policies impacted people in their city in prior years, but it has been very challenging to share and leverage fast-changing data to make real-time decisions that can improve city life.
● Streaming data in the Healthcare Industry
○ Medical clinicians and researchers are using streaming data to speed decision making in hospital settings and improve healthcare outcomes for patients.
○ Doctors make use of large amounts of time-sensitive data when caring for patients, including results of lab tests, pathology reports, X-rays, and digital
imaging. They also use medical devices to monitor a patient’s vital signs such as blood pressure, heart rate, and temperature. While these devices
provide alerts when the readings go out of normal range, in some cases, preventive action could take place if doctors were able to receive an early
warning.
● Streaming data in the Energy Industry
○ To increase energy efficiency. E.g. a large university monitors streaming data on its energy consumption and integrates it with weather data to make real-time adjustments in energy use and production.
○ To advance the production of alternative sources of energy. E.g. a wind-farm company uses streaming data to create hourly and daily predictions about energy production. The resulting analytics are used to select the best location for its wind turbines and to reduce cost per kilowatt-hour of energy produced.
● Connecting streaming data to historical and other real-time data sources
○ The most important aspect of these types of outcomes requires the ability to understand the context of the situation. A doctor might see analysis that
points to a particular disease. However, further analysis of other patients with similar symptoms and test results shows that other possible
diagnoses may exist. In a complicated world, data is valuable in taking action only in the context of how it is applied to a problem.
Big Data Applications (cont)

Examples: log data applications, ad/media applications, digital marketing applications
Big Data Technologies
NOSQL Databases - Distributed DB - Big Data Storage Systems

● Class of systems developed to manage large amounts of data in organizations such as Google, Amazon, Facebook, Twitter, … and in applications e.g. social media, Web links, marketing & sales, road maps & spatial data, e-mail.
● The term NOSQL is read as Not Only SQL - rather than NO to SQL - and is meant to convey that many applications need systems other than traditional relational SQL systems for data management.
● Most NOSQL systems are distributed databases/distributed storage systems
○ Focus on semistructured data storage, high performance, availability, data replication, scalability
○ Less emphasis on immediate data consistency, powerful query languages, and structured data storage.
Characteristics of NOSQL Systems
● Scalability: two kinds of scalability in distributed systems
○ Vertical Scalability
○ Horizontal Scalability: used in NOSQL systems
● Availability, Replication, Eventual Consistency
● Replication Models:
○ Master-Slave replication
○ Master-Master replication
● Sharding of Files (horizontal partitioning)
● High-Performance Data Access
○ Hashing
○ Range partitioning
● Data models & query languages: emphasize performance & FLEXIBILITY
○ Not requiring a Schema
○ Less Powerful Query Languages
○ Versioning
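The two high-performance data-access schemes listed above, hashing and range partitioning, can be sketched as follows. This is a toy illustration (the 4-node cluster, the split points, and the function names are made up, not taken from any particular NOSQL system):

```python
# Sketch: contrasting the two data-access schemes above -- hashing and
# range partitioning -- over a hypothetical 4-node cluster.

def hash_partition(key: str, num_nodes: int) -> int:
    """Hashing: the node is derived from the key alone -> O(1) lookups,
    but keys are scattered, so a range scan must touch every node."""
    return hash(key) % num_nodes

def range_partition(key: str, boundaries: list) -> int:
    """Range partitioning: keys are assigned to nodes by sorted key
    ranges, so a scan over contiguous keys touches only a few nodes."""
    for node, upper in enumerate(boundaries):
        if key < upper:
            return node
    return len(boundaries)  # last node holds everything >= the top boundary

boundaries = ["g", "n", "t"]  # hypothetical split points for 4 nodes
print(range_partition("apple", boundaries))   # node 0
print(range_partition("mango", boundaries))   # node 1
print(range_partition("zebra", boundaries))   # node 3
```

The trade-off this makes visible: hashing spreads load evenly but destroys key order; range partitioning preserves order (good for scans) but can create hot spots.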
NOSQL vs Relational Database/ SQL Database (1)
Relational databases provide a store of related data
tables. These tables have a fixed schema, use SQL
(Structured Query Language) to manage data, and support
ACID guarantees.

NoSQL databases refer to high-performance, non-relational data stores. They excel in their ease-of-use, scalability, resilience, and availability characteristics. Instead of joining tables of normalized data, NoSQL stores unstructured or semi-structured data, often in key-value pairs or JSON documents. NoSQL databases typically don't provide ACID guarantees beyond the scope of a single database partition. High-volume services that require sub-second response time favor NoSQL datastores.
NOSQL vs Relational Database/ SQL Database (2)
SQL: Relational database management system (RDBMS)
NoSQL: Non-relational or distributed database system

SQL: Fixed, static, or predefined schema
NoSQL: Dynamic schema

SQL: Not suited for hierarchical data storage
NoSQL: Best suited for hierarchical data storage

SQL: Best suited for complex queries
NoSQL: Not so good for complex queries

SQL: Vertically scalable
NoSQL: Horizontally scalable

SQL: Follows ACID properties
NoSQL: Follows CAP (consistency, availability, partition tolerance)

Consideration for NoSQL vs Relational Database

Consider a NoSQL datastore when:
● You have high-volume workloads that require predictable latency at large scale (e.g. latency measured in milliseconds while performing millions of transactions per second)
● Your data is dynamic and frequently changes
● Relationships can be denormalized data models
● Data retrieval is simple and expressed without table joins
● Data is typically replicated across geographies and requires finer control over consistency, availability, and performance
● Your application will be deployed to commodity hardware

Consider a relational database when:
● Your workload volume generally fits within thousands of transactions per second
● Your data is highly structured and requires referential integrity
● Relationships are expressed through table joins on normalized data models
● You work with complex queries and reports
● Data is typically centralized, or can be replicated to regions asynchronously
● Your application will be deployed to large, high-end hardware

Consistency Models
How NOSQL systems approach the issue of consistency among replicas

Eventual consistency: eventually, everything will be consistent. Yes, eventually. That's the only promise.

Strict consistency: the system always returns the latest write. For any incoming write operation, once a write is acknowledged to the client, the updated value is visible on read from any node.

Implementation of Eventual Consistency

1. Multi-leader replication: two data centers. Each data center has one master write node and several read replicas. The two master nodes asynchronously replicate towards each other. Eventually everything will be in sync. However, if you write(x,v1) on master 1 and write(x,v2) on master 2, there is no guarantee which value read(x) will eventually return; it depends on the replication speed or wall clock. The wall clock is not reliable in a distributed system: distributed systems don't share the same time in their internal wall clocks.

2. Leaderless replication: Cassandra and DynamoDB. You can adjust your consistency level. If you opt in for a lower level of consistency, they use asynchronous replication to spread the writes through gossip protocols.

https://www.cohesity.com/blogs/strict-vs-eventual-consistency/
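The multi-leader scenario above can be simulated in a few lines. This is an illustrative sketch (the `Master` class is invented for the demo, not any real database API); it shows that naive asynchronous replication leaves the replicas disagreeing about x until some deterministic conflict rule, such as last-writer-wins timestamps, is applied:

```python
# Sketch: two "masters" accept writes independently and replicate
# asynchronously. Until a conflict-resolution rule is applied, reads can
# return different values depending on which node you ask.

class Master:
    def __init__(self):
        self.data = {}
        self.outbox = []          # pending async replication messages

    def write(self, key, value):
        self.data[key] = value
        self.outbox.append((key, value))

    def replicate_to(self, other):
        # the "eventually" part: replication happens some time later
        for key, value in self.outbox:
            other.data[key] = value
        self.outbox.clear()

m1, m2 = Master(), Master()
m1.write("x", "v1")               # write(x, v1) on master 1
m2.write("x", "v2")               # write(x, v2) on master 2
print(m1.data["x"], m2.data["x"])  # v1 v2 -- replicas disagree

m1.replicate_to(m2)
m2.replicate_to(m1)
# Naive replay just swaps the values (m1 now has v2, m2 has v1): without
# a deterministic rule the replicas never converge on one value -- the
# ambiguity the slide describes.
print(m1.data["x"], m2.data["x"])  # v2 v1
```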
CAP Theorem - emphasis of NOSQL systems on Availability

The theorem states that distributed data systems offer a trade-off between consistency, availability, and partition tolerance, and that any database can only guarantee two of the three properties:

❖ Consistency : the nodes will have the same copies of


a replicated data item visible for various transactions.
❖ Availability : each read or write request for a data item will either be processed successfully or will receive a message that the operation cannot be completed.
❖ Partition Tolerance : guarantees the system
continues to operate even if a replicated data node
fails or loses connectivity with other replicated data
nodes.
CAP Theorem
CAP's consistency vs ACID's consistency; strong consistency vs weak consistency (see Consistency Models)

https://docs.microsoft.com/en-us/dotnet/architecture/cloud-native/relational-vs-nosql-data
Categories of NOSQL Systems
DynamoDB Introduction - Data Model
The DynamoDB system is an Amazon product and is available as part of Amazon's AWS/SDK platforms. It can be used as part of Amazon's cloud computing services, for the data storage component.

DynamoDB data model: concepts of tables, items, and attributes.
A table in DynamoDB does not have a schema. It holds a collection of self-describing items.
Each item consists of a number of (attribute, value) pairs, and attribute values can be single-valued or multivalued.
DynamoDB allows the user to specify items in JSON format (the system will convert them to the internal storage format of DynamoDB).
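The schemaless, self-describing item model above can be sketched with plain dictionaries. The attribute names (Artist, SongTitle, Genres, Year) are illustrative only; the point is that items in the same table may carry different attributes, and attribute values may be multivalued:

```python
import json

# Sketch: DynamoDB-style self-describing items. The "table" has no schema;
# each item is a set of (attribute, value) pairs, and items in the same
# table can carry different attributes. Names here are illustrative.

table = {}  # primary-key value -> item

item1 = {"Artist": "No One You Know", "SongTitle": "Call Me Today",
         "Genres": ["Country", "Folk"]}          # multivalued attribute
item2 = {"Artist": "Acme Band", "Year": 2012}    # different attributes: fine

table[("No One You Know", "Call Me Today")] = item1
table[("Acme Band",)] = item2

# Items can be supplied in JSON; the system converts them to its internal
# storage format.
print(json.dumps(item1))
```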
DynamoDB keys and attributes
https://medium.com/zenofai/scaling-dynamodb-for-big-data-using-parallel-scan-1b3baa3df0d8
Denormalization by using a complex attribute

https://www.alexdebrie.com/posts/dynamodb-one-to-many/
Denormalization by duplicating data

https://www.alexdebrie.com/posts/dynamodb-one-to-many/
DynamoDB Indexing
When a table is created, it is required to specify a table name and primary key.

Two types of primary key:

❏ A single attribute (partition key): the DynamoDB system uses this attribute to build a hash index on the items in the table -> hash type primary key. The items are not ordered in storage by the value of the hash attribute.
❏ A pair of attributes (partition key and sort key): hash and range type primary key. (A, B): attribute A will be used for hashing, and the B values will be used for ordering the records with the same A value.

A table with this type of key can have additional secondary indexes defined on its attributes. E.g., if we want to store multiple versions of some type of items in a table, we could use ItemID as hash and Date or Timestamp as range in a hash and range type primary key.

Primary key & Partition key
https://aws.amazon.com/blogs/database/choosing-the-right-dynamodb-partition-key/
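The versioning example above (ItemID as hash attribute, Timestamp as range attribute) can be sketched in-memory. This is a minimal illustration, not DynamoDB's actual storage engine; the key names and values are made up:

```python
from bisect import insort
from collections import defaultdict

# Sketch: ItemID as the hash (partition) attribute, Timestamp as the range
# (sort) attribute. Items sharing an ItemID land in the same partition and
# are kept ordered by Timestamp, so "all versions" is an ordered lookup.

table = defaultdict(list)   # ItemID -> sorted list of (timestamp, value)

def put(item_id, timestamp, value):
    insort(table[item_id], (timestamp, value))  # keep range-key order

def query_versions(item_id):
    """All versions of one item, ordered by the range key (Timestamp)."""
    return table[item_id]

put("doc-42", "2024-01-02", "v2")
put("doc-42", "2024-01-01", "v1")
put("doc-42", "2024-01-03", "v3")

print(query_versions("doc-42"))
# [('2024-01-01', 'v1'), ('2024-01-02', 'v2'), ('2024-01-03', 'v3')]
```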
DynamoDB Indexing
Why do I need a partition key?

DynamoDB stores and retrieves each item based on the primary key value (unique). Items are distributed across 10-GB storage units, called partitions (physical storage internal to DynamoDB). Each table has one or more partitions.

DynamoDB uses the partition key's value as an input to an internal hash function. The output from the hash function determines the partition in which the item is stored. Each item's location is determined by the hash value of its partition key.

In most cases, all items with the same partition key are stored together in a collection, which we define as a group of items with the same partition key but different sort keys. For tables with composite primary keys, the sort key may be used as a partition boundary: DynamoDB splits partitions by sort key if the collection size grows bigger than 10 GB.

Primary key & Partition key
https://aws.amazon.com/blogs/database/choosing-the-right-dynamodb-partition-key/
Recommendations for Partition Keys
Use high-cardinality attributes (distinct values for each item, like emailid, employee_no, customerid, sessionid, orderid).

Use composite attributes based on the access pattern. E.g. customerid#productid#countrycode as the partition key and order_date as the sort key, where the symbol # is used to split the different fields.

Add random numbers or digits from a predetermined range for write-heavy use cases. Suppose that you expect a large volume of writes for a partition key (for example, greater than 1,000 writes per second). In this case, use an additional prefix or suffix (a fixed number from a predetermined range, say 0-9) and add it to the partition key.

https://aws.amazon.com/blogs/database/choosing-the-right-dynamodb-partition-key/
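The write-sharding tip above (random suffix from a fixed 0-9 range) can be sketched as follows. This is an illustrative helper, not an AWS API; the key format with # as separator follows the composite-attribute convention mentioned above:

```python
import random

# Sketch: for a hot partition key, append a random suffix from a fixed
# range (0-9) so writes spread over 10 logical partitions. The trade-off:
# reads must fan out over all suffixes and merge the results.

SUFFIXES = range(10)

def sharded_key(partition_key: str) -> str:
    """Write side: pick one of the 10 shard keys at random."""
    return f"{partition_key}#{random.choice(SUFFIXES)}"

def all_shard_keys(partition_key: str) -> list:
    """Read side: enumerate every suffix so a query can cover all shards."""
    return [f"{partition_key}#{s}" for s in SUFFIXES]

print(sharded_key("2024-06-01"))         # e.g. 2024-06-01#7
print(all_shard_keys("2024-06-01")[:3])  # ['2024-06-01#0', '2024-06-01#1', '2024-06-01#2']
```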
DynamoDB Secondary Index
● Global secondary index — an index with a hash and range key that can be
different from those on the table. A global secondary index is considered
“global” because queries on the index can span all of the data in a table,
across all partitions.
● Local secondary index — an index that has the same hash key as the table,
but a different range key. A local secondary index is “local” in the sense that
every partition of a local secondary index is scoped to a table partition that
has the same hash key.
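The global/local distinction above can be modeled with plain Python dicts. This is a conceptual sketch only (the attribute names customer_id, order_date, status are invented for the example, and real DynamoDB indexes are maintained by the service, not computed like this):

```python
# Sketch: base table keyed by (customer_id hash, order_date range).
# An LSI keeps the same hash key but re-sorts by another attribute;
# a GSI is a separate index with its own hash key spanning the whole table.

orders = [
    {"customer_id": "c1", "order_date": "2024-01-05", "status": "SHIPPED"},
    {"customer_id": "c1", "order_date": "2024-01-02", "status": "OPEN"},
    {"customer_id": "c2", "order_date": "2024-01-03", "status": "SHIPPED"},
]

# Local secondary index: same hash key (customer_id), different sort key
# (status) -- scoped to each customer's partition.
lsi = sorted(orders, key=lambda o: (o["customer_id"], o["status"]))

# Global secondary index: a new hash key (status) spanning all partitions.
gsi = {}
for o in orders:
    gsi.setdefault(o["status"], []).append(o)

print([o["status"] for o in lsi if o["customer_id"] == "c1"])  # ['OPEN', 'SHIPPED']
print(len(gsi["SHIPPED"]))                                     # 2
```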
DynamoDB Query & Scan
Add a GSI (Global Secondary Index) to index that attribute and enable Query. As a last resort, use Scan.

https://dynobase.dev/dynamodb-scan-vs-query/
DynamoDB Use Cases

SnapChat
Voldemort Key-Value Distributed Data Store

● Voldemort is an open source system (Apache 2.0 license), based on Amazon's Dynamo.
● Focus on high performance and horizontal scalability, as well as on providing replication for high availability and sharding for improving latency (response time) of read and write requests.
● Technique to distribute the key-value pairs among the nodes of a distributed cluster: consistent hashing.
● Features:
○ Simple basic operations: a collection of (key, value) pairs is kept in a Voldemort store(s).
○ High-level formatted data values: JSON.
○ Consistent hashing for distributing (key, value) pairs.
○ Consistency and versioning: similar to DynamoDB for consistency in the presence of replicas. Concurrent write operations are allowed by different processes -> two or more different values may exist for the same key at different nodes when items are replicated. Consistency is achieved using versioning and read repair. Each write is associated with a vector clock value. When a read occurs, the system can reconcile a final single value from the different versions of the same value (of the same key) based on application semantics.
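The vector-clock versioning described above can be sketched as follows. This is illustrative, not Voldemort's actual API: a clock that dominates another has seen all of its writes and supersedes it; versions that neither dominates are concurrent and are handed back to the application to reconcile:

```python
# Sketch: vector-clock reconciliation on read. Clocks are dicts mapping
# node name -> write counter; values are arbitrary payloads.

def dominates(a: dict, b: dict) -> bool:
    """True if clock a has seen every write that clock b has."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def reconcile(versions):
    """Keep only versions that no other version dominates (the set of
    concurrent versions the application must choose among)."""
    return [(clock, value) for clock, value in versions
            if not any(dominates(other, clock) and other != clock
                       for other, _ in versions)]

v1 = ({"A": 2, "B": 1}, "x=10")   # superseded by v2
v2 = ({"A": 3, "B": 1}, "x=11")
v3 = ({"A": 2, "B": 2}, "x=12")   # concurrent with v2 -> app must choose

survivors = reconcile([v1, v2, v3])
print([value for _, value in survivors])   # ['x=11', 'x=12']
```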
Voldemort Consistent Hashing
For distributing (k,v) pairs among the nodes in the distributed cluster of nodes: a hash function h(k) applied to the key k determines where the item will be stored.

The range of h(k) values, [0, Hmax = 2^n - 1] for n-bit hash values, is evenly distributed on a circle (or ring).

An item (k,v) will be stored on the node whose position on the ring follows the position of h(k) in a clockwise direction.

This scheme allows horizontal scalability: a new node can be added at one or more locations on the ring, depending on the node's capacity.

Allows replication: a specified number of replicas of the item are placed on successive nodes on the ring in a clockwise direction.

Sharding: different items in the store (file) are located on different nodes. Nodes with higher capacity can have more locations on the ring.
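The ring described above can be sketched in a few lines. This is an illustrative implementation under assumed parameters (a 32-bit MD5-derived ring and three hypothetical node names), not Voldemort's real code; the point is that a key is stored on the first node clockwise from h(k), and adding a node moves only the keys in the segment it takes over:

```python
import hashlib
from bisect import bisect_right

RING_BITS = 32
RING_SIZE = 2 ** RING_BITS           # positions 0 .. 2^n - 1 on the ring

def h(key: str) -> int:
    """Deterministic position on the ring for a key or node name."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

class Ring:
    def __init__(self, nodes):
        # each node sits at its hashed position; sorted for clockwise walk
        self.points = sorted((h(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        """First node whose position follows h(key) clockwise."""
        positions = [p for p, _ in self.points]
        i = bisect_right(positions, h(key)) % len(self.points)  # wrap around
        return self.points[i][1]

ring = Ring(["node-1", "node-2", "node-3"])
before = {k: ring.node_for(k) for k in ("a", "b", "c", "d", "e")}

ring2 = Ring(["node-1", "node-2", "node-3", "node-4"])  # scale out
moved = [k for k in before if ring2.node_for(k) != before[k]]
# Only keys in the ring segment node-4 took over move; the rest stay put.
print(moved)
```

Replication (not shown) would place copies on the next nodes clockwise; giving a high-capacity node several positions on the ring is the "virtual nodes" variant the slide alludes to.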
DEMO
Big Data Technologies Challenges
● Heterogeneity of information
● Privacy and confidentiality
● Need for visualization and better human interfaces
● Inconsistent and incomplete information
References
1. https://docs.microsoft.com/en-us/dotnet/architecture/cloud-native/relational-vs-nosql-data
2. https://www.geeksforgeeks.org/difference-between-sql-and-nosql/
3. https://medium.com/zenofai/scaling-dynamodb-for-big-data-using-parallel-scan-1b3baa3df0d8
4. https://aws.amazon.com/blogs/database/choosing-the-right-dynamodb-partition-key/
5. https://www.project-voldemort.com/voldemort/
6. https://financesonline.com/big-data-statistics/
7. IBM (2014): "Analytics Across the Enterprise: How IBM Realizes Business Value from Big Data and Analytics"
8. https://aws.amazon.com/dynamodb/customers/
9. J. Hurwitz, A. Nugent, F. Halper, M. Kaufman, "Big Data for Dummies", John Wiley & Sons Inc., 2013, ISBN: 978-1-118-64401-0
10. R. Elmasri & S.B. Navathe (2016): Fundamentals of Database Systems, 7th Edition, Addison-Wesley, ISBN-13: 978-0-13-397077-7
11. https://www.cohesity.com/blogs/strict-vs-eventual-consistency/
12. https://www.linkedin.com/pulse/consistency-models-distributed-system-hohuan-chang
13. https://techvidvan.com/tutorials/big-data-technologies/
