
Big Data Analytics Lecture 3A

The document outlines the agenda for a Big Data Analytics course, focusing on NoSQL fundamentals, the CAP theorem, and MongoDB installation. It discusses the motivations for using NoSQL databases over traditional SQL databases, emphasizing scalability, flexibility, and performance in handling large data volumes. Additionally, it explains the CAP theorem's implications for distributed systems and provides an overview of various NoSQL database types.

MSDA9215: Big Data Analytics

April/May 2024

Week 3A: Data Storage Methods & Big Data Processing
Temitope Oguntade
Agenda

● Midterm Exam
● NoSQL Fundamentals
● CAP Theorem
● Flexible Data Model
● MongoDB Installation
Logistics

● Instructor: Temitope Oguntade
● Email: [email protected]
● TAs:
● Class Time: 08:30 - 17:00 Mon, 18:00 - 20:45 Wed
● Course Credit:
● Prerequisite: As may be determined
● Code Source:
● Credits: The Sail Platform
Why NoSQL?

● The choice between NoSQL and SQL databases boils down to the
requirements of the application and the nature of the data.
● NoSQL databases are designed to complement SQL databases, not to
replace them.
Motivation for industry: 1/2
The scale of users, devices, and data continues to increase
significantly. The challenge and competitive advantage are to provide
the best possible experience for users or customers. Technology has
to adapt to offer experiences that correspond to the evolving behavior
of users.

● Walmart saw a sharp decline in conversion rate when page load
time increased from one to four seconds. (Source: Walmart Labs)
● Amazon noticed a 1% drop in sales for every 100 ms of added
latency.
Motivation for industry: 2/2

● The browsing and shopping experience on Amazon is another
good example because it can cost Amazon millions if they do not
provide performant solutions.
● Amazon observed that many users prefer low latency over
accuracy. They are happy to refresh their shopping cart but are
not willing to tolerate a slow loading page.
● Hence, occasionally having an inaccurate but fast-loading
shopping cart is favored over a shopping cart that is slow to load
with accurate results.
Motivation for industry: Requirements

Let’s look at some of the requirements of this competitive and
fast-evolving market that have driven the development of the NoSQL
paradigm, in areas where traditional RDBMSs fall short.

● For Clients/customers
● For Developers
● For Business Owners
Motivation for industry: Requirements
For clients/customers
○ Provide high availability and scalability
○ Process a large amount of data
○ Be geographically aware

● Customers expect a service provided by web services to be
available at all times irrespective of whether it is a Black Friday or
a Cyber Monday.
● With the advent of mobile phones, social networks and the
Internet of things (IoT), large amounts of data are being
generated.
Motivation for industry: Requirements
● It is also important to provide customers with the same Quality of
Service (QoS) wherever in the world they are.
● Relational databases are generally deployed on a single server,
where the available resources such as memory, processing power,
and storage are fixed.
○ Hence, meeting the scaling requirements means moving to a
bigger server (vertical scaling), which tends to be costlier.
● NoSQL databases, however, scale out simply by adding more
servers (horizontal scaling).
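A minimal sketch of what "adding more servers" can mean for the data itself, assuming a simple hash-modulo placement scheme. The names `shard_for` and `distribute` are hypothetical and only illustrate the idea; this is not how any particular NoSQL product places data.

```python
import hashlib


def shard_for(key: str, num_servers: int) -> int:
    """Map a key to a server index with a stable hash (illustrative scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_servers


def distribute(keys, num_servers):
    """Group keys by the server index that would store them."""
    servers = {i: [] for i in range(num_servers)}
    for key in keys:
        servers[shard_for(key, num_servers)].append(key)
    return servers


user_keys = [f"user:{i}" for i in range(1000)]

# Scaling out from three to four servers spreads the same keys more thinly.
three_servers = distribute(user_keys, 3)
four_servers = distribute(user_keys, 4)

print({server: len(keys) for server, keys in three_servers.items()})
print({server: len(keys) for server, keys in four_servers.items()})
```

With plain modulo placement, changing the number of servers moves many keys to a different server, which is one reason production systems tend to use schemes such as consistent hashing instead.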
Motivation for industry: Requirements
For Developers
○ A flexible data model or schema
○ Agile development

● Unlike relational databases that require structured schemas,
NoSQL does not impose restrictions on the structure of the data.
● This provides for an Agile Development process and helps to
achieve fast development and deployment lifecycles.
Motivation for industry: Requirements
For Business owners
○ Move on from costly enterprise solutions to open source
solutions
○ Exploit commodity hardware and the power of cloud
computing resources in a cost-effective way

● Due to the heavy increase in users, business owners have to think
about how to manage the operational cost of their business.
● Traditional database solutions tend to be costlier when
businesses need to scale as their data or user base grows.
The Challenge to Achieve Scalability
● In the cloud era, one of the major requirements is scalability to
handle an increasing workload.
● Vertical scaling (a.k.a. scale-up) is limited by the capacity of a
single machine and does not match the needs of distributed
systems. Horizontal scaling (a.k.a. scale-out) is more practical for
scaling dynamically over time by adding more resources to the
distributed system.
● However, scaling out introduces additional complexity because the
application or system is distributed across multiple networked
machines, any of which could fail intermittently or permanently.
● This trade-off between the need to scale out and the complexity it
brings is best stated by the CAP theorem.
CAP Theorem: 1/2
CAP theorem states that it is impossible for a distributed data store to
simultaneously provide more than two out of the following three
guarantees.
● Consistency. Every read receives the most recent write or an error,
so no stale data is ever returned. At any given time, all nodes
return the same result for the same read.
● Availability. The system is always available with no downtime.
Every request receives a non-error response, which can be either the
most recent write or stale data.
CAP Theorem: 2/2
● Partition Tolerance. A partition means the nodes are split into
multiple groups that cannot communicate with one another
because of a network failure. For example, the network between
group A and group B is cut off, or all the nodes in group B are down.
Partition tolerance requires that the system continues to operate
during such a network failure or partition between nodes.
Figure 1: CAP Theorem Diagram
CAP Theorem Diagram
● The CAP theorem applies only when a network partition or failure
happens, not at all times. If there were no network failures, both
availability and consistency could be satisfied and there would be no
trade-off between them. However, network failures are unavoidable in
any distributed system, so network partitioning has to be tolerated.
● If a partition happens in a distributed data store, one has to make
the trade-off between consistency and availability.
● During a network partition, if consistency is chosen over availability,
the system returns an error or a timeout whenever it cannot guarantee
that the requested information is the most recent, and clients
experience downtime. If availability is chosen over consistency, the
system returns the most recent version available to it, which can be
stale.
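The two choices described above can be sketched with a toy replica that is cut off from the primary. Everything here (the `Replica` class, the `Unavailable` exception, the `"CP"` and `"AP"` mode labels) is a hypothetical illustration of the behaviour, not the API of any real database.

```python
class Unavailable(Exception):
    """Raised by a CP replica that cannot prove its data is current."""


class Replica:
    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP" (labels for this toy only)
        self.value = None
        self.partitioned = False  # True while cut off from the primary

    def replicate(self, value):
        """Apply a write sent by the primary; it is lost while partitioned."""
        if not self.partitioned:
            self.value = value

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Consistency over availability: refuse rather than risk stale data.
            raise Unavailable("cannot confirm latest write during partition")
        # Availability over consistency: answer with whatever is stored locally.
        return self.value


def run(mode):
    replica = Replica(mode)
    replica.replicate("cart=[book]")
    replica.partitioned = True
    replica.replicate("cart=[book, pen]")  # never arrives at the replica
    try:
        return replica.read()
    except Unavailable as exc:
        return f"error: {exc}"


print("CP:", run("CP"))
print("AP:", run("AP"))
```

The CP replica reports an error (downtime from the client's point of view), while the AP replica answers immediately with the older cart.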
Examples

Guarantees   Examples                                        Trade-off
CA           Traditional relational databases, e.g.          non-distributed
             MySQL, PostgreSQL
CP           HBase, MongoDB, Memcached, Redis, Azure SQL     downtime
AP           Cassandra, Amazon DynamoDB                      stale data

● Traditional SQL databases are generally not designed to be
distributed, and NoSQL databases fill that gap. Once the datastore is
distributed, the trade-off is only between consistency and availability.
Quick Note

CAP theorem is not an unconditional "two out of three" dilemma. It is a
conditional trade-off when a network partition happens for a distributed
system.
Keywords in CAP Theorem:
● Consistency: no stale data
● Availability: no downtime
● Partition Tolerance: network failure tolerance in a distributed system
Low Latency v/s Strong Consistency

● A network failure may be a rare event, but there is always network delay.
● The latency between nodes can be regarded as a temporary partition.
It causes a temporary trade-off between consistency and availability.
The CAP theorem indicates that a distributed data store (P) that
guarantees low latency (A) has to give up strong consistency (C) and
allow stale data. If stale data is tolerable and low latency is what you
care about, as with item recommendations on Amazon, you can choose the
AP model with a looser guarantee such as eventual consistency. Eventual
consistency means that reads may return stale data for a while, but all
replicas converge to the latest value once updates stop arriving.
● One of the best examples of the AP model to achieve low latency is
Amazon DynamoDB. Amazon DynamoDB offers "consistently single-digit
millisecond latency at any scale" with eventual consistency.
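Eventual consistency can be sketched as a replica that lags behind the primary until pending writes are applied. The class `EventuallyConsistentStore` and its methods are hypothetical names for this toy model; real systems replicate asynchronously over the network rather than through an explicit `sync` call.

```python
from collections import deque


class EventuallyConsistentStore:
    """Toy model: writes land on the primary first and reach the replica later."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = deque()  # writes not yet applied to the replica

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))

    def read_from_replica(self, key):
        # Fast local read; may be stale if replication has not caught up.
        return self.replica.get(key)

    def sync(self):
        # Apply every pending write in order so the replica converges.
        while self.pending:
            key, value = self.pending.popleft()
            self.replica[key] = value


store = EventuallyConsistentStore()
store.write("recommendation", "v1")
store.sync()
store.write("recommendation", "v2")

stale = store.read_from_replica("recommendation")  # replica still holds v1
store.sync()
fresh = store.read_from_replica("recommendation")  # replica has converged

print(stale, fresh)
```

The read between the second write and the second `sync` is fast but stale, which is the price the AP model accepts for low latency.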
Flexible Data Model 1/4
Introduction to Data Models
● SQL databases excel with structured data, prioritizing consistency but
often at the cost of performance.
● NoSQL databases enhance flexibility, especially suitable for unstructured
data, and improve performance via horizontal scaling and tolerance for
slightly outdated (stale) data.

Context: Building a Social Media Platform


● Scenario: Developing a multimedia social website where users can post
content (images, videos), comment, and vote.
● Initial thought: Designing SQL tables for Users, Posts, Comments, Images,
Videos, and Votes.
Flexible Data Model 2/4
Challenges with SQL for Social Media

● Complex Queries: Requires joining multiple tables to aggregate content for each
post, including images, videos, comments, and votes.

● Performance Issues: These complex joins can degrade performance, especially as
the database scales.

Limitations of SQL in Dynamic Environments

● Schema Inflexibility: SQL databases require predefined schemas which need extensive
modifications to accommodate new features like adding audio support or additional
attributes for images.

● Data Duplication Concerns: SQL insists on data normalization to avoid duplication,
which might not be necessary for all types of data, particularly in a dynamic,
content-driven website.
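To make the join overhead concrete, here is a small sketch using Python's built-in `sqlite3` module. The table layout, column names, and the helper `load_post` are assumptions made for illustration; the slides only name the tables, not their columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users    (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE posts    (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
CREATE TABLE images   (id INTEGER PRIMARY KEY, post_id INTEGER, url TEXT);
CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER,
                       user_id INTEGER, text TEXT);
CREATE TABLE votes    (id INTEGER PRIMARY KEY, post_id INTEGER, user_id INTEGER);

INSERT INTO users    VALUES (1, 'boo'), (2, 'far');
INSERT INTO posts    VALUES (1, 1, 'post title');
INSERT INTO images   VALUES (1, 1, 'images/1.png');
INSERT INTO comments VALUES (1, 1, 2, 'Nice post!'), (2, 1, 1, 'Love this!');
INSERT INTO votes    VALUES (1, 1, 2), (2, 1, 1);
""")


def load_post(conn, post_id):
    """Assemble one post for display; note how many tables are touched."""
    title, author = conn.execute(
        "SELECT p.title, u.name FROM posts p "
        "JOIN users u ON u.id = p.user_id WHERE p.id = ?",
        (post_id,),
    ).fetchone()
    images = [row[0] for row in conn.execute(
        "SELECT url FROM images WHERE post_id = ? ORDER BY id", (post_id,))]
    comments = [{"text": text, "user": name} for text, name in conn.execute(
        "SELECT c.text, u.name FROM comments c "
        "JOIN users u ON u.id = c.user_id WHERE c.post_id = ? ORDER BY c.id",
        (post_id,))]
    votes = conn.execute(
        "SELECT COUNT(*) FROM votes WHERE post_id = ?", (post_id,)
    ).fetchone()[0]
    return {"title": title, "user": author, "images": images,
            "comments": comments, "votes": votes}


post = load_post(conn, 1)
print(post)
```

Rendering a single post already needs four queries and two joins, and each new content type (videos, audio) would add another table and another query.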
Flexible Data Model 3/4
Advantages of Using NoSQL for Social Media Platforms
● Schema Flexibility: NoSQL, such as MongoDB, allows for more flexible data
models which are adaptable to changes like new content types without schema
redesign.
● Efficient Data Retrieval: Data is stored in a way that reflects its use on the site,
e.g., posts with all multimedia and interactions can be retrieved in a single query
without joins.

Example of NoSQL Implementation


● MongoDB Collections: Each post can be stored as a separate document within a
collection, simplifying the data retrieval process for dynamic content types.
{
  "_id": ObjectId("54c955492b7c8eb21818bd09"),
  "title": "post title",
  "date": "2016-04-30",
  "text": "My heart is on the cloud",
  "user": "boo",
  "images": ["http://social.s3.amazonaws.com/images/1.png"],
  "videos": [
    {"url": "http://social.s3.amazonaws.com/videos/1.mp4", "title": "First video"},
    {"url": "http://social.s3.amazonaws.com/videos/2.mp4", "title": "Second video"}
  ],
  "votes": 12,
  "comments": [
    {"text": "Nice post!", "date": "2016-05-01", "user": "far"},
    {"text": "Love this!", "date": "2016-05-01", "user": "boo"}
  ]
}
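The document above can be fetched with one lookup and no joins. The sketch below uses a plain Python list as a stand-in for a MongoDB collection so that it runs without a server; the helper `find_one` is a hypothetical local function that mimics an equality-match lookup, and the `_id` is a plain string rather than a real ObjectId.

```python
# In-memory stand-in for a "posts" collection (assumption: no MongoDB server).
posts = [
    {
        "_id": "54c955492b7c8eb21818bd09",  # string stand-in for an ObjectId
        "title": "post title",
        "date": "2016-04-30",
        "user": "boo",
        "images": ["images/1.png"],
        "videos": [
            {"url": "videos/1.mp4", "title": "First video"},
            {"url": "videos/2.mp4", "title": "Second video"},
        ],
        "votes": 12,
        "comments": [
            {"text": "Nice post!", "date": "2016-05-01", "user": "far"},
            {"text": "Love this!", "date": "2016-05-01", "user": "boo"},
        ],
    }
]


def find_one(collection, query):
    """Return the first document whose fields equal every key/value in query."""
    for doc in collection:
        if all(doc.get(key) == value for key, value in query.items()):
            return doc
    return None


# One lookup returns the post together with its media, votes and comments.
post = find_one(posts, {"user": "boo", "title": "post title"})
print(post["title"], len(post["videos"]), len(post["comments"]), post["votes"])
```

With a real deployment, a driver call such as PyMongo's `find_one` on the collection would play the role of the local helper here.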
Flexible Data Model 4/4
● When you need to change the data format, you can simply adopt the change in new
posts going forward, without migrating the existing data model.

● The move to alternatives to an RDBMS is often driven by the need to:

○ Handle unrelated, indeterminate or evolving data.

○ Achieve faster time to market, enabled by agile development and flexible data models.

○ Handle rapidly growing volumes of structured, semi-structured and unstructured data.

○ Scale beyond the capacity constraints of existing systems.

○ Move away from expensive proprietary database software and hardware.

● To meet these requirements, companies are building operational applications with
NoSQL databases.
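The point about changing the data format only for new posts can be sketched as follows, using audio support as the new feature. The field name `audio`, its contents, and the helper `audio_tracks` are assumptions for illustration; the list again stands in for a document collection.

```python
# An existing post written before audio support existed.
old_post = {"title": "post title", "images": ["images/1.png"]}

# A newer post that simply carries an extra "audio" field (hypothetical name).
new_post = {
    "title": "podcast episode",
    "images": [],
    "audio": [{"url": "audio/1.mp3", "title": "Episode 1"}],
}

collection = [old_post, new_post]


def audio_tracks(doc):
    """Read the new field, treating documents that lack it as having none."""
    return doc.get("audio", [])


for doc in collection:
    print(doc["title"], len(audio_tracks(doc)))
```

Old and new documents live side by side in the same collection; the application code, rather than a schema migration, absorbs the difference.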
Basic Types of NoSQL Databases: 1/2
Type: Key-Value Stores (schema-less)
Definition: The least complex NoSQL model, which stores data in an
unstructured, schema-less way as indexed keys and values.
Examples: Memcached, Redis
Applications: Generally used as a distributed cache.

Type: Wide Column Stores (Column Family Stores)
Definition: A column-based model that stores data tables by column
rather than by row to enable fast data access, search and aggregation.
Examples: Cassandra, HBase
Applications: Used in applications that deal with large amounts of data,
such as recording and storing logs of consumer history in an e-commerce
platform.

Type: Document Stores
Definition: A data model that stores semi-structured data as documents
made up of tagged elements. Semi-structured data means the data shares
some standard format or encoding, such as XML or JSON. The data is
grouped into collections, and documents in the same collection can have
the same or different structures.
Examples: MongoDB, Amazon DynamoDB
Applications: Highly general purpose due to the flexibility of the data
model.
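As a rough illustration of how the three models shape data differently, the sketch below expresses each one with ordinary Python structures. The keys, column family names, and sample values are all made up for the example, and real stores add persistence, distribution and indexing on top.

```python
import json

# Key-value store: the value is an opaque blob that is only looked up by key.
key_value_store = {"session:42": '{"user": "boo", "cart": ["book"]}'}

# Wide-column store: row key -> column family -> column -> value.
wide_column_store = {
    "user:boo": {
        "profile": {"name": "boo"},
        "history": {"2016-04-30": "viewed:book", "2016-05-01": "bought:book"},
    }
}

# Document store: a collection of documents that need not share a structure.
document_store = {
    "posts": [
        {"title": "post title", "votes": 12},
        {"title": "video post", "videos": [{"url": "videos/1.mp4"}]},
    ]
}

# The key-value store cannot query inside the value; the client decodes it.
session = json.loads(key_value_store["session:42"])

# The wide-column layout lets one column family of one row be read on its own.
history = wide_column_store["user:boo"]["history"]

# The document store can filter on fields inside documents.
with_videos = [doc for doc in document_store["posts"] if "videos" in doc]

print(session["cart"], sorted(history), [doc["title"] for doc in with_videos])
```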
Basic Types of NoSQL Databases: 2/2

Type: Graph DBMS
Definition: A network database model that uses edges and nodes to
represent the data.
Examples: Neo4j, Amazon Neptune
Applications: Navigating social network connections.
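Navigating social connections is essentially a graph traversal. The sketch below runs a breadth-first search over a small adjacency map; the `follows` data and the function `within_hops` are hypothetical, and a graph DBMS would express the same question in its own query language.

```python
from collections import deque

# Nodes are users; a directed edge means "follows" (sample data for the sketch).
follows = {
    "boo": {"far", "zed"},
    "far": {"boo", "kit"},
    "zed": {"kit"},
    "kit": set(),
}


def within_hops(graph, start, max_hops):
    """Return {user: distance} for users reachable in at most max_hops edges."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    del distance[start]  # the starting user is not one of their own connections
    return distance


direct = within_hops(follows, "boo", 1)
extended = within_hops(follows, "boo", 2)
print(sorted(direct.items()), sorted(extended.items()))
```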

● MongoDB, HBase, Memcached, and Redis are the leading technology
options in the NoSQL space. We will explore MongoDB in the next class.
Install MongoDB Community Edition

Click to Install MongoDB Community Edition
