Big Data Analytics Lecture 3A
April/May 2024
● Midterm Exam
● NoSQL Fundamentals
● CAP Theorem
● Flexible Data Model
● MongoDB Installation
Logistics
● For Clients/customers
● For Developers
● For Business Owners
Motivation for industry: Requirements
● For Clients/Customers
○ Provide high availability and scalability
○ Process a large amount of data
○ Be geographically aware
● A network failure may be a rare event, but network delay is always present.
● The latency between nodes can be regarded as a temporary partition, which
forces a temporary trade-off between consistency and availability.
● The CAP theorem implies that a distributed data store, which must tolerate
partitions (P), cannot guarantee low-latency responses (A) without giving up
strict consistency (C) and allowing stale data. If stale data is tolerable and
low latency is what you care about, as with item recommendations on Amazon,
you can choose the AP model with a relaxed guarantee such as eventual
consistency. Eventual consistency means that reads may temporarily return
stale data, while all replicas converge to the latest value once updates stop.
● One of the best examples of the AP model used to achieve low latency is
Amazon DynamoDB. It is described as offering "consistently single-digit
millisecond latency at any scale", with eventually consistent reads by default.
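The trade-off described above can be illustrated with a small in-memory sketch. It is a toy model, not how DynamoDB or any real store is implemented: the class names, the three-replica setup, and the explicit sync() step standing in for network delay are all assumptions made for illustration.

```python
class Replica:
    """One node holding its own copy of the data."""
    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy AP-style store: a write lands on one replica and propagates later."""
    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []  # replication messages still "in flight"

    def write(self, key, value):
        # Acknowledge as soon as the first replica has the value (low latency).
        self.replicas[0].data[key] = value
        for replica in self.replicas[1:]:
            self.pending.append((replica, key, value))

    def read(self, key, replica_index):
        # Any replica answers immediately, even if it has not caught up yet.
        return self.replicas[replica_index].data.get(key)

    def sync(self):
        # Deliver the delayed messages; afterwards all replicas agree.
        for replica, key, value in self.pending:
            replica.data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("recommended_item", "book")
store.sync()                                  # every replica now holds "book"
store.write("recommended_item", "laptop")     # not yet replicated

stale = store.read("recommended_item", replica_index=2)      # old value
fresh = store.read("recommended_item", replica_index=0)      # new value
store.sync()
converged = store.read("recommended_item", replica_index=2)  # caught up
print(stale, fresh, converged)
```

Between the second write and the second sync(), a reader can see either value depending on which replica answers. That window is the stale data an AP system accepts in exchange for always answering quickly.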
Flexible Data Model 1/4
Introduction to Data Models
● SQL databases excel with structured data, prioritizing consistency but
often at the cost of performance.
● NoSQL databases offer more flexibility and are especially suitable for
unstructured data. They improve performance via horizontal scaling and by
tolerating slightly outdated (stale) data.
● Complex Queries: Assembling the content of each post (images, videos, comments,
and votes) requires joining multiple tables.
● Schema Inflexibility: SQL databases require predefined schemas, which need extensive
modification to accommodate new features such as audio support or additional
attributes for images.
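Both points can be sketched with Python's built-in sqlite3 module. The table layout, the sample rows, and the caption attribute below are assumptions made for illustration, not a schema taken from the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts    (id INTEGER PRIMARY KEY, user TEXT, date TEXT);
CREATE TABLE images   (post_id INTEGER, url TEXT);
CREATE TABLE comments (post_id INTEGER, user TEXT, text TEXT);
CREATE TABLE votes    (post_id INTEGER, voter TEXT);
INSERT INTO posts    VALUES (1, 'boo', '2016-04-30');
INSERT INTO images   VALUES (1, 'images/1.png');
INSERT INTO comments VALUES (1, 'alice', 'Nice!'), (1, 'bob', 'Great shot');
INSERT INTO votes    VALUES (1, 'alice'), (1, 'bob'), (1, 'carol');
""")

# Complex query: one post needs three joins before it can be displayed.
query = """
SELECT p.user,
       COUNT(DISTINCT i.url)   AS image_count,
       COUNT(DISTINCT c.rowid) AS comment_count,
       COUNT(DISTINCT v.rowid) AS vote_count
FROM posts p
LEFT JOIN images   i ON i.post_id = p.id
LEFT JOIN comments c ON c.post_id = p.id
LEFT JOIN votes    v ON v.post_id = p.id
WHERE p.id = 1
GROUP BY p.id
"""
row = conn.execute(query).fetchone()
print(row)

# Schema inflexibility: new features require schema changes up front.
conn.execute("ALTER TABLE images ADD COLUMN caption TEXT")      # hypothetical new attribute
conn.execute("CREATE TABLE audio (post_id INTEGER, url TEXT)")  # audio support
image_columns = [col[1] for col in conn.execute("PRAGMA table_info(images)")]
print(image_columns)
```

Even in this small case, reading a single post touches four tables, and supporting audio means altering the schema before any audio post can be stored.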
{
  "_id": ObjectId("54c955492b7c8eb21818bd09"),
  "date": "2016-04-30",
  "user": "boo",
  "images": ["http://social.s3.amazonaws.com/images/1.png"],
  "videos": [
  ],
  "votes": 12,
  "comments": [
  ]
}
Flexible Data Model 4/4
● When you need to change the data format, you can simply apply the change to new
posts going forward, without having to change your existing data model.
● The search for alternatives to an RDBMS is often driven by the
need to:
○ Achieve faster time to market, enabled by agile development and flexible data models.
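The flexibility described above can be sketched with plain Python dictionaries. The list below is only a stand-in for a document collection, and the second post, its audio field, and the audio_tracks helper are hypothetical examples rather than part of the lecture's data.

```python
posts = []  # stand-in for a document collection (an assumption, not a real database)

# An existing post in the original format.
posts.append({
    "date": "2016-04-30",
    "user": "boo",
    "images": ["images/1.png"],
    "votes": 12,
})

# A later post adopts a new format with an "audio" field.
# Nothing has to be migrated: older posts simply lack the field.
posts.append({
    "date": "2016-05-02",
    "user": "boo",
    "images": [],
    "audio": ["audio/1.mp3"],
    "votes": 3,
})


def audio_tracks(post):
    """Read the optional field, treating its absence as 'no audio'."""
    return post.get("audio", [])


tracks_per_post = [len(audio_tracks(p)) for p in posts]
print(tracks_per_post)
```

The price of this flexibility is that the reading code has to handle both shapes, which is why audio_tracks supplies a default for documents that lack the field.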
Basic Types of NoSQL Databases: 1/2
Type | Description | Examples | Typical use
Schema-less Key-Value Stores | The least complex NoSQL model, which stores data in an unstructured, schema-less way that consists of indexed keys and values. | Memcached, Redis | Generally used as a distributed cache.
Wide Column Stores (Column Family Stores) | A column-based model that stores data tables based on columns rather than rows to enable fast data access, search and aggregation. | Cassandra, HBase | Used in applications which deal with large amounts of data, like recording and storing logs about consumer history in an e-commerce platform.
Document Stores | A data model to store semi-structured data as documents made up of tagged elements. Semi-structured data means the data shares some standard format or encoding, such as XML, JSON, etc. The data is grouped by collections and each document in the same collection can have the same or different structure. | MongoDB, Amazon DynamoDB | Highly general purpose due to the flexibility of the data model.
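The cache use case listed for key-value stores can be sketched as a simple lookup pattern. Here a Python dict stands in for a store such as Redis or Memcached, and the function names, the key format, and the fake database call are hypothetical.

```python
cache = {}          # stand-in for a key-value store (plain dict, not a real client)
backend_calls = []  # records how often the slow data source is hit


def load_profile_from_database(user_id):
    """Pretend to run an expensive query against the primary database."""
    backend_calls.append(user_id)
    return {"id": user_id, "name": f"user-{user_id}"}


def get_profile(user_id):
    key = f"profile:{user_id}"       # the indexed key
    if key in cache:                 # cache hit: no database access
        return cache[key]
    value = load_profile_from_database(user_id)
    cache[key] = value               # store the value for later requests
    return value


first = get_profile(42)   # miss: loads from the database
second = get_profile(42)  # hit: served from the cache
print(len(backend_calls))
```

Because every access is a lookup by key, the store needs no schema and no query planner, which is what makes this model the simplest of the four.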
Basic Types of NoSQL Databases: 2/2
Type | Description | Examples | Typical use
Graph DBMS | A network database model that uses edges and nodes to represent the data. | Neo4j, Amazon Neptune | Navigate social network connections.
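Navigating social connections amounts to walking edges between nodes. The sketch below uses a plain adjacency dictionary and a breadth-first search. It does not use Neo4j or Neptune, and the names and friendships are made up for illustration.

```python
from collections import deque

# Nodes are people; each set holds the edges (friendships) of one person.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann"},
    "dan": {"bob", "eve"},
    "eve": {"dan"},
}


def within_hops(graph, start, max_hops):
    """Return {person: hop distance} for everyone reachable within max_hops."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if distance[person] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph[person]:
            if neighbor not in distance:
                distance[neighbor] = distance[person] + 1
                queue.append(neighbor)
    del distance[start]  # the starting person is not a connection of itself
    return distance


reachable = within_hops(friends, "ann", 2)
print(sorted(reachable.items()))
```

In a relational schema, each additional hop would typically add another self-join on a friendship table, whereas a graph model follows the edges directly.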