Module 1
Prof. Kaushika S.,
Department of CSE,
SOET, CMR University.
Structured Data
Semi-Structured Data
Unstructured Data
Introduction to NoSQL
NoSQL is a type of database management system (DBMS) that is designed to handle and
store large volumes of unstructured and semi-structured data.
The term NoSQL originally referred to “non-SQL” or “non-relational” databases, but the
term has since evolved to mean “not only SQL,” as NoSQL databases have expanded to
include a wide range of different database architectures and data models.
Characteristics:
Flexible schema design.
Horizontal scalability.
Designed for handling unstructured and semi-structured data.
The Value of Relational Databases
Structured Data Organization:
• Relational databases organize data in structured tables, ensuring a clear and well-defined
structure.
ACID Properties:
• Relational databases adhere to ACID properties (Atomicity, Consistency, Isolation,
Durability) to ensure data integrity.
Imagine you're at a busy restaurant. The waiter takes orders from multiple tables. Now, let's see
how ACID properties might affect the restaurant's efficiency:
Atomicity:
Every order must be completed fully or not at all. If a customer orders a burger and fries, the
kitchen serves both or neither; a half-finished order is never sent out. This can slow down
service if one item takes longer to cook.
Consistency:
The food must meet the restaurant's standards. If the burger isn't cooked properly, it needs to be
remade. This slows down service as it delays serving the customer until the food is right.
Isolation:
Each table's order is handled as if it were the only one in the kitchen, so orders being
prepared at the same time never get mixed up. Keeping orders separated this way can force some
of them to wait, which leads to delays during peak hours when many orders come in at once.
Durability:
Once an order is placed, it's recorded and won't be lost, even if the system crashes. This
reliability adds overhead, making the ordering process slower.
In busy times, adhering strictly to these principles can slow down service, as each order
requires careful handling and verification.
In database systems, adhering strictly to ACID properties can sometimes hinder performance
and scalability, especially in high-demand scenarios where speed and efficiency are paramount.
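The atomicity part of the analogy can be tried out with Python's built-in sqlite3 module. This is only a minimal sketch: the `orders` table, the `place_order` helper, and the `fail_on` parameter are hypothetical names made up for illustration.

```python
import sqlite3

def place_order(conn, table_no, items, fail_on=None):
    """Insert every item of an order in one transaction, or none of them."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for item in items:
                if item == fail_on:
                    # Simulate the kitchen failing part-way through the order.
                    raise RuntimeError(f"kitchen cannot prepare {item}")
                conn.execute(
                    "INSERT INTO orders (table_no, item) VALUES (?, ?)",
                    (table_no, item),
                )
        return True
    except RuntimeError:
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (table_no INTEGER, item TEXT)")

ok = place_order(conn, 1, ["burger", "fries"])                       # both items stored
failed = place_order(conn, 2, ["burger", "fries"], fail_on="fries")  # nothing stored
rows = conn.execute("SELECT table_no, item FROM orders ORDER BY rowid").fetchall()
```

After the second call, the burger for table 2 is not in the table either: the failed order was rolled back as a whole.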
BASE Property
Basically Available:
• The restaurant ensures that even if one part of the kitchen is busy or unavailable, other parts
can still prepare and serve food.
• This keeps the overall service running smoothly, even if some sections are temporarily
unavailable.
Soft State:
• The restaurant doesn't need to maintain strict consistency across all orders. If one table's
order takes longer to prepare, it doesn't hold up other orders.
• Each order is processed independently, allowing the kitchen to focus on serving as quickly
as possible.
Eventually Consistent:
• While the kitchen works on preparing orders, the waiters continue taking new orders.
• The kitchen might not immediately process all orders, but eventually, all tables receive
their food.
• This flexibility allows the restaurant to handle peaks in demand without sacrificing overall
service quality.
In this way, BASE properties allow the restaurant to prioritize availability and responsiveness,
ensuring that customers receive their orders in a timely manner, even during busy periods.
Similarly, in database systems, BASE properties prioritize system availability and
responsiveness over strict consistency, allowing for better performance and scalability in
high-demand scenarios.
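Eventual consistency can be illustrated with a small in-memory simulation. This is a sketch under simplifying assumptions, not how any particular database works: the `EventuallyConsistentStore` class and its `sync` method are hypothetical, and real systems propagate updates in the background rather than on request.

```python
from collections import deque

class EventuallyConsistentStore:
    """Toy store: a write is acknowledged after one replica has it."""

    def __init__(self, n_replicas=2):
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = deque()  # (replica_index, key, value) not yet applied

    def write(self, key, value):
        # Update the first replica immediately; the others catch up later.
        self.replicas[0][key] = value
        for i in range(1, len(self.replicas)):
            self.pending.append((i, key, value))

    def read(self, key, replica=0):
        return self.replicas[replica].get(key)

    def sync(self):
        # Stand-in for background propagation: apply queued updates in order.
        while self.pending:
            i, key, value = self.pending.popleft()
            self.replicas[i][key] = value

store = EventuallyConsistentStore()
store.write("table_7", "burger and fries")
stale = store.read("table_7", replica=1)  # None: update has not arrived yet
store.sync()
fresh = store.read("table_7", replica=1)  # "burger and fries"
```

The store stayed available for the write and for both reads; the price is that the second replica briefly returned an out-of-date answer.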
Emergence of NoSQL
Diverse Implementations:
• NoSQL encompasses various types: document-oriented, key-value stores, columnar, and
graph databases.
• Examples: MongoDB, Redis, Apache Cassandra, Neo4j.
• Each tailored to specific use cases, providing unique advantages for modern applications.
Aggregate Data Models
Aggregate data models provide a framework for organizing and managing data in
databases.
They help make sense of different types of data and are good for different kinds of tasks.
Document Store
Each document can have a varying number of fields, accommodating evolving data
structures without affecting others.
Example: MongoDB
Suitable for applications ranging from content management systems to real-time analytics.
Consider an online store. Each product, like a t-shirt or a book, is stored as a separate
"document." These documents contain all the details about the product, like its name, price,
and description.
If the store wants to add new information, like color options for a t-shirt, MongoDB lets
them do it without changing the other documents.
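The idea of documents with differing fields can be sketched with plain Python dictionaries. This does not use MongoDB or its driver; the product data and the `find` helper are hypothetical and only mimic the shape of a document collection.

```python
# Each product is one self-contained document (here: a dict).
products = [
    {"_id": 1, "name": "T-shirt", "price": 15.0, "description": "Cotton tee"},
    {"_id": 2, "name": "Novel", "price": 9.5, "description": "Paperback"},
]

# Add color options to the t-shirt only; the book document is untouched.
products[0]["colors"] = ["red", "blue"]

def find(collection, **criteria):
    """Return the documents whose fields match all given criteria."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

tshirt = find(products, name="T-shirt")[0]
has_colors = ["colors" in doc for doc in products]  # [True, False]
```

No schema change was needed: one document gained a field while the other kept its original shape.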
Key-Value Store
Simple structure enables fast data retrieval, making it efficient for high-speed data access.
Example: Redis
Supports various data types and features like pub/sub messaging and data persistence.
Consider an online game. Each player has a unique "key" like their username, and Redis
stores all their game progress, like levels completed and scores earned, as the "value"
associated with that key.
So, when a player logs in, Redis quickly finds their key and loads their game data.
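A key-value lookup of this kind can be sketched in a few lines. This is an in-memory stand-in, not the Redis client API: the `KeyValueStore` class, the `player:alice` key format, and the sample values are assumptions for illustration.

```python
import json

class KeyValueStore:
    """Toy key-value store: the value is an opaque blob found only by key."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        # Serialize so the store never looks inside the value.
        self._data[key] = json.dumps(value)

    def get(self, key, default=None):
        raw = self._data.get(key)
        return default if raw is None else json.loads(raw)

store = KeyValueStore()
store.set("player:alice", {"levels_completed": 12, "score": 4800})

progress = store.get("player:alice")  # the whole saved game, in one lookup
missing = store.get("player:bob")     # None: no such key
```

The store cannot answer a question like "which players have a score above 4000" without scanning every value, which is the trade-off for its simple, fast lookups by key.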
Columnar Store
Enables efficient data compression, faster query performance, and better support for
analytical workloads.
Example: Apache Cassandra, a distributed wide-column (column-family) database designed for
high availability and scalability.
Commonly used for time-series data, IoT applications, and real-time analytics.
Consider a smart home. Each device, like a thermostat or a security camera, stores its
readings in columns.
So, the temperature sensor records all its readings in one column, and the camera records
all its footage in another.
This makes it easy to find specific data quickly, even when there's lots of it, because a
query reads only the columns it needs.
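The difference between storing data by row and by column can be sketched as follows. It illustrates the column-oriented layout in general, not the storage format of any specific product; the readings and the `to_columns` helper are hypothetical.

```python
# Row-oriented: each reading is kept together as one record.
rows = [
    {"device": "thermostat", "time": 1, "temperature": 21.5},
    {"device": "thermostat", "time": 2, "temperature": 22.0},
    {"device": "thermostat", "time": 3, "temperature": 22.5},
]

def to_columns(rows):
    """Regroup the same data so that each field is stored contiguously."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

# An analytical query touches only the one column it needs.
temps = columns["temperature"]
average = sum(temps) / len(temps)  # 22.0
```

Because similar values sit next to each other in a column, this layout also tends to compress well.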
Graph Databases
Data represented as nodes, edges, and properties, allowing for modeling complex
relationships.
Example: Neo4j
Neo4j: Leading graph database with a native graph storage and processing engine.
Features a powerful query language (Cypher) and APIs for building graph-based
applications.
Commonly used for social networks, recommendation engines, and network analysis.
Neo4j helps a social media platform understand how users are connected.
Each user is a "node," and their relationships, like friendships or following other users, are
the "edges" between nodes.
So, Neo4j shows how users are connected, making it easy to see who's friends with whom
and how information spreads through the network.
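The node-and-edge model can be sketched with an adjacency list and a breadth-first search. This is plain Python, not Neo4j or Cypher; the user names and the `degrees_of_separation` function are hypothetical.

```python
from collections import deque

# Nodes are users; each directed edge means "follows".
follows = {
    "asha": ["ben", "chitra"],
    "ben": ["dev"],
    "chitra": ["dev"],
    "dev": [],
}

def degrees_of_separation(graph, start, target):
    """Breadth-first search: fewest edges from start to target, or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        user, distance = queue.popleft()
        if user == target:
            return distance
        for neighbour in graph.get(user, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None

hops = degrees_of_separation(follows, "asha", "dev")         # 2
unreachable = degrees_of_separation(follows, "dev", "asha")  # None
```

A graph database is built to answer this kind of relationship question directly, without the repeated joins a relational schema would need.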
Distribution Models
NoSQL databases are well suited to handling large volumes of data on clusters of many machines.
As more data comes in, it gets harder and more expensive to make a single server handle it all.
That's where distribution models come in.
They help NoSQL databases handle more data, process more traffic, and stay online even if
parts of the network have issues.
There are two main ways to spread data around in a NoSQL database: Replication and
Sharding.
Replication: Replication takes the same data and copies it over multiple nodes. This is like
making copies of the same book. There are two types: master-slave, where one node
handles all updates and the others copy it, and peer-to-peer, where all nodes are equal.
Sharding: Sharding puts different data on different nodes. This is like splitting a big
library into smaller ones. Each library holds different books, so together they can handle
more readers.
Single Server
Imagine you have a huge collection of books that you want to organize. You decide to use
a NoSQL database to store information about each book. But how do you make sure
everything stays organized and easy to find?
Think of your NoSQL database as a bookshelf in your room. You are the keeper of this
bookshelf, and you manage everything about it. When someone wants to add a new book or
take one out, they come to you.
Real-Life Example:
In a library, the librarian is like the single-server setup. They manage the entire collection
of books and handle all requests for adding, removing, or finding books.
The first and simplest distribution option is the one most often recommended: no
distribution at all.
Run the database on a single machine that handles all the reads and writes to the
data store.
Graph databases are the obvious category here—these work best in a single-server
configuration.
Sharding
In a book club, there are different sections for different types of books: fiction, non-fiction,
mystery, etc. Each section has its own leader who manages the books within that category.
When someone wants a book from a particular section, they go to the leader of that section.
Real-Life Example:
In a supermarket, different sections like produce, dairy, and meats can be compared to
shards in sharding. Each section is responsible for managing and stocking items within its
category.
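Often, a busy data store is busy because different people are accessing different parts of
the dataset. Sharding puts those parts on different servers, so each request goes only to
the server that holds the data it needs.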
Many NoSQL databases offer auto-sharding, where the database takes on the
responsibility of allocating data to shards and ensuring that data access goes to the right
shard. This can make it much easier to use sharding in an application.
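One common way to decide which shard holds a key is to hash the key. The sketch below assumes that approach; it is not the auto-sharding scheme of any particular database, and the `ShardedStore` class and the book keys are hypothetical.

```python
import hashlib

class ShardedStore:
    """Toy sharded store: each key lives on exactly one shard."""

    def __init__(self, n_shards):
        self.shards = [{} for _ in range(n_shards)]

    def _shard_for(self, key):
        # A stable hash sends the same key to the same shard every time.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key, value):
        self.shards[self._shard_for(key)][key] = value

    def get(self, key):
        # Only the shard responsible for the key is consulted.
        return self.shards[self._shard_for(key)].get(key)

store = ShardedStore(n_shards=3)
for title in ["fiction:dune", "mystery:gone-girl", "non-fiction:sapiens"]:
    store.put(title, {"available": True})

found = store.get("fiction:dune")
total_stored = sum(len(shard) for shard in store.shards)  # 3: no duplicates
```

A simple modulo scheme like this one moves most keys when the number of shards changes, which is one reason real systems use more elaborate placement strategies.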
Master-Slave Replication
In master-slave replication, one node (the master) handles all updates, while the others (the
slaves) copy the master's data.
This setup helps when lots of people are reading data, because reads can be spread across
the slaves. Even if the master fails, the slaves can still serve reads, although writes must
wait until a new master is in place.
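The read and write paths of master-slave replication can be sketched in memory. This is a simplified model: the `MasterSlaveCluster` class and the `master_up` flag are hypothetical, and replication here happens immediately, whereas real systems usually replicate with some delay.

```python
class MasterSlaveCluster:
    """Toy cluster: writes go to the master, reads are served by slaves."""

    def __init__(self, n_slaves=2):
        self.master = {}
        self.slaves = [{} for _ in range(n_slaves)]
        self.master_up = True
        self._next_slave = 0

    def write(self, key, value):
        if not self.master_up:
            raise RuntimeError("master is down: writes are unavailable")
        self.master[key] = value
        for slave in self.slaves:  # copy the change to every slave
            slave[key] = value

    def read(self, key):
        # Spread reads across the slaves in round-robin order.
        slave = self.slaves[self._next_slave % len(self.slaves)]
        self._next_slave += 1
        return slave.get(key)

cluster = MasterSlaveCluster()
cluster.write("book:42", "shelf B3")

cluster.master_up = False                    # simulate a master failure
still_readable = cluster.read("book:42")     # "shelf B3"
try:
    cluster.write("book:43", "shelf C1")
    write_succeeded = True
except RuntimeError:
    write_succeeded = False                  # writes need a master
```

Adding slaves increases read capacity, but every write still passes through the single master.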
Now, imagine you have a library with many shelves and lots of books. To keep the catalogue
accurate, you appoint a head librarian (the master) who records every change to it.
Assistant librarians (the slaves) keep copies of the catalogue, so whenever someone wants to
know about a book, any of them can say where to find it.
Real-Life Example
In a company, the head manager may be like the master in a master-slave replication setup.
They oversee operations and delegate tasks to other employees (slaves) based on their
expertise.
Peer-to-Peer Replication
Peer-to-peer replication does not have a master.
All nodes are equal, and each of them can accept both reads and writes.
If one node goes down, it's okay because the others can still work. And if you need more
power, just add more nodes.
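A peer-to-peer cluster can be sketched in the same in-memory style. The `PeerToPeerCluster` class and node names are hypothetical, and the sketch deliberately ignores two hard problems that real systems must solve: conflicting writes made on different nodes at the same time, and bringing a recovered node back up to date.

```python
class PeerToPeerCluster:
    """Toy cluster: every node accepts writes and shares them with its peers."""

    def __init__(self, names):
        self.data = {name: {} for name in names}
        self.up = {name: True for name in names}

    def write(self, via, key, value):
        if not self.up[via]:
            raise RuntimeError(f"{via} is down")
        # The accepting node passes the write on to every reachable peer.
        for name in self.data:
            if self.up[name]:
                self.data[name][key] = value

    def read(self, via, key):
        if not self.up[via]:
            raise RuntimeError(f"{via} is down")
        return self.data[via].get(key)

cluster = PeerToPeerCluster(["node_a", "node_b", "node_c"])
cluster.write("node_a", "book:42", "shelf B3")

cluster.up["node_a"] = False                      # one node fails
cluster.write("node_b", "book:43", "shelf C1")    # another node takes the write
from_c = cluster.read("node_c", "book:43")        # "shelf C1"
```

Unlike the master-slave setup, losing a node here does not stop writes, because any remaining node can accept them.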
Imagine a book exchange where everyone is an equal partner. Each person has their own
collection of books, and they can share or borrow books from each other freely. There's no
one in charge, and everyone has access to all the books.
Real-Life Example:
In a community garden, everyone contributes equally and shares the produce. There's no
central authority, and everyone has access to the same resources, similar to peer-to-peer
replication in a NoSQL database.
Beyond NoSQL
File Systems: Efficiently organize and manage unstructured data with hierarchical storage
structures.
Event Sourcing: Capture and store data changes as events for detailed audit trails and state
reconstruction.
Memory Image: Utilize memory-based storage for rapid data access in real-time analytics
and high-throughput scenarios.
Version Control: Track file changes over time, facilitating collaboration and change
management in software development.
XML Databases: Specialize in storing and querying XML data, supporting XML data
types and querying languages.
Object Databases: Store data in object form, aligning with object-oriented programming
principles for seamless integration.
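Of the options listed above, event sourcing is the easiest to show in a few lines: the current state is never stored directly but is rebuilt by replaying the recorded events. The sketch below is a minimal illustration; the account-style events and the `apply_event` and `replay` functions are hypothetical.

```python
from functools import reduce

def apply_event(balance, event):
    """Apply one recorded change to the current state."""
    kind, amount = event
    if kind == "deposit":
        return balance + amount
    if kind == "withdraw":
        return balance - amount
    raise ValueError(f"unknown event type: {kind}")

def replay(events, upto=None):
    """Rebuild the state from the first `upto` events (all if None)."""
    return reduce(apply_event, events[:upto], 0)

# The append-only log is the source of truth and doubles as an audit trail.
log = [("deposit", 100), ("withdraw", 30), ("deposit", 50)]

current = replay(log)            # 120
after_two = replay(log, upto=2)  # 70: the state as it was earlier
```

Replaying a prefix of the log reconstructs any past state, which is the property that makes event sourcing useful for auditing.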
Choosing your Database
Data Type: Look at what kind of data you have and choose a database that's good for that
type.
Speed and Capacity: Think about how fast your database needs to be and how much data it
needs to hold.
Growing and Staying Up: Check if the database can handle getting bigger and if it can
stay running even if something goes wrong.
Help and Updates: Find a database that has a community or vendor who can help if something
goes wrong, and that is kept up to date with new features.
Cost and Rules: Consider how much it costs to use the database and if you have to follow
any special rules.