
NoSQL Databases and Data Pipelines

Based on slides by Mike Franklin, George Kollios, and Jimmy Lin.
Part of the slides are adapted from Database System Concepts, Seventh Edition, by Avi Silberschatz, Henry F. Korth, and S. Sudarshan.
Motivation
• Very large volumes of data are being collected
  - Driven by the growth of the web, social media, and more recently the Internet of Things
  - Web logs were an early source of data
• Analytics on web logs has great value for advertisements, web-site structuring, deciding what posts to show to a user, etc.
• Big Data: differentiated from data handled by earlier-generation databases
  - Volume: much larger amounts of data stored
  - Velocity: much higher rates of insertions/updates
  - Variety: many types of data, beyond relational data
Big Data (some old numbers)
• Facebook:
  - 130 TB/day: user logs
  - 200-400 TB/day: 83 million pictures
• Google: > 25 PB/day of processed data
• Gene sequencing: 100M kilobases per day per machine
  - Sequence 1 cell for every infant by 2015?
  - 10 trillion cells / human body
• Total data created in 2010: ~1 zettabyte (1,000,000 PB)
  - ~60% increase every year
~80% of Big Data is not structured

• Structured:
  - Data of a well-defined data type, format, or structure
  - Examples: relational database tables and CSV files
• Semi-structured:
  - Textual data files with a discernible pattern, enabling parsing
  - Examples: XML and JSON files
• Quasi-structured:
  - Textual data with erratic formats that can be parsed with effort, tools, and time
  - Example: web clickstream data
• Unstructured:
  - Data that has no inherent structure
  - Examples: text documents, images, and video
What is NoSQL?
• An emerging "movement" around non-relational software for Big Data

Wikipedia: "A NoSQL database provides a mechanism for storage and retrieval of data that use looser consistency models than traditional relational databases in order to achieve horizontal scaling and higher availability. Some authors refer to them as 'Not only SQL' to emphasize that some NoSQL systems do allow SQL-like query language to be used."

https://en.wikipedia.org/wiki/NoSQL
NoSQL features
• Scalability is crucial!
  - Load has increased rapidly for many applications
  - Large servers are expensive
  - Solution: use clusters of small (cheap) commodity machines (often cloud based)
• Need to partition the data and use replication (sharding)
  - E.g., records with key values from 1 to 100,000 on database 1, records with key values from 100,001 to 200,000 on database 2, etc.
  - The application must track which records are on which database and send queries/updates to that database
• Develop with agility
  - Suitable for faster and more agile application development due to its flexibility
https://azure.microsoft.com/en-us/resources/cloud-computing-dictionary/what-is-nosql-database/
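The range-based sharding described above can be sketched in a few lines. This is an illustrative toy, not any real system's API: two Python dicts stand in for two database servers, and `shard_for` plays the role of the application-side routing logic.

```python
# Illustrative sketch of range-based sharding (assumed names, not a real API):
# keys 1..100,000 live on shard 0, keys 100,001..200,000 on shard 1, etc.
SHARD_SIZE = 100_000
shards = {0: {}, 1: {}}  # each dict stands in for a separate database server

def shard_for(key: int) -> int:
    # the application routes each key to the shard that owns its range
    return (key - 1) // SHARD_SIZE

def put(key: int, value) -> None:
    shards[shard_for(key)][key] = value

def get(key: int):
    return shards[shard_for(key)].get(key)

put(42, "alice")        # lands on shard 0
put(150_000, "bob")     # lands on shard 1
```

Hash-based sharding is the common alternative: it spreads load more evenly, but gives up the ability to answer range queries on the key from a single shard.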
NoSQL features
• Sometimes there is no well-defined schema
  - Performance and availability are more important than the strong consistency provided by an RDBMS
  - Supports flexible schemas
• Allows for semi-structured data
  - Handles large, unrelated, indeterminate, or rapidly changing data
  - Still needs to provide ways to query efficiently (use of index methods)
  - Needs to express specific types of queries easily
NoSQL Example
• Storing information about a user (first name, last name, cell phone number, city) and their hobbies

[Figure: the same user data stored the relational-database way (normalized tables) vs. the NoSQL way (a single document)]

Image source: https://www.mongodb.com/nosql-explained
The Structure Spectrum

• Structured (schema-first): relational database, formatted messages
• Semi-structured (schema-later): documents, XML, tagged text/media
• Unstructured (schema-never): plain text, media
Flavors of NoSQL

Four main types:
• key-value stores
• document databases
• column-family (aka big-table/columnar) stores
• graph databases

https://azure.microsoft.com/en-us/resources/cloud-computing-dictionary/what-is-nosql-database/
Key Value Storage Systems
• Key-value storage systems store large numbers (billions or even more) of small (KB-MB) sized records
• Records are partitioned across multiple machines
• Queries are performed on keys and routed by the system to the appropriate machine
• Records are also replicated across multiple machines, to ensure availability even if a machine fails
  - Key-value stores ensure that updates are applied to all replicas, to keep values consistent
• Examples: Redis, MemcacheDB, Amazon's DynamoDB, Voldemort
Key-value pairs in Amazon DynamoDB

[Figure: a DynamoDB table shown as items made up of key-value attribute pairs]

https://aws.amazon.com/nosql/key-value/
Key Value Storage Systems
• Key-value stores support:
  - put(key, value): store a value with an associated key
  - get(key): retrieve the stored value associated with the specified key
  - delete(key): remove the key and its associated value
• Some systems also support range queries on key values
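The put/get/delete interface above can be made concrete with a toy in-memory store. This is a minimal sketch (class and method names are my own, not any real product's API); a sorted key list is kept alongside the dict so that the optional range query is also supported.

```python
# Toy in-memory key-value store sketching the put/get/delete interface,
# plus a range query over sorted keys (illustrative only).
import bisect

class KVStore:
    def __init__(self):
        self._data = {}
        self._keys = []  # kept sorted to support range queries

    def put(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        if key in self._data:
            del self._data[key]
            self._keys.remove(key)

    def range(self, lo, hi):
        # all (key, value) pairs with lo <= key <= hi
        i = bisect.bisect_left(self._keys, lo)
        j = bisect.bisect_right(self._keys, hi)
        return [(k, self._data[k]) for k in self._keys[i:j]]

store = KVStore()
store.put("user:1", {"name": "Ada"})
store.put("user:2", {"name": "Grace"})
store.delete("user:1")
```

Real systems differ mainly in what happens around this interface: partitioning keys across machines and replicating each record, as described above.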
JSON
• JSON (JavaScript Object Notation) is an alternative data model for semi-structured data.
• Built on two key structures:
  - an object, which is a sequence of name/value pairs
    { "_id": "1000",
      "name": "Sanders Theatre",
      "capacity": 1000 }
  - an array of values: [ "123", "222", "333" ]
• A value can be:
  - an atomic value: string, number, true, false, null
  - an object
  - an array
Data Representation in Key Value
• An example of a JSON object is:
{
  "ID": "22222",
  "name": {
    "firstname": "Albert",
    "lastname": "Einstein"
  },
  "deptname": "Physics",
  "children": [
    { "firstname": "Hans", "lastname": "Einstein" },
    { "firstname": "Eduard", "lastname": "Einstein" }
  ]
}
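The object above nests an object (`name`) and an array of objects (`children`). As a quick check that it is well-formed JSON, it can be parsed and navigated with Python's standard-library `json` module:

```python
# Parse the example document and navigate its nested object and array.
import json

doc = json.loads("""
{
  "ID": "22222",
  "name": { "firstname": "Albert", "lastname": "Einstein" },
  "deptname": "Physics",
  "children": [
    { "firstname": "Hans", "lastname": "Einstein" },
    { "firstname": "Eduard", "lastname": "Einstein" }
  ]
}
""")

first = doc["name"]["firstname"]                       # nested object access
kids = [c["firstname"] for c in doc["children"]]       # iterate the array
```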
Document Databases
• Extends the idea of key/value pairs; however, the value is a document
  - expressed using some sort of semi-structured data model
    - XML
    - more often: JSON or BSON (JSON's binary counterpart)
  - the value can be examined and used by the DBMS (unlike in key/value stores)
• Queries can be based on the key (as in key/value stores), but more often they are based on the contents of the document
• Here again, there is support for sharding and replication
  - sharding can be based on values within the document
• Examples include: MongoDB, CouchDB, Terrastore
MongoDB (An Example of a Document Database)
• Data are organized in collections. A collection stores a set of documents.
• A collection is like a table and a document is like a record, but: each document can have a different set of attributes, even in the same collection
  - Semi-structured schema!
• Only requirement: every document should have an "_id" field
• The name comes from "humongous" => Mongo
Example MongoDB documents

{ "_id": ObjectId("4efa8d2b7d284dad101e4bc9"),
  "Last Name": "Cousteau",
  "First Name": "Jacques-Yves",
  "Date of Birth": "06-1-1910" }

{ "_id": ObjectId("4efa8d2b7d284dad101e4bc7"),
  "Last Name": "PELLERIN",
  "First Name": "Franck",
  "Date of Birth": "09-19-1983",
  "Address": "1 chemin des Loges",
  "City": "VERSAILLES" }
XML Example
<employees>
  <employee>
    <id>4efa8d2b7d284dad101e4bc9</id>
    <LastName>Cousteau</LastName>
    <FirstName>Jacques-Yves</FirstName>
    <DateOfBirth>06-1-1910</DateOfBirth>
  </employee>
  <employee>
    <id>4efa8d2b7d284dad101e4bc7</id>
    <LastName>PELLERIN</LastName>
    <FirstName>Franck</FirstName>
    <DateOfBirth>09-19-1983</DateOfBirth>
    <Address>1 chemin des Loges</Address>
    <City>VERSAILLES</City>
  </employee>
</employees>

(Note: XML element names cannot contain spaces, so "Last Name" becomes LastName, etc.)

https://www.w3schools.com/js/js_json_xml.asp
Columnar databases
• store data tables by columns rather than by rows
• are advantageous when querying a subset of columns, by eliminating the need to read columns that are not relevant
• are used in analytics that quickly aggregate the values of a given column (e.g., adding up the total sales for the year)
• are typically less efficient for inserting new data

[Figure: the same table stored the relational (row-oriented) way vs. the NoSQL (columnar) way]
https://en.wikipedia.org/wiki/Column-oriented_DBMS
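The layout difference can be sketched with plain Python data structures: a row store keeps one record per row, a column store keeps one list per attribute. Summing `sales` in the columnar layout touches only that one list, while the row layout must visit every full record (the table contents here are made up for illustration).

```python
# Row-oriented layout: one dict per record.
rows = [
    {"id": 1, "product": "widget", "sales": 120},
    {"id": 2, "product": "gadget", "sales": 80},
    {"id": 3, "product": "widget", "sales": 200},
]

# Column-oriented layout of the same table: one list per attribute.
columns = {
    "id": [1, 2, 3],
    "product": ["widget", "gadget", "widget"],
    "sales": [120, 80, 200],
}

row_total = sum(r["sales"] for r in rows)   # must scan every full record
col_total = sum(columns["sales"])           # reads only the relevant column
```

On disk the effect is much larger than in this sketch: the irrelevant columns never need to be read from storage at all, and a single column of uniform type compresses well.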
Graph databases
• focus on the relationships between data elements
• each element is stored as a node (e.g., a person in a social media graph)
• connections between elements (e.g., friendships in social media) are called links or relationships
• connections are first-class elements of the database, stored directly
• graph databases are usually run alongside other, more traditional databases
• use cases: fraud detection, social networks, and knowledge graphs

https://www.mongodb.com/scale/types-of-nosql-databases
Example Document Database: MongoDB
Key features include:
• JSON-style documents
  - actually uses BSON (JSON's binary format)
• replication for high availability
• auto-sharding for scalability
• document-based queries
• can create an index on any attribute
  - for faster reads
MongoDB Terminology
relational term  <==>  MongoDB equivalent
----------------------------------------------------------
database         <==>  database
table            <==>  collection
row              <==>  document
attributes       <==>  fields (field-name:value pairs)
primary key      <==>  the _id field, which is the key associated with the document
The _id Field
Every MongoDB document must have an _id field.
• its value must be unique within the collection
• acts as the primary key of the collection
• it is the key in the key/value pair
• If you create a document without an _id field:
  - MongoDB adds the field for you
  - and assigns it a unique BSON ObjectId
• example from the MongoDB shell:
> db.test.save({ rating: "PG-13" })
> db.test.find()
{ "_id" : ObjectId("528bf38ce6d3df97b49a0569"), "rating" : "PG-13" }

• Note: quoting field names is optional (see rating above)
Data Modeling in MongoDB
Need to determine how to map entities and relationships to collections of documents
• It can make sense to group different types of entities together
  - create an aggregate containing data that tends to be accessed together
• Could in theory store each type of entity in its own collection:
  - each entity type gets its own (flexibly formatted) type of document
  - documents of the same type are stored in the same collection
  - store references to related documents in different collections
Capturing Relationships in MongoDB
• Embed documents within other documents if:
  - there are contained or one-to-few relationships between entities
  - the embedded data do not change frequently or grow without bound
  - the embedded data is queried frequently together

{
  "_id": ObjectId("52ffc33cd85242f436000001"),
  "name": "Tom Benzamin",
  "contact": "987654321",
  "dob": "01-01-1991",
  "address": [
    {
      "building": "22 A, Indiana Apt",
      "pincode": 123456,
      "city": "Los Angeles",
      "state": "California"
    },
    {
      "building": "170 A, Acropolis Apt",
      "pincode": 456789,
      "city": "Chicago",
      "state": "Illinois"
    }
  ]
}
https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/modeling-data
Capturing Relationships in MongoDB
• Store references to other documents using their _id values if:
  - the data grows unbounded (e.g., comments on a post)
  - the data changes frequently (e.g., stock information)

{
  "_id": ObjectId("52ffc33cd85242f436000001"),
  "name": "Tom Benzamin",
  "contact": "987654321",
  "dob": "01-01-1991",
  "address_ids": [
    ObjectId("52ffc4a5d85242602e000000"),
    ObjectId("52ffc4a5d85242602e000001")
  ]
}

{
  "_id": ObjectId("52ffc4a5d85242602e000000"),
  "building": "22 A, Indiana Apt",
  "pincode": 123456,
  "city": "Los Angeles",
  "state": "California"
}

{
  "_id": ObjectId("52ffc4a5d85242602e000001"),
  "building": "170 A, Acropolis Apt",
  "pincode": 456789,
  "city": "Chicago",
  "state": "Illinois"
}
https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/modeling-data
Queries in MongoDB
• Each query can only access a single collection of documents.
• Use a method called
  db.collection.find(<selection>, <projection>)
• Example: find the names of all G-rated movies:
  > db.movies.find({ rating: "G" }, { name: 1 })
Projection
• Specify the names of the fields that you want in the output with 1 (0 hides the value)
• Example:
  > db.movies.find({}, { "title": 1, _id: 0 })
  (will report the title but not the id)
Selection
• You can specify conditions on the corresponding attributes using find:
  > db.movies.find({ rating: "G", year: 2000 }, { name: 1, runtime: 1 })
• Operators for other types of comparisons:
  MongoDB      SQL equivalent
  $gt, $gte    >, >=
  $lt, $lte    <, <=
  $ne          !=
• Example: find the names of movies with earnings <= 200000:
  > db.movies.find({ earnings: { $lte: 200000 }})
• For logical operators ($and, $or, $nor), use an array of conditions and apply the logical operator among the array conditions:
  > db.movies.find({ $or: [ { rating: "G" }, { rating: "PG-13" } ] })
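To make the selection-document semantics concrete, here is a tiny pure-Python emulation of how find() evaluates equality, comparison operators, and $or against a document. This is a teaching sketch of the matching rules only (the function and data are invented for illustration), not MongoDB's implementation.

```python
# Minimal emulation of MongoDB selection documents (illustrative sketch).
OPS = {
    "$gt":  lambda a, b: a > b,
    "$gte": lambda a, b: a >= b,
    "$lt":  lambda a, b: a < b,
    "$lte": lambda a, b: a <= b,
    "$ne":  lambda a, b: a != b,
}

def matches(doc, selection):
    for field, cond in selection.items():
        if field == "$or":
            # cond is an array of sub-selections; at least one must match
            if not any(matches(doc, c) for c in cond):
                return False
        elif isinstance(cond, dict):
            # e.g. {"earnings": {"$lte": 200000}}
            if not all(OPS[op](doc.get(field), v) for op, v in cond.items()):
                return False
        elif doc.get(field) != cond:
            # plain equality, e.g. {"rating": "G"}
            return False
    return True

movies = [
    {"name": "A", "rating": "G",     "earnings": 150000},
    {"name": "B", "rating": "PG-13", "earnings": 900000},
]
found = [m["name"] for m in movies
         if matches(m, {"$or": [{"rating": "G"}, {"rating": "PG-13"}]})]
```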
Aggregation
• Recall the aggregate operators in SQL: AVG(), SUM(), etc. More generally, aggregation involves computing a result from a collection of data.
• db.collection.count(<selection>)
  - returns the number of documents in the collection that satisfy the specified selection document
  - Example: how many G-rated movies are shorter than 90 minutes?
    > db.movies.count({ rating: "G", runtime: { $lt: 90 }})
• db.collection.distinct(<field>, <selection>)
  - returns an array with the distinct values of the specified field in documents that satisfy the specified selection document
  - Example: which actors have been in one or more of the top 10 grossing movies?
    > db.movies.distinct("actors.name", { earnings_rank: { $lte: 10 }})
  - If we omit the selection, we get all distinct values of that field


Aggregation Pipeline
• MongoDB supports several approaches to aggregation:
  - single-purpose aggregation methods
  - an aggregation pipeline
  - map-reduce
• Aggregation pipelines are the most flexible and useful (see next):
  - A very powerful way to write queries in MongoDB is to use pipelines
  - We execute the query in stages
  - Every stage gets some documents as input, applies filters/aggregations/projections, and outputs new documents
  - These documents are the input to the next stage (next operator), and so on

https://docs.mongodb.com/manual/core/aggregation-pipeline/
Aggregation Pipeline Example
• Let's use the following pizza orders collection and find the total order quantity of medium-size pizzas, grouped by pizza name

db.orders.aggregate( [
  // Stage 1: Filter pizza order documents by pizza size
  {
    $match: { size: "medium" }
  },
  // Stage 2: Group remaining documents by pizza name and calculate total quantity
  {
    $group: { _id: "$name", totalQuantity: { $sum: "$quantity" } }
  }
] )

https://docs.mongodb.com/manual/core/aggregation-pipeline/
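The two stages above can be mimicked in plain Python to see how documents flow through the pipeline: a list comprehension plays $match, then a dict accumulates the $group sums. The order documents below are made up for illustration; the field names follow the example.

```python
# Plain-Python walk-through of the $match -> $group pipeline above.
from collections import defaultdict

orders = [
    {"name": "Pepperoni", "size": "medium", "quantity": 20},
    {"name": "Pepperoni", "size": "small",  "quantity": 10},
    {"name": "Cheese",    "size": "medium", "quantity": 50},
    {"name": "Pepperoni", "size": "medium", "quantity": 20},
]

# Stage 1: $match: { size: "medium" }
medium = [o for o in orders if o["size"] == "medium"]

# Stage 2: $group: { _id: "$name", totalQuantity: { $sum: "$quantity" } }
totals = defaultdict(int)
for o in medium:
    totals[o["name"]] += o["quantity"]
```

Each stage consumes the documents produced by the previous one, which is exactly the dataflow MongoDB's pipeline executes server-side.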
Aggregation Pipeline Example
• Let's use the pizza orders collection and find the total pizza order value and average order quantity between two dates

db.orders.aggregate( [
  // Stage 1: Filter pizza order documents by date range
  {
    $match:
    {
      "date": { $gte: new ISODate( "2020-01-30" ), $lt: new ISODate( "2022-01-30" ) }
    }
  },
  // Stage 2: Group remaining documents by date and calculate results
  {
    $group:
    {
      _id: { $dateToString: { format: "%Y-%m-%d", date: "$date" } },
      totalOrderValue: { $sum: { $multiply: [ "$price", "$quantity" ] } },
      averageOrderQuantity: { $avg: "$quantity" }
    }
  },
  // Stage 3: Sort documents by totalOrderValue in descending order
  {
    $sort: { totalOrderValue: -1 }
  }
] )
What is a Data Pipeline?
• A data pipeline is a process for moving data between a source system and a target repository
• It involves software that automates the many steps that may be involved in moving data for a specific use case, such as extracting data from a source system and then loading it into a target repository

https://www.qlik.com/us/etl/etl-pipeline
What is Extract, Transform, and Load (ETL)?

• A set of processes to extract data from one system, transform it, and then load it into a target repository (data warehouse or data lake)
• Transform is the process of converting the format or structure of the data set to match the target system
  - Data mapping, applying concatenations or calculations
• The ETL process is most appropriate for small data sets which require complex transformations
  - Transforming larger data sets can take a long time up front, but analysis can take place immediately once the ETL process is complete

https://www.qlik.com/us/etl/etl-pipeline
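The extract/transform/load steps above can be sketched as three small functions. This is a minimal illustration under assumed in-memory "systems": a list of dicts stands in for the source system, another list for the warehouse table, and the transform does the data mapping, concatenation, and calculation the slide mentions.

```python
# Minimal ETL sketch (all names and data invented for illustration).
source = [{"first": "Ada", "last": "Lovelace", "q1": 10, "q2": 15}]
warehouse = []  # stands in for the target repository

def extract():
    # pull raw records out of the source system
    return list(source)

def transform(rows):
    # data mapping + concatenation + calculation to match the target schema
    return [{"name": f"{r['first']} {r['last']}", "total": r["q1"] + r["q2"]}
            for r in rows]

def load(rows):
    # write the transformed records into the target repository
    warehouse.extend(rows)

load(transform(extract()))
```

In ELT (next slide) the same `transform` step would instead run inside the target system, after loading the raw rows.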
What is Extract, Load, and Transform (ELT)?

• All data is extracted from the source and immediately loaded into the target system (data warehouse or data lake)
• Data is transformed on an as-needed basis in the target system
  - works with raw, unstructured, semi-structured, and structured data
  - Transformation can slow down the querying and analysis processes if there is not sufficient processing power
• ELT is more cost-effective than ETL; it is appropriate for larger, structured and unstructured data sets, and when timeliness is important
  - Cloud platforms (Amazon Redshift, Snowflake, Azure Synapse, Databricks) offer much lower costs and a variety of plan options to store and process data

https://www.qlik.com/us/etl/etl-vs-elt
