NoSQL
data pipelines
Based on slides by
Mike Franklin, George Kollios and Jimmy Lin
Some of the slides are adapted from Database System Concepts,
Seventh Edition, by Avi Silberschatz, Henry F. Korth, and S. Sudarshan 1
Motivation
• Very large volumes of data being collected
Driven by growth of web, social media, and more
recently internet-of-things
Web logs were an early source of data
• Analytics on web logs has great value for
advertisements, web site structuring, deciding what posts
to show to a user, etc.
• Big Data: differentiated from data handled by
earlier generation databases
Volume: much larger amounts of data stored
Velocity: much higher rates of
insertions/updates
Variety: many types of data, beyond relational
data
Big Data (some old numbers)
• Facebook:
130TB/day: user logs
200-400TB/day: 83 million pictures
• Structured:
Data of a well-defined data type, format, or structure
Example: Relational database tables and CSV files
• Semi-structured:
Textual data files with a discernable pattern, enabling parsing
Example: XML, JSON files
• Quasi-structured:
Textual data with erratic data formats: can be formatted with effort, tools,
and time
Example: Web clickstream data
• Unstructured:
Data that has no inherent structure
Examples: Text documents, images, and video
4
What is NoSQL?
• An emerging “movement” around
non-relational software for Big Data
Wikipedia: “A NoSQL database provides a mechanism for storage
and retrieval of data that use looser consistency models than
traditional relational databases in order to achieve
horizontal scaling and higher availability. Some authors refer to
them as "Not only SQL" to emphasize that some NoSQL systems do
allow SQL-like query language to be used.”
https://en.wikipedia.org/wiki/NoSQL
NoSQL features
• Scalability is crucial!
load has increased rapidly for many applications
Large servers are expensive
Solution: use clusters of small (cheap) commodity
machines (often cloud based)
• Need to partition the data and use replication (sharding)
• E.g., records with key values from 1 to 100,000 on
database 1, records with key values from 100,001 to
200,000 on database 2, etc. (a minimal routing sketch
follows this slide)
• Application must track which records are on which
database and send queries/updates to that database
• Develop with agility
Suitable for faster and more agile application
development due to its flexibility
6
https://azure.microsoft.com/en-us/resources/cloud-computing-dictionary/what-is-nosql-database/
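A minimal Python sketch of the range-based routing described above; the shard names and key ranges mirror the example on the slide and are purely illustrative:

# Range-based sharding: route each record to a shard by its key range.
# The shard names and ranges below are illustrative, not from any real system.

SHARD_RANGES = [
    (1, 100_000, "database_1"),        # keys 1 .. 100,000
    (100_001, 200_000, "database_2"),  # keys 100,001 .. 200,000
]

def shard_for_key(key):
    # Return the shard that owns this key; the application (or a router)
    # sends the query or update to that database.
    for low, high, shard in SHARD_RANGES:
        if low <= key <= high:
            return shard
    raise KeyError("no shard covers key %d" % key)

print(shard_for_key(42))       # database_1
print(shard_for_key(150_000))  # database_2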
NoSQL features
• Sometimes there is no well-defined schema
Performance and availability are more important
than the strong consistency provided by an RDBMS
Supports a flexible schema
• Allow for semi-structured data
Handle large, unrelated, indeterminate, or rapidly
changing data
Still need to provide ways to query efficiently
(use of index methods)
Need to express specific types of queries easily
7
NoSQL Example
• Storing information about a user (first name, last name,
cell phone number, city) and their hobbies
(Figure: the same user stored the relational way, as normalized
tables, vs. the NoSQL way, as a single document; see the sketch
below)
8
Image source: https://www.mongodb.com/nosql-explained
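Since the figure is not reproduced here, a small Python sketch of the same contrast, with illustrative field values: the relational way splits the user and their hobbies across normalized tables, while the NoSQL way keeps everything in one document.

# The same user modeled two ways (illustrative values).

# Relational way: normalized rows in two tables, joined on user_id.
users = [
    {"user_id": 1, "first_name": "Leslie", "last_name": "Yepp",
     "cell": "8125552344", "city": "Pawnee"},
]
hobbies = [
    {"user_id": 1, "hobby": "scrapbooking"},
    {"user_id": 1, "hobby": "eating waffles"},
]

# NoSQL way: one self-contained document with the hobbies embedded.
user_doc = {
    "_id": 1,
    "first_name": "Leslie",
    "last_name": "Yepp",
    "cell": "8125552344",
    "city": "Pawnee",
    "hobbies": ["scrapbooking", "eating waffles"],
}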
The Structure Spectrum
https://azure.microsoft.com/en-us/resources/cloud-computing-dictionary/what-is-nosql-database/ 10
Key Value Storage Systems
• Key-value storage systems store large numbers
(billions or even more) of small (KB-MB) sized
records
• Records are partitioned across multiple machines
• Queries are performed on keys and routed by the
system to appropriate machine
• Records are also replicated across multiple machines,
to ensure availability even if a machine fails
Key-value stores ensure that updates are applied
to all replicas, to keep values consistent
https://aws.amazon.com/nosql/key-value/ 12
Key Value Storage Systems
• Key-value stores support
put(key, value): store a value with an associated key
get(key): retrieve the stored value associated with the
specified key
delete(key): remove the key and its associated value
• Some systems also support range queries on key
values
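A toy, in-memory Python sketch of this interface; real key-value stores partition and replicate records across machines behind the same put/get/delete operations, and the class and method names here are only illustrative:

# Toy in-memory key-value store showing the client-facing operations.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Store a value under the given key (overwrites any existing value).
        self._data[key] = value

    def get(self, key):
        # Retrieve the stored value, or None if the key is absent.
        return self._data.get(key)

    def delete(self, key):
        # Remove the key and its associated value, if present.
        self._data.pop(key, None)

    def range(self, low, high):
        # Optional range query: all (key, value) pairs with low <= key <= high.
        return [(k, v) for k, v in sorted(self._data.items()) if low <= k <= high]

store = KeyValueStore()
store.put("user:22222", {"name": "Albert Einstein", "deptname": "Physics"})
print(store.get("user:22222"))
store.delete("user:22222")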
JSON
• JSON is an alternative data model for
semi-structured data.
• JavaScript Object Notation
14
Data Representation in Key Value
• An example of a JSON object is:
{
    "ID": "22222",
    "name": {
        "firstname": "Albert",
        "lastname": "Einstein"
    },
    "deptname": "Physics",
    "children": [
        { "firstname": "Hans", "lastname": "Einstein" },
        { "firstname": "Eduard", "lastname": "Einstein" }
    ]
}
Document Databases
16
MongoDB (An example of a Document
Database)
- Data are organized in collections. A collection stores
a set of documents.
- A collection is like a table and a document is like a record,
but: each document can have a different set of
attributes, even in the same collection
Semi-structured schema!
- Only requirement: every document should have an
"_id" field
humongous => Mongo
17
Example: MongoDB documents
{ "_id": ObjectId("4efa8d2b7d284dad101e4bc9"),
  "Last Name": "Cousteau",
  "First Name": "Jacques-Yves",
  "Date of Birth": "06-1-1910" },

{ "_id": ObjectId("4efa8d2b7d284dad101e4bc7"),
  "Last Name": "PELLERIN",
  "First Name": "Franck",
  "Date of Birth": "09-19-1983",
  "Address": "1 chemin des Loges",
  "City": "VERSAILLES" }
18
XML Example
<employees>
  <employee>
    <id>4efa8d2b7d284dad101e4bc9</id>
    <LastName>Cousteau</LastName>
    <FirstName>Jacques-Yves</FirstName>
    <DateOfBirth>06-1-1910</DateOfBirth>
  </employee>
  <employee>
    <id>4efa8d2b7d284dad101e4bc7</id>
    <LastName>PELLERIN</LastName>
    <FirstName>Franck</FirstName>
    <DateOfBirth>09-19-1983</DateOfBirth>
    <Address>1 chemin des Loges</Address>
    <City>VERSAILLES</City>
  </employee>
</employees>
19
https://www.w3schools.com/js/js_json_xml.asp
Columnar databases
• stores data tables by columns rather than by rows
• advantageous when querying a subset of columns, since
columns that are not relevant never need to be read
• used in analytics that quickly aggregate the values of a
given column (e.g., adding up the total sales for the year);
see the sketch below
• typically less efficient for inserting new data
21
https://www.mongodb.com/scale/types-of-nosql-databases
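A small Python sketch (with made-up data) of the row vs. column layouts, and why a per-column aggregate only has to touch one array:

# Row layout vs. column layout for the same (made-up) sales table.

# Row store: each record kept together; summing one column still reads whole rows.
rows = [
    {"order_id": 1, "region": "East", "sales": 120.0},
    {"order_id": 2, "region": "West", "sales": 80.0},
    {"order_id": 3, "region": "East", "sales": 200.0},
]
total_row_store = sum(r["sales"] for r in rows)

# Column store: each column is its own array; the aggregate reads only "sales"
# and never touches "order_id" or "region".
columns = {
    "order_id": [1, 2, 3],
    "region": ["East", "West", "East"],
    "sales": [120.0, 80.0, 200.0],
}
total_column_store = sum(columns["sales"])

assert total_row_store == total_column_store == 400.0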
Example Document Database:
MongoDB
Key features include:
• JSON-style documents
• actually uses BSON (JSON's binary format)
• replication for high availability
• auto-sharding for scalability
• document-based queries
• can create an index on any attribute
• for faster reads
22
MongoDB Terminology
relational term   <==>  MongoDB equivalent
----------------------------------------------------------
database          <==>  database
table             <==>  collection
row               <==>  document
attributes        <==>  fields (field-name:value pairs)
primary key       <==>  the _id field, which is the key
                        associated with the document
23
The _id Field
Every MongoDB document must have an _id field.
• its value must be unique within the collection
• acts as the primary key of the collection
• it is the key in the key/value pair
• If you create a document without an _id field:
• MongoDB adds the field for you
• assigns it a unique BSON ObjectID
• example from the MongoDB shell:
> db.test.save({ rating: "PG-13" })
> db.test.find()
{ "_id" : ObjectId("528bf38ce6d3df97b49a0569"), "rating" : "PG-13" }
24
Data Modeling in MongoDB
Need to determine how to map entities and relationships to
collections of documents
• It can make sense to group different types of entities together
• create an aggregate containing data that tends to be accessed
together
• Could instead give each type of entity its own collection:
• each with its own (flexibly formatted) type of document
• documents of the same type stored in the same collection
• related documents in other collections referred to by reference
25
Capturing Relationships in MongoDB
26
https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/modeling-data
Capturing Relationships in MongoDB
• store references to other documents using their _id values if
  data grows unbounded (e.g., comments on a post)
  data changes frequently (e.g., stock information)

{
  "_id": ObjectId("52ffc33cd85242f436000001"),
  "name": "Tom Hanks",
  "contact": "987654321",
  "dob": "01-01-1991"
}

{
  "_id": ObjectId("52ffc4a5d85242602e000000"),
  "building": "22 A, Indiana Apt",
  "pincode": 123456,
  "city": "Los Angeles",
  "state": "California"
}

{
  "_id": ObjectId("52ffc4a5d85242602e000001"),
  "building": "170 A, Acropolis Apt",
  "pincode": 456789,
  "city": "Chicago",
  "state": "Illinois"
}

{
  "_id": ObjectId("52ffc33cd85242f436000001"),
  "contact": "987654321",
  "dob": "01-01-1991",
  "name": "Tom Benzamin",
  "address_ids": [
    ObjectId("52ffc4a5d85242602e000000"),
    ObjectId("52ffc4a5d85242602e000001")
  ]
}
27
https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/modeling-data
Queries in MongoDB
28
Projection
• Specify the names of the fields that you want in the output with
1 (0 hides the field)
• Example:
>db.movies.find({},{"title":1,_id:0})
(will report the title but not the id)
29
Selection
• You can specify the condition on the corresponding attributes
using the find:
>db.movies.find({ rating: "G", year: 2000 }, {name: 1, runtime: 1 })
• Operators for other types of comparisons:
MongoDB SQL equivalent
$gt, $gte >, >=
$lt, $lte <, <=
$ne !=
Example: find the names of movies with earnings <= 200000
> db.movies.find({ earnings: { $lte: 200000 }})
30
Aggregation
• Recall the aggregate operators in SQL: AVG(), SUM(), etc.
More generally, aggregation involves computing a result
from a collection of data.
• db.collection.count(<selection>)
returns the number of documents in the collection
that satisfy the specified selection document
Example: how many G-rated movies are shorter than 90 minutes?
>db.movies.count({ rating: "G", runtime: { $lt: 90 }})
• db.collection.distinct(<field>, <selection>)
returns an array with the distinct values of the specified field in
documents that satisfy the specified selection document
Example: which actors have been in one or more of the top 10 grossing movies?
>db.movies.distinct("actors.name", { earnings_rank: { $lte: 10 }})
https://docs.mongodb.com/manual/core/aggregation-pipeline/
32
Aggregation Pipeline example
• Let’s use a pizza orders collection and find the total order
quantity of medium-size pizzas, grouped by pizza name
db.orders.aggregate( [
// Stage 1: Filter pizza order documents by pizza size
{
$match: { size: "medium" }
},
// Stage 2: Group remaining documents by pizza name and calculate total quantity
{
$group: { _id: "$name", totalQuantity: { $sum: "$quantity" } }
}
]) 33
https://docs.mongodb.com/manual/core/aggregation-pipeline/
Aggregation Pipeline example
• Let’s use the pizza orders collection and find total pizza order value and
average order quantity between two dates
db.orders.aggregate( [
// Stage 1: Filter pizza order documents by date range
{
$match:
{
"date": { $gte: new ISODate( "2020-01-30" ), $lt: new ISODate( "2022-01-30" ) }
}},
// Stage 2: Group remaining documents by date and calculate results
{
$group:
{
_id: { $dateToString: { format: "%Y-%m-%d", date: "$date" } },
totalOrderValue: { $sum: { $multiply: [ "$price", "$quantity" ] } },
averageOrderQuantity: { $avg: "$quantity" }
}
},
// Stage 3: Sort documents by totalOrderValue in descending order
{
$sort: { totalOrderValue: -1 }
}] )
35
What is a Data Pipeline?
• A data pipeline is a process for moving data between a
source system and a target repository
• It involves software that automates the steps involved in
moving data for a specific use case, such as extracting data
from a source system and then loading it into a target repository
https://www.qlik.com/us/etl/etl-pipeline
36
What is Extract, Transform, and Load (ETL)?
• A set of processes to extract data from one system, transform it, and
then load it into a target repository (data warehouse or data lake)
• Transform is the process of converting the format or structure of the
data set to match the target system
Data mapping, applying concatenations or calculations
• The ETL process is most appropriate for smaller data sets that require
complex transformations (a minimal sketch follows this slide)
Transforming larger data sets can take a long time up front, but analysis
can take place immediately once the ETL process is complete
37
https://www.qlik.com/us/etl/etl-pipeline
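A minimal Python sketch of the three ETL steps; the source file "orders.csv", its field names, and the in-memory "warehouse" are hypothetical placeholders for whatever real systems are involved:

# Minimal ETL sketch (hypothetical source file, fields, and target).
import csv

def extract(path):
    # Extract: read raw records from the source system (here, a CSV export).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(records):
    # Transform: reshape each record to match the target schema
    # (a concatenation and a calculation, as mentioned above).
    out = []
    for r in records:
        out.append({
            "customer": f'{r["first_name"]} {r["last_name"]}',
            "revenue": float(r["price"]) * int(r["quantity"]),
        })
    return out

def load(records, warehouse):
    # Load: write the already-transformed records into the target repository.
    warehouse.extend(records)

warehouse = []                       # stand-in for a warehouse table
load(transform(extract("orders.csv")), warehouse)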
What is Extract, Load, and Transform (ELT)?
• All data is extracted from the source and immediately loaded into the target
system (data warehouse or data lake)
• Data is transformed on an as-needed basis in the target system
raw, unstructured, semi-structured, and structured data
Transformation can slow down querying and analysis if there is
not sufficient processing power
• ELT is more cost effective than ETL; it is appropriate for larger, structured and
unstructured data sets and when timeliness is important (see the sketch below)
Cloud platforms (Amazon Redshift, Snowflake, Azure Synapse, Databricks) offer
much lower costs and a variety of plan options to store and process data
38
https://www.qlik.com/us/etl/etl-vs-elt
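For contrast, a minimal Python sketch of the same pipeline in ELT order: raw records are loaded first and transformed later, inside the (here simulated) target system; "orders.csv" and the dict-based "lake" are again hypothetical.

# Minimal ELT sketch: extract, load the raw records as-is, transform later.
import csv

def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def load_raw(records, lake):
    # Load: land the raw, untransformed records in the target (e.g., a data lake).
    lake["raw_orders"] = records

def transform_in_target(lake):
    # Transform on demand, using the target system's own compute,
    # only when an analysis actually needs the derived values.
    lake["order_revenue"] = [
        float(r["price"]) * int(r["quantity"]) for r in lake["raw_orders"]
    ]

lake = {}                            # stand-in for the target repository
load_raw(extract("orders.csv"), lake)
transform_in_target(lake)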