Data-Modeling

The document discusses data modeling in relational and NoSQL databases, emphasizing the importance of schema design for performance and cost efficiency. It outlines the challenges of embedding versus referencing data, providing guidelines for when to use each approach based on relationships and data characteristics. Additionally, it covers indexing policies in Azure Cosmos DB, highlighting the benefits of automatic indexing and the ability to customize indexing strategies.

Uploaded by

suresh
Copyright
© © All Rights Reserved
Data Modeling

Data Modeling
Is just as important as with relational data!

There’s still a schema – just enforced at the application level

Plan upfront for best performance & costs


Feedback: “Small collections add up ->$$$”
Answer: Smart data modelling will help
2 Extremes

SQL (ORM): Normalize everything

NoSQL: Embed as 1 piece
Contoso Restaurant Menu

Relational modelling – Each menu item has a reference to a category.

Menu Item            Category
------------         ------------
ID                   ID
Item Name            Category Name
Item Description     Category Description
Item Category ID

Non-relational modeling – each menu item is a self-contained document

{
  "ID": 1,
  "ItemName": "hamburger",
  "ItemDescription": "cheeseburger, no cheese",
  "CategoryId": 5,
  "Category": "sandwiches",
  "CategoryDescription": "2 pieces of bread + filling"
}
Number 1 question…
“Where are my joins?”

“Where are my joins!?!?”

Naïve way: Normalize, make 2 network calls, merge client side

But! We can model our data in a way to get the same functionality of a join,
without the tradeoff
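
The difference between the naïve approach and a modeled one can be sketched in a few lines of Python. This is only an illustration: the dictionaries stand in for collections, and `fetch` is a hypothetical stand-in for one network round trip, not a Cosmos DB API.

```python
# Hypothetical in-memory "collections"; a real client would issue network requests.
menu_items = {1: {"ID": 1, "ItemName": "hamburger", "CategoryId": 5}}
categories = {5: {"ID": 5, "CategoryName": "sandwiches"}}
embedded_items = {
    1: {"ID": 1, "ItemName": "hamburger", "CategoryId": 5, "Category": "sandwiches"}
}

calls = {"count": 0}


def fetch(collection, key):
    """Stand-in for one network round trip (assumed helper, not a real SDK call)."""
    calls["count"] += 1
    return collection[key]


def read_normalized(item_id):
    """Naive way: two calls, then merge on the client."""
    item = fetch(menu_items, item_id)
    category = fetch(categories, item["CategoryId"])
    return {**item, "Category": category["CategoryName"]}


def read_embedded(item_id):
    """Modeled way: the category name already lives in the item document."""
    return fetch(embedded_items, item_id)
```

Both functions return the same merged item, but `read_normalized` costs two round trips and `read_embedded` costs one.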
Modeling Challenges

#1: To de-normalize, or normalize? To embed, or to reference?

#2: Put data types in same collection, or different?


Modeling challenge #1: To embed or reference?
Embed
{
  "menuID": 1,
  "menuName": "Lunch menu",
  "items": [
    {"ID": 1, "ItemName": "hamburger", "ItemDescription":...},
    {"ID": 2, "ItemName": "cheeseburger", "ItemDescription":...}
  ]
}
Reference
{
  "menuID": 1,
  "menuName": "Lunch menu",
  "items": [
    {"ID": 1},
    {"ID": 2}
  ]
}
{"ID": 1, "ItemName": "hamburger", "ItemDescription":...}
{"ID": 2, "ItemName": "cheeseburger", "ItemDescription":...}
When To Embed #1
Data from entities is queried together
{
  "ID": 1,
  "ItemName": "hamburger",
  "ItemDescription": "cheeseburger, no cheese",
  "Category": "sandwiches",
  "CategoryDescription": "2 pieces of bread + filling",
  "Ingredients": [
    {"ItemName": "bread", "calorieCount": 100, "Qty": "2 slices"},
    {"ItemName": "lettuce", "calorieCount": 10, "Qty": "1 slice"},
    {"ItemName": "tomato", "calorieCount": 10, "Qty": "1 slice"},
    {"ItemName": "patty", "calorieCount": 700, "Qty": "1"}
  ]
}

E.g. in a recipe, the ingredients are always queried with the item


When To Embed #2
Child data is dependent/intrinsic to a parent

{
  "id": "Order1",
  "customer": "Customer1",
  "orderDate": "2018-09-26",
  "itemsOrdered": [
    {"ID": 1, "ItemName": "hamburger", "Price": 9.50, "Qty": 1},
    {"ID": 2, "ItemName": "cheeseburger", "Price": 9.50, "Qty": 499}
  ]
}

Items Ordered depends on Order
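
As a rough sketch of why this helps, the order total can be computed from the one embedded document with no further lookups. The `order_total` helper is hypothetical and only illustrates the read path.

```python
import json

# The order document from the slide, with its dependent line items embedded.
order = json.loads("""
{
  "id": "Order1",
  "customer": "Customer1",
  "orderDate": "2018-09-26",
  "itemsOrdered": [
    {"ID": 1, "ItemName": "hamburger", "Price": 9.50, "Qty": 1},
    {"ID": 2, "ItemName": "cheeseburger", "Price": 9.50, "Qty": 499}
  ]
}
""")


def order_total(order_doc):
    """Sum price * quantity over the embedded line items (one document read)."""
    return sum(line["Price"] * line["Qty"] for line in order_doc["itemsOrdered"])
```

Because the line items have no meaning outside their order, nothing else needs to reference them, so embedding costs nothing in flexibility here.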


When To Embed #3
1:1 relationship
{
  "id": "1",
  "name": "Alice",
  "email": "[email protected]",
  "phone": "555-5555",
  "loyaltyNumber": 13838359,
  "addresses": [
    {"street": "1 Contoso Way", "city": "Seattle"},
    {"street": "15 Fabrikam Lane", "city": "Orlando"}
  ]
}

Each customer has exactly one email, phone, and loyalty number (1:1 relationship)
When To Embed #4, #5
Similar rate of updates – does the data change at the same (slower) pace? -> Minimize writes
1:few relationships

{
  "id": "1",
  "name": "Alice",
  "email": "[email protected]",
  "addresses": [
    {"street": "1 Contoso Way", "city": "Seattle"},
    {"street": "15 Fabrikam Lane", "city": "Orlando"}
  ]
}

Email and addresses don’t change too often
When To Embed - Summary
• Data from entities is queried together
• Child data is dependent on a parent
• 1:1 relationship
• Similar rate of updates – does the data change at the same pace
• 1:few – the set of values is bounded

• Usually embedding provides better read performance


• Follow the guidelines above to minimize the trade-off in write performance
Modeling challenge #1: To embed or reference?
Embed

{
  "menuID": 1,
  "menuName": "Lunch menu",
  "items": [
    {"ID": 1, "ItemName": "hamburger", "ItemDescription":...},
    {"ID": 2, "ItemName": "cheeseburger", "ItemDescription":...}
  ]
}

Reference
{
  "menuID": 1,
  "menuName": "Lunch menu",
  "items": [
    {"ID": 1},
    {"ID": 2}
  ]
}
{"ID": 1, "ItemName": "hamburger", "ItemDescription":...}
{"ID": 2, "ItemName": "cheeseburger", "ItemDescription":...}
When To Reference #1
1 : many (unbounded relationship)

Embedding doesn’t make sense:
- Too many writes to same document
- 2MB document limit

{
  "id": "1",
  "name": "Alice",
  "email": "[email protected]",
  "Orders": [
    {
      "id": "Order1",
      "orderDate": "2018-09-18",
      "itemsOrdered": [
        {"ID": 1, "ItemName": "hamburger", "Price": 9.50, "Qty": 1},
        {"ID": 2, "ItemName": "cheeseburger", "Price": 9.50, "Qty": 499}
      ]
    },
    ...
    {
      "id": "OrderNfinity",
      "orderDate": "2018-09-20",
      "itemsOrdered": [
        {"ID": 1, "ItemName": "hamburger", "Price": 9.50, "Qty": 1}
      ]
    }
  ]
}
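
To make the size concern concrete, here is a minimal sketch that grows a customer document by embedding more orders and compares its serialized size to the 2MB limit quoted above. The helper names are hypothetical, and the JSON text length is only an approximation of how a database would account for document size.

```python
import json

DOCUMENT_LIMIT_BYTES = 2 * 1024 * 1024  # the 2MB limit quoted on the slide


def make_customer(order_count):
    """Build a hypothetical customer document with `order_count` embedded orders."""
    orders = [
        {
            "id": f"Order{n}",
            "orderDate": "2018-09-18",
            "itemsOrdered": [
                {"ID": 1, "ItemName": "hamburger", "Price": 9.50, "Qty": 1}
            ],
        }
        for n in range(1, order_count + 1)
    ]
    return {"id": "1", "name": "Alice", "Orders": orders}


def serialized_size(document):
    """Approximate the stored size as the length of the UTF-8 JSON text."""
    return len(json.dumps(document).encode("utf-8"))


def exceeds_limit(document):
    """True once the embedded orders push the document past the limit."""
    return serialized_size(document) > DOCUMENT_LIMIT_BYTES
```

A customer with a handful of orders is far below the limit, but since the array is unbounded the document eventually crosses it, and every new order rewrites the whole document in the meantime.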
When To Reference #2
Data changes at different rates

Number of orders, amount spent will likely change faster than email

Guidance: Store this aggregate data in its own document, and reference it

{
  "id": "1",
  "name": "Alice",
  "email": "[email protected]",
  "stats":[
    {"TotalNumberOrders": 100},
    {"TotalAmountSpent": 550}
  ]
}
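
One way to picture this guidance: keep the slow-changing profile and the fast-changing aggregates in two documents linked by an ID. The document shapes and the `record_order` helper below are assumptions made for illustration, not a layout prescribed by the slides.

```python
# Slow-changing profile document.
customer = {"id": "1", "name": "Alice", "statsId": "stats-1"}

# Fast-changing aggregates in their own document, referenced from the profile.
customer_stats = {
    "id": "stats-1",
    "customerId": "1",
    "TotalNumberOrders": 100,
    "TotalAmountSpent": 550,
}


def record_order(stats, amount):
    """Return an updated copy of the stats document; the profile is not rewritten."""
    return {
        **stats,
        "TotalNumberOrders": stats["TotalNumberOrders"] + 1,
        "TotalAmountSpent": stats["TotalAmountSpent"] + amount,
    }
```

Each new order then touches only the small stats document, so the larger profile document is written only when the profile itself changes.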
When To Reference #3

Many : Many relationships

Speakers have multiple sessions
Sessions have multiple speakers
Keep separate Speaker & Session documents

Speaker documents:
{
  "id": "speaker1",
  "name": "Alice",
  "email": "[email protected]",
  "sessions":[
    {"id": "session1"},
    {"id": "session2"}
  ]
}
{
  "id": "speaker2",
  "name": "Bob",
  "email": "[email protected]",
  "sessions":[
    {"id": "session1"},
    {"id": "session4"}
  ]
}

Session document:
{
  "id": "session1",
  "name": "Modelling Data 101",
  "speakers":[
    {"id": "speaker1"},
    {"id": "speaker2"}
  ]
}
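
A small sketch of how a client might resolve these references, using plain dictionaries in place of collections. The `speaker_names` helper is hypothetical, and emails are left out for brevity.

```python
# Speaker and session documents keyed by id, mirroring the slide.
speakers = {
    "speaker1": {
        "id": "speaker1",
        "name": "Alice",
        "sessions": [{"id": "session1"}, {"id": "session2"}],
    },
    "speaker2": {
        "id": "speaker2",
        "name": "Bob",
        "sessions": [{"id": "session1"}, {"id": "session4"}],
    },
}
sessions = {
    "session1": {
        "id": "session1",
        "name": "Modelling Data 101",
        "speakers": [{"id": "speaker1"}, {"id": "speaker2"}],
    },
}


def speaker_names(session_id):
    """Follow each speaker reference stored on the session document."""
    session = sessions[session_id]
    return [speakers[ref["id"]]["name"] for ref in session["speakers"]]
```

Each side stores only the other side's IDs, so a speaker's details live in exactly one place however many sessions list that speaker.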
When To Reference #4
What is referenced, is heavily referenced by many others

Here, session is referenced by speakers and attendees
Allows you to update Session independently

{
  "id": "speaker1",
  "name": "Alice",
  "email": "[email protected]",
  "sessions":[
    {"id": "session1"},
    {"id": "session2"}
  ]
}
{
  "id": "attendee1",
  "name": "Eve",
  "email": "[email protected]",
  "bookmarkedSessions":[
    {"id": "session1"},
    {"id": "session4"}
  ]
}
{
  "id": "session1",
  "name": "Modelling Data 101",
  "speakers":[
    {"id": "speaker1"},
    {"id": "speaker2"}
  ]
}
When To Reference Summary

• 1 : many (unbounded relationship)


• many : many relationships
• Data changes at different rates
• What is referenced, is heavily referenced by many others

• Typically provides better write performance


• But may require more network calls for reads
But wait, you can do both!

Speaker
{
  "id": "speaker1",
  "name": "Alice",
  "email": "[email protected]",
  "address": "1 Microsoft Way",
  "phone": "555-5555",
  "sessions":[
    {"id": "session1"},
    {"id": "session2"}
  ]
}

Session
{
  "id": "session1",
  "name": "Modelling Data 101",
  "speakers":[
    {"id": "speaker1", "name": "Alice", "email": "[email protected]"},
    {"id": "speaker2", "name": "Bob"}
  ]
}

Embed frequently used data, but use the reference to get less frequently used data
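
The hybrid approach might look like this on the read side. It is only a sketch: the helper names are hypothetical, and Bob's address and phone are invented placeholder values.

```python
# Full speaker documents (less frequently needed details live here).
speakers = {
    "speaker1": {
        "id": "speaker1",
        "name": "Alice",
        "address": "1 Microsoft Way",
        "phone": "555-5555",
    },
    # Placeholder values for illustration only.
    "speaker2": {
        "id": "speaker2",
        "name": "Bob",
        "address": "2 Example Road",
        "phone": "555-0000",
    },
}

# The session embeds the frequently used name next to each reference.
session = {
    "id": "session1",
    "name": "Modelling Data 101",
    "speakers": [
        {"id": "speaker1", "name": "Alice"},
        {"id": "speaker2", "name": "Bob"},
    ],
}


def speaker_line_up(session_doc):
    """Common case: names come straight from the session, no extra lookup."""
    return [s["name"] for s in session_doc["speakers"]]


def speaker_details(session_doc, position):
    """Rare case: follow the reference to fetch the full speaker document."""
    return speakers[session_doc["speakers"][position]["id"]]
```

The cost of this design is that an embedded copy such as the name must be updated in every session when a speaker changes it, so it suits fields that are read often and changed rarely.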
Modelling Challenge #2: What entities go into a collection?

Relational: One entity per table

In Cosmos DB & NoSQL:

• Option 1: One entity per collection

• Option 2: Multiple entities per collection


Option 2: Multiple entities per collection

“Feels” weird, but it can greatly improve performance!

• Makes sense when there are similar access patterns


- If you need “join-like” capabilities, & data is not already embedded

• Approach: Introduce “type” property


Approach: Introduce Type Property

Ability to query across multiple entity types with a single network request.

For example, we have two types of documents: person and cat

{
  "id": "Andrew",
  "type": "Person",
  "familyId": "Liu",
  "worksOn": "Azure Cosmos DB"
}
{
  "id": "Ralph",
  "type": "Cat",
  "familyId": "Liu",
  "fur": {
    "length": "short",
    "color": "brown"
  }
}

We can query both types of documents without needing a JOIN simply by running a query without a filter on type:
SELECT * FROM c WHERE c.familyId = "Liu"
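
The effect of that query can be imitated in plain Python to show what comes back. The list below stands in for a collection, the third document is a made-up extra, and `query_family` is a hypothetical helper rather than an SDK call.

```python
# One collection holding several entity types, distinguished by "type".
collection = [
    {"id": "Andrew", "type": "Person", "familyId": "Liu", "worksOn": "Azure Cosmos DB"},
    {"id": "Ralph", "type": "Cat", "familyId": "Liu",
     "fur": {"length": "short", "color": "brown"}},
    # Hypothetical extra document from another family.
    {"id": "Tim", "type": "Person", "familyId": "Smith"},
]


def query_family(family_id, doc_type=None):
    """Filter on familyId; optionally narrow to one entity type."""
    return [
        doc for doc in collection
        if doc["familyId"] == family_id
        and (doc_type is None or doc["type"] == doc_type)
    ]
```

Calling `query_family("Liu")` returns both the person and the cat in one pass, while passing `doc_type="Cat"` mirrors adding a filter on `c.type` when only one entity type is wanted.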
Handle any data with no schema or indexing required

Azure Cosmos DB’s schema-less service automatically indexes all your data, regardless of the data model, to deliver blazing-fast queries.

• Automatic index management
• Synchronous auto-indexing
• No schemas or secondary indices needed
• Works across every data model

Item             Color     Microwave safe  Liquid capacity  CPU                                  Memory  Storage
Geek mug         Graphite  Yes             16oz             ???                                  ???     ???
Coffee Bean mug  Tan       No              12oz             ???                                  ???     ???
Surface book     Gray      ???             ???              3.4 GHz Intel Skylake Core i7-6600U  16GB    1 TB SSD
Indexing Policies
Custom Indexing Policies
Though all Azure Cosmos DB data is indexed by default, you can specify a custom indexing policy for your collections. Custom indexing policies allow you to design and customize the shape of your index while maintaining schema flexibility.
• Define trade-offs between storage, write and query performance, and query consistency
• Include or exclude documents and paths to and from the index
• Configure various index types

{
  "automatic": true,
  "indexingMode": "Consistent",
  "includedPaths": [{
    "path": "/*",
    "indexes": [{
      "kind": "Range",
      "dataType": "String",
      "precision": -1
    }, {
      "kind": "Range",
      "dataType": "Number",
      "precision": -1
    }, {
      "kind": "Spatial",
      "dataType": "Point"
    }]
  }],
  "excludedPaths": [{
    "path": "/nonIndexedContent/*"
  }]
}
Indexing JSON Documents

{
  "locations": [
    {
      "country": "Germany",
      "city": "Berlin"
    },
    {
      "country": "France",
      "city": "Paris"
    }
  ],
  "headquarter": "Belgium",
  "exports": [
    { "city": "Moscow" },
    { "city": "Athens" }
  ]
}

Index tree:
locations
  0
    country: Germany
    city: Berlin
  1
    country: France
    city: Paris
headquarter: Belgium
exports
  0
    city: Moscow
  1
    city: Athens
Indexing JSON Documents

{
  "locations": [
    {
      "country": "Germany",
      "city": "Bonn",
      "revenue": 200
    }
  ],
  "headquarter": "Italy",
  "exports": [
    {
      "city": "Berlin",
      "dealers": [
        { "name": "Hans" }
      ]
    },
    { "city": "Athens" }
  ]
}

Index tree:
locations
  0
    country: Germany
    city: Bonn
    revenue: 200
headquarter: Italy
exports
  0
    city: Berlin
    dealers
      0
        name: Hans
  1
    city: Athens
Indexing JSON Documents

(Figure: the index tree of the first document + the index tree of the second document, side by side, ready to be merged.)
Inverted Index

(Figure: the merged index tree, with every node annotated with the set of document IDs that contain it.)

{1, 2}  locations, headquarter, exports, country, city, Germany, Athens
{1}     Belgium, Berlin (under locations), France, Paris, Moscow
{2}     Italy, Bonn, revenue, 200, Berlin (under exports), dealers, name, Hans
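
The idea behind the figure can be sketched in Python: flatten each document into paths and record which document IDs contain each path. This is a simplified illustration, not the engine's actual data structure. It indexes only complete root-to-leaf paths, whereas the figure annotates every node, and the function names are hypothetical.

```python
from collections import defaultdict

doc1 = {
    "locations": [
        {"country": "Germany", "city": "Berlin"},
        {"country": "France", "city": "Paris"},
    ],
    "headquarter": "Belgium",
    "exports": [{"city": "Moscow"}, {"city": "Athens"}],
}
doc2 = {
    "locations": [{"country": "Germany", "city": "Bonn", "revenue": 200}],
    "headquarter": "Italy",
    "exports": [
        {"city": "Berlin", "dealers": [{"name": "Hans"}]},
        {"city": "Athens"},
    ],
}


def paths(value, prefix=""):
    """Yield one path string per leaf, e.g. /locations/0/country/Germany."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from paths(child, f"{prefix}/{key}")
    elif isinstance(value, list):
        for position, child in enumerate(value):
            yield from paths(child, f"{prefix}/{position}")
    else:
        yield f"{prefix}/{value}"


def build_inverted_index(docs):
    """Map each leaf path to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for path in paths(doc):
            index[path].add(doc_id)
    return index


index = build_inverted_index({1: doc1, 2: doc2})
```

Looking up `index["/locations/0/country/Germany"]` gives both document IDs, while `index["/headquarter/Belgium"]` gives only the first, which is how a query on a path and value can find matching documents without scanning them.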
Indexing Policy

No indexing
{
  "indexingMode": "none",
  "automatic": false,
  "includedPaths": [],
  "excludedPaths": []
}

Index some properties
{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [
    {
      "path": "/age/?",
      "indexes": [
        {
          "kind": "Range",
          "dataType": "Number",
          "precision": -1
        }
      ]
    },
    {
      "path": "/gender/?",
      "indexes": [
        {
          "kind": "Range",
          "dataType": "String",
          "precision": -1
        }
      ]
    }
  ],
  "excludedPaths": [
    {
      "path": "/*"
    }
  ]
}
Range Indexes
These are created by default for each property and are needed for:

Equality queries:
SELECT * FROM container c WHERE c.property = 'value'

Range queries:
SELECT * FROM container c WHERE c.property > 'value' (works for >, <, >=, <=, !=)

ORDER BY queries:
SELECT * FROM container c ORDER BY c.property

JOIN queries:
SELECT child FROM container c JOIN child IN c.properties WHERE child =
'value'
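
As intuition for why one index type can serve equality, range, and ORDER BY queries, consider a sorted list of (value, document id) pairs searched with binary search. This is a conceptual sketch with made-up data and hypothetical helper names. It is not a description of how Azure Cosmos DB implements range indexes internally.

```python
import bisect

# Hypothetical index entries for one property: (value, document id), kept sorted.
entries = sorted([(30, "d1"), (25, "d2"), (41, "d3"), (30, "d4")])
values = [value for value, _ in entries]


def equal_to(x):
    """Equality: binary-search the run of entries whose value is x."""
    low = bisect.bisect_left(values, x)
    high = bisect.bisect_right(values, x)
    return [doc for _, doc in entries[low:high]]


def greater_than(x):
    """Range: everything to the right of the last entry equal to x."""
    return [doc for _, doc in entries[bisect.bisect_right(values, x):]]


def ordered():
    """ORDER BY: the entries are already stored in sorted order."""
    return [doc for _, doc in entries]
```

All three lookups reuse the same sorted structure, which is why a property needs only one such index for these query shapes.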
Spatial Indexes
These must be added and are needed for geospatial queries:

Geospatial distance queries:
SELECT * FROM container c WHERE ST_DISTANCE(c.property, { "type": "Point", "coordinates": [0.0, 10.0] }) < 40

Geospatial within queries:
SELECT * FROM container c WHERE ST_WITHIN(c.property, { "type": "Point", "coordinates": [0.0, 10.0] })
Composite Indexes
These must be added and are needed for queries that ORDER BY two or more properties.

ORDER BY queries on multiple properties:
SELECT * FROM container c ORDER BY c.firstName, c.lastName
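
A policy that would support the query above might be assembled like this. It is a sketch: the surrounding settings (`includedPaths`, `excludedPaths`) are assumed defaults chosen for illustration, and only the `compositeIndexes` entry is specific to the two-property ORDER BY.

```python
import json

# Indexing policy with one composite index over (firstName, lastName).
policy = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [{"path": "/*"}],
    "excludedPaths": [],
    "compositeIndexes": [
        [
            {"path": "/firstName", "order": "ascending"},
            {"path": "/lastName", "order": "ascending"},
        ]
    ],
}

# JSON text that could be supplied as the collection's indexing policy.
policy_json = json.dumps(policy, indent=2)
```

The order of the paths in the composite index follows the order of the properties in the ORDER BY clause.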
Online Index Transformations
On-the-fly Index Changes
In Azure Cosmos DB, you can make changes to the indexing policy of a collection on the fly. Changes can affect the shape of the index, including paths, precision values, and its consistency model.

A change in indexing policy effectively requires a transformation of the old index into a new index.

(Figure: timeline from the v1 policy at t0 to the v2 policy at t1, labelled “New document writes (CRUD) & queries”.)

t0: PUT /colls/examplecollection { indexingPolicy: … }
t1: GET /colls/examplecollection returns x-ms-index-transformation-progress: 100
Index Tuning

Metrics Analysis
The SQL APIs provide information about performance metrics, such as the index storage used and the throughput cost (request units) for every operation. You can use this information to compare various indexing policies, and for performance tuning.
When running a HEAD or GET request against a collection resource, the x-ms-request-quota and the x-ms-request-usage headers provide the storage quota and usage of the collection.

(Figure: tuning cycle with the steps Update Index Policy, Index Collection, Query Collection, View Headers.)
Best Practices

Understand query patterns – which properties are being used?

Understand impact on write cost – index update RU cost scales with # properties
