DF200 - 01 - Indexes and Optimization Mongo DB Training
Release: 20211008
What are Indexes for
● Speed up queries and updates
● Avoid disk I/O
● Reduce overall computation
MongoDB uses an empirical technique to look for the best index. This does not require any
collection statistics and is well suited to the simple, single-collection queries MongoDB
performs, normally with just one index.
Queries tend to be quick on small datasets, so we will use tools to see the implications
rather than merely observing time.
The example query has no index, so the DB Engine must look through every document in the
collection.
If the collection is too big and isn’t cached in RAM, then a huge amount of disk I/O will be
consumed, and it will be very slow.
“executionStats” gives us more detail, such as how long the query takes to run and how much data
it has to examine.
“allPlansExecution” runs all candidate plans and gathers statistics for comparison.
Index Demonstration

We can see this looked at all 5,555 documents, returning 11 in 8 milliseconds.
We can create an index to improve this.
Explain plans are complicated, with nested stages of processing. Key metrics are in bold here.

> use sample_airbnb
sample_airbnb
> db.listingsAndReviews.find({number_of_reviews:50}).explain("executionStats")
...
executionStats: {
  executionSuccess: true,
  nReturned: 11,
  executionTimeMillis: 8,
  totalKeysExamined: 0,
  totalDocsExamined: 5555,
  executionStages: {
    stage: "COLLSCAN",
    filter: {
      number_of_reviews: {
        '$eq' : 50
...
● nReturned is how many documents this stage returns - e.g., the index may narrow to 100
documents, but then an unindexed filter drops that to 10 documents
● totalKeysExamined - number of index entries
● totalDocsExamined - number of documents read
In production, we need to look at the impact this will have.
Creating an index takes an object of fields, each mapped to an ascending (1) or descending (-1)
direction.
In languages where members of objects aren’t ordered, the syntax is a little different - drivers
typically take an ordered list of field/direction pairs instead.
Making indexes in the production system needs to consider the impact on the server, cache,
disks, and any locking that may occur.
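As a minimal sketch in the shell (matching the earlier explain example on number_of_reviews; the second, compound index is purely illustrative):

```javascript
// Ascending single-field index: 1 = ascending, -1 = descending.
db.listingsAndReviews.createIndex({ number_of_reviews: 1 })

// A hypothetical compound index with mixed directions.
db.listingsAndReviews.createIndex({ property_type: 1, number_of_reviews: -1 })
```

In a driver such as PyMongo, the same index is expressed as an ordered list of pairs, e.g. `create_index([("number_of_reviews", 1)])`, because plain dictionaries historically did not preserve order.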
After creating the index, the same explain completes in under a millisecond:

nReturned: 11,
works: 12,
...
The example shows the improved efficiency of using an index - note how totalKeysExamined,
totalDocsExamined, and nReturned are all the same.
This is the same behavior as an RDBMS.
Explainable Operations
● find()
● aggregate()
● count()
● update()
● remove()
● findAndModify()
The newer-style APIs, for example updateOne() or updateMany(), do not allow
explain, so you need to use update() - same functionality.
peoplecoll = db.people
explainpeoplecoll = peoplecoll.explain()
explainpeoplecoll.count()
A query is sent to the server only once we start requesting results from the cursor - so we set a
flag on the cursor to request the explain plan rather than the results.
If we don’t have a cursor, calling explain() on a collection returns an explainable wrapper
object with that flag set.
Listing Indexes

Call getIndexes() on a collection to see the index definitions.

> db.listingsAndReviews.getIndexes()
[
  { v: 2, key: { _id: 1 }, name: '_id_' },
  ...
  {
    v: 2,
    key: { number_of_reviews: 1 },
    name: 'number_of_reviews_1'
  }
]
getIndexes() shows index information (the number of indexes may vary from what you see here on
the slide)
Index Sizes

We can call the stats() method and look at the indexSizes key to see how large each index
is in bytes.

> db.listingsAndReviews.stats().indexSizes
{
  _id_: 143360,
  property_type_1_room_type_1_beds_1: 65536,
  name_1: 253952,
  'address.location_2dsphere': 98304,
  number_of_reviews_1: 45056
}
Among other options, a scaling factor can be passed to see stats in the desired unit:
db.listingsAndReviews.stats({ scale: 1024 }).indexSizes shows the index
sizes in KB.
Exercise
Use the sample_airbnb database and the listingsAndReviews collection.
1. Find the name of the host with the most total listings (this is an existing field)
2. Create an index to support the query.
3. Calculate how much more efficient it is now with this index.
○ NULL is a value, so only one document can have NULL in a unique field.
● Sparse Indexes don’t index missing fields or nulls.
db.scores.createIndex( { score: 1 } , { sparse: true } )
○ Sparse Indexes are superseded by Partial Indexes
○ Use { field : { $exists : true } } for your partialFilterExpression.
● Partial indexes index a subset of documents based on values.
○ Can greatly reduce index size
db.orders.createIndex( { customer: 1, store: 1 },
{ partialFilterExpression: { archived: false } } )
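One point worth illustrating (a sketch against the orders collection above): a partial index can only be used when the query is provably within the partialFilterExpression, so the filter condition must appear in the query itself.

```javascript
// Can use the partial index: the query includes archived: false.
db.orders.find({ customer: "Ada", store: "Leeds", archived: false })

// Cannot use it: documents with archived: true would be missed,
// so MongoDB must use another index or a collection scan.
db.orders.find({ customer: "Ada", store: "Leeds" })
```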
Hashed indexing creates performance challenges for range matches, etc., so it should be used
with caution.
Random values (hashes or traditional GUIDs) in a B-tree maximize the requirement for RAM and
disk I/O and so should be avoided. This is covered later in the course.
Indexes and Performance
● Indexes improve read performance when used.
● Each index adds ~10% overhead
○ Hashed Indexes can add a lot more.
● An index is modified any time a document:
○ Is inserted (applies to all indexes)
○ Is deleted (applies to all indexes)
○ Is updated in such a way that its indexed field changes
Indexes must be applied with careful consideration, as they create overhead when writing data.
Unused indexes should be identified and removed.
Index Limitations
● You can have up to 64 indexes per collection.
● You should NEVER be anywhere close to that upper bound.
● Write performance will degrade to unusable at somewhere between 20-30.
● 4 is a good number to aim for
The hard limit is 64 indexes per collection, but you should not have anywhere near this number
Use Indexes with Care
● Every query should use an index.
● Every index should be used by a query.
● Indexes require RAM.
● Be mindful about the choice of key.
Depending on the size and available resources, indexes will either be used from disk or cached.
You should aim to fit indexes in the cache. Otherwise, performance will be seriously impacted.
Index Prefix Compression
● MongoDB Indexes use a special compressed format
● Each entry is just delta from the previous one
● If there are identical entries, they need only one byte
● As indexes are inherently sorted, this makes them much smaller
● Smaller indexes mean less RAM is required to cache them
MongoDB uses index prefix compression to reduce the space that indexes consume.
Where an entry shares a prefix with a previous entry in the block, it stores a pointer to that
entry and a length, followed by the new data.
Subsequent identical keys therefore take very little space. This helps optimize cache usage.
Introduction to Multikey Indexes
● A multikey index is an index that has indexed an array.
● An index entry is created on each unique value found in an array.
● Multikey indexes can index primitives, documents, or sub-arrays.
● There is nothing special that you need to do to create a multikey index.
● You create them using createIndex() just as you would with an ordinary single-field index.
● If any field in the index is ever found to be an array then the index is described
as being multikey.
You cannot create a compound multikey index if more than one to-be-indexed field of a document
is an array.
Multikey Basics

Exercise for the class:

● How many records are there?
● Will they use our index?

For each query:

● How many results?
● Which index, if any, will it use?

> use test
> db.race_results.drop()
> // Answer the questions before running these two!
> db.blog.insertMany([
    {"comments": [ { "name" : "Bob", "rating" : 1 },
                   { "name" : "Frank", "rating" : 5.3 },
                   { "name" : "Susan", "rating" : 3 } ] },
    {"comments": [ { "name" : "Megan", "rating" : 1 } ] },
    {"comments": [ { "name" : "Luke", "rating" : 1.4 },
                   { "name" : "Matt", "rating" : 5 },
                   { "name" : "Sue", "rating" : 7 } ] }
  ])
...
{ $elemMatch : { $eq : 3 } }
...
On the previous page we tried to search for a value in an array of arrays - not only could we not
index it, but there seemed to be no way to search for it.
If you do need to search in an array of arrays, it is possible - but it requires the very powerful,
if misunderstood, $elemMatch, which returns true or false based on whether an array member
matches a query.
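A sketch of that technique (a hypothetical moves collection with a nested-array field): nesting one $elemMatch inside another lets us match a value inside an inner array.

```javascript
db.moves.insertOne({ lastmoves: [ [3, 4], [5, 6] ] })

// Match documents where some inner array contains the value 3.
db.moves.find({ lastmoves: { $elemMatch: { $elemMatch: { $eq: 3 } } } })
```

The outer $elemMatch tests each member of lastmoves; the inner one tests each member of that inner array against { $eq: 3 }.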
Compound Indexes
● Create an index based on more than one field.
○ They are called Compound Indexes
○ MongoDB normally only uses one index per query
○ Compound indexes are the most common type of indexes
○ They are the same conceptually as used in an RDBMS
● You may use up to 32 fields in a compound index.
● The field order and direction are very important.
● You create them like a single-field index, but with more fields specified.
createIndex({country:1,state:1,city:1})
find({country:"UK",city:"Glasgow"})
Uses the index for country and city, but must look at every state in the country, so it
examines many index keys.
createIndex({country:1,city:1,state:1})
A better index for this query, as it can go straight to country and city.
The directions matter when doing range queries.
In addition to supporting queries that match all the index fields, compound indexes can support
queries that match on the prefix of the index fields.
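For instance (a sketch with a hypothetical places collection, reusing the index shape above), the compound index serves any query on a leading prefix of its fields:

```javascript
db.places.createIndex({ country: 1, city: 1, state: 1 })

// Supported by the index (prefix { country }):
db.places.find({ country: "UK" })

// Supported (prefix { country, city }):
db.places.find({ country: "UK", city: "Glasgow" })

// NOT a prefix: { city } alone cannot use this index efficiently.
db.places.find({ city: "Glasgow" })
```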
The Order of Fields Matters
● Equality First.
● In order of selectivity
● What fields, for a typical query, will filter the most.
● selectivity != cardinality; a selective field can even be a boolean choice
● Normally Male/Female is not selective (for the common query case)
● Dispatched versus Delivered IS selective though
● Then Range or Sort (Usually Sort)
● Sorts are much more expensive than range queries when no index is used.
The order should usually be Equality, Sort, Range, but it can sometimes be better to have the
range first.
Putting the most selective fields first can greatly reduce the quantity of the index that is in the
working set. If you have a true/false "Archived" field, having it at the start keeps the
archived portion of the index out of RAM.
But be aware that selectivity and cardinality are different concepts.
Example: A Simple Message Board
We will look at the indexes needed for a simple Message Board App.
We want to automatically clean up our board and remove some older, low-rated
anonymous messages on a regular basis.
Here is our query requirement:
1. Find all messages in a specified timestamp range.
2. Select for whether the messages are anonymous or not.
3. Sort by rating from lowest to highest.
[Figure: two B-trees. Left, index on (timestamp, username): root (4, Anonymous); children
(2, Anonymous) and (5, Martha); leaves (1, Anonymous) and (3, Sam). Right, index on
(username, timestamp): root (Martha, 5); children (Anonymous, 2) and (Sam, 3); leaves
(Anonymous, 1) and (Anonymous, 4).]
1. Exact Match at start filters down the tree to walk (Just Anonymous).
2. Find first Anonymous where timestamp >= 2
3. Walk tree whilst Anonymous & timestamp <= 4
4. Visits only two index nodes in total (2 and 4)
● The index should return results in the requested sort order.
  ○ Otherwise, we need to reorder them.

executionStats: {
  executionSuccess: true,
  nReturned: 2,
...
The index should also cover sorting where possible to prevent sorting in memory.
Index in correct order ?
Query: {timestamp:{$gte:2, $lte:4}, username:"anonymous"} Sort: { rating: 1 }
With Index: { username: 1, timestamp: 1 , rating:1 }
[Figure: B-tree for index { username: 1, timestamp: 1, rating: 1 }: root (Martha, 5, 5);
children (Anonymous, 2, 5) and (Sam, 3, 2); further Anonymous leaf entries.]
1. Exact Match at start filters down the tree to walk (Just Anonymous).
2. Find first Anonymous where timestamp >= 2
3. Walk tree whilst Anonymous & timestamp <= 4
4. Visits only two index nodes in total (2 and 4)
5. But results come back in rating order 5, 2 - so they still need sorting.
Copyright 2020-2021 MongoDB, Inc. All rights reserved. Slide 40
Adding a sort field to the index after a range can produce undesired results.
Index in better order ?
Query: {timestamp:{$gte:2, $lte:4}, username:"anonymous"} Sort: { rating: 1 }
With Index: { username: 1, rating: 1, timestamp: 1 }
[Figure: B-tree for index { username: 1, rating: 1, timestamp: 1 }: root (Martha, 5, 5);
children (Anonymous, 3, 1) and (Sam, 2, 3); further Anonymous leaf entries.]
1. Exact Match at start filters down the tree to walk (Just Anonymous)
2. Find first Anonymous where timestamp >= 2
3. Walk the tree left to right until timestamp is not <= 4 (3 nodes), checking each is 'Anonymous'.
4. Return only two of the three nodes visited (2 and 4).
5. But results in the correct order
Amending the index field order, putting the sort field before the range field, can improve
performance significantly.
When walking the tree in step 3, one extra node is checked: you need to see one 'wrong' entry
to know where the 'right' ones end.
Rules of Compound Indexing
● Equality before range
● Equality before sorting
You can have a compound multikey index; all the fields in the compound index are stored in an
index entry for each unique value in the array field.
For any given document, it does not matter which field in the compound index is the array.
But only one of the fields in any given compound index can be an array in a single document.
If this were not the case, we would need an index entry for every possible combination of
values - which might be huge.
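A sketch of that constraint (hypothetical demo collection): MongoDB rejects a document in which two compound-indexed fields are both arrays.

```javascript
db.demo.createIndex({ a: 1, b: 1 })

db.demo.insertOne({ a: [1, 2], b: 3 })      // OK: only one indexed field is an array
db.demo.insertOne({ a: 1, b: [3, 4] })      // Also OK: the array can be either field
db.demo.insertOne({ a: [1, 2], b: [3, 4] }) // Fails: "cannot index parallel arrays"
```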
Index Covered Queries

● Fetch all data from the index, not the documents.

MongoDB> use sample_airbnb
MongoDB> // These numbers are obtained when the shell and server are in the same region.
MongoDB> function mkdata()
...
If all the fields we need can be found in the index, then we don't need to fetch the document
at all.
We need to remove _id from the projection if it's not in the index we use to query.
We cannot use a multikey index for the projection, as the index entry does not tell us the
position or quantity of values, or even whether the field is an array or a scalar in any given
document.
Indexes store data in a slightly different format to BSON - all numbers have a common format
(and a type) in the index and so need to be converted back to BSON, which takes more
processing time.
If we are fetching one or two values from a larger document, index covering is good; if it's
most of the fields in a document, it's probably better to fetch the document.
If we need to add extra fields to the index in order to facilitate covering, add them at the end
and be aware of the extra storage.
Index-covered queries are the ultimate goal but not always achievable.
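A sketch of a covered query (assuming a hypothetical index on property_type and beds in sample_airbnb); note the projection excludes _id and names only indexed fields.

```javascript
db.listingsAndReviews.createIndex({ property_type: 1, beds: 1 })

// Covered: filter and projection use only indexed fields, _id excluded.
db.listingsAndReviews.find(
  { property_type: "House" },
  { _id: 0, property_type: 1, beds: 1 }
).explain("executionStats")
// A covered plan shows totalDocsExamined: 0 and no FETCH stage.
```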
Once we reach our data storage limit, we cannot create anything else, including an index - so we
make the index first and then make the big collection. This collection is technically larger than
our storage limit in the free tier, but as it is created in an aggregation we are able to
generate it.
The toArray() method on a cursor fetches all of its contents. We do have some network overhead
here, which is a constant in the total time; without it (for example, in an aggregation) the
difference is proportionally much larger.
Do remember to drop the big collection after the above commands are executed, or you will not
be able to perform any more write operations.
Exercise - Compound indexes
Create the best index you can for this query - how efficient can you get it?
> query = { amenities: "Waterfront",
"bed_type" : { $in : [ "Futon", "Real Bed" ] },
first_review : { $lt: ISODate("2018-12-31") },
last_review : { $gt : ISODate("2019-02-28") }
}
● Spherical 2d
○ Full set of GeoJSON objects
○ Required for larger areas
$geoNear, $geoWithin, and $geoIntersects are operators for using geospatial functionality
within MongoDB.
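As a generic sketch (hypothetical places collection; the coordinates and distance are illustrative only), a 2dsphere index supports searching by distance from a point:

```javascript
db.places.createIndex({ location: "2dsphere" })

// Find documents within 5 km of a point (GeoJSON: longitude first, then latitude).
db.places.find({
  location: {
    $near: {
      $geometry: { type: "Point", coordinates: [ -8.6131, 41.1413 ] },
      $maxDistance: 5000   // metres
    }
  }
})
```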
Geo Indexing Exercise
I like the idea of staying at "Ribeira Charming Duplex" in Porto.
It has no pool - find the five properties within 5 km that do.
You will need to do more than one query; however, you should write a program (that runs in the
shell) that takes the name of a property and finds somewhere nearby with a pool.
> use sample_airbnb
> var villaname = "Ribeira Charming Duplex"
> var nearto = db.listingsAndReviews.findOne({name:villaname})
> var position = nearto.address.location
> query = <write your query here>
> db.listingsAndReviews.find(query,{name:1})
TTL indexes automatically delete documents.

> // TTL set to auto-delete where create_date > 1 minute old.
TTL indexes should be used with caution, as restoring deleted data can be a huge pain, or even
impossible if no backups exist.
TTL policies should be widely communicated to ensure the deletions come as no surprise.
Be careful not to delete huge amounts of data in production.
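A sketch (hypothetical events collection) of the kind of TTL index the note above describes, expiring documents one minute after their create_date:

```javascript
// A background task periodically removes documents whose
// create_date is more than expireAfterSeconds in the past.
db.events.createIndex(
  { create_date: 1 },
  { expireAfterSeconds: 60 }
)

db.events.insertOne({ create_date: new Date(), msg: "hello" })
// Roughly a minute later the document is removed; the TTL monitor
// runs on an interval, so deletion is not instantaneous.
```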
Native text Indexes
● Superseded in Atlas by Lucene - but relevant to on-premise
● Indexes tokens (words, etc.) used in string fields.
○ It allows you to search for 'contains'.
● Algorithm
○ Split text fields into a list of words.
○ Drop language-specific stop words ("the", "an", "a", "and").
○ Apply language-specific suffix stemming
○ "running", "runs", "runner" all become "run".
○ Take the set of stemmed words and make a multikey index.
● MongoDB supports text search for several western languages.
● Queries are OR by default.
● Can be compound indexes
Text indexes use an algorithm to split text fields into words and make them available for search
using contains.
Native text Indexes - limits
● Logical AND queries can be performed by putting required terms in quotes.
○ This can also be used for required phrases: "ocean drive".
○ This is applied as a secondary filter to an OR query for all terms.
○ This makes AND and phrase queries very inefficient.
● No fuzzy matching
● Many index entries per document (slow to update)
● No wildcard searching
● Indexes are smaller than Lucene though
Text indexes are limited, and Lucene should be used instead on Atlas
Text Index Example

Only one text index per collection, so create it on multiple fields.

> use sample_airbnb
> db.listingsAndReviews.createIndex(
  {
    "reviews.comments": "text",
    "summary": "text",
    "notes": "text"
  }
)

Index all fields with:

db.collection.createIndex( { "$**": "text" } )

> db.listingsAndReviews.find({ $text : { $search : "dogs" } })
> db.listingsAndReviews.find({ $text : { $search : "dogs -cats" } })
> db.listingsAndReviews.find({ $text : { $search : " \"coffee shop\" " } })

Use $meta and sort() to order results if the result set is small.

> // Fails as cannot sort so much data in RAM
> db.listingsAndReviews.find(
  { $text: { $search: "coffee shop cake" } },
  { name: 1, score: { $meta: "textScore" } }
).sort( { score: { $meta: "textScore" } } )
You can use -term as in "dogs -cats" to say do NOT return results with this term.
Wildcard Indexes
● Dynamic schema means it's hard to index all fields
○ There are alternative schemas that we will see later
● Wildcard indexes index all fields, or a subtree
● Index Entries treat the fieldpath as the first value
○ A normal index on "user.firstname" contains "Job", "Joe", "John" as keys.
○ A wildcard index on "user" contains the fieldnames in the keys:
"firstname.Job", "firstname.Joe", "firstname.John",
and any other fields in user, e.g.
"lastname.Adams", "lastname.Jones", "lastname.Melville"
○ What does that do to the index size?
○ What are the performance and hardware implications?
○ This feature is easy to overuse/misuse.
○ Indexing correctly matters - not just "index everything"
● Indexes are mostly just like RDBMS indexes - but queries are simpler.
Indexing should be used effectively to improve query performance, but over-indexing or indexing
incorrectly can have an adverse effect.
Indexes do NOT need to be held entirely in RAM; like collections, the database will cache in
RAM the parts it accesses often and evict those it doesn't.
The index exists on disk, with a cache in RAM of the parts accessed recently. We want to access
it from RAM, so we need to design indexes so that infrequently accessed data does not get
cached (or add more RAM). Often this is based on a date value in the index.
Indexes in Production
To build an index, the server has to read all the data, which may be much bigger than the
RAM available in production. There were two ways of creating an index in
previous versions: foreground and background.
Foreground Index (MongoDB Pre 4.2)
Foreground index builds were fast but required blocking all read-write access to the parent
database of the collection being indexed for the duration of the build.
Background Index (MongoDB Pre 4.2)
Background index builds were slower and had less efficient results but allowed read-write
access to the database and its collections during the build process.
Hybrid Index (MongoDB 4.2+)
Hybrid index builds do not lock the server and build quickly. This is the method used in
MongoDB 4.2 and later.
Exercise Answers
Exercise - Indexing
Find the name of the host with the most total listings
db.listingsAndReviews.find({},{"host.host_total_listings_count":1,
"host.host_name":1}).sort({"host.host_total_listings_count":-1}).limit(1)
Now create an index to support this and show somehow how much more efficient it is.
Before the index:
"executionTimeMillis" : 11,
"totalDocsExamined" : 5555,
"works" : 5559,

db.listingsAndReviews.createIndex({"host.host_total_listings_count": 1})

After the index:
"executionTimeMillis" : 1,
"totalKeysExamined" : 1,
"totalDocsExamined" : 1,
"works" : 2,
Note: in the first query, where we apply a sort and limit, MongoDB doesn’t guarantee consistent
return of the same document, especially when there are simultaneous writes happening. It is
advisable to include the _id field in the sort if consistency is required.
Exercise - Compound Indexes
Create the best index you can for this query - how efficient can you get it?
DOCS_EXAMINED: 13
KEYS_EXAMINED: 117
TIME: 0MS
Exercise Answers
Multikey Basics Answers:
Slide 1 (Simple Arrays): 3 records.
Only 11 index entries, as only one entry is needed for the duplicate '3' in the third record.
Then you could index on {"lastmoves.p":1} as a multikey index; nested arrays are indexable as
long as they are not anonymous.