An in Depth Look at Database Indexing
Most developers who work with relational databases have used indexes, or have at least heard of them; indexing is one of the most common concepts in the database world. The important part, however, is understanding what to index and how the index will improve query response time. To do that, you need to know how you are going to query your tables: a proper index can be created only when you know exactly what your queries and data access patterns look like.
We will use MySQL with its default InnoDB storage engine, although the concepts explained in this article are more or less the same in other database servers such as Oracle and MSSQL.
The Engine column in the above screenshot shows the storage engine used to create the table. Here, InnoDB is used.
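As a minimal sketch of the setup described above: the table name index_demo and the four column names appear later in this article, but the column types and lengths below are assumptions, not the original definition.

```sql
-- Hypothetical schema: the column names come from the article,
-- the types and lengths are assumed for illustration.
CREATE TABLE index_demo (
    name     VARCHAR(20) NOT NULL,
    age      INT,
    pan_no   VARCHAR(20),
    phone_no VARCHAR(20)
) ENGINE = InnoDB;

-- The Engine column of the output shows which storage engine backs the table.
SHOW TABLE STATUS WHERE Name = 'index_demo';
```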
Now insert some random data into the table. My table with 5 rows looks like the following:
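A sketch of how such data could be loaded, assuming the hypothetical index_demo schema with columns name, age, pan_no, and phone_no. Only the first row is taken from this article; the other four are made-up placeholders.

```sql
-- Only the 'kousik' row appears in the article; the rest are invented sample rows.
INSERT INTO index_demo (name, age, pan_no, phone_no) VALUES
    ('kousik',   28, 'HJKXS9086W', '9090909090'),
    ('sample_a', 31, 'AAAAA1111A', '9010000001'),
    ('sample_b', 45, 'BBBBB2222B', '9015000002'),
    ('sample_c', 23, 'CCCCC3333C', '9019000003'),
    ('sample_d', 37, 'DDDDD4444D', '9025000004');

SELECT * FROM index_demo;
```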
I have not created any index on this table so far. Let's verify this with the command SHOW INDEX. It returns 0 results.
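Spelled out in full, and assuming the table is named index_demo, the check might look like this:

```sql
-- Lists every user-visible index on the table; an empty result set means none exist yet.
SHOW INDEX FROM index_demo;
```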
At this moment, if we run a simple SELECT query, the query will scan the whole table to find the result, since there is no user-defined index:
EXPLAIN shows how the query engine plans to execute the query. In the above screenshot, you can see that the rows column (an estimate of the rows to be examined) returns 5 and possible_keys returns null. possible_keys lists the indices that could be used for this query. The key column shows which of those possible indices is actually going to be used.
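A query of the kind discussed here could be inspected as follows. This is only a sketch: the table name index_demo and the literal phone number are assumptions used for illustration.

```sql
-- With no index on phone_no, expect possible_keys and key to be NULL
-- and rows to be close to the total number of rows in the table.
EXPLAIN SELECT * FROM index_demo WHERE phone_no = '9090909090';
```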
Primary Key:
The above query is very inefficient. Let’s optimise this query. We will make the
phone_no column a PRIMARY KEY assuming that no two users can exist in our
system with the same phone number. Take the following into consideration
when creating a primary key:
The ideal primary key type should be a number like INT or BIGINT
because integer comparisons are faster, so traversing through the index
will be very fast.
Since we don't have any primary key defined as of now, let's see what InnoDB created for us by default:
The EXTENDED keyword of SHOW INDEX shows all the indices that are not usable by the user but are managed completely by MySQL.
Here we see that MySQL has defined a composite index (we will discuss
composite indices later) on DB_ROW_ID , DB_TRX_ID , DB_ROLL_PTR , & all
columns defined in the table. In the absence of a user defined primary key, this
index is used to find records uniquely.
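The hidden index described above can be listed with the EXTENDED keyword, which is available in MySQL 8.0. A minimal sketch, assuming the table is named index_demo:

```sql
-- EXTENDED adds the internal, hidden index entries that plain SHOW INDEX omits.
SHOW EXTENDED INDEX FROM index_demo;
```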
Let's now create the primary index on phone_no and examine the created index:
Note that CREATE INDEX cannot be used to create a primary index; use ALTER TABLE instead.
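A minimal sketch of that step, assuming the table is named index_demo and that every existing phone_no value is unique and non-NULL (otherwise the ALTER TABLE statement fails):

```sql
-- Promote phone_no to the primary key; MySQL names the resulting index PRIMARY.
ALTER TABLE index_demo ADD PRIMARY KEY (phone_no);

-- Inspect the index that was just created.
SHOW INDEX FROM index_demo;
```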
In the above screenshot, we see that one primary index is created on the column phone_no. The columns of the output are described as follows:
Non_unique : 1 if the index is not unique, 0 if it is unique.
Key_name : The name of the index created. The name of the primary index is always PRIMARY in MySQL, whether or not you provided an index name while creating the index.
Null : YES if the column may contain NULL values and blank if it does not.
Index_type : Indicates which indexing data structure is used for this index.
Some possible candidates are — BTREE , HASH , RTREE , or FULLTEXT .
Comment : The information about the index not described in its own column.
Index_comment : The comment for the index specified when you created the
index with the COMMENT attribute.
Now let’s see if this index reduces the number of rows which will be searched
for a given phone_no in the WHERE clause of a query.
In this snapshot, notice that the rows column has returned only 1, and possible_keys and key both return PRIMARY. This means that, using the primary index named PRIMARY (the name is auto-assigned when you create the primary key), the query optimizer goes directly to the record and fetches it. It's very efficient. This is exactly what an index is for: to minimize the search scope at the cost of extra space.
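To reproduce this check, the same kind of lookup can be explained again once the primary key exists. The table name index_demo and the literal value are assumptions for illustration.

```sql
-- After the primary key is in place, key should read PRIMARY and rows should be 1.
-- For an equality match on the whole primary key, the type column is typically const.
EXPLAIN SELECT * FROM index_demo WHERE phone_no = '9090909090';
```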
Clustered Index:
A clustered index is collocated with the data in the same table space or same
disk file. You can consider that a clustered index is a B-Tree index whose leaf
nodes are the actual data blocks on disk, since the index & data reside together.
This kind of index physically organizes the data on disk as per the logical order
of the index key.
The big yellow rectangle represents a disk block (data block).
The blue rectangles represent data stored as rows inside that block.
The footer area represents the index of the block, where the small red rectangles reside in sorted order of a particular key. These small blocks are essentially pointers to the offsets of the records.
Records are stored in the disk block in arbitrary order. Whenever new records are added, they go into the next available space. Whenever an existing record is updated, the storage engine decides whether the record still fits in the same position or whether a new position has to be allocated for it.
In this way, you really don’t need to care about actually organizing the physical
record in a certain order, rather a small index section is maintained in that order
& fetching or maintaining records becomes very easy.
SELECT * FROM index_demo WHERE phone_no > '9010000000' AND phone_no < '9020000000'
A data block is fetched into memory when the query is executed. Say the data block contains phone_no values in the range from 9010000000 to 9030000000. The range you requested in the query is then just a subset of the data present in the block. If you now fire the next query to get all the phone numbers in the range from, say, 9015000000 to 9019000000, you don't need to fetch any more blocks from disk. The complete data can be found in the current block, so a clustered index reduces the number of disk I/O operations by collocating related data as much as possible in the same data block. This reduced disk I/O improves performance.
So if you have a well-thought-out primary key and your queries are based on the primary key, performance will be very good.
When you define a PRIMARY KEY on your table, InnoDB uses it as the
clustered index. Define a primary key for each table that you create. If
there is no logical unique and non-null column or set of columns, add a
new auto-increment column, whose values are filled in automatically.
If you do not define a PRIMARY KEY for your table, MySQL locates the
first UNIQUE index where all the key columns are NOT NULL and InnoDB
uses it as the clustered index.
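The advice above about adding an auto-increment column can be sketched as follows. The table name index_demo_with_id and the column id are hypothetical, and the remaining column types are assumed.

```sql
-- Hypothetical table: id is a surrogate key whose values MySQL fills in automatically.
-- Because it is the PRIMARY KEY, InnoDB uses it as the clustered index.
CREATE TABLE index_demo_with_id (
    id       BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    name     VARCHAR(20) NOT NULL,
    age      INT,
    pan_no   VARCHAR(20),
    phone_no VARCHAR(20),
    PRIMARY KEY (id)
) ENGINE = InnoDB;

-- id is omitted from the column list, so the next auto-increment value is used.
INSERT INTO index_demo_with_id (name, age, pan_no, phone_no)
VALUES ('kousik', 28, 'HJKXS9086W', '9090909090');
```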
In short, the MySQL InnoDB engine actually manages the primary index as
clustered index for improving performance, so the primary key & the actual
record on disk are clustered together.
In the following diagram, the left side rectangles represent leaf level index
blocks, and the right side rectangles represent the data blocks. Logically the
data blocks look to be aligned in a sorted order, but as already described earlier,
the actual physical locations may be scattered here & there.
Is it possible to create a primary index on a non-primary key?
In MySQL, a primary index is automatically created, and we have already
described above how MySQL chooses the primary index. But in the database
world, it’s actually not necessary to create an index on the primary key column
— the primary index can be created on any non primary key column as well. But
when created on the primary key, all key entries are unique in the index, while in
the other case, the primary index may have a duplicated key as well.
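For reference, MySQL removes a primary key with ALTER TABLE ... DROP PRIMARY KEY. A minimal sketch, assuming the table is named index_demo:

```sql
-- Drops the primary key (and with it the PRIMARY index) from the table.
ALTER TABLE index_demo DROP PRIMARY KEY;
```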
- If the primary key does not exist, you get the following error:
"ERROR 1091 (42000): Can't DROP 'PRIMARY'; check that column/key exists"
Secondary Index:
Any index other than a clustered index is called a secondary index. Unlike the clustered (primary) index, secondary indices do not affect the physical storage location of the rows.
So to understand, you can assume that the secondary index has reference to the
primary key’s address, although it’s not the case. Retrieving data through the
secondary index means you have to traverse two B+ trees — one is the
secondary index B+ tree itself, and the other is the primary index B+ tree.
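A sketch of a secondary index and a query that goes through it. The index name secondary_idx_1 on the column name is mentioned later in this article; the table name index_demo and the literal value are assumptions.

```sql
-- Secondary index on name; in InnoDB each entry also carries the row's primary key value.
CREATE INDEX secondary_idx_1 ON index_demo (name);

-- The lookup first walks the secondary index to find matching primary key values,
-- then walks the clustered (primary) index to fetch the remaining columns.
EXPLAIN SELECT * FROM index_demo WHERE name = 'kousik';
```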
Also, if the primary key is very large, like a URL, secondary indexes can be inefficient in terms of storage, since each of them contains a copy of the primary key column value. More secondary indexes mean more duplicate copies of the primary key value, and hence more storage for a large primary key. The primary index itself stores the keys too, so the combined effect on storage can be very high.
This process is expensive when several secondary indexes exist. Also other
tables may have a foreign key reference to the primary key, so you need to
delete those foreign key references before you delete the primary key.
Unlike some other database servers (SQL Server, for example, allows only one), in MySQL a unique key column can hold any number of NULL values. In the SQL standard, NULL means an undefined value. For MySQL to allow only one NULL in a unique key column, it would have to assume that all NULL values are the same.
But logically this is not correct, since NULL means undefined, and undefined values can't be compared with each other; that is the nature of NULL. As MySQL can't assert that all NULLs mean the same thing, it allows multiple NULL values in the column.
The following command shows how to create a unique key index in MySQL:
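A minimal sketch of such a command, assuming the table index_demo, phone_no as its primary key, and a nullable pan_no column. The index name unique_idx_1 and the inserted rows are hypothetical.

```sql
-- Enforce uniqueness of pan_no for all non-NULL values.
CREATE UNIQUE INDEX unique_idx_1 ON index_demo (pan_no);

-- Both rows are accepted: MySQL does not treat two NULLs as duplicates of each other.
INSERT INTO index_demo (name, age, pan_no, phone_no) VALUES
    ('sample_x', 40, NULL, '9000000001'),
    ('sample_y', 41, NULL, '9000000002');
```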
Composite Index:
MySQL lets you define indices on multiple columns, up to 16 columns. Such an index is called a multi-column, composite, or compound index.
Let's say we have an index defined on 4 columns: col1, col2, col3, col4. With a composite index, we have search capability on col1, (col1, col2), (col1, col2, col3), and (col1, col2, col3, col4). So we can use any left-side prefix of the indexed columns, but we can't skip a column in the middle or start from a later column, as in (col1, col3), (col1, col2, col4), col3, or col4. The index cannot serve these combinations in full: for (col1, col3) or (col1, col2, col4), MySQL can use at most the leading prefix (col1, or col1 and col2) for the lookup, and for col3 or col4 alone it has no usable prefix at all.
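The left-prefix rule can be tried out with a three-column index. This is a sketch: the column list (pan_no, name, age) and the name composite_index_2 both appear in this article, but that this name belongs to exactly this column list is an assumption, as are the table name and the literal values.

```sql
-- Composite index; the column order defines which prefixes are usable.
CREATE INDEX composite_index_2 ON index_demo (pan_no, name, age);

-- Left-side prefixes: the index can be used for the lookup.
EXPLAIN SELECT * FROM index_demo WHERE pan_no = 'HJKXS9086W';
EXPLAIN SELECT * FROM index_demo WHERE pan_no = 'HJKXS9086W' AND name = 'kousik';

-- No left-side prefix (pan_no is missing): the index cannot be used to seek to matching rows.
EXPLAIN SELECT * FROM index_demo WHERE name = 'kousik' AND age = 28;
```

Comparing the key and key_len columns across the three plans shows how much of the index each query is able to use.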
If you have queries containing a WHERE clause on multiple columns, make sure the conditions cover a left-side prefix of the composite index's columns so that the index can benefit the query. What matters is which columns are constrained, not the order in which the conditions are written in the clause. In fact, while deciding the columns for a composite index, you can analyze the different use cases of your system and try to come up with the column order that benefits most of them.
Composite indices can help you in JOIN & SELECT queries as well. Example: in
the following SELECT * query, composite_index_2 is used.
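A query of this shape might look like the sketch below. It assumes that composite_index_2 is an index on (pan_no, name, age) of index_demo; the literal values are illustrative.

```sql
-- All three indexed columns are constrained, so the optimizer can pick the composite index;
-- the key column of the plan shows which index was actually chosen.
EXPLAIN SELECT * FROM index_demo
WHERE pan_no = 'HJKXS9086W' AND name = 'kousik' AND age = 28;
```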
When several indexes are defined, the MySQL query optimizer chooses the index that eliminates the greatest number of rows, that is, the one that scans as few rows as possible.
MySQL maintains index statistics, which help it infer what the data in the system looks like. Index statistics are only a generalization, but based on this metadata MySQL decides which index is appropriate for the current query.
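These statistics can be refreshed and inspected by hand. A sketch, assuming the table is named index_demo and that InnoDB persistent statistics are enabled (the default in recent MySQL versions):

```sql
-- Recalculate the index statistics for the table.
ANALYZE TABLE index_demo;

-- The Cardinality column is the optimizer's estimate of distinct values per index.
SHOW INDEX FROM index_demo;

-- The persistent statistics that InnoDB stores for each index of the table.
SELECT index_name, stat_name, stat_value, stat_description
FROM mysql.innodb_index_stats
WHERE database_name = DATABASE() AND table_name = 'index_demo';
```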
In our example, for the following record, a composite index key is formed by
concatenating pan_no , name , age — HJKXS9086Wkousik28 .
+--------+------+------------+------------+
| name   | age  | pan_no     | phone_no   |
+--------+------+------------+------------+
| kousik |   28 | HJKXS9086W | 9090909090 |
+--------+------+------------+------------+
If you are creating an index on col1 and a composite index on (col1, col2), then the composite index alone should be enough. Queries on col1 alone can be served by the composite index, since col1 is a left-side prefix of that index.
Covering Index:
A covering index is a special kind of composite index in which every column referenced by the query exists in the index. The query optimizer then does not need to read the table rows to get the data; it gets the result from the index itself. Example: we have already defined a composite index on (pan_no, name, age), so now consider the following query:
SELECT age FROM index_demo WHERE pan_no = 'HJKXS9086W' AND name = 'kousik'
The columns mentioned in the SELECT & WHERE clauses are part of the
composite index. So in this case, we can actually get the value of the age
column from the composite index itself. Let’s see what the EXPLAIN command
shows for this query:
EXPLAIN FORMAT=JSON SELECT age FROM index_demo WHERE pan_no = 'HJKXS9086W' AND name = '111kousik1';
In the above response, note the key using_index, which is set to true. This signifies that a covering index has been used to answer the query.
Partial Index:
We already know that indices speed up our queries at the cost of space. The more indices you have, the greater the storage requirement. We have already created an index called secondary_idx_1 on the column name. The column name can contain large values of any length. In addition, the row locators (row pointers) stored in the index have their own size. So overall, an index can impose a high storage and memory load.
In MySQL, it's possible to create an index on just the first few characters of a column's value as well (the prefix length is counted in characters for non-binary string types and in bytes for binary string types). Example: the following command creates an index on the first 4 characters of name. Though this method reduces memory overhead by a certain amount, the index can't eliminate many rows, since in this example the first 4 characters may be common across many names. Prefix indexing of this kind is supported on CHAR, VARCHAR, BINARY, and VARBINARY columns, and is required when indexing BLOB and TEXT columns.
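A minimal sketch of a prefix index, assuming the table index_demo with a string column name. The index name name_prefix_idx is hypothetical.

```sql
-- Index only the leading 4 characters of name instead of the full value.
CREATE INDEX name_prefix_idx ON index_demo (name(4));

-- The Sub_part column of the output reports the indexed prefix length (4 here).
SHOW INDEX FROM index_demo;
```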
There are many other indices as well like Spatial index and Full Text Search
index offered by MySQL. I have not yet experimented with those indices, so I’m
not discussing them in this post.
DML operations update indices too, so write operations become costlier with indexes. The more indices you have, the greater the cost.
Indexes are used to make read operations faster. So if you have a system that is write heavy but not read heavy, think hard about whether you need an index or not.
Indices might need some maintenance as well if old data still remains in the index. Stale entries need to be deleted, otherwise they hog memory, so try to have a monitoring plan for your indices.
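One possible starting point for such a monitoring plan is the sys schema that ships with MySQL 5.7 and later. This is a sketch, not a complete policy: it assumes the Performance Schema is enabled, and the index name in the DROP statement is a hypothetical placeholder.

```sql
-- Indexes with no recorded usage since the server last started.
-- A long enough observation window is needed before trusting this list.
SELECT object_schema, object_name, index_name
FROM sys.schema_unused_indexes
WHERE object_schema = DATABASE();

-- Once an index is confirmed to be unnecessary, it can be removed.
-- name_prefix_idx is a placeholder; substitute the index you decided to drop.
DROP INDEX name_prefix_idx ON index_demo;
```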
Kousik Nath
Engineer @ PayPal, loves to have deep discussion on distributed and scalable systems, system
architecture, design patterns, algorithmic problem solving. Linkedin:
https://www.linkedin.com/in/kousikn/