
Database Query Efficiency and Indexing

The document discusses various data structures and indexing techniques to improve the efficiency of database queries, emphasizing the importance of selectivity and data characteristics. It covers types of indexes, dimensionality of data, and different query types, including exact match, range, and similarity queries. Additionally, it explains specific structures like R-Trees and Pyramid Indexes, detailing their operations and the implications of using them for various data types.

Uploaded by

xerodo1379

Exercises:

3.1 – Time Complexity Analysis


5.3 – Pyramid Indexes (Which queries can be accessed efficiently using PIs?)

The efficiency of database queries often depends on two things:


1. The selectivity of a query, i.e. how constrained it is (searching for just one object is much quicker
than searching for a list of objects)
2. Data characteristics (how data is distributed and how our data is stored)

A simple sequential scan has a runtime of 𝑂(𝑛), yet this can be improved by using various data
structures.
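
As a minimal sketch of this difference, the following compares a sequential scan with a lookup on sorted data. The function names and the sample values are made up for illustration; the sorted list merely stands in for an index structure.

```python
from bisect import bisect_left

def sequential_scan(records, key):
    """Check every record in turn: O(n) comparisons in the worst case."""
    for position, value in enumerate(records):
        if value == key:
            return position
    return None

def sorted_lookup(sorted_records, key):
    """Binary search on sorted data: O(log n) comparisons."""
    position = bisect_left(sorted_records, key)
    if position < len(sorted_records) and sorted_records[position] == key:
        return position
    return None

records = [17, 3, 42, 8, 23]        # hypothetical, unsorted data
sorted_records = sorted(records)    # [3, 8, 17, 23, 42]

print(sequential_scan(records, 42))        # 2
print(sorted_lookup(sorted_records, 42))   # 4
```

The sorted copy is redundant information that has to be maintained alongside the data, which is the price paid for the faster lookup.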

Index Structures have the following properties:


 Common Goal: Improve (shorten) the processing time of a query
 The task of the index is to locate the queried data entry for the user
 Has to grow along with the underlying data
 Improved time complexity:
o commonly 𝑂(log 𝑛)
o sometimes 𝑂(1)
o in many cases only 𝑂(𝑛)
 Success is dependent on data characteristics
 Are additional structures which consist of redundant information
 Come in 2 types:
o Primary (only one per relational table / file): physical clustering/sorting of data entries
according to one selected (unique) primary key
o Secondary: support for additional non-key attributes; creation of a secondary index requires
only the creation of a new directory, no physical sorting of data entries required; multiple
secondary index structures per relational table / file possible
 Requirements:
o Efficient Search
o Dynamic insert, deletion, update of data entries
o Order-preserving index
o Efficient space usage of the index
o Easy to implement
o Adaptive towards changing data distributions
o Parallelism
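
The distinction between primary and secondary indexes can be sketched as follows. This is only an illustration under assumptions: the table, the attribute names and the helper `build_secondary_index` are hypothetical, and the "directory" is modelled as a plain dictionary.

```python
from collections import defaultdict

# Hypothetical table, physically sorted by the primary key "id".
table = [
    {"id": 1, "city": "Vienna"},
    {"id": 2, "city": "Graz"},
    {"id": 3, "city": "Vienna"},
]

def build_secondary_index(rows, attribute):
    """Directory from attribute value to row positions; the rows stay in place."""
    directory = defaultdict(list)
    for position, row in enumerate(rows):
        directory[row[attribute]].append(position)
    return directory

city_index = build_secondary_index(table, "city")
vienna_ids = [table[position]["id"] for position in city_index["Vienna"]]
print(vienna_ids)  # [1, 3]
```

Several such directories can exist for the same table, one per attribute, because none of them changes the physical order of the rows.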

To be noted: all index structures promise an efficiency improvement, but whether they deliver it always
depends on the kind of data stored, the queries and the data distribution.

Dimensionality: the dimensionality of the data affects how efficiently our queries can be answered:

 One-dimensional data
o Scalar (numerical), nominal data, ordinal data, metric data
 Multi-dimensional data
o Multi-variant queries, geographic information systems (spatial data), …
 High-dimensional data
o Images, vector spaces, time series, sensor representations
 No-dimensional data (metric data)
o Only distance functions between data items are available (protein folding structures)
o Important characteristics of metric data:
1. Symmetry: 𝑑(𝑝, 𝑞) = 𝑑(𝑞, 𝑝)
2. Definiteness: 𝑑(𝑝, 𝑞) = 0 ⇔ 𝑝 = 𝑞
3. Triangle Inequality: 𝑑(𝑝, 𝑟) ≤ 𝑑(𝑝, 𝑞) + 𝑑(𝑞, 𝑟)
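
To make the three properties concrete, here is a small sketch that checks them on a sample of items. The Hamming distance on equal-length strings is an assumed example of a distance function; it is not taken from the notes. The check is a test on a finite sample, not a proof.

```python
from itertools import product

def hamming(p, q):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(p, q))

def satisfies_metric_axioms(items, d):
    """Check symmetry, definiteness and the triangle inequality on a sample."""
    for p, q, r in product(items, repeat=3):
        if d(p, q) != d(q, p):            # symmetry
            return False
        if (d(p, q) == 0) != (p == q):    # definiteness
            return False
        if d(p, r) > d(p, q) + d(q, r):   # triangle inequality
            return False
    return True

sample = ["ACGT", "ACGA", "TCGA", "TTTT"]   # hypothetical items
print(satisfies_metric_axioms(sample, hamming))

# The squared difference violates the triangle inequality: 4 > 1 + 1.
print(satisfies_metric_axioms([0, 1, 2], lambda a, b: (a - b) ** 2))
```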

Types of Data Structuring:

1. Space-Oriented Structures: easy partitioning, but the complexity may rise unpredictably.
Example: hashing functions.
2. Data-Oriented Structures: constant re-partitioning, but a balanced distribution.
Example: search trees.

We can also improve efficiency by changing Internal Structures:

1. Efficient Sequential Scan: instead of storing and scanning all the information, we store a compact
representation of it and scan that.
Complexity: still 𝑂(𝑛), but quicker than a plain sequential scan.
Example: Bitmaps, VA files.
2. Hierarchical structures: Idea: pruning of search paths.
Complexity: 𝑂(𝑛) to 𝑂(log 𝑛).
Example: search trees.
3. Scatter Storage Structures: the data is scattered over a fixed set of buckets, which can give the
best query complexity. However, this may also lead to degeneration of the data tables.
Complexity: 𝑂(𝑛) to 𝑂(1).
Example: hash functions.
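
The range from 𝑂(1) to 𝑂(𝑛) for scatter storage can be illustrated with a toy bucket table. This is a sketch under assumptions: the keys, bucket counts and the identity hash function are chosen only to show a well-spread and a degenerated case.

```python
def build_buckets(keys, bucket_count, hash_function):
    """Scatter keys over a fixed number of buckets."""
    buckets = [[] for _ in range(bucket_count)]
    for key in keys:
        buckets[hash_function(key) % bucket_count].append(key)
    return buckets

def lookup(buckets, key, hash_function):
    """Only the single bucket the key hashes to is scanned."""
    bucket = buckets[hash_function(key) % len(buckets)]
    return key in bucket

def identity(key):
    return key

keys = list(range(0, 80, 8))             # 0, 8, 16, ..., 72 (10 keys)
good = build_buckets(keys, 7, identity)  # keys spread over the buckets
bad = build_buckets(keys, 8, identity)   # every key lands in bucket 0

print(max(len(bucket) for bucket in good))  # 2
print(max(len(bucket) for bucket in bad))   # 10
```

In the degenerated table a lookup has to scan all 10 keys in one bucket, which is no better than a sequential scan.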

Types of Search Queries:

1. Exact Match Query
 Specifies all k attributes exactly
 (𝑥₁, … , 𝑥ₖ)
2. Partial Match Query
 Specifies values only for some attributes; the others are left open (∗)
 e.g. (∗, 𝑥₂, … , 𝑥ᵢ, ∗, 𝑥ₖ)
3. Range Query
 Specifies all k ranges
 ([𝑢₁, 𝑜₁], … , [𝑢ₖ, 𝑜ₖ])
4. Partial Range Query
 Specifies some ranges
 Similar to partial match queries
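
The four query types can be written as simple predicates over tuples. The following is a sketch with hypothetical records of k = 3 attributes; `ANY` is an assumed stand-in for the ∗ wildcard, and the evaluation is a plain scan without any index.

```python
ANY = None  # stands for the "∗" wildcard

# Hypothetical records with k = 3 attributes: (age, floor, salary).
records = [(30, 2, 4000), (30, 5, 5200), (45, 2, 6100), (52, 1, 3900)]

def partial_match(rows, pattern):
    """Keep rows that equal the pattern wherever the pattern is not ANY."""
    return [row for row in rows
            if all(p is ANY or p == x for p, x in zip(pattern, row))]

def partial_range(rows, ranges):
    """Keep rows whose attributes lie in [low, high] wherever a range is given."""
    return [row for row in rows
            if all(r is ANY or r[0] <= x <= r[1] for r, x in zip(ranges, row))]

exact = partial_match(records, (30, 2, 4000))          # all attributes fixed
partial = partial_match(records, (30, ANY, ANY))       # some attributes fixed
ranged = partial_range(records, ((25, 50), (1, 3), (3000, 7000)))
partial_ranged = partial_range(records, (ANY, (2, 2), ANY))

print(exact)           # [(30, 2, 4000)]
print(partial)         # [(30, 2, 4000), (30, 5, 5200)]
print(ranged)          # [(30, 2, 4000), (45, 2, 6100)]
print(partial_ranged)  # [(30, 2, 4000), (45, 2, 6100)]
```

An exact match query is the special case of a partial match query without wildcards, and a range query is the special case of a partial range query in which every range is given.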

Types of Similarity Queries (from most to least strict):

1. Range Queries
 Returns all objects whose distance to the query object q is at most a specified 𝜀 value
 {𝑜 | 𝑜 ∈ 𝐷𝐵 ∧ 𝑑(𝑜, 𝑞) ≤ 𝜀}
2. k-Nearest Neighbor Queries
 Gives us the object(s) o that are the closest to q, i.e. those whose distance to q is no larger than
that of any other object o'
 {𝑜 | 𝑜 ∈ 𝐷𝐵 ∧ ∀𝑜′ ∈ 𝐷𝐵: 𝑑(𝑜, 𝑞) ≤ 𝑑(𝑜′, 𝑞)}
3. Ranking Queries
 Similar to k-nearest neighbor, but the result is ordered (e.g. ascending in relation to 𝑑(𝑜, 𝑞)) and
can become large, since no k is fixed in advance
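
The three similarity queries can be sketched with a brute-force scan. The sample database, the query point and the use of the Euclidean distance are assumptions made for illustration; an index structure would aim to return the same results while inspecting fewer objects.

```python
import math

def range_query(db, q, eps):
    """All objects whose distance to q is at most eps."""
    return [o for o in db if math.dist(o, q) <= eps]

def ranking_query(db, q):
    """All objects, ordered by ascending distance to q."""
    return sorted(db, key=lambda o: math.dist(o, q))

def knn_query(db, q, k):
    """The k objects closest to q."""
    return ranking_query(db, q)[:k]

db = [(3, 4), (0, 0), (6, 8), (1, 1)]   # hypothetical 2-dimensional objects
q = (0, 0)

print(range_query(db, q, 2))   # [(0, 0), (1, 1)]
print(knn_query(db, q, 3))     # [(0, 0), (1, 1), (3, 4)]
print(ranking_query(db, q))    # [(0, 0), (1, 1), (3, 4), (6, 8)]
```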

INVERTED LISTS:

Inverted Lists work as follows:

ONLY if all attributes 𝐴₁ … 𝐴ₖ are equally important, we can answer a multivariate query using Inverted
Lists.
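
A minimal sketch of the idea, assuming one list of row identifiers per (attribute, value) pair and a multivariate query answered by intersecting the lists. The rows and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows, keyed by row identifier.
rows = {
    1: {"color": "red", "size": "L"},
    2: {"color": "blue", "size": "L"},
    3: {"color": "red", "size": "M"},
}

def build_inverted_lists(rows):
    """One set of row identifiers per (attribute, value) pair."""
    lists = defaultdict(set)
    for row_id, row in rows.items():
        for attribute, value in row.items():
            lists[(attribute, value)].add(row_id)
    return lists

def multi_attribute_query(lists, conditions):
    """Intersect the lists of all (attribute, value) conditions."""
    id_sets = [lists.get(condition, set()) for condition in conditions.items()]
    return set.intersection(*id_sets) if id_sets else set()

lists = build_inverted_lists(rows)
print(multi_attribute_query(lists, {"color": "red", "size": "L"}))  # {1}
```

Every attribute gets the same treatment here: no list is preferred over another, which matches the condition that all attributes are equally important.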

COMPOSITE INDEXES:

Composite Indexes, by contrast, exploit the fact that not all attributes are equally important.
The order of importance is encoded inside the composite index and doesn't change.
Example:
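
One way to picture a composite index is a list of tuples sorted by a fixed attribute order. The sketch below assumes a hypothetical index on (last_name, first_name); a sorted Python list stands in for the actual index structure.

```python
from bisect import bisect_left, bisect_right

# Hypothetical composite index on (last_name, first_name); the order is fixed.
index = sorted([
    ("Huber", "Anna"), ("Huber", "Paul"), ("Maier", "Eva"), ("Wolf", "Anna"),
])

def by_leading_attribute(index, last_name):
    """Entries for one last_name are contiguous: two binary searches suffice."""
    low = bisect_left(index, (last_name,))
    # chr(0x10FFFF) is the largest code point, used here as an upper sentinel.
    high = bisect_right(index, (last_name, chr(0x10FFFF)))
    return index[low:high]

def by_trailing_attribute(index, first_name):
    """Entries for one first_name are scattered: the whole index is scanned."""
    return [entry for entry in index if entry[1] == first_name]

print(by_leading_attribute(index, "Huber"))   # [('Huber', 'Anna'), ('Huber', 'Paul')]
print(by_trailing_attribute(index, "Anna"))   # [('Huber', 'Anna'), ('Wolf', 'Anna')]
```

Queries on the leading attribute profit from the sort order, while queries on the trailing attribute alone do not. This is the sense in which the importance of the attributes is built into the index.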

R-TREES:

R-Trees are another data structure that is based on storage of overlapping page regions.

The height is always ≤ ⌈logₘ 𝑁⌉ − 1, where m is the minimum number of entries in a page and N is the
number of stored objects.
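
As a worked example of this bound, the sketch below evaluates ⌈logₘ 𝑁⌉ − 1 for assumed values of N and m. The values are hypothetical, and the logarithm is computed with integer arithmetic to avoid floating-point rounding at exact powers.

```python
def ceil_log(n, base):
    """Smallest integer e with base**e >= n, using exact integer arithmetic."""
    exponent, power = 0, 1
    while power < n:
        power *= base
        exponent += 1
    return exponent

def max_rtree_height(n_objects, min_entries):
    """Upper bound on the height: ceil(log_m N) - 1."""
    return ceil_log(n_objects, min_entries) - 1

# Hypothetical sizes: one million objects, at least 50 entries per page.
print(max_rtree_height(1_000_000, 50))  # 3
```

With 50 or more entries per page, even a million objects fit into a very shallow tree.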

The idea is to approximate objects by their minimum bounding rectangles (MBRs) and thereby allow for
quicker searching:

1. If we find that the queried range is empty, we can stop at the 1st level.
2. Even if we keep on looking for an object that doesn’t exist, we only need to go through a small
number of data pages (≤ ⌈logₘ 𝑁⌉ − 1)

Each rectangle in a directory page covers the MBR of all rectangles in all directory- or data-pages stored
in the respective subtree:

It should be noted:
 Only the leaf nodes contain any kinds of objects, intermediary nodes only contain references to
other pages
 An object is always stored in exactly one node, but a query might have to search several nodes if the
queried object lies where page regions overlap
 R-Trees handle spatial data well, but it's a bad idea to use R-Trees for high-dimensional data due to
degeneration. The more overlap there is, the more degenerated the tree is. The "sweet spot" is
somewhere in the range of 2- to 6-dimensional data.
 ONLY if all attributes 𝐴₁ … 𝐴ₖ are equally important can we use R-Trees

The operations on R-Trees work as follows:

1. Search: start at the root node and follow every entry whose rectangle has a non-empty intersection
with the query rectangle q (i.e. they overlap), down to the matching index-entries in the leaf nodes

2. Insert: insertion performed in leaf nodes. In case of overflow, perform a split of nodes and insertion
of the respective index-entry into the parent node; hence, recursive split of parent nodes possible.
Complexity: 𝑂(log 𝑛), assuming no split is necessary, otherwise 𝑂(log 𝑛) + the complexity of the
respective split operation

3. Node Splitting:
a. Quadratic Split:
Complexity: 𝑂(𝑀²), where M is the maximum number of entries in a page, yet linear in the number of
dimensions of the data space
After performing this split, if we’re inserting an object, we insert it into the rectangle with the
smallest number of objects.
b. Linear Split:
Complexity: 𝑂(𝑀)
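
The quadratic cost comes from comparing all pairs of entries. The sketch below shows one commonly described seed-picking step for a quadratic split: choose the two entries that would waste the most area if they were placed in the same node. The notes do not spell out the split itself, so this rule, the 2-dimensional rectangles and all names are assumptions.

```python
from itertools import combinations

def area(r):
    """Area of a rectangle (x_min, y_min, x_max, y_max)."""
    return (r[2] - r[0]) * (r[3] - r[1])

def enclose(a, b):
    """Minimum bounding rectangle of two rectangles."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def wasted_area(a, b):
    """Area of the common MBR that is covered by neither rectangle's own area."""
    return area(enclose(a, b)) - area(a) - area(b)

def pick_seeds_quadratic(entries):
    """Examine all pairs of entries, O(M^2), and return the most wasteful pair."""
    return max(combinations(range(len(entries)), 2),
               key=lambda pair: wasted_area(entries[pair[0]], entries[pair[1]]))

# Hypothetical overflowing page with four entries.
entries = [(0, 0, 1, 1), (1, 0, 2, 1), (8, 8, 9, 9), (0, 1, 1, 2)]
print(pick_seeds_quadratic(entries))  # (0, 2)
```

The two seeds start the two new nodes; the remaining entries are then distributed between them.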

These are the main types of queries in R-Trees:

1. Point Query: results in the entire data page being returned. If a point is located at the
intersection of 2 pages, both are returned
2. Rectangle Query:
a. Intersection: returns an intersection of all nodes that contain something in a range
b. Inclusion: returns only the page that fully contains the query
3. Range Query:
a. Intersection
b. Cover
4. Nearest-Neighbor Query: returns the closest data page at the lowest level (leaf node).
5. K-Nearest-Neighbors Query: returns k leaf nodes closest to a data point.

When doing a rectangular query (example to the right):

1. Start at the root node
2. Do a depth-first search for the leaf node intersecting with the query
3. Continue the search in intersecting rectangles
4. Go one level higher, repeat with step 2
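
These steps amount to a recursive depth-first traversal that skips every subtree whose rectangle misses the query. The following is a minimal sketch; the dictionary-based node layout, the tiny hand-built tree and the object names are hypothetical.

```python
def intersects(a, b):
    """True if two rectangles (x_min, y_min, x_max, y_max) overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node, query):
    """Depth-first search that prunes subtrees whose MBR misses the query."""
    results = []
    for rectangle, child in node["entries"]:
        if not intersects(rectangle, query):
            continue                        # prune this entry
        if node["leaf"]:
            results.append(child)           # child is an object identifier
        else:
            results.extend(search(child, query))
    return results

# Hypothetical two-level tree: a root directory page and two data pages.
leaf_a = {"leaf": True, "entries": [((0, 0, 1, 1), "a1"), ((2, 2, 3, 3), "a2")]}
leaf_b = {"leaf": True, "entries": [((6, 6, 7, 7), "b1"), ((8, 5, 9, 6), "b2")]}
root = {"leaf": False, "entries": [((0, 0, 3, 3), leaf_a), ((6, 5, 9, 7), leaf_b)]}

print(search(root, (2.5, 2.5, 6.5, 6.5)))  # ['a2', 'b1']
print(search(root, (4, 0, 5, 1)))          # []
```

The second query overlaps no rectangle in the root, so the search stops at the first level without reading any data page.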

PYRAMID INDEXES:

A d-dimensional key is translated into a one-dimensional key with the help of pyramids.
Then every object is assigned one specific one-dimensional key.
Example:

Each object can be indexed with just 2 values:


1. The pyramid number 𝑖 (which pyramid the object is in)
2. The pyramid value 𝑝𝑣

Pyramid Values:

For a point 𝑣 in a pyramid 𝑖 we have to calculate the height ℎᵥ:

ℎᵥ = |𝑣_(𝑖 MOD 𝑑)|
With d being the number of dimensions.

Then we can generate a one-dimensional representation of the object that can be stored in a B-Tree,
the so-called pyramid value:
𝑝𝑣ᵥ = (𝑖 + ℎᵥ)

The pyramid value is thus the sum of the pyramid number and the height of the point within that
pyramid.
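
The computation can be sketched as follows. Two assumptions are made that the notes do not state: coordinates are measured relative to the center of the data space (each in [-0.5, 0.5]), and the pyramid number is chosen by the dimension with the largest absolute coordinate, with d added for the positive side.

```python
def pyramid_value(v):
    """Return (i, h, pv) for a point v with coordinates in [-0.5, 0.5].

    Assumed numbering: pyramids 0 .. d-1 lie on the negative side of each
    dimension, pyramids d .. 2d-1 on the positive side.
    """
    d = len(v)
    # Dimension in which the point is farthest from the center.
    j_max = max(range(d), key=lambda j: abs(v[j]))
    i = j_max if v[j_max] < 0 else j_max + d
    h = abs(v[i % d])          # height of the point within pyramid i
    return i, h, i + h         # pyramid value = pyramid number + height

print(pyramid_value((0.125, -0.375)))  # (1, 0.375, 1.375)
print(pyramid_value((0.25, 0.125)))    # (2, 0.25, 2.25)
```

Because the height is at most 0.5, the integer part of the pyramid value still identifies the pyramid, so the values of different pyramids occupy disjoint key ranges in the B-Tree.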

The problem with pyramid values:

A pyramid value does not identify one object only: infinitely many points (all points at the same
height within the same pyramid) share the same value.
This might lead to many objects being returned as a result of a query (for example, a range query), so
the set of the returned objects will then have to be filtered (or refined).

If the pyramid values (i.e. heights) are different across all of the queried data points, then the data set
can and will be queried efficiently. However, if the heights of the queried data objects are the same,
then pyramid indexes don’t provide faster query times.
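
The need for a refinement step can be shown with two points that share a pyramid value. The sketch repeats a compact pyramid-value function under the same assumptions as in the published Pyramid-Technique with centered coordinates (each in [-0.5, 0.5]); the points and the query box are hypothetical.

```python
def pyramid_value(v):
    """Pyramid number plus height for a point with coordinates in [-0.5, 0.5]."""
    d = len(v)
    j_max = max(range(d), key=lambda j: abs(v[j]))
    i = j_max if v[j_max] < 0 else j_max + d
    return i + abs(v[i % d])

points = [(0.125, -0.375), (-0.25, -0.375), (0.25, 0.125)]
index = sorted((pyramid_value(p), p) for p in points)   # one-dimensional keys

# Filter step: the key 1.375 is shared by two different points.
candidates = [p for pv, p in index if pv == 1.375]

# Refinement step: keep only candidates inside the actual query box
# 0 <= x <= 0.5 and -0.5 <= y <= 0.
result = [p for p in candidates if 0 <= p[0] <= 0.5 and -0.5 <= p[1] <= 0]

print(candidates)  # [(-0.25, -0.375), (0.125, -0.375)]
print(result)      # [(0.125, -0.375)]
```

If many stored points have the same height, the filter step returns many candidates and most of the work shifts to the refinement, which is why equal heights erase the advantage of the index.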
