An Easy Guide To Advanced SQL Window Functions - by Julia Kho - Towards Data Science
An Easy Guide To Advanced SQL Window Functions - by Julia Kho - Towards Data Science
Save
https://towardsdatascience.com/a-guide-to-advanced-sql-window-functions-f63f2642cbf9 1/32
8/26/22, 10:01 AM An Easy Guide to Advanced SQL Window Functions | by Julia Kho | Towards Data Science
This article is a guide on advanced window functions for data analysis in SQL. This is
definitely a must know for data scientists and analysts. I will first introduce what
window functions are, why you should use them, and the 3 types of window functions.
Next, I will go through real-life examples to show how each of these functions are used.
“A window function performs a calculation across a set of table rows that are somehow
related to the current row…Behind the scenes, the window function is able to access more
than just the current row of the query result.”
Window functions are similar to the aggregation done in the GROUP BY clause.
However, rows are not grouped into a single row, each row retains their separate
identity. That is, a window function may return a single value for each row. Here’s a
good visualization of what I mean by that.
Image by Author
Notice how the GROUP BY aggregation on the left hand side of the picture groups the
three rows into one single row. The window function on the right hand side of the
picture is able to output each row with an aggregation value. This may save you from
having to do a join after the GROUP BY.
Let’s say we have some salary data and we want to find to create a column that gives us
the average salary for each job title.
https://towardsdatascience.com/a-guide-to-advanced-sql-window-functions-f63f2642cbf9 3/32
8/26/22, 10:01 AM An Easy Guide to Advanced SQL Window Functions | by Julia Kho | Towards Data Science
On the left is what a GROUP BY aggregation would return and on the right is what a
window function would return. As you can see, the group by consolidates our data into
just three rows. With a window function, we retain the original 11 rows and have a
new column called AVG_SALARY. We could then choose to compare each individual’s
salary to the average salary if desired.
Window functions are also simple to use and read. That is, they can reduce the
complexity of your queries, which makes it easier to maintain down the road.
In addition, they can help with performance issues. For example, you can use a
window function instead of having to do a self-join or cross-join.
https://towardsdatascience.com/a-guide-to-advanced-sql-window-functions-f63f2642cbf9 4/32
8/26/22, 10:01 AM An Easy Guide to Advanced SQL Window Functions | by Julia Kho | Towards Data Science
Important Note:
Before we start, it is important to note that in terms of the order of operations in SQL,
window functions come in sixth on the list.
Image by Author
This is important because based off of this logical order, window functions are allowed
in SELECT and ORDER BY, but they are not allowed in FROM , WHERE , GROUP BY , or
HAVING clauses.
Note: If you really need to have it inside a WHERE clause or GROUP BY clause, you may
get around this limitation by using a subquery or a WITH query.
Image by Author
expression is the name of the column that we want the window function operated
on. This may not be necessary depending on what window_function is used
PARTITION BY divides the rows into partitions so we can specify which rows to
use to compute the window function
ORDER BY is used so that we can order the rows within each partition. This is
optional and does not have to be specified
ROWS can be used if we want to further limit the rows within our partition. This is
optional and usually not used
Don’t worry about memorizing the definitions and syntax or even fully understanding
what it means exactly right now. Everything will make a lot more sense once you look
at the examples in the article and get an intuitive understanding of how to go about
writing a window function.
Quick Example
To help you get a better idea of how the syntax really works, below is an example of
what a window function would look like in practice.
This is the query that would have generated the output we saw earlier regarding salary
by job title.
Here, AVG() is the name of the window function, SALARY is the expression and
JOB TITLE is our partition list We did not use an ORDER BY as it is not needed and
https://towardsdatascience.com/a-guide-to-advanced-sql-window-functions-f63f2642cbf9 6/32
8/26/22, 10:01 AM An Easy Guide to Advanced SQL Window Functions | by Julia Kho | Towards Data Science
JOB_TITLE is our partition list. We did not use an ORDER BY as it is not needed and
we do not want to use ROWS because we do not want to further
Openlimit
in appour partition.
Get started
Image by Author
Again, no need for memorizing syntax for now. At this stage, the one concept I want
you to understand is that the window function computes a value for each row in the
“window” or “partition”. A window can be one of more rows and it is specified by the
clause PARTITION BY . In our example, we partitioned by job title. As you can see in the
snippet above, I’ve highlighted each job title a different color. Each color represents a
different “window” or a different “partition”. The window function computes one
average salary value for each partition.
Image by Author
There are three main types of window functions available to use: aggregate, ranking,
https://towardsdatascience.com/a-guide-to-advanced-sql-window-functions-f63f2642cbf9 7/32
8/26/22, 10:01 AM An Easy Guide to Advanced SQL Window Functions | by Julia Kho | Towards Data Science
and value functions. In the image below, you can see some of the names of the
Open in app Get started
Image by Author
Here’s a quick overview of what each type of window function is useful for.
Ranking functions: these functions are useful for ranking rows within its partition.
Value functions: these functions allow you to compare values from previous or
following rows within the partition or the first or last value within the partition.
Data
https://towardsdatascience.com/a-guide-to-advanced-sql-window-functions-f63f2642cbf9 8/32
8/26/22, 10:01 AM An Easy Guide to Advanced SQL Window Functions | by Julia Kho | Towards Data Science
For the example problems below, I am using data from the Northwind database located
on this website. See Northwind_small.sqlite. The file can also be found in my github
repo.
Per the download source, “The Northwind sample database was provided with
Microsoft Access as a tutorial schema for managing small business customers, orders,
inventory, purchasing, suppliers, shipping, and employees. Northwind is an excellent
tutorial schema for a small-business ERP, with customers, orders, inventory,
purchasing, suppliers, shipping, employees, and single-entry accounting.”
The full schema of the database is displayed on the website linked above. For the
examples in this article, I will only be using the [Order] and [OrderDetail] tables.
https://towardsdatascience.com/a-guide-to-advanced-sql-window-functions-f63f2642cbf9 9/32
8/26/22, 10:01 AM An Easy Guide to Advanced SQL Window Functions | by Julia Kho | Towards Data Science
Source: https://github.com/jpwhite3/northwind-SQLite3
Aggregate Functions
Image by Author
Exercise 1: Create a new column that calculates average Unit Price for each CustomerId
Image by Author
From the list of aggregate window functions listed on the left, we can see that AVG()
will be the window function that we want to use. Our expression will be the Unit Price
column because we want to calculate the average of Unit Price.
Next, we need to figure out how we want to partition. That is, how should the rows be
grouped together in order to create our partitions? The exercise statement tells us to
find the avg price for each CustomerId. That tells us that we want to group rows that
have the same CustomerId, so that is will be part of our partition list. For the purpose
of this exercise, we have no use of an ORDER BY. Below is what our query would look
like.
https://towardsdatascience.com/a-guide-to-advanced-sql-window-functions-f63f2642cbf9 10/32
8/26/22, 10:01 AM An Easy Guide to Advanced SQL Window Functions | by Julia Kho | Towards Data Science
SELECT CustomerId,
UnitPrice,
AVG(UnitPrice) OVER (PARTITION BY CustomerId) AS
“AvgUnitPrice”
FROM [Order]
INNER JOIN OrderDetail ON [Order].Id = OrderDetail.OrderId
https://towardsdatascience.com/a-guide-to-advanced-sql-window-functions-f63f2642cbf9 11/32
8/26/22, 10:01 AM An Easy Guide to Advanced SQL Window Functions | by Julia Kho | Towards Data Science
https://towardsdatascience.com/a-guide-to-advanced-sql-window-functions-f63f2642cbf9 12/32
8/26/22, 10:01 AM An Easy Guide to Advanced SQL Window Functions | by Julia Kho | Towards Data Science
Image by Author
As you can see in the image above, an average unit price is computed for each of our
partitions of CustomerId.
1. Create a new column that calculates max Discount for each CustomerId
2. Create a new column that calculates min Unit Price for each ProductId
Exercise 2: Create a new column that calculates average Unit Price for each group of
CustomerId AND EmployeeId.
You can choose to partition by more than 1 column. Earlier, we calculated the average
unit price for each CustomerId group. This time, we’ll add in EmployeeId.
SELECT CustomerId,
EmployeeId,
AVG(UnitPrice) OVER (PARTITION BY CustomerId, EmployeeId) AS
“AvgUnitPrice”
FROM [Order]
INNER JOIN OrderDetail ON [Order].Id = OrderDetail.OrderId
https://towardsdatascience.com/a-guide-to-advanced-sql-window-functions-f63f2642cbf9 13/32
8/26/22, 10:01 AM An Easy Guide to Advanced SQL Window Functions | by Julia Kho | Towards Data Science
Image by Author
Open in app Get started
Notice how the partition changes from earlier. The calculations are computed for every
unique group of CustomerId and EmployeeId as visually shown by the different colors
in the table.
Ranking Functions
Image by Author
Exercise 3: Create a new column that ranks Unit Price in descending order for each
CustomerId.
We’ll use the first three ranking functions on the list on the left: ROW_NUMBER,
RANK, and DENSE_RANK.
ROW_NUMBER
We can use this function to show the row number of a given row within its partition.
N t th t f th ki f ti d th
https://towardsdatascience.com/a-guide-to-advanced-sql-window-functions-f63f2642cbf9 t if i ithi th 14/32
8/26/22, 10:01 AM An Easy Guide to Advanced SQL Window Functions | by Julia Kho | Towards Data Science
Note that for the ranking functions, we do not have to specify an expression within the
Open in app
parentheses like we did previously for the aggregate functions. Get started
In addition, since are doing a ranking, order is important here. We have to make sure
that the Unit Price column is ordered correctly so that the ranking is applied correctly.
To do so, we can add ORDER BY UnitPrice DESC as part of the windows function, right
SELECT CustomerId,
OrderDate,
UnitPrice,
ROW_NUMBER() OVER (PARTITION BY CustomerId ORDER BY UnitPrice
DESC) AS “UnitRank”
FROM [Order]
INNER JOIN OrderDetail
ON [Order].Id = OrderDetail.OrderId
https://towardsdatascience.com/a-guide-to-advanced-sql-window-functions-f63f2642cbf9 15/32
8/26/22, 10:01 AM An Easy Guide to Advanced SQL Window Functions | by Julia Kho | Towards Data Science
Source: Author
As you can see in the output above, our UnitPrice column is in descending order and
the unit’s rank is shown for each customer id in the last column. There are 12 rows for
customer ALFK, so the rank goes from 1 to 12.
You may be wondering, what happens if I use the ORDER BY at the end of the SQL
statement and not inside the windows function, will I get the same results?
Take a minute a think about it and come back. Does it matter if I have the ORDER BY
inside the windows function versus outside?
https://towardsdatascience.com/a-guide-to-advanced-sql-window-functions-f63f2642cbf9 16/32
8/26/22, 10:01 AM An Easy Guide to Advanced SQL Window Functions | by Julia Kho | Towards Data Science
Let’s try it out! Remove the ORDER BY from the windows function and let’s add it to
the end.
SELECT CustomerId,
OrderDate,
UnitPrice,
ROW_NUMBER() OVER (PARTITION BY CustomerId) AS “UnitRank”
FROM [Order]
INNER JOIN OrderDetail ON [Order].Id = OrderDetail.OrderId
ORDER BY CustomerId, UnitPrice DESC
https://towardsdatascience.com/a-guide-to-advanced-sql-window-functions-f63f2642cbf9 17/32
8/26/22, 10:01 AM An Easy Guide to Advanced SQL Window Functions | by Julia Kho | Towards Data Science
Hm, it looks like we aren’t getting the same results as earlier. Unit price is ordered
correctly in descending order, but the unit’s rank doesn’t look right. Why not?
Remember back to the SQL order of operations. Window functions are processed sixth
https://towardsdatascience.com/a-guide-to-advanced-sql-window-functions-f63f2642cbf9 18/32
8/26/22, 10:01 AM An Easy Guide to Advanced SQL Window Functions | by Julia Kho | Towards Data Science
Image by Author
So the row numbers were created BEFORE the UnitPrice was ordered. That’s why we
don’t get the same results! Makes sense.
RANK()
Now, let’s try RANK() in place of ROW_NUMBER().
SELECT CustomerId,
OrderDate,
UnitPrice,
RANK() OVER (PARTITION BY CustomerId ORDER BY UnitPrice DESC)
AS “UnitRank”
FROM [Order]
INNER JOIN OrderDetail ON [Order].Id = OrderDetail.OrderId
https://towardsdatascience.com/a-guide-to-advanced-sql-window-functions-f63f2642cbf9 19/32
8/26/22, 10:01 AM An Easy Guide to Advanced SQL Window Functions | by Julia Kho | Towards Data Science
Image by Author
What’ the difference now? With row number, there were no repeated numbers. But
with RANK(), if you have multiple values with the exact same value, the rank function
will give them the same rank.
Notice in rows 2 and 3, the unit price is 45.60, so both rows are given the same rank of
2. Rows 7 and 8 also have the same unit price and are given the same rank of 7.
Also note that the ranking skips a number. For example, in row 3, the rank skips to 4
since there are two rows of rank 2. If there were three rows with rank 2, then it would
skip to rank 5 and so on.’
Well what if you don’t want it to skip numbers? Well we can use DENSE RANK()
https://towardsdatascience.com/a-guide-to-advanced-sql-window-functions-f63f2642cbf9 20/32
8/26/22, 10:01 AM An Easy Guide to Advanced SQL Window Functions | by Julia Kho | Towards Data Science
Well, what if you don t want it to skip numbers? Well, we can use DENSE_RANK()
instead. Open in app Get started
DENSE_RANK()
Again, replace our window function to be DENSE_RANK() and keep all else the same.
SELECT CustomerId,
OrderDate,
UnitPrice,
DENSE_RANK() OVER (PARTITION BY CustomerId ORDER BY UnitPrice
DESC) AS “UnitRank”
FROM [Order]
INNER JOIN OrderDetail ON [Order].Id = OrderDetail.OrderId
Image by Author
It follows the same behavior as RANK() in that if the values are the same the same
https://towardsdatascience.com/a-guide-to-advanced-sql-window-functions-f63f2642cbf9 21/32
8/26/22, 10:01 AM An Easy Guide to Advanced SQL Window Functions | by Julia Kho | Towards Data Science
It follows the same behavior as RANK() in that if the values are the same, the same
rank will be given to those rows. See rows 2 and 3. Notice that in app4, theGet
in row
Open started
rank does
Your homework now is to learn how PERCENT_RANK() and NTILE() work and to try
those functions for yourself.
Value Functions
To me, value functions are probably the top reason why window functions are so
amazing. These functions are great for extracting values from other rows that might be
useful for a report.
Image by Author
We can use the LAG or LEAD functions to help us create a column that is able to pull
values from other rows. LAG can return values from the previous rows whereas LEAD
returns values from following rows. Comparing previous or following rows can be
super useful when working with time series data and calculating differences across
time.
Exercise 4: Create a new column that provides the previous order date’s Quantity for each
ProductId.
SELECT ProductId,
OrderDate,
Quantity,
LAG(Quantity) OVER (PARTITION BY ProductId ORDER BY
https://towardsdatascience.com/a-guide-to-advanced-sql-window-functions-f63f2642cbf9 22/32
8/26/22, 10:01 AM An Easy Guide to Advanced SQL Window Functions | by Julia Kho | Towards Data Science
OrderDate) AS "LAG"
Open in app Get started
FROM [Order]
INNER JOIN OrderDetail ON [Order].Id = OrderDetail.OrderId
We use LAG on the Quantity column to return the value from the previous row. Just
like before, we need to make sure our data is sorted in order inside our windows
function. We’ll sort by the OrderDate here.
https://towardsdatascience.com/a-guide-to-advanced-sql-window-functions-f63f2642cbf9 23/32
8/26/22, 10:01 AM An Easy Guide to Advanced SQL Window Functions | by Julia Kho | Towards Data Science
As you can see in the image above, we get a column with the previous OrderDate’s
Quantity. This is really useful because we could compare the current order date to the
previous order date and calculate the differences across the two time periods. In the
first row, there is no previous order date, so it is NaN, or null.
Exercise 5: Create a new column that provides the following order date’s Quantity for each
ProductId.
This is going to look pretty similar to our earlier example. However, this time, since we
want the following row, we will use LEAD().
SELECT ProductId,
OrderDate,
Quantity,
LEAD(Quantity) OVER (PARTITION BY ProductId ORDER BY
OrderDate) AS "LEAD"
FROM [Order]
INNER JOIN OrderDetail ON [Order].Id = OrderDetail.OrderId
https://towardsdatascience.com/a-guide-to-advanced-sql-window-functions-f63f2642cbf9 24/32
8/26/22, 10:01 AM An Easy Guide to Advanced SQL Window Functions | by Julia Kho | Towards Data Science
Image by Author
As you can see, the new column LEAD contains the values from the the next row down.
Exercise 6: Create a new column that provides the very first Quantity ever ordered for each
ProductId.
To get the first quantity, we can use the FIRST_VALUE function, which will give us the
https://towardsdatascience.com/a-guide-to-advanced-sql-window-functions-f63f2642cbf9 25/32
8/26/22, 10:01 AM An Easy Guide to Advanced SQL Window Functions | by Julia Kho | Towards Data Science
SELECT ProductId,
OrderDate,
Quantity,
FIRST_VALUE(Quantity) OVER (PARTITION BY ProductId ORDER BY
OrderDate) AS "FirstValue"
FROM [Order]
INNER JOIN OrderDetail ON [Order].Id = OrderDetail.OrderId
Image by Author
As you can see in the image, the first order for product ID 1 was on 8/20/2012 with a
quantity of 45, so we get a value of 45 for all rows related to product 1.
1. Create a new column that provides the second Quantity ordered for each
ProductId. (HINT: use the NTH_VALUE function)
Let’s take a quick break from the exercises to learn a new concept that hasn’t been
Open in app Get started
discussed yet.
You may remember from the definitions at the beginning that we can specify ROWS
with a frame_clause to further limit our window size. I’m saving this towards the end
here because people tend to get a bit confused about this. I’m going to quickly go over
the syntax and how it’s used, then let’s look at an example to really understand what’s
going on.
In the <starting_row> and <ending row>, we have the following options at our
disposal:
UNBOUNDED PRECEDING — all rows before the current row in the partition, i.e.
the first row of the partition
UNBOUNDED FOLLOWING — all rows after the current row in the partition, i.e.
the last row of the partition
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW — this means look back the
previous 3 rows up to the current row.
ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING — this means look back the
previous 5 rows up to 1 row before the current row
One worthy note is that anytime that you add an ORDER BY clause, SQL sets the
default window as ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
SELECT CustomerId,
UnitPrice,
AVG(UnitPrice) OVER (PARTITION BY CustomerId
ORDER BY CustomerId
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS “CumAvg”
FROM [Order]
INNER JOIN OrderDetail ON [Order].Id = OrderDetail.OrderId
https://towardsdatascience.com/a-guide-to-advanced-sql-window-functions-f63f2642cbf9 28/32
8/26/22, 10:01 AM An Easy Guide to Advanced SQL Window Functions | by Julia Kho | Towards Data Science
Image by Author
In the output above, you can see that an average is recalculated at every row. In row 1,
there is only 1 number, so the average is 45.60. In row 2, the CumAvg is the average of
45.60 and 18. In row 3, the CumAvg is the average of 45.60, 18, and 12. And so on…
1. Calculate the average Quantity of the previous 5 rows up to the previous 3 rows for
each CustomerId.
2. Calculated the minimum UnitPrice for the previous 2 rows up to the current row
for each CustomerId
https://towardsdatascience.com/a-guide-to-advanced-sql-window-functions-f63f2642cbf9 29/32
8/26/22, 10:01 AM An Easy Guide to Advanced SQL Window Functions | by Julia Kho | Towards Data Science
Challenge Exercise: Create a new column that provides the last Quantity ordered for
each ProductId. (Hint: use the LAST_VALUE() window function and think about the
partitioning)
Great work getting through this article! Window functions are really powerful and a lot
of fun, so I hope you enjoyed going through these exercises. Thank you for reading!
Drop a comment below if you have feedback or questions.
And if you are looking for a cheatsheet, check out this link here.
https://towardsdatascience.com/a-guide-to-advanced-sql-window-functions-f63f2642cbf9 30/32