Episode 2 - Transcription
Hey guys, and welcome back to the Data Analyst Portfolio Project Series! In this episode, we are diving into the
first technical part of our marketing analysis project. Today, we're going to look at extracting and cleaning
marketing data using SQL.
We'll go through different SQL transformations that will be applied to each table in our data model before we
bring the data into Power BI. We'll break down the SQL queries step by step, explaining each transformation and
its purpose.
If you haven't watched the previous episode where we introduced the project and discussed the business case,
make sure to check that out first. You can find the link, along with everything else you need, in the description.
That being said, we are ready to take a look at the queries and the transformations that we're going to do in this
project.
Product Table
Okay, guys, so here we have some different SQL statements that we're going to walk through. We're going to
break them down, and I'm going to explain as we go. We're going to start on a little bit of the simpler side and then
get a little bit more advanced as we progress through the different statements.
What's important is that there is a database called Portfolio Project Marketing Analytics. You can find the link for
that in the description of this video. You need to restore that backup file onto this SQL Server Express server that
you have so that you can get these tables available for you locally.
That is what I show in my video on how to set up and create a practice environment. If you haven't done that,
make sure to go back and check that video; it's also linked in the description. So before you can follow along,
you need to make sure that you have this database restored on your SQL Server, and you can find all of the
instructions to do that there.
The first statement that we will walk through is the one that is on our product data. Now, you'll see in these
statements that there is a top part, and I've added something like a divider.
I will share these SQL queries on my GitHub—again, link in the description. Everything is there. This part above
the stars will not be there; that's just for me to go a little bit back and forth to show you guys the table before and
after. But just follow along.
If you have any questions, ask in the comments, and I think this is going to be quite useful. I've also tried to really
add comments on the different parts for you guys to understand what is going on here.
Let’s just start by looking at the products data. What does the products table look like before anything has been
done to it? So we're going to mark this up and execute this part.
Here you can see we have a Product ID, we have a Product Name, we have a Category, and we have a Price.
Now, right away, this table is fine as is. One thing you're going to see is that I am going to comment out the
Category column because I think it's a bit redundant when we have just products and all the categories are just
one.
So, I'm going to remove that in our final query. Then there's one more thing I'm going to do: I'm going to apply a
CASE expression to this Price column. I want to bucket each price by whether it falls below, within, or above
certain amounts, labeled Low, Medium, or High, and return that as a new Price Category column.
Now, the threshold values here are up to you guys, but I recommend that you just write this statement out and
try to understand what is going on.
What we're doing is a comparison. We're writing a CASE: when a condition is fulfilled, categorize the row as this
text; when the price is between these bands, this text; else, this text. Then we END it and return it as this
column, which becomes Price Category.
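A minimal sketch of that CASE expression follows. It is not the query from the video: it runs the SQL through Python's built-in sqlite3 module so it works without the restored database, and the table name, column names, sample rows, and the 50 and 200 thresholds are all assumptions chosen for illustration. The CASE syntax itself is the same in SQL Server.

```python
import sqlite3

# In-memory stand-in for the products table; names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (ProductID INTEGER, ProductName TEXT, Category TEXT, Price REAL)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    [
        (1, "Running Shoes", "Sports", 223.75),
        (2, "Fitness Tracker", "Sports", 196.68),
        (3, "Yoga Mat", "Sports", 41.26),
    ],
)

query = """
SELECT
    ProductID,
    ProductName,
    Price,
    -- Category is left out because every product shares one category
    CASE
        WHEN Price < 50 THEN 'Low'                   -- assumed lower threshold
        WHEN Price BETWEEN 50 AND 200 THEN 'Medium'  -- assumed middle band
        ELSE 'High'                                  -- everything above 200
    END AS PriceCategory
FROM products
ORDER BY ProductID
"""

price_rows = conn.execute(query).fetchall()
for row in price_rows:
    print(row)
```

Changing the two threshold values and rerunning is a quick way to see how the buckets shift.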
So, it's fairly simple. You know, this is a simple start. We're going to start a little bit soft with a query here. There
isn't anything else that we're going to do on this table.
Let me run this now. If you're new to SQL Server Management Studio: only the part that I mark is executed, and
its result is what gets returned in this result window down here.
So, I mark up this query now, execute, and now you can see that we have this new column. We have this Price
Category column, and you can see that there's High, Medium, Low.
The point of doing that is just if a company wants to have an easy way to understand how many products are
selling within a certain category, it's easier to have some buckets than having these values and having to just try
and understand for yourself what is High, Medium, Low.
So, I've just decided to add that category there. You could also just press Ctrl + A to select everything and
execute. Then you will see both results: the first statement, which is this one, and the bottom statement, which
is the result of this one.
Now, like I said, on my GitHub, I'm going to add these SQL statements there. It will be the exact same statements,
but it will be without anything that is above this star. The stars, we don't need that.
This is the statement that will produce this result, and this is the statement—or the result of this SQL query—is
what is going to be used in our dashboard later.
But I recommend that you guys go in, look at the video, try to write the statement yourself, look at the different
comments that I have added, and try to understand what is going on here.
You can see here, it's the part where I commented out the category because I thought that was a little bit
redundant. Now it's back, but let's take it out again.
If you get other ideas of things you want to do, then please go ahead. There's no right or wrong answer here.
What's important is that the steps we are applying in this SQL statement are going to make our data more useful
for analysis in the dashboards later.
So, we're going to add and do some improvements here so that the output becomes better, cleaner, and more
structured in the Power BI dashboard at a later stage.
But this is the first one. This is the one that relates to product information.
Customer Table
The next one then is going to be about customers. Now, this one is a little bit different.
What we're doing here is we have one customers table and we have one geography table. I'm going to mark both of
them, and we'll get two different results.
So, you can see we have a Customer ID, we have a Name, we have an Email, Gender, Age, and Geography ID.
This Geography ID is going to correspond to an ID in the geography table.
Like I said, the stars and above are just for me to show you guys. Below is the actual query that we're going to
focus on.
So, we can then just go back to, let's say, if we have customers—that's what that looks like by itself—or we have
geography—that’s what that looks like by itself.
Now, if we go down, you can see this is a join, a statement which is going to combine these two tables.
So, we have a SELECT and we are selecting from the customer table as "C", where we're going to alias all the
columns that come from this table in the statement.
So, you can see customers over here—that's this table—and geography over here—that's that table.
We have a LEFT JOIN on this condition where Geography ID from the table, which has been aliased with "C", is
matched onto the Geography ID on the table, which has been aliased as "G", which is Geography.
From the customers table we select Customer ID, Customer Name, Email, Gender, and Age, each prefixed with this
alias to tell the SELECT statement which table the column comes from.
Then we're going to add the Country and City from this other table on this condition.
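Here is a small sketch of that LEFT JOIN, run through Python's built-in sqlite3 module instead of SQL Server so it is self-contained. The table names, column names, and the two sample customers are assumptions for illustration; the join syntax is the same in SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE geography (GeographyID INTEGER, Country TEXT, City TEXT);
CREATE TABLE customers (CustomerID INTEGER, CustomerName TEXT, Email TEXT,
                        Gender TEXT, Age INTEGER, GeographyID INTEGER);
-- Hypothetical sample rows
INSERT INTO geography VALUES (1, 'UK', 'London'), (2, 'Germany', 'Berlin');
INSERT INTO customers VALUES
    (1, 'Customer A', 'a@example.com', 'Female', 50, 2),
    (2, 'Customer B', 'b@example.com', 'Male', 37, 1);
""")

query = """
SELECT
    c.CustomerID, c.CustomerName, c.Email, c.Gender, c.Age,  -- from customers (alias c)
    g.Country, g.City                                        -- from geography (alias g)
FROM customers AS c
LEFT JOIN geography AS g
    ON c.GeographyID = g.GeographyID  -- match each customer to its geography row
ORDER BY c.CustomerID
"""

joined = conn.execute(query).fetchall()
for row in joined:
    print(row)
```

Each customer row comes back once, with Country and City appended from the matching geography row.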
This is a LEFT JOIN. If you want to, you can go in and change it to a RIGHT JOIN.
Actually, let's first look at the result of the LEFT JOIN before I start changing the join criteria.
So, we can mark this part off, execute, and now you can see we have a table which combines customer information
and geography information.
In this dimension, we are doing a join, and it's because we don't need two different tables. We can combine them
into one.
This is a LEFT JOIN, so these two columns have been added to the left table, which is the first one in the join,
which is customers.
Just for the sake of it, if I comment this out and do it as a RIGHT JOIN instead, you can see it now lists the
cities first, because those are the values that come up first in the geography table, and then it adds whatever
matches from the customer table.
So, the table on the right side of the join becomes the base table, and the customer rows are joined onto it.
Then there's an INNER and FULL OUTER JOIN—just some other types.
In this example, it doesn't really make any difference, but I've added those there in case you want to comment in
and out and just look at it.
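To see when the join type does matter, here is a sketch with made-up data in which one customer points at a Geography ID that does not exist. It uses Python's built-in sqlite3 module, and the table and column names are assumptions. Only LEFT and INNER are compared, because older SQLite builds do not support RIGHT or FULL OUTER JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE geography (GeographyID INTEGER, Country TEXT, City TEXT);
CREATE TABLE customers (CustomerID INTEGER, CustomerName TEXT, GeographyID INTEGER);
INSERT INTO geography VALUES (1, 'UK', 'London'), (2, 'Germany', 'Berlin');
-- Customer 3 has no matching geography row (invented for this demo)
INSERT INTO customers VALUES
    (1, 'Customer A', 1), (2, 'Customer B', 2), (3, 'Customer C', 99);
""")

def run_join(join_type):
    # join_type only ever comes from the fixed strings passed below
    sql = f"""
        SELECT c.CustomerID, g.City
        FROM customers AS c
        {join_type} JOIN geography AS g ON c.GeographyID = g.GeographyID
        ORDER BY c.CustomerID
    """
    return conn.execute(sql).fetchall()

left_rows = run_join("LEFT")    # keeps every customer, City is NULL when unmatched
inner_rows = run_join("INNER")  # keeps only customers with a matching geography row

print(left_rows)
print(inner_rows)
```

With clean data where every customer has a valid Geography ID, both queries return the same rows, which is why the choice makes no visible difference in the episode's example.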
But the purpose of this one is for you guys to just do a quick practice on selecting two different tables, combining
them.
And again, from a more sensible point of view, why are we doing this?
Here you can see we have an Age, we have a Country, and a City.
If you want to, you can go in and case the Age column into, let's call it, different categories of age like Young,
Middle Age, Old, or some other ways of putting it.
Or you can do countries—Northern Europe, Western Europe, Eastern Europe, Southern Europe.
If you want to, I'm just trying to come up with some ideas for you guys.
You could categorize them into Large or Small based on the number of people that live there.
There are different things you can do here to categorize the data or put it into a context that makes it usable
from a different analytical point of view.
That is what we're trying to do with the SQL here.
So far, we have two tables that are going to become what we call Dimension Tables, or lookup tables, in a data
model.
We'll come back to this later in Power BI, but let me sketch it quickly. Say we have Customers and Geography up
here, and then we're going to have some tables down here (excuse my paint drawing skills).
These are going to be Dimension Tables, and they're going to filter down onto these tables.
They're going to filter the Fact Tables, which we will just get to.
This will be a little more clear when we get into Power BI.
We have some tables down here which are Fact Tables, and some tables here which are Lookup Tables or
Dimension Tables.
So far, we have done category changes, and often you do that on Dimension Tables, or we've combined different
dimension tables.
Customer Reviews Table
The first Fact Table we can look at is a simple example. It is the one that has some information about customer
reviews:
A Review ID
The Customer who did the review
The Product ID to identify what product was reviewed
The Date
The Rating
The Text
Now, the issue with this one is that some of these reviews contain double spaces instead of single spaces.
So, I've added a simple REPLACE function to replace each double space with a single space.
That’s the only thing we're going to do on this table for now.
We can run this query, and you can see that the text looks a little bit cleaner.
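A quick sketch of that REPLACE call, again run through Python's built-in sqlite3 module rather than SQL Server. The table name, column names, and review texts are assumptions for illustration; REPLACE takes the same three arguments in SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_reviews (ReviewID INTEGER, CustomerID INTEGER, "
    "ProductID INTEGER, ReviewDate TEXT, Rating INTEGER, ReviewText TEXT)"
)
conn.executemany(
    "INSERT INTO customer_reviews VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 77, 18, "2023-12-23", 3, "Average  experience,  nothing special."),
        (2, 80, 19, "2024-12-25", 5, "The quality is top-notch."),
    ],
)

query = """
SELECT
    ReviewID,
    Rating,
    REPLACE(ReviewText, '  ', ' ') AS ReviewText  -- double space -> single space
FROM customer_reviews
ORDER BY ReviewID
"""

cleaned_reviews = conn.execute(query).fetchall()
for row in cleaned_reviews:
    print(row)
```

A single REPLACE pass handles double spaces; a run of three or more spaces would need the call to be nested or repeated.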
This is also the table that we're going to work on in Python later.
We’re going to combine that with our understanding of the review text and the rating that they gave to see if we
can make it even more accurate or extract some other types of information.
That’s in the next episode, but I thought I’d explain to you guys what we are going to do.
This table is a Fact Table, which means we can calculate metrics like:
Average Rating
For now, the rating and the text are something that we want to understand based on the product that was reviewed
or the customer who wrote the review.
The product and customer details will then be used to filter the ratings to understand how this all comes together.
So, this table becomes one of the Fact Tables in our data model.
Engagement Table
The next Fact Table we’re going to look at is the engagement data table.
We’re going to make some improvements here to address a few issues, including:
1. Different spellings in the Content Type column (e.g., "Blog", "blog", "Video", "video").
2. Inconsistent formatting in the Engagement Date column.
3. Combined Views & Clicks column, which needs to be split into two separate columns.
To fix these, the query does the following:
1. Standardize the content types using the UPPER function to make all values consistent in terms of case.
2. Replace variations like "social media" with a single, standardized version.
3. Format the Engagement Date column for consistency.
4. Split the Views & Clicks column into separate columns for easier analysis.
We’ve also decided to filter out rows where the Content Type is “Newsletter,” as we don’t need those records for
our analysis.
This cleaned data is now ready to be brought into Power BI for further analysis.
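The four steps and the Newsletter filter can be sketched as below. This is an illustration under several assumptions, not the query from the video: the table and column names are invented, the misspelled variant is assumed to be "Socialmedia", the combined column is assumed to use a hyphen between views and clicks, and the dates are assumed to be stored as ISO text. It also runs on SQLite through Python's sqlite3 module, so it uses strftime, substr, and instr; in SQL Server the equivalent steps would use functions such as FORMAT, LEFT, RIGHT, and CHARINDEX.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE engagement_data (EngagementID INTEGER, ContentType TEXT, "
    "EngagementDate TEXT, ViewsClicksCombined TEXT)"
)
conn.executemany(
    "INSERT INTO engagement_data VALUES (?, ?, ?, ?)",
    [
        (1, "Blog", "2024-08-30", "1883-671"),
        (2, "blog", "2024-03-28", "1589-332"),
        (3, "Socialmedia", "2024-01-15", "500-20"),
        (4, "Newsletter", "2024-02-01", "300-12"),
        (5, "video", "2024-05-09", "2100-150"),
    ],
)

query = """
SELECT
    EngagementID,
    -- Steps 1 and 2: one casing, one spelling
    REPLACE(UPPER(ContentType), 'SOCIALMEDIA', 'SOCIAL MEDIA') AS ContentType,
    -- Step 3: one date format (dd.mm.yyyy chosen here as an assumption)
    strftime('%d.%m.%Y', EngagementDate) AS EngagementDate,
    -- Step 4: text before the hyphen is views, text after it is clicks
    CAST(substr(ViewsClicksCombined, 1,
                instr(ViewsClicksCombined, '-') - 1) AS INTEGER) AS Views,
    CAST(substr(ViewsClicksCombined,
                instr(ViewsClicksCombined, '-') + 1) AS INTEGER) AS Clicks
FROM engagement_data
WHERE UPPER(ContentType) <> 'NEWSLETTER'  -- drop newsletter rows
ORDER BY EngagementID
"""

engagement_rows = conn.execute(query).fetchall()
for row in engagement_rows:
    print(row)
```

Comparing against UPPER(ContentType) in the WHERE clause means differently cased spellings of "Newsletter" are filtered out as well.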
Customer Journey Table
The last Fact Table is the Customer Journey table. This one is a little more advanced, so I'll try to take you
through it step by step.
The table contains:
Journey ID
Customer ID
Product ID
The date they visited
Different stages
An action they performed
One issue with this table is the Duration column, which has missing values in some places.
Another issue is that there are duplicate rows—some rows have exactly the same information.
To fix these issues, the query does two things:
1. Use a subquery to calculate the average duration for each visit date and use that to fill in missing values.
2. Identify and remove duplicate rows using the ROW_NUMBER function.
This query uses a Common Table Expression (CTE) to assign a row number to each record based on its unique
combination of fields, such as:
Customer ID
Product ID
Visit Date
Stage
Action
If the same combination appears more than once, the ROW_NUMBER function numbers the matching rows in sequence:
the first occurrence gets a row number of 1, the second occurrence gets 2, and so on.
Here, you can see rows where the ROW_NUMBER is greater than 1—these are the duplicates.
Now that we’ve identified them, we can remove them by keeping only the rows with a ROW_NUMBER equal to 1.
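A minimal sketch of the ROW_NUMBER approach is shown below, run through Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer). The table name, column names, CTE name, and sample rows are assumptions for illustration; the WITH and ROW_NUMBER() OVER (PARTITION BY ...) syntax is the same in SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_journey (JourneyID INTEGER, CustomerID INTEGER, "
    "ProductID INTEGER, VisitDate TEXT, Stage TEXT, Action TEXT, Duration REAL)"
)
conn.executemany(
    "INSERT INTO customer_journey VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (22, 10, 5, "2024-01-10", "Homepage", "View", 120),
        (23, 11, 7, "2024-01-11", "Checkout", "Drop-off", 90),
        (23, 11, 7, "2024-01-11", "Checkout", "Drop-off", 90),  # exact duplicate
    ],
)

# CTE that numbers rows within each combination of the identifying fields
numbered_cte = """
WITH NumberedRows AS (
    SELECT
        JourneyID, CustomerID, ProductID, VisitDate, Stage, Action, Duration,
        ROW_NUMBER() OVER (
            PARTITION BY CustomerID, ProductID, VisitDate, Stage, Action
            ORDER BY JourneyID
        ) AS row_num
    FROM customer_journey
)
"""

# Rows numbered above 1 are the duplicates
duplicates = conn.execute(
    numbered_cte + "SELECT JourneyID, row_num FROM NumberedRows WHERE row_num > 1"
).fetchall()

# Keeping only row_num = 1 leaves one row per combination
deduped = conn.execute(
    numbered_cte
    + "SELECT JourneyID, CustomerID, ProductID, VisitDate, Stage, Action, Duration "
    "FROM NumberedRows WHERE row_num = 1 ORDER BY JourneyID"
).fetchall()

print(duplicates)
print(deduped)
```

Running the duplicates query first, before filtering, is a simple check that the partition columns really do identify the repeated rows.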
We’ll use another subquery to calculate the average duration for each visit date.
If a Duration value is null, we’ll replace it with the average duration for that specific date.
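The fill-in step can be sketched like this, using Python's built-in sqlite3 module. The table name, column names, and sample durations are assumptions; the sketch computes the per-date average with a window function inside a subquery and applies it with COALESCE, both of which SQL Server supports with the same syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_journey (JourneyID INTEGER, VisitDate TEXT, Duration REAL)"
)
conn.executemany(
    "INSERT INTO customer_journey VALUES (?, ?, ?)",
    [
        (1, "2024-01-10", 100),
        (2, "2024-01-10", 200),
        (3, "2024-01-10", None),  # missing duration to be filled
        (4, "2024-01-11", 60),
    ],
)

query = """
SELECT
    JourneyID,
    VisitDate,
    COALESCE(Duration, avg_duration) AS Duration  -- use the average only when NULL
FROM (
    SELECT
        JourneyID,
        VisitDate,
        Duration,
        -- AVG skips NULLs, so this is the mean of the known durations per date
        AVG(Duration) OVER (PARTITION BY VisitDate) AS avg_duration
    FROM customer_journey
) AS with_avg
ORDER BY JourneyID
"""

filled = conn.execute(query).fetchall()
for row in filled:
    print(row)
```

If every duration on a given date is missing, the average for that date is also NULL and the value stays empty, so that case would need its own fallback.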
After running the final query for the Customer Journey table, here’s what the cleaned data looks like:
The Duration column now has no null values; all missing durations have been replaced with averages for
their respective visit dates.
Duplicate rows have been removed; only the first occurrence of each unique combination remains.
To verify, let’s look at a specific Journey ID that previously had duplicates. For example, Journey ID 23 initially
had two rows. Now, it has only one row in the cleaned table.
The query has successfully eliminated duplicates and replaced missing Duration values.
This cleaned table will now be brought into Power BI, where it will serve as one of our Fact Tables.
Combining Results
These are essential SQL techniques for cleaning and preparing data for analysis. To recap, we now have:
1. Dimension Tables:
   - Product Table
   - Customer Table (combined with Geography)
2. Fact Tables:
   - Customer Reviews Table
   - Engagement Table
   - Customer Journey Table
In Power BI, these Dimension Tables will serve as filters for the Fact Tables.
For example:
The Product Table can filter customer reviews by product type or price category.
The Customer Table can filter engagement or journey data by customer demographics or location.
Next Episode
That’s it for this episode on extracting and cleaning marketing data using SQL.
In the next episode, we’ll enhance our data by performing sentiment analysis on customer reviews using Python.
This will add valuable insights to our marketing analytics project.
Make sure to like, subscribe, and hit the notification bell so you don’t miss any updates in this series.
If you have any questions or thoughts, leave a comment below. I'd love to hear from you.
Conclusion
Thank you for watching, and I’ll see you in the next episode!