Episode 2 - Transcription

Power BI material

Intro

Hey guys, and welcome back to the Data Analyst Portfolio Project Series! In this episode, we are diving into the
first technical part of our marketing analysis project. Today, we're going to look at extracting and cleaning
marketing data using SQL.

We'll go through different SQL transformations that will be applied to each table in our data model before we
bring the data into Power BI. We'll break down the SQL queries step by step, explaining each transformation and
its purpose.

If you haven't watched the previous episode, where we introduced the project and discussed the business case,
make sure to check that out first. You can find the link and all the related information in the description.

That being said, we are ready to take a look at the queries and the transformations that we're going to do in this
project.

Product Table
Okay, guys, so here we have some different SQL statements that we're going to walk through. We're going to
break them down, and I'm going to explain as we go. We're going to start on a little bit of the simpler side and then
get a little bit more advanced as we progress through the different statements.

What's important is that there is a database called Portfolio Project Marketing Analytics. You can find the link for
it in the description of this video. You need to restore that backup file onto your SQL Server Express server so
that these tables are available to you locally.

That is what I show in my video on how to set up and create a practice environment. If you haven't done that,
make sure to go back and check that video, which is also linked in the description. So before you can follow along,
make sure that you have this database available on your SQL Server. You can find all of the instructions for that
there.

The first statement that we will walk through is the one on our product data. Now, you'll see in these
statements that there is a top part, and below it I've added a row of stars as a divider.

I will share these SQL queries on my GitHub—again, link in the description. Everything is there. This part above
the stars will not be there; that's just for me to go a little bit back and forth to show you guys the table before and
after. But just follow along.

If you have any questions, ask in the comments, and I think this is going to be quite useful. I've also tried to really
add comments on the different parts for you guys to understand what is going on here.

Let’s just start by looking at the products data. What does the products table look like before anything has been
done to it? So we're going to mark this up and execute this part.

Here you can see we have a Product ID, we have a Product Name, we have a Category, and we have a Price.

Now, right away, this table is fine as is. One thing you're going to see is that I am going to comment out the
Category column, because I think it's a bit redundant when all the products share just one category.

So, I'm going to remove that in our final query. Then there's one more thing that I'm going to do, and that is to
apply a CASE expression to this Price column. I want to make a category for when the price is below a certain
amount, within a certain range, or above it. I want it to go Low, Medium, or High, and we're going to return that
as a new Price Category column.

Now, the threshold values here, I'm going to let you guys decide. Pick whatever you want, but I recommend
that you write this statement out yourself and try to understand what is going on.

What we're doing is a comparison. We open a CASE, and when the first condition is fulfilled, the row is
categorized with this text. When the price is between these bands, it gets this text. Else, it gets this text. Then
we END it and return the result as this column, which becomes Price Category.
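
A minimal sketch of this kind of CASE statement, not the exact query from the video. It runs the SQL against an in-memory SQLite database through Python's sqlite3 module as a stand-in for SQL Server. The table name `products`, the sample rows, and the 50 and 200 thresholds are all assumptions chosen for illustration.

```python
import sqlite3

# In-memory stand-in for the products table. The rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products "
    "(ProductID INTEGER, ProductName TEXT, Category TEXT, Price REAL)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    [
        (1, "Running Shoes", "Sports", 223.75),
        (2, "Yoga Mat", "Sports", 41.26),
        (3, "Dumbbells", "Sports", 150.00),
    ],
)

# The thresholds (50 and 200) are assumptions; pick your own bands.
query = """
SELECT
    ProductID,
    ProductName,
    Price,
    -- Category,  -- left out: every product shares the same category
    CASE
        WHEN Price < 50 THEN 'Low'
        WHEN Price BETWEEN 50 AND 200 THEN 'Medium'
        ELSE 'High'
    END AS PriceCategory
FROM products
ORDER BY ProductID
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The CASE branches are checked top to bottom, so the first condition that matches decides the label and everything that matches none of them falls through to ELSE.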

So, it's fairly simple. You know, this is a simple start. We're going to start a little bit soft with a query here. There
isn't anything else that we're going to do on this table.

If I show you guys, if I run this now—and if you're new in SQL Server Management Studio—the part that I mark
is what is going to be returned in this window, this result window down here.

So, I mark up this query now, execute, and now you can see that we have this new column. We have this Price
Category column, and you can see that there's High, Medium, Low.

The point of doing that is just if a company wants to have an easy way to understand how many products are
selling within a certain category, it's easier to have some buckets than having these values and having to just try
and understand for yourself what is High, Medium, Low.

So, I've just decided to add that category there. You could also select everything with Ctrl + A and execute. Then
you will see both results: the first statement, which is this one, and the bottom statement, which is the result of
this one.

Now, like I said, I'm going to add these SQL statements to my GitHub. They will be the exact same statements,
but without anything that is above the stars. We don't need that part.

This is the statement that will produce this result, and the result of this SQL query is what is going to be used
in our dashboard later.

But I recommend that you guys go in, look at the video, try to write the statement yourself, look at the different
comments that I have added, and try to understand what is going on here.

You can see here the part where I commented out the Category column because I thought it was a little bit
redundant. Now it's back, but let's take it out again.

If you get other ideas of things you want to do, then please go ahead. There's no right or wrong answer here.
What's important is that the steps we apply in this SQL statement make the data more useful for analysis and
better suited for the dashboards later.

So, we're going to add and do some improvements here so that the output becomes better, cleaner, and more
structured in the Power BI dashboard at a later stage.

But this is the first one. This is the one that relates to product information.

Customer Table
The next one then is going to be about customers. Now, this one is a little bit different.

What we're doing here is we have one customers table and we have one geography table. I'm going to mark both of
them, and we'll get two different results.

So, you can see we have a Customer ID, we have a Name, we have an Email, Gender, Age, and Geography ID.
This Geography ID is going to correspond to an ID in the geography table.

What we're doing here is trying to create a combined table.

Like I said, the stars and above are just for me to show you guys. Below is the actual query that we're going to
focus on.

So, we can then just go back to, let's say, if we have customers—that's what that looks like by itself—or we have
geography—that’s what that looks like by itself.

Now, if we go down, you can see this is a join statement, which is going to combine these two tables.

So, we have a SELECT, and we are selecting from the customers table as "C". That alias is what we put in front of
all the columns that come from this table in the statement.

Then we're joining the geography table as "G".

So, you can see customers over here, that's this table, and geography over here, that's that table.

We have a LEFT JOIN on this condition, where the Geography ID from the table aliased as "C" is matched to the
Geography ID in the table aliased as "G", which is geography.

We are then going to SELECT:

• Customer ID
• Customer Name
• Email
• Gender
• Age

...with this alias in front, to tell the SELECT statement which table each column comes from.

Then we're going to add the Country and City from this other table on this condition.
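
A minimal sketch of the join described here, again run on an in-memory SQLite database through Python rather than on SQL Server. The table names `customers` and `geography`, the column names, and every sample row are assumptions made up for illustration. The third customer has no Geography ID, to show what a LEFT JOIN does with an unmatched row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE geography (GeographyID INTEGER, Country TEXT, City TEXT);
CREATE TABLE customers (CustomerID INTEGER, CustomerName TEXT, Email TEXT,
                        Gender TEXT, Age INTEGER, GeographyID INTEGER);

-- Made-up sample rows
INSERT INTO geography VALUES (1, 'UK', 'London'), (2, 'Germany', 'Berlin');
INSERT INTO customers VALUES
    (1, 'Emma Anderson', 'emma@example.com', 'Female', 32, 2),
    (2, 'Liam Novak', 'liam@example.com', 'Male', 45, 1),
    (3, 'Sara Holm', 'sara@example.com', 'Female', 27, NULL);
""")

# "c" and "g" are the aliases; the prefix on each column says which
# table that column comes from.
query = """
SELECT
    c.CustomerID,
    c.CustomerName,
    c.Email,
    c.Gender,
    c.Age,
    g.Country,
    g.City
FROM customers AS c
LEFT JOIN geography AS g
    ON c.GeographyID = g.GeographyID
ORDER BY c.CustomerID
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

All three customers come back, because the LEFT JOIN keeps every row of the left table. The customer without a matching geography row gets NULL (shown as `None` in Python) for Country and City.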

This is a LEFT JOIN. If you want to, you can go in and change it to a RIGHT JOIN.

Actually, let's first look at the result of the LEFT JOIN before I start changing the join criteria.

So, we can mark this part off, execute, and now you can see we have a table which combines customer information
and geography information.

In this dimension, we are doing a join because we don't need two different tables. We can combine them
into one.

Having fewer tables in our data model is more efficient.

It makes sense. It's fine.


This is the combined result that we want to get into Power BI at a later stage when we're going to do some
analysis.

This is a LEFT JOIN, so every row from the left table, which is the first one in the join, customers, is kept, and
these two columns have been added to it.

Just for the sake of it, if I comment this out and I do this as a RIGHT JOIN...

Let me see if I can get this part here...

You can see now it's going to list the cities first because those are the values that come up in the geography table
first, and then it adds whatever matches on that from the customer table.

So, with a RIGHT JOIN, the table on the right side, geography, becomes the one that is kept in full, and
customers is joined onto that.

Then there are INNER JOIN and FULL OUTER JOIN, which are just some other join types.

In this example, it doesn't really make any difference, but I've added those there in case you want to comment
them in and out and look at the results.

But the purpose of this one is for you guys to do a quick practice on selecting from two different tables and
combining them.

And again, from a more practical point of view, why are we doing this?

It's because we don't need two different tables.

We can combine them.

Fewer tables in a data model just makes sense.

Here you can see we have an Age, we have a Country, and a City.

If you want to, you can go in and apply a CASE expression to the Age column to put it into, let's call it, different
age categories like Young, Middle Age, Old, or some other way of putting it.

Or you can group countries: Northern Europe, Western Europe, Eastern Europe, Southern Europe.

I'm just trying to come up with some ideas for you guys.

You could also take the cities.

You could categorize them as Large or Small based on the number of people that live there.

There are different things you can do here to categorize the data or put it into context, so that you can use it
from more analytical points of view.

That is what we're trying to do with the SQL here.

So far, we have worked on two tables, products and customers. They are going to become what we call
Dimension Tables, or lookup tables, in a data model.

If we have, let me see if I can draw this up for you guys...

Say we have a data model.

We'll come back to this later in Power BI, but let's say we have, you know...

Let me find...

I haven't done this forever, that's for sure.

Let's go.

Okay, so if we have Customers and Geography, and then we're going to have some tables down here—excuse my
paint drawing skills—

These are going to be Dimension Tables, and they're going to filter down onto these tables.

They're going to filter the Fact Tables, which we will just get to.

This will be a little more clear when we get into Power BI.

I'm just trying to get it out there already.

We have some tables down here which are Fact Tables, and some tables here which are Lookup Tables or
Dimension Tables.

So far, we have added categories, which is something you often do on Dimension Tables, and we have combined
different dimension tables.

Customer Review Table


Now we're going to go into the tables that hold the values we do calculations on, which we call Fact Tables.

The first one we can look at is a simple example. It is the one that has some information about customer reviews.

Let’s take a look at that one.

You can see we have:

• A Review ID
• The Customer who did the review
• The Product ID to identify what product was reviewed
• The Date
• The Rating
• The Text

Now, the issue with this one is that in some of these reviews, there are double spaces between words instead of
single spaces.

So, I've added a simple REPLACE function that replaces each double space with a single space.
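
A minimal sketch of that REPLACE clean-up, run on an in-memory SQLite database through Python as a stand-in for SQL Server. The table name `customer_reviews`, the column names, and the sample reviews are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_reviews (ReviewID INTEGER, CustomerID INTEGER, "
    "ProductID INTEGER, ReviewDate TEXT, Rating INTEGER, ReviewText TEXT)"
)
# Made-up reviews; both contain a double space.
conn.executemany(
    "INSERT INTO customer_reviews VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 77, 18, "2023-12-23", 3, "Average  experience, nothing special."),
        (2, 80, 19, "2024-12-25", 5, "The quality is  top-notch."),
    ],
)

# REPLACE(text, find, replace_with): swap each double space for a single one.
query = """
SELECT
    ReviewID,
    CustomerID,
    ProductID,
    ReviewDate,
    Rating,
    REPLACE(ReviewText, '  ', ' ') AS ReviewText
FROM customer_reviews
ORDER BY ReviewID
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

One thing to keep in mind: a single REPLACE pass only handles runs of exactly two spaces. A run of three spaces would still leave a double space behind, so data with longer gaps would need the REPLACE nested or repeated.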

That’s the only thing we're going to do on this table for now.

We can run this query, and you can see that the text looks a little bit cleaner.

So, we've done a small fix there.

This is also the table that we're going to work on in Python later.

In Python, we’re going to do something called sentiment analysis.

I'll just very quickly explain what it is.

We will try to determine if the text in the review is positive or negative.

We’re going to combine that with our understanding of the review text and the rating that they gave to see if we
can make it even more accurate or extract some other types of information.

That’s in the next episode, but I thought I’d explain to you guys what we are going to do.

This table is a Fact Table, which means we can calculate metrics like:

• Average Rating

Later, when we add the sentiment analysis, we can create:

• A Rating Sentiment Score
• A Rating Sentiment Category

For now, the rating and the text are something that we want to understand based on the product or the customer
who wrote it.

The customer provides context:

• Who was the customer?
• Which city are they from?
• How old are they?
• What gender?

The product tells us:

• The category of the product
• The product type
• Whether it’s a Low, Medium, or High priced product

These details will then be used to filter the rating to understand how this all comes together.

So, this table becomes one of the Fact Tables in our data model.

Engagement Table
The next Fact Table we’re going to look at is the engagement data table.

Here’s what it looks like before we’ve done anything to it.

We’re going to make some improvements here to address a few issues, including:

1. Different spellings in the Content Type column (e.g., "Blog", "blog", "Video", "video").
2. Inconsistent formatting in the Engagement Date column.
3. Combined Views & Clicks column, which needs to be split into two separate columns.

These are the steps we’ll take to clean up this table:

1. Standardize the content types using the UPPER function to make all values consistent in terms of case.
2. Replace variations like "social media" with a single, standardized version.
3. Format the Engagement Date column for consistency.
4. Split the Views & Clicks column into separate columns for easier analysis.

We’ve also decided to filter out rows where the Content Type is “Newsletter,” as we don’t need those records for
our analysis.
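
A minimal sketch of these clean-up steps, not the query from the video. It runs on an in-memory SQLite database through Python, so the string and date functions (`substr`, `instr`, `strftime`) are SQLite's and differ from what you would write in SQL Server. The table name `engagement_data`, the column names, the `-` separator in the combined column, the `socialmedia` spelling, the target date format, and all sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE engagement_data (EngagementID INTEGER, ContentType TEXT, "
    "ViewsClicksCombined TEXT, EngagementDate TEXT)"
)
# Made-up rows with mixed casing and a combined "views-clicks" column.
conn.executemany(
    "INSERT INTO engagement_data VALUES (?, ?, ?, ?)",
    [
        (1, "Blog", "1883-671", "2023-08-30"),
        (2, "blog", "5280-532", "2023-03-28"),
        (3, "socialmedia", "1905-204", "2023-12-08"),
        (4, "Newsletter", "100-10", "2023-01-01"),
        (5, "VIDEO", "2766-257", "2023-07-21"),
    ],
)

query = """
SELECT
    EngagementID,
    -- Upper-case first, then fix the one-word spelling
    REPLACE(UPPER(ContentType), 'SOCIALMEDIA', 'SOCIAL MEDIA') AS ContentType,
    -- Everything before the '-' is views, everything after is clicks
    CAST(substr(ViewsClicksCombined, 1,
                instr(ViewsClicksCombined, '-') - 1) AS INTEGER) AS Views,
    CAST(substr(ViewsClicksCombined,
                instr(ViewsClicksCombined, '-') + 1) AS INTEGER) AS Clicks,
    -- One consistent date format (dd.mm.yyyy chosen for illustration)
    strftime('%d.%m.%Y', EngagementDate) AS EngagementDate
FROM engagement_data
WHERE UPPER(ContentType) <> 'NEWSLETTER'
ORDER BY EngagementID
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The filter compares the upper-cased value, so a row spelled "newsletter" or "NEWSLETTER" would be dropped as well.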

Let’s run the query.

After execution, this is what the cleaned table looks like:

• IDs are listed first for better clarity.
• Content types are standardized.
• Views and clicks are now in separate columns.
• Dates are formatted properly.

This cleaned data is now ready to be brought into Power BI for further analysis.

We’ve introduced a few more SQL concepts here, such as:

• Using UPPER to standardize text
• Splitting columns with string functions
• Filtering rows with specific conditions

If you want to see the query, like I mentioned earlier, check the description for links to my GitHub.

Customer Journey Table


The last table we’ll look at is the Customer Journey table.

This one is a little more advanced, so I’ll try to take you through it step by step.

If we start by just looking at the Customer Journey table by itself, it shows:

• Journey ID
• Customer ID
• Product ID
• The date they visited
• Different stages
• An action they performed

This table allows us to analyze customer journeys in a funnel.

For example:

• Did they view a product?
• Did they click on it?
• Did they buy something?

One issue with this table is the Duration column, which has missing values in some places.

Another issue is that there are duplicate rows—some rows have exactly the same information.

To fix these issues, we’ll do the following:

1. Use a subquery to calculate the average duration for each visit date and use that to fill in missing values.
2. Identify and remove duplicate rows using the ROW_NUMBER function.

I’ll show you the query to identify duplicates first.

This query uses a Common Table Expression (CTE) to assign a row number to each record based on its
combination of fields, such as:

• Customer ID
• Product ID
• Visit Date
• Stage
• Action

Each unique combination forms its own partition, and the ROW_NUMBER function counts the rows inside it, so a
combination that appears more than once ends up with more than one row number.

For example, the first occurrence gets a row number of 1, the second occurrence gets 2, and so on.

Let’s execute this query to verify duplicate records.

Here, you can see rows where the ROW_NUMBER is greater than 1—these are the duplicates.

Now that we’ve identified them, we can remove them by keeping only the rows with a ROW_NUMBER equal to 1.
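
A minimal sketch of the ROW_NUMBER approach, run on an in-memory SQLite database through Python (window functions need SQLite 3.25 or newer). The table name `customer_journey`, the column names, the choice to order by Journey ID inside each partition, and the sample rows are assumptions. Journey ID 2 is entered twice on purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_journey (JourneyID INTEGER, CustomerID INTEGER,
    ProductID INTEGER, VisitDate TEXT, Stage TEXT, Action TEXT, Duration REAL);

-- Made-up rows; JourneyID 2 appears twice with identical values
INSERT INTO customer_journey VALUES
    (1, 10, 5, '2024-01-05', 'Homepage',    'View',     120.0),
    (2, 10, 5, '2024-01-05', 'ProductPage', 'Click',     60.0),
    (2, 10, 5, '2024-01-05', 'ProductPage', 'Click',     60.0),
    (3, 11, 7, '2024-01-06', 'Checkout',    'Purchase', 200.0);
""")

# The CTE numbers the rows inside each group of identical field values.
numbered_cte = """
WITH numbered AS (
    SELECT
        JourneyID, CustomerID, ProductID, VisitDate, Stage, Action, Duration,
        ROW_NUMBER() OVER (
            PARTITION BY CustomerID, ProductID, VisitDate, Stage, Action
            ORDER BY JourneyID
        ) AS row_num
    FROM customer_journey
)
"""

# Step 1: look at the duplicates (row number above 1).
duplicates = conn.execute(
    numbered_cte + "SELECT JourneyID, row_num FROM numbered WHERE row_num > 1"
).fetchall()

# Step 2: keep only the first row of every group.
deduped = conn.execute(
    numbered_cte
    + "SELECT JourneyID, CustomerID, ProductID, VisitDate, Stage, Action, Duration "
    + "FROM numbered WHERE row_num = 1 ORDER BY JourneyID"
).fetchall()

print(duplicates)
for row in deduped:
    print(row)
```

Nothing is deleted from the source table here. The duplicates are simply left out of the result by the `row_num = 1` filter.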

Next, we’ll handle the missing Duration values.

We’ll use another subquery to calculate the average duration for each visit date.

If a Duration value is null, we’ll replace it with the average duration for that specific date.
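
A minimal sketch of filling the gaps, run on an in-memory SQLite database through Python. It uses COALESCE with a correlated subquery, which is one way to do this and may not be how the video's query is structured. The table name, column names, and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_journey (JourneyID INTEGER, VisitDate TEXT,
                               Stage TEXT, Duration REAL);

-- Made-up rows; two of them have no Duration
INSERT INTO customer_journey VALUES
    (1, '2024-01-05', 'Homepage',    100.0),
    (2, '2024-01-05', 'ProductPage', 200.0),
    (3, '2024-01-05', 'Checkout',    NULL),
    (4, '2024-01-06', 'Homepage',     50.0),
    (5, '2024-01-06', 'ProductPage', NULL);
""")

# COALESCE returns the first non-NULL argument: the row's own Duration
# if it has one, otherwise the average for the same visit date.
query = """
SELECT
    cj.JourneyID,
    cj.VisitDate,
    cj.Stage,
    COALESCE(
        cj.Duration,
        (SELECT AVG(d.Duration)
         FROM customer_journey AS d
         WHERE d.VisitDate = cj.VisitDate)
    ) AS Duration
FROM customer_journey AS cj
ORDER BY cj.JourneyID
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

AVG skips NULL values, so the average for a date is based only on the rows that actually have a duration. If every row on a date were missing its duration, the result for that date would stay NULL.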

When we execute the final query, the cleaned table will:

1. Have no duplicate rows.
2. Have missing durations replaced with average values.

After running the final query for the Customer Journey table, here’s what the cleaned data looks like:

• The Duration column now has no null values; all missing durations have been replaced with averages for
their respective visit dates.
• Duplicate rows have been removed; only the first occurrence of each unique combination remains.

To verify, let’s look at a specific Journey ID that previously had duplicates. For example, Journey ID 23 initially
had two rows. Now, it has only one row in the cleaned table.

The query has successfully eliminated duplicates and replaced missing Duration values.

Combining Results
This cleaned table will now be brought into Power BI, where it will serve as one of our Fact Tables.

The advanced concepts we’ve covered here include:

1. Using the ROW_NUMBER function to identify and remove duplicates.
2. Implementing a subquery to calculate averages and fill missing values.
3. Leveraging Common Table Expressions (CTEs) for temporary result sets to aid in intermediate
calculations.

If this process feels complex, I recommend revisiting the query and breaking it down step by step. Pay attention to:

• How the ROW_NUMBER function partitions and orders data.
• How the subquery calculates averages based on a specific column.

These are essential SQL techniques for cleaning and preparing data for analysis.

Data Model Overview


At this stage, we’ve worked on several tables:

1. Dimension Tables:
   o Product Table
   o Customer Table (combined with Geography)
2. Fact Tables:
   o Customer Reviews Table
   o Engagement Table
   o Customer Journey Table

In Power BI, these Dimension Tables will serve as filters for the Fact Tables.

For example:

• The Product Table can filter customer reviews by product type or price category.
• The Customer Table can filter engagement or journey data by customer demographics or location.

Next Episode
That’s it for this episode on extracting and cleaning marketing data using SQL.

You’ve learned how to:

1. Connect to a database in SQL Server Management Studio.
2. Extract relevant data.
3. Apply cleaning techniques to prepare data for analysis in Power BI.

In the next episode, we’ll enhance our data by performing sentiment analysis on customer reviews using Python.
This will add valuable insights to our marketing analytics project.

Make sure to like, subscribe, and hit the notification bell so you don’t miss any updates in this series.

If you have any questions or thoughts, leave a comment below. I’d love to hear from you.

Conclusion
Thank you for watching, and I’ll see you in the next episode!
