Let’s start, right after the coffee I need to go get downstairs.

Still to do: the document transcript (GPT) of the project presentation and an introduction.

This is a project that will use different tools and techniques related to Marketing Analytics.
We’ll be using SQL, Python and Power BI.
First we’ll use SQL to query the database.
(We’ll then see if we can do the same using Power Query) – For the moment we’ll stick to the
tutorial
I used the database “PortfolioProject_MarketingAnalytics” with the following tables:
 dbo.customer_journey
 dbo.customer_reviews
 dbo.customers
 dbo.engagement_data
 dbo.geography
 dbo.products

Let’s start with products


-- SQL Query to categorize products based on their price

SELECT
ProductID, -- Selects the unique identifier for each product
ProductName, -- Selects the name of each product
Price, -- Selects the price of each product
-- Category, -- Selects the product category for each product

CASE -- Categorizes the products into price categories: Low, Medium, or High
WHEN Price < 50 THEN 'Low' -- If the price is less than 50, categorize as 'Low'
WHEN Price BETWEEN 50 AND 200 THEN 'Medium' -- If the price is between 50 and 200
(inclusive), categorize as 'Medium'
ELSE 'High' -- If the price is greater than 200, categorize as 'High'
END AS PriceCategory -- Names the new column as PriceCategory

FROM
dbo.products; -- Specifies the source table from which to select the data

We’ll be dropping the Product Category column as it has only one category – Sports.
We’ll be labeling the products by their price using CASE. The thresholds are the ones we see in
the code (I’d like to see whether there’s a metric, such as the average +/- one standard deviation,
that could be used to label products instead – see the sketch below). This new column may be of
good use, so we’ve added it.
There are other transformations and columns we could add, but we’ll leave it like this. It’s a matter
of thinking it through, and if we feel something makes sense, we just create it.
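
A minimal sketch of a data-driven alternative, assuming we want the thresholds to come from the data itself rather than from fixed values: compute the mean and standard deviation of Price and label anything more than one standard deviation from the mean as 'Low' or 'High'. The one-standard-deviation cut-off is an assumption for illustration, not part of the tutorial.

-- Hypothetical sketch: price categories derived from the mean and standard deviation of Price
WITH PriceStats AS (
    SELECT
        AVG(Price) AS AvgPrice,   -- Average price across all products
        STDEV(Price) AS StdPrice  -- Standard deviation of product prices
    FROM dbo.products
)
SELECT
    p.ProductID,
    p.ProductName,
    p.Price,
    CASE
        WHEN p.Price < s.AvgPrice - s.StdPrice THEN 'Low'   -- More than one std dev below the mean
        WHEN p.Price > s.AvgPrice + s.StdPrice THEN 'High'  -- More than one std dev above the mean
        ELSE 'Medium'                                       -- Within one std dev of the mean
    END AS PriceCategory
FROM dbo.products AS p
CROSS JOIN PriceStats AS s;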
Let’s move forward to Customers

The Customers table has a GeographyID that relates to the geography table, so we can enrich the
Customers table with this information. Power BI has great features to visualize geographical
data, so we’ll be joining customers to geography to get the location fields (Country and City) for
the customers.
As we’ll be bringing the geographical information into the table, we’ll drop GeographyID from
customers.
-- SQL statement to join customers with geography to enrich customer data with geographic information

SELECT
c.CustomerID, -- Selects the unique identifier for each customer
c.CustomerName, -- Selects the name of each customer
c.Email, -- Selects the email of each customer
c.Gender, -- Selects the gender of each customer
c.Age, -- Selects the age of each customer
g.Country, -- Selects the country from the geography table to enrich customer data
g.City -- Selects the city from the geography table to enrich customer data
FROM
dbo.customers as c -- Specifies the alias 'c' for the customers table
LEFT JOIN
-- RIGHT JOIN
-- INNER JOIN
-- FULL OUTER JOIN
dbo.geography g -- Specifies the alias 'g' for the geography table
ON
c.GeographyID = g.GeographyID; -- Joins the two tables on the GeographyID field to
match customers with their geographic information

Again, there are other transformations and derived columns we could consider that might make
sense, like binning ages, but we won’t be doing that here. We could also create those directly inside
Power BI (best practice says that, for performance gains, new columns should be created in the
source system whenever that’s possible). A quick sketch of what age binning could look like is shown below.
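
A minimal sketch of age binning in SQL; the bin boundaries below are an assumption for illustration, not something defined in the project.

-- Hypothetical sketch: bin customer ages into groups with CASE
SELECT
    CustomerID,
    Age,
    CASE
        WHEN Age < 25 THEN 'Under 25'
        WHEN Age BETWEEN 25 AND 39 THEN '25-39'
        WHEN Age BETWEEN 40 AND 59 THEN '40-59'
        ELSE '60+'
    END AS AgeGroup -- Derived column that could be added to the customer query above
FROM dbo.customers;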

Now we’ll move to fact tables


This first one, customer_reviews, holds the customer reviews. Here we identified double spaces
between words, and we’re replacing them with a single space using REPLACE.
-- Query to clean whitespace issues in the ReviewText column

SELECT
ReviewID, -- Selects the unique identifier for each review
CustomerID, -- Selects the unique identifier for each customer
ProductID, -- Selects the unique identifier for each product
ReviewDate, -- Selects the date when the review was written
Rating, -- Selects the numerical rating given by the customer (e.g., 1 to 5 stars)
-- Cleans up the ReviewText by replacing double spaces with single spaces to ensure the
text is more readable and standardized
REPLACE(ReviewText, '  ', ' ') AS ReviewText
FROM
dbo.customer_reviews; -- Specifies the source table from which to select the data

We’ll be using this table later in Python for our sentiment analysis (trying to determine whether the
text of the review indicates a positive or negative feeling).
The next table we’ll be addressing is engagement_data. This is what it looks like before doing
anything to it.

In this table we’ll be doing some different things.


 From the get-go we can see the ContentType field has different spellings for the same value. We’ll
be addressing that
 We’re going to change EngagementDate
 We’re going to restructure the Campaign and Product IDs
 We’re going to split ViewsClickCombined
 We’re filtering out content type Newsletter

-- Query to clean and normalize the engagement_data table

SELECT
EngagementID, -- Selects the unique identifier for each engagement record
ContentID, -- Selects the unique identifier for each piece of content
CampaignID, -- Selects the unique identifier for each marketing campaign
ProductID, -- Selects the unique identifier for each product
UPPER(REPLACE(ContentType, 'Socialmedia', 'Social Media')) AS ContentType, -- Replaces
"Socialmedia" with "Social Media" and then converts all ContentType values to uppercase
LEFT(ViewsClicksCombined, CHARINDEX('-', ViewsClicksCombined) - 1) AS Views, --
Extracts the Views part from the ViewsClicksCombined column by taking the substring before
the '-' character
RIGHT(ViewsClicksCombined, LEN(ViewsClicksCombined) - CHARINDEX('-',
ViewsClicksCombined)) AS Clicks, -- Extracts the Clicks part from the ViewsClicksCombined
column by taking the substring after the '-' character
Likes, -- Selects the number of likes the content received
-- Converts the EngagementDate to the dd.mm.yyyy format
FORMAT(CONVERT(DATE, EngagementDate), 'dd.MM.yyyy') AS EngagementDate -- Converts and
formats the date as dd.mm.yyyy
FROM
dbo.engagement_data -- Specifies the source table from which to select the data
WHERE
ContentType != 'Newsletter'; -- Filters out rows where ContentType is 'Newsletter' as
these are not relevant for our analysis

NOTE: When I started building the visuals for this project, I noticed EngagementDate was coming in
as text, and when attempting to change it to the date data type in Power Query it would return a
lot of errors, so I opted not to format the date and to import it in its original format, which solved
the issue. (FORMAT returns text, which is why the column arrived as text; see the sketch below for an alternative.)
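
A minimal alternative sketch, assuming we want Power BI to receive a genuine DATE column: skip FORMAT (which returns an nvarchar) and return the converted value directly. Only the relevant columns are shown here.

-- Sketch: keep EngagementDate as a DATE instead of formatted text
SELECT
    EngagementID,
    CONVERT(DATE, EngagementDate) AS EngagementDate -- Stays a DATE, so Power Query detects the type correctly
FROM dbo.engagement_data;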
Now we’ll be doing a little bit more advanced SQL on the customer_journey table. Let’s look at it

It shows:
 Journey ID
 Customer ID
 Product ID
 The date they visited
 Different stages
 An action they performed

This table allows us to analyze customer journeys in a funnel. For example:


 Did they view a product?
 Did they click on it?
 Did they buy something?
But there are a lot of things to do before we dive into those analyses. Let’s address them:
1. There are duplicate rows—some rows have exactly the same information. To find those we
use ROW_NUMBER() in a CTE. This expression is just for us to check whether there are
duplicates.
We’ll be using the same method afterwards, in our final query, to address all the issues in this table.
-- Common Table Expression (CTE) to identify and tag duplicate records

WITH DuplicateRecords AS (
SELECT
JourneyID, -- Select the unique identifier for each journey (and any other
columns you want to include in the final result set)
CustomerID, -- Select the unique identifier for each customer
ProductID, -- Select the unique identifier for each product
VisitDate, -- Select the date of the visit, which helps in determining the
timeline of customer interactions
Stage, -- Select the stage of the customer journey (e.g., Awareness,
Consideration, etc.)
Action, -- Select the action taken by the customer (e.g., View, Click,
Purchase)
Duration, -- Select the duration of the action or interaction
-- Use ROW_NUMBER() to assign a unique row number to each record within the
partition defined below
ROW_NUMBER() OVER (
-- PARTITION BY groups the rows based on the specified columns that should
be unique
PARTITION BY CustomerID, ProductID, VisitDate, Stage, Action
-- ORDER BY defines how to order the rows within each partition (usually
by a unique identifier like JourneyID)
ORDER BY JourneyID
) AS row_num -- This creates a new column 'row_num' that numbers each row
within its partition
FROM
dbo.customer_journey -- Specifies the source table from which to select the
data
)

-- Select all records from the CTE where row_num > 1, which indicates duplicate
entries

SELECT *
FROM DuplicateRecords
WHERE row_num > 1 -- Filters out the first occurrence (row_num = 1) and only shows
the duplicates (row_num > 1)
ORDER BY JourneyID


o If we just want to check whether there are duplicate records, we can use the GROUP BY method,
which shows us any duplicates in the table.
SELECT
CustomerID,
ProductID,
VisitDate,
Stage,
Action,
Duration,
COUNT(*) AS DuplicateCount
FROM customer_journey
GROUP BY CustomerID, ProductID, VisitDate, Stage, Action, Duration
HAVING COUNT(*) > 1;
2. The other issue we have is the NULLs in the Duration column.
To fix this we’ll replace those values with a calculated value using COALESCE. In this case we’ll
use the average duration for the date on the row where the NULL appears. (It’s one approach
to replacing these NULLs; there are others, but we’ll use this one for now.)
3. So, the final query below removes duplicates, finds the average duration per date, and
replaces NULL values with that average.

SELECT
JourneyID, -- Selects the unique identifier for each journey to ensure data
traceability
CustomerID, -- Selects the unique identifier for each customer to link journeys
to specific customers
ProductID, -- Selects the unique identifier for each product to analyze customer
interactions with different products
VisitDate, -- Selects the date of the visit to understand the timeline of
customer interactions
Stage, -- Uses the uppercased stage value from the subquery for consistency in
analysis
Action, -- Selects the action taken by the customer (e.g., View, Click, Purchase)
COALESCE(Duration, avg_duration) AS Duration -- Replaces missing durations with
the average duration for the corresponding date
FROM
(
-- Subquery to process and clean the data
SELECT
JourneyID, -- Selects the unique identifier for each journey to ensure
data traceability
CustomerID, -- Selects the unique identifier for each customer to link
journeys to specific customers
ProductID, -- Selects the unique identifier for each product to analyze
customer interactions with different products
VisitDate, -- Selects the date of the visit to understand the timeline of
customer interactions
UPPER(Stage) AS Stage, -- Converts Stage values to uppercase for
consistency in data analysis
Action, -- Selects the action taken by the customer (e.g., View, Click,
Purchase)
Duration, -- Uses Duration directly, assuming it's already a numeric type
AVG(Duration) OVER (PARTITION BY VisitDate) AS avg_duration, --
Calculates the average duration for each date, using only numeric values
ROW_NUMBER() OVER (
PARTITION BY CustomerID, ProductID, VisitDate, UPPER(Stage), Action
-- Groups by these columns to identify duplicate records
ORDER BY JourneyID -- Orders by JourneyID to keep the first
occurrence of each duplicate
) AS row_num -- Assigns a row number to each row within the partition to
identify duplicates
FROM
dbo.customer_journey -- Specifies the source table from which to select
the data
) AS subquery -- Names the subquery for reference in the outer query
WHERE
row_num = 1; -- Keeps only the first occurrence of each duplicate group
identified in the subquery
Now we’ll move into Python for a bit. We’ll be enhancing our marketing data by incorporating
sentiment analysis.
We’ll be taking some steps in order to start joining all our stages together and create a cohesive
project. First of all:
 Connecting to our database to retrieve customer reviews data
Let’s revisit our SQL query for the customer_reviews table, as well as the table itself.

-- Query to clean whitespace issues in the ReviewText column

SELECT
ReviewID, -- Selects the unique identifier for each review
CustomerID, -- Selects the unique identifier for each customer
ProductID, -- Selects the unique identifier for each product
ReviewDate, -- Selects the date when the review was written
Rating, -- Selects the numerical rating given by the customer (e.g., 1 to 5 stars)
-- Cleans up the ReviewText by replacing double spaces with single spaces to ensure the
text is more readable and standardized
REPLACE(ReviewText, '  ', ' ') AS ReviewText
FROM
dbo.customer_reviews; -- Specifies the source table from which to select the data

We’re going to take the review text and use a Python library to perform sentiment analysis on it:
basically analyzing what the text says and then retrieving a positive or negative sentiment from it.
As we can see, there’s already a Rating field that gives us a way of “knowing” more or less the
sentiment of the review. For this exercise we’ll be focusing only on the review text and trying to
determine whether what was written was positive or negative. This adds more information that we
can then combine or use together with the score to enrich or deepen the analysis.
Next we’ll enrich our dataset using Python. Python is a little bit more advanced: a different
language, different everything from SQL.
The script is a bit long but nothing special, and it has a lot of comments to guide you through it.
# pip install pandas nltk pyodbc sqlalchemy

import pandas as pd
import pyodbc
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Before using these libraries in the project we need to install them. To do so we run pip
# for each of them (e.g. !pip install nltk, and the others).

# Download the VADER lexicon for sentiment analysis if not already present.
nltk.download('vader_lexicon')
# This is another download of a component we'll need during the exercise. It's part of the
# nltk library, so it's just a matter of bringing it into play.

# Below is where the exercise starts.


# We're creating a DataFrame from a SQL statement.
# We're defining the attributes needed to connect to the database instance.
# We're just asking for these columns; no transformation is being done.

# Define a function to fetch data from a SQL database using a SQL query
def fetch_data_from_sql():
    # Define the connection string with parameters for the database connection
    conn_str = (
        "Driver={ODBC Driver 17 for SQL Server};"  # Specify the driver for SQL Server
        "Server=DESKTOP-PC06D3B;"  # Specify your SQL Server instance
        "Database=PortfolioProject_MarketingAnalytics;"  # Specify the database name
        "Trusted_Connection=yes;"  # Use Windows Authentication for the connection
    )
    # Establish the connection to the database
    conn = pyodbc.connect(conn_str)

    # Define the SQL query to fetch customer reviews data
    query = "SELECT ReviewID, CustomerID, ProductID, ReviewDate, Rating, ReviewText FROM customer_reviews"

    # Execute the query and fetch the data into a DataFrame
    df = pd.read_sql(query, conn)

    # Close the connection to free up resources
    conn.close()

    # Return the fetched data as a DataFrame
    return df

# Fetch the customer reviews data from the SQL database
customer_reviews_df = fetch_data_from_sql()

# Initialize the VADER sentiment intensity analyzer for analyzing the sentiment of text data
sia = SentimentIntensityAnalyzer()

# Define a function to calculate sentiment scores using VADER
# Here we're defining a function to grade the sentiment of the reviews (I believe we're creating
# a field that will receive these values; at this definition stage I don't think any calculations
# are performed yet).
def calculate_sentiment(review):
    # Get the sentiment scores for the review text
    sentiment = sia.polarity_scores(review)
    # Return the compound score, which is a normalized score between -1 (most negative)
    # and 1 (most positive)
    return sentiment['compound']

# Define a function to categorize sentiment using both the sentiment score and the review rating
# Here we're defining a function that will populate two fields (score and rating). Score is
# a field created by VADER. Rating comes from our customer_reviews table.
def categorize_sentiment(score, rating):
    # Use both the text sentiment score and the numerical rating to determine the sentiment category
    if score > 0.05:  # Positive sentiment score
        if rating >= 4:
            return 'Positive'  # High rating and positive sentiment
        elif rating == 3:
            return 'Mixed Positive'  # Neutral rating but positive sentiment
        else:
            return 'Mixed Negative'  # Low rating but positive sentiment
    elif score < -0.05:  # Negative sentiment score
        if rating <= 2:
            return 'Negative'  # Low rating and negative sentiment
        elif rating == 3:
            return 'Mixed Negative'  # Neutral rating but negative sentiment
        else:
            return 'Mixed Positive'  # High rating but negative sentiment
    else:  # Neutral sentiment score
        if rating >= 4:
            return 'Positive'  # High rating with neutral sentiment
        elif rating <= 2:
            return 'Negative'  # Low rating with neutral sentiment
        else:
            return 'Neutral'  # Neutral rating and neutral sentiment

# Define a function to bucket sentiment scores into text ranges
# Here we're defining a function to create buckets populated with values based only on the
# score given to the text.
def sentiment_bucket(score):
    if score >= 0.5:
        return '0.5 to 1.0'  # Strongly positive sentiment
    elif 0.0 <= score < 0.5:
        return '0.0 to 0.49'  # Mildly positive sentiment
    elif -0.5 <= score < 0.0:
        return '-0.49 to 0.0'  # Mildly negative sentiment
    else:
        return '-1.0 to -0.5'  # Strongly negative sentiment

# Apply sentiment analysis to calculate sentiment scores for each review
# Here we are creating the column 'SentimentScore', calculated as per the definition above.
customer_reviews_df['SentimentScore'] = customer_reviews_df['ReviewText'].apply(calculate_sentiment)

# Apply sentiment categorization using both text and rating
# Here we are creating the column 'SentimentCategory', calculated as per the definition above.
customer_reviews_df['SentimentCategory'] = customer_reviews_df.apply(
    lambda row: categorize_sentiment(row['SentimentScore'], row['Rating']), axis=1)

# Apply sentiment bucketing to categorize scores into defined ranges
# Here we are creating the column 'SentimentBucket', calculated as per the definition above.
customer_reviews_df['SentimentBucket'] = customer_reviews_df['SentimentScore'].apply(sentiment_bucket)

# Display the first few rows of the DataFrame with sentiment scores, categories, and buckets
print(customer_reviews_df.head())

# Save the DataFrame with sentiment scores, categories, and buckets to a new CSV file
customer_reviews_df.to_csv('customer_reviews_with_sentiment_VH.csv', index=False)
After some adjustments, namely to the SQL connection settings and renaming the
customer_reviews table to its name in the database, we’ve created our DataFrame with the new
columns and saved it as ‘customer_reviews_with_sentiment_VH.csv’. As we can see below, the
double-spacing issue is present again, as our query from Python to the database didn’t apply any
transformation, but that’s also good: it lets us use some tools in Power Query to address this issue.

For this part of the exercise we used Jupyter Notebook and created the file ‘Marketing Analytics-
checkpoint.ipynb’ with these steps.
Now we’ll load into Power BI the tables we’ve created before, build our data model inside
Power BI, and create an interactive dashboard, transforming raw data into actionable insights.
Let’s bring the data into Power BI.
We’ll connect to our database. I have my database running on my laptop, so I’ll use ‘localhost’ and
then choose the database we’re working with. In this case we’ll be importing all the data, as it is a
personal project. (DirectQuery could also be used, but that option is usually meant for other types
of projects; companies often don’t have the need, or the storage, to import all the data, so they
query the database directly whenever there’s the need for it.)
Let’s bring the tables in as they are, meaning we’ll import them into Power BI in their original state,
before our SQL transformations and selections. We won’t import the geography table, as we’ve
already joined its information into the customers table. We’ll also see that relationships have
already been set by Power BI. We’ll look at those a little later to see whether they are correct or
need some adjustments.

Now let’s move to Power Query by selecting “Transform Data”. Here we won’t get into the details
of every table imported; we’ll just use the customer_journey table as an example.

As you can see, the transformations we did in SQL are not here: the NULL values are present and
Stage is not in uppercase. So nothing that we’ve done in SQL has been imported. To bring it in,
let’s remove the applied step “Navigation” from the Query Settings pane; we’ll then see the
database tables (even geography, which we haven’t imported).
Clicking the settings wheel next to Source opens a window, and under “advanced options” we’ll
find the area where we can paste the query we wrote in SQL.
When we click OK, we’ll see that the result we obtained in SQL is now also in Power BI.

We could, and eventually should, check the data types of the fields. For now we’ll just adjust the
VisitDate data type to Date instead of Date/Time. (Having the time for each step of the journey
could add some value, but as we don’t have it we’ll leave it as is.)
Now it’s time to perform the same steps for the other tables, always changing date fields to the
Date data type, since we don’t have time data.
On engagement_data, the Views and Clicks fields came in as text, so we changed them to Whole
Number. (They arrive as text because LEFT and RIGHT return character strings; a sketch of casting
them at the source is shown below.)
NOTE: Basically all numeric fields came in as decimals despite being whole numbers. We haven’t
addressed that, but I think we should. Let’s see.
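
A minimal sketch, assuming we’d rather fix the types in the source query than in Power Query: cast the split values to INT so they arrive in Power BI as whole numbers. Only the relevant columns are shown.

-- Sketch: cast the split Views and Clicks parts to integers in the source query
SELECT
    EngagementID,
    CAST(LEFT(ViewsClicksCombined, CHARINDEX('-', ViewsClicksCombined) - 1) AS INT) AS Views,
    CAST(RIGHT(ViewsClicksCombined, LEN(ViewsClicksCombined) - CHARINDEX('-', ViewsClicksCombined)) AS INT) AS Clicks
FROM dbo.engagement_data
WHERE ContentType != 'Newsletter';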
We’ll also be importing the sentiment analysis .csv created in our Python exercise. As it is a .csv
file, from inside Power Query we’ll click “New source” -> “Text/CSV” and choose our file.
If we select a cell, we can see that the double-spacing issue in this file hasn’t been addressed (as
mentioned before). We’ll address that in the next step.

To do that we’ll select the column and then choose Replace Values, entering a double space in
“Value to find” and a single space in “Replace with”.

We now have our double-spacing issue addressed and fixed.


NOTE: Here, in the CSV file, the data types of the numeric fields came in correctly.

We’re now ready to close and apply the changes and move to the next step.
The next step is to analyze the relationships Power BI created in our model. In order to go through
the process without Power BI deciding the relationships for us, we’ll delete them all and then start
from scratch, to better understand relationships in general and the ones in our model in particular.
Let’s place the dimensions on top and fact tables on the bottom and create the relationships

After creating the relationships we will need a date table. As we know, Power BI works best with a
dedicated Date table, which we can create from scratch with a lot of different fields depending on
the analyses we want to make. For this example we’ll use a predefined script, which we can find in
this txt file.
To do so, we’ll choose “create table” and paste our script into the DAX formula bar.
Calendar =
ADDCOLUMNS (
CALENDAR ( DATE ( 2023, 1, 1 ), DATE ( 2025, 12, 31 ) ),
"DateAsInteger", FORMAT ( [Date], "YYYYMMDD" ),
"Year", YEAR ( [Date] ),
"Monthnumber", FORMAT ( [Date], "MM" ),
"YearMonthnumber", FORMAT ( [Date], "YYYY/MM" ),
"YearMonthShort", FORMAT ( [Date], "YYYY/mmm" ),
"MonthNameShort", FORMAT ( [Date], "mmm" ),
"MonthNameLong", FORMAT ( [Date], "mmmm" ),
"DayOfWeekNumber", WEEKDAY ( [Date] ),
"DayOfWeek", FORMAT ( [Date], "dddd" ),
"DayOfWeekShort", FORMAT ( [Date], "ddd" ),
"Quarter", "Q" & FORMAT ( [Date], "Q" ),
"YearQuarter",
FORMAT ( [Date], "YYYY" ) & "/Q"
& FORMAT ( [Date], "Q" )
)

So now we also have our dimension date table and we’ll connect it to the date fields of the fact
tables.
Now it’s time to start enhancing our model with some calculations, or measures. We have a lot of
information by now that would allow us to perform many analyses, and this amount of data also
lets us think about other metrics we can calculate and then use in our visuals.
A best practice is to create an empty table to hold the measures we’ll create. Let’s name it
“_Measures”; the underscore in front of the name puts this table at the top of the list. The syntax to
create this table is simply to write “_Measures = ” in the formula bar and apply.
We are ready to create some simple measures. (NOTE: check with GPT whether these measures
make sense in general or whether, on the contrary, they were made up just for this exercise.)
 Clicks = SUM(engagement_data[Clicks])
 Likes = SUM(engagement_data[Likes])
 Views = SUM(engagement_data[Views])
 Rating (Average) = AVERAGE(customer_reviews_with_sentiment_VH[Rating])
 Number of Cust Reviews =
DISTINCTCOUNT(customer_reviews_with_sentiment_VH[ReviewID])
 Number of Cust Journeys = DISTINCTCOUNT(customer_journey[JourneyID])
 Number of Campaigns = DISTINCTCOUNT(engagement_data[CampaignID])
 Conversion Rate =
VAR TotalVisitors =
    CALCULATE(COUNT(customer_journey[JourneyID]), customer_journey[Action] = "View")
VAR TotalPurchases =
    CALCULATE(COUNT(customer_journey[JourneyID]), customer_journey[Action] = "Purchase")
RETURN
    IF(TotalVisitors = 0, 0, DIVIDE(TotalPurchases, TotalVisitors))

Let’s start building our report in Power BI.

Checks
Empty months do not show when no products were bought in that month
The line chart for the conversion rate calculates wrongly (I think)
