
Creating a Data Transformation Pipeline with Cloud Dataprep

Overview

Cloud Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis. In this lab you explore the Cloud Dataprep UI to build a data transformation pipeline that runs at a scheduled interval and outputs results into BigQuery.

The dataset you'll use is an ecommerce dataset that has millions of Google Analytics session records for the Google Merchandise Store loaded into BigQuery. You have a copy of that dataset for this lab and will explore the available fields and rows for insights.

Objectives

In this lab, you learn how to perform these tasks:

 Connect BigQuery datasets to Cloud Dataprep.
 Explore dataset quality with Cloud Dataprep.
 Create a data transformation pipeline with Cloud Dataprep.
 Schedule transformation jobs to output results to BigQuery.

What you'll need

 The Google Chrome browser. Other browsers are currently not supported by Cloud Dataprep.
Task 1. Setting up your development environment

Opening BigQuery console

Although this lab is largely focused on Cloud Dataprep, you need BigQuery both as the source from which the pipeline ingests its dataset and as the destination for the output when the pipeline is completed.

In the Google Cloud Console, select Navigation menu > BigQuery:

The Welcome to BigQuery in the Cloud Console message box opens. This message box provides a link to the quickstart guide and lists UI updates.

Click Done.

Task 2. Creating a BigQuery Dataset


In this task, you will create a new BigQuery dataset to receive the output table of
your new pipeline.

1. In the left pane, select your project name:

2. Then from the right-hand side of the Console, click CREATE DATASET:

 For Dataset ID, type ecommerce.

 Leave the other values at their defaults.

3. Then click Create Dataset. You will now see your dataset under your
project in the left-hand menu:
4. Now find the Query editor and copy and paste the following SQL query into
it:

#standardSQL
CREATE OR REPLACE TABLE ecommerce.all_sessions_raw_dataprep
OPTIONS(
description="Raw data from analyst team to ingest into Cloud Dataprep"
) AS
SELECT * FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE date = '20170801'; # limiting to one day of data (~56K rows) for this lab
5. Then click Run. This query copies over a subset of the public raw
ecommerce dataset (one day's worth of session data, or about 56
thousand records) into a new table named all_sessions_raw_dataprep,
which has been added to your ecommerce dataset for you to explore and
clean in Cloud Dataprep.
6. Confirm that the new table exists in your ecommerce dataset:
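If you prefer to verify from the Query editor instead of the left-hand menu, a quick row count works as well (a minimal sketch; the one-day extract should contain roughly 56 thousand rows):

#standardSQL
-- Sanity check: the new table should exist and hold ~56K rows for 2017-08-01
SELECT COUNT(*) AS row_count
FROM ecommerce.all_sessions_raw_dataprep;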

Task 3. Opening Cloud Dataprep


In this task, you will open Cloud Dataprep and go through some initialization
steps.

1. Using a Chrome browser (Reminder: Cloud Dataprep works ONLY in Chrome), open the Navigation menu and select Dataprep.

2. Accept the terms of service.

3. In the Share account information with Trifacta dialog, select the checkbox, and then click Agree and Continue.

4. To allow Trifacta to access your project data, click Allow.


Note: This authorization process might take a few minutes.
5. When the Sign in with Google window appears, select your Qwiklabs account
and then click Allow. Accept the Trifacta Terms of Service if prompted.

6. If prompted to use the default location for the storage bucket, click Continue.

Soon after, you should be on the Cloud Dataprep homepage:


Task 4. Connecting BigQuery data to Cloud Dataprep
In this task, you will connect Cloud Dataprep to your BigQuery data source. On
the Cloud Dataprep page:

1. Click Create Flow in the top-right corner.

2. In the Create Flow dialog, specify these details:

 For Flow Name, type Ecommerce Analytics Pipeline
 For Flow Description, type Revenue reporting table

3. Click Create.

4. If prompted with a What's a flow? popup, select Don't show me any helpers.

5. Now click Import & Add Datasets.

6. In the left pane, click BigQuery.

7. When your ecommerce dataset is loaded, click on it.


8. Click on the Create dataset icon (+ sign) on the left of
the all_sessions_raw_dataprep table.

9. Click Import & Add to Flow in the bottom right corner.

The data source automatically updates. In the right pane, Add should become an available option. You are ready to go to the next task.

Task 5. Exploring ecommerce data fields with a UI
In this task, you will load and explore a sample of the dataset within Cloud
Dataprep.

1. In the right pane, click Add > Recipe.

2. Click Edit Recipe.
Cloud Dataprep loads a sample of your dataset into the Transformer view. This
process might take a few seconds. You are now ready to start exploring the data!

Answer the following questions:

 How many columns are there in the dataset?

Answer: 32 columns.

 How many rows does the sample contain?

Answer: About 12 thousand rows.

 What is the most common value in the channelGrouping column?

Hint: Find out by hovering your mouse cursor over the histogram under the channelGrouping column title.

Answer: Referral. A referring site is typically any other website that has a link to your content. An example here would be a different website that reviewed a product on our ecommerce website and linked to it. This is considered a different acquisition channel than if the visitor came from a search engine.
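If you want to cross-check the sample against the full one-day table in BigQuery, a simple aggregation works (a sketch; note it counts rows rather than deduplicated sessions, so the exact numbers may differ from the Dataprep histogram):

#standardSQL
-- Distribution of traffic channels in the one-day extract
SELECT channelGrouping, COUNT(*) AS row_count
FROM ecommerce.all_sessions_raw_dataprep
GROUP BY channelGrouping
ORDER BY row_count DESC;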
Tip: When looking for a specific column, click the Find column icon in the top right corner, then start typing the column's name in the Find column textfield, then click on the column's name. This will automatically scroll the grid to bring the column onto the screen.
 What are the top three countries from which sessions originate?

Answer: United States, India, United Kingdom

 What does the grey bar under totalTransactionRevenue represent?

Answer: Missing values for the totalTransactionRevenue field. This means that a lot of sessions in this sample did not generate revenue. Later, we will filter out these values so our final table only has customer transactions and associated revenue.

 What is the maximum timeOnSite in seconds, maximum pageviews, and maximum sessionQualityDim for the data sample? (Hint: Open the Column Details menu from the menu to the right of the timeOnSite column.)
To close the details window, click the Close Column Details button in the top right corner. Then repeat the process to view details for the pageviews and sessionQualityDim columns.
Answers:

 Maximum Time On Site: 5,561 seconds (or 92 minutes)
 Maximum Pageviews: 155 pages
 Maximum Session Quality Dimension: 97

Note: Your answers for maximums may vary slightly due to the data sample used by Cloud Dataprep.
Note on averages: Use extra caution when performing aggregations like averages over a column of data. We need to first ensure fields like timeOnSite are only counted once per session. We'll explore the uniqueness of visitor and session data in a later lab.
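For reference, the same maximums can be computed over the full one-day table with a short query (a sketch; values over the full table may differ slightly from the Dataprep sample):

#standardSQL
-- Maximums over the full one-day table (Dataprep answers come from a sample)
SELECT
  MAX(timeOnSite) AS max_time_on_site,
  MAX(pageviews) AS max_pageviews,
  MAX(sessionQualityDim) AS max_session_quality_dim
FROM ecommerce.all_sessions_raw_dataprep;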
 Looking at the histogram for sessionQualityDim, are the data values
evenly distributed?
Answer: No, they are skewed to lower values (low quality sessions), which is
expected.

 What is the date range for the dataset? Hint: Look at the date field.

Answer: 8/1/2017 (one day of data)

 You might see a red bar under the productSKU column. If so, what might
that mean?

Answer: A red bar indicates mismatched values. While sampling data, Cloud
Dataprep attempts to automatically identify the type of each column. If you do not
see a red bar for the productSKU column, then this means that Cloud Dataprep
correctly identified the type for the column (i.e. the String type). If you do see a
red bar, then this means that Cloud Dataprep found enough numeric values in its sampling to determine (incorrectly) that the type should be Integer. Cloud
Dataprep also detected some non-integer values and therefore flagged those
values as mismatched. In fact, the productSKU is not always an integer (for
example, a correct value might be "GGOEGOCD078399"). So in this case, Cloud
Dataprep incorrectly identified the column type: it should be a string, not an
integer. You will fix that later in this lab.

 Looking at the v2ProductName column, what are the most popular products?

Answer: Nest products

 Looking at the v2ProductCategory column, what are some of the most popular product categories?

Answers:

 Nest
 Bags
 (not set) (which means that some sessions are not associated with a category)

These are the most popular categories in our sample.

 True or False? The most common productVariant is COLOR.

Answer: False. It's (not set), because most products do not have variants (80%+).

 What are the two values in the type column?

Answer: PAGE and EVENT
A user can have many different interaction types when browsing your website.
Types include recording session data when viewing a PAGE or a special EVENT
(like "clicking on a product") and other types. Multiple hit types can be triggered
at the exact same time so you will often filter on type to avoid double counting.
We'll explore this more in a later analytics lab.

 What is the maximum productQuantity?

Answer: 100 (your answer may vary)

productQuantity indicates how many units of that product were added to cart. 100 means 100 units of a single product were added.

 What is the dominant currencyCode for transactions?

Answer: USD (United States Dollar)

 Are there valid values for itemQuantity or itemRevenue?

Answer: No, they are all NULL (or missing) values.

Note: After exploration, in some datasets you may find duplicative or deprecated
columns. We will be using productQuantity and productRevenue fields instead
and dropping the itemQuantity and itemRevenue fields later in this lab to
prevent confusion for our report users.

 What percentage of transactionId values are valid? What does this represent for our ecommerce dataset?

Answer: About 4.6% of transaction IDs have a valid value, which represents the average conversion rate of the website (4.6% of visitors transact).
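A rough way to reproduce this figure in BigQuery is to compare non-null transactionId rows against the total (a sketch; row-level counts are only an approximation of a true session-level conversion rate):

#standardSQL
-- Approximate share of rows carrying a valid transactionId
SELECT
  COUNTIF(transactionId IS NOT NULL) AS rows_with_transaction,
  COUNT(*) AS total_rows,
  ROUND(COUNTIF(transactionId IS NOT NULL) / COUNT(*) * 100, 1) AS pct_valid
FROM ecommerce.all_sessions_raw_dataprep;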
 How many eCommerceAction_type values are there, and what is the most common value?
Hint: Count the distinct number of histogram columns.

Answers: There are seven values found in our sample. The most common value is 0, which indicates that the type is unknown. This makes sense, as the majority of the web sessions on our website will not perform any ecommerce actions as they are just browsing.
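The same distribution can be checked directly in BigQuery (a sketch against the one-day table created earlier):

#standardSQL
-- How often each ecommerce action type occurs
SELECT eCommerceAction_type, COUNT(*) AS occurrences
FROM ecommerce.all_sessions_raw_dataprep
GROUP BY eCommerceAction_type
ORDER BY occurrences DESC;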

 Using the schema, what does eCommerceAction_type = 6 represent?

Hint: Search for eCommerceAction_type and read the description for the mapping.
Answer: 6 maps to "Completed purchase". Later in this lab we will ingest this mapping as part of our data pipeline.
Task 6. Cleaning the data
In this task, you will clean the data by deleting unused columns, eliminating
duplicates, creating calculated fields, and filtering out unwanted rows.

Converting the productSKU column data type

To ensure that the productSKU column type is a string data type, open the menu to the right of the productSKU column, then click Change type > String.
Verify that the first step in your data transformation pipeline was created by
clicking on the Recipe icon:
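For reference, the equivalent operation in SQL is a simple cast (a minimal sketch; if BigQuery already stores productSKU as a string, the cast is a no-op):

#standardSQL
-- Force productSKU to be treated as a string rather than an integer
SELECT CAST(productSKU AS STRING) AS productSKU
FROM ecommerce.all_sessions_raw_dataprep;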

Deleting unused columns

As we mentioned earlier, we will be deleting the itemQuantity and itemRevenue columns, as they only contain NULL values and are not useful for the purposes of this lab.

 Open the menu for the itemQuantity column, and then click Delete.
 Repeat the process to delete the itemRevenue column.
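In SQL terms, dropping these two columns is roughly equivalent to the following (a sketch using BigQuery's EXCEPT modifier):

#standardSQL
-- Keep every column except the two all-NULL ones
SELECT * EXCEPT (itemQuantity, itemRevenue)
FROM ecommerce.all_sessions_raw_dataprep;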

Deduplicating rows

Your team has informed you there may be duplicate session values included in
the source dataset. Let's remove these with a new deduplicate step.
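For context, the deduplication step you are about to add behaves roughly like a SELECT DISTINCT over all columns (a sketch, not what Dataprep actually generates):

#standardSQL
-- Keep only fully distinct rows
SELECT DISTINCT *
FROM ecommerce.all_sessions_raw_dataprep;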

1. Click the Filter rows icon in the toolbar, then click Remove duplicate rows.

2. Click Add in the right-hand panel.

3. Review the recipe that you created so far; it should resemble the following:

Filtering out sessions without revenue

Your team has asked you to create a table of all user sessions that bought at
least one item from the website. Filter out user sessions with NULL revenue.

1. Under the totalTransactionRevenue column, click the grey Missing values bar. All rows with a missing value for totalTransactionRevenue are now highlighted in red.

2. In the Suggestions panel, in Delete rows, click Add.
This step filters your dataset to only include transactions with revenue
(where totalTransactionRevenue is not NULL).
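In SQL, this filter corresponds to a simple NOT NULL condition (a sketch):

#standardSQL
-- Keep only sessions that generated revenue
SELECT *
FROM ecommerce.all_sessions_raw_dataprep
WHERE totalTransactionRevenue IS NOT NULL;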

Filtering sessions for PAGE views

The dataset contains sessions of different types, for example PAGE (for page views) or EVENT (for triggered events like "viewed product categories" or "added to cart"). To avoid double counting session pageviews, add a filter to only include page view related hits.
1. In the histogram below the type column, click the bar for PAGE. All rows
with the type PAGE are now highlighted in green.

2. In the Suggestions panel, in Keep rows, click Add.
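The SQL equivalent of this step is another WHERE clause (a sketch):

#standardSQL
-- Keep only page view hits to avoid double counting
SELECT *
FROM ecommerce.all_sessions_raw_dataprep
WHERE type = 'PAGE';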


Task 7. Enriching the data
Search your schema documentation for visitId and read the description to
determine if it is unique across all user sessions or just the user.
 visitId: an identifier for this session. This is part of the value usually stored as the _utmb cookie. This is only unique to the user. For a completely unique ID, you should use a combination of fullVisitorId and visitId.
As we see, visitId is not unique across all users. We will need to create a
unique identifier.

Creating a new column for a unique session ID

As you discovered, the dataset has no single column for a unique visitor session.
Create a unique ID for each session by concatenating
the fullVisitorID and visitId fields.

1. Click on the Merge columns icon in the toolbar.

2. For Columns, select fullVisitorId and visitId.

3. For Separator, type a single hyphen character: -

4. For the New column name, type unique_session_id.


5. Click Add.
The unique_session_id is now a combination of the fullVisitorId and visitId.
We will explore in a later lab whether each row in this dataset is at the unique
session level (one row per user session) or something even more granular.
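A rough SQL equivalent of this Merge columns step is a CONCAT (a sketch; the CAST is an assumption, needed only if visitId is stored as a number rather than a string):

#standardSQL
-- Build a unique session identifier from fullVisitorId and visitId
SELECT
  CONCAT(fullVisitorId, '-', CAST(visitId AS STRING)) AS unique_session_id,
  *
FROM ecommerce.all_sessions_raw_dataprep;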

Creating a case statement for the ecommerce action type

As you saw earlier, values in the eCommerceAction_type column are integers that map to actual ecommerce actions performed in that session. For example, 3 = "Add to Cart" or 5 = "Check out." This mapping will not be immediately apparent to our end users, so let's create a calculated field that brings in the value name.

1. Click on Conditions in the toolbar, then click Case on single column.


2. For Column to evaluate, specify eCommerceAction_type.

3. Next to Cases (1), click Add 8 times for a total of 9 cases.


4. For each Case, specify the following mapping values (including the single
quote characters):
Value to compare    New value
0    'Unknown'
1    'Click through of product lists'
2    'Product detail views'
3    'Add product(s) to cart'
4    'Remove product(s) from cart'
5    'Check out'
6    'Completed purchase'
7    'Refund of purchase'
8    'Checkout options'
5. For New column name, type eCommerceAction_label. Leave the other
fields at their default values.

6. Click Add.
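For reference, this Dataprep case step corresponds roughly to a SQL CASE expression (a sketch; the CAST is an assumption so the comparison works whether the column is stored as an integer or a string):

#standardSQL
-- Translate numeric action codes into readable labels
SELECT
  CASE CAST(eCommerceAction_type AS STRING)
    WHEN '0' THEN 'Unknown'
    WHEN '1' THEN 'Click through of product lists'
    WHEN '2' THEN 'Product detail views'
    WHEN '3' THEN 'Add product(s) to cart'
    WHEN '4' THEN 'Remove product(s) from cart'
    WHEN '5' THEN 'Check out'
    WHEN '6' THEN 'Completed purchase'
    WHEN '7' THEN 'Refund of purchase'
    WHEN '8' THEN 'Checkout options'
  END AS eCommerceAction_label,
  *
FROM ecommerce.all_sessions_raw_dataprep;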

Adjusting values in the totalTransactionRevenue column


As mentioned in the schema, the totalTransactionRevenue column contains values passed to Analytics multiplied by 10^6 (e.g., 2.40 would be given as 2400000). You now divide the contents of that column by 10^6 to get the original values.
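In SQL terms, the adjustment is a single division (a sketch, for reference); the steps below perform the same calculation in Dataprep:

#standardSQL
-- Undo the 10^6 scaling applied by Analytics (e.g., 2400000 becomes 2.40)
SELECT
  totalTransactionRevenue / 1000000 AS totalTransactionRevenue1,
  *
FROM ecommerce.all_sessions_raw_dataprep;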
1. Open the menu to the right of the totalTransactionRevenue column, then select Calculate > Custom formula.

2. For Formula, type: DIVIDE(totalTransactionRevenue,1000000) and
for New column name, type: totalTransactionRevenue1. Notice the
preview for the transformation:
3. Click Add.

4. To convert the new totalTransactionRevenue1 column's type to a decimal data type, open the menu to the right of the totalTransactionRevenue1 column, then click Change type > Decimal.
5. Review the full list of steps in your recipe:
Task 8. Running and scheduling Cloud Dataprep jobs to BigQuery
Challenge: Now that you are satisfied with the flow, it's time to execute the transformation
recipe against your source dataset. The challenge for you is to load the output of the job into the
BigQuery dataset that you created earlier. Make sure you load the output into a separate table
and name it revenue_reporting.
Once your Cloud Dataprep job is completed, refresh your BigQuery page and
confirm that the output table revenue_reporting exists.
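You can also spot-check the new table from the Query editor to confirm the transformed columns made it through (a sketch; it assumes you named the output table revenue_reporting as instructed and that the recipe's column names carried through unchanged):

#standardSQL
-- Preview a few transformed rows from the pipeline output
SELECT unique_session_id, eCommerceAction_label, totalTransactionRevenue1
FROM ecommerce.revenue_reporting
LIMIT 10;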

Click Check my progress to verify the objective.

Verify if the Cloud Dataprep jobs output the data to BigQuery


While it's running, you can also schedule the execution of the pipeline in the next step so the job can be re-run automatically on a regular basis to account for newer data. Note: You can navigate and perform other operations while jobs are running.

1. You will now schedule a recurrent job execution. Click the Flows icon on the left of the screen.

2. On the right of your Ecommerce Analytics Pipeline flow, click the More icon, then click Schedule Flow.


3. In the Add Schedule dialog:

4. For Frequency, select Weekly.

5. For day of week, select Saturday and unselect Sunday.

6. For time, enter 3:00 and select AM.

7. Click Save.
The job is now scheduled to run every Saturday at 3AM!

8. Click the Jobs icon on the left of the screen.

9. You see the list of jobs. Wait until your job is marked as Completed.
