1 - Creating A Data Transformation Pipeline With Cloud Dataprep
Overview
Objectives
Although this lab is largely focused on Cloud Dataprep, you need BigQuery both as
the source from which the pipeline ingests the dataset and as the destination for
the output when the pipeline completes.
3. Then click Create Dataset. You will now see your dataset under your
project in the left-hand menu:
4. Now find the Query editor and copy and paste the following SQL query into
it:
#standardSQL
CREATE OR REPLACE TABLE ecommerce.all_sessions_raw_dataprep
OPTIONS(
description="Raw data from analyst team to ingest into Cloud Dataprep"
) AS
SELECT * FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE date = '20170801'; # limit to one day of data (about 56k rows) for this lab
5. Then click Run. This query copies over a subset of the public raw
ecommerce dataset (one day's worth of session data, or about 56
thousand records) into a new table named all_sessions_raw_dataprep,
which has been added to your ecommerce dataset for you to explore and
clean in Cloud Dataprep.
6. Confirm that the new table exists in your ecommerce dataset:
Answer: 32 columns.
Answer: Referral. A referring site is typically any other website that has a link to
your content. For example, a different website might review a product on our
ecommerce website and link to it. This is considered a different acquisition
channel than if the visitor came from a search engine.
Tip: When looking for a specific column, click the Find column icon in the top right
corner, start typing the column's name in the Find column text field, then click the
column's name. The grid scrolls automatically to bring that column into view.
What are the top three countries from which sessions originate?
You might see a red bar under the productSKU column. If so, what might
that mean?
Answer: A red bar indicates mismatched values. While sampling data, Cloud
Dataprep attempts to automatically identify the type of each column. If you do not
see a red bar for the productSKU column, then this means that Cloud Dataprep
correctly identified the type for the column (i.e. the String type). If you do see a
red bar, then this means that Cloud Dataprep found enough number values in its
sampling to determine (incorrectly) that the type should be Integer. Cloud
Dataprep also detected some non-integer values and therefore flagged those
values as mismatched. In fact, the productSKU is not always an integer (for
example, a correct value might be "GGOEGOCD078399"). So in this case, Cloud
Dataprep incorrectly identified the column type: it should be a string, not an
integer. You will fix that later in this lab.
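The mismatch behavior described above can be illustrated with a small Python sketch. This is not Cloud Dataprep's actual inference logic: the majority-vote rule, the 0.5 threshold, the function name infer_column_type, and the numeric sample SKUs are all assumptions made for illustration. Only the value "GGOEGOCD078399" comes from this lab.

```python
def infer_column_type(values, threshold=0.5):
    """Guess a column type from sampled string values (illustrative rule only).

    Returns the guessed type and the values that do not match it.
    """
    looks_integer = [v.isdigit() for v in values]
    share = sum(looks_integer) / len(values)
    if share > threshold:
        # Most sampled values are digits, so the guess is Integer and the
        # remaining values are flagged as mismatched (the "red bar").
        mismatched = [v for v, ok in zip(values, looks_integer) if not ok]
        return "Integer", mismatched
    return "String", []


# Hypothetical sample: three numeric-looking SKUs and one alphanumeric SKU.
sample = ["9180781", "9182569", "9184677", "GGOEGOCD078399"]
inferred, mismatched = infer_column_type(sample)
print(inferred, mismatched)
```

With a sample dominated by numeric-looking SKUs, the sketch guesses Integer and flags the alphanumeric SKU, which is the situation the red bar reports. Treating the column as a String removes the mismatch.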
Answers: Nest, Bags, and (not set) are the most popular in our sample. The (not set)
value means that some sessions are not associated with a category.
Note: After exploration, in some datasets you may find duplicative or deprecated
columns. We will be using productQuantity and productRevenue fields instead
and dropping the itemQuantity and itemRevenue fields later in this lab to
prevent confusion for our report users.
Answer: There are seven values in our sample. The most common value is zero (0),
which indicates that the type is unknown. This makes sense, as the majority of
web sessions on our website do not perform any ecommerce actions; those visitors
are just browsing.
Deduplicating rows
Your team has informed you there may be duplicate session values included in
the source dataset. Let's remove these with a new deduplicate step.
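As a rough illustration of what a deduplicate step does, the following Python sketch drops rows that are identical in every column and keeps the first occurrence. The sample rows and the function name deduplicate are hypothetical, and the sketch assumes that "duplicate" means an exact match across all columns.

```python
def deduplicate(rows):
    """Return rows with exact duplicates removed, preserving first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        # A row's identity is the full set of its (column, value) pairs.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical sample: the first two rows are exact duplicates.
rows = [
    {"fullVisitorId": "111", "visitId": 1, "productSKU": "GGOEGOCD078399"},
    {"fullVisitorId": "111", "visitId": 1, "productSKU": "GGOEGOCD078399"},
    {"fullVisitorId": "222", "visitId": 2, "productSKU": "GGOEGOCD078399"},
]
deduped = deduplicate(rows)
print(len(rows), "->", len(deduped))
```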
3. Review the recipe that you have created so far; it should resemble the following:
Your team has asked you to create a table of all user sessions in which the visitor
bought at least one item from the website. Filter out user sessions with NULL revenue.
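The filter can be pictured as keeping only the rows whose revenue value is present. The Python sketch below assumes the revenue column is totalTransactionRevenue (the column used later in this lab) and uses hypothetical sample rows, with None standing in for NULL.

```python
# Hypothetical sample sessions; None stands in for a NULL revenue value.
sessions = [
    {"visitId": 1, "totalTransactionRevenue": None},
    {"visitId": 2, "totalTransactionRevenue": 37390000},
    {"visitId": 3, "totalTransactionRevenue": None},
]

# Keep only sessions that recorded revenue, i.e. drop the NULL rows.
purchases = [s for s in sessions if s["totalTransactionRevenue"] is not None]
print(purchases)
```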
As you discovered, the dataset has no single column that uniquely identifies a
visitor session. Create a unique ID for each session by concatenating
the fullVisitorId and visitId fields.
2. For Columns, select fullVisitorId and visitId.
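The concatenation amounts to joining the two values into one string. In the Python sketch below, the "-" separator, the function name make_session_id, and the sample values are assumptions for illustration; use whatever separator and column name you chose in the recipe step.

```python
def make_session_id(full_visitor_id, visit_id, separator="-"):
    """Join fullVisitorId and visitId into one session identifier string."""
    return f"{full_visitor_id}{separator}{visit_id}"


# Hypothetical values for one session.
session_id = make_session_id("4988517937139937145", 1501600000)
print(session_id)
```

A separator keeps the two parts distinguishable: without one, different pairs of values could produce the same combined string.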
0 'Unknown'
5 'Check out'
6 'Completed purchase'
7 'Refund of purchase'
8 'Checkout options'
5. For New column name, type eCommerceAction_label. Leave the other
fields at their default values.
6. Click Add.
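The case mapping above behaves like a lookup table from action code to label. The Python sketch below includes only the pairs shown in this section; other codes are left unmapped here and return None, which is an assumption about the default rather than documented Dataprep behavior. The names ACTION_LABELS and label_action are hypothetical.

```python
# Only the code/label pairs listed in this section.
ACTION_LABELS = {
    0: "Unknown",
    5: "Check out",
    6: "Completed purchase",
    7: "Refund of purchase",
    8: "Checkout options",
}


def label_action(code):
    """Return the label for an action code, or None if the code is not mapped."""
    return ACTION_LABELS.get(code)


print(label_action(6))
```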
2. For Formula, type: DIVIDE(totalTransactionRevenue,1000000) and
for New column name, type: totalTransactionRevenue1. Notice the
preview for the transformation:
3. Click Add.
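The formula divides each revenue value by one million. A minimal Python sketch of the same arithmetic follows; the sample value and the function name are hypothetical, and passing missing values through as None is an assumption, not a statement about how DIVIDE treats NULL.

```python
def scale_revenue(total_transaction_revenue):
    """Divide a raw revenue value by 1,000,000, passing None through unchanged."""
    if total_transaction_revenue is None:
        return None
    return total_transaction_revenue / 1_000_000


# Hypothetical raw value.
scaled = scale_revenue(37390000)
print(scaled)
```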
4. For Frequency, select Weekly.
7. Click Save.
The job is now scheduled to run every Saturday at 3AM!
9. In the list of jobs, wait until your job is marked as Completed.
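To reason about when a weekly Saturday 3 AM schedule will next fire, the following Python sketch computes the next matching time from a given moment. It is only an illustration of the schedule's arithmetic: the function name is hypothetical, time zones are ignored, and it does not call any Dataprep scheduling API.

```python
from datetime import datetime, timedelta


def next_weekly_run(now, weekday=5, hour=3):
    """Return the first time strictly after `now` on `weekday` at `hour`:00.

    Weekdays follow Python's convention: Monday is 0, Saturday is 5.
    """
    days_ahead = (weekday - now.weekday()) % 7
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=hour, minute=0, second=0, microsecond=0
    )
    if candidate <= now:
        # This week's slot has already passed, so move to next week.
        candidate += timedelta(days=7)
    return candidate


# Tuesday, 1 August 2017 at noon.
print(next_weekly_run(datetime(2017, 8, 1, 12, 0)))
```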