Chapter 2. Building a Data Lake and DWH with Cloud Storage and BigQuery
About the data analysis infrastructure built in this book
Now, let us start building a data analysis infrastructure using Google Cloud Platform. First, I will give a simple picture of the data analysis infrastructure to be built in this book. A data analysis infrastructure can be roughly divided into two parts: a "data lake" that stores various data, and a "DWH" (data warehouse) that reads data from the data lake and processes and aggregates it into a form that can actually be used. We will build the data lake and the DWH with Google Cloud Platform's Cloud Storage and BigQuery services, respectively.
If you want to know more about Cloud
Storage and BigQuery, please refer to the
Google Cloud documentation.
■ About Cloud Storage
https://cloud.google.com/storage/docs/introduction
■ About BigQuery
https://cloud.google.com/bigquery/what-is-bigquery
Advance preparation
How to create a Google Cloud Platform account
The details of the creation procedure are omitted in this book. If you have any questions, please refer to technical articles on the Internet.
https://cloud.google.com/free/
Creating an application on App Engine
In order to use Cloud Scheduler for scheduling, which will be explained later, it is assumed that you already have an application running on App Engine. If you do not have one, please create one in advance.
■ Setting up a Google Cloud project for App Engine
https://cloud.google.com/appengine/docs/standard
About the sample data handled in this book
The following are the specifications of the
sample data to be uploaded to the Cloud
Storage bucket and the table definitions to be
created in BigQuery. Please check them if
necessary.
Files to be uploaded to the Cloud Storage bucket
File name: pos_salesdata_daily.csv

    Column Name
 1  sales_date
 2  store_code
 3  store_name
 4  product_code
 5  product_name
 6  sales_amount
 7  sales_cnt
Image of data

    sales_date  store_code  store_name  product_code
 1  2020-10-21  10031       Narita      1000001
 2  2020-10-21  10031       Narita      1000002
 3  2020-05-01  10021       Suginami    5000301
 4  2020-05-01  10021       Suginami    5000301
 5  2020-08-21  10001       Tachikawa   8000001
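If you do not have this sample file at hand, the following is a minimal Python sketch that writes a CSV in the same layout. The file name and column order follow the specification above; the product names, amounts, and counts in the rows are made-up values purely for illustration.

import csv

# Hypothetical sample rows: the first four columns follow the "Image of data"
# above, while product_name, sales_amount, and sales_cnt are made-up values.
rows = [
    ["2020-10-21", "10031", "Narita", "1000001", "Sample Product A", "1200", "3"],
    ["2020-10-21", "10031", "Narita", "1000002", "Sample Product B", "800", "1"],
    ["2020-05-01", "10021", "Suginami", "5000301", "Sample Product C", "500", "2"],
]

with open("pos_salesdata_daily.csv", "w", newline="") as f:
    writer = csv.writer(f)
    # Header row; it will be skipped at load time (skip_leading_rows = 1).
    writer.writerow(["sales_date", "store_code", "store_name", "product_code",
                     "product_name", "sales_amount", "sales_cnt"])
    writer.writerows(rows)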
$ gsutil mb gs://[test-sample-gcp]/

The bucket name must be set to a globally unique name.
(4) Uploading files to the bucket
When you select the bucket you created, you
will be taken to the "Bucket Details" screen.
・ Click "UPLOAD FILES" and select the
files you want to upload.
For operations using the CLI on Cloud Shell
$ gsutil cp [pos_salesdata_daily.csv] gs://test-sample-gcp/
In this example, we assume that the file in the Cloud Shell home directory is copied to the Cloud Storage bucket.
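If you prefer to script the upload rather than use gsutil, a minimal sketch using the google-cloud-storage client library (an extra dependency, installed with pip install google-cloud-storage, and not used elsewhere in this book) would look like this. It assumes the bucket test-sample-gcp already exists and that your credentials are configured.

from google.cloud import storage

# Upload the sample CSV from the current directory to the existing bucket.
client = storage.Client()
bucket = client.bucket("test-sample-gcp")
blob = bucket.blob("pos_salesdata_daily.csv")
blob.upload_from_filename("pos_salesdata_daily.csv")
print("Uploaded gs://{}/{}".format(bucket.name, blob.name))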
With the above steps, the data lake is complete.

You may think, "Surely it can't be that simple, just creating a bucket and uploading files to it?" But that is exactly what it is. All you have to do is "create a bucket" and "upload files". In practice, you will probably store files in separate buckets, and in separate folders within those buckets, for different business purposes (for example, for each data source such as a mission-critical system) or for different time periods of the target data. However, it is important to understand that the minimum configuration unit of a data lake is a single file stored in a single bucket, as shown in the previous steps.
Building a DWH
Next, we will use BigQuery, GCP's DWH
service, to build the DWH.
(1) Create a dataset in BigQuery
Select "BigQuery" from the navigation menu.
$ bq --location=us-central1 mk \
--dataset \
[project_id]:bq_dataset
Please replace [project_id] with the project ID of your own environment.
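The same dataset can also be created from Python with the google-cloud-bigquery client library used later in this book. This is only a sketch, assuming your default project and credentials are already set up; replace the location if you chose a different one.

from google.cloud import bigquery

client = bigquery.Client()
dataset = bigquery.Dataset(client.dataset("bq_dataset"))
dataset.location = "us-central1"
dataset = client.create_dataset(dataset)  # API request
print("Created dataset {}.{}".format(client.project, dataset.dataset_id))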
(2) Create a table in the dataset
With the dataset "bq_dataset" you just created selected under Resources, click "Create Table" on the right side of the screen.
On the "Create Table" screen, fill in the following items and click "Create Table".
Source: Empty table
Destination: Search for a project
Project name: select an existing project
Dataset: bq_dataset
Table type: Native table
Table name: pos_salesdata
Schema: see the table below
Name           Type      Mode
sales_date     DATE      NULLABLE
store_code     STRING    NULLABLE
store_name     STRING    NULLABLE
product_code   STRING    NULLABLE
product_name   STRING    NULLABLE
sales_amount   NUMERIC   NULLABLE
sales_cnt      INTEGER   NULLABLE
For operations using the CLI on Cloud Shell
Creating a table
$ bq mk \
--table \
[project_id]:bq_dataset.pos_salesdata \
sales_date:DATE,store_code:STRING,store_name:STRING,product_code:STRING,product_name:STRING,sales_amount:NUMERIC,sales_cnt:INTEGER
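The equivalent table can also be created from Python. The sketch below uses the same google-cloud-bigquery client library and the schema from the table above; it assumes the dataset bq_dataset already exists.

from google.cloud import bigquery

client = bigquery.Client()
schema = [
    bigquery.SchemaField("sales_date", "DATE"),
    bigquery.SchemaField("store_code", "STRING"),
    bigquery.SchemaField("store_name", "STRING"),
    bigquery.SchemaField("product_code", "STRING"),
    bigquery.SchemaField("product_name", "STRING"),
    bigquery.SchemaField("sales_amount", "NUMERIC"),
    bigquery.SchemaField("sales_cnt", "INTEGER"),
]
table_ref = client.dataset("bq_dataset").table("pos_salesdata")
table = client.create_table(bigquery.Table(table_ref, schema=schema))  # API request
print("Created table {}".format(table.table_id))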
This completes creating the bucket in Cloud Storage, uploading the sample data, and creating the dataset and table in BigQuery.
Chapter 3. Using Cloud Functions to Automate BigQuery Load Processing
Overview of the load process to BigQuery
From here on, we will create a process that automatically loads files stored in a Cloud Storage bucket into a BigQuery table at a predetermined time, using the GCP services Cloud Functions, Cloud Pub/Sub, and Cloud Scheduler.
Image of the load process from Cloud Storage to BigQuery
$ python3 bqload.py
After running the above code and confirming that the data has been loaded into the table in BigQuery, we need to modify the code so that it can be executed as a function in Cloud Functions. Because the function is triggered by a Pub/Sub message, we also need to decode the base64-encoded data in the message, so the code becomes the following:
import base64
import datetime
from google.cloud import bigquery


def bqload_func(event, context):
    # Decode the base64-encoded data in the Pub/Sub message.
    pubsub_message = base64.b64decode(event['data']).decode('utf-8')
    print(pubsub_message)

    client = bigquery.Client()
    dataset_id = 'bq_dataset'
    dataset_ref = client.dataset(dataset_id)

    job_config = bigquery.LoadJobConfig()
    job_config.schema = [
        bigquery.SchemaField("sales_date", "DATE"),
        bigquery.SchemaField("store_code", "STRING"),
        bigquery.SchemaField("store_name", "STRING"),
        bigquery.SchemaField("product_code", "STRING"),
        bigquery.SchemaField("product_name", "STRING"),
        bigquery.SchemaField("sales_amount", "NUMERIC"),
        bigquery.SchemaField("sales_cnt", "INTEGER"),
    ]
    # Skip the header row of the CSV file.
    job_config.skip_leading_rows = 1
    # The source format defaults to CSV, so the line below is optional.
    job_config.source_format = bigquery.SourceFormat.CSV
    uri = "gs://test-sample-gcp/pos_salesdata_daily.csv"

    load_job = client.load_table_from_uri(
        uri,
        dataset_ref.table("pos_salesdata"),
        job_config=job_config,
    )  # API request
    load_job.result()  # Waits for table load to complete.

    destination_table = client.get_table(dataset_ref.table("pos_salesdata"))
    print("Loaded {} rows.".format(destination_table.num_rows))
Contents of requirements.txt
google-cloud-bigquery==1.25.0
Paste the above code and deploy it.
(3) Publish the message.
Return to the Cloud Pub/Sub screen, and
from the topic details screen, select "Publish
message" -> "Publish one message".
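You can also publish a test message from code instead of the console. This is a minimal sketch assuming the google-cloud-pubsub library is installed (pip install google-cloud-pubsub); the topic name bqload-topic and the project ID are placeholders, so replace them with the topic that triggers your function and your own project.

from google.cloud import pubsub_v1

# Publish a test message to the topic that triggers the Cloud Function.
# "your-project-id" and "bqload-topic" are placeholders.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("your-project-id", "bqload-topic")
future = publisher.publish(topic_path, b"load pos_salesdata")  # data must be bytes
print("Published message id: {}".format(future.result()))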
Chapter 5. Other things that are necessary for building a data infrastructure
ETL processing that occurs in data infrastructure construction
In this book, I have not touched on ETL processing at all, because the focus has been on the minimum infrastructure required for a data analysis platform. In practice, however, ETL processing is always necessary for a data analysis infrastructure. If you implement it on GCP, you will need a service for executing ETL or workflow processing, specifically Dataflow or Composer. However, building and operating a data analysis infrastructure with these services requires allocating considerable engineering resources, so there is no need to take them on at the stage of building a new, lightweight data analysis infrastructure.

If you want to do ETL processing on the simple data analysis infrastructure configuration discussed in this book, I recommend using BigQuery's "Query Scheduling" feature (scheduled queries). The query scheduling feature allows you to schedule SQL to be executed in BigQuery. You can use it to process and transform the data loaded into a BigQuery table by the Cloud Functions function and store the result in another table using SQL.
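As a concrete illustration, the kind of SQL you would register as a scheduled query can be tried out first with the same BigQuery client library. This is only a sketch: the destination table daily_store_sales and the aggregation itself are examples, not part of the configuration built in this book.

from google.cloud import bigquery

client = bigquery.Client()

# Aggregate the loaded POS data by date and store into another table.
sql = """
    SELECT sales_date, store_code, store_name,
           SUM(sales_amount) AS total_sales_amount,
           SUM(sales_cnt) AS total_sales_cnt
    FROM bq_dataset.pos_salesdata
    GROUP BY sales_date, store_code, store_name
"""

job_config = bigquery.QueryJobConfig()
job_config.destination = client.dataset("bq_dataset").table("daily_store_sales")
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

query_job = client.query(sql, job_config=job_config)  # API request
query_job.result()  # Waits for the query to finish.
print("Wrote aggregated results to bq_dataset.daily_store_sales")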