Simple Data Infrastructure Using Google Cloud

Introduction.

The purpose of this book is to show you how to build a simple data analysis infrastructure using Google Cloud Platform.
In general, building a data analysis infrastructure requires a great deal of time and money. Beyond the actual system construction, many procedures and tasks need to be completed: aligning the business strategies and data utilization policies of the management strategy and IT departments, clarifying how the user departments intend to use the data, and then using those requirements as the basis for internal budget deliberations and other adjustments.
In a company that does not have a well-developed internal IT environment and is reluctant to invest in IT, it is easy to imagine that when you explain to executives that you plan to build a new data analysis infrastructure, they will ask, "A data analysis infrastructure? What is the point of building such a thing?" or "How much revenue will it generate?" (That is how it was at my company, too.) However, because a data analysis infrastructure deals with "data," which seems concrete at first glance but is actually abstract, it is very difficult to clarify in advance exactly how far it will be used when planning the construction. So what should you do if your organization does not invest enough in the use of data? My recommendation is to start small: actually build the data infrastructure with your own hands.
When you hear the words "build a data
infrastructure", you may have the image of
doing something very difficult. Of course,
data infrastructure has various functions, and
it is not realistic to create it from scratch by
yourself, but if you use cloud services such as
Amazon Web Service (AWS) or Google
Cloud Platform (GCP), it is actually not that
hard. In fact, it's rather easy to do.
Specifically, how to proceed is to assign one
or two people within the company who are
familiar with AWS, GCP, Azure, etc. as PoC
personnel for building the data infrastructure,
build the infrastructure once in about two to
three months, and then actually store the
company's data and perform tasks related to
data utilization (e.g., creating data marts for
BI connection, creating data marts for
advertising operations, etc.). If it works well,
then we can invest a little more to develop it
into a solid infrastructure, which would be
more efficient considering the cost of
building the system.
This book describes, as simply as possible, how to build a data infrastructure for organizations (or individuals) that are going to build a new data analysis infrastructure and want to do so easily without spending a lot of money.
First of all, with this book in hand, let's try to
build a data infrastructure easily using GCP!
October 23, 2021 Data Consulting Lab.
Kanazawa
Chapter1.What is a "data analytics
infrastructure"?
How to build a data analysis
infrastructure
About Data Lake and DWH
Chapter2.Building a Data Lake and DWH
with Cloud Storage and BigQuery
About the data analysis infrastructure
built in this book
Data Lake and DWH Construction
Procedures
Chapter3.Using Cloud Functions to
Automate BigQuery Load Processing
About the image of the loading process to
BigQuery
About the Python code that runs in Cloud
Functions
Steps to automate BigQuery load
processing using Cloud Functions
Chapter4.The cost of building a data
infrastructure
Chapter5.Other things that are necessary
for building a data infrastructure
ETL processing that occurs in data
infrastructure construction
Chapter 1. What is a "data analytics infrastructure"?

As the name suggests, a data analysis infrastructure is "an infrastructure for analyzing data." This on its own may be hard to picture, so let me explain it a little more carefully.
"We collect a variety of internal data into the
system, use that data to perform aggregation
and analysis according to our objectives, and
then use the results to create and distribute
reports."
A data analysis infrastructure is the system you need in order to do the above. Of course, this is just one example of how it can be used. If you acquire various data from external sources in addition to your own data, you can make new discoveries by combining those data with your own for analysis. If you operate your own e-commerce site or use CRM or MA tools, you can combine the member attribute information and purchase history held in those tools to conduct more accurate digital marketing.

How to build a data analysis infrastructure
In order to build a data analysis infrastructure, we first need to investigate what kind of data is being managed within the company and decide how to collect it and where to manage and aggregate it. It is also necessary to consider how the data will be utilized within the company. First, we create a system to store the data linked from various systems, such as the company's core systems, and then we create a system to analyze and aggregate that data. (This book does not deal with specific data utilization methods.)

About Data Lake and DWH
A place to store the data linked from various systems is generally called a "data lake," and a mechanism for performing aggregation and analysis based on that data is called a "data warehouse" (DWH). A typical data analysis infrastructure is built from these two components.
Nowadays, most data lakes and DWHs are built using cloud services such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Azure. Below are the services used for a data analysis infrastructure in AWS and GCP respectively.
In the case of AWS
Data Lake: Amazon S3
DWH: Amazon Redshift
In the case of GCP
Data Lake: Cloud Storage
DWH: BigQuery
There are many cloud services available for
building data analysis infrastructure, and you
can choose from a variety of services
depending on the scale of your data and your
usage. In addition, since the services are
updated quite frequently, it is recommended
to check the latest updates of the services you
are using. In this document, we will be using
the services of Google Cloud Platform
(GCP).

Chapter 2. Building a Data Lake and DWH with Cloud Storage and BigQuery

About the data analysis infrastructure built in this book
Now, let's start building a data analysis infrastructure using Google Cloud Platform. First, here is a simple picture of the data analysis infrastructure to be built in this book. It can be roughly divided into two parts: a "data lake" that stores various data, and a "DWH" that reads data from the data lake and processes and aggregates it into a form that can actually be used. We will build the data lake and the DWH with Google Cloud Platform's Cloud Storage and BigQuery services, respectively.
If you want to know more about Cloud
Storage and BigQuery, please refer to the
Google Cloud documentation.
■ About Cloud Storage
https://cloud.google.com/storage/docs/introduction
■ About BigQuery
https://cloud.google.com/bigquery/what-is-bigquery
Advance preparation
How to create a Google Cloud Platform
Account
The details of the account creation procedure are omitted in this book. If you have any questions, please refer to technical articles on the Internet.
https://cloud.google.com/free/
Creating an App Engine application
To use Cloud Scheduler for scheduling, which will be explained later, it is assumed that you already have an application running on App Engine. If you do not have an App Engine application, please create one in advance.
■ Setting up a Google Cloud project for App
Engine
https://cloud.google.com/appengine/docs/standard
About the contents of the sample data
handled in this manual
The following are the specifications of the
sample data to be uploaded to the Cloud
Storage bucket and the table definitions to be
created in BigQuery. Please check them if
necessary.
Files to be uploaded to the Cloud Storage
bucket
File name: pos_salesdata_daily.csv
Column Name
1 sales_date
2 store_code
3 store_name
4 product_code
5 product_name
6 sales_amount
7 sales_cnt
Tables to create in BigQuery

Table name: pos_salesdata
Field Name  Type  Contents
1 sales_date DATE Date the product was sold
2 store_code STRING Store code
3 store_name STRING Store name
4 product_code STRING Product code
5 product_name STRING Product name
6 sales_amount NUMERIC Sales amount
7 sales_cnt INTEGER Number of items sold

Image of data (only the first four columns are shown)
   sales_date   store_code  store_name  product_code
1  2020-10-21   10031       Narita      1000001
2  2020-10-21   10031       Narita      1000002
3  2020-05-01   10021       Suginami    5000301
4  2020-05-01   10021       Suginami    5000301
5  2020-08-21   10001       Tachikawa   8000001

We do not provide sample data for download, so please create appropriate data in the above format.
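As one way to do that, here is a minimal sketch that writes a small CSV file in this format using only the Python standard library. The store and product values are made-up placeholders; adjust them freely.

/// Sample code (generating a sample pos_salesdata_daily.csv)
import csv
import random
from datetime import date, timedelta

header = ["sales_date", "store_code", "store_name", "product_code",
          "product_name", "sales_amount", "sales_cnt"]
stores = [("10031", "Narita"), ("10021", "Suginami"), ("10001", "Tachikawa")]
products = [("1000001", "Product A", 500), ("5000301", "Product B", 1200)]

with open("pos_salesdata_daily.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    for _ in range(20):
        sales_date = date(2020, 10, 1) + timedelta(days=random.randint(0, 30))
        store_code, store_name = random.choice(stores)
        product_code, product_name, unit_price = random.choice(products)
        cnt = random.randint(1, 5)
        writer.writerow([sales_date.isoformat(), store_code, store_name,
                         product_code, product_name, unit_price * cnt, cnt])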

Data Lake and DWH Construction Procedures
Now, let's create a data infrastructure using GCP's services. First, we will build a data lake and a DWH as shown in the following figure.

Image of data lake and DWH configuration

Building a Data Lake
We will use Cloud Storage, a storage service of GCP, to build the data lake.
(1) Log in to GCP.
(2) Open the GCP console from the following link.
https://console.cloud.google.com/
To operate the command line from Cloud Shell, start Cloud Shell with the Cloud Shell button at the upper right of the screen.
(3) Create a bucket.
Open the navigation menu (the hamburger button in the upper left corner of the screen) and select "Storage" -> "Browser". When the Storage browser screen opens, select "Create Bucket" at the top of the screen.

In the bucket creation screen:
・ Enter a name for the bucket. It must be globally unique.
・ Check "Region" for the location type.
・ Select "us-central1" for the location.
・ Check "Standard" for the storage class.
・ Check "Fine-grained" for the method of controlling access to objects.
For operations using the CLI on Cloud Shell
This is how to perform the above GUI procedure using the CLI (command-line operation).

$ gsutil mb -l us-central1 gs://[test-sample-gcp]/

The bucket name must be set to a globally unique name.
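As a reference, the same bucket can also be created from Python with the google-cloud-storage client library. This is only an optional sketch; replace the bucket name with your own.

/// Sample code (creating the bucket with the Python client library)
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("test-sample-gcp")  # replace with your own globally unique name
bucket.storage_class = "STANDARD"
client.create_bucket(bucket, location="us-central1")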
(4) Upload files to the bucket.
When you select the bucket you created, you will be taken to the "Bucket details" screen.
・ Click "UPLOAD FILES" and select the files you want to upload.
For operations using the CLI on Cloud Shell

$ gsutil cp [pos_salesdata_daily.csv] gs://test-sample-gcp/

In this example, we assume that the file in the Cloud Shell home directory is copied to the Cloud Storage bucket.
The above steps complete the building of the data lake.
"Well, you can't just create a bucket and
upload files to it."
But that's exactly what it is. All you have to
do is "create a bucket" and "upload files". In
reality, you probably store files in separate
buckets and folders within the buckets for
different business purposes (e.g. for each data
source such as mission-critical systems) or
for different periods of time of the target data.
However, it is important to understand that
the minimum configuration unit of a data lake
is one file stored in one bucket, as shown in
the previous step.
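For example, separating files by data source and by date could look like the following sketch with the google-cloud-storage client library; the prefix layout here is just an illustration, not something prescribed in this book.

/// Sample code (uploading a file under a per-source, per-date prefix)
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("test-sample-gcp")
# Store POS data under a prefix per data source and per date.
blob = bucket.blob("pos/2020-10-21/pos_salesdata_daily.csv")
blob.upload_from_filename("pos_salesdata_daily.csv")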
Building a DWH
Next, we will use BigQuery, GCP's DWH service, to build the DWH.
(1) Create a dataset in BigQuery.
Select "BigQuery" from the navigation menu. With an existing project selected in the resource column on the left of the screen, click "Create dataset" on the right of the screen.

The "Create dataset" screen will appear on


the right side of the screen. Fill in the input
fields as shown below, and then click "Create
dataset".
Data Set ID:bq_dataset
Data Location: us-central1
Default table expiration date
Encryption: Key managed by Google
For operations using the CLI on Cloud Shell
Creating a dataset

$ bq --location=us-central1 mk \
    --dataset \
    [project_id]:bq_dataset

Please replace [project_id] with the project ID of your own environment.
(2) Create a table in the dataset.
With the dataset you just created ("bq_dataset") selected in Resources, click "Create table" on the right side of the screen.
From the "Create table" screen on the right side of the screen, fill in the following items and click "Create table".
Source: Empty table
Destination:
Project name: Select an existing project
Dataset: bq_dataset
Table type: Native table
Table name: pos_salesdata
Schema: see the table below
Name Type Mode
sales_date DATE NULLABLE
store_code STRING NULLABLE
store_name STRING NULLABLE
product_code STRING NULLABLE
product_name STRING NULLABLE
sales_amount NUMERIC NULLABLE
sales_cnt INTEGER NULLABLE
For operations using the CLI on Cloud Shell
Creating a table

$ bq mk \
    --table \
    [project_id]:bq_dataset.pos_salesdata \
    sales_date:DATE,store_code:STRING,store_name:STRING,product_code:STRING,product_name:STRING,sales_amount:NUMERIC,sales_cnt:INTEGER
With the above, we have created a bucket in Cloud Storage, uploaded the sample data, and created a dataset and a table in BigQuery.
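If you also want to confirm the result from Python rather than the console, a quick check like the following works; replace the project ID placeholder with your own.

/// Sample code (checking that the dataset and table exist)
from google.cloud import bigquery

client = bigquery.Client(project="[your-project]")
for table in client.list_tables("bq_dataset"):
    print(table.table_id)  # should print "pos_salesdata"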

Chapter 3. Using Cloud Functions to Automate BigQuery Load Processing

About the image of the loading process to BigQuery
From here on, we will use GCP services to create a process that automatically loads files stored in a Cloud Storage bucket into a BigQuery table at a predetermined time. Specifically, we will use Cloud Functions, Cloud Pub/Sub, and Cloud Scheduler.

Image of the load process from Cloud Storage to BigQuery

The process of loading files stored in Cloud Storage into a BigQuery table can be run manually from the GCP console, but in actual operation it is more convenient to automate it. We will make good use of GCP's scheduler service, Cloud Scheduler, and its FaaS service, Cloud Functions, to automate the loading process.
The following services are required to automate the process of loading from Cloud Storage to BigQuery:
・Cloud Scheduler
・Cloud Pub/Sub
・Cloud Functions
The specific steps are as follows:
(1) Create a topic in Cloud Pub/Sub.
(2) Set up a trigger for Cloud Functions.
(3) Publish a message.
(4) Set up the function execution schedule of Cloud Functions with Cloud Scheduler.

About the Python code that runs in Cloud Functions

Before going into the steps, I will briefly explain the function (Python code) to be executed by Cloud Functions. There is sample code in the Google Cloud documentation, which will be used as the basis for the explanation. At the following URL, find the section called "Load CSV data into a table" and select the Python code.
■ Loading CSV data from Cloud Storage
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv

/// Here is the sample code

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("post_abbr", "STRING"),
    ],
    skip_leading_rows=1,
    # The source format defaults to CSV, so the line below is optional.
    source_format=bigquery.SourceFormat.CSV,
)

uri = "gs://cloud-samples-data/bigquery/us-states/us-states.csv"

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)  # Make an API request.
print("Loaded {} rows.".format(destination_table.num_rows))

Here is a brief explanation of how the sample code is processed.

Define the schema for loading into the BigQuery table:
schema=[
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("post_abbr", "STRING"),
],

Set the number of header lines to skip in the source file (in the sample, the first line is skipped and loading starts from the second line):
skip_leading_rows=1,

Specify the URI of the Cloud Storage file you are loading from:
uri = "gs://cloud-samples-data/bigquery/us-states/us-states.csv"

The actual load is performed by client.load_table_from_uri(); afterwards, the destination table (table_id) is retrieved to report how many rows were loaded:
destination_table = client.get_table(table_id)

Now, let's create the code to actually load data into the BigQuery table, based on the sample code above. Set the project name, dataset name, table name, and the URI of the Cloud Storage file you are loading from to match your environment, then run the code from Cloud Shell and check that the data is actually loaded into the BigQuery table. You can save the modified code in a file such as "bqload.py" on Cloud Shell and run it manually for easy confirmation.
/// Sample code (modified version)

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# Set table_id to the ID of the table to load into.
table_id = "[your-project].bq_dataset.pos_salesdata"

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("sales_date", "DATE"),
        bigquery.SchemaField("store_code", "STRING"),
        bigquery.SchemaField("store_name", "STRING"),
        bigquery.SchemaField("product_code", "STRING"),
        bigquery.SchemaField("product_name", "STRING"),
        bigquery.SchemaField("sales_amount", "NUMERIC"),
        bigquery.SchemaField("sales_cnt", "INTEGER"),
    ],
    skip_leading_rows=1,
    # The source format defaults to CSV, so the line below is optional.
    source_format=bigquery.SourceFormat.CSV,
)

uri = "gs://[test-sample-gcp]/pos_salesdata_daily.csv"

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)  # Make an API request.
print("Loaded {} rows.".format(destination_table.num_rows))

Now, let's check whether the above code can actually be executed.

$ python3 bqload.py

After running the code and confirming that the data has been loaded into the BigQuery table, we need to modify it so that it can be executed as a function in Cloud Functions. Because the function will be triggered by a Pub/Sub message, we need to decode the base64-encoded data in the message, so we modify the code as follows.

/// Sample code (to be executed as a Cloud Functions function)

import base64

from google.cloud import bigquery


def bqload_func(event, context):
    pubsub_message = base64.b64decode(event['data']).decode('utf-8')
    print(pubsub_message)

    client = bigquery.Client()
    dataset_id = 'bq_dataset'
    dataset_ref = client.dataset(dataset_id)

    job_config = bigquery.LoadJobConfig()
    job_config.schema = [
        bigquery.SchemaField("sales_date", "DATE"),
        bigquery.SchemaField("store_code", "STRING"),
        bigquery.SchemaField("store_name", "STRING"),
        bigquery.SchemaField("product_code", "STRING"),
        bigquery.SchemaField("product_name", "STRING"),
        bigquery.SchemaField("sales_amount", "NUMERIC"),
        bigquery.SchemaField("sales_cnt", "INTEGER"),
    ]
    job_config.skip_leading_rows = 1
    # The source format defaults to CSV, so the line below is optional.
    job_config.source_format = bigquery.SourceFormat.CSV

    uri = "gs://test-sample-gcp/pos_salesdata_daily.csv"

    load_job = client.load_table_from_uri(
        uri,
        dataset_ref.table("pos_salesdata"),
        job_config=job_config
    )  # API request

    load_job.result()  # Waits for the table load to complete.

    destination_table = client.get_table(dataset_ref.table("pos_salesdata"))
    print("Loaded {} rows.".format(destination_table.num_rows))

The above code will be used later to configure Cloud Functions.

Steps to automate BigQuery load processing using Cloud Functions
Let's start with Cloud Pub/Sub and go
through the steps in order.
(1) Create a topic in Cloud Pub/Sub
・ Select "Pub/Sub" -> "Topic" from the
navigation menu.
・ Enter the following information on the
topic creation screen.
Topic ID: bqload-topic
Encryption: Unchecked
(2) Set up a trigger for Cloud Functions.
From the "Topic details" screen in Pub/Sub, click the "Trigger Cloud Function" link at the top of the screen to create a function.
Function name: bqload_func
Region: us-central1
Trigger type: Cloud Pub/Sub
Cloud Pub/Sub topic: bqload-topic (the topic you created in the previous step)
Retry on failure: unchecked
When you have completed the above settings, select "Save" and click "Next" at the bottom of the screen.
You will be taken to the function editing screen. Configure the settings as follows, edit main.py and requirements.txt, and deploy.
Runtime: Python 3.8
Entry point: bqload_func
Source code: Inline editor
Contents of main.py

import base64

from google.cloud import bigquery


def bqload_func(event, context):
    pubsub_message = base64.b64decode(event['data']).decode('utf-8')
    print(pubsub_message)

    client = bigquery.Client()
    dataset_id = 'bq_dataset'
    dataset_ref = client.dataset(dataset_id)

    job_config = bigquery.LoadJobConfig()
    job_config.schema = [
        bigquery.SchemaField("sales_date", "DATE"),
        bigquery.SchemaField("store_code", "STRING"),
        bigquery.SchemaField("store_name", "STRING"),
        bigquery.SchemaField("product_code", "STRING"),
        bigquery.SchemaField("product_name", "STRING"),
        bigquery.SchemaField("sales_amount", "NUMERIC"),
        bigquery.SchemaField("sales_cnt", "INTEGER"),
    ]
    job_config.skip_leading_rows = 1
    # The source format defaults to CSV, so the line below is optional.
    job_config.source_format = bigquery.SourceFormat.CSV

    uri = "gs://test-sample-gcp/pos_salesdata_daily.csv"

    load_job = client.load_table_from_uri(
        uri,
        dataset_ref.table("pos_salesdata"),
        job_config=job_config
    )  # API request

    load_job.result()  # Waits for the table load to complete.

    destination_table = client.get_table(dataset_ref.table("pos_salesdata"))
    print("Loaded {} rows.".format(destination_table.num_rows))
Contents of requirements.txt
google-cloud-bigquery==1.25.0
Paste the above code and deploy it.
(3) Publish the message.
Return to the Cloud Pub/Sub screen, and from the topic details screen, select "Publish message" -> "Publish one message".
In the "Publish message" screen, enter the following information and click "Publish".
Type of publication: 1 time
Message body: bqload
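Instead of publishing from the console, you can also publish a test message from Python with the google-cloud-pubsub library. This is only an optional sketch; replace the project ID placeholder with your own.

/// Sample code (publishing a test message to the topic from Python)
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("[your-project]", "bqload-topic")
# The message body is delivered to bqload_func as base64-encoded event['data'].
future = publisher.publish(topic_path, b"bqload")
print(future.result())  # prints the ID of the published message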
(4) Set up the function execution schedule of Cloud Functions in Cloud Scheduler.
Click "Create Job" in Cloud Scheduler, and configure the following settings on the job creation screen.
Name: bqload_scd
Frequency: 0 */1 * * * (runs every hour)
Time zone: Japan Standard Time
Target: Pub/Sub
Topic: bqload-topic
Payload: test
After creating the job, click the "Run now" button on the job screen, and confirm that the data from the target file in Cloud Storage has been loaded into the BigQuery table. (It is easier to confirm if you delete the target table beforehand.)
Chapter 4. The cost of building a data infrastructure

For the data analysis infrastructure built with Cloud Storage, BigQuery, Cloud Functions, etc. as described in this book, the basic cost of construction and operation is almost zero. (This excludes cases such as storing extremely large files in Cloud Storage or querying very large tables in BigQuery.)
When building and operating a data analytics infrastructure on GCP, the main cost drivers are storing data in Cloud Storage, executing queries (SQL) in BigQuery, and using other services such as Cloud Composer and Dataflow. However, by using services such as Cloud Scheduler and Cloud Functions as described in this book, the operational costs can be reduced dramatically. (That said, if complex ETL processing is required, it is recommended to introduce a separate ETL tool.)
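Since query execution is one of the main cost drivers, it is also worth knowing that BigQuery can estimate how many bytes a query would scan without actually running it. The following is a small sketch using a dry run; the query itself is just an example against the table built in this book.

/// Sample code (estimating query cost with a dry run)
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
query = """
SELECT store_name, SUM(sales_amount) AS total_sales
FROM `[your-project].bq_dataset.pos_salesdata`
GROUP BY store_name
"""
query_job = client.query(query, job_config=job_config)  # dry run: nothing is executed or billed
print("This query would process {} bytes.".format(query_job.total_bytes_processed))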
Especially in the phase where you are building a new data analysis infrastructure, there is almost no charge, so feel free to give it a try.

Chapter 5. Other things that are necessary for building a data infrastructure

ETL processing that occurs in data infrastructure construction
This book has not dealt with ETL processing at all, because it covers only the minimum infrastructure required for a data analysis infrastructure. However, ETL processing is always necessary in a data analysis infrastructure. If you implement it on GCP, you need a service to execute ETL processing or workflow processing, specifically Dataflow or Cloud Composer. However, building and operating a data analysis infrastructure with these services requires a considerable amount of engineering resources, so you do not need to get involved with them at the stage of casually building a new data analysis infrastructure. If you want to do ETL processing on the simple data analysis infrastructure configuration discussed in this book, I recommend using BigQuery's "scheduled queries" feature. Scheduled queries allow you to schedule SQL to be executed in BigQuery. You can use this feature to process and transform the data loaded into a BigQuery table by the Cloud Functions function and store the result in another table using SQL.
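As a concrete image, the SQL you would register as a scheduled query could look like the following sketch, which aggregates the loaded POS data into a daily summary table. The destination table name and the aggregation itself are just examples, and the snippet runs the query ad hoc from Python; register the same SQL as a scheduled query to run it periodically.

/// Sample code (an example transformation you could register as a scheduled query)
from google.cloud import bigquery

client = bigquery.Client()
# Aggregate the loaded POS data into a daily sales summary table.
query = """
CREATE OR REPLACE TABLE `[your-project].bq_dataset.pos_sales_daily_summary` AS
SELECT
  sales_date,
  store_code,
  store_name,
  SUM(sales_amount) AS total_sales_amount,
  SUM(sales_cnt) AS total_sales_cnt
FROM `[your-project].bq_dataset.pos_salesdata`
GROUP BY sales_date, store_code, store_name
"""
client.query(query).result()  # waits for the query to finish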
