OD M1 Introduction To Data Engineering
Introduction to Data Engineering
In this Introduction to Data Engineering module, we’ll describe the role of a data
engineer and explain why data engineering should be done in the cloud.
A data engineer is someone who builds data pipelines, and so we’ll start by looking at
what this means -- what kinds of pipelines a data engineer builds and their purpose.
We’ll look at the challenges associated with the practice of data engineering and how
many of those challenges are easier to address when you build your data pipelines in
the cloud.
Next, we’ll introduce you to BigQuery, Google Cloud’s petabyte-scale serverless data
warehouse.
Having defined what data lakes and data warehouses are, we’ll then discuss these in
more detail.
Data Engineers may be responsible for both the backend transactional database
systems that support a company's applications and the data warehouses that support
their analytic workloads. In this lesson, we’ll explore the differences between
databases and data warehouses and the Google Cloud solutions for each of these
workloads.
Since a data warehouse also serves other teams, it is crucial to learn how to partner
effectively with them.
As part of being an effective partner, your engineering team will be asked to set up
data access policies and overall governance of how data is to be used and NOT used
by your users. We’ll discuss how to provide access to the data warehouse while
keeping to data governance best practices.
We’ll also discuss productionizing the whole operation and automating and monitoring
as much of it as possible.
Finally, we’ll look at a case study of how a Google Cloud customer solved a specific
business problem, before you complete a hands-on lab where you will use BigQuery
to analyze data.
01 The Role of a Data Engineer
Let’s start by exploring the role of a data engineer in a little more detail.
Get the data to where it can be useful. Get the data into a usable condition.
So… how do we get the raw data from multiple systems, and where can we store it durably?
What does a data engineer do? A data engineer builds data pipelines.
Why does the data engineer build data pipelines? Because they want to get their data
into a place, such as a dashboard, report, or machine learning model, from which the
business can make data-driven decisions.
The data has to be in a usable condition so that someone can use this data to make
decisions. Many times, the raw data is by itself not very useful.
(Diagram: raw data from sources such as spreadsheets is replicated into a single location.)
One term you will hear a lot when you do data engineering is the concept of a data
lake.
A data lake brings together data from across the enterprise into a single location.
So, you might get the data from a relational database or from a spreadsheet, and
store the raw data in a data lake.
One option for this single location to store the raw data is to store it in a Cloud
Storage bucket.
What are the key considerations when deciding between data lake options? What do
you think?
https://cloud.google.com/solutions/build-a-data-lake-on-gcp#cloud-storage-as-data-lake
There are some considerations that you need to keep in mind as you build a data
lake.
1. Does your data lake handle all the types of data you have? Can it all fit into a
Cloud Storage bucket? If you have an RDBMS, you might need to put the data
in Cloud SQL, a managed database, rather than in Cloud Storage.
2. Can it elastically scale to meet demand? As the data you collect grows, will
you run out of disk space? This is more of a problem with on-premises
systems than with the cloud.
3. Does it support high-throughput ingestion? What is the network bandwidth?
Do you have edge points of presence?
4. Is there fine-grained access control to objects? Do users need to seek within a
file? Or is it enough to get a file as a whole? Cloud Storage is blob storage, so
you might need to think about the granularity of what you store.
5. Can other tools connect easily? How do they access the store? Don’t lose
sight of the fact that the purpose of a data lake is to make data accessible for
analytics.
https://cloud.google.com/solutions/build-a-data-lake-on-gcp#cloud-storage-as-data-lake
We mentioned our first Google Cloud product, the Cloud Storage bucket, which is a
good option for staging all of your raw data in one place before building
transformation pipelines into your data warehouse.
Why choose Cloud Storage? Businesses commonly use Cloud Storage as a backup
and archival utility. Because of Google’s many data center locations and high network
availability, storing data in a Cloud Storage bucket is durable and performant.
For a data engineer, you will often use a Cloud Storage bucket as part of your data
lake to store many different raw data files, such as CSV, JSON, or Avro. You could
then load them into, or query them directly from, BigQuery as your data warehouse.
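As a quick illustration, here is a minimal sketch of querying raw files in a bucket directly from BigQuery by defining an external table; the bucket path, dataset, and column names below are hypothetical, not from this course:

CREATE OR REPLACE EXTERNAL TABLE mydataset.raw_orders
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-data-lake-bucket/orders/*.csv'],
  skip_leading_rows = 1  -- assumes each CSV file has a header row
);

SELECT order_id, order_total
FROM mydataset.raw_orders
LIMIT 10;

The same pattern works for JSON or Avro files by changing the format option, or you can load the files into native BigQuery storage when query performance matters more than keeping the files in place.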
Later in the course, you’ll create Cloud Storage buckets using the Cloud Console and
the command line. Other Google Cloud products and services can easily query and
integrate with your bucket once you’ve set it up and loaded it with data.
Dataproc and Dataflow
Speaking of loading data, what if your raw data needs additional processing?
You may need to extract the data from its original location, transform it, and then load
it in.
One option is to carry out data processing. This is often done using Dataproc or
Dataflow. We’ll discuss using these products to carry out batch pipelines later in this
course.
But what if batch pipelines are not enough? What if you need real-time analytics on
data that arrives continuously and endlessly?
In that case, you might receive the data in Pub/Sub, transform it using Dataflow and
stream it into BigQuery.
02 Data Engineering Challenges
As a data engineer, you will usually encounter a few problems when building data
pipelines.
You might find it difficult to access the data that you need. You might find that the
data, even after you access it, doesn’t have the quality that’s required by the
analytics or machine learning model.
You plan to build a model, and even if the data quality is there, you might find that the
transformations require computational resources that aren’t available to you.
And finally, you might run into challenges around query performance and being able
to run all of the queries and all of the transformations that you need with the
computational resources that you have.
Let’s take the first challenge of consolidating disparate datasets, data formats, and
managing access at scale.
Data is scattered across Google Analytics 360, CRM, and Campaign Manager
products, among other sources. No common tool exists to analyze data and share
results with the rest of the organization.
For example, you want to compute the customer acquisition cost: how much does it
cost in terms of marketing and promotions and discounts to acquire a customer?
That data might be scattered across a variety of marketing products and customer
relationship management software, and finding a tool that can analyze all of this
data might be difficult, because it might come from different organizations, different
tools, and different schemas, and some of that data might not even be structured.
(Diagram: web catalog, inventory, and promotions source systems and sets of files feed an ETL process into a DSS database (TPC-DS), which is used for reports.)
So what makes data access so difficult? Primarily, this is because data in many
businesses is siloed by departments and each department creates its own
transactional systems to support its own business processes.
So, for example, you might have operational systems that correspond to your store
systems, a different operational system maintained by your product warehouses that
manages your inventory, and a marketing department that manages all the
promotions. Now suppose you need to run an analytics query across all of these, such as:
“Give me all the in-store promotions for recent orders and their inventory levels.”
(Diagram: the same source systems, annotated to show that the data is stored in separate systems with restricted access.)
You need to know how to combine data from the stores, from the promotions, and
from the inventory levels, and because these are all stored in separate
systems—some of which have restricted access—building an analytic system that
uses all three of these data sets to answer an ad hoc query like this can be very
difficult.
The second challenge is that cleaning, formatting and getting the data ready for
insights requires you to build ETL pipelines.
ETL pipelines are usually necessary to ensure data accuracy and quality. The cleaned
and transformed data are typically stored, not in a data lake, but in a data warehouse.
A data warehouse is a consolidated place to store the data, and all the data are easily
joinable and queryable. Unlike a data lake, where the data is in the raw format, in the
data warehouse, the data is stored in a way that makes it efficient to query.
Because data becomes useful only after you clean it up, you should assume that any
raw data that you collect from source systems needs to be cleaned and transformed.
And if you are transforming it, you might as well transform it into a format that makes it
efficient to query. In other words, ETL the data and store it in a data warehouse.
Let’s say you are a retailer and you have to consolidate data from multiple source
systems. Think about what the use case is.
Query: Give me the best performing in-store promotions in France.
Suppose the use case is to get the best performing in-store promotions in France.
You need to get the data from the stores and you have to get the data from the
promotions. But perhaps the store data is missing information.
Maybe some of the transactions are in cash, and for those, perhaps there is no
information on who the customer is; or some transactions might be spread over
multiple receipts, and you might need to combine those transactions because they
come from the same customer. Or perhaps the time stamps of the products are stored
in local time, whereas your stores are spread across the globe, and so before you can do
anything, you need to convert everything into UTC.
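As a minimal sketch of that kind of cleanup, assuming a hypothetical raw table with a local purchase DATETIME, the store's time zone, a customer ID, and an amount, you might normalize timestamps to UTC and combine a customer's receipts per day like this:

CREATE OR REPLACE TABLE dw.daily_customer_sales AS
SELECT
  customer_id,
  -- Interpret the local DATETIME in the store's time zone; the result is a UTC timestamp.
  DATE(TIMESTAMP(local_purchase_datetime, store_time_zone)) AS purchase_date_utc,
  -- Combine transactions that were split across multiple receipts from the same customer.
  SUM(amount) AS total_amount
FROM raw_sales.transactions
GROUP BY customer_id, purchase_date_utc;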
(Diagram: the same query, with a note that the promotion list is stored as .csv files or manual spreadsheets.)
Similarly, the promotions may not be stored in the transaction database at all. They
might be just a text file that somebody loads on their web page, and it has a list of
codes that are used by the web application to apply discounts.
It can be extremely difficult to do a query like finding the best performing in-store
promotions because the data has so many problems. Whenever you have data like
this, you need to get the raw data and transform it into a form with which you can
actually carry out the necessary analysis.
It is obviously best if you can do this sort of cleanup and consolidation just once, and
store the resulting data to make further analysis easy. That’s the point of a data
warehouse.
If you need to do so much consolidation and cleanup, a common problem that arises
is where to carry out this compute. The availability of computation resources can be a
challenge.
(Chart: on-premises compute capacity versus consumption over time. Under-provisioned capacity wastes time; under-utilized capacity wastes money.)
If you are on an on-premises system, data engineers will need to manage server and
cluster capacity and make sure that enough capacity exists to carry out the ETL jobs.
The problem is that the compute needed by these ETL jobs is not constant over time.
Very often, it varies week to week, depending on factors like holidays and
promotional sales. This means that when traffic is low, you are wasting money, and
when traffic is high, your jobs are taking way too long.
Once your data is in your data warehouse, you need to optimize the queries your
users are running to make the most efficient use of your compute resources.
If you’re managing an on-premises data analytics cluster, you will be responsible for
choosing a query engine, installing the query engine software and keeping it up to
date, and provisioning any additional servers for more capacity.
Isn’t there a better way to manage server overhead so we can focus on insights?
03 Introduction to BigQuery
There is a much better way to manage server overhead so we can focus on insights.
It is to use a serverless data warehouse.
● BigQuery replaces a typical data warehouse hardware setup.
● BigQuery organizes data tables into units called datasets.
● BigQuery defines schemas and issues queries directly on external data sources.
● Tables and views function the same way as in a traditional data warehouse.
● IAM grants permission to perform specific actions.
The BigQuery service replaces the typical hardware setup for a traditional data
warehouse. That is, it serves as a collective home for all analytical data in an
organization.
Datasets are collections of tables that can be divided along business lines or a given
analytical domain. Each dataset is tied to a Google Cloud project.
A data lake might contain files in Cloud Storage or Google Drive or even transactional
data from Cloud Bigtable. BigQuery can define a schema and issue queries directly
on external data as federated data sources.
Database tables and views function the same way in BigQuery as they do in a
traditional data warehouse, allowing BigQuery to support queries written in a standard
SQL dialect that is ANSI 2011 compliant.
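For example, a standard SQL query in BigQuery references a table as project.dataset.table. The query below runs against a well-known public dataset that Google hosts; any table you have access to works the same way:

SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_2013`
WHERE state = 'TX'
GROUP BY name
ORDER BY total DESC
LIMIT 10;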
● Performance tuning
● Resource provisioning
● Utilization improvements
● Handling growing scale
● Deployment & configuration
● Reliability
A key consideration behind agility is being able to do more with less, and it’s important
to make sure that you're not doing things that don’t add value.
If you do work that is common across multiple industries, it's probably not something
that your business wants to pay for.
The cloud lets you, the data engineer, spend less time managing hardware and more
time doing things that are much more customized and specific to the business.
On the cloud, you don’t have to be concerned about provisioning, reliability,
utilization improvements, or performance tuning, so you can spend all your time
thinking about how to get better insights from your data.
(Chart: resource allocation tracks consumption over time.)
You don't need to provision resources before using BigQuery, unlike many RDBMS
systems. BigQuery allocates storage and query resources dynamically based on your
usage patterns.
Storage resources are allocated as you consume them and deallocated as you
remove data or drop tables.
Query resources are allocated according to query type and complexity. Each query
uses some number of slots, which are units of computation that comprise a certain
amount of CPU and RAM.
04 Data Lakes and Data Warehouses
We’ve defined what a data lake is, and what a data warehouse is. Let’s look at these
in a bit more detail.
Recall that we emphasized that the data has to be in a usable condition so that
someone can use this data to make decisions. Many times, the raw data is by itself
not very useful.
(Diagram: raw data is replicated into the data lake; an Extract, Transform, Load pipeline moves it into the data warehouse.)
We said that raw data gets replicated and stored in a data lake. To make the data
usable, you will use Extract-Transform-Load, or ETL, pipelines and store the
transformed data in a data warehouse.
Let’s consider the key considerations when deciding between data warehouse
options.
Traditional data warehouses are hard to manage and operate. They were designed
for a batch paradigm of data analytics and for operational reporting needs. The data in
the data warehouse was meant to be used by only a few management folks for
reporting purposes. BigQuery is a modern data warehouse that changes the
conventional mode of data warehousing. Here we can see some of the key
comparisons between a traditional data warehouse and BigQuery.
BigQuery provides mechanisms for automated data transfer and powers business
applications using technology that teams already know and use, so everyone has
access to data insights. You can create read-only shared data sources that both
internal and external users can query, and make query results accessible for anyone
through user-friendly tools such as Looker, Google Sheets, Tableau, or Google Data
Studio.
BigQuery lays the foundation for AI. It’s possible to train Tensorflow and Google Cloud
Machine Learning models directly with data sets stored in BigQuery, and BigQuery
ML can be used to build and train machine learning models with simple SQL. Another
extended capability is BigQuery GIS, which allows organizations to analyze
geographic data in BigQuery, essential to many critical business decisions that
revolve around location data.
BigQuery also simplifies data operations by using Identity and Access
Management to control user access to resources, creating roles and groups and
assigning permissions for running jobs and queries in a project, and by providing
automatic data backup and replication.
(Diagram: BigQuery querying data in place in Cloud SQL, such as MySQL, and in Cloud Storage.)
Even though we talked about getting data into BigQuery by running ETL pipelines,
there is another option.
That is to treat BigQuery as a query engine and allow it to query the data in place. For
example, you can use BigQuery to directly query database data in Cloud SQL, that
is, managed relational databases like PostgreSQL and MySQL.
You can also use BigQuery to directly query files on Cloud Storage as long as these
files are in formats like CSV or Parquet.
The real power comes when you can leave your data in place and still join it against
other data in the data warehouse.
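As a sketch of what that can look like (the connection ID, dataset, and table names here are hypothetical), a single federated query can join live Cloud SQL data against a warehouse table using EXTERNAL_QUERY:

SELECT
  w.customer_id,
  w.lifetime_value,
  o.order_count
FROM `my_project.analytics.customer_ltv` AS w
JOIN EXTERNAL_QUERY(
  'my_project.us.my_cloudsql_connection',
  -- This inner query runs inside the Cloud SQL database itself.
  'SELECT customer_id, COUNT(*) AS order_count FROM orders GROUP BY customer_id'
) AS o
ON w.customer_id = o.customer_id;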
05 Transactional Databases Versus Data Warehouses
Data engineers may be responsible for both the back end transactional database
systems that support your company's applications AND the data warehouses that
support your analytic workloads. In this lesson, you’ll explore the differences between
databases and data warehouses and the Google Cloud solutions for each workload.
Cloud SQL
● Automatic encryption
● 64 TB storage capacity
If you have SQL Server, MySQL, or PostgreSQL as your relational database, you can
migrate it to Cloud SQL which is Google Cloud’s fully managed relational database
solution.
One question you might get asked is: “Why not simply use Cloud SQL for
reporting workflows? You can run SQL directly on the database, right?”
https://cloud.google.com/products/databases/
This is a great question and will be answered in greater detail in the Building a Data
Warehouse module. Google Cloud has several options for RDBMSs, including
Spanner and Cloud SQL.
The fundamental architecture of these data storage options is quite different. Cloud
SQL databases are RECORD-based storage -- meaning the entire record must be
read from disk, even if you selected only a single column in your query.
BigQuery is COLUMN-based storage, which, as you might guess, allows for really
wide reporting schemas, since you can simply read individual columns from disk.
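A small, hypothetical illustration of why that matters: selecting only the columns a report needs means BigQuery reads only those columns, while SELECT * reads every column of a wide table.

-- Reads just two columns from a very wide reporting table.
SELECT order_id, order_total
FROM `my_project.sales.orders_wide`;

-- Reads every column in the table, even the ones the report never uses.
SELECT *
FROM `my_project.sales.orders_wide`;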
(Diagram: store, inventory, and promotions transactional systems.)
This isn’t to say RDBMSs aren’t as performant as Data Warehouses -- they serve two
different purposes. An RDBMS helps your business manage new transactions.
Take this point-of-sale terminal at a storefront. Each order and product is likely written
out as new records in a relational database somewhere. This database may store all
of the orders received from their website, all of the products listed in the catalog, or
the number of items in their inventory.
This is so that when an existing order is changed, it can be quickly updated in the
database. Transactional systems allow for a single row in a database table to be
modified in a consistent way. They also are built on certain relational database
principles, like referential integrity, to guard against cases like a customer ordering a
product that doesn’t exist in the product table.
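To make that concrete, here is a minimal, hypothetical sketch of the kind of constraint and single-row update a transactional database such as Cloud SQL handles; this is ordinary relational SQL that would run in PostgreSQL or MySQL, not in BigQuery:

-- Referential integrity: an order line must reference a product that exists.
CREATE TABLE order_items (
  order_id   INT NOT NULL,
  product_id INT NOT NULL REFERENCES products (product_id),
  quantity   INT NOT NULL
);

-- A single existing row is updated quickly and consistently when an order changes.
UPDATE order_items
SET quantity = 3
WHERE order_id = 1042
  AND product_id = 7;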
So where does all this raw data end up in our data lake and data warehouse
discussion? What’s the complete picture?
(Diagram: the web catalog, inventory, and promotions source systems feed the data lake and then the data warehouse; downstream, an engineering pipeline feeds another team’s data warehouse, and a BI pipeline feeds reporting.)
Here it is. Our operational systems, like our relational databases that store online
orders, inventory, and promotions, are our raw data sources on the left. Note that this
isn’t exhaustive -- you could have other source systems that are manual like CSV files
or spreadsheets too.
These upstream data sources get gathered together in a single consolidated location
in our data lake which is designed for durability and high availability.
Once in the data lake, the data often needs to be processed via transformations that
then output the data into our data warehouse where it is ready for use by
downstream teams.
Three quick examples of other teams that often build pipelines on our data
warehouse are machine learning teams, business intelligence teams, and other data
engineering teams.
So who works on these teams and how do they partner with our data engineering
team?
06 Partner Effectively with Other Data Teams
Since a data warehouse also serves other teams, it is crucial to learn how to partner
effectively with them.
Remember that once you’ve got data where it can be useful, and it’s in a usable
condition, we need to add new value to the data through analytics and machine
learning. What teams might rely on our data?
There are many data teams that rely on your data warehouse and partnerships with
data engineering to build and maintain new data pipelines.
Let’s examine how each of these roles interacts with your new data warehouse and
how data engineers can best partner with them.
(Diagram: raw data flows into the data lake and data warehouse, then through a feature pipeline into an ML model.)
As you’ll see in our course on machine learning, an ML team’s models rely on having
lots of high quality input data to create, train, test, evaluate, and serve their models.
They will often partner with data engineering teams to build pipelines and datasets for
use in their models.
One question the ML team will often ask is: “How long does it take for a transaction
to make it from raw data all the way into the data warehouse?” They’re asking this
because any data that they train their models on must also be available at
prediction time. If there is a long delay in collecting and aggregating the raw data, it
will impact the ML team’s ability to create useful models.
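One hedged way to answer that question, assuming your warehouse table records an ingestion timestamp (the table and column names here are hypothetical), is to measure the lag directly:

SELECT
  TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(ingestion_time), MINUTE) AS minutes_behind
FROM `my_project.analytics.transactions`;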
A second question that you will definitely get asked is how difficult it would be to add
more columns or rows of data into certain datasets. Again, the ML team relies on
teasing out relationships between the columns of data and having a rich history to
train models on. You will earn the trust of your partner ML teams by making your
datasets easily discoverable, documented, and available for them to experiment
on quickly.
SELECT *
FROM ML.EVALUATE(MODEL `bqml_tutorial.sample_model`,
  TABLE eval_table)
A unique feature of BigQuery is that you can create high-performing machine learning
models directly in BigQuery using just SQL, with BigQuery ML. Here is the model
code to CREATE a model, EVALUATE it, and then MAKE predictions. You’ll
see this again in our lectures on machine learning later on.
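As a sketch of all three steps, keeping the tutorial’s model name but using hypothetical training and input tables and columns:

-- 1. CREATE a model: train a logistic regression on labeled data.
CREATE OR REPLACE MODEL `bqml_tutorial.sample_model`
OPTIONS (model_type = 'logistic_reg') AS
SELECT
  label,  -- the column to predict
  feature_1,
  feature_2
FROM `bqml_tutorial.training_table`;

-- 2. EVALUATE the model against held-out data.
SELECT *
FROM ML.EVALUATE(MODEL `bqml_tutorial.sample_model`,
  TABLE `bqml_tutorial.eval_table`);

-- 3. MAKE predictions on new rows.
SELECT *
FROM ML.PREDICT(MODEL `bqml_tutorial.sample_model`,
  TABLE `bqml_tutorial.new_data`);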
(Diagram: raw data flows into the data lake and data warehouse, then through a BI pipeline into reporting dashboards.)
Other critical stakeholders are your business intelligence and data analyst teams that
rely on good clean data to query for insights and build dashboards.
These teams need datasets that have clearly defined schema definitions, the ability to
quickly preview rows, and the performance to scale to many concurrent dashboard
users.
BigQuery BI Engine
● No need to manage OLAP cubes or separate BI servers for dashboard performance.
● Natively integrates with BigQuery streaming for real-time data refresh.
(Diagram: batch or streaming data flows into BigQuery and BI Engine, which serve Looker, Studio, Sheets, and partner BI tools.)
One of the Google Cloud products that helps manage the performance of dashboards
is BigQuery BI Engine. BI Engine is a fast, in-memory analysis service that is built
directly into BigQuery and available to speed up your business intelligence
applications.
Historically, BI teams would have to build, manage, and optimize their own BI servers
and OLAP cubes to support reporting applications. Now, with BI Engine, you can get
sub-second query response time on your BigQuery datasets without having to create
your own cubes. BI Engine is built on top of the same BigQuery storage and compute
architecture and serves as a fast, in-memory, intelligent caching service that maintains
state.
https://www.youtube.com/watch?v=TqlrIcmqPgo
(Diagram: raw data flows into the data lake and data warehouse, then through a pipeline into another team’s data warehouse.)
One last group of stakeholders are other data engineers that rely on the uptime and
performance of your data warehouse and pipelines for their downstream data lakes
and data warehouses.
They will often ask “How can you ensure that the data pipeline we depend on will
always be available when we need it?”
Or, “We are noticing high demand for certain really popular datasets. How can you
monitor and scale the health of your entire data ecosystem?”
● One popular way is to use the built-in Cloud Monitoring of all resources on
Google Cloud.
● Since Google Cloud Storage and BigQuery are resources, you can set up
alerts and notifications for metrics like “Statement Scanned Bytes” or “Query
Count” so you can better track usage and performance.
● Here are two other reasons why Cloud Monitoring is used: one is to track
spending across all the different resources used, and two is to see the billing
trends for your team or organization.
● And lastly, you can use the Cloud Audit Logs to view actual query job
information to see granular level details about which queries were executed
and by whom. This is useful if you have sensitive datasets that you need to
monitor closely. A topic we will discuss more next.
https://cloud.google.com/bigquery/docs/monitoring
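As one complementary, hedged example: in addition to Cloud Audit Logs, BigQuery exposes job metadata through INFORMATION_SCHEMA, so (assuming your jobs run in the US multi-region) you can query which users ran which queries and how many bytes each one scanned:

SELECT
  user_email,
  job_id,
  total_bytes_processed,
  creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY total_bytes_processed DESC
LIMIT 20;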
07 Data Access and Governance
As part of being an effective partner, your engineering team will be asked to set up
data access policies and overall governance of how data is to be used and NOT used
by your users.
This is what we mean when we say a data engineer must manage the data. This
includes critical topics, such as privacy and security. What are some key
considerations when managing certain datasets?
Who should have access?
(Diagram: Data Catalog alongside the engineering pipeline and reporting dashboards.)
One solution for data governance is to use Data Catalog together with the Cloud Data
Loss Prevention API.
● Data Catalog makes all the metadata about your datasets available for your
users to search. You can group datasets together with tags, flag certain
columns as sensitive, and so on.
● Why is this useful? If you have many different datasets with many different
tables -- which different users have different access levels to -- the Data
Catalog provides a single unified user experience for discovering those
datasets quickly. No more hunting for specific table names in SQL first.
● Often used in conjunction with Data Catalog is the Cloud Data Loss
Prevention API, or DLP API, which helps you better understand and manage
sensitive data. It provides fast, scalable classification and redaction for
sensitive data elements like credit card numbers, names, social security
numbers, US and selected international identifier numbers, phone numbers,
and Google Cloud credentials.
08 Build Production-ready Pipelines
Once your data lakes and data warehouses are set up and your governance policy is
in place, it’s time to productionalize the whole operation and automate and monitor as
much of it as we can.
That’s what we mean when we say productionalize the data process. It has to be an
end-to-end and scalable data processing system.
(Diagram: raw data flows into the data lake and data warehouse; an engineering pipeline feeds another team’s data warehouse, and a BI pipeline feeds reporting dashboards.)
How can we ensure pipeline health and data cleanliness?
Are we using the latest data engineering tools and best practices?
Your data engineering team is responsible for the health of the plumbing -- that is the
pipelines -- and ensuring that the data is available and up-to-date for analytic and ML
workloads.
(Diagram: application server on App Engine, database server, recommendation API, ML data, and Cloud Storage.)
Cloud Composer helps your data engineering team orchestrate all the pieces to the
data engineering puzzle that we have discussed so far (and even more that you
haven’t come across yet!).
For example, when a new CSV file gets dropped into Cloud Storage, you can
automatically have that trigger an event that kicks off a data processing workflow and
puts that data directly into your data warehouse.
The power of this tool comes from the fact that Google Cloud big data products and
services have API endpoints that you can call.
A Cloud Composer job can then run every night or every hour and kick off your entire
pipeline from raw data, to the data lake, and into the data warehouse for you.
We’ll discuss workflow orchestration in greater detail in later modules and you’ll do a
lab on Cloud Composer as well.
09 Google Cloud Customer Case Study
We have looked at a lot of different aspects of what a data engineer has to do. Let’s
look at a case study of how a Google Cloud customer solves a specific business
problem. That will help tie all these different aspects together.
“We believe that users with a wide range of technical skills should be able to discover data
and have access to SQL-based analysis and visualization tools that perform well”
-- Twitter
(Diagram: Twitter’s pipeline components: Vertica, HDFS, Cloud Replicator, Cloud Storage, BigQuery, the BigQuery blaster loader, Airflow, and Google Data Studio.)
Twitter has large amounts of data, and they also have high-powered sales teams and
marketing teams, which for a long time did not have access to the data and couldn’t
use that data for carrying out the analysis that they wanted to be able to do.
Much of the data was stored on Hadoop clusters that were completely over-taxed. So
Twitter replicated some of that data from HDFS on to Cloud Storage, loaded it into
BigQuery, and provided BigQuery to the rest of the organization. These were some of
the most frequently requested data sets within Twitter, and they discovered that, with
ready access to the data, many people who were not data analysts are now analyzing
data and making better decisions as a result.
For more information, a link to the blog post is available in the PDF version of this
content under Course Resources.
https://blog.twitter.com/engineering/en_us/topics/infrastructure/2019/democratizing-data-analysis-with-google-bigquery.html
Recap
● Data sources
● Data lakes
● Data warehouses
Concept review
(Diagram: sources such as sensors and journals feed stream processing with Pub/Sub and Dataflow; refined data lands in a real-time store (Cloud Bigtable) and in Cloud Spanner and Cloud Storage; data sources also feed ML training via Dataproc.)
● Recall that your data sources are your upstream systems, like RDBMSs and
other raw data that comes from your business in different formats.
● Data lakes -- your consolidated location for raw data that is durable and highly
available. In this example our data lake is Cloud Storage.
● And data warehouses which are the end result of preprocessing the raw data
in your data lake and getting it ready for analytic and ML workloads.
You’ll notice a lot of other Google Cloud product icons here like batch and streaming
data into your lake and running ML on your data. We’ll cover those topics in detail
later in this course.
https://github.com/priyankavergadia/google-cloud-4-words
https://cloud.google.com/blog/topics/developers-practitioners/back-popular-demand-google-cloud-products-4-words-or-less-2022-edition
Lab Intro
Using BigQuery to do Analysis
Now it's time to practice analyzing data with BigQuery in your lab. In this lab, you will
execute interactive queries in the BigQuery console, and then combine and run
analytics on multiple datasets.