How Google Big Query Changed The Game
TABLE OF CONTENTS
Introduction
The Normal Approach To Database Work
BigQuery Does It Differently
BigQuery Reality
Public Datasets
Federated Tables
Conclusion
The rules for data access are different now, data is available now like never before, and we are at the
beginning of something “data amazing”.
INTRODUCTION
Databases are wonderful places to organize, store, analyze and report on data. The approach to getting
answers from a database has been consistent across all data warehousing technologies for the last 25
years, and only with the advent of Google Cloud Platform (GCP) BigQuery has this changed. The effort that
goes into joining large, disparate datasets to provide a single linked view for analysis and reporting has
gone from huge to negligible. It’s a new data world for us, as one of the most fundamental guiding
principles of data warehousing has now changed.
THE NORMAL APPROACH TO DATABASE WORK
Until now, the only way to get reports that link large amounts of data, such as customers, products and
sales by store, has been to bring this data into a single, central database and query it to build reports.
A picture I’m sure most readers would be familiar with:
This works for reporting on data which is internal to your business and available to be pulled into a central
database. However, you need to build the ETL (extract, transform, load) pipeline, which is not trivial. This
becomes more complex when trying to link external data. You find a source for the data you need, then
build an ETL pipeline from that source into your central database. This is normally much more involved
than reading from an internal source. Let’s take the example of linking your internal customer, sales and
store data with local weather data and comparing that against vendor-predicted sales.
Pulling data from outside your business and loading it into your central database for reporting is normally
a large effort. Most people reading this have firsthand experience of the challenges of building ETL
pipelines, and of the effort and risk involved in making data available in a central database for reporting
and analytics. There is an endless list of problems to be encountered while doing this. For any given ETL,
just consider the design, approvals, build, QA and governance related to extracting, transforming,
combining, validating and publishing the data.
Large-footprint, expensive tools such as Informatica and Talend support a hugely profitable business
model of selling software that requires specific training to build and manage complex ETL pipelines. I
cannot overstate the build and maintenance cost of ETL for reporting and analytics; it is often
underestimated and under-appreciated.
Another factor is the time it takes to complete the build and implementation of a new ETL. Requesting
and allocating resources, provisioning systems, gaining approvals, scoping the requirements and producing
the documentation, not to mention designing and building the actual ETL, all take time. To further push
out the timeline, there is a direct correlation between the size of a company and the complexity
(bureaucracy) of the process around sourcing and loading data, which makes the process longer for bigger
companies. So a simple request from a data scientist for access to public or vendor data to build a model
can leave them waiting months.
BIGQUERY DOES IT DIFFERENTLY
Here's the game changer... ANY object in BigQuery is available for use by ANY BigQuery user, restricted
only by permissions.
Imagine a business storing data in BigQuery being as commonplace as having a website, with anyone who
wants access to that data just a credit card away. Customer lists, sales transactions, document contents,
pictures, sound recordings, music, videos, vendors, products, invoices, shipping notices, sales forecasts,
questionnaires, demographic models, user activity… any data you can imagine that a business tracks and
stores internally could seamlessly be made available to everyone for a simple, cheap, pay-per-usage fee.
For our weather data example, find your source, figure out your requirements and get approval to spend
as you normally would, but don’t worry about assembling a project team to build your ETL.
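To make the weather example concrete, here is a minimal sketch of what the cross-project join might look like. The project, dataset, table and column names (`my-company.sales.daily_store_sales`, `weather-vendor.observations.daily`, `store_zip`, `avg_temp_c` and so on) are hypothetical placeholders, not real sources. The only BigQuery convention relied on is the fully qualified `project.dataset.table` name, which is what lets one query reach a table owned by someone else, subject to permissions.

```python
def build_weather_join_sql(sales_table: str, weather_table: str) -> str:
    """Return SQL joining internal sales to an externally owned weather table.

    Both arguments are fully qualified `project.dataset.table` names. The
    column names used below are hypothetical and must be adapted to the
    real schemas.
    """
    for name in (sales_table, weather_table):
        # A fully qualified name here has exactly three dot-separated parts.
        if len(name.split(".")) != 3:
            raise ValueError(f"expected project.dataset.table, got {name!r}")
    return (
        "SELECT s.store_id, s.sale_date, s.total_sales, w.avg_temp_c\n"
        f"FROM `{sales_table}` AS s\n"
        f"JOIN `{weather_table}` AS w\n"
        "  ON s.store_zip = w.zip_code AND s.sale_date = w.obs_date"
    )


if __name__ == "__main__":
    print(build_weather_join_sql(
        "my-company.sales.daily_store_sales",  # hypothetical internal table
        "weather-vendor.observations.daily",   # hypothetical vendor table
    ))
```

The point of the sketch is that the vendor's table is referenced in place; nothing is copied into your own project before the join runs.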
BIGQUERY REALITY
Once you have an account with GCP, you pay through your existing billing project for access to any data
you have permission to read. The cost is either pay-per-use, based on bytes read, or a monthly flat fee.
Incorporating the newly sourced weather data into your report is a matter of understanding the new data
and using it.
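Because the pay-per-use cost is driven by bytes read, a rough estimate is simple arithmetic. The sketch below is illustrative only: the function name is made up for this example, and the price per tebibyte is a parameter you must supply from the current pricing page, since the value used here (5.00) is an assumed placeholder rather than a quoted rate.

```python
TIB = 2 ** 40  # number of bytes in one tebibyte


def estimate_query_cost(bytes_read: int, price_per_tib: float) -> float:
    """Estimate the pay-per-use cost of a query from the bytes it reads.

    `price_per_tib` is an assumption supplied by the caller; it is not
    looked up anywhere and should be taken from current pricing.
    """
    if bytes_read < 0 or price_per_tib < 0:
        raise ValueError("bytes_read and price_per_tib must be non-negative")
    return bytes_read / TIB * price_per_tib


if __name__ == "__main__":
    # A query scanning 250 GiB at an assumed price of 5.00 per TiB.
    scanned = 250 * 2 ** 30
    print(f"estimated cost: {estimate_query_cost(scanned, 5.00):.2f}")
```

Selecting only the columns a report needs reduces the bytes read, and so the estimate, which is why the query shape matters as much as the data size.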
PUBLIC DATASETS
An incredible idea.
These are datasets where Google pays for the storage and you pay for usage by bytes read. BigQuery has
a public dataset marketplace, which currently lists 147 datasets covering fascinating topics such as
climate, science, health and finance. https://console.cloud.google.com/marketplace
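As a sketch of what using a public dataset could look like, the snippet below builds a query against the NOAA GSOD weather tables in the `bigquery-public-data` project and, optionally, asks BigQuery for a dry-run estimate of the bytes the query would read. It assumes the `google-cloud-bigquery` Python package and default credentials are available. The per-year table naming (`noaa_gsod.gsodYYYY`) and the column names (`stn`, `temp`) are assumptions that should be checked against the current schema.

```python
def gsod_avg_temp_sql(year: int) -> str:
    """Build a query for the average temperature per station in one year."""
    if not isinstance(year, int):
        raise TypeError("year must be an int")
    # Assumed naming: one table per year, e.g. noaa_gsod.gsod2020.
    table = f"bigquery-public-data.noaa_gsod.gsod{year}"
    return (
        "SELECT stn, AVG(temp) AS avg_temp\n"
        f"FROM `{table}`\n"
        "GROUP BY stn\n"
        "ORDER BY avg_temp DESC\n"
        "LIMIT 10"
    )


def dry_run_bytes(sql: str) -> int:
    """Return the bytes the query would process, without running it."""
    # Imported lazily so the SQL builder works without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses the default project and credentials
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed


if __name__ == "__main__":
    print(gsod_avg_temp_sql(2020))
```

Calling `dry_run_bytes(gsod_avg_temp_sql(2020))` in an authenticated environment would report the bytes to be read before any usage is charged.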
FEDERATED TABLES
Querying data outside a database with incredible power.
The “Federated Tables” feature lets you link to an external data source and query it directly, even
though the data is not stored in BigQuery. Instead of loading or streaming the data, you create a table
that references the external data source.
BigQuery offers support for querying data directly from:
- Bigtable
- Cloud Storage, e.g. Hive data from Dataproc, etc.
- Google Drive, e.g. Google Sheets
- Cloud SQL (beta)
https://cloud.google.com/bigquery/external-data-sources
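For the Cloud Storage case in the list above, creating a table that references external data might be sketched as follows. The helper builds a `CREATE EXTERNAL TABLE` statement with `format` and `uris` options. The helper name, the dataset and table name, and the bucket path are hypothetical, and no column list is given, so the schema handling should be checked for the file format you use.

```python
from typing import List


def external_table_ddl(table: str, file_format: str, uris: List[str]) -> str:
    """Build DDL for a table that references files held outside BigQuery."""
    if not uris:
        raise ValueError("at least one source URI is required")
    # Each URI becomes a quoted element of the `uris` array option.
    uri_list = ", ".join(f"'{uri}'" for uri in uris)
    return (
        f"CREATE EXTERNAL TABLE `{table}`\n"
        f"OPTIONS (format = '{file_format}', uris = [{uri_list}])"
    )


if __name__ == "__main__":
    print(external_table_ddl(
        "my_dataset.store_weather",         # hypothetical dataset.table
        "CSV",
        ["gs://my-bucket/weather/*.csv"],   # hypothetical bucket path
    ))
```

Once such a table exists, it can be referenced in a query like any other table, while the files stay where they are.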
IMAGINE
CONCLUSION
The rules for data access are different now, data is available now like never before, and
we are at the beginning of something “data amazing”.
Having spent 25 years working in this space, I feel this is one of those moments worth writing about: the
beginning of a technology wave that will forever change data warehousing.
Google BigQuery has done to data what Google.com did to the Internet.
Thank you for reading my white paper and please consider joining my LinkedIn group: Google Cloud
BigQuery.