Building a data lakehouse on Google Cloud Platform

Rachel Levy, Steve Thill, and Firat Tekiner
September 2021
For example, our data centers enable a Dataproc environment to connect to either Google Cloud Storage or the BigQuery storage subsystem and read/write data at storage speeds, thanks to a network that achieves petabit bisection bandwidth. This allows Spark developers to leverage data inside BigQuery without the need for data duplication and cumbersome ETL operations. The speed of the internal Google network enables your organization to bring the processing to the data and avoid data duplication, reducing data latency, processing time, data discrepancies, and cost. In addition, Dataplex, our intelligent data fabric, enables you to manage your distributed data assets while making data securely accessible to all your analytics tools.

Dataplex provides metadata-led data management with built-in data quality and governance capabilities. With trust in the data you have, you spend more time deriving value out of the data and less time wrestling with infrastructure boundaries and inefficiencies.

Additionally, with the integrated analytics experience provided by Dataplex, you can rapidly create, secure, integrate, and analyze your data at scale. Finally, you can build an analytics strategy that augments existing architecture and meets your financial governance goals.
Introduction
In the ever-evolving world of data architectures and ecosystems, there is a growing suite of tools being offered to enable data management, governance, scalability, and even machine learning. With promises of digital transformation and evolution, organizations often find themselves with sophisticated solutions that have a considerable amount of bolt-on features. However, the ultimate goal should be to simplify the underlying infrastructure and thus enable teams to focus on bringing value to the business. Data engineers should be able to focus on making the raw data more useful to an organization. Data scientists should be able to focus on looking at the data, using tools to exploit hidden information, and producing predictive data models.
Figure 1: BigQuery at the center of the data ecosystem, connecting data catalog & governance, data preparation, business intelligence, and real-time streaming & ingestion.
Google Cloud has taken this approach by using our planet-scale analytics platform to bring together two foundational solutions. The core capabilities for enterprise data operations, data lakes, and data warehouses have been unified, simplifying the management tasks while increasing the value. The centerpiece of this architectural revolution is BigQuery. As shown in Figure 1, BigQuery is at the center of our customers' data ecosystem because it is both tightly integrated with Google Cloud and open to partners' technologies. BigQuery provides the lakehouse architecture, which brings the best of the lake and the warehouse without the overhead of both. This unlocks the value of data and ensures a unified governance approach with tools such as Dataplex and Analytics Hub.

Data warehouses are systems that came about when business leaders were looking to gain analytical insight from operational data stores. The legacy systems that may have worked for the past 40 years have proven to be expensive and often cannot address the challenges around data freshness, scaling, and high costs. Also, data warehouses only fit the needs of tabular data, limiting their usability for a rapidly growing variety of data types and structures. Schema was often applied on write and was driven by a specific analytic use case. This also limited the flexibility of future use of the data for things such as machine learning and advanced analytics.

To solve some data warehouse limitations, new technologies (such as Hadoop-powered data lakes) were developed and ushered in the big data era. For example, data lakes were developed as low-cost storage solutions that essentially amounted to distributed storage of files. They looked great on paper by promising low cost and the ability to scale.

In reality, these promises were not realized for many organizations, mainly because the lakes were not easily operationalized, productionized, or utilized. To interact with data in the lake, an end user had to be fairly proficient in particular coding paradigms, which limited the set of people who could use the data. All of this, in turn, increased the total cost of ownership. There were also significant data governance challenges created by data lakes: they did not work well with existing identity and access management (IAM) and security models. Furthermore, they ended up creating data silos because data was not easily shared across the Hadoop environment.

During the big data era, these two systems co-existed and complemented each other as the two main database management systems of enterprises, residing side by side. Traditionally, structured and processed data was stored in the data warehouse. On the other hand, data lakes provided the ability to land raw data without having to create a schema. This model created silos between teams. Essentially, data warehouse users were closer to the business and had ideas about how to improve analysis, often without the ability to explore the data to drive a deeper understanding. Data lake users, conversely, were closer to the raw data and had the tools and capabilities to explore it. However, they spent so much time doing this that they were consequently more focused on the data itself than on the business.

The architecture of a data lakehouse reduces operational costs, simplifies transformation processes, and enhances governance. This model is built on the convergence of data lakes and warehouses, as well as of data teams across organizations. In essence, it implements warehouse-like data structures and data management functions on the low-cost storage that is typical of data lakes.
BigQuery as a data lakehouse

To make it possible for all users to have access to the same underlying data, the data lakehouse takes advantage of BigQuery's storage and compute power to use views rather than materialized tables. This is important because a data lakehouse has the same storage subsystem, enabling shared storage behind the views to minimize unnecessary data replication. This is all done in BigQuery, without the standard storage premium often associated with traditional data warehouses. The permanent location …

Figure 2: A data pipeline feeding BigQuery, which serves general user reports and business user reports (Workday reports, SAP financial reports, JDA).
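To make the views-over-shared-storage idea concrete, here is a minimal sketch using the google-cloud-bigquery Python client; the project, dataset, and table names are hypothetical, not from the paper.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

# A logical view copies no data: queries against it read the shared
# underlying BigQuery storage at query time.
view = bigquery.Table("my-project.curated.active_customers")
view.view_query = """
    SELECT customer_id, region, lifetime_value
    FROM `my-project.raw.customers`
    WHERE status = 'active'
"""
client.create_table(view)  # raises if the view already exists
```

Because the view is only a stored query, every team that selects from it sees the same underlying rows, with no copy to keep in sync.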
A modern data warehouse like BigQuery can handle massive data volumes and has cost parity with other data storage mechanisms such as Cloud Storage. This reduces operational costs, simplifies transformation processes, and enhances governance. Furthermore, the data warehouse is then used as the data fabric for all the datasets (that are kept and governed in it). In many organizations adopting data lakehouses, the centralized analytics or IT team ingests the data from source systems and provides a standardized set of views that various teams can then leverage for their own use cases. An example of these views is in Figure 3.
In this example, IT teams can ingest data into the bronze layer and utilize views to cleanse the data through the staging/silver layer (see Figure 3), while use-case-specific, curated views can then be created in the teams' own projects. A deep dive into each of the architectural components of a data lakehouse on Google Cloud is found in the next sections.

Traditionally, with data warehousing, compute and storage were scarce resources that teams competed for. When those resources were no longer available, it often led to fragmentation of resources (data marts). These arbitrary resource constraints tied the ability to unlock the data's value to the capacity of the hardware, rather than the capacity of the imagination. Data lakehouses, specifically on Google Cloud, remove these artificial constraints by providing nearly limitless and instantaneous scalability, due in large part to the separation of compute and storage.

BigQuery's separation of storage and compute allows BigQuery compute to be brought to other storage mechanisms through federated queries, and allows other compute paradigms to use data stored in native BigQuery format through the Storage API. The Storage API lets the storage (which is separated from the compute clusters) be treated like structured data in a lake. Rather than reading in Parquet or Avro files, Dataproc, Google Cloud's managed Hadoop, can read the data directly from BigQuery storage, run its computations, and write the results back to BigQuery. An example of this is seen in Figure 4.

Figure 4: Processing engines brought to BigQuery storage, with processing separated from storage.
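As a sketch of that flow, the PySpark snippet below reads a BigQuery table through the open source spark-bigquery connector, aggregates it, and writes the result back. The project, table, and bucket names, and the connector version, are illustrative assumptions rather than values from the paper.

```python
from pyspark.sql import SparkSession

# Connector version is illustrative; pick one matching your Spark/Scala build.
spark = (SparkSession.builder
         .appName("lakehouse-spark-example")
         .config("spark.jars.packages",
                 "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.36.1")
         .getOrCreate())

# Read directly from BigQuery storage via the Storage Read API,
# instead of exporting Parquet/Avro files first.
orders = (spark.read.format("bigquery")
          .option("table", "my-project.lakehouse.orders")
          .load())

daily_totals = orders.groupBy("order_date").sum("amount")

# Write the result back to BigQuery; indirect writes stage data in GCS.
(daily_totals.write.format("bigquery")
 .option("table", "my-project.lakehouse.daily_totals")
 .option("temporaryGcsBucket", "my-temp-bucket")
 .mode("overwrite")
 .save())
```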
Separation of compute and storage is key to managing resources in the cloud. It enables resource sharing across applications with reduced overhead. Furthermore, it enables setting budgets at various levels and stages, and it is possible to define caps for different workloads while meeting SLAs. For example, flex slots in BigQuery simultaneously provide SLA guarantees and elasticity: resource guarantees can be made at minute, monthly, or yearly intervals at the BigQuery level. On the other hand, ephemeral Dataproc clusters let you bring up complex Hadoop (including Spark) clusters in a matter of seconds, run the required workload, and shut the cluster down. A combination of these means you can manage overruns and handle unexpected spikes without requiring considerable capital investment.

This is how the lines that have traditionally been drawn between data lakes and data warehouses can start to blur.

Figure 5: Slot reservations shared across workloads (IT, reporting, analysis, customer, external).
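A hedged sketch of the ephemeral-cluster pattern with the google-cloud-dataproc Python client follows: create a cluster, run one job, and delete the cluster so no idle capacity is billed. The project, region, bucket, and machine shapes are hypothetical.

```python
from google.cloud import dataproc_v1

project, region = "my-project", "us-central1"  # hypothetical values
opts = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
clusters = dataproc_v1.ClusterControllerClient(client_options=opts)
jobs = dataproc_v1.JobControllerClient(client_options=opts)

# 1. Bring up a short-lived cluster sized for this workload only.
cluster = {
    "project_id": project,
    "cluster_name": "ephemeral-etl",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}
clusters.create_cluster(
    request={"project_id": project, "region": region, "cluster": cluster}
).result()

# 2. Run the Spark workload (a hypothetical PySpark script staged in GCS).
job = {
    "placement": {"cluster_name": "ephemeral-etl"},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/etl.py"},
}
jobs.submit_job_as_operation(
    request={"project_id": project, "region": region, "job": job}
).result()

# 3. Tear the cluster down as soon as the job finishes.
clusters.delete_cluster(
    request={"project_id": project, "region": region, "cluster_name": "ephemeral-etl"}
).result()
```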
Ingesting data into the data lakehouse

… transactional systems that stream data in real time. For example, when streaming data into a lakehouse such as BigQuery, it is best to use an append-only model. This means that historical data will always be in the table, but the query can include a WHERE clause that ensures only the latest version of each record is returned.
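For example, with a hypothetical append-only customers table keyed by customer_id and stamped with updated_at, a query like the sketch below (run here through the google-cloud-bigquery Python client) keeps all history in the table while returning only each record's latest version.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Every change is appended as a new row; QUALIFY filters each
# customer_id down to its most recent version at query time.
latest_sql = """
SELECT *
FROM `my-project.lakehouse.customers`
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY customer_id
    ORDER BY updated_at DESC
) = 1
"""
for row in client.query(latest_sql).result():
    print(row.customer_id, row.updated_at)
```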
Storing data in the data lakehouse

There may be instances where Spark or other ETL processes are already codified, so changing them for the sake of new technology might not make sense. If, …
Figure 7: Data lakehouse design pattern
Analyzing data in the data lakehouse

Once the data is ingested and stored in the data lakehouse, it must be analyzed and activated to drive business value. If the data is not accessible to the right resources, it is not even paying for the storage costs it incurs. To activate the data, an analyst or data scientist must find insight that drives action. Traditional reporting with data in a warehouse looks back at historical data over the past week, month, quarter, etc. While there is value in understanding these trends in the business, it is also important to use analytics to look forward so that real-time actions can be taken to correct issues before or as they arise.

There are a few ways to use the data that is stored in BigQuery, and the access method should be based on an end user's skill set. Meeting users at their level of data access, whether SQL, Python, or more GUI-based methods, means that technological skills do not limit their ability to use data for any job. Data scientists may be working outside traditional SQL-based or BI types of tools. Because BigQuery has the Storage API, tools such as Spark running on Dataproc or AI notebooks can easily be integrated into the workflow. The paradigm shift here is that the data lakehouse architecture supports bringing the compute to the data rather than moving the data around. In addition to the BigQuery SQL engine, the following diagram demonstrates other computation frameworks.

Figure 8: Computation frameworks over BigQuery storage: Hadoop, Spark, and high-performance dataframes.
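As an illustrative sketch of meeting Python users where they are, the snippet below pulls query results into a pandas dataframe over the BigQuery Storage API; it assumes the pandas and google-cloud-bigquery-storage packages are installed, and the table name is hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# create_bqstorage_client=True streams rows over the Storage Read API,
# which is much faster than paging through the REST API for large results.
df = client.query(
    "SELECT order_date, amount FROM `my-project.lakehouse.orders`"
).to_dataframe(create_bqstorage_client=True)

print(df.groupby("order_date")["amount"].sum().head())
```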
The data lakehouse architecture makes it easy to share data with granular access controls across enterprises and with other/partner companies. For example, role-based access methods across a suite of products make it possible to apply the same rules to data throughout its transformation journey, ensuring data governance and reduced operational cost. Therefore, Spark code using the BigQuery Storage API, as well as users working in spreadsheets rather than writing SQL, would still be leveraging the data lakehouse as their data source. This allows increased collaboration across the organization and enables the democratization of data.

When data is organized and democratized with a business-driven approach, it can be leveraged as a shareable and monetizable asset within an organization or with partner organizations. To formalize this capability, Google offers a layer on top of BigQuery called Analytics Hub, which can create private data exchanges. Exchange administrators (a.k.a. data curators) give permission to publish and subscribe to data in the exchange to specific individuals or groups, both internally and externally to business partners or buyers, as depicted in Figure 9.

Figure 9: Analytics Hub exchanges and listings (private, public, commercial, or Google-hosted) connecting BigQuery publishers and subscribers.
You can publish, discover, and subscribe to shared assets that are powered by the scalability of BigQuery, including open source formats. Publishers can view aggregated usage metrics. Data providers can reach enterprise BigQuery customers with data, insights, ML models, or visualizations, and leverage Cloud Marketplace to monetize their apps, insights, or models. This is similar to how BigQuery public datasets are managed through a Google-managed exchange. You can drive innovation with access to unique Google datasets, commercial/industry datasets, public datasets, or curated data exchanges from your organization or partner ecosystem. These capabilities can be realized when data operations are optimized to provide more valuable opportunities to the organization, rather than spending time feeding and caring for individual, and potentially redundant, systems.

Figure 10: A data exchange shared across an ecosystem of retailers, suppliers, and logistics providers.
Making it simple

… systems. And it complements this by automatically registering metadata for tables and filesets in metastores and Data Catalog. Furthermore, with our …

Figure 11: Dataplex capabilities: an integrated experience with analytics workspaces, serverless and BYOI compute, a unified metastore, management, and data intelligence.
With Dataplex, an integrated analytics workspace is becoming a reality. There is no infrastructure to manage, and it provides one-click access to insights for different personas. This means that data administrators are able to set up and manage workspaces together with appropriate environment profiles, including compute parameters, libraries, etc. At the same time, they are able to control user access and manage costs through one seamless interface. Data scientists have one-click access to notebooks. Further, they can discover notebooks by using a notebook repository with links to associated data, while being able to save and share notebooks as if they were sharing any other asset within the organization. Data analysts are able to use SQL Workspace for ad-hoc analysis without being dependent on any data processing environment. Effectively, through a single pane of glass they will be able to use Presto, Hive, or BigQuery without needing to access various environments.

Last, but not least, data access is made simple and straightforward. An integrated experience across all Google Cloud Data Analytics services provides virtual lakehouse experiences. This is complemented with an integrated serverless notebook experience with serverless Spark for data science. All Cloud Storage data is automatically made queryable through OSS tools and BigQuery, while enabling search and discovery across the board by using Data Catalog.
Conclusion
We are in a transformative era for data analytics in the cloud. As data volumes increase and companies become more data-driven, they need to break down data silos and make data more accessible to numerous users across the business. We have seen an increase in the number of unified data platform architecture options that meet the needs of different organization types. Google Cloud's suite of data analytics products is well suited for any modern analytics data platform pattern, including a data lakehouse.

Google is a data company, with a world-class suite of analytics products. But our secret sauce lies in our planet-scale, intelligent infrastructure upon which our products are built. Not only can you develop a lakehouse that meets your data users' needs, but we have you covered with unique hardware and networking, integration that enables streaming at unlimited scale, a serverless data warehouse with an unmatched 99.99% uptime SLA, and flexible and intelligent compute that takes the guesswork out of provisioning servers.