Exam Overview: GCP Data Engineer
v20190802
1. Exam overview

The exam consists of 50 questions that must be answered in 2 hours. Many questions describe a scenario, and you have to choose which option would be the best solution.

Hadoop ecosystem² (particularly focusing on Apache Pig, Hive, Spark, and Beam)
- Pig
  - Procedural data flow language: Pig Latin
  - Less development effort & better code efficiency
  - Does not have any notion of partitions
  - Supports Avro
- Hive
  - Data warehousing system and query language
  - SQL-like, for structured datastores (relational)
  - The two previous points mean you can use Sqoop to import data directly into Cloud Storage
- Wide column store
  - A solution for problems where one of your requirements is a very heavy write system and you want a quite responsive reporting system on top of that stored data
  - Does not provide ACID and relational data properties
  - An available, partition-tolerant system that supports eventual consistency
- Redis
- Hadoop
- Oozie
- You can easily add nodes to MySQL Cluster data nodes and build a cube to do OLAP (answer for a mock question)
1 By github.com/xg1990, modified based on 'James' GCP Dotpoints' with various online sources which are acknowledged at the end. Please share under the Creative Commons Attribution 3.0 Australia License, and feel free to submit an issue on https://github.com/xg1990/GCP-Data-Engineer-Study-Guide/issues if you find any mistakes.
2 Refer to https://hadoopecosystemtable.github.io
3.1.Cloud Storage
- Blob storage. Upload any bytes to a location (comparable to Amazon S3).
- Nearline and Coldline: still ~1 sec lookup for access, but charged for the volume of data accessed.

Encryption:
- Google Cloud Platform encrypts customer data stored at rest by default.
- Server-side encryption: performed after Cloud Storage receives your data; keys can be Google-managed, customer-managed (Cloud Key Management Service) or customer-supplied.
- Client-side encryption: encryption that occurs before data is sent to Cloud Storage.

3.2.Cloud SQL & Spanner
- Best for gigabytes of data with transactional support.
- Spanner: geographical separation (data replicated across regions).

3.3.Cloud Bigtable
- High-throughput analytics; low latency.
- For large analytical and operational workloads.
- Management of data structures and the underlying key design is required.
- No transactional support, so it can handle petabytes of data (not possible in CSQL/CDS).
- Avoid schema designs that require atomicity across rows.
- Work with it using the HBase API.
- Single key lookup. No property search. Quick range lookup.
- Row keys are stored lexicographically in big endian format.
- Anti-patterns and schema design: https://cloud.google.com/bigtable/docs/schema-design

Key Design:
- Avoid hotspotting (see the row-key sketch at the end of this section):
  - For sequential or time-series data, use a reverse date stamp at the end of the key.
  - Salting: only where field promotion does not solve the problem.
- Tables should be tall and narrow (store changes by appending new rows - tall; collapse flags into a single column - narrow).
- Group columns of data you are likely to query together (for example address fields, first/last name and contact details).
- Use short column names, and organise them into column families (groups, e.g. MD:symbol, MD:price) to organise the data.
- Size limits: a single row key: 4 KB (soft limit); a single value in a table cell: 100 MB; all values in a single row: 256 MB.
- BigTable periodically compacts the data for you. This means reorganising and removing deleted records once they're no longer needed: a delete initially only marks the row, and compaction can remove the deleted row later, so deleting data will temporarily increase the space used.

Performance:
- Testing requires at least 300 GB of data and takes minutes to hours.
- Before you test, run a heavy pre-test for several minutes.
- Stay below the recommended storage utilization per node.
- Put your Compute Engine instance in the same zone as your Cloud Bigtable instance for the best possible performance (when using Cloud Bigtable with a Compute Engine-based application).

Management:
- You can update any of the following settings without any downtime: number of clusters / replication settings.
- It is impossible to switch between SSD and HDD: export the data from the existing instance and import the data into a new instance, OR write a Cloud Dataflow or Hadoop MapReduce job that copies the data from one instance to another.

Access Control:
- You can configure access control at the project level and the instance level (the lowest IAM resource you can control).

Architecture:
- A Cloud Bigtable instance is mostly just a container for your clusters and nodes, which do all of the real work.
- Tables belong to instances, not to clusters or nodes. So if you have an instance with more than one cluster, you cannot assign tables to individual clusters.
- Use replication to provide high availability.
- Instance types: Production & Development.
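A minimal sketch (not from the guide) of the reverse-timestamp row-key idea, using the google-cloud-bigtable client; the `vehicle_id` field, project, instance and table names are invented for illustration.

```python
import datetime

from google.cloud import bigtable

MAX_TS = 10**13  # fixed upper bound used to reverse the timestamp ordering


def row_key(vehicle_id: str, event_time: datetime.datetime) -> bytes:
    """Build a Bigtable row key that avoids hotspotting.

    The identifier comes first (field promotion) and a reversed timestamp
    comes last, so recent events sort first and sequential writes are not
    all sent to the same part of the keyspace.
    """
    reversed_ts = MAX_TS - int(event_time.timestamp() * 1000)
    return f"{vehicle_id}#{reversed_ts:013d}".encode()


# Illustrative write; project/instance/table names are placeholders.
client = bigtable.Client(project="my-project", admin=False)
table = client.instance("my-instance").table("telemetry")
row = table.direct_row(row_key("vehicle-42", datetime.datetime.utcnow()))
row.set_cell("MD", "speed", b"88")
row.commit()
```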
3.4.Datastore
- Built on top of BigTable.
- For structured data.
- Not strongly consistent for every row.
- Atomic transactions: can execute a set of operations where either all succeed, or none occur.
- Key lookups somewhat like Amazon DynamoDB.
- Allows for SQL-like querying down to property level.
- Does not support complex joins with multiple inequality filters.

Performance:
- Fast to terabyte scale, low latency.
- Quick read, slow write, as it relies on indexing every property (by default) and must update indexes as updates/writes occur.

Terminology compared with an RDBMS:

  RDBMS / CloudSQL / Spanner    Datastore
  Row                           Entity
  Tables                        Kind
  Fields                        Property
  Structured relational         Structured

Errors and Error Handling:
- UNAVAILABLE, DEADLINE_EXCEEDED: retry using exponential backoff (see the sketch below).
- INTERNAL: do not retry this request more than once.
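A minimal sketch (not from the guide) of retrying with exponential backoff, assuming a hypothetical `do_request()` callable and the google.api_core exception classes; attempt counts and delays are placeholders.

```python
import random
import time

from google.api_core import exceptions


def call_with_backoff(do_request, max_attempts=5):
    """Retry transient Datastore errors with exponential backoff plus jitter."""
    delay = 1.0  # seconds before the first retry
    for attempt in range(1, max_attempts + 1):
        try:
            return do_request()
        except (exceptions.ServiceUnavailable, exceptions.DeadlineExceeded):
            if attempt == max_attempts:
                raise  # give up after the configured number of attempts
            time.sleep(delay + random.uniform(0, 0.5))
            delay *= 2  # double the wait between attempts
```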
3.5.BigQuery
- For analytics; serverless.
- An alternative to Hadoop with Hive.
- Petabyte scale.
- Higher latency: used more for analytics than for low-latency rapid lookups like an RDBMS such as CloudSQL or Spanner.
- Has connectors for BigTable, GCS and Google Drive, and can import from Datastore backups, CSV, JSON and Avro.

Loading data:
- Web console (local files), GCS, GDS.
- Streaming (costly): data arriving with Cloud Dataflow (CDF), Cloud Logging or POST calls.
- Raw files as a federated data source: CSV/JSON/Avro on GCS, Google Sheets.
- Google Drive: loading data into BigQuery from Google Drive is not currently supported.
- By default, the BigQuery service expects all source data to be UTF-8 encoded.
- JSON files must always be encoded in UTF-8.
- To support (occasionally) changing schemas you can use 'Automatically detect' for schema changes. Automatically detect is not selected by default.
- Web UI limits: you cannot upload a file greater than 10 MB in size, and you cannot upload a file in SQL format.

Querying:
- Query jobs are actions executed asynchronously to load, export, query, or copy data.
- You cannot use both Legacy and SQL2011 (standard SQL) in the same query.
- If you use the LIMIT clause, BigQuery will still scan the entire table, so it does not reduce the query cost.

Anti-patterns:
- Avoid SELECT * (full scan); select only the columns you need.
- Cross-Join: avoid joins that generate more outputs than inputs.
- Update/Insert single row/column: avoid point-specific DML; instead batch updates and inserts.
- BUT: normalising is the way to make a dataset better organized, though it is less performance-optimized.

Table partitioning:
- You cannot change an existing table into a partitioned table. You must create a partitioned table from scratch.
- Two types of partitioned tables: tables partitioned by ingestion time, and tables partitioned on a DATE or TIMESTAMP column.
- Ingestion-time partitioned tables include a pseudo column named _PARTITIONTIME that contains a date-based timestamp for data loaded into the table.

Wildcard tables:
- Used if you want to union all similar tables with similar names: '*' (e.g. project.dataset.Table*); see the query sketch below.
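A small illustration (not from the guide) of querying an ingestion-time partitioned table and a set of wildcard tables with the google-cloud-bigquery client; project, dataset and table names are made up.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses the default project and credentials

# Limit the scan of an ingestion-time partitioned table to one partition.
partition_sql = """
    SELECT COUNT(*) AS rows_loaded
    FROM `my-project.my_dataset.events`
    WHERE _PARTITIONTIME = TIMESTAMP('2019-08-01')
"""

# Union many similarly named tables (events_2017, events_2018, ...) via a wildcard.
wildcard_sql = """
    SELECT _TABLE_SUFFIX AS source_table, COUNT(*) AS n
    FROM `my-project.my_dataset.events_*`
    WHERE _TABLE_SUFFIX BETWEEN '2017' AND '2019'
    GROUP BY source_table
"""

for sql in (partition_sql, wildcard_sql):
    for row in client.query(sql).result():  # query() starts an asynchronous query job
        print(dict(row))
```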
User-defined functions:
- You can write user-defined functions in JavaScript (UDF) as well as SQL (see the sketch after this list).

Views:
• Views: virtual tables defined by a SQL query. For more information, see "Using views".
• External tables: tables backed by storage external to BigQuery (also known as a federated data source).

Caching:
- There is no charge for a query that retrieves its results from cache.
- Cached results are not used if:
  - any of the referenced tables or logical views have changed since the results were previously cached;
  - you are querying multiple tables using a wildcard;
  - the query runs against an external data source.
- The "use cached results" option is set each time you run a query.

Windowing:
- Window functions increase the efficiency and reduce the complexity of queries that analyze partitions (windows) of a result set.

Bucketing:
- Each bucket will be the same size and is based on the hash function of a column. Each bucket is a separate file, which makes for more efficient sampling and joining of data.

Security:
- Security (access control) can be applied at the project and dataset level, but not at the table or view level.
- There are three types of resources in BigQuery.
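For the UDF bullet above, a minimal example (mine, not from the guide) of a JavaScript UDF in standard SQL submitted through the Python client; the function and table are invented for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client()

# CREATE TEMP FUNCTION defines a JavaScript UDF that only exists for this query.
sql = """
    CREATE TEMP FUNCTION cleanName(name STRING)
    RETURNS STRING
    LANGUAGE js AS '''
      return name === null ? null : name.trim().toLowerCase();
    ''';

    SELECT cleanName(customer_name) AS customer_name
    FROM `my-project.my_dataset.orders`
    LIMIT 10
"""

for row in client.query(sql).result():
    print(row.customer_name)
```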
Legacy vs. standard SQL³:
- Standard SQL uses backtick-quoted wildcard tables, e.g. `project.dataset.tablename*`.

Billing:
- Based on storage (amount of data stored) and querying (amount of data / number of bytes processed by the query).
- Storage options are active and long-term (whether the table has been modified or not in the past 90 days).
- Query options are on-demand and flat-rate.

Export:
- Data can only be exported in JSON / CSV / Avro.
- To export more than 1 GB of data, you need to put a wildcard in the destination filename (up to 1 GB of table data can go to a single file); see the sketch below.
- Distributed writing to file for output, e.g. `file-0001-of-0002`.
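A sketch (not from the guide) of exporting a table to GCS with a wildcard destination using the Python client; the bucket and table names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# A wildcard in the destination URI lets BigQuery shard exports larger than 1 GB.
destination_uri = "gs://my-export-bucket/orders/export-*.csv"
table_ref = bigquery.TableReference.from_string("my-project.my_dataset.orders")

extract_job = client.extract_table(table_ref, destination_uri)  # asynchronous job
extract_job.result()  # wait for completion
print("Exported", table_ref.table_id, "to", destination_uri)
```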
3 https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql
4. Big Data Processing

Covering knowledge about BigQuery, Cloud Dataflow, Cloud Dataproc, Cloud Datalab and Cloud Pub/Sub.
- e.g. Bigtable for low-latency use and BigQuery for data exploration.

4.2.GCP Compute
- VM
- Pre-emptible instances (up to 80% discount, but can be taken away).
- Allocated on-demand; you only pay for the time they are up.

Cloud Dataflow
- Serverless, with automated scaling and deployment.
- Handles sudden and extreme spikes of traffic which require immediate scaling.
- Can be used for batch or stream data.
- Scalable, fault-tolerant, multi-step processing of data.
- Often used for data preparation / ETL for data sets.
- Understand how to integrate with BigQuery, and why you would use JSON or Java related to Pipelines.
- The dataflow.worker role provides the permissions necessary for a Compute Engine service account to execute work units for a Dataflow pipeline.
- Create a cron job with the Google App Engine Cron Service to run a Cloud Dataflow job on a schedule.

Basic concepts:
- A PCollection is a potentially distributed, multi-element data set that acts as the pipeline's data.
- A transform is a data processing operation, or a step, in your pipeline. A transform takes one or more PCollections as input, performs a processing function that you provide on the elements of that PCollection, and produces an output PCollection.
- DirectPipelineRunner allows you to execute operations in the pipeline directly & locally.

Windowing:
- Can apply windowing to streams for rolling aggregations.
- Window types: fixed, sliding, session, and the Single Global Window.
- Default windowing behavior is to assign all elements of a PCollection to a single, global window, even for unbounded PCollections.

Triggers (see the sketch below):
- A trigger determines when a Window's contents should be output, based on certain criteria being met.
- Allows specifying a trigger to control when (in processing time) results for the given window can be produced.
- Time-Based Triggers.
- Event time triggers: these triggers operate on the event time, as indicated by the timestamp on each data element. Beam's default trigger is event time-based.
- Data-driven triggers: currently these only support firing after a certain number of data elements.
- Composite triggers: these triggers combine multiple triggers in various ways.
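A compact Apache Beam (Python SDK) sketch, not taken from the guide, showing fixed windows with an event-time trigger plus early firings; the element values and timestamps are invented.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger

with beam.Pipeline() as p:
    (
        p
        # TimestampedValue attaches an event-time timestamp to each element.
        | beam.Create([("user1", 1), ("user1", 3), ("user2", 5)])
        | beam.Map(lambda kv: window.TimestampedValue(kv, 1564700000))
        # 60-second fixed windows; fire at the watermark, with early firings
        # every 10 elements, discarding panes that were already emitted.
        | beam.WindowInto(
            window.FixedWindows(60),
            trigger=trigger.AfterWatermark(early=trigger.AfterCount(10)),
            accumulation_mode=trigger.AccumulationMode.DISCARDING,
        )
        | beam.CombinePerKey(sum)
        | beam.Map(print)
    )
```

Run as-is this executes on the local DirectRunner, which is the same idea as the DirectPipelineRunner bullet above.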
Cloud Pub/Sub
- Server-less messaging.
- Decouples producers and consumers of data.
- Use cases: throughput (balance load among workers), unification (cross-organisational), latency (accept requests at the edge of the network), consistency.

Message Flow:
- The Pub/Sub service stores messages until they are delivered and acknowledged by subscribers.
- The Pub/Sub service forwards messages from a topic to its subscriptions; the subscriber receives each message from its subscription and acknowledges the message.

Deduplicate Tech (see the sketch below):
- Maintain a database table to store the hash value and other metadata for each data entry.
- Cloud Pub/Sub assigns a unique `message_id` to each message, which can be used to detect duplicate messages received by the subscriber.
- Lots of duplicate messages may happen when the endpoint is not acknowledging messages in time.
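A small illustration (not from the guide) of subscriber-side dedup keyed on `message_id` with the google-cloud-pubsub client; the subscription path is a placeholder and the in-memory set stands in for the durable dedup table mentioned above.

```python
from google.cloud import pubsub_v1

subscription_path = "projects/my-project/subscriptions/my-sub"  # placeholder
seen_ids = set()  # stand-in for a durable dedup store


def callback(message):
    # Pub/Sub may redeliver, so drop anything we have already processed.
    if message.message_id in seen_ids:
        message.ack()
        return
    seen_ids.add(message.message_id)
    print("processing", message.data)
    message.ack()  # acknowledge promptly to avoid further redelivery


subscriber = pubsub_v1.SubscriberClient()
streaming_pull = subscriber.subscribe(subscription_path, callback=callback)

try:
    streaming_pull.result(timeout=30)  # block while messages are handled
except Exception:
    streaming_pull.cancel()
```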
Cloud Dataproc
- Dataproc is a managed (No Ops) Hadoop cluster on GCP (i.e. managed Hadoop, Pig, Hive, Spark programs).
- Automated cluster management and resizing; you bring the code/query only.
- Job management screen in the console.
- You can SSH directly to the cluster nodes.
- You can access the YARN web interface by configuring a browser to connect through a SOCKS proxy.

Storage:
- Store data in HDFS (split up on the cluster, but requires the cluster to be up) or in GCS (separate cluster and storage).
- Modify configuration files using cluster properties.

Cluster creation (see the sketch below):
- When creating a new Cloud Dataproc cluster with the projects.regions.clusters.create operation, these four values are required: project, region, name, and zone.

Billing:
- Billed by the second. All Cloud Dataproc clusters are billed in one-second clock-time increments, subject to a 1-minute minimum billing.

IAM:
- Service accounts used with Cloud Dataproc must have the Dataproc/Dataproc Worker role (or have all the permissions granted by that role), which includes permission to write to Google Cloud Logging.
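A rough sketch, assuming the google-cloud-dataproc Python client (v2-style API); the project, region, zone and cluster names are placeholders, and the exact client signature varies between library versions.

```python
from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# The four required values from the notes above: project, region, name, zone.
cluster = {
    "project_id": "my-project",
    "cluster_name": "my-cluster",
    "config": {
        "gce_cluster_config": {"zone_uri": "us-central1-a"},
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}

operation = client.create_cluster(
    request={"project_id": "my-project", "region": region, "cluster": cluster}
)
print(operation.result().cluster_name)  # blocks until the cluster is ready
```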
5. Machine Learning⁴

Covering knowledge on the GCP ML APIs (Vision API, Speech API, Natural Language API and Translate API). Understand the different ML services available: ML Engine, ML APIs, and TensorFlow, as well as the relevance of Cloud Datalab.

ML services:
- CloudML (ML Engine): for the Data Scientist; use a custom model, scaleable, no-ops (where you have enough data to train a model).
- ML APIs: for the App Developer; use pre-built models.
- Managed Trifacta (Dataprep) for preparing and analysing data of any size stored in CSV, JSON, or relational-table formats.
- Can be triggered by Dataflow, GCS bucket events, Pub/Sub messages and HTTP calls.

• TensorFlow CheatSheet and Terminology
- Features: the data values/fields you choose to model; these can be transformed (x^2, y^2, etc).
- Feature Engineering: the process of building a set of feature combinations to act on inputs.
- Feature engineering is less needed than with linear models.
- Label: the correct classification/value.
- Model: a mathematical function; some work is done on inputs for an output.
- Training: adjusting variable weights in a model to minimise error.
- Neuron: a way to combine inputs and weight them to make a decision (it is one unit of input combination).
- Recall: the true positive rate; how many times a thing IS in the class (e.g. the actual number of cats).
- Gradient Descent: the process of testing error in order to minimise its value iteratively. Error decreases towards a minimum, which can be global or local, so starting points and learning rates are important to ensure the process doesn't stop at a local minimum. Too low a learning rate and your model will train slowly; too high and it may miss the minimum. (See the sketch below.)
- Initial weight selection is hard to get right; human intuition can often help with this starting point for gradient descent.
- Cost depends on the problem: usually the sum of squared errors for regression problems or cross-entropy for classification problems.
- Common models tend to include regression (what value?) and classification (what category?).
- Recommendations: cluster similar items (products, houses, etc.) and use ratings from similar users ("most users rate House Y as a 2"); combine these two to produce a rating.
- Converting inputs to vectors for analysis is a well documented problem that can actually be improved with the use of ML itself. Some public models already exist, including https://en.wikipedia.org/wiki/Word2vec
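A toy gradient-descent loop (mine, not from the guide) for a one-parameter squared-error cost, just to show how the learning rate drives the size of each update.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=100):
    """Fit y ~ w * x by minimising the sum of squared errors."""
    w = 0.0  # initial weight (the starting point matters for harder problems)
    for _ in range(steps):
        # d/dw of sum((w*x - y)^2) = 2 * sum(x * (w*x - y))
        grad = 2 * sum(x * (w * x - y) for x, y in zip(xs, ys))
        w -= learning_rate * grad  # too small: slow; too large: overshoots
    return w


print(gradient_descent([1, 2, 3], [2, 4, 6]))  # approaches 2.0
```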
• Deploying Models

Online versus Batch Prediction:
- Online: optimized to minimize the latency of serving predictions.

Training cluster:
- Build a pipeline to continue training your model based on both new and old data.
- The training service allocates the resources for the machine types you specify.
- The training service runs until your job succeeds or encounters an unrecoverable error.
- Replica: your running job on a given node is called a replica. Each replica in the training cluster is given a single role or task in distributed training:
  - master: exactly one replica is designated the master. This task manages the others and reports status for the job as a whole. If you are running a single-process job, the sole replica is the master for the job.
  - workers: one or more replicas may be designated as workers. These replicas do their portion of the work as you designate in your job configuration.
- In the distributed case, it is the status of the master replica that signals the overall job status.
- Running a Cloud ML Engine training job locally (gcloud ml-engine local train) is especially useful in the case of testing distributed models.

Effective ML:
- Data Collection
- Data Organisation
- Model Creation using human insight and domain knowledge
- Use machines to flesh out the model from the input data
- Your dataset should cover all cases, both positive and negative, and should have at least 5 cases of each; otherwise the model cannot correctly classify that case. Near misses are also important for this. Explore your data, find causes of problems and try to fix the issue.

5.3.TensorFlow
● Open-source machine learning / deep learning platform
● Machine learning library
● Lazy evaluation during graph build, full evaluation during execution (see the sketch below)
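A tiny TensorFlow 1.x-style illustration (not from the guide) of the lazy build / deferred execution point; TensorFlow 2.x is eager by default, so this uses the compat API.

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()  # behave like TF 1.x: build first, run later

# Build phase: these lines only construct graph nodes, nothing is computed yet.
a = tf.placeholder(tf.float32, name="a")
b = tf.placeholder(tf.float32, name="b")
total = a * b + 2.0

# Execution phase: the graph is evaluated only when the session runs it.
with tf.Session() as sess:
    print(sess.run(total, feed_dict={a: 3.0, b: 4.0}))  # -> 14.0
```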
● Discard info that identifies a row (you need at least 5-10 examples of a particular value to avoid overfitting)
● Predictor columns must be numerical (not categorical / codes)
● Create the model

Feature Engineering:
- A sparse vector is very long, with many zeros, and contains only a single 1 (one-hot).
- If you don't know the set of possible values in advance, you can use categorical_column_with_hash_bucket instead.
- An embedding is a mapping from discrete objects, such as words, to vectors of real numbers. (See the sketch below.)
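A short sketch (not from the guide) using the TF 1.x / estimator-era tf.feature_column API that these bullets refer to; the column names are invented.

```python
import tensorflow as tf

# Hashing is useful when the full vocabulary is not known in advance.
city = tf.feature_column.categorical_column_with_hash_bucket(
    "city", hash_bucket_size=1000
)

# An embedding maps each discrete bucket to a small dense vector of real numbers.
city_embedding = tf.feature_column.embedding_column(city, dimension=8)

# A numeric predictor column (predictors must ultimately be numerical).
rooms = tf.feature_column.numeric_column("rooms")

# These columns could then feed a canned estimator, for example:
# model = tf.estimator.DNNRegressor(
#     feature_columns=[city_embedding, rooms], hidden_units=[16, 8])
```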
5.4.Datalab
- Datalab: managed Jupyter notebooks, great for use with a Dataproc cluster to write PySpark.
- A powerful interactive tool created to explore, analyze, transform and visualize data and build machine learning models on Google Cloud.
- Can also insert SQL & JS for BigQuery, HTML for web content, charts, tables.
- Analyse data in BQ, GCE, GCS.
- Free; you just pay for the resources it uses.
- How to run Datalab: Docker + Gateway (best option; uses a gateway and proxy, runs locally).

6. Management

6.1.IAM & Billing
- 3 member types: service account, Google account and Google group.
- Service accounts are for non-human users such as applications.
- Project transfer fees are associated with the instigator.
- Billing access can be provided to a project or set of projects without granting access to the content. This is useful for separation of duties between finance/devs etc.

6.2.Stackdriver
- For store, search, analyse, monitor, and alert on log data and events.
- Know how you can debug, monitor and log using Stackdriver, and how to use Stackdriver to help debug source code.
- Audit Logs to review data access (e.g. BigQuery).
- Stackdriver Trace: collects latency data from Google App Engine, Google HTTP(S) load balancers, and applications instrumented with the Stackdriver Trace SDKs, and displays it in near real time in the Google Cloud Platform Console.

Data Studio
- Can use the existing YouTube data source.
- The prefetch cache is only active for data sources that use the owner's credentials to access the underlying data.
- Turn the prefetch cache off if your data changes frequently and you want to prioritize freshness over performance; otherwise expect higher data usage costs for paid data sources, such as BigQuery.
- You can turn the prefetch cache on if you are using a data source that incurs usage costs (e.g., BigQuery) and want to minimize those costs.

Choosing storage⁶
- Searching for objects by attribute value: Datastore (Bigtable supports lookup only by a single row key).
- High-throughput writes of wide column data: Bigtable.
- Warehousing structured data: BigQuery.
Avro Data Format⁵:
- Is faster to load. The data can be read in parallel, even if the data blocks are compressed. (See the loading sketch below.)
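A brief sketch (not from the guide) of loading Avro files from GCS into BigQuery with the Python client; the URIs and table name are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,  # the schema is read from the Avro files
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/*.avro",           # wildcard picks up all shards
    "my-project.my_dataset.events",
    job_config=job_config,
)
load_job.result()  # wait for the load to finish
print(client.get_table("my-project.my_dataset.events").num_rows, "rows loaded")
```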
Deployment Manager
- Allows you to specify all the resources needed for your application in a declarative format using YAML.

Transfer Appliance
- One-time transfers: rack, capture and then ship your offline data to Google Cloud.
- If you have a large number of files to transfer, compressing and combining smaller files into fewer larger files is also a best practice for speeding up transfer speeds.

Datastore vs Bigtable
- If the data set is relatively small (< 10 TB), then Datastore will be preferred.
- If the data set is > 10 TB and/or there is no requirement for multiple indexes, then Bigtable will be better.
5 https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro
6 https://cloud.google.com/storage-options/
8. Case Study
- The correct answer is not necessarily the best technical option; pick the one that achieves the right outcome for the company based on the case study's requirements.
- Another helpful case write-up⁷.

9. Resources
- Data Engineering Roadmap: https://github.com/hasbrain/data-engineer-roadmap
- Google Cloud Next '17 talks:
  - https://www.youtube.com/watch?v=TLpfGaYWshw
  - https://www.youtube.com/watch?v=r_yYDysfB-k
  - Serverless data processing with Google Cloud Dataflow (Google Cloud Next '17): https://www.youtube.com/watch?v=3BrcmUqWNm0
- Let me know if any source is missing.

7 https://www.linkedin.com/pulse/google-cloud-certified-professional-data-engineer-writeup-rix/