Exam Overview: GCP Data Engineer

- Cloud Storage is a low-latency blob storage service for storing unlimited data. It is ideal for storing data but not for high-volume read/write workloads. Data is spread across multiple zones and regions for durability and availability.
- Cloud SQL is a managed relational database service for transactional workloads up to gigabytes in size. It offers MySQL and PostgreSQL databases.
- BigTable is a wide-column store for petabyte-scale datasets. It has no transactions but can handle high read/write volumes. Data is stored using a column-oriented model and keys are used to query data.


GCP Data Engineer [1]

v20190802

[1] by github.com/xg1990, modified based on 'James' GCP Dotpoints' with various online sources which are acknowledged at the end. Please share under the Creative Commons Attribution 3.0 Australia License, and feel free to submit an issue at https://github.com/xg1990/GCP-Data-Engineer-Study-Guide/issues if you find any mistakes.

1. Exam overview

The exam consists of 50 questions that must be answered in 2 hours. The content includes:

- Storage (20% of questions)
- Big Data Processing (35%)
- Machine Learning (18%)
- case studies (15%)
- others (Hadoop and security, about 12%)

All of the questions are scenario simulations where you have to choose which option would be the best way to deal with the situation.

2. Big Data Ecosystem [2]

Particularly focusing on Apache Pig, Hive, Spark, and Beam.

[2] refer to https://hadoopecosystemtable.github.io

- Hadoop
  - open-source MapReduce framework
  - the underlying technology for Dataproc
- HDFS
  - Hadoop File System
- Pig
  - scripting language that compiles into MapReduce jobs
  - procedural data flow language: PigLatin
  - less development effort and better code efficiency
  - does not have any notion of partitions
  - supports Avro
- Hive
  - data warehousing system and query language
  - SQL-like
- Spark
  - fast, interactive, general-purpose framework for SQL, streaming, machine learning, etc.
  - solves similar problems as Hadoop MapReduce but with a fast in-memory approach (see the PySpark sketch at the end of this list)
- Sqoop
  - transfers data between Hadoop and structured (relational) datastores
  - Sqoop imports data from a relational database system or a mainframe into HDFS
  - running Sqoop on a Dataproc Hadoop cluster gives you access to the built-in Google Cloud Storage connector
  - the two previous points mean you can use Sqoop to import data directly into Cloud Storage
- Oozie
  - workflow scheduler system to manage Apache Hadoop jobs
  - Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions
- Cassandra
  - wide-column store based on the ideas of BigTable and DynamoDB (Datastore)
  - a solution for problems where one of your requirements is a very write-heavy system with a reasonably responsive reporting system on top of the stored data
  - does not provide ACID and relational data properties
  - an available, partition-tolerant system that supports eventual consistency
- MongoDB
  - a fit for use cases where your system demands a schema-less document store
- HBase
  - might be a fit for search engines, analyzing log data, or any place where scanning huge, two-dimensional join-less tables is a requirement
- Redis
  - built to provide in-memory search for a variety of data structures such as trees, queues and linked lists; can be a good fit for real-time leaderboards and pub/sub-style systems
- MySQL
  - you can easily add nodes to MySQL Cluster data nodes and build a cube to do OLAP (answer for a mock question)
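To make the Spark bullet concrete, here is a minimal PySpark sketch of the kind of in-memory, SQL-style job that would otherwise need a multi-stage MapReduce pipeline. The bucket path and column names are hypothetical.

```python
# Minimal PySpark sketch (hypothetical paths and columns): read CSV files,
# aggregate in memory, and write the result back out. On Dataproc the built-in
# GCS connector lets you use gs:// paths in place of HDFS.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-sensor-aggregation").getOrCreate()

events = spark.read.csv("gs://my-bucket/events/*.csv", header=True, inferSchema=True)

daily = (events
         .groupBy("device_id", F.to_date(F.col("event_time")).alias("day"))
         .agg(F.avg("temperature").alias("avg_temp")))

daily.write.mode("overwrite").parquet("gs://my-bucket/output/daily_temps")
spark.stop()
```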
3. Storage

Cloud Storage / Cloud SQL / Datastore / BigTable / BigQuery.

3.1. Cloud Storage (GCS)

- Blob storage. Upload any bytes to a location; the content isn't indexed at all, just stored. (comparable to Amazon S3)
- Virtually unlimited storage
- Nearline and Coldline: ~1 second lookup for access, charged for the volume of data accessed
  - Nearline for data accessed about once per month
  - Coldline for data accessed about once per year
- buckets to segregate storage items
- geographical separation:
  - persistent, durable, replicated
  - spread data across zones to minimise the impact of service disruptions
  - spread data across regions to provide global access to data
- Ideal for storing data, but not for a high volume of reads/writes (e.g. sensor data)
- A way to store data that can be commonly used by Dataproc and BigQuery
- Encryption:
  - Google Cloud Platform encrypts customer data stored at rest by default
  - Encryption options:
    - Server-side encryption:
      - Customer-supplied encryption keys: you can create and manage your own encryption keys for server-side encryption
      - Customer-managed encryption keys: you can generate and manage your encryption keys using Cloud Key Management Service
    - Client-side encryption: encryption that occurs before data is sent to Cloud Storage
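A minimal sketch of basic GCS usage with the google-cloud-storage Python client; the bucket and object names are hypothetical.

```python
# Upload and download a blob with the google-cloud-storage client.
# GCS only stores the bytes; it does not index or parse the content.
from google.cloud import storage

client = storage.Client()  # uses Application Default Credentials
bucket = client.bucket("my-sensor-archive")

blob = bucket.blob("raw/2019/08/02/readings.csv")
blob.upload_from_filename("readings.csv")

# Fetch it back later, e.g. for a Dataproc job or a BigQuery load.
blob.download_to_filename("/tmp/readings.csv")
```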

3.2. Cloud SQL & Spanner

- Managed / no-ops relational database (MySQL and PostgreSQL), like Amazon RDS
- best for gigabytes of data with a transactional nature
- Low latency
- Doesn't scale well beyond gigabytes
- Spanner is a distributed and scalable solution for RDBMS workloads, however it is also more expensive
- Management:
  - managed backups and automatic replication
  - fast connection with GCE/GAE
  - uses Google security
  - flexible pricing: pay for what you use

3.3. BigTable

Performance:
- Fast to petabyte scale; not a good solution for storing less than 1 TB of data
- Low-latency read/write access
- High-throughput analytics
- Native time-series support
- for large analytical and operational workloads
- designed for sparse tables
- no management of the underlying data structures and infrastructure is required

Feature:
- Stored on Google's internal store, Colossus
- no transactional support (so it can handle petabytes of data)
- not relational (no SQL or joins); ACID only at the row level
- Avoid schema designs that require atomicity across rows
- high throughput: throughput grows linearly with node count if correctly balanced
- work with it using the HBase API
- no-ops, auto-balanced, replicated, compacted

Key design:
- Design your keys for how you intend to query.
- If your most common query is for the most recent data, use a reverse date stamp at the end of the key.
- Ensure your keys are evenly distributed to avoid hotspotting. This is why a date stamp as the key, or at the start of the key, is bad practice: all of the most recent data is being written at the same time.
- For historical data analytics, the hotspot issue may not be the biggest concern.
- For time-series data, use tall/narrow tables. Denormalize: prefer multiple tall and narrow tables.
- To avoid hotspotting:
  - Field promotion (preferred): move fields from the column data into the row key to make writes non-contiguous.
  - Salting (only where field promotion does not resolve it): add an additional calculated element to the row key to artificially make writes non-contiguous.
  - (see the row-key sketch at the end of this section)

Query:
- Single-key lookup. No property search.
- Stored lexicographically in big-endian format, so keys can be anything.
- quick range lookups

Data update:
- When querying, BigTable selects the most recent value that matches the key.
- This means that when deleting or updating we actually write a new row with the desired data, and compaction can remove the deleted row later. Deleting data therefore temporarily increases disk usage.
- append only (cannot update a single field like in Cloud SQL / Cloud Datastore)
- tables should be tall and narrow (store changes by appending new rows: tall; collapse flags into a single column: narrow)

Group columns:
- Group columns of data you are likely to query together (for example address fields, first/last name and contact details).
- use short column names, organised into column families (groups, e.g. MD:symbol, MD:price)

Periodic compaction:
- BigTable periodically compacts the data for you. This means reorganising and removing deleted records once they're no longer needed.

Sorted String Tables:
- BigTable relies on Sorted String Tables (SSTables) to organise the data.
- SSTables are immutable, pre-sorted key/value pairs of strings or protobufs.

Size limits:
- A single row key: 4 KB (soft limit?)
- A single value in a table cell: 100 MB
- All values in a single row: 256 MB

Architecture:
- A Cloud Bigtable instance is mostly just a container for your clusters and nodes, which do all of the real work.
- Tables belong to instances, not to clusters or nodes. So if you have an instance with up to 2 clusters, you can't assign tables to individual clusters.

Production & Development:
- Production: a standard instance with either 1 or 2 clusters, as well as 3 or more nodes in each cluster.
  - use replication to provide high availability
- Development: a low-cost instance for development and testing, with performance limited to the equivalent of a 1-node cluster.
- You can update any of the following settings without any downtime:
  - number of clusters / replication settings
  - upgrade a development instance to a production instance (permanent)
- Switching between SSD and HDD is not possible:
  - export the data from the existing instance and import it into a new instance, OR
  - write a Cloud Dataflow or Hadoop MapReduce job that copies the data from one instance to another.
- It is recommended to create your Compute Engine instance in the same zone as your Cloud Bigtable instance for the best possible performance (when using Cloud Bigtable with a Compute Engine-based application).

Performance test:
- BigTable learns about your access patterns and adjusts the metadata stored in nodes in order to try to balance your workloads.
  - This takes minutes to hours and requires at least 300 GB of data.
- use a production instance
- Stay below the recommended storage utilization per node.
- Before you test, run a heavy pre-test for several minutes.
- Run your test for at least 10 minutes.

Access control:
- you can configure access control at the project level and the instance level (the lowest IAM resource you can control)

Tools:
- cbt is a tool for doing basic interactions with Cloud Bigtable.
- The HBase shell is a command-line tool that performs administrative tasks, such as creating and deleting tables.
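The row-key advice above can be seen in a short sketch with the google-cloud-bigtable client. The project, instance, table and column-family names are hypothetical; the key promotes the metric ID into the row key and ends with a reverse timestamp, so writes are spread out while recent rows for one metric remain in a contiguous scan range.

```python
# Hedged sketch of the "field promotion + reverse timestamp" key design.
import datetime
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("sensor-instance").table("metrics")

def make_row_key(metric_id: str, ts: datetime.datetime) -> bytes:
    # Reverse timestamp so the newest rows for a metric sort first.
    reverse_ts = 10**13 - int(ts.timestamp() * 1000)
    return f"{metric_id}#{reverse_ts}".encode()

row = table.direct_row(make_row_key("device42/temp", datetime.datetime.utcnow()))
row.set_cell("MD", "value", b"21.7")  # short column family and qualifier names
row.commit()
```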

3.4. Datastore

- Built on top of BigTable.
- not strongly consistent for every row (some queries are eventually consistent)
- document DB for non-relational data

Suitable for:
- Atomic transactions: can execute a set of operations where either all succeed or none occur.
- Supports ACID transactions and SQL-like queries.
- structured data
- hierarchical document storage such as HTML

Query:
- can search by keys or properties (if indexed)
- key lookups, somewhat like Amazon DynamoDB
- allows SQL-like querying down to the property level
- does not support complex joins with multiple inequality filters

Performance:
- Fast to terabyte scale, low latency
- Quick reads, slow writes: it relies on indexing every property (by default) and must update indexes as updates/writes occur

RDBMS (Cloud SQL / Spanner) vs. Datastore terminology:
- Row → Entity
- Table → Kind
- Field → Property
- Column values must be consistent between rows → Properties can vary between entities
- Structured relational data → Structured hierarchical data (HTML, XML)

Errors and error handling:
- UNAVAILABLE, DEADLINE_EXCEEDED: retry using exponential backoff.
- INTERNAL: do not retry this request more than once.
- Other: do not retry without fixing the problem.
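A minimal google-cloud-datastore sketch of the points above: an atomic transaction around a write, then a property query on an indexed field. Kind and property names are hypothetical.

```python
from google.cloud import datastore

client = datastore.Client()

# Atomic transaction: either every operation inside succeeds or none do.
with client.transaction():
    key = client.key("Order", "order-1001")
    order = datastore.Entity(key=key)
    order.update({"customer": "acme", "status": "shipped", "total": 99.5})
    client.put(order)

# Property query: only works on indexed properties; no complex joins or
# multiple inequality filters.
query = client.query(kind="Order")
query.add_filter("status", "=", "shipped")
shipped = list(query.fetch(limit=10))
```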
3.5. BigQuery

Feature:
- Fully managed data warehouse
- for analytics; serverless
- an alternative to Hadoop with Hive
- Has connectors for BigTable, GCS and Google Drive, and can import from Datastore backups, CSV, JSON and Avro

Performance:
- Petabyte scale
- High latency; used more for analytics than for low-latency rapid lookups like an RDBMS such as Cloud SQL or Spanner

Data import:
- batch (free): web console (local files), GCS, GDS
- stream (costly): data from Cloud Dataflow, Cloud Logging or POST calls
- Raw files: federated data sources, CSV/JSON/Avro on GCS, Google Sheets
- Google Drive: loading data into BigQuery from Google Drive is not currently supported, but you can query data in Google Drive using an external table.
- By default, the BigQuery service expects all source data to be UTF-8 encoded
  - JSON files must always be encoded in UTF-8
- To support (occasionally) changing schemas you can use 'Automatically detect' for schema changes. Automatically detect is not selected by default.
- You cannot use the Web UI to:
  - upload a file greater than 10 MB in size
  - upload multiple files at the same time
  - upload a file in SQL format

Query:
- Standard SQL (preferred) or Legacy SQL (old)
  - Cannot use both Legacy SQL and SQL 2011 (Standard SQL) in the same query
- Distributed writing of output to files, e.g. `file-0001-of-0002`
- user-defined functions in JavaScript (UDFs)
- Query jobs are actions executed asynchronously to load, export, query, or copy data.
- If you use the LIMIT clause, BigQuery will still process the entire table.
- Avoid SELECT * (full scan); select only the columns needed (SELECT * EXCEPT)
- benefits of using denormalized data:
  - increases query speed
  - makes queries simpler
  - BUT: normalization makes the dataset better organized, though less performance-optimized
- types of queries:
  - Interactive: the query is executed immediately and counts toward daily/concurrent usage (default)
  - Batch: batches of queries are queued and each query starts when idle resources are available; it only counts toward daily usage, and switches to interactive if idle for 24 hours

Legacy vs. standard SQL [3]:
- `project.dataset.tablename*`
- It is set each time you run a query
- the default query language is Legacy SQL for the classic UI and Standard SQL for the Beta UI

Windowing:
- window functions increase the efficiency and reduce the complexity of queries that analyze partitions (windows) of a dataset, by providing complex operations without the need for many intermediate calculations
- They reduce the need for intermediate tables to store temporary data

Bucketing:
- Like partitioning, but each split/partition should be the same size and is based on the hash of a column. Each bucket is a separate file, which makes for more efficient sampling and joining of data.

Partitions:
- Table partitioning improves query performance and reduces costs
- You cannot change an existing table into a partitioned table. You must create a partitioned table from scratch.
- Two types of partitioned tables:
  - Ingestion time: tables partitioned based on the data's ingestion (load) date or arrival date. Each partitioned table has the pseudo column _PARTITIONTIME, the time the data was loaded into the table. Pseudo columns are reserved for the table and cannot be used by the user.
  - Partitioned tables: tables that are partitioned based on a TIMESTAMP or DATE column.
- Partitioned tables include a pseudo column named _PARTITIONTIME that contains a date-based timestamp for data loaded into the table.
  - It can be used to query specific partitions in the WHERE clause (see the query sketch below).
- Wildcard tables: used if you want to union all similar tables with similar names, using '*' (e.g. `project.dataset.Table*`)
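A small google-cloud-bigquery sketch of the querying guidance above: restrict the scan to a single ingestion-time partition via _PARTITIONTIME and avoid SELECT * with SELECT * EXCEPT. Project, dataset, table and column names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT * EXCEPT(raw_payload)
FROM `my-project.telemetry.events`
WHERE _PARTITIONTIME = TIMESTAMP('2019-08-01')
"""

for row in client.query(sql).result():
    print(row["device_id"], row["event_type"])
```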
Anti-patterns:
- Avoid self-joins
- Partition/skew: avoid unequally sized partitions, or a value that occurs more often than any other value
- Cross-join: avoid joins that generate more outputs than inputs
- Update/insert of a single row/column: avoid point-specific DML; batch updates and inserts instead
- anti-patterns and schema design: https://cloud.google.com/bigtable/docs/schema-design

Access control:
- Security can be applied at the project and dataset level, but not at the table or view level
- the three types of resources in BigQuery are organizations, projects, and datasets
- Authorized views allow you to share query results with particular users/groups without giving them access to the underlying data
  - can be used to restrict access to particular columns or rows
  - create a separate dataset to store the view

Billing:
- based on storage (amount of data stored), querying (amount of data / number of bytes processed by the query), and streaming inserts
- Storage options are active and long-term (modified or not within the past 90 days)
- Query options are on-demand and flat-rate

Table types:
- Native tables: tables backed by native BigQuery storage.
- External tables: tables backed by storage external to BigQuery (also known as federated data sources). For more information, see Querying External Data Sources.
- Views: virtual tables defined by a SQL query. For more information, see Using Views.

Caching:
- There is no charge for a query that retrieves its results from cache.
- BigQuery caches query results for 24 hours.
- By default, a query's results are cached unless:
  - a destination table is specified
  - any of the referenced tables or logical views have changed since the results were previously cached
  - any of the tables referenced by the query have recently received streaming inserts (a streaming buffer is attached to the table), even if no new rows have arrived
  - the query uses non-deterministic functions such as CURRENT_TIMESTAMP(), NOW() or CURRENT_USER()
  - you are querying multiple tables using a wildcard
  - the query runs against an external data source

Export:
- Data can only be exported as JSON / CSV / Avro
- The only compression option available is GZIP
  - GZIP compression is not supported for Avro exports
- To export more than 1 GB of data, you need to put a wildcard in the destination filename (up to 1 GB of table data per file); see the export sketch below.
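A hedged sketch of an export (extract) job matching the notes above: a wildcard destination URI so tables larger than 1 GB can be sharded, with GZIP compression (which is not valid for Avro exports). All names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()
table_ref = bigquery.TableReference.from_string("my-project.telemetry.events")

job_config = bigquery.ExtractJobConfig()
job_config.destination_format = bigquery.DestinationFormat.CSV
job_config.compression = bigquery.Compression.GZIP

extract_job = client.extract_table(
    table_ref,
    "gs://my-export-bucket/events/events-*.csv.gz",
    job_config=job_config,
)
extract_job.result()  # block until the export job finishes
```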
For more, refer to https://cloud.google.com/bigquery/docs/ and https://github.com/jorwalk/data-engineering-gcp/blob/master/know/bigquery.md

[3] https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql
4. Big Data Processing

Covering knowledge about BigQuery, Cloud Dataflow, Cloud Dataproc, Cloud Datalab and Cloud Pub/Sub.

4.1. App Engine

- run code on managed instances of machines with automated scaling and deployment
- handles sudden and extreme spikes of traffic which require immediate scaling

4.2. GCP Compute

- VMs
- Pre-emptible instances (up to 80% discount, but they can be taken away)
- Allocated on demand; you only pay for the time they are up

4.3. Dataflow

Feature:
- Executes Apache Beam pipelines (no-ops; Beam could also run on Spark or Flink)
- Can be used for batch or streaming data
- scalable, fault-tolerant, multi-step processing of data
- Often used for data preparation/ETL of data sets: filter, group, transform
- Pipelines and how they work for ETL

Key things to focus on:
- event time vs. processing time
- configuring ETL pipelines
- how to integrate with BigQuery
- the constraints you might have (e.g. Bigtable for low-latency use and BigQuery for data exploration)
- why you would use JSON or Java related to pipelines

DataSource:
- The Cloud Dataflow connector for Cloud Bigtable makes it possible to use Cloud Bigtable in a Cloud Dataflow pipeline.
- Can read data from multiple sources, kick off multiple Cloud Functions in parallel, and write to multiple sinks in a distributed fashion
- pipelines cannot share data or transforms

Windowing:
- Can apply windowing to streams, e.g. for a rolling average over the window, the max in a window, etc.
- window types:
  - fixed time windows
  - sliding time windows (overlapping)
  - session windows
  - single global window
- The default windowing behavior is to assign all elements of a PCollection to a single global window, even for unbounded PCollections

Triggers:
- A trigger determines when a window's contents should be emitted, based on certain criteria being met
- Allows specifying when (in processing time) results for the given window can be produced
- If unspecified, the default behavior is to trigger first when the watermark passes the end of the window, and then trigger again every time there is late-arriving data
- Time-based triggers:
  - Event time triggers: operate on the event time, as indicated by the timestamp on each data element. Beam's default trigger is event-time based.
  - Processing time triggers: operate on the processing time, the time when the data element is processed at any given stage in the pipeline.
- Data-driven triggers: operate by examining the data as it arrives in each window and firing when the data meets a certain property. Currently, data-driven triggers only support firing after a certain number of data elements.
- Composite triggers: combine multiple triggers in various ways.

Tech:
- PCollection: an abstraction that represents a potentially distributed, multi-element data set, which acts as the pipeline's data
- A transform is a data processing operation, or step, in your pipeline. A transform takes one or more PCollections as input, performs a processing function that you provide on the elements of that PCollection, and produces an output PCollection.
- DirectPipelineRunner allows you to execute operations in the pipeline directly and locally
- Create a cron job with the Google App Engine Cron Service to run a Cloud Dataflow job on a schedule

IAM:
- the dataflow.developer role enables a developer to interact with the Cloud Dataflow job, with data privacy
- the dataflow.worker role provides the permissions necessary for a Compute Engine service account to execute work units for a Dataflow pipeline

Pipeline update:
- With Update, you replace an existing pipeline in place with a new one and preserve Dataflow's exactly-once processing guarantee
- When updating a pipeline manually, use DRAIN instead of CANCEL to maintain in-flight data
  - The Drain command is supported for streaming pipelines only

(A Beam pipeline sketch follows.)
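A minimal Apache Beam sketch tying the Dataflow concepts together: a streaming pipeline that reads from Pub/Sub, applies fixed one-minute windows, counts elements per window and writes rows to BigQuery. Topic and table names are hypothetical, and the BigQuery table is assumed to already exist with a matching schema; run with the DirectRunner locally or the DataflowRunner on GCP.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

options = PipelineOptions(streaming=True)  # add runner/project options as needed

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
     | "Window" >> beam.WindowInto(window.FixedWindows(60))  # 1-minute fixed windows
     | "PairOne" >> beam.Map(lambda msg: ("events", 1))
     | "Count" >> beam.CombinePerKey(sum)
     | "ToRow" >> beam.Map(lambda kv: {"name": kv[0], "count": kv[1]})
     | "Write" >> beam.io.WriteToBigQuery("my-project:telemetry.event_counts"))
```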


4.4. Cloud Pub/Sub

Feature:
- serverless messaging
- decouples producers and consumers of data in large organisations / complex systems
- the glue that connects all the components
- ordering is not guaranteed
- push and pull subscriptions
  - pull is a more efficient message delivery/consumption mechanism

Message flow:
- The publisher creates a topic in the Cloud Pub/Sub service and sends messages to the topic.
- Messages are persisted in a message store until they are delivered and acknowledged by subscribers.
- The Pub/Sub service forwards messages from a topic to all of its subscriptions individually. Each subscription receives messages by either push or pull.
- The subscriber receives pending messages from its subscription and acknowledges each message.
- When a message is acknowledged by the subscriber, it is removed from the subscription's message queue.

Deduplication:
- Maintain a database table to store the hash value and other metadata for each data entry.
- Cloud Pub/Sub assigns a unique `message_id` to each message, which can be used to detect duplicate messages received by the subscriber.
- Many duplicate messages may occur when the endpoint is not acknowledging messages within the acknowledgement deadline.

Benefits of asynchronous processing:
- availability (buffer during outages)
- change management
- throughput (balance load among workers)
- unification (cross-organisational)
- latency (accept requests at the edge of the network)
- consistency

Basic concepts:
- topics
- subscriptions
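A minimal google-cloud-pubsub sketch of the message flow above: publish to a topic, then pull and acknowledge from a subscription; the `message_id` on each message is what a consumer can record for deduplication. Project, topic and subscription names are hypothetical, and the exact pull/acknowledge signatures vary slightly between client library versions.

```python
from google.cloud import pubsub_v1

project = "my-project"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project, "sensor-events")
publisher.publish(topic_path, b"temp=21.7", device="device42").result()

subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(project, "sensor-events-sub")
response = subscriber.pull(subscription=sub_path, max_messages=10)

for received in response.received_messages:
    print(received.message.message_id, received.message.data)  # message_id for dedup

subscriber.acknowledge(
    subscription=sub_path,
    ack_ids=[m.ack_id for m in response.received_messages],
)
```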
4.5. Dataproc

Feature:
- Dataproc is a managed (no-ops) Hadoop cluster on GCP (i.e. managed Hadoop, Pig, Hive and Spark programs)
- automated cluster management and resizing
- code/query only
- job management screen in the console
- think in terms of a 'job-specific resource': for each job, create a cluster and then delete it
- used if migrating existing on-premise Hadoop or Spark infrastructure to Google Cloud Platform without redevelopment effort

Storage:
- Can store data on disk (HDFS) or can use GCS
- GCS allows the use of preemptible machines that can reduce costs significantly
- store in HDFS (split up on the cluster, but requires the cluster to be up) or in GCS (separate cluster and storage)

Customize the software:
- set initialization actions
- modify configuration files using cluster properties
- log into the master node and make changes from there
- NOT via Cloud Deployment Manager

Tech:
- when creating a new Cloud Dataproc cluster with the projects.regions.clusters.create operation, these four values are required: project, region, name, and zone (see the sketch below)
- you can access the YARN web interface by configuring a browser to connect through a SOCKS proxy
- you can SSH directly to the cluster nodes
- you can use a SOCKS proxy to connect your browser through an SSH tunnel
- the YARN ResourceManager and the HDFS NameNode interfaces are available on the master node

Billing:
- billed by the second. All Cloud Dataproc clusters are billed in one-second clock-time increments, subject to a 1-minute minimum.

IAM:
- Service accounts used with Cloud Dataproc must have the Dataproc Worker role (or have all the permissions granted by the Dataproc Worker role).
- They need permissions to read from and write to Google Cloud Storage, and to write to Google Cloud Logging.
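A hedged sketch of the projects.regions.clusters.create call mentioned above, using the Google API discovery client; only the four required values (project, region, cluster name, zone) are supplied, all of which are hypothetical here. The job-specific pattern would then be: create the cluster, submit the job, delete the cluster.

```python
from googleapiclient import discovery

dataproc = discovery.build("dataproc", "v1")

cluster_body = {
    "projectId": "my-project",
    "clusterName": "ephemeral-job-cluster",
    "config": {"gceClusterConfig": {"zoneUri": "us-central1-a"}},
}

request = dataproc.projects().regions().clusters().create(
    projectId="my-project",
    region="us-central1",
    body=cluster_body,
)
operation = request.execute()  # returns a long-running operation to poll
print(operation.get("name"))
```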


4.6. Dataprep

- Managed Trifacta for preparing, analysing the quality of, and transforming input data
- a service for visually exploring, cleaning, and preparing data for analysis; can transform data of any size stored in CSV, JSON, or relational-table formats


4.7. Cloud Functions

- Node.js functions as a service
- No ops, no server: just a code entry point and a response, autoscaled by GCP
- Can be triggered by Dataflow, GCS bucket events, Pub/Sub messages and HTTP calls
5. Machine Learning [4]

Covering knowledge of the GCP ML APIs (Vision API, Speech API, Natural Language API and Translate API) and TensorFlow.

- Understand the different ML services available: ML Engine, the ML APIs, and TensorFlow, as well as the relevance of Cloud Datalab.
- mostly ML domain knowledge (and not TensorFlow specific), basically about training; nothing about the Cloud ML service
- Embeddings
- Deploying models

[4] This part is not well organised yet.

5.1. ML Terms

- Label: the correct classification/value
- Input: the predictor variables
- Example: an input + label sample used to train your model
- Model: a mathematical function; some work is done on the inputs to produce an output
- Training: adjusting variable weights in a model to minimise error
- Prediction: using the model to guess the label for an input
- Supervised learning: training your model using example data to predict future data
- Unsupervised learning: data is analysed without labels, for patterns or clusters
- Neuron: a way to combine inputs and weight them to make a decision (one unit of input combination)
- Gradient descent: the process of testing error in order to minimise its value; iteratively, the error decreases towards a minimum. (This can be a global or a local minimum, and starting points and learning rates are important to ensure the process doesn't stop at a local minimum. Too low a learning rate and your model will train slowly; too high and it may miss the minimum.) A toy example follows this list.
- Hidden layer: a set of neurons that act on the same input data
- Features: the data values/fields you choose to model; these can be transformed (x^2, y^2, etc.)
- Feature engineering: the process of building a set of feature combinations to act on inputs
- Precision: the positive predictive value; how many times the model correctly predicted a thing as its classification (e.g. cat)
- Recall: the true positive rate; how many of the things that actually are in the class were found (the actual number of cats)
  - Example: the model only recognised 1 out of 10 cats, but every prediction it made was right: 100% precision, 10% recall
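A toy NumPy illustration of the gradient descent loop described above: a one-weight linear model where the weight is nudged against the gradient of the squared error; the learning rate trades off training speed against overshooting the minimum. The data here is synthetic.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + np.random.normal(0.0, 0.1, size=x.shape)  # true weight is about 3

w, learning_rate = 0.0, 0.05
for step in range(100):
    error = w * x - y                   # prediction minus label
    gradient = 2 * np.mean(error * x)   # derivative of mean squared error w.r.t. w
    w -= learning_rate * gradient       # step against the gradient
print("learned weight:", round(w, 3))
```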


Recommendation engines:
- cluster similar users: User A and User B both rate House Z as a 4
- cluster similar items (products, houses, etc.): most users rate House Y as a 2
- combine these two to produce a rating

5.2. ML Basics

- There are two main stages of ML: training and inference. Inference is often predictive in nature.
- Common models tend to include regression (what value?) and classification (what category?)
- Converting inputs to vectors for analysis is a well-documented problem that can actually be improved with the use of ML itself. Some public models already exist, including Word2vec (https://en.wikipedia.org/wiki/Word2vec).
- Initial weight selection is hard to get right; human intuition can often help with this starting point for gradient descent.
- Weights are iteratively tweaked: set initial weights -> calculate error -> adjust weights -> recalculate error -> repeat until the error is minimised.
- In image ML each pixel is represented by a number, so vectorising images is actually much easier than text, where you have to recognise that "man" is to "woman" as "boy" is to "girl", so the vectors need to have similar magnitude differences.
- Overfitting and how to correct it
- Neural network basics (nodes / layers)
- Use the TensorFlow Playground to understand neural networks

Neural networks:
- The goal is to minimize cost
- Cost depends on the problem: usually the sum of squared errors for regression problems or cross-entropy for classification problems
- Feature engineering is less needed than for linear models, but still useful
- Understand how to reduce noise

Wide & Deep Learning models:
- The wide model is used for memorization, while the deep model is used for generalization
- Use for recommender systems, search, and ranking problems

Effective ML:
- Data collection
- Data organisation
- Model creation using human insight and domain knowledge
- Use machines to flesh out the model from the input data
- Your dataset should cover all cases, both positive and negative, and should have at least 5 cases of each; otherwise the model cannot correctly classify that case. Near misses are also important for this. Explore your data, find the causes of problems and try to fix the issue. If it can't be fixed, try to find more 'bad' cases to train your model on; otherwise remove these from the data set.

Machine Learning on GCP:
- TensorFlow: for the ML researcher; use the SDK
- Cloud ML Engine: for the data scientist; custom models, scalable, no-ops (where you have enough data to train a model)
- ML APIs: for the app developer; pre-built models, e.g. vision, speech, language (where you would need more data to train your own model)

Online training and/or continued learning:
- build a pipeline to continue training your model based on both new and old data


5.3. ML Engine

- Train and predict with ML models
- Can use multiple ML frameworks such as TensorFlow, scikit-learn and XGBoost

Training cluster:
- The training service allocates the resources for the machine types you specify.
- Replica: your running job on a given node is called a replica.
- Each replica in the training cluster is given a single role or task in distributed training:
  - master: exactly one replica is designated the master. This task manages the others and reports status for the job as a whole. If you are running a single-process job, the sole replica is the master for the job.
  - workers: one or more replicas may be designated as workers. These replicas do the portion of the work that you designate in your job configuration.
  - parameter servers: one or more replicas may be designated as parameter servers. These replicas coordinate shared model state between the workers.
- The CUSTOM tier for Cloud Machine Learning Engine allows you to specify the number of workers and parameter servers.

Online versus batch prediction:
- Online:
  - optimized to minimize the latency of serving predictions
  - predictions are returned in the response message
  - returns as soon as possible
- Batch:
  - optimized to handle a high volume of instances in a job and to run more complex models
  - predictions are written to output files in a Cloud Storage location that you specify
  - asynchronous request

Exceptions:
- The training service runs until your job succeeds or encounters an unrecoverable error.
- In the distributed case, it is the status of the master replica that signals the overall job status.

Running a Cloud ML Engine training job locally (gcloud ml-engine local train) is especially useful for testing distributed models.
5.4. TensorFlow

- Open-source machine learning / deep learning platform
- Lazy evaluation while building the graph, full evaluation during execution
- Machine learning library
- Underpins many of Google's products
- C++ engine and API (so it can run on GPUs)
- Python API (so you can easily write code)
- To use TensorFlow:
  - Collect predictor and target data
    - discard info that identifies a row (you need at least 5-10 examples of a particular value to avoid overfitting)
    - predictor columns must be numerical (not categorical / codes)
  - Create the model
    - how many nodes and layers do we need?
  - Train the model based on the input data
    - a regression model predicts a number
    - a classification model predicts a category
  - Use the model on new data

Feature engineering:
- a sparse vector is very long, with many zeros, and contains only a single 1
- If you don't know the set of possible values in advance, you can use categorical_column_with_hash_bucket instead.
- An embedding is a mapping from discrete objects, such as words, to vectors of real numbers.
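A small TensorFlow sketch connecting the bullets above: a hashed categorical column for an ID whose vocabulary isn't known in advance, an embedding that maps it to a short vector of real numbers, and a canned DNN classifier (a classification model predicts a category; a regressor would predict a number). Feature names are hypothetical and the tf.feature_column / tf.estimator APIs are assumed.

```python
import tensorflow as tf

# Sparse categorical input with an unknown vocabulary -> hash bucket.
device = tf.feature_column.categorical_column_with_hash_bucket(
    "device_id", hash_bucket_size=1000)

# Embedding: map the discrete IDs to dense vectors of real numbers.
device_embedding = tf.feature_column.embedding_column(device, dimension=8)
temperature = tf.feature_column.numeric_column("temperature")

model = tf.estimator.DNNClassifier(
    feature_columns=[device_embedding, temperature],
    hidden_units=[16, 8],   # how many nodes and layers do we need?
    n_classes=2,
)
# model.train(input_fn=...) would then fit the weights on example records.
```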
5.5. Datalab

- Datalab: managed Jupyter notebooks, great for use with a Dataproc cluster to write PySpark jobs
- open-source notebook built on Jupyter
- use existing Python packages
- can also insert SQL and JavaScript for BigQuery, and HTML for web content, charts and tables
- analyse data in BigQuery, GCE and GCS
- free; you just pay for the resources you use
- Three ways to run Datalab:
  - locally (good if only one person is using it)
  - Docker on GCE (better; usable by multiple people through SSH/Cloud Shell, uses resources on GCE)
  - Docker + Gateway (best; uses a gateway and proxy, runs locally)
- a powerful interactive tool created to explore, analyze, transform and visualize data and build machine learning models on Google Cloud Platform

6. Management

6.1. IAM & Billing

- 3 member types: service account, Google account and Google group
  - Service accounts are for non-human users such as applications
  - Google accounts are for single users
  - Google groups are for multiple users
- Project transfer fees are associated with the instigator.
- Billing access can be provided to a project or set of projects without granting access to the content. This is useful for separation of duties between finance, developers, etc.
- Know how billing works across projects.

6.2. Stackdriver

- For storing, searching, analysing, monitoring, and alerting on log data and events.
- Be sure to know the sub-products of Stackdriver (Debugger, Error Reporting, Alerting, Trace, Logging), what they do and when they should be used.
- Hybrid monitoring service.
- Know how you can debug, monitor and log using Stackdriver, and how to use Stackdriver to help debug source code.
- Audit Logs to review data access (e.g. BigQuery)
- Stackdriver Monitoring
  - can see the usage of BigQuery query slots
- Stackdriver Trace
  - a distributed tracing system for Google Cloud Platform that collects latency data from Google App Engine, Google HTTP(S) load balancers, and applications instrumented with the Stackdriver Trace SDKs, and displays it in near real time in the Google Cloud Platform Console

6.3. Data Studio

- Data dashboards
- can use the existing YouTube data source
- The prefetch cache is only active for data sources that use the owner's credentials to access the underlying data.
- Disabling the query cache could result in higher data usage costs for paid data sources, such as BigQuery.
- You can turn the prefetch cache off for a given report. You might want to do this if your data changes frequently and you want to prioritize freshness over performance.
- You can turn the prefetch cache on if you are using a data source that incurs usage costs (e.g. BigQuery) and want to minimize those costs.

6.4. Cloud Shell

- temporary VM
- recycled every 60 minutes (approx.)

6.5. Cloud Deployment Manager

- Allows you to specify all the resources needed for your application in a declarative format using YAML
- Repeatable deployment process

6.6. Data Transfer

- Storage Transfer Service
  - import online data into Cloud Storage
  - repeating schedule
  - transfer data within Cloud Storage, from one bucket to another
  - sources: GCS, S3, URL
- Transfer Appliance
  - one-time transfer
  - rack, capture and then ship your offline data to Google Cloud
- If you have a large number of files to transfer, you might want to use the gsutil -m option to perform a parallel (multi-threaded/multi-processing) copy.
- Compressing and combining smaller files into fewer, larger files is also a best practice for speeding up transfer speeds.

Avro data format [5]:
- Is faster to load. The data can be read in parallel, even if the data blocks are compressed.
- Doesn't require typing or serialization.
- Is easier to parse because there are none of the encoding issues found in other formats such as ASCII.
- Compressed Avro files are not supported, but compressed data blocks are. BigQuery supports the DEFLATE and Snappy codecs.

[5] https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro

7. Product Selection

- If there is a requirement to search terabytes to petabytes of data relatively quickly, it will make more sense to simply store it in BigQuery (comparable to AWS Redshift).
- For Datastore, there is a possibility that it could work as a replacement for Cassandra; it is most likely that BigTable will be a better solution.
  - if the data set is relatively small (< 10 TB), then Datastore is preferred
  - if the data set is > 10 TB and/or there is no requirement for multiple indexes, then BigTable will be better
- Be aware of any limitations regarding indexes and partitioning. [6]
- Searching for objects by attribute value: Datastore (BigTable only supports lookup by a single row key)
- High-throughput writes of wide-column data: Bigtable
- Warehousing structured data: BigQuery

[6] https://cloud.google.com/storage-options/
8. Case Study

There are 2 case studies, which are the same as on the GCP website: a logistics company (Flowlogistic) and a communications hardware company (MJTelco). Each case study includes about 4 questions which ask how to transform that company's current technology to use GCP technologies. We can learn details about these case studies on Linux Academy.

The correct answer is not necessarily the best technical solution, but the solution that achieves the right outcome for the company based on their current limitations. [7]

Flowlogistic: https://cloud.google.com/certification/guides/data-engineer/casestudy-flowlogistic
- find a way to store the data that can be commonly used by Dataproc and BigQuery
- both Dataproc and BigQuery integrate well with Cloud Storage

MJTelco: https://cloud.google.com/certification/guides/data-engineer/casestudy-mjtelco

Other helpful case:
- Spotify's Event Delivery: The Road to the Cloud: https://labs.spotify.com/2016/02/25/spotifys-event-delivery-the-road-to-the-cloud-part-i/

[7] https://www.linkedin.com/pulse/google-cloud-certified-professional-data-engineer-writeup-rix/

9. Resources and references

Resources:
- Google Data Engineering Cheatsheet: https://github.com/ml874/Data-Engineering-on-GCP-Cheatsheet
- Data Engineering Roadmap: https://github.com/hasbrain/data-engineer-roadmap
- Whizlabs mock exams
- Let me know if any source is missing

YouTube videos to watch:
- Deconstructing a Customer Case: Data Engineer Exam: https://www.youtube.com/watch?v=r_yYDysfB-k
- Auto-awesome: advanced data science on Google Cloud Platform (Google Cloud Next '17): https://www.youtube.com/watch?v=Jp-qJFF9jww&list=PLIivdWyY5sqLq-eM4W2bIgbrpAsP5aLtZ
- Introduction to Google Cloud Machine Learning (Google Cloud Next '17): https://www.youtube.com/watch?v=COSXg5HKaO4
- Introduction to big data: tools that deliver deep insights (Google Cloud Next '17): https://www.youtube.com/watch?v=dlrP2HJMlZg
- Easily prepare data for analysis with Google Cloud (Google Cloud Next '17): https://www.youtube.com/watch?v=Q5GuTIgmt98
- Serverless data processing with Google Cloud Dataflow (Google Cloud Next '17): https://www.youtube.com/watch?v=3BrcmUqWNm0
- Data Modeling for BigQuery (Google Cloud Next '17): https://www.youtube.com/watch?v=Vj6ksosHdhw
- Migrating your data warehouse to Google BigQuery: Lessons Learned (Google Cloud Next '17): https://www.youtube.com/watch?v=TLpfGaYWshw
- Webinar: Building a real-time analytics pipeline with BigQuery and Cloud Dataflow (EMEA): https://www.youtube.com/watch?v=kdmAiQeYGgE
