GCCF Unit 5
Please read this disclaimer before proceeding:
This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document through
email in error, please notify the system manager. This document contains
proprietary information and is intended only to the respective group / learning
community as intended. If you are not the addressee you should not disseminate,
distribute or copy through e-mail. Please notify the sender immediately by e-mail
if you have received this document by mistake and delete this document from your
system. If you are not the intended recipient you are notified that disclosing,
copying, distributing or taking any action in reliance on the contents of this
information is strictly prohibited.
20CS929
Department : CSE
Created by:
Date : 04.09.2023
1. CONTENTS
S. No. | Contents
1 | Contents
2 | Course Objectives
3 | Pre Requisites
4 | Syllabus
5 | Course Outcomes
7 | Lecture Plan
9 | Lecture Notes
10 | Assignments
12 | Part B Questions
16 | Assessment Schedule
• Pre-requisite Chart
CO | Level | PO-1 | PO-2 | PO-3 | PO-4 | PO-5 | PO-6 | PO-7 | PO-8 | PO-9 | PO-10 | PO-11 | PO-12 | PSO1 | PSO2 | PSO3
C306.1 | K3 | 2 | 1 | 1 | - | - | - | - | - | - | 2 | 2 | 2 | 2 | 2 | 2
C306.2 | K3 | 3 | 3 | 3 | - | - | - | - | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2
C306.3 | K3 | 3 | 3 | 3 | - | - | 2 | - | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2
C306.4 | K3 | 3 | 3 | 3 | - | - | - | - | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2
C306.5 | K3 | 3 | 3 | 3 | - | - | 2 | - | - | 2 | 2 | 2 | 2 | 2 | 2 | 2
Correlation Level:
1. Slight (Low)
2. Moderate (Medium)
3. Substantial (High)
If there is no correlation, put "-".
7. LECTURE PLAN
Sl. No. | Topic of Lecture | Number of Periods | Proposed Date | Actual Date | CO | Taxonomy Level | Mode of Delivery
1 | Introduction to big data managed services in the cloud | 1 | 14.10.2023 |  | CO5 | K2 | PPT
2 | Leverage big data operations with Cloud Dataproc | 1 | 16.10.2023 |  | CO5 | K3 | PPT
3 | Build Extract, Transform, and Load pipelines using Cloud Dataflow | 1 | 18.10.2023 |  | CO5 | K3 | PPT
4 | BigQuery, Google’s Enterprise Data Warehouse | 1 | 19.10.2023 |  | CO5 | K3 | PPT
5 | Introduction to machine learning in the cloud | 1 | 20.10.2023 |  | CO5 | K2 | PPT
6 | Building bespoke machine learning models with AI Platform | 1 | 21.10.2023 |  | CO5 | K3 | PPT
8 | Google’s pre-trained machine learning APIs | 1 | 03.11.2023 |  | CO5 | K3 | PPT
9. LECTURE NOTES

Big Data is a collection of datasets so large and complex that it is very difficult to
process using legacy data processing applications.
So, legacy or traditional systems cannot process a large amount of data in one go.
Tools for big data can help with the volume of the data collected, the speed at which that
data becomes available to an organization for analysis, and the complexity or varieties of
that data.
Data can be a company’s most valuable asset. Using big data to reveal insights can help
understand the areas that affect your business—from market conditions and customer
purchasing behaviors to your business processes.
Big data consists of petabytes (more than 1 million gigabytes) and exabytes (more than
1 billion gigabytes), as opposed to the gigabytes common for personal devices.
As big data emerged, so did computing models with the ability to store and manage it.
Centralized or distributed computing systems provide access to big data. Centralized
computing means the data is stored on a central computer and processed by computing
platforms like BigQuery.
Distributed computing means big data is stored and processed on different computers,
which communicate over a network. A software framework like Hadoop makes it possible
to store the data and run applications to process it.
There are benefits to using centralized computing and analyzing big data where it lives,
rather than extracting it for analysis from a distributed system. Insights are accessible to
every user in your company—and integrated into daily workflows—when big data is
housed in one place and analyzed by one platform.
Big data is different from typical data assets because of its volume, complexity, and need
for advanced business intelligence tools to process and analyze it. The attributes that
define big data are volume, variety, velocity, and variability. These big data attributes are
commonly referred to as the four V's.
Volume: The key characteristic of big data is its scale—the volume of data that is
available for collection by your enterprise from a variety of devices and sources.
Variety: Variety refers to the formats that data comes in, such as email messages, audio
files, videos, sensor data, and more. Classifications of big data variety include structured,
semi-structured, and unstructured data.
Velocity: Big data velocity refers to the speed at which large datasets are acquired,
processed, and accessed.
Variability: Big data variability means the meaning of the data constantly changes.
Therefore, before big data can be analyzed, the context and meaning of the datasets
must be properly understood.
Big data is commonly classified into three types:
Structured Data
Unstructured Data
Semi-structured Data
The above three types of Big Data are technically applicable at all levels of analytics. It is
critical to understand the source of raw data and its treatment before analysis while
working with large volumes of big data. Because there is so much data, extraction of
information needs to be done efficiently to get the most out of the data.
Google Cloud provides three key managed services for big data:
Dataproc
Dataflow
BigQuery
Google Cloud Dataproc
Cloud Dataproc is a managed Apache Hadoop and Spark service for batch processing,
querying, streaming and machine learning. Users can quickly spin up Hadoop or Spark
clusters and resize them at any time without compromising data pipelines through
automation and orchestration. It can be fully integrated with other Google big data
services, such as BigQuery and Bigtable, as well as Stackdriver Logging and Monitoring.
Google Cloud Dataflow
Cloud Dataflow is a serverless stream and batch processing service. Users can build a
pipeline to manage and analyze data in the cloud, while Cloud Dataflow automatically
manages the resources. It was built to integrate with other Google services, including
BigQuery and Cloud Machine Learning, as well as third-party products, such as Apache
Spark and Apache Beam.
Google BigQuery:
BigQuery is a data warehouse that processes and analyzes large data sets using SQL
queries. These services can capture and examine streaming data for real-time analytics.
It stores data with Google's Capacitor columnar data format, and users can load data via
streaming or batch loads. To load, export, query and copy data, use the classic web UI,
the web UI in the GCP Console, the bq command-line tool or client libraries. Since
BigQuery is a serverless offering, enterprises only pay for the storage and compute they
consume.
DATAPROC:
Dataproc is a managed Spark and Hadoop service that lets you take advantage of open-
source data tools for batch processing, querying, streaming, and machine learning.
Dataproc automation helps to create clusters quickly, manage them easily, and save
money by turning clusters off when you don't need them.
In addition, Dataproc has powerful features to enable your organization to lower costs,
increase performance and streamline operational management of workloads running on
the cloud.
Apache Hadoop and Apache Spark are open-source technologies that often are the
foundation of big data processing.
Apache Hadoop is a set of tools and technologies which enables a cluster of computers
to store and process large volumes of data. The Apache Hadoop software library is a
framework that allows for the distributed processing of large data sets across clusters of
computers using simple programming models.
Apache Spark is a unified analytics engine for large-scale data processing and achieves
high performance for both batch and stream data.
Cost effective: Dataproc is priced at 1 cent per virtual CPU per cluster per hour, on top
of any other Google Cloud resources you use. In addition, Dataproc clusters can include
preemptible instances that have lower compute prices. You use and pay for things only
when you need them.
Fast and scalable: Dataproc clusters are quick to start, scale, and shut down, and each
of these operations takes 90 seconds or less, on average. Clusters can be created and
scaled quickly with many virtual machine types, disk sizes, number of nodes, and
networking options.
Open-source ecosystem: You can use Spark and Hadoop tools, libraries, and
documentation with Dataproc.
Fully managed: You can easily interact with clusters and Spark or Hadoop jobs, without
the assistance of an administrator or special software, through the Cloud Console, the
Google Cloud SDK, or the Dataproc REST API (a client-library sketch follows this list).
Image versioning: Dataproc’s image versioning feature lets you switch between
different versions of Apache Spark, Apache Hadoop, and other tools.
Built-in integration: The built-in integration with Cloud Storage, BigQuery, and Cloud
Bigtable ensures that data will not be lost. This, together with Cloud Logging and Cloud
Monitoring, provides a complete data platform and not just a Spark or Hadoop cluster.
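As a rough illustration of the "fully managed" interaction mentioned above, the sketch below submits a PySpark job to an existing Dataproc cluster with the google-cloud-dataproc Python client library. The project ID, region, cluster name, and the gs:// script path are placeholder assumptions, not values from this course material.

# Submit a PySpark job to an existing Dataproc cluster (illustrative sketch).
from google.cloud import dataproc_v1

PROJECT_ID = "my-project"     # assumed project ID
REGION = "us-central1"        # assumed region
CLUSTER_NAME = "my-cluster"   # assumed existing cluster

# The client must point at the regional Dataproc endpoint for the cluster's region.
job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/wordcount.py"},
}

# Submit the job and wait for it to finish.
operation = job_client.submit_job_as_operation(
    request={"project_id": PROJECT_ID, "region": REGION, "job": job}
)
response = operation.result()
print("Job finished with state:", response.status.state.name)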
Some of the core open-source components included with Dataproc clusters, such as
Apache Hadoop and Apache Spark, provide web interfaces. These interfaces can be used
to manage and monitor cluster resources and facilities, such as the YARN resource
manager, the Hadoop Distributed File System (HDFS), MapReduce, and Spark.
You can connect to web interfaces running on a Dataproc cluster using the Dataproc
Component Gateway, your project's Cloud Shell, or the Google Cloud CLI gcloud
command-line tool.
Dataproc components:
When you create a cluster, standard Apache Hadoop ecosystem components are
automatically installed on the cluster. You can also install additional components, called
"optional components," on the cluster when you create it by specifying the
--optional-components flag:

gcloud dataproc clusters create CLUSTER_NAME \
    --optional-components=COMPONENT_NAME(s) \
    ... other flags
Dataproc clusters are built on Compute Engine instances. Machine types define the
virtualized hardware resources available to an instance. Compute Engine offers both
predefined machine types and custom machine types. Dataproc clusters can use both
predefined and custom machine types for master and/or worker nodes.
Dataproc supports the following Compute Engine predefined machine types in clusters:
General purpose machine types, which include N1, N2, N2D, and E2 machine
types:
Dataproc also supports N1, N2, N2D, and E2 custom machine types.
Compute-optimized machine types, which include C2 machine types.
Memory-optimized machine types, which include M1 and M2 machine types.
Custom machine types:
Custom machine types are ideal for the following workloads:
Workloads that are not a good fit for the predefined machine types.
Workloads that require more processing power or more memory, but don't
need all of the upgrades that are provided by the next machine type level.
Dataproc integrates with Apache Hadoop and the Hadoop Distributed File System (HDFS).
The following features and considerations can be important when selecting compute and
data storage options for Dataproc clusters and jobs:
HDFS with Cloud Storage: Dataproc uses the Hadoop Distributed File System
(HDFS) for storage. Additionally, Dataproc automatically installs the HDFS-
compatible Cloud Storage connector, which enables the use of Cloud Storage
in parallel with HDFS. Data can be moved in and out of a cluster through
upload/download to HDFS or Cloud Storage.
VM disks:
By default, when no local SSDs are provided, HDFS data and intermediate
shuffle data is stored on VM boot disks, which are Persistent Disks.
If you use local SSDs, HDFS data and intermediate shuffle data is stored on
the SSDs.
Persistent Disk size and type affect performance and VM size, whether using
HDFS or Cloud Storage for data storage.
VM Boot disks are deleted when the cluster is deleted.
Cloud Dataproc use cases:
1. Log processing
2. Ad-hoc data analysis
3. Machine learning
Dataproc Workflow Templates:
The Dataproc WorkflowTemplates API provides a flexible and easy-to-use mechanism for
managing and executing workflows. A Workflow Template is a reusable workflow
configuration. It defines a graph of jobs with information on where to run those jobs.
Key Points:
If the workflow uses a managed cluster, it creates the cluster, runs the jobs, and then
deletes the cluster when the jobs are finished.
If the workflow uses a cluster selector, it runs jobs on a selected existing cluster.
Workflows are ideal for complex job flows. You can create job dependencies so that a job
starts only after its dependencies complete successfully.
When you create a workflow template, Dataproc does not create a cluster or submit jobs
to a cluster. Dataproc creates or selects a cluster and runs workflow jobs on the cluster
when a workflow template is instantiated.
Managed cluster:
A workflow template can specify a managed cluster. The workflow will create an
"ephemeral" cluster to run workflow jobs, and then delete the cluster when the workflow
is finished.
Cluster selector:
A workflow template can specify an existing cluster on which to run workflow jobs by
specifying one or more user labels previously attached to the cluster. The workflow will
run on a cluster that matches all of the labels. If multiple clusters match all labels,
Dataproc selects the cluster with the most YARN available memory to run all workflow
jobs. At the end of workflow, Dataproc does not delete the selected cluster.
Parameterized:
If you run a workflow template multiple times with different values, use parameters
to avoid editing the workflow template for each run:
Workflows can be instantiated inline using the gcloud command with workflow template
YAML files or by calling the Dataproc InstantiateInline API. Inline workflows do not create
or modify workflow template resources.
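A hedged sketch of instantiating an inline workflow with the google-cloud-dataproc Python client library follows: it declares one managed (ephemeral) cluster and one PySpark step, so Dataproc creates the cluster, runs the job, and deletes the cluster when the workflow finishes. All names, machine types, and gs:// paths are placeholder assumptions.

# Instantiate an inline workflow template (illustrative sketch).
from google.cloud import dataproc_v1

PROJECT_ID, REGION = "my-project", "us-central1"   # assumed values

client = dataproc_v1.WorkflowTemplateServiceClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

template = {
    "placement": {
        "managed_cluster": {
            "cluster_name": "ephemeral-cluster",
            "config": {
                "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
                "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
            },
        }
    },
    "jobs": [
        {
            "step_id": "prepare-data",
            "pyspark_job": {"main_python_file_uri": "gs://my-bucket/prepare.py"},
        }
    ],
}

operation = client.instantiate_inline_workflow_template(
    request={"parent": f"projects/{PROJECT_ID}/regions/{REGION}", "template": template}
)
operation.result()   # blocks until the jobs finish and the ephemeral cluster is deleted
print("Workflow finished.")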
DATAFLOW:
Dataflow is a serverless, fast and cost-effective service that supports both stream and
batch processing. It provides portability with processing jobs written using the open-
source Apache Beam libraries and removes operational overhead from your data
engineering teams by automating the infrastructure provisioning and cluster
management.
The Apache Beam SDK is an open-source programming model that enables you to develop
both batch and streaming pipelines. You create pipelines with an Apache Beam program
and then run them on the Dataflow service.
Apache Beam Programming Model:
Apache Beam is an open source, unified model for defining both batch- and streaming-
data parallel-processing pipelines. The Apache Beam programming model simplifies the
mechanics of large-scale data processing. Using one of the Apache Beam SDKs, you build
a program that defines the pipeline. Then, one of Apache Beam's supported distributed
processing backends, such as Dataflow, executes the pipeline.
Basic concepts:
Pipelines
A pipeline encapsulates the entire series of computations involved in reading
input data, transforming that data, and writing output data. The input source and
output sink can be the same or of different types, allowing you to convert data
from one format to another.
PCollection
A PCollection represents a potentially distributed, multi-element dataset that
acts as the pipeline's data. Apache Beam transforms use PCollection objects as
inputs and outputs for each step in your pipeline. A PCollection can hold a dataset
of a fixed size or an unbounded dataset from a continuously updating data source.
Transforms
A transform represents a processing operation that transforms data. A
transform takes one or more PCollections as input, performs an operation that you
specify on each element in that collection and produces one or more PCollections
as output. A transform can perform nearly any kind of processing operation,
including performing mathematical computations on data, converting data from
one format to another, grouping data together, reading and writing data, filtering
data to output only the elements you want or combining data elements into single
values.
ParDo
ParDo is the core parallel processing operation in the Apache Beam SDKs,
invoking a user-specified function on each of the elements of the
input PCollection. ParDo collects the zero or more output elements into an
output PCollection. The ParDo transform processes elements independently and
possibly in parallel.
Pipeline I/O
Apache Beam I/O connectors let you read data into your pipeline and write
output data from your pipeline. An I/O connector consists of a source and a sink.
All Apache Beam sources and sinks are transforms that let your pipeline work with
data from several different data storage formats. You can also write a custom I/O
connector.
Aggregation
Aggregation is the process of computing some value from multiple input
elements. The primary computational pattern for aggregation in Apache Beam is
to group all elements with a common key and window. Then, it combines each
group of elements using an associative and commutative operation.
Source
A transform that reads from an external storage system. A pipeline typically
reads input data from a source. The source has a type, which may be different
from the sink type, so you can change the format of data as it moves through the
pipeline.
Sink
A transform that writes to an external data storage system, like a file or a database.
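The sketch below ties these concepts together with the Apache Beam Python SDK: a pipeline reads from a source, applies a ParDo transform and an aggregation, and writes to a sink. The gs:// paths are placeholder assumptions; adding Dataflow-specific pipeline options would run the same code on the Dataflow service.

# A minimal word-count pipeline illustrating source, ParDo, aggregation, and sink.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class ExtractWords(beam.DoFn):
    """ParDo: emit zero or more lower-cased words for each input line."""
    def process(self, line):
        for word in line.split():
            yield word.lower()


options = PipelineOptions()   # pass --runner=DataflowRunner, --project, etc. to run on Dataflow

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read"   >> beam.io.ReadFromText("gs://my-bucket/input.txt")   # source
        | "Words"  >> beam.ParDo(ExtractWords())                         # transform (ParDo)
        | "Count"  >> beam.combiners.Count.PerElement()                  # aggregation
        | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
        | "Write"  >> beam.io.WriteToText("gs://my-bucket/output")       # sink
    )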
Streaming pipelines:
For unbounded PCollections in streaming pipelines, Dataflow divides the continuously
arriving data into windows so that aggregations can be computed over logical, finite
chunks of data. The concept of windows also applies to bounded PCollections that
represent data in batch pipelines.
A watermark is a threshold that indicates when Dataflow expects all of the data in a
window to have arrived. If new data arrives with a timestamp that's in the window but
older than the watermark, the data is considered late data.
The data source determines the watermark. You can allow late data with the Apache
Beam SDK. Dataflow SQL does not process late data.
Triggers:
Triggers determine when to emit aggregated results as data arrives. By default, results
are emitted when the watermark passes the end of the window.
You can use the Apache Beam SDK to create or modify triggers for each collection in a
streaming pipeline. You cannot set triggers with Dataflow SQL.
The Apache Beam SDK can set triggers that operate on any combination of the following
conditions:
Event time, as indicated by the timestamp on each data element.
Processing time, which is the time that the data element is processed at any given
stage in the pipeline.
The number of data elements in a collection.
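As a small, hedged illustration of windows, watermarks, and triggers in the Apache Beam Python SDK, the sketch below groups keyed elements into 60-second fixed windows, emits a result when the watermark passes the end of each window, and re-emits when late data arrives within the allowed lateness. The sample elements and timestamps are made up for illustration.

# Windowing with an event-time trigger and late-data handling (illustrative sketch).
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import AccumulationMode, AfterCount, AfterWatermark

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create([("sensor-1", 3), ("sensor-1", 5), ("sensor-2", 7)])
        # Assign event-time timestamps (all at t=0 here, purely for illustration).
        | "Stamp" >> beam.Map(lambda kv: window.TimestampedValue(kv, 0))
        | "Window" >> beam.WindowInto(
            window.FixedWindows(60),                     # 60-second fixed windows
            trigger=AfterWatermark(late=AfterCount(1)),  # fire at the watermark, then once per late element
            allowed_lateness=300,                        # accept data up to 5 minutes late
            accumulation_mode=AccumulationMode.ACCUMULATING,
        )
        | "Sum" >> beam.CombinePerKey(sum)               # aggregation per key and window
        | "Print" >> beam.Map(print)
    )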
Dataflow templates allow you to package a Dataflow pipeline for deployment. Anyone
with the correct permissions can then use the template to deploy the packaged pipeline.
You can create your own custom Dataflow templates, and Google provides pre-built
templates for common scenarios.
Dataflow supports two types of template: Flex templates, which are newer, and classic
templates. If you are creating a new Dataflow template, we recommend creating it as a
Flex template.
Template workflow
1. Developers create a development environment and develop their pipeline using the Apache Beam SDK.
2. Developers package the pipeline as a template and stage it on Google Cloud.
3. Other users submit a request to the Dataflow service to run the template.
BIGQUERY:
BigQuery is a fully managed enterprise data warehouse that helps you manage
and analyze your data with built-in features like machine learning, geospatial
analysis, and business intelligence.
BigQuery's serverless architecture lets you use SQL queries to answer your
organization's biggest questions with zero infrastructure management.
BigQuery maximizes flexibility by separating the compute engine that analyzes
your data from your storage choices.
BigQuery interfaces include the Google Cloud console and the BigQuery
command-line tool. Developers and data scientists can use client libraries with
familiar programming languages, including Python, Java, JavaScript, and Go, as
well as BigQuery's REST API and RPC API, to transform and manage data.
BigQuery storage:
BigQuery stores data using a columnar storage format that is optimized for
analytical queries.
BigQuery presents data in tables, rows, and columns and provides full support for
database transaction semantics (ACID).
BigQuery storage is automatically replicated across multiple locations to provide
high availability.
BigQuery administration:
Batch loading:
With batch loading, you load the source data into a BigQuery table in a single batch
operation.
Load jobs. Load data from Cloud Storage or from a local file by creating a load
job.
SQL. The LOAD DATA SQL statement loads data from one or more files into a new
or existing table.
BigQuery Data Transfer Service. Use the BigQuery Data Transfer Service to
automate loading data from Google Software as a Service (SaaS) apps or from
third-party applications and services.
BigQuery Storage Write API. The Storage Write API lets you batch-process an
arbitrarily large number of records and commit them in a single atomic operation.
If the commit operation fails, you can safely retry the operation. Unlike BigQuery
load jobs, the Storage Write API does not require staging the data to intermediate
storage such as Cloud Storage.
Other managed services. Use other managed services to export data from an
external data store and import it into BigQuery.
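For instance, a Cloud Storage load job can be created with the google-cloud-bigquery Python client library as sketched below; the table ID and the gs:// path are placeholder assumptions.

# Batch-load a CSV file from Cloud Storage into a BigQuery table (illustrative sketch).
from google.cloud import bigquery

client = bigquery.Client()

table_id = "my-project.my_dataset.sales"   # assumed destination table
uri = "gs://my-bucket/sales.csv"           # assumed source file

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,    # skip the CSV header row
    autodetect=True,        # infer the schema from the file
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()           # wait for the load job to complete

table = client.get_table(table_id)
print(f"Loaded {table.num_rows} rows into {table_id}.")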
Streaming:
With streaming, you continually send smaller batches of data in real time, so the data is
available for querying as it arrives. Options for streaming in BigQuery include the
following:
Storage Write API. The Storage Write API supports high-throughput streaming
ingestion with exactly-once delivery semantics.
Dataflow. Use Dataflow with the Apache Beam SDK to set up a streaming
pipeline that writes to BigQuery.
BigQuery Connector for SAP. The BigQuery Connector for SAP enables near
real time replication of SAP data directly into BigQuery.
Generated data:
You can use SQL to generate data and store the results in BigQuery. Options for
generating data include:
Use data manipulation language (DML) statements to perform bulk inserts into an
existing table or store query results in a new table.
Use a CREATE TABLE ... AS statement to create a new table from a query result.
Run a query and save the results to a table. You can append the results to an
existing table or write to a new table.
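A small sketch of the CREATE TABLE ... AS option, issued through the Python client library, follows; the project, dataset, and column names are placeholder assumptions.

# Generate data with SQL: store a query result in a new table (illustrative sketch).
from google.cloud import bigquery

client = bigquery.Client()

sql = """
CREATE TABLE `my-project.my_dataset.daily_totals` AS
SELECT order_date, SUM(amount) AS total_amount
FROM `my-project.my_dataset.sales`
GROUP BY order_date
"""

client.query(sql).result()   # runs the statement and waits for it to finish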
BigQuery analytics:
Analytic workflows:
Ad hoc analysis. BigQuery uses Google Standard SQL, the SQL dialect in BigQuery, to
support ad hoc analysis.
Geospatial analysis. BigQuery uses geography data types and Google Standard SQL
geography functions to let you analyze and visualize geospatial data.
Machine learning. BigQuery ML uses Google Standard SQL queries to let you create
and execute machine learning (ML) models in BigQuery.
Queries:
The primary unit of analysis in BigQuery is the SQL query. BigQuery has two SQL dialects:
Google Standard SQL and legacy SQL. Google Standard SQL is the preferred dialect.
Data sources:
External data - query various external data sources such as other Google Cloud storage
services (like Cloud Storage) or database services (like Cloud Spanner or Cloud SQL).
Multi-cloud data - query data that's stored in other public clouds such as AWS or Azure.
Public datasets - you can analyze any of the datasets that are available in the public
dataset marketplace.
Types of queries:
After you load your data into BigQuery, you can query the data using one of the following
query job types:
Interactive query jobs. By default, BigQuery runs queries as interactive
(on-demand) jobs, which start executing as soon as possible.
Batch query jobs. With these jobs, BigQuery queues each batch query on your
behalf and then starts the query when idle resources are available, usually within
a few minutes.
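The sketch below shows a batch-priority query submitted with the Python client library, with the result saved to a destination table; all table names are placeholder assumptions.

# Run a batch query and save the results to a table (illustrative sketch).
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    priority=bigquery.QueryPriority.BATCH,                        # queue the query as a batch job
    destination="my-project.my_dataset.top_products",             # save results to this table
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,   # overwrite the table if it exists
)

query_job = client.query(
    "SELECT product, COUNT(*) AS orders "
    "FROM `my-project.my_dataset.sales` "
    "GROUP BY product ORDER BY orders DESC LIMIT 10",
    job_config=job_config,
)

for row in query_job.result():   # waits until the queued job runs and finishes
    print(row["product"], row["orders"])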
INTRODUCTION TO MACHINE LEARNING IN THE CLOUD
Machine learning lets businesses use data to teach a system how to solve a problem
with machine learning algorithms, and to get better at it over time.
Machine learning (ML) is a subfield of artificial intelligence (AI). The goal of ML is to make
computers learn from the data that you give them. Instead of writing code that describes
the action the computer should take, your code provides an algorithm that adapts based
on examples of intended behavior. The resulting program, consisting of the algorithm
and associated learned parameters, is called a trained model.
● Machine learning is a toolset, like Newton’s laws of mechanics. Just as you can use
Newton’s laws to learn how long it will take a ball to fall to the ground if you drop
it off a cliff, you can use machine learning to solve certain kinds of problems at
scale by using data examples, but without the need for custom code.
● You might have also heard the term deep learning, or deep neural networks. Deep
learning is a subset of machine learning that adds layers in between input data
and output results to make a machine learn at more depth. It’s a type of machine
learning that works even when the data is unstructured, like images, speech,
video, natural language text, and so on. Image classification is a type of deep
learning. A machine can learn how to classify images into categories when it is
shown lots of examples.
The basic difference between machine learning and other techniques in AI is that in
machine learning, machines learn. They don’t start out intelligent, they become
intelligent.
So, how do machines become intelligent? Intelligence requires training. To train a
machine learning model, examples are required.
For example, to train a model to estimate how much you’ll owe in taxes, you must show
the model many, many examples of tax returns or if you want to train a model to estimate
trip time between one location and another, you’ll need to show it many examples of
previous journeys.
Imagine you work for a manufacturing company, and you want to train a machine learning
model to detect defects in the parts before they are assembled into products. You’d start
by creating a dataset of images of parts. Some of those parts would be good and some
parts would be defective. For each image, you will assign a corresponding label and use
that set of examples to train the model.
An important detail to emphasize is that a machine learning model is only as good as the
data used to train it. And a good model requires a lot of training data of historical
examples of rejected parts and parts in good condition. With these elements, you can
train a model to categorize parts as defective or not.
The basic reason why ML models need high-quality data is because they don’t have
human general knowledge; data is the only thing they have access to.
After the model has been trained, it can be used to make predictions on data it has never
seen before.
In this example, the input for the trained model is an image of a part. Because the model
has been trained on specific examples of good and defective parts, it can correctly predict
that this part is in good condition.
Algorithms, or ML models, are standard. That means that they exist independently of the
use case. Although detecting manufacturing defects in images and detecting diseased
leaves in images are two different use cases, the same algorithm, which is an image
classification network, works for both.
Similarly, standard algorithms exist to predict the future value of a time series or to
transcribe human speech to text. ResNet, for example, is a standard algorithm for image
classification. It’s not essential to understand how an image classification algorithm
works, only that it’s the algorithm that we should use if we want to classify images of
automotive parts.
Google Cloud offers four options for building machine learning models.
1. The first option is BigQuery ML. You’ll remember from the previous module of
this course that BigQuery ML is a tool for using SQL queries to create and execute
machine learning models in BigQuery. If you already have your data in BigQuery
and your problems fit the predefined ML models, this could be your best choice.
2. The second option is AutoML, which is a no-code solution, so you can build your
own machine learning models on Vertex AI through a point-and-click interface.
3. The third option is custom training, through which you can code your own
machine learning environment, the training, and the deployment, which provides
you with flexibility and control over the ML pipeline.
4. And finally, there are pre-built APIs, which are application programming
interfaces. This option lets you use machine learning models that have already
been built and trained by Google, so you don’t have to build your own machine
learning models if you don’t have enough training data or sufficient machine
learning expertise in-house.
BigQuery ML
BigQuery ML lets you create and execute machine learning models in BigQuery using
standard SQL queries. BigQuery ML democratizes machine learning by letting SQL
practitioners build models using existing SQL tools and skills. BigQuery ML increases
development speed by eliminating the need to move data.
BigQuery ML empowers data analysts to use machine learning through existing SQL tools
and skills. Analysts can use BigQuery ML to build and evaluate ML models in BigQuery.
Analysts don't need to export small amounts of data to spreadsheets or other applications
or wait for limited resources from a data science team.
BigQuery ML supports several types of models, including:
Matrix Factorization for creating product recommendation systems. You can create
product recommendations using historical customer behavior, transactions, and product
ratings, and then use those recommendations for personalized customer experiences.
Time series for performing time-series forecasts. You can use this feature to create
millions of time series models and use them for forecasting. The model automatically
handles anomalies, seasonality, and holidays.
Boosted Tree for creating XGBoost based classification and regression models.
Deep Neural Network (DNN) for creating TensorFlow-based Deep Neural Networks
for classification and regression models.
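As a hedged sketch, the statements below train a logistic regression classifier entirely inside BigQuery with a CREATE MODEL statement and then call ML.PREDICT, issued here through the Python client library. The dataset, table, and column names are placeholder assumptions.

# Train and use a BigQuery ML model with SQL (illustrative sketch).
from google.cloud import bigquery

client = bigquery.Client()

create_model_sql = """
CREATE OR REPLACE MODEL `my_dataset.purchase_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['purchased']) AS
SELECT country, pageviews, time_on_site, purchased
FROM `my_dataset.web_sessions`
"""
client.query(create_model_sql).result()   # trains the model inside BigQuery

# Use the trained model for predictions with ML.PREDICT.
predict_sql = """
SELECT *
FROM ML.PREDICT(MODEL `my_dataset.purchase_model`,
                (SELECT country, pageviews, time_on_site
                 FROM `my_dataset.new_sessions`))
"""
for row in client.query(predict_sql).result():
    print(dict(row))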
BigQuery ML has the following advantages over other approaches to using ML with a
cloud-based data warehouse:
1. BigQuery ML democratizes the use of ML by empowering data analysts, the
primary data warehouse users, to build and run models using existing business
intelligence tools and spreadsheets. Predictive analytics can guide business
decision-making across the organization.
2. There is no need to program an ML solution using Python or Java. Models are
trained and accessed in BigQuery using SQL—a language data analysts know.
3. BigQuery ML increases the speed of model development and innovation by
removing the need to export data from the data warehouse. Instead, BigQuery ML
brings ML to the data. The need to export and reformat data has the following
disadvantages:
Increases complexity because multiple tools are required.
Reduces speed because moving and formatting large amounts of data for
Python-based ML frameworks takes longer than model training in BigQuery.
Requires multiple steps to export data from the warehouse, restricting the
ability to experiment on your data.
VERTEX AI
Build, deploy, and scale machine learning (ML) models faster, with fully managed ML
tools for any use case. Vertex AI allows users to build machine learning models with
either AutoML, a no-code solution, or Custom Training, a code-based solution. AutoML is
easy to use and lets data scientists spend more time turning business problems into ML
solutions, but custom training enables data scientists to have full control over the
development environment and process.
Vertex AI brings AutoML and AI Platform together into a unified API, client library, and
user interface. AutoML lets you train models on image, tabular, text, and video datasets
without writing code, while training in AI Platform lets you run custom training code. With
Vertex AI, both AutoML training and custom training are available options. Whichever
option you choose for training, you can save models, deploy models, and request
predictions with Vertex AI.
Google’s solution to many of the production and ease-of-use challenges is Vertex AI, a
unified platform that brings all the components of the machine learning ecosystem and
workflow together.
Vertex AI is a unified platform, meaning one digital experience to create, manage,
and deploy models over time and at scale. In Vertex AI, the machine learning process
is carried out in four stages:
● Data readiness:
During the data readiness stage, users can upload data from wherever it’s
stored: Cloud Storage, BigQuery, or a local machine.
● Feature readiness:
Then, during the feature readiness stage, users can create features, which
are the processed data that will be put into the model, and then share them with
others by using the feature store.
● Training and hyperparameter tuning:
After that, it’s time for training and hyperparameter tuning. This means that
when the data is ready, users can experiment with different models and adjust
hyperparameters.
● Deployment and model monitoring:
And finally, during deployment and model monitoring, users can set up the
pipeline to move the model into production, automatically monitoring it and
performing continuous improvements.
Benefits of Vertex AI
It’s seamless. Vertex AI provides a smooth user experience from uploading and
preparing data all the way to model training and production.
It’s scalable. The machine learning operations (MLOps) capabilities provided by
Vertex AI help monitor and manage ML in production, scaling storage and
computing power automatically.
It’s sustainable. All the artifacts and features created with Vertex AI can be reused
and shared.
And it’s speedy. Vertex AI requires 80% fewer lines of code to train a model
compared with competing platforms.
Components of Vertex AI
This section describes the pieces that make up Vertex AI and the primary purpose of each
piece.
Model training
You can train models on Vertex AI by using AutoML, or if you need the wider range of
customization options available in AI Platform Training, use custom training.
In custom training, you can select from among many different machine types to power
your training jobs, enable distributed training, use hyperparameter tuning, and accelerate
with GPUs.
Model deployment
You can deploy models on Vertex AI and get an endpoint to serve predictions on Vertex
AI, whether or not the model was trained on Vertex AI.
Data labeling
Data labeling jobs let you request human labeling for a dataset that you plan to use to
train a custom machine learning model. You can submit a request to label your video,
image, or text data.
To submit a labeling request, you provide a representative sample of labeled data, specify
all the possible labels for your dataset, and provide some instructions for how to apply
those labels. The human labelers follow your instructions, and when the labeling request
is complete, you get your annotated dataset that you can use to train a machine learning
model.
Vertex AI Feature Store
Vertex AI Feature Store is a fully managed repository where you can ingest, serve, and
share ML feature values within your organization. Vertex AI Feature Store manages all
the underlying infrastructure. For example, it provides storage and compute resources
for you and can easily scale as needed.
REST API
The Vertex AI REST API provides RESTful services for managing jobs, models, and
endpoints, and for making predictions with hosted models on Google Cloud.
Vertex AI Workbench notebooks instances
Vertex AI Workbench is a Jupyter notebook-based development environment for
the entire data science workflow. You can interact with Vertex AI services from
within a Vertex AI Workbench instance's Jupyter notebook. Vertex AI Workbench
integrations and features can make it easier to access your data, process data
faster, schedule notebook runs, and more.
Vertex AI workflow
Gather your data: Determine the data you need for training and testing your model
based on the outcome you want to achieve.
Prepare your data: Make sure your data is properly formatted and labeled.
Train: Set parameters and build your model.
Evaluate: Review model metrics.
Deploy and predict: Make your model available to use.
But before you start gathering your data, you need to think about the problem you are
trying to solve. This will inform your data requirements.
AUTO ML
AutoML is a no-code solution to build ML models through Vertex AI. It allows users to
train high-quality custom machine learning models with minimal effort or machine
learning expertise.
If you've worked with ML models before, you know that training and deploying ML models
can be incredibly time consuming, because you need to repeatedly add new data and
features, try different models, and tune parameters to achieve the best result.
To solve this problem, when AutoML was first announced in January of 2018, the goal
was to automate machine learning pipelines to save data scientists from manual work,
such as tuning hyperparameters and comparing against multiple models.
Transfer Learning:
In transfer learning, you build a knowledge base in the field. You can think of this like
gathering many books to create a library.
Transfer learning is a powerful technique that lets people with smaller datasets, or less
computational power, achieve state-of-the-art results by using pre-trained models that
have been trained on similar, larger datasets. Because the model learns through transfer
learning, it doesn’t have to learn from the beginning, so it can generally reach higher
accuracy with much less data and computation time than models that don’t use transfer
learning.
The second technology is neural architecture search. The goal of neural architecture
search is to find the optimal model for the relevant project.
AutoML supports four types of data: image, tabular, text, and video. For each data
type, AutoML solves different types of problems, called objectives.
Benefits of AutoML
Using these technologies has produced a tool that can significantly benefit data
scientists.
One of the biggest benefits is that it’s a no-code solution. That means it can
train high-quality custom machine learning models with minimal effort and requires
little machine learning expertise.
This allows data scientists to focus their time on tasks like defining business
problems or evaluating and improving model results.
Others might find AutoML useful to quickly prototype models and explore new
datasets before investing in development. This might involve using it to identify
the best features in a dataset.
AutoML Tables takes your dataset and starts training for multiple model architectures at
the same time. This lets Cloud AutoML determine the best model architecture for your
data quickly, without having to serially iterate over the many possible model
architectures. The model architectures to test include:
Linear
Feedforward deep neural network
Gradient Boosted Decision Tree
AdaNet
Ensembles of various model architectures
As new model architectures come out of the research community, Google will add those
as well.
(Figure: Dataset → AutoML → Generate predictions with a REST API call. Image source: Cloud AutoML.)
Data Preparation
Before we can start training, we must ensure that the training data is in a format that is
supported by the platform. AutoML Tabular supports two ways of importing training data;
if we are already utilising Google’s BigQuery platform, we can import the data straight
from a BigQuery table, or we can upload a comma-separated values (CSV) file to Cloud
Storage.
If you are dealing with structured data, AutoML Tables is recommended. Depending on
use cases, AutoML Tables will create the necessary model to solve the problem:
A binary classification model predicts a binary outcome (one of two classes). Use
this for yes or no questions, for example, predicting whether a customer would
buy a subscription (or not). All else being equal, a binary classification problem
requires less data than other model types.
A multi-class classification model predicts one class from three or more discrete
classes. Use this to categorize things. For the retail example, you’d want to build
a multi-class classification model to segment customers into different personas.
A regression model predicts a continuous value. For the retail example, you’d also
want to build a regression model to forecast customer spending over the next
month.
Creating a Dataset
Importing a dataset can be done through the Google Cloud Platform (GCP) Console. In
the Vertex AI section there is a Datasets page where we can manage our datasets. From
there we can create a tabular dataset and select a data source for the dataset.
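The same step can also be done programmatically. The sketch below creates a tabular dataset with the Vertex AI SDK for Python; the project, region, and data sources are placeholder assumptions.

# Create a Vertex AI tabular dataset (illustrative sketch).
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Either a CSV file in Cloud Storage or a BigQuery table can serve as the source.
dataset = aiplatform.TabularDataset.create(
    display_name="customer-purchases",
    gcs_source=["gs://my-bucket/training_data.csv"],
    # bq_source="bq://my-project.my_dataset.training_data",   # alternative source
)
print(dataset.resource_name)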
(Figure: sample structured data that you can use with Cloud AutoML.)
(Figure: a visualisation of the train/validation/test splits. Image source: About Train, Validation and Test Sets in Machine Learning.)
On the other hand, working with unstructured data is more complicated. The Vision API
classifies images into thousands of predefined categories, detects individual objects and
faces within images, and finds and reads printed words contained within images. If you
want to detect individual objects, faces, and text in your dataset, or your need for image
classification is quite generic, try out the Vision API and see if it works for you.
(Figure: Vision API image classification.)
Coupled with image labeling in the Vision API, AutoML Vision enables you to perform
supervised learning, which involves training a computer to recognize patterns from
labeled data. Using supervised learning, we can train a model to recognize the patterns
and content that we care about in images.
Training a Model
Once the dataset has been created we can start training our first model.
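A hedged sketch of this training step with the Vertex AI SDK for Python follows, assuming `dataset` is the tabular dataset created earlier; the display names, target column, and training budget are placeholder assumptions.

# Train an AutoML tabular classification model (illustrative sketch).
from google.cloud import aiplatform

job = aiplatform.AutoMLTabularTrainingJob(
    display_name="purchase-classifier",
    optimization_prediction_type="classification",   # or "regression" for continuous targets
)

model = job.run(
    dataset=dataset,                       # the TabularDataset created earlier
    target_column="purchased",             # label column (an assumed name)
    budget_milli_node_hours=1000,          # one node hour of training budget
    model_display_name="purchase-classifier-v1",
)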
The model outputs a series of numbers that communicate how strongly it associates each
label with that example. If the number is high, the model has high confidence that the
label should be applied to that document.
After you have created (trained) a model, you can request a prediction for an image using
the predict method. The available options for online (individual) prediction are:
Web UI
Command-Line
Python
Java
Node.js
Deployment, Testing and Explainability
The trained model that is produced by AutoML can then be deployed in two ways; we
can export the model as a saved TensorFlow (TF) model which we can then serve
ourselves in a Docker container, or we can deploy the model directly to a GCP endpoint.
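For the endpoint option, a hedged sketch with the Vertex AI SDK for Python follows; the machine type and the instance payload are placeholder assumptions, and `model` is the trained model from the previous step.

# Deploy the trained model to an endpoint and request an online prediction (illustrative sketch).
from google.cloud import aiplatform

endpoint = model.deploy(machine_type="n1-standard-4")   # managed endpoint on Google Cloud

prediction = endpoint.predict(
    instances=[{"country": "IN", "pageviews": "12", "time_on_site": "340"}]   # keys must match your schema
)
print(prediction.predictions)

# Undeploy when the endpoint is no longer needed to stop incurring charges.
endpoint.undeploy_all()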
CUSTOM TRAINING
Custom training gives you full control over the entire ML development process, from
data preparation to model deployment, by letting you write your own training code.
If you want to code your machine learning model, you can use this option by building a
If you want to code your machine learning model, you can use this option by building a
custom training solution with Vertex AI Workbench.
First, determine what structure you want your ML training code to take. You can provide
training code to Vertex AI in one of the following forms:
Pre-built container
Custom container
Pre-built container
So, if your ML training needs a platform like TensorFlow, PyTorch, scikit-learn, or XGBoost,
and Python code to work with the platform, a pre-built container is probably your best
solution.
Custom container
Create a Docker container image with code that trains an ML model and exports it to
Cloud Storage. Include any dependencies required by your code in the container image.
Use this option if you want to use dependencies that are not included in one of the Vertex
AI pre-built containers for training. For example, if you want to train using a Python ML
framework that is not available in a pre-built container, or if you want to train using a
programming language other than Python, then this is the better option.
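As a hedged illustration of the pre-built-container path, the sketch below packages a local training script for Vertex AI custom training with the SDK for Python; the script name, container image URIs, and machine settings are placeholder assumptions.

# Run a custom training job in a pre-built framework container (illustrative sketch).
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

job = aiplatform.CustomTrainingJob(
    display_name="defect-detector-training",
    script_path="task.py",                  # your local training script (assumed name)
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-8:latest",   # example training image
    requirements=["pandas"],                # extra Python dependencies, if any
    model_serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-8:latest",
)

model = job.run(
    replica_count=1,
    machine_type="n1-standard-4",
    model_display_name="defect-detector",
)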
Pre-built APIs
When using AutoML, you define a domain-specific labeled training dataset that is used to
create the custom ML model you require. However, if you don’t need a domain-specific
dataset, Google’s suite of pre-built ML APIs might meet your needs.
Good machine learning models require lots of high-quality training data. You should aim
for hundreds of thousands of records to train a custom model. If you don't have that kind
of data, pre-built APIs are a great place to start.
Pre-built APIs are offered as services. Often, they can act as building blocks to create
the application you want without the expense or complexity of creating your own
models. They save the time and effort of building, curating, and training a new dataset,
so you can directly deal with predictions.
Examples of Pre-built API
● The Cloud Natural Language API recognizes parts of speech called entities and
sentiment.
● The Cloud Translation API converts text from one language to another.
● The Vision API works with and recognizes content in static images.
● And the Video Intelligence API recognizes motion and action in video.
And Google has already done a lot of work to train these models using its own datasets.
Recall that how well a model is trained depends on how much data is available to
train it. As you might expect, Google has many images, much text, and many ML
researchers to train its pre-built models. This means less work for you.
1.Speech-to-Text API
Speech requests
Speech-to-Text has three main methods to perform speech recognition. These are listed
below:
Synchronous Recognition (REST and gRPC) sends audio data to the Speech-to-Text
API, performs recognition on that data, and returns results after all audio has been
processed. Synchronous recognition requests are limited to audio data of 1 minute or less
in duration.
Asynchronous Recognition (REST and gRPC) sends audio data to the Speech-to-
Text API and initiates a Long Running Operation. Using this operation, you can
periodically poll for recognition results. Use asynchronous requests for audio data of any
duration up to 480 minutes.
Streaming Recognition (gRPC only) performs recognition on audio data provided within
a gRPC bi-directional stream. Streaming requests are designed for real-time recognition
purposes, such as capturing live audio from a microphone.
Requests contain configuration parameters as well as audio data. The following sections
describe these types of recognition requests, the responses they generate, and how to
handle those responses in more detail.
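A hedged sketch of a synchronous request with the google-cloud-speech Python client library follows; the gs:// URI and audio settings are placeholder assumptions.

# Synchronous speech recognition (illustrative sketch).
from google.cloud import speech

client = speech.SpeechClient()

audio = speech.RecognitionAudio(uri="gs://my-bucket/meeting.wav")
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

# Suitable for audio of one minute or less.
response = client.recognize(config=config, audio=audio)
for result in response.results:
    print("Transcript:", result.alternatives[0].transcript)

# For longer audio (up to 480 minutes), use the asynchronous variant instead:
# operation = client.long_running_recognize(config=config, audio=audio)
# response = operation.result(timeout=300)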
2. Natural Language API
Service: language.googleapis.com
To call this service, we recommend that you use the Google-provided client libraries. If
your application needs to use your own libraries to call this service, use the following
information when you make the API requests.
Discovery document
https://language.googleapis.com/$discovery/rest?version=v1
https://language.googleapis.com/$discovery/rest?version=v1beta2
Service endpoint
A service endpoint is a base URL that specifies the network address of an API service.
One service might have multiple service endpoints. This service has the following service
endpoint and all URIs below are relative to this service endpoint
https://language.googleapis.com.
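Instead of calling the REST endpoint directly, the google-cloud-language Python client library can be used, as in the hedged sketch below; the sample sentence is made up.

# Sentiment and entity analysis with the Natural Language API (illustrative sketch).
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

document = language_v1.Document(
    content="Google Cloud makes big data analysis surprisingly easy.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)

# Sentiment of the whole document.
sentiment = client.analyze_sentiment(request={"document": document}).document_sentiment
print(f"Score: {sentiment.score:.2f}, magnitude: {sentiment.magnitude:.2f}")

# Entities (people, places, products, and so on) mentioned in the text.
for entity in client.analyze_entities(request={"document": document}).entities:
    print(entity.name, entity.type_.name)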
3. Text-to-Speech API
Text-to-Speech converts text or Speech Synthesis Markup Language (SSML) input into
audio data of natural human speech like MP3 or LINEAR16 (the encoding used in WAV
files).
Text-to-Speech is ideal for any application that plays audio of human speech to users. It
allows you to convert arbitrary strings, words, and sentences into the sound of a person
speaking the same things.
Imagine that you have a voice assistant app that provides natural language feedback to
your users as playable audio files. Your app might take an action and then provide human
speech as feedback to the user.
Speech synthesis
The process of translating text input into audio data is called synthesis and the output of
synthesis is called synthetic speech.
Text-to-Speech accepts input in two forms:
Raw text
SSML-formatted data (Speech Synthesis Markup Language)
To create a new audio file, you call the synthesize endpoint of the API.
The speech synthesis process generates raw audio data as a base64-encoded string. You
must decode the base64-encoded string into an audio file before an application can play
it. Most platforms and operating systems have tools for decoding base64 text into
playable media files.
Voices
Text-to-Speech creates raw audio data of natural, human speech. That is, it creates audio
that sounds like a person talking. When you send a synthesis request to Text-to-Speech,
you must specify a voice that 'speaks' the words.
Text-to-Speech has a wide selection of custom voices available for you to use. The
voices differ by language, gender, and accent (for some languages). For example, you
can create audio that mimics the sound of a female English speaker with a British accent.
You can also convert the same text into a different voice, say a male English speaker with
an Australian accent.
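A hedged sketch of such a request with the google-cloud-texttospeech Python client library follows; the input text and output filename are placeholder assumptions.

# Synthesize speech with a female en-GB voice (illustrative sketch).
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

synthesis_input = texttospeech.SynthesisInput(text="Your order has been shipped.")
voice = texttospeech.VoiceSelectionParams(
    language_code="en-GB",
    ssml_gender=texttospeech.SsmlVoiceGender.FEMALE,
)
audio_config = texttospeech.AudioConfig(audio_encoding=texttospeech.AudioEncoding.MP3)

response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config
)

# The client library returns decoded audio bytes, so they can be written directly.
with open("output.mp3", "wb") as out:
    out.write(response.audio_content)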
4. Cloud Translation API
Cloud Translation enables your websites and applications to dynamically translate text
programmatically through an API. Translation uses a Google pre-trained or a custom
machine learning model to translate text. By default, Translation uses a Google pre-
trained Neural Machine Translation (NMT) model, which Google updates on a semi-regular
cadence when more training data or better techniques become available.
Benefits
Cloud Translation can translate text for more than 100 language pairs. If you don't know
the language of your source text, Cloud Translation can detect it for you.
Cloud Translation scales seamlessly and allows unlimited character translations per day.
However, there are restrictions on content size for each request and request rates.
Additionally, you can use quota limits to manage your budget.
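A hedged sketch of detection and translation with the google-cloud-translate Python client library (Basic edition) follows; the input sentence and target language are placeholder assumptions.

# Detect the source language and translate text (illustrative sketch).
from google.cloud import translate_v2 as translate

client = translate.Client()

text = "Big data is a collection of large and complex datasets."

detection = client.detect_language(text)
print("Detected language:", detection["language"])

result = client.translate(text, target_language="de")   # translate into German
print(result["translatedText"])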
Cloud Translation - Advanced has all of the capabilities of Cloud Translation - Basic and
more. Cloud Translation - Advanced edition is optimized for customization and long form
content use cases. Customization features include glossaries and custom model selection.
Pricing
Translation charges you on a monthly basis based on the number of characters that
you send.
5. Vision API
Cloud Vision allows developers to easily integrate vision detection features within
applications, including image labeling, face and landmark detection, optical character
recognition (OCR), and tagging of explicit content.
Text detection
Optical character recognition (OCR) for an image; text recognition and conversion
to machine-coded text. Identifies and extracts UTF-8 text in an image.
Document text detection (dense text / handwriting)
Optical character recognition (OCR) for a file (PDF/TIFF) or dense text image;
dense text recognition and conversion to machine-coded text.
Landmark detection
Landmark Detection detects popular natural and human-made structures within
an image.
Logo Detection
Logo Detection detects popular product logos within an image.
Label detection
Provides generalized labels for an image.
For each label returns a textual description, confidence score, and topicality rating.
Image properties
Returns dominant colors in an image.
Each color is represented in the RGBA color space, has a confidence score, and
displays the fraction of pixels occupied by the color [0, 1].
Object localization
The Vision API can detect and extract multiple objects in an image with Object
Localization.
Crop hint detection
Provides a bounding polygon for the cropped image, a confidence score, and an
importance fraction of this salient region with respect to the original image for
each request.
You can provide up to 16 image ratio values (width:height) for a single image.
Web entities and pages
Provides a series of related Web content to an image.
Face detection
Locates faces with bounding polygons and identifies specific facial "landmarks"
such as eyes, ears, nose, mouth, etc. along with their corresponding confidence
values.
Returns likelihood ratings for emotion (joy, sorrow, anger, surprise) and general
image properties (underexposed, blurred, headwear present).
Likelihoods ratings are expressed as 6 different values: UNKNOWN,
VERY_UNLIKELY, UNLIKELY, POSSIBLE, LIKELY, or VERY_LIKELY.
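A hedged sketch of label detection and text detection with the google-cloud-vision Python client library follows; the gs:// image path is a placeholder assumption.

# Label detection and OCR with the Vision API (illustrative sketch).
from google.cloud import vision

client = vision.ImageAnnotatorClient()

image = vision.Image()
image.source.image_uri = "gs://my-bucket/part-photo.jpg"   # assumed image location

# Label detection: generalized labels with confidence scores.
for label in client.label_detection(image=image).label_annotations:
    print(f"{label.description}: {label.score:.2f}")

# Text detection (OCR): extract any printed text found in the image.
texts = client.text_detection(image=image).text_annotations
if texts:
    print("Detected text:", texts[0].description)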
6. Video Intelligence API
The Video Intelligence API allows developers to use Google video analysis technology as
part of their applications. The REST API enables users to annotate videos stored locally
or in Cloud Storage, or live-streamed, with contextual information at the level of the entire
video, per segment, per shot, and per frame.
10. ASSIGNMENT
service account with IAM roles. Explore permissions for the Network
Admin and Security Admin roles.
Optional components are installed by specifying the --optional-components flag when
creating the cluster:

gcloud dataproc clusters create CLUSTER_NAME \
    --optional-components=COMPONENT_NAME(s) \
    ... other flags
9. List the usecases of Dataproc.
Log processing
Ad-hoc data analysis
Machine learning
10. What is Dataproc workflow template?
The Dataproc WorkflowTemplates API provides a flexible and easy-to-use mechanism
for managing and executing workflows. A Workflow Template is a reusable workflow
configuration. It defines a graph of jobs with information on where to run those jobs.
Data sources:
External data - query various external data sources such as other Google Cloud storage
services (like Cloud Storage) or database services (like Cloud Spanner or Cloud SQL).
Multi-cloud data - query data that's stored in other public clouds such as AWS or Azure.
Public datasets - can analyze any of the datasets that are available in the public dataset
marketplace.
● Data readiness:
During the data readiness stage, users can upload data from wherever it’s stored:
Cloud Storage, BigQuery, or a local machine.
● Feature readiness:
Then, during the feature readiness stage, users can create features, which are the
processed data that will be put into the model, and then share them with others
by using the feature store.
After that, it’s time for training and hyperparameter tuning. This means that when
the data is ready, users can experiment with different models and adjust
hyperparameters.
And finally, during deployment and model monitoring, users can set up the pipeline
to transform the model into production by automatically monitoring and
performing continuous improvements.
● The Cloud Translation API converts text from one language to another.
● The Text-to-Speech API converts text into high-quality voice audio.
● The Vision API works with and recognizes content in static images.
● And the Video Intelligence API recognizes motion and action in video.
12. PART B QUESTIONS
1. How will you leverage big data operations with Cloud Dataproc?
2. How will you build extract, transform, and load pipelines using Cloud Dataflow?
3. Explain in detail about BigQuery.
4. Explain in detail about Machine Learning tools in Google Cloud.
5. Demonstrate the concept of Vertex AI in detail.
6. List out the pre-built APIs and explain them in detail.
7. Explain the concept of AutoML in detail.
13. ONLINE CERTIFICATIONS
https://onlinecourses.nptel.ac.in/noc20_cs55/preview
https://learndigital.withgoogle.com/digitalgarage/course/gcloud-
computing-foundations
14. REAL TIME APPLICATIONS
“ML fits into the natural progression of Twitter consumer and revenue product building,
especially for a product feature such as Spaces,” explains Diem Nguyen, Senior Machine
Learning Engineer and Data Scientist at Twitter. “We launched Spaces with a base-line
algorithm using the ‘most popular’ heuristic which assumes that if a Space is popular,
there’s a good chance you'd like it too. But our aim is to leverage ML to surface the most
interesting and relevant Spaces to a particular Twitter customer, making it easier for them
to find and join the conversations they personally care about. This is a complex
functionality that Google Cloud ML capabilities help us to enable.”
While looking for the right tools to power this vision, Nguyen and her team started
evaluating in December 2021 whether the Vertex AI platform and AutoML in particular
could solve challenges observed when they first started building Spaces. These included
a lack of dedicated ML resources to build and deploy the product feature, and the need
to work on a multi-cloud environment.
“We had three key questions in mind during our assessment,” Nguyen explains. “Can we
realistically deploy the AutoML model off-platform? Once deployed, can it solve for the
request load that we get from the service we’re serving (in this case, the Spaces tab)?
And finally, can we develop and maintain such a solution without a dedicated team of ML
experts for this project?” The answer to all three questions was yes.
Positive answers motivated the Spaces Engineering team to take the solution to
production in February 2022. “We started using AutoML Tables to train high-accuracy
models with minimal ML expertise or effort, alleviating our resource constraint,” says
Nguyen of the results. “Soon AutoML also stood out for its high performance and for
supporting easy deployment beyond the Google Cloud Platform, making it ideal for this
project hosted in a multi-cloud environment.”
Because Spaces are live audio conversations, the Spaces tab needs to be ranked to
customers in near real time, so they don’t miss out. With this in mind, Twitter’s model
currently performs 900 queries per second on the Spaces tab and evaluates 50,000
candidates per second. Meanwhile, 99% of these requests are faster than 100
milliseconds, and 90% of requests are faster than 50 milliseconds.
16. ASSESSMENT SCHEDULE

S.No. | Name of the Assessment | Start Date | End Date | Portion
REFERENCES:
1. https://cloud.google.com/docs
2. https://www.cloudskillsboost.google/course_templates/153
3. https://nptel.ac.in/courses/106105223
18. MINI PROJECT
As a junior data engineer at Jooli Inc., recently trained in Google Cloud and a number
of its data services, you have been asked to demonstrate your newly learned skills. The
team has asked you to complete the following tasks.
Disclaimer:
This document is confidential and intended solely for the educational purpose of RMK Group of
Educational Institutions. If you have received this document through email in error, please notify the
system manager. This document contains proprietary information and is intended only to the
respective group / learning community as intended. If you are not the addressee you should not
disseminate, distribute or copy through e-mail. Please notify the sender immediately by e-mail if you
have received this document by mistake and delete this document from your system. If you are not
the intended recipient you are notified that disclosing, copying, distributing or taking any action in
reliance on the contents of this information is strictly prohibited.