
Google Cloud Data Engineering Training

with Real-world Projects and Case Studies

Role: GCP Data Engineer


Course duration: 2.5 months
Mode: Online
Teaching Language: English
Trainer: Shaik Saidhul
Contact: 7305101711
GCP Cloud Basics
GCP Introduction
o The need for cloud computing in modern businesses.
o Key features and offerings of Google Cloud Platform (GCP).
o Overview of core GCP services and products.
o Benefits and advantages of using cloud infrastructure.
o Step-by-step guide to creating a free-tier account on GCP.

GCP Interfaces
o Console
• Navigating the GCP Console
• Configuring the GCP Console for Efficiency
• Using the GCP Console for Service Management
o Shell
• Introduction to Cloud Shell
• Command-line Interface (CLI) Basics
• Cloud Shell Commands for Service Deployment and Management
o SDK
• Overview of GCP Software Development Kits (SDKs)
• Installing and Configuring SDKs
• Writing and Executing GCP SDK Commands

GCP Locations
o Regions
• Understanding GCP Regions
• Selecting Regions for Service Deployment
• Impact of Region on Service Performance
o Zones
• Exploring GCP Zones
• Distributing Resources Across Zones
• High Availability and Disaster Recovery Considerations
o Importance
• Significance of Choosing the Right Location
• Global vs. Regional Resources
• Factors Influencing Location Decisions

GCP IAM & Admin


o Identities
• Introduction to Identity and Access Management (IAM)
• Users, Groups, and Service Accounts
• Best Practices for Identity Management
o Roles
• GCP IAM Roles Overview
• Defining Custom Roles
• Role-Based Access Control (RBAC) Implementation
o Policy
• Resource-based Policies
• Understanding and Implementing Organization Policies
• Auditing and Monitoring Policies
o Resource Hierarchy
• GCP Resource Hierarchy Structure
• Managing Resources in a Hierarchy
• Organizational Structure Best Practices

Linux Basics on Cloud Shell


o Getting started with Linux
o Linux Installation
o Basic Linux Commands
• Cloud shell tips
• File and Directory Operations (ls, cd, pwd, mkdir, rmdir, cp, mv, touch, rm, nano)
• File Content Manipulation (cat, less, head, tail, grep)
• Text Processing (awk, sed, cut, sort, uniq)
• User and Permission related (whoami, id, su, sudo, chmod, chown)

Python for Data Engineer


o Data Types
• Strings
• Operators
• Numbers (Int, Float)
• Booleans
o Data Structures
• Lists
• Tuples
• Dictionaries
• Sets
o Python Programming Constructs
• if, elif, else statements
• for loops, while loops
• Exception Handling
• File I/O operations
o Modular Programming in Python
• Functions & Lambda Functions
• Classes (a short example combining these constructs follows this list)
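
A minimal example pulling the constructs above together (dictionaries, loops, exception handling, file I/O, and a function); the file name and fields are made up for illustration:

def summarize_sales(path):
    """Read 'region,amount' lines and total the amount per region."""
    totals = {}  # dictionary: region -> running total
    try:
        with open(path) as f:               # file I/O
            for line in f:                  # for loop
                region, amount = line.strip().split(",")
                totals[region] = totals.get(region, 0) + float(amount)
    except FileNotFoundError:               # exception handling
        print(f"No such file: {path}")
    return totals

# Usage: build a tiny sample file, then summarize it.
with open("sales.csv", "w") as f:
    f.write("east,100.5\nwest,200\neast,50\n")
print(summarize_sales("sales.csv"))  # {'east': 150.5, 'west': 200.0}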
GCP Data Engineering Tools
Google Cloud Storage
o Overview of Cloud Storage as a scalable and durable object storage service.
o Understanding buckets and objects in Cloud Storage.
o Use cases for Cloud Storage, such as data backup, multimedia storage, and website content hosting.
o Creating and managing Cloud Storage buckets.
o Uploading and downloading objects to and from Cloud Storage (see the client sketch after this list).
o Setting access controls and permissions for buckets and objects.
o Data Transfer and Lifecycle Management
o Object Versioning
o Integration with Other GCP Services
o Implementing best practices for optimizing Cloud Storage performance.
o Securing data in Cloud Storage with encryption and access controls.
o Monitoring and logging for Cloud Storage operations.
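
As a taste of the bucket and object operations above, a minimal sketch using the google-cloud-storage Python client; the bucket and file names are placeholders, and credentials are assumed to come from the environment (for example, Application Default Credentials in Cloud Shell):

from google.cloud import storage

client = storage.Client()

# Create a regional bucket (bucket names must be globally unique).
bucket = client.create_bucket("example-training-bucket", location="us-central1")

# Upload a local file as an object, then download it back.
blob = bucket.blob("raw/sales.csv")
blob.upload_from_filename("sales.csv")
blob.download_to_filename("sales_copy.csv")

# List objects under a prefix.
for b in client.list_blobs("example-training-bucket", prefix="raw/"):
    print(b.name)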

Cloud SQL
o Introduction to Cloud SQL
o Creating and Managing Cloud SQL Instances
o Configuring database settings, users, and access controls.
o Connecting to Cloud SQL instances using Cloud SQL Studio, Cloud Shell, and workbenches (see the connector sketch after this list)
o Importing and exporting data in Cloud SQL.
o Backups and High Availability
o Integration with Other GCP Services
o Managing database user roles and permissions.
o Introduction to Database Migration Service (DMS)
o End-to-end database migration project
• Offline: Export and Import method
• Online: DMS method
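
For the connection topic above, a minimal sketch using the Cloud SQL Python Connector with SQLAlchemy; the instance connection name, user, password, and database are placeholders:

import sqlalchemy
from google.cloud.sql.connector import Connector

connector = Connector()

def getconn():
    # "project:region:instance" is the instance connection name.
    return connector.connect(
        "my-project:us-central1:training-instance",
        "pymysql",
        user="trainee",
        password="change-me",
        db="salesdb",
    )

engine = sqlalchemy.create_engine("mysql+pymysql://", creator=getconn)
with engine.connect() as conn:
    for row in conn.execute(sqlalchemy.text("SELECT NOW()")):
        print(row)
connector.close()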

BigQuery (SQL development)


o Introduction to BigQuery
o BigQuery Architecture
o Use cases for BigQuery in business intelligence and analytics.
o Various methods of creating tables in BigQuery
o BigQuery Data Sources and File Formats
o Native tables and external tables
o SQL Queries and Performance Optimization
• Writing and optimizing SQL queries in BigQuery (see the sketch after this list).
• Understanding query execution plans and best practices.
• Partitioning and clustering tables for performance.
o Data Integration and Export
• Loading data into BigQuery from Cloud Storage, Cloud SQL, and other sources.
• Exporting data from BigQuery to various formats.
• Real-time data streaming into BigQuery.
o Configuring access controls and permissions in BigQuery.
o BigQuery Views:
• Views
• Materialized Views
• Authorized Views
o Integration with Other GCP Services
• Integrating BigQuery with Dataflow for ETL processes.
• Building data pipelines with BigQuery and Composer.
o Case Study-1: Spotify
o Case Study-2: Social Media
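
A small sketch of the partitioning and clustering ideas above, using the google-cloud-bigquery client; it assumes a dataset named demo_ds already exists, and the table and columns are invented for illustration:

from google.cloud import bigquery

client = bigquery.Client()

# DDL: daily partitioning plus clustering for partition pruning and locality.
client.query("""
    CREATE TABLE IF NOT EXISTS demo_ds.events
    (event_date DATE, user_id STRING, amount FLOAT64)
    PARTITION BY event_date
    CLUSTER BY user_id
""").result()

# This query benefits from pruning: only one day's partition is scanned.
sql = """
    SELECT user_id, SUM(amount) AS total
    FROM demo_ds.events
    WHERE event_date = DATE "2024-01-01"
    GROUP BY user_id
    ORDER BY total DESC
"""
for row in client.query(sql).result():
    print(row.user_id, row.total)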

Dataproc (PySpark Development)


o Introduction to Hadoop and Apache Spark
o Understanding the difference between Spark and MapReduce
o What are Spark and PySpark?
o Understanding Spark framework and its functionalities
o Overview of Dataproc as a fully managed Apache Spark and Hadoop service.
o Use cases for Dataproc in data processing and analytics.
o Cluster Creation and Configuration
• Creating and managing Dataproc clusters.
• Configuring cluster properties for performance and scalability.
• Preemptible instances and cost optimization.
o Running Jobs on Dataproc
• Submitting and monitoring Spark and Hadoop jobs on Dataproc.
• Use of initialization actions and custom scripts.
• Job debugging and troubleshooting.
o Integration with Storage and BigQuery
• Reading and writing data from/to Cloud Storage and BigQuery.
• Integrating Dataproc with other storage solutions.
• Performance optimization for data access.
o Automation and scheduling of recurring jobs.
o Case Study-1: Data Cleaning of Employee Travel Records
o End-to-end batch PySpark pipeline using Dataproc, BigQuery, and GCS (sketched below)
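
A minimal sketch of such a batch job: a PySpark script that reads raw CSV records from GCS, cleans them, and writes to BigQuery through the spark-bigquery connector. The bucket, dataset, and column names are placeholders; the script would typically be submitted with gcloud dataproc jobs submit pyspark:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("travel-records-cleaning").getOrCreate()

# Read raw records from Cloud Storage.
df = (spark.read.option("header", True)
      .csv("gs://example-bucket/raw/travel_records.csv"))

# Basic cleaning: drop duplicates and rows missing key fields,
# and normalize the employee name column.
clean = (df.dropDuplicates()
           .dropna(subset=["employee_id", "travel_date"])
           .withColumn("employee_name", F.trim(F.initcap("employee_name"))))

# Write to BigQuery; the connector stages data through a temporary GCS bucket.
(clean.write.format("bigquery")
      .option("table", "demo_ds.travel_records_clean")
      .option("temporaryGcsBucket", "example-temp-bucket")
      .mode("overwrite")
      .save())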

Databricks on GCP
o What is the Databricks Lakehouse Platform?
o Databricks architecture and components
o Setting up and Administering a Databricks workspace
o Managing data with Delta Lake
o Databricks Unity Catalog
o Notebooks and clusters
o ELT with Spark SQL and Python
o Optimizing performance within Databricks
o Incremental Data Processing
o Delta Live Tables
o Case study: creating end-to-end workflows (a Delta Lake sketch follows this list)
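
A minimal Delta Lake sketch in the spirit of the topics above, written for a Databricks notebook where spark is predefined; the schema and table names are placeholders:

# Write a DataFrame as a Delta table.
spark.sql("CREATE SCHEMA IF NOT EXISTS demo")
df = spark.createDataFrame([(1, "open"), (2, "closed")], ["ticket_id", "status"])
df.write.format("delta").mode("overwrite").saveAsTable("demo.tickets")

# Incremental processing: upsert new records with MERGE.
updates = spark.createDataFrame([(2, "reopened"), (3, "open")],
                                ["ticket_id", "status"])
updates.createOrReplaceTempView("updates")
spark.sql("""
    MERGE INTO demo.tickets AS t
    USING updates AS u
    ON t.ticket_id = u.ticket_id
    WHEN MATCHED THEN UPDATE SET t.status = u.status
    WHEN NOT MATCHED THEN INSERT *
""")

# Time travel: read the table as of an earlier version.
spark.read.format("delta").option("versionAsOf", 0).table("demo.tickets").show()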

Dataflow (Apache Beam Development)


o Introduction to Dataflow
o Use cases for Dataflow in real-time analytics and ETL.
o Understanding the difference between Apache Spark and Apache Beam
o How Dataflow is different from Dataproc
o Building Data Pipelines with Apache Beam
• Writing Apache Beam pipelines for batch and stream processing.
• Custom Pipelines and Pre-defined pipelines
• Transformations and windowing concepts.
o Integration with Other GCP Services
• Integrating Dataflow with BigQuery, Pub/Sub, and other GCP services.
• Real-time analytics and visualization using Dataflow and BigQuery.
• Workflow orchestration with Composer.
o End-to-end streaming pipeline using Apache Beam with Dataflow, a Python app, Pub/Sub,
BigQuery, and GCS (see the sketch after this list)
o Template method of creating pipelines
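
A condensed sketch of that streaming pipeline in Apache Beam: read JSON events from Pub/Sub, window them, aggregate, and write to BigQuery. The topic, table, and field names are placeholders; running it on Dataflow would additionally require the DataflowRunner and project/region options:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
     | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | "KeyByUser" >> beam.Map(lambda e: (e["user_id"], 1))
     | "Window" >> beam.WindowInto(window.FixedWindows(60))  # 1-minute windows
     | "CountPerUser" >> beam.CombinePerKey(sum)
     | "ToRow" >> beam.Map(lambda kv: {"user_id": kv[0], "events": kv[1]})
     | "Write" >> beam.io.WriteToBigQuery(
           "my-project:demo_ds.user_counts",
           schema="user_id:STRING,events:INTEGER",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))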

Cloud Pub/Sub
o Introduction to Pub/Sub
o Understanding the role of Pub/Sub in event-driven architectures.
o Key Pub/Sub concepts: topics, subscriptions, messages, and acknowledgments.
o Creating and Managing Topics and Subscriptions
• Using the GCP Console to create Pub/Sub topics and subscriptions.
• Configuring message retention policies and acknowledgment settings.
o Publishing and Consuming Messages
• Writing and deploying code to publish messages to a topic.
• Implementing subscribers to consume and process messages from subscriptions (see the sketch after this list).
o Integration with Other GCP Services
• Connecting Pub/Sub with Cloud Functions for serverless event-driven computing.
• Integrating Pub/Sub with Dataflow for real-time stream processing.
o Streaming use case using Dataflow
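
A minimal publish-and-consume sketch with the Pub/Sub client library; the project, topic, and subscription IDs are placeholders:

from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

project = "my-project"

# Publish a message; publish() returns a future, result() blocks until sent.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project, "events")
future = publisher.publish(topic_path, b'{"user_id": "u1"}', origin="demo")
print("published message id:", future.result())

# Consume messages: the callback acknowledges each one after processing.
subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(project, "events-sub")

def callback(message):
    print("received:", message.data, message.attributes)
    message.ack()

streaming_pull = subscriber.subscribe(sub_path, callback=callback)
try:
    streaming_pull.result(timeout=30)  # listen for 30 seconds
except TimeoutError:
    streaming_pull.cancel()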

Cloud Composer (DAG Creations)


o Introduction to Composer/Airflow
o Overview of Airflow Architecture
o Use cases for Composer in managing and scheduling workflows.
o Creating and Managing Workflows
• Creating and configuring Composer environments.
• Defining and scheduling workflows using Apache Airflow.
• Monitoring and managing workflow executions.
o Integration with Data Engineering Services
• Orchestrating workflows involving BigQuery, DataFlow, and other services.
• Coordinating ETL processes with Composer.
• Integrating with external systems and APIs.
o Error Handling and Troubleshooting
• Handling errors and retries in Composer workflows.
• Debugging and troubleshooting failed workflow executions.
• Logging and monitoring for Composer workflows.
o Level-1-DAG: Orchestrating the BigQuery pipelines (see the DAG sketch after this list)
o Level-2-DAG: Orchestrating the Dataproc pipelines
o Level-3-DAG: Orchestrating the Dataflow pipelines
o Implementing CI/CD in Composer Using Cloud Build and GitHub
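
A minimal Level-1-style DAG sketch: a daily Composer/Airflow workflow that runs one BigQuery job via BigQueryInsertJobOperator, with retries as simple error handling. The project, dataset, and schedule are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryInsertJobOperator,
)

with DAG(
    dag_id="level1_bigquery_pipeline",
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"retries": 2},  # retry failed tasks twice
) as dag:
    aggregate = BigQueryInsertJobOperator(
        task_id="aggregate_daily_events",
        configuration={
            "query": {
                "query": """
                    SELECT user_id, COUNT(*) AS events
                    FROM demo_ds.events
                    WHERE event_date = DATE "{{ ds }}"
                    GROUP BY user_id
                """,
                "useLegacySql": False,
                "destinationTable": {
                    "projectId": "my-project",
                    "datasetId": "demo_ds",
                    "tableId": "daily_user_counts",
                },
                "writeDisposition": "WRITE_TRUNCATE",
            }
        },
    )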

Data Fusion
o Introduction to Data Fusion
• Overview of Data Fusion as a fully managed data integration service.
• Use cases for Data Fusion in ETL and data migration.
o Building Data Integration Pipelines
• Creating ETL pipelines using the visual interface.
• Configuring data sources, transformations, and sinks.
• Using pre-built templates for common integration scenarios.
o Integration with GCP and External Services
• Integrating Data Fusion with BigQuery, Cloud Storage, and other GCP services.
o End-to-end pipeline using Data Fusion with Wrangler, GCS, and BigQuery

Cloud Functions
o Cloud Functions Introduction
o Setting up Cloud Functions in GCP
o Event-driven architecture and use cases
o Writing and deploying Cloud Functions
o Triggering Cloud Functions:
• HTTP triggers
• Pub/Sub triggers
• Cloud Storage triggers
o Monitoring and logging Cloud Functions
o Use case 1: Loading files from GCS into BigQuery as soon as they are uploaded (see the sketch below).
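
A sketch of that use case as a first-generation, background Cloud Function deployed with a google.storage.object.finalize trigger; the destination dataset and table are placeholders:

from google.cloud import bigquery

def gcs_to_bigquery(event, context):
    """Triggered when an object is finalized in the watched bucket."""
    uri = f"gs://{event['bucket']}/{event['name']}"

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,                    # infer schema from the file
        write_disposition="WRITE_APPEND",
    )
    load_job = client.load_table_from_uri(
        uri, "demo_ds.uploads", job_config=job_config
    )
    load_job.result()  # wait for the load job to finish
    print(f"Loaded {uri} into demo_ds.uploads")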

Terraform
o Terraform Introduction
o Installing and configuring Terraform.
o Infrastructure Provisioning
o Terraform basic commands
• init, plan, apply, destroy
o Create Resources in Google Cloud Platform
• GCS buckets
• Dataproc cluster
• BigQuery Datasets and tables
• And more resources as needed

What Students Can Expect by the End of the Course


Proficient in SQL Development:
o Mastering SQL for querying and manipulating data within Google BigQuery and Cloud SQL.
o Writing complex queries and optimizing performance for large-scale datasets.
o Understanding schema design and best practices for efficient data storage.
PySpark Development Skills:
o Proficiency in using PySpark for large-scale data processing on Google Cloud.
o Developing and optimizing Spark jobs for distributed data processing.
o Understanding Spark's RDDs, DataFrames, and transformations for data manipulation.

Apache Beam Development Mastery:


o Creating data processing pipelines using Apache Beam.
o Understanding the concepts of parallel processing and data parallelism.
o Implementing transformations and integrating with other GCP services.

DAG Creations with Cloud Composer:


o Designing and implementing Directed Acyclic Graphs (DAGs) for orchestrating workflows.
o Using Cloud Composer for workflow automation and managing dependencies.
o Developing DAGs that integrate various GCP services for end-to-end data processing.

Notebooks and Workflows with Databricks:


o Understand how to build and manage data pipelines using Databricks and Delta Lake.
o Efficiently query and analyze large datasets with Databricks SQL and Apache Spark.
o Implement scalable workflows and optimize performance within Databricks.

Architecture Planning:
o Proficient in architecting end-to-end data solutions on GCP.
o Understanding the principles of designing scalable, reliable, and cost-effective data
architectures.

Certification Readiness
o Prepare for the Google Cloud Professional Data Engineer (PDE) and Associate Cloud
Engineer (ACE) certifications through a combination of theoretical knowledge and
hands-on experience.

The course will empower students with practical skills in SQL, PySpark, Apache Beam, DAG creation,
and architecture planning, ensuring they are well prepared to tackle real-world data engineering
challenges and successfully obtain GCP certifications.

Thank You.
