Big Data Bootcamp with AWS & Azure
This comprehensive program is designed to equip you with the skills and knowledge needed to
thrive as a Big Data Engineer in today’s cloud-driven world. Starting from foundational Big Data
concepts and tools, you’ll progress to mastering distributed computing frameworks like Apache
Spark and Kafka.
The course emphasizes practical learning through hands-on projects and real-world use
cases, focusing on integrating Big Data solutions with Azure & AWS Cloud. By the end of the
program, you’ll have the ability to design, build, and deploy scalable data pipelines, process
massive datasets, and implement analytics workflows.
The course is completely beginner friendly; the only prerequisites are an understanding of Python and a basic idea of databases and SQL, both of which are also covered as part of the course.
Learning Objectives
Topics
Working with Python Libraries: Overview of essential Python libraries such as Pandas (for data manipulation), OS (for file handling), and Math (for basic calculations).
Joins and Aggregations in SQL: Learn how to use JOINs (INNER, LEFT, RIGHT, FULL) to combine data from multiple tables and apply aggregate functions (e.g., SUM, AVG, COUNT) for analysis.
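As a quick illustration of these prerequisites, the sketch below shows how a SQL-style INNER JOIN and GROUP BY aggregation map onto Pandas; the customer and order data are made up for the example.

```python
import pandas as pd

# Hypothetical sample data standing in for two database tables
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Asha", "Ben", "Chen"]})
orders = pd.DataFrame({"order_id": [10, 11, 12, 13],
                       "customer_id": [1, 1, 2, 4],
                       "amount": [250.0, 40.0, 99.5, 10.0]})

# INNER JOIN: keep only customers that have matching orders
joined = customers.merge(orders, on="customer_id", how="inner")

# Aggregation: SUM and COUNT per customer, like GROUP BY in SQL
summary = (joined.groupby("name")["amount"]
                 .agg(total="sum", num_orders="count")
                 .reset_index())
print(summary)
```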
This section introduces the fundamentals of Big Data, exploring its importance in today’s data-
driven world. It explains the challenges of traditional data processing methods and how Big Data
technologies address these challenges. The section also covers distributed systems, key
characteristics of Big Data, and its applications across various industries. Finally, it introduces the
Hadoop ecosystem and distributed storage and processing concepts, setting the stage for
advanced topics in subsequent modules.
Topics
Big Data Overview: Understand what Big Data is, its importance, and its defining characteristics (Volume, Velocity, Variety, Veracity, and Value).
Applications of Big Data: Learn about real-world use cases of Big Data in industries like finance, healthcare, retail, and cloud computing.
Module 2
Deep Dive into Hadoop Architecture and Ecosystem
This module focuses on the Hadoop ecosystem, covering its architecture, core components, and
the role of distributed storage and processing. Students will learn about the Hadoop Distributed
File System (HDFS) and YARN, and how they enable the handling of massive datasets. The
module also introduces essential tools in the Hadoop ecosystem that support data storage,
processing, and management.
Topics
HDFS (Hadoop Distributed File System): Understand the design principles of HDFS, its block storage mechanism, and replication strategies for fault tolerance.
Topics
Parallelism and Partitioning: Explore how Spark handles parallelism and data partitioning for efficient processing across clusters.
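As a quick illustration, the PySpark sketch below inspects and changes the number of partitions of a DataFrame; the dataset and partition counts are arbitrary.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()

# Illustrative dataset: Spark splits it into partitions processed in parallel
df = spark.range(0, 1_000_000)          # single column named "id"
print(df.rdd.getNumPartitions())        # partition count chosen by Spark

# Repartition to spread work across more tasks (triggers a shuffle)
df16 = df.repartition(16)
print(df16.rdd.getNumPartitions())      # 16

# Coalesce reduces partitions without a full shuffle (e.g., before writing out)
df4 = df16.coalesce(4)
print(df4.rdd.getNumPartitions())       # 4
```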
Topics
Spark SQL Basics: Learn to query structured data using Spark SQL and integrate it with DataFrames for seamless processing.
Optimization with Catalyst: Dive into Spark's Catalyst optimizer for query optimization and understand its role in performance tuning.
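The sketch below registers a DataFrame as a temporary view, queries it with Spark SQL, and prints the plans produced by the Catalyst optimizer; the sales data is made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# Illustrative sales data; in practice this would come from files or tables
sales = spark.createDataFrame(
    [("2024-01-01", "laptop", 1200.0), ("2024-01-02", "phone", 650.0)],
    ["sale_date", "product", "amount"],
)
sales.createOrReplaceTempView("sales")

# Query the DataFrame with SQL; the result is an ordinary DataFrame
top = spark.sql("""
    SELECT product, SUM(amount) AS revenue
    FROM sales
    GROUP BY product
    ORDER BY revenue DESC
""")
top.show()

# Inspect the logical and physical plans produced by the Catalyst optimizer
top.explain(True)
```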
Topics
Data Partitioning and Shuffling: Understand how data is partitioned in Spark and how shuffling occurs during transformations like joins and groupBy.
Optimizing Spark Jobs with Configurations: Dive into tuning Spark configurations to maximize efficiency, focusing on driver and executor memory, number of partitions, and parallelism.
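A hedged example of setting such configurations when building a SparkSession is shown below; the memory sizes and partition counts are placeholders that would be tuned to the actual cluster and data volume.

```python
from pyspark.sql import SparkSession

# Illustrative values only; real settings depend on cluster size and workload
spark = (SparkSession.builder
         .appName("tuning-demo")
         .config("spark.driver.memory", "4g")
         .config("spark.executor.memory", "8g")
         .config("spark.executor.cores", "4")
         .config("spark.sql.shuffle.partitions", "200")  # partitions after wide ops
         .config("spark.default.parallelism", "200")     # RDD-level parallelism
         .getOrCreate())

print(spark.conf.get("spark.sql.shuffle.partitions"))
```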
Module 6
Spark Performance Optimization and Advanced Tuning
This module will focus entirely on advanced performance optimization techniques for Spark. After
covering the basics and intermediate performance tuning in earlier modules, this section will dive
into the finer details of optimizing Spark jobs, tuning Spark for large datasets, and improving the
overall efficiency of Spark workloads. Students will also explore best practices for performance
troubleshooting and debugging Spark jobs.
Topics
Advanced Spark SQL Optimizations: Learn how to optimize Spark SQL queries, including query plan optimization, partition pruning, and predicate pushdown.
Tuning Spark for Large-Scale Data: Understand the strategies for tuning Spark when working with very large datasets, such as handling skew and data repartitioning.
Executor and Memory Management: Dive deeper into managing executor memory, fine-tuning garbage collection, and adjusting Spark configurations for better performance.
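One common approach to skew (assuming Spark 3.x) is sketched below: enabling Adaptive Query Execution so skewed join partitions are split at runtime, and repartitioning on the join key before a large join. The tables and settings are illustrative, not a prescription.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("advanced-tuning-demo").getOrCreate()

# Adaptive Query Execution (Spark 3.x) can split skewed partitions at runtime
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# Illustrative tables; imagine 'orders' is heavily skewed on customer_id
orders = spark.range(0, 100_000).withColumnRenamed("id", "customer_id")
customers = spark.range(0, 1_000).withColumnRenamed("id", "customer_id")

# Repartitioning on the join key before a large join can reduce hotspots
joined = (orders.repartition("customer_id")
                .join(customers, "customer_id"))
joined.explain()   # inspect the physical plan, including any skew-join handling
```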
Topics
Types of NoSQL Databases: Learn about the four major types of NoSQL databases: document stores, key-value stores, column-family stores, and graph databases.
Use Cases for NoSQL: Learn when to use NoSQL databases, with real-world examples such as social media platforms, IoT applications, and big data analytics.
Module 8
MongoDB: Document-Based NoSQL Database
This module focuses on MongoDB, a popular document-based NoSQL database. Students will
learn about MongoDB’s architecture, data model, and query language. Hands-on practice will
include creating collections, inserting documents, querying data, and performing aggregations
using MongoDB’s powerful features.
Topics
MongoDB Indexing and Performance: Learn how to optimize MongoDB queries using indexing and other performance optimization techniques.
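A brief pymongo sketch of these ideas is shown below; the connection string, database, collection, and documents are illustrative.

```python
from pymongo import MongoClient, ASCENDING

# Connection string, database, and collection names are illustrative
client = MongoClient("mongodb://localhost:27017")
db = client["retail"]
orders = db["orders"]

# Insert documents (schema-flexible: fields can vary per document)
orders.insert_many([
    {"customer": "Asha", "amount": 250.0, "items": ["laptop"]},
    {"customer": "Ben", "amount": 40.0, "items": ["mouse", "pad"]},
])

# Index the field used in frequent queries to avoid full collection scans
orders.create_index([("customer", ASCENDING)])

# Query and aggregate
print(orders.find_one({"customer": "Asha"}))
pipeline = [{"$group": {"_id": "$customer", "total": {"$sum": "$amount"}}}]
for row in orders.aggregate(pipeline):
    print(row)
```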
Module 9
Cassandra: Column-Family Based NoSQL Database
In this module, students will explore Cassandra, a column-family-based NoSQL database
designed for handling large-scale, high-velocity data. Cassandra is widely used for applications
that require high availability and fault tolerance. This module will cover its architecture, query
language (CQL), and how to scale Cassandra for massive datasets.
Topics
Cassandra Query Language (CQL): Learn how to use CQL, Cassandra's query language, to interact with the database.
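The hedged sketch below uses the DataStax Python driver to run a few CQL statements; the contact point, keyspace, and table are illustrative.

```python
from cassandra.cluster import Cluster

# Contact points and keyspace/table names are illustrative
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# CQL looks like SQL but is designed around partition keys, not joins
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.sensor_readings (
        sensor_id text, reading_time timestamp, value double,
        PRIMARY KEY (sensor_id, reading_time))
""")

# Parameterized insert, then a query restricted by the partition key
session.execute(
    "INSERT INTO demo.sensor_readings (sensor_id, reading_time, value) "
    "VALUES (%s, toTimestamp(now()), %s)",
    ("sensor-1", 21.5),
)
for row in session.execute(
        "SELECT * FROM demo.sensor_readings WHERE sensor_id = %s", ("sensor-1",)):
    print(row.sensor_id, row.value)
```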
Topics
Hive Data Types: Explore Hive's data types, including primitive types (int, string) and complex types (array, map).
Creating Databases and Tables: Learn how to create databases and tables in Hive, including specifying column types and table partitions.
Loading Data into Hive: Cover the process of loading data into Hive from local storage and HDFS (both internal and external tables).
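HiveQL itself is not Python, but the same statements can be run from a Hive-enabled SparkSession (or pasted into the Hive CLI). The sketch below creates a partitioned table and loads a local CSV into it; the database, table, and file path are illustrative, and a configured Hive metastore is assumed.

```python
from pyspark.sql import SparkSession

# Hive-enabled Spark session; assumes a Hive metastore is configured
spark = (SparkSession.builder
         .appName("hive-demo")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS sales_db")
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_db.transactions (
        txn_id INT,
        product STRING,
        amount DOUBLE)
    PARTITIONED BY (txn_date STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
""")

# Load a local CSV file into a specific partition (path is illustrative)
spark.sql("""
    LOAD DATA LOCAL INPATH '/tmp/transactions_2024-01-01.csv'
    INTO TABLE sales_db.transactions
    PARTITION (txn_date = '2024-01-01')
""")
spark.sql("SELECT product, SUM(amount) FROM sales_db.transactions GROUP BY product").show()
```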
Module 11
Topics
Internal vs. External Tables: Understand the difference between internal and external tables in Hive and their practical uses.
Complex Data Types in Hive: Learn how to work with complex data types like Arrays, Maps, and Structs for flexible data storage.
Bucketing and Performance Tuning: Learn how bucketing helps distribute data across files and optimizes query performance for large datasets.
Join Optimizations in Hive: Learn about advanced join techniques like Map-Side Join, Sorted Merge Join, and Skew Join for optimizing large queries.
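Continuing the same approach as the earlier Hive sketch, the example below contrasts an external table (Hive tracks only metadata, so dropping it leaves the files in place) with a bucketed managed table; the names, locations, and bucket count are illustrative, and the statements could equally be run in the Hive CLI.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-tables-demo")
         .enableHiveSupport()
         .getOrCreate())

# External table: only metadata lives in the metastore; data stays at the HDFS path
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_db.raw_events (
        event_id STRING, payload MAP<STRING, STRING>)
    LOCATION 'hdfs:///data/raw_events'
""")

# Bucketed managed table: rows are hashed on user_id into a fixed number of files,
# which helps joins and sampling on that column
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_db.events_bucketed (
        event_id STRING, user_id BIGINT, amount DOUBLE)
    CLUSTERED BY (user_id) INTO 32 BUCKETS
""")
```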
Module 12
Topics
Message Key-Value Pairs in Kafka: Explore working with key-value pairs for Kafka messages.
Working with JSON, CSV Data: Learn to send and consume JSON and CSV formatted data using Kafka.
Producers and Consumers in Consumer Groups: Learn the concept of consumer groups and how they manage parallel processing of Kafka messages.
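A minimal kafka-python sketch of these ideas follows; the broker address, topic, and group id are illustrative. Messages with the same key always land on the same partition, and consumers that share a group_id split the topic's partitions between them.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Broker address and topic name are illustrative
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# The key controls which partition a message lands on (same key -> same partition)
producer.send("sensor-readings", key="sensor-1", value={"temp": 21.5, "unit": "C"})
producer.flush()

# Consumers sharing a group_id split the topic's partitions between them
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    group_id="dashboard-service",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.key, message.value)
    break  # stop after one message in this sketch
```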
Module 14
Topics
Kafka Integration with Spark: Learn how to consume data from Kafka topics in Spark Structured Streaming for real-time processing.
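As an illustration of this integration, the hedged sketch below reads a Kafka topic with Spark Structured Streaming and maintains a running average; the broker address, topic name, and JSON schema are assumptions for the example, and the spark-sql-kafka connector package must be available on the cluster.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-streaming-demo").getOrCreate()

schema = StructType([StructField("sensor_id", StringType()),
                     StructField("temp", DoubleType())])

# Read the Kafka topic as an unbounded streaming DataFrame
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "sensor-readings")
       .load())

# Kafka values arrive as bytes; parse the JSON payload into columns
readings = (raw.selectExpr("CAST(value AS STRING) AS json")
               .select(from_json(col("json"), schema).alias("r"))
               .select("r.*"))

# Write a running average per sensor to the console (sink is illustrative)
query = (readings.groupBy("sensor_id").avg("temp")
         .writeStream.outputMode("complete").format("console").start())
query.awaitTermination()
```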
Topics
What is Orchestration in Big Data? Understand the concept of orchestration and its role in automating the execution of data pipelines in Big Data environments.
Need for Dependency Management in Data Pipeline Design: Learn why dependency management is crucial in ensuring tasks are executed in the correct order and how it prevents issues in complex data workflows.
What is Apache Airflow? Get an introduction to Apache Airflow, its purpose in data pipeline orchestration, and its role in the Big Data ecosystem.
Airflow Operators: Learn about the operators in Airflow, such as BashOperator and PythonOperator, and their role in task execution.
Writing Airflow DAG Scripts: Understand how to write DAG scripts in Airflow, including the basic structure, task dependencies, and scheduling.
Executing Parallel Tasks in Airflow: Learn how to configure parallel task execution in Airflow to run multiple tasks concurrently.
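A minimal Airflow DAG along these lines is sketched below (assuming Airflow 2.x; the task commands, schedule, and DAG name are illustrative). The two extract tasks have no dependency on each other, so the scheduler can run them in parallel before the transform step.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

# Task logic is illustrative; a real pipeline would call extraction/loading scripts
def transform():
    print("transforming extracted files")

with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_orders = BashOperator(task_id="extract_orders",
                                  bash_command="echo extract orders")
    extract_customers = BashOperator(task_id="extract_customers",
                                     bash_command="echo extract customers")
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load = BashOperator(task_id="load_warehouse", bash_command="echo load")

    # The two extracts are independent, so Airflow can run them in parallel
    [extract_orders, extract_customers] >> transform_task >> load
```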
Module 16
Topics
Cloud Service Models: Overview of the three primary cloud models: IaaS (Infrastructure as a Service), PaaS (Platform as a Service), and SaaS (Software as a Service).
Benefits of Cloud for Big Data: Explore the benefits of using the cloud for Big Data engineering: scalability, flexibility, cost efficiency, and on-demand computing.
Overview of Azure: Introduction to Azure and its role in the cloud ecosystem, highlighting key services for data engineering.
Azure Global Infrastructure: Learn about Azure's data centers, regions, and availability zones and their importance for high-availability systems.
Module 17
This module focuses on the various Azure Storage Services, including Blob Storage and Azure
Data Lake Storage Gen2, which are critical for storing and managing large datasets for Big Data
applications.
Topics
Azure Storage Overview: Learn about Azure's storage solutions and their roles in storing data for Big Data engineering.
Azure Blob Storage: Introduction to Azure Blob Storage, its use cases, and data management techniques for unstructured data.
Azure Data Lake Storage Gen2: Explore ADLS Gen2, its integration with HDFS, and its hierarchical namespace for managing large-scale data.
Storage Tiers in Azure: Understand the Hot, Cool, and Archive tiers in Blob Storage for cost-effective data management.
Setting up Blob Storage and ADLS Gen2: Hands-on setup and configuration of Blob Storage and Azure Data Lake Storage Gen2.
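As a small hands-on illustration, the sketch below uploads a local file to Blob Storage with the azure-storage-blob SDK; the connection string, container, and blob paths are placeholders, and the same Blob APIs generally work against ADLS Gen2-enabled accounts as well.

```python
from azure.storage.blob import BlobServiceClient

# Connection string, container, and file names are illustrative placeholders
conn_str = "<your-storage-account-connection-string>"
service = BlobServiceClient.from_connection_string(conn_str)

# Create a container (in ADLS Gen2 accounts this acts as a filesystem root)
container = service.get_container_client("raw-data")
if not container.exists():
    container.create_container()

# Upload a local file as a block blob
with open("sales_2024-01-01.csv", "rb") as f:
    container.upload_blob(name="sales/2024/01/01.csv", data=f, overwrite=True)

# List what was uploaded
for blob in container.list_blobs(name_starts_with="sales/"):
    print(blob.name, blob.size)
```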
Module 18
This module covers Azure Databricks, a powerful platform for Apache Spark. Students will learn
how to create and manage Databricks clusters, use notebooks for data processing, and perform
data transformations with Spark.
Topics
Azure Databricks Pricing: Learn about pricing models for Azure Databricks and how to optimize cluster usage for cost efficiency.
Module 19
This module introduces Azure Data Factory (ADF), a cloud service for orchestrating data
workflows. Students will learn how to create data pipelines, schedule data movements, and
monitor the performance of their pipelines.
Topics
Introduction to Azure Data Factory: Learn about Azure Data Factory (ADF) and its role in creating and managing ETL and ELT pipelines.
Creating Data Pipelines in ADF: Understand how to create data pipelines for automating data ingestion, transformation, and loading tasks.
Working with Datasets and Linked Services: Learn about datasets and linked services in ADF to define source and destination data locations.
Topics
Error Handling and Logging: Understand error handling and logging best practices in ADF to ensure pipeline robustness.
Data Flow Debugging and Optimization: Learn how to debug and optimize data flows in ADF, improving performance in large-scale workflows.
ADF Integration with Other Azure Services: Understand how ADF integrates with other Azure services such as Azure Databricks, Azure Functions, and Azure Synapse.
Module 21
Topics
EMR Cluster Setup and Configuration: Learn to create and configure an EMR cluster, including selecting appropriate instance types, node types (Master, Core, and Task Nodes), and scaling options.
Hadoop and Spark on EMR: Understand how to run distributed Hadoop MapReduce and Apache Spark jobs on EMR clusters.
EMR and S3 Integration: Learn to store input data in S3, process it using EMR, and save the output back to S3 for scalability and cost-efficiency.
Monitoring and Optimizing EMR Jobs: Explore tools for tracking job progress, debugging issues, and tuning cluster performance for faster execution.
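To illustrate the EMR and S3 workflow, here is a hedged PySpark script of the kind that might be submitted as an EMR step; the bucket names, paths, and columns are made up for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# A PySpark script submitted as an EMR step; bucket and prefix names are illustrative
spark = SparkSession.builder.appName("emr-s3-job").getOrCreate()

# Read raw input directly from S3 (EMR clusters access S3 through EMRFS)
clicks = spark.read.json("s3://my-raw-bucket/clickstream/2024/01/")

# Simple transformation: keep successful events and count clicks per page
report = (clicks.filter(col("status") == 200)
                .groupBy("page")
                .count())

# Write results back to S3 so they survive after the cluster is terminated
report.write.mode("overwrite").parquet("s3://my-processed-bucket/click-report/")
```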
Module 22
Topics
Introduction to Amazon S3: Overview of Amazon S3, its role as a storage solution for Big Data, and its ability to store massive datasets efficiently.
Versioning and Lifecycle Policies: Learn to use versioning to track file changes and set lifecycle policies for archiving or deleting unused data.
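A short boto3 sketch of these features is shown below; the bucket name, prefix, and retention periods are illustrative.

```python
import boto3

# Bucket name and lifecycle rules are illustrative
s3 = boto3.client("s3")
bucket = "my-data-lake-bucket"

# Turn on versioning so overwritten or deleted objects can be recovered
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Lifecycle policy: move raw data to Glacier after 90 days, delete after one year
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-raw-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 365},
        }]
    },
)
```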
Topics
Introduction to AWS Athena: Overview of Athena, its serverless architecture, and how it simplifies querying structured and semi-structured data stored in S3.
Setting Up Athena for Querying Data: Learn to configure Athena, define external tables, and run SQL queries on data stored in S3.
Introduction to AWS Glue: Understand the role of AWS Glue in creating data catalogs, schema discovery, and automating ETL processes.
Using Glue Crawlers: Learn how to set up Glue crawlers to infer data schemas and create metadata tables for use in Athena.
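The hedged sketch below runs an Athena query from Python with boto3; the database, table, and S3 result location are illustrative and would normally come from a Glue crawler's catalog.

```python
import time
import boto3

# Database, table, and bucket names are illustrative; the table would typically
# be created by a Glue crawler over files in S3
athena = boto3.client("athena")

run = athena.start_query_execution(
    QueryString="SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page",
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = run["QueryExecutionId"]

# Poll until the serverless query finishes
while True:
    state = athena.get_query_execution(
        QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

# Fetch the result rows (also written as a CSV to the OutputLocation above)
if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```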
Capstone Project - 1
Outline:
Data Source: Start with raw data stored locally or ingested from a file source
such as CSV or logs.
ETL Process:
Extract: Use Hadoop to ingest raw data into HDFS.
Transform: Clean and preprocess the data using a MapReduce job,
performing tasks like filtering, deduplication, and aggregations.
Load: Save the processed data into Hive tables for querying and reporting.
Querying Data: Use HiveQL to perform operations such as grouping, filtering,
and aggregating data for business insights.
Key Technologies:
Hadoop (HDFS, Spark): Modules 2–5.
Hive: Modules 10–11.
Outcome:
By completing this project, students will:
Understand how to design an end-to-end batch processing pipeline using
Hadoop and Hive.
Gain experience with HDFS storage, MapReduce, and HiveQL for Big Data
analytics.
Capstone Project - 2
Outline:
Data Source: Simulate real-time data streams using a Kafka producer (e.g.,
sending IoT sensor readings or stock prices).
Pipeline Components:
Apache Kafka: Use Kafka to manage the stream of incoming data with
appropriate topics and partitions.
Spark Streaming: Consume the Kafka stream, process data in real-time (e.g.,
compute rolling averages or identify anomalies), and write results to HDFS
or S3 for further analysis.
Output: Store processed data in a NoSQL database like Cassandra or
MongoDB (covered in earlier modules) for querying and visualization.
Key Technologies:
Apache Kafka: Modules 12–13.
Spark Streaming: Module 14.
NoSQL Database Integration: Modules 7–9.
Outcome:
By completing this project, students will:
Learn to create and manage real-time data pipelines.
Apply streaming analytics for fast, event-driven insights.
Showcase the ability to work with Kafka, Spark Streaming, and NoSQL
databases in a single workflow.
Capstone Project - 3
Outline:
Key Technologies:
Azure Data Factory (ADF): Modules 19–21.
Azure Databricks: Modules 18–21.
Azure Data Lake Storage Gen2 (ADLS): Module 20.
Outcome:
By completing this project, students will:
Gain experience with modern cloud-based data lakehouse architectures.
Learn to integrate Databricks, ADF, and ADLS for scalable workflows.
Prepare for real-world cloud-based data engineering challenges.
Capstone Project - 4
Outline:
Data Source: Store a raw dataset in S3 (e.g., customer logs or product data).
Pipeline Components:
AWS Glue Crawlers: Use Glue crawlers to automatically discover schemas
and create a data catalog.
AWS Glue Jobs: Write transformation scripts to clean and preprocess the
data, converting it into an optimized format like Parquet or ORC.
AWS Athena: Query the cataloged data in S3 using SQL to perform analysis,
such as generating reports or KPIs.
Output: Visualize the results using AWS QuickSight or export them to BI tools
like Tableau.
Key Technologies:
AWS S3: Module 23.
AWS Glue: Module 24.
AWS Athena: Module 24.
Outcome:
By completing this project, students will:
Master serverless tools like Glue and Athena for Big Data analytics.
Learn to catalog, transform, and query large datasets efficiently.
Build cost-effective, serverless data engineering workflows.