Developer Training For Apache Spark and Hadoop
Course Outcomes:
• Distribute, store, and process data in a Hadoop cluster
• Write, configure, and deploy Spark applications on a cluster
• Use the Spark shell for interactive data analysis
• Process and query structured data using Spark SQL and Hive Query Language
• Understand a wide variety of learning algorithms and build an end-to-end machine learning model with MLlib in PySpark
• Use Spark Streaming to process a live data stream
What to Expect
This course is designed for developers and engineers who have programming experience; prior knowledge of Hadoop or Spark is not required.
• Apache Spark examples and hands-on exercises are presented in Scala and Python. The
ability to program in one of those languages is required.
• Basic familiarity with the Linux command line is assumed
• Basic knowledge of SQL is helpful
Module 1
Introduction to Apache Hadoop and the Hadoop Ecosystem
• Apache Hadoop Overview
• Data Ingestion and Storage
• Data Processing
• Data Analysis and Exploration
• Other Ecosystem Tools
• Introduction to the Hands-On Exercises

Module 2
Apache Hadoop File Storage
• Apache Hadoop Cluster Components
• HDFS Architecture
• Using HDFS

Module 3
Distributed Processing on an Apache Hadoop Cluster
• YARN Architecture
• Working With YARN

Module 4
Apache Spark Basics
• What is Apache Spark?
• Starting the Spark Shell
• Using the Spark Shell
• Getting Started with Datasets and DataFrames
• DataFrame Operations
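For a taste of the material, a minimal PySpark sketch of the interactive work this module introduces (the input file is hypothetical):

    from pyspark.sql import SparkSession

    # In the Spark shell (pyspark) a SparkSession named `spark` already exists;
    # in a standalone script it is created explicitly.
    spark = SparkSession.builder.appName("SparkBasics").getOrCreate()

    df = spark.read.json("people.json")   # hypothetical input file
    df.printSchema()                      # inspect the inferred schema
    df.select("name", "age").where(df.age > 21).show()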
Module 5
Working with DataFrames and Schemas
• Introduction to DataFrames
• Exercise: Introducing DataFrames
• Exercise: Reading and Writing DataFrames
• Exercise: Working with Columns
• Exercise: Working with Complex Types
• Exercise: Combining and Splitting DataFrames
• Exercise: Summarizing and Grouping DataFrames
• Exercise: Working with UDFs
• Exercise: Working with Windows
• Eager and Lazy Execution
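As an illustration of reading with an explicit schema (file, column, and output names are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("SchemasDemo").getOrCreate()

    # An explicit schema avoids a separate inference pass over the data.
    schema = StructType([
        StructField("name", StringType()),
        StructField("age", IntegerType()),
    ])
    people = spark.read.csv("people.csv", schema=schema, header=True)

    # Column operations are lazy; nothing runs until an action such as write.
    adults = people.where(col("age") >= 18).withColumn("age_next_year", col("age") + 1)
    adults.write.mode("overwrite").parquet("adults.parquet")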
Module 6
Analyzing Data with DataFrame Queries
• Querying DataFrames Using Column Expressions
• Grouping and Aggregation Queries
• Joining DataFrames
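A short sketch of a grouping query followed by a join (the data is made up for illustration):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("QueriesDemo").getOrCreate()
    orders = spark.createDataFrame(
        [(1, 20.0), (1, 35.0), (2, 10.0)], ["customer_id", "amount"])
    customers = spark.createDataFrame(
        [(1, "Alice"), (2, "Bob")], ["customer_id", "name"])

    # Aggregate per key, then join the summary back to the dimension table.
    totals = orders.groupBy("customer_id").agg(F.sum("amount").alias("total_spent"))
    customers.join(totals, "customer_id").orderBy(F.desc("total_spent")).show()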
Module 7
Introduction to Apache Hive
• About Hive
• Transforming Data with HiveQL
Module 8
Working with Apache Hive
• Exercise: Working with Partitions
• Exercise: Working with Buckets
• Exercise: Working with Skew
• Exercise: Using SerDes to Ingest Text Data
• Exercise: Using Complex Types to Denormalize Data
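A sketch of the kind of HiveQL the partitioning exercises involve, issued here through a Hive-enabled SparkSession (table and column names are hypothetical):

    # Assumes a SparkSession built with .enableHiveSupport() (see Module 9).
    spark.sql("""
        CREATE TABLE IF NOT EXISTS web_logs (ip STRING, url STRING)
        PARTITIONED BY (log_date STRING)
        STORED AS PARQUET
    """)
    spark.sql("""
        INSERT INTO web_logs PARTITION (log_date = '2024-01-01')
        VALUES ('10.0.0.1', '/index.html')
    """)
    spark.sql("SHOW PARTITIONS web_logs").show()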
Module 9
Hive and Spark Integration
• Hive and Spark Integration
• Exercise: Spark Integration with Hive
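For illustration, enabling Hive support wires Spark SQL to the shared Hive metastore (the table name is hypothetical):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("HiveIntegration")
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("SHOW TABLES").show()
    logs = spark.table("web_logs")   # read a Hive table as a DataFrame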
Module 10
RDD Overview
• RDD Overview
• RDD Data Sources
• Creating and Saving RDDs
• RDD Operations
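A minimal sketch of creating, transforming, and saving RDDs (paths are hypothetical; `spark` is the session from the earlier sketches):

    sc = spark.sparkContext

    # Create RDDs from an in-memory collection and from a text file.
    nums = sc.parallelize([1, 2, 3, 4])
    lines = sc.textFile("data.txt")

    # Transformations are lazy; collect() and saveAsTextFile() are actions.
    squares = nums.map(lambda x: x * x)
    print(squares.collect())           # [1, 4, 9, 16]
    squares.saveAsTextFile("squares_out")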
Module 11
Transforming Data with RDDs
• Writing and Passing Transformation Functions
• Transformation Execution
• Converting Between RDDs and DataFrames
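A sketch of passing a named function to a transformation and converting between RDDs and DataFrames (file and column names are hypothetical):

    sc = spark.sparkContext

    # A named function passed to a transformation is shipped to the executors.
    def parse_line(line):
        name, age = line.split(",")
        return (name, int(age))

    pairs = sc.textFile("people.csv").map(parse_line)   # hypothetical file
    df = pairs.toDF(["name", "age"])                    # RDD -> DataFrame
    rows = df.rdd                                       # DataFrame -> RDD of Row objects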
Module 12
Aggregating Data with Pair RDDs
• Key-Value Pair RDDs
• Map-Reduce
• Other Pair RDD Operations
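The classic map-reduce word count over key-value pairs, as a sketch (the input path is hypothetical):

    sc = spark.sparkContext
    counts = (sc.textFile("data.txt")
              .flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))
    print(counts.take(10))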
Module 13
Querying Tables and Views with Apache Spark SQL
• Querying Tables in Spark Using SQL
• Querying Files and Views
• The Catalog API
• Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
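A sketch of querying a temporary view with SQL and inspecting it via the Catalog API (data is made up for illustration):

    people = spark.createDataFrame([("Alice", 34), ("Ben", 15)], ["name", "age"])
    people.createOrReplaceTempView("people")

    adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
    adults.show()

    # The Catalog API exposes the same metadata programmatically.
    print(spark.catalog.listTables())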
Module 14
Working with Datasets in Scala
• Datasets and DataFrames
• Creating Datasets
• Loading and Saving Datasets
• Dataset Operations
Module 15
Writing, Configuring, and Running Apache Spark Applications
• Writing a Spark Application
• Building and Running an Application
• Application Deployment Mode
• The Spark Application Web UI
• Configuring Application Properties
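A minimal application skeleton and a typical submission command, for illustration (file names and paths are hypothetical):

    # my_app.py
    from pyspark.sql import SparkSession

    if __name__ == "__main__":
        spark = SparkSession.builder.appName("MyApp").getOrCreate()
        df = spark.read.parquet("events.parquet")   # hypothetical input
        print(df.count())
        spark.stop()

    # Submitted to a YARN cluster with, for example:
    #   spark-submit --master yarn --deploy-mode cluster my_app.py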
Module 16
Distributed Processing
• Review: Apache Spark on a Cluster
• RDD Partitions
• Example: Partitioning in Queries
• Stages and Tasks
• Job Execution Planning
• Example: Catalyst Execution Plan
• Example: RDD Execution Plan
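A sketch of inspecting how Catalyst plans a query and how the data is partitioned (the DataFrame is made up for illustration):

    from pyspark.sql import functions as F

    df = spark.range(100).withColumn("age", F.col("id") % 10)
    query = df.groupBy("age").count()

    query.explain(True)                  # parsed, analyzed, optimized, and physical plans
    print(df.rdd.getNumPartitions())     # partitions of the underlying RDD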
Module 17
Distributed Processing Challenges
• Shuffle
• Skew
• Order
Module 18
Distributed Data Persistence
• DataFrame and Dataset Persistence
• Persistence Storage Levels
• Viewing Persisted RDDs
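A sketch of caching an intermediate result with an explicit storage level (the DataFrame is a stand-in for an expensive computation):

    from pyspark import StorageLevel

    df = spark.range(1000000)                 # stand-in for an expensive result
    df.persist(StorageLevel.MEMORY_AND_DISK)  # the level controls memory/disk use
    df.count()                                # first action materializes the cache
    df.unpersist()                            # release it when no longer needed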
Module 19
Machine Learning with Spark ML
• Common Apache Spark Use Cases
• Iterative Algorithms in Apache Spark: Machine Learning, Graph Processing
• Introduction to MLlib: Various ML Algorithms Supported by MLlib
• ML Model with Spark ML
• Exercise: Implement Linear Regression
• Exercise: Implement Logistic Regression
• Exercise: Implement Random Forest
• Exercise: Implement k-means
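A minimal Spark ML sketch in the spirit of the linear regression exercise (the toy data follows label = x1 + 2*x2 + 1 and is made up for illustration):

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    data = spark.createDataFrame(
        [(1.0, 2.0, 6.0), (2.0, 3.0, 9.0), (3.0, 5.0, 14.0)],
        ["x1", "x2", "label"])

    # Assemble the feature columns into the single vector column MLlib expects.
    features = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
    model = LinearRegression().fit(features.transform(data))
    print(model.coefficients, model.intercept)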
Module 20
Apache Spark Streaming: Introduction to DStreams
• Apache Spark Streaming Overview
• Example: Streaming Request Count
• DStreams
• Developing Streaming Applications
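A request-count sketch with DStreams, assuming a hypothetical socket source (note that DStreams are a legacy API, deprecated in recent Spark releases in favor of Structured Streaming):

    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(spark.sparkContext, 2)    # 2-second micro-batches
    reqs = ssc.socketTextStream("localhost", 9999)   # hypothetical log source

    # Count requests per URL within each batch.
    counts = (reqs.map(lambda line: (line.split()[1], 1))
                  .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()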
Module 21
Apache Spark Streaming: Processing Multiple Batches
• Multi-Batch Operations
• Time Slicing
• State Operations
• Sliding Window Operations
• Preview: Structured Streaming
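Continuing the sketch above, a sliding-window count over a 30-second window that slides every 10 seconds (the checkpoint path is hypothetical):

    ssc.checkpoint("ckpt")    # window/state operations require a checkpoint directory
    windowed = counts.reduceByKeyAndWindow(
        lambda a, b: a + b,   # add counts entering the window
        lambda a, b: a - b,   # subtract counts leaving the window
        windowDuration=30,
        slideDuration=10)
    windowed.pprint()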
Module 22
Apache Spark Streaming: Data Sources
• Streaming Data Source Overview
• Apache Flume and Apache Kafka Data Sources
• Example: Using a Kafka Direct Data Source
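The Python Kafka direct DStream API (pyspark.streaming.kafka) shipped with older Spark releases (1.3 through 2.3) and has since been removed; a sketch against those versions, with a hypothetical topic and broker:

    from pyspark.streaming.kafka import KafkaUtils

    # Each record arrives as a (key, message) pair.
    stream = KafkaUtils.createDirectStream(
        ssc,
        topics=["weblogs"],                                   # hypothetical topic
        kafkaParams={"metadata.broker.list": "broker1:9092"})  # hypothetical broker
    stream.map(lambda kv: kv[1]).pprint()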