B2. Introduction To Big Data With Spark and Hadoop - Coursera

This course provides an introduction to big data concepts and tools like Apache Hadoop and Apache Spark. It covers topics such as explaining the impact of big data, describing Hadoop architecture and applications, applying Spark programming basics, and using Spark SQL. The course contains several hands-on labs and assessments to help students apply the concepts.

Uploaded by

Hafiszan
Copyright
© All Rights Reserved

Introduction to Big Data with Spark and Hadoop


This course is part of multiple programs.

Taught in English. 19 languages available. Some content may not be translated.

Instructors: Aije Egwaikhide +2 more

38,140 already enrolled

Course
Gain insight into a topic and learn the fundamentals

4.4 (288 reviews) | 88%

Intermediate level

18 hours (approximately)

Flexible schedule
Learn at your own pace

What you'll learn

Explain the impact of big data, including use cases, tools, and processing methods.

Describe Apache Hadoop architecture, ecosystem, practices, and user-related applications, including Hive, HDFS, HBase, Spark, and MapReduce.

Apply Spark programming basics, including parallel programming basics for DataFrames, data sets, and Spark SQL.

Use Spark’s RDDs and data sets, optimize Spark SQL using Catalyst and Tungsten, and use Spark’s development and runtime environment options.

Skills you'll gain

Big Data, SparkSQL, SparkML, Apache Hadoop, Apache Spark

Details to know

Shareable certificate: Add to your LinkedIn profile

Assessments: 14 assignments

There are 7 modules in this course
This self-paced IBM course will teach you all about big data! You will become familiar with the characteristics of big data and its application in big
data analytics. You will also gain hands-on experience with big data processing tools like Apache Hadoop and Apache Spark.

Bernard Marr defines big data as the digital trace that we are generating in this digital era. You will start the course by understanding what big data is
and exploring how insights from big data can be harnessed for a variety of use cases. You’ll also explore how big data uses technologies like parallel
processing, scaling, and data parallelism.

Next, you will learn about Hadoop, an open-source framework that allows for the distributed processing of large data sets, and its ecosystem. You will
discover important applications that go hand in hand with Hadoop, like the Hadoop Distributed File System (HDFS), MapReduce, and HBase. You will become
familiar with Hive, data warehouse software that provides an SQL-like interface to efficiently query and manipulate large data sets.

You’ll then gain insights into Apache Spark, an open-source processing engine that provides users with new ways to store and use big data. In this
course, you will discover how to leverage Spark to deliver reliable insights. The course provides an overview of the platform, going into the
components that make up Apache Spark.

You’ll learn about DataFrames, perform basic DataFrame operations, and work with Spark SQL. You’ll also explore how Spark processes and monitors the
requests your application submits and how you can track work using the Spark Application UI.

This course has several hands-on labs to help you apply and practice the concepts you learn. You will complete Hadoop and Spark labs using various
tools and technologies, including Docker, Kubernetes, Python, and Jupyter Notebooks.

What Is Big Data?


Module details
Module 1 • 1 hour to complete

In this module, you’ll begin your acquisition of Big Data knowledge with the most up-to-date definition of Big Data. You’ll explore the impact of Big Data
on everyday personal tasks and business transactions with Big Data Use Cases. You’ll also learn how Big Data uses parallel processing, scaling, and data
parallelism. Going further, you’ll explore commonly used Big Data tools and explain the role of open source in Big Data. Finally, you’ll go beyond the
hype and explore additional Big Data viewpoints.
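
As a rough illustration of the data parallelism mentioned in this module, the sketch below splits a list into chunks, applies the same function to every chunk concurrently, and combines the partial results. It is a minimal sketch using only the Python standard library, not material from the course; the helper names `split_into_chunks` and `partial_sum` are made up for this example, and a thread pool stands in for the separate machines a real cluster would use.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, num_chunks):
    """Split data into at most num_chunks contiguous pieces (hypothetical helper)."""
    size = -(-len(data) // num_chunks)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_sum(chunk):
    """The same operation is applied independently to every chunk."""
    return sum(chunk)


data = list(range(1, 101))
chunks = split_into_chunks(data, 4)

# Each chunk is processed by its own worker; the results come back in chunk order.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(partial_sum, chunks))

# Combine the partial results into the final answer.
total = sum(partial_results)
print(partial_results, total)
```

Scaling out follows the same shape: more chunks and more workers, with the combining step unchanged.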

What's included

8 videos 1 reading 2 assignments 1 plugin

8 videos • Total 47 minutes

Course Introduction • 5 minutes

What is Big Data? • 7 minutes

Impact of Big Data • 5 minutes

Parallel Processing, Scaling, and Data Parallelism • 7 minutes

Big Data Tools and Ecosystem • 4 minutes

Open Source and Big Data • 6 minutes

Beyond the Hype • 4 minutes

Big Data Use Cases • 5 minutes

1 reading • Total 2 minutes

Summary and Highlights: Introduction to Big Data • 2 minutes

2 assignments • Total 41 minutes

Practice Quiz: Introduction to Big Data • 14 minutes

Graded Quiz: What Is Big Data? • 27 minutes


1 plugin • Total 12 minutes

Module 1 Glossary: What Is Big Data? • 12 minutes

Introduction to the Hadoop Ecosystem


Module details
Module 2 • 2 hours to complete

In this module, you'll gain a fundamental understanding of the Apache Hadoop architecture, ecosystem, practices, and commonly used applications,
including the Hadoop Distributed File System (HDFS), MapReduce, Hive, and HBase. You’ll also gain practical skills in hands-on labs, where you query data
using Hive, launch a single-node Hadoop cluster using Docker, and run MapReduce jobs.
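
To give a feel for the MapReduce model named above, here is a minimal word-count sketch in plain Python. It is not the course's lab code and does not use Hadoop; the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count` are assumptions chosen for this illustration, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


sample_lines = ["big data with Hadoop", "big data with Spark"]
counts = word_count(sample_lines)
print(counts)
```

In a real Hadoop job, the map and reduce steps would run on many nodes and the framework would handle the shuffle between them.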

What's included

6 videos 1 reading 2 assignments 3 app items 2 plugins

6 videos • Total 37 minutes

Introduction to Hadoop • 7 minutes

Intro to MapReduce • 5 minutes

Hadoop Ecosystem • 4 minutes

HDFS • 8 minutes

HIVE • 5 minutes

HBASE • 5 minutes

1 reading • Total 2 minutes

Summary and Highlights: Introduction to Hadoop • 2 minutes

2 assignments • Total 36 minutes

Practice Quiz: Introduction to Hadoop • 12 minutes

Graded Quiz: Introduction to Hadoop Ecosystem • 24 minutes

3 app items • Total 60 minutes

Hands-on Lab: Getting Started with Hive • 20 minutes

Hands-on Lab: Hadoop MapReduce • 20 minutes

Hands-on Lab: Hadoop Cluster (Optional) • 20 minutes

2 plugins • Total 30 minutes

Cheat Sheet: Introduction to the Hadoop Ecosystem • 15 minutes

Module 2 Glossary: Introduction to the Hadoop Ecosystem • 15 minutes

Apache Spark


Module details
Module 3 • 1 hour to complete

In this module, you’ll turn your attention to the popular Apache Spark platform, where you will explore the attributes and benefits of Apache Spark and
distributed computing. You'll gain key insights about functional programming and lambda functions. You’ll also explore Resilient Distributed Datasets
(RDDs), parallel programming, and resilience in Apache Spark, and relate RDDs and parallel programming to Apache Spark. Then, you’ll dive into
additional Apache Spark components and learn how Apache Spark scales with Big Data. Working with Big Data also calls for
queries, including structured queries using SQL, so you’ll learn about the functions, parts, and benefits of Spark SQL and DataFrame queries and
discover how DataFrames work with Spark SQL.
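
The functional programming and lambda function ideas in this module can be previewed in plain Python. The sketch below is an illustrative assumption, not course material: it chains `map`, `filter`, and `reduce` over an ordinary list. Spark's RDD API offers similarly named operations, but no Spark installation is used or required here.

```python
from functools import reduce

numbers = list(range(1, 11))

# A lambda is a small anonymous function; this one squares its argument.
square = lambda x: x * x

# map applies the function to every element.
squares = list(map(square, numbers))

# filter keeps only the elements for which the predicate returns True.
even_squares = list(filter(lambda x: x % 2 == 0, squares))

# reduce folds the remaining elements into a single value.
total = reduce(lambda acc, x: acc + x, even_squares, 0)

print(even_squares, total)
```

Because each step depends only on its input and has no side effects, the work can in principle be split across partitions of the data, which is the property that parallel frameworks rely on.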

What's included

5 videos 1 reading 2 assignments 1 app item 2 plugins

5 videos • Total 24 minutes

Why use Apache Spark? • 5 minutes

Functional Programming Basics • 5 minutes

Parallel Programming using Resilient Distributed Datasets • 5 minutes

Scale out / Data Parallelism in Apache Spark • 3 minutes

Dataframes and SparkSQL • 4 minutes

1 reading • Total 2 minutes

Summary and Highlights: Introduction to Apache Spark • 2 minutes

2 assignments • Total 31 minutes

Practice Quiz: Introduction to Apache Spark • 10 minutes

Graded Quiz: Apache Spark • 21 minutes

1 app item • Total 15 minutes

Hands-on Lab: Getting Started with Spark using Python • 15 minutes

2 plugins • Total 30 minutes

Cheat Sheet: Apache Spark • 15 minutes

Module 3 Glossary: Apache Spark • 15 minutes

DataFrames and Spark SQL


Module details
Module 4 • 2 hours to complete

In this module, you’ll learn about Resilient Distributed Datasets (RDDs), their uses in Apache Spark, and RDD transformations and actions. You'll
compare the use of datasets with Spark's latest data abstraction, DataFrames. You'll learn to identify and apply basic DataFrame operations. You’ll
explore Apache Spark SQL optimization and learn how Spark SQL and memory optimization benefit from using Catalyst and Tungsten. Finally, you’ll
fortify your skills with a guided hands-on lab to create a table view and apply data aggregation techniques.

What's included

5 videos 1 reading 2 assignments 2 app items 4 plugins

5 videos • Total 25 minutes

RDDs in Parallel Programming and Spark • 5 minutes

Data-frames and Datasets • 4 minutes

Catalyst and Tungsten • 5 minutes


ETL with DataFrames • 6 minutes

Real-world usage of SparkSQL • 4 minutes

1 reading • Total 2 minutes

Summary and Highlights: Introduction to DataFrames and Spark SQL • 2 minutes

2 assignments • Total 31 minutes

Practice Quiz: Introduction to DataFrames & Spark SQL • 10 minutes

Graded Quiz: DataFrames and Spark SQL • 21 minutes

2 app items • Total 30 minutes

Hands-on Lab: Introduction to DataFrames • 15 minutes

Hands-on Lab: Introduction to SparkSQL • 15 minutes

4 plugins • Total 60 minutes

Reading: User-Defined Schema (UDS) for DSL and SQL • 10 minutes

Reading: Common Transformations and Optimization Techniques in Spark • 20 minutes

Cheat Sheet: DataFrames and Spark SQL • 15 minutes

Module 4 Glossary: DataFrames and Spark SQL • 15 minutes

Development and Runtime Environment Options


Module details
Module 5 • 3 hours to complete

In this module, you’ll explore how Spark processes the requests that your application submits and learn how you can track work using the Spark
Application UI. Because Spark application work happens on the cluster, you need to be able to identify Spark cluster managers, their components,
and benefits. You’ll also learn how to connect with each cluster manager and how and when you might want to set up a local, standalone Spark
instance. Next, you’ll learn about Apache Spark application submission, including the use of Spark’s unified interface, “spark-submit,” and learn about
options and dependencies. You’ll also describe and apply options for submitting applications, identify external application dependency management
techniques, and list Spark Shell benefits. You’ll also look at recommended practices for Spark's static and dynamic configuration options and perform
hands-on labs to use Apache Spark on IBM Cloud and run Spark on Kubernetes.

What's included

6 videos 2 readings 3 assignments 2 app items 4 plugins

6 videos • Total 32 minutes

Apache Spark Architecture • 5 minutes

Overview of Apache Spark Cluster Modes • 6 minutes

How to Run an Apache Spark Application • 6 minutes

Using Apache Spark on IBM Cloud • 4 minutes

Setting Apache Spark Configuration • 5 minutes

Running Spark on Kubernetes • 4 minutes

2 readings • Total 4 minutes


Summary and Highlights: Spark Architecture • 2 minutes

Summary and Highlights: Spark Runtime Environments • 2 minutes

3 assignments • Total 33 minutes

Practice Quiz: Spark Architecture • 6 minutes

Practice Quiz: Spark Runtime Environments • 6 minutes

Graded Quiz: Development and Runtime Environment Options • 21 minutes

2 app items • Total 80 minutes

Hands-on Lab: Submit Apache Spark Applications • 60 minutes

Hands-on Lab: Apache Spark on Kubernetes • 20 minutes

4 plugins • Total 40 minutes

Spark Environments - Overview and Options • 5 minutes

How to Set Up Your Own Spark Environments (Optional) • 5 minutes

Cheat Sheet: Development and Runtime Environment Options • 15 minutes

Module 5 Glossary: Development and Runtime Environment Options • 15 minutes

Monitoring and Tuning


Module details
Module 6 • 2 hours to complete

Platforms and applications require monitoring and tuning to manage issues that inevitably happen. In this module, you'll learn about connecting to the
Apache Spark user interface web server and using the same UI web server to manage application processes. You’ll also identify common Apache Spark
application issues and learn about debugging issues using the application UI and locating related log files. Further, you’ll gain real-world
knowledge about how Spark manages memory and processor resources in the hands-on lab.

What's included

5 videos 1 reading 2 assignments 1 app item 3 plugins

5 videos • Total 30 minutes

The Apache Spark User Interface • 5 minutes

Monitoring Application Progress • 7 minutes

Debugging Apache Spark Application Issues • 5 minutes

Understanding Memory Resources • 5 minutes

Understanding Processor Resources • 5 minutes

1 reading • Total 2 minutes

Summary and Highlights: Introduction to Monitoring and Tuning • 2 minutes

2 assignments • Total 31 minutes

Practice Quiz: Introduction to Monitoring and Tuning • 10 minutes

Graded Quiz: Monitoring and Tuning • 21 minutes


1 app item • Total 30 minutes

Hands-on Lab: Monitoring and Performance Tuning • 30 minutes

3 plugins • Total 35 minutes

[Optional] Batch Data Ingestion Methods • 5 minutes

Cheat Sheet: Monitoring and Tuning • 15 minutes

Module 6 Glossary: Monitoring and Tuning • 15 minutes

Final Project and Assessment


Module details
Module 7 • 4 hours to complete

In this module, you’ll perform a practice lab where you’ll explore two critical aspects of data processing using Spark: working with Resilient Distributed
Datasets (RDDs) and constructing DataFrames from JSON data. You will also apply various transformations and actions on both RDDs and DataFrames
to gain insights and manipulate the data effectively. Further, you’ll apply your knowledge in a final project where you will create a DataFrame by loading
data from a CSV file and applying transformations and actions using Spark SQL. Finally, you’ll be assessed based on your learning from the course.

Instructors
Instructor ratings 4.3 (62 ratings)

Aije Egwaikhide
IBM
6 Courses • 533,410 learners

Romeo Kienzler
IBM
10 Courses • 602,961 learners

Rav Ahuja
IBM
44 Courses • 2,079,406 learners

Offered by

IBM

4.4 288 reviews

5 stars 63.66%

4 stars 19.72%

3 stars 9.34%

2 stars 2.76%

1 star 4.49%

JS
4 · Reviewed on May 2, 2022

hands on lab and quizzes at the end of each session was very
helpful

MG
5 · Reviewed on Jul 16, 2023

Course was full of information and details for a beginner In big data
technology

RS
5 · Reviewed on May 8, 2022

Fantastic blend of theory and practical (labs). The labs are short and have concise
material.

© 2023 Coursera Inc. All rights reserved.