B2. Introduction To Big Data With Spark and Hadoop - Coursera
Course
Gain insight into a topic and learn the fundamentals
Intermediate level
Recommended experience
18 hours (approximately)
Flexible schedule
Learn at your own pace
Explain the impact of big data, including use cases, tools, and processing methods.
Describe Apache Hadoop architecture, ecosystem, practices, and user-related applications, including Hive, HDFS, HBase, Spark, and MapReduce.
Apply Spark programming basics, including parallel programming basics for DataFrames, data sets, and Spark SQL.
Use Spark’s RDDs and data sets, optimize Spark SQL using Catalyst and Tungsten, and use Spark’s development and runtime environment options.
Bernard Marr defines big data as the digital trace that we are generating in this digital era. You will start the course by understanding what big data is
and exploring how insights from big data can be harnessed for a variety of use cases. You’ll also explore how big data uses technologies like parallel
processing, scaling, and data parallelism.
Next, you will learn about Hadoop, an open-source framework for the distributed processing of large data sets, and about its ecosystem. You will
discover important applications that go hand in hand with Hadoop, like the Hadoop Distributed File System (HDFS), MapReduce, and HBase. You will become
familiar with Hive, data warehouse software that provides an SQL-like interface to efficiently query and manipulate large data sets.
You’ll then gain insights into Apache Spark, an open-source processing engine that provides users with new ways to store and use big data. In this
course, you will discover how to leverage Spark to deliver reliable insights. The course provides an overview of the platform, going into the
components that make up Apache Spark.
You’ll learn about DataFrames, perform basic DataFrame operations, and work with Spark SQL. You’ll also explore how Spark processes and monitors the
requests your application submits and how you can track work using the Spark Application UI.
This course has several hands-on labs to help you apply and practice the concepts you learn. You will complete Hadoop and Spark labs using various
tools and technologies, including Docker, Kubernetes, Python, and Jupyter Notebooks.
In this module, you’ll begin your acquisition of Big Data knowledge with the most up-to-date definition of Big Data. You’ll explore the impact of Big Data
on everyday personal tasks and business transactions with Big Data Use Cases. You’ll also learn how Big Data uses parallel processing, scaling, and data
parallelism. Going further, you’ll explore commonly used Big Data tools and explain the role of open-source in Big Data. Finally, you’ll go beyond the
hype and explore additional Big Data viewpoints.
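As a rough illustration of the data parallelism mentioned above, the sketch below splits a data set into chunks, applies the same operation to each chunk on a separate worker, and combines the partial results. It uses only the Python standard library; the function names and the choice of four workers are assumptions made for this example, and a thread pool is used here only to show the pattern, not to demonstrate a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, num_chunks):
    """Split a list into at most num_chunks contiguous, roughly equal parts."""
    if not data:
        return []
    size = -(-len(data) // num_chunks)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_of_squares(chunk):
    """The same operation is applied independently to every chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, num_workers=4):
    chunks = split_into_chunks(data, num_workers)
    # Each worker processes its own chunk; partial results are combined at the end.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)


numbers = list(range(1, 101))
squares_total = parallel_sum_of_squares(numbers)
print(squares_total)
```

Scaling out in a real cluster follows the same idea, except that the chunks live on different machines rather than in one process.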
In this module, you'll gain a fundamental understanding of the Apache Hadoop architecture, ecosystem, practices, and commonly used applications,
including the Hadoop Distributed File System (HDFS), MapReduce, Hive, and HBase. You’ll also gain practical skills in hands-on labs, where you'll query
data using Hive, launch a single-node Hadoop cluster using Docker, and run MapReduce jobs.
What's included
HDFS • 8 minutes
HIVE • 5 minutes
HBASE • 5 minutes
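The MapReduce model covered in this module can be sketched in plain Python as a word count. This is a single-process illustration of the map, shuffle, and reduce phases, not Hadoop's actual API; the function names and the sample lines are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


lines = ["big data with Spark", "big data with Hadoop", "Spark and Hadoop"]
mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_phase(mapped)
word_counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(word_counts)
```

In a real cluster the map calls run on the nodes that hold each block of input, and the shuffle moves data across the network so that every key ends up at a single reducer.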
Apache Spark
Module details
Module 3 • 1 hour to complete
In this module, you’ll turn your attention to the popular Apache Spark platform, where you will explore the attributes and benefits of Apache Spark and
distributed computing. You'll gain key insights about functional programming and lambda functions. You’ll also explore Resilient Distributed Datasets
(RDDs), parallel programming, and resilience in Apache Spark, and see how RDDs and parallel programming relate to each other. Then, you’ll dive into
additional Apache Spark components and learn how Apache Spark scales with Big Data. Working with Big Data also means working with
queries, including structured queries using SQL. You’ll learn about the functions, parts, and benefits of Spark SQL and DataFrame queries, and
discover how DataFrames work with Spark SQL.
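The functional style and lambda functions mentioned above can be illustrated with Python's built-in `filter` and `map` and with `functools.reduce`. This is a minimal sketch with made-up sample values; Spark's RDD API offers operations with similar names, but nothing below calls Spark.

```python
from functools import reduce

readings = [3, 8, 15, 4, 23, 42]  # hypothetical sample values

# filter keeps the elements for which the lambda returns True.
large = list(filter(lambda x: x > 5, readings))

# map applies the lambda to every element and leaves the input unchanged.
doubled = list(map(lambda x: x * 2, large))

# reduce folds the elements into a single value.
doubled_sum = reduce(lambda a, b: a + b, doubled)

print(large, doubled, doubled_sum)
```

Because each step returns a new value instead of modifying its input, the steps can be applied to separate partitions of the data independently, which is what makes this style a good fit for parallel processing.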
In this module, you’ll learn about Resilient Distributed Datasets (RDDs), their uses in Apache Spark, and RDD transformations and actions. You'll
compare the use of datasets with Spark's latest data abstraction, DataFrames. You'll learn to identify and apply basic DataFrame operations. You’ll
explore Apache Spark SQL optimization and learn how Spark SQL and memory optimization benefit from using Catalyst and Tungsten. Finally, you’ll
fortify your skills with a guided hands-on lab to create a table view and apply data aggregation techniques.
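The difference between RDD transformations and actions can be sketched with a toy class in plain Python: transformations only record what should happen, and an action runs the recorded steps. `LazyDataset` is a hypothetical stand-in written for this example and is not part of Spark; it ignores partitioning, caching, and fault tolerance.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: return a new dataset and do no work yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: execute the recorded pipeline and return a result.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # records every time square() actually runs


def square(x):
    calls.append(x)
    return x * x


base = LazyDataset([1, 2, 3, 4, 5])
pipeline = base.map(square).filter(lambda x: x % 2 == 1)
calls_before_action = len(calls)  # nothing has been computed yet
odd_squares = pipeline.collect()  # the action triggers the work
print(calls_before_action, odd_squares)
```

Deferring the work until an action is requested is what gives an engine the chance to look at the whole pipeline before running it.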
In this module, you’ll explore how Spark processes the requests that your application submits and learn how you can track work using the Spark
Application UI. Because Spark application work happens on the cluster, you need to be able to identify Apache Spark cluster managers, their components,
and benefits. You’ll also know how to connect with each cluster manager and how and when you might want to set up a local, standalone Spark
instance. Next, you’ll learn about Apache Spark application submission, including the use of Spark’s unified interface, “spark-submit,” and learn about
options and dependencies. You’ll also describe and apply options for submitting applications, identify external application dependency management
techniques, and list Spark Shell benefits. You’ll also look at recommended practices for Spark's static and dynamic configuration options and perform
hands-on labs to use Apache Spark on IBM Cloud and run Spark on Kubernetes.
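To make the submission options described above more concrete, the sketch below assembles a `spark-submit` argument list in Python without running it. The flags `--master`, `--conf`, and `--py-files` are standard `spark-submit` options; the helper name `build_spark_submit`, the file names, and the chosen configuration value are assumptions made for this example.

```python
import shlex


def build_spark_submit(app_path, master, conf=None, py_files=None, app_args=()):
    """Assemble a spark-submit argument list (nothing is executed here)."""
    command = ["spark-submit", "--master", master]
    # Each configuration property is passed as its own --conf key=value pair.
    for key, value in sorted((conf or {}).items()):
        command += ["--conf", f"{key}={value}"]
    # Extra Python dependencies are passed as one comma-separated list.
    if py_files:
        command += ["--py-files", ",".join(py_files)]
    command.append(app_path)
    command += list(app_args)
    return command


local_cmd = build_spark_submit(
    "app.py",                                   # hypothetical application file
    master="local[2]",                          # run locally with two worker threads
    conf={"spark.sql.shuffle.partitions": "8"},
    py_files=["helpers.zip"],                   # hypothetical dependency archive
    app_args=["input.csv"],                     # hypothetical application argument
)
print(shlex.join(local_cmd))
```

Changing only the `master` value, for example to a `spark://` or `k8s://` URL, is what points the same application at a different cluster manager.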
Platforms and applications require monitoring and tuning to manage issues that inevitably happen. In this module, you'll learn about connecting to the
Apache Spark user interface web server and using the same UI web server to manage application processes. You’ll also identify common Apache Spark
application issues and learn about debugging issues using the application UI and locating related log files. Further, you’ll discover and gain real-world
knowledge about how Spark manages memory and processor resources in a hands-on lab.
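One piece of arithmetic that helps when reasoning about processor resources is how many tasks can run at once. The sketch below assumes the default of one CPU core per task; the helper name `task_waves` and the cluster sizes are hypothetical values chosen for illustration, not figures from the course.

```python
import math


def task_waves(num_partitions, num_executors, cores_per_executor, cpus_per_task=1):
    """Return (parallel task slots, number of waves needed to run all tasks)."""
    slots = num_executors * (cores_per_executor // cpus_per_task)
    return slots, math.ceil(num_partitions / slots)


# Hypothetical stage: 200 partitions on 4 executors with 5 cores each.
slots, waves = task_waves(num_partitions=200, num_executors=4, cores_per_executor=5)
print(slots, waves)
```

If a stage has far fewer partitions than slots, some cores sit idle; if it has far more, the stage runs in many waves, which is one of the patterns to look for in the application UI.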
In this module, you’ll perform a practice lab where you’ll explore two critical aspects of data processing using Spark: working with Resilient Distributed
Datasets (RDDs) and constructing DataFrames from JSON data. You will also apply various transformations and actions on both RDDs and DataFrames
to gain insights and manipulate the data effectively. Further, you’ll apply your knowledge in a final project where you will create a DataFrame by loading
data from a CSV file and applying transformations and actions using Spark SQL. Finally, you’ll be assessed based on your learning from the course.
Instructors
Instructor ratings 4.3 (62 ratings)
Aije Egwaikhide
IBM
6 Courses • 533,410 learners
Romeo Kienzler
IBM
10 Courses • 602,961 learners
Rav Ahuja
IBM
44 Courses • 2,079,406 learners
Offered by
IBM
5 stars 63.66%
4 stars 19.72%
3 stars 9.34%
2 stars 2.76%
1 star 4.49%
JS
4 · Reviewed on May 2, 2022
hands on lab and quizzes at the end of each session was very
helpful
MG
5 · Reviewed on Jul 16, 2023
Course was full of information and details for a beginner In big data
technology
RS
5 · Reviewed on May 8, 2022
Fantastic blend of theory and practical (labs). The labs are short and have concise
material.