RDDs vs DataFrames and Datasets
Of all the developers' delights, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. One of Apache Spark's appeals to developers has been its easy-to-use APIs for operating on large datasets across languages: Scala, Java, Python, and R.
In this blog, I explore three sets of APIs available in Apache Spark 2.2 and beyond: RDDs, DataFrames, and Datasets. I discuss why and when you should use each set, outline their performance and optimization benefits, and enumerate scenarios in which to use DataFrames and Datasets instead of RDDs. Mostly, I will focus on DataFrames and Datasets, because in Apache Spark 2.0, these two APIs were unified.
Our primary motivation behind this unification is our quest to simplify Spark by limiting the
number of concepts that you have to learn and by offering ways to process structured data. And
through structure, Spark can offer higher-level abstraction and APIs as domain-specific language
constructs.
DataFrames
Like an RDD, a DataFrame is an immutable distributed collection of data. Unlike an RDD, the data is organized into named columns, like a table in a relational database. Designed to make processing large datasets even easier, DataFrames allow developers to impose a structure onto a distributed collection of data, allowing higher-level abstraction; they provide a domain-specific language API to manipulate your distributed data; and they make Spark accessible to a wider audience, beyond specialized data engineers.
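As a brief illustration of that domain-specific language, here is a minimal sketch of a DataFrame query in Scala; the input file and column names (device_name, temp) are assumptions for illustration, not part of any fixed schema:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder().appName("DataFrameExample").getOrCreate()
import spark.implicits._

// Read JSON into a DataFrame of named columns (hypothetical input file)
val df = spark.read.json("devices.json")

// Select, filter, group, and aggregate using the DataFrame DSL,
// which reads close to a SQL query
df.select($"device_name", $"temp")
  .where($"temp" > 25)
  .groupBy($"device_name")
  .agg(avg($"temp"))
  .show()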
In our preview of the Apache Spark 2.0 webinar and subsequent blog, we mentioned that in Spark 2.0 the DataFrame API would merge with the Dataset API, unifying data processing capabilities across libraries. Because of this unification, developers now have fewer concepts to learn or remember, and can work with a single high-level, type-safe API called Dataset.
Datasets
Starting in Spark 2.0, Dataset takes on two distinct API characteristics: a strongly-typed API and an untyped API, as shown in the table below. Conceptually, consider a DataFrame as an alias for a collection of generic objects, Dataset[Row], where a Row is a generic untyped JVM object. A Dataset, by contrast, is a collection of strongly-typed JVM objects, dictated by a case class you define in Scala or a class in Java.
Typed and Untyped APIs

Language    Main Abstraction
Scala       Dataset[T] & DataFrame (alias for Dataset[Row])
Java        Dataset[T]
Python*     DataFrame
R*          DataFrame

*Note: Since Python and R have no compile-time type-safety, we only have untyped APIs, namely DataFrames.
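Before the aggregation example below, here is a minimal sketch of how a typed Dataset might be created in Scala. The DeviceIoTData case class, its fields, and the JSON file name are assumptions for illustration only:

import org.apache.spark.sql.SparkSession

// Hypothetical schema for IoT device readings (fields chosen for illustration)
case class DeviceIoTData(device_id: Long, cca3: String, temp: Double, humidity: Double)

val spark = SparkSession.builder().appName("DatasetExample").getOrCreate()
import spark.implicits._

// Reading JSON produces an untyped DataFrame (Dataset[Row]);
// as[DeviceIoTData] converts it into a strongly-typed Dataset
val ds = spark.read.json("iot_devices.json").as[DeviceIoTData]

With such a Dataset in hand, the snippet below filters and aggregates device readings using both typed lambdas and untyped column expressions: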
// Filter temps > 25, project (temp, humidity, cca3), group by country code (_3), and average
val dsAvgTmp = ds.filter(d => d.temp > 25)
  .map(d => (d.temp, d.humidity, d.cca3))
  .groupBy($"_3").avg()
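Note that filter() and map() here operate on JVM objects with compile-time type checks, whereas groupBy($"_3") and avg() use untyped column expressions: the mapped tuple's fields receive the default column names _1, _2, and _3, and the aggregated result comes back as an untyped DataFrame.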