Cse3002 Big Data m3 Detailed
Spark started in 2009 as a research project in the UC Berkeley RAD Lab, later
to become the AMPLab. The researchers in the lab had previously been working
on Hadoop MapReduce, and observed that MapReduce was inefficient for
iterative and interactive computing jobs.
Thus, from the beginning, Spark was designed to be fast for interactive queries
and iterative algorithms, bringing in ideas like support for in-memory storage
and efficient fault recovery.
Spark Versions and Releases
• Since its creation, Spark has been a very active project and
community, with the number of contributors growing with each
release.
• Spark 1.0 had over 100 individual contributors. Though the level of
activity has rapidly grown, the community continues to release
updated versions of Spark on a regular schedule.
• (Figure: Spark versions and release history)
Storage Layers for Spark
• Spark can create distributed datasets from any file stored in the
Hadoop distributed filesystem (HDFS) or other storage systems
supported by the Hadoop APIs (including your local filesystem,
Amazon S3, Cassandra, Hive, HBase, etc.).
• It’s important to remember that Spark does not require Hadoop; it
simply has support for storage systems implementing the Hadoop
APIs. Spark supports text files, Sequence Files, Avro, Parquet, and any
other Hadoop InputFormat.
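• A minimal sketch of loading data from these storage layers in PySpark; the paths, hostname, and bucket name below are hypothetical, and only the URI scheme changes between systems:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("StorageLayersDemo")
    sc = SparkContext(conf=conf)

    # Local filesystem (hypothetical path)
    local_rdd = sc.textFile("file:///home/user/data/README.md")

    # HDFS (hypothetical namenode host/port and path)
    hdfs_rdd = sc.textFile("hdfs://namenode:9000/data/logs.txt")

    # Amazon S3 (hypothetical bucket; needs the Hadoop S3 connector on the classpath)
    s3_rdd = sc.textFile("s3a://my-bucket/data/events.txt")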
Programming with RDDs
• Transformations construct a new RDD from an existing one, while actions compute a result based on an RDD and either return it to the driver program or save it to an external storage system (e.g., HDFS).
• One example of an action is first(), which returns the first element in an RDD; it is demonstrated in Example 3-3 (a similar sketch appears under “RDD Basics” below).
RDD Basics
• Transformations and actions are different because of the way Spark computes
RDDs.
• Although you can define new RDDs any time, Spark computes them only in a lazy
fashion — that is, the first time they are used in an action.
• This approach makes a lot of sense when you are working with Big Data.
• For instance, in Example 3-2 and Example 3-3, we defined a text file and then
filtered the lines that include Python.
• If Spark were to load and store all the lines in the file as soon as we wrote lines =
sc.textFile(…), it would waste a lot of storage space, given that we then
immediately filter out many lines.
• Instead, once Spark sees the whole chain of transformations, it can compute just
the data needed for its result.
• In fact, for the first() action, Spark scans the file only until it finds the first matching line; it doesn’t even read the whole file.
• Spark’s RDDs are by default recomputed each time you run an action on them.
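• A minimal sketch along the lines of Examples 3-2 and 3-3, which are referenced above but not reproduced on these slides (assuming sc from the pyspark shell or the setup in the earlier sketch):

    lines = sc.textFile("README.md")                            # transformation: nothing is read yet
    pythonLines = lines.filter(lambda line: "Python" in line)   # transformation: still lazy
    print(pythonLines.first())                                  # action: Spark scans the file only until
                                                                # it finds the first matching line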
RDD Basics
• If you would like to reuse an RDD in multiple actions, you can ask Spark to persist it
using RDD.persist().
• In practice, you will often use persist() to load a subset of your data into memory and query it repeatedly.
• For example, if we knew that we wanted to compute multiple results about the
README lines that contain Python, we could write the script as shown below.
• The behavior of not persisting by default makes a lot of sense for big datasets: if you will not
reuse the RDD, there’s no reason to waste storage space when Spark could instead stream
through the data once and just compute the result.
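• A minimal sketch of the script referred to above, assuming the pythonLines RDD from the previous sketch:

    pythonLines.persist()        # ask Spark to keep this RDD around after it is first computed
    print(pythonLines.count())   # first action: computes the RDD and caches it in memory
    print(pythonLines.first())   # second action: reuses the cached data instead of re-reading the file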
RDD Basics
To summarize, every Spark program and shell session will work as follows:
1. Create some input RDDs from external data.
2. Transform them to define new RDDs using transformations like filter().
3. Ask Spark to persist() any intermediate RDDs that will need to be reused.
4. Launch actions such as count() and first() to kick off a parallel
computation, which is then optimized and executed by Spark.
Creating RDDs
• Spark provides two ways to create RDDs: loading an external dataset and parallelizing
a collection in your driver program.
• The simplest way to create RDDs is to take an existing collection in your program and
pass it to SparkContext’s parallelize() method, as shown in Examples 3-5 through 3-7.
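• A minimal sketch of parallelize(); the collection here is made up for illustration (assuming sc from the pyspark shell):

    # Create an RDD from an in-memory collection in the driver program.
    lines = sc.parallelize(["pandas", "i like pandas"])
    print(lines.count())   # 2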
RDD Operations
• As you derive new RDDs from each other using transformations, Spark keeps
track of the set of dependencies between different RDDs, called the lineage
graph.
• It uses this information to compute each RDD on demand and to recover lost data if part of a persistent RDD is lost.
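• As a rough illustration (the log file is hypothetical), building several RDDs from one input creates a lineage graph, and PySpark’s toDebugString() prints the lineage Spark has recorded; the exact output format varies by Spark version:

    inputRDD = sc.textFile("log.txt")                              # hypothetical input file
    errorsRDD = inputRDD.filter(lambda line: "error" in line)      # depends on inputRDD
    warningsRDD = inputRDD.filter(lambda line: "warning" in line)  # also depends on inputRDD
    badLinesRDD = errorsRDD.union(warningsRDD)                     # depends on both

    # Spark uses these recorded dependencies to compute badLinesRDD on demand
    # and to recompute lost partitions if needed.
    print(badLinesRDD.toDebugString())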
Actions
• Actions are the second type of RDD operation.
• They are the operations that return a final value to the driver program or write data
to an external storage system.
• Actions force the evaluation of the transformations required for the RDD they were
called on, since they need to actually produce output.
Actions
• In this example (sketched after this list), we use take() to retrieve a small number of elements in the RDD at the driver program.
• We then iterate over them locally to print out information at the driver.
• RDDs also have a collect() function to retrieve the entire RDD.
• This can be useful if your program filters an RDD down to a very small size and you’d like to deal with it locally.
• The entire dataset must fit in memory on a single machine to use collect() on it, so collect() shouldn’t be used on large datasets.
• In most cases RDDs can’t just be collect()ed to the driver because they are too large.
• In these cases, it’s common to write data out to a distributed storage system such as
HDFS or Amazon S3.
• You can save the contents of an RDD using the saveAsTextFile() action,
saveAsSequenceFile(), or any of a number of actions for various built-in formats.
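• The example the bullets above refer to is not reproduced on these slides; a minimal sketch, with a hypothetical input file and output path (assuming sc from the pyspark shell):

    badLinesRDD = sc.textFile("log.txt").filter(lambda line: "error" in line)

    # take(): bring a small sample back to the driver and inspect it locally.
    for line in badLinesRDD.take(10):
        print(line)

    # collect() would return the entire RDD to the driver; only safe for small results.
    # allBadLines = badLinesRDD.collect()

    # For large results, write to distributed storage instead.
    badLinesRDD.saveAsTextFile("hdfs://namenode:9000/output/bad-lines")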
Passing Functions to Spark
• If the function you pass is a member of an object, or references fields of an object, Spark sends the entire containing object to the worker nodes, which can be much larger than the piece of information you actually need. Instead, just extract the fields you need from your object into a local variable and pass that in (see the sketch below).
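• A minimal sketch of this pattern; the class and field names are made up for illustration:

    class WordFunctions(object):
        def __init__(self, query):
            self.query = query

        def get_matches_whole_object(self, rdd):
            # Problematic: the lambda references self, so Spark serializes the entire object.
            return rdd.filter(lambda line: self.query in line)

        def get_matches_local_copy(self, rdd):
            # Better: copy just the field we need into a local variable and pass that in.
            query = self.query
            return rdd.filter(lambda line: query in line)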
Common Transformations and Actions
Basic RDDs
We will begin by describing what transformations and actions we can perform on all
RDDs regardless of the data.
Element-wise transformations
• The two most common transformations you will likely be using are map() and filter().
• The map() transformation takes in a function and applies it to each element in the
RDD with the result of the function being the new value of each element in the
resulting RDD.
• The filter() transformation takes in a function and returns an RDD that has only the elements that pass the filter() function.
• We can use map() to do any number of things, from fetching the website associated
with each URL in our collection to just squaring the numbers.
• It is useful to note that map()’s return type does not have to be the same as its input type, so if we had an RDD of strings and our map() function were to parse the strings and return a Double, our input RDD type would be RDD[String] and the resulting RDD type would be RDD[Double].
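• A minimal sketch of map() and filter(), squaring numbers as mentioned above (assuming sc from the pyspark shell; the input collections are made up):

    nums = sc.parallelize([1, 2, 3, 4])

    # map(): apply a function to every element; its result becomes the new element.
    squared = nums.map(lambda x: x * x)
    print(squared.collect())   # [1, 4, 9, 16]

    # filter(): keep only the elements for which the function returns True.
    evens = nums.filter(lambda x: x % 2 == 0)
    print(evens.collect())     # [2, 4]

    # map() can change the element type, e.g. an RDD of strings to an RDD of floats.
    doubles = sc.parallelize(["1.5", "2.5"]).map(lambda s: float(s))
    print(doubles.collect())   # [1.5, 2.5]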
Common Transformations and Actions
• Sometimes we want to produce multiple output elements for each input element.
• The operation to do this is called flatMap(). As with map(), the function we provide to
flatMap() is called individually for each element in our input RDD.
• Instead of returning a single element, we return an iterator with our return values.
• Rather than producing an RDD of iterators, we get back an RDD that consists of the
elements from all of the iterators. A simple usage of flatMap() is splitting up an input
string into words, as shown in Examples 3-29 through 3-31.
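• A minimal word-splitting sketch in the spirit of Examples 3-29 through 3-31, which are referenced above but not reproduced here (assuming sc from the pyspark shell):

    lines = sc.parallelize(["hello world", "hi"])
    words = lines.flatMap(lambda line: line.split(" "))
    print(words.first())   # "hello"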
Common Transformations and Actions
• We illustrate the difference between flatMap() and map() in Figure 3-3.
• You can think of flatMap() as “flattening” the iterators returned to it, so that instead of ending up with an RDD of lists we have an RDD of the elements in those lists.
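• As a rough illustration of the difference Figure 3-3 depicts (the figure itself is not reproduced here):

    lines = sc.parallelize(["hello world", "hi"])

    print(lines.map(lambda line: line.split(" ")).collect())
    # [['hello', 'world'], ['hi']]  -> an RDD of lists

    print(lines.flatMap(lambda line: line.split(" ")).collect())
    # ['hello', 'world', 'hi']      -> the lists are "flattened" into their elements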