Fast Data Processing With Spark - Second Edition - Sample Chapter
Spark is a framework used for writing fast, distributed
programs. Spark solves similar problems as Hadoop
MapReduce does, but with a fast in-memory approach
and a clean functional style API. With its ability to integrate
with Hadoop and built-in tools for interactive query analysis
(Spark SQL), large-scale graph processing and analysis
(GraphX), and real-time analysis (Spark Streaming), it can be
used interactively to quickly process and query big datasets.
Krishna Sankar and Holden Karau
Chapter 9, Machine Learning Using Spark MLlib, talks about regression, classification,
clustering, and recommendation. This is probably the largest chapter in this book. If you
are stranded on a remote island and could take only one chapter with you, this should be
the one!
Chapter 10, Testing, talks about the importance of testing distributed applications.
Chapter 11, Tips and Tricks, distills some of the things we have seen along the way. Our hope
is that, as you become more and more adept at Spark programming, you will add to this list and
send us your own gems for us to include in the next version of this book!
RDDs
Spark RDDs can be created from any supported Hadoop source. Native collections in
Scala, Java, and Python can also serve as the basis for an RDD. Creating RDDs from a
native collection is especially useful for testing.
Before jumping into the details of the supported data sources, take some time
to learn what RDDs are and what they are not. It is crucial to understand that
defining an RDD does not actually load or compute any data; it only sets up
the pipeline that will produce it. (RDDs follow the principle of lazy evaluation:
an expression is evaluated only when it is needed, that is, when an action is
called.) This means that when you go to access the data in an RDD, the operation
could fail at that point. The computation that produces an RDD's data is
performed only when the data is referenced, for example, by caching
or writing out the RDD. This also means that you can chain a large number of
operations and not have to worry about excessive blocking in a computational
thread. It's important to note during application development that you can write
code, compile it, and even run your job; unless you materialize the RDD, your code
may not have even tried to load the original data.
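For example, here is a minimal sketch in the Scala shell (the values and the println are made up for illustration); the map transformation is only recorded, and nothing executes until the count() action is called:
Scala:
val lazyRDD = sc.parallelize(List(1, 2, 3)).map { x =>
  println("processing " + x)   // not printed when the RDD is defined
  x * 2
}
lazyRDD.count()                // only now does the map actually run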
The simplest way to create an RDD is to parallelize a native collection.
Scala:
val dataRDD = sc.parallelize(List(1,2,4))
dataRDD.take(3)
Java:
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;
The program (compiled and run as Test01.java) parallelizes the same small list and then calls count() and take(3); the log output shows both jobs and their results:
14/11/22 13:37:46 INFO SparkContext: Job finished: count at Test01.java:13, took 0.153231 s
3
[..]
14/11/22 13:37:46 INFO SparkContext: Job finished: take at Test01.java:14, took 0.010193 s
[1, 2, 4]
The reason we show a full program for Java is that you can work interactively in the Scala
and Python shells, but Java code has to be compiled and run. I use Eclipse and add the JAR
file /usr/local/spark-1.1.1/assembly/target/scala-2.10/spark-assembly-1.1.1-hadoop2.4.0.jar
to the Java build path.
Python:
rdd = sc.parallelize([1,2,3])
rdd.take(3)
The simplest method for loading external data is loading text from a file. This requires
the file to be available on all the nodes in the cluster, which isn't much of a problem
in local mode. In distributed mode, you will want to use Spark's addFile functionality
to copy the file to all of the machines in your cluster. Assuming your SparkContext
object is called sc, we could load text data from a file as follows (you need to create
the file first):
Scala:
import org.apache.spark.SparkFiles;
...
sc.addFile("spam.data")
val inFile = sc.textFile(SparkFiles.get("spam.data"))
inFile.first()
Java:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;
import org.apache.spark.SparkFiles;

public class LDSV02 {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("Chapter 05").setMaster("local");
    JavaSparkContext ctx = new JavaSparkContext(conf);
    System.out.println("Running Spark Version : " + ctx.version());
    ctx.addFile("/Users/ksankar/fpds-vii/data/spam.data");
    // The remainder mirrors the Scala and Python versions: read the
    // distributed copy of the file and print its first line.
    JavaRDD<String> inFile = ctx.textFile(SparkFiles.get("spam.data"));
    System.out.println(inFile.first());
  }
}
Python:
from pyspark.files import SparkFiles
sc.addFile("spam.data")
in_file = sc.textFile(SparkFiles.get("spam.data"))
in_file.take(1)
The resulting RDD is of the string type, with each line being a unique element in the
RDD. take(1) is an action that picks the first element from the RDD.
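A quick check in the Scala shell (reusing the inFile RDD from the Scala example above) shows the difference between first() and take(n):
Scala:
inFile.first()   // the first line of spam.data, as a single String
inFile.take(3)   // an Array[String] holding the first three lines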
Frequently, your input files will be CSV or TSV files, which you will want to read,
parse, and then turn into RDDs for processing. There are two ways of reading CSV
files: parsing them with our own functions, or using a CSV library such as opencsv.
Let's first look at parsing using our own functions:
Scala:
val inFile = sc.textFile("Line_of_numbers.csv")
val numbersRDD = inFile.map(line => line.split(','))
scala> numbersRDD.take(10)
[..]
14/11/22 12:13:11 INFO SparkContext: Job finished: take at <console>:18, took 0.010062 s
res7: Array[Array[String]] = Array(Array(42, 42, 55, 61, 53, 49, 43, 47, 49, 60, 68, 54, 34, 35, 35, 39))
This gives us an array of arrays of String, but we need numeric (Double) values:
scala> val numbersRDD = inFile.map(line => line.split(',')).map(_.toDouble)
<console>:15: error: value toDouble is not a member of Array[String]
       val numbersRDD = inFile.map(line => line.split(',')).map(_.toDouble)
This does not work because map(_.toDouble) is applied to an Array[String], not to the individual strings. This is where flatMap comes in handy:
scala> val numbersRDD = inFile.flatMap(line => line.split(',')).map(_.toDouble)
numbersRDD: org.apache.spark.rdd.RDD[Double] = MappedRDD[10] at map at <console>:15
scala> numbersRDD.collect()
[..]
res10: Array[Double] = Array(42.0, 42.0, 55.0, 61.0, 53.0, 49.0, 43.0, 47.0, 49.0, 60.0, 68.0, 54.0, 34.0, 35.0, 35.0, 39.0)
scala> numbersRDD.sum()
[..]
14/11/22 12:19:15 INFO SparkContext: Job finished: sum at <console>:18, took 0.013293 s
res9: Double = 766.0
scala>
Python:
inp_file = sc.textFile("Line_of_numbers.csv")
numbers_rdd = inp_file.map(lambda line: line.split(','))
>>> numbers_rdd.take(10)
# As in Scala, flatMap gives us the individual values, which we can
# convert to float and sum
numbers_flat = inp_file.flatMap(lambda line: line.split(',')).map(lambda x: float(x))
numbers_flat.sum()
Java:
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.DoubleFunction;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.SparkFiles;
// "lines" is the JavaRDD<String> read from Line_of_numbers.csv
JavaRDD<String[]> numbersStrRDD = lines.map(new Function<String, String[]>() {
  public String[] call(String line) { return line.split(","); }
});
List<String[]> val = numbersStrRDD.take(1);
for (String[] e : val) {
  for (String s : e) {
    System.out.print(s + " ");
  }
  System.out.println();
}
// flatMap flattens the arrays of fields into a single RDD of strings
JavaRDD<String> strFlatRDD = lines.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String line) { return Arrays.asList(line.split(",")); }
});
List<String> val1 = strFlatRDD.collect();
for (String s : val1) {
  System.out.print(s + " ");
}
System.out.println();
// parse each string into an Integer
JavaRDD<Integer> numbersRDD = strFlatRDD.map(new Function<String, Integer>() {
  public Integer call(String s) { return Integer.parseInt(s); }
});
List<Integer> val2 = numbersRDD.collect();
for (Integer s : val2) {
  System.out.print(s + " ");
}
System.out.println();
// sum the values with reduce
Integer sum = numbersRDD.reduce(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});
System.out.println("Sum = " + sum);
  }
}
This also illustrates one of the ways of getting data out of Spark: you can transform
an RDD into a standard local collection (a Scala array) using the collect() function.
The collect() function is especially useful for testing, in much the same way that the
parallelize() function is: parallelize() partitions the input data and turns it into an
RDD, while collect() gathers the results of the job's execution. Note that collect()
only works if your data fits in the memory of the single host where your driver code
runs, and even then it creates a bottleneck because everything has to come back to
one machine.
The collect() function brings all the data to the machine
that runs the code. So beware of accidentally doing
collect() on a large RDD!
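A minimal sketch of the safer pattern (the RDD here is made up for illustration): prefer take() for inspecting a few elements and reserve collect() for data you know is small:
Scala:
val bigRDD = sc.parallelize(1 to 1000000)
bigRDD.take(5)      // brings back only five elements to the driver
bigRDD.collect()    // brings back all one million elements; fine here, risky at scale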
The split() and toDouble() approach doesn't always work out so well for more
complex CSV files. opencsv is a versatile CSV library for Java and Scala; for Python,
the built-in csv module does the trick. Let's use the opencsv library to parse the
CSV file in Scala.
Scala:
import au.com.bytecode.opencsv.CSVReader
import java.io.StringReader
sc.addFile("Line_of_numbers.csv")
val inFile = sc.textFile("Line_of_numbers.csv")
val splitLines = inFile.map(line => {
  val reader = new CSVReader(new StringReader(line))
  reader.readNext()
})
val numericData = splitLines.map(line => line.map(_.toDouble))
val summedData = numericData.map(row => row.sum)
println(summedData.collect().mkString(","))
[..]
14/11/22 12:37:43 INFO TaskSchedulerImpl: Removed TaskSet 13.0, whose tasks have all completed, from pool
14/11/22 12:37:43 INFO SparkContext: Job finished: collect at <console>:28, took 0.0234 s
766.0
While loading text files into Spark is certainly easy, text files on a local disk are often
not the most convenient format for storing large chunks of data. Spark supports
loading from all of the different Hadoop formats (sequence files, regular text files,
and so on) and from all of the supported Hadoop storage sources (HDFS, S3, HBase,
and so on). You can also load your CSV data into HBase using one of its bulk loading
tools (such as ImportTsv) and then read it from there.
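For example, the same textFile call works across storage systems, with the URI scheme selecting the source (the paths below are hypothetical):
Scala:
val localLines = sc.textFile("/tmp/spam.data")                       // local file system
val hdfsLines = sc.textFile("hdfs://namenode:8020/data/spam.data")   // HDFS
val s3Lines = sc.textFile("s3n://my-bucket/data/spam.data")          // S3, with credentials configured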
Sequence files are binary flat files consisting of key value pairs; they are one of the
common ways of storing data for use with Hadoop. Loading a sequence file into
Spark is similar to loading a text file, but you also need to let it know about the types
of the keys and values. The types must either be subclasses of Hadoop's Writable
class or be implicitly convertible to such a type. For Scala users, some native types
are convertible through the implicits in WritableConverter. As of version 1.1.0, the
standard WritableConverter types are int, long, double, float, boolean, byte arrays,
and string. Let's illustrate by looking at the process of loading a sequence file of
String to Integer, as shown here:
Scala:
val data = sc.sequenceFile[String, Int](inputFile)
Java:
JavaPairRDD<Text, IntWritable> dataRDD = sc.sequenceFile(file, Text.class, IntWritable.class);
// In the Spark 1.x Java API, mapToPair (rather than map) produces a JavaPairRDD
JavaPairRDD<String, Integer> cleanData = dataRDD.mapToPair(
    new PairFunction<Tuple2<Text, IntWritable>, String, Integer>() {
      @Override
      public Tuple2<String, Integer> call(Tuple2<Text, IntWritable> pair) {
        return new Tuple2<String, Integer>(pair._1().toString(), pair._2().get());
      }
    });
Note that in the preceding cases, like with the text input, the
file need not be a traditional file; it can reside on S3, HDFS,
and so on. Also note that for Java, you can't rely on implicit
conversions between types.
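Going in the other direction, a pair RDD whose key and value types have Writable converters can be written out with saveAsSequenceFile; a minimal Scala sketch with made-up data and a hypothetical output directory:
Scala:
import org.apache.spark.SparkContext._   // brings the sequence file save implicits into scope
val pairs = sc.parallelize(List(("spark", 1), ("hadoop", 2)))
pairs.saveAsSequenceFile("sequenceOutput")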
Loading data from HBase goes through Hadoop's TableInputFormat, using the newAPIHadoopRDD helper.
Scala:
import org.apache.spark._
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
...
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, input_table)
// Initialize hBase table if necessary
val admin = new HBaseAdmin(conf)
if(!admin.isTableAvailable(input_table)) {
val tableDesc = new HTableDescriptor(input_table)
admin.createTable(tableDesc)
}
val hBaseRDD = sc.newAPIHadoopRDD(conf,
classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
Java:
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.client.Result;
...
JavaSparkContext sc = new JavaSparkContext(args[0], "sequence load",
    System.getenv("SPARK_HOME"), System.getenv("JARS"));
Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, args[1]);
// Initialize the HBase table if necessary
HBaseAdmin admin = new HBaseAdmin(conf);
if (!admin.isTableAvailable(args[1])) {
  HTableDescriptor tableDesc = new HTableDescriptor(args[1]);
  admin.createTable(tableDesc);
}
JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD =
    sc.newAPIHadoopRDD(conf, TableInputFormat.class,
        ImmutableBytesWritable.class, Result.class);
The approach used to load the HBase data can be generalized to load any other sort
of Hadoop data: if a helper method does not already exist on SparkContext for your
data source, simply create a Hadoop configuration specifying how to load the data
and pass it to the newAPIHadoopRDD method. Helper methods exist for plain text
files and sequence files; a hadoopFile helper, similar to the sequenceFile API, also
exists for other Hadoop file formats.
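For instance, here is a minimal Scala sketch (with a hypothetical path) that reads plain text through the generic newAPIHadoopFile call instead of the textFile helper; any InputFormat from the new Hadoop API can be plugged in the same way:
Scala:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val raw = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///some/path")
val lines = raw.map { case (_, text) => text.toString }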
Saving your data back out of Spark is just as straightforward.
For Scala:
rddOfStrings.saveAsTextFile("out.txt")
keyValueRdd.saveAsObjectFile("sequenceOut")
For Java:
rddOfStrings.saveAsTextFile("out.txt");
keyValueRdd.saveAsObjectFile("sequenceOut");
For Python:
rddOfStrings.saveAsTextFile("out.txt")
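An RDD written with saveAsObjectFile can later be read back with objectFile; a minimal Scala sketch, assuming keyValueRdd held (String, Int) pairs:
Scala:
val restored = sc.objectFile[(String, Int)]("sequenceOut")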
References:
http://spark-project.org/docs/latest/scala-programming-guide.html#hadoop-datasets
http://opencsv.sourceforge.net/
http://commons.apache.org/proper/commons-csv/
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/InputFormat.html
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
http://spark.apache.org/docs/latest/api/python/
http://wiki.apache.org/hadoop/SequenceFile
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html
http://hbase.apache.org/book/quickstart.html
https://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaPairRDD.html
https://bzhangusc.wordpress.com/2014/06/18/csv-parser/
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html
Summary
In this chapter, you saw how to load data from a variety of different sources.
We also looked at basic parsing of the data from text input files. Now that we can load
our data into a Spark RDD, it is time to move on to the next chapter and explore the
different operations we can perform on that data.
Get more information on Fast Data Processing with Spark, Second Edition at www.PacktPub.com.