Fast Data Processing With Spark - Second Edition - Sample Chapter
Spark is a framework used for writing fast, distributed
programs. Spark solves similar problems as Hadoop
MapReduce does, but with a fast in-memory approach
and a clean functional style API. With its ability to integrate
with Hadoop and built-in tools for interactive query analysis
(Spark SQL), large-scale graph processing and analysis
(GraphX), and real-time analysis (Spark Streaming), it can be
used interactively to quickly process and query big datasets.
Krishna Sankar and Holden Karau
Chapter 9, Machine Learning Using Spark MLlib, talks about regression, classification,
clustering, and recommendation. This is probably the largest chapter in this book. If you
are stranded on a remote island and could take only one chapter with you, this should be
the one!
Chapter 10, Testing, talks about the importance of testing distributed applications.
Chapter 11, Tips and Tricks, distills some of the things we have seen along the way. Our hope
is that, as you become more and more adept at Spark programming, you will add to this list and
send us your own gems for us to include in the next version of this book!
RDDs
Spark RDDs can be created from any supported Hadoop source. Native collections in
Scala, Java, and Python can also serve as the basis for an RDD. Creating RDDs from a
native collection is especially useful for testing.
Before jumping into the details of the supported data sources, take some time
to learn what RDDs are and what they are not. It is crucial to understand that
defining an RDD does not actually load or compute any data; it only sets up
the pipeline that will produce it. (RDDs follow the principle of lazy evaluation:
an expression is evaluated only when it is needed, that is, when an action is
called.) This means that when you go to access the data in an RDD, the operation
could fail at that point. The computation that produces an RDD's data is
performed only when the data is referenced, for example, by caching
or writing out the RDD. This also means that you can chain a large number of
operations and not have to worry about excessive blocking in a computational
thread. It's important to note during application development that you can write
code, compile it, and even run your job; unless you materialize the RDD, your code
may not have even tried to load the original data.
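For example, here is a minimal sketch in the Scala shell (the values and the println are made up for illustration); the map transformation is only recorded, and nothing executes until the count() action is called:
Scala:
val lazyRDD = sc.parallelize(List(1, 2, 3)).map { x =>
  println("processing " + x)   // not printed when the RDD is defined
  x * 2
}
lazyRDD.count()                // only now does the map actually run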
The simplest way to create an RDD is to parallelize a native collection.
Scala:
val dataRDD = sc.parallelize(List(1,2,4))
dataRDD.take(3)
Java:
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;
The program (compiled and run as Test01.java) parallelizes the same small list and then calls count() and take(3); the log output shows both jobs and their results:
14/11/22 13:37:46 INFO SparkContext: Job finished: count at Test01.java:13, took 0.153231 s
3
[..]
14/11/22 13:37:46 INFO SparkContext: Job finished: take at Test01.java:14, took 0.010193 s
[1, 2, 4]
The reason we show a full program for Java is that you can work interactively in the Scala
and Python shells, but Java code has to be compiled and run. I use Eclipse and add the JAR
file /usr/local/spark-1.1.1/assembly/target/scala-2.10/spark-assembly-1.1.1-hadoop2.4.0.jar
to the Java build path.
Python:
rdd = sc.parallelize([1,2,3])
rdd.take(3)
The simplest method for loading external data is loading text from a file. This requires
the file to be available on all the nodes in the cluster, which isn't much of a problem
in local mode. In distributed mode, you will want to use Spark's addFile functionality
to copy the file to all of the machines in your cluster. Assuming your SparkContext
object is called sc, we could load text data from a file as follows (you need to create
the file first):
Scala:
import org.apache.spark.SparkFiles;
...
sc.addFile("spam.data")
val inFile = sc.textFile(SparkFiles.get("spam.data"))
inFile.first()
Java:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;
import org.apache.spark.SparkFiles;

public class LDSV02 {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("Chapter 05").setMaster("local");
    JavaSparkContext ctx = new JavaSparkContext(conf);
    System.out.println("Running Spark Version : " + ctx.version());
    ctx.addFile("/Users/ksankar/fpds-vii/data/spam.data");
    // The remainder mirrors the Scala and Python versions: read the
    // distributed copy of the file and print its first line.
    JavaRDD<String> inFile = ctx.textFile(SparkFiles.get("spam.data"));
    System.out.println(inFile.first());
  }
}
Python:
from pyspark.files import SparkFiles
sc.addFile("spam.data")
in_file = sc.textFile(SparkFiles.get("spam.data"))
in_file.take(1)
The resulting RDD is of the string type, with each line being a unique element in the
RDD. take(1) is an action that picks the first element from the RDD.
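A quick check in the Scala shell (reusing the inFile RDD from the Scala example above) shows the difference between first() and take(n):
Scala:
inFile.first()   // the first line of spam.data, as a single String
inFile.take(3)   // an Array[String] holding the first three lines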
Frequently, your input files will be CSV or TSV files, which you will want to read,
parse, and then turn into RDDs for processing. There are two ways of reading CSV
files: parsing them with our own functions, or using a CSV library such as opencsv.
Let's first look at parsing using our own functions:
Scala:
val inFile = sc.textFile("Line_of_numbers.csv")
val numbersRDD = inFile.map(line => line.split(','))
scala> numbersRDD.take(10)
[..]
14/11/22 12:13:11 INFO SparkContext: Job finished: take at <console>:18, took 0.010062 s
res7: Array[Array[String]] = Array(Array(42, 42, 55, 61, 53, 49, 43, 47, 49, 60, 68, 54, 34, 35, 35, 39))
This gives us an array of arrays of String, but we need numeric (Double) values:
scala> val numbersRDD = inFile.map(line => line.split(',')).map(_.toDouble)
<console>:15: error: value toDouble is not a member of Array[String]
       val numbersRDD = inFile.map(line => line.split(',')).map(_.toDouble)
This does not work because map(_.toDouble) is applied to an Array[String], not to the individual strings. This is where flatMap comes in handy:
scala> val numbersRDD = inFile.flatMap(line => line.split(',')).map(_.toDouble)
numbersRDD: org.apache.spark.rdd.RDD[Double] = MappedRDD[10] at map at <console>:15
scala> numbersRDD.collect()
[..]
res10: Array[Double] = Array(42.0, 42.0, 55.0, 61.0, 53.0, 49.0, 43.0, 47.0, 49.0, 60.0, 68.0, 54.0, 34.0, 35.0, 35.0, 39.0)
scala> numbersRDD.sum()
[..]
14/11/22 12:19:15 INFO SparkContext: Job finished: sum at <console>:18, took 0.013293 s
res9: Double = 766.0
scala>
Python:
inp_file = sc.textFile("Line_of_numbers.csv")
numbers_rdd = inp_file.map(lambda line: line.split(','))
>>> numbers_rdd.take(10)
# As in Scala, flatMap gives us the individual values, which we can
# convert to float and sum
numbers_flat = inp_file.flatMap(lambda line: line.split(',')).map(lambda x: float(x))
numbers_flat.sum()
Java:
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.DoubleFunction;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.SparkFiles;
// "lines" is the JavaRDD<String> read from Line_of_numbers.csv
JavaRDD<String[]> numbersStrRDD = lines.map(new Function<String, String[]>() {
  public String[] call(String line) { return line.split(","); }
});
List<String[]> val = numbersStrRDD.take(1);
for (String[] e : val) {
  for (String s : e) {
    System.out.print(s + " ");
  }
  System.out.println();
}
// flatMap flattens the arrays of fields into a single RDD of strings
JavaRDD<String> strFlatRDD = lines.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String line) { return Arrays.asList(line.split(",")); }
});
List<String> val1 = strFlatRDD.collect();
for (String s : val1) {
  System.out.print(s + " ");
}
System.out.println();
// parse each string into an Integer
JavaRDD<Integer> numbersRDD = strFlatRDD.map(new Function<String, Integer>() {
  public Integer call(String s) { return Integer.parseInt(s); }
});
List<Integer> val2 = numbersRDD.collect();
for (Integer s : val2) {
  System.out.print(s + " ");
}
System.out.println();
// sum the values with reduce
Integer sum = numbersRDD.reduce(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});
System.out.println("Sum = " + sum);
  }
}
This also illustrates one of the ways of getting data out of Spark: you can transform
an RDD into a standard local collection (a Scala array) using the collect() function.
The collect() function is especially useful for testing, in much the same way that the
parallelize() function is: parallelize() partitions the input data and turns it into an
RDD, while collect() gathers the results of the job's execution. Note that collect()
only works if your data fits in the memory of the single host where your driver code
runs, and even then it creates a bottleneck because everything has to come back to
one machine.
The collect() function brings all the data to the machine
that runs the code. So beware of accidentally doing
collect() on a large RDD!
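A minimal sketch of the safer pattern (the RDD here is made up for illustration): prefer take() for inspecting a few elements and reserve collect() for data you know is small:
Scala:
val bigRDD = sc.parallelize(1 to 1000000)
bigRDD.take(5)      // brings back only five elements to the driver
bigRDD.collect()    // brings back all one million elements; fine here, risky at scale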
The split() and toDouble() approach doesn't always work out so well for more
complex CSV files. opencsv is a versatile CSV library for Java and Scala; for Python,
the built-in csv module does the trick. Let's use the opencsv library to parse the
CSV file in Scala.
Scala:
import au.com.bytecode.opencsv.CSVReader
import java.io.StringReader
sc.addFile("Line_of_numbers.csv")
val inFile = sc.textFile("Line_of_numbers.csv")
val splitLines = inFile.map(line => {
  val reader = new CSVReader(new StringReader(line))
  reader.readNext()
})
val numericData = splitLines.map(line => line.map(_.toDouble))
val summedData = numericData.map(row => row.sum)
println(summedData.collect().mkString(","))
[..]
14/11/22 12:37:43 INFO TaskSchedulerImpl: Removed TaskSet 13.0, whose tasks have all completed, from pool
14/11/22 12:37:43 INFO SparkContext: Job finished: collect at <console>:28, took 0.0234 s
766.0
While loading text files into Spark is certainly easy, text files on a local disk are often
not the most convenient format for storing large chunks of data. Spark supports
loading from all of the different Hadoop formats (sequence files, regular text files,
and so on) and from all of the supported Hadoop storage sources (HDFS, S3, HBase,
and so on). You can also load your CSV data into HBase using one of its bulk loading
tools (such as ImportTsv) and then read it from there.
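For example, the same textFile call works across storage systems, with the URI scheme selecting the source (the paths below are hypothetical):
Scala:
val localLines = sc.textFile("/tmp/spam.data")                       // local file system
val hdfsLines = sc.textFile("hdfs://namenode:8020/data/spam.data")   // HDFS
val s3Lines = sc.textFile("s3n://my-bucket/data/spam.data")          // S3, with credentials configured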
Sequence files are binary flat files consisting of key value pairs; they are one of the
common ways of storing data for use with Hadoop. Loading a sequence file into
Spark is similar to loading a text file, but you also need to let it know about the types
of the keys and values. The types must either be subclasses of Hadoop's Writable
class or be implicitly convertible to such a type. For Scala users, some native types
are convertible through the implicits in WritableConverter. As of version 1.1.0, the
standard WritableConverter types are int, long, double, float, boolean, byte arrays,
and string. Let's illustrate by looking at the process of loading a sequence file of
String to Integer, as shown here:
Scala:
val data = sc.sequenceFile[String, Int](inputFile)
Java:
JavaPairRDD<Text, IntWritable> dataRDD = sc.sequenceFile(file, Text.class, IntWritable.class);
// In the Spark 1.x Java API, mapToPair (rather than map) produces a JavaPairRDD
JavaPairRDD<String, Integer> cleanData = dataRDD.mapToPair(
    new PairFunction<Tuple2<Text, IntWritable>, String, Integer>() {
      @Override
      public Tuple2<String, Integer> call(Tuple2<Text, IntWritable> pair) {
        return new Tuple2<String, Integer>(pair._1().toString(), pair._2().get());
      }
    });
Note that in the preceding cases, like with the text input, the
file need not be a traditional file; it can reside on S3, HDFS,
and so on. Also note that for Java, you can't rely on implicit
conversions between types.
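Going in the other direction, a pair RDD whose key and value types have Writable converters can be written out with saveAsSequenceFile; a minimal Scala sketch with made-up data and a hypothetical output directory:
Scala:
import org.apache.spark.SparkContext._   // brings the sequence file save implicits into scope
val pairs = sc.parallelize(List(("spark", 1), ("hadoop", 2)))
pairs.saveAsSequenceFile("sequenceOutput")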
Loading data from HBase goes through Hadoop's TableInputFormat, using the newAPIHadoopRDD helper.
Scala:
import org.apache.spark._
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
...
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, input_table)
// Initialize hBase table if necessary
val admin = new HBaseAdmin(conf)
if(!admin.isTableAvailable(input_table)) {
val tableDesc = new HTableDescriptor(input_table)
admin.createTable(tableDesc)
}
val hBaseRDD = sc.newAPIHadoopRDD(conf,
classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
Java:
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.client.Result;
...
JavaSparkContext sc = new JavaSparkContext(args[0], "sequence load",
    System.getenv("SPARK_HOME"), System.getenv("JARS"));
Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, args[1]);
// Initialize the HBase table if necessary
HBaseAdmin admin = new HBaseAdmin(conf);
if (!admin.isTableAvailable(args[1])) {
  HTableDescriptor tableDesc = new HTableDescriptor(args[1]);
  admin.createTable(tableDesc);
}
JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD =
    sc.newAPIHadoopRDD(conf, TableInputFormat.class,
        ImmutableBytesWritable.class, Result.class);
The approach used to load the HBase data can be generalized to load any other sort
of Hadoop data: if a helper method does not already exist on SparkContext for your
data source, simply create a Hadoop configuration specifying how to load the data
and pass it to the newAPIHadoopRDD method. Helper methods exist for plain text
files and sequence files; a hadoopFile helper, similar to the sequenceFile API, also
exists for other Hadoop file formats.
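For instance, here is a minimal Scala sketch (with a hypothetical path) that reads plain text through the generic newAPIHadoopFile call instead of the textFile helper; any InputFormat from the new Hadoop API can be plugged in the same way:
Scala:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val raw = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///some/path")
val lines = raw.map { case (_, text) => text.toString }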
Saving your data back out of Spark is just as straightforward.
For Scala:
rddOfStrings.saveAsTextFile("out.txt")
keyValueRdd.saveAsObjectFile("sequenceOut")
For Java:
rddOfStrings.saveAsTextFile("out.txt");
keyValueRdd.saveAsObjectFile("sequenceOut");
For Python:
rddOfStrings.saveAsTextFile("out.txt")
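An RDD written with saveAsObjectFile can later be read back with objectFile; a minimal Scala sketch, assuming keyValueRdd held (String, Int) pairs:
Scala:
val restored = sc.objectFile[(String, Int)]("sequenceOut")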
References:
http://spark-project.org/docs/latest/scala-programming-guide.html#hadoop-datasets
http://opencsv.sourceforge.net/
http://commons.apache.org/proper/commons-csv/
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/InputFormat.html
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
http://spark.apache.org/docs/latest/api/python/
http://wiki.apache.org/hadoop/SequenceFile
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html
http://hbase.apache.org/book/quickstart.html
https://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaPairRDD.html
https://bzhangusc.wordpress.com/2014/06/18/csv-parser/
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html
Summary
In this chapter, you saw how to load data from a variety of different sources.
We also looked at basic parsing of the data from text input files. Now that we can load
our data into a Spark RDD, it is time to move on to the next chapter and explore the
different operations we can perform on that data.
Get more information on Fast Data Processing with Spark, Second Edition at www.PacktPub.com.