Data Analysis Frameworks for IoT
10.1. Introduction
The volume, velocity and variety of data generated by data-intensive IoT systems are so large
that it is difficult to store, manage, process and analyze the data using traditional databases
and data processing tools. Data can be analyzed with aggregation methods (such as
computing the mean, maximum, minimum or counts) or with machine learning methods such
as clustering and classification. Clustering is used to group similar data items together, such
that items which are more similar to each other (with respect to some similarity criteria) than
to other items are placed in the same cluster. Classification is used for categorizing
objects into predefined categories.
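As a brief illustration of these two kinds of methods, the sketch below uses scikit-learn (which is also used later in this chapter); the readings and labels are made up for the example:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# hypothetical [temperature, humidity] readings
readings = np.array([[22, 45], [23, 47], [45, 10], [47, 12], [24, 44], [46, 11]])

# clustering: group similar readings into two clusters
kmeans = KMeans(n_clusters=2).fit(readings)
print kmeans.labels_

# classification: assign a new reading to a predefined category
labels = ['normal', 'normal', 'fire', 'fire', 'normal', 'fire']
clf = DecisionTreeClassifier().fit(readings, labels)
print clf.predict([[44, 9]])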
In this chapter, you will learn about various frameworks for data analysis including
Apache Hadoop, Apache Oozie, Apache Spark and Apache Storm. Case studies on batch
and real-time data analysis for a forest fire detection system are described. Before going into
the specifics of the data analysis tools, let us look at the IoT system and the requirements for
data analysis.
Figure 10.1 shows the deployment design of a forest fire detection system with multiple
end nodes which are deployed in a forest. The end nodes are equipped with sensors for
measuring temperature, humidity, light and carbon monoxide (CO) at various locations
in the forest. Each end node sends data independently to the cloud using REST-based
communication. The data collected in the cloud is analyzed to predict whether a fire has broken
out in the forest.
[Figure 10.1: Deployment design of the forest fire detection system - end-node devices in the field communicate with cloud-hosted REST services, and an analytics component (IoT intelligence) and database in the cloud store and analyze the collected data]
Figure 10.2 shows an example of the data collected for forest fire detection. Each row
in the table shows timestamped readings of temperature, humidity, light and CO sensors.
By analyzing the sensor readings in real-time (each row of the table), predictions can be made
about the occurrence of a forest fire. The sensor readings can also be aggregated on various
timescales (minute, hourly, daily or monthly) to determine the mean, maximum and minimum
readings. This data can help in developing prediction models.
[Figure 10.2: Example of data collected for forest fire detection - timestamped temperature, humidity, light and CO readings, which feed real-time data analytics with Storm]
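For instance, such hourly aggregates can be computed with a few lines of Python (a sketch; the input file name is hypothetical and the line format is assumed to match Figure 10.2, i.e. timestamp, temperature, humidity, light, CO):

from collections import defaultdict

hourly = defaultdict(list)
for line in open('sensor_data.csv'):                  # hypothetical input file
    timestamp, temp, humidity, light, co = line.strip().split(',')
    hour = timestamp.strip('"')[:13]                  # e.g. '2014-05-01 11'
    hourly[hour].append(float(temp))

# mean, maximum and minimum temperature for each hour
for hour in sorted(hourly):
    temps = hourly[hour]
    print hour, sum(temps) / len(temps), max(temps), min(temps)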
Figure 10.3 shows a schematic diagram of the forest fire detection end node. The end node
is based on a Raspberry Pi device and uses a DHT22 temperature and humidity sensor, a
light-dependent resistor (LDR) and a MICS5525 CO sensor. Box 10.1 shows the Python code for the native
controller service that runs on the end nodes. This example uses the Xively PaaS for storing
data. In the setupController function, new Xively datastreams are created for temperature,
humidity, light and CO data. The runController function is called every second to obtain the
sensor readings. The Xively REST API is used for sending data to the Xively
cloud.
= Box 10.1: Controller service for forest fire detection system - [Link]
import time
import datetime
import requests
import xively
import dhtreader
import spidev
global temp_datastream
global CO_datastream
global humidity_datastream
global light_datastream
Figure 10.3: Schematic diagram of forest fire detection end node showing Raspberry Pi
device and sensors
#Initialize DHT22
dhtreader.init()

temperature, humidity = read_DHT22_Sensor()
light = readLDR()
CO_reading = readCOSensor()

temp_datastream.current_value = temperature
temp_datastream.at = datetime.datetime.utcnow()
light_datastream.current_value = light
light_datastream.at = datetime.datetime.utcnow()
CO_datastream.current_value = CO_reading
CO_datastream.at = datetime.datetime.utcnow()
try:
    temp_datastream.update()
except requests.HTTPError as e:
    print "HTTPError({0}): {1}".format(e.errno, e.strerror)
try:
    humidity_datastream.update()
except requests.HTTPError as e:
    print "HTTPError({0}): {1}".format(e.errno, e.strerror)
try:
    light_datastream.update()
except requests.HTTPError as e:
    print "HTTPError({0}): {1}".format(e.errno, e.strerror)
try:
    CO_datastream.update()
except requests.HTTPError as e:
    print "HTTPError({0}): {1}".format(e.errno, e.strerror)
temp_datastream = get_tempdatastream(feed)
temp_datastream.max_value = None
temp_datastream.min_value = None
humidity_datastream = get_humiditydatastream(feed)
humidity_datastream.max_value = None
humidity_datastream.min_value = None
light_datastream = get_lightdatastream(feed)
light_datastream.max_value = None
light_datastream.min_value = None
CO_datastream = get_COdatastream(feed)
CO_datastream.max_value = None
CO_datastream.min_value = None
setupController()
while True:
    runController()
    time.sleep(1)
NameNode
NameNode keeps the directory tree of all files in the file system, and tracks where across
the cluster the file data is kept. It does not store the data of these files itself. Client
applications talk to the NameNode whenever they wish to locate a file, or when they want to
add/copy/move/delete a file. The NameNode responds to the successful requests by returning
a list of relevant DataNode servers where the data lives. NameNode serves as both directory
namespace manager and ‘inode table’ for the Hadoop DFS. There is a single NameNode
running in any DFS deployment.
Secondary NameNode
HDFS is not currently a high availability system. The NameNode is a Single Point of Failure
for the HDFS Cluster. When the NameNode goes down, the file system goes offline. An
optional Secondary NameNode which is hosted on a separate machine creates checkpoints of
the namespace.
JobTracker
The JobTracker is the service within Hadoop that distributes MapReduce tasks to specific
nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack.
TaskTracker
TaskTracker is a node in a Hadoop cluster that accepts Map, Reduce and Shuffle tasks from
the JobTracker. Each TaskTracker has a defined number of slots which indicate the number
of tasks that it can accept. When the JobTracker tries to find a TaskTracker to schedule a
map or reduce task it first looks for an empty slot on the same node that hosts the DataNode
containing the data. If an empty slot is not found on the same node, the JobTracker looks for
an empty slot on a node in the same rack.
DataNode
A DataNode stores data in an HDFS file system. A functional HDFS filesystem has more
than one DataNode, with data replicated across them. DataNodes connect to the NameNode
on startup. DataNodes respond to requests from the NameNode for filesystem operations.
Client applications can talk directly to a DataNode, once the NameNode has provided the location of the data.
The TaskTracker spawns a separate process for each task so that any task failure does not bring down the TaskTracker. The TaskTracker monitors
these spawned processes while capturing the output and exit codes. When the process
finishes, successfully or not, the TaskTracker notifies the JobTracker. When a task fails the
TaskTracker notifies the JobTracker and the JobTracker decides whether to resubmit the job
to some other TaskTracker or mark that specific record as something to avoid. The JobTracker
can blacklist a TaskTracker as unreliable if there are repeated task failures. When the job is
completed, the JobTracker updates its status. Client applications can poll the JobTracker for
status of the jobs.
Install Java
Hadoop requires Java 6 or later version. Box 10.2 lists the commands for installing Java 7.
Install Hadoop
To setup a Hadoop cluster, the Hadoop setup tarball is downloaded and unpacked on all the
nodes. The Hadoop version used for the cluster example in this section is 1.0.4. Box 10.3
lists the commands for installing Hadoop.
#Modify the /etc/hosts file and add the private IPs of the master and slave nodes:
$sudo vim /etc/hosts
#<private_IP_master> master
#<private_IP_slave1> slave1
#<private_IP_slave2> slave2

#Open the authorized keys file and copy the authorized keys of each node
$sudo vim ~/.ssh/authorized_keys
Networking
After unpacking the Hadoop setup package on all the nodes of the cluster, the next step is to
configure the network such that all the nodes can connect to each other over the network. To
make the addressing of nodes simple, assign simple host names to the nodes (such as master, slave1
and slave2). The /etc/hosts file is edited on all nodes and IP addresses and host names of all
the nodes are added.
Hadoop control scripts use SSH for cluster-wide operations such as starting and stopping
NameNode, DataNode, JobTracker, TaskTracker and other daemons on the nodes in the
cluster. For the control scripts to work, all the nodes in the cluster must be able to connect
to each other via password-less SSH login. To enable this, a public/private RSA key pair
is generated on each node. The private key is stored in the file ~/.ssh/id_rsa and the public key
is stored in the file ~/.ssh/id_rsa.pub. The public SSH key of each node is copied to the
~/.ssh/authorized_keys file of every other node. This can be done by manually editing the
~/.ssh/authorized_keys file on each node or by using the ssh-copy-id command. The final step in
setting up the networking is to save the host key fingerprints of each node to the known_hosts file of
every other node. This is done by connecting from each node to every other node by SSH.
Configure Hadoop
With the Hadoop setup package unpacked on all nodes and networking of nodes setup,
the next step is to configure the Hadoop cluster. Hadoop is configured using a number of
configuration files listed in Table 10.1. Boxes 10.4, 10.5, 10.6 and 10.7 show the sample
configuration settings for the Hadoop configuration files [Link], [Link],
[Link], masters/slaves files respectively.
<?xml version="1.0"?>
<configuration>
<property>
<name>[Link]</name>
<value>hdfs://master:54310</value>
</property>
</configuration>
<?xml version="1.0"?>
<configuration>
<property>
<name>[Link]</name>
<value>2</value>
</property>
</configuration>
<?xml version="1.0"?>
<configuration>
<property>
<name>[Link]</name>
<value>master:54311</value>
</property>
</configuration>
$cd hadoop/conf/
$cd hadoop-1.0.4

#Format NameNode
$bin/hadoop namenode -format
$jps
Figure 10.9: Hadoop HDFS status page showing live data nodes
Box 10.9 shows the map program for the batch analysis of sensor data. The map program
reads the data from standard input (stdin) and splits the data into timestamp and individual
sensor readings. The map program emits key-value pairs where key is a portion of the
timestamp (that depends on the timescale on which the data is to be aggregated) and the
value is a comma separated string of sensor readings.
#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.strip()
    data = line.split(',')
    l = len(data)
    #key is a portion of the timestamp; the slice depends on the aggregation timescale
    key = data[0][:14]
    value = data[1] + ',' + data[2] + ',' + data[3] + ',' + data[4]
    print '%s\t%s' % (key, value)
Box 10.10 shows the reduce program for the batch analysis of sensor data. The key-value
pairs emitted by the map program are shuffled to the reducer and grouped by the key. The
reducer reads the key-value pairs grouped by the same key from standard input and computes
the means of temperature, humidity, light and CO readings.
#!/usr/bin/env python
from operator import itemgetter
import sys
import numpy as np

current_key = None
current_vals_list = []
key = None

#read key-value pairs produced by the mapper from stdin
for line in sys.stdin:
    line = line.strip()
    key, value = line.split('\t', 1)
    list_of_values = value.split(',')

    if current_key == key:
        current_vals_list.append(list_of_values)
    else:
        if current_key:
            l = len(current_vals_list) + 1
            b = np.array(current_vals_list, dtype=float)
            meanval = [np.mean(b[0:l,0]), np.mean(b[0:l,1]),
                       np.mean(b[0:l,2]), np.mean(b[0:l,3])]
            print '%s%s' % (current_key, str(meanval))
        current_vals_list = []
        current_vals_list.append(list_of_values)
        current_key = key

if current_key == key:
    l = len(current_vals_list) + 1
    b = np.array(current_vals_list, dtype=float)
    meanval = [np.mean(b[0:l,0]), np.mean(b[0:l,1]),
               np.mean(b[0:l,2]), np.mean(b[0:l,3])]
    print '%s%s' % (current_key, str(meanval))
#Testing locally
$cat [Link] | python [Link] | python [Link]
[Figure: Hadoop YARN architecture - clients submit jobs to the Resource Manager; Application Masters and containers run on the nodes, with interactions for job submission, node status, resource requests, MapReduce status and container status requests]
To submit an application, the client constructs and submits an Application Submission Context which contains information such
as scheduler queue, priority and user information. The Application Submission Context also
contains a Container Launch Context which contains the application’s jar, job files, security
tokens and any resource requirements. The client can query the RM for application reports.
The client can also "force kill" an application by sending a request to the RM.
[Figure: Interactions between the client and the Resource Manager - new application request, response with an application ID, application submission, and application report request/response]
Figure 10.15 shows the interactions between the Resource Manager and the Application Master
(registration, heartbeats, allocate requests and responses, and finish application).
Upon receiving an application submission context from a client, the RM finds an available
container meeting the resource requirements for running the AM for the application. On
finding a suitable container, the RM contacts the NM for the container to start the AM process
on its node. When the AM is launched it registers itself with the RM. The registration process
consists of handshaking that conveys information such as the RPC port that the AM will be
listening on, the tracking URL for monitoring the application’s status and progress, etc. The
registration response from the RM contains information for the AM that is used in calculating
and requesting any resource requests for the application’s individual tasks (such as minimum
and maximum resource capabilities for the cluster). The AM relays heartbeat and progress
information to the RM. The AM sends resource allocation requests to the RM that contain a
list of requested containers, and may also contain a list of containers released by the AM.
Upon receiving the allocation request, the scheduler component of the RM computes a list of
containers that satisfy the request and sends back an allocation response. Upon receiving the
resource list, the AM contacts the associated NMs for starting the containers. When the job
finishes, the AM sends a Finish Application message to the RM.
Figure 10.16 shows the interactions between the Application Master and the Node
Manager. Based on the resource list received from the RM, the AM requests the hosting NM
for each container to start the container. The AM can request and receive a container status
report from the Node Manager.
In the previous section you learned how to setup a Hadoop 1.x cluster. This section describes
the steps involved in setting up a Hadoop YARN cluster. The initial steps of setting up the
hosts, installing Java and configuring the networking are the same as in Hadoop 1.x. The next
step is to download the Hadoop YARN setup package and unpack it on all nodes as follows:
$wget [Link]/hadoop/common/stable2/[Link]
export HADOOP_HOME=/home/ubuntu/hadoop-2.2.0
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export JAVA_HOME=/usr/lib/jvm/java-7-oracle/
export HADOOP_HOME=/home/ubuntu/hadoop-2.2.0
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
$mkdir -p $HADOOP_HOME/tmp
Next, add the slave hostnames to the etc/hadoop/slaves file on the master machine:
slave1
slave2
slave3
The next step is to edit the Hadoop configuration files. Boxes 10.12, 10.13, 10.14
and 10.15 show the sample configuration settings for the Hadoop configuration files -
[Link], [Link], [Link], [Link] files respectively.
<?xml version="1.0"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce.shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>master:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>master:8040</value>
</property>
</configuration>
Box 10.16 shows the commands for starting/stopping Hadoop YARN cluster.
$cd hadoop-1.0.4
#Format NameNode
bin/hadoop namenode -format
Figures 10.17, 10.18 and 10.19 show the screenshots of the Hadoop Namenode, YARN
cluster and job history server dashboards.
Next, follow the steps for setting up Hadoop described in the previous section. After
setting up Hadoop, install the packages required for setting up Oozie as follows:
$wget [Link]
$tar -xzvf oozie-3.3.2.tar.gz
$cd oozie-3.3.2/bin
$./mkdistro.sh -DskipTests
Create a new directory named ‘oozie’ and copy the built binaries. Also copy the jar files
from ‘hadooplibs’ directory to the libext directory as follows:
$cd /home/hduser
$mkdir oozie
$cp -R oozie-3.3.2/distro/target/oozie-3.3.2-distro/oozie-3.3.2/* oozie
$cd /home/hduser/oozie
$mkdir libext
$cp /home/hduser/oozie-3.3.2/hadooplibs/hadoop-1/target/hadooplibs/hadooplib-1.1.1.oozie-3.3.2/* /home/hduser/oozie/libext/
Download ExtJS 2.2 to the 'libext' directory. This is required for the Oozie web console:
$cd /home/hduser/oozie/libext/
$wget [Link]
#Create sharelib on HDFS
$./bin/oozie-setup.sh sharelib create -fs hdfs://master:54310

#Start Oozie
$./bin/oozied.sh start
The status of Oozie can be checked from command line or the web console as follows:
To setup the Oozie client, copy the client tar file to the 'oozie-client' directory and add the path in the
.bashrc file as follows:
"2014-07-01 ZO Gd O25 Ts AO
The goal of the analysis job is to find the count of each status/error code in the data.
[Figure 10.20: Oozie workflow (DAG) for computing counts of machine status/error codes]
Boxes 10.17 and 10.18 show the map and reduce programs which are executed in the
workflow. The map program parses the status/error code from each line in the input and
emits key-value pairs where key is the status/error code and value is 1. The reduce program
receives the key-value pairs emitted by the map program aggregated by the same key. For
each key, the reduce program calculates the count and emits key-value pairs where key is the
status/error code and the value is the count.
= Box 10.17: Map program for computing counts of machine status/error codes
#!/usr/bin/env python
import sys
#Data format
#"2014-07-01 20:02:02",115
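The body of the mapper is not reproduced above; a minimal sketch of the logic the text describes (reading lines of the assumed timestamp,code format from standard input and emitting code<TAB>1 pairs) is:

#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.strip()
    # assumed line format: "2014-07-01 20:02:02",<status/error code>
    code = line.split(',')[-1]
    print '%s\t%s' % (code, 1)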
= Box 10.18: Reduce program for computing counts of machine status/error codes
#!/usr/bin/env python
from operator import itemgetter
import sys

current_key = None
current_count = 0
key = None

#read key-value pairs produced by the mapper from stdin
for line in sys.stdin:
    line = line.strip()
    key, count = line.split('\t', 1)
    count = int(count)

    if current_key == key:
        current_count += count
    else:
        if current_key:
            unpackedKey = current_key.split(',')
            print '%s%s' % (current_key, current_count)
        current_count = count
        current_key = key

if current_key == key:
    unpackedKey = current_key.split(',')
    print '%s%s' % (current_key, current_count)
Box 10.20 shows the specification for the Oozie workflow shown in Figure 10.20. The Oozie
workflow has been parameterized with variables within the workflow definition. The values
of these variables are provided in the job properties file shown in Box 10.19.
nameNode=hdfs://master:54310
jobTracker=master:54311
queueName=default
oozie.libpath=${nameNode}/user/hduser/share/lib
oozie.use.system.libpath=true
oozie.wf.rerun.failnodes=true
oozieProjectRoot=${nameNode}/user/hduser/oozieProject
appPath=${oozieProjectRoot}/pythonApplication
oozie.wf.application.path=${appPath}
oozieLibPath=${oozie.libpath}
= Box 10.20: Oozie workflow for computing counts of machine status/error codes
<value>${outputDir}</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>1</value>
</property>
</configuration>
<file>${appPath}/[Link]#[Link]</file>
<file>${appPath}/[Link]#[Link]</file>
</map-reduce>
<ok to="sendEmailSuccess"/>
<error to="sendEmailKill"/>
</action>
<action name="sendEmailSuccess">
<email xmlns="uri:oozie:email-action:0.1">
<to>${emailToAddress}</to>
<subject>Status of workflow ${wf:id()}</subject>
<body>The workflow ${wf:id()} completed successfully</body>
</email>
<ok to="end"/>
<error to="end"/>
</action>
<action name="sendEmailKill">
<email xmlns="uri:oozie:email-action:0.1">
<to>${emailToAddress}</to>
<subject>Status of workflow ${wf:id()}</subject>
<body>The workflow ${wf:id()} had issues and was killed.
The error message is: ${wf:errorMessage(wf:lastErrorNode())}</body>
</email>
<ok to="killJobAction"/>
<error to="killJobAction"/>
</action>
<kill name="killJobAction">
<message>"Killed job due to error:
${wf:errorMessage(wf:lastErrorNode())}"</message>
</kill>
<end name="end"/>
</workflow-app>
Let us now look at a more complicated workflow which has two MapReduce jobs.
Extending the example described earlier in this section, let us say we want to find the
status/error code with the maximum count. The MapReduce job in the earlier workflow
computed the counts for each status/error code. A second MapReduce job, which consumes
the output of the first MapReduce job, computes the maximum count. The map and reduce
programs for the second MapReduce job are shown in Boxes 10.21 and 10.22.
Figure 10.21 shows a DAG representation of the Oozie workflow for computing machine
status/error code with maximum count. The specification of the workflow is shown in
Box 10.23.
Figure 10.21: Oozie workflow for computing machine status/error code with maximum count
» Box 10.21: Map program for computing machine status/error code with
maximum count
#!/usr/bin/env python
import sys
#Data format
#"2014-07-01 [Link]",115
» Box 10.22: Reduce program for computing machine status/error code with
maximum count
#!/usr/bin/env python
from operator import itemgetter
import sys

current_key = None
current_count = 0
key = None
maxcount = 0
maxcountkey = None

#read key-value pairs (code, count) produced by the first job
for line in sys.stdin:
    line = line.strip()
    key, count = line.split('\t', 1)
    count = int(count)
    if count > maxcount:
        maxcount = count
        maxcountkey = key

print '%s%s' % (maxcountkey, maxcount)
» Box 10.23: Oozie workflow for computing machine status/error code with
maximum count
<name>mapred.input.dir</name>
<value>${inputDir}</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>${outputDir}</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>1</value>
</property>
</configuration>
<file>${appPath}/[Link]#[Link]</file>
<file>${appPath}/[Link]#[Link]</file>
</map-reduce>
<ok to="streamingAction2"/>
<error to="killJobAction"/>
</action>
<action name="streamingAction2">
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<streaming>
<mapper>python [Link]</mapper>
<reducer>python [Link]</reducer>
</streaming>
<configuration>
<property>
<name>oozie.libpath</name>
<value>${oozieLibPath}/mapreduce-streaming</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>${outputDir}</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>${outputDir}/output2</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>1</value>
</property>
</configuration>
<file>${appPath}/[Link]#Mapper1.py</file>
<file>${appPath}/[Link]#[Link]</file>
</map-reduce>
<ok to="end"/>
<error to="killJobAction"/>
</action>
<kill name="killJobAction">
<message>"Killed job due to error:
${wf:errorMessage(wf:lastErrorNode())}"</message>
</kill>
Figure 10.22 shows a screenshot of the Oozie web console which can be used to monitor
the status of Oozie workflows.
[Figure 10.22: Oozie web console]
[Figure: Apache Spark cluster components - the SparkContext (driver) connects to a cluster manager such as Apache Mesos, and worker nodes run executors with tasks and cache]
Spark comes with a spark-ec2 script (in the spark/ec2 directory) which makes it easy to
set up a Spark cluster on Amazon EC2. With the spark-ec2 script you can launch, manage
and shut down a Spark cluster using the script's launch, login, stop and destroy actions.
A Spark cluster set up on EC2 is configured to use HDFS as its default filesystem, so to analyze
the contents of a file, the file should first be copied to HDFS (for example, with the hadoop fs -put command).
Spark supports a shell mode with which you can interactively run commands for analyzing
data. To launch the Spark Python shell, run the following command:
$./bin/pyspark
When you launch a PySpark shell, a SparkContext is created in the variable called sc.
The following commands show how to load a text file and count the number of lines from the
PySpark shell.
textFile = sc.textFile("[Link]")
textFile.count()
Let us now look at a standalone Spark application that computes word counts in a file.
Box 10.24 shows a Python program for computing word count. The program uses the map
and reduce functions. The flatMap and map transformation take as input a function which
is applied to each element of the dataset. While the flatMap function can map each input
item to zero or more output items, the map function maps each input item to another item.
The transformations take as input functions which are applied to the data elements. The
input functions can be in the form of Python lambda expressions or local functions. In the
word count example flatMap takes as input a lambda expression that splits each line of the
file into words. The map transformation outputs key-value pairs where the key is a word and the
value is 1. The reduceByKey transformation aggregates values of each key using the function
specified (add function in this example). Finally the collect action is used to return all the
elements of the result as an array.
= Box 10.24: Apache Spark Python program for computing word count
from pyspark import SparkContext
from operator import add

sc = SparkContext(appName="WordCountApp")
lines = sc.textFile("[Link]")
counts = lines.flatMap(lambda x: x.split(" ")) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(add)
output = counts.collect()
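The collected results can then be printed from the driver program, for example:

for (word, count) in output:
    print '%s: %i' % (word, count)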
Let us look at another Spark application for batch analysis of data. Taking the example
of analysis of forest fire detection sensor data described in the previous section, let us look at
a Spark application that aggregates the time-stamped sensor data and finds hourly maximum
values for temperature, humidity, light and CO. The Python code for the Spark application is
shown in Box 10.25. The sensor data is loaded as a text file. Each line of the text file contains
time-stamped sensor data. The lines are first split by applying the map transformation to
access the individual sensor readings. In the next step, a map transformation is applied which
outputs key-value pairs where key is a timestamp (excluding the minutes and seconds part)
and value is a sensor reading. Finally the reduceByKey transformation is applied to find the
maximum sensor reading.
» Box 10.25: Apache Spark Python program for computing maximum values for
sensor readings
#Data format:
#"2014-06-25 10:47:44",26,36,2860,274
from pyspark import SparkContext
sc = SparkContext(appName="MyApp")
textFile = sc.textFile("[Link]")
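The rest of the program is not reproduced in this excerpt; a sketch of the transformations the text describes (the helper and the field positions are assumptions based on the data format comment) is:

def parseLine(line):
    fields = line.strip().split(',')
    hour = fields[0].strip('"')[:13]                # timestamp truncated to the hour
    readings = [float(v) for v in fields[1:]]       # temperature, humidity, light, CO
    return (hour, readings)

hourly_max = textFile.map(parseLine).reduceByKey(
    lambda a, b: [max(p, q) for p, q in zip(a, b)])
print hourly_max.collect()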
Box 10.26 shows an example of using Spark for data filtering. This example uses the
sensor data from the forest fire detection IoT system.
= Box 10.26: Apache Spark Python program for filtering sensor readings
#Data format:
#"2014-06-25 10:47:44",26,36,2860,274
from pyspark import SparkContext

sc = SparkContext(appName="App")
textFile = sc.textFile("[Link]")

#Alternative implementation
def filterfunc(line):
    if int(line[1]) > 20 and int(line[2]) > 20 \
            and int(line[3]) > 6000 and int(line[4]) > 200:
        return line
    else:
        return ""

#split each line into fields before filtering
textFile.map(lambda line: line.split(",")).filter(filterfunc).collect()
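The primary implementation referred to by the '#Alternative implementation' comment is not shown in the excerpt; an equivalent filter written with a lambda expression (using the same thresholds as filterfunc) might look like this:

filtered = textFile.map(lambda line: line.split(",")) \
                   .filter(lambda r: int(r[1]) > 20 and int(r[2]) > 20
                           and int(r[3]) > 6000 and int(r[4]) > 200)
print filtered.collect()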
#Data format:
#26.0, 36.0, 2860.0, 274.0
import numpy as np
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="KMeans")
lines = sc.textFile(inputfile)        # inputfile and k (number of clusters) are set elsewhere
data = lines.map(parseVector)
model = KMeans.train(data, k)
print "Final centers: " + str(model.clusterCenters)
Box 10.28 shows an example of classifying data with Naive Bayes classification algorithm.
The training data in this example consists of labeled points where value in the first column is
the label. The parsePoint function parses the data and creates Spark LabeledPoint objects.
The labeled points are passed to the NaiveBayes object for training a model. Finally, the
classification is done by passing the test data (as labeled point) to the trained model.
#Data format:
#1.0, 26.0, 36.0, 2860.0, 274.0
import numpy as np
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import NaiveBayes

sc = SparkContext(appName="App")
points = sc.textFile("[Link]").map(parsePoint)

#Train a Naive Bayes model on the labeled points
model = NaiveBayes.train(points)

#Make prediction.
prediction = model.predict([20.0, 40.0, 1000.0, 300.0])
print "Prediction is: " + str(prediction)
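The parsePoint helper is likewise not reproduced; a sketch consistent with the data format comment (label in the first column, sensor readings in the remaining columns) is:

def parsePoint(line):
    values = [float(v) for v in line.split(',')]
    # the first column is the label, the remaining columns are the features
    return LabeledPoint(values[0], values[1:])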
group services [135]. Zookeeper is required for coordination of the Storm cluster.
A computation job on the Storm cluster is called a “topology” which is a graph of
computation. A Storm topology comprises of a number of worker processes that are
distributed on the cluster. Each worker process runs a subset of the topology. A topology
is composed of Spouts and Bolts. Spout is a source of streams (sequence of tuples), for
example, a sensor data stream. The streams emitted by the Spouts are processed by the Bolts.
Bolts subscribe to Spouts, consume the streams, process them and emit new streams. A
topology can consist of multiple Spouts and Bolts. Figure 10.26 shows a Storm topology
with one Spout and three Bolts. Bolts 1 and 2 subscribe to the Spout and consume the streams
emitted by the Spout. The outputs of Bolts 1 and 2 are consumed by Bolt-3.
[Figure: Apache Storm cluster showing Nimbus, Zookeeper nodes and Supervisor nodes running worker processes]
On the instance with hostname “zookeeper”, setup Zookeeper by following the instructions
in Box 10.29.
#Create file:
sudo vim /etc/apt/sources.list.d/cloudera.list

cd /usr/lib/zookeeper/bin/
sudo ./zkServer.sh start
On the instances with hostnames “nimbus” and “supervisor”, install Storm by following
the instructions shown in Box 10.30.
#ZEROMQ INSTALLATION
wget [Link]
tar -xzf zeromq-2.1.7.tar.gz
cd zeromq-2.1.7
./configure
make
sudo make install

#JZMQ INSTALLATION
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
cd jzmq/src
touch classdist_noinst.stamp
CLASSPATH=.:./.:$CLASSPATH javac -d . org/zeromq/ZMQ.java \
  org/zeromq/ZMQException.java org/zeromq/ZMQQueue.java \
  org/zeromq/ZMQForwarder.java org/zeromq/ZMQStreamer.java
cd ..
./autogen.sh
./configure
make
sudo make install

#STORM INSTALLATION
wget [Link]
unzip storm-0.8.2.zip
sudo ln -s storm-0.8.2 storm
vim .bashrc
PATH=$PATH:"/home/ubuntu/storm"
source .bashrc
After installing Storm, edit the configuration file and enter the IP addresses of the Nimbus
and Zookeeper nodes as shown in Box 10.31. You can then launch Nimbus and Storm UI. The
Storm UI can be viewed in the browser at the address [Link]
Figures 10.27 and 10.28 show screenshots of the Storm UI. The commands for submitting
topologies to Storm are shown in Box 10.31.
storm.zookeeper.servers:
    - "[Link]"               #IP Address of Zookeeper node
nimbus.host: "192.168.1.21"      #IP Address of Nimbus node
#Launch nimbus:
cd storm
bin/storm nimbus

#Launch UI:
bin/storm ui

#Submit topology:
bin/storm jar storm-starter-0.0.1-SNAPSHOT.jar [Link] my-topology

#Kill topology
bin/storm kill my-topology
Figure 10.27: Screenshot of Storm UI showing cluster, topology and supervisor summary
[Figure 10.28: Screenshot of Storm UI showing topology summary, topology actions and topology stats]
Figure 10.29: Using Apache Storm for real-time analysis of IoT data
[Figure: Approaches for real-time analysis of IoT data with Apache Storm - the IoT device, cloud services (REST- and WebSocket/WAMP-based communication), database and the Storm topology (spout and bolts) for real-time analysis]
=» Box 10.32: Training and saving Decision Tree classifier for forest fire detection
import numpy as np
from sklearn.tree import DecisionTreeClassifier
import csv
import cPickle

train_data = np.array(train_data)
X_train = train_data[0::, 1::]
y_train = train_data[0::, 0]
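The training and pickling steps themselves are not reproduced above; they might look like the following sketch (the output file name is an assumption):

# Train the Decision Tree classifier on the labeled sensor data
model = DecisionTreeClassifier()
model = model.fit(X_train, y_train)

# Save the trained model so that the Storm Bolt can load it later
# ('classifier.pkl' is an assumed file name)
with open('classifier.pkl', 'wb') as f:
    cPickle.dump(model, f)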
To use the saved Decision Tree classifier for predictions, the next step is to create a new
Storm project. Box 10.33 shows the commands for creating a Storm project and Box 10.34
shows a sample configuration file for the project.
= Box 10.34: Configuration file for forest fire detection Storm app - [Link]
<project xmlns="[Link]"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="[Link]
    [Link]">
<modelVersion>4.0.0</modelVersion>
<groupId>com.[Link]</groupId>
Figure 10.31: Example of a generated decision tree for forest fire detection
<artifactId>forest-app</artifactId>
<packaging>jar</packaging>
<version>1.0-SNAPSHOT</version>
<name>forest-app</name>
<url>[Link]</url>
<build>
<resources>
<resource>
<directory>$ {basedir}/multilang</directory>
</resource>
</resources>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>2.3.2</version>
<configuration>
<source>1.6</source>
<target>1.6</target>
<compilerVersion>1.6</compilerVersion>
</configuration>
</plugin>
</plugins>
</build>
<repositories>
<repository>
<id>[Link]</id>
<url>[Link]</url>
</repository>
</repositories>
<dependencies>
<dependency>
<groupId>storm</groupId>
<artifactId>storm</artifactId>
<version>0.8.0</version>
</dependency>
</dependencies>
</project>
After creating the project, within the project directory create directories
(/multilang/resources) and then create Python programs for Spout and Bolt within the
resources directory. Box 10.35 shows the Python code for the Storm Spout. The Spout
connects to the Xively data streams for temperature, humidity, light and CO data and retrieves
the data. The Spout emits the sensor data as streams of comma separated values.
def getData():
    #Format of data - "temperature,humidity,light,co"
    #e.g. 80,20,6952,10
    data = [Link]("forest_data").current_value
    return data

def nextTuple(self):
    time.sleep(2)
    data = getData()
    storm.emit([data])

SensorSpout().run()
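Putting these pieces together, a complete multilang Spout might look like the following sketch (it assumes the storm.py multilang helper module shipped with the storm-starter project, and a getData function like the one above):

import storm
import time

class SensorSpout(storm.Spout):
    def initialize(self, conf, context):
        self._conf = conf
        self._context = context

    def nextTuple(self):
        time.sleep(2)
        data = getData()          # "temperature,humidity,light,co"
        storm.emit([data])

SensorSpout().run()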
Box 10.36 shows the Python code for the Storm Bolt. This Bolt receives the streams of
sensor data emitted by the Spout. The Decision Tree classifier saved earlier is used in the
Bolt to make predictions.
import storm
import numpy as np
import cPickle
from sklearn.tree import DecisionTreeClassifier

test_data = []
test_data.append(data)
test_data = np.array(test_data)
X_test = test_data[0::, 0::]
storm.emit([result])
SensorBolt().run()
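A complete Bolt that loads the pickled classifier and uses it for prediction might look like the following sketch (again assuming the storm.py multilang helper module; the model file name matches the assumption made while saving it):

import storm
import numpy as np
import cPickle

class SensorBolt(storm.BasicBolt):
    def initialize(self, conf, context):
        # load the Decision Tree classifier saved during training
        self.model = cPickle.load(open('classifier.pkl', 'rb'))

    def process(self, tup):
        readings = tup.values[0].split(',')          # "temperature,humidity,light,co"
        test_data = np.array([[float(v) for v in readings]])
        result = self.model.predict(test_data)
        storm.emit([str(result[0])])

SensorBolt().run()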
After the Spout and Bolt programs are created the next step is to create a topology.
Box 10.37 shows the Java program for creating a topology. To create a topology an object of
the TopologyBuilder class is created. The Spout and Bolt are defined using the setSpout and
setBolt methods. These methods take as input a user-specified id, objects of the Spout/Bolt
classes, and the amount of parallelism required. Storm has two modes of operation - local
and distributed. In the local mode, Storm simulates worker nodes within a local process. The
distributed mode runs on the Storm cluster. The program in Box 10.37 shows the code for
submitting topology to both local and distributed modes.
With all the project files created, the final step is to build and run the project. Box 10.38
shows the commands for building and running a Storm project.
package [Link];
import [Link];
import [Link];
import [Link];
import [Link];
@Override
public void declareOutputFields (OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("sensordata"));
}
@Override
public Map<String, Object> getComponentConfiguration() {
    return null;
}
@Override
public void declareOutputFields (OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("sensordata"));
}
@Override
public Map<String, Object> getComponentConfiguration() {
return null;
}
// conf, Utils and cluster are defined in the full program (not shown in this fragment)
conf.setDebug(true);
Utils.sleep(10000);
cluster.shutdown();
import time
import datetime
import dhtreader
import spidev
from random import randint
from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks
from autobahn.twisted.util import sleep
from autobahn.twisted.wamp import ApplicationSession
#Initialize DHT22
dhtreader.init()
volts = round(volts, places)
lux = 500 * (3.3 - volts) / (R * volts)
return lux
client = MongoClient('localhost', 27017)
db = client['mydb']
# ZeroMQ Context
context = zmq.Context()
#Subscribe to Topic
yield self.subscribe(on_event, '[Link]')
# ZeroMQ Context
context = zmq.Context()
sock = context.socket([Link])
sock.connect("tcp://127.0.0.1:5690")
SensorSpout().run()
import storm
import numpy as np
import cPickle
from sklearn.tree import DecisionTreeClassifier

data_without_timestamp = data[1:]
test_data = []
test_data.append(data_without_timestamp)
test_data = np.array(test_data)
X_test = test_data[0::, 0::]
storm.emit([result])
SensorBolt().run()
Figure 10.32: Schematic diagram of IoT device for structural health monitoring
Discrete Fourier Transform (DFT) is useful for converting a sampled signal from time
domain to frequency domain which makes the analysis of the signal easier. However, for
streaming vibration data in which the spectral content changes over time, the DFT cannot
reveal the transitions in the spectral content. The Short Time Fourier Transform (STFT) is better
suited for revealing the changes in the spectral content of the SHM data. To compute the STFT,
the signal is multiplied by a window function which is non-zero only over a short interval, and the
Fourier transform of the windowed signal is computed as the window slides along the signal:

$$X(m, \omega) = \sum_{n=-\infty}^{+\infty} x[n]\, w[n-m]\, e^{-j\omega n}$$

where w[n] is a window function. Commonly used window functions are the Hann and
Hamming windows.
Alternatively, the STFT can be interpreted as a filtering operation, in which the signal is passed through a band-pass filter whose impulse response is derived from the window function.
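A minimal STFT implementation along these lines might look like the following sketch (it uses numpy with a Hann window; the frame size and hop length are assumptions); the Bolt described later in this section applies such a function to each axis of the vibration data:

import numpy as np

def stft(x, framesize=256, hop=128):
    # multiply each frame by a Hann window and take its Fourier transform
    w = np.hanning(framesize)
    frames = [np.fft.rfft(w * x[i:i + framesize])
              for i in range(0, len(x) - framesize + 1, hop)]
    return np.array(frames)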
import time
import datetime
from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks
from autobahn.twisted.util import sleep
from autobahn.twisted.wamp import ApplicationSession
adxl345 = ADXL345()

for i in range(1000):
    axes = adxl345.getAxes(True)     # read a new sample on each iteration
    x.append(axes['x'])
    y.append(axes['y'])
    z.append(axes['z'])

datalist = [timestamp, x, y, z]
return datalist
client = MongoClient('localhost', 27017)
db = client['mydb']
collection = db['iotcollection']
# ZeroMQ Context
context = zmq.Context()
#Subscribe to Topic
yield self.subscribe(on_event, '[Link]')
# ZeroMQ Context
context = zmq.Context()
sock = context.socket([Link])
sock.connect("tcp://[Link]:5690")
def nextTuple(self):
    time.sleep(2)
    data = sock.recv()
    storm.emit([data])

SensorSpout().run()
import storm
import scipy
# stft(x): x - signal, fs - sample rate, framesize - frame size
# (the definition of the stft helper is not shown in this fragment)

x = data[1]
y = data[2]
z = data[3]

X = stft(x)
Y = stft(y)
Z = stft(z)

output = [Link](X)
result = "STFT - X " + str(output)
output = [Link](Y)
result = result + "STFT - Y " + str(output)
output = [Link](Z)
result = result + "STFT - Z " + str(output)

storm.emit([result])
SensorBolt().run()
Summary
In this chapter you learned about various tools for analyzing IoT data. IoT systems can have
varied data analysis requirements. For some IoT systems, the volume of data is so huge that
analyzing the data on a single machine is not possible. For such systems, distributed batch data
analytics frameworks such as Apache Hadoop can be used for data analysis. For IoT systems
which have real-time data analysis requirements, tools such as Apache Storm are useful. For
IoT systems which require interactive querying of data, tools such as Apache Spark can be
used. Hadoop is an open source framework for distributed batch processing of massive scale
data. Hadoop MapReduce provides a data processing model and an execution environment
for MapReduce jobs for large scale data processing. Key processes of Hadoop include
NameNode, Secondary NameNode, JobTracker, TaskTracker and DataNode. NameNode
keeps the directory tree of all files in the file system, and tracks where across the cluster the
file data is kept. Secondary NameNode creates checkpoints of the namespace. JobTracker
distributes MapReduce tasks to specific nodes in the cluster. TaskTracker accepts Map,
Reduce and Shuffle tasks from the JobTracker. DataNode stores data in an HDFS file system.
You learned how to setup a Hadoop cluster and run MapReduce jobs on the cluster. You
learned about the next generation architecture of Hadoop called YARN. YARN is a framework
for job scheduling and cluster resource management. Key components of YARN include
Resource Manager, Application Master, Node Manager and Containers. You learned about
the Oozie workflow scheduler system that allows managing Hadoop jobs. You learned about
the Apache Spark in-memory cluster computing framework. Spark supports various high-level
tools for data analysis such as Spark Streaming for streaming jobs, Spark SQL for analysis
of structured data, MLlib machine learning library for Spark, GraphX for graph processing
and Shark (Hive on Spark). Finally, you learned about Apache Storm which is a framework
for distributed and fault-tolerant real-time computation.
Lab Exercises
1. In this exercise you will create a multi-node Hadoop cluster on a cloud. Follow the
steps below:
• Create an Amazon Web Services account.
• From the Amazon EC2 console, launch two [Link] EC2 instances.
• When the instances start running, note the public DNS addresses of the instances.
• Connect to the instances using SSH.
• Run the commands given in Box 10.2 to install Java on each instance.
• Run the commands given in Box 10.3 to install Hadoop on each instance.
• Configure Hadoop. Use the templates for [Link], [Link],
[Link] and the masters and slaves files shown in Boxes 10.4 - 10.7.
• Start the Hadoop cluster using the commands shown in Box 10.8.
• In a browser, open the Hadoop cluster status pages:
public-DNS-of-hadoop-master:50070
public-DNS-of-hadoop-master:50030
2. In this exercise you will run a MapReduce job on a Hadoop cluster for aggregating
data (computing mean, maximum and minimum) on various timescales. Follow the
steps below:
• Generate synthetic data using the following Python program:

for j in range(0, readings):
    timestamp = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')
    data = timestamp + "," + str(randrange(0, 100)) + \
        "," + str(randrange(0, 100)) + "," + str(randrange(0, 10000)) + \
        "," + str(randrange(200, 400))
    f.write(data + "\n")      # f is the output file opened earlier (not shown)
f.close()
cool, chilly, cold, freezing, humid, dry. Follow the steps below:
• Save the weather monitoring data to a text or CSV file and manually classify and
label the data (50-100 rows). For example:
#Format of labeled file: