This chapter presents some technical details about data formats, streaming, optimization
of computation, and distributed deployment of optimized learning algorithms.
Chapter 22 provides additional optimization details. We show format conversion and
working with XML, SQL, JSON, CSV, SAS and other data objects. In addition, we
illustrate SQL server queries, describe protocols for managing, classifying and predicting
outcomes from data streams, and demonstrate strategies for optimization, improvement
of computational performance, parallel (MPI) and graphics (GPU) computing.
The Internet of Things (IoT) leads to a paradigm shift of scientific inference –
from static data interrogated in a batch or distributed environment to on-demand
service-based Cloud computing. Here, we will demonstrate how to work with
specialized data, data-streams, and SQL databases, as well as develop and assess
on-the-fly data modeling, classification, prediction and forecasting methods. Impor-
tant examples to keep in mind throughout this chapter include high-frequency data
delivered in real time in hospital ICUs (e.g., microsecond Electroencephalography
signals, EEGs), dynamically changing stock market data (e.g., Dow Jones Industrial
Average Index, DJI), and weather patterns.
We will present (1) format conversion of XML, SQL, JSON, CSV, SAS and other
data objects, (2) visualization of bioinformatics and network data, (3) protocols for
managing, classifying and predicting outcomes from data streams, (4) strategies for
optimization, improvement of computational performance, parallel (MPI) and
graphics (GPU) computing, and (5) processing of very large datasets.
Unlike the case studies we saw in the previous chapters, some real world data may
not always be nicely formatted, e.g., as CSV files. We must collect, arrange, wrangle,
and harmonize scattered information to generate computable data objects that can be
further processed by various techniques. Data wrangling and preprocessing may take
over 80% of the time researchers spend interrogating complex multi-source data
archives. The following procedures will enhance your skills in collecting and han-
dling heterogeneous real world data. Multiple examples of handling long-and-wide
data, messy and tidy data, and data cleaning strategies can be found in this JSS Tidy
Data article by Hadley Wickham.
The R package rio imports and exports various types of file formats, e.g.,
tab-separated (.tsv), comma-separated (.csv), JSON (.json), Stata (.dta),
SPSS (.sav and .por), Microsoft Excel (.xls and .xlsx), Weka (.arff), and
SAS (.sas7bdat and .xpt).
rio provides three important functions import(), export() and convert().
They are intuitive, easy to understand, and efficient to execute. Take Stata (.dta) files
as an example. First, we can download 02_Nof1_Data.dta from our datasets folder.
# install.packages("rio")
library(rio)
# Download the Stata .dta file locally first
# Local data can be loaded by:
#nof1<-import("02_Nof1_Data.dta")
# the data can also be loaded from the server remotely as well:
nof1 <- read.csv("https://umich.instructure.com/files/330385/download?download_frd=1")
str(nof1)
The data are automatically stored as a data frame. Note that rio sets
stringsAsFactors = FALSE by default.
rio can help us export files into any other format we choose. To do this we have
to use the export() function.
export(nof1, "02_Nof1.xlsx")
This line of code exports the Nof1 data in xlsx format to the R working
directory. Mac users may have a problem exporting *.xlsx files using rio
because of a missing zip tool, but they can still output other formats such as ".csv".
An alternative strategy to save an xlsx file is to use the package xlsx with the default
row.names = TRUE.
rio also provides a one step process to convert and save data into alternative
formats. The following simple code allows us to convert and save the
02_Nof1_Data.dta file we just downloaded into a CSV file.
# convert("02_Nof1_Data.dta", "02_Nof1_Data.csv")
convert("02_Nof1.xlsx",
"02_Nof1_Data.csv")
You will see a new CSV file appear in the current working directory. Similar
transformations are available for other data formats and types.
Let’s use as an example the CDC Behavioral Risk Factor Surveillance System
(BRFSS) Data, 2013-2015. This file for the combined landline and cell phone data
set was exported from SAS V9.3 in the XPT transport format. This file contains
330 variables and can be imported into SPSS or STATA. Please note: some of the
variable labels get truncated in the process of converting to the XPT format.
Be careful – this compressed (ZIP) file is over 315MB in size!
# install.packages("Hmisc")
library(Hmisc)
memory.size(max=T)
## [1] 115.81
pathToZip <- tempfile()
download.file("http://www.socr.umich.edu/data/DSPA/BRFSS_2013_2014_2015.zip"
, pathToZip)
# let's just pull two of the 3 years of data (2013 and 2015)
brfss_2013 <- sasxport.get(unzip(pathToZip)[1])
dim(brfss_2013); object.size(brfss_2013)
## 685581232 bytes
# clean up
unlink(pathToZip)
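# (assumed) define the binary outcome used in the logistic model below; the
# BRFSS health-plan indicator variable name (hlthpln1) is an assumption
brfss_2013$has_plan <- brfss_2013$hlthpln1 == 1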
system.time(
gml1 <- glm(has_plan ~ as.factor(x.race), data=brfss_2013,
family=binomial)
) # report execution time
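# (assumed) the model summary shown below was produced by a call like:
summary(gml1)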
##
## Call:
## glm(formula = has_plan ~ as.factor(x.race), family = binomial,
## data = brfss_2013)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1862 0.4385 0.4385 0.4385 0.8047
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.293549 0.005649 406.044 <2e-16 ***
## as.factor(x.race)2 -0.721676 0.014536 -49.647 <2e-16 ***
## as.factor(x.race)3 -0.511776 0.032974 -15.520 <2e-16 ***
## as.factor(x.race)4 -0.329489 0.031726 -10.386 <2e-16 ***
## as.factor(x.race)5 -1.119329 0.060153 -18.608 <2e-16 ***
## as.factor(x.race)6 -0.544458 0.054535 -9.984 <2e-16 ***
## as.factor(x.race)7 -0.510452 0.030346 -16.821 <2e-16 ***
## as.factor(x.race)8 -1.332005 0.012915 -103.138 <2e-16 ***
## as.factor(x.race)9 -0.582204 0.030604 -19.024 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
Next, we'll examine the odds (or rather, the log odds ratio, LOR) of having a health
care plan (HCP) by race (R). The LORs are calculated for two array dimensions,
separately for each race level (presence of a health care plan is binary, whereas
race has 9 levels, R1, R2, . . ., R9). For example, the odds ratio of having a HCP
for R1 : R2 is:

$$OR(R_1 : R_2) = \frac{\dfrac{P(HCP \mid R_1)}{1 - P(HCP \mid R_1)}}{\dfrac{P(HCP \mid R_2)}{1 - P(HCP \mid R_2)}}.$$
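Since the fitted logistic model uses a logit link, the log odds-ratios of having a HCP for each race level relative to the reference level R1 can be read directly off the model coefficients; a minimal sketch is:

# log odds-ratios LOR(R_i : R_1), i = 2, ..., 9, and the corresponding odds ratios
lor_by_race <- coef(gml1)[-1]     # drop the intercept
or_by_race <- exp(lor_by_race)
or_by_race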
# install.packages("DBI"); install.packages("RMySQL")
# install.packages("RODBC"); library(RODBC)
library(DBI)
library(RMySQL)
Completing SQL database commands like these requires access to the remote
UCSC SQL Genome server and user-specific credentials. A functional version of this
example is available on the DSPA website. Below is another example that all
readers can run, as it relies only on local services.
# install.packages("RSQLite")
library("RSQLite")
## <SQLiteConnection>
## Path: :memory:
## Extensions: TRUE
dbListTables(myConnection)
## character(0)
# Add tables to the local SQL DB
data(USArrests); dbWriteTable(myConnection, "USArrests", USArrests)
## [1] TRUE
## [1] TRUE
dbListTables(myConnection);
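# (assumed) overall average Assault rate across all states, matching the output below
myQuery <- dbGetQuery(myConnection, "SELECT avg(Assault) FROM USArrests"); myQuery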
## avg(Assault)
## 1 170.76
myQuery <- dbGetQuery(myConnection, "SELECT avg(Assault) FROM USArrests GROUP BY UrbanPop"); myQuery
## avg(Assault)
## 1 48.00
## 2 81.00
## 3 152.00
## 4 211.50
## 5 271.00
## 6 190.00
## 7 83.00
## 8 109.00
## 9 109.00
## 10 120.00
## 11 57.00
## 12 56.00
## 13 236.00
## 14 188.00
## 15 186.00
## 16 102.00
## 17 156.00
## 18 113.00
## 19 122.25
## 20 229.50
## 21 151.00
## 22 231.50
## 23 172.00
## 24 145.00
## 25 255.00
## 26 120.00
## 27 110.00
## 28 204.00
## 29 237.50
## 30 252.00
## 31 147.50
## 32 149.00
## 33 254.00
## 34 174.00
## 35 159.00
## 36 276.00
## avg(poorhlth)
## 1 56.25466
## 2 53.99962
## 3 58.85072
## 4 66.26757
myQuery1_13 - myQuery1_15
## avg(poorhlth)
## 1 0.4992652
## 2 -1.4952515
## 3 -2.5037326
## 4 -1.3536797
# reset the DB query
# dbClearResult(myQuery)
# clean up
dbDisconnect(myConnection)
## [1] TRUE
We are already familiar with (pseudo) random number generation (e.g.,
rnorm(100, 10, 4) or runif(100, 10, 20)), which algorithmically generates
computer values following specified distributions. There are also web services,
e.g., random.org, that can provide true random numbers based on atmospheric
noise, rather than using a pseudo-random number generation protocol. Below is one
example of generating a total of 300 random integers between 100 and 200
(in decimal format), arranged in 3 columns of 100 rows each.
# https://www.random.org/integers/?num=300&min=100&max=200&col=3&base=10&format=plain&rnd=new
siteURL <- "http://random.org/integers/"  # base URL
shortQuery <- "num=300&min=100&max=200&col=3&base=10&format=plain&rnd=new"
completeQuery <- paste(siteURL, shortQuery, sep="?")  # concatenate the URL and the query string
rngNumbers <- read.table(file=completeQuery)  # read the data
rngNumbers
## V1 V2 V3
## 1 144 179 131
## 2 127 160 150
## 3 142 169 109
…
## 98 178 103 134
## 99 173 178 156
## 100 117 118 110
The RCurl package provides powerful tools for extracting and scraping information
from websites. Let's install it and extract information from the SOCR website.
# install.packages("RCurl")
library(RCurl)
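The call that fetched the web object discussed next is not shown here; it was presumably similar to the one used later in this section:

web <- getURL("http://wiki.socr.umich.edu/index.php/SOCR_Data", followlocation = TRUE)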
The web object looks incomprehensible. This is because most websites are
wrapped in XML/HTML hypertext or include JSON-formatted metadata. RCurl
deals with special HTML tags and website metadata.
To work with the web page content only, the httr package may be a better choice than
RCurl. It returns a list that is much easier to interpret.
#install.packages("httr")
library(httr)
web<-GET("http://wiki.socr.umich.edu/index.php/SOCR_Data")
str(web[1:3])
## List of 3
## $ url : chr "http://wiki.socr.umich.edu/index.php/SOCR_Data"
## $ status_code: int 200
## $ headers :List of 12
## ..$ date : chr "Mon, 03 Jul 2017 19:09:56 GMT"
## ..$ server : chr "Apache/2.2.15 (Red Hat)"
## ..$ x-powered-by : chr "PHP/5.3.3"
## ..$ x-content-type-options: chr "nosniff"
## ..$ content-language : chr "en"
## ..$ vary : chr "Accept-Encoding,Cookie"
## ..$ expires : chr "Thu, 01 Jan 1970 00:00:00 GMT"
## ..$ cache-control : chr "private, must-revalidate, max-age=0"
## ..$ last-modified : chr "Sat, 22 Oct 2016 21:46:21 GMT"
## ..$ connection : chr "close"
## ..$ transfer-encoding : chr "chunked"
## ..$ content-type : chr "text/html; charset=UTF-8"
## ..- attr(*, "class")= chr [1:2] "insensitive" "list"
A combination of the RCurl and the XML packages could help us extract only the
plain text in our desired webpages. This would be very helpful to get information
from heavy text-based websites.
web <- getURL("http://wiki.socr.umich.edu/index.php/SOCR_Data", followlocation = TRUE)
#install.packages("XML")
library(XML)
web.parsed<-htmlParse(web, asText = T)
plain.text<-xpathSApply(web.parsed, "//p", xmlValue)
cat(paste(plain.text, collapse = "\n"))
## The links below contain a number of datasets that may be used for demonst
ration purposes in probability and statistics education. There are two types
of data - simulated (computer-generated using random sampling) and observed
(research, observationally or experimentally acquired).
##
## The SOCR resources provide a number of mechanisms to simulate data using
computer random-number generators. Here are some of the most commonly used S
OCR generators of simulated data:
##
## The following collections include a number of real observed datasets from
different disciplines, acquired using different techniques and applicable in
different situations.
##
## In addition to human interactions with the SOCR Data, we provide several
machine interfaces to consume and process these data.
##
## Translate this page:
##
## (default)
##
## Deutsch
…
## România
##
## Sverige
Here we extracted all plain text between the starting and ending paragraph
HTML tags, <p> and </p>.
More information about extracting text from XML/HTML to text via XPath is
available online.
The process of extracting data from complete web pages and storing it in a structured
data format is called scraping. Before starting to scrape a website, we need to
understand its underlying HTML structure. We also have to check the website's
terms of use to make sure that scraping is allowed.
The R package rvest is a very good place to start “harvesting” data from
websites.
To start with, we use read_html() to store the SOCR data website into an
XML node object.
library(rvest)
SOCR<-read_html("http://wiki.socr.umich.edu/index.php/SOCR_Data")
SOCR
## {xml_document}
## <html lang="en" dir="ltr" class="client-nojs">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=
...
## [2] <body class="mediawiki ltr sitedir-ltr ns-0 ns-subject page-SOCR_Dat
...
From the summary structure of SOCR, we can discover that there are two important
hypertext section markups <head> and <body>. Also, notice that the SOCR data
website uses <title> and </title> tags to separate title in the <head> section.
Let’s use html_node() to extract title information based on this knowledge.
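A minimal sketch of that extraction (the CSS selector is an assumption) is:

SOCR %>% html_node("head title") %>% html_text()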
Here we used the %>% operator, or pipe, to connect two functions. The above line of
code creates a chain of functions operating on the SOCR object. The first function in
the chain, html_node(), extracts the title from the <head> section. Then,
html_text() converts the HTML-formatted hypertext into plain text. More on R
piping can be found in the magrittr package documentation.
Another function, rvest::html_nodes() can be very helpful in scraping.
Similar to html_node(), html_nodes() can help us extract multiple nodes in
an xmlnode object. Assume that we want to obtain the meta elements (usually page
description, keywords, author of the document, last modified, and other metadata)
from the SOCR data website. We apply html_nodes() to the SOCR object to
extract the hypertext data, e.g., lines starting with <meta> in the <head> section of
the HTML page source. Optionally, html_attrs(), which extracts
attributes, text, and tag names from HTML, can be used to obtain the main attributes.
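A sketch of the call producing the attribute list shown below (the selector is an assumption) is:

SOCR %>% html_nodes("head meta") %>% html_attrs()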
## [[1]]
## http-equiv content
## "Content-Type" "text/html; charset=UTF-8"
##
## [[2]]
## charset
## "UTF-8"
##
## [[3]]
## http-equiv content
## "X-UA-Compatible" "IE=EDGE"
##
## [[4]]
## name content
## "generator" "MediaWiki 1.23.1"
##
## [[5]]
## name content
## "ResourceLoaderDynamicStyles" ""
library(httr)
nof1 <- GET("https://umich.instructure.com/files/1760327/download?download_frd=1")
nof1
## Response [https://instructure-uploads.s3.amazonaws.com/account_1770000000
0000001/attachments/1760327/02_Nof1_Data.json?response-content-disposition=a
ttachment%3B%20filename%3D%2202_Nof1_Data.json%22%3B%20filename%2A%3DUTF-8%2
7%2702%255FNof1%255FData.json&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credent
ial=AKIAJFNFXH2V2O7RPCAA%2F20170703%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Da
te=20170703T190959Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signa
ture=ceb3be3e71d9c370239bab558fcb0191bc829b98a7ba61ac86e27a2fc3c1e8ce]
We can see that JSON objects are very simple. The data structure is organized
using hierarchies marked by square brackets. Each piece of information is formatted
as a {key:value} pair.
The jsonlite package is a very useful tool for importing online JSON-formatted
datasets directly into a data frame. Its syntax is very straightforward.
#install.packages("jsonlite")
library(jsonlite)
nof1_lite<-
fromJSON("https://umich.instructure.com/files/1760327/download?download_frd=1")
class(nof1_lite)
## [1] "data.frame"
We can convert an xlsx dataset into CSV format and use read.csv() to load this kind of
dataset. However, R provides an alternative, the read.xlsx() function in the package
xlsx, to simplify this process. Take the 02_Nof1_Data.xls data from the course files
as an example. We need to download the file first.
# install.packages("xlsx")
library(xlsx)
nof1<-read.xlsx("C:/Users/Folder/02_Nof1.xlsx", 1)
str(nof1)
## 'data.frame': 900 obs. of 10 variables:
## $ ID : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Day : num 1 2 3 4 5 6 7 8 9 10 ...
## $ Tx : num 1 1 0 0 1 1 0 0 1 1 ...
The last argument, 1, refers to the first Excel worksheet, as any Excel file may include
a large number of sheets. Also, we can download the xls or xlsx file into our
R working directory so that the file path is easier to specify.
Sometimes more complex protocols may be necessary to ingest data from XLSX
documents, for instance, when the XLSX document is large, includes many tables, and is
only accessible via HTTP from a web server. Below is an example of
downloading the second table, ABIDE_Aggregated_Data, from the multi-
table Autism/ABIDE XLSX dataset:
# install.packages("openxlsx"); library(openxlsx)
tmp = tempfile(fileext = ".xlsx")
download.file(url = "https://umich.instructure.com/files/3225493/download?do
wnload_frd=1",
destfile = tmp, mode="wb") df_Autism <- openxlsx::read.xlsx(xlsxFile = tmp,
sheet = "ABIDE_Aggregated_Data", skipEmptyRows = TRUE)
dim(df_Autism)
Genetic data are stored in widely varying formats and usually have more feature
variables than observations. They could have 1,000 columns and only 200 rows. One
of the commonly used pre-processing steps for such datasets is variable selection. We
will talk about this in Chap. 17.
The Bioconductor project created powerful R functionality (packages and tools)
for analyzing genomic data, see Bioconductor for more detailed information.
Social network data and graph datasets describe the relations between nodes (vertices)
using connections (links or edges) joining the node objects. If we have
N objects, there can be up to N * (N − 1) directed links establishing pairwise associations
between the nodes. Let's use an example with N = 4 to demonstrate a simple graph
potentially modeling the node linkage (Table 16.1).
If we change each a → b entry to an indicator variable (0 or 1) capturing whether there is
an edge connecting a pair of nodes, then we get the graph adjacency matrix.
Edge lists provide an alternative way to represent network connections. Every
line in the list contains a connection between two nodes (objects) (Table 16.2).
The edge list on Table 16.2 lists three network connections: object 1 is linked to
object 2; object 1 is linked to object 3; and object 2 is linked to object 3. Note that
edge lists can represent both directed as well as undirected networks or graphs.
We can imagine that if N is very large, e.g., social networks, the data represen-
tation and analysis may be resource intense (memory or computation). In R, we have
multiple packages that can deal with social network data. One user-friendly example
is provided using the igraph package. First, let’s build a toy example and visualize
it using this package (Fig. 16.1).
#install.packages("igraph")
library(igraph)
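The toy graph construction is not shown; a hedged sketch (the specific edges are arbitrary, chosen only to produce a small undirected network like the one in Fig. 16.1) is:

# build a small undirected graph from a two-column character edge list and plot it
edge_list <- matrix(c("1","2", "1","3", "2","3", "3","4", "4","7", "4","10"),
                    ncol = 2, byrow = TRUE)
g <- graph_from_edgelist(edge_list, directed = FALSE)
plot(g)   # Fig. 16.1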
Fig. 16.1 Visualization of the simple toy network rendered with igraph
Now let’s examine the co-appearance network of Facebook circles. The data
contains anonymized circles (friends lists) from Facebook collected from survey
participants using a Facebook app. The dataset only includes edges (circles, 88,234)
connecting pairs of nodes (users, 4,039) in the member social networks.
The values on the connections represent the number of links/edges within a circle.
We have a huge edge-list made of scrambled Facebook user IDs. Let’s load this
dataset into R first. The data is stored in a text file. Unlike CSV files, text files in table
format need to be imported using read.table(). We are using the header = F
option to let R know that we don’t have a header in the text file that contains only
tab-separated node pairs (indicating the social connections, edges, between
Facebook users).
soc.net.data <- read.table("https://umich.instructure.com/files/2854431/download?download_frd=1", sep=" ", header=F)
head(soc.net.data)
## V1 V2
## 1 0 1
## 2 0 2
## 3 0 3
## 4 0 4
## 5 0 5
## 6 0 6
Now the data is stored in a data frame. To make this dataset ready for igraph
processing and visualization, we need to convert soc.net.data into a matrix
object.
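The conversion step itself is not shown; a sketch producing the soc.net.data.mat object used below is:

soc.net.data.mat <- as.matrix(soc.net.data, ncol = 2)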
By using ncol = 2, we made a matrix with two columns. The data is now ready
and we can apply graph.edgelist().
# remove the first 347 edges (to wipe out the degenerate "0" node)
graph_m<-graph.edgelist(soc.net.data.mat[-c(0:347), ], directed = F)
Before we display the social network graph we may want to examine our model
first.
summary(graph_m)
This is an extremely brief yet informative summary. The first line U--- 4038
87887 includes potentially four letters and two numbers. The first letter could be U
or D indicating undirected or directed edges. A second letter N would mean that the
objects set has a "name" attribute. A third letter, W, indicates a weighted graph. Since we
didn't add weights in our analysis, the third letter is empty ("-"). A fourth character is
an indicator for bipartite graphs, whose vertices can be divided into two disjoint
sets such that every edge connects a vertex in one set to a vertex in the other set. The
two numbers following the four letters represent the number of nodes and the
number of edges, respectively. Now let’s render the graph (Fig. 16.2).
plot(graph_m)
This graph is very complicated. We can still see that some nodes are connected
to many more neighbors than others. To quantify this, we can use the degree()
function, which lists the number of edges incident to each node.
degree(graph_m)
Skimming the table, we find that the 107th user has as many as 1,044
connections, which makes this user a highly connected hub that likely has
higher social relevance.
Some nodes might be more important than others because they serve as
bridges linking clouds of nodes. To compare their importance, we can use the
betweenness centrality measure. Betweenness centrality quantifies how often a node
lies on the shortest paths between other nodes; high betweenness for a specific node
indicates influence. The betweenness() function calculates this measure.
betweenness(graph_m)
http://socr.umich.edu/html/Navigators.html
http://socr.ucla.edu/SOCR_HyperTree.json
Fig. 16.3 Live demo: a dynamic graph representation of the SOCR resources
We can try another example using SOCR hierarchical data, which is also avail-
able for dynamic exploration as a tree graph. Let’s read its JSON data source using
the jsonlite package (Fig. 16.3).
tree.json<-fromJSON("http://socr.ucla.edu/SOCR_HyperTree.json",
simplifyDataFrame = FALSE)
# install.packages("data.tree")
library(data.tree)
tree.graph<-as.Node(tree.json, mode = "explicit")
In this graph, "About SOCR", which is located at the center, represents the root
node of the tree graph.
The proliferation of Cloud services and the emergence of modern technology in all
aspects of human experience lead to a tsunami of data, much of which is streamed
in real time. The interrogation of such voluminous data is an increasingly important
area of research. Data streams are ordered, often unbounded, sequences of data
points created continuously by a data generator. All of the data mining, interrogation
and forecasting methods we discuss here are also applicable to data streams.
16.3.1 Definition
A data stream is an ordered, potentially unbounded, sequence of data points

$$Y = \{y_1, y_2, y_3, \cdots, y_t, \cdots\},$$
where the (time) index, t, reflects the order of the observation/record, which may be
single numbers, simple vectors in multidimensional space, or objects, e.g., structured
Ann Arbor Weather (JSON) and its corresponding structured form. Some streaming
data is streamed because it’s too large to be downloaded shotgun style and some is
streamed because it’s continually generated and serviced. This presents the potential
problem of dealing with data streams that may be unlimited.
Notes:
• Data sources: Real or synthetic stream data can be used. Random simulation
streams may be created by rstream. Real stream data may be piped from
financial data providers, the WHO, World Bank, NCAR and other sources.
• Inference Techniques: Many of the data interrogation techniques we have seen
can be employed for dynamic stream data, e.g., factas, for PCA, rEMM and
birch for clustering, etc. Clustering and classification methods capable of
processing data streams have been developed, e.g., Very Fast Decision Trees
(VFDT), time window-based Online Information Network (OLIN), On-demand
Classification, and the APRIORI streaming algorithm.
• Cloud distributed computing: Hadoop2/HadoopStreaming, SPARK, Storm3/
RStorm provide environments to expand batch/script-based R tools to the
Cloud.
The R stream package provides data stream mining algorithms using fpc, clue,
cluster, clusterGeneration, MASS, and proxy packages. In addition, the
package streamMOA provides an rJava interface to the Java-based data stream
clustering algorithms available in the Massive Online Analysis (MOA) framework
for stream classification, regression and clustering.
If you need a deeper exposure to data streaming in R, we recommend you go over
the stream vignettes.
This example shows the creation and loading of a mixture of 5 random 2D Gaussians,
centered at (x_coords, y_coords) with pairwise correlations rho_corr, representing
a simulated data stream.
Generate the stream:
# install.packages("stream")
library("stream")
k-Means Clustering
We will now try k-means and the density-based data stream clustering algorithm
D-Stream, where micro-clusters are formed by grid cells of size gridsize whose density
(Cm) is at least 1.2 times the average cell density. The model is updated
with the next 500 data points from the stream.
First, let's run the k-means clustering with k = 5 clusters and plot the resulting
micro- and macro-clusters (Fig. 16.5).
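A sketch of this step (the clusterer name kmc, referenced further below, is an assumption) is:

kmc <- DSC_Kmeans(k = 5)              # k-means clusterer
update(kmc, stream_5G, n = 500)       # cluster 500 streamed points
plot(kmc, stream_5G, type = "both")   # circles: micro-clusters; crosses: macro-clusters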
Fig. 16.5 Micro and macro clusters of a 5-means clustering of the first 500 points of the streamed
simulated 2D Gaussian kernels
In this clustering plot, micro-clusters are shown as circles and macro-clusters are
shown as crosses and their sizes represent the corresponding cluster weight
estimates.
Next, try the density-based data stream clustering algorithm D-Stream. Prior to
updating the model with the next 1,000 data points from the stream, we specify the
grid cell size (gridsize = 0.1) that defines the micro-clusters and the threshold
(Cm = 1.2) that specifies the density of a dense grid cell as a multiple of the average cell
density.
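A sketch of this D-Stream step (the object name dstream_5G is hypothetical) is:

dstream_5G <- DSC_DStream(gridsize = 0.1, Cm = 1.2)
update(dstream_5G, stream_5G, n = 1000)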
We can re-cluster the data using k-means with 5 clusters and plot the resulting
micro- and macro-clusters (Fig. 16.6).
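A sketch of the re-clustering (km_G5 is the name referenced below; dstream_5G refers to the D-Stream sketch above) is:

km_G5 <- DSC_Kmeans(k = 5)
recluster(km_G5, dstream_5G)
plot(km_G5, stream_5G, type = "both")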
Note the subtle changes in the clustering results between kmc and km_G5.
Fig. 16.6 Micro- and macro- clusters of a 5-means clustering of the next 1,000 points of the
streamed simulated 2D Gaussian kernels
For DSD objects, some basic stream functions include print(), plot(), and
write_stream(); the latter can save part of a data stream to disk. DSD_Memory
and DSD_ReadCSV objects also include member functions like reset_stream(),
which resets the position in the stream to its beginning.
To request a new batch of data points from the stream, we use get_points().
This chooses a random cluster (based on the probability weights in p_weight)
and draws a point from the multivariate Gaussian distribution (mean = mu, covariance
matrix = Σ) of that cluster. Below, we pull n = 10 new data points from the stream
(Fig. 16.7).
Fig. 16.7 Scatterplot of the next batch of 700 random Gaussian points in 2D
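# (assumed) pull n = 10 new data points from the stream, producing the output below
get_points(stream_5G, n = 10)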
## X1 X2
## 1 0.4017803 0.2999017
## 2 0.4606262 0.5797737
## 3 0.4611642 0.6617809
## 4 0.3369141 0.2840991
## 5 0.8928082 0.5687830
## 6 0.8706420 0.4282589
## 7 0.2539396 0.2783683
## 8 0.5594320 0.7019670
## 9 0.5030676 0.7560124
## 10 0.7930719 0.0937701
new_p <- get_points(stream_5G, n = 100, class = TRUE)
head(new_p, n = 20)
## X1 X2 class
## 1 0.7915730 0.09533001 4
## 2 0.4305147 0.36953997 2
## 3 0.4914093 0.82120395 3
## 4 0.7837102 0.06771246 4
## 5 0.9233074 0.48164544 5
## 6 0.8606862 0.49399269 5
## 7 0.3191884 0.27607324 2
## 8 0.2528981 0.27596700 2
## 9 0.6627604 0.68988585 3
## 10 0.7902887 0.09402659 4
## 11 0.7926677 0.09030248 4
## 12 0.9393515 0.50259344 5
## 13 0.9333770 0.62817482 5
## 14 0.7906710 0.10125432 4
## 15 0.1798662 0.24967850 2
## 16 0.7985790 0.08324688 4
## 17 0.5247573 0.57527380 3
## 18 0.2358468 0.23087585 2
## 19 0.8818853 0.49668824 5
## 20 0.4255094 0.81789418 3
Note that if you add noise to your stream, e.g., stream_Noise <-
DSD_Gaussians(k = 5, d = 4, noise = .1, p = c(0.1, 0.5, 0.3,
0.9, 0.1)), then the noise points that are not classified as part of any cluster will
have an NA class label.
set.seed(12345)
stream_Bench <- DSD_Benchmark(1)
stream_Bench
This benchmark generator creates two clusters moving in 2D space. One moves
from top-left to bottom-right, the other from bottom-left to top-right. When they meet
at the center of the domain, the two clusters overlap and then split again.
Concept drift in the stream can be depicted by requesting (10) times 300 data
points from the stream and animating the plot. Fast-forwarding the stream can be
accomplished by requesting, but ignoring, (2000) points in between the (10) plots.
The output of the animation below is suppressed to save space.
for(i in 1:10) {
plot(stream_Bench, 300, xlim = c(0, 1), ylim = c(0, 1))
tmp <- get_points(stream_Bench, n = 2000)
}
reset_stream(stream_Bench)
animate_data(stream_Bench,n=8000,horizon=120, xlim=c(0,1), ylim=c(0,1))
# Animations can be saved as HTML or GIF
#saveHTML(ani.replay(), htmlfile = "stream_Bench_Animation.html")
#saveGIF(ani.replay())
These data represent the X and Y spatial knee-pain locations for over 8,000 patients,
along with labels indicating the knee view: Front, Back, Left, and Right. Let's try to read the
SOCR Knee Pain Dataset as a stream.
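The scraping code itself is not shown; a rough sketch (the wiki page URL, the table index, and the column handling are assumptions) is:

library(rvest)
wiki_url <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_Data_KneePainData_041409")
html_nodes(wiki_url, "#content")      # the xml_nodeset printed below
kneeRawData <- html_table(html_nodes(wiki_url, "table")[[2]])
kneeRawData_df <- as.data.frame(kneeRawData)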
## {xml_nodeset (1)}
## [1] <div id="content" class="mw-body-primary" role="main">\n\t<a id="top
...
# View(kneeRawData_df)
We can use the DSD_Memory class (from the stream package) to get a stream interface for matrix or
data frame objects, like the knee pain location dataset. The number of true clusters
in this dataset is k = 4.
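The stream wrapper itself is not shown; a sketch, assuming the knee-pain coordinates have been rescaled to [0, 1] and stored in columns x and y, is:

streamKnee <- DSD_Memory(kneeRawData_df[, c("x", "y")], k = 4, loop = TRUE)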
# Each time we get a point from *streamKnee*, the stream pointer moves to the next position (row) in the data.
get_points(streamKnee, n=10)
## x y
## 1 0.11885895 0.5057803
## 2 0.32488114 0.6040462
## 3 0.31537242 0.4971098
## 4 0.32488114 0.4161850
## 5 0.69413629 0.5289017
## 6 0.32171157 0.4595376
## 7 0.06497623 0.4913295
## 8 0.12519810 0.4682081
## 9 0.32329635 0.4942197
## 10 0.30744849 0.5086705
streamKnee
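The D-Stream clusterer definition is not shown above; a sketch consistent with the parameters printed below is:

dsc_streamKnee <- DSC_DStream(gridsize = 0.1, Cm = 1.2)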
# stream::update
reset_stream(streamKnee, pos = 1)
update(dsc_streamKnee, streamKnee, n = 500)
dsc_streamKnee
## DStream
## Class: DSC_DStream, DSC_Micro, DSC_R, DSC
## Number of micro-clusters: 16
## Number of macro-clusters: 11
Fig. 16.9 Data stream clustering and classification of the SOCR knee-pain dataset (n = 500)
Fig. 16.10 5-Means stream clustering of the SOCR knee pain data
head(get_centers(dsc_streamKnee))
## [,1] [,2]
## [1,] 0.05 0.45
## [2,] 0.05 0.55
## [3,] 0.15 0.35
## [4,] 0.15 0.45
## [5,] 0.15 0.55
## [6,] 0.15 0.65
$$0 \leq Purity = \frac{1}{N}\sum_{i=1}^{k} \max_{j} \left| c_i \cap t_j \right| \leq 1,$$
where N is the number of observed data points, k is the number of clusters, ci is the ith cluster,
and tj is the true class (label group) sharing the maximum number of points with cluster ci.
High purity suggests that we correctly label points (Fig. 16.10).
kMeans_Knee <- DSC_Kmeans(k=5) # use 4-5 clusters matching the 4 knee labels
recluster(kMeans_Knee, dsc_streamKnee)
plot(kMeans_Knee, streamKnee, type = "both")
Fig. 16.11 Animated continuous 5-means stream clustering of the knee pain data
Fig. 16.12 Continuous stream clustering and purity index across iterations
## points purity
## 1 1 0.9600000
## 2 101 0.9043478
## 3 201 0.9500000
…
## 49 4801 0.9047619
## 50 4901 0.8850000
Figure 16.13 shows the average clustering purity as we evaluate the stream clustering
across the streaming points.
## points purity
## 1 1 0.9714286
## 2 101 0.9833333
## 3 201 0.9722222
…
## 49 4801 0.9772727
## 50 4901 0.9777778
Here and in previous chapters, e.g., Chap. 15, we notice that R may sometimes be
slow and memory-inefficient. These problems may be severe, especially for
datasets with millions of records or when using complex functions. There are
packages for processing large datasets and memory optimization – bigmemory,
biganalytics, bigtabulate, etc.
We have also seen long execution times when running processes that ingest, store or
manipulate huge data.frame objects. The dplyr package, created by Hadley
Wickham and Romain François, provides a faster route to manage such large datasets
in R. It creates an object called tbl, similar to data.frame, which has an
in-memory column-like structure. R reads these objects a lot faster than data frames.
To make a tbl object we can either convert an existing data frame to tbl or
connect to an external database. Converting from data frame to tbl is quite easy. All
we need to do is call the function as.tbl().
#install.packages("dplyr")
library(dplyr)
nof1_tbl<-as.tbl(nof1); nof1_tbl
## # A tibble: 900 × 10
## ID Day Tx SelfEff SelfEff25 WPSS SocSuppt PMss PMss3 PhyAct
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 1 33 8 0.97 5.00 4.03 1.03 53
## 2 1 2 1 33 8 -0.17 3.87 4.03 1.03 73
## 3 1 3 0 33 8 0.81 4.84 4.03 1.03 23
…
This looks like a normal data frame. If you are using R Studio, displaying the
nof1_tbl will show the same output as nof1.
#install.packages("data.table")
library(data.table)
nof1<-fread("C:/Users/Dinov/Desktop/02_Nof1_Data.csv")
nof1[ID==1, mean(PhyAct)]
## [1] 52.66667
This useful functionality can also help us run complex operations with only a few
lines of code. One of the drawbacks of using data.table objects is that they are
still limited by the available system memory.
# install.packages("ff")
library(ff)
# vitalsigns<-read.csv.ffdf(file="UQ_VitalSignsData_Case04.csv", header=T)
vitalsigns <- read.csv.ffdf(file="https://umich.instructure.com/files/366335/download?download_frd=1", header=T)
mean(vitalsigns$Pulse)
## Warning in mean.default(vitalsigns$Pulse): argument is not numeric or
## logical: returning NA
## [1] NA
For basic calculations on such large datasets, we can use another package,
ffbase. It allows operations on ffdf objects using simple tasks like: mathemat-
ical operations, query functions, summary statistics and bigger regression models
using packages like biglm, which will be mentioned later in this chapter.
# install.packages("ffbase")
library(ffbase)
mean(vitalsigns$Pulse)
## [1] 108.7185
outputs of all the processes to get the final answer faster. However, parallel algorithms
may require special conditions and cannot be applied to all problems. If two
tasks have to be run in a specific order, the problem cannot be parallelized.
To measure how much time can be saved by different methods, we can use the function
system.time().
system.time(mean(vitalsigns$Pulse))
## user system elapsed
## 0 0 0
This means calculating the mean of the Pulse column in the vitalsigns dataset
takes less than 0.001 seconds. These values will vary between computers, operating
systems, and system states.
We will introduce two packages for parallel computing, multicore and snow
(their core components are included in the package parallel). They each handle
multitasking differently. However, to run these packages, you need to have a
relatively modern multicore computer. Let's check how many cores your computer
has. The function parallel::detectCores() provides this functionality.
parallel is a base package, so there is no need to install it prior to using it.
library(parallel); detectCores()
## [1] 8
So, there are eight (8) cores in my computer. I will be able to run up to 6-8 parallel
jobs on this computer.
The multicore package simply uses the multitasking capabilities of the kernel,
the computer’s operating system, to “fork” additional R sessions that share the same
memory. Imagine that we open several R sessions in parallel and let each of them do
part of the work. Now, let’s examine how this can save time when running complex
protocols or dealing with large datasets. To start with, we can use the mclapply()
function, which is similar to lapply(), applying a function over a vector and
returning a list. Instead of applying the function to the whole vector at once, mclapply()
divides the complete computational task and delegates portions of it to each
available core. To demonstrate this procedure, we will construct a simple, yet time-
consuming task of generating random numbers. Also, we can use the system.time()
function to track the execution time.
set.seed(123)
system.time(c1 <- rnorm(10000000))
# Note: the multicore calls may not work on Windows, but will work on Linux/Mac.
# This shows 2-core and 4-core invocations:
# system.time(c2 <- unlist(mclapply(1:2, function(x) {rnorm(5000000)}, mc.cores = 2)))
# system.time(c4 <- unlist(mclapply(1:4, function(x) {rnorm(2500000)}, mc.cores = 4)))
The unlist() is used at the end to combine results from different cores into a
single vector. Each line of code creates 10,000,000 random numbers. The c1 call
took the longest time to complete. The c2 call used two cores to finish the task (each
core handled 5,000,000 numbers) and used less time than c1. Finally, c4 used all
four cores to finish the task and successfully reduced the overall time. We can see
that when we use more cores the overall time is significantly reduced.
The snow package allows parallel computing on multicore multiprocessor
machines or a network of multiple machines. It might be more difficult to use but
it's also certainly more flexible. First, we can set how many cores we want to use via
the makeCluster() function.
# install.packages("snow")
library(snow)
cl<-makeCluster(2)
This call might cause your computer to pop up a warning message about access
through the firewall. To do the same task, we can use the parLapply() function in the
snow package. Note that we have to pass the cluster object we created with the previous
makeCluster() call.
While using parLapply(), we have to specify the cluster, the data, and the function
that will be applied to the data, as sketched below. Remember to stop the cluster we made after
completing the task, to release the system resources.
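A minimal sketch of such a call (the workload split and the object name are illustrative) is:

# generate 2 x 5,000,000 random numbers, one chunk per cluster node
system.time(c2_snow <- unlist(parLapply(cl, rep(5000000, 2), rnorm)))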
stopCluster(cl)
# install.packages("doParallel")
library(doParallel)
cl<-makeCluster(4)
registerDoParallel(cl)
#install.packages("foreach")
library(foreach)
system.time(c4 <- foreach(i=1:4, .combine = 'c') %dopar% rnorm(2500000))
Here we used four items (each item runs on a separate core); .combine = 'c'
tells foreach to combine the results using the c() function, generating the
aggregate result vector.
Also, don’t forget to close the doParallel by registering the sequential
backend.
unregister<-registerDoSEQ()
Modern computers have graphics cards, GPUs (Graphical Processing Units), that
consist of thousands of cores; however, these cores are very specialized, unlike those of the
standard CPU chip. If we can use this feature for parallel computing, we may
reach amazing performance improvements, at the cost of complicating the
processing algorithms and increasing the constraints on the data format. Specific
disadvantages of GPU computing include reliance on proprietary manufacturer (e.g.,
NVidia) frameworks and the Compute Unified Device Architecture (CUDA) programming
language. CUDA allows programming of GPU instructions in a common
computing language. This paper provides one example of using GPU computation to
significantly improve the performance of advanced neuroimaging and brain mapping
processing of multidimensional data.
The R package gputools is created for parallel computing using NVidia
CUDA. Detailed GPU computing in R information is available online.
As we mentioned earlier, some tasks can be parallelized more easily than others. In real-
world situations, we can pick the algorithms that lend themselves well to
parallelization. Some of the R packages that allow parallel computing using ML
algorithms are listed below.
biglm allows training regression models with data from SQL databases or large
data chunks obtained from the ff package. The output is similar to the standard
lm() function that builds linear models. However, biglm operates efficiently on
massive datasets.
The bigrf package can be used to train random forests combining the foreach
and doParallel packages. In Chap. 15, we presented random forests as machine
learners ensembling multiple tree learners. With parallel computing, we can split the
task of creating thousands of trees into smaller tasks that can be outsourced to each
available compute core. We only need to combine the results at the end. Then, we
will obtain the exact same output in a relatively shorter amount of time.
Combining the caret package with foreach, we can obtain a powerful method
to deal with time-consuming tasks like building a random forest learner. Utilizing the
same example we presented in Chap. 15, we can see the time difference of utilizing
the foreach package.
#library(caret)
system.time(m_rf <- train(CHARLSONSCORE ~ ., data = qol, method = "rf",
metric = "Kappa", trControl = ctrl, tuneGrid = grid_rf))
It took more than a minute to finish this task in the standard execution model,
relying purely on the regular caret function. Below, the same model training completes
much faster using parallelization (less than half the time) compared to the standard
call above.
set.seed(123)
cl<-makeCluster(4)
registerDoParallel(cl)
getDoParWorkers()
## [1] 4
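With the parallel backend registered, re-running the same train() call from above executes the resampling iterations across the four workers; a sketch (reusing the qol, ctrl, and grid_rf objects from Chap. 15) is:

system.time(m_rf_par <- train(CHARLSONSCORE ~ ., data = qol, method = "rf",
                              metric = "Kappa", trControl = ctrl, tuneGrid = grid_rf))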
unregister<-registerDoSEQ()
Try to analyze the co-appearance network in the novel “Les Miserables”. The data
contains the weighted network of co-appearances of characters in Victor Hugo’s
novel “Les Miserables”. Nodes represent characters as indicated by the labels and
edges connect any pair of characters that appear in the same chapter of the book. The
values on the edges are the number of such co-appearances.
miserables <- read.table("https://umich.instructure.com/files/330389/download?download_frd=1", sep="", header=F)
head(miserables)
Also, try to interrogate some of the larger datasets we have by using alternative
parallel computing and big data analytics.
• Download the Main SOCR Wiki Page and compare RCurl and httr.
• Read and write XML code for the SOCR Main Page.
• Scrape the data from the SOCR Main Page.
• Download 03_les_miserablese_GraphData.txt
• Visualize this undirected network.
• Summarize the graph and explain the output.
• Calculate the degree and the centrality measures of this graph.
• Find out some important characters.
• Will the result change or not if we assume the graph is directed?