Mining and Visualising Real-World Data: About This Module
The aims of this module are to:
maximise the insight into the dataset and summarise its main characteristics
provide the basis and support the selection of appropriate Machine Learning tools to be applied
In this module, you will be introduced to the rich set of Python-based tools for data
manipulation and visualisation. You will be using data from a publicly available case study
using an external file ( retail_data.csv ), which you will need to load and pre-process before
you can apply any Machine Learning algorithms. In this module you will learn how to do this.
You will also learn how to further explore and interpret your data through simple measures
and plotting.
Case Study
The dataset we will be using throughout today’s workshop is an aggregated and adapted
version of the Online Retail Case Study (https://archive.ics.uci.edu/ml/datasets/Online+Retail) from
the UCI Machine Learning repository, which is a great source for publicly available real-world
data for Machine Learning purposes. The dataset has been designed for this workshop with
the purpose of modelling the behaviour of customers ("returning" vs. "non-returning"
customers) based on their transaction activity (such as balance, max spent and number of
orders, among others).
http://beta.cambridgespark.com/courses/jpm/02module.html 1/16
2016. 11. 27. Mining and visualising realworld data
Figure 1. Preview of the online retail dataset. A description of the columns of this table can be
found in data/features_description.md .
Answer:
The features are relevant to the activity of the customers ( balance , max_spent ,
n_orders , etc.).
The target variable or class highlights whether the customer is returning or not ("yes"
vs. "no").
Throughout this workshop we will make use of the following Python libraries:
pandas : for high-performance, easy-to-use data structures and data analysis functions.
numpy (NumPy): for its array data structure and data manipulation functions.
Start by opening the provided file jpm-vanilla.ipynb . At the very top of the file, you should
see all the various import statements we will be using throughout today’s workshop. These tell
the Python interpreter (the engine that runs the program) that these libraries are required for
the program to run. In this case:
PYTHON
# compatibility with python2 and 3
from __future__ import print_function, division
# numerical capacity
import scipy.stats
import numpy as np
import pandas as pd
# scikit-learn preprocessing (LabelEncoder, scale)
from sklearn import preprocessing
# matplotlib setup
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# plotly setup
import plotly.plotly as py
from plotly.graph_objs import *
from plotly.tools import FigureFactory as FF
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
init_notebook_mode() (2)
# extra tools
from mpl_toolkits.mplot3d.axes3d import Axes3D
import visplots (1)
# SVM module
from sklearn.svm import SVC
1. The visplots library has been developed in-house, and provides additional plotting
functionality for the visualisation of the classifiers' boundaries and their performance.
2. Run at the start of every IPython notebook to use plotly.offline . This injects the
plotly.js source files into the notebook.
At this first stage, we simply load the data and have a look at the top few lines - we can use the .head() method to achieve this.
PYTHON
# Import the data and explore the first few rows
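One possible solution, sketched here with a small inline CSV standing in for retail_data.csv so the snippet is self-contained (in the notebook you would pass the filename directly to pd.read_csv):

```python
import io
import pandas as pd

# In the notebook: retail = pd.read_csv("retail_data.csv")
# Here a tiny inline CSV stands in for the real file.
csv_text = "balance,max_spent,n_orders\n120.5,40.0,3\n75.0,75.0,1\n"
retail = pd.read_csv(io.StringIO(csv_text))
print(retail.head())   # first (up to five) rows of the DataFrame
```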
Before you can feed data into a Machine Learning algorithm, you first need to convert the imported DataFrame into a numpy array. It is also good practice to always check the dimensionality of the input data using the shape attribute, to confirm that you really have imported all the data correctly (e.g. one common mistake is to get the separator wrong and end up with only one column).
PYTHON
# Convert to numpy array and check the dimensionality
npArray = np.array(retail)
print(npArray.shape)
> (1998, 11) (1)
1. These values tell us that our imported data consist of 1998 rows (samples) and 11 columns
(features).
PYTHON
# Split to input matrix X and class vector y
X = npArray[:,:-1].astype(float) (1)
y = npArray[:,-1]
1. To use the values in X as continuous (floating point) values, we need to explicitly convert
or “cast” them into the float data type (which is Python’s data type for representing
continuous numerical values).
Try printing the size of the input matrix X and class vector y using the shape attribute:
PYTHON
# Print the dimensions of X and y
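A possible solution is print(X.shape) and print(y.shape); the mechanics can be checked on a small stand-in array (toy names are used here so the real X and y are not overwritten; on the full dataset the shapes are (1998, 10) and (1998,)):

```python
import numpy as np

# Stand-in for npArray: 4 samples, 2 features plus a class column
toy = np.array([["1.0", "2.0", "yes"],
                ["3.0", "4.0", "no"],
                ["5.0", "6.0", "yes"],
                ["7.0", "8.0", "no"]])
X_toy = toy[:, :-1].astype(float)  # all columns but the last, cast to float
y_toy = toy[:, -1]                 # the last column (class labels)
print(X_toy.shape)                 # (4, 2)
print(y_toy.shape)                 # (4,)
```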
These tell us that X is a 2-dimensional array (matrix) with 1998 rows and 10 columns, while y
is a 1-dimensional array (vector) with 1998 elements. Also, based on the class vector y, the
customers are classified into two distinct categories: "yes" for returning customers and "no"
for non-returning customers.
Plotly (https://plot.ly/) is an online collaborative data analysis and graphing tool that we will use
in order to construct fully interactive graphs. The Plotly API allows you to access all of the
library’s interactive functionality directly from Python (or other programming languages such
as R, JavaScript and MATLAB, among others). Crucially, Plotly has recently been made open-
source (https://plot.ly/javascript/open-source-announcement/), which now enables plotting without
requiring access to their API. Plotly Offline (https://plot.ly/python/offline/) brings interactive Plotly
graphs to the offline Jupyter (IPython) Notebook environment.
An imbalanced distribution of labels (classes) is a frequent problem when working with real-world data, and it can often lead to poor classification results for the minority class even if the
classification results for the majority class are very good. This is something to bear in mind
when evaluating the performance of a model, since for imbalanced class distributions, the
overall accuracy can be high even when the model performs very poorly when classifying the
minority class. In order to investigate the class frequency, we will use the itemfreq()
function as follows:
PYTHON
# Print the y frequencies
yFreq = scipy.stats.itemfreq(y)
print(yFreq)
> [['no' 260]
['yes' 1738]]
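Note that itemfreq() has been removed from newer SciPy releases; if it is unavailable, NumPy's unique() gives the same counts (a sketch reproducing the frequencies above on a reconstructed label vector):

```python
import numpy as np

# Reconstruct a label vector with the same class frequencies as the dataset
y_demo = np.array(["no"] * 260 + ["yes"] * 1738)
labels, counts = np.unique(y_demo, return_counts=True)
print(dict(zip(labels, counts)))  # {'no': 260, 'yes': 1738}
```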
In our current dataset, you can see that the y values are categorical (i.e. they can only take one
of a discrete set of values) and have a non-numeric representation, "yes" or "no". This can be
problematic for scikit-learn and plotting functions in Python, since they assume numerical
values, so we need to map the text categories to numerical representations using
LabelEncoder and the fit_transform function from the preprocessing module:
PYTHON
# Convert the categorical to numeric values, and print the y frequencies
le = preprocessing.LabelEncoder()
y = le.fit_transform(y)
yFreq = scipy.stats.itemfreq(y)
print(yFreq)
> [[ 0 260]
 [ 1 1738]]
Visualising the data is a good way to get a feel for how it is distributed. As a simple example, try plotting the frequencies of the output labels, 1 and 0, using the Bar graphical object (trace) from Plotly.
PYTHON
# Display the y frequencies in a barplot with Plotly
data = [Bar(x=["no", "yes"], y=yFreq[:, 1])] (1)
layout = Layout(xaxis=dict(title="Class"), yaxis=dict(title="Frequency")) (2)
fig = Figure(data=data, layout=layout) (3)
iplot(fig) (4)
1. Creating the Data object. The Data object may contain one or more graphical objects such as Scatter, Box and Bar, often referred to as "traces". Data is in a list-like format, therefore we must use square bracket notation "[ ]".
2. Creating a Layout object. Layouts and most of their individual arguments (such as the xaxis and yaxis) are in dict format.
3. Creating a Figure object. This is the step where we combine the Data with the Layout.
4. Plotting the Figure object with the iplot() function.
Figure 2. Bar chart showing frequencies of returning (class "1") and non-returning (class "0")
customers.
Data scaling
To avoid attributes with greater numeric ranges dominating those with smaller numeric
ranges, it is usually advisable to scale your data prior to fitting a classification model. Feature
scaling is generally performed as part of the data pre-processing.
In order to investigate the range and descriptive statistics of our features, we can apply the
describe() function from pandas to the original retail DataFrame (not the numpy
array!). For instance:
PYTHON
retail.describe()
Figure 3. Descriptive statistics for each feature in the online retail dataset.
While the table above is informative, visual aids are often preferable as they enhance the interpretability of the findings. Boxplots are commonly used to investigate differences in the ranges of the input features: they are a standardised way of displaying the distribution of the data based on the "five number summary" (minimum, first quartile, median, third quartile, and maximum). At this stage, let us start by constructing a boxplot of the raw data:
PYTHON
# Create a boxplot of the raw data
header = list(retail.columns) # the column (feature) names from the DataFrame
ncol = X.shape[1] # number of feature columns in X
data = [
    Box(
        y=X[:,i], # values to be used for box plot
        name=header[i], # label (on hover and x-axis)
        marker=dict(color="purple"),
    ) for i in range(ncol)
]
layout = Layout(
    xaxis=dict(title="Feature"),
    yaxis=dict(title="Value"),
    showlegend=False,
)
fig = Figure(data=data, layout=layout) # combine the Data with the Layout
iplot(fig)
Figure 4. Boxplot highlighting the different feature ranges of the raw data.
There are many ways of scaling but the most common scaling mechanism is auto-scaling,
where for each column, the values are centred around the mean and divided by their
standard deviation. This scaling mechanism can be applied by calling the scale() function from
scikit-learn’s preprocessing module.
PYTHON
# Auto-scale the data
X = preprocessing.scale(X)
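Auto-scaling can also be written out in plain NumPy, which makes the mechanism explicit; a sketch on a toy matrix (preprocessing.scale computes the same transformation, using the population standard deviation):

```python
import numpy as np

X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])

# Centre each column on its mean, then divide by its standard deviation
X_scaled = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_scaled.mean(axis=0))  # each column now has mean ~0
print(X_scaled.std(axis=0))   # and standard deviation 1
```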
The outcome of the scaling can be once more represented in a boxplot using exactly the
previous plotting script:
PYTHON
# Create a boxplot of the scaled data (simple or enhanced)
If you feel more adventurous, you can visit https://plot.ly/python/box-plots/ to find more Plotly examples (and the full list of Box arguments (https://plot.ly/python/reference/#box)) and create even more advanced boxplots.
PYTHON
# Create an enhanced scatter plot of the first two features
f1 = 0 (1)
f2 = 1
data = [
    Scatter(
        x=X[y == c, f1],
        y=X[y == c, f2],
        mode="markers",
        name=name, # class label shown in the legend
    ) for c, name in [(0, "no"), (1, "yes")] (2)
]
layout = Layout(
    xaxis=dict(title=header[f1]),
    yaxis=dict(title=header[f2]),
    height=600,
)
fig = Figure(data=data, layout=layout)
iplot(fig)
1. f1 and f2 are used to specify the features that you wish to plot against each other.
2. Remember that we can combine more than one trace in a Data object using a list format ("[ ]"), whereas the Layout object is in a dict format.
Figure 6. Scatter plot of a combination of features against each other with the colour and
marker shape indicating whether the points are classified as returning or non-returning
customers.
PYTHON
# Hint: Investigate the Scatter3d object from Plotly
# Axes in 3D Plotly plots work a little differently than in 2D:
# axes are bound to a Scene object (use help(Scene)).
Figure 7. Three-dimensional scatterplot. The colour highlights whether the points are classified
as returning or non-returning customers like in the previous plot. In this case we look at the
relationship between the features mean_spent , balance and max_spent at the same time.
PYTHON
# Hints: You may want to use nested loops that iterate through the
# rows and columns of the grid, and also import and make use of the
# make_subplots() function from Plotly
Figure 8. Scatter plot matrix where features are plotted against each other. The colour
highlights whether the points are classified as returning or non-returning customers like in the
previous plot. This kind of chart is handy when one needs to look at the relationships between
multiple variables simultaneously.
Further reading
http://numpy.scipy.org - To find out more about numpy.
http://scikit-learn.org/stable/supervised_learning.html#supervised-learning - scikit-learn supervised learning page.
Wrap up of Module 2
A vector can be represented by a NumPy array with one dimension.
A NumPy array can only hold data of a single data type. If you try to change an element to a value of an incompatible type, an error will be raised.
The colon can be used in array indices to specify a range of elements, e.g. X[1:3]. This is known as index slicing. The index to the left of the colon (1) is included in the range, while the index to the right of the colon (3) is not.
Plotting two features (dimensions) against each other is a good way of identifying features
or feature combinations that might be useful in classification.
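These wrap-up points can be verified in a few lines of NumPy (a minimal sketch):

```python
import numpy as np

v = np.array([10, 20, 30, 40])   # a vector: a 1-dimensional NumPy array
print(v[1:3])                    # index slicing: index 1 included, index 3 excluded

# A NumPy array holds a single dtype: an incompatible assignment raises an error
try:
    v[0] = "not a number"
except ValueError as err:
    print("assignment failed:", err)
```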