Machine Learning Absolute Beginners Introduction 2nd
Oliver Theobald
Second Edition
Copyright © 2017 by Oliver Theobald
All rights reserved. No part of this publication may be reproduced,
distributed, or transmitted in any form or by any means, including
photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the publisher,
except in the case of brief quotations embodied in critical reviews
and certain other non-commercial uses permitted by copyright law.
Contents
INTRODUCTION
WHAT IS MACHINE LEARNING?
ML CATEGORIES
THE ML TOOLBOX
DATA SCRUBBING
SETTING UP YOUR DATA
REGRESSION ANALYSIS
CLUSTERING
BIAS & VARIANCE
ARTIFICIAL NEURAL NETWORKS
DECISION TREES
ENSEMBLE MODELING
BUILDING A MODEL IN PYTHON
MODEL OPTIMIZATION
FURTHER RESOURCES
DOWNLOADING DATASETS
FINAL WORD
INTRODUCTION
Machines have come a long way since the Industrial Revolution. They
continue to fill factory floors and manufacturing plants, but now their
capabilities extend beyond manual activities to cognitive tasks that, until
recently, only humans were capable of performing. Judging song
competitions, driving automobiles, and mopping the floor with professional
chess players are three examples of the specific complex tasks machines are
now capable of simulating.
But their remarkable feats trigger fear among some observers. Part of this
fear nestles on the neck of survivalist insecurities, where it provokes the
deep-seated question of what if? What if intelligent machines turn on us in a
struggle of the fittest? What if intelligent machines produce offspring with
capabilities that humans never intended to impart to machines? What if the
legend of the singularity is true?
The other notable fear is the threat to job security, and if you’re a truck driver
or an accountant, there is a valid reason to be worried. According to the
British Broadcasting Corporation’s (BBC) interactive online resource Will a
robot take my job?, professions such as bar worker (77%), waiter (90%),
chartered accountant (95%), receptionist (96%), and taxi driver (57%) each
have a high chance of becoming automated by the year 2035. [1]
But research on planned job automation and crystal ball gazing with respect
to the future evolution of machines and artificial intelligence (AI) should be
read with a pinch of skepticism. AI technology is moving fast, but broad
adoption is still an uncharted path fraught with known and unforeseen
challenges. Delays and other obstacles are inevitable.
Nor is machine learning a simple case of flicking a switch and asking the
machine to predict the outcome of the Super Bowl and serve you a delicious
martini. Machine learning is far from what you would call an out-of-the-box
solution.
Machines operate based on statistical algorithms managed and overseen by
skilled individuals—known as data scientists and machine learning
engineers. This is one labor market where job opportunities are destined for
growth but where, currently, supply is struggling to meet demand. Industry
experts lament that one of the biggest obstacles delaying the progress of AI is
the inadequate supply of professionals with the necessary expertise and
training.
According to Charles Green, the Director of Thought Leadership at Belatrix
Software:
“It’s a huge challenge to find data scientists, people with machine
learning experience, or people with the skills to analyze and use the
data, as well as those who can create the algorithms required for
machine learning. Secondly, while the technology is still emerging, there
are many ongoing developments. It’s clear that AI is a long way from
how we might imagine it.” [2]
Perhaps your own path to becoming an expert in the field of machine learning
starts here, or maybe a baseline understanding is sufficient to satisfy your
curiosity for now. In any case, let’s proceed with the assumption that you are
receptive to the idea of training to become a successful data scientist or
machine learning engineer.
To build and program intelligent machines, you must first understand
classical statistics. Algorithms derived from classical statistics contribute the
metaphorical blood cells and oxygen that power machine learning. Layer
upon layer of linear regression, k-nearest neighbors, and random forests surge
through the machine and drive their cognitive abilities. Classical statistics is
at the heart of machine learning and many of these algorithms are based on
the same statistical equations you studied in high school. Indeed, statistical
algorithms were conducted on paper well before machines ever took on the
title of artificial intelligence.
Computer programming is another indispensable part of machine learning.
There isn’t a click-and-drag or Web 2.0 solution to perform advanced
machine learning in the way one can conveniently build a website nowadays
with WordPress or Strikingly. Programming skills are therefore vital to
manage data and design statistical models that run on machines.
Some students of machine learning will have years of programming
experience but haven’t touched classical statistics since high school. Others,
perhaps, never even attempted statistics in their high school years. But not to
worry: many of the machine learning algorithms we discuss in this book have
working implementations in your programming language of choice; no
equation writing necessary. You can use code to execute the actual number
crunching for you.
If you have not learned to code before, you will need to if you wish to make
further progress in this field. But for the purpose of this compact starter’s
course, the curriculum can be completed without any background in
computer programming. This book focuses on the high-level fundamentals of
machine learning as well as the mathematical and statistical underpinnings of
designing machine learning models.
For those who do wish to look at the programming aspect of machine
learning, Chapter 13 walks you through the entire process of setting up a
supervised learning model using the popular programming language Python.
WHAT IS MACHINE LEARNING?
In 1959, IBM published a paper in the IBM Journal of Research and
Development with an, at the time, obscure and curious title. Authored by
IBM’s Arthur Samuel, the paper investigated the use of machine learning in the
game of checkers “to verify the fact that a computer can be programmed so
that it will learn to play a better game of checkers than can be played by the
person who wrote the program.” [3]
Although it was not the first publication to use the term “machine learning”
per se, Arthur Samuel is widely considered the first person to coin and
define machine learning in the form we now know today. Samuel’s landmark
journal submission, Some Studies in Machine Learning Using the Game of
Checkers, is also an early indication of homo sapiens’ determination to
impart our own system of learning to man-made machines.
Figure 1: Historical mentions of “machine learning” in published books. Source: Google Ngram Viewer, 2017
Although not directly mentioned in Arthur Samuel’s definition, a key feature
of machine learning is the concept of self-learning. This refers to the
application of statistical modeling to detect patterns and improve
performance based on data and empirical information; all without direct
programming commands. This is what Arthur Samuel described as the ability
to learn without being explicitly programmed. But he doesn’t imply that
machines formulate decisions with no upfront programming. On the contrary,
machine learning is heavily dependent on computer programming. Instead,
Samuel observed that machines don’t require a direct input command to
perform a set task but rather input data.
Figure 3: The lineage of machine learning represented by a row of Russian matryoshka dolls
Popping out from computer science and data science as the third matryoshka
doll is artificial intelligence. Artificial intelligence, or AI, encompasses the
ability of machines to perform intelligent and cognitive tasks. Comparable to
the way the Industrial Revolution gave birth to an era of machines that could
simulate physical tasks, AI is driving the development of machines capable
of simulating cognitive abilities.
Although still broad, AI is dramatically more honed than computer science and
data science, and it contains numerous subfields that are popular today. These
subfields include search and planning, reasoning and knowledge
representation, perception, natural language processing (NLP), and of course,
machine learning. Machine learning bleeds into other fields of AI, including
NLP and perception through the shared use of self-learning algorithms.
Figure 4: Visual representation of the relationship between data-related fields
Supervised Learning
As the first branch of machine learning, supervised learning concentrates on
learning patterns through connecting the relationship between variables and
known outcomes and working with labeled datasets.
Supervised learning works by feeding the machine sample data with various
features (represented as “X”) and the correct value output of the data
(represented as “y”). The fact that the output and feature values are known
qualifies the dataset as “labeled.” The algorithm then deciphers patterns that
exist in the data and creates a model that can reproduce the same underlying
rules with new data.
For instance, to predict the market rate for the purchase of a used car, a
supervised algorithm can formulate predictions by analyzing the relationship
between car attributes (including the year of make, car brand, mileage, etc.)
and the selling price of other cars sold based on historical data. Given that the
supervised algorithm knows the final price of other cars sold, it can then
work backward to determine the relationship between the characteristics of
the car and its value.
Figure 1: Car value prediction model
After the machine deciphers the rules and patterns of the data, it creates what
is known as a model: an algorithmic equation for producing an outcome with
new data based on the rules derived from the training data. Once the model is
prepared, it can be applied to new data and tested for accuracy. After the
model has passed both the training and test data stages, it is ready to be
applied and used in the real world.
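The train-then-test workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the car records are invented, a single feature (mileage) stands in for the full set of car attributes, and the helper names fit_line and predict are hypothetical. The model here is a simple least-squares line rather than any particular algorithm the book goes on to use.

```python
# Hypothetical used-car records: (mileage in thousands of km, sale price in dollars).
# The prices follow an exact linear rule so the result is easy to check by hand.
cars = [(20, 18000), (40, 16000), (60, 14000), (80, 12000),
        (100, 10000), (120, 8000)]

# Hold back the last two records as test data; train on the rest.
train, test = cars[:4], cars[4:]

def fit_line(rows):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    var_x = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x

# "Training": derive the rule (the model) from the labeled training data.
slope, intercept = fit_line(train)

def predict(mileage):
    """Apply the learned rule to a car the model has not seen."""
    return slope * mileage + intercept

# "Testing": measure the mean absolute error on the held-out records.
test_error = sum(abs(predict(x) - y) for x, y in test) / len(test)
print(f"price = {slope:.1f} * mileage + {intercept:.1f}; test error = {test_error:.2f}")
```

Because the invented prices sit exactly on a line, the test error here is zero. Real data is noisier, which is why the test stage matters.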
In Chapter 13, we will create a model for predicting house values where y is
the actual house price and X are the variables that impact y, such as land size,
location, and the number of rooms. Through supervised learning, we will
create a rule to predict y (house value) based on the given values of various
variables (X).
Examples of supervised learning algorithms include regression analysis,
decision trees, k-nearest neighbors, neural networks, and support vector
machines. Each of these techniques will be introduced later in the book.
Unsupervised Learning
In the case of unsupervised learning, not all variables and data patterns are
classified. Instead, the machine must uncover hidden patterns and create
labels through the use of unsupervised learning algorithms. The k-means
clustering algorithm is a popular example of unsupervised learning. This
simple algorithm groups data points that are found to possess similar features
as shown in Figure 1.
Figure 1: Example of k-means clustering, a popular unsupervised learning technique
If you group data points based on the purchasing behavior of SME (Small
and Medium-sized Enterprises) and large enterprise customers, for example,
you are likely to see two clusters emerge. This is because SMEs and large
enterprises tend to have disparate buying habits. When it comes to purchasing
cloud infrastructure, for instance, basic cloud hosting resources and a Content
Delivery Network (CDN) may prove sufficient for most SME customers.
Large enterprise customers, though, are more likely to purchase a wider array
of cloud products and entire solutions that include advanced security and
networking products like WAF (Web Application Firewall), a dedicated
private connection, and VPC (Virtual Private Cloud). By analyzing customer
purchasing habits, unsupervised learning is capable of identifying these two
groups of customers without specific labels that classify the company as
small, medium or large.
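To make the grouping idea concrete, here is a minimal k-means sketch in plain Python. The customer figures are invented, the function name k_means is hypothetical, and the starting centroids are fixed to the first and last customers so the run is repeatable (library implementations typically choose starting points differently). Note that no row carries a label such as "SME" or "enterprise."

```python
import math

# Hypothetical customers: (monthly cloud spend in $1,000s, number of products bought).
customers = [(1, 2), (2, 1), (2, 3), (3, 2),         # happen to look like SMEs
             (40, 9), (45, 12), (50, 10), (55, 11)]  # happen to look like large enterprises

def k_means(points, centroids, max_iterations=10):
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:  # nothing moved, so the groups are stable
            break
        centroids = new_centroids
    return centroids, clusters

centroids, clusters = k_means(customers, centroids=[customers[0], customers[-1]])
print(centroids)
print(clusters)
```

The two clusters that emerge correspond to the two customer types, even though the algorithm was never told those types exist.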
The advantage of unsupervised learning is it enables you to discover patterns
in the data that you were unaware existed—such as the presence of two major
customer types. Clustering techniques such as k-means clustering can also
provide the springboard for conducting further analysis after discrete groups
have been discovered.
In industry, unsupervised learning is particularly powerful in fraud detection
—where the most dangerous attacks are often those yet to be classified. One
real-world example is DataVisor, who essentially built their business model
based on unsupervised learning.
Founded in 2013 in California, DataVisor protects customers from fraudulent
online activities, including spam, fake reviews, fake app installs, and
fraudulent transactions. Whereas traditional fraud protection services draw on
supervised learning models and rule engines, DataVisor uses unsupervised
learning which enables them to detect unclassified categories of attacks in
their early stages.
On their website, DataVisor explains that "to detect attacks, existing solutions
rely on human experience to create rules or labeled training data to tune
models. This means they are unable to detect new attacks that haven’t already
been identified by humans or labeled in training data." [5]
This means that traditional solutions analyze the chain of activity for a
particular attack and then create rules to predict a repeat attack. Under this
scenario, the dependent variable (y) is the event of an attack and the
independent variables (X) are the common predictor variables of an attack.
Examples of independent variables could be:
a) A sudden large order from an unknown user. For example, established customers
generally spend less than $100 per order, but a new user spends $8,000 in one
order immediately upon registering their account.
b) A sudden surge of user ratings. For example, as a typical author and bookseller
on Amazon.com, it’s uncommon for my first published work to receive more
than one book review within the space of one to two days. In general,
approximately 1 in 200 Amazon readers leave a book review and most books
go weeks or months without a review. However, I commonly see competitors
in this category (data science) attracting 20-50 reviews in one day!
(Unsurprisingly, I also see Amazon removing these suspicious reviews weeks
or months later.)
c) Identical or similar user reviews from different users. Following the
same Amazon analogy, I often see user reviews of my book appear on other
books several months later (sometimes with a reference to my name as the
author still included in the review!). Again, Amazon eventually removes
these fake reviews and suspends these accounts for breaking their terms of
service.
d) Suspicious shipping address. For example, for small businesses that routinely ship
products to local customers, an order from a distant location (where they
don't advertise their products) can in rare cases be an indicator of fraudulent
or malicious activity.
Standalone activities such as a sudden large order or a distant shipping
address may provide too little information to predict sophisticated
cybercriminal activity and are more likely to lead to false positives. But a
model that monitors combinations of independent variables, such as a sudden
large purchase order from the other side of the globe or a landslide of book
reviews that reuse existing content will generally lead to more accurate
predictions. A supervised learning-based model could deconstruct and
classify what these common independent variables are and design a detection
system to identify and prevent repeat offenses.
Sophisticated cybercriminals, though, learn to evade classification-based rule
engines by modifying their tactics. In addition, leading up to an attack,
attackers often register and operate single or multiple accounts and incubate
these accounts with activities that mimic legitimate users. They then utilize
their established account history to evade detection systems, which are
trigger-heavy against recently registered accounts. Supervised learning-based
solutions struggle to detect sleeper cells until the actual damage has been
done, especially in the case of new categories of attacks.
DataVisor and other anti-fraud solution providers therefore leverage
unsupervised learning to address the limitations of supervised learning by
analyzing patterns across hundreds of millions of accounts and identifying
suspicious connections between users—without knowing the actual category
of future attacks. By grouping malicious actors and analyzing their
connections to other accounts, they are able to prevent new types of attacks
whose independent variables are still unlabeled and unclassified. Sleeper cells
in their incubation stage (mimicking legitimate users) are also identified
through their association to malicious accounts. Clustering algorithms such as
k-means clustering can generate these groupings without a full training
dataset in the form of independent variables that clearly label indications of
an attack, such as the four examples listed earlier. Knowledge of the
dependent variable (known attackers) is generally the key to identifying other
attackers before the next attack occurs. Another advantage of unsupervised
learning is that companies like DataVisor can uncover entire criminal rings by
identifying subtle correlations across users.
We will cover unsupervised learning later in this book specific to clustering
analysis. Other examples of unsupervised learning include association
analysis, social network analysis, and dimensionality reduction algorithms.
Reinforcement Learning
Reinforcement learning is the third and most advanced algorithm category in
machine learning. Unlike supervised and unsupervised learning, which both
reach a definite endpoint once a model is formulated from the training and
test data segments, reinforcement learning continuously improves its model
by leveraging feedback from previous iterations.
Reinforcement learning can be complicated and is probably best explained
through an analogy to a video game. As a player progresses through the
virtual space of a game, they learn the value of various actions under different
conditions and become more familiar with the field of play. Those learned
values then inform and influence a player’s subsequent behavior and their
performance immediately improves based on their learning and past
experience.
Reinforcement learning is very similar, where algorithms are set to train the
model through continuous learning. A standard reinforcement learning model
has measurable performance criteria where outputs are not tagged—instead,
they are graded. In the case of self-driving vehicles, avoiding a crash earns
a positive score, and in the case of chess, avoiding defeat likewise earns a
positive score.
A specific algorithmic example of reinforcement learning is Q-learning. In Q-
learning, you start with a set environment of states, represented by the
symbol ‘S’. In the game Pac-Man, states could be the challenges, obstacles or
pathways that exist in the game. There may exist a wall to the left, a ghost to
the right, and a power pill above—each representing different states.
The set of possible actions to respond to these states is referred to as “A.” In
the case of Pac-Man, actions are limited to left, right, up, and down
movements, as well as multiple combinations thereof.
The third important symbol is “Q.” Q is the starting value and has an initial
value of “0.”
As Pac-Man explores the space inside the game, two main things will
happen:
1) Q drops as negative things occur after a given state/action
2) Q increases as positive things occur after a given state/action
In Q-learning, the machine will learn to match the action for a given state that
generates or maintains the highest level of Q. It will learn initially through the
process of random movements (actions) under different conditions (states).
The machine will record its results (rewards and penalties) and how they
impact its Q level and store those values to inform and optimize its future
actions.
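The S, A, and Q ideas above can be sketched with a deliberately tiny stand-in for Pac-Man. Everything specific here is an assumption made for illustration: the "game" is a corridor of five squares with only left and right moves, the reward values are invented, and the learning rate (ALPHA) and discount factor (GAMMA) are settings the text above does not mention but which the standard Q-learning update needs.

```python
import random

# A corridor of five squares. Square 0 holds a ghost (penalty, game over);
# square 4 holds a power pill (reward, game over).
GHOST, PILL = 0, 4
ACTIONS = {"left": -1, "right": +1}

ALPHA = 0.5   # learning rate: how far each update moves Q toward the new estimate
GAMMA = 0.9   # discount factor: how much future rewards count

# Q starts at 0 for every (state, action) pair.
Q = {(s, a): 0.0 for s in range(5) for a in ACTIONS}

rng = random.Random(0)  # fixed seed so the run is repeatable
for _ in range(2000):
    state = rng.choice([1, 2, 3])
    while state not in (GHOST, PILL):
        action = rng.choice(list(ACTIONS))      # explore with random movements
        next_state = state + ACTIONS[action]
        reward = {GHOST: -1.0, PILL: 1.0}.get(next_state, 0.0)
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        # Q rises after good outcomes and drops after bad ones.
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state

# For each state, pick the action with the highest recorded Q.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in (1, 2, 3)}
print(policy)
```

With these settings the learned policy heads right, toward the power pill, from every non-terminal square, and the Q value for stepping left next to the ghost ends up negative.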
While this sounds simple enough, implementation is a much more difficult
task and beyond the scope of an absolute beginner’s introduction to machine
learning. Reinforcement learning algorithms aren’t covered in this book;
however, I will leave you with a link to a more comprehensive explanation of
reinforcement learning and Q-learning following the Pac-Man scenario.
https://inst.eecs.berkeley.edu/~cs188/sp12/projects/reinforcement/reinforcement.html
THE ML TOOLBOX
A handy way to learn a new subject area is to map and visualize the essential
materials and tools inside a toolbox.
If you were packing a toolbox to build websites, for example, you would first
pack a selection of programming languages. This would include frontend
languages such as HTML, CSS, and JavaScript, one or two backend
programming languages based on personal preferences, and of course, a text
editor. You might throw in a website builder such as WordPress and then
have another compartment filled with web hosting, DNS, and maybe a few
domain names that you’ve recently purchased.
This is not an extensive inventory, but from this general list, you can start to
gain a better appreciation of what tools you need to master in order to
become a successful website developer.
Let’s now unpack the toolbox for machine learning.
Compartment 1: Data
In the first compartment is your data. Data constitutes the input variables
needed to form a prediction. Data comes in many forms, including structured
and unstructured data. As a beginner, it is recommended that you start with
structured data. This means that the data is defined and labeled (with
schema) in a table, as shown here:
Before we proceed, I first want to explain the anatomy of a tabular dataset. A
tabular (table-based) dataset contains data organized in rows and columns. Each
column contains a feature. A feature is also known as a variable, a dimension
or an attribute—but they all mean the same thing.
Each individual row represents a single observation of a given
feature/variable. Rows are sometimes referred to as a case or value, but in
this book, we will use the term “row.”
Each column is known as a vector. Vectors store your X and y values and
multiple vectors (columns) are commonly referred to as matrices. In the case
of supervised learning, y will already exist in your dataset and be used to
identify patterns in relation to independent variables (X). The y values are
commonly expressed in the final column, as shown in Figure 2.
Figure 2: The y value is often but not always expressed in the far right column
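As a rough illustration of this anatomy, the snippet below stores a small table as a list of rows and separates X from y by position. The column names and values are invented for the example, and placing y in the last column is simply the convention described above.

```python
# Hypothetical tabular dataset: each inner list is a row (one observation),
# and each position within a row is a column (one feature).
header = ["land_size_m2", "rooms", "distance_to_city_km", "price"]
rows = [
    [450, 3, 12.0, 520000],
    [300, 2, 5.5, 610000],
    [800, 4, 30.0, 480000],
]

# X holds the independent variables (every column except the last);
# y holds the dependent variable (the final column).
X = [row[:-1] for row in rows]
y = [row[-1] for row in rows]

# A single column (vector) can be pulled out by its position in the header.
rooms = [row[header.index("rooms")] for row in rows]

print(X)
print(y)
print(rooms)
```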
Compartment 2: Infrastructure
The second compartment of the toolbox contains your infrastructure, which
consists of platforms and tools to process data. As a beginner to machine
learning, you are likely to be using a web application (such as Jupyter
Notebook) and a programming language like Python. There are then a series
of machine learning libraries, including NumPy, Pandas, and Scikit-learn that
are compatible with Python. Machine learning libraries are a collection of
pre-compiled programming routines frequently used in machine learning.
You will also need a machine from which to work, in the form of a computer
or a virtual server. In addition, you may need specialized libraries for data
visualization such as Seaborn and Matplotlib, or a standalone software
program like Tableau, which supports a range of visualization
techniques including charts, graphs, maps, and other visual options.
With your infrastructure spread out across the table (hypothetically of
course), you are now ready to get to work building your first machine
learning model. The first step is to crank up your computer. Laptops and
desktop computers are both suitable for working with smaller datasets. You
will then need to install a programming environment, such as Jupyter
Notebook, and a programming language, which for most beginners is Python.
Python is the most widely used programming language for machine learning
because:
a) It is easy to learn and operate,
b) It is compatible with a range of machine learning libraries, and
c) It can be used for related tasks, including data collection (web
scraping) and data piping (Hadoop and Spark).
Other go-to languages for machine learning include C and C++. If you’re
proficient with C and C++ then it makes sense to stick with what you already
know. C and C++ are the default programming languages for advanced
machine learning because they can run directly on a GPU (Graphics
Processing Unit). Python needs to be converted first before it can run on a
GPU, but we will get to this and what a GPU is later in the chapter.
Next, Python users will typically install the following libraries: NumPy,
Pandas, and Scikit-learn. NumPy is a free and open-source library that allows
you to efficiently load and work with large datasets, including managing
matrices.
Scikit-learn provides access to a range of popular algorithms, including linear
regression, Bayes’ classifier, and support vector machines.
Finally, Pandas enables your data to be represented on a virtual
spreadsheet that you can control through code. It shares many of the same
features as Microsoft Excel in that it allows you to edit data and perform
calculations. In fact, the name Pandas derives from the term “panel data,”
which refers to its ability to create a series of panels, similar to “sheets” in
Excel. Pandas is also ideal for importing and extracting data from CSV files.
Compartment 3: Algorithms
Now that the machine learning environment is set up and you’ve chosen your
programming language and libraries, you can next import your data directly
from a CSV file. You can find hundreds of interesting datasets in CSV format
from kaggle.com. After registering as a member of their platform, you can
download a dataset of your choice. Best of all, Kaggle datasets are free and
there is no cost to register as a user.
The dataset will download directly to your computer as a CSV file, which
means you can open it in Microsoft Excel and even run basic
algorithms such as linear regression on your dataset.
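In Python, the same CSV file would typically be loaded with the Pandas library. The sketch below uses a few invented lines of CSV text held in memory in place of a downloaded file, so it runs without any download. With a real dataset you would pass the file's path to read_csv instead (the filename in the comment is hypothetical).

```python
import io
import pandas as pd

# Stand-in for a downloaded file. With a real dataset you would write
# something like pd.read_csv("my_dataset.csv") (hypothetical filename).
csv_text = io.StringIO(
    "suburb,rooms,price\n"
    "Northside,3,520000\n"
    "Riverside,2,610000\n"
    "Hilltop,4,480000\n"
)

df = pd.read_csv(csv_text)

print(df.head())            # first rows, laid out like a spreadsheet
print(df.shape)             # (number of rows, number of columns)
print(df["price"].mean())   # a calculation over one column
```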
Next is the third and final compartment that stores the algorithms. Beginners
will typically start off by using simple supervised learning algorithms such as
linear regression, logistic regression, decision trees, and k-nearest neighbors.
Beginners are also likely to apply unsupervised learning in the form of
k-means clustering and dimensionality reduction algorithms.
Visualization
No matter how impactful and insightful your data discoveries are, you need a
way to effectively communicate the results to relevant decision-makers. This
is where data visualization, a highly effective medium to communicate data
findings to a general audience, comes in handy. The visual message conveyed
through graphs, scatterplots, box plots, and the representation of numbers in
shapes makes for quick and easy storytelling.
In general, the less informed your audience is, the more important it is to
visualize your findings. Conversely, if your audience is knowledgeable about
the topic, additional details and technical terms can be used to supplement
visual elements.
To visualize your results you can draw on Tableau or a Python library such as
Seaborn, which are stored in the second compartment of the toolbox.
Advanced Toolbox
We have so far examined the toolbox for a typical beginner, but what about
an advanced user? What would their toolbox look like? While it may take
some time before you get to work with the advanced toolkit, it doesn’t hurt to
have a sneak peek.
The toolbox for an advanced learner resembles the beginner’s toolbox but
naturally comes with a broader spectrum of tools and, of course, data. One of
the biggest differences between a beginner and an advanced learner is the size
of the data they manage and operate. Beginners naturally start by working
with small datasets that are easy to manage and which can be downloaded
directly to one’s desktop as a simple CSV file. Advanced learners, though,
will be eager to tackle massive datasets, well in the vicinity of big data.
Compartment 2: Infrastructure
After scrubbing the dataset, the next step is to pull out your machine learning
equipment. In terms of tools, there are no real surprises. Advanced learners
are still using the same machine learning libraries, programming languages,
and programming environments as beginners.
However, given that advanced learners are now dealing with up to petabytes
of data, robust infrastructure is required. Instead of relying on the CPU of a
personal computer, advanced students typically turn to distributed computing
and a cloud provider such as Amazon Web Services (AWS) to run their data
processing on what is known as a Graphics Processing Unit (GPU) instance.
GPU chips were originally added to PC motherboards and video game consoles
such as the PlayStation 2 and the Xbox for gaming purposes. They were
developed to accelerate the creation of images with millions of pixels whose
frames needed to be constantly recalculated to display output in less than a
second. By 2005, GPU chips were produced in such large quantities that their
price had dropped dramatically and they’d essentially matured into a
commodity. Although highly popular in the video game industry, the
application of such computer chips in the space of machine learning was not
fully understood or realized until recently.
In his 2016 book, The Inevitable: Understanding the 12 Technological
Forces That Will Shape Our Future, Founding Executive Editor of Wired
Magazine, Kevin Kelly, explains that in 2009, Andrew Ng and a team at
Stanford University discovered how to link inexpensive GPU clusters to run
neural networks consisting of hundreds of millions of node connections.
“Traditional processors required several weeks to calculate all the cascading
possibilities in a neural net with one hundred million parameters. Ng found
that a cluster of GPUs could accomplish the same thing in a day.”[6]
Feature Selection
To generate the best results from your data, it is important to first identify the
variables most relevant to your hypothesis. In practice, this means being
selective about the variables you select to design your model.
Rather than creating a four-dimensional scatterplot with four features in the
model, there may be an opportunity to select two highly relevant features and
build a two-dimensional plot that is easier to interpret. Moreover, preserving
features that do not correlate strongly with the outcome value can, in fact,
distort the model and reduce its accuracy. Consider the following table
excerpt downloaded from kaggle.com documenting dying languages.
Database: https://www.kaggle.com/the-guardian/extinct-languages
Let’s say our goal is to identify variables that lead to a language becoming
endangered. Based on this goal, it’s unlikely that a language’s “Name in
Spanish” will lead to any relevant insight. We can therefore go ahead and
delete this vector (column) from the dataset. This will help to prevent over-
complication and potential inaccuracies, and will also improve the overall
processing speed of the model.
Secondly, the dataset holds duplicate information in the form of separate
vectors for “Countries” and “Country Code.” Including both of these vectors
doesn’t provide any additional insight; hence, we can choose to delete one
and retain the other.
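Deleting a vector is a one-step operation in code. The sketch below uses plain Python dictionaries with invented rows. Only the column names "Name in Spanish," "Countries," and "Country Code" come from the discussion above; the other columns and all values are placeholders. In Pandas, the equivalent step is usually done with the DataFrame drop method.

```python
# Hypothetical rows shaped like the table described above.
languages = [
    {"Name in English": "Language A", "Name in Spanish": "Lengua A",
     "Countries": "Country X", "Country Code": "XX", "Number of speakers": 1200},
    {"Name in English": "Language B", "Name in Spanish": "Lengua B",
     "Countries": "Country Y", "Country Code": "YY", "Number of speakers": 35},
]

# Columns judged irrelevant ("Name in Spanish") or redundant ("Country Code")
# for the stated goal.
columns_to_drop = {"Name in Spanish", "Country Code"}

# Rebuild each row without the dropped columns.
reduced = [
    {column: value for column, value in row.items() if column not in columns_to_drop}
    for row in languages
]

print(list(reduced[0]))
```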
Another method to reduce the number of features is to roll multiple features
into one. In the next table, we have a list of products sold on an e-commerce
platform. The dataset comprises four buyers and eight products. This is not a
large sample size of buyers and products—due in part to the spatial
limitations of the book format. A real-life e-commerce platform would have
many more columns to work with, but let’s go ahead with this example.
In order to analyze the data in a more efficient way, we can reduce the
number of columns by merging similar features into fewer columns. For
instance, we can remove individual product names and replace the eight
product items with a lower number of categories or subtypes. As all product
items fall under the single category of “fitness,” we will sort by product
subtype and compress the columns from eight to three. The three newly
created product subtype columns are “Health Food,” “Apparel,” and
“Digital.”
This enables us to transform the dataset in a way that preserves and captures
information using fewer variables. The downside to this transformation is that
we have less information about relationships between specific products.
Rather than recommending products to users according to other individual
products, recommendations will instead be based on relationships between
product subtypes.
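As a rough sketch of this compression, the code below rolls eight product columns into the three subtype columns named above. The product names, the buyers' purchases, and the function name compress are all invented for illustration; only the subtype labels come from the text.

```python
# Hypothetical product-to-subtype mapping. Only the three subtype names
# ("Health Food", "Apparel", "Digital") come from the text.
SUBTYPE = {
    "Protein Shake": "Health Food",
    "Energy Bar": "Health Food",
    "Vitamin Pack": "Health Food",
    "Running Shoes": "Apparel",
    "Yoga Pants": "Apparel",
    "Gym Shirt": "Apparel",
    "Fitness App": "Digital",
    "Workout Video": "Digital",
}

# Made-up purchase history for four buyers.
purchases = {
    "Buyer 1": ["Protein Shake", "Energy Bar", "Fitness App"],
    "Buyer 2": ["Running Shoes", "Yoga Pants"],
    "Buyer 3": ["Vitamin Pack", "Gym Shirt", "Workout Video"],
    "Buyer 4": ["Energy Bar"],
}


def compress(history, subtype_of):
    """Count purchases per subtype instead of per individual product."""
    categories = sorted(set(subtype_of.values()))
    table = {}
    for buyer, items in history.items():
        counts = {category: 0 for category in categories}
        for item in items:
            counts[subtype_of[item]] += 1
        table[buyer] = counts
    return table


compressed = compress(purchases, SUBTYPE)
for buyer, counts in compressed.items():
    print(buyer, counts)
```

Each buyer is now described by three numbers rather than eight, at the cost of no longer knowing which specific product was bought.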
Nonetheless, this approach does uphold a high level of data relevancy.
Buyers will be recommended health food when they buy other health food or
when they buy apparel (depending on the level of correlation), and obviously
not machine learning textbooks—unless it turns out that there is a strong
correlation there! But alas, such a variable is outside the frame of this dataset.
Remember that data reduction is also a business decision, and business
owners in counsel with the data science team will need to consider the trade-
off between convenience and the overall precision of the model.
Row Compression
In addition to feature selection, there may also be an opportunity to reduce
the number of rows and thereby compress the total number of data points.
This can involve merging two or more rows into one. For example, in the
following dataset, “Tiger” and “Lion” can be merged and renamed
“Carnivore.”
However, by merging these two rows (Tiger & Lion), the feature values for
both rows must also be aggregated and recorded in a single row. In this case,
it is viable to merge the two rows because they both possess the same
categorical values for all features except y (Race Time)—which can be
aggregated. The race time of the Tiger and the Lion can be added and divided
by two.
Numerical values, such as time, are normally simple to aggregate unless they
are categorical. For instance, it would be impossible to aggregate an animal
with four legs and an animal with two legs! We obviously can’t merge these
two animals and set “three” as the aggregate number of legs.
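The merge rule described above can be sketched as follows. The feature names and values are hypothetical stand-ins for the animal table, and merge_rows is an illustrative helper: it averages the numeric target only when every other feature matches.

```python
# Hypothetical rows; the feature names and values are illustrative only.
tiger = {"Animal": "Tiger", "Meat Eater": "Yes", "Legs": 4, "Race Time": 2.0}
lion = {"Animal": "Lion", "Meat Eater": "Yes", "Legs": 4, "Race Time": 3.0}
bird = {"Animal": "Ostrich", "Meat Eater": "No", "Legs": 2, "Race Time": 2.5}


def merge_rows(row_a, row_b, new_name, name_key="Animal", target_key="Race Time"):
    """Merge two rows if all categorical features match; average the target."""
    for key in row_a:
        if key in (name_key, target_key):
            continue
        if row_a[key] != row_b[key]:
            raise ValueError(f"Cannot merge: rows differ on {key!r}")
    merged = dict(row_a)
    merged[name_key] = new_name
    # Add the two race times and divide by two, as described in the text.
    merged[target_key] = (row_a[target_key] + row_b[target_key]) / 2
    return merged


carnivore = merge_rows(tiger, lion, "Carnivore")
print(carnivore)

try:
    merge_rows(tiger, bird, "Animal")
except ValueError as error:
    # Four legs and two legs cannot be averaged into "three".
    print(error)
```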
Row compression can also be difficult to implement when numerical values
aren’t available. For example, the values “Japan” and “Argentina” are very
difficult to merge. The countries “Japan” and “South Korea” can be merged,
as they can be categorized as the same continent, “Asia” or “East Asia.”
However, if we add “Pakistan” and “Indonesia” to the same group, we may
begin to see skewed results, as there are significant cultural, religious,
economic, and other dissimilarities between these four countries.
In summary, non-numerical and categorical row values can be problematic to
merge while preserving the true value of the original data. Also, row
compression is normally less attainable than feature compression for most
datasets.
One-hot Encoding
After choosing variables and rows, you next want to look for text-based
features that can be converted into numbers. Aside from set text-based values
such as True/False (that automatically convert to “1” and “0” respectively),
many algorithms and also scatterplots are not compatible with non-numerical
data.
One means to convert text-based features into numerical values is through
one-hot encoding, which transforms features into binary form, represented as
“1” or “0”—“True” or “False.” A “0,” representing False, means that the
feature does not belong to a particular category, whereas a “1”—True or
“hot”—denotes that the feature does belong to a set category.
Below is another excerpt of the dataset on dying languages, which we can use
to practice one-hot encoding.
First, note that the values contained in the “No. of Speakers” column do not
contain commas or spaces, e.g. 7,500,000 and 7 500 000. Although such
formatting does make large numbers clearer for our eyes, programming
languages don’t require such niceties. In fact, formatting numbers can lead to
an invalid syntax or trigger an unwanted result, depending on the
programming language you use. So remember to keep numbers unformatted
for programming purposes. Feel free, though, to add spacing or commas at
the data visualization stage, as this will make it easier for your audience to
interpret!
On the right-hand side of the table is a vector categorizing the degree of
endangerment of the nine different languages. We can convert this column to
numerical values by applying the one-hot encoding method, as demonstrated
in the subsequent table.
Using one-hot encoding, the dataset has expanded to five columns and we
have created three new features from the original feature (Degree of
Endangerment). We have also set each column value to “1” or “0,”
depending on the original category value.
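A minimal sketch of one-hot encoding in plain Python is shown below. The category labels are illustrative assumptions and may not match the values in the table, and one_hot is a hand-written helper rather than a library function.

```python
def one_hot(values):
    """Turn a list of category labels into rows of 0/1 indicator columns."""
    categories = sorted(set(values))
    encoded = [{category: int(value == category) for category in categories}
               for value in values]
    return categories, encoded


# Illustrative category labels (assumed, not taken from the table).
degrees = ["Vulnerable", "Endangered", "Vulnerable", "Extinct"]

categories, encoded = one_hot(degrees)
print(categories)
for row in encoded:
    print(row)
```

Each original value becomes a row containing exactly one "1," and the single text column becomes one numeric column per category.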
This now makes it possible for us to input the data into our model and choose
from a wider array of machine learning algorithms. The downside is that we
have more dataset features, which may lead to slightly longer processing
time. This is nonetheless manageable, but it can be problematic for datasets
where original features are split into a larger number of new features.
One hack to minimize the number of features is to restrict binary cases to a
single column. As an example, there is a speed dating dataset on kaggle.com
that lists “Gender” in a single column using one-hot encoding. Rather than
create discrete columns for both “Male” and “Female,” they merged these
two features into one. According to the dataset’s key, females are denoted as
“0” and males are denoted as “1.” The creator of the dataset also used this
technique for “Same Race” and “Match.”
Database: https://www.kaggle.com/annavictoria/speed-dating-experiment
Binning
Binning is another method of feature engineering that is used to convert
numerical values into a category.
Whoa, hold on! Didn’t you say that numerical values were a good thing? Yes,
numerical values tend to be preferred in most cases. Numerical values are less
ideal in situations where they capture variations that are irrelevant to the goals
of your analysis. Let’s take house price evaluation as an example. The exact
measurements of a tennis court might not matter greatly when evaluating
house prices. The relevant information is whether the house has a tennis
court. The same logic probably also applies to the garage and the swimming
pool, where the existence or non-existence of the variable is more influential
than their specific measurements.
The solution here is to replace the numeric measurements of the tennis court
with a True/False feature or a categorical value such as “small,” “medium,”
and “large.” Another alternative would be to apply one-hot encoding with “0”
for homes that do not have a tennis court and “1” for homes that do have a
tennis court.
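Both options can be sketched in a few lines. The court areas and the size cut-offs below are made-up numbers chosen for illustration; sensible thresholds would depend on the dataset.

```python
def to_binary(area):
    """Return 1 if the home has a tennis court (any positive area), else 0."""
    return 1 if area else 0


def to_size_bin(area, small_max=250.0, medium_max=500.0):
    """Replace an exact area with a coarse label. The cut-offs are assumptions."""
    if not area:
        return "none"
    if area <= small_max:
        return "small"
    if area <= medium_max:
        return "medium"
    return "large"


# Hypothetical tennis court areas in square meters; 0 means no court.
court_areas = [0, 260.0, 195.0, 670.0]

binary = [to_binary(area) for area in court_areas]
bins = [to_size_bin(area) for area in court_areas]
print(binary)
print(bins)
```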
Missing Data
Dealing with missing data is never a desired situation. Imagine unpacking a
jigsaw puzzle that you discover has five percent of its pieces missing.
Missing values in a dataset can be equally frustrating and will ultimately
interfere with your analysis and final predictions. There are, however,
strategies to minimize the negative impact of missing data.
One approach is to approximate missing values using the mode value. The
mode represents the single most common variable value available in the
dataset. This works best with categorical and binary variable types.
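The mode approach can be sketched as follows, assuming missing entries are stored as None. The column values are hypothetical and fill_with_mode is an illustrative helper.

```python
from collections import Counter


def fill_with_mode(values):
    """Replace missing entries (None) with the most common observed value."""
    observed = [value for value in values if value is not None]
    mode_value = Counter(observed).most_common(1)[0][0]
    return [mode_value if value is None else value for value in values]


# Hypothetical categorical column with two missing entries.
status = ["Vulnerable", None, "Extinct", "Vulnerable", None]

filled = fill_with_mode(status)
print(filled)
```

If two values are tied for most common, this sketch simply takes the one that Counter lists first, which is a simplification.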
Before you split your data, it is important that you randomize all rows in the
dataset. This helps to avoid bias in your model, as your original dataset might
be arranged sequentially depending on the time it was collected or some other
factor. Unless you randomize your data, you may accidentally omit important
variance from the training data that will cause unwanted surprises when you
apply the trained model to your test data. Fortunately, Scikit-learn provides a
built-in function to shuffle and randomize your data with just one line of code
(demonstrated in Chapter 13).
After randomizing your data, you can begin to design your model and apply
that to the training data. The remaining 30 percent or so of data is put to the
side and reserved for testing the accuracy of the model.
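To make the shuffle-then-split step concrete, here is a plain-Python stand-in. It is not the Scikit-learn function mentioned above; the function name, the 30 percent default, and the fixed seed are assumptions chosen for illustration.

```python
import random


def shuffle_and_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows, then reserve a fraction for testing."""
    shuffled = list(rows)
    # A fixed seed makes the shuffle repeatable from run to run.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Ten placeholder rows arranged sequentially, as a time-ordered dataset might be.
rows = list(range(10))

train_rows, test_rows = shuffle_and_split(rows)
print("training rows:", train_rows)
print("test rows:", test_rows)
```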
In the case of supervised learning, the model is developed by feeding the
machine the training data and the expected output (y). The machine is able to
analyze and discern relationships between the features (X) found in the
training data to calculate the final output (y).
The next step is to measure how well the model actually performs. A
common approach to analyzing prediction accuracy is a measure called mean
absolute error, which examines each prediction in the model and provides an
average error score for each prediction.
In Scikit-learn, mean absolute error is found by first running the
model.predict function on X (features), which generates a prediction for each
row in the dataset. Scikit-learn then compares the predictions of the model to
the correct outcome values (y) and measures the average size of the error.
You will know that your model is accurate when the error is low on both the
training data and the test data. This means that the model has
learned the dataset’s underlying patterns and trends.
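The calculation behind mean absolute error is simple enough to write by hand. The sketch below is a plain-Python illustration with made-up values, not Scikit-learn's own implementation; mean_abs_error is a hypothetical helper name.

```python
def mean_abs_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up outcome values (y) and model predictions for four rows.
y_true = [3.0, 4.0, 2.0, 7.0]
y_pred = [2.5, 4.5, 2.0, 8.0]

mae = mean_abs_error(y_true, y_pred)
print(mae)  # (0.5 + 0.5 + 0.0 + 1.0) / 4 = 0.5
```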
Once the model can adequately predict the values of the test data, it is ready
for use in the wild. If the model fails to accurately predict values from the test
data, you will need to check whether the training and test data were properly
randomized. Alternatively, you may need to change the model's
hyperparameters.
Each algorithm has hyperparameters; these are your algorithm settings. In
simple terms, these settings control and impact how fast the model learns
patterns and which patterns to identify and analyze.
Cross Validation
Although the training/test data split can be effective in developing models
from existing data, a question mark remains as to whether the model will
work on new data. If your existing dataset is too small to construct an
accurate model, or if the training/test partition of data is not appropriate, this
can lead to poor estimations of performance in the wild.
Fortunately, there is an effective workaround for this issue. Rather than
splitting the data into two segments (one for training and one for testing), we
can implement what is known as cross validation. Cross validation
maximizes the availability of training data by splitting data into various
combinations and testing each specific combination.
Cross validation can be performed through two primary methods. The first
method is exhaustive cross validation, which involves finding and testing all
possible combinations to divide the original sample into a training set and a
test set. The alternative and more common method is non-exhaustive cross
validation, known as k-fold validation. The k-fold validation technique
involves splitting data into k assigned buckets and reserving one of those
buckets to test the training model at each round.
To perform k-fold validation, data are first randomly assigned to k number of
equal sized buckets. One bucket is then reserved as the test bucket and is used
to measure and evaluate the performance of a model trained on the remaining
(k-1) buckets. The process is repeated until each bucket has served once as
the test bucket.
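The bucket rotation can be sketched in plain Python as follows. The function name and seed are assumptions, and the sketch only produces the training and test partitions for each round; fitting and scoring a model on each partition is left out.

```python
import random


def k_fold_splits(rows, k, seed=0):
    """Yield (training rows, test rows) once for each of the k buckets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled rows into k buckets; sizes differ by at most one row.
    buckets = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test_rows = buckets[i]
        train_rows = [row for j, bucket in enumerate(buckets) if j != i
                      for row in bucket]
        yield train_rows, test_rows


rows = list(range(10))  # ten placeholder rows
splits = list(k_fold_splits(rows, k=5))

for round_number, (train_rows, test_rows) in enumerate(splits, start=1):
    print(round_number, "test bucket:", test_rows)
```

Every row appears in exactly one test bucket, so all of the data is used for both training and testing across the k rounds.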
Imagine you’re back in high school and it's the year 2015 (which is probably
much more recent than your actual year of graduation!). During your senior
year, a news headline piques your interest in Bitcoin. With your natural
tendency to chase the next shiny object, you tell your family about your
cryptocurrency aspirations. But before you have a chance to bid for your first
Bitcoin on Coinbase, your father intervenes and insists that you try paper
trading before you go risking your life savings. “Paper trading” is using
simulated means to buy and sell an investment without involving actual
money.
So over the next twenty-four months, you track the value of Bitcoin and write
down its value at regular intervals. You also keep a tally of how many days
have passed since you first started paper trading. You never expected to
still be paper trading after two years, but unfortunately, you never got a
chance to enter the cryptocurrency market. As suggested by your father, you
waited for the value of Bitcoin to drop to a level you could afford. But
instead, the value of Bitcoin exploded in the opposite direction.
Nonetheless, you haven’t lost hope of one day owning Bitcoin. To help you
decide whether to keep waiting for the value to drop or to find an
alternative investment class, you turn your attention to statistical analysis.
You first reach into your toolbox for a scatterplot. With the blank scatterplot
in your hands, you proceed to plug in your x and y coordinates from your
dataset and plot Bitcoin values from 2015 to 2017. However, rather than use
all three columns from the table, you select the second (Bitcoin price) and
third (No. of Days Transpired) columns to build your model and populate the
scatterplot (shown in Figure 1). As we know, numerical values (found in the
second and third columns) are easy to plug into a scatterplot and require no
special conversion or one-hot encoding. What’s more, the first and third
columns contain the same variable of “time” and the third column alone is
sufficient.
As your goal is to estimate what Bitcoin will be valued at in the future, the y-
axis plots the dependent variable, which is “Bitcoin Price.” The independent
variable (X), in this case, is time. The “No. of Days Transpired” is thereby
plotted on the x-axis.
After plotting the x and y values on the scatterplot, you can immediately see a
trend in the form of a curve ascending from left to right with a steep increase
between day 607 and day 736. Based on the upward trajectory of the curve, it
might be time to quit hoping for a drop in value.
However, an idea suddenly pops into your head. What if instead of
waiting for the value of Bitcoin to fall to a level that you can afford, you
instead borrow from a friend and purchase Bitcoin now at day 736? Then,
when the value of Bitcoin rises further, you can pay back your friend and
continue to earn asset appreciation on the Bitcoin you fully own.
In order to assess whether it’s worth borrowing from your friend, you will
need to first estimate how much you can earn in potential profit. Then you
need to figure out whether the return on investment will be adequate to pay
back your friend in the short-term.
It’s now time to reach into the third compartment of the toolbox for an
algorithm. One of the simplest algorithms in machine learning is regression
analysis, which is used to determine the strength of a relationship between
variables. Regression analysis comes in many forms, including linear, non-
linear, logistic, and multilinear, but let’s take a look first at linear regression.
Linear regression comprises a straight line that splits your data points on a
scatterplot. The goal of linear regression is to split your data in a way that
minimizes the distance between the regression line and all data points on the
scatterplot. This means that if you were to draw a vertical line from the
regression line to each data point on the graph, the aggregate of those
distances (squared, in the standard least-squares method) would be the
smallest possible for any straight line drawn through the data.
Figure 2: Linear regression line
As shown in Figure 3, the hyperplane reveals that you actually stand to lose
money on your investment at day 800 (after buying on day 736)! Based on
the slope of the hyperplane, Bitcoin is expected to depreciate in value
between day 736 and day 800—despite no precedent in your dataset for
Bitcoin ever dropping in value.
While it’s needless to say that linear regression isn’t a fail-proof method for
picking investment trends, the trendline does offer a basic reference point to
predict the future. If we were to use the trendline as a reference point earlier
in time, say at day 240, then the prediction posted would have been more
accurate. At day 240 there is a low degree of deviation from the hyperplane,
while at day 736 there is a high degree of deviation. Deviation refers to the
distance between the hyperplane and the data point.
Figure 4: The distance of the data points to the hyperplane
In general, the closer the data points are to the regression line, the more
accurate the final prediction. If there is a high degree of deviation between
the data points and the regression line, the slope will provide less accurate
predictions. Basing your predictions on the data point at day 736, where there
is high deviation, results in poor accuracy. In fact, the data point at day 736
constitutes an outlier because it does not follow the same general trend as the
previous four data points. What’s more, as an outlier it exaggerates the
trajectory of the hyperplane based on its high y-axis value. Unless future data
points scale in proportion to the y-axis values of the outlier data point, the
model’s predictive accuracy will suffer.
Calculation Example
Although your programming language will take care of this automatically,
it’s useful to understand how linear regression is actually calculated. We will
use the following dataset and formula to perform linear regression.
# The final two columns of the table are not part of the original dataset and have been added for convenience to complete the following equation.
Where:
Σ = Total sum
Σx = Total sum of all x values (1 + 2 + 1 + 4 + 3 = 11)
Σy = Total sum of all y values (3 + 4 + 2 + 7 + 5 = 21)
Σxy = Total sum of x*y for each row (3 + 8 + 2 + 28 + 15 = 56)
Σx² = Total sum of x*x for each row (1 + 4 + 1 + 16 + 9 = 31)
B = (5(56) – (11 × 21)) / (5(31) – 11²) = 49 / 34 ≈ 1.441
Let’s now test the regression line by looking up the coordinates for x = 2.
y = 1.029 + 1.441(x)
y = 1.029 + 1.441(2)
y = 3.911
In this case, the prediction is very close to the actual result of 4.0.
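The arithmetic above can be checked with a short script. The x and y values are the ones listed in the sums; the intercept expression is the standard least-squares formula, assumed here because it reproduces the 1.029 used in the text.

```python
# x and y values taken from the sums listed above.
x = [1, 2, 1, 4, 3]
y = [3, 4, 2, 7, 5]

n = len(x)
sum_x = sum(x)                                  # 11
sum_y = sum(y)                                  # 21
sum_xy = sum(xi * yi for xi, yi in zip(x, y))   # 56
sum_x2 = sum(xi * xi for xi in x)               # 31

denominator = n * sum_x2 - sum_x ** 2           # 5(31) - 11^2 = 34

# Slope of the regression line: 49 / 34.
b = (n * sum_xy - sum_x * sum_y) / denominator

# Intercept, using the standard least-squares expression (assumed): 35 / 34.
a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator

prediction = a + b * 2  # predicted y for x = 2

print(round(a, 3), round(b, 3))
print(prediction)
```

Without rounding the coefficients first, the prediction for x = 2 is 133 / 34, or about 3.912, which differs from the 3.911 above only because the text rounds a and b to three decimals before adding.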
Logistic Regression
A large part of data analysis boils down to a simple question: is something
“A” or “B?” Is it “positive” or “negative?” Is this person a “potential
customer” or “not a potential customer?” Machine learning accommodates
such questions through logistic equations, and specifically through what is
known as the sigmoid function. The sigmoid function produces an S-shaped
curve that can convert any number and map it into a numerical value between
0 and 1, but it does so without ever reaching those exact limits.
A common application of the sigmoid function is found in logistic regression.
Logistic regression adopts the sigmoid function to analyze data and predict
discrete classes that exist in a dataset. Although logistic regression shares a
visual resemblance to linear regression, it is technically a classification
technique. Whereas linear regression addresses numerical equations and
forms numerical predictions to discern relationships between variables,
logistic regression predicts discrete classes.
The logistic sigmoid function above is calculated as “1” divided by “1” plus
“e” raised to the power of negative “x,” where:
x = the numerical value you wish to transform
e = Euler's number, approximately 2.718
In a binary case, a value of 0 represents no chance of occurring, and 1
represents a certain chance of occurring. The degree of probability for values
located between 0 and 1 can be calculated according to how close they rest to
0 (impossible) or 1 (certain possibility) on the scatterplot.
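The formula described above, 1 divided by 1 plus e raised to the power of negative x, can be written as a small Python function. The two-branch form is a choice made for this sketch: it is algebraically the same formula, arranged so that very large negative inputs do not overflow.

```python
import math


def sigmoid(x):
    """Map any real number to a value between 0 and 1."""
    if x >= 0:
        # Direct form: 1 / (1 + e^-x).
        return 1.0 / (1.0 + math.exp(-x))
    # Equivalent form e^x / (1 + e^x), which avoids overflow when x is very negative.
    z = math.exp(x)
    return z / (1.0 + z)


for value in (-4, -1, 0, 1, 4):
    print(value, sigmoid(value))
```

An input of 0 maps to exactly 0.5, and inputs further from zero map closer to 0 or 1.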
Figure 7: A sigmoid function used to classify data points
Based on the found probabilities we can assign each data point to one of two
discrete classes. As seen in Figure 7, we can create a cut-off point at 0.5 to
classify the data points into classes. Data points that record a value above 0.5
are classified as Class A, and any data points below 0.5 are classified as Class
B. Data points that record a result of exactly 0.5 are unclassifiable, but such
instances are rare due to the mathematical component of the sigmoid
function.
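The cut-off rule can be sketched as below. The probabilities are made-up values, and returning None for a result of exactly 0.5 is simply one way to represent an unclassifiable data point.

```python
def classify(probability, cutoff=0.5):
    """Assign Class A above the cut-off, Class B below it, None exactly on it."""
    if probability > cutoff:
        return "Class A"
    if probability < cutoff:
        return "Class B"
    return None  # exactly on the cut-off: unclassifiable


# Hypothetical sigmoid outputs for four data points.
probabilities = [0.91, 0.12, 0.5, 0.67]

labels = [classify(p) for p in probabilities]
print(labels)
```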
Please also note that this formula alone does not produce the hyperplane
dividing discrete categories as seen earlier in Figure 6. The statistical formula
for plotting the logistic hyperplane is somewhat more complicated and can be
conveniently plotted using your programming language.
Given its strength in binary classification, logistic regression is used in many
fields, including fraud detection, disease diagnosis, emergency detection, loan
default detection, and spam email identification, where the task is to assign
each case to a specific class, e.g. non-spam or spam. However, logistic
regression can also be applied to cases where there are a set number of
discrete values greater than two, e.g. single, married, and divorced.
Logistic regression with more than two outcome values is known as
multinomial logistic regression, which can be seen in Figure 8.
Two tips to remember when performing logistic regression are that the data
should be free of missing values and that all variables are independent of
each other. There should also be sufficient data for each outcome value to
ensure high accuracy. A good starting point would be approximately 30-50
data points for each outcome, i.e. 60-100 total data points for binary logistic
regression.
The scatterplot in Figure 9 consists of data points that are linearly separable
and the logistic hyperplane (A) splits the data points into two classes in a way
that minimizes the distance between all data points and the hyperplane. The
second line, the SVM hyperplane (B), likewise separates the two clusters, but
from a position of maximum distance between itself and the two clusters.
You will also notice a gray area that denotes margin, which is the distance
between the hyperplane and the nearest data point, multiplied by two. The
margin is a key feature of SVM and is important because it offers additional
support to cope with new data points that may infringe on a logistic
regression hyperplane. To illustrate this scenario, let’s consider the same
scatterplot with the inclusion of a new data point.
Figure 10: A new data point is added to the scatterplot
The new data point is a circle, but it is located incorrectly on the left side of
the logistic regression hyperplane (designated for stars). The new data point,
though, remains correctly located on the right side of the SVM hyperplane
(designated for circles) courtesy of ample “support” supplied by the margin.
Figure 11: Mitigating anomalies
k-Nearest Neighbors
The simplest clustering algorithm is k-nearest neighbors (k-NN), a supervised
learning technique used to classify new data points based on their relationship
to nearby data points.
k-NN is similar to a voting system or a popularity contest. Think of it as
being the new kid in school and choosing a group of classmates to socialize
with based on the five classmates who sit nearest to you. Among the five
classmates, three are geeks, one is a skater, and one is a jock. According to
k-NN, you would choose to hang out with the geeks based on their numerical
advantage. Let’s look at another example.
Figure 1: An example of k-NN clustering used to predict the class of a new data point
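The classroom analogy above can be turned into a small sketch. The seat coordinates are invented, straight-line (Euclidean) distance is assumed as the measure of nearness, and knn_predict is an illustrative helper name.

```python
import math
from collections import Counter


def knn_predict(labeled_points, new_point, k=5):
    """Return the most common label among the k points nearest to new_point."""
    by_distance = sorted(labeled_points,
                         key=lambda item: math.dist(item[0], new_point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Invented seat coordinates and groups for the classmates.
classmates = [
    ((1, 1), "geek"), ((1, 2), "geek"), ((2, 1), "geek"),
    ((2, 2), "skater"), ((0, 2), "jock"),
    ((8, 8), "jock"), ((9, 8), "jock"), ((8, 9), "jock"),
]
new_kid = (1, 1.5)

group = knn_predict(classmates, new_kid, k=5)
print(group)
```

The five nearest classmates are three geeks, one skater, and one jock, so the geeks win the vote even though jocks are the largest group in the room overall.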
k-Means Clustering
As a popular unsupervised learning algorithm, k-means clustering attempts to
divide data into k discrete groups and is effective at uncovering basic data
patterns. Examples of potential groupings include animal species, customers
with similar features, and housing market segmentation. The k-means
clustering algorithm works by first splitting data into k number of clusters
with k representing the number of clusters you wish to create. If you choose
to split your dataset into three clusters then k, for example, is set to 3.
Each data point can be assigned to only one cluster and each cluster is
discrete. This means that there is no overlap between clusters and no case of
nesting a cluster inside another cluster. Also, all data points, including
anomalies, are assigned to a centroid irrespective of how they impact the final
shape of the cluster. However, due to the statistical force that pulls all nearby
data points to a central point, your clusters will generally form an elliptical or
spherical shape.
Figure 7: Two clusters are formed after calculating the Euclidean distance of the remaining data points to the centroids.
Figure 8: The centroid coordinates for each cluster are updated to reflect the cluster’s mean value. As one data point has switched from the right cluster to the left cluster, the
centroids of both clusters are recalculated.
Figure 9: Two final clusters are produced based on the updated centroids for each cluster
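The assign-then-update cycle described in the figure captions above can be sketched in plain Python. The data points and starting centroids are invented for illustration, and the function is a simplified version of the technique rather than a library implementation.

```python
import math


def k_means(points, centroids, max_rounds=100):
    """Repeat: assign points to the nearest centroid, then move each centroid
    to the mean of its cluster, until the centroids stop changing."""
    for _ in range(max_rounds):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(axis) / len(cluster) for axis in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Invented data: two obvious groups plus one in-between point at (4, 4).
points = [(1, 1), (1, 2), (2, 1), (2, 2), (4, 4),
          (8, 8), (8, 9), (9, 8), (9, 9)]
# Two data points chosen as the starting centroids (k = 2).
final_centroids, final_clusters = k_means(points, [(1, 1), (4, 4)])

print(final_centroids)
print(final_clusters)
```

In this run the point (4, 4) starts in the second cluster and switches to the first once the centroids are recalculated, after which the assignments stop changing.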
Setting k
In setting k, it is important to strike the right number of clusters. In general,
as k increases, clusters become smaller and variance falls. However, the
downside is that neighboring clusters become less distinct from one another
as k increases.
If you set k to the same number of data points in your dataset, each data point
automatically converts into a standalone cluster. Conversely, if you set k to 1,
then all data points will be deemed as homogenous and produce only one
cluster. Needless to say, setting k to either extreme will not provide any
worthy insight to analyze.
In order to optimize k, you may wish to turn to a scree plot for guidance. A
scree plot charts the degree of scattering (variance) inside a cluster as the
total number of clusters increases. Scree plots are famous for their iconic
“elbow,” which reflects several pronounced kinks in the plot’s curve.
A scree plot compares the Sum of Squared Error (SSE) for each variation of
total clusters. SSE is measured as the sum of the squared distance between
the centroid and the other neighbors inside the cluster. In a nutshell, SSE
drops as more clusters are formed.
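To see SSE fall as clusters are added, the sketch below computes it for the same invented points grouped into one, two, and four clusters. The groupings are hand-picked for illustration rather than produced by running k-means.

```python
def centroid(cluster):
    """Mean position of the points in a cluster."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))


def sse(clusters):
    """Sum of squared distances from each point to its cluster's centroid."""
    total = 0.0
    for cluster in clusters:
        center = centroid(cluster)
        for point in cluster:
            total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total


# Invented points forming two well-separated groups.
left = [(1, 1), (1, 2), (2, 1), (2, 2)]
right = [(8, 8), (8, 9), (9, 8), (9, 9)]

sse_one = sse([left + right])                          # k = 1
sse_two = sse([left, right])                           # k = 2
sse_four = sse([left[:2], left[2:], right[:2], right[2:]])  # k = 4

print(sse_one, sse_two, sse_four)
```

The drop from one cluster to two is large, while the drop from two to four is small, which is the kind of kink a scree plot makes visible.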
This then raises the question of what the optimal number of clusters is. In
general, you should opt for a cluster solution where SSE subsides
dramatically to the left on the scree plot, but before it reaches a point of
negligible change with cluster variations to its right. For instance, in Figure
10, there is little impact on SSE for six or more clusters. This would result in
clusters that would be small and difficult to distinguish.
In this scree plot, two or three clusters appear to be an ideal solution. There
exists a significant kink to the left of these two cluster variations due to a
pronounced drop-off in SSE. Meanwhile, there is still some change in SSE
with the solution to their right. This will ensure that these two cluster
solutions are distinct and have an impact on data classification.
A simpler and non-mathematical approach to setting k is applying
domain knowledge. For example, if I am analyzing data concerning visitors
to the website of a major IT provider, I might want to set k to 2. Why two
clusters? Because I already know there is likely to be a major discrepancy in
spending behavior between returning visitors and new visitors. First-time
visitors rarely purchase enterprise-level IT products and services, as these
customers will normally go through a lengthy research and vetting process
before procurement can be approved.
Hence, I can use k-means clustering to create two clusters and test my
hypothesis. After creating two clusters, I may then want to examine one of
the two clusters further, either applying another technique or again using k-
means clustering. For example, I might want to split returning users into two
clusters (using k-means clustering) to test my hypothesis that mobile users
and desktop users produce two disparate groups of data points. Again, by
applying domain knowledge, I know it is uncommon for large enterprises to
make big-ticket purchases on a mobile device. Still, I wish to create a
machine learning model to test this assumption.
If, though, I am analyzing a product page for a low-cost item, such as a $4.99
domain name, new visitors and returning visitors are less likely to produce
two clear clusters. As the product item is of low value, new users are less
likely to deliberate before purchasing.
Instead, I might choose to set k to 3 based on my three primary lead
generators: organic traffic, paid traffic, and email marketing. These three lead
sources are likely to produce three discrete clusters based on the facts that:
a) Organic traffic generally consists of both new and returning
customers with a strong intent of purchasing from my website (through
pre-selection, e.g. word of mouth, previous customer experience).
b) Paid traffic targets new customers who typically arrive on the
website with a lower level of trust than organic traffic, including
potential customers who click on the paid advertisement by mistake.
c) Email marketing reaches existing customers who already have
experience purchasing from the website and have established user
accounts.
This is an example of domain knowledge based on my own occupation, but
do understand that the effectiveness of “domain knowledge” diminishes
dramatically past a low number of k clusters. In other words, domain
knowledge might be sufficient for determining two to four clusters, but it will
be less valuable in choosing between 20 and 21 clusters.
BIAS & VARIANCE
Algorithm selection is an important step in forming an accurate prediction
model, but deploying an algorithm with a high rate of accuracy can be a
difficult balancing act. Each algorithm can produce vastly different models,
and therefore dramatically different results, depending on the
hyperparameters provided. As mentioned earlier, hyperparameters are the
algorithm’s settings, similar to the controls on the dashboard of an airplane or
the knobs used to tune radio frequency—except hyperparameters are lines of
code!
Shooting targets, as seen in Figure 2, are not a visual chart used in machine
learning, but they do help to explain bias and variance. Imagine that the center
of the target, or the bull’s-eye, perfectly predicts the correct value of your
model. Each dot marked on the target then represents an individual realization
of your model based on your training data. In certain cases, the dots will be
densely positioned close to the bull’s-eye, ensuring that predictions made by
the model are close to the actual data. In other cases, the training data will be
scattered across the target. The more the dots deviate from the bull’s-eye, the
higher the bias and the less accurate the model will be in its overall predictive
ability.
In the first target, we can see an example of low bias and low variance. Bias
is low because the hits are closely aligned to the center and there is low
variance because the hits are densely positioned in one location.
The second target (located on the right of the first row) shows a case of low
bias and high variance. Although the hits are not as close to the bull’s-eye as
the previous example, they are still near to the center and bias is therefore
relatively low. However, there is high variance this time because the hits are
spread out from each other.
The third target (located on the left of the second row) represents high bias
and low variance and the fourth target (located on the right of the second
row) shows high bias and high variance.
Ideally, you want a situation where there is low variance and low bias. In
reality, though, there is more often a trade-off between optimal bias and
variance. Bias and variance both contribute to error, but it is the prediction
error that you want to minimize, not bias or variance specifically.
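To make the target analogy concrete, here is a minimal sketch that treats each hit as a single predicted number rather than a point on a target. The bull's-eye value of 100 and the four lists of hits are invented for illustration. Bias is measured as the gap between the average hit and the bull's-eye, and variance as the spread of the hits around their own average.

```python
from statistics import mean, pvariance

def bias_and_variance(predictions, true_value):
    """Return (bias, variance) for repeated predictions of one true value."""
    bias = mean(predictions) - true_value   # gap between the average hit and the bull's-eye
    variance = pvariance(predictions)       # spread of the hits around their own average
    return bias, variance

true_value = 100.0  # hypothetical bull's-eye
targets = {
    "low bias, low variance": [99.0, 100.0, 101.0, 100.0],
    "low bias, high variance": [80.0, 120.0, 95.0, 105.0],
    "high bias, low variance": [129.0, 130.0, 131.0, 130.0],
    "high bias, high variance": [110.0, 150.0, 125.0, 135.0],
}

for name, hits in targets.items():
    bias, variance = bias_and_variance(hits, true_value)
    # Mean squared error equals bias squared plus variance
    mse = mean((h - true_value) ** 2 for h in hits)
    print(f"{name}: bias={bias:.1f}, variance={variance:.1f}, mse={mse:.1f}")
```

Because the squared prediction error is made up of both terms, lowering one does not help if the other rises by more, which is why it is the overall error that we try to minimize.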
In Figure 3, we can see two lines moving from left to right. The line above
represents the test data and the line below represents the training data. From
the left, both lines begin at a point of high prediction error due to low
variance and high bias. As they move from left to right they change to the
opposite: high variance and low bias. This leads to low prediction error in the
case of the training data and high prediction error for the test data. In the
middle of the chart is an optimal balance of prediction error between the
training and test data. This is a common case of bias-variance trade-off.
Figure 4: Underfitting on the left and overfitting on the right
The human brain contains interconnected neurons with dendrites that receive
inputs. From these inputs, the neuron produces an electric signal output from
the axon and then emits these signals through axon terminals to other
neurons.
Similar to neurons in the human brain, artificial neural networks are formed
by interconnected neurons, also called nodes, which interact with each other
through axons, called edges. In a neural network, the nodes are stacked up in
layers and generally start with a broad base. The first layer consists of raw
data such as numeric values, text, images or sound, which are divided into
nodes. Each node then sends information to the next layer of nodes through
the network’s edges.
Figure 2: The nodes, edges/weights, and sum/activation function of a basic neural network
Each edge has a numeric weight that can be altered and
formulated based on experience. If the sum of the weighted inputs arriving
along the connected edges satisfies a set threshold, applied by the activation
function, it will activate a neuron at the next layer. However, if the sum does
not meet the set threshold, the activation will not be triggered. This results
in an all or nothing arrangement.
Note, also, that the weights along each edge are unique to ensure that the
nodes fire differently (as seen in Figure 3) and they don’t all return the same
outcome.
Figure 3: Unique edges to produce different outcomes
A typical neural network can be divided into input, hidden, and output layers.
Data is first received by the input layer, where broad features are detected.
The hidden layer(s) then analyze and process the data. Based on previous
computations, the data becomes streamlined through the passing of each
hidden layer. The final result is shown as the output layer.
The middle layers are considered hidden layers because, like human vision,
they covertly break down objects between the input and output layers. For
example, when humans see four lines connected in the shape of a square we
instantly recognize those four lines as a square. We don’t notice the lines as
four independent lines with no relationship to each other. Our brain is
conscious only of the output layer. Neural networks work much the same way
in that they break down data into layers and examine the hidden layers to
produce a final output.
While there are many techniques to assemble the nodes of a neural network,
the simplest method is the feed-forward network. In a feed-forward network,
signals flow only in one direction and there is no loop in the network.
The most basic form of a feed-forward neural network is the perceptron.
Figure 7: Activation function where the output (y) is 0 when x is negative, and the output (y) is 1 when x is positive
Thus:
Input 1: 24 * 0.5 = 12
Input 2: 16 * -1.0 = -16
Sum (Σ): 12 + -16 = - 4
As a numeric value less than zero, our result will register as “0” and therefore
not trigger the activation function of the perceptron.
However, we can also modify the activation threshold to a completely
different rule, such as:
x > 3, y = 1
x ≤ 3, y = 0
Figure 8: Activation function where the output (y) is 0 when x is equal or less than 3, and the output (y) is 1 when x is greater than 3
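The worked example above can be sketched in a few lines of Python. The inputs (24 and 16) and weights (0.5 and -1.0) come from the example; the function name perceptron_output is made up for this sketch, and treating a sum of exactly zero as "not activated" is an assumption, since the figure only covers negative and positive values.

```python
def perceptron_output(inputs, weights, threshold=0.0):
    """Return 1 if the weighted sum is greater than the threshold, otherwise 0."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum > threshold else 0

inputs = [24, 16]
weights = [0.5, -1.0]

# 24 * 0.5 + 16 * -1.0
weighted_sum = sum(x * w for x, w in zip(inputs, weights))

first_rule = perceptron_output(inputs, weights, threshold=0.0)   # fires when x > 0
second_rule = perceptron_output(inputs, weights, threshold=3.0)  # fires when x > 3

print(weighted_sum, first_rule, second_rule)
```

Changing only the threshold argument is enough to switch between the two activation rules, which is why the same weighted sum can fire under one rule and stay silent under another.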
When working with a larger model of neural network layers, a value of “1”
will be configured to pass the output to the next layer. Conversely, a “0”
value is configured to be ignored and will not be passed to the next layer for
processing.
In supervised learning, perceptrons can be trained on data to develop a
prediction model. The training steps are as follows:
1) Inputs are fed into the processor (neurons/nodes).
2) The perceptron estimates the value of those inputs.
3) The perceptron computes the error between the estimate and the
actual value.
4) The perceptron adjusts its weights according to the error.
5) Repeat the previous four steps until you are satisfied with the
model’s accuracy. The training model can then be applied to the test
data.
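The five steps above can be sketched as a small training loop. This is a minimal illustration rather than a production implementation: the training data (a logical AND gate), the learning rate of 0.1, the zero starting weights, and the function names are all assumptions chosen to keep the example short.

```python
def predict(inputs, weights, bias):
    """Step 2: estimate an output from the weighted inputs."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0

def train_perceptron(samples, learning_rate=0.1, max_epochs=100):
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(max_epochs):                    # step 5: repeat
        mistakes = 0
        for inputs, actual in samples:             # step 1: feed in the inputs
            estimate = predict(inputs, weights, bias)
            error = actual - estimate              # step 3: compute the error
            if error != 0:
                mistakes += 1
                # Step 4: nudge each weight in the direction that reduces the error
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, inputs)]
                bias += learning_rate * error
        if mistakes == 0:                          # stop once every sample is correct
            break
    return weights, bias

# Hypothetical training data: output is 1 only when both inputs are 1
and_gate = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train_perceptron(and_gate)
predictions = [predict(x, weights, bias) for x, _ in and_gate]
print(predictions)
```

In practice the stopping rule would be based on accuracy against held-out data rather than a perfect score on the training samples.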
The weakness of a perceptron is that, because the output is binary (1 or 0),
small changes in the weights or bias in any single perceptron within a larger
neural network can induce polarizing results. This can lead to dramatic
changes within the network and a complete flip in regards to the final output.
As a result, this makes it very difficult to train an accurate model that can be
successfully applied to test data and future data inputs.
An alternative to the perceptron is the sigmoid neuron. A sigmoid neuron is
very similar to a perceptron, but because it applies a sigmoid function rather
than a binary rule, its output can be any value between 0 and 1. This enables
more flexibility to absorb small changes in edge weights without triggering
inverse results, as the output is no longer binary. In other words, the output
result won’t flip just because of one minor change to an edge weight or input
value.
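The difference between the two neuron types can be seen by feeding both the same pair of nearly identical weighted sums. The values -0.1 and 0.1 are arbitrary choices for this sketch, and the step function uses the "greater than zero" rule from the earlier perceptron example.

```python
import math

def step(x):
    """Binary perceptron rule: 1 when x is positive, otherwise 0."""
    return 1 if x > 0 else 0

def sigmoid(x):
    """Smooth alternative that returns a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

for weighted_sum in (-0.1, 0.1):
    print(weighted_sum, step(weighted_sum), round(sigmoid(weighted_sum), 3))
```

A small shift across zero flips the perceptron's output completely, while the sigmoid output moves only slightly on either side of 0.5.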
Figure 12: Common usage scenarios and paired deep learning techniques
As can be seen from the table, multi-layer perceptrons have been largely
superseded by new deep learning techniques such as convolutional networks,
recurrent networks, deep belief networks, and recursive neural tensor
networks (RNTN). These more advanced iterations of a neural network can
be used effectively across a number of practical applications that are
currently in vogue. Although convolutional networks are arguably the
most popular and powerful of deep learning techniques, new methods and
variations are continuously evolving.
11
DECISION TREES
The fact that neural networks can be applied to a broader range of machine
learning problems than any other technique has led some pundits to hail
neural networks as the ultimate machine learning algorithm. However, this is
not to say that neural networks fit the bill as a statistical silver bullet. In
various cases, neural networks fall short and decision trees are held up as a
popular counterargument.
The massive reserve of data and computational resources that neural
networks demand is one obvious pitfall. Only after training on millions of
tagged examples can Google's image recognition engine reliably recognize
classes of simple objects (such as dogs). But how many dog pictures do you
need to show to the average four-year-old before they “get it?”
Decision trees, on the other hand, provide high-level efficiency and easy
interpretation. These two benefits make this simple algorithm popular in the
space of machine learning.
As a supervised learning technique, decision trees are used primarily for
solving classification problems, but they can be applied to solve regression
problems too.
Of these three variables, variable 1 (Exceeded KPIs) produces the best result
with two perfectly homogenous groups. Variable 3 produces the second best
result, as one leaf is homogenous. Variable 2 produces two leaves that are not
homogenous. Variable 1 would therefore be selected as the first binary
question to split this dataset.
Whether it is ID3 or another algorithm, this process of splitting data into
binary partitions, known as recursive partitioning, is repeated until a stopping
criterion is met. This stopping point could be based on a range of criteria,
such as:
- When all leaves contain fewer than 3-5 items
- When a branch produces a result that places all items in one binary
leaf
Figure 3: Example of a stopping criterion
Random Forests
Rather than striving for the most efficient split at each round of recursive
partitioning, an alternative technique is to construct multiple trees and
combine their predictions to select an optimal path of classification or
prediction. This involves a randomized selection of binary questions to grow
multiple different decision trees, known as random forests. In the industry,
you will also often hear people refer to this process as “bootstrap
aggregating” or “bagging.”
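The bagging idea can be sketched without any libraries. This simplified example covers only the bootstrap-and-vote part: each "tree" is a one-question stump on a single feature, and the dataset, the number of stumps (25), and the function names are invented for illustration. A real random forest also randomizes which features each tree may ask about.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Pick the threshold on a single feature that misclassifies the fewest rows."""
    best_threshold, best_errors = None, None
    for threshold in sorted({x for x, _ in sample}):
        # The stump predicts True whenever x >= threshold
        errors = sum((x >= threshold) != label for x, label in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagged_predict(thresholds, x):
    """Let every stump vote and return the majority answer."""
    votes = Counter(x >= t for t in thresholds)
    return votes.most_common(1)[0][0]

random.seed(0)
# Hypothetical data: (feature value, label)
data = [(1, False), (2, False), (3, False), (4, False),
        (6, True), (7, True), (8, True), (9, True)]

thresholds = []
for _ in range(25):
    bootstrap = random.choices(data, k=len(data))  # sample rows with replacement
    thresholds.append(fit_stump(bootstrap))

print(bagged_predict(thresholds, 2), bagged_predict(thresholds, 8))
```

Each stump sees a slightly different dataset and may settle on a different threshold, but the majority vote smooths out those differences.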
Boosting
Another variant of multiple decision trees is the popular technique of
boosting, which is a family of algorithms that convert “weak learners” to
“strong learners.” The underlying principle of boosting is to add weights to
instances that were misclassified in earlier rounds. This can be interpreted as
similar to a language teacher offering after-school tutoring to the weakest
students in the class in order to improve the average test results of the entire
class.
A popular boosting algorithm is gradient boosting. Rather than selecting
combinations of binary questions at random (like random forests), gradient
boosting selects binary questions that improve prediction accuracy for each
new tree. Decision trees are therefore grown sequentially, as each tree is
created using information derived from the previous decision tree.
The way this works is that mistakes incurred with the training data are
recorded and then applied to the next round of training data. At each iteration,
weights are added to the training data based on the results of the previous
iteration. Higher weighting is applied to instances that were incorrectly
predicted from the training data, and instances that were correctly predicted
receive less weighting. The training and test data are then compared and
errors are again logged in order to inform weighting at each subsequent
round. Earlier iterations that do not perform well, and that perhaps
misclassified data, can thus be improved upon through further iterations. This
process is repeated until there is a low level of error. The final result is then
obtained from a weighted average of the total predictions derived from each
model.
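The reweighting idea described above is most directly expressed by AdaBoost, so the sketch below shows a single AdaBoost-style round rather than gradient boosting itself. The five starting weights and the pattern of correct and incorrect predictions are invented, and the function name reweight is hypothetical.

```python
import math

def reweight(weights, correct):
    """One AdaBoost-style round: raise weights of misclassified rows, lower the rest.

    Assumes the weighted error is strictly between 0 and 1.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1 - error) / error)  # influence given to this weak learner
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated], alpha   # normalize so the weights sum to 1

weights = [0.2] * 5                           # every instance starts with equal weight
correct = [True, True, True, False, True]     # the fourth instance was misclassified
new_weights, alpha = reweight(weights, correct)
print([round(w, 3) for w in new_weights])
```

After the update, the misclassified instance carries far more weight than the others, so the next tree is pushed to get it right.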
While this approach mitigates the issue of overfitting, it does so with fewer
trees than the bagging approach. In general, the more trees you add to a
random forest, the greater its ability to thwart overfitting. Conversely, with
gradient boosting, too many trees may cause overfitting and caution should
be taken as new trees are added.
One drawback of using random forests and gradient boosting is that we return
to a black-box technique and sacrifice the visual simplicity and ease of
interpretation that comes with a single decision tree.
12
ENSEMBLE MODELING
One of the most effective machine learning methodologies is ensemble
modeling, also known as ensembles. Ensemble modeling combines statistical
techniques to create a model that produces a unified prediction. It is through
combining estimates and following the wisdom of the crowd that ensemble
modeling performs a final classification or outcome with better predictive
performance. Naturally, ensemble models are a popular choice when it comes
to machine learning competitions like the Netflix Competition and Kaggle
competitions.
Ensemble models can be classified into various categories including
sequential, parallel, homogeneous, and heterogeneous. Let’s start by first
looking at sequential and parallel models. For sequential ensemble models,
prediction error is reduced by adding weights to instances that earlier
classifiers misclassified. Gradient boosting and AdaBoost are two examples of
sequential models. Conversely, parallel ensemble models work concurrently
and reduce error by averaging. Random forests are an example of this
technique.
Ensemble models can also be generated using a single technique with
numerous variations (known as a homogeneous ensemble) or through
different techniques (known as a heterogeneous ensemble). An example of a
homogeneous ensemble model would be numerous decision trees working
together to form a single prediction (bagging). Meanwhile, an example of a
heterogeneous ensemble would be the usage of k-means clustering or a neural
network in collaboration with a decision tree model.
Naturally, it is important to select techniques that complement each other.
Neural networks, for instance, require complete data for analysis, whereas
decision trees can effectively handle missing values. Together, these two
techniques provide added value over a homogeneous model. The neural
network accurately predicts the majority of instances that provide a value and
the decision tree ensures that there are no “null” results that would otherwise
be incurred from missing values in a neural network. The other advantage of
ensemble modeling is that aggregated estimates are generally more accurate
than any single estimate.
There are various subcategories of ensemble modeling; we have already
touched on two of these in the previous chapter. Four popular subcategories
of ensemble modeling are bagging, boosting, a bucket of models, and
stacking.
Bagging, as we know, is short for “bootstrap aggregating” and is an example
of a homogeneous ensemble. This method draws upon randomly drawn
datasets and combines predictions to design a unified model based on a
voting process among the models trained on those datasets. Expressed in
another way, bagging is a special process of model averaging. Random forest,
as we know, is a popular example of bagging.
Boosting is a popular alternative technique that addresses error and data
misclassified by the previous iteration to form a final model. Gradient
boosting and AdaBoost are both popular examples of boosting.
A bucket of models trains numerous different algorithmic models using the
same training data and then picks the one that performed most accurately on
the test data.
Stacking runs multiple models simultaneously on the data and combines
those results to produce a final model. This technique is currently very
popular in machine learning competitions, including the Netflix Prize. (Held
between 2006 and 2009, Netflix offered a prize for a machine learning model
that could improve their recommender system in order to produce more
effective movie recommendations. One of the winning techniques adopted a
form of linear stacking that combined predictions from multiple predictive
models.)
Although ensemble models typically produce more accurate predictions, one
drawback to this methodology is, in fact, the level of sophistication.
Ensembles face the same trade-off between accuracy and simplicity as a
single decision tree versus a random forest. The transparency and simplicity
of a simple technique, such as a decision tree or k-nearest neighbors, is lost
and instantly mutated into a statistical black-box. Performance of the model
will win out in most cases, but the transparency of your model is another
factor to consider when determining your preferred methodology.
13
BUILDING A MODEL IN PYTHON
After examining the statistical underpinnings of numerous algorithms, it’s
time to turn our attention to building an actual machine learning model.
Although there are various options in regards to programming languages (as
outlined in Chapter 4), for this exercise we will use Python because it is quick
to learn and it’s an effective programming language for anyone interested in
manipulating and working with large datasets.
If you don't have any experience in programming or programming with
Python, there’s no need to worry. The key purpose of this chapter is to
understand the methodology and steps behind building a basic machine
learning model.
In this exercise, we will design a house price valuation system using gradient
boosting by following these six steps:
1) Set up the development environment
2) Import the dataset
3) Scrub the dataset
4) Split the data into training and test data
5) Select an algorithm and configure its hyperparameters
6) Evaluate the results
To initiate Jupyter Notebook, run the following command from the Terminal
(for Mac/Linux) or Command Prompt (for Windows):
jupyter notebook
Terminal/Command Prompt will then generate a URL, for example:
http://localhost:8888/
Copy and paste the generated URL into your web browser to load Jupyter
Notebook. Once you have Jupyter Notebook open in your browser, click on
“New” in the top right-hand corner of the web application to create a new
notebook, and then select “Python 3.”
The final step is to load the libraries required to complete this
exercise. You will need to import Pandas and a number of functions from
Scikit-learn into the notebook.
In machine learning, each project will vary in regards to the libraries required
for import. For this particular exercise, we are using gradient boosting
(ensemble modeling) and mean absolute error to measure performance.
You will need to import each of the following libraries and functions by
entering these exact commands in Jupyter Notebook:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import mean_absolute_error
import joblib  # sklearn.externals.joblib was removed in Scikit-learn 0.23
Don’t worry if you don’t recognize each of the imported libraries in the code
snippet above. These libraries will be referred to in later steps.
df.head(n=5)
Right-click and select “Run” or navigate from the Jupyter Notebook menu:
Cell > Run All
This will populate the dataset within Jupyter Notebook as shown in Figure 2.
This step is not mandatory, but it is a useful technique for reviewing your
dataset inside Jupyter Notebook.
Scrubbing Process
Let’s first remove columns from the dataset that we don’t wish to include in
the model by using the del df[' '] statement and entering the titles of the
columns (vectors) that we wish to remove.
# The misspellings of “longitude” and “latitude” are used, as the two misspellings were not corrected in
the source file.
del df['Address']
del df['Method']
del df['SellerG']
del df['Date']
del df['Postcode']
del df['Lattitude']
del df['Longtitude']
del df['Regionname']
del df['Propertycount']
Keep in mind that it’s important to drop rows with missing values after
applying the del df function to remove columns (as shown in the previous
step). This way, there’s a better chance that more rows from the original
dataset will be preserved. Imagine dropping a whole row because it was
missing the value for a variable that would be later deleted like the post code
in our model!
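The effect of the ordering can be seen with a tiny example. This is a sketch under stated assumptions: the three-row dataframe is invented, and the dropna arguments shown (axis=0, how="any") are one reasonable way to drop rows with missing values, not necessarily the exact call used for this model.

```python
import pandas as pd

# Hypothetical three-row dataset with two gaps
df = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Postcode": [3067.0, None, 3070.0],  # this column will be deleted anyway
    "Price": [850_000.0, 1_200_000.0, None],
})

# Dropping rows first also discards the row that is only missing a postcode
rows_if_dropped_first = len(df.dropna())

# Deleting the unwanted column first preserves that row
del df["Postcode"]
df = df.dropna(axis=0, how="any")
rows_if_dropped_after = len(df)

print(rows_if_dropped_first, rows_if_dropped_after)
```

Only the row with a missing price is lost when the columns are removed first.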
Next, let’s convert columns that contain non-numerical data to numerical
values using one-hot encoding. With Pandas, one-hot encoding can be
performed using the get_dummies function:
This command converts column values for Suburb, CouncilArea, and Type
into numerical values through the application of one-hot encoding.
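The one-hot encoding step described above might look like the following sketch. The column names Suburb, CouncilArea, and Type and the dataframe name features_df are taken from the text; the three sample rows are invented so that the example runs on its own.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real dataset
df = pd.DataFrame({
    "Suburb": ["Abbotsford", "Richmond", "Abbotsford"],
    "CouncilArea": ["Yarra", "Yarra", "Yarra"],
    "Type": ["h", "u", "h"],
    "Price": [850_000, 620_000, 910_000],
})

# Replace each listed text column with one indicator column per distinct value
features_df = pd.get_dummies(df, columns=["Suburb", "CouncilArea", "Type"])

print(list(features_df.columns))
```

Each original text column disappears and is replaced by columns named after its values, such as Suburb_Richmond, while numeric columns like Price are left untouched.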
Next, we need to remove the “Price” column because this column will act as
our dependent variable (y) and for now we are only examining the eleven
independent variables (X).
del features_df['Price']
Finally, create X and y arrays from the dataset by converting the dataframes
to NumPy arrays (to_numpy). The X array contains the independent variables and
the y array contains the dependent variable of Price.
# as_matrix was removed in Pandas 1.0; to_numpy is its replacement
X = features_df.to_numpy()
y = df['Price'].to_numpy()
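With X and y in place, the split into training and test data (step 4) might be sketched as follows using the train_test_split function imported earlier. The 70/30 ratio, the random_state value, and the toy arrays are assumptions made for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real X and y arrays: ten rows, two features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Shuffle the rows, then hold back 30% of them as test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=0
)

print(X_train.shape, X_test.shape)
```

Fixing random_state makes the shuffle repeatable; leaving it out gives a different split on every run.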
model = ensemble.GradientBoostingRegressor(
    n_estimators=150,
    learning_rate=0.1,
    max_depth=30,
    min_samples_split=4,
    min_samples_leaf=6,
    max_features=0.6,
    loss='huber'
)
The first line calls the algorithm itself (gradient boosting). The lines below
it set the hyperparameters for this algorithm.
n_estimators represents how many decision trees to build. Remember that a
high number of trees will generally improve accuracy (up to a certain point),
but it will also increase the model’s processing time. Above, I have selected
150 decision trees as an initial starting point.
learning_rate controls the rate at which additional decision trees influence
the overall prediction. This effectively shrinks the contribution of each tree
by the set learning_rate. Inserting a low rate here, such as 0.1, should
improve accuracy.
max_depth defines the maximum number of layers (depth) for each decision
tree. If “None” is selected, then nodes expand until all leaves are pure or until
all leaves contain fewer than min_samples_split samples. Here, I have selected a
high maximum number of layers (30), which will have a dramatic effect on the
final result, as we will see later.
min_samples_split defines the minimum number of samples required to
execute a new binary split. For example, min_samples_split = 10 means there
must be ten available samples in order to create a new branch.
min_samples_leaf represents the minimum number of samples that must
appear in each child node (leaf) before a new branch can be implemented.
This helps to mitigate the impact of outliers and anomalies in the form of a
low number of samples found in one leaf as a result of a binary split. For
example, min_samples_leaf = 4 requires there to be at least four available
samples within each leaf for a new branch to be created.
max_features is the total number of features presented to the model when
determining the best split. As mentioned in Chapter 11, random forests and
gradient boosting restrict the total number of features shown to each
individual tree to create multiple results that can be voted upon later.
If the max_features value is an integer (whole number), the model will
consider max_features at each split (branch). If the value is a float (e.g. 0.6),
then max_features is the percentage of total features randomly selected.
Although max_features sets a maximum number of features to consider in
identifying the best split, total features may exceed the max_features limit if
no split can initially be made.
loss sets the loss function used to measure the model's error. For this
exercise, we are using huber, which protects against outliers and anomalies.
Alternative options include ls (least squares regression), lad (least absolute
deviations), and quantile (quantile regression). Huber is actually a
combination of ls and lad. Note that from Scikit-learn 1.0 onward, ls and lad
are named squared_error and absolute_error respectively.
To learn more about gradient boosting hyperparameters, you may refer to the
Scikit-learn website:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
Lastly, we need to use Scikit-learn to save the training model as a file using
the joblib.dump function, which was imported into Jupyter Notebook in Step
1. This will allow us to use the training model again in the future for
predicting new real estate property values, without needing to rebuild the
model from scratch.
joblib.dump(model, 'house_trained_model.pkl')
Here, we input our y values, which represent the correct results from the
training dataset. The model.predict function is then called on the X training
set and will generate a prediction with up to two decimal places. The mean
absolute error function will then compare the difference between the model’s
expected predictions and the actual values. The same process is repeated with
the test data.
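To show what mean absolute error actually measures, here is a plain Python sketch of the calculation. The three actual and predicted prices are invented; in the exercise itself the same quantity is computed by Scikit-learn's mean_absolute_error function on the output of model.predict.

```python
def mean_absolute_error(actual, predicted):
    """Average size of the prediction errors, ignoring their direction."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual_prices = [500_000, 750_000, 1_200_000]      # hypothetical correct y values
predicted_prices = [520_000, 700_000, 1_230_000]   # hypothetical model predictions

mae = mean_absolute_error(actual_prices, predicted_prices)
print("Mean absolute error: %.2f" % mae)
```

Because the errors are taken as absolute values, an overestimate and an underestimate of the same size count equally and do not cancel each other out.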
Let’s now run the entire model by right-clicking and selecting “Run” or
navigating from the Jupyter Notebook menu: Cell > Run All.
Wait a few seconds for the computer to process the training model. The
results, as shown below, will then appear at the bottom of the notebook.
For this exercise, our training set mean absolute error is $27,157.02 and the
test set mean absolute error is $169,962.99. This means that on average, the
model miscalculated the actual property value in the training set by a mere
$27,157.02. On the test set, however, it miscalculated by an average of
$169,962.99.
This means that our training model was very accurate at predicting the actual
value of properties contained in the training data. While $27,157.02 may
seem like a lot of money, this average error value is low given the maximum
range of our dataset is $8 million. As many of the properties in the dataset are
in excess of seven figures ($1,000,000+), $27,157.02 constitutes a reasonably
low error rate.
But how did the model fare with the test data? These results are less accurate.
The test data provided less indicative predictions with an average error rate of
$169,962.99. A high discrepancy between the training and test data is usually
a key indicator of overfitting. As our model is tailored to the training data, it
stumbled when predicting the test data, which probably contains new patterns
that the model hasn’t adjusted for. The test data, of course, is likely to contain
slightly different patterns and new potential outliers and anomalies.
However, in this case, the difference between the training and test data is
exacerbated by the fact that we configured the model to overfit the training
data. An example of this issue was setting max_depth to “30.” Although
setting a high max_depth improves the chances of the model finding patterns
in the training data, it does tend to lead to overfitting. Another possible cause
is a poor split of the training and test data, but for this model the data was
randomized using Scikit-learn.
Lastly, please take into account that because the training and test data are
shuffled randomly, your own results will differ slightly when replicating this
model on your own machine.
14
MODEL OPTIMIZATION
In the previous chapter we built our first supervised learning model. We now
want to improve its accuracy and reduce the effects of overfitting. A good
place to start is modifying the model’s hyperparameters.
Without changing any other hyperparameters, let’s first start by modifying
max_depth from “30” to “5.” The model now generates the following results:
Although the mean absolute error of the training set is higher, this helps
reduce the problem of overfitting and should improve the results of the test
data. Another step to optimize the model is to add more trees. If we set
n_estimators to 250, we see this result:
This second optimization reduces the training set’s absolute error rate by
approximately $11,000 and we now have a smaller gap between our training
and test results for mean absolute error.
Together, these two optimizations underline the importance of maximizing
and understanding the impact of individual hyperparameters. If you decide to
replicate this supervised machine learning model at home, I recommend that
you test modifying each of the hyperparameters individually and analyze
their impact on mean absolute error. In addition, you will notice changes in
the machine’s processing time based on the hyperparameters selected. For
instance, setting max_depth to “5” reduces total processing time compared to
when it was set to “30” because the maximum number of branch layers is
significantly lower. Processing speed and resources will become an important
consideration as you move on to working with larger datasets.
Another important optimization technique is feature selection. As you will
recall, we removed nine features while scrubbing our dataset. Now might be
a good time to reconsider those features and analyze whether they have an
effect on the overall accuracy of the model. “SellerG” would be an interesting
feature to add to the model because the real estate company selling the
property could have some impact on the final selling price.
Alternatively, dropping features from the current model may reduce
processing time without having a significant effect on accuracy—or may
even improve accuracy. To select features effectively, it is best to isolate
feature modifications and analyze the results, rather than applying various
changes at once.
While manual trial and error can be an effective technique to understand the
impact of variable selection and hyperparameters, there are also automated
techniques for model optimization, such as grid search. Grid search allows
you to list a range of configurations you wish to test for each hyperparameter,
and then methodically tests each of those possible hyperparameters. An
automated voting process takes place to determine the optimal model. As the
model must test each possible combination of hyperparameters, grid search
does take a long time to run! Example code for grid search is shown at the
end of this chapter.
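A minimal sketch of grid search using Scikit-learn's GridSearchCV is shown here. To keep it quick and self-contained, it runs on a small synthetic dataset generated with make_regression instead of the housing data, and the candidate values in param_grid are arbitrary examples. GridSearchCV ranks each combination by its cross-validated score, here negative mean absolute error over three folds.

```python
from sklearn import ensemble
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real training data
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Candidate values to try for each hyperparameter (2 x 2 x 2 = 8 combinations)
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 4],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    ensemble.GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",  # higher (closer to zero) is better
    cv=3,                               # three-fold cross-validation
)
search.fit(X, y)

print(search.best_params_)
```

Every extra value added to the grid multiplies the number of models that must be trained, which is why run time grows so quickly on larger datasets.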
Finally, if you wish to use a different supervised machine learning algorithm
and not gradient boosting, much of the code used in this exercise can be
replicated. For instance, the same code can be used to import a new dataset,
preview the dataframe, remove features (columns), remove rows, split and
shuffle the dataset, and evaluate mean absolute error.
http://scikit-learn.org is a great resource to learn more about other algorithms
as well as the gradient boosting used in this exercise.
# Remove price
del features_df['Price']
# Set up algorithm
model = ensemble.GradientBoostingRegressor(
    n_estimators=250,
    learning_rate=0.1,
    max_depth=5,
    min_samples_split=4,
    min_samples_leaf=6,
    max_features=0.6,
    loss='huber'
)
# Remove price
del features_df['Price']
# Input algorithm
model = ensemble.GradientBoostingRegressor()
| Machine Learning |
Machine Learning
Format: Coursera course
Presenter: Andrew Ng
Cost: Free
Suggested Audience: Beginners (especially those with a preference for
MATLAB)
A free and well-taught introduction from Andrew Ng, one of the most
influential figures in this field. This course has become a virtual rite of
passage for anyone interested in machine learning.
| Basic Algorithms |
| The Future of AI |
The Inevitable: Understanding the 12 Technological Forces That Will
Shape Our Future
Format: E-Book, Book, Audiobook
Author: Kevin Kelly
Suggested Audience: All (with an interest in the future)
A well-researched look into the future with a major focus on AI and machine
learning by The New York Times best-selling author Kevin Kelly. Provides a
guide to twelve technological imperatives that will shape the next thirty years.
| Programming |
| Recommendation Systems |
Recommender Systems
Format: Coursera course
Presenter: The University of Minnesota
Cost: Free 7-day trial or included with $49 USD Coursera subscription
Suggested Audience: All
Taught by the University of Minnesota, this Coursera specialization covers
fundamental recommender system techniques including content-based and
collaborative filtering as well as non-personalized and project-association
recommender systems.
| Deep Learning |
Deep Learning Simplified
Format: Blog
Channel: DeepLearning.TV
Suggested Audience: All
A short video series to get you up to speed with deep learning. Available for
free on YouTube.
| Future Careers |
Will a Robot Take My Job?
Format: Online article
Author: The BBC
Suggested Audience: All
Check how safe your job is in the AI era leading up to the year 2035.
Hotel Reviews
Does having a five-star reputation lead to more disgruntled guests, and
conversely, can two-star hotels rock the guest ratings by setting low
expectations and over-delivering? Or are one and two-star rated hotels simply
rated low for a reason? Find all this out from this sample dataset of hotel
reviews. This particular dataset covers 1,000 hotels and includes hotel name,
location, review date, text, title, username, and rating. The dataset is sourced
from Datafiniti’s Business Database, which includes almost every hotel in
the world.
Thank you,
Oliver Theobald
[1] BBC, Will A Robot Take My Job?, 2015, http://www.bbc.com/news/technology-34066941
[2] Nearshore Americas, Machine Learning Adoption Thwarted by Lack of Skills and Understanding, 2017, http://www.nearshoreamericas.com
[3] Arthur Samuel, Some Studies in Machine Learning Using the Game of Checkers, IBM Journal of Research and Development, Vol. 3, Issue 3, 1959.
[4] Arthur Samuel, Some Studies in Machine Learning Using the Game of Checkers, IBM Journal of Research and Development, Vol. 3, Issue 3, 1959.
[5] DataVisor, Unsupervised Machine Learning Engine, 2017, https://www.datavisor.com/unsupervised-machine-learning-engine/
[6] Kevin Kelly, The Inevitable: Understanding the 12 Technological Forces That Will Shape Our Future, Penguin Books, 2016.
[7] Torch, What is Torch?, 2017, http://torch.ch/