Data v2
Data is everywhere
Data science is a multidisciplinary blend of data inference, algorithm development, and technology used to solve analytically complex problems.
We will consider data science as a field of study and practice that involves the collection, storage, and processing of data in order to derive important insights into a problem or a phenomenon. Such data may be generated by humans (surveys, logs, etc.) or machines (weather data, road vision, etc.), and could be in different formats (text, audio, video, augmented or virtual reality, etc.).
The number of job postings for “data scientist” grew 57% year-over-year in the first quarter of 2015. Both industry and academia have recently increased their demand for data science and data scientists. The reason is not surprising: we have a lot of data, we continue to generate a staggering amount of data at an unprecedented and ever-increasing speed, analyzing such data wisely necessitates the involvement of competent and well-trained practitioners, and analyzing such data can provide actionable insights.
The “3V model” attempts to lay this out in a simple (and catchy) way. These are the three Vs:
1. Volume: The sheer scale of data being generated and stored.
2. Velocity: The speed at which data is generated and must be processed.
3. Variety: The massive array of data and types (structured and unstructured).
Data can be seen as (1) fact, (2) signal, and (3) symbol. Here, information is differentiated from data in that it is “useful.”
The question should be: Where do we not see data science these days? It is unlimited; it is everywhere.
Increase of data volume in the last 15 years.
Libraries
Security
Data Science and Business Analytics
Business analytics (BA) refers to the skills, technologies, and practices for continuous iterative exploration and investigation of past and current business performance to gain insight and be strategic.
There are four types of analytics, each of which holds opportunities for data scientists in business analytics:
• Decision analytics: supports decision-making with visual analytics that reflect reasoning.
• Descriptive analytics: provides insight from historical data with reporting, scorecards, clustering, etc.
• Predictive analytics: employs predictive modeling using statistical and machine learning techniques.
• Prescriptive analytics: recommends decisions using optimization, simulation, etc.
Broadly speaking, engineering in various fields (chemical, civil, computer, mechanical, etc.) has created demand for data scientists and data science methods.
Engineers constantly need data to solve problems. Data scientists have been called upon to develop methods and techniques to meet these needs. Likewise, engineers have assisted data scientists. Data science has benefited from new software and hardware developed via engineering, such as the CPU (central processing unit) and GPU (graphics processing unit) that substantially reduce computing time.
Computer scientists have developed numerous techniques and methods, such as (1) database (DB) systems that can handle the increasing volume of data in both structured and unstructured formats, expediting data analysis; (2) visualization techniques that help people make sense of data; and (3) algorithms that make it possible to compute complex and heterogeneous data in less time.
What is computational thinking? Typically, it means thinking like a computer scientist. Computational thinking is using abstraction and decomposition when attacking a large, complex task or designing a large, complex system.
By now, perhaps you are convinced that: (1) data science is a flourishing and fantastic field; (2) it is virtually everywhere; and (3) perhaps you want to pursue it as a career! To do so, you need three important skills:
1. Willingness to experiment.
3. Data literacy: the ability to extract meaningful information from a dataset.
Going forward, it is important that you develop a solid foundation in statistical techniques and computational thinking. You then need to pick up a couple of programming and data processing tools – Python, R, and SQL. If you already know some programming language (e.g., C, Java, PHP) or a scientific data processing environment (e.g., Matlab), you could use them to solve many or most of the problems and tasks in data science.
More on Data
“Just as trees are the raw material from which paper is produced, so too, can data be viewed as the raw material from which information is obtained.”
What matters is that any data – whether it is a number, a category, or a text – is labeled. In other words, we know what that number, category, or text means.
Unstructured data is data without labels. The lack of structure makes compiling and organizing unstructured data a time- and energy-consuming task. It would be easy to derive insights from unstructured data if it could be instantly transformed into structured data. However, structured data is akin to machine language, in that it makes information much easier for computers to parse. Unstructured data, on the other hand, is often how humans communicate (“natural language”); but people do not interact naturally with information in strict, database format.
• Public
• Accessible
• Described
• Reusable
• Complete
• Timely
• Managed post-release
We are living in a world where more and more devices – from lightbulbs to cars – are getting connected to the Internet, creating an emerging trend of the Internet of Things (IoT). These devices are generating and using much data, but not all of it is of “traditional” types (numbers, text). When dealing with such contexts, we may need to collect and explore multimodal (different forms) and multimedia (different media) data such as images, music and other sounds, gestures, body posture, and the use of space.
1. CSV (Comma-Separated Values) format is the most common import and export format for spreadsheets and databases.
2. TSV (Tab-Separated Values) files are used for raw data and can be imported into and exported from spreadsheet software.
3. XML (eXtensible Markup Language) was designed to be both human- and machine-readable, and can thus be used to store and transport data. In the real world, computer systems and databases contain data in incompatible formats. As XML data is stored in plain text format, it provides a software- and hardware-independent way of storing data. This makes it much easier to create data that can be shared by different applications.
4. RSS (Really Simple Syndication) is a format used to share data between services, and which was defined in the 1.0 version of XML. It facilitates the delivery of information from various sources on the Web. Information provided by a website in an XML file in such a way is called an RSS feed. Most current Web browsers can directly read RSS files, but a special RSS reader or aggregator may also be used.
5. JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is not only easy for humans to read and write, but also easy for machines to parse and generate. It is based on a subset of the JavaScript programming language and is built on two structures:
• A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
• An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.
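As a quick, minimal sketch of how these formats might be read in Python (the records and field names below are made up for illustration, not taken from any particular dataset):

# Reading small CSV, TSV, and JSON samples; the data here is hypothetical.
import io
import json
import pandas as pd

csv_text = "date,product,amount\n2021-01-05,widget,3\n2021-01-06,gadget,5"
sales = pd.read_csv(io.StringIO(csv_text))           # CSV: comma-separated fields, one record per line

tsv_text = "date\tproduct\tamount\n2021-01-05\twidget\t3"
logs = pd.read_csv(io.StringIO(tsv_text), sep="\t")  # TSV: same idea, tab-separated

json_text = '[{"name": "Ann", "orders": [3, 5]}, {"name": "Raj", "orders": [2]}]'
customers = json.loads(json_text)                    # name/value pairs and ordered lists
customers_df = pd.DataFrame(customers)               # flatten into a table for analysis

print(sales)
print(logs)
print(customers_df)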
Data in the real world is often dirty; that is, it needs to be cleaned up before it can be used for a desired purpose. This is often called data pre-processing. What makes data “dirty”? Here are some of the factors that indicate that data is not clean or ready to process:
• Incomplete. When some of the attribute values are lacking, certain attributes of interest are lacking, or attributes contain only aggregate data.
• Noisy. When data contains errors or outliers. For example, some of the data points in a dataset may contain extreme values that can severely affect the dataset’s range.
• Inconsistent. Data contains discrepancies in codes or names. For example, if the “Name” column for registration records of employees contains values other than alphabetical letters, or if records do not start with a capital letter, discrepancies are present.
Forms of data pre-processing
Data Cleaning
Since there are several reasons why data could be “dirty,” there are just as many ways to “clean” it. Below are three key methods that describe ways in which data may be “cleaned,” better organized, or scrubbed of potentially incorrect, incomplete, or duplicated information.
Data Munging
Often, the data is not in a format that is easy to work with. For example, it may be stored or presented in a way that is hard to process. Thus, we need to convert it to something more suitable for a computer to understand. To accomplish this, there is no specific scientific method. The approaches to take are all about manipulating or wrangling (or munging) the data to turn it into something that is more convenient or desirable. This can be done manually, automatically, or, in many cases, semi-automatically.
Consider the instruction “Add two diced tomatoes, three cloves of garlic, and a pinch of salt in the mix.” This can be turned into a table.
This table conveys the same information as the text, but it is more “analysis friendly.”
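A rough sketch of what such munging could look like in Python; the column names (ingredient, quantity, preparation) are one reasonable choice, not a prescribed structure:

# Turning the recipe sentence above into an "analysis friendly" table.
import pandas as pd

recipe = pd.DataFrame(
    [
        {"ingredient": "tomato", "quantity": 2, "preparation": "diced"},
        {"ingredient": "garlic", "quantity": 3, "preparation": "cloves"},
        {"ingredient": "salt", "quantity": "a pinch", "preparation": None},
    ]
)
print(recipe)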
Sometimes data may be in the right format, but some of the values are missing. Other times data may be missing due to problems with the process of collecting data, or an equipment malfunction. Or, comprehensiveness may not have been considered important at the time of collection. Furthermore, some data may get lost due to system or human error while storing or transferring the data. So, what to do when we encounter missing data? There is no single good answer. We need to find a suitable strategy based on the situation. Strategies to combat missing data include ignoring that record, using a global constant to fill in all missing values, imputation, inference-based solutions (Bayesian formula or a decision tree), etc. These inference techniques are vital topics in machine learning and data mining.
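A small sketch of two of these strategies in pandas; the DataFrame and its values are hypothetical:

# Handling a missing value by (1) ignoring the record or (2) imputing the column mean.
import numpy as np
import pandas as pd

readings = pd.DataFrame({"city": ["A", "B", "C", "D"],
                         "temperature": [70.1, np.nan, 68.4, 71.0]})

dropped = readings.dropna()                # strategy 1: ignore the incomplete record

imputed = readings.copy()                  # strategy 2: fill the gap with the mean
imputed["temperature"] = imputed["temperature"].fillna(imputed["temperature"].mean())

print(dropped)
print(imputed)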
There are times when the data is not missing, but it is corrupted for some reason. This is, in some ways, a bigger problem than missing data. Data corruption may be a result of faulty data collection instruments, data entry problems, or technology limitations. For example, a digital thermometer measures temperature to one decimal point (e.g., 70.1°F), but the storage system ignores the decimal points. So, now we have 70.1°F and 70.9°F both stored as 70°F. This may not seem like a big deal, but for humans a 99.4°F temperature means you are fine, and 99.8°F means you have a fever, and if our storage system represents both of them as 99°F, then it fails to differentiate between healthy and sick persons!
Just as there is no single technique to take care of missing data, there is no one way to remove noise, or smooth out the noisiness in the data. However, there are some steps to try. First, you should identify or remove outliers. For example, records of previous students who sat for a data science examination show all students scored between 70 and 90 points, barring one student who received just 12 points. It is safe to assume that the last student’s record is an outlier (unless we have a reason to believe that this anomaly is really an unfortunate case for a student!). Second, you could try to resolve inconsistencies in the data. For example, all entries of customer names in the sales data should follow the convention of capitalizing all letters, and you could easily correct them if they are not.
Data Integration
To be as efficient and effective for various data analyses as possible, data from various sources commonly needs to be integrated. The following steps describe how to integrate multiple databases or files.
1. Combine data from multiple sources into a coherent storage place (e.g., a single file or a database).
2. Engage in schema integration, or the combining of metadata from different sources.
a. A conflict may arise; for instance, the presence of different attributes and values from various sources for the same real-world entity.
b. Reasons for this conflict could be different representations or different scales; for example, metric vs. British units.
3. Address redundant data in data integration. Redundant data is commonly generated in the process of integrating multiple databases; for example, the same attribute may appear under different names in different sources.
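A brief sketch of these steps in pandas; the two source tables, their column names, and the key reconciliation are all hypothetical:

# Integrate two sources that describe the same customers under different schemas.
import pandas as pd

crm = pd.DataFrame({"customer_id": [1, 2], "name": ["Ann", "Raj"]})
sales = pd.DataFrame({"cust_id": [1, 2], "total_usd": [120.0, 80.5]})

# Schema integration: reconcile the differing key names, then combine the sources.
combined = crm.merge(sales.rename(columns={"cust_id": "customer_id"}),
                     on="customer_id", how="inner")

# Address redundancy: drop any exact duplicate rows produced by the integration.
combined = combined.drop_duplicates()
print(combined)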
Data Transformation
Data must be transformed so it is consistent and readable (by a system). The following five processes may be used for data transformation.
4. Normalization: values are scaled to fall within a small, specified range, often combined with aggregation. Some of the techniques that are used for accomplishing normalization (but we will not be covering them here) are:
a. Min–max normalization.
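As a quick illustration, min–max normalization rescales values into a small range such as [0, 1]; the numbers below are made up:

# Min–max normalization: (x - min) / (max - min) maps each value into [0, 1].
import numpy as np

values = np.array([70.0, 75.0, 90.0, 60.0])     # hypothetical raw values
normalized = (values - values.min()) / (values.max() - values.min())
print(normalized)                               # 60 -> 0.0, 90 -> 1.0, others in between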
Data Reduction
Data reduction is a key process in which a reduced representation of a dataset is obtained that produces the same or similar analytical results. One example of a large dataset that could warrant reduction is a data cube. Data cubes are multidimensional sets of data that can be stored in a spreadsheet.
A data cube could be in two, three, or a higher dimension. Each dimension typically represents an attribute of interest. Two of the most common techniques used for data reduction are described below.
• Data Cube Aggregation. The lowest level of a data cube is the aggregated data for an individual entity of interest. To do this, use the smallest representation that is sufficient to address the given task. In other words, we reduce the data to its most meaningful size and structure for the task at hand.
• Dimensionality Reduction. In contrast with the data cube aggregation method, where the data reduction is done with consideration of the task, the dimensionality reduction method works with respect to the nature of the data. Here, a dimension or a column in your data spreadsheet is referred to as a “feature,” and the goal of the process is to identify which features to remove or collapse into a combined feature. This requires identifying redundancy in the given data and/or creating composite dimensions or features that could sufficiently represent a set of raw features. Strategies for reduction include sampling, clustering, principal component analysis, etc.
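A minimal sketch of dimensionality reduction with principal component analysis (PCA) in scikit-learn; the data here is randomly generated just to show the mechanics:

# Collapse 5 raw features into 2 composite features (principal components).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))               # hypothetical: 100 records, 5 features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                      # (100, 2)
print(pca.explained_variance_ratio_)        # how much variance each component keeps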
Data Discretization
We are often dealing with data that are collected from processes that are continuous, such as temperature, ambient light, and a company’s stock price. But sometimes we need to convert these continuous values into more manageable parts. This mapping is called discretization. And as you can see, in undertaking discretization, we are also essentially reducing data. Thus, this process of discretization could also be perceived as a means of data reduction, but it holds particular importance for numerical data.
To achieve discretization, divide the range of continuous attributes into intervals. For instance, we could decide to split the range of temperature values into cold, moderate, and hot, or the price of company stock into above or below its market valuation.
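A small sketch of the temperature example in pandas; the readings and the bin edges (50°F and 75°F) are illustrative choices:

# Discretize continuous temperatures into the intervals cold / moderate / hot.
import pandas as pd

temps = pd.Series([31.0, 55.2, 68.9, 77.5, 93.1])
labels = pd.cut(temps, bins=[-float("inf"), 50, 75, float("inf")],
                labels=["cold", "moderate", "hot"])
print(labels.tolist())   # ['cold', 'moderate', 'moderate', 'hot', 'hot']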
These two terms – data analysis and data analytics – are often used interchangeably and could be confusing. Data analysis refers to hands-on data exploration and evaluation. Data analytics is a broader term and includes data analysis as a necessary subcomponent. Analytics defines the science behind the analysis. The science means understanding the cognitive processes an analyst uses to understand problems and explore data in meaningful ways.
One way to understand the difference between analysis and analytics is to think in terms of past and future. Analysis looks backwards, providing marketers with a historical view of what has happened. Analytics, on the other hand, models the future or predicts a result.
Analytics makes extensive use of mathematics and statistics and the use of descriptive techniques and predictive models to gain valuable knowledge from data. We can categorize analysis techniques into six classes of analysis and analytics: descriptive analysis, diagnostic analytics, predictive analytics, prescriptive analytics, exploratory analysis, and mechanistic analysis.
Descriptive Analysis
Descriptive analysis is about “what is happening now based on incoming data.” It is a method for quantitatively describing the main features of a collection of data. Here are a few key points about descriptive analysis:
Take the example of the Census Data Set, where descriptive analysis is applied on a whole population.
Of course, data needs to be displayed. Once some data has been collected, it is useful to plot a graph showing how many times each score occurs. This is known as a frequency distribution. Frequency distributions come in different shapes and sizes. Therefore, it is important to have some general descriptions for common types of distribution. The following are some of the ways in which statisticians can present numerical findings.
Histogram. Histograms plot values of observations on the horizontal axis, with a bar showing how many times each value occurred in the dataset.
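A minimal sketch of plotting such a frequency distribution with matplotlib; the scores here are randomly generated rather than taken from a real dataset:

# Plot a histogram (frequency distribution) of hypothetical exam scores.
import numpy as np
import matplotlib.pyplot as plt

scores = np.random.default_rng(1).normal(loc=80, scale=5, size=200)

plt.hist(scores, bins=15)        # bar heights show how often each range of scores occurs
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.title("Frequency distribution of scores")
plt.show()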
Normal Distribution. In an ideal world, data would be distributed symmetrically around the center of all scores. Thus, if we drew a vertical line through the center of a distribution, both sides should look the same. This so-called normal distribution is characterized by a bell-shaped curve.
There are two ways in which a distribution can deviate from normal: lack of symmetry (skew) and pointiness (kurtosis).
Measures of Centrality
Common measures of centrality are the mean, the median, and the mode.
Diagnostic Analytics
Diagnostic analytics are used for discovery, or to determine why something happened. Sometimes this type of analytics, when done hands-on with a small dataset, is also known as causal analysis, since it involves at least one cause (usually more than one) and one effect.
This allows a look at past performance to determine what happened and why. The result of the analysis is often referred to as an analytic dashboard. There are various types of techniques available for diagnostic or causal analytics. Among them, one of the most frequently used is correlation.
Correlations
Correlation is a statistical analysis that is used to measure and describe the strength and direction of the relationship between two variables. Strength indicates how closely two variables are related to each other, and direction indicates how one variable would change its value as the value of the other variable changes.
Correlation is a simple statistical measure that examines how two variables change together over time. Take, for example, “umbrella” and “rain.” If someone who grew up in a place where it never rained saw rain for the first time, this person would observe that, whenever it rains, people use umbrellas. They may also notice that, on dry days, folks do not carry umbrellas. By definition, “rain” and “umbrella” are said to be correlated! More specifically, this relationship is strong and positive. Think about this for a second.
An important statistic, Pearson’s r correlation, is widely used to measure the degree of the relationship between linearly related variables. When examining the stock market, for example, Pearson’s r correlation can measure the degree to which two commodities are related.
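A small sketch of Pearson’s r using SciPy; the rainfall and umbrella counts are invented to mimic the example above:

# Compute Pearson's r for two hypothetical, strongly related variables.
from scipy.stats import pearsonr

rainfall  = [0, 2, 5, 1, 0, 8, 3]      # daily rainfall (mm)
umbrellas = [1, 10, 22, 6, 0, 30, 14]  # umbrellas observed on the street

r, p_value = pearsonr(rainfall, umbrellas)
print(round(r, 2))                     # close to +1: a strong, positive correlation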
Predictive Analytics
As you may have guessed, predictive analytics has its roots in our ability to predict what might happen. These analytics are about understanding the future using the data and the trends we have seen in the past, as well as emerging new contexts and processes. An example is trying to predict how people will spend their tax refunds based on how consumers normally behave around a given time of the year (past data and trends), and how a new tax policy (new context) may affect people’s refunds.
1. First, once the data collection is complete, it needs to go through the process of cleaning.
2. Cleaned data can help us obtain hindsight into relationships between different variables. Plotting the data (e.g., on a scatterplot) is a good place to look for hindsight.
3. Next, we need to confirm the existence of such relationships in the data. This is where regression comes into play. From the regression equation, we can confirm the pattern of distribution inside the data. In other words, we obtain insight from hindsight.
4. Finally, based on the identified patterns, or insight, we can predict the future, i.e., foresight.
Process of predictive analytics
Prescriptive Analytics
Prescriptive analytics is dedicated to finding the best course of action for a given situation. It may start by first analyzing the situation (using descriptive analysis), but then moves toward finding connections among various parameters/variables, and their relation to each other, to address a specific problem, more likely that of prediction.
A process-intensive task, the prescriptive approach analyzes potential decisions, the interactions between decisions, the influences that bear upon these decisions, and the bearing all of this has on an outcome, to ultimately prescribe an optimal course of action in real time.
Prescriptive analytics can also suggest options for taking advantage of a future opportunity or mitigating a future risk, and illustrate the implications of each. In practice, prescriptive analytics can continually and automatically process new data to improve the accuracy of predictions and provide advantageous decision options.
For example, in healthcare, we can better manage the patient population by using prescriptive analytics to measure the number of patients who are clinically obese, then add filters for factors like diabetes and LDL cholesterol levels to determine where to focus treatment.
Exploratory Analysis
Often when working with data, we may not have a clear understanding of the problem or the situation. And yet, we may be called on to provide some insights. In other words, we are asked to provide an answer without knowing the question! This is where we go for an exploration.
Exploratory analysis is an approach to analyzing datasets to find previously unknown relationships. Often such analysis involves using various data visualization approaches. Yes, sometimes seeing is believing! But more important, when we lack a clear question or a hypothesis, plotting the data in different forms could provide us with some clues regarding what we may find or want to find in the data. Such insights can then be useful for defining future studies/questions, leading to other forms of analysis.
Usually not the definitive answer to the question at hand but only the start, exploratory analysis should not be used alone for generalizing and/or making predictions from the data.
Exploratory data analysis is an approach that postpones the usual assumptions about what kind of model the data follows in favor of the more direct approach of allowing the data itself to reveal its underlying structure in the form of a model. Thus, exploratory analysis is not a mere collection of techniques; rather, it offers a philosophy as to how to dissect a dataset; what to look for; how to look; and how to interpret the outcomes.
As exploratory analysis consists of a range of techniques, its application is varied as well. However, the most common application is looking for patterns in the data, such as finding groups of similar genes from a collection of samples.
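As a rough sketch of what such an exploration might look like in Python (the tiny dataset and its columns are hypothetical; in practice you would plot whatever data you were handed):

# Quick numerical summary and a couple of exploratory plots.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"height_cm": [160, 172, 181, 158, 169, 190],
                   "weight_kg": [55, 70, 84, 52, 66, 95]})

print(df.describe())                       # summary statistics for each column

df.plot.scatter(x="height_cm", y="weight_kg")   # look for relationships
df.hist()                                        # look at each variable's distribution
plt.show()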
Mechanistic Analysis
Mechanistic analysis involves understanding the exact changes in variables that lead to changes in other variables for individual objects. For instance, we may want to know how the number of free doughnuts per employee per day affects employee productivity. Perhaps by giving them one extra doughnut we gain a 5% productivity boost, but two extra doughnuts could end up making them lazy (and diabetic)!
More seriously, though, think about studying the effects of carbon emissions on bringing about the Earth’s climate change. Here, we are interested in seeing how the increased amount of CO2 in the atmosphere is causing the overall temperature to change. We now know that, in the last 150 years, CO2 levels have gone from 280 parts per million to 400 parts per million. And in that time, the Earth has heated up by 1.53 degrees Fahrenheit (0.85 degrees Celsius). This is a clear sign of climate change, something that we all need to be concerned about, but I will leave it there for now. What I want to bring you back to thinking about is the kind of analysis we presented here – that of studying a relationship between two variables. Such relationships are often explored using regression.
Regression
In statistical modeling, regression analysis is a process for estimating the relationships among variables. Given this definition, you may wonder how regression differs from correlation. The answer can be found in the limitations of correlation analysis. Correlation by itself does not provide any indication of how one variable can be predicted from another. Regression provides this crucial information.
Beyond estimating a relationship, regression analysis is a way of predicting an outcome variable from one predictor variable (simple linear regression) or several predictor variables (multiple linear regression). Linear regression, the most common form of regression used in data analysis, assumes this relationship to be linear. In other words, the relationship of the predictor variable(s) and outcome variable can be expressed by a straight line.
Regression analysis has a number of salient applications to data science and other statistical fields. In the business realm, for example, powerful linear regression can be used to generate insights on consumer behavior, which helps professionals understand business and factors related to profitability. It can also help a corporation understand how sensitive its sales are to advertising expenditures, or it can examine how a stock price is affected by changes in interest rates. Regression analysis may even be used to look to the future; an equation may forecast demand for a company’s products or predict stock behavior.
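A minimal sketch of fitting such a straight line with NumPy; the advertising-spend and sales figures are made up purely for illustration:

# Fit a line: sales ≈ slope * ad_spend + intercept, then forecast a new value.
import numpy as np

ad_spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical spend (thousands of dollars)
sales = np.array([12.0, 15.5, 18.0, 22.5, 24.0]) # hypothetical sales (units)

slope, intercept = np.polyfit(ad_spend, sales, deg=1)
forecast = slope * 6.0 + intercept               # predicted sales for a new spend level
print(round(slope, 2), round(intercept, 2), round(forecast, 1))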
Machine Learning
Machine learning is a spin-off, or a subset, of artificial intelligence (AI). Here, the goal is to give “computers the ability to learn without being explicitly programmed.”
d. Scalability.
e. Ensemble modeling.
Note here that, in most cases, the application of machine learning is entwined with the application of statistical analysis. Therefore, it is important to remember the differences in the nomenclature of these two fields.
• In machine learning, a target is called a label.
Machine learning algorithms are organized into a taxonomy, based on the desired outcome of the algorithm. Common algorithm types include:
a. Supervised learning. When we know the labels on the training examples we are using to learn.
b. Unsupervised learning. When we do not know the labels (or even the number of labels or classes) from the training examples we are using for learning.
Also to note: one phrase you often hear with machine learning is data mining. That is because machine learning and data mining overlap quite significantly in many places. Depending on who you talk to, one is seen as a precursor or entry point for the other. In the end, it does not matter, as long as we keep our focus on understanding the context and deriving some meaning out of the data.
Data mining is about understanding the nature of the data to gain insight into the problem that generated the dataset in the first place, or some unidentified issues that may arise in the future. Take the case of customers’ brand loyalty in the highly competitive e-commerce market. All of the e-commerce platforms store a database of customers’ previous purchases and return history along with customer profiles. This kind of dataset not only helps the business owners to understand existing customers’ purchasing patterns, such as the products they may be interested in, or to measure brand loyalty, but also provides in-depth knowledge about potential new customers.
Regression
Think about it as a much more sophisticated version of extrapolation. For example, if you know the relationship between education and income (the more someone is educated, the more money they make), then we could predict someone’s income based on their education. Simply speaking, learning such a relationship is regression.
In more technical terms, regression is concerned with modeling the relationship between variables of interest. These relationships use some measures of error in the predictions to refine the models iteratively. In other words, regression is a process.
We can learn about two variables relating in some way (e.g., correlation), but if there is a relationship of some kind, can we figure out if or how one variable could predict the other? Linear regression allows us to do that. Specifically, we want to see how a variable X affects a variable y. Here, X is called the independent variable or predictor; y is called the dependent variable or response. Take note of the notation here. The X is in uppercase because it could have multiple feature vectors, making it a feature matrix. If we are dealing with only a single feature for X, we may decide to use the lowercase x. On the other hand, y is in lowercase because it is a single value or feature being predicted.
As mentioned previously, linear regression fits a line (or plane, or hyperplane) to the dataset. For example, in the figure below, we want to predict the annual return using the excess return of a stock in a stock portfolio. The line represents the relation between these two variables. Here, it happens to be quite linear (most of the data points lie close to the line), but such is not always the case.
• Linear regression
• Logistic regression
• Stepwise regression
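A brief sketch of the X/y convention described above, using scikit-learn’s linear regression; the feature matrix and response values are hypothetical:

# Fit a line to a single-feature X and numerical response y, then predict a new point.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5]])   # feature matrix (one column)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])             # response values

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)       # slope and intercept of the fitted line
print(model.predict([[3.0]]))              # prediction for a new value of X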
While it is easy to understand individual tools and methods, it is not always clear how to pick the best one(s) for a given problem. There are multiple factors that need to be considered before choosing the right algorithm for a problem. Some of these factors are discussed below.
Accuracy
Most of the time, beginners in machine learning incorrectly assume that for each problem the best algorithm is the most accurate one. However, getting the most accurate answer possible is not always necessary. Sometimes an approximation is adequate, depending on the problem. If so, you may be able to cut your processing time dramatically by sticking with more approximate methods. Another advantage of more approximate methods is that they naturally tend to avoid overfitting.
Training Time
The number of minutes or hours necessary to train a model varies between algorithms. Training time is often closely tied to accuracy – one typically accompanies the other. In addition, some algorithms are more sensitive to the number of data points than others. A limit on time can drive the choice of algorithm, especially when the dataset is large.
Linearity
Lots of machine learning algorithms make use of linearity. Linear classification algorithms assume that classes can be separated by a straight line (or its higher-dimensional analog). These include logistic regression and support vector machines. Linear regression algorithms assume that data trends follow a straight line. These assumptions are not bad for some problems, but on others they bring accuracy down.
Number of Parameters
Parameters are the knobs a data scientist gets to turn when setting up an algorithm. They are numbers that affect the algorithm’s behaviour, such as error tolerance, number of iterations, or options between variants of how the algorithm behaves. The training time and accuracy of the algorithm can sometimes be quite sensitive to getting just the right settings. Typically, algorithms with a large number of parameters require the most trial and error to find a good combination.
Number of Features
For certain types of data, the number of features can be very large compared to the number of data points. This is often the case with genetics or textual data. The large number of features can bog down some learning algorithms, making training time unfeasibly long. Support vector machines are particularly well suited to this case.
Often the hardest part of solving a machine learning problem can be finding the right estimator for the job. Different estimators are better suited for different types of data and different problems. How do we learn about when to use which estimator or technique? There are two primary ways that I can think of: (1) developing a comprehensive theoretical understanding of different ways we could develop estimators or build models; and (2) through lots of hands-on experience. As you may have guessed, in this book, we are going with the latter.
Supervised learning
Supervised learning algorithms use a set of examples from previous records to make predictions about the future. For instance, existing car prices can be used to make guesses about future models. Each example used to train such an algorithm is labeled with the value of interest – in this case, the car’s price. A supervised learning algorithm looks for patterns in a training set. It may use any information that might be relevant – the season, the car’s current sales records, similar offerings from competitors, the manufacturer’s brand perception among consumers – and each algorithm may look for a different set of information and find different types of patterns. Once the algorithm has found the best pattern it can, it uses that pattern to make predictions for unlabeled testing data – tomorrow’s values.
There are several types of supervised learning that exist within machine learning. Among them, the three most commonly used algorithm types are regression, classification, and anomaly detection.
Logistic Regression
One thing to note about linear regression is that the outcome variable is numerical. So, the question is: What happens when the outcome variable is not numerical? For example, suppose you have a weather dataset with the attributes humidity, temperature, and wind speed, each describing one aspect of the weather for a day. Based on these attributes, you want to predict if the weather for the day is suitable for playing golf. In this case, the outcome variable that you want to predict is categorical (“yes” or “no”). Fortunately, to deal with this kind of classification problem, we have logistic regression.
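A minimal sketch of such a classifier with scikit-learn; the weather records and the “good day for golf” labels are invented for illustration:

# Logistic regression on a hypothetical weather dataset (1 = play golf, 0 = don't).
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[85, 72, 5], [90, 65, 12], [60, 75, 3],     # [humidity %, temperature °F, wind mph]
              [55, 80, 4], [95, 60, 15], [50, 78, 2]])
y = np.array([0, 0, 1, 1, 0, 1])

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([[58, 76, 4]]))          # predicted class for a new day
print(clf.predict_proba([[58, 76, 4]]))    # probability of each class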
Softmax Regression
So far, we have seen regression for a numerical outcome variable as well as regression for a binomial (“yes” or “no”, “1” or “0”) categorical outcome. But what happens if we have more than two categories? For example, you want to rate a student’s performance, based on the marks they got in individual subjects, as “excellent,” “good,” “average,” or “below average.” We need multinomial logistic regression for this. In this sense, multinomial logistic regression, or softmax regression, is a generalization of regular logistic regression to handle multiple (more than two) classes.
In softmax regression, we replace the sigmoid function from logistic regression with the so-called softmax function. This function takes a vector of n real numbers as input and normalizes the vector into a distribution of n probabilities. That is, the function transforms all the n components from any real values (positive or negative) to values in the interval (0, 1).
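A tiny sketch of the softmax function itself; the raw scores for four hypothetical performance categories are made up:

# The softmax function: turn a vector of real numbers into a probability distribution.
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return exp_z / exp_z.sum()

scores = np.array([2.0, 1.0, 0.5, -1.0])   # hypothetical raw scores for 4 categories
probs = softmax(scores)
print(probs, probs.sum())                  # values in (0, 1) that sum to 1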
Classification can be supervised or unsupervised. The former is the case when assigning a label to a picture as, for example, either “cat” or “dog.” Here the number of possible choices is predetermined. When there are only two choices, it is called two-class or binomial classification. When there are more categories, it is known as multiclass or multinomial classification. There are many methods and algorithms for building classifiers, with k nearest neighbor (kNN) being one of the most popular.
1. Store the existing data points along with their labels (the training set).
2. When we get a new data point, we compare it to each of our existing data points and find similarity.
3. Take the k most similar data points (the k nearest neighbors).
4. From these k data points, take the majority vote of their labels. The winning label is the label/class of the new data point.
The number k is usually small, between 2 and 20. As you can imagine, the larger the number of nearest neighbors (the value of k), the longer it takes us to do the processing.
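A minimal sketch of kNN with scikit-learn; the six labeled points and the value k = 3 are arbitrary choices for illustration:

# k nearest neighbors: classify new points by majority vote among the 3 closest training points.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]])   # two features per point
y = np.array([0, 0, 0, 1, 1, 1])                                 # two classes

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[2, 2], [6, 5]]))       # each prediction is the majority label of its 3 neighbors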
Decision Tree
In machine learning, a decision tree is used for classification problems. In such problems, the goal is to create a model that predicts the value of a target variable based on several input variables. A decision tree builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.
Several algorithms exist that generate decision trees, such as ID3/4/5, CART, and CLS.
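A small sketch of growing and inspecting such a tree with scikit-learn (which uses a CART-style algorithm); the weather-like data and labels are hypothetical:

# Build a shallow decision tree and print its decision nodes and leaf nodes.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[85, 10], [90, 15], [60, 5], [65, 3], [95, 20], [55, 4]]   # [humidity %, wind mph]
y = [0, 0, 1, 1, 0, 1]                                          # 1 = play golf, 0 = don't

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["humidity", "wind"]))    # the learned splits
print(tree.predict([[70, 6]]))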
Decision Rule
Rules are a popular alternative to decision trees. Rules typically take the form of an {IF: THEN} expression (e.g., {IF “condition” THEN “result”}). Typically, for any dataset, an individual rule in itself is not a model, as this rule can be applied only when the associated condition is satisfied. Therefore, rule-based machine learning methods typically identify a set of rules that collectively comprise the prediction model, or the knowledge base.
Decision rules (left) and decision tree (right) for weather data
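To make the idea concrete, here is a hypothetical rule set for the golf-playing decision, written as plain {IF: THEN} expressions; the conditions and thresholds are illustrative, not taken from the figure:

# A small set of decision rules that together form the prediction model.
def classify_weather(outlook, humidity, windy):
    if outlook == "overcast":
        return "play"
    if outlook == "sunny" and humidity <= 75:
        return "play"
    if outlook == "rainy" and not windy:
        return "play"
    return "don't play"

print(classify_weather("sunny", 80, windy=False))   # -> "don't play"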
Random Forest
A decision tree seems like a nice method for doing classification – it typically has good accuracy, and, more importantly, it provides human-understandable insights. But one big problem the decision tree algorithm has is that it could overfit the data. What does that mean? It means it could try to model the given data so well that, while the classification accuracy on that dataset would be wonderful, the model may find itself crippled when looking at any new data; it learned too much from the data!
One way to address this problem is to use not just one, not just two, but many decision trees, each one created slightly differently, and then take some kind of average from what these trees decide and predict. Such an approach is so useful and desirable that in many situations there is a whole set of algorithms that apply it. They are called ensemble methods.
In machine learning, ensemble methods rely on multiple learning algorithms to obtain better prediction accuracy than what any of the constituent learning algorithms can achieve. In general, an ensemble algorithm consists of a concrete and finite set of alternative models but incorporates a much more flexible structure among those alternatives. One example of an ensemble method is random forest, which can be used for both regression and classification tasks.
Random forest operates by constructing a multitude of decision trees at training time and selecting the mode of the class as the final class label for classification, or the mean prediction of the individual trees when used for regression tasks. The advantage of using random forest over a decision tree is that the former tries to correct the decision tree’s habit of overfitting the data to the training set.
For a training set of size N, each decision tree is created in the following manner:
1. A sample of the N training cases is taken at random but with replacement from the original training set. This sample will be used as the training set to grow the tree.
2. If the dataset has M input variables, a number m (m being a lot smaller than M) is specified such that, at each node, m variables are selected at random out of the M. Among these m, the best split is used to split the node. The value of m is held constant while we grow the forest.
3. Following the above steps, each tree is grown to its largest possible extent and there is no pruning.
4. Predict new data by aggregating the predictions of the n trees (i.e., majority votes for classification, average for regression).
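A minimal sketch of the procedure above using scikit-learn’s random forest; the data points and the choice of 100 trees are arbitrary for illustration:

# Random forest: many trees grown on bootstrap samples, combined by majority vote.
from sklearn.ensemble import RandomForestClassifier

X = [[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7], [2, 2], [7, 7]]
y = [0, 0, 0, 1, 1, 1, 0, 1]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict([[2, 3], [6, 5]]))    # each prediction is a majority vote across the trees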
Random forest is considered a panacea for all data science problems among many of its practitioners. There is a belief that, when you cannot think of any algorithm, irrespective of the situation, use random forest. This is a bit irrational, since no algorithm strictly dominates in all applications (one size does not fit all). Nonetheless, people have their favorite algorithms. And there are reasons why, for many data scientists, random forest is the favorite:
1. It can solve both types of problems, that is, classification and regression, and does a decent estimation for both.
2. Random forest requires almost no input preparation. It can handle binary features, categorical features, and numerical features without any need for scaling.
3. Random forest is not very sensitive to the specific set of parameters used. As a result, it does not require a lot of tweaking and fiddling to get a decent model; just use a large number of trees and things will not go terribly awry.
So, is random forest a silver bullet? Absolutely not. First, it does a good job at classification but not as good a job at regression problems, since it does not give precise continuous predictions. Second, random forest can feel like a black-box approach for statistical modelers, as you have very little control over what the model does. At best, you can try different parameters and random seeds and hope that will change the output.
Naïve Bayes
This is a very popular and robust approach for classification that uses Bayes’ theorem. Bayesian classification represents a supervised learning method as well as a statistical method for classification. In a nutshell, it is a classification technique based on Bayes’ theorem with an assumption of independence among predictors. Here, all attributes contribute equally and independently to the decision. In simple terms, a Naïve Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about three inches in diameter. Even if these features depend on each other or upon the existence of other features, all of these properties independently contribute to the probability that this fruit is an apple, and that is why it is known as naïve. It turns out that, in most cases, while such a naïve assumption is found to be untrue, the resulting classification models do amazingly well.
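A small sketch of a Naïve Bayes classifier in scikit-learn, using the fruit example above with made-up feature values (redness, roundness, diameter):

# Gaussian Naïve Bayes: each feature contributes independently to the class probability.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[0.9, 0.8, 3.0], [0.8, 0.9, 3.2], [0.2, 0.4, 1.0],
              [0.3, 0.3, 0.8], [0.85, 0.95, 2.9], [0.1, 0.5, 1.2]])
y = np.array([1, 1, 0, 0, 1, 0])           # 1 = apple, 0 = not an apple

nb = GaussianNB().fit(X, y)
print(nb.predict([[0.88, 0.85, 3.1]]))      # predicted class for a new fruit
print(nb.predict_proba([[0.88, 0.85, 3.1]]))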
One thing that has been common in all the classifier models we have seen so far is that they assume linear separation of classes. In other words, they try to come up with a decision boundary that is a line (or a hyperplane in a higher dimension). But many problems do not have such linear characteristics. Support vector machine (SVM) is a method for the classification of both linear and nonlinear data.
SVMs are considered by many to be the best stock classifier for doing machine learning tasks. By stock, here we mean in its basic form and not modified. This means you can take the basic form of the classifier and run it on the data, and the results will have low error rates. Support vector machines make good decisions for data points that are outside the training set. In a nutshell, an SVM is an algorithm that uses nonlinear mapping to transform the original training data into a higher dimension. Within this new dimension, it searches for the linear optimal separating hyperplane (i.e., a decision boundary separating the tuples of one class from another). With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane. The SVM finds this hyperplane using support vectors (“essential” training tuples) and margins (defined by the support vectors).
Linearly separable data
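A brief sketch of an SVM with a nonlinear (RBF) kernel in scikit-learn; the two-class points below are invented so that one class surrounds the other and a straight line cannot separate them:

# SVM with a nonlinear kernel: map the data to a higher dimension and separate it there.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0.3, 0.2], [-0.2, 0.1], [2, 2], [-2, 2], [2, -2], [-2, -2]])
y = np.array([0, 0, 0, 1, 1, 1, 1])

svm = SVC(kernel="rbf").fit(X, y)
print(svm.support_vectors_)                 # the "essential" training points
print(svm.predict([[0.1, -0.1], [1.8, 1.9]]))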
We saw how to learn from data when the labels or true values associated with them are available. In other words, we knew what was right or wrong, and we used that information to build a regression or classification model that could then make predictions for new data. Such a process fell under supervised learning. Now, we will consider the other big area of machine learning where we do not know the true labels or values for the given data, and yet we want to learn the underlying structure of that data and be able to explain it. This is called unsupervised learning.
In unsupervised learning, data points have no labels associated with them. Instead, the goal of an unsupervised learning algorithm is to organize the data in some way or to describe its structure. This can mean grouping it into clusters or finding different ways of looking at complex data so that it appears simpler or more organized.
Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields.
Agglomerative Clustering
This is a bottom-up approach of building clusters, or groups of similar data points, from individual data points. Following is a general outline of how an agglomerative clustering algorithm runs.
1. Use any computable cluster similarity measure, for example, Euclidean distance, cosine similarity, etc.
2. Start by treating each data point as its own cluster.
3. Repeat {
– identify the two most similar clusters Cj and Ck (there could be ties – choose one pair)
– merge Cj and Ck into a single cluster
} until only one cluster (or the desired number of clusters) remains.
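A minimal sketch of this bottom-up merging with scikit-learn; the six 2-D points and the choice of two final clusters are arbitrary:

# Agglomerative clustering: repeatedly merge the most similar clusters until 2 remain.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 1], [1.2, 0.9], [0.8, 1.1], [5, 5], [5.2, 4.8], [4.9, 5.1]])

agg = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = agg.fit_predict(X)
print(labels)     # cluster assignment for each point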
Divisive Clustering
The reverse of the agglomerative technique, divisive clustering works in a top-down mode, where the goal is to break up the cluster containing all objects into smaller clusters.
There is a simple and effective algorithm to carry out the general approach described above: k-means. One of the most frequently used clustering algorithms, k-means clustering is an algorithm to classify or to group your objects, based on attributes or features, into k groups, where k is a positive integer.
1. The basic step of k-means clustering is simple. In the beginning, we determine the number of clusters (k) that we want, and we assume the centroid or center of these clusters. We can take any random objects as the initial centroids, or the first k objects in sequence can also serve as the initial centroids.
2. Then the k-means algorithm will do the three steps below until convergence.
Step 2: Put any initial partition that classifies the data into k clusters. You may assign the training samples randomly or systematically, as in the following:
2. Assign each of the remaining (N − K) training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster.
Step 3: Take each sample in sequence and compute its distance from the centroid of each of the clusters. If a sample is not currently in the cluster with the closest centroid, switch this sample to that cluster and update the centroid of the cluster gaining the new sample and the cluster losing the sample.
Repeat the above three steps until convergence is achieved – that is, until a pass through the training samples causes no new assignments.
We have seen clustering, classification algorithms, and probabilistic models that are based on the existence of efficient and robust procedures for learning parameters from observations. Often, however, the only data available for training a model are incomplete. Missing values can occur, for example, in medical diagnoses, where patient histories generally include results from a limited battery of tests. The expectation maximization (EM) algorithm is a fantastic approach to addressing this problem. The EM algorithm enables parameter estimation in probabilistic models with incomplete data.
Reinforcement learning
Reinforcement learning (RL) attempts to model how software agents should take actions in an environment so as to maximize some form of cumulative reward.
Let us take an example. Imagine you want to train a computer to play chess against a human. In such a case, determining the best move to make depends on a number of factors. The number of possible states that can exist in a game is usually very large. To cover these many states using a standard rules-based approach would mean specifying a lot of hard-coded rules. RL cuts out the need to manually specify rules, and RL agents learn simply by playing the game. For two-player games, such as backgammon, agents can be trained by playing against other human players or even other RL agents.
In RL, the algorithm decides to choose the next course of action once it sees a new data point. Based on how suitable the action is, the learning algorithm also gets some incentive a short time later. The algorithm always modifies its course of action toward the highest reward. Reinforcement learning is common in robotics, where the set of sensor readings at one point in time is a data point, and the algorithm must choose the robot’s next action. It is also a natural fit for Internet-of-Things (IoT) applications.
The typical framing of a reinforcement learning (RL) scenario: an agent takes actions in an environment, which are interpreted into a reward and a representation of the state, which is then fed back into the agent.
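As a rough illustration of the reward-driven loop described above, here is a tiny tabular Q-learning sketch on a made-up five-state corridor; the environment, rewards, and parameter values are all hypothetical and chosen only to show the mechanics:

# Tabular Q-learning on a toy corridor: reaching state 4 yields reward 1.
import numpy as np

n_states, n_actions = 5, 2                 # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))        # estimated cumulative reward per (state, action)
alpha, gamma, epsilon = 0.5, 0.9, 0.3      # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

for episode in range(200):
    state = 0
    while state != 4:
        # Mostly exploit the best known action, sometimes explore a random one.
        action = rng.integers(n_actions) if rng.random() < epsilon else int(Q[state].argmax())
        next_state = min(state + 1, 4) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == 4 else 0.0
        # Q-learning update: nudge the estimate toward the reward plus discounted future value.
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(Q.round(2))   # the "move right" action values dominate, pointing toward the reward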