Linear Regression by Hand Linear Regression Is A Data Scientist S by Richard Peterson Towards Data Science 24062021 033829pm
https://towardsdatascience.com/linear-regression-by-hand-ee7fe5a751bf 1/7
24/06/2021 Linear Regression by Hand. Linear regression is a data scientist’s… | by Richard Peterson | Towards Data Science
Linear regression is a form of linear algebra that was allegedly invented by Carl
Friedrich Gauss (1777–1855), but was first published in a scientific paper by Adrien-
Marie Legendre (1752–1833). Gauss used the least squares method to guess when and
where the asteroid Ceres would appear in the night sky (The Discovery of Statistical
Regression, 2015). This was not a hobby project; it was a well-funded research
project for the purpose of oceanic navigation, a highly competitive field that was
sensitive to technological disruption.
Linear regression attempts to model the relationship between two variables by fitting a
linear equation to observed data (Linear Regression, n.d.).
Scatterplots
Let’s pick some data to use as an example. The relationship between chimpanzee
hunting party size and percentage of successful hunts is well documented (Busse,
1978). I am going to grab a few data points from Busse to use for this article and plot
the data using a seaborn scatterplot. Notice how the line I drew through the data does
not fit it perfectly, but the points approximate a linear pattern. The line I drew through
the data is the Least Squares Line, and is used to predict y values for given x values.
Using just a rudimentary Least Squares Line drawn by hand through the data, we could
predict that a hunting party of 4 chimpanzees is going to be around 52% successful. We
are not 100 percent accurate, but with more data, we would likely improve our
accuracy. How well the data fits the Least Squares Line is measured by the Correlation
Coefficient.
The least squares line is defined as the line where the sum of the squares of the vertical
distances from the data points to the line is as small as possible (Lial, Greenwell and
Ritchey, 2016).
The least squares line has two components: the slope m, and y-intercept b. We will
solve for m first, and then solve for b. The equations for m and b are:
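The formulas themselves do not appear in the text here. A common textbook form of the least squares slope and intercept, consistent with the sums this article goes on to compute but an assumption about what the original showed, is:

```latex
m = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {n\sum x^{2} - \left(\sum x\right)^{2}},
\qquad
b = \frac{\sum y - m\sum x}{n}
```

Here n is the number of data points, and b depends on m, which is why m is solved first.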
That’s a lot of Sigmas (∑)! But don’t worry, Sigma just means “sum of”, such as “sum of
x,” symbolized by ∑x, which is just the sum of the x column, “Number of Chimpanzees.”
We need to calculate ∑x, ∑y, ∑xy, ∑x², and ∑y². Each piece will then be fed into the
equations for m and b. Create the below table based on our original dataset.
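As a rough sketch, the table could be built in Python as follows. The x and y values are assumptions: they are illustrative numbers chosen to be consistent with the fitted line reported later in the article (n = 8), because the original table is not reproduced in this text. The helper name sigma_table is also hypothetical.

```python
# Assumed data: illustrative values, NOT copied from the article's table.
xs = [1, 2, 3, 4, 5, 6, 7, 8]          # number of chimpanzees (x)
ys = [30, 45, 51, 57, 60, 65, 70, 71]  # percent of successful hunts (y)


def sigma_table(xs, ys):
    """Return the per-row values (x, y, xy, x^2, y^2) and their column sums."""
    rows = [(x, y, x * y, x ** 2, y ** 2) for x, y in zip(xs, ys)]
    # zip(*rows) turns the list of rows into columns, which are then summed.
    totals = tuple(sum(column) for column in zip(*rows))
    return rows, totals


rows, totals = sigma_table(xs, ys)
sum_x, sum_y, sum_xy, sum_x2, sum_y2 = totals

print("x  y  xy  x^2  y^2")
for row in rows:
    print(*row)
print("Sums:", sum_x, sum_y, sum_xy, sum_x2, sum_y2)
```

Each column sum corresponds to one of the Sigma terms: ∑x, ∑y, ∑xy, ∑x², and ∑y².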
Now it is a simple matter to plug our Sigma values into the equations for m and b. n is
the number of data points in the dataset, which in our case is 8.
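The plug-in step can be sketched in Python. The formulas are the common textbook least squares form, and the sums passed in are assumptions derived from illustrative data that is consistent with the article's result; they are not taken from the original table. The function name least_squares_line is hypothetical.

```python
def least_squares_line(n, sum_x, sum_y, sum_xy, sum_x2):
    """Return slope m and intercept b of the least squares line from the sums."""
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n  # b uses m, so m is computed first
    return m, b


# Assumed sums for n = 8 illustrative data points.
m, b = least_squares_line(n=8, sum_x=36, sum_y=449, sum_xy=2249, sum_x2=204)
print(round(m, 4), round(b, 4))  # 5.4405 31.6429
```

With these assumed sums, the rounded slope and intercept agree with the equation the article reports.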
There you have it! You can make predictions of y from given values of x using your
equation: y = 5.4405x + 31.6429. This means that our line crosses the y-axis at 31.6429 and
the y-values increase by 5.4405 percentage points for each additional chimpanzee that joins
the hunting party. To test this out, let’s predict the percent hunt success for 4
chimpanzees.
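A minimal sketch of that prediction, using the slope and intercept from the equation above. The function name predict_success is hypothetical.

```python
M = 5.4405   # slope from the article's equation
B = 31.6429  # y-intercept from the article's equation


def predict_success(party_size, m=M, b=B):
    """Predict percent hunt success for a given party size using y = m*x + b."""
    return m * party_size + b


prediction = predict_success(4)
print(round(prediction, 4))  # 53.4049
```

The result, roughly 53.4 percent, is close to the 52 percent read off the hand-drawn line earlier.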
We just predicted the percentage of successful hunts for a chimpanzee hunting party
based solely on knowledge of their group size, which is pretty amazing!
Let’s plot the least squares line over our previous scatterplot using Python to show how
it fits the data. seaborn.regplot() is a great chart to use in this situation, but for
demonstration purposes, I will manually create the y = mx + b line and lay it over the
seaborn chart.
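One way this overlay might look is sketched below. It is not the author's original code. The data values are assumptions (illustrative numbers consistent with the fitted line), and the helper names line_endpoints and plot_fit are hypothetical. It assumes seaborn and matplotlib are installed.

```python
# Assumed data: illustrative values, NOT copied from the article's table.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [30, 45, 51, 57, 60, 65, 70, 71]

# Slope and intercept as reported in the article.
M, B = 5.4405, 31.6429


def line_endpoints(m, b, x_min, x_max):
    """Return the x and y coordinates of the two endpoints of y = m*x + b."""
    return [x_min, x_max], [m * x_min + b, m * x_max + b]


def plot_fit(xs, ys, m, b):
    """Draw a seaborn scatterplot and lay the y = m*x + b line over it."""
    # Imported here so line_endpoints works without the plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots()
    sns.scatterplot(x=xs, y=ys, ax=ax)
    line_x, line_y = line_endpoints(m, b, min(xs), max(xs))
    ax.plot(line_x, line_y, color="red", label=f"y = {m}x + {b}")
    ax.set_xlabel("Number of Chimpanzees")
    ax.set_ylabel("Percent of successful hunts")
    ax.legend()
    return fig
```

Calling plot_fit(xs, ys, M, B) returns a matplotlib figure that can be shown or saved. Because the line is straight, two endpoints are enough to draw it.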
However, now that you can make predictions, you need to qualify your predictions with
the Correlation Coefficient, which describes how well the data fits your calculated
line.
Correlation Coefficient
We use the Correlation Coefficient to determine if the least squares line is a good model
for our data. If the data points are not linear, a straight line will not be the right model
for prediction. Karl Pearson developed the Correlation Coefficient r, which ranges from -1
to 1 and measures the strength of the linear relationship between two variables
(Lial, Greenwell and Ritchey, 2016). If r is exactly -1 or 1, the data fits the line
exactly, with no deviation from the line. r = 0 means that there is no linear
correlation. As r approaches zero, the linear association weakens.
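The formula for r is not reproduced in the text here. The usual computational form of Pearson's correlation coefficient, written with the same sums used for m and b, is given below as an assumption about what the original showed:

```latex
r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {\sqrt{n\sum x^{2} - \left(\sum x\right)^{2}}\;
          \sqrt{n\sum y^{2} - \left(\sum y\right)^{2}}}
```

The numerator is the same as the numerator of m, so r always has the same sign as the slope.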
Luckily, these Sigma values have already been calculated in our previous table. We
simply plug them into our equation.
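A sketch of that calculation in Python, using the usual computational form of Pearson's r. The sums are assumptions derived from illustrative data that is consistent with the article's fitted line, not values copied from the original table, and the function name correlation_coefficient is hypothetical.

```python
from math import sqrt


def correlation_coefficient(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Return Pearson's r computed from the column sums."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


# Assumed sums for n = 8 illustrative data points.
r = correlation_coefficient(
    n=8, sum_x=36, sum_y=449, sum_xy=2249, sum_x2=204, sum_y2=26541
)
print(round(r, 4))
```

With these assumed sums, r comes out at roughly 0.96, which is close to positive 1.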
Our value is close to positive 1, which means that the data is strongly and positively
correlated. You could have guessed this from looking at the least squares line plotted
over the scatterplot, but the Correlation Coefficient gives you a number to back it up!
Conclusion
Linear regression is one of the best machine learning methods available to a data
scientist or a statistician. There are many ways to create a machine learning model
using your programming skills, but it is definitely a good idea to familiarize yourself
with the math used by the model.
References
Busse, C. D. (1978). Do Chimpanzees Hunt Cooperatively? The American Naturalist,
112(986), 767–770. https://doi.org/10.1086/283318
Lial, Greenwell and Ritchey (2016). Finite Mathematics and Calculus with Applications,
10th Ed. New York, NY: Pearson [ISBN-13 9780133981070].