Linear Regression by Hand: Linear Regression Is a Data Scientist's…, by Richard Peterson, Towards Data Science, 24/06/2021

The document discusses linear regression, including:
- Linear regression is a method for predicting a dependent variable (y) from an independent variable (x) using a line of best fit to model the relationship between the two variables.
- It calculates the least squares line, the line that minimizes the sum of the squared distances between the data points and the line.
- The correlation coefficient measures how strongly the data points are correlated to the least squares line, with values between -1 and 1 indicating the strength of the linear relationship.
- A worked example predicting chimpanzee hunting success rates from party sizes shows how to calculate the least squares line equation and the correlation coefficient.

24/06/2021 Linear Regression by Hand. Linear regression is a data scientist’s… | by Richard Peterson | Towards Data Science


Linear Regression by Hand


Linear regression is a data scientist’s most basic and powerful tool. Let’s take a
closer look at the Least Squares Line and Correlation Coefficient.

Richard Peterson Apr 12, 2020 · 6 min read

Invention of Linear Regression


https://towardsdatascience.com/linear-regression-by-hand-ee7fe5a751bf

Linear regression is a form of linear algebra that was allegedly invented by Carl
Friedrich Gauss (1777–1855), but was first published in a scientific paper by Adrien-
Marie Legendre (1752–1833). Gauss used the least squares method to predict when and
where the asteroid Ceres would appear in the night sky (The Discovery of Statistical
Regression, 2015). This was not a hobby project; it was a well-funded research
project for the purpose of oceanic navigation, a highly competitive field that was
sensitive to technological disruption.

Principles of Linear Regression


Linear regression is a method for predicting y from x. In our case, y is the dependent
variable, and x is the independent variable. We want to predict the value of y for a
given value of x. Now, if the data were perfectly linear, we could simply calculate the
line in slope-intercept form, y = mx + b. To predict y, we would just plug a given
value of x into the equation. In the real world, our data will not be perfectly linear. It
will likely be in the form of a cluster of data points on a scatterplot. From that
scatterplot, we would like to determine: what is the line of best fit that describes the
linear qualities of the data, and how well does the line fit the cluster of points?

Linear regression attempts to model the relationship between two variables by fitting a
linear equation to observed data (Linear Regression, n.d.).
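If the data really were perfectly linear, prediction would be nothing more than evaluating the line. A minimal sketch in Python, using as defaults the slope and intercept the article derives later from the chimpanzee data:

```python
# Predicting y from x with a line in slope-intercept form, y = m*x + b.
# The default m and b are the values fitted later in the article.

def predict(x, m=5.4405, b=31.6429):
    """Return the predicted y for a given x."""
    return m * x + b

print(round(predict(4), 1))  # a hunting party of 4 -> about 53.4% success
```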

Scatterplots
Let’s make up some data to use as an example. The relationship between chimpanzee
hunting party size and the percentage of successful hunts is well documented (Busse,
1978). I am going to grab a few data points from Busse to use for this article, and plot
the data using a seaborn scatterplot. Notice how the line I drew through the data does
not fit it perfectly, but the points approximate a linear pattern? The line I drew through
the data is the Least Squares Line, and it is used to predict y values for given x values.
Using just a rudimentary Least Squares Line drawn by hand through the data, we could
predict that a hunting party of 4 chimpanzees is going to be around 52% successful. We
are not 100 percent accurate, but with more data, we would likely improve our
accuracy. How well the data fits the Least Squares Line is measured by the Correlation
Coefficient.


Least Squares Line


In the chart above, I just drew a line by hand through the data that I judged to be the
best fit. We should calculate this line in slope-intercept form, y = mx + b, to make true
predictions. What we are seeking is a line where the differences between the line and
each point are as small as possible. This is the line of best fit.

The least squares line is defined as the line where the sum of the squares of the vertical
distances from the data points to the line is as small as possible (Lial, Greenwell and
Ritchey, 2016).
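To make "as small as possible" concrete, here is a short sketch comparing the sum of squared vertical distances for two candidate lines. The four data points are made up for illustration only, not the article's dataset; the least squares line is the candidate with the smallest sum.

```python
# Sum of squared vertical distances (residuals) from points to a line.
# The data points here are hypothetical, for illustration only.

points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]

def sse(m, b):
    """Sum of squared vertical distances from each point to y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)

# A line close to the data's trend scores far lower than a poor one.
print(sse(1.9, 0.2) < sse(1.0, 2.0))  # True
```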

The least squares line has two components: the slope m, and the y-intercept b. We will
solve for m first, and then solve for b. The equations for m and b are:

m = (n∑xy - ∑x∑y) / (n∑x² - (∑x)²)

b = (∑y - m∑x) / n

That’s a lot of Sigmas (∑)! But don’t worry, Sigma just means “sum of,” such as “sum of
x,” symbolized by ∑x, which is just the sum of the x column, “Number of Chimpanzees.”
We need to calculate ∑x, ∑y, ∑xy, ∑x², and ∑y². Each piece will then be fed into the
equations for m and b. Create the table below based on our original dataset.
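The table of Sigma values can be built in a few lines of Python. The eight (x, y) pairs below are hypothetical stand-ins loosely shaped like the chimpanzee example, not the article's actual Busse (1978) values:

```python
# Build the Sigma table: for each (x, y) pair compute x*y, x**2, and y**2,
# then sum each column. The data points are hypothetical stand-ins.

data = [(1, 30), (2, 45), (3, 52), (4, 57), (5, 60), (6, 62), (7, 68), (8, 75)]

n = len(data)
sum_x  = sum(x for x, y in data)
sum_y  = sum(y for x, y in data)
sum_xy = sum(x * y for x, y in data)
sum_x2 = sum(x ** 2 for x, y in data)
sum_y2 = sum(y ** 2 for x, y in data)

print(n, sum_x, sum_y, sum_xy, sum_x2)  # 8 36 449 2252 204
```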


Now it is a simple matter to plug our Sigma values into the equations for m and b. n is
the number of values in the dataset, which in our case is 8.

There you have it! You can make predictions of y from given values of x using your
equation: y = 5.4405x + 31.6429. This means that our line starts at 31.6429, and
the y-values increase by 5.4405 percentage points for every chimpanzee that joins
the hunting party. To test this out, let’s predict the percent hunt success for 4
chimpanzees.

y = 5.4405(4)+31.6429, which results in y=53.4

We just predicted the percentage of successful hunts for a chimpanzee hunting party
based solely on knowledge of their group size, which is pretty amazing!
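The whole calculation can be sketched end to end. Since the data points below are hypothetical stand-ins rather than the article's actual values, the fitted m and b come out close to, but not equal to, the article's 5.4405 and 31.6429:

```python
# Compute the least squares slope m and intercept b from the column sums,
# then predict hunt success for a party of 4. Data points are hypothetical.

data = [(1, 30), (2, 45), (3, 52), (4, 57), (5, 60), (6, 62), (7, 68), (8, 75)]
n = len(data)
sum_x  = sum(x for x, y in data)
sum_y  = sum(y for x, y in data)
sum_xy = sum(x * y for x, y in data)
sum_x2 = sum(x * x for x, y in data)

# m = (n∑xy - ∑x∑y) / (n∑x² - (∑x)²),  b = (∑y - m∑x) / n
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / n

print(round(m, 4), round(b, 4))  # slope and intercept
print(round(m * 4 + b, 1))       # predicted success for a party of 4
```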


Let’s plot the least squares line over our previous scatterplot using Python to show how
it fits the data. Seaborn’s regplot() is a great chart to use in this situation, but for
demonstration purposes, I will manually create the y = mx + b line and lay it over the
seaborn chart.
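Manually creating the line just means evaluating y = mx + b across the range of party sizes. A sketch of that step, using the article's fitted slope and intercept (the plotting calls themselves are noted in comments rather than executed here):

```python
# Manually create the y = m*x + b line over the range of party sizes.
# These (xs, line_y) pairs are what you would overlay on the scatterplot,
# e.g. sns.scatterplot(...) for the points followed by plt.plot(xs, line_y)
# for the line. Slope and intercept are the article's fitted values.

m, b = 5.4405, 31.6429

xs = list(range(1, 9))            # party sizes 1 through 8
line_y = [m * x + b for x in xs]  # points on the least squares line

print(line_y[0], line_y[-1])      # line height at x = 1 and x = 8
```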

However, now that you can make predictions, you need to qualify your predictions with
the Correlation Coefficient, which describes how well the data fits your calculated
line.

Correlation Coefficient
We use the Correlation Coefficient to determine if the least squares line is a good model
for our data. If the data points are not linear, a straight line will not be the right model
for prediction. Karl Pearson invented the Correlation Coefficient r, which ranges between -1
and 1 and measures the strength of the linear relationship between two variables
(Lial, Greenwell and Ritchey, 2016). If r is exactly -1 or 1, the data fits the line
exactly, with no deviation from the line. r = 0 means that there is no linear
correlation, and as r approaches zero, the strength of the association decreases.


The Correlation Coefficient is described by the formula

r = (n∑xy - ∑x∑y) / √[(n∑x² - (∑x)²)(n∑y² - (∑y)²)]

Luckily, these Sigma values have already been calculated in our previous table. We
simply plug them into our equation.

Our value is close to positive 1, which means that the data is highly and positively
correlated. You could have determined this from looking at the least squares line plotted
over the scatterplot, but the Correlation Coefficient quantifies it!
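Plugging the Sigma values into the formula for r is one more line of arithmetic. Again, the data points are hypothetical stand-ins, so this r is not the article's exact value, but it shows the same strong positive correlation:

```python
# Compute Pearson's correlation coefficient r from the same column sums.
# r = (n∑xy - ∑x∑y) / sqrt((n∑x² - (∑x)²)(n∑y² - (∑y)²))
# Data points are hypothetical stand-ins, not the article's values.
import math

data = [(1, 30), (2, 45), (3, 52), (4, 57), (5, 60), (6, 62), (7, 68), (8, 75)]
n = len(data)
sum_x  = sum(x for x, y in data)
sum_y  = sum(y for x, y in data)
sum_xy = sum(x * y for x, y in data)
sum_x2 = sum(x * x for x, y in data)
sum_y2 = sum(y * y for x, y in data)

r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(round(r, 4))  # close to +1: a strong positive linear relationship
```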

Conclusion
Linear regression is one of the best machine learning methods available to a data
scientist or a statistician. There are many ways to create a machine learning model
using your programming skills, but it is definitely a good idea to familiarize yourself
with the math used by the model.

References
Busse, C. D. (1978). Do Chimpanzees Hunt Cooperatively? The American Naturalist,
112(986), 767–770. https://doi.org/10.1086/283318

Lial, Greenwell and Ritchey (2016). Finite Mathematics and Calculus with Applications,
10th Ed. New York, NY: Pearson [ISBN-13 9780133981070].

Linear Regression. (n.d.). Retrieved April 11, 2020, from
http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm


The Discovery of Statistical Regression. (2015, November 6). Priceonomics.
http://priceonomics.com/the-discovery-of-statistical-regression/
