Simple Linear Regression
• y = a + bx, where a is the intercept and b is the slope
“The scenario: you are an HR officer and you have a candidate with 5 years of experience. What is the best salary you should offer them?”
Before diving into this problem, let's plot the dataset first:
The points do not all fall exactly on a line, BUT
they form a line-like shape. It's linear!
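• Before turning to the dataset we will actually work with, here is a minimal sketch of the idea applied to the HR scenario; the experience/salary numbers below are made up purely for illustration and are not from any real dataset:
• import numpy as np
• from sklearn.linear_model import LinearRegression
• # Hypothetical (years of experience, salary) pairs, just to illustrate the idea
• years = np.array([[1], [2], [3], [4], [6], [7], [8], [10]])
• salary = np.array([40000, 45000, 52000, 58000, 71000, 77000, 83000, 95000])
• model = LinearRegression().fit(years, salary)
• # Best-guess salary for a candidate with 5 years of experience
• print(model.predict([[5]]))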
• Let's start with exploratory data analysis. You want to get to know your
data first - this includes loading it in, visualizing features, exploring their
relationships and making hypotheses based on your observations. The
dataset is a CSV (comma-separated values) file, which contains the hours
studied and the scores obtained based on those hours. We'll load the data
into a DataFrame using Pandas:
• import pandas as pd
• Let's read the CSV file and package it into a DataFrame:
• # Replace path_to_file with the path to your student_scores.csv file
• path_to_file = 'home/projects/datasets/student_scores.csv'
• df = pd.read_csv(path_to_file)
• Once the data is loaded in, let's take a quick peek at the first 5 values using
the head() method:
• df.head()
• This results in:
• Hours Scores
• 0 2.5 21
• 1 5.1 47
• 2 3.2 27
• 3 8.5 75
• 4 3.5 30
• We can also check the shape of our dataset via the shape property:
• df.shape
• Knowing the shape of your data is generally pretty crucial to being able to both analyze it and
build models around it:
• (25, 2)
• We have 25 rows and 2 columns - that's 25 entries, each containing an hours value and a score.
Our initial question was whether we'd get a higher score if we'd studied longer. In essence,
we're asking about the relationship between Hours and Scores. So, what's the relationship
between these variables? A great way to explore relationships between variables is through
scatterplots. We'll plot the hours on the X-axis and scores on the Y-axis, and for each pair, a
marker will be positioned based on their values:
• df.plot.scatter(x='Hours', y='Scores', title='Scatterplot of hours and scores percentages');
• As the hours increase, so do the scores. There's a fairly high positive correlation
here! Since the shape of the line the points are making appears to be straight - we
say that there's a positive linear correlation between the Hours and Scores
variables. How correlated are they? The corr() method calculates and displays the
correlations between numerical variables in a DataFrame:
• print(df.corr())
• Hours Scores
• Hours 1.000000 0.976191
• Scores 0.976191 1.000000
• In this table, Hours has a 1.0 (100%) correlation with Hours, just as Scores has a
100% correlation with Scores; naturally, any variable has a 1:1 mapping with
itself. However, the correlation between Scores and Hours is about 0.98. Anything
above 0.8 is considered to be a strong positive correlation.
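• As a side note, corr() reports the Pearson correlation: the covariance of the two variables divided by the product of their standard deviations. A minimal NumPy sanity check of that formula (this is only a check, not part of the tutorial's pipeline):
• import numpy as np
• hours = df['Hours'].values
• scores = df['Scores'].values
• r = np.cov(hours, scores)[0, 1] / (hours.std(ddof=1) * scores.std(ddof=1))
• print(r)  # should match the 0.976... value reported by df.corr()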
• To separate the target and features, we can attribute the dataframe column values
to our y and X variables:
• y = df['Scores'].values.reshape(-1, 1)
• X = df['Hours'].values.reshape(-1, 1)
• Scikit-Learn's linear regression model expects a 2D input, and we're really offering
a 1D array if we just extract the values:
• print(df['Hours'].values) # [2.5 5.1 3.2 8.5 3.5 1.5 9.2 ... ]
print(df['Hours'].values.shape) # (25,)
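• If in doubt, a quick sanity check of the reshaped arrays (the shapes shown assume the 25-row dataset above):
• print(X.shape)  # (25, 1) - 25 rows, 1 feature column
• print(y.shape)  # (25, 1)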
• We could already feed our X and y data directly to our linear regression model, but if
we use all of our data at once, how can we know if our results are any good? Just like
in learning, we will use one part of the data to train our model and another part
of it to test it.
• This is easily achieved through the helper train_test_split() method, which accepts
our X and y arrays (also works on DataFrames and splits a single DataFrame into
training and testing sets), and a test_size. The test_size is the percentage of the
overall data we'll be using for testing:
• from sklearn.model_selection import train_test_split
• X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
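• Note that the split is random, so the exact rows in each set (and the numbers printed further below) can vary between runs. If reproducibility matters, train_test_split also accepts a random_state argument; for example (the seed value 42 is arbitrary):
• X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)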
• We have our train and test sets ready. Scikit-Learn has a plethora of model types we
can easily import and train, LinearRegression being one of them:
• from sklearn.linear_model import LinearRegression
• regressor = LinearRegression()
• Now we need to fit the line to our data; we do that with the .fit() method, passing
our X_train and y_train data:
• regressor.fit(X_train, y_train)
• The line is defined by our features and the intercept/slope. In fact, we can inspect
the intercept and slope by printing
the regressor.intercept_ and regressor.coef_ attributes, respectively:
• print(regressor.intercept_)
• 2.82689235
• print(regressor.coef_)
• The result should be:
• 9.68207815
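• Plugging these two numbers into y = a + bx gives the fitted line for this particular run (your values may differ slightly, since the train/test split is random):
• score ≈ 9.68 * hours + 2.83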
• Making Predictions
• To predict a new value by hand, we could write our own formula that plugs the slope
and intercept into the line equation:
• def calc(slope, intercept, hours):
• return slope*hours+intercept
• score = calc(regressor.coef_, regressor.intercept_, 9.5)
• print(score) # [[94.80663482]]
• However - a much handier way to predict new values using our model is to call on
the predict() function:
• # Pass 9.5 in double brackets to make it a 2-dimensional array
• score = regressor.predict([[9.5]])
• print(score) # [[94.80663482]]
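• We can also predict on the whole test set at once and compare the predictions with the actual scores; a minimal sketch (the 'Actual'/'Predicted' column labels are just illustrative names) that collects both into the df_preds DataFrame printed below:
• y_pred = regressor.predict(X_test)
• df_preds = pd.DataFrame({'Actual': y_test.squeeze(), 'Predicted': y_pred.squeeze()})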
• print(df_preds)
Evaluating the Model
• Mean Absolute Error (MAE): we subtract the predicted values from the actual
values to obtain the errors, take the absolute values of those errors, and average
them. This metric gives a notion of the overall error of the model's predictions;
the smaller (closer to 0), the better.
• $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\text{Actual}_i - \text{Predicted}_i\right|$
• Mean Squared Error (MSE): It is similar to the MAE metric, but it squares the
errors. As with MAE, the smaller, or closer to 0, the better. The errors are
squared so as to make large errors even larger. One thing to pay close attention
to is that MSE is usually a hard metric to interpret, due to the size of its values
and the fact that they aren't on the same scale as the data.
• $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\text{Actual}_i - \text{Predicted}_i\right)^2$
• Root Mean Squared Error (RMSE): Tries to solve the interpretation problem raised
with the MSE by taking the square root of its final value, scaling it back to
the same units as the data. It is easier to interpret and useful when we need to
display or report the error in the units of the data itself. It shows how far off
the predictions may be: with an RMSE of 4.35, our model's error is typically
about 4.35 above or below the actual value. The closer to 0, the better as well.
• $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$
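• Before reaching for Scikit-Learn's helpers, here is a minimal NumPy sketch of the three formulas above (assuming the y_test and y_pred arrays from the prediction step); it should agree with the library results below:
• import numpy as np
• errors = y_test - y_pred                # Actual - Predicted
• print(np.mean(np.abs(errors)))          # MAE
• print(np.mean(errors ** 2))             # MSE
• print(np.sqrt(np.mean(errors ** 2)))    # RMSE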
• from sklearn.metrics import mean_absolute_error, mean_squared_error
• import numpy as np
• mae = mean_absolute_error(y_test, y_pred)
• mse = mean_squared_error(y_test, y_pred)
• rmse = np.sqrt(mse)
• We will also print the metric results using f-strings, with two digits of precision after the decimal point via :.2f:
• print(f'Mean absolute error: {mae:.2f}')
• print(f'Mean squared error: {mse:.2f}')
• print(f'Root mean squared error: {rmse:.2f}')
• The results of the metrics will look like this:
• Mean absolute error: 3.92
• Mean squared error: 18.94
• Root mean squared error: 4.35
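• To put these values in context, one rough, informal check is to compare the RMSE with the average score (this is only a quick sanity check, not a standard metric):
• print(f'RMSE relative to the mean score: {rmse / y.mean():.2%}')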