
Simple Linear Regression

A linear function has one independent variable and one dependent variable. The independent variable is x and the dependent variable is y.

Y = a + bx

• a is the constant term, or the y intercept. It is the value of the dependent variable when x = 0.
• b is the coefficient of the independent variable. It is also known as the slope, and it gives the rate of change of the dependent variable with respect to x.
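As a quick worked example, with made-up coefficients (the values below are purely illustrative, not taken from any dataset):

a = 30000  # hypothetical y intercept: salary at 0 years of experience
b = 9000   # hypothetical slope: salary increase per extra year
x = 5      # years of experience
y = a + b * x
print(y)   # 75000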
“The scenario: you are an HR officer, and you have a candidate with 5 years of experience. What is the best salary you should offer him?”
Before diving into this problem, let’s plot the data set first:
Not all the points lie exactly on a line, BUT they form a line shape. It’s linear!
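To reproduce that scatter plot, here is a minimal sketch (assuming the CSV columns are named YearsExperience and Salary; adjust the names to match your file):

import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_csv('salary_data.csv')
plt.scatter(dataset['YearsExperience'], dataset['Salary'], color='red')
plt.title('Salary VS Experience')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()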

Linear Regression with Python


Before moving on, we summarize the 2 basic steps of Machine Learning:
1. Training
2. Prediction
We will use 4 libraries: numpy and pandas to work with the data set, sklearn to implement the machine learning functions, and matplotlib to visualize our plots:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
• Importing the dataset
• dataset = pd.read_csv('salary_data.csv')
• X = dataset.iloc[:, :-1].values  # every column except the last one (Years of Experience)
• y = dataset.iloc[:, 1].values  # the column at index 1, i.e. the second column (Salary)
Code explanation:
• dataset: the table containing all the values in our csv file
• X: the first column, containing the Years of Experience array
• y: the last column, containing the Salary array
• Next, we have to split our dataset (30 observations in total) into 2 sets: a training set used for training and a test set used for testing:

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
• Code explanation:
• test_size=1/3: we split our dataset (30 observations) into 2 parts (training set, test set), and the ratio of the test set to the whole dataset is 1/3 (10 observations go into the test set). You can pass 1/2 or 0.5 to get a 50% split instead; they mean the same thing. We should not make the test set too big; if it is, we will lack data to train on. Normally, we pick around 5% to 30%.
• train_size: if we already use test_size, the rest of the data is automatically assigned to train_size.
• random_state: the seed for the random number generator. We can pass an instance of the RandomState class as well. If we leave it out, the RandomState instance used by np.random is used instead; passing a fixed integer such as 0 makes the split reproducible.
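To see what random_state buys us, here is a small check (a sketch added for illustration): splitting twice with the same seed yields identical sets.

from sklearn.model_selection import train_test_split
import numpy as np

X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=1/3, random_state=0)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=1/3, random_state=0)
print(np.array_equal(X_te1, X_te2))  # True: a fixed seed reproduces the split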
• We already have the training set and test set; now we have to build the Regression Model:

# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
• Code explanation:
• regressor = LinearRegression(): our training model, which will implement the Linear Regression.
• regressor.fit: in this line, we pass X_train, which contains the Years of Experience values, and y_train, which contains the corresponding Salary values, to fit the model. This is the training process.
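Once the model is fitted, we can also inspect the learned line through scikit-learn's intercept_ and coef_ attributes (the exact values depend on your data and split):

print(regressor.intercept_)  # a: the y intercept
print(regressor.coef_)       # b: the slope, i.e. salary increase per extra year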
Let’s visualize our training model and testing model:

# Visualizing the Training set results
viz_train = plt
viz_train.scatter(X_train, y_train, color='red')
viz_train.plot(X_train, regressor.predict(X_train), color='blue')
viz_train.title('Salary VS Experience (Training set)')
viz_train.xlabel('Years of Experience')
viz_train.ylabel('Salary')
viz_train.show()

# Visualizing the Test set results
viz_test = plt
viz_test.scatter(X_test, y_test, color='red')
viz_test.plot(X_train, regressor.predict(X_train), color='blue')
viz_test.title('Salary VS Experience (Test set)')
viz_test.xlabel('Years of Experience')
viz_test.ylabel('Salary')
viz_test.show()
After running the above code, you will see 2 plots in the console window.
Comparing the two plots, we can see that the two blue lines have the same direction. Our model is good to use now.
Alright! Now that we have the model, we can use it to predict the value of y (Salary) for any value of X (Years of Experience). This is how we do it:

# Predicting the result of 5 Years of Experience
y_pred = regressor.predict([[5]])

• You can offer your candidate a salary of $73,545.90, and this is the best salary for him!

We can also pass an array of X values instead of a single value of X:

# Predicting the Test set results
y_pred = regressor.predict(X_test)
• In conclusion, Simple Linear Regression comes down to 5 steps:
1. Import the dataset.
2. Split the dataset into a training set and a testing set (X and y arrays for each set). Normally, the testing set should be 5% to 30% of the dataset.
3. Visualize the training set and testing set to double-check (you can skip this step if you want).
4. Initialize the regression model and fit it using the training set (both X and y).
5. Let’s predict!!
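Putting those 5 steps together, a minimal end-to-end script might look like this (a sketch assuming salary_data.csv with experience in the first column and salary in the second):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# 1. Import the dataset
dataset = pd.read_csv('salary_data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

# 3. (Optional) visualize the sets to double-check

# 4. Initialize the regression model and fit it on the training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# 5. Predict
print(regressor.predict([[5]]))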
EXERCISE 2

• Let's start with exploratory data analysis. You want to get to know your
data first - this includes loading it in, visualizing features, exploring their
relationships and making hypotheses based on your observations. The
dataset is a CSV (comma-separated values) file, which contains the hours
studied and the scores obtained based on those hours. We'll load the data
into a DataFrame using Pandas:

• import pandas as pd
• Let's read the CSV file and package it into a DataFrame:
• # Substitute the path_to_file content with the path to your student_scores.csv file
• path_to_file = 'home/projects/datasets/student_scores.csv'
• df = pd.read_csv(path_to_file)
• Once the data is loaded in, let's take a quick peek at the first 5 values using
the head() method:
• df.head()
• This results in:
•    Hours  Scores
  0    2.5      21
  1    5.1      47
  2    3.2      27
  3    8.5      75
  4    3.5      30
• We can also check the shape of our dataset via the shape property:
• df.shape
• Knowing the shape of your data is generally pretty crucial to being able to both analyze it and
build models around it:
• (25, 2)
• We have 25 rows and 2 columns - that's 25 entries containing a pair of an hour and a score.
Our initial question was whether we'd score a higher score if we'd studied longer. In essence,
we're asking for the relationship between Hours and Scores. So, what's the relationship
between these variables? A great way to explore relationships between variables is through
Scatterplots. We'll plot the hours on the X-axis and scores on the Y-axis, and for each pair, a
marker will be positioned based on their values:
• df.plot.scatter(x='Hours', y='Scores', title='Scatterplot of hours and scores percentages');
• As the hours increase, so do the scores. There's a fairly high positive correlation here! Since the points appear to fall along a straight line, we say that there's a positive linear correlation between the Hours and Scores variables. How correlated are they? The corr() method calculates and displays the correlations between numerical variables in a DataFrame:
• print(df.corr())
•           Hours    Scores
  Hours   1.000000  0.976191
  Scores  0.976191  1.000000
• In this table, Hours and Hours have a 1.0 (100%) correlation, just as Scores have a
100% correlation to Scores, naturally. Any variable will have a 1:1 mapping with
itself! However, the correlation between Scores and Hours is 0.97. Anything
above 0.8 is considered to be a strong positive correlation.
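• The same number can be cross-checked with NumPy's corrcoef, which returns the full correlation matrix (a quick sanity check added here, not part of the original exercise):

import numpy as np
r = np.corrcoef(df['Hours'], df['Scores'])[0, 1]
print(round(r, 6))  # 0.976191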
• To separate the target and features, we can assign the DataFrame column values to our y and X variables:
• y = df['Scores'].values.reshape(-1, 1)
• X = df['Hours'].values.reshape(-1, 1)
• Scikit-Learn's linear regression model expects a 2D input, but we'd really be passing a 1D array if we just extracted the values:
• print(df['Hours'].values) # [2.5 5.1 3.2 8.5 3.5 1.5 9.2 ... ]
print(df['Hours'].values.shape) # (25,)
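• After applying reshape(-1, 1) as above, a quick check (a small addition for illustration) confirms both arrays are now 2D column vectors, one sample per row:

print(X.shape)  # (25, 1)
print(y.shape)  # (25, 1)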
• We could already feed our X and y data directly to our linear regression model, but if we use all of our data at once, how can we know if our results are any good? Just like in learning, we will use one part of the data to train our model and another part to test it.
• This is easily achieved through the helper train_test_split() method, which accepts
our X and y arrays (also works on DataFrames and splits a single DataFrame into
training and testing sets), and a test_size. The test_size is the percentage of the
overall data we'll be using for testing:
• from sklearn.model_selection import train_test_split
• X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
• We have our train and test sets ready. Scikit-Learn has a plethora of model types we
can easily import and train, LinearRegression being one of them:
• from sklearn.linear_model import LinearRegression
• regressor = LinearRegression()
• Now we need to fit the line to our data. We will do that by using the .fit() method along with our X_train and y_train data:
• regressor.fit(X_train, y_train)
• The line is defined by our features and the intercept/slope. In fact, we can inspect the intercept and slope by printing the regressor.intercept_ and regressor.coef_ attributes, respectively:
• print(regressor.intercept_)
• 2.82689235
• print(regressor.coef_)
• The result should be:
• [9.68207815]
• Making Predictions
• If we wanted, we could calculate predictions by hand with our own formula:
• def calc(slope, intercept, hours):
•     return slope * hours + intercept
• score = calc(regressor.coef_, regressor.intercept_, 9.5)
• print(score)  # [[94.80663482]]
• However, a much handier way to predict new values using our model is to call the predict() function:
• # Passing 9.5 in double brackets to make a 2-dimensional array
• score = regressor.predict([[9.5]])
• print(score)  # [[94.80663482]]

• To make predictions on the test data, we pass the X_test values to the predict() method. We can assign the results to the variable y_pred:
• y_pred = regressor.predict(X_test)
• Comparing actual and predicted data:
• df_preds = pd.DataFrame({'Actual': y_test.squeeze(), 'Predicted': y_pred.squeeze()})
• print(df_preds)
Evaluating the Model
• Mean Absolute Error (MAE): we subtract the predicted values from the actual values to obtain the errors, sum the absolute values of those errors, and take their mean. This metric gives a notion of the overall error of the model's predictions; the smaller (the closer to 0), the better.
• $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert \mathrm{Actual}_i - \mathrm{Predicted}_i \rvert$
• Mean Squared Error (MSE): similar to MAE, but it squares the errors instead of taking their absolute values. As with MAE, the smaller (the closer to 0), the better. Squaring makes large errors count even more. One thing to pay close attention to: MSE is usually a hard metric to interpret, because its values can be large and are not on the same scale as the data.
• $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(\mathrm{Actual}_i - \mathrm{Predicted}_i)^2$
• Root Mean Squared Error (RMSE): tries to solve the interpretation problem of MSE by taking the square root of its final value, scaling it back to the same units as the data. It is easier to interpret and useful when we need to display the error together with the actual values of the data. So if we have an RMSE of 4.35, our model's predictions are typically off by about 4.35 in either direction (above or below the actual value). The closer to 0, the better as well.
• $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$
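The three formulas translate directly into NumPy (a minimal sketch for illustration; in practice we use scikit-learn's implementations, as below):

import numpy as np

def mae(actual, predicted):
    # mean of the absolute errors
    return np.mean(np.abs(actual - predicted))

def mse(actual, predicted):
    # mean of the squared errors
    return np.mean((actual - predicted) ** 2)

def rmse(actual, predicted):
    # square root of the MSE, back in the data's units
    return np.sqrt(mse(actual, predicted))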
• from sklearn.metrics import mean_absolute_error, mean_squared_error
• import numpy as np
• mae = mean_absolute_error(y_test, y_pred)
• mse = mean_squared_error(y_test, y_pred)
• rmse = np.sqrt(mse)
• We will also print the metric results using f-strings, with 2-digit precision after the decimal point via :.2f:
• print(f'Mean absolute error: {mae:.2f}')
• print(f'Mean squared error: {mse:.2f}')
• print(f'Root mean squared error: {rmse:.2f}')
• The results of the metrics will look like this:
• Mean absolute error: 3.92
• Mean squared error: 18.94
• Root mean squared error: 4.35
