XII AI UNIT I CAPSTONE PROJECT PART II
Now load the dataset from the CSV file "car.csv" that we have already created,
using the read_csv() function.
(read_csv('car.csv') reads the data from the CSV file into a DataFrame.)
>>> import pandas as pd
>>> df = pd.read_csv("car.csv")
>>> df
Distance Year Price
0 1500 5 50000
1 1600 3 45000
2 1000 1 70000
3 2000 2 60000
4 4000 7 35000
5 8000 9 20000
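(In case car.csv is not available, a short snippet like the one below could recreate it from the values shown above. This is only an illustrative sketch, not one of the required steps.)
>>> data = {'Distance': [1500, 1600, 1000, 2000, 4000, 8000],
...         'Year': [5, 3, 1, 2, 7, 9],
...         'Price': [50000, 45000, 70000, 60000, 35000, 20000]}
>>> pd.DataFrame(data).to_csv("car.csv", index=False)   # write the table to car.csv without the index column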
Now extract the dependent variable (Price) and the independent variables (Distance
and Year).
>>>X=df[['Distance','Year']]
>>>y=df['Price']
>>>X
Distance Year
0 1500 5
1 1600 3
2 1000 1
3 2000 2
4 4000 7
5 8000 9
>>>y
0 50000
1 45000
2 70000
3 60000
4 35000
5 20000
Now display the shape (i.e. the number of rows and columns) of the variables X and y.
Note that shape is an attribute, not a method, so it is written without parentheses.
>>>X.shape
(6, 2)
>>>y.shape
(6,)
Now split the data into training data and testing data using the train_test_split() function from sklearn.
>>>from sklearn.model_selection import train_test_split
>>>X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3)
>>>X_train
Distance Year
3 2000 2
5 8000 9
0 1500 5
2 1000 1
>>>X_test
Distance Year
1 1600 3
4 4000 7
>>>y_train
3 60000
5 20000
0 50000
2 70000
>>>y_test
1 45000
4 35000
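Note that train_test_split() shuffles the rows randomly, so the exact rows that end up in the training and testing sets will vary from run to run (your output may differ from the one shown above). Passing a fixed random_state makes the split reproducible; for example (random_state=42 is an arbitrary illustrative choice):
>>>X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=42)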
The train-test split procedure is appropriate when a sufficiently large dataset is
available: it runs faster, and there is enough data to give meaningful training and testing
sets. Cross-validation should be used if your dataset is small.
On small datasets, the extra computational burden of running cross-validation isn't a big
deal, and these are also the cases where the model quality scores from a single train-test
split are least reliable. So, if your dataset is small, you should use cross-validation.
If your model takes a couple of minutes or less to run, it's probably worth switching to
cross-validation. If your model takes much longer to run, cross-validation may slow down your
workflow more than it's worth.
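As a rough sketch (not part of the original worked example), cross-validation on the same X and y could be done with sklearn's cross_val_score() function. The choice of model (LinearRegression) and the number of folds (cv=3) below are illustrative assumptions.
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.model_selection import cross_val_score
>>> model = LinearRegression()
>>> # cv=3 keeps each fold at 2 rows for this small 6-row dataset (illustrative choice)
>>> scores = cross_val_score(model, X, y, cv=3, scoring='neg_mean_squared_error')
>>> scores.mean()   # average (negative) MSE across the 3 folds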
6. Metrics of model quality by simple Math
There are standard measures that we can use to summarize how good a set of predictions actually
is.
You must estimate the quality of a set of predictions when training a machine learning model.
Performance metrics like classification accuracy and Root Mean Squared Error (RMSE) give you a
clear, objective idea of how good a set of predictions is, and in turn how good the model is that
generated them.
All the algorithms in machine learning rely on minimizing or maximizing a function, which we call
the "objective function". The functions that are minimized are called "loss functions". A loss
function is a measure of how well a prediction model is able to predict the expected outcome.
Loss functions can be broadly categorized into 2 types: classification loss and regression loss.
Regression functions predict a quantity, and classification functions predict a label.
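One such regression loss is MSE (Mean Squared Error): the average of the squared differences between the actual values and the predicted values. In standard notation:
MSE = (1/n) Σ (yᵢ − ŷᵢ)²
where yᵢ is the actual value, ŷᵢ is the predicted value and n is the number of observations.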
Note: - MSE will never be negative because the errors are always squared.
Let's take an example, in the form of a graph, to understand MSE/RMSE:
[Graph: actual and predicted Rain values plotted against Day, shown side by side]
For calculating MSE in Python, we have the mean_squared_error() function, defined in the sklearn.metrics module of the sklearn library. For the rain example above it gave:
Output : - 0.21606
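A minimal sketch of how mean_squared_error() might be called is shown below; the actual_rain and predicted_rain values are illustrative assumptions, not the data that produced the output above.
>>> from sklearn.metrics import mean_squared_error
>>> import numpy as np
>>> actual_rain = [2.5, 3.0, 1.0, 4.0]        # hypothetical actual values
>>> predicted_rain = [2.0, 3.5, 1.5, 3.5]     # hypothetical predicted values
>>> mse = mean_squared_error(actual_rain, predicted_rain)
>>> mse
0.25
>>> np.sqrt(mse)                              # RMSE is the square root of MSE
0.5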
RMSE- Root Mean Squared Error
RMSE is one of the methods to determine the accuracy of our model in predicting the target values. In machine
learning, when we want to look at the accuracy of our model, we take the root mean square of the error between
the test (actual) values and the predicted values. Mathematically:
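RMSE = √( (1/n) Σ (yᵢ − ŷᵢ)² ), i.e. RMSE = √MSE
where yᵢ is the actual (test) value, ŷᵢ is the predicted value and n is the number of predictions. (This is the standard definition of RMSE.)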
In this scatter graph, the red dots are the actual values and the blue line is the set of predicted values
drawn by our model.
Here X represents the distance between an actual value and the predicted line; this distance represents
the error. Similarly, we can draw straight lines from each red dot to the blue line. Squaring these distances,
taking the mean of all of them, and finally taking the square root gives us the RMSE of our model.
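For example (with made-up numbers), if the distances (errors) for three points are 1, 2 and 2, then MSE = (1² + 2² + 2²)/3 = 3 and RMSE = √3 ≈ 1.73.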
Note: - A good model should have an RMSE value of less than 180. In case you have a higher RMSE value, this
would mean that you probably need to change your features, or that you need to tweak your
hyperparameters.
(Hyperparameters are the parameters whose values govern the learning process)
Steps to Calculate MSE/RMSE: -