
ML Course

Uploaded by

darshith15m
Copyright
© © All Rights Reserved
INTERNSHIP REPORT

ON

Used Car Price Prediction


Submitted in complete fulfillment for

Machine Learning & Artificial Intelligence Internship

SUBMITTED BY
DARSHITH L
(U03NM21T029008)
Computer Science and Engineering, 4th sem
University Visvesvaraya College of Engineering

UNDER THE GUIDANCE OF

Mr. Sudheendra,
Technical Lead,
Quant Masters Technologies Pvt. Ltd.

Quant Masters Pvt. Ltd,

Rajajinagar, Bengaluru - 560021


Table of Contents

1. Introduction
    1.1 Introduction to Machine Learning
        1.1.1 Key Concepts and steps in machine learning
        1.1.2 Applications
        1.1.3 Challenges
    1.2 Project Overview
        1.2.1 Problem we are solving
        1.2.2 How I did it
        1.2.3 What was found
        1.2.4 Software Used
2. Overview
3. Proposed methodology
    3.1 Importing Modules
    3.2 Data Preprocessing
    3.3 Train-Test Split
    3.4 Training
        3.4.1 Decision Tree Regressor
        3.4.2 Random Forest Regressor
        3.4.3 Linear Model
    3.5 Evaluation
    3.6 Result
    3.7 References
4. Conclusion

List Of Images

Fig 1.1 Types of machine Learning
Fig 1.2 Applications of Machine Learning
Fig 3.1 importing basic modules
Fig 3.2 data frame sample
Fig 3.3 summary of data types and number of non null values in df
Fig 3.4 dropping rows having null value entry
Fig 3.5 df.info() after dropna()
Fig 3.6 converting Year to int16
Fig 3.7 Converting Kilometers_Driven to int
Fig 3.8 reset_index()
Fig 3.9 Converting Km/Kg to KMPL
Fig 3.10 Converting Engine
Fig 3.11 Converting Power and Seats
Fig 3.12 Converting New_Price
Fig 3.13 Modifying brand name
Fig 3.14 Label Encoding
Fig 3.15 Train test split
Fig 3.16 Decision Tree Regressor
Fig 3.17 Random forest Regressor
Fig 3.18 Linear Regressor
Fig 3.19 Mean absolute error of different models

1. Introduction

1.1 Introduction to Machine Learning:


Machine learning (ML) is a subset of artificial intelligence (AI) that focuses on the
development of algorithms and models that allow computers to learn from and make
predictions or decisions based on data, without being explicitly programmed for each
task. It is a powerful tool used across various industries to extract valuable insights,
automate processes, and make data-driven decisions.

1.1.1 Key Concepts and steps in machine learning


Data: At the core of machine learning is data. ML algorithms learn patterns and
relationships from data to make predictions or decisions. This data can be
structured, such as tables in databases, or unstructured, such as text, images, or
videos.
Algorithm: Machine learning algorithms are mathematical models that learn from
data to perform specific tasks. These algorithms can be categorized into three
main types: supervised learning, unsupervised learning, and reinforcement
learning.
● Supervised Learning: In supervised learning, the algorithm learns from
labeled data, where each example is associated with a target variable or
outcome. The goal is to learn a mapping from inputs to outputs, enabling
the algorithm to make predictions on new, unseen data.
● Unsupervised Learning: Unsupervised learning algorithms learn from
unlabeled data, seeking to uncover hidden patterns or structures within
the data. Clustering and dimensionality reduction are common tasks in
unsupervised learning.

● Reinforcement Learning: Reinforcement learning involves training an
agent to interact with an environment in order to achieve a specific goal.
The agent learns by receiving feedback in the form of rewards or penalties
based on its actions.

Fig 1.1 Types of machine Learning

Model Evaluation: Evaluating the performance of ML models is crucial to assess
their effectiveness and generalization capabilities. Common evaluation metrics
vary depending on the type of task and the specific goals of the model.
Feature Engineering: Feature engineering involves selecting, transforming, and
creating features (input variables) to improve the performance of ML models.
Effective feature engineering can significantly impact the predictive power of a
model.
Deployment and Monitoring: Deploying ML models into production involves
integrating them into existing systems or applications. Continuous monitoring is
essential to ensure that models perform accurately and reliably over time, as
data distributions and patterns may change.

1.1.2 Applications:
Machine learning has a wide range of applications across industries, including:

● Healthcare: Predictive analytics for disease diagnosis, personalized treatment
recommendations, and patient monitoring.
● Finance: Fraud detection, credit scoring, algorithmic trading, and risk
assessment.
● Retail: Customer segmentation, recommendation systems, demand forecasting,
and inventory management.
● Automotive: Autonomous driving, predictive maintenance, and vehicle safety.
● Marketing: Customer churn prediction, targeted advertising, and sentiment
analysis.
● Manufacturing: Quality control, predictive maintenance, and supply chain
optimization.

Fig 1.2 Applications of Machine Learning

1.1.3 Challenges:
Despite its numerous benefits, machine learning also presents several challenges:

● Data Quality: ML models are highly dependent on the quality, relevance, and
representativeness of the data used for training. Poor-quality data can lead to
biased or inaccurate predictions.
● Interpretability: Many ML algorithms, particularly deep learning models, are often
considered "black boxes," making it challenging to interpret their decision-making
processes.
● Ethical Considerations: ML models can inadvertently perpetuate or amplify
biases present in the data, raising ethical concerns related to fairness,
transparency, and accountability.

● Computational Resources: Training complex ML models, especially deep neural
networks, requires significant computational resources, including high-
performance GPUs and large-scale distributed systems.

1.2 Project Overview:


Understanding the importance of a model that predicts the price of second-hand cars
involves recognizing the inherent value in making informed purchasing decisions. Such
a model serves as a vital tool for both buyers and sellers in the used car market. For
buyers, it provides invaluable insight into the fair market value of a pre-owned vehicle,
empowering them to make well-informed decisions based on factors such as the car's
make, model, mileage, condition, and market trends. By leveraging this predictive
model, buyers can navigate the complex landscape of second-hand car pricing with
confidence, ensuring they obtain the best possible deal for their budget. Conversely,
sellers can utilize the model to set competitive yet fair prices for their vehicles, attracting
potential buyers while maximizing returns. Moreover, the advantages of buying a
second-hand car are further accentuated through such predictive models. Not only do
second-hand cars offer significant cost savings compared to new vehicles, but they also
provide access to a wider range of models and features within a given budget.
Additionally, the depreciation curve of a used car is typically gentler than that of a new
car, allowing buyers to enjoy greater value retention over time. Furthermore, many used
cars come with certified pre-owned (CPO) status or warranties, offering buyers peace of
mind regarding the vehicle's condition and reliability. In essence, the combination of a
predictive pricing model and the advantages of buying second-hand cars underscores
the importance of informed decision-making in the used car market, benefiting both
buyers and sellers alike.

1.2.1 Problem we are solving:

The problem we are addressing is the lack of transparency and consistency in pricing
for second-hand cars. The used car market is often characterized by varying price
listings that do not always reflect the true value of the vehicles. This can lead to
confusion and uncertainty for both buyers and sellers, who may struggle to determine
fair market prices. Our goal is to provide a solution that leverages machine learning to
predict the price of second-hand cars accurately, thereby enhancing transparency and
facilitating fair transactions in the used car market.

1.2.2 How I did it:

Data Collection: The data was downloaded from an open source website for learning
purposes.

Data Preprocessing: I cleaned and prepared the dataset by handling missing values,
removing outliers, and standardizing features.

Feature Selection: I identified the most relevant features that could influence the
pricing of second-hand cars, such as car make, model, year of manufacture, mileage,
fuel type, and transmission type.

Model Training: I trained predictive models using machine learning algorithms such as
decision trees, random forests, and linear regression.

Model Evaluation: I evaluated the performance of the trained model using appropriate
evaluation metrics, such as mean absolute error (MAE) to ensure its accuracy and
reliability.

Fine-Tuning: I fine-tuned the model parameters and hyperparameters to optimize its
performance and generalize well to unseen data.

Cross-Validation: I performed cross-validation to assess the model's robustness and
generalization ability across different subsets of the data.
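
The cross-validation step can be sketched as follows. This is only an illustration on synthetic data: the feature matrix, the fold count of 5, and the choice of scikit-learn's cross_val_score with MAE scoring are assumptions, since the report does not show the code used for this step.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the car data (assumption): two numeric features and a price.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=100)

model = RandomForestRegressor(n_estimators=20, random_state=0)

# scikit-learn reports errors as negative scores so that "higher is better";
# flipping the sign gives one MAE value per fold.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
mae_per_fold = -scores
mean_mae = mae_per_fold.mean()
```

A large spread between the per-fold values would suggest that the model's accuracy depends heavily on which rows it was trained on.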

Validation: I compared the model's predictions against real-world second-hand car
prices to check its effectiveness and identify any areas for improvement.

Deployment: Since this model was made for learning purposes, I uploaded it to
GitHub for my own further reference.

1.2.3 What was found:


My trained model demonstrated promising results in accurately predicting the prices of
second-hand cars. By leveraging advanced machine learning techniques, I was able to
achieve a high level of precision in estimating the fair market value of used vehicles. My
model effectively captured the intricate relationships between various car attributes and
their impact on pricing, providing valuable insights for both buyers and sellers in the
used car market. Furthermore, my analysis revealed that factors such as car age,
mileage, brand reputation, and fuel efficiency significantly influenced the pricing of
second-hand cars. Overall, my findings underscore the efficacy of machine learning in
addressing pricing discrepancies and enhancing transparency in the second-hand car
market.

1.2.4 Software Used:


I chose Google Colab for this project because of its numerous advantages:

Free Access: Google Colab provides free access to computational resources, including
GPU and TPU accelerators, which are crucial for training machine learning models
efficiently.

Cloud-Based: Being cloud-based, Google Colab allows me to access my project from
any device with an internet connection, enabling seamless collaboration and flexibility.

Pre-Installed Libraries: Google Colab comes pre-installed with popular Python libraries
such as TensorFlow, PyTorch, and scikit-learn, saving me time and effort in setting up
my environment.

Jupyter Notebook Environment: Google Colab provides a Jupyter Notebook
environment, allowing me to write and execute Python code in a convenient and
interactive manner, with support for Markdown cells for documentation. The code runs
on Google's hosted machines rather than on my own computer.

2. Overview

In my project, the task is akin to training a computer to accurately estimate the prices of
used cars. I utilize a form of machine learning known as Supervised Learning, where
the computer learns from a curated set of examples containing details about used cars
and their actual selling prices. This dataset acts as a teacher, providing the necessary
examples for the computer to learn from.

The dataset comprises information about various cars, including their models,
manufacturing years, mileage, conditions, and corresponding prices. My goal is to
enable the computer to predict the price of a car based on its features, essentially
teaching it how to make informed estimations using algorithms.

One approach I employ is the Decision Tree algorithm, which functions like a flowchart,
asking a series of questions about the car's characteristics to arrive at a price
estimation. Another technique I utilize is the Random Forest algorithm, where a group of
decision trees collaborates, resembling seeking advice from multiple experts to
ascertain the most accurate price.

Additionally, I explore Linear Regression, a simpler method that seeks a linear
relationship in the data to predict prices. It's akin to drawing a straight line through a
scatter plot of car features and prices.

Once the computer has learned from these examples, it can predict the price of a new,
unseen car by applying the patterns and relationships it has discerned from the
training data. If the predictions are not precise enough, I can refine the model by
adjusting its settings or improving the features and then retraining it. Moreover, as
more data becomes available, the Random Forest model can be retrained on it so that
its predictions keep pace with changes in the dynamic used car market.

3. Proposed Methodology

My proposed methodology for the project involved the following steps:

3.1 Importing Modules:


Initially, I imported essential libraries such as NumPy and Pandas. I imported the
other required modules at the point where they were first needed, to keep the code
easy to follow.

Fig 3.1 importing basic modules

Fig 3.2 data frame sample

3.2 Data Preprocessing:


The first step in data preprocessing was checking for missing values using the
DataFrame's info() method.

Fig 3.3 summary of data types and number of non null values in df
Upon discovering numerous NaN values in the column designated as the target
variable, I opted to remove rows containing these NaN values.
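
A minimal sketch of this check-and-drop step is given below. The tiny DataFrame and its column names (Name, Year, Price) are assumptions made for illustration; the real dataset is much larger. The isna().sum() call gives the same information as the non-null counts printed by info(), but as a value that can be inspected in code.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset; column names are assumptions.
df = pd.DataFrame({
    "Name": ["Maruti Swift", "Honda City", "Hyundai i20", "Toyota Innova"],
    "Year": ["2015", "2012", "2017", "2014"],
    "Price": [4.5, np.nan, 6.1, np.nan],  # target column
})

# Number of missing entries in each column.
missing_per_column = df.isna().sum()

# Keep only the rows whose target value is known.
df_clean = df.dropna(subset=["Price"])
```

Restricting dropna() to the target column with subset keeps rows that are only missing a feature value; calling dropna() with no arguments would drop those as well.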

Fig 3.4 dropping rows having null value entry


Following this, I proceeded with data preprocessing in two stages. In the first stage, I
observed that some numeric data were incorrectly stored as objects, so I corrected
them to their respective data types.

Fig 3.5 df.info() after dropna()

Fig 3.6 converting Year to int16

Fig 3.7 Converting Kilometers_Driven to int
The Year and Kilometers_Driven columns were stored as strings of digits. Using the
astype() method, they were converted to integer types.
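
The conversion can be sketched as follows, using two made-up rows. The int16 type for Year follows the figure caption; the int64 type for Kilometers_Driven is an assumption, since the report only says "int".

```python
import pandas as pd

# Two made-up rows with the numbers stored as strings.
df = pd.DataFrame({
    "Year": ["2015", "2012"],
    "Kilometers_Driven": ["41000", "87000"],
})

# int16 is enough for a four-digit year and saves memory.
df["Year"] = df["Year"].astype("int16")

# Kilometre readings can exceed the int16 range, so a wider type is used (assumption).
df["Kilometers_Driven"] = df["Kilometers_Driven"].astype("int64")

total_km = df["Kilometers_Driven"].sum()
```

Once the columns are numeric, arithmetic such as the sum above works as expected instead of concatenating strings.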

Fig 3.8 reset_index()


Calling dropna() deleted many rows, but the index labels of the remaining rows are
not renumbered automatically. Hence reset_index() was called, which renumbered
the index from 0 to the end.
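
The effect on the index can be seen in a small sketch. The drop=True argument is an assumption: the report only names reset_index(), and without drop=True pandas keeps the old labels in a new column called "index".

```python
import pandas as pd

df = pd.DataFrame({"Price": [4.5, None, 6.1, None, 3.2]})

# Dropping rows leaves gaps in the index labels.
df = df.dropna()
index_before = list(df.index)

# Renumber the rows from 0 and discard the old labels.
df = df.reset_index(drop=True)
index_after = list(df.index)
```

Here index_before holds the surviving labels with gaps, while index_after is a contiguous range starting at 0.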

Fig 3.9 Converting Km/Kg to KMPL


In the Mileage column, mileage was specified in two different units. Since
KMPL (kilometers per liter) is more convenient to understand and the majority of
entries are in KMPL, entries in Km/Kg were also converted to KMPL using mathematical
formulas.
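
The report does not state which formula was used. One dimensionally consistent option, shown here purely as an assumption, is to multiply the km/kg figure by the fuel's density in kg per liter. The entry format ("value unit"), the function name to_kmpl, and the density value are all hypothetical.

```python
def to_kmpl(mileage: str, fuel_density_kg_per_l: float) -> float:
    """Return mileage in km per liter for entries such as '19.67 kmpl' or '26.6 km/kg'."""
    value_text, unit = mileage.split()
    value = float(value_text)
    if unit.lower() == "km/kg":
        # km/kg * kg/l = km/l
        return value * fuel_density_kg_per_l
    return value  # already in km per liter


# Illustrative density only; it is NOT taken from the report.
ILLUSTRATIVE_DENSITY = 0.5

converted = [to_kmpl(m, ILLUSTRATIVE_DENSITY) for m in ["19.67 kmpl", "26.6 km/kg"]]
```

Whatever factor is used in practice should be recorded alongside the model, because it changes every converted value in the column.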

Fig 3.10 Converting Engine

Fig 3.11 Converting Power and Seats


All entries in the Engine column were given in ‘CC’ and all entries in the Power
column in ‘bhp’. Using the split() function, the numerical part of each entry was
extracted and converted to int type.
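
A sketch of the extraction, assuming entries of the form "1248 CC" and "74 bhp". The sample values are made up, and Power is converted to float here rather than int because power figures often carry a decimal part; that choice is an assumption, not the report's code.

```python
import pandas as pd

df = pd.DataFrame({
    "Engine": ["1248 CC", "1582 CC"],
    "Power": ["74 bhp", "126.2 bhp"],
})

# Split on whitespace, keep the first token (the number), then convert the type.
df["Engine"] = df["Engine"].str.split().str[0].astype(int)
df["Power"] = df["Power"].str.split().str[0].astype(float)
```

The .str accessor applies split() to every row at once, so no explicit loop over the records is needed.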

Fig 3.12 Converting New_Price
Some of the entries in New_Price were in crore. By traversing each record, all entries
in crore were converted to lakhs (1 crore = 100 lakhs).
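
The unit conversion can be sketched as below. The entry format ("8.61 Lakh", "1.28 Cr") and the helper name price_in_lakhs are assumptions for illustration; only the factor of 100 lakhs per crore is a fixed fact.

```python
def price_in_lakhs(entry: str) -> float:
    """Convert a price string such as '8.61 Lakh' or '1.28 Cr' to a number in lakhs."""
    amount_text, unit = entry.split()
    amount = float(amount_text)
    if unit.lower().startswith("cr"):
        return amount * 100  # 1 crore = 100 lakhs
    return amount


prices = [price_in_lakhs(p) for p in ["8.61 Lakh", "1.28 Cr"]]
```

After this step every value in the column shares one unit, so the model is not misled by a 1.28 crore car appearing cheaper than an 8.61 lakh one.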

Fig 3.13 Modifying brand name


For simplicity, the full model name was reduced to the manufacturer's name.
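
One simple way to do this, assuming the manufacturer is always the first word of the Name column, is shown below. The column names and sample rows are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Maruti Wagon R LXI", "Hyundai Creta 1.6 CRDi", "Honda Jazz V"],
})

# The first whitespace-separated token is taken as the manufacturer.
df["Brand"] = df["Name"].str.split().str[0]
```

A caveat of this shortcut is that two-word manufacturers such as "Land Rover" are cut to their first word, so they may need a manual fix.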

Fig 3.14 Label Encoding


In the second stage, I encoded categorical data such as location, fuel type, and
transmission using one-hot encoding (as they are nominal data), while employing label
encoding for the 'Owner_Type' column (as it is ordinal data).
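
The two encodings can be sketched as follows. The category values and the explicit order mapping are assumptions; an explicit dictionary is used for Owner_Type here because it guarantees that the integer codes follow the intended order, which an alphabetical label encoder would not.

```python
import pandas as pd

df = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "CNG"],
    "Transmission": ["Manual", "Automatic", "Manual"],
    "Owner_Type": ["First", "Third", "Second"],
})

# Ordinal column: map each category to an integer that respects its order.
owner_order = {"First": 0, "Second": 1, "Third": 2, "Fourth & Above": 3}
df["Owner_Type"] = df["Owner_Type"].map(owner_order)

# Nominal columns: one new indicator column per category.
df = pd.get_dummies(df, columns=["Fuel_Type", "Transmission"])
```

One-hot encoding avoids implying that, for example, Diesel is "greater than" CNG, at the cost of adding one column per category.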

3.3 Train-Test Split:


Utilizing the train_test_split function from the sklearn.model_selection module, I divided the
dataset into training and testing sets to facilitate model training and evaluation.
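
A minimal sketch of the split on placeholder arrays is shown below. The 80/20 ratio and the random_state value are assumptions; the report does not state the split ratio, and random_state is set here only so that the example is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for testing (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```

The model is fitted only on X_train and y_train, and the held-out rows are used afterwards to estimate how well it handles cars it has not seen.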

Fig 3.15 Train test split

3.4 Training:
For model training, I employed various algorithms including Decision Tree, Random
Forest Regressor, and Linear Regression to explore different approaches and assess
their performance.

3.4.1 Decision Tree Regressor:

● Decision tree regression is a supervised learning algorithm used for regression
tasks.
● It works by partitioning the feature space into smaller regions and fitting a simple
model (usually a constant value) to each region.
● The model splits the feature space based on the features that lead to the
greatest reduction in variance or another specified criterion.
● Decision trees are easy to interpret and visualize, making them useful for
understanding the underlying decision-making process.
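
The partitioning idea in the points above can be illustrated with a tiny example. The data (car age against price) is made up, and max_depth=1 is chosen only so that the tree makes a single split; neither is taken from the report.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up data: young cars cost 9.0, old cars cost 2.0.
X = np.array([[1], [2], [3], [10], [11], [12]])  # car age in years
y = np.array([9.0, 9.0, 9.0, 2.0, 2.0, 2.0])     # price

# A depth-1 tree makes one split and predicts a constant in each region.
tree = DecisionTreeRegressor(max_depth=1, random_state=0)
tree.fit(X, y)

predictions = tree.predict([[2], [11]])
```

The tree separates the young cars from the old ones and predicts the mean price of each group, which is the "constant value per region" behavior described in the list.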

Fig 3.16 Decision Tree Regressor

3.4.2 Random Forest Regressor:

● Random forest regression is an ensemble learning method based on decision
trees.
● It creates multiple decision trees during training and combines their predictions
through averaging (for regression) or voting (for classification) to improve
generalization and reduce overfitting.
● Random forests introduce randomness in the tree-building process by selecting a
random subset of features at each split and training each tree on a bootstrapped
sample of the dataset.
● Random forests are robust, versatile, and perform well on a wide range of
datasets.
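
The averaging described above can be checked directly. This sketch uses synthetic data and an arbitrary forest size of 50 trees, both assumptions; it compares the forest's prediction with the mean of the individual trees' predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (assumption): price-like target from two features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Predict with every individual tree, then average across trees.
per_tree = np.stack([t.predict(X[:5]) for t in forest.estimators_])
manual_average = per_tree.mean(axis=0)

forest_prediction = forest.predict(X[:5])
```

The manual average and the forest's own prediction agree, which shows that a regression forest is simply the mean of its trees.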

Fig 3.17 Random forest Regressor

3.4.3 Linear Model:

● Linear regression is a simple and commonly used regression algorithm that
models the relationship between a dependent variable and one or more
independent variables by fitting a linear equation to the observed data.
● The linear equation represents a straight line in two dimensions or a hyperplane
in higher dimensions.
● Linear regression assumes that the relationship between the variables is linear
and that the residuals (the differences between observed and predicted values)
are normally distributed with constant variance.
● Despite its simplicity, linear regression can be powerful and effective for many
real-world problems, especially when the relationship between the variables is
approximately linear.
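
As a small illustration of fitting a linear equation, the sketch below generates prices from a known line (price = 10 - 1.5 × age, a made-up relationship) and checks that the fitted model recovers the slope and intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up, perfectly linear data: price falls by 1.5 for each year of age.
X = np.array([[1.0], [2.0], [3.0], [4.0]])  # car age in years
y = 10.0 - 1.5 * X[:, 0]                    # price

model = LinearRegression().fit(X, y)

slope = model.coef_[0]
intercept = model.intercept_
predicted_price = model.predict([[5.0]])[0]
```

On real car data the points do not lie exactly on a line, so the fitted coefficients are the ones that minimize the squared residuals rather than an exact match.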

Fig 3.18 Linear Regressor

3.5 Evaluation:
To evaluate the trained models, I utilized the Mean Absolute Error (MAE) metric. This
metric provided insights into the average magnitude of errors between predicted and
actual prices, aiding in the assessment of model performance and selection of the most
suitable algorithm for the task at hand.
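
MAE is the mean of the absolute differences between actual and predicted values. The sketch below computes it by hand and with scikit-learn's mean_absolute_error on four made-up prices; the numbers are illustrative and are not results from the project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up prices for illustration.
actual = np.array([5.0, 7.5, 3.0, 10.0])
predicted = np.array([4.5, 8.0, 3.0, 12.0])

# Absolute errors are 0.5, 0.5, 0.0 and 2.0, so their mean is 0.75.
mae_manual = np.mean(np.abs(actual - predicted))
mae_sklearn = mean_absolute_error(actual, predicted)
```

Because MAE is expressed in the same unit as the price, a value of 0.75 means the predictions are off by 0.75 price units on average.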

Fig 3.19 Mean absolute error of different models

3.6 Result
The models were successfully trained and evaluated. The MAE decreases when certain
columns are removed. In practice, however, all features are needed to predict an
accurate price, so the final model includes all the features.

Our final model is trained using RandomForestRegressor, retaining all the columns.

Link for code: Car price prediction.ipynb

3.7 References:
https://pandas.pydata.org/docs/reference/
https://numpy.org/doc/stable/
https://scikit-learn.org/stable/modules/classes.html
https://www.questionpro.com/blog/categorical-data/#:~:text=There%20are%20two%20types%20of,scale%20or%20order%20to%20it.
https://www.geeksforgeeks.org/ml-one-hot-encoding-of-datasets-in-python/

4. Conclusion

In the culmination of this project aimed at predicting Indian used car prices, a robust
methodology was employed, leveraging the power of NumPy and Pandas for effective
data handling and preprocessing. The journey began with a clear problem definition:
predicting used car prices based on various parameters.

Throughout the exploration and preprocessing phase, critical steps were taken to
ensure data reliability and relevance. The models selected for this task, namely
Decision Tree, Random Forest, and Linear Regression, represent diverse approaches to
machine learning, each contributing unique insights into the predictive process.

A conscious decision was made to keep the project simple, avoiding advanced
parameters like random state. This choice aligns with the project's emphasis on clarity
and comprehensibility, making it accessible to a wider audience. The evaluation metric
chosen, Mean Absolute Error (MAE), reflects a pragmatic approach given the relatively
small dataset. The focus was not solely on achieving the most accurate output but,
rather, on comprehending core concepts and refining the methodology.

This project serves as a learning journey, not only for predicting used car prices but for
understanding the intricacies of machine learning in a real-world context. By adhering to
simplicity, the project fosters a deeper understanding of the fundamental principles,
laying a foundation for further exploration and refinement in future endeavors. It is a
testament to the balance between practicality and complexity, providing valuable
insights into the dynamic world of predictive modeling.

I would also like to thank my instructor, Mr. Sudheendra, and Quant Masters
Technologies Pvt. Ltd. for providing this great opportunity to advance my learning
journey and for preparing me to build expertise in this domain in the future.
