ML Course
SUBMITTED BY
DARSHITH L
(U03NM21T029008)
Computer Science and Engineering, 4th sem
University of Visvesvaraya College of Engineering
Mr. Sudheendra,
Technical Lead,
Quant Masters Technologies Pvt. Ltd.
Sl no Content
1 Introduction
2 Overview
3 Proposed methodology
4 Conclusion
1. Introduction
1.1 Introduction to Machine Learning
1.1.1 Key Concepts and steps in machine learning
1.1.2 Applications
1.1.3 Challenges
3.4 Training
3.5 Evaluation
3.6 Result
3.7 References
4. Conclusion
List Of Images
Fig no Images
1.1 Types of machine Learning
1.2 Applications of Machine Learning
3.1 importing basic modules
3.2 data frame sample
3.3 summary of data types and number of non null values in df
3.4 dropping rows having null value entry
3.5 df.info() after dropna()
3.6 converting Year to int16
3.7 reset_index()
3.8 Converting Km/Kg to KMPL
3.9 Converting Engine
3.10 Converting Power and Seats
3.11 Converting New_Price
3.12 Modifying brand name
3.13 Label Encoding
3.14 Train test split
3.15 Decision Tree Regressor
3.16 Random forest Regressor
3.17 Linear Regressor
1. Introduction
● Reinforcement Learning: Reinforcement learning involves training an
agent to interact with an environment in order to achieve a specific goal.
The agent learns by receiving feedback in the form of rewards or penalties
based on its actions.
Deployment and Monitoring: Deploying ML models into production involves
integrating them into existing systems or applications. Continuous monitoring is
essential to ensure that models perform accurately and reliably over time, as
data distributions and patterns may change.
1.1.2 Applications:
Machine learning has a wide range of applications across industries, including:
Fig 1.2 Applications of Machine Learning
1.1.3 Challenges:
Despite its numerous benefits, machine learning also presents several challenges:
● Data Quality: ML models are highly dependent on the quality, relevance, and
representativeness of the data used for training. Poor-quality data can lead to
biased or inaccurate predictions.
● Interpretability: Many ML algorithms, particularly deep learning models, are often
considered "black boxes," making it challenging to interpret their decision-making
processes.
● Ethical Considerations: ML models can inadvertently perpetuate or amplify
biases present in the data, raising ethical concerns related to fairness,
transparency, and accountability.
● Computational Resources: Training complex ML models, especially deep neural
networks, requires significant computational resources, including
high-performance GPUs and large-scale distributed systems.
The problem we are addressing is the lack of transparency and consistency in pricing
for second-hand cars. The used car market is often characterized by varying price
listings that do not always reflect the true value of the vehicles. This can lead to
confusion and uncertainty for both buyers and sellers, who may struggle to determine
fair market prices. Our goal is to provide a solution that leverages machine learning to
predict the price of second-hand cars accurately, thereby enhancing transparency and
facilitating fair transactions in the used car market.
Data Collection: The data was downloaded from an open source website for learning
purposes.
Data Preprocessing: I cleaned and prepared the dataset by handling missing values,
removing outliers, and standardizing features.
Feature Selection: I identified the most relevant features that could influence the
pricing of second-hand cars, such as car make, model, year of manufacture, mileage,
fuel type, and transmission type.
Model Training: I trained predictive models using machine learning algorithms such as
decision trees, random forests, and linear regression.
Model Evaluation: I evaluated the performance of the trained models using an appropriate
evaluation metric, the mean absolute error (MAE), to assess their accuracy and
reliability.
Validation: I compared the model's predictions against real-world second-hand car
prices to assess its effectiveness and identify any areas for improvement.
Deployment: Since this model was made for learning purposes, I uploaded it to
GitHub for my own future reference.
Free Access: Google Colab provides free access to computational resources, including
GPU and TPU accelerators, which are crucial for training machine learning models
efficiently.
Pre-Installed Libraries: Google Colab comes pre-installed with popular Python libraries
such as TensorFlow, PyTorch, and scikit-learn, saving me time and effort in setting up
my environment.
2. Overview
In my project, the task is akin to training a computer to accurately estimate the prices of
used cars. I utilize a form of machine learning known as Supervised Learning, where
the computer learns from a curated set of examples containing details about used cars
and their actual selling prices. This dataset acts as a teacher, providing the necessary
examples for the computer to learn from.
The dataset comprises information about various cars, including their models,
manufacturing years, mileage, conditions, and corresponding prices. My goal is to
enable the computer to predict the price of a car based on its features, essentially
teaching it how to make informed estimations using algorithms.
One approach I employ is the Decision Tree algorithm, which functions like a flowchart,
asking a series of questions about the car's characteristics to arrive at a price
estimation. Another technique I utilize is the Random Forest algorithm, where a group of
decision trees collaborates, resembling seeking advice from multiple experts to
ascertain the most accurate price.
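The two ideas above can be illustrated with a minimal sketch in scikit-learn. The data here is synthetic: the two features (car age and kilometres driven) and the price rule are invented for illustration and are not the project's dataset. `random_state` is fixed only so that the sketch is repeatable. The last lines check that the forest's estimate is the average of the estimates of its individual trees, which is the "multiple experts" idea.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): car age in years and kilometres driven.
rng = np.random.default_rng(0)
age = rng.integers(1, 15, size=200)
km = rng.integers(5_000, 150_000, size=200)
X = np.column_stack([age, km]).astype(float)
# Invented price rule (in lakh) plus noise, only to have something to learn.
y = 12.0 - 0.5 * age - 0.00003 * km + rng.normal(0.0, 0.3, size=200)

# A single tree: a flowchart of threshold questions on the features.
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# A forest: many trees, each trained on a bootstrap sample of the rows.
forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

new_car = np.array([[5.0, 60_000.0]])
tree_price = tree.predict(new_car)[0]
forest_price = forest.predict(new_car)[0]

# The forest's estimate is the mean of its trees' individual estimates.
per_tree = np.array([t.predict(new_car)[0] for t in forest.estimators_])
mean_of_trees = per_tree.mean()
```

Averaging many trees usually gives a smoother estimate than any single tree, at the cost of a model that is harder to read as a flowchart.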
Once the computer has learned from these examples, it becomes capable of predicting
the price of a new, unseen car by applying the patterns and relationships it has
discerned from the training data. If the predictions are not precise enough, I can refine
the model, for example by improving the data preparation or adjusting the model's
settings. Moreover, as more data becomes available, the models can be retrained on the
larger dataset so that their predictions keep pace with the dynamic used car market and
remain up-to-date and reliable.
3. Proposed Methodology
Fig 3.3 summary of data types and number of non null values in df
Upon discovering numerous NaN values in the column designated as the target
variable, I opted to remove rows containing these NaN values.
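A minimal sketch of this step with pandas is given here. The small data frame and the column name `Price` for the target are assumptions made for illustration; the project's actual frame is the one shown in the figures.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame (assumption): "Price" plays the role of the target column.
df = pd.DataFrame(
    {
        "Name": ["Car A", "Car B", "Car C", "Car D"],
        "Year": ["2015", "2012", "2018", "2010"],
        "Price": [4.5, np.nan, 7.2, np.nan],
    }
)

# Count missing entries per column before deciding what to drop.
missing_before = df.isna().sum()

# Keep only rows whose target value is present, then renumber the index.
cleaned = df.dropna(subset=["Price"]).reset_index(drop=True)
```

Passing `subset` restricts the check to the target column, so rows that are missing only some other value are kept.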
Fig 3.7 Converting Kilometers_Driven to int
Year and Kilometers_Driven were stored as strings of digits. Using the
astype() function, they were converted to integer types.
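The conversion can be sketched as follows. The sample values are invented, and the choice of `int64` for Kilometers_Driven is an assumption; the report's figure list only states that Year was converted to int16.

```python
import pandas as pd

# Stand-in frame (assumption): both columns hold digits stored as text.
df = pd.DataFrame(
    {
        "Year": ["2015", "2012", "2018"],
        "Kilometers_Driven": ["41000", "87000", "23500"],
    }
)

# int16 covers -32768..32767, which is enough for a calendar year.
df["Year"] = df["Year"].astype("int16")
# Kilometre readings can exceed the int16 range, so use a wider integer.
df["Kilometers_Driven"] = df["Kilometers_Driven"].astype("int64")

# Arithmetic now works on numbers instead of concatenating text.
total_km = int(df["Kilometers_Driven"].sum())
```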
Fig 3.12 Converting New_Price
Some of the entries in New_Price were in crore. By traversing each record, all entries
in crore were converted to lakh.
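A minimal sketch of such a conversion, using the fact that 1 crore equals 100 lakh. The helper name `price_to_lakh` and the "value unit" text layout (for example "1.5 Cr" or "8.61 Lakh") are assumptions for illustration, not the report's actual code or data format.

```python
def price_to_lakh(text):
    """Convert a string such as '1.5 Cr' or '8.61 Lakh' to a float in lakh.

    The 'value unit' layout is an assumption made for this sketch.
    """
    value, unit = text.strip().split()
    amount = float(value)
    if unit.lower().startswith("cr"):
        return amount * 100.0  # 1 crore = 100 lakh
    return amount


raw_prices = ["8.61 Lakh", "1.5 Cr", "21 Lakh"]
# Traverse each record and express it in lakh.
prices_in_lakh = [price_to_lakh(p) for p in raw_prices]
```

After this step every price is on the same scale, so the column can be treated as a single numeric feature.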
3.4 Training:
For model training, I employed various algorithms including Decision Tree, Random
Forest Regressor, and Linear Regression to explore different approaches and assess
their performance.
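The training step can be sketched as below, fitting the three model types on the same training rows. The data is synthetic, and the 80/20 split, `n_estimators=50`, and the fixed `random_state` are assumptions chosen to keep the sketch repeatable; the report's own models were built without setting a random state.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in features and target (assumption, not the project's data).
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(300, 4))
y = 10.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.1, size=300)

# Hold out 20% of the rows for testing; the split ratio is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

models = {
    "Decision Tree": DecisionTreeRegressor(random_state=1),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=1),
    "Linear Regression": LinearRegression(),
}

# Fit every model on the same training rows and keep its test predictions.
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
```

Keeping the models in one dictionary makes it easy to compare them later on identical test rows.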
Fig 3.17 Random forest Regressor
3.5 Evaluation:
To evaluate the trained models, I utilized the Mean Absolute Error (MAE) metric. This
metric provided insights into the average magnitude of errors between predicted and
actual prices, aiding in the assessment of model performance and selection of the most
suitable algorithm for the task at hand.
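For n cars, MAE is the mean of |actual price - predicted price| over all cars. The sketch below computes it by hand and with scikit-learn's `mean_absolute_error`; the four price pairs are invented values in lakh, used only to make the arithmetic visible.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Invented example prices in lakh (assumption, not results from the project).
actual = np.array([5.0, 7.5, 3.0, 12.0])
predicted = np.array([4.5, 8.0, 3.5, 10.0])

# Absolute errors are 0.5, 0.5, 0.5 and 2.0, so the mean is 3.5 / 4.
manual_mae = np.mean(np.abs(actual - predicted))
library_mae = mean_absolute_error(actual, predicted)
```

Because MAE is in the same unit as the price, a value of 0.875 here reads directly as "off by 0.875 lakh on average".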
3.6 Result
The models were successfully trained and evaluated. The MAE decreases when certain
columns are removed. In practice, however, all features are needed to predict an
accurate price, so the final model includes all of them.
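One way to examine how MAE changes when a column is left out is sketched below. The data, the column names, and the helper `holdout_mae` are hypothetical and chosen for illustration; this is not the procedure or the result reported above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); "noise" is unrelated to the target.
rng = np.random.default_rng(2)
df = pd.DataFrame(
    {
        "age": rng.uniform(1, 15, size=400),
        "km": rng.uniform(5, 150, size=400),
        "noise": rng.normal(0, 1, size=400),
    }
)
price = 12.0 - 0.5 * df["age"] - 0.03 * df["km"] + rng.normal(0, 0.2, size=400)


def holdout_mae(columns):
    """Train on the given columns and return the MAE on held-out rows."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[columns], price, test_size=0.25, random_state=2
    )
    model = RandomForestRegressor(n_estimators=50, random_state=2)
    model.fit(X_train, y_train)
    return mean_absolute_error(y_test, model.predict(X_test))


all_columns = list(df.columns)
baseline = holdout_mae(all_columns)
# MAE after leaving out one column at a time.
ablation = {
    col: holdout_mae([c for c in all_columns if c != col]) for col in all_columns
}
```

A lower MAE on a single split does not by itself show that a column is unhelpful, which is consistent with the decision above to keep all features.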
3.7 References:
https://pandas.pydata.org/docs/reference/
https://numpy.org/doc/stable/
https://scikit-learn.org/stable/modules/classes.html
https://www.questionpro.com/blog/categorical-data/#:~:text=There%20are%20two%20types%20of,scale%20or%20order%20to%20it.
https://www.geeksforgeeks.org/ml-one-hot-encoding-of-datasets-in-python/
4. Conclusion
In the culmination of this project aimed at predicting Indian used car prices, a robust
methodology was employed, leveraging the power of NumPy and Pandas for effective
data handling and preprocessing. The journey began with a clear problem definition:
predicting used car prices based on various parameters.
Throughout the exploration and preprocessing phase, critical steps were taken to
ensure data reliability and relevance. The models selected for this task, namely
Decision Tree, Random Forest, and Linear Regression, represent diverse approaches to
machine learning, each contributing unique insights into the predictive process.
A conscious decision was made to keep the project simple, avoiding advanced
parameters like random state. This choice aligns with the project's emphasis on clarity
and comprehensibility, making it accessible to a wider audience. The evaluation metric
chosen, Mean Absolute Error (MAE), reflects a pragmatic approach given the relatively
small dataset. The focus was not solely on achieving the most accurate output but,
rather, on comprehending core concepts and refining the methodology.
This project serves as a learning journey, not only for predicting used car prices but for
understanding the intricacies of machine learning in a real-world context. By adhering to
simplicity, the project fosters a deeper understanding of the fundamental principles,
laying a foundation for further exploration and refinement in future endeavors. It is a
testament to the balance between practicality and complexity, providing valuable
insights into the dynamic world of predictive modeling.