Major Project Report (ROHIT)
DECLARATION
This is to declare that this report has been written by me. No part of the report is plagiarized
from other sources. All information included from other sources has been duly
acknowledged. I aver that if any part of the report is found to be plagiarized, I shall take full
responsibility for it.
ROHIT DEVGAN
ACKNOWLEDGEMENT
“An engineer with only theoretical knowledge is not a complete engineer. Practical
knowledge is very important to develop and apply engineering skills.” It gives me great
pleasure to have the opportunity to acknowledge and express my gratitude to those who
were associated with me during my training at IBM (Data Science with Python), Delhi.
I am very grateful to Ms. Naina Devi for providing me with the opportunity to train under
her able guidance.
I am pleased to acknowledge my sincere thanks to Dr. Akhilesh Das Gupta Institute of
Technology and Management and IBM for their kind encouragement in doing this project
and for completing it successfully. I am grateful to them for allowing me to be a part of this
project and for helping me complete it.
ROHIT DEVGAN
ABSTRACT
Calculating car mileage and predicting diamond prices may not seem like important topics,
but if we compare them with everyday life, we will see that these are questions people
regularly ask.
So, to make things easier, we have built a solution that helps us get answers.
The car mileage calculation simply depends upon the attributes chosen by the user, for
example horsepower; the model relates these values to MPG to give an estimated mileage.
The diamond price calculation simply depends upon the attributes chosen by the user, for
example carat or clarity; the model relates these values to price to give an estimated value.
TABLE OF CONTENTS
Acknowledgement
Abstract
CHAPTER 1:
1.1 Introduction
1.1.1 Definition of Python
1.2 Motivation
CHAPTER 2:
2.1 Literature Review
2.2 Related Work
CHAPTER 3:
3.1 Coding
3.1.1 Prediction of Car Mileage
3.1.2 Prediction of Diamond Price
CHAPTER 4:
4.1 Technology Used
4.2 Methodology of the Project
CHAPTER 5:
5.1 Implementation
5.1.1 Prediction of Car Mileage
5.1.2 Prediction of Diamond Price
CHAPTER 6:
6.1 Result and Analysis
CHAPTER 7:
7.1 Conclusion
7.2 Future Scope
REFERENCES
CHAPTER: 1
1.1 INTRODUCTION
1.1.1 DEFINITION OF PYTHON
Python is an interpreted, object-oriented, high-level
programming language. Its high-level built-in data structures,
combined with dynamic typing and dynamic binding, make it
very attractive for Rapid Application Development, as well as
for use as a scripting or glue language to connect existing components together.
The Python interpreter and the extensive standard library are available in source or binary
form without charge for all major platforms, and can be freely distributed.
Python was developed by Guido van Rossum in the late eighties and early nineties at the
National Research Institute for Mathematics and Computer Science in the Netherlands.
Python's source code is freely available under an open-source licence (the Python Software
Foundation License), which permits free use and distribution.
Python is now maintained by a core team of developers, with Guido van Rossum having long
held a central role in directing its progress.
Python is often compared to other programming languages such as Java and C++. In this
section we will briefly compare Python to each of these languages.
COMPARISON TO JAVA
Python programs are generally expected to run slower than Java programs, but they
also take much less time to develop.
Python programs are typically 3-5 times shorter than equivalent Java programs. This
difference can be attributed to Python's built-in high-level data types and its dynamic
typing.
For example, a Python programmer wastes no time declaring the types of arguments
or variables, and Python's powerful polymorphic list and dictionary types, for which
rich syntactic support is built straight into the language, find a use in almost every
Python program. Because of the run-time typing, Python's run time must work harder
than Java's.
For example, when evaluating the expression a+b, it must first inspect the objects a
and b to find out their type, which is not known at compile time. It then invokes the
appropriate addition operation, which may be an overloaded user-defined method.
COMPARISON TO C++
Almost everything said for Java also applies for C++, just more so: where Python code
is typically 3-5 times shorter than equivalent Java code, it is often 5-10 times shorter
than equivalent C++ code.
Anecdotal evidence suggests that one Python programmer can finish in two months
what two C++ programmers can't complete in a year.
Easy to Read:
Python code reads like simple English. There are no semicolons or curly braces;
indentation defines the code blocks.
You can often tell what the code is supposed to do simply by looking at it.
Interpreted:
When a programming language is interpreted, it means that the source code is
executed line by line, and not all at once.
Programming languages such as C++ or Java are not interpreted, and hence need to
be compiled first to run them.
There is no need to compile Python because it is processed at runtime by the
interpreter.
Object-Oriented and Procedure-Oriented:
A programming language is object-oriented if it focuses design around data and
objects, rather than functions and logic.
On the contrary, a programming language is procedure-oriented if it focuses more on
functions (code that can be reused). One of the critical Python features is that it
supports both object-oriented and procedure-oriented programming.
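A minimal sketch of the two styles side by side (an invented example, not code from the project):

```python
# Procedure-oriented style: a reusable function operating on plain values.
def area_of_rectangle(width, height):
    return width * height

# Object-oriented style: data (width, height) and behaviour bundled in a class.
class Rectangle:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        return self.width * self.height

# Both approaches compute the same result.
print(area_of_rectangle(3, 4))   # 12
print(Rectangle(3, 4).area())    # 12
```

Python lets you pick whichever style suits the problem, or mix them freely in one program.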
Expressive:
Python needs only a few lines of code to perform complex tasks. For example,
to display Hello World, you simply type one line: print("Hello World").
Other languages like Java or C take multiple lines to do the same.
Support for GUI:
One of the key aspects of any programming language is support for a GUI, or Graphical
User Interface. A user can easily interact with software through a GUI.
Python offers various toolkits, such as Tkinter, wxPython, and PyQt, which allow for
easy and fast GUI development.
Dynamically Typed:
Many programming languages need to declare the type of the variable before
runtime. With Python, the type of the variable can be decided during runtime. This
makes Python a dynamically typed language.
For example, if you have to assign the integer value 61 to a variable x, you don’t need
to write int x = 61. You just have to write x = 61.
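The behaviour described above can be seen in a short sketch (the values are arbitrary):

```python
# Dynamic typing: no declaration is needed, and the same name may even
# be rebound to a value of a different type at runtime.
x = 61            # x refers to an int
print(type(x))    # <class 'int'>

x = "sixty-one"   # now x refers to a str -- perfectly legal in Python
print(type(x))    # <class 'str'>
```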
High-level Language:
Python is a high-level programming language because programmers don’t need to
remember the system architecture, nor do they have to manage the memory.
This makes it super programmer-friendly and is one of the key features of Python.
Speed:
Because Python is a high-level language, it is not as close to the hardware as C or C++.
Python code is executed line by line by an interpreter rather than being compiled ahead
of time, which tends to make it slower than compiled languages.
Mobile Development:
While Python is strong on desktop and server platforms, making it an excellent
server-side language, it is a weak choice for mobile development and is very rarely
used there. As a result, few mobile applications are built in it; Carbonnelle is one
example built in Python.
Memory Consumption:
Python is not a good choice for memory-intensive tasks, and it is therefore rarely used
for them. Its memory consumption is also high, due to the flexibility of its data types.
Web Development:
When it comes to web development, Python offers numerous options.
For instance, you have Django, Pyramid, Flask, and Bottle as web frameworks, and
even advanced content management systems like Plone and Django CMS.
These frameworks come with standard libraries and modules that simplify tasks like
content management, database interaction, and interfacing with internet protocols
such as HTTP, SMTP, XML, JSON, FTP, IMAP, and POP.
Game Development:
Python comes loaded with many useful extensions (libraries) that come in handy for
the development of interactive games.
For instance, PySoy (a 3D game engine that supports Python 3) and PyGame are two
Python-based libraries used widely for game development.
Python has been used in popular games like Battlefield 2, Frets on Fire, World of
Tanks, Disney’s Toontown Online, Vega Strike, and Civilization IV.
Apart from game development, game designers can also use Python for developing
tools to simplify specific actions such as level design or dialog tree creation, and even
use those tools to export those tasks in formats that can be used by the primary game
engine.
Also, Python is used as a scripting language by many game engines.
We can start out by creating simple applications such as calculators and to-do apps,
and then go on to create much more complicated applications.
Moreover, many major organizations and companies, like NASA, Google, Nokia, IBM,
and Yahoo Maps, also use Python in their operational work.
For Example:
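As a small illustration of such a simple application, a minimal calculator function might look like this (an illustrative sketch, not code from the project):

```python
# A tiny calculator: dispatch on an operator symbol (illustrative only).
def calculate(op, a, b):
    if op == "+":
        return a + b
    elif op == "-":
        return a - b
    elif op == "*":
        return a * b
    elif op == "/":
        return a / b
    else:
        raise ValueError(f"unknown operator: {op}")

print(calculate("+", 2, 3))   # 5
print(calculate("/", 10, 4))  # 2.5
```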
Loops in Python:
A loop is a control structure that allows you to repeatedly execute a block of code as
long as a certain condition is satisfied. Loops are useful for automating repetitive tasks
and iterating over collections of data. Python supports two main types of loops: "for"
loops and "while" loops.
1. For loop: A "for" loop is used to iterate over a sequence (like a list, tuple,
string, etc.) and execute a block of code for each element in the sequence.
2. While loop: A "while" loop is used to repeatedly execute a block of code as
long as a specified condition remains true.
For Example:
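The two loop types above can be sketched as follows (the values are illustrative):

```python
# "for" loop: iterate over each element of a sequence.
fruits = ["apple", "banana", "cherry"]
for fruit in fruits:
    print(fruit)

# "while" loop: repeat while a condition remains true.
count = 1
total = 0
while count <= 5:
    total += count
    count += 1
print(total)   # 15 (1 + 2 + 3 + 4 + 5)
```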
Conditional Statements:
Conditional statements in Python allow you to make decisions in your code based on
certain conditions. These statements are used to control the flow of your program
and execute different blocks of code depending on whether a given condition is true
or false. The main conditional statements in Python are the "if," "elif" (short for "else
if"), and "else" statements.
For Example:
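A short sketch of these statements (the marks and grade thresholds are arbitrary):

```python
# if / elif / else: execute one branch depending on which condition holds.
marks = 72
if marks >= 80:
    grade = "A"
elif marks >= 60:
    grade = "B"
else:
    grade = "C"
print(grade)   # B, since 72 is between 60 and 79
```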
1.1.7 APPLICATION AND SOFTWARE USED FOR PYTHON:
There are several applications and software tools that are commonly used for Python
programming.
a. PyCharm IDE
b. Jupyter Notebook
c. Keras
d. pip (package manager)
e. Anaconda Navigator
Anaconda Navigator:
Anaconda Navigator is a graphical user interface (GUI) included with the Anaconda
distribution, which is a popular platform used for data science, machine learning, and
scientific computing in Python. Anaconda Navigator provides a user-friendly way to
manage and work with Python environments, packages, and projects. Here are some
key points about Anaconda Navigator:
1. Graphical Interface: Anaconda Navigator offers a visual interface for managing
various aspects of your data science workflow, making it easier for users who
might not be comfortable with the command line.
2. Environment Management: Anaconda Navigator allows you to create and
manage isolated Python environments, commonly referred to as conda
environments. These environments let you keep your projects and
packages separate, which helps avoid conflicts between different
dependencies.
3. Package Management: With Navigator, you can easily search, install, and
update Python packages and libraries from the Anaconda repository. It
simplifies the process of installing and managing complex packages and
dependencies.
4. Project Management: You can create and manage data science projects using
Navigator. Each project can have its own isolated environment and associated
packages, making it easy to switch between different project configurations.
5. Integrated Development Environments (IDEs): Navigator can be used to
launch popular Python IDEs like Jupyter Notebook, JupyterLab, and Spyder.
These IDEs are widely used in the data science community for interactive
coding, analysis, and visualization.
Jupyter Notebook:
Jupyter Notebook is an open-source web application that lets you create and share
documents combining live code, equations, visualizations, and narrative text. It is
widely used for data cleaning, analysis, and modelling.
Key features of Jupyter Notebook include:
1. Cell-Based Structure: Notebooks are organized into cells. Each cell can contain
code (usually in Python but can support other languages), Markdown text for
explanations and documentation, or even LaTeX equations for mathematical
formulas.
2. Interactive Execution: Users can run individual code cells interactively. This
allows for step-by-step execution and immediate feedback, which is useful for
debugging and iterative development.
3. Rich Output: Notebooks can display rich outputs, including images, plots,
tables, and interactive visualizations. This makes it great for data analysis and
presentation.
4. Data Visualization: Jupyter Notebook integrates well with various data
visualization libraries, such as Matplotlib, Seaborn, Plotly, and more. This
makes it easy to create meaningful visualizations within your notebook.
5. Easy Sharing: Notebooks can be easily shared with others. They can be
exported to different formats like HTML, PDF, and slides, allowing you to
communicate your work effectively.
6. Collaboration: While real-time collaborative editing is not a built-in feature of
Jupyter Notebook, there are third-party tools and platforms that allow for
collaborative editing and sharing of notebooks.
7. Educational Tool: Jupyter Notebook is used widely in education for teaching
programming, data science, and various scientific concepts. Its interactive
nature and ability to mix code and explanations make it a valuable tool for
learning.
1.1.8 Regression In Python:
Definition of Regression:
Regression is a statistical and machine learning technique used for modelling the
relationship between one or more independent variables (often called "predictors" or
"features") and a dependent variable (often called the "target" or "outcome"). The
main goal of regression analysis is to understand and quantify the relationship
between the variables, as well as to make predictions or estimations based on this
relationship.
There are various types of regression techniques, each suited for different scenarios
and data types. Some common types of regression include:
a. Linear Regression
b. Multiple Linear Regression
c. Polynomial Regression
d. Logistic Regression
e. Time Series Regression
f. Support Vector Regression
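As an illustrative sketch of the simplest of these, linear regression, the following fits a line to a few made-up points using NumPy. The data and variable names are invented for the example; the actual projects work on real car and diamond datasets.

```python
import numpy as np

# Fit y = w * x + b by ordinary least squares on invented data
# that roughly follows y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # one predictor (feature)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # target values

# np.polyfit with degree 1 returns the slope and intercept.
w, b = np.polyfit(x, y, 1)
print(round(w, 2), round(b, 2))   # slope close to 2, intercept close to 0

# Use the fitted line to predict the target for a new input.
y_new = w * 6.0 + b
```

The same idea generalises to multiple predictors (multiple linear regression) and to the other techniques listed above.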
d. Risk Assessment: Regression models can be used in finance and insurance to
assess risk factors and predict the likelihood of events, such as loan defaults or
insurance claims.
e. Healthcare: Regression can be used to predict medical outcomes, like
predicting patient health outcomes based on factors such as age, medical
history, and other relevant variables.
f. Machine Learning: Regression algorithms are a key component of machine
learning pipelines, helping algorithms learn patterns and relationships within
data.
Regression empowers us to extract meaningful insights from data and build predictive
models that help us estimate outcomes based on underlying relationships. By
leveraging regression in projects focused on car mileage and diamond price
prediction, we can enhance decision-making, gain valuable insights, and make
informed forecasts in two distinct domains.
1.2 MOTIVATION:
The motivation behind working on projects to predict car mileage and diamond prices
using Python is all about making life better and smarter. Imagine being able to know
how far a car can go on a tank of gas before you buy it. This could save you money and
help the environment by choosing a more fuel-efficient car. That's why predicting car
mileage matters – it's about helping people make wise choices while being kind to our
planet.
Now, think about diamonds. They're beautiful and valuable, but their prices can be
confusing. Imagine having a tool that could help you understand how much a diamond
should cost based on its size, cut, and other features. This could help you get a fair
deal when buying one. That's where predicting diamond prices comes in – it's about
giving people the knowledge they need to make informed decisions when buying or
selling something precious.
In simple words, both these projects use clever math to predict important things – like
how efficient a car is or how much a diamond should cost. This knowledge can make
a real difference in people's lives, from saving money to being more environmentally
conscious and getting the best value for their money.
On the other hand, in the diamond price prediction project, the problem centres on
uncovering the intricate relationship between a diamond's characteristics and its
market price. This involves decoding how factors like the diamond's carat weight, cut
quality, colour, and clarity affect its value. The key challenge is to build an accurate
prediction model that aids in estimating a diamond's price based on its qualities. This
can provide buyers, sellers, and investors with a clearer understanding of the diamond
market, empowering them to make more informed decisions when it comes to
purchasing or selling diamonds.
1.4 OBJECTIVE:
The objectives of our projects focused on predicting car mileage and diamond prices
using Python are quite clear. In the car mileage prediction project, our goal is to
develop accurate prediction models that can estimate how far a car can travel on a
given amount of fuel. By analysing attributes like engine size, weight, and power, we
aim to create a tool that assists potential car buyers in making informed decisions
about fuel efficiency before purchasing a vehicle. This objective aligns with the
broader goal of promoting environmentally responsible and cost-effective
transportation choices.
Similarly, in the diamond price prediction project, our aim is to construct reliable
prediction models that can predict the market price of diamonds based on their
intrinsic features. By understanding the relationships between attributes such as carat
weight, cut quality, color, and clarity, we intend to create a tool that empowers
buyers, sellers, and investors in the diamond industry to navigate the market with
more confidence. Ultimately, our objectives in both projects revolve around utilizing
Python to develop models that provide valuable insights, enabling better decision-
making for both car buyers and diamond enthusiasts.
1.5 FEASIBILITY STUDY:
In the case of predicting car mileage, feasibility revolves around data availability and
quality. It's crucial to determine if there's sufficient data encompassing diverse car
attributes and corresponding mileage values. Additionally, evaluating the accuracy of
the data and its representation of real-world scenarios is paramount. The feasibility
study also involves gauging the availability of appropriate regression algorithms and
tools within Python that can handle the complexity of the prediction task.
1.5.1 Study of Car Mileage:
The primary function of cars is to provide convenient and fast transportation. They
allow individuals to commute to work, travel long distances, run errands, and explore
new places. With cars, people can save time and have greater flexibility in their daily
routines.
Sports Car – This includes cars known for their performance at high speeds.
They are often convertibles. Sports cars are designed with a focus on form and
performance rather than seating capacity, luxury, or cargo space.
Hatchbacks – This refers to a car with additional space for cargo. These cars
have a large door at the back, which gives access to this additional area.
Minivan – These are smaller versions of vans, built on car platforms. They have
sliding rear doors and fall into the segment of mid-sized cars.
Sport Utility Vehicle (SUV) – These cars are known for their off-road
capabilities. They have a large body type, seating from 5 to 7+, and usually
four-wheel drive.
Station Wagon – Also known as an Estate Car, these are quite similar to a
hatchback but with an extended roof and rear body. The extra space makes
room for a third row of seats.
IV. MILEAGE
Mileage is how many kilometres a vehicle runs per litre of fuel, or how many
miles it runs per gallon. The term is also used for how many kilometres or miles
the vehicle has covered in its lifetime. These are the two formal usages of the
term. Mileage differs from one vehicle to another.
Factors on which mileage depends include the following:
1. Number of Cylinders:- The fuel mileage, or fuel efficiency, of a vehicle can
be influenced by several factors, including the number of cylinders in the
engine. However, the number of cylinders is just one of many factors that
contribute to overall fuel efficiency. Others, such as engine size, vehicle
weight, aerodynamics, transmission type, and driving conditions, also play
significant roles. In general, vehicles with fewer cylinders tend to have better
fuel efficiency, all else being equal, because engines with fewer cylinders
generate less internal friction, consume less fuel, and produce fewer
emissions.
2. Vehicle Weight:- Heavier vehicles generally need more energy to move, but
the relationship between weight and mileage isn't always linear or
straightforward. Other factors such as engine efficiency, transmission type,
tyre type and pressure, road conditions, driving habits, and maintenance also
play significant roles in determining a vehicle's fuel efficiency.
1.5.2 Study of Diamond Price:
Similarly, for the diamond price prediction project, feasibility hinges on the availability
and reliability of diamond data. Assessing whether the dataset includes a
comprehensive range of diamond attributes and corresponding prices is essential.
Ensuring that the data is representative of the diamond market and sufficiently
detailed is a critical part of the study. Additionally, determining the adequacy of
Python libraries for advanced regression techniques, data pre-processing, and
visualization is crucial for successful execution.
Introduction:
Diamond is a solid form of the element carbon with its atoms arranged in a crystal
structure called diamond cubic. Another solid form of carbon known as graphite is the
chemically stable form of carbon at room temperature and pressure, but diamond is
metastable and converts to it at a negligible rate under those conditions. Diamond has
the highest hardness and thermal conductivity of any natural material, properties that
are used in major industrial applications such as cutting and polishing tools. They are
also the reason that diamond anvil cells can subject materials to pressures found deep
in the Earth.
Diamonds are among the most coveted and admired gemstones in the world, known
for their exquisite beauty, brilliance, and enduring symbolism. Formed deep within
the Earth's mantle under extreme heat and pressure, diamonds emerge as rare and
remarkable treasures. With a long history of intrigue, cultural significance, and
industrial utility, diamonds hold a special place in both the realms of luxury and
practicality.
Types of diamonds:
1. Technical: - These four types of diamonds are the main technical classifications.
Type Ia diamonds: because nitrogen gathers in clusters in these stones, they have a
yellowish tinge.
Type IIa diamonds: these diamonds have no nitrogen impurities and differing
fluorescent properties.
Type Ib diamonds: these are also quite rare, and their main feature is that individual
nitrogen atoms are scattered throughout the stone (rather than in clusters).
Type IIb diamonds: another rare type of diamond with no nitrogen atoms. They do,
however, contain boron in addition to their main carbon content.
2. By casual shoppers:- Most people in the market for diamonds will classify them
according to the following basic names (or variations of them):
Flawless (FL) - no inclusions or blemishes are visible to a skilled grader using 10×
magnification.
Internally Flawless (IF) - no inclusions and only blemishes are visible to a skilled
grader using 10× magnification.
Very, Very Slightly Included (VVS1 and VVS2) - inclusions are difficult for a skilled
grader to see under 10× magnification.
Very Slightly Included (VS1 and VS2) - inclusions are minor and range from difficult
to somewhat easy for a skilled grader to see under 10x magnification.
Slightly Included (SI1 and SI2) - inclusions are noticeable to a skilled grader under
10x magnification.
Included (I1, I2, and I3) - inclusions are obvious under 10× magnification and may
affect transparency and brilliance.
4. Carat (weight of a diamond):- It is equal to 200 mg or 0.00643 troy oz and, with all else
being equal, the larger the size of a diamond, the greater its value. The bulk of diamonds on
the market are less than one carat, which means that we must talk in terms of points (100
sub-divisions of the carat). So, treat the carat as an indication purely of the weight/size of
the diamond rather than any indication of value.
5. Cut of a diamond: - The last of the four Cs is the “cut” of a diamond, which refers to
how well the stone has been shaped and faceted. Cut quality is commonly graded as:
Excellent (EX)
Very Good (VG)
Good (G)
Fair (F)
Poor (P)
Specifications of a Diamond
Carat Weight: This is a measure of a diamond's size and weight, with one carat equal
to 200 milligrams. The carat weight directly affects the diamond's size, and larger
diamonds are generally rarer and more valuable.
Cut: The cut refers to the way a diamond's facets are arranged and shaped. It's one
of the most crucial factors affecting a diamond's brilliance and sparkle. The quality of
the cut determines how effectively light is reflected within the diamond.
Cut Grade: This evaluates the overall quality of the diamond's cut, including
proportions, symmetry, and polish. Common cut grades include Excellent, Very
Good, Good, Fair, and Poor.
Colour: Diamonds are graded on a colour scale that ranges from D (colourless) to Z
(light yellow or brown). The less colour a diamond has, the higher its value.
Colourless diamonds allow more light to pass through, resulting in greater brilliance.
Clarity: Clarity refers to the presence of internal and external imperfections, known
as inclusions and blemishes, respectively. The clarity grade ranges from Flawless (no
imperfections visible under 10x magnification) to Included (imperfections visible to
the naked eye).
Clarity Grading: Clarity grades include categories like Flawless (FL), Internally
Flawless (IF), Very, Very Slightly Included (VVS1, VVS2), Very Slightly Included (VS1,
VS2), Slightly Included (SI1, SI2), and Included (I1, I2, I3).
Shape: The shape of a diamond refers to its overall outline or form. Common shapes
include Round Brilliant, Princess, Emerald, Asscher, Marquise, Oval, Pear, Heart, and
Cushion.
Certification: Diamonds are often certified by reputable gemological laboratories,
such as the Gemological Institute of America (GIA) or the International Gemological
Institute (IGI). Certification provides an objective assessment of a diamond's quality,
including its Four Cs.
Fluorescence: Some diamonds exhibit fluorescence when exposed to ultraviolet
light. This can affect the diamond's appearance under certain lighting conditions.
Symmetry and Polish: These factors assess the precision of a diamond's facets and
the smoothness of its surface. Excellent symmetry and polish contribute to a
diamond's overall beauty.
Depth and Table Percentage: These measurements relate to the proportions of the
diamond. The depth is the height of the diamond measured from the culet to the
table, and the table percentage is the width of the table facet relative to the
diameter of the diamond.
1.6 SIGNIFICANCE OF PROJECT:
In the case of predicting car mileage, the project's importance emerges from its
potential to guide individuals towards more informed and economical choices when
purchasing vehicles. By accurately estimating a car's fuel efficiency based on its
attributes, the project contributes to environmentally conscious decisions and cost
savings. In a world increasingly focused on sustainability and economic prudence, such
predictions hold the promise of making a positive impact on both individuals and the
environment.
Similarly, the diamond price prediction project carries significance in a realm where
clarity and transparency are paramount. By uncovering the intricate relationships
between diamond characteristics and their market prices, the project empowers
consumers, jewellers, and investors to make more educated decisions. This newfound
understanding of diamond pricing can lead to fairer transactions and enhanced trust
within the industry. Given the emotional and financial value associated with
diamonds, the project's potential to foster well-informed choices is both practical and
meaningful.
In the diamond price prediction project, a range of beneficiaries emerges from the
complex diamond market ecosystem. Consumers and potential buyers gain the
advantage of a fairer understanding of diamond pricing, enabling them to make well-
informed decisions when purchasing precious gemstones. Jewellers benefit from
improved transparency, facilitating better communication with customers and
fostering trust. Investors, on the other hand, can utilize the prediction system to gain
insights into pricing trends, enhancing their decision-making in a dynamic market.
CHAPTER: 2
2.1 LITERATURE REVIEW
Prediction of Car Mileage:
The project's literature review for car mileage prediction using Python entails a
thorough analysis of the literature, studies, and resources already available on the
topic of car mileage prediction. This review aims to offer a thorough understanding of
the approaches, methods, and conclusions that have been previously investigated in
works that are similar to this one.
To reduce fuel consumption and address environmental issues, many studies have
looked into predicting car mileage. Regression analysis, machine learning algorithms,
and data mining techniques are among the methods frequently used by researchers.
One popular technique is linear regression, where predictions are made by modelling
the relationship between car characteristics (such as engine specifications, weight,
and horsepower) and fuel efficiency (measured in miles per gallon, or mpg). Linear
regression is favoured for its simplicity and its ability to uncover linear relationships
between variables.
Furthermore, more sophisticated methods for forecasting car mileage have gained
popularity. These methods include decision trees, support vector machines, and
neural networks. These methods enable the capture of non-linear relationships and
interactions between attributes that linear models might find difficult to fully
represent. It is possible for machine learning algorithms to glean intricate patterns
from large datasets, improving prediction accuracy.
The evaluation of prediction models takes up a sizable portion of the literature review.
The accuracy of predictions is frequently evaluated using metrics like mean squared
error (MSE), root mean square error (RMSE), and mean absolute error (MAE). These
metrics assess how well the models generalize to new data by measuring the
discrepancy between predicted and actual mileage values.
Accurate diamond price forecasting has implications for economic analysis and
investment decisions outside of the jewellery sector. In order to predict the market
value of diamonds, researchers have used a variety of machine learning algorithms
and statistical models to extract important insights from diamond attributes.
Despite the improvements, problems still exist. Subjective factors that are difficult to
quantify and take into account, such as market sentiment and cultural trends, can
have an impact on diamond prices. Additionally, current data is required to maintain
prediction accuracy due to the diamond market's constant evolution.
The importance of data preparation and feature engineering is front and centre in
the predictive modelling ecosystem. The cornerstone of ensuring the integrity of
input data is meticulous data cleaning, thoughtful handling of missing values, and
skilful conversion of categorical variables into numeric representations. The accuracy
of our predictive models is increased by feature selection and extraction, which work
like skilled sculptors to chisel away at features to uncover those that have the
greatest impact on fuel efficiency.
Prediction Of Diamond Price
An extensive analysis of earlier research, studies, and endeavours that have delved
into the challenging field of estimating the market value of diamonds is required for
the related work within our project focused on using Python to predict diamond
prices. As it gives us useful information, methodologies, and lessons from previous
endeavours, this investigation is essential for determining the course of our project.
Due to the allure and economic importance of these gemstones, researchers have
become engrossed in the task of predicting diamond prices. To understand the
complex connections between diamond attributes and their corresponding market
values, a number of strategies and techniques have been investigated. A crucial
method for figuring out how characteristics like carat weight, cut quality, colour, and
clarity affect a diamond's value is regression analysis. In particular, linear regression
has become a trustworthy method for identifying these relationships and
comprehending how they affect a diamond's market value.
CHAPTER: 3
3.1 Prediction of Car Mileage
…and so on; the data continues up to 397 rows.
The predicted value for the given user input is 26.2 mpg.
Prediction of Diamond Price
The predicted value of the diamond price is $112.23, depending upon the input from the user.
CHAPTER: 4
4.1 TECHNOLOGY USED
Python was the major technology used for implementing the machine learning concepts, because Python provides numerous built-in methods in the form of packaged libraries. The following are the prominent libraries/tools we used in our project.
NumPy
NumPy (Numerical Python) is a well-known open-source library that supports large, multi-dimensional arrays and matrices and offers a range of mathematical operations on them. It is a foundational Python package for scientific computing and is widely used in a variety of disciplines, including image processing, machine learning, and data analysis.
Key features of NumPy:
a. Multi-dimensional Arrays
b. Element-wise Operations
c. Broadcasting
d. Indexing and Slicing
e. Mathematical Functions
f. Random Number Generation
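A minimal sketch of several of these features in action (the arrays below are made up purely for illustration):

```python
import numpy as np

# Multi-dimensional array with element-wise operations
a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = a * 2 + 1                      # element-wise arithmetic

# Broadcasting: a row vector is stretched across each row of `a`
row = np.array([10.0, 20.0])
c = a + row

# Indexing/slicing and a mathematical reduction
first_col = a[:, 0]
mean_all = a.mean()

print(b)
print(c)
print(first_col, mean_all)
```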
Pandas
pandas is a free, open-source Python library for handling and analysing data. It offers data structures and functions designed to make working with structured data simple and intuitive, including data found in CSV files, Excel spreadsheets, SQL databases, and other formats. Built on top of the NumPy library, pandas is a foundational tool for data scientists, analysts, and researchers working with tabular data.
Key features of pandas:
a. Data Structure
b. Data Alignment
c. Indexing and Selection
d. Data Cleaning and Transformation
e. Grouping and Aggregation
f. Time Series Handling
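A short example of selection, grouping, and aggregation, using a few invented rows in the spirit of our auto-mpg data:

```python
import pandas as pd

# Small tabular dataset, like a slice of the auto-mpg file (values invented)
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 36.0, 26.0],
    "cylinders": [8, 8, 4, 4],
    "origin": ["usa", "usa", "japan", "europe"],
})

# Selection via boolean indexing
efficient = df[df["mpg"] > 25]

# Grouping and aggregation
mean_mpg_by_cyl = df.groupby("cylinders")["mpg"].mean()

print(efficient)
print(mean_mpg_by_cyl)
```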
Matplotlib
Matplotlib is a popular 2D plotting library for Python with which you can make a huge selection of static, interactive, and animated visualizations. It is especially helpful for creating graphs, charts, plots, and other visual representations of data. Matplotlib is fully customizable and combines easily with other libraries like NumPy and pandas.
The Matplotlib library contains a module called matplotlib.pyplot that offers a high-
level interface for making different kinds of plots and visualizations. By abstracting
away some of the lower-level details, it makes the process of creating plots simpler,
facilitating the quick and effective generation of plots.
Key features of Matplotlib:
a. Flexible Plotting
b. Customization
c. Subplots and Layouts
d. Export and Integration
e. Mathematical Expressions
f. 3D Visualization
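A small sketch of the matplotlib.pyplot-style workflow described above. The weight/mpg values are invented, and the Agg backend is selected so the script runs without a display:

```python
import matplotlib
matplotlib.use("Agg")              # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

# Invented illustrative data: car weight vs. fuel efficiency
weights = [2100, 2600, 3200, 3900, 4400]
mpg = [34, 29, 24, 18, 14]

fig, ax = plt.subplots()
ax.scatter(weights, mpg)
ax.set_xlabel("weight (lbs)")
ax.set_ylabel("mpg")
ax.set_title("Heavier cars tend to have lower mileage")
fig.savefig("weight_vs_mpg.png")   # export the figure to an image file
```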
Seaborn
Built on top of Matplotlib, Seaborn is a statistical data visualization library for Python. It offers a higher-level interface for designing visually appealing and informative statistical graphics. Seaborn is particularly helpful for visualizing statistical relationships in your data and for creating complex visualizations with little code.
Key features of Seaborn:
a. Statistical Visualization
b. Higher-Level Abstraction
c. Built-in Theme and Color Palettes
d. Facet Grids
e. Categorical and Statistical Plot
f. Time Series Visualization
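A brief illustration of Seaborn's one-call categorical plotting, on a made-up cut/price table (not our project data):

```python
import matplotlib
matplotlib.use("Agg")              # render without a display
import pandas as pd
import seaborn as sns

# Invented rows: diamond cut grade and price
df = pd.DataFrame({
    "cut": ["Ideal", "Fair", "Ideal", "Good", "Fair", "Good"],
    "price": [5000, 1200, 4700, 2500, 1400, 2300],
})

# A categorical statistical plot in a single call
ax = sns.boxplot(data=df, x="cut", y="price")
ax.figure.savefig("price_by_cut.png")
```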
Scikit-Learn
Scikit-learn is a machine learning library for Python that offers a variety of tools for carrying out machine learning tasks. It is intended to be straightforward and user-friendly while providing a full range of functionality for both inexperienced and seasoned data scientists.
Scikit-learn's capabilities:
a. Supervised Learning: Scikit-learn offers a range of algorithms for supervised
learning tasks, where the goal is to learn a mapping from input data to a target
variable. This includes classification (assigning labels to data) and regression
(predicting continuous values).
b. Unsupervised Learning: Scikit-learn provides methods for unsupervised
learning tasks, such as clustering (grouping similar data points) and
dimensionality reduction (reducing the number of features while retaining
important information).
c. Model Selection and Evaluation: The library includes tools for evaluating the
performance of machine learning models, such as cross-validation,
hyperparameter tuning, and various evaluation metrics like accuracy,
precision, recall, F1-score, etc.
d. Data Preprocessing: Scikit-learn supports data preprocessing techniques like
feature scaling, normalization, handling missing values, and feature extraction.
e. Ensemble Methods: It offers ensemble methods like random forests, which
combine multiple models to improve predictive performance.
f. Model Persistence: Scikit-learn allows you to save trained models to disk and
reload them later, making it easy to deploy models in production
environments.
g. Easy API: Scikit-learn follows a consistent API design, making it relatively
straightforward to switch between different algorithms and experiment with
various models.
h. Community and Documentation: Scikit-learn has a large and active
community, providing extensive documentation, tutorials, and examples for
various use cases.
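The supervised-learning, data-splitting, and evaluation capabilities above can be sketched in a few lines. The data here is synthetic (make_regression), standing in for our project datasets:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for a real dataset
X, y = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit a supervised model and evaluate it on held-out data
model = LinearRegression().fit(X_train, y_train)
score = r2_score(y_test, model.predict(X_test))
print(f"R^2 on held-out data: {score:.3f}")
```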
4.2 METHODOLOGY OF THE PROJECT
Data Collection and Pre-processing:
Data collection and pre-processing are essential steps in the data analysis and machine learning pipeline. These steps entail gathering raw data and putting it into a format that can be used for analysis or modelling. Here is a Python-oriented explanation of these steps.
Data Collection: Data collection involves obtaining raw data from various sources,
such as databases, APIs, web scraping, or manual entry. The collected data can be in
different formats like CSV, JSON, XML, or databases like SQL.
Data Pre-processing: Data pre-processing is the process of preparing collected data
for analysis or modelling by cleaning, transforming, and organizing it. Dealing with
missing values, outliers, and irrelevant data is made easier by this step.
a. Dropping of Unnecessary Columns
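As a minimal sketch (the rows below are invented), dropping a column that does not help a numeric model might look like:

```python
import pandas as pd

# Invented rows in the shape of the auto-mpg data
df = pd.DataFrame({
    "car name": ["ford pinto", "toyota corolla"],
    "mpg": [26.0, 31.0],
    "weight": [2565, 2228],
})

# The free-text car name does not help a numeric regression model
df = df.drop(columns=["car name"])

print(list(df.columns))
```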
Data Imputation
Data imputation is a method for replacing null or dirty values with the mean, median, or mode, with no major change to the overall distribution (EDA) of the dataset.
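A small example of median imputation on a made-up horsepower column (mean or mode imputation works the same way with `.mean()` or `.mode()`):

```python
import numpy as np
import pandas as pd

# Invented column with two missing entries
df = pd.DataFrame({"horsepower": [130.0, np.nan, 150.0, np.nan, 90.0]})

# Replace missing values with the column median
median_hp = df["horsepower"].median()
df["horsepower"] = df["horsepower"].fillna(median_hp)

print(df["horsepower"].tolist())
```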
Feature Engineering
A critical step in the machine learning and data analysis processes is feature
engineering. It entails using raw data to build new features or alter existing ones in
order to improve a machine learning model's performance. In order to improve the
representation of the data, reduce noise, and extract pertinent information, feature
engineering aims to increase the accuracy and generalizability of models.
Here are some key EDA steps and techniques used in the project:
a. Data Inspection:
b. Handling of Missing Data
c. Descriptive Statistics:
Data Visualisation
Data visualisation is the process of presenting data in graphical or visual formats in order to communicate insights, patterns, trends, and relationships that may not be as clear from raw data alone. Data visualization lets you communicate complex information effectively and helps you find insights that can be put to use.
Types of Visualisation:
a. Histogram: A histogram is a graphical representation of the distribution of a
dataset. It's commonly used to display the frequency or count of data points
within certain ranges, also known as "bins" or "intervals."
b. Boxplot: A box plot, also known as a box-and-whisker plot, is a graphical
representation of the distribution of a dataset. It provides a summary of the
key features of the data's distribution, such as the median, quartiles, and
potential outliers.
c. Scatter Plots: A scatter plot shows distinct data points as dots on a two-dimensional plane. It is especially helpful for analyzing the relationship between two variables and for finding outliers or patterns in the data. There are two types of scatter plot, 2D and 3D, for two variables and three variables respectively.
d. Count Plot: A count plot is a graphic display that shows the frequency or count of categorical data in a dataset. It is especially beneficial for visualizing the distribution of categorical variables and for comparing the frequency of various categories.
e. Pair Plot: A pair plot, also known as a scatterplot matrix, is a graphical
representation that displays pairwise relationships between multiple variables
in a dataset.
Data Encoding:
Data encoding is the process of transforming textual or categorical information into a
numerical format that can be quickly processed and used as input for data analysis or
machine learning algorithms. Data encoding aids in the conversion of non-numerical
data into a format that is appropriate for machine learning algorithms, which
frequently require numerical input.
For example: converting the categorical colour column into numerical data.
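A sketch of that colour conversion. The ordinal mapping below (D best, down to J worst) is one reasonable encoding choice, not necessarily the exact one used in the project:

```python
import pandas as pd

# Invented colour grades for a handful of diamonds
df = pd.DataFrame({"color": ["E", "J", "D", "G", "E"]})

# Ordinal mapping: D (best) down to J (worst) becomes 6 .. 0
order = {"D": 6, "E": 5, "F": 4, "G": 3, "H": 2, "I": 1, "J": 0}
df["color_code"] = df["color"].map(order)

print(df)
```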
Dealing with Outliers:
Outliers are data points in a dataset that differ significantly from the rest of the data.
They could be unusually high or low values that deviate from the data's typical pattern.
Measurement errors, incorrect data entry, or genuinely uncommon events are just a
few of the causes of outliers.
Removal of Outliers
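One common approach, sketched here on invented prices, is the 1.5 × IQR rule implied by the boxplot:

```python
import pandas as pd

# Invented prices; the last value is an obvious outlier
df = pd.DataFrame({"price": [326, 550, 700, 950, 1200, 18000]})

# Flag values outside 1.5 * IQR beyond the quartiles as outliers
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
clean = df[(df["price"] >= lower) & (df["price"] <= upper)]

print(clean["price"].tolist())
```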
Data Splitting:
Data splitting is the process of dividing a dataset into dependent and independent variables and into training and testing subsets, for training machine learning models, validating their performance, and testing their generalization ability.
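An illustrative 80/20 split with scikit-learn's train_test_split; X and y here are placeholder arrays, not our project data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix X (10 rows, 2 features) and target y
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 80% of rows for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```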
Model Selection:
Model selection is the process of deciding which machine learning algorithm or model
is best for a given problem. It entails assessing and contrasting various models to
ascertain which one performs the best in terms of accuracy, predictive power,
generalization, and other pertinent metrics.
Model Evaluation:
Model evaluation is a crucial step in the machine learning process that determines how well a trained model performs on fresh, unseen data. It involves assessing the model's accuracy, generalizability, and room for improvement. Effective model evaluation guides decisions about fine-tuning, selecting, or comparing models by helping you understand how well your model is likely to perform in real-world scenarios.
Checking of Accuracy Score
Final Output:
CHAPTER: 5
5.1 IMPLEMENTATION:
Prediction of Car Mileage:
Step 1. Importing the libraries and creating a pandas DataFrame. Before performing operations on the datasets, we must first import the required libraries.
a. Pandas:
Purpose: Pandas is a powerful data manipulation and analysis library.
It's used for loading, cleaning, pre-processing, and exploring datasets.
Explanation: You can use Pandas to read data from CSV files, handle
missing values, encode categorical variables, and perform various data
transformations.
b. NumPy:
Purpose: NumPy is a fundamental package for numerical computations
in Python. It provides support for arrays, matrices, and mathematical
functions.
Explanation: NumPy arrays are used to store and manipulate numerical data efficiently. Many machine learning libraries, including scikit-learn, work with NumPy arrays.
c. Scikit-Learn:
Purpose: scikit-learn is a machine learning library that provides a wide
range of tools for various machine learning tasks, including regression.
Explanation: You can use scikit-learn to choose regression models, pre-
process data, split datasets, train models, and evaluate their
performance.
d. Matplotlib And Seaborn:
Purpose: These libraries are used for data visualization.
Explanation: Matplotlib and Seaborn allow you to create various plots
and charts to visualize your data, model results, and relationships
between features and target variables.
Step 2. Data pre-processing and visualization. First, we check for any null, missing, incomplete, or inappropriate values using the following code.
CONTENT:
Title: Auto-Mpg Data
Source: This dataset was taken from the StatLib library which is maintained at
Carnegie Mellon University. The dataset was used in the 1983 American
Statistical Association Exposition
Relevant information: The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivariate discrete and 5 continuous attributes. The number of instances is 398 and the number of attributes is 9, including the class attribute. The attribute information is as follows:
i. mpg - Mileage/Miles Per Gallon
ii. cylinders - the power unit of the car where gasoline is turned into
power
iii. displacement - engine displacement of the car
iv. horsepower - rate of the engine performance
v. weight - the weight of a car
vi. acceleration - the acceleration of a car
vii. model year - the model year of the car
viii. origin - the origin of the car
ix. car- the name of the car
o Clean corrupt data
1. The first command will tell you whether there is any missing value in the numerical data, but not in string data, since string columns can contain blank values that the command does not capture.
2. The second command will tell you whether the datatype of every feature is as per our
expectation i.e., we expect displacement, horsepower, mpg etc. to be numerical
datatype (float/int).
3. df.info() will help check whether each data type is exactly what we are expecting.
We see that the horsepower column is perceived as object data type by Pandas, whereas we should be expecting a floating-point value. It means there is a string somewhere. Our goal now is to find the string value(s) and decide what to do with the corrupt data.
Generally, the steps are to check for null, missing, incomplete, or inappropriate values and subsequently clean the data by converting columns to appropriate data types, filling missing values, normalizing, etc.
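A sketch of that cleanup: in the auto-mpg file the corrupt horsepower entries are "?" placeholders, so coercing the column to numeric and then imputing the resulting NaNs works (the four rows below are invented):

```python
import pandas as pd

# Toy frame mimicking the issue: horsepower is read as strings
df = pd.DataFrame({"horsepower": ["130", "?", "150", "90"]})
print(df["horsepower"].dtype)      # object, not float

# Coerce to numeric: non-numeric strings such as "?" become NaN
df["horsepower"] = pd.to_numeric(df["horsepower"], errors="coerce")

# Fill the NaNs with the column median
df["horsepower"] = df["horsepower"].fillna(df["horsepower"].median())

print(df["horsepower"].tolist())
```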
o Check and remove outliers using Boxplot
1. To see if the data has any outliers, we now plot a boxplot of every independent variable or feature (x1, x2, x3, etc.) except the car name, using the df.boxplot() function.
2. Define a list of continuous variables and call df.boxplot() on only those columns to detect outliers.
Scaling and Normalization: Features with different scales can affect the performance of certain regression algorithms. Scaling and normalization help bring features to a similar scale.
Feature Selection: Identify which features are relevant for predicting the target variable (mileage). You can use techniques like correlation analysis, feature importance from tree-based models, or domain knowledge to decide which features to keep.
Train-Test Split: Split your dataset into training and testing sets. The training set is used to train the model, while the testing set is used to evaluate its performance.
Feature Selection: Choosing the most pertinent and informative characteristics from your dataset to train the model is an important stage in creating a predictive model. When forecasting car mileage using regression, it is important to select characteristics that have a strong association with the target variable (mileage) and can help the model produce accurate predictions.
1. Domain Knowledge: Start by considering your domain knowledge. Think about which features are intuitively expected to affect the mileage of a car. For instance, cylinders, displacement, horsepower, weight, acceleration, model year, and origin are commonly known factors that influence car mileage.
2. Correlation Analysis: Calculate the correlation between each numerical feature and
the target variable (mileage). Features with higher absolute correlation values are
generally more important. You can use tools like the Pandas library to calculate
correlations and visualize them using plots like heatmaps.
3. Visualization: Visualize the relationships between individual features and the target
variable. You can use scatter plots, box plots, or other relevant plots to understand
how changes in each feature correspond to changes in car mileage.
4. Feature Importance from Models: Train a preliminary model (e.g., a simple linear
regression or decision tree) and use the feature importance scores provided by the
model. Some models, like Random Forests, can quantify the importance of each
feature in predicting the target variable.
5. Recursive Feature Elimination: This method involves repeatedly training a model and
removing the least important feature in each iteration until a desired number of
features is reached. Scikit-learn provides the RFECV class that can help with this
process.
6. Lasso Regression (L1 Regularization): Lasso regression applies L1 regularization, which encourages the model to set coefficients of less important features to zero, effectively performing feature selection during model training.
7. Variance Thresholding: If you have features with very low variance, they might not contribute much to the prediction. You can use scikit-learn's VarianceThreshold to remove features with variance below a certain threshold.
8. Feature Engineering: Sometimes, combining or transforming features can create more informative variables. For example, you could create a power-to-weight ratio feature from horsepower and weight, or combine cylinders, displacement, and model year into composite indicators of efficiency.
9. SelectKBest and SelectPercentile: These techniques use statistical tests to select the top K features, or a certain percentage of features, that have the strongest correlation with the target variable.
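A small sketch of SelectKBest on synthetic data (make_regression stands in for our dataset; the choice of k=3 is arbitrary):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: 6 candidate features, of which 3 actually drive the target
X, y = make_regression(n_samples=100, n_features=6, n_informative=3,
                       random_state=0)

# Keep the 3 features with the strongest univariate relationship to y
selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)            # 3 columns remain
print(selector.get_support())      # boolean mask of kept features
```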
Step 4. Splitting Data
By splitting the dataset into two subsets, you can train your model on one subset and then evaluate how well it performs on the other, unseen subset. The end goal of any predictive model is to perform well on fresh, unforeseen data.
In the case of predicting car mileage, you have a dataset containing attributes (such as cylinders, displacement, horsepower, weight, acceleration, model year, and origin) and the associated mileage. Based on these characteristics, we wish to create a model that can forecast car mileage.
1. Load and Prepare Data: Load your dataset, pre-process it by handling missing values,
encoding categorical features, and scaling/normalizing numerical features.
2. Separate Features and Target: Split your dataset into two parts: features (X) and
target (y). Features are the input variables that you will use to predict the target
variable (car mileage).
3. Split Data: Use a function or library to divide your dataset into training and testing
subsets. The common practice is to allocate a larger portion of the data for training
(e.g., 80%) and a smaller portion for testing (e.g., 20%). This proportion can vary based
on the size of your dataset.
4. Training and Testing: You use the X_train and y_train subsets to train your regression
model. This allows the model to learn the relationships between features and target
from the training data.
5. Evaluation: After training, you use the trained model to predict car mileage using the
testing features (X_test). Then, you compare this predicted mileage with the actual
mileage (y_test) to evaluate the model's performance. Common evaluation metrics
for regression include Mean Squared Error (MSE), Root Mean Squared Error (RMSE),
and R-squared
Let's now walk through how to use Python to create a regression model for forecasting car mileage. As the chosen regression model for this discussion, we'll concentrate on linear regression.
To instantiate the model, use the scikit-learn package. Train the selected regression model using the training data; this allows the model to learn the patterns and relationships between the features and the target mileage. In the case of linear regression, scikit-learn's LinearRegression class can be used to create and train the model.
Step 7: Model Evaluation: Use the testing dataset to evaluate the performance of your
trained model. Common metrics for regression tasks include Mean Squared Error (MSE), Root
Mean Squared Error (RMSE), and R-squared.
1. Mean Squared Error (MSE):
Purpose: MSE measures the average squared difference between the predicted and
actual values. It provides insight into the overall magnitude of errors in your
predictions.
Explanation: MSE computes the squared difference for each prediction and then
averages these squared differences. It penalizes larger errors more heavily.
Formula: MSE = (1/n) Σ (yᵢ - ŷᵢ) ², where yᵢ is the actual value and ŷᵢ is the predicted
value for the i-th observation, and n is the number of observations.
Interpretation: A lower MSE indicates better model performance. It's sensitive to
outliers, as larger errors contribute more significantly to the total.
2. Root Mean Squared Error (RMSE):
Purpose: RMSE is the square root of the MSE. It provides a metric in the same unit as the target variable, making it more interpretable.
Explanation: RMSE is a popular metric for understanding the average size of errors.
It's more sensitive to larger errors due to the squaring and then square root
operations.
Formula: RMSE = √MSE
Interpretation: Similar to MSE, lower RMSE values indicate better performance. It's
easy to compare against the range of your target variable.
3. R-squared (R²):
Purpose: R² measures the proportion of the variance in the target variable that is predictable from the input features. It gives an idea of how well the model explains the variability in the data.
Formula: R² = 1 - (Σ (yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)²), where ȳ is the mean of the actual values.
We assess the effectiveness of the regression method using these measures in the context of the task at hand: estimating the mileage of cars. Higher R² values and lower MSE and RMSE values indicate a more effective model.
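These metrics can be computed directly with scikit-learn; the true/predicted values below are invented so the arithmetic is easy to check by hand:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Invented actual vs. predicted mileage-like values
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

mse = mean_squared_error(y_true, y_pred)   # average squared error
rmse = np.sqrt(mse)                        # same units as the target
r2 = r2_score(y_true, y_pred)              # fraction of variance explained

print(mse, rmse, r2)
```

Each residual is 10, so MSE = 100, RMSE = 10, and R² = 1 − 400/50000 = 0.992.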
Step 8. Prediction: After having developed and assessed the model, we can apply it to generate predictions on fresh, previously unseen data. Give the trained model the feature values of a car, and it will forecast the car's mileage.
The next stage, after the model has been trained and its Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) calculated to assess its performance, is to utilize the regression model to make predictions on new data (new_mileage).
1. new_mileage is a list containing the feature values for the new car whose mileage you want to predict. This list should have the same format and order as the features you used for training your model.
2. model.predict() is used to make predictions based on the new features. The method
takes a 2D array-like input, where each row represents a set of features. In this case,
we're providing a single set of features, so we wrap it in a list.
3. predicted_mileage will store the predicted mileage for the new car.
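A minimal sketch of this prediction step; the three training rows and the new car's features are invented, so the fitted coefficients are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny illustrative fit: mpg from (weight, horsepower) on made-up rows
X_train = np.array([[2200.0, 70.0], [3000.0, 100.0], [4300.0, 150.0]])
y_train = np.array([33.0, 24.0, 14.0])
model = LinearRegression().fit(X_train, y_train)

# Features of the new, unseen car, in the same column order as training.
# Note the 2D shape: predict() expects one row per sample.
new_car = [[2800.0, 90.0]]
predicted_mileage = model.predict(new_car)
print(round(float(predicted_mileage[0]), 1))
```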
Prediction of Diamond Price:
This chapter describes the process of implementing the system. The implementation was divided into five parts titled Data Set, Data Cleaning and Normalization, Machine Learning Algorithms, Measurements, and Inference. Each of these parts is explained in its own section of this chapter and shown in diagrams. The high-level component of the diagram without a dedicated section of this chapter, Simulated Aging, is detailed in the Measurements section. The entire implementation was written in Python 3 in the Jupyter Notebook IDE. The libraries utilized are pandas, sklearn (scikit-learn), NumPy, re (regular expressions), matplotlib, and seaborn.
Dataset: The dataset was sourced from Kaggle and includes the prices and other attributes of 26,967 diamonds. There are 10 attributes in the dataset, including the target, i.e., price.
Feature description:
price: price in US dollars ($326--$18,823). This is the target column.
The 4 Cs of Diamonds:
carat (0.2--5.01): The carat is the diamond's physical weight measured in metric carats. One carat equals 1/5 gram and is subdivided into 100 points. Carat weight is the most objective grade of the 4Cs.
cut (Fair, Good, Very Good, Premium, Ideal): In determining the quality of the cut, the diamond grader evaluates the cutter's skill in the fashioning of the diamond. The more precisely the diamond is cut, the more captivating it is to the eye.
color, from J (worst) to D (best): The colour of gem-quality diamonds occurs in many hues, in the range from colourless to light yellow or light brown. Colourless diamonds are the rarest. Other natural colours (blue, red, and pink, for example) are known as "fancy," and their colour grading is different from that of white colourless diamonds.
clarity (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)): Diamonds can have internal characteristics known as inclusions or external characteristics known as blemishes. Diamonds without inclusions or blemishes are rare; however, most characteristics can only be seen with magnification.
Dimensions:
x: length in mm (0--10.74)
y: width in mm (0--58.9)
z: depth in mm (0--31.8)
table: width of the top of the diamond relative to the widest point (43--95)
A diamond's table refers to the flat facet of the diamond seen when the stone is face up. The
main purpose of a diamond table is to refract entering light rays and allow reflected light rays
from within the diamond to meet the observer’s eye. The ideal table cut diamond will give
the diamond stunning fire and brilliance.
Data pre-processing is a critical step in building regression models for predicting diamond prices. It entails preparing the raw data for machine learning model training and evaluation by cleaning, converting, and organizing it. Here is a description of the pre-processing steps you might use in Python when predicting the price of diamonds.
2. Handling Missing Values: Deal with missing values by dropping the affected rows, filling missing values with the mean or median, or using more advanced techniques like interpolation.
3. Scaling and Normalization: Features with different scales can affect the performance
of certain regression algorithms. Scaling and normalization help bring features to a
similar scale
4. Feature Selection: Identify which features are relevant for predicting the target
variable (price). You can use techniques like correlation analysis, feature importance
from tree-based models, or domain knowledge to decide which features to keep.
5. Train-Test Split: Split your dataset into training and testing sets. The training set is used to train the model, while the testing set is used to evaluate its performance.
Feature selection is a critical step in building a predictive model, as it involves choosing the most relevant and informative features from your dataset to train the model. In the context of predicting diamond prices using regression, you want to select features that have a strong relationship with the target variable (price) and can help the model make accurate predictions. Here is how you can approach feature selection for predicting diamond prices using Python:
1. Domain Knowledge: Start by considering your domain knowledge. Think about which
features are intuitively expected to affect the price of a diamond. For instance, carat
weight, cut quality, color, and clarity are commonly known factors that influence
diamond prices.
2. Correlation Analysis: Calculate the correlation between each numerical feature and
the target variable (price). Features with higher absolute correlation values are
generally more important. You can use tools like the Pandas library to calculate
correlations and visualize them using plots like heatmaps.
3. Visualization: Visualize the relationships between individual features and the target
variable. You can use scatter plots, box plots, or other relevant plots to understand
how changes in each feature correspond to changes in diamond prices.
4. Feature Importance from Models: Train a preliminary model (e.g., a simple linear
regression or decision tree) and use the feature importance scores provided by the
model. Some models, like Random Forests, can quantify the importance of each
feature in predicting the target variable.
5. Recursive Feature Elimination: This method involves repeatedly training a model and
removing the least important feature in each iteration until a desired number of
features is reached. Scikit-learn provides the RFE and RFECV classes that can help
with this process.
6. Lasso Regression (L1 Regularization): Lasso regression applies L1 regularization,
which encourages the model to set coefficients of less important features to zero. This
effectively performs feature selection during the model training process.
7. Variance Thresholding: If you have features with very low variance, they might not
contribute much to the prediction. You can use scikit-learn's VarianceThreshold to
remove features with variance below a certain threshold.
8. Feature Engineering: Sometimes, combining or transforming features can create
more informative variables. For example, you could create a feature that represents
the product of carat weight and cut quality score.
9. SelectKBest and SelectPercentile: These techniques use statistical tests to select the
top K features or a certain percentage of features that have the strongest correlation
with the target variable.
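The correlation-based selection described in point 2 can be sketched in a few lines. Again the data is a toy example, and the 0.5 threshold is an arbitrary illustrative cut-off, not a recommended value.

```python
import pandas as pd

# Toy numeric features and target; values are illustrative only.
df = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 1.0, 1.2],
    "depth": [61.0, 62.5, 60.8, 61.9, 62.0],
    "price": [500, 900, 1600, 3000, 4100],
})

# Absolute correlation of each feature with the target, sorted descending.
corr = df.corr()["price"].drop("price").abs().sort_values(ascending=False)

# Keep features whose |correlation| with price exceeds a chosen threshold.
selected = corr[corr > 0.5].index.tolist()
```

In this toy frame, carat correlates strongly with price while depth does not, so only carat survives the threshold; a heatmap of `df.corr()` (e.g., with Seaborn) gives the same information visually.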
Splitting the dataset into two subsets lets you train your model on one subset and then
evaluate how well it performs on data it has never seen. The ultimate aim of any predictive
model is to perform well on fresh, unseen data.
In the case of predicting diamond prices, you have a dataset containing attributes (such as
carat, cut, color, clarity, etc.) and the corresponding prices. Based on these attributes, the
goal is to build a model that can forecast diamond prices.
1. Load and Prepare Data: Load your dataset, pre-process it by handling missing values,
encoding categorical features, and scaling/normalizing numerical features.
2. Separate Features and Target: Split your dataset into two parts: features (X) and
target (y). Features are the input variables that you will use to predict the target
variable (diamond price).
3. Split Data: Use a function or library to divide your dataset into training and testing
subsets. The common practice is to allocate a larger portion of the data for training
(e.g., 80%) and a smaller portion for testing (e.g., 20%). This proportion can vary based
on the size of your dataset.
In Python, you can use the train_test_split function from the sklearn.model_selection
module to achieve this.
· X_train and y_train: These are the subsets of features and target respectively
that will be used for training the model.
· X_test and y_test: These are the subsets of features and target respectively
that will be used for testing and evaluating the model.
4. Training and Testing: You use the X_train and y_train subsets to train your regression
model. This allows the model to learn the relationships between features and target
from the training data.
5. Evaluation: After training, you use the trained model to predict diamond prices using
the testing features (X_test). Then, you compare these predicted prices with the
actual prices (y_test) to evaluate the model's performance. Common evaluation
metrics for regression include Mean Squared Error (MSE), Root Mean Squared Error
(RMSE), and R-squared.
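The split in step 3 can be sketched with train_test_split on a small placeholder array (the data here stands in for the real feature matrix and price vector):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix (10 rows, 2 features) and target vector.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

With test_size=0.2 and 10 rows, 8 rows land in the training subsets and 2 in the testing subsets.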
Let's now walk through how to use Python to create a regression model for forecasting
diamond prices. For this discussion, we'll concentrate on linear regression as the chosen
regression model.
Step 6: Model Training: Train the selected regression model using the training data. In the
case of Linear Regression, you can use a library such as scikit-learn (sklearn) to create and
train the model.
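A minimal training sketch, using toy data where the true relationship (price = 3000 × carat + 100) is known in advance so the fitted coefficients can be checked:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data with a known linear relationship: price = 3000 * carat + 100.
carat = np.array([[0.3], [0.5], [0.7], [1.0]])
price = 3000 * carat.ravel() + 100

# Fit learns the intercept and slope from the training data.
model = LinearRegression()
model.fit(carat, price)
```

Because the toy data is exactly linear, the learned slope and intercept recover 3000 and 100; on real diamond data the coefficients summarize each feature's estimated effect on price.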
Pandas:
Purpose: Pandas is a powerful data manipulation and analysis library. It's used for
loading, cleaning, pre-processing, and exploring datasets.
Explanation: You can use Pandas to read data from CSV files, handle missing values,
encode categorical variables, and perform various data transformations.
NumPy:
Purpose: NumPy is a fundamental package for numerical computations in Python. It
provides support for arrays, matrices, and mathematical functions.
Explanation: NumPy arrays are used to store and manipulate numerical data
efficiently. Many machine learning libraries, including scikit-learn, work with NumPy
arrays.
scikit-learn (sklearn):
Purpose: scikit-learn is a machine learning library that provides a wide range of tools
for various machine learning tasks, including regression.
Explanation: You can use scikit-learn to choose regression models, pre-process data,
split datasets, train models, and evaluate their performance.
Matplotlib and Seaborn:
Purpose: These libraries are used for data visualization.
Explanation: Matplotlib and Seaborn allow you to create various plots and charts to
visualize your data, model results, and relationships between features and target
variables.
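The way these libraries fit together can be shown in miniature: Pandas holds the labeled data, and `.to_numpy()` exposes the NumPy arrays that scikit-learn operates on (the two-row frame here is purely illustrative).

```python
import pandas as pd

# A tiny illustrative frame; real data would be loaded with pd.read_csv(...).
df = pd.DataFrame({"carat": [0.3, 0.5], "price": [500, 900]})

# scikit-learn accepts DataFrames directly, but internally works on NumPy
# arrays; .to_numpy() makes that conversion explicit.
X = df[["carat"]].to_numpy()   # 2D feature matrix
y = df["price"].to_numpy()     # 1D target vector
```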
Step 7: Model Evaluation: Use the testing dataset to evaluate the performance of your
trained model. Common metrics for regression tasks include Mean Squared Error (MSE), Root
Mean Squared Error (RMSE), and R-squared.
Mean Squared Error (MSE):
Purpose: MSE measures the average squared difference between the predicted and
actual values. It provides insight into the overall magnitude of errors in your
predictions.
Explanation: MSE computes the squared difference for each prediction and then
averages these squared differences. It penalizes larger errors more heavily.
Formula: MSE = (1/n) Σ(yᵢ - ŷᵢ)², where yᵢ is the actual value and ŷᵢ is the predicted value
for the i-th observation, and n is the number of observations.
Interpretation: A lower MSE indicates better model performance. It's sensitive to
outliers, as larger errors contribute more significantly to the total.
Root Mean Squared Error (RMSE):
Purpose: RMSE is the square root of the MSE. It provides a metric in the same unit as
the target variable, making it more interpretable.
Explanation: RMSE is a popular metric for understanding the average size of errors.
It's more sensitive to larger errors due to the squaring and then square root
operations.
Formula: RMSE = √MSE
Interpretation: Similar to MSE, lower RMSE values indicate better performance. It's
easy to compare against the range of your target variable.
R-squared (R²):
Purpose: R² measures the proportion of the variance in the target variable that's
predictable from the input features. It gives an idea of how well the model explains
the variability in the data.
Explanation: R² ranges from 0 to 1. A value of 1 means the model perfectly predicts
the target variable, while 0 means the model performs no better than predicting the
mean of the target variable.
Formula: R² = 1 - (Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)²), where ȳ is the mean of the actual values.
In the context of the task at hand, estimating the price of diamonds, you would assess the
effectiveness of the regression model using these measures. Higher R² values and lower
MSE and RMSE values indicate a more effective model.
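The three formulas above can be computed directly with NumPy on a small made-up prediction set (the y values here are illustrative, chosen so the results are easy to verify by hand); scikit-learn offers the same metrics as mean_squared_error and r2_score in sklearn.metrics.

```python
import numpy as np

# Illustrative actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

# MSE = (1/n) Σ(yᵢ - ŷᵢ)²
mse = np.mean((y_true - y_pred) ** 2)

# RMSE = √MSE
rmse = np.sqrt(mse)

# R² = 1 - Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)²
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
```

Here the squared errors are 0.25, 0, 0.25, 0, so MSE = 0.125, RMSE ≈ 0.354, and R² = 0.975, illustrating that small errors relative to the target's variance yield an R² close to 1.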
Step 8: Prediction: After the model has been developed and assessed, you can apply it
to generate predictions on fresh, previously unseen data. Give the trained model the
feature values of a diamond, and it will forecast the diamond's price.
The next stage is to use the regression model to make predictions on new data (new_price)
after it has been trained and its Mean Squared Error (MSE) and Root Mean Squared Error
(RMSE) have been calculated to assess its performance.
1. new_price is a list containing the feature values for the new diamond you want to
predict the price for. This list should have the same format and order as the features
you used for training your model.
2. model.predict() is used to make predictions based on the new features. The method
takes a 2D array-like input, where each row represents a set of features. In this case,
we're providing a single set of features, so we wrap it in a list.
3. predicted_price will store the predicted price for the new diamond.
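The three points above can be sketched end to end. The single-feature toy model (price = 3000 × carat + 100) stands in for the full trained model, and new_price here holds one hypothetical carat value rather than the project's real feature set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Train on toy data where price = 3000 * carat + 100 exactly.
X_train = np.array([[0.3], [0.5], [0.7], [1.0]])
y_train = 3000 * X_train.ravel() + 100
model = LinearRegression().fit(X_train, y_train)

# Feature values for one new diamond, wrapped in a 2D array (one row per
# diamond, same column order as the training features).
new_price = [[0.8]]
predicted_price = model.predict(new_price)[0]
```

Because the toy relationship is exact, the prediction for carat 0.8 is 3000 × 0.8 + 100 = 2500; note that predict always takes a 2D input, which is why the single feature list is wrapped in an outer list.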
CHAPTER: 6
6.1 RESULTS AND ANALYSIS:
A number of significant findings and conclusions have been drawn from the training
and evaluation of the regression model for predicting automobile mileage. In order to
reliably estimate the mileage (MPG) of various automobile models, the model was
constructed using a dataset comprising parameters such as engine size, horsepower,
weight, and fuel type. The key findings and the subsequent analysis are outlined in the
following sections.
1. Model Performance:
The performance of the model was carefully analysed on a previously unseen
testing dataset, and the results were appraised using standard regression assessment
measures. The Mean Squared Error (MSE), which measures the squared difference
between anticipated and actual values, was computed as [MSE value for car mileage
is 8.9209 and for Diamond price is 392819.70]. The Root Mean Squared Error (RMSE),
which provides a measure of the model's predicted accuracy in MPG units, was
calculated as [RMSE value for car mileage is 1.5272 and for Diamond price is 21.5086].
These measurements show the degree of divergence between projected and actual
mileage numbers.
3. Feature Significance:
To determine their impact on the expected mileage, the regression coefficients
connected to each predictor characteristic were investigated. It is noteworthy that
factors like engine size, horsepower, and weight exhibited sizable positive coefficients,
indicating that increases in these characteristics are associated with higher
automobile mileage. The coefficients of the pertinent one-hot encoded features
further demonstrated that the presence of specific fuel types led to variances in
mileage.
4. Prediction Accuracy:
The model was applied to novel input data to generate predictions for car mileage.
The predictions aligned closely with actual mileage values, thereby signifying the
model's ability to generalize its learned patterns to new data points. This accuracy
holds promising implications for practical applications, where informed estimations of
car mileage play a pivotal role in decision-making processes.
7. Data Preprocessing: We report the findings from thorough data preparation, model
training, and model evaluation after completing an extensive regression analysis to
forecast diamond prices. The goal was to create a precise prediction model that could
anticipate diamond values based on a number of inherent characteristics. The results
of this effort are summarized as follows:
11. Analysis: The analysis of the regression model's performance revealed that the
selected attributes, such as carat weight, cut quality, color grade, clarity grade, and
dimensions, significantly influence diamond prices. The model was successful in
capturing the intricate relationships between these attributes and the final price of
the diamond. Moreover, the R-squared value indicated that a substantial proportion
of the price variability could be explained by the provided attributes.
In conclusion, the results of this regression analysis demonstrate the model's
effectiveness in predicting diamond prices with a high degree of accuracy. The analysis
reaffirms the importance of attributes such as carat weight, cut quality, color grade,
clarity grade, and dimensions in determining diamond prices. These findings
contribute valuable insights to the diamond industry, enabling informed decision-making and pricing strategies based on quantifiable attributes.
CHAPTER: 7
7.1 CONCLUSION:
In conclusion, the project that was centred on the estimation of car mileage has
produced insightful findings. We were able to develop a predictive model with a
commendable level of accuracy in estimating car mileage by combining advanced
machine learning techniques with a large dataset. The model shows its ability to
generalize and make reasonably accurate predictions on unobserved data by
examining various factors including engine specifications, vehicle weight, and
aerodynamics. The project also demonstrated the value of feature engineering in
improving predictive performance. The results of this project have important
applications, especially in the automotive sector, where precise mileage forecasting
can help with improving vehicle design, increasing fuel efficiency, and empowering
consumers to make wise decisions. However, there is still room for advancement,
including investigating the incorporation of more sophisticated algorithms and
incorporating real-time data for dynamic predictions. In general, this project highlights
both the potential of machine learning in the automotive industry and the iterative
process of improving predictive models for more accurate results.
Similarly, the project that was primarily concerned with the forecasting of diamond
prices has shown insightful findings and promising results. A reliable diamond price
prediction model was created using meticulous data collection, pre-processing, and
feature engineering, as well as cutting-edge machine learning algorithms. The model
proved to be highly accurate at estimating diamond values based on a variety of
important characteristics, including carat weight, cut, clarity, color, and depth.
According to industry standards, these characteristics have been shown to be
significant predictors of diamond prices. Additionally, the model's generalizability and
potential applicability in real-world scenarios are highlighted by the project's success
in utilizing a diverse dataset from reliable sources. It's crucial to recognize the inherent
complexity of diamond pricing, which can be impacted by consumer preferences,
market trends, and geopolitical factors. Although the developed model provides a
strong foundation for estimating diamond prices, ongoing improvement and
validation using new data will be necessary to keep the model current and accurate
over time. This project not only demonstrates machine learning's ability to predict
outcomes in the field of gem valuation, but it also paves the way for future
investigation and study in the field of luxury item pricing forecasting.
Similarly, for diamond price prediction, the project's future scope offers intriguing
opportunities for improvement and innovation. Integrating advanced gemmological
data, such as fluorescence, symmetry, and other intricate attributes, could give a more
nuanced understanding of diamond valuation, increasing the model's accuracy. A
larger dataset and better feature engineering could result from partnerships with
jewellery specialists and gemmologists. Integrating external factors like economic
indicators and consumer preferences could improve the model's predictive abilities in
light of the constantly changing market trends. By embracing cutting-edge
technologies like blockchain for provenance verification, the diamond market could
gain more transparency, which would affect prices and merit inclusion in the
prediction model. Additionally, extending the application to rare minerals and colored
gemstones could diversify it and draw more attention. In conclusion, the future
potential of the project lies in a multidisciplinary approach that combines
gemmological expertise, technological advancements, and dynamic data sources to
create a reliable prediction model adaptable to the ever-changing diamond industry
landscape.
REFERENCES:
1. Class Notes
2. Annina S, Mahima SD, Ramesh B, “An Overview of Machine Learning and its applications”.
International Journal of Electrical Sciences & Engineering (IJESE), 2015. pp. 22-24.
The dataset “Auto MPG” was created by an anonymous user on the UCI Machine Learning
Repository. As of the writing of this report, the last modification to the dataset was six
years ago.
The dataset is available for viewing and downloading at the following link:
The dataset “cubic zirconia” was created by an anonymous user on the UCI Machine
Learning Repository. As of the writing of this report, the last modification to the dataset
was six years ago.
The dataset is available for viewing and downloading at the following link:
cubic_zirconia | Kaggle
The Source Code is available for viewing and downloading at the following link:
https://www.kaggle.com/code/ghousethanedar/car-mileage-prediction-model