Major Project Report (ROHIT)


DECLARATION

This is to declare that this report has been written by me. No part of the report is plagiarized
from other sources; all information taken from other sources has been duly acknowledged.
I aver that if any part of the report is found to be plagiarized, I shall take full
responsibility for it.

ROHIT DEVGAN

ACKNOWLEDGEMENT

“An engineer with only theoretical knowledge is not a complete engineer. Practical
knowledge is very important to develop and apply engineering skills.” It gives me great
pleasure to have an opportunity to acknowledge and to express gratitude to those who were
associated with me during my training at IBM (Data Science with Python), Delhi.

I am very grateful to Ms. Naina Devi for providing me with an opportunity for training under
her able guidance.

I am pleased to acknowledge my sincere thanks to Dr. Akhilesh Das Gupta Institute of Technology
and Management and IBM for their kind encouragement in doing this project and for
completing it successfully. I am grateful to them for allowing me to be a part of this project
and for helping me complete it.

ROHIT DEVGAN

ABSTRACT

Calculating car mileage and predicting diamond prices may not seem like important
topics, but in everyday life these are questions people ask regularly.

So, to make things easier, we have built a solution that helps in getting answers.

The car mileage calculation depends on the attributes chosen by the user, for
example horsepower or weight. The model relates these values to MPG (miles per gallon)
to give an estimated car mileage.

The diamond price calculation likewise depends on the attributes chosen by the user, for
example carat or clarity. The model relates these values to price to give an estimated value.

TABLE OF CONTENTS

Acknowledgement……………………………………………………………………………………………i
Abstract……………………………………………………………………………………………………………ii

CHAPTER 1:
1.1 Introduction:…………………………………………………………………………...………1
1.1.1 Definition of Python…………………………………………………………………………..……….………1

1.1.2 History of Python………………………………………………………………………………..….…...……..1

1.1.3 Comparison with other languages………………………………………………………………..…..….1

1.1.4 Python Features……………………………………………………………………………………….…..…….2

1.1.5 Demerits of Python Programming…………………………………………………………….…..…….4

1.1.6 Real World Application of Python……………………………………………………………….……….5

1.1.7 Application And Software Used For Python………………………………………………………….9

1.1.8 Regression In Python…………………………………………………………………………………..………12

1.2 Motivation……………………………………………………………………………………....14

1.3 Statement Of Problem……………………………………………………………….……14

1.4 Objective …………………………………………………………………………………….…..15

1.5 Feasibility Study………………………………………………………………………….…..16


1.5.1 Study of Car Mileage……………………………………………………………………………………..……16

1.5.2 Study of Diamond Price………………………………………………………………………………..…….19

1.6 Significance Of Project……………………………………………………………….……23

1.7 Beneficiary Of System.......................................................................23

CHAPTER 2:
2.1 Literature Review……………………………………………………………………….……24
2.2 Related Work……………………………………………………………………………….….26
CHAPTER 3:
3.1 Coding………………………………………………………………………………………………28
3.1.1 Prediction of Car Mileage…………………………………………………………………………………….28
3.1.2 Prediction of Diamond Price………………………………………………………………………………..44

CHAPTER 4:
4.1 Technology Used……………………………………………………………………………..65
4.2 Methodology Of The Project…………………………………………………………..68

CHAPTER 5:
5.1 Implementation……………………………………………………………………………....80
5.1.1 Prediction of Car Mileage…………………………………………………………………………………..…80
5.1.2 Prediction of Diamond Price………………………………………………………………………………...90

CHAPTER 6:
6.1 Result And Analysis………………………………………………………………………….99

CHAPTER 7:
7.1 Conclusion………………………………………………………………………………………103
7.2 Future Scope…………………………………………………………………………………..104

REFERENCES……………………………………………………………………………………………..106
CHAPTER: 1

1.1 INTRODUCTION
1.1.1 DEFINITION OF PYTHON
Python is an interpreted, object-oriented, high-level
programming language. Its high-level built-in data structures,
combined with dynamic typing and dynamic binding, make it
very attractive for Rapid Application Development, as well as
for use as a scripting or glue language to connect existing components together.

The Python interpreter and the extensive standard library are available in source or binary
form without charge for all major platforms, and can be freely distributed.

1.1.2 HISTORY OF PYTHON

Python was developed by Guido van Rossum in the late eighties and early nineties at the
National Research Institute for Mathematics and Computer Science in the Netherlands.

Python is copyrighted. Its source code is available under the Python Software Foundation
(PSF) License, an open-source license compatible with the GNU General Public License (GPL).

Python is now maintained by a core development team, although Guido van
Rossum long held a vital role in directing its progress.

1.1.3 COMPARISON WITH OTHER LANGUAGES

Python is often compared to other programming languages such as Java and C++. In this
section we will briefly compare Python to each of these languages.

 COMPARISON TO JAVA
Python programs are generally expected to run slower than Java programs, but they
also take much less time to develop.
Python programs are typically 3-5 times shorter than equivalent Java programs. This
difference can be attributed to Python's built-in high-level data types and its dynamic
typing.
For example, a Python programmer wastes no time declaring the types of arguments
or variables, and Python's powerful polymorphic list and dictionary types, for which
rich syntactic support is built straight into the language, find a use in almost every
Python program. Because of the run-time typing, Python's run time must work harder
than Java's.
For example, when evaluating the expression a+b, it must first inspect the objects a
and b to find out their type, which is not known at compile time. It then invokes the
appropriate addition operation, which may be an overloaded user-defined method.
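This run-time dispatch is easy to see in a short sketch (the `Vector` class below is an illustrative example, not from the report):

```python
# Python resolves the meaning of a + b at run time, based on the
# operands' types -- there are no compile-time type declarations.

def add(a, b):
    return a + b  # dispatches to int.__add__, str.__add__, etc.

print(add(2, 3))          # integer addition
print(add("py", "thon"))  # string concatenation

# A user-defined class can overload + as well:
class Vector:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __add__(self, other):
        return Vector(self.x + other.x, self.y + other.y)

v = Vector(1, 2) + Vector(3, 4)
print(v.x, v.y)  # 4 6
```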
 Comparison to C++
Almost everything said for Java also applies for C++, just more so: where Python code
is typically 3-5 times shorter than equivalent Java code, it is often 5-10 times shorter
than equivalent C++ code.
Anecdotal evidence suggests that one Python programmer can finish in two months
what two C++ programmers can't complete in a year.

1.1.4 PYTHON FEATURES


 Easy to Learn:
 Python is a very high-level programming language, yet it is effortless to learn.
 Anyone can learn to code in Python in just a few hours or a few days. Mastering Python
and all its advanced concepts, packages and modules might take some more time.
However, learning the basic Python syntax is very easy compared to other popular
languages like C, C++, and Java.

 Easy to Read :
 Python code looks like simple English words. There is no need for semicolons or braces,
and indentation defines the code blocks.
 You can tell what the code is supposed to do simply by looking at it.

 Free and Open-Source :


 Python is developed under an OSI-approved open source license. Hence, it is
completely free to use, even for commercial purposes.
 It doesn't cost anything to download Python or to include it in your application. It can
also be freely modified and re-distributed. Python can be downloaded from the official
Python website.

 Interpreted :
 When a programming language is interpreted, it means that the source code is
executed line by line, and not all at once.
 Programming languages such as C++ or Java are not interpreted, and hence need to
be compiled first to run them.
 There is no need to compile Python because it is processed at runtime by the
interpreter.
 Object-Oriented and Procedure-Oriented :
 A programming language is object-oriented if it focuses design around data and
objects, rather than functions and logic.
 On the contrary, a programming language is procedure-oriented if it focuses more on
functions (code that can be reused). One of the critical Python features is that it
supports both object-oriented and procedure-oriented programming.

 Expressive :
 Python needs to use only a few lines of code to perform complex tasks. For example,
to display Hello World, you simply need to type one line - print(“Hello World”).
 Other languages like Java or C would take up multiple lines to execute this.

 Support for GUI :
 One of the key aspects of any programming language is support for GUI or Graphical
User Interface. A user can easily interact with the software using a GUI.
 Python offers various toolkits, such as Tkinter, wxPython and PyQt, which allow
for easy and fast GUI development.

 Dynamically Typed :
 Many programming languages need to declare the type of the variable before
runtime. With Python, the type of the variable can be decided during runtime. This
makes Python a dynamically typed language.
 For example, if you have to assign the integer value 20 to a variable x, you don’t need
to write int x = 20. You just have to write x = 20.
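A two-line sketch of this behaviour:

```python
# The type of a name is determined at run time and may change:
x = 20          # no declaration needed; x refers to an int
print(type(x))  # <class 'int'>

x = "twenty"    # the same name can later refer to a str
print(type(x))  # <class 'str'>
```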

 High-level Language :
 Python is a high-level programming language because programmers don’t need to
remember the system architecture, nor do they have to manage the memory.
 This makes it super programmer-friendly and is one of the key features of Python.

1.1.5 DEMERITS OF PYTHON PROGRAMMING :


Like various other programming languages, Python has its own set of advantages and
disadvantages. Let’s look at some of the disadvantages of Python:

 Speed :
Because Python is a high-level language, it is not as close to the hardware as C or C++.
Python code is executed line by line by an interpreter rather than being compiled to
machine code ahead of time, which slows it down.

 Mobile Development:
Python is strong on desktop and server platforms and is an excellent server-side
language, but it is a weak choice for mobile development and is very rarely used
there. This is why very few mobile applications are built in it, Carbonnelle being
one example built in Python.
 Memory Consumption:
Python is not a good choice for memory-intensive tasks, and it is rarely used
for that purpose. Python’s memory consumption is high due to the flexibility of
its data types.

1.1.6 REAL WORLD APPLICATIONS OF PYTHON:


Python supports cross-platform operating systems, which makes building applications
with it all the more convenient. Some real-world applications of Python are:

 Web Development :
When it comes to web development, Python offers numerous options for web
development.
For instance, you have Django, Pyramid, Flask, and Bottle for developing web
frameworks and even advanced management systems like Plone and Django CMS.
These web frameworks are packed with standard libraries and modules which
simplify tasks like content management, database interaction, and interfacing with
internet protocols like HTTP, SMTP, XML, JSON, FTP, IMAP, and POP.

 Game Development :
Python comes loaded with many useful extensions (libraries) that come in handy for
the development of interactive games.
For instance, libraries like PySoy (a 3D game engine that supports Python 3) and
PyGame are two Python-based libraries used widely for game development.
Python is the foundation for popular games like Battlefield 2, Frets on Fire, World of
Tanks, Disney’s Toontown Online, Vega Strike, and Civilization-IV.

Apart from game development, game designers can also use Python for developing
tools to simplify specific actions such as level design or dialog tree creation, and even
use those tools to export those tasks in formats that can be used by the primary game
engine.
Also, Python is used as a scripting language by many game engines.

 Scientific and Numeric Applications :


Python has become a crucial tool in scientific and numeric computing. In fact, Python
provides the skeleton for applications that deal with computation and scientific data
processing. Apps like FreeCAD (3D modeling software) and Abaqus (finite element
method software) are coded in Python.
Some of the most useful Python packages for scientific and numeric computation
include:
● SciPy (scientific numeric library)
● Pandas (data analytics library)
● IPython (command shell)
● Numeric Python (fundamental numeric package), etc.

 Data Science and Data Visualization :


Data is money if you know how to extract relevant information which can help you
take calculated risks and increase profits. You study the data you have, perform
operations and extract the information required. Libraries such as Pandas, NumPy
help you in extracting information.
You can even visualize the data using libraries such as Matplotlib and Seaborn, which
are helpful in plotting graphs and much more.

 Desktop GUI (GRAPHICAL USER INTERFACE) :


Python can be used to program desktop applications. It provides the Tkinter library
that can be used to develop user interfaces. There are some other useful toolkits, such
as wxWidgets, Kivy and PyQt, that can be used to create applications on several
platforms. We can start out with creating simple applications such as calculators and
to-do apps, and go ahead and create much more complicated applications.
Moreover, many major organizations and companies like NASA, Google, Nokia, IBM
and Yahoo Maps also use Python in their operational work.

 SOME BASIC FEATURES OF PYTHON:


 Basic operators used in python:
In Python, operators are symbols that represent specific operations to be performed
on one or more operands. These operations can include arithmetic calculations, logical
evaluations, comparisons, assignments, and more. Python provides a variety of
operators that serve different purposes. Here are some common types of operators
in Python:
1. Arithmetic Operators : +, -, *, /, %, //, **
2. Comparison Operators: ==, !=, <, >, <=, >=
3. Logical Operators: and, or, not
4. Assignment Operators: =, +=, -=, *=, /=, %=, //=, **=
5. Bitwise Operators: &, |, ^, ~, <<, >>
6. Membership Operators: in, not in
7. Identity Operators: is, is not

For Example:
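A short illustration of the operator families listed above:

```python
a, b = 7, 3

# Arithmetic operators
print(a + b, a - b, a * b, a / b, a % b, a // b, a ** b)

# Comparison and logical operators
print(a > b and b != 0)   # True
print(not (a == b))       # True

# Assignment operators
c = a
c += b                    # c is now 10

# Bitwise operators
print(a & b, a | b, a ^ b, a << 1)  # 3 7 4 14

# Membership and identity operators
nums = [1, 2, 3]
print(2 in nums, 5 not in nums)     # True True
print(c is not None)                # True
```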

 Loops in Python:
A loop is a control structure that allows you to repeatedly execute a block of code as
long as a certain condition is satisfied. Loops are useful for automating repetitive tasks
and iterating over collections of data. Python supports two main types of loops: "for"
loops and "while" loops.
1. For loop: A "for" loop is used to iterate over a sequence (like a list, tuple,
string, etc.) and execute a block of code for each element in the sequence.
2. While loop: A "while" loop is used to repeatedly execute a block of code as
long as a specified condition remains true.

For Example:
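A short illustration of both loop types:

```python
# for loop: iterate over a sequence and run the body once per element
total = 0
for n in [1, 2, 3, 4, 5]:
    total += n
print(total)  # 15

# while loop: repeat the body as long as the condition remains true
count = 5
while count > 0:
    count -= 1
print(count)  # 0
```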

 Conditional Statements:
Conditional statements in Python allow you to make decisions in your code based on
certain conditions. These statements are used to control the flow of your program
and execute different blocks of code depending on whether a given condition is true
or false. The main conditional statements in Python are the "if," "elif" (short for "else
if"), and "else" statements.

For Example:
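A short illustration using "if", "elif", and "else" (the marks/grade values are arbitrary):

```python
marks = 72

# Exactly one branch runs, chosen by the first condition that is true.
if marks >= 80:
    grade = "A"
elif marks >= 60:
    grade = "B"
else:
    grade = "C"

print(grade)  # B
```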

1.1.7 APPLICATION AND SOFTWARE USED FOR PYTHON:
There are several applications and software tools that are commonly used for Python
programming.
a. PyCharm IDE:
b. Jupyter Notebook
c. Keras
d. Pip Package
e. Anaconda Navigator

 Anaconda Navigator:
Anaconda Navigator is a graphical user interface (GUI) included with the Anaconda
distribution, which is a popular platform used for data science, machine learning, and
scientific computing in Python. Anaconda Navigator provides a user-friendly way to
manage and work with Python environments, packages, and projects. Here are some
key points about Anaconda Navigator:
1. Graphical Interface: Anaconda Navigator offers a visual interface for managing
various aspects of your data science workflow, making it easier for users who
might not be comfortable with the command line.
2. Environment Management: Anaconda Navigator allows you to create and
manage isolated Python environments, commonly referred to as conda
environments. These environments enable you to keep your projects and
packages separate, which helps avoid conflicts between different
dependencies.
3. Package Management: With Navigator, you can easily search, install, and
update Python packages and libraries from the Anaconda repository. It
simplifies the process of installing and managing complex packages and
dependencies.
4. Project Management: You can create and manage data science projects using
Navigator. Each project can have its own isolated environment and associated
packages, making it easy to switch between different project configurations.

5. Integrated Development Environments (IDEs): Navigator can be used to
launch popular Python IDEs like Jupyter Notebook, JupyterLab, and Spyder.
These IDEs are widely used in the data science community for interactive
coding, analysis, and visualization.

5. Visual Interface for Conda Commands: While Anaconda Navigator
provides a GUI, it is built on top of the conda package manager. It offers a
visual interface for conda commands, allowing users to manage
environments and packages without having to remember complex command-
line syntax.

6. Channel Management: You can configure conda channels through
Navigator. Channels are sources from which conda fetches packages. This
is useful when you need to access packages from non-default channels.

7. Learning Resources: Anaconda Navigator often provides links to
documentation, tutorials, and other learning resources, which can be helpful
for newcomers to data science and the Python ecosystem.

 Jupyter Notebook:

Jupyter Notebook is an open-source web application that allows users to create
and share documents that contain live code, equations, visualizations, and
explanatory text. It's a popular tool among data scientists, researchers, educators, and
programmers for interactive and exploratory computing. The name "Jupyter" is a
combination of three programming languages: Julia, Python, and R, which it initially
supported. However, it now supports a wide range of programming languages.

 Key features of Jupyter Notebook include:

1. Cell-Based Structure: Notebooks are organized into cells. Each cell can contain
code (usually in Python but can support other languages), Markdown text for
explanations and documentation, or even LaTeX equations for mathematical
formulas.
2. Interactive Execution: Users can run individual code cells interactively. This
allows for step-by-step execution and immediate feedback, which is useful for
debugging and iterative development.
3. Rich Output: Notebooks can display rich outputs, including images, plots,
tables, and interactive visualizations. This makes it great for data analysis and
presentation.
4. Data Visualization: Jupyter Notebook integrates well with various data
visualization libraries, such as Matplotlib, Seaborn, Plotly, and more. This
makes it easy to create meaningful visualizations within your notebook.
5. Easy Sharing: Notebooks can be easily shared with others. They can be
exported to different formats like HTML, PDF, and slides, allowing you to
communicate your work effectively.
6. Collaboration: While real-time collaborative editing is not a built-in feature of
Jupyter Notebook, there are third-party tools and platforms that allow for
collaborative editing and sharing of notebooks.
7. Educational Tool: Jupyter Notebook is used widely in education for teaching
programming, data science, and various scientific concepts. Its interactive
nature and ability to mix code and explanations make it a valuable tool for
learning.

1.1.8 Regression In Python:
 Definition of Regression:
Regression is a statistical and machine learning technique used for modelling the
relationship between one or more independent variables (often called "predictors" or
"features") and a dependent variable (often called the "target" or "outcome"). The
main goal of regression analysis is to understand and quantify the relationship
between the variables, as well as to make predictions or estimations based on this
relationship.
There are various types of regression techniques, each suited for different scenarios
and data types. Some common types of regression include:
a. Linear Regression
b. Multiple Linear Regression
c. Polynomial Regression
d. Logistic Regression
e. Time Series Regression
f. Support Vector Regression

 Use Of Regression in Python:


Regression is widely used in Python for various purposes, such as data analysis,
prediction, and modelling. Here are some common use cases for regression in Python:
a. Predictive Modelling: Regression is frequently used for predictive modelling.
Given historical data with known outcomes, you can train a regression model
to learn the relationships between input variables and the target variable. This
trained model can then be used to predict the target variable for new or
unseen data points.
b. Sales Forecasting: In business, regression can be used to predict future sales
based on factors like historical sales data, marketing expenditure, economic
indicators, and more.
c. Price Prediction: Regression can help predict prices of products or assets based
on factors like supply, demand, competitor prices, and other relevant
variables.

d. Risk Assessment: Regression models can be used in finance and insurance to
assess risk factors and predict the likelihood of events, such as loan defaults or
insurance claims.
e. Healthcare: Regression can be used to predict medical outcomes, like
predicting patient health outcomes based on factors such as age, medical
history, and other relevant variables.
f. Machine Learning: Regression algorithms are a key component of machine
learning pipelines, helping algorithms learn patterns and relationships within
data.

 How We Used Regression In Our Project:


In both the projects involving the prediction of car mileage and diamond prices,
regression plays a pivotal role in modelling and forecasting. Regression is a statistical
technique that enables us to establish relationships between input variables and a
target variable. In the context of predicting car mileage, regression helps us
understand how various attributes of a car, such as its engine size, weight, and
horsepower, influence its fuel efficiency or mileage. Similarly, in the case of predicting
diamond prices, regression assists us in deciphering the intricate interplay between
diamond characteristics like carat weight, cut quality, colour, and clarity, and their
corresponding prices.

Regression empowers us to extract meaningful insights from data and build predictive
models that help us estimate outcomes based on underlying relationships. By
leveraging regression in projects focused on car mileage and diamond price
prediction, we can enhance decision-making, gain valuable insights, and make
informed forecasts in two distinct domains.
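As a minimal sketch of how such a regression is fitted (a closed-form simple linear regression on made-up horsepower/MPG figures, not the project's actual dataset or code):

```python
# Simple one-variable linear regression, fit by the closed-form
# least-squares solution. The horsepower/MPG pairs below are
# illustrative made-up values, not the report's real dataset.

def fit_line(xs, ys):
    """Return (slope, intercept) minimising the squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

horsepower = [70, 90, 110, 130, 150]
mpg        = [35, 30, 26, 22, 18]

m, c = fit_line(horsepower, mpg)
print(f"predicted MPG at 100 hp: {m * 100 + c:.1f}")  # 28.3
```

The project itself uses library implementations (e.g. scikit-learn) on real data; this sketch only shows the underlying least-squares idea.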

1.2 MOTIVATION:
The motivation behind working on projects to predict car mileage and diamond prices
using python is all about making life better and smarter. Imagine being able to know
how far a car can go on a tank of gas before you buy it. This could save you money and
help the environment by choosing a more fuel-efficient car. That's why predicting car
mileage matters – it's about helping people make wise choices while being kind to our
planet.

Now, think about diamonds. They're beautiful and valuable, but their prices can be
confusing. Imagine having a tool that could help you understand how much a diamond
should cost based on its size, cut, and other features. This could help you get a fair
deal when buying one. That's where predicting diamond prices comes in – it's about
giving people the knowledge they need to make informed decisions when buying or
selling something precious.

In simple words, both these projects use clever math to predict important things – like
how efficient a car is or how much a diamond should cost. This knowledge can make
a real difference in people's lives, from saving money to being more environmentally
conscious and getting the best value for their money.

1.3 STATEMENT OF PROBLEM:


In the projects focused on predicting car mileage and diamond prices, we're tackling
some interesting challenges. For the car mileage prediction project, the problem
revolves around understanding the factors that influence how far a car can travel on
a certain amount of fuel. This involves figuring out how attributes like the car's engine
size, weight, and power relate to its fuel efficiency. The main challenge here is to
create a reliable prediction model that can help people estimate a car's mileage before
they buy it, which in turn can assist them in making eco-friendly and cost-effective
choices.

On the other hand, in the diamond price prediction project, the problem centres on
uncovering the intricate relationship between a diamond's characteristics and its
market price. This involves decoding how factors like the diamond's carat weight, cut
quality, colour, and clarity affect its value. The key challenge is to build an accurate
prediction model that aids in estimating a diamond's price based on its qualities. This
can provide buyers, sellers, and investors with a clearer understanding of the diamond
market, empowering them to make more informed decisions when it comes to
purchasing or selling diamonds.

1.4 OBJECTIVE:
The objectives of our projects focused on predicting car mileage and diamond prices
using Python are quite clear. In the car mileage prediction project, our goal is to
develop accurate prediction models that can estimate how far a car can travel on a
given amount of fuel. By analysing attributes like engine size, weight, and power, we
aim to create a tool that assists potential car buyers in making informed decisions
about fuel efficiency before purchasing a vehicle. This objective aligns with the
broader goal of promoting environmentally responsible and cost-effective
transportation choices.

Similarly, in the diamond price prediction project, our aim is to construct reliable
prediction models that can predict the market price of diamonds based on their
intrinsic features. By understanding the relationships between attributes such as carat
weight, cut quality, color, and clarity, we intend to create a tool that empowers
buyers, sellers, and investors in the diamond industry to navigate the market with
more confidence. Ultimately, our objectives in both projects revolve around utilizing
Python to develop models that provide valuable insights, enabling better decision-
making for both car buyers and diamond enthusiasts.

1.5 FEASIBILITY STUDY:
In the case of predicting car mileage, feasibility revolves around data availability and
quality. It's crucial to determine if there's sufficient data encompassing diverse car
attributes and corresponding mileage values. Additionally, evaluating the accuracy of
the data and its representation of real-world scenarios is paramount. The feasibility
study also involves gauging the availability of appropriate regression algorithms and
tools within Python that can handle the complexity of the prediction task.
1.5.1 Study of Car Mileage:
The primary function of cars is to provide convenient and fast transportation. They
allow individuals to commute to work, travel long distances, run errands, and explore
new places. With cars, people can save time and have greater flexibility in their daily
routines.

Cars are mainly differentiated according to the following criteria:


I. According to weight
Subcompact – curb weight less than 2500 lbs
Compact – curb weight 2500 to 2999 lbs
Mid-size – curb weight 3000 to 3499 lbs
Full size – curb weight more than 3500 lbs
II. According to style
Coupe – These are small cars but are admired for their sport looks. They have
the seating capability for two to four people. It is usually equipped with two
doors with a small boot.
Sedan or Saloon – These cars have a separate area for keeping luggage.
Sedans are loved by many. They are typically four door cars with seating
capability of 4 + people. Saloon is just the British variant of the term Sedan.
Convertibles – These cars have a detachable rooftop, and that is why they are
known as convertibles, as whenever required, the look of the car can be altered
in context to rooftop.

Sports Car – This includes the cars which are known for their good
performance at high speeds. They are usually convertibles. Sports cars are
designed for focusing on form and function rather than seating capacity,
luxury or cargo capacity.
Hatchbacks – It refers to the car which has an additional space for cargo.
These cars have a large door at the backside of the car, which gives access to
this additional area.
Minivan – These are smaller version of vans which are structured on car
platforms. These vans have sliding rear doors. These are included in the
segment of mid-sized cars.
Sports utility Vehicle or SUV – These cars are known for their off-road
capabilities. They have a large body type. Their seating capacity ranges from 5
to 7+. They are usually the ones with four-wheel drive.
Station Wagon – These are also known as Estate Car. They are quite similar
to a hatchback but with an extended roof and rear body. This extra space
makes it eligible for adding the third row.

III. According to power


Cars can also be differentiated according to the power they produce, whether
indicated power or brake power.
o High power cars: these have brake power of 250 hp to 300 hp or above.
o Medium power cars: these have brake power of around 150 hp to 250 hp.
o Low power cars: these have brake power of around 0 to 150 hp.

IV. MILEAGE
Mileage is how many kilometres a vehicle runs per litre of fuel, or how many
miles it runs per gallon of fuel. It is also used to describe how many
kilometres/miles the vehicle has covered in its lifetime. These are the
two formal usages of the term mileage. It differs from one vehicle to another.
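The two conventions are related by unit conversion (1 mile ≈ 1.609 km, 1 US gallon ≈ 3.785 L), which can be sketched as:

```python
KM_PER_MILE = 1.60934
LITRES_PER_US_GALLON = 3.78541

def mpg_to_kmpl(mpg):
    """Convert miles per US gallon to kilometres per litre."""
    return mpg * KM_PER_MILE / LITRES_PER_US_GALLON

# A 30 MPG car covers roughly 12.75 km per litre.
print(round(mpg_to_kmpl(30), 2))
```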

Factors on which mileage depends upon are as following: -
1. Number of Cylinders: The fuel mileage, or fuel efficiency, of a vehicle can
be influenced by several factors, including the number of cylinders in the
engine. However, the number of cylinders is just one of many factors that
contribute to overall fuel efficiency; others, such as engine size, vehicle
weight, aerodynamics, transmission type, and driving conditions, also play
significant roles. In general, vehicles with fewer cylinders tend to have better
fuel efficiency, all else being equal, because engines with fewer cylinders
generate less internal friction, consume less fuel, and produce fewer
emissions.

2. Weight: The mileage of a vehicle, often referred to as fuel efficiency or fuel
economy, can be influenced by several factors, including the weight of the
vehicle. Here's how weight can impact mileage:
1. Aerodynamic Efficiency.
2. Engine Power and Load.
3. Rolling Resistance.
4. Acceleration and Braking.
5. Vehicle Design.
6. Driving Behaviours.

weight and mileage isn't always linear or straightforward. Other factors such
as engine efficiency, transmission type, tire type and pressure, road conditions,
driving habits, and maintenance also play significant roles in determining a
vehicle's fuel Efficiency. Top of Form

3. Horsepower:- There isn't a direct linear relationship between engine
horsepower and fuel efficiency. Higher-horsepower engines tend to be
found in larger, more powerful vehicles that sacrifice some fuel
efficiency for better performance. On the other hand, lower-horsepower
engines can be designed to prioritize efficiency, especially in smaller and
lighter vehicles or those optimized for city driving. It is important to consider a
combination of factors, including engine size, vehicle weight, driving
conditions, and technology, to understand how horsepower influences mileage.

4. Running Conditions:- The mileage, also known as fuel efficiency or gas
mileage, of an engine depends on various running conditions that influence
how efficiently the engine converts fuel into usable energy to propel the
vehicle. Key running conditions that impact engine mileage include:
1. Driving Speed and Style.
2. Terrain.
3. Traffic Conditions.
4. Vehicle Load.
5. Maintenance and Condition.
6. Weather and Environmental Factors.

1.5.2 Study of Diamond Price:
Similarly, for the diamond price prediction project, feasibility hinges on the availability
and reliability of diamond data. Assessing whether the dataset includes a
comprehensive range of diamond attributes and corresponding prices is essential.
Ensuring that the data is representative of the diamond market and sufficiently
detailed is a critical part of the study. Additionally, determining the adequacy of
Python libraries for advanced regression techniques, data pre-processing, and
visualization is crucial for successful execution.

Introduction:
Diamond is a solid form of the element carbon with its atoms arranged in a crystal
structure called diamond cubic. Another solid form of carbon known as graphite is the
chemically stable form of carbon at room temperature and pressure, but diamond is
metastable and converts to it at a negligible rate under those conditions. Diamond has
the highest hardness and thermal conductivity of any natural material, properties that
are used in major industrial applications such as cutting and polishing tools. They are
also the reason that diamond anvil cells can subject materials to pressures found deep
in the Earth.
Diamonds are among the most coveted and admired gemstones in the world, known
for their exquisite beauty, brilliance, and enduring symbolism. Formed deep within
the Earth's mantle under extreme heat and pressure, diamonds emerge as rare and
remarkable treasures. With a long history of intrigue, cultural significance, and
industrial utility, diamonds hold a special place in both the realms of luxury and
practicality

Types of diamonds:

1. Technical: - These four types of diamonds are the main technical classifications.

 Type Ia diamonds: because nitrogen gathers in clusters in these stones, they have a
yellowish tinge.
 Type IIa diamonds: these diamonds have no nitrogen impurities and differing
fluorescent properties.
 Type Ib diamonds: these are also quite rare, and their main feature is that individual
nitrogen atoms are scattered throughout the stone (rather than in clusters).
 Type IIb diamonds: another rare type of diamond with no nitrogen atoms. They do,
however, contain boron in addition to their main carbon content.

2. By casual shoppers:- Most people in the market for diamonds will classify them
according to the following basic names (or variations of them):

 Natural diamonds – the standard type of diamond we referred to in the
introduction: largely colourless and brightly sparkling in the light.
 Treated diamonds – natural diamonds that have been enhanced artificially by
inclusion filling or colour enhancement (usually much cheaper than natural
diamonds as it’s the only way that these diamonds can be sold).
 Natural coloured diamonds – pink, yellow, blue, purple, violet, red, green, grey,
white, black diamonds and more: demand for these has boomed in recent years.
 Man-made diamonds – these are created in the lab, a skill that has become easier
and cheaper in recent times with the advance of technology (usually identified by
their cheaper price).

3. Clarity of a diamond:- The classification system for diamond clarity provided by
the GIA includes the following categories:

 Flawless (FL) - no inclusions or blemishes are visible to a skilled grader using 10×
magnification.
 Internally Flawless (IF) - no inclusions and only blemishes are visible to a skilled
grader using 10× magnification.
 Very, Very Slightly Included (VVS1 and VVS2) - inclusions are difficult for a skilled
grader to see under 10× magnification.
 Very Slightly Included (VS1 and VS2) - inclusions are minor and range from difficult
to somewhat easy for a skilled grader to see under 10x magnification.
 Slightly Included (SI1 and SI2) - inclusions are noticeable to a skilled grader under
10x magnification.
 Included (I1, I2, and I3) - inclusions are obvious under 10× magnification and may
affect transparency and brilliance.

4. Carat (weight of a diamond):- One carat is equal to 200 mg (about 0.00643 troy oz)
and, with all else being equal, the larger the size of a diamond, the greater its value. The
bulk of diamonds on the market are less than one carat, so sizes are often quoted in points
(100 sub-divisions of the carat). So, treat the carat purely as an indication of the
weight/size of the diamond rather than as any indication of value on its own.

5. Cut of a diamond: - The last of the four Cs is the “cut” of a diamond. This refers to other
visible features of a diamond, such as its:

 Proportions (width and depth of the diamond)


 Finish (does light escape from the diamond and leave it looking dull?)
 Symmetry
 Polish
Cuts of diamonds are subject to a grading system too:

 Excellent (EX)
 Very Good (VG)
 Good (G)
 Fair (F)
 Poor (P)

6. Other distinguishing features of diamonds

Some of the most common diamond shapes are:

 Round (the most common)


 Princess-cut
 Oval or pear-shaped diamonds
 Emerald-cut diamonds
The shape may affect the brilliance and sparkle of the diamond.

Specification of diamond

 Carat Weight: This is a measure of a diamond's size and weight, with one carat equal
to 200 milligrams. The carat weight directly affects the diamond's size, and larger
diamonds are generally rarer and more valuable.
 Cut: The cut refers to the way a diamond's facets are arranged and shaped. It's one
of the most crucial factors affecting a diamond's brilliance and sparkle. The quality of
the cut determines how effectively light is reflected within the diamond.
 Cut Grade: This evaluates the overall quality of the diamond's cut, including
proportions, symmetry, and polish. Common cut grades include Excellent, Very
Good, Good, Fair, and Poor.
 Colour: Diamonds are graded on a colour scale that ranges from D (colourless) to Z
(light yellow or brown). The less colour a diamond has, the higher its value.
Colourless diamonds allow more light to pass through, resulting in greater brilliance.

 Clarity: Clarity refers to the presence of internal and external imperfections, known
as inclusions and blemishes, respectively. The clarity grade ranges from Flawless (no
imperfections visible under 10x magnification) to Included (imperfections visible to
the naked eye).
 Clarity Grading: Clarity grades include categories like Flawless (FL), Internally
Flawless (IF), Very, Very Slightly Included (VVS1, VVS2), Very Slightly Included
(VS1, VS2), Slightly Included (SI1, SI2), and Included (I1, I2, I3).
 Shape: The shape of a diamond refers to its overall outline or form. Common shapes
include Round Brilliant, Princess, Emerald, Asscher, Marquise, Oval, Pear, Heart, and
Cushion.
 Certification: Diamonds are often certified by reputable gemological laboratories,
such as the Gemological Institute of America (GIA) or the International Gemological
Institute (IGI). Certification provides an objective assessment of a diamond's quality,
including its Four Cs.
 Fluorescence: Some diamonds exhibit fluorescence when exposed to ultraviolet
light. This can affect the diamond's appearance under certain lighting conditions.
 Symmetry and Polish: These factors assess the precision of a diamond's facets and
the smoothness of its surface. Excellent symmetry and polish contribute to a
diamond's overall beauty.
 Depth and Table Percentage: These measurements relate to the proportions of the
diamond. The depth is the height of the diamond measured from the culet to the
table, and the table percentage is the width of the table facet relative to the
diameter of the diamond.

Factors on which the price of a diamond depends: -

• Carat Weight
• Cut
• Colour
• Clarity
• Shape
• Certification
• Market Demand
• Treatments and Enhancements
• Market Trends and Economic Factors

1.6 SIGNIFICANCE OF PROJECT:
In the case of predicting car mileage, the project's importance emerges from its
potential to guide individuals towards more informed and economical choices when
purchasing vehicles. By accurately estimating a car's fuel efficiency based on its
attributes, the project contributes to environmentally conscious decisions and cost
savings. In a world increasingly focused on sustainability and economic prudence, such
predictions hold the promise of making a positive impact on both individuals and the
environment.

Similarly, the diamond price prediction project carries significance in a realm where
clarity and transparency are paramount. By uncovering the intricate relationships
between diamond characteristics and their market prices, the project empowers
consumers, jewellers, and investors to make more educated decisions. This newfound
understanding of diamond pricing can lead to fairer transactions and enhanced trust
within the industry. Given the emotional and financial value associated with
diamonds, the project's potential to foster well-informed choices is both practical and
meaningful.

1.7 BENEFICIARY OF THE SYSTEM:


For the car mileage prediction project, the primary beneficiaries are potential car
buyers and environmentally conscious individuals. The prediction system empowers
them with the ability to estimate a vehicle's fuel efficiency before purchase, helping
them make more economical and eco-friendly choices. By having insights into how
different car attributes affect mileage, buyers can align their preferences with
environmental concerns, ultimately reducing fuel consumption and emissions.

In the diamond price prediction project, a range of beneficiaries emerges from the
complex diamond market ecosystem. Consumers and potential buyers gain the
advantage of a fairer understanding of diamond pricing, enabling them to make well-
informed decisions when purchasing precious gemstones. Jewellers benefit from
improved transparency, facilitating better communication with customers and
fostering trust. Investors, on the other hand, can utilize the prediction system to gain
insights into pricing trends, enhancing their decision-making in a dynamic market.

CHAPTER: 2
2.1 LITERATURE REVIEW
 Prediction of Car Mileage:
The project's literature review for car mileage prediction using Python entails a
thorough analysis of the literature, studies, and resources already available on the
topic of car mileage prediction. This review aims to offer a thorough understanding of
the approaches, methods, and conclusions that have been previously investigated in
works that are similar to this one.

To reduce fuel consumption and address environmental issues, many studies have
looked into predicting car mileage. Regression analysis, machine learning algorithms,
and data mining techniques are among the methods frequently used by researchers.
One popular technique is linear regression, where predictions are made by modelling
the relationship between car characteristics (such as engine specifications, weight,
and horsepower) and fuel efficiency (measured in miles per gallon, or mpg). Linear
regression is favoured for its simplicity and its ability to uncover linear relationships
between variables.

Furthermore, more sophisticated methods for forecasting car mileage have gained
popularity. These methods include decision trees, support vector machines, and
neural networks. These methods enable the capture of non-linear relationships and
interactions between attributes that linear models might find difficult to fully
represent. It is possible for machine learning algorithms to glean intricate patterns
from large datasets, improving prediction accuracy.

The importance of feature engineering and data pre-processing is also emphasized in
the literature. To ensure that the input data is of a high enough standard to support
reliable predictions, it is essential to clean the data, handle missing values, and
transform categorical variables into numerical representations. In order to improve
the performance of the models, it is crucial to identify the most important
characteristics that have a significant impact on fuel efficiency.

The evaluation of prediction models takes up a sizable portion of the literature review.
The accuracy of predictions is frequently evaluated using metrics like mean squared
error (MSE), root mean square error (RMSE), and mean absolute error (MAE). These
metrics assess how well the models generalize to new data by measuring the
discrepancy between predicted and actual mileage values.
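The three metrics named above can be computed directly with NumPy; the sketch below uses illustrative actual and predicted mileage values, not results from the report's models:

```python
import numpy as np

# Illustrative actual vs. predicted mileage values (mpg)
y_true = np.array([18.0, 25.0, 30.0, 22.0])
y_pred = np.array([17.5, 26.0, 29.0, 23.0])

mse = np.mean((y_true - y_pred) ** 2)    # mean squared error
rmse = np.sqrt(mse)                      # root mean square error
mae = np.mean(np.abs(y_true - y_pred))   # mean absolute error

print(mse, rmse, mae)
```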

The literature also highlights domain-specific considerations. Different vehicle types,
like trucks and sedans, may display varying characteristics that have an impact on fuel
efficiency. For precise predictions, it is therefore essential to adapt prediction models
to particular vehicle categories. In addition, outside variables like the climate, the type
of roads, and driving habits can have a big impact on mileage. These elements are
incorporated into models to increase their sturdiness and practical relevance.

 Prediction Of Diamond Price


Python has been used to predict diamond prices using a variety of methodologies and
techniques that have been created and improved over time. This literature review
aims to offer a summary of the major developments, difficulties, and contributions in
this area.

Accurate diamond price forecasting has implications for economic analysis and
investment decisions outside of the jewellery sector. In order to predict the market
value of diamonds, researchers have used a variety of machine learning algorithms
and statistical models to extract important insights from diamond attributes.

In this context, regression analysis is a fundamental strategy. When determining
relationships between specific characteristics of a diamond, such as its carat weight,
cut quality, and clarity, and its final price, researchers have looked at the predictive
power of linear regression models.

Despite the improvements, problems still exist. Subjective factors that are difficult to
quantify and take into account, such as market sentiment and cultural trends, can
have an impact on diamond prices. Additionally, current data is required to maintain
prediction accuracy due to the diamond market's constant evolution.

2.2 RELATED WORK


 Prediction Of Car Mileage
The related work within the project context involves a thorough investigation of
earlier research, studies, and projects that have engaged with similar difficulties in
estimating a vehicle's fuel efficiency. The project's focus is on predicting car mileage
using Python. This exploration is a crucial phase of our project because it gives us the
opportunity to learn from and improve upon existing methodologies, techniques,
and discoveries.

Researchers have explored a number of strategies to improve fuel efficiency and


promote sustainable transportation options in the vast field of predicting car
mileage. Regression analysis is one of the tried-and-true methods for unravelling the
complex relationships between car characteristics and fuel efficiency. Linear
regression, known for its simplicity, is widely used to capture linear relationships
between vehicle characteristics like engine specifications, weight, and power, and
fuel efficiency metrics like miles per gallon (mpg).

The importance of data preparation and feature engineering is front and centre in
the predictive modelling ecosystem. The cornerstone of ensuring the integrity of
input data is meticulous data cleaning, thoughtful handling of missing values, and
skilful conversion of categorical variables into numeric representations. The accuracy
of our predictive models is increased by feature selection and extraction, which work
like skilled sculptors to chisel away at features to uncover those that have the
greatest impact on fuel efficiency.

Evaluation of prediction models serves as a compass as we navigate this domain. We
evaluate the congruence of our models' predictions with actual mileage values using
metrics like the R-squared coefficient, Mean Absolute Error (MAE), and Root Mean
Squared Error (RMSE).

 Prediction Of Diamond Price
An extensive analysis of earlier research, studies, and endeavours that have delved
into the challenging field of estimating the market value of diamonds is required for
the related work within our project focused on using Python to predict diamond
prices. As it gives us useful information, methodologies, and lessons from previous
endeavours, this investigation is essential for determining the course of our project.

Due to the allure and economic importance of these gemstones, researchers have
become engrossed in the task of predicting diamond prices. To understand the
complex connections between diamond attributes and their corresponding market
values, a number of strategies and techniques have been investigated. A crucial
method for figuring out how characteristics like carat weight, cut quality, colour, and
clarity affect a diamond's value is regression analysis. In particular, linear regression
has become a trustworthy method for identifying these relationships and
comprehending how they affect a diamond's market value.

It is impossible to overstate the importance of feature engineering and data
preparation. Accurate predictions are based on meticulous data cleaning, expert
handling of missing data, and careful transformation of categorical attributes into
numerical forms. Similar to the process of honing a piece of art, feature selection
and extraction aims to draw attention to characteristics that have a significant
influence on diamond prices and improve the accuracy of our predictive models.

A crucial component of related research is the evaluation of predictive models.
Metrics like R-squared, Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE)
offer quantitative measures to evaluate how closely our models' forecasts match
actual diamond prices. These metrics act as benchmarks, pointing us in the direction
of reliable and accurate models.

CHAPTER: 3
3.1 Prediction of Car Mileage

…and so on; the dataset continues in this way up to 397 rows.

[Pages 29-42 of the original report contain the notebook code and output screenshots
for the car-mileage analysis; they are not reproduced in this text version.]
The predicted value for the given user input is 26.2 mpg.

This might seem unrealistic, but it follows directly from the input supplied by the user.

3.2 Prediction of Diamond Price

[Pages 44-63 of the original report contain the notebook code and output screenshots
for the diamond-price analysis; they are not reproduced in this text version.]
The predicted diamond price is $112.23, depending on the input from the user.

CHAPTER: 4
4.1 TECHNOLOGY USED
Python was the major technology used for implementing the machine learning
concepts, the reason being that Python offers numerous ready-made methods in the
form of packaged libraries. The following are the prominent libraries/tools we used
in our project.

 NumPy
NumPy (Numerical Python) is a well-known open-source library that supports large,
multi-dimensional arrays and matrices and offers a range of mathematical operations
on these arrays. It is a foundational Python package for scientific computing and is
widely used in a variety of disciplines, including image processing, machine learning,
data analysis, and more.
Key features of NumPy:
a. Multi-dimensional Arrays
b. Element-wise Operations
c. Broadcasting
d. Indexing and Slicing
e. Mathematical Functions
f. Random Number Generation
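A brief illustration of several of these features (multi-dimensional arrays, element-wise operations, broadcasting, and slicing), on made-up values:

```python
import numpy as np

a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # a 2x3 multi-dimensional array

doubled = a * 2                        # element-wise operation
shifted = a + np.array([10, 20, 30])   # broadcasting a 1-D array over each row
first_col = a[:, 0]                    # indexing and slicing: first column

print(doubled.sum(), shifted[1, 2], first_col.tolist())
```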

 Pandas
pandas is a free, open-source Python library for handling and analysing data. It
offers data structures and functions that are intended to make it simple and intuitive
to work with structured data, including that found in CSV files, Excel spreadsheets,
SQL databases, and other formats. Built on top of the NumPy library, pandas is a
foundational tool for data scientists, analysts, and researchers working with tabular
data.
Key features of pandas:
a. Data Structure
b. Data Alignment
c. Indexing and Selection

d. Data Cleaning and Transformation
e. Grouping and Aggregation
f. Time Series Handling
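A small sketch of the selection and grouping features listed above, on an illustrative DataFrame (the column names here are examples, not the project's dataset):

```python
import pandas as pd

# A small illustrative table
df = pd.DataFrame({
    "origin": ["usa", "japan", "usa", "europe"],
    "mpg": [18.0, 32.0, 22.0, 27.0],
})

# Indexing and selection: rows where mpg exceeds 20
high_mpg = df[df["mpg"] > 20]

# Grouping and aggregation: mean mpg per origin
mean_by_origin = df.groupby("origin")["mpg"].mean()

print(len(high_mpg), mean_by_origin["usa"])
```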

 Matplotlib
Matplotlib is a popular 2D plotting library for Python that can produce a wide
selection of static, interactive, and animated visualizations. It is especially helpful
for creating graphs, charts, plots, and other visual representations of data, can be
fully customized, and combines easily with other libraries like NumPy and pandas.
The Matplotlib library contains a module called matplotlib.pyplot that offers a high-
level interface for making different kinds of plots and visualizations. By abstracting
away some of the lower-level details, it makes the process of creating plots simpler,
facilitating the quick and effective generation of plots.
Key features of Matplotlib:
a. Flexible Plotting
b. Customization
c. Subplots and Layouts
d. Export and Integration
e. Mathematical Expressions
f. 3D Visualization

 Seaborn
Built on top of Matplotlib, Seaborn is a statistical data visualization library for Python.
It offers a more complex interface for designing visually appealing and educational
statistical graphics. Seaborn is particularly helpful for visualizing statistical
relationships in your data and for creating complex visualizations with little code.
Key features of Seaborn:
a. Statistical Visualization
b. Higher-Level Abstraction
c. Built-in Theme and Color Palettes
d. Facet Grids
e. Categorical and Statistical Plot
f. Time Series Visualization

 Scikit-Learn
A machine learning library for Python called Scikit-learn offers a variety of tools for
carrying out various machine learning tasks. It is intended to be straightforward and
user-friendly while providing a full range of functionalities for both inexperienced and
seasoned data scientists.
Scikit-learn's capabilities:
a. Supervised Learning: Scikit-learn offers a range of algorithms for supervised
learning tasks, where the goal is to learn a mapping from input data to a target
variable. This includes classification (assigning labels to data) and regression
(predicting continuous values).
b. Unsupervised Learning: Scikit-learn provides methods for unsupervised
learning tasks, such as clustering (grouping similar data points) and
dimensionality reduction (reducing the number of features while retaining
important information).
c. Model Selection and Evaluation: The library includes tools for evaluating the
performance of machine learning models, such as cross-validation,
hyperparameter tuning, and various evaluation metrics like accuracy,
precision, recall, F1-score, etc.
d. Data Preprocessing: Scikit-learn supports data preprocessing techniques like
feature scaling, normalization, handling missing values, and feature extraction.
e. Ensemble Methods: It offers ensemble methods like random forests, which
combine multiple models to improve predictive performance.
f. Model Persistence: Scikit-learn allows you to save trained models to disk and
reload them later, making it easy to deploy models in production
environments.
g. Easy API: Scikit-learn follows a consistent API design, making it relatively
straightforward to switch between different algorithms and experiment with
various models.
h. Community and Documentation: Scikit-learn has a large and active
community, providing extensive documentation, tutorials, and examples for
various use cases.

4.2 METHODOLOGY OF THE PROJECT
 Data Collection and Pre-processing:
Data collection and pre-processing are essential steps in the data analysis and machine
learning pipeline. These actions entail gathering raw data and putting it into a format
that can be used for analysis or modelling. Here is a Python explanation of these steps

Data Collection: Data collection involves obtaining raw data from various sources,
such as databases, APIs, web scraping, or manual entry. The collected data can be in
different formats like CSV, JSON, XML, or databases like SQL.
Data Pre-processing: Data pre-processing is the process of preparing collected data
for analysis or modelling by cleaning, transforming, and organizing it. Dealing with
missing values, outliers, and irrelevant data is made easier by this step.
a. Dropping Of Unnecessary Column
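In pandas, dropping an unnecessary column is typically a single `drop` call; the column names below are illustrative, not the report's exact dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "mpg": [18.0, 32.0],
    "horsepower": [130, 95],
    "car": ["chevrolet chevelle", "toyota corolla"],  # name column, not predictive
})

# Drop a column that is not useful for modelling
df = df.drop(columns=["car"])
print(list(df.columns))
```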

 Data Imputation
Data imputation is a method of replacing null or dirty values with the mean,
median, or mode, without materially changing the overall distribution of the dataset.
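A minimal pandas sketch of mean/median/mode imputation; the column names and values are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "horsepower": [130.0, None, 95.0, 105.0],
    "origin": ["usa", "japan", None, "usa"],
})

# Numeric column: fill missing values with the median
df["horsepower"] = df["horsepower"].fillna(df["horsepower"].median())

# Categorical column: fill missing values with the mode (most frequent value)
df["origin"] = df["origin"].fillna(df["origin"].mode()[0])

print(df.isnull().sum().sum())
```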

 Feature Engineering
A critical step in the machine learning and data analysis processes is feature
engineering. It entails using raw data to build new features or alter existing ones in
order to improve a machine learning model's performance. In order to improve the
representation of the data, reduce noise, and extract pertinent information, feature
engineering aims to increase the accuracy and generalizability of models.

 Exploratory Data Analysis (EDA)


An important first step in the data analysis process is exploratory data analysis (EDA).
It entails analysing and summarizing the key traits, trends, and connections found in a
dataset. EDA enables you to comprehend the organization of the data, spot potential
problems, and produce analysis-related hypotheses.

Here are some key steps and techniques of EDA used in project
a. Data Inspection:

b. Handling of Missing Data

c. Descriptive Statistics:
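The three EDA steps above usually amount to a handful of pandas calls; the frame below is illustrative, not the project's data:

```python
import pandas as pd

df = pd.DataFrame({
    "mpg": [18.0, 32.0, 22.0, None],
    "weight": [3504, 2130, 3436, 2672],
})

# a. Data inspection: shape and column dtypes
print(df.shape, dict(df.dtypes))

# b. Handling of missing data: count nulls per column
print(df.isnull().sum().to_dict())

# c. Descriptive statistics: count, mean, std, quartiles, min/max
stats = df.describe()
print(round(stats.loc["mean", "weight"], 1))
```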

 Data Visualisation
Data visualisation is the process of presenting data in graphical or visual formats to
communicate insights, patterns, trends, and relationships that may not be as clear from
raw data alone. It lets you communicate complex information effectively and helps you
find insights that can be put to use.

Type of Visualisation:
a. Histogram: A histogram is a graphical representation of the distribution of a
dataset. It's commonly used to display the frequency or count of data points
within certain ranges, also known as "bins" or "intervals."
b. Boxplot: A box plot, also known as a box-and-whisker plot, is a graphical
representation of the distribution of a dataset. It provides a summary of the
key features of the data's distribution, such as the median, quartiles, and
potential outliers.

c. Scatter Plots: An illustration known as a scatter plot shows distinct data points
as dots on a two-dimensional plane. It's especially helpful for analyzing the
relationship between two variables and finding outliers or patterns in the data.
Scatter plots come in 2D and 3D variants, used for two and three variables
respectively.

d. Count Plot: Count plots are a specific kind of graphic display that show the
frequency or count of categorical data in a dataset. It's especially beneficial for
visualizing the distribution of categorical variables and contrasting the
frequency of various categories.

e. Pair Plot: A pair plot, also known as a scatterplot matrix, is a graphical
representation that displays pairwise relationships between multiple variables
in a dataset.

f. Heatmap: A heatmap is a graphical representation that uses color to visualize


the values of a two-dimensional dataset. It's particularly useful for showing the
relationships between variables in a matrix format and highlighting patterns or
trends in the data.
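Several of the plot types above can be sketched with Matplotlib alone on synthetic data (the values below are generated for illustration; seaborn offers higher-level equivalents such as sns.heatmap() and sns.pairplot()):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
weight = rng.normal(3000, 500, 200)            # illustrative "weight" values
mpg = 50 - weight / 100 + rng.normal(0, 2, 200)  # loosely related "mpg" values

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(weight, bins=20)       # histogram: distribution of one variable
axes[0].set_title("Histogram")
axes[1].boxplot(mpg)                # boxplot: median, quartiles, outliers
axes[1].set_title("Boxplot")
axes[2].scatter(weight, mpg, s=8)   # scatter plot: relation of two variables
axes[2].set_title("Scatter")
fig.tight_layout()
fig.savefig("eda_plots.png")
plt.close(fig)
```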

 Data Encoding:
Data encoding is the process of transforming textual or categorical information into a
numerical format that can be quickly processed and used as input for data analysis or
machine learning algorithms. Data encoding aids in the conversion of non-numerical
data into a format that is appropriate for machine learning algorithms, which
frequently require numerical input.
For example: converting the categorical data of the colour column into numerical data.
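A pandas-only sketch of two common encodings for a categorical colour column (the category values here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"colour": ["E", "I", "J", "E", "H"]})

# Label encoding: map each category to an integer code
df["colour_code"] = df["colour"].astype("category").cat.codes

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(df["colour"], prefix="colour")

print(df["colour_code"].tolist(), list(one_hot.columns))
```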

 Dealing with Outliers:
Outliers are data points in a dataset that differ significantly from the rest of the data.
They could be unusually high or low values that deviate from the data's typical pattern.
Measurement errors, incorrect data entry, or genuinely uncommon events are just a
few of the causes of outliers.

Finding Percentage of outliers

Removal of Outliers
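The two steps captioned above (finding the percentage of outliers and removing them) are commonly done with the 1.5 x IQR rule; a sketch on an illustrative series:

```python
import pandas as pd

s = pd.Series([18, 20, 21, 22, 23, 24, 25, 26, 80])  # 80 is an obvious outlier

# IQR rule: points beyond 1.5 * IQR from the quartiles are flagged as outliers
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outlier_mask = (s < lower) | (s > upper)
pct_outliers = 100 * outlier_mask.mean()  # percentage of outliers

cleaned = s[~outlier_mask]                # removal of outliers
print(round(pct_outliers, 1), len(cleaned))
```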

 Data Splitting:
Data splitting is the process of separating a dataset into independent variables
(features) and the dependent variable (target), and dividing the rows into training
and testing subsets used to train machine learning models, validate their
performance, and test their generalization ability.
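With scikit-learn this is a single call to train_test_split; the feature matrix below is synthetic, standing in for the project's attributes and target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative feature matrix X and target y (e.g. car attributes -> mpg)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```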

 Model Selection:
Model selection is the process of deciding which machine learning algorithm or model
is best for a given problem. It entails assessing and contrasting various models to
ascertain which one performs the best in terms of accuracy, predictive power,
generalization, and other pertinent metrics.

 Training and Testing:


Training and testing are critical steps in the process of building and evaluating machine
learning models. These steps ensure that your model learns from data and can
generalize its predictions to new, unseen data. Training involves teaching the model
to recognize patterns in the data, while testing assesses how well the model can
perform on new data it hasn't seen before.
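The train-then-test cycle described above can be sketched with a linear regression on synthetic data generated from a known relationship (y = 3x + 1), so the learned coefficients are easy to check:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with a known linear relationship: y = 3*x + 1
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)          # training: learn coefficients from the data
predictions = model.predict(X_test)  # testing: predict on unseen rows

print(round(model.coef_[0], 2), round(model.intercept_, 2))
```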

 Model Evaluation:
Model evaluation is a crucial step in the machine learning process that determines
how well a trained model performs on fresh, untested data. It involves evaluating the
model's precision, generalizability, and potential for development. Effective model
evaluation guides decisions about fine-tuning, selecting, or contrasting various models
by assisting you in understanding how well your model is likely to perform in real-
world scenarios.
Checking of Accuracy Score

Finding All Possible Errors
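Both captioned steps (checking the accuracy score and finding the errors) correspond to a few scikit-learn metric calls; the actual and predicted values below are illustrative, not the report's results:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Illustrative actual vs. predicted values
y_true = np.array([18.0, 25.0, 30.0, 22.0, 28.0])
y_pred = np.array([17.0, 26.0, 29.5, 23.0, 27.0])

r2 = r2_score(y_true, y_pred)        # the "accuracy score" for regression
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)

print(round(r2, 3), mae, round(rmse, 3))
```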

Finally Output:

CHAPTER: 5

5.1 IMPLEMENTATION:
Prediction of Car Mileage:
Step 1. Importing the libraries and creating a pandas DataFrame. Before performing
any work on the datasets, we must first import the required libraries.
a. Pandas:
 Purpose: Pandas is a powerful data manipulation and analysis library.
It's used for loading, cleaning, pre-processing, and exploring datasets.
 Explanation: You can use Pandas to read data from CSV files, handle
missing values, encode categorical variables, and perform various data
transformations.

b. NumPy:
 Purpose: NumPy is a fundamental package for numerical computations
in Python. It provides support for arrays, matrices, and mathematical
functions.
 Explanation: NumPy arrays are used to store and manipulate numerical
data efficiently. Many machine learning libraries, including scikit-
learn, work with NumPy arrays.
c. Scikit-Learn:
 Purpose: scikit-learn is a machine learning library that provides a wide
range of tools for various machine learning tasks, including regression.
 Explanation: You can use scikit-learn to choose regression models, pre-
process data, split datasets, train models, and evaluate their
performance.
d. Matplotlib And Seaborn:
 Purpose: These libraries are used for data visualization.
 Explanation: Matplotlib and Seaborn allow you to create various plots
and charts to visualize your data, model results, and relationships
between features and target variables.
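The imports for these four libraries can be sketched as below. The CSV file name shown in the comment is a hypothetical placeholder; the tiny DataFrame is toy data used only to show that the libraries load and work together.

```python
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so plots work in scripts too
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# df = pd.read_csv('auto-mpg.csv')   # hypothetical file name for the dataset
df = pd.DataFrame({'mpg': [18.0, 15.0], 'horsepower': ['130', '165']})
print(df.dtypes)
```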

Step 2. Data pre-processing and visualization. First, we check for any null, missing,
incomplete, or inappropriate values using the following code.

CONTENT:
 Title: Auto-Mpg Data
 Source: This dataset was taken from the StatLib library which is maintained at
Carnegie Mellon University. The dataset was used in the 1983 American
Statistical Association Exposition
 Relevant information: The data concerns city-cycle fuel consumption in miles
per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous
attributes. The number of instances is 398 and the number of attributes is 9,
including the class attribute. The attribute information is:
i. mpg - mileage/miles per gallon
ii. cylinders - the number of cylinders, the power unit of the car where
gasoline is turned into power
iii. displacement - engine displacement of the car
iv. horsepower - rate of the engine performance
v. weight - the weight of the car
vi. acceleration - the acceleration of the car
vii. model year - the model year of the car
viii. origin - the origin of the car
ix. car name - the name of the car

o Check data completeness


Now we check for any null, missing, incomplete, or inappropriate values using
the following code.

o Clean corrupt data
1. The first command tells you whether there are any missing values among the numerical
data. It does not catch problems in string data, since string-type data can be blank
without being null.
2. The second command tells you whether the datatype of every feature is as expected,
i.e., we expect displacement, horsepower, mpg, etc. to be of numerical datatype
(float/int).
3. df.info() will help check whether each data type is exactly what we are expecting.
We see that the horsepower column is perceived as object data type by Pandas, whereas
we should be expecting a floating-point value. It means there is a string somewhere. Now
our goal is to find that string value(s) and decide what to do with the corrupt data.

Generally, the steps are to check for null, missing, incomplete, or inappropriate values and
subsequently clean the data by converting columns to appropriate data types, filling missing
values, normalizing, etc.
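These cleaning steps can be sketched as below. The toy DataFrame mimics the issue described: in the Auto-MPG data the horsepower column contains '?' placeholders, which is why Pandas reads it as object; coercing to numeric and filling with the median is one common fix (an assumption here, since the report does not state which fill strategy it used).

```python
import pandas as pd

# Toy frame mimicking the problem: horsepower read as object because of '?' entries.
df = pd.DataFrame({'mpg': [18.0, 15.0, 16.0],
                   'horsepower': ['130', '?', '150']})

print(df.isnull().sum())   # no NaNs yet: '?' is a string, not a null
print(df.dtypes)           # horsepower shows up as object, not float

# Coerce to numeric: '?' becomes NaN, then fill NaNs with the column median.
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())
print(df.dtypes['horsepower'])   # now float64
```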

o Data visualization using plots and charts


Now, we use some data visualisation techniques to visualise our data with
charts, histograms, etc., to decide whether any further data pre-processing or feature
engineering is required. In short, we plot to check for anomalies, outliers,
the distribution, the range of values, etc.
i) Pairplot
Show a pairplot of dependent variable (y) with respect to every independent variable
or feature (x1, x2, x3 etc) except car name

ii) Histogram plot


Now we plot histogram of dependent variable (y) and every independent variable or
feature (x1, x2, x3 etc) except car name. Define and describe a histplot() function and
later call it to plot all histograms

o Check and remove outliers using Boxplot
1. To see if the data has any outliers, we will now plot a boxplot of every independent
variable or feature (x1, x2, x3 etc) except car name. Define and describe a df.boxplot()
function
2. Define list of continuous variables and call df.boxplot() for only those to detect outliers
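The boxplot-based outlier check can be sketched as below on toy data. The 1.5×IQR rule used to flag outliers is a common convention, assumed here since the report does not name its exact removal rule.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({'displacement': rng.normal(200, 50, 100),
                   'horsepower': rng.normal(100, 20, 100),
                   'acceleration': rng.normal(15, 3, 100)})

# Define the list of continuous variables and boxplot only those.
continuous = ['displacement', 'horsepower', 'acceleration']
ax = df.boxplot(column=continuous)

# A simple IQR rule to flag (and optionally drop) outliers in one column.
q1, q3 = df['horsepower'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df['horsepower'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_clean = df[mask]
print(len(df), len(df_clean))
```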

Step 3. Set up Machine learning model for prediction


In this section, I will show you the initial steps needed before we begin applying different
machine learning techniques or models.
Machine learning algorithms typically require numerical inputs, so we need to encode
categorical variables (such as cylinders, model year, and origin) into numerical values.
One-hot encoding is a common technique for this purpose (pd.get_dummies()).

Scaling and Normalization: Features with different scales can affect the performance of
certain regression algorithms. Scaling and normalization help bring features to a similar scale.

Feature Selection: Identify which features are relevant for predicting the target variable
(mileage). You can use techniques like correlation analysis, feature importance from tree-
based models, or domain knowledge to decide which features to keep.

Train-Test Split:

Split your dataset into training and testing sets. The training set is used to train the model,
while the testing set is used to evaluate its performance.

Feature Selection: Choosing the most relevant and informative characteristics from your
dataset to train the model is an important stage in creating a predictive model. When
forecasting car mileage using regression, it is important to select characteristics that have a
strong association with the target variable (mileage) and can help the model produce
accurate predictions.

1. Domain Knowledge: Start by considering your domain knowledge. Think about which
features are intuitively expected to affect the mileage of a car. For instance, cylinders,
displacement, horsepower, weight, acceleration, model year, and origin are commonly
known factors that influence car mileage.

2. Correlation Analysis: Calculate the correlation between each numerical feature and
the target variable (mileage). Features with higher absolute correlation values are
generally more important. You can use tools like the Pandas library to calculate
correlations and visualize them using plots like heatmaps.

3. Visualization: Visualize the relationships between individual features and the target
variable. You can use scatter plots, box plots, or other relevant plots to understand
how changes in each feature correspond to changes in car mileage.

4. Feature Importance from Models: Train a preliminary model (e.g., a simple linear
regression or decision tree) and use the feature importance scores provided by the
model. Some models, like Random Forests, can quantify the importance of each
feature in predicting the target variable.

5. Recursive Feature Elimination: This method involves repeatedly training a model and
removing the least important feature in each iteration until a desired number of
features is reached. Scikit-learn provides the RFECV class that can help with this
process.

6. Lasso Regression (L1 Regularization): Lasso regression applies L1 regularization,


which encourages the model to set coefficients of less important features to zero. This
effectively performs feature selection during the model training process.

7. Variance Thresholding: If you have features with very low variance, they might not
contribute much to the prediction. You can use scikit-learn's Variance Threshold to
remove features with variance below a certain threshold.

8. Feature Engineering: Sometimes, combining or transforming features can create
more informative variables. For example, you could create a power-to-weight ratio
from horsepower and weight, or combine cylinders, displacement, and model year
into indicators of good mileage.

9. SelectKBest and SelectPercentile: These techniques use statistical tests to select the
top K features or a certain percentage of features that have the strongest correlation
with the target variable.

10. Regularization Techniques: Besides Lasso regression, other regularization techniques


like Ridge regression can also help in implicitly performing feature selection by
shrinking less important features' coefficients.
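Two of the techniques listed above, correlation analysis (point 2) and SelectKBest (point 9), can be sketched together on toy data. Here a synthetic "mpg" depends mainly on weight, and an uninformative noise feature is added to show how the selector discards it; the real code would run on the full Auto-MPG feature set.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(3)
n = 200
weight = rng.uniform(1500, 5000, n)
noise_feat = rng.normal(0, 1, n)                  # irrelevant feature
mpg = 45 - 0.008 * weight + rng.normal(0, 1, n)   # mpg driven mainly by weight

df = pd.DataFrame({'weight': weight, 'noise': noise_feat, 'mpg': mpg})

# Correlation analysis: features with large |correlation| vs the target matter most.
print(df.corr()['mpg'])

# SelectKBest keeps the K features with the strongest statistical relationship.
selector = SelectKBest(score_func=f_regression, k=1)
selector.fit(df[['weight', 'noise']], df['mpg'])
kept = np.array(['weight', 'noise'])[selector.get_support()]
print(kept)
```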

Step 4: Splitting Data

By splitting the data set into two subsets, you can train your model on one subset and then
evaluate how well it performs on the other, unseen subset. The end aim of any predictive
model is to perform well on fresh, unforeseen data.

In the case of predicting car mileage, you have a dataset containing attributes (such as
cylinders, displacement, horsepower, weight, acceleration, model year, and origin) and the
associated mileage. Based on these characteristics, we wish to create a model that can
forecast car mileage.

Steps to Split Data:

1. Load and Prepare Data: Load your dataset, pre-process it by handling missing values,
encoding categorical features, and scaling/normalizing numerical features.

2. Separate Features and Target: Split your dataset into two parts: features (X) and
target (y). Features are the input variables that you will use to predict the target
variable (car mileage).

3. Split Data: Use a function or library to divide your dataset into training and testing
subsets. The common practice is to allocate a larger portion of the data for training
(e.g., 80%) and a smaller portion for testing (e.g., 20%). This proportion can vary based
on the size of your dataset.

In Python, you can use the train_test_split function from the
sklearn.model_selection module to achieve this.
· X_train and y_train: These are the subsets of features and target respectively
that will be used for training the model.
· X_test and y_test: These are the subsets of features and target respectively
that will be used for testing and evaluating the model.

4. Training and Testing: You use the X_train and y_train subsets to train your regression
model. This allows the model to learn the relationships between features and target
from the training data.

5. Evaluation: After training, you use the trained model to predict car mileage using the
testing features (X_test). Then, you compare this predicted mileage with the actual
mileage (y_test) to evaluate the model's performance. Common evaluation metrics
for regression include Mean Squared Error (MSE), Root Mean Squared Error (RMSE),
and R-squared
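The five steps above can be sketched end to end as below. The features are random toy data with the same number of instances as Auto-MPG (398); only the split proportions and the train/predict/evaluate flow are the point here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(398, 4))   # 398 instances, as in the Auto-MPG data
y = X @ np.array([4.0, -2.0, 1.0, 0.5]) + rng.normal(0, 0.05, 398)

# Step 3: 80/20 split; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)   # step 4: training
pred = model.predict(X_test)                       # step 5: evaluation
rmse = np.sqrt(mean_squared_error(y_test, pred))
print(X_train.shape, X_test.shape, rmse)
```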

Step 5: Choose a Regression Model:

Let's go through a thorough explanation of how to use Python to create a
regression model for forecasting car mileage. As the chosen regression model for this
discussion, we'll concentrate on linear regression.

Step 6. Model selection


Select a regression algorithm that is suitable for the problem at hand. A linear regression
model is an appropriate option here because of its capacity to capture linear relationships
between the features and the target variable. Using the scikit-learn package, instantiate the
selected model and fit it to the training data; this allows the model to learn patterns and
relationships between the features and the target mileage.

Train the selected regression model using the training data. In the case of Linear Regression,
you can use libraries like scikit-learn (sklearn) to create and train the model.


Step 7: Model Evaluation: Use the testing dataset to evaluate the performance of your
trained model. Common metrics for regression tasks include Mean Squared Error (MSE), Root
Mean Squared Error (RMSE), and R-squared.

1. Mean Squared Error (MSE):

 Purpose: MSE measures the average squared difference between the predicted and
actual values. It provides insight into the overall magnitude of errors in your
predictions.
 Explanation: MSE computes the squared difference for each prediction and then
averages these squared differences. It penalizes larger errors more heavily.
 Formula: MSE = (1/n) Σ(yᵢ - ŷᵢ)², where yᵢ is the actual value and ŷᵢ is the predicted
value for the i-th observation, and n is the number of observations.
 Interpretation: A lower MSE indicates better model performance. It's sensitive to
outliers, as larger errors contribute more significantly to the total.

2. Root Mean Squared Error (RMSE):

 Purpose: RMSE is the square root of the MSE. It provides a metric in the same unit as
the target variable, making it more interpretable.
 Explanation: RMSE is a popular metric for understanding the average size of errors.
It's more sensitive to larger errors due to the squaring and then square root
operations.
 Formula: RMSE = √MSE
 Interpretation: Similar to MSE, lower RMSE values indicate better performance. It's
easy to compare against the range of your target variable.

3. R-squared (R²) or Coefficient of Determination:

 Purpose: R² measures the proportion of the variance in the target variable that's
predictable from the input features. It gives an idea of how well the model explains
the variability in the data.

 Explanation: R² ranges from 0 to 1. A value of 1 means the model perfectly predicts


the target variable, while 0 means the model performs no better than predicting the
mean of the target variable.

 Formula: R² = 1 - (Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)²), where ȳ is the mean of the actual values.
You would assess the effectiveness of the regression method using these measures in
the context of the task at hand, estimating the mileage of cars. Higher R² values and
lower MSE and RMSE values indicate a more effective model.
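The three formulas above can be checked directly with NumPy on a handful of toy actual/predicted mileage values (made up here purely for arithmetic illustration):

```python
import numpy as np

y_true = np.array([21.0, 18.5, 30.0, 26.0])   # actual mileage (toy values)
y_pred = np.array([20.0, 19.0, 29.0, 27.5])   # predicted mileage (toy values)

n = len(y_true)
mse = np.sum((y_true - y_pred) ** 2) / n           # MSE = (1/n) Σ(yᵢ - ŷᵢ)²
rmse = np.sqrt(mse)                                # RMSE = √MSE
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                           # R² = 1 - (Σ(yᵢ-ŷᵢ)² / Σ(yᵢ-ȳ)²)
print(mse, rmse, r2)
```

For these numbers MSE works out to 1.125 and RMSE to about 1.06; the same values would come out of sklearn.metrics.mean_squared_error and r2_score.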

Step 8: Prediction: After the model has been developed and assessed, we can apply it to
generate predictions on fresh, previously unseen data. Give the trained model the feature
values of a car, and it will forecast the car's mileage.

The next stage is to utilize the regression model to make predictions on new data
(new_mileage) after it has been trained and its Mean Squared Error (MSE) and Root Mean
Squared Error (RMSE) have been calculated to assess its performance.

1. new_mileage is a list containing the feature values for the new car you want to predict
the mileage for. This list should have the same format and order as the features you
used for training your model.

2. model.predict() is used to make predictions based on the new features. The method
takes a 2D array-like input, where each row represents a set of features. In this case,
we're providing a single set of features, so we wrap it in a list.

3. predicted_mileage will store the predicted mileage for the new car.

Finally, we can print out the predicted mileage.
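The three prediction steps above can be sketched as follows. The feature layout in the comment is a hypothetical four-feature order, and the training data is a noiseless toy function, so the model and prediction are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
# Hypothetical feature order: [cylinders, displacement, horsepower, weight],
# here rescaled to [0, 1] for the toy example.
X = rng.uniform(0, 1, size=(100, 4))
y = X @ np.array([1.0, -3.0, -2.0, -4.0]) + 40   # deterministic toy mileage

model = LinearRegression().fit(X, y)

# 1. new_mileage: feature values for one new car, same order as training features.
new_mileage = [0.5, 0.5, 0.5, 0.5]
# 2. predict() takes 2D input, so the single row is wrapped in a list.
predicted_mileage = model.predict([new_mileage])
# 3. predicted_mileage holds the forecast; print it.
print(predicted_mileage[0])
```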

Prediction of Diamond Price:
This chapter describes the process of implementing the system. The implementation was
divided into five parts titled Data Set, Data Cleaning and Normalization, Machine Learning
Algorithms, Measurements, and Inference. Each of these parts is explained in its own
section of this chapter and shown in diagrams. The high-level component of the diagram
without a dedicated section of this chapter, Simulated Aging, is detailed in the
measurement section. The entire implementation was written in Python 3 in the Jupyter
Notebook IDE. The libraries utilized are pandas, sklearn (scikit-learn), NumPy, re (regular
expressions), matplotlib, and seaborn.

Step 1: Data Collection and Exploration

Dataset: The dataset was sourced from Kaggle and includes the prices and other attributes of
26,967 diamonds. There are 10 attributes in the dataset, including the target, i.e., price.

Feature description: price - price in US dollars ($326--$18,823). This is the target column.

 The 4 Cs of Diamonds: - carat (0.2--5.01) The carat is the diamond’s physical weight
measured in metric carats. One carat equals 1/5 gram and is subdivided into 100
points. Carat weight is the most objective grade of the 4Cs.

 cut (Fair, Good, Very Good, Premium, Ideal) In determining the quality of the cut, the
diamond grader evaluates the cutter’s skill in the fashioning of the diamond. The more
precise the diamond is cut, the more captivating the diamond is to the eye.

 color, from J (worst) to D (best). The colour of gem-quality diamonds occurs in many
hues, ranging from colourless to light yellow or light brown. Colourless diamonds
are the rarest. Other natural colours (blue, red, and pink, for example) are known as
"fancy," and their colour grading differs from that of white colourless diamonds.

 clarity (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)) Diamonds can have
internal characteristics known as inclusions or external characteristics known as
blemishes. Diamonds without inclusions or blemishes are rare; however, most
characteristics can only be seen with magnification.
 Dimensions

x length in mm (0--10.74)

y width in mm (0--58.9)

z depth in mm (0--31.8)

 depth total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79) The depth of


the diamond is its height (in millimetres) measured from the culet (bottom tip) to the
table (flat, top surface).

 table width of the top of the diamond relative to widest point (43--95)

A diamond's table refers to the flat facet of the diamond seen when the stone is face up. The
main purpose of a diamond table is to refract entering light rays and allow reflected light rays
from within the diamond to meet the observer’s eye. The ideal table cut diamond will give
the diamond stunning fire and brilliance.

Step 2: Data Pre-processing

Data pre-processing is a critical step, particularly for regression models predicting diamond
prices. It entails cleaning, converting, and organizing the raw data to prepare it for machine
learning model training and evaluation. Here is a description of the data pre-processing
procedures you might use in Python when predicting the price of diamonds.

1. Handling Missing Values:


Check for missing values in your dataset. If there are missing values, you need to decide
how to handle them. Common strategies include removing rows with missing values
(df.dropna()), filling missing values with the mean or median, or using more
advanced techniques like interpolation.

2. Encoding Categorical Variables:


Machine learning algorithms typically require numerical inputs, so you need to encode
categorical variables (like cut, color, clarity) into numerical values. One-hot encoding
is a common technique for this purpose (pd.get_dummies()).

3. Scaling and Normalization: Features with different scales can affect the performance
of certain regression algorithms. Scaling and normalization help bring features to a
similar scale.

4. Feature Selection: Identify which features are relevant for predicting the target
variable (price). You can use techniques like correlation analysis, feature importance
from tree-based models, or domain knowledge to decide which features to keep.

5. Train-Test Split: Split your dataset into training and testing sets. The training set is
used to train the model, while the testing set is used to evaluate its performance.
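Steps 1-3 above can be sketched on a few toy diamond rows. The column names follow the dataset description; the median fill and StandardScaler choices are illustrative assumptions, not necessarily what the report's own code used.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy diamond rows (column names follow the dataset description above).
df = pd.DataFrame({'carat': [0.3, 1.2, 0.8, None],
                   'cut': ['Ideal', 'Fair', 'Premium', 'Good'],
                   'price': [500, 4000, 2500, 1800]})

# 1. Handle missing values: fill the numeric gap with the column median.
df['carat'] = df['carat'].fillna(df['carat'].median())

# 2. One-hot encode the categorical column(s).
df = pd.get_dummies(df, columns=['cut'])

# 3. Scale the numeric feature to zero mean / unit variance.
df[['carat']] = StandardScaler().fit_transform(df[['carat']])
print(df.columns.tolist())
```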


Here's how you can approach feature selection for predicting diamond prices using Python:

Step 3: Feature Selection

Choosing the most relevant and informative characteristics from your dataset to train the
model is an important stage in creating a predictive model. When forecasting diamond prices
using regression, it is important to select characteristics that have a strong association with
the target variable (price) and can help the model produce accurate predictions.
1. Domain Knowledge: Start by considering your domain knowledge. Think about which
features are intuitively expected to affect the price of a diamond. For instance, carat
weight, cut quality, color, and clarity are commonly known factors that influence
diamond prices.

2. Correlation Analysis: Calculate the correlation between each numerical feature and
the target variable (price). Features with higher absolute correlation values are
generally more important. You can use tools like the Pandas library to calculate
correlations and visualize them using plots like heatmaps.

3. Visualization: Visualize the relationships between individual features and the target
variable. You can use scatter plots, box plots, or other relevant plots to understand
how changes in each feature correspond to changes in diamond prices.

4. Feature Importance from Models: Train a preliminary model (e.g., a simple linear
regression or decision tree) and use the feature importance scores provided by the
model. Some models, like Random Forests, can quantify the importance of each
feature in predicting the target variable.

5. Recursive Feature Elimination: This method involves repeatedly training a model and
removing the least important feature in each iteration until a desired number of
features is reached. Scikit-learn provides the RFECV class that can help with this
process.
6. Lasso Regression (L1 Regularization): Lasso regression applies L1 regularization,
which encourages the model to set coefficients of less important features to zero. This
effectively performs feature selection during the model training process.

7. Variance Thresholding: If you have features with very low variance, they might not
contribute much to the prediction. You can use scikit-learn's Variance Threshold to
remove features with variance below a certain threshold.

8. Feature Engineering: Sometimes, combining or transforming features can create
more informative variables. For example, you could create a feature that represents
the product of carat weight and cut quality score.

9. SelectKBest and SelectPercentile: These techniques use statistical tests to select the
top K features or a certain percentage of features that have the strongest correlation
with the target variable.

10. Regularization Techniques: Besides Lasso regression, other regularization techniques


like Ridge regression can also help in implicitly performing feature selection by
shrinking less important features' coefficients.
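Lasso-based feature selection (points 6 and 10 above) can be sketched on toy diamond data. A synthetic price is driven by carat only, plus one uninformative feature; with standardized inputs and a sufficiently large alpha, L1 regularization shrinks the unimportant coefficient to zero. The alpha value is an illustrative choice.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
n = 300
carat = rng.uniform(0.2, 2.0, n)
noise_feat = rng.normal(0, 1, n)                  # uninformative feature
price = 8000 * carat + rng.normal(0, 100, n)      # price driven by carat

# Standardize features so the L1 penalty treats them on an equal footing.
X = StandardScaler().fit_transform(np.column_stack([carat, noise_feat]))
model = Lasso(alpha=50.0).fit(X, price)

# The coefficient of the irrelevant feature is shrunk toward (here, to) zero,
# effectively selecting carat as the useful feature.
print(model.coef_)
```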

Step 4: Splitting data

By splitting the data set into two subsets, you can train your model on one subset and then
evaluate how well it performs on the other, unseen subset. The end aim of any predictive
model is to perform well on fresh, unforeseen data.

In the case of predicting diamond prices, you have a dataset containing attributes (such as
carat, cut, color, clarity, etc.) and the associated prices. Based on these characteristics, we
wish to create a model that can forecast diamond prices.

Steps to Split Data:

1. Load and Prepare Data: Load your dataset, pre-process it by handling missing values,
encoding categorical features, and scaling/normalizing numerical features.

2. Separate Features and Target: Split your dataset into two parts: features (X) and
target (y). Features are the input variables that you will use to predict the target
variable (diamond price).

3. Split Data: Use a function or library to divide your dataset into training and testing
subsets. The common practice is to allocate a larger portion of the data for training
(e.g., 80%) and a smaller portion for testing (e.g., 20%). This proportion can vary based
on the size of your dataset.

In Python, you can use the train_test_split function from the sklearn.model_selection
module to achieve this.
· X_train and y_train: These are the subsets of features and target respectively
that will be used for training the model.
· X_test and y_test: These are the subsets of features and target respectively
that will be used for testing and evaluating the model.

4. Training and Testing: You use the X_train and y_train subsets to train your regression
model. This allows the model to learn the relationships between features and target
from the training data.

5. Evaluation: After training, you use the trained model to predict diamond prices using
the testing features (X_test). Then, you compare these predicted prices with the
actual prices (y_test) to evaluate the model's performance. Common evaluation
metrics for regression include Mean Squared Error (MSE), Root Mean Squared Error
(RMSE), and R-squared

Step 5: Choose a Regression Model:

Let's go through a thorough explanation of how to use Python to create a
regression model for forecasting diamond prices. As the chosen regression model for this
discussion, we'll concentrate on linear regression.

Step 6: Model Training: Train the selected regression model using the training data. In the
case of Linear Regression, you can use libraries like scikit-learn (sklearn) to create and train
the model.
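This training step can be sketched as below. The toy data uses a deterministic price built from carat and a hypothetical per-cut bonus, so the fitted model recovers the relationship exactly; real diamond data would of course be noisier.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 200
df = pd.DataFrame({'carat': rng.uniform(0.2, 2.0, n),
                   'cut': rng.choice(['Fair', 'Good', 'Ideal'], n)})
# Hypothetical additive bonus for better cuts, used only to build toy prices.
cut_bonus = df['cut'].map({'Fair': 0, 'Good': 500, 'Ideal': 1500})
df['price'] = 8000 * df['carat'] + cut_bonus

# One-hot encode 'cut', then fit the linear regression on the training frame.
X = pd.get_dummies(df[['carat', 'cut']], columns=['cut'])
model = LinearRegression().fit(X, df['price'])
print(round(model.score(X, df['price']), 3))   # R² on the toy data
```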


Step 7: Model Evaluation: Use the testing dataset to evaluate the performance of your
trained model. Common metrics for regression tasks include Mean Squared Error (MSE), Root
Mean Squared Error (RMSE), and R-squared.

1. Mean Squared Error (MSE):

 Purpose: MSE measures the average squared difference between the predicted and
actual values. It provides insight into the overall magnitude of errors in your
predictions.

 Explanation: MSE computes the squared difference for each prediction and then
averages these squared differences. It penalizes larger errors more heavily.
 Formula: MSE = (1/n) Σ(yᵢ - ŷᵢ)², where yᵢ is the actual value and ŷᵢ is the predicted value
for the i-th observation, and n is the number of observations.
 Interpretation: A lower MSE indicates better model performance. It's sensitive to
outliers, as larger errors contribute more significantly to the total.

2. Root Mean Squared Error (RMSE):

 Purpose: RMSE is the square root of the MSE. It provides a metric in the same unit as
the target variable, making it more interpretable.
 Explanation: RMSE is a popular metric for understanding the average size of errors.
It's more sensitive to larger errors due to the squaring and then square root
operations.
 Formula: RMSE = √MSE
 Interpretation: Similar to MSE, lower RMSE values indicate better performance. It's
easy to compare against the range of your target variable.

3. R-squared (R²) or Coefficient of Determination:

 Purpose: R² measures the proportion of the variance in the target variable that's
predictable from the input features. It gives an idea of how well the model explains
the variability in the data.
 Explanation: R² ranges from 0 to 1. A value of 1 means the model perfectly predicts
the target variable, while 0 means the model performs no better than predicting the
mean of the target variable.
 Formula: R² = 1 - (Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)²), where ȳ is the mean of the actual values.
You would assess the effectiveness of the regression method using these measures in the
context of the task at hand, estimating the price of diamonds. Higher R² values and lower
MSE and RMSE values indicate a more effective model.

Step 8: Prediction: After the model has been developed and assessed, we can apply it
to generate predictions on fresh, previously unseen data. Give the trained model the
feature values of a diamond, and it will forecast the diamond's price.

The next stage is to utilize the regression model to make predictions on new data (new_price)
after it has been trained and its Mean Squared Error (MSE) and Root Mean Squared Error
(RMSE) have been calculated to assess its performance.

1. new_price is a list containing the feature values for the new diamond you want to
predict the price for. This list should have the same format and order as the features
you used for training your model.
2. model.predict() is used to make predictions based on the new features. The method
takes a 2D array-like input, where each row represents a set of features. In this case,
we're providing a single set of features, so we wrap it in a list.
3. predicted_price will store the predicted price for the new diamond.

Finally, we can print out the predicted price of the diamond.

CHAPTER: 6
6.1 RESULT AND ANALYSIS:
A number of significant findings and conclusions have been drawn from the training
and evaluation of the regression model for predicting automobile mileage. In order to
reliably estimate the mileage (MPG) of various automobile models, the model was
constructed using a dataset comprising parameters such as engine size, horsepower,
weight, and fuel type. The key findings and the subsequent analysis are outlined in the
following sections.

1. Model Performance:
The performance of the model was carefully analysed on a previously unseen
testing dataset, and the results were appraised using standard regression assessment
measures. The Mean Squared Error (MSE), which measures the average squared difference
between predicted and actual values, was computed as [MSE value for car mileage
is 8.9209 and for Diamond price is 392819.70]. The Root Mean Squared Error (RMSE),
which provides a measure of the model's predictive accuracy in MPG units, was
calculated as [RMSE value for car mileage is 1.5272 and for Diamond price is 21.5086].
These measurements show the degree of divergence between predicted and actual
mileage numbers.

2. Coefficient of Determination (R-squared):


The R-squared value (coefficient of determination) was calculated to determine how
much of the variability in the target variable (mileage) is explained by the model's
predictor variables. The calculated R-squared value of [R-squared value] indicates that
the model explains [R-squared percentage] % of the variance in car mileage. This
metric highlights the extent to which the selected features contribute to the model's
predictive accuracy.
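The three metrics above (MSE, RMSE, R-squared) can be computed with scikit-learn as follows. The actual and predicted MPG values here are synthetic placeholders, not the project's results:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([18.0, 24.0, 31.0, 15.0, 27.0])   # actual MPG (illustrative)
y_pred = np.array([17.1, 25.2, 29.8, 16.0, 26.5])   # model predictions (illustrative)

mse = mean_squared_error(y_true, y_pred)  # average squared error
rmse = np.sqrt(mse)                       # same units as the target (MPG)
r2 = r2_score(y_true, y_pred)             # fraction of variance explained

print(f"MSE={mse:.4f}  RMSE={rmse:.4f}  R^2={r2:.4f}")
```

RMSE is simply the square root of MSE, which is why it is reported in MPG units while MSE is in squared units.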

3. Feature Significance:
To determine their impact on the predicted mileage, the regression coefficients
associated with each predictor feature were examined. Notably, factors such as engine
size, horsepower, and weight exhibited sizable coefficients, indicating that changes in
these characteristics are strongly associated with changes in automobile mileage. The
coefficients of the relevant one-hot encoded features further showed that the
presence of specific fuel types led to variations in mileage.
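Coefficient inspection of this kind can be sketched as below. The data is synthetic, with the target built as an exact linear function of the features, so the fitted coefficients recover the chosen weights; the feature names are assumptions, not the project's actual columns:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

feature_names = ["engine_size", "horsepower", "weight"]

rng = np.random.default_rng(0)
X = rng.uniform(1.0, 5.0, size=(50, 3))
# Synthetic target: mileage that falls as each feature rises.
y = 40.0 - 2.0 * X[:, 0] - 1.5 * X[:, 1] - 3.0 * X[:, 2]

model = LinearRegression().fit(X, y)

# Each coefficient's sign gives the direction of association and its
# magnitude the strength, for features on comparable scales.
for name, coef in zip(feature_names, model.coef_):
    print(f"{name}: {coef:+.3f}")
```

In practice, features should be standardized before comparing coefficient magnitudes, since raw coefficients depend on each feature's units.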

4. Prediction Accuracy:
The model was applied to novel input data to generate predictions for car mileage.
The predictions aligned closely with actual mileage values, thereby signifying the
model's ability to generalize its learned patterns to new data points. This accuracy
holds promising implications for practical applications, where informed estimations of
car mileage play a pivotal role in decision-making processes.

5. Limitations and Future Considerations:


It is imperative to acknowledge certain limitations of the model. While the chosen
features capture a significant portion of the variance in car mileage, other
unaccounted factors might influence fuel efficiency. Future iterations of the model
could incorporate additional attributes or explore more advanced regression
techniques to potentially enhance prediction accuracy.

6. Results and Analysis: Predictions Using Regression for Diamond Prices


We show the outcomes of diligent data preparation, model training, and model
evaluation after a thorough regression analysis for forecasting diamond prices. The
objective was to create a reliable forecast model to calculate diamond values based
on a number of inherent characteristics. The following list enumerates the results of
this effort:

7. Data Preprocessing: Before modelling, the diamond dataset was cleaned and
prepared: records were checked for missing and inconsistent values, and categorical
attributes such as cut, colour, and clarity were encoded into numerical form so that
they could be used by the regression model.
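The categorical encoding step can be sketched with pandas one-hot encoding. The column names follow the common diamond-dataset convention and the rows are invented for illustration:

```python
import pandas as pd

# Miniature stand-in for the diamond dataset (values are illustrative).
df = pd.DataFrame({
    "carat":   [0.5, 1.0, 1.5],
    "cut":     ["Ideal", "Premium", "Good"],
    "clarity": ["VS1", "SI1", "VS2"],
    "price":   [1500, 5000, 9000],
})

# One-hot encode the categorical grades into 0/1 indicator columns;
# numeric columns (carat, price) pass through unchanged.
encoded = pd.get_dummies(df, columns=["cut", "clarity"])
print(encoded.columns.tolist())
```

Each distinct category value becomes its own indicator column (e.g. cut_Ideal), which is the one-hot representation referred to elsewhere in this report.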

8. Feature Selection: Attributes with a significant impact on diamond prices were
selected as features for the regression model. These attributes included carat weight,
cut quality, color grade, clarity grade, and dimensions. The chosen features were used
to construct a feature matrix that served as the foundation for training the regression
model.

9. Model Selection and Training: We used a regression technique designed for
forecasting continuous values, in this case diamond prices. The chosen model was
trained on a subset of the data to learn the correlations between the selected features
and the target variable, the price of the diamond. Adjustments were made to enhance
the model's predictive capacity.
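The hold-out training step described here can be sketched as follows. The data is synthetic (price driven by a single carat-like feature plus noise), so the exact split proportions and model choice are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: one feature (e.g. carat weight) driving price.
rng = np.random.default_rng(42)
X = rng.uniform(0.3, 2.5, size=(100, 1))
y = 4000.0 * X[:, 0] + rng.normal(0.0, 50.0, 100)

# Reserve 20% of the data as an unseen test split for later evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print(f"Train R^2: {model.score(X_train, y_train):.3f}")
```

Keeping the test split untouched until evaluation is what allows the metrics in the next item to estimate performance on genuinely unseen data.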

10. Model Evaluation: The trained model's performance was rigorously evaluated using
established regression metrics: Mean Squared Error (MSE), Root Mean Squared Error
(RMSE), and R-squared (coefficient of determination). These metrics quantified the
model's accuracy in predicting diamond prices and its ability to explain the variability
in price based on the provided attributes.

Results: Upon evaluation, the regression model displayed a compelling level of
predictive capability. The Mean Squared Error, a measure of the model's prediction
accuracy, was low, indicating that the model's predictions were generally close to the
actual diamond prices. The Root Mean Squared Error, which measures the average
prediction error, further corroborated the model's efficacy in producing accurate
predictions.

11. Analysis: The analysis of the regression model's performance revealed that the
selected attributes, such as carat weight, cut quality, color grade, clarity grade, and
dimensions, significantly influence diamond prices. The model was successful in
capturing the intricate relationships between these attributes and the final price of
the diamond. Moreover, the R-squared value indicated that a substantial proportion
of the price variability could be explained by the provided attributes.
In conclusion, the results of this regression analysis demonstrate the model's
effectiveness in predicting diamond prices with a high degree of accuracy. The analysis
reaffirms the importance of attributes such as carat weight, cut quality, color grade,
clarity grade, and dimensions in determining diamond prices. These findings
contribute valuable insights to the diamond industry, enabling informed decision-
making and pricing strategies based on quantifiable attributes.

CHAPTER: 7
7.1 CONCLUSION:
In conclusion, the project that was centred on the estimation of car mileage has
produced insightful findings. We were able to develop a predictive model with a
commendable level of accuracy in estimating car mileage by combining advanced
machine learning techniques with a large dataset. The model shows its ability to
generalize and make reasonably accurate predictions on unobserved data by
examining various factors including engine specifications, vehicle weight, and
aerodynamics. The project also demonstrated the value of feature engineering in
improving predictive performance. The results of this project have important
applications, especially in the automotive sector, where precise mileage forecasting
can help with improving vehicle design, increasing fuel efficiency, and empowering
consumers to make wise decisions. However, there is still room for advancement,
including investigating the incorporation of more sophisticated algorithms and
incorporating real-time data for dynamic predictions. In general, this project highlights
both the potential of machine learning in the automotive industry and the iterative
process of improving predictive models for more accurate results.

Similarly, the project that was primarily concerned with the forecasting of diamond
prices has shown insightful findings and promising results. A reliable diamond price
prediction model was created using meticulous data collection, pre-processing, and
feature engineering, as well as cutting-edge machine learning algorithms. The model
proved to be highly accurate at estimating diamond values based on a variety of
important characteristics, including carat weight, cut, clarity, color, and depth.
According to industry standards, these characteristics have been shown to be
significant predictors of diamond prices. Additionally, the model's generalizability and
potential applicability in real-world scenarios are highlighted by the project's success
in utilizing a diverse dataset from reliable sources. It's crucial to recognize the inherent
complexity of diamond pricing, which can be impacted by consumer preferences,
market trends, and geopolitical factors. Although the developed model provides a
strong foundation for estimating diamond prices, ongoing improvement and

validation using new data will be necessary to keep the model current and accurate
over time. This project not only demonstrates machine learning's ability to predict
outcomes in the field of gem valuation, but it also paves the way for future
investigation and study in the field of luxury item pricing forecasting.

7.2 FUTURE SCOPE


Future developments and applications for the car mileage prediction project have a
lot of potential. The accuracy and adaptability of the model could be improved by
enlarging the dataset to include a wider array of automobile models, production years,
and driving conditions. The creation of dynamic mileage predictions that adjust to
changing road and weather conditions may be made possible by incorporating real-
time data from IoT sensors and GPS systems. Additionally, incorporating extra features
like maintenance logs, engine specs, and driving styles may produce a more thorough
understanding of the variables affecting fuel efficiency. The integration of predictive
mileage technology into onboard vehicle systems could be facilitated by cooperation
with automakers, providing drivers with real-time feedback to improve their driving
habits. The project's success might also open the door for comparable predictive
models in the transportation industry, assisting with fleet management, eco-friendly
driving techniques, and urban planning initiatives. In conclusion, the project's future
scope will encompass a comprehensive strategy that includes cutting-edge data
sources, real-time applications, and cooperative efforts with industry stakeholders.

Similarly, for diamond price prediction, the project's future scope offers intriguing
opportunities for improvement and innovation. Integrating advanced gemmological
data, such as fluorescence, symmetry, and other intricate attributes, could give a more
nuanced understanding of diamond valuation, increasing the model's accuracy. A
larger dataset and better feature engineering could result from partnerships with
jewellery specialists and gemmologists. Integrating external factors like economic
indicators and consumer preferences could improve the model's predictive abilities in
light of the constantly changing market trends. By embracing cutting-edge
technologies like blockchain for provenance verification, the diamond market could

gain more transparency, which would affect prices and merit inclusion in the
prediction model. Additionally, extending the application to rare minerals and colored
gemstones could diversify it and draw more attention. In conclusion, the future
potential of the project lies in a multidisciplinary approach that combines
gemmological expertise, technological advancements, and dynamic data sources to
create a reliable prediction model adaptable to the ever-changing diamond industry
landscape.

REFERENCES:
1. Class Notes

2. Annina S., Mahima S. D., Ramesh B., "An Overview of Machine Learning and its Applications",
International Journal of Electrical Sciences & Engineering (IJESE), 2015, pp. 22-24.

3. Dataset Published on Kaggle

The dataset “Auto mpg”, originally from the UCI Machine Learning Repository, was uploaded
to Kaggle by an anonymous user. As of the writing of this report, the last modification to the
dataset was 6 years ago.

The dataset is available for viewing and downloading at the following link:

Auto-mpg dataset | Kaggle

4. Dataset Published on Kaggle

The dataset “cubic zirconia”, originally from the UCI Machine Learning Repository, was
uploaded to Kaggle by an anonymous user. As of the writing of this report, the last
modification to the dataset was 6 years ago.

The dataset is available for viewing and downloading at the following link:

cubic_zirconia | Kaggle

5. Some ideas for the source code were taken from Kaggle:

The Source Code is available for viewing and downloading at the following link:

https://www.kaggle.com/code/ghousethanedar/car-mileage-prediction-model
