Minor Project Report
Submitted By
Biswajeet Biswal
(2224100026)
CERTIFICATE
This is to certify that the project report entitled “FLIGHT FARE PREDICTION
SYSTEM” being submitted by Mr. Biswajeet Biswal bearing the registration no:
2224100026 in partial fulfilment for the award of the degree of Master of Computer
Applications to the Odisha University of Technology and Research is a record of bona fide
work carried out by him under my guidance and supervision.
The results embodied in this project report have not been submitted to any other
University or Institute for the award of any Degree or Diploma.
DECLARATION
ACKNOWLEDGEMENT
I would like to express my sincere gratitude to my advisor, Mrs. Susmita Pal, School of
Computer Science, whose knowledge and guidance have motivated me to achieve my goals. She
has consistently been a source of motivation, encouragement, and inspiration. The time I have
spent working under her supervision has truly been a pleasure.
I take it as a great privilege to express my heartfelt gratitude to Dr. Jibitesh Mishra, Head of the
School of Computer Science, for his valuable support, and to all senior faculty members of the
School of Computer Science for their help during my course. Thanks also to the programmers and
non-teaching staff of the School of Computer Science.
Finally, special thanks to my parents for their support and encouragement throughout my life
and this course. Thanks to all my friends and well-wishers for their constant support.
ABSTRACT
The Flight Fare Prediction System is an innovative application leveraging machine learning
algorithms to forecast and analyze airfare prices. By harnessing historical flight data, user preferences,
and real-time market trends, the system employs sophisticated models to predict future ticket prices
accurately. This system aims to empower travelers with reliable estimations, assisting them in making
informed decisions about when to book flights. Through an intuitive interface, users can input their
travel details and receive projected fare information, enabling them to optimize their travel budget and
secure cost-effective flight options. The system's intelligent insights and predictive capabilities serve
as a valuable tool in navigating the dynamic and often fluctuating landscape of air travel pricing.
CONTENTS
1 – Introduction
1.1- Background
1.2– Project scope
1.3– Existing Work
1.4 – Proposed System
1.5 – Project workflow
2 – Algorithms Used
2.1 – Machine Learning Basics
2.1.1 – Data Mining Workflow
2.1.2 – Supervised and Unsupervised Learning
2.1.3 – Machine Learning Algorithms
2.1.4 – Evaluation Metrics
3 – Methodology
3.1 – Data Review
3.2 - Software Introduction
3.3 – Data Flow Diagram
4 – Analysis
4.1 – Exploratory Data Analysis
4.1.1 – Data Identification
4.1.2 – Visualizing the data
5 – Model Training
5.1 – Preparing the data for modelling
5.1.1 – Separating independent and dependent variables
5.1.2 - Feature engineering
5.1.3 – Splitting training and testing data
5.2 – Model testing and Evaluation function
5.2.1 – Description
5.2.2 - Results
5.3 – Plotting
5.3.1 – Reg Plot
6 – Modular Project Structure
6.1 – Folder Structure
6.2 – Pipelining
7 – User View
8 – Conclusion
9 – References/Bibliography
1. INTRODUCTION
1.1. Background
• The project entitled “Flight Fare Prediction System” is solely focused on predicting
the ticket price based on given features.
• This project helps calculate more accurate and efficient ticket prices for a journey
from one location to another.
• A machine learning algorithm, the Random Forest regressor, is implemented to
predict the price of the ticket.
2. Consider the impact of journey time on fare prediction. Longer or shorter flights
might have different pricing dynamics, which could be factored into the model to
enhance accuracy.
3. Evaluate how the number of stops or layovers affects fares. Non-stop flights
might have different pricing dynamics compared to flights with multiple stops,
and this information could enhance the model's predictive power.
4. Based on the conclusions drawn from the above points, we will build a
machine learning model which predicts the ticket price given the user’s
attributes.
1.5. Project Workflow
1. Data Collection: In this step, we collect the data from different sources.
2.1. Data Processing: In this step we clean the data, performing operations such as
dealing with missing values, removing outliers, etc.
2.2. Feature Extraction and Engineering: In this step, with the help of different
plotting methods and statistics we find out which features are relevant to the
project and add more relevant columns which will improve the efficiency of the
project. This will prepare the dataset for the next steps.
2.3. Feature Scaling and Selection: In this step, we identify the numerical
features and the categorical features. On the categorical features we perform One
Hot Encoding to convert the categories into numbers, and on the numerical
features we perform standard scaling to make the numeric values
compatible with the rest of the data.
2.4. Splitting the data: In this step, the data is split into a training and a testing
set (80 : 20). The training set is used to train the machine learning model, and the
testing set is used to test the predictions of our model.
4. Modelling: In this step, the model with the highest accuracy is used to build the final
model, which is then saved as the final “.pkl” file (a sketch follows at the end of this section).
5. Model Evaluation and Tuning: In this step, we use several evaluation metrics
to test the accuracy of the model on our testing dataset. The accuracy of our model can
be improved using a method known as Hyperparameter Tuning (also sketched at the end of
this section). If our model’s accuracy score does not reach a specific threshold (0.88, in our
case), we go back to step 2 and repeat the whole process.
6. Deployment and Monitoring: In this step, we make our machine learning model user
friendly by building a pipeline and connecting it to a user interface. After building the UI, we
make the model publicly available by deploying it using a cloud service such as Microsoft
Azure or Amazon AWS.
Diagrammatic representation of the “Project Work Flow”
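As an illustration of step 4, here is a minimal sketch of saving a trained model to a “.pkl” file with pickle. The placeholder data and the file name flight_rf.pkl are assumptions made for the sketch, not taken from the project.

    import pickle
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    # placeholder data; in the project, X_train and y_train come from steps 1-2.4
    rng = np.random.default_rng(42)
    X_train, y_train = rng.random((100, 5)), rng.random(100) * 10000

    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    with open("flight_rf.pkl", "wb") as f:  # file name is illustrative
        pickle.dump(model, f)

And a minimal sketch of the Hyperparameter Tuning mentioned in step 5, using scikit-learn's RandomizedSearchCV; the parameter grid shown is an assumption for illustration.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import RandomizedSearchCV

    rng = np.random.default_rng(0)
    X_train, y_train = rng.random((200, 5)), rng.random(200) * 10000  # placeholder data

    params = {"n_estimators": [100, 300, 500], "max_depth": [5, 10, None]}  # illustrative grid
    search = RandomizedSearchCV(RandomForestRegressor(random_state=0),
                                param_distributions=params, n_iter=5, cv=5, scoring="r2")
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)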
2. ALGORITHMS USED
2.1 Machine Learning Basics
2.1.1. Data Mining Workflow
1. The most basic workflow for data mining, and therefore machine learning, can be divided
into six steps. The first step is data acquisition, as insufficient or
biased data can lead to wrong results. In machine learning or data mining this data
usually has to be quite large, as patterns might only emerge with thousands or millions of
individual data points.
2. The second and maybe most important step is the cleaning of the given data. Problems with
data can include unsuitable features, noisy values, and so on. Features can be nominal (such as
yes/no or male/female), ordinal (rankings such as school grades) or numerical
(temperatures, costs and other parameters), and sometimes features have to be converted.
An example would be to label all days with average temperatures over 10°C as "warm"
and all below as "cold". Outliers might either be interesting, as in anomaly detection, or
can affect the outcome of the learning process negatively, e.g. when using experimental
data, where outliers have to be removed; this has to be considered in the
cleaning process. After all, a scientific or analytical process can only be as good as the
data provided, as the saying jokingly goes: "garbage in, garbage out".
3. The third and fourth steps, with the cleaned data and a decision about the chosen
modeling technique, are to build and train a model. This is possibly the most abstract step
in data mining, especially when pre-built programs such as Azure Machine Learning
Studio or scikit-learn (in Python) are used. Simply put, the training process is finding
structures in the given training data. What kind of data or features these are is heavily
dependent on the goal (clustering or prediction etc.). Taking a simple example of model
training: a very famous dataset is the iris-dataset, which includes features of three different
iris-species and uses clustering to determine the correct subspecies. Features included in
this set are different parts of the flower, such as petal or sepal length. When using
clustering, the training algorithm iteratively calculates the most common features and
therefore allows a grouping process of all the data points. The result is heavily dependent
on the complexity of these clusters, just as in regression the result is dependent on the
complexity of a curve. As can be seen in Figure 2.1, it can be quite hard to determine if
the bottom-most red data point belongs to the species-A-cluster or to the species-B-
cluster, depending on the method of measurement.
4. Testing and evaluating the model is mostly done by statistical methods and will seldom
give a result of a 100% match between the training and validation data. Considering one of
the most intuitive and simple data mining models, linear regression, this uncertainty is
mostly covered by introducing a measurement of uncertainty.
2.1.2. Supervised and Unsupervised Learning
In machine learning an abundance of models and algorithms can be found, but most
fundamentally these are divided into supervised and unsupervised learning.
a. One fundamental example has been mentioned in the foregoing section, the
clustering of iris-species. This is a supervised process when data points are
labeled ("species A", "species B" or "species C") and labels are calculated for new
data points. Comparing the labels calculated according to the trained model with the
original labels gives the model’s accuracy, hence supervised.
b. Unsupervised learning on the other hand does not require any labeling, since the
algorithm is searching for a pattern in the data. This might be useful when
categorizing customers into different groups without a priori knowledge of which
groups they belong to.
2.1.3. Machine Learning Algorithms
In our project only Regression is used, so here are some of the commonly used regression
algorithms:
Linear regression models assume that the relationships between input and output variables are
linear. These models are quite simplistic, but in many cases provide adequate and tractable
representations of the relationships. The model aims at a prediction of the real output data Y from the
given input data X = (x_1, x_2, …, x_p) and has the following form:

f(X) = β_0 + Σ_{j=1}^{p} β_j x_j

β describes the initially unknown coefficients. Linear models with more than one input variable (p
> 1) are called multiple linear regression models. The best known estimation method for linear
regression is the least squares method. In this method, the coefficients β = (β_0, β_1, …,
β_p) are determined in such a way that the Residual Sum of Squares (RSS) becomes minimal:

RSS(β) = Σ_{i=1}^{n} (y_i − f(x_i))²

Here, y_i − f(x_i) describes the residuals, β_0 the estimate of the intercept term, and β_j the
estimates of the slope parameters.
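As a sketch of the least squares method, the following toy example fits a multiple linear regression with scikit-learn; the data is synthetic and generated only for illustration.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # toy data generated from y = 3 + 2*x_1 - x_2 plus noise
    rng = np.random.default_rng(0)
    X = rng.random((200, 2))
    y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(0, 0.1, 200)

    reg = LinearRegression().fit(X, y)  # least squares fit, minimizing the RSS
    print(reg.intercept_)  # estimate of beta_0 (close to 3)
    print(reg.coef_)       # estimates of beta_1 and beta_2 (close to 2 and -1)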
A Decision Tree grows by iteratively splitting tree nodes until the ‘leaves’ contain no more
impurities or a termination condition is reached. The creation of the Decision Tree starts at
the root of the tree and splits the data in a way that results in the largest Information Gain IG.
In binary Decision Trees, the total dataset D_p is divided by an attribute a into D_left
and D_right. Accordingly, the information gain is defined as:

IG(D_p, a) = I(D_p) − (N_left / N_p) · I(D_left) − (N_right / N_p) · I(D_right)

where I is the impurity measure, N_p is the number of samples in the parent node, and N_left
and N_right are the numbers of samples in the left and right child nodes.
The algorithm aims at maximizing the information gain, i.e. the method wants to split the
total dataset in such a way that the impurity in the child nodes is reduced the most.
While classification uses entropy or the Gini coefficient as a measure of impurity, regression
uses, for example, the Mean Squared Error (MSE) as a measure of the impurity of a node.
Splitting methods that use the Mean Squared Error to determine impurity are also
called variance reduction methods. Usually, the tree size is controlled by the maximum
number of nodes max_depth, at which the division of the dataset stops.
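A minimal sketch of such a regression tree in scikit-learn, again on synthetic data; "squared_error" is scikit-learn's name for the MSE criterion, i.e. the variance-reduction splitting described above.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.random((200, 1))
    y = np.sin(4 * X[:, 0]) + rng.normal(0, 0.05, 200)  # toy regression target

    tree = DecisionTreeRegressor(criterion="squared_error", max_depth=5)  # max_depth limits tree size
    tree.fit(X, y)
    print(tree.get_depth(), tree.get_n_leaves())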
The Random Forest has a number of hyperparameters. The most crucial one, besides
the maximum depth of the trees max_depth, is the number of decision trees n_estimators. By
default, the Mean Square Error (MSE) is used as criterion for splitting the dataset as the trees
grow.
The following figure shows an example model for the Random Forest. The way it works
results in the characteristic “step” form:
2.1.4. Evaluation Metrics
MSE (model): Mean Squared Error of the predictions against the actual values.
MSE (baseline): Mean Squared Error of the mean prediction against the actual values.
A common way to combine these two quantities is the R² score, R² = 1 − MSE(model) / MSE(baseline), which is also used to evaluate the models later in this report.
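A small numeric sketch of these two quantities and the resulting R² score, on made-up fare values:

    import numpy as np

    y_true = np.array([3500., 7200., 4100., 9800.])  # toy actual fares
    y_pred = np.array([3700., 6900., 4300., 9500.])  # toy predicted fares

    mse_model = np.mean((y_true - y_pred) ** 2)            # MSE of the model's predictions
    mse_baseline = np.mean((y_true - y_true.mean()) ** 2)  # MSE of always predicting the mean
    r2 = 1 - mse_model / mse_baseline
    print(mse_model, mse_baseline, r2)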
3. Methodology
3.1. Data Review
1. Dataset Source - https://www.kaggle.com/datasets/nikhilmittal/flight-fare-prediction-mh
2. This data contains 11 columns and 10683 rows.
3. Logical View
g. Arrival_Time: The time at which the flight arrives at its destination.
1. Anaconda: A Python distribution for data science and machine learning. It includes
tooling for Python virtual environments, whose main purpose is to create isolated
environments where each project is independent of the others, using its own dependencies.
2. Jupyter Notebook: Jupyter Notebook is an open-source web-based interactive
computing environment used for data analysis, visualization, and the creation of
computational notebooks. It allows users to create and share documents that contain
live code, equations, visualizations, and narrative text.
3. VS Code: Short for Visual Studio Code, a free and open-source source code editor
developed by Microsoft. It is widely used by programmers and developers across various
programming languages and platforms.
VS Code is designed to be lightweight, customizable, and highly extensible, making
it suitable for a wide range of programming tasks. It provides a user-friendly interface
and a powerful set of features that enhance productivity and streamline the
development process.
4. Flask: Flask is a lightweight web framework for Python that allows developers to
build web applications quickly and with minimal boilerplate code. It is known for
its simplicity, flexibility, and ease of use. Flask is often referred to as a "micro"
framework because it provides only the essential features needed to create web
applications, without imposing any strict architectural patterns or dependencies.
Insights
1. There are five categorical attributes: Airline, Source, Destination, Route,
Additional_Info.
2. The other six attributes are numerical or time-based: Date_of_Journey, Dep_Time,
Arrival_Time, Duration, Total_Stops, Price.
Insights: This boxplot showcasing the relationship between the source (departure
location) and the price of tickets offers insights into how ticket prices vary based on the
departure point.
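A boxplot like this can be produced with seaborn; the file name Data_Train.xlsx is an assumption about the Kaggle download.

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_excel("Data_Train.xlsx")  # file name assumed; reading .xlsx requires openpyxl
    sns.boxplot(x="Source", y="Price", data=df)
    plt.xticks(rotation=45)
    plt.show()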
4.1.2.3. Boxplot: Destination vs Price of ticket
Insights: From the above plot it is clear that ticket prices for
business class are higher, which is quite obvious.
Insights: There are more flight departures from Delhi than from any other source.
4.1.2.8. Count Plot: Destination
Insights: It appears that Cochin is receiving more flights compared to other
destinations.
4.1.2.9. Count plot: Duration_hours
1. Independent Variables
1. Our categorical variables contain string values, which our model does not understand, so by
using OneHotEncoder we convert the categorical variables into numerical variables.
2. The values of some numerical features are in the range 50-100, which does not match the
numerical values of the other variables. To fix this problem, we scale our data up or down
using StandardScaler; this helps improve the accuracy of our model. Both steps are sketched below.
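A minimal sketch of both steps with scikit-learn's ColumnTransformer; the exact split of columns into categorical and numerical features here is an assumption for illustration.

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    categorical = ["Airline", "Source", "Destination"]  # names from the dataset description
    numerical = ["Total_Stops", "Duration_hours"]       # assumed engineered numeric columns

    preprocessor = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("num", StandardScaler(), numerical),
    ])
    # X_processed = preprocessor.fit_transform(X)  # X: DataFrame of independent variables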
1. The dataset of 10462 rows is split into a training set of 8369 rows and a testing set of 2093 rows.
2. The training set will be used for training the model, and the testing set will be used for testing the
accuracy of the model, as sketched below.
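A sketch of the split with scikit-learn; the placeholder arrays stand in for the prepared features and target, and an 80:20 split of 10462 rows reproduces the 8369/2093 counts above.

    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(42)
    X, y = rng.random((10462, 5)), rng.random(10462)  # placeholders for the prepared data

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    print(len(X_train), len(X_test))  # 8369 2093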
5.2. Model testing and Evaluation Function
5.2.1. Description:
In the model testing code, we initialize all of the machine learning algorithms on which we will
be testing the accuracy of the model using the evaluate method. In the evaluate method
we have used two evaluation metrics:
1. R2 score
2. Mean squared error
The evaluate function takes the actual values and the predicted values as
parameters and returns the r2_score and the rmse. These values tell us the accuracy of
the model. A sketch of this function is given below.
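A minimal sketch of such an evaluate function, assuming scikit-learn's metrics:

    import numpy as np
    from sklearn.metrics import mean_squared_error, r2_score

    def evaluate(actual, predicted):
        # returns the R2 score and the root mean squared error of the predictions
        r2 = r2_score(actual, predicted)
        rmse = np.sqrt(mean_squared_error(actual, predicted))
        return r2, rmse

    # example usage: r2, rmse = evaluate(y_test, model.predict(X_test))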
In the model testing code, we use the predict function on each of the algorithms to obtain
a set of predicted values. These predicted values are given to the evaluate function
along with the actual values to get the accuracy of the model, which produces the following
output.
As we can see, the Random Forest regressor gives the most efficient results:
1. r2 score - 0.88
2. mse – 1553.1
5.2.2. Results
So, from all the above modelling and testing, we have found that
random forest regressor is the best suited algorithm for the dataset,
which gives the most accurate prediction.
5.3. Plotting
5.3.1 – Reg Plot
As the actual and the predicted values in our reg plot fall along a
straight line, we conclude that the accuracy of our model is up to the
mark. A sketch of the plotting code is given below.
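A sketch of this plot with seaborn's regplot; y_test and y_pred stand for the actual and predicted prices from section 5.2.

    import seaborn as sns
    import matplotlib.pyplot as plt

    sns.regplot(x=y_test, y=y_pred)  # y_test, y_pred from the model testing step
    plt.xlabel("Actual price")
    plt.ylabel("Predicted price")
    plt.show()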
6 – Modular Project Structure
6.1. Modular Project Structure
In order to modularize the project and give it a proper structure, we follow these rules:
6.1.1. Folder Structure
2. src: The folder that consists of the source code related to data gathering, data preparation,
feature extraction, etc.
3. tests: The folder that consists of the code representing unit tests for the code maintained
within the src folder.
4. models: The folder that consists of files representing trained/retrained models built as part of
build jobs, etc. The model names can be appropriately set using the project name plus a
date-time, or the project build id (in case the model is created as part of a build job). Another
approach is to store the model files in separate storage such as AWS S3, Google Cloud
Storage, or any other form of storage.
5. data: The folder consists of data used for model training/retraining. The data could also be
stored in a separate storage system.
6. pipeline: The folder consists of code that's used for retraining and testing the model in an
automated manner. This could be Docker container-related code, scripts, workflow-related
code, etc.
7. docs: The folder that consists of documentation such as the product requirement
specifications (PRS), technical design specifications (TDS), etc.
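Putting the rules above together, a possible layout looks like this (the root folder name is illustrative):

    flight-fare-prediction/
    ├── src/       - data gathering, preparation and feature-extraction code
    ├── tests/     - unit tests for the code in src
    ├── models/    - trained/retrained .pkl model files
    ├── data/      - data used for model training/retraining
    ├── pipeline/  - automated retraining and testing code
    └── docs/      - PRS, TDS and other documentation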
6.2. Pipelining
Data pipelines operate on the same principle as physical pipelines; only they deal with
information rather than liquids or gases. Data pipelines are a sequence of data processing steps, many of them
accomplished with special software. The pipeline defines how, what, and where the data is
collected. Data pipelining automates data extraction, transformation, validation, and
combination, then loads it for further analysis and visualization. The entire pipeline
provides speed from one end to the other by eliminating errors and neutralizing bottlenecks
or latency.
In our modular project, we have used the concept of pipelining to take the user input from
the UI and channel the data through a pipeline to our model, which then
returns the prediction. A sketch is given below.
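A minimal sketch of such a pipeline with scikit-learn, reusing the preprocessor sketched in section 5.1; the variable names are assumptions for illustration.

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.pipeline import Pipeline

    # `preprocessor` is the ColumnTransformer sketched earlier
    pipe = Pipeline([
        ("preprocess", preprocessor),
        ("model", RandomForestRegressor(n_estimators=100, random_state=42)),
    ])
    # pipe.fit(X_train, y_train)      # train once on the prepared data
    # fare = pipe.predict(user_trip)  # user_trip: single-row DataFrame built from the UI form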
7 – User View
The UI is built using HTML and CSS, and is bundled up in a
local server using Flask; a minimal sketch of the Flask side is given below.
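This sketch assumes a templates/home.html page and the flight_rf.pkl file from the modelling step; the form field names are assumptions about the HTML.

    import pickle
    from flask import Flask, render_template, request

    app = Flask(__name__)

    with open("flight_rf.pkl", "rb") as f:  # model file name is an assumption
        model = pickle.load(f)

    @app.route("/", methods=["GET", "POST"])
    def home():
        prediction = None
        if request.method == "POST":
            # build a single feature row from the submitted form (field names assumed)
            row = [[int(request.form["Total_Stops"]), float(request.form["Duration"])]]
            prediction = round(float(model.predict(row)[0]), 2)
        return render_template("home.html", prediction=prediction)

    if __name__ == "__main__":
        app.run(debug=True)  # serves the UI on a local development server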
8. Conclusion
This system will give people an idea of the trends the prices follow and also provide the
predicted value of the price, which they can check before booking flights to save money.
This kind of system or service can be provided to customers through flight booking
companies to help them book tickets.
In essence, leveraging features like airlines, total stops, departure and arrival times,
source, destination, and additional information in a flight fare prediction project presents
a rich opportunity. However, addressing complexities, ensuring data quality, employing
advanced models, accommodating user preferences, and fostering continuous
improvement are crucial steps towards building a robust and accurate prediction system
in this domain.
9. References/Bibliography
Web Sources:
https://www.javatpoint.com/machine-learning
https://www.geeksforgeeks.org/machine-learning/
https://www.w3schools.com/python/default.asp
https://stackoverflow.com/
https://www.youtube.com/watch?v=bPrmA1SEN2k&list=PLZoTAELRMXVPBTrWtJkn3wWQxZkmTXGwe
Book:
1. Tom Mitchell, Machine Learning, McGraw Hill, 1997.
2. John Fox, Applied Regression Analysis, Linear Models, and Related
Methods, Sage Publications, 1997. ISBN: 080394540X.