Minor Project Report
Submitted By
Biswajeet Biswal
(2224100026)
CERTIFICATE
This is to certify that the project report entitled “FLIGHT FARE PREDICTION
SYSTEM” being submitted by Mr. Biswajeet Biswal bearing the registration no:
2224100026 in partial fulfilment for the award of the degree of Master of Computer
Applications to the Odisha University of Technology and Research is a record of bona fide
work carried out by him under my guidance and supervision.
The results embodied in this project report have not been submitted to any other
University or Institute for the award of any Degree or Diploma.
DECLARATION
ACKNOWLEDGEMENT
I would like to express my sincere gratitude to my advisor, Mrs. Susmita Pal, School of
Computer Science, whose knowledge and guidance have motivated me to achieve my goals. She
has consistently been a source of motivation, encouragement, and inspiration. The time I have
spent working under her supervision has truly been a pleasure.
I take it as a great privilege to express my heartfelt gratitude to Dr. Jibitesh Mishra, Head of the
School of Computer Science, for his valuable support, and to all senior faculty members of the
School of Computer Science for their help during my course. Thanks also to the programmers and
non-teaching staff of the School of Computer Science.
Finally, special thanks to my parents for their support and encouragement throughout my life
and this course. Thanks to all my friends and well-wishers for their constant support.
ABSTRACT
The Flight Fare Prediction System is an innovative application leveraging machine learning
algorithms to forecast and analyze airfare prices. By harnessing historical flight data, user preferences,
and real-time market trends, the system employs sophisticated models to predict future ticket prices
accurately. This system aims to empower travelers with reliable estimations, assisting them in making
informed decisions about when to book flights. Through an intuitive interface, users can input their
travel details and receive projected fare information, enabling them to optimize their travel budget and
secure cost-effective flight options. The system's intelligent insights and predictive capabilities serve
as a valuable tool in navigating the dynamic and often fluctuating landscape of air travel pricing.
CONTENTS
1 – Introduction
1.1- Background
1.2– Project scope
1.3– Existing Work
1.4 – Proposed System
1.5 – Project workflow
2 – Algorithms Used
2.1 – Machine Learning Basics
2.1.1 – Data Mining Workflow
2.1.2 – Supervised and Unsupervised Learning
2.1.3 – Machine Learning Algorithms
2.1.4 – Evaluation Metrics
3 – Methodology
3.1 – Data Review
3.2 - Software Introduction
3.3 – Data Flow Diagram
4 – Analysis
4.1 – Exploratory Data Analysis
4.1.1 – Data Identification
4.1.2 – Visualizing the data
5 – Model Training
5.1 – Preparing the data for modelling
5.1.1 – Separating independent and dependent variables
5.1.2 - Feature engineering
5.1.3 – Splitting training and testing data
5.2 – Model testing and Evaluation function
5.2.1 – Description
5.2.2 - Results
5.3 – Plotting
5.3.1 – Reg Plot
6 – Modular Project Structure
6.1 – Folder Structure
6.2 – Pipelining
7 – User View
8 – Conclusion
9 – References/Bibliography
1. INTRODUCTION
1.1. Background
• The project entitled “Flight Fare Prediction System” is solely focused on predicting
the ticket price based on given features.
• This project helps calculate more accurate and efficient ticket prices for a journey
from one location to another.
• A machine learning algorithm, the Random Forest regressor, is implemented to
predict the price of the ticket.
2. Consider the impact of journey time on fare prediction. Longer or shorter flights
might have different pricing dynamics, which could be factored into the model to
enhance accuracy.
3. Evaluate how the number of stops or layovers affects fares. Non-stop flights
might have different pricing dynamics compared to flights with multiple stops,
and this information could enhance the model's predictive power.
4. Based on the conclusions drawn from the above points, we will build a
machine learning model which predicts the ticket price given the user’s
attributes.
1.5. Project Workflow
1. Data Collection: In this step, we collect the data from different sources.
2.1. Data Processing: In this step we clean the data, performing operations such as
dealing with missing values, removing outliers, etc.
2.2. Feature Extraction and Engineering: In this step, with the help of different
plotting methods and statistics we find out which features are relevant to the
project and add more relevant columns which will improve the efficiency of the
project. This will prepare the dataset for the next steps.
2.3. Feature Scaling and Selection: In this step, we identify the numerical
features and the categorical features. On the categorical features we perform One
Hot Encoding to convert the categories into numbers, and on the numerical
features we perform standard scaling to make the numeric values
compatible with the rest of the data.
2.4. Splitting the data: In this step, the data is split into a training and a testing
set (80 : 20). The training set is used to train the machine learning model, and the
testing set is used to test the predictions of our model.
4. Modelling: In this step, the model with the highest accuracy is used to build the final
model, which is then saved as the final “.pkl” file (a sketch follows at the end of this section).
5. Model Evaluation and Tuning: In this step, we use several evaluation metrics
to test the accuracy of the model on our testing dataset. The accuracy of our model can
be improved using a method known as Hyperparameter Tuning (also sketched at the end of
this section). If our model’s accuracy score does not reach a specific threshold (0.88, in our
case), we go back to step 2 and repeat the whole process.
6. Deployment and Monitoring: In this step, we make our machine learning model user
friendly by building a pipeline and connecting it to a user interface. After building the UI, we
make the model publicly available by deploying it using a cloud service such as Microsoft
Azure or Amazon AWS.
Diagrammatic representation of the “Project Work Flow”
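As an illustration of step 4, here is a minimal sketch of saving a trained model to a “.pkl” file with pickle. The placeholder data and the file name flight_rf.pkl are assumptions made for the sketch, not taken from the project.

    import pickle
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    # placeholder data; in the project, X_train and y_train come from steps 1-2.4
    rng = np.random.default_rng(42)
    X_train, y_train = rng.random((100, 5)), rng.random(100) * 10000

    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    with open("flight_rf.pkl", "wb") as f:  # file name is illustrative
        pickle.dump(model, f)

And a minimal sketch of the Hyperparameter Tuning mentioned in step 5, using scikit-learn's RandomizedSearchCV; the parameter grid shown is an assumption for illustration.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import RandomizedSearchCV

    rng = np.random.default_rng(0)
    X_train, y_train = rng.random((200, 5)), rng.random(200) * 10000  # placeholder data

    params = {"n_estimators": [100, 300, 500], "max_depth": [5, 10, None]}  # illustrative grid
    search = RandomizedSearchCV(RandomForestRegressor(random_state=0),
                                param_distributions=params, n_iter=5, cv=5, scoring="r2")
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)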
2. ALGORITHMS USED
2.1 Machine Learning Basics
2.1.1. Data Mining Workflow
1. The most basic workflow for data mining, and therefore machine learning, can be divided
into six steps. The first step is data acquisition, as insufficient or
biased data can lead to wrong results. In machine learning or data mining this data
usually has to be quite large, as patterns might only emerge with thousands or millions of
individual data points.
2. The second and maybe most important step is the cleaning of the given data. Problems with
data can include unsuitable features, noisy values, and so on. Features can be nominal (such as
yes/no or male/female), ordinal (rankings such as school grades) or numerical
(temperatures, costs and other parameters), and sometimes features have to be converted.
An example would be to label all days with average temperatures over 10°C as "warm"
and all below as "cold". Outliers might either be interesting, as in anomaly detection, or
can affect the outcome of the learning process negatively, e.g. when using experimental
data, where outliers have to be removed; this has to be considered in the
cleaning process. After all, a scientific or analytical process can only be as good as the
data provided, as the saying jokingly goes: "garbage in, garbage out".
3. The third and fourth steps, with the cleaned data and a decision about the chosen
modeling technique, are to build and train a model. This is possibly the most abstract step
in data mining, especially when pre-built programs such as Azure Machine Learning
Studio or scikit-learn (in Python) are used. Simply put, the training process is finding
structures in the given training data. What kind of data or features these are is heavily
dependent on the goal (clustering or prediction etc.). Taking a simple example of model
training: a very famous dataset is the iris-dataset, which includes features of three different
iris-species and uses clustering to determine the correct subspecies. Features included in
this set are different parts of the flower, such as petal or sepal length. When using
clustering, the training algorithm iteratively calculates the most common features and
therefore allows a grouping process of all the data points. The result is heavily dependent
on the complexity of these clusters, just as in regression the result is dependent on the
complexity of a curve. As can be seen in Figure 2.1, it can be quite hard to determine if
the bottom-most red data point belongs to the species-A-cluster or to the species-B-
cluster, depending on the method of measurement.
4. Testing and evaluating the model is mostly done by statistical methods and will seldom
give a result of a 100% match between the training and validation data. Considering one of
the most intuitive and simple data mining models, linear regression, this uncertainty is
mostly covered by introducing a measurement of uncertainty.
2.1.2. Supervised and Unsupervised Learning
In machine learning an abundance of models and algorithms can be found, but most
fundamentally these are divided into supervised and unsupervised learning.
a. One fundamental example has been mentioned in the foregoing section, the
clustering of iris-species. This is a supervised process when data points are
labeled ("species A", "species B" or "species C") and labels are calculated for new
data points. Comparing the labels calculated according to the trained model with the
original labels gives the model’s accuracy, hence supervised.
b. Unsupervised learning on the other hand does not require any labeling, since the
algorithm is searching for a pattern in the data. This might be useful when
categorizing customers into different groups without a priori knowledge of which
groups they belong to.
2.1.3. Machine Learning Algorithms
In our project only Regression is used, so here are some of the commonly used regression
algorithms:
Linear regression models assume that the relationships between input and output variables are
linear. These models are quite simplistic, but in many cases provide adequate and tractable
representations of the relationships. The model aims at a prediction of the real output data Y from the
given input data X = (x_1, x_2, …, x_p) and has the following form:

f(X) = β_0 + Σ_{j=1}^{p} β_j x_j

β describes the initially unknown coefficients. Linear models with more than one input variable (p
> 1) are called multiple linear regression models. The best known estimation method for linear
regression is the least squares method. In this method, the coefficients β = (β_0, β_1, …,
β_p) are determined in such a way that the Residual Sum of Squares (RSS) becomes minimal:

RSS(β) = Σ_{i=1}^{n} (y_i − f(x_i))²

Here, y_i − f(x_i) describes the residuals, β_0 the estimate of the intercept term, and β_j the
estimates of the slope parameters.
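As a sketch of the least squares method, the following toy example fits a multiple linear regression with scikit-learn; the data is synthetic and generated only for illustration.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # toy data generated from y = 3 + 2*x_1 - x_2 plus noise
    rng = np.random.default_rng(0)
    X = rng.random((200, 2))
    y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(0, 0.1, 200)

    reg = LinearRegression().fit(X, y)  # least squares fit, minimizing the RSS
    print(reg.intercept_)  # estimate of beta_0 (close to 3)
    print(reg.coef_)       # estimates of beta_1 and beta_2 (close to 2 and -1)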
A Decision Tree grows by iteratively splitting tree nodes until the ‘leaves’ contain no more
impurities or a termination condition is reached. The creation of the Decision Tree starts at
the root of the tree and splits the data in a way that results in the largest Information Gain IG.
In binary Decision Trees, the total dataset D_p is divided by an attribute a into D_left
and D_right. Accordingly, the information gain is defined as:

IG(D_p, a) = I(D_p) − (N_left / N_p) · I(D_left) − (N_right / N_p) · I(D_right)

where I is the impurity measure, N_p is the number of samples in the parent node, and N_left
and N_right are the numbers of samples in the left and right child nodes.
The algorithm aims at maximizing the information gain, i.e. the method wants to split the
total dataset in such a way that the impurity in the child nodes is reduced the most.
While classification uses entropy or the Gini coefficient as a measure of impurity, regression
uses, for example, the Mean Squared Error (MSE) as a measure of the impurity of a node.
Splitting methods that use the Mean Squared Error to determine impurity are also
called variance reduction methods. Usually, the tree size is controlled by the maximum
number of nodes max_depth, at which the division of the dataset stops.
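A minimal sketch of such a regression tree in scikit-learn, again on synthetic data; "squared_error" is scikit-learn's name for the MSE criterion, i.e. the variance-reduction splitting described above.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.random((200, 1))
    y = np.sin(4 * X[:, 0]) + rng.normal(0, 0.05, 200)  # toy regression target

    tree = DecisionTreeRegressor(criterion="squared_error", max_depth=5)  # max_depth limits tree size
    tree.fit(X, y)
    print(tree.get_depth(), tree.get_n_leaves())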
The Random Forest has a number of hyperparameters. The most crucial one, besides
the maximum depth of the trees max_depth, is the number of decision trees n_estimators. By
default, the Mean Square Error (MSE) is used as criterion for splitting the dataset as the trees
grow.
The following figure shows an example model for the Random Forest. The way it works
results in the characteristic “step” form:
2.1.4. Evaluation Metrics
MSE (model): Mean Squared Error of the predictions against the actual values.
MSE (baseline): Mean Squared Error of the mean prediction against the actual values.
A common way to combine these two quantities is the R² score, R² = 1 − MSE(model) / MSE(baseline), which is also used to evaluate the models later in this report.
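A small numeric sketch of these two quantities and the resulting R² score, on made-up fare values:

    import numpy as np

    y_true = np.array([3500., 7200., 4100., 9800.])  # toy actual fares
    y_pred = np.array([3700., 6900., 4300., 9500.])  # toy predicted fares

    mse_model = np.mean((y_true - y_pred) ** 2)            # MSE of the model's predictions
    mse_baseline = np.mean((y_true - y_true.mean()) ** 2)  # MSE of always predicting the mean
    r2 = 1 - mse_model / mse_baseline
    print(mse_model, mse_baseline, r2)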
3. Methodology
3.1. Data Review
1. Dataset Source - https://www.kaggle.com/datasets/nikhilmittal/flight-fare-prediction-mh
2. This data contains 11 columns and 10683 rows.
3. Logical View
g. Arrival_Time: The time at which the flight arrives at its destination.
1. Anaconda: A Python distribution for data science and machine learning. It includes
tooling for Python virtual environments, whose main purpose is to create isolated
environments where each project is independent of the others, using its own dependencies.
2. Jupyter Notebook: Jupyter Notebook is an open-source web-based interactive
computing environment used for data analysis, visualization, and the creation of
computational notebooks. It allows users to create and share documents that contain
live code, equations, visualizations, and narrative text.
3. VS Code: Short for Visual Studio Code, a free and open-source source code editor
developed by Microsoft. It is widely used by programmers and developers across various
programming languages and platforms.
VS Code is designed to be lightweight, customizable, and highly extensible, making
it suitable for a wide range of programming tasks. It provides a user-friendly interface
and a powerful set of features that enhance productivity and streamline the
development process.
4. Flask: Flask is a lightweight web framework for Python that allows developers to
build web applications quickly and with minimal boilerplate code. It is known for
its simplicity, flexibility, and ease of use. Flask is often referred to as a "micro"
framework because it provides only the essential features needed to create web
applications, without imposing any strict architectural patterns or dependencies.
Insights
1. There are five categorical attributes: Airline, Source, Destination, Route,
Additional_Info.
2. The other six attributes are numerical or time-based: Date_of_Journey, Dep_Time,
Arrival_Time, Duration, Total_Stops, Price.
Insights: This boxplot showcasing the relationship between the source (departure
location) and the price of tickets offers insights into how ticket prices vary based on the
departure point.
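A boxplot like this can be produced with seaborn; the file name Data_Train.xlsx is an assumption about the Kaggle download.

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_excel("Data_Train.xlsx")  # file name assumed; reading .xlsx requires openpyxl
    sns.boxplot(x="Source", y="Price", data=df)
    plt.xticks(rotation=45)
    plt.show()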
4.1.2.3. Boxplot: Destination vs Price of ticket
Insights: From the above plot it is clear that ticket prices for
business class are higher, which is quite obvious.
Insights: There are more flight departures from Delhi than from any other source.
4.1.2.8. Count Plot: Destination
Insights: It appears that Cochin is receiving more flights compared to other
destinations.
4.1.2.9. Count plot: Duration_hours
1. Independent Variables
1. Our categorical variables contain string values, which our model does not understand, so by
using OneHotEncoder we convert the categorical variables into numerical variables.
2. The values of some numerical features are in the range 50-100, which does not match the
numerical values of the other variables. To fix this problem, we scale our data up or down
using StandardScaler; this helps improve the accuracy of our model. Both steps are sketched below.
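A minimal sketch of both steps with scikit-learn's ColumnTransformer; the exact split of columns into categorical and numerical features here is an assumption for illustration.

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    categorical = ["Airline", "Source", "Destination"]  # names from the dataset description
    numerical = ["Total_Stops", "Duration_hours"]       # assumed engineered numeric columns

    preprocessor = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("num", StandardScaler(), numerical),
    ])
    # X_processed = preprocessor.fit_transform(X)  # X: DataFrame of independent variables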
1. The dataset of 10462 rows is split into a training set of 8369 rows and a testing set of 2093 rows.
2. The training set will be used for training the model, and the testing set will be used for testing the
accuracy of the model, as sketched below.
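A sketch of the split with scikit-learn; the placeholder arrays stand in for the prepared features and target, and an 80:20 split of 10462 rows reproduces the 8369/2093 counts above.

    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(42)
    X, y = rng.random((10462, 5)), rng.random(10462)  # placeholders for the prepared data

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    print(len(X_train), len(X_test))  # 8369 2093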
5.2. Model testing and Evaluation Function
5.2.1. Description:
In the model testing code, we initialize all of the machine learning algorithms on which we will
be testing the accuracy of the model using the evaluate method. In the evaluate method
we have used two evaluation metrics:
1. R2 score
2. Mean squared error
The evaluate function takes the actual values and the predicted values as
parameters and returns the r2_score and the rmse. These values tell us the accuracy of
the model. A sketch of this function is given below.
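A minimal sketch of such an evaluate function, assuming scikit-learn's metrics:

    import numpy as np
    from sklearn.metrics import mean_squared_error, r2_score

    def evaluate(actual, predicted):
        # returns the R2 score and the root mean squared error of the predictions
        r2 = r2_score(actual, predicted)
        rmse = np.sqrt(mean_squared_error(actual, predicted))
        return r2, rmse

    # example usage: r2, rmse = evaluate(y_test, model.predict(X_test))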
In the model testing code, we use the predict function on each of the algorithms to obtain
a set of predicted values. These predicted values are given to the evaluate function
along with the actual values to get the accuracy of the model, which produces the following
output.
As we can see, the Random Forest regressor gives the most efficient results:
1. r2 score - 0.88
2. mse – 1553.1
5.2.2. Results
So, from all the above modelling and testing, we have found that
random forest regressor is the best suited algorithm for the dataset,
which gives the most accurate prediction.
5.3. Plotting
5.3.1 – Reg Plot
As the actual and the predicted values in our reg plot fall along a
straight line, we conclude that the accuracy of our model is up to the
mark. A sketch of the plotting code is given below.
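A sketch of this plot with seaborn's regplot; y_test and y_pred stand for the actual and predicted prices from section 5.2.

    import seaborn as sns
    import matplotlib.pyplot as plt

    sns.regplot(x=y_test, y=y_pred)  # y_test, y_pred from the model testing step
    plt.xlabel("Actual price")
    plt.ylabel("Predicted price")
    plt.show()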
6 – Modular Project Structure
6.1. Modular Project Structure
In order to modularize the project and give it a proper structure, we follow these rules:
6.1.1. Folder Structure
2. src: The folder that consists of the source code related to data gathering, data preparation,
feature extraction, etc.
3. tests: The folder that consists of the code representing unit tests for the code maintained
within the src folder.
4. models: The folder that consists of files representing trained/retrained models built as part of
build jobs, etc. The model names can be appropriately set using the project name plus a
date-time, or the project build id (in case the model is created as part of a build job). Another
approach is to store the model files in separate storage such as AWS S3, Google Cloud
Storage, or any other form of storage.
5. data: The folder consists of data used for model training/retraining. The data could also be
stored in a separate storage system.
6. pipeline: The folder consists of code that's used for retraining and testing the model in an
automated manner. This could be Docker container-related code, scripts, workflow-related
code, etc.
7. docs: The folder that consists of documentation such as the product requirement
specifications (PRS), technical design specifications (TDS), etc.
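Putting the rules above together, a possible layout looks like this (the root folder name is illustrative):

    flight-fare-prediction/
    ├── src/       - data gathering, preparation and feature-extraction code
    ├── tests/     - unit tests for the code in src
    ├── models/    - trained/retrained .pkl model files
    ├── data/      - data used for model training/retraining
    ├── pipeline/  - automated retraining and testing code
    └── docs/      - PRS, TDS and other documentation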
6.2. Pipelining
Data pipelines operate on the same principle as physical pipelines; only they deal with
information rather than liquids or gases. Data pipelines are a sequence of data processing steps, many of them
accomplished with special software. The pipeline defines how, what, and where the data is
collected. Data pipelining automates data extraction, transformation, validation, and
combination, then loads it for further analysis and visualization. The entire pipeline
provides speed from one end to the other by eliminating errors and neutralizing bottlenecks
or latency.
In our modular project, we have used the concept of pipelining to take the user input from
the UI and channel the data through a pipeline to our model, which then
returns the prediction. A sketch is given below.
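A minimal sketch of such a pipeline with scikit-learn, reusing the preprocessor sketched in section 5.1; the variable names are assumptions for illustration.

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.pipeline import Pipeline

    # `preprocessor` is the ColumnTransformer sketched earlier
    pipe = Pipeline([
        ("preprocess", preprocessor),
        ("model", RandomForestRegressor(n_estimators=100, random_state=42)),
    ])
    # pipe.fit(X_train, y_train)      # train once on the prepared data
    # fare = pipe.predict(user_trip)  # user_trip: single-row DataFrame built from the UI form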
7 – User View
The UI is built using HTML and CSS, and is bundled up in a
local server using Flask; a minimal sketch of the Flask side is given below.
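This sketch assumes a templates/home.html page and the flight_rf.pkl file from the modelling step; the form field names are assumptions about the HTML.

    import pickle
    from flask import Flask, render_template, request

    app = Flask(__name__)

    with open("flight_rf.pkl", "rb") as f:  # model file name is an assumption
        model = pickle.load(f)

    @app.route("/", methods=["GET", "POST"])
    def home():
        prediction = None
        if request.method == "POST":
            # build a single feature row from the submitted form (field names assumed)
            row = [[int(request.form["Total_Stops"]), float(request.form["Duration"])]]
            prediction = round(float(model.predict(row)[0]), 2)
        return render_template("home.html", prediction=prediction)

    if __name__ == "__main__":
        app.run(debug=True)  # serves the UI on a local development server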
8. Conclusion
This system will give people an idea of the trends the prices follow and also provide the
predicted value of the price, which they can check before booking flights to save money.
This kind of system or service can be provided to customers through flight booking
companies to help them book tickets.
In essence, leveraging features like airlines, total stops, departure and arrival times,
source, destination, and additional information in a flight fare prediction project presents
a rich opportunity. However, addressing complexities, ensuring data quality, employing
advanced models, accommodating user preferences, and fostering continuous
improvement are crucial steps towards building a robust and accurate prediction system
in this domain.
9. References/Bibliography
Web Sources:
https://www.javatpoint.com/machine-learning
https://www.geeksforgeeks.org/machine-learning/
https://www.w3schools.com/python/default.asp
https://stackoverflow.com/
https://www.youtube.com/watch?v=bPrmA1SEN2k&list=PLZoTAELRMXVPBTrWtJkn3wWQxZkmTXGwe
Book:
1. Tom Mitchell, Machine Learning, McGraw Hill, 1997.
2. John Fox, Applied Regression Analysis, Linear Models, and Related
Methods, Sage Publications, 1997. ISBN: 080394540X.