Governors State University
OPUS Open Portal to University Scholarship
All Capstone Projects Student Capstone Projects
Fall 2022
Airline Search Engine Project
Arun Kailasa
Recommended Citation
Kailasa, Arun, "Airline Search Engine Project" (2022). All Capstone Projects. 568.
https://opus.govst.edu/capstones/568
This Capstone Project is brought to you for free and open access by the Student Capstone Projects at OPUS Open
Portal to University Scholarship. It has been accepted for inclusion in All Capstone Projects by an authorized
administrator of OPUS Open Portal to University Scholarship.
AIRLINE SEARCH ENGINE PROJECT
By
Arun Kailasa
B. Tech, Vaagdevi College of Engineering, 2020
GRADUATE CAPSTONE SEMINAR PROJECT
Submitted in partial fulfillment of the requirements
For the Degree of Master of Science,
With a Major in Computer Science
Governors State University
University Park, IL 60484
2022
ABSTRACT
The Airline Search Engine Project is a tool that helps anyone find facts and data related to airlines and airports. The
raw data set for this project is available in the .dat format and can be downloaded from [1].
The tool may also perform an initial cleaning of the data where needed to form dimensional data. The cleaning process
includes data value unification, data type and size unification, deduplication, dropping columns, and correcting some
known errors.
The data is processed with Python and Spark, and it can be stored in distributed storage systems such as Hadoop and
Amazon S3. The development environments used in this project are editors such as Google Colab and PyCharm.
This tool can be run as a job on different clusters such as EMR (Elastic MapReduce), HDInsight, Cloudera, and
Databricks. It can turn terabytes of raw data into useful information, from which we can create reports for Data
Analysts, Data Scientists, and businesspeople.
Table of Contents
1 Project Description
1.1 Appendix A:
1.2 Appendix B:
1.3 Appendix C:
2 Architecture and flow of the Data Pipeline
3 Tools and Technologies
4 Project Structure
5 Project folder Hierarchy
6 Utility Code
7 Code for creating the Spark session
8 Transformation and Cleaning
9 Complete Project Code:
10 Project Output Screenshots
10.1 Find a list of Airports operating in the Country X
10.2 Find the list of Airlines having X stops
10.3 List of Airlines operating with codeshare
10.4 Find the list of Active Airlines in the United States
10.5 Which country (or) territory has the highest number of Airports
10.6 The top K cities with most Incoming Airlines
10.7 The top K cities with most Outgoing Airlines
10.8 Trip that connects two cities X and Y
10.9 Trip that connects X and Y with less than Z stops
10.10 All the cities reachable within d hops of a city
10.11 Find list of Airports operating in the Country X
10.12 Find the list of Airlines having X stops
10.13 List of Airlines operating with code share
10.14 Find the list of Active Airlines in the United States
10.15 Which country (or) territory has the highest number of Airports
10.16 The top K cities with most incoming Airlines
10.17 The top K cities with most outgoing Airlines
10.18 Trip that connects two cities X and Y
10.19 Trip that connects X and Y with less than Z stops
10.20 All the cities reachable within d hops of a city
11 AWS Output Screenshot
12 Acknowledgement
13 References:
1 Project Description
This tool processes the raw data sets listed in Appendix A and derives from them the useful facts listed in
Appendix B. It first turns the raw data into dimensional data models: the Airports, Airlines, Routes, Planes, and
Countries tables. The schema of those tables can be found in Appendix C.
1.1 Appendix A:
The raw data sets are:
1) Airport.dat – Contains information related to airports, such as airport id, airport name, etc.
2) Airlines.dat – Contains information related to airlines, such as airline id, airline name, etc. [5].
3) Routes.dat – Contains information related to routes, such as source airport and destination airport.
4) Plane – Contains information related to planes, such as plane name, etc.
5) Country – Contains information related to countries, such as country name and iso_code [5].
1.2 Appendix B:
a. Find the list of Airports operating in the Country X.
b. Find the list of Airlines having X stops.
c. List of Airlines operating with code share.
d. Find the list of Active Airlines in the United States.
Airline aggregation:
e. Which Country (or) Territory has the highest number of Airports.
f. The top K cities with most Incoming/Outgoing Airlines.
Trip recommendation:
g. Define a trip as a sequence of connected routes. Find a trip that connects two cities X and Y (reachability).
h. Find a trip that connects X and Y with less than Z stops (constrained reachability).
i. Find all the cities reachable within d hops of a city (bounded reachability).
j. Fast transitive closure/connected components implemented with parallel/distributed algorithms.
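The trip recommendation items amount to path search on a directed graph whose nodes are airports (or cities) and whose edges are routes. A minimal sketch of that idea in plain Python follows. The function name find_trip, the max_stops parameter, and the sample routes are illustrative assumptions and are not taken from the project code. It uses breadth-first search, so the first trip it finds is one with the fewest legs.

```python
from collections import defaultdict, deque


def find_trip(routes, source, destination, max_stops=None):
    """Return one trip (a list of airports) from source to destination, or None.

    routes: iterable of (source_airport, destination_airport) pairs.
    max_stops: if given, only trips with fewer than max_stops intermediate
    stops are accepted (the "less than Z stops" constraint).
    """
    graph = defaultdict(set)
    for src, dst in routes:
        graph[src].add(dst)

    queue = deque([[source]])  # each queue entry is a partial trip
    visited = {source}
    while queue:
        trip = queue.popleft()
        if trip[-1] == destination:
            return trip
        # Extending this trip by one leg would give it len(trip) - 1
        # intermediate stops, so stop extending once that hits the limit.
        if max_stops is not None and len(trip) - 1 >= max_stops:
            continue
        for nxt in sorted(graph[trip[-1]]):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(trip + [nxt])
    return None


# Made-up sample routes, for illustration only.
sample_routes = [("GOH", "KEF"), ("KEF", "LHR"), ("LHR", "JFK"), ("GOH", "CPH")]
trip = find_trip(sample_routes, "GOH", "JFK")
direct_only = find_trip(sample_routes, "GOH", "JFK", max_stops=1)
```

Because breadth-first search visits shorter trips first, a trip that violates the stop limit implies that no shorter alternative exists, which is why the bounded variant can simply stop extending.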
1.3 Appendix C:
Table name: Airports
Column        Type
airport_id    bigint
Name          string
city          string
country       string
iata          string
icao          string
latitude      double
longitude     double
altitude      bigint
timezone      double
dst           string
tz_database   string
type          string
source        string
Table name: Airlines
Column      Type
Airlineid   bigint
Name        string
Alias       string
Iata        string
Icao        string
Callsign    string
Country     string
active      string
Table name: Routes
Column                  Type
Airline                 string
Airlineid               string
Source_airport          string
Source_airport_id       string
Destination_airport     string
Destination_airportid   string
Codeshare               string
Stops                   bigint
Equipment               string
Table name: Planes
Column   Type
Name     string
Iata     string
Icao     string
Table name: Countries
Column       Type
Name         string
Iso_code     string
Dafif_code   string
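To make the role of these schemas concrete, the following sketch applies the Airports schema to one raw line using only the Python standard library. It assumes comma-separated fields, lowercased column names, and the mapping bigint to int and double to float. The helper name parse_row and the sample line are made up for illustration. In Spark the same schema would normally be declared on the DataFrame reader instead.

```python
import csv
import io

# Column names and types follow the Airports table in Appendix C
# (names lowercased; bigint -> int, double -> float, string -> str).
AIRPORTS_SCHEMA = [
    ("airport_id", int), ("name", str), ("city", str), ("country", str),
    ("iata", str), ("icao", str), ("latitude", float), ("longitude", float),
    ("altitude", int), ("timezone", float), ("dst", str),
    ("tz_database", str), ("type", str), ("source", str),
]


def parse_row(fields, schema):
    """Cast a list of raw string fields to a dict typed according to schema.

    Numeric fields that cannot be cast (for example a null marker)
    are stored as None.
    """
    record = {}
    for (name, caster), raw in zip(schema, fields):
        try:
            record[name] = caster(raw)
        except ValueError:
            record[name] = None
    return record


# Made-up sample line; the timezone field holds the \N null marker.
SAMPLE_LINE = (
    r'1,"Example Airport","Example City","Exampleland","EXA","EXAM",'
    r'10.5,-20.25,100,\N,"U","Etc/UTC","airport","Sample"'
)
fields = next(csv.reader(io.StringIO(SAMPLE_LINE)))
record = parse_row(fields, AIRPORTS_SCHEMA)
```

The same pattern extends to the other tables by swapping in their column lists.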
2 Architecture and flow of the Data Pipeline
The given data set is uploaded either to an Amazon S3 bucket [4, 6] or to the Hadoop Distributed File System
(HDFS). The uploaded data is processed with the Apache Spark engine [3], which typically runs on a cluster such as
the Amazon Elastic MapReduce (EMR) service or as a locally installed Spark. Once the data is processed, it can be
stored in another Amazon S3 bucket or in HDFS. The output data can be viewed with various tools such as Apache
Superset, Tableau, the Presto query engine, or Amazon Athena [6], or it can be registered as another Hive table [3].
Figure 1: Architecture and flow of the Data Pipeline [2].
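The read, process, and write flow described above can be sketched as follows. This is not the project's actual job code: the function names storage_uri and run_pipeline, the Parquet output format, and the deduplication step standing in for the real transformations are all assumptions. The s3:// scheme is the one commonly used on EMR; a plain Hadoop installation would more likely use s3a://.

```python
def storage_uri(backend, location, path):
    """Build an input or output URI for the two storage options in the text.

    backend: "s3" or "hdfs".
    location: bucket name for S3, or namenode host:port for HDFS.
    """
    if backend == "s3":
        return f"s3://{location}/{path.lstrip('/')}"
    if backend == "hdfs":
        return f"hdfs://{location}/{path.lstrip('/')}"
    raise ValueError(f"unsupported backend: {backend}")


def run_pipeline(spark, input_uri, output_uri):
    """Read raw data, process it, and write the result back to storage.

    spark: an existing pyspark.sql.SparkSession (EMR cluster or local).
    """
    raw = spark.read.csv(input_uri, header=False, inferSchema=True)
    processed = raw.dropDuplicates()  # placeholder for the real transformations
    processed.write.mode("overwrite").parquet(output_uri)
    return processed


# Hypothetical bucket and host names, for illustration only.
raw_uri = storage_uri("s3", "example-raw-bucket", "/openflights/airports.dat")
out_uri = storage_uri("hdfs", "namenode:8020", "processed/airports")
```

With a session available, the job would be started with run_pipeline(spark, raw_uri, out_uri). Keeping the URI construction separate from the Spark calls lets the same job run against S3 or HDFS without code changes.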
3 Tools and Technologies
Google Colab, Spark, Python, AWS, PyCharm, HDFS, and AWS resources such as S3 buckets, Identity and Access
Management (IAM), AWS Glue Data Catalog, AWS Glue Crawler, Amazon Athena, and SQL.
4 Project Structure
The Airline Search Engine Project is developed in an Integrated Development Environment (IDE) such as
PyCharm [8], with the necessary language binaries, PySpark and Spark, installed [3, 11].
Figure 2: PySpark version 3.1.2 and Spark version 3.1.2.
The pip list command shows the PySpark version used in this project: PySpark 3.1.2, running on Spark 3.1.2.
Figure 3: pip list command showing PySpark Version.
5 Project folder Hierarchy
A separate project is created for this work, with its own virtual environment in which the necessary dependency
modules such as Pandas [10] and NumPy are installed. The folder structure includes a separate folder for data
loading/reading, and shared Spark utility code is kept in its own folder, the util folder.
Figure 4: Project folder Hierarchy
6 Utility Code
Utility code was developed to read the Spark session configuration and to set the Spark configuration at run time.
The load_df utility was developed to read the data. You can find the code in the screenshot below.
Figure 5 : Utility Code
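Since the utility code is only available as a screenshot, the following is a hedged sketch of what such helpers might look like, not a transcription of the project's code. Only the name load_df comes from the text. The INI-style configuration format, the function names read_spark_conf and apply_runtime_conf, and the load_df signature are assumptions.

```python
import configparser


def read_spark_conf(text, section="spark"):
    """Parse INI-style text and return the Spark settings as a dict."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep keys case-sensitive
    parser.read_string(text)
    return dict(parser[section])


def apply_runtime_conf(spark, conf):
    """Set each key/value pair on a running session."""
    for key, value in conf.items():
        spark.conf.set(key, value)


def load_df(spark, path, columns):
    """Read a headerless comma-separated file and name its columns."""
    return spark.read.csv(path, header=False, inferSchema=True).toDF(*columns)


# Hypothetical configuration text, for illustration only.
SAMPLE_CONF = """
[spark]
spark.app.name = airline-search
spark.sql.shuffle.partitions = 8
"""
conf = read_spark_conf(SAMPLE_CONF)
```

Given a session, apply_runtime_conf(spark, conf) would then push those settings, and load_df(spark, path, columns) would return a DataFrame with the column names from Appendix C.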
7 Code for creating the Spark session
Figure 6: Code for creating the Spark session
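As the session code is shown only as an image, here is a minimal sketch of how a Spark session is commonly created from a builder. The function name build_session and the configuration values are assumptions. RecordingBuilder is a stand-in that only records calls, so the call order can be shown without a Spark installation. In real use the builder argument would be pyspark.sql.SparkSession.builder.

```python
def build_session(builder, app_name, conf=None):
    """Configure a session builder and return the resulting session.

    builder: typically pyspark.sql.SparkSession.builder.
    conf: optional dict of Spark configuration key/value pairs.
    """
    builder = builder.appName(app_name)
    for key, value in (conf or {}).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


class RecordingBuilder:
    """Demo-only stand-in for SparkSession.builder that records each call."""

    def __init__(self):
        self.calls = []

    def appName(self, name):
        self.calls.append(("appName", name))
        return self

    def config(self, key, value):
        self.calls.append(("config", key, value))
        return self

    def getOrCreate(self):
        self.calls.append(("getOrCreate",))
        return self


demo = build_session(
    RecordingBuilder(), "airline-search", {"spark.sql.shuffle.partitions": "8"}
)
```

Passing the builder in as an argument keeps the function easy to exercise locally and lets the same code run unchanged on a cluster.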
8 Transformation and Cleaning
Some transformation and cleaning work is then done: strings such as “\N” and “-” are replaced with na, and all
null values are likewise replaced with the string na. You can find the output after this transformation and cleaning
in the screenshot below.
Figure 7: Transformation and Cleaning
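The cleaning rule just described can be expressed in a few lines. The sketch below works on plain Python rows so that it runs without Spark. The names clean_value and clean_row are hypothetical, and surrounding whitespace is stripped on the assumption that “- ” with a trailing space should be treated like “-”. On a Spark DataFrame, a similar effect can likely be obtained with DataFrame.replace followed by DataFrame.fillna.

```python
# Raw null markers named in the text; "\\N" is the two-character string \N.
NULL_MARKERS = {"\\N", "-"}


def clean_value(value, placeholder="na"):
    """Return placeholder for None or a null-marker string, else the value."""
    if value is None:
        return placeholder
    if isinstance(value, str) and value.strip() in NULL_MARKERS:
        return placeholder
    return value


def clean_row(row, placeholder="na"):
    """Apply clean_value to every field of a row."""
    return [clean_value(value, placeholder) for value in row]


# Made-up row, for illustration only.
cleaned = clean_row(["AA", "\\N", "- ", None, 0])
```

Non-string values such as the stops count pass through unchanged, so only missing or marker values end up as na.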
9 Complete Project Code:
Figure 8: Project Code
Figure 9: Complete Project Code
Figure 10: Complete Project Code
Figure 11: Spark Session Configuration Code
10 Project Output Screenshots
10.1 Find a list of Airports operating in the Country X
spark.sql("select *, count(*) over () as count from airports where country = 'Greenland'").show(100)
Output:
Figure 10.1: Output for list of Airports operating in the Country X (‘GREENLAND’)
10.2 Find the list of Airlines having X stops
spark.sql("select * from routes where stops > 0").show(100)
Output:
Figure 10.2: Output for list of Airlines having X stops
10.3 List of Airlines operating with codeshare
spark.sql("select * , count(*) over() as count from routes where codeshare != 'na' ").show(100)
Output:
Figure 10.3: Output for list of Airlines operating with codeshare
10.4 Find the list of Active Airlines in the United States
spark.sql("select *, count(*) over() as count from airlines where country = 'United States' and active = 'Y'").show(100)
Output:
Figure 10.4: Output for list of active Airlines in the United States
10.5 Which country (or) territory has the highest number of Airports
spark.sql("select count(*) as cnt, country from airports group by country order by cnt desc").show(20)
Output:
Figure 10.5: Output for the countries with highest number of Airports
10.6 The top K cities with most Incoming Airlines
spark.sql("""select * from (select airports.airportid, airports.name, airports.city, airports.country,
tb2.incoming_flight_count from airports inner join (select count(*) as incoming_flight_count, destinationairportid
from routes group by destinationairportid) tb2 on airports.airportid = tb2.destinationairportid) otb order by
otb.incoming_flight_count desc""").show(100)
Output:
Figure 10.6: Output for the top cities with most incoming Airlines
10.7 The top K cities with most Outgoing Airlines
spark.sql("""select * from (select airports.airportid, airports.name, airports.city, airports.country,
tb2.outgoing_flight_count from airports inner join (select count(*) as outgoing_flight_count, sourceairportid from
routes group by sourceairportid) tb2 on airports.airportid = tb2.sourceairportid) otb order by
otb.outgoing_flight_count desc""").show(100)
Output:
Figure 10.7: Output for top cities with most outgoing Airlines
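The two top-K queries above order all airports by route count and rely on show(100) to cut the list. A variant that makes K an explicit parameter can be sketched as below. To keep the example runnable without a cluster it uses the standard-library sqlite3 module as a stand-in for spark.sql. The function name top_k_incoming and the sample rows are made up, and the table and column names follow the queries above. Apart from the ? placeholder, the SQL should carry over to Spark SQL largely unchanged, though that has not been verified here.

```python
import sqlite3


def top_k_incoming(conn, k):
    """Return the k airports with the most incoming routes."""
    query = """
        select a.city, a.name, count(*) as incoming_flight_count
        from routes r
        inner join airports a on a.airportid = r.destinationairportid
        group by a.airportid, a.city, a.name
        order by incoming_flight_count desc, a.city
        limit ?
    """
    return conn.execute(query, (k,)).fetchall()


# In-memory database with made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table airports (airportid text, name text, city text, country text);
    create table routes (sourceairportid text, destinationairportid text);
    insert into airports values ('1', 'Alpha Field', 'Alpha', 'X');
    insert into airports values ('2', 'Beta Field', 'Beta', 'X');
    insert into airports values ('3', 'Gamma Field', 'Gamma', 'Y');
    insert into routes values ('1', '2');
    insert into routes values ('3', '2');
    insert into routes values ('1', '3');
""")
top_two = top_k_incoming(conn, 2)
```

The outgoing case is symmetric: join on sourceairportid instead of destinationairportid.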
10.8 Trip that connects two cities X and Y
spark.sql("""select * from routes where sourceairportid = '2613' and
destinationairportid='2531' """).show(100)
Output:
Figure 10.8: Output for trip that connects two cities X and Y
10.9 Trip that connects X and Y with less than Z stops
spark.sql("""select * from routes where sourceairportid = '2613' and
destinationairportid='2531' and stops < 1 """).show(100)
Output:
Figure 10.9: Output for trip that connects X and Y with less than Z stops
10.10 All the cities reachable within d hops of a city
spark.sql("""select destinationairport from routes where stops = 1 """).show(100)
Output:
Figure 10.10: Output for all the cities reachable within d hops of a city
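The query above filters on the stops column of individual routes. A hop-bounded search that starts from one particular city would instead expand the route graph level by level. The following plain-Python sketch shows one way to do that. The function name reachable_within and the sample routes are assumptions, and on the full data set the same expansion would more naturally be written as repeated self-joins of the routes table in Spark.

```python
from collections import defaultdict


def reachable_within(routes, start, d):
    """Return {airport: hops} for airports reachable from start in at most d hops.

    routes: iterable of (source_airport, destination_airport) pairs.
    """
    graph = defaultdict(set)
    for src, dst in routes:
        graph[src].add(dst)

    hops = {start: 0}
    frontier = {start}
    for depth in range(1, d + 1):
        next_frontier = set()
        for airport in frontier:
            for nxt in graph.get(airport, ()):
                if nxt not in hops:  # first visit gives the fewest hops
                    hops[nxt] = depth
                    next_frontier.add(nxt)
        if not next_frontier:
            break
        frontier = next_frontier
    del hops[start]  # the start itself is not reported
    return hops


# Made-up routes, for illustration only.
demo_routes = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E"), ("D", "A")]
within_two = reachable_within(demo_routes, "A", 2)
```

Each airport is recorded the first time it is reached, so the stored value is the minimum number of hops from the start.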
10.11 Find list of Airports operating in the Country X
spark.sql("select *, count(*) over () as count from airports where country = 'Greenland'").show(100)
Output:
Figure 10.11: Output for list of Airports operating in the country X
10.12 Find the list of Airlines having X stops
spark.sql("select * from routes where stops > 0").show(100)
Output:
Figure 10.12: Output for the list of Airlines having X stops
10.13 List of Airlines operating with code share
spark.sql("select * , count(*) over() as count from routes where codeshare != 'na' ").show(100)
Output:
Figure 10.13: Output for Airlines operating with code share
10.14 Find the list of Active Airlines in the United States
spark.sql("select *, count(*) over() as count from airlines where country = 'United States' and active = 'Y'").show(100)
Output:
Figure 10.14: Output for list of active airlines in the United States
10.15 Which country (or) territory has the highest number of Airports
spark.sql("select count(*) as cnt, country from airports group by country order by cnt desc").show(20)
Output:
Figure 10.15: Output for multiple countries having highest number of Airports
10.16 The top K cities with most incoming Airlines
spark.sql("""select * from (select airports.airportid, airports.name, airports.city, airports.country,
tb2.incoming_flight_count from airports inner join (select count(*) as incoming_flight_count, destinationairportid
from routes group by destinationairportid) tb2 on airports.airportid = tb2.destinationairportid) otb order by
otb.incoming_flight_count desc""").show(100)
Output:
Figure 10.16: Output for top K cities with most incoming Airlines
10.17 The top K cities with most outgoing Airlines
spark.sql("""select * from (select airports.airportid, airports.name, airports.city, airports.country,
tb2.outgoing_flight_count from airports inner join (select count(*) as outgoing_flight_count, sourceairportid from
routes group by sourceairportid) tb2 on airports.airportid = tb2.sourceairportid) otb order by
otb.outgoing_flight_count desc""").show(100)
Output:
Figure 10.17: Output for top K cities with most outgoing Airlines
10.18 Trip that connects two cities X and Y
spark.sql("""select * from routes where sourceairportid = '2613' and
destinationairportid='2531' """).show(100)
Output:
Figure 10.18: Output for trip that connects two cities X and Y
10.19 Trip that connects X and Y with less than Z stops
spark.sql("""select * from routes where sourceairportid = '2613' and
destinationairportid='2531' and stops < 1 """).show(100)
Output:
Figure 10.19: Output for trip that connects X and Y with less than Z stops
10.20 All the cities reachable within d hops of a city
spark.sql("""select destinationairport from routes where stops = 1 """).show(100)
Output:
Figure 10.20: Output for all the cities reachable within d hops of a city
11 AWS Output Screenshot
Figure 30: AWS Crawlers page
Figure 31: AWS Tables
Figure 32: AWS Athena Query
Figure 33: AWS Athena Output
12 Acknowledgement
I would like to thank my major professor, Liu Yunchuan, for having faith in me and my talents and for continuing to believe that I
would be able to complete the project on schedule. This project was completed successfully thanks to that support, ongoing direction,
and insightful feedback. I also want to express my sincere gratitude to my mentor for being on my panel, working as my academic
advisor, helping me make all the important choices, and having faith in me.
13 References:
[1] http://openflights.org/data.html
[2] https://docs.aws.amazon.com/glue/latest/ug/tutorial-create-job.html
[3] https://spoddutur.github.io/spark-notes/spark-as-cloud-based-sql-engine-via-thrift-server.html
[4] https://docs.aws.amazon.com/s3/index.html
[5] https://www.iata.org/en/publications/directories/code-search/
[6] https://www.youtube.com/watch?v=8VOf1PUFE0I
[7] https://docs.aws.amazon.com/iam/index.html
[8] https://www.jetbrains.com/pycharm/learn/
[9] https://docs.python.org/3/library/index.html
[10] https://pandas.pydata.org/docs/
[11] https://spark.apache.org/docs/latest/api/python/index.html