Airline Search Engine Project

This document describes an airline search engine project that analyzes airline and airport data. The project builds a data pipeline to clean, transform, and load raw data into Spark. It then demonstrates various queries like finding airports in a country, airlines with a certain number of stops, active airlines in the US, and routes between cities. The project utilizes tools like Python, Spark, AWS and shows screenshots of sample outputs.

Governors State University

OPUS Open Portal to University Scholarship

All Capstone Projects Student Capstone Projects

Fall 2022

Airline Search Engine Project


Arun Kailasa

Follow this and additional works at: https://opus.govst.edu/capstones

Recommended Citation
Kailasa, Arun, "Airline Search Engine Project" (2022). All Capstone Projects. 568.
https://opus.govst.edu/capstones/568

For more information about the academic degree, extended learning, and certificate programs of Governors State
University, go to http://www.govst.edu/Academics/Degree_Programs_and_Certifications/

Visit the Governors State Computer Science Department


This Capstone Project is brought to you for free and open access by the Student Capstone Projects at OPUS Open
Portal to University Scholarship. It has been accepted for inclusion in All Capstone Projects by an authorized
administrator of OPUS Open Portal to University Scholarship. For more information, please contact
[email protected].
AIRLINE SEARCH ENGINE PROJECT
By

Arun Kailasa
B. Tech, Vaagdevi College of Engineering, 2020

GRADUATE CAPSTONE SEMINAR PROJECT

Submitted in partial fulfillment of the requirements

For the Degree of Master of Science,

With a Major in Computer Science

Governors State University


University Park, IL 60484

2022
ABSTRACT

The Airline Search Engine Project is a tool that helps anyone find facts and data related to airlines and airports. The
raw data set for this project is available in the .dat format and can be downloaded from [1].

The tool also performs an initial cleaning of the data where needed to form dimensional data. The cleaning process
includes data value unification, data type and size unification, deduplication, dropping columns, and correcting some
known errors.

The data is processed with Python and Spark. It can be stored in distributed storage systems such as Hadoop and
Amazon S3. The development environments used in this project are Google Colab and PyCharm.

The tool can be run as a job on different clusters such as EMR (Elastic MapReduce), HDInsight, Cloudera, and
Databricks. It can turn terabytes of raw data into useful information, from which we can create reports for Data
Analysts, Data Scientists, and businesspeople.
Table of Contents
1 Project Description
1.1 Appendix A
1.2 Appendix B
1.3 Appendix C
2 Architecture and flow of the Data Pipeline
3 Tools and Technologies
4 Project Structure
5 Project folder Hierarchy
6 Utility Code
7 Code for creating the Spark session
8 Transformation and Cleaning
9 Complete Project Code
10 Project Output Screenshots
10.1 Find a list of Airports operating in the Country X
10.2 Find the list of Airlines having X stops
10.3 List of Airlines operating with codeshare
10.4 Find the list of Active Airlines in the United States
10.5 Which country (or) territory has the highest number of Airports
10.6 The top K cities with most Incoming Airlines
10.7 The top K cities with most Outgoing Airlines
10.8 Trip that connects two cities X and Y
10.9 Trip that connects X and Y with less than Z stops
10.10 All the cities reachable within d hops of a city
10.11 Find list of Airports operating in the Country X
10.12 Find the list of Airlines having X stops
10.13 List of Airlines operating with code share
10.14 Find the list of Active Airlines in the United States
10.15 Which country (or) territory has the highest number of Airports
10.16 The top K cities with most incoming Airlines
10.17 The top K cities with most outgoing Airlines
10.18 Trip that connects two cities X and Y
10.19 Trip that connects X and Y with less than Z stops
10.20 All the cities reachable within d hops of a city
11 AWS Output Screenshot
12 Acknowledgement
13 References
1 Project Description

This tool processes the raw data sets listed in Appendix A and derives from them the useful facts listed in Appendix B.
It first builds dimensional data models from the raw data: the Airports, Airlines, Routes, Planes, and Countries tables.
The schemas of those tables are given in Appendix C.

1.1 Appendix A:
The raw data sets are:
1) Airport.dat – contains information related to airports, such as airport id and airport name.
2) Airlines.dat – contains information related to airlines, such as airline id and airline name [5].
3) Routes.dat – contains information related to routes, such as source airport and destination airport.
4) Plane – contains information related to planes, such as plane name.
5) Country – contains information related to countries, such as country name and iso_code [5].

1.2 Appendix B:
a. Find the list of Airports operating in Country X.
b. Find the list of Airlines having X stops.
c. List the Airlines operating with code share.
d. Find the list of Active Airlines in the United States.

Airline aggregation:
e. Which Country (or) Territory has the highest number of Airports.
f. The top K cities with the most Incoming/Outgoing Airlines.

Trip recommendation:
g. Define a trip as a sequence of connected routes. Find a trip that connects two cities X and Y (reachability).
h. Find a trip that connects X and Y with less than Z stops (constrained reachability).
i. Find all the cities reachable within d hops of a city (bounded reachability).
j. Fast transitive closure/connected component implemented in parallel/distributed algorithms.
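The last item in the list above (transitive closure / connected components) is not shown elsewhere in this report. As a rough illustration only, the following is a minimal single-machine sketch that groups airports into connected components with a union-find structure. It is not the parallel/distributed algorithm the item asks for, routes are treated as undirected edges (an assumption), and the airport codes in sample_routes are made-up sample data rather than rows from the data set.

```python
from typing import Dict, Iterable, List, Set, Tuple


def connected_components(routes: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group airports into components, treating each route as an undirected edge."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        # Register unseen airports as their own root.
        parent.setdefault(x, x)
        # Path halving: point each visited node at its grandparent.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for src, dst in routes:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_src] = root_dst  # merge the two components

    groups: Dict[str, Set[str]] = {}
    for airport in list(parent):
        groups.setdefault(find(airport), set()).add(airport)
    # Sort components so the result order is deterministic.
    return sorted(groups.values(), key=lambda group: sorted(group))


# Hypothetical sample routes (source airport, destination airport).
sample_routes = [("GOH", "SFJ"), ("SFJ", "CPH"), ("ORD", "JFK")]
components = connected_components(sample_routes)
```

Two airports end up in the same component exactly when some chain of routes links them, which answers the undirected version of the reachability question for every pair at once.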

1.3 Appendix C:

Table name: Airports

Column                  Type
airport_id              bigint
Name                    string
city                    string
country                 string
iata                    string
icao                    string
latitude                double
longitude               double
altitude                bigint
timezone                double
dst                     string
tz_database             string
type                    string
source                  string

Table name: Airlines

Column                  Type
Airlineid               bigint
Name                    string
Alias                   string
Iata                    string
Icao                    string
Callsign                string
Country                 string
active                  string

Table name: Routes

Column                  Type
Airline                 string
Airlineid               string
Source_airport          string
Source_airport_id       string
Destination_airport     string
Destination_airportid   string
Codeshare               string
Stops                   bigint
Equipment               string

Table name: Planes

Column                  Type
Name                    string
Iata                    string
Icao                    string

Table name: Countries

Column                  Type
Name                    string
Iso_code                string
Dafif_code              string

2 Architecture and flow of the Data Pipeline

The given data set is uploaded either to an Amazon S3 bucket [4, 6] or to the Hadoop Distributed File System (HDFS).
The uploaded data is then processed with the Apache Spark engine [3], which typically runs on a cluster such as the
Amazon Elastic MapReduce (EMR) service or as a locally installed Spark. Once the data is processed, the output can be
stored in another Amazon S3 bucket or in HDFS. The output data can be viewed with tools such as Apache Superset,
Tableau, the Presto query engine, or Amazon Athena [6], or it can be registered as another Hive table [3].

Figure 1: Architecture and flow of the Data Pipeline [2].

3 Tools and Technologies


Google Colab, Spark, Python, AWS, PyCharm, HDFS, SQL, and AWS resources such as S3 buckets, Identity and Access
Management (IAM), AWS Glue Data Catalog, AWS Glue Crawler, and Amazon Athena.

4 Project Structure

The Airline Search Engine Project is developed in an Integrated Development Environment (IDE) such as
PyCharm [8], with the necessary language binaries, PySpark and Spark, installed [3, 11].

Figure 2: PySpark version 3.1.2 and Spark version 3.1.2.

The pip list command shows the versions used in this project: PySpark 3.1.2 and Spark 3.1.2.

Figure 3: pip list command showing PySpark Version.

5 Project folder Hierarchy

A separate project is created for this work, with its own virtual environment in which the necessary project
dependencies, such as Pandas [10] and NumPy, are installed. The folder structure includes a separate folder for data
loading/reading, and the Spark utility code is kept in its own folder, such as the util folder.

Figure 4: Project folder Hierarchy

6 Utility Code

Utility code was developed to read the Spark session configuration and to set the Spark configuration at run time. The
load_df utility was developed to read the data. The code is shown in the screenshot below.

Figure 5: Utility Code
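Because the utility code itself is only available as a screenshot, the following is a hypothetical sketch of what such utilities could look like, not the project's actual code. The INI-style configuration file, the section name "spark", and the helper names read_spark_conf and create_session are assumptions made for illustration; only the name load_df comes from the text above, and its body here is a guess. PySpark is imported lazily so the configuration part can be tried without a Spark installation.

```python
import configparser
import os
import tempfile
from typing import Dict, List, Optional


def read_spark_conf(path: str, section: str = "spark") -> Dict[str, str]:
    """Read Spark settings (key = value pairs) from an INI-style file (assumed format)."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names exactly as written in the file
    parser.read(path)
    return dict(parser[section]) if parser.has_section(section) else {}


def create_session(app_name: str, conf: Dict[str, str]):
    """Build a SparkSession and apply each configuration entry to the builder."""
    from pyspark.sql import SparkSession  # lazy import: needs PySpark installed

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def load_df(spark, path: str, columns: List[str], view: Optional[str] = None):
    """Read a header-less comma-separated .dat file and name its columns."""
    # toDF raises an error if len(columns) differs from the number of fields read.
    df = spark.read.csv(path, header=False, inferSchema=True).toDF(*columns)
    if view:
        df.createOrReplaceTempView(view)  # makes the data queryable via spark.sql
    return df


# Small demonstration of the configuration reader only (no Spark required).
with tempfile.TemporaryDirectory() as tmp_dir:
    conf_path = os.path.join(tmp_dir, "spark.conf")
    with open(conf_path, "w") as handle:
        handle.write("[spark]\nspark.master = local[2]\nspark.sql.shuffle.partitions = 4\n")
    demo_conf = read_spark_conf(conf_path)
```

With PySpark installed, a call such as load_df(spark, "airports.dat", column_names, view="airports") would register the view that the queries in section 10 select from; the file path and column list in that call are placeholders.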

7 Code for creating the Spark session

Figure 6: Code for creating the Spark session

8 Transformation and Cleaning

The transformation and cleaning step replaces placeholder strings such as “\N” and “-” with na, and replaces all
null values with the string na. The output after this transformation and cleaning is shown in the screenshot
below.

Figure 7: Transformation and Cleaning
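The cleaning rule described above can be sketched in plain Python as follows. This is a stand-in for illustration, not the project's Spark code (which appears only in the screenshot): it assumes the .dat files are comma-separated with optional double quotes, and it additionally treats empty fields as missing, which is an assumption. The helper names clean_value and clean_rows and the sample line are hypothetical.

```python
import csv
import io
from typing import List, Optional

# Placeholder strings to unify; "" (empty field) is an added assumption.
PLACEHOLDERS = {"\\N", "-", ""}


def clean_value(value: Optional[str]) -> str:
    """Map nulls and placeholder strings to the literal string "na"."""
    if value is None:
        return "na"
    # Compare the stripped text so that "- " and "-" are handled alike.
    return "na" if value.strip() in PLACEHOLDERS else value


def clean_rows(raw_text: str) -> List[List[str]]:
    """Parse comma-separated lines and clean every field."""
    reader = csv.reader(io.StringIO(raw_text))
    return [[clean_value(field) for field in row] for row in reader]


# Hypothetical airline record in the layout of the Airlines table.
sample = '1,"Example Air",\\N,-,EXA,"EXAMPLE","United States",Y\n'
cleaned = clean_rows(sample)
```

After this step a filter such as codeshare != 'na' (used in section 10.3) selects only the rows that carried a real value.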

9 Complete Project Code:

Figure 8: Project Code

Figure 9: Complete Project Code

Figure 10: Complete Project Code

Figure 11: Spark Session Configuration Code

10 Project Output Screenshots

10.1 Find a list of Airports operating in the Country X

spark.sql("select *, count(*) over () as count from airports where country = 'Greenland'").show(100)

Output:

Figure 10.1: Output for list of Airports operating in the Country X (‘GREENLAND’)

10.2 Find the list of Airlines having X stops


spark.sql("select * from routes where stops > 0").show(100)

Output:

Figure 10.2: Output for list of Airlines having X stops

10.3 List of Airlines operating with codeshare
spark.sql("select * , count(*) over() as count from routes where codeshare != 'na' ").show(100)

Output:

Figure 10.3: Output for list of Airlines operating with codeshare

10.4 Find the list of Active Airlines in the United States

spark.sql("select *, count(*) over() as count from airlines where country = 'United States' and active = 'Y'").show(100)

Output:

Figure 10.4: Output for list of active Airlines in the United States

10.5 Which country (or) territory has the highest number of Airports
spark.sql("select count(*) as cnt, country from airports group by country order by cnt desc").show(20)

Output:

Figure 10.5: Output for the countries with highest number of Airports

10.6 The top K cities with most Incoming Airlines

spark.sql("""select * from (select airports.airportid, airports.name, airports.city, airports.country,
tb2.incoming_flight_count from airports inner join (select count(*) as incoming_flight_count, destinationairportid
from routes group by destinationairportid) tb2 on airports.airportid = tb2.destinationairportid) otb order by
otb.incoming_flight_count desc""").show(100)

Output:

Figure 10.6: Output for the top cities with most incoming Airlines
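The query above orders every airport by incoming route count and relies on show(100) to cut off the display. One way to make K an explicit parameter is a limit clause, sketched below. SQLite from the Python standard library stands in for Spark SQL here purely so the example runs without a cluster; the table contents and the helper name top_k_incoming are made up for illustration, and only the column names follow the query above.

```python
import sqlite3

# In-memory stand-in tables with hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table airports (airportid text, name text, city text, country text);
create table routes (airline text, sourceairportid text, destinationairportid text);
insert into airports values
    ('1', 'Alpha Intl', 'Alphaville', 'Exampleland'),
    ('2', 'Beta Field', 'Betatown', 'Exampleland'),
    ('3', 'Gamma Strip', 'Gammaburg', 'Exampleland');
insert into routes values
    ('AA', '1', '2'), ('BB', '3', '2'), ('AA', '2', '1'),
    ('CC', '1', '3'), ('CC', '3', '1'), ('BB', '1', '2');
""")


def top_k_incoming(connection: sqlite3.Connection, k: int):
    """Return the k airports with the most incoming routes."""
    query = """
        select a.airportid, a.name, a.city, a.country, r.incoming_flight_count
        from airports a
        inner join (
            select destinationairportid, count(*) as incoming_flight_count
            from routes
            group by destinationairportid
        ) r on a.airportid = r.destinationairportid
        order by r.incoming_flight_count desc, a.airportid
        limit ?
    """
    return connection.execute(query, (k,)).fetchall()


top_two = top_k_incoming(conn, 2)
```

The extra sort key a.airportid only makes ties deterministic. Like the original query, this ranks airports rather than cities; ranking cities would need a further group by on the city column.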

10.7 The top K cities with most Outgoing Airlines

spark.sql("""select * from (select airports.airportid, airports.name, airports.city, airports.country,
tb2.outgoing_flight_count from airports inner join (select count(*) as outgoing_flight_count, sourceairportid from
routes group by sourceairportid) tb2 on airports.airportid = tb2.sourceairportid) otb order by
otb.outgoing_flight_count desc""").show(100)

Output:

Figure 10.7: Output for top cities with most outgoing Airlines

10.8 Trip that connects two cities X and Y

spark.sql("""select * from routes where sourceairportid = '2613' and destinationairportid = '2531'""").show(100)

Output:

Figure 10.8: Output for trip that connects two cities X and Y
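The query above returns only direct routes between the two airports. Since Appendix B defines a trip as a sequence of connected routes, a multi-leg search can be sketched with a breadth-first search over (source, destination) pairs, as below. This is an illustrative plain-Python sketch under that reading, not the project's implementation; the function name find_trip and the intermediate airport ids '100', '200', and '300' are hypothetical, while '2613' and '2531' are the ids used in the query.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Tuple


def find_trip(routes: Iterable[Tuple[str, str]], source: str, target: str) -> Optional[List[str]]:
    """Return a trip with the fewest routes from source to target, or None."""
    graph: Dict[str, List[str]] = {}
    for src, dst in routes:
        graph.setdefault(src, []).append(dst)

    previous: Dict[str, Optional[str]] = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        airport = queue.popleft()
        if airport == target:
            # Walk the predecessor links back to the source.
            path: List[str] = []
            current: Optional[str] = airport
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for nxt in graph.get(airport, []):
            if nxt not in previous:
                previous[nxt] = airport
                queue.append(nxt)
    return None


# Hypothetical routes: no direct route, but a connection through airport '100'.
trip_routes = [("2613", "100"), ("100", "2531"), ("2613", "200"), ("200", "300")]
trip = find_trip(trip_routes, "2613", "2531")
no_trip = find_trip(trip_routes, "2531", "2613")
```

Routes are directed here, so a trip from X to Y does not imply one from Y to X.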

10.9 Trip that connects X and Y with less than Z stops

spark.sql("""select * from routes where sourceairportid = '2613' and destinationairportid = '2531' and stops < 1""").show(100)

Output:

Figure 10.9: Output for trip that connects X and Y with less than Z stops
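The query above filters on the stops column of a single route record. If "stops" is instead read as the number of intermediate airports in a multi-route trip (an interpretation the source does not confirm), the constraint can be sketched as a depth-bounded breadth-first search. The function name find_trip_with_fewer_stops and the intermediate airport id '100' are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Tuple


def find_trip_with_fewer_stops(
    routes: Iterable[Tuple[str, str]], source: str, target: str, z: int
) -> Optional[List[str]]:
    """Return a trip from source to target with fewer than z intermediate airports."""
    graph: Dict[str, List[str]] = {}
    for src, dst in routes:
        graph.setdefault(src, []).append(dst)

    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        # Extending a path of n airports yields n - 1 intermediate stops,
        # so stop extending once that would reach z.
        if len(path) - 1 >= z:
            continue
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical routes: the only connection goes through airport '100'.
stop_routes = [("2613", "100"), ("100", "2531")]
one_stop_allowed = find_trip_with_fewer_stops(stop_routes, "2613", "2531", 2)
direct_only = find_trip_with_fewer_stops(stop_routes, "2613", "2531", 1)
```

Because breadth-first search reaches each airport by a path with the fewest routes first, marking airports as seen does not hide any trip that would satisfy the bound.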

10.10 All the cities reachable within d hops of a city

spark.sql("""select destinationairport from routes where stops = 1 """).show(100)

Output:

Figure 10.10: Output for all the cities reachable within d hops of a city
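The query above selects destinations of routes that have exactly one stop, which is not the same as everything reachable within d hops of a chosen city. A bounded breadth-first search that follows Appendix B's wording is sketched below as a plain-Python illustration. It assumes the routes have already been joined to the airports table so that each route is a (source city, destination city) pair; the function name cities_within_hops and the city names A to E are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple


def cities_within_hops(city_routes: Iterable[Tuple[str, str]], start: str, d: int) -> Dict[str, int]:
    """Map each city reachable from start in at most d hops to its hop count."""
    graph: Dict[str, Set[str]] = {}
    for src, dst in city_routes:
        graph.setdefault(src, set()).add(dst)

    hops: Dict[str, int] = {start: 0}
    queue = deque([start])
    while queue:
        city = queue.popleft()
        if hops[city] == d:
            continue  # already at the hop limit, do not expand further
        for nxt in graph.get(city, ()):
            if nxt not in hops:
                hops[nxt] = hops[city] + 1
                queue.append(nxt)

    del hops[start]  # report only the other cities
    return hops


# Hypothetical city-level routes.
hop_routes = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E")]
within_two = cities_within_hops(hop_routes, "A", 2)
```

Here city D is three hops from A and is therefore excluded when d is 2.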

10.11 Find list of Airports operating in the Country X

spark.sql("select *, count(*) over () as count from airports where country = 'Greenland'").show(100)

Output:

Figure 10.11: Output for list of Airports operating in the country X

10.12 Find the list of Airlines having X stops


spark.sql("select * from routes where stops > 0").show(100)

Output:

Figure 10.12: Output for the list of Airlines having X stops

10.13 List of Airlines operating with code share

spark.sql("select * , count(*) over() as count from routes where codeshare != 'na' ").show(100)

Output:

Figure 10.13: Output for Airlines operating with code share

10.14 Find the list of Active Airlines in the United States

spark.sql("select *, count(*) over() as count from airlines where country = 'United States' and active = 'Y'").show(100)

Output:

Figure 10.14: Output for list of active airlines in the United States

10.15 Which country (or) territory has the highest number of Airports

spark.sql("select count(*) as cnt, country from airports group by country order by cnt desc").show(20)

Output:

Figure 10.15: Output for multiple countries having highest number of Airports

10.16 The top K cities with most incoming Airlines

spark.sql("""select * from (select airports.airportid, airports.name, airports.city, airports.country,
tb2.incoming_flight_count from airports inner join (select count(*) as incoming_flight_count, destinationairportid
from routes group by destinationairportid) tb2 on airports.airportid = tb2.destinationairportid) otb order by
otb.incoming_flight_count desc""").show(100)

Output:

Figure 10.16: Output for top K cities with most incoming Airlines

10.17 The top K cities with most outgoing Airlines

spark.sql("""select * from (select airports.airportid, airports.name, airports.city, airports.country,
tb2.outgoing_flight_count from airports inner join (select count(*) as outgoing_flight_count, sourceairportid from
routes group by sourceairportid) tb2 on airports.airportid = tb2.sourceairportid) otb order by
otb.outgoing_flight_count desc""").show(100)

Output:

Figure 10.17: Output for top K cities with most outgoing Airlines

10.18 Trip that connects two cities X and Y

spark.sql("""select * from routes where sourceairportid = '2613' and destinationairportid = '2531'""").show(100)

Output:

Figure 10.18: Output for trip that connects two cities X and Y

10.19 Trip that connects X and Y with less than Z stops

spark.sql("""select * from routes where sourceairportid = '2613' and destinationairportid = '2531' and stops < 1""").show(100)

Output:

Figure 10.19: Output for trip that connects X and Y with less than Z stops

10.20 All the cities reachable within d hops of a city


spark.sql("""select destinationairport from routes where stops = 1 """).show(100)

Output:

Figure 10.20: Output for all the cities reachable within d hops of a city

11 AWS Output Screenshot

Figure 30: AWS Crawlers page

Figure 31: AWS Tables

Figure 32: AWS Athena Query

Figure 33: AWS Athena Output

12 Acknowledgement

I would like to thank my major professor, Liu Yunchuan, for having faith in me and my talents and for continuing to believe that I
would be able to complete the project on schedule. This Project was completed successfully thanks to the support, ongoing direction,
and insightful feedback. I also want to express my sincere gratitude to my mentor for being on my panel, working as my academic
advisor, helping me make all the important choices, and having faith in me.

13 References:
[1] http://openflights.org/data.html
[2] https://docs.aws.amazon.com/glue/latest/ug/tutorial-create-job.html
[3] https://spoddutur.github.io/spark-notes/spark-as-cloud-based-sql-engine-via-thrift-server.html
[4] https://docs.aws.amazon.com/s3/index.html
[5] https://www.iata.org/en/publications/directories/code-search/
[6] https://www.youtube.com/watch?v=8VOf1PUFE0I
[7] https://docs.aws.amazon.com/iam/index.html
[8] https://www.jetbrains.com/pycharm/learn/
[9] https://docs.python.org/3/library/index.html
[10] https://pandas.pydata.org/docs/
[11] https://spark.apache.org/docs/latest/api/python/index.html
