Formulate the Data Science Problem

Uploaded by

nicebi09
Copyright
© All Rights Reserved

Formulate the Data Science Problem:

The primary goal of this practicum is to analyze flight delay patterns using data from the
Bureau of Transportation Statistics for flights arriving in or departing from Arizona,
Nevada, and California in 2019. Specifically, the problem involves two main objectives:

1. Analyzing air traffic at various airports within these states to determine the volume and
patterns of air traffic.
2. Evaluating which airlines/carriers are more prone to delays and identifying the factors
contributing to these delays.

Question 2 — Prepare the Data (8 pts)

Step 1: Evaluate Dataset and Data Types

First, load the dataset and inspect the variables to ensure they are of the correct data
type. Typical data types include integers for counts, floats for measurements like delay
times, and strings for categorical variables like airline names and airport codes.

python
import pandas as pd

# Load the dataset
df = pd.read_csv('2019_ONTIME_REPORTING_FSW.csv')

# Display the first few rows and the data types of the dataframe
print(df.head())
df.info()  # info() prints its report directly and returns None

Step 2: Convert Variables to Suitable Data Types

Check if there are any columns that need type conversion. For example, dates should be
converted to datetime objects.

python
# Convert columns to appropriate data types
df['FL_DATE'] = pd.to_datetime(df['FL_DATE'])

# Check for other necessary conversions,
# for instance ensuring numeric columns are indeed numeric
df['DEP_DELAY'] = pd.to_numeric(df['DEP_DELAY'], errors='coerce')
df['ARR_DELAY'] = pd.to_numeric(df['ARR_DELAY'], errors='coerce')
df.info()
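
As a minimal sketch of what errors='coerce' does, the example below uses a small made-up frame (the values are hypothetical, not taken from the dataset). Entries that cannot be parsed become NaN or NaT instead of raising an error, so it is worth counting them after the conversion.

```python
import pandas as pd

# Hypothetical raw values, as they might look if a CSV column was read as text.
raw = pd.DataFrame({
    "FL_DATE": ["2019-01-01", "2019-01-02", "not a date"],
    "DEP_DELAY": ["5", "-3", "N/A"],
})

# errors="coerce" replaces unparseable entries with NaT / NaN instead of raising.
raw["FL_DATE"] = pd.to_datetime(raw["FL_DATE"], errors="coerce")
raw["DEP_DELAY"] = pd.to_numeric(raw["DEP_DELAY"], errors="coerce")

# Count how many values each conversion could not parse.
n_bad = raw[["FL_DATE", "DEP_DELAY"]].isna().sum()
print(raw.dtypes)
print(n_bad)
```

A non-zero count here means the conversion silently discarded values, which then show up as missing data in Step 3.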

Step 3: Handle Missing and Invalid Values

Identify and handle missing or invalid values in the dataset.

python
# Check for missing values
print(df.isnull().sum())

# Handle missing values - options include dropping or filling.
# Here we fill with the column mean for demonstration.
# The result is assigned back because chained fillna(..., inplace=True)
# triggers warnings in newer pandas versions.
df['DEP_DELAY'] = df['DEP_DELAY'].fillna(df['DEP_DELAY'].mean())
df['ARR_DELAY'] = df['ARR_DELAY'].fillna(df['ARR_DELAY'].mean())
print(df.isnull().sum())
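
The choice between dropping and filling affects later per-airline averages. The sketch below compares the two options on a small made-up frame (hypothetical values, with the same column names as above). It is an illustration of the trade-off, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two carriers, some arrival delays missing.
flights = pd.DataFrame({
    "OP_UNIQUE_CARRIER": ["AA", "AA", "WN", "WN", "WN"],
    "ARR_DELAY": [10.0, np.nan, -5.0, 40.0, np.nan],
})

# Option 1: drop rows with no recorded arrival delay.
dropped = flights.dropna(subset=["ARR_DELAY"])

# Option 2: fill missing delays with the overall mean (as in the step above).
overall_mean = flights["ARR_DELAY"].mean()
filled = flights.assign(ARR_DELAY=flights["ARR_DELAY"].fillna(overall_mean))

# Compare the per-carrier averages produced by each option.
comparison = pd.DataFrame({
    "dropped": dropped.groupby("OP_UNIQUE_CARRIER")["ARR_DELAY"].mean().round(2),
    "filled": filled.groupby("OP_UNIQUE_CARRIER")["ARR_DELAY"].mean().round(2),
})
print(comparison)
```

In this toy example, filling with the overall mean pulls each carrier's average toward the global average, which can mask differences between carriers.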

Step 4: High-Level Explanation and Visualizations

Summarize the dataset and create visualizations to support the explanation.

python
import matplotlib.pyplot as plt
import seaborn as sns

# Summary statistics
summary = df.describe()
print(summary)

# Visualization 1: Distribution of Departure Delays
plt.figure(figsize=(10, 6))
sns.histplot(df['DEP_DELAY'], bins=50, kde=True)
plt.title('Distribution of Departure Delays')
plt.xlabel('Departure Delay (minutes)')
plt.ylabel('Frequency')
plt.show()

# Visualization 2: Distribution of Arrival Delays
plt.figure(figsize=(10, 6))
sns.histplot(df['ARR_DELAY'], bins=50, kde=True)
plt.title('Distribution of Arrival Delays')
plt.xlabel('Arrival Delay (minutes)')
plt.ylabel('Frequency')
plt.show()

# Visualization 3: Number of Flights per Airline
plt.figure(figsize=(12, 8))
sns.countplot(y='OP_UNIQUE_CARRIER', data=df,
              order=df['OP_UNIQUE_CARRIER'].value_counts().index)
plt.title('Number of Flights per Airline')
plt.xlabel('Number of Flights')
plt.ylabel('Airline')
plt.show()

Question 3 — Explore Patterns in the Region (20 points)

Step 1: Determine which region has the most air traffic

Count the number of flights per state.

python
# Count flights per state
state_counts = df['ORIGIN_STATE_NM'].value_counts()

# Visualize the results
plt.figure(figsize=(8, 6))
state_counts.plot(kind='bar')
plt.title('Number of Flights per State')
plt.xlabel('State')
plt.ylabel('Number of Flights')
plt.show()

Comment on Findings:

• The bar plot shows the distribution of flights across Arizona, Nevada, and California,
identifying which state has the highest air traffic.
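
Counting by ORIGIN_STATE_NM only captures departures. If arrivals should count toward a state's traffic as well, a sketch along the following lines could combine both. It assumes a DEST_STATE_NM column exists alongside ORIGIN_STATE_NM, which has not been confirmed for this file, and the sample rows are made up.

```python
import pandas as pd

# Hypothetical sample; DEST_STATE_NM is an assumed column name.
flights = pd.DataFrame({
    "ORIGIN_STATE_NM": ["California", "California", "Nevada", "Arizona", "Texas"],
    "DEST_STATE_NM": ["Nevada", "Arizona", "California", "California", "California"],
})
states = ["Arizona", "California", "Nevada"]

# Departures and arrivals per state, restricted to the three states of interest.
departures = flights["ORIGIN_STATE_NM"].value_counts().reindex(states, fill_value=0)
arrivals = flights["DEST_STATE_NM"].value_counts().reindex(states, fill_value=0)

traffic = pd.DataFrame({"departures": departures, "arrivals": arrivals})
traffic["total"] = traffic["departures"] + traffic["arrivals"]
traffic["share_pct"] = (traffic["total"] / traffic["total"].sum() * 100).round(2)

busiest_state = traffic["total"].idxmax()
print(traffic)
print("Busiest state:", busiest_state)
```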

Step 2: Analyze the Most Popular Outbound/Destination Airports

Identify top outbound and destination airports for each state.


python
# Top outbound airports per state: count flights per (state, airport) pair,
# then keep the five busiest airports within each state.
# Grouping by state as well keeps ORIGIN_STATE_NM available for the second groupby.
top_outbound_airports = (
    df.groupby(['ORIGIN_STATE_NM', 'ORIGIN'])['DEST']
    .count()
    .reset_index(name='count')
    .sort_values(by='count', ascending=False)
    .groupby('ORIGIN_STATE_NM')
    .head(5)
)

# Visualize the results
plt.figure(figsize=(12, 8))
sns.barplot(x='count', y='ORIGIN', hue='ORIGIN_STATE_NM', data=top_outbound_airports)
plt.title('Top Outbound Airports per State')
plt.xlabel('Number of Flights')
plt.ylabel('Airport')
plt.show()

Comment on Findings:

• This visualization shows the top 5 outbound airports for each state and helps identify
key hubs of air traffic.
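
The same ranking can be applied to destination airports. The helper below is a hypothetical sketch (top_airports is not part of the original answer) and assumes the data has a DEST_STATE_NM column. It is shown on made-up rows.

```python
import pandas as pd

# Hypothetical sample; DEST_STATE_NM is an assumed column name.
flights = pd.DataFrame({
    "DEST_STATE_NM": ["California"] * 6 + ["Nevada"] * 3,
    "DEST": ["LAX", "LAX", "LAX", "SFO", "SFO", "SAN", "LAS", "LAS", "RNO"],
})

def top_airports(frame, state_col, airport_col, n=5):
    """Return the n busiest airports within each state, by row count."""
    counts = (
        frame.groupby([state_col, airport_col])
        .size()
        .reset_index(name="count")
        .sort_values([state_col, "count"], ascending=[True, False])
    )
    return counts.groupby(state_col).head(n).reset_index(drop=True)

top_dest = top_airports(flights, "DEST_STATE_NM", "DEST", n=2)
print(top_dest)
```

Calling the same helper with "ORIGIN_STATE_NM" and "ORIGIN" would give the outbound ranking, so both views come from one function.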

Question 4 — Explore the Carriers (20 points)

Step 1: Calculate the Proportion of Flights for Each Airline

python
# Calculate the proportion of flights for each airline
airline_flight_counts = df['OP_UNIQUE_CARRIER'].value_counts(normalize=True) * 100

# Visualize the top 10 airlines
top_10_airlines = airline_flight_counts.head(10)
plt.figure(figsize=(12, 8))
top_10_airlines.plot(kind='bar')
plt.title('Proportion of Flights by Top 10 Airlines')
plt.xlabel('Airline')
plt.ylabel('Proportion of Flights (%)')
plt.show()

Step 2: Analyze Flight Delays for Each Airline

Calculate and visualize summary statistics of delays.

python
# Summary statistics for delays
# (a list of columns needs double brackets; the tuple form fails in current pandas)
delay_summary = df.groupby('OP_UNIQUE_CARRIER')[['DEP_DELAY', 'ARR_DELAY']].describe()
print(delay_summary.round(2))

# Visualize the delay patterns
plt.figure(figsize=(12, 8))
sns.boxplot(x='OP_UNIQUE_CARRIER', y='DEP_DELAY', data=df)
plt.title('Boxplot of Departure Delays by Airline')
plt.xlabel('Airline')
plt.ylabel('Departure Delay (minutes)')
plt.xticks(rotation=90)
plt.show()

Comment on Findings:

• Boxplots highlight the variation and patterns in departure delays for each airline,
indicating which airlines are more prone to delays.
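
One way to make "more prone to delays" concrete is the share of flights that arrive late by some cutoff. The sketch below uses 15 minutes, the threshold commonly used in on-time reporting. Treat the cutoff as an assumption to confirm against the assignment. The data is made up.

```python
import pandas as pd

# Hypothetical sample of arrival delays (minutes) for two carriers.
flights = pd.DataFrame({
    "OP_UNIQUE_CARRIER": ["AA", "AA", "AA", "AA", "WN", "WN", "WN", "WN"],
    "ARR_DELAY": [0, 20, 45, -5, 5, 10, 16, -2],
})

DELAY_THRESHOLD_MIN = 15  # assumed cutoff for counting a flight as delayed

flights["IS_DELAYED"] = flights["ARR_DELAY"] >= DELAY_THRESHOLD_MIN

# Per-carrier flight count, median delay, and percentage of delayed flights.
carrier_stats = flights.groupby("OP_UNIQUE_CARRIER").agg(
    flights=("ARR_DELAY", "count"),
    median_delay=("ARR_DELAY", "median"),
    pct_delayed=("IS_DELAYED", "mean"),
)
carrier_stats["pct_delayed"] = (carrier_stats["pct_delayed"] * 100).round(2)
carrier_stats = carrier_stats.sort_values("pct_delayed", ascending=False)
print(carrier_stats)
```

The median is included because a few very long delays can dominate the mean, as the long boxplot whiskers suggest.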

Question 5 — Evaluate Airlines with Best Record (30 points)


Step 1: Identify Airlines with Best On-Time Record

Calculate and visualize the top 10 airlines with the least average delay.

python
# Calculate average delay per airline
avg_delays = df.groupby('OP_UNIQUE_CARRIER')['ARR_DELAY'].mean().sort_values()

# Visualize the top 10 airlines
top_10_on_time_airlines = avg_delays.head(10)
plt.figure(figsize=(12, 8))
top_10_on_time_airlines.plot(kind='bar')
plt.title('Top 10 Airlines with Least Average Arrival Delay')
plt.xlabel('Airline')
plt.ylabel('Average Arrival Delay (minutes)')
plt.show()

Step 2: Calculate Total Flight Hours per Month for Top 10 Airlines

python
# Calculate total flight hours per month for top 10 airlines
top_10_airlines_list = top_10_on_time_airlines.index.tolist()

# .copy() avoids a SettingWithCopyWarning when the new column is added below
df_top_10 = df[df['OP_UNIQUE_CARRIER'].isin(top_10_airlines_list)].copy()
df_top_10['FLIGHT_HOURS'] = df_top_10['ACTUAL_ELAPSED_TIME'] / 60  # Convert minutes to hours
monthly_flight_hours = (
    df_top_10.groupby(['OP_UNIQUE_CARRIER', df_top_10['FL_DATE'].dt.month])['FLIGHT_HOURS']
    .sum()
    .unstack()
)

# Visualize the results
plt.figure(figsize=(14, 8))
sns.heatmap(monthly_flight_hours, annot=True, fmt=".2f", cmap="YlGnBu")
plt.title('Monthly Flight Hours for Top 10 Airlines')
plt.xlabel('Month')
plt.ylabel('Airline')
plt.show()

Comment on Findings:

• The heatmap provides insights into the flight hours distribution across different months
for the top 10 airlines, showing seasonal variations and operational patterns.
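
"Flight hours" can be read in two ways: gate-to-gate time (ACTUAL_ELAPSED_TIME, which includes taxiing) or airborne time only. The sketch below computes both monthly tables on made-up rows. It assumes an AIR_TIME column in minutes is available, which has not been confirmed for this file, and monthly_hours is a hypothetical helper.

```python
import pandas as pd

# Hypothetical sample; AIR_TIME is an assumed column name (minutes airborne).
flights = pd.DataFrame({
    "OP_UNIQUE_CARRIER": ["AA", "AA", "WN", "WN"],
    "FL_DATE": pd.to_datetime(["2019-01-05", "2019-02-10", "2019-01-07", "2019-01-20"]),
    "ACTUAL_ELAPSED_TIME": [90.0, 120.0, 60.0, 75.0],
    "AIR_TIME": [70.0, 100.0, 45.0, 55.0],
})
flights["MONTH"] = flights["FL_DATE"].dt.month

def monthly_hours(frame, minutes_col):
    """Carrier-by-month table of total hours, rounded to 2 decimal places."""
    hours = frame[minutes_col] / 60
    table = (
        hours.groupby([frame["OP_UNIQUE_CARRIER"], frame["MONTH"]])
        .sum()
        .unstack(fill_value=0)
    )
    return table.round(2)

gate_to_gate = monthly_hours(flights, "ACTUAL_ELAPSED_TIME")
airborne = monthly_hours(flights, "AIR_TIME")
print(gate_to_gate)
print(airborne)
```

Whichever definition is used, stating it in the plot title or caption keeps the heatmap unambiguous.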

Question 6 — Explore Aircraft (20 points)

Step 1: Select and Analyze Three Aircraft

Identify and analyze three specific aircraft using their TAIL_NUM.

python
# Select three aircraft (missing tail numbers are dropped first)
selected_aircraft = df['TAIL_NUM'].dropna().unique()[:3]
df_selected_aircraft = df[df['TAIL_NUM'].isin(selected_aircraft)]

# Calculate average arrival and departure delays for selected aircraft
aircraft_delay_summary = df_selected_aircraft.groupby('TAIL_NUM')[['DEP_DELAY', 'ARR_DELAY']].mean()

# Visualize the results
# (figsize goes to plot() because DataFrame.plot opens its own figure)
aircraft_delay_summary.plot(kind='bar', figsize=(12, 8))
plt.title('Average Arrival and Departure Delays for Selected Aircraft')
plt.xlabel('Aircraft (TAIL_NUM)')
plt.ylabel('Average Delay (minutes)')
plt.show()

# Analyze destinations for selected aircraft
# (fill_value=0 keeps the counts as integers, which fmt="d" below requires)
aircraft_destinations = (
    df_selected_aircraft.groupby(['TAIL_NUM', 'DEST'])
    .size()
    .unstack(fill_value=0)
)

# Visualize the destinations
plt.figure(figsize=(14, 8))
sns.heatmap(aircraft_destinations, annot=True, fmt="d", cmap="Blues")
plt.title('Destinations for Selected Aircraft')
plt.xlabel('Destination Airport')
plt.ylabel('Aircraft (TAIL_NUM)')
plt.show()

Comment on Findings:

• Bar plots show the average delays for each selected aircraft, while heatmaps display the
destinations these aircraft frequently travel to, revealing patterns in their operations.
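
Taking the first three tail numbers in file order may select aircraft with very few flights. One alternative, sketched below on made-up rows, is to pick the three aircraft with the most recorded flights. This selection rule is an assumption, since the task does not say how the aircraft should be chosen.

```python
import pandas as pd

# Hypothetical sample; one row has no tail number.
flights = pd.DataFrame({
    "TAIL_NUM": ["N1", "N1", "N1", "N2", "N2", "N3", "N3", "N4", None],
    "DEST": ["LAX", "LAS", "LAX", "PHX", "PHX", "SFO", "SAN", "LAX", "LAS"],
    "ARR_DELAY": [10.0, 0.0, 5.0, 30.0, 20.0, -5.0, 7.0, 3.0, 1.0],
})

# value_counts() ignores missing tail numbers and sorts by frequency.
top_tails = flights["TAIL_NUM"].value_counts().head(3).index.tolist()
selected = flights[flights["TAIL_NUM"].isin(top_tails)]

# Flight count and mean arrival delay per aircraft, rounded to 2 decimals.
summary = selected.groupby("TAIL_NUM")["ARR_DELAY"].agg(["count", "mean"]).round(2)

# Integer table of how often each aircraft flew to each destination.
dest_counts = pd.crosstab(selected["TAIL_NUM"], selected["DEST"])
print(summary)
print(dest_counts)
```

Reporting the flight count next to the mean makes it clear how much data each average is based on.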

Note: Label all visualizations with appropriate titles and axis labels, and round
all numeric calculations to 2 decimal places. Provide
