Formulate the Data Science Problem
The primary goal of this practicum is to analyze flight delay patterns using data from the
Bureau of Transportation Statistics for flights departing from Arizona,
Nevada, and California in 2019. Specifically, the problem involves two main objectives:
1. Analyzing air traffic at various airports within these states to determine the volume and
patterns of air traffic.
2. Evaluating which airlines/carriers are more prone to delays and identifying the factors
contributing to these delays.
First, load the dataset and inspect the variables to ensure they are of the correct data
type. Typical data types include integers for counts, floats for measurements like delay
times, and strings for categorical variables like airline names and airport codes.
python
import pandas as pd

# Load the dataset
df = pd.read_csv('2019_ONTIME_REPORTING_FSW.csv')

# Display the first few rows and data types of the dataframe
print(df.head())
df.info()  # info() prints its report directly and returns None
Check if there are any columns that need type conversion. For example, dates should be
converted to datetime objects.
python
# Convert columns to appropriate data types
df['FL_DATE'] = pd.to_datetime(df['FL_DATE'])

# Check for other necessary conversions
# For instance, ensuring numeric columns are indeed numeric
# (errors='coerce' turns unparseable values into NaN)
df['DEP_DELAY'] = pd.to_numeric(df['DEP_DELAY'], errors='coerce')
df['ARR_DELAY'] = pd.to_numeric(df['ARR_DELAY'], errors='coerce')
df.info()  # info() prints its report directly and returns None
python
# Check for missing values
print(df.isnull().sum())

# Handle missing values - options include dropping or filling
# Here we choose to fill with mean or a specific value for demonstration
# (assign the result back; fillna(inplace=True) on a selected column is unreliable in recent pandas)
df['DEP_DELAY'] = df['DEP_DELAY'].fillna(df['DEP_DELAY'].mean())
df['ARR_DELAY'] = df['ARR_DELAY'].fillna(df['ARR_DELAY'].mean())
print(df.isnull().sum())
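Mean imputation is only one of the options named in the comment above. As a minimal sketch, the two options can be compared side by side on a few invented rows (the values are illustrative and do not come from the BTS file):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real file; values are invented for illustration.
toy = pd.DataFrame({
    "OP_UNIQUE_CARRIER": ["AA", "AA", "WN", "WN"],
    "DEP_DELAY": [10.0, np.nan, -5.0, 25.0],
    "ARR_DELAY": [12.0, np.nan, -8.0, np.nan],
})

delay_cols = ["DEP_DELAY", "ARR_DELAY"]

# Option 1: drop every row that is missing either delay value.
dropped = toy.dropna(subset=delay_cols)

# Option 2: fill each delay column with that column's own mean.
filled = toy.copy()
filled[delay_cols] = filled[delay_cols].fillna(filled[delay_cols].mean())

print(len(dropped), "rows kept after dropping")
print(filled[delay_cols].round(2))
```

If the missing delays turn out to belong mostly to flights that never operated normally (worth checking against any cancellation or diversion indicator the file provides), filling with the mean assigns delays to flights that had none, so dropping those rows may be the more defensible choice.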
python
import matplotlib.pyplot as plt
import seaborn as sns

# Summary statistics
summary = df.describe()
print(summary)

# Visualization 1: Distribution of Departure Delays
plt.figure(figsize=(10, 6))
sns.histplot(df['DEP_DELAY'], bins=50, kde=True)
plt.title('Distribution of Departure Delays')
plt.xlabel('Departure Delay (minutes)')
plt.ylabel('Frequency')
plt.show()

# Visualization 2: Distribution of Arrival Delays
plt.figure(figsize=(10, 6))
sns.histplot(df['ARR_DELAY'], bins=50, kde=True)
plt.title('Distribution of Arrival Delays')
plt.xlabel('Arrival Delay (minutes)')
plt.ylabel('Frequency')
plt.show()

# Visualization 3: Number of Flights per Airline
plt.figure(figsize=(12, 8))
sns.countplot(y='OP_UNIQUE_CARRIER', data=df,
              order=df['OP_UNIQUE_CARRIER'].value_counts().index)
plt.title('Number of Flights per Airline')
plt.xlabel('Number of Flights')
plt.ylabel('Airline')
plt.show()
python
# Count flights per state
state_counts = df['ORIGIN_STATE_NM'].value_counts()

# Visualize the results
plt.figure(figsize=(8, 6))
state_counts.plot(kind='bar')
plt.title('Number of Flights per State')
plt.xlabel('State')
plt.ylabel('Number of Flights')
plt.show()
Comment on Findings:
The bar plot shows the distribution of flights across Arizona, Nevada, and California,
identifying which state has the highest air traffic.
Comment on Findings:
This visualization shows the top 5 outbound airports for each state and helps identify
key hubs of air traffic.
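The code behind this visualization is not shown here. A minimal sketch of one way to compute the underlying counts follows, run on invented rows. `ORIGIN` (the origin airport code) is an assumed column name, and `top_airports_by_state` is a hypothetical helper, not part of the practicum materials.

```python
import pandas as pd

def top_airports_by_state(df, n=5):
    """Return the n busiest origin airports in each state, by flight count."""
    counts = (
        df.groupby(["ORIGIN_STATE_NM", "ORIGIN"])
        .size()
        .rename("FLIGHTS")
        .reset_index()
    )
    # Busiest airports first within each state, then keep the first n per state.
    counts = counts.sort_values(
        ["ORIGIN_STATE_NM", "FLIGHTS"], ascending=[True, False]
    )
    return counts.groupby("ORIGIN_STATE_NM").head(n).reset_index(drop=True)

# Invented rows for illustration only.
toy = pd.DataFrame({
    "ORIGIN_STATE_NM": ["Arizona"] * 3 + ["Nevada"] * 3,
    "ORIGIN": ["PHX", "PHX", "TUS", "LAS", "LAS", "RNO"],
})
top = top_airports_by_state(toy, n=1)
print(top)
```

On the real data the default `n=5` applies, and each state's rows can then be drawn as a labeled bar chart in the same style as the other plots.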
python
# Calculate the proportion of flights for each airline
airline_flight_counts = df['OP_UNIQUE_CARRIER'].value_counts(normalize=True) * 100

# Visualize the top 10 airlines
top_10_airlines = airline_flight_counts.head(10)
plt.figure(figsize=(12, 8))
top_10_airlines.plot(kind='bar')
plt.title('Proportion of Flights by Top 10 Airlines')
plt.xlabel('Airline')
plt.ylabel('Proportion of Flights (%)')
plt.show()
python
# Summary statistics for delays
# (selecting several columns after groupby requires a list, hence the double brackets)
delay_summary = df.groupby('OP_UNIQUE_CARRIER')[['DEP_DELAY', 'ARR_DELAY']].describe()
print(delay_summary.round(2))

# Visualize the delay patterns
plt.figure(figsize=(12, 8))
sns.boxplot(x='OP_UNIQUE_CARRIER', y='DEP_DELAY', data=df)
plt.title('Boxplot of Departure Delays by Airline')
plt.xlabel('Airline')
plt.ylabel('Departure Delay (minutes)')
plt.xticks(rotation=90)
plt.show()
Comment on Findings:
Boxplots highlight the variation and patterns in departure delays for each airline,
indicating which airlines are more prone to delays.
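Boxplots show spread, but "prone to delays" can also be expressed as a rate. The following is a minimal sketch on invented rows. It assumes a flight counts as delayed when it arrives 15 or more minutes late, a cutoff commonly used in on-time reporting that should be adjusted if the practicum defines delay differently. `DELAY_THRESHOLD_MIN` and `delay_rate_by_carrier` are hypothetical names.

```python
import pandas as pd

DELAY_THRESHOLD_MIN = 15  # assumed cutoff; adjust to the practicum's definition

def delay_rate_by_carrier(df, threshold=DELAY_THRESHOLD_MIN):
    """Percentage of flights per carrier whose arrival delay meets the threshold."""
    # Ignore flights with no recorded arrival delay.
    known = df.dropna(subset=["ARR_DELAY"])
    is_delayed = known["ARR_DELAY"] >= threshold
    # The mean of a boolean column is the share of True values.
    rate = is_delayed.groupby(known["OP_UNIQUE_CARRIER"]).mean() * 100
    return rate.round(2).sort_values(ascending=False)

# Invented rows for illustration only.
toy = pd.DataFrame({
    "OP_UNIQUE_CARRIER": ["AA", "AA", "AA", "WN", "WN"],
    "ARR_DELAY": [20.0, 5.0, 40.0, -3.0, 15.0],
})
rates = delay_rate_by_carrier(toy)
print(rates)
```

A rate like this is less sensitive to a handful of extreme delays than the mean is, so it complements the boxplot rather than replacing it.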
Calculate and visualize the top 10 airlines with the least average delay.
python
# Calculate average delay per airline (sorted ascending, so the smallest delays come first)
avg_delays = df.groupby('OP_UNIQUE_CARRIER')['ARR_DELAY'].mean().sort_values()

# Visualize the top 10 airlines
top_10_on_time_airlines = avg_delays.head(10)
plt.figure(figsize=(12, 8))
top_10_on_time_airlines.plot(kind='bar')
plt.title('Top 10 Airlines with Least Average Arrival Delay')
plt.xlabel('Airline')
plt.ylabel('Average Arrival Delay (minutes)')
plt.show()
Step 2: Calculate Total Flight Hours per Month for Top 10 Airlines
python
# Calculate total flight hours per month for top 10 airlines
top_10_airlines_list = top_10_on_time_airlines.index.tolist()

# .copy() makes an independent frame, so adding a column does not trigger SettingWithCopyWarning
df_top_10 = df[df['OP_UNIQUE_CARRIER'].isin(top_10_airlines_list)].copy()
df_top_10['FLIGHT_HOURS'] = df_top_10['ACTUAL_ELAPSED_TIME'] / 60  # Convert minutes to hours

# Rows are airlines, columns are months (1-12)
monthly_flight_hours = (
    df_top_10.groupby(['OP_UNIQUE_CARRIER', df_top_10['FL_DATE'].dt.month])['FLIGHT_HOURS']
    .sum()
    .unstack()
)

# Visualize the results
plt.figure(figsize=(14, 8))
sns.heatmap(monthly_flight_hours, annot=True, fmt=".2f", cmap="YlGnBu")
plt.title('Monthly Flight Hours for Top 10 Airlines')
plt.xlabel('Month')
plt.ylabel('Airline')
plt.show()
Comment on Findings:
The heatmap provides insights into the flight hours distribution across different months
for the top 10 airlines, showing seasonal variations and operational patterns.
python
# Select three aircraft (the first three tail numbers that appear in the data)
selected_aircraft = df['TAIL_NUM'].dropna().unique()[:3]
df_selected_aircraft = df[df['TAIL_NUM'].isin(selected_aircraft)]

# Calculate average arrival and departure delays for selected aircraft
aircraft_delay_summary = (
    df_selected_aircraft.groupby('TAIL_NUM')[['DEP_DELAY', 'ARR_DELAY']].mean().round(2)
)

# Visualize the results (DataFrame.plot creates its own figure, so figsize is passed here)
aircraft_delay_summary.plot(kind='bar', figsize=(12, 8))
plt.title('Average Arrival and Departure Delays for Selected Aircraft')
plt.xlabel('Aircraft (TAIL_NUM)')
plt.ylabel('Average Delay (minutes)')
plt.show()

# Analyze destinations for selected aircraft
# size() with fill_value=0 keeps the counts as integers, which fmt="d" requires
aircraft_destinations = (
    df_selected_aircraft.groupby(['TAIL_NUM', 'DEST']).size().unstack(fill_value=0)
)

# Visualize the destinations
plt.figure(figsize=(14, 8))
sns.heatmap(aircraft_destinations, annot=True, fmt="d", cmap="Blues")
plt.title('Destinations for Selected Aircraft')
plt.xlabel('Destination Airport')
plt.ylabel('Aircraft (TAIL_NUM)')
plt.show()
Comment on Findings:
Bar plots show the average delays for each selected aircraft, while heatmaps display the
destinations these aircraft frequently travel to, revealing patterns in their operations.
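Taking the first three values of `unique()` selects whichever tail numbers happen to appear first in the file. If aircraft with more flight history are preferred, one option is to pick the three with the most recorded flights. The sketch below uses invented tail numbers and destinations:

```python
import pandas as pd

# Invented rows for illustration only; one row has no tail number.
toy = pd.DataFrame({
    "TAIL_NUM": ["N1", "N1", "N2", "N2", "N2", "N3", "N3", "N3", "N3", "N4", None],
    "DEST": ["LAX", "PHX", "LAS", "LAS", "LAX", "SFO", "SFO", "LAX", "SFO", "SAN", "LAX"],
})

# Work only with rows that have a tail number.
known = toy.dropna(subset=["TAIL_NUM"])

# value_counts() sorts by frequency, so head(3) gives the three busiest aircraft.
busiest = known["TAIL_NUM"].value_counts().head(3).index.tolist()

# Cross-tabulate flights per aircraft and destination; the counts are integers.
dest_counts = pd.crosstab(known["TAIL_NUM"], known["DEST"]).loc[busiest]
print(dest_counts)
```

Because `pd.crosstab` returns integer counts, the result can be passed to `sns.heatmap` with `fmt="d"` without further conversion.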
Note: Label all visualizations with appropriate titles and axis labels, and round
all numeric calculations to 2 decimal places. Provide