Exploratory Data Analysis Report: Electric Vehicle Dataset
The dataset comprises 232,230 rows and 17 columns, offering a detailed snapshot of electric
vehicles. Each entry represents a unique vehicle, identifiable by its Vehicle Identification
Number (VIN), accompanied by a variety of attributes detailing its technical specifications,
geographical context, and administrative information.
Key Attributes
Vehicle Details:
● Model Year, Make, Model: Describe the vehicle's manufacturing year, brand, and
specific model.
● Electric Vehicle Type: Categorizes the vehicle as either Battery Electric Vehicle (BEV)
or Plug-in Hybrid Electric Vehicle (PHEV).
● CAFV Eligibility: Indicates if the vehicle is eligible for Clean Alternative Fuel Vehicle
incentives.
● Electric Range: Specifies the distance a vehicle can travel solely on electric power.
Notably, 25% of the records report an electric range of 0 miles, which could signify hybrid
models or potential data recording inconsistencies.
● Base MSRP: Represents the manufacturer's suggested retail price. A significant number
of entries are recorded as $0, suggesting missing data or placeholder values.
Additional Information:
Missing Values:
The dataset exhibits minimal missing values in most location-based fields: County (4), City (4),
Postal Code (4), Vehicle Location (11), Electric Utility (4), and 2020 Census Tract (4). However,
there are more substantial missing entries in the Legislative District (481). Additionally, Electric
Range and Base MSRP each have 27 missing values.
Duplicates:
The dataset is free of duplicate rows, indicating a clean and reliable set of unique vehicle
records.
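The missing-value and duplicate counts above can be reproduced with a few lines of code. The sketch below is a minimal, dependency-free illustration and not the code used for this report: the helper names and the three sample rows are hypothetical, and it assumes that both empty strings and None count as missing. With pandas, `DataFrame.isna().sum()` and `DataFrame.duplicated().sum()` give comparable counts.

```python
from collections import Counter


def count_missing(rows, columns):
    """Return {column: number of rows whose value is None or an empty string}."""
    return {
        col: sum(1 for row in rows if row.get(col) is None or row.get(col) == "")
        for col in columns
    }


def count_duplicate_rows(rows, columns):
    """Count rows that exactly repeat an earlier row across all given columns."""
    seen = Counter(tuple(row.get(col) for col in columns) for row in rows)
    return sum(n - 1 for n in seen.values())


# Made-up rows for illustration only; they are not taken from the dataset.
rows = [
    {"VIN": "AAA111", "County": "King", "Legislative District": "43"},
    {"VIN": "BBB222", "County": "", "Legislative District": None},
    {"VIN": "AAA111", "County": "King", "Legislative District": "43"},
]
columns = ["VIN", "County", "Legislative District"]

missing = count_missing(rows, columns)
duplicates = count_duplicate_rows(rows, columns)
print(missing)
print(duplicates)
```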
Preliminary Observations:
● Tesla Dominance: Tesla vehicles, particularly the Model Y and Model 3, are the most
prevalent in the dataset, suggesting a strong market presence or a focus in the data
collection process.
● Electric Range Distribution: The high percentage of vehicles with an electric range of 0
miles warrants further investigation to differentiate between hybrid vehicles and potential
data quality issues.
● Geographic Concentration: The majority of vehicles are registered in Washington
State, with a notable concentration in King County and the city of Seattle. This regional
bias should be considered during analysis.
Planned Analyses:
● Top Vehicle Makes & Models: Visualize the frequency distribution of the top 10 vehicle
makes (e.g., TESLA, CHEVROLET) and the top 10 models (e.g., MODEL Y, MODEL 3)
to understand market share and popularity.
● Temporal Analysis: Analyze the count of electric vehicles by model year to identify
trends in EV adoption over time and potential growth trajectories.
● Electric Range Analysis: Examine the distribution of the electric range to understand
the typical range capabilities and investigate the significant number of zero-range
entries.
● Vehicle Range by Electric Vehicle Type: Compare the electric range of Battery Electric
Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) to assess performance
differences between these categories.
● Pricing Insights: Investigate the Base MSRP across different vehicle makes to identify
pricing patterns and the impact of brand on price, while acknowledging and addressing
the presence of $0 values.
● Geographic Distribution: Map the distribution of electric vehicles across different
counties and cities within Washington State to identify areas with high EV adoption.
● Relationship between Range and Model Year: Explore if there's a correlation between
the vehicle's model year and its electric range, potentially indicating technological
improvements over time.
Data Quality Checks:
● Missing Values: Identified and documented the missing values in County, City, Postal
Code (4 each), Electric Range (27), Base MSRP (27), Legislative District (481), Vehicle
Location (11), Electric Utility (4), and 2020 Census Tract (4). While the number is
relatively small for most fields, the Legislative District has a notable amount of missing
data.
● Zero Values in Electric Range: Noticed that 25% of the Electric Range entries are 0.0.
This was flagged as a potential indicator of hybrid vehicles or a data entry issue requiring
further investigation.
● Zero Values in Base MSRP: Identified 27 instances where the Base MSRP is $0. This
suggests missing or placeholder data that needs to be addressed for accurate pricing
analysis.
● Duplicates: Confirmed that the dataset contains no duplicate rows, ensuring the
uniqueness of each vehicle record.
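One way to start the zero-range investigation flagged above is to compute the share of zero values separately for each Electric Vehicle Type. The sketch below is an assumption about how that check might look, not the report's own code: the function name is hypothetical, the sample rows are made up, and missing values are skipped so that they are not confused with zeros.

```python
from collections import defaultdict


def zero_share_by_group(rows, group_key, value_key):
    """Return {group: fraction of non-missing values that are exactly zero}."""
    totals = defaultdict(int)
    zeros = defaultdict(int)
    for row in rows:
        value = row.get(value_key)
        if value is None:
            continue  # missing values are tracked separately from zeros
        totals[row[group_key]] += 1
        if float(value) == 0.0:
            zeros[row[group_key]] += 1
    return {group: zeros[group] / totals[group] for group in totals}


# Made-up rows for illustration only; they are not taken from the dataset.
sample = [
    {"Electric Vehicle Type": "BEV", "Electric Range": 0},
    {"Electric Vehicle Type": "BEV", "Electric Range": 215},
    {"Electric Vehicle Type": "BEV", "Electric Range": 0},
    {"Electric Vehicle Type": "BEV", "Electric Range": 291},
    {"Electric Vehicle Type": "PHEV", "Electric Range": 32},
    {"Electric Vehicle Type": "PHEV", "Electric Range": None},
]

shares = zero_share_by_group(sample, "Electric Vehicle Type", "Electric Range")
print(shares)
```

If the zero values turn out to be concentrated in one vehicle type, that points toward a systematic recording rule; the same function can be applied to Base MSRP grouped by Make.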
Feature Engineering:
Tesla Dominance: Tesla is the leading manufacturer in the dataset with 13,451
vehicles.
Top Tesla Models: Within the Tesla brand, the Model Y (6,392 records) and
Model 3 (4,875 records) are the most frequently recorded models.
The dataset shows a higher number of records for more recent model years, with
2023 having the highest count (8,099 records), followed by 2024 and 2022. This
trend likely reflects the increasing adoption of electric vehicles and/or the focus of
the data collection efforts.
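The make, model, and model-year figures above are frequency tallies. A minimal sketch of how such tallies could be produced is shown below; the `top_counts` helper and the sample rows are hypothetical and only illustrate the technique.

```python
from collections import Counter


def top_counts(rows, key, n=10):
    """Return the n most frequent non-missing values of `key` with their counts."""
    values = (row[key] for row in rows if row.get(key) not in (None, ""))
    return Counter(values).most_common(n)


# Made-up rows for illustration only; they are not taken from the dataset.
sample = [
    {"Make": "TESLA", "Model Year": 2023},
    {"Make": "TESLA", "Model Year": 2023},
    {"Make": "NISSAN", "Model Year": 2019},
    {"Make": "TESLA", "Model Year": 2022},
    {"Make": "CHEVROLET", "Model Year": 2023},
    {"Make": "NISSAN", "Model Year": 2022},
]

top_makes = top_counts(sample, "Make", n=2)
# Counts per model year, sorted chronologically for a trend plot.
by_year = sorted(Counter(row["Model Year"] for row in sample).items())
print(top_makes)
print(by_year)
```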
A substantial portion of the vehicles (25%) report an electric range of 0.0 miles.
This could imply a significant presence of plug-in hybrid vehicles in the dataset,
or it might indicate data entry errors where the range was not recorded for some
battery electric vehicles. Further investigation is needed to clarify this.
● Performance by Vehicle Type:
● Pricing Discrepancies:
● Geographic Concentration:
The majority of the electric vehicles in the dataset are located in Washington
State, with a notable concentration in King County and the city of Seattle. This
indicates a strong adoption of EVs in this region, possibly due to state incentives,
infrastructure, or other regional factors.
4. Formulating Hypotheses
Based on the initial exploration of the data, we can formulate the following testable hypotheses:
● Null Hypothesis (H0): There is no statistically significant difference in the mean Base
MSRP across different vehicle makes.
● Alternative Hypothesis (H1): At least one vehicle make has a mean Base MSRP that differs
from that of the other makes.
● Test Used: One-way ANOVA (Analysis of Variance)
● Reasoning: One-way ANOVA compares the means of a continuous variable (Base MSRP)
across several independent groups (vehicle makes). It tests whether the variation between
group means is larger than the variation within groups would produce by chance.
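To make the test concrete, the sketch below computes the one-way ANOVA F-statistic by hand: the between-group mean square divided by the within-group mean square. It is an illustration under stated assumptions, not the code that produced this report's results. The function name and the price lists are made up, and the p-value is left out because it needs the F-distribution, which the Python standard library does not provide (a library routine such as `scipy.stats.f_oneway` returns both values).

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more observations than groups")

    group_means = [fmean(g) for g in groups]
    grand_mean = fmean(x for g in groups for x in g)

    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of each observation around its own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up Base MSRP values for three hypothetical makes; not taken from the dataset.
msrp_by_make = [
    [30000, 32000, 34000],
    [32000, 34000, 36000],
    [38000, 40000, 42000],
]

f_stat, df_between, df_within = one_way_anova_f(msrp_by_make)
print(round(f_stat, 4), df_between, df_within)
```

A large F means the makes differ more from one another than vehicles of the same make differ among themselves. The function divides by the within-group sum of squares, so it will fail if every group has zero variance.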
Results:
● F-statistic: 11.8574
● P-value: 3.5405578232248146e-73
Interpretation:
The p-value (approximately 3.54e-73) is far below the conventional significance level of 0.05,
which is very strong statistical evidence against the null hypothesis.
Conclusion:
Based on the one-way ANOVA, we reject the null hypothesis and conclude that mean Base MSRP
differs across vehicle makes in this dataset, which supports the alternative hypothesis.
Because $0 entries in Base MSRP enter the group means, the test should be re-run once those
values are handled. A post-hoc analysis (for example, Tukey's HSD) could then identify which
specific makes differ from each other.
To further analyze this electric vehicle dataset and gain deeper insights, consider the following
steps:
● Address Missing MSRP Values: Investigate the reasons behind the missing Base
MSRP values. If possible, explore imputation techniques based on vehicle make, model,
and year, or consider using external data sources to fill these gaps.
● Investigate Zero Electric Range: Further analyze the vehicles with an electric range of
0.0 miles. Determine if they are primarily plug-in hybrid vehicles or if there are potential
data errors. If they are PHEVs, consider creating a separate category for more granular
analysis.
● Geographic Analysis: Conduct a more in-depth geographic analysis to understand the
factors driving EV adoption in specific regions. This could involve looking at correlations
with population density, income levels, charging infrastructure, and state incentives.
● Time Series Analysis: If more historical data is available, perform a time series analysis
to understand the growth trends of different EV makes and models over time.
● Correlation Analysis: Explore the correlations between different attributes, such as the
relationship between model year and electric range, or between Base MSRP and electric
range.
● Regression Analysis: Build a regression model to predict the electric range or Base
MSRP based on other vehicle attributes like make, model, and vehicle type.
● CAFV Eligibility Analysis: Investigate the characteristics of vehicles that are eligible for
CAFV incentives versus those that are not.
● Electric Utility Analysis: Explore the distribution of electric vehicles across different
utility providers and see if there are any patterns or correlations.
● Hypothesis Testing: Conduct formal significance tests for the remaining formulated
hypotheses (comparison of electric range between BEVs and PHEVs, and the
correlation between model year and electric range).
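The two remaining hypotheses named in the last step can be approached with standard statistics. The sketch below computes Welch's t-statistic for the BEV versus PHEV range comparison and Pearson's r for model year versus electric range. It is a minimal illustration with made-up numbers and hypothetical function names, and it stops at the test statistics: the corresponding p-values require the t-distribution, which is not in the Python standard library.

```python
from math import sqrt
from statistics import fmean, variance


def welch_t(a, b):
    """Welch's t-statistic for two independent samples with unequal variances."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (fmean(a) - fmean(b)) / standard_error


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    mean_x, mean_y = fmean(x), fmean(y)
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    spread_x = sqrt(sum((xi - mean_x) ** 2 for xi in x))
    spread_y = sqrt(sum((yi - mean_y) ** 2 for yi in y))
    return covariance / (spread_x * spread_y)


# Made-up values for illustration only; they are not taken from the dataset.
bev_range = [200, 220, 240]
phev_range = [20, 30, 40]
model_year = [2018, 2019, 2020, 2021]
electric_range = [150, 210, 240, 300]

t_stat = welch_t(bev_range, phev_range)
r = pearson_r(model_year, electric_range)
print(round(t_stat, 3), round(r, 3))
```

Zero-range entries should be excluded or handled before running either calculation on the real data, since they would pull both the group means and the correlation toward zero.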
The dataset appears to be of reasonably good quality, with a large number of records and
minimal missing values in most key fields. The absence of duplicate entries is also a positive
aspect. However, the significant number of zero values in the Electric Range and Base MSRP
fields raises some concerns and requires further investigation. These zero values could
represent different scenarios (e.g., hybrid vehicles, missing data) that need to be clearly
understood to avoid skewing the analysis.
To enhance the analysis and gain a more comprehensive understanding, the following
additional data points would be beneficial: