Final Capstone Project - Group 4 - TPS
Report
Toronto Police Services
Made By
Group 4 – TPS
Dev Wadiker
Chinmay Wadhavkar
Hajera Unnisa
Deepanshi
Institution
George Brown College
Date of Submission
17 August 2024
Executive Summary
Objective:
This report analyses Toronto Police Services' Major Crime Indicators data to identify temporal and spatial crime trends and to translate those trends into actionable insights for resource allocation and crime prevention.
Key Findings:
Data Quality and Cleaning: The dataset initially contained missing values
and redundant columns. A meticulous data cleaning process was
implemented, involving the filling of missing values, removal of duplicates,
and standardization of categorical data. These steps ensured that the data
was accurate and ready for analysis.
Temporal Trends: Analysis of crime data over time revealed significant
trends, such as fluctuations in crime rates by year, month, and hour. Peak
periods of criminal activity were identified, providing insights into when crimes
are most likely to occur.
Spatial Analysis: The spatial distribution of crimes highlighted key hotspots
across different neighbourhoods in Toronto. Certain areas were identified as
having consistently higher crime rates, indicating the need for targeted
policing efforts.
Crime Categories: The data showed that certain types of crimes, such as
assaults, were more prevalent in specific neighbourhoods and during certain
times of the day. This finding suggests that crime prevention strategies could
be tailored to address these specific issues.
Conclusions:
The analysis of the crime data has provided valuable insights into the temporal and
spatial distribution of crimes in Toronto. By understanding when and where crimes
are most likely to occur, Toronto Police Services can better allocate resources and
develop targeted interventions. The findings also underscore the importance of
continuous data monitoring and analysis to adapt to changing crime patterns.
Recommendations:
1. Resource Allocation: Based on the identified crime hotspots and peak crime
periods, it is recommended that Toronto Police Services allocate resources
more effectively, particularly during high-crime hours and in high-risk
neighbourhoods.
2. Targeted Interventions: Develop community-specific strategies to address
the types of crimes that are most prevalent in each neighbourhood.
3. Continuous Monitoring: Implement ongoing data collection and analysis to
monitor trends and adjust policing strategies in real-time.
Introduction
Background:
Toronto, one of the most populous cities in Canada, faces a wide range of crime-
related challenges. Understanding crime patterns is crucial for law enforcement
agencies, like Toronto Police Services, to effectively allocate resources, develop
crime prevention strategies, and ensure public safety. This report is part of a
capstone project undertaken by a team of students from George Brown College, in
collaboration with Toronto Police Services, aimed at analysing crime data to uncover
trends, patterns, and actionable insights.
The dataset used for this analysis consists of records on major crime indicators in
Toronto, including various types of offenses, their occurrences over time, and their
geographical distribution across the city. Given the large volume of data and the
complexity of the variables involved, a systematic approach was required to clean,
analyse, and interpret the data effectively.
Problem Statement:
The primary challenge addressed by this analysis is to identify significant temporal and
spatial trends in crime within Toronto. By doing so, the goal is to provide Toronto Police
Services with actionable insights that can be used to improve resource allocation, enhance
public safety measures, and develop targeted crime prevention strategies. This analysis seeks
to answer the following key questions:
What are the trends in crime rates over time, and how do they vary by month, day,
and hour?
Which neighborhoods in Toronto are the most affected by crime, and what types of
crimes are most prevalent in these areas?
How can Toronto Police Services utilize these findings to optimize their operational
strategies?
Objectives:
1. To clean and preprocess the provided crime dataset to ensure its accuracy and
usability for analysis.
2. To conduct an exploratory data analysis (EDA) that identifies key trends, patterns,
and correlations within the data.
3. To provide detailed insights into the temporal and spatial distribution of crimes in
Toronto.
4. To offer recommendations for Toronto Police Services based on the analysis findings,
with a focus on enhancing crime prevention and resource allocation.
Scope:
This report focuses on the analysis of major crime indicators in Toronto, covering various
types of crimes, their temporal patterns, and their geographical distribution across the city.
The scope includes:
Data cleaning and preparation to address missing values and redundant information.
Exploratory data analysis (EDA) to uncover trends and patterns in the data.
Interpretation of the results to provide actionable insights.
Recommendations for future work and potential improvements in crime prevention
strategies.
This analysis does not cover the causes of crime or the socio-economic factors influencing
crime rates, as these are beyond the scope of the dataset and the objectives of this project.
Data Description
Data Source:
The dataset used for this analysis was provided by Toronto Police Services as part
of the Major Crime Indicators Open Data initiative. It contains detailed records of
reported crimes in Toronto, covering various offenses, their occurrence times, and
locations. The data was initially unprocessed, requiring several steps of cleaning and
preparation to ensure its suitability for analysis.
Variables:
OCC_DATE (Occurrence Date): The date and time when the crime
occurred.
REPORT_DATE (Report Date): The date and time when the crime was
reported to the police.
OCC_YEAR, OCC_MONTH, OCC_DAY, OCC_DOY, OCC_DOW,
OCC_HOUR: These columns break down the occurrence date into different
temporal components, such as year, month, day, day of the year, day of the
week, and hour.
LONG_WGS84, LAT_WGS84: The longitude and latitude coordinates of the
crime location, used for spatial analysis.
DIVISION: The police division responsible for the area where the crime
occurred.
LOCATION_TYPE, PREMISES_TYPE: These columns describe the type of
location and premises where the crime occurred.
OFFENCE, MCI_CATEGORY: These columns categorize the crime by
offense type and major crime indicator category, respectively.
NEIGHBOURHOOD_158, NEIGHBOURHOOD_140: These columns provide
neighborhood identifiers, allowing for analysis by specific areas within
Toronto.
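The temporal components above can be derived directly from OCC_DATE. The sketch below uses hypothetical sample rows (not actual records from the dataset) and assumes OCC_DATE parses as a standard timestamp; column names follow the descriptions above.

```python
import pandas as pd

# Hypothetical sample rows mirroring the dataset's OCC_DATE column
df = pd.DataFrame({"OCC_DATE": ["2023-07-14 22:15:00", "2023-12-01 03:40:00"]})
df["OCC_DATE"] = pd.to_datetime(df["OCC_DATE"])

# Derive each temporal component with pandas' datetime accessors
df["OCC_YEAR"] = df["OCC_DATE"].dt.year
df["OCC_MONTH"] = df["OCC_DATE"].dt.month_name()
df["OCC_DAY"] = df["OCC_DATE"].dt.day
df["OCC_DOY"] = df["OCC_DATE"].dt.dayofyear
df["OCC_DOW"] = df["OCC_DATE"].dt.day_name()
df["OCC_HOUR"] = df["OCC_DATE"].dt.hour
```

Deriving these columns from a single timestamp, rather than storing them independently, keeps the temporal fields mutually consistent.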
Missing Data:
The dataset initially contained missing values in several key variables, particularly those related to occurrence dates and times (e.g., OCC_YEAR, OCC_MONTH, OCC_DAY). These gaps were addressed using a combination of imputation techniques (mode, mean, and backfill), which are described in detail in the Methodology section.
Data Quality:
The data underwent a rigorous cleaning process to ensure its quality. This included filling missing values, removing duplicate records, dropping redundant columns, and standardizing categorical data.
Post-cleaning, the dataset was saved in a final, processed state, ready for analysis.
The resulting data is now consistent, complete, and well-structured, providing a solid
foundation for the subsequent exploratory data analysis.
Methodology
Approach:
The analysis of the crime data was conducted in a systematic manner, starting with
data cleaning and preparation, followed by exploratory data analysis (EDA), and
culminating in detailed trend and pattern analysis. The primary objective was to
uncover actionable insights that could assist Toronto Police Services in optimizing
their operations and crime prevention strategies.
Python and Pandas: Python, particularly the Pandas library, was the primary
tool used for data manipulation and cleaning. Pandas facilitated the handling
of large datasets, allowing for efficient data preprocessing, including handling
missing values, standardizing data, and removing duplicates.
Matplotlib and Seaborn: These Python libraries were used for data
visualization. They provided the necessary tools to create informative and
visually appealing charts and graphs that depict temporal and spatial trends in
the data.
Jupyter Notebooks: The entire analysis was conducted in Jupyter
Notebooks, which provided an interactive environment for data exploration
and analysis. It allowed for the integration of code, output, and narrative,
making it easier to document the analysis process.
Geopandas: For spatial analysis, Geopandas was utilized to handle
geographic data and create visualizations that highlighted crime hotspots and
neighborhood-level crime distributions.
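As an illustration of the Pandas-based cleaning steps, the minimal sketch below standardizes categorical text and removes exact duplicates. The values are hypothetical stand-ins; the real dataset's categories and volumes differ.

```python
import pandas as pd

# Hypothetical raw extract with inconsistent text and a duplicate row
raw = pd.DataFrame({
    "OFFENCE": ["Assault", "assault ", "Theft Over", "Assault"],
    "DIVISION": ["D14", "D14", "D51", "D14"],
})

# Standardize categorical text first, so that rows differing only in
# casing or whitespace are recognized as duplicates
raw["OFFENCE"] = raw["OFFENCE"].str.strip().str.title()
clean = raw.drop_duplicates().reset_index(drop=True)
```

Ordering matters here: deduplicating before standardization would leave "Assault" and "assault " as distinct rows.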
Assumptions:
Several assumptions were made during the analysis.
Limitations:
Process:
The data cleaning and preparation phase was a critical step in ensuring that the
dataset was accurate, consistent, and ready for analysis. Given the large volume of
data and the complexity of the variables, a systematic approach was applied to
address issues such as missing values, redundant columns, and inconsistent data
formatting.
Mode Imputation: This method was particularly useful for categorical data
because it preserved the distribution of categories. For instance, if the
majority of crimes occurred on a Friday, missing values in the OCC_DOW
column were filled with "Friday."
Mean Imputation: This method helped maintain the central tendency of the
numerical data. For example, if a few entries were missing for OCC_DAY, the
mean day of occurrence across all records was used to fill in these gaps.
Backfill Method: For date/time variables, where chronological consistency is
crucial, the backfill method ensured that no gaps disrupted the temporal
sequence of events. This method is particularly useful in datasets where
events follow a natural order, such as the progression of crime reports over
time.
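The three imputation methods described above can be sketched as follows. The slice of data is hypothetical, chosen only to show each technique on the column type it suits.

```python
import numpy as np
import pandas as pd

# Hypothetical slice with the kinds of gaps described above
df = pd.DataFrame({
    "OCC_DOW": ["Friday", "Friday", np.nan, "Monday"],
    "OCC_DAY": [12.0, np.nan, 3.0, 21.0],
    "OCC_DATE": pd.to_datetime(["2023-05-01", None, "2023-05-03", "2023-05-04"]),
})

# Mode imputation for the categorical day-of-week column
df["OCC_DOW"] = df["OCC_DOW"].fillna(df["OCC_DOW"].mode()[0])

# Mean imputation for the numeric day-of-month column
df["OCC_DAY"] = df["OCC_DAY"].fillna(df["OCC_DAY"].mean())

# Backfill for the date column, preserving chronological consistency
df["OCC_DATE"] = df["OCC_DATE"].bfill()
```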
Data Transformation:
In this section, we summarize and analyse the key findings from the data and visualizations, breaking them down into sub-sections. The EDA report is structured as follows:
1. Descriptive Statistics
Summary Statistics: We'll summarize the key statistics from the dataset,
such as mean, median, standard deviation, and ranges of numeric columns.
2. Temporal Analysis
Crime Trends Over Time: Analyze the trends in crime rates over the years,
identify peak months, and observe patterns across different days of the week
and hours of the day.
Seasonal Patterns: Identify any seasonal variations in crime rates.
3. Spatial Analysis
4. Crime Category Analysis
5. Victim and Offender Analysis
6. Correlation Analysis
1. Descriptive Statistics
2. Temporal Analysis
3. Spatial Analysis
Crime Hotspots:
o The crime_hotspots.png map reveals high concentrations of crime in
certain areas of Toronto, with noticeable hotspots.
Neighborhood Analysis:
o The crime_distribution_top_neighborhoods.png chart identifies the top
20 neighborhoods with the highest crime rates, with West Humber-
Clairville and Moss Park leading.
4. Crime Category Analysis
Offense Types:
o The frequency_of_offense_types.png chart shows that assault is the
most frequent crime type, followed by vehicle-related offenses and
theft.
Location and Premises Types:
o The distribution_of_crimes_by_location.png visualization shows that
most crimes occur in condos and mobile homes, followed by public
spaces.
5. Victim and Offender Analysis
Demographics:
o If demographic data were available, this section would analyze the
characteristics of victims and offenders.
6. Correlation Analysis
Correlation Matrix:
o The correlation_matrix.png visualization shows strong correlations
between certain variables, like OBJECTID and REPORT_YEAR, which
could indicate data recording patterns rather than meaningful insights.
Model Analysis
Model Summary:
The Random Forest model was optimized using Randomized Search for
hyperparameter tuning.
Best Parameters:
o Number of estimators: 200
o Maximum depth: 20
o Minimum samples split: 2
o Minimum samples leaf: 1
Overall Accuracy: The model achieved an accuracy score of 44.19%,
indicating moderate performance.
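The tuning step can be sketched as below. Synthetic data stands in for the engineered features, and the candidate values mirror the best parameters reported above; this is an illustrative sketch, not the project's exact search space.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the engineered features; the real inputs were
# coordinate and temporal columns, with MCI_CATEGORY as the target
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = rng.integers(0, 3, 200)

param_distributions = {
    "n_estimators": [100, 200],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

# Randomized Search samples a subset of the grid instead of trying
# every combination, keeping tuning tractable on large datasets
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=4,  # kept small for the sketch
    cv=3,
    random_state=42,
)
search.fit(X, y)
```

After fitting, `search.best_params_` holds the winning combination and `search.best_estimator_` the refitted model.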
Performance Metrics:
Visualizations:
1. Confusion Matrix:
o File: confusion_matrix_rf.png
o Insight: The confusion matrix reveals that the model has strong
performance for certain high-frequency crimes but struggles with less
frequent categories. The matrix's normalization also helps in
understanding the relative performance across categories.
2. Feature Importance:
o File: feature_importance_rf.png
o Insight: The feature importance plot shows that spatial features like
LONG_WGS84 and LAT_WGS84 are the most significant predictors,
followed by temporal features like OCC_DAY and OCC_HOUR. This
suggests that both location and time are critical factors in predicting
crime types.
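The two diagnostics above can be reproduced in miniature as follows. The data is synthetic, with the first feature deliberately made informative to mimic the dominance of the spatial predictors; feature names follow the dataset's columns.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(1)
feature_names = ["LONG_WGS84", "LAT_WGS84", "OCC_DAY", "OCC_HOUR"]
X = rng.random((300, 4))
y = (X[:, 0] > 0.5).astype(int)  # first feature drives the label

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Row-normalized confusion matrix: each row sums to 1, so per-category
# recall is comparable even when class frequencies differ widely
cm = confusion_matrix(y, model.predict(X), normalize="true")

# Ranked impurity-based feature importances
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)
```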
Conclusion:
The Random Forest model provides a reasonable starting point but has
limitations, particularly with less frequent crime categories. The accuracy of
44.19% suggests that while the model has captured some patterns, there is
room for improvement.
In this section, we delve into the core analyses conducted on the Toronto crime
dataset. The primary objective was to explore the data, identify patterns, and
develop predictive models that could be utilized by the Toronto Police Services to
optimize crime prevention strategies and resource allocation.
1. Descriptive Statistics
The analysis began with a descriptive statistical overview of the dataset. This step
provided crucial insights into the central tendencies, dispersion, and distribution of
the data across key variables.
Summary Statistics:
o The variables REPORT_YEAR, REPORT_MONTH, REPORT_DAY,
OCC_YEAR, OCC_DAY, LAT_WGS84, and LONG_WGS84 were
examined.
o Key Findings:
The dataset spans multiple years, with a noticeable increase in
crime reporting in recent years.
The geographic coordinates (LAT_WGS84, LONG_WGS84)
revealed the spatial distribution of crime incidents across
Toronto.
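A minimal sketch of this step is shown below, using a few hypothetical rows in place of the full dataset; the real analysis examined the columns listed above.

```python
import pandas as pd

# Hypothetical numeric slice of the examined columns
df = pd.DataFrame({
    "OCC_YEAR": [2019, 2020, 2021, 2022],
    "LAT_WGS84": [43.65, 43.70, 43.66, 43.72],
})

# Count, mean, standard deviation, min/max, and quartiles per column
summary = df.describe()
```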
2. Temporal Analysis
Temporal analysis was conducted to understand the distribution of crimes over time,
which included an investigation into yearly, monthly, daily, and hourly trends.
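The yearly and hourly breakdowns reduce to simple group-by counts, sketched below on hypothetical records with the temporal columns described earlier.

```python
import pandas as pd

# Hypothetical incident records
df = pd.DataFrame({
    "OCC_YEAR": [2019, 2019, 2020, 2020, 2020],
    "OCC_HOUR": [22, 22, 14, 22, 3],
})

# Incidents per year, and the single busiest hour overall
per_year = df.groupby("OCC_YEAR").size()
peak_hour = df["OCC_HOUR"].value_counts().idxmax()
```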
3. Spatial Analysis
Crime Hotspots:
o A scatter plot was generated to visualize crime occurrences across
Toronto using latitude and longitude coordinates.
o Key Findings:
High concentrations of crime were observed in central Toronto,
particularly in downtown areas, suggesting these as hotspots
requiring focused policing efforts.
Neighbourhood Analysis:
o Crime distribution across different neighbourhoods was examined
using bar charts.
o Key Findings:
Certain neighbourhoods, such as "West Humber-Clairville" and
"Moss Park," consistently reported higher crime rates compared
to others. These areas might benefit from targeted crime
prevention programs.
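Both spatial views can be sketched as below. The coordinates and neighbourhood labels are hypothetical stand-ins, and the sketch uses a plain matplotlib scatter rather than the Geopandas map produced for the project.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend; no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical points; the real analysis used LAT_WGS84/LONG_WGS84
df = pd.DataFrame({
    "LONG_WGS84": [-79.38, -79.40, -79.60],
    "LAT_WGS84": [43.65, 43.66, 43.72],
    "NEIGHBOURHOOD_158": ["Moss Park", "Moss Park", "West Humber-Clairville"],
})

# Hotspot scatter: dense clusters of points mark high-crime areas
fig, ax = plt.subplots()
ax.scatter(df["LONG_WGS84"], df["LAT_WGS84"], s=5, alpha=0.5)
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
out_path = os.path.join(tempfile.gettempdir(), "crime_hotspots_sketch.png")
fig.savefig(out_path)

# Neighbourhood ranking behind the top-20 bar chart
top = df["NEIGHBOURHOOD_158"].value_counts().head(20)
```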
4. Crime Category Analysis
The analysis was further refined by categorizing crimes based on their types and
examining the frequency and distribution of major crime categories.
5. Correlation Analysis
Correlation analysis was employed to identify relationships between different variables in the
dataset.
Correlation Matrix:
o A heatmap was generated to visualize the correlation between key variables.
o Key Findings:
There were strong correlations between certain temporal variables,
such as REPORT_YEAR and OCC_YEAR, indicating the consistency of
crime reporting. However, correlations between other variables were
generally low, suggesting that crimes are influenced by a diverse set
of factors.
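The heatmap rests on a pairwise correlation matrix, sketched below on a hypothetical numeric slice; perfectly aligned report and occurrence years produce the strong correlation noted above.

```python
import pandas as pd

# Hypothetical numeric slice of the temporal columns
df = pd.DataFrame({
    "REPORT_YEAR": [2019, 2020, 2021, 2022],
    "OCC_YEAR":    [2019, 2020, 2021, 2022],
    "OCC_HOUR":    [3, 22, 14, 9],
})

# Pairwise Pearson correlations; this matrix feeds the heatmap
corr = df.corr()
```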
6. Model Development
Based on the insights gathered from the EDA, predictive models were developed to forecast
crime occurrences and types.
Visualizations
Interpretation
The analyses reveal that crime in Toronto is significantly influenced by both spatial and
temporal factors. The Random Forest model provided a deeper understanding of the factors
contributing to crime occurrences but also highlighted the complexities involved in
accurately predicting crime.
Conclusions
Summary of Findings
The analysis of the Toronto crime dataset revealed several critical insights that can
inform and enhance law enforcement strategies:
1. Temporal Trends:
o Key Insight: Crime incidents have shown notable patterns over the
years, with an increase in reports from 2014 onwards, peaking around
2019 and 2021. The observed reduction in 2024 could reflect recent
successful interventions or data collection timing.
o Actionable Insight: Understanding these trends allows for better
forecasting and resource allocation, ensuring that law enforcement is
prepared for periods of heightened activity.
2. Spatial Trends:
o Key Insight: Certain neighborhoods, notably "West Humber-Clairville"
and "Moss Park," have consistently high crime rates. Downtown
Toronto also emerged as a critical area for focus.
o Actionable Insight: Targeted interventions in these hotspots can lead
to significant reductions in crime, making these areas safer for
residents and visitors.
Implications
Final Thoughts
The Random Forest model's performance, combined with the insights gained from
the EDA, provides a solid foundation for predictive policing in Toronto. While there is
always room for improvement, the model has already demonstrated its utility and
potential. By continuing to refine the model and incorporating additional data, the
Toronto Police Services can stay ahead of crime trends and improve public safety.
The findings from this study affirm the value of data-driven approaches in law
enforcement and highlight the opportunities for continuous improvement and
innovation in predictive policing.
Recommendations
Based on the findings from our analysis, we present the following recommendations
to help the Toronto Police Services leverage data-driven insights for more effective
crime prevention and resource allocation:
Incorporate the Random Forest Model: Integrate the Random Forest model
developed in this study into daily operations. Use it to forecast potential crime
types in specific areas, allowing the police to take preemptive actions.
Continuous Model Improvement: While the model has shown promising
results, further refinement is recommended. Incorporate additional data
sources such as socioeconomic factors, weather data, and public event
schedules to enhance the model’s accuracy and predictive power.
Expand Predictive Analytics: Explore the use of more advanced machine
learning techniques, such as Gradient Boosting Machines or Neural
Networks, to build on the current model. These techniques can potentially
offer even greater accuracy and insights.
3. Community Engagement
Utilize Data for Policy Development: Use the insights gained from the EDA
and predictive models to inform policy decisions. For example, policies could
be developed around the deployment of resources based on temporal and
spatial crime patterns, or around targeted interventions for specific crime
types.
Invest in Data Infrastructure: To fully realize the potential of predictive
policing, continued investment in data collection, storage, and analysis
infrastructure is essential. This includes adopting modern data management
platforms, improving data quality, and ensuring that officers and analysts have
access to the tools and training they need to leverage these resources
effectively.
Appendices
The appendices section provides supplementary materials that support the analysis
and recommendations presented in the report. This includes relevant code,
additional visualizations, and detailed explanations of the dataset variables.
1. Code
The Python code used for data analysis, Exploratory Data Analysis (EDA), and
model building is available upon request. This code includes all the necessary steps,
from data cleaning to the implementation of the Random Forest model. The code is
well-documented, making it easy to understand and replicate the analysis.
2. Additional Visualizations and Tables
This section includes visualizations and tables that were generated during the
analysis but were not included in the main body of the report. These provide further
insights and support the findings discussed earlier.
Visualizations:
1. Correlation Matrix: Visualizes the relationships between variables in
the dataset.
2. Crime Distribution by Month, Day, and Hour: Charts that provide
additional details on how crime occurrences vary by time.
3. Crime Hotspots: A scatter plot showing crime incidents across
Toronto, highlighting high-density areas.
4. Crime Distribution Across Neighborhoods: A detailed bar chart of
crime rates in different neighborhoods.
5. Distribution of Major Crime Categories: A bar chart showing the
prevalence of different crime types.
6. Confusion Matrix (Random Forest Model): Provides a detailed look
at the model's performance across various crime categories.
7. Feature Importance (Random Forest Model): Displays the
significance of different features used in the model.
3. Data Dictionary
4. Tools and Technologies
Python:
o Python was the primary programming language used for data analysis and
model building. The following libraries were extensively used:
Pandas: For data manipulation and analysis.
NumPy: For numerical computations and handling large datasets.
Matplotlib: For creating static, animated, and interactive
visualizations.
Seaborn: For statistical data visualization.
Scikit-learn: For machine learning algorithms and model evaluation.
Jupyter Notebook: The analysis was conducted using Jupyter Notebook, an open-
source web application that allows for the creation and sharing of documents that
contain live code, equations, visualizations, and narrative text.
Microsoft Excel: Used for initial data cleaning and exploration.