1. Introduction
1.1 Background of the Dataset
1.2 Relevance of Data Analysis in Public Services
1.3 Goals and Scope of Analysis
1.4 Tools Used
2. Data Understanding
2.1 Overview of Data Structure
2.2 Description of Key Columns
2.3 Data Types and Missing Data Analysis
3. Data Preparation
3.1 Import the Dataset and Clean It
3.2 Converting Date-Time Columns and Creating New Features
3.3 Dropping Irrelevant Columns
3.4 Handling Missing Data
3.5 Display Unique Values from All Columns
4. Data Analysis
4.1 Show Summary Statistics such as Sum, Mean, and Standard Deviation
4.2 Calculate and Show Correlation of All Variables
5. Data Exploration
5.1 Four Major Insights through Visualization after Data Mining
5.2 Group Complaint Types by Average Request Closing Time and Location
Statistical Testing: Test 1 - Average Response Time Across Complaint Types
Test 2: Whether the Type of Complaint or Service Requested and Location Are Related
Conclusion
Bibliography
1. Introduction
1.1 Background of the Dataset
The dataset is a collection of service requests registered with New York City's 311 line. 311 is a non-emergency customer service hotline that the city's residents, businesses, and visitors call to report noise, illegal parking, and other sanitation, maintenance, and public service problems. The data on these service requests is critical because it provides insight into the demand for public services, resource distribution, and problems in an urban area.
Each service request also carries several attributes, such as:
Complaint Type: The type of complaint made, for example noise, a blocked driveway, or illegal parking.
Agency: The department or agency responsible for the complaint, for example the NYPD or the Department of Sanitation.
Incident Location: The location where the incident was reported, including details such as ZIP codes and neighborhood names.
Created Date and Closed Date: The times when the request was opened and closed.
Analyzing such data makes it possible to trace the chronology of complaints in different areas, monitor how effectively problems are solved, and observe how the mix of complaint types changes over time. (NYC.gov, n.d.)
1.2 Relevance of Data Analysis in Public Services
Data analysis is an important tool for optimizing public services, particularly in metropolises the size of New York. The large volume of data created through 311 service requests is therefore of great value to city authorities and policymakers.
For instance, data analysis helps identify geographic hotspots for particular categories of complaints and allows targeted interventions in those areas. The city can also analyze response times and resolution rates to gauge the operational efficiency of its agencies and adjust workflows accordingly. Public service agencies can even plan to prevent recurring problems. Moreover, analysis of citizen complaints can reveal systemic problems that call for long-term solutions such as infrastructure upgrades or policy changes. Acting on these insights shifts public service management from reactive to proactive, which is in the best interest of residents because their concerns are resolved more promptly and efficiently. (Goldsmith, 2023)
1.3 Goals and Scope of Analysis
This analysis aims primarily to understand patterns and trends in 311 service request data with respect to the following goals:
Complaint Trends: Establish which complaints are most prevalent in New York City and whether some complaint types increase over time while others decrease.
Response Time Analysis: Examine how promptly the city's agencies respond to complaints and whether response time varies across complaint types and geographic locations.
The analysis includes multiple data preparation steps to clean and transform the data into a state appropriate for statistical analysis, visual exploration of the important variables, and hypothesis tests that check for relationships between factors such as complaint type and geographic location, or whether response times differ across complaint types. It also aims to pinpoint actionable insights that support informed decisions and service improvements, helping the city optimize the way it responds to citizen complaints and ultimately improve public satisfaction.
1.4 Tools Used
1. Pandas:
Pandas is a widely used Python library for data manipulation and analysis. In this
project, it was essential for reading the CSV dataset into a structured DataFrame. It
allowed easy handling of missing values, renaming columns, and filtering rows based
on conditions. We used pandas to extract date and time components, group data by
categories (like month or borough), and compute summary statistics. Its intuitive
syntax and integration with other tools made it ideal for managing and exploring our
service request data. (Alriksson, 2020)
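As a rough illustration of the pandas operations described above, the sketch below loads the CSV, parses the creation date, and produces a grouped count per borough. The file name 311_Service_Requests.csv is an assumed placeholder, and the column names follow the dataset description rather than the project's exact code.

```python
import pandas as pd

# Load the 311 service request CSV into a DataFrame
# (the file name is an assumed placeholder).
df = pd.read_csv("311_Service_Requests.csv", low_memory=False)

# Parse the creation timestamp and derive a simple date component.
df["Created Date"] = pd.to_datetime(df["Created Date"], errors="coerce")
df["Month"] = df["Created Date"].dt.month

# Count complaints per borough as an example of a grouped summary statistic.
complaints_per_borough = df.groupby("Borough")["Unique Key"].count()
print(complaints_per_borough)
```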
2. Matplotlib:
Matplotlib is a foundational plotting library in Python used to create static, interactive,
and animated visualizations. In this project, it was used to build bar charts, line plots,
and scatter plots to visually interpret complaint trends. It helped visualize the
distribution of complaints across boroughs, time, and complaint types. With
customization features such as color, label formatting, and saving plots as images,
Matplotlib made our visual outputs presentation-ready. These visuals were key to
revealing trends that would otherwise remain hidden in raw numbers.
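A minimal Matplotlib sketch of the kind of trend plot described above, assuming the DataFrame df from the previous step with Created Date already parsed to datetime:

```python
import matplotlib.pyplot as plt

# Line plot of complaint volume per month.
monthly_counts = df["Created Date"].dt.to_period("M").value_counts().sort_index()

plt.figure(figsize=(10, 4))
plt.plot(monthly_counts.index.astype(str), monthly_counts.values, marker="o")
plt.xticks(rotation=45, ha="right")       # readable month labels
plt.xlabel("Month")
plt.ylabel("Number of Complaints")
plt.title("311 Complaints per Month")
plt.tight_layout()
plt.savefig("complaints_per_month.png")   # save a presentation-ready image
plt.show()
```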
3. Seaborn:
Seaborn is a high-level data visualization library built on top of Matplotlib that provides
an easier and more aesthetically pleasing way to create plots. We used Seaborn to
generate statistical plots like boxplots and histograms, which helped analyze the
spread and distribution of complaint response times. Its integration with pandas made
it simple to plot directly from DataFrames. Seaborn automatically handles visual
themes, legends, and color palettes, making plots clearer and more professional. It
significantly improved the visual storytelling aspect of our data analysis. (Solomon,
2022)
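A small Seaborn sketch of a boxplot of response times by complaint type, assuming the engineered Request_Closing_Time column (in hours) described later in the data preparation section:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Restrict to the five most common complaint types to keep the plot readable.
top5 = df["Complaint Type"].value_counts().head(5).index
subset = df[df["Complaint Type"].isin(top5)]

plt.figure(figsize=(10, 5))
sns.boxplot(data=subset, x="Complaint Type", y="Request_Closing_Time")
plt.xticks(rotation=30, ha="right")
plt.ylabel("Request closing time (hours)")
plt.title("Spread of Response Times by Complaint Type")
plt.tight_layout()
plt.show()
```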
4. Jupyter Notebook:
Jupyter Notebook served as the interactive development environment where all the
coding and visualization tasks were performed. It allowed us to write and execute code
in cells, visualize outputs immediately, and document our process alongside the code.
This interactivity helped test small portions of code step by step and adjust parameters
quickly. Additionally, markdown cells enabled us to add explanations, titles, and
structured reports directly in the notebook. Jupyter’s ability to combine code, output,
and narrative made it ideal for both development and presentation.
5. CSV File:
The dataset used in this project was stored in a CSV (Comma-Separated Values)
format. CSV files are simple text files that store tabular data and are widely used for
sharing structured information. Using pandas, we imported the CSV file into a
DataFrame to begin our analysis. Despite being large and complex, the CSV format
allowed us to access and process the data efficiently. It served as the foundation for
all our analysis, containing the complaint details, timestamps, locations, and status
information required for exploration.
2. Data Understanding
2.1 Overview of Data Structure
The dataset maintains a fairly comprehensive record of how citizens have interacted with 311 over a period of time. It contains 300,698 rows and 53 columns, and every column in a row contributes to the picture of a single service request. The main groups of attributes are:
• Complaint details: type and description
• Location details: incident address, ZIP code, borough
• Dates and times: creation, closure, response times
• Agency responsible for handling the complaint: NYPD, Sanitation Department, etc.
Each row represents a particular complaint, while the columns record the complaint's type, the agency responsible, the location of the incident, and the dates when the request was created and resolved.
2.2 Description of Key Columns
This dataset has a number of columns, each capturing information on the request for
services. A few of the primary columns in the dataset are:
• Unique Key: A unique identifier that is given to every service request. This is the
primary key of this dataset.
• Created Date: date and time of the creation of the service request. It is one of the
important columns to understand the timestamp of when any complaint was received.
• Closed Date: date and time of closing out for the service request. It will help to find
out resolution time - time taken to resolve the complaint.
• Agency: The agency or department that is responsible for addressing the complaint (examples are the NYPD and the Department of Sanitation).
• Agency Name: The complete name of the agency that is handling the request (for example, "New York City Police Department").
• Complaint Type: The specific complaint type, such as Blocked Driveway, Noise - Street/Sidewalk, or Illegal Parking.
• Descriptor: A further specification of the complaint that adds context; for example, Loud Music/Party for Noise - Street/Sidewalk.
• Incident Zip: The ZIP code where the incident took place, which provides general information about the location of the complaint.
• Incident Address: The exact street address where the complaint was lodged, if available.
These fields capture a great deal of information about the nature of complaints, their location, and the way they were resolved. In particular, the response time for any complaint can be calculated quickly as the difference between two key dates, Created Date and Closed Date, while Complaint Type and Location Type help classify and understand the issues raised by citizens.
2.3 Data Types and Missing Data Analysis
Integer: Used for numerical data such as Unique Key (the unique identifier of a request).
Object (String): Used for textual data such as Agency, Complaint Type, and Incident Address.
Datetime: Used for dates and times, for example the values in the Created Date and Closed Date columns.
Float: Used for numeric data that may contain decimals or empty values; Incident Zip is an example.
Missing Data: Several columns contain some missing or null values, only a few of which affect the analysis.
Missing Data Analysis: Closed Date has 2,164 missing values, which indicates that those requests are still open and not yet closed. Some other columns, such as Descriptor, Incident Address, and Resolution Action, also have a few missing values, but these are not critical because they do not interfere with the basic analysis or the time-to-respond calculation.
Handling Missing Data: The rows representing still-open requests (missing Closed Date) can simply be removed, or the missing values can be imputed with a placeholder. For other columns with missing values, imputation or removal of incomplete entries is chosen depending on how relevant the column is to the analysis.
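The checks described above can be reproduced with a few pandas calls. This is a sketch, assuming df is the DataFrame loaded earlier:

```python
# Inspect the data type of every column.
print(df.dtypes)

# Count missing values per column, largest first.
missing_counts = df.isnull().sum().sort_values(ascending=False)
print(missing_counts.head(10))

# Share of missing values per column, useful when deciding
# whether to impute values or drop incomplete rows.
print((missing_counts / len(df)).head(10))
```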
No.  Column Name           Description                                                           Data Type
2    Created Date          The date and time when the request was created.                      String
3    Closed Date           The date and time when the request was closed.                       String
15   Park Facility Name    The name of the park facility, if the incident occurred in a park.   String
3. Data Preparation
Data preparation is one of the most important steps in a data analysis pipeline. It is the process of converting raw data into a clean dataset that is ready for further analysis, whatever form that analysis takes. In this scenario, preparing the dataset of 311 service requests from New York City consists of a few primary steps, as described below:
3.1 Import the dataset and clean it
The very first step in data preparation is to import the dataset and do some initial
cleaning. This is performed by loading the data into a DataFrame, ensuring its
structure is proper, and making it analysis-ready.
• Loading the dataset: To begin with, we import the data from its source file format into pandas (in this instance, a CSV file).
• As soon as the data loads, we run a few preliminary cleaning checks, looking for missing values, duplicate entries, or other data integrity issues, as sketched below.
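A sketch of this loading-and-checking step; the file name is an assumed placeholder and the expected shape comes from the dataset description above:

```python
import pandas as pd

# Load the raw CSV (the file name is an assumed placeholder).
df = pd.read_csv("311_Service_Requests.csv", low_memory=False)

# Preliminary integrity checks.
print(df.shape)                      # expected: (300698, 53)
print(df.duplicated().sum())         # number of fully duplicated rows
print(df.isnull().sum().head(10))    # missing values in the first columns

# Remove exact duplicate rows, if any.
df = df.drop_duplicates()
```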
3.2 Converting Date-Time Columns and Creating New Features
Explanation: The code snippet below converts the string date columns into datetime using the pd.to_datetime() function. It also sets errors='coerce', so that every invalid date entry is converted to NaT. Subtracting the Created Date from the Closed Date then gives the time taken to close each request, expressed in hours.
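The original snippet was included as a figure; a reconstruction consistent with the explanation above might look like this:

```python
import pandas as pd

# Convert the string date columns to datetime; invalid entries become NaT.
df["Created Date"] = pd.to_datetime(df["Created Date"], errors="coerce")
df["Closed Date"] = pd.to_datetime(df["Closed Date"], errors="coerce")

# New feature: time taken to close each request, expressed in hours.
df["Request_Closing_Time"] = (
    df["Closed Date"] - df["Created Date"]
).dt.total_seconds() / 3600
```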
3.3 Dropping Irrelevant Columns
In any given dataset there may be fields that have no significance for the analysis. Such columns can safely be removed to make the data more manageable and easier to view. For example, in this dataset there is no need to know precise addresses, school names, or vehicle-related details in order to understand the pattern of complaints and the duration of response times. We therefore drop the columns that are irrelevant for the type of analysis at hand, as sketched below.
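A sketch of the dropping step; the exact column list here is illustrative, chosen from the kinds of columns mentioned above rather than taken from the project's code:

```python
# Columns judged irrelevant for complaint-pattern and response-time analysis
# (the exact list is illustrative).
columns_to_drop = [
    "Incident Address",
    "School Name",
    "Vehicle Type",
]

# errors="ignore" keeps the call safe if a listed column is absent.
df = df.drop(columns=columns_to_drop, errors="ignore")
print(df.shape)
```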
3.4 Handling Missing Data
We drop all rows with missing data in critical columns such as Created Date and Closed Date, since these are essential for computing response times. For rows with missing data in less critical columns, values may be imputed or dropped depending on their significance. (Jain, 2021)
Figure 5: Handling Missing Data.
Explanation: The dropna() method removes rows with missing values in the specified columns, while the fillna() method fills missing values in the Complaint Type column with the most frequent value (the mode), as sketched below.
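Based on that explanation, the handling step could be sketched as follows (assuming the DataFrame df from the previous steps):

```python
# Drop rows that lack the dates needed for the response-time calculation.
df = df.dropna(subset=["Created Date", "Closed Date"])

# Fill missing complaint types with the most frequent value (the mode).
most_frequent = df["Complaint Type"].mode()[0]
df["Complaint Type"] = df["Complaint Type"].fillna(most_frequent)

print(df.isnull().sum().head(10))   # verify the critical columns are complete
```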
4. Data Analysis
Data analysis is the process of investigating and interpreting data with a view to identifying patterns, trends, and relationships between variables. It requires applying statistical techniques such as summary statistics, correlation analysis, and data visualization. Summary statistics highlight principal measures such as the mean, standard deviation, skewness, and kurtosis. Correlation analysis helps identify relationships between numeric variables. Visualizing the data then helps in understanding patterns, outliers, and trends more effectively. It is this process that allows informed decisions to be made based on insights from the data.
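As a sketch of how such summary statistics can be produced with pandas, assuming the prepared DataFrame df with the engineered Request_Closing_Time column:

```python
# Summary statistics (count, mean, standard deviation, quartiles)
# for the numeric columns, including Request_Closing_Time.
print(df.describe())

# Skewness and kurtosis of the response-time distribution.
print("Skewness:", df["Request_Closing_Time"].skew())
print("Kurtosis:", df["Request_Closing_Time"].kurtosis())

# Sum and mean of the closing time in hours.
print("Total hours:", df["Request_Closing_Time"].sum())
print("Mean hours:", df["Request_Closing_Time"].mean())
```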
Explanation: A correlation value can vary from -1 (perfect negative) to +1 (perfect positive). Running corr() on the full DataFrame initially raised an error because the calculation was attempted on non-numeric columns. We therefore first filtered out only the numeric columns using select_dtypes() with the data types float64 and int64, and then constructed the correlation matrix for those columns to show the relationships between the numeric variables.
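A sketch of the numeric-only correlation computation described above:

```python
# Keep only numeric columns so corr() does not fail on text data.
numeric_df = df.select_dtypes(include=["float64", "int64"])

# Pearson correlation matrix of the numeric variables.
correlation_matrix = numeric_df.corr()
print(correlation_matrix)
```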
5. Data Exploration
EDA is an approach to analyzing a dataset in which its structure, patterns, and relationships between variables are examined before any deeper analysis. It is typically the first stage of data analysis and consists of a number of techniques applied to gain insight into the data before advanced modeling. The principal objectives of data exploration are as follows: (Kumar, 2021)
2. Missing or Anomalous Data Identification: Look for the missing, duplicated, or outlier
values that need to be handled before analysis.
Data exploration helps form hypotheses, choose the proper analytical techniques, and prepare the data for more advanced statistical models, machine learning models, or decision-making processes. It is the most critical starting point of any data science project.
5.1 Four Major Insights through Visualization after Data Mining
Data mining and analysis of large datasets will yield several valuable insights through
visualizations. Here are four key insights which stem from various types of
visualizations:
1. Most Frequent Complaint Types
Insight: A bar chart of the most frequent complaints within each category easily brings out their frequency. For instance, if complaints related to noise or blocked driveways occur very often, this signifies a repeated problem in that area.
Visualization:
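One way such a bar chart could be produced (a sketch; column names are taken from the dataset description, not the project's exact code):

```python
import matplotlib.pyplot as plt

# Ten most frequent complaint types across the whole dataset.
top_complaints = df["Complaint Type"].value_counts().head(10)

plt.figure(figsize=(10, 5))
top_complaints.plot(kind="bar", color="steelblue")
plt.xlabel("Complaint Type")
plt.ylabel("Number of Complaints")
plt.title("Most Frequent Complaint Types")
plt.tight_layout()
plt.show()
```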
2. Geographical Distribution of Complaints
Insight: A scatter plot or heatmap based on latitude and longitude can be plotted to point out the areas where complaints are most concentrated. This highlights areas where resource allocation may be skewed and where more attention or intervention is needed. (Singh, 2020)
Visualization:
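A simple location scatter plot could be sketched as below, assuming Latitude and Longitude columns as referenced in the insight above:

```python
import matplotlib.pyplot as plt

# Scatter plot of complaint locations; a small alpha value reveals dense hotspots.
geo = df.dropna(subset=["Latitude", "Longitude"])

plt.figure(figsize=(7, 7))
plt.scatter(geo["Longitude"], geo["Latitude"], s=1, alpha=0.1)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Geographical Distribution of Complaints")
plt.tight_layout()
plt.show()
```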
3. Request_Closing_Time by Complaint Type
This concept refers to identifying trends or regularities in how service requests occur
over time by leveraging NumPy, a numerical computing library in Python.
In practice, service request data often includes timestamps indicating when each
request was made. By analyzing these timestamps with NumPy arrays, you can detect
temporal patterns—such as peaks in activity during certain hours, days, or seasons.
Visualization:
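A small sketch of such a temporal check, combining the parsed timestamps with a NumPy array of hours (assuming the datetime conversion from the data preparation step):

```python
import numpy as np

# Hour of day for every request, as a NumPy array.
hours = df["Created Date"].dt.hour.to_numpy()

# Count requests per hour (0-23) to expose daily peaks in activity.
valid_hours = hours[~np.isnan(hours)].astype(int)
counts = np.bincount(valid_hours, minlength=24)
for hour, count in enumerate(counts):
    print(f"{hour:02d}:00  {count}")
```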
4. Association between Variables
Insight: A correlation matrix heatmap can be used to detail the relationships between variables. For instance, it can reveal positive correlations between specific complaint types and their response times, or correlations driven by geographic location, where certain areas are more likely to generate particular complaints.
Visualization:
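A sketch of such a heatmap, reusing the numeric-only correlation matrix from the data analysis step:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Heatmap of the correlation matrix for the numeric columns.
numeric_df = df.select_dtypes(include=["float64", "int64"])

plt.figure(figsize=(8, 6))
sns.heatmap(numeric_df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation between Numeric Variables")
plt.tight_layout()
plt.show()
```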
Statistical Testing: Test 1 - Average Response Time Across Complaint Types
A One-Way ANOVA is conducted to test whether response times are equal across all types of complaints. The analysis of variance checks whether there is a significant difference in mean Request_Closing_Time among the different complaint types.
Hypotheses:
Null Hypothesis (H0): The average response times are equal for all complaint types,
so there is no significant difference between them.
Alternative Hypothesis (H1): The average response times are not equal across all
complaint types, indicating that at least one group is different.
Explanation (a code sketch follows this list):
• f_oneway(): This function performs the ANOVA test comparing the means of multiple groups (complaint types). It returns two values: the F-statistic (the ratio of between-group variance to within-group variance) and the p-value (which tells whether the difference is statistically significant).
• F-statistic: A larger value indicates a higher likelihood that at least one group
mean is different.
• p-value: If p < 0.05, we reject the null hypothesis and conclude that at least one
complaint type has a significantly different average response time. If p ≥ 0.05,
we fail to reject the null hypothesis, meaning no significant difference exists.
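A sketch of how this test can be run with SciPy, assuming Request_Closing_Time in hours and grouping by complaint type:

```python
from scipy.stats import f_oneway

# One array of closing times per complaint type (missing values dropped).
groups = []
for _, grp in df.groupby("Complaint Type"):
    values = grp["Request_Closing_Time"].dropna().values
    if len(values) > 1:          # each group needs at least two observations
        groups.append(values)

f_stat, p_value = f_oneway(*groups)
print("F-statistic:", f_stat)
print("p-value:", p_value)

if p_value < 0.05:
    print("Reject H0: at least one complaint type has a different mean response time.")
else:
    print("Fail to reject H0: no significant difference in mean response times.")
```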
Test 2: Whether the type of complaint or service requested and location are related
Null Hypothesis (H0): The type of complaint or service requested and the location (Borough) are independent; there is no significant relationship between them.
Alternative Hypothesis (H1): The type of complaint or service requested and the location (Borough) are related; there is a significant relationship between complaint type and location.
A Chi-Square test helps determine if two categorical variables (complaint type and
location) are independent or related.
Explanation (a code sketch follows this list):
• Chi2 Stat: The Chi-Square statistic, which helps measure the association
between two categorical variables.
• p-value: If the p-value < 0.05, we reject the Null Hypothesis and conclude that
there is a significant relationship between the complaint type and location. If the
p-value ≥ 0.05, we fail to reject the Null Hypothesis and conclude that there is
no significant relationship between the two.
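A sketch of the Chi-Square test of independence with SciPy, using a contingency table of complaint type against borough:

```python
from scipy.stats import chi2_contingency
import pandas as pd

# Contingency table: complaint types (rows) by borough (columns).
contingency = pd.crosstab(df["Complaint Type"], df["Borough"])

chi2_stat, p_value, dof, expected = chi2_contingency(contingency)
print("Chi2 Stat:", chi2_stat)
print("p-value:", p_value)

if p_value < 0.05:
    print("Reject H0: complaint type and borough are related.")
else:
    print("Fail to reject H0: no significant relationship found.")
```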
Conclusion
In this milestone of the project, we aimed to explore, clean, and analyze the 311
Customer Service Requests dataset to extract meaningful insights that can be used
for process optimization, resource allocation, and improving service efficiency. The
dataset provided a wealth of information, including the Complaint Type, Borough,
Created Date, Closed Date, and other relevant details that help in understanding the
patterns and trends in customer complaints. The first stage of this project involved the
data understanding phase, where we identified the key variables, such as Complaint
Type, Borough, and Request_Closing_Time. We recognized early on that the dataset
required significant preprocessing to ensure data quality and completeness.
In the data preparation phase, we handled missing values, converted the Created Date
and Closed Date to the proper datetime format, and engineered the new feature,
Request_Closing_Time, which calculated the time it took to resolve each complaint.
We also dropped irrelevant columns and handled any remaining missing values by
imputing the Complaint Type with the most frequent category. This step ensured that
our dataset was clean and ready for deeper analysis, and we could now focus on the
core aspects of the dataset.
Once the data was prepared, we moved into the exploratory data analysis (EDA)
phase, where we sought to uncover patterns, trends, and insights from the data.
Through visualizations such as bar charts, box plots, and histograms, we discovered
key findings:
• ANOVA Test: The One-Way ANOVA test confirmed that average response times
differ significantly across different complaint types (with a p-value of 0.0). This
validated our earlier observation from visualizations, emphasizing that certain
complaint types require more time to resolve, and indicating potential areas for
process improvement.
Bibliography
Jain, S. (2021, October 2). Towards Data Science. Retrieved from Towards Data Science: https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4