Data Loading and Pre-Processing
Data loading involves acquiring the necessary datasets from various sources and loading
them into a suitable format for analysis. In our project, we gather data from the British Social
Attitudes survey, public surveys, and additional sources spanning specific timeframes.
Utilizing tools like pandas in Python, we import these datasets into DataFrames, allowing for
easy manipulation and analysis.
This section details the process of loading and pre-processing the data used to develop a
public engagement prediction model for government policies. Our objective was to analyze
the data gathered from the British Social Attitudes (BSA) survey and assess its relevance for
the model. We focused specifically on variables that could influence public engagement with
a given policy. The data is spread across six CSV files:
1. bsa21_variables.csv: This CSV file contains variable names and labels from the BSA
survey (year 2021). It provides information about the type of data collected in each
variable.
2. bsa21_data.csv: This CSV file houses the actual survey responses from participants in
the BSA (year 2021). It links individual responses to corresponding variables using
identifiers.
3. public_survey_variables.csv: This CSV file contains variable names and labels from the
public survey. It provides information about the type of data collected in each variable.
4. public_survey_data.csv: This CSV file houses the actual survey responses from public
survey participants. It may include information from different surveys or time periods,
depending on our project scope.
5. 2010_2011_variables.csv: This CSV file contains variable names and labels from the
2010_2011 survey dataset. It provides information about the type of data collected in each
variable.
6. 2010_2011_data.csv: This CSV file houses the actual survey responses from participants
in the 2010_2011 survey. It may include information from different surveys or time
periods, depending on our project scope.
We imported the necessary libraries, including pandas (pd) for data manipulation, and used
the pd.read_csv function to load each dataset into a separate pandas DataFrame. The code
snippet below demonstrates this process:
import pandas as pd
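A minimal sketch of the loading step could look like the following. It assumes the six files listed above sit in a single directory; the load_datasets helper and the dictionary keys are illustrative names, not part of the project code.

```python
from pathlib import Path

import pandas as pd

# File names as listed above; the dictionary keys are illustrative labels.
DATASET_FILES = {
    "bsa21_variables": "bsa21_variables.csv",
    "bsa21_data": "bsa21_data.csv",
    "public_survey_variables": "public_survey_variables.csv",
    "public_survey_data": "public_survey_data.csv",
    "2010_2011_variables": "2010_2011_variables.csv",
    "2010_2011_data": "2010_2011_data.csv",
}


def load_datasets(data_dir="."):
    """Load every listed CSV found in data_dir into its own DataFrame."""
    frames = {}
    for name, filename in DATASET_FILES.items():
        path = Path(data_dir) / filename
        if path.exists():  # files that are not present are skipped
            frames[name] = pd.read_csv(path)
    return frames
```

Calling load_datasets("data") would return a dictionary mapping each label to its DataFrame, so frames["bsa21_data"].shape gives the size of the 2021 response table.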
Data pre-processing is crucial to ensure the quality and effectiveness of our model. The
following steps were undertaken to prepare the data for further analysis:
1. Missing Values: We checked for missing values in each column of the DataFrames using the
isnull().sum() method. Columns with a high proportion of missing values (e.g., exceeding 50%
of the data) were dropped to avoid introducing bias. Remaining missing entries were imputed
with a placeholder value, such as -1, to maintain consistent data structure.
2. Scaling Numerical Features: Numerical features (columns containing integer or float data
types) were scaled using StandardScaler from the scikit-learn library (sklearn.preprocessing).
This step standardizes each feature to zero mean and unit variance, so that features measured
on different scales contribute comparably to the model's training process.
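The two steps above can be sketched as follows. This is an illustration under assumptions rather than the project code: the clean_and_scale helper and its parameter names are hypothetical, and the placeholder is applied to numeric columns only.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def clean_and_scale(df, max_missing=0.5, placeholder=-1):
    """Drop sparse columns, fill remaining numeric gaps, standardize numeric columns."""
    # Fraction of missing entries per column (isnull().sum() divided by the row count).
    missing_fraction = df.isnull().sum() / len(df)
    kept = df.loc[:, missing_fraction <= max_missing].copy()

    numeric_cols = list(kept.select_dtypes(include="number").columns)
    if not numeric_cols:
        return kept

    # Fill the remaining numeric gaps with the placeholder, then standardize.
    filled = kept[numeric_cols].fillna(placeholder)
    scaled = StandardScaler().fit_transform(filled)
    for i, col in enumerate(numeric_cols):
        kept[col] = scaled[:, i]
    return kept
```

One design point worth noting: because the placeholder is filled in before scaling, it influences each column's mean and variance. Scaling first and filling afterwards would avoid that, at the cost of departing from the order described above.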
Here's a breakdown of the preprocessing techniques applied to the textual policy descriptions:
1. Tokenization: We divided the text descriptions into individual words using the
word_tokenize function from the NLTK library. This step helps identify the essential
words that convey the meaning of the policy.
2. Lowercasing: All characters in the tokens were converted to lowercase using a list
comprehension. This ensures consistency in feature representation and avoids biasing
the model towards capitalization.
3. Punctuation Removal: Punctuation marks were removed from the tokens using a
translation table built with Python's str.maketrans function. Punctuation doesn't hold
significant meaning in the context of our analysis, and removing it helps focus on the
core words.
4. Stop Word Removal: Stop words, which are common words like "the," "a," and "is,"
were eliminated using the NLTK stopwords corpus. These words don't contribute
much to the meaning of the policy and can introduce noise into the data.
5. Lemmatization: Words were reduced to their base forms (lemmas) using the
WordNetLemmatizer from NLTK. This helps capture the core meaning of words
regardless of their grammatical variations (e.g., "increasing" becomes "increase" when it
is lemmatized as a verb).
By applying these techniques, we transformed the raw text descriptions into a cleaner and
more consistent format, ready for feature extraction.
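The first four steps can be sketched in plain Python. To keep the example runnable without downloading NLTK resources, it swaps in whitespace splitting for word_tokenize and a small hard-coded stop-word set for the NLTK stopwords corpus, and it leaves out lemmatization. The preprocess function and the STOP_WORDS set are illustrative names.

```python
import string

# Small illustrative stop-word set; the NLTK stopwords corpus is much larger.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "for", "on"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text):
    """Tokenize, lowercase, strip punctuation, and drop stop words."""
    tokens = text.split()                                # whitespace tokenization
    tokens = [t.lower() for t in tokens]                 # lowercasing
    tokens = [t.translate(PUNCT_TABLE) for t in tokens]  # punctuation removal
    # Drop stop words and any tokens left empty after punctuation removal.
    return [t for t in tokens if t and t not in STOP_WORDS]
```

For example, preprocess("The policy is increasing funding for schools.") returns ['policy', 'increasing', 'funding', 'schools'].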
To convert the preprocessed text data into numerical features suitable for machine learning
algorithms, we employed Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF
assigns weights to words based on their frequency within a document (policy description) and
their overall frequency across the entire dataset. Words that appear frequently within a
specific policy description but rarely across all descriptions receive higher TF-IDF weights.
This approach emphasizes the unique and relevant terms associated with each policy.
The resulting TF-IDF matrix provides a numerical representation of the textual data,
capturing the importance of each word in describing a particular government policy. This
matrix can then be used as input for various machine learning models to predict public
engagement.
After clustering the TF-IDF features of each dataset with K-Means, we obtained the following
Silhouette Scores:
Data 1 (bsa21_variables): 0.0686
Data 2 (public_survey_variables): 0.3923
Data 3 (2010_2011_variables): 0.3040
The Silhouette Score ranges from -1 to 1, with higher values indicating better-separated
clusters. The scores for Data 2 and Data 3 are considerably higher than the score for Data 1,
which is close to zero and therefore points to heavily overlapping clusters. This suggests that
the textual descriptions in Data 1 are less well-defined or have less inherent structure than
those in the other datasets.
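A Silhouette Score of this kind can be computed along the following lines. This is a sketch under assumptions: the variable labels are invented for illustration, and the number of clusters (2) and the random seed are arbitrary choices rather than the settings used in the project.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical variable labels standing in for one of the *_variables.csv files.
variable_labels = [
    "age of respondent",
    "age of partner",
    "age of youngest child",
    "trust in government",
    "trust in parliament",
    "trust in courts",
]

features = TfidfVectorizer().fit_transform(variable_labels)

# Assign each label to one of two clusters (n_clusters is an assumed value).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(features)

# Mean silhouette coefficient over all samples, between -1 and 1.
score = silhouette_score(features, cluster_ids)
print(f"Silhouette Score: {score:.4f}")
```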
We investigated two machine learning approaches for predicting public engagement with
government policies:
1. K-Means Clustering: After grouping policies into clusters using K-Means, we could
potentially associate the clusters with different levels of public engagement based on
historical data or expert analysis.
2. Neural Network Model: We also trained a Neural Network model to directly predict
the public engagement score for a new policy description. The Neural Network learns
from the relationship between the TF-IDF features of the policy description and the
corresponding public engagement score (where one is available in the data).
In this project, we focused on demonstrating the K-Means clustering approach. However, the
implementation for training a Neural Network model is also provided in the project code for
reference.
For illustrative purposes, let's consider the results from Data 2 (public_survey_variables),
which achieved the highest Silhouette Score. The K-Means model assigned each policy
description in Data 2 to one of a pre-defined number of clusters.
We can then analyze the characteristics of the policies within a specific cluster to understand
the underlying themes or topics that might influence public engagement. Visualizations can
be helpful in this process. Here are some examples:
For demographic data (Data 2): We can explore the age distribution within a cluster
to see if there's a particular age group more interested in the policies within that
cluster.
For activity level data (Data 2): We can visualize the activity levels (work hours) of
the population in a cluster to understand if there's a correlation between activity level
and the policy focus (e.g., healthcare policies might be of higher interest to less active
populations).
For job-related information (Data 2): We can analyze the distribution of job types
or employment status within a cluster to see if specific professions are more likely to
be impacted by the policies in that cluster.
These analyses can help us develop insights into the factors that influence public engagement
with different types of government policies.
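The three analyses above could start from simple per-cluster summaries in pandas. The table below is entirely hypothetical: the column names (cluster, age, work_hours, employment_status) and the values are placeholders, not fields or figures from the survey files.

```python
import pandas as pd

# Hypothetical respondent-level table with a cluster assignment per row.
respondents = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1],
    "age": [23, 31, 58, 64, 70],
    "work_hours": [40, 38, 10, 0, 5],
    "employment_status": ["employed", "employed", "retired", "retired", "part-time"],
})

# Age distribution within each cluster (count, mean, quartiles, ...).
age_by_cluster = respondents.groupby("cluster")["age"].describe()

# Average activity level (work hours) per cluster.
hours_by_cluster = respondents.groupby("cluster")["work_hours"].mean()

# Counts of each employment status per cluster.
status_by_cluster = pd.crosstab(respondents["cluster"], respondents["employment_status"])
```

Each of these summaries can be passed to a plotting call (for instance status_by_cluster.plot(kind="bar") when matplotlib is installed) to produce the visualizations described above.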
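We trained neural network models on the TF-IDF features extracted from each dataset (Data
1, Data 2, and Data 3). The reported accuracy scores are computed as the maximum value of
the predicted cluster probabilities, so they reflect how confidently the model assigns
descriptions to clusters: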
Data 1: 65.80%
Data 2: 91.56%
Data 3: 97.49%
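Data 2 (public_survey_variables) and Data 3 (2010_2011_variables) achieved high scores
(91.56% and 97.49%), suggesting that the neural network was able to capture the relevant
features from the policy descriptions in these datasets. Data 1 (bsa21_variables) scored
markedly lower (65.80%), in line with its low Silhouette Score.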
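One way to obtain a score of this kind is sketched below. It assumes that the network is trained to reproduce K-Means cluster assignments and that the reported figure is the mean of the per-description maximum probability; the text does not confirm either point. The example labels, layer size, and seeds are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

# Hypothetical variable labels (placeholders, not survey content).
variable_labels = [
    "age of respondent",
    "age of partner",
    "age of youngest child",
    "trust in government",
    "trust in parliament",
    "trust in courts",
]

features = TfidfVectorizer().fit_transform(variable_labels)

# Assumed training targets: the K-Means cluster assignment of each label.
cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
mlp.fit(features, cluster_ids)

probabilities = mlp.predict_proba(features)  # shape (n_samples, n_clusters)
confidence = probabilities.max(axis=1)       # maximum cluster probability per sample
print(f"Mean maximum cluster probability: {confidence.mean():.2%}")
```

Because this score is measured on the training data and reflects confidence rather than agreement with held-out labels, a train/test split with mlp.score on the test portion would give a more conventional accuracy estimate.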
In the initial phase of this project, I familiarized myself with the fundamentals of Artificial
Intelligence (AI) and Machine Learning (ML). This involved understanding core concepts,
algorithms, and their applications. Building on this foundation, I identified the potential of
using AI and ML to analyze textual data related to government policies and public
engagement.
The focus then shifted to exploring techniques for uncovering patterns and relationships
within this data. K-Means clustering, an unsupervised learning method, was chosen to group
policy descriptions with similar characteristics. This approach allows for the identification of
thematic clusters that might be associated with different levels of public interest.
Enhancing Prediction with Neural Networks:
The project further investigated the application of Neural Networks, a supervised learning
technique. Neural Networks can learn complex relationships between features (like TF-IDF
weights from policy descriptions) and target variables (public engagement scores in this
case). The code included a Multi-Layer Perceptron (MLP) implementation to train a Neural
Network model for predicting public engagement scores.
The results indicated that the Neural Network models achieved promising accuracy scores,
particularly for Data 2 (public_survey_variables) and Data 3 (2010_2011_variables). This
suggests that the model was able to effectively capture the underlying patterns in the textual
data and public engagement scores within these datasets.
Comparing K-Means clustering and Neural Networks highlights the strengths of each
approach. K-Means offers interpretability, allowing us to analyze the policy descriptions
within each cluster to understand the thematic drivers of public engagement. Neural
Networks, while potentially more accurate, are less interpretable and require further
analysis of the predicted cluster probabilities.
5. Conclusion
This project successfully explored the application of Artificial Intelligence and Machine
Learning techniques for analyzing government policy data and predicting public engagement.
The implemented methods, K-Means clustering and Neural Networks, demonstrated the
potential for uncovering hidden patterns and relationships within textual policy descriptions.
The K-Means clustering approach facilitated the grouping of policies with similar
characteristics, enabling further analysis to understand the thematic factors influencing public
interest. The exploration of Neural Networks provided promising results for predicting public
engagement scores, particularly for datasets with a clear relationship between policy
descriptions and engagement levels.
While the project yielded positive results, there are areas for further exploration and
improvement. One key challenge involves the interpretability of Neural Network models.
Unlike K-Means clustering where we can directly analyze the policy descriptions within each
cluster, Neural Networks can be like "black boxes." Future work could investigate techniques
like feature importance analysis to shed light on this. Feature importance analysis ranks
features (words or phrases in our case) based on their contribution to the model's predictions.
This would provide valuable insights into the specific aspects of policy descriptions that the
model leverages for predicting public engagement.
Another exciting area for exploration is incorporating additional data sources beyond textual
policy descriptions. Public sentiment analysis from social media platforms can reveal public
opinion in real-time, capturing the immediate reactions and discussions surrounding a policy.
Demographic data, such as age, income, or location, could further enrich the model's
understanding of public engagement. By considering factors like age groups most impacted
by a policy or geographical areas with higher stakes, the model's predictions could be more
nuanced and geographically specific.
By overcoming these limitations and incorporating additional data sources, the project's
foundation in AI and ML can be further strengthened. These techniques have the potential to
become valuable tools for policymakers seeking to understand and optimize their policy
initiatives. Imagine a future where policymakers can leverage AI-powered predictions to
gauge public interest in proposed policies, tailor messaging to resonate with specific
demographics, and ultimately foster more effective and well-received policy
implementations.