Data Loading and Pre-Processing

The document discusses loading, preprocessing, and analyzing survey data from multiple sources to predict public engagement with government policies. Key steps include importing datasets, handling missing data, scaling numeric features, removing outliers, preprocessing text by tokenizing, lemmatizing and extracting TF-IDF features, clustering preprocessed text using K-Means, and training models like K-Means and neural networks to predict public engagement.

Uploaded by

ramzan1243259


1. Data Loading and Pre-processing

Data loading and pre-processing are foundational stages in any data-driven project, setting
the groundwork for subsequent analysis and modelling. In our Public Engagement Prediction
Model on a Government Policy, these stages are crucial for understanding the data, ensuring
its quality, and preparing it for further exploration.

Data loading involves acquiring the necessary datasets from various sources and loading
them into a suitable format for analysis. In our project, we gather data from the British Social
Attitudes survey, public surveys, and additional sources spanning specific timeframes.
Utilizing tools like pandas in Python, we import these datasets into DataFrames, allowing for
easy manipulation and analysis.

Once the data is loaded, the pre-processing phase begins. This section details how we loaded
and pre-processed the data used to develop a public engagement prediction model for
government policies. Our objective was to analyze the data gathered from the British
Social Attitudes (BSA) survey and assess its relevance for the model. We focused specifically
on variables that could influence public engagement with a given policy.

1.1 Data Sources

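The project utilized three datasets, each stored as a pair of CSV files (one holding variable
names and labels, one holding responses), giving six files in total: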

1. bsa21_variables.csv: This CSV file contains variable names and labels from the BSA
survey (year 2021). It provides information about the type of data collected in each
variable.

2. bsa21_data.csv: This CSV file houses the actual survey responses from participants in
the BSA (year 2021). It links individual responses to corresponding variables using
identifiers.

3. public_survey_variables.csv: This CSV file contains variable names and labels from the
public survey. It provides information about the type of data collected in each
variable.

4. public_survey_data.csv: This CSV file houses the actual survey responses from
participants in the public survey. The responses may include information from different
surveys or time periods, depending on our project scope.

5. 2010_2011_variables.csv: This CSV file contains variable names and labels from the
2010_2011 survey dataset. It provides information about the type of data collected in each
variable.

6. 2010_2011_data.csv: This CSV file houses the actual survey responses from participants
in the 2010_2011 survey. The responses may include information from different surveys or
time periods, depending on our project scope.

1.2 Data Loading

We imported the necessary libraries, including pandas (pd) for data manipulation, and used
the pd.read_csv function to load each dataset into a separate pandas DataFrame. The code
snippet below demonstrates this process:

import pandas as pd

# Variable names and labels for each survey (files read as Latin-1)
data1_variables = pd.read_csv("bsa21_variables.csv", encoding='latin1')
data2_variables = pd.read_csv("public_survey_variables.csv", encoding='latin1')
data3_variables = pd.read_csv("2010_2011_variables.csv", encoding='latin1')

# Survey responses for each survey
data1 = pd.read_csv("bsa21_data.csv")
data2 = pd.read_csv("public_survey_data.csv")
data3 = pd.read_csv("2010_2011_data.csv")

1.3 Data Pre-processing

Data pre-processing is crucial to ensure the quality and effectiveness of our model. The
following steps were undertaken to prepare the data for further analysis:

1. Missing Values: We checked for missing values in each column of the DataFrames using the
isnull().sum() method. Columns with a high proportion of missing values (e.g., exceeding 50%
of the data) were dropped to avoid introducing bias. Remaining missing entries were imputed
with a placeholder value, such as -1, to maintain consistent data structure.

2. Scaling Numerical Features: Numerical features (columns containing integer or float data
types) were scaled using StandardScaler from the scikit-learn library (sklearn.preprocessing).
This step standardizes each feature to zero mean and unit variance, so that features measured
on larger scales do not dominate the model's training process.

3. Outlier Removal: Isolation Forest, an anomaly detection algorithm from scikit-learn
(sklearn.ensemble), was employed to identify and remove outliers from the numerical data. This
helps to improve the model's robustness and prevent it from being overly influenced by
extreme data points.

The code for the data pre-processing steps is provided below:
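As an illustration only, and not the project's own listing, the three steps above could be sketched as follows. The function name preprocess, its parameters, the 5% contamination rate, and the toy columns are assumptions introduced here. In this sketch only numeric columns are imputed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler


def preprocess(df, missing_threshold=0.5, placeholder=-1.0, contamination=0.05):
    """Drop sparse columns, impute, scale, and filter outliers (illustrative)."""
    # 1. Drop columns whose share of missing values exceeds the threshold.
    missing_share = df.isnull().mean()
    df = df.loc[:, missing_share <= missing_threshold].copy()

    # Impute the remaining gaps in numeric columns with the placeholder value.
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(placeholder)

    # 2. Standardize numeric columns to zero mean and unit variance.
    df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])

    # 3. Keep only the rows that Isolation Forest labels as inliers (+1).
    forest = IsolationForest(contamination=contamination, random_state=0)
    inliers = forest.fit_predict(df[numeric_cols]) == 1
    return df.loc[inliers].reset_index(drop=True)


# Hypothetical toy data standing in for survey responses.
rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "age": rng.normal(45, 10, 20),
    "work_hours": rng.normal(35, 5, 20),
    "mostly_missing": [1.0] * 5 + [np.nan] * 15,  # 75% missing, so dropped
})
raw.loc[3, "age"] = np.nan   # one gap to impute
raw.loc[7, "age"] = 400.0    # one obvious outlier

clean = preprocess(raw)
print(clean.shape)
```

Imputing with -1 before scaling means the placeholder takes part in the mean and variance estimates, which is worth keeping in mind when interpreting the scaled values.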

2. Text Preprocessing and Feature Extraction

This section details the process of preprocessing and extracting features from the textual data
contained within the survey variables. Our objective was to transform the raw text
descriptions of government policies into numerical features suitable for use in machine
learning models that predict public engagement.

2.1 Text Preprocessing Steps

Here's a breakdown of the preprocessing techniques applied to the textual policy descriptions:

1. Tokenization: We divided the text descriptions into individual words using the
word_tokenize function from the NLTK library. This step helps identify the essential
words that convey the meaning of the policy.
2. Lowercasing: All characters in the tokens were converted to lowercase using a list
comprehension. This ensures consistency in feature representation and avoids biasing
the model towards capitalization.
3. Punctuation Removal: Punctuation marks were removed from the tokens using a
translation table built with Python's str.maketrans and applied with str.translate.
Punctuation doesn't hold significant meaning in the context of our analysis, and
removing it helps focus on the core words.
4. Stop Word Removal: Stop words, which are common words like "the," "a," and "is,"
were eliminated using the NLTK stopwords corpus. These words don't contribute
much to the meaning of the policy and can introduce noise into the data.
5. Lemmatization: Inflected words were reduced to their base forms (lemmas) using the
WordNetLemmatizer from NLTK. This helps capture the core meaning of words
regardless of their grammatical variations (e.g., "increasing" becomes "increase"
when it is lemmatized as a verb).

By applying these techniques, we transformed the raw text descriptions into a cleaner and
more consistent format, ready for feature extraction.
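The five steps can be sketched as a single function. This is a minimal sketch, not the project's code: the name preprocess_text, the choice of pos="v" for the lemmatizer, and the example sentence are assumptions. The fallback branch (whitespace tokenizer, tiny stop list, no lemmatization) exists only so the sketch still runs where NLTK or its data files are not installed.

```python
import string

try:
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    STOP_WORDS = set(stopwords.words("english"))
    _wordnet = WordNetLemmatizer()
    word_tokenize("probe")        # raises LookupError if tokenizer data is missing
    _wordnet.lemmatize("probes")  # raises LookupError if WordNet is missing

    def lemmatize(token):
        # Assumption: treat every token as a verb so "increasing" -> "increase".
        return _wordnet.lemmatize(token, pos="v")

except (ImportError, LookupError):
    # Simplified stand-ins, used only when NLTK resources are unavailable.
    STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "for"}

    def word_tokenize(text):
        return text.split()

    def lemmatize(token):
        return token

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess_text(text):
    tokens = word_tokenize(text)                          # 1. tokenize
    tokens = [t.lower() for t in tokens]                  # 2. lowercase
    tokens = [t.translate(PUNCT_TABLE) for t in tokens]   # 3. strip punctuation
    tokens = [t for t in tokens if t and t not in STOP_WORDS]  # 4. stop words
    return [lemmatize(t) for t in tokens]                 # 5. lemmatize


tokens = preprocess_text("The government is increasing funding for schools.")
print(tokens)
```

Joining the returned tokens with spaces gives one cleaned string per description, which is the form a TF-IDF vectorizer expects.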

2.2 Feature Extraction with TF-IDF

To convert the preprocessed text data into numerical features suitable for machine learning
algorithms, we employed Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF
assigns weights to words based on their frequency within a document (policy description) and
their overall frequency across the entire dataset. Words that appear frequently within a
specific policy description but rarely across all descriptions receive higher TF-IDF weights.
This approach emphasizes the unique and relevant terms associated with each policy.
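For reference, the weighting can be written out. The first form below is the common textbook definition. The second is the smoothed variant that scikit-learn's TfidfVectorizer applies by default (smooth_idf=True), after which each document vector is L2-normalized. Whether the project kept those defaults is an assumption. Here N is the number of documents and df(t) is the number of documents containing term t.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \ln \frac{N}{\mathrm{df}(t)}
\quad \text{(textbook)},
\qquad
\mathrm{idf}(t) = \ln \frac{1 + N}{1 + \mathrm{df}(t)} + 1
\quad \text{(scikit-learn default)}
```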

Here's how we implemented TF-IDF feature extraction:

1. TF-IDF Vectorizer: We initialized a TF-IDF vectorizer using the TfidfVectorizer class
from scikit-learn. The vectorizer builds a vocabulary of unique words encountered in the
corpus and calculates the TF-IDF weights for each word in each document.
2. Fitting and Transforming: The TF-IDF vectorizer was fitted to the preprocessed
text data using the fit_transform method. This process creates a document-term matrix,
where each row represents a policy description, each column represents a unique
word, and the cell values represent the corresponding TF-IDF weight.

The resulting TF-IDF matrix provides a numerical representation of the textual data,
capturing the importance of each word in describing a particular government policy. This
matrix can then be used as input for various machine learning models to predict public
engagement.

3. Clustering and Model Training

This section details the process of clustering the preprocessed textual data and training
machine learning models to predict public engagement with government policies. Our
objective is to group policy descriptions with similar characteristics and develop models that
can estimate the level of public interest or response a new policy might generate.

3.1 K-Means Clustering

We employed K-Means clustering, an unsupervised machine learning technique, to group the
textual policy descriptions (represented by TF-IDF features) into a predefined number of
clusters. K-Means aims to minimize the within-cluster variance, so that data points
within a cluster are similar to each other and dissimilar to data points in other clusters.

The number of clusters (k) is a crucial parameter in K-Means clustering. We experimented
with different values of k and evaluated the clustering performance using the Silhouette Score
metric. The Silhouette Score compares the cohesion within clusters (how similar data points
are within a cluster) with the separation between clusters (how dissimilar data points
are between clusters). It ranges from -1 to 1, and higher scores indicate better clustering
quality.
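Written out, the standard definition for a single data point i is given below, where a(i) is the mean distance from i to the other points in its own cluster and b(i) is the mean distance from i to the points of the nearest other cluster. The reported Silhouette Score is the mean of s(i) over all n points.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
\qquad
S = \frac{1}{n} \sum_{i=1}^{n} s(i)
```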

In our implementation, we achieved the following Silhouette Scores for the three datasets:
• Data 1 (bsa21_variables): 0.0686
• Data 2 (public_survey_variables): 0.3923
• Data 3 (2010_2011_variables): 0.3040

The Silhouette Scores for Data 2 and Data 3 are considerably higher than the score for Data 1,
indicating a more meaningful clustering for these datasets. A score close to zero, as for
Data 1, means that the clusters overlap heavily. This suggests that the textual descriptions in
Data 1 might be less well-defined or have a lower inherent structure compared to the other
datasets.
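The search over k can be sketched as follows. This is an illustration under assumptions: the variable labels are invented, the range of k is arbitrary, and n_init and random_state are chosen here only to make the run repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical variable labels standing in for a *_variables.csv file.
labels_text = [
    "age of respondent", "age of partner", "age left school",
    "hours worked per week", "hours worked overtime", "hours of unpaid work",
    "trust in government", "trust in parliament", "trust in police",
]
X = TfidfVectorizer().fit_transform(labels_text)

scores = {}
for k in range(2, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    cluster_ids = model.fit_predict(X)
    # Mean silhouette coefficient over all samples for this k.
    scores[k] = silhouette_score(X, cluster_ids)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```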

3.2 Model Training

We investigated two machine learning approaches for predicting public engagement with
government policies:

1. K-Means Clustering: After grouping policies into clusters using K-Means, we could
potentially associate the clusters with different levels of public engagement based on
historical data or expert analysis.
2. Neural Network Model: We also trained a Neural Network model to directly predict
the public engagement score for a new policy description. The Neural Network learns
from the relationship between the TF-IDF features of the policy description and the
corresponding public engagement score (where the dataset provides one).

In this project, we focused on demonstrating the K-Means clustering approach. However, the
implementation for training a Neural Network model is also provided in the code snippet for
your reference.

3.3 Analysis of Clustering Results

For illustrative purposes, let's consider the results from Data 2 (public_survey_variables)
which achieved a higher Silhouette Score. The K-Means model assigned each policy
description in Data 2 to one of the pre-defined clusters.

We can then analyze the characteristics of the policies within a specific cluster to understand
the underlying themes or topics that might influence public engagement. Visualizations can
be helpful in this process. Here are some examples:

• For demographic data (Data 2): We can explore the age distribution within a cluster
to see if there's a particular age group more interested in the policies within that
cluster.
• For activity level data (Data 2): We can visualize the activity levels (work hours) of
the population in a cluster to understand if there's a correlation between activity level
and the policy focus (e.g., healthcare policies might be of higher interest to less active
populations).
• For job-related information (Data 2): We can analyze the distribution of job types
or employment status within a cluster to see if specific professions are more likely to
be impacted by the policies in that cluster.

These analyses can help us develop insights into the factors that influence public engagement
with different types of government policies.
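A per-cluster summary of this kind can be sketched with pandas. The table below is invented, and the column names cluster, age, work_hours, and employment_status are hypothetical; the sketch assumes that a cluster label has already been attached to each row.

```python
import pandas as pd

# Hypothetical rows, each already tagged with a K-Means cluster label.
respondents = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1],
    "age": [23, 31, 27, 58, 64, 61],
    "work_hours": [40, 38, 42, 10, 0, 15],
    "employment_status": ["employed", "employed", "employed",
                          "retired", "retired", "part-time"],
})

# Age distribution within each cluster (count, mean, quartiles, ...).
age_summary = respondents.groupby("cluster")["age"].describe()

# Average activity level (work hours) per cluster.
mean_hours = respondents.groupby("cluster")["work_hours"].mean()

# Share of each employment status within each cluster.
status_share = (
    respondents.groupby("cluster")["employment_status"]
    .value_counts(normalize=True)
)

print(age_summary)
print(mean_hours)
print(status_share)
```

Each of these summaries can be passed to a plotting library to produce the histograms or bar charts described above.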

3.4 Neural Network Exploration for Public Engagement Prediction

In addition to K-Means clustering, we investigated the potential of Neural Networks for
predicting public engagement with government policies. Neural Networks are a powerful
machine learning technique capable of learning complex relationships between features and
target variables. In our case, the features are the TF-IDF weights representing the textual
policy descriptions, and the target variable (where the dataset provides one) is the public
engagement score associated with each policy.

3.4.1 Neural Network Model Implementation

The code includes a function train_neural_network that implements a Multi-Layer Perceptron
(MLP) classifier for neural network training. The MLP takes the TF-IDF features of the
policy descriptions (X_train) and the corresponding public engagement scores (y_train) as
input and learns to map the features to the engagement scores.

The code also includes functions to:

• Split the data into training and testing sets (train_test_split).
• Predict the cluster probabilities for a new policy using the trained neural network
model (predict_cluster_nn).
• Visualize the predicted cluster probabilities (plot_cluster_probabilities).
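The original function bodies are not shown here, so the following is a hedged reconstruction rather than the project's implementation. It reuses the names train_neural_network and predict_cluster_nn from the text, but the signatures, the hidden layer size, the toy descriptions, and the labels (standing in for cluster assignments) are all assumptions. The plotting helper is omitted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier


def train_neural_network(X_train, y_train):
    """Fit a small MLP classifier on TF-IDF features (illustrative settings)."""
    model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
    model.fit(X_train, y_train)
    return model


def predict_cluster_nn(model, vectorizer, text):
    """Return one probability per cluster for a new description."""
    features = vectorizer.transform([text])
    return model.predict_proba(features)[0]


# Hypothetical descriptions; labels stand in for K-Means cluster ids.
docs = [
    "increase school funding", "new school meals plan",
    "school teacher pay rise", "funding for school buildings",
    "school exam reform", "free school transport",
    "reduce hospital waiting time", "hospital staff pay rise",
    "new hospital building plan", "hospital parking charges",
    "funding for hospital beds", "hospital visiting hours reform",
]
labels = [0] * 6 + [1] * 6

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=0
)

model = train_neural_network(X_train, y_train)
probs = predict_cluster_nn(model, vectorizer, "extra funding for hospital equipment")
print(probs, probs.max())
```

Calling model.score(X_test, y_test) on the held-out split gives a classification accuracy, which is a different quantity from the maximum predicted probability printed above.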

3.4.2 Neural Network Training Results

We trained neural network models on the TF-IDF features extracted from each dataset (Data
1, Data 2, and Data 3). The reported accuracy scores (computed as the maximum value of the
predicted cluster probabilities) are:
• Data 1: 65.80%
• Data 2: 91.56%
• Data 3: 97.49%

Because each score is a maximum predicted probability, it reflects how confident the model
is in its prediction rather than an accuracy measured against held-out labels.

Data 3 (2010_2011_variables) achieved the highest score (97.49%), suggesting that the
neural network was able to capture the relevant features from the policy descriptions in this
dataset.

Data 2 (public_survey_variables) also achieved a high score (91.56%), indicating that
the neural network model learned effective patterns from the textual data and public
engagement scores in this dataset. This suggests that the policy descriptions in Data 2 and
Data 3 might have a clearer relationship with public engagement than those in Data 1.

Data 1 (bsa21_variables) resulted in a lower score (65.80%) compared to the other
datasets. This could be due to several factors, such as:

• The public engagement scores in Data 1 might be less reliable or informative.
• The textual descriptions in Data 1 might be less well-defined or have a lower inherent
structure compared to the other datasets, making it more challenging for the neural
network to learn meaningful patterns.

4. Predictive Analysis and Cluster Prediction

This section explored two machine learning approaches for predicting public engagement
with government policies: K-Means clustering and Neural Networks.

Understanding the Data Landscape:

In the initial phase of this project, we familiarized ourselves with the fundamentals of
Artificial Intelligence (AI) and Machine Learning (ML). This involved understanding core
concepts, algorithms, and their applications. Building on this foundational knowledge, we
identified the potential of using AI and ML to analyze textual data related to government
policies and public engagement.

The focus then shifted to exploring techniques for uncovering patterns and relationships
within this data. K-Means clustering, an unsupervised learning method, was chosen to group
policy descriptions with similar characteristics. This approach allows for the identification of
thematic clusters that might be associated with different levels of public interest.

Enhancing Prediction with Neural Networks:

The project further investigated the application of Neural Networks, a supervised learning
technique. Neural Networks can learn complex relationships between features (like TF-IDF
weights from policy descriptions) and target variables (public engagement scores in this
case). The code included a Multi-Layer Perceptron (MLP) implementation to train a Neural
Network model for predicting public engagement scores.

The results indicated that the Neural Network models achieved promising accuracy scores,
particularly for Data 2 (public_survey_variables) and Data 3 (2010_2011_variables). This
suggests that the model was able to effectively capture the underlying patterns in the textual
data and public engagement scores within these datasets.

By comparing K-Means clustering and Neural Networks, we can see the value proposition of
each approach. K-Means offers interpretability, allowing us to analyze the policy descriptions
within each cluster to understand the thematic drivers of public engagement. Neural
Networks, while potentially more accurate, can be less interpretable, requiring further
analysis of the predicted cluster probabilities.

5. Conclusion

This project successfully explored the application of Artificial Intelligence and Machine
Learning techniques for analyzing government policy data and predicting public engagement.
The implemented methods, K-Means clustering and Neural Networks, demonstrated the
potential for uncovering hidden patterns and relationships within textual policy descriptions.

The K-Means clustering approach facilitated the grouping of policies with similar
characteristics, enabling further analysis to understand the thematic factors influencing public
interest. The exploration of Neural Networks provided promising results for predicting public
engagement scores, particularly for datasets with a clear relationship between policy
descriptions and engagement levels.

Overcoming Challenges and Future Directions:

While the project yielded positive results, there are areas for further exploration and
improvement. One key challenge involves the interpretability of Neural Network models.
Unlike K-Means clustering where we can directly analyze the policy descriptions within each
cluster, Neural Networks can be like "black boxes." Future work could investigate techniques
like feature importance analysis to shed light on this. Feature importance analysis ranks
features (words or phrases in our case) based on their contribution to the model's predictions.
This would provide valuable insights into the specific aspects of policy descriptions that the
model leverages for predicting public engagement.

Another exciting area for exploration is incorporating additional data sources beyond textual
policy descriptions. Public sentiment analysis from social media platforms can reveal public
opinion in real-time, capturing the immediate reactions and discussions surrounding a policy.
Demographic data, such as age, income, or location, could further enrich the model's
understanding of public engagement. By considering factors like age groups most impacted
by a policy or geographical areas with higher stakes, the model's predictions could be more
nuanced and geographically specific.

Furthermore, advancements in Natural Language Processing (NLP) techniques could be
integrated to delve deeper into the sentiment and emotions conveyed within the policy
descriptions. This could allow the model to not only identify the topics discussed but also
gauge the overall public sentiment towards those topics.

By overcoming these limitations and incorporating additional data sources, the project's
foundation in AI and ML can be further strengthened. These techniques have the potential to
become valuable tools for policymakers seeking to understand and optimize their policy
initiatives. Imagine a future where policymakers can leverage AI-powered predictions to
gauge public interest in proposed policies, tailor messaging to resonate with specific
demographics, and ultimately foster more effective and well-received policy
implementations.