0% found this document useful (0 votes)

163 views19 pages

Enhancing The Government Accounting Information Sys - 2023 - International Journ

Uploaded by

Ellishya Khirudin

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

163 views19 pages

Enhancing The Government Accounting Information Sys - 2023 - International Journ

Uploaded by

Ellishya Khirudin

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 19

International Journal of Accounting Information Systems 48 (2023) 100600

Contents lists available at ScienceDirect

International Journal of Accounting

Information Systems
journal homepage: www.elsevier.com/locate/accinf

Enhancing the government accounting information systems using

social media information: An application of text mining and
machine learning
Huijue Kelly Duan a, *, Miklos A. Vasarhelyi b, Mauricio Codesso c, Zamil Alzamil d
a
Sacred Heart University, Jack Welch College of Business & Technology, 3135 Easton Turnpike, Fairfield, CT 06825, United States of America
b
Rutgers, The State University of New Jersey, Rutgers Business School, 1 Washington Park, Newark, NJ 07102, United States of America
c
Northeastern University, D’Amore-McKim School of Business, 319J Hayden Hall, Boston, MA 02115, United States of America
d
Majmaah University, Computer Science Department, Al-Majmaah 11952, Saudi Arabia

A R T I C L E I N F O A B S T R A C T

Keywords: This study demonstrates a way of bringing an innovative data source, social media information, to
Social media the government accounting information systems to support accountability to stakeholders and
Text mining managerial decision-making. Future accounting and auditing processes will heavily rely on
Machine learning
multiple forms of exogenous data. As an example of the techniques that could be used to generate
Sentiment analysis
this needed information, the study applies text mining techniques and machine learning algo
rithms to Twitter data. The information is developed as an alternative performance measure for
NYC street cleanliness. It utilizes Naïve Bayes, Random Forest, and XGBoost to classify the tweets,
illustrates how to use the sampling method to solve the imbalanced class distribution issue, and
uses VADER sentiment to derive the public opinion about street cleanliness. This study also ex
tends the research to another social media platform, Facebook, and finds that the incremental
value is different between the two social media platforms. This data can then be linked to gov
ernment accounting information systems to evaluate costs and provide a better understanding of
the efficiency and effectiveness of operations.

1. Introduction

Future accounting systems will utilize large amounts of exogenous data (Brown-Liburd et al., 2019) in conjunction with traditional
accounting data. Government accounting systems will move to be a conglomerate of three main components: 1) traditional financial,
2) infrastructure maintenance, and 3) quality of services (Bora et al., 2021). This study illustrates how exogenous variables eventually
integrated into service processes can be used within modern accounting and assurance operational services. It explores an alternative
performance measure by analyzing social media information to enhance government managerial decision-making and bring inno
vation to governmental operations. The progressive development of information and communication technologies (ICTs) and the
digital transformation of operations have fundamentally changed every aspect of people’s lives, social needs, as well as communication
strategies with the government. Modern government reporting demands reform toward a “data-driven, analytics-based, real-time, and
proactive reporting paradigm” (Bora et al., 2021). A dynamic and interconnected communication channel with the citizens would

* Corresponding author.
E-mail address: [email protected] (H.K. Duan).

https://doi.org/10.1016/j.accinf.2022.100600
Received 29 September 2021; Received in revised form 5 July 2022; Accepted 14 November 2022
Available online 25 November 2022
1467-0895/© 2022 Elsevier Inc. All rights reserved.
H.K. Duan et al. International Journal of Accounting Information Systems 48 (2023) 100600

Fig. 1. NYC Residents NYC311 Complaints. This Tableau dashboard information is based on data obtained from NYC Open Data, which is available
at https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9.

generate the exogenous data source to improve public services’ performance and delivery. It would also be part of the three-
dimensional reporting system measuring and reporting the quality of services. Outdated measurements and old-fashioned ways of
operations cannot provide efficient public services to meet current citizens’ needs and expectations. For example, the New York City
(NYC) Mayor’s Office of Operations implements a Scorecard inspection program to assess the cleanliness of its streets and sidewalks by
relying on inspectors’ subjective judgment during a drive-by visual inspection of sampled locations.1 This method was established in
1973 and has not changed for nearly fifty years (Office of the New York State Comptroller, 2020). The ratings are adjusted for street
miles but not for the population, housing density, or the nature of activity in the inspected area, such as residential or commercial
areas. Based on the current rating, the majority of the streets are rated as acceptably clean (See Appendix A). However, the Office of
New York State Comptroller issued an audit report in 2020 where it stated several weaknesses of the methodology used by the Mayor’s
office, specifically the inspection process and the rating calculation, which raise concerns over the reliability of the ratings.
The auditors also pointed out that “without analyzing and acting on all available data, including complaints, to identify and mitigate the
underlying problem, there is material risk that the same sanitation problems will continue to surface and negatively impact the quality of life for
residents and visitors in those areas” (Office of the New York State Comptroller, 2020). The state auditors encouraged the Department of
Sanitation to consider all the available data sources to develop and implement additional performance measures for street cleanliness
(Office of the New York State Comptroller, 2020). The current service reporting system is what technology of the last century could
provide. As accounting information systems are rigid and backward-looking, the public would be much better served with close-to-
real-time service reporting integrated with a system of public accountability.
Additionally, NYC residents increasingly contact the Department of Sanitation via NYC311 about missed trash pickups, overflowing
litter baskets, and other insalubrious conditions. The examination of the NYC311 service request data from May 22, 2014, to May 22,
2019, reveals an increasing trend of complaints or requests for services by NYC residents to the Department of Sanitation and the
Department of Health and Mental Hygiene (as shown in Fig. 1).

1
Scorecard Inspection, information is available at: https://www.worldsweeper.com/Street/Profiles/NYCScorecard.pdf.

2
H.K. Duan et al. International Journal of Accounting Information Systems 48 (2023) 100600

To better embrace innovation in government, many plans and proposals are being considered and implemented, including big data
analytics, smart cities, machine learning, drone usage, etc. Governments are increasingly adopting innovative data sources and data
analytics to better support the decision-making process, such as mobile device sensor-based app data, crowdsourcing data, Twitter
sentiment, and postings (Kitchin, 2014; O’Leary, 2013; OECD, 2017; Zeemering, 2021). Several cities have been exploring this area,
using different management information systems to gather exogenous data and monitor public services and functions. Examples of
these include monitoring traffic based on transportation network data, the data analytic center of the Centro De Operacoes Prefeitura
Do Rio in Brazil, London’s Dashboard and LoveCleanStreets App, Boston’s infrastructure monitoring system, etc. (Kitchin, 2014; Li
et al., 2018; O’Leary, 2019a, 2013). Incorporating big data into government information systems as part of service evaluation and
assessment factors improves public services’ effectiveness, which allows the government official to make data-driven decisions,
promptly address the issues, and better deploy the resources.
As an example to demonstrate the possibility of using exogenous data in supporting government managerial decision-making, this
study proposes an alternative performance measure. This measure uses social media information to assess the street cleanliness in NYC
in response to the New York State auditors’ recommendations stated in the 2020 audit report. It utilizes text mining techniques and
machine learning algorithms to examine social media information, applies an analytical approach to identify temporal trends and
patterns of street cleanliness, provides a different perspective about street cleanliness other than official cleanliness ratings, and as
sesses the tweets’ sentiment to measure the performance of municipal services. The study finds that the overall sentiment trend over
the examined period is negative, inconsistent with the official Scorecard ratings. This study proposes that the government incorporates
social media information into municipal performance evaluation and assessment factors. A continuous monitoring dashboard for street
cleanliness that integrates various data sources, including social media information, can be built to support public services decision-
making.
Public accountability is an essential factor for a sustainable and stable government. Many government institutions demonstrate
their accountability by disclosing the tax revenue amount and illustrating how they spend taxpayers’ money efficiently and effectively,
as well as how that expenditure benefits citizens’ lives (Callahan and Holzer, 1999). Involving citizens in the process of government
fiscal budgeting and decision-making process, particularly in resource allocation and performance measurement, is critical to meeting
citizens’ expectations and increasing the government’s accountability (Berner and Smith, 2004; Ebdon and Franklin, 2004; Justice
et al., 2006; Robbins et al., 2008; Woolum, 2011). The majority of governments’ performance measures concentrate on information
used to make internal management decisions, such as inputs, outputs, staffing patterns, and resource allocations (Ho and Ni, 2005;
Woolum, 2011). Incorporating exogenous data, such as social media information, into government accounting information systems is a
way of considering citizens’ preference and their views on public issues, which helps government decision-makers to provide better
public services that matter to citizens, determine how public services should be managed, measured, and reported.
The contributions of this study mainly focus on three areas. First, this study demonstrates the possibility of incorporating social
media information into the government information systems to support decision-making. Collecting and analyzing social media in
formation is a direct and efficient way to obtain timely feedback from citizens and proactively interact with the public. Government
accounting information systems can incorporate these measures and link them to cost figures allowing the understanding of the ef
ficiency and effectiveness of operations. Second, this study presents a data analytical approach to enhance decision-making using more
real-time type data rather than only historical data provided by accounting systems. Users can retrieve valuable information from the
tweets by utilizing text mining techniques and machine learning algorithms and can handle a dataset with an imbalanced class dis
tribution issue. Among the total number of tweets collected, only a small portion of the data is relevant to the subject; thus, the
distribution of the dataset is skewed. The sampling methods used in the study can resolve the imbalanced class distribution issue, and
the methodology can be generalized to other areas, such as predicting financial fraud and assessing bankruptcy possibilities. Third, this
study provides an example of using social media information as an alternative performance measure. It applies emerging technologies
and an analytical approach to examine social media information and provides a different perspective from the general public for
tackling a public problem.
The remainder of this study is organized as follows: the second section reviews existing literature on the study of social media
information. The third section provides the methodology of this study. The fourth section shows the results, and the fifth section
focuses on extending the analysis to another social media platform. Finally, the last section discusses the conclusions and limitations of
the study and provides future avenues for research.

2. Literature review

Research on social media has exponentially grown in recent years. As part of the exogenous data, the added value and the impact of
social media are significant considering the volume, velocity, variety, and veracity of the information that is available (Buhl et al.,
2013; Vasarhelyi et al., 2015; Yoon et al., 2015; Zhang et al., 2015). A Twitter platform facilitates network interconnections and
perfectly illustrates the social network theory. The interconnected network among users generates a rich data source for opinion
mining and sentiment analysis (Pak and Paroubek, 2010). This section discusses the extant literature related to crowdsourcing, the
value of social media information and the techniques that researchers are using to analyze this type of data, and the information used to
measure the municipalities’ performance.

2.1. The use of crowdsourcing

Many cities are seeking a novel approach to address street condition issues. For example, Boston implemented a mobile, cloud-

3
H.K. Duan et al. International Journal of Accounting Information Systems 48 (2023) 100600

based app for the citizens to report problems related to the city’s infrastructure, such as potholes and graffiti (O’Leary, 2019a). London
developed a crowdsourcing-based cloud computing system, LoveCleanStreets,2 allowing citizens to take a picture of illegal dumping,
potholes, graffiti, etc., and submit the images through a mobile app (Li et al., 2018). Jakarta implemented a tool to capture citizens’
social media posts, including Twitter, to produce a real-time flood monitoring system (OECD, 2017). The use of crowdsourcing is an
emerging technique and has become increasingly popular (O’Leary, 2019b). O’Leary (2019b) presents five case studies of the Big 4
accounting firms and Wikistrat that use crowdsourcing to generate innovations and change their consulting business model. Firms use
different social media platforms to gather opinions on various issues and suggestions for business development and address clients’
concerns. Researchers also study the utilization of crowdsourcing, such as applying the crowdsourcing approach in the accounting and
finance field, using social media in knowledge management, and exploring the use of crowdsourcing to build data analytical tools
(Dzuranin and Mălăescu, 2016; O’Leary, 2016a, 2015a). Governments can undoubtedly utilize crowdsourcing to improve and enhance
public services (Dutil, 2015); the participation of the citizens can “help the government be more responsive and effective” (Linders,
2012). Canada initiated a crowdsourcing competition to explore ideas that can help Canada define its future role in the global digital
economic environment (O’Leary, 2016b). Participants are asked to evaluate a set of innovative ideas to help develop Canada’s digital
future (O’Leary, 2016b). The development of ICTs facilitates a broader horizon for government to communicate and interact with the
public. As a popular microblogging platform, Twitter is utilized by many governments to engage in communication (Mossberger et al.,
2013). Eventually, crowdsourcing results must be integrated into the “modern” government accounting systems.

2.2. The value of social media and NLP tools

Social network theory refers to interconnections among people, organizations, or groups (Haythornthwaite, 1996; Williams and
Durrance, 2008). The interaction within the network promotes collaborations among users, which could generate valuable infor
mation and insight for stakeholders. The use of social media, such as Facebook, Twitter, YouTube, Instagram, Weibo, etc., has
dramatically grown in the past decade. These social media channels, which are Internet-based Web 2.03 applications, have provided a
platform for users to proactively express and exchange opinions, share knowledge and experiences, and develop their social networks.
As a major social media platform, Twitter had more than 322.4 million users worldwide in 2021; the number is expected to increase to
340.2 million by 2024.4 People are rapidly adopting these communication channels, establishing social network relationships via
complex network links. To put things into perspective, Twitter generates over 500 million tweets each day, and Facebook has more
than 4.75 billion posts per day (Dhaoui et al., 2017). This amount of information is considered a rich data source that is high in volume,
velocity, and variety to support decision-making (O’Leary, 2015b).
Researchers find that Twitter data contains valuable information and can be used to discover signal events, predict specific cir
cumstances, and assess the causality of an event (O’Leary, 2015b). Twitter is being used in various settings, including audit procedures
(Rozario et al., 2022), emergency and disaster situations (Hughes and Palen, 2009; Mandel et al., 2012; Vieweg et al., 2010), political
campaigns (O’Leary, 2012), fraud activity (O’Leary, 2011), reputation management (Jansen et al., 2009; Prokofieva, 2015), election
prediction (Cameron and Barrett, 2016; Shi et al., 2012; Tsakalidis et al., 2015), disease control prediction (Culotta, 2010; Guo et al.,
2020; Jahanbin and Rahmanian, 2020), stock market movement (Bollen et al., 2011; Oh and Sheng, 2011; Risius et al., 2015; Sul et al.,
2017), sales prediction (Asur and Huberman, 2010; Culotta, 2013; Lassen et al., 2014), etc. The incremental value of disseminating this
type of qualitative unstructured content and retrieving useful information can be significant.
There is a growing trend in analyzing qualitative information using Natural Language Processing (NLP) tools. Researchers explore
ways to interpret the textual information from annual reports, financial news articles, conference calls, employees’ e-mails, social
media contents (Burgoon et al., 2016; Holton, 2009; Larcker and Zakolyukina, 2012; Li, 2008; Liu and Moffitt, 2016; Loughran and
McDonald, 2011; Sul et al., 2017). The bag of words approach is commonly used in analyzing textual content, also known as the rule-
based dictionary approach. Loughran and McDonald (2011) develop their own dictionary to examine the tone and sentiment of
corporate 10-K reports. Based on Management Discussion and Analysis (MD&A) sections in annual reports or quarterly filings, Cec
chini et al. (2010) create their dictionaries of keywords to automatically analyze financial text, detect economic events, and predict
fraud and bankruptcy. Other research utilizes the machine learning approach to quantify the qualitative information, including un
supervised machine learning5 (e.g., clustering method) and supervised machine learning6 (e.g., classification method). Li (2010)
examines the information content of the forward-looking statements in the MD&A of 10-K and 10-Q filings using a Naïve Bayesian
machine learning algorithm. Schumaker et al. (2012) evaluate sentiment in financial news articles to predict stock prices using the
Support Vector Regression machine learning algorithm.
Valence Aware Dictionary and sEntiment Reasoner (VADER) is a “lexicon and rule-based sentiment analysis tool that is specifically
tuned to the sentiment expressed in social media” (Hutto and Gilbert, 2014). Many studies apply VADER to perform the sentiment

2
For more information about LoveClearnStreets, please refer to https://lovecleanstreets.info/.
3
Web 2.0 refers to the second generation of the World Wide Web. For more detail, please refer to https://www.webopedia.com/TERM/W/Web_2_
point_0.html.
4
Number of Twitter users worldwide from 2019 to 2024: https://www.statista.com/statistics/303681/twitter-users-worldwide/.
5
Unsupervised machine learning is to study the structure of a dataset in order to detect anomalies, reduce dataset dimensionality, extract common
features or attributes (Tan et al., 2019).
6
Supervised machine learning is to use a class of labeled data (e.g., training dataset) to generate a learning algorithm that could correctly predict
the class labels of records it has never seen before (e.g., testing dataset) (Tan et al., 2019).

4
H.K. Duan et al. International Journal of Accounting Information Systems 48 (2023) 100600

analysis of textual content. Elbagir and Yang (2019) use VADER to classify the tweet sentiment related to the 2016 US election. Borg
and Boldt (2020) apply VADER to assess the sentiment expressed in customers’ e-mails. Pano and Kashef (2020) perform a sentiment
analysis of Bitcoin-related tweets using VADER to predict the bitcoin price during the COVID-19 pandemic. Nemes and Kiss (2021) use
VADER as one of the sentiment tools to analyze the stock news headlines. This study adopts VADER as the sentiment analysis tool to
assess the public’s opinion of street cleanliness.

2.3. Municipalities’ performance measure

Non-financial indicators are widely used in the public sector, such as the measures used in the Service Efforts and Accomplishments
(SEA) reporting. SEA reporting was implemented by the Governmental Accounting Standard Board (GASB), which aims to provide the
citizens with the performance measures of the public services, including services efficiency and effectiveness, and service quality.7
Performance information could have an impact on the municipalities’ budgeting, funding, and donations. Buchheit and Parsons (2006)
perform an experimental study regarding the impact of disclosing non-financial information (e.g., information related to service efforts
and accomplishments) on non-profit organizations’ donations. The study finds that information about service efforts and accom
plishments significantly influences the donors’ decision-making process (Buchheit and Parsons, 2006). Wang (2000) uses a national
survey from 208 counties in the US to examine the impact of different performance measures in budgetary decision-making. The study
finds that the counties use various performance indicators in different stages of the budget cycle, such as agency requests and executive
budgets, etc. (Wang, 2000). Municipalities’ performance is measured in different ways, such as resource allocation decisions,
budgetary decision making, human resources management, performance monitoring, and program evaluation (Reck, 2001; Rivenbark
and Kelly, 2006; Wang, 2002). Different types of information can be used as part of the municipalities’ performance measures,
including financial and non-financial information. Reed (1986) conducts an experimental study and finds that when only non-financial
information is presented, particularly program effectiveness data, the information influences government budget funding decisions.
Reck (2001) examines the incremental value of financial and non-financial information in government budgeting allocation and
performance evaluation. The study finds that financial information is useful to allocate resources, while non-financial information is
used to evaluate the overall performance and is influential in assessing the overall entity’s efficiency and effectiveness (Reck, 2001).
Social media information has been used as part of a new performance measure in various fields. Bonsón and Ratkai (2013) use
Facebook data to generate metrics to measure the effectiveness of corporate social network communication with the stakeholders,
including the stakeholders’ mood. Coulter and Roggeveen (2012) examine the effect of social media on consumers’ reactions to
product-related promotions, which provide insights into their marketing strategies. Burton and Soboleva (2011) use tweets to measure
the company’s marketing communication strategy based on six companies (twelve twitter accounts) in the US and Australia. Based on
prior research, social media information could be a potential data source to be used as a performance measure for public services; it
would provide the government with a different perspective from the general public.
Government service reporting allows the users to assess the economy, efficiency, and effectiveness of the service provided, where
the performance measures concern the results of the public services.8 However, some of the performance indicators provided today by
government entities are a far call from providing the basis for accountability. A timely and dynamic reporting with real-time exogenous
data feeding would reshape the government’s performance management and facilitate a more responsive government. Incorporating
social media information into the government accounting information systems enables the citizens to provide direct feedback about
the output quality of public services and explicitly indicate the citizens’ needs and expectations. Governments can utilize this
communication channel to promote public engagement to support their decision-making and promptly deploy the service resources.

3. Methodology

The general workflow for this study is illustrated in Appendix B. The following subsections describe each step in detail.

3.1. Data collection

The Streaming API9 is forward-looking and collects upcoming tweets, it is generally the preferred way of downloading a large
number of tweets without exceeding the rate limits, but it is time-consuming (Bonzanini, 2016). This study uses the Streaming API to
collect tweets based on NYC’s longitude and latitude due to the granularity restriction of Twitter’s geotagged data. Geotagged tweets
are categorized based on a bounding box, which is defined by longitude and latitude; the granularity of the bounding box must be one

7
GASB SEA reporting, information is available at: https://www.seagov.org/aboutpmg/.
8
About SEA Reporting – Performance Measurement, information is available at: https://www.seagov.org/aboutpmg/performance_measurement.
shtml.
9
Twitter offers several different types of API, such as Enterprise, Premium, Standard, Essential, Elevated, and Academic Research. Depending on
the API, fees may be required, and different levels of data access and limitations may apply. For this study, we only had access to standard free API.
For more information, please refer to: https://developer.twitter.com/en/docs/twitter-api Twitter released the Academic Research API in 2021,
which allows users to retrieve archived tweets back to March 2006. Academic researchers may consider using Academic Research API for their
future research. For more information, please see: https://developer.twitter.com/en/docs/twitter-api/tweets/search/introduction.

5
H.K. Duan et al. International Journal of Accounting Information Systems 48 (2023) 100600

of the options: neighborhood/county, city, admin, or country.10 We define the bounding box using NYC’s longitude and latitude, and
all tweets from this location are captured. A Python script is used to access the Twitter API using Python 2.711 to fetch all Twitter
streams originating from NYC. Different Python libraries are used during the tweets collection process. For example, Tweepy allows
users to access the Twitter API.12 StreamListener enables users to stream real-time tweets and store the tweets in a designated loca
tion13 (See Appendix C for a list of major Python libraries used in this study). The dataset collected in this study is from August 27,
2018, to May 22, 2019, and contains 6.8 million tweets that cover all the tweets that originated from NYC. A sample of the tweets is
listed in Appendix D.

3.2. Data preparation

After collecting the data through the Streaming API, the following steps transform the raw data into a format that the user can read
and analyze.
Data Cleaning: a ‘C’ script14 is used to remove the corrupted records, quotation marks, dots, and commas.
Variable Selection and Aggregation: the streamed data contains many different types of attributes. This step selects the attributes in
the dataset that could potentially be used in the subsequent analysis. Six fields were chosen for each tweet and aggregated into the
dataset, considering the potential relevance to the research subject. The selected attributes are the date and time of the tweet, tweets
body (the content posted by users), user identification number (a unique ID for each user), number of followers (the number of fol
lowers to the author), number of likes (the number of likes for the tweet), the total number of posts of the individual user. Structured
Query Language Platform (MySQL)15is used to store the streamed tweets.
Data Aggregation: multiple databases, data cubes, and files are aggregated, and 27 chunk files are combined into one single table in
MySQL. Finally, a CSV file is generated as the dataset for analysis.

3.3. Relevancy determination

The tweets are collected based on NYC’s longitude and latitude; therefore, the dataset contains tweets that are not relevant to the
research topic. A methodology needs to be selected to retrieve relevant information from a massive number of tweets. One method is to
utilize keywords. A list of keywords is created based on the research topic and the bag of words in the Natural Language Toolkit (NLTK)
to filter the relevant tweets.16 However, after applying the initial list of keywords, the dataset still contains many irrelevant tweets.
Hence, it is essential to review the dataset and check the specific keywords in the content to create an appropriate list of keywords. For
example, the keyword = dog could be related to a personal pet or homeless dog, and depending on the content, the tweet might not be
relevant to the research topic here. Therefore, the list of keywords needs to be modified, e.g., adjust dog to stray dog, homeless dog, etc.
This step requires some manual work to review and update the keywords list (See Appendix E for a sample of the keyword list).
Combining manual and automatic efforts have been explored in academic literature. Chakraborty and Vasarhelyi (2017) create a
hybrid model to build a taxonomy utilizing manual and automatic steps. They use a clustering approach to develop a taxonomy
structure and use manual steps to create data tagging, identify the required list of items, and validate the accuracy of the clustering
approach. Even though the manual work is time-consuming, it is beneficial to develop a knowledge base to analyze the tweets in the
subsequent analysis. After applying the final keywords list, the remaining dataset contains about 132,149 tweets. However, out of the
132,149 tweets, many tweets are still irrelevant and need to be further preprocessed.

3.3.1. Data preprocessing

The dataset needs to be preprocessed before applying the machine learning method, which converts unstructured and semi-
structured text into a structured format. The following steps are performed to normalize the text:
Tokenization: text in tweets is broken into discrete words referred to as tokens.
Non-ASCII Characters Removal: ASCII stands for American Standard Code for Information Interchange;17 it is a system language to
represent text in computers. This step is to remove all the non-ASCII characters, which are characters, languages, or scripts used other

10
Twitter Geo/search information https://developer.twitter.com/en/docs/twitter-api/v1/geo/places-near-location/api-reference/get-geo-search.
11
Python software download is available at: (https://www.python.org/).
12
Introduction to Tweepy, information is available at https://www.pythoncentral.io/introduction-to-tweepy-twitter-for-python/.
13
StreamListener usage, information is available at https://docs.tweepy.org/en/v3.4.0/streaming_how_to.html.
14
C-script is a script that runs in a command-line environment. Information is available at https://docs.microsoft.com/en-us/windows-server/
administration/windows-commands/cscript.
15
MySQL is an open-source relational database management system. Information is available at https://www.mysql.com/.
16
The list of keywords was constructed by combining manual and automatic efforts. The authors first reviewed the NLTK corpus package to
identify relevant words using judgments. Subsequently, the authors identified topics mentioned in the tweets using the topic modeling approach and
manual validation to develop a comprehensive taxonomy. The topic modeling approach indicates the probability of words used for each abstract
topic. Based on identified words from topic modeling, the authors checked back to the tweets and updated the keyword lists accordingly. For
example, the topic modeling result indicates the word ‘stink’; authors filtered the tweets using ‘stink’ and reviewed the tweets to see other similar
words used. In this case, the authors extend the keyword list to include ‘stinky,’ ‘stinks,’ etc.
17
ASCII, information is available at https://techterms.com/definition/ascii#:~:text=Stands%20for%20%22American%20Standard%20Code,%
2C%20numbers%2C%20and%20punctuation%20symbols.

6
H.K. Duan et al. International Journal of Accounting Information Systems 48 (2023) 100600

than American Standard Code.

Numbers Replacement: this step is to replace all integer occurrences with textual representation (e.g., 2019 is converted to the word
‘2019’).
URL, User ID, Hashtags, and Special Symbols Removal: this step is to remove all the special symbols in the contents, such as URL (e.g.,
https: link), user ID (e.g., \ud83d), Hashtags, and special symbols (e.g., # and @).
Stopwords Removal: this step is to remove all the stopwords listed in the NLTK, a Python library.
Lemmatization: lemmatization is a process of normalizing the text or word during Natural Language Processing; it is often referred to
as removing inflectional endings and returning the base or dictionary form of a word.18 For example, the words posts, posted, and
posting are treated as the word ‘post.’

3.3.2. Classification model

Blei et al. (2003) describe Latent Dirichlet Allocation (LDA) as “a generative probabilistic model for collections of discrete data, and
it is a three-level hierarchical Bayesian model.” LDA is an example of topic modeling, which classifies the dataset based on different
topics; it provides several topics within a dataset and the probability of each word per topic. However, this methodology could not
provide a clear separation between relevant and irrelevant tweets in this study. Each identified case varies depending on the dataset,
and the probability of each word used could be different from topic to topic. When all the results from various topics are combined,
relevant tweets might be considered irrelevant under specific topics. Mixed results were found by using this method, and the classi
fication of the dataset was not precise. Therefore, an alternative approach, the supervised machine learning method, was used.
The supervised machine learning method relies on a training dataset to train the machine to learn and predict the outcome of the
testing data. Three different algorithms are used in this study: Naïve Bayes (NB), Random Forest (RF), and eXtreme Gradient Boosting
(XGBoost).19 NB is based on Bayesian methods, a “statistical principle for combining prior knowledge of the classes with new evidence
gathered from data” (Tan et al., 2019). NB assumes the attributes are conditionally independent of the other features in the dataset to
estimate the class-conditional probability (Tan et al., 2019). The algorithm is generally considered the simplest and most widely-used
probabilistic classification model that can handle uncertainty in predictions, compute class-conditional probabilities even in high-
dimensional settings, robust to noises and irrelevant attributes, and handle missing values (Tan et al., 2019). RF is “an ensemble
learning method20 specifically designed for decision tree classifiers; it combines the predictions made by multiple decision trees, where
each tree is generated based on the values of an independent set of random vectors” (Tan et al., 2019). RF is computationally fast and
robust to overfitting and performs well in high-dimensional settings (Tan et al., 2019). XGBoost is a type of boosting method, a part of
the ensemble methods that manipulate the training sets. Boosting is an iterative procedure used to change the distribution of training
sets adaptively; it assigns a weight to each training example, then adjusts the weight at the end of each boosting round so that the base
classifiers will focus on examples that are hard to classify (Tan et al., 2019). The boosting approach is particularly effective for an
imbalanced dataset. All three algorithms are widely used in literature (Alom et al., 2018; Awwalu et al., 2019; Bazzaz Abkenar et al.,
2021; Holton, 2009; McCord and Chuah, 2011; Schnebly and Sengupta, 2019; Singh et al., 2019; Tseng et al., 2012). This study adopts
the three algorithms mainly because of their performance in prior literature and their reasonable computational speed.
The supervised machine learning approach requires manual labeling to create a set of labeled data. Twenty-six thousand tweets are
labeled manually as “1”, meaning relevant to the subject, or “0”, meaning irrelevant to the research subject. The labeled dataset is
divided into 80% as the true training set and 20% as the validation set. To preserve the matching characteristics21 between the datasets
and prevent introducing bias into the samples, stratified 10-fold cross-validation and a pairwise t-test are used to identify the best-
performed classifier. Then the best-performed classifier is applied to the testing set to identify the relevant tweets to the subject.

3.4. Sentiment analysis

Hutto and Gilbert (2014) compare the VADER to eleven other highly regarded sentiment analysis tools: Linguistic Inquiry Word
Count (LIWC), General Inquirer (GI), Affective Norms for English Words (ANEW), SentiWordNet (SWN), SenticNet (SCN), Word-Sense
Disambiguation (WSD) using WordNet, the Hu-Liu04 opinion lexicon, NB classifier, Maximum Entropy (MaxEnt or ME), SVM-
Classification, SVM-Regression, and conclude that VADER outperformed these tools in terms of dealing with social media texts, the
New York Times editorials, movie reviews, and product reviews. Hutto and Gilbert (2014) find that VADER performed better than
other highly regarded sentiment analysis tools in large part because it used human raters from Amazon Mechanical Turk during its
development. Each rater might have a different interpretation of emotional intensity; some words might be negative to one person but
neutral to others. VADER sentiment analysis considers these factors and averages raters’ ratings for each word, “the sentiment lexicon
is sensitive to both the polarity and the intensity of sentiments expressed in social media contexts” (Hutto and Gilbert, 2014). It
combines a dictionary of lexical features to sentiment scores with a set of five heuristics based on grammatical and syntactical cues to

18
A definition and discussion of lemmatization can be found athttps://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-
1.html#2378.
19
Authors also used Support Vector Machines (SVM). However, it is computationally very slow, and the performance (accuracy rate and recall
ratio) is not good. In that case, we dropped this algorithm.
20
An ensemble method “constructs a set of base classifiers from training data and performs classification by taking a vote on the predictions made
by each base classifier” (Tan et al., 2019).
21
To keep the matching characteristics of the datasets, we split the dataset based on the same proportion between relevant and irrelevant samples.

7
H.K. Duan et al. International Journal of Accounting Information Systems 48 (2023) 100600

Table 1
Confusion Matrix.
Confusion Matrix Predicted No Predicted Yes

Actual No True Negative (TN) False Positive (FP)

(Type I Error)
Actual Yes False Negative (FN) True Positive (TP)
(Type II Error)
Precision TP / (TP + FP)
Recall TP / (TP + FN)
F1-Score (2 * Recall * Precision) / (Recall + Precision)
Classification Accuracy (TP + TN) / (TP + TN + FP + FN)

Table 2
Confusion Matrix and Classification Results.

Panel A: Naïve Bayes

Precision Recall F1-Score Support

0 0.98 0.89 0.93 4942

1 0.12 0.47 0.19 159
Accuracy 0.87 5101
Macro avg 0.55 0.68 0.56 5101
Weighted avg 0.95 0.87 0.91 5101
Confusion Matrix Predicted No Predicted Yes Predicted Classification
Actual No TN = 4377 FP = 565 0 117,510
Actual Yes FN = 85 TP = 74 1 14,639

Panel B: Random Forest

Precision Recall F1-Score Support
0 0.98 1 0.99 4942
1 0.93 0.31 0.47 159
Accuracy 0.98 5101
Macro avg 0.95 0.66 0.73 5101
Weighted avg 0.98 0.98 0.97 5101
Confusion Matrix Predicted No Predicted Yes Predicted Classification
Actual No TN = 4938 FP = 4 0 130,388
Actual Yes FN = 109 TP = 50 1 1,761

Panel C: XGBoost
Precision Recall F1-Score Support
0 0.98 1 0.99 4942
1 0.84 0.37 0.52 159
Accuracy 0.98 5101
Macro avg 0.91 0.68 0.75 5101
Weighted avg 0.98 0.98 0.97 5101
Confusion Matrix Predicted No Predicted Yes Predicted Classification
Actual No TN = 4931 FP = 11 0 130,350
Actual Yes FN = 100 TP = 59 1 1,799

convey changes to sentiment intensity. It considers punctuation (e.g.,!!!), capitalization (e.g., I am so HAPPY), degree modifiers (e.g., it
is good vs it is extremely good), emoticon (e.g.,), acronyms (e.g., LOL, ttyl), slang (e.g., Nah, meh), etc. These features are commonly
used in Twitter’s content. VADER was found to be the best suitable for social media text. It has also proven itself to be a great tool when
analyzing the sentiment of movie reviews and opinion articles (Hutto and Gilbert, 2014).
VADER sentiment analysis returns a sentiment score in the range of − 1 to 1, from most negative to most positive.22 The sentiment
score is calculated by summing up the sentiment scores of each VADER dictionary-listed word in the sentence (Hutto and Gilbert,
2014). The score is categorized as negative, neutral, positive, and compound. The compound score is computed by summing the
valence scores23 of each word in the lexicon, adjusted according to the rules, and then normalized to be between − 1 (most extreme
negative) and + 1 (most extreme positive) (Hutto and Gilbert, 2014). This is also known as a normalized or weighted composite
score.24 To separate the sentiment into categories, researchers need to assign a threshold for the compound score. Typically, the

22
VADER package in Python is available at https://github.com/cjhutto/vaderSentiment.
23
Valence scores measure the sentiment intensity (Hutto and Gilbert, 2014).
24
VADER compound score calculation information is available at https://blog.quantinsti.com/vader-sentiment/#:~:text=Compound%20VADER
%20scores%20for%20analyzing,1%20(most%20extreme%20positive).

8
H.K. Duan et al. International Journal of Accounting Information Systems 48 (2023) 100600

threshold value is 0.05. After reviewing the dataset, the standard threshold value (0.05) causes misclassification issues in this study.
Thus, the threshold value needed to be adjusted according to the instances. A sample of 500 tweets is manually annotated,25 and the
sentiment scores are determined based on descriptive statistics, such as mean, median, and standard deviation. The overall distribution
of the sentiment scores is evaluated based on different criteria, such as within one standard deviation or two standard deviations from
the mean. The final threshold is determined based on the highest accuracy rate compared to the manual annotation; this is a similar
approach performed by Chakraborty and Vasarhelyi (2017) that manually validated the machine learning results.

4. Results

The approach for obtaining results can be divided into two steps. The first step is relevancy determination, which uses a supervised
machine learning method to retrieve relevant tweets related to this study. The second step is sentiment analysis, which applies VADER
to the relevant tweets identified during the first step.

4.1. Relevancy determination

Twenty-six thousand tweets were labeled manually as “1”, meaning relevant to the subject, or “0”, meaning irrelevant to the
research subject. The relevant tweets represent over 3% of the total labeled tweets. 80% of the labeled data is set as the true training
group, and 20% of the labeled data is set as the validation group. Three classifiers, NB, RF, and XGBoost, are applied to the dataset to
evaluate the data classification. The results from the three algorithms are evaluated by examining the confusion matrix.26 Table 1
shows a general overview of the confusion matrix.
The results from NB (as shown in Table 2, Panel A) indicate that the false-positive number is 565, which means 565 tweets are
irrelevant but identified as relevant. The false-negative number is 85, which means 85 tweets are relevant but identified as irrelevant.
Additionally, the precision level is 98% and 12% for classifiers 0 (irrelevant) and 1 (relevant), respectively. The recall level is 89% for
classifier 0 and 47% for classifier 1. The F1-score is 93% for classifier 0 and 19% for classifier 1, respectively. Finally, the model
achieves 87% prediction accuracy. As a result, the model predicted that 117,510 records are classified as irrelevant; 14,639 records as
relevant.
The results from RF (as shown in Table 2, Panel B) indicate that the false-positive number is 4, which means 4 tweets are irrelevant
but identified as relevant. The false-negative number is 109, which means 109 tweets are relevant but identified as irrelevant.
Furthermore, the precision level is 98% and 93% for classifiers 0 and 1, respectively. The recall level is 1% for classifier 0 and 31% for
classifier 1. The F1-score is 99% for classifier 0 and 47% for classifier 1. Finally, the model achieves 98% prediction accuracy. As a
result, the RF model predicted that 130,388 records are classified as irrelevant; 1,761 records are classified as relevant.
The results from XGBoost (as shown in Table 2, Panel C) indicate that the false-positive number is 11, which means 11 tweets are
irrelevant but identified as relevant. The false-negative number is 100, which means 100 tweets are relevant but identified as irrel
evant. Moreover, the precision level is 98% and 84% for classifiers 0 and 1, respectively. The recall level is 1% for classifier 0 and 37%
for classifier 1. The F1-score is 99% for classifier 0 and 52% for classifier 1. Finally, the model achieves 98% prediction accuracy. As a
result, the model predicted that 130,350 records are classified as irrelevant; 1,799 records as relevant.
The above results (Table 2) indicate that RF and XGBoost have very similar performance, and both have achieved a high accuracy
score (about 98%). Based on the accuracy scores, the performance of each classifier model is excellent. However, the results from the
classifier models indicate that the dataset is facing an imbalanced class issue. For example, the RF model predicts that 130,388 records
are classified as irrelevant; only 1,761 records are as relevant; the distribution of the dataset is highly imbalanced and skewed as the
majority of the tweets in this dataset are irrelevant. Additionally, all three models’ false-negative numbers are high (85, 109, and 100,
respectively for NB, RF, and XGBoost), which causes the low recall ratio (47%, 31%, 37%, for NB, RF, and XGBoost respectively).
Applying two well-known sampling methods can resolve this imbalanced class distribution issue. The first one is the random
undersampling method, which balances the distribution of the classes by randomly removing records from the majority class (Tan
et al., 2019). The other is the random oversampling method, which aims to balance the class distribution by randomly duplicating
records from the minority class (Tan et al., 2019). However, each of the two methods has its limitations. Undersampling may cause the
dataset to lose valuable information and cause an underfitting issue. Oversampling may cause overfitting. Stratified 10-fold cross-
validation and a pairwise t-test are performed to prevent the limitations of the sampling methods. This approach preserves the
matching characteristics and keeps the same distribution proportion of the samples between the training and validation set, preventing
bias introduction during model evaluation. Additionally, the area under the ROC curve (AUC) is assessed to evaluate the performance
of different classifiers. A summary of the key measures is in Table 3.
XGBoost-oversampling performs the best among all nine models based on the AUC, as it has the highest number (85%), followed by
RF-undersampling and XGBoost-undersampling (both are 82%). It is also reasonable to consider using the recall ratio to evaluate the

25
Two users manually annotated all the selected samples, any discrepancies between the two users are investigated to reach a final decision.
26
Confusion matrix is used to evaluate the performance of a classification model that is based on the counts of test records correctly and incorrectly
predicted by the model (Tan et al., 2019).True Positive - Relevant tweets are classified correctly.True Negative - Irrelevant tweets are classified
correctly.False Positive - Irrelevant tweets are classified as relevant.False Negative - Relevant tweets are classified as irrelevant.Precision - The
ability of the classifier not to label an irrelevant tweet as relevant.Recall - The ability of the classifier to capture all the relevant tweets.F-1 Score -
Weighted average of Precision and Recall.Classification Accuracy - Percentage of accurate prediction.

9
H.K. Duan et al. International Journal of Accounting Information Systems 48 (2023) 100600

Table 3
A Summary of the Key Measures.
Accuracy Precision Recall F1-Score ROC AUC

NB 0.87 0.12 0.47 0.19 0.7

RF 0.98 0.93 0.31 0.98 0.64
XGBoost 0.98 0.84 0.37 0.52 0.75
NB_Undersampling 0.57 0.05 0.78 0.1 0.67
RF_Undersampling 0.9 0.21 0.74 0.32 0.82
XGBoost_Undersampling 0.86 0.16 0.77 0.26 0.82
NB_Oversampling 0.87 0.12 0.47 0.19 0.7
RF_Oversampling 0.97 0.6 0.36 0.45 0.72
XGBoost_Oversampling 0.94 0.3 0.69 0.42 0.85

Table 4
Sentiment Analysis by Category.
Sentiment Homeless Parking Street Subway Grand Total

Negative 9.8% 15.9% 32.2% 5.4% 63.4%

Neutral 3.5% 6.4% 15.9% 2.4% 28.1%
Positive 1.6% 1.6% 4.3% 1.0% 8.5%

models, as the relevant tweets are from the minority class. In this case, based on the recall ratio, all three undersampling classifiers
perform relatively better.
Moreover, a pairwise t-test based on AUC and recall ratio is used to further evaluate the classifiers’ performance; the pairwise t-test
is set at a 5% significance two-tailed test. The untabulated AUC pairwise t-test results indicate that the models under the two sampling
methods are significantly different from the original classifiers. Mainly, RF-undersampling, XGBoost-undersampling, and XGBoost-
oversampling are substantially different from the original classifiers. XGBoost-oversampling and RF-undersampling are significantly
different from the majority of the other models. The untabulated recall pairwise t-test results indicate that most of the classifiers are
significantly different from the others.
Overall, considering the computational requirements and misclassification cost, XGBoost-oversampling deems a reasonable clas
sifier to select as it has the highest AUC and relatively high recall ratio. Manual validation is performed on a sample basis to avoid
overfitting. One thousand tweets are randomly selected and manually labeled as relevant and irrelevant. The results also indicate that
XGBoost-oversampling performs best, considering the accuracy percentage compared to the manual labels, with RF-undersampling as
the second best. Therefore, it is determined that the XGBoost-oversampling method is the appropriate classifier to use. Finally,
applying the XGBoost-oversampling classifier to the testing set result, the final dataset consists of 8,434 relevant tweets.

4.2. Sentiment analysis

After identifying relevant tweets, the next step is to apply the sentiment analysis. The sentiment expressed in people’s tweets can be
used as an indicator of the street condition. The overall sentiment is negative (63.4% negative, 28.1% neutral, and 8.5% positive).
Additionally, to provide additional information, the dataset is categorized into four categories: Street, Subway, Homeless,27 and
Parking, based on the topics discussed in the tweets. Table 4 shows that the majority of the negative tweets are related to the street. As
expected, most tweets are negative as people are more likely to vent their frustrations on social media channels. There are, however, a
few positive tweets posted as well.28 The impact of an imbalanced number of positive and negative tweets on the municipalities’
operations is limited since one of the responsibilities of municipalities is to provide adequate, efficient, and effective public services to
meet the needs of citizens. By addressing citizens’ complaints and frustrations, municipalities can better manage the resources in
needed areas, provide targeted services, and involve citizens in the public issues. Considering the negative tweets are relatively stable
compared to the positive tweets (e.g., more negative tweets than positive tweets), the measurements of municipalities’ performance
could be based on the percentage change of the negative sentiment.
A dashboard is utilized to display the findings. As depicted in Fig. 2, most of the tweets are negative and are related to the street.
However, specific events could impact this result, as the time series analysis shows that people tweet more on certain days. As noted in
the dashboard, the public opinion about the overall NYC street cleanliness is negative, different from the official Scorecard ratings (see
Appendix A). This information informs a different perspective on a public issue.

27
NYC has a large number of homeless people residing on the streets and in subway stations, causing sanitation problems.
28
Examples of positive tweets: “My block smells like straight up flowers I love this,” “The 1 train has never seen such gorgeousness!” “Goodbye
litter, Graffiti, Gum, Stickers, and Debris… you are no match for our neighborhood volunteers!”.

10
H.K. Duan et al. International Journal of Accounting Information Systems 48 (2023) 100600

Fig. 2. Street Cleanliness Dashboard.

5. Framework extension

The tweets were collected based on NYC’s longitude and latitude, not at a granular level (e.g., at street level), due to the limitation
of the Twitter API used. To complement this limitation and evaluate the approach to analyzing social media information, another
social media platform (Facebook) is selected for testing. Another purpose of this extension is to explore the potential usage of Facebook
data in evaluating NYC street cleanliness. Due to Facebook’s privacy restriction on personal data, community posts, which are public
data, were selected.29 Eighteen Facebook pages related to various NYC communities were selected, and over 20 thousand posts were
collected. A summary of the distribution of the posts over each community is in Table 5.
The same classifier, XGBoost-oversampling, is applied to the dataset and predicts that 3,707 records are classified as irrelevant; 401
as relevant. However, the incremental value of the 401 records is limited to the research topic as the posts are mostly generic an
nouncements of particular events or activities related to the street condition, parking, or homelessness rather than personal opinions.
Therefore, all 401 posts are evaluated individually, and 110 are identified as truly relevant. The overall sentiment of the Facebook
dataset is 87% neutral, 8% positive, and 5% negative. The findings are displayed in Fig. 3.
Furthermore, locational information is examined from the negative and positive posts (see Fig. 4). The Bronx has the most negative
posts, mainly due to street cleaning and parking issues. Soho has the most positive posts primarily due to a special event that promotes
clean-up of the area.
Based on this supplemental study, we conclude that the incremental value of Twitter and Facebook is different, at least for this
research topic. They contain different types of data, even though they are both social media information. Therefore, the approach
should be different when analyzing these two types of information. Tweets are generally short and precise, and the content is more
reflective of personal opinions. Facebook posts could contain more extended content; the content from the community pages is mostly
announcements related to particular events. It is essential to identify the appropriate data source for the research subject. In this case,

29
Public community pages are another way for people to express their opinions on Facebook besides posting on an individual’s account. Local
citizens who are actively involved in local community work are more aware of the functions of Facebook community pages, and they are motivated
to express and communicate their concerns with the public. In this case, Facebook community pages are identified as another data source to explore.

11
H.K. Duan et al. International Journal of Accounting Information Systems 48 (2023) 100600

Table 5
Facebook Communities.
# Facebook Community Number of Posts Time Period

1 Battery Park City Parks (@BatteryParkCityParks) 1673 2014-03-25 to 2019-09-18

2 CHELSEA.NOW.NEWSPAPER (@CHELSEA.NOW.NEWSPAPER) 2509 2012-06-15 to 2019-01-02
3 The Battery Conservancy (@InwoodNYC) 942 2011-06-28 to 2019-09-09
4 EastMidtown Business District - Midtown(@EastMidtownPartnership) 2321 2010-09-10 to 2019-09-18
5 34thStNYC Midtown(@34thStNYC) 1458 2013-03-18 to 2019-09-18
6 timessquarenyc - Midtown(@timessquarenyc) 3954 2008-12-12 to 2019-09-19
7 newyorktimessquare1 – Midtown (@newyorktimessquare1) 1784 2014-11-18 to 2019-09-19
8 Murray Hill (@MHNANYC) 553 2013-05-14 to 2019-09-17
9 Gramercy (@GNAnyc) 1365 2010-09-23 to 2019-09-19
10 Stuyvesant Town (@stuydems) 450 2009-08-19 to 2019-09-20
11 Stuyvesant Town − 2 - Manhattan (@townofstuyvesant) 406 2015-04-18 to 2019-09-20
12 Upper East side (@ILoveTheUpperEastSide) 99 2014-05-30 to 2019-09-06
13 West Harlem 1 (@westharlemcpo) 264 2012-03-19 to 2019-11-09
14 West Harlem 2 (@WestHarlemDems) 69 2017-09-11 to 2019-05-15
15 Central Harlem (@harlemparktopark) 2571 2008-03-21 to 2019-09-20
16 East Harlem (@EastHarlem360) 251 2017-05-02 to 2019-09-17
17 East Harlem (@EastHarlemAlliance) 318 2014-10-15 to 2019-02-27
18 Washington Heights and Inwood 1354 2013-09-09 to 2019-09-20

Fig. 3. Facebook Street Cleanliness Dashboard.

Twitter data indicates more valuable information than Facebook data. One of the reasons could be that Facebook data is collected from
individual community pages rather than individual users’ webpage; people might tend to post an event announcement or activity
promotion to the community pages rather than complain about the dirty streets. Therefore, it would be interesting to analyze in
dividuals’ Facebook posts if the information is available and compare the two data sources for future research. On the other hand, for a

12
H.K. Duan et al. International Journal of Accounting Information Systems 48 (2023) 100600

Fig. 4. Facebook Posts Distribution.

government to utilize this social media platform and collect crowdsourcing opinions about street cleanliness, a Facebook page could be
created for this public interest.

6. Conclusion

6.1. Summary

This study demonstrates how to bring an innovative data source to the government information system and utilize social media
information to support government managerial decision-making. Text mining techniques and machine learning algorithms analyze
social media information. These social media data sources can develop an alternative performance measure for NYC street cleanliness.
Specifically, this paper applies text mining techniques and supervised machine learning algorithms to analyze relevant tweets from a
large volume of tweets systematically, and it also examines Facebook community posts. Three different algorithms are evaluated: NB,
RF, and XGBoost. The results indicate that RF and XGBoost algorithms provide the best accuracy regarding data relevancy for both
Twitter and Facebook datasets. However, the dataset faces an imbalance issue as most data are classified as irrelevant, resulting in a
skewed classification. Random sampling methods are used to resolve this problem, and the testing results indicate that XGBoost-
oversampling provides better performance than other models. This study uses the VADER sentiment analysis tool to assess the
sentiment expressed in the tweets, referred to as a gold standard to analyze social media information (Hutto and Gilbert, 2014). The
testing results indicate that the overall sentiment trend over the examined period is negative, and most negative sentiment is related to
street cleanliness. These findings are different from the official Scorecard ratings, but they align with the increasing trend of NYC311
complaints. This demonstrates the need for alternative information sources for use in deciding how and where to apply limited
resources.
The methodology presented in this study could be used in a system of real-time reporting and associated with accounting for
municipal expenses and resources used. Incorporating social media information into the government’s operational evaluation process
provides a different perspective of a public issue; it allows the authorities to comprehensively assess the problem and determine
adequate action plans. Performance measurement is widely used in budgeting and management; many state and local governments
base their budget decisions on the efficiency and effectiveness of service delivery (Kelly and Rivenbark, 2014). As a result, performance
measurement becomes a significant factor in deciding the actual budget of governments (Melkers and Willoughby, 2005; Woolum,
2011). Associating government resources and expenses with improvements as reflected in declining negative (and improving positive)
social media reporting could allow government agencies to be more responsive. It is also a way of involving citizens in the mea
surement of public services, assisting in better targeting services, and improving communication by understanding citizens’ per
spectives on public issues, all of which would lend credibility to the performance measurement.
Additionally, many organizations, including government entities, adopt the Balanced Scorecard (BSC) methodologies to improve
their strategic planning and management system (Chan, 2004; Erawan, 2020; Farneti, 2009; Griffiths, 2003; Hoque and Adams, 2011;
Lang, 2004). It is viewed as the vehicle to increase the government’s performance and public accountability (Lang, 2004). The concept
of BSC was introduced in the early 1990 s; it consists of four perspectives measuring the organization’s performance: customer,
financial, internal business, and innovation and learning perspectives (Kaplan and Norton, 1992). Specifically, the customer per
spectives refer to the question: “how do the customers see us?” (Kaplan and Norton, 1992). In this sense, utilizing social media infor
mation to support the government’s performance measures and decision-making is also a way of assessing customers’ perspectives on a
public issue and assisting the municipalities in establishing a well-balanced BSC system for their future strategic planning and
management.
As such, this research demonstrates the need to improve government accounting information systems to utilize this and other new
information sources. Without considering various alternative information sources, government decisions may be made that are not in
the best interests of the public. For instance, at the beginning of the pandemic, the mayor of NYC cut the Sanitation Department’s
budget by more than $100 million, resulting in reduced corner trash basket services, curbside compost services, and street-cleaning
frequency (Arschin, 2022). Ultimately, this budget cut led to a pileup of trash on the streets, overflowing corner baskets, and litter
in the streets, which raised concerns from lawmakers (Arschin, 2022).
Current government service reporting provides limited accountability as many fundamental issues, such as street cleanliness and
infrastructure monitoring, are not reported or accounted (Bora et al., 2021). Public service reporting is a critical aspect of the three-

13
H.K. Duan et al. International Journal of Accounting Information Systems 48 (2023) 100600

Fig. 5. Sources of Data (Adapted from Bora et al., 2021).

dimensional government reporting components, see Fig. 5. Both exogenous and endogenous data can provide a wide range of
measuring attributes in a dynamic government reporting schema (Bora et al., 2021). Integrating exogenous social media data into
service processes supports the government’s reporting move towards modern accounting and operational assurance services.
To further utilize the results from this study, a continuous monitoring dashboard for street cleanliness can be built to create a
dynamic and interactive communication channel between authorities and citizens. The dashboard can include all the available data
sources, such as official rating results, social media sentiments, NYC311, etc., resource deployment (e.g., needed personnel and
required supplies), and status after service. The monitoring dashboard increases social awareness and the transparency of government
operations. It will enable the authorities to timely address the problem, better deploy the resources, effectively manage their opera
tions, and improve the quality of public services.

6.2. Limitations and future work

This study has several limitations. The No Free Lunch theorem states that “there are no single learning algorithms that in any
domain always induces the most accurate learner” (Alpaydin, 2014). Many other approaches could be explored and examined to
improve the solutions to the imbalanced classification. Due to the limitations of the Twitter API used, the data is collected based on
NYC’s longitude and latitude; it is not based on detailed street-level. However, the tweets at a granular level can be obtained through
different types of Twitter API. On the other hand, the government has the privilege to obtain the data at a level that other parties cannot
(Brown-Liburd et al., 2019). Authorities can use this study as a pilot test and extend the study to location-based analysis. Additionally,
the analysis was extended to another social media platform, Facebook. Data from Facebook community pages were utilized due to
Facebook’s privacy restriction on personal data. It would be interesting to compare the usefulness and informativeness between
Twitter and Facebook’s personal posts if the information is available, and potentially the official can adopt both social media infor
mation to support their decision making. Other social media information can also be explored. Moreover, the sentiment examined in
this paper was only measured in three types: negative, positive, and neutral. A more advanced semantic analysis of tweets could be
potentially studied, including real-time images of problematic posts.
The digital transformation in the current data environment forces organizations to adopt new ways of running the business,
conducting tasks, and reengineering operations. These changes impose unprecedented opportunities and challenges to government
entities, business operations, auditors, regulators, and other stakeholders. This study presents an innovative approach to enhance
government decision-making and brings a new data source to the government information systems.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to
influence the work reported in this paper.

Acknowledgments

We are thankful for the helpful comments received from Daniel O’Leary, Helen Brown-Liburd, Aleksandr Kogan, Deniz Appelbaum,
Lawrence Gordon, and everyone from Rutgers, The State University of New Jersey – Continuous Auditing & Reporting Lab (CAR Lab).
Special thanks to the editors and two anonymous reviewers from the journal; thank you for your valuable comments on the publication
of this paper.

14
H.K. Duan et al. International Journal of Accounting Information Systems 48 (2023) 100600

This paper was presented at the 2019 American Accounting Association (AAA) Annual Research Workshop on Strategic and
Emerging Technologies, the 2020 AAA AIS, SET and International Sections Joint Midyear Meeting in Orlando, FL, the 2020 Durham
Rutgers Accounting Analytics Network Research Webinar, the 2020 AAA Annual Meeting, the 12th Annual Pre-ICIS Workshop on
Accounting Information System in 2020. The authors are thankful for all the comments received from the conference reviewers and
participants.

Appendix

Appendix A. NYC Scorecard inspection program

According to the prospectus issued by the New York City Mayor’s Office (NYC Mayor’s Office of Operations, 1973):
“The New York City Mayor’s Office runs a Scorecard Cleanliness Program to measure the cleanliness of NYC streets and sidewalks.
The information is used for the Department of Sanitation used for policy development, planning, and evaluation of citywide opera
tions; the Mayor’s Office for tracking and monitoring the City’s cleanliness over time; Community Boards and other public interest
groups to learn about cleanliness conditions in local neighborhoods; Business Improvement Districts to evaluate the conditions of
neighborhood shopping and central business districts. The measurements are based on rigorous photographic standards of cleanliness
for streets and sidewalks. The ratings are based on a seven-point scale of cleanliness: 1 to be the cleanest, 3 is the dirtiest, and five
intermediate ratings, ratings below 1.5 are considered acceptably clean. The inspections are conducted either before or after the
Department of Sanitation street cleaning and are continuously monitored to detect potentially biased ratings. The overall trend of the
cleanliness rating for each district is analyzed, including month to month, year to year, district to district comparison.”
Below is the Scorecard Rating Scale, adopted from the Audit Report issued by the Office of the New York State Comptroller (Office
of the New York State Comptroller, 2020).

As indicated in the below inspection reports30 (August 2018 and April 2019), the percent of acceptably clean streets in NYC’s
neighborhood is rated above 94%.

30
Updated inspection report is available at: https://www1.nyc.gov/site/operations/performance/scorecard-street-sidewalk-cleanliness-ratings.
page.

15
H.K. Duan et al. International Journal of Accounting Information Systems 48 (2023) 100600

Appendix B. Workflow

The general workflow of this study is illustrated here:

Appendix C. Major Python libraries used in this study

Libraries General Explanation In this study

Tweepy Tweepy is Twitter for Python; it enables the user to access the Twitter API Access the Twitter API.
to download the tweets. For more information, please refer to:
https://tweepy.readthedocs.io/en/latest/getting_started.
html#introduction

StreamListener Automatically sent a result to a designated channel.For more information, Stream real-time tweets and store the tweets in a designated
please refer to location.
https://docs.spring.io/spring-cloud-stream/docs/1.0.3.RELEASE/api/
org/springframework/cloud/stream/annotation/StreamListener.html

Panda library Python Data Analysis Library (Panda) is a Python package to work with This study utilized this library in many ways, such as reading the
structured and time-series data; it permits users to analyze and manipulate CSV/excel files, assessing the dataset’s structure, preprocessing
data in any language. It can reshape, slice, index, subset, group by, merge, the dataset, etc.
join data. For more information, please refer to https://pandas.pydata.
org/

re Regular expression operations are used to identify special characters or This study used regular expression operations to parse and
strings. For more information, please refer to: analyze the special characters used in the tweets.
https://www.datacamp.com/community/tutorials/python-regular-
expression-tutorial

Scikit-learn Scikit-learn offers different packages, such as classification, regression, This study mainly used Scikit-learn to set up the classification
clustering, etc. models.
For more information, please refer to:
https://scikit-learn.org/stable/

NLTK Natural Language Toolkit (NLTK) contains different packages to analyze This study uses the NLTK library to preprocess the tweets (e.g.,
the contents. For more information, please refer to: perform tokenization and lemmatization, remove stopwords,
https://www.nltk.org/ etc.)

16
H.K. Duan et al. International Journal of Accounting Information Systems 48 (2023) 100600

Appendix D. Sample of the tweets

Appendix E. Sample of the keyword list

parking parkin parkins sidewalks sidewalk dirty trash

trashed trashes trashing trashy filthy filth garbage

overflow overflowing overflowed overflows disgust disgusted disgusting
disgusts stink stinkers stinking stinks stinky foul stench
stench odor foul smell smells smelled smelley*
urine homeless clean cleaned cleaner cleanest cleaning
cleanliness cleanly cleanness cleans cleanse cleansed cleanser
cleansers cleansing cleanup cleanups rat rats rat’s
rodents roach roaches roache* pigeon pigeon’s pigeons
crap mice mouse stray strayed strayer strays
straying cat stray dog homeless dog mutt mutts dirt
dirt’s street streets street’s streett* streetwise feces
excrement blood blooded bloodied bloodier bloodest bloodless
bloods blood’s

*typos used in the tweets.

References

Alom, Z., Carminati, B., Ferrari, E., 2018. Detecting Spam Accounts on Twitter. In: in: 2018 IEEE/ACM international Conference on Advances in Social Networks
Analysis and Mining (ASoNAM). IEEE, pp. 1191–1198. https://doi.org/10.1109/ASONAM.2018.8508495.
Alpaydin, E., 2014. Introduction to Machine Learning, 3rd ed. MIT Press.
Arschin, D., 2022. Trash is Piling Up on NYC Streets, Lawmakers Say [WWW Document]. FOX 5 New York. URL https://www.fox5ny.com/news/too-much-trash-
piling-up-on-nyc-streets-lawmakers-say (accessed 2.3.22).
Asur, S., Huberman, B.A., 2010. Predicting the Future with Social Media, in: 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent
Technology. IEEE, pp. 492–499. 10.1109/WI-IAT.2010.63.
Awwalu, J., Bakar, A.A., Yaakub, M.R., 2019. Hybrid N-gram model using Naïve Bayes for classification of political sentiments on Twitter. Neural Comput. Appl. 31,
9207–9220. https://doi.org/10.1007/s00521-019-04248-z.
Bazzaz Abkenar, S., Mahdipour, E., Jameii, S.M., Haghi Kashani, M., 2021. A hybrid classification method for twitter spam detection based on differential evolution
and random forest. Concurrency Comput.: Pract. Experience 33, e6381.
Berner, M., Smith, S., 2004. The state of the states: a review of state requirements for citizen participation in the local government budget process. State Local Govern.
Rev. 36, 140–150. https://doi.org/10.1177/0160323x0403600205.
Blei, D.M., Ng, A.Y., Jordan, M.I., 2003. Latent Dirichlet Allocation. J. Mach. Learn. Res. 3, 993–1022. https://doi.org/10.1162/jmlr.2003.3.4-5.993.
Bollen, J., Mao, H., Zeng, X., 2011. Twitter mood predicts the stock market. J. Comput. Sci. 2, 1–8. https://doi.org/10.1016/j.jocs.2010.12.007.
Bonsón, E., Ratkai, M., 2013. A set of metrics to assess stakeholder engagement and social legitimacy on a corporate facebook page. Online Inform. Rev. 37, 787–803.
https://doi.org/10.1108/OIR-03-2012-0054.
Bonzanini, M., 2016. Mastering Social Media Mining with Python. Packt Publishing Ltd.
Bora, I., Dai, J., Duan, H.K., Vasarhelyi, M.A., Zhang, A., 2021. The transformation of government accountability and reporting. J. Emerg. Technol. Account. 18, 1–21.
Borg, A., Boldt, M., 2020. Using VADER sentiment and SVM for predicting customer response sentiment. Expert Syst. Appl. 162, 113746 https://doi.org/10.1016/j.
eswa.2020.113746.
Brown-Liburd, H., Cheong, A., Vasarhelyi, M.A., Wang, X., 2019. Measuring with Exogenous Data (MED), and Government Economic Monitoring (GEM). J. Emerg.
Technol. Account. 16, 1–19. https://doi.org/10.2308/jeta-10682.
Buchheit, S., Parsons, L.M., 2006. An experimental investigation of accounting information’s influence on the individual giving process. J. Account. Public Policy 25,
666–686. https://doi.org/10.1016/j.jaccpubpol.2006.09.002.
Buhl, H.U., Röglinger, M., Moser, D.-K.-F., Heidemann, J., 2013. Big data. Bus. Inform. Syst. Eng. 5, 65–69. https://doi.org/10.1007/978-981-13-3384-2_9.
Burgoon, J., Mayew, W.J., Giboney, J.S., Elkins, A.C., Moffitt, K., Dorn, B., Byrd, M., Spitzley, L., 2016. Which spoken language markers identify deception in high-
stakes settings? Evidence from earnings conference calls. J. Lang. Social Psychol. 35, 123–157. https://doi.org/10.1177/0261927X15586792.
Burton, S., Soboleva, A., 2011. Interactive or reactive? Marketing with Twitter. J. Consumer Market. 28, 491–499. https://doi.org/10.1108/07363761111181473.
Callahan, K., Holzer, M., 1999. Results-Oriented Government: Citizen Involvement in Performance Measurement. Performance & Quality Measurement in
Government. Issues and Experiences, pp. 51–64.
Cameron, M.P., Barrett, P., 2016. Can social media predict election results? Evidence from New Zealand. J. Polit. Market. 15, 416–432.
Cecchini, M., Aytug, H., Koehler, G.J., Pathak, P., 2010. Making words work-using financial text as a predictor of financial events. Decis. Support Syst. 50, 164–175.
https://doi.org/10.1016/j.dss.2010.07.012.

17
H.K. Duan et al. International Journal of Accounting Information Systems 48 (2023) 100600

Chakraborty, V., Vasarhelyi, M., 2017. A hybrid method for taxonomy creation. Int. J. Digital Account. Res. 17, 33–65. https://doi.org/10.4192/1577-8517-v17_2.
Chan, Y.-C.-L., 2004. Performance measurement and adoption of balanced scorecards: a survey of municipal governments in the USA and Canada. Int. J. Public Sector
Manage. 17, 204–221. https://doi.org/10.1108/09513550410530144.
Coulter, K., Roggeveen, A., 2012. “Like It Or Not”: consumer responses to word-of-mouth communication in on-line social networks. Manage. Res. Rev. 35, 878–899.
Culotta, A., 2013. Lightweight methods to estimate influenza rates and alcohol sales volume from twitter messages. Lang. Resour. Eval. 47, 217–238. https://doi.org/
10.1007/s10579-012-9185-0.
Culotta, A., 2010. Towards Detecting Influenza Epidemics by Analyzing Twitter Messages, in: Proceedings of the First Workshop on Social Media Analytics. pp.
115–122.
Dhaoui, C., Webster, C.M., Tan, L.P., 2017. Social media sentiment analysis: lexicon versus machine learning. J. Consum. Market. 34, 480–488. https://doi.org/
10.1108/JCM-03-2017-2141.
Dutil, P., 2015. Crowdsourcing as a new instrument in the Government’s Arsenal: explorations and considerations. Canad. Public Admin. 58, 363–383. https://doi.
org/10.1111/capa.12134.
Dzuranin, A.C., Mălăescu, I., 2016. The current state and future direction of IT audit: challenges and opportunities. J. Inform. Syst. 30, 7–20. https://doi.org/10.2308/
isys-51315.
Ebdon, C., Franklin, A., 2004. Searching for a role for citizens in the budget process. Public Budget. Finance 24, 32–49. https://doi.org/10.1111/j.0275-
1100.2004.02401002.x.
Elbagir, S., Yang, J., 2019. Twitter Sentiment Analysis Using Natural Language Toolkit and VADER Sentiment, in: Proceedings of the International Multiconference of
Engineers and Computer Scientists. p. 16.
Erawan, I.G.A., 2020. Implementation of balanced scorecard in Indonesian government institutions: a systematic literature review. J. Public Admin. Stud. 4, 64–71.
Farneti, F., 2009. Balanced scorecard implementation in an Italian Local Government Organization. Public Money Manage. 29, 313–320. https://doi.org/10.1080/
09540960903205964.
Griffiths, J., 2003. Balanced scorecard use in New Zealand Government Departments and Crown Entities. Aust. J. Public Admin. 62, 70–79. https://doi.org/10.1111/
j.2003.00350.x.
Guo, J.-W., Radloff, C.L., Wawrzynski, S.E., Cloyes, K.G., 2020. Mining Twitter to explore the emergence of COVID-19 symptoms. Public Health Nurs. 37, 934–940.
https://doi.org/10.1111/phn.12809.Mining.
Haythornthwaite, C., 1996. Social network analysis: an approach and technique for the study of information exchange. Lib. Inform. Sci. Res. 18, 323–342.
Ho, A.-T.-K., Ni, A.Y., 2005. Have cities shifted to outcome-oriented performance reporting?—A content analysis of city budgets. Public Budget. Finance 25, 61–83.
Holton, C., 2009. Identifying disgruntled employee systems fraud risk through text mining: a simple solution for a multi-billion dollar problem. Decis. Support Syst.
46, 853–864. https://doi.org/10.1016/j.dss.2008.11.013.
Hoque, Z., Adams, C., 2011. The rise and use of balanced scorecard measures in Australian government departments. Finan. Account. Manage. 27, 308–334. https://
doi.org/10.1111/j.1468-0408.2011.00527.x.
Hughes, A.L., Palen, L., 2009. Twitter adoption and use in mass convergence and emergency events. Int. J. Emergency Manage. 6, 248–260.
Hutto, C.J., Gilbert, E., 2014. VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text, in: Proceedings of the International AAAI
Conference on Web and Social Media. pp. 216–225.
Jahanbin, K., Rahmanian, V., 2020. Using Twitter and Web News Mining to Predict COVID-19 Outbreak. Asian Pacific J. Trop. Med. 13, 378. https://doi.org/
10.4103/1995-7645.279651.
Jansen, B., Zhang, M., Sobel, K., Chowdury, A., 2009. Twitter power: tweets as electronic word of mouth. J. Am. Soc. Inform. Sci. Technol. 60, 2169–2188. https://
doi.org/10.1002/asi.
Justice, J.B., Melitski, J., Smith, D.L., 2006. E-government as an instrument of fiscal accountability and responsiveness: do the best practitioners employ the best
practices? Am. Rev. Public Admin. 36, 301–322. https://doi.org/10.1177/0275074005283797.
Kaplan, R.S., Norton, D.P., 1992. The balanced scorecard: measures that drive performance. Harvard Bus. Rev. 70, 71–79.
Kelly, J.M., Rivenbark, W.C., 2014. Performance Budgeting for State and Local Government. Routledge.
Kitchin, R., 2014. The real-time city? Big data and smart urbanism. GeoJournal 79, 1–14. https://doi.org/10.1007/s10708-013-9516-8.
Lang, S.S., 2004. Balanced scorecard and government entities. CPA J. 74, 48.
Larcker, D.F., Zakolyukina, A.A., 2012. Detecting deceptive discussions in conference calls. J. Account. Res. 50, 495–540. https://doi.org/10.1111/j.1475-
679X.2012.00450.x.
Lassen, N.B., Madsen, R., Vatrapu, R., 2014. Predicting iPhone Sales from iPhone Tweets, in: 2014 IEEE 18th International Enterprise Distributed Object Computing
Conference. IEEE, pp. 81–90.
Li, F., 2008. Annual report readability, current earnings, and earnings persistence. J. Account. Econ. 45, 221–247. https://doi.org/10.1016/j.jacceco.2008.02.003.
Li, F., 2010. The information content of forward-looking statements in corporate filings—a Naïve Bayesian machine learning approach. J. Account. Res. 48,
1049–1102. https://doi.org/10.1111/j.1475-679X.2010.00382.x.
Li, W., Bhushan, B., Gao, J., 2018. A Mutilple-Level Assessment System for Smart City Street Cleanliness. SEKE 256–255.
Linders, D., 2012. From E-government to we-government: defining a typology for citizen coproduction in the age of social media. Govern. Inform. Quart. 29, 446–454.
https://doi.org/10.1016/j.giq.2012.06.003.
Liu, Y., Moffitt, K.C., 2016. Text mining to uncover the intensity of SEC comment letters and its association with the probability of 10-K restatement. J. Emerg.
Technol. Account. 13, 85–94. https://doi.org/10.2308/jeta-51438.
Loughran, T., McDonald, B., 2011. When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. J. Finance 66, 35–65. https://doi.org/10.1111/j.1540-
6261.2010.01625.x.
Mandel, B., Culotta, A., Boulahanis, J., Stark, D., Lewis, B., Rodrigue, J., 2012. A Demographic Analysis of Online Sentiment During Hurricane Irene, in: Proceedings
of the Second Workshop on Language in Social Media. pp. 27–36.
McCord, M., Chuah, M., 2011. Spam Detection on Twitter Using Traditional Classifiers. In: International Conference on Autonomic and Trusted Computing 175–186.
Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23496-5_13.
Melkers, J., Willoughby, K., 2005. Models of performance-measurement use in local governments: understanding budgeting, communication, and lasting effects.
Public Admin. Rev. 65, 180–190.
Mossberger, K., Wu, Y., Crawford, J., 2013. Connecting citizens and local governments? Social media and interactivity in major U.S cities. Govern. Inform. Quart. 30,
351–358. https://doi.org/10.1016/j.giq.2013.05.016.
Nemes, L., Kiss, A., 2021. Prediction of stock values changes using sentiment analysis of stock news headlines. J. Inform. Telecommun. 5, 375–394. https://doi.org/
10.1080/24751839.2021.1874252.
NYC Mayor’s Office of Operations, 1973. Evaluating Municipal Services: Scorecard Cleanliness Program Prospectus.
O’Leary, D.E., 2011. Blog mining-review and extensions: “from each according to his opinion”. Decis. Support Syst. 51, 821–830. https://doi.org/10.1016/j.
dss.2011.01.016.
O’Leary, D.E., 2012. Computer-based political action: the battle and internet blackout over PIPA. Computer 45, 64–72. https://doi.org/10.1109/MC.2012.186.
O’Leary, D.E., 2013. Exploiting Big Data from Mobil Device Sensor-Based Apps: Challenges and Benefits. MIS Quarterly Executive 12.
O’Leary, D.E., 2015a. Crowdsourcing tags in accounting and finance: review, analysis, and emerging issues. J. Emerg. Technol. Account. 12, 93–115. https://doi.org/
10.2308/jeta-51195.
O’Leary, D.E., 2015b. Twitter mining for discovery, prediction and causality: applications and methodologies. Intel. Syst. Account. Finance Manage. 22, 227–247.
https://doi.org/10.1002/isaf.1376.
O’Leary, D.E., 2016a. KPMG knowledge management and the next phase: using enterprise social media. J. Emerging Technol. Account. 13, 215–230. https://doi.org/
10.2308/jeta-51600.

18
H.K. Duan et al. International Journal of Accounting Information Systems 48 (2023) 100600

O’Leary, D.E., 2016b. On the relationship between number of votes and sentiment in crowdsourcing ideas and comments for innovation: a case study of Canada’s
digital compass. Decis. Support Syst. 88, 28–37. https://doi.org/10.1016/j.dss.2016.05.006.
O’Leary, D.E., 2019a. Facilitating citizens’ voice and process reengineering using a cloud-based mobile app. J. Inform. Syst. 33, 137–162. https://doi.org/10.2308/
isys-52244.
O’Leary, D.E., 2019b. Enterprise Crowdsourcing Innovation in the Big 4 Consulting Firms. J. Emerg. Technol. Account. 16, 99–118.
OECD, 2017. Embracing Innovation in Government: Global Trends.
Office of the New York State Comptroller, 2020. New York City Department of Sanitation New York City Mayor’s Office of Operations Street and Sidewalk Cleanliness
Division of State Government Accountability.
Oh, C., Sheng, O.R.L., 2011. Investigating Predictive Power of Stock Micro Blog Sentiment in Forecasting Future Stock Directional Prices Movement, in: Proceedings of
the International Conference on Information Systems (ICIS).
Pak, A., Paroubek, P., 2010. Twitter as a Corpus for Sentiment Analysis and Opinion Mining, in: Proceedings of the Seventh International Conference on Language
Resources and Evaluation (LREC’10).
Pano, T., Kashef, R., 2020. A complete VADER-based sentiment analysis of Bitcoin (BTC) Tweets during the ERA of COVID-19. Big Data Cogn. Comput. 4, 33. https://
doi.org/10.3390/bdcc4040033.
Prokofieva, M., 2015. Twitter-based dissemination of corporate disclosure and the intervening effects of firms’ visibility: evidence from Australian-listed companies.
J. Inform. Syst. 29, 107–136. https://doi.org/10.2308/isys-50994.
Reck, J.L., 2001. The usefulness of financial and nonfinancial performance information in resource allocation decisions. J. Account. Public Policy 20, 45–71. https://
doi.org/10.1016/S0278-4254(01)00018-7.
Reed, S.A., 1986. The impact of nonmonetary performance measures upon budgetary decision making in the public sector. J. Account. Public Policy 5, 111–140.
https://doi.org/10.1016/0278-4254(86)90018-9.
Risius, M., Akolk, F., Beck, R., 2015. Differential Emotions and the Stock Market - The Case of Company-specific Trading. ECIS 2015 Completed Research Papers 147.
Rivenbark, W., Kelly, J., 2006. Performance budgeting in municipal government. Public Perform. Manage. Rev. 30, 35–46. https://doi.org/10.2753/pmr1530-
9576300102.
Robbins, M.D., Simonsen, B., Feldman, B., 2008. Citizens and resource allocation: improving decision making with interactive web-based citizen participation. Public
Admin. Rev. 68, 564–575. https://doi.org/10.1111/j.1540-6210.2008.00891.x.
Rozario, A., Vasarhelyi, M.A., Wang, D., 2022. On the use of consumer tweets to assess the risk of misstated revenue in consumer-facing industries: evidence from
analytical procedures. Auditing: J. Practice Theory.
Schnebly, J., Sengupta, S., 2019. Random Forest Twitter Bot Classifier, in: 2019 IEEE 9th Annual Computing and Communication Workshop and Conference (CCWC).
IEEE, pp. 0506–0512. 10.1109/CCWC.2019.8666593.
Schumaker, R.P., Zhang, Y., Huang, C.N., Chen, H., 2012. Evaluating sentiment in financial news articles. Decis. Support Syst. 53, 458–464. https://doi.org/10.1016/
j.dss.2012.03.001.
Shi, L., Agarwal, N., Agrawal, A., Garg, R., Spoelstra, J., 2012. Predicting US Primary Elections with Twitter. URL: http://snap. stanford. edu/social2012/papers/shi.
pdf 4.
Singh, J.P., Dwivedi, Y.K., Rana, N.P., Kumar, A., Kapoor, K.K., 2019. Event classification and location prediction from tweets during disasters. Ann. Oper. Res. 283,
737–757. https://doi.org/10.1007/s10479-017-2522-3.
Sul, H.K., Dennis, A.R., Yuan, L.I., 2017. Trading on Twitter: using social media sentiment to predict stock returns. Decis. Sci. 48, 454–488. https://doi.org/10.1111/
deci.12229.
Tan, P.-N., Steinbach, M., Kumar, V., Karpatne, A., 2019. Introduction to Data Mining, 2nd ed. Pearson Education Inc.
Tsakalidis, A., Papadopoulos, S., Cristea, A.I., Kompatsiaris, Y., 2015. Predicting elections for multiple countries using Twitter and Polls. IEEE Intell. Syst. 30, 10–17.
https://doi.org/10.1109/MIS.2015.17.
Tseng, C., Patel, N., Paranjape, H., Lin, T.Y., Teoh, S., 2012. Classifying Twitter data with Naive Bayes Classifier Chris. IEEE International Conference on Granular
Computing 2012, 294–299.
Vasarhelyi, M., Kogan, A., Tuttle, B.M., 2015. Big data in accounting: an overview. Account. Horizons 29, 381–396. https://doi.org/10.2308/acch-51071.
Vieweg, S., Hughes, A.L., Starbird, K., Palen, L., 2010. Microblogging During Two Natural Hazards Events: What Twitter May Contribute to Situational Awareness, in:
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. pp. 1079–1088. 10.1145/1753326.1753486.
Wang, X., 2000. Performance measurement in budgeting: a study of county governments. Public Budget. Finance 20, 102–118. https://doi.org/10.1111/0275-
1100.00022.
Wang, X., 2002. Assessing performance measurement impact: a study of U.S. local governments. Public Perform. Manage. Rev. 26, 26–43. https://doi.org/10.2307/
3381296.
Williams, K., Durrance, J.C., 2008. Social networks and social capital: rethinking theory in community informatics. J. Commun. Inform. 4.
Woolum, J., 2011. Citizen involvement in performance measurement and reporting: a comparative case study from local government. Public Perform. Manage. Rev.
35, 79–102. https://doi.org/10.2753/PMR1530-9576350104.
Yoon, K., Hoogduin, L., Zhang, L., 2015. Big data as complementary audit evidence. Account. Horizons 29, 431–438. https://doi.org/10.2308/acch-51076.
Zeemering, E.S., 2021. Functional Fragmentation in City Hall and Twitter Communication During the COVID-19 Pandemic: evidence from Atlanta, San Francisco, and
Washington. DC. Government Information Quarterly 38, 101539. https://doi.org/10.1016/j.giq.2020.101539.
Zhang, J., Yang, X., Appelbaum, D., 2015. Toward effective big data analysis in continuous auditing. Account. Horizons 29, 469–476. https://doi.org/10.2308/acch-
51070.