SSRN Id3449848 PDF
Abstract
Audit quality has always been the focus of audit research, especially since the passage of the
Sarbanes-Oxley Act in 2002. Much research has been done to measure and predict audit quality,
and the existing predictive models commonly use regression. By contrast, this paper uses various
supervised learning algorithms to predict audit quality, which is proxied by restatements, the best
measure of audit quality that is publicly available (Aobdia, 2015). Using 14,028 firm-year
observations from 2008 to 2016 in the United States and ten different supervised learning
algorithms, the research mainly shows that the Random Forest algorithm can predict audit quality
more accurately than logistic regression, and that audit-related variables are better than financial
variables in predicting audit quality. The results of this paper can provide regulators, investors,
and other stakeholders with a more effective tool than traditional logistic regression to assess and
predict audit quality, thus better protecting the interests of the general public and ensuring the
healthy functioning of the capital market.
Key words:
Audit Quality, Machine Learning Algorithms, Restatements
1. Introduction
Many factors have been identified as affecting audit quality, such as abnormal audit fees (Blankley,
Hurtt, & MacGregor, 2012), auditor industry specialization (Romanus, Maher, & Fleming, 2008),
auditor changes (Romanus et al., 2008), brand name of the auditor (Eshleman & Guo, 2014), and
auditor size (Francis & Yu, 2009). To model and predict audit quality, the mainstream research
uses linear regression (Francis & Yu, 2009) or logistic regression (Francis & Yu, 2009; Lennox et
al., 2014), depending on whether the dependent variable, the proxy for audit quality, is continuous
or discrete. However, if the main purpose is to make predictions, some machine learning
algorithms may perform better than regressions. With all the current knowledge of what can affect
audit quality, machine learning algorithms can be constructed for the use of regulators, investors,
and other stakeholders to assess and predict audit quality more accurately, thus better protecting
the interests of the general public and ensuring the efficient functioning of the capital market.
In the machine learning domain, regression is a subset of supervised learning, in which the
algorithms learn from the available examples with known “labels” (Alpaydin, 2014). Besides
regression, other common supervised learning algorithms are Artificial Neural Networks (ANN),
Decision Tree (DT), Naïve Bayes (NB), and Support Vector Machine (SVM). Supervised
learning algorithms have been very successful in performing prediction tasks such as image and
voice recognition.
This paper addresses four research questions: 1) how accurately can machine learning algorithms
predict audit quality and which algorithms work the best? 2) which variables are the most
predictive of audit quality? 3) which group of variables is more predictive of audit quality, the
audit-related variables or the financial variables? 4) are the predictive abilities of the two groups
of variables complementary or supplementary? Answering the above four research questions can
provide clear guidance to regulators, investors, and other stakeholders on which algorithms
and variables to choose to best predict audit quality. The results of this research show that 1)
compared to regressions, machine learning algorithms, especially Random Forest, perform better
in predicting audit quality; 2) the six most predictive variables are: auditor market share, client’s
total assets, auditor portfolio share, audit fee, auditor size, and the brand name of the auditor; 3)
compared to financial variables, audit-related variables perform better in predicting audit quality;
and 4) the predictive ability of the algorithms is the highest when both financial variables and
audit-related variables are included in the independent variables, indicating that the two groups of
variables complement each other in predicting audit quality.
This research contributes to the audit literature in three aspects: 1) this paper pioneers the
construction of machine learning algorithms to predict audit quality, and provides evidence that
Random Forest is more accurate in predicting audit quality than regressions; 2) this research
identifies the six most predictive variables of audit quality, five of which are audit-related
variables, providing new evidence to previous audit quality research; 3) the results of this paper
provide regulators, investors, and other stakeholders with more powerful tools to assess and predict
audit quality.
This paper is organized as follows: section two goes through the literature of audit quality and
machine learning and comes up with the research questions; section three provides details of the
empirical implementation; section four documents the empirical results; section five provides
some additional analysis; and section six concludes the paper.
2. Literature Review and Research Questions
While there is no consensus on which audit quality measures are the best because each has its own
strengths and weaknesses depending on the research setting (M. DeFond & Zhang, 2014), Aobdia
(2015) finds that restatements and whether the client meets or beats the zero earnings threshold
can better predict Part I Findings, an “accurate measure of audit process quality derived from audit
deficiencies of individual engagements identified during the PCAOB inspections process”, than
others. Compared to the measure of whether the issuer meets or beats the zero earnings threshold
and other proxies such as accrual-based metrics, restatement reflects the actual audit quality
delivered, and is thus relatively strong evidence of poor audit quality (M. DeFond & Zhang, 2014). Moreover,
restatement is a very direct and egregious measure of audit quality (M. L. DeFond & Francis, 2005;
M. DeFond & Zhang, 2014; Romanus et al., 2008) because it indicates that “the auditor
erroneously issued an unqualified opinion on materially misstated financial statements” (M.
DeFond & Zhang, 2014). Besides its directness and egregiousness, its dichotomous value is
unambiguous and convenient for the purpose of making predictions. Therefore, restatement is
chosen as the proxy for audit quality in this paper, in which the focus is on assessing and predicting
audit quality. However, since the SEC examines only one third of public companies’ financial
statements, there may be some “false negatives” (discussed in section three) among the
audit engagements that are not examined. Furthermore, since the restatement reflects the existence
of material misstatements in the financial statements, it cannot capture the subtle audit quality
variation (M. DeFond & Zhang, 2014). Moreover, the instances of restatements are relatively rare
compared to the whole sample, which will result in an imbalanced dataset. To address the data
imbalance and the “false negative” issues, some techniques are deployed in this research
(discussed in section three).
Machine learning is a subset of Artificial Intelligence (AI). At its core, machine learning is
“programming computers to optimize a performance criterion using example data or past
experience” (Alpaydin, 2014). By “learning” from example data or past experience, the algorithms
will automatically extract the hidden knowledge of performing certain tasks that humans cannot
find explicit solutions, such as pattern recognition in images and videos, classifying spam emails
from legitimate ones, and predicting fraudulent behaviors (Alpaydin, 2014). There are generally
three types of machine learning algorithms: supervised learning, unsupervised learning, and semi-
supervised learning. In supervised learning, the algorithms are trained and tested on example or
past data with “labels” (Alpaydin, 2014), for example, whether or not an email is spam, whether
or not a fraudulent behavior has happened, and whether or not a voice recording comes from Bob,
etc. Common supervised learning algorithms are Naïve Bayesian (NB), Bayesian Belief Network
(BBN), Artificial Neural Networks (ANN), Decision Trees (DT), Support Vector Machines
(SVM), Random Tree (RT) and Random Forest (RF) etc. Supervised learning algorithms are
mainly used for classification/prediction tasks, and they have been used to predict economic events
such as frauds and bankruptcy (Cecchini, Aytug, Koehler, & Pathak, 2010; Chen, Huang, & Kuo,
2009; Dimmock & Gerken, 2012). Unlike supervised learning, whose aim is “to learn a
mapping from the input to an output whose correct values are provided by a supervisor”,
unsupervised learning is trained and applied on unlabeled data and it focuses on finding the
regularities/patterns from the input data. One common method in unsupervised learning is
clustering where the aim is to find clusters or grouping of input. For example, in customer
segmentation, customers with similar attributes are clustered in the same group so that different
services can be provided to different customer groups (Alpaydin, 2014). Semi-supervised learning
falls between supervised and unsupervised learning and it is trained on a combination of labeled
and unlabeled data.
In this research, since audit quality is measured by restatement, which has “labels” (i.e., whether
or not the financial statement was restated), supervised learning algorithms should be used.
Previous research on audit quality or restatement uses regressions because the goal is to find causal
relationships between audit quality/restatement and other variables of interest (Aier, Comprix,
Gunlock, & Lee, 2005; Becker et al., 1998; Deis & Giroux, 1992; Eshleman & Guo, 2014; Francis
& Yu, 2009; Ghosh, 2005; Kinney, Palmrose, & Scholz, 2004; Lennox et al., 2014; Plumlee &
Yohn, 2010; Schmidt & Wilkins, 2013). However, if making predictions is the main purpose, many
supervised learning algorithms other than regressions can be utilized. Though these algorithms
have been very successful in predicting economic events such as fraud and bankruptcy, it is not
clear whether they can be used to accurately predict audit quality and which variables should be
included in the algorithm to achieve the best predictive ability. Thus, the first two research
questions this research aims to address are:
RQ1: How accurately can supervised learning algorithms predict audit quality and which
algorithms work the best?
RQ2: What factors are the most predictive of audit quality using supervised learning algorithms?
Restatement is an output-based audit quality proxy which is constrained by the firm’s financial
reporting system and its innate characteristics (M. DeFond & Zhang, 2014). Besides, audit quality
is not independent of the firm’s financial reporting system and its innate characteristics, because
firm “managers are likely to choose the quality of the financial reporting systems in anticipation
of the audit quality they expect the auditor to deliver” and the “auditors are expected to explicitly
consider the quality of the firm’s financial reporting system and its innate characteristics in
selecting clients, and in the audit planning process” (M. DeFond & Zhang, 2014). Thus, to
mitigate bias, this research includes both audit related variables and financial variables in the
independent variables. In a similar study, Dutta, Dutta, & Raahemi (2017) predict restatement
using supervised learning algorithms. However, they do not treat restatement as a proxy for audit
quality. Thus, the last two research questions this research aims to address are:
RQ3: Which group of variables is more predictive of audit quality using supervised learning
algorithms, the audit related variables or the financial variables?
RQ4: Are the predictive abilities of the two groups of variables complementary or supplementary
using supervised learning algorithms?
3. Empirical Implementation
3.1 Data Collection
In this paper, the audit related data and the restatement data come from the Audit Analytics
database and the financial data come from COMPUSTAT. The time period of the sample spans
from 2008 to 2016. This research chooses 2008 as the starting point because it is post-SOX and
post-financial crisis. In this paper, the instances of restatements in 10-Ks due to accounting errors
and fraud are used. The details of how the restated instances are generated for this research are
provided in the Appendix.
Thirty independent variables are collected and calculated based on Table 1. The other six variables
are not included because they are not publicly available. Table 2 lists the thirty independent
variables and Table 3 in the Appendix provides the details of how each variable is calculated. The
shaded variables are the sixteen audit-related variables. The three
accrual variables (TotalNetAccruals, AbnormalAccruals, and AbsAbnAcc) are regarded as both
audit-related and financial variables because they are not only the indicators of audit quality (M.
DeFond & Zhang, 2014), but also the financial indicators of the client. In the further analysis, I
exclude the accrual variables from the audit-related variables and the main results still hold.
Table 4
                                                                          # firm-year observations
Financial data merged with restatement from 2002 to 2016                                   245,299
Less: missing values in DLAF                                                              -145,312
Less: missing values in GC                                                                 -26,104
Less: missing values in FE                                                                    -623
Less: missing values in SALESGROWTH                                                        -14,057
Less: missing values in Zscore                                                             -16,555
Less: missing values in LEV1                                                                   -40
Less: missing values in FREEC                                                                  -50
Less: missing values in AbsAbnAcc                                                             -432
Less: missing values in materialweakness                                                    -2,241
Less: missing values in BM ratio                                                               -73
Financial data merged with restatement from 2002 to 2016 (no missing values)                39,812
Less: non-restatement observations whose SIC and FE never appear in those of
restatement instances                                                                      -21,296
Matched sample from 2002 to 2016                                                            18,516
Less: observations from 2002 to 2007                                                        -4,488
Matched sample from 2008 to 2016                                                            14,028
The descriptive statistics for the whole sample are provided in Table 6. Outliers are kept in the
sample because the supervised learning algorithms used in this research are not sensitive to outliers
(Alpaydin, 2014), and deleting them may even cause loss of information that is useful for efficient
classification. In the additional analysis, the winsorized data are used to perform the same analysis,
and the main results still hold. From the descriptive statistics, about 10% of the observations
disclosed material weakness, most (85.3%) of the auditors have been with the client for at least
three years, about 12% of the sample observations received going concern opinions, and more than
half (62.1%) of the sample observations were audited by Big4 firms.
The Pearson correlation matrix is provided in Table 7. LAF (Log of Audit Fees) is statistically
significantly correlated with most of the financial variables. However, only the correlation between
LAF and LTA (Log of Total Asset) is economically significant (the correlation is 0.9017). This
might be due to the fact that more audit effort is generally expended on larger firms, resulting in
higher audit fees. Although some other audit variables are significantly correlated with some of
the financial variables, the correlation coefficients are small enough to be ignored.
Table 7
Pearson correlation matrix (columns follow the same variable order as the rows)
(1) TotalNetAccruals       1
(2) AbnormalAccruals       0.0077   1
(3) AbsAbnAcc              0.0033   0.8324*  1
(4) SMALL_PROFIT           0.0291*  -0.0074  -0.0118  1
(5) SMALL_INCREASE         0.0035   -0.0031  -0.0048  0.4099*  1
(6) GC                     -0.0369* 0.0278*  0.0642*  -0.1687* -0.0652* 1
(7) PRIOGC                 -0.0394* 0.0292*  0.0591*  -0.1706* -0.0630* 0.6840*  1
(8) Specialist             0.0094   -0.0907* -0.1031* 0.0167*  0.0109   -0.1135* -0.1084* 1
(9) WeightedMarketValue    0.0858*  -0.0050  -0.0111  0.0450*  -0.0012  -0.1075* -0.1234* 0.0592*  1
(10) AuditorPortfolioShare -0.0261* 0.0026   0.0080   -0.0836* -0.0222* 0.2279*  0.2606*  0.0362*  0.0545*  1
(11) AuditorMarketShare    0.0862*  -0.0120  -0.0240* 0.1421*  0.0298*  -0.2698* -0.3005* 0.1109*  0.4210*  -0.3318* 1
(12) FIN                   -0.0001  0.0407*  0.0428*  -0.0150  -0.0063  0.0910*  0.0873*  -0.0020  -0.0077  0.0168*  -0.0115  1
(13) EXANTE                0.1475*  0.0027   -0.0153  -0.0299* -0.0453* 0.0488*  0.0753*  0.0206*  0.0308*  0.0236*  -0.0309* 0.0188*  1
(14) EPSGrowth             0.0417*  0.0174*  0.0126   -0.1443* -0.1447* 0.0170*  0.0007   -0.0105  0.0186*  0.0120   0.0152   0.0021   0.0651*  1
(15) AuditorChange         -0.0173* -0.0065  0.0014   -0.0475* -0.0171* 0.1470*  0.1658*  -0.0442* -0.0652* 0.0888*  -0.1744* 0.0064   0.0046   0.0049   1
(16) Big4                  0.0644*  -0.0153  -0.0314* 0.1470*  0.0368*  -0.3469* -0.3958* 0.1151*  0.2423*  -0.4558* 0.7054*  -0.0277* -0.0304* 0.0148   -0.2404* 1
Note: correlations significant at the 5% level are starred. The shaded areas indicate the audit-related variables and the
financial variables that are statistically significantly correlated with each other.
Table 8
Since there is no consensus on whether it is necessary to deal with the data imbalance issue, this
research trains the algorithms on both the original imbalanced training dataset and the synthetic
balanced training dataset generated by SMOTE, and then tests them on the original imbalanced
testing dataset. In the balanced training dataset generated by SMOTE, the synthetic instance is
created by first taking the vector between the current data point and one of its k nearest neighbors,
and then multiplying this vector by a random number between 0 and 1 (Dutta et al., 2017). The
SMOTE filter in WEKA, a commonly used machine learning software, is used to generate the
synthetic balanced training data. The summary of this balanced training data is listed in Table 9.
Table 9
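The interpolation step described above can be sketched in a few lines. A minimal NumPy illustration of SMOTE's core idea (not the WEKA implementation; the function name and arguments are hypothetical):

```python
import numpy as np

def smote_sample(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority instances: pick a minority point,
    take the vector to one of its k nearest minority neighbors, and add
    a random fraction (between 0 and 1) of that vector to the point."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        x = X_minority[i]
        distances = np.linalg.norm(X_minority - x, axis=1)
        neighbors = np.argsort(distances)[1:k + 1]  # skip the point itself
        neighbor = X_minority[rng.choice(neighbors)]
        gap = rng.random()                          # random number in [0, 1)
        synthetic.append(x + gap * (neighbor - x))
    return np.array(synthetic)
```

Because each synthetic point lies on a segment between two real minority points, the new instances stay inside the region the minority class already occupies.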
When training the algorithms, either audit-related variables, financial variables, or both are
included as independent variables, because the third and fourth research questions ask which group
of variables has better predictive ability and whether the two groups complement each other in
predicting audit quality.
The main supervised learning algorithms used in this paper are: Naïve Bayesian (NB), Bayesian
Belief Network (BBN), Artificial Neural Network (ANN), Support Vector Machine (SVM),
Decision Tree (DT), Random Tree (RT) and Random Forest (RF). Other advanced algorithms such
as Bagging, Stacking and AdaBoost are also used. The algorithm names and their corresponding
choices in WEKA are listed in Table 10, and brief introductions to the major algorithms used
are provided below.
Naïve Bayesian (NB) classifier is based on Bayes’ theorem with the “naïve” assumption that every
pair of features is independent (Zhang, 2004). Given a class variable $y$ and a dependent feature
vector $x_1$ through $x_n$, Bayes’ theorem states the following relationship (Zhang, 2004):

$$P(y \mid x_1, \ldots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \ldots, x_n)}$$

Given the input, $P(x_1, \ldots, x_n)$ is constant. Therefore, the classification rule can be derived as
follows:

$$P(y \mid x_1, \ldots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

$$\hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y)$$
Bayesian networks use this Bayes’ rule for probabilistic inference (Murphy, 1998).
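The argmax rule can be sketched directly in Python. A minimal illustration with hand-set (hypothetical) priors and per-feature likelihoods for two binary "red flag" features:

```python
def naive_bayes_predict(priors, likelihoods, x):
    """Return argmax_y P(y) * prod_i P(x_i | y).
    priors: {class: P(y)}; likelihoods: {class: list of {feature value: P(x_i|y)}}."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for i, xi in enumerate(x):
            score *= likelihoods[c][i].get(xi, 1e-9)  # tiny floor for unseen values
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# hypothetical numbers: a rare "restated" class with two binary indicators
priors = {"restated": 0.1, "clean": 0.9}
likelihoods = {
    "restated": [{1: 0.9, 0: 0.1}, {1: 0.8, 0: 0.2}],
    "clean":    [{1: 0.1, 0: 0.9}, {1: 0.2, 0: 0.8}],
}
label = naive_bayes_predict(priors, likelihoods, (1, 1))  # both indicators raised
```

With both indicators raised, the "restated" score (0.1 × 0.9 × 0.8) beats the "clean" score (0.9 × 0.1 × 0.2) despite the low prior, illustrating how the likelihood product drives the classification.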
Artificial neural network models take their inspiration from the functioning of the brain, and the
backpropagation is used to train the neural network for a variety of applications (Alpaydin, 2014).
When ANN is used for classification, the perceptron is the basic processing element, which
converts the inputs it receives into an output as a function of a weighted sum of the inputs. For
example, $y = \frac{1}{1+\exp(-\mathbf{w}^{\top}\mathbf{x})}$, where $\mathbf{x}$ is a vector of inputs, $\mathbf{w}$ is a vector of weights, and $y$ is the
output. The weights $\mathbf{w}$ need to be “learned” through backpropagation until the errors are minimized.
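A minimal sketch of this processing element and a single backpropagation update (illustrative names; a squared-error loss is assumed):

```python
import math

def sigmoid_unit(w, x):
    """Perceptron output y = 1 / (1 + exp(-w.x))."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-s))

def backprop_step(w, x, target, lr=0.5):
    """One gradient-descent update on the squared error (y - target)^2 / 2."""
    y = sigmoid_unit(w, x)
    delta = (y - target) * y * (1.0 - y)          # error signal at the unit
    return [wi - lr * delta * xi for wi, xi in zip(w, x)]

# repeated updates drive the output toward the target for input (1, 1)
w = [0.0, 0.0]
for _ in range(200):
    w = backprop_step(w, (1.0, 1.0), 1.0)
```

Repeating the update moves the weighted sum in the direction that shrinks the error, which is the core of what backpropagation does across every layer of a full network.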
A support vector machine determines a hyperplane in the feature space that best separates positive
from negative examples, and “a feature space results from mapping the observable attributes to
properties that might better relate to the problem at hand” (Cecchini et al., 2010).
A decision tree is “a hierarchical model for supervised learning whereby the local region is
identified in a sequence of recursive splits in a smaller number of steps”, and it is composed of
internal decision nodes and terminal leaves (Alpaydin, 2014). The goal of using a decision tree is
“to create a model that predicts the value of a target variable by learning simple decision rules
inferred from the data features” (scikit-learn, n.d.).
Random Forest
Random forest builds multiple decision trees and merges them together to get a more accurate and
stable prediction (Donges, 2018). Random forest searches for the best feature among a random
subset of features at each split, and the resulting diversity generally yields a better model
(Donges, 2018).
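As a concrete illustration, a minimal scikit-learn sketch of training a random forest and scoring it by AUC; synthetic data stand in for the paper's sample, so every name and number here is illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic stand-in for the firm-year data: two informative features out of ten
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=2000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, forest.predict_proba(X_te)[:, 1])
```

Each of the 200 trees is grown on a bootstrap sample with random feature subsets at each split, and the averaged vote is what gives the ensemble its stability.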
In classifying instances into restated or non-restated, there are four outcomes: 1) the actual restated
instances are correctly classified as restated (True Positive); 2) the actual restated instances are
wrongly classified as non-restated (False Negative); 3) the actual non-restated instances are
correctly classified as non-restated (True Negative); and 4) the actual non-restated instances are
wrongly classified as restated (False Positive). In this particular context, false negative is more
serious than false positive, so the cost of false negative should be higher than that of false positive.
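One common way to encode this asymmetry is cost-sensitive training. A hedged scikit-learn sketch using class weights, an analogue of WEKA's cost matrix (the data here are synthetic and the 10:1 weighting mirrors the relative cost used in the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# imbalanced synthetic data: roughly 5% positives ("restated"), overlapping classes
rng = np.random.default_rng(0)
n = 4000
y = (rng.random(n) < 0.05).astype(int)
X = rng.normal(size=(n, 1)) + 1.5 * y.reshape(-1, 1)   # modest separation

plain = LogisticRegression().fit(X, y)
costed = LogisticRegression(class_weight={0: 1, 1: 10}).fit(X, y)  # FN costs 10x FP

def recall(model):
    pred = model.predict(X)
    return ((pred == 1) & (y == 1)).sum() / (y == 1).sum()
```

Weighting false negatives 10× shifts the decision boundary toward flagging more instances as restated, trading some false positives for fewer missed restatements.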
Subset feature selection can be used to remove less significant or redundant attributes to help build
parsimonious models (Dutta et al., 2017), and this research compares the performance of the
algorithms with and without subset feature selection. The feature selection can be realized using
the WEKA function AttributeSelectedClassifier. The evaluator chosen is CfsSubsetEval, which
evaluates the worth of a subset of attributes by considering the individual predictive ability of each
feature along with the degree of redundancy between them. The searching method “bi-directional”
is chosen.
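The idea behind CfsSubsetEval, rewarding feature-class correlation while penalizing inter-feature redundancy, can be approximated outside WEKA. A hedged NumPy sketch using the standard CFS merit with a greedy forward search (a simplification of WEKA's bi-directional search; function names are illustrative):

```python
import numpy as np

def cfs_merit(X, y, subset):
    """CFS merit = k * r_cf / sqrt(k + k(k-1) * r_ff): average feature-class
    correlation rewarded, average feature-feature correlation penalized."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    r_ff = 0.0
    if k > 1:
        pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
        r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1]) for a, b in pairs])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def greedy_cfs(X, y):
    """Add features one at a time while the CFS merit keeps improving."""
    remaining, selected, best = list(range(X.shape[1])), [], -np.inf
    while remaining:
        merit, j = max((cfs_merit(X, y, selected + [j]), j) for j in remaining)
        if merit <= best:
            break
        selected.append(j)
        remaining.remove(j)
        best = merit
    return selected
```

The search stops as soon as adding any remaining feature would lower the merit, which is what keeps the selected subset parsimonious.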
After the algorithms are trained on the training data, the trained algorithms will be tested on the
original imbalanced testing data. Figure 2 illustrates the whole procedure from training to testing
performed using WEKA.
Figure 2
As has been discussed above, there might be four outcomes when the trained algorithms classify
instances in the testing data: true positive, false negative, true negative, and false positive. A
confusion matrix (Figure 3) is a matrix that summarizes all the outcomes.
Figure 3
There are several indexes (listed in Table 11) that can be used to evaluate the performance of the
algorithms. Due to the imbalance of the testing data, this research uses the Recall, Specificity, and
the Area Under Curve (AUC) to evaluate the performance of the algorithms, because these indexes
indicate how effectively the algorithms correctly classify each instance into its actual class. The
closer these three indexes are to 1, the better the performance.
Table 11
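The first two indexes can be computed directly from the confusion-matrix counts. A minimal sketch (the counts here are made up purely for illustration):

```python
def recall_specificity(tp, fn, tn, fp):
    """Recall = TP/(TP+FN): share of restated cases caught.
    Specificity = TN/(TN+FP): share of non-restated cases kept clean."""
    return tp / (tp + fn), tn / (tn + fp)

# hypothetical confusion-matrix counts
rec, spec = recall_specificity(tp=60, fn=40, tn=900, fp=100)
```

On an imbalanced test set, plain accuracy would be dominated by the large non-restated class, which is why recall and specificity (plus AUC) are reported separately.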
4. Results
4.1 Imbalanced Training Data
When the algorithms are trained on the original imbalanced training data, they only perform well
when the relative cost of false negatives to false positives is 10. Among all the algorithms,
Random Forest, Bagging with Random Forest, AdaBoost with Random Forest, and Stacking with
Random Forest outperform the others. This might be because Random Forest is not sensitive to
the imbalanced data. The testing results when the imbalanced training data are used and when the
misclassification cost is 10 are listed in Table 12.
Testing results using original imbalanced training data at the misclassification cost of 10
Input Algorithm Recall Specificity AUC Accuracy
Note:
There are 17 non-audit-related features: LTA, LEV1, LEV2, FREEC, materialweakness, SALESGROWTH, OCF,
ZScore, BMratio, TotalNetAccruals, AbnormalAccruals, AbsAbnAcc, SMALL_PROFIT, SMALL_INCREASE, FIN,
EXANTE, and EPSGrowth
There are 16 audit-related features: LAF, DLAF, AuditorSize, INFLUENCE, TENURE, TotalNetAccruals,
AbnormalAccruals, AbsAbnAcc, GC, PRIOGC, Specialist, WeightedMarketValue, AuditorPortfolioShare,
AuditorMarketShare, AuditorChange, and Big4
The selected features from “All Features” are LTA, SMALL_INCREASE, AuditorPortfolioShare,
AuditorMarketShare; The selected features from “Non Audit-related Features” are LTA, SALESGROWTH, OCF,
Zscore, SMALL_PROFIT, FIN; The selected features from “Audit-related Features” are LAF, AuditorPortfolioShare,
AuditorMarketShare.
To see which variables are the most predictive of audit quality, the variables are ranked in terms
of their predictive ability using the evaluator in WEKA called GainRatioAttributeEval, which
evaluates the worth of an attribute by measuring the gain ratio with respect to the class. This
ranking is listed in Table 13 and the shaded variables are audit related variables.
Table 13
Ranking of predictability using Random Forest
(with original training dataset, relative cost=10)
Rank Variable Rank Variable
1 AuditorMarketShare 16 TotalNetAccruals
2 LTA 17 LEV2
3 AuditorPortfolioShare 18 AbsAbnAcc
4 LAF 19 GC
5 AuditorSize 20 AbnormalAccruals
6 Big4 21 BMratio
7 OCF 22 PRIOGC
8 FREEC 23 TENURE
9 ZScore 24 EXANTE
10 INFLUENCE 25 EPSGrowth
11 SMALL_INCREASE 26 materialweakness
12 SMALL_PROFIT 27 DLAF
13 SALESGROWTH 28 Specialist
14 FIN 29 AuditorChange
15 LEV1 30 WeightedMarketValue
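The gain ratio behind WEKA's GainRatioAttributeEval can be mimicked for a single discrete attribute. A minimal Python sketch (information gain divided by the attribute's own split entropy; names are illustrative):

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(feature, labels):
    """Gain ratio of a discrete feature with respect to the class labels."""
    n = len(labels)
    groups = {}
    for f, lab in zip(feature, labels):
        groups.setdefault(f, []).append(lab)
    info_gain = entropy(labels) - sum(len(g) / n * entropy(g)
                                      for g in groups.values())
    split_info = entropy(feature)
    return info_gain / split_info if split_info > 0 else 0.0
```

Dividing by the split entropy penalizes attributes with many values, which plain information gain would otherwise favor.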
Among the six most predictive variables listed in Table 13, five are audit-related variables: the
market share of the auditor, the portfolio share of the auditor, the log of audit fees, the size of the
auditor, and the brand name of the auditor.
To this point, the four research questions raised in this research can be answered as follows: 1) Random
Forest algorithm works the best in predicting audit quality and can achieve an AUC value of 0.723
when trained on all variables without feature selection; 2) the most predictive variables are: the
market share of the auditor, the log of client’s total assets, the portfolio share of the auditor, log of
audit fees, size of the auditor, and the brand name of the auditor; 3) audit-related variables have
better predictive ability than financial variables; and 4) audit-related variables and financial
variables complement each other in predicting audit quality.
When the algorithms are trained on the synthetic balanced data, those that are sensitive to
imbalanced data start to perform decently, for example, Bayesian Belief Network, Artificial Neural
Network, and Support Vector Machine. The details of the testing results are listed in Table 14,
Table 15, and Table 16. When all variables are included as inputs (Table 14), the algorithms
generally do not work well if subset feature selection is performed. Without feature selection,
SVM achieves an AUC of 0.654 regardless of the level of misclassification cost; Bayesian
Network works the best when the misclassification cost is 1; MultilayerPerceptron and Random
Forest perform the best when the relative cost is 5; and J48 performs the best when the
misclassification cost is 10. The highest AUC value (0.696) is achieved when the Random Forest
is used at the misclassification cost of 5.
Note:
The selected subset features from “All Features” are materialweakness, AuditorSize, TENURE, SMALL_PROFIT,
SMALL_INCREASE, Specialist, FIN, EXANTE, EPSGrowth
Note:
There are 17 non-audit-related features: LTA, LEV1, LEV2, FREEC, materialweakness, SALESGROWTH, OCF,
ZScore, BMratio, TotalNetAccruals, AbnormalAccruals, AbsAbnAcc, SMALL_PROFIT, SMALL_INCREASE,
FIN, EXANTE, and EPSGrowth
The selected features from “Non Audit-related Features” are materialweakness, SMALL_PROFIT, FIN, EXANTE,
and EPSGrowth.
When only financial variables are included as inputs (Table 15), the performance is good enough
only when no subset feature selection is performed. Bayesian Network works the best when the
misclassification cost is 10; SVM works well when the misclassification cost is within 10; and
Random Forest works well when the misclassification cost is between 5 and 10. The highest AUC
value (0.619) is achieved by SVM when the misclassification cost is at most 10.
When only audit related variables are included as inputs (Table 16), the algorithms perform better
when no subset feature selection is performed, and the performance is decent only when the
misclassification cost is 1. The highest AUC value (0.758) is achieved when MultilayerPerceptron
is used, followed by Random Forest (AUC of 0.625).
No matter whether the algorithms are trained by the original imbalanced data or the synthetic
balanced data, the performance is better when no subset feature selection is performed. Table 17
summarizes the results from imbalanced and balanced data without feature selection.
Note:
The selected features from “All Features” are materialweakness, AuditorSize, TENURE, SMALL_PROFIT,
SMALL_INCREASE, Specialist, FIN, EXANTE, EPSGrowth
The selected features from “Non Audit-related Features” are materialweakness, SMALL_PROFIT, FIN, EXANTE,
and EPSGrowth.
When the algorithms are trained on balanced training data, the highest value of AUC when all
variables are included is 0.696, that when only financial variables are included is 0.619, and that
when only audit related variables are included is 0.650. This coincides with the result generated
from the original unbalanced training data: the audit-related variables have better predictive ability
of audit quality than financial variables, and the combination of the two groups achieves the best
performance, indicating that audit related variables and financial variables complement each other
in predicting audit quality.
When all variables are included in inputs and when the misclassification cost is set to 10, Random
Forest has higher value of AUC when it is trained on imbalanced data than on balanced data, and
the same holds true when only financial variables or only audit-related variables are used as inputs.
The conclusions derived from the above analysis can be summarized as follows: 1)
Supervised learning algorithms can be used to accurately predict audit quality, especially when
Random Forest is applied on the original data without feature selection; 2) the most predictive
variables are: the market share of the auditor, the log of client’s total assets, the portfolio share of
the auditor, log of audit fees, size of the auditor, and the brand name of the auditor; 3) audit related
variables have better predictability than financial variables; and 4) audit related variables and
financial variables complement each other in predicting audit quality.
5. Further Analysis
5.1 Comparing Random Forest with Logistic Regression
Since Random Forest performs extremely well in the previous analysis, it will be compared with
logistic regression here to see which can better predict audit quality. Table 18 lists the performance
indicators of the two algorithms under each condition.
The results show that Random Forest outperforms Logistic regression in each scenario in terms of
AUC value, showing the superior performance of Random Forest in this particular context. One
explanation for the poorer performance of logistic regression is that it is sensitive to the outliers
in the dataset, whereas the Random Forest algorithm is robust to them.
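This kind of gap can be illustrated on synthetic data; the toy setup below combines a nonlinear signal with a few injected outliers, so the gap here mainly reflects the nonlinearity rather than the paper's actual data (all numbers are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(1500, 8))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # nonlinear (XOR-like) signal
X[:30] *= 50                              # a handful of extreme outliers

auc_rf = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                         X, y, cv=5, scoring="roc_auc").mean()
auc_lr = cross_val_score(LogisticRegression(max_iter=1000),
                         X, y, cv=5, scoring="roc_auc").mean()
```

A linear decision boundary cannot separate an XOR-type signal and its coefficient estimates are pulled by the scaled rows, while tree splits are invariant to monotone rescaling and can carve out the interaction.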
Table 19
Summary of performance when accruals are included/excluded from audit variables
(with the original unbalanced training data and misclassification cost of 10)
Input               Feature selection          Algorithm      Recall  Specificity  AUC    Accuracy
With accruals       Without feature selection  Random Forest  0.639   0.704        0.671  0.700
With accruals       With feature selection     Random Forest  0.556   0.737        0.646  0.726
Excluding accruals  Without feature selection  Random Forest  0.627   0.737        0.682  0.730
Excluding accruals  With feature selection     Random Forest  0.556   0.737        0.646  0.726
6. Conclusion
This research pioneers the construction of supervised learning algorithms that are more effective
than traditional regressions at predicting audit quality, which is proxied by restatement, one of the
best publicly available measures of audit quality (Aobdia, 2015). Using 14,028 firm-year
from 2008 to 2016 in the United States and ten different supervised learning algorithms, the
research shows that: 1) supervised learning algorithms can be used to predict audit quality
accurately, especially when Random Forest is applied; 2) the variables that are most predictive of
audit quality are: the market share of the auditor, the log of client’s total assets, the portfolio share
of the auditor, log of audit fees, size of the auditor, and the brand name of the auditor; 3) audit
related variables have higher predictive ability than financial variables; and 4) audit related
variables and financial variables complement each other in predicting audit quality.
One major limitation of this paper, and of most research that predicts restatement or fraud, is that
the financial variables used in the model are not the original versions before restatement. This can
be problematic because it means that research on audit failures is based on data that have already
been restated to predict future restatements. Lack of access to the original financial data is the
major reason why only the updated financial data are used in this paper and most other related
research. However, the conclusions that audit-related variables alone can predict audit quality very
accurately, and that they predict better than financial variables, support the reliability of the
algorithms. Future research in the area of predicting audit quality might consider including the
original, pre-restatement financial data.
In a nutshell, stakeholders who wish to predict audit quality are advised to use the Random Forest
algorithm trained on the original imbalanced data with both audit-related variables and financial
variables as inputs.
BANKRUPTCY       The Altman Z-score, a measure of the probability of bankruptcy, with a lower
                 value indicating greater financial distress (Francis & Yu, 2009)
VOLATILITY       Client’s stock volatility, measured as the standard deviation of 12 monthly
                 stock returns for the current fiscal year (Francis & Yu, 2009)
MB               Log of book-to-market ratio (Francis & Yu, 2009)
ACCRUALS         Signed abnormal accruals (Francis & Yu, 2009)
ABS_ACCRUALS     Absolute value of abnormal accruals derived from the performance-adjusted
                 accruals model in Equation (Francis & Yu, 2009)
SMALL_PROFIT     Dummy variable, coded 1 if a client's net income deflated by lagged total
                 assets is between 0 and 5 percent, and 0 otherwise (Francis & Yu, 2009)
SMALL_INCREASE   Dummy variable, coded 1 if a client's net income deflated by lagged total
                 assets is between 0 and 1.3 percent, and 0 otherwise (Francis & Yu, 2009)
GCREPORT         Dummy variable that takes the value of 1 if a firm receives a going-concern
                 report in a specific fiscal year, and 0 otherwise (Francis & Yu, 2009)
The restatement data provided by Audit Analytics include the “Restatement Begin Date” and
“Restatement End Date”. Audit Analytics confirmed via email that the beginning and ending dates
of the restatement outline the periods affected by the restatement. For example, a restatement with
beginning and ending dates of 01/01/2015 and 12/31/2017 affected years 2015, 2016, and 2017.
Also, “it is possible (although rare) to have only certain years within the period affected to be
restated”. In the example above, for instance, a cash flow restatement could affect only 2015 and
2017, but not 2016. Based on the above information, the
beginning and ending dates of the restatement for each firm are converted into firm-year
observations using STATA (code provided on demand).
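The conversion can equally be sketched in pandas (the STATA code itself is not shown in the paper, so the column names and layout below are hypothetical):

```python
import pandas as pd

# hypothetical restatement window for one firm
restatements = pd.DataFrame({
    "firm_id":   ["A"],
    "res_begin": pd.to_datetime(["2015-01-01"]),
    "res_end":   pd.to_datetime(["2017-12-31"]),
})

# one row per firm per fiscal year covered by the restatement window
rows = [
    {"firm_id": r.firm_id, "fiscal_year": year, "restated": 1}
    for r in restatements.itertuples()
    for year in range(r.res_begin.year, r.res_end.year + 1)
]
firm_years = pd.DataFrame(rows)
```

The expanded frame can then be merged with the COMPUSTAT firm-year panel on firm identifier and fiscal year, with unmatched observations coded as non-restated.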
Table 3