Excerpts - Machine Learning and Big Data Projects
READING 7
Machine Learning
by Kathleen DeRose, CFA, and Christophe Le Lannou
Kathleen DeRose, CFA, is at New York University, Stern School of Business (USA).
Christophe Le Lannou is at dataLearning (United Kingdom).
LEARNING OUTCOMES
The candidate should be able to:
INTRODUCTION
Investment firms are increasingly using technology at every step of the investment management value chain—from improving their understanding of clients, to uncovering new sources of alpha, to executing trades more efficiently. Machine learning techniques, a central part of that technology, are the subject of this reading. These techniques first appeared in finance in the 1990s and have since flourished with the explosion of data and cheap computing power.
This reading provides a high-level view of machine learning (ML). It covers a
selection of key ML algorithms and their investment applications. Investment prac-
titioners should be equipped with a basic understanding of the types of investment
problems that machine learning can address, an idea of how the algorithms work,
and the vocabulary to interact with machine learning and data science experts. While
investment practitioners need not master the details and mathematics of machine
learning, as domain experts in investments they can play an important role by being
able to source appropriate model inputs, interpret model outputs, and translate out-
puts into appropriate investment actions.
Section 2 gives an overview of machine learning in investment management.
Section 3 defines machine learning and the types of problems that can be addressed
by supervised and unsupervised learning. Section 4 describes evaluating machine
learning algorithm performance. Key supervised machine learning algorithms are
covered in Section 5, while Section 6 describes key unsupervised machine learning
algorithms. Neural networks, deep learning nets, and reinforcement learning are
covered in Section 7. The reading concludes with a summary.
(Diagram: a training dataset of target outcomes paired with feature inputs; predicted targets {YPredict} are compared with actual targets {Y}.)
Evaluation of Fit
In supervised machine learning, the dependent variable (Y) is the target and the
independent variables (X’s) are known as features. The labeled data (training data set)
is used to train the supervised ML algorithm to infer a pattern-based prediction rule.
The fit of the ML model is evaluated using labeled test data in which the predicted
targets (YPredict) are compared to the actual targets (YActual).
An example of supervised learning is the case in which ML algorithms are used
to predict whether credit card transactions are fraudulent or legitimate. In the credit
card example, the target is a binary variable with a value of 1 for “fraudulent” or 0
for “non-fraudulent.” The features are the transaction characteristics. The chosen ML
algorithm uses these data to train a model to predict fraud more accurately in new
transactions. The ML program “learns from experience” if the percentage of correctly
predicted credit card transactions increases as the input from a growing credit card
database increases. One possible ML algorithm to use here would be to fit a logistic
regression model to the data to provide an estimate of the probability a transaction
is fraudulent.
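To make the credit card example concrete, here is a minimal sketch using scikit-learn and synthetic data; the feature matrix and labels below are hypothetical stand-ins for real transaction data.

```python
# Hypothetical sketch: a logistic regression fraud classifier on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))                          # transaction features
y = (X[:, 0] + rng.normal(size=1000) > 2).astype(int)   # 1 = fraudulent, 0 = non-fraudulent

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

fraud_prob = model.predict_proba(X_test)[:, 1]          # estimated probability of fraud
print("First five fraud probabilities:", fraud_prob[:5].round(3))
print("Share of test transactions classified correctly:", model.score(X_test, y_test))
```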
Supervised learning can be divided into two categories of problems, regression
problems and classification problems, with the distinction between them being deter-
mined by the nature of the target (Y) variable. If the target variable is continuous, then
the task is one of regression (even if the ML technique used is not “regression,” note
this nuance of ML terminology). If the target variable is categorical or ordinal (i.e.,
a ranked category), then it is a classification problem. Regression and classification
use different ML techniques.
Regression focuses on making predictions of continuous target variables. Most
readers are already familiar with multiple linear regression (e.g., ordinary least squares)
models, but other supervised learning techniques exist that include non-linear models.
These non-linear models are useful for problems involving large data sets with large
numbers of features, many of which may be correlated. Some examples of problems
belonging to the regression category are using historical stock market returns to fore-
cast stock price performance or using historical corporate financial ratios to forecast
the probability of bond default.
Classification focuses on sorting observations into distinct categories. In a regres-
sion problem, when the dependent variable (target) is categorical, the model relating
the outcome to the independent variables (features) is called a “classifier.” Many
classification models are binary classifiers, as in the case of fraud detection for credit
card transactions. Multi-category classification is not uncommon, as in the case of
classifying firms into multiple credit rating categories. In assigning ratings, the outcome
variable is ordinal, meaning the categories have a distinct order or ranking (e.g., from
low to high creditworthiness). Ordinal variables are intermediate between categorical
variables and continuous variables on a scale of measurement.
Exhibit 3 presents a stylized decision flowchart for choosing among the machine
learning algorithms shown in Exhibit 2. The dark-shaded ovals contain the supervised
ML algorithms; the light-shaded ovals contain the unsupervised ML algorithms; and
the key questions to consider are shown in the unshaded boxes.
(Exhibit 3: stylized decision flowchart mapping these questions to the algorithms: PCA for dimensionality reduction; penalized regression/LASSO, CART, random forests, and neural nets for numerical prediction; KNN, SVM, CART, random forests, and neural nets for classification with labeled data; and k-means, hierarchical clustering, and neural nets for clustering with unlabeled data.)
Start by asking: Are the data complex, having many features that are highly
correlated? If yes, then dimensionality reduction using principal components analysis
(PCA) is appropriate.
Next, is the problem one of classification or numerical prediction? If numerical
prediction, then depending on whether or not the data have non-linear characteristics,
the choice of ML algorithms is from a set of regression algorithms—either penalized
regression/LASSO for linear data or CART, random forest, and neural networks for
non-linear data.
If the problem is one of classification, then depending on whether or not the data
are labeled, the choice is either from a set of classification algorithms using labeled
data or from a set of clustering algorithms using unlabeled data.
If the data are labeled, then depending on whether or not the data have non-linear
characteristics, the choice of classification algorithm would be k-nearest neighbor
(KNN) and support vector machine (SVM) for linear data or CART, random forest,
and neural networks for non-linear data.
Finally, if the data are unlabeled, the choice of clustering algorithm depends on
whether or not the data have non-linear characteristics. The choice of clustering
algorithm would be neural networks for non-linear data or for linear data, k-means
with a known number of categories, and hierarchical clustering with an unknown
number of categories.
Following a description of how to evaluate ML algorithm performance, we will
define all of the ML algorithms shown in Exhibit 3 and then explain their applications
in investment management.
EXAMPLE 1
Solution to 1:
B is correct. A is incorrect because machine learning is not best described as
a type of computer algorithm. C is incorrect because machine learning is not
limited to extracting information from linear, labeled data sets.
Solution to 2:
A is correct. B is incorrect because the term “labeled training data” means the
target (Y) is provided. C is incorrect because a supervised ML algorithm is meant
to predict a target (Y) variable.
Solution to 3:
A is correct. B is incorrect because supervised learning uses labeled training
data. C is incorrect because it describes unsupervised learning.
Solution to 4:
C is correct. A is incorrect because it describes classification, not dimension
reduction. B is incorrect because it describes clustering, not dimension reduction.
Such a model has perfect hindsight but no foresight. The main contributors to overfitting are thus high
noise levels in the data and too much complexity in the model. The middle graph shows
no errors in this overfit model. Complexity refers to the number of features, terms, or
branches in the model and to whether the model is linear or non-linear (non-linear is
more complex). As models become more complex, overfitting risk increases. A good
fit/robust model fits the training (in-sample) data well and generalizes well to out-of-
sample data, both within acceptable degrees of error. The right graph shows that the
good fitting model has only one error, the misclassified circle.
In building models, data scientists try to simultaneously minimize both bias and variance errors while selecting an algorithm with good predictive or classifying power, as seen in the right panel of Exhibit 5.
(Exhibit 5: three panels, each plotted against the number of training samples.)
Exhibit 6  Fitting Curve Shows Trade-Off Between Bias and Variance Errors and Model Complexity
(Exhibit 6 plots model error (Ein, Eout) against model complexity: bias error falls and variance error rises as complexity increases, so total error is minimized at the optimal complexity.)
Finding the optimal point (managing overfitting risk)—the sweet spot just before
the total error rate starts to rise (due to increasing variance error)—is a core part of
the machine learning process and the key to successful generalization. Data scientists
express the trade-off between overfitting and generalization as a trade-off between
cost (the difference between in- and out-of-sample error rates) and complexity. They
use the trade-off between cost and complexity to calibrate and visualize under- and
overfitting and to optimize their models.
EXAMPLE 2
Solution to 1:
A, Statement 1, is correct. B, Statement 2, is incorrect because it describes
a poorly fitting model with high bias. C, Statement 3, is incorrect because it
describes an overfitted model with poor generalization.
Solution to 2:
B is correct. Anand’s model is good at correctly classifying using the training
sample, but it does not perform well using new data. The model is overfitted,
so it has high variance error.
Solution to 3:
B is correct. A is incorrect because the penalty should increase in size with the
number of included features. C is incorrect because Anand is using labeled data
for classification, and unsupervised learning models do not use labeled data.
Penalized regression includes a constraint such that the regression coefficients
are chosen to minimize the sum of squared residuals plus a penalty term that increases
in size with the number of included features. So, in a penalized regression, a feature
must make a sufficient contribution to model fit to offset the penalty from including
it. Therefore, only the more important features for explaining Y will remain in the
penalized regression model.
In one popular type of penalized regression, LASSO (least absolute shrinkage and selection operator), the penalty term has the following form, with λ > 0:

Penalty term = λ Σ_{k=1}^{K} |b_k|.
In addition to minimizing the sum of the squared residuals, LASSO also involves
minimizing the sum of the absolute values of the regression coefficients (see the fol-
lowing expression). The greater the number of included features (i.e., variables with
non-zero coefficients), the larger the penalty term. Therefore, penalized regression
ensures that a feature is included only if the sum of squared residuals declines by more
than the penalty term increases. All types of penalized regression involve a trade-off of
this type. Also, since LASSO eliminates the least important features from the model,
it automatically performs feature selection.
Σ_{i=1}^{n} (Y_i − Ŷ_i)² + λ Σ_{k=1}^{K} |b_k|
Lambda (λ) is a hyperparameter—a parameter whose value must be set by the
researcher before learning begins—of the regression model and will determine the
balance between fitting the model versus keeping the model parsimonious. Note that
in the case where λ = 0, then the LASSO penalized regression is equivalent to an
OLS regression. When using LASSO or other penalized regression techniques, the
penalty term is added only during the model building process (i.e., when fitting the
model to the training data). Once the model has been built, the penalty term is no
longer needed, and the model is then evaluated by the sum of the squared residuals
generated using the test data set.
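A minimal sketch of LASSO in scikit-learn, where the hyperparameter is named alpha rather than lambda; the data and parameter values below are synthetic and illustrative.

```python
# Hypothetical sketch: LASSO shrinks unimportant coefficients exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                        # 20 candidate features, most irrelevant
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)  # only the first two features matter

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)        # alpha = 0 would be equivalent to OLS

print("Features kept:", np.flatnonzero(lasso.coef_))  # automatic feature selection
print("Test R^2:", lasso.score(X_test, y_test))       # penalty not applied out of sample
```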
With today’s availability of fast computation algorithms, investment analysts are
increasingly using LASSO and other regularization techniques to remove less per-
tinent features and build parsimonious models. Regularization describes methods
that reduce statistical variability in high dimensional data estimation problems—in
this case, reducing regression coefficient estimates toward zero and thereby avoiding
complex models and the risk of overfitting. LASSO has been used, for example, for
forecasting default probabilities in industrial sectors where scores of potential features,
many collinear, have been reduced to fewer than 10 variables, which is important given
the relatively small number (about 100) of observations of default.
Regularization methods can also be applied to non-linear models. A long-term
challenge of the asset management industry in applying mean–variance optimization
has been the estimation of stable covariance matrixes and asset weights for large
portfolios. Asset returns typically exhibit strong multi-collinearity, making the esti-
mation of the covariance matrix highly sensitive to noise and outliers, so the resulting
optimized asset weights are highly unstable. Regularization methods have been used
to address this problem.
In prediction, only out-of-sample performance (i.e., prediction accuracy) really
matters. The relatively parsimonious models produced by applying penalized regression
methods, like LASSO, tend to work well because they are less subject to overfitting.
The intuition behind support vector machine (SVM) classification is straightforward and best explained with a few pictures. The left panel in Exhibit 7
presents a simple data set with two features (x and y coordinates) labeled in two groups
(triangles and crosses). These binary labeled data are noticeably separated into two
distinct regions, which could represent stocks with positive and negative returns in
a given year. These two regions can be easily separated by numerous straight lines;
three of them are shown in the right panel of Exhibit 7. The data are thus linearly
separable, and any of the straight lines shown would be called a linear classifier—a
binary classifier that makes its classification decision based on a linear combination
of the features of each data point.
(Exhibits: scatter plots of the labeled data on X and Y feature axes; a subsequent plot shows a linear classifier together with its margin.)
(Exhibit 10, Panel A: a decision tree whose root node splits on IOG, X1 > 10%; its branches split on FCFG, X2 > 10% on one side and X2 > 20% on the other, with a further split on IOG, X1 > 5%; terminal nodes are labeled dividend increase (+) or no dividend increase (−). Panel B: the corresponding partition of the feature space into regions with X1 ≤ 10% and X1 > 10%, where X1 is IOG.)
We now turn to how the CART algorithm selects features and cutoff values for
them. Initially, the classification model is trained from the labeled data, which in
this hypothetical case are 10 instances of companies having dividend increase (the
crosses) and 10 instances of companies with no dividend increase (the dashes). As
shown in Panel B of Exhibit 10, at the initial root node and at each decision node the
feature space (i.e., the plane defined by X1 and X2) is split into two rectangles for
values above and below the cutoff value for the particular feature represented at that
node. This can be seen by noting the distinct patterns of the lines that emanate from
the decision nodes in Panel A. These same distinct patterns are used for partitioning
the feature space in Panel B.
The CART algorithm chooses the feature and the cutoff value at each node that
generates the widest separation of the labeled data to minimize classification error
(e.g., by a criterion, such as mean-squared error). After each decision node, the
partition of the feature space becomes smaller and smaller, so observations in each
group have lower within-group error than before. At any level of the tree, when the
classification error does not diminish much more from another split (bifurcation),
the process stops, the node is a terminal node, and the category that is in the major-
ity at that node is assigned to it. If the objective of the model is classification, then
the prediction of the algorithm at each terminal node will be the category with the
majority of data points. For example, in Panel B of Exhibit 10, the top right rectangle
of the feature space, representing IOG (X1)>10% and FCFG (X2)>20%, contains 5
crosses, the most data points of any of the partitions. So, CART would predict that
a new data point (i.e., a company) with such features belongs to the cross (dividend
increase) category. However, if instead the new data point had IOG (X1)>10% and
FCFG (X2) ≤20%, then it would be predicted to belong to the dash (no dividend
increase) category—represented by the lower right rectangle with 2 crosses but with
3 dashes. Finally, if the goal is regression, then the prediction at each terminal node
is the mean of the labeled values.
CART makes no assumptions about the characteristics of the training data, so if
left unconstrained, potentially it can perfectly learn the training data. To avoid such
overfitting, regularization parameters can be added, such as the maximum depth of the
tree, the minimum population at a node, or the maximum number of decision nodes.
The iterative process of building nodes is stopped once the regularization criterion
has been reached. For example, in Panel B of Exhibit 10, the upper left rectangle of
the feature space (determined by X1≤10%, X2>10%, and X1≤5% with three crosses)
might represent a terminal node resulting from a regularization criterion with min-
imum population equal to 3. Alternatively, regularization can occur via a pruning
technique that can be used afterward to reduce the size of the tree. Sections of the
tree that provide little classifying power are pruned (i.e., removed).
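A minimal sketch of fitting a CART classifier with the regularization parameters just mentioned, using scikit-learn on synthetic data; the parameter values are illustrative.

```python
# Hypothetical sketch: a regularized CART classifier.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(
    max_depth=3,          # maximum depth of the tree
    min_samples_leaf=3,   # minimum population at a node
    max_leaf_nodes=8,     # caps the number of terminal nodes
).fit(X_train, y_train)

print(export_text(tree))                             # the fitted tree as readable rules
print("Test accuracy:", tree.score(X_test, y_test))
```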
By its iterative structure, CART can uncover complex dependencies between
features that other models cannot reveal. As demonstrated in Exhibit 10, the same
feature can appear several times in combination with other features and some features
may be relevant only if other conditions have been met.
As shown in Exhibit 11, high profitability is a critical feature for predicting if a
stock is an attractive investment or a value trap (i.e., an investment that, although
apparently priced cheaply, is likely to be unprofitable). This feature is relevant only
if the stock is cheap—for example, in this hypothetical case if P/E is less than 15,
leverage is high (debt to total capital > 50%), and sales are expanding (sales growth
> 15%). Said another way, high profitability is irrelevant in this context if the stock is
not cheap, and if leverage is not high, and if sales are not expanding. Multiple linear
regression typically fails in such situations where the relationship between the features
and the outcome is non-linear.
(Exhibit 11: a decision tree. Is the stock cheap (P/E < 15x)? If yes, is leverage high (debt/total capital > 50%)? If yes, is sales growth high (> 15%)? If yes, is profitability high (net profit margin > 20%)? High profitability leads to the terminal node "attractive equity investment"; otherwise, the terminal node is "value trap.")
CART is a popular supervised machine learning model because the tree provides
a visual explanation for the prediction. This contrasts favorably with other algorithms
that are often considered to be “black boxes” because it may be difficult to understand
the reasoning behind their outcomes and thus to foster trust in them. CART is a
powerful tool to build expert systems for decision-making processes. It can induce
robust rules despite noisy data and complex relationships between high numbers of
features. Typical applications of CART in investment management include, among
others, enhancing detection of fraud in financial statements, generating consistent
decision processes in equity and fixed-income selection, and simplifying communi-
cation of investment strategies to clients.
The left plot presents the actual defaults in light shade and no defaults in dark shade, while the middle and right plots present the predicted defaults and no defaults (also in light and dark shades, respectively). It is
clear from the middle plot, which is based on a traditional linear regression model, that
the model fails to predict the complex non-linear relationship between the features.
Conversely, the right plot, which presents the prediction results of a random forest
model, shows that this model performs very well in matching the actual distribution
of the data.
(Three scatter plots of Y against X comparing the actual default data with the predictions of the linear regression model and of the random forest model.)
Despite its relative simplicity, random forest is a powerful algorithm with many
investment applications. These include, for example, use in factor-based investment
strategies for asset allocation and investment selection or use in predicting whether
an IPO will be successful (e.g., percent oversubscribed, first trading day close/IPO
price) given the attributes of the IPO offering and the corporate issuer.
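A minimal sketch of a random forest classifier in scikit-learn; the data and settings below are hypothetical.

```python
# Hypothetical sketch: a random forest, i.e., a bagged ensemble of decision trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees in the forest
    max_features="sqrt",   # each split considers a random subset of the features
    random_state=0,
).fit(X_train, y_train)

print("Test accuracy:", forest.score(X_test, y_test))
```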
EXAMPLE 3
Solution to 1:
Lee is addressing a supervised learning classification problem because she
must determine whether Biotron’s upcoming bond issue would be classified as
investment grade or non-investment grade.
Solution to 2:
One suitable ML algorithm is the SVM. The SVM algorithm is a linear classifier
that aims to seek the optimal hyperplane—the one that separates observations
into two distinct sets by the maximum margin. So, the SVM is well suited to
binary classification problems, such as the one facing Lee (investment grade
vs. non-investment grade). In this case, Lee could train the SVM algorithm on
data—characteristics (features) and rating (target)—of low investment-grade
(Baa3/BBB–) and high non-investment-grade (Ba1/BB+) bonds. Lee would then
note on which side of the margin the new data point (Biotron’s new bonds) lies.
The KNN algorithm is also well suited for classification problems because
it classifies a new observation by finding similarities (or nearness) between the
new observation and the existing data. Training the algorithm with data as for
SVM, the decision rule for classifying Biotron’s new bonds is which classifi-
cation is in the majority among its k-nearest neighbors. Note that k must be
pre-specified by Lee.
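A minimal sketch of how both classifiers might be trained and compared; the data here are synthetic stand-ins for the bond characteristics and ratings described above.

```python
# Hypothetical sketch: training SVM and KNN classifiers on the same labeled data.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=6, random_state=1)  # features, ratings
svm = SVC(kernel="linear").fit(X, y)                  # maximum-margin linear classifier
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)   # k = 5 must be pre-specified

new_bond = X[:1]                                      # stand-in for a new bond issue
print("SVM:", svm.predict(new_bond), "KNN:", knn.predict(new_bond))
```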
Solution to 3:
If the ML algorithms disagreed on the classification, the classification would
be more likely to be sensitive to the algorithm’s approach to classifying data.
Because the classification of Biotron’s new issue appears robust to the choice
of ML algorithm (i.e., both algorithms agree on the rating), the resulting clas-
sification will likely be correct.
EXAMPLE 4
Solution to 1:
Kim is addressing a classification problem because she must determine whether
bonds that she is considering purchasing in the credit quality range of B/B2 to
CCC/Caa2 will default or not default.
Solution to 2:
With 19 fundamental and 5 technical factors (i.e., the features) the dimension-
ality of the model is 24.
Solution to 3:
The CART model is an available algorithm for addressing classification problems.
Its ability to handle complex, non-linear relationships makes it a good choice
to address the modelling problem at hand. An important advantage of CART
is that its results are relatively straightforward to visualize and interpret, which
should help Kim explain her recommendations based on the model to Hilux’s
investment committee and the firm’s clients.
Solution to 4:
At each node in the decision tree, the algorithm will choose the feature and the
cutoff value for the selected feature that generates the widest separation of the
labeled data to minimize classification error.
Solution to 5:
The team can avoid overfitting and improve the predictive power of the CART
model by adding regularization parameters. For example, the team could spec-
ify the maximum depth of the tree, the minimum population at a node, or the
maximum number of decision nodes. The iterative process of building nodes will
be stopped once the regularization criterion has been reached. Alternatively, a
pruning technique can be used afterward to remove parts of the CART model
that provide little power to correctly classify instances into default or no default
categories.
Solution to 6:
The analytics team might use ensemble learning to combine the predictions from
a collection of models, where the average result of many predictions leads to a
reduction in noise and thus more accurate predictions. Ensemble learning can
be achieved by an aggregation of either heterogeneous learners—different types
of algorithms combined with a voting classifier—or homogenous learners—a
combination of the same algorithm but using different training data based on
the bootstrap aggregating (i.e., bagging) technique. The team may also consider
developing a random forest classifier (i.e., a collection of many decision trees)
trained via a bagging method.
(Exhibit 13: a three-dimensional plot of the data showing PC 1, the eigenvector that explains the largest percentage of total variance in the data, with PC 2 orthogonal, at 90°, to it. Accompanying scree plots show the proportion of total variance explained against the number of principal components, from 0 to 20.)
The main drawback of PCA is that since the principal components are combina-
tions of the data set’s initial features, they typically cannot be easily labeled or directly
interpreted by the analyst. Compared to modelling data with variables that represent
well-defined concepts, the end user of PCA may perceive PCA as something of a
“black box.”
Reducing the number of features to the most relevant predictors is very useful,
even when working with data sets having as few as ten or so features. Notably, dimen-
sion reduction facilitates visually representing the data in two or three dimensions.
It is typically performed as part of exploratory data analysis, before training another
supervised or unsupervised learning model. Machine learning models are quicker
to train, tend to reduce overfitting (by avoiding the curse of dimensionality), and are
easier to interpret if provided with lower dimensional data sets.
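A minimal sketch of PCA on synthetic correlated data using scikit-learn; the number of components retained is illustrative.

```python
# Hypothetical sketch: PCA reduces 15 correlated features to 3 composite variables.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))        # 3 underlying drivers
X = np.hstack([base + 0.1 * rng.normal(size=(500, 3)) for _ in range(5)])

pca = PCA(n_components=3).fit(X)
X_reduced = pca.transform(X)            # the principal components
print("Variance explained:", pca.explained_variance_ratio_)  # basis for a scree plot
```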
6.2 Clustering
Clustering is another type of unsupervised machine learning that is used to organize
data points into similar groups called clusters. A cluster contains a subset of obser-
vations from the data set such that all the observations within the same cluster are
deemed “similar.” The aim is to find a good clustering of the data—meaning that the
observations inside each cluster are similar or close to each other (a property known
as cohesion) and the observations in two different clusters are as far away from one
another or are as dissimilar as possible (a property known as separation). Exhibit 14
depicts this intra-cluster cohesion and inter-cluster separation.
2 The algorithm then analyzes the features for each observation. Based on the
distance measure that is utilized, k-means assigns each observation to its closest
centroid, which defines a cluster.
3 Using the observations within each cluster, k-means then calculates the new
(k) centroids for each cluster, where the centroid is the average value of their
assigned observations.
4 K-means then reassigns the observations to the new centroids, redefining the
clusters in terms of included and excluded observations.
5 The process of recalculating the new (k) centroids for each cluster is reiterated.
6 K-means then reassigns the observations to the revised centroids, again redefin-
ing the clusters in terms of observations that are included and excluded.
The k-means algorithm will continue to iterate until no observation is reassigned
to a new cluster (i.e., no need to recalculate new centroids). The algorithm has then
converged and reveals the final k clusters with their member observations. The k-means
algorithm has minimized intra-cluster distance (thereby maximizing cohesion) and
has maximized inter-cluster distance (thereby maximizing separation) under the
constraint that k = 3.
(Exhibit: panels illustrating the k-means iterations with centroids c1, c2, and c3. Step 4: reassigns each observation to the nearest centroid (from Step 3). Step 5: reiterates the process of recalculating new centroids. Step 6: reassigns each observation to the nearest centroid (from Step 5), completing the second iteration.)
The k-means algorithm is fast and works well on very large data sets with hun-
dreds of millions of observations. However, the final assignment of observations to
clusters can depend on the initial location of the centroids. To address this problem,
the algorithm can be run several times using different sets of initial centroids, and
then one can choose the clustering that is most useful given the business purpose.
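A minimal sketch of k-means in scikit-learn on synthetic data; the n_init argument reruns the algorithm from several sets of initial centroids and keeps the best result, as just described.

```python
# Hypothetical sketch: k-means clustering with k = 3.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("Final centroids:\n", km.cluster_centers_)
print("Cluster assignment of first 10 observations:", km.labels_[:10])
```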
One limitation of this technique is that the hyperparameter, k, the number of
clusters in which to partition the data, must be decided before k-means can be run.
So, one needs to have a sense of how many clusters are reasonable for the problem
A few technical points on dendrograms bear mentioning—although they may not all be
apparent in Exhibit 17. The x-axis shows the clusters, and the y-axis indicates some
distance measure. Clusters are represented by a horizontal line, the arch, which con-
nects two vertical lines, called dendrites, where the height of each arch represents
the distance between the two clusters being considered. Shorter dendrites represent
a shorter distance (and greater similarity) between clusters. The horizontal dashed
lines cutting across the dendrites show the number of clusters into which the data
are split at each stage.
The agglomerative algorithm starts at the bottom of the dendrite where each
observation is its own cluster (A to K). Agglomerative clustering then generates the 6
larger clusters (1 to 6). For example, clusters A and B combine to form cluster 1, and
observation G remains its own cluster, now cluster 4. Moving up the dendrogram,
2 larger clusters are formed, where, for example, cluster 7 includes clusters 1 to 3.
Finally, at the top of the dendrogram is the single large cluster (9). The dendrogram
readily shows how this largest cluster is composed of the two main sub-clusters (7
and 8), each having three smaller sub-clusters (1 to 3 and 4 to 6, respectively). The
dendrogram also facilitates visualization of divisive clustering by starting at the top
of the largest cluster and then working downward until the bottom is reached where
all 11 single-observation clusters are shown.
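A minimal sketch of agglomerative clustering and a dendrogram using SciPy; the observations below are synthetic.

```python
# Hypothetical sketch: agglomerative (hierarchical) clustering with SciPy.
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(4, 2)) for c in (0, 5, 10)])  # 12 observations

Z = linkage(X, method="ward")                     # the full merge history (the tree)
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print(labels)
# dendrogram(Z) draws the tree when matplotlib is available
```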
(Exhibit 17: a dendrogram with clusters A to K on the x-axis and the distance measure on the y-axis, running from 0 to 0.06. Arches connect dendrites to form clusters 1 to 6, which combine into the two main sub-clusters, 7 and 8, and finally into the single large cluster, 9.)
Although clusters do not come with ready-made labels (the clusters themselves are not explicitly defined), they are still very useful in practice for
uncovering important underlying structure (namely, similarities among observations)
in complex data sets.
EXAMPLE 5
Exhibit 19 A More Complex (4-5-1) Neural Network with One Hidden Layer
(Diagram: four input nodes connected to a hidden layer of five nodes, which connect to a single output node.)
Now consider any of the nodes to the right of the input layer. These nodes are
sometimes called “neurons” because they process information received. Take the
topmost hidden node. Four links connect to that node from the inputs, so the node
gets four values transmitted by the links. Each node has, conceptually, two functional
parts: a summation operator and an activation function. Once the node receives the
four input values, the summation operator multiplies each value by a weight and sums
the weighted values to form the total net input. The total net input is then passed to
the activation function, which transforms this input into the final output of the node.
Informally, the activation function operates like a light dimmer switch that decreases
or increases the strength of the input. The activation function is characteristically
non-linear, such as an S-shaped (sigmoidal) function (with output range of 0 to 1) or
the rectified linear unit function shown in Panel B of Exhibit 18. Non-linearity implies
that the rate of change of output differs at different levels of input.
This activation function is shown in Exhibit 20, where in the left graph a negative
total net input is transformed via the S-shaped function into an output close to 0.
This low output implies the node does not “fire,” so there is nothing to pass to the
next node. Conversely, in the right graph a positive total net input is transformed
into an output close to 1, so the node does fire. The output of the activation function
is then transmitted to the next set of nodes if there is a second hidden layer or, as in
this case, to the output layer node as the predicted value. The process of transmis-
sion just described (think of forward pointing arrows in Exhibit 19) is referred to as
forward propagation.
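A minimal sketch of forward propagation through the 4-5-1 network of Exhibit 19, using NumPy, randomly initialized weights, and a sigmoid activation function; all values are illustrative.

```python
# Hypothetical sketch: one forward pass through a 4-5-1 neural network.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # S-shaped activation with output in (0, 1)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                           # the four input features
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)    # weights and biases, input -> hidden
W2, b2 = rng.normal(size=(1, 5)), np.zeros(1)    # weights and biases, hidden -> output

hidden = sigmoid(W1 @ x + b1)        # summation operator, then activation, per node
output = sigmoid(W2 @ hidden + b2)   # the network's predicted value
print(output)
```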
(Exhibit 20: two graphs of the sigmoid function applied to the total net input from the summation operator, each annotated with the dimmer switch analogy. In the left graph a negative total net input maps to an output near 0; in the right graph a positive total net input maps to an output near 1.)
Starting with an initialized set of random network weights, training a neural net-
work in a supervised learning context is an iterative process in which predictions are
compared to actual values of labeled data and evaluated by a specified performance
measure (e.g., mean squared error). Then, network weights are adjusted to reduce
total error of the network. (If the process of adjustment works backward through the
layers of the network, this process is called backward propagation). Learning takes
place through this process of adjustment to network weights with the aim of reducing
total error. Without proliferating notation relating to nodes, the gist of the updating
can be expressed informally as:
New weight = (Old weight) – (Learning rate) × (Partial derivative of the total
error with respect to the old weight),
where “partial derivative” is a “gradient” or “rate of change of the total error with
respect to the change in the old weight,” and learning rate is a parameter that affects
the magnitude of adjustments. When learning is complete, all the network weights
have assigned values.
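The update rule reads directly as one line of code. A minimal sketch with made-up numbers; in practice the gradient would come from backward propagation.

```python
# Hypothetical sketch: one gradient-descent update of a single network weight.
learning_rate = 0.1
old_weight = 0.8
gradient = 0.25   # stand-in for the partial derivative of total error w.r.t. this weight

new_weight = old_weight - learning_rate * gradient
print(new_weight)   # 0.775
```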
The structure of a network in which all the features are interconnected with
non-linear activation functions allows neural networks to uncover and approximate
complex non-linear relationships among features. Broadly speaking, when more nodes
and more hidden layers are specified, a neural network’s ability to handle complexity
tends to increase (but so does the risk of overfitting).
Asset pricing is a noisy, stochastic process with potentially unstable relationships
that challenge modelling processes, so researchers are asking if machine learning
can improve our understanding of how markets work. Research comparing statisti-
cal and machine learning methods’ abilities to explain and predict equity prices so
far indicates that simple neural networks produce models of equity returns at the
individual stock and portfolio level that are superior to models built using traditional
statistical methods due to their ability to capture dynamic and interacting variables.
This suggests that ML-based models, such as neural networks, may simply be better
able to cope with the non-linear relationships inherent in security prices. However,
the trade-offs in using them are the lack of interpretability and the amount of data
needed to train such models.
Solution to 1:
A deep learning net (DLN) is a neural network (NN) with many hidden layers (at
least 3 but often more than 20). NNs and DLNs have been successfully applied to
a wide variety of complex tasks characterized by non-linearities and interactions
among features, particularly pattern recognition problems.
Solution to 2:
Mitsui wants to detect patterns of potential style drift in the daily trading of
nearly 100 external asset managers in many markets. This task will involve the
fast processing of huge amounts of complicated data. Monroe is correct that a
DLN is well suited to PEPF’s needs.
Solution to 3:
The input layer, the hidden layers, and the output layer constitute the three
groups of layers of DLNs. The input layer receives the inputs (i.e., features) and
has as many nodes as there are dimensions of the feature set. The hidden layers
consist of nodes, each comprised of a summation operator and an activation
function that are connected by links. These hidden layers are, in effect, where the
model is learned. The final layer, the output layer, produces a set of probabilities
of an observation being in any of the target style categories (each represented
by a node in the output layer). The DLN assigns the category based on the style
category with the highest probability.
EXAMPLE 7
Solution to 1:
B is correct. A and C are incorrect because when the target variable is binary
or categorical, the problem is a classification problem rather than a regression
problem.
Solution to 2:
C is correct. A is incorrect because penalized regression is related to multiple
linear regression. B is incorrect because penalized regression involves adding a
penalty term to the sum of the squared regression residuals.
Solution to 3:
C is correct. A is incorrect because CART is a supervised ML algorithm. B
is incorrect because CART is a classification and regression algorithm, not a
clustering algorithm.
Solution to 4:
B is correct. A is incorrect because neural networks are not exactly modeled on
the human nervous system. C is incorrect because neural networks are not based
on a tree structure of nodes when the relationships among the features are linear.
Solution to 5:
A is correct. B is incorrect because it refers to k-means clustering. C is incorrect
because it refers to classification, which involves supervised learning.
Solution to 6:
C is correct because dimension reduction techniques, like PCA, are aimed at
reducing the feature set to a manageable size while retaining as much of the
variation in the data as possible.
■■ Neural networks with many hidden layers (at least 3 but often more than 20)
are known as deep learning nets (DLNs) and are the backbone of the artificial
intelligence revolution.
■■ The RL algorithm involves an agent that should perform actions that will
maximize its rewards over time, taking into consideration the constraints of its
environment.
PRACTICE PROBLEMS
B The activation function in a node operates like a light dimmer switch since
it decreases or increases the strength of the total net input.
C The summation operator receives input values, multiplies each by a weight,
sums up the weighted values into the total net input, and passes it to the
activation function.
SOLUTIONS
1 A is correct. The target variable (quarterly return) is continuous, hence this calls
for a supervised machine learning based regression model.
B is incorrect, since classification uses categorical or ordinal target variables,
while in Step 1 the target variable (quarterly return) is continuous.
C is incorrect, since clustering involves unsupervised machine learning so does
not have a target variable.
2 B is correct. It is least appropriate because with LASSO, when λ = 0 the penalty
(i.e., regularization) term reduces to zero, so there is no regularization and the
regression is equivalent to an ordinary least squares (OLS) regression.
A is incorrect. With Classification and Regression Trees (CART), one way that
regularization can be implemented is via pruning which will reduce the size of
the regression tree—sections that provide little explanatory power are pruned
(i.e., removed).
C is incorrect. With LASSO, when λ is between 0.5 and 1 the relatively large
penalty (i.e., regularization) term requires that a feature makes a sufficient con-
tribution to model fit to offset the penalty from including it in the model.
3 A is correct. K-Means clustering is an unsupervised machine learning algo-
rithm which repeatedly partitions observations into a fixed number, k, of non-
overlapping clusters (i.e., groups).
B is incorrect. Principal Components Analysis is a long-established statistical
method for dimension reduction, not clustering. PCA aims to summarize or
reduce highly correlated features of data into a few main, uncorrelated compos-
ite variables.
C is incorrect. CART is a supervised machine learning technique that is most
commonly applied to binary classification or regression.
4 C is correct. Here, 20 is a hyperparameter (in the K-Means algorithm), which is
a parameter whose value must be set by the researcher before learning begins.
A is incorrect, because it is not a hyperparameter. It is just the size (number of
stocks) of Alef ’s portfolio.
B is incorrect, because it is not a hyperparameter. It is just the size (number of
stocks) of Alef ’s eligible universe.
5 B is correct. To predict which stocks are likely to become acquisition targets,
the ML model would need to be trained on categorical labelled data having the
following two categories: “0” for “not acquisition target”, and “1” for “acquisition
target”.
A is incorrect, because the target variable is categorical, not continuous.
C is incorrect, because the target variable is categorical, not ordinal (i.e., 1st,
2nd, 3rd, etc.).
6 C is correct. The advantages of using CART over KNN to classify companies
into two categories (“not acquisition target” and “acquisition target”), include
all of the following: For CART there are no requirements to specify an initial
hyperparameter (like K) or a similarity (or distance) measure as with KNN, and
CART provides a visual explanation for the prediction (i.e., the feature variables
and their cut-off values at each node).
A is incorrect, because CART provides all of the advantages indicated in
Statements I, II and III.
READING 8
Big Data Projects
by Sreekanth Mallikarjun, PhD, and Ahmed Abbasi, PhD
Sreekanth Mallikarjun, PhD, is at Reorg (USA) and the University of Virginia, McIntire
School of Commerce (USA). Ahmed Abbasi, PhD, is at the University of Virginia, McIntire
School of Commerce (USA).
LEARNING OUTCOMES
The candidate should be able to:
INTRODUCTION
Big data (also referred to as alternative data) encompasses data generated by financial
markets (e.g., stock and bond prices), businesses (e.g., company financials, production
volumes), governments (e.g., economic and trade data), individuals (e.g., credit card
purchases, social media posts), sensors (e.g., satellite imagery, traffic patterns), and
the Internet of Things, or IoT, (i.e., the network of interrelated digital devices that can
transfer data among themselves without human interaction). A veritable explosion in
big data has occurred over the past decade or so, especially in unstructured data gen-
erated from social media (e.g., posts, tweets, blogs), email and text communications,
web traffic, online news sites, electronic images, and other electronic information
sources. The prospects are for exponential growth in big data to continue.
Investment managers are increasingly using big data in their investment processes
as they strive to discover signals embedded in such data that can provide them with an
information edge. They seek to augment structured data with a plethora of unstruc-
tured data to develop improved forecasts of trends in asset prices, detect anomalies,
etc. A typical example involves a fund manager using financial text data from 10-K
reports for forecasting stock sentiment (i.e., positive or negative), which can then be
used as an input to a more comprehensive forecasting model that includes corporate
financial data.
Unlike structured data (numbers and values) that can be readily organized into
data tables to be read and analyzed by computers, unstructured data typically require
specific methods of preparation and refinement before being usable by machines (i.e.,
computers) and useful to investment professionals. Given the volume, variety, and
velocity of available big data, it is important for portfolio managers and investment
analysts to have a basic understanding of how unstructured data can be transformed
into structured data suitable as inputs to machine learning (ML) methods (in fact, for
any type of modeling methods) that can potentially improve their financial forecasts.
This reading describes the steps in using big data, both structured and unstruc-
tured, in financial forecasting. The concepts and methods are then demonstrated in a
case study of an actual big data project. The project uses text-based data derived from
financial documents to train an ML model to classify text into positive or negative
sentiment classes for the respective stocks and then to predict sentiment.
Section 2 of the reading covers a description of the key characteristics of big data.
Section 3 provides an overview of the steps in executing a financial forecasting project
using big data. We then describe in Sections 4–6 key aspects of data preparation and
wrangling, data exploration, and model training using structured data and unstructured
(textual) data. In Section 7, we bring these pieces together by covering the execution
of an actual big data project. A summary in Section 8 concludes the reading.
3 Text preparation and wrangling. This step involves critical cleansing and
preprocessing tasks necessary to convert streams of unstructured data into a
format that is usable by traditional modeling methods designed for structured
inputs.
4 Text exploration. This step encompasses text visualization through techniques,
such as word clouds, and text feature selection and engineering.
The resulting output (e.g., sentiment prediction scores) can either be combined
with other structured variables or used directly for forecasting and/or analysis.
Next, we describe two key steps from the ML Model Building Steps depicted in
Exhibit 1 that typically differ for structured data versus textual big data: data/text
preparation and wrangling and data/text exploration. We then discuss model training.
Finally, we focus on applying these steps to a case study related to classifying and
predicting stock sentiment from financial texts.
Structured Data
(Exhibit: the model building steps for structured data, beginning with structured data input collection.)
EXAMPLE 1
Wang has been asked to lead a new analytics team at LendALot tasked with developing the
ML-based creditworthiness scoring model. In the context of machine learning
using structured data sources, address the following questions.
1 State and explain one decision Wang will need to make related to:
A conceptualizing the modeling task.
B data collection.
C data preparation and wrangling.
D data exploration.
E model training.
In a later phase of the project, LendALot attempts to improve its credit
scoring processes by incorporating textual data in credit scoring. Wang tells his
team, “Enhance the creditworthiness scoring model by incorporating insights
from text provided by the prospective borrowers in the loan application free
response fields.”
2 Identify the process step that Wang’s statement addresses.
3 State two potential needs of the LendAlot team in relation to text
curation.
4 State two potential needs of the LendAlot team in relation to text prepara-
tion and wrangling.
Solution to 1:
A In the conceptualization step, Wang will need to decide how the output
of the ML model will be specified (e.g., a binary classification of credit-
worthiness), how the model will be used and by whom, and how it will be
embedded in LendALot’s business processes.
B In the data collection phase, Wang must decide on what data—internal,
external, or both—to use for credit scoring.
C In the data preparation and wrangling step, Wang will need to decide on
data cleansing and preprocessing needs. Cleansing may entail resolving
missing values, extreme values, etc. Preprocessing may involve extracting,
aggregating, filtering, and selecting relevant data columns.
D In the data exploration phase, Wang will need to decide which exploratory
data analysis methods are appropriate, which features to use in building a
credit scoring model, and which features may need to be engineered.
E In the model training step, Wang must decide which ML algorithm(s)
to use. Assuming labeled training data are available, the choice will be
among supervised learning algorithms. Decisions will need to be made on
how model fit is measured and how the model is validated and tuned.
Solution to 2:
Wang’s statement relates to the initial step of text problem formulation.
Solution to 3:
Related to text curation, the team will be using internal data (from loan appli-
cations). They will need to ensure that the text comment fields on the loan
applications have been correctly implemented and enabled. If these fields are
not required, they need to ensure there is a sufficient response rate to analyze.
Exhibit 3 (Continued)

ID  Name      Gender  Date of Birth  Salary     Other Income  State             Credit Card
5   Ms. XYZ   F       15/1/1975      $60,500    $0                              Y
6   Mr. GHI   M       9/10/1942      NA         $55,000       TX                N
7   Mr. TUV   M       2/27/1956      $300,000   $50,000       CT                Y
8   Ms. DEF   F       4/4/1980       $55,000    $0            British Columbia  N
Data cleansing can be expensive and cumbersome because it involves the use of
automated, rule-based, and pattern recognition tools coupled with manual human
inspection to sequentially check for the aforementioned types of errors row by row
and column by column. The process involves a detailed data analysis as an initial
step in identifying various errors that are present in the data. In addition to a manual
inspection and verification of the data, analysis software, such as SPSS, can be used
to understand metadata (data that describes and gives information about other data)
about the data properties to use as a starting point to investigate any errors in the data.
The business value of the project determines the necessary quality of data cleansing
and subsequently the amount of resources used in the cleansing process. In case the
errors cannot be resolved due to lack of available resources, the data points with errors
can simply be omitted depending on the size of the dataset. For instance, if a dataset is
large with more than 10,000 rows, removing a few rows (approximately 100) may not
have a significant impact on the project. If a dataset is small with less than 1,000 rows,
every row might be important and deleting many rows thus harmful to the project.
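A minimal sketch of such rule-based checks in pandas; the toy data frame below is hypothetical and mimics the error types in Exhibit 3.

```python
# Hypothetical sketch: flagging incompleteness, invalidity, and duplication errors.
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ms. XYZ", "Mr. GHI", "Mr. TUV", "Mr. TUV"],
    "Salary": [60500, None, 300000, 300000],   # None is an incompleteness error
    "Other Income": [0, 55000, -100, -100],    # negative income is an invalidity error
})

print(df[df["Salary"].isna()])       # incomplete rows
print(df[df["Other Income"] < 0])    # invalid (out-of-range) values
df = df.drop_duplicates()            # removes the duplicated row
```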
1 The data shown for Ms. Beta contain what is best described as an:
A invalidity error.
B inaccuracy error.
C incompleteness error.
2 The data shown for Mr. Gamma contain what is best described as an:
A invalidity error.
B duplication error.
C incompleteness error.
3 The data shown for Ms. Delta contain what is best described as an:
A invalidity error.
B inaccuracy error.
C duplication error.
4 The data shown for Mr. Zeta contain what is best described as an:
A invalidity error.
B inaccuracy error.
C duplication error.
5 The process mentioned in Wang’s first statement is best described as:
A feature selection.
B feature extraction.
C feature engineering
6 Wang’s second statement is best described as:
A feature selection.
B feature extraction.
C feature engineering.
Solution to 1:
A is correct. This is an invalidity error because the data are outside of a mean-
ingful range. Income cannot be negative.
Solution to 2:
C is correct. This is an incompleteness error as the loan type is missing.
Solution to 3:
B is correct. This is an inaccuracy error because LendALot must know how
much they have lent to that particular borrower (who eventually repaid the loan
as indicated by the loan outcome of no default).
Solution to 4:
C is correct. Row 8 duplicates row 7: This is a duplication error.
Solution to 5:
A is correct. The process mentioned involves selecting the features to use. The
proposal makes sense; with “ID,” “Name” is not needed to identify an observation.
Solution to 6:
B is correct. The proposed feature is a ratio of two existing features. Feature
extraction is the process of creating (i.e., extracting) new variables from existing
ones in the data.
(Exhibit: a sample of cleansed text, "Robots Are Us.")
However, the source text that can be downloaded is not as clean. The raw text
contains html tags and formatting elements along with the actual text. Exhibit 7 shows
the raw text from the source.
The initial step in text processing is cleansing, which involves basic operations to
clean the text by removing unnecessary elements from the raw text. Text operations
often use regular expressions. A regular expression (regex) is a series that contains
characters in a particular order. Regex is used to search for patterns of interest in a
given text. For example, a regex "<.*?>" can be used to find all the html tags that are present in the raw text.
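A minimal sketch of applying that regex in Python; the raw text below is a hypothetical snippet.

```python
# Hypothetical sketch: removing html tags from raw text with a regex.
import re

raw = "<p>Robots Are Us reported <b>solid revenue growth</b>.</p>"
clean = re.sub(r"<.*?>", "", raw)    # "<.*?>" matches any html tag (non-greedy)
print(clean)                         # Robots Are Us reported solid revenue growth.
```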
(Exhibit: the sample texts after cleansing; e.g., Text 1, "The man went to the market today," and Text 4, "There is no market for the product.")
Similar to structured data, text data also require normalization. The normalization
process in text processing involves the following:
1 Lowercasing the alphabet removes distinctions among the same words due to
upper and lower cases. This action helps the computers to process the same
words appropriately (e.g., “The” and “the”).
2 Stop words are such commonly used words as “the,” “is,” and “a.” Stop words do
not carry a semantic meaning for the purpose of text analyses and ML training.
However, depending on the end-use of text processing, for advance text applica-
tions it may be critical to keep the stop words in the text in order to understand
the context of adjacent words. For ML training purposes, stop words typically
are removed to reduce the number of tokens involved in the training set. A
predefined list of stop words is available in programming languages to help with
this task. In some cases, additional stop words can be added to the list based
on the content. For example, the word “exhibit” may occur often in financial
filings, which in general is not a stop word but in the context of the filings can
be treated as a stop word.
3 Stemming is the process of converting inflected forms of a word into its base
word (known as stem). Stemming is a rule-based approach, and the results need
not necessarily be linguistically sensible. Stems may not be the same as the
morphological root of the word. Porter’s algorithm is the most popular method
for stemming. For example, the stem of the words “analyzed” and “analyzing”
is “analyz.” Similarly, the British English variant “analysing” would become
“analys.” Stemming is available in R and Python. The text mining package in R
provides a stemDocument function that uses this algorithm.
4 Lemmatization is the process of converting inflected forms of a word into
its morphological root (known as lemma). Lemmatization is an algorithmic
approach and depends on the knowledge of the word and language structure.
For example, the lemma of the words “analyzed” and “analyzing” is “analyze.”
Lemmatization is computationally more expensive and advanced.
Stemming or lemmatization will reduce the repetition of words occurring in var-
ious forms and maintain the semantic structure of the text data. Stemming is more
common than lemmatization in the English language since it is simpler to perform.
In text data, data sparseness refers to words that appear very infrequently, resulting
in data consisting of many unique, low frequency tokens. Both techniques decrease
data sparseness by aggregating many sparsely occurring words in relatively less sparse
stems or lemmas, thereby aiding in training less complex ML models.
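To make the stemming/lemmatization distinction concrete, the following is a minimal Python sketch using the NLTK package (an assumed tool choice; the reading itself cites R's text mining package and its stemDocument function). The lemmatizer requires a one-time download of the WordNet corpus:

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    print(stemmer.stem("analyzed"), stemmer.stem("analyzing"))  # analyz analyz
    print(stemmer.stem("analysing"))                            # analys

    # One-time setup: import nltk; nltk.download("wordnet")
    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("analyzed", pos="v"))            # analyze
    print(lemmatizer.lemmatize("analyzing", pos="v"))           # analyze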
The last step of text preprocessing is using the final BOW after normalizing to build
a document term matrix (DTM). DTM is a matrix that is similar to a data table for
structured data and is widely used for text data. Each row of the matrix belongs to a
document (or text file), and each column represents a token (or term). The number of
rows of DTM is equal to the number of documents (or text files) in a sample dataset.
The number of columns is equal to the number of tokens from the BOW that is built
using all the documents in a sample dataset. The cells can contain the counts of the
number of times a token is present in each document. The matrix cells can also be
filled with other values, which will be explained in the financial forecasting project
section of this reading, where a large dataset is helpful in understanding the concepts. At this point,
the unstructured text data are converted to structured data that can be processed
further and used to train the ML model. Exhibit 11 shows a DTM constructed from
the resultant BOW of the four texts from Exhibit 10.
Exhibit 11 DTM of Four Texts and Using Normalized BOW Filled with Counts
of Occurrence
         man  went  market  today  valu  increas  need  product
Text 1     1     1       1      1     0        0     0        0
Text 2     0     0       1      0     1        1     0        0
Text 3     0     0       1      0     0        1     1        0
Text 4     0     0       1      0     0        0     0        1
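A minimal sketch of DTM construction using scikit-learn's CountVectorizer (an assumed tool choice); the four normalized texts are reconstructed here from the tokens shown in Exhibit 11:

    from sklearn.feature_extraction.text import CountVectorizer

    texts = [
        "man went market today",   # Text 1 (normalized)
        "market valu increas",     # Text 2 (assumed from Exhibit 11)
        "market increas need",     # Text 3 (assumed from Exhibit 11)
        "market product",          # Text 4 (assumed from Exhibit 11)
    ]
    vectorizer = CountVectorizer()
    dtm = vectorizer.fit_transform(texts)     # rows = documents, columns = tokens
    print(vectorizer.get_feature_names_out()) # columns are ordered alphabetically
    print(dtm.toarray())                      # cells hold counts of occurrence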
Solution to 2:
C is correct. Some punctuations, such as percentage signs, currency symbols,
and question marks, may be useful for ML model training, so when such punc-
tuations are removed, annotations should be added.
Solution to 3:
A is correct. Each column of a document term matrix represents a token from
the bag-of-words that is built using all the documents in a sample dataset.
Solution to 4:
B is correct. A cell in a document term matrix contains a count of the number
of tokens of the kind indicated in the column heading.
Solution to 5:
C is correct. The other choices are related to text cleansing.
Solution to 6:
A is correct. Stemming, the process of converting inflected word forms into a
base word (or stem), is one technique that can address the problem described.
Solution to 7:
C is correct, by definition. The other choices are not true.
Data Exploration
Data Collection/Curation → Data Preparation and Wrangling → Data Exploration (Exploratory Data Analysis, Feature Selection, Feature Engineering) → Model Training → Results
Feature Selection
Structured data consist of features, represented by different columns of data in a table
or matrix. After using EDA to discover relevant patterns in the data, it is essential to
identify and remove unneeded, irrelevant, and redundant features. Basic diagnostic
testing should also be performed on features to identify redundancy, heteroscedasticity,
and multicollinearity. The objective of the feature selection process is to assist
in identifying significant features that when used in a model retain the important
patterns and complexities of the larger dataset while requiring fewer data overall.
This last point is important since computing power is not free (i.e., explicit costs and
processing time).
Typically, structured data even after the data preparation step can contain features
that do not contribute to the accuracy of an ML model or that negatively affect the
quality of ML training. The most desirable outcome is a parsimonious model with
fewer features that provides the maximum predictive power out-of-sample.
Feature selection must not be confused with the data preprocessing steps during
data preparation. Good feature selection requires an understanding of the data and
statistics, and comprehensive EDA must be performed to assist with this step. In
contrast, data preprocessing during data preparation requires only clarification from
data administrators and basic intuition (e.g., salary vs. income).
Feature selection on structured data is a methodical and iterative process. Statistical
measures can be used to assign a score gauging the importance of each feature. The
features can then be ranked using this score and either retained or eliminated from
the dataset. The statistical methods utilized for this task are usually univariate and
consider each feature independently or with regard to the target variable. Methods
include chi-square test, correlation coefficients, and information-gain measures (i.e.,
R-squared values from regression analysis). All of these statistical methods can be
combined in a manner that uses each method individually on each feature, automatically
performing backward and forward passes over features to improve feature selection.
Prebuilt feature selection functions are available in popular programming languages
used to build and train ML models.
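As an illustration of such prebuilt functions, the following scikit-learn sketch (with made-up data) scores features against the target using the chi-square test and retains the two highest-scoring features:

    import numpy as np
    from sklearn.feature_selection import SelectKBest, chi2

    # Hypothetical data: 4 observations, 3 non-negative features (chi2 requires non-negative inputs)
    X = np.array([[1, 0, 3], [0, 1, 0], [2, 0, 5], [0, 2, 1]])
    y = np.array([1, 0, 1, 0])
    selector = SelectKBest(score_func=chi2, k=2)
    X_selected = selector.fit_transform(X, y)
    print(selector.scores_)   # chi-square importance score per feature
    print(X_selected.shape)   # (4, 2): only the top two features are retained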
Dimensionality reduction assists in identifying the features in the data that account
for the greatest variance between observations and allows for the processing of a
reduced volume of data. Dimensionality reduction may be implemented to reduce
a large number of features, which helps reduce the memory needed and speed up
learning algorithms. Feature selection is different from dimensionality reduction, but
both methods seek to reduce the number of features in the dataset. The dimension-
ality reduction method creates new combinations of features that are uncorrelated,
whereas feature selection includes and excludes features present in the data without
altering them.
Feature Engineering
After the appropriate features are selected, feature engineering helps further optimize
and improve the features. The success of ML model training depends on how well the
data are presented to the model. The feature engineering process attempts to pro-
duce good features that describe the structures inherent in the dataset. This process
depends on the context of the project, domain of the data, and nature of the problem.
Structured data are likely to contain quantities, which can be engineered to better
present relevant patterns in the dataset. This action involves engineering an existing
feature into a new feature or decomposing it into multiple features.
For continuous data, a new feature may be created—for example, by taking the
logarithm of the product of two or more features. As another example, when con-
sidering a salary or income feature, it may be important to recognize that different
salary brackets impose a different taxation rate. Domain knowledge can be used to
decompose an income feature into different tax brackets, resulting in a new feature:
“income_above_100k,” with possible values 0 and 1. The value 1 under the new feature
captures the fact that a subject has an annual salary of more than $100,000. By group-
ing subjects into income categories, assumptions about income tax can be made and
utilized in a model that uses the income tax implications of higher and lower salaries
to make financial predictions.
For categorical data, for example, a new feature can be a combination (e.g., sum
or product) of two features or a decomposition of one feature into many. If a single
categorical feature represents education level with five possible values—high school,
associates, bachelor’s, master’s, and doctorate—then these values can be decomposed
into five new features, one for each possible value (e.g., is_highSchool, is_doctorate)
filled with 0s (for false) and 1s (for true). The process in which categorical variables are
converted into binary form (0 or 1) for machine reading is called one hot encoding.
It is one of the most common methods for handling categorical features in text data.
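A minimal pandas sketch of one hot encoding (an assumed tool choice; equivalent functions exist in other libraries):

    import pandas as pd

    df = pd.DataFrame({"education": ["high school", "bachelors", "doctorate"]})
    # Each education level becomes its own 0/1 column (is_bachelors, is_doctorate, ...)
    encoded = pd.get_dummies(df, columns=["education"], prefix="is", dtype=int)
    print(encoded)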
When date-time is present in the data, such features as “second of the hour,” “hour
of the day,” and “day of the date” can be engineered to capture critical information
about temporal data attributes—which are important, for example, in modeling
trading algorithms.
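A minimal pandas sketch of such date-time decomposition (the timestamps are hypothetical):

    import pandas as pd

    ts = pd.Series(pd.to_datetime(["2019-03-15 09:30:05", "2019-03-15 16:00:00"]))
    features = pd.DataFrame({
        "hour_of_day": ts.dt.hour,        # e.g., useful for intraday trading patterns
        "day_of_month": ts.dt.day,
        "second_of_minute": ts.dt.second,
    })
    print(features)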
Feature engineering techniques systemically alter, decompose, or combine existing
features to produce more meaningful features. More meaningful features allow an
ML model to train more swiftly and easily. Different feature engineering strategies
can lead to the generation of dramatically different results from the same ML model.
The impact of feature selection and engineering on ML training is discussed further
in the next section.
Feature Engineering
As with structured data, feature engineering can greatly improve ML model training
and remains a combination of art and science. The following are some techniques for
feature engineering, which may overlap with text processing techniques.
1 Numbers: In text processing, numbers are converted into a token, such as “/
number/.” However, numbers with different digit lengths can represent different
kinds of numbers, so it may be useful to convert different numbers
into different tokens. For example, numbers with four digits may indicate years,
and numbers with many digits could be an identification number. Four-digit
numbers can be replaced with “/number4/,” 10-digit numbers with “/num-
ber10/,” and so forth.
2 N-grams: Multi-word patterns that are particularly discriminative can be identi-
fied and their connection kept intact. For example, “market” is a common word
that can be indicative of many subjects or classes; the words “stock market”
are used in a particular context and may be helpful to distinguish general texts
from finance-related texts. Here, a bigram would be useful as it treats the two
adjacent words as a single token (e.g., stock_market).
3 Named entity recognition (NER): NER is an extensive procedure available as a
library or package in many programming languages. The named entity rec-
ognition algorithm analyzes the individual tokens and their surrounding
semantics while referring to its dictionary to tag an object class to the token.
Exhibit 19 shows the NER tags of the text “CFA Institute was formed in 1947
and is headquartered in Virginia.” Additional object classes are, for example,
MONEY, TIME, and PERCENT, which are not present in the example text. The
NER tags, when applicable, can be used as features for ML model training for
better model performance. NER tags can also help identify critical tokens on
which such operations as lowercasing and stemming then can be avoided (e.g.,
Institute here refers to an organization rather than a verb). Such techniques
make the features more discriminative.
4 Parts of speech (POS): Similar to NER, parts of speech uses language structure
and dictionaries to tag every token in the text with a corresponding part of
speech. Some common POS tags are noun, verb, adjective, and proper noun.
Exhibit 19 shows the POS tags and descriptions of tags for the example text.
POS tags can be used as features for ML model training and to identify the
number of tokens that belong to each POS tag. If a given text contains many
proper nouns, it may be related to people and organizations and thus may
cover a business topic. POS tags can be useful for separating verbs and nouns
for text analytics. For example, the word “market” can be a verb when used as
“to market …” or noun when used as “in the market.” Differentiating such tokens
can help further clarify the meaning of the text. The use of “market” as a verb
could indicate that the text relates to the topic of marketing and might discuss
marketing a product or service. The use of “market” as a noun could suggest
that the text relates to a physical or stock market and might discuss stock
trading. Also for POS tagging, such compound nouns as “CFA Institute” can
be treated as a single token. POS tagging can be performed using libraries or
packages in programming languages.
In addition, many more creative techniques convey text information in a struc-
tured way to the ML training process. The goal of feature engineering is to maintain
the semantic essence of the text while simplifying and converting it into structured
data for ML.
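A minimal NLTK sketch of two of the techniques above, POS tagging and bigram construction (an assumed tool choice; NER requires additional model resources and is omitted here):

    import nltk

    # One-time setup (assumption: standard NLTK resources):
    # nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
    tokens = nltk.word_tokenize("CFA Institute was formed in 1947")
    print(nltk.pos_tag(tokens))  # e.g., [('CFA', 'NNP'), ('Institute', 'NNP'), ...]
    # Join adjacent tokens into bigram features such as "CFA_Institute"
    print(["_".join(pair) for pair in nltk.bigrams(tokens)])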
EXAMPLE 4
Data Exploration
Paul Wang’s analytics team at LendALot Corporation has completed its initial
data preparation and wrangling related to their creditworthiness classification
ML model building efforts. As a next step, Wang has asked one of the team
members, Eric Kim, to examine the available structured data sources to see what
types of exploratory data analysis might make sense. Kim has been tasked with
reporting to the team on high-level patterns and trends in the data and which
variables seem interesting. Greater situational awareness about the data can
inform the team’s decisions regarding model training and whether (and how) to
incorporate textual big data in conjunction with the structured data inputs. Use
the following sample of columns and rows Kim pulled for manual examination
to answer the next questions.
Solution to 1:
Lee and Kim should consider bag-of-words (BOW), n-grams, and parts-of-speech
(POS) as key textual feature representations for their text data. Conversely, named
entity recognition (NER) might not be as applicable in this context because the
data on prospective borrowers do not include any explicit references to people,
locations, dates, or organizations.
Solution to 2:
All three textual feature representations have the potential to add value.
Bag-of-words (BOW) is typically applicable in most contexts involving text
features derived from languages where token boundaries are explicitly present
(e.g., English) or can be inferred through processing (e.g., a different language,
such as Spanish). BOW is generally the best starting point for most projects
exploring text feature representations.
N-grams, representations of word or token sequences, are also applicable.
N-grams can offer invaluable contextual information that can complement and
enrich a BOW. In this specific credit-worthiness context, we examine the BOW
token “worked.” It appears three times (rows 5–7), twice in no-default loan texts
and once in a defaulted loan text. This finding suggests that “worked” is being
used to refer to the borrower’s work ethic and may be a good predictor of credit
worthiness. Digging deeper and looking at several trigrams (i.e., three-token
sequences) involving “worked,” we see that “have_worked_hard” appears in the
two no-default loan related texts (referring to borrower accomplishments and
plans) and “had_worked_harder” appears in the defaulted loan text (referring to
what could have been done). This example illustrates how n-grams can provide
richer contextualization capabilities for the creditworthiness prediction ML
models.
Parts-of-speech tags can add value because they identify the composition of
the texts. For example, POS provides information on whether the prospective
borrowers are including many action words (verbs) or descriptors (adjectives)
and whether this is being done differently in instances of no-default versus
instances of defaulted loans.
MODEL TRAINING
Machine learning model training is a systematic, iterative, and recursive process. The
number of iterations required to reach optimum results depends on:
■■ the nature of the problem and input data and
■■ the level of model performance needed for practical application.
Machine learning models combine multiple principles and operations to provide
predictions. As seen in the last two sections, typical ML model building requires
data preparation and wrangling (cleansing and preprocessing) and data exploration
(exploratory data analysis as well as feature selection and engineering). In addition,
domain knowledge related to the nature of the data is required for good model build-
ing and training. For instance, knowledge of investment management and securities
trading is important when using financial data to train a model for predicting costs
of trading stocks. It is crucial for ML engineers and domain experts to work together
in building and training robust ML models.
The three tasks of ML model training are method selection, performance evaluation,
and tuning. Exhibit 20 outlines model training and its three component tasks. Method
selection is the art and science of deciding which ML method(s) to incorporate and
is guided by such considerations as the classification task, type of data, and size of
data. Performance evaluation entails using an array of complementary techniques and
measures to quantify and understand a model’s performance. Tuning is the process of
undertaking decisions and actions to improve model performance. These steps may be
repeated multiple times until the desired level of ML model performance is attained.
Although no standard rulebook for training an ML model exists, having a fundamental
understanding of domain-specific training data and ML algorithm principles plays a
vital role in good model training.
Model Training
Data Collection/Curation → Data Preparation and Wrangling → Data Exploration → Model Training (Method Selection, Performance Evaluation, Tuning) → Results
Exhibit 22 depicts the idea of undersampling of the majority class and oversampling
of the minority class. In practice, the choice of whether to undersample or oversample
depends on the specific problem context. Advanced techniques can also reproduce
synthetic observations from the existing data, and the new observations can be added
to the dataset to balance the minority class.
Performance Evaluation
It is important to measure the model training performance or goodness of fit for vali-
dation of the model. We shall cover several techniques to measure model performance
that are well suited specifically for binary classification models.
1 Error analysis. For classification problems, error analysis involves computing
four basic evaluation metrics: true positive (TP), false positive (FP), true nega-
tive (TN), and false negative (FN) metrics. FP is also called a Type I error, and
FN is also called a Type II error. Exhibit 23 shows a confusion matrix, a grid
that is used to summarize values of these four metrics.
[Exhibit 23: a confusion matrix tabulating predicted results against actual values, with cells for the TP, FP, TN, and FN counts.]
Trading off precision and recall is subject to business decisions and model
application. Therefore, additional evaluation metrics that provide the overall
performance of the model are generally used. The two overall performance
metrics are accuracy and F1 score. Accuracy is the percentage of correctly
predicted classes out of total predictions. F1 score is the harmonic mean of
precision and recall. F1 score is more appropriate (than accuracy) when unequal
class distribution is in the dataset and it is necessary to measure the equilib-
rium of precision and recall. High scores on both of these metrics suggest good
model performance. The formulas for accuracy and F1 score are as follows:
Accuracy = (TP + TN)/(TP + FP + TN + FN). (5)
F1 score = (2 * P * R)/(P + R). (6)
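These formulas, together with precision, P = TP/(TP + FP), and recall, R = TP/(TP + FN), are simple to compute. A minimal sketch with made-up confusion-matrix counts:

    def classification_metrics(tp, fp, tn, fn):
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        accuracy = (tp + tn) / (tp + fp + tn + fn)            # Equation 5
        f1 = (2 * precision * recall) / (precision + recall)  # Equation 6
        return precision, recall, accuracy, f1

    # Hypothetical counts: TP=90, FP=10, TN=80, FN=20
    print(classification_metrics(tp=90, fp=10, tn=80, fn=20))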
Bias error is high when the model is overly simplified and fails to capture the underlying
patterns in the training data. Variance error is high when the model is overly complicated
and memorizes the training data so much that it will likely perform poorly on new
data. It is not possible to completely eliminate both types of errors. However, both
errors can be minimized so the total aggregate error (bias error + variance error) is
at a minimum. The bias–variance trade-off is critical to finding an optimum balance
where a model neither underfits nor overfits.
1 Parameters are critical for a model and are dependent on the training data.
Parameters are learned from the training data as part of the training process
by an optimization technique. Examples of parameters include coefficients in
regression, weights in NN, and support vectors in SVM.
2 Hyperparameters are used for estimating model parameters and are not depen-
dent on the training data. Examples of hyperparameters include the regulariza-
tion term (λ) in supervised models, activation function and number of hidden
layers in NN, number of trees and tree depth in ensemble methods, k in k-near-
est neighbor classification and k-means clustering, and p-threshold in logistic
regression. Hyperparameters are manually set and tuned.
For example, if a researcher is using a logistic regression model to classify sentences
from financial statements into positive or negative stock sentiment, the initial cutoff
point for the trained model might be a p-threshold of 0.50 (50%). Therefore, any sen-
tence for which the model produces a probability >50% is classified as having positive
sentiment. The researcher can create a confusion matrix from the classification results
(of running the CV dataset) to determine such model performance metrics as accuracy
and F1 score. Next, the researcher can vary the logistic regression’s p-threshold—say
to 0.55 (55%), 0.60 (60%), or even 0.65 (65%)—and then re-run the CV set, create new
confusion matrixes from the new classification results, and compare accuracy and F1
scores. Ultimately, the researcher would select the logistic regression model with a
p-threshold value that produces classification results generating the highest accuracy
and F1 scores. Note that the process just outlined will be demonstrated in Section 7.
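A minimal sketch of this p-threshold tuning loop using scikit-learn, with synthetic data standing in for the sentiment dataset:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, f1_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, random_state=0)  # synthetic stand-in data
    X_train, X_cv, y_train, y_cv = train_test_split(X, y, random_state=0)
    model = LogisticRegression().fit(X_train, y_train)
    p = model.predict_proba(X_cv)[:, 1]  # predicted probability of the positive class
    for threshold in (0.50, 0.55, 0.60, 0.65):
        y_pred = (p > threshold).astype(int)
        print(threshold, accuracy_score(y_cv, y_pred), f1_score(y_cv, y_pred))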
There is no general formula to estimate hyperparameters. Thus, tuning heuristics
and such techniques as grid search are used to obtain the optimum values of hyper-
parameters. Grid search is a method of systematically training an ML model by using
various combinations of hyperparameter values, cross validating each model, and
determining which combination of hyperparameter values ensures the best model
performance. The model is trained using different combinations of hyperparameter
values until the optimum set of values is found. Optimum values must result in sim-
ilar performance of the model on training and CV datasets, meaning that the training
error and CV error are close. This ensures that the model can be generalized to test
data or to new data and thus is less likely to overfit. The plot of training errors for
each value of a hyperparameter (i.e., changing model complexity) is called a fitting
curve. Fitting curves provide visual insight on the model’s performance (for the given
hyperparameter and level of model complexity) on the training and CV datasets and
are visually helpful to tune hyperparameters. Exhibit 26 shows the bias–variance error
trade-off by plotting a generic fitting curve for a regularization hyperparameter (λ).
[Exhibit 26 (generic fitting curve): training error (Error_train) and cross-validation error (Error_cv) plotted against the regularization hyperparameter lambda (λ). With slight regularization (small λ), Error_cv >> Error_train, indicating overfitting; with large regularization, both errors rise, indicating underfitting; the optimum regularization lies in between, where both errors are small and close to each other.]
The raw text contains punctuations, numbers, and white spaces that may not be
necessary for model training. Text cleansing involves removing, or incorporating
appropriate substitutions for, potentially extraneous information present in the text.
Operations to remove html tags are unnecessary because none are present in the text.
Punctuations: Before stripping out punctuations, percentage and dollar symbols are
substituted with word annotations to retain their essence in the financial texts. Such
word annotation substitutions convey that percentage and currency-related tokens were
involved in the text. As the sentences have already been identified within and extracted
from the source text, punctuation marks that help identify discrete sentences, such as
periods, semicolons, and commas, are removed. Some special characters, such as
“+” and “©,” are also removed. It is a good practice to implement word annotation
substitutions before removing the rest of the punctuations.
Numbers: Numerical values of numbers in the text have no significant utility for
sentiment prediction in this project because sentiment primarily depends on the words
in a sentence. Here is an example sentence: “Ragutis, which is based in Lithuania's
second-largest city, Kaunas, boosted its sales last year 22.3 percent to 36.4 million
litas.” The word “boosted” implies that there was growth in sales, so analysis of this
sentiment does not need to rely on interpretation of numerical text data. Sentiment
analysis typically does not involve extracting, interpreting, and calculating relevant
numbers but instead seeks to understand the context in which the numbers are used.
Other commonly occurring numbers are dates and years, which are also not required
to predict sentence sentiment. Thus, all numbers present in the text are removed for
this financial sentiment project. However, prior to removing numbers, abbreviations
representing orders of magnitude, such as million (commonly represented by “m,”
“mln,” or “mn”), billion, or trillion, are replaced with the complete word. Retaining
these orders of magnitude-identifying words in the text preserves the original text
meaning and can be useful in predicting sentence sentiment.
Whitespaces: White spaces are present in the raw text. Additional white spaces
occur after performing the above operations to remove extraneous characters. The
white spaces must be removed to keep the text intact. Exhibit 29 shows the sample
text after cleansing. The cleansed text is free of punctuations and numbers, with
useful substitutions.
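A minimal Python sketch (with a hypothetical sentence) showing one possible ordering of these cleansing operations: annotate symbols, expand magnitude abbreviations, then strip punctuation, numbers, and extra white spaces:

    import re

    raw = "Sales rose 22.3% to EUR 36.4 mln in 2007."  # hypothetical sentence
    text = raw.replace("%", " percentSign ")           # annotate before removing punctuation
    text = re.sub(r"\bmln\b", "million", text)         # expand magnitude abbreviation first
    text = re.sub(r"[^\w\s]", " ", text)               # remove remaining punctuation
    text = re.sub(r"\d+", "", text)                    # remove numbers
    text = re.sub(r"\s+", " ", text).strip()           # collapse extra white spaces
    print(text)  # Sales rose percentSign to EUR million in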
Sentence | Sentiment
Profit before taxes amounted to EUR million down from EUR million a year ago | negative
Profit before taxes decreased by percentSign to EUR million in the first nine months of compared to EUR million a year earlier | negative
Profit before taxes decreased to EUR million from EUR million the year before | negative
Profit before taxes was EUR million down from EUR million | negative
The companys profit before taxes fell to EUR million in the third quarter of compared to EUR million in the corresponding period in | negative
In August October the companys result before taxes totalled EUR million up from EUR million in the corresponding period in | positive
Finnish Bore that is owned by the Rettig family has grown recently through the acquisition of smaller shipping companies | positive
The plan is estimated to generate some EUR million USD million in cost savings on an annual basis | positive
Exhibit 29 (Continued)
Sentence | Sentiment
Finnish pharmaceuticals company Orion reports profit before taxes of EUR million in the third quarter of up from EUR million in the corresponding period in | positive
Finnish Sampo Bank of Danish Danske Bank group reports profit before taxes of EUR million in up from EUR million in | positive

The following rows show the same sentences after normalization (lowercasing and stemming):

profit befor tax amount to currencysign million down from currencysign million a year ago | negative
profit befor tax decreas by percentsign to currencysign million in the first nine month of compar to currencysign million a year earlier | negative
profit before tax decreas to currencysign million from currencysign million the year befor | negative
profit befor tax was currencysign million down from currencysign million | negative
the compani profit befor tax fell to currencysign million in the third quarter of compar to currencysign million in the correspond period in | negative
in august octob the compani result befor tax total currencysign million up from currencysign million in the correspond period in | positive
finnish bore that is own by the rettig famili has grown recent through the acquisit of smaller shipping company | positive
the plan is estim to generat some currencysign million currencysign million in cost save on an annual basi | positive
[Histogram: distribution of the number of characters in a sentence (x-axis, 0 to 300), with sentence counts on the y-axis (0 to 300).]
Word clouds are a convenient method of visualizing the text data because they
enable rapid comprehension of a large number of tokens and their corresponding
weights. Exhibit 35 shows a word cloud for all the sentences in the corpus. The font
sizes of the words are proportionate to the number of occurrences of each word in the
corpus. Similarly, Exhibit 36 shows the word cloud divided into two halves: one half
representing negative sentiment class sentences (upper half ); one half representing
positive sentiment class sentences (lower half ). Notably, some highly discriminative
stems and words, such as “decreas” and “down” in the negative half and “increas” and
“rose” in the positive half, are present. The feature selection process will eliminate
common words and highlight useful words for better model training.
Feature Selection
Exploratory data analysis revealed the most frequent tokens in the texts that could
potentially add noise to this ML model training process. In addition to common tokens,
many rarely occurring tokens, often proper nouns (i.e., names), are not informative
for understanding the sentiment of the sentence. Further analyses must be conducted
to decide which words to eliminate. Feature selection for text data involves keeping
the useful tokens in the BOW that are informative and help to discriminate different
classes of texts—those with positive sentiment and those with negative sentiment. At
this point, a total of 44,151 non-unique tokens are in the 2,180 sentences.
Frequency analysis on the processed text data helps in filtering unnecessary
tokens (or features) by quantifying how important tokens are in a sentence and in the
corpus as a whole. Term frequency (TF) at the corpus level—also known as collection
frequency (CF)—is the number of times a given word appears in the whole corpus
© CFA Institute. For candidate use only. Not for distribution.
Financial Forecasting Project: Classifying and Predicting Sentiment for Stocks 569
For example, TF at the sentence level for the word “the” in sentences num-
ber 701 and 223 is calculated as 6/39 = 0.1538462 and 5/37 = 0.1351351,
respectively.
8 DF (Document Frequency): Defined as the number of documents (i.e., sentences)
that contain a given word divided by the total number of sentences (here,
2,180). Document frequency is important since words frequently occurring
across sentences provide no differentiating information in each sentence. The
following equation can be used to compute DF:
DF = SentenceCountWithWord/Total number of sentences. (12)
For example, DF of the word “the” is 1,453/2,180 = 0.6665138; so, 66.7% of the
sentences contain the word “the.” A high DF indicates high word frequency in
the text.
9 IDF (Inverse Document Frequency): A relative measure of how unique a term
is across the entire corpus. Its meaning is not directly related to the size of the
corpus. The following equation can be used to compute IDF:
IDF = log(1/DF). (13)
For example, IDF of the word “the” is log(1/0.6665138) = 0.4056945. A low IDF
indicates high word frequency in the text.
10 TF–IDF: To get a complete representation of the value of each word, TF at
the sentence level is multiplied by the IDF of a word across the entire dataset.
Higher TF–IDF values indicate words that appear more frequently within a
smaller number of documents; such terms are relatively unique and therefore
potentially important. Conversely, a low TF–IDF value indicates terms that appear in
many documents. TF–IDF values can be useful in measuring the key terms
across a compilation of documents and can serve as word feature values for
training an ML model. The following equation can be used to compute TF–IDF:
TF–IDF = TF × IDF. (14)
TF or TF–IDF values are placed at the intersection of sentences (rows) and terms
(columns) of the document term matrix. For this project, TF values are used for the
DTM as the texts are sentences rather than paragraphs or other larger bodies of
text. TF–IDF values vary by the number of documents in the dataset; therefore, the
model performance can vary when applied to a dataset with just a few documents.
In addition to removing custom stop words and sparse terms, single character letters
are also eliminated because they do not add any value to the sentiment significance.
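These calculations can be reproduced directly; note that the IDF figures in this reading use the natural logarithm:

    import math

    total_sentences = 2180
    df = 1453 / total_sentences  # DF of "the": 0.6665138 (Equation 12)
    idf = math.log(1 / df)       # natural log: 0.4056945 (Equation 13)
    tf = 6 / 39                  # TF of "the" in sentence 701: 0.1538462
    tf_idf = tf * idf            # Equation 14
    print(round(df, 7), round(idf, 7), round(tf_idf, 7))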
Feature Engineering
N-grams are used as a feature engineering process in this project. Use of n-grams
helps to understand the sentiment of a sentence as a whole. As mentioned previously,
the objective of this project is to predict sentiment class (positive and negative) from
financial texts. Both unigrams and bigrams are implemented, and the BOW is created
from them. Bigram tokens are helpful for keeping negations intact in the text, which
is vital for sentiment prediction. For example, the tokens “not” and “good” or “no”
and “longer” can be formed into single tokens, now bigrams, such as “not_good”
and “no_longer.” These and similar tokens can be useful during ML model training
and can improve model performance. Exhibit 41 shows a sample of 100 words from
the BOW containing both unigram and bigram tokens after removal of custom stop
words, sparse terms, and single characters. Note that the BOW contains such tokens
as increas, loss, loss_prior, oper_rose, tax_loss, and sale_increas. Such tokens are
informative about the embedded sentiment in the texts and are useful for training
an ML model. The corresponding word frequency measures for the document term
matrix are computed based on this new BOW.
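A minimal scikit-learn sketch of building a combined unigram-and-bigram BOW (note that scikit-learn joins n-gram tokens with spaces rather than underscores):

    from sklearn.feature_extraction.text import CountVectorizer

    texts = ["the outlook is not good", "profit is no longer falling"]  # hypothetical
    vectorizer = CountVectorizer(ngram_range=(1, 2))  # keep unigrams and bigrams
    bow = vectorizer.fit_transform(texts)
    print(vectorizer.get_feature_names_out())  # includes 'not good' and 'no longer'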
EXAMPLE 6
Similarly, the DTMs for the CV set and the test set are built using tokens from the
final training BOW for tuning, validating, and testing of the model. To be clear, the
final BOW from the training corpus is used for building DTMs across all the splits
because the model has been trained on that final BOW. Thus, the columns (think,
features) of all three DTMs are the same, but the number of rows varies because a
different number of sentences are in each split. The DTMs are filled with resultant
term frequency values calculated using sentences in the corpuses of the respective
splits—sentences from the CV set corpus and sentences from the test set corpus.
Exhibit 42 tabulates the summary of dimensions of the data splits and their uses in
the model training process. As mentioned, the columns of DTMs for the splits are
the same, equal to the number of unique tokens (i.e., features) from the final training
corpus BOW, which is 9,188. Note that this number of unique tokens (9,188) differs
from that in the master corpus (11,501) based on the sentences that are included in
the training corpus after the random sampling.
Method Selection
Alternative ML methods, including SVM, decision trees, and logistic regression, were
examined because these techniques are all considered potentially suitable for this
particular task (i.e., supervised learning), type of data (i.e., text), and size of data (i.e.,
wider data with many potential variables). The SVM and logistic regression methods
appeared to offer better performance than decision trees. For brevity, we discuss
logistic regression in the remainder of this reading. Logistic regression was used to
train the model, using the training corpus DTM containing 1,309 sentences. As a
reminder, in this project texts are the sentences and the classifications are positive
and negative sentiment classes (labeled 1 and 0, respectively). The tokens are feature
variables, and the sentiment class is the target variable. Text data typically contain
thousands of tokens. These result in sparse DTMs because each column represents
a token feature and the values are mostly zeros (i.e., not all the tokens are present
in every text). Logistic regression can deal with such sparse training data because
the regression coefficients will be close to zero for tokens that are not present in a
significant number of sentences. This allows the model to ignore a large number of
minimally useful features. Regularization further helps lower the coefficients when
the features rarely occur and do not contribute to the model training.
Logistic regression is applied on the final training DTM for model training. As
this method uses maximum likelihood estimation, the output of the logistic model is
a probability value ranging from 0 to 1. However, because the target variable is binary,
coefficients from the logistic regression model are not directly used to predict the
value of the target variable. Rather, a mathematical function uses the logistic regression
coefficient (β) to calculate the probability (p) of a sentence having positive sentiment (y =
1).3 If p for a sentence is 0.90, there is a 90% likelihood that the sentence has positive
sentiment. Theoretically, the sentences with p > 0.50 likely have positive sentiment.
Because this is not always true in practice, however, it is important to find an ideal
threshold value of p. We elaborate on this point in a subsequent example. The thresh-
old value is a cutoff point for p values, and the ideal threshold p value is influenced
by the dataset and model training. When the p values (i.e., probability of sentences
having positive sentiment) of sentences are above this ideal threshold p value, then
the sentences are highly likely to have positive sentiment (y = 1). The ideal threshold
p value is estimated heuristically using performance metrics and ROC curves, as will
be demonstrated shortly.
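A minimal sketch of this transformation (the exponential form given in footnote 3 below), with hypothetical coefficients and feature values:

    import numpy as np

    def p_positive(beta0, betas, x):
        # P(y = 1) = 1/(1 + exp(-(beta0 + beta1*x1 + ... + betan*xn)))
        return 1.0 / (1.0 + np.exp(-(beta0 + np.dot(betas, x))))

    # Hypothetical intercept, coefficients, and token feature values
    print(p_positive(0.5, np.array([1.2, -0.8]), np.array([1.0, 0.25])))  # approx. 0.82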
[Exhibit 43 ROC Curves of Model Results for Training and CV Data Before Regularization: Panel A, ROC curve for training data (AUC = 96.5%); Panel B, ROC curve for CV data (AUC = 86.2%). Each panel plots the true positive rate against the false positive rate.]
As the model is overfitted, least absolute shrinkage and selection operator (LASSO)
regularization is applied to the logistic regression. LASSO regularization penalizes the
coefficients of the logistic regression to prevent overfitting of the model. The penal-
ized regression will select the tokens (features) that have statistically significant (i.e.,
non-zero) coefficients and that contribute to the model fit; LASSO does this while
disregarding the other tokens. Exhibit 44 shows the ROC curves for the new model
after regularization.
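A minimal scikit-learn sketch of L1-penalized (LASSO) logistic regression on synthetic data; in scikit-learn, C is the inverse of the regularization strength λ:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Synthetic stand-in for a sparse DTM: 50 features, only 5 informative
    X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                               random_state=0)
    lasso_logit = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
    lasso_logit.fit(X, y)
    # The L1 penalty drives coefficients of uninformative features to exactly zero
    print((lasso_logit.coef_ != 0).sum(), "non-zero coefficients out of 50")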
3 This mathematical function is an exponential function of the form: P(y = 1) = 1/(1 + exp(−(β0 + β1x1 + β2x2 + … + βnxn))).
[Exhibit 47: charts of model performance measures plotted against threshold p values ranging from 0 to 1.0.]
Exhibit 47 (Continued)
* The shaded row shows the selected threshold p value (0.60) and the performance metrics for the
selected model.
Finally, the confusion matrix using the ideal threshold p value of 0.60 is constructed
to observe the performance of the final model. When the predicted probability p > 0.60,
the prediction is y = 1 (indicating positive sentiment); otherwise, the prediction
is y = 0 (negative sentiment). The confusion matrix for the CV data is
shown in Exhibit 48. It is clear that the model performance metrics have improved
in the final model compared to the earliest case when the threshold p value was 0.50.
Now, accuracy and F1 score have both increased by one percentage point to 91% and
94%, respectively, while precision has increased by two percentage points to 90%.
Performance Metrics
without reading large documents. These sentiment classifications can also be used
as structured input data for larger ML models that have a specific purpose, such as
to predict future stock price movements.
EXAMPLE 7
Solution to 1:
Since confusion matrix A has fewer true positives (TPs) and fewer true neg-
atives (TNs) than the confusion matrix in Exhibit 48 (281 vs. 284 and 110 vs.
114, respectively), confusion matrix A has lower accuracy and a lower F1 score
compared to the one in Exhibit 48 (0.90 vs. 0.91 and 0.93 vs. 0.94, respectively).
Also, although confusion matrix A has slightly better precision, 0.91 vs. 0.90,
due to slightly fewer false positives (FPs), it has significantly lower recall, 0.94 vs.
0.98, due to having many more false negatives (FNs), 17 vs. 7, than the confusion
matrix in Exhibit 48. On balance, the ML model using the threshold p value of
0.60 is the superior model for this sentiment classification problem.
Solution to 2:
Confusion matrix B has the same number of TPs (281) and TNs (110) as con-
fusion matrix A. Therefore, confusion matrix B also has lower accuracy (0.90)
and a lower F1 score (0.93) compared to the one in Exhibit 48. Although con-
fusion matrix B has slightly better recall, 0.99 vs. 0.98, due to fewer FNs, it has
somewhat lower precision, 0.87 vs. 0.90, due to having many more FPs, 41 vs.
30, than the confusion matrix in Exhibit 48. Again, it is apparent that the ML
model using the threshold p value of 0.60 is the better model in this sentiment
classification context.
Solution to 3:
The main differences in performance metrics between confusion matrixes A
and B are in precision and recall. Confusion matrix A has higher precision, at
0.91 vs. 0.87, but confusion matrix B has higher recall, at 0.99 vs. 0.94. These
differences highlight the trade-off between FP (Type I error) and FN (Type II
error). Precision is useful when the cost of FP is high, such as when an expen-
sive product that is fine mistakenly fails quality inspection and is scrapped; in
this case, FP should be minimized. Recall is useful when the cost of FN is high,
such as when an expensive product is defective but mistakenly passes quality
inspection and is sent to the customer; in this case, FN should be minimized.
In the context of sentiment classification, FP might result in buying a stock for
which sentiment is incorrectly classified as positive when it is actually negative.
Conversely, FN might result in avoiding (or even shorting) a stock for which
the sentiment is incorrectly classified as negative when it is actually positive.
The model behind the confusion matrix in Exhibit 48 strikes a balance in the
trade-off between precision and recall.
SUMMARY
In this reading, we have discussed the major steps in big data projects involving the
development of machine learning (ML) models—namely, those combining textual big
data with structured inputs.
■■ Big data—defined as data with volume, velocity, variety, and potentially lower
veracity—has tremendous potential for various fintech applications, including
several related to investment management.
■■ The main steps for traditional ML model building are conceptualization of the
problem, data collection, data preparation and wrangling, data exploration, and
model training.
■■ For textual ML model building, the first four steps differ somewhat from those
used in the traditional model: Text problem formulation, text curation, text
preparation and wrangling, and text exploration are typically necessary.
■■ For structured data, data preparation and wrangling entail data cleansing and
data preprocessing. Data cleansing typically involves resolving incompleteness
errors, invalidity errors, inaccuracy errors, inconsistency errors, non-uniformity
errors, and duplication errors.
■■ Preprocessing for structured data typically involves performing the following
transformations: extraction, aggregation, filtration, selection, and conversion.
■■ Preparation and wrangling text (unstructured) data involves a set of text-
specific cleansing and preprocessing tasks. Text cleansing typically involves
removing the following: html tags, punctuations, most numbers, and white
spaces.
the financial data using normalization. She notes that over the full sample dataset, the
“Interest Expense” variable ranges from a minimum of 0.2 and a maximum of 12.2,
with a mean of 1.1 and a standard deviation of 0.4.
Steele and Schultz then discuss how to preprocess the raw text data. Steele tells
Schultz that the process can be completed in the following three steps:
Step 1 Cleanse the raw text data.
Step 2 Split the cleansed data into a collection of words for them to be
normalized.
Step 3 Normalize the collection of words from Step 2 and create a distinct set
of tokens from the normalized words.
With respect to Step 1, Steele tells Schultz:
“I believe I should remove all html tags, punctuations, numbers, and extra
white spaces from the data before normalizing them.”
After properly cleansing the raw text data, Steele completes Steps 2 and 3. She
then performs exploratory data analysis. To assist in feature selection, she wants to
create a visualization that shows the most informative words in the dataset based on
their term frequency (TF) values. After creating and analyzing the visualization, Steele
is concerned that some tokens are likely to be noise features for ML model training;
therefore, she wants to remove them.
Steele and Schultz discuss the importance of feature selection and feature engi-
neering in ML model training. Steele tells Schultz:
“Appropriate feature selection is a key factor in minimizing model over-
fitting, whereas feature engineering tends to prevent model underfitting.”
Once satisfied with the final set of features, Steele selects and runs a model on the
training set that classifies the text as having positive sentiment (Class “1”) or negative
sentiment (Class “0”). She then evaluates its performance using error analysis. The
resulting confusion matrix is presented in Exhibit 2.
B inconsistency error.
C non-uniformity error.
4 What type of error is most likely present in the last row of data (ID #4) in
Exhibit 1?
A Inconsistency error
B Incompleteness error
C Non-uniformity error
5 During the preprocessing of the data in Exhibit 1, what type of data transforma-
tion did Steele perform during the data preprocessing step?
A Extraction
B Conversion
C Aggregation
6 Based on Exhibit 1, for the firm with ID #3, Steele should compute the scaled
value for the “Interest Expense” variable as:
A 0.008.
B 0.083.
C 0.250.
7 Is Steele’s statement regarding Step 1 of the preprocessing of raw text data
correct?
A Yes.
B No, because her suggested treatment of punctuation is incorrect.
C No, because her suggested treatment of extra white spaces is incorrect.
8 Steele’s Step 2 can be best described as:
A tokenization.
B lemmatization.
C standardization.
9 The output created in Steele’s Step 3 can be best described as a:
A bag-of-words.
B set of n-grams.
C document term matrix.
10 Given her objective, the visualization that Steele should create in the explor-
atory data analysis step is a:
A scatter plot.
B word cloud.
C document term matrix.
11 To address her concern in her exploratory data analysis, Steele should focus on
those tokens that have:
A low chi-square statistics.
B low mutual information (MI) values.
C very low and very high term frequency (TF) values.
12 Is Steele’s statement regarding the relationship between feature selection/fea-
ture engineering and model fit correct?
A Yes.
B No, because she is incorrect with respect to feature selection.
C No, because she is incorrect with respect to feature engineering.
7 B is correct. Although most punctuations are not necessary for text analysis and
should be removed, some punctuations (e.g., percentage signs, currency sym-
bols, and question marks) may be useful for ML model training. Such punctu-
ations should be substituted with annotations (e.g., /percentSign/, /dollarSign/,
and /questionMark/) to preserve their grammatical meaning in the text. Such
annotations preserve the semantic meaning of important characters in the text
for further text processing and analysis stages.
8 A is correct. Tokenization is the process of splitting a given text into separate
tokens. This step takes place after cleansing the raw text data (removing html
tags, numbers, extra white spaces, etc.). The tokens are then normalized to cre-
ate the bag-of-words (BOW).
9 A is correct. After the cleansed text is normalized, a bag-of-words is created. A
bag-of-words (BOW) is a collection of a distinct set of tokens from all the texts
in a sample dataset.
10 B is correct. Steele wants to create a visualization for Schultz that shows the
most informative words in the dataset based on their term frequency (TF, the
ratio of the number of times a given token occurs in the dataset to the total
number of tokens in the dataset) values. A word cloud is a common visual-
ization when working with text data as it can be made to visualize the most
informative words and their TF values. The most commonly occurring words
in the dataset can be shown by varying font size, and color is used to add more
dimensions, such as frequency and length of words.
11 C is correct. Frequency measures can be used for vocabulary pruning to remove
noise features by filtering the tokens with very high and low TF values across
all the texts. Noise features are both the most frequent and most sparse (or
rare) tokens in the dataset. On one end, noise features can be stop words that
are typically present frequently in all the texts across the dataset. On the other
end, noise features can be sparse terms that are present in only a few text files.
Text classification involves dividing text documents into assigned classes. The
frequent tokens strain the ML model to choose a decision boundary among the
texts as the terms are present across all the texts (an example of underfitting).
The rare tokens mislead the ML model into classifying texts containing the rare
terms into a specific class (an example of overfitting). Thus, identifying and
removing noise features are critical steps for text classification applications.
12 A is correct. A dataset with a small number of features may not carry all the
characteristics that explain relationships between the target variable and the
features. Conversely, a large number of features can complicate the model and
potentially distort patterns in the data due to low degrees of freedom, causing
overfitting. Therefore, appropriate feature selection is a key factor in minimiz-
ing such model overfitting. Feature engineering tends to prevent underfitting in
the training of the model. New features, when engineered properly, can elevate
the underlying data points that better explain the interactions of features. Thus,
feature engineering can be critical to overcome underfitting.
13 A is correct. Precision, the ratio of correctly predicted positive classes (true
positives) to all predicted positive classes, is calculated as:
Precision (P) = TP/(TP + FP) = 182/(182 + 52) = 0.7778 (78%).
14 B is correct. The model’s F1 score, which is the harmonic mean of precision and
recall, is calculated as:
F1 score = (2 × P × R)/(P + R).
F1 score = (2 × 0.7778 × 0.8545)/(0.7778 + 0.8545) = 0.8143 (81%).