Integrated Long-Term Stock Selection Models Based On Feature Selection and Machine Learning Algorithms For China Stock Market

This document describes a study that builds integrated long-term stock selection models for the China stock market based on feature selection and machine learning algorithms. The models select features from historical stock data using various feature selection methods, then use machine learning algorithms like random forest, support vector machines, and decision trees to predict long-term stock price trends. The best performing model uses the random forest algorithm for both feature selection and stock price trend prediction. A long-short portfolio based on this model validates its effectiveness for long-term investment.

Received December 22, 2019, accepted January 16, 2020, date of publication January 24, 2020, date of current version February 6, 2020.

Digital Object Identifier 10.1109/ACCESS.2020.2969293

Integrated Long-Term Stock Selection Models Based on Feature Selection and Machine Learning Algorithms for China Stock Market

XIANGHUI YUAN1, JIN YUAN1, TIANZHAO JIANG2, AND QURAT UL AIN1
1 School of Economics and Finance, Xi'an Jiaotong University, Xi'an 710049, China
2 Shanghai Foresee Investment Limited Liability Company, Shanghai 200120, China

Corresponding author: Jin Yuan ([email protected])

This work was supported by the Chinese Natural Science Foundation under Grant 11631013, Grant 11971372, and Grant 11801433.

ABSTRACT The classical linear multi-factor stock selection model is widely used for long-term stock price trend prediction. However, the stock market is chaotic, complex, and dynamic, for which reasons the linear model assumption may be unreasonable, and it is more meaningful to construct a better integrated stock selection model based on different feature selection and nonlinear stock price trend prediction methods. In this paper, features are selected by various feature selection algorithms, and the parameters of the machine-learning-based stock price trend prediction models are set through time-sliding-window cross-validation based on 8 years of data from the Chinese A-share market. The analysis of the different integrated models shows that the model performs best when the random forest algorithm is used for both feature selection and stock price trend prediction. Based on the random forest algorithm, a long-short portfolio is constructed to validate the effectiveness of the best model.

INDEX TERMS Stock, trend prediction, machine learning, feature selection, long-term investment.

I. INTRODUCTION

The ability of investors to make a profit depends mainly on their prediction ability, while most investors in the Chinese A-share market are facing investment losses. One of the main reasons is that most investors have limited information and limited ability to predict the stock price trend well. Therefore, how to construct an effective stock selection model to improve investors' predictability is a meaningful topic.

In recent years, algorithms for stock trend prediction have been continuously proposed. From the perspective of forecasting, they can be divided into two major categories. One is stock price trend prediction, which is called classification [1]–[5]. The other is stock price forecasting, which is called regression [6]–[10]. In addition, from the perspective of forecasting time, they are roughly grouped into two types: the short-term and the long-term trend forecast of stock price [11]. In general, the length of time for stock price trend prediction is highly correlated with the selected features. For example, indicators such as yesterday's closing price and the 5-day moving average closing price are usually used to predict a stock's short-term trend. In contrast, financial factors such as operating income and return on total assets are more useful in predicting the long-term trend of stocks.

In this paper, we focus on long-term stock price trend prediction in order to construct a better long-term stock selection model. The widely used classic models for long-term stock price trend prediction are the well-known Capital Asset Pricing Model (CAPM) and the Arbitrage Pricing Theory (APT). These two models have been studied and improved by many scholars. For example, Fama and French establish the three-factor model to explain stock returns [12]. These are linear models with historical feature data as the inputs and stock returns as the outputs. However, the stock market is chaotic, complex, and dynamic, for which reasons the linear model assumption may be unreasonable, and it is especially important to consider a nonlinear model to achieve a mapping between features and stock returns. Fortunately, approaches to establishing nonlinear models can be found in many pieces of literature in recent years. Huang, Nakamori, and Wang predicted the direction of the NIKKEI 225 index, and their results show that the support vector machine (SVM) algorithm can achieve higher accuracy [1]. Patel et al. analyze applications of the four models in

The associate editor coordinating the review of this manuscript and approving it for publication was Sunith Bandaru.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
22672 VOLUME 8, 2020
X. Yuan et al.: Integrated Long-Term Stock Selection Models Based on Feature Selection and Machine Learning Algorithms

the Indian market, including artificial neural network (ANN), SVM, random forest (RF), and Naive Bayes (NB) [13], [14].

At present, there are more than 3,000 listed companies in the Chinese A-share market, and the number of listed companies is still increasing. The traditional strategy is mainly to research listed companies to determine whether a company is worth buying or not. However, with the increasing number of listed companies, this traditional strategy requires ever more manpower and material resources, and thus its sustainability is not strong. Numerous experts and scholars verify that the China stock market is still in a developing stage and that a large number of individual investors trade in this market. Moreover, due to information asymmetry and other phenomena, the prices of some stocks tend to deviate from their intrinsic value. Therefore, by analyzing historical stock data by computer, the construction of a quantitative stock selection model has great potential in such a market. This kind of model can generate a set of logical and strict trading instructions that keep investors from being affected by market sentiment and making incorrect judgments.

This paper focuses on the multi-feature stock selection model based on different feature selection algorithms and machine-learning-based stock price trend prediction algorithms for the China stock market and establishes the nonlinear relationship between factors and stock returns. It expands the development direction of the classical multi-factor model and provides a new investment strategy for investors in the China stock market. In our work, 60 features are obtained to be used as the input of the model, drawn from financial reports, daily opening prices, closing prices, volumes, and other data of the A-share market. The main contributions of this research are reflected in the following points. First, the feature selection algorithm is used to filter the features, which reduces the complexity of the model and avoids the dimensional disaster caused by too many features. Second, considering the problems of analyzing time series data with the original cross-validation method, time sliding window cross-validation is adopted to make the model more suitable for the actual situation. Third, the stock price trend forecasting algorithm is applied to predict the excess returns of stocks in the subsequent month.

The remainder of this paper is organized into the following sections. Section 2 introduces several common stock price trend prediction algorithms and feature selection algorithms and describes the principle and application of each algorithm in detail. Section 3 explains the experiment scheme. Section 4 discusses the results of the experiment and proposes an effective trading strategy. Section 5 concludes this paper.

II. METHODOLOGY
A. THE METHODS OF STOCK PRICE TREND PREDICTION
1) SVM

The support vector machine was first proposed by Vapnik and then applied in different fields [15]. There are two categories of support vector machines: classification (SVC) [3], [16], and regression (SVR) [17], [18]. The core idea of SVM is the maximum margin hyperplane, by which it can classify the sample data into two different categories, positive and negative examples [19].

The SVM model is built as follows. (x_i, label_i), (i = 1, 2, ..., n) is a series of linearly separable sample data, where x_i is a data point in the N-dimensional space and label_i is the label corresponding to the sample data, taking the value −1 or 1. The separating hyperplane obtained by SVM is w^T · x + b = 0. The distance from a point to the hyperplane is calculated as:

D = |w^T · x + b| / ||w|| = label(w^T · x + b) / ||w||

The idea of SVM mentioned above is to maximize the minimum interval, so the algorithm can be transformed into the following optimization problem:

min (1/2)||w||^2   s.t. label_i(w^T · x_i + b) ≥ 1, (i = 1, 2, ..., n)

The above optimization problem is a quadratic programming problem, which can be solved by a corresponding method. This paper then introduces a second method to solve the classification hyperplane equation: the Lagrangian multiplier method. The above optimization problem is transformed into the following problem by the Lagrangian multiplier method and the KKT condition (Lagrange duality):

max_{α≥0} min_{w,b} L(w, b, α)

L(w, b, α) = (1/2)||w||^2 − Σ_{i=1}^{n} α_i (label_i(w^T · x_i + b) − 1)

where label_i is the classification label and α_i is the Lagrangian multiplier. Working out the above formula yields the following dual problem:

max L(w, b, α) = Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j=1}^{n} α_i α_j label_i label_j x_i^T x_j

s.t. α_i ≥ 0;  Σ_{i=1}^{n} α_i label_i = 0

This optimization problem can be solved by the sequential minimal optimization (SMO) algorithm, which yields the specific values of the parameters α_i. The resulting decision function can be written as:

f(x) = sign(w^T · x + b) = sign((Σ_{i=1}^{n} α_i x_i label_i)^T · x + b) = sign(Σ_{i=1}^{n} α_i label_i ⟨x_i, x⟩ + b)

However, the data distribution discussed so far is an ideal situation that is completely linear and can be divided


without abnormal points. It is not possible to get the hyperplane directly using the maximum-minimum interval. Therefore, the model allows the data to deviate from the hyperplane, and a new optimization function is obtained:

max Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j=1}^{n} α_i α_j label_i label_j x_i^T x_j

s.t. 0 ≤ α_i ≤ C;  Σ_{i=1}^{n} α_i label_i = 0

where C is the penalty parameter on the slack variables, which controls the tolerance of the model to abnormal points as the size of C changes.

An advantage of the SVM over the LR model is that the SVM solves the linear inseparability problem of data by applying the kernel function. In general, data in a low-dimensional space are nonlinear, but after being mapped to a high-dimensional space, the data can become linearly separable. The role of the kernel function is to compute, within the low-dimensional space, the inner product of two vectors after they are mapped to the high-dimensional space, which not only realizes the mapping from low-dimensional space to high-dimensional space but also reduces the complexity. The function obtained by the SVM is:

f(x) = sign(Σ_{i=1}^{N} α_i y_i · K(x_i, x) + b)

K is the kernel function of the SVM, which is its most distinctive feature. Several common kernel functions are shown in Table 1.

TABLE 1. Kernel functions.

2) RF

The random forest algorithm evolved from the decision tree algorithm. The traditional decision tree model is constructed by dividing the sample data according to the features, where each partition is made on the optimal partitioning attribute. As the number of features and the number of branch nodes increase, the model constructed in this way is the decision tree [20], [21].

The steps to build a decision tree are as follows.

(a) The creation of the root node: all data samples are placed in the root node, then all features are traversed and the optimal feature is filtered out to divide the data samples; thus the original data samples are divided into multiple subsets. The methods for evaluating features are information gain, information gain rate, Gini index, etc. For example, the ID3 algorithm applies information gain, and the CART tree uses the Gini index.

(b) The creation of leaf nodes: the data sets divided by the optimal feature are placed in the leaf nodes.

(c) The segmentation of leaf nodes: for each subset of data, the feature set at this point is the remaining features after the optimal feature is removed. All features are traversed again, and the best feature is selected to divide the subset of data into new subsets.

(d) The construction of the decision tree model: continue with steps (b) and (c) until the conditions for stopping the split are met. In general, there are several split-stop conditions, such as the number of leaf nodes satisfying a condition, all features having been used to divide the data, and so on.

Consequently, the decision tree is constructed in a relatively simple way, and the calculation is not very complicated. However, since each division of the data is a local optimization, similar in principle to a greedy algorithm for data sample partitioning, it easily causes over-fitting. In order to further solve this problem, the random forest adopts random sampling with replacement and obtains several sub-datasets from the original data sets, which are respectively used to train decision tree (weak classifier) models, and the final model is built by voting, taking the mean value, and so on.

FIGURE 1. The structure of random forest.

Fig. 1 shows the structure of the random forest [22]. The steps for building the model are as follows. Firstly, the sample data sets are selected based on the Bagging method from the original training data sets, generating N training data sets. Secondly, N decision tree models are trained separately based on these N training data sets. Thirdly, the random forest consists of these N decision trees. For the classification problem, the final classification result is decided by the votes of the N decision tree classifiers, and for the regression problem, the average of the predicted values of the N decision trees determines the final prediction result [23], [24].

3) ANN

Artificial neural networks have been widely used for stock price trend prediction because they can better complete the construction of nonlinear models [25], [26]. This paper mainly constructs a three-layer, fully connected neural network model, and Fig. 2 describes the specific structure. The function of the model is to use the feature data sets as the input to the model, and the final prediction result is obtained


through the output layer after the calculation of two hidden layers. The weights on the nodes are calculated by the error back propagation algorithm.

FIGURE 2. The structure of three-layer fully connected neural network.

FIGURE 3. The structure of the two-layer fully connected neural network.

For example, this paper assumes the existence of a two-layer fully connected neural network, presented in Fig. 3. The details of the figure are as follows: the number of input layer nodes is N, defined as {x_1, x_2, ..., x_N}, and the number of hidden layer nodes is M, defined as {hidden_1, hidden_2, ..., hidden_M}. The number of output layer nodes is K, defined as {y_1, y_2, ..., y_K}. The weight from input layer node i to hidden layer node j is W1_ij (i = 1, 2, ..., N; j = 1, 2, ..., M), and the weight from hidden layer node i to output layer node j is W2_ij (i = 1, 2, ..., M; j = 1, 2, ..., K). The bias term of the input layer is b1, and the weight from this bias item to the hidden layer node is bk1. The bias term of the hidden layer is b2, and the weight from this bias item to the output layer node is bk2.

Given the data of the input layer and the output layer, the weights between the nodes need to be determined to complete the construction of the entire network. The method for determining the weights between nodes is the error back propagation algorithm, and the specific flow chart is displayed in Fig. 4.

FIGURE 4. The flow chart of the error back propagation algorithm.

(a) Parameter initialization
The initial weights and the bias weights are set, and then the initial weights are updated according to forward propagation and error back propagation until the error or the number of iterations meets the condition.

(b) Forward propagation
From the input layer to the hidden layer:

hidden_j = f(net1_j) = f(x_1 W1_1j + x_2 W1_2j + ... + x_N W1_Nj + b1 · bk1), (j = 1, 2, ..., M)

where f(·) is the activation function, set here as the Sigmoid function, written as f(x) = 1 / (1 + e^{−x}).

From the hidden layer to the output layer:

y_j = f(net2_j) = f(hidden_1 W2_1j + hidden_2 W2_2j + ... + hidden_M W2_Mj + b2 · bk2), (j = 1, 2, ..., K)

(c) Calculate the total error
The algorithm defines a loss function to measure the fit of the model: the smaller the value of the loss function, the better the fit. For each data sample, the loss function is:

E = (1/2) Σ_{i=1}^{K} (y_i − t_i)^2

where t_i is the target value.

(d) Error back propagation
The weights are updated by error back propagation so that the error is reduced. There are some common optimization methods, such as the gradient descent method, which is used here as an example to illustrate the process of error back propagation.

(1) Update the weights between the hidden layer and the output layer


Firstly, the effect of each weight on the overall error is calculated, which is the partial derivative of the error with respect to the weight:

∂E/∂W2_ij = (∂E/∂y_j) · (∂y_j/∂net2_j) · (∂net2_j/∂W2_ij) = (y_j − t_j) · y_j · (1 − y_j) · hidden_i, (i = 1, 2, ..., M; j = 1, 2, ..., K)

For the bias item weights:

∂E/∂bk2 = (∂E/∂y_j) · (∂y_j/∂net2_j) · (∂net2_j/∂bk2) = (y_j − t_j) · y_j · (1 − y_j) · b2

Secondly, a learning rate η is set to update the weights:

W2_ij+ = W2_ij − η · ∂E/∂W2_ij
bk2+ = bk2 − η · ∂E/∂bk2

where W2_ij+ and bk2+ are the updated weights.

(2) Update the weights between the input layer and the hidden layer
Firstly, the partial derivative of the total error with respect to the weight is calculated:

∂E/∂W1_ij = (∂E/∂hidden_j) · (∂hidden_j/∂W1_ij) = (Σ_{k=1}^{K} (∂E/∂y_k) · (∂y_k/∂net2_k) · (∂net2_k/∂hidden_j)) · hidden_j · (1 − hidden_j) · x_i

Secondly, the learning rate is used to update the weights:

W1_ij+ = W1_ij − η · ∂E/∂W1_ij
bk1+ = bk1 − η · ∂E/∂bk1

where W1_ij+ and bk1+ are the updated weights. At this point all of the weights have been updated so that the error is reduced, and steps (b), (c), and (d) are repeated until the error is less than the set threshold or the number of iterations satisfies the condition.

The above part introduces the construction method of a two-layer fully connected neural network and the weight update method. For multi-layer neural networks, the same algorithm can be used for construction.

B. THE METHODS OF FEATURE SELECTION

When all the features are directly used as input to the model, the following situations may occur because of unrelated features and correlations between features: (1) higher complexity of the model, and (2) after the feature dimension exceeds a certain limit, the performance of the classifier decreases as the feature dimension increases [27].

1) SVM-RFE

Recursive feature elimination (RFE) applies a machine learning model to perform multiple rounds of training [28]. After each round of training, the features with the lowest importance are eliminated, and then the model is trained on the new feature set.

SVM-RFE is the recursive feature elimination algorithm based on SVM [29]. The operation steps are as follows. The original sample data set is

D = {x_i1, x_i2, ..., x_in, y_i}, (i = 1, 2, ..., m)

where n represents the number of features and m represents the number of samples.

(a) The original sample data set is used as input to train a linear SVM model. The classification decision function of the linear SVM model is f(x) = sign(w^T · x + b), where w_i (i = 1, 2, ..., n) indicates the weight corresponding to the i-th feature.

(b) Calculate the importance score for each feature as score_i = w_i^2 (i = 1, 2, ..., n), where score_i represents the importance score of the i-th feature.

(c) Sort the importance of all features in descending order and remove the last-ranked feature. The number of features changes from n to n − 1, and the data set D is updated.

(d) Cycle through steps (a), (b), and (c) until the number of remaining features meets the set condition.

2) FEATURE SELECTION BASED ON RF

The decision tree model is constructed by selecting the optimal feature from all the features at each step to divide the sample data, so how to select the optimal feature from a large number of features is the key issue of the decision tree model. The decision tree model measures the importance of features by the value of information gain, information gain rate, Gini index, and so on, by which the importance of all features can be measured well. Therefore, a feature selection algorithm based on the tree model can extract the features with higher importance in this way [30], [31].

With the continuous development of such models, integrated forests such as random forest and GBDT have begun to emerge, which solve the over-fitting problem of the decision tree to some extent. Therefore, this paper adopts the feature selection algorithm based on random forest. The specific calculation of the model is as follows.

(a) The random forest uses sampling with replacement to generate the sub-data sets used to construct the corresponding decision tree models, so some data will not be selected during the sampling process. This part of the data is called the out-of-bag (OOB) data, which is analyzed through the newly constructed decision


tree model to calculate the corresponding error, which is recorded as OOB_error1.

(b) In the out-of-bag data, take the data corresponding to one of the features and randomly change the value of some of the data, that is, add a certain amount of noise as interference.

(c) Assume that the number of decision trees in the random forest is n, and calculate OOB_error1 before adding noise and OOB_error2 after adding noise for each decision tree model.

(d) Calculate the importance of the feature according to the formula (1/n) Σ_{i=1}^{n} (OOB_error2_i − OOB_error1_i). It is reasonable that the more important the feature, the worse the prediction effect after adding noise, so the larger this value, the greater the importance of the feature.

(e) For each feature, cycle through steps (b), (c), and (d) to obtain the importance of every feature.

(f) Feature filtering is performed according to the calculated feature importance. All features are sorted, and then the most important features can be selected, either by selecting a specified number of features or by setting a feature importance threshold.

FIGURE 5. The flow chart of building the stock selection model.
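The out-of-bag importance computation in steps (a)–(e) can be sketched as follows. This is a minimal illustration, not the paper's implementation: for determinism the "noise" is a rotation of the feature column rather than a random shuffle, the error measure is mean squared error, and the toy model and data are invented.

```python
# Sketch of OOB permutation importance: compare the out-of-bag error before
# and after perturbing one feature's column; a larger increase in error
# means a more important feature (OOB_error2 - OOB_error1 in the text).

def oob_error(predict, rows, targets):
    """Mean squared error of `predict` on the out-of-bag rows."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(predict, rows, targets, feature):
    base = oob_error(predict, rows, targets)          # OOB_error1 (before noise)
    column = [r[feature] for r in rows]
    column = column[1:] + column[:1]                  # perturb: rotate the column
    noisy = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, column)]
    return oob_error(predict, noisy, targets) - base  # OOB_error2 - OOB_error1

# Toy check: the model below uses only feature 0, so perturbing feature 0
# should raise the error while perturbing feature 1 changes nothing.
predict = lambda r: 2.0 * r[0]
rows = [[1.0, 5.0], [2.0, 1.0], [3.0, 9.0], [4.0, 2.0]]
targets = [2.0, 4.0, 6.0, 8.0]
print(permutation_importance(predict, rows, targets, feature=0))  # positive
print(permutation_importance(predict, rows, targets, feature=1))  # 0.0
```

In a full implementation this difference would be averaged over all n trees of the forest, each with its own out-of-bag sample, as in the formula of step (d).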
III. EXPERIMENTS DESIGN

The second chapter introduced the principles of the feature selection algorithms and the prediction algorithms, which provides a solid theoretical basis for the construction of this model. This chapter mainly introduces how to realize the multi-factor model based on machine learning algorithms to perform real-time stock selection. The specific flow chart is shown in Fig. 5.

A. DATA COLLECTION

The data of this paper are from all stocks of the Chinese A-share market, and the work is to predict individual stock trends. Considering that special treatment (ST) stocks and sub-new stocks with a shorter time on the market are riskier to operate, ST stocks and stocks listed for less than 3 months are excluded. The data set covers January 1, 2010 to January 1, 2018. On the last trading day of each month, the data set including features and stock excess returns is collected. Further, the sliding window method is employed to divide the data into training sets and testing sets. These experimental data are all obtained from the Wind database.

It is widely known that the predictive effects of machine learning algorithms are closely related to the features, which can contribute to stock returns. To be consistent with the literature, this paper selects 60 features for forecasting. The features used as input data for the model to predict the return of the stocks are listed in Table 2. As shown in the table, all features are divided into 10 categories, such as valuation, growth, and so on.

Since the ultimate goal of creating the model is to predict stock price trends, how to deal with stock price trends is particularly important. For example, almost all stocks are on a falling trend in a bear market and on a rising trend in a bull market. Therefore, it is not very reasonable to predict stock price trends directly. Following the literature, to prevent stock returns from being affected by market trends, the stock excess return, calculated by subtracting the return of the Shanghai Stock Index from the stock return, is taken as the forecast target. Moreover, because of the impact of data noise, models often fail to achieve good results. Thus, for the data in the sample, the stock returns are sorted in descending order every month, and then the top 30% of the stocks are classified as 1 and the bottom 30% as −1.

B. DATA PREPROCESSING

Considering that the feature data may have extreme values, which would affect the model and lead to abnormal results, this article uses the following method to process extreme values:

x_i,new = x_m + n × D_MAD   if x_i ≥ x_m + n × D_MAD
x_i,new = x_m − n × D_MAD   if x_i ≤ x_m − n × D_MAD
x_i,new = x_i               otherwise

where x_i,new is the processed value and x_i is the value of the i-th variable. x_m is the median of the sequence, D_MAD is the median of the sequence |x_i − x_m|, and n is used to control the amplitude of the upper and lower limits.

Due to a lack of financial reports, calculation errors, etc., data loss is likely, affecting the accuracy of data analysis, so the preprocessing of missing values is quite important. In this paper, the missing values are processed using a fill method to make full use of the data. Because the factors of stocks in the same industry are roughly similar, a missing factor exposure is set to the average value of the same factor over stocks in the same industry.
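The median-based clipping and the industry-average fill described above can be sketched as follows. The factor values, industry labels, and the choice n = 5 below are hypothetical, purely for illustration:

```python
# Sketch of the two preprocessing steps: clipping pulls values beyond
# x_m +/- n * D_MAD back to the bound (x_m = median, D_MAD = median of
# |x_i - x_m|); filling replaces a missing value with the industry mean.

def median(seq):
    s = sorted(seq)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

def clip_extremes(values, n=5):
    xm = median(values)
    dmad = median([abs(x - xm) for x in values])
    hi, lo = xm + n * dmad, xm - n * dmad
    return [hi if x >= hi else lo if x <= lo else x for x in values]

def fill_by_industry(values, industries):
    """Replace missing values (None) with the mean of the same industry."""
    means = {}
    for ind in set(industries):
        obs = [v for v, i in zip(values, industries) if i == ind and v is not None]
        means[ind] = sum(obs) / len(obs) if obs else 0.0
    return [means[i] if v is None else v for v, i in zip(values, industries)]

ep = [0.8, 1.1, None, 0.9, 25.0]            # hypothetical EP factor exposures
sector = ["bank", "bank", "bank", "it", "it"]
print(clip_extremes([x for x in ep if x is not None], n=5))
print(fill_by_industry(ep, sector))
```

In the toy data, the outlier 25.0 is pulled back to the upper bound, and the missing bank-sector exposure is replaced with the mean of the other bank-sector values.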


TABLE 2. Input features for the stock market data set.


For the stocks of the A-Share market, stock returns will


also be affected by market capitalization and industry in addi-
tion to the 60 factors listed in Table 2. For example, The EP
factor of the banking industry’s stock is larger than that of the
internet industry. The purpose of neutralization is to eliminate
the influence of other factors and make the selected stocks
more dispersed. The main idea is to obtain a new feature
that is independent of the original feature through multiple
linear regression models. That is to say, through establishing FIGURE 6. 10-fold cross-validation.
a linear regression model, the residual value obtained by the
regression is used as the new factor data.
Since features have different units and quantity sizes, such
as market value and EP, they are obviously different in size.
Substituting the features into the model directly will result
in different proportions of different features, affecting the
prediction results. This problem is solved by the following
standardized methods:
            x_i,new = (x_i − u) / σ

where x_i,new is the value after normalization, x_i is the value of the ith variable, u is the mean of the sequence, and σ is its standard deviation.

FIGURE 7. The time window slicing cross-validation.

C. PARAMETER SETTING OF FEATURE SELECTION
1) SVM-RFE
When the SVM-RFE algorithm is applied for feature selection, a certain number of features must be selected from those listed in Table 1. First, all features are sorted by importance in descending order. Second, the top 80% of the features are selected, which means that 48 features are retained.

2) RF
The RF algorithm is used in the same way as SVM-RFE: both must select a certain number of features. For consistency, the top 80% of the features are also selected.

D. PARAMETER SETTING OF STOCK PREDICTION
The stock price trend forecasting algorithms used in this paper are SVM, RF, and ANN. For each model, different parameter values lead to different results. In machine learning, the common method of parameter tuning is cross-validation; however, it cannot be applied to finance directly. For example, the steps of 10-fold cross-validation, one of the most common cross-validation methods, are shown in Fig. 6. If 10-fold cross-validation is used on financial data, future data are inevitably used to build a model that predicts past data. Thus, the time window slicing cross-validation strategy is applied in this paper [32]. Using 12 months of data for cross-validation, the training set is first divided into 12 equal groups, as illustrated in Fig. 7; each time, the first 4 groups of data are used as the training set and the next group as the validation set.

For the SVM algorithm, the Gaussian kernel function maps the data into an infinite-dimensional space. Therefore, regardless of the data distribution, the data can be mapped into an infinite-dimensional space by the Gaussian kernel, which makes it possible to build a high-dimensional linear model. In this paper, the Gaussian kernel function is used to build the SVM model. The optimization function (the dual problem) is:

            max  Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j=1}^{n} α_i α_j label_i label_j K(x_i, x_j)

            s.t.  0 ≤ α_i ≤ C;   Σ_{i=1}^{n} α_i label_i = 0

            K(x_i, x_j) = exp(−γ ||x_i − x_j||²)

According to this formula, the parameters to be tuned in the SVM model are mainly C and gamma (γ), which are listed in Table 3. For example, when C = 0.001 and gamma = 0.0001, the results obtained by the time window slicing cross-validation are used as the evaluation results of the model. By testing every combination of C and gamma, a result is obtained for each setting, and the best-performing model is finally selected.

In addition, when the RF algorithm is chosen, some parameters need to be set, such as the number of decision trees, the maximum number of features each decision tree considers at a split, the minimum number of samples required to split an internal node, and the minimum number of samples at a leaf node. To reduce the complexity of the model, some of these parameters are fixed in advance: the number of decision trees is 100, and the maximum number of features considered at each split is the square root of the total number of features. The minimum number of samples required to split an internal node is denoted ''s'' and the minimum number of samples at a leaf node ''l''. They are described in detail in Table 4.
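A minimal sketch of the time window slicing scheme described above. It assumes the window advances by one group per fold; the paper fixes 12 groups and a 4-group training window but does not spell out the step size, so that is an assumption here:

```python
import numpy as np

def time_window_folds(n_samples, n_groups=12, train_groups=4):
    """Yield (train_idx, val_idx) pairs in which the validation data
    always come strictly after the training window, so no future
    information leaks into model fitting."""
    bounds = np.linspace(0, n_samples, n_groups + 1, dtype=int)
    groups = [np.arange(bounds[i], bounds[i + 1]) for i in range(n_groups)]
    for start in range(n_groups - train_groups):
        train_idx = np.concatenate(groups[start:start + train_groups])
        val_idx = groups[start + train_groups]
        yield train_idx, val_idx

# 12 monthly groups -> 8 folds; fold 0 trains on months 1-4 and
# validates on month 5, fold 1 on months 2-5 and month 6, and so on.
folds = list(time_window_folds(n_samples=1200))
print(len(folds))                          # → 8
print(folds[0][1][0] > folds[0][0][-1])    # → True (validation after training)
```

Each (C, gamma) combination would then be scored on these folds, in contrast to 10-fold cross-validation, whose random folds let a model train on future observations.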

VOLUME 8, 2020 22679


X. Yuan et al.: Integrated Long-Term Stock Selection Models Based on Feature Selection and Machine Learning Algorithms

TABLE 3. The parameters of SVM.

TABLE 4. The parameters of RF.

TABLE 5. The parameters of ANN.

TABLE 6. Confusion matrix.
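The confusion-matrix indicators and the threshold-free AUC used later in this section can be computed as in this sketch; the labels and scores are made up for illustration:

```python
import numpy as np

# Hypothetical labels and scores for a binary up/down classifier.
y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
score = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1])

# Threshold-based confusion-matrix indicators (threshold = 0.5).
pred = (score >= 0.5).astype(int)
tp = np.sum((pred == 1) & (y == 1)); fp = np.sum((pred == 1) & (y == 0))
fn = np.sum((pred == 0) & (y == 1)); tn = np.sum((pred == 0) & (y == 0))
accuracy = (tp + tn) / len(y)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# AUC is threshold-free: the probability that a random positive
# outranks a random negative (equal to the area under the ROC curve).
pos, neg = score[y == 1], score[y == 0]
auc = np.mean([p > n for p in pos for n in neg])

print(accuracy, precision, recall, auc)  # → 0.75 0.75 0.75 0.6875
```

Shifting the 0.5 threshold changes accuracy, precision, and recall, but leaves the AUC untouched, which is why the paper prefers AUC for model comparison.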

In this paper, a three-layer fully connected neural network is applied to predict the stock price trend. Compared with the other two algorithms, this one has more parameters, such as the number of hidden-layer neurons, the optimization function, the L2 penalty (regularization) parameter, and the maximum number of iterations. The first hidden layer has 20 neurons and the second has 10. The optimization function is ''lbfgs'', an optimizer in the family of quasi-Newton methods. The L2 penalty (regularization) parameter is denoted ''alpha'' and the maximum number of iterations ''max_iter''. They are described in detail in Table 5.

E. OUT-OF-SAMPLE TEST
The sliding window method, which has been widely used in stock price trend prediction, is applied to divide the sample into different groups of training and testing sets. The main reason for applying this method is that investors pay attention to the recent trend of stocks but are not interested in data from long ago, so the model needs to be continuously updated during application. To use the latest stock information as much as possible, the model is regenerated every month. For example, the training set of the first group covers the period from January 2010 to January 2011, while its testing set covers January 2011 to February 2011. The specific division method is shown in Fig. 8.

F. MODEL EVALUATION
For classification algorithms, there are several common evaluation indicators, such as accuracy and precision. They are calculated from the confusion matrix defined in Table 6.

The success of a classifier is usually measured by its correct rate, with accuracy treated as the most important indicator for evaluating a model. However, computing accuracy requires manually setting a classification threshold, and the result is strongly affected by that threshold, so this paper additionally uses the AUC indicator to evaluate the model. AUC is the area under the ROC curve, whose x-axis is the false positive rate and whose y-axis is the recall.

To evaluate the model further, the paper constructs a strategy model and back-tests it on historical data. There are some indicators, as exhibited in Table 7, which


TABLE 8. Prediction accuracy of different models.

TABLE 9. Prediction AUC of different models.

FIGURE 8. Sliding window by one month.

TABLE 7. Evaluation indicators.

are used to evaluate the strategy model. The annualized return refers to the return that can be obtained after one year of investment. The Sharpe ratio is an indicator that measures both the return and the risk of the model. The max drawdown is the maximum loss that the model may have suffered over a certain period in the past. If a model has a large max drawdown, investors will often lose confidence in it, so controlling the max drawdown reasonably is particularly important. Here Rp is the annualized return, P is the total return, n is the number of days the strategy is conducted, Rf is the risk-free rate, and σp is the volatility of the return. Px and Py are the total values of stocks and cash on given days, with the requirement y > x.

IV. RESULTS
A. EMPIRICAL RESULTS AND DISCUSSION
In this paper, the procedure described in Section 3 is used for the experiment. The above steps are repeated every three months according to the time window slicing method, and the results are shown as follows.

First, the prediction accuracy and AUC of the different integrated models are analyzed, as shown in Table 8 and Table 9. The two tables show that the stock price trend forecast is best when the RF model is adopted. Since the prediction performance of the three models is relatively close, the paper conducts a more in-depth analysis of all three.

Second, the main task is to analyze the profitability of the different integrated models. In order to further explain the

FIGURE 9. Hierarchical combined back-testing net value.


FIGURE 10. The new long-short portfolio net value.

TABLE 10. Annualized return of different models.

TABLE 11. Sharpe ratio of different models.

application of the algorithm proposed in this paper, the APT model is used to predict stock returns for comparative analysis. The specific strategy is built as follows: the back-test period runs from January 2011 to January 2018, and stocks are traded at the end of each month. The A-share stocks are ranked by the predicted probability of the label ''1'', and the top 1%, 3%, or 5% of stocks are then selected, with the funds divided equally among them.

Table 10 shows the annualized return of the different integrated models. The results demonstrate that the SVM and RF models are more profitable than APT, and that whether SVM or RF is used to predict stock price trends, the profitability is stronger than that of ANN. As the number of selected stocks decreases, the profitability of the model becomes stronger; therefore, choosing the top 1% of stocks yields the highest returns. The Sharpe ratios of the models are reported in Table 11. Moreover, Table 12 and Table 13 show the win rates and profit-loss ratios of the different models.

The above tables show that the best result is obtained when RF is used both to select the features and to predict the stock price trend. The RF-RF1 integrated model is therefore analyzed in detail as follows.

First of all, the specific strategy is built as follows: the back-test period runs from January 2011 to January 2018, and stocks are traded at the end of each month. By ranking the probability that the A-share stocks are predicted to be ''1'', all stocks are divided into 10 equal groups, and within each group the funds are divided equally among all stocks. The next step is to analyze the profitability of these ten groups.

1 In the RF-RF integrated model, the random forest algorithm is used for both feature selection and stock price trend prediction.
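The grouping step can be sketched as follows; the probabilities are simulated, whereas the paper ranks real model outputs:

```python
import numpy as np

# Hypothetical predicted probabilities of label "1" for 1000 A-share stocks.
rng = np.random.default_rng(1)
proba = rng.uniform(size=1000)

# Rank stocks by predicted probability (descending) and split them into
# 10 equal groups; Group 1 holds the highest-probability names, each
# bought with an equal share of the funds.
order = np.argsort(-proba)
groups = np.array_split(order, 10)

print(len(groups[0]))                                     # → 100
print(proba[groups[0]].min() >= proba[groups[1]].max())   # → True
```

The second check confirms the ordering property the strategy relies on: every stock in Group 1 has a higher predicted probability than any stock in Group 2.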


TABLE 12. Win rate of different models.

TABLE 13. Profit loss ratio of different models.

TABLE 14. Performance of different groups.
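The evaluation indicators reported for each group (annualized return, Sharpe ratio, max drawdown) can be computed from a net-value series as in this sketch; the net values are simulated, and the 3% risk-free rate is an assumed parameter, not a figure from the paper:

```python
import numpy as np

# Hypothetical daily net values over two years (252 trading days/year).
rng = np.random.default_rng(2)
net = np.cumprod(1 + rng.normal(0.0005, 0.01, size=504))

# Annualized return: total return compounded back to a one-year horizon.
n = len(net)
total_return = net[-1] - 1.0          # initial net value is 1.0
annualized = (1 + total_return) ** (252 / n) - 1

# Sharpe ratio: excess return over the risk-free rate per unit of volatility.
daily = np.diff(net) / net[:-1]
rf = 0.03                             # assumed annual risk-free rate
sharpe = (annualized - rf) / (daily.std(ddof=1) * np.sqrt(252))

# Max drawdown: worst peak-to-trough loss of the net-value curve.
running_peak = np.maximum.accumulate(net)
max_drawdown = np.max(1 - net / running_peak)

print(f"annualized={annualized:.2%}, sharpe={sharpe:.2f}, "
      f"max_drawdown={max_drawdown:.2%}")
```

The running-maximum trick makes the drawdown computation a single vectorized pass instead of a double loop over all peak/trough pairs.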

Fig. 9 shows the net values of these ten groups over the period from January 2011 to January 2018. The red line represents the net-value trend of Group 1, while the purple and black lines represent the net values of the HS300 and the Shanghai Composite Index. Table 14 shows the performance of the different groups. Group 1 has the best results on every evaluation indicator. To further illustrate the profitability of the models, the return of each of the ten groups in each year is calculated, as shown in Table 15. Group 1 again has the best performance. Therefore, the RF-RF integrated model has strong profitability.

The above empirical results indicate that the RF-RF integrated model has the best performance in annualized return, Sharpe ratio, win rate, and profit-loss ratio, and the hierarchical combined back-testing also shows that the model has strong long-term predictability and profitability. At present, quantitative investment accounts for an ever larger share of the financial investment field. Especially in recent years, with the poor market situation, traditional investment has suffered large losses in the China stock market, while quantitative investment obtains stable income by controlling systematic risk, which makes quantitative investment products more and more trusted by investors. Based on feature selection and machine learning algorithms, our empirical results show that the RF-RF integrated model can bring stable long-term returns to investors


TABLE 15. Return of different combination in different years.

TABLE 16. Performance of the new portfolio.

which is meaningful for guiding investment, as well as promoting investors' willingness to invest and improving the vitality of the capital market.

B. A LONG-SHORT TRADING STRATEGY BASED ON THE RF-RF INTEGRATED MODEL
Note that the Sharpe ratio of Group 1 in Table 14 is 0.744 and its max drawdown is 45.95%. The main reason is that this model is always fully invested in stocks, so it is difficult to control the drawdown when the market falls substantially. As a result, investors choosing an investment strategy may still have doubts about its relatively volatile returns.

The RF-RF integrated model helps investors solve this problem by buying the stocks of Group 1 and selling the stocks of Group 10 at the same time. As can be seen from Fig. 10 and Table 16, the annualized return of this new long-short portfolio is 21.92%, which is lower than that of the portfolio selecting the top 1% of stocks, but its max drawdown is 13.58%, far below that of the long-only portfolio, and its Sharpe ratio is 2.86, which is significant for investors.

V. CONCLUSION
This paper aims to analyze the profitability of various integrated stock selection models based on different feature selection and stock price trend prediction algorithms. The original features are filtered by feature selection methods. The time sliding window method is applied for cross-validation to determine the parameters of the stock price trend prediction algorithms, which makes the model more practical in actual investment transactions. The empirical results show that the best performance is obtained when RF is applied for both feature selection and stock price trend forecasting. By building the model with different numbers of selected stocks, it is also found that the RF-RF model achieves the highest return when it chooses the top 1% of stocks, with a 29.51% annualized return. The stratified back-testing method is used to further analyze the profitability of the RF-RF model; the annualized return of the new long-short portfolio from 2011 to 2018 is 21.92%, while its max drawdown is only 13.58%. Therefore, the RF-RF model is highly predictive of long-term stock price trends and can be used to guide investment.

There are still some issues to improve in this article: (1) the models are not tested in overseas markets such as the US and UK; (2) the feature selection algorithm still needs to be optimized, for example in how the number of selected features is determined; and (3) more new features with greater predictive power still need to be explored.

REFERENCES
[1] W. Huang, Y. Nakamori, and S.-Y. Wang, ''Forecasting stock market movement direction with support vector machine,'' Comput. Oper. Res., vol. 32, no. 10, pp. 2513–2522, Oct. 2005.


[2] C. Huang, D. Yang, and Y. Chuang, ''Application of wrapper approach and composite classifier to the stock trend prediction,'' Expert Syst. Appl., vol. 34, no. 4, pp. 2870–2878, May 2008.
[3] M.-C. Lee, ''Using support vector machine with a hybrid feature selection method to the stock trend prediction,'' Expert Syst. Appl., vol. 36, no. 8, pp. 10896–10904, Oct. 2009.
[4] Y. Kara, M. Acar Boyacioglu, and Ö. K. Baykan, ''Predicting direction of stock price index movement using artificial neural networks and support vector machines: The sample of the Istanbul Stock Exchange,'' Expert Syst. Appl., vol. 38, no. 5, pp. 5311–5319, May 2011.
[5] M. Ballings, D. Van Den Poel, N. Hespeels, and R. Gryp, ''Evaluating multiple classifiers for stock price direction prediction,'' Expert Syst. Appl., vol. 42, no. 20, pp. 7046–7056, Nov. 2015.
[6] D. Enke and S. Thawornwong, ''The use of data mining and neural networks for forecasting stock market returns,'' Expert Syst. Appl., vol. 29, no. 4, pp. 927–940, Nov. 2005.
[7] C.-F. Tsai and S.-P. Wang, ''Stock price forecasting by hybrid machine learning techniques,'' in Proc. Int. Multi-Conf. Eng. Comput. Scientists, Mar. 2009, pp. 755–761.
[8] Y. Zuo and E. Kita, ''Stock price forecast using Bayesian network,'' Expert Syst. Appl., vol. 39, no. 8, pp. 6729–6737, Jun. 2012.
[9] E. Chong, C. Han, and F. C. Park, ''Deep learning networks for stock market analysis and prediction: Methodology, data representations, and case studies,'' Expert Syst. Appl., vol. 83, pp. 187–205, Oct. 2017.
[10] B. Weng, L. Lu, X. Wang, F. M. Megahed, and W. Martinez, ''Predicting short-term stock prices using ensemble methods and online data sources,'' Expert Syst. Appl., vol. 112, pp. 258–273, Dec. 2018.
[11] A. Oztekin, R. Kizilaslan, S. Freund, and A. Iseri, ''A data analytic approach to forecasting daily stock returns in an emerging market,'' Eur. J. Oper. Res., vol. 253, no. 3, pp. 697–710, Sep. 2016.
[12] E. F. Fama and K. R. French, ''Common risk factors in the returns on stocks and bonds,'' J. Financial Economics, vol. 33, no. 1, pp. 3–56, Feb. 1993.
[13] J. Patel, S. Shah, P. Thakkar, and K. Kotecha, ''Predicting stock and stock price index movement using Trend Deterministic Data Preparation and machine learning techniques,'' Expert Syst. Appl., vol. 42, no. 1, pp. 259–268, Jan. 2015.
[14] H. Y. Kim and C. H. Won, ''Forecasting the volatility of stock price index: A hybrid model integrating LSTM with multiple GARCH-type models,'' Expert Syst. Appl., vol. 103, pp. 25–37, Aug. 2018.
[15] V. Vapnik, ''An overview of statistical learning theory,'' IEEE Trans. Neural Netw., vol. 10, no. 5, pp. 988–999, Sep. 1999.
[16] A. Fan and M. Palaniswami, ''Stock selection using support vector machines,'' in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2001, pp. 1793–1798.
[17] F. E. H. Tay and L. Cao, ''Application of support vector machines in financial time series forecasting,'' Omega, vol. 29, no. 4, pp. 309–317, Aug. 2001.
[18] K.-J. Kim, ''Financial time series forecasting using support vector machines,'' Neurocomputing, vol. 55, nos. 1–2, pp. 307–319, Sep. 2003.
[19] R. Khemchandani, Jayadeva, and S. Chandra, ''Knowledge based proximal support vector machines,'' Eur. J. Oper. Res., vol. 195, no. 3, pp. 914–923, Jun. 2009.
[20] S. Safavian and D. Landgrebe, ''A survey of decision tree classifier methodology,'' IEEE Trans. Syst., Man, Cybern., vol. 21, no. 3, pp. 660–674, Jun. 1991.
[21] Y. H. Cho, J. K. Kim, and S. H. Kim, ''A personalized recommender system based on Web usage mining and decision tree induction,'' Expert Syst. Appl., vol. 23, no. 3, pp. 329–342, Oct. 2002.
[22] V. Svetnik, A. Liaw, C. Tong, J. C. Culberson, R. P. Sheridan, and B. P. Feuston, ''Random forest: A classification and regression tool for compound classification and QSAR modeling,'' J. Chem. Inf. Comput. Sci., vol. 43, no. 6, pp. 1947–1958, Nov. 2003.
[23] B. Lariviere and D. Vandenpoel, ''Predicting customer retention and profitability by using random forests and regression forests techniques,'' Expert Syst. Appl., vol. 29, no. 2, pp. 472–484, Aug. 2005.
[24] A. Prinzie and D. Van Den Poel, ''Random forests for multiclass classification: Random multinomial logit,'' Expert Syst. Appl., vol. 34, no. 3, pp. 1721–1732, Apr. 2008.
[25] M. Khashei and M. Bijari, ''An artificial neural network (p, d, q) model for time series forecasting,'' Expert Syst. Appl., vol. 37, no. 1, pp. 479–489, Jan. 2010.
[26] Q. Cao, M. E. Parry, and K. B. Leggio, ''The three-factor model and artificial neural networks: Predicting stock price movement in China,'' Ann. Oper. Res., vol. 185, no. 1, pp. 25–44, May 2011.
[27] R. Cervelló-Royo, F. Guijarro, and K. Michniuk, ''Stock market trading rule based on pattern recognition and technical analysis: Forecasting the DJIA index with intraday data,'' Expert Syst. Appl., vol. 42, no. 14, pp. 5963–5975, Aug. 2015.
[28] W. You, Z. Yang, and G. Ji, ''Feature selection for high-dimensional multi-category data using PLS-based local recursive feature elimination,'' Expert Syst. Appl., vol. 41, no. 4, pp. 1463–1475, Mar. 2014.
[29] X. Lin, F. Yang, L. Zhou, P. Yin, H. Kong, W. Xing, X. Lu, L. Jia, Q. Wang, and G. Xu, ''A support vector machine-recursive feature elimination feature selection method based on artificial contrast variables and mutual information,'' J. Chromatography B, vol. 910, pp. 149–155, Dec. 2012.
[30] X.-Q. Hu, M. Cui, and B. Chen, ''Feature selection based on random forest and application in correlation analysis of symptom and disease,'' presented at the Conf. IEEE Int. Symp. Med. Educ., Aug. 2009.
[31] D.-J. Yao, J. Yang, and X. J. Zhan, ''Feature selection algorithm based on random forest,'' J. Jilin Univ., vol. 44, no. 1, pp. 137–141, 2014.
[32] X. Zhang, Y. Hu, K. Xie, S. Wang, E. Ngai, and M. Liu, ''A causal feature selection algorithm for stock prediction modeling,'' Neurocomputing, vol. 142, pp. 48–59, Oct. 2014.
[33] J. Rohde, ''Downside risk measure performance in the presence of breaks in volatility,'' J. Risk Model Validation, vol. 9, no. 2, pp. 31–68, Dec. 2015.

XIANGHUI YUAN received the B.Sc. degree in electrical engineering and its automation from Shaanxi Science and Technology University, Xianyang, China, in 2002, and the Ph.D. degree in control science and engineering from Xi'an Jiaotong University, Xi'an, China, in 2008. From 2008 to 2018, he was a Faculty Member with the Department of Automation, Xi'an Jiaotong University. He is currently an Associate Professor with the Department of Financial Engineering, Xi'an Jiaotong University. His research interests include estimation and decision theory for stochastic systems, financial engineering, and machine learning for financial data analysis. He is also a CFA charter holder.

JIN YUAN was born in Anhui, China, in 1994. He received the B.S. degree from Northwest Polytechnic University, Xi'an, China, in 2015. He is currently pursuing the Ph.D. degree in applied economics with Xi'an Jiaotong University, Xi'an. His research interests include financial engineering, machine learning for financial data analysis, performance evaluation, and ranking.

TIANZHAO JIANG was born in Anhui, China, in 1993. He received the B.S. degree from Northwest University, Xi'an, China, in 2016, and the M.S. degree in control theory and science from Xi'an Jiaotong University, Xi'an, in 2019. He is currently working as a Researcher with Shanghai Foresee Investment Ltd., Liability Company. His research interests include information fusion, financial engineering, machine learning algorithms, and investment.

QURAT UL AIN received the master's degree in accounting and finance and the M.S. degree in finance from the University of Central Punjab, Lahore, Pakistan, in 2014 and 2017, respectively. She is currently pursuing the Ph.D. degree with the School of Economics and Finance, Xi'an Jiaotong University, China. Her areas of research interest are corporate finance, corporate governance, risk management, and technology innovation. Her current research work revolves around the governance role of women directors. Her research work has been published in well-reputed journals.
