Predicting IPO Underperformance Using Machine Learning
Rachit Agrawal
SJMSOM, IIT Bombay
Email: [email protected]
Usha Ananthakumar
SJMSOM, IIT Bombay
Email: [email protected]
ABSTRACT
Predicting IPO short-term returns is a challenging task owing to the many determinants involved. Empirical analyses in the literature have established the presence of IPO underpricing, but that research relies largely on linear regression (GLM) models. In this study, we apply various machine learning techniques to classified data to predict IPO underpricing in the Indian markets. The results show that the gradient boosting (GB) model performs significantly better than GLM models and other ensemble techniques. The variable importance measure indicates that profit margin, market sentiment, and the debt-to-assets ratio are the most important predictors of IPO short-term returns.
INTRODUCTION
Financial markets are an integral part of any nation's economic growth engine, allowing funds to flow from low-utility to high-utility applications. The capital market improves the allocation of funds between savers and investors with varying needs and goals, directing capital toward its most efficient uses and thereby increasing investment efficiency. A capital market's efficiency is a characteristic of the market in which it operates and relies on the speed, accuracy, and reliability of its information flow, which different stakeholders use to form actionable insights and make informed decisions. The quality of these signals, and their separation from the noise generated incessantly in the markets, supports the price discovery mechanism and helps financial markets achieve the efficiency required to realize growth.
An Initial Public Offering (IPO) is one of the processes that contributes to the distribution of funds in capital markets. IPOs belong to the primary market, where corporates raise capital against equity shares to undertake new or expansion projects, from investors willing to take on the risk of the company's stated endeavor in the prospect of earning higher returns. Unlike the secondary market, where price discovery happens by bringing together buyers and sellers under the supply-demand dynamics of the financial instruments, in this market the bookrunners, i.e., investment banks and their consortiums, ascertain a price for the equity shares. Investment banks set this price based on their judgment of the demand for the company's shares at various price points, on intrinsic operational, financial, and management-related factors, and on the capital needed by the company for the stated purpose. In emerging and developed economies alike, we observe that IPOs are mostly underpriced to attract high demand for such instruments, and IPO pricing differs significantly, with varying degrees of underpricing and overpricing.
An IPO is said to be underpriced when the offer price of the issue is below the closing market price on the listing day. Whenever an issue is underpriced, it is a gain for investors and a loss of capital for the issuing company, and vice versa when the issue is overpriced. The determinants and driving factors behind such underpriced and overpriced IPOs are still not properly understood. With this varying degree of pricing, investors can experience major losses in both the long term and the short term; in the worst cases, the total capital is lost and the IPO may be delisted from the market.
The IPO market in India is booming, and many more companies are issuing equity shares in the capital market. On average, 30-35 companies raise capital from retail and institutional investors every year. This market has been improving over the past three years owing to buoyant equity markets, increasing financial savings, and the outperformance of recent IPO listings. However, not all IPOs have generated positive returns for investors; there have been instances of significant short-term losses and delisting in the IPO markets.
In this study, we analyze the underpricing patterns of historical IPOs to detect and predict the decisive variables in IPO listing-day performance. The study could help retail investors identify the important determinants of IPO underpricing and protect them from losing money in short-term investments. Since the efficiency of a capital market is enhanced by information and signal transmission to different stakeholders, such an exercise has merit: it can attract investors to the market with a sound understanding of it and give them confidence while investing in the IPOs of Indian companies.
Machine learning models can learn patterns on their own when a large enough data set is fed to them. This study utilizes classification techniques such as logistic regression, random forest, bagging, and gradient boosting to predict the performance of IPOs.
LITERATURE REVIEW
Stoll and Curley (1970) and Ibbotson (1975) laid the foundation of IPO pricing and mispricing with respect to short-term first-day returns. They observed significant price appreciation just after an equity offering was listed, and the initial gains in IPO performance were exceptionally high.
Many theoretical studies have examined the factors responsible for this level of underpricing in IPOs. Baron (1982) attributed the underpricing of IPOs to two theories: information asymmetry and market uncertainty. According to the information asymmetry argument, the issuer knows more about the operational and functional aspects of the business and compensates less-informed investors through underpricing. Second, owing to market uncertainty, issuers rely on bankers' expertise about market conditions and the issue's demand. Ritter and Welch (2002) reviewed US IPO underperformance from 1980 to 2001, concluding that asymmetric information models are not the primary drivers of underperformance; they believed that agency conflicts and share allocation issues could explain the dramatic variations in underpricing.
Another theory, by Habib and Ljungqvist (2001), explains the voluntary underpricing of an issue by the issuer, who releases only a small portion of their holding to public shareholders. Loughran and Ritter (2002) combine this with the fact that, rather than maximizing their gain on the fresh equity issue in isolation, issuers have a stake in maximizing their overall wealth. Hence, when an issuer releases a small portion of its shareholding to the public below fair market value, it acts as a strong incentive for investors and attracts considerable demand. By foregoing fractional gains on a small portion of the holding, issuers attract a higher valuation in the long term for their residual shareholdings.
Although most studies concern developed economies, Loughran, Ritter, and Rydqvist (1994) find underpricing in developing nations to be greater than that observed in developed economies. Gao (2010) and Tian (2011) showed that asymmetric information theory does not hold in a developing nation such as China. Procianoy and Cigerza (2007) used a multivariate linear regression model to explain IPO performance in the emerging economies of Brazil, India, and China, with offer size, investment bank reputation, final offer price, market performance, high-tech content of goods produced, GDP, FDI, interest rate, and inflation as determinants of performance. Most of these theories have been subjected to rigorous empirical testing using firm-specific and market-specific factors, and the empirical evidence presented in the literature is notably in favor of the asymmetric information theory.
Several researchers have analyzed IPO returns using a variety of computational intelligence techniques. Luque, Quintana, and Isasi (2012) focus mainly on offering characteristics to predict IPO returns using a genetic algorithm. Robertson, Golden, Runger, and Wasil (1998) constructed OLS regression and neural network models to predict first-day IPO returns; their empirical findings show that the neural network models' predictions were better than those of OLS regression. Quintana, Saez, and Isasi (2017) explained IPO underperformance in US markets using machine learning models and concluded that random forests outperform eight popular machine learning algorithms (instance-based learning algorithms, least median of squares regression, locally weighted learning, M5 model trees, M5 model rules, multilayer perceptrons, radial basis function networks, and support vector machines trained with sequential minimal optimization) in terms of mean and median predictive accuracy. Baba and Sevil (2020) studied IPO returns in the Istanbul equity market and reached similar conclusions regarding the random forest algorithm.
Turning to Indian markets, Krishnamurti and Kumar (2002) tested underpricing against the time lag between final allotment and listing and found it to be an important factor; investors perceived the time lag as a risk and required additional compensation for it. Ghosh (2002) studied uncertainty and firm age as determinants of underpricing and found a positive relationship between uncertainty risk and underpricing, but the firm's age could not explain initial returns. Bansal and Khanna (2012) analyzed Indian IPOs after the global financial crisis and found that firm age, IPO year, the book-building pricing mechanism, ownership structure, issue size, and market capitalization explained 44% of the variation in issuer underpricing. Chhabra, Kiran, and Sah (2017a) find that variables that signal information are highly significant and that companies with high information disclosure experience less underpricing; Chhabra, Kiran, Sah, and Sharma (2017b), on the other hand, find the informational variables less effective in explaining IPO underpricing.
The present literature on Indian capital markets is restricted to regression models with smaller data sets and limited variables, and is more focused on long-term capital gains. We analyze the Indian equity markets for short-term IPO returns using machine learning techniques over 2007-2020 (pre-COVID), a period spanning the pre-global-financial-crisis years through the eve of COVID-19.
SAMPLE DATA
Data
The data set used in this study consists of first-day trading returns of 262 public offerings listed
on NSE during the period 2007 to 2020 (pre-COVID) with a minimum post-issue paid-up capital
of Rs 10 crore. Table I shows the IPO proceeds during this period. In this period, a total of 429
major firms went public, raising around $42 billion of capital. The volume of IPOs peaked in 2007 with 108 deals. However, we see a significant dip in interest post-2007. This reduction in IPO
activity was largely driven by the rapid decline in Indian stock exchanges following the breakout
of the global financial crisis in the third quarter of 2008. The IPO activities bounced back in 2009-
10. However, IPO activities took a sudden dip in the years 2013 and 2014 due to poor sentiments,
volatile secondary markets, and promoters not getting the right valuations. In the following years,
especially since 2015, the global favorable monetary conditions spurred economic growth and
sectoral developments in many emerging markets. The year 2017 remains the most productive
year for capital raising, with the public injecting ~$11 billion into equities. Due to data
inconsistencies, we removed 167 firms from our dataset. In our sample of 262 IPOs from 2007 to
2020, the average underpricing is 15.7 percent. Approximately 63 percent of the IPOs end the
first day of trading at a closing price greater than the offer price, and about 37 percent have zero or negative first-day returns. The IPO data used in the empirical analysis are obtained from the
NSE website and CMIE prime database.
Table I. IPO Proceeds
Variables
The IPO process is closely related to the firm qualities, market timing, agency issues, investors’
interest, asymmetric information, and so on. For this study, we have selected variables based on
previous IPO underpricing literature and some new firm characteristics variables. The description
of the variables selected and descriptive statistics of the variables are provided in Table II. For
market characteristics, we will be using broader market returns (“BSE”) as a proxy for market
sentiments. For firm characteristics, we have used pre-IPO year data as the basis for calculation
of ratios and margins.
To predict IPO underpricing, we first select the commonly used measure of underpricing, expressed as

Ui = (Poi − Pci) / Poi    (1)

where
Ui is the degree of underpricing,
Poi is the offer price for stock i, and
Pci is the closing price for stock i on the listing day.

Negative values of Ui represent an underpriced IPO, while positive values suggest an overpriced one. Ui is the dependent variable for our study; we classify it as "underpriced" or "overpriced" for further analysis.
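As an illustrative sketch, the measure in Eq. (1) and the two-class labelling can be computed as follows (the prices and names are hypothetical, not drawn from the paper's dataset):

```python
# Sketch of the underpricing measure in Eq. (1); prices are illustrative.
offers = [100.0, 250.0, 75.0]   # offer price P_o per issue
closes = [130.0, 240.0, 75.0]   # listing-day closing price P_c

def underpricing(p_offer, p_close):
    """U = (P_o - P_c) / P_o: negative => underpriced, else overpriced."""
    return (p_offer - p_close) / p_offer

# Zero or positive U is grouped with the overpriced class, matching the
# paper's treatment of zero first-day returns.
labels = ["underpriced" if underpricing(o, c) < 0 else "overpriced"
          for o, c in zip(offers, closes)]
print(labels)  # ['underpriced', 'overpriced', 'overpriced']
```

The first issue closes above its offer price on day one, so its U is negative and it is labelled underpriced.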
RESEARCH METHODOLOGY
We intend to utilize classification algorithms to predict IPO underpricing and compare the
techniques based on their predictive ability. We also aim to identify variables that are significantly
better predictors of IPO underpricing. Accordingly, we selected the following set of methods to analyze our data: logistic regression, decision trees, and ensemble techniques such as bagging, random forest, and gradient boosting.
Predictive Models
Logistic Regression
Logistic regression (LR) is a statistical method that predicts the probability of a categorical
response for a given set of independent variables. (Hosmer & Lemeshow, 1989) provide a
comprehensive introduction to logistic regression analysis.
The logistic regression model uses the odds ratio, which represents the probability of an event of
interest compared with the probability of not having an event of interest. The model is based on
the natural logarithm of the odds ratio, given by

ln(p / (1 − p)) = β0 + β1x1 + β2x2 + … + βkxk

where p is the probability of success, p/(1 − p) is the odds ratio, and β0, β1, β2, …, βk are the parameters of the model. LR uses the maximum likelihood method to estimate the model parameters.
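A minimal logistic-regression sketch, assuming scikit-learn and synthetic data (the three predictors and their coefficients are illustrative, not the study's variables):

```python
# Logistic regression sketch on synthetic data: three hypothetical predictors,
# maximum-likelihood fit, and predicted class probabilities.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                       # illustrative predictors
true_logit = X @ np.array([1.5, -2.0, 0.5])         # assumed true log-odds
y = (rng.random(200) < 1 / (1 + np.exp(-true_logit))).astype(int)

model = LogisticRegression().fit(X, y)              # estimates betas by MLE
proba = model.predict_proba(X)[:, 1]                # p = P(y = 1 | x)
print(model.coef_, model.intercept_)
```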
Decision Trees
Decision trees are one of the most used predictive modeling algorithms in practice. Decision trees
were first applied to language modeling by (Bahl et al.,1987) to estimate the probability of spoken
words. A decision tree is a predictive model, which is a mapping from observations about an item
to conclusions about its target value. Decision trees work by doing successive binary splits. The
first split will yield the biggest separation or distinction in two groups of data. Each subgroup is
then split until some stopping criterion is reached. In the tree structure, leaves represent classifications (also referred to as labels), non-leaf nodes are features, and branches represent the feature values that lead to the classifications.
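The successive binary splits described above can be sketched with scikit-learn on synthetic data (the depth limit stands in for a stopping criterion; all settings are illustrative):

```python
# Decision tree sketch on synthetic data: successive binary splits up to a
# stopping criterion (here, a maximum depth of 3).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=300, n_features=4, random_state=1)
tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)
print(export_text(tree))   # internal nodes show splits, leaves show classes
```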
Bagging or Bootstrap Aggregating
Bagging (Breiman, 1996) is a method for fitting multiple versions of a prediction model and then
combining them into an aggregated prediction. In the bagging algorithm, bootstrap copies of the original training data are created, and the regression or classification algorithm (commonly referred to as the base learner) is applied to each bootstrap sample. In the classification context, new predictions are made by taking the majority vote across the predictions made by the individual trees. Bagging effectively reduces the variance of an individual base learner because of the aggregation process.
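The bootstrap-and-vote procedure can be sketched with scikit-learn's BaggingClassifier on synthetic data (the sample sizes and tree count are illustrative, not the paper's settings):

```python
# Bagging sketch on synthetic data: bootstrap copies of the training set,
# one decision tree per copy (the default base learner), and a majority
# vote across trees at prediction time.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=2)
bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=2)
bag.fit(X, y)
print(bag.score(X, y))
```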
Random Forest
Random forest (Breiman, 2001) combines the two concepts of bagging and random selection of features by generating a set of classification trees, where the training set for each tree is selected by bootstrapping from the original sample and the features considered for partitioning at each node are a random subset of the original feature set. Random forest has become a commonly used tool owing to its ability to handle many features with small samples and its improved accuracy. Random feature sampling and bootstrapping together reduce the correlation between the generated trees, and majority voting over the class responses reduces the variance of the error, providing an improvement over the bagging algorithm.
When building decision trees, each time a split in the tree is considered, a random selection of m
predictors is chosen as a subset of split candidates from the full set of predictors. The number of
predictors considered at each split (m) is approximately equal to the square root of the total
number of predictors, p.
Random forest trees are insensitive to skewed distributions, outliers, and missing values; they are considered one of the most efficient predictive ML techniques.
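The m ≈ sqrt(p) rule for the per-split candidate subset can be sketched in scikit-learn via max_features="sqrt" (synthetic data; all other settings are illustrative):

```python
# Random forest sketch on synthetic data: bootstrapped trees plus a random
# subset of m = sqrt(p) candidate predictors at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=9, n_informative=4,
                           random_state=3)
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            random_state=3)
rf.fit(X, y)                     # sqrt(9) = 3 predictors tried per split
print(rf.score(X, y))
```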
Gradient Boosting
Gradient boosting (GB) (Freund & Schapire, 1997; Friedman, 2001) is a machine-learning algorithm for regression and classification problems that combines the output of many weak predictive models to produce a final robust predictive model.
GB procedures are invariant under all monotonic transformations of a single input variable (e.g.,
logarithm transform) and are not sensitive to outliers (this algorithm isolates outliers only in
separate nodes without affecting the performance of the final model). GB is insensitive to
multicollinearity, and it is more robust due to better handling of uncorrelated inputs (Friedman,
2001). GB can select and rank the variables, which provides a feasible way to compare IPO
underpricing predictors.
The statistical framework describes boosting as a numerical optimization problem whose goal is to minimize the loss of the model by adding weak learners via a gradient-descent-like process. The training samples are not weighted equally for the later learners: samples with larger errors receive higher weights, so each later learner in GB is adjusted based on the errors made by the previous learners. GB has three major parts:
Loss Function - The role of the loss function is to estimate how well the model makes predictions on the given data.
Weak Learner - A weak learner is one that classifies the data but does so poorly, with a high error rate. Decision tree models are often selected as weak learners.
Additive Model - This is the iterative, sequential approach of adding the trees (weak learners) one step at a time. Each iteration should reduce the value of the loss function.
Boosted trees are grown sequentially; each tree is grown using information from previously
grown trees to improve performance. By fitting each tree in the sequence to the previous tree’s
residuals, we are allowing each new tree in the sequence to focus on the previous tree’s
mistakes and thus do better.
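The sequential tree-on-residuals procedure described above can be sketched with scikit-learn's GradientBoostingClassifier on synthetic data (all hyperparameter values are illustrative, not the study's):

```python
# Gradient boosting sketch on synthetic data: shallow trees added one at a
# time, each fitted to the errors (pseudo-residuals) of the ensemble so far.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=4)
gb = GradientBoostingClassifier(n_estimators=100,   # number of weak learners
                                learning_rate=0.1,  # shrinkage per step
                                max_depth=3,        # shallow trees as learners
                                random_state=4)
gb.fit(X, y)
print(gb.score(X, y))
```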
Evaluation Metrics
We have analyzed the above methods based on their predictive accuracy, sensitivity, AUC, and RIS. Sensitivity plays an important role in this study, as the Type I error matters more for the results than the Type II error.
• Type I error (positive class: "overpriced"): if the model predicts an overpriced IPO as underpriced, it could incur a significant loss for an investor. The higher the sensitivity, the lower the Type I error.
• Type II error: if the model predicts an underpriced IPO as overpriced, the investor would not incur losses, since the investor simply would not invest in that IPO based on the model's prediction.
AUC stands for Area Under the ROC Curve. The ROC (Receiver Operating Characteristic) curve is a graph that shows the performance of a classification model at all classification thresholds, and the AUC provides an aggregate measure of performance across all possible thresholds.
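On hypothetical predictions, sensitivity and AUC can be computed as follows (the labels and scores are invented for illustration, not taken from the study):

```python
# Evaluation-metric sketch on hypothetical predictions. Here 1 denotes the
# positive class ("overpriced"), matching the paper's Type I error setup.
from sklearn.metrics import recall_score, roc_auc_score

y_true  = [1, 1, 1, 0, 0, 1, 0, 1]                    # actual labels
y_pred  = [1, 1, 0, 0, 0, 1, 1, 1]                    # hard class predictions
y_score = [0.9, 0.8, 0.4, 0.3, 0.2, 0.7, 0.6, 0.85]   # predicted P(overpriced)

sensitivity = recall_score(y_true, y_pred)   # TP / (TP + FN) = 4/5 here
auc = roc_auc_score(y_true, y_score)         # threshold-free ranking measure
print(sensitivity, auc)
```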
The relative influence score (RIS) measures how useful a particular variable is to a model by quantifying its importance while marginalizing over the other variables. Friedman and Meulman (2003) showed that such measures are based on the number of times a variable is selected to improve the model. The RIS is used to study variable importance and its effect on the prediction model.
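A rough analogue of an RIS ranking can be obtained from the impurity-based feature_importances_ of a fitted gradient boosting model in scikit-learn; this is a sketch on synthetic data, and the importance measure is scikit-learn's, not necessarily the exact RIS of Friedman and Meulman (2003):

```python
# Variable-importance sketch: scikit-learn's impurity-based
# feature_importances_ as a stand-in for an RIS-style ranking.
# Synthetic data; feature indices, not the paper's variables.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           random_state=5)
gbm = GradientBoostingClassifier(random_state=5).fit(X, y)

# Sort features by their relative influence, largest first.
ranking = sorted(enumerate(gbm.feature_importances_),
                 key=lambda t: t[1], reverse=True)
for idx, score in ranking:
    print(f"feature {idx}: relative influence {score:.3f}")
```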
Table III. Model Comparison
EMPIRICAL RESULTS
We applied the various classification methods to a training dataset containing 70% of the observations, assigned randomly using stratified sampling, and tested our models on the remaining 30%. Table III summarizes the outcomes obtained by applying the models to the testing dataset.
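A minimal sketch of such a 70/30 stratified split, assuming scikit-learn and a synthetic stand-in for the 262-IPO sample (the roughly 63/37 class balance follows the Data section):

```python
# Stratified 70/30 split sketch on a synthetic stand-in for the 262-IPO
# sample; the class weights approximate the paper's 63/37 balance.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=262, weights=[0.37], random_state=6)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=6)  # keeps class ratios
print(len(X_tr), len(X_te))   # 183 training IPOs, 79 test IPOs
```

Stratification ensures the underpriced/overpriced proportions are nearly identical in the training and test samples, which matters with only 262 observations.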
Table IV summarizes the predictive performances of the gradient boosting (GB) model, logistic regression (GLM), pruned tree, random forest, and bagging. On the test sample, the GB model achieves a significantly higher AUC (also shown in Figure I), higher accuracy, and higher sensitivity, indicating that the GB model has better overall prediction performance than the other models.
[Table IV, partially recovered rows: Pruned Tree: 0.60, 0.58, 0.51; GLM: 0.55, 0.60, 0.38]
Based on the variable importance graph of the GB model (as shown in Figure II), we find that the BSE 3M return, D/A ratio, profit margin, and asset turnover ratio are the most important indicators of an IPO's underpricing. The study also highlights that firm characteristics have higher variable importance than listing characteristics.
CONCLUSION
Owing to the involvement of many determinants with very different explanatory power, the presence of outliers, and data inconsistencies, predicting IPO underpricing is a challenging task. We selected 11 determinants based on the empirical findings of numerous previous studies. The outcomes of this study show that profit margin, short-term market sentiment, and the leverage (D/A) ratio are the most important predictors of IPO underpricing in the Indian markets. Based on the results, firm characteristics such as margin measures, leverage ratios, and asset quality, market characteristics such as short-term market sentiment, and listing characteristics such as issue expenses should be considered when investing in an IPO from a short-term capital gain perspective.
The gradient boosting model's ability to predict short-term gains may be of relevance, as both IPO issuers and investors are highly concerned with the uncertainty and market response regarding IPO price and performance. This study shows that the gradient boosting model has advantages over GLM models in predicting IPO underpricing in three respects: it adapts to large numbers of input variables, it provides a ranking (RIS) of these inputs based on their contribution to the prediction, and it properly handles predictors that show large skewness or kurtosis and that violate GLM assumptions.
The empirical findings of this study add to the existing literature by emphasizing the accuracy and methodological advantages of ensemble techniques in predicting IPO underpricing.
REFERENCES
[1] Baba, B., & Sevil, G. (2020). Predicting IPO initial returns using random forest. Borsa Istanbul
Review, 20(1), 13-23.
[2] Bansal R., & Khanna, R. (2012). IPOs underpricing and money “left on the table” in Indian
market. International Journal of Research in Management, Economics and Commerce, 2,
106-120.
[3] Baron, D. (1982). A model of the demand for investment banking advice and distribution services for new issues. Journal of Finance, 37(4), 955-976.
[4] Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
[5] Chhabra, S., Kiran, R., & Sah, A. N. (2017a). Information asymmetry leads to underpricing:
Validation through SEM for Indian IPOs. Program, 51(2),116-131.
[6] Chhabra, S., Kiran, R., Sah, A. N., & Sharma, V. (2017b). Information and performance
optimization: A study of Indian IPOs during 2005-2012. Program, 51(4), 458-471.
[7] Desai, V. S., & Bharati, R. (1998). A comparison of linear regression and neural network
methods for predicting excess returns on large stocks. Annals of Operations Research, 78(0),
127-163.
[8] Freund, Y., & Schapire, R.E. (1996). Experiments with a new boosting algorithm. Machine
Learning: Proceedings of the Thirteenth International Conference, 13,148-156.
[9] Freund, Y., & Schapire, R.E. (1997). A decision-theoretic generalization of on-line learning
and an application to boosting. Journal of Computer and System Sciences, 55(1), 119-139.
[10] Friedman, J. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), 1189-1232.
[11] Friedman, J., & Meulman, J. (2003). Multiple additive regression trees with application in epidemiology. Statistics in Medicine, 22(9), 1365-1381.
[12] Gao, Y. (2010). What comprises IPO initial returns: Evidence from the Chinese market.
Pacific-Basin Finance Journal, 18(1), 77-89.
[13] Ghosh, S. (2002). Underpricing of IPOs: The Indian experience over the last decade. SSRN Electronic Journal.
[14] Habib, M. A., & Ljungqvist, A. P. (2001). Underpricing and entrepreneurial wealth losses in IPOs: Theory and evidence. The Review of Financial Studies, 14(2), 433-458.
[15] Hosmer, D., & Lemeshow, S. (1989). Applied logistic regression. New York: Wiley.
[16] Krishnamurti, C. and Kumar, P. (2002). The initial listing performance of Indian IPOs,
Managerial Finance, 28 (2), 39-51.
[17] Lin, C., & Hsu, S. (2008). Determinants of the initial IPO performance: Evidence from Hong
Kong and Taiwan. Applied Financial Economics, 18(12), 955-963.
[18] Ljungqvist, A., & Wilhelm, W. J. (2003). IPO Pricing in the Dot-Com Bubble. The Journal of
Finance, 58(2), 723–752.
[19] Loughran, T., Ritter, J. R., & Rydqvist, K. (1994). Initial public offerings: International insights. Pacific-Basin Finance Journal, 2(2-3), 165-199.
[20] Loughran, T., & Ritter, J. R. (2002). Why don't issuers get upset about leaving money on the table in IPOs? Review of Financial Studies, 15(2), 413-443.
[21] Luque, C., Quintana, D., & Isasi, P. (2012). Predicting IPO underpricing with genetic
algorithms. International Journal of Artificial Intelligence, 8(S12), 133-146.
[22] Marshall, B. B. (2004). The effect of firm financial characteristics and the availability of alternative finance on IPO underpricing. Journal of Economics and Finance, 28(1), 88-103.
[23] Pande, A., & Vaidyanathan, R. (2009). Determinants of IPO underpricing in the National Stock
Exchange of India. ICFAI Journal of Applied Finance, 25(1), 84-107.
[24] Procianoy, Jairo Laser & Cigerza, Gilles. (2007). IPOs in Emerging Markets: A Comparison
of Brazil, India and China. SSRN Electronic Journal.
[25] Pande, A., & Vaidyanathan, R. (2007). Determinants of IPO underpricing in the National Stock Exchange of India. IUP Journal of Applied Finance, 15(1), 17-18.
[26] Quintana, D., Sáez, Y., & Isasi, P. (2017). Random forest prediction of IPO underpricing. Applied Sciences, 7(6), 636.
[27] Ritter, J. R. (1984). The “hot issue” market of 1980. Journal of Business, 57(2), 215-240.
[28] Ritter, J. R., & Welch, I. (2002). A review of IPO activity, pricing, and allocations. The Journal
of Finance, 57(4), 1795-1828.
[29] Robertson, S. J., Golden, B. L., Runger, G. C., & Wasil, E. E. (1998). Neural network models
for initial public offerings. Neurocomputing, 18(3), 165-182.
[30] Stoll, H. R., & Curley, A. J. (1970). Small business and the new issues market for equities. Journal of Financial and Quantitative Analysis, 5(3), 309-322.
[31] Tian, L. (2011). Regulatory underpricing: Determinants of Chinese extreme IPO returns.
Journal of Empirical Finance, 18(1), 78-90.