Proceedings of the International Conference on Inventive Computation Technologies (ICICT 2023)

IEEE Xplore Part Number: CFP23F70-ART; ISBN: 979-8-3503-9849-6

Forecasting Number of Indian Startups using Supervised Learning Regression Models
Darshanaben Dipakkumar Pandya
Shri C.J Patel College Of Computer Studies (BCA), Sankalchand Patel University, Visnagar
[email protected]

Amitkumar Kantilal Patel
Assistant Professor, Silver Oak College Of Computer Application, Silver Oak University, Gota, Ahmedabad, Gujarat
[email protected]

Janki Manishkumar Purohit
Assistant Professor, Silver Oak College Of Computer Applications, Silver Oak University, Gota, Ahmedabad, Gujarat
[email protected]

Madhavi Nandlal Bhuptani
Assistant Professor, Silver Oak College Of Computer Application, Silver Oak University, Gota, Ahmedabad, Gujarat
[email protected]

Dr. Sheshang Degadwala
Associate Professor & Head of Department, Department of Computer Engineering, Sigma University, Vadodara, Gujarat
[email protected]

Dhairya Vyas
Research Scholar, The Maharaja Sayajirao University of Baroda, Vadodara, Gujarat, India
[email protected]

2023 International Conference on Inventive Computation Technologies (ICICT) | 979-8-3503-9849-6/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICICT57646.2023.10134480

Abstract— For regulators and investors, estimating the potential of the Indian market requires accurately predicting the number of startups in the Indian ecosystem. Startup growth may be predicted with great accuracy using Supervised Learning Regression models. These models take into account a wide range of variables, including financing, market demand, and competition. The purpose of this research is to use Supervised Learning Regression models to make predictions about the future of the startup scene in India. Information from the Startup Database, official papers, and scholarly journals all factored into the analysis. Supervised Learning Regression models are then used to make predictions about future growth based on the identified variables, using training data taken from the past. Factors including finance availability, government regulations, and market demand are identified in the report as having a substantial influence on the number of startups in India. The potential expansion of the startup sector in India is foreseen by using Supervised Learning Regression models to forecast the future number of companies in the Indian ecosystem. The findings of this research support the use of linear models for estimating future startup activity in India. Policymakers and investors may benefit from this study's results by learning more about the forces that are propelling India's startup scene forward.

Keywords— Indian Startups, Forecasting, Linear Models, Growth, Factors.

I. INTRODUCTION
The number of Indian startups operating in a wide range of industries has increased dramatically in recent years. The prospects for these new businesses depend on a number of elements, such as access to capital, regulatory environment, consumer demand, and level of competition. Policy makers and investors in the Indian market may benefit from accurate projections of startup growth.

Predicting a startup's growth from a variety of inputs using a Supervised Learning Regression model has been shown to be a valuable tool. In this work, Supervised Learning Regression models are used to predict the total number of Indian startup companies. The startup database, government papers, and scholarly publications are all used to determine which variables influence the development of startups in India.

The study focuses on identifying the significant factors that drive the growth of startups in India, such as the availability of funding, government policies, market demand, and competition. In order to forecast future growth based on these characteristics, linear models are developed using past data.

Fig. 1. Indian Startups [15]

Policy makers and investors may utilise the findings of this study to better understand the Indian market and the startup industry's potential for development. The results of this research can add to what is already known about utilising Supervised Learning Regression models to predict the expansion of new businesses.

II. RELATED WORK
This literature analysis draws from fourteen publications that explore the potential of machine learning and other data analytics tools for gauging the prospects of new businesses. The high rate of startup failure makes this a significant topic of study, as does the need to isolate the most promising new ventures.

In [1], Savin et al. (2023) employ a topic-based categorization system to identify worldwide patterns among new businesses. The authors provide a process for spotting new developments and anticipating how they may affect new businesses. Social media, news stories, and company reports
were among the many data sets analysed for this research. The results demonstrate the efficiency of the suggested strategy in spotting tendencies and foreseeing their prospective effects on new businesses.

Investment returns for new businesses may be predicted with the use of AI algorithms and econometric models, as investigated in [2] by Farahani et al. Portfolio optimization utilising VaR and C-VaR is also suggested by the authors. Financial documents, stock market data, and news articles are only some of the data sources that were analysed for this research. The results verify the efficacy of the suggested approach in estimating ROI for new businesses.

Success rates of new businesses are predicted using machine learning methods in [3] by Bangdiwala et al. The authors present a strategy for discovering what makes a business successful and then utilising that knowledge to train a machine learning model. Financial documents, industry reports, and news articles were among the data sets analysed for this research. The results verify the validity of the suggested approach for forecasting the long-term viability of new businesses.

In [4], Castle et al. (2021) zero in on the lessons learned from forecasting contests as a basis for their forecasting guidelines. Multiple models, expert opinion, and data visualisation are all discussed by the authors as aspects that contribute to reliable forecasting. The research was conducted by analysing results from many prediction contests. The results demonstrate the efficacy of the suggested forecasting principles in improving prediction accuracy.

CapitalVX is a machine learning model proposed in [5] by Ross et al. (2021) for use in selecting startups and predicting their exits. Financial documents, industry studies, and news stories are only some of the materials the authors utilise to train the model. The research demonstrates that the CapitalVX model accurately predicts the success rate of startups and pinpoints the optimal exit option.

In [6], Varma (2021) reviews the state of the art in predicting the success of new businesses using machine learning. Data quality and model interpretability are only two of the issues the author highlights as major obstacles to the advancement of machine learning-based prediction methods. This article is helpful since it summarises current research on using machine learning to predict the success of new businesses.

Using Crunchbase data, Żbikowski and Antosiuk (2021) offer in [7] a machine learning method that is devoid of bias. The authors forecast the success of businesses using a variety of machine learning approaches, such as decision trees and logistic regression. The research demonstrates that the suggested method identifies the most important determinants of startup success and accurately predicts the chance of success.

Bonaventura et al. (2020) explore in [8] the use of network centrality indicators for predicting success in the global start-up network. The authors analysed the ties between startups and investors using a database of 80,000 companies, which represents more than 3 million contacts. Their findings indicate that start-ups with ties to a few powerful investors do better than their peers. Results show that metrics of network centrality are useful for forecasting startups' fortunes.

Arroyo et al. (2019) examine the usefulness of machine learning algorithms for assisting with VC investment decisions [9]. Five different machine learning algorithms were evaluated for their ability to predict the success of 272 startups using this dataset. Based on their findings, the Random Forest and Support Vector Machine (SVM) algorithms are the most effective at forecasting the long-term viability of new businesses. The research shows that machine learning might be useful in assisting VC investment choices.

By combining conventional econometric models with machine learning algorithms, Krishna et al. (2019) established a new framework for foreseeing the success of new businesses [10]. The authors created a prototype system that takes into account details including the startup's physical location, industry, and financing history. They compared the results of their method to those of conventional econometric models by applying it to a dataset consisting of 2,000 startups. When compared to conventional econometric models, the results demonstrate that the machine learning-based method is superior at forecasting the long-term performance of startups.

Dellermann et al. (2018) provide a hybrid intelligence approach to forecasting early-stage start-up performance in their study [11]. The authors created a prototype approach to forecast the success of startups that blends human assessment with machine learning algorithms. Using data from 600 startups, the system demonstrated that its hybrid intelligence approach provided more accurate predictions of startup success than did conventional machine learning techniques.

The accuracy of linear regression and support vector regression (SVR) algorithms for forecasting the success of startups was compared by Kavitha et al. (2017) in [12]. The authors examined the accuracy of the two algorithms in forecasting the return on investment using a dataset including information on one thousand different startups. Their findings demonstrate that the SVR algorithm is superior to linear regression for forecasting the long-term viability of new businesses.

Cassar (2014) looked at how prior business and industry expertise might foretell a company's future performance in [13]. Using data from 657 new businesses, the author discovered that individuals having previous start-up experience had a better chance of success than those without. Furthermore, the author discovered that prior expertise in the sector is more essential than prior start-up experience in determining the success of new businesses.

In [14], Shalabh (2013) revisits the problem of accurate prediction using linear regression models. To boost the performance of linear regression models, the author suggested a strategy that combines principal component analysis with ridge regression. The research emphasises the need for accurate prediction methodologies in the startup industry.

In conclusion, the literature study emphasises the value of machine learning and other data analytics techniques for gauging the potential of new businesses. Topic-based categorization, econometric modelling, and various machine learning algorithms are only some of the methods proposed
for prediction in the papers reviewed in this overview. The findings demonstrate the efficacy of these methods in determining the most important criteria for a startup's success and in making accurate predictions of that success.

III. METHODOLOGY

A. Dataset
The top 300 Indian startups are represented in the given dataset, which contains the following categories of information:

• Company - Name of the startup.
• City - The city where the startup is headquartered.
• Starting Year - The year in which the startup was founded.
• Founders - Names of the startup's founders.
• Industries - The industry sector in which the startup operates.
• No. of Employees - The total number of employees working for the startup.
• Funding Amount in USD - The total amount of funding received by the startup in US dollars.
• Funding Rounds - The number of times the startup has raised funds from the market. Each funding round requires the founders to trade equity in their business for capital to advance their companies to the next level.
• No. of Investors - The total number of investors who have invested in the startup.

B. Linear Regression Models

Linear Regression [2,3,10,11]: Linear regression is a straightforward and popular statistical tool for modelling the association between a dependent variable and a set of independent variables. The connection between the variables is assumed to be linear, that is, representable by a straight line. Finding the best-fit line in linear regression is all about reducing the gap between the predicted and observed values. The equation for linear regression, which is often used for making predictions about quantitative variables, is

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$

where $Y$ is the dependent variable, $X_1, X_2, \dots, X_n$ are the independent variables, and $\beta_0, \beta_1, \beta_2, \dots, \beta_n$ are the coefficients estimated from the data.

Decision Tree Regression [4,5,7,9]: Decision trees model the connection between variables non-parametrically, which allows them to capture non-linear patterns. A tree divides the data recursively into subgroups according to the values of the independent variables, then predicts the mean or median value of the dependent variable within each of these subsets. Both quantitative and categorical data may be easily analysed and interpreted using decision trees. However, they are vulnerable to overfitting, which may reduce how well they generalise to unseen data.

Random Forest Regression [6,8,12,14]: Random forest regression is an ensemble technique that uses a collection of different decision trees to get a more precise and reliable prediction. A forest of decision trees is constructed, with each tree trained on a different subset of data and attributes. The final forecast is obtained by averaging the predictions of the separate trees. Random forest regression mitigates the overfitting problem of decision tree regression, leading to improved performance in most cases. Predicting real estate and stock market values, as well as the results of medical procedures, are just a few of its many uses.

In conclusion, linear regression is a straightforward technique, but its underlying assumption of a linear relationship between variables limits its ability to capture complex non-linear patterns in data. The more flexible decision tree regression and random forest regression offer greater potential for discovering such patterns. Decision tree regression is a single-tree approach, whereas random forest regression uses many decision trees to boost model performance. There are many different types of regression models available, and selecting the most appropriate one will depend on the nature of the issue at hand and the data being analysed.

IV. NUMBER OF STARTUP FORECASTING
Fig. 2 depicts the workflow used to forecast the number of Indian startups between 1984 and 2022 using linear, random forest, and decision tree regression.

Data Reading (Indian Startup, 1984-2022) → Pre-Processing (Null-removal and Duplicate Removal) → Calculate Correlation → Supervised Learning Regression Model (Linear, Random Forest and Decision Tree) → Make Forecasting → Evaluation (R2 Score, MSE, RMSE and MAE)

Fig. 2. Flow Diagram of Forecasting

Reading and collecting the Indian startup data, which spans the years 1984 to 2022, is the initial stage. This data is the input to the next stages of analysis and is used to develop the regression models.

Data is pre-processed to guarantee its quality and fitness for use in model construction. Handling missing values (null-removal), deleting duplicate data points, and converting data into an analysis-ready format are all examples of what may fall under this category.

After the dataset has been cleaned and prepared, the following stage is to determine the degree of association between its variables. When developing regression models, correlation is useful because it provides information on the strength and direction of linear relationships between variables.
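The pre-processing and correlation steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the records are made up, the function names are hypothetical, and forming the target series by counting startups per founding year is an assumption about how the data is aggregated.

```python
from collections import Counter
from math import sqrt

# Hypothetical raw records (company, founding year); None marks a missing value.
raw_records = [
    ("A", 2010), ("B", 2010), ("C", 2011), ("C", 2011),  # "C" is duplicated
    ("D", None),                                          # missing founding year
    ("E", 2011), ("F", 2012), ("G", 2012), ("H", 2012),
    ("I", 2013), ("J", 2013), ("K", 2013), ("L", 2013),
]

def preprocess(records):
    """Null-removal followed by duplicate removal, keeping first occurrences."""
    complete = [row for row in records if None not in row]
    return list(dict.fromkeys(complete))

def pearson(xs, ys):
    """Pearson correlation: strength and direction of a linear relationship."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

clean = preprocess(raw_records)

# Assumed aggregation: number of startups founded in each year.
counts = dict(sorted(Counter(year for _, year in clean).items()))
years = list(counts)
r = pearson(years, [counts[y] for y in years])
print(f"{len(clean)} clean rows, correlation(year, count) = {r:.3f}")
```

A coefficient close to +1 or -1 suggests that a linear model may describe the relationship well, while a value near 0 suggests it may not.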
The regression models are constructed during the Build
Model phase. Linear Regression, Random Forest Regression,
and Decision Tree Regression are the three models shown in
the block diagram. The data that has already been cleaned
and sorted, together with the associated variables discovered
in the first two processes, are used to build these models.
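To make the three model families concrete, the sketch below fits each of them to a single feature (the year) using only the Python standard library. It is an illustration under stated assumptions rather than the method used in the paper: the yearly counts are invented, all function names are hypothetical, and the "forest" is plain bagging of trees, because with one feature the attribute subsampling that a random forest also performs has nothing to choose from.

```python
import random
from statistics import mean

def fit_linear(xs, ys):
    """Ordinary least squares for one feature; returns a predict(x) function."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def fit_tree(xs, ys, max_depth=3, min_leaf=1):
    """Regression tree on one feature: split where squared error drops most."""
    def sse(values):
        m = mean(values)
        return sum((v - m) ** 2 for v in values)

    def build(points, depth):
        targets = [y for _, y in points]
        if depth == 0 or len(points) < 2 * min_leaf:
            return mean(targets)                  # leaf predicts the mean target
        best = None
        for i in range(min_leaf, len(points) - min_leaf + 1):
            if points[i - 1][0] == points[i][0]:
                continue                          # no threshold between equal x
            cost = sse(targets[:i]) + sse(targets[i:])
            if best is None or cost < best[0]:
                best = (cost, i)
        if best is None:
            return mean(targets)
        i = best[1]
        threshold = (points[i - 1][0] + points[i][0]) / 2
        return (threshold, build(points[:i], depth - 1), build(points[i:], depth - 1))

    tree = build(sorted(zip(xs, ys)), max_depth)

    def predict(x):
        node = tree
        while isinstance(node, tuple):            # internal node: follow the split
            threshold, left, right = node
            node = left if x <= threshold else right
        return node
    return predict

def fit_forest(xs, ys, n_trees=25, seed=0, **tree_kwargs):
    """Bagging: each tree sees a bootstrap resample; predictions are averaged."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_tree([xs[i] for i in idx], [ys[i] for i in idx], **tree_kwargs))
    return lambda x: mean([t(x) for t in trees])

# Invented yearly startup counts, for illustration only.
years = list(range(2010, 2020))
counts = [2, 3, 3, 5, 8, 12, 15, 21, 26, 30]

linear = fit_linear(years, counts)
tree = fit_tree(years, counts)
forest = fit_forest(years, counts)
for name, model in [("linear", linear), ("tree", tree), ("forest", forest)]:
    print(name, round(model(2020), 2))
```

One design point worth noting: a tree predicts a constant outside the range of years it was trained on, so the tree and the bagged ensemble return the same value for every future year, whereas the linear model extrapolates the fitted trend.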
Once the regression models have been constructed, they may be utilised for forecasting. In this stage the models are used to make predictions about the dependent variable given information about the independent variables.
Fig. 4. Data Correlation

Evaluating the effectiveness of the regression models is the next step after producing the predictions. The R2 score, the mean squared error (MSE), the root mean squared error (RMSE), and the mean absolute error (MAE) are among the metrics that may be used for this purpose. These criteria allow for a more thorough evaluation of the models' predictive performance.
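The four metrics have simple closed forms, written out below as a minimal sketch. The function name and the sample values are hypothetical and serve only to show the arithmetic.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Return R2, MSE, RMSE and MAE for paired observed/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {
        "r2": 1 - ss_res / ss_tot,                   # share of variance explained
        "mse": mse,
        "rmse": sqrt(mse),                           # same units as the target
        "mae": sum(abs(e) for e in errors) / n,
    }

# Hypothetical observed and predicted startup counts.
observed = [3, 5, 8, 12]
predicted = [2, 6, 8, 11]
scores = regression_metrics(observed, predicted)
print(scores)
```

A higher R2 and lower MSE, RMSE, and MAE indicate a better fit. RMSE and MAE are expressed in the units of the target, which makes them easier to interpret than MSE.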
In conclusion, the block diagram summarises the main processes involved in developing and assessing regression models for data from Indian startups: data reading, pre-processing, correlation calculation, model development, forecasting, and evaluation using a number of different metrics.

V. RESULT ANALYSIS
Here, the performance of linear, random forest, and decision tree regression is examined on data spanning Indian startups' founding years (1984-2022).

Fig. 3. Dataset Reading

Fig. 5. Linear Regression

Fig. 6. Decision Tree Regression
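Because RMSE is by definition the square root of MSE, the values reported in Table I can be cross-checked with a short calculation. The sketch below uses the MSE and RMSE figures exactly as they appear in the table. The 0.01 tolerance is an assumption chosen to allow for two-decimal rounding or truncation.

```python
from math import sqrt

# (MSE, RMSE) pairs as reported in Table I.
reported = {
    "Linear Regression": (112.16, 6.96),
    "Decision Tree": (63.45, 7.96),
    "Random Forest": (13.78, 3.71),
}

TOLERANCE = 0.01  # assumed allowance for two-decimal rounding

# A row is consistent when the reported RMSE matches sqrt(reported MSE).
consistent = {
    model: abs(sqrt(mse) - rmse) < TOLERANCE
    for model, (mse, rmse) in reported.items()
}

for model, (mse, rmse) in reported.items():
    print(f"{model}: sqrt(MSE) = {sqrt(mse):.2f}, reported RMSE = {rmse:.2f}")
```

The Decision Tree and Random Forest rows satisfy the identity. The Linear Regression row does not, since the square root of 112.16 is about 10.59 rather than 6.96, which may point to a transcription error in one of those two reported values.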

Fig. 7. Random Forest Regression

TABLE I. CLASSIFICATION PARAMETERS

Model               R2-Score   MSE      RMSE   MAE
Linear Regression   0.42       112.16   6.96   0.42
Decision Tree       0.67       63.45    7.96   3.95
Random Forest       0.92       13.78    3.71   1.94

The result analysis shows that the Random Forest model gives the best performance among the supervised regression models considered.

CONCLUSION
In conclusion, the number of Indian startups may be predicted with the use of Random Forest regression. Using a linear regression model constructed from analysed past data, we may extrapolate information about the startups' future growth. The number of Indian startups serves as the dependent variable, and linear regression can be used to determine whether other factors are related to this number (e.g., time, funding, market size). Once patterns in the data have been recognised, the model may be utilised to create predictions. Keep in mind that the quality and trustworthiness of the data used in the development of the models determine the precision of the predictions. Data pre-processing, such as null-removal, duplicate-removal, and outlier management, is essential for protecting the reliability of the model. The performance and prediction capacity of the Random Forest model are further shown by its R2 score, MSE, RMSE, and MAE, which are 0.92, 13.78, 3.71, and 1.94, respectively. The future accuracy of predictions may be enhanced by the review and improvement of additional datasets.

REFERENCES
[1] I. Savin, K. Chukavina, and A. Pushkarev, “Topic-based classification and identification of global trends for startup companies,” vol. 60, no. 2. Springer US, 2023. doi: 10.1007/s11187-022-00609-6.
[2] M. Farahani, M. Shahvaroughi Farahani, and A. Esfahani, “Forecasting Startup Return using Artificial Intelligence Methods and Econometric Models and Portfolio Optimization Using VaR and C-VaR,” International journal of innovation in Engineering, vol. 2, no. 1, pp. 78–109, 2022, [Online]. Available: https://www.researchgate.net/publication/362073994
[3] M. Bangdiwala, Y. Mehta, S. Agrawal, and S. Ghane, “Predicting Success Rate of Startups using Machine Learning Algorithms,” in 2022 2nd Asian Conference on Innovation in Technology (ASIANCON), 2022, pp. 1–6. doi: 10.1109/ASIANCON55314.2022.9908921.
[4] J. L. Castle, J. A. Doornik, and D. F. Hendry, “Forecasting Principles from Experience with Forecasting Competitions,” Forecasting, vol. 3, no. 1, pp. 138–165, 2021, doi: 10.3390/forecast3010010.
[5] G. Ross, S. Das, D. Sciro, and H. Raza, “CapitalVX: A machine learning model for startup selection and exit prediction,” Journal of Finance and Data Science, vol. 7, pp. 94–114, 2021, doi: 10.1016/j.jfds.2021.04.001.
[6] S. Varma, “Machine Learning based Outcome Prediction of New Ventures: A review,” vol. 9, no. 3, pp. 529–532, 2021, [Online]. Available: www.ijert.org
[7] K. Żbikowski and P. Antosiuk, “A machine learning, bias-free approach for predicting business success using Crunchbase data,” Information Processing and Management, vol. 58, no. 4, 2021, doi: 10.1016/j.ipm.2021.102555.
[8] M. Bonaventura, V. Ciotti, P. Panzarasa, S. Liverani, L. Lacasa, and V. Latora, “Predicting success in the worldwide start-up network,” Scientific Reports, vol. 10, no. 1, pp. 1–6, 2020, doi: 10.1038/s41598-019-57209-w.
[9] J. Arroyo, F. Corea, G. Jimenez-Diaz, and J. A. Recio-Garcia, “Assessment of machine learning performance for decision support in venture capital investments,” IEEE Access, vol. 7, pp. 124233–124243, 2019, doi: 10.1109/ACCESS.2019.2938659.
[10] A. Krishna, A. Agrawal, and A. Choudhary, “IEEE International Conference on Data Mining Workshops, ICDMW,” IEEE International Conference on Data Mining Workshops, ICDMW, vol. 2019-November, pp. 798–805, 2019, doi: 10.1109/ICDMW.2016.103.
[11] D. Dellermann, P. Ebel, N. Lipusch, K. M. Popp, and J. M. Leimeister, “Finding the Unicorn: Predicting Early Stage Startup Success through a Hybrid Intelligence Method,” ICIS 2017: Transforming Society with Digital Innovation, pp. 1–12, 2018, doi: 10.2139/ssrn.3159123.
[12] S. Kavitha, S. Varuna, and R. Ramya, “A comparative analysis on linear regression and support vector regression,” Proceedings of 2016 Online International Conference on Green Engineering and Technologies, IC-GET 2016, 2017, doi: 10.1109/GET.2016.7916627.
[13] G. Cassar, “Industry and startup experience on entrepreneur forecast performance in new firms,” Journal of Business Venturing, vol. 29, no. 1, pp. 137–151, 2014, doi: 10.1016/j.jbusvent.2012.10.002.
[14] Shalabh, “A revisit to efficient forecasting in linear regression models,” Journal of Multivariate Analysis, vol. 114, no. 1, pp. 161–170, 2013, doi: 10.1016/j.jmva.2012.07.017.
[15] M. Rathore, “India: Number of recognized startups 2022,” Statista, 02-Jan-2023. [Online]. Available: https://www.statista.com/statistics/1155602/india-start-up-recognized-businesses/. [Accessed: 14-Apr-2023].
