
Business Analytics (A Case-Study Approach Using LDA Topic Modeling)

This document summarizes a research paper presented at the 5th International Conference on Computing Methodologies and Communication about using topic modeling via latent Dirichlet allocation (LDA) on a dataset of 2000 business articles to gain insights. The paper applies LDA topic modeling to identify concealed topics and generate word clouds. It preprocesses the data, builds and visualizes the LDA model, and reports empirical findings on the extracted topics. The paper is structured to provide background on business analytics, language processing, and topic modeling, and then describe the proposed methodology and experimental results.


Proceedings of the Fifth International Conference on Computing Methodologies and Communication (ICCMC 2021)

IEEE Xplore Part Number: CFP21K25-ART

Business Analytics: A case-study approach using LDA topic modelling

Vidya Vukanti
Department of Information Science & Engineering
CMR Institute of Technology
Bengaluru, India
[email protected]

Anu Jose
Department of Information Science & Engineering
CMR Institute of Technology
Bengaluru, India
[email protected]

2021 5th International Conference on Computing Methodologies and Communication (ICCMC) | 978-1-6654-0360-3/20/$31.00 ©2021 IEEE | DOI: 10.1109/ICCMC51019.2021.9418344

Strategic decision-making in business is improved through business analytics, which can be descriptive, diagnostic, predictive or prescriptive. Business analytics applies statistical methods to historical data to gain new insights for better policymaking. Topic modelling is a statistical, unsupervised machine-learning approach to text mining that can uncover themes in a corpus such as social media posts, annual reports, news coverage, related articles, and trends in the domain. In this research, topic modelling is applied to the Business Digital Economy Dataset, which hosts around 2400 titles, abstracts and keywords from different authors on various digital-economy topics. Of the various approaches to topic modelling, Latent Dirichlet Allocation (LDA) is the most widely used. This paper explores research articles on emerging trends in the business economy to extract the concealed semantic structures and generate word clouds. The research procedure comprises introduction of the dataset, data pre-processing, and building and visualizing the model.

Keywords: Business Analytics, Topic modelling, Latent Dirichlet Allocation (LDA)

I. INTRODUCTION

In today's world, business applications capture information from data sources in all corners of the world: transactional data, created mainly by businesses; crowdsourced data from services such as Instagram, OpenStreetMap and Flickr; social data from Twitter and LinkedIn; and machine data from IoT devices.

Business Analytics plays a key role in analyzing this vast amount of data, helping organizations to make simple and accurate decisions by providing comprehensive information in the form of dashboards, charts, forms, etc. In the literature, various text mining tools are used by a wide range of business analytics applications, but there is little evidence demonstrating the use of topic modelling [1]. In Business Analytics, topic modelling can be applied to generate business reports, to identify sheering topics on social media platforms and to extract latent themes from reviews on e-commerce websites. Other applications of topic modelling include scientific research, bioinformatics, social network analysis and software engineering [2].

Strategic management in business analytics is the process of articulating, applying and assessing cross-functional decisions that will enable an organization to achieve its objectives in an optimal manner [3]. To facilitate strategic analysis, academics and practitioners have devised a variety of management tools [4,5], each with a different objective [6]. Business organizations use Business Analytics to improve strategic management for better decision-making and to gain new insights into their businesses through various statistical methods and technologies applied to historical data.

This paper considers the problem of modelling a digital economy corpus of nearly 2000 articles. Latent Dirichlet Allocation (LDA) based topic modelling is used to cluster the topics by calculating the probability distribution over a set of words. The result is represented in the form of word clouds for various topics, showing the statistical relationships that are useful for tasks such as cataloguing, innovation, summarization, and understanding the importance of the topics.

The rest of the paper is structured as follows: Section 2 provides a literature survey on business analytics, language processing and text mining, language modelling, and topic modelling in the field of business analytics. A detailed description of the proposed methodology is given in Section 3. Section 4 demonstrates our approach and reports empirical findings. Conclusions and future work are detailed in Section 5.

II. LITERATURE SURVEY

A business's plan requires regular updates to understand strengths and weaknesses, which can be due to either internal or external changes in the organization. To analyse this, Pröllochs and Feuerriegel [1] proposed a model in which these changes are determined by performance evaluations of firms, using topic modelling with topic-specific analysis of various opportunities and risks.

Language Processing/Natural Language Processing is a challenging research area in computer science concerned with information management, semantic mining, and enabling computers to obtain meaning from human language in text documents. Jelodar et al. [7] observed that Latent Dirichlet Allocation (LDA) is one of the most popular approaches in topic modelling and also summarized the challenges in topic modelling.

Text mining transforms unstructured text documents into a structured form that is suitable for analysis. Ponweiser [8] describes text mining as the process of deriving high-quality information from text.

Language processing uses text mining on text documents to organize, identify patterns, assess and infer the output. Tong and Zhang [9] showed that text mining tasks include categorization, clustering, summarization, keyword extraction, etc.

Language modelling is an application of Natural Language Processing that determines the probability of a sequence of words. Archna Oberoi [10] listed some of the applications of language modelling, which include speech


Authorized licensed use limited to: University of Prince Edward Island. Downloaded on June 07,2021 at 23:32:41 UTC from IEEE Xplore. Restrictions apply.

recognition, audio-to-text conversion, sentiment analysis, and document summarization. Language modelling is of two types: 1) statistical models, which are probabilistic models that predict the next word in a sequence; and 2) neural language models, which are based on neural networks and are used for speech recognition or machine translation.

Topic modelling methods are powerful techniques that are widely applied in natural language processing for topic discovery and semantic mining from unordered documents. According to Blei et al. [11], topic modelling is a frequently used unsupervised text-mining tool for discovering hidden semantic structures or topics in a large corpus of texts. Fig. 1 illustrates the idea of the topic modelling strategy.

Fig. 1: Topic Modelling strategy: clustering words (shapes: triangles & circles) under related topics.

As Fig. 1 shows, a corpus is a large collection of documents of unstructured text (e.g. articles, blog posts, book chapters, emails) from which word-topics are extracted through topic modelling methods, giving a view of the topic trend. The two elementary outputs of topic modelling are a collection of words that commonly occur together (topics) and a list of documents that are related to similar topics. The topics generated are completely unique.

There are four types of topic modelling:
• Latent Semantic Analysis (LSA): LSA creates a vector-based representation of text to capture semantic content.
• Probabilistic Latent Semantic Analysis (PLSA): PLSA models related information under a probabilistic framework in order to determine the underlying semantic structure of the data [12].
• Latent Dirichlet Allocation (LDA): LDA calculates the similarity between the given source documents and acquires distributions of every document over topics. In this paper, LDA is chosen for two reasons: 1) it has had a great impact in the field of NLP; 2) it is the most popular probabilistic and statistical text-modelling method in the field of machine learning.
• Correlated Topic Model (CTM): CTM uses the logistic normal distribution to identify the topics from a corpus.

Business Intelligence (BI) is a collection of processes, technologies and architectures that can convert raw data into profitable business information. Rouhani et al. [13] reviewed research in BI and classified it into managerial, technical and system-enabled approaches. There is an increasing trend in BI research, applications and concepts.

Meaningful insights can be retrieved from a corpus of data using natural language model-based information retrieval. Joby [14] uses latent semantic analysis to extract substantial information from the user's question or from the corpus.

Because data is growing in volume and dimensionality with advances in computer and web technologies, Kherwa et al. [15] stated that topic modelling can be used to extract hidden concepts, latent variables and prominent features of data, depending on the framework.

Unsupervised document classification combining web scraping, support vector machines and LDA topic modelling was proposed by Thielmann et al. [16], yielding accurate results for different data sets.

III. PROPOSED METHODOLOGY

The proposed strategic framework is shown in Fig. 2. LDA topic modelling is applied to the business economy dataset to understand and derive solutions on topic modelling.

Fig. 2: Proposed Strategic Framework model for Business Economy Corpus.

A. Business Economy Dataset

The Business Digital Economy Dataset hosts around 2400 titles, abstracts and keywords from different authors on various digital-economy topics. The summary of each topic will be used as the base for the topic models. The 'Abstract' column is used to build the topic model, and the other columns are used for exploratory analysis with word clouds.

Fig. 3: WordCloud of Abstracts.

B. Pre-Processing of data

Data pre-processing includes data parsing and tokenization, stop-word removal, lemmatization, vector-space representation and word-cloud representation.

Data parsing and cleaning are done by taking the abstract field and removing the rows whose abstracts are not available. This is followed by removing punctuation and ensuring there is no case sensitivity. Tokenization of words is implemented and is graphically represented for the first 500


high-frequency words in the form of word clouds, as shown in Fig. 3. In Fig. 3, a word with a larger representation has higher prominence and a higher frequency.

The next step is to build N-gram language models at the word level, i.e. bigram and trigram models, to find the probability distribution over word sequences. This process removes noise of any kind (spelling errors, special characters, non-standard word forms, grammar mistakes, and so on) by using these probabilities to identify, for example, invalid dictionary words. To give more focus to the words that carry more meaning for the topics, commonly used words, also known as stop-words, such as "a", "an", "the", "is", "was", etc., are removed. This is followed by stemming and lemmatization, which normalize the words to prepare them for further processing. Fig. 4 shows the words after this process. From these words, a corpus is built that defines the frequency of each word.

Fig. 4: Normalized text

C. Building Topic Modelling using LDA & Results

Acronyms

• LDA: LDA is an unsupervised text modelling technique, which was introduced in 2003 by Blei et al. [17]. It is a classical statistical model for topic mining in NLP. LDA uses a matrix factorization technique to represent a given corpus as a "document-word matrix". These matrices are processed through sampling techniques until they converge to a steady point.

Fig. 5: A Graphical distribution of LDA model

A graphical representation of the LDA model flow is shown in Fig. 5, where the notation is as follows:
α is the Dirichlet parameter
η is the topic parameter
β_{1:K} are the topics, where β_k is a distribution over the words
θ is the document-topic distribution parameter
θ_{d,k} is the topic proportion for topic k in document d
z is the word-topic assignment parameter
z_{d,n} is the topic assignment for the n-th word in document d
w is the observed word
w_{d,n} is the n-th word in document d

With this notation, the joint distribution of the hidden and observed variables of LDA is as follows [8]:

$$p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{k=1}^{K} p(\beta_k \mid \eta) \prod_{d=1}^{D} p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right)$$

where D is the number of documents and N is the number of words in a document.

The distribution specifies the interdependencies between topic assignment and topic distribution, and between the words in all the documents and their dependency on the topics and topic assignments.

• Building Topic Model using LDA: The LDA model is implemented on the Business Digital Economy dataset of around 2400 sample abstracts, for 100 passes of Gibbs sampling, with the number of topics set to 15, the alpha value set to 0.01 and the beta value set to 0.1, through 500 iterations. The model converges for each abstract by generating a topic distribution, which defines how much each word of the abstract is related to each topic. Table 1 shows sample topics along with the distribution of the words with respect to each topic.

D. Computing Model Perplexity & Coherence Score:

In our experiments, a low number of topics was preferred, as the objective is to provide a general view. To approximate the optimal number of topics, an intrinsic evaluation metric known as perplexity is evaluated for different numbers of topics. Perplexity measures how well a probability distribution predicts a sample; a low perplexity indicates good prediction. Perplexity is optimized by evaluating the coherence score between the topics inferred by the model. The topic coherence measure helps to differentiate topics that are semantically interpretable from topics that are artifacts of statistical inference. Table 2 shows the values obtained for these metrics. Fig. 6 shows the variation in coherence score as the number of topics varies.

Fig. 6: Coherence Score w.r.t No. of topics
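
The pre-processing steps of Section B (dropping rows without an abstract, lower-casing, punctuation removal, tokenization, stop-word removal and bigram counting) can be sketched in Python as follows. This is a minimal illustration rather than the pipeline used in the paper: the stop-word list is a tiny hypothetical one, lemmatization is omitted, and the names `preprocess` and `bigram_counts` and the sample abstracts are assumptions introduced here.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "was", "of", "and", "in", "to", "for"}

def preprocess(abstract):
    """Lower-case, strip punctuation, tokenize, and drop stop-words."""
    text = abstract.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # punctuation and digits become spaces
    return [tok for tok in text.split() if tok not in STOPWORDS]

def bigram_counts(docs):
    """Count adjacent token pairs across all tokenized documents."""
    counts = Counter()
    for tokens in docs:
        counts.update(zip(tokens, tokens[1:]))
    return counts

abstracts = [
    "The digital economy is reshaping business analytics.",
    None,  # a row with no abstract, removed during cleaning
    "Business analytics supports the digital economy.",
]
docs = [preprocess(a) for a in abstracts if a]
word_freq = Counter(tok for doc in docs for tok in doc)  # basis for a word cloud
bigrams = bigram_counts(docs)
```

The word frequencies in `word_freq` are the kind of counts a word cloud is drawn from, and `bigrams` gives the raw counts from which bigram probabilities could be estimated.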

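
The model-building step of Section C can be illustrated with a small collapsed Gibbs sampler, one common way of fitting LDA by sampling. The paper does not state which implementation was used, so this is only a sketch under that assumption: the function `lda_gibbs`, the toy documents and the reduced settings (2 topics, 200 iterations) are illustrative, while alpha = 0.01 and beta = 0.1 follow the values reported above. Here `beta` smooths the topic-word distributions, which appears to correspond to the topic parameter η in the notation of Section C.

```python
import random

def lda_gibbs(docs, num_topics, alpha, beta, iterations, seed=0):
    """Fit LDA by collapsed Gibbs sampling; returns (vocab, theta, phi)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]       # tokens of doc d in topic k
    topic_word = [[0] * V for _ in range(num_topics)]  # tokens of word v in topic k
    topic_total = [0] * num_topics                     # all tokens in topic k
    assign = []                                        # topic of each token, z(d, n)

    # Random initial topic assignment for every word token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assign.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                v, k = word_id[w], assign[d][n]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample its topic from the full conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][n] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed estimates of document-topic (theta) and topic-word (phi) distributions.
    theta = [[(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
              for k in range(num_topics)] for d, doc in enumerate(docs)]
    phi = [[(topic_word[k][v] + beta) / (topic_total[k] + V * beta)
            for v in range(V)] for k in range(num_topics)]
    return vocab, theta, phi

docs = [
    ["digital", "economy", "platform", "digital"],
    ["market", "price", "market", "trade"],
    ["digital", "platform", "economy"],
    ["trade", "price", "market"],
]
vocab, theta, phi = lda_gibbs(docs, num_topics=2, alpha=0.01, beta=0.1, iterations=200)
```

Each row of `theta` is one document's distribution over topics, and each row of `phi` is one topic's distribution over the vocabulary; sorting a row of `phi` gives the top words of that topic.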

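
The two metrics of Section D can be sketched as follows. A negative "perplexity" such as the one reported in Table 2 is what some toolkits return as a per-word log-likelihood bound; if that is the case here (an assumption, since the paper does not name its toolkit), gensim's convention converts a bound b into a perplexity of 2^(-b). The `umass_coherence` function is a simplified u_mass-style score built from document co-occurrence counts and averaged over word pairs; both function names and the toy documents are hypothetical.

```python
import math

def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound into a perplexity, 2 ** (-bound)."""
    return 2.0 ** (-per_word_bound)

def umass_coherence(top_words, docs):
    """Average u_mass-style coherence of a topic's top words.

    top_words should be ordered from most to least probable, and every
    word must occur in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            joint = doc_freq(top_words[i], top_words[j])
            scores.append(math.log((joint + 1) / doc_freq(top_words[j])))
    return sum(scores) / len(scores)

docs = [
    ["digital", "economy", "platform"],
    ["digital", "economy", "market"],
    ["digital", "trade", "price"],
]
coherent = umass_coherence(["digital", "economy"], docs)
less_coherent = umass_coherence(["digital", "price"], docs)
perplexity = perplexity_from_bound(-7.832773965)
```

Word pairs that appear together in many documents score closer to zero, while pairs that rarely co-occur score more negative, which is why u_mass values such as the one in Table 2 are typically negative.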

TABLE 2: Perplexity & Coherence Score

Metric                    Value Obtained
Perplexity                -7.832773965
Coherence Score           0.341598949
Coherence Score (u_mass)  -4.205272144

E. Visualization of Results:

Fig. 7 shows a visualization of the results of LDA topic modelling on the Digital Economy data. One side of Fig. 7 shows the topic distribution and the other side shows a bar chart of the key terms that represent the selected topic. The blue bar represents the frequency of each word, whereas the red bar represents its local frequency.
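
The two bars described above can be computed directly from topic-word counts. The sketch below assumes that the blue bar is a term's overall count in the corpus and the red bar is its count within the selected topic, which is one reading of "local frequency"; the function `term_bars` and the toy counts are hypothetical.

```python
def term_bars(topic_word_counts, vocab, topic, top_n=3):
    """Return (term, overall_count, count_in_topic) for a topic's top terms."""
    # Overall count of each term, summed over all topics (the "blue" bar).
    overall = [sum(row[v] for row in topic_word_counts) for v in range(len(vocab))]
    # Count of each term inside the selected topic (the "red" bar).
    within = topic_word_counts[topic]
    ranked = sorted(range(len(vocab)), key=lambda v: within[v], reverse=True)
    return [(vocab[v], overall[v], within[v]) for v in ranked[:top_n]]

vocab = ["data", "economy", "market"]
topic_word_counts = [
    [5, 1, 0],  # topic 0
    [2, 4, 6],  # topic 1
]
bars = term_bars(topic_word_counts, vocab, topic=1, top_n=2)
```

A term whose two counts are nearly equal occurs almost exclusively in the selected topic, whereas a term with a much larger overall count is shared with other topics.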

TABLE 1: Sample topics along with the probability distribution of words


Fig. 7a: LDA Visualization for Top 15 topics

Fig. 7b: LDA Visualization for Top 6 topics


IV. CONCLUSION & FUTURE WORK

In this research study, a framework is created to enable scholars to use topic modelling to carry out a related literature review, reducing the need to read articles manually and allowing a large corpus to be analysed in a faster and more reliable manner. The framework, which is based on LDA topic modelling, is implemented on the business economy dataset to categorize articles under various topics by identifying them. The framework includes pre-processing of data for cleaning and cross-validating the data, LDA topic modelling, and post-processing to create various topics along with their frequency of occurrence. The model is analyzed by calculating the perplexity and coherence score between the topics inferred.

Future work can include an improved framework that can be applied to different corpora to obtain practical insights for further development, as well as integrating the proposed LDA topic modelling framework with various classifiers such as SVM and KNN for more optimal classification.

REFERENCES

[1] Pröllochs, N., & Feuerriegel, S. (2020). Business analytics for strategic management: Identifying and assessing corporate challenges via topic modelling. Information & Management, 57(1), 103070.
[2] Qiang, J., Qian, Z., Li, Y., Yuan, Y., & Wu, X. (2020). Short text topic modeling techniques, applications, and performance: a survey. IEEE Transactions on Knowledge and Data Engineering.
[3] Ayitey, W. (2010). A Simple Approach to Strategic Management, Methodist Book Depot Ltd. Available from https://www.researchgate.net/publication/279958992_A_Simple_Approach_to_Strategic_Management
[4] Cheng, C., & Havenvid, M. (2017). Investigating strategy tools from an interactive perspective. The IMP Journal, 11(1), 127–149.
[5] Laamanen, T., Mantere, S., & Vaara, E. (2018). Strategy Processes and Practices: Dialogues and Intersections. Strategic Management Journal, 39.
[6] G.G. Dess, A.B. Eisner, G.T. Lumpkin, Strategic Management: Text and Cases, 4th ed., McGraw-Hill/Irwin, New York, NY, 2008.
[7] Hamed Jelodar, Yongli Wang, Chi Yuan, Xia Feng, Xiahui Jiang, Yanchao Li, & Liang Zhao. (2018). Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey.
[8] Ponweiser, M. (2012). Latent Dirichlet allocation in R, Vienna University of Business and Economics.
[9] Tong, Z., & Zhang, H. (2016). A text mining research based on LDA topic modelling. In International Conference on Computer Science, Engineering and Information Technology (pp. 201–210).
[10] https://insights.daffodilsw.com/blog/what-are-language-models-in-nlp
[11] Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993–1022.
[12] Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM (pp. 77–84).
[13] Rouhani, S., Asgari, S., & Mirhosseini, S. (2012). Review study: Business intelligence concepts and approaches. American Journal of Scientific Research, 50, 62–75.
[14] Joby, P. P. (2020). Expedient Information Retrieval System for Web Pages Using the Natural Language Modeling. Journal of Artificial Intelligence, 2(02), 100–110.
[15] Kherwa, P., & Bansal, P. (2020). Topic Modeling: A Comprehensive Review. EAI Endorsed Transactions on Scalable Information Systems, 7(24), e2.
[16] Thielmann, A., Weisser, C., Krenz, A., & Säfken, B. (2020). Unsupervised Document Classification integrating Web Scraping, One-Class SVM and LDA Topic Modelling.
[17] http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/AV1011/oneata.pdf
