Automated Use Case Diagram Generator Using NLP and ML
H. Rukshan Piyumadu Dias, C. S. L. Vidanapathirana, Rukshala Weerasinghe
Department of Computing, Informatics Institute of Technology, Colombo, Sri Lanka
Abstract—This paper presents a novel approach to generating a use case diagram by analyzing a given user story using NLP and ML. Use case diagrams play a major role in the design phase of the SDLC, so automating the use case diagram design process would save considerable time and effort. Numerous manual and semi-automated tools have been developed previously. This paper also discusses the need for use case diagrams and the problems faced when designing them. This paper is an attempt to solve those issues by generating the use case diagram in a fully automatic manner.

Keywords—use case diagram, NLP, ML

I. INTRODUCTION

A use case diagram is a type of UML diagram used to represent the dynamic aspect of a system [1]. It is useful for visualizing and specifying the system in the analysis and design phase. User stories describe the functionality of a software system in a simple way [2]. The analysis and design phase is important because the software is built on top of the blueprint created in this phase [3]. Drawing use case diagrams with manual tools such as Visual UML, Draw.io, Rational Rose and SmartDraw can be time consuming. A survey conducted in 2017 found that 54.5% of people encountered the problem of 'time consuming' when drawing use case diagrams [4]. Another survey in 2017 found that 40.2% of system engineers create UML diagrams manually, 23.9% generate UML diagrams using semi-automatic methods and only 13% generate UML diagrams using automatic methods [5]. In addition, when drawing use case diagrams manually, the system analyst has to remember the correct notation. These issues show the need for automation when drawing use case diagrams.

This paper proposes a novel method for generating use case diagrams from textual user stories by using Natural Language Processing (NLP) and Machine Learning (ML). Within the topic of automating UML diagrams, less research has been conducted on automated use case diagram generation than on other UML diagrams such as class, sequence and activity diagrams. This research contributes towards filling that gap by identifying actors, use cases and relationships from the user story and presenting them as a use case diagram that follows correct UML notation.

The remainder of this paper is structured as follows. Section II reviews the literature on the same domain and the parent domain. Section III discusses the proposed approach. Section IV presents the results of the proposed method and their discussion. Finally, Section V concludes the research.

II. LITERATURE REVIEW

A limited amount of research has been done in the domain of automated use case diagram generation. For that reason, the research team decided to conduct the literature review on both the same domain (use case diagrams) and the parent domain (UML diagrams), covering the years 2006 to 2023.

Sharif Ahmed et al. [6] proposed a system to identify the actors and use cases from a given text by using NLP techniques. The user story is split into sentences, and the sentences are tokenized, with either stemming or lemmatization applied to the tokens. Stanford CoreNLP is then used for Part-Of-Speech (POS) tagging. Finally, a set of heuristic rules is used to identify the actors and use cases. Additionally, Word Sense Disambiguation (WSD) is used to identify words with similar senses, which helps to reduce ambiguity. Anaphora resolution is handled using JavaRAP, which replaces pronouns with the nouns they refer to.
Vemuri, Chala and Fathi [8] proposed a system that includes both NLP and ML models for generating UML diagrams. The proposed model has four main sub-models: a pre-processor, an actor classifier, a use case classifier and a post-processor. The pre-processor uses the NLP model to tokenize the input text, apply POS tagging, extract nouns and verbs, and split and remove tags, which filters the input text into two subsets of candidate actors and candidate use cases. The two classifier models, both ML models, are then used to identify the respective actors' or use cases' relationships. Finally, the post-processor converts the classified results into the diagram. Accuracy, recall and precision tests were run to evaluate the proposed system.

Narawita and Vidanage [3] suggest a system that uses both NLP and ML models for generating UML diagrams. In the NLP model, sentence splitting, lexical analysis, syntax analysis and word chunking were done to identify specific semantic elements in the input text. Then an ML-based classifier was created to identify use cases and relationships. For this classifier, three different algorithms were built using Weka: a multi-layer perceptron, which achieved very high accuracy; a logistic algorithm, whose accuracy is significantly lower than the multi-layer perceptron's but whose performance is higher; and Sequential Minimal Optimization (SMO), which has lower accuracy than the logistic algorithm and the highest performance. After the classification, a rule-based approach was used for functions such as removing unwanted terms from the user input text, identifying specific terms in the input text and defining the Weka ARFF file names used to read the files.

III. PROPOSED METHOD

This section describes the proposed approach, which works in four phases. Fig. 1 shows the pipeline of the proposed approach.

Fig. 1: Proposed model pipeline

A. Spelling and grammar correction

The proposed approach uses NLP to identify actors and use cases from the given user story. If the user story contains grammatical errors, the predictions will be inaccurate, which is why Grammatical Error Correction (GEC) is a vital task in the field of NLP [9]. GEC is the task of identifying errors in a text and correcting them automatically [10]. A GEC model corrects punctuation, spelling, tense, syntax and subject-verb agreement errors in a text [11]. This approach uses a GEC model called "Gingerit", which is built on the gingersoftware.com API. The Gingerit library detects and corrects the types of errors mentioned above [12]. In this phase, the initial user story is taken as input, its grammatical errors are corrected, and the grammatically correct user story is passed to the NLP model.

B. NLP approach

Natural Language Processing (NLP) is a subcategory of Artificial Intelligence and Machine Learning that allows computers to understand and manipulate natural human language [13]. In this approach, the NLP model is the key model for identifying the actors and use cases together with their relationships.

The NLP model receives the grammatically correct user story from the GEC model. First, the user story is divided into sentences using sentence segmentation. Next, the sentences are divided into words using word tokenization. For example, after tokenization "I like to read books" becomes [I], [like], [to], [read], [books]. Tokenization makes it possible to examine each word individually.
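The segmentation and tokenization steps can be illustrated with a minimal sketch. The paper does not state which NLP library performs these steps, so the sketch uses only regular expressions from the Python standard library as a stand-in. The function names split_sentences and tokenize, and the second sentence of the sample story, are hypothetical and chosen only for illustration.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive sentence segmentation: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence: str) -> List[str]:
    """Naive word tokenization: keep words (allowing inner ' or -), drop punctuation."""
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", sentence)


# Illustrative input: the first sentence is the paper's example,
# the second is an assumed extra sentence to show segmentation.
story = "I like to read books. The customer buys a product."
sentences = split_sentences(story)
tokens = [tokenize(sentence) for sentence in sentences]

for sentence, words in zip(sentences, tokens):
    print(sentence, "->", words)
```

A production pipeline would more likely rely on a trained tokenizer, since rules this simple mishandle abbreviations such as "e.g." and decimal numbers.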
After word tokenization, Part-Of-Speech (POS) tagging is applied. POS tags provide valuable information about words or tokens [14]. POS tagging is used to identify whether a word is a noun, verb, adjective, pronoun, conjunction, preposition, numeral, article or interjection [15].

After POS tagging, dependency parsing is applied to every word in the sentence. Dependency parsing is a technique used to analyze the grammatical relations between words in a sentence [16]. It creates a tree or graph data structure of the sentence, breaking down the words or tokens according to their grammatical relations. Using dependency parsing leads to a more accurate NLP model.

The algorithm identifies the subject of the sentence as the token with the "nsubj" dependency label whose POS tag is not "PRON" (pronoun). It then identifies the object and the verb of the sentence: the object is the token with the "dobj" dependency label, and the verb is the token with the POS tag "VERB". After the subject, object and verb have been identified, lemmatization is applied to them. Lemmatization converts a token or word to its root word [17], [18].

Word        Lemmatization
Buys        Buy
Better      Good
Changing    Change

TABLE I: Words and their lemmatized forms

Finally, unique subjects are identified as actors, and the combination of verb and object is identified as the use case. The association relationship between actors and use cases is identified as follows: when a new actor is identified, all the use cases after it are considered the use cases of that actor until another new or different actor appears. The extracted data of the use case diagram is stored in a dictionary data structure.

C. ML approach

The ML model receives from the NLP model a dictionary data structure that contains actors and use cases. The ML model is used to filter out unnecessary use cases from those identified by the NLP model. The dataset consists of more than 700 candidate use cases extracted by the NLP model, each labeled as either a use case or not. The classification is performed with the Naive Bayes algorithm, a supervised classification method. In some cases, the Naive Bayesian classifier is an effective and simple probabilistic classification method [19]. After the unnecessary use cases have been filtered out of the data structure, the updated data structure is passed on to the diagram generation model.

D. Diagram generation

In this phase, the final identified actors and use cases are received from the ML model as a data structure of key-value pairs. This data now needs to be converted into a visual use case diagram. Several tools were identified for visually representing the use case diagram:

Tool name    Uses correct UML notation    Supports use case diagrams
PlantUML     ✓                            ✓
yUML         X                            ✓
ChartMage    ✓                            X
ZenUML       ✓                            X
Umple        ✓                            X
UMLGraph     ✓                            X
DotUML       ✓                            X

TABLE II: Diagram generating tools comparison

After comparing these tools, PlantUML was selected as the tool to visualize the use case diagram. Generating a use case diagram with PlantUML requires writing pseudo code in PlantUML's text syntax, which can be automated using the final updated actor-use case dictionary. For example, the code below is the PlantUML code that displays the use case diagram for the actor "Customer" and the use case "buy product":

@startuml
left to right direction
actor "Customer" as Cu
rectangle {
  usecase "buy product" as UC1
}
Cu --> UC1
@enduml

IV. RESULTS AND DISCUSSION

This paper proposes an approach to generate use case diagrams automatically. GEC is performed first. It fixes most grammatical errors, but it struggles with spelling: when a word is misspelled, the GEC model sometimes replaces it with a completely different word. In an evaluation on 8 user stories of 20 to 35 words each, 13 out of 15 actors were identified correctly, which is about 86.67%, and 20 out of 28 use cases were identified correctly, which is about 71.43%.
User stories count            8
Actual actors count           15
Actual use cases count        28
Identified actors count       13
Identified use cases count    20
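The percentages reported for the NLP model follow directly from the counts in the table, and the classifier metrics used in this section are all functions of four confusion-matrix counts. The sketch below reproduces both calculations. The helper names are hypothetical, and because the paper does not report its confusion matrix, the TP, TN, FP and FN values passed to classification_metrics are invented placeholders that only show the arithmetic. They are not the study's results.

```python
from typing import Dict


def identification_rate(found: int, total: int) -> float:
    """Percentage of expected items that were identified correctly."""
    return round(100.0 * found / total, 2)


# Counts taken from the evaluation table.
actor_rate = identification_rate(13, 15)
use_case_rate = identification_rate(20, 28)
print(f"actors: {actor_rate}%  use cases: {use_case_rate}%")


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# HYPOTHETICAL counts for illustration only; not reported in the paper.
example = classification_metrics(tp=40, tn=30, fp=10, fn=20)
print(example)
```

A real evaluation would also need to guard against division by zero when the classifier predicts no positives or the data contains none.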
Accuracy = (TP + TN) / TOTAL
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 score = 2 * (Precision * Recall) / (Precision + Recall)

Here TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, and TOTAL = TP + TN + FP + FN.

Several metrics can be calculated from the confusion matrix, for example the accuracy, precision, recall and F1 score. The accuracy is defined as the ratio of the number of correct predictions to the total number of predictions. The ML classification model of this study has an accuracy of 75.52%. The precision describes the quality of the positive predictions. The recall and the F1 score of the ML model are 76% and 72%, respectively. After the actors, use cases and relationships have been identified automatically and the use case diagram has been generated, there can still be some issues. To address them, the system provides manual editing of the use case diagram: users can add and remove actors or use cases, rename actors or use cases, and move the connection between an actor and a use case to another actor.

For the user story below, the generated use case diagram is shown in Fig. 2.

"A customer calls a car repair shop to make an appointment for an oil change. The receptionist checks the availability of the mechanic and schedules the appointment for the next available time slot."

The user stories used to evaluate this system had not previously been used to train or test the model.

V. CONCLUSION

In conclusion, this paper proposed a method to generate the use case diagram for a given user story by using NLP and ML models. It also discussed the different tools and techniques that were considered, along with their advantages and disadvantages. In addition, several limitations of the proposed method have been identified. The model does not identify complex relationships such as include and extend; it identifies only association relationships. The user also has to enter the name of the system manually, because the model does not identify a suitable name from the given user story. Finally, use case diagrams can be generated only for English language user stories.

ACKNOWLEDGMENT

We would like to thank our supervisor, Ms. Rukshala Weerasinghe, for her guidance and encouragement throughout the research.

REFERENCES

[1] G. Booch, I. Jacobson, and J. Rumbaugh, The Unified Modeling Language User Guide, The Addison Wesley Object Technology Series, 1998.
[2] A. Azzazi, “A Framework using NLP to automatically convert User-Stories into Use Cases in Software Projects,” IJCSNS International Journal of Computer Science and Network Security, 2017.
[3] C. R. Narawita and K. Vidanage, “UML generator – use case and class diagram generation from text requirements,” Int J on Adv. in ICT for Emerging Countries, vol. 10, no. 1, pp. 1–10, May 2017, doi: 10.4038/icter.v10i1.7182.
[4] R. S. Madanayake, G. K. A. Dias, and N. D. Kodikara, “Transforming Simplified Requirement in to a UML Use Case Diagram Using an Open Source Tool,” International Journal of Computer Science and Software Engineering, vol. 6, no. 3, p. 61, 2017.
[5] O. S. Dawood and A.-E.-K. Sahraoui, “From Requirements Engineering to UML using Natural Language Processing – Survey Study,” European Journal of Engineering and Technology Research, vol. 2, no. 1, pp. 44–50, Jan. 2017, doi: 10.24018/ejeng.2017.2.1.236.
[6] S. Gulia and T. Choudhury, “An efficient automated design to generate UML diagram from Natural Language Specifications,” in 2016 6th International Conference - Cloud System and Big Data Engineering (Confluence), Jan. 2016, pp. 641–648. doi: 10.1109/CONFLUENCE.2016.7508197.
[7] M. Alksasbeh, B. Alqaralleh, A. Tahseen, Alramadin, and K. Alemerien, “AN AUTOMATED USE CASE DIAGRAMS GENERATOR FROM NATURAL LANGUAGE REQUIREMENTS,” Journal of Theoretical and Applied Information Technology, 2017.
[8] S. Vemuri, S. Chala, and M. Fathi, “Automated use case diagram generation from textual user requirement documents,” in 2017 IEEE 30th Canadian Conference on Electrical and Computer Engineering (CCECE), Apr. 2017, pp. 1–4. doi: 10.1109/CCECE.2017.7946792.
[9] M. Long, Y. Wang, Y. Peng, and W. Huang, “A Review of the Research on the Evaluation Metrics for Automatic Grammatical Error Correction System,” Mobile Information Systems, vol. 2022, p. e5998948, Oct. 2022, doi: 10.1155/2022/5998948.
[10] R. H. Susanto, P. Phandi, and H. T. Ng, “System Combination for Grammatical Error Correction,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 951–962. doi: 10.3115/v1/D14-1102.
[11] C. Bryant, Z. Yuan, M. R. Qorib, H. Cao, H. T. Ng, and T. Briscoe, “Grammatical Error Correction: A Survey of the State of the Art,” Computational Linguistics, pp. 1–59, Apr. 2023, doi: 10.1162/coli_a_00478.
[12] S. Abhishek, H. Sathish, A. K. K, and A. T, “Aiding the Visually Impaired using Artificial Intelligence and Speech Recognition Technology,” in 2022 4th International Conference on Inventive Research in Computing Applications (ICIRCA), Sep. 2022, pp. 1356–1362. doi: 10.1109/ICIRCA54612.2022.9985659.
[13] Y. Li, M. A. Thomas, and D. Liu, “From semantics to pragmatics: where IS can lead in Natural Language Processing (NLP) research,” European Journal of Information Systems, vol. 30, no. 5, pp. 569–590, Sep. 2021, doi: 10.1080/0960085X.2020.1816145.
[14] S. G. Kanakaraddi and S. S. Nandyal, “Survey on Parts of Speech Tagger Techniques,” in 2018 International Conference on Current Trends towards Converging Technologies (ICCTCT), Mar. 2018, pp. 1–6. doi: 10.1109/ICCTCT.2018.8550884.
[15] M. Haspelmath, “Word Classes and Parts of Speech,” in International Encyclopedia of the Social & Behavioral Sciences, N. J. Smelser and P. B. Baltes, Eds., Oxford: Pergamon, 2001, pp. 16538–16545. doi: 10.1016/B0-08-043076-7/02959-4.
[16] S. Singkul and K. Woraratpanya, “Thai Dependency Parsing with Character Embedding,” in 2019 11th International Conference on Information Technology and Electrical Engineering (ICITEE), Oct. 2019, pp. 1–5. doi: 10.1109/ICITEED.2019.8930002.
[17] R. Pramana, Debora, J. J. Subroto, A. A. S. Gunawan, and Anderies, “Systematic Literature Review of Stemming and Lemmatization Performance for Sentence Similarity,” in 2022 IEEE 7th International Conference on Information Technology and Digital Applications (ICITDA), Nov. 2022, pp. 1–6. doi: 10.1109/ICITDA55840.2022.9971451.
[18] D. Khyani, B. Siddhartha, N. Niveditha, and B. Divya, “An interpretation of lemmatization and stemming in natural language processing,” J. Univ. Shanghai Sci. Technol., vol. 22, pp. 350–357, 2021.
[19] Y. Huang and L. Li, “Naive Bayes classification algorithm based on small sample set,” in 2011 IEEE International Conference on Cloud Computing and Intelligence Systems, Sep. 2011, pp. 34–39. doi: 10.1109/CCIS.2011.6045027.
[20] S. Vemuri, S. Chala, and M. Fathi, “Automated use case diagram generation from textual user requirement documents,” IEEE 30th Canadian Conf. on Electrical and Computer Engineering (CCECE), 2017.
[21] M. S. O. Z. A. B. J. Alrawashdeh, “Generate use case from the requirements written in a natural language using machine learning,” IEEE Jordan Int. Joint Conf. on Electrical Engineering and Info. Tech. (JEEIT), 2019.
[22] M. Elallaoui, K. Nafil, and R. Touahni, “Automatic transformation of user stories into UML use case diagrams using NLP techniques,” The 8th Int. Conf. on Ambient Systems, Networks and Technologies (ANT), 2018.
[23] A. M. Maatuk and E. A. Abdelnabi, “Generating UML use case and activity diagrams using NLP techniques and heuristics rules,” DATA’21: Int. Conf. on Data Science, E-learning and Info. Systems, 2021.
[24] S. Ahmed, A. Ahmed, and N. U. Eisty, “Automatic Transformation of Natural to Unified Modeling Language: A Systematic Review,” in 2022 IEEE/ACIS 20th International Conference on Software Engineering Research, Management and Applications (SERA), May 2022, pp. 112–119. doi: 10.1109/SERA54885.2022.9806783.
[25] S. Ahmed, A. Ahmed, and N. U. Eisty, “Automatic Transformation of Natural to Unified Modeling Language: A Systematic Review,” in 2022 IEEE/ACIS 20th International Conference on Software Engineering Research, Management and Applications (SERA), May 2022, pp. 112–119. doi: 10.1109/SERA54885.2022.9806783.
[26] X. Schmitt, S. Kubler, J. Robert, M. Papadakis, and Y. LeTraon, “A Replicable Comparison Study of NER Software: StanfordNLP, NLTK, OpenNLP, SpaCy, Gate,” in 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), Oct. 2019, pp. 338–343. doi: 10.1109/SNAMS.2019.8931850.