Towards Automatically Extracting UML Class Diagrams from Natural Language Specifications

Song Yang and Houari Sahraoui
ABSTRACT
[...] it is complex and resource-consuming. We propose an automated approach for the extraction of UML class diagrams from natural language software specifications. To develop our approach, we create a dataset of UML class diagrams and their English specifications with the help of volunteers. Our approach is a pipeline of steps consisting of the segmentation of the input into sentences, the classification of the sentences, the generation of UML class diagram fragments from sentences, and the composition of these fragments into one UML class diagram. We develop a quantitative testing framework specific to UML class diagram extraction. Our approach yields low precision and recall but serves as a benchmark for future research.
CCS CONCEPTS
• Software and its engineering → Software design engineering; • Computing methodologies → Information extraction; Classification and regression trees.

KEYWORDS
Model-driven engineering, Machine learning, Natural language processing, Domain modeling
ACM Reference Format:
Song Yang and Houari Sahraoui. 2022. Towards Automatically Extracting UML Class Diagrams from Natural Language Specifications. In ACM/IEEE 25th International Conference on Model Driven Engineering Languages and Systems (MODELS '22 Companion), October 23–28, 2022, Montreal, QC, Canada. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3550356.3561592

1 INTRODUCTION
Software development is a complex and error-prone process. Part of this complexity comes from the gap between domain experts, who are familiar with the domain knowledge but have limited expertise with development tools, and software specialists, who master the development environments but are unfamiliar with the target application domain. To fill that gap, the model-driven engineering paradigm aims at raising the level of abstraction in software development and in the tools and languages that are produced to model them [8].

For a domain specialist, creating UML models from scratch is a time-consuming and error-prone process that requires various technical skills. To address that problem, various approaches target the generation of models from different structured information such as user stories [4]. However, little work has been done on the extraction of models from natural language specifications. In the specific case of UML class diagrams, existing work relies either on techniques that use machine learning in a semi-automated process [12] or on rule-based techniques that are fully automated but require a restricted input [1, 7]. In this paper, we propose an approach that combines both machine learning and rules while accepting free-flowing text.

Our approach uses natural language patterns and machine learning to fully automate the generation process. We first decompose a specification into sentences. Then, using a trained classifier, we tag each sentence as describing either a class or a relationship. Next, using grammar patterns, we map each sentence into a UML fragment. Finally, we assemble the fragments into a complete UML class diagram using a composition algorithm. In addition to our approach, we build a dataset thanks to the effort of the modeling community. This dataset is used to train the classifier and to evaluate the approach.

We evaluate our approach on a dataset of 62 diagrams containing 624 fragments. Although the accuracy of our approach does not reach the level needed for practical use, our work explores the benefits of mixing machine learning with natural language patterns in a fully automated process. Our approach can serve as a baseline for future research on generating UML diagrams from English specifications, and the dataset created together with the defined quantitative metrics can serve as a benchmark for this problem.

The rest of the paper is structured as follows. Section 2 gives an overview of the proposed generation pipeline and the details of each step. The setup and the results of evaluating the approach are provided in Section 3. Section 5 discusses the related work and positions our contribution with respect to it. Section 4 lists some threats to validity. Finally, we conclude this paper in Section 6.
2 APPROACH

2.1 Overview
The goal is to design a method to translate English specifications to UML diagrams. To do this, we implement a tool pipeline that generates UML class diagrams from natural language specifications. First, we create a dataset. Secondly, we implement an NLP pipeline that performs the extraction of UML class diagrams. Figure 1 summarizes the process.
Our approach combines machine learning with pattern-based diagram generation. To perform machine learning, we start by creating a dataset of UML class diagrams and their corresponding specifications in natural language (top part of Figure 1). We select pre-existing UML class diagrams from the AtlanMod Zoo repository¹. The selected diagrams are decomposed into fragments and manually labeled by volunteer participants. After postprocessing, the labeled diagrams are stored in a repository.

Figure 1: Overview of the extraction process of UML class diagrams from natural language specifications

The bottom part of Figure 1 consists of the actual diagram generation process, which takes place right after a user submits a software specification. The submitted natural language specification is then preprocessed and decomposed into sentences. Using a classifier built from the above-mentioned dataset, the sentences are labeled according to the nature of the UML construct they refer to, i.e., a class or a relationship. According to this label, specific procedures of parsing and extraction are performed on the sentence to generate a UML fragment. In the end, all UML fragments are composed back together into one UML class diagram.
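To illustrate the composition step, the following minimal Python sketch merges fragments that share a class name into a single diagram. The dictionary-based fragment format and the function name compose are illustrative assumptions, not the tool's actual data structures or algorithm.

    # Illustrative sketch only: merge class and relationship fragments into one
    # diagram by unifying classes that share a name (assumed fragment format).
    def compose(fragments):
        classes, relationships = {}, []
        for frag in fragments:
            if frag["kind"] == "class":
                attributes = classes.setdefault(frag["name"], set())
                attributes.update(frag.get("attributes", []))
            else:  # relationship fragment
                relationships.append((frag["source"], frag["target"], frag.get("label", "")))
                classes.setdefault(frag["source"], set())  # ensure both ends exist
                classes.setdefault(frag["target"], set())
        return {"classes": classes, "relationships": relationships}

    fragments = [
        {"kind": "class", "name": "Course", "attributes": ["title"]},
        {"kind": "relationship", "source": "Teacher", "target": "Course", "label": "teaches"},
    ]
    print(compose(fragments))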
2.2 Dataset Creation
We create a new dataset for both the operation and the evaluation of our approach. In particular, we use this dataset to learn a classifier for the Classification step in Figure 1.

To build the dataset, we start from an existing set of UML class diagrams from the AtlanMod Zoo. The AtlanMod Zoo has a repository of 305 high-quality UML class diagrams that model various domains. The size of the diagrams varies from a few to hundreds of classes. We fragment each diagram into simple classes (Figure 2) and relationships (Figure 3). Table 1 shows the size of the initial set of diagrams and the fragments, as well as the portion that we labeled.

    Dataset         UML models   UML fragments
    AtlanMod Zoo    305          8706
    Labeled         62           649

Table 1: UML datasets and their sizes by version

We do not offer monetary compensation for participation. However, since participation was low, we did not impose a contribution limit. After about two months of crowdsourcing, we receive labels for 649 fragments across 62 UML class diagrams. The produced dataset is available on a public repository². To ensure quality, labels are reviewed and some are rejected. We replace the rejected labels by labeling them again ourselves. Figure 4 shows example labels.
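As an illustration of how a labeled entry of this dataset can be represented, the sketch below pairs a UML fragment with the English sentence provided by a participant. The class and field names are hypothetical; the multiplicities follow the notation of Figure 3, and the example sentence is taken from Figure 4(b).

    from dataclasses import dataclass, field
    from typing import List

    # Hypothetical representation of a labeled dataset entry: a UML fragment
    # (a single class, or a relationship between two classes) paired with the
    # English sentence written by a volunteer. Field names are illustrative.
    @dataclass
    class ClassFragment:
        name: str
        attributes: List[str] = field(default_factory=list)

    @dataclass
    class RelationshipFragment:
        source: str
        target: str
        source_multiplicity: str = "1..1"
        target_multiplicity: str = "0..*"

    @dataclass
    class LabeledFragment:
        fragment: object   # ClassFragment or RelationshipFragment
        sentence: str      # English label provided by a participant

    example = LabeledFragment(
        fragment=RelationshipFragment(source="Table", target="Key"),
        sentence="A Key is owned by one and only one Table in an RDBMS",
    )
    print(example)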
2.3 Preprocessing and Fragmentation
Preprocessing is the first step after receiving an input specification from the user, as shown in Figure 1. We substitute pronouns throughout the text, such as it and him, by their reference nouns. This is done using coreferee [5], which is a tool written in Python that performs coreference resolution, including pronoun substitution.

    A course is taught by a teacher. A classroom is assigned to it.
    ⟹ A course is taught by a teacher. A classroom is assigned to a course.

Pronoun substitution allows sentences in the English specification to be less dependent on each other for semantic purposes. The accuracy of coreferee for general English text is 81%.
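A minimal sketch of this pronoun-substitution step is given below, based on coreferee's documented spaCy extension (doc._.coref_chains.resolve). It is an approximation of the preprocessing described above, not the actual implementation, and coreferee may require a larger English model than the small one used elsewhere in the pipeline.

    import spacy
    import coreferee  # registers the "coreferee" spaCy pipeline component

    # Sketch: rebuild the text, replacing anaphoric tokens by their referents.
    nlp = spacy.load("en_core_web_sm")  # assumption; coreferee may need a larger model
    nlp.add_pipe("coreferee")

    def substitute_pronouns(text: str) -> str:
        doc = nlp(text)
        rebuilt = []
        for token in doc:
            referents = doc._.coref_chains.resolve(token)  # None unless token is an anaphor
            rebuilt.append(" and ".join(t.text for t in referents) if referents else token.text)
        return " ".join(rebuilt)

    print(substitute_pronouns(
        "A course is taught by a teacher. A classroom is assigned to it."))
    # roughly: "A course is taught by a teacher . A classroom is assigned to course ."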
Sentence fragmentation is the second step in the runtime operations in Figure 1. We split the preprocessed text into individual sentences, using spaCy [3]. spaCy is an NLP library in Python that can be used for various NLP tasks, such as sentence splitting. spaCy splits text into sentences by looking at punctuation and special cases like abbreviations. Its decisions are powered by pre-trained statistical models. We use the small English model, which has a good speed and respectable performance. For instance, in the following example, the first two dots are not considered for splitting the sentences, but the third dot is.

    An employee has a level of studies, i.e., a degree. An employee is affiliated to a department.
    ⟹
    s1: An employee has a level of studies, i.e., a degree.
    s2: An employee is affiliated to a department.
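The splitting itself amounts to iterating over spaCy's sentence spans; the snippet below reproduces the example above with the small English model.

    import spacy

    # Sentence fragmentation with spaCy's small English model.
    nlp = spacy.load("en_core_web_sm")

    text = ("An employee has a level of studies, i.e., a degree. "
            "An employee is affiliated to a department.")
    for i, sent in enumerate(nlp(text).sents, start=1):
        print(f"s{i}: {sent.text}")
    # s1: An employee has a level of studies, i.e., a degree.
    # s2: An employee is affiliated to a department.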
Figure 2: Class fragment

Figure 3: Relationship fragment

Figure 4: Example labels received by crowdsourcing. (b) A Key is owned by one and only one Table in an RDBMS
2.4 Sentence Classification
When training classifiers on natural language text, we have to select a method to map those sentences into numerical representations. To this end, we experiment with two vectorization methods, count and tf-idf, which are designed to turn words into vectors. tf-idf's key difference from count is the penalization of very common words in a document. This allows giving less importance to words like "the".

As for the classification algorithms, we experiment with various algorithms from the scikit-learn library [9]. We use the default hyperparameter settings for each algorithm. Table 2 shows the performance of the algorithms on the test data. It is worth noting that the training takes less than one minute.

Although some classifiers have better accuracy, we pick the Bernoulli Naive Bayes classifier with a tf-idf vectorizer. Bernoulli Naive Bayes is simple, has a good accuracy that is more stable across training experiments, and is generally faster to execute. Interestingly, Bernoulli Naive Bayes performs better with a tf-idf vectorizer than with the count vectorizer, when it should perform equally well on both in theory. We attribute the difference in performance to the randomized splitting of the dataset when training and testing. Moreover, if a Bernoulli distribution captures enough information to classify well, it seems frequency-based vectorization is not needed.

Should a given sentence describe both a UML class and a UML relationship, we let the classifier make the decision.
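A minimal sketch of the selected configuration is shown below: a scikit-learn pipeline combining a tf-idf vectorizer with Bernoulli Naive Bayes under default hyperparameters. The two training sentences and their labels are made up for illustration; the real classifier is trained on the labeled fragments of Section 2.2.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.pipeline import make_pipeline

    # Illustrative training data (hypothetical labels).
    sentences = [
        "An employee has a level of studies, i.e., a degree.",
        "A course is taught by a teacher.",
    ]
    labels = ["class", "relationship"]

    # tf-idf vectorization followed by Bernoulli Naive Bayes, default settings.
    model = make_pipeline(TfidfVectorizer(), BernoulliNB())
    model.fit(sentences, labels)
    print(model.predict(["A classroom is assigned to a course."]))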
2.5 UML Fragment Generation
After classifying each sentence as describing either a "class" or a "relationship", we generate the corresponding UML fragment according to this classification, which is the fourth step in the runtime operations of Figure 1.

Using spaCy's small English model [3], we define several grammar patterns to match the English sentences. We design the patterns
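As an illustration of what such a pattern can look like, the sketch below uses spaCy's token Matcher to capture a passive construction of the form "X is <verb> by Y" as a relationship between two classes. This is a made-up example pattern; whether the actual patterns are written with the Matcher, the DependencyMatcher, or direct parse-tree rules is not specified here.

    import spacy
    from spacy.matcher import Matcher

    # Illustrative grammar pattern: a passive construction read as a relationship
    # between a target class (X) and a source class (Y).
    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)
    matcher.add("PASSIVE_RELATION", [[
        {"POS": "NOUN"},                 # X -> target class
        {"LEMMA": "be"},
        {"TAG": "VBN"},                  # past participle -> relationship name
        {"LOWER": "by"},
        {"POS": "DET", "OP": "?"},
        {"POS": "NOUN"},                 # Y -> source class
    ]])

    doc = nlp("A course is taught by a teacher.")
    for _, start, end in matcher(doc):
        print(doc[start:end].text)       # "course is taught by a teacher"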
5.2 Automatic Approaches
Automatic approaches do not require extensive user intervention. Once input is given, the user only needs to wait for the recommended result. The automatic approaches make use of more traditional NLP techniques, such as hand-written rules and grammar parsing. For example, the authors of [1, 7] use several heuristics to analyze the natural language specification.

The automated approaches presented in [1, 7] have normalization rules, which require users to write specifications in a restricted English sentence structure. Our approach accepts free-flowing text. We don't have any normalization rules that users must keep in mind.
5.3 Semi-automatic Approaches
Semi-automatic approaches make use of an AI assistant to guide the user in generating UML class diagrams. DoMoBOT is an example of such a tool. Overall, DoMoBOT makes use of machine learning via knowledge bases inside pre-trained models and word embeddings. The UML class diagram is generated progressively as the user provides feedback to DoMoBOT [12].

Although human intervention during the generation of UML models improves quality, the additional effort spent by users makes the tools difficult to use, especially by domain experts. Our approach is meant to improve the automation of the generation process as much as possible. An automated approach is also better for consistent testing.
6 CONCLUSION
In this paper, we propose an automated approach to extract UML class diagrams from English specifications. The approach uses machine learning and pattern-based techniques. Machine learning is used in the form of a binary classifier that labels sentences as either describing a class or a relationship. The pattern-based techniques are handwritten grammar rules to parse English sentences. In this approach, we fragment the English input into sentences, generate UML class diagram fragments from them, and combine all the fragments together into a final result.

To develop our tool, we first create a dataset of UML diagrams paired with English specifications. The specifications are produced by a crowdsourcing initiative. The resulting dataset, although small, is enough to train the classifier and evaluate our approach.

We define three evaluation metrics of varying strictness to test our approach's accuracy in generating classes and relationships from an English specification. The results for classes are 17% precision and 25% recall for exact matching, the strictest metric. The results for relationships are a connectivity similarity of 63% and a size difference of 67%.
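For exact matching, a generated class counts only if it also appears in the reference diagram. A minimal sketch of such a computation over class names (an assumption about what is matched, not the paper's evaluation code) is:

    # Exact-match precision and recall over sets of class names (illustrative).
    def exact_match_scores(generated: set, reference: set):
        true_positives = len(generated & reference)
        precision = true_positives / len(generated) if generated else 0.0
        recall = true_positives / len(reference) if reference else 0.0
        return precision, recall

    generated = {"Course", "Teacher", "Room"}
    reference = {"Course", "Teacher", "Classroom", "Department"}
    print(exact_match_scores(generated, reference))  # (0.666..., 0.5)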
The correctness of the produced diagrams is limited. However, these results are in part explained by the imprecision of the NLP tools we used. Using more sophisticated NLP tools will help to improve these results. In addition, more grammar patterns can be added to those defined in Section 2.5, and an improved version of the composition algorithm will reduce irrelevant classes.

From a broader perspective, our research lays the groundwork for a consistent quantitative evaluation framework, with our approach being the baseline and with the dataset and metrics being the testing framework. From the novelty perspective, we explore intermediate machine learning steps to simplify a mostly rule-based approach. Furthermore, our approach uses a divide-and-conquer strategy when fragmenting diagrams and text and when composing them back together.

In the future, a more complex pattern system can improve the performance of our approach. Currently, we only use a single rule to generate a UML fragment, but if several rules contribute together, the performance can increase. The composition algorithm can also be improved, such as by considering a confidence score for each fragment. Furthermore, inheritance can be generated as a new type of relationship by adding more grammar patterns. Finally, we can generalize our approach to handle other types of UML diagrams, in particular behavioral ones.

REFERENCES
[1] Esra A. Abdelnabi, Abdelsalam M. Maatuk, Tawfig M. Abdelaziz, and Salwa M. Elakeili. 2020. Generating UML class diagram using NLP techniques and heuristic rules. In 2020 20th International Conference on Sciences and Techniques of Automatic Control and Computer Engineering (STA). IEEE, 277–282.
[2] Esra A. Abdelnabi, Abdelsalam M. Maatuk, and Mohammed Hagal. 2021. Generating UML Class Diagram from Natural Language Requirements: A Survey of Approaches and Techniques. In 2021 IEEE 1st International Maghreb Meeting of the Conference on Sciences and Techniques of Automatic Control and Computer Engineering (MI-STA). 288–293. https://doi.org/10.1109/MI-STA52233.2021.9464433
[3] Explosion AI. 2016. spaCy: Industrial-Strength Natural Language Processing. Explosion AI, Berlin, Germany. Retrieved Jul 13, 2022 from https://spacy.io/
[4] Meryem Elallaoui, Khalid Nafil, and Raja Touahni. 2015. Automatic generation of UML sequence diagrams from user stories in Scrum process. In 2015 10th International Conference on Intelligent Systems: Theories and Applications (SITA). IEEE, 1–6.
[5] Richard Paul Hudson. 2021. Coreferee: Coreference resolution for multiple languages. Explosion AI, Berlin, Germany. Retrieved Jul 13, 2022 from https://spacy.io/universe/project/coreferee
[6] Danai Koutra, Ankur Parikh, Aaditya Ramdas, and Jing Xiang. 2011. Algorithms for Graph Similarity and Subgraph Matching. (2011), 15–16. Retrieved Jul 18, 2022 from https://www.cs.cmu.edu/~jingx/docs/DBreport.pdf
[7] Priyanka More and Rashmi Phalnikar. 2012. Generating UML diagrams from natural language specifications. International Journal of Applied Information Systems 1, 8 (2012), 19–23.
[8] Gunter Mussbacher et al. 2020. Opportunities in intelligent modeling assistance. Software and Systems Modeling 19, 5 (2020), 1045–1053.
[9] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[10] PlantUML. 2021. Open-source tool that uses simple textual descriptions to draw beautiful UML diagrams. PlantUML. Retrieved Jul 14, 2022 from https://plantuml.com
[11] Julia Rubin and Marsha Chechik. 2013. N-Way Model Merging. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering (Saint Petersburg, Russia) (ESEC/FSE 2013). Association for Computing Machinery, New York, NY, USA, 301–311. https://doi.org/10.1145/2491411.2491446
[12] Rijul Saini, Gunter Mussbacher, Jin L.C. Guo, and Jörg Kienzle. 2022. Automated, interactive, and traceable domain modelling empowered by artificial intelligence. Software and Systems Modeling 21, 3 (2022), 1015–1045.