Context-Based Ontology Modelling For Database: Enabling Chatgpt For Semantic Database Management
Abstract
This research paper explores the use of ChatGPT in database man-
agement. ChatGPT, an AI-powered chatbot, has limitations in per-
forming tasks related to database management due to the lack
of standardized vocabulary and grammar for representing database
semantics. To address this limitation, the paper proposes a solu-
tion that involves developing a set of syntaxes that can repre-
sent database semantics in natural language. The syntax is used
to convert database schemas into natural language formats, pro-
viding a new application of ChatGPT in database management.
The proposed solution is demonstrated through a case study where
ChatGPT is used to perform two tasks: semantic integration and
table joining. Results demonstrate that the use of semantic database
representations produces more precise outcomes and avoids common
mistakes compared to cases with no semantic representation.
The proposed method has the potential to speed up the database
management process, reduce the level of database domain knowledge
required, and enable automatic database operations without accessing
the actual data, thus eliminating privacy protection concerns when
using AI. This paper provides a promising new direction for research
in the field of AI-based database management.
2 ChatGPT for Semantic Database Management
1 Introduction
ChatGPT is a conversational chatbot that uses artificial intelligence (AI) and
machine learning (ML) techniques, combined with natural language process-
ing (NLP) methods, to produce human-like text. It was launched in November
2022 and quickly gained popularity, reaching over one million users within just
five days [1]. Its ability to produce human-like text has made it a popular
tool for a wide range of tasks, including answering questions, writing short
stories, composing music, solving math problems, translating between
languages, and even computer programming.
Database operation involves the manipulation of data and information
using specific syntax or commands, similar to computer programming. In
database operations, these commands are known as database queries, which
are used to retrieve, update, and manipulate data stored in a database. Sim-
ilarly, in computer programming, instructions are written in a programming
language, such as C or Python, in order to specify the desired behaviour of a
computer program.
There has been an expectation that ChatGPT could assist in creating
database queries, just as it can assist in creating computer programs. However,
creating database queries requires an understanding of the database itself, and
there is no conventional way to represent database semantics. This problem
limits ChatGPT’s ability to perform tasks related to database management.
In this paper, we present a solution to this problem by developing a set
of syntaxes that can represent database semantics, such as table structure
and relationships, in natural language. These syntaxes allow the creation of
semantic representations of databases that ChatGPT can understand, enabling
it to perform database management tasks. Our work is demonstrated through a
case study, where ChatGPT is used to perform two tasks: semantic integra-
tion and table joining. Our results show that the use of semantic database
representations produces more precise outcomes and avoids common mistakes
compared to cases with no semantic representation.
The proposed method transforms database schemas into natural language
formats, providing a new application of ChatGPT in database management.
This study has the potential to speed up the database management process,
reduce the level of database domain knowledge required, and enable automatic
database operations without accessing the actual data, thus eliminating
privacy protection concerns when using AI.
The rest of the paper is organized as follows: In Section 2, we provide a
review of related work in the area of database management using AI. Then,
we describe our proposed solution in Section 3. Section 4 presents the results
and discussion of our case study. Finally, we discuss the potential benefits and
limitations of our work and conclude with future directions for research.
2 Literature review
2.1 AI-based database queries generation
The use of AI models for generating database queries through natural language
has been the focus of several research studies. One such model proposed by
Bais et al. [2] utilizes NLP techniques to analyze and interpret user queries
by performing morphological, syntactic, and semantic analysis, resulting in a
valid database query in SQL. Similarly, Sawant et al. [3] implemented a system
that can generate SQL queries from text and speech input using NLP and deep
learning techniques such as Long Short-Term Memory (LSTM).
Other studies, such as Ghosh et al. [4], Nagare et al. [5], and Kombade et
al. [6], have also used lexical, syntax, and semantic analysis to extract
SQL queries from natural language input. Kombade et al. [6] additionally
considered abbreviations in the natural language input when generating
SQL queries. These studies were implemented in Python with a GUI for
input and output, and the user could provide input through speech or text.
Despite the progress made in this field, limitations still exist in the ability
of AI models to accurately generate database queries from natural language
due to the complexity and ambiguity of natural language, as well as the lack
of standardized vocabulary and grammar for representing database structures.
For instance, Nagare et al. [5] mention that their system checks the validity
of the user’s query, but it is unclear how validity is determined.
Moreover, these studies consider only basic database operations such as
select, delete, and update. Complex operations, such as joining multiple
tables and semantic integration, have not been investigated.
3 Methodology
3.1 ChatGPT
ChatGPT is a language model developed by OpenAI [15]. It is a type of AI
algorithm trained to predict the likelihood of a given sequence of words based
on the context of the words that come before it. This technology is based on
self-attention mechanisms [16] and has been trained on a massive dataset of
text, allowing it to generate sophisticated and seemingly intelligent writing.
ChatGPT is designed to converse with users in English and other languages
on a wide range of topics, making it ideal for use in chatbots, customer service,
content creation, and language translation tasks.
One of the applications of ChatGPT is to assist in programming, which
can be achieved in two ways. Firstly, ChatGPT can serve as a programming
assistant or tool. For instance, developers can ask ChatGPT programming-
related questions and obtain recommendations and suggestions about general
workflows and steps. Secondly, ChatGPT can generate code snippets directly,
resulting in enhanced productivity and time-saving benefits for developers.
Despite its advanced natural language processing capability and successes
in assisting programming, ChatGPT has not yet been able to generate queries
for databases because database schemas, which contain vital information about
database structures, are frequently written in the form of a graph rather than
natural language.
two examples that illustrate how the “context-of” construct can be used to
describe the relationship of headers within one table and the relationship of
tables within one database.
[Figure: table ‘Patients_Alabama’ with its basic and contextual schema]
Basic schema: a table ‘Patients_Alabama’ with headers: Id, BIRTHDATE, DEATHDATE, SSN, PREFIX, FIRST, LAST, SUFFIX, MAIDEN, MARITAL, RACE, ETHNICITY, GENDER, BIRTHPLACE, ADDRESS, CITY, STATE, COUNTY.
Contextual schema: In table ‘Patients_Alabama’, headers ADDRESS, CITY, STATE, and COUNTY are in the context of patients’ address.
[Figure: database schema diagram showing several tables with their headers, including ‘organizations’, ‘medications’, and ‘conditions’]
and implementing the COM-DB method, which uses the “context-of” construct
and other ontology modelling constructs to convert a database schema into
natural language. The effectiveness of this method is demonstrated in the
case study, where it is used to complete two sophisticated database
management tasks.
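The conversion from schema to natural language can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this paper: the function names `basic_schema`, `context_of`, and `join_names` are hypothetical, and the sentence templates are copied from the ‘Patients_Alabama’ example, with plain quotes in place of typographic ones.

```python
from typing import Sequence


def join_names(names: Sequence[str]) -> str:
    """Join names as 'A', 'A and B', or 'A, B, and C'."""
    items = list(names)
    if len(items) <= 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + ", and " + items[-1]


def basic_schema(table: str, headers: Sequence[str]) -> str:
    """Render the basic schema sentence: the table name and its headers."""
    return f"a table '{table}' with headers: {', '.join(headers)}."


def context_of(table: str, headers: Sequence[str], context: str) -> str:
    """Render one "context-of" statement for a group of headers."""
    noun, verb = ("header", "is") if len(headers) == 1 else ("headers", "are")
    return (
        f"In table '{table}', {noun} {join_names(headers)} "
        f"{verb} in the context of {context}."
    )


# Headers of the 'Patients_Alabama' table as listed in the example.
PATIENT_HEADERS = [
    "Id", "BIRTHDATE", "DEATHDATE", "SSN", "PREFIX", "FIRST", "LAST",
    "SUFFIX", "MAIDEN", "MARITAL", "RACE", "ETHNICITY", "GENDER",
    "BIRTHPLACE", "ADDRESS", "CITY", "STATE", "COUNTY",
]

if __name__ == "__main__":
    print(basic_schema("Patients_Alabama", PATIENT_HEADERS))
    print(context_of("Patients_Alabama",
                     ["ADDRESS", "CITY", "STATE", "COUNTY"],
                     "patients' address"))
```

Only header names are needed as input; no rows of data are read, which is consistent with the aim of operating without access to the actual data.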
4 Case study
The case study aims to showcase the efficacy of the proposed COM-DB
system. The system’s primary feature is the “context-of” construct, which
utilizes natural language to capture database semantics like table structure
and relationships. The primary objective of the system is to create semantic
representations of databases that can be easily comprehended by ChatGPT,
enabling it to perform various database management tasks.
The case study provides empirical evidence to support the effectiveness
of COM-DB. Two sample databases are taken from the literature:
Synthea Alabama [18] and BDA EHR [19]. Based on those databases, two
experiments are conducted that represent typical tasks in database
integration: semantic integration and table joining. In both experiments,
ChatGPT is used to perform the task with and without the COM-DB-based
schema. Each experiment is repeated 10 times to ensure reliability
and to account for potential inconsistency in ChatGPT’s performance. The
results shown illustrate an average outcome of the repeated experiments.
patients A        patients B
Name              FIRST
Surname           LAST
Date of Birth     BIRTHDATE
Place of Birth    BIRTHPLACE
Address           ADDRESS, CITY, STATE, COUNTY
Gender            GENDER
the headers from table ’patients A’ and table ’patients B’ which contain the
same information. Some headers may need to be combined or split.” is an
explanation of the task to be completed by ChatGPT.
The output in Figure 3 shows that ChatGPT understands the task but performs
it only partially. It correctly matches Date of Birth with BIRTHDATE, Place
of Birth with BIRTHPLACE, and Gender with GENDER. However, it fails to match
Name with FIRST and Surname with LAST. In addition, it does not notice that
ADDRESS in ‘patients B’ should be combined with the headers CITY, STATE,
and COUNTY.
Figure 4 shows the input and output of using ChatGPT with COM-DB
based schema. In addition to the inputs used in Figure 3, the ontology model
information is described as “In table ’patients A’, headers Name and Surname
are in the context of patients’ name. In table ’patients B’, headers ADDRESS,
CITY, STATE, and COUNTY are in the context of patients’ address.”
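The difference between the two inputs can be sketched as follows. This is an illustration under stated assumptions rather than the exact prompts used in the experiment: `build_prompt` is a hypothetical helper, the header lists are taken from the matching table of ‘patients A’ and ‘patients B’, the order of the parts is assumed, and the task text is left as a placeholder.

```python
from typing import Iterable


def build_prompt(schema: Iterable[str], contexts: Iterable[str], task: str) -> str:
    """Concatenate schema sentences, "context-of" sentences, and the task."""
    parts = [*schema, *contexts, task]
    return "\n".join(part.strip() for part in parts if part.strip())


# Basic schema sentences (header lists assumed from the matching table).
SCHEMA = [
    "a table 'patients A' with headers: Name, Surname, Date of Birth, "
    "Place of Birth, Address, Gender.",
    "a table 'patients B' with headers: FIRST, LAST, BIRTHDATE, BIRTHPLACE, "
    "ADDRESS, CITY, STATE, COUNTY, GENDER.",
]

# Ontology model information quoted in the text.
CONTEXTS = [
    "In table 'patients A', headers Name and Surname are in the context of "
    "patients' name.",
    "In table 'patients B', headers ADDRESS, CITY, STATE, and COUNTY are in "
    "the context of patients' address.",
]

# Placeholder only; substitute the task explanation given to ChatGPT.
TASK = "<task description goes here>"

baseline_prompt = build_prompt(SCHEMA, [], TASK)      # without COM-DB
comdb_prompt = build_prompt(SCHEMA, CONTEXTS, TASK)   # with COM-DB

if __name__ == "__main__":
    print(comdb_prompt)
```

The two prompts differ only in the “context-of” sentences, so any change in ChatGPT's output can be attributed to the added context information.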
explain all tables with their contained headers in alphabetical order. Similar
to Experiment 1, only the headers are provided here without any sample data
or data type. The second part, “To create a SQL query that generates a list of
careplans, with corresponding providers’ and patients’ identity information.”
is an explanation of the task to be completed by ChatGPT.
Fig. 5 Experiment 2 Generate a new view from multiple tables without COM-DB based
schema. The conversation is split into two columns, from left to right.
Fig. 7 Experiment 2 Generate a new view from multiple tables with COM-DB based
schema. The conversation is split into two columns, from left to right.
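To make the table joining task concrete, the sketch below shows the kind of query the task asks for, run against an in-memory SQLite database. It is not ChatGPT's output from Figures 5 or 7. The table subset, the header names, the link from careplans to providers through an encounters table, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical minimal subset of tables and headers (assumed, not the full schema).
SCHEMA = """
CREATE TABLE patients   (Id TEXT PRIMARY KEY, "FIRST" TEXT, "LAST" TEXT);
CREATE TABLE providers  (Id TEXT PRIMARY KEY, NAME TEXT);
CREATE TABLE encounters (Id TEXT PRIMARY KEY, PATIENT TEXT, PROVIDER TEXT);
CREATE TABLE careplans  (Id TEXT PRIMARY KEY, PATIENT TEXT, ENCOUNTER TEXT,
                         DESCRIPTION TEXT);
"""

# Careplans joined with the patients' and providers' identity information.
# The provider is reached through the encounter in which the careplan was made.
QUERY = """
SELECT c.Id          AS careplan_id,
       c.DESCRIPTION AS careplan,
       p."FIRST"     AS patient_first,
       p."LAST"      AS patient_last,
       pr.NAME       AS provider_name
FROM careplans AS c
JOIN patients   AS p  ON p.Id  = c.PATIENT
JOIN encounters AS e  ON e.Id  = c.ENCOUNTER
JOIN providers  AS pr ON pr.Id = e.PROVIDER
ORDER BY c.Id;
"""


def run_demo():
    """Create the tables, insert one made-up row each, and run the join."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO patients VALUES ('p1', 'Ada', 'Smith')")
    conn.execute("INSERT INTO providers VALUES ('pr1', 'Dr. Lee')")
    conn.execute("INSERT INTO encounters VALUES ('e1', 'p1', 'pr1')")
    conn.execute(
        "INSERT INTO careplans VALUES "
        "('c1', 'p1', 'e1', 'Diabetes self management plan')"
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in run_demo():
        print(row)
```

Choosing the correct join path requires knowing how the headers of one table relate to other tables, which header names alone do not state.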
5 Discussions
The results of the experiments indicate that ChatGPT performs better in
both the semantic integration and table joining tasks when using the
COM-DB-based schema. The context information provided by the ontology models
helps ChatGPT to complete the tasks better. The study demonstrates that
6 Conclusion
This paper explores the use of ChatGPT in the area of database management,
highlighting the challenges of using natural language processing to perform
database queries. Our research presents a solution by developing a set of
syntaxes to represent database semantics in natural language. These syntaxes,
called COM-DB, enable ChatGPT to perform tasks related to database
management, such as semantic integration and table joining. Our case study
shows that the use of semantic representations in database management leads
to more precise outcomes and reduces common mistakes compared to cases
without such representations.
Our research aims to contribute to the field of database management
by introducing a novel approach for converting database schemas into nat-
ural language format, thereby opening up new applications for ChatGPT.
References
[1] Thorp, H.H.: ChatGPT is fun, but not an author. American Association
for the Advancement of Science (2023)
[2] Bais, H., Machkour, M., Koutti, L.: A model of a generic natural language
interface for querying database. International Journal of Intelligent Systems
and Applications 8(2), 35 (2016)
[3] Sawant, A., Raina, R., Patil, A., Pardeshi, A.: AI model to generate SQL
queries from natural language instructions through voice. In: Journal of
Physics: Conference Series, vol. 2273, p. 012014 (2022). IOP Publishing
[4] Ghosh, P.K., Dey, S., Sengupta, S.: Automatic SQL query formation from
natural language query. International Journal of Computer Applications
975, 8887 (2014)
[5] Nagare, P., Indhe, S., Sabale, D., Thorat, G., Chaturvedi, P.: Automatic
SQL query formation from natural language query. Int. Res. J. Eng. Technol.
4, 1589–1591 (2017)
[6] Kombade, C., More, M., Pujari, A., Patil, S.: Natural language processing
with some abbreviation to SQL. International Journal for Research in Applied
Science and Engineering Technology 8(5), 1046–1048 (2020)
[7] Volz, J., Bizer, C., Gaedke, M., Kobilarov, G.: Silk - a link discovery
framework for the web of data. LDOW 538, 53 (2009)
[8] Ngomo, A.-C.N., Auer, S.: LIMES - a time-efficient approach for large-scale
link discovery on the web of data. integration 15(3) (2011)
[9] Suchanek, F.M., Abiteboul, S., Senellart, P.: PARIS: Probabilistic
alignment of relations, instances, and schema. arXiv preprint arXiv:1111.7164
(2011)
[10] Urbani, J., Kotoulas, S., Maassen, J., Van Harmelen, F., Bal, H.: WebPIE:
A web-scale parallel inference engine using MapReduce. Journal of Web
Semantics 10, 59–75 (2012)
[11] Böhm, C., De Melo, G., Naumann, F., Weikum, G.: LINDA: distributed
web-of-data-scale entity matching. In: Proceedings of the 21st ACM
International Conference on Information and Knowledge Management, pp.
2104–2108 (2012)
[12] Morales, C., Collarana, D., Vidal, M.-E., Auer, S.: MateTee: A semantic
similarity metric based on translation embeddings for knowledge graphs.
In: Web Engineering: 17th International Conference, ICWE 2017, Rome,
Italy, June 5-8, 2017, Proceedings 17, pp. 246–263 (2017). Springer
[13] Ristoski, P., Paulheim, H.: RDF2Vec: RDF graph embeddings for data
mining. In: The Semantic Web–ISWC 2016: 15th International Semantic Web
Conference, Kobe, Japan, October 17–21, 2016, Proceedings, Part I 15,
pp. 498–514 (2016). Springer
[14] Lantzaki, C., Papadakos, P., Analyti, A., Tzitzikas, Y.: Radius-aware
approximate blank node matching using signatures. Knowledge and
Information Systems 50, 505–542 (2017)
[15] van Dis, E.A., Bollen, J., Zuidema, W., van Rooij, R., Bockting, C.L.:
ChatGPT: five priorities for research. Nature 614(7947), 224–226 (2023)
[16] Humphreys, G.W., Sui, J.: Attentional control and the self: the self-
attention network (san). Cognitive neuroscience 7(1-4), 5–17 (2016)
[17] Lin, W., Babyn, P., Yan, Y., Zhang, W.: Ontology in the modern computer
era. Enterprise Information Systems (2023, submitted)
[18] Walonoski, J., Kramer, M., Nichols, J., Quina, A., Moesel, C., Hall,
D., Duffett, C., Dube, K., Gallagher, T., McLachlan, S.: Synthea: An
approach, method, and software mechanism for generating synthetic
patients and the synthetic electronic health care record. Journal of the
American Medical Informatics Association 25(3), 230–238 (2018)
[19] Silvestri, S., Esposito, A., Gargiulo, F., Sicuranza, M., Ciampi, M.,
De Pietro, G.: A big data architecture for the extraction and analysis of
EHR data. In: 2019 IEEE World Congress on Services (SERVICES), vol.
2642, pp. 283–288 (2019). IEEE