0% found this document useful (0 votes)

41 views22 pages

Recent Advances in Text-To-SQL- A Survey of What We Have and What We Expect

Uploaded by

huw0217

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

41 views22 pages

Recent Advances in Text-To-SQL- A Survey of What We Have and What We Expect

Uploaded by

huw0217

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 22

Recent Advances in Text-to-SQL:

A Survey of What We Have and What We Expect

Naihao Deng Yulong Chen Yue Zhang

University of Michigan Westlake University Westlake University
[email protected] [email protected] [email protected]

Abstract What are the major cities in the state of Kansas?

Database End User

Text-to-SQL has attracted attention from both
the natural language processing and database
arXiv:2208.10099v1 [cs.CL] 22 Aug 2022

SELECT T1.CITY_NAME FROM CITY AS T1 WHERE

communities because of its ability to convert Model T1.POPULATION > 150000 AND T1.STATE_NAME = "Kansas" ;
the semantics in natural language into SQL
queries and its practical application in build- Figure 1: The framework for text-to-SQL systems.
ing natural language interfaces to database sys- Given the database schema and user utterance, the sys-
tems. The major challenges in text-to-SQL tem outputs a corresponding SQL query to query the
lie in encoding the meaning of natural utter- database system for the result. Appendix B gives more
ances, decoding to SQL queries, and translat- text-to-SQL examples.
ing the semantics between these two forms.
These challenges have been addressed to dif-
ferent extents by the recent advances. How-
ever, there is still a lack of comprehensive sur- text-to-SQL has attracted researchers in the natural
veys for this task. To this end, we review language processing (NLP) and the database (DB)
recent progress on text-to-SQL for datasets,
community for decades (Codd, 1970; Hemphill
methods, and evaluation and provide this sys-
tematic survey, addressing the aforementioned et al., 1990; Dahl et al., 1994; Zelle and Mooney,
challenges and discussing potential future di- 1996; Popescu et al., 2003; Bertomeu et al., 2006;
rections. We hope that this survey can serve Wang et al., 2020a; Scholak et al., 2021b).
as quick access to existing work and motivate
future research. 1 The challenges in text-to-SQL lie within three
aspects: (1) extracting the meaning of natural ut-
1 Introduction terances (encoding); (2) transforming the extracted
meaning into another expression which is pragmat-
The task of text-to-SQL is to convert natural ut- ically equivalent to the NL meaning (translating)
terances into SQL queries (Zhong et al., 2017; Yu and; (3) producing the corresponding SQL queries
et al., 2018c). Figure 1 shows an example. Given (decoding). A wide range of methods has been in-
a user utterance “What are the major cities in the vestigated to address the technical challenges, from
state of Kansas?”, the system outputs a correspond- representation learning, intermediate structures, de-
ing SQL that can be used for retrieving the answer coding, model structures, training objectives, and
from a database. It builds a natural language in- other perspectives. In addition, much work has
terface to the database (NLIDB) to help lay users been conducted on data resources and evaluation.
access information in the database (Popescu et al., However, relatively little work has been done in the
2003; Li and Jagadish, 2014), inspiring research literature to provide a comprehensive survey of the
in human-computer interaction (Elgohary et al., landscape. The only exceptions are (Katsogiannis-
2020). Because the SQL query can be regarded as Meimarakis and Koutrika, 2021) and (Kalajdjieski
a semantic representation (Guo et al., 2020), text- et al., 2020), but they cover a limited scope. To this
to-SQL is also a representative task in semantic end, we aim to provide a systematic survey that
parsing, helping downstream applications such as involves a broader range of text-to-SQL research
question answering (Wang et al., 2020d). Thus, and addresses the aforementioned challenges.
1
The Github Link for this survey is: https://
github.com/text-to-sql-survey-coling22/ In this paper, we survey the recent progress on
text-to-sql-survey-coling22.github.io. text-to-SQL, from datasets (§ 2), methods (§ 3)
Datasets #Size #DB #D #T/DB Issues addressed Sources for data
College courses,
Domain
Spider (Yu et al., 2018c) 10,181 200 138 5.1 DabaseAnswers,
generalization
WikiSQL
WikiSQL (Zhong et al., 2017) 80,654 26,521 - 1 Data size Wikipedia
Lexicon-level
Squall (Shi et al., 2020b) 11,468 1,679 - 1 WikiTableQuestions
supervision
Domain
KaggleDBQA (Lee et al., 2021) 272 8 8 2.3 Real web daabases
generalization
Internet Movie
IMDB (Yaghmazadeh et al., 2017) 131 1 1 16 -
Database
Yelp (Yaghmazadeh et al., 2017) 128 1 1 7 - Yelp website
University of
Advising (Finegan-Dollak et al., 2018) 3,898 1 1 10 - Michigan course
information
MIMICSQL (Wang et al., 2020d) 10,000 1 1 5 - Healthcare domain
SQL template
SEDE (Hazoom et al., 2021) 12,023 1 1 29 Stack Exchange
diversity

Table 1: The statistic for recent text-to-SQL datasets. #Size, #DB, #D, and #T/DB represent the numbers of
question-SQL pairs, databases, domains, and the averaged number of tables per domain, respectively. The “-” in
the #D column indicates an unknown number of domains, and the “-” in the Issues Addressed indicates no specific
issue addressed by the dataset. Datasets above and below the line are cross-domain and single-domain, respectively.
The complete statistic is listed in Table 7 in Appendix C.

to evaluation (§ 4) 2 and highlight potential direc- unseen SQL queries or SQL queries from other
tions for future work (§ 5). Apendix A shows the domains (Finegan-Dollak et al., 2018; Yu et al.,
topology for the text-to-SQL task. 2018c). However, since these datasets are adapted
from real-life applications, most of them contain
2 Datasets domain knowledge (Gan et al., 2021b) and dataset
As shown in Table 1, existing text-to-SQL datasets conventions (Suhr et al., 2020). Thus, they are still
can be classified into three categories: single- valuable to evaluate models’ ability to generalize
domain datasets, cross-domain datasets and others. to new domains and explore how to incorporate do-
main knowledge and dataset convention to model
Single-Domain Datasets Single-domain text-to- predictions.
SQL datasets typically collect question-SQL pairs Appendix B gives a detailed discussion on do-
for a single database in some real-world tasks, in- main knowledge and dataset convention, and con-
cluding early ones such as Academic (Li and Ja- crete text-to-SQL examples.
gadish, 2014), Advising (Finegan-Dollak et al.,
2018), ATIS (Price, 1990; Dahl et al., 1994), Large Scale Cross-domain Datasets Large
GeoQuery (Zelle and Mooney, 1996), Yelp and cross-domain datasets such as WikiSQL (Zhong
IMDB (Yaghmazadeh et al., 2017), Scholar (Iyer et al., 2017) and Spider (Yu et al., 2018c) are pro-
et al., 2017) and Restaurants (Tang and Mooney, posed to better evaluate deep neural models. Wik-
2000; Popescu et al., 2003), as well as recent ones iSQL uses tables extracted from Wikipedia and lets
such as SEDE (Hazoom et al., 2021), ESQL (Chen annotators paraphrase questions generated for the
et al., 2021a) and MIMICSQL (Wang et al., 2020d). tables. Compared to other datasets, WikiSQL is an
These single-domain datasets, particularly the order of magnitude larger, containing 80,654 nat-
early ones, are usually limited in size, containing ural utterances in total (Zhong et al., 2017). How-
only a few hundred to a few thousand examples. Be- ever, WikiSQL contains only simple SQL queries,
cause of the limited size and similar SQL patterns and only a single table is queried within each SQL
in training and testing phases, text-to-SQL mod- query (Yu et al., 2018c).
els that are trained on these single-domain datasets Yu et al. (2018c) propose Spider, which contains
can achieve decent performance by simply mem- 200 databases with an average of 5 tables for each
orizing the SQL patterns and fail to generalize to database, to test models’ performance on compli-
2
Note that most work discussed in this paper is in English cated unseen SQL queries and their ability to gen-
unless otherwise specified. eralize to new domains. Furthermore, researchers
expand Spider to study various issues of their inter- and attain robustness towards different types of
est (Lei et al., 2020; Zeng et al., 2020; Gan et al., questions (Radhakrishnan et al., 2020) .
2021b; Taniguchi et al., 2021; Gan et al., 2021a).
Besides, researchers build several large-scale Typical data augmentation techniques involve
text-to-SQL datasets in different languages such as paraphrasing questions and filling pre-defined tem-
CSpider (Min et al., 2019a), TableQA (Sun et al., plates for increasing data diversity. Iyer et al. (2017)
2020), DuSQL (Wang et al., 2020c) in Chinese, use the Paraphrase Database (PPDB) (Ganitkevitch
ViText2SQL (Tuan Nguyen et al., 2020) in Viet- et al., 2013) to generate paraphrases for training
namese, and PortugueseSpider (José and Cozman, questions. Appendix B gives an example of this
2021) in Portuguese. Given that human transla- augmentation method. Iyer et al. (2017) and Yu
tion has shown to be more accurate than machine et al. (2018b) collect question-SQL templates and
translation (Min et al., 2019a), these datasets are an- fill in them with DB schema. Researchers also em-
notated mainly by human experts based on the En- ploy neural models to generate natural utterances
glish Spider dataset. These Spider-based datasets for sampled SQL queries to acquire more data. For
can serve as potential resources for multi-lingual instance, Li et al. (2020a) fine-tune pre-trained T5
text-to-SQL research. model (Raffel et al., 2019) using SQL query as the
Other Datasets Several context-dependent text- input to predict natural utterance on WikiSQL, and
to-SQL datasets have been proposed, which involve then randomly synthesize SQL queries from tables
user interactions with the text-to-SQL system in in WikiSQL and use the tuned model to generate
English (Price, 1990; Dahl et al., 1994; Yu et al., the corresponding natural utterance.
2019a,b) and Chinese (Guo et al., 2021). In addi-
tion, researchers collect datasets to study questions The quality of the augmented data is impor-
in text-to-SQL being answerable or not (Zhang tant because low-quality data can hurt the perfor-
et al., 2020), lexicon-level mapping (Shi et al., mance of the models (Wu et al., 2021). Various
2020b) and cross-domain evaluation for real Web approaches have been exploited to improve the
databases (Lee et al., 2021). quality of the augmented data. After sampling
Appendix C.1 discusses more details about SQL queries, Zhong et al. (2020b) employ an utter-
datasets mentioned in § 2. ance generator to generate natural utterances and
a semantic parser to convert the generated natural
3 Methods utterance to SQL queries. To filter out low-quality
Early text-to-SQL systems employ rule-based and augmented data, Zhong et al. (2020b) only keep
template-based methods (Li and Jagadish, 2014; data that have the same generated SQL queries as
Mahmud et al., 2015), which is suitable for simple the sampled ones. Wu et al. (2021) use a hierarchi-
user queries and databases. However, with the cal SQL-to-question generation process to obtain
progress in both DB and NLP communities, recent high-quality data. Observing that there is a strong
work focuses on more complex settings (Yu et al., segment-level mapping between SQL queries and
2018c). In these settings, deep models can be more natural utterances, Wu et al. (2021) decompose
useful because of their great feature representation SQL queries into several clauses, translate each
ability and generalization ability. clause into a sub-question, and then combine the
In this survey, we focus on the deep learn- sub-questions into a complete question.
ing methods primarily. We divide these meth-
ods employed in text-to-SQL research into Data To increase the diversity of the augmented
Augmentation (§ 3.1), Encoding (§ 3.2), Decod- data, Guo et al. (2018) incorporate a latent variable
ing (§ 3.3), Learning Techniques (§ 3.4), and Mis- in their SQL-to-text model to encourage question
cellaneous (§ 3.5). diversity. Radhakrishnan et al. (2020) augment the
WikiSQL dataset by simplifying and compressing
3.1 Data Augmentation questions to simulate the colloquial query behavior
Data augmentation can help text-to-SQL models of end-users. Wang et al. (2021b) exploit a proba-
handle complex or unseen questions (Zhong et al., bilistic context-free grammar (PCFG) to explicitly
2020b; Wang et al., 2021b), achieve state-of-the- model the composition of SQL queries, encourag-
art with less supervised data (Guo et al., 2018), ing sampling compositional SQL queries.
Methods Adopted by
Applied as “both columns are from the same table” in their
datasets graph.
TypeSQL (Yu et al., Graphs have also been used to encode questions
Encode type WikiSQL
2018a)
GNN (Bogin et al., together with DB schema. Researchers have been
Graph-based Spider
2019a) using different types of graphs to capture the se-
RAT-SQL (Wang
Self-attention
et al., 2020a)
Spider mantics in NL and facilitate linking between NL
SQLova (Hwang and table schema. Cao et al. (2021) adopt line
Adapt PLM WikiSQL
et al., 2019) graph (Gross et al., 2018) to capture multi-hop
TaBERT (Yin et al.,
Pre-training
2020)
Spider semantics by meta-path (e.g., an exact match for
a question token and column, together with the
Table 2: Typical methods used for encoding in text-to- column belonging to a table can form a 2-hop
SQL. The full table of existing methods and more de- meta-path) and distinguish between local and non-
tails are listed in Table 8 in Appendix D. local neighbors so that different tables and columns
will be attended differently. SADGA (Cai et al.,
2021) adopts the graph structure to provide a uni-
3.2 Encoding
fied encoding for both natural utterances and DB
Various methods have been adopted to address the schemas to help question-schema linking. Apart
challenges of representing the meaning of ques- from the relations between entities in both ques-
tions, representing the structure for DB schema, tions and DB schema, the structure for DB schemas,
and linking the DB content to question. We group S2 SQL (Hui et al., 2022) integrates syntax de-
them into five categories, as shown in Table 2. pendency among question tokens into the graph
Encode Token Types To better encode key- to improve model performance. To improve the
words such as entities and numbers in questions, Yu generalization of the graph method for unseen do-
et al. (2018a) assign a type to each word in the mains, ShawdowGNN (Chen et al., 2021b) ignores
question, with a word being an entity from the names of tables or columns in the database and
knowledge graph, a column, or a number. Yu et al. uses abstract schemas in the graph projection neu-
(2018c) concatenate word embeddings and the ral network to obtain delexicalized representations
corresponding type embeddings to feed into their of questions and DB schemas.
model. Finally, graph-based techniques are also ex-
ploited in context-dependent text-to-SQL. For in-
Graph-based Methods Since DB schemas con- stance, IGSQL (Cai and Wan, 2020) uses a graph
tain rich structural information, graph-based meth- encoder to utilize historical information of DB
ods are used to better encode such structures. schemas in the previous turns.
As summarized in § 2, datasets prior to Spider
typically involve simple DBs that contain only one Self-attention Models using transformer-based
table or a single DB in both training and testing. encoder (He et al., 2019; Hwang et al., 2019; Xie
As a result, modeling DB schema receives little et al., 2022) incorporate the original self-attention
attention. Because Spider contains complex and mechanism by default because it is the building
different DB in training and testing, Bogin et al. block of the transformer structure.
(2019a) propose to use graphs to represent the struc- RAT-SQL (Wang et al., 2020a) applies relation-
ture of the DB schemas. Specifically, Bogin et al. aware self-attention, a modified version of self-
(2019a) use nodes to represent tables and columns, attention (Vaswani et al., 2017), to leverage rela-
edges to represent relationships between tables and tions of tables and columns. DuoRAT (Scholak
columns, such as tables containing columns, pri- et al., 2021a) also adopts such a relation-aware
mary key, and foreign key constraints, and then self-attention in their encoder.
use graph neural networks (GNNs) (Li et al., 2016)
to encode the graph structure. In their subsequent Adapt PLM Various methods have been pro-
work, Bogin et al. (2019b) use a graph convolu- posed to leverage the knowledge in pre-trained lan-
tional network (GCN) to capture DB structures and guage models (PLMs) and better align PLM with
a gated GCN to select the relevant DB information the text-to-SQL task. PLMs such as BERT (Devlin
for SQL generation. RAT-SQL (Wang et al., 2020a) et al., 2019) are used to encode questions and DB
encodes more relationships for DB schemas such schemas. The modus operandi is to input the con-
catenation of question words and schema words Methods Adopted by
Applied
to the BERT encoder (Hwang et al., 2019; Choi datasets
et al., 2021). Other methods adjust the embeddings SyntaxSQLNet (Yu
Tree Spider
et al., 2018b)
by PLMs. On WikiSQL, for instance, X-SQL (He SQLNet (Xu et al.,
Sketch WikiSQL
et al., 2019) replaces segment embeddings from 2017)
the pre-trained encoder by column type embed- SmBop (Rubin and
Bottom-up Spider
Berant, 2021)
dings. Guo and Gao (2019) encode two additional Attention Wang et al. (2019) WikiSQL
feature vectors for matching between question to- Copy Wang et al. (2018a) WikiSQL
kens and table cells as well as column names and IRNet (Guo et al.,
IR Spider
2019)
concatenate them with BERT embeddings of ques- Global-GCN Bogin
Spider
tions and DB schemas. Others et al. (2019b)
Kelkar et al. (2020) Spider
HydraNet (Lyu et al., 2020) uses BERT to
encode the question and an individual column, Table 3: Typical methods used for decoding in text-to-
aligning with the tasks BERT is pre-trained on. SQL. The full table and more details are listed in Ta-
After obtaining the BERT representations of all ble 9 in Appendix D. IR: Intermediate Representation.
columns, Lyu et al. (2020) select top-ranked
columns for SQL prediction. Liu et al. (2021b)
train an auxiliary concept prediction module to pre- we group these methods into five main categories
dict which tables and columns correspond to the and other technologies.
question. They detect important question tokens by Tree-based Seq2Tree (Dong and Lapata, 2016)
detecting the largest drop in the confidence score employs a decoder that generates logical forms in a
caused by erasing that token in the question. Lastly, top-down manner. The components in the sub-tree
they train the PLM with a grounding module us- are generated conditioned on their parents apart
ing the question tokens and the corresponding ta- from the input question. Note that the syntax of
bles as well as columns. By empirical studies, Liu the logical forms is implicitly learned from data
et al. (2021b) claim that their approach can awaken for Seq2Tree. Similarly, Seq2AST (Yin and Neu-
the latent grounding from PLM via this erase-and- big, 2017) uses an abstract syntax tree (AST) for
predict technique. decoding the target programming language, where
the syntax is explicitly integrated with AST. Al-
Pre-training There have been various works
though both Seq2Tree (Dong and Lapata, 2016)
proposing different pre-training objectives and us-
and Seq2AST (Yin and Neubig, 2017) do not study
ing different pre-training data to better align the
text-to-SQL datasets, their uses of trees inspire
transformer-based encoder with the text-to-SQL
tree-based decoding in text-to-SQL. SyntaxSQL-
task. For instance, TaBERT (Yin et al., 2020)
Net (Yu et al., 2018b) employs a tree-based decod-
uses tabular data for pre-training with objectives
ing method specific to SQL syntax and recursively
of masked column prediction and cell value recov-
calls modules to predict different SQL components.
ery to pre-train BERT. Grappa (Yu et al., 2021)
synthesizes question-SQL pairs over tables and Sketch-based SQLNet (Xu et al., 2017) designs
pre-trains BERT with the objectives of masked lan- a sketch aligned with the SQL grammar, and SQL-
guage modeling (MLM) and predicting whether a Net only needs to fill in the slots in the sketch
column appears in the SQL query as well as what rather than predict both the output grammar and
SQL operations are triggered. GAP (Shi et al., the content. Besides, the sketch captures the de-
2020a) pre-trains BART (Lewis et al., 2020) on pendency of the predictions. Thus, the prediction
synthesized text-to-SQL and tabular data with the of one slot is only conditioned on the slots it de-
objectives of MLM, column prediction, column pends on, which avoids issues of the same SQL
recovery, and SQL generation. query with varied equivalent serializations. Dong
and Lapata (2018) decompose the decoding into
3.3 Decoding two stages, where the first decoder predicts a rough
Various methods have been proposed for decoding sketch, and the second decoder fills in the low-
to achieve a fine-grained and easier process for level details conditioned on the question and the
SQL generation and bridge the gap between natural sketch. Such coarse-to-fine decoding has also been
language and SQL queries. As shown in Table 3, adopted in other works such as IRNet (Guo et al.,
2019). To address the complex SQL queries with the copy mechanism is also adopted in context-
nested structures, RYANSQL (Choi et al., 2021) dependent text-to-SQL task (Wang et al., 2020b).
recursively yields SELECT statements and uses a
sketch-based slot filling for each of the SELECT Intermediate Representations Researchers use
statements. intermediate representations to bridge the gap be-
tween natural language and SQL queries. Inc-
Bottom-up Both the tree-based and the sketch- SQL (Shi et al., 2018) defines actions for different
based decoding mechanisms can be viewed as SQL components and let decoder decode actions
top-down decoding mechanisms. Rubin and Be- instead of SQL queries. IRNet (Guo et al., 2019)
rant (2021) use a bottom-up decoding mechanism. introduces SemQL, an intermediate representation
Given K trees of height t, the decoder scores trees for SQL queries that can cover most of the chal-
with height t + 1 constructed by SQL grammar lenging Spider benchmark. Specifically, SemQL
from the current beam, and K trees with the high- removes the JOIN ON, FROM and GROUP BY
est scores are kept. Then, a representation of the clauses, merges HAVING and WHERE clause for
new K trees is generated and placed in the new SQL queries. ValueNet (Brunner and Stockinger,
beam. 2021) uses SemQL 2.0, which extends SemQL to
include value representation. Based on SemQL,
Attention Mechanism To integrate the encoder-
NatSQL (Gan et al., 2021c) removes the set op-
side information at decoding, an attention score is
erators 3 . Suhr et al. (2020) implement SemQL
computed and multiplied with hidden vectors from
as a mapping from SQL to a representation with
the encoder to get the context vector, which is then
an under-specified FROM clause, which they call
used to generate an output token (Dong and Lapata,
SQLU F . Rubin and Berant (2021) employ a rela-
2016; Zhong et al., 2017).
tional algebra augmented with SQL operators as
Variants of the attention mechanism have been
the intermediate representations.
used to better propagate the information encoded
However, the intermediate representations are
from questions and DB schemas to the decoder.
usually designed for a specific dataset and cannot
SQLNet (Xu et al., 2017) designs column atten-
be easily adapted to others (Suhr et al., 2020). To
tion, where it uses hidden states from columns
construct a more generalized intermediate represen-
multiplied by embeddings for the question to cal-
tation, Herzig et al. (2021) propose to omit tokens
culate attention scores for a column given the ques-
in the SQL query that do not align to any phrase in
tion. Guo and Gao (2018) incorporate bi-attention
the utterance.
over question and column names for SQL com-
Inspired by the success of text-to-SQL task,
ponent selection. Wang et al. (2019) adopt a
intermediate representations are also studied
structured attention (Kim et al., 2017) by comput-
for SPARQL, another executable language for
ing the marginal probabilities to fill in the slots
database systems (Saparina and Osokin, 2021;
in their generated abstract SQL queries. Duo-
Herzig et al., 2021).
RAT (Scholak et al., 2021a) adopts the relation-
aware self-attention mechanism in both its encoder Others PICARD (Scholak et al., 2021b) and
and decoder. Other works that use sequence-to- UniSAr (Dou et al., 2022) set constraints to the
sequence transformer-based models or decoder- decoder to prevent generating invalid tokens. Sev-
only transformer-based models incorporate the self- eral methods adopt an execution-guided decoding
attention mechanism by default (Scholak et al., mechanism to exclude non-executable partial SQL
2021b; Xie et al., 2022). queries from the output candidates (Wang et al.,
Copy Mechanism Seq2AST (Yin and Neubig, 2018b; Hwang et al., 2019). Global-GNN (Bogin
2017) and Seq2SQL (Zhong et al., 2017) employ et al., 2019b) employs a separately trained discrim-
the pointer network (Vinyals et al., 2015) to com- inative model to rerank the top-K SQL queries
pute the probability of copying words from the in the decoder’s output beam, which is to reason
input. Wang et al. (2018a) use types (e.g., columns, about the complete SQL queries instead of con-
SQL operators, constant from questions) to explic- sidering each word and DB schemas in isolation.
itly restrict locations in the query to copy from Similarly, Kelkar et al. (2020) train a separate dis-
and develop a new training objective to only copy 3
The operators that combine the results of two or more
from the first occurrence in the input. In addition, SELECT statements, such as INTERSECT
criminator to better search among candidate SQL 2012) to learn an auxiliary reward to discount spu-
queries. Xu et al. (2017); Yu et al. (2018b); Guo rious SQL queries in SQL generation. Min et al.
and Gao (2018); Lee (2019) use separate submod- (2019b) model the possible SQL queries as a dis-
ules to predict different SQL components, eas- crete latent variable and adopt a hard-EM-style
ing the difficulty of generating a complete SQL parameter updates, letting their model take advan-
query. Chen et al. (2020b) employ a gate to select tage of the possible pre-computed solutions.
between the output sequence encoded for the ques-
tion and the output sequence from the previous 3.5 Miscellaneous
decoding steps at each step for SQL generation. In- In DB linking, BRIDGE (Lin et al., 2020) appends
spired by machine translation, Müller and Vlachos a representation for the DB cell values mentioned
(2019) apply byte-pair encoding (BPE) (Sennrich in the question to corresponding fields in the en-
et al., 2016) to compress SQL queries to shorter coded sequence, which links the DB content to the
sequences guided by AST, reducing the difficulties question. Ma et al. (2020) employ an explicit ex-
in SQL generation. tractor of slots mentioned in the question and then
link them with DB schemas.
3.4 Learning Techniques
Model-wise, Finegan-Dollak et al. (2018) use a
Apart from end-to-end supervised learning, differ- template-based model which copies slots from the
ent learning techniques have been proposed to help question. Shaw et al. (2021) use a hybrid model
text-to-SQL research. Here we summarize these which firstly uses a high precision grammar-based
learning techniques, each addressing a specific is- approach (NQG) to generate SQL queries, then
sue for the task. uses T5 (Raffel et al., 2019) as a back-up if NQG
Fully supervised Ni et al. (2020) adopt active fails. Yan et al. (2020) formulate submodule slot-
learning to save human annotation. Yao et al. (2019, filling as machine reading comprehension (MRC)
2020); Li et al. (2020b) employ interactive or imi- task and apply BERT-based MRC models on it.
tation learning to enhance text-to-SQL systems via Besides, DT-Fixup (Xu et al., 2021) designs an
interactions with end-users. Huang et al. (2018); optimization approach for a deeper Transformer on
Wang et al. (2021a); Chen et al. (2021a) adopt small datasets for the text-to-SQL task.
meta-learning (Finn et al., 2017) for domain gen- In SQL generation, IncSQL (Shi et al., 2018)
eralization. Various multi-task learning settings allows parsers to explore alternative correct action
have been proposed to improve text-to-SQL mod- sequences to generate different SQL queries. Brun-
els via enhancing their abilities on some relevant ner and Stockinger (2021) search values in DB to
tasks. Chang et al. (2020) set an auxiliary task insert values into SQL query.
of mapping between column and condition values. For context-dependent text-to-SQL, researchers
SeaD (Xuan et al., 2021) integrates two denoising adopt techniques such as turn-level encoder and
objectives to help the model better encode infor- copy mechanism (Suhr et al., 2018; Zhang et al.,
mation from the structural data. Hui et al. (2021b) 2019; Wang et al., 2020b), constrained decod-
integrate a task of learning the correspondence being (Wang et al., 2020b), dynamic memory decay
tween questions and DB schemas. Shi et al. (2021) mechanism (Hui et al., 2021a), treating questions
integrate a column classification task to classify and SQL queries as two modalities, and using bi-
which columns appear in the SQL query. McCann modal pre-trained models (Zheng et al., 2022).
et al. (2018) and Xie et al. (2022) train their models
with other semantic parsing tasks, which improves 4 Evaluation
models’ performance on text-to-SQL task.
Metrics Table 4 shows widely used automatic
Weakly supervised Seq2SQL (Zhong et al., evaluation metrics for the text-to-SQL task. Early
2017) use reinforcement learning to learn WHERE works evaluate SQL queries by comparing the
clause to allow different orders for components in database querying results executed from the pre-
WHERE clause. Liang et al. (2018) leverage mem- dicted SQL query and the ground-truth (or gold)
ory buffer to reduce the variance of policy gradient SQL query (Zelle and Mooney, 1996; Yagh-
estimates when applying reinforcement learning mazadeh et al., 2017) or use exact string match
to text-to-SQL. Agarwal et al. (2019) use meta- to compare the predicted SQL query with the gold
learning and Bayesian optimization (Snoek et al., one query (Finegan-Dollak et al., 2018). However,
Metrics Datasets Errors performance loss when tested against different text-
Naiive Execution GeoQuery, IMDB, False to-SQL datasets from other domains (Suhr et al.,
Accuracy Yelp, WikiSQL, etc positive 2020; Lee et al., 2021). It is unclear how to in-
Advising, WikiSQL, False
Exact String Match corporate domain knowledge to the models trained
etc negative
Exact Set Match Spider
False on Spider and deploy these models efficiently on
negative different domains, especially those with similar in-
Test Suite Accuracy
(execution accuracy Spider, GeoQuery, False formation stored in DB but slightly different DB
with generated etc positive schemas. Although large-scale datasets promote
databases) the cross-domain settings, question-SQL pairs from
Spider are free from domain knowledge, ambiguity,
Table 4: The summary of metrics, datasets that use
these metrics, and their potential error cases. or domain convention. Thus, cross-domain text-to-
SQL needs to be studied in future research to build
a practical cross-domain system that can handle
execution accuracy can create false positives for se-
real-world requests.
mantically different SQL queries even if they yield
the same execution results (Yu et al., 2018c). The There are different use cases in real-world sce-
exact string match can be too strict as two different narios, which requires models to be robust to dif-
strings can still have the same semantics (Zhong ferent settings and be smart to handle different user
et al., 2020a). Aware of these issues, Yu et al. requests. For instance, the model trained with DB
(2018c) adopt exact set match (ESM) in Spider, schemas can need to handle a corrupted table, or no
deciding the correctness of SQL queries by com- table is provided in its practical use. Besides, the
paring the sub-clauses of SQL queries. Zhong et al. input from users can vary from the standard ques-
(2020a) generate databases that can distinguish the tion input in Spider or WikiSQL, which poses chal-
predicted SQL query and gold one. Both methods lenges to models trained on these datasets. More
are used as official metrics on Spider. user studies need to be done to study how well
the current systems serve the end-users and the in-
Evaluation Setup Early single-domain datasets put pattern from the end-users. Apart from SQL
typically use the standard train/dev/test split (Iyer queries, administrators can want to change DB
et al., 2017) by splitting the question-SQL pairs ran- schemas, where a system that can translate the
domly. To evaluate generalization to unseen SQL natural language to such DB commands can be
queries within the current domain, Finegan-Dollak helpful. Also, although there are already works
et al. (2018) propose SQL query split, where no on text-to-SQL beyond English (Min et al., 2019a;
SQL query is allowed to appear in more than one Tuan Nguyen et al., 2020; José and Cozman, 2021),
set among the train, dev, and test sets. Further- but we still lack a comprehensive study on multi-
more, Yu et al. (2018c) propose a database split, lingual text-to-SQL, which can be challenging but
where the model does not see the databases in the useful in real-life scenarios. Finally, it is important
test set in its training time. Other splitting methods to build NLIDB for people with disabilities. Song
also exist to help different research topics (Shaw et al. (2022) propose speech-to-SQL that translates
et al., 2021; Chang et al., 2020). voice input to SQL queries, which helps visually
impaired end users. More work can be done to
5 Discussion and Future Directions address various needs from the perspective of end-
Ever since the LUNAR system (Woods et al., 1972; users, in particular, the needs from minorities.
Woods, 1973), systems for retrieving DB informa- Text-to-SQL research can also be integrated into
tion have witnessed an increasing amount of re- a larger scope of research. Application-wise, Xu
search interest and an enormous growth, especially et al. (2020) develop a question answering system
in the field of text-to-SQL in the deep learning for the database, Chen et al. (2020a) generate task-
era. With the ever-increasing model performance oriented dialogue by retrieving knowledge from the
on the WikiSQL and Spider leaderboards, one can database using the text-to-SQL model. An example
be optimistic because models are becoming more of the possible directions is to employ the text-to-
sophisticated than ever. But there are still several SQL model to query databases for fact-checking.
challenges to overcome. Research-wise, Guo et al. (2020) compare SQL
First, these sophisticated models suffer a great queries to other logical forms in semantic pars-
ing, Xie et al. (2022) include text-to-SQL as one of Stewart for proofreading and suggestions. The
the tasks to achieve a generalized semantic parsing work is funded by the Zhejiang Province Key
framework. The inter-relations between various Project 2022SDXHDX0003.
logical forms in semantic parsing can be further
studied. A generalized framework or a general-
ized model can come as the fruit for our semantic References
parsing community. Rishabh Agarwal, Chen Liang, Dale Schuurmans, and
In hindsight, the development of text-to-SQL Mohammad Norouzi. 2019. Learning to generalize
from sparse and underspecified rewards. In Pro-
has been pushed by the innovation in the general ceedings of the 36th International Conference on
ML/NLP community, such as LSTM (Hochreiter Machine Learning, ICML 2019, 9-15 June 2019,
and Schmidhuber, 1997), self-attention (Vaswani Long Beach, California, USA, volume 97 of Pro-
et al., 2017), PLMs (Devlin et al., 2019), etc. Re- ceedings of Machine Learning Research, pages 130–
140. PMLR.
cently, prompt learning has achieved decent perfor-
mance on various tasks, in particular, in the low- Núria Bertomeu, Hans Uszkoreit, Anette Frank, Hans-
resource setting (Liu et al., 2021a). Such charac- Ulrich Krieger, and Brigitte Jörg. 2006. Contextual
phenomena and thematic relations in database QA
teristics align well with the expectation of having a
dialogues: results from a Wizard-of-Oz experiment.
functional text-to-SQL model with a few training In Proceedings of the Interactive Question Answer-
samples. Some recent works already explore applying Workshop at HLT-NAACL 2006, pages 1–8, New
ing prompt learning to the text-to-SQL task (Xie York, NY, USA. Association for Computational Lin-
et al., 2022). The practical expectation for the guistics.
text-to-SQL task is to deploy the model in differ- Shikhar Bharadwaj and Shirish Shevade. 2022. Effi-
ent scenarios, requiring robustness across domains. cient constituency tree based encoding for natural
However, prompt learning struggles with being ro- language to bash translation. In Proceedings of the
2022 Conference of the North American Chapter of
bust, and the performance can be easily affected the Association for Computational Linguistics: Hu-
by the selected data. This misalignment encour- man Language Technologies, pages 3159–3168.
ages researchers to study how to employ prompt
Ben Bogin, Jonathan Berant, and Matt Gardner. 2019a.
learning in the real-world text-to-SQL task, which
Representing schema structure with graph neural
can need further understanding of the cross-domain networks for text-to-SQL parsing. In Proceedings of
challenges for text-to-SQL. the 57th Annual Meeting of the Association for Com-
Another line of research is to evaluate these so- putational Linguistics, pages 4560–4565, Florence,
Italy. Association for Computational Linguistics.
phisticated text-to-SQL systems. The typical mea-
sure is to evaluate the performance of the system Ben Bogin, Matt Gardner, and Jonathan Berant. 2019b.
on some existing datasets. As there are operational Global reasoning over database structures for text-
systems using NL input to perform tasks such as to-SQL parsing. In Proceedings of the 2019 Con-
ference on Empirical Methods in Natural Language
getting answers from database management system Processing and the 9th International Joint Confer-
or building ontologies or playing some games, the ence on Natural Language Processing (EMNLP-
performance of these systems can be measured by IJCNLP), pages 3659–3664, Hong Kong, China. As-
the diminution of the (human) time taken to get sociation for Computational Linguistics.
the searched information (Deng et al., 2021; Zhou Sridevi Bonthu, S Rama Sree, and MHM Kr-
et al., 2022). While there are context-dependent ishna Prasad. 2021. Text2PyCode: Machine transla-
text-to-SQL datasets available (Yu et al., 2019a,b), tion of natural language intent to python source code.
In International Cross-Domain Conference for Ma-
researchers can draw inspirations from other fields chine Learning and Knowledge Extraction, pages
of research (Zellers et al., 2021) to design interac- 51–60. Springer.
tive set-ups to evaluate text-to-SQL systems. Ap-
Ursin Brunner and Kurt Stockinger. 2021. Valuenet:
pendix E discusses tasks relevant to the task of
A natural language-to-SQL system that learns from
text-to-SQL. database information. In 2021 IEEE 37th Inter-
national Conference on Data Engineering (ICDE),
Acknowledgement pages 2177–2182. IEEE.
Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang
Yue Zhang is the corresponding author. We thank Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ra-
all reviewers for their insightful comments, and madan, and Milica Gašić. 2018. MultiWOZ - a
Rada Mihalcea, Siqi Shen, Winston Wu and Ian large-scale multi-domain Wizard-of-Oz dataset for
task-oriented dialogue modelling. In Proceedings of DongHyun Choi, Myeong Cheol Shin, EungGyun Kim,
the 2018 Conference on Empirical Methods in Nat- and Dong Ryeol Shin. 2021. RYANSQL: Recur-
ural Language Processing, pages 5016–5026, Brus- sively applying sketch-based slot fillings for com-
sels, Belgium. Association for Computational Lin- plex text-to-SQL in cross-domain databases. Com-
guistics. putational Linguistics, 47(2):309–332.
Ruichu Cai, Jinjie Yuan, Boyan Xu, and Zhifeng Hao. E. F. Codd. 1970. A relational model of data for large
2021. SADGA: Structure-aware dual graph aggre- shared data banks. Commun. ACM, 13(6):377–387.
gation network for text-to-SQL. Advances in Neural
Information Processing Systems, 34. Deborah A. Dahl, Madeleine Bates, Michael Brown,
William Fisher, Kate Hunicke-Smith, David Pallett,
Yitao Cai and Xiaojun Wan. 2020. IGSQL: Database Christine Pao, Alexander Rudnicky, and Elizabeth
schema interaction graph based neural model for Shriberg. 1994. Expanding the scope of the ATIS
context-dependent text-to-SQL generation. In Pro- task: The ATIS-3 corpus. In Human Language Tech-
ceedings of the 2020 Conference on Empirical Meth- nology: Proceedings of a Workshop held at Plains-
ods in Natural Language Processing (EMNLP), boro, New Jersey, March 8-11, 1994.
pages 6903–6912, Online. Association for Compu-
tational Linguistics. Naihao Deng, Shuaichen Chang, Peng Shi, Tao Yu,
and Rui Zhang. 2021. Prefix-to-SQL: Text-to-SQL
Ruisheng Cao, Lu Chen, Zhi Chen, Yanbin Zhao, generation from incomplete user questions. arXiv
Su Zhu, and Kai Yu. 2021. LGESQL: Line graph en- preprint arXiv:2109.13066.
hanced text-to-SQL model with mixed local and non-
local relations. In Proceedings of the 59th Annual Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Meeting of the Association for Computational Lin- Kristina Toutanova. 2019. BERT: Pre-training of
guistics and the 11th International Joint Conference deep bidirectional transformers for language under-
on Natural Language Processing (Volume 1: Long standing. In Proceedings of the 2019 Conference
Papers), pages 2541–2555, Online. Association for of the North American Chapter of the Association
Computational Linguistics. for Computational Linguistics: Human Language
Shuaichen Chang, Pengfei Liu, Yun Tang, Jing Huang, Technologies, Volume 1 (Long and Short Papers),
Xiaodong He, and Bowen Zhou. 2020. Zero-shot pages 4171–4186, Minneapolis, Minnesota. Associ-
text-to-SQL learning with auxiliary task. In Pro- ation for Computational Linguistics.
ceedings of the AAAI Conference on Artificial Intel- Li Dong and Mirella Lapata. 2016. Language to logi-
ligence, volume 34, pages 7488–7495. cal form with neural attention. In Proceedings of the
Chieh-Yang Chen, Pei-Hsin Wang, Shih-Chieh Chang, 54th Annual Meeting of the Association for Compu-
Da-Cheng Juan, Wei Wei, and Jia-Yu Pan. 2020a. tational Linguistics (Volume 1: Long Papers), pages
AirConcierge: Generating task-oriented dialogue 33–43, Berlin, Germany. Association for Computa-
via efficient large-scale knowledge retrieval. In tional Linguistics.
Findings of the Association for Computational Lin-
guistics: EMNLP 2020, pages 884–897, Online. As- Li Dong and Mirella Lapata. 2018. Coarse-to-fine de-
sociation for Computational Linguistics. coding for neural semantic parsing. In Proceedings
of the 56th Annual Meeting of the Association for
Sanxing Chen, Aidan San, Xiaodong Liu, and Computational Linguistics (Volume 1: Long Papers),
Yangfeng Ji. 2020b. A tale of two linkings: Dy- pages 731–742, Melbourne, Australia. Association
namically gating between schema linking and struc- for Computational Linguistics.
tural linking for text-to-SQL parsing. In Proceed-
ings COLING-2020, the 28th International Confer- Longxu Dou, Yan Gao, Mingyang Pan, Dingzirui
ence on Computational Linguistics, pages 2900– Wang, Jian-Guang Lou, Wanxiang Che, and Dechen
2912, Barcelona, Spain (Online). Association for Zhan. 2022. UniSAr: A unified structure-aware au-
Computational Linguistics. toregressive language model for text-to-SQL. ArXiv
preprint, abs/2203.07781.
Yongrui Chen, Xinnan Guo, Chaojie Wang, Jian
Qiu, Guilin Qi, Meng Wang, and Huiying Li. Ahmed Elgohary, Saghar Hosseini, and Ahmed Has-
2021a. Leveraging table content for zero-shot san Awadallah. 2020. Speak to your parser: Interac-
text-to-SQL with meta-learning. ArXiv preprint, tive text-to-SQL with natural language feedback. In
abs/2109.05395. Proceedings of the 58th Annual Meeting of the Asso-
ciation for Computational Linguistics, pages 2065–
Zhi Chen, Lu Chen, Yanbin Zhao, Ruisheng Cao, Zi- 2077, Online. Association for Computational Lin-
han Xu, Su Zhu, and Kai Yu. 2021b. ShadowGNN: guistics.
Graph projection neural network for text-to-SQL
parser. In Proceedings of the 2021 Conference of Catherine Finegan-Dollak, Jonathan K. Kummerfeld,
the North American Chapter of the Association for Li Zhang, Karthik Ramanathan, Sesh Sadasivam,
Computational Linguistics: Human Language Tech- Rui Zhang, and Dragomir Radev. 2018. Improving
nologies, pages 5567–5577, Online. Association for text-to-SQL evaluation methodology. In Proceed-
Computational Linguistics. ings of the 56th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Pa- on Empirical Methods in Natural Language Process-
pers), pages 351–360, Melbourne, Australia. Asso- ing (EMNLP), pages 1520–1540, Online. Associa-
ciation for Computational Linguistics. tion for Computational Linguistics.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Jiaqi Guo, Ziliang Si, Yu Wang, Qian Liu, Ming Fan,
Model-agnostic meta-learning for fast adaptation of Jian-Guang Lou, Zijiang Yang, and Ting Liu. 2021.
deep networks. In Proceedings of the 34th Inter- Chase: A large-scale and pragmatic Chinese dataset
national Conference on Machine Learning, ICML for cross-database context-dependent text-to-SQL.
2017, Sydney, NSW, Australia, 6-11 August 2017, In Proceedings of the 59th Annual Meeting of the
volume 70 of Proceedings of Machine Learning Re- Association for Computational Linguistics and the
search, pages 1126–1135. PMLR. 11th International Joint Conference on Natural Lan-
guage Processing (Volume 1: Long Papers), pages
Yujian Gan, Xinyun Chen, Qiuping Huang, Matthew 2316–2331, Online. Association for Computational
Purver, John R. Woodward, Jinxia Xie, and Peng- Linguistics.
sheng Huang. 2021a. Towards robustness of text-to-
SQL models against synonym substitution. In Pro- Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao,
ceedings of the 59th Annual Meeting of the Associa- Jian-Guang Lou, Ting Liu, and Dongmei Zhang.
tion for Computational Linguistics and the 11th In- 2019. Towards complex text-to-SQL in cross-
ternational Joint Conference on Natural Language domain database with intermediate representation.
Processing (Volume 1: Long Papers), pages 2505– In Proceedings of the 57th Annual Meeting of the
2515, Online. Association for Computational Lin- Association for Computational Linguistics, pages
guistics. 4524–4535, Florence, Italy. Association for Compu-
tational Linguistics.
Yujian Gan, Xinyun Chen, and Matthew Purver.
2021b. Exploring underexplored limitations of Tong Guo and Huilin Gao. 2018. Bidirectional
cross-domain text-to-SQL generalization. In Pro- attention for SQL generation. ArXiv preprint,
ceedings of the 2021 Conference on Empirical Meth- abs/1801.00076.
ods in Natural Language Processing, pages 8926–
8931, Online and Punta Cana, Dominican Republic. Tong Guo and Huilin Gao. 2019. Content en-
Association for Computational Linguistics. hanced BERT-based text-to-SQL generation. ArXiv
preprint, abs/1910.07179.
Yujian Gan, Xinyun Chen, Jinxia Xie, Matthew Purver,
John R. Woodward, John Drake, and Qiaofu Zhang. Moshe Hazoom, Vibhor Malik, and Ben Bogin. 2021.
2021c. Natural SQL: Making SQL easier to infer Text-to-SQL in the wild: A naturally-occurring
from natural language specifications. In Findings dataset based on stack exchange data. In Proceed-
of the Association for Computational Linguistics: ings of the 1st Workshop on Natural Language Pro-
EMNLP 2021, pages 2030–2042, Punta Cana, Do- cessing for Programming (NLP4Prog 2021), pages
minican Republic. Association for Computational 77–87, Online. Association for Computational Lin-
Linguistics. guistics.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Pengcheng He, Yi Mao, Kaushik Chakrabarti, and
Callison-Burch. 2013. PPDB: The paraphrase Weizhu Chen. 2019. X-SQL: reinforce schema
database. In Proceedings of the 2013 Conference of representation with context. ArXiv preprint,
the North American Chapter of the Association for abs/1908.08113.
Computational Linguistics: Human Language Tech-
nologies, pages 758–764, Atlanta, Georgia. Associa- Charles T. Hemphill, John J. Godfrey, and George R.
tion for Computational Linguistics. Doddington. 1990. The ATIS spoken language sys-
tems pilot corpus. In Speech and Natural Language:
Jonathan L Gross, Jay Yellen, and Mark Anderson. Proceedings of a Workshop Held at Hidden Valley,
2018. Graph theory and its applications. Chapman Pennsylvania, June 24-27,1990.
and Hall/CRC.
Jonathan Herzig, Peter Shaw, Ming-Wei Chang, Kelvin
Daya Guo, Yibo Sun, Duyu Tang, Nan Duan, Jian Yin, Guu, Panupong Pasupat, and Yuan Zhang. 2021. Un-
Hong Chi, James Cao, Peng Chen, and Ming Zhou. locking compositional generalization in pre-trained
2018. Question generation from SQL queries im- models using intermediate representations. ArXiv
proves neural semantic parsing. In Proceedings of preprint, abs/2104.07478.
the 2018 Conference on Empirical Methods in Nat-
ural Language Processing, pages 1597–1607, Brus- Sepp Hochreiter and Jürgen Schmidhuber. 1997.
sels, Belgium. Association for Computational Lin- Long short-term memory. Neural computation,
guistics. 9(8):1735–1780.

Jiaqi Guo, Qian Liu, Jian-Guang Lou, Zhenwen Li, Po-Sen Huang, Chenglong Wang, Rishabh Singh, Wen-
Xueqing Liu, Tao Xie, and Ting Liu. 2020. Bench- tau Yih, and Xiaodong He. 2018. Natural language
marking meaning representations in neural seman- to structured query generation via meta-learning. In
tic parsing. In Proceedings of the 2020 Conference Proceedings of the 2018 Conference of the North
American Chapter of the Association for Compu- Chia-Hsuan Lee, Oleksandr Polozov, and Matthew
tational Linguistics: Human Language Technolo- Richardson. 2021. KaggleDBQA: Realistic evalu-
gies, Volume 2 (Short Papers), pages 732–738, New ation of text-to-SQL parsers. In Proceedings of the
Orleans, Louisiana. Association for Computational 59th Annual Meeting of the Association for Compu-
Linguistics. tational Linguistics and the 11th International Joint
Conference on Natural Language Processing (Vol-
Binyuan Hui, Ruiying Geng, Qiyu Ren, Binhua Li, ume 1: Long Papers), pages 2261–2273, Online. As-
Yongbin Li, Jian Sun, Fei Huang, Luo Si, Pengfei sociation for Computational Linguistics.
Zhu, and Xiaodan Zhu. 2021a. Dynamic hybrid re-
lation network for cross-domain context-dependent Dongjun Lee. 2019. Clause-wise and recursive decod-
semantic parsing. ArXiv preprint, abs/2101.01686. ing for complex and cross-domain text-to-SQL gen-
eration. In Proceedings of the 2019 Conference on
Binyuan Hui, Ruiying Geng, Lihan Wang, Bowen Empirical Methods in Natural Language Processing
Qin, Bowen Li, Jian Sun, and Yongbin Li. 2022. and the 9th International Joint Conference on Natu-
S2 SQL: Injecting syntax to question-schema inter- ral Language Processing (EMNLP-IJCNLP), pages
action graph encoder for text-to-SQL parsers. ArXiv 6045–6051, Hong Kong, China. Association for
preprint, abs/2203.06958. Computational Linguistics.

Binyuan Hui, Xiang Shi, Ruiying Geng, Binhua Li, Wenqiang Lei, Weixin Wang, Zhixin Ma, Tian Gan,
Yongbin Li, Jian Sun, and Xiaodan Zhu. 2021b. Im- Wei Lu, Min-Yen Kan, and Tat-Seng Chua. 2020.
proving text-to-SQL with schema dependency learn- Re-examining the role of schema linking in text-to-
ing. ArXiv preprint, abs/2103.04399. SQL. In Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Process-
Wonseok Hwang, Jinyeong Yim, Seunghyun Park, and ing (EMNLP), pages 6943–6954, Online. Associa-
Minjoon Seo. 2019. A comprehensive exploration tion for Computational Linguistics.
on WikiSQL with table-aware word contextualiza-
tion. ArXiv preprint, abs/1902.01069. Mike Lewis, Yinhan Liu, Naman Goyal, Mar-
jan Ghazvininejad, Abdelrahman Mohamed, Omer
Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Levy, Veselin Stoyanov, and Luke Zettlemoyer.
Krishnamurthy, and Luke Zettlemoyer. 2017. Learn- 2020. BART: Denoising sequence-to-sequence pre-
ing a neural semantic parser from user feedback. In training for natural language generation, translation,
Proceedings of the 55th Annual Meeting of the As- and comprehension. In Proceedings of the 58th An-
sociation for Computational Linguistics (Volume 1: nual Meeting of the Association for Computational
Long Papers), pages 963–973, Vancouver, Canada. Linguistics, pages 7871–7880, Online. Association
Association for Computational Linguistics. for Computational Linguistics.

Fei Li and Hosagrahar V Jagadish. 2014. Construct-

Marcelo Archanjo José and Fabio Gagliardi Cozman.
ing an interactive natural language interface for rela-
2021. mRAT-SQL+ GAP: A Portuguese text-to-
tional databases. Proceedings of the VLDB Endow-
SQL transformer. In Brazilian Conference on Intel-
ment, 8(1):73–84.
ligent Systems, pages 511–525. Springer.
Ning Li, Bethany Keller, Mark Butler, and Daniel
Jovan Kalajdjieski, Martina Toshevska, and Frosina Cer. 2020a. SeqGenSQL–a robust sequence gener-
Stojanovska. 2020. Recent advances in SQL ation model for structured query language. ArXiv
query generation: A survey. ArXiv preprint, preprint, abs/2011.03836.
abs/2005.07667.
Yujia Li, Daniel Tarlow, Marc Brockschmidt, and
George Katsogiannis-Meimarakis and Georgia Richard S. Zemel. 2016. Gated graph sequence
Koutrika. 2021. A deep dive into deep learning neural networks. In 4th International Conference
approaches for text-to-SQL systems. In Proceed- on Learning Representations, ICLR 2016, San Juan,
ings of the 2021 International Conference on Puerto Rico, May 2-4, 2016, Conference Track Pro-
Management of Data, pages 2846–2851. ceedings.
Amol Kelkar, Rohan Relan, Vaishali Bhardwaj, Yuntao Li, Bei Chen, Qian Liu, Yan Gao, Jian-
Saurabh Vaichal, Chandra Khatri, and Peter Re- Guang Lou, Yan Zhang, and Dongmei Zhang. 2020b.
lan. 2020. Bertrand-dr: Improving text-to-SQL “what do you mean by that?” a parser-independent
using a discriminative re-ranker. ArXiv preprint, interactive approach for enhancing text-to-SQL. In
abs/2002.00557. Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP),
Yoon Kim, Carl Denton, Luong Hoang, and Alexan- pages 6913–6922, Online. Association for Computa-
der M. Rush. 2017. Structured attention networks. tional Linguistics.
In 5th International Conference on Learning Rep-
resentations, ICLR 2017, Toulon, France, April 24- Chen Liang, Mohammad Norouzi, Jonathan Berant,
26, 2017, Conference Track Proceedings. OpenRe- Quoc V. Le, and Ni Lao. 2018. Memory aug-
view.net. mented policy optimization for program synthesis
and semantic parsing. In Advances in Neural In- 9th International Joint Conference on Natural Lan-
formation Processing Systems 31: Annual Con- guage Processing (EMNLP-IJCNLP), pages 2851–
ference on Neural Information Processing Systems 2864, Hong Kong, China. Association for Computa-
2018, NeurIPS 2018, December 3-8, 2018, Mon- tional Linguistics.
tréal, Canada, pages 10015–10027.
Samuel Müller and Andreas Vlachos. 2019. Byte-
Xi Victoria Lin, Richard Socher, and Caiming Xiong. pair encoding for text-to-SQL generation. ArXiv
2020. Bridging textual and tabular data for cross- preprint, abs/1910.08962.
domain text-to-SQL semantic parsing. In Findings
of the Association for Computational Linguistics: Ansong Ni, Pengcheng Yin, and Graham Neubig. 2020.
EMNLP 2020, pages 4870–4888, Online. Associa- Merging weak and active supervision for semantic
tion for Computational Linguistics. parsing. In Proceedings of the AAAI Conference on
Artificial Intelligence, volume 34, pages 8536–8543.
Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang,
Hiroaki Hayashi, and Graham Neubig. 2021a. Pre- Peter Ochieng. 2020. Parot: Translating natural lan-
train, prompt, and predict: A systematic survey of guage to SPARQL. Expert Systems with Applica-
prompting methods in natural language processing. tions: X, 5:100024.
ArXiv preprint, abs/2107.13586.
Panupong Pasupat and Percy Liang. 2015. Compo-
Qian Liu, Dejian Yang, Jiahui Zhang, Jiaqi Guo, Bin sitional semantic parsing on semi-structured tables.
Zhou, and Jian-Guang Lou. 2021b. Awakening la- In Proceedings of the 53rd Annual Meeting of the
tent grounding from pretrained language models for Association for Computational Linguistics and the
semantic parsing. In Findings of the Association 7th International Joint Conference on Natural Lan-
for Computational Linguistics: ACL-IJCNLP 2021, guage Processing (Volume 1: Long Papers), pages
pages 1174–1189, Online. Association for Computa- 1470–1480, Beijing, China. Association for Compu-
tional Linguistics. tational Linguistics.
Qin Lyu, Kaushik Chakrabarti, Shobhit Hathi, Sou- Ana-Maria Popescu, Oren Etzioni, and Henry Kautz.
vik Kundu, Jianwen Zhang, and Zheng Chen. 2020. 2003. Towards a theory of natural language inter-
Hybrid ranking network for text-to-SQL. ArXiv faces to databases. In Proceedings of the 8th in-
preprint, abs/2008.04759. ternational conference on Intelligent user interfaces,
Jianqiang Ma, Zeyu Yan, Shuai Pang, Yang Zhang, and pages 149–157.
Jianping Shen. 2020. Mention extraction and link-
ing for SQL query generation. In Proceedings of the P. J. Price. 1990. Evaluation of spoken language sys-
2020 Conference on Empirical Methods in Natural tems: the ATIS domain. In Speech and Natural Lan-
Language Processing (EMNLP), pages 6936–6942, guage: Proceedings of a Workshop Held at Hidden
Online. Association for Computational Linguistics. Valley, Pennsylvania, June 24-27,1990.

Tanzim Mahmud, KM Azharul Hasan, Mahtab Ahmed, Karthik Radhakrishnan, Arvind Srikantan, and Xi Vic-
and Thwoi Hla Ching Chak. 2015. A rule based ap- toria Lin. 2020. ColloQL: Robust cross-domain
proach for NLP based query processing. In 2015 text-to-SQL over search queries. ArXiv preprint,
2nd International Conference on Electrical Infor- abs/2010.09927.
mation and Communication Technologies (EICT),
pages 78–82. IEEE. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine
Lee, Sharan Narang, Michael Matena, Yanqi Zhou,
Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, Wei Li, and Peter J Liu. 2019. Exploring the limits
and Richard Socher. 2018. The natural language de- of transfer learning with a unified text-to-text trans-
cathlon: Multitask learning as question answering. former. ArXiv preprint, abs/1910.10683.
ArXiv preprint, abs/1806.08730.
Ohad Rubin and Jonathan Berant. 2021. SmBoP:
Qingkai Min, Yuefeng Shi, and Yue Zhang. 2019a. A Semi-autoregressive bottom-up semantic parsing. In
pilot study for Chinese SQL semantic parsing. In Proceedings of the 2021 Conference of the North
Proceedings of the 2019 Conference on Empirical American Chapter of the Association for Computa-
Methods in Natural Language Processing and the tional Linguistics: Human Language Technologies,
9th International Joint Conference on Natural Lan- pages 311–324, Online. Association for Computa-
guage Processing (EMNLP-IJCNLP), pages 3652– tional Linguistics.
3658, Hong Kong, China. Association for Computa-
tional Linguistics. Irina Saparina and Anton Osokin. 2021. SPARQLing
database queries from intermediate question decom-
Sewon Min, Danqi Chen, Hannaneh Hajishirzi, and positions. In Proceedings of the 2021 Conference on
Luke Zettlemoyer. 2019b. A discrete hard EM ap- Empirical Methods in Natural Language Processing,
proach for weakly supervised question answering. pages 8984–8998, Online and Punta Cana, Domini-
In Proceedings of the 2019 Conference on Empirical can Republic. Association for Computational Lin-
Methods in Natural Language Processing and the guistics.
Torsten Scholak, Raymond Li, Dzmitry Bahdanau, 2012. Proceedings of a meeting held December 3-
Harm de Vries, and Chris Pal. 2021a. DuoRAT: To- 6, 2012, Lake Tahoe, Nevada, United States, pages
wards simpler text-to-SQL models. In Proceedings 2960–2968.
of the 2021 Conference of the North American Chap-
ter of the Association for Computational Linguistics: Yuanfeng Song, Raymond Chi-Wing Wong, Xuefang
Human Language Technologies, pages 1313–1321, Zhao, and Di Jiang. 2022. Speech-to-SQL: Towards
Online. Association for Computational Linguistics. speech-driven SQL query generation from natural
language question. ArXiv preprint, abs/2201.01209.
Torsten Scholak, Nathan Schucher, and Dzmitry Bah-
danau. 2021b. PICARD: Parsing incrementally for Alane Suhr, Ming-Wei Chang, Peter Shaw, and Ken-
constrained auto-regressive decoding from language ton Lee. 2020. Exploring unexplored generalization
models. In Proceedings of the 2021 Conference on challenges for cross-database semantic parsing. In
Empirical Methods in Natural Language Processing, Proceedings of the 58th Annual Meeting of the Asso-
pages 9895–9901, Online and Punta Cana, Domini- ciation for Computational Linguistics, pages 8372–
can Republic. Association for Computational Lin- 8388, Online. Association for Computational Lin-
guistics. guistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. Alane Suhr, Srinivasan Iyer, and Yoav Artzi. 2018.
2016. Neural machine translation of rare words Learning to map context-dependent sentences to ex-
with subword units. In Proceedings of the 54th An- ecutable formal queries. In Proceedings of the 2018
nual Meeting of the Association for Computational Conference of the North American Chapter of the
Linguistics (Volume 1: Long Papers), pages 1715– Association for Computational Linguistics: Human
1725, Berlin, Germany. Association for Computa- Language Technologies, Volume 1 (Long Papers),
tional Linguistics. pages 2238–2249, New Orleans, Louisiana. Associ-
ation for Computational Linguistics.
Peter Shaw, Ming-Wei Chang, Panupong Pasupat, and
Kristina Toutanova. 2021. Compositional general- Ningyuan Sun, Xuefeng Yang, and Yunfeng Liu. 2020.
ization and natural language variation: Can a se- Tableqa: a large-scale Chinese text-to-SQL dataset
mantic parsing approach handle both? In Proceed- for table-aware SQL generation. ArXiv preprint,
ings of the 59th Annual Meeting of the Association abs/2006.06434.
for Computational Linguistics and the 11th Interna-
tional Joint Conference on Natural Language Pro- Lappoon R. Tang and Raymond J. Mooney. 2000. Au-
cessing (Volume 1: Long Papers), pages 922–938, tomated construction of database interfaces: Inter-
Online. Association for Computational Linguistics. grating statistical and relational learning for seman-
tic parsing. In 2000 Joint SIGDAT Conference on
Peng Shi, Patrick Ng, Zhiguo Wang, Henghui Empirical Methods in Natural Language Process-
Zhu, Alexander Hanbo Li, Jun Wang, Cicero ing and Very Large Corpora, pages 133–141, Hong
Nogueira dos Santos, and Bing Xiang. 2020a. Kong, China. Association for Computational Lin-
Learning contextual representations for seman- guistics.
tic parsing with generation-augmented pre-training.
ArXiv preprint, abs/2012.10309. Yasufumi Taniguchi, Hiroki Nakayama, Kubo
Takahiro, and Jun Suzuki. 2021. An investiga-
Peng Shi, Tao Yu, Patrick Ng, and Zhiguo Wang. tion between schema linking and text-to-SQL
2021. End-to-end cross-domain text-to-SQL se- performance. ArXiv preprint, abs/2102.01847.
mantic parsing with auxiliary task. ArXiv preprint,
abs/2106.09588. Anh Tuan Nguyen, Mai Hoang Dao, and Dat Quoc
Nguyen. 2020. A pilot study of text-to-SQL seman-
Tianze Shi, Kedar Tatwawadi, Kaushik Chakrabarti, tic parsing for Vietnamese. In Findings of the As-
Yi Mao, Oleksandr Polozov, and Weizhu Chen. 2018. sociation for Computational Linguistics: EMNLP
IncSQL: Training incremental text-to-SQL parsers 2020, pages 4079–4085, Online. Association for
with non-deterministic oracles. ArXiv preprint, Computational Linguistics.
abs/1809.05054.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Tianze Shi, Chen Zhao, Jordan Boyd-Graber, Hal Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
Daumé III, and Lillian Lee. 2020b. On the poten- Kaiser, and Illia Polosukhin. 2017. Attention is all
tial of lexico-logical alignments for semantic pars- you need. In Advances in Neural Information Pro-
ing to SQL queries. In Findings of the Association cessing Systems 30: Annual Conference on Neural
for Computational Linguistics: EMNLP 2020, pages Information Processing Systems 2017, December 4-
1849–1864, Online. Association for Computational 9, 2017, Long Beach, CA, USA, pages 5998–6008.
Linguistics.
Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly.
Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2015. Pointer networks. In Advances in Neural
2012. Practical bayesian optimization of machine Information Processing Systems 28: Annual Con-
learning algorithms. In Advances in Neural Infor- ference on Neural Information Processing Systems
mation Processing Systems 25: 26th Annual Con- 2015, December 7-12, 2015, Montreal, Quebec,
ference on Neural Information Processing Systems Canada, pages 2692–2700.
Bailin Wang, Mirella Lapata, and Ivan Titov. 2021a. W. Woods, Ronald Kaplan, and Bonnie Webber. 1972.
Meta-learning for domain generalization in seman- The lunar sciences natural language information sys-
tic parsing. In Proceedings of the 2021 Conference tem.
of the North American Chapter of the Association
for Computational Linguistics: Human Language William A Woods. 1973. Progress in natural language
Technologies, pages 366–379, Online. Association understanding: an application to lunar geology. In
for Computational Linguistics. Proceedings of the June 4-8, 1973, national com-
puter conference and exposition, pages 441–450.
Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr
Kun Wu, Lijie Wang, Zhenghua Li, Ao Zhang, Xinyan
Polozov, and Matthew Richardson. 2020a. RAT-
Xiao, Hua Wu, Min Zhang, and Haifeng Wang.
SQL: Relation-aware schema encoding and linking
2021. Data augmentation with hierarchical SQL-
for text-to-SQL parsers. In Proceedings of the 58th
to-question generation for cross-domain text-to-SQL
Annual Meeting of the Association for Computa-
parsing. In Proceedings of the 2021 Conference on
tional Linguistics, pages 7567–7578, Online. Asso-
Empirical Methods in Natural Language Processing,
ciation for Computational Linguistics.
pages 8974–8983, Online and Punta Cana, Domini-
Bailin Wang, Ivan Titov, and Mirella Lapata. 2019. can Republic. Association for Computational Lin-
Learning semantic parsers from denotations with la- guistics.
tent structured alignments and abstract programs. In Tianbao Xie, Chen Henry Wu, Peng Shi, Ruiqi Zhong,
Proceedings of the 2019 Conference on Empirical Torsten Scholak, Michihiro Yasunaga, Chien-Sheng
Methods in Natural Language Processing and the Wu, Ming Zhong, Pengcheng Yin, Sida I Wang, et al.
9th International Joint Conference on Natural Lan- 2022. UnifiedSKG: Unifying and multi-tasking
guage Processing (EMNLP-IJCNLP), pages 3774– structured knowledge grounding with text-to-text
3785, Hong Kong, China. Association for Computa- language models. ArXiv preprint, abs/2201.05966.
tional Linguistics.
Peng Xu, Dhruv Kumar, Wei Yang, Wenjie Zi, Keyi
Bailin Wang, Wenpeng Yin, Xi Victoria Lin, and Caim- Tang, Chenyang Huang, Jackie Chi Kit Cheung, Si-
ing Xiong. 2021b. Learning to synthesize data for mon J.D. Prince, and Yanshuai Cao. 2021. Optimiz-
semantic parsing. In Proceedings of the 2021 Con- ing deeper transformers on small datasets. In Pro-
ference of the North American Chapter of the Asso- ceedings of the 59th Annual Meeting of the Associa-
ciation for Computational Linguistics: Human Lan- tion for Computational Linguistics and the 11th In-
guage Technologies, pages 2760–2766, Online. As- ternational Joint Conference on Natural Language
sociation for Computational Linguistics. Processing (Volume 1: Long Papers), pages 2089–
2102, Online. Association for Computational Lin-
Chenglong Wang, Marc Brockschmidt, and Rishabh guistics.
Singh. 2018a. Pointing out SQL queries from text.
Silei Xu, Sina Semnani, Giovanni Campagna, and
Chenglong Wang, Kedar Tatwawadi, Marc Monica Lam. 2020. AutoQA: From databases to QA
Brockschmidt, Po-Sen Huang, Yi Mao, Olek- semantic parsers with only synthetic training data.
sandr Polozov, and Rishabh Singh. 2018b. Robust In Proceedings of the 2020 Conference on Empirical
text-to-SQL generation with execution-guided Methods in Natural Language Processing (EMNLP),
decoding. ArXiv preprint, abs/1807.03100. pages 422–434, Online. Association for Computa-
tional Linguistics.
Huajie Wang, Mei Li, and Lei Chen. 2020b. PG-
GSQL: Pointer-generator network with guide de- Xiaojun Xu, Chang Liu, and Dawn Song. 2017. SQL-
coding for cross-domain context-dependent text-to- Net: Generating structured queries from natural
SQL generation. In Proceedings of COLING-2022, language without reinforcement learning. ArXiv
the 28th International Conference on Computational preprint, abs/1711.04436.
Linguistics, pages 370–380, Barcelona, Spain (On- Kuan Xuan, Yongbo Wang, Yongliang Wang, Zujie
line). Association for Computational Linguistics. Wen, and Yang Dong. 2021. Sead: End-to-end
text-to-SQL generation with schema-aware denois-
Lijie Wang, Ao Zhang, Kun Wu, Ke Sun, Zhenghua
ing. ArXiv preprint, abs/2105.07911.
Li, Hua Wu, Min Zhang, and Haifeng Wang. 2020c.
DuSQL: A large-scale and pragmatic Chinese text- Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and
to-SQL dataset. In Proceedings of the 2020 Con- Thomas Dillig. 2017. SQLizer: query synthesis
ference on Empirical Methods in Natural Language from natural language. Proceedings of the ACM on
Processing (EMNLP), pages 6923–6935, Online. As- Programming Languages, 1(OOPSLA):1–26.
sociation for Computational Linguistics.
Zeyu Yan, Jianqiang Ma, Yang Zhang, and Jianping
Ping Wang, Tian Shi, and Chandan K. Reddy. 2020d. Shen. 2020. SQL generation via machine reading
Text-to-SQL generation for question answering on comprehension. In Proceedings of COLING-2022,
electronic medical records. In WWW ’20: The Web the 28th International Conference on Computational
Conference 2020, Taipei, Taiwan, April 20-24, 2020, Linguistics, pages 350–356, Barcelona, Spain (On-
pages 350–361. ACM / IW3C2. line). Association for Computational Linguistics.
Ziyu Yao, Yu Su, Huan Sun, and Wen-tau Yih. 2019. Tao Yu, Rui Zhang, Heyang Er, Suyi Li, Eric Xue,
Model-based interactive semantic parsing: A uni- Bo Pang, Xi Victoria Lin, Yi Chern Tan, Tianze
fied framework and a text-to-SQL case study. In Shi, Zihan Li, Youxuan Jiang, Michihiro Yasunaga,
Proceedings of the 2019 Conference on Empirical Sungrok Shim, Tao Chen, Alexander Fabbri, Zifan
Methods in Natural Language Processing and the Li, Luyao Chen, Yuwen Zhang, Shreya Dixit, Vin-
9th International Joint Conference on Natural Lan- cent Zhang, Caiming Xiong, Richard Socher, Walter
guage Processing (EMNLP-IJCNLP), pages 5447– Lasecki, and Dragomir Radev. 2019a. CoSQL: A
5458, Hong Kong, China. Association for Computa- conversational text-to-SQL challenge towards cross-
tional Linguistics. domain natural language interfaces to databases. In
Proceedings of the 2019 Conference on Empirical
Ziyu Yao, Yiqi Tang, Wen-tau Yih, Huan Sun, and Methods in Natural Language Processing and the
Yu Su. 2020. An imitation game for learning se- 9th International Joint Conference on Natural Lan-
mantic parsers from user interaction. In Proceed- guage Processing (EMNLP-IJCNLP), pages 1962–
ings of the 2020 Conference on Empirical Methods 1979, Hong Kong, China. Association for Computa-
in Natural Language Processing (EMNLP), pages tional Linguistics.
6883–6902, Online. Association for Computational
Linguistics. Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga,
Dongxu Wang, Zifan Li, James Ma, Irene Li,
Xi Ye, Qiaochu Chen, Xinyu Wang, Isil Dillig, and Qingning Yao, Shanelle Roman, Zilin Zhang,
Greg Durrett. 2020. Sketch-driven regular expres- and Dragomir Radev. 2018c. Spider: A large-
sion generation from natural language and examples. scale human-labeled dataset for complex and cross-
Transactions of the Association for Computational domain semantic parsing and text-to-SQL task. In
Linguistics, 8:679–694. Proceedings of the 2018 Conference on Empirical
Methods in Natural Language Processing, pages
Pengcheng Yin and Graham Neubig. 2017. A syntactic 3911–3921, Brussels, Belgium. Association for
neural model for general-purpose code generation. Computational Linguistics.
In Proceedings of the 55th Annual Meeting of the As-
sociation for Computational Linguistics (Volume 1: Tao Yu, Rui Zhang, Michihiro Yasunaga, Yi Chern
Long Papers), pages 440–450, Vancouver, Canada. Tan, Xi Victoria Lin, Suyi Li, Heyang Er, Irene
Association for Computational Linguistics. Li, Bo Pang, Tao Chen, Emily Ji, Shreya Dixit,
David Proctor, Sungrok Shim, Jonathan Kraft, Vin-
Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Se- cent Zhang, Caiming Xiong, Richard Socher, and
bastian Riedel. 2020. TaBERT: Pretraining for joint Dragomir Radev. 2019b. SParC: Cross-domain se-
understanding of textual and tabular data. In Pro- mantic parsing in context. In Proceedings of the
ceedings of the 58th Annual Meeting of the Asso- 57th Annual Meeting of the Association for Com-
ciation for Computational Linguistics, pages 8413– putational Linguistics, pages 4511–4523, Florence,
8426, Online. Association for Computational Lin- Italy. Association for Computational Linguistics.
guistics.
John M Zelle and Raymond J Mooney. 1996. Learn-
Tao Yu, Zifan Li, Zilin Zhang, Rui Zhang, and ing to parse database queries using inductive logic
Dragomir Radev. 2018a. TypeSQL: Knowledge- programming. In Proceedings of the national con-
based type-aware neural text-to-SQL generation. In ference on artificial intelligence, pages 1050–1055.
Proceedings of the 2018 Conference of the North
American Chapter of the Association for Compu- Rowan Zellers, Ari Holtzman, Elizabeth Clark, Lianhui
tational Linguistics: Human Language Technolo- Qin, Ali Farhadi, and Yejin Choi. 2021. TuringAd-
gies, Volume 2 (Short Papers), pages 588–594, New vice: A generative and dynamic evaluation of lan-
Orleans, Louisiana. Association for Computational guage use. In Proceedings of the 2021 Conference of
Linguistics. the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
Tao Yu, Chien-Sheng Wu, Xi Victoria Lin, Bailin nologies, pages 4856–4880, Online. Association for
Wang, Yi Chern Tan, Xinyi Yang, Dragomir R. Computational Linguistics.
Radev, Richard Socher, and Caiming Xiong. 2021.
Grappa: Grammar-augmented pre-training for table Jichuan Zeng, Xi Victoria Lin, Steven C.H. Hoi,
semantic parsing. In 9th International Conference Richard Socher, Caiming Xiong, Michael Lyu, and
on Learning Representations, ICLR 2021, Virtual Irwin King. 2020. Photon: A robust cross-domain
Event, Austria, May 3-7, 2021. OpenReview.net. text-to-SQL system. In Proceedings of the 58th An-
nual Meeting of the Association for Computational
Tao Yu, Michihiro Yasunaga, Kai Yang, Rui Zhang, Linguistics: System Demonstrations, pages 204–
Dongxu Wang, Zifan Li, and Dragomir Radev. 214, Online. Association for Computational Linguis-
2018b. SyntaxSQLNet: Syntax tree networks for tics.
complex and cross-domain text-to-SQL task. In Pro-
ceedings of the 2018 Conference on Empirical Meth- Rui Zhang, Tao Yu, Heyang Er, Sungrok Shim,
ods in Natural Language Processing, pages 1653– Eric Xue, Xi Victoria Lin, Tianze Shi, Caim-
1663, Brussels, Belgium. Association for Computa- ing Xiong, Richard Socher, and Dragomir Radev.
tional Linguistics. 2019. Editing-based SQL query generation for
cross-domain context-dependent questions. In Pro- A Topology for Text-to-SQL
ceedings of the 2019 Conference on Empirical Meth-
ods in Natural Language Processing and the 9th In- Figure 5 shows the topology for the text-to-SQL
ternational Joint Conference on Natural Language task.
Processing (EMNLP-IJCNLP), pages 5338–5349,
Hong Kong, China. Association for Computational
Linguistics.
B Text-to-SQL Examples

Yusen Zhang, Xiangyu Dong, Shuaichen Chang, Tao B.1 Table and Database
Yu, Peng Shi, and Rui Zhang. 2020. Did you ask a Table 6 shows an example of the table in the
good question? a cross-domain question intention
classification benchmark for text-to-SQL. ArXiv
database for Restaurants dataset. The domain for
preprint, abs/2010.12634. this dataset is restaurant information, where ques-
tions are typically about food type, restaurant loca-
Yanzhao Zheng, Haibin Wang, Baohua Dong, Xingjun tion, etc.
Wang, and Changshan Li. 2022. HIE-SQL: His-
tory information enhanced network for context- There is a big difference in terms of how many
dependent text-to-SQL semantic parsing. ArXiv tables a database has. For restaurants, there are 3
preprint, abs/2203.07376. tables in the database, while there are 32 tables in
Ruiqi Zhong, Tao Yu, and Dan Klein. 2020a. Semantic ATIS (Suhr et al., 2020).
evaluation for text-to-SQL with distilled test suites.
In Proceedings of the 2020 Conference on Empirical B.2 Domain Knowledge
Methods in Natural Language Processing (EMNLP), Question: Will undergrads be okay to take 581 ?
pages 396–411, Online. Association for Computa-
tional Linguistics. SQL query:
SELECT DISTINCT T1.ADVISORY_REQUIREMENT ,
Victor Zhong, Mike Lewis, Sida I. Wang, and Luke T1.ENFORCED_REQUIREMENT , T1.NAME FROM
Zettlemoyer. 2020b. Grounded adaptation for zero- COURSE AS T1 WHERE T1.DEPARTMENT =
shot executable semantic parsing. In Proceedings of "EECS" AND T1.NUMBER = 581 ;
the 2020 Conference on Empirical Methods in Nat-
ural Language Processing (EMNLP), pages 6869– In Advising dataset, Department “EECS” is con-
6882, Online. Association for Computational Lin- sidered as domain knowledge where “581” in the
guistics. utterance means a course in “EECS” department
Victor Zhong, Caiming Xiong, and Richard Socher. with course number “581”.
2017. Seq2SQL: Generating structured queries
from natural language using reinforcement learning. B.3 Dataset Convention
ArXiv preprint, abs/1709.00103.
Question: Give me some restaurants in alameda ?
Jiawei Zhou, Jason Eisner, Michael Newman, Em- SQL query:
manouil Antonios Platanios, and Sam Thomson. SELECT T1.HOUSE_NUMBER ,
2022. Online semantic parsing for latency reduc- T2.NAME FROM LOCATION AS T1 , RESTAURANT
tion in task-oriented dialogue. In Proceedings of the AS T2 WHERE T1.CITY_NAME = "alameda"
60th Annual Meeting of the Association for Compu- AND T2.ID = T1.RESTAURANT_ID ;
tational Linguistics (Volume 1: Long Papers), pages
1554–1576, Dublin, Ireland. Association for Compu- In Restaurants dataset, when the user queries
tational Linguistics. “restaurants”, by dataset convention, the cor-
responding SQL query returns the column
“HOUSE_NUMBER” and “NAME”.

B.4 Text-to-SQL Templates

An example of the template for text-to-SQL pair
used by Iyer et al. (2017) is as follows:
Question template: Get all <ENT1>.<NAME>
having <ENT2>.<COL1>.<NAME> as
<ENT2>.<COL1>.<TYPE>
SQL query template:
SELECT <ENT1>.<DEF> FROM JOIN_FROM(
<ENT1>, <ENT2>) WHERE JOIN_WHERE(<ENT1>,
<ENT2>) AND
<ENT2>.<COL1> = <ENT2>.<COL1>.<TYPE> ;
ATIS; GeoQuery; Restau-
Single- rants; Scholar; Academic;
domain Yelp; IMDB; Advising; MIM-
ICSQL; ESQL(zh); SEDE

WikiSQL

Large Scale Spider; Spider-DK; SpiderUTran ;

Datasets § 2
Cross-domain Spider-L; SpiderSL ; Spider-Syn

TableQA(zh); DuSQL(zh);
ViText2SQL(vi); CSpi-
der(zh); PortugueseSpider(pt)

ATIS; Sparc; CoSQL;

Others Multi-turn Splash; Chase (zh)

Data Others TriageSQL; Squall; KaggleDBQA

Augmentation
Encode Token Types;
Encoding Graph-based; Self-attention;
Adapt PLM; Pre-training

Tree-based; Sketch-based;
Methodologies Bottom-up; Attention Mecha-
text-to-SQL Decoding
§3 nism; Copy Mechanism; Inter-
mediate Representation; Others

Active Learning; Interac-

Learning Fully
tive/Imitation Learning; Meta-
Techniques Supervised
learning; Multi-task learning
Miscellaneous
Reinforcement Learning; Meta-
Weakly
Learning; Bayesian Optimization;
supervised
Hard-EM-style Parameter Updates

Evaluations Exact string match; Exact set

§4 Metrics match; Execution accuracy

Split Example split; SQL

Methods query split; Database split

Table 5: Topology for text-to-SQL. Format adapted from Liu et al. (2021a).

CITY_NAME* COUNTY REGION paperDataset , dataset WHERE author.

authorId = writes.authorId
VARCHAR(255) VARCHAR(255) VARCHAR(255) AND writes.paperId = paper.paperId
Alameda AND paper.paperId = paperDataset.paperId
Alameda Bay Area
County AND paperDataset.datasetId = dataset.
Contra Costa datasetId AND dataset.datasetName =
Alamo Bay Area
County DATASET_TYPE ;
Alameda
Albany Bay Area
County
... ... ...

Table 6: Geography, one of the tables in Restaurants , where they populate the slots in the templates
database. * denotes the primary key of this table. We with table and column names from the database
only include 3 rows for demonstration purpose.
schema, as well as join the corresponding tables
accordingly.
Generated question: Get all author having dataset
as DATASET_TYPE An example of the PPDB (Ganitkevitch et al.,
Generated SQL query: 2013) paraphrasing is “thrown into jail” and “im-
SELECT author.authorId prisoned”. The English portion of PPDB contains
FROM author , writes , paper , over 220 million paraphrasing pairs.
B.5 Complexity of Natural Language and C.1 More Discussion on Text-to-SQL
SQL Query Pairs Datasets
In terms of the complexity for SQL queries, CSpider (Min et al., 2019a), Vi-
Finegan-Dollak et al. (2018) find that models per- Text2SQL (Tuan Nguyen et al., 2020) and José
form better on shorter SQL queries than longer and Cozman (2021) translate all the English
SQL queries, which indicates that shorter SQL questions in Spider into Chinese, Vietnamese and
queries are easier in general. Yu et al. (2018c) Portuguese, respectively. TableQA (Sun et al.,
define the SQL hardness as the number of SQL 2020) follows the data collection method from
components. The SQL query is harder when it con- WikiSQL, while DuSQL (Wang et al., 2020c)
tains more SQL keywords such as GROUP BY and follows Spider. Both TableQA and DuSQL collect
nested subqueries. Yu et al. (2018c) gives some Chinese utterance and SQL query pairs across
examples of SQL queries with different difficulty different domains. Chen et al. (2021a) propose a
levels: Chinese domain-specific dataset, ESQL.
Easy: For multi-turn context-dependent text-to-SQL
SELECT COUNT(*) benchmarks, ATIS (Price, 1990; Dahl et al.,
FROM cars_data 1994) includes user interactions with a SQL flight
WHERE cylinders > 4 ;
database in multiple turns. Sparc (Yu et al., 2019b)
Medium: takes a further step to collect multi-turn interactions
SELECT T2.name, COUNT(*) across 200 databases and 138 domains. However,
FROM concert AS T1 JOIN stadium AS T2 ON
T1.stadium_id = T2.stadium_id GROUP
both ATIS and Sparc assume all user questions can
BY T1.stadium_id ; be mapped into SQL queries and do not include
Hard: system responses. Later, inspired by task-oriented
dialogue system (Budzianowski et al., 2018), Yu
SELECT T1.country_name
FROM countries AS T1 JOIN continents AS et al. (2019a) propose CoSQL. In CoSQL, the di-
T2 ON T1.continent = T2.cont_id JOIN alogue state is tracked by SQL. CoSQL includes
car_makers AS T3 ON T1.country_id = T3.
country
three tasks of SQL-grounded dialogue state track-
WHERE T2.continent = ’Europe’ ing to generate SQL queries from user’s utterance,
GROUP BY T1.country_name system response generation from query results, and
HAVING COUNT(*) >= 3 ;
user dialogue act prediction to detect and resolve
Extra Hard: ambiguous and unanswerable questions.
SELECT AVG(life_expectancy) FROM country Besides, TriageSQL (Zhang et al., 2020) col-
WHERE name NOT IN lects unanswerable questions other than natural
(SELECT T1.name
FROM country AS T1 JOIN utterance and SQL query pairs from Spider and
country_language AS T2 WikiSQL, bringing up the challenge of distinguish-
ON T1.code = T2.country_code
WHERE T2.language = "English"
ing answerable questions from unanswerable ones
AND T2.is_official = "T") ; in text-to-SQL systems.
In terms of the complexity of natural utterance, D Encoding and Decoding Method
there is no qualitative measure of how hard the
utterance is. Intuitively, models’ performance can Table 8 and Table 9 show the encoding and decod-
decrease when faced with longer questions from ing methods that have been discussed in § 3.2 and
users. However, the information conveyed in longer § 3.3, respectively.
sentences can be more complete, while there can
be ambiguity in shorter sentences. Besides, there E Other Related Tasks
can be domain-specific phrases that confuse the Other tasks that are related to text-to-SQL in-
model in both short and long utterances (Suhr et al., clude text-to-python (Bonthu et al., 2021), text-
2020). Thus, researchers need to consider various to-shell script/bash script (Bharadwaj and She-
perspectives to determine the complexity of natural vade, 2022), text-to-regex (Ye et al., 2020), text-to-
utterance. SPARQL (Ochieng, 2020), etc. They all take natu-
ral language queries as input and output different
C Text-to-SQL Datasets
logical forms. Among these tasks, text-to-SPARQL
Table 7 lists statistics for text-to-SQL datasets. is closest to text-to-SQL as both SPARQL and SQL
Datasets #Size #DB #D #T/DB Issues addressed Sources for data
College courses,
Domain
Spider (Yu et al., 2018c) 10,181 200 138 5.1 DabaseAnswers,
generalization
WikiSQL
Domain
Spider-DK (Gan et al., 2021b) 535 10 - 4.8 Spider dev set
knowledge
Spider + 5,330
Untranslatable
SpiderUtran (Zeng et al., 2020) 15,023 200 138 5.1 untranslatable
questions
questions
Spider-L (Lei et al., 2020) 8,034 160 - 5.1 Schema linking Spider train/dev
SpiderSL (Taniguchi et al., 2021) 1,034 10 - 4.8 Schema linking Spider dev set
Spider-Syn (Gan et al., 2021a) 8,034 160 - 5.1 Robustness Spider train/dev
WikiSQL (Zhong et al., 2017) 80,654 26,521 - 1 Data size Wikipedia
WikiTableQuestions
Lexicon-level
Squall (Shi et al., 2020b) 11,468 1,679 - 1 (Pasupat and Liang,
supervision
2015)
Domain
KaggleDBQA (Lee et al., 2021) 272 8 8 2.3 Real web daabases
generalization
ATIS (Price, 1990; Dahl et al., 1994) 5,280 1 1 32 - Flight-booking
GeoQuery (Zelle and Mooney, 1996) 877 1 1 6 - US geography
Academic
Scholar (Iyer et al., 2017) 817 1 1 7 -
publications
Microsoft Academic
Academic (Li and Jagadish, 2014) 196 1 1 15 - Search (MAS)
database
Internet Movie
IMDB (Yaghmazadeh et al., 2017) 131 1 1 16 -
Database
Yelp (Yaghmazadeh et al., 2017) 128 1 1 7 - Yelp website
University of
Advising (Finegan-Dollak et al., 2018) 3,898 1 1 10 - Michigan course
information
Restaurants (Tang and Mooney, 2000)
378 1 1 3 - Restaurants
(Popescu et al., 2003)
MIMICSQL (Wang et al., 2020d) 10,000 1 1 5 - Healthcare domain
SQL template
SEDE (Hazoom et al., 2021) 12,023 1 1 29 Stack Exchange
diversity

Table 7: Summarization for text-to-SQL datasets. #Size, #DB, #D, and #T/DB represent the number of question-
SQL pairs, databases, domains, and tables per domain, respectively. We put “-” in the #D column because we do
not know how many domains are in the Spider dev set and “-” in the Issues Addressed column because there is no
specific issue addressed for the dataset. Datasets above and below the line are cross-domain and single-domain,
respectively.

can execute on database systems. Therefore, some

end-to-end models that take user queries as the in-
put and output a sequence of logical forms can be
applied to both tasks (Raffel et al., 2019). In con-
trast, methods (Xu et al., 2017) designed to take
care of SQL natures cannot be directly applied to
SPARQL, which requires carefully modification
instead.
Applied
Methods Adopted by Addressed challenges
datasets
Representing question
Encode token type TypeSQL (Yu et al., 2018a) WikiSQL
meaning
GNN (Bogin et al., 2019a) Spider
Global-GCN (Bogin et al., 2019b) Spider
IGSQL (Cai and Wan, 2020) Sparc, CoSQL
RAT-SQL (Wang et al., 2020a) Spider
Graph-based LEGSQL (Cao et al., 2021) Spider
SADGA (Cai et al., 2021) Spider
ShawdowGNN (Chen et al., 2021b) Spider (1) Representing ques-
Spider, tion and DB schemas in
S2 SQL (Hui et al., 2022)
Spider-Syn a structured way
X-SQL (He et al., 2019) WikiSQL (2) Schema linking
SQLova (Hwang et al., 2019) WikiSQL
Self-attention RAT-SQL (Wang et al., 2020a) Spider
DuoRAT (Scholak et al., 2021a) Spider
WikiSQL,
UnifiedSKG (Xie et al., 2022)
Spider
X-SQL (He et al., 2019) WikiSQL
SQLova (Hwang et al., 2019) WikiSQL
Guo and Gao (2019) WikiSQL
Adapt PLM HydraNet (Lyu et al., 2020) WikiSQL
Spider-L, Leveraging external
Liu et al. (2021b), etc data to represent ques-
SQUALL
TaBERT (Yin et al., 2020) Spider tion and DB schemas
Pre-training GraPPA (Yu et al., 2021) Spider
GAP (Shi et al., 2020a) Spider

Table 8: Methods used for encoding in text-to-SQL.

Applied
Methods Adopted by Addressed challenges
datasets
Seq2Tree (Dong and Lapata, 2016) -
Tree-based Seq2AST (Yin and Neubig, 2017) -
SyntaxSQLNet (Yu et al., 2018b) Spider
SQLNet (Xu et al., 2017) WikiSQL
Dong and Lapata (2018) WikiSQL Hierarchical decoding
Sketch-
based IRNet (Guo et al., 2019) Spider
RYANSQL (Choi et al., 2021) Spider
Bottom-up SmBop (Rubin and Berant, 2021) Spider
Seq2Tree (Dong and Lapata, 2016) -
Attention Seq2SQL (Zhong et al., 2017) WikiSQL
Attention Bi-attention Guo and Gao (2018) WikiSQL
Mechanism Structured attention Wang et al. (2019) WikiSQL
Relation-aware
DuoRAT (Scholak et al., 2021a) Spider Synthesizing informa-
Self-attention
Seq2AST (Yin and Neubig, 2017) - tion for decoding
Copy Mech- Seq2SQL (Zhong et al., 2017) WikiSQL
anism Wang et al. (2018a) WikiSQL
SeqGenSQL (Li et al., 2020a) WikiSQL
IncSQL (Shi et al., 2018) WikiSQL
IRNet (Guo et al., 2019) Spider
Spider and
Suhr et al. (2020)
others♠
Intermediate Bridging the gap be-
GeoQuery,
Representa- Herzig et al. (2021) tween natural language
ATIS, Scholar
tion and SQL query
Gan et al. (2021c) Spider
Brunner and Stockinger (2021) Spider
WikiSQL,
Constrained decod- UniSAr (Dou et al., 2022) Spide and
ing others♥
PICARD (Scholak et al., 2021b) Spider, CoSQL
Fine-grained decoding
SQLova (Hwang et al., 2019) WikiSQL
Execution-guided
Wang et al. (2018b) WikiSQL
Discriminative Global-GCN (Bogin et al., 2019b) Spider
SQL Ranking
re-ranking Kelkar et al. (2020) Spider
SQLNet (Xu et al., 2017) WikiSQL
Others Separate submodule Guo and Gao (2018) WikiSQL
Lee (2019) Spider Easier decoding
Advising, ATIS,
BPE Müller and Vlachos (2019)
GeoQuery
Synthesizing
Link gating Chen et al. (2020b) Spider information for
decoding

Table 9: Methods used for decoding in text-to-SQL. ♠ : Academic, Advising, ATIS, GeoQuery, Yelp, IMDB,
Scholar, Restaurants; ♥ : TableQA DuSQL, CoSQL, Sparc, Chase.

Download ? PDF - 1000+ SQL Interview Questions & Answers v2
No ratings yet
Download ? PDF - 1000+ SQL Interview Questions & Answers v2
1,261 pages
Recent Advances in Text to SQL
No ratings yet
Recent Advances in Text to SQL
22 pages
Llm model transform for short term trading on commodity
No ratings yet
Llm model transform for short term trading on commodity
7 pages
Dusql
No ratings yet
Dusql
13 pages
research paper
No ratings yet
research paper
32 pages
Large Language Model Enhanced Text-to-SQL Generation- A Survey
No ratings yet
Large Language Model Enhanced Text-to-SQL Generation- A Survey
18 pages
Semantic Parsing For Complex Data Retrieval: Targeting Query Plans vs. SQL For No-Code Access To Relational Databases
No ratings yet
Semantic Parsing For Complex Data Retrieval: Targeting Query Plans vs. SQL For No-Code Access To Relational Databases
17 pages
Graphix T5
No ratings yet
Graphix T5
10 pages
A Survey On Text-to-SQL Parsing: Concepts, Methods, and Future Directions
No ratings yet
A Survey On Text-to-SQL Parsing: Concepts, Methods, and Future Directions
19 pages
LLM Based Survey Text 1741015993
No ratings yet
LLM Based Survey Text 1741015993
20 pages
2406.08426v3
No ratings yet
2406.08426v3
18 pages
Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL
No ratings yet
Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL
17 pages
LLM Based TXT To SQL
No ratings yet
LLM Based TXT To SQL
18 pages
Syntax_and_Relation_Enhanced_Query_Generation_for_
No ratings yet
Syntax_and_Relation_Enhanced_Query_Generation_for_
12 pages
3583140.3583165
No ratings yet
3583140.3583165
14 pages
670e4e23bdd7d170839060aa2023.findings-emnlp.227
No ratings yet
670e4e23bdd7d170839060aa2023.findings-emnlp.227
32 pages
Few-Shot Text-to-SQL Translation Using Structure
No ratings yet
Few-Shot Text-to-SQL Translation Using Structure
28 pages
HCteam_IT_Proposal
No ratings yet
HCteam_IT_Proposal
15 pages
Solid-SQL Enhanced Schema-linking Based in-context Learning For
No ratings yet
Solid-SQL Enhanced Schema-linking Based in-context Learning For
11 pages
2303.07351v1
No ratings yet
2303.07351v1
16 pages
ChatGPT SQL
No ratings yet
ChatGPT SQL
7 pages
Seq 2 SQL
No ratings yet
Seq 2 SQL
13 pages
2405.16755v2
No ratings yet
2405.16755v2
39 pages
Enhancing Text-To-SQL Translation for Financial System Design
No ratings yet
Enhancing Text-To-SQL Translation for Financial System Design
11 pages
RATSQL
No ratings yet
RATSQL
12 pages
T2S Retrieval
No ratings yet
T2S Retrieval
16 pages
Base paper
No ratings yet
Base paper
10 pages
RESDSQL
No ratings yet
RESDSQL
9 pages
From Natural Language to SQL Review Of
No ratings yet
From Natural Language to SQL Review Of
15 pages
An Algorithm To Transform Natural Languages To SQL Queries For Relational Databases
No ratings yet
An Algorithm To Transform Natural Languages To SQL Queries For Relational Databases
7 pages
2024 Lrec-Main 539
No ratings yet
2024 Lrec-Main 539
19 pages
TEXT-to-SQL DIN-SQL
No ratings yet
TEXT-to-SQL DIN-SQL
11 pages
SQLin Big Data
No ratings yet
SQLin Big Data
7 pages
Can LLM Already Serve As A Database Interface? A Big Bench For Large-Scale Database Grounded Text-To-Sqls
No ratings yet
Can LLM Already Serve As A Database Interface? A Big Bench For Large-Scale Database Grounded Text-To-Sqls
28 pages
Docspider a Dataset of Cross Domain Natural Language Querying for Mongodb
No ratings yet
Docspider a Dataset of Cross Domain Natural Language Querying for Mongodb
32 pages
13657_Spider_2_0_Can_Language_
No ratings yet
13657_Spider_2_0_Can_Language_
45 pages
Structure-guided Large Language Models For
No ratings yet
Structure-guided Large Language Models For
24 pages
Data Democratisation with Deep Learning
No ratings yet
Data Democratisation with Deep Learning
4 pages
Natural Language To SQL Queries
No ratings yet
Natural Language To SQL Queries
17 pages
S 2.0: E L M R - W E T - SQL W - : Pider Valuating Anguage Odels On EAL Orld Nterprise EXT TO ORK Flows
No ratings yet
S 2.0: E L M R - W E T - SQL W - : Pider Valuating Anguage Odels On EAL Orld Nterprise EXT TO ORK Flows
45 pages
Paper 1
No ratings yet
Paper 1
6 pages
p1737-kim
No ratings yet
p1737-kim
14 pages
NL2SQL_handbook
No ratings yet
NL2SQL_handbook
181 pages
db gpt hub 2024
No ratings yet
db gpt hub 2024
17 pages
1 s2.0 S2352340922004152 Main
No ratings yet
1 s2.0 S2352340922004152 Main
11 pages
A Natural Language Interface To Relational Databases Using An Online Analytic Processing Hypercube
No ratings yet
A Natural Language Interface To Relational Databases Using An Online Analytic Processing Hypercube
18 pages
PS-SQL_Phrase-based_Schema-Linking_with_Pre-trained_Language_Models_for_Text-to-SQL_Parsing
No ratings yet
PS-SQL_Phrase-based_Schema-Linking_with_Pre-trained_Language_Models_for_Text-to-SQL_Parsing
5 pages
OmniSQL Synthesizing High-quality Text-To-SQL Data at Scale
No ratings yet
OmniSQL Synthesizing High-quality Text-To-SQL Data at Scale
15 pages
Natural Language To SQL in Low-Code Platforms
No ratings yet
Natural Language To SQL in Low-Code Platforms
11 pages
Data Framework
No ratings yet
Data Framework
5 pages
2409.16751v1
No ratings yet
2409.16751v1
18 pages
2.1 Review of Literature: "SQL Generation and PL/SQL Execution From Natural Language Processing"
No ratings yet
2.1 Review of Literature: "SQL Generation and PL/SQL Execution From Natural Language Processing"
11 pages
17.NL To SQL System-.Full
No ratings yet
17.NL To SQL System-.Full
6 pages
Natural Language Processing With Some Abbreviation To SQL
No ratings yet
Natural Language Processing With Some Abbreviation To SQL
5 pages
6.1 Emerging Databases
No ratings yet
6.1 Emerging Databases
18 pages
Conceptual Graphs For A Database Interface (Sowa 1976)
No ratings yet
Conceptual Graphs For A Database Interface (Sowa 1976)
22 pages
Full Text Indexes in Postgresql
No ratings yet
Full Text Indexes in Postgresql
37 pages
C: A Pragmatic Chinese Answer-to-Sequence Dataset With Large Scale and High Quality
No ratings yet
C: A Pragmatic Chinese Answer-to-Sequence Dataset With Large Scale and High Quality
16 pages
Formation of SQL From Natural Language Query Using NLP: Uma M Sneha V Sneha G
No ratings yet
Formation of SQL From Natural Language Query Using NLP: Uma M Sneha V Sneha G
5 pages
24_Data_Centric_Text_to_SQL_wi
No ratings yet
24_Data_Centric_Text_to_SQL_wi
6 pages
KSQL for Stream Processing: Definitive Reference for Developers and Engineers
From Everand
KSQL for Stream Processing: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Data Analytics - IBM
No ratings yet
Data Analytics - IBM
14 pages
Setting Up JDBC & Odbc
No ratings yet
Setting Up JDBC & Odbc
6 pages
Queries With Table
No ratings yet
Queries With Table
14 pages
Security
No ratings yet
Security
20 pages
2022-23 S4 Mid-Year Exam P1
No ratings yet
2022-23 S4 Mid-Year Exam P1
19 pages
Comprehensive Guide To WMI 1
No ratings yet
Comprehensive Guide To WMI 1
15 pages
Table
No ratings yet
Table
1 page
LO1 Write An SQL Statement To Retrieve and Sort Data
100% (1)
LO1 Write An SQL Statement To Retrieve and Sort Data
10 pages
The SQL Tutorial For Data Analysis
No ratings yet
The SQL Tutorial For Data Analysis
103 pages
Bala Bhaskar
No ratings yet
Bala Bhaskar
9 pages
Paper 2 Past Papers Topical N Yearly 2K23 Onward - Marking
No ratings yet
Paper 2 Past Papers Topical N Yearly 2K23 Onward - Marking
151 pages
Java Server Benchmarks
No ratings yet
Java Server Benchmarks
25 pages
158.337 Database Development - Massey - Exam - S1 2010
No ratings yet
158.337 Database Development - Massey - Exam - S1 2010
26 pages
Be Happy and Make Others To Be Happy - Oracle EBS R12 Purchasing, Inventory, Order Management Queries
No ratings yet
Be Happy and Make Others To Be Happy - Oracle EBS R12 Purchasing, Inventory, Order Management Queries
17 pages
CS6001-C Sharp and .Net Programming
No ratings yet
CS6001-C Sharp and .Net Programming
11 pages
Practical Questions With Answer
No ratings yet
Practical Questions With Answer
3 pages
Databse Chapter 5 - SQL-1
No ratings yet
Databse Chapter 5 - SQL-1
72 pages
Query Processing System: BREB-2016)
No ratings yet
Query Processing System: BREB-2016)
11 pages
Postgis Introduction
100% (1)
Postgis Introduction
36 pages
DBMS Lab-4
No ratings yet
DBMS Lab-4
8 pages
db2 Migrate
No ratings yet
db2 Migrate
37 pages
Application Engine - Development, Execution & Debugging
No ratings yet
Application Engine - Development, Execution & Debugging
53 pages
DBMS Lab Manual Updated
No ratings yet
DBMS Lab Manual Updated
91 pages
Relational Model Slides
No ratings yet
Relational Model Slides
30 pages
DV 210220053508
No ratings yet
DV 210220053508
30 pages
ICT Assignment 4 Bachelors
No ratings yet
ICT Assignment 4 Bachelors
4 pages
Lab 4 Muhammad Abdullah (1823-2021)
No ratings yet
Lab 4 Muhammad Abdullah (1823-2021)
11 pages
Rajat Dubey Resume
No ratings yet
Rajat Dubey Resume
1 page
Database Sever Report
No ratings yet
Database Sever Report
16 pages

Recent Advances in Text-To-SQL- A Survey of What We Have and What We Expect

Uploaded by

Recent Advances in Text-To-SQL- A Survey of What We Have and What We Expect

Uploaded by

Recent Advances in Text-to-SQL:

A Survey of What We Have and What We Expect

Naihao Deng Yulong Chen Yue Zhang

Abstract What are the major cities in the state of Kansas?

Database End User

SELECT T1.CITY_NAME FROM CITY AS T1 WHERE

Fei Li and Hosagrahar V Jagadish. 2014. Construct-

B.4 Text-to-SQL Templates

Large Scale Spider; Spider-DK; SpiderUTran ;

ATIS; Sparc; CoSQL;

Data Others TriageSQL; Squall; KaggleDBQA

Active Learning; Interac-

Evaluations Exact string match; Exact set

Split Example split; SQL

CITY_NAME* COUNTY REGION paperDataset , dataset WHERE author.

can execute on database systems. Therefore, some

Table 8: Methods used for encoding in text-to-SQL.

You might also like