What if we could simply talk to a database? Non-technical users could then extract whatever insights they need without writing a line of SQL.
This project presents a schema-aware Text-to-SQL (T2S) system that translates natural language questions into executable SQL queries. The system is architected as a robust FastAPI service that leverages multiple Large Language Models (LLMs) for high accuracy and is optimized for efficient, cost-effective local inference.
The core methodology achieves 81% execution accuracy on the complex, multi-table Spider benchmark. This work bridges the gap between theoretical research and practical deployment by creating a system that is not only accurate but also scalable and user-friendly.
- High-Accuracy SQL Generation: Utilizes a fine-tuned SQLCoder-7B model, building on foundational research with T5 and BART models, to generate syntactically and semantically correct SQL.
- Dynamic Semantic Schema Matching: Employs Sentence-BERT (SBERT) to intelligently match a user's question to the most relevant database schema from over 35 candidates, achieving over 90% accuracy in schema selection.
- Multi-Layer Validation Pipeline: Ensures query reliability through a comprehensive validation process, including initial syntax checks (`sqlparse`), schema compliance, and a final execution check against a SQLite engine to verify correctness.
- Optimized for Local Inference: Uses GGUF quantization to run the 7B-parameter model efficiently on local hardware, drastically reducing operational costs compared to API-based solutions.
- Production-Grade API: Built with FastAPI, the system exposes clean, RESTful endpoints for schema matching, SQL generation, and query execution, with robust error handling and logging (see the sketch after this list).
- Proven on Complex Benchmarks: The system's effectiveness was validated on the WikiSQL and Spider datasets, demonstrating its ability to handle both simple and complex multi-table database structures.
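The snippet below is a minimal FastAPI sketch of what a generation endpoint could look like. The route name, request/response fields, and the placeholder helpers are illustrative assumptions, not the project's exact API; the real pipeline stages are described in the next section.

```python
# Minimal FastAPI sketch of a Text-to-SQL endpoint (illustrative only).
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="Text-to-SQL Service")

class QueryRequest(BaseModel):
    question: str

class QueryResponse(BaseModel):
    schema_name: str
    sql: str

def match_schema(question: str) -> str:
    # Placeholder for the SBERT schema-matching stage.
    return "demo_schema"

def generate_sql(question: str, schema_name: str) -> str:
    # Placeholder for the SQLCoder-7B generation stage.
    return "SELECT 1;"

@app.post("/generate-sql", response_model=QueryResponse)
def generate_sql_endpoint(req: QueryRequest) -> QueryResponse:
    try:
        schema_name = match_schema(req.question)
        sql = generate_sql(req.question, schema_name)
    except ValueError as exc:
        # Surface pipeline failures as a structured HTTP error.
        raise HTTPException(status_code=422, detail=str(exc)) from exc
    return QueryResponse(schema_name=schema_name, sql=sql)
```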
The system employs a modular pipeline that processes a natural language question through several stages to ensure the final SQL query is accurate, relevant, and executable.
- Semantic Schema Matching: The user's question is first encoded into a dense vector embedding using Sentence-BERT (`all-MiniLM-L6-v2`). This embedding is compared against pre-computed embeddings of all available database schemas using cosine similarity to find the most relevant one (see the first sketch after this list).
- Schema-Aware Prompt Engineering: The identified schema is formatted using custom XML-like tags (`<tab>`, `<col>`) and concatenated with the original question. This technique provides critical structural context to the LLM, reducing errors and hallucinations.
- LLM-Based SQL Generation: The structured prompt is sent to a locally deployed SQLCoder-7B (GGUF) model, which generates the SQL query. The decoding strategy uses beam search to enhance syntactic accuracy (a prompt-and-generation sketch also follows this list).
- Validation and Execution: The generated SQL undergoes a final check for syntactic validity and is then executed against a live SQLite database to confirm it runs without errors and produces a valid result (the last sketch below illustrates this step).
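A minimal sketch of the schema-matching step, using the Sentence-Transformers model named above. The schema catalogue contents and the flattened text format are illustrative assumptions; in practice the embeddings would be pre-computed and cached.

```python
# Sketch: pick the most relevant schema for a question via SBERT + cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical schema catalogue: name -> flattened textual description.
schemas = {
    "concerts": "table stadium: id, name, capacity; table concert: id, name, stadium_id, year",
    "students": "table student: id, name, age; table enrollment: student_id, course_id",
}

schema_names = list(schemas.keys())
schema_embeddings = model.encode(list(schemas.values()), convert_to_tensor=True)

def match_schema(question: str) -> str:
    """Return the name of the schema most similar to the question."""
    q_emb = model.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, schema_embeddings)[0]
    return schema_names[int(scores.argmax())]

print(match_schema("How many concerts were held in each stadium in 2014?"))
```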
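The next sketch shows how the `<tab>`/`<col>` prompt could be assembled and passed to a GGUF model through llama-cpp-python. The prompt template, model path, and generation settings are assumptions; greedy decoding (temperature 0) stands in here for the beam-search strategy described above, which llama-cpp-python does not expose directly.

```python
# Sketch: schema-aware prompt construction + local SQLCoder-7B (GGUF) generation.
from llama_cpp import Llama

def build_prompt(question: str, schema: dict[str, list[str]]) -> str:
    """Wrap the matched schema in <tab>/<col> tags and append the question."""
    parts = []
    for table, columns in schema.items():
        cols = "".join(f"<col>{c}</col>" for c in columns)
        parts.append(f"<tab>{table}{cols}</tab>")
    return f"### Schema:\n{''.join(parts)}\n### Question:\n{question}\n### SQL:\n"

# Hypothetical local model path; any SQLCoder-7B GGUF quantization would work the same way.
llm = Llama(model_path="models/sqlcoder-7b.Q4_K_M.gguf", n_ctx=2048)

def generate_sql(question: str, schema: dict[str, list[str]]) -> str:
    prompt = build_prompt(question, schema)
    out = llm(prompt, max_tokens=256, temperature=0.0, stop=[";"])
    return out["choices"][0]["text"].strip() + ";"
```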
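Finally, a simplified sketch of the validation-and-execution layer: a syntax check with `sqlparse` followed by a trial run against SQLite. The SELECT-only restriction and the error handling are illustrative simplifications; the schema-compliance check mentioned above is omitted here.

```python
# Sketch: validate generated SQL syntactically, then execute it against SQLite.
import sqlite3
import sqlparse

def validate_and_execute(sql: str, db_path: str) -> list[tuple]:
    parsed = sqlparse.parse(sql)
    if not parsed or parsed[0].get_type() != "SELECT":
        raise ValueError("Generated statement is empty or not a SELECT query")
    with sqlite3.connect(db_path) as conn:
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            raise ValueError(f"Query failed against the database: {exc}") from exc
```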
- Execution Accuracy (Spider): 81%
- Exact Match Accuracy (WikiSQL): 70.16% (using CodeT5-base)
- Schema Selection Accuracy: >90%
| Category | Technologies |
|---|---|
| Backend | FastAPI, Python 3.11+, SQLAlchemy |
| ML/NLP | SQLCoder-7B (GGUF), Sentence-Transformers, T5, BART, llama-cpp-python, Hugging Face Transformers |
| Database | SQLite, PostgreSQL |
| Tools & DevOps | sqlparse, NLTK, Faker (for synthetic data), GGUF Quantization |