Can Smaller AI Models Solve Text-to-SQL?

May 4, 2026

Author: Derek Chezzi    Editor: Satya Krishna Gorti

New research shows how Multi-Sample Critiquing improves accuracy by generating and comparing multiple SQL candidates.


Databases are the quiet infrastructure behind modern life. They’re used across nearly every domain imaginable—healthcare, education, science, finance, government, retail, transportation, business, and much more—anywhere information needs to be stored accurately and retrieved reliably. If you’ve ever searched a library catalogue, checked test results in a patient portal, tracked a delivery, managed student records, or used a business intelligence tool to explore customer revenue, marketing campaign performance, or support ticket trends, you’ve interacted with a database even if you didn’t realize it.

The Power of a Database

At their core, databases are systems for storing facts in a structured way so people and software can find the right information quickly and consistently. The real power isn’t just storage; it’s retrieval and connection: with the right question, you can pull specific details, summarize patterns, and combine information from different places to reveal relationships that aren’t obvious on the surface. That’s how a hospital spots a trend in readmissions, a city identifies traffic hotspots, a researcher finds correlations in a dataset, or a marketing team detects shifts in customer behavior.

To retrieve information and draw those insights, people have traditionally relied on specialized software. Artificial Intelligence is changing that.

Querying a Database

Retrieving information from a database presents two main challenges.

First, it requires precision. To retrieve the right information, you need to understand how the data is structured and stored.

Second, getting useful answers often requires specialized knowledge or software.

AI can help bridge that gap, but with an important caveat. Using AI to query a database means translating a natural language request into the programming language the database understands, usually Structured Query Language (SQL). Because database queries require precision, even a small error can break the request or, worse, return a result that appears correct but is actually wrong. And that is the greater risk: not a visible failure, but a convincing answer that cannot be trusted.

Using AI for Text-to-SQL

One limitation of software tools that interface with databases, such as a customer relationship management platform or library management software, is that they are often designed to answer predefined questions. When the data changes or someone needs an ad hoc query the dashboard wasn’t designed to answer, teams often hit a bottleneck and need custom SQL.

This is where text-to-SQL becomes valuable.

In simple terms, text-to-SQL is the task of translating a natural language question into a correct SQL query. The user may or may not know how to write code to query a database.

The challenge is that the AI must work out where to pull the data from, deciding which tables and columns are the right ones to use. It must then join the data from the various tables correctly, and it must apply the right filters, grouping, time windows, and aggregations.

Let’s look at a simple example.

Imagine you’re a marketing manager working in a subscription business, and you want to answer a question that isn’t already in the revenue dashboard: “Which marketing channel drove the highest first-month revenue from new customers in February, and what was the average order value?”

For a human analyst this request is simple to understand but might be time-consuming to answer. You could be pulling the data from multiple sources (i.e., tables with data about customers, orders, and marketing attribution), then defining what “new customer” means, applying the right time windows, and calculating totals and averages. For an AI system, this is not a one-step process. It must identify the right tables and columns, join them correctly, and apply filters and aggregations in exactly the right way; otherwise the answer can look plausible while being wrong.
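
To make that concrete, here is the kind of SQL a correct answer might require. This is a minimal sketch, assuming a hypothetical SQLite schema with customers, orders, and attribution tables and assuming the question refers to February 2026; every name below is invented for illustration.

# Hypothetical SQL for the marketing question above (assumed schema:
# customers(id, signup_date), orders(customer_id, order_date, amount),
# attribution(customer_id, channel)). SQLite-style date arithmetic.
candidate_sql = """
SELECT a.channel,
       SUM(o.amount) AS first_month_revenue,
       AVG(o.amount) AS avg_order_value
FROM customers c
JOIN orders o      ON o.customer_id = c.id
JOIN attribution a ON a.customer_id = c.id
WHERE c.signup_date >= '2026-02-01'                     -- "new customer": signed up in February
  AND c.signup_date <  '2026-03-01'
  AND o.order_date  <  DATE(c.signup_date, '+1 month')  -- first-month revenue window
GROUP BY a.channel
ORDER BY first_month_revenue DESC;
"""

Every choice here, from the join keys to the signup window to the first-month cutoff, is a place where a generated query can be valid yet wrong.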

This example demonstrates the inherent trickiness of text-to-SQL and highlights four challenges in using AI for this task.

First is ambiguity in the language. Users may include words like “recent,” “best,” or “active” in their prompts, and the AI must interpret the intent and find the relevant data.

Second is schema complexity. The database structure may be vast, with many tables to sift through; the AI must identify the correct ones and then join the data from different tables properly.

Third is accuracy. Some queries may return results that “look right” but are in fact wrong. A process must be in place to fact-check against these silent errors.

And fourth, edge cases. Naming conventions can differ from table to table, tables can be missing, and values within tables can be inconsistent.

The Trade-offs in Today’s Text-to-SQL Approaches

Many of the strongest recent text-to-SQL systems rely on large closed-source models, such as GPT-4, combined with task decomposition.

One influential example is MAC-SQL¹, which breaks the task into stages handled by components such as a Selector, Decomposer, and Refiner. This approach performs very well on several SQL generation benchmarks.

But relying on a large, API-based closed-source model for text-to-SQL comes with trade-offs.

Cost: Large models are more expensive to run, and breaking the task into smaller steps increases the number of model calls, driving costs even higher.

Latency: More model calls also mean slower response times.

Transparency: Closed-source models limit visibility and control.

Adaptability: Restrictions on fine-tuning and modification make it harder to adapt these models to new tasks.

Privacy: Sending data through an API can introduce privacy and security concerns.

Smaller open-source models can help address these drawbacks.

The challenge is that methods built on smaller open-source models, especially those under 10 billion parameters, have often shown promising results but still lag behind larger models in performance. Our solution, Multi-Sample Critiquing (MSc-SQL), addresses these trade-offs by using small open-source models while still achieving accuracy that is competitive with several larger closed-source models on key benchmarks.

The Principle Behind Multi-Sample Critiquing

The core idea behind MSc-SQL is simple: don’t bet everything on a single generated query.

As noted, many high-performing text-to-SQL systems improve accuracy by breaking the task into multiple steps. But when those systems rely on large models, that often means more model calls, higher latency, and greater cost. MSc-SQL also uses a multi-stage pipeline but adds a different form of test-time computation through multi-sample critiquing. Instead of relying on one generated query, it produces multiple SQL candidates and compares them to select the best one. This adds some inference overhead relative to a simple baseline, but because the work is done with smaller open-source models, the system can still be far more compute-efficient than larger-model approaches while also offering greater transparency, control, and privacy.

This differs from one-shot methods such as Chain-of-Thought prompting, where extra computation happens within a single response. That can improve performance, but it still leaves the system dependent on one candidate query. MSc-SQL spreads that computation across multiple candidates, increasing the odds that at least one is correct and giving the system a stronger basis for selecting the best result.
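
At a high level, the pipeline can be sketched as the following loop. This is an illustrative outline only, not the paper's implementation; generate_sql, try_execute, and choose stand in for the modules described in the sections below.

# Illustrative outline of multi-sample critiquing (hypothetical interfaces).
def msc_sql(question, schema, generators, critic, db):
    candidates = []
    for model in generators:                        # small open-source models
        sql = model.generate_sql(question, schema)  # 1. sample a SQL candidate
        result = db.try_execute(sql)                # 2. run it (rows or an error)
        candidates.append((sql, result))
    best = critic.choose(question, schema, candidates)  # 3. compare side by side
    return candidates[best][0]                      # the selected SQL query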

A Closer Look at the Challenges of Text-to-SQL

As noted earlier, translating natural language into SQL is difficult for several reasons.

First, people can ask the same question in different ways.

If the underlying intent is to identify the “top 10 customers by revenue in Q4 2025, in Canada, excluding refunds,” that request could be phrased in any of these ways (all of which map to the single query sketched after the list):

  1. “Who were our top 10 customers by revenue in Q4 2025 in Canada, not counting refunded orders?”
  2. “Show me the 10 highest-grossing Canadian customers for Oct–Dec 2025, excluding refunds or returns.”
  3. “Rank customers in Canada by net sales for Q4 2025 and list the top ten.”
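
All three phrasings should resolve to the same query. A hypothetical sketch, assuming customers and orders tables with country, order_date, order_status, and amount columns (all names invented for illustration):

# One SQL target for all three phrasings above (hypothetical schema).
target_sql = """
SELECT c.name,
       SUM(o.amount) AS net_revenue
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE c.country = 'CA'                                    -- Canada
  AND o.order_date BETWEEN '2025-10-01' AND '2025-12-31'  -- Q4 2025
  AND o.order_status <> 'Refunded'                        -- exclude refunds
GROUP BY c.id, c.name
ORDER BY net_revenue DESC
LIMIT 10;
"""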

Second, database schemas often involve complex relationships.

Take the question: “Which marketing channel produced the highest first-month revenue from new customers in February?”

A human analyst may understand which tables connect, how they should be joined, and what counts as a “new customer.” An AI system needs to infer all of that. One incorrect join can silently double-count revenue or assign it to the wrong channel.
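
For instance, if the attribution table stores one row per marketing touchpoint rather than one row per customer, a seemingly natural join inflates the totals. A hypothetical illustration:

# Hypothetical silent double-count: if attribution holds two rows for a
# customer (say, 'email' and 'search'), every one of their orders joins
# twice, so revenue is counted toward both channels and the grand total
# is inflated. The query still runs without any error.
double_count_sql = """
SELECT a.channel, SUM(o.amount) AS revenue
FROM orders o
JOIN attribution a ON a.customer_id = o.customer_id
GROUP BY a.channel;
"""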

Third, the system must generate SQL that is not only valid but correct.

If a user asks, “What was our net revenue last month by channel?”, the goal is not simply to produce SQL that runs. It is to produce SQL that runs and returns a truthful answer.
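
Both of the queries below would execute without complaint, but only the second reflects net revenue. A hypothetical contrast, assuming an orders table that records channel, amount, and refunded_amount (the last-month date filter is omitted for brevity):

# Valid SQL is not the same as correct SQL (hypothetical schema).
valid_but_wrong = "SELECT channel, SUM(amount) FROM orders GROUP BY channel;"                    # gross revenue
valid_and_right = "SELECT channel, SUM(amount - refunded_amount) FROM orders GROUP BY channel;"  # net revenue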

How MSc-SQL Works

To address these challenges, we divide the task into three distinct modules: the Schema Linking Module, the SQL Generation Module, and the Multi-Sample Critiquing Module. Together they form the blueprint for the pipeline shown in the figure below.

Schema Linking: Narrow the Search Space

The first step in the pipeline is to narrow the search space.

Schema linking is the process of identifying which tables and columns are most likely to be relevant to the question, using the question itself, the database schema, and supporting metadata. This helps later stages focus on the most relevant parts of the database rather than reasoning across every available table. It is similar to the Selector component in MAC-SQL.
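
A minimal sketch of what this step's input and output might look like, assuming a generic llm.complete interface; the module in the paper is a fine-tuned model, so treat this only as an illustration of the contract:

# Illustrative schema linking: ask a small model to predict which tables
# and columns are relevant, given the question and the full schema.
def link_schema(llm, question, schema_description):
    prompt = (
        "Database schema:\n" + schema_description + "\n\n"
        "Question: " + question + "\n"
        "List only the tables and columns needed to answer the question."
    )
    return llm.complete(prompt)  # e.g. "customers(id, signup_date); orders(...)"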

Generate Multiple SQL Candidates

The next module produces two or three candidate queries, using the tables and columns predicted in the first module, sometimes with different small language models. In our experiments, we used Mistral, Llama, and Gemma models. By generating a diverse set of queries across models, the system increases the odds that at least one will be correct.

Generating a valid SQL query, while conceptually simple, can be tricky. It requires knowing how values are formatted in the database. To help with that, the system retrieves contextual few-shot examples that the model can use when generating the query. For string columns, it pulls examples closest in meaning to the natural-language question; for other column types, it samples example values differently. These examples act as hints, helping the model judge which values are relevant and how they may be represented. For example, a database might store a value as “CA” rather than “California”, use “Cust_ID” instead of “CustomerID”, or label a status as “Pending” rather than “In Progress”.
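
A sketch of how that retrieval might work, assuming an embed function that maps a string to a vector (dot products approximate cosine similarity on normalized embeddings) and a small pre-collected sample of values per column; the names and data layout are invented for illustration:

import numpy as np

def value_hints(question, column_samples, embed, k=3):
    """Pick up to k example values per column to include as few-shot hints.
    column_samples: {column_name: (dtype, [sampled values])} (hypothetical)."""
    q_vec = embed(question)
    hints = {}
    for col, (dtype, values) in column_samples.items():
        if dtype == "string":
            # String columns: choose values semantically closest to the question.
            sims = [float(np.dot(q_vec, embed(v))) for v in values]
            top = np.argsort(sims)[::-1][:k]
            hints[col] = [values[i] for i in top]
        else:
            # Other column types: just show a few sampled values as-is.
            hints[col] = values[:k]
    return hints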

The schema linking step tends to maximize recall and can therefore predict more tables than are necessary. To make the pipeline robust to this noise, the SQL generation model is fine-tuned to discard unnecessary or irrelevant tables. Through this fine-tuning, the model learns to generate syntactically correct and semantically valid SQL queries, improving accuracy.

Evaluating the Output: Multi-Sample Critiquing

In the final module, we introduce sample critiquing, the key innovation in this process. At this stage, a separate module evaluates the generated candidates. This step is shown in the image below.

The critic compares the sample outputs side by side and selects the result judged to be the most accurate.

In this module, the model learns to make an informed critique using richer contextual information. We provide it with the question, the schema, and the generated SQL queries. We also provide the output of each SQL query, along with any resulting error messages, to support its evaluation.

As a baseline, we consider a basic critiquing model: \(f_{\text{isc}}: \left(q, \mathcal{S}_q, (s_i, r_i), \mathcal{M}_{\text{sc}} \right) \to \left[0, 1\right] \).

This function scores an individual SQL candidate, estimating the likelihood that it is correct, and gives us a baseline for comparison when evaluating our Multi-Sample Critiquing method. In this formula, \(q\) is the natural language question; \(\mathcal{S}_q\) is the set of schema tables predicted to be needed to answer the question; \(s_i\) and \(r_i\) are a single candidate SQL query and its execution result; and \(\mathcal{M}_{\text{sc}}\) denotes additional metadata needed at this stage.

Once we establish a baseline measurement, we then apply the MSc process to compare the samples. Rather than scoring each candidate independently for its likelihood of being correct, MSc-SQL critiques multiple samples together and selects the best one through side-by-side comparison as expressed in the following formula. This gives the model more context than independent sample critiquing, allowing it to detect subtle differences that can make one query more accurate than another. We define our multi-sample critiquing model as: \(f_{\text{msc}}: \left(q, \mathcal{S}_q, \{ (s_i, r_i) \}_{i=1}^{n}, \mathcal{M}_{\text{sc}} \right) \to \{1, \cdots, n \} \).

This formula explains how our method uses multi-sample comparison to identify a single best answer based on what the model determines to be the most accurate response to the query among all samples reviewed.
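
In code, the difference between the two critics is the shape of the input and the output. A minimal sketch, assuming a generic llm.complete interface that replies with exactly the number asked for (prompt wording invented for illustration):

# Independent critic f_isc: score one (sql, result) pair in [0, 1].
def critique_single(llm, question, schema, sql, result):
    prompt = (f"Question: {question}\nSchema: {schema}\n"
              f"SQL: {sql}\nExecution result: {result}\n"
              "How likely is this query to be correct? Answer with a number in [0, 1].")
    return float(llm.complete(prompt))

# Multi-sample critic f_msc: see all candidates at once, return the winner's index.
def critique_multi(llm, question, schema, candidates):
    listing = "\n".join(f"[{i + 1}] SQL: {s}\n    Result: {r}"
                        for i, (s, r) in enumerate(candidates))
    prompt = (f"Question: {question}\nSchema: {schema}\n"
              f"Candidate queries and their results:\n{listing}\n"
              "Which candidate best answers the question? Reply with its number.")
    return int(llm.complete(prompt)) - 1  # 0-based index into candidates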

How We Evaluated the Approach

We ran a series of experiments to identify the process that delivers the greatest accuracy and efficiency with small models.

Our test variables included:

  • the number of samples used in critiquing;
  • the effect of model diversity on accuracy;
  • varying model temperature (or randomness); and
  • using samples generated by multiple generation models and comparing the results with those from a single generation model.

The key metrics we measured included:

  • Execution Accuracy, to verify results against expected outcomes and assess practical usability (a sketch of this check follows the list);
  • Exact Match, to assess the syntactic precision of the SQL query; and
  • Valid Efficiency Score, to evaluate both accuracy and computational efficiency.
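
Execution Accuracy is the most consequential of these in practice: a predicted query counts as correct only if running it returns the same rows as running the gold query. A minimal sketch of that check, assuming a SQLite database and ignoring row order by comparing multisets:

import sqlite3
from collections import Counter

def execution_match(db_path, predicted_sql, gold_sql):
    """True if both queries return the same multiset of rows (order-insensitive)."""
    conn = sqlite3.connect(db_path)
    try:
        pred = conn.execute(predicted_sql).fetchall()
        gold = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute cannot match
    finally:
        conn.close()
    return Counter(pred) == Counter(gold)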

Key Findings

Our experiments revealed several key insights:

  • Multi-Sample Critiquing using samples generated by diverse generation models increases the overall accuracy of the text-to-SQL pipeline.
  • Training different models from random initializations increases the likelihood of generating a correct query compared with using samples from a single generation model.
  • A diverse set of generated SQL candidates, especially from multiple models, is more beneficial than sending duplicate queries from the same model through the pipeline.
  • With smaller models, Multi-Sample Critiquing performs better on simple queries, while larger models such as GPT-4 perform better on moderate and challenging queries.
  • Compared with other selection methods—specifically the Llama-8B reward model from the RewardBench² leaderboard, fine-tuned Llama-8B, and a self-consistency approach—our Multi-Sample Critiquing method outperforms them by a wide margin.

Why This Matters

If text-to-SQL works reliably, it changes who can access and use data inside an organization. Instead of routing every custom question through a data analyst or settling for whatever a dashboard happens to show, more people can ask precise questions in plain language and get answers quickly. That has real operational impact: faster decisions, fewer bottlenecks, and less “analysis ping-pong” between business teams and data teams.

Because this research focuses on improving the performance of smaller, open models, it also points toward deployments that are cheaper, more customizable, and easier to run in privacy-sensitive environments, without sending schemas and data questions to an external closed API.

Key Takeaway

The core takeaway is simple: turning language into SQL is not just about writing code that looks right. It is about producing a query that behaves correctly when run against real business data. Our paper’s key distinction is not to bet everything on a single generated query. By producing multiple candidates, executing them, and using a critic to select the best one, Multi-Sample Critiquing improves the odds that the answer matches the user’s intent, especially for messy, real-world business intelligence questions such as attributing first-month revenue to marketing channels for new customers. In other words, it is a practical step toward making “ask your data a question” work for more people, with stronger safeguards than a closed-API workflow and at lower cost.

Citation

@inproceedings{gorti2025msc,
  title={MSc-SQL: Multi-Sample Critiquing Small Language Models For Text-To-SQL Translation},
  author={Gorti, Satya Krishna and Gofman, Ilan and Liu, Zhaoyan and Wu, Jiapeng and Vouitsis, No{\"e}l and Yu, Guangwei and Cresswell, Jesse C and Hosseinzadeh, Rasa},
  booktitle={Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)},
  pages={2145--2160},
  year={2025}
}

References

  1. Wang, Bing, et al. “MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL.” Proceedings of the 31st International Conference on Computational Linguistics. 2025.
  2. Lambert, Nathan, et al. “RewardBench: Evaluating Reward Models for Language Modeling.” Findings of the Association for Computational Linguistics: NAACL 2025.