License: CC BY 4.0
arXiv:2604.21413v1 [cs.DB] 23 Apr 2026

An Alternate Agentic AI Architecture
(It’s About the Data)

Fabian Wenz1,2,3 Felix Treutwein5 Kai Arenja1 Çağatay Demiralp3,4 Michael Stonebraker3
1TUM 2TU-Darmstadt 3MIT 4AWS AI Labs 5Landeshauptstadt München
fab_wenz@mit.edu, kai.arenja@tum.de, felix.treutwein@muenchen.de,
{cagatay, stonebraker}@csail.mit.edu
Abstract

For the last several years, the dominant narrative in "agentic AI" has been that large language models should orchestrate information access by dynamically selecting tools, issuing sub-queries, and synthesizing results. We argue this approach is misguided: enterprises do not suffer from a reasoning deficit, but from a data integration problem.

Enterprises are data-centric: critical information is scattered across heterogeneous systems (e.g., databases, documents, and external services), each with its own query language, schema, access controls, and performance constraints. In contrast, contemporary LLM-based architectures are optimized for reasoning over unstructured text and treat enterprise systems as either corpora or external tools invoked by a black-box component. This creates a mismatch between schema-rich, governed, performance-critical data systems and text-centric, probabilistic LLM architectures, leading to limited transparency, weak correctness guarantees, and unpredictable performance.

In this paper, we present RUBICON, an alternative architecture grounded in data management principles. Instead of delegating orchestration to an opaque agent, we introduce AQL (Agentic Query Language), a small, explicit query algebra - Find, From, and Where - executed through source-specific wrappers that enforce access control, schema alignment, and result normalization. All intermediate results are visible and inspectable. Complex questions are decomposed into structured, auditable query plans rather than hidden chains of LLM calls.

Our thesis is simple: enterprise AI is not a prompt engineering problem; it is a systems problem. By reintroducing explicit query structure, wrapper-based mediation, and cost-based optimization, we obtain the breadth of agentic search while preserving traceability, determinism, and trust in enterprise environments.

1 Introduction

Agentic AI is gaining widespread traction. In effect, Large Language Models (LLMs) are not good enough by themselves for deployment in many enterprise applications. As such, they are “enriched” by adding modules to an LLM that can then query text repositories or databases as well as add control flow if necessary. In these enriched systems, the LLM remains the component that performs the reasoning and determines which added modules to invoke. The resulting application is thereby a workflow of modules under LLM control.

Contemporary wisdom suggests that an LLM be responsible for parsing Natural Language (NL) utterances from a user and playing the role of workflow coordinator. If possible, an external data source is ingested into the agentic AI environment, and enriched data is made available to the LLM through Retrieval-Augmented Generation (RAG) or through prompt engineering. We will term this architecture “LLM-centric” because of the major role that an LLM plays in the solution.

In enterprise applications, we believe this approach is seriously flawed for several reasons. First, enterprise data sources are rarely in “the pile” and LLMs cannot train on them readily. There is an adage that an LLM will be unable to retrieve any stored data that it has not seen multiple times before. Hence, an LLM does not have an advantage on enterprise data.

Second, enterprises store mission critical data in specialized data stores, for example SAP for financial data, Salesforce for customer data, and data warehouses for historical data. It is unlikely that such data can be imported into an LLM for size and security reasons. In this case, Agentic AI enthusiasts suggest querying such data stores and then moving the result as text into an LLM. Such an approach turns enriched structured data into text, thereby losing all the structure in such records.

Third, there is a thorny problem of access control. Databases often deal with sensitive information, and allowed access is usually role-specific (e.g., restricted to members of the accounting department) or user-specific (e.g., a user can only see their own salary). Current LLMs are oblivious to such restrictions.

Fourth, the application is often trying to integrate data from multiple sources. For example, to find the professors at a university who have a Wikipedia page, one good approach would be to query an institutional data warehouse for the relevant faculty records and then join the result to Wikipedia metadata. This join is naturally expressed in SQL, but posing such a query directly to an LLM will typically generate an uninteresting result. It should also be noted that query optimization technology for databases is very mature and provides a good foundation for optimizing such interactions. It is not clear how such optimization would work in an LLM-based system.
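As a hedged illustration of the join described above, the following sketch uses Python's built-in sqlite3 module. The table names (faculty, wiki_page), their columns, and the sample rows are hypothetical stand-ins for the warehouse and for Wikipedia metadata. Matching on exact name equality is a simplifying assumption.

```python
import sqlite3

# In-memory database with two hypothetical tables standing in for the two sources.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE faculty (name TEXT, department TEXT);   -- hypothetical warehouse table
    CREATE TABLE wiki_page (title TEXT, url TEXT);       -- hypothetical Wikipedia metadata
    INSERT INTO faculty VALUES ('Ada Example', 'CS'), ('Bob Sample', 'EE');
    INSERT INTO wiki_page VALUES ('Ada Example', 'https://example.org/wiki/Ada_Example');
""")

def professors_with_wiki_page(connection):
    """Return (name, url) for every faculty row that has a matching page title."""
    return connection.execute("""
        SELECT f.name, w.url
        FROM faculty AS f
        JOIN wiki_page AS w ON w.title = f.name
        ORDER BY f.name
    """).fetchall()

result = professors_with_wiki_page(conn)
```

Only faculty members with a matching page survive the inner join, and the database engine, not a language model, decides how to evaluate it.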

Of course, one or more of these data sources can be an LLM or a text repository. LLM output can be treated as a vector or processed with text indexing such as Apache Lucene. In either case, it is straightforward to access such results using database technology.

Lastly, an LLM-centric architecture will have to employ “text-to-SQL” or “text-to-module-interface” to connect to enterprise data sources. Text-to-SQL has been evaluated in a number of enterprise-style settings. There is ample literature that suggests text-to-SQL works well, mostly by running a text-to-SQL module on either the Spider [20] or Bird [10] data set. The leaderboard for those benchmarks shows multiple systems with accuracy above 85%. However, these benchmarks are characterized by:

  1. Public data: this data is considered to be in “the pile”. Data warehouses, on the other hand, are rarely in the pile, as noted above.

  2. A “clean” (i.e., non-redundant) schema: Enterprise data warehouses use materialized views and other kinds of redundancy to speed access. Redundancy makes text-to-SQL considerably harder, as there are now multiple ways to solve a query.

  3. No idiosyncratic data: In a typical university data warehouse, as in most other warehouses, there is institution-specific jargon. For example, a short January term may refer to a one-month semester, and the Computer Science major may be referred to by an internal program code. It is implausible that an LLM can deal with such site-specific content.

  4. Relatively simple queries specified by students: In contrast, the actual queries to real data warehouses are formulated by business users with a real problem and are more complex than those in Spider or Bird. See [18] for examples of this complexity difference.

Because of these differences, a substantial drop in accuracy is observed when applying an LLM to real enterprise data warehouses. Specifically, accuracy may fall well below benchmark-reported levels, even when using standard techniques (RAG, prompt engineering using the schema and results of previous queries, enrichment using publicly available institutional websites, etc.). In summary, there can be more than a 50% drop in accuracy when real problems are encountered. This is the difference between production-usable technology and failure [4].

We have therefore concluded that text-to-SQL will not work in the enterprise, at least not for data exhibiting the above four characteristics. We note that current agentic AI is dependent on technology that, so far, does not work reliably in these settings.

For these reasons, we believe that agentic AI must take a different approach to enterprise data and applications. In this paper, we propose that agentic AI applications should be “data-centric” and not “LLM-centric”. In other words, the application should be “structured data-centric” and not “text-centric”. In the next section, we describe our alternate architecture, RUBICON, which is based on data sources, wrappers, a query optimizer/executor and a limited natural language front end.

2 Architecture of RUBICON

This section describes the architecture of RUBICON and the AQL abstraction that enables structured, query-time integration across heterogeneous enterprise sources.

2.1 Structured Retrieval and Integration Layer

Contemporary enterprise copilots and knowledge assistants such as Microsoft 365 Copilot [11], Glean [17], and emerging open agentic systems such as OpenCLAW [14] typically follow an ingestion-and-indexing paradigm. These systems connect to heterogeneous enterprise data sources (e.g., CRM, ERP, file systems), extract documents or records, compute embeddings or sparse indices, and expose a unified semantic search interface over otherwise disconnected silos. While this architecture improves discoverability, it does not integrate the underlying data models. Put differently, a text-oriented architecture is not adept at performing semantic joins.

Figure 1: AQL-Based Query Processing Architecture (RUBICON).

RUBICON instead treats retrieval as a virtual data integration problem in the classical database sense [9]. It assumes a collection of (often large) data sources that are left in situ with whatever indexing is currently in place. Access control is provided by whatever system manages the data store, as noted in the architectural diagram of Figure 1. Integration occurs at the query layer, not by physically consolidating data into a new store.

To join such data sources, the obvious approach (widely used today) is to have a data source-specific wrapper (connector) that converts the bespoke API of each data source into a common one. In our opinion, a requirement for this interface is to facilitate data integration (joins), and SQL would be a natural choice. However, using Text-to-SQL to convert a human request into SQL is very likely unworkable in an enterprise setting. Instead, we propose a stripped-down SQL-like notation, called AQL (Agentic Query Language), described in Section 2.3. The goal of a wrapper or agent is to translate AQL into the bespoke interface actually supported. We assume that wrappers will be constructed locally by enterprise personnel.

As noted in Figure 1, on top of AQL there is a query processing module, which can run in either of two modes: interactive mode and compiled mode. In interactive mode, a user issues a single AQL command and then inspects the result before proceeding. Since RUBICON can easily “go off the rails,” this allows the user to direct processing toward a final result. Using this mode, we expect the user to execute one command at a time until the desired goal is reached. Each issued command produces an explicit intermediate table that is visible to the user. These intermediate results are first-class artifacts: they can be inspected, saved, reused, or joined with subsequent queries. Because execution proceeds through concrete relational representations rather than hidden conversational state, the resulting workflow is reproducible. The user is not dependent on regenerating an answer through repeated LLM reasoning but instead operates over stable, inspectable query outputs.

In contrast, compiled mode executes an entire sequence of AQL commands as a single composite plan. If the user plans to run the same task again, then RUBICON has the complete command sequence and can invoke a relational-style query optimizer to obtain the answer more rapidly. In effect, interactive mode corresponds to interpreted execution, whereas compiled mode corresponds to relational compilation. Of course, compiled mode can also be invoked directly by specifying a more complex interaction from the outset. One of the strengths of RUBICON is that relational compilation is vastly cheaper and more predictable than repeated LLM processing.

On top of this processing layer sits a user interface. It is possible that this interface is natural language, processed by an LLM. However, our first version will use a graphical user interface layered above AQL. It will build on the form-based relational interfaces of the 1980s, such as Query-by-Forms (QBF) [7] and Query-by-Example (QBE) [21].

In this sense, RUBICON functions as connective tissue across enterprise systems. Rather than constructing a loosely connected graph of embeddings, the system supports a structured relational substrate in which entities, attributes, and relationships are explicitly represented and queryable in a SQL-like notation.

This architectural choice has two important consequences:

  • Structured Datastore over Disconnected Indices: Retrieval operates over a logically integrated relational schema constructed at query time via wrappers, enabling joins, constraints, aggregation, provenance tracking, and transactional guarantees.

  • Dual Language of Interaction: AQL provides declarative programmability through a GUI; we will investigate an NL interface in the future.

2.2 Wrappers

Multimodal data sources (e.g., structured records, free text, images, video) may require sophisticated wrappers. For example, a video wrapper would be an object detection pipeline (e.g., YOLO-style detectors [15]) that transforms visual data into relational tuples (object, attribute, timestamp, source). A textual corpus would likely be supported by a word-oriented text system such as Lucene, by sparse lexical representations [16], or by dense vector embeddings [12]. In any case, the wrapper will transform AQL into the native query language of the specific interface.

Crucially, the wrapper must present a relational view, even if none exists natively. For example, an e-mail system can be exposed as a table Message(from, to, subject, date, body, ...) where the underlying search engine evaluates the WHERE clause and returns matching rows. Likewise, Wikipedia can be exposed as a table Page(title, url, snippet, text, categories, ...) even if the underlying interface is keyword or vector search.

These tables are not materialized relations; they are logical views constructed by the wrapper. This normalization step is essential. Without rows and columns, downstream operators such as joins and unions have nothing well-defined to operate on.
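The relational view described above can be sketched as follows. This is a minimal illustration, not the RUBICON wrapper interface: the class EmailWrapper, its find method, the keyword_search backend, and the sample mailbox are all hypothetical. The column "from" is renamed to "sender" because "from" is a Python keyword. For simplicity, the WHERE text is passed to the backend as plain keywords rather than being translated by an LLM.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Message:
    sender: str   # stands in for the "from" column
    to: str
    subject: str
    date: str
    body: str

class EmailWrapper:
    """Hypothetical wrapper exposing a keyword-search backend as a logical table."""

    columns = ("sender", "to", "subject", "date", "body")

    def __init__(self, search: Callable[[str], list[dict]]):
        self._search = search  # native interface of the underlying system (assumed)

    def find(self, columns: list[str], where: str) -> list[dict]:
        unknown = [c for c in columns if c not in self.columns]
        if unknown:
            raise ValueError(f"unknown column(s): {unknown}")
        # Normalize every hit to the fixed schema, then project the requested columns.
        rows = [Message(**hit) for hit in self._search(where)]
        return [{c: getattr(r, c) for c in columns} for r in rows]

# Made-up mailbox and backend, used only to exercise the wrapper.
_MAILBOX = [
    {"sender": "a@example.org", "to": "b@example.org", "subject": "benchmark queries",
     "date": "2026-01-05", "body": "Draft of Q1-Q7 attached."},
    {"sender": "c@example.org", "to": "b@example.org", "subject": "lunch",
     "date": "2026-01-06", "body": "Noon?"},
]

def keyword_search(query: str) -> list[dict]:
    terms = query.lower().split()
    return [m for m in _MAILBOX
            if all(t in (m["subject"] + " " + m["body"]).lower() for t in terms)]

wrapper = EmailWrapper(keyword_search)
rows = wrapper.find(["sender", "subject"], "benchmark")
```

Because every result is a list of uniformly keyed rows, a downstream join or union can treat the e-mail system like any other table.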

2.3 Common Notation: Agentic Query Language (AQL)

Our experience dealing with enterprise data warehouses is that full NL text cannot be translated to SQL with sufficient accuracy to be workable in production. At some point in the future, the outlook may be rosier, but for now we need to specify a SQL subset. Accordingly, AQL is a restricted query language that serves as the common notation between users and data-source wrappers. Our major simplification is to require the user to say what attributes from what table they desire and then to supply a natural language predicate, as follows:

FIND <column(s)>
FROM <table>
WHERE <NL utterance>

Given our experience with data warehouses, this seems a required simplification.
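To make the basic form concrete, here is a small parsing sketch. It is not the RUBICON parser: the AqlQuery class, the parse_aql function, and the use of commas to separate columns are assumptions made for illustration. The natural-language predicate is kept as an uninterpreted string, since translating it is the wrapper's job.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class AqlQuery:
    columns: tuple[str, ...]
    table: str
    predicate: str  # natural-language text, left uninterpreted here

# FIND <column(s)> FROM <table> WHERE <NL utterance>; keywords are case-insensitive.
_AQL = re.compile(
    r"^\s*FIND\s+(?P<cols>.+?)\s+FROM\s+(?P<table>\S+)\s+WHERE\s+(?P<pred>.+?)\s*$",
    re.IGNORECASE | re.DOTALL,
)

def parse_aql(text: str) -> AqlQuery:
    match = _AQL.match(text)
    if match is None:
        raise ValueError("expected: FIND <column(s)> FROM <table> WHERE <NL utterance>")
    # Assumption: multiple columns are comma-separated.
    cols = tuple(c.strip() for c in match.group("cols").split(",") if c.strip())
    if not cols:
        raise ValueError("at least one column is required")
    return AqlQuery(cols, match.group("table"), match.group("pred"))

query = parse_aql(
    "FIND name, promotion_date\nFROM Faculty\nWHERE promoted to full professor"
)
```

The table and column names are resolved before any model is involved, so only the predicate text remains for an LLM-backed wrapper to translate.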

Aggregates.

Adding aggregates to the above yields:

FIND <aggregate-name column(s)>
FROM <table>
WHERE <NL utterance>

Joins.

Lastly, we need to support joins as follows:

FIND <column(s)>
FROM <table>
WHERE <NL utterance>
JOIN
FIND <column(s)>
FROM <table>
WHERE <NL utterance>

Schema access.

A new user needs to access the schema of the various data sources, as follows:

?: returns the data sources known to RUBICON
? <data source>: returns the tables known to the data source
? <table>: returns the columns and object types in the table, which we assume is unique

Housekeeping.

Finally, a user needs the following housekeeping commands:

SAVE (<query>) as <new-table>
OUTPUT <table>
DELETE <table>

The above is a compromise between expressivity and the requirement that the interaction be successfully processable.
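The housekeeping commands can be pictured as operations on a catalog of named intermediate tables. The sketch below is a hypothetical in-memory illustration: the Session class, its method names, and the choice to reject overwriting an existing table are assumptions, not documented RUBICON behavior.

```python
class Session:
    """Hypothetical in-memory catalog of intermediate tables."""

    def __init__(self):
        self._tables: dict[str, list[dict]] = {}

    def save(self, rows: list[dict], name: str) -> None:
        # Mirrors SAVE (<query>) as <new-table>, given an already evaluated result.
        if name in self._tables:
            raise KeyError(f"table already exists: {name}")  # assumed policy
        self._tables[name] = [dict(r) for r in rows]  # copy to keep the saved table stable

    def output(self, name: str) -> list[dict]:
        # Mirrors OUTPUT <table>; returns copies so callers cannot mutate the catalog.
        return [dict(r) for r in self._tables[name]]

    def delete(self, name: str) -> None:
        # Mirrors DELETE <table>.
        del self._tables[name]

    def tables(self) -> list[str]:
        return sorted(self._tables)

session = Session()
session.save([{"name": "Ada Example"}], "lab_professors")
saved_names = session.tables()
saved_rows = session.output("lab_professors")
session.delete("lab_professors")
remaining = session.tables()
```

Saved tables are copied on the way in and out, so a stored intermediate result stays unchanged between inspections.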

This restriction is intentional. By requiring the user to specify relations and attributes explicitly, AQL makes the LLM’s role narrow and observable. The natural-language predicate is translated into a native call by the wrapper, and the resulting intermediate representation is visible to the user. Consequently, queries can be inspected, saved, reused, and re-executed deterministically.

In contrast, fully agentic systems repeatedly invoke an LLM to decide what to do next, producing opaque execution traces that are neither predictable nor repeatable. Our goal is to minimize LLM calls, expose their translations, and reduce the system to a sequence of explicit relational operations. The result is a pipeline that is transparent, reproducible, and suitable for enterprise deployment.

ID Query Wiki University DW Research Lab Website Pile/LLM Email
Q1 List all research lab professors at the university and the dates they were promoted to full professor. R O R
Q2 How many university buildings have a Wikipedia page? R R
Q3 Which research lab professors have won a Turing Award or a Nobel Prize? R R O
Q4 Summarize the email thread between two users about benchmark queries. R R
Q5 Which university email newsletters am I subscribed to? R R
Q6 What research lab events have taken place in a specific campus building over the past month? R R
Q7 Which projects are currently being worked on in a particular university building? R R
Table 1: Ground-truth source relevance for the seven benchmark queries. Green = required source (R), yellow = optional source (O), gray = irrelevant source (–).

3 Experimental Setup

We have implemented a controlled multi-source benchmark to evaluate the performance of RUBICON versus other options (LLM-centric workflows, tool-augmented orchestration). We have been working with the Munich (Germany) Department of Transportation, which has several full-time engineers answering questions from citizens about crosswalk safety, traffic light timing, and trolley spacing. We have a hundred actual queries with their human-generated solutions. We plan to abstract their application into a benchmark to evaluate RUBICON as soon as we have permission from the relevant entities (expected within a month). In the meantime, we have constructed a similar benchmark on data we have access to. This benchmark simulates a setting in which relevant information is distributed across independent data sources with no schema-level integration.

3.1 Enterprise Motivation

Modern enterprises operate in highly fragmented information environments. Organizational knowledge is distributed across heterogeneous systems, each optimized for a specific operational purpose. These systems typically include:

  • Relational Databases: Structured transactional and analytical data stored in data warehouses and OLTP systems.

  • Documentation Repositories: Internal knowledge bases, policy documents, technical documentation, and reports stored in content management systems.

  • Application Platforms: Enterprise systems such as CRM, ERP, or HR solutions, each exposing application-specific APIs.

  • Email and Communication Systems: Semi-structured correspondence that often contains operational decisions, updates, and contextual clarifications.

  • External Public Knowledge: Web content and publicly accessible information sources that shape external representations of the organization.

  • Geoinformation and CAD Systems: Spatial databases and engineering design repositories that manage geospatial layers, maps, 3D models, and CAD drawings.

Specifically, our benchmark contains the following five data sources:

  • Wikipedia, accessed via public API calls.

  • University Data Warehouse (DW). We have access to 97 tables from an institutional administration data warehouse. Most of the tables concern facilities (buildings, room use, etc.). We have anonymized the data in these tables and copied it into a SQLite database.

  • University research laboratory website. This site has information on lab personnel, research activities, events, etc.

  • Email system (Gmail [6]), accessed via API calls.

  • The Pile / LLM knowledge, representing general pretrained language knowledge accessible through foundation models.

Each source exposes a distinct interface and data modality (structured tables, web documents, private communication, or pretrained knowledge). No source shares schema-level integration with the others. As a result, cross-source reasoning requires explicit coordination rather than implicit joins or shared schema assumptions.

3.2 Query Workload Design

We constructed a workload of seven expert-designed queries (Q1–Q7). Each query is deliberately designed to require integration of exactly two relevant sources. The remaining three sources are irrelevant for answering the query and serve as distractors. Table 1 shows the ground-truth relevance matrix.

This design enforces three structural properties:

  • Mandatory source selection: Correctly answering a query requires identifying both relevant sources.

  • Cross-source reasoning: No query can be answered using a single source alone.

  • Distractor sensitivity: Invoking irrelevant sources increases cost and may introduce hallucinated or incomplete information.

The benchmark is intentionally small and fully disclosed to allow transparent inspection of system behavior at the per-query level in addition to aggregate metrics.

3.3 Ground-Truth Construction

Ground-truth answers were established through manual expert retrieval across the required sources for each query. For every $q \in Q$, we verified that:

  • the required sources contain sufficient information to derive the correct answer, and

  • the irrelevant sources do not contain sufficient standalone information to produce a correct result.

Each final answer was validated to ensure consistency across the two required sources. This controlled construction allows systematic diagnosis of coordination failures, including omitted sources and redundant exploration.
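The diagnosis of omitted sources and redundant exploration can be sketched as a set comparison between the ground-truth sources and the sources a system actually invoked. The function name diagnose and the example source sets below are illustrative assumptions; the sets are not read from Table 1.

```python
def diagnose(required: set[str], invoked: list[str],
             optional: frozenset[str] = frozenset()) -> dict:
    """Compare the sources a run touched against the ground-truth requirement."""
    used = set(invoked)
    return {
        "omitted": sorted(required - used),                 # required but never consulted
        "distractors": sorted(used - required - optional),  # consulted but irrelevant
        "complete": required <= used,                       # every required source consulted
    }

# Illustrative run: one required source is skipped and one distractor is called.
report = diagnose(
    required={"Wikipedia", "University DW"},
    invoked=["Wikipedia", "Wikipedia", "Pile/LLM"],
)
```

A run can only be marked correct if "complete" is true, and the "distractors" list shows where unnecessary cost was incurred.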

3.4 Metrics

We evaluate each system along correctness (accuracy), token usage, provider-reported monetary cost, latency, and coordination complexity.

Accuracy.

Accuracy measures whether a system produces the correct final answer for a given query. A prediction is considered correct if the system’s output is semantically and logically equivalent to the manually established ground-truth answer for the corresponding query.

Token Usage.

We record the number of tokens consumed by the model for each query, split into input and output tokens. These values are obtained from the provider API.

Provider-Reported Monetary Cost.

We measure monetary cost using the value reported by the model provider’s API (when available). We do not compute cost from token counts and do not include storage or database costs.

Latency.

We measure latency as the time from issuing a query to receiving the first output token from the model (Time to First Token, TTFT).

Coordination Complexity.

To quantify coordination behavior, we measure the number of external tool invocations executed while answering each query.
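The metrics above can be aggregated over a workload as in the following sketch. The QueryRun record, its field names, and the sample values are made up for illustration and are not measurements from this paper.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass(frozen=True)
class QueryRun:
    correct: bool
    tokens_in: int
    tokens_out: int
    tool_calls: int
    cost_usd: float   # as reported by the provider API
    ttft_s: float     # time to first token, in seconds

def aggregate(runs: list[QueryRun]) -> dict[str, float]:
    """Average each per-query metric; accuracy is the share of correct runs."""
    if not runs:
        raise ValueError("no runs to aggregate")
    return {
        "accuracy_pct": 100.0 * sum(r.correct for r in runs) / len(runs),
        "mean_tokens_in": mean(r.tokens_in for r in runs),
        "mean_tokens_out": mean(r.tokens_out for r in runs),
        "mean_tool_calls": mean(r.tool_calls for r in runs),
        "mean_cost_usd": mean(r.cost_usd for r in runs),
        "mean_ttft_s": mean(r.ttft_s for r in runs),
    }

# Two made-up runs, used only to exercise the function.
summary = aggregate([
    QueryRun(True, 100, 40, 2, 0.002, 1.5),
    QueryRun(False, 300, 60, 4, 0.004, 2.5),
])
```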

3.5 Model Configuration

We evaluate three representative state-of-the-art foundation models and compare their performance to RUBICON:

  • OpenAI GPT-5-mini [13],

  • Google Gemini-3-flash-preview [5],

  • Anthropic Claude-Sonnet-4.6 [1].

All models are configured with the highest available reasoning settings (e.g., long-context mode, increased reasoning effort and web search). We deliberately focus on “mini” or mid-tier foundation models rather than the largest available variants. In enterprise settings, query processing must be cost-predictable and scalable. Architectures that require invoking frontier-scale models for every intermediate step incur substantial latency and operating expense, and are therefore unlikely to be viable in production. Our objective is to demonstrate that the proposed system achieves competitive accuracy while reducing reliance on large, opaque models. If acceptable performance can be obtained with smaller models, the resulting solution is more economical, repeatable, and deployable at scale.

We evaluate two prompting styles:

  • Natural Language (NL): The query is submitted exactly as written.

  • AQL-style Prompting: The query is reformulated into a structured pseudo-query representation intended to guide reasoning.

We compare two orchestration settings against RUBICON: a standard single-shot interaction (“Vanilla LLM”) and a ReAct-style agent [19] implemented using LangChain [3]. Across both settings, all baselines use identical foundation models and model configurations. The only difference is whether structured tool access and iterative reasoning are enabled.

Vanilla LLM

This setting corresponds to the standard chat interface: the model receives the prompt and produces a single response without tool use.

No enterprise-internal information is injected into the context window. In particular, the university data warehouse, the lab research website API, and the email system are not accessible in this configuration. As a result, the model must rely on pretrained knowledge and, when enabled, publicly accessible web information. This isolates the limitations of implicit multi-source reasoning in the absence of explicit coordination and tool-based retrieval.

LangChain ReAct Agent [8]

We implement a ReAct-style agent that interleaves reasoning and action. In this setting, the model may:

  • generate intermediate reasoning steps,

  • invoke one of the available tools,

  • observe the returned result, and

  • refine its reasoning before issuing additional tool calls.

The agent has access to the following tools:

  • Wikipedia API,

  • University Data Warehouse (SQLite),

  • Research Lab website API,

  • Gmail API,

  • LLM knowledge tool.

The underlying foundation models (GPT-5-mini, Gemini-3-flash-preview, and Claude-Sonnet-4.6) are identical to those used in the vanilla baseline. The difference is purely in the interaction pattern: instead of producing a single-shot response, the model can iteratively invoke tools and incorporate intermediate results into its final answer.
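The reason-act-observe cycle described above can be sketched without any agent framework. The code below does not use the LangChain API and is not the agent used in our experiments: react_loop, the scripted_policy stub standing in for an LLM, the tool names, and the max_steps limit are all hypothetical.

```python
from typing import Callable, Optional

Tool = Callable[[str], str]
# A policy maps the transcript so far to ("call", tool_name, tool_input)
# or to ("final", answer, "").
Policy = Callable[[list[str]], tuple[str, str, str]]

def react_loop(question: str, policy: Policy, tools: dict[str, Tool],
               max_steps: int = 5) -> tuple[Optional[str], list[str]]:
    """Alternate between policy decisions and tool calls until a final answer or the step limit."""
    transcript = [f"Question: {question}"]
    calls: list[str] = []
    for _ in range(max_steps):
        kind, first, second = policy(transcript)
        if kind == "final":
            return first, calls
        if first not in tools:
            transcript.append(f"Observation: unknown tool {first!r}")
            continue
        calls.append(first)
        observation = tools[first](second)
        transcript.append(f"Action: {first}({second!r})")
        transcript.append(f"Observation: {observation}")
    return None, calls  # step limit reached without a final answer

def scripted_policy(transcript: list[str]) -> tuple[str, str, str]:
    # Deterministic stand-in for an LLM: consult each source once, then answer.
    text = "\n".join(transcript)
    if "Action: dw(" not in text:
        return ("call", "dw", "lab professors")
    if "Action: wiki(" not in text:
        return ("call", "wiki", "Turing Award")
    return ("final", "joined answer", "")

tools = {"dw": lambda q: "3 rows", "wiki": lambda q: "2 pages"}
answer, calls = react_loop("Q3", scripted_policy, tools)
truncated_answer, truncated_calls = react_loop("Q3", scripted_policy, tools, max_steps=1)
```

With a real model in place of the scripted policy, nothing in the loop forces both sources to be consulted, and the step limit can end a run before any answer is produced.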

| ID | Query (short) | Vanilla GPT NL | Vanilla GPT AQL | Vanilla Gemini NL | Vanilla Gemini AQL | Vanilla Claude NL | Vanilla Claude AQL | ReAct GPT NL | ReAct GPT AQL | ReAct Gemini NL | ReAct Gemini AQL | ReAct Claude NL | ReAct Claude AQL | RUBICON GPT AQL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Q1 | Research lab professors + promotion dates | I | I | I | I | I | I | I | I | I | I | I | I | C |
| Q2 | University buildings w/ Wikipedia page | I | I | I | I | I | I | I | I | I | I | I | I | C |
| Q3 | Research lab professors w/ Turing/Nobel | I | I | I | I | I | I | I | I | I | I | I | I | C |
| Q4 | Email thread summary (two users) | I | I | I | I | I | I | I | I | I | I | I | I | C |
| Q5 | Subscribed university newsletters | I | I | I | I | I | I | I | I | I | I | I | I | C |
| Q6 | Research lab events in a campus building last month | I | I | I | I | I | I | I | I | I | I | I | I | C |
| Q7 | Projects in a university building | I | I | I | I | I | I | I | I | I | I | I | I | C |
| | Accuracy (%) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 100.00 |

Table 2: Per-query correctness. C = correct, and I = incorrect. Vanilla = Vanilla LLM; ReAct = LangChain ReAct Agent.

| System | Config | $\overline{T}_{in}$ | $\overline{T}_{out}$ | $\overline{k}$ | $\overline{C}$ (USD) | $\overline{t}_{TTFT}$ (s) |
|---|---|---|---|---|---|---|
| Vanilla | GPT-NL | 19.71 | 1,085.86 | 0.0 | 0.0016 | 13.70 |
| Vanilla | GPT-AQL | 64.43 | 1,383.00 | 0.0 | 0.0021 | 12.80 |
| Vanilla | Gem-NL | 25.00 | 657.57 | 0.0 | 0.0020 | 8.50 |
| Vanilla | Gem-AQL | 78.14 | 442.71 | 0.0 | 0.0014 | 30.95 |
| Vanilla | Cl-NL | 21.86 | 169.14 | 0.0 | 0.0026 | 0.95 |
| Vanilla | Cl-AQL | 76.14 | 673.57 | 0.0 | 0.010332 | 1.20 |
| ReAct | GPT-NL | 20,489.14 | 2,231.86 | 3.57 | 0.0046 | 29.47 |
| ReAct | GPT-AQL | 46,486.71 | 4,760.71 | 8.29 | 0.0123 | 52.21 |
| ReAct | Gem-NL | 271,574.29 | 11,823.29 | 17.28 | 0.1654 | 190.66 |
| ReAct | Gem-AQL | 469,543.29 | 14,774.57 | 22.71 | 0.2791 | 246.23 |
| ReAct | Cl-NL | 55,917.00 | 1,708.43 | 11.14 | 0.1883 | 99.48 |
| ReAct | Cl-AQL | 156,656.29 | 4,763.29 | 13.14 | 0.5416 | 265.00 |
| RUBICON | GPT-AQL | 4,182 | 2,347 | 2.00 | 0.0036 | 21.84 |

Table 3: Aggregated efficiency metrics averaged over queries. $\overline{k}$ is the mean number of tool calls per query (0 for Vanilla).

4 Evaluation

We report results for the configurations defined in Section 3.

4.1 Accuracy

Table 2 reports per-query correctness across all systems and prompting styles. A result is marked as correct (C) only if all required sources are consulted and the final answer fully matches the ground truth. All other outcomes are labeled incorrect (I).

Across all seven benchmark queries, neither vanilla LLMs nor the LangChain ReAct agent achieve a single correct result. This holds across all models (GPT, Gemini, Claude) and both prompting styles (NL and AQL). Overall accuracy for these configurations is therefore 0%.

Importantly, errors are not dominated by hallucinated content. Instead, failures consistently stem from incomplete multi-source coordination: models either omit at least one required source, terminate prematurely, or fail to correctly integrate results across sources. Even when some relevant information is retrieved, required joins are not completed.

Providing tool access via the ReAct agent does not alter this outcome. Despite being able to invoke all available sources, no configuration reliably enforces completeness over the required source set. Tool availability alone does not guarantee correct coordination.

For example, in the “research lab Turing or Nobel laureates” query (Q3), both the vanilla LLM configurations and the LangChain ReAct agent exhibit the same coordination failure. In many runs, the model retrieves award recipients from Wikipedia but never restricts the result correctly to research lab professors. In other runs, the enterprise database is queried but award status is not verified through Wikipedia. In no configuration does the system reliably consult both required sources and complete the necessary join.

In contrast, RUBICON achieves 100% accuracy, correctly answering all seven queries. Because RUBICON enforces explicit source selection and deterministic execution over required sources, it eliminates omission-based failures.

To avoid confounding coordination errors with long-context degradation, we intentionally do not inject enterprise data directly into the context window of vanilla LLMs. Prior long-context evaluations (e.g., BEAVER [4]) demonstrate that excessive context length can degrade reasoning quality due to context dilution. Our setup therefore isolates pretrained knowledge and explicit tool-based coordination, ensuring that observed failures reflect coordination limitations rather than context saturation effects.

Overall, the results demonstrate that the dominant bottleneck is not language understanding but reliable multi-source completeness. Without explicit coordination constraints, even tool-augmented LLM agents fail systematically.

4.2 Efficiency: Cost and Latency

Table 3 reports average input tokens ($\overline{T}_{in}$), output tokens ($\overline{T}_{out}$), tool calls ($\overline{k}$), monetary cost ($\overline{C}$), and time-to-first-token (TTFT, $\overline{t}_{TTFT}$).

Vanilla vs. ReAct scaling.

Vanilla configurations exhibit compact token footprints and low cost. Average input length remains below 80 tokens across models, reflecting single-shot inference without iterative tool interaction. Latency is correspondingly low.

ReAct configurations show a fundamentally different scaling regime. Iterative reasoning–tool cycles cause rapid context accumulation, leading to input token counts roughly three to four orders of magnitude higher than in the Vanilla setting (Table 3). For GPT, input tokens increase from $\approx 20$ (Vanilla-NL) to over $20{,}000$ (ReAct-NL) and $46{,}000$ (ReAct-AQL). Gemini and Claude exhibit even more extreme growth, with Gemini-AQL exceeding $469{,}000$ input tokens on average. This growth directly increases monetary cost and latency, with TTFT expanding from seconds to multiple minutes in the most exploratory configurations.
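The growth factors implied by Table 3 can be computed directly, as in the sketch below. The values are the mean input-token counts from Table 3, reading entries such as "20 489.14" as 20,489.14, which is an interpretation of the table layout consistent with the figures quoted in the text. The function name growth is ours.

```python
import math

# Mean input tokens per query from Table 3: (Vanilla, ReAct) for each configuration.
MEAN_INPUT_TOKENS = {
    "GPT-NL": (19.71, 20489.14),
    "GPT-AQL": (64.43, 46486.71),
    "Gem-NL": (25.00, 271574.29),
    "Gem-AQL": (78.14, 469543.29),
    "Cl-NL": (21.86, 55917.00),
    "Cl-AQL": (76.14, 156656.29),
}

def growth(vanilla: float, react: float) -> tuple[float, float]:
    """Return the multiplicative growth and its base-10 logarithm (orders of magnitude)."""
    factor = react / vanilla
    return factor, math.log10(factor)

report = {cfg: growth(v, r) for cfg, (v, r) in MEAN_INPUT_TOKENS.items()}

for cfg, (factor, orders) in report.items():
    print(f"{cfg}: x{factor:,.0f} ({orders:.2f} orders of magnitude)")
```

Every configuration grows by at least several hundred times, and the Gemini-NL configuration by more than ten thousand times.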

Model-level differences.

Efficiency differences reflect both pricing and coordination behavior.

In the Vanilla setting, cost differences are largely pricing-driven; Claude appears comparatively expensive despite modest token usage due to higher per-token rates.

Under ReAct, behavioral differences dominate. Gemini generates by far the largest token footprints and highest total cost, not primarily because of pricing, but due to deeper exploratory execution: more intermediate reasoning, more tool calls, and frequent termination at tool-call limits rather than natural completion. Claude exhibits substantial but more bounded expansion, while GPT shows the most conservative growth.

Thus, under iterative tool use, efficiency is governed less by token price and more by exploration depth and termination discipline.

Effect of AQL prompting.

AQL systematically increases prompt size due to its structured specification format. Across both Vanilla and ReAct settings, AQL configurations exhibit higher average input tokens than their NL counterparts. Under ReAct, this amplification compounds with iterative reasoning, often doubling total token consumption (e.g., GPT and Gemini). Thus, while AQL provides clearer structural guidance, it also expands the computational footprint.

Tool-call dynamics.

The average number of tool calls per query ($\overline{k}$) highlights the degree of exploratory behavior. Vanilla configurations perform no tool interaction ($\overline{k}=0$). ReAct configurations range from moderate (GPT-NL: 3.57) to highly exploratory (Gemini-AQL: 22.71). In contrast, RUBICON maintains a bounded and deterministic interaction pattern ($\overline{k}=2.0$), reflecting explicit planning rather than iterative search.

Cost and latency implications.

Overall, efficiency scales with exploratory coordination depth. Systems that repeatedly re-plan, serialize intermediate results, and re-query tools accumulate large context windows, driving cost and latency upward. Constrained execution strategies maintain bounded token growth and predictable runtime characteristics.

[Figure 2 panels: (a) GPT (NL/AQL) vs Rubicon; (b) Gemini (NL/AQL) vs Rubicon; (c) Claude (NL/AQL) vs Rubicon. x-axis: queries Q1 to Q7; y-axis: # tool calls (0 to 30); stacked series: Wikipedia, Pile/LLM, Lab Website, DW, Email.]
Figure 2: Tool-call composition per query. Each query shows three stacked bars (LangChain+NL, LangChain+AQL, Rubicon).

4.3 Coordination Behavior

Figure 2 characterizes how tool usage is allocated across sources for each query and model.

Ground-truth alignment

RUBICON exhibits perfect overlap with the required sources across all queries. Its tool calls are concentrated on the ground-truth sources, with minimal or no exploration of irrelevant ones. The color composition of the stacked bars closely mirrors the requirement structure defined in Table 2. This reflects bounded exploration and deterministic source selection.

Natural language prompting (NL)

In contrast, LangChain+NL shows substantially noisier coordination. Across models, two recurring failure patterns emerge: (i) exploratory invocation of non-required sources prior to consulting required ones, and (ii) omission of one or more required sources before termination.

The latter is particularly visible in queries requiring multi-source aggregation: NL configurations frequently miss at least one necessary source, even when other required sources are consulted correctly. This explains the large gap between RUBICON and LLM-centric systems observed in Section 4.1.
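The two failure patterns can be detected mechanically from a tool-call trace. The sketch below is an illustration only, not the evaluation code used in the paper; the function name, the report fields, and the example trace are hypothetical, and the required-source set stands in for a row of Table 2.

```python
def coordination_report(calls: list[str], required: set[str]) -> dict:
    """Summarize how a sequence of source calls relates to the required sources."""
    called = set(calls)
    # Index of the first call that hits a required source (or end of trace).
    first_required = next(
        (i for i, source in enumerate(calls) if source in required), len(calls)
    )
    return {
        # Fraction of required sources consulted at least once.
        "coverage": len(required & called) / len(required) if required else 1.0,
        # Pattern (ii): required sources never consulted before termination.
        "missing": sorted(required - called),
        "irrelevant": sorted(called - required),
        # Pattern (i): a non-required source is invoked before any required one.
        "explored_before_required": any(
            source not in required for source in calls[:first_required]
        ),
    }


# Hypothetical trace for a query that needs both Wikipedia and the DW.
report = coordination_report(
    calls=["Email", "Wikipedia", "Wikipedia"],
    required={"Wikipedia", "DW"},
)
```

In this example the agent starts with an irrelevant source and terminates without consulting the DW, so both patterns fire and coverage is 0.5.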

Effect of AQL

Providing AQL specifications improves source selection behavior. Across models, AQL increases overlap with the required sources and reduces purely irrelevant exploration. However, this improvement does not necessarily reduce overall tool usage. In most cases, AQL results in at least as many tool calls as NL, and often more.

Thus, AQL appears to improve which tools are called, but not how efficiently they are invoked.

Model-level differences

Clear differences in call intensity are observable across models. Gemini consistently invokes the largest number of tools per query, followed by Claude, and then GPT. This ordering holds across both NL and AQL prompting styles. Importantly, higher call volume does not translate into higher correctness, indicating that exploratory breadth alone does not resolve coordination failures.

Coordination vs. exploration

Overall, correctness depends not merely on tool availability or reasoning effort, but on disciplined source selection and complete multi-source coverage. While AQL nudges models toward more appropriate tool invocation, coordination remains probabilistic and termination conditions remain unreliable. RUBICON's constrained planning eliminates these failure modes by enforcing explicit completeness over required sources.
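One way to picture enforced completeness is an executor that iterates over a fixed list of required sources and refuses to proceed if any of them cannot be served. This is a minimal sketch under stated assumptions, not RUBICON's implementation: `execute_plan`, `IncompleteCoverage`, the one-call-per-source policy, and the lambda wrappers are all hypothetical.

```python
from typing import Callable


class IncompleteCoverage(Exception):
    """Raised when a required source cannot be queried."""


def execute_plan(required: list[str],
                 wrappers: dict[str, Callable[[], list]]) -> dict[str, list]:
    """Call each required source exactly once, or fail before answering."""
    results: dict[str, list] = {}
    for source in required:  # bounded: the loop length is fixed by the plan
        if source not in wrappers:
            raise IncompleteCoverage(f"no wrapper for required source {source!r}")
        results[source] = wrappers[source]()
    return results


# Hypothetical wrappers returning canned rows.
wrappers = {"DW": lambda: ["row-1"], "Wikipedia": lambda: ["page-1"]}
results = execute_plan(["DW", "Wikipedia"], wrappers)

try:
    execute_plan(["DW", "Email"], wrappers)
    gate_fired = False
except IncompleteCoverage:
    gate_fired = True
```

The design choice is that omission becomes an explicit error rather than a silent early stop, and the number of calls is determined by the plan rather than by the model.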

4.4 Plan Sensitivity and the Need for an Optimizer

Even for a single user question, the AQL layer admits multiple semantically similar execution plans with dramatically different cost profiles. Consider query $Q3$:

NL:

"Which research lab professors at the university have won a Turing Award or a Nobel Prize?"

AQL (award-first):

FIND laureate_full_name, award_name
FROM WIKIPEDIA
WHERE people associated with 'Turing Award' or 'Nobel Prize'
JOIN
FIND full_name
FROM UNIVERSITY_DW.faculty
WHERE the person is a professor in the research lab

AQL (faculty-first):

FIND full_name
FROM UNIVERSITY_DW.faculty
WHERE the person is a professor in the research lab
JOIN
FIND full_name, award_name
FROM WIKIPEDIA
WHERE for each professor, determine whether their page indicates a 'Turing Award' or 'Nobel Prize'

This question can be expressed through at least two alternative plans.

Plan A (award-first). First extract the set of Turing and Nobel laureates from Wikipedia, then join this candidate set with research lab faculty from UNIVERSITY_DW.

Plan B (faculty-first). First enumerate all research lab faculty from UNIVERSITY_DW, and for each faculty member retrieve and scan the corresponding Wikipedia page to determine whether it mentions a Turing Award or Nobel Prize.

While both plans aim to compute the same result, their execution characteristics differ fundamentally.

Plan B induces an $\mathcal{O}(|\text{research lab professors}|)$ pattern of external lookups: for every faculty member, a Wikipedia retrieval and a long-text scan are required. In an LLM-mediated setting, each probe translates into tool invocations, token consumption, and latency. If faculty pages are long or require multiple retrieval attempts, the cumulative cost increases proportionally with the number of faculty members.

Plan A instead attempts to push down a highly selective predicate (“award recipient”) into Wikipedia before joining with research lab faculty. Since the number of Turing and Nobel laureates is relatively small compared to the number of faculty members, the number of external probes and subsequent joins is substantially reduced. This mirrors classical database join-order optimization: applying selective filters early can shrink intermediate results and avoid expensive nested access patterns.
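The difference in access pattern can be shown on toy data. The sketch below is an illustration, not the paper's executor: it assumes Plan A can retrieve each award's laureate list with a single external lookup, and all names, pages, and helper functions are hypothetical.

```python
AWARDS = ("Turing Award", "Nobel Prize")


def plan_b_faculty_first(faculty, fetch_page):
    """One external lookup per faculty member, then a scan of the page text."""
    probes, winners = 0, []
    for name in faculty:
        page = fetch_page(name)
        probes += 1
        winners += [(name, award) for award in AWARDS if award in page]
    return winners, probes


def plan_a_award_first(faculty, fetch_laureates):
    """One external lookup per award (assumed), then a local hash join."""
    probes, winners = 0, []
    faculty_set = set(faculty)
    for award in AWARDS:
        laureates = fetch_laureates(award)
        probes += 1
        winners += [(name, award) for name in laureates if name in faculty_set]
    return winners, probes


# Toy stand-ins for UNIVERSITY_DW.faculty and Wikipedia.
faculty = ["Prof. A", "Prof. B", "Prof. C", "Prof. D"]
pages = {"Prof. A": "... received the Turing Award ...", "Prof. B": "..."}
laureates = {"Turing Award": ["Prof. A", "Someone Else"], "Nobel Prize": ["Another Person"]}

result_b, probes_b = plan_b_faculty_first(faculty, lambda n: pages.get(n, ""))
result_a, probes_a = plan_a_award_first(faculty, lambda a: laureates[a])
```

Both plans return the same answer, but Plan B issues one probe per faculty member while Plan A issues one per award, so the gap widens as the faculty table grows.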

Importantly, the difference between these plans is structural rather than linguistic. Both can be expressed correctly in AQL; both are logically valid. Yet their token usage, tool-call count, and latency can diverge significantly depending on join order and access path. In LLM-centric multi-source systems, where external retrieval and long-context processing are expensive, such plan sensitivity directly translates into monetary cost and response time.

Our experiments already demonstrate that increasing autonomy expands the cost surface through additional reasoning steps and tool calls. The example above highlights a complementary issue: even under perfect coordination, execution cost remains highly dependent on plan structure. This suggests that multi-source LLM systems require an optimizer (or at minimum a cost-aware planner) that can choose access paths and join orders using source statistics, cardinality estimates, caching strategies, and tool pricing information. Without such mechanisms, greater autonomy risks translating into higher spend and slower responses without corresponding gains in correctness.
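A cost-aware planner of this kind can be sketched in a few lines. The cost model below (calls times a per-call fee plus tokens read) is a deliberate simplification, and the cardinality estimates, token sizes, and prices are made up for illustration; none of them come from the experiments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanEstimate:
    name: str
    external_probes: float   # estimated number of tool calls
    tokens_per_probe: float  # estimated tokens read per call


def estimated_cost(plan: PlanEstimate, usd_per_token: float, usd_per_call: float) -> float:
    """Simplified cost model: every probe pays a call fee plus its token volume."""
    return plan.external_probes * (usd_per_call + plan.tokens_per_probe * usd_per_token)


def choose_plan(plans, usd_per_token: float, usd_per_call: float) -> PlanEstimate:
    """Pick the plan with the lowest estimated cost."""
    return min(plans, key=lambda p: estimated_cost(p, usd_per_token, usd_per_call))


# Hypothetical statistics: two long laureate lists vs. one page per professor.
award_first = PlanEstimate("award-first", external_probes=2, tokens_per_probe=50_000)
large_lab = PlanEstimate("faculty-first", external_probes=300, tokens_per_probe=4_000)
small_lab = PlanEstimate("faculty-first", external_probes=10, tokens_per_probe=4_000)

best_large = choose_plan([award_first, large_lab], usd_per_token=1e-6, usd_per_call=0.001)
best_small = choose_plan([award_first, small_lab], usd_per_token=1e-6, usd_per_call=0.001)
```

With these numbers the planner prefers award-first for a 300-person lab but faculty-first for a 10-person lab, which illustrates why the choice has to be driven by source statistics rather than fixed in advance.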

We present this example as a motivating illustration of query-plan sensitivity and leave a full empirical cost comparison of alternative AQL plans to future work.

4.5 Failure Modes

We observe four dominant error modes across systems:

Missing-source hallucination (Vanilla).

When required enterprise-internal data is unavailable, vanilla LLMs produce plausible but incorrect responses by extrapolating from partial web information or pretrained knowledge.

Incomplete cross-source reasoning (Vanilla, ReAct).

For multi-source queries, models frequently consult or reason over only one relevant source, yielding incomplete aggregation or missing constraints.

Incorrect source selection (ReAct).

Despite access to all tools, the agent often fails to identify one of the required sources, leading to systematic omission errors.

Context accumulation effects (ReAct).

As intermediate tool outputs accumulate, prompt length and noise increase, raising latency and increasing the likelihood of integration errors or early stopping.

Taken together, these failures indicate that the primary bottleneck is coordination: enforcing source completeness, disciplined invocation order, and stable integration under heterogeneous multi-source queries.

4.6 Summary of Findings

Across all experiments, we find: (i) stronger prompting and increased reasoning effort do not compensate for missing or omitted sources, (ii) higher cost and latency do not reliably predict correctness, and (iii) the dominant limitation is coordination—ensuring source completeness and deterministic multi-source integration rather than increasing model-side deliberation.

5 Conclusion

This work evaluated whether increasing reasoning effort and granting broader tool access improves multi-source coordination in LLM-based systems. Across all configurations, a consistent pattern emerges: greater autonomy does not reliably translate into correct or complete multi-source integration.

Even when equipped with full tool access and high reasoning effort, models frequently fail to identify all required sources, invoke them in appropriate order, or enforce completeness before termination. Errors arise primarily from omissions and structural missteps rather than linguistic or logical weaknesses. Coordination, not fluency, is the dominant challenge.

These findings are consistent with broader enterprise evidence. The State of AI in Business 2025 report by Project NANDA (MIT)[2], which analyzes more than 300 publicly disclosed AI initiatives and surveys senior executives, documents widespread experimentation with generative and agentic systems but limited transition to measurable economic impact. Fewer than 5% of custom enterprise AI initiatives achieve demonstrable returns, highlighting a persistent pilot-to-production gap. Our benchmark reflects this structural gap at a controlled micro-level: strong reasoning capabilities do not automatically yield reliable, production-grade multi-source execution.

Importantly, increasing autonomy expands both cost and failure surface. Additional reasoning steps and iterative tool invocations substantially increase token consumption, latency, and provider-reported cost without proportional gains in accuracy. Autonomy alone therefore scales uncertainty and expense rather than robustness.

Overall, the results indicate that unconstrained LLM-centric coordination is insufficient for enterprise-grade multi-source reasoning. Reliable systems require explicit structural guarantees: enforced source coverage, bounded exploration, constraint-aware planning, and verifiable completeness. Increased reasoning depth alone does not provide these guarantees.

References

  • [1] Anthropic (2025) Claude sonnet 4.6. Model documentation. https://www.anthropic.com/claude. Accessed: 2026-02-27.
  • [2] A. Challapally, C. Pease, R. Raskar, P. Chari, and MIT NANDA (2025-07) The GenAI divide: state of AI in business 2025. Technical report, Project NANDA, Massachusetts Institute of Technology.
  • [3] H. Chase (2022) LangChain. https://github.com/langchain-ai/langchain. Accessed: 2026-02-27.
  • [4] P. B. Chen, F. Wenz, Y. Zhang, M. Kayali, N. Tatbul, M. J. Cafarella, Ç. Demiralp, and M. Stonebraker (2024) BEAVER: an enterprise benchmark for text-to-SQL. CoRR abs/2409.02038.
  • [5] Google DeepMind (2025) Gemini 3 flash (preview). Preview model documentation. https://ai.google.dev/. Accessed: 2026-02-27.
  • [6] Google (2024) Gmail. https://workspace.google.com/products/gmail/. Accessed: 2026-02-27.
  • [7] M. Jayapandian and H. V. Jagadish (2008) Automated creation of a forms-based database query interface. Proc. VLDB Endow. 1 (1), pp. 695–709.
  • [8] LangChain (2024) ReAct agent documentation. https://python.langchain.com/docs/modules/agents/agent_types/react. Accessed: 2026-02-27.
  • [9] M. Lenzerini (2002) Data integration: a theoretical perspective. In Proceedings of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 3-5, Madison, Wisconsin, USA, L. Popa, S. Abiteboul, and P. G. Kolaitis (Eds.), pp. 233–246.
  • [10] J. Li, B. Hui, G. Qu, J. Yang, B. Li, B. Li, B. Wang, B. Qin, R. Geng, N. Huo, et al. (2024) Can LLM already serve as a database interface? A big bench for large-scale database grounded text-to-SQLs. Advances in Neural Information Processing Systems 36.
  • [11] Microsoft (2023) Microsoft 365 Copilot: reinventing productivity with AI. https://www.microsoft.com/en-us/microsoft-365/copilot. Accessed: 2026-02-25.
  • [12] T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. In 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, Y. Bengio and Y. LeCun (Eds.).
  • [13] OpenAI (2025) GPT-5 mini. Model documentation. https://platform.openai.com/docs/models. Accessed: 2026-02-27.
  • [14] OpenCLAW Contributors (2024) OpenCLAW: open source agent framework. Project documentation. https://github.com/openclaw.
  • [15] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. arXiv:1506.02640.
  • [16] G. Salton and C. Buckley (1988-08) Term-weighting approaches in automatic text retrieval. Inf. Process. Manage. 24 (5), pp. 513–523. ISSN 0306-4573.
  • [17] Glean Technologies (2023) Glean: work AI for the enterprise. https://www.glean.com. Accessed: 2026-02-25.
  • [18] F. Wenz, O. Bouattour, D. Yang, J. Choi, C. Gregg, N. Tatbul, and Ç. Demiralp (2026) BenchPress: a human-in-the-loop annotation system for rapid text-to-SQL benchmark curation. In 16th Conference on Innovative Data Systems Research, CIDR 2026, Chaminade, CA, USA, January 18-21, 2026.
  • [19] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao (2023) ReAct: synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023.
  • [20] T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, et al. (2018) Spider: a large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. arXiv preprint arXiv:1809.08887.
  • [21] M. M. Zloof (1975) Query-by-example: the invocation and definition of tables and forms. In Proceedings of the International Conference on Very Large Data Bases, September 22-24, 1975, Framingham, Massachusetts, USA, D. S. Kerr (Ed.), pp. 1–24.