AI search systems now answer questions like research assistants, but they do not always deserve that level of trust. A new study reports that popular tools frequently make claims that their cited sources do not support.
“Our evaluation demonstrates that current public systems fall short of their promise to deliver trustworthy, source-grounded synthesis,” said lead author Pranav Narayanan Venkit from Salesforce AI Research (SAIR) in Palo Alto, California.
The team evaluated 303 queries across two categories and checked answers against eight evidence and sourcing metrics.
Rates of unsupported claims ranged from roughly one-quarter to nearly one-half in search modes, and one deep research configuration reached 97.5 percent unsupported statements in its long reports.
The framework, called DeepTRACE, audits answers at the statement level to see what is said and whether the listed sources actually back it up.
The researchers defined eight dimensions and computed them for each tool’s output.
The eight are one-sided answers, overconfident answers, fraction of relevant statements, fraction of unsupported statements, fraction of uncited sources, source necessity, citation accuracy, and citation thoroughness.
A one-sided answer occurs when only one perspective is presented on a debate question.
Overconfidence is flagged when a highly confident tone accompanies a one-sided answer, which can mislead users into thinking a contested topic is settled.
Source necessity tests whether each listed source is truly needed to support the answer. Citation accuracy checks whether the specific sources cited for a sentence actually support that sentence, not just the general topic.
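To make the statement-level bookkeeping concrete, here is a minimal Python sketch. The Statement record, the function names, and the toy sentences are assumptions for illustration, not the DeepTRACE code; they simply mirror the definitions above, counting a statement as unsupported when no listed source backs it, and a citation as accurate only when it supports the exact sentence it is attached to.

```python
# Illustrative sketch only: these structures and names are assumptions,
# not the DeepTRACE implementation. Each statement records which cited
# sources (if any) were judged to actually support it.
from dataclasses import dataclass, field

@dataclass
class Statement:
    text: str
    cited_sources: list = field(default_factory=list)       # source IDs attached to this sentence
    supporting_sources: list = field(default_factory=list)  # subset judged to actually back it

def unsupported_fraction(statements):
    """Share of statements with no backing in any listed source."""
    if not statements:
        return 0.0
    return sum(1 for s in statements if not s.supporting_sources) / len(statements)

def citation_accuracy(statements):
    """Share of attached citations that support the exact sentence they annotate."""
    citations = [(s, src) for s in statements for src in s.cited_sources]
    if not citations:
        return 0.0
    return sum(1 for s, src in citations if src in s.supporting_sources) / len(citations)

# Toy answer with two sentences: one grounded, one not.
answer = [
    Statement("Solar capacity grew sharply last year.", ["src1"], ["src1"]),
    Statement("Experts agree fossil fuels will vanish by 2030.", ["src2"], []),
]
print(unsupported_fraction(answer))  # 0.5
print(citation_accuracy(answer))     # 0.5
```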
Across the search engines studied, unsupported claim rates varied widely. One system’s search mode had about 23 percent unsupported statements, while another reached 47 percent for the same class of tasks.
Deep research settings often reduced overconfident tone but did not eliminate unsupported content. One deep research agent reached 97.5 percent unsupported statements despite producing long, citation-heavy reports.
The evaluation also found frequent misattribution. Even when a supporting source existed, tools sometimes cited an irrelevant link rather than the correct one.
When answers lean one way on contentious topics, people can be pushed into a narrow lane of information.
That risk is compounded if the system sounds sure of itself while omitting counterarguments.
Independent work has shown that assistants can mirror a user’s stated views – a behavior known as sycophancy.
One recent paper reported that preference-tuned models often align with the user even when that reduces truthfulness.
The audit’s debate queries make that tendency visible. Rates of one-sidedness stayed high across engines and deep research modes.
Calibrating tone matters here. The study’s overconfidence metric penalizes confident language when balance is missing.
Citation accuracy in the audit ranged from about 40 percent to 80 percent, depending on the system. That spread means a link in an answer is not always the right link for the sentence it claims to support.
The authors also measured citation thoroughness, asking whether all available supporting links are cited where they belong. They warn that listing many links does not guarantee strong grounding.
“More sources and longer answers do not translate into reliability,” wrote Venkit. Users can be given a wall of links while the key claims remain weakly supported.
Source necessity helps cut through that fog. If only a small subset of links is truly necessary to support the factual claims, the rest may create false confidence.
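A simplified sketch of that idea follows, using hypothetical data structures rather than the paper's exact procedure: a listed source is treated as unnecessary when every sentence it supports is also backed by at least one other listed source.

```python
# Rough illustration of the source-necessity idea, not the paper's method.
# Each answer sentence is represented only by the set of listed sources
# judged to support it (names and structure are assumptions).

def unnecessary_sources(supported_by, listed_sources):
    """Return listed sources whose removal would leave every sentence still supported."""
    unnecessary = []
    for src in listed_sources:
        still_covered = all(
            len(supports - {src}) > 0          # another source also backs this sentence
            for supports in supported_by
            if src in supports
        )
        if still_covered:
            unnecessary.append(src)
    return unnecessary

# Toy answer: three sentences, four listed sources.
supported_by = [{"src1"}, {"src1", "src2"}, {"src3"}]
listed = ["src1", "src2", "src3", "src4"]
print(unnecessary_sources(supported_by, listed))  # ['src2', 'src4']
```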
The team used an internal judge model to score confidence, balance, and factual support.
To anchor those judgments, they compared model scores with human annotations on a subset and reported Pearson correlation values of about 0.72 for confidence and 0.62 for factual support.
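For readers unfamiliar with that statistic, the sketch below shows what such a comparison looks like. The scores are invented and the code is purely illustrative, not the study's pipeline; it only demonstrates how Pearson's r measures agreement between an automated judge and human raters.

```python
# Illustration only: the scores are made up, and this is not the study's data.
import numpy as np

judge_scores = np.array([0.9, 0.4, 0.7, 0.2, 0.8])   # automated judge ratings per answer
human_scores = np.array([0.85, 0.5, 0.6, 0.3, 0.9])  # human ratings of the same answers

# Pearson's r measures linear agreement between the two sets of ratings.
r = np.corrcoef(judge_scores, human_scores)[0, 1]
print(f"Pearson correlation: {r:.2f}")
```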
That approach allows them to scale to thousands of checks. It also raises fair questions about when automated judging should be paired with more extensive human review.
The dataset spans 303 queries across debate and expert topics. Debate prompts included a question about whether alternative energy can effectively replace fossil fuels, and expert prompts probed areas such as computational hydrology.
As with any benchmark, results are a snapshot in time. Systems change quickly, and follow-up audits will be needed to determine whether accuracy, balance, and sourcing improve.
The audit’s findings align with broader concerns about factual drift in long text generation.
A comprehensive survey on hallucinations in language generation documents how models can produce fluent but unsupported content across tasks.
Retrieval helps, but it does not solve everything. Models still need to attribute specific claims to specific lines of evidence.
Definitions matter here. An unsupported statement is a sentence without backing in any of the listed sources, and citation accuracy is the share of citations that support the exact sentence they are attached to.
These definitions are strict by design. They reflect how a careful reader would check claims against sources line by line.
Treat AI search like a first pass rather than a final verdict. If a sentence makes a strong claim, click through and look for the exact passage in the cited source that supports it.
Watch for a confident tone on disputed questions. If an answer sounds certain but does not present countervailing evidence, assume you are only getting part of the picture.
Look at how many sources are truly used. If only a few links are doing the real work, the rest may be window dressing.
Small habits go a long way. Skim the original material, compare at least two independent sources, and note when numbers in the answer do not appear in the link.
The study is published on arXiv.