Commercial AI chatbots are becoming news intermediaries. Their performance is high, uneven, and fragile.

We evaluated six production AI chatbots on same-day BBC News questions across six regional services. The best systems now answer many recent-news questions correctly, but aggregate accuracy masks regional inequity, retrieval dependence, opaque source selection, and vulnerability to misleading premises.


Mirac Suzgun¹, Emily Shen², Federico Bianchi³, Alexander Spangher¹,
Thomas Icard¹, Daniel E. Ho¹, Dan Jurafsky¹, and James Zou¹,³

¹ Stanford University ² Independent Researcher ³ Together AI

12,600 model-question instances across a 14-day real-time evaluation

94-96% multiple-choice accuracy for the strongest systems on same-day news questions

79.3% accuracy on Hindi, the lowest region for every model tested

>70% of errors traced to source divergence or retrieval failure

19-70% adversarial-accuracy spread when questions contained subtle false premises

Main finding

Aggregate accuracy is high, but unevenly distributed.

In clean multiple-choice conditions, the leading chatbots answered questions about events reported hours earlier with high accuracy: Gemini 3 Flash reached 95.6%, Grok 4 reached 95.0%, and Gemini 3 Pro reached 93.7% across the full 14-day study.

But the same evaluation shows that reliability is not evenly distributed. The systems struggled most with Hindi news, often retrieving English-language sources such as Wikipedia or English summaries instead of local-language reporting. The resulting answers can look plausible in general while being wrong about the specific fact reported in the article.

A substantial advance in real-time factual accuracy

The top four systems exceeded 90% accuracy on recent news questions, a substantial advance over earlier real-time QA benchmarks.

A regional gap obscured by aggregate accuracy

Hindi accuracy was 79.3%, compared with 88.9–91.3% for the other five regions, and every model performed worst on Hindi.

Retrieval is the bottleneck

Source divergence and retrieval failure accounted for more than 70% of errors. When models found the right evidence, they usually extracted the answer.

Competence does not imply robustness

False-premise variants caused sharp drops: adversarial accuracy ranged from 70.0% for Grok 4 to 19.0% for GPT-5.

Inside the benchmark

What a question looks like.

Each day, the pipeline generates 25 five-option multiple-choice questions per region, each anchored to a specific same-day BBC article. Questions target concrete details that typically survive only in the originating article — exact figures, named sources, locations, and time–place pairs.

Four sample benchmark questions rendered in their original scripts, with the correct answer in each highlighted in green.

**English (US & Canada, Feb 9):** the host city of the second round of US-brokered Ukraine–Russia peace talks. **Turkish (Feb 10):** the date Keçiören became a separate district. **Arabic (Feb 11):** the cumulative number of ISIS-affiliated detainees transferred from Syria to Iraq, attributed to spokesperson Saad Maan. **Hindi (Feb 12):** the time and place on VIP Road of a Lamborghini crash in Kanpur. Distractors are constructed to be plausible yet article-incorrect, so a model that retrieves a topically related but distinct article will typically pick a wrong option.
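As a rough illustration of the question format described above, the sketch below models one benchmark item as a small Python record. The class name, field names, placeholder URL, and option strings are hypothetical and are not taken from the paper's released code; only the five-option constraint and the anchoring to a single article come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NewsQuestion:
    """One benchmark item (hypothetical schema, for illustration only)."""
    region: str          # e.g. "Hindi" or "US & Canada"
    article_url: str     # the same-day BBC article the question is anchored to
    prompt: str          # question text in the regional service's language
    options: tuple       # exactly five answer options
    answer_index: int    # position of the article-correct option

    def __post_init__(self):
        # Validate the five-option format stated in the text.
        if len(self.options) != 5:
            raise ValueError("expected exactly five options")
        if not 0 <= self.answer_index < 5:
            raise ValueError("answer_index must point at one of the options")

    def is_correct(self, chosen_index: int) -> bool:
        """Score a model's choice against the article-derived answer."""
        return chosen_index == self.answer_index


# Placeholder content: the URL and option strings are invented, not benchmark data.
example = NewsQuestion(
    region="US & Canada",
    article_url="https://www.bbc.com/news/example",
    prompt="Which city hosted the second round of talks?",
    options=("City A", "City B", "City C", "City D", "City E"),
    answer_index=2,
)
print(example.is_correct(2))  # True
print(example.is_correct(0))  # False
```

Storing the correct option as an index keeps scoring independent of the option text, which matters when the same schema has to hold questions in six different scripts.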

Figures from the paper

Key findings.

Nine figures map the evaluation along five axes. Figures 1–2 give overall accuracy by model and by region, and the model-by-region interaction—identifying both the headline accuracy level and the Hindi gap that aggregate averages obscure. Figure 3 reports the gap between multiple-choice and free-response scoring on a paired-evaluation subset, bounding how much the headline format inflates absolute accuracy. Figures 4–6 examine citation behavior: how often each provider attributes to BBC despite its scraping restrictions, which domains dominate non-English regions (English Wikipedia tops the Hindi panel), and how widely two providers' citation profiles diverge on identical queries. Figure 7 quantifies the contribution of live retrieval through a no-search ablation. Figures 8–9 turn to robustness: how accuracy degrades when questions contain subtly false premises, and the dissociation between noticing such premises and recovering the correct answer.

Bar chart of multiple-choice accuracy by model. Gemini 3 Flash 95.6%, Grok 4 95.0%, Gemini 3 Pro 93.7%, Claude 4.5 Sonnet 90.4%, GPT-5 85.0%, GPT-4o Mini 69.0%.

**14 days · 12,600 model–question instances · web search enabled.** Bars show mean accuracy across all 14 days and all 6 regions. The right-margin column is mean errors per day out of 150 questions.

Bar chart of accuracy by region. Five regions (US & Canada, Turkish, Russian, Arabic, Afrique) cluster between 88.9% and 91.3%; Hindi sits alone at 79.3% (highlighted in red).

**Aggregated across all six models.** Five regions cluster within 88.9–91.3%. Hindi falls nearly 10 percentage points below the next-lowest region (79.3% versus 88.9%), and the gap persists when GPT-4o Mini is excluded.

Heatmap of accuracy by model and region. Cells are shaded from deep teal (high accuracy) through cream (mid) to coral red (low). The Hindi column is visibly warmer than every other column; GPT-4o Mini's row is the warmest overall, with its Hindi cell at 54.3% the deepest coral.

**14-day mean accuracy (%) per model–region cell, with ±1 SD across days.** The right column is the row mean; the bottom row is the column mean. Cell shade encodes accuracy: deep teal is high, coral is low. The Hindi column visibly warms for every model, and weaker models (GPT-5, GPT-4o Mini) show wider regional spread than stronger ones: the regional gap widens as model capability falls.
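A minimal sketch of how one heatmap cell (a mean with ±1 SD across days) could be computed. The daily values are invented and shortened to five days for brevity, and the use of the sample standard deviation (`statistics.stdev`) rather than the population form (`statistics.pstdev`) is an assumption; the page does not say which the paper uses.

```python
from statistics import mean, stdev


def cell_summary(daily_accuracy):
    """Return (mean, sample SD) of daily accuracy percentages for one cell."""
    return mean(daily_accuracy), stdev(daily_accuracy)


# Invented daily accuracies (percent correct out of 25 questions) for one
# model-region cell; the real study spans 14 days.
daily = [80.0, 76.0, 84.0, 72.0, 88.0]
cell_mean, cell_sd = cell_summary(daily)
print(f"{cell_mean:.1f} +/- {cell_sd:.1f}")  # 80.0 +/- 6.3
```

With only 25 questions per region per day, a single question moves daily accuracy by 4 percentage points, so day-to-day SDs of several points are expected even for a stable system.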

The Hindi gap is not a failure of language comprehension. Models generate fluent Hindi and reason competently in it. The failures are overwhelmingly failures of retrieval and grounding: models pivot to English-language sources covering the same topic but reporting different specific details, and answer faithfully from those substitutes.

Per-model dumbbell chart showing multiple-choice accuracy versus the range of free-response accuracy across three independent LLM judges. Gemini 3 Pro 97 / 82-86; Grok 4 92.7 / 78.7-79.3; Gemini 3 Flash 92.7 / 78.0-78.7; Claude 4.5 Sonnet 93.3 / 75.3-76.7; GPT-5 81.3 / 64.7-66.0; GPT-4o Mini 69.3 / 45.3-47.3. Right-margin column reports the MC minus best-FR-grader gap.

**MC vs. free-response on a single-day validation set (April 6, 2026; 850 paired items).** Per-model marker = multiple-choice accuracy; slate-blue bracket = the range across three independent LLM judges (Claude Opus 4.7, GPT-5.4, Gemini 3 Pro) grading the free-response variant. The MC-to-FR drop is 16–17 percentage points in aggregate; per-model gaps range from 11 points (Gemini 3 Pro) to 22 points (GPT-4o Mini). Crucially, model *rankings* are identical under both formats and across all three graders, so the relative findings reported throughout the paper (the Hindi gap, retrieval dominance, adversarial fragility) are robust to evaluation format, even as absolute accuracy levels should be read as upper bounds on what these systems would achieve in naturalistic open-ended interactions.
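The per-model gap in the right-margin column can be reproduced from the numbers quoted in the chart description. The sketch below assumes that "best FR grader" means the judge giving the highest free-response accuracy, and it rounds to one decimal place; the function and variable names are illustrative.

```python
# MC accuracy and the (lowest, highest) free-response accuracy across the
# three judges, copied from the chart description above (percent).
results = {
    "Gemini 3 Pro": (97.0, (82.0, 86.0)),
    "Grok 4": (92.7, (78.7, 79.3)),
    "Gemini 3 Flash": (92.7, (78.0, 78.7)),
    "Claude 4.5 Sonnet": (93.3, (75.3, 76.7)),
    "GPT-5": (81.3, (64.7, 66.0)),
    "GPT-4o Mini": (69.3, (45.3, 47.3)),
}


def mc_minus_best_fr(mc, fr_range):
    """Gap in percentage points between MC and the highest-scoring FR judge."""
    return round(mc - max(fr_range), 1)


gaps = {model: mc_minus_best_fr(mc, fr) for model, (mc, fr) in results.items()}
for model, gap in gaps.items():
    print(f"{model}: {gap:.1f} points")
```

The smallest and largest values in `gaps` are 11.0 (Gemini 3 Pro) and 22.0 (GPT-4o Mini), matching the range stated in the caption.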

Bar chart of BBC citation rate by model. Grok 4 28.5%, Gemini 3 Flash 6.9%, Gemini 3 Pro 4.1%, GPT-5 0.2%, GPT-4o Mini 0.0%, Claude 4.5 Sonnet 0.0%.

**Rate = responses with any BBC URL among citations.** Three models effectively never cite the BBC (0.0–0.2%), even though every benchmark question is generated from a BBC article. The divergence likely reflects differences in scraping policy and licensing compliance as much as retrieval capability: providers that comply with the BBC's `robots.txt` restrictions cite the BBC less, regardless of how well their retrieval works.
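The rate definition above (a response counts if any of its citations is a BBC URL) can be sketched as follows. The list of BBC hostnames and the subdomain-matching rule are assumptions made for illustration; the page does not specify how the paper matches domains, and the example URLs are invented.

```python
from urllib.parse import urlparse

# Assumed BBC domains; the paper's exact matching rule is not stated here.
BBC_DOMAINS = ("bbc.com", "bbc.co.uk")


def cites_bbc(citation_urls):
    """True if any cited URL's host is a BBC domain or a subdomain of one."""
    for url in citation_urls:
        host = (urlparse(url).hostname or "").lower()
        if any(host == d or host.endswith("." + d) for d in BBC_DOMAINS):
            return True
    return False


def bbc_citation_rate(responses):
    """Percentage of responses with at least one BBC URL among their citations."""
    if not responses:
        return 0.0
    hits = sum(cites_bbc(urls) for urls in responses)
    return 100.0 * hits / len(responses)


# Invented citation lists for four responses.
responses = [
    ["https://www.bbc.com/news/articles/example"],
    ["https://en.wikipedia.org/wiki/Example"],
    ["https://t24.com.tr/example", "https://www.bbc.co.uk/turkce/example"],
    [],
]
print(bbc_citation_rate(responses))  # 50.0
```

Matching on the parsed hostname rather than a substring avoids counting look-alike hosts such as `notbbc.com`.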

Six-panel grid showing the top eight cited domains for each of the six regional query sets, with sources whose primary language differs from the regional service highlighted in slate blue. English Wikipedia dominates Hindi; English BBC and Guardian appear high for Russian and Arabic; only Turkish has a native-language source (T24) at #1.

**Top eight cited domains per region (14-day mean, all models aggregated).** Sources whose primary language differs from the regional service are highlighted in slate blue. English Wikipedia tops the Hindi panel, surpassing every Hindi-language outlet on our index. The same English-pivot pattern is visible, in milder form, across most non-English regions.

Six-panel dumbbell chart showing the per-model citation frequency for each of the eight top domains in each of the six regions. Each row's horizontal span between markers indicates how widely two models' citation profiles diverge for the same query set. Grok 4 reaches the BBC much more often than any other model; only the Gemini family cites VertexAI Search; Claude shows distinctive regional preferences.

**Per-model variation in domain reliance, by region.** Same domains as Figure 5; here each marker shape encodes one of the six models. The horizontal span within a row shows how much two providers' citation profiles diverge for an identical query set. Users asking the same question of different chatbots receive answers grounded in materially different source material, a form of information fragmentation specific to AI-mediated access.

Models almost always extract the correct answer when they retrieve the correct source. The binding constraint is the fidelity of the connection between query and evidence — what the paper calls evidence binding.

Dumbbell chart showing accuracy for four frontier models with and without web search on US & Canada questions. Filled marker indicates production (search-enabled) accuracy; hollow ring indicates the same model's accuracy with web search disabled. Grok 4 drops from near-perfect to roughly half (a 46-percentage-point gap); GPT-5 drops by 31 percentage points. Right-margin column lists each model's gap.

**Web-search ablation, US & Canada questions (14-day mean).** Filled marker = production (search-enabled); hollow ring = the same model with web search disabled. With search disabled, accuracy falls to 51–61%. The top systems' headline accuracy depends almost entirely on effective retrieval: Grok 4's gain from search is the largest (+46 percentage points) and GPT-5's the smallest (+31 percentage points). The practical reliability of these systems is therefore downstream of search infrastructure, not just model capability.

Slope chart showing standard vs adversarial accuracy for four frontier models. Grok 4 drops from 96.0% to 70.0%, Gemini 3 Pro from 96.0% to 55.0%, Claude 4.5 Sonnet from 92.0% to 46.0%, and GPT-5 from 88.0% to 19.0%.

**Standard vs. adversarial accuracy on US & Canada questions (4-day mean).** Each connector spans one model's drop when a subtle false premise is injected into otherwise well-formed news questions. The standard-condition spread is 8.0 percentage points (88.0–96.0%); under adversarial framing the spread widens to 51.0 points. GPT-5 falls to 19.0%, barely above the 16.7% chance baseline on the six-option adversarial instrument.
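The chance baseline and the two spreads quoted above follow directly from the reported accuracies. The short sketch below reproduces that arithmetic; it assumes uniform guessing over the six options, and the variable names are illustrative.

```python
# Accuracy figures (percent) quoted in the caption above.
standard = {"Grok 4": 96.0, "Gemini 3 Pro": 96.0,
            "Claude 4.5 Sonnet": 92.0, "GPT-5": 88.0}
adversarial = {"Grok 4": 70.0, "Gemini 3 Pro": 55.0,
               "Claude 4.5 Sonnet": 46.0, "GPT-5": 19.0}


def spread(scores):
    """Difference between the best and worst model, in percentage points."""
    return max(scores.values()) - min(scores.values())


chance = 100.0 / 6  # uniform guessing over six options

# Per-model drop from the standard to the adversarial condition.
drops = {model: standard[model] - adversarial[model] for model in standard}

print(f"chance baseline: {chance:.1f}%")                  # 16.7%
print(f"standard spread: {spread(standard):.1f}")         # 8.0
print(f"adversarial spread: {spread(adversarial):.1f}")   # 51.0
print(drops)
```

The per-model drops work out to 26, 41, 46, and 69 percentage points, so the ordering of models by fragility differs from their ordering by standard accuracy.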

Scatter plot of detection rate (x-axis) vs adversarial accuracy (y-axis) for four frontier models, with provider-colored markers, a 16.7% chance line (six-option instrument), and a y=x reference line. Gemini 3 Pro at (80%, 55%); Claude 4.5 Sonnet at (78%, 46%); Grok 4 at (59%, 70%); GPT-5 at (36%, 19%). The misalignment between detection and correct answers is visually obvious.

**False-premise detection rate vs. adversarial accuracy** (100 adversarial questions per model). Detection requires the model to explicitly flag or correct the injected falsehood in its reasoning; adversarial accuracy is whether the final answer is correct. The two axes *dissociate*: Claude 4.5 Sonnet detects 78% but answers only 46% correctly, while Grok 4 detects 59% yet answers 70% correctly. Robustness to false premises requires both *epistemic vigilance* (noticing) and *retrieval strength* (recovering), capabilities that current systems develop independently.
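The two plotted rates are marginals. A per-question tally shows how often detection and a correct answer occur together. The sketch below is one way to compute both from the same records, under the assumption that each adversarial response is annotated with two booleans; the record schema and the ten example records are hypothetical and do not come from the paper's data.

```python
from collections import Counter


def premise_outcomes(records):
    """Tally (detected, correct) pairs and return both marginal rates in percent."""
    if not records:
        raise ValueError("need at least one record")
    table = Counter((r["detected"], r["correct"]) for r in records)
    n = len(records)
    detection_rate = 100.0 * (table[(True, True)] + table[(True, False)]) / n
    accuracy = 100.0 * (table[(True, True)] + table[(False, True)]) / n
    return table, detection_rate, accuracy


# Invented annotations for ten adversarial responses from one model.
records = (
    [{"detected": True, "correct": True}] * 3
    + [{"detected": True, "correct": False}] * 4
    + [{"detected": False, "correct": True}] * 1
    + [{"detected": False, "correct": False}] * 2
)
table, detection_rate, accuracy = premise_outcomes(records)
print(detection_rate, accuracy)   # 70.0 40.0
print(table[(True, False)])       # 4 responses flagged the premise but answered wrongly
```

The `(True, False)` and `(False, True)` cells are where the dissociation shows up: a model can flag the false premise and still fail to recover the right answer, or answer correctly without ever flagging it.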

Study design

A real-time test of production systems, not base models in isolation.

Collect same-day news

Each day, the pipeline collected top articles from six BBC regional services: US & Canada, Arabic, Afrique, Hindi, Russian, and Turkish.

Generate grounded questions

The pipeline created 25 five-option factual questions per region per day, targeting concrete details such as figures, locations, quotes, and named entities.

Evaluate commercial chatbots

Six production chatbots were queried in parallel with native web search enabled, reflecting the retrieval and synthesis systems users actually encounter.

Analyze reliability

The study measured accuracy, citations, source domains, error mechanisms, search ablation, and robustness to subtle false premises.
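The design steps above imply the headline instance count. The sketch below spells out that arithmetic using the regions, models, and per-day question quota named on this page; the constant names are illustrative.

```python
# Regions and models as listed on this page.
REGIONS = ["US & Canada", "Arabic", "Afrique", "Hindi", "Russian", "Turkish"]
MODELS = ["Gemini 3 Flash", "Gemini 3 Pro", "Grok 4",
          "Claude 4.5 Sonnet", "GPT-5", "GPT-4o Mini"]
QUESTIONS_PER_REGION_PER_DAY = 25
DAYS = 14

questions_per_day = len(REGIONS) * QUESTIONS_PER_REGION_PER_DAY  # 150
total_questions = questions_per_day * DAYS                        # 2,100
total_instances = total_questions * len(MODELS)                   # 12,600

print(questions_per_day, total_questions, total_instances)  # 150 2100 12600
```

Each model therefore sees 2,100 questions over the study, or 350 per region, which is the denominator behind every per-cell accuracy in the model-by-region results.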

**Models evaluated.** All six were queried at temperature 0.0 with each provider's native web search enabled. Knowledge-cutoff dates all precede the 14-day evaluation window (Feb 9–22, 2026).

| Model | API identifier | Knowledge cutoff | Released |
| --- | --- | --- | --- |
| Gemini 3 Flash | `gemini-3-flash-preview` | Jan 2025 | Dec 17, 2025 |
| Gemini 3 Pro | `gemini-3-pro-preview` | Jan 2025 | Nov 18, 2025 |
| Grok 4 | `grok-4-0709` | Nov 2024 | Jul 9, 2025 |
| Claude 4.5 Sonnet | `claude-sonnet-4-5` | Jan 2025 | Sep 29, 2025 |
| GPT-5 | `gpt-5` | Sep 30, 2024 | Aug 7, 2025 |
| GPT-4o Mini | `gpt-4o-mini-search-preview` | Oct 1, 2023 | Mar 11, 2025 |

Why it matters

AI-mediated news access is becoming infrastructure.

As more people ask chatbots for news, model choice affects more than answer quality. It affects which sources are surfaced, which languages receive reliable grounding, which publishers receive attribution, and whether users are warned when their question contains a false premise.

The study suggests that evaluating AI news systems on aggregate accuracy alone is insufficient. Public-interest evaluation should also measure retrieval fidelity across languages, source attribution, licensing constraints, and robustness to imperfect user questions.

Important caveats

What this study does and does not claim.

Multiple choice is an upper bound

A free-response validation showed a 16–17 percentage-point drop in absolute accuracy, though model rankings remained stable across all three independent LLM judges.

BBC is a special case

BBC News is prominent and trusted, but access may be shaped by robots.txt, scraping restrictions, and provider licensing arrangements.

Geography may matter

Queries were issued from U.S.-based servers, which may affect search personalization and especially the retrieval of local-language sources.

Systems change quickly

The 14-day evaluation captures a snapshot of production systems whose search and model behavior can shift over time.

Cite this work

How to cite this work.

@article{suzgun2026news,
  title   = {Evaluating Commercial AI Chatbots as News Intermediaries},
  author  = {Suzgun, Mirac and Shen, Emily and Bianchi, Federico and
             Spangher, Alexander and Icard, Thomas and Ho, Daniel E. and
             Jurafsky, Dan and Zou, James},
  journal = {arXiv preprint arXiv:2605.22785},
  year    = {2026}
}

Each question is answerable only by retrieving an article-specific detail.

Top systems answer 90 – 96% of same-day news questions correctly.

Five regions cluster within 89 – 91%; Hindi is alone, nearly ten points lower.

Every model performs worst on Hindi.

Free-response scoring lowers accuracy by 16–17 percentage points, but model rankings are preserved.

Grok 4 cites the BBC at more than four times the rate of any other model evaluated.

For non-English regions, English-language domains often outrank native-language outlets.

Different chatbots cite materially different sources for the same query.

Disabling web search drops accuracy by 31–46 percentage points.

Subtle false premises drop top systems from 88 – 96% accuracy to as low as 19%.

Noticing a false premise and answering correctly are independent capabilities.