In 252,000 Trials, AI Answer Engines Cited Topical, Top-Listed Sources First

Key takeaways

Topical relevance beat cosmetic polish
First position still carried real weight
Price and recent timestamps helped
Formatting-only edits changed little

If your company wants to show up in AI answers, ranking near the top of retrieval is only half the battle. These systems often cite just a few sources, so the real prize is being named in the response at all. This paper asks a simple question with big stakes: when two retrieved pages compete, which one gets cited first? The authors built a controlled two-document retrieval-augmented generation (RAG) testbed, a setup that feeds exactly two candidate sources into the model. They ran 252,000 trials across six large language models, changing one of 18 content factors at a time and using brand anonymization plus counterbalanced source order to separate content effects from position bias. The clearest drivers of first citation were topical relevance and list position. Explicit price information and a recent timestamp also helped consistently. Completeness and trust cues gave smaller gains, while formatting-only changes had little impact. The paper also releases a reproducible evaluation protocol and a prioritized Generative Engine Optimization (GEO) checklist. In an early internal pilot at Sprinklr, teams reported positive qualitative feedback on workflow usability.

A source can do everything right and still vanish inside an AI answer. That is the odd new reality of search engines that answer with citations: only a few retrieved pages get named in the final response, so being visible now means more than ranking near the top. This study asks the question most site owners and product teams now care about: when two pages compete inside the model's context, which one gets cited first? The answer is less glamorous than many marketing guides suggest. The models kept reaching for the page that fit the topic best and, just as importantly, the page that came first in the list. Style mattered, but much less than substance and position.

What actually made one source win the first citation?

Across six LLMs, the testbed ran 252,000 trials, varying one of 18 content factors at a time. Each trial used exactly two candidate sources, and the outcome was which of the two received the first citation marker in the model's answer. That design matters because it turns a messy ranking problem into a clean head-to-head match. The biggest winners were topical relevance and list position, which means the model did not treat all retrieved pages as equal once they entered the prompt. Explicit price information and a recent timestamp also helped consistently. Completeness and trust cues added smaller gains, while formatting-only edits barely moved the needle. In plain terms: if a page is closer to the user's question, and if it shows up first, it has the best shot at being the one the model names first.
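To make the outcome measure concrete, here is a minimal Python sketch of how a first citation could be read off a model's answer. It assumes sources are cited with bracketed numeric markers such as [1] and [2]; this summary does not state the marker format the authors used, and the function names are illustrative, not the paper's code.

```python
import re
from typing import List, Optional

# Assumption: citations appear as bracketed numbers, e.g. "[1]" or "[2]".
CITATION_MARKER = re.compile(r"\[(\d+)\]")


def first_cited_source(answer: str, num_sources: int = 2) -> Optional[int]:
    """Return the 1-based index of the first valid source cited, or None."""
    for match in CITATION_MARKER.finditer(answer):
        index = int(match.group(1))
        # Skip markers that do not point at one of the supplied sources.
        if 1 <= index <= num_sources:
            return index
    return None


def first_citation_rate(answers: List[str], target: int) -> float:
    """Share of answers (among those citing anything) that cite `target` first."""
    firsts = [first_cited_source(answer) for answer in answers]
    cited = [index for index in firsts if index is not None]
    if not cited:
        return 0.0
    return sum(1 for index in cited if index == target) / len(cited)
```

Answers with no valid marker are left out of the rate rather than counted as a loss. That is one reasonable choice for a sketch, not necessarily how the authors handled such cases.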

How the test stripped away the noise

The setup used a two-document retrieval-augmented generation testbed, but the trick was in how carefully it controlled the contest. In each run, the two sources differed in exactly one factor, so the result could be tied to that single change instead of a pile of changes at once. Brand anonymization removed recognition effects, while counterbalanced source order helped separate content from simple placement bias. Mixed-effects models then pulled the pattern out across all the repeated trials, showing which factors kept helping across models and which ones only looked good in one setting. That is why the method reads like a fair race rather than a noisy popularity contest: one variable changes, the rest stay still, and the first citation reveals what the model actually prefers.
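The logic of counterbalancing can be sketched in a few lines of Python. This is a simple descriptive split under stated assumptions, not the mixed-effects analysis the authors ran: the `TrialResult` record and the `summarize` function are hypothetical names, and each trial is assumed to end with exactly one of the two sources cited first.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TrialResult:
    """One head-to-head trial (hypothetical record layout)."""
    variant_listed_first: bool  # True if the edited page was shown as source 1
    variant_cited_first: bool   # True if the edited page got the first citation


def summarize(results: List[TrialResult]) -> Dict[str, float]:
    """Split first-citation outcomes into a content effect and a position effect."""
    listed_first = [r for r in results if r.variant_listed_first]
    listed_second = [r for r in results if not r.variant_listed_first]
    if not listed_first or not listed_second:
        raise ValueError("counterbalancing needs trials in both source orders")

    # Win rate of the edited page in each source order.
    p_first = sum(r.variant_cited_first for r in listed_first) / len(listed_first)
    p_second = sum(r.variant_cited_first for r in listed_second) / len(listed_second)

    return {
        # Edited page's advantage averaged over both orders (0 means no effect).
        "content_effect": (p_first + p_second) / 2 - 0.5,
        # Advantage of being listed first, averaged over both pages.
        "position_effect": (p_first - p_second) / 2,
    }
```

Because every pairing runs in both orders, a page that wins only when it is listed first shows up as a position effect, while a page that wins in either slot shows up as a content effect.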

252,000 trials across six LLMs in a paired two-document RAG testbed

Topical relevance and list position drove first citations most strongly.
Explicit price information and recent timestamps helped consistently.
Completeness and trust cues gave smaller gains.
Formatting-only edits had little impact.

“topical relevance and list position are the biggest drivers of being cited first”

From the abstract

“Formatting-only edits have little impact.”

Why this changes the GEO playbook

For anyone trying to win visibility in AI answers, this shifts the game from decoration to substance. The results suggest that generative engine optimization is less about making pages look busy and more about making them easy to prefer when the model compares two options side by side. That is a sharper target than classic search advice, because the prize is not only rank but citation. A source that appears in the answer may get seen even if the user never clicks through, while an uncited source can disappear entirely. The practical lesson is simple: topical fit comes first, then placement, then concrete details such as an explicit price and a recent timestamp.

The next test is not another slogan

The release of a reproducible evaluation protocol and a prioritized GEO checklist gives others a way to test the same question in new settings, which is more useful than another broad claim about AI search. The paper also says an early internal pilot at Sprinklr got positive qualitative feedback on workflow usability, so the method is not staying on the whiteboard. The next hard question is whether the same ranking of factors holds when the two-source contest is replaced by more crowded answer engines, or when the content mix changes across domains. If topical relevance and top position keep winning there, then the quiet logic of citation may be more stable than the flashy interface around it.

