ResearchAudio.io
Repeat Your Prompt Twice. Get Better Answers.
47 wins, 0 losses across 7 models. No extra tokens generated.
Copy your prompt. Paste it twice. That is the entire technique. A team at Google Research ran this across models including Gemini, GPT-4o, Claude 3.7 Sonnet, and DeepSeek V3, and the doubled prompt improved accuracy in 47 out of 70 benchmark-model combinations, with zero losses and no increase in output length or latency. The paper is titled "Prompt Repetition Improves Non-Reasoning LLMs" (arXiv:2512.14982). At first glance, it sounds like a trick. On inspection, it maps directly to how transformer attention works.

Why Causal Attention Creates the Problem

LLMs use causal (left-to-right) attention: each token can only attend to tokens that appear before it in the sequence. This creates an asymmetry in how context is processed depending on word order. A question placed after a long passage of context can attend to all of it. A question placed before the context cannot attend to anything that follows.
This is why "options-first, question-last" and "question-first, options-last" configurations produce different accuracy scores on the same multiple-choice benchmark. The model processes them differently because its attention mechanism literally sees different information at each token position.
Prompt repetition sidesteps this. When the full query appears twice in sequence, every token in the second copy can attend to every token in the first copy. The prompt effectively becomes bidirectional. The researchers note this mirrors what reasoning models often do on their own: extended chain-of-thought outputs frequently begin by restating the problem, which achieves the same effect at the cost of extra generated tokens. Prompt repetition moves that repetition to the prefill stage, where computation is parallelized and does not affect output length.
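
To make the visibility argument concrete, here is a toy sketch (not from the paper) using a boolean causal mask: in a single copy of the prompt, early tokens cannot attend to anything that follows, while every token in the second copy of a repeated prompt can attend to the entire first copy.

import numpy as np

def causal_visibility(n_tokens):
    # Row i is True at every position token i is allowed to attend to.
    return np.tril(np.ones((n_tokens, n_tokens), dtype=bool))

prompt_len = 6                              # pretend the query is 6 tokens long
single = causal_visibility(prompt_len)
doubled = causal_visibility(2 * prompt_len)

# Single prompt: the first token sees only itself (1 of 6 positions).
print(int(single[0].sum()))                            # -> 1

# Repeated prompt: the first token of the second copy already sees the
# whole first copy, i.e. the full query.
print(bool(doubled[prompt_len, :prompt_len].all()))    # -> True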
How the Technique Works
What the Experiments Measured
The team tested 7 models across 7 benchmarks: ARC Challenge, OpenBookQA, GSM8K, MMLU-Pro, MATH, and two custom tasks they designed to specifically stress-test causal attention limits. All tests ran via official provider APIs in February and March 2025, with requests interleaved in round-robin order to minimize provider-side variance.
The custom tasks are worth understanding in detail. NameIndex gives the model a list of 50 names and asks it to retrieve the 25th one. This is a pure middle-of-sequence retrieval problem, where causal attention degrades significantly. MiddleMatch gives a list of 40 items (with repetitions drawn from a pool of 10) and asks what appears between two specific items. Both tasks show dramatic gains with prompt repetition: Gemini 2.0 Flash-Lite went from 21.33% to 97.33% on NameIndex. The x3 variant (three repetitions) showed even stronger results on these tasks.
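
For readers who want to probe this on their own models, here is a rough reconstruction of the two task formats. The exact wording, list construction, and answer checking in the paper may differ, so treat these prompts as illustrative rather than as the authors' code.

import random

def name_index_prompt(names, k=25):
    # Middle-of-sequence retrieval: 50 names, ask for the k-th one.
    assert len(names) == 50
    listing = "\n".join(f"{i + 1}. {name}" for i, name in enumerate(names))
    return f"Here is a list of names:\n{listing}\nWhat is name number {k} in the list?"

def middle_match_prompt(pool):
    # 40 items drawn with repetition from a pool of 10; ask what appears
    # between two neighbouring items.
    assert len(pool) == 10
    items = [random.choice(pool) for _ in range(40)]
    i = random.randint(1, len(items) - 2)   # pick an interior position
    listing = ", ".join(items)
    return (f"Here is a list of items: {listing}\n"
            f"Which item appears between '{items[i - 1]}' and '{items[i + 1]}'?")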
The paper also tested a padding control: inputs padded with periods to match the length of the repeated prompt. Padding produced no accuracy gain, ruling out the hypothesis that longer inputs alone help. The gain is specifically attributable to repeated semantic content, not input length.
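
A padding control of this kind is straightforward to reproduce. The sketch below pads with periods to matching character length; whether the paper matched on characters or tokens is an assumption here.

def repeated(query):
    return f"{query}\n{query}"

def padded_control(query):
    # Same length as the repeated prompt, but the extra characters carry
    # no semantic content.
    target = len(repeated(query))
    return query + "." * (target - len(query))

q = "What is name number 25 in the list?"
assert len(padded_control(q)) == len(repeated(q))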
Key Insight: For options-first multiple-choice prompts (where the model reads the answer choices before seeing the question), prompt repetition produces larger gains than for question-first layouts. Causal attention cannot relate the options back to a question it has not yet seen. The repeated prompt fixes this by providing a second pass where the full context is visible.
Latency: What the Data Shows
Across all seven models, average and median output token counts were statistically identical between baseline and prompt repetition. Because prefill is parallelized, doubling the input does not add proportional wall-clock time. The researchers measured end-to-end API latency and found no meaningful increase for standard prompt lengths.
The one exception is worth noting: Anthropic's models (Claude 3 Haiku and Claude 3.7 Sonnet) showed latency increases for the longest inputs (from the NameIndex and MiddleMatch datasets and the x3 repetition variant). The paper attributes this to prefill processing time on very long contexts. For typical prompt lengths, this effect does not appear.
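
If you want to verify the latency claim against your own provider, a minimal timing harness like the sketch below works. The model name and prompt are placeholders, and single requests are noisy, so average over many runs before drawing conclusions.

import time
from openai import OpenAI

client = OpenAI()

def timed_completion(prompt, model="gpt-4o-mini"):
    # Returns (end-to-end seconds, number of generated tokens).
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return time.perf_counter() - start, resp.usage.completion_tokens

query = "List the prime numbers between 10 and 50."
print("baseline:", timed_completion(query))
print("repeated:", timed_completion(f"{query}\nLet me repeat that:\n{query}"))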
Key Insight: Unlike chain-of-thought prompting ("think step by step"), which increases output token count and latency proportionally, prompt repetition operates entirely in the prefill stage. The model generates the same number of tokens either way. For latency-sensitive production systems, this distinction matters.
When Reasoning Is Enabled
The title specifies "non-reasoning LLMs" deliberately. When the team tested models with explicit chain-of-thought prompting, the results were largely neutral: 5 wins, 1 loss, and 22 ties across 28 tests. The explanation is intuitive. Reasoning models trained with reinforcement learning already learn to repeat parts of the input as part of their extended thinking process. Prompt repetition in the system prompt becomes redundant once the model does this internally in its output.
Key Insight: One future direction the paper outlines is to fine-tune a model on repeated prompts, then train it with RL to stop repeating the input itself. The model would internalize bidirectional attention patterns during training without needing explicit repetition at inference time.
Applying This in Practice
The implementation requires no code changes to the model or inference infrastructure. The output format is identical to baseline, so the technique is compatible with any downstream parser or structured output schema. The template for the verbose variant is:
{query}
Let me repeat that:
{query}
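
A drop-in helper makes this a one-line change in most pipelines. The function and the options-first example below are illustrative, not taken from the paper.

def repeat_prompt(query):
    # Verbose variant of the template above.
    return f"{query}\nLet me repeat that:\n{query}"

mcq = (
    "Options:\n(A) 12\n(B) 14\n(C) 16\n(D) 18\n"
    "Question: What is 7 + 9?\n"
    "Answer with the letter only."
)
print(repeat_prompt(mcq))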
The paper identifies the tasks most likely to benefit: retrieval from the middle of long lists, options-first multiple choice, and any structure where critical context appears before the question. Tasks with short symmetric prompts show smaller but still positive gains. The technique performs best when the model would otherwise need to attend backwards across a long context window.
The Deeper Implication
The paper frames prompt repetition as a practical workaround for a structural property of causal language models. But the striking result (47 wins, 0 losses, across seven distinct model families from four different organizations) suggests the underlying limitation is more consequential than commonly assumed. Every non-reasoning model in production today processes prompts with this same asymmetry. The gains from a simple copy-paste may be the visible surface of a much deeper accuracy gap in how these models handle context order.
ResearchAudio.io
Source: Prompt Repetition Improves Non-Reasoning LLMs (arXiv:2512.14982)
Vibe code with your voice
Wispr Flow lets you dictate prompts, PRDs, bug reproductions, and code review notes directly in Cursor, Warp, or your editor of choice. Speak your instructions and Flow auto-tags file names, preserves variable names and inline identifiers, and formats lists and steps for pasting straight into GitHub, Jira, or Docs. That means less retyping, fewer copy-and-paste errors, and faster triage. For deeper context and examples, see our Vibe Coding article on wisprflow.ai. Try Wispr Flow for engineers.


