Google discovered how to generate hundreds of thousands of high-quality query reformulations without human input by traversing the embedding space between queries and their target documents.

Here’s How it Works
- Take a query and its relevant document (e.g., “stock market returns” → S&P 500 data)
- Move step-by-step through latent space using the interpolation formula q_κ = q + (κ/k)(d − q), where κ is the current step and k is the total number of steps
- Decode each point back to text using a trained “query decoder”
- Collect the successful reformulations that retrieve the target document
This generated 863,307 training examples for a query suggestion model (qsT5) that outperforms all existing baselines.
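
To make the traversal concrete, here is a minimal sketch of the interpolation step; the vectors below are random stand-ins for real GTR embeddings.

```python
import numpy as np

# Random stand-ins for the real GTR embeddings of a query and its relevant document.
q = np.random.randn(768)   # query embedding
d = np.random.randn(768)   # gold-document embedding
k = 20                     # total number of traversal steps

# q_kappa = q + (kappa / k) * (d - q): step 0 is the query itself, step k lands on the document.
points = [q + (kappa / k) * (d - q) for kappa in range(k + 1)]
```

Each intermediate vector is what the query decoder, described next, turns back into text.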

Query Decoder + Latent Space Traversal
Step 1: Build a Query Decoder
First, they trained a T5 model to invert Google’s GTR search encoder. Feed it any embedding vector, and it generates the query that would produce that embedding. This achieved 96% cosine similarity on reconstruction, nearly perfect fidelity.
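
Below is a minimal sketch of how such a decoder could be wired together with the public GTR checkpoint and T5 from Hugging Face. The length-1 projected "encoder output" and the reconstruction loss are assumptions of this sketch, not the paper's exact recipe.

```python
import torch
from torch import nn
from transformers import T5ForConditionalGeneration, T5Tokenizer
from transformers.modeling_outputs import BaseModelOutput
from sentence_transformers import SentenceTransformer

gtr = SentenceTransformer("sentence-transformers/gtr-t5-base")  # frozen retrieval encoder
tok = T5Tokenizer.from_pretrained("t5-base")
decoder = T5ForConditionalGeneration.from_pretrained("t5-base")

# Project a single GTR vector into a length-1 "encoder output" the T5 decoder can cross-attend to.
proj = nn.Linear(gtr.get_sentence_embedding_dimension(), decoder.config.d_model)

def decode_embedding(vec: torch.Tensor, max_len: int = 32) -> str:
    """Generate query text from an embedding vector."""
    enc = BaseModelOutput(last_hidden_state=proj(vec).view(1, 1, -1))
    out = decoder.generate(encoder_outputs=enc, max_length=max_len)
    return tok.decode(out[0], skip_special_tokens=True)

def reconstruction_loss(query: str) -> torch.Tensor:
    """Train decoder (and proj) to reproduce a query from its own frozen GTR embedding."""
    vec = torch.tensor(gtr.encode(query))
    labels = tok(query, return_tensors="pt").input_ids
    enc = BaseModelOutput(last_hidden_state=proj(vec).view(1, 1, -1))
    return decoder(encoder_outputs=enc, labels=labels).loss
```

Once trained this way, the decoder can be pointed at any vector in the space, including the intermediate traversal points.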

Step 2: Generate Training Data via Traversal
Starting with MSMarco query-document pairs:
- Compute embeddings for both query and gold document
- Take 20 steps along the straight line between them
- Decode each intermediate point
- Keep reformulations that improve retrieval metrics

Example traversal from “average yearly return on stock market”:
Step 0: “average yearly return on stock market” [nDCG: 0.0]
Step 5: “what is the average return in a stock market” [nDCG: 0.0]
Step 12: “what is the average return on the s&p stock exchange” [nDCG: 0.36]
Step 20: “what is the average annual return of the s&p stock exchange” [nDCG: 1.0]
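
A sketch of the loop that produces traversals like the one above, reusing gtr and decode_embedding from the query-decoder sketch; retrieve() and ndcg() stand in for whatever retrieval stack and metric implementation are used:

```python
def traverse(query: str, gold_doc: str, k: int = 20):
    """Walk from the query embedding toward the document embedding, keeping improvements."""
    q = torch.tensor(gtr.encode(query))
    d = torch.tensor(gtr.encode(gold_doc))
    baseline = ndcg(retrieve(query), gold_doc)      # score of the original query
    kept = []
    for kappa in range(1, k + 1):
        q_kappa = q + (kappa / k) * (d - q)         # one step further toward the document
        reformulation = decode_embedding(q_kappa)
        score = ndcg(retrieve(reformulation), gold_doc)
        if score > baseline:                        # keep only reformulations that help
            kept.append((query, reformulation, score))
    return kept
```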

Step 3: Train the Production Model
Using this synthetic dataset, they fine-tuned T5-large with two variants:
- qsT5-plain: Input is just the query
- qsT5: Input is query + top-5 search results (pseudo-relevance feedback)
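A sketch of how those two inputs might be assembled; the exact prompt template and separators used for fine-tuning are an assumption here:

```python
def build_input(query: str, top_docs: list[str] | None = None) -> str:
    """Format an input for qsT5-plain (query only) or qsT5 (query + PRF context)."""
    if not top_docs:                      # qsT5-plain: the query alone
        return f"reformulate: {query}"
    context = " ".join(top_docs[:5])      # qsT5: query plus top-5 retrieved passages
    return f"reformulate: {query} context: {context}"
```
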
The Geometry of Meaning
Modern neural retrievers like GTR embed queries and documents in the same vector space where semantic similarity equals geometric proximity. The researchers’ insight: if relevant documents cluster in certain regions, then moving toward those regions should produce better queries.
The elegance lies in three key observations:
- Latent spaces are structured: Related concepts form neighborhoods
- Paths carry meaning: Intermediate points represent semantic compromises
- Decoders preserve semantics: The query decoder reliably maps vectors back to meaningful text
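A quick, self-contained way to observe that shared space with the public GTR checkpoint (the sentences are purely illustrative):

```python
from sentence_transformers import SentenceTransformer, util

gtr = SentenceTransformer("sentence-transformers/gtr-t5-base")
vecs = gtr.encode([
    "average yearly return on stock market",        # query
    "Historical annual returns of the S&P 500",     # relevant passage
    "How to care for a pet python",                 # unrelated passage
])
print(util.cos_sim(vecs[0], vecs[1]))  # high: semantically close, geometrically close
print(util.cos_sim(vecs[0], vecs[2]))  # low: far away in the latent space
```
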
The Implicit Learning Phenomenon
Here’s the fascinating part: while training data comes from explicit geometric traversal, the final qsT5 model operates without any vector arithmetic. It has internalized the traversal patterns.
When qsT5 sees “python loops” + search results about programming:
- It doesn’t compute q + α(d − q) explicitly
- Instead, it has learned which reformulation directions work
- It generates “python for loop examples” and “python iterator protocol” based on learned patterns
The model essentially compresses thousands of traversal examples into an implicit understanding of how to navigate query space.
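
A sketch of what inference looks like under the same assumptions: plain text in, several reformulations out, and no vector arithmetic anywhere (qst5 is the fine-tuned model; tok and build_input come from the earlier sketches):

```python
def reformulate(query: str, top_docs: list[str], n: int = 10) -> list[str]:
    """Generate n reformulations from the query and its initial search results."""
    ids = tok(build_input(query, top_docs), return_tensors="pt").input_ids
    outs = qst5.generate(ids, num_beams=n, num_return_sequences=n, max_length=32)
    return [tok.decode(o, skip_special_tokens=True) for o in outs]
```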

Production Implementation and Impact
In deployment, the system works like this:
- User query → Initial search
- Top results → Context for reformulation
- qsT5 model → Multiple query variants
- Parallel search → Comprehensive results
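Put together, a deployment sketch might look like this, where search() and merge_and_rerank() are placeholders for the production retrieval and fusion components:

```python
def fanout_search(user_query: str, per_query: int = 100):
    initial = search(user_query, k=5)                           # 1. initial search
    variants = reformulate(user_query, initial)                 # 2-3. PRF context + qsT5
    all_queries = [user_query, *variants]
    results = {q: search(q, k=per_query) for q in all_queries}  # 4. parallel search
    return merge_and_rerank(results)                            # fuse into one ranked list
```
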
Performance gains:
- MSMarco: nDCG@10 improved from 0.420 to 0.554
- Natural Questions: nDCG@10 improved from 0.495 to 0.637
- Generates 10+ diverse reformulations per query
Example: reformulations of the query “who created spiritual gangster” produced by each method:
- MQR:
  - Who created the Spiritual Gangster?
  - Who created the “spiritual gangster” storyline?
  - Who created the “spiritual gangster”?
- RM3:
  - who created spiritual gangster spiritual
  - who created spiritual gangster modern
  - who created spiritual gangster inspired
- Sampling+QD:
  - who created gangster a spiritual & egantious
  - who created spiritual gangster -gangster
  - who created spiritual gangster
- qsT5:
  - who is the founder of spiritual gangsters
  - who created the spiritual gangster ( spiritual yogi )
  - what is the spiritual gangster movement
- qsT5-plain:
  - who are the founders of the gangster spirit band
  - how many gangsters were formed in white supreme
  - who was the members of the gangster supremes
Why Pseudo-Relevance Feedback Matters
The qsT5 model with PRF significantly outperforms the query-only version because:
- Disambiguation: “python” → programming language vs. snake
- Terminology discovery: Seeing documents reveals domain-specific terms
- Intent grounding: Results show what the corpus actually contains
The model learns to extract signals from initial results and incorporate them into reformulations, mimicking how human searchers refine queries after seeing preliminary results.
Implications for Search Architecture
This approach enables:
- Automated query fanout without hand-crafted rules
- Continuous improvement via self-supervised learning
- Interpretable AI through query decoder inspection
- Language-agnostic reformulation (the method works on embeddings, not words)
The Broader Vision
By framing query reformulation as navigation in latent space, this work opens new possibilities:
- Real-time search adaptation based on user behavior
- Cross-modal search (text to image queries)
- Explainable search suggestions (“moving toward technical documentation”)
The key insight: instead of treating queries as fixed strings, we can view them as starting points for journeys through meaning space. The AI has learned to be an expert guide for these journeys.