
Fan-Out Query Search Volume Prediction Using Deep Learning

While traditional keyword research tools provide valuable data, they often fall short in discovering truly novel or long-tail search query variations that a business might not yet rank for, or even be aware of. This is where our query fan-out model comes in: it uses advanced language models to generate a vast array of related search queries from existing organic queries.

https://dejan.ai/tools/fanout

However, generating a massive list of potential keywords creates a new challenge: how do you efficiently assess the search volume potential of these new, unproven queries? Manually checking each one is impractical. In this article we present a deep learning approach developed to automatically predict the search volume ranges for these fan-out queries, transforming a broad list into an actionable, prioritized asset.

The Challenge: Scaling Keyword Research

Content teams and SEO strategists constantly seek to expand their keyword footprint. Given a primary query like “AI SEO” and a target URL (e.g., dejan.ai), a fan-out generation model can suggest many diverse, yet related queries.

Here’s the exact output from the fan-out model for a single search query:

  • ai seo tools
  • ai powered seo tools
  • ai powered search engine optimization
  • ai search engine optimization
  • ai for search engine optimization
  • seo automation with ai
  • artificial intelligence for seo
  • best ai seo tools
  • ai for seo optimization
  • seo with artificial intelligence
  • best ai for seo
  • artificial intelligence seo tools
  • ai for seo ranking
  • ai for seo a/b testing
  • ai powered seo
  • artificial intelligence in seo
  • ai in search engine optimization
  • ai for seo agencies
  • ai for seo beginners
  • artificial intelligence for search engines
  • ai seo automation
  • ai for seo
  • AI-powered SEO tools
  • ai seo best practices
  • benefits of ai for SEO
  • ai powered keyword research
  • ai for seo content
  • ai powered search engine
  • ai seo tools 2025
  • ai website optimization
  • artificial intelligence seo
  • benefits of ai in seo
  • search engine optimization ai
  • ai powered seo services
  • AI for SEO automation
  • ai SEO tools comparison
  • ai seo optimization
  • best ai tools for seo
  • SEO with AI tools
  • improve seo with ai
  • ai in seo
  • ai powered seo software
  • machine learning for seo
  • ai seo examples
  • AI SEO implementation
  • automate seo with AI
  • ai powered website ranking
  • ai marketing automation
  • ai seo platforms
  • zendesk ai seo
  • artificial intelligence for website ranking
  • ai for seo tracking
  • ai seo algorithm
  • ai for digital marketing
  • ai for seo audit
  • artificial intelligence seo strategy
  • AI for SEO strategy
  • ai seo automation tools
  • dejan ai SEO
  • ai SEO agency
  • AI SEO services
  • best ai SEO software
  • website seo with ai
  • dejan ai seo expert
  • predictive seo ai
  • AI SEO software
  • AI for website seo
  • generate backlinks with ai
  • ai for seo reviews
  • website ranking automation with ai
  • google ai SEO
  • ai website traffic

While invaluable for identifying new opportunities, this explosion of data quickly becomes overwhelming. Each generated query ideally needs a search volume estimate to determine its potential value and prioritize content efforts. Relying on external tools for millions of queries is costly and time-consuming.

Query Demand Estimator

To address this, we developed a Query Demand Estimator (QDE) using a deep learning model. The core idea is to train a sequence classification model to categorize a given query into predefined search volume buckets.

A tool driven by the QDE model, trained for one specific industry.

Model Training

1. Data Preparation: The Ground Truth

The success of any supervised learning model hinges on the quality and quantity of its training data. Our approach involved:

  • Collecting Organic Performance Data: We aggregated historical search performance data (impressions and clicks) for millions of queries where our digital properties ranked well (positions 1-10). Ranking well implies that the impression data is a good proxy for actual search demand, as the content is visible to a significant portion of searchers.
  • Defining Volume Buckets: We established 12 distinct search volume ranges, from very low (“51-100” impressions) to very high (“200,001+” impressions). These ranges became our target labels.
  • Labeling Queries: Each query from our high-ranking dataset was assigned to its corresponding impression bucket, creating a (query_text, volume_label) pair dataset. For example, “dejan ai query fan-out tool” might be labeled as “501-1000”, while “top AI SEO agencies” would be assigned whichever bucket matches its observed impressions.

label_id,label_text
0,51-100
1,101-150
2,151-200
3,201-250
4,251-500
5,501-1000
6,1001-2000
7,2001-5000
8,5001-10000
9,10001-100000
10,100001-200000
11,200001+
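The labeling step above can be sketched as a simple lookup from impression counts to label ids. The thresholds mirror the bucket table; the function name and the handling of sub-51 queries are our assumptions, not the production code:

```python
# Bucket ranges from the label table: (lower bound, upper bound), inclusive.
BUCKETS = [
    (51, 100), (101, 150), (151, 200), (201, 250), (251, 500),
    (501, 1000), (1001, 2000), (2001, 5000), (5001, 10000),
    (10001, 100000), (100001, 200000),
]

def impressions_to_label(impressions: int):
    """Map an impression count to its label_id (None if below 51)."""
    for label_id, (low, high) in enumerate(BUCKETS):
        if low <= impressions <= high:
            return label_id
    if impressions > 200000:
        return 11  # "200001+"
    return None  # below the lowest bucket; presumably excluded from training
```

Queries under 51 impressions appear to be excluded from the label set entirely, which keeps the model from learning noise in the near-zero range.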

2. Model Architecture and Training

We leveraged a pre-trained transformer model, specifically mDeBERTa-v3-base, known for its strong performance across various natural language understanding tasks, including classification. The choice of mDeBERTa also offers multilingual capabilities, which is advantageous for global businesses.

The model was fine-tuned as a sequence classifier:

  • Input: A search query.
  • Output: One of the 12 predefined search volume buckets.

The training process involved:

  • Tokenization: Converting text queries into numerical tokens using the mDeBERTa tokenizer, ensuring consistent input length (MAX_LENGTH=256).
  • Batching and Epochs: Training in batches (BATCH_SIZE=16) over several epochs (EPOCHS=3) to allow the model to learn from the data efficiently.
  • Optimization: Using AdamW optimizer with a low learning rate (LR=2e-5) and weight decay to prevent overfitting.
  • Evaluation: Regular evaluation on a held-out validation set to monitor performance using metrics like accuracy, precision, recall, and F1-score. Weights & Biases (WandB) was used for experiment tracking.
MODEL_NAME = "microsoft/mdeberta-v3-base"
WANDB_PROJECT = "mdeberta-finetune"
NUM_LABELS = 12
MAX_LENGTH = 256
EPOCHS = 3
BATCH_SIZE = 16
LR = 2e-5
WEIGHT_DECAY = 0.01
WARMUP_RATIO = 0.1
LOGGING_STEPS = 10
EVAL_STEPS = 200
SAVE_TOTAL_LIMIT = 10
OUTPUT_DIR = "./finetuned-mdeberta"
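As a rough illustration of the evaluation step, accuracy and macro-averaged precision, recall, and F1 over the 12 buckets can be computed directly from predicted and true label ids. This is a dependency-free sketch of the metric logic, not the actual WandB-instrumented training code:

```python
def classification_metrics(y_true, y_pred, num_labels=12):
    """Accuracy plus macro-averaged precision, recall, and F1 over label ids."""
    assert len(y_true) == len(y_pred) and y_true
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls, f1s = [], [], []
    for label in range(num_labels):
        # Per-label counts: true positives, false positives, false negatives.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / num_labels,
        "recall": sum(recalls) / num_labels,
        "f1": sum(f1s) / num_labels,
    }
```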

3. Integration into the Fan-Out Workflow

Once trained, the QDE model was integrated into our fan-out query generation system. As the fan-out model generated new query variations for a given URL and seed query, each new variation was immediately passed to the QDE model for a volume prediction. This allowed the system to:

  • Generate an extensive list of relevant keywords.
  • Assign an estimated search volume range and a confidence score to each generated keyword.
  • Store these predictions alongside the fan-out query and its original source, making the data directly actionable.
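Conceptually, the integration of the two models looks like the sketch below. Here `generate_fanout` and `predict_volume` are hypothetical stand-ins for the actual fan-out generator and QDE model calls, and the returned values are placeholders:

```python
def generate_fanout(seed_query: str, url: str) -> list:
    # Stand-in for the fan-out generation model (illustrative output only).
    return [f"{seed_query} tools", f"best {seed_query}", f"{seed_query} automation"]

def predict_volume(query: str) -> tuple:
    # Stand-in for the QDE model: returns (volume bucket label, confidence).
    return ("501-1000", 0.62)

def enrich_fanout(seed_query: str, url: str) -> list:
    """Attach a predicted volume bucket and confidence to each variation."""
    records = []
    for query in generate_fanout(seed_query, url):
        bucket, confidence = predict_volume(query)
        records.append({
            "source_query": seed_query,
            "source_url": url,
            "query": query,
            "predicted_volume": bucket,
            "confidence": confidence,
        })
    return records
```

Storing the source query and URL alongside each prediction is what keeps the output directly actionable: every keyword can be traced back to the page it was generated for.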

Validation: How Accurate Are the Predictions?

Validation is crucial. To assess the QDE model’s real-world utility, we compared its predictions against a true gold standard: a subset of queries from a held-out dataset, representing terms where our properties consistently ranked in positions 1-10. For these queries, impression data closely reflects actual search volume.

The validation process involved:

  1. Extracting the QDE model’s volume predictions for all fan-out queries.
  2. Identifying queries that overlapped with our high-ranking ground truth dataset.
  3. Comparing the QDE predicted_volume bucket with the actual_volume_bucket from our ground truth.
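The two headline metrics reported below can be computed directly from the (predicted, actual) bucket-id pairs produced in step 3; a minimal sketch:

```python
def bucket_accuracy(pairs):
    """Exact and exact-or-adjacent accuracy over (predicted_id, actual_id) pairs."""
    n = len(pairs)
    exact = sum(1 for predicted, actual in pairs if predicted == actual)
    # A prediction one bucket above or below the truth counts as adjacent.
    adjacent = sum(1 for predicted, actual in pairs if abs(predicted - actual) <= 1)
    return exact / n, adjacent / n
```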

Key Findings:

Exact Match Accuracy: 23.31%

Initially, this might seem modest. It means that for 23.31% of the overlapping queries, the model predicted the exact search volume bucket.

Combined Accuracy (Exact + Adjacent): 54.80%

This metric is far more representative of the model’s practical value. It indicates that for 54.80% of the queries, the model’s prediction was either exactly correct OR within one adjacent search volume bucket (e.g., predicting “501-1000” when the actual was “251-500” or “1001-2000”). This level of accuracy is highly beneficial for prioritizing content efforts.

What the numbers mean

  • Exact Match Accuracy (23%): Out of all predictions, only about 1 in 4 were exactly correct.
  • Combined Accuracy (55%): If we also count predictions that were very close (off by just one “volume bucket”), the model got it right more than half the time.

Why 50% isn’t “coin flip” odds

This isn’t a yes/no problem. The model isn’t picking between just 2 outcomes (like heads vs. tails). Instead, it has to choose among 12 different possible volume ranges (labels).

  • If the model were guessing randomly, each guess would have about a 1 in 12 chance (~8%) of being correct.
  • Getting ~23% exact match accuracy is much better than random chance—it means the model is finding real patterns.
  • The ~55% combined accuracy shows that even when it misses, it’s often close to the right bucket, not completely wrong. That’s useful for practical decision-making.

How to read the confusion matrix

  • The diagonal shows “perfect hits.” Those are the exact matches.
  • The cells right next to the diagonal are “near misses” (predicted slightly higher or lower than reality).
  • Off-diagonal far-away values mean the model got it very wrong—these are the cases we want to reduce.
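A confusion matrix of this kind can be built from the same (predicted, actual) label pairs; the diagonal holds the exact hits and the band one step off the diagonal holds the near misses. A pure-Python sketch (function names are illustrative):

```python
def confusion_matrix(pairs, num_labels=12):
    """matrix[actual][predicted] counts, from (predicted_id, actual_id) pairs."""
    matrix = [[0] * num_labels for _ in range(num_labels)]
    for predicted, actual in pairs:
        matrix[actual][predicted] += 1
    return matrix

def diagonal_mass(matrix, band=0):
    """Fraction of predictions within `band` buckets of the true label."""
    total = sum(sum(row) for row in matrix)
    hits = sum(
        matrix[actual][predicted]
        for actual in range(len(matrix))
        for predicted in range(len(matrix))
        if abs(actual - predicted) <= band
    )
    return hits / total
```

With `band=0` this recovers exact-match accuracy, and with `band=1` the combined exact-plus-adjacent figure, so the matrix and the headline metrics stay consistent by construction.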

Insights from the Confusion Matrix:

The confusion matrix (a table showing actual vs. predicted labels) provided deeper insights:

  • Directional Correctness: The predictions clustered strongly around the diagonal, confirming the model’s ability to broadly categorize queries into low, medium, and high-volume ranges.
  • Systematic Biases:
    • Under-prediction in Low-Mid Range: The model showed a slight tendency to predict slightly lower volume buckets (e.g., 51-100) for queries that actually fell into the next higher categories (101-150, 151-200). This is a useful bias, as it means potentially under-valued queries might be identified, encouraging further investigation.
    • Slight Over-prediction in Mid-Range: Conversely, some mid-range queries were occasionally over-predicted by one or two buckets, which can help flag terms as potentially more valuable than initially perceived.

A Powerful Tool for SEO Strategy

The deep learning-powered QDE model, integrated with fan-out query generation, transforms a previously manual and time-consuming process into an automated, scalable, and data-driven one. While not always achieving perfect exact-bucket accuracy, its ability to correctly or nearly correctly classify query search volume over 50% of the time provides an invaluable, actionable signal.

This system empowers SEO teams to:

  • Rapidly identify and prioritize millions of new keyword opportunities.
  • Uncover long-tail queries that traditional tools might miss.
  • Strategically plan content and optimize existing pages with a clearer understanding of potential demand, moving beyond guesswork with the power of deep learning.

The future of SEO keyword research is increasingly augmented by AI, allowing businesses to be more agile, comprehensive, and ultimately, more successful in capturing organic search demand.

Want one for your industry?

If you’d like a custom QDE model trained for your own website or client, please apply below. This type of model training is best suited for websites with at least 100K queries, ideally 1M or more. We’ll evaluate your dataset and advise whether it’s suitable for model training.

