
Fan-Out Query Search Volume Prediction Using Deep Learning

While traditional keyword research tools provide valuable data, they often fall short in discovering truly novel or long-tail search query variations that a business might not yet rank for, or even be aware of. This is where our query fan-out model comes in: it uses advanced language models to generate a vast array of related search queries from existing organic queries.

https://dejan.ai/tools/fanout

However, generating a massive list of potential keywords creates a new challenge: how do you efficiently assess the search volume potential of these new, unproven queries? Manually checking each one is impractical. In this article we present a deep learning approach developed to automatically predict the search volume ranges for these fan-out queries, transforming a broad list into an actionable, prioritized asset.

The Challenge: Scaling Keyword Research

Content teams and SEO strategists constantly seek to expand their keyword footprint. Given a primary query like “AI SEO” and a target URL (e.g., dejan.ai), a fan-out generation model can suggest many diverse, yet related queries.

Here’s the exact output from the fan-out model for a single search query:

  • ai seo tools
  • ai powered seo tools
  • ai powered search engine optimization
  • ai search engine optimization
  • ai for search engine optimization
  • seo automation with ai
  • artificial intelligence for seo
  • best ai seo tools
  • ai for seo optimization
  • seo with artificial intelligence
  • best ai for seo
  • artificial intelligence seo tools
  • ai for seo ranking
  • ai for seo a/b testing
  • ai powered seo
  • artificial intelligence in seo
  • ai in search engine optimization
  • ai for seo agencies
  • ai for seo beginners
  • artificial intelligence for search engines
  • ai seo automation
  • ai for seo
  • AI-powered SEO tools
  • ai seo best practices
  • benefits of ai for SEO
  • ai powered keyword research
  • ai for seo content
  • ai powered search engine
  • ai seo tools 2025
  • ai website optimization
  • artificial intelligence seo
  • benefits of ai in seo
  • search engine optimization ai
  • ai powered seo services
  • AI for SEO automation
  • ai SEO tools comparison
  • ai seo optimization
  • best ai tools for seo
  • SEO with AI tools
  • improve seo with ai
  • ai in seo
  • ai powered seo software
  • machine learning for seo
  • ai seo examples
  • AI SEO implementation
  • automate seo with AI
  • ai powered website ranking
  • ai marketing automation
  • ai seo platforms
  • zendesk ai seo
  • artificial intelligence for website ranking
  • ai for seo tracking
  • ai seo algorithm
  • ai for digital marketing
  • ai for seo audit
  • artificial intelligence seo strategy
  • AI for SEO strategy
  • ai seo automation tools
  • dejan ai SEO
  • ai SEO agency
  • AI SEO services
  • best ai SEO software
  • website seo with ai
  • dejan ai seo expert
  • predictive seo ai
  • AI SEO software
  • AI for website seo
  • generate backlinks with ai
  • ai for seo reviews
  • website ranking automation with ai
  • google ai SEO
  • ai website traffic

While invaluable for identifying new opportunities, this explosion of data quickly becomes overwhelming. Each generated query ideally needs a search volume estimate to determine its potential value and prioritize content efforts. Relying on external tools for millions of queries is costly and time-consuming.

Query Demand Estimator

To address this, we developed a Query Demand Estimator (QDE) using a deep learning model. The core idea is to train a sequence classification model to categorize a given query into predefined search volume buckets.

A tool driven by the QDE model, trained for one specific industry.

Model Training

1. Data Preparation: The Ground Truth

The success of any supervised learning model hinges on the quality and quantity of its training data. Our approach involved:

  • Collecting Organic Performance Data: We aggregated historical search performance data (impressions and clicks) for millions of queries where our digital properties ranked well (positions 1-10). Ranking well implies that the impression data is a good proxy for actual search demand, as the content is visible to a significant portion of searchers.
  • Defining Volume Buckets: We established 12 distinct search volume ranges, from very low (“51-100” impressions) to very high (“200,001+” impressions). These ranges became our target labels.
  • Labeling Queries: Each query from our high-ranking dataset was assigned to its corresponding impression bucket, creating a (query_text, volume_label) pair dataset. For example, “dejan ai query fan-out tool” might be labeled as “501-1000”, while “top AI SEO agencies” would be assigned whichever bucket matches its observed impressions.

label_id,label_text
0,51-100
1,101-150
2,151-200
3,201-250
4,251-500
5,501-1000
6,1001-2000
7,2001-5000
8,5001-10000
9,10001-100000
10,100001-200000
11,200001+
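The labeling step above can be sketched as a simple lookup from impression counts to label ids. The thresholds mirror the bucket table; the function name and the handling of sub-51 queries are our assumptions, not the production code:

```python
# Bucket ranges from the label table: (lower bound, upper bound), inclusive.
BUCKETS = [
    (51, 100), (101, 150), (151, 200), (201, 250), (251, 500),
    (501, 1000), (1001, 2000), (2001, 5000), (5001, 10000),
    (10001, 100000), (100001, 200000),
]

def impressions_to_label(impressions: int):
    """Map an impression count to its label_id (None if below 51)."""
    for label_id, (low, high) in enumerate(BUCKETS):
        if low <= impressions <= high:
            return label_id
    if impressions > 200000:
        return 11  # "200001+"
    return None  # below the lowest bucket; presumably excluded from training
```

Queries under 51 impressions appear to be excluded from the label set entirely, which keeps the model from learning noise in the near-zero range.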

2. Model Architecture and Training

We leveraged a pre-trained transformer model, specifically mDeBERTa-v3-base, known for its strong performance across various natural language understanding tasks, including classification. The choice of mDeBERTa also offers multilingual capabilities, which is advantageous for global businesses.

The model was fine-tuned as a sequence classifier:

  • Input: A search query.
  • Output: One of the 12 predefined search volume buckets.

The training process involved:

  • Tokenization: Converting text queries into numerical tokens using the mDeBERTa tokenizer, ensuring consistent input length (MAX_LENGTH=256).
  • Batching and Epochs: Training in batches (BATCH_SIZE=16) over several epochs (EPOCHS=3) to allow the model to learn from the data efficiently.
  • Optimization: Using AdamW optimizer with a low learning rate (LR=2e-5) and weight decay to prevent overfitting.
  • Evaluation: Regular evaluation on a held-out validation set to monitor performance using metrics like accuracy, precision, recall, and F1-score. Weights & Biases (WandB) was used for experiment tracking.
MODEL_NAME = "microsoft/mdeberta-v3-base"
WANDB_PROJECT = "mdeberta-finetune"
NUM_LABELS = 12
MAX_LENGTH = 256
EPOCHS = 3
BATCH_SIZE = 16
LR = 2e-5
WEIGHT_DECAY = 0.01
WARMUP_RATIO = 0.1
LOGGING_STEPS = 10
EVAL_STEPS = 200
SAVE_TOTAL_LIMIT = 10
OUTPUT_DIR = "./finetuned-mdeberta"
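As a rough illustration of the evaluation step, accuracy and macro-averaged precision, recall, and F1 over the 12 buckets can be computed directly from predicted and true label ids. This is a dependency-free sketch of the metric logic, not the actual WandB-instrumented training code:

```python
def classification_metrics(y_true, y_pred, num_labels=12):
    """Accuracy plus macro-averaged precision, recall, and F1 over label ids."""
    assert len(y_true) == len(y_pred) and y_true
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls, f1s = [], [], []
    for label in range(num_labels):
        # Per-label counts: true positives, false positives, false negatives.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / num_labels,
        "recall": sum(recalls) / num_labels,
        "f1": sum(f1s) / num_labels,
    }
```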

3. Integration into the Fan-Out Workflow

Once trained, the QDE model was integrated into our fan-out query generation system. As the fan-out model generated new query variations for a given URL and seed query, each new variation was immediately passed to the QDE model for a volume prediction. This allowed the system to:

  • Generate an extensive list of relevant keywords.
  • Assign an estimated search volume range and a confidence score to each generated keyword.
  • Store these predictions alongside the fan-out query and its original source, making the data directly actionable.
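Conceptually, the integration of the two models looks like the sketch below. Here `generate_fanout` and `predict_volume` are hypothetical stand-ins for the actual fan-out generator and QDE model calls, and the returned values are placeholders:

```python
def generate_fanout(seed_query: str, url: str) -> list:
    # Stand-in for the fan-out generation model (illustrative output only).
    return [f"{seed_query} tools", f"best {seed_query}", f"{seed_query} automation"]

def predict_volume(query: str) -> tuple:
    # Stand-in for the QDE model: returns (volume bucket label, confidence).
    return ("501-1000", 0.62)

def enrich_fanout(seed_query: str, url: str) -> list:
    """Attach a predicted volume bucket and confidence to each variation."""
    records = []
    for query in generate_fanout(seed_query, url):
        bucket, confidence = predict_volume(query)
        records.append({
            "source_query": seed_query,
            "source_url": url,
            "query": query,
            "predicted_volume": bucket,
            "confidence": confidence,
        })
    return records
```

Storing the source query and URL alongside each prediction is what keeps the output directly actionable: every keyword can be traced back to the page it was generated for.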

Validation: How Accurate Are the Predictions?

Validation is crucial. To assess the QDE model’s real-world utility, we compared its predictions against a true gold standard: a subset of queries from a held-out dataset, representing terms where our properties consistently ranked in positions 1-10. For these queries, impression data closely reflects actual search volume.

The validation process involved:

  1. Extracting the QDE model’s volume predictions for all fan-out queries.
  2. Identifying queries that overlapped with our high-ranking ground truth dataset.
  3. Comparing the QDE predicted_volume bucket with the actual_volume_bucket from our ground truth.
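The two headline metrics reported below can be computed directly from the (predicted, actual) bucket-id pairs produced in step 3; a minimal sketch:

```python
def bucket_accuracy(pairs):
    """Exact and exact-or-adjacent accuracy over (predicted_id, actual_id) pairs."""
    n = len(pairs)
    exact = sum(1 for predicted, actual in pairs if predicted == actual)
    # A prediction one bucket above or below the truth counts as adjacent.
    adjacent = sum(1 for predicted, actual in pairs if abs(predicted - actual) <= 1)
    return exact / n, adjacent / n
```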

Key Findings:

Exact Match Accuracy: 23.31%

Initially, this might seem modest. It means that for 23.31% of the overlapping queries, the model predicted the exact search volume bucket.

Combined Accuracy (Exact + Adjacent): 54.80%

This metric is far more representative of the model’s practical value. It indicates that for 54.80% of the queries, the model’s prediction was either exactly correct OR within one adjacent search volume bucket (e.g., predicting “501-1000” when the actual was “251-500” or “1001-2000”). This level of accuracy is highly beneficial for prioritizing content efforts.

What the numbers mean

  • Exact Match Accuracy (23%): Out of all predictions, only about 1 in 4 were exactly correct.
  • Combined Accuracy (55%): If we also count predictions that were very close (off by just one “volume bucket”), the model got it right more than half the time.

Why 50% isn’t “coin flip” odds

This isn’t a yes/no problem. The model isn’t picking between just 2 outcomes (like heads vs. tails). Instead, it has to choose among 12 different possible volume ranges (labels).

  • If the model were guessing randomly, each guess would have about a 1 in 12 chance (~8%) of being correct.
  • Getting ~23% exact match accuracy is much better than random chance—it means the model is finding real patterns.
  • The ~55% combined accuracy shows that even when it misses, it’s often close to the right bucket, not completely wrong. That’s useful for practical decision-making.

How to read the confusion matrix

  • The diagonal shows “perfect hits.” Those are the exact matches.
  • The cells right next to the diagonal are “near misses” (predicted slightly higher or lower than reality).
  • Off-diagonal far-away values mean the model got it very wrong—these are the cases we want to reduce.
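A confusion matrix of this kind can be built from the same (predicted, actual) label pairs; the diagonal holds the exact hits and the band one step off the diagonal holds the near misses. A pure-Python sketch (function names are illustrative):

```python
def confusion_matrix(pairs, num_labels=12):
    """matrix[actual][predicted] counts, from (predicted_id, actual_id) pairs."""
    matrix = [[0] * num_labels for _ in range(num_labels)]
    for predicted, actual in pairs:
        matrix[actual][predicted] += 1
    return matrix

def diagonal_mass(matrix, band=0):
    """Fraction of predictions within `band` buckets of the true label."""
    total = sum(sum(row) for row in matrix)
    hits = sum(
        matrix[actual][predicted]
        for actual in range(len(matrix))
        for predicted in range(len(matrix))
        if abs(actual - predicted) <= band
    )
    return hits / total
```

With `band=0` this recovers exact-match accuracy, and with `band=1` the combined exact-plus-adjacent figure, so the matrix and the headline metrics stay consistent by construction.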

Insights from the Confusion Matrix:

The confusion matrix (a table showing actual vs. predicted labels) provided deeper insights:

  • Directional Correctness: The predictions clustered strongly around the diagonal, confirming the model’s ability to broadly categorize queries into low, medium, and high-volume ranges.
  • Systematic Biases:
    • Under-prediction in Low-Mid Range: The model showed a slight tendency to predict slightly lower volume buckets (e.g., 51-100) for queries that actually fell into the next higher categories (101-150, 151-200). This is a useful bias, as it means potentially under-valued queries might be identified, encouraging further investigation.
    • Slight Over-prediction in Mid-Range: Conversely, some mid-range queries were occasionally over-predicted by one or two buckets, which can help flag terms as potentially more valuable than initially perceived.

A Powerful Tool for SEO Strategy

The deep learning-powered QDE model, integrated with fan-out query generation, transforms a previously manual and time-consuming process into an automated, scalable, and data-driven one. While not always achieving perfect exact-bucket accuracy, its ability to correctly or nearly correctly classify query search volume over 50% of the time provides an invaluable, actionable signal.

This system empowers SEO teams to:

  • Rapidly identify and prioritize millions of new keyword opportunities.
  • Uncover long-tail queries that traditional tools might miss.
  • Strategically plan content and optimize existing pages with a clearer understanding of potential demand, moving beyond guesswork with the power of deep learning.

The future of SEO keyword research is increasingly augmented by AI, allowing businesses to be more agile, comprehensive, and ultimately, more successful in capturing organic search demand.

Want one for your industry?

If you’d like a custom QDE model trained for your own website or client, please apply below. This type of model training is best suited for websites with at least 100K queries, ideally 1M or more. We’ll evaluate your dataset and advise whether it’s suitable for model training.

