A recent patent application describes a method for training AI models to better understand human queries by using LLMs to automatically generate training data.
Training search engines to understand what we mean is a major challenge. Traditionally, it requires massive amounts of human-labeled data, which is slow and expensive to produce. Now, a new patent application describes an alternative: using a large language model to generate the training data for a search model.
Instead of relying on humans, this new method uses large language models to automatically generate high-quality training data. It starts with a simple passage of text, then tasks the AI with generating a relevant query and finding other matching documents.
To ensure high quality, the system uses a two-stage process. First, it creates a specific search task and query. Second, it ranks how well different passages actually answer that query. This helps the AI learn to distinguish between a passage that merely mentions a topic and one that truly answers a user's question.
The technology also shines in multilingual search. Instead of just translating questions word-for-word, it uses a summarize-then-ask technique. The AI first summarizes a passage, then uses that summary to write natural, context-appropriate questions in different languages.
In the real world, this could drastically improve corporate databases, online shopping searches, and research tools, helping people find exactly what they need in a fraction of the time.
A recent patent application* reveals an innovative method for training AI models to become more effective at understanding and answering human queries. The approach tackles a fundamental challenge in modern search technology: how to teach AI systems to truly understand what people are looking for, rather than just matching keywords.
The traditional way of training search AI requires massive amounts of human-labeled data – real questions paired with their ideal answers. This is expensive, time-consuming, and often limited in scope. The newly proposed method takes a different approach: it uses advanced AI language models to automatically generate diverse, high-quality training examples.
Here’s a practical example of how it works:
Let’s say the system encounters this passage: “The film follows the story of American scientist John Smith and his role in the development of the elixir of life.”
The AI would:
This approach solves several practical problems:
The technology could improve various real-world applications:
At its core, the patent application describes a two-stage distillation process for training embedding models. The architecture uses large language models (LLMs) both to generate training data and to validate it.
The first stage employs few-shot prompting of an LLM to generate both tasks and queries. What makes this approach unique is its explicit separation of task description from query generation. The LLM receives a passage and generates two distinct outputs: a task description that defines the type of retrieval required, and a relevant query for that task. This separation allows for much finer control over training data diversity.
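As a rough illustration of the first stage, the sketch below builds a few-shot prompt that asks for a task description followed by a query, then splits the model's completion into those two fields. The prompt wording, the `Task:` and `Query:` field labels, and the names `FewShotExample`, `build_prompt`, and `parse_completion` are all assumptions made for this example; the patent application's actual prompts are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class FewShotExample:
    passage: str
    task: str
    query: str


def build_prompt(examples: list[FewShotExample], passage: str) -> str:
    """Assemble a few-shot prompt that asks for a task line, then a query line."""
    parts = ["Read the passage, then write a retrieval task and a query for it.\n"]
    for ex in examples:
        parts.append(f"Passage: {ex.passage}\nTask: {ex.task}\nQuery: {ex.query}\n")
    # Leave the final "Task:" open so the model completes both fields.
    parts.append(f"Passage: {passage}\nTask:")
    return "\n".join(parts)


def parse_completion(completion: str) -> tuple[str, str]:
    """Split a completion shaped like '<task> Query: <query>' into its two parts."""
    task_part, sep, query_part = completion.partition("Query:")
    query_lines = query_part.strip().splitlines()
    if not sep or not query_lines:
        raise ValueError("completion is missing a 'Query:' field")
    return task_part.strip(), query_lines[0].strip()
```

Keeping the task and the query as separate fields means the few-shot examples can vary the task type (question answering, fact checking, and so on) independently of the query wording, which is where the extra control over diversity comes from.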
The second stage scores relevance with two distinct prompting strategies: Query Likelihood and Relevance Classification. Query Likelihood estimates how likely the LLM is to generate the given query from a passage, while Relevance Classification asks the LLM directly whether the passage is relevant to the query. The two resulting rankings are combined using Reciprocal Rank Fusion to create a final ranking function.
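Reciprocal Rank Fusion itself is simple enough to sketch in a few lines. The version below converts each scorer's raw scores into ranks and sums 1 / (k + rank) per passage. The constant `k = 60` is a commonly used default and an assumption here, as are the function names and the passage ids; the source does not state which value the system uses.

```python
def ranks_from_scores(scores: dict[str, float]) -> dict[str, int]:
    """Map each passage id to its 1-based rank, highest score first."""
    ordered = sorted(scores, key=lambda pid: scores[pid], reverse=True)
    return {pid: rank for rank, pid in enumerate(ordered, start=1)}


def reciprocal_rank_fusion(
    score_lists: list[dict[str, float]], k: int = 60
) -> dict[str, float]:
    """Fuse several scorers by summing 1 / (k + rank) for each passage."""
    fused: dict[str, float] = {}
    for scores in score_lists:
        for pid, rank in ranks_from_scores(scores).items():
            fused[pid] = fused.get(pid, 0.0) + 1.0 / (k + rank)
    return fused


# Illustrative scores only: log-likelihoods from one prompt,
# relevance probabilities from the other.
query_likelihood = {"p1": -2.1, "p2": -0.7, "p3": -3.5}
relevance_classification = {"p1": 0.6, "p2": 0.9, "p3": 0.1}
fused = reciprocal_rank_fusion([query_likelihood, relevance_classification])
ranking = sorted(fused, key=fused.get, reverse=True)
```

Because the fusion works on ranks rather than raw scores, the two prompting strategies do not need to produce values on the same scale.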
The model employs a dual-encoder architecture with separate towers for query and document processing. The query tower processes both the task description and the query, while the document tower handles the passage and any associated metadata like titles. This separation allows for efficient retrieval during inference while maintaining the ability to encode rich contextual information.
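The sketch below shows the shape of that split, not a real model. Both towers are stood in for by the same toy hashed bag-of-words encoder (`toy_encode`), which is purely an assumption for illustration; in practice each tower would be a trained neural encoder. The input templates (`task: ... query: ...` and `title: ... text: ...`) and the 64-dimensional size are likewise hypothetical.

```python
import hashlib
import math

DIM = 64  # toy embedding size, chosen only for this illustration


def toy_encode(text: str) -> list[float]:
    """Stand-in for a trained tower: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def encode_query(task: str, query: str) -> list[float]:
    # Query tower input: the task description plus the query itself.
    return toy_encode(f"task: {task} query: {query}")


def encode_document(title: str, passage: str) -> list[float]:
    # Document tower input: the passage plus metadata such as the title.
    return toy_encode(f"title: {title} text: {passage}")


def search(query_vec: list[float], doc_index: dict[str, list[float]], top_k: int = 3):
    """Rank precomputed document vectors by dot product with the query vector."""
    scored = [(dot(query_vec, vec), doc_id) for doc_id, vec in doc_index.items()]
    return sorted(scored, reverse=True)[:top_k]
```

The efficiency gain comes from the fact that document vectors can be computed once and stored, so only the query tower runs at search time.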
The query generation process follows a three-step pipeline:
For multilingual applications, the patent application introduces summarize-then-ask prompting (SAP). Instead of direct translation or cross-lingual generation, SAP first creates an extractive summary in the source language, then uses this summary as context for generating queries in target languages. This helps maintain semantic coherence across languages while generating natural-sounding queries.
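The two-call structure of this summarize-then-ask flow can be sketched as follows. The `llm` argument is a hypothetical callable that takes a prompt string and returns a completion, and the prompt wording is invented for this example; neither comes from the patent application.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def summarize_then_ask(passage: str, target_language: str, llm: LLM) -> dict[str, str]:
    """Two LLM calls: an extractive summary first, then a target-language query."""
    summary_prompt = (
        "Select the sentences from the passage that carry its key facts. "
        "Copy them verbatim.\n\n"
        f"Passage: {passage}\nSummary:"
    )
    summary = llm(summary_prompt).strip()

    # The summary is supplied as context so the query is written directly in
    # the target language instead of being translated from a source query.
    query_prompt = (
        f"Passage: {passage}\nSummary: {summary}\n"
        f"Write a question in {target_language} that the passage answers.\nQuestion:"
    )
    query = llm(query_prompt).strip()
    return {"summary": summary, "query": query, "language": target_language}
```

Any client that maps a prompt string to a completion string can be passed as `llm`, which also makes the function easy to exercise with a stub.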
Rather than assuming the seed passage is the optimal answer, the system implements a global ranking strategy to identify potentially better matches. This approach recognizes that the original passage might not be the best answer to the generated query, leading to higher quality training data.
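A minimal sketch of that relabeling step, assuming each candidate passage (including the seed) has already been given a relevance score for the generated query. How the candidates are gathered and scored is outside this snippet, and the function name and the tie-breaking rule in favor of the seed are assumptions.

```python
def relabel_positive(seed_id: str, scores: dict[str, float]) -> tuple[str, bool]:
    """Return the best-scoring passage id and whether it replaced the seed."""
    if seed_id not in scores:
        raise KeyError(f"seed passage {seed_id!r} has no score")
    best_id = max(scores, key=lambda pid: scores[pid])
    if scores[best_id] == scores[seed_id]:
        best_id = seed_id  # keep the seed when it ties for first place
    return best_id, best_id != seed_id
```

For example, with scores `{"seed": 0.4, "n1": 0.7, "n2": 0.1}` the function picks `n1` as the positive and reports that the seed was replaced.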
The system employs a two-pronged approach to hard negative mining:
This dual approach helps create more challenging and effective training examples.
The training process uses contrastive learning with temperature-scaled similarity scores. The loss function pulls query embeddings closer to positive passage embeddings while pushing them away from negative examples, with attention to batch composition and the temperature setting.
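One common form of such a loss is a softmax cross-entropy over temperature-scaled dot products, with the other passages in the batch acting as negatives. The sketch below implements that form in plain Python. Treating in-batch passages as the negatives and using a temperature of 0.05 are assumptions for illustration, since the source does not give the exact formulation.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def in_batch_contrastive_loss(
    query_vecs: list[list[float]],
    passage_vecs: list[list[float]],
    temperature: float = 0.05,
) -> float:
    """Mean softmax cross-entropy where passage i is the positive for query i."""
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [dot(q, p) / temperature for p in passage_vecs]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

A lower temperature sharpens the softmax, so a negative that scores close to the positive is penalized more heavily. This is one reason batch composition matters: the harder the negatives in the batch, the more signal each step carries.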
The system’s performance is evaluated on two major benchmarks:
Key metrics include cross-lingual transfer performance, zero-shot generalization capability, retrieval accuracy at various thresholds, and query generation diversity.
*Systems and Methods for Generating Instruction Fine-tuning Dataset for a General Purpose Embedding Model – #20250045316