An interactive analysis of the extractive grounding pipeline. Based on patents, the BERT paper, Passage Ranking announcements, and the SMITH model architecture.
Snippet selection has evolved from simple keyword density (TF-IDF) to semantic understanding. Explore the timeline to see the shift in technology.
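As a point of reference, here is a minimal sketch of that older keyword-density approach, scoring candidate sentences by TF-IDF overlap with the query. scikit-learn is used purely as a stand-in, and the example sentences are invented for illustration; this is not Google's actual system.

```python
# Minimal sketch of the keyword-density era: score each sentence by
# TF-IDF overlap with the query and return the best match.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

query = "Why do leaves change color?"
sentences = [
    "Leaves change color because chlorophyll breaks down in autumn.",
    "The tree sheds its leaves to conserve water over winter.",
    "Maple syrup is harvested in early spring.",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(sentences + [query])

# Compare the query vector (last row) against every sentence vector.
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
best = max(range(len(sentences)), key=lambda i: scores[i])
print(sentences[best], scores[best])
```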
Bidirectional Encoder Representations from Transformers changed everything.
"BERT: Pre-training of Deep Bidirectional Transformers..."
Research suggests a multi-stage process: Segmentation, Retrieval (finding candidates), and Scoring (Cross-Attention). Use the controls to simulate how Google likely "reads" a page to extract a snippet for the query: "Why do leaves change color?"
Passage Ranking papers suggest segmenting documents into fixed-length windows or at logical DOM breaks.
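A minimal sketch of how that segmentation step might work, assuming overlapping fixed-length word windows. The window and stride sizes are illustrative assumptions, not values taken from the papers.

```python
# Segmentation sketch: split a page into overlapping fixed-length word
# windows. Window and stride sizes here are illustrative assumptions.
def segment(text: str, window: int = 80, stride: int = 40) -> list[str]:
    words = text.split()
    if len(words) <= window:
        return [" ".join(words)]
    passages = []
    for start in range(0, len(words) - stride, stride):
        passages.append(" ".join(words[start:start + window]))
    return passages

# A real system might instead break at logical DOM boundaries
# (headings, paragraphs) rather than at fixed word counts.
```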
For snippets, Google likely uses a retrieve-then-rerank architecture: Bi-Encoders are fast for finding candidates (Retrieval), while Cross-Encoders (like BERT) are far more accurate at selecting the specific sentence (Grounding), despite being computationally expensive.
Dual Encoders (Bi-Encoders): The query and the document are encoded independently into vectors. Retrieval is a fast Nearest Neighbor search.
Source: "Dense Passage Retrieval for Open-Domain Question Answering" (2020)
Cross-Encoders: The query and the candidate passage are concatenated and fed into BERT. The model attends to every word pair. This produces the "Grounding Score" used to snip the text.
Source: "Passage Re-ranking with BERT" (2019)
Google's Siamese Multi-depth Transformer-based Hierarchical (SMITH) encoder handles long documents better than BERT by modeling sentence blocks structurally, which is crucial for finding snippets in long articles.
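A heavily simplified sketch of that hierarchical idea, in plain PyTorch: encode each sentence block independently, then contextualize the block embeddings with a document-level transformer. The dimensions, block sizes, and pooling are illustrative assumptions, and the Siamese (two-tower) matching part of SMITH is omitted.

```python
# Two-level encoder sketch: a block-level transformer runs inside each
# sentence block, then a document-level transformer runs across the
# pooled block embeddings. Sizes are illustrative, not the paper's.
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, vocab_size=30522, dim=256, block_len=32):
        super().__init__()
        self.block_len = block_len
        self.embed = nn.Embedding(vocab_size, dim)
        block_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        doc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.block_encoder = nn.TransformerEncoder(block_layer, num_layers=2)
        self.doc_encoder = nn.TransformerEncoder(doc_layer, num_layers=2)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (num_blocks, block_len), the document pre-split
        # into fixed-length sentence blocks.
        x = self.embed(token_ids)                # (blocks, block_len, dim)
        x = self.block_encoder(x)                # contextualize within each block
        block_vecs = x.mean(dim=1).unsqueeze(0)  # (1, blocks, dim) pooled per block
        doc = self.doc_encoder(block_vecs)       # contextualize across blocks
        return doc.squeeze(0)                    # one vector per sentence block

doc_vectors = HierarchicalEncoder()(torch.randint(0, 30522, (8, 32)))
print(doc_vectors.shape)  # torch.Size([8, 256])
```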
Primary patents and research papers analyzed to construct this model.
| Type | Title / ID | Relevance to Snippets | Year |
|---|---|---|---|