Grounding Should Come Before Generation
Google’s RARR (Retrofit Attribution using Research and Revision) is a clever but fragile Band‑Aid for LLM hallucinations. Today I want to zoom out and contrast that generate → ground philosophy with a retrieval‑first alternative that’s already proving more robust in production.
Quick Recap: What RARR Tries to Do
- Step 1: Draft – The LLM autocompletes an answer from scratch.
- Step 2: Query‑auto‑gen – It turns its own output into Google queries.
- Step 3: Retrieve & Revise – It fetches passages, checks facts, edits and cites.
Great for retro‑fitting citations onto an existing model; terrible when that auto‑generated query layer sneezes. Miss the target once and the whole answer wobbles.
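For concreteness, here is a minimal sketch of that draft → query → retrieve → revise loop, in the spirit of the paper’s description rather than Google’s actual code. The `llm` and `web_search` callables are hypothetical stand‑ins you would wire up to your own model and search API.

```python
from typing import Callable, Sequence

def rarr_style_answer(question: str,
                      llm: Callable[[str], str],
                      web_search: Callable[[str], Sequence[str]]) -> dict:
    """Hedged sketch of a generate-then-ground (RARR-like) loop; not Google's code."""
    # Step 1: Draft - the LLM answers from parametric memory alone.
    draft = llm(f"Answer the question: {question}")

    # Step 2: Query auto-generation - the model writes search queries for its OWN claims.
    # This is the single point of failure: one bad query poisons everything downstream.
    queries = [q.strip() for q in
               llm(f"List search queries, one per line, to verify each claim in:\n{draft}").splitlines()
               if q.strip()]

    # Step 3: Retrieve & revise - fetch evidence, check agreement, edit, collect citations.
    revised, report = draft, []
    for query in queries:
        for passage in web_search(query):
            relevant = llm(f"Is this passage relevant to '{query}'? Answer yes or no.\n{passage}")
            if relevant.strip().lower() != "yes":
                continue
            report.append((query, passage))
            agrees = llm(f"Does the passage support the text? Answer yes or no.\n"
                         f"Text: {revised}\nPassage: {passage}")
            if agrees.strip().lower() == "no":
                revised = llm(f"Minimally edit the text so it agrees with the evidence.\n"
                              f"Text: {revised}\nEvidence: {passage}")
    return {"answer": revised, "attribution_report": report}
```

Every step after the draft depends on the auto‑generated queries, which is exactly where the pain points below come from.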
Observed Pain Points
- Single Point of Failure – One malformed query cascades into wrong evidence and wrong edits.
- Latency Tax – Draft ➜ Search ➜ Edit is three passes, not one.
- Intent Attrition – Post‑hoc revision trades nuance for attribution: the EFEC baseline in the RARR paper “fixes” facts largely by deleting them, and even RARR gives up part of the original intent while editing.
Enter Retrieve‑Then‑Generate (RAG)
The Retrieval‑Augmented Generation framework flips the order (retrieve → generate) and keeps the evidence on‑hand before the model opens its mouth. First proposed by Lewis et al. (2020), RAG pipes your user query through a vector index, pulls the top‑k passages, and feeds «query + evidence» into the decoder in a single context window.
- RAG‑Sequence – One retrieved evidence set conditions the entire answer.
- RAG‑Token – The model can lean on a different retrieved passage for each token it generates.
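Here is a minimal sketch of that retrieve‑then‑generate ordering. It is not the original Lewis et al. implementation (which trains retriever and generator jointly); `embed`, `search`, and `llm` are hypothetical stand‑ins for your embedding model, vector store, and generator.

```python
from typing import Callable, Sequence

def rag_style_answer(question: str,
                     embed: Callable[[str], Sequence[float]],
                     search: Callable[[Sequence[float], int], Sequence[str]],
                     llm: Callable[[str], str],
                     k: int = 5) -> str:
    """Hedged sketch of a retrieve-then-generate (RAG-like) flow."""
    # 1. Retrieve FIRST, using the user's own query - no fragile query-generation layer.
    passages = search(embed(question), k)

    # Clean failure mode: if retrieval comes back empty, say so instead of guessing.
    if not passages:
        return "I don't know - no supporting evidence was retrieved."

    # 2. Generate ONCE, with query + evidence together in a single context window.
    evidence = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return llm(
        "Answer the question using only the evidence below, citing passage numbers.\n\n"
        f"Evidence:\n{evidence}\n\nQuestion: {question}"
    )
```

Note that the retrieval key is the user’s query itself, so there is no model‑written query layer to fail silently.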
Why Retrieval‑First Wins
- Built‑in Factuality – The model copies or reasons over real text instead of hallucinating dates and names.
- Cleaner Failure Modes – If retrieval finds nothing, you know early and can say “I don’t know.”
- Speed – One grounded generation pass instead of a draft pass plus a revision pass (no post‑hoc surgery).
- Benchmark Proof – Outperforms parametric‑only baselines on open‑domain QA and yields higher evidence attribution scores out of the box.
Fusion‑in‑Decoder (FiD): Multi‑Passage, State of the Art
FiD (Izacard & Grave 2021) pushes the idea further by:
- Encoding each passage separately (no 10k‑token concatenation headaches).
- Letting the decoder’s cross‑attention fuse signals across all passages.
The result? Even better factual accuracy and graceful scaling to bigger evidence sets.
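A toy sketch of the FiD data flow, using generic PyTorch transformer layers as stand‑ins for the pretrained encoder‑decoder (T5 in the paper). The point is only the shape juggling: each passage is encoded on its own, then the encoder outputs are concatenated so the decoder’s cross‑attention sees all of them at once.

```python
import torch
import torch.nn as nn

# Toy dimensions, purely illustrative.
d_model, n_passages, passage_len, answer_len = 64, 4, 32, 10

# Stand-ins for a real pretrained encoder/decoder (FiD initializes from T5).
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)

# Step 1: encode each "question + passage_i" pair independently,
# so no single sequence ever gets unmanageably long.
passage_embeddings = torch.randn(n_passages, passage_len, d_model)
encoded = encoder(passage_embeddings)            # (n_passages, passage_len, d_model)

# Step 2: concatenate all encoder outputs into one long memory; the decoder's
# cross-attention fuses evidence across every passage while generating the answer.
memory = encoded.reshape(1, n_passages * passage_len, d_model)
answer_prefix = torch.randn(1, answer_len, d_model)   # embedded partial answer
fused = decoder(answer_prefix, memory)
print(fused.shape)                               # torch.Size([1, 10, 64])
```

Because passages are encoded independently, encoder cost grows linearly with the number of passages, while the decoder still gets a global view of the evidence.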
Putting It Together
Paradigm | Steps | Achilles Heel |
---|---|---|
Generate → Ground (RARR) | Draft → Queries → Retrieval → Edit | Query generator fails → bad evidence → bad answer |
Retrieve → Generate (RAG / FiD) | Retrieve → Decoder attends & writes | Retriever misses → detect early, return fallback |
Takeaways for Search & Content Folks
- Ground first, write second. Evidence in context slashes hallucinations at the root.
- Measure both attribution and intent preservation. Deleting half the answer to stay factual is cheating. (A minimal sketch of both scores follows this list.)
- Latency matters. If you’re building a user‑facing tool, every extra loop is UX drag you’ll feel.
- RARR is fine as a retrofit. But if you’re architecting from scratch in 2025, retrieval‑first is the sturdier foundation.
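As promised above, here is a minimal sketch of the two scores from the RARR evaluation, as I understand them from the talk: attribution is judged per claim against the retrieved evidence (by humans or an NLI‑style model, not shown here), preservation combines an intent judgment with a character‑level edit‑distance ratio, and the two are merged with a harmonic mean (F1). Exact formulations in the paper may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming character edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def preservation(original: str, revised: str, intent_preserved: bool) -> float:
    # Edit-distance ratio floored at 0, scaled by the (human-judged) intent score.
    lev = max(1.0 - levenshtein(original, revised) / max(len(original), 1), 0.0)
    return float(intent_preserved) * lev

def combined_f1(attribution: float, preservation_score: float) -> float:
    # Harmonic mean of attribution and preservation, as in an F1 score.
    total = attribution + preservation_score
    return 0.0 if total == 0 else 2 * attribution * preservation_score / total
```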
Bottom line: don’t spend your roadmap polishing a Band‑Aid. Slot evidence into the context window before generation, and your model will thank you, and so will your users.
Acknowledgements
Thanks to Jean-Christophe Chouinard for bringing this to my attention.
This article is AI augmented using the following context:
Personal view as the primary driver for the article.
The process is suboptimal in the sense that the pipeline starts with an autoregressive step and then tries to make it work by grounding as a band-aid. This setup seems particularly prone to error because it depends on the query generator: if that layer fails, the entire response fails. A more logical sequence of events in the pipeline would give the model both index results and relevant grounding in a unified context prior to its response, as opposed to the grounding-as-an-afterthought paradigm.
Full video transcript as context
00:00 Junling Hu: talk will be uh we are very happy to get uh speaker from Google Research, Ni Lao. He is going to talk about large language model and attributed text generation. So without further ado, I will let uh start.
00:17 Ni Lao: Uh thanks Junling for inviting me. Um for the talk.
00:22 Ni Lao: Um in this talk um uh going to talk about actually two things. One is large language model
00:29 Ni Lao: and uh one major issue with them.
00:33 Ni Lao: Um and another um
00:35 Ni Lao: part is the recent uh publication we put out on arXiv,
00:41 Ni Lao: uh which introduce attributed text generation task.
00:46 Ni Lao: Um, let me
00:48 Ni Lao: So first disclaimer.
00:49 Ni Lao: Uh, this talk is like I said, it’s a combination of two talks. One is from last year about large language model.
00:55 Ni Lao: And the other one is this new paper we just uh put out on arXiv.
01:00 Ni Lao: Um, and I don’t represent Google. This is just I comment on new publications and old publications.
01:09 Ni Lao: So let’s see there are main three things: 1. LLMs vs Search Engines vs Databases, 2. Attributed Text Generation, 3. RARR (Retrofit Attribution using Research and Revision).
01:13 Ni Lao: Uh let’s start with the first one about large language model.
01:18 Ni Lao: This cake is very famous. It’s called uh Yann LeCun’s cake.
01:23 Ni Lao: Um, what he is trying to say is that
01:26 Ni Lao: for machine learning, the most important part is
01:30 Ni Lao: unsupervised training.
01:32 Ni Lao: Uh that’s the cake itself.
01:34 Ni Lao: And supervised training is just the icing.
01:37 Ni Lao: And reinforcement learning is just the cherry on the top.
01:40 Ni Lao: Um, because uh by the end of the day,
01:44 Ni Lao: uh you want your model to be able to learn from a few examples. For example,
01:50 Ni Lao: um a children can distinguish uh a type of new animal just by having one example, right?
01:57 Ni Lao: Um, in comparison, a lot of um image classification model need thousands of examples, only a few years ago.
02:05 Ni Lao: Um, maybe in the past a few years, this has changed a lot.
02:11 Ni Lao: Um, and pretrained pretrained model, um play a big very big role in this change.
02:19 Ni Lao: Um, the fundamental um
02:24 Ni Lao: uh relationship between data and model size is the following. It’s saying that
02:31 Ni Lao: the let’s say you the DE is the effective training data size.
02:36 Ni Lao: And DF is the the label data you provide to your task.
02:41 Ni Lao: And DT is how much data you can transfer from other tasks.
02:49 Ni Lao: And based on a lot of experiments, these researchers found that the effective transferred uh data set
02:58 Ni Lao: is has this relationship with
03:02 Ni Lao: your fine tune task size and the model size. So you can see that the bigger the model, the more you can transfer from
03:09 Ni Lao: generic task or pretrain task to your fine tune task.
03:14 Ni Lao: When your model is very, very big,
03:17 Ni Lao: you basically don’t need a lot of training data. Your your effective train data is basically just the pre-training the transfer data instead of your actually labeled data.
03:28 Ni Lao: So based on this, you can just like give very, very few labeled data uh and achieve good result because most of the knowledge is transferred from somewhere else.
03:51 Ni Lao: Um, this works really well uh for many cases, but also fails um in certain cases, and make the model very embarrassing to show their results.
04:05 Ni Lao: For example, you can try GPT-3, right? Let’s say you take one of the largest models and try to ask questions about the world, right?
04:16 Ni Lao: Uh if you ask something that’s very common… the model might give you the correct answer. Like if you ask what’s the birthday of Barack Obama, it will give you a correct date and year. (Fact)
04:30 Ni Lao: If you ask his about his wife, it will still give you the correct answer. (Fact) But when you ask more um detail knowledge… For example, what is Barack Obama’s father’s birthday? Barack Obama’s father’s birthday is August 4, 1961. (Fiction)
04:59 Ni Lao: …it will just like fake something… and show it to you, pretend this is the real one. And you have no way to tell, right? There’s no way for you to tell this is the correct one and this is the incorrect one just by looking at the answers. They all look very good… look like legit answers.
05:16 Ni Lao: But if you find a document… about the same the political topic, right? It’s very easy to verify if the answer is correct or not. You can find a page about Obama’s father or Obama’s family, you can easily verify this answer is incorrect or this answer is correct.
05:39 Ni Lao: So this is kind of a big problem if we want to use language model to produce things and for people to read. People might be fooled, right? Because the format of the answer is looks so good. People might think uh they are getting the truth or facts, but actually it’s made up by the language model.
05:58 Ni Lao: So in this talk, we’re just trying to understand why the language model is doing this and also what can be possibly done to fix that.
06:14 Ni Lao: Oh, okay. So I think Stephen asked the question, is it possible to get the confidence level of these tokens? Yes, you can get the confidence level for every token, right? But still you you cannot distinguish whether
06:29 Ni Lao: the confidence, the low confidence come from either of the two reason, right?
06:37 Ni Lao: The let’s say the this the one of the reason is model have never seen this fact in the corpus, right? Another possible reason is that the corpus has several answers which are conflicting with each other, right? In both cases, the model will give you a a low score. But there’s no way for you to tell um which is the case. And by default it’s also no way for you to verify if the output is is correct or not. So it will be very I wouldn’t trust the answer from this large language model.
07:14 Ni Lao: Um especially about facts.
07:16 Ni Lao: Um Let’s continue.
07:19 Ni Lao: Um let’s compare that with search engine.
07:23 Ni Lao: Uh search engine is kind of very, very different, but fundamentally, they can do the same thing, right? You are looking up things that you care about, right? You you can ask the same question to large language model and search engine and see how the answer are different.
07:39 Ni Lao: Um so search engines are very scalable, they come back very quickly. You can like accept a lot of queries and return the answer very quickly.
07:48 Ni Lao: Uh it’s more accountable. It sort of have understanding of which website are uh trustworthy and and will prioritize those websites.
08:00 Ni Lao: Uh however, it’s less generalizable. It’s uh or say it’s less smart than uh deep model. It doesn’t match uh different expression of the same concept that well.
08:13 Ni Lao: Um Ideally we we want to have both, right? We want to have scalability and accountability uh from the search engine, but we we want the large language model to but we also want to be generalizable like like the large language models.
08:32 Ni Lao: So the question is, can we make large language models like more like a search engine or more like a database? I would say. Um so whenever it returns an answer, can it give me attribution? Give me pointers to where this answer come from.
08:50 Ni Lao: Um and when it doesn’t have when it have never learned some of the facts, you should tell me. You should tell me like I don’t know, I have no record of this fact uh in my knowledge, right? And also you should separate data from logic, right? How you reason and query things is part of the model, but all these facts is sort of um kind of like a storage. How can we achieve those things, right?
09:13 Ni Lao: So what we believe that can get us closer to that point, it is to change the task, the the way we define text generation.
09:20 Ni Lao: Um especially we want to have the generated text to have to be attributed so that we it’s easy to verify uh if the output is correct or not correct.
09:31 Ni Lao: Um that will get make uh the language models a lot more trustworthy than it is today.
09:39 Ni Lao: Um and also we come up with uh a prototype system that can do attribution uh with um generation.
09:53 Ni Lao: Um at the same time, we want to investigate why this issue happen and what’s the possible solution. So, eventually what we came up with this post hoc fix um scheme where we don’t change large language model at all. We don’t change anything. The output is exactly what they used to output. But after that, we make some changes
10:11 Ni Lao: to fix the problem.
10:13 Ni Lao: Because um architecture uh innovation takes time. Um we we don’t we don’t need to do that right now. What we want is just to study what’s the problem. Um so there’s some interesting assumption we made. Uh one is that
10:32 Ni Lao: the uh the large language models even though they they cannot tell facts from fiction, they still contain valuable procedure knowledge, naming like what I should say given a question, right? How these like sentences should be structured. These are all very valuable.
10:51 Ni Lao: Um and the their initial output can be seen as a plan for the ideal output.
10:59 Ni Lao: And the only thing that’s missing from this output um are the facts in the in the generated text.
11:07 Ni Lao: That’s the main assumption we make. But eventually you will see the assumption might not hold that well, but uh at a very high level, it still holds.
11:18 Ni Lao: Okay. So the task we change the text generation task to be attributed text generation.
11:26 Ni Lao: Um so the setup, as I said, it’s post hoc fixing things. So we assume there is already a text generation model that generated some outputs. It can be answers to a question, summary of a passage or dialogue uh continuing one sentence in a dialogue. It can be any of these things, right?
11:45 Ni Lao: Um, and then uh a hypothetical system should do retrieval over a text corpus. Uh let’s say you can use a search engine over over the web.
11:59 Ni Lao: Uh and then the output would be uh one fixing all the factual errors um in the in the initial output.
12:09 Ni Lao: Two, also give a report of where these facts come from. For every claim in the text, um the system should should attribute that to some of the sentences somewhere in a corpus, right? Let’s say you have a URL representing the document ID and uh a sentence or a passage representing the the context that’s supporting the output.
12:38 Ni Lao: And eventually there is um human evaluation or automatic evaluation like model can evaluate the quality of these two outputs, the revision Y and the the attribution report A.
12:54 Ni Lao: So eventually you will give a score for how well are the claims attributed and also a score of how well the original intention has been preserved. Yeah, this is something new, like nobody have ever tried to measure this before, because nobody have this task setup. So this task setup assume that the original text generation model knows
13:22 Ni Lao: uh the in domain know have the in domain knowledge uh about what need to be said. So we want to preserve that intention because uh if you don’t preserve the initial intention, you you can very easily have a trivial solution, right? You always answer a fact like the earth is round.
13:44 Ni Lao: Um and then point to a particular page on the Wikipedia, right? You sort of start to talk about something completely different, but it’s always attributed, that will trivially solve the attribution problem, but it doesn’t really accomplish the original task. Like let’s say um the system was talking to human about a certain topic, right? You don’t want to switch the topic. You want to continue on that topic, but talking with facts.
14:09 Ni Lao: Um so how to measure the quality? Uh uh as I said, there are two measurement, one is attribution. Uh how the revised text Y can be attributed to the evidence in A.
14:26 Ni Lao: Uh so we use both human and automatic evaluation. Uh for human, this is a rating template that that’s published one year ago. Uh for automatic, this is a model that’s also published one year ago.
14:53 Ni Lao: Uh for preservation, uh there’s no existing measurement, so we have to come up by something new uh that measure whether the revised text Y preserve the intention of the original text X.
15:09 Ni Lao: Uh so there’s human rating template and also automatic metric. Uh for automatic metric, we use uh edit distance to see how many character or like uh what’s the portion of character that’s getting um replaced in the new text.
15:29 Ni Lao: And eventually the the preservation measure is just uh the product of these two measures.
15:36 Ni Lao: And to measure the overall quality of a system, we just combine these two metric, the attribution and preservation into one measure.
15:49 Ni Lao: So there’s an example rating template for attribution. Uh so basically for every sentence in text, there is the interface to ask the reader whether the sentence can be attributed to any of the given evidence. There should be a whole bunch of evidence.
16:09 Ni Lao: Uh so this is an end-to-end mapping between sentences and the evidences.
16:18 Ni Lao: Um for preservation of intent, um it’s just uh a multi class classification, whether the intent was uh preserved or not similar or someone in the middle.
16:31 Ni Lao: Okay, I guess I should uh stop if I see if anyone have any problem any question about um the task setup.
16:39 Ni Lao: Uh, I guess there are some comments in the um about yeah.
16:46 Ni Lao: GitHub software pirate GitHub. The main point there is a violation of requirement for code use attribution. Don’t know from legal point of view if that case has weight or not, but that’s the first and foremost violation.
17:03 Ni Lao: Yeah, I I’m not a lawyer, but uh I guess it’s always good to attribute things when you are writing, right? And same that’s true for human and probably it’s also true for machines. Whenever machines write a sentence, it should try to attribute that to something um in the literature as much as possible.
17:27 Ni Lao: So I do have a quick question on the revision research Uh-huh. So it seems like that you’re updating that’s for example, you updating the record to the test one.
17:39 Ni Lao: Right? So in this case, in the use case that you show. Uh-huh.
17:45 Ni Lao: So, uh when you override or maybe call the new data in the corpus, Uh-huh. Do we need to keep the old one or you just override? We don’t need to keep the old one because you think about why where does this old one come from, right?
17:59 Ni Lao: Uh, when you ask, let’s say, uh the original question is, what is the world record for uh so and so, right? Um for I guess this is like running or something, right? Well let’s say what’s the world record for running, right? And then as a human, right? You know the the format of the output. The format should be like the the marathon record was certain time, right? From hour and minute and second by certain people somewhere, of somewhere at some year, right? You know the exact format, right? But as a human, you cannot write down the exact time, exact date, and exact year, right? Same thing for the model, right? The model probably doesn’t have this fact uh at hand by at the hand to tell you exactly what it should be. But it knows sort of the format. It will first generate um a sentence that has the correct format, but only thing that need to be fixed is the facts. So in that sense, there’s no point of keeping the original number like this like this hour and and minute and second. Actually the it has the very good guess, right? It guess the hour and minute correctly, but miss the second, which is very hard, right? So there’s no point of keeping that because you know the model will struggle, it will like try to guess, right?
19:33 Ni Lao: Like like you you like you have an initial guess and then you find a Wikipedia page or something, right? You find the actual facts, and then you have your final answer, right? You output the final answer. So there’s no point of keeping the initial guess.
19:47 Ni Lao: So my assumption is that if somebody ask, let’s say top one, top two, some kind of sequence. Let’s say ask question, who is the world record holder before someone called Kim change? then how how this would respond? If you don’t have this kind of record of B, then
20:08 Ni Lao: Uh, can you say that Can you say that again? I didn’t quite get your question.
20:13 Ni Lao: Yeah, let’s say somebody hold the world record in 2018 is A, right? Uh-huh. But I want to ask the question, who is the holding record before A? It was B, something like that. But you say we don’t keep the record of B then
20:28 Ni Lao: Oh no, we don’t keep the the guessing, right? The guessing by the model. The model really don’t have enough information to like give you the exact answer anyway, right? We don’t keep that.
20:39 Ni Lao: Okay. So there’s no point of keeping this 39 seconds because that’s made up. Okay. Right? Got it. Thank you.
20:46 Ni Lao: There’s another question on the latency.
20:49 Ni Lao: What about the impact on latency? Do you try to measure that? compared to ground attribution in one go instead of generate and revise.
21:01 Ni Lao: Uh, yeah, we didn’t measure that. It definitely is going to be slower, right? Because you you generate and then regenerate, right? It’s definitely going to be slower. But that’s yeah, that’s just how this is set up.
21:19 Ni Lao: Okay, let’s uh continue.
21:23 Ni Lao: Um then how to measure the quality? Uh uh as I said, there are two measurement. One is attribution. Uh how the revised text Y can be attributed to the evidence in A?
21:36 Ni Lao: Uh so we use both human and automatic evaluation. Uh for human, there’s a rating template that that’s published one year ago.
21:46 Ni Lao: For automatic, there this is a model that’s also published one year ago.
21:53 Ni Lao: Uh for preservation, um there’s no existing measurement, so we have to come up by something new. Uh that measure whether the revised text Y preserve the intention of the original text X.
22:10 Ni Lao: Uh so there’s human rating template and also automatic metric. Uh for automatic metric, we use uh edit distance to see how many character or like uh what’s the portion of character that’s getting um replaced in the new text.
22:29 Ni Lao: And eventually the the preservation measure is just uh the product of these two measures.
22:36 Ni Lao: And to measure the overall quality of a system, we just combine these two metric, the attribution and preservation into one measure.
22:49 Ni Lao: So there’s an example rating template for attribution. Uh so basically for every sentence in text, there is the interface to ask the reader whether the sentence can be attributed to any of the given evidence. There should be a whole bunch of evidence.
23:10 Ni Lao: Uh so this is an end-to-end mapping between sentences and the evidences.
23:18 Ni Lao: Um for preservation of intent, um it’s just uh a multi class classification, whether the intent was uh preserved or not similar or someone in the middle.
23:33 Ni Lao: Uh so there’s some question about GitHub. Yeah, so I I’m not sure.
23:44 Ni Lao: Okay, so this is the task setup. Um any question about the task setup?
23:50 Ni Lao: So you mentioned about the preservation. Is this the the industrial or study standard they use as the measurement or No. Nobody nobody used this before, right? No, I use it because of the specific way we set up this task, right? The task is to modify the initial output of uh text generator. So basically our solution is task agnostic, right? It doesn’t matter what task the the first model is trying to do. Uh this our our solution is just trying to fix the facts. So assumption is that fixing facts is something that’s very generic, that’s not task specific, but that may or may not be true, but you have to make some assumption before you do anything, I guess.
24:40 Ni Lao: Um Yeah. Okay, so can you give for example, what is the uh preservation score higher case and what is the low case in in how do you measure it? Oh, here, right? This is the example, right? There’s a passage A and a passage B. And then given the same context above, how similar is the intent conveyed by passage A and passage B, then the reader will just choose one of these three, right? Similar or not similar or somewhere in the middle.
25:12 Ni Lao: Okay, so this is evaluated by human. Yeah, this is human. Okay.
25:19 Ni Lao: Um Okay. Now, we switch to the actual solution or we would yeah, we can say it’s a solution. Um but mainly just uh demonstrating a point of how these issues can potentially be uh be fixed.
25:41 Ni Lao: So the system starts with input text passage, so like this here. Uh somebody premier something, I guess it’s a movie or something. Premiered on so and so date on so and so uh channel.
25:57 Ni Lao: Um and then the system will start with generating queries from this passage, then each query represents uh a claim that need to be verified. And these queries are sent to some search engine. And the search engine returns documents and which are getting turned into passages. Um and all these passages are sort of the context that can be used to attribute uh these claims.
26:31 Ni Lao: And there are several modules. Some of the module decide whether passages are relevant, some of you decide whether uh relevant passage agree or not agree with uh with your initial passage. So if they agree, there’s nothing to be done, right? Just skip this uh context. If they do not agree, there is uh edit module that takes in two passages also the query and produce a new passage that try to fix the original passage.
27:10 Ni Lao: And eventually there is some mechanism to pick a subset of the evidence um uh into a report so that human can judge um the attribution and uh and the preservation.
27:27 Ni Lao: So the query generation part is from the model or Uh, all of them are models, right? Like generate query, judge whether um the passage is relevant or uh does passage agree and also make edits. All these are just models.
27:46 Ni Lao: So the query are the pure text is Yeah. Edit is also pure text, right? Agreement it kind of like classification, but you can turn that into pure text. The output is yes or no, let’s say.
28:02 Ni Lao: Um Okay. So in the retrieval part, Uh you do some tokenization or how do we do the retrieval sign in here?
28:13 Ni Lao: Uh it’s sent to Google. So this query is sent to google.com. Oh, okay. Google.com come back with documents, yeah. Okay. Got it.
28:25 Ni Lao: Uh about all these modules, right? Um this is like something that we come up in a short amount of time. Uh there’s no training anywhere. Um it’s just few short learning and also demonstrating how the large models can learn with very few examples, right? Uh so all these modules are just like prompts that you send to a large language model. And and it needs very few labeled data, but also but it needs prompt engineering. So basically, you need to try all different ways to talk to large language model so the model will do things that that accomplish your certain the task, right? For example, for query generation, the prompt will sort of pretend it’s talking to someone, like you said something. This is the the original passage, right? To verify it, I Google something. I Google something, I Google something, I Google something. It’s like literally like pretending like talking to someone about Googling some facts about um what you said. Maybe people really talk like this way on Reddit, I don’t know.
29:43 Ni Lao: Um And the similarly for other components, right? You you sort of pretending to be talking to someone uh in a prompt.
29:57 Ni Lao: Any question about this part?
30:04 Ni Lao: So when you when you Google it, return a whole document. So how do you know which part is more important than the others? I think there is some logic that’s not prompt. There’s some logic to break the document into passages and decide like how relevant is each passage.
30:26 Ni Lao: Okay, so this is included in the model that you proposed or It is. Yeah, it’s part of the um the solution. It’s not something existing. Oh yeah, that’s what I understand. Yeah. Yeah.
30:41 Ni Lao: How do they come up with those prompt? Uh researchers or interns, I guess. Like you need people to like to really try all different ways to talk to large language model to end up with this, right?
30:56 Ni Lao: So it’s kind of ad hoc. Yeah, it’s uh black magic. Okay.
31:06 Ni Lao: Um generating the attribution report. Um there’s some simple logic to pick at most M uh evidence to be to be part of the attribution report, right? Because to prevent the extreme case where you include every text, every document in the um in the attribution report, then then it’s very easy to get a very high attribution score. So the the system should really pick only the one that’s needed um to to verify the claims. So there’s this uh exhaustive search to find the minimum the the set of um evidence that that sort of explains every claim in in the generated text. And the claims are represented by these search queries.
32:05 Ni Lao: Um so now we switch to uh evaluation or the experiments. So in the experiment, we uh we experiment with uh quite a few tasks. These are the task that are sort of works well. And later I’ll talk about there are other tasks which are more challenging, which are uh more like mass um or other type of tasks.
32:34 Ni Lao: And these uh here these three tasks are uh question answering, reasoning or dialogue. And you can see these are example uh system outputs for these tasks.
32:48 Ni Lao: And for these tasks, we um use different language model to generate the initial outputs. Uh so for dialogue, we use LaMDA because LaMDA is sort of trained to do dialogue. We feel that might be the best uh you can model you can use for this task.
33:10 Ni Lao: Uh and for non dialogue tasks, we use PaLM.
33:19 Ni Lao: And for the baseline, um we pick two baseline. Uh one is LaMDA research. So the LaMDA is kind of a very big system, right? And then it has a component where it take an initial output of a language model, and then it starts to do Google search, basically. And try to fix um issues in the initial output until it decide that the output looks okay, and it will output it will uh output response to the user.
33:55 Ni Lao: Uh this is one baseline. Um the other baseline is from uh fact correction literature. So this is not a dialogue system. This is like fact checking, fact correction system, where it starts with a claim and also does retrieval, like all the system look very similar, but they sort of designed for different purpose and will behave very differently.
34:23 Ni Lao: Um it does retrieval and then it corrects the the output based on the retrieval result.
34:31 Ni Lao: Um here are the main results. Um you can see that um for EFEC, it’s designed to fix uh the attribution, fix the facts. You can see uh the attribution does goes very high, right? It it like when it output something, it like 50% of the output can be attributed, which is higher than all the other systems. However, it tends to like completely change everything, change all the outputs. Um, and when we look into what has been changed, it looks like um it it often will delete a lot of content. So basically take a passage and uh it will keep some part of the passage that can be attributed, but also delete all the other parts that cannot be attributed. So eventually you sort of you lose some of the intention of the original passage. That’s why the preservation score is very, very low, right? It’s like lower than 10%, which like you you lose a lot of information. Even though the result is fully or mostly attributed.
35:49 Ni Lao: LaMDA is sort of similar, it’s less attributed, but it will keep more of the content. Um and the system we just described, um will preserve most of the content. It will preserve like 80% of the content most of the time, which is much, much higher because um it’s designed to uh to preserve the original intent. Uh even though the attribution is slightly lower, uh but but if you compute the F1 measure, it’s going to be highest because it preserve the original content.
36:29 Ni Lao: Uh any question about this result?
36:36 Ni Lao: What does the dash line indicate? The dash line is um the attribution without editing. So remember these system internally, they do some retrieval, right? Once you retrieve, you can already compute how many, how much of the generation can be attributed to the retrieval result, right? Without editing anything, you can already attribute some of the sentences, right? But but with further editing, you’re supposed to get better because some of the facts might be wrong, right? And therefore cannot be attributed. And if you replace the wrong fact with the correct fact, then then they are attributable. So you’re supposed to be higher then the dash line. So dash line has no uh editing.
37:31 Ni Lao: So it sort of tells you how much editing is improving the attribution.
37:37 Ni Lao: Yeah. So one question in terms of accountability, that was one of the original goals. Uh so if the attribution percentage goes down, I know score goes up, but if the attribution percentage goes down, uh how does that help with the accountability goal? Where do you see it go down? It’s going up, right? The these dots are higher than the dash line, right? The dash line, okay. So that is the baseline and then Yeah, it’s going up except for LaMDA is the one that is going down. Okay, in the first Yeah. Oh, this is going uh this there should um I think there should be like three different dash lines because each system actually does a slightly different retrieval. Uh so there this is the highest dash line, I guess. So LaMDA probably started with some dash line which is lower and it improved over that. But this figure is a little bit misleading because each dot should have its own dash line. Makes sense. Thank you. Yeah. Yeah, so here it’s just showing the highest attribute score among all three system. Uh so it’s not clear which system produce this dash line. Maybe we should draw three dash lines.
38:54 Ni Lao: Uh okay. That’s a 100% sure, right? So this That’s why we call this attribution, right? We never say this is factual, right? Because fact is a much higher standard where you assume the source is trustworthy. So attribution only means that you find something that that supports your claim, right? But that’s something whether that something is really trustworthy, we don’t have any claim on that.
55:18 Ni Lao: But you’re making an editorial decision whether to include that source or not. So Junling, this is the same as when the Microsoft Tay model was polluted with Hitler comments. What what if somebody tries to put Mein Kampf into the model?
55:33 Ni Lao: Yeah, that’s a larger question.
55:36 Ni Lao: Rucher, uh you can go ahead on your question.
55:43 Ni Lao: Hey, thanks Junan. So I I I’m new to this area, but I’ve been fascinated by it. I guess my question is a very simple one. Is it common to have such parameterized machine learning models in in machine learning papers where you can um, you know, based on your choice of parameters in this graph, come up with a new model easily and tune it for a certain purpose?
56:08 Ni Lao: Yeah, traditionally you, you you tune your model to do new tasks with a lot of training examples. But more recently because these large models are more generalizable, you can just give it a few examples instead of thousands of examples.
56:27 Ni Lao: And is that because you’re working on a um already a large model which has all the information and you just need to tune it?
56:36 Ni Lao: Uh so there’s no tuning at all, right? So these models are like um large language models, a large language model, right? You you give this portion as the the blue portion as the the input to the model, the model will continue to generate the rest of the the outputs, right? Mhm. Uh and it will generate all these questions given the input.
57:03 Ni Lao: So the way you teach the model is to give a few examples like this. So for this passage, I generate those queries, for that passage, I generate those uh queries. And then use that as the initial input to the model, and then you add one more passage to ask the model to continue to generate something.
57:24 Ni Lao: Okay, very interesting. Thanks for that. And is this available online or do I have to set this up if I want to play with something like this?
57:33 Ni Lao: It’s not open source. We are working on open sourcing this, but it’s not.
57:38 Ni Lao: But uh but the prompts are you can see all the prompts on the paper. In the appendix we include all the prompts.
57:48 Junling Hu: Great. Thank you. Uh I guess we reached the end of our uh talk uh our meeting time. Thanks everyone for coming and thanks Lee for giving this wonderful talk.
58:01 Ni Lao: Thanks everyone. Have a great weekend. Thank you. Thank you very much.
58:08 Ni Lao: Uh Junling, I just posted my announcement about my own talk that I mentioned to you. Okay. Thanks for the talk, Niel. and uh thanks, Jun.
Transcript summary
Talk Overview:
Ni Lao discussed two main topics:
- Large Language Models (LLMs) and their limitations.
- Attributed Text Generation—introducing methods to improve LLM trustworthiness by attributing generated text to verifiable sources.
Main Points:
1. Large Language Models vs. Search Engines:
- Yann LeCun’s cake metaphor:
- Unsupervised training is the cake.
- Supervised training is icing.
- Reinforcement learning is the cherry on top.
- Data Transferability:
Larger models transfer more effectively from pre-training to fine-tuning tasks. Big models require less task-specific data due to extensive knowledge from pre-training.
- LLM Limitations:
LLMs generate plausible but often incorrect facts (“hallucinations”). For example:
- Correct fact: “Barack Obama’s birthday.”
- Incorrect fact: “Obama’s father’s birthday.”
- Confidence Issues:
LLMs provide confidence scores but can’t differentiate between:
- No information available.
- Conflicting information sources.
- Comparison with Search Engines:
- Search engines provide more reliable results by prioritizing trustworthy sources but lack the generalization and intelligence of LLMs.
- Goal: combine the accountability of search engines with the flexibility of LLMs.
2. Attributed Text Generation:
- New task to generate text with verifiable sources (attribution).
- System design includes:
- Initial output from LLM.
- Verification queries to search engines.
- Fact-checking and revision based on retrieved sources.
- Generation of an attribution report.
- Evaluation metrics:
- Attribution: How well text claims match retrieved evidence.
- Preservation: How closely the revised text maintains the original intent.
- Example Metrics:
- Human and automated evaluations.
- Preservation measured via edit distance and intent similarity.
3. Prototype System (RARR: Retrofit Attribution using Research and Revision):
- Workflow:
- Generate text.
- Extract verification queries.
- Retrieve documents (e.g., via Google).
- Check passage relevance and factual correctness.
- Edit passages if needed.
- Produce attribution report citing evidence.
- Modules were implemented through prompt engineering—a trial-and-error process where the best-performing prompts were selected manually (“black magic”).
Experiment Results:
- Compared with two baselines (LaMDA research & EFEC):
- RARR achieved the highest preservation of original intent (around 80%).
- EFEC had higher attribution but heavily reduced content preservation.
- RARR balanced attribution and preservation most effectively.
Audience Q&A Highlights:
Open Source Availability:
Currently not open-source, but plans to release prompts in the paper appendix.
Original vs. Revised Facts:
Discard initial LLM-generated guesses when corrected by verified facts.
Latency Issues:
The proposed two-step approach (generate and then verify/revise) naturally increases latency.
Accountability:
System provides clear attribution but doesn’t inherently verify source credibility.
Practical Application:
Current solution is purely through prompts and large LLMs.
No additional training; few-shot learning with minimal examples.
Model response
Yes—that’s a fair critique. What Ni Lao et al. describe is essentially a generate‑then‑ground (GenGround) pipeline: you let the LLM spit out an autoregressive answer (and even generate its own “verification” queries), and then you go back and try to patch any factual errors by retrieving documents to “ground” each claim (aclanthology.org). That approach has two major weaknesses:
- Error propagation through query generation
If the LLM’s generated queries are off-target or ambiguous, the retrieval step will fetch irrelevant or misleading passages. Because everything hinges on that initial query layer, a single bad query can derail the entire grounding process.
- Latency and brittleness
You pay the cost of a full generation pass and then a retrieval+revision pass. And since the LLM wasn’t conditioned on any real evidence up front, it has no built‑in notion of which facts it actually “knows” versus invented.
A more robust alternative: Retrieve‑then‑Generate
The standard Retrieval‑Augmented Generation (RAG) paradigm flips the order:
- Retrieve a small set of grounded documents directly using the user’s original query.
- Generate the answer conditioned on those retrieved passages, which are fed into the LLM’s context window (arXiv).
By exposing real evidence up front, you:
- Eliminate the brittle query‑generation layer—you use the user’s intent directly for retrieval.
- Improve factuality—the model can attend to and copy or reason over actual source text rather than hallucinating.
- Reduce end‑to‑end latency—a single generation pass over a fused [user query + retrieved docs] context is usually faster than generate‑then‑ground.
In practice, systems like “RAG” (Lewis et al., 2020) or Fusion‑in‑Decoder (Izacard & Grave, 2021) demonstrate that giving the LLM a unified context of user query plus retrieved evidence before generation leads to far more reliable, less error‑prone outputs—no “afterthought” grounding needed.
Alternative model response
Your critique is basically right: RARR’s “generate → ground” order is a pragmatic patch, not the ideal architecture for factual reliability.
Below is a quick comparison of the two design philosophies and why the post‑hoc route was chosen, plus emerging alternatives.
1 Where the generate‑then‑ground pipeline is fragile
Step | Main risk | Failure cascade |
---|---|---|
(a) Autoregressive draft | Hallucinated facts inserted. | Everything downstream must now detect & fix them. |
(b) Query generation | Draft must be parsed into good search queries. | Bad or missing queries → no relevant evidence → unrepaired errors. |
(c) Retrieval & edit | Depends on (b) fetching correct passages. | Even a perfect editor can’t fix what it can’t see. |
Because the draft is already committed to particular entities, dates, etc., any retrieval miss forces expensive re‑writes or content deletion, which is why RARR scored high on attribution but lost intent preservation (arXiv).
2 Why they did it anyway
Practical advantage | Rationale |
---|---|
Model‑agnostic | Works with any off‑the‑shelf LLM; no retraining. |
Keeps stylistic value | Initial draft carries tone/structure; editor only patches facts. |
Easy to bolt on | Can retrofit existing products while RAG infra is still being built. |
In short, it’s a fast, low‑invasion bandaid for hallucinations, not a long‑term blueprint.
3 Retrieval‑first (RAG) is cleaner—and now feasible
Retrieve → read → generate pipelines load evidence before decoding, so the model never invents facts it can’t source.
Recent work makes this approach more attractive:
- Stochastic RAG – end‑to‑end differentiable retrieval that jointly optimises retriever and generator (arXiv).
- OneGen – single‑pass model that interleaves token generation with on‑the‑fly vector retrieval (ACL Anthology).
- GenGround – shows gains on multi‑hop QA by mixing parametric knowledge with retrieved snippets early (arXiv).
These systems reduce dependence on a fragile query‑generation step because retrieval is either optimised jointly or invoked token‑by‑token.
4 What a “better” pipeline looks like now
- User query
- Dense retriever (or hybrid) pulls k candidate passages.
- LLM receives query + top‑k passages in one context window.
- Constrained decoding (e.g., copy‑attention, citation markers) forces the model to ground each claim.
- Optional verify‑and‑refine loop (smaller, faster) for final polish.
5 Bottom line
Your assessment—that the post‑hoc approach is sub‑optimal and query‑generator‑dependent—is accurate. It was a conscious trade‑off for ease of deployment; the research frontier is now shifting toward integrated retrieval‑generation that mitigates the single‑point‑of‑failure issue you highlighted.
Fusion‑in‑Decoder was a new concept to me; I discovered it while authoring this article.