general information

RAG provides up-to-date information about the world and domain-specific data to your GenAI applications.

embedding models

An embedding model identifies relevant information when given a user’s query. These models can do this by looking at the “human meaning” behind a query and matching that to the “meaning” of a broader set of documents, webpages, videos, or other sources of information.
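
To make this concrete, here is a minimal sketch of that matching step. It assumes the sentence-transformers library; the model name, query, and example documents are placeholders for illustration, not part of the original text.

```python
from sentence_transformers import SentenceTransformer, util

# Any bi-encoder embedding model works here; this is just a small example choice.
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "How can I give my LLM access to recent domain knowledge?"
documents = [
    "RAG supplies an LLM with up-to-date, domain-specific information at query time.",
    "The context window limits how much text an LLM can read at once.",
    "Croissants are made by laminating butter into yeasted dough.",
]

# Encode the query and documents into fixed-length vectors that capture their "meaning".
query_vec = model.encode(query, convert_to_tensor=True)
doc_vecs = model.encode(documents, convert_to_tensor=True)

# Cosine similarity between the query vector and each document vector.
scores = util.cos_sim(query_vec, doc_vecs)[0]

# Rank documents by how closely their meaning matches the query's meaning.
for score, doc in sorted(zip(scores.tolist(), documents), reverse=True):
    print(f"{score:.3f}  {doc}")
```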

rerankers and two-stage retrieval

As with most tools, RAG is easy to use but hard to master. The truth is that there is more to RAG than putting documents into a vector DB and adding an LLM on top. That can work, but it won’t always.

For vector search to work, we need vectors. These vectors are essentially compressed representations of the “meaning” behind some text, typically squeezed into 768 or 1536 dimensions. Because we compress all of that information into a single vector, some of it is lost.

Because of this information loss, we often see that the top three (for example) vector search results miss relevant information. Unfortunately, relevant documents may sit below our top_k cutoff and never be returned.

What do we do if relevant information at a lower position would help our LLM formulate a better response? The easiest approach is to increase the number of documents we’re returning (increase top_k) and pass them all to the LLM.

The metric we would measure here is recall — meaning “how many of the relevant documents are we retrieving”.

recall@K = (# of relevant docs returned) / (# of relevant docs in the dataset)
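
As a worked example (the document IDs and relevance labels below are hypothetical), recall@K can be computed like this:

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """recall@K = (# of relevant docs in the top K results) / (# of relevant docs in the dataset)."""
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# Ranked results from vector search, and the ground-truth relevant doc IDs.
retrieved = [12, 7, 44, 3, 19, 88, 5, 61, 9, 30]
relevant = {7, 19, 61, 2}

print(recall_at_k(retrieved, relevant, k=3))   # 0.25; only doc 7 made the top 3
print(recall_at_k(retrieved, relevant, k=10))  # 0.75; raising K catches docs 19 and 61
```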

Unfortunately, we cannot return everything. LLMs have limits on how much text we can pass to them; we call this limit the context window.

A large context window can fit many tens of pages of text, so could we return many documents (not quite all) and “stuff” the context window to improve recall?

Again, no. We cannot use context stuffing because this reduces the LLM’s recall performance — note that this is the LLM recall, which is different from the retrieval recall we have been discussing so far.

When storing information in the middle of a context window, an LLM’s ability to recall that information becomes worse than had it not been provided in the first place.

LLM recall refers to the ability of an LLM to find information from the text placed within its context window. Research shows that LLM recall degrades as we put more tokens in the context window. LLMs are also less likely to follow instructions as we stuff the context window — so context stuffing is a bad idea.

We can increase the number of documents returned by our vector DB to increase retrieval recall, but we cannot pass these to our LLM without damaging LLM recall.

The solution to this issue is to maximize retrieval recall by retrieving plenty of documents, and then maximize LLM recall by minimizing the number of documents that make it to the LLM. To do that, we reorder the retrieved documents and keep only the most relevant for our LLM; this reordering step is reranking.

A reranking model — also known as a cross-encoder — is a type of model that, given a query and document pair, will output a similarity score. We use this score to reorder the documents by relevance to our query.
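
A minimal sketch of that scoring step, assuming the sentence-transformers CrossEncoder class; the model name, query, and candidate texts are illustrative placeholders:

```python
from sentence_transformers import CrossEncoder

# Any cross-encoder works; this is a common small reranking model choice.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What limits how much text an LLM can read?"
candidates = [
    "The context window limits how much text an LLM can process at once.",
    "Vector databases store document embeddings for similarity search.",
    "Rerankers reorder retrieved documents by relevance to the query.",
]

# The cross-encoder sees each (query, document) pair together and outputs one relevance score per pair.
scores = reranker.predict([(query, doc) for doc in candidates])

# Reorder the candidates by their reranker score.
for score, doc in sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True):
    print(f"{score:.3f}  {doc}")
```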

Search engineers have used rerankers in two-stage retrieval systems for a long time. In these two-stage systems, a first-stage model (an embedding model/retriever) retrieves a set of relevant documents from a larger dataset. Then, a second-stage model (the reranker) is used to rerank those documents retrieved by the first-stage model.
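
Putting the two stages together might look like the following sketch. The function name and the retrieve_k/return_k parameters are made up for illustration; it assumes the bi-encoder and cross-encoder objects from the earlier sketches and pre-computed, L2-normalized document vectors.

```python
import numpy as np

def two_stage_search(query, documents, doc_vectors, bi_encoder, reranker,
                     retrieve_k=25, return_k=3):
    """Stage 1: fast vector search. Stage 2: accurate reranking of the candidates."""
    # Stage 1: embed the query and take the retrieve_k most similar documents.
    # With normalized document vectors, the dot product is cosine similarity.
    query_vec = bi_encoder.encode(query, normalize_embeddings=True)
    sims = doc_vectors @ query_vec
    candidate_idx = np.argsort(-sims)[:retrieve_k]

    # Stage 2: score each (query, candidate) pair with the cross-encoder and
    # keep only the return_k highest-scoring documents for the LLM.
    pairs = [(query, documents[i]) for i in candidate_idx]
    scores = reranker.predict(pairs)
    best = np.argsort(-scores)[:return_k]
    return [documents[candidate_idx[i]] for i in best]
```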

The reason for splitting retrieval into two stages is that rerankers are slow, and retrievers are fast.

If a reranker is so much slower, why bother using them? The answer is that rerankers are much more accurate than embedding models.

When using bi-encoder models with vector search, we frontload all of the heavy transformer computation to when we are creating the initial vectors — that means that when a user queries our system, we have already created the vectors, so all we need to do is:

  1. Run a single transformer computation to create the query vector.
  2. Compare the query vector to document vectors with cosine similarity (or another lightweight metric).
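
The two steps above might look like this sketch, again assuming sentence-transformers with placeholder texts; note that the heavy encode call over the documents runs once at indexing time, while query time is a single encode plus cheap vector math.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice

# Indexing time (done once, ahead of any query): the heavy transformer work of
# embedding every document happens here.
documents = ["first document text ...", "second document text ...", "third document text ..."]
doc_vectors = bi_encoder.encode(documents, normalize_embeddings=True)

# Query time, step 1: a single transformer pass to create the query vector.
query_vec = bi_encoder.encode("example user query", normalize_embeddings=True)

# Query time, step 2: compare the query vector to the pre-computed document
# vectors; with normalized vectors, the dot product equals cosine similarity.
scores = doc_vectors @ query_vec
top_idx = np.argsort(-scores)[:2]
print([documents[i] for i in top_idx])
```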

With rerankers, we are not pre-computing anything. Instead, we’re feeding our query and a single other document into the transformer, running a whole transformer inference step, and outputting a single similarity score.
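
A rough, purely illustrative count of where the transformer passes go at query time (the numbers below are hypothetical, not from the original text):

```python
# Purely illustrative numbers.
num_docs = 1_000_000   # documents already embedded in the vector index
top_k = 25             # candidates handed to the reranker in a two-stage setup

# Bi-encoder at query time: one transformer pass (the query itself); comparing
# against the pre-computed document vectors is cheap vector math.
bi_encoder_passes = 1

# Cross-encoder at query time: one full transformer pass per (query, document)
# pair it scores, so the cost scales with how many documents it must rank.
rerank_candidates_passes = top_k      # two-stage: rerank only the retrieved candidates
rerank_everything_passes = num_docs   # reranking the whole index per query is impractical

print(bi_encoder_passes, rerank_candidates_passes, rerank_everything_passes)
```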

applications

references

  1. Retrieval Augmented Generation (RAG)
  2. Rerankers and Two-Stage Retrieval

attachments

version history

2025-04-14, first draft;