RAG from scratch (4)

source : https://www.youtube.com/playlist?list=PLfaIDFEXuae2LXbO1_PKyVJiQ23ZztA0x




(Indexing techniques for vectorstores)


(1) Multi-representation Indexing

build a summary-style index whose entries are used to fetch the full document

i.e. distill documents


idea from "Proposition Indexing" (https://arxiv.org/pdf/2312.06648) - decouple the raw document from the retrieval unit


document → split → (via LLMs) make proposition


what is a proposition?

a distillation of that split (a kind of summary) that is better optimized for retrieval; in the paper, an atomic, self-contained statement of fact


ref : https://www.youtube.com/watch?v=gTCU9I6QqCE&list=PLfaIDFEXuae2LXbO1_PKyVJiQ23ZztA0x&index=12


(a) the full raw document is saved in a docstore + (b) a summary of the raw document is generated (via LLMs) and embedded in a vectorstore


at query time, the query is matched against the summaries, and the best-matching summary points back to its full raw document, which is fed into the LLM to generate the answer

(suited for long context LLMs)


i.e. the LLM receives the full raw document, not the summary
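The docstore + vectorstore split can be sketched in plain Python. This is only a minimal illustration, not the LangChain implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, `summarize` is a hypothetical stand-in for the LLM summary step (it just takes the first sentence), and the class name `MultiRepresentationIndex` is made up for this note.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarize(doc):
    """Hypothetical stand-in for an LLM summary: keep only the first sentence."""
    return doc.split(".")[0]


class MultiRepresentationIndex:
    def __init__(self):
        self.docstore = {}     # doc_id -> full raw document
        self.vectorstore = []  # list of (summary embedding, doc_id)

    def add(self, doc_id, doc):
        # the raw document goes to the docstore, only its summary is embedded
        self.docstore[doc_id] = doc
        self.vectorstore.append((embed(summarize(doc)), doc_id))

    def retrieve(self, query, k=1):
        # search over summaries, but return the full raw documents
        q = embed(query)
        ranked = sorted(self.vectorstore, key=lambda item: cosine(q, item[0]), reverse=True)
        return [self.docstore[doc_id] for _, doc_id in ranked[:k]]


index = MultiRepresentationIndex()
index.add("a", "ColBERT scores passages with token level late interaction. It keeps one vector per token.")
index.add("b", "RAPTOR builds a tree of summaries over clustered chunks. Higher levels are more abstract.")
top = index.retrieve("how does a tree of summaries work")
```

The query only ever touches the summary embeddings; the shared `doc_id` is what links a summary hit back to the raw document in the docstore.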


ref : 

1. Dense X Retrieval: What Retrieval Granularity Should We Use?

(Proposition Indexing)




(2) RAPTOR - hierarchical indexing

(hierarchy in abstraction)

hierarchical index of summaries

recursively build more abstract, high level summaries


raw documents → clusters → summary of clusters → clusters → higher level of summary


why? lower level questions vs. higher level questions.

lower level questions may only require details from a single raw document, whereas higher level questions may require gathering information across several documents (or chunks)
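The recursive "cluster, then summarize" loop can be sketched as below. This is a toy under stated assumptions, not the RAPTOR implementation: `cluster` simply groups neighbours in fixed-size batches (RAPTOR itself clusters by embedding similarity), and `summarize` is a hypothetical stand-in for the LLM call that keeps the first three words of each text.

```python
def summarize(texts):
    """Hypothetical stand-in for an LLM summary: first three words of each text."""
    return " / ".join(" ".join(t.split()[:3]) for t in texts)


def cluster(nodes, size=2):
    """Placeholder clustering: fixed-size batches of neighbours.
    (Assumption: a real system would cluster by embedding similarity.)"""
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]


def build_tree(chunks, size=2):
    """Return a list of levels: level 0 holds raw chunks, higher levels hold summaries."""
    if size < 2:
        raise ValueError("size must be at least 2 so that each level shrinks")
    levels = [list(chunks)]
    while len(levels[-1]) > 1:
        groups = cluster(levels[-1], size)
        levels.append([summarize(group) for group in groups])
    return levels


chunks = [
    "Cats sleep most of the day.",
    "Cats hunt at night.",
    "Dogs need daily walks.",
    "Dogs are pack animals.",
]
levels = build_tree(chunks)

# one option: index the nodes of every level together, so detailed questions can
# match raw chunks and broad questions can match higher-level summaries
all_nodes = [node for level in levels for node in level]
```

With four chunks and `size=2`, this yields three levels (4 chunks, 2 cluster summaries, 1 root summary), so seven nodes in total are available for retrieval.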


ref : 

1. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval



further studies




(3) ColBERT

why? compressing a full document into a single vector may be too restrictive

tokenize document + tokenize question → embed every token → for each question token, find its max similarity over all document tokens → sum those maxima to get the document's score
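The scoring step (often called MaxSim) can be sketched in a few lines. The 2-d token vectors below are made up for illustration; in ColBERT they would come from a BERT encoder, and `maxsim_score` is just a name chosen for this note.

```python
import math


def normalize(v):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def maxsim_score(query_vecs, doc_vecs):
    """For each query token take its best-matching document token, then sum the maxima."""
    score = 0.0
    for q in query_vecs:
        score += max(sum(a * b for a, b in zip(q, d)) for d in doc_vecs)
    return score


# toy per-token embeddings (assumption: stand-ins for real contextual embeddings)
query = [normalize(v) for v in [[1.0, 0.0], [0.0, 1.0]]]
doc_a = [normalize(v) for v in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]]
doc_b = [normalize(v) for v in [[1.0, 1.0], [-1.0, 0.0]]]

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

`doc_a` contains an exact match for each query token, so it scores higher than `doc_b`, where every query token only finds a partial match.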


ref :

1. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT


2. LangChain documentation on RAGatouille




