Lessons Learned Building a High-Quality RAG System

https://www.reddit.com/r/LocalLLaMA/comments/16cbimi/yet_another_rag_system_implementation_details_and/

Summary

The author details their journey building a Retrieval-Augmented Generation (RAG) system for querying large knowledge bases in Obsidian and technical documents. Initial attempts using standard recursive text splitting and cosine similarity yielded underwhelming results, even with more powerful LLMs. Significant improvements came from several key implementation details and strategies:

**Pre-processing and Chunking:**

* **Document Format:** Structured formats such as Markdown, HTML, and DOCX are preferred over PDFs, which are harder to parse. Source code should ideally be kept as a single block, with metadata linking any split pieces.
* **Metadata Inclusion:** Chunks should include the document name, references to higher-level logical blocks (e.g., parent headers), and, for code, start/end indicators and the language. External metadata such as the document path and a collection label enables dynamic filtering.
* **Chunk Size:** Retrieval quality is highly sensitive to chunk size. One solution is to embed documents at multiple chunk sizes and dynamically select the best-performing one at runtime, at the cost of extra storage and processing time.

**Embeddings:**

* **Model Choice:** Models like `e5-large-v2` offer a good balance of size and quality, comparable to OpenAI's ADA.
* **Prefixes:** Some embedding models require specific prefixes on both text chunks and queries to achieve optimal results.

**Retrieval:**

* **Re-ranker:** A crucial component for improving retrieval quality. Re-rankers score passages obtained from similarity or hybrid search, so results can be sorted by relevance before being fed to the LLM. Cross-encoders are an efficient method for re-ranking.
* **Sparse Embeddings (SPLADE):** Used to address the "vocabulary mismatch problem." Despite their high dimensionality, sparse embeddings can be stored efficiently.
* **Hybrid Search:** Combines sparse (SPLADE) and dense (similarity search) retrieval. Rather than taking a weighted combination of sparse and dense scores, the implemented approach retrieves the top-k documents from each, takes their union, and re-ranks the combined set with the re-ranker before passing results to the LLM.

The post also includes extensive community discussion on topics such as objective evaluation metrics for RAG systems, handling different document types (especially PDFs), the role of metadata, alternative embedding and re-ranking strategies (e.g., HyDE, cross-encoders), scaling RAG to large document sets, handling follow-up questions, and the trade-offs between frameworks like Langchain and custom implementations. The author's implementation details are available on GitHub.
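The chunk-metadata scheme described above can be sketched as follows. The field names and helper functions here are illustrative, not taken from the author's implementation:

```python
# A chunk record carrying the metadata the post recommends: document name,
# parent headers, filter-time metadata, and (for code) a language marker.
# All names below are hypothetical.

def make_chunk(text, doc_name, parent_headers, doc_path, collection,
               is_code=False, code_lang=None):
    """Bundle a text chunk with embedded and external metadata."""
    return {
        "text": text,
        "doc_name": doc_name,              # embedded with the chunk
        "parent_headers": parent_headers,  # e.g. ["Installation", "Linux"]
        "doc_path": doc_path,              # external metadata for filtering
        "collection": collection,          # e.g. an "obsidian" collection label
        "is_code": is_code,
        "code_lang": code_lang,            # e.g. "bash" when is_code is True
    }

def chunk_header(chunk):
    """Render the metadata prefix that gets embedded alongside the text."""
    header = f"Document: {chunk['doc_name']}"
    if chunk["parent_headers"]:
        header += " > " + " > ".join(chunk["parent_headers"])
    if chunk["is_code"]:
        header += f" [code: {chunk['code_lang']}]"
    return header
```

Keeping `doc_path` and `collection` outside the embedded text lets the retriever filter candidates dynamically without polluting the embedding.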
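On the prefix point: the `e5` family, including `e5-large-v2`, expects asymmetric prefixes, `"passage: "` on indexed chunks and `"query: "` on search queries. A minimal sketch (the commented-out sentence-transformers usage is one possible stack, not necessarily the author's):

```python
# e5-style asymmetric prefixes: passages and queries are prefixed
# differently before being embedded.

def prep_passages(chunks):
    """Prefix indexed chunks for an e5-style embedding model."""
    return [f"passage: {c}" for c in chunks]

def prep_query(q):
    """Prefix a search query for an e5-style embedding model."""
    return f"query: {q}"

# Example usage with sentence-transformers (assumed stack, not executed here):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("intfloat/e5-large-v2")
# doc_vecs = model.encode(prep_passages(chunks), normalize_embeddings=True)
# q_vec = model.encode(prep_query("how do I reset my config?"),
#                      normalize_embeddings=True)
```

Omitting these prefixes degrades similarity scores noticeably with such models, which is why the post calls them out.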
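The hybrid retrieval flow, top-k from sparse and dense, union, then cross-encoder re-ranking, can be sketched as below. `score_pair` stands in for a real cross-encoder (e.g. a sentence-transformers `CrossEncoder.predict` call); here it is just an injected callable:

```python
# Hybrid retrieval sketch: union sparse (SPLADE) and dense candidates,
# then re-rank the combined set with a cross-encoder-style scorer.
# `score_pair(query, text)` is a hypothetical stand-in for a cross-encoder.

def hybrid_retrieve(query, sparse_topk, dense_topk, texts, score_pair, k=5):
    """Return the k doc ids ranked best by the re-ranker over the union."""
    # Ordered union of candidate doc ids, duplicates removed.
    candidates = list(dict.fromkeys(sparse_topk + dense_topk))
    # Score every (query, passage) pair with the re-ranker.
    scored = [(score_pair(query, texts[i]), i) for i in candidates]
    scored.sort(reverse=True)  # highest relevance first
    return [i for _, i in scored[:k]]
```

The point of the union-then-re-rank design is that the expensive cross-encoder only sees a small candidate set, while avoiding the need to tune weights for combining sparse and dense scores.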

Keywords

RAG system, Retrieval-Augmented Generation, LLM, chunking, embeddings, re-ranker, hybrid search, sparse embeddings, dense embeddings
