Lessons Learned Building a High-Quality RAG System
The author details their journey building a Retrieval-Augmented Generation (RAG) system for querying large knowledge bases in Obsidian and technical documents. Initial attempts using standard recursive text splitting and cosine similarity yielded underwhelming results, even with more powerful LLMs. Significant improvements came from several key implementation details and strategies:

**Pre-processing and Chunking:**

* **Document Format:** Structured formats like Markdown, HTML, and DOCX are preferred over PDFs, which are harder to parse. Source code should ideally be treated as a single block, with metadata linking split pieces.
* **Metadata Inclusion:** Chunks should include the document name, references to higher-level logical blocks (e.g., parent headers), and, for code, start/end indicators and the language. External metadata such as the document path and a collection label allows for dynamic filtering.
* **Chunk Size:** Retrieval quality is highly sensitive to chunk size. One remedy is to embed documents at multiple chunk sizes and dynamically select the best-performing one at runtime, at the cost of extra storage and processing time.

**Embeddings:**

* **Model Choice:** Models like `e5-large-v2` offer a good balance of size and quality, comparable to OpenAI's ADA.
* **Prefixes:** Some embedding models require specific prefixes on both text chunks and queries (for the e5 family, `query:` and `passage:`) to achieve optimal results.

**Retrieval:**

* **Re-ranker:** A crucial component for improving retrieval quality. Re-rankers score passages obtained from similarity or hybrid search, so results can be sorted by relevance before being fed to the LLM. Cross-encoders are an effective way to implement re-ranking.
* **Sparse Embeddings (SPLADE):** Used to address the "vocabulary mismatch problem." Despite their high dimensionality, sparse embeddings can be stored efficiently.
* **Hybrid Search:** Combines sparse (SPLADE) and dense (similarity search) retrieval. The implemented approach retrieves the top-k documents from each, takes the union, and then re-ranks the combined set.
* **Re-ranking Strategy:** Rather than weighting sparse and dense scores together, the author retrieves the top-k from each, unions them, and lets the re-ranker sort the final set for the LLM (see the sketch after this summary).

The post also includes extensive discussion with the community on topics such as objective evaluation metrics for RAG systems, handling different document types (especially PDFs), the role of metadata, alternative embedding and re-ranking strategies (e.g., HyDE, cross-encoders), scaling RAG to large document sets, handling follow-up questions, and the trade-offs between frameworks like Langchain and custom implementations. The author's implementation details are available on GitHub.
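To make the retrieval flow concrete, here is a minimal sketch of the retrieve-union-re-rank pipeline described above, using `sentence-transformers` for both the dense encoder and the cross-encoder. The cross-encoder model name, the toy corpus, and the `sparse_search` stand-in (a naive term-overlap scorer in place of a real SPLADE index) are illustrative assumptions, not the author's actual implementation.

```python
# Sketch: dense + sparse retrieval, union of candidates, cross-encoder re-ranking.
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

# e5-family models expect "passage: " / "query: " prefixes at encode time.
embedder = SentenceTransformer("intfloat/e5-large-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed re-ranker choice

documents = [
    "Markdown files are split on headers and embedded with their parent header as metadata.",
    "SPLADE produces sparse lexical embeddings that help with vocabulary mismatch.",
    "Cross-encoders score (query, passage) pairs and are used to re-rank candidates.",
]
doc_vectors = embedder.encode(
    [f"passage: {d}" for d in documents], normalize_embeddings=True
)

def dense_search(query: str, top_k: int = 2) -> list[int]:
    """Cosine similarity over normalized embeddings (plain dot product)."""
    q = embedder.encode(f"query: {query}", normalize_embeddings=True)
    scores = doc_vectors @ q
    return list(np.argsort(-scores)[:top_k])

def sparse_search(query: str, top_k: int = 2) -> list[int]:
    """Placeholder for a SPLADE-style lexical search; here a naive term-overlap score."""
    terms = set(query.lower().split())
    scores = np.array([len(terms & set(d.lower().split())) for d in documents])
    return list(np.argsort(-scores)[:top_k])

def hybrid_retrieve(query: str, top_k: int = 2) -> list[str]:
    # Union the dense and sparse candidate sets, then let the cross-encoder sort them.
    candidates = sorted(set(dense_search(query, top_k)) | set(sparse_search(query, top_k)))
    pairs = [(query, documents[i]) for i in candidates]
    rerank_scores = reranker.predict(pairs)
    order = np.argsort(-rerank_scores)
    return [documents[candidates[i]] for i in order]

print(hybrid_retrieve("How do I fix vocabulary mismatch in retrieval?"))
```

Note that the e5 encoder receives the `query:` / `passage:` prefixes mentioned above, and that the final ordering comes from the cross-encoder rather than from any weighted combination of sparse and dense scores.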
LocalLlama Subreddit - Discussions on Local AI and Large Language Models
The r/LocalLLaMA subreddit is a community dedicated to discussions about local Artificial Intelligence (AI) and Large Language Models (LLMs), with a particular focus on Meta AI's Llama models. The subreddit features a wide range of content, including announcements, discussions, news, and user-created projects. Recent popular posts include discussions on the "MiniMax 2.1 release?", "Xiaomi's MiMo-V2-Flash (309B model)", and "Apple introduces SHARP, a model that generates a photorealistic 3D Gaussian representation from a single image in seconds." Users share their experiences and seek advice on various topics such as optimizing llama.cpp performance with specific flags, building open-source voice assistants that run entirely in the browser, and using AI for coding assistance with open-weight models. There are also discussions about new model releases and potential releases, such as "GLM 4.7 imminent?!" and "New Google model incoming!!!". Hardware and performance are also frequent topics, with users discussing deals on GPUs and RAM, and exploring configurations like a Raspberry Pi with an eGPU. Projects shared include a Rust-based HTML-to-Markdown converter for RAG token saving, a free CPU-only trainer for LLMs, and a model that turns video into humanoid robot motion. The community also engages in more theoretical discussions, such as measuring AI drift and the semantic instability of LLMs.
RAG - Jason Liu
The content from jxnl.co/writing/category/rag/ discusses Retrieval-Augmented Generation (RAG) systems, highlighting various aspects through a series of speaker sessions and articles. Key themes include the practical application and improvement of RAG, particularly in the context of coding agents and enterprise AI. The site features a "Coding Agents Speaker Series" focused on economically viable agents like Devin, Sourcegraph's Amp, Cline, and Augment, emphasizing their real-world revenue generation and production use. It also introduces a "RAG Master Series" as a comprehensive guide to RAG systems, covering fundamental concepts, advanced optimization, anti-patterns, and case studies. Specific sessions delve into:

* **Domain Experts:** The role of specialized knowledge in vertical AI, with insights from Anterior's Head of Clinical AI.
* **Text Chunking:** Technical research on chunking strategies for RAG, presented by ChromaDB.
* **Agentic Approaches:** How techniques from coding agents can inform RAG system design, using lessons from Augment.
* **Multi-Agent Systems:** Why single agents with robust context management may outperform multi-agent systems in certain contexts, as explored by Cognition.
* **Benchmarking:** Critiques of standard benchmarks like MTEB and the case for custom evaluation sets for retrieval systems, with research from Chroma.
* **Document Automation:** Achieving high extraction accuracy in document processing using AI, with case studies from Extend.
* **Performance Boosts:** Strategies for improving RAG performance, such as fine-tuning re-rankers and embedding models, discussed by LanceDB.
* **Custom Embedding Models:** The advantage of building bespoke embedding models for individual customers over generic solutions, as practiced by Glean.

The content positions RAG as a foundational technology for AI applications requiring external knowledge and reasoning, distinguishing it from other AI applications through its blend of information-retrieval and language-generation complexity. The site also links to a related "Context Engineering" series covering technical implementation patterns.
Full Stack Retrieval
No summary available.
The 5 Levels of Text Splitting for Retrieval
This YouTube video, "The 5 Levels Of Text Splitting For Retrieval" by Greg Kamradt, explores methods for dividing text into manageable chunks for retrieval systems, particularly in the context of Retrieval-Augmented Generation (RAG). It outlines five levels of text splitting, starting with basic character-based methods and progressing to more sophisticated semantic and agentic approaches:

* **Level 1:** Character splitting.
* **Level 2:** Recursive character splitting.
* **Level 3:** Document-specific splitting.
* **Level 4:** Semantic splitting using embeddings (a minimal sketch follows this entry).
* **Level 5:** Agentic splitting.
* **Bonus:** Alternative representations.

The video provides a detailed breakdown of each level, including theoretical explanations and practical demonstrations, with timestamps for each section. It aims to help viewers optimize text chunking for better retrieval quality in AI applications.
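As an illustration of Level 4, the sketch below splits a document wherever adjacent sentences drift apart in embedding space. The model name, the naive sentence segmentation, and the 0.3 distance threshold are assumptions chosen for illustration and are not taken from the video.

```python
# Sketch of semantic splitting: break where consecutive sentences diverge in embedding space.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed small, fast embedding model

def semantic_split(text: str, distance_threshold: float = 0.3) -> list[str]:
    # Very naive sentence segmentation; a real pipeline would use a proper sentence tokenizer.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if len(sentences) < 2:
        return sentences
    vecs = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Cosine distance between adjacent sentences (embeddings are normalized).
        distance = 1.0 - float(np.dot(vecs[i - 1], vecs[i]))
        if distance > distance_threshold:
            chunks.append(". ".join(current) + ".")
            current = []
        current.append(sentences[i])
    chunks.append(". ".join(current) + ".")
    return chunks

doc = "Embeddings map text to vectors. Similar sentences land close together. My cat sleeps all day."
print(semantic_split(doc))
```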
ChunkViz
ChunkViz is a tool designed to help users understand different text chunking and splitting strategies, particularly in the context of language models. The application allows users to upload text and visualize how various chunking methods, such as 'Character Splitter' and 'Recursive Character Text Splitter' (available in JavaScript, Python, and Markdown variants), divide the text. Users can adjust 'Chunk Size' and 'Chunk Overlap' parameters to observe their effects. The visualization uses different colors to represent distinct chunks, with overlapping text highlighted in orange. The tool also includes an explanation of 'superlinear returns' as a concept relevant to performance and growth, drawing parallels to business, knowledge acquisition, and exponential growth. It explains that language models perform better with focused, relevant information, and chunking is a strategy to provide this. The site notes that text splitters may trim whitespace, affecting visual continuity, and that overlap is capped at less than 50% of the chunk size. ChunkViz is open-sourced under the MIT License and developed by Greg Kamradt.
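For reference, the fixed-size character splitting that ChunkViz visualizes can be reproduced in a few lines. The default parameters below are illustrative only; the validation check mirrors the tool's rule that overlap stay below 50% of the chunk size.

```python
# Sketch of fixed-size character splitting with overlap, as visualized by ChunkViz.
def character_split(text: str, chunk_size: int = 25, chunk_overlap: int = 5) -> list[str]:
    # Mirror ChunkViz's constraint: overlap must be less than half the chunk size.
    if not 0 <= chunk_overlap < chunk_size / 2:
        raise ValueError("chunk_overlap must be less than half of chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each chunk
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "Language models perform better with focused, relevant context."
for chunk in character_split(sample):
    print(repr(chunk))  # overlapping characters appear at the start of each subsequent chunk
```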