RAG System Architecture: Components, How To Implement, Challenges, and Best Practices
Eliott Ardisson
Founder & CEO - Basalt Studio
A practitioner's guide to RAG system architecture: components, chunking strategies, hybrid retrieval, reranking, and what actually breaks in production deployments.
TL;DR
- RAG (Retrieval Augmented Generation) combines a language model with external knowledge retrieval, allowing AI responses to be grounded in your actual business data rather than static training weights.
- The most consequential architectural decisions are made early: vector strategy, chunking approach, and database selection shape everything downstream.
- Hybrid retrieval (dense + sparse vectors) consistently outperforms single-vector approaches for mixed real-world query types.
- Chunking strategy is where most production RAG systems quietly fail — arbitrary character splits break semantic continuity and degrade retrieval quality.
- Evaluation is not optional. Systems without ongoing benchmarking degrade silently as data changes.
What RAG Architecture Actually Is
Retrieval Augmented Generation (RAG) is a system design pattern that connects a large language model to an external retrieval layer. Instead of answering purely from what the model learned during training, a RAG system fetches relevant content from a vector database or document store at inference time, then passes that content to the LLM as context for its response.
This matters for business applications because it lets you ground AI responses in your own data — internal documentation, case files, product catalogs, client records — without retraining or fine-tuning a model. It also makes responses more auditable, since you can trace which source documents influenced a given answer.
The term “RAG architecture” refers specifically to the structural decisions: which vector types you use, how you store and index documents, how you process queries, and how components connect. This is distinct from a RAG pipeline (the step-by-step data flow) or a RAG application (the user-facing product). Getting the architecture right determines whether the application works in production or just in demos.
The Three-Stage Core
Every RAG system, regardless of complexity, moves through three fundamental stages.
Indexing is where documents are prepared for retrieval. Text is extracted, cleaned, split into chunks, converted into vector representations (embeddings), and stored in a vector database alongside metadata.
Retrieval happens at query time. The user’s input is embedded with the same embedding model used during indexing, and a similarity search identifies the document chunks most likely to contain relevant information.
Generation is where the LLM takes the retrieved chunks as context and produces a response. The quality of this step depends almost entirely on the quality of retrieval — garbage in, garbage out.
Production systems add layers on top of this core: preprocessing pipelines, reranking models, caching, access controls, and monitoring. Simple tutorials tend to show only the happy path. The sections below cover the decisions that actually matter once you move beyond a proof of concept.
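To make the three stages concrete, here is a minimal sketch in plain Python. It assumes a toy bag-of-words `embed` function in place of a real embedding model, and the function names (`build_index`, `retrieve`, `build_prompt`) and sample documents are illustrative only. The prompt it builds is what would be sent to an LLM; the model call itself is left out.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    # Indexing: embed each chunk once and store the vector next to its text.
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    # Retrieval: embed the query the same way, then rank chunks by similarity.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query: str, context: list[str]) -> str:
    # Generation input: retrieved chunks are passed to the LLM as numbered sources.
    sources = "\n".join(f"[{i}] {c}" for i, c in enumerate(context, start=1))
    return f"Answer using only the sources below.\n{sources}\n\nQuestion: {query}"


index = build_index([
    "Refunds are issued within 14 days of a return.",
    "Support hours are 9am to 5pm on weekdays.",
    "Shipping is free for orders over 50 euros.",
])
top_chunks = retrieve(index, "when are refunds issued", k=1)
prompt = build_prompt("when are refunds issued", top_chunks)
```

Numbering the sources in the prompt is one simple way to keep answers traceable to the chunks that informed them.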
Vector Strategy: Dense, Sparse, and Hybrid
Your choice of vector type determines what categories of queries your system handles well. This is one of the most consequential early decisions.
Dense vectors (semantic embeddings) capture conceptual meaning. A query about “reducing staff turnover” can match documents about “employee retention” or “workforce stability” even if none of those exact words appear in the query. Dense vectors are generated by neural embedding models and stored as high-dimensional floating-point arrays.
The limitation: dense vectors can underperform on exact-match requirements. If a user searches for a specific product code, clause reference, or technical term, a semantic embedding may dilute the signal by weighting related-but-wrong results.
Sparse vectors (keyword-based methods, typically BM25 or TF-IDF variants) excel at exact term matching. They work well for legal documents, technical manuals, product specifications, and any corpus where users tend to know the precise language of what they need.
The limitation: sparse vectors miss semantic relationships. A query about “contract termination” may not match a document that discusses “ending an agreement” unless the document also happens to contain the query’s exact terms.
Hybrid retrieval runs both searches in parallel, merges the candidate sets, and applies a fusion scoring step before reranking. For most production business applications — where queries range from precise lookups to open-ended questions — hybrid is the right default. The trade-off is higher indexing complexity and slightly more query overhead, but the retrieval quality improvement typically justifies it.
| Approach | Strengths | Weaknesses | Suitable for |
|---|---|---|---|
| Dense only | Semantic understanding, handles paraphrasing | Can miss exact matches | Conceptual Q&A, conversational search |
| Sparse only | Precise keyword matching | Misses semantic variation | Legal, technical, catalog search |
| Hybrid | Broad coverage, best general accuracy | More complex to implement | Most production use cases |
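The fusion step in hybrid retrieval can be implemented in several ways. One common option is Reciprocal Rank Fusion (RRF), sketched below as an assumption rather than the only valid choice. It combines the ranked lists from dense and sparse search using ranks alone, so the two score scales never need to be normalized. The document IDs are made up, and `k=60` is a conventional smoothing constant that you may want to tune.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); top ranks contribute most.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical semantic search results
sparse_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical keyword search results
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Documents that appear near the top of both lists (`doc_a`, `doc_c`) rise above documents found by only one retriever, which is the behavior you want before handing candidates to a reranker.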
Chunking: Where RAG Systems Quietly Break
Chunking is the process of splitting source documents into segments before embedding. It sounds mechanical, but it is where the majority of retrieval quality problems originate.
The fundamental tension: smaller chunks allow more precise retrieval, but may lack enough context for the LLM to generate a useful answer. Larger chunks provide richer context, but dilute the relevance signal during retrieval and consume more of the LLM’s context window.
Fixed-size character splitting is what most introductory examples show. It is simple to implement and completely blind to document structure. It will split sentences mid-thought, separate a table header from its data, and break numbered steps across chunks. For anything beyond a toy demo, this approach causes problems.
Semantic chunking splits on logical boundaries: paragraph breaks, section headings, topic transitions, or sentence boundaries. This requires slightly more preprocessing but produces chunks that behave as coherent units during retrieval.
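As a rough illustration of splitting on logical boundaries, the sketch below breaks text on blank lines and packs whole paragraphs into chunks up to a word budget. Counting whitespace-separated words is a simplifying assumption standing in for a real tokenizer, and `chunk_by_paragraph` is a hypothetical helper name.

```python
import re


def chunk_by_paragraph(text: str, max_words: int = 120) -> list[str]:
    """Group whole paragraphs into chunks without splitting any paragraph."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and current_len + n > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


doc = (
    "Intro paragraph about leave policy.\n\n"
    "Details on annual leave.\n\n"
    "Details on sick leave."
)
chunks = chunk_by_paragraph(doc, max_words=8)
```

A paragraph longer than `max_words` is kept intact as its own oversized chunk here; a fuller implementation would fall back to sentence boundaries for those cases.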
Hierarchical chunking maintains multiple levels of granularity: full documents for broad context, sections for intermediate specificity, and paragraphs for precise retrieval. Advanced retrieval pipelines can query at different levels depending on query type.
Practical defaults that work well across most business document types: 300-400 tokens per chunk, with 50-100 tokens of overlap between adjacent chunks. The overlap prevents important information from being severed at a boundary. Test these numbers against your actual queries — technical documentation often benefits from larger chunks, conversational content often benefits from smaller ones.
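The size-plus-overlap defaults can be sketched as a sliding window over a token list. The example assumes the text has already been tokenized; the placeholder tokens and the specific values (350 and 75, taken from the middle of the ranges above) are for illustration only.

```python
def chunk_with_overlap(tokens: list[str], size: int = 350, overlap: int = 75) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + size >= len(tokens):
            break
    return chunks


tokens = [f"t{i}" for i in range(1000)]  # placeholder tokens
chunks = chunk_with_overlap(tokens, size=350, overlap=75)
```

With 1,000 tokens this produces four chunks, the last one shorter than the rest, and the final 75 tokens of each chunk reappear at the start of the next.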
Embedding Model Decisions
The embedding model converts text into vectors. Its quality directly determines how well your retrieval captures semantic relationships.
Cloud-hosted embedding APIs (from model providers) are the fastest path to a working system. They require no infrastructure, perform well out of the box, and handle variable load without capacity planning. The cost scales with usage volume and can become significant for high-throughput applications.
Self-hosted open-source embedding models run on your own infrastructure. After initial setup there is no per-request API fee, so the marginal cost per embedding is limited to the compute you already operate. Privacy-sensitive industries (legal, HR, accounting, healthcare-adjacent) often prefer this route because documents never leave their environment. The trade-off is infrastructure management, model versioning, and a potentially lower performance ceiling on niche domains.
Dimensionality affects storage and compute. Higher-dimensional embeddings (1536+) can capture finer semantic distinctions but take more storage and make indexing and search slower. For most SMB document corpora, 768-dimension models offer a reasonable balance. Don’t over-optimize here early, but do choose deliberately: switching embedding models later requires re-indexing your entire corpus.
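A quick back-of-the-envelope calculation shows how dimensionality drives storage. The sketch assumes vectors are stored as 32-bit floats (4 bytes per value) and uses a hypothetical corpus of 100,000 chunks. It counts raw vector data only and ignores index structures and metadata, which add overhead on top.

```python
def raw_vector_storage_mb(num_chunks: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw size of the embedding matrix in megabytes (10^6 bytes)."""
    return num_chunks * dimensions * bytes_per_value / 1_000_000


storage_768 = raw_vector_storage_mb(100_000, 768)    # about 307 MB
storage_1536 = raw_vector_storage_mb(100_000, 1536)  # about 614 MB
```

Storage scales linearly with dimension count, so doubling dimensions doubles the raw footprint before any index overhead.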
Vector Databases and Indexing
The vector database stores your embeddings and handles similarity search at query time. For development and small corpora (under ~10,000 documents), lightweight options with simple setup are appropriate. For production deployments at scale, you need databases built for approximate nearest neighbor (ANN) search with efficient indexing structures.
ANN algorithms like HNSW trade a small amount of recall for dramatically faster query times. At meaningful document scale, exact nearest neighbor search is computationally prohibitive. Most vector databases implement ANN by default.
Key configuration decisions:
- Number of candidates to retrieve before reranking (typically 20-50; you narrow this down in the reranking step)
- Similarity threshold below which results are discarded as irrelevant
- Metadata filtering to restrict search to relevant subsets (by document type, date range, client, department)
Metadata filtering is often underused. If a recruitment agency’s RAG system serves both clients and internal staff, filtering by access tier at the database level is cleaner and safer than trying to enforce it in the application layer.
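The three configuration decisions above interact, and the sketch below shows their combined logic in plain Python. In a real deployment the access-tier filter would be passed to the vector database as part of the query rather than applied afterwards. The `Hit` class, the `access_tier` field name, and the threshold value are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    chunk_id: str
    score: float  # similarity score from vector search, higher is better
    metadata: dict = field(default_factory=dict)


def select_candidates(hits: list[Hit], allowed_tiers: set[str],
                      min_score: float = 0.3, top_k: int = 20) -> list[Hit]:
    """Apply the metadata filter, the similarity threshold, and the candidate limit."""
    # 1. Metadata filter: keep only chunks the caller is allowed to see.
    visible = [h for h in hits if h.metadata.get("access_tier") in allowed_tiers]
    # 2. Similarity threshold: discard weak matches as irrelevant.
    relevant = [h for h in visible if h.score >= min_score]
    # 3. Candidate count: keep the best top_k for the reranking step.
    return sorted(relevant, key=lambda h: h.score, reverse=True)[:top_k]


hits = [
    Hit("c1", 0.82, {"access_tier": "internal"}),
    Hit("c2", 0.75, {"access_tier": "client"}),
    Hit("c3", 0.10, {"access_tier": "client"}),
    Hit("c4", 0.64, {"access_tier": "client"}),
]
client_view = select_candidates(hits, allowed_tiers={"client"})
```

Chunks without an `access_tier` value are excluded by default here, which is the safer failure mode for access control.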
Reranking: The Step That Lifts Production Quality
Initial vector similarity is a proxy for relevance, not a direct measure of it. A chunk can be semantically close to a query without actually answering it. Reranking adds a second, more expensive scoring pass over the top candidates from retrieval.
Cross-encoder rerankers take each (query, document chunk) pair and score them jointly. This is more computationally intensive than vector similarity but substantially more accurate at identifying which chunks are genuinely useful for the query. For systems where retrieval quality has a direct impact on user trust — internal knowledge bases, client-facing agents, compliance tools — reranking is worth the latency cost.
The typical pattern: retrieve 20-50 candidates via hybrid search, rerank to 5-10, pass the top results to the LLM. This keeps context windows manageable and improves generation quality.
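The retrieve-then-rerank pattern can be sketched with a pluggable scoring function. Here `toy_score` is a deliberately crude placeholder (word overlap) standing in for a cross-encoder model, which would score each (query, chunk) pair jointly; the function names and sample text are hypothetical.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    """Score every (query, chunk) pair and keep the top_n highest-scoring chunks."""
    scored = [(score_pair(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]


def toy_score(query: str, chunk: str) -> float:
    """Placeholder for a cross-encoder: fraction of query words found in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


candidates = [  # imagine these came back from hybrid search, in this order
    "Contract renewal happens annually.",
    "Office closes at 6pm.",
    "Notice period for contract termination is 30 days.",
]
top = rerank("contract termination notice period", candidates, toy_score, top_n=2)
```

Keeping the scorer as a parameter makes it straightforward to swap the placeholder for a real reranking model without touching the surrounding pipeline.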
Reranking becomes increasingly important as corpus size grows. With a few hundred documents, initial retrieval may be accurate enough. With tens of thousands of chunks, the gap between the initial ranked list and the truly relevant documents widens considerably.
Data Ingestion and Update Patterns
Static document corpora are the exception in business settings, not the rule. Pricing changes. Policies get updated. New case law appears. Client records evolve. Your ingestion architecture needs to account for how data changes over time.
Batch ingestion processes updates on a schedule — hourly, daily, weekly. It is simpler to implement and debug, and works well when the cost of slightly stale data is low. An internal HR knowledge base updated nightly is fine for most queries.
Push-based ingestion processes updates as they happen via event triggers or webhooks. The data in your vector store stays current, but this requires event-driven infrastructure and careful handling of partial updates, deletions, and conflicts. Customer-facing agents that reference live inventory or pricing typically need this pattern.
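Whichever trigger you use, the ingestion job needs to decide what to add, re-embed, or delete. One simple approach, assumed here for illustration, is to store a content hash per document and compare it against the current source on each run. The `plan_sync` helper and the document IDs are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(source_docs: dict[str, str], indexed_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Compare current source documents with what the index holds."""
    to_add, to_update = [], []
    for doc_id, text in source_docs.items():
        if doc_id not in indexed_hashes:
            to_add.append(doc_id)
        elif indexed_hashes[doc_id] != content_hash(text):
            to_update.append(doc_id)  # content changed: re-chunk and re-embed
    # Anything indexed but no longer in the source must have its chunks removed.
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in source_docs]
    return {"add": sorted(to_add), "update": sorted(to_update), "delete": sorted(to_delete)}


indexed = {
    "pricing": content_hash("Plan A costs 10"),
    "policy": content_hash("Old policy"),
    "legacy": content_hash("Retired doc"),
}
source = {"pricing": "Plan A costs 10", "policy": "New policy", "faq": "How do I log in?"}
plan = plan_sync(source, indexed)
```

Unchanged documents are skipped entirely, which keeps embedding costs proportional to what actually changed.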
Whatever your ingestion model, invest in the preprocessing pipeline. Raw business documents contain boilerplate headers, repeated legal disclaimers, navigation artifacts, and encoding inconsistencies. All of this becomes noise in your vector store. Clean data compounds into better retrieval; dirty data compounds into unreliable responses.
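As one small example of preprocessing, the sketch below drops lines that recur across a large share of documents, which is a rough heuristic for boilerplate headers and disclaimers. The 50% cutoff and the helper name are assumptions; a real corpus would need the threshold tuned and the removed lines reviewed.

```python
from collections import Counter


def strip_repeated_lines(docs: list[str], max_share: float = 0.5) -> list[str]:
    """Remove lines that appear in more than max_share of the documents."""
    line_doc_counts: Counter = Counter()
    for doc in docs:
        # Count each distinct line once per document.
        line_doc_counts.update({line.strip() for line in doc.splitlines() if line.strip()})
    limit = max_share * len(docs)
    cleaned = []
    for doc in docs:
        kept = [
            line for line in doc.splitlines()
            if line.strip() and line_doc_counts[line.strip()] <= limit
        ]
        cleaned.append("\n".join(kept))
    return cleaned


docs = [
    "ACME Confidential\nRefund policy: 14 days.",
    "ACME Confidential\nShipping policy: 3 days.",
    "ACME Confidential\nSupport hours: 9 to 5.",
]
cleaned = strip_repeated_lines(docs)
```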
Evaluation: The Practice Most Teams Skip
RAG systems degrade silently. As your document corpus changes, as query patterns shift, as embedding models get updated, the retrieval quality you validated at launch may no longer hold. Without systematic evaluation, you won’t know until users stop trusting the system.
Build an evaluation dataset early — a representative set of queries paired with the documents that should appear in the top results. Use this to measure retrieval recall (how often the right document appears in the top-k results) and mean reciprocal rank (how high the correct document ranks). Run this suite on a schedule, not just at deployment.
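Both metrics are short to implement. The sketch below assumes a simplified evaluation set with exactly one expected document per query; real datasets often list several relevant documents, in which case the definitions need adjusting. The query and document IDs are made up.

```python
def recall_at_k(results: dict[str, list[str]], expected: dict[str, str], k: int = 5) -> float:
    """Share of queries whose expected document appears in the top k results."""
    hits = sum(1 for query, doc in expected.items() if doc in results.get(query, [])[:k])
    return hits / len(expected)


def mean_reciprocal_rank(results: dict[str, list[str]], expected: dict[str, str]) -> float:
    """Average of 1/rank of the expected document (0 when it is not retrieved)."""
    total = 0.0
    for query, doc in expected.items():
        ranked = results.get(query, [])
        if doc in ranked:
            total += 1.0 / (ranked.index(doc) + 1)
    return total / len(expected)


expected = {"q1": "d1", "q2": "d5", "q3": "d9"}
results = {
    "q1": ["d1", "d2", "d3"],  # expected doc at rank 1
    "q2": ["d4", "d5", "d6"],  # expected doc at rank 2
    "q3": ["d7", "d8", "d3"],  # expected doc missing
}
recall = recall_at_k(results, expected, k=3)   # 2 of 3 queries hit
mrr = mean_reciprocal_rank(results, expected)  # (1 + 1/2 + 0) / 3
```

Logging these two numbers on every scheduled run gives you a trend line, which is what makes silent degradation visible.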
Track end-user signals too: whether users accept or rephrase responses, whether they escalate to a human, whether answers get flagged. These are lagging indicators but they reflect the ground truth of whether the system is working.
In our work helping founder-led professional services firms deploy knowledge retrieval agents, the most common failure pattern is a system that worked well at launch and gradually became unreliable over six months as documents were added without re-evaluating chunk quality or embedding coverage. A lightweight automated evaluation suite would have caught this.
Common Mistakes Worth Avoiding
Treating data cleaning as an afterthought. Retrieval is only as good as what is in the index. Teams that invest in embedding strategy but skip preprocessing consistently see worse results than teams with simpler models and clean data.
Starting with over-engineered retrieval. Multi-stage pipelines with hierarchical chunking, query decomposition, and multi-hop reasoning are appropriate solutions to specific failure modes — not a starting architecture. Begin with straightforward hybrid search, validate with real users, then add complexity where evidence shows it helps.
No evaluation framework. A RAG system without benchmarks is a black box. You cannot improve what you cannot measure, and you cannot detect degradation you are not tracking.
Ignoring document deletion. When source documents are updated or removed, stale chunks remain in the vector store and continue to surface in retrieval. Build deletion and update handling into the ingestion pipeline from the start.
Chunk size chosen by gut feel. The right chunk size depends on your document types and query patterns. Test several configurations against a real query sample before committing — the performance difference between 200 and 500 tokens can be substantial, in either direction.
What a Realistic Implementation Timeline Looks Like
A focused team with existing engineering capacity can typically reach a working production deployment in four to six weeks for a well-scoped use case.
- Weeks one and two: data audit, ingestion pipeline, basic semantic search working against a representative document sample.
- Weeks three and four: hybrid retrieval, reranking, evaluation dataset, initial user testing.
- Weeks five and six: production deployment, monitoring setup, iteration based on early user feedback.
The two to three months after launch matter as much as the build. This is when you discover the query patterns you did not anticipate, the document types that chunk poorly, and the edge cases where retrieval fails. Budget time for this iteration phase.
A Practical Architecture Checklist
Data layer
- Identified all source documents and their update frequency
- Preprocessing pipeline handles all relevant file formats
- Deduplication and quality validation in place
Vector storage
- Database selected for target document scale
- Hybrid indexing configured (dense + sparse)
- Metadata schema designed for filtering requirements
Retrieval pipeline
- Semantic chunking implemented with appropriate chunk size
- Chunk overlap configured
- Hybrid search merging and reranking in place
Generation
- Context assembly formats retrieved chunks clearly for the LLM
- Edge cases handled: no results found, conflicting sources, out-of-scope queries
- Source attribution included in responses
Evaluation and monitoring
- Evaluation dataset created before launch
- Automated retrieval quality metrics running on a schedule
- User feedback signals being captured
Closing Thoughts
RAG architecture is not complicated in principle, but the gap between a working demo and a reliable production system is wider than most teams expect. The decisions that matter most — vector strategy, chunking approach, evaluation framework — are also the ones that are easiest to defer until they become expensive problems.
The pattern that works: start with a focused use case, build clean data infrastructure, implement hybrid retrieval, measure retrieval quality before you measure anything else, and add complexity only where evidence demands it.
If you are scoping a RAG implementation and want to pressure-test your approach before committing to a build, Basalt Studio offers an AI strategy call where we can walk through your document types, query patterns, and infrastructure constraints to help you avoid the most common architectural missteps.
