ENGINEERING 18 Feb 2026 13 min read

Building RAG Pipelines That Actually Work in Production

Retrieval-augmented generation is the most practical way to give large language models access to your organisation's proprietary knowledge. But the gap between a demo RAG system and a production-grade pipeline is enormous. This guide covers the engineering decisions that determine whether your RAG system delivers reliable, accurate results at scale.

Aru Bhardwaj Founder & CEO, Insightrix

The Promise and the Reality of RAG

Retrieval-augmented generation has become the default architecture for enterprise LLM applications. The appeal is obvious: instead of fine-tuning a model on your proprietary data (expensive, slow, and prone to hallucination), you retrieve relevant documents at query time and provide them as context to the language model. The model generates its response grounded in your actual data, reducing hallucinations and ensuring answers reflect current information.

The concept is elegantly simple. The execution is anything but. Every RAG tutorial makes it look straightforward: split your documents, embed them, store them in a vector database, retrieve the top-k results, and pass them to the LLM. In reality, each of these steps involves engineering decisions that profoundly affect the system's accuracy, latency, cost, and reliability. Get them wrong, and your RAG system will confidently produce incorrect answers, miss relevant information, or grind to a halt under production load.

Reality Check

In our experience building RAG systems for enterprises across financial services, legal, and healthcare, the single greatest source of failure is not the language model—it is the retrieval step. If you retrieve the wrong documents, no amount of prompt engineering will save the output. Getting retrieval right is 80% of the battle.

This guide is based on our experience building and deploying RAG pipelines for enterprise clients. We cover the architectural decisions, engineering trade-offs, and hard-won lessons that separate demo-quality RAG from production-grade systems. We assume familiarity with the basic concepts of LLMs, embeddings, and vector search.

RAG Architecture: Beyond the Basics

The canonical RAG architecture consists of two phases: an offline indexing pipeline that processes documents into searchable embeddings, and an online query pipeline that retrieves relevant context and generates responses. In production, both phases are significantly more complex than the textbook version suggests.

The Indexing Pipeline

Your indexing pipeline must handle document ingestion from multiple sources (file systems, databases, APIs, content management systems), document parsing and format conversion (PDF, Word, HTML, Markdown, scanned images requiring OCR), text extraction and cleaning (removing boilerplate, headers, footers, and navigation elements), chunking (splitting documents into appropriately sized segments), metadata extraction (dates, authors, document types, section headings), embedding generation, and storage in a vector database with associated metadata.

Each of these steps introduces potential failure modes. PDFs with complex layouts may parse incorrectly. OCR may introduce errors in scanned documents. Chunking decisions can split critical information across boundaries. Metadata extraction may miss important context. A production indexing pipeline must handle all of these gracefully, with logging, error handling, and quality checks at every stage.
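One way to picture "handle failures gracefully" is a runner that applies each stage in order, logs the stage at which a document fails, and sets that document aside instead of aborting the whole run. The sketch below is a minimal illustration under assumptions: `run_indexing`, the `(doc_id, payload)` shape, and the two example stages are hypothetical names, not part of any particular framework.

```python
import logging

logger = logging.getLogger("indexing")


def run_indexing(documents, stages):
    """Apply each (name, function) stage to every document.

    Returns (processed, failures). A document that raises in any stage is
    logged and recorded in `failures` rather than stopping the run.
    """
    processed, failures = [], []
    for doc_id, payload in documents:
        stage_name = None
        try:
            for stage_name, stage in stages:
                payload = stage(payload)
        except Exception as exc:  # broad on purpose: quarantine, do not crash
            logger.warning("document %s failed at stage %s: %s", doc_id, stage_name, exc)
            failures.append((doc_id, stage_name, str(exc)))
            continue
        processed.append((doc_id, payload))
    return processed, failures


# Two toy stages standing in for real parsing and quality checks.
def clean_text(text):
    return " ".join(text.split())


def require_text(text):
    if not text:
        raise ValueError("no text extracted")
    return text
```

In a real pipeline the failure list would typically feed a dashboard or a retry queue, so that parsing and OCR problems are visible rather than silently shrinking the index.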

The Query Pipeline

The query pipeline receives a user question, transforms it into one or more search queries, retrieves relevant documents, optionally re-ranks them, constructs a prompt with the retrieved context, sends it to the language model, and returns the generated response with source citations. In production, you also need query preprocessing (spell correction, intent classification, query decomposition for complex questions), guardrails (content filtering, PII detection, topic boundary enforcement), caching (for repeated or similar queries), and comprehensive logging for debugging and evaluation.
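Caching for repeated queries can be sketched as a thin wrapper around whatever function answers a question. The example below is an assumption-laden illustration: `QueryCache` and `normalise` are hypothetical names, and it handles only exact matches after light normalisation. Matching merely similar queries would need an embedding comparison, which is not shown.

```python
def normalise(query):
    """Lower-case and collapse whitespace so trivially different queries share a key."""
    return " ".join(query.lower().split())


class QueryCache:
    """Exact-match cache in front of an answer function (hypothetical interface)."""

    def __init__(self, answer_fn):
        self._answer_fn = answer_fn
        self._entries = {}

    def ask(self, query):
        key = normalise(query)
        if key not in self._entries:
            self._entries[key] = self._answer_fn(query)
        return self._entries[key]

    def clear(self):
        """Call after re-indexing so stale answers are not served."""
        self._entries.clear()
```

A production cache would also need expiry and invalidation tied to index updates; the `clear` method only hints at that.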

A production RAG system is not a simple chain of retriever plus generator. It is a distributed system with multiple failure modes, performance bottlenecks, and quality dimensions that must be monitored and optimised continuously.

Chunking Strategies That Preserve Meaning

Chunking—how you split documents into segments for embedding and retrieval—is arguably the most consequential decision in a RAG pipeline. Poor chunking destroys context, splits related information across separate chunks, and makes it impossible for the retrieval step to find complete, coherent answers.

Fixed-Size Chunking

The simplest approach is to split text into fixed-size chunks (for example, 512 tokens) with some overlap (typically 50–100 tokens). This is fast, predictable, and easy to implement. It is also the worst strategy for most enterprise content because it has no awareness of document structure. A fixed-size chunk might start in the middle of a paragraph, split a table across two chunks, or separate a heading from the content it describes.
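For reference, fixed-size chunking with overlap fits in a few lines. This sketch assumes the text has already been tokenised into a list; splitting on whitespace is used here only as a stand-in for a real tokeniser, and `fixed_size_chunks` is an illustrative name.

```python
def fixed_size_chunks(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows of `chunk_size` that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Whitespace splitting is an assumption made purely for illustration.
tokens = "a policy may exclude flood damage unless cover is added".split()
windows = fixed_size_chunks(tokens, chunk_size=4, overlap=1)
```

Nothing in the function looks at paragraphs, tables, or headings, which is exactly the weakness described above.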

Semantic Chunking

Semantic chunking uses the document's structure to create more meaningful segments. The simplest version respects paragraph boundaries. More sophisticated approaches use headings and subheadings to create hierarchical chunks, keeping each section as a coherent unit. For structured documents like contracts, policies, or technical manuals, this approach dramatically improves retrieval quality because each chunk represents a complete thought or concept rather than an arbitrary slice of text.

Recursive and Hierarchical Chunking

For documents with complex structures, we often use a recursive approach: split first by major sections (H1 headings), then by subsections (H2, H3), then by paragraphs, with each level maintaining a reference to its parent. This creates a hierarchy that supports both broad and narrow retrieval. A query about a specific clause can retrieve the relevant paragraph, while a broader question can pull back an entire section.
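The parent-reference idea can be sketched as follows. This is a simplified illustration that assumes the document has already been converted to Markdown-style text with `#` headings; the `Chunk` type and `hierarchical_chunks` function are hypothetical names. Each paragraph becomes a chunk that records the chain of headings above it.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Chunk:
    text: str
    heading_path: tuple  # headings above this paragraph, outermost first


def hierarchical_chunks(markdown_text):
    """Split by headings, then by blank-line-separated paragraphs."""
    path = []    # stack of headings currently in scope
    buffer = []  # lines of the paragraph being accumulated
    chunks = []

    def flush():
        text = " ".join(line.strip() for line in buffer).strip()
        if text:
            chunks.append(Chunk(text=text, heading_path=tuple(path)))
        buffer.clear()

    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # leave any deeper or same-level section
            path.append(match.group(2).strip())
        elif not line.strip():
            flush()
        else:
            buffer.append(line)
    flush()
    return chunks


def section_text(chunks, heading_path):
    """Broad retrieval: join every paragraph that sits under a given heading path."""
    prefix = tuple(heading_path)
    return " ".join(c.text for c in chunks if c.heading_path[:len(prefix)] == prefix)
```

A narrow query can return a single `Chunk`, while `section_text` shows how the stored heading path lets a broader query expand to the enclosing section.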

From the Field

In a project for a European insurance company, switching from fixed-size 512-token chunks to semantic chunking based on policy section boundaries improved retrieval precision by 34% and reduced hallucination rates by 22%. The model was the same; only the chunking strategy changed. This is a representative result across our engagements.

Choosing and Optimising Embedding Models

Embedding models convert text into dense vector representations that capture semantic meaning. The quality of these embeddings directly determines how well your retrieval system can match queries to relevant documents. Choosing the right embedding model is a critical decision that affects accuracy, latency, cost, and storage requirements.

Model Selection Criteria

When selecting an embedding model, consider these factors: dimensionality (higher dimensions capture more nuance but increase storage and computation costs), maximum token length (models that truncate at 512 tokens will lose information from longer chunks), multilingual support (critical for organisations operating across language boundaries), domain relevance (models trained on general web text may underperform on specialised domains like legal or medical text), and licensing (some models have restrictions on commercial use or require specific attribution).
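The storage side of the dimensionality trade-off is simple arithmetic. The helper below is a rough estimate that assumes uncompressed 32-bit floats and ignores index structures and metadata, so real footprints will differ; `raw_vector_size_gib` is an illustrative name.

```python
def raw_vector_size_gib(num_chunks, dimensions, bytes_per_value=4):
    """Size of the raw vectors alone, in GiB (float32 by default)."""
    return num_chunks * dimensions * bytes_per_value / 2**30


# Same corpus, two hypothetical model sizes: the larger model needs 4x the space.
small = raw_vector_size_gib(10_000_000, 384)
large = raw_vector_size_gib(10_000_000, 1536)
```

Because the cost scales linearly with both chunk count and dimensions, a chunking change that doubles the number of chunks has the same storage effect as doubling the embedding size.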

Domain-Specific Fine-Tuning

For enterprise applications, off-the-shelf embedding models often underperform because they were trained on general-purpose text that does not reflect your domain's vocabulary, concepts, or semantic relationships. Fine-tuning an embedding model on your domain-specific data can significantly improve retrieval quality. The key is constructing high-quality training pairs: queries and their corresponding relevant documents. These can be generated from existing search logs, FAQ databases, or by using an LLM to generate synthetic query-document pairs from your corpus.

Fine-tuning does not require massive datasets. In our experience, 5,000 to 10,000 high-quality training pairs are sufficient to meaningfully improve retrieval quality for most enterprise domains. The improvement is particularly pronounced for technical or regulatory content where general-purpose models struggle with domain-specific terminology.

Retrieval Optimisation: Beyond Naive Vector Search

Naive vector search—embedding the query, finding the top-k nearest neighbours, and passing them to the LLM—is a reasonable starting point but insufficient for production. Several techniques can significantly improve retrieval quality and robustness.

Hybrid Search

Combining vector search (semantic similarity) with keyword search (BM25 or similar) produces better results than either approach alone. Vector search excels at finding semantically related content even when the exact keywords differ, but it can miss documents that use specific technical terms or product names. Keyword search captures these exact matches but misses paraphrases and conceptual relationships. Hybrid search combines both signals, typically using reciprocal rank fusion to merge the results.
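Reciprocal rank fusion itself is compact: each document scores the sum of 1 / (k + rank) over every result list in which it appears. The sketch below assumes each retriever returns an ordered list of document IDs; the function name is illustrative, and `k=60` is a commonly used default rather than a value taken from our deployments.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]   # from semantic search
keyword_hits = ["c", "a", "d"]  # from BM25
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Because RRF uses only ranks, it avoids having to normalise vector similarities and BM25 scores onto a common scale.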

Query Transformation

The user's query is rarely the optimal search query. Query transformation techniques improve retrieval by reformulating the query before search. Common approaches include query expansion (adding related terms to broaden the search), query decomposition (breaking complex questions into simpler sub-queries that are searched independently), and hypothetical document embeddings (HyDE), where you ask the LLM to generate a hypothetical answer, embed that answer, and search with its embedding instead of the question's. HyDE can be remarkably effective because the hypothetical answer is often semantically closer to the actual documents than the original question.
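A HyDE-style retrieval step can be sketched with three injected callables. All three (`generate`, `embed`, `search`) are hypothetical interfaces standing in for your LLM client, embedding model, and vector store; the prompt wording is likewise only an example.

```python
def hyde_retrieve(question, generate, embed, search, top_k=5):
    """Search with the embedding of a hypothetical answer rather than the question.

    generate(prompt) -> str, embed(text) -> vector, and
    search(vector, top_k) -> results are assumed interfaces.
    """
    prompt = (
        "Write a short passage that plausibly answers the question.\n\n"
        f"Question: {question}\n\nPassage:"
    )
    hypothetical_answer = generate(prompt)
    query_vector = embed(hypothetical_answer)
    return search(query_vector, top_k)
```

The hypothetical answer is used only for retrieval and is then discarded. The final response should still be generated from the retrieved documents, since the hypothetical passage may itself be wrong.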

Re-ranking

Initial retrieval is optimised for recall—casting a wide net to ensure relevant documents are not missed. Re-ranking applies a more computationally expensive model to the retrieved candidates to improve precision—pushing the most relevant documents to the top. Cross-encoder re-rankers evaluate the query and document together (rather than independently, as embedding models do) and can capture fine-grained relevance signals that vector similarity misses. In production, we typically retrieve 20–50 candidates via hybrid search and then re-rank to select the top 3–5 for inclusion in the prompt.
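The retrieve-then-re-rank pattern can be sketched as below. The `score_pair` argument is a hypothetical callable standing in for a cross-encoder that scores a query and a passage together; the word-overlap scorer is a toy used only to make the example runnable.

```python
def rerank(query, candidates, score_pair, top_n=5):
    """Re-score (doc_id, text) candidates against the query and keep the best top_n.

    score_pair(query, text) -> float is an assumed interface; in production it
    would wrap a cross-encoder model.
    """
    scored = [(score_pair(query, text), doc_id) for doc_id, text in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:top_n]]


def overlap_score(query, text):
    """Toy relevance score: number of shared lower-cased words."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


candidates = [
    ("d1", "the cat sat"),
    ("d2", "refund policy for annual plans"),
    ("d3", "annual report"),
]
top = rerank("annual refund policy", candidates, overlap_score, top_n=2)
```

Since the scorer runs once per candidate, the size of the candidate pool largely determines how much latency re-ranking adds.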

Generation and Prompting for Faithful Outputs

Once you have retrieved the right documents, the generation step must produce an accurate, well-structured response that is faithfully grounded in the retrieved context. This is where prompt engineering, output formatting, and citation management come into play.

Prompt Design for Grounded Generation

Your generation prompt must clearly instruct the model to base its answer on the provided context and to indicate when the context does not contain sufficient information to answer the question. The prompt should also specify the desired output format, tone, and level of detail. In production, prompts are not static strings—they are parameterised templates that vary based on the query type, user role, and application context.
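A parameterised template might look like the following sketch, built on Python's standard `string.Template`. The wording, the parameter names (`user_role`, `tone`, `detail`), and the refusal sentence are illustrative assumptions to adapt, not a recommended production prompt.

```python
from string import Template

REFUSAL = "I don't have enough information to answer this."

GROUNDED_PROMPT = Template(
    "You are answering a question for a $user_role.\n"
    "Use ONLY the numbered context passages below.\n"
    "Cite the passages you rely on inline, for example [1] or [2].\n"
    'If the context is insufficient, reply exactly: "$refusal"\n'
    "Tone: $tone. Level of detail: $detail.\n\n"
    "Context:\n$context\n\n"
    "Question: $question\n"
    "Answer:"
)


def build_prompt(question, passages, user_role="general employee",
                 tone="neutral", detail="concise"):
    """Fill the template, numbering passages so the model can cite them."""
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(passages, start=1))
    return GROUNDED_PROMPT.substitute(
        user_role=user_role, refusal=REFUSAL, tone=tone,
        detail=detail, context=context, question=question,
    )
```

Keeping the template separate from the code that fills it makes it easier to version prompts and to vary them by query type or user role.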

Citation and Attribution

Enterprise RAG systems must provide citations that allow users to verify the AI's claims against source documents. This means not only indicating which documents were used but also specifying the relevant section or passage within each document. Implementing reliable citation requires careful coordination between the retrieval and generation steps: each retrieved chunk must carry sufficient metadata (document title, section heading, page number) to construct a meaningful reference, and the generation prompt must instruct the model to cite sources inline.
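To illustrate the coordination between retrieval metadata and inline citations, the sketch below maps `[n]` markers in a generated answer back to the chunks that were numbered in the prompt. The `SourceChunk` fields and the `resolve_citations` function are hypothetical, and the `[n]` marker format is an assumption about how the prompt asked the model to cite.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceChunk:
    title: str
    section: str
    page: int
    text: str


def resolve_citations(answer, chunks):
    """Return (references, invalid_markers) for the [n] markers found in the answer.

    Markers are 1-based positions in `chunks`, matching the numbering used in
    the prompt. Markers that point outside the list are reported as invalid.
    """
    cited = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    references, invalid = [], []
    for n in cited:
        if 1 <= n <= len(chunks):
            chunk = chunks[n - 1]
            references.append(f"[{n}] {chunk.title}, {chunk.section}, p. {chunk.page}")
        else:
            invalid.append(n)
    return references, invalid
```

A non-empty `invalid` list is a useful signal in its own right: it means the model cited a passage it was never given, which is worth logging and possibly blocking.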

Production Tip

Always implement a "confidence threshold" in your generation step. If the retrieved documents do not contain information relevant to the query (as determined by the re-ranker scores or by the model's own assessment), the system should decline to answer rather than hallucinate. Users trust a system that says "I don't have enough information to answer this" far more than one that confidently produces incorrect answers.
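A score-based version of this gate can be sketched as follows. The threshold of 0.5 is a placeholder that would need calibrating against your own re-ranker and evaluation set, and `generate` is a hypothetical callable for the generation step.

```python
REFUSAL = "I don't have enough information to answer this."


def answer_or_decline(question, scored_chunks, generate, min_score=0.5, min_chunks=1):
    """Generate only when enough retrieved chunks clear the relevance threshold.

    scored_chunks is a list of (text, re_ranker_score) pairs;
    generate(question, passages) -> str is an assumed interface.
    """
    relevant = [text for text, score in scored_chunks if score >= min_score]
    if len(relevant) < min_chunks:
        return REFUSAL
    return generate(question, relevant)
```

Filtering before generation also keeps weakly related passages out of the prompt, which reduces the chance that the model builds an answer on them.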

Evaluation: Measuring What Matters

You cannot improve what you cannot measure, and RAG systems have multiple quality dimensions that must be evaluated independently. A holistic evaluation framework should assess retrieval quality, generation quality, and end-to-end system performance.

Retrieval Metrics

Measure retrieval independently from generation. Key metrics include precision at k (what fraction of the top-k retrieved documents are relevant), recall (what fraction of all relevant documents are retrieved), mean reciprocal rank (how highly the first relevant document is ranked), and normalised discounted cumulative gain (how well the ranking aligns with the relevance ordering). Building a ground-truth evaluation dataset of queries paired with their relevant documents is essential. This dataset should cover the full range of query types your system will encounter, including edge cases and adversarial queries.
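The four metrics can be computed per query with a few small functions. This is a minimal sketch using the standard definitions; the function names and the graded-relevance dictionary passed to `ndcg_at_k` are illustrative choices. Mean reciprocal rank is the average of `reciprocal_rank` over all queries in the evaluation set.

```python
import math


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant IDs that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant ID, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, gains, k):
    """Normalised DCG; `gains` maps doc ID -> graded relevance (missing means 0)."""
    dcg = sum(gains.get(doc_id, 0) / math.log2(i + 1)
              for i, doc_id in enumerate(retrieved[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(i + 1) for i, gain in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs) if runs else 0.0
```

Running these over the ground-truth dataset after every pipeline change turns statements like "retrieval got better" into numbers that can be tracked.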

Generation Metrics

Generation quality can be assessed along several dimensions: faithfulness (does the answer accurately reflect the retrieved context, without adding information that is not present?), relevance (does the answer address the user's question?), completeness (does the answer cover all relevant aspects of the question?), and coherence (is the answer well-structured and easy to understand?). Automated evaluation frameworks such as RAGAS provide useful signals, but human evaluation remains the gold standard for assessing generation quality, particularly for faithfulness.

Establish a regular evaluation cadence, weekly or fortnightly, in which domain experts review a sample of production queries and responses. Track quality metrics over time to detect regressions early and to measure the impact of pipeline changes.

The most dangerous failure mode in a RAG system is a plausible-sounding answer that is factually incorrect. These failures are difficult for end users to detect and erode trust rapidly. Continuous evaluation is the only reliable defence.

Conclusion: Engineering Rigour Over Demo Magic

Building a RAG demo takes an afternoon. Building a production RAG system that delivers accurate, reliable results across thousands of queries per day takes months of careful engineering. The difference lies in the attention paid to chunking strategies, embedding quality, retrieval optimisation, prompt engineering, and continuous evaluation.

If you are planning to deploy a RAG system, invest the time to get the foundations right. Choose chunking strategies that preserve the semantic structure of your documents. Evaluate and potentially fine-tune your embedding model for your domain. Implement hybrid search and re-ranking to maximise retrieval quality. Design prompts that enforce grounded generation with proper citations. And build an evaluation framework that lets you measure and improve quality continuously.

The organisations that succeed with RAG are those that treat it as a serious engineering challenge, not a quick win. The underlying technology is powerful, but realising that power in production requires the same discipline, rigour, and operational excellence that any mission-critical system demands.

Building a RAG system for your organisation?

We design and build production-grade RAG pipelines for enterprises across Europe. From architecture design to deployment and monitoring, we bring the engineering rigour your project needs. Book a free consultation to discuss your use case.
