What is RAG? The 3 components and when not to use it

By LLMTest Team · Apr 22, 2026 · 6 min read · glossary · rag · fundamentals · vibe-coders
On this page

  1. What RAG does
  2. The 3 parts of every RAG pipeline
  3. When RAG is the right call
  4. When RAG isn't the right call
  5. What actually breaks in RAG pipelines
  6. FAQ
  7. Testing your RAG setup

Your LLM keeps making up answers about your product because your docs aren't in its training data. Fine-tuning sounds expensive and slow. Someone in Slack suggests RAG. Here's what that means.

What RAG does

Retrieval-augmented generation is a pattern for giving an LLM access to your private documents without retraining it. Instead of baking your knowledge into the model weights, you retrieve the relevant text at query time and paste it into the prompt. The model answers based on what you retrieved, not on what it memorized during training.

That's the whole trick. Everything else is implementation detail.

The 3 parts of every RAG pipeline

Every RAG implementation, from a weekend prototype to a production knowledge base, has the same three stages.

1. Ingestion

You take your source documents (PDFs, Markdown files, support tickets, Notion pages, whatever), split them into smaller chunks, convert each chunk to a vector using an embedding model, and store those vectors in a searchable index. The embedding model converts text into a list of numbers that captures meaning, so passages with similar meaning end up mathematically close together.

This is a one-time job, or a scheduled sync if your documents change. The output is a vector index.
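Here's a minimal sketch of that ingestion stage, assuming the OpenAI embeddings API, Markdown files in a docs/ folder, and a flat JSON file standing in for the vector index. The chunk size, model name, and file paths are placeholders for your own setup.

```python
# Minimal ingestion sketch: split docs into chunks, embed, store in a JSON "index".
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
import json
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def chunk_text(text: str, max_chars: int = 1500) -> list[str]:
    # Naive fixed-size chunking; real pipelines usually split on headings or sentences.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

index = []
for path in Path("docs").glob("*.md"):            # your source documents
    for chunk in chunk_text(path.read_text()):
        resp = client.embeddings.create(
            model="text-embedding-3-small",        # any embedding model works
            input=chunk,
        )
        index.append({
            "source": str(path),
            "text": chunk,
            "embedding": resp.data[0].embedding,   # list of floats
        })

Path("index.json").write_text(json.dumps(index))
```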

2. Retrieval

When a user asks a question, you embed the question using the same embedding model, then query the index for the chunks most similar to the question. You get back the top-N matches, typically 3 to 10, ranked by semantic similarity.

This is where most RAG failures originate. If the chunks are too large, or the question phrasing doesn't match the source text, you pull back the wrong passages. Swapping models or rewriting prompts won't fix bad retrieval.
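A minimal retrieval sketch against the same JSON index from the ingestion example: embed the question with the same model used at ingestion, then rank stored chunks by plain cosine similarity. A real vector database does this step for you; the brute-force version below is just to show the mechanics.

```python
# Minimal retrieval sketch: embed the question, rank chunks by cosine similarity.
import json
import numpy as np
from openai import OpenAI

client = OpenAI()

def retrieve(question: str, top_n: int = 5) -> list[dict]:
    index = json.loads(open("index.json").read())
    vectors = np.array([item["embedding"] for item in index])

    q = np.array(client.embeddings.create(
        model="text-embedding-3-small",   # must match the ingestion model
        input=question,
    ).data[0].embedding)

    # Cosine similarity between the question and every stored chunk.
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    best = np.argsort(sims)[::-1][:top_n]
    return [index[i] for i in best]
```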

3. Generation

You build a prompt: the retrieved chunks, then the user's question, then an instruction to answer based on the provided context. The LLM reads the chunks, synthesizes an answer, and ideally cites which passage it drew from.

One constraint worth planning around: the context window limits how many chunks you can include. If you retrieve 10 large chunks, they might overflow the model's working memory, and you'll need to rank or truncate. The relationship between retrieval and the context window limit becomes a real engineering constraint as your corpus grows beyond a few hundred pages.
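A sketch of the generation step under a rough token budget. The ~4 characters-per-token estimate and the model name are placeholders, not recommendations; in production you'd count tokens with a real tokenizer.

```python
# Generation sketch: fit ranked chunks into a token budget, then prompt the
# model to answer only from that context.
from openai import OpenAI

client = OpenAI()

def answer(question: str, chunks: list[dict], budget_tokens: int = 6000) -> str:
    context, used = [], 0
    for c in chunks:                        # chunks arrive ranked best-first
        est_tokens = len(c["text"]) // 4    # rough estimate; use a tokenizer in production
        if used + est_tokens > budget_tokens:
            break                           # truncate instead of overflowing the window
        context.append(f"[{c['source']}]\n{c['text']}")
        used += est_tokens

    prompt = (
        "Answer using only the context below. Cite the source in brackets.\n"
        "If the context does not answer the question, say so.\n\n"
        + "\n\n---\n\n".join(context)
        + f"\n\nQuestion: {question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                # any chat model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```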

When RAG is the right call

RAG fits best when:

  • Your data isn't in the training set. Internal docs, proprietary knowledge bases, product FAQs, customer-specific context. None of this is in any commercial model's training data. RAG is the standard way to inject it.
  • Your data changes frequently. Fine-tuning requires re-running a training job whenever data changes. RAG just requires re-indexing, which is fast and cheap.
  • You need provenance. RAG lets you return citations. Fine-tuned models can't trace which training example informed their answer.
  • Your corpus is too large for the context window. If you have 10,000 support articles, you can't paste them all into one prompt. RAG retrieves the relevant slice.

When RAG isn't the right call

RAG adds real overhead: an embedding step, a vector store, a retrieval pass, chunk boundary decisions, and metadata management. Don't add it if you don't need it.

Your entire knowledge base fits in the context window. If it's 50 pages, stuff it into the prompt first. It's faster to build. With modern context windows pushing well past 128k tokens, this works more often than people expect.

You need structured lookups. If users ask "what's the current price of SKU #4421?", a database query with a function call is more accurate and cheaper than semantic retrieval.
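A sketch of what that looks like, assuming a hypothetical SQLite products table. The point is that an exact lookup, exposed to the model as a tool, answers this class of question better than any similarity search.

```python
# Structured lookup sketch: query the database directly for exact-value
# questions. Table and column names here are hypothetical.
import sqlite3

def price_for_sku(sku: str) -> float | None:
    conn = sqlite3.connect("catalog.db")
    row = conn.execute(
        "SELECT price FROM products WHERE sku = ?", (sku,)
    ).fetchone()
    conn.close()
    return row[0] if row else None

# Expose this as a tool/function call so the LLM invokes it for pricing
# questions instead of answering from retrieved prose.
```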

You're still validating the use case. Build the feature with context-stuffing first. Add RAG when you've hit a real scaling limit.

Your queries and documents are phrased very differently. Semantic search assumes the question and the answer will be semantically close. Legal contracts and casual user questions often aren't. You'll need hybrid search (keyword plus semantic) or query rewriting before RAG works reliably.

What actually breaks in RAG pipelines

Chunk boundaries. A chunk that spans two unrelated topics retrieves partially relevant context, even when it ranks at the top. Most teams start with 256-512 token chunks and 10-20% overlap, then tune based on what retrieval misses.
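A sketch of that starting point using token-based windows with overlap, via tiktoken's cl100k_base encoding. The sizes below (384-token chunks, 64-token overlap) sit inside the ranges quoted above and should be tuned against your own retrieval misses.

```python
# Chunking sketch: fixed-size token windows with overlap.
import tiktoken

def chunk_tokens(text: str, size: int = 384, overlap: int = 64) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks, step = [], size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        chunks.append(enc.decode(window))
        if start + size >= len(tokens):     # last window reached the end
            break
    return chunks
```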

No metadata filtering. If your knowledge base covers 10 product versions and you retrieve without filtering on version, you'll confidently answer from the wrong one. Add metadata fields (version, date, category) to your index and filter before ranking.
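A sketch of filtering before ranking, assuming index entries shaped like the ingestion example with an added version field:

```python
# Metadata filtering sketch: narrow to the right version first, THEN rank by
# similarity, so a v2 question never surfaces v1 docs.
import numpy as np

def retrieve_filtered(q_embedding: list[float], version: str,
                      index: list[dict], top_n: int = 5) -> list[dict]:
    candidates = [item for item in index if item.get("version") == version]
    if not candidates:
        return []
    vectors = np.array([c["embedding"] for c in candidates])
    q = np.array(q_embedding)
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    order = np.argsort(sims)[::-1][:top_n]
    return [candidates[i] for i in order]
```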

Hallucination on empty retrieval. When no relevant chunks exist, the LLM often invents an answer anyway. Prompt it explicitly: "If the provided context doesn't answer the question, say so." Then measure how often retrieval fails on real traffic; it's usually higher than expected.
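One way to sketch that guard: refuse to generate when nothing scores above a similarity floor, and log the miss so you can count them. The 0.30 threshold and the generate_answer helper are placeholders for your own pipeline.

```python
# Empty-retrieval guard sketch: bail out below a similarity floor and log it.
import logging

NO_ANSWER = "I couldn't find anything in the docs that answers this."

def answer_or_refuse(question: str, scored_chunks: list[tuple[float, dict]],
                     min_sim: float = 0.30) -> str:
    relevant = [chunk for score, chunk in scored_chunks if score >= min_sim]
    if not relevant:
        logging.warning("retrieval_miss question=%r", question)  # count these on real traffic
        return NO_ANSWER
    return generate_answer(question, relevant)  # hypothetical: your generation step
```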

RAG also adds costs beyond the LLM call. Embedding each query costs tokens, the vector database query adds latency, and retrieved chunks inflate your input token count. These three hidden costs compound fast on high-traffic RAG apps.

FAQ

What's the difference between RAG and fine-tuning? Fine-tuning bakes knowledge into the model weights through continued training. RAG retrieves knowledge at inference time and places it in the context window. RAG is faster to update, cheaper to iterate on, and easier to inspect. Fine-tuning is better when you want to change how the model reasons or writes, not just what it knows.

Do I need a vector database? For small corpora (a few hundred documents), you can skip a dedicated vector database and store embeddings in a JSON file or a PostgreSQL table with pgvector. A dedicated vector database (Pinecone, Weaviate, Qdrant) becomes worth it when you have tens of thousands of chunks and need filtering, multi-tenancy, or hybrid search.

What embedding model should I use? For most apps, text-embedding-3-small from OpenAI ($0.02 per 1M tokens) is a solid starting point. If you need multilingual support or want to avoid OpenAI dependency, nomic-embed-text (open-source, runs locally) and Cohere's embed-v3 are both good alternatives. The embedding model and the generation model don't need to come from the same provider.

What is hybrid search? Hybrid search combines dense retrieval (semantic/vector similarity) with sparse retrieval (keyword matching, typically BM25). It tends to outperform either approach alone, especially when user queries contain exact product names or technical terms. Most vector databases support hybrid search natively.
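A sketch of score fusion using the rank_bm25 package for the sparse side; the 50/50 weighting and min-max normalization are placeholders to tune. In practice your vector database's built-in hybrid mode does this for you.

```python
# Hybrid search sketch: blend BM25 keyword scores with vector similarity.
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_scores(query: str, q_embedding: np.ndarray,
                  index: list[dict], alpha: float = 0.5) -> np.ndarray:
    # Sparse side: BM25 over whitespace-tokenized chunk text.
    bm25 = BM25Okapi([item["text"].lower().split() for item in index])
    sparse = np.array(bm25.get_scores(query.lower().split()))

    # Dense side: cosine similarity against stored embeddings.
    vectors = np.array([item["embedding"] for item in index])
    dense = vectors @ q_embedding / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(q_embedding)
    )

    # Normalize each to [0, 1] before mixing so neither score scale dominates.
    def norm(x: np.ndarray) -> np.ndarray:
        return (x - x.min()) / (x.max() - x.min() + 1e-9)

    return alpha * norm(dense) + (1 - alpha) * norm(sparse)
```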

Can I use RAG with any LLM? Yes. The retrieval step is model-agnostic. You retrieve with one embedding model and generate with any LLM. If you want to compare which generation model handles your retrieved context most accurately, you can route the same retrieved chunks through multiple models and compare outputs. That's exactly the kind of test LLMTest's benchmark runner is built for.
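A sketch of that comparison against the OpenAI chat API; the model names are placeholders, and the same loop works with any provider's SDK.

```python
# Model comparison sketch: send the SAME retrieved context + question to
# several chat models and collect the answers side by side.
from openai import OpenAI

client = OpenAI()

def compare_models(prompt: str, models: list[str]) -> dict[str, str]:
    outputs = {}
    for model in models:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        outputs[model] = resp.choices[0].message.content
    return outputs

# e.g. compare_models(rag_prompt, ["gpt-4o-mini", "gpt-4o"]) and diff the answers
```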

How do I measure RAG quality? Track retrieval precision (are the returned chunks actually relevant?) and generation faithfulness (does the answer reflect the retrieved chunks?). For retrieval, build a small golden set of questions with known correct chunks. For generation, use an LLM-as-judge pass with a faithfulness rubric. Even a small model scoring 1 to 5 catches most hallucination failures.
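A sketch of the retrieval half: precision@k over a hand-labeled golden set, where each question lists the chunk IDs that should come back (the "id" field and the shape of the golden set are assumptions). Faithfulness would be a separate LLM-as-judge pass over (answer, retrieved chunks) pairs.

```python
# Evaluation sketch: mean precision@k over a golden set of labeled questions.
def precision_at_k(golden: list[dict], retrieve_fn, k: int = 5) -> float:
    # golden items look like {"question": ..., "relevant_ids": [...]}  (your labels)
    hits, total = 0, 0
    for item in golden:
        retrieved_ids = {c["id"] for c in retrieve_fn(item["question"])[:k]}
        hits += len(retrieved_ids & set(item["relevant_ids"]))
        total += k
    return hits / total if total else 0.0
```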

Testing your RAG setup

RAG introduces enough moving parts that "it feels like it's working" is not a reliable signal. Build a small golden set of question-and-answer pairs from your real corpus, run them through your pipeline, and track precision at the retrieval step separately from faithfulness at the generation step. Treating them as one number hides which half is failing.

If you're building a RAG pipeline and want to test which generation model handles your retrieved context best, LLMTest runs your prompts through multiple models and scores the outputs automatically.

Ship LLM features without burning your budget.

LLMTest proxies your OpenAI / Anthropic calls, tracks cost per feature, and auto-rewrites prompts to be cheaper while holding quality. Free to start.

Create a free account

Related reading

Context windows explained: why your 128k model only gives you 100k
Apr 21, 2026 · 6 min read
1 token is not 1 word: LLM conversion rates that predict your bill
Apr 27, 2026 · 6 min read
Prompt caching breaks even at 1.3 requests. Here's the math.
Apr 27, 2026 · 5 min read