RAG systems retrieve relevant content by searching for it, but traditional search engines look for keyword matches. The content you need is not always described using the same words as the query: a question about “employee leave entitlement” should retrieve documents that discuss “annual leave policy” even though the two phrases share only the word “leave”. Vector databases solve this by storing content as mathematical representations (vectors) that capture semantic meaning, enabling similarity-based search that finds conceptually related content regardless of exact word match.
What a Vector Actually Is
A vector, in this context, is a list of numbers — typically hundreds or thousands of numbers — that represents the meaning of a piece of text. An AI embedding model converts text into these vectors such that texts with similar meanings produce similar vectors. “The patient presented with chest pain” and “the customer complained of heart discomfort” would produce similar vectors because they describe similar situations, even though they share few exact words. A vector database stores these numerical representations and enables efficient search for vectors that are numerically similar — which translates to semantically similar text.
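To make "numerically similar" concrete, here is a minimal sketch of cosine similarity, one common way to compare two vectors. The four-number vectors are invented by hand for illustration; a real embedding model would produce hundreds or thousands of dimensions, and the variable names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)

# Hand-made toy vectors (NOT real embeddings), chosen so that the two
# medical sentences point in a similar direction and the invoice one does not.
chest_pain = [0.9, 0.8, 0.1, 0.0]
heart_discomfort = [0.8, 0.9, 0.2, 0.1]
invoice_overdue = [0.0, 0.1, 0.9, 0.8]

similar = cosine_similarity(chest_pain, heart_discomfort)
dissimilar = cosine_similarity(chest_pain, invoice_overdue)
print(f"medical vs medical: {similar:.2f}")
print(f"medical vs invoice: {dissimilar:.2f}")
```

A score near 1 means the vectors point in nearly the same direction; a score near 0 means they are unrelated. Vector databases run this kind of comparison (or a close relative such as dot product or Euclidean distance) across many stored vectors at once.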
How RAG Uses Vector Databases
The RAG pipeline has two phases. At indexing time: each document in your knowledge base is split into chunks, each chunk is converted into a vector by an embedding model, and the vectors are stored in the vector database alongside the original text. At query time: the user’s question is also converted into a vector, the vector database finds the stored chunks whose vectors are most similar to the question vector, and those chunks are retrieved and included in the context sent to the AI model.
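The two phases can be sketched in a few lines. This is an in-memory illustration under loud assumptions: `embed` is a toy word-count stand-in for a real embedding model (it only captures word overlap, not meaning), `InMemoryVectorStore` is a hypothetical class rather than any product's API, and the documents are assumed to be already split into chunks.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class InMemoryVectorStore:
    """Hypothetical minimal store: keeps each vector next to its original text."""

    def __init__(self):
        self.records = []  # list of (vector, original_text)

    def add(self, text):
        # Indexing time: embed the chunk and store it with its text.
        self.records.append((embed(text), text))

    def search(self, query, k=2):
        # Query time: embed the question and rank stored chunks by similarity.
        query_vector = embed(query)
        scored = [(cosine(query_vector, vec), text) for vec, text in self.records]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

store = InMemoryVectorStore()
for chunk in [
    "Employees receive 25 days of annual leave per year.",
    "Invoices are payable within 30 days of receipt.",
    "The office is closed on public holidays.",
]:
    store.add(chunk)

hits = store.search("How many days of annual leave do employees receive?")
context = "\n".join(text for _, text in hits)
print(context)  # this text would be placed in the prompt sent to the AI model
```

A real system would swap `embed` for an embedding model call and the list scan for the vector database's approximate nearest-neighbour search, but the shape of the two phases stays the same.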
Vector Database Options for Business
| Option | Best For | Setup | Cost |
|---|---|---|---|
| Pinecone | Managed cloud, fast setup | Low | Free tier / usage |
| Weaviate | Open source, self-hosted | Medium | Free (self-hosted) |
| Chroma | Local development / small scale | Very Low | Free |
| pgvector (Postgres) | Teams already using Postgres | Low | Infrastructure only |
Choosing the Right Vector Database
For small businesses building their first RAG system, Chroma is a good starting point: it runs locally without any infrastructure setup, can persist vectors to disk, and integrates directly with LangChain and LlamaIndex. For production RAG systems that need to serve many users reliably, Pinecone is an accessible managed option with a free tier for getting started (check its current limits against your expected volume). For businesses already running Postgres, the pgvector extension adds vector search capability to your existing database without new infrastructure.
Chunk Size: The Hidden Variable That Matters
How you split your documents into chunks significantly affects retrieval quality. Chunks that are too small lose context — a single sentence extracted without its surrounding paragraph may not contain enough information to be useful. Chunks that are too large include irrelevant content that dilutes the relevance signal. For most business documents, chunks of 300–500 tokens with a 50-token overlap between adjacent chunks produce good retrieval quality. Test different chunk sizes on your specific document types and measure retrieval quality (whether the right chunks are returned for test queries) before settling on a configuration.
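A fixed-size chunker with overlap can be sketched as follows. It assumes the text has already been turned into a list of tokens; the example approximates tokens with whitespace-separated words, which will not match the counts of a real model tokenizer. The function name and defaults are illustrative.

```python
def chunk_tokens(tokens, chunk_size=400, overlap=50):
    """Split a token list into chunks of up to `chunk_size` tokens,
    where each chunk repeats the last `overlap` tokens of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks

# Approximation: treat whitespace-separated words as tokens.
document = " ".join(f"word{i}" for i in range(1000))
tokens = document.split()
chunks = chunk_tokens(tokens, chunk_size=400, overlap=50)

print(len(chunks))               # number of chunks produced
print([len(c) for c in chunks])  # size of each chunk
```

Running the same function with several `chunk_size` values over a handful of representative documents is a cheap way to produce the candidate configurations to compare.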
Indexing Your Documents Effectively
The quality of your vector index depends heavily on how you prepare and chunk your documents. Long documents divided into large chunks (1,000+ tokens) retrieve broadly but imprecisely: the model receives a lot of content, much of it irrelevant to the specific query. Very small chunks (50–100 tokens) are precise but may lack the surrounding context that makes a retrieved passage interpretable. The 300–500 token range sits between these extremes for most business documents, and the 50-token overlap between adjacent chunks ensures that concepts spanning a chunk boundary are not lost.
Document metadata enriches retrieval significantly. Tagging each chunk with its source document, document type, date, and any relevant categorisation (product, region, department) enables filtered retrieval — searching only within documents of a specific type or time range. A legal question should retrieve from contract and policy documents, not from marketing copy, even if the marketing copy contains some overlapping terminology. Build metadata into your indexing pipeline from the start; retrofitting it after the index is built requires re-indexing everything.
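As a rough sketch of filtered retrieval, the snippet below attaches metadata to each chunk and narrows the candidate set before any similarity ranking happens. The `Chunk` class, the `doc_type` and `year` fields, and `filter_chunks` are hypothetical names chosen for illustration; real vector databases expose their own filter syntax.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def filter_chunks(chunks, allowed_types=None, min_year=None):
    """Keep only chunks whose metadata passes every filter that was supplied."""
    kept = []
    for chunk in chunks:
        if allowed_types is not None and chunk.metadata.get("doc_type") not in allowed_types:
            continue
        if min_year is not None and chunk.metadata.get("year", 0) < min_year:
            continue
        kept.append(chunk)
    return kept

chunks = [
    Chunk("Either party may terminate with 30 days notice.",
          {"doc_type": "contract", "year": 2024}),
    Chunk("Terminate boredom with our new product line!",
          {"doc_type": "marketing", "year": 2024}),
    Chunk("Leave requests require manager approval.",
          {"doc_type": "policy", "year": 2023}),
]

# A legal question searches contracts and policies only, never marketing copy.
legal = filter_chunks(chunks, allowed_types={"contract", "policy"})
recent_legal = filter_chunks(chunks, allowed_types={"contract", "policy"}, min_year=2024)
print([c.metadata["doc_type"] for c in legal])
print([c.text for c in recent_legal])
```

The similarity search then runs only over the chunks that survive the filter, which is why the metadata has to exist at indexing time.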
Hybrid Search: Combining Dense and Keyword Retrieval
Pure vector (dense) search is powerful for semantic similarity but can underweight exact keyword matches. A query for a specific product SKU, a person’s name, or a precise technical term may not retrieve the right document through semantic search alone because the vector representation dilutes the importance of exact terms. Hybrid search combines dense vector search with traditional keyword (BM25) search, merging and reranking the results from both. For business knowledge bases with a mix of semantic content and specific named entities, hybrid search often outperforms pure vector search on retrieval accuracy.
Weaviate, Pinecone, and Qdrant all support hybrid search configurations. If your initial RAG system shows poor retrieval on queries involving specific names, product codes, or precise technical terms, enabling hybrid search is usually the first optimisation to try. The configuration change is typically a few parameters in your retrieval call, not a major infrastructure change.
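The merging step can be illustrated with reciprocal rank fusion (RRF), one widely used way to combine two ranked lists. Whether a given database uses RRF or another weighting scheme is not something this article specifies, so treat this as a sketch of the idea rather than any product's behaviour. The document IDs and the constant `k=60` are illustrative choices.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by both retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results for the same query from the two retrievers.
dense_results = ["doc_a", "doc_b", "doc_c"]    # semantic (vector) search
keyword_results = ["doc_c", "doc_a", "doc_d"]  # keyword (BM25) search

fused = reciprocal_rank_fusion([dense_results, keyword_results])
print(fused)
```

Here `doc_a` and `doc_c` appear in both lists and therefore outrank `doc_b` and `doc_d`, which each appear in only one.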
Monitoring and Improving Retrieval Quality Over Time
Retrieval quality should be monitored continuously, not just at initial deployment. As your document base grows and changes, previously well-calibrated retrieval parameters may need adjustment. Maintain a test set of representative queries with known correct answers that you run against the retrieval system weekly. Track the percentage of test queries where the relevant document is in the top three retrieved chunks. This metric, commonly called recall@3 (or hit rate@3 when each query has a single relevant document), is a practical measure of retrieval quality for most RAG use cases. When recall@3 drops, investigate whether the issue is a new document type that chunks or embeds poorly, a query type that the current retrieval configuration does not handle well, or document staleness where the indexed content no longer reflects current reality.
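The weekly check can be a very small script. The sketch below assumes each test query has exactly one known relevant document ID and that `retrieve` is whatever function returns ranked IDs from your system; here it is faked with a dictionary of made-up results so the example runs on its own.

```python
def recall_at_k(test_set, retrieve, k=3):
    """Fraction of test queries whose known relevant document ID
    appears among the top-k IDs returned by `retrieve`."""
    if not test_set:
        raise ValueError("test set is empty")
    hits = sum(
        1 for query, relevant_id in test_set
        if relevant_id in retrieve(query)[:k]
    )
    return hits / len(test_set)

# Made-up ranked results standing in for a real retrieval call.
fake_results = {
    "annual leave": ["hr-12", "hr-07", "fin-03", "hr-01"],
    "invoice terms": ["fin-03", "fin-09", "hr-12"],
    "vpn setup": ["it-02", "it-05", "it-11", "it-01"],
}

test_set = [
    ("annual leave", "hr-07"),   # found at rank 2
    ("invoice terms", "fin-03"), # found at rank 1
    ("vpn setup", "it-01"),      # only at rank 4, so a miss at k=3
]

score = recall_at_k(test_set, lambda query: fake_results[query], k=3)
print(f"recall@3 = {score:.2f}")
```

Logging this number after every run gives you the trend line to alert on.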
Start with Chroma for local development and test your chunking strategy on ten representative documents. Measure retrieval quality with a small test set before committing to a production configuration or a managed vector database.
Embedding Model Selection
The embedding model — the model that converts text to vectors — significantly affects retrieval quality. OpenAI’s text-embedding-3-small and text-embedding-3-large are widely used and perform well for general business content. For specialised domains (legal, medical, scientific), domain-specific embedding models trained on domain-relevant text can improve retrieval accuracy. The embedding model must be the same at indexing time and query time — if you re-index with a different model, you must also update the query embedding process to match, otherwise similarity scores become meaningless.
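One way to guard against a model mismatch is to record the embedding model's name alongside the index and refuse vectors that come from a different one. The sketch below is a hypothetical wrapper, not a feature of any particular vector database; `PinnedIndex` and `ModelMismatchError` are invented names.

```python
class ModelMismatchError(ValueError):
    """Raised when a vector comes from a different model than the index."""

class PinnedIndex:
    """Hypothetical wrapper that remembers which model built the index."""

    def __init__(self, embedding_model):
        self.embedding_model = embedding_model
        self.vectors = []

    def check(self, model_name):
        # Vectors from different models live in different spaces,
        # so comparing them produces meaningless similarity scores.
        if model_name != self.embedding_model:
            raise ModelMismatchError(
                f"index built with {self.embedding_model!r}, got {model_name!r}"
            )

    def add(self, vector, model_name):
        self.check(model_name)
        self.vectors.append(vector)

index = PinnedIndex("text-embedding-3-small")
index.add([0.1, 0.2, 0.3], model_name="text-embedding-3-small")

try:
    # Simulates a query path that was switched to another model by mistake.
    index.check("text-embedding-3-large")
    mismatch_message = None
except ModelMismatchError as err:
    mismatch_message = str(err)

print(mismatch_message)
```

Calling `check` at the start of every query makes a mismatch fail loudly instead of silently returning poor results.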
Embedding model selection also affects cost and latency. Smaller, faster embedding models are appropriate for real-time retrieval where latency matters; larger models produce better embeddings at higher cost and are more appropriate for offline indexing or applications where retrieval quality is prioritised over latency. Match the embedding model to your latency requirements as well as your quality requirements — the best embedding model for your application is the one that meets both constraints.
Reranking: The Quality Boost After Retrieval
Initial vector retrieval optimises for semantic similarity: it finds the chunks whose vectors are closest to the query vector, which are not always the chunks that best answer the query. A reranker takes the top N retrieved chunks and re-scores each one against the specific query, applying a more sophisticated relevance judgment than pure vector similarity. Reranking typically improves retrieval precision (the proportion of retrieved chunks that are actually relevant) at the cost of additional processing time and expense.
For applications where retrieval quality is critical and latency is not the primary constraint, adding a reranking step (using Cohere Rerank or a cross-encoder model) after initial vector retrieval produces meaningfully better answer quality. For latency-sensitive applications, the additional processing time may not be worth the quality improvement. Test both configurations on your specific queries and latency requirements before deciding.
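The two-stage shape (retrieve broadly, then re-score a short list) can be sketched as below. The scoring function is deliberately pluggable: `toy_overlap_score` is a crude word-overlap stand-in so the example runs without any model, and in practice you would pass a function that calls a cross-encoder or a hosted reranking service. All names here are hypothetical.

```python
import re

def rerank(query, candidates, score_fn, top_k=3):
    """Re-score already-retrieved candidate texts against the query
    and return the `top_k` highest-scoring ones."""
    ranked = sorted(candidates, key=lambda text: score_fn(query, text), reverse=True)
    return ranked[:top_k]

def toy_overlap_score(query, text):
    """Stand-in scorer (NOT a real reranker): share of query words found in the text."""
    query_words = set(re.findall(r"[a-z0-9]+", query.lower()))
    text_words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(query_words & text_words) / len(query_words) if query_words else 0.0

# Pretend these came back from the initial vector search, in that order.
candidates = [
    "Our returns desk is open on weekdays.",
    "Damaged items qualify for a full refund under the refund policy.",
    "Shipping policy for international orders.",
]

best = rerank("refund policy for damaged items", candidates, toy_overlap_score, top_k=2)
print(best)
```

Because the reranker scores each candidate against the query individually, its cost grows with N, which is why it is applied to a short list rather than the whole index.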
Production Readiness Checklist
Before deploying a vector database to production, verify that: your backup and restore process works (vector indices can be rebuilt from source documents, but this takes time, so test it); your embedding model is pinned to a specific version (embedding model updates can change vector representations, breaking similarity search); your index size and query latency have been load tested at expected production volume; your monitoring is configured to alert on retrieval quality drops and index size anomalies; and your update pipeline has been tested end-to-end, because the process that adds new documents and updates changed ones should be as reliable as the query path. A vector database that works well in development but has not been hardened for production reliability will fail at the worst possible time.