RAG with Embeddings & Rerank
RAG with Embeddings & Rerank
Build retrieval-augmented generation pipelines using OpenRouter’s embeddings and rerank APIs
RAG with Embeddings & Rerank
Build retrieval-augmented generation pipelines using OpenRouter’s embeddings and rerank APIs
Retrieval-Augmented Generation (RAG) grounds LLM responses in your own data by retrieving relevant documents before generating an answer. This prevents hallucinations and keeps responses up to date without fine-tuning.
OpenRouter provides all three building blocks for a RAG pipeline through a single API:
A typical RAG pipeline follows these steps:
Split your documents into chunks and generate embeddings for each chunk. Store the embeddings in a vector database (or in-memory for prototyping).
In production, use a vector database (Pinecone, Weaviate, Qdrant, pgvector, etc.) to store and query embeddings efficiently. The in-memory approach shown here is for illustration only.
When a user asks a question, embed the query and find the most similar document chunks using cosine similarity.
Embedding-based retrieval is fast but approximate. A rerank model uses a cross-encoder to compare each document directly against the query, producing more accurate relevance scores. This is especially valuable when you retrieve many candidates (e.g., 20) and want to narrow down to the best few (e.g., 3).
Pass the top-ranked documents as context to a chat model. The LLM generates a grounded answer based on the retrieved information.
Here is a full end-to-end RAG pipeline combining all four steps:
Reranking adds an extra API call, so it’s worth understanding when it helps most:
Use rerank when:
Skip rerank when:
How you split documents significantly affects retrieval quality:
Smaller chunks (200-300 tokens) tend to produce more precise retrieval but may lose surrounding context. Larger chunks (500-1000 tokens) preserve more context but may dilute relevance signals. Experiment with your specific data to find the right balance.
Use the same embedding model for indexing and querying. Mixing models produces incompatible vector spaces and will give poor retrieval results.
Batch your embedding requests. Send multiple texts in a single API call to reduce latency and costs. The embeddings API accepts arrays of inputs.
Cache embeddings. Embeddings for the same text are deterministic. Store them in a database to avoid recomputing.
Retrieve more than you need, then rerank. A common pattern is to retrieve 10-20 candidates via embeddings, then rerank to the top 3-5. This combines the speed of embedding search with the precision of cross-encoder reranking.
Include metadata in your prompt. When generating, include source metadata (document title, section, URL) alongside the text so the LLM can produce proper citations.
Set a relevance threshold. After reranking, filter out documents below a minimum relevance score to avoid injecting irrelevant context that could confuse the LLM.
Browse available models for each step: