UpSkillZone

Fundamentals · 2026-05-06

Embedding Models Compared: OpenAI, Cohere, BGE, E5, and How to Choose (2026)

The embedding model you choose determines your retrieval ceiling. A bad embedding model cannot be fixed with a better vector database. Here is how to evaluate and choose.


§1

What Embeddings Do in a RAG System

Embedding models convert text into dense vectors that encode semantic meaning. In a RAG system, both the documents (at ingestion time) and queries (at retrieval time) are embedded into the same vector space. Retrieval works because documents and queries that are about the same topic end up close together in that space. The quality of this proximity mapping is the retrieval ceiling — the maximum recall your system can achieve regardless of how good your vector database or retrieval algorithm is.
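
A minimal sketch of that proximity mapping, using sentence-transformers with an open-source BGE model as a stand-in — any of the models discussed below follows the same embed-then-compare pattern:

```python
# Embed documents and a query into the same vector space, then measure proximity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The quarterly earnings report will be published in March.",
]
query = "How long do I have to return an item?"

# Documents are embedded at ingestion time, the query at retrieval time,
# but both land in the same vector space.
doc_vectors = model.encode(documents, normalize_embeddings=True)
query_vector = model.encode(query, normalize_embeddings=True)

# Cosine similarity is the "proximity" that retrieval relies on.
scores = util.cos_sim(query_vector, doc_vectors)
print(scores)  # the refund-policy document should score noticeably higher
```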

A weak embedding model maps unrelated texts to similar vectors (false positives in retrieval) and maps semantically similar texts to distant vectors (false negatives, the more damaging failure). Neither failure is recoverable downstream — the LLM cannot generate a correct answer from a falsely retrieved document, and it cannot generate an answer at all if the relevant document is not retrieved.

Embedding quality is domain-dependent. A model that achieves state-of-the-art results on the MTEB benchmark (which covers general English text) may underperform on your medical documentation corpus, your multilingual customer support tickets, or your code repository. Always evaluate embedding models on your actual data distribution before committing to one in production.


§2

OpenAI text-embedding-3: The Safe Default

OpenAI's text-embedding-3-small and text-embedding-3-large are the most widely used embedding models in production. text-embedding-3-small produces 1536-dimensional vectors (reducible to 256 or 512 via Matryoshka representation learning) at $0.02 per million tokens. text-embedding-3-large produces up to 3072 dimensions at $0.13 per million tokens.

The primary advantage is operational simplicity: the same API key as your generation model, reliable uptime, and a consistent interface. The quality is strong for general English text — competitive with best-in-class open-source models on most MTEB tasks. The Matryoshka dimension reduction is a genuine production feature: you can reduce embedding dimensions to save storage and improve query speed with only 2–5% quality loss.
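
A short sketch of what that looks like in practice, using the OpenAI Python SDK and the dimensions parameter to request reduced Matryoshka embeddings; the input strings are illustrative:

```python
# Embed two texts with text-embedding-3-small at a reduced dimensionality.
# Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["What is the refund window?", "Returns are accepted for 30 days."],
    dimensions=512,  # reduce from the default 1536 to save storage and speed up search
)

vectors = [item.embedding for item in response.data]
print(len(vectors), len(vectors[0]))  # 2 vectors, 512 dimensions each
```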

The downsides: vendor lock-in (changing embedding models requires re-embedding everything), cost at scale (a 10M-document corpus costs $200–$1,300 to embed depending on average document length), and weaker multilingual performance than Cohere Embed v3 or BGE-M3. For English-only applications with moderate scale and a preference for operational simplicity, text-embedding-3-small is the right default.


§3

Cohere Embed v3: Multilingual and Classification-Optimized

Cohere Embed v3 (embed-multilingual-v3.0 and embed-english-v3.0) is the strongest commercial embedding model for multilingual applications. It supports 100+ languages with consistent cross-lingual retrieval quality — a query in French reliably retrieves a relevant document in German. This is a qualitatively different capability from OpenAI's models, which are primarily English-optimized.

Cohere's input_type parameter is a production-relevant feature: you specify whether you are embedding a query or a document, and the model uses different internal representations for each. This asymmetric embedding is aligned with how semantic search actually works — query intent and document content have different linguistic structures, and a single embedding function that ignores this distinction underperforms one that accounts for it.
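
A sketch of the asymmetric pattern with the Cohere Python SDK; the texts are illustrative and the exact client class may vary slightly across SDK versions:

```python
# Embed a document and a query with different input_type values so the model
# uses its document-side and query-side representations respectively.
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

doc_response = co.embed(
    texts=["Les retours sont acceptés sous 30 jours."],
    model="embed-multilingual-v3.0",
    input_type="search_document",
)
query_response = co.embed(
    texts=["How long is the return window?"],
    model="embed-multilingual-v3.0",
    input_type="search_query",
)

doc_vector = doc_response.embeddings[0]
query_vector = query_response.embeddings[0]
```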

For English-only retrieval, Cohere Embed v3 is competitive with but not clearly better than OpenAI text-embedding-3-large. Choose Cohere when multilingual support is required, when you want the input_type semantic distinction, or when you are already using Cohere's generation models and want a single vendor relationship. The API pricing is comparable to OpenAI at similar quality tiers.


§4

BGE Models: The Open-Source Benchmark Leaders

The BGE (BAAI General Embedding) model family from the Beijing Academy of Artificial Intelligence consistently tops the MTEB leaderboard. BGE-M3 is particularly notable: it supports dense retrieval, sparse retrieval, and multi-vector (ColBERT-style) retrieval from a single model, and it covers 100+ languages. BGE-large-en-v1.5 is the strongest English-only model in the open-source ecosystem for most retrieval tasks.
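
A sketch of pulling all three representation types out of BGE-M3 in one pass, assuming the FlagEmbedding package that BAAI publishes alongside the models:

```python
# Dense, sparse, and ColBERT-style multi-vector outputs from a single encode call.
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)  # fp16 assumes a GPU; drop it on CPU

output = model.encode(
    ["Embedding models convert text into dense vectors."],
    return_dense=True,         # single 1024-d vector per text
    return_sparse=True,        # lexical weights for sparse matching
    return_colbert_vecs=True,  # per-token vectors for late interaction
)

dense = output["dense_vecs"]
sparse = output["lexical_weights"]
multi_vector = output["colbert_vecs"]
```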

Running BGE models requires infrastructure for inference: a GPU server, a containerized API (typically Hugging Face's text-embeddings-inference or Qdrant's FastEmbed), and operational management of that service. The total cost of ownership includes inference hardware, engineering time, and the ongoing maintenance burden. For teams that already run ML inference infrastructure, this is manageable. For teams without that infrastructure, the savings versus OpenAI/Cohere need to be substantial to justify the operational overhead.
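
As a rough sketch of what the self-hosted setup looks like from the application side, here is a call to a locally running text-embeddings-inference (TEI) container; the docker command in the comment and the endpoint shape should be checked against the TEI version you deploy:

```python
# Call a self-hosted TEI server. Assumes it was started with something along the lines of:
#   docker run --gpus all -p 8080:80 \
#     ghcr.io/huggingface/text-embeddings-inference:<version-tag> \
#     --model-id BAAI/bge-large-en-v1.5
import requests

resp = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": ["What is the refund window?"]},
    timeout=10,
)
resp.raise_for_status()
vectors = resp.json()  # a list of embedding vectors, one per input
print(len(vectors[0]))
```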

BGE-M3's multi-vector support is a genuine capability advantage: it can produce ColBERT-style late interaction embeddings that outperform single-vector dense retrieval on many tasks. Implementing ColBERT-style retrieval requires a vector database that supports multi-vector documents (Qdrant supports this natively). If you need maximum retrieval quality and have the infrastructure, BGE-M3 with ColBERT retrieval is a strong choice.
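
A sketch of the storage side, assuming qdrant-client with multivector support (Qdrant 1.10 or later); the collection name and vector size are illustrative:

```python
# Create a Qdrant collection that stores one set of per-token vectors per document
# and scores queries with MaxSim late interaction.
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="colbert_chunks",
    vectors_config=models.VectorParams(
        size=1024,  # per-token vector size produced by BGE-M3
        distance=models.Distance.COSINE,
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM  # late interaction scoring
        ),
    ),
)
```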


§5

E5 Models: Instruction-Tuned and Domain-Adaptable

The E5 (EmbEddings from bidirEctional Encoder rEpresentations) model family from Microsoft Research uses instruction-tuned embeddings — each input is prefixed with a marker or task instruction that conditions the embedding on its intended use: "query: " and "passage: " for the encoder-only E5 models, or a full natural-language instruction for the instruct variants. This instruction-following capability makes E5 models highly adaptable to specific retrieval tasks.

E5-mistral-7b-instruct is a 7B-parameter embedding model that achieves state-of-the-art performance on the MTEB benchmark by leveraging a large language model backbone. The quality is exceptional but the inference cost is high — a 7B model requires a GPU with 14–16GB VRAM just for inference, and throughput is limited compared to smaller encoder-only models. E5-large-v2 and E5-base-v2 are the practical alternatives: encoder-only, fast, and competitive with OpenAI text-embedding-3-small on most English retrieval tasks.
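
A small sketch of the prefix convention with the encoder-only E5 models; the example texts are illustrative:

```python
# E5-v2 models expect "query: " and "passage: " prefixes on every input,
# which is how the model embeds queries and documents asymmetrically.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-large-v2")

query = "query: how long is the return window?"
passages = [
    "passage: Returns are accepted within 30 days of purchase.",
    "passage: The quarterly earnings report is published in March.",
]

query_vec = model.encode(query, normalize_embeddings=True)
passage_vecs = model.encode(passages, normalize_embeddings=True)
print(util.cos_sim(query_vec, passage_vecs))
```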

The instruction-tuning of E5 models makes them particularly effective for asymmetric retrieval tasks where queries and documents have different structures. For domain-specific applications, E5 models fine-tune well with relatively small labeled datasets — a few thousand query-document pairs can produce significant quality improvements over the base model.


§6

Fine-Tuning an Embedding Model for Your Domain

Out-of-box embedding models are trained on general text distributions. If your corpus uses specialized vocabulary (medical, legal, financial, scientific), your queries use domain-specific terminology, or your retrieval task has a structure that differs from general web search, fine-tuning can produce significant recall improvements.

Fine-tuning an embedding model requires query-document pairs: examples of queries paired with the documents that are relevant to them. You can generate these synthetically using an LLM (for each document chunk, generate 3–5 queries that it answers) and then train on (query, relevant_chunk, irrelevant_chunk) triplets using a contrastive loss. The standard objective is InfoNCE or MultipleNegativesRankingLoss, which pulls each query toward its relevant chunk and away from the negatives.
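
A sketch of the synthetic-pair generation step, using the OpenAI chat API; the prompt and model name are illustrative assumptions, not part of any standard recipe:

```python
# Generate n synthetic search queries per document chunk with an LLM.
from openai import OpenAI

client = OpenAI()

def generate_queries(chunk: str, n: int = 3) -> list[str]:
    """Ask an LLM for n questions this chunk answers, one per line."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Write {n} distinct search queries that the following passage "
                f"answers, one per line, no numbering:\n\n{chunk}"
            ),
        }],
    )
    lines = response.choices[0].message.content.splitlines()
    return [q.strip() for q in lines if q.strip()]

# Each (query, chunk) pair is a positive; negatives can be sampled from other
# chunks (or mined as hard negatives) to build the contrastive triplets.
```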

The practical minimum for fine-tuning is around 1,000–5,000 labeled pairs, though more is better. Use a small encoder model as the base (BGE-small or E5-small) to keep inference fast. Fine-tune with sentence-transformers using MultipleNegativesRankingLoss — this is the standard tooling and is well documented. Expect a 5–15 percentage point improvement in Recall@5 on domain-specific queries, which is significant at production scale.
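
A minimal fine-tuning sketch with sentence-transformers and MultipleNegativesRankingLoss; the training pair is a placeholder for the synthetic pairs described above:

```python
# Fine-tune a small BGE model on (query, relevant_chunk) pairs.
# MultipleNegativesRankingLoss treats the other in-batch positives as negatives.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# pairs = [(query, relevant_chunk), ...] built from the synthetic generation step
pairs = [("how long is the return window?",
          "Returns are accepted within 30 days of purchase.")]
train_examples = [InputExample(texts=[q, chunk]) for q, chunk in pairs]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=32)

loss = losses.MultipleNegativesRankingLoss(model)
model.fit(
    train_objectives=[(train_loader, loss)],
    epochs=1,
    warmup_steps=100,
)
model.save("bge-small-finetuned")
```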


§7

How to Evaluate Embedding Quality for Your Use Case

Evaluating embedding models requires a held-out retrieval evaluation set: query-document pairs where you know which document is the correct retrieval result. For each candidate model, embed your query set and document corpus, run ANN search, and measure Recall@5 and MRR. The model with the highest Recall@5 on your specific corpus and query distribution wins — benchmark leaderboard rankings are informative but not determinative.

A fast evaluation protocol: use 100–200 query-document pairs (synthetically generated or human-labeled), a moderate-size corpus sample (50K–500K documents), and a consistent vector database configuration across models. Run all models at the same embedding dimensionality for fair comparison. Include timing measurements — a model that achieves 2% better Recall@5 but takes 10x longer to embed is not necessarily the right choice if you need real-time ingestion.
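
A sketch of the metric computation, assuming you already have query and document embeddings as NumPy arrays plus the index of the correct document for each query; brute-force cosine search stands in for the vector database at this scale:

```python
# Compute Recall@k and MRR from precomputed embeddings and known relevance labels.
import numpy as np

def evaluate(query_vecs, doc_vecs, correct_doc_ids, k=5):
    # Normalize so the dot product equals cosine similarity.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = q @ d.T                      # (num_queries, num_docs)
    ranked = np.argsort(-scores, axis=1)  # doc indices, best first

    hits, reciprocal_ranks = 0, []
    for i, correct in enumerate(correct_doc_ids):
        rank = int(np.where(ranked[i] == correct)[0][0]) + 1
        hits += rank <= k
        reciprocal_ranks.append(1.0 / rank)
    return {f"recall@{k}": hits / len(correct_doc_ids),
            "mrr": float(np.mean(reciprocal_ranks))}
```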

Do not evaluate an embedding model on the same query-document pairs that were used to fine-tune it or to tune the retrieval system — this is circular. Always use a held-out query set that the model has not seen. Synthetic evaluation queries generated with the same LLM and prompt that produced your fine-tuning queries will also overestimate performance — use a different model or human-authored queries for the most reliable evaluation.


FAQ

Frequently asked questions

Which embedding model is best for RAG?
For English-only production applications with no existing ML inference infrastructure, start with OpenAI text-embedding-3-small. It is the operational default with strong quality and simple integration. If you need multilingual support, use Cohere embed-multilingual-v3.0. If you are running your own inference infrastructure and want maximum quality, use BGE-M3 for multilingual or BGE-large-en-v1.5 for English-only. If you need domain-specific quality improvements and have labeled data, fine-tune an E5 or BGE model on your domain. The most important evaluation is on your actual corpus and query distribution — benchmark rankings are guides, not guarantees.
Should I use OpenAI embeddings or an open-source model?
It depends on your operational constraints and scale economics. OpenAI embeddings: simple integration, reliable uptime, no infrastructure to run, pay-per-token pricing that is reasonable at moderate scale. Open-source (BGE, E5): higher upfront operational cost (inference infrastructure), potentially better quality for domain-specific tasks, no per-token API cost at scale (just compute), and no vendor lock-in. The break-even point depends on your embedding volume and infrastructure costs, but typically: below 50M tokens/month, OpenAI is cheaper when you factor in infrastructure cost. Above 200M tokens/month, self-hosted inference on GPU is almost always cheaper. The quality gap is small for general English text — choose based on operational preference at small scale, and re-evaluate at large scale.
How do I evaluate embedding model quality?
Build or generate a retrieval evaluation set: 100–500 query-document pairs where you know the correct document for each query. Embed the full corpus and all queries with each candidate model. Run ANN search (k=5) for each query. Measure Recall@5 (fraction of queries where the correct document is in the top 5) and MRR (mean reciprocal rank of the first correct result). Run this evaluation for every candidate model and choose the one with the best Recall@5 on your specific data. The evaluation should take under an hour for a corpus of 100K documents. Cache embeddings between runs so you pay for each embedding once.
What is the difference between BGE and E5?
Both are open-source embedding model families that compete at the top of the MTEB leaderboard. BGE (BAAI General Embedding) is trained by the Beijing Academy of Artificial Intelligence. Its flagship model, BGE-M3, supports dense, sparse, and multi-vector retrieval from a single model with 100+ language support. E5 (Microsoft Research) uses instruction-tuned embeddings — you prefix each input with a task description that conditions the embedding. E5-mistral-7b-instruct achieves the highest quality at the cost of high inference requirements. In practice, BGE-large-en-v1.5 and E5-large-v2 are similarly competitive for English retrieval. BGE-M3 is the clear winner for multilingual or multi-vector use cases.
Can I fine-tune an embedding model without a GPU?
Technically yes, but it is impractical for anything beyond very small experiments. Fine-tuning even a small encoder model (BGE-small, 33M parameters) on a few thousand examples takes hours on a CPU vs. minutes on a GPU. The sentence-transformers library supports CPU training, but the gradient computations for contrastive loss are significantly slower without CUDA. Practical options: use Google Colab with a T4 GPU (free tier, sufficient for small fine-tuning runs up to ~10K examples), use a cloud spot instance (A10G for ~$0.60/hour, typically takes 30–90 minutes for a full fine-tuning run), or use the Hugging Face AutoTrain API which handles GPU allocation for you. Do not attempt to fine-tune a 7B-parameter model without at least one A100 or H100.


rag · embeddings · fundamentals