
How AI Finds Information: Retrieval Models, Embeddings and RAG Explained

Learn how retrieval models, embedding vectors and retrieval-augmented generation are combined in modern AI systems. This article covers semantic search, the role of embeddings in supporting generative models, and the practical trade-offs of embedding-based retrieval.

Retrieval, Embeddings and RAG: How Modern AI Systems Find and Use Information

Modern AI systems are often described as if they are a single model doing everything. In practice, most useful systems are built from multiple components, each solving a different problem. One of the most important, and often least visible, of these components is the retrieval model.

Retrieval models answer a simple but critical question: how do we find the right information? Before an AI system can summarise, explain, or reason about anything, it first needs to locate relevant material. This is why retrieval models are typically used alongside generative models such as ChatGPT rather than being replaced by them.

Retrieval models

Retrieval models are used to search through a collection of documents and return the most relevant ones for a given query. Traditionally, this was done using keyword-based approaches. Older models such as bag-of-words or TF-IDF represent documents based on which words they contain and how often those words appear. A query is matched to documents that share many of the same terms, with rarer words weighted more heavily than common ones.

Keyword-based retrieval is precise and reliable when users know exactly what they are looking for. It works well for names, numbers, regulations, and exact phrases. However, it struggles when the wording of a query differs from the wording used in a document, even if the underlying meaning is the same.
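
As an illustration, here is a minimal sketch of keyword-based retrieval using TF-IDF with scikit-learn. The documents and query are invented for this example; the point is that matching rests entirely on shared terms, with rarer terms weighted more heavily.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    documents = [
        "Invoice 4281 was issued under regulation EU 2016/679.",
        "The supplier signed the contract in March.",
        "Payment terms are net 30 days.",
    ]

    # Each document becomes a sparse vector of term weights: terms that are frequent
    # in a document but rare across the collection score highest.
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(documents)

    # The query is vectorised with the same vocabulary, so only exact term overlap counts.
    query_vector = vectorizer.transform(["Which invoice cites regulation EU 2016/679?"])

    scores = cosine_similarity(query_vector, doc_vectors)[0]
    print(documents[scores.argmax()])  # the invoice document, because it shares the rare terms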

More modern retrieval systems use embedding models. Instead of comparing words directly, embedding-based retrieval compares meanings. Both queries and documents are converted into numerical representations, and similarity is measured mathematically. This allows a search for “limits of transformer retrieval” to return a document that discusses “geometric constraints on single-vector retrievers”, even if the phrasing does not overlap.
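
The same idea can be sketched with an off-the-shelf embedding model. The snippet below uses the sentence-transformers library; the model name is just one commonly used example, and any text-embedding model would behave similarly.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    documents = [
        "Geometric constraints on single-vector retrievers.",
        "A recipe for apple pie with a shortcrust base.",
    ]
    doc_embeddings = model.encode(documents, convert_to_tensor=True)

    # The query shares almost no words with the first document, but its meaning is close.
    query_embedding = model.encode("limits of transformer retrieval", convert_to_tensor=True)

    scores = util.cos_sim(query_embedding, doc_embeddings)[0]
    print(documents[int(scores.argmax())])  # the retrieval document, despite the different wording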

In practice, many real-world systems combine both approaches. Keyword search provides precision, while embeddings provide semantic flexibility. This hybrid approach is now common in enterprise search tools and AI-powered assistants.
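
One common way to combine the two result lists is reciprocal rank fusion, sketched below. The fusion constant k = 60 is a conventional default rather than something this article prescribes, and the document ids are invented.

    def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
        # Each ranking is a list of document ids, best first; documents that rank
        # highly in either the keyword list or the embedding list float to the top.
        scores: dict[str, float] = {}
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
        return sorted(scores, key=scores.get, reverse=True)

    keyword_hits = ["doc_17", "doc_02", "doc_40"]    # illustrative ids from keyword search
    embedding_hits = ["doc_02", "doc_31", "doc_17"]  # illustrative ids from embedding search
    fused = reciprocal_rank_fusion([keyword_hits, embedding_hits])  # doc_02 and doc_17 rise to the top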

How retrieval fits with generative models

Generative models such as ChatGPT are designed to read, reason, and generate text. They are not designed to efficiently search large document collections. Asking a language model to read thousands of documents every time a user asks a question would be slow, expensive, and often impossible due to context length limits.

Retrieval solves this problem by narrowing the search space. Instead of giving the generative model everything, the system first retrieves a small set of relevant documents and only passes those into the model. The generative model then uses that material to produce an answer, summary, or explanation.

This pattern is known as Retrieval-Augmented Generation, or RAG.

What RAG actually means

At its core, RAG simply means that a generative model produces an answer using external documents rather than relying only on what it learned during training. The defining feature of RAG is not how documents are selected, but that retrieved information is included as context for generation.

In its simplest form, RAG can be entirely manual. A user might select a small number of documents themselves, paste them into a prompt, and ask the model to answer a question using only those sources. The retrieval step is done by a human, but the generation is still augmented with retrieved material. This still counts as RAG.
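
As an illustration, a manually assembled RAG prompt might look like the following; the sources and question are invented.

    Answer the question using only the sources below. If the answer is not in the
    sources, say that you cannot find it.

    Source 1: [pasted text of an internal leave policy]
    Source 2: [pasted text of an onboarding handbook]

    Question: How many days of parental leave do new employees get?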

Most modern systems automate this process. Instead of a user choosing documents manually, a retrieval model selects them automatically based on the query. The retrieved documents are then inserted into the prompt and passed to the generative model. The underlying idea is the same, but the retrieval step now scales to large document libraries and works in real time.

When people talk about RAG today, they are usually referring to this automated version, but both approaches fit the definition.
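
Put together, the automated version can be sketched in a few lines. The helpers embed and chat_model below are hypothetical placeholders for an embedding model and a generative model; only the flow described above, embed the query, rank documents by similarity, then build the prompt, is assumed.

    import numpy as np

    def answer_with_rag(question, documents, doc_embeddings, embed, chat_model, top_k=3):
        # 1. Embed the query with the same model used for the documents.
        query_vec = embed(question)

        # 2. Rank documents by cosine similarity and keep the top_k.
        sims = doc_embeddings @ query_vec / (
            np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_vec))
        top_docs = [documents[i] for i in np.argsort(-sims)[:top_k]]

        # 3. Insert the retrieved documents into the prompt and generate the answer.
        context = "\n\n".join(top_docs)
        prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
        return chat_model(prompt)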

How embedding models enable automated RAG

Embedding models are what make automated RAG practical at scale. An embedding model converts text into a vector, which is a list of numbers representing meaning in a high-dimensional space. Similar meanings result in vectors that are close together, while different meanings are far apart.

To make this concrete, consider a simple document library containing three short documents:

“Jon likes apples.”
“Mary likes bananas.”
“Server outages increased last quarter.”

Each document is passed through the same embedding model and converted into a vector. The exact numbers are not meaningful on their own, but the relative distances are:

“Jon likes apples.” → [0.12, -0.44, 0.88, …]
“Mary likes bananas.” → [0.10, -0.41, 0.85, …]
“Server outages increased last quarter.” → [-0.91, 0.32, -0.14, …]

At this stage, the embedding model is finished. It has not compared documents to each other or ranked anything. It has simply mapped each piece of text into the same semantic space.

When a user asks the question “Who likes apples?”, that question is embedded in exactly the same way:

“Who likes apples?” → [0.11, -0.43, 0.86, …]

The retrieval system then compares the query vector to all document vectors using a similarity measure such as cosine similarity. The query vector is closest to “Jon likes apples”, somewhat close to “Mary likes bananas”, and very far from the document about server outages. The closest match is retrieved and passed to the generative model as context.

Importantly, this comparison happens numerically rather than linguistically. The system is not matching the word “apples”. It is comparing positions in a high-dimensional space where similarity reflects meaning learned during training.
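
The comparison can be reproduced with a few lines of numpy, using the illustrative three-number prefixes above. Real embeddings have hundreds or thousands of dimensions, so only the relative ordering here is meaningful.

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    documents = {
        "Jon likes apples.":                      np.array([0.12, -0.44, 0.88]),
        "Mary likes bananas.":                    np.array([0.10, -0.41, 0.85]),
        "Server outages increased last quarter.": np.array([-0.91, 0.32, -0.14]),
    }
    query = np.array([0.11, -0.43, 0.86])  # "Who likes apples?"

    for text, vector in documents.items():
        print(f"{cosine(query, vector):+.4f}  {text}")

    # With only three of the many dimensions shown, the first two scores come out
    # almost identical; full-length embeddings would separate "Jon likes apples."
    # more clearly from "Mary likes bananas.", while the outage document stays far away.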

How embedding models differ from word matching

It is natural to assume that embedding models work by matching words within documents. If that were the case, long documents would often appear similar simply because they share many common words. That intuition is correct for older models such as bag-of-words or TF-IDF, but it does not describe how modern embedding models work.

Embedding models are usually based on transformer architectures. These models use attention mechanisms to understand relationships between words in context. Words that are central to the meaning of a sentence or paragraph have a much stronger influence on the final embedding than generic or structural words.

Consider two documents that both begin with “This paper evaluates transformer-based language models”. One goes on to discuss clinical diagnosis, accuracy, and calibration. The other focuses on legal contracts, clause extraction, and compliance. Although they share many words, their embeddings will diverge because the concepts that define their meaning are very different.

This allows embedding models to work well even with long documents that contain substantial word overlap. However, this process still involves compression, which introduces important limitations.

Can ChatGPT do retrieval?

ChatGPT can participate in retrieval workflows, but it is not itself a retrieval engine. Its role is fundamentally different from that of a retrieval or embedding model.

On its own, ChatGPT does not maintain an index of documents and cannot efficiently search large document libraries. If documents are pasted directly into the prompt, the model can reason over them, but this is manual RAG rather than scalable retrieval.

In practical systems, ChatGPT is often used to interpret a user’s question, reformulate it into a good search query, or decide whether retrieval is needed at all. It can also combine, summarise, and explain documents once they have been retrieved. When ChatGPT appears to “look things up”, it is usually relying on an external retrieval system that has already selected the relevant material.

A helpful way to think about this is that retrieval models decide where to look, while ChatGPT decides what the retrieved information means and how to communicate it.

Why vectors and dimensionality matter

Every embedding has a fixed size, such as 384, 768, or 4,096 numbers. This size determines how much information can be represented. A higher-dimensional vector can encode more distinctions, just as a larger space can accommodate more objects without crowding.

When a document or query is simple, this compression works well. When it becomes complex, covering multiple topics or intents, all of that information must still be squeezed into a single vector. At some point, trade-offs are unavoidable. A query that is meant to retrieve several very different documents may not be able to sit close to all of them at once without also becoming close to irrelevant material.

This is not a failure of training or data. It is a geometric constraint. Even perfectly trained embedding models hit this limit as collections grow larger and more diverse.

The fundamental limitation of single-embedding retrieval

Recent research has shown that these trade-offs are not just practical inconveniences, but fundamental limits of single-embedding retrieval. In theory, a retriever should be able to return any subset of documents for a given query, for example “documents about X and Y but not Z”. In practice, as the number and diversity of documents increases, this becomes impossible.

The limitation becomes clearer when thinking in terms of vectors. Imagine a document collection with three relevant documents:

A document about topic X → [0.90, 0.10, -0.20, …]
A document about topic Y → [-0.85, 0.15, 0.30, …]
A document about topic Z → [0.05, -0.95, 0.40, …]

Now consider a query that asks for documents about X and Y, but not Z. That query must also be represented as a single vector:

“documents about X and Y but not Z” → [? , ? , ? , …]

Ideally, this query vector would sit close to both the X and Y document vectors while remaining far from Z. In practice, this may be impossible. The vectors for X and Y may lie far apart in the embedding space, and moving the query closer to one necessarily moves it further away from the other or closer to unrelated documents.

As the document collection grows and topics become more diverse, these conflicts become unavoidable. No matter how well the embedding model is trained, a single point in space cannot reliably represent all combinations of relevance. This is a geometric limitation rather than a linguistic one, and increasing embedding size can delay the problem but cannot remove it entirely.
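
The conflict can be made concrete with the illustrative vectors above. In this sketch the query is placed at the best possible compromise direction between X and Y, their angular bisector, and it still ends up far from both.

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    x = np.array([0.90, 0.10, -0.20])   # document about topic X
    y = np.array([-0.85, 0.15, 0.30])   # document about topic Y
    z = np.array([0.05, -0.95, 0.40])   # document about topic Z

    print(cosine(x, y))                 # ≈ -0.96: X and Y point in almost opposite directions

    # The angular bisector is the single direction equally close to X and Y.
    query = x / np.linalg.norm(x) + y / np.linalg.norm(y)

    print(cosine(query, x), cosine(query, y))   # both ≈ 0.15: close to neither document
    print(cosine(query, z))                     # ≈ -0.69: at least Z is still excluded

Any move that raises the query's similarity to X lowers its similarity to Y, which is exactly the trade-off described above.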

This also explains why traditional keyword-based methods can sometimes outperform modern embedding models on certain retrieval tasks, and why multi-vector approaches and agentic retrieval strategies are increasingly important for complex queries.

Practical design choices in modern retrieval systems

These limits explain many common design decisions. Long documents are usually split into chunks so that unrelated sections do not get blended together. Keyword search is often combined with embeddings to balance precision and recall. Some systems use multiple embeddings per document so that different parts can be matched independently.
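
A minimal chunking helper might look like this; the chunk size and overlap are illustrative defaults, not values taken from any particular system.

    def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
        # Slide a window over the text so adjacent chunks share some context.
        chunks = []
        start = 0
        while start < len(text):
            chunks.append(text[start:start + chunk_size])
            start += chunk_size - overlap
        return chunks

    # Each chunk is embedded separately, so a long document contributes several
    # vectors and unrelated sections are not blended into a single embedding.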

For more complex queries, retrieval may be iterative. A generative model can reformulate a question, retrieve additional documents, and combine results across multiple steps. This agentic approach avoids forcing all relevance into a single retrieval operation.
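
In code, this loop can be sketched as below. The helpers retrieve, is_sufficient and reformulate_query are hypothetical placeholders for a retrieval call, a stopping check, and a generative model that rewrites the query to target remaining gaps.

    def iterative_retrieve(question, retrieve, is_sufficient, reformulate_query, max_steps=3):
        evidence = []
        query = question
        for _ in range(max_steps):
            evidence.extend(retrieve(query))               # gather newly retrieved documents
            if is_sufficient(question, evidence):          # stop once the evidence covers the question
                break
            query = reformulate_query(question, evidence)  # aim the next search at what is missing
        return evidence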

Understanding these trade-offs helps set realistic expectations. Embedding-based retrieval is extremely powerful, but it is not a universal solution.

Why this matters for AI agents

These limitations become especially important in the context of AI agents. Unlike simple question-answering systems, agents do not retrieve information once and stop. They plan, decompose tasks, retrieve information repeatedly, and combine results across multiple steps.

An agent might begin with a broad question, retrieve background information, identify gaps, issue more targeted searches, and then reconcile conflicting evidence. In this setting, retrieval is no longer a supporting step but a core capability that shapes the agent’s behaviour.

Single-embedding retrieval works well when queries are simple and focused. As agents take on more complex tasks, they naturally generate queries that combine multiple constraints. These are exactly the cases where the geometric limits of single-vector retrieval become visible.

This is why many agentic systems rely on hybrid retrieval, multi-vector representations, and iterative retrieval loops rather than a single embedding lookup. Agents do not overcome the limits of embeddings by reasoning harder. Instead, they work around those limits by breaking problems into smaller pieces, retrieving information in stages, and refining their search as they go.

Retrieval models in practice

In modern AI systems, retrieval models and generative models play complementary roles. Retrieval models, often powered by embeddings, determine where to look. Generative models determine what the information means and how to communicate it.

Embedding models enable fast, scalable semantic search across large document libraries. RAG allows generative models to ground their answers in real, up-to-date, or private information. When designed thoughtfully, these components work together to produce systems that are far more reliable and useful than any single model on its own.

As AI systems move from static question answering to agentic behaviour, retrieval is no longer just an implementation detail. It becomes a defining constraint on what those systems can reliably do.

September 23, 2025
