ICML 2025 Best Paper: Can xMemory Finally Solve RAG’s Memory Dilemma?
As large-model Agents flourish, enabling an AI to remember conversation history spanning dozens of sessions has become key to real-world usefulness. When we chat with an AI companion that has accompanied us for months, we expect it to recall project preferences we once discussed, understand our work habits, and even remember a tiny detail mentioned in some long-ago chat. Yet standard retrieval-augmented generation (RAG) shows startling limitations in such “agent memory” scenarios.
The Overlooked Root: RAG and Agent Memory Are Fundamentally Incompatible
Before diving into xMemory, it helps to understand the backdrop. What was traditional RAG designed for? Retrieving from oceans of web documents, enterprise knowledge bases, or hundreds of research papers. In those settings the retrieved snippets come from disparate sources, often covering different topics and viewpoints, so classic “top-k similarity” can effectively pick out the most relevant heterogeneous pieces.
Agent memory is utterly different. When a user talks with an Agent for weeks or months, the stored memories exhibit a unique structure: highly related, highly redundant, and tightly linked along the time axis. If the user mentions the same project many times in a month, all related fragments form a dense “cloud” in semantic space—extremely similar to one another—while the few points that carry the key information for the current query may hide inside that cloud.
The awkward result: standard top-5 or top-10 retrieval returns near-duplicate snippets about the same topic. The “most relevant” contexts simply retell the same story, whereas the true temporal evidence chain is discarded as redundancy.
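This failure mode is easy to reproduce in miniature. The sketch below uses invented 2-D “embeddings” and snippet texts: five near-duplicate project snippets form a dense cloud, and plain top-k cosine retrieval crowds out the one temporally crucial memory.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 2-D "embeddings": five near-duplicate snippets about the same
# project form a dense cloud; one crucial snippet sits apart from it.
memory = {
    "project update v1": (0.99, 0.10),
    "project update v2": (0.98, 0.12),
    "project update v3": (0.97, 0.11),
    "project update v4": (0.99, 0.09),
    "project update v5": (0.98, 0.10),
    "deadline moved to June": (0.70, 0.70),
}
query = (0.95, 0.20)

# Standard top-3 similarity retrieval: the dense cloud wins every slot,
# and the temporally decisive snippet is discarded as "less relevant".
top3 = sorted(memory, key=lambda k: cosine(query, memory[k]), reverse=True)[:3]
print(top3)
```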
xMemory’s core insight: standard RAG assumes a heterogeneous text corpus, whereas agent memory is essentially a homogeneous memory stream. This mismatch cannot be fixed by mere tuning.
xMemory’s Core Idea: Decoupling then Aggregation
xMemory proposes an elegant solution: decoupling, then aggregation. Instead of struggling at the raw-dialogue level, it first structures the memory, then uses that structure to drive retrieval.
Concretely, xMemory builds a four-level hierarchy:
- Messages: the actual user-Agent turns
- Episodes: compress consecutive turns into concise summaries capturing one topic unit
- Semantics: reusable long-term facts extracted from episodes—name, employer, preferences, etc.
- Themes: group related semantics into higher-level concept clusters
This hierarchy “decouples” information that was tangled in the time stream (semantics are isolated from raw turns) while “aggregating” it into high-level links. Think of letting muddy river water settle into layers, then drawing from the layer you need.
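The four levels can be sketched as plain data structures. Every class and field name below is an illustrative assumption, not the paper's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    """A raw user/Agent turn."""
    role: str
    text: str

@dataclass
class Episode:
    """Concise summary of consecutive turns on one topic unit."""
    summary: str
    messages: list = field(default_factory=list)

@dataclass
class Semantic:
    """Reusable long-term fact distilled from episodes."""
    fact: str
    source_episodes: list = field(default_factory=list)

@dataclass
class Theme:
    """Concept cluster grouping related semantics."""
    name: str
    semantics: list = field(default_factory=list)

# Example: a preference mentioned mid-chat is "decoupled" from the raw
# turns and aggregated upward into a theme.
m = Message("user", "By the way, I prefer weekly status emails.")
ep = Episode("User discussed reporting cadence.", [m])
sem = Semantic("User prefers weekly status emails.", [ep])
theme = Theme("work preferences", [sem])
print(theme.name, "->", theme.semantics[0].fact)
```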
Two-Stage Adaptive Retrieval: Representative Selection & Uncertainty Awareness
Building the hierarchy is not enough; retrieval must also be efficient. xMemory employs a two-stage adaptive retrieval strategy, the paper’s most striking contribution.
Stage 1: Query-aware Representative Selection on kNN Graph
At the theme and semantic layers xMemory keeps a k-nearest-neighbor graph. Given a query, the system first performs “representative selection” on the theme layer—picking nodes that cover diverse knowledge directions.
Selection is governed by a balancing objective:
i* = argmax_i [ α · CoverageGain(i) + (1 − α) · Relevance(i, q) ]
The objective trades query relevance off against coverage, iteratively updating neighbor coverage states until a coverage threshold is met, which prevents all representatives from clustering in one dense region.
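A minimal greedy sketch of this objective, assuming cosine relevance and defining coverage gain as the fraction of nodes newly covered by a candidate plus its kNN neighbors. The function names, the skip-zero-gain rule, and the toy data are my assumptions, not the paper's implementation:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-12)

def select_representatives(nodes, knn, query, alpha=0.5, coverage_threshold=0.8):
    """Greedily pick i* = argmax [alpha * coverage_gain + (1 - alpha) * relevance]
    until the chosen nodes plus their kNN neighbors cover the threshold."""
    covered, chosen, total = set(), [], len(nodes)
    while len(covered) / total < coverage_threshold:
        best, best_score = None, -1.0
        for i, vec in nodes.items():
            if i in chosen:
                continue
            newly = ({i} | set(knn[i])) - covered
            if not newly:  # assumption: skip picks that add no coverage
                continue
            score = alpha * len(newly) / total + (1 - alpha) * cosine(vec, query)
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
        covered |= {best} | set(knn[best])
    return chosen

# Two dense clusters of themes; the query points at cluster A (nodes 0-2),
# yet one representative from cluster B (nodes 3-5) is kept for coverage.
nodes = {0: (1.0, 0.0), 1: (0.99, 0.05), 2: (0.98, 0.1),
         3: (0.0, 1.0), 4: (0.05, 0.99), 5: (0.1, 0.98)}
knn = {0: [1, 2], 1: [0, 2], 2: [0, 1],
       3: [4, 5], 4: [3, 5], 5: [3, 4]}
reps = select_representatives(nodes, knn, query=(1.0, 0.05))
print(reps)
```

Note how the coverage term forces the second pick out of the query's own cluster, which is exactly the anti-clustering behavior the formula is designed for.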
Stage 2: Uncertainty-Adaptive Evidence Inclusion
After semantics are selected, the system decides which episodes to include as “evidence”. xMemory scores each candidate by how much it reduces the language model’s uncertainty over the answer. Only snippets that significantly lower uncertainty are added, achieving on-demand inclusion: relevant details stay, redundant duplicates are filtered out.
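The idea can be sketched with entropy as the uncertainty measure and a toy stand-in for the language model's answer distribution. In the real system the distribution would come from model probabilities; the function names and the reduction threshold here are assumptions for illustration:

```python
import math

def entropy(probs):
    """Shannon entropy of an answer distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_evidence(candidates, answer_dist, min_reduction=0.1):
    """Greedy sketch of uncertainty-adaptive inclusion: keep a candidate
    episode only if it lowers answer entropy by at least min_reduction."""
    context, h = [], entropy(answer_dist(()))
    for ep in candidates:
        h_new = entropy(answer_dist(tuple(context + [ep])))
        if h - h_new >= min_reduction:
            context.append(ep)
            h = h_new
    return context

# Toy stand-in "LM": confident once the context mentions the deadline;
# redundant restatements of the project add no information.
def toy_answer_dist(context):
    if any("deadline" in ep for ep in context):
        return (0.9, 0.05, 0.05)   # low entropy: answer pinned down
    return (1/3, 1/3, 1/3)         # high entropy: still guessing

candidates = [
    "project update: same status as last week",
    "deadline moved to June",
    "project update: nothing new",
]
print(select_evidence(candidates, toy_answer_dist))
```

The redundant status snippets leave the entropy unchanged and are skipped; only the snippet that actually pins down the answer survives.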
Results: Quality and Efficiency Win-Win
Evaluated on LoCoMo (≈300-turn multi-session chats) and PerLTQA (personal long-term QA):
- Qwen3-8B: BLEU 28.51 → 34.48, F1 40.45 → 43.98
- GPT-5 nano: BLEU 36.65 → 38.71, F1 48.17 → 50.00
- Token cost (Qwen3-8B): 9103 → 4711, almost halved!
Higher answer quality and lower cost—irresistible for cost-sensitive long-running Agents.
Deeper Implications
The paper teaches:
- “Memory organization dictates retrieval efficiency.” Separate long-lived facts from short-lived context, core concepts from trivia. Separation is intelligence.
- The two-stage pattern—coarse representative selection, then fine-grained uncertainty filtering—generalizes to any resource-constrained scenario.
- RAG is powerful but not universal. When assumptions clash with reality, customize or redesign instead of forcing a fit.
Closing
This ICML 2025 best paper energizes the agent-memory frontier. With its decouple-then-aggregate hierarchy and its representative-selection plus uncertainty-aware retrieval, xMemory resolves standard RAG’s predicament in homogeneous memory streams while cutting token cost. If you build long-memory Agents or care about advanced retrieval, read it. It offers not just a deployable technique, but a way of thinking: when existing methods mismatch the problem, return to first principles and redesign.