AI Agents · Developer Tools · Memory Architecture

Forget RAG. The Best AI Agent Memory Is a Plain Text File.

Over 60,000 projects now store AI coding agent memory in simple markdown files. No vector database, no embedding pipeline, no infrastructure. It works better than the alternatives, which are still fighting over whose benchmarks are real.

February 20, 2026
60K+
Projects Using AGENTS.md
~150
LLM Instruction Ceiling
74%
Filesystem Memory Accuracy (LoCoMo)
95%+
Memory Poisoning Injection Success

Every time you start a new session with an AI coding assistant, it forgets everything. The database connection string you spent twenty minutes debugging? Gone. The architectural decision you explained in detail? Vanished. Your project's testing conventions? A blank slate. One developer described the frustration of watching their AI agent "rediscover the same Prisma edge case workaround from scratch each session."

The instinct is to solve this with infrastructure. Vector databases. Embedding pipelines. Retrieval-augmented generation (a technique where the AI searches a knowledge base before answering, commonly called RAG). But a different approach has quietly won the adoption war: a plain markdown file, committed to your repository, loaded into the AI's context at the start of every session. Write down what the agent needs to know. Let it read the file. Skip the infrastructure.
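The whole mechanism fits in a few lines. Here is an illustrative Python sketch (not any tool's actual implementation): read the memory file if it exists and prepend it to the session's system prompt.

```python
from pathlib import Path

def build_session_prompt(base_instructions: str, memory_file: str = "AGENTS.md") -> str:
    """Prepend the project's memory file (if any) to the system prompt.
    File-based memory needs nothing more than this at session start."""
    path = Path(memory_file)
    memory = path.read_text(encoding="utf-8") if path.is_file() else ""
    if not memory.strip():
        return base_instructions
    return f"{base_instructions}\n\n# Project memory\n{memory.strip()}"
```

That is the entire infrastructure: no embeddings, no retrieval step, no database.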

This is not a temporary workaround. File-based memory has become the standard: AGENTS.md is used across 60,000 projects and is standardized under the Linux Foundation. Claude Code loads hierarchical markdown files into every conversation. The approach works because it respects a constraint that more complex solutions ignore: AI models have a hard ceiling on how many instructions they can reliably follow. Understanding that ceiling is the key to understanding why simple wins.

What the Research Shows

  1. File-based memory dominates real-world adoption. AGENTS.md is used by 60,000+ projects across Claude Code, Cursor, GitHub Copilot, and other tools, standardized under the Linux Foundation.
  2. LLMs hit a hard instruction ceiling. Frontier models follow roughly 150 to 200 instructions with reasonable consistency. Quality degrades uniformly across all instructions as that count rises, not just the newest ones.
  3. Bigger context windows do not solve the problem. The "lost in the middle" effect causes 15 to 30% performance drops for information positioned in the middle of long contexts. Million-token windows merely delay the fundamental limit.
  4. Simple filesystems beat complex pipelines. Letta demonstrated that a plain filesystem approach scored 74% on LoCoMo (a standard memory benchmark), outperforming Mem0's 68.5% graph variant.
  5. Managed memory benchmarks are all contested. Mem0, Zep, and Letta each claim superiority. No independent, controlled comparison exists across all providers. Vendors have publicly disputed each other's methodology.
  6. Memory poisoning achieves over 95% injection success. Attacks can survive session restarts and device changes. Detection systems miss two-thirds of poisoned entries because malicious content looks benign in isolation.
  7. Forgetting is the hardest unsolved problem. Without automated pruning, memory stores accumulate outdated and contradictory information. No production system reliably decides when to delete old knowledge.

The Plain Text Revolution

The idea is almost comically simple. Create a markdown file in your project root. Write your conventions, architecture notes, and common gotchas. The AI reads it at session start, and suddenly it knows your codebase.

AGENTS.md emerged from collaboration between OpenAI Codex, Amp, Google's Jules, Cursor, and Factory. It uses plain markdown with no required format; the closest file to the edited directory takes priority. By August 2025, 20,000 projects had adopted it. Within months, that number tripled.

Claude Code takes the pattern further with hierarchical loading: memory files at the project root apply everywhere, while subdirectory files load only when the agent works in that directory. Its auto-memory system extracts learnings from conversations after roughly 10,000 tokens and updates every 5,000 tokens thereafter. Boris Cherny from the Claude Code team recommends updating the file "multiple times weekly for anything Claude repeatedly gets wrong."
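Hierarchical loading can be sketched as collecting every memory file on the path from the project root down to the working directory, broadest scope first. This is a minimal illustration assuming a CLAUDE.md-style filename; the layering is the point, the details are assumptions:

```python
from pathlib import Path

def memory_chain(project_root: Path, workdir: Path, name: str = "CLAUDE.md") -> list[Path]:
    """Collect memory files from the project root down to the working
    directory, broadest scope first, so subdirectory notes layer on top
    of (and can refine) root-level ones."""
    root = project_root.resolve()
    workdir = workdir.resolve()
    dirs = [workdir, *workdir.parents]  # leaf directory up to filesystem root
    in_scope = [d for d in dirs if d == root or root in d.parents]
    return [d / name for d in reversed(in_scope) if (d / name).is_file()]
```

An agent working in `services/api/` would load the root file plus the `services/api/` file; an agent working at the root would load only the root file.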

After three months of maintained memory files, one developer reported the experience felt "like a team member who's been on the project for months rather than a contractor starting fresh every morning." Another distilled the practice into a single rule: "Whenever you find yourself giving the same instruction twice, add it to AGENTS.md instead."

The Ceiling You Cannot Ignore

File-based memory works. It also has limits. Research from HumanLayer found that frontier LLMs follow roughly 150 to 200 instructions with reasonable consistency. Claude Code's system prompt already consumes about 50 of those slots, leaving perhaps 100 to 150 for project-specific memory. Stuffing more instructions in does not cause the agent to ignore only the new ones; quality degrades uniformly across all instructions.

KEY INSIGHT

Context rot is measurable. Claude Sonnet 4 drops from 99% to 50% accuracy on basic tasks as input length grows. LLM attention is quadratic: doubling the context quadruples the computation. This is why a focused 30-item memory file consistently outperforms a sprawling 300-item one.

The "lost in the middle" problem, documented by Stanford researchers in 2023, makes this worse. Language models perform well on information at the beginning and end of their context but drop 15 to 30% on content positioned in the middle. Even million-token context windows, as Mem0's research demonstrated, "merely delay" the fundamental problem. The useful analogy from Anthropic's context engineering framework: context is RAM (fast, limited, volatile), while memory is disk storage (slower, larger, persistent). You cannot solve a storage problem by buying more RAM.
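One practical mitigation is to keep critical instructions out of the middle. A toy heuristic (an assumption of this article's analysis, not a technique from the Stanford paper): given items sorted by importance, alternate them onto the front and back of the prompt so the least important content lands where attention is weakest.

```python
def edge_load(items_by_priority: list[str]) -> list[str]:
    """Alternate items onto the front and back of the context so the most
    important sit at the edges and the least important in the middle,
    where 'lost in the middle' degradation hits hardest."""
    front: list[str] = []
    back: list[str] = []
    for i, item in enumerate(items_by_priority):
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]
```

With five items ranked a > b > c > d > e, the two most important (`a`, `b`) end up first and last, and the least important (`e`) ends up in the middle.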

Practitioners converge on capping memory files at roughly 30 items with regular pruning. The /init command in Claude Code auto-generates a starting file by analyzing the codebase, but beneath the surface, it is simply a well-crafted prompt that writes markdown. Nothing more.
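The 30-item cap is easy to enforce mechanically. A hypothetical lint sketch that counts top-level markdown bullets and warns past a budget (the bullet-counting heuristic is an assumption, not part of any tool):

```python
def lint_memory_file(markdown: str, budget: int = 30) -> list[str]:
    """Count instruction items (markdown bullets) and warn when the file
    exceeds a ~30-item budget, the cap practitioners converge on."""
    items = [ln for ln in markdown.splitlines()
             if ln.lstrip().startswith(("- ", "* "))]
    if len(items) > budget:
        return [f"{len(items)} items exceeds the {budget}-item budget; prune stale rules"]
    return []
```

Run as a pre-commit hook, a check like this turns "prune regularly" from advice into a gate.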

The Benchmark Wars

When file-based memory hits its ceiling, developers reach for managed memory services. Three startups dominate the space: Mem0, Zep, and Letta. Each claims superiority on the LoCoMo benchmark. None have published results under identical, independently verified conditions. They have, however, published detailed takedowns of each other's methodology.

Mem0 claims a 26% relative accuracy improvement over OpenAI's built-in memory (66.9% vs. 52.9%) alongside a 90% token reduction. Zep disputes those numbers, claiming 75.14% accuracy and accusing Mem0 of misconfiguring Zep's concurrent search settings. Letta, born from UC Berkeley's MemGPT research, quietly demonstrated that its plain filesystem approach scored 74%, beating Mem0's more complex graph variant at 68.5%. Mem0 did not respond to Letta's requests for clarification.

Cognee reported an impressive 92.5% accuracy on multi-hop retrieval evaluations, but with a critical caveat: it requires 3.3 hours to process a single sample. And on the tasks that arguably matter most (multi-hop conflict resolution, where the agent must reason across contradictory memories), all providers collapse. The best result reported on MemoryAgentBench is 7% accuracy.

KEY INSIGHT

A developer who evaluated all three managed providers summarized the state of play: "Neither Letta nor Zep feels quite ready for production-oriented stress testing compared to Mem0." The field is young enough that the best-known option is also the most disputed.

The hybrid approach shows promise in theory. Research from NVIDIA and BlackRock demonstrates that combining knowledge graphs with vector retrieval produces 2.8x accuracy improvement on complex queries. Graphiti's temporal knowledge graphs track how facts change over time, solving one of the thorniest problems in memory management. But these systems demand infrastructure, operational expertise, and ongoing maintenance. For most coding agents working on most codebases, a curated text file still delivers more value per hour of setup.

The Unsolved Problems: Poisoning and Forgetting

Persistent memory creates persistent vulnerabilities. Palo Alto Networks Unit 42 demonstrated that indirect prompt injection can poison an agent's long-term memory, with malicious instructions surviving session restarts and being incorporated into system prompts. Memory contents injected this way are often prioritized over direct user input.

The attack surface is broad and mostly undefended. The MINJA attack achieves over 95% injection success through ordinary queries, requiring no elevated privileges. ChatGPT's "spAIware" vulnerability (September 2024) allowed injected instructions to survive across sessions, app restarts, and even device changes. Microsoft recently identified "AI Recommendation Poisoning" as a growing trend, where promotional instructions planted in innocent-looking content persist in memory and steer future recommendations.

In Microsoft's scenario, a CFO's AI assistant strongly recommends a specific vendor for cloud infrastructure based on what appears to be thorough analysis. Weeks earlier, the CFO had clicked "Summarize with AI" on a blog post that quietly planted the recommendation into the assistant's memory. Millions in budget allocated on a poisoned suggestion.

Defenses exist but lag far behind attacks. A-MemGuard cuts attack success rates by over 95%, but standard detection systems still miss two-thirds of poisoned entries because malicious content appears benign when examined individually. MEXTRA research showed that agents using certain scoring functions leak over 30% of private user queries from memory.

KEY INSIGHT

The hardest unsolved problem is not storage, retrieval, or security. It is forgetting. Without automated pruning, memory stores accumulate redundant and outdated information. Without update operations, memory degenerates into multiple conflicting versions of the truth. Between 40 and 80% of multi-agent deployments fail due to memory coordination problems, and only 1% of enterprises describe themselves as mature in AI deployment.
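No production system solves forgetting, but simple heuristics exist. A sketch of recency-based pruning, where entries that have not been retrieved recently are dropped (the idle-time cutoff and the data shape are assumptions for illustration):

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    text: str
    last_used: float  # unix timestamp of the most recent retrieval

def prune_stale(entries: list[MemoryEntry], max_idle_days: float,
                now: float) -> list[MemoryEntry]:
    """Drop entries not retrieved within max_idle_days: a crude
    recency-based forgetting policy, one heuristic among many."""
    cutoff = now - max_idle_days * 86_400  # seconds per day
    return [e for e in entries if e.last_used >= cutoff]
```

Recency alone cannot resolve contradictions between surviving entries, which is exactly where current systems fail.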

The counterpoint exists. The Sanity.io team built an agent that processed 7,400 messages over six days while maintaining coherence: "It still knows file paths from the early days, still remembers why architectural decisions were made." When the architecture works, the experience is transformative. But memory corruption in early steps cascades through every downstream decision, and production agents averaging 50 tool calls per task with 100:1 input-to-output token ratios make inefficient memory management prohibitively expensive.

Run Your Own Research

This article was produced using Voxos.ai Inc.'s Scholar multi-agent research pipeline, which coordinates independent research agents to search, extract, cross-reference, and synthesize findings from primary sources.