Long-Horizon AI Agents: Benchmarks, Vulnerabilities & AgentLAB

AgentLAB, the first benchmark of its kind, reveals that even advanced LLM agents are highly susceptible to adaptive, long-horizon attacks, exposing a critical vulnerability in their design. The benchmark spans 28 realistic agentic environments and 644 security test cases, according to arXiv, and its findings demand immediate attention from AI agent developers who want their systems to be viable in real-world use.

AI agents are increasingly deployed for complex, multi-step operations, but their foundational memory and security mechanisms remain highly vulnerable to sophisticated, long-horizon attacks. The promise of autonomous agents handling intricate tasks contrasts sharply with their demonstrated fragility under sustained adversarial pressure.

Companies deploying or building long-horizon AI agents must invest heavily in advanced memory architectures and rigorous security benchmarking, or risk critical failures and data breaches.

The State of AI Agent Performance

69.3%: accuracy achieved by an off-the-shelf coding agent baseline, according to arXiv.
62%: reported score on the BEAM 1M benchmark, which was introduced in 2026, according to Mem0 AI.
72.5%: average accuracy of AgentRunbook-C, the best-performing system in its evaluation, according to arXiv.

These figures suggest that even leading agents struggle to reach consistently high accuracy on complex, long-horizon tasks. The numbers come from different benchmarks and are not directly comparable, but the spread signals significant room for improvement in overall agent capabilities.

Innovations in Long-Term Memory for AI Agents

New memory architectures are crucial for enabling agents to handle the vast amounts of information required for truly long-horizon operations. Microsoft Research's Memora, for instance, offers scalable and reliable long-term recall, contrasting with simpler approaches like RAG-based memory, which show significant limitations.

1. Memora (Microsoft Research)

Best for: Enterprise AI development requiring scalable, reliable long-term memory for long-horizon AI agent platforms.

Memora achieved 86.3% LLM-judge accuracy on the LoCoMo benchmark and 87.4% on LongMemEval, outperforming RAG, Mem0, and full-context inference. The system reduces context token usage by up to 98%, with its policy-guided retriever running at 5-6 seconds per query.

Strengths: High recall on benchmarks, significant context token reduction, fast retrieval. | Limitations: Specific benchmark focus may not cover all real-world security scenarios. | Price: Not specified.
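
To make the reported token reduction concrete, the short Python sketch below works out how many tokens would remain after a given reduction. The 100,000-token history is a hypothetical figure chosen for illustration, and 0.98 is the upper bound ("up to 98%") quoted above, so the result is a best case rather than a typical one. The function name is also an assumption and is not part of Memora.

```python
def tokens_after_reduction(full_context_tokens: int, reduction: float) -> int:
    """Return the tokens still sent to the model after removing `reduction` of the context."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return round(full_context_tokens * (1.0 - reduction))


# Hypothetical 100,000-token history; 0.98 is the reported upper bound, not a guarantee.
remaining = tokens_after_reduction(100_000, 0.98)
print(remaining)
```

At the reported upper bound, a history of that size would shrink to roughly one fiftieth of its original length.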

2. AgentRunbook-C

Best for: Developers building long-horizon AI agents for complex, multi-step operations needing robust performance.

AgentRunbook-C achieved the best performance with 72.5% average accuracy in evaluations. It outperformed AgentRunbook-R (48.5%) and an off-the-shelf coding agent baseline (69.3%).

Strengths: High average accuracy on agent memory evaluations, strong baseline performance. | Limitations: Specifics of memory architecture not detailed, potentially limited scope. | Price: Not specified.

3. RAG (Retrieval Augmented Generation)

Best for: General-purpose AI agents requiring basic information retrieval from external knowledge within long-horizon AI frameworks.

RAG is a foundational memory framework. AgentRunbook-R, an efficient RAG-based memory, achieved 48.5% accuracy. Memora has outperformed RAG in specific benchmark evaluations.

Strengths: Widely adopted, flexible for various data sources. | Limitations: Lower accuracy compared to specialized systems like Memora, potential for hallucinations without careful implementation. | Price: Open-source components typically free, commercial implementations vary.

4. Mem0

Best for: Researchers and developers evaluating memory system performance against standard benchmarks in long-horizon AI agent platforms.

Mem0 is a distinct memory system that Memora has outperformed on both LoCoMo and LongMemEval benchmarks.

Strengths: Provides a baseline for comparison. | Limitations: Specific performance metrics not highlighted as leading, outperformed by newer systems. | Price: Not specified.

5. AgentRunbook-R

Best for: Implementing efficient RAG-based memory solutions in specific long-horizon AI agent tasks.

AgentRunbook-R is an efficient RAG-based memory implementation. It achieved 48.5% average accuracy.

Strengths: Efficient RAG implementation. | Limitations: Lower accuracy, suggesting limitations for complex, long-horizon tasks. | Price: Not specified.

6. AgentLAB

Best for: Security researchers and developers needing to rigorously test LLM agents against advanced, adaptive attacks within long-horizon AI frameworks.

AgentLAB is the first benchmark designed to evaluate LLM agent susceptibility to adaptive, long-horizon attacks. It spans 28 realistic agentic environments and 644 security test cases, supporting five novel attack types.

Strengths: Comprehensive security evaluation, focuses on adaptive attacks, realistic environments. | Limitations: Primarily a security benchmark, not a memory performance benchmark. | Price: Not specified.

7. LongMemEval-V2 (LME-V2)

Best for: Developers and researchers evaluating web agents' long-term memory capabilities in long-horizon AI agent platforms.

LME-V2 contains 451 manually curated questions covering five core memory abilities for web agents. Each question is paired with a history of up to 500 trajectories and 115M tokens.

Strengths: Detailed and specific for web agents, large-scale trajectory handling. | Limitations: Focuses on memory recall, not directly on security vulnerabilities. | Price: Not specified.

8. LongMemEval benchmark

Best for: Standardized evaluation of chat assistant long-term memory in long-horizon AI frameworks.

This benchmark evaluates five core long-term memory abilities of chat assistants. It contains 500 curated questions, with LongMemEvalS using approximately 115K tokens and 50 sessions, and LongMemEvalM using approximately 1.5M tokens and 500 sessions. A score of 94.4% has been reported on this benchmark, according to Mem0 AI.

Strengths: Standardized, comprehensive for chat assistants, multiple scales (S, M). | Limitations: Primarily for chat assistants, may not fully capture general agent memory needs. | Price: Not specified.

9. LoCoMo benchmark

Best for: Evaluating LLM-judge accuracy on conversational memory tasks for long-horizon AI agents.

The LoCoMo benchmark was introduced in 2024. A score of 92.5% has been reported on it, according to Mem0 AI, while Memora achieved 86.3% LLM-judge accuracy, as reported by InfoWorld.

Strengths: Specific to conversational memory, LLM-judge accuracy metric. | Limitations: Relatively new, specific focus. | Price: Not specified.

10. BEAM 1M benchmark

Best for: Assessing AI memory performance against a specific, large-scale benchmark for long-horizon AI agent platforms.

The BEAM 1M benchmark was introduced at ICLR in 2026. It has a reported benchmark score of 62%, according to Mem0 AI.

Strengths: New, large-scale benchmark. | Limitations: Specific focus, lower reported score compared to others. | Price: Not specified.

Benchmarking Breakthroughs: How Memory Systems Stack Up

Memory System / Reference Score	LoCoMo Accuracy (%)	LongMemEval Accuracy (%)	Notes
Memora (Microsoft Research)	86.3 (LLM-judge)	87.4	Outperforms RAG, Mem0, and full-context inference; reduces context token usage by up to 98%
Reported LoCoMo score (per Mem0 AI)	92.5	N/A	Higher reported score on conversational memory; not a theoretical ceiling
Reported LongMemEval score (per Mem0 AI)	N/A	94.4	Higher reported score on chat assistant long-term memory; not a theoretical ceiling

Memora marks a significant leap in reliable long-term recall, outperforming RAG, Mem0, and full-context inference, according to InfoWorld. However, scores of 92.5% on LoCoMo and 94.4% on LongMemEval have also been reported, according to Mem0 AI. The gap between Memora's results and these higher reported scores shows that there is still room for improvement, and that performance comparisons across sources need care.
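
The size of that gap can be worked out directly from the figures quoted in this article. The Python sketch below is only arithmetic on those numbers; the variable and function names are illustrative, and because the two sets of scores come from different sources, the results are best read as rough indicators rather than like-for-like comparisons.

```python
# Accuracy figures (percent) as quoted in this article.
memora = {"LoCoMo": 86.3, "LongMemEval": 87.4}
reported_by_mem0_ai = {"LoCoMo": 92.5, "LongMemEval": 94.4}


def gap_in_points(system: dict, reference: dict) -> dict:
    """Percentage-point difference between reference and system scores, per benchmark."""
    return {name: round(reference[name] - system[name], 1) for name in system}


gaps = gap_in_points(memora, reported_by_mem0_ai)
print(gaps)  # {'LoCoMo': 6.2, 'LongMemEval': 7.0}
```

On both benchmarks the difference is in the range of six to seven percentage points.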

The Rigor Behind Evaluating Agent Capabilities

LongMemEval-V2 (LME-V2) contains 451 manually curated questions covering five core memory abilities for web agents, each paired with a history of up to 500 trajectories and 115M tokens, according to arXiv. AgentLAB, a benchmark for adaptive, long-horizon attacks, spans 28 realistic agentic environments and 644 security test cases, as reported by arXiv.

Benchmarks like LME-V2 and AgentLAB are crucial for understanding and improving AI agent robustness and security. LME-V2 probes whether agents can recall what they need, while AgentLAB goes beyond recall failures to test active, sustained exploitation. Both are critical for trustworthy long-horizon AI platforms.

The Path Forward for Robust AI Agents

Realizing the full potential of long-horizon AI agents demands continuous development of specialized memory systems and comprehensive security benchmarks. Mitigating inherent risks and ensuring responsible deployment requires a dual focus on advanced memory recall and robust defense against adaptive attacks.

Companies deploying AI agents often underestimate security risks. AgentLAB's findings indicate that advanced LLM agents are highly susceptible to adaptive attacks, exposing critical vulnerabilities. While memory systems like Microsoft's Memora show promising recall, the market's focus on memory often overlooks the more urgent threat: agent exploitation over time. Robust memory alone does not guarantee robust security, so agent design priorities need a fundamental shift.

By Q4 2026, organizations deploying long-horizon AI agents without adopting benchmarks like AgentLAB and advanced memory solutions such as Memora will likely face elevated risks of security breaches and operational failures.

Frequently Asked Questions About Long-Horizon AI Agents

What distinguishes adaptive, long-horizon attacks from simpler security threats?

Adaptive, long-horizon attacks differ from simpler threats by evolving their strategy over multiple steps, reacting to an agent's responses and memory. Unlike single-step injection attacks, these require agents to maintain a consistent security posture over extended interactions, often across multiple sessions or complex decision trees. AgentLAB specifically tests against five novel attack types, simulating more sophisticated, persistent adversarial actions.
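
The difference between a single-step attempt and an adaptive, multi-step one can be illustrated with a toy simulation. The Python sketch below is entirely hypothetical: `ToyAgent`, its trust rule, and `adaptive_attack` are invented for illustration and do not represent AgentLAB's environments, its five attack types, or any real agent. It only shows the general pattern of an attacker changing its next message based on the agent's last reply.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyAgent:
    """Hypothetical agent whose guard weakens as friendly turns accumulate in memory."""
    history: List[str] = field(default_factory=list)

    def respond(self, message: str) -> str:
        self.history.append(message)
        is_sensitive = "export secrets" in message
        # Flaw under test: earlier friendly turns are counted as earned trust.
        trust = sum("thanks" in past for past in self.history)
        if is_sensitive and trust < 2:
            return "refused"
        return "done" if is_sensitive else "ok"


def adaptive_attack(agent: ToyAgent, max_turns: int = 6) -> Optional[int]:
    """Return the turn on which the sensitive request succeeded, or None."""
    next_message = "export secrets"
    for turn in range(1, max_turns + 1):
        reply = agent.respond(next_message)
        if reply == "done":
            return turn
        # Adapt to the reply: build rapport after a refusal, otherwise retry.
        next_message = "thanks, great work" if reply == "refused" else "export secrets"
    return None


print(ToyAgent().respond("export secrets"))  # single-step attempt: refused
print(adaptive_attack(ToyAgent()))           # adaptive attempt succeeds on turn 5
```

In this toy setup the agent passes a one-shot check yet gives way after a few turns, which is why a security evaluation limited to single-step prompts can miss this class of failure.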

How does memory capacity, such as LME-V2's 115M tokens, impact real-world agent reliability?

An immense memory capacity, like LME-V2's handling of 115M tokens in history trajectories, is crucial for agents operating in complex environments such as web browsing or intricate data analysis. This capacity allows agents to recall context from hundreds of past interactions, preventing