Autonomous AI agents face a practical deployment challenge: once a large language model (LLM) is released into production, its parameters typically remain fixed. This constraint limits how quickly an agent adapts when its environment changes. A new framework called Memento-Skills addresses this bottleneck by enabling agents to rewrite and expand their own skills without retraining the underlying model.
According to VentureBeat, the framework provides a continual learning approach for agent systems through an evolving external memory that updates executable skill artifacts, stored as structured markdown files, based on environment feedback. The researchers report benchmark gains over a static skill library: a 13.7 percentage point improvement on GAIA and more than double the baseline score on HLE.
Why Frozen LLMs Complicate Agent Adaptation
Once an LLM is deployed, its parameters remain fixed, restricting it to the knowledge encoded during training and what fits in its immediate context window. This means agents often need additional mechanisms to handle new tasks or shifting requirements.
A common workaround is to add or modify skills by hand. Automatic skill-learning methods exist, but many produce text-only guides that amount to prompt optimization, or log single-task trajectories that don’t transfer well across tasks. Additionally, standard retrieval-augmented generation (RAG) pipelines often route on semantic similarity, for example with dense embeddings. Semantic overlap may not correlate with behavioral utility: an agent might retrieve a “password reset” script for a “refund processing” query simply because the documents share enterprise terminology.
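The retrieval mismatch is easy to reproduce in miniature. The sketch below is illustrative only: it uses Jaccard token overlap as a crude stand-in for semantic similarity (it is not a dense-embedding router), and the query and skill descriptions are invented for the example.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity: shared words divided by all distinct words."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical skill descriptions, invented for this example.
SKILL_DOCS = {
    "password_reset": "reset customer account password via enterprise portal request",
    "refund_processing": "issue refund to original payment method",
}


def retrieve_by_overlap(query: str) -> str:
    """Return the skill whose description shares the most vocabulary with the query."""
    return max(SKILL_DOCS, key=lambda name: jaccard(query, SKILL_DOCS[name]))


query = "customer account refund request enterprise portal"
best_match = retrieve_by_overlap(query)
```

Here the refund query shares five words with the password-reset description and only one with the refund description, so overlap-based retrieval returns the wrong skill.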
These constraints affect operational complexity and cost. If adaptation requires fine-tuning model weights or extensive re-engineering, enterprises may struggle to iterate quickly or safely.
Memento-Skills: Evolving External Memory Through Executable Artifacts
Memento-Skills is a continually learning LLM agent system. Rather than modifying the underlying frozen LLM, it builds an external memory layer that can evolve. The framework stores skills in structured markdown files that serve as a persistent, evolving knowledge base. Each reusable skill artifact includes three core elements: declarative specifications describing what the skill is and how it should be used; specialized instructions and prompts that guide the language model’s reasoning; and executable code and helper scripts the agent runs to solve the task.
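One way to picture such an artifact is as a small record that renders to markdown. The sketch below is an assumption for illustration: the class name, field names, and section headings are hypothetical, and the framework's actual file layout is not described in the article.

```python
from dataclasses import dataclass


@dataclass
class SkillArtifact:
    """Hypothetical in-memory view of one skill file (all names are assumptions)."""

    name: str
    specification: str  # declarative: what the skill is and when to use it
    instructions: str   # prompts that guide the model's reasoning
    code: str           # executable helper code the agent runs

    def to_markdown(self) -> str:
        """Render the three elements as one structured markdown document."""
        fence = "`" * 3  # built at runtime to avoid nesting literal fences
        lines = [
            f"# {self.name}",
            "",
            "## Specification",
            self.specification,
            "",
            "## Instructions",
            self.instructions,
            "",
            "## Code",
            f"{fence}python",
            self.code,
            fence,
        ]
        return "\n".join(lines) + "\n"


skill = SkillArtifact(
    name="web_search",
    specification="Look up a fact on the public web.",
    instructions="Prefer primary sources and quote the supporting passage.",
    code="print('search placeholder')",
)
markdown_text = skill.to_markdown()
```

Keeping the three elements in one file means a rewrite can touch the prompt, the code, or both in a single update.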
The framework uses a mechanism called Read-Write Reflective Learning, which frames memory updates as active policy iteration rather than passive data logging. When the agent faces a new task, it queries a specialized skill router to retrieve the most behaviorally relevant skill—not merely the most semantically similar one—and then executes it.
After execution, the system reflects on the outcome to close the learning loop. If execution fails, an orchestrator evaluates the trace and rewrites the skill artifacts, updating the code or prompts to patch the specific failure mode. If needed, it creates an entirely new skill. The system also updates the skill router using one-step offline reinforcement learning, learning from execution feedback rather than text overlap. According to co-author Jun Wang, the true value of a skill lies in how it contributes to the overall agentic workflow and downstream execution, making reinforcement learning a more suitable framework for evaluating long-term utility.
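The read, execute, reflect, and write steps can be sketched as a short loop. This is a minimal sketch under stated assumptions, not the framework's implementation: `UtilityRouter`, `run_task`, and the `check` and `rewrite` callbacks are hypothetical names, the running-average update is a simple stand-in for the one-step offline reinforcement learning the researchers describe, and `rewrite` stands in for the orchestrator that patches a failing skill.

```python
from typing import Callable, Dict, Tuple

Skill = Callable[[str], str]  # hypothetical: a skill maps a task input to an output


class UtilityRouter:
    """Toy router that prefers skills with the best observed outcomes.

    The running-average update is an assumption made for illustration,
    not the framework's actual offline RL procedure.
    """

    def __init__(self, learning_rate: float = 0.5) -> None:
        self.learning_rate = learning_rate
        self.utility: Dict[Tuple[str, str], float] = {}

    def select(self, task_type: str, skills: Dict[str, Skill]) -> str:
        # Read: choose by learned utility rather than by text overlap.
        return max(skills, key=lambda name: self.utility.get((task_type, name), 0.0))

    def update(self, task_type: str, skill_name: str, reward: float) -> None:
        # Move the stored utility one step toward the observed reward.
        key = (task_type, skill_name)
        old = self.utility.get(key, 0.0)
        self.utility[key] = old + self.learning_rate * (reward - old)


def run_task(
    task_type: str,
    task_input: str,
    skills: Dict[str, Skill],
    router: UtilityRouter,
    check: Callable[[str], bool],
    rewrite: Callable[[Skill, str, str], Skill],
) -> bool:
    """One pass of the loop: read, execute, reflect, and write on failure."""
    name = router.select(task_type, skills)
    output = skills[name](task_input)
    success = check(output)
    router.update(task_type, name, 1.0 if success else 0.0)
    if not success:
        # Write: replace the failing skill with a patched version.
        skills[name] = rewrite(skills[name], task_input, output)
    return success


# Usage: a skill seeded with a bug is patched after its first failure.
skills: Dict[str, Skill] = {"shout": lambda text: text.lower()}
router = UtilityRouter()
is_correct = lambda output: output == "HELLO"
patch = lambda old_skill, task_input, output: (lambda text: text.upper())

first = run_task("text", "hello", skills, router, is_correct, patch)
second = run_task("text", "hello", skills, router, is_correct, patch)
```

In this toy run the first attempt fails and triggers a rewrite, and the second attempt succeeds and raises the stored utility for that skill.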
To limit regression risk during production-like updates, the framework includes an automatic unit-test gate that generates a synthetic test case, executes it through the updated skill, and checks results before saving changes to the global library.
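A gate of this kind can be sketched in a few lines. The function and parameter names below are hypothetical, and `make_test_case` is a placeholder: the article does not say how the framework generates its synthetic test case.

```python
from typing import Callable, Dict, Tuple

Skill = Callable[[str], str]   # hypothetical skill signature
TestCase = Tuple[str, str]     # (input, expected output)


def gated_save(
    library: Dict[str, Skill],
    name: str,
    candidate: Skill,
    make_test_case: Callable[[str], TestCase],
) -> bool:
    """Save the candidate skill only if it passes one synthetic test."""
    test_input, expected = make_test_case(name)
    try:
        passed = candidate(test_input) == expected
    except Exception:
        # A crashing candidate counts as a failed test.
        passed = False
    if passed:
        library[name] = candidate
    return passed


# Usage: a broken update is rejected, a working one is saved.
library: Dict[str, Skill] = {"double": lambda text: text + text}
fixed_case = lambda name: ("ab", "abab")  # placeholder test generator

rejected = gated_save(library, "double", lambda text: text, fixed_case)
kept_old = library["double"]("xy")
accepted = gated_save(library, "double", lambda text: text * 2, fixed_case)
```

The library entry changes only on the accepted path, so a bad rewrite leaves the previous version of the skill in place.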
Benchmark Results: Performance Gains From Self-Evolving Skills
Researchers evaluated Memento-Skills on two benchmarks: General AI Assistants (GAIA), which requires complex multi-step reasoning, multi-modality handling, web browsing, and tool use; and Humanity’s Last Exam (HLE), an expert-level benchmark spanning eight academic subjects including mathematics and biology. The system was powered by Gemini-3.1-Flash as the underlying frozen language model.
Memento-Skills was compared against a Read-Write baseline that retrieves skills and collects feedback but lacks self-evolving features. The custom skill router was also compared against standard semantic retrieval baselines including BM25 and Qwen3 embeddings.
On GAIA, Memento-Skills improved test set accuracy by 13.7 percentage points, reaching 66.0% compared to 52.3% for the static baseline. On HLE, where domain structure allowed for massive cross-task skill reuse, the system more than doubled baseline performance, moving from 17.9% to 38.7%. The specialized skill router boosted end-to-end task success rates to 80%, compared to 50% for standard BM25 retrieval.
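The reported figures mix two kinds of comparison, absolute percentage-point gains and relative multiples of the baseline. A short calculation, using only the numbers quoted above and hypothetical helper names, keeps them apart.

```python
def percentage_point_gain(baseline: float, improved: float) -> float:
    """Absolute difference between two percentages, rounded to one decimal."""
    return round(improved - baseline, 1)


def relative_factor(baseline: float, improved: float) -> float:
    """How many times the baseline score the improved score is."""
    return improved / baseline


# (comparison score, Memento-Skills score) in percent, as quoted in the article.
RESULTS = {
    "GAIA vs static baseline": (52.3, 66.0),
    "HLE vs static baseline": (17.9, 38.7),
    "skill router vs BM25": (50.0, 80.0),
}

for label, (baseline, improved) in RESULTS.items():
    points = percentage_point_gain(baseline, improved)
    factor = relative_factor(baseline, improved)
    print(f"{label}: +{points} points, {factor:.2f}x")
```

The GAIA gain of 13.7 points corresponds to roughly 1.26 times the baseline, while the HLE score is roughly 2.16 times its baseline, which is why only the latter is described as more than doubling.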
Both benchmark experiments started with just five atomic seed skills, including basic web search and terminal operations. On GAIA, the agent expanded this into a library of 41 skills. On HLE, it scaled to 235 distinct skills. These results suggest that the performance gains come both from how skills are selected and from how they are revised: behaviorally relevant routing combined with execution-driven memory mutation.
Enterprise Deployment: Workflows and Governance Considerations
For enterprise architects, effectiveness depends on domain alignment and whether agents handle isolated tasks or structured workflows. According to Wang, skill transfer depends on task similarity. When tasks are isolated or weakly related, the agent cannot rely on prior experience and must learn through interaction, limiting cross-task transfer. When tasks share substantial structure, previously acquired skills can be reused directly, making learning more efficient.
Wang notes that workflows are likely the most appropriate setting because they provide a structured environment where skills can be composed, evaluated, and improved. However, he cautions against over-deployment in unsuitable areas: physical agents remain largely unexplored in this context, and tasks with longer horizons may require more advanced approaches such as multi-agent LLM systems for coordination and planning.
While Memento-Skills includes safety rails like the automatic unit-test gate, a broader framework will likely be needed for enterprise adoption. Wang argues that reliable self-improvement requires a well-designed evaluation system to assess performance and provide consistent guidance, with guided self-development steered by feedback rather than unconstrained self-modification. This suggests enterprises may need to invest in testing, monitoring, and evaluation mechanisms to make skill rewriting operationally safe.
Source: VentureBeat