Why We Need Continual Learning
https://www.a16z.news/p/why-we-need-continual-learning

In Christopher Nolan's Memento, Leonard Shelby lives inside a fractured present. After a traumatic brain injury, he suffers from anterograde amnesia, an affliction that prevents him from forming new memories. Every few minutes, his world resets, leaving him stranded in a perpetual now, untethered from what just happened and uncertain of what comes next. To cope, he survives by tattooing notes on his body and snapping Polaroids - external props that remind him of what his brain cannot retain.

Large language models live in a similar perpetual present. They emerge from training with vast knowledge frozen into their parameters, but they cannot form new memories - cannot update their parameters in response to new experience. To compensate, we surround them with scaffolding: chat history as short-term sticky notes, retrieval systems as external notebooks, system prompts as guiding tattoos. The model itself never fully internalizes the new information.

There's a growing belief among some researchers that this is not enough. In-context learning (ICL) is sufficient for problems where the answer, or pieces of the answer, already exist somewhere in the world. But for problems that require genuine discovery (like novel mathematics), for adversarial scenarios (like security), or for knowledge too tacit to express in language, there's a strong argument that models need a way to update their knowledge and experience directly in their parameters after deployment.

ICL is transient. Real learning requires compression.
Until we let models compress continuously, we may be stuck in Memento's perpetual present. Conversely, if we can train models to learn their own memory architectures - rather than offloading to bespoke harnesses - we may unlock a new dimension of scaling.

The name for this field of research is continual learning. And while the idea is not new (see: McCloskey and Cohen, 1989!), we think it's some of the most important work happening in AI right now. With the astounding growth in model capabilities over the past 2-3 years, the gap between what models know and what they could know has become increasingly obvious. So our goal with this post is to share what we've learned from top researchers working in this field; help disambiguate different approaches to continual learning; and advance this topic in the startup ecosystem.

Note: This article was shaped by conversations with an extraordinary group of researchers, PhD students, and startup founders who have shared their work and perspectives on continual learning openly with us. Their insights, from the theoretical foundations to the engineering realities of post-deployment learning, made this piece sharper and more grounded than anything we could have written on our own.
Thank you for your generosity with your time and ideas!

First, Let's Talk About Context

Before making the case for parametric learning - i.e., learning that updates the model's weights - it's important to acknowledge that in-context learning absolutely does work. And there is a compelling argument that it will keep winning.

Transformers are, at their core, conditional next-token predictors over a sequence. Give them the right sequence, and you get surprisingly rich behavior without touching the weights. That is why context management, prompt engineering, instruction tuning, and few-shot examples have been so powerful. The intelligence lives in the static parameters, and the apparent capabilities change radically depending on what you feed into the window.

Cursor's recent deep-dive on scaling autonomous coding agents gives a nice example of this point: "A surprising amount of the system's behavior comes down to how we prompt the agents. The harness and models matter, but the prompts matter more." The model weights were fixed. What made the system work was careful orchestration of context: what to include, when to summarize, how to maintain coherent state across hours of autonomous operation.

OpenClaw is another great example.
It broke out not because of special model access (the underlying models were available to everyone) but because of how effectively it turns context and tools into working state: tracking what you're doing, structuring intermediate artifacts, deciding what to re-inject into the prompt, maintaining persistent memory of prior work. OpenClaw elevates agent harness design to a discipline in its own right.

When prompting first emerged, many researchers were skeptical that "just prompting" could be a serious interface. It looked like a hack. Yet it was native to the transformer architecture, required no retraining, and scaled automatically with model improvements. So as models got better, prompting got better. "Janky but native" interfaces often win because they couple directly to the underlying system rather than fighting it. And so far, that's exactly what's happening with LLMs.

State Space Models: Context On Steroids

As the dominant workflow moves from raw LLM calls to agentic loops, pressure is building on the in-context learning model. It used to be relatively rare to fill up context completely. This usually happened when LLMs were asked to do a long sequence of discrete work, and the app layer could prune and/or compress chat history in a straightforward way. With agents, though, one task can consume a significant portion of the total available context. Each step in the agent's loop relies on context passed from prior iterations.
And they often fail after 20–100 steps because they lose the thread: their context fills up, coherence degrades, and they stop converging.

As a result, the major AI labs are now contributing significant resources (i.e., large training runs) to develop models with very large context windows. This is a natural approach to take because it builds on what's working (in-context learning) and maps cleanly to the broader industry shift toward inference-time compute. The most common architecture is to intersperse fixed memory layers with normal attention heads, i.e., state space models and linear attention variants (we will refer to all of these as SSMs for simplicity). SSMs offer a fundamentally better scaling profile than traditional attention for long contexts.

The goal is to help agents maintain coherence over loops that are longer by several orders of magnitude, from say ~20 steps to ~20,000, without losing the breadth of skills and knowledge afforded by traditional transformers. If it works, this will be a major win for long-running agents. And you could even consider this approach a form of continual learning: while you're not updating the model weights, you've introduced an external memory layer that rarely needs to be reset.

So, these non-parametric approaches are real and powerful. Any assessment of continual learning has to start here. The question is not whether today's context-based systems work - they do.
The question is whether we are looking at the ceiling, and if new approaches can take us further.

What Context Misses: The Filing Cabinet Fallacy

"The thing that happened with AGI and pre-training is that in some sense they overshot the target… A human being is not an AGI. Yes, there is definitely a foundation of skills, but a human being lacks a huge amount of knowledge. Instead, we rely on continual learning. If I produce a super intelligent 15-year-old, they don't know very much at all. A great student, very eager. You can say, 'Go and be a programmer. Go and be a doctor.' The deployment itself will involve some kind of a learning, trial-and-error period. It's a process, not dropping the finished thing."
— Ilya Sutskever

Imagine a system with infinite storage. The world's biggest filing cabinet, every fact perfectly indexed, instantly retrievable. It can look up anything. Has it learned?

No. It has never been forced to do the compression.

This is the centerpiece of our argument, and it draws on a point that Ilya Sutskever has made before: LLMs are, at their core, compression algorithms. During training, they compress the internet into parameters. The compression is lossy, and that is precisely what makes it powerful. Compression forces the model to find structure, to generalize, to build representations that transfer across contexts. A model that memorizes every training example is worse than one that extracts the underlying patterns.
The lossy compression is the learning.

The irony is that the very mechanism that makes LLMs powerful during training (compressing raw data into compact, transferable representations) is exactly what we refuse to let them do after deployment. We stop the compression at the moment of release and replace it with external memory. Most agent harnesses, of course, compress context in some bespoke way. But wouldn't the bitter lesson suggest that the models themselves should learn to do this compression, directly and at scale?

One example Yu Sun shares to illustrate the debate is math. Consider Fermat's Last Theorem. For over 350 years, no mathematician could prove it - not because they lacked access to the right literature, but because the solution was highly novel. The conceptual distance between established mathematics and the eventual answer was simply too vast. When Andrew Wiles finally cracked it in the 1990s, after seven years of working in near-total isolation, he had to invent powerful new techniques to reach the solution. His proof relied on successfully bridging two distinct branches of mathematics: elliptic curves and modular forms. While earlier work by Ken Ribet had shown that proving this connection would automatically resolve Fermat's Last Theorem, no one possessed the theoretical machinery to actually construct that bridge until Wiles.
A similar argument can be made about Grigori Perelman's proof of the Poincaré conjecture.

The central question is: do these examples prove that something is missing from LLMs, some ability to update their priors and think in truly creative ways? Or does the story prove the opposite - that all human knowledge is just data available for training and recombination, and that Wiles and Perelman simply show what LLMs could do at even greater scale?

This question is empirical, and the answer is not known yet. But we do know there are many classes of problems where in-context learning fails today and where parametric learning could have an impact.

What's more, in-context learning is limited to what can be expressed in language, whereas weights can encode concepts that a prompt cannot relay in text. Some patterns are too high-dimensional, too tacit, too deeply structural to fit in a context. For example, the visual texture that distinguishes a benign artifact from a tumor in a medical scan, or the micro-fluctuations in audio that define a speaker's unique cadence, are patterns that do not easily decompose into exact words. Language can only approximate them. No prompt, no matter how long, can transfer them; that kind of knowledge can only live in the weights - in the latent space of learned representations, not words.
No matter how long the context window grows, there will be knowledge that cannot be described in text and can only be held in the parameters.

This may help explain why explicit "the bot remembers you" features, such as ChatGPT's memory, often trigger user discomfort rather than delight. Users don't actually want recall per se. They want competence. A model that has internalized your patterns can generalize to novel situations; a model that merely recalls your history cannot. The difference between "Here is what you responded to this email before" (verbatim) and "I understand how you think well enough to anticipate what you need" is the difference between retrieval and learning.

A Primer on Continual Learning

There are various approaches to continual learning. The dividing line is not "memory features" vs. "no memory features." It is: where does compaction happen? The approaches cluster along a spectrum from no compaction (pure retrieval, weights frozen), to full internal compaction (weight-level learning, the model gets smarter), with one important middle ground (modules).

Context

On the context end, teams build smarter retrieval pipelines, agent harnesses, and prompt orchestration. This is the most mature category: the infrastructure is proven and the deployment story is clean. The limitation is depth: the context length.

One emerging extension worth noting here: multi-agent architectures as a scaling strategy for context itself.
If a single model is bounded by a 128K-token window, a coordinated swarm of agents, each holding its own context, specializing on a slice of the problem, and communicating results, can collectively approximate unbounded working memory. Each agent performs in-context learning within its window; the system aggregates. Karpathy's recent autoresearch project and Cursor's example of building a web browser are early examples. It is a purely non-parametric approach (no weights change), but it dramatically extends the ceiling of what context-based systems can do.

Modules

In the modules space, teams build attachable knowledge modules (compressed KV caches, adapter layers, external memory stores) that specialize a general-purpose model without retraining it. An 8B model with the right module can match 109B performance on targeted tasks using a fraction of the memory. The appeal is that it works with existing transformer infrastructure.

Weights

On the weights end, researchers are pursuing genuine parametric learning, such as sparse memory layers that update only the relevant fraction of parameters, reinforcement learning loops that refine models from feedback, and test-time training that compresses context into weights during inference. These are the deepest approaches and the hardest to deploy, but the ones that actually allow models to fully internalize new information or skills.

There are multiple parametric mechanisms for doing the update.
To name a few research directions: the weight-level research landscape spans several parallel lines of work.

Regularization and weight-space methods are the oldest: EWC (Kirkpatrick et al., 2017) penalizes changes to parameters in proportion to their importance for previous tasks, and weight interpolation (Kozal et al., 2024) blends old and new weight configurations in parameter space, though both tend to be brittle at scale.

Test-time training, pioneered by Sun et al. (2020) and since evolved into architectural primitives (TTT layers, TTT-E2E, TTT-Discover), takes a different approach: run gradient descent on test-time data, compressing new information into the parameters at the moment it matters.

Meta-learning asks whether we can train models that learn how to learn, from MAML's few-shot-friendly parameter initialization (Finn et al., 2017) to Behrouz et al.'s Nested Learning (2025), which structures the model as a hierarchy of optimization problems operating at different timescales, with fast-adapting and slow-updating modules inspired by biological memory consolidation.

Distillation preserves prior-task knowledge by matching a student to a frozen teacher checkpoint. LoRD (Liu et al., 2025) makes this efficient enough to run continuously by pruning both the model and the replay buffer.
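As a toy illustration of the distillation idea (our own minimal sketch, not any of the cited systems' implementations), the student is trained on a blend of new-task cross-entropy and a KL term that pins its predictions to a frozen teacher's soft targets:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    """Blend new-task cross-entropy with a KL penalty toward a frozen teacher."""
    p_t = softmax(teacher_logits / T)  # soft targets from the frozen teacher
    p_s = softmax(student_logits / T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    p = softmax(student_logits)
    ce = -np.log(p[np.arange(len(labels)), labels] + 1e-12)  # new-task loss
    return float(np.mean(alpha * kl * T * T + (1 - alpha) * ce))

# A student that matches its teacher pays no KL penalty;
# one that drifts away from the teacher pays more.
logits = np.array([[2.0, 0.5, -1.0]])
loss_same = distillation_loss(logits, logits, np.array([0]))
loss_drift = distillation_loss(np.array([[0.0, 2.0, 0.0]]), logits, np.array([0]))
```

The `alpha` and temperature `T` values here are illustrative; in practice they trade off how strongly old knowledge is preserved against how fast the new task is learned.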
Self-distillation (SDFT, Shenfeld et al., 2026) flips the source, using the model's own expert-conditioned outputs as the training signal, sidestepping the catastrophic forgetting of sequential fine-tuning.

Recursive self-improvement operates in a similar spirit: STaR (Zelikman et al., 2022) bootstraps reasoning from self-generated rationales, AlphaEvolve (DeepMind, 2025) discovered improvements to algorithms untouched for decades, and Silver and Sutton's "Era of Experience" (2025) frames agents as learning from a continuous, never-ending experience stream.

These research directions are converging. TTT-Discover already fuses test-time training with RL-driven exploration. HOPE nests fast and slow learning loops inside a single architecture. SDFT turns distillation into a self-improvement primitive. The boundaries between these categories are blurring: the next generation of continual learning systems will likely combine multiple strategies, using regularization to stabilize, meta-learning to accelerate, and self-improvement to compound. A growing cohort of startups is betting on different layers of this stack.

The Continual Learning Startup Landscape

The non-parametric end of the spectrum is the most familiar. Harness companies (Letta, mem0, Subconscious) build orchestration layers and scaffolding that manage what goes into the context window. External storage and RAG infrastructure (e.g. Pinecone, xmemory) provide the retrieval backbone.
The data exists; the challenge is getting the right slice of it in front of the model at the right time. As context windows expand, the design space for these companies grows with them, particularly on the harness side, where a new wave of startups is emerging to manage increasingly complex context strategies.

The parametric side is earlier and more varied. Companies here are attempting some version of post-deployment compression, letting models internalize new information in the weights. The approaches cluster into a few distinct bets about how models should learn after release.

Partial compaction: learning without retraining. Some teams are building attachable knowledge modules (compressed KV caches, adapter layers, external memory stores) that specialize a general-purpose model without touching its core weights. The shared thesis: you can get meaningful compaction (not just retrieval) while keeping the stability-plasticity tradeoff manageable, because the learning is isolated rather than distributed across the full parameter space. An 8B model with the right module can match far larger models' performance on targeted tasks. The upside is composability: modules work with existing transformer architectures out of the box, can be swapped or updated independently, and are far easier to experiment with than retraining.

RL and feedback loops: learning from signals. Other teams are betting that the richest signal for post-deployment learning already exists in the deployment loop itself: user corrections, task success and failure, reward signals from real-world outcomes. The core idea is that models should treat every interaction as a potential training signal, not just an inference request. This is a close analog to how humans improve at a job: you do the work, you get feedback, you internalize what worked.
The engineering challenge is converting sparse, noisy, sometimes adversarial feedback into stable weight updates without catastrophic forgetting. But a model that genuinely learns from deployment compounds in value over time in a way that context-only systems cannot.

Data-centric approaches: learning from the right signal. A related but distinct bet is that the bottleneck isn't the learning algorithm but the training data and surrounding systems. These teams focus on curating, generating, or synthesizing the right data to drive continual updates, the premise being that a model with access to a high-quality, well-structured learning signal needs far fewer gradient steps to meaningfully improve. This connects naturally to the feedback-loop companies but emphasizes the upstream question: not just whether the model can learn, but what and to what degree it should learn from.

Novel architectures: learning by design. The most radical bet is that the transformer architecture itself is the bottleneck, and that continual learning requires fundamentally different computational primitives: architectures with continuous-time dynamics and built-in memory mechanisms. The thesis here is structural: if you want a system that learns continuously, you should build the learning mechanism into the substrate.

All the major labs are also active across these categories. Some are exploring better context management and chain-of-thought reasoning. Others are experimenting with external memory modules or sleep-time compute pipelines. Several stealth startups are pursuing novel architectures. The field is early enough that no single approach has won, and given the range of use cases, none should.

Why Naive Weight Updates Fail

Updating model parameters in production introduces a cascade of failure modes that are, so far, unsolved at scale.

The engineering problems are well-documented.
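The best-known of them, catastrophic forgetting (naive fine-tuning on new data erasing old capabilities), is easy to reproduce even in a toy model. In this illustrative sketch (plain numpy standing in for full-scale fine-tuning), a linear regressor is fit on task A, then trained with gradient steps on task B only, and its fit to task A collapses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tasks A and B want different weights from the same two-parameter model.
w_a_true = np.array([1.0, -2.0])
w_b_true = np.array([-3.0, 0.5])
Xa, Xb = rng.normal(size=(64, 2)), rng.normal(size=(64, 2))
ya, yb = Xa @ w_a_true, Xb @ w_b_true

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def sgd(w, X, y, lr=0.1, steps=200):
    # Plain gradient descent on mean squared error.
    for _ in range(steps):
        w = w - lr * 2 * X.T @ (X @ w - y) / len(X)
    return w

w = sgd(np.zeros(2), Xa, ya)     # learn task A
loss_a_before = mse(w, Xa, ya)   # near zero after converging on A
w = sgd(w, Xb, yb)               # naively fine-tune on task B alone
loss_a_after = mse(w, Xa, ya)    # task A performance collapses
```

Nothing in the update rule protects the task A solution, so the weights simply migrate to task B; methods like EWC add a penalty on exactly this kind of drift.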
Catastrophic forgetting means that models sensitive enough to learn from new data destroy existing representations - the stability-plasticity dilemma. Temporal disentanglement refers to the fact that invariant rules and mutable state get compressed into the same weights, so updating one corrupts the other. Logical integration fails because fact updates don't propagate to their consequences: changes are local to token sequences, not semantic concepts. And unlearning remains impossible: there is no differentiable operation for subtraction, so false or toxic knowledge has no surgical remedy.

But there is a second set of problems that gets less attention. The current separation between training and deployment is not just an engineering convenience - it is a safety, auditability, and governance boundary. Open it, and several things break at once. Safety alignment can degrade unpredictably: even narrow fine-tuning on benign data can produce broadly misaligned behavior. Continuous updates create a data-poisoning surface, a slow, persistent version of prompt injection that lives in the weights. Auditability breaks down because a continuously updating model is a moving target that can't be versioned, regression-tested, or certified once. And privacy risks intensify when user interactions get compressed into parameters, baking sensitive information into representations that are far harder to filter than retrieved context.

These are open problems, not fundamental impossibilities, and solving them is as much a part of the continual learning research agenda as solving the core architectural challenges.

From Memento to Memory

Leonard's tragedy in Memento isn't that he can't function: he's resourceful, even brilliant within any given scene. His tragedy is that he can never compound. Every experience remains external - a Polaroid, a tattoo, a note in someone else's handwriting.
He can retrieve, but he cannot compress the new knowledge.

As Leonard moves through this self-constructed maze, the line between truth and belief begins to blur. His condition does not just strip him of memory; it forces him to constantly reconstruct meaning, making him both investigator and unreliable narrator in his own story.

Today's AI operates under the same constraint. We have built extraordinarily capable retrieval systems: longer context windows, smarter harnesses, coordinated multi-agent swarms. And they work! But retrieval is not learning. A system that can look up any fact has not been forced to find structure. It has not been forced to generalize. The lossy compression that makes training so powerful, the mechanism that turns raw data into transferable representations, is exactly what we shut off the moment we deploy.

The path forward is likely not a single breakthrough but a layered system. In-context learning will remain the first line of adaptation: it is native, proven, and improving. Module mechanisms can handle the middle ground of personalization and domain specialization. But for the hard problems, such as discovery, adversarial adaptation, and knowledge too tacit to express in words, we may need models that compress experience into their parameters after training. That means advances in sparse architectures, meta-learning objectives, and self-improvement loops. It may also require us to redefine what "a model" even means: not a fixed set of weights, but an evolving system that includes its memories, its update algorithms, and its capacity to abstract from its own experience.

The filing cabinet keeps getting bigger. But a bigger filing cabinet is still a filing cabinet. The breakthrough is letting the model do after deployment what made it powerful during training: compress, abstract, and learn. We stand at the cusp of moving from amnesiac models to ones with a glimmer of experience. Otherwise, we will be stuck in our own Memento.

----