Bounded-Memory Long-Horizon Conversations: MGPT on BABILong
A transformer's context window is its short-term memory, and it is both finite and expensive. As a conversation gets longer, the usual options are to grow the window — which costs more compute and memory every turn — or to throw old turns away and lose the thread. Neither scales to the kind of long-horizon interaction we care about.
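To make the cost argument concrete, here is a back-of-the-envelope sketch of how the KV cache grows when every turn is kept, compared with a capped context. The model dimensions (32 layers, 8 KV heads, head dimension 128, 16-bit values) and the helper names are illustrative assumptions, not MGPT's actual configuration.

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes of KV cache for `tokens` cached positions.

    The leading 2 counts one key tensor and one value tensor per layer.
    All default dimensions are hypothetical, chosen only for illustration.
    """
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


def growing_window(turns, tokens_per_turn):
    """Cached tokens after each turn when every turn is kept."""
    return [tokens_per_turn * (t + 1) for t in range(turns)]


def bounded_window(turns, tokens_per_turn, budget_tokens):
    """Cached tokens after each turn when the cached context is capped."""
    return [min(tokens_per_turn * (t + 1), budget_tokens) for t in range(turns)]


if __name__ == "__main__":
    turns, tokens_per_turn, budget = 100, 200, 2000
    grown = growing_window(turns, tokens_per_turn)[-1]
    capped = bounded_window(turns, tokens_per_turn, budget)[-1]
    print(f"growing window: {kv_cache_bytes(grown) / 2**20:.0f} MiB")
    print(f"bounded window: {kv_cache_bytes(capped) / 2**20:.0f} MiB")
```

Under these assumed dimensions each cached token costs 128 KiB, so the growing window's memory rises linearly with the number of turns while the capped one levels off once the budget is reached.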
MGPT — Mask-Generative Pretrained Transformer — takes a different route. Instead of treating the KV cache as a fixed buffer to be managed from the outside, it lets the model govern its own memory. A pretrained GPT is fine-tuned to emit explicit <|mask|>N commands that hide stale turns from direct attention while keeping their compressed traces inside downstream state. The model decides what to forget, and recalls what it hid without replaying it verbatim.
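The bookkeeping side of this idea can be sketched as a toy. The post does not specify what the N in <|mask|>N refers to, so the sketch below assumes it is the zero-based index of the turn to hide; the `Turn` and `Conversation` classes are hypothetical names for illustration, not part of MGPT. The toy tracks only which turns stay visible and does not model the compressed traces that remain in downstream state.

```python
import re
from dataclasses import dataclass, field

# Assumption: "<|mask|>N" hides the turn at zero-based index N.
MASK_COMMAND = re.compile(r"<\|mask\|>(\d+)")


@dataclass
class Turn:
    text: str
    visible: bool = True


@dataclass
class Conversation:
    turns: list = field(default_factory=list)

    def add_turn(self, text):
        """Apply any mask commands in `text`, then store the cleaned turn."""
        for match in MASK_COMMAND.finditer(text):
            index = int(match.group(1))
            if 0 <= index < len(self.turns):  # ignore out-of-range indices
                self.turns[index].visible = False
        self.turns.append(Turn(MASK_COMMAND.sub("", text).strip()))

    def visible_context(self):
        """Texts of the turns still exposed to direct attention."""
        return [turn.text for turn in self.turns if turn.visible]


if __name__ == "__main__":
    conv = Conversation()
    conv.add_turn("Mary moved to the bathroom.")
    conv.add_turn("John went to the hallway.")
    conv.add_turn("Noted. <|mask|>0")  # model-emitted command hides turn 0
    print(conv.visible_context())
```

In this toy the hidden turn is simply gone from view. In the mechanism described above, tokens generated while that turn was still visible would keep carrying its information in their own cached state.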
The result
On BABILong, a long-context question-answering benchmark, a 4B-parameter MGPT reaches 99% token accuracy on the QA1 task with constant memory usage, outperforming substantially larger systems that have to keep growing their context to keep up. The effective context contracts as the model masks, but the neural traces left behind in the KV cache behave like a fading-but-durable recollection — enough to answer questions about information the model chose to hide turns earlier.
This is the evidence we needed for the architecture thesis: capability does not have to come from a bigger model or a bigger window. A small model that manages its own memory can sustain long-horizon interaction at a fixed cost.
The demo
Here is MGPT solving a BABILong QA1 problem in real time, using model-managed masking and fixed-memory context handling rather than an ever-growing window.
The paper
The full write-up — Bounded-Memory Long-Horizon Conversations: MGPT and Model-Governed KV Memory — covers the architecture, the masking mechanism, training and inference, and the BABILong evaluation in detail. It is a work-in-progress draft.
MGPT is the architecture behind the Continual MI research line. The same efficiency that makes a small model competitive on BABILong is what we serve through the MGPT API.