
Efficient local models need a better attention architecture.
MGPT (Mask-Generative Pretrained Transformer) is our main research surface. It uses masking to address the attention deficit of smaller models, so that an open base model can become strong on local hardware before it is ever asked to learn continually.
Smaller models underperform partly because their attention is too weak.
Continual learning can't happen on a model you can't run. If the base model is too large for local hardware, too costly to update, or locked behind someone else's API, its weights will never adapt to your everyday use.
MGPT works on that foundation. The hypothesis is that small models have an attention deficit: they can't reliably route the information that matters through limited context and compute. Masking gives the model a deliberate way to choose what to keep and act on, adding capability without simply adding parameters.
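To make the masking idea concrete, here is a minimal sketch of how a keep/drop mask changes attention weights. It is not MGPT's actual mechanism: the function names, the toy scores, and the hand-written mask are all assumptions for illustration. In MGPT the model manages the mask itself, whereas here it is simply passed in.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention_weights(scores, keep):
    """Softmax over scores, with masked-out positions forced to zero weight.

    `keep` is a hypothetical boolean mask: True means the position stays visible.
    """
    if not any(keep):
        raise ValueError("at least one position must be kept")
    masked = [s if k else float("-inf") for s, k in zip(scores, keep)]
    return softmax(masked)

# Toy example (assumed values): position 0 holds the relevant fact,
# the rest are distractors with nearly the same raw score.
scores = [2.0, 1.8, 1.9, 1.7, 1.8]
keep = [True, False, False, True, False]

dense = softmax(scores)                         # weight spread over everything
masked = masked_attention_weights(scores, keep)  # weight only on kept positions

print("dense weight on the fact: ", round(dense[0], 3))
print("masked weight on the fact:", round(masked[0], 3))
```

With near-equal scores, the unmasked softmax spreads weight thinly across every position. Dropping distractors concentrates the same budget on what is kept, which is the intuition behind adding capability without adding parameters.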
The aim isn't a longer context window. It's a stronger local base model — small, efficient, and open — capable enough to carry continual learning later.
The MGPT API
The hosted API is the product-facing route for building on MGPT today, while the architecture work pushes toward more capable local base models. Create scoped platform keys, call the endpoint, and route model work through Continual MI.
BABILong QA1
MGPT solving a QA1 problem on the BABILong benchmark using model-managed masking. The demo is an early proof of concept that masking can help route attention through constrained memory instead of relying only on bigger models.
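For readers unfamiliar with the task, the sketch below shows the shape of a QA1-style problem, assuming it follows the bAbI single-supporting-fact format: a few location facts are buried in unrelated text, and the question asks where a person is. The keep rule here is a hand-written stand-in, not MGPT's learned masking, and `answer_where`, `budget`, and the story text are all hypothetical.

```python
def answer_where(sentences, person, budget=4):
    """Answer 'Where is <person>?' while holding at most `budget` sentences."""
    memory = []
    for sentence in sentences:
        # Hand-written keep/mask rule: keep only sentences naming the person.
        if person in sentence.split():
            memory.append(sentence)
            if len(memory) > budget:
                memory.pop(0)  # evict the oldest kept sentence
    if not memory:
        return None
    # A single supporting fact suffices: the person's most recent move.
    return memory[-1].rstrip(".").split()[-1]

story = [
    "Mary moved to the bathroom.",
    "The weather was mild that year.",
    "John went to the hallway.",
    "A long stretch of unrelated text follows.",
    "Mary travelled to the kitchen.",
]

print(answer_where(story, "Mary"))
print(answer_where(story, "John"))
```

The point of the toy is the memory budget: the answer depends only on what survives the keep/mask decision, not on how much text was read.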
Work-in-progress draft
MGPT updates and discussion also live in the Continual Society, especially around efficient architecture, local open models, and the path toward continual learning.