Continual MI · Blog

Block Generation: How We Made Endless Visual Novels Affordable

June 28, 2026 · MDL · Engineering · Generation · Cost

MDL started as something nobody had built before: an endless visual novel engine. Not a branching script with a finite number of endings, but a story that keeps writing itself for as long as you keep playing. You make a choice, the engine writes the next moment of your world, you respond, and it writes again. There is no last page.

It worked better than we expected. The immersion was the thing people kept coming back to. We watched players stay inside a single universe for days at a stretch, the way you fall into a good book and lose the afternoon, except the book was writing itself around their decisions. That was the proof we needed that the core idea was real.

There was one problem, and it was a big one. MDL was a premium engine. To experience a story for any meaningful length of time, you had to pay, and you had to pay a lot.

Why it was so expensive

The first version of the engine ran on Gemini 3 Flash. It is a cheaper model than the flagships, but it has real creative range, and for our task that combination was hard to beat. Every step of the story was a fresh call to it. We even built a prefetching system on top so the next beat would already be on its way before you asked for it, which kept the experience feeling smooth instead of turn-based.

Smooth, but costly. One step, one call, every time. And the cost got worse in a way we did not anticipate. The model would sometimes fall into a rhythm of producing illustrated scenes back to back, and each of those scenes cost 15 credits to generate. A few of those in a row and a player could watch their balance evaporate in minutes. Some people blew through their credits before they had really gotten anywhere.

For an experience whose whole promise is endless, that is close to fatal. An endless story you can only afford in short bursts is not the product we set out to build. We needed the economics to make sense, not just the storytelling.

The two months we spent on the wrong answer

The plan we committed to was training our own model. We had tens of thousands of real generations from actual play, which is exactly the kind of data you want, and the idea was to use it to train an open-source model we could host ourselves and run far more cheaply than calling out to a flagship for every step.

It did not work, and it is worth being honest about why.

The models we could realistically train are smaller. Smaller is what makes them cheap to run, but smaller is also where the creativity goes to die. None of the candidates were inherently creative in the way the task demands. And they were not small enough to be cheap on the training side either. We did not have the capital to fully fine-tune them, only to nudge them.

This is the uncomfortable shape of the problem. The reason Gemini 3 Flash is good at writing a living story is that it is enormous, and that scale is also why it adapts to our task so readily. The model we would actually want to fine-tune is Gemini 3 Flash itself, and the moment you fine-tune something that large you have given back the cost savings that were the entire point.

So we stopped. Every model we trained came out meaningfully worse than the Gemini 3 Flash baseline, and the thing that gave it away every time was the writing. You can forgive a model a lot, but you cannot forgive flat prose in a product whose only job is to make you feel like you are inside a story.

The accident

There was one experiment from the training work that did not look like a dead end.

While we had the training pipeline open, we tried to teach a model to do something Gemini 3 Flash had never reliably done for us: produce several story steps in a single generation instead of just one. If we could get multiple beats per call, the cost math changed completely. The trained model managed it, sort of. It could output multiple steps, but not consistently enough to ship, and the writing was still the weak link.

So we went back to Gemini 3 Flash and moved on.

Except we left something behind. When we reverted, we forgot to turn structured output enforcement back on, and the multi-step prompt from the experiment was still in place. Structured output enforcement is the mechanism that forces the model to return data in a strict, schema-shaped form. It is normally how we guarantee the engine gets something it can parse. With it switched off and the multi-step instructions still present, Gemini 3 Flash did the thing the trained model had only fumbled at: it produced clean multi-step output, consistently.

That was the part we did not see coming. With strict structured output turned on, the same model had always struggled to produce more than one good step at a time. Take the constraint away and it could write a whole block of beats at once, and write them well. We had spent two months trying to train this behavior into a smaller model, and the behavior was sitting in Gemini 3 Flash the whole time, locked behind a setting.
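With enforcement off, the guarantee that the engine receives something parseable has to come from somewhere else. The sketch below shows one way that check could look. It is an illustration only: the post does not describe the real output format, so the JSON array of step objects, the `text` field, and the `parse_steps` name are all assumptions.

```python
import json
from typing import Any, Dict, List


def parse_steps(raw: str) -> List[Dict[str, Any]]:
    """Pull a list of story steps out of free-form model output.

    Assumes (hypothetically) that the prompt asks for a JSON array of
    objects, each with a string "text" field. Without schema enforcement
    the model may wrap the array in prose or code fences, so we locate
    the outermost brackets and validate the result ourselves.
    """
    start = raw.find("[")
    end = raw.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON array found in model output")

    # json.JSONDecodeError is a subclass of ValueError, so callers can
    # catch ValueError for every failure mode of this function.
    steps = json.loads(raw[start:end + 1])

    if not isinstance(steps, list) or not steps:
        raise ValueError("expected a non-empty list of steps")
    for step in steps:
        if not isinstance(step, dict) or not isinstance(step.get("text"), str):
            raise ValueError("each step needs a string 'text' field")
    return steps
```

A caller would treat a `ValueError` as a failed generation and retry, which is the job the schema enforcement used to do implicitly.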

We tested it properly before we let ourselves believe it. The writing was not worse. If anything it was slightly better, because the model was composing a continuous run of the scene rather than restarting cold at each step. And it was steady: across real playthroughs, Gemini 3 Flash settled at an average of about 6.7 story steps per generation.

What that bought us

We call the new approach block generation: one call produces a block of playable beats, and the player walks through them while the next block is already being prepared.
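A minimal sketch of that loop, assuming a hypothetical `generate_block(history)` function that stands in for one model call. The names, the thread-based prefetch, and the omission of player choices are all simplifications for illustration, not the engine's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List


def play_blocks(
    generate_block: Callable[[List[str]], List[str]],
    max_blocks: int,
) -> Iterator[str]:
    """Yield beats one at a time while the next block is prepared.

    generate_block is a hypothetical stand-in for one model call: it
    receives the beats produced so far and returns the next block.
    """
    history: List[str] = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Kick off the first block before the player sees anything.
        pending = pool.submit(generate_block, list(history))
        for i in range(max_blocks):
            block = pending.result()  # waits only if the block is not ready
            history.extend(block)
            if i + 1 < max_blocks:
                # Start the next call now, so it runs while the player
                # is still walking through the current block.
                pending = pool.submit(generate_block, list(history))
            for beat in block:
                yield beat
```

The number of model calls equals the number of blocks, not the number of beats, which is where the saving comes from.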

On one full playthrough that ran to 160 beats, the cleanest stretch came out like this:

Metric	Block generation	One step per call
Playable beats	149	~149
Model calls	22	~149
Beats per call	6.7	1.0
Credits charged	42	≥149

Same story. Same number of beats reaching the player. About 85% fewer model calls, and at least 100 credits saved on a single universe.
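The figures follow directly from the table. A quick check, using only the numbers reported there (the one-step credit count is the table's lower bound, so the saving is a lower bound too):

```python
# Numbers copied from the table above.
beats = 149
block_calls = 22
block_credits = 42
single_step_calls = 149       # one call per beat
single_step_credits_min = 149  # table reports this as a lower bound

beats_per_call = beats / block_calls
call_reduction = 1 - block_calls / single_step_calls
credits_saved_min = single_step_credits_min - block_credits

print(f"beats per call: {beats_per_call:.2f}")       # beats per call: 6.77
print(f"fewer model calls: {call_reduction:.1%}")    # fewer model calls: 85.2%
print(f"credits saved (min): {credits_saved_min}")   # credits saved (min): 107
```

For this stretch the ratio works out to just under 6.8 beats per call, in line with the roughly 6.7 average seen across playthroughs.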

The reason this saves money is worth being precise about, because it is not where people assume. The model still writes every beat. The output, the actual story, is the same length. What changes is that all the work around the writing — assembling the prompt, the round trip, the per-call billing — now gets paid once per block instead of once per beat.

The images, and the wait

The illustrated scenes are the other place block generation quietly fixed things, on two fronts.

The first is how many image requests we make. Under the old one-step system, an illustrated scene could be triggered on any step, and the model would sometimes chain them, firing several heavy image generations back to back. That was both the latency problem and the 15-credits-a-scene problem stacked on top of each other. With block generation the structure naturally caps it: a block carries at most one illustrated scene, and it tends to land in the middle of the block rather than at the front. Instead of a run of synchronous image waits and a run of 15-credit charges, you get one, placed where there are beats on either side of it.

The second is where the wait happens. When the illustrated scene sits a few beats into the block, the player still has reading to do before they reach it, so we generate the image into that reading time instead of making them sit and stare at a spinner. The text block goes out first and the image finishes in the background while the player reads toward it.
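One way to picture the overlap is with a background task that is only awaited when the player actually reaches the illustrated beat. This is a sketch under assumptions: `render_image` and `show_beat` are hypothetical stand-ins for the image service and the UI (including reading time), and the real engine may be structured differently.

```python
import asyncio
import time
from typing import Awaitable, Callable, List


async def play_block_with_image(
    beats: List[str],
    image_index: int,
    render_image: Callable[[], Awaitable[str]],
    show_beat: Callable[[str], Awaitable[None]],
) -> float:
    """Show beats in order while the block's image renders in the background.

    Returns the seconds the player visibly waited on the image.
    """
    if not 0 <= image_index < len(beats):
        raise ValueError("image_index must point at a beat in the block")

    # Scheduled now; it makes progress while earlier beats are being read.
    image_task = asyncio.create_task(render_image())
    waited = 0.0
    for i, beat in enumerate(beats):
        if i == image_index:
            start = time.monotonic()
            await image_task  # blocks only if the image is not ready yet
            waited = time.monotonic() - start
        await show_beat(beat)
    return waited
```

With the image at index 0 the full render time shows up as a wait. With a few beats ahead of it, most or all of that time is absorbed by reading.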

The playthrough we measured generated three illustrated scenes:

Scene	Image time	Position in block	What the player felt
A sketch reveal	~22.1s	front of its block	still a visible wait
A training-park scene	~23.7s	end of its block	hidden behind reading
A stairwell moment	~17.5s	end of its block	hidden behind reading

The two scenes that landed with beats ahead of them generated in the background. Across the run, about 65% of total image generation time (~41s of ~63s) happened after the response had already reached the player, hidden behind beats they were still reading. The background images averaged around 20.6s each, and with three beats of reading in front of them at roughly 6 to 8 seconds per beat, that is 18 to 24 seconds of cover, enough to swallow most of the wait.
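Those figures can be reproduced from the table. The short check below uses only the reported image times and the stated 6 to 8 seconds of reading per beat; `visible_wait` is an illustrative helper, not something from the engine.

```python
# Image times in seconds, copied from the table above.
image_seconds = {
    "sketch reveal": 22.1,
    "training-park scene": 23.7,
    "stairwell moment": 17.5,
}
background_scenes = ["training-park scene", "stairwell moment"]

total = sum(image_seconds.values())
hidden = sum(image_seconds[name] for name in background_scenes)
hidden_share = hidden / total
background_avg = hidden / len(background_scenes)

# Three beats of reading ahead of each background image, 6 to 8 s per beat.
beats_ahead = 3
cover_low, cover_high = beats_ahead * 6, beats_ahead * 8


def visible_wait(image_time: float, cover: float) -> float:
    """Seconds of image time left over once reading time is used up."""
    return max(0.0, image_time - cover)


worst_case = max(visible_wait(image_seconds[n], cover_low) for n in background_scenes)

print(f"{hidden:.1f}s of {total:.1f}s hidden ({hidden_share:.0%})")  # 41.2s of 63.3s hidden (65%)
print(f"background average: {background_avg:.1f}s")                  # background average: 20.6s
print(f"reading cover: {cover_low} to {cover_high}s")                # reading cover: 18 to 24s
print(f"worst visible wait at fastest reading: {worst_case:.1f}s")   # worst visible wait at fastest reading: 5.7s
```

Even for the slowest background image and the fastest reader, the leftover wait is under six seconds, which is what "most of the wait" amounts to here.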

The one exception proves where to push next. The sketch reveal sat at the front of its block with nothing to read ahead of it, so it was still an inline wait, about 37 seconds of total response time with roughly 22 of those being the image. That is the case the placement is meant to avoid, and it tells us the goal is not just “make images faster” but “make sure there are beats to read before the expensive one arrives.”

Where this leaves us

Block generation is the gain we needed to make an endless visual novel actually worth playing at length instead of in expensive bursts. Lower cost per beat is the headline, but the more interesting consequence is what it frees up. Because each call now carries six or seven beats instead of one, we can afford to put our best models on the task. We are moving toward running Gemini 3.1 Pro for generation, which raises the writing quality and the immersion another step, on stories that are now cheaper to tell than they were on the old setup.

The honest lesson is that the win did not come from the plan. It came from a reverted experiment and a setting nobody re-enabled. We will take it. But the reason we caught it at all is that we were reading the logs of real playthroughs closely enough to notice when the numbers changed. That habit is the actual asset, and it is the one we are keeping.

This is the first post on the Continual MI engineering blog. We build MDL, the endless visual novel engine, as part of a longer bet on continual learning. More on that bet another time.
