A loop for things you can’t measure
Self-improving loops only work where the outcome is verifiable. Can you build one that improves how an app feels? What I learned trying.
Loops are everywhere right now, and for good reason. You let a model try something, check the result, feed the result back, and let it try again. Do that enough times and the system gets better on its own. It is one of the most exciting ideas in how we build with models.
But look closely at the loops that actually work, and they all live in the same neighborhood. Code that compiles or does not. A test that passes. A math answer that checks out. A benchmark that moves. The loop closes because something can say right or wrong, cheaply and without a human. That verifiable signal is the whole engine.
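In code, the verifiable case is small. The following is a minimal sketch rather than any particular system's implementation: `propose` and `verify` are hypothetical stand-ins for a model call and an automatic check, and the toy square-root task exists only to make the loop runnable.

```python
from typing import Callable, Optional


def improve(
    propose: Callable[[Optional[str]], str],
    verify: Callable[[str], Optional[str]],
    max_rounds: int = 5,
) -> Optional[str]:
    """Try, check, feed the failure back, and try again."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        candidate = propose(feedback)
        feedback = verify(candidate)  # None means the check passed
        if feedback is None:
            return candidate
    return None  # budget exhausted without a passing candidate


def make_guesser() -> Callable[[Optional[str]], str]:
    """Toy 'model' (an assumption for illustration): nudges an integer guess."""
    state = {"guess": 1}

    def propose(feedback: Optional[str]) -> str:
        if feedback == "too low":
            state["guess"] += 1
        elif feedback == "too high":
            state["guess"] -= 1
        return str(state["guess"])

    return propose


def verify_sqrt_of_49(candidate: str) -> Optional[str]:
    """Toy verifier: cheap, automatic, and able to say right or wrong."""
    n = int(candidate)
    if n * n == 49:
        return None
    return "too low" if n * n < 49 else "too high"


result = improve(make_guesser(), verify_sqrt_of_49, max_rounds=10)
```

Everything interesting lives in `verify`. Take it away and the loop has nothing to feed back, which is the situation the rest of this piece is about.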
The interesting question is everything else
Most of what software is made of has no such signal. Whether a flow feels smooth. Whether a screen is confusing. Whether the product is actually pleasant to use. Experience does not compile. There is no test that returns true when something feels right. And that is exactly where I want a loop the most.
Can you build a loop that improves the experience of an app, not only its correctness? I think you can. But almost everything written about self-improving systems takes the verifiable case for granted, and the experience case is a different problem hiding under the same word.
The model is not the bottleneck
The model can already do the judging. Anyone who has asked one for design feedback knows this. Show it a screen, describe a flow, and it will tell you something useful about where a user would get stuck or what feels off. It has seen enough products to have taste. So the missing piece was never the model’s ability to hold an opinion about experience.
Two pieces are missing, and both are less glamorous: what you feed it, and how it ever knows it helped.
What I built, and what it taught me
At Akarii I built a version of this. A self-improving product loop: an agent drives a real browser through the running app, walks the important flows, evaluates what it finds, and files improvement proposals. A person reviews them, and for clear, verifiable bugs the agent can open the pull request itself. It shipped, and it was genuinely useful.
It also showed me where the hard edges are, and they are the two I just named.
The inputs are harder than they look
I assumed the hard part was giving the agent eyes, so I put it in a browser and captured what it saw. But what the model actually reasoned over was the behavior and structure of the page: what loaded, what threw an error, how long it took, the words that came back. The screenshot was mostly there for the human reviewing later. So even with a browser in the loop, the thing judging experience was reading a description of the page, not seeing it.
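To make that gap concrete, here is a hedged sketch of the kind of record such an agent might hand to the model. The `PageObservation` type, its field names, and `to_prompt` are hypothetical illustrations, not the actual data model behind the loop described here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PageObservation:
    """What the agent captured on one page (hypothetical structure)."""
    url: str
    load_ms: int
    console_errors: List[str] = field(default_factory=list)
    visible_text: str = ""
    screenshot_path: Optional[str] = None  # kept for the human reviewer


def to_prompt(obs: PageObservation) -> str:
    """Render the observation as the text the model actually reads."""
    lines = [
        f"URL: {obs.url}",
        f"Load time: {obs.load_ms} ms",
        f"Console errors: {len(obs.console_errors)}",
    ]
    lines += [f"  - {err}" for err in obs.console_errors]
    lines.append(f"Visible text: {obs.visible_text[:200]}")
    # The screenshot is deliberately absent: in this sketch the model
    # judges a description of the page, not the rendered page itself.
    return "\n".join(lines)


obs = PageObservation(
    url="https://example.test/checkout",
    load_ms=1840,
    console_errors=["TypeError: total is undefined"],
    visible_text="Review your order. Pay now.",
    screenshot_path="shots/checkout.png",
)
prompt = to_prompt(obs)
```

Nothing in `prompt` carries rhythm, spacing, or hesitation. Those only exist in the rendered page and in the person looking at it.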
Experience is the felt, rendered thing. The rhythm of a screen, the small hesitation before someone understands what to do. You can hand a loop the DOM. Handing it the feeling is the actual research problem, and a browser alone does not solve it.
The signal is the harder half
The deeper problem is the one that makes experience different from code. Say the agent proposes a good change and it ships. How does the loop know it worked? A code loop reruns the test. An experience loop has nothing to rerun. There is no assertion for better.
This is why my loop stops where it does. It proposes, and a person decides. For a clean bug I let it run all the way to a pull request, because a bug has something close to a verifiable signal. For anything about how the product feels, it hands the judgment back to a human. That is not a safety fallback. The judgment of experience genuinely belongs to a person.
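The stopping rule can be sketched as a routing function. This is an illustration under assumptions, not the production logic: the `Proposal` fields, the two `Kind` values, and the `has_failing_check` flag are hypothetical names standing in for whatever makes a bug "clear and verifiable".

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    BUG = "bug"
    EXPERIENCE = "experience"


class Route(Enum):
    OPEN_PULL_REQUEST = "open_pull_request"
    HUMAN_REVIEW = "human_review"


@dataclass
class Proposal:
    title: str
    kind: Kind
    # Hypothetical flag: there is a reproducible failure that a fix
    # can be checked against, i.e. something close to a verifiable signal.
    has_failing_check: bool


def route(proposal: Proposal) -> Route:
    """Only verifiable bugs go straight to a pull request."""
    if proposal.kind is Kind.BUG and proposal.has_failing_check:
        return Route.OPEN_PULL_REQUEST
    # Anything about how the product feels goes back to a person.
    return Route.HUMAN_REVIEW


bug = Proposal("Save button throws on empty title", Kind.BUG, True)
vague_bug = Proposal("Export sometimes seems slow", Kind.BUG, False)
feel = Proposal("Onboarding asks for too much up front", Kind.EXPERIENCE, False)
routes = [route(p) for p in (bug, vague_bug, feel)]
```

Note that the default branch is the human one. The automated path has to earn its way in by having a check behind it.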
Closing it with softer signals
The way I think this actually works is less like a proof and more like a habit. You let the agent run on a cadence, every day, so improving the product becomes a steady background process instead of a project you schedule. Most days it surfaces a few small things. Over weeks that compounds into something a person would never have had the patience to do by hand.
And you close the loop with the only signal that truly understands experience, which is people. The users. What they complain about, what they abandon, what they return to. You feed that back in as the thing the agent chases. Real users are the reward function you cannot write down.
Experience is defined by the people having it, so the signal that closes the loop has to come from them.
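One modest way to feed those signals back in is to let them decide where the agent looks next. The sketch below assumes user signals arrive as `(flow, kind)` pairs, where the kind labels `complaint`, `abandon`, and `return` are hypothetical. It orders attention only. It does not claim to measure whether the experience got better.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Signal kinds that suggest friction (assumed labels for illustration).
FRICTION_KINDS = {"complaint", "abandon"}


def flows_to_watch(signals: Iterable[Tuple[str, str]], top: int = 3) -> List[str]:
    """Return the flows with the most friction signals, most frequent first."""
    counts = Counter(flow for flow, kind in signals if kind in FRICTION_KINDS)
    return [flow for flow, _ in counts.most_common(top)]


signals = [
    ("checkout", "abandon"),
    ("checkout", "complaint"),
    ("search", "complaint"),
    ("profile", "return"),  # people coming back is not friction
]
watchlist = flows_to_watch(signals, top=2)
```

The agent would then walk the flows in `watchlist` first on its next daily run, and a person still decides whether what it proposes is better.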
This might be the real shape of it
For a while I read this as a weakness. My loop did not close itself the way the impressive coding loops do, and that felt unfinished. I have come around to thinking it is the actual shape of the problem. There is no clever trick that removes the people from an experience loop, because they are the definition of the thing being improved.
That changes what you are engineering for. The goal is to make the loop tight and cheap enough that a little human and user judgment can steer a lot of automated proposing. The machine does the tireless watching. People supply the taste.
The loop I actually want stays out of the way. It runs every day, watches the product the way a careful designer would if they had infinite patience, proposes more than any person would have time to, and leaves the question of whether it is better to the people who feel it. The part I am still chasing is the edge between those two. How much of that judgment can you compress into a signal the loop can use on its own, before you flatten the very thing you were trying to improve? I do not have the full answer. What I am fairly sure of is that the answer is not a metric pretending experience is a number.