Leaflet, my autoresearcher
Many of my friends think models cannot do research the way humans can — and that making models do research is a waste of compute.
Let’s map what research actually involves. Generating a hypothesis using background knowledge and creativity. Designing an experiment to test it, running it, and measuring the outcomes. Then reading the measurements, and supporting, refuting, or updating the hypothesis. Repeat.
Models can very reasonably handle the experimental side of this today — designing experiments given a hypothesis, running them, and recording measurements. They can do this far more reliably, and in far greater quantity, than humans ever could.
On hypothesis generation: leading research labs are already using models specifically for mass hypothesis generation. Earlier, it took a human who had spent their life in a field — from undergraduate to master’s to PhD — to generate hypotheses one by one. Models can be prompted to produce hundreds of them in minutes. With specialised training, the quantity and quality both improve.
So the only stage where we might want a few humans in the loop is hypothesis generation — generating them and reviewing what comes back. At present, labs have assistants and students running experiments and measuring outcomes one by one. All of that can be automated. It is more reliable. It is faster.
What people are actually worried about is that research is a creative field, and mass-producing creativity feels uncomfortable. It is uncomfortable. But machines are coming to work alongside humans in almost every creative field. We just have to swallow that.
If we integrate machines into research, scientists can now produce and test far more hypotheses than before. Earlier, since running experiments and writing code for hundreds of hypotheses was the bottleneck, a scientist had to choose: which one is worth testing first? Which ones aren’t worth testing at all?
That constraint no longer exists. You can try out hundreds of hypotheses. This changes how researchers think and work.
What it opened up for me personally: I can explore a lot more than I would have before. Earlier I would read ten papers to understand one topic. Now the model generates ten hypotheses for me — and they’re already precise. I don’t have to read a hundred papers to surface them. I have the ten, and sitting with the model I can understand what’s involved.
What does it take to come up with a hypothesis? For me it has always been this: I’m exploring a bunch of papers simultaneously — studying art, language, physics — and I take a concept from one field and try to apply it to another. Most recently I’ve been applying ideas from complex systems to everything else I’m studying. That’s where my most interesting hypotheses come from. I’ve watched other scientists do the same — they were studying some field, went and applied it to a completely different one, and it just worked. The world is more interconnected than we give it credit for.
I’m hopeful that when the time comes that models can handle hypothesis generation too, humans will have moved up to work at an even higher level.
The role of us humans, I always think, is to discover problems. There are plenty of other humans who find solutions to problems. But our job is to discover more and more problems. The better the problems we find, the better our civilisation becomes. That is the most defining feature of an advanced civilisation: it knows about problems we can’t even fathom.
But then: you have a problem, you have a bunch of possible solutions, you go in five different directions and you have to make a choice. There’s no real mathematical way to get the probability of each one working.
That’s where I think world models come in. The idea is to run world-level simulations. Here are five possible ways of dealing with this problem. Run a simulation for all five. Then you can compare the outputs.
What level of problem am I talking about? Say we discover a vaccine for polio. We want to understand: what happens around the world when it is given to everyone? A world model lets us simulate that. There will be a group that accepts the vaccine and a group that refuses it. That refusal becomes a new problem, one we could not previously have fathomed. World models bring those up.
You might object that I’m assuming this is possible. Sure. But that’s the same ground we stood on before: all of this would be nice if simulating intelligence is possible. And we are at a point now where it might be. It’s a hypothesis. If it is possible, we will see.
LLMs are very good at simulating a given role. They are role-playing models. You give them the right role and they play. So if we can identify all the kinds of roles that exist in the world, we can create role-playing models for each. Give them the same problem. See how each one responds.
I would still call this an unsolved problem. Right now. Hopefully someday.
Do you remember the Foundation novels? Hari Seldon predicts the movement of masses, but in the second book they find characters who break through his frame entirely — characters who make all his predictions untrue. Sounds similar to weather forecasting in a way. We are optimistic about this.
Now the practical question. When I started doing this I kept asking myself: I can just give a good detailed prompt to Fable to solve a problem. Why on earth would I build my own auto-research framework? Fable was awesome. Fable also rests in peace now. So why?
Here is the reason I would still want a framework, even with the best model available:
A framework’s job today is to allow productive conflict to happen.
Each time there is a conflict between agents, something new is learned — something previously unknown, or at least unconceived. Can the framework allow that?
Fable, the model itself, was good. The Anthropic CLI creates a very good conflict for the model to solve — that’s why it works so well. But the agents in standard CLIs are singular. They do not create conflicts. They have some algorithmic conflicts (you have to read a file before you write to it, that kind of rule), but those are the framework forcing simple checks. The Claude Code instance behaves like one person. There isn’t much internal conflict until you prompt it to have some. You can prompt almost any decent model into the same shape — that’s not the differentiator.
Before Fable, the model I relied on for anything important, anything where I was ideating and not just implementing, was ChatGPT 5.5 Pro. An extremely good ideation model. Then Fable went away and I had to go back to Opus 4.8 (I’m fond of Opus, don’t get me wrong). The difference is this: Opus, I need to tell what to do. Fable, I was willing to listen to. Its judgment you could trust.
A friend pushed back: the model is awesome — but is it the model’s capability, or is it the framework around it doing the work? That’s a good distinction. There are two components:
- Model capability. Some models are simply better generalists, better ideators, better judges.
- Framework capability. If the framework lets a model write a hypothesis, walk away, come back two steps later, and review what it wrote — that’s a property of the framework, not the model.
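The second component can be made concrete. Below is a minimal sketch of a framework that hands a draft back for review a fixed number of steps after it was written. The class name `DeferredReview`, the `delay` parameter, and the `review` callable are all hypothetical names chosen for illustration; in practice `review` would wrap a model call.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DeferredReview:
    """Queues drafts and hands each one back for review `delay` steps later."""

    delay: int = 2  # assumption: "two steps later", as in the text
    step: int = 0
    pending: deque = field(default_factory=deque)  # (due_step, draft) pairs

    def submit(self, draft: str) -> None:
        # Record the step at which this draft becomes eligible for review.
        self.pending.append((self.step + self.delay, draft))

    def tick(self, review: Callable[[str], str]) -> list[str]:
        """Advance one step and review every draft that is now due."""
        self.step += 1
        reviewed = []
        while self.pending and self.pending[0][0] <= self.step:
            _, draft = self.pending.popleft()
            reviewed.append(review(draft))
        return reviewed
```

The delay lives in the framework, not in the model: the same `review` callable behaves differently depending on when the framework chooses to call it.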
When building an auto-research agent, both matter. The same models can sit inside very different frameworks, and the framework that enables explicit conflict will outperform one that does not.
So the answer to “why build my own?” is: take the same models the CLIs use, but build a framework that enables conflict more deeply than they do. Fable and Codex have it, but constrained. Make ours less constrained.
Imagine three agents working together to produce one output — say a developer, a security agent, and a QA agent — going through 50 conflicts before they’re all happy with the result. A better model would likely have fewer conflicts on the same task. Or it might have more, depending on what’s being done — but the point is: every single argument in the system will be of higher quality than it was before.
My usual setup in software is developer → security agent → QA. The developer writes code; the security agent checks for issues; the QA agent checks that it works; the loop runs until all three are satisfied.
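A minimal sketch of that loop, under stated assumptions: each reviewer is a plain callable that returns an objection string or `None` when satisfied, and the developer is a `revise` callable that takes the current draft plus the objections. The names `conflict_loop`, `revise`, and `max_rounds` are hypothetical, and real agents would be model calls instead of functions.

```python
from typing import Callable, Optional

# A reviewer returns an objection, or None when it is satisfied.
Reviewer = Callable[[str], Optional[str]]


def conflict_loop(
    draft: str,
    revise: Callable[[str, list[str]], str],
    reviewers: dict[str, Reviewer],
    max_rounds: int = 50,
) -> tuple[str, int]:
    """Revise a draft until no reviewer objects.

    Returns the final draft and the number of objections raised on the way.
    If max_rounds is exhausted, the last draft is returned unresolved.
    """
    conflicts = 0
    for _ in range(max_rounds):
        objections = []
        for name, review in reviewers.items():
            message = review(draft)
            if message is not None:
                objections.append(f"{name}: {message}")
        if not objections:
            break  # every reviewer is satisfied
        conflicts += len(objections)
        draft = revise(draft, objections)
    return draft, conflicts
```

Swapping in a stronger model changes what `revise` and the reviewers say, not the shape of the loop.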
With a better model, what happens? The code written in the first pass is already much more secure than what a junior agent would produce, because the model has internalised: in this world, security matters. So the code arrives cleaner. When it goes to the security agent, which is also a better model, that agent doesn’t just check for basic things. It checks for extremely complicated stuff. Maybe more conflicts, maybe fewer. Either way, each argument is of higher quality than before.
The same system magically appears to start working better — because the people in it (if we consider agents to be people) became more thoughtful.
A no-conflict system is a hierarchical system. The coder agent gives the code, the PR agent just takes and pushes it, and that’s all it does. It never questions what the code is or where it’s being pushed.
Maybe you can think of examples where that’s appropriate, but they have to be very specific.
Say I want to post on socials about a security breach. My developer and PR agents go figure out the source of the breach, fix it, and write a report. When that report goes out to social, you might think: the social agent’s job is just to relay what they said.
But even here — think what a conflict would bring. The social agent would say: hey, if you post it exactly like this, it’s going to sound really bad. Make these changes. And release a longer note for people who want the full story. That’s a productive conflict. It improves the outcome.
I’m having a hard time thinking of many scenarios where I would not want conflict. In most situations conflict creates better outcomes.
Imagine a state ruled by a dictator. That dictator wants the population to have no conflicts — to just follow whatever the constitution says. That is the only kind of system that wants no conflict. Everywhere else: you want conflict.
Having conflict is one thing. The structure of it is another.
I have been using Leaflet for the past week, and I’m beginning to find my own answers to: when do I want conflicts? how much? when exactly do I want them to resolve? Artists talk about their process of getting ideas. This is what I’m finding here. Not the agent’s process — my process. The carbon–silicon mingling.
Drawing the line is everything. We want conflicts, but we want them to resolve at some point. A conflict that never resolves is useless. If models are extremely capable but also extremely adamant, the security agent can keep finding new issues forever. The loop never ends. There needs to be a point where everyone says — okay, we don’t need to keep going. Where do you draw it? Maybe 90%. I’m happy at 90%.
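One way to draw that line in code is a stopping rule. This is only a sketch: it assumes "90%" means the fraction of reviewers currently satisfied, which is one possible reading, and it adds a round budget so an adamant agent cannot keep the loop alive forever. The function name `should_stop` and both default values are hypothetical.

```python
def should_stop(
    verdicts: list[bool],
    round_no: int,
    threshold: float = 0.9,  # assumption: 90% of reviewers satisfied
    max_rounds: int = 20,    # assumption: hard cap on rounds
) -> bool:
    """Stop when enough reviewers are satisfied or the budget is spent."""
    if round_no >= max_rounds:
        return True
    if not verdicts:
        return True  # nobody is left to object
    satisfied = sum(verdicts) / len(verdicts)
    return satisfied >= threshold
```

For example, `should_stop([True] * 9 + [False], round_no=3)` returns `True`, while `should_stop([True, False], round_no=3)` returns `False`.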
Here is the frame I’ve landed on for my own research:
- I ask the agent to do a survey of the field.
- The agent and I sit together and read it. At this point we are not generating hypotheses, not drawing conclusions, not questioning. Just reading.
- We go to hypothesis generation. The agent generates first. I review. It generates again. After some rounds we have a list.
- I rank the hypotheses.
- After ranking, I start experimenting — test the first hypothesis. Based on what comes back, update the rest.
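The last two steps of this frame (rank, then test and update) can be sketched as below. The survey, reading, and generation steps are interactive and are left out. Everything here is an assumption made for illustration: the `Hypothesis` class, the numeric `score` standing in for my ranking, and the `experiment` and `update` callables.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hypothesis:
    text: str
    score: float  # the human's ranking; higher means test sooner


def run_frame(
    hypotheses: list[Hypothesis],
    experiment: Callable[[Hypothesis], bool],
    update: Callable[[Hypothesis, Hypothesis, bool], Hypothesis],
) -> list[tuple[str, bool]]:
    """Test hypotheses in ranked order, re-scoring the rest after each result."""
    queue = sorted(hypotheses, key=lambda h: h.score, reverse=True)
    log = []
    while queue:
        current = queue.pop(0)
        result = experiment(current)
        log.append((current.text, result))
        # Let the outcome change the scores of what remains, then re-rank.
        queue = sorted(
            (update(h, current, result) for h in queue),
            key=lambda h: h.score,
            reverse=True,
        )
    return log
```

The `update` hook is where a result from one experiment promotes or demotes the hypotheses that have not been tested yet.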
That’s my frame. Yours could be different. Maybe your stage one is: just ask the model for a hypothesis, test it, get the conclusion, and that’s where you start to think.
To be clear, this is not how the agent’s process should be designed. It is the carbon–silicon mingling I want for myself: where I want the human in the loop. All the scientists from now on are going to have their own way to do research with agents. These people are so smart. Let’s see what kind of processes they come up with.