autoresearcher — The Mirror

I was studying Neuronpedia and ran SAE training on my laptop. I was going to test a few hypotheses myself, but it seemed pointless. Why should I run experiments one by one? I have LLMs that can run tens of simultaneous threads. They can co-research with me. So I set the SAE aside for the time being and started building my own autoresearch agent. We call it Leaflet.

autoresearcher design

The Leaflet stack has three core steps. This section is the working framework — what each step actually does, the prompt sequence that drives it, and a real sample I ran on the SAE topic.

Step 1 — Survey (STORM)

The survey step is Stanford’s STORM (Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking — paper / code).

Given a topic string, STORM produces a Wikipedia-style article from scratch by:

Generating writer personas (from related Wikipedia tables of contents).
Simulating one multi-turn conversation per persona — each turn = ask a question, decompose into search queries, retrieve snippets, answer.
Drafting an outline from parametric knowledge, then refining it using the conversations.
Writing each section using the retrieved snippets, with citations.
Adding a summary at the top and deduping.
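
The five stages above can be sketched as one function. This is a minimal sketch under assumptions, not STORM's actual code: lm and search are hypothetical callables standing in for the language model and the retriever, and every function name and prompt string is made up for illustration. The dedup here only covers the source list, which is narrower than what STORM does.

```python
from typing import Callable, Dict, List

LM = Callable[[str], str]            # hypothetical: prompt in, text out
Search = Callable[[str], List[str]]  # hypothetical: query in, snippets out


def storm_sketch(topic: str, lm: LM, search: Search,
                 max_perspective: int = 2, max_conv_turn: int = 2) -> Dict[str, object]:
    # 1. Writer personas (STORM derives these from related Wikipedia TOCs).
    personas = [lm(f"Persona {i} for writing about: {topic}")
                for i in range(max_perspective)]

    # 2. One multi-turn conversation per persona:
    #    question -> search queries -> retrieved snippets -> answer.
    snippets: List[str] = []
    conversations: List[List[str]] = []
    for persona in personas:
        turns: List[str] = []
        for _ in range(max_conv_turn):
            question = lm(f"As {persona}, ask a question about {topic}")
            queries = lm(f"Search queries for: {question}").splitlines()
            found = [s for q in queries for s in search(q)]
            answer = lm(f"Answer '{question}' using: {found}")
            snippets.extend(found)
            turns.append(f"Q: {question}\nA: {answer}")
        conversations.append(turns)

    # 3. Draft an outline from parametric knowledge, then refine it.
    draft = lm(f"Outline for: {topic}")
    outline = lm(f"Refine outline {draft} using {conversations}").splitlines()

    # 4. Write each section from the retrieved snippets.
    sections = {h: lm(f"Write section '{h}' citing: {snippets}")
                for h in outline if h.strip()}

    # 5. Add a summary; deduplicate the source list.
    summary = lm(f"Summarize sections: {list(sections)}")
    return {"summary": summary, "sections": sections,
            "sources": sorted(set(snippets))}
```

With this shape the number of LM calls grows as max_perspective × max_conv_turn × 3 plus a few fixed calls, which is why the two knobs dominate cost.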

The exact LM-call-by-LM-call prompt sequence is here: storm_prompt_sequence.html ← click through.

Sample run on SAEs

Input:

--topic "Sparse autoencoders in language model interpretability"
--retriever serper
--max-conv-turn 2 --max-perspective 2 --search-top-k 3

Ran with Claude Code as the LM (a custom dspy adapter shelling out to claude -p) plus Serper for web retrieval. Wall-clock: ~7m45s. ~30 LM calls, 18 Serper queries.
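
The shell-out part of that adapter might look roughly like this. It is a sketch of the subprocess call only: the dspy wrapper around it is not shown because these notes do not record it, and the function name, the command parameter, and the timeout are my own assumptions.

```python
import subprocess
from typing import Sequence


def call_cli_lm(prompt: str,
                command: Sequence[str] = ("claude", "-p"),
                timeout_s: float = 300.0) -> str:
    """Send one prompt to a command-line LM and return its stdout.

    The prompt is passed as the final argument, e.g. `claude -p "<prompt>"`.
    Raises CalledProcessError on a non-zero exit and TimeoutExpired on timeout.
    """
    result = subprocess.run(
        [*command, prompt],
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,
    )
    return result.stdout.strip()
```

Keeping the command injectable means the same function can be pointed at any other CLI (or a stub) when testing without burning LM calls.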

Output: a 36K-character Wikipedia-style article with citations to Anthropic Circuits Thread, OpenAI SAE paper, arxiv, OpenReview. Full article: storm_sae_article.txt.

Excerpt — the opening summary paragraph:

Sparse autoencoders (SAEs) are a class of neural network architectures that have emerged as a prominent tool in mechanistic interpretability, a subfield of artificial intelligence research that aims to open the black box of neural networks and rigorously explain the underlying computations they perform. Their primary application is the decomposition of the complex, superposed internal representations of large language models (LLMs) into components that correspond to distinct, human-interpretable semantic concepts. […]

Honest gaps for Leaflet: STORM produces a clean article but doesn’t fully match what Leaflet step 1 wants — no explicit history/timeline section, people aren’t named (Olah, Cunningham, Bricken, Templeton, Nanda), datasets/open-source assets aren’t enumerated, and the outline is parametric-knowledge-driven rather than literature-driven. Natural upgrades for the Leaflet variant: academic-only retriever (Semantic Scholar / paper-qa), explicit history + people + datasets sub-prompts, and a citation-precision verification pass.

Step 2 — Hypothesis generation (Co-Scientist)

The hypothesis-generation step is an open-source re-implementation of Google’s AI co-scientist (Gottweis et al., Nature, 2026 / Google research blog). The agent roster, prompts, and control flow follow the paper.

The agents:

Generation — proposes hypotheses via literature review and simulated scientific debate.
Reflection — reviews hypotheses for novelty, correctness, and testability; deep-verifies the underlying assumptions.
Ranking — runs an Elo tournament with simulated debates between hypotheses.
Evolution — combines, simplifies, makes more feasible, or out-of-box-reimagines top-ranked hypotheses.
Proximity — embeds and clusters hypotheses to drive dedup and informative tournament pairings.
Meta-review — synthesizes system-wide feedback and the final research overview.
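
For the Ranking agent, the Elo bookkeeping is the easy part to pin down. Below is a sketch using the standard Elo update; the K-factor of 32, the starting rating of 1200, and the judge callable (which stands in for a simulated debate and returns True when the first hypothesis wins) are assumptions, not values taken from the paper or the re-implementation.

```python
import itertools
from typing import Callable, Dict, List, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> Tuple[float, float]:
    """Return the new (A, B) ratings after one match; total rating is conserved."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def run_tournament(hypotheses: List[str],
                   judge: Callable[[str, str], bool],
                   rounds: int = 1,
                   start: float = 1200.0) -> List[Tuple[str, float]]:
    """Round-robin tournament; returns (hypothesis, rating) sorted best first."""
    ratings: Dict[str, float] = {h: start for h in hypotheses}
    for _ in range(rounds):
        for a, b in itertools.combinations(hypotheses, 2):
            ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], judge(a, b))
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
```

A full round-robin is quadratic in the number of hypotheses, which is presumably why the Proximity agent is used to pick informative pairings instead.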

A Supervisor parses the goal into a research plan and schedules agent tasks through a durable SQLite-backed queue.
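
A durable SQLite-backed queue can be very small. This is a sketch of the idea, not the re-implementation's schema: the table name, columns, and status values are all assumptions, and it is only safe for a single worker process.

```python
import sqlite3
from typing import Optional, Tuple

SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    agent   TEXT NOT NULL,
    payload TEXT NOT NULL,
    status  TEXT NOT NULL DEFAULT 'pending'
)
"""


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the queue; pass a file path to survive restarts."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn


def enqueue(conn: sqlite3.Connection, agent: str, payload: str) -> int:
    """Add a pending task for the named agent and return its id."""
    cur = conn.execute(
        "INSERT INTO tasks (agent, payload) VALUES (?, ?)", (agent, payload))
    conn.commit()
    return cur.lastrowid


def claim_next(conn: sqlite3.Connection) -> Optional[Tuple[int, str, str]]:
    """Mark the oldest pending task as running and return it, or None."""
    row = conn.execute(
        "SELECT id, agent, payload FROM tasks "
        "WHERE status = 'pending' ORDER BY id LIMIT 1").fetchone()
    if row is None:
        return None
    conn.execute("UPDATE tasks SET status = 'running' WHERE id = ?", (row[0],))
    conn.commit()
    return row


def mark_done(conn: sqlite3.Connection, task_id: int) -> None:
    """Record that a task finished."""
    conn.execute("UPDATE tasks SET status = 'done' WHERE id = ?", (task_id,))
    conn.commit()
```

Because every state change is committed, a crashed run can be resumed by re-queuing whatever is still marked running.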

The exact LM-call-by-LM-call prompt sequence is here: co_scientist_prompt_sequence.html ← click through.

Sample run on SAEs

Input:

co-scientist run "test SAEs as way to interpret neural networks" \
  --n 3 --budget-usd 2.0 --wall-clock 900

Output: the Meta-review agent’s final research overview. Top-ranked hypothesis was SAE feature-splitting as a causal concept hierarchy — test whether features that appear at smaller dictionary sizes have greater causal effect on downstream task performance than features only appearing at larger sizes. Full overview: co_scientist_sae_overview.md.

Excerpt — the executive summary:

The tournament converged on a single top-ranked hypothesis exploring how Sparse Autoencoder (SAE) feature-splitting hierarchies might encode causal structure within neural networks. The core idea — that features learned at smaller dictionary sizes are causally more central than those appearing only at larger sizes — is a compelling operationalization of mechanistic interpretability that bridges representation learning and causal inference. […] Acting on it could yield a principled, falsifiable framework for evaluating whether SAEs do more than describe activations — whether they actually illuminate how models compute.

Concrete first experiment proposed: train SAEs at 3–5 dictionary sizes (512 / 2048 / 8192 / 32768 features) on a single residual-stream layer of a small open-weight LLM (Pythia-1.4B or GPT-2-XL); compute activation-patching effect sizes per feature on 2–3 downstream tasks; test whether mean effect size is significantly higher for features in the smallest SAE vs. the largest. 4–8 weeks of work with standard interpretability tooling (TransformerLens, SAELens).
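
The overview does not say which significance test to use for the last step. One option that needs no distributional assumptions is a one-sided permutation test on the per-feature effect sizes; the sketch below is my choice of test, not something the co-scientist run specified, and the function and parameter names are hypothetical.

```python
import random
from statistics import mean
from typing import Sequence


def permutation_test(small_sae: Sequence[float], large_sae: Sequence[float],
                     n_perm: int = 2000, seed: int = 0) -> float:
    """One-sided p-value for H1: mean(small_sae) > mean(large_sae).

    Inputs are per-feature activation-patching effect sizes. Group labels
    are shuffled n_perm times; the p-value is the fraction of shuffles whose
    mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = mean(small_sae) - mean(large_sae)
    pooled = list(small_sae) + list(large_sae)
    n_small = len(small_sae)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_small]) - mean(pooled[n_small:])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)
```

With 2 to 3 downstream tasks and several dictionary sizes this gets run many times, so some multiple-comparison correction would be needed on top.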

Step 3 — Experimentation and writeup

Detailed setup pending — this step uses Sakana’s AI Scientist v2 (cloned at ~/AI-Scientist-v2/), with ShinkaEvolve inside it where evolutionary search beats greedy iteration, and AlphaEvolve as the design reference. The hypothesis from step 2 is the input. The output is a paper + code that ran.

specifics

Leaflet — the maximally designed group of agents

The group is called Leaflet. It is a group of agents — not a pipeline, not a framework, not a system. The framing matters: a group has members with roles, and the members have character.

what Leaflet does

Leaflet takes in any research direction and produces outcomes — by which I mean a tested hypothesis written up as a paper, with code that ran and data that was collected, grounded in a survey of the field it came from.

The input is a research direction. Example I keep coming back to:

“SAEs’ effectiveness in mechanistic interpretability, and alternate methods.”

The output, at the end, is something I can sit down and read.

the steps

Leaflet runs in three core steps, with a fourth step (the gym) that exists to make Leaflet itself better over time.

Step 1 — Survey (STORM-shaped)

The survey step is designed after Stanford’s STORM. When I hand Leaflet a research direction, the survey agent performs a thorough survey of the field.

The output of this step is a thoroughly examined Wikipedia-style page for the field. It must contain:

Complete history of the field
All relevant papers and the people behind them
Important milestones in the field
State of the art and recent breakthroughs
All available open-source code that’s important
Available datasets
Everything else worth knowing about the field

This is a real survey writeup — not a list, not bullet points handed off to the next agent. It writes the complete survey down. The next step reads it.
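
A cheap way to catch a survey that skips one of the required parts is to check its section headings against the list above. This is a crude sketch: the keyword lists are my own guesses, substring matching will produce false positives, and the catch-all "everything else" item is deliberately left out because it cannot be checked this way.

```python
from typing import Dict, List, Sequence, Tuple

# Required content -> keywords that would count as covering it (assumed).
REQUIRED_SECTIONS: Dict[str, Tuple[str, ...]] = {
    "history": ("history", "timeline"),
    "papers and people": ("papers", "people", "authors"),
    "milestones": ("milestone",),
    "state of the art": ("state of the art", "sota", "breakthrough"),
    "open-source code": ("code", "open-source", "open source"),
    "datasets": ("dataset",),
}


def missing_sections(headings: Sequence[str]) -> List[str]:
    """Return the required items that no heading appears to cover."""
    lowered = [h.lower() for h in headings]
    missing = []
    for name, keywords in REQUIRED_SECTIONS.items():
        if not any(k in h for h in lowered for k in keywords):
            missing.append(name)
    return missing
```

A non-empty result would be the trigger for a targeted sub-prompt rather than a full rerun of the survey.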

Step 2 — Hypothesis generation (co-scientist-shaped, with poetry)

The hypothesis generation step is based on Google’s co-scientist. Its guiding principle is: quantity first, quality second. Generate broadly, then filter.

Its output is a list of hypotheses to test.

Internally it works very much like co-scientist:

Generation agent proposes hypotheses
Reflection agent reflects on them
Ranking agent ranks them
Review / evaluation agent evaluates them
The loop iterates

The quirk — and this matters to me: along with reflecting, this step also writes poems and stories about the hypotheses. I want a creative element here. If there are neurons that activate on creativity, I want them activated in this step. So the step produces both:

The list of hypotheses (handed to step 3)
Stories and poems about the hypotheses (written down, kept)

It writes it all down. Both the structured list and the creative artifacts are outputs of this step.
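
The loop and its two outputs can be sketched as follows. This is only a skeleton under assumptions: generate, score, and write_creative are hypothetical callables, and score collapses the Reflection, Ranking, and Review agents into a single number, which the real step would not do.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step2Output:
    hypotheses: List[str] = field(default_factory=list)  # handed to step 3
    creative: List[str] = field(default_factory=list)    # poems and stories, kept


def run_step2(survey: str,
              generate: Callable[[str, List[str], int], List[str]],
              score: Callable[[str], float],
              write_creative: Callable[[str], str],
              n_generate: int = 20, keep: int = 5,
              iterations: int = 2) -> Step2Output:
    pool: List[str] = []
    for _ in range(iterations):
        # Quantity first: generate broadly, conditioned on the current pool.
        pool.extend(generate(survey, pool, n_generate))
        # Drop exact duplicates while preserving order.
        pool = list(dict.fromkeys(pool))
        # Quality second: keep only the top-scoring hypotheses.
        pool = sorted(pool, key=score, reverse=True)[:keep]
    creative = [write_creative(h) for h in pool]
    return Step2Output(hypotheses=pool, creative=creative)
```

Here the creative writing happens once at the end; running it inside the loop, so the poems can feed back into generation, is an equally plausible design.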

Step 3 — Experimentation and writeup (Sakana-shaped)

The list of hypotheses is passed to Sakana’s AI Scientist v2. It tests each hypothesis thoroughly and writes a paper for it. This is what I read at the end.

This step:

Saves all the code it produces
Tests the code well
Uses ShinkaEvolve-like tools to optimize specific steps where evolutionary search beats greedy iteration
Can use other tools too as needed

Step 4 — The gym (future, FutureHouse-shaped)

At the end of all this, I will have a gym built where my leaflets can go to get better. The gym is based on FutureHouse’s Aviary for now. More on this later — it gets designed after the first three steps are built and running.

how the steps connect

research direction
       │
       ▼
   ┌───────────────────────────┐
   │ Step 1 — Survey (STORM)   │ → Wikipedia-style page (history, papers,
   └───────────────────────────┘   people, milestones, SOTA, code, data)
       │
       ▼
   ┌───────────────────────────────────┐
   │ Step 2 — Hypotheses (co-scientist)│ → list of hypotheses
   │ Generation ↔ Reflection ↔ Ranking │ → poems & stories (kept)
   │ ↔ Review (+ creative writing)     │
   │ Principle: quantity > quality     │
   └───────────────────────────────────┘
       │
       ▼
   ┌──────────────────────────────────┐
   │ Step 3 — Sakana AI Scientist     │ → papers (I read these)
   │ Tests each hypothesis thoroughly │ → code (saved, tested)
   │ Uses ShinkaEvolve where useful   │
   └──────────────────────────────────┘
       │
       ▼
   ┌─────────────────────────────────┐
   │ Step 4 — Gym (FutureHouse-based)│ → leaflets get better over time
   └─────────────────────────────────┘

a validation — Google organises along the same three axes

Good news: Google’s own Labs / Science page organises their scientific-AI work along the same three sections I’ve landed on for Leaflet:

Literature agent — Google has NotebookLM. I have STORM.
Hypothesis generator — Google has AI co-scientist. I have a co-scientist-shaped step.
Experimentation agent — Google has AlphaEvolve. I’m using Sakana AI Scientist, with references taken from AlphaEvolve.

Independently arriving at the same three-step decomposition is a useful sanity check.

principles that span the group

Creativity is a real ingredient, not decoration. The poems-and-stories quirk in step 2 is on purpose. The hypothesis step is the step where breadth, surprise, and imagination matter most. The creative artifacts are not throwaway — they’re written down because writing them is what activates the right mode of thinking.
Quantity first, quality second — but only inside step 2. Step 1 is comprehensive (not “lots of surveys”), step 3 is rigorous (not “lots of papers”). The breadth/depth balance is intentional and lives in different places at different steps.
Each step writes its full output down. Survey → full page. Hypothesis gen → list + creative writing. Experiments → papers + code. Nothing is summarized away between steps.
The gym is a self-improvement layer, not a fourth research step. Step 4 is about Leaflet, not produced by Leaflet.
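
The "writes its full output down" principle could be enforced with one small helper that every step calls. This is a sketch with an assumed directory layout (one folder per step plus a manifest); the function name and file names are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict


def save_step_output(run_dir: Path, step: str, files: Dict[str, str]) -> Path:
    """Write every artifact of a step to disk, plus a manifest listing them.

    `files` maps file name -> full text content. Nothing is summarized:
    the next step is expected to read these files directly.
    """
    step_dir = run_dir / step
    step_dir.mkdir(parents=True, exist_ok=True)
    for name, content in files.items():
        (step_dir / name).write_text(content, encoding="utf-8")
    manifest = {"step": step, "files": sorted(files)}
    (step_dir / "manifest.json").write_text(
        json.dumps(manifest, indent=2), encoding="utf-8")
    return step_dir
```

For step 2 this would be called with both the hypothesis list and the poems, so the creative artifacts are stored next to the structured ones.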

plan sketching readings

Before I clone Sakana’s AI Scientist and run it, I want to skim every related paper — just in case something better than this already exists. Working dump below.

The stack I like

The auto-research pipeline isn’t one agent — it’s four layers:

Hypothesis generation — Stanford STORM (open) or Google’s AI co-scientist (design reference) → produces a ranked list of research directions from the literature.
Experiment + writeup — Sakana’s AI Scientist v2 → takes one of those directions and runs the full ideation → BFTS experiments → paper pipeline.
Algorithm/code evolution — Sakana’s ShinkaEvolve → optionally evolves the code/algorithm inside the experiment step.
Benchmark + evaluate the agent system — FutureHouse’s Aviary (the gym) → at the end, once we’ve built or chained these agents, we run them through a standardized arena to measure how good the system actually is. Same role for agents that ImageNet played for classifiers.

AI Scientist v2 is built for one hypothesis → one paper, not for surveying — the first step belongs to a different class of tool. Aviary is a layer below the rest — not an agent, an environment for scoring agents — but that’s exactly why it sits at the end as the evaluation step.

Hypothesis-generation agents — the missing piece

The step that comes before AI Scientist. Survey the field, propose directions, debate, rank.

Design reference (not open):

Google — AI co-scientist (blog, Feb 2025) — multi-agent system designed exactly for this. Generation → Reflection → Ranking → Evolution → Meta-review. Piloted on biomedical questions with domain experts. Paper: arxiv 2502.18864. Code is gated (Trusted Tester Program); use this as the architectural reference, not a clone target.

Tier 1 — open-source, runnable today:

ResearchAgent (Baek et al., 2024) — paper: arxiv 2404.07738. Multi-agent: Idea Generator + Critic, iterated over a paper corpus. The closest academic analog to AI co-scientist; co-scientist is partly a refinement of this design. Worth one weekend to spin up.
AI Scientist v2’s ideation step alone — python ai_scientist/perform_ideation_temp_free.py. Uses Semantic Scholar + LLM reflection. Narrow (one hypothesis, not a ranked list) but works out of the box on our already-cloned repo.
Stanford STORM — multi-agent literature-survey writer. Caveat: STORM outputs a Wikipedia-style article, not a list of testable hypotheses. Useful as a landscape-prep tool, not as the hypothesis step itself.
FutureHouse — PaperQA2 — agentic paper Q&A. Building block.
FutureHouse — Aviary — gymnasium for scientific agents, including hypothesis-gen environments. Building block.

Tier 2 — commercial, instant, no code:

OpenAI Deep Research / Anthropic / Google’s equivalents — one carefully crafted prompt (“survey SAE research at Anthropic/DeepMind/Goodfire; produce 10 ranked hypotheses with claim + novelty + feasibility + experiment shape”) realistically gets 80% of what co-scientist gives. Fastest path.
Roll-your-own Claude/GPT with Semantic Scholar tool — ~50 lines of Python. Most flexible if you want to control the prompt + scoring rubric.
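
The roll-your-own option really is short. A sketch, assuming the public Semantic Scholar Graph API paper-search endpoint and a hypothetical lm callable (prompt in, text out); the prompt wording is mine, the rubric fields are the ones listed above, and the fetcher is injectable so the logic can be tried without network access.

```python
import json
from typing import Callable, Dict, List
from urllib.parse import urlencode
from urllib.request import urlopen

S2_SEARCH = "https://api.semanticscholar.org/graph/v1/paper/search"


def build_search_url(query: str, limit: int = 10) -> str:
    """URL for a keyword search returning title, year, and abstract."""
    params = {"query": query, "limit": limit, "fields": "title,year,abstract"}
    return f"{S2_SEARCH}?{urlencode(params)}"


def fetch_json(url: str) -> Dict:
    """Default fetcher: plain HTTP GET, JSON-decoded."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def build_hypothesis_prompt(topic: str, papers: List[Dict], n: int = 10) -> str:
    """Prompt asking for n ranked hypotheses grounded in the listed papers."""
    listing = "\n".join(f"- {p.get('title')} ({p.get('year')})" for p in papers)
    return (
        f"Topic: {topic}\n"
        f"Recent papers:\n{listing}\n\n"
        f"Produce {n} ranked hypotheses. For each give: claim, novelty, "
        f"feasibility, experiment shape."
    )


def propose_hypotheses(topic: str, lm: Callable[[str], str],
                       fetch: Callable[[str], Dict] = fetch_json,
                       n: int = 10) -> str:
    """Search the literature, then ask the LM for ranked hypotheses."""
    papers = fetch(build_search_url(topic)).get("data", [])
    return lm(build_hypothesis_prompt(topic, papers, n))
```

The scoring rubric lives entirely in the prompt string, which is the flexibility argument for this route over a hosted deep-research product.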

Adjacent (not direct substitutes):

Edison Scientific — platform — research-agent platform; placeholder note until I dig in.
POPPER (Huang et al., 2025) — automated falsification of hypotheses. Runs after you already have a hypothesis, in order to test it.
Elicit / Consensus / ResearchRabbit — literature search tools; useful as a data layer under a generator.

Honest recommendation: weekend project — run ResearchAgent + Deep Research on “sparse autoencoders for LLM interpretability”; compare outputs; pick the better one; feed one hypothesis to AI Scientist v2. Don’t build a co-scientist clone yourself unless these outputs are unusable — that’s a separate 1–2 week project that delays your first paper.

AI Scientist v2 (Sakana)

The main reference. Clone this. Run this.

End-to-end agentic system that has produced the first AI-written workshop paper accepted through peer review. Pipeline: ideation → agentic tree search (BFTS) experiments → writeup → LLM/VLM peer review.

Code: SakanaAI/AI-Scientist-v2 — ai_scientist module
Paper (Nature): s41586-026-10265-5
Benchmark reference: SWE-bench (original)
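
To keep the search idea straight in my head: assuming BFTS here means best-first tree search, the generic version is a priority queue over candidate nodes. The sketch below is the textbook algorithm on a toy problem, not Sakana's implementation; in the real system a node would be an experiment attempt and expand would be an LLM proposing follow-up attempts.

```python
import heapq
import itertools
from typing import Callable, Iterable, List, Tuple, TypeVar

Node = TypeVar("Node")


def best_first_search(root: Node,
                      expand: Callable[[Node], Iterable[Node]],
                      score: Callable[[Node], float],
                      budget: int = 20) -> Tuple[Node, float]:
    """Expand the highest-scoring frontier node until the budget runs out.

    Returns the best node seen and its score.
    """
    counter = itertools.count()  # tie-breaker so heapq never compares nodes
    root_score = score(root)
    # heapq is a min-heap, so scores are stored negated.
    frontier: List[Tuple[float, int, Node]] = [(-root_score, next(counter), root)]
    best, best_score = root, root_score
    expanded = 0
    while frontier and expanded < budget:
        _, _, node = heapq.heappop(frontier)
        expanded += 1
        for child in expand(node):
            child_score = score(child)
            if child_score > best_score:
                best, best_score = child, child_score
            heapq.heappush(frontier, (-child_score, next(counter), child))
    return best, best_score
```

Compared with greedy iteration, the frontier keeps weaker branches alive, so the search can back up when the currently best branch stalls.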

ShinkaEvolve (Sakana)

Algorithm evolution: an LLM-driven evolutionary search that mutates and selects programs against a score. Need to try this on Codeforces.

Blog: Sakana — ShinkaEvolve
Code: SakanaAI/ShinkaEvolve

TODO: run ShinkaEvolve over Codeforces problems.

ASAL — searching for artificial life (Sakana)

This looks like cool shit. Searching for artificial life. Whaaaaaaaat. Need to read the hell out of this. Fucking awesome hell of a paper.

Blog + code: Sakana — ASAL

Adjacent question, todo: what is Core War? Understand it. (Probably the 1984 programming game where programs fight in a virtual machine — ancestor of artificial-life simulations. Need to confirm and pull the thread to ASAL.)

Trinity (Sakana)

A trained manager-worker-evaluator. Worth looking into — same shape as the auto-research stack, but the orchestration itself is learned.

Blog: Sakana — Trinity

Adjacent — Darwin Gödel Machine + open questions

Darwin Gödel Machine (Sakana) — self-improving system; relevant to the “design your way of auto research” question.

Open question, todo: is “saksham AI” doing something like this? If not, find it and propose it. (Note for myself: this might have been a voice-typo for “Sakana” — but if I really meant a different org, fill in here.)

concrete next steps

Skim each link above. Note what’s actually novel vs marketing.
Decide if AI Scientist v2 is still the right base, or if Trinity / DGM gives a better starting point.
Clone the chosen base. Get one full run going end-to-end on a small problem.
Then start replacing pieces with my own architecture.

v1, 2026-05-24. Working notes; not polished. Will grow as I read.