This writing is not finished yet. It will not make sense: it is random chunks of ideas, and the links between them are only in my head right now. Subscribe to get notified when it is finished.

autoresearcher

I was studying Neuronpedia and ran SAE training on my laptop. I was going to run a few hypotheses myself, but it seemed pointless. Why should I be running experiments one by one? I have LLMs that can run tens of simultaneous threads. They can co-research with me. So I left the SAE work for the time being and started building my own autoresearch agent. We call it Leaflet.


autoresearcher design

The Leaflet stack has three core steps. This section is the working framework — what each step actually does, the prompt sequence that drives it, and a real sample I ran on the SAE topic.

Step 1 — Survey (STORM)

The survey step is Stanford’s STORM (Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking — paper / code).

Given a topic string, STORM produces a Wikipedia-style article from scratch by:

  1. Generating writer personas (from related Wikipedia tables of contents).
  2. Simulating one multi-turn conversation per persona — each turn = ask a question, decompose into search queries, retrieve snippets, answer.
  3. Drafting an outline from parametric knowledge, then refining it using the conversations.
  4. Writing each section using the retrieved snippets, with citations.
  5. Adding a summary at the top and deduping.
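
The five steps above can be sketched as a plain Python skeleton. This is not STORM's code or API: `ask_lm` and `search` are hypothetical stand-ins for an LM call and a retriever, the prompts are placeholders, and the parameter names only mirror the flags used in the sample run.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical stand-ins: an LM call (prompt -> text) and a retriever (query -> snippets).
AskLM = Callable[[str], str]
Search = Callable[[str], list[str]]


@dataclass
class Conversation:
    persona: str
    turns: list[tuple[str, str]] = field(default_factory=list)  # (question, answer)
    snippets: list[str] = field(default_factory=list)


def survey(topic: str, ask_lm: AskLM, search: Search,
           max_perspective: int = 2, max_conv_turn: int = 2) -> str:
    # 1. Generate writer personas, one per line of the LM reply.
    personas = ask_lm(f"List personas for: {topic}").splitlines()[:max_perspective]

    # 2. One multi-turn conversation per persona:
    #    ask a question, decompose into queries, retrieve, answer.
    conversations = []
    for persona in personas:
        conv = Conversation(persona)
        for _ in range(max_conv_turn):
            question = ask_lm(f"As {persona}, ask a question about {topic}")
            queries = ask_lm(f"Search queries for: {question}").splitlines()
            found = [s for q in queries for s in search(q)]
            answer = ask_lm(f"Answer {question!r} using: {found}")
            conv.turns.append((question, answer))
            conv.snippets.extend(found)
        conversations.append(conv)

    # 3. Draft an outline from parametric knowledge, then refine it.
    draft = ask_lm(f"Outline for: {topic}")
    notes = [t for c in conversations for t in c.turns]
    outline = ask_lm(f"Refine outline {draft!r} using: {notes}").splitlines()

    # 4. Write each section from the retrieved snippets.
    snippets = [s for c in conversations for s in c.snippets]
    sections = [ask_lm(f"Write section {h!r} citing: {snippets}") for h in outline]

    # 5. Put a summary on top and drop duplicate sections (order preserved).
    body = list(dict.fromkeys(sections))
    summary = ask_lm(f"Summarize: {body}")
    return "\n\n".join([summary, *body])
```

Passing the LM and the retriever in as callables keeps the control flow readable and lets either one be swapped (for example an academic-only retriever) without touching the steps.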

The exact LM-call-by-LM-call prompt sequence is here: storm_prompt_sequence.html ← click through.

Sample run on SAEs

Input:

--topic "Sparse autoencoders in language model interpretability"
--retriever serper
--max-conv-turn 2 --max-perspective 2 --search-top-k 3

Run on Claude Code as the LM (custom dspy adapter shelling out to claude -p) + Serper for web retrieval. Wall-clock: ~7m45s. ~30 LM calls, 18 Serper queries.
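
A minimal sketch of the "shelling out to claude -p" part, assuming the Claude Code CLI is on PATH and prints the completion to stdout. This is not the dspy adapter itself: `run_cli` and `claude_complete` are hypothetical helper names, and a real adapter would wrap something like this in whatever LM interface dspy expects.

```python
import subprocess
from typing import Callable, Sequence

# The runner is injectable so the wrapper can be exercised without the CLI installed.
Runner = Callable[[Sequence[str]], str]


def run_cli(cmd: Sequence[str]) -> str:
    """Run a command and return its stdout, raising on a non-zero exit code."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def claude_complete(prompt: str, runner: Runner = run_cli) -> str:
    """Send one prompt to `claude -p` and return the reply with whitespace stripped."""
    return runner(["claude", "-p", prompt]).strip()
```

Passing the command as a list avoids shell quoting problems when the prompt contains quotes or newlines.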

Output: a 36K-character Wikipedia-style article with citations to Anthropic Circuits Thread, OpenAI SAE paper, arxiv, OpenReview. Full article: storm_sae_article.txt.

Excerpt — the opening summary paragraph:

Sparse autoencoders (SAEs) are a class of neural network architectures that have emerged as a prominent tool in mechanistic interpretability, a subfield of artificial intelligence research that aims to open the black box of neural networks and rigorously explain the underlying computations they perform. Their primary application is the decomposition of the complex, superposed internal representations of large language models (LLMs) into components that correspond to distinct, human-interpretable semantic concepts. […]

Honest gaps for Leaflet: STORM produces a clean article but doesn’t fully match what Leaflet step 1 wants — no explicit history/timeline section, people aren’t named (Olah, Cunningham, Bricken, Templeton, Nanda), datasets/open-source assets aren’t enumerated, and the outline is parametric-knowledge-driven rather than literature-driven. Natural upgrades for the Leaflet variant: academic-only retriever (Semantic Scholar / paper-qa), explicit history + people + datasets sub-prompts, and a citation-precision verification pass.

Step 2 — Hypothesis generation (Co-Scientist)

The hypothesis-generation step is an open-source re-implementation of Google’s AI co-scientist (Gottweis et al., Nature, 2026 / Google research blog). The agent roster, prompts, and control flow follow the paper.

The agents:

A Supervisor parses the goal into a research plan and schedules agent tasks through a durable SQLite-backed queue.
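
A minimal sketch of what a durable SQLite-backed task queue could look like, using only the standard library. The table layout, status values, and function names are assumptions for illustration, not the schema of the re-implementation, and the claim step assumes a single worker process.

```python
import sqlite3
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    agent   TEXT NOT NULL,
    payload TEXT NOT NULL,
    status  TEXT NOT NULL DEFAULT 'pending'
)
"""


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the queue; pass a file path to survive restarts."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn


def enqueue(conn: sqlite3.Connection, agent: str, payload: str) -> int:
    """Add a pending task for an agent and return its id."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO tasks (agent, payload) VALUES (?, ?)", (agent, payload)
        )
    return cur.lastrowid


def claim_next(conn: sqlite3.Connection) -> Optional[tuple]:
    """Mark the oldest pending task as running and return (id, agent, payload)."""
    with conn:
        row = conn.execute(
            "SELECT id, agent, payload FROM tasks "
            "WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE tasks SET status = 'running' WHERE id = ?", (row[0],))
    return row


def finish(conn: sqlite3.Connection, task_id: int) -> None:
    """Mark a task as done."""
    with conn:
        conn.execute("UPDATE tasks SET status = 'done' WHERE id = ?", (task_id,))
```

Because every state change is a committed row update, a crashed run can be resumed by re-opening the same file and re-queuing anything still marked as running.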

The exact LM-call-by-LM-call prompt sequence is here: co_scientist_prompt_sequence.html ← click through.

Sample run on SAEs

Input:

co-scientist run "test SAEs as way to interpret neural networks" \
  --n 3 --budget-usd 2.0 --wall-clock 900
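
A minimal sketch of how the two limit flags above could be enforced, assuming --budget-usd caps total spend in dollars and --wall-clock caps elapsed seconds. `RunBudget` and its methods are hypothetical names, not part of the tool.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunBudget:
    budget_usd: float
    wall_clock_s: float
    spent_usd: float = 0.0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, cost_usd: float) -> None:
        """Record the cost of one LM call."""
        self.spent_usd += cost_usd

    def exhausted(self, now: Optional[float] = None) -> bool:
        """True once either the dollar cap or the time cap has been reached."""
        now = time.monotonic() if now is None else now
        return (self.spent_usd >= self.budget_usd
                or now - self.started_at >= self.wall_clock_s)
```

A scheduler would check `exhausted()` before handing out each new agent task and stop scheduling once it returns True.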

Output: the Meta-review agent’s final research overview. Top-ranked hypothesis was SAE feature-splitting as a causal concept hierarchy — test whether features that appear at smaller dictionary sizes have greater causal effect on downstream task performance than features only appearing at larger sizes. Full overview: co_scientist_sae_overview.md.

Excerpt — the executive summary:

The tournament converged on a single top-ranked hypothesis exploring how Sparse Autoencoder (SAE) feature-splitting hierarchies might encode causal structure within neural networks. The core idea — that features learned at smaller dictionary sizes are causally more central than those appearing only at larger sizes — is a compelling operationalization of mechanistic interpretability that bridges representation learning and causal inference. […] Acting on it could yield a principled, falsifiable framework for evaluating whether SAEs do more than describe activations — whether they actually illuminate how models compute.

Concrete first experiment proposed: train SAEs at 3–5 dictionary sizes (512 / 2048 / 8192 / 32768 features) on a single residual-stream layer of a small open-weight LLM (Pythia-1.4B or GPT-2-XL); compute activation-patching effect sizes per feature on 2–3 downstream tasks; test whether mean effect size is significantly higher for features in the smallest SAE vs. the largest. 4–8 weeks of work with standard interpretability tooling (TransformerLens, SAELens).
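
The final comparison in that experiment (is the mean effect size higher for the smallest SAE than for the largest?) can be sketched as a one-sided permutation test. The overview does not name a statistical test, so this choice is my assumption, and the inputs are just lists of per-feature effect sizes that would come from the activation-patching step.

```python
import random
from statistics import mean
from typing import Sequence


def permutation_test(small: Sequence[float], large: Sequence[float],
                     n_permutations: int = 10_000, seed: int = 0) -> float:
    """One-sided p-value for the claim mean(small) > mean(large)."""
    rng = random.Random(seed)
    observed = mean(small) - mean(large)
    pooled = list(small) + list(large)
    k = len(small)
    hits = 0
    for _ in range(n_permutations):
        # Shuffle the labels: under the null, group membership is arbitrary.
        rng.shuffle(pooled)
        if mean(pooled[:k]) - mean(pooled[k:]) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_permutations + 1)
```

Call it as `permutation_test(effects_smallest_sae, effects_largest_sae)`; a small p-value means a difference this large rarely appears when the labels are shuffled. A permutation test avoids assuming the effect sizes are normally distributed, which seems unlikely for sparse features.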

Step 3 — Experimentation and writeup

Detailed setup pending — this step uses Sakana’s AI Scientist v2 (cloned at ~/AI-Scientist-v2/), with ShinkaEvolve inside it where evolutionary search beats greedy iteration, and AlphaEvolve as the design reference. The hypothesis from step 2 is the input. The output is a paper + code that ran.


specifics

Leaflet — the maximally designed group of agents

The group is called Leaflet. It is a group of agents — not a pipeline, not a framework, not a system. The framing matters: a group has members with roles, and the members have character.

what Leaflet does

Leaflet takes in any research direction and produces outcomes — by which I mean a tested hypothesis written up as a paper, with code that ran and data that was collected, grounded in a survey of the field it came from.

The input is a research direction. Example I keep coming back to:

“SAEs’ effectiveness in mechanistic interpretability, and alternate methods.”

The output, at the end, is something I can sit down and read.

the steps

Leaflet runs in three core steps, with a fourth step (the gym) that exists to make Leaflet itself better over time.

Step 1 — Survey (STORM-shaped)

The survey step is designed after Stanford’s STORM. When I hand Leaflet a research direction, the survey agent performs a thorough survey of the field.

The output of this step is a thoroughly examined Wikipedia-style page for the field. It must contain: history, papers, people, milestones, SOTA, code, and data.

This is a real survey writeup — not a list, not bullet points handed off to the next agent. It writes the complete survey down. The next step reads it.

Step 2 — Hypothesis generation (co-scientist-shaped, with poetry)

The hypothesis generation step is based on Google’s co-scientist. Its guiding principle is: quantity first, quality second. Generate broadly, then filter.

Its output is a list of hypotheses to test.
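
"Generate broadly, then filter" needs a filter. One way to do it is a pairwise tournament with Elo ratings, sketched below. The update rule is the standard Elo formula; `judge` is a hypothetical callable (in practice an LM comparing two hypotheses), and the starting rating and K-factor are arbitrary choices, not values taken from co-scientist.

```python
import itertools
from typing import Callable, Sequence


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def elo_rank(hypotheses: Sequence[str],
             judge: Callable[[str, str], str],
             k: float = 32.0, start: float = 1200.0) -> list:
    """Run one round-robin of pairwise matches and return (hypothesis, rating) sorted best first."""
    ratings = {h: start for h in hypotheses}
    for a, b in itertools.combinations(hypotheses, 2):
        winner = judge(a, b)  # must return either a or b
        score_a = 1.0 if winner == a else 0.0
        exp_a = expected_score(ratings[a], ratings[b])
        ratings[a] += k * (score_a - exp_a)
        ratings[b] += k * ((1.0 - score_a) - (1.0 - exp_a))
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
```

A full round-robin costs n(n-1)/2 judge calls, so with a large hypothesis pool it would make sense to sample matches instead of playing all of them.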

Internally it works very much like co-scientist: Generation, Reflection, Ranking, and Review feed into one another in a loop.

The quirk — and this matters to me: along with reflecting, this step also writes poems and stories about the hypotheses. I want a creative element here. If there are neurons that activate on creativity, I want them activated in this step. So the step produces both:

  1. The list of hypotheses (handed to step 3)
  2. Stories and poems about the hypotheses (written down, kept)

It writes it all down. Both the structured list and the creative artifacts are outputs of this step.

Step 3 — Experimentation and writeup (Sakana-shaped)

The list of hypotheses is passed to Sakana’s AI Scientist v2. It tests all the hypotheses thoroughly and writes papers. This is what I read at the end.

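This step tests each hypothesis thoroughly, uses ShinkaEvolve where useful, and saves the tested code alongside the papers.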

Step 4 — The gym (future, FutureHouse-shaped)

At the end of all this, I will have a gym built where my leaflets can go to get better. The gym is based on FutureHouse’s Aviary for now. More on this later — it gets designed after the first three steps are built and running.

how the steps connect

research direction


   ┌───────────────────────────┐
   │ Step 1 — Survey (STORM)   │ → Wikipedia-style page (history, papers,
   └───────────────────────────┘   people, milestones, SOTA, code, data)


   ┌───────────────────────────────────┐
   │ Step 2 — Hypotheses (co-scientist)│ → list of hypotheses
   │ Generation ↔ Reflection ↔ Ranking │ → poems & stories (kept)
   │ ↔ Review (+ creative writing)     │
   │ Principle: quantity > quality     │
   └───────────────────────────────────┘


   ┌──────────────────────────────────┐
   │ Step 3 — Sakana AI Scientist     │ → papers (I read these)
   │ Tests each hypothesis thoroughly │ → code (saved, tested)
   │ Uses ShinkaEvolve where useful   │
   └──────────────────────────────────┘


   ┌─────────────────────────────────┐
   │ Step 4 — Gym (FutureHouse-based)│ → leaflets get better over time
   └─────────────────────────────────┘
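
The handoffs in the diagram can be sketched as typed data passed between three callables. Every name below is hypothetical; the only point is which artifact each step reads and which it writes, including that the poems and stories are kept but not passed on.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HypothesisBatch:
    hypotheses: list[str]  # handed to step 3
    creative: list[str]    # poems and stories: written down and kept, not passed on


@dataclass
class LeafletRun:
    direction: str
    survey: str            # step 1 output, read by step 2
    batch: HypothesisBatch
    papers: list[str]      # step 3 output, one per hypothesis


def run_leaflet(direction: str,
                survey_step: Callable[[str], str],
                hypothesis_step: Callable[[str], HypothesisBatch],
                experiment_step: Callable[[str], str]) -> LeafletRun:
    """Chain the three core steps and keep every intermediate artifact."""
    survey = survey_step(direction)
    batch = hypothesis_step(survey)
    papers = [experiment_step(h) for h in batch.hypotheses]
    return LeafletRun(direction, survey, batch, papers)
```

Returning the whole `LeafletRun` rather than only the papers keeps the survey and the creative artifacts around, which is also what a later gym step would need in order to score a run.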

a validation — Google organises along the same three axes

Good news: Google’s own Labs / Science page organises their scientific-AI work along the same three sections I’ve landed on for Leaflet:

Independently arriving at the same three-step decomposition is a useful sanity check.

principles that span the group


plan sketching readings

Before I clone Sakana’s AI Scientist and run it, I want to skim every related paper — just in case something better than this already exists. Working dump below.

The stack I like

The auto-research pipeline isn’t one agent — it’s four layers:

  1. Hypothesis generation — Stanford STORM (open) or Google’s AI co-scientist (design reference) → produces a ranked list of research directions from the literature.
  2. Experiment + writeup — Sakana’s AI Scientist v2 → takes one of those directions and runs the full ideation → BFTS experiments → paper pipeline.
  3. Algorithm/code evolution — Sakana’s ShinkaEvolve → optionally evolves the code/algorithm inside the experiment step.
  4. Benchmark + evaluate the agent system — FutureHouse’s Aviary (the gym) → at the end, once we’ve built or chained these agents, we run them through a standardized arena to measure how good the system actually is. Same role for agents that ImageNet played for classifiers.

AI Scientist v2 is built for one hypothesis → one paper, not for surveying — the first step belongs to a different class of tool. Aviary is a layer below the rest — not an agent, an environment for scoring agents — but that’s exactly why it sits at the end as the evaluation step.

The step that comes before AI Scientist. Survey the field, propose directions, debate, rank.

Design reference (not open):

  • Google — AI co-scientist (blog, Feb 2025) — multi-agent system designed exactly for this. Generation → Reflection → Ranking → Evolution → Meta-review. Piloted on biomedical questions with domain experts. Paper: arxiv 2502.18864. Code is gated (Trusted Tester Program); use this as the architectural reference, not a clone target.

Tier 1 — open-source, runnable today:

  • ResearchAgent (Baek et al., 2024) — paper: arxiv 2404.07738. Multi-agent: Idea Generator + Critic, iterated over a paper corpus. The closest academic analog to AI co-scientist; co-scientist is partly a refinement of this design. Worth one weekend to spin up.
  • AI Scientist v2’s ideation step alone: python ai_scientist/perform_ideation_temp_free.py. Uses Semantic Scholar + LLM reflection. Narrow (one hypothesis, not a ranked list) but works out of the box on our already-cloned repo.
  • Stanford STORM — multi-agent literature-survey writer. Caveat: STORM outputs a Wikipedia-style article, not a list of testable hypotheses. Useful as a landscape-prep tool, not as the hypothesis step itself.
  • FutureHouse — PaperQA2 — agentic paper Q&A. Building block.
  • FutureHouse — Aviary — gymnasium for scientific agents, including hypothesis-gen environments. Building block.

Tier 2 — commercial, instant, no code:

  • OpenAI Deep Research / Anthropic / Google’s equivalents — one carefully crafted prompt (“survey SAE research at Anthropic/DeepMind/Goodfire; produce 10 ranked hypotheses with claim + novelty + feasibility + experiment shape”) realistically gets 80% of what co-scientist gives. Fastest path.
  • Roll-your-own Claude/GPT with Semantic Scholar tool — ~50 lines of Python. Most flexible if you want to control the prompt + scoring rubric.
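
A minimal sketch of the roll-your-own option: fetch papers from the Semantic Scholar Graph API paper search endpoint, then build a prompt asking for ranked hypotheses with the rubric fields from the Deep Research bullet above. The helper names and prompt wording are my own, the LM call itself is left out, and unauthenticated requests to the API are rate-limited.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"


def build_search_url(query: str, limit: int = 10,
                     fields: tuple = ("title", "abstract", "year")) -> str:
    """Build the paper-search URL with the requested fields."""
    params = {"query": query, "limit": limit, "fields": ",".join(fields)}
    return f"{SEARCH_URL}?{urllib.parse.urlencode(params)}"


def fetch_papers(query: str, limit: int = 10) -> list:
    """Return the list of paper dicts from the response's 'data' key (needs network)."""
    with urllib.request.urlopen(build_search_url(query, limit), timeout=30) as resp:
        return json.load(resp).get("data", [])


def build_prompt(topic: str, papers: list, n_hypotheses: int = 10) -> str:
    """Turn the retrieved papers into a hypothesis-generation prompt."""
    lines = [
        f"- {p.get('title')} ({p.get('year')}): {p.get('abstract') or 'no abstract'}"
        for p in papers
    ]
    return (
        f"Topic: {topic}\n"
        "Papers:\n" + "\n".join(lines) + "\n\n"
        f"Produce {n_hypotheses} ranked hypotheses. For each one give: "
        "claim, novelty, feasibility, experiment shape."
    )
```

The prompt returned by `build_prompt(topic, fetch_papers(topic))` can then go to whichever LM is at hand; the scoring rubric lives in one string, so it is easy to change.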

Adjacent (not direct substitutes):

  • Edison Scientific — platform — research-agent platform; placeholder note until I dig in.
  • POPPER (Huang et al., 2024) — automated falsification of hypotheses. Runs after you have a hypothesis to test it.
  • Elicit / Consensus / ResearchRabbit — literature search tools; useful as a data layer under a generator.

Honest recommendation: weekend project — run ResearchAgent + Deep Research on “sparse autoencoders for LLM interpretability”; compare outputs; pick the better one; feed one hypothesis to AI Scientist v2. Don’t build a co-scientist clone yourself unless these outputs are unusable — that’s a separate 1–2 week project that delays your first paper.

AI Scientist v2 is the main reference. Clone this. Run this.

End-to-end agentic system that has produced the first AI-written workshop paper accepted through peer review. Pipeline: ideation → agentic tree search (BFTS) experiments → writeup → LLM/VLM peer review.
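
To make the tree-search stage concrete, here is a generic best-first search, assuming BFTS stands for best-first tree search. This is a toy sketch, not AI Scientist v2's implementation: `expand` and `score` are hypothetical callables that in the real system would correspond to proposing experiment variants and evaluating their results.

```python
import heapq
import itertools
from typing import Callable, Iterable, TypeVar

Node = TypeVar("Node")


def best_first_search(root: Node,
                      expand: Callable[[Node], Iterable[Node]],
                      score: Callable[[Node], float],
                      max_expansions: int = 20):
    """Always expand the highest-scoring unexpanded node; return (best node, best score)."""
    counter = itertools.count()  # tie-breaker so the heap never compares nodes directly
    best, best_score = root, score(root)
    # heapq is a min-heap, so scores are stored negated.
    frontier = [(-best_score, next(counter), root)]
    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, node = heapq.heappop(frontier)
        for child in expand(node):
            s = score(child)
            if s > best_score:
                best, best_score = child, s
            heapq.heappush(frontier, (-s, next(counter), child))
    return best, best_score
```

Compared with greedy iteration, the frontier keeps weaker branches alive, so the search can back up to an earlier node when the current best line stops improving.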

ShinkaEvolve: algorithm evolution. It evolves programs to solve algorithmic problems. Need to try this on Codeforces.

TODO: run ShinkaEvolve over Codeforces problems.

This looks like cool shit. Searching for artificial life. Whaaaaaaaat. Need to read the hell out of this. Fucking awesome hell of a paper.

Adjacent question, todo: what is Core War? Understand it. (Probably the 1984 programming game where programs fight in a virtual machine — ancestor of artificial-life simulations. Need to confirm and pull the thread to ASAL.)

A trained manager-worker-evaluator. Worth looking into — same shape as the auto-research stack, but the orchestration itself is learned.

Open question, todo: is “saksham AI” doing something like this? If not, find it and propose it. (Note for myself: this might have been a voice-typo for “Sakana” — but if I really meant a different org, fill in here.)

concrete next steps

  1. Skim each link above. Note what’s actually novel vs marketing.
  2. Decide if AI Scientist v2 is still the right base, or if Trinity / DGM gives a better starting point.
  3. Clone the chosen base. Get one full run going end-to-end on a small problem.
  4. Then start replacing pieces with my own architecture.

v1, 2026-05-24. Working notes; not polished. Will grow as I read.
