Co-Scientist — Sequence of LM Prompts in One Run

The order in which LM calls fire when co-scientist processes one research goal, captured from a code walkthrough of the Kaimen-Inc fork on 2026-06-02. Modeled after the companion STORM doc.

The sequence

┌─ STAGE 0: Parse the goal (1 LM call) ─────────────────────────────────┐

  1. ParseGoal(goal)
     → prompt: "parse_goal.md"
     → forced tool: record_research_plan
     → produces a ResearchPlan that scopes the rest of the run
     → fallback: if the model doesn't emit the tool, supervisor uses a
       bare ResearchPlan and the session continues

  ──── supervisor then enqueues N initial Generation tasks ────

└──────────────────────────────────────────────────────────────────────┘

┌─ STAGE 1: Main loop ──────────────────────────────────────────────────┐
│  task scheduler runs concurrency-many tasks at once;                  │
│  every finished task may enqueue follow-ups;                          │
│  loop terminates on budget / wall-clock / Elo stability / idle        │

  ┌─ Generation (per initial task AND per Evolution offspring) ────┐
  │                                                                │
  │   2. Generation(goal, plan, system_feedback)                   │
  │      → prompt: "generation_literature.md"                      │
  │      → tool loop, up to 8 iterations:                          │
  │           pubmed_search | arxiv_search | europe_pmc_search |   │
  │           web_search | web_fetch | record_hypothesis (terminal)│
  │      → each iteration = 1 LM call                              │
  │      → result: new Hypothesis (if dedup passes)                │
  │      → follow-up: enqueue Reflection                           │
  └────────────────────────────────────────────────────────────────┘

  ┌─ Reflection (1 per new hypothesis) ────────────────────────────┐
  │                                                                │
  │   3. Reflection(hypothesis)                                    │
  │      → prompt: "reflection_review.md"  (= reflection.full)     │
  │      → tool loop, up to 8 iterations:                          │
  │           same lit-search tools + record_review (terminal)     │
  │      → result: novelty / correctness / testability scores      │
  │      → follow-up: enqueue Ranking.AddToTournament              │
  └────────────────────────────────────────────────────────────────┘

  ┌─ Ranking — Elo tournament ─────────────────────────────────────┐
  │                                                                │
  │   AddToTournament (non-LM): Elo=1200, schedule pairings        │
  │                                                                │
  │   4a. RankingDebate(A, B)   ← cold / Δelo<50  (low confidence) │
  │       → prompt: "ranking.debate.md"                            │
  │       → 1 LM call, no tool loop                                │
  │                                                                │
  │   4b. RankingPairwise(A, B) ← warm / Δelo≥50 (high confidence) │
  │       → prompt: "ranking.pairwise.md"                          │
  │       → 1 LM call, no tool loop                                │
  │                                                                │
  │   verdict ("better idea: 1|2") → Elo update → record match     │
  │   every ~20 matches → Proximity reclusters (no LM)             │
  └────────────────────────────────────────────────────────────────┘

  ┌─ Evolution (fires when leaderboard matures) ───────────────────┐
  │   trigger: ≥20 hypotheses, each with ≥3 matches                │
  │   one EvolveTopHypotheses task runs 3 strategies in sequence:  │
  │                                                                │
  │   5a. EvolutionCombine(top-1 paired with most-distant)         │
  │       → prompt: "evolution_combine.md"                         │
  │   5b. EvolutionSimplify(top-1)                                 │
  │       → prompt: "evolution_simplify.md"                        │
  │   5c. EvolutionOutOfBox(top-5 as inspiration)                  │
  │       → prompt: "evolution_out_of_box.md"                      │
  │       (a 4th strategy, "feasibility", ships but is off in M3)  │
  │                                                                │
  │   each strategy = 1 tool_loop call (≤6 iters); offspring       │
  │   re-enter the cycle via the Reflection follow-up rule.        │
  └────────────────────────────────────────────────────────────────┘

  ┌─ Meta-review (system feedback, fires periodically) ────────────┐
  │   6. MetaReviewSystem                                          │
  │      → prompt: "metareview_system.md"                          │
  │      → forced tool: record_system_feedback                     │
  │      → output injected into Generation / Evolution prompts as  │
  │        steering for subsequent hypotheses                      │
  │      → fires every ~50 matches when supervisor is idle         │
  └────────────────────────────────────────────────────────────────┘

└──────────────────────────────────────────────────────────────────────┘

┌─ STAGE 2: Finalize (1 LM call) ───────────────────────────────────────┐

  7. MetaReviewFinal(top hypotheses, all reviews, all feedback)
     → prompt: "metareview_final.md"
     → no tools; the model writes prose directly
     → SAVED as data/artifacts/<session_id>/final/overview.md

└──────────────────────────────────────────────────────────────────────┘
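
The Elo step in the Ranking box (new entries start at 1200, each verdict updates both ratings) can be sketched with the standard Elo formula. This is a minimal illustration, not the fork's code: the function names and the K-factor of 32 are assumptions, since the walkthrough does not state which K the fork uses.

```python
INITIAL_ELO = 1200.0  # from the diagram: AddToTournament seeds Elo=1200


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, winner: int, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one match.

    winner is 1 or 2, mirroring the "better idea: 1|2" verdict.
    k=32 is an assumed K-factor, not a value taken from the fork.
    """
    if winner not in (1, 2):
        raise ValueError("winner must be 1 or 2")
    score_a = 1.0 if winner == 1 else 0.0
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two fresh hypotheses, hypothesis 1 wins: ratings move symmetrically.
new_a, new_b = update_elo(INITIAL_ELO, INITIAL_ELO, winner=1)
print(new_a, new_b)
```

With equal starting ratings the expected score is 0.5, so the winner gains exactly k/2 points and the loser drops by the same amount.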

Concrete call count for our run

Run config: --n 3 initial hypotheses, --budget-usd 100, --wall-clock 900 (15 min), concurrency=2, default max_ideas=60, max_matches_per_idea=12, all tool_loop.*_max_iters at defaults. Our bottleneck is wall-clock time and claude -p subprocess latency (~3-5 s of cold start per LM call), not budget.

Prompt                 | Times fired | Why
-----------------------+-------------+------------------------------------------------------------
parse_goal             | 1           | once at session start
generation_literature  | 3 – 8       | 3 initial + offspring from Evolution; each is a full tool loop of 1 – 8 LM calls
reflection_review      | 3 – 8       | one Reflection per new hypothesis; each is a tool loop
ranking.debate         | 5 – 12      | cold matches and low-Δelo pairs (high-verbosity mode)
ranking.pairwise       | 8 – 20      | warm matches with Δelo≥50 (fast mode); dominates once the leaderboard is established
evolution_combine      | 0 – 2       | only fires after leaderboard maturity (≥20 hypotheses, ≥3 matches each); may not trigger at all in a 15-min subprocess-bound run
evolution_simplify     | 0 – 2       | same maturity gate
evolution_out_of_box   | 0 – 2       | same maturity gate
metareview_system      | 0 – 2       | every ~50 matches; small runs hit this 0–1 times
metareview_final       | 1           | always, once at termination
-----------------------+-------------+------------------------------------------------------------
Total LM calls         | ~35 – 80    | includes tool-loop iterations; wide because tool loops can short-circuit on the first record_* call or run all 8 iterations
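
As a sanity check, the per-prompt ranges above can be multiplied out into hard lower and upper bounds. This is a rough sketch that treats every range as independent, which is an assumption (match counts depend on how many hypotheses exist). The tuple layout and names are made up for illustration. The per-task iteration caps (8 for Generation and Reflection, 6 for Evolution, 1 elsewhere) come from the sequence diagram.

```python
# (prompt, min_tasks, max_tasks, min_calls_per_task, max_calls_per_task)
ROWS = [
    ("parse_goal", 1, 1, 1, 1),
    ("generation_literature", 3, 8, 1, 8),
    ("reflection_review", 3, 8, 1, 8),
    ("ranking.debate", 5, 12, 1, 1),
    ("ranking.pairwise", 8, 20, 1, 1),
    ("evolution_combine", 0, 2, 1, 6),
    ("evolution_simplify", 0, 2, 1, 6),
    ("evolution_out_of_box", 0, 2, 1, 6),
    ("metareview_system", 0, 2, 1, 1),
    ("metareview_final", 1, 1, 1, 1),
]

# Fewest tasks, each finishing on its first call.
lower = sum(min_t * min_c for _, min_t, _, min_c, _ in ROWS)
# Most tasks, each running its tool loop to the cap.
upper = sum(max_t * max_c for _, _, max_t, _, max_c in ROWS)

print(lower, upper)  # 21 200
```

The ~35 – 80 figure sits well inside these bounds, which is consistent with it being an estimate for a typical run and not a worst case.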

Notes about the order

Ranking has no tool loop. Unlike Generation, Reflection, and Evolution, each Ranking match is a single direct LM call — no literature search, no nested iterations. That's why the tournament can rack up many matches quickly while Generation and Reflection are slower per-hypothesis. The verdict is parsed from the trailing text via a regex looking for better idea: 1|2.
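
The verdict parsing described above could look roughly like the following. The pattern and function name are illustrative guesses, not the fork's actual regex. The sketch takes the last occurrence because the note says the verdict is read from the trailing text.

```python
import re
from typing import Optional

# Assumed pattern: case-insensitive "better idea:" followed by 1 or 2.
VERDICT_RE = re.compile(r"better idea:\s*([12])\b", re.IGNORECASE)


def parse_verdict(text: str) -> Optional[int]:
    """Return 1 or 2 from the last verdict marker, or None if absent."""
    matches = VERDICT_RE.findall(text)
    return int(matches[-1]) if matches else None


reply = "Idea 1 is cheaper, but idea 2 is more testable.\nbetter idea: 2"
print(parse_verdict(reply))  # 2
```

Returning None when no marker is found lets the caller decide whether to retry or discard the match instead of silently picking a winner.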
Reflection's modes, and Ranking.debate vs. Ranking.pairwise, are alternatives: only one fires per invocation. Reflection ships three modes in the supplementary pseudocode (full, verification, observation), but only full is wired up in this fork. Ranking's _select_mode() picks debate when either hypothesis's match counter is < 2 or the Elo gap is < 50, and pairwise otherwise. Evolution's three strategies, by contrast, are not mutually exclusive: every EvolveTopHypotheses task runs combine + simplify + out_of_box in sequence.
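
The mode-selection rule can be written as a small pure function. This is a sketch of the rule as stated in the note, not a copy of _select_mode(). The signature and parameter names are assumptions; the thresholds (2 matches, 50 Elo points) are the ones quoted above.

```python
def select_mode(
    matches_a: int,
    matches_b: int,
    elo_a: float,
    elo_b: float,
    min_matches: int = 2,
    elo_gap: float = 50.0,
) -> str:
    """Pick "debate" for cold or closely rated pairs, else "pairwise"."""
    cold = matches_a < min_matches or matches_b < min_matches
    close = abs(elo_a - elo_b) < elo_gap
    return "debate" if (cold or close) else "pairwise"


print(select_mode(0, 5, 1200.0, 1400.0))  # debate: A has played < 2 matches
print(select_mode(3, 3, 1200.0, 1230.0))  # debate: gap of 30 is < 50
print(select_mode(3, 3, 1200.0, 1250.0))  # pairwise: gap of exactly 50
```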
The tool loop is the load-bearing inner mechanism. Each Generation / Reflection / Evolution task is one outer task but up to 8 (Generation, Reflection) or 6 (Evolution) inner LM calls — one per assistant turn in the assistant ↔ tool_use ↔ tool_result loop. Loop ends when the model emits a "terminal" tool (record_hypothesis, record_review, record_system_feedback) or hits the max-iters cap. On the final allowed iteration the supervisor optionally forces the terminal tool so the run commits instead of looping on yet another search.
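
The loop shape described above can be sketched with a stub in place of the LM. Everything here is hypothetical scaffolding (ToolCall, tool_loop, stub_model and their signatures); the real machinery lives in co_scientist/llm/tool_loop.py and may differ. The sketch shows one LM call per iteration, early exit on the terminal tool, and forcing the terminal tool on the last allowed iteration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class ToolCall:
    name: str
    args: dict


# Hypothetical model interface: (history, forced_tool_or_None) -> ToolCall
Model = Callable[[List[tuple], Optional[str]], ToolCall]


def tool_loop(
    model: Model,
    tools: Dict[str, Callable[..., str]],
    terminal: str,
    max_iters: int = 8,
    force_on_last: bool = True,
) -> Tuple[Optional[dict], int]:
    """Run the assistant/tool loop; return (terminal args, LM calls used)."""
    history: List[tuple] = []
    for i in range(max_iters):
        is_last = i == max_iters - 1
        forced = terminal if (force_on_last and is_last) else None
        call = model(history, forced)  # exactly one LM call per iteration
        if call.name == terminal:
            return call.args, i + 1
        result = tools[call.name](**call.args)  # execute the search tool
        history.append((call, result))
    return None, max_iters  # cap reached without a terminal call


def stub_model(history: List[tuple], forced: Optional[str]) -> ToolCall:
    """Stand-in LM that keeps searching until the terminal tool is forced."""
    if forced is not None:
        return ToolCall(forced, {"text": f"hypothesis after {len(history)} searches"})
    return ToolCall("web_search", {"query": "q"})


tools = {"web_search": lambda query: f"results for {query}"}
result, calls = tool_loop(stub_model, tools, "record_hypothesis", max_iters=3)
print(result, calls)
```

With max_iters=3 the stub searches twice and is forced to commit on the third call, so the task costs three LM calls. That is the worst case the call-count table budgets for, scaled down from 8.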
Meta-review comes in two different shapes. metareview_system is steering feedback that gets injected back into Generation and Evolution prompts for subsequent hypotheses, so it shapes what the run produces next; it fires periodically (roughly every 50 matches). metareview_final is the readable end-of-run document and fires exactly once, at termination. Only the final overview is meant to be read by a person.
Proximity is non-LM. Embeddings (Voyage → OpenAI → hash fallback) plus sklearn clustering. It schedules a recluster every ~20 matches to drive informative pair selection — but it never calls the LM. Its presence in the agent roster is what makes co-scientist "7 agents" and not "6 LM agents".
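
The provider fallback chain mentioned above (Voyage, then OpenAI, then a hash fallback) can be illustrated without any network calls. This is a sketch under stated assumptions: the provider callables are stubs, the function names are invented, and the SHA-256 construction is only one plausible way to build a hash embedding, not necessarily what proximity.py does.

```python
import hashlib
from typing import Callable, List, Sequence

Embedder = Callable[[str], List[float]]


def hash_embedding(text: str, dim: int = 16) -> List[float]:
    """Deterministic offline fallback: SHA-256 bytes scaled into [0, 1]."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i % len(digest)] / 255.0 for i in range(dim)]


def embed_with_fallback(text: str, providers: Sequence[Embedder]) -> List[float]:
    """Try each provider in order; fall back to the hash embedding."""
    for provider in providers:
        try:
            return provider(text)
        except Exception:
            continue  # provider unavailable, try the next one
    return hash_embedding(text)


def unavailable(text: str) -> List[float]:
    """Stub for a remote provider with no API key configured."""
    raise RuntimeError("no API key")


vector = embed_with_fallback("kinase inhibitor hypothesis", [unavailable, unavailable])
print(len(vector))  # 16
```

A hash embedding of this kind is stable across runs but carries no semantic similarity, so in this sketch it only keeps the clustering step runnable when no embedding API is reachable.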

Source: hypothesisgenerator/Co-Scientist/co_scientist/agents/ — see supervisor.py, generation.py, reflection.py, ranking.py, evolution.py, proximity.py, metareview.py. Tool-loop machinery in co_scientist/llm/tool_loop.py. Prompt templates in config/prompts/ (14 Jinja2 markdown files). Limits and budget shares in config/default.toml.