Co-Scientist — Sequence of LM Prompts in One Run

This doc traces the order in which LM calls fire when co-scientist processes one research goal. It was captured from a code walkthrough of the Kaimen-Inc fork on 2026-06-02 and is modeled after the companion STORM doc.

The sequence

┌─ STAGE 0: Parse the goal (1 LM call) ─────────────────────────────────┐
  1. ParseGoal(goal)
       → prompt: "parse_goal.md"
       → forced tool: record_research_plan
       → produces a ResearchPlan that scopes the rest of the run
       → fallback: if the model doesn't emit the tool, supervisor uses
         a bare ResearchPlan and the session continues
  ──── supervisor then enqueues N initial Generation tasks ────
└───────────────────────────────────────────────────────────────────────┘

┌─ STAGE 1: Main loop ──────────────────────────────────────────────────┐
  task scheduler runs concurrency-many tasks at once;
  every finished task may enqueue follow-ups;
  loop terminates on budget / wall-clock / Elo stability / idle

  ┌─ Generation (per initial task AND per Evolution offspring) ────┐
    2. Generation(goal, plan, system_feedback)
         → prompt: "generation_literature.md"
         → tool loop, up to 8 iterations:
             pubmed_search | arxiv_search | europe_pmc_search |
             web_search | web_fetch | record_hypothesis (terminal)
         → each iteration = 1 LM call
         → result: new Hypothesis (if dedup passes)
         → follow-up: enqueue Reflection
  └────────────────────────────────────────────────────────────────┘

  ┌─ Reflection (1 per new hypothesis) ────────────────────────────┐
    3. Reflection(hypothesis)
         → prompt: "reflection_review.md" (= reflection.full)
         → tool loop, up to 8 iterations:
             same lit-search tools + record_review (terminal)
         → result: novelty / correctness / testability scores
         → follow-up: enqueue Ranking.AddToTournament
  └────────────────────────────────────────────────────────────────┘

  ┌─ Ranking: Elo tournament ──────────────────────────────────────┐
    AddToTournament (non-LM): Elo=1200, schedule pairings

    4a. RankingDebate(A, B)    ← cold / Δelo<50 (low confidence)
         → prompt: "ranking.debate.md"
         → 1 LM call, no tool loop

    4b. RankingPairwise(A, B)  ← warm / Δelo≥50 (high confidence)
         → prompt: "ranking.pairwise.md"
         → 1 LM call, no tool loop

    verdict ("better idea: 1|2") → Elo update → record match
    every ~20 matches → Proximity reclusters (no LM)
  └────────────────────────────────────────────────────────────────┘

  ┌─ Evolution (fires when leaderboard matures) ───────────────────┐
    trigger: ≥20 hypotheses, each with ≥3 matches
    one EvolveTopHypotheses task runs 3 strategies in sequence:

    5a. EvolutionCombine(top-1 paired with most-distant)
         → prompt: "evolution_combine.md"
    5b. EvolutionSimplify(top-1)
         → prompt: "evolution_simplify.md"
    5c. EvolutionOutOfBox(top-5 as inspiration)
         → prompt: "evolution_out_of_box.md"
    (a 4th strategy, "feasibility", ships but is off in M3)

    each strategy = 1 tool_loop call (≤6 iters); offspring
    re-enter the cycle via the Reflection follow-up rule.
  └────────────────────────────────────────────────────────────────┘

  ┌─ Meta-review (system feedback, fires periodically) ────────────┐
    6. MetaReviewSystem
         → prompt: "metareview_system.md"
         → forced tool: record_system_feedback
         → output injected into Generation / Evolution prompts as
           steering for subsequent hypotheses
         → fires every ~50 matches when supervisor is idle
  └────────────────────────────────────────────────────────────────┘
└───────────────────────────────────────────────────────────────────────┘

┌─ STAGE 2: Finalize (1 LM call) ───────────────────────────────────────┐
  7. MetaReviewFinal(top hypotheses, all reviews, all feedback)
       → prompt: "metareview_final.md"
       → no tools; the model writes prose directly
       → SAVED as data/artifacts/<session_id>/final/overview.md
└───────────────────────────────────────────────────────────────────────┘

Concrete call count for our run

Run config: --n 3 initial hypotheses, --budget-usd 100, --wall-clock 900 (15 min), concurrency=2, default max_ideas=60, max_matches_per_idea=12, all tool_loop.*_max_iters at defaults. Bottleneck for us is wall-clock and claude -p subprocess latency (~3-5 s of cold-start per LM call), not budget.

Prompt	Times fired	Why
`parse_goal`	1	once at session start
`generation_literature`	3 – 8	3 initial + offspring from Evolution; each is a full tool loop of 1 – 8 LM calls
`reflection_review`	3 – 8	one Reflection per new hypothesis; each is a tool loop
`ranking.debate`	5 – 12	cold matches and low-Δelo pairs (high-verbosity mode)
`ranking.pairwise`	8 – 20	warm matches with Δelo≥50 (fast mode); dominates once the leaderboard is established
`evolution_combine`	0 – 2	only fires after leaderboard maturity (≥20 hypotheses, ≥3 matches each); may not trigger at all in a 15-min subprocess-bound run
`evolution_simplify`	0 – 2	same maturity gate
`evolution_out_of_box`	0 – 2	same maturity gate
`metareview_system`	0 – 2	every ~50 matches; small runs hit this 0–1 times
`metareview_final`	1	always, once at termination
Total LM calls (incl. tool-loop iterations)	~35 – 80	wide because tool loops can short-circuit on the first `record_*` call or run all 8 iterations
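
For comparison with the ~35 – 80 estimate, the hard bounds implied by the table can be worked out directly. The sketch below multiplies each row's task range by a per-task call range: 1 – 8 for Generation and Reflection loops, 1 – 6 for Evolution strategies, exactly 1 for everything else. The `ROWS` layout and the `call_bounds` helper are illustrative names, not code from the fork.

```python
# Each row: (min_tasks, max_tasks, min_calls_per_task, max_calls_per_task).
# Task ranges come from the table above; per-task call ranges come from
# the tool-loop caps (8 for Generation/Reflection, 6 for Evolution).
ROWS = {
    "parse_goal": (1, 1, 1, 1),
    "generation_literature": (3, 8, 1, 8),
    "reflection_review": (3, 8, 1, 8),
    "ranking.debate": (5, 12, 1, 1),
    "ranking.pairwise": (8, 20, 1, 1),
    "evolution_combine": (0, 2, 1, 6),
    "evolution_simplify": (0, 2, 1, 6),
    "evolution_out_of_box": (0, 2, 1, 6),
    "metareview_system": (0, 2, 1, 1),
    "metareview_final": (1, 1, 1, 1),
}


def call_bounds(rows):
    """Return (fewest, most) LM calls the ranges above allow."""
    low = sum(min_n * min_c for min_n, _, min_c, _ in rows.values())
    high = sum(max_n * max_c for _, max_n, _, max_c in rows.values())
    return low, high


print(call_bounds(ROWS))  # (21, 200)
```

The ~35 – 80 figure sits inside these bounds. One reading, not confirmed by the source, is that it assumes tool loops usually stop well short of their iteration caps.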

Notes about the order

Ranking has no tool loop. Unlike Generation, Reflection, and Evolution, each Ranking match is a single direct LM call — no literature search, no nested iterations. That's why the tournament can rack up many matches quickly while Generation and Reflection are slower per-hypothesis. The verdict is parsed from the trailing text via a regex looking for better idea: 1|2.
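
The verdict-then-update step can be sketched as follows. This is a minimal illustration, not the fork's code: the exact regex in ranking.py may differ, `parse_verdict` and `elo_update` are hypothetical names, and the K-factor of 32 is an assumption (the source only states the 1200 starting rating). The update itself is the standard Elo formula.

```python
import re

# Assumed pattern; the fork's actual regex may be stricter or looser.
VERDICT_RE = re.compile(r"better idea:\s*([12])", re.IGNORECASE)


def parse_verdict(text):
    """Return 1 or 2 from the last 'better idea: N' in the text, else None."""
    found = VERDICT_RE.findall(text)
    return int(found[-1]) if found else None


def elo_update(rating_a, rating_b, a_won, k=32.0):
    """Standard Elo update; k=32 is an assumed value, not from the source."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


reply = "Idea 1 is testable but narrow. Idea 2 generalizes.\nbetter idea: 2"
winner = parse_verdict(reply)
print(winner)                                 # 2
print(elo_update(1200.0, 1200.0, winner == 1))  # (1184.0, 1216.0)
```

Taking the last match rather than the first is a guard against the model quoting the phrase earlier in its reasoning before committing to a verdict.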

Reflection and Ranking each pick exactly one mode per invocation. Reflection ships three modes in the supplementary pseudocode (full, verification, observation), but only full is wired up in this fork. Ranking's _select_mode() picks debate when either match counter is < 2 or the Elo gap is < 50, otherwise pairwise. Evolution's three strategies, by contrast, are not mutually exclusive: every EvolveTopHypotheses task runs combine + simplify + out_of_box in sequence.
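
The Ranking mode rule described above fits in a few lines. This is a sketch of the stated rule only; the signature and parameter names are assumptions and need not match the fork's `_select_mode()`.

```python
def select_mode(matches_a, matches_b, elo_a, elo_b,
                min_matches=2, elo_gap=50.0):
    """Pick 'debate' for cold or closely rated pairs, else 'pairwise'."""
    cold = matches_a < min_matches or matches_b < min_matches
    close = abs(elo_a - elo_b) < elo_gap
    return "debate" if cold or close else "pairwise"


print(select_mode(0, 5, 1200, 1300))  # debate   (A has played < 2 matches)
print(select_mode(3, 3, 1200, 1230))  # debate   (gap of 30 is < 50)
print(select_mode(3, 3, 1200, 1250))  # pairwise (gap of exactly 50)
```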

The tool loop is the load-bearing inner mechanism. Each Generation / Reflection / Evolution task is one outer task but up to 8 (Generation, Reflection) or 6 (Evolution) inner LM calls, one per assistant turn in the assistant ↔ tool_use ↔ tool_result loop. The loop ends when the model emits a "terminal" tool (record_hypothesis, record_review, record_system_feedback) or hits the max-iters cap. On the final allowed iteration the supervisor optionally forces the terminal tool, so the run commits instead of spending its last turn on yet another search.
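
The loop shape described above can be sketched with a stubbed model. Everything here is illustrative: `ToolCall`, `run_tool_loop`, and the `call_model(history, forced_tool)` callback are hypothetical stand-ins and do not reflect the actual interface in co_scientist/llm/tool_loop.py.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict


def run_tool_loop(call_model, run_tool, terminal_tool,
                  max_iters=8, force_terminal_on_last=True):
    """Run assistant/tool turns until the terminal tool fires or the cap hits.

    Returns (terminal_args_or_None, number_of_LM_calls).
    """
    history = []
    for i in range(max_iters):
        is_last = i == max_iters - 1
        forced = terminal_tool if (is_last and force_terminal_on_last) else None
        call = call_model(history, forced)  # one LM call per iteration
        if call.name == terminal_tool:
            return call.args, i + 1
        history.append((call, run_tool(call)))  # feed the tool result back
    return None, max_iters


# Stub model: keeps searching until the loop forces the terminal tool.
def stub_model(history, forced_tool):
    if forced_tool is not None:
        return ToolCall(forced_tool, {"title": "stub hypothesis"})
    return ToolCall("web_search", {"query": f"query {len(history)}"})


result, calls = run_tool_loop(stub_model, lambda call: "no results",
                              terminal_tool="record_hypothesis")
print(result, calls)  # {'title': 'stub hypothesis'} 8
```

With forcing disabled, the same stub exhausts all 8 iterations and returns no result. Avoiding that outcome is the point of forcing the terminal tool on the last turn.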

Meta-review fires in two different shapes. metareview_system is steering feedback that gets injected back into Generation and Evolution prompts on subsequent turns; it shapes future hypotheses. metareview_final is the readable end-of-run document. Only the final overview is what you sit down and read.

Proximity is non-LM. Embeddings (Voyage → OpenAI → hash fallback) plus sklearn clustering. It schedules a recluster every ~20 matches to drive informative pair selection — but it never calls the LM. Its presence in the agent roster is what makes co-scientist "7 agents" and not "6 LM agents".
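
The embedding fallback chain can be sketched as below. The Voyage and OpenAI clients are represented only as hypothetical callables that may raise, so no real SDK calls appear here. `hash_embed` is one plausible hashing-trick fallback and is an assumption; proximity.py may implement its hash fallback differently.

```python
import hashlib
import math


def hash_embed(text, dim=64):
    """Deterministic bag-of-words embedding via the hashing trick (assumed)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[index] += sign
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid divide-by-zero
    return [v / norm for v in vec]


def embed_with_fallback(text, providers):
    """Try each provider in order; fall back to hash_embed if all fail."""
    for provider in providers:
        try:
            return provider(text)
        except Exception:
            continue  # provider unavailable, try the next one
    return hash_embed(text)


def unavailable_provider(text):
    # Stand-in for a remote embedding client with no API key configured.
    raise RuntimeError("provider unavailable")


vector = embed_with_fallback("elo", [unavailable_provider, unavailable_provider])
print(len(vector))  # 64
```

A fallback of this kind needs no network access and always returns the same vector for the same text, so reclustering can proceed even when no embedding API is configured.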

Source: hypothesisgenerator/Co-Scientist/co_scientist/agents/ — see supervisor.py, generation.py, reflection.py, ranking.py, evolution.py, proximity.py, metareview.py. Tool-loop machinery in co_scientist/llm/tool_loop.py. Prompt templates in config/prompts/ (14 Jinja2 markdown files). Limits and budget shares in config/default.toml.