The order in which LM calls fire when co-scientist processes one research goal, captured from a code walkthrough of the Kaimen-Inc fork on 2026-06-02. Modeled after the companion STORM doc.
┌─ STAGE 0: Parse the goal (1 LM call) ─────────────────────────────────┐
1. ParseGoal(goal)
→ prompt: "parse_goal.md"
→ forced tool: record_research_plan
→ produces a ResearchPlan that scopes the rest of the run
→ fallback: if the model doesn't emit the tool, supervisor uses a
bare ResearchPlan and the session continues
──── supervisor then enqueues N initial Generation tasks ────
└──────────────────────────────────────────────────────────────────────┘
┌─ STAGE 1: Main loop ──────────────────────────────────────────────────┐
│ task scheduler runs concurrency-many tasks at once; │
│ every finished task may enqueue follow-ups; │
│ loop terminates on budget / wall-clock / Elo stability / idle │
┌─ Generation (per initial task AND per Evolution offspring) ────┐
│ │
│ 2. Generation(goal, plan, system_feedback) │
│ → prompt: "generation_literature.md" │
│ → tool loop, up to 8 iterations: │
│ pubmed_search | arxiv_search | europe_pmc_search | │
│ web_search | web_fetch | record_hypothesis (terminal)│
│ → each iteration = 1 LM call │
│ → result: new Hypothesis (if dedup passes) │
│ → follow-up: enqueue Reflection │
└────────────────────────────────────────────────────────────────┘
┌─ Reflection (1 per new hypothesis) ────────────────────────────┐
│ │
│ 3. Reflection(hypothesis) │
│ → prompt: "reflection_review.md" (= reflection.full) │
│ → tool loop, up to 8 iterations: │
│ same lit-search tools + record_review (terminal) │
│ → result: novelty / correctness / testability scores │
│ → follow-up: enqueue Ranking.AddToTournament │
└────────────────────────────────────────────────────────────────┘
┌─ Ranking — Elo tournament ─────────────────────────────────────┐
│ │
│ AddToTournament (non-LM): Elo=1200, schedule pairings │
│ │
│ 4a. RankingDebate(A, B) ← cold / Δelo<50 (low confidence) │
│ → prompt: "ranking.debate.md" │
│ → 1 LM call, no tool loop │
│ │
│ 4b. RankingPairwise(A, B) ← warm / Δelo≥50 (high confidence) │
│ → prompt: "ranking.pairwise.md" │
│ → 1 LM call, no tool loop │
│ │
│ verdict ("better idea: 1|2") → Elo update → record match │
│ every ~20 matches → Proximity reclusters (no LM) │
└────────────────────────────────────────────────────────────────┘
┌─ Evolution (fires when leaderboard matures) ───────────────────┐
│ trigger: ≥20 hypotheses, each with ≥3 matches │
│ one EvolveTopHypotheses task runs 3 strategies in sequence: │
│ │
│ 5a. EvolutionCombine(top-1 paired with most-distant) │
│ → prompt: "evolution_combine.md" │
│ 5b. EvolutionSimplify(top-1) │
│ → prompt: "evolution_simplify.md" │
│ 5c. EvolutionOutOfBox(top-5 as inspiration) │
│ → prompt: "evolution_out_of_box.md" │
│ (a 4th strategy, "feasibility", ships but is off in M3) │
│ │
│ each strategy = 1 tool_loop call (≤6 iters); offspring │
│ re-enter the cycle via the Reflection follow-up rule. │
└────────────────────────────────────────────────────────────────┘
┌─ Meta-review (system feedback, fires periodically) ────────────┐
│ 6. MetaReviewSystem │
│ → prompt: "metareview_system.md" │
│ → forced tool: record_system_feedback │
│ → output injected into Generation / Evolution prompts as │
│ steering for subsequent hypotheses │
│ → fires every ~50 matches when supervisor is idle │
└────────────────────────────────────────────────────────────────┘
└──────────────────────────────────────────────────────────────────────┘
┌─ STAGE 2: Finalize (1 LM call) ───────────────────────────────────────┐
7. MetaReviewFinal(top hypotheses, all reviews, all feedback)
→ prompt: "metareview_final.md"
→ no tools; the model writes prose directly
→ SAVED as data/artifacts/<session_id>/final/overview.md
└──────────────────────────────────────────────────────────────────────┘
Run config: --n 3 initial hypotheses, --budget-usd 100, --wall-clock 900 (15 min), concurrency=2, default max_ideas=60, max_matches_per_idea=12, and all tool_loop.*_max_iters at their defaults. For us the bottleneck is wall-clock time and claude -p subprocess latency (~3-5 s of cold start per LM call), not budget.
| Prompt | Times fired | Why |
|---|---|---|
| parse_goal | 1 | once at session start |
| generation_literature | 3 – 8 | 3 initial + offspring from Evolution; each is a full tool loop of 1 – 8 LM calls |
| reflection_review | 3 – 8 | one Reflection per new hypothesis; each is a tool loop |
| ranking.debate | 5 – 12 | cold matches and low-Δelo pairs (high-verbosity mode) |
| ranking.pairwise | 8 – 20 | warm matches with Δelo≥50 (fast mode); dominates once the leaderboard is established |
| evolution_combine | 0 – 2 | only fires after leaderboard maturity (≥20 hypotheses, ≥3 matches each); may not trigger at all in a 15-min subprocess-bound run |
| evolution_simplify | 0 – 2 | same maturity gate |
| evolution_out_of_box | 0 – 2 | same maturity gate |
| metareview_system | 0 – 2 | every ~50 matches; small runs hit this 0–1 times |
| metareview_final | 1 | always, once at termination |
| Total LM calls (incl. tool-loop iterations) | ~35 – 80 | wide because tool loops can short-circuit on the first record_* call or run all 8 iterations |
better idea: 1|2.
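A minimal sketch of how a verdict line of this shape could be parsed and turned into an Elo update. The regex, the function names, and the K-factor of 32 are assumptions for illustration and are not taken from ranking.py; only the verdict format and the 1200 starting rating come from the walkthrough. The update itself is the standard Elo formula.

```python
import re
from typing import Optional, Tuple

K_FACTOR = 32.0  # assumption: the fork's real K-factor is not stated in this doc


def parse_verdict(text: str) -> Optional[int]:
    """Return 1 or 2 from the last 'better idea: N' occurrence, else None."""
    found = re.findall(r"better idea:\s*([12])", text, flags=re.IGNORECASE)
    return int(found[-1]) if found else None


def elo_update(elo_a: float, elo_b: float, winner: int,
               k: float = K_FACTOR) -> Tuple[float, float]:
    """Standard Elo update; winner is 1 (idea A) or 2 (idea B)."""
    expected_a = 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))
    score_a = 1.0 if winner == 1 else 0.0
    delta = k * (score_a - expected_a)
    return elo_a + delta, elo_b - delta


# Two fresh hypotheses both start at 1200; idea 1 wins the match.
winner = parse_verdict("...reasoning...\nbetter idea: 1")
if winner is not None:
    print(elo_update(1200, 1200, winner))  # (1216.0, 1184.0)
```

Taking the last occurrence rather than the first is a deliberate choice in this sketch: a debate-style transcript may mention the phrase while reasoning before it commits to a final verdict.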
full, verification, observation) but only full is wired up in this fork. Ranking's _select_mode() picks debate when either match counter is < 2 or the Elo gap is < 50, otherwise pairwise. Evolution's three strategies, by contrast, are not mutually exclusive — every EvolveTopHypotheses task runs combine + simplify + out_of_box in sequence.
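The ranking mode rule described above can be sketched as follows. This is an illustration of the stated thresholds (fewer than 2 matches on either side, or an Elo gap under 50), not a copy of the fork's _select_mode(); the signature and constant names are hypothetical.

```python
MIN_MATCHES = 2          # below this on either side, the pair counts as "cold"
ELO_GAP_THRESHOLD = 50   # below this gap, the outcome is considered low confidence


def select_mode(matches_a: int, matches_b: int,
                elo_a: float, elo_b: float) -> str:
    """Pick the ranking prompt for a pairing: 'debate' or 'pairwise'."""
    if matches_a < MIN_MATCHES or matches_b < MIN_MATCHES:
        return "debate"      # cold pairing: at least one idea is barely tested
    if abs(elo_a - elo_b) < ELO_GAP_THRESHOLD:
        return "debate"      # close ratings: spend the more verbose prompt
    return "pairwise"        # warm pairing with a clear gap: fast mode


print(select_mode(0, 5, 1200, 1300))  # debate
print(select_mode(3, 3, 1200, 1249))  # debate
print(select_mode(3, 3, 1200, 1250))  # pairwise
```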
record_hypothesis, record_review, record_system_feedback) or hits the max-iters cap. On the final allowed iteration the supervisor optionally forces the terminal tool so the run commits instead of looping on yet another search.
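The loop behaviour described above can be sketched as below. The model interface (a callable that receives the transcript and an optional forced tool name and returns a dict with "tool" and "args") is a hypothetical stand-in, as are the function and parameter names; the real machinery lives in co_scientist/llm/tool_loop.py and may differ.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical model interface: (transcript, forced_tool) -> {"tool": str, "args": dict}
ModelFn = Callable[[List[dict], Optional[str]], Dict[str, Any]]


def run_tool_loop(model: ModelFn,
                  tools: Dict[str, Callable[..., Any]],
                  terminal_tool: str,
                  max_iters: int = 8,
                  force_on_last: bool = True) -> Optional[dict]:
    """Run up to max_iters LM calls; return the terminal tool's args, or None."""
    transcript: List[dict] = []
    for i in range(max_iters):
        is_last = i == max_iters - 1
        # On the final allowed iteration, optionally force the terminal tool.
        forced = terminal_tool if (force_on_last and is_last) else None
        call = model(transcript, forced)          # one LM call per iteration
        name, args = call["tool"], call["args"]
        if name == terminal_tool:
            return args                           # loop commits and stops
        result = tools[name](**args)              # e.g. a literature search
        transcript.append({"tool": name, "args": args, "result": result})
    return None                                   # cap hit without committing


# Usage with a fake model that keeps searching until it is forced to commit.
def fake_model(transcript, forced):
    if forced:
        return {"tool": forced, "args": {"title": "committed on last iteration"}}
    return {"tool": "web_search", "args": {"query": "example"}}


tools = {"web_search": lambda query: ["stub result for " + query]}
print(run_tool_loop(fake_model, tools, "record_hypothesis", max_iters=3))
```

With force_on_last disabled, the same fake model would exhaust the cap and the sketch would return None, which corresponds to a run that never commits.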
metareview_system is steering feedback that gets injected back into Generation and Evolution prompts on subsequent turns; it shapes future hypotheses. metareview_final is the readable end-of-run document, and of the two it is the only one you actually sit down and read.
Source: hypothesisgenerator/Co-Scientist/co_scientist/agents/ — see supervisor.py, generation.py, reflection.py, ranking.py, evolution.py, proximity.py, metareview.py. Tool-loop machinery in co_scientist/llm/tool_loop.py. Prompt templates in config/prompts/ (14 Jinja2 markdown files). Limits and budget shares in config/default.toml.