A multi-agent autonomous research-paper authoring system for political science. Local-first. Span-traced. Self-auditing. Built around the principle that a single model cannot do eight people's jobs in one shot.
Sovereign began as the answer to a falsified hypothesis. Its predecessor — a single-call pipeline that asked one large model to draft a research paper end-to-end — failed with a perfect record: every output rejected, every run, every paper. The model was not the problem. The architecture was. A genius asked to be a committee, in one shot, with no memory of having tried before, will refuse the role by producing slop.
So we decomposed. Eight competences were named. Each was given a card to read, a card to write, and a contract that defines what it must not touch. Each emits a span event the moment it completes, so when the pipeline drifts we know which stage drifted, by how much, and why. The orchestrator does not think; it reads the plan, dispatches the next stage, and logs.
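A minimal sketch of an orchestrator that does not think, assuming a plan is an ordered list of named stage functions over a shared state dictionary. The names `orchestrate`, `curate`, and `take_notes` are hypothetical illustrations, not Sovereign's actual code.

```python
from typing import Callable, Dict, List, Tuple

# A plan entry is a stage name plus the function that performs the stage.
Stage = Tuple[str, Callable[[Dict], Dict]]


def orchestrate(plan: List[Stage], state: Dict) -> List[str]:
    """Read the plan, dispatch each stage in order, and log. No reasoning here."""
    log: List[str] = []
    for name, stage in plan:
        state.update(stage(state))   # the stage writes its output into shared state
        log.append(f"done:{name}")   # the orchestrator only records what happened
    return log


# Toy stages standing in for real agents (hypothetical).
def curate(state: Dict) -> Dict:
    return {"sources": ["a", "b"]}


def take_notes(state: Dict) -> Dict:
    return {"cards": len(state["sources"])}


state: Dict = {}
log = orchestrate([("source-curator", curate), ("note-taker", take_notes)], state)
```

All judgment lives inside the stages; the loop itself could be replaced by any scheduler without changing what the paper says.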
A genius asked to be a committee, in one shot, refuses the role by producing slop. — On the predecessor
Each agent is a bounded function. It reads a defined input, performs one act of reasoning, writes a typed output, and emits a span. Nothing more.
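The shape of a bounded agent can be sketched as a wrapper that runs one act and emits one span. The `Span` fields and the `run_agent` helper are assumptions made for illustration; the real span schema is not given here.

```python
import json
import time
from dataclasses import dataclass, asdict
from typing import Callable, Dict, List


@dataclass
class Span:
    stage: str          # which agent ran
    duration_ms: float  # wall-clock time of the single act
    contract_ok: bool   # did the output carry every required key?


def run_agent(stage: str,
              act: Callable[[Dict], Dict],
              required_keys: List[str],
              payload: Dict,
              trace: List[str]) -> Dict:
    """Read a defined input, perform one act, write an output, emit a span."""
    start = time.perf_counter()
    output = act(payload)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    contract_ok = all(key in output for key in required_keys)
    trace.append(json.dumps(asdict(Span(stage, elapsed_ms, contract_ok))))
    return output


trace: List[str] = []
result = run_agent(
    "debate-mapper",
    lambda p: {"debates": [p["topic"]]},  # stand-in for a model call
    required_keys=["debates"],
    payload={"topic": "audience costs"},
    trace=trace,
)
```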
The drafter writes the paper specification: thesis, opponents, scope conditions, falsifier, target cases. The critic reads the draft and tries to break it — fatal findings block downstream stages until reconciled. Every paper begins with an argument that has already survived its first opponent.
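The blocking rule can be sketched as a gate over the critic's findings. The `SpecFinding` type, the severity labels, and the `reconciled` flag are hypothetical; only the rule that fatal findings block downstream stages until reconciled comes from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpecFinding:
    severity: str             # "fatal" or "minor" (assumed labels)
    note: str
    reconciled: bool = False  # set once the drafter has answered the finding


def may_proceed(findings: List[SpecFinding]) -> bool:
    """Downstream stages run only when no fatal finding is left unreconciled."""
    return all(f.reconciled for f in findings if f.severity == "fatal")


findings = [
    SpecFinding("minor", "scope conditions could be tighter"),
    SpecFinding("fatal", "falsifier cannot be observed in the target cases"),
]
blocked = not may_proceed(findings)

findings[1].reconciled = True
released = may_proceed(findings)
```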
The curator annotates a retrieval cache with role, relevance, and rationale per source. It decides which forty-six abstracts matter, which six to mine deeply, and which to drop. The curator's job is to refuse most of what is offered.
The note-taker mints cards. Each card carries an exact quote, a paraphrase, the source's author-year, a relation to the paper, and a section hint. A card is the atomic unit of citable claim. Cards that cannot be verified against a source are marked UNVERIFIED and excluded from the citation pool.
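A card and its verification gate might look like the following sketch. The five fields come from the description above; the `VERIFIED` label and the whitespace-normalised substring check are assumptions, since the actual verification method is not specified.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Card:
    quote: str           # exact quote from the source
    paraphrase: str
    author_year: str     # e.g. "Fearon 1994"
    relation: str        # how the source relates to the paper
    section_hint: str
    status: str = "UNVERIFIED"  # every card starts outside the citation pool


def _normalise(text: str) -> str:
    return " ".join(text.split())


def verify(card: Card, source_text: str) -> Card:
    """Assumed check: the quote must appear in the source, ignoring whitespace."""
    if _normalise(card.quote) in _normalise(source_text):
        card.status = "VERIFIED"
    return card


def citation_pool(cards: List[Card]) -> List[Card]:
    return [c for c in cards if c.status == "VERIFIED"]


source = "Leaders who back down\nsuffer audience costs at home."
good = verify(Card("suffer audience costs", "Backing down is costly.",
                   "Fearon 1994", "supports", "theory"), source)
bad = verify(Card("never face any costs", "Backing down is free.",
                  "Fearon 1994", "opposes", "theory"), source)
pool = citation_pool([good, bad])
```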
The lit-reviewer names the schools engaged, the agreement edges between them, and where the paper lands in the conversation. Its output is both structured (a graph of positions) and prose (draft one of the literature-review section).
The debate-mapper identifies live debates, underexplored intersections, and contested concepts. The result is the map of disagreement that the paper either intervenes in or routes around.
The author writes one section at a time, grounded in the relevant cards, citing only authors that resolve to a card. Untraced citations trigger a bounded retry with the offenders forbidden by name. The assembler merges drafts into the manuscript at the section heading, anchored at column zero.
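The citation check and bounded retry can be sketched as below. The `(Author Year)` citation pattern, the retry budget, and the `draft_fn` callback are assumptions for illustration; a real drafter would be a model call that receives the forbidden names in its prompt.

```python
import re
from typing import Callable, List, Set, Tuple

# Assumed citation style: "(Author Year)", for example "(Fearon 1994)".
CITATION = re.compile(r"\(([A-Z][A-Za-z\-]+) (\d{4})\)")


def untraced(text: str, known: Set[str]) -> List[str]:
    """Return cited author-years that do not resolve to any card."""
    found = {f"{author} {year}" for author, year in CITATION.findall(text)}
    return sorted(found - known)


def write_section(draft_fn: Callable[[List[str]], str],
                  known: Set[str],
                  max_retries: int = 2) -> Tuple[str, List[str]]:
    """Draft once, then retry at most max_retries times with offenders forbidden."""
    forbidden: List[str] = []
    text = draft_fn(forbidden)
    for _ in range(max_retries):
        offenders = untraced(text, known)
        if not offenders:
            break
        forbidden = sorted(set(forbidden) | set(offenders))
        text = draft_fn(forbidden)
    return text, untraced(text, known)


def fake_drafter(forbidden: List[str]) -> str:
    # Stand-in for the prose author: drops a citation once it is forbidden.
    if "Smith 2001" in forbidden:
        return "Costs bind leaders (Fearon 1994)."
    return "Costs bind leaders (Fearon 1994), as shown (Smith 2001)."


text, remaining = write_section(fake_drafter, {"Fearon 1994"})
```

The retry is bounded on purpose: if offenders remain after the budget is spent, they are returned to the caller rather than looped on.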
The methodologist runs seventy Shively rules over the manuscript. Each finding has a rule id, a severity, a locus, and a remediation. Findings flow back into a bounded audit-to-author revision loop.
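A finding and the bounded audit-to-author loop, sketched under assumptions: the four `Finding` fields come from the text, while the round limit, the `toy_audit` rule, and the id `R-TOY-01` are invented for the example and are not among the seventy rules.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Finding:
    rule_id: str
    severity: str
    locus: str        # where in the manuscript the rule fired
    remediation: str  # what the author should do about it


def revision_loop(manuscript: str,
                  audit: Callable[[str], List[Finding]],
                  revise: Callable[[str, List[Finding]], str],
                  max_rounds: int = 3) -> Tuple[str, List[Finding], int]:
    """Audit, revise, and re-audit until clean or until the round budget is spent."""
    findings = audit(manuscript)
    rounds = 0
    while findings and rounds < max_rounds:
        manuscript = revise(manuscript, findings)
        findings = audit(manuscript)
        rounds += 1
    return manuscript, findings, rounds


def toy_audit(text: str) -> List[Finding]:
    # Hypothetical rule: flag the intensifier "very".
    if "very " in text:
        return [Finding("R-TOY-01", "minor", "sentence 1", "delete the intensifier")]
    return []


def toy_revise(text: str, findings: List[Finding]) -> str:
    return text.replace("very ", "")


final, open_findings, rounds = revision_loop(
    "The effect is very large.", toy_audit, toy_revise)
```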
The adversary runs eighty SAGE attacks against the manuscript. The paper is not judged near-final until the adversary's findings are absent or preempted.
The orchestrator runs each paper through eight stages. Every stage emits a span. The trace is the paper's birth certificate.
① Spec-Author + Spec-Adversary → paper_specs/{paper}.json
② Source-Curator → test_retrieval/{paper}.json
③ Note-Taker → cards/{paper}/card_*.json
④ Lit-Reviewer → schools, edges, where-paper-lands
⑤ Debate-Mapper → live debates, intersections
⑥ Prose-Author (per section) → ProseDraft + cards-cited
⑦ Patch-Assembler → runs/{paper}/manuscript.md
⑧ Methodologist + Adversary → audit_{paper}.jsonl → calibration_{paper}.jsonl → ⚑ operator ping if near-final

trace: task_traces/pipeline_v2_{paper}_{ts}.jsonl
portfolio: state/pipeline_v2_portfolio.json (atomic)
Eight stages produce one trace and ten artefacts. The trace is the per-stage span ledger — millisecond timing, contract status, decisions, signals. The artefacts are paper-level: the spec, the cards, the manuscript, the calibration, the audit. Together they form a complete, reproducible record of how this paper came to be the way it is.
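One plausible way to keep an append-only span ledger and an atomic portfolio file, assuming JSON Lines for the trace and a write-then-rename for the portfolio. The helper names and the demo paths are hypothetical; only the idea of a per-stage trace and an atomic portfolio comes from the layout above.

```python
import json
import os
import tempfile
from pathlib import Path


def append_span(trace_path: Path, span: dict) -> None:
    """Append one span as one JSON line; earlier lines are never rewritten."""
    trace_path.parent.mkdir(parents=True, exist_ok=True)
    with trace_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(span) + "\n")


def write_atomic(path: Path, payload: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_name, path)  # readers see the old file or the new, never half
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


with tempfile.TemporaryDirectory() as root:
    base = Path(root)
    trace_file = base / "task_traces" / "demo.jsonl"
    append_span(trace_file, {"stage": 1, "contract_ok": True})
    append_span(trace_file, {"stage": 2, "contract_ok": True})
    portfolio_file = base / "state" / "portfolio.json"
    write_atomic(portfolio_file, {"demo-paper": "stage 2 complete"})
    span_count = len(trace_file.read_text(encoding="utf-8").splitlines())
    portfolio = json.loads(portfolio_file.read_text(encoding="utf-8"))
```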
"Is this paper any good" is not a question you answer by vibes. It is a question you answer by distribution. The calibrator scores each manuscript on ten axes against a reference set of twenty-four hundred and one published papers from International Studies Quarterly and the American Political Science Review. A composite is reported alongside the per-axis scores; an axis is flagged if it falls below a threshold derived from the reference distribution.
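Scoring by distribution can be sketched as a percentile rank per axis. Everything specific here is an assumption: the axis names, the tiny reference sets, the 0.25 floor, and the use of a plain mean for the composite. The text says only that the threshold is derived from the reference distribution.

```python
from bisect import bisect_left
from statistics import mean
from typing import Dict, List


def percentile_rank(score: float, reference: List[float]) -> float:
    """Fraction of reference papers scoring strictly below this manuscript."""
    ordered = sorted(reference)
    return bisect_left(ordered, score) / len(ordered)


def calibrate(scores: Dict[str, float],
              reference: Dict[str, List[float]],
              floor: float = 0.25) -> Dict:
    ranks = {axis: percentile_rank(value, reference[axis])
             for axis, value in scores.items()}
    flagged = sorted(axis for axis, rank in ranks.items() if rank < floor)
    return {"per_axis": ranks, "composite": mean(ranks.values()), "flagged": flagged}


# Two hypothetical axes with four reference papers each.
reference = {"clarity": [1.0, 2.0, 3.0, 4.0], "rigor": [1.0, 2.0, 3.0, 4.0]}
report = calibrate({"clarity": 3.5, "rigor": 0.5}, reference)
```

A rank-based score makes each axis comparable across runs without assuming the reference scores are normally distributed.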
If a run does not improve on the previous run on axes we can name, the run did not improve. The calibrator's purpose is to make that judgment formal. A baseline snapshot is taken on first run; every subsequent run reports its delta. The operator does not have to read the manuscript to know whether the loop is learning.
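The delta report reduces to a subtraction per axis. In this sketch, "improved" is assumed to mean at least one named axis went up and none went down; the actual criterion is not stated, and the axis names are hypothetical.

```python
from typing import Dict, List


def deltas(baseline: Dict[str, float], current: Dict[str, float]) -> Dict[str, float]:
    """Per-axis change of the current run relative to the baseline snapshot."""
    return {axis: current[axis] - baseline[axis] for axis in baseline}


def improved_axes(delta: Dict[str, float]) -> List[str]:
    return sorted(axis for axis, change in delta.items() if change > 0)


def run_improved(delta: Dict[str, float]) -> bool:
    # Assumed rule: something got better and nothing got worse.
    return bool(improved_axes(delta)) and all(c >= 0 for c in delta.values())


baseline = {"clarity": 0.50, "rigor": 0.25}
current = {"clarity": 0.75, "rigor": 0.25}
delta = deltas(baseline, current)
named = improved_axes(delta)
verdict = run_improved(delta)
```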
Sovereign runs entirely on the operator's hardware — an Apple Silicon workstation with sixty-four gigabytes of unified memory. The model rotation is served by llama-swap, mixing MLX-backed servers for the larger reasoners and llama.cpp for selected GGUF quants.
No cloud. No telemetry. The whole stack — about two hundred and forty gigabytes on disk — sits behind a single local router. When the operator closes the laptop, the system is gone. When it opens, it returns exactly as it was.
Sovereign is not a chatbot. There is no persona, no helpful assistant, no conversational front. It is not a research copilot; it does not sit beside the operator suggesting edits. It runs as a batch over a paper specification and produces, on the other side, a manuscript that has survived eight rounds of independent review.
It is not a frontier model. It uses small, open, locally-runnable models — none above one hundred billion parameters — and gets where it gets through architecture rather than scale. The thesis is that the next decade of useful AI will be won by tighter loops, not larger weights.
It is not finished. The first integration run is paper-frontier-007. Whatever happens, the trace is the data. The loop will not be near-final on its first attempt; near-final on its first attempt was never the goal. The goal is a loop where the next run is traceably better than the last on axes we can name.
We did not ship a feature. We shipped a loop. — Address at the cutover
The system is in plug-and-run readiness as of May 2026. Eight paper specifications have been pre-drafted (frontier-007 through frontier-014), each surviving spec-adversary review with zero fatal findings. Three retrieval caches are warmed. The cutover script — preflight, model launch, smoke-test, dry-run, first live run — is one command.
One hundred and twenty-five tests pass. The gap-audit reports zero actionable findings across twenty-two checks. Six recurring bug patterns were caught and killed during build. The legacy single-call pipeline is queued for retirement on the third successful end-to-end run.
The engineering log and the address given at the cutover are collected in the address and reflections.