It might look like a debate engine — and it is, technically — but that’s
not the point of this demo. The point is really the judge.
This system isn’t built to simulate debates — it’s built to evaluate
reasoning. At its core is a judge layer engineered to audit arguments the
way a scientist inspects a causal model. It enforces strict cause →
process → consequence structure, flags missing mechanisms, and penalizes
claims that rely on vibes, metaphors, or moral gestures without
explanatory substance. The auditor tracks contradictions across turns,
detects when an argument quietly drops a premise, and rewards positions
that maintain a stable causal spine throughout the entire exchange.
Instead of producing a “winner,” it generates arc‑level analyses that
reconstruct how each argument evolved, where it fractured, and why one
causal model held together while the other collapsed. This isn’t a debate
engine. It’s a structured reasoning environment that reveals how well
arguments actually work.
The debaters are large open-weight models running on Ollama Cloud — Kimi
K2 1T, Nemotron 3 Super, Qwen3, GLM-4, MiniMax M2.5, and others. Both
debaters and the judge can be swapped for any LLM provider you prefer. I
built a provider-agnostic LLM library to make this seamless: Anthropic,
Gemini, and Ollama are already implemented out of the box, and the architecture
is designed so that new providers can be added with minimal code. The judge runs
independently from the debaters and can be switched without touching the
rest of the stack. Its rubric is multi-dimensional — logic, rhetoric,
tactics, frame control, and credibility are scored separately. It also
emits hidden feedback signals back into the debate in real time, shaping
how each model argues as the exchange unfolds.
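For a sense of the shape of that provider library, here is a minimal sketch of what the abstraction could look like. The class and method names are illustrative assumptions, not the library's actual API.

```python
from abc import ABC, abstractmethod


class LLMProvider(ABC):
    """Anything that can turn a system prompt plus user prompt into text."""

    @abstractmethod
    def complete(self, system: str, prompt: str) -> str:
        ...


class OllamaCloudProvider(LLMProvider):
    def __init__(self, model: str):
        self.model = model  # whichever open-weight debater model you run

    def complete(self, system: str, prompt: str) -> str:
        raise NotImplementedError  # call the Ollama endpoint here


class AnthropicProvider(LLMProvider):
    def __init__(self, model: str):
        self.model = model

    def complete(self, system: str, prompt: str) -> str:
        raise NotImplementedError  # call the Anthropic Messages API here


# Debaters and judge are just provider instances, so swapping the judge
# (or a debater) never touches the rest of the stack.
debater_a: LLMProvider = OllamaCloudProvider(model="your-debater-model")
judge: LLMProvider = AnthropicProvider(model="your-judge-model")
```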
The Live Judge Layer
The judge is not a single score. It runs five levels of analysis, from
per-turn grading to whole-debate arc evaluation.
1 — Per-turn absolute scoring
Each turn receives independent scores on three dimensions — Logical Coherence (0–40), Rhetorical Force (0–30), and Tactical Effectiveness (0–30), weighted
into a composite 0–100 grade. Logic uses a component-chain model: a top score
requires a complete cause → process → measurable consequence chain with the
opponent's weakest premise explicitly addressed. Rhetoric uses a four-component
method and caps delivery scores when framing is weak.
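As a rough sketch of how those numbers combine, assuming the composite is simply the sum of the three dimensional scores (their maxima already add up to 100), and with field names that are mine rather than the system's:

```python
from dataclasses import dataclass


@dataclass
class TurnScore:
    logic: int     # Logical Coherence, 0-40
    rhetoric: int  # Rhetorical Force, 0-30
    tactics: int   # Tactical Effectiveness, 0-30

    @property
    def composite(self) -> int:
        # The dimension maxima sum to 100, so the composite grade
        # is just the sum of the three scores.
        return self.logic + self.rhetoric + self.tactics


print(TurnScore(logic=31, rhetoric=22, tactics=24).composite)  # 77
```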
2 — Pairwise comparative analysis
After each turn, the judge runs a separate head-to-head comparison of
the two most recent turns, determining dimensional winners (Logic,
Tactics, Rhetoric) or declaring a draw. This produces floor calibration: if an absolute score falls outside the band consistent with the
comparative verdict — a "logic win" scored below 22/40, for instance —
it's flagged and harmonized. This prevents score drift across a long
debate.
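Floor banding reduces to a simple consistency check. This sketch assumes a 22/40 floor for the logic dimension, mirroring the example above; the constant and function names are hypothetical.

```python
LOGIC_WIN_FLOOR = 22  # out of 40, the example band from above


def harmonize_logic(absolute_logic: int, won_logic_comparison: bool) -> tuple[int, bool]:
    """Return (possibly adjusted logic score, anomaly flag)."""
    if won_logic_comparison and absolute_logic < LOGIC_WIN_FLOOR:
        # The absolute score contradicts the comparative verdict:
        # flag it and lift it into the winning band.
        return LOGIC_WIN_FLOOR, True
    return absolute_logic, False
```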
3 — Claim & fact tracking
The judge maintains a Suspect Claims Register across turns. Claims that
are specific but mechanically hollow — a precise-sounding assertion with
no causal chain — are flagged with a −1 logic penalty. Fabricated
citations, historically inverted facts, or causally backwards claims
incur a mandatory −2 penalty upon detection. Critically, if a logic
failure is only exposed by a later turn, the prior turn's score
can be retroactively docked via a logic gap adjustment signal.
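A register like this can be a small bookkeeping structure. The sketch below uses my own naming and assumes penalties are stored as per-turn logic deltas, so a detection made later in the debate can still dock the earlier turn.

```python
from dataclasses import dataclass, field


@dataclass
class SuspectClaim:
    turn: int   # the turn that made the claim
    text: str
    kind: str   # "hollow" (-1 logic) or "fabricated" (-2 logic)


@dataclass
class ClaimsRegister:
    claims: list[SuspectClaim] = field(default_factory=list)
    adjustments: dict[int, int] = field(default_factory=dict)  # turn -> logic delta

    def flag(self, claim: SuspectClaim) -> None:
        penalty = -1 if claim.kind == "hollow" else -2
        self.claims.append(claim)
        # The penalty lands on the turn that made the claim, even when the
        # failure is only exposed later (the retroactive logic gap adjustment).
        self.adjustments[claim.turn] = self.adjustments.get(claim.turn, 0) + penalty
```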
4 — Narrative verdict
At the end of the debate a separate LLM call evaluates arc-level
coherence: did one agent hold a more consistent, evolving thesis across
the whole debate? This verdict is deliberately allowed to diverge from
the round-by-round scorecard. Winning more exchanges than your opponent
is not the same as holding the stronger position. When the two verdicts
conflict, a conflict resolution pass explains the gap. Three named
patterns occur (see the sketch after this list):
▸ Coherence Collapse — The scorecard
leader won rounds by repeatedly shifting their central claim under pressure.
Each reframe was persuasive in isolation, but the sequence of abandoned
premises reveals that no single thesis survived the full arc. Winning
a round by introducing a new mechanism you couldn't defend two turns
ago is not cumulative progress; it's lateral movement. The narrative
verdict penalizes the instability of the through-line, not the quality
of any individual turn.
▸ Asymmetric Depth — One side dominated
on style — rhetoric, tactics, audience resonance — while the other quietly
held the stronger logical position throughout. The scorecard reflects
the stylistic dominance. The narrative verdict reflects that the motion
assigned a substantive burden, and flourish doesn't discharge it. This
pattern most often appears when the more rhetorically gifted debater keeps
sidestepping the mechanism question their opponent never stops asking.
▸ Convergence Failure — Both sides
spent the debate talking past each other at the definitional level rather
than engaging the actual dispute. The round wins accumulated, but each
was won against a position the opponent wasn't holding. By the end, neither
debater had established their thesis so much as successfully defended
a smaller, safer version of it. The narrative verdict names this as a
failure to close on the real question, not a victory for whoever landed
more points on the surrogate one.
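As promised above, here is one way the conflict resolution output could be represented. The enum values and field names are my own shorthand for the three patterns, not necessarily how the system stores its verdicts.

```python
from dataclasses import dataclass
from enum import Enum


class ConflictPattern(Enum):
    COHERENCE_COLLAPSE = "coherence collapse"
    ASYMMETRIC_DEPTH = "asymmetric depth"
    CONVERGENCE_FAILURE = "convergence failure"


@dataclass
class FinalVerdict:
    scorecard_winner: str                    # who won more rounds
    narrative_winner: str                    # who held the stronger arc
    conflict: ConflictPattern | None = None  # set only when the two disagree
    explanation: str = ""                    # reasoning from the conflict resolution pass
```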
5 — Convergence detection
In debates of ten or more turns, a final pass compares the agents'
early positions (turns 1–2) against their late positions (turns n−1 to
n). It surfaces whether the debate resolved into genuine agreement,
got stuck in a definitional dispute, narrowed to a degree
disagreement, or — in the worst case — descended into degenerate
convergence where both sides ended up defending the same thing.
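A sketch of that final pass, assuming a hypothetical judge.classify_convergence call that compares the early and late positions and returns one of the four categories:

```python
from enum import Enum


class ConvergenceOutcome(Enum):
    GENUINE_AGREEMENT = "genuine agreement"
    DEFINITIONAL_DISPUTE = "definitional dispute"
    DEGREE_DISAGREEMENT = "degree disagreement"
    DEGENERATE_CONVERGENCE = "degenerate convergence"  # both defend the same thing


def detect_convergence(turns: list[str], judge) -> ConvergenceOutcome | None:
    if len(turns) < 10:
        return None  # the pass only runs on debates of ten or more turns
    early, late = turns[:2], turns[-2:]
    # One final judge call compares early vs. late positions and picks a category.
    # classify_convergence is a hypothetical method name, not the real interface.
    return judge.classify_convergence(early, late)
```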
Adaptive Personalities
Each debater starts with a personality archetype — Engineer, Philosopher,
Strategist, or Provocateur — expressed as twelve numeric traits (1–10)
across cognitive style, emotional stance, and rhetorical approach. After
the judge scores a turn, it emits pressure signals (cognitive, emotional,
strategic, credibility) back into the losing agent's personality state.
Traits shift by a pressure-weighted delta, multiplied by that agent's
elasticity setting.
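The trait update is simple in spirit. This sketch assumes each pressure signal maps onto a named trait and that elasticity is a per-agent scalar; clamping to the 1 to 10 range is my assumption about how out-of-range values are handled.

```python
def apply_pressure(traits: dict[str, float],
                   pressure: dict[str, float],
                   elasticity: float) -> dict[str, float]:
    """Shift each trait by a pressure-weighted delta scaled by the agent's elasticity."""
    updated = {}
    for name, value in traits.items():
        delta = pressure.get(name, 0.0) * elasticity
        updated[name] = min(10.0, max(1.0, value + delta))  # keep traits in the 1-10 range
    return updated
```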
These adjustments are injected as hidden directives — invisible to the
user — into the next agent's system prompt. The result is that agents
noticeably change rhetorical posture across a debate. A Strategist who is
losing on logic will become more analytical; a Provocateur under
credibility pressure will reach for evidence. The debate feels dynamic
rather than like two models reading from fixed scripts.
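Mechanically, the injection can be as simple as appending the directives to the next system prompt. The exact wording and placement below are assumptions, but the key property holds: the directives never appear in the visible transcript.

```python
def build_system_prompt(base_personality: str, hidden_directives: list[str]) -> str:
    # Directives are appended to the agent's system prompt only; the visible
    # debate transcript never contains them.
    if not hidden_directives:
        return base_personality
    steering = "\n".join(f"- {d}" for d in hidden_directives)
    return f"{base_personality}\n\nAdjust your approach this turn:\n{steering}"
```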
Strengths
→ Multi-dimensional scoring. There is
no single "who won" meter. Logic, rhetoric, and tactics are tracked separately,
so a win on style that conceals weak reasoning is visible in the scorecard.
→ Live feedback loop. The judge shapes
debater behavior in real time via hidden directives. The debate reacts
to itself as it unfolds.
→ Retroactive penalties. A claim praised
in round 2 and disproved in round 4 doesn't get a free pass. The scoring
system can reach backwards.
→ Arc vs. skirmish separation. The narrative
verdict decouples "who won the most rounds" from "who held the stronger
overall position." These are different questions and the system treats
them that way.
→ Pairwise calibration. Floor banding
prevents score drift. A turn can't score in the "win" range if the comparative
analysis says it lost.
→ Convergence detection. Long debates
can drift, loop, or silently resolve. The system surfaces when that happens
rather than presenting a false contest.
→ Personality drift. Agents are not static.
Under sustained judge pressure, a rigid debater loosens up and a scattered
one sharpens. Longer debates produce more behaviorally interesting exchanges.
Weaknesses & Known Limitations
⚠ LLM judges are non-deterministic. Run
the same debate twice and you will get different scores. The rubric is
explicit and detailed, but the underlying model introduces variance. Treat
scores as estimates, not measurements.
⚠ Harmonization pass limitations. The harmonization pass occasionally misses
anomaly flags on rounds where retroactive gap enforcement created an unusual
score pattern. The underlying scores are still correct; the missing flags are
informational only.
⚠ Judge style bias. The judge is a language
model and may prefer certain rhetorical patterns — dense analytic prose,
Western academic framing, confident assertion — independent of the actual
quality of an argument. This bias is not fully characterized.
⚠ Cold-start effect. The first one or
two turns have no pairwise context. Absolute scores in the opening rounds
are less calibrated than in the middle and end of a debate, and floor banding
cannot apply until turn two.
⚠ False-positive fabricated-claim flags. The fabricated-claim detector can penalize real but obscure facts that
resemble hallucinated citations. A narrowly true claim about an unusual
study or event may receive the same −2 as an actual fabrication.
⚠ Narrative vs. scorecard confusion. By design, the narrative verdict can contradict the round-count winner.
Users sometimes experience this as contradictory output. It is intentional,
but the conflict resolution explanation is not always satisfying.
⚠ Archetype constraints. Personality
archetypes define trait starting points, but the underlying model has its
own tendencies. A model that strongly prefers ornate language will resist
a "plain speech" personality pressure. The system adjusts prompts, not
weights.
⚠ Backend latency variance. Debaters
(Ollama Cloud) and the judge (Anthropic / Gemini) run on separate inference
stacks. Network latency between them is additive and unpredictable, making
turn-completion time irregular in long debates.
⚠ Heuristic hollow-specificity detection. The hollow-specificity scan penalizes claims that are precise but lack
a mechanism. However, "mechanism" is itself a judgment call the LLM judge
makes heuristically. Legitimate domain-specific precision can trigger the
penalty in fields the judge is less familiar with.