It might look like a debate engine — and it is, technically — but that’s
not the point of this demo. The point is really the judge.
This system isn’t built to simulate debates — it’s built to evaluate
reasoning. At its core is a judge layer engineered to audit arguments the
way a scientist inspects a causal model. It enforces strict cause →
process → consequence structure, flags missing mechanisms, and penalizes
claims that rely on vibes, metaphors, or moral gestures without
explanatory substance. The auditor tracks contradictions across turns,
detects when an argument quietly drops a premise, and rewards positions
that maintain a stable causal spine throughout the entire exchange.
Instead of producing a “winner,” it generates arc‑level analyses that
reconstruct how each argument evolved, where it fractured, and why one
causal model held together while the other collapsed. This isn’t a debate
engine. It’s a structured reasoning environment that reveals how well
arguments actually work.
The debaters are large open-weight models running on Ollama Cloud — Kimi
K2 1T, Nemotron 3 Super, Qwen3, GLM-4, MiniMax M2.5, and others. Both
debaters and the judge can be swapped for any LLM provider you prefer. I
built a provider-agnostic LLM library to make this seamless: Anthropic,
Gemini, and Ollama are already implemented out of the box, and the architecture
is designed so that new providers can be added with minimal code. The judge runs
independently from the debaters and can be switched without touching the
rest of the stack. Its rubric is multi-dimensional — logic, rhetoric,
tactics, frame control, and credibility are scored separately. It also
emits hidden feedback signals back into the debate in real time, shaping
how each model argues as the exchange unfolds.
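For a sense of the shape of that provider library, here is a minimal sketch of what the abstraction could look like. The class and method names are illustrative assumptions, not the library's actual API.

```python
from abc import ABC, abstractmethod


class LLMProvider(ABC):
    """Anything that can turn a system prompt plus user prompt into text."""

    @abstractmethod
    def complete(self, system: str, prompt: str) -> str:
        ...


class OllamaCloudProvider(LLMProvider):
    def __init__(self, model: str):
        self.model = model  # whichever open-weight debater model you run

    def complete(self, system: str, prompt: str) -> str:
        raise NotImplementedError  # call the Ollama endpoint here


class AnthropicProvider(LLMProvider):
    def __init__(self, model: str):
        self.model = model

    def complete(self, system: str, prompt: str) -> str:
        raise NotImplementedError  # call the Anthropic Messages API here


# Debaters and judge are just provider instances, so swapping the judge
# (or a debater) never touches the rest of the stack.
debater_a: LLMProvider = OllamaCloudProvider(model="your-debater-model")
judge: LLMProvider = AnthropicProvider(model="your-judge-model")
```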
The Live Judge Layer
The judge is not a single score. It runs five levels of analysis, from
per-turn grading to whole-debate arc evaluation.
1 — Per-turn absolute scoring
Each turn receives independent scores on three dimensions — Logical Coherence (0–40), Rhetorical Force (0–30), and Tactical Effectiveness (0–30), weighted
into a composite 0–100 grade. Logic uses a component-chain model: a top score
requires a complete cause → process → measurable consequence chain with the
opponent's weakest premise explicitly addressed. Rhetoric uses a four-component
method and caps delivery scores when framing is weak.
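As a rough sketch of how those numbers combine, assuming the composite is simply the sum of the three dimensional scores (their maxima already add up to 100), and with field names that are mine rather than the system's:

```python
from dataclasses import dataclass


@dataclass
class TurnScore:
    logic: int     # Logical Coherence, 0-40
    rhetoric: int  # Rhetorical Force, 0-30
    tactics: int   # Tactical Effectiveness, 0-30

    @property
    def composite(self) -> int:
        # The dimension maxima sum to 100, so the composite grade
        # is just the sum of the three scores.
        return self.logic + self.rhetoric + self.tactics


print(TurnScore(logic=31, rhetoric=22, tactics=24).composite)  # 77
```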
2 — Pairwise comparative analysis
After each turn, the judge runs a separate head-to-head comparison of
the two most recent turns, determining dimensional winners (Logic,
Tactics, Rhetoric) or declaring a draw. This produces floor calibration: if an absolute score falls outside the band consistent with the
comparative verdict — a "logic win" scored below 22/40, for instance —
it's flagged and harmonized. This prevents score drift across a long
debate.
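Floor banding reduces to a simple consistency check. This sketch assumes a 22/40 floor for the logic dimension, mirroring the example above; the constant and function names are hypothetical.

```python
LOGIC_WIN_FLOOR = 22  # out of 40, the example band from above


def harmonize_logic(absolute_logic: int, won_logic_comparison: bool) -> tuple[int, bool]:
    """Return (possibly adjusted logic score, anomaly flag)."""
    if won_logic_comparison and absolute_logic < LOGIC_WIN_FLOOR:
        # The absolute score contradicts the comparative verdict:
        # flag it and lift it into the winning band.
        return LOGIC_WIN_FLOOR, True
    return absolute_logic, False
```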
3 — Claim & fact tracking
The judge maintains a Suspect Claims Register across turns. Claims that
are specific but mechanically hollow — a precise-sounding assertion with
no causal chain — are flagged with a −1 logic penalty. Fabricated
citations, historically inverted facts, or causally backwards claims
incur a mandatory −2 penalty upon detection. Critically, if a logic
failure is only exposed by a later turn, the prior turn's score
can be retroactively docked via a logic gap adjustment signal.
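A register like this can be a small bookkeeping structure. The sketch below uses my own naming and assumes penalties are stored as per-turn logic deltas, so a detection made later in the debate can still dock the earlier turn.

```python
from dataclasses import dataclass, field


@dataclass
class SuspectClaim:
    turn: int   # the turn that made the claim
    text: str
    kind: str   # "hollow" (-1 logic) or "fabricated" (-2 logic)


@dataclass
class ClaimsRegister:
    claims: list[SuspectClaim] = field(default_factory=list)
    adjustments: dict[int, int] = field(default_factory=dict)  # turn -> logic delta

    def flag(self, claim: SuspectClaim) -> None:
        penalty = -1 if claim.kind == "hollow" else -2
        self.claims.append(claim)
        # The penalty lands on the turn that made the claim, even when the
        # failure is only exposed later (the retroactive logic gap adjustment).
        self.adjustments[claim.turn] = self.adjustments.get(claim.turn, 0) + penalty
```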
4 — Narrative verdict
At the end of the debate a separate LLM call evaluates arc-level
coherence: did one agent hold a more consistent, evolving thesis across
the whole debate? This verdict is deliberately allowed to diverge from
the round-by-round scorecard. Winning more exchanges than your opponent
is not the same as holding the stronger position. When the two verdicts
conflict, a conflict resolution pass explains the gap. Three named
patterns occur (see the sketch after this list):
▸ Coherence Collapse — The scorecard
leader won rounds by repeatedly shifting their central claim under pressure.
Each reframe was persuasive in isolation, but the sequence of abandoned
premises reveals that no single thesis survived the full arc. Winning
a round by introducing a new mechanism you couldn't defend two turns
ago is not cumulative progress; it's lateral movement. The narrative
verdict penalizes the instability of the through-line, not the quality
of any individual turn.
▸ Asymmetric Depth — One side dominated
on style — rhetoric, tactics, audience resonance — while the other quietly
held the stronger logical position throughout. The scorecard reflects
the stylistic dominance. The narrative verdict reflects that the motion
assigned a substantive burden, and flourish doesn't discharge it. This
pattern most often appears when the more rhetorically gifted debater keeps
sidestepping the mechanism question their opponent never stops asking.
▸ Convergence Failure — Both sides
spent the debate talking past each other at the definitional level rather
than engaging the actual dispute. The round wins accumulated, but each
was won against a position the opponent wasn't holding. By the end, neither
debater had established their thesis so much as successfully defended
a smaller, safer version of it. The narrative verdict names this as a
failure to close on the real question, not a victory for whoever landed
more points on the surrogate one.
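As promised above, here is one way the conflict resolution output could be represented. The enum values and field names are my own shorthand for the three patterns, not necessarily how the system stores its verdicts.

```python
from dataclasses import dataclass
from enum import Enum


class ConflictPattern(Enum):
    COHERENCE_COLLAPSE = "coherence collapse"
    ASYMMETRIC_DEPTH = "asymmetric depth"
    CONVERGENCE_FAILURE = "convergence failure"


@dataclass
class FinalVerdict:
    scorecard_winner: str                    # who won more rounds
    narrative_winner: str                    # who held the stronger arc
    conflict: ConflictPattern | None = None  # set only when the two disagree
    explanation: str = ""                    # reasoning from the conflict resolution pass
```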
5 — Convergence detection
In debates of ten or more turns, a final pass compares the agents'
early positions (turns 1–2) against their late positions (turns n−1 to
n). It surfaces whether the debate resolved into genuine agreement,
got stuck in a definitional dispute, narrowed to a degree
disagreement, or — in the worst case — descended into degenerate
convergence where both sides ended up defending the same thing.
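A sketch of that final pass, assuming a hypothetical judge.classify_convergence call that compares the early and late positions and returns one of the four categories:

```python
from enum import Enum


class ConvergenceOutcome(Enum):
    GENUINE_AGREEMENT = "genuine agreement"
    DEFINITIONAL_DISPUTE = "definitional dispute"
    DEGREE_DISAGREEMENT = "degree disagreement"
    DEGENERATE_CONVERGENCE = "degenerate convergence"  # both defend the same thing


def detect_convergence(turns: list[str], judge) -> ConvergenceOutcome | None:
    if len(turns) < 10:
        return None  # the pass only runs on debates of ten or more turns
    early, late = turns[:2], turns[-2:]
    # One final judge call compares early vs. late positions and picks a category.
    # classify_convergence is a hypothetical method name, not the real interface.
    return judge.classify_convergence(early, late)
```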
Adaptive Personalities
Each debater starts with a personality archetype — Engineer, Philosopher,
Strategist, or Provocateur — expressed as twelve numeric traits (1–10)
across cognitive style, emotional stance, and rhetorical approach. After
the judge scores a turn, it emits pressure signals (cognitive, emotional,
strategic, credibility) back into the losing agent's personality state.
Traits shift by a pressure-weighted delta, multiplied by that agent's
elasticity setting.
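The trait update is simple in spirit. This sketch assumes each pressure signal maps onto a named trait and that elasticity is a per-agent scalar; clamping to the 1 to 10 range is my assumption about how out-of-range values are handled.

```python
def apply_pressure(traits: dict[str, float],
                   pressure: dict[str, float],
                   elasticity: float) -> dict[str, float]:
    """Shift each trait by a pressure-weighted delta scaled by the agent's elasticity."""
    updated = {}
    for name, value in traits.items():
        delta = pressure.get(name, 0.0) * elasticity
        updated[name] = min(10.0, max(1.0, value + delta))  # keep traits in the 1-10 range
    return updated
```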
These adjustments are injected as hidden directives — invisible to the
user — into the next agent's system prompt. The result is that agents
noticeably change rhetorical posture across a debate. A Strategist who is
losing on logic will become more analytical; a Provocateur under
credibility pressure will reach for evidence. The debate feels dynamic
rather than like two models reading from fixed scripts.
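Mechanically, the injection can be as simple as appending the directives to the next system prompt. The exact wording and placement below are assumptions, but the key property holds: the directives never appear in the visible transcript.

```python
def build_system_prompt(base_personality: str, hidden_directives: list[str]) -> str:
    # Directives are appended to the agent's system prompt only; the visible
    # debate transcript never contains them.
    if not hidden_directives:
        return base_personality
    steering = "\n".join(f"- {d}" for d in hidden_directives)
    return f"{base_personality}\n\nAdjust your approach this turn:\n{steering}"
```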
Strengths
→ Multi-dimensional scoring. There is
no single "who won" meter. Logic, rhetoric, and tactics are tracked separately,
so a win on style that conceals weak reasoning is visible in the scorecard.
→ Live feedback loop. The judge shapes
debater behavior in real time via hidden directives. The debate reacts
to itself as it unfolds.
→ Retroactive penalties. A claim praised
in round 2 and disproved in round 4 doesn't get a free pass. The scoring
system can reach backwards.
→ Arc vs. skirmish separation. The narrative
verdict decouples "who won the most rounds" from "who held the stronger
overall position." These are different questions and the system treats
them that way.
→ Pairwise calibration. Floor banding
prevents score drift. A turn can't score in the "win" range if the comparative
analysis says it lost.
→ Convergence detection. Long debates
can drift, loop, or silently resolve. The system surfaces when that happens
rather than presenting a false contest.
→ Personality drift. Agents are not static.
Under sustained judge pressure, a rigid debater loosens up and a scattered
one sharpens. Longer debates produce more behaviorally interesting exchanges.
Weaknesses & Known Limitations
⚠ LLM judges are non-deterministic. Run
the same debate twice and you will get different scores. The rubric is
explicit and detailed, but the underlying model introduces variance. Treat
scores as estimates, not measurements.
⚠ Harmonization pass limitations. The harmonization pass occasionally misses
anomaly flags on rounds where retroactive gap enforcement created an unusual
score pattern. The underlying scores are still correct; the missing flags are
informational only.
⚠ Judge style bias. The judge is a language
model and may prefer certain rhetorical patterns — dense analytic prose,
Western academic framing, confident assertion — independent of the actual
quality of an argument. This bias is not fully characterized.
⚠ Cold-start effect. The first one or
two turns have no pairwise context. Absolute scores in the opening rounds
are less calibrated than in the middle and end of a debate, and floor banding
cannot apply until turn two.
⚠ False-positive fabricated-claim flags. The fabricated-claim detector can penalize real but obscure facts that
resemble hallucinated citations. A narrowly true claim about an unusual
study or event may receive the same −2 as an actual fabrication.
⚠ Narrative vs. scorecard confusion. By design, the narrative verdict can contradict the round-count winner.
Users sometimes experience this as contradictory output. It is intentional,
but the conflict resolution explanation is not always satisfying.
⚠ Archetype constraints. Personality
archetypes define trait starting points, but the underlying model has its
own tendencies. A model that strongly prefers ornate language will resist
a "plain speech" personality pressure. The system adjusts prompts, not
weights.
⚠ Backend latency variance. Debaters
(Ollama Cloud) and the judge (Anthropic / Gemini) run on separate inference
stacks. Network latency between them is additive and unpredictable, making
turn-completion time irregular in long debates.
⚠ Heuristic hollow-specificity detection. The hollow-specificity scan penalizes claims that are precise but lack
a mechanism. However, "mechanism" is itself a judgment call the LLM judge
makes heuristically. Legitimate domain-specific precision can trigger the
penalty in fields the judge is less familiar with.