Technical Summary

Architecture

Market Synthesis uses adversarial orchestration with domain-routed experts:

graph TD
  subgraph S1["Stage 1: Single Agent Baseline"]
    direction LR
    E1["Event"] --> H1["Haiku 4.5<br/>1 API call"]
    H1 --> R1["Verdict"]
  end
  subgraph S2["Stage 2: Multi-Agent (no orchestration)"]
    direction LR
    E2["Event"] --> A2a["Agent 1"]
    E2 --> A2b["Agent 2"]
    E2 --> A2c["Agent 3"]
    A2a --> V2["Majority Vote"]
    A2b --> V2
    A2c --> V2
  end
  subgraph S3["Stage 3: Adversarial Debate (current)"]
    direction TB
    E3["Event"] --> CLS3["📋 Classify<br/>Sonnet"]
    CLS3 -->|"routes category-specific<br/>prompts to all 3"| PROS3["⚔️ Prosecutor<br/>Haiku"]
    CLS3 --> EXP3["🔬 Expert<br/>Sonnet"]
    CLS3 --> DEF3["🛡️ Defender<br/>Haiku"]
    PROS3 -->|"parallel,<br/>independent"| JUDGE3["⚖️ Judge<br/>Sonnet"]
    EXP3 --> JUDGE3
    DEF3 --> JUDGE3
    JUDGE3 --> TRADE3["🔍 Gap<br/>Haiku"]
    JUDGE3 --> CAL3["📊 Calibrate<br/>Haiku"]
    TRADE3 --> BEST["✅ Stage 3 result<br/>635 signals · +7.98% mean P&L<br/>303 backtested · 2014–2026<br/>adversarial exit system"]
    CAL3 --> BEST
  end
  S1 -.->|"Stage 1 → Stage 2<br/>96.9% detection,<br/>no discrimination"| S2
  S2 -.->|"Stage 2 → Stage 3<br/>77% detection,<br/>no orchestration"| S3
  style S1 fill:#f8fafc,stroke:#94a3b8
  style S2 fill:#fef3c7,stroke:#d97706
  style S3 fill:#eef4ff,stroke:#4f7ee8
  style BEST fill:#dcfce7,stroke:#16a34a,color:#14532d
  style CLS3 fill:#f0f9ff,stroke:#0284c7,color:#0c4a6e
  style PROS3 fill:#dcfce7,stroke:#16a34a,color:#14532d
  style DEF3 fill:#fee2e2,stroke:#dc2626,color:#7f1d1d
  style EXP3 fill:#fef3c7,stroke:#d97706,color:#78350f
  style JUDGE3 fill:#e0e7ff,stroke:#4f46e5,color:#312e81
  style TRADE3 fill:#fef3c7,stroke:#d97706,color:#78350f
  style CAL3 fill:#f0f9ff,stroke:#0284c7,color:#0c4a6e

Evolution from single agent to adversarial debate: orchestration is what makes multi-agent work

The current production architecture (8 agents, 4 sequential round-trips):

  1. News Enrichment fetches real contemporaneous coverage (Guardian API 1992+, GNews 2004+, NYT if key available) with per-domain search strategies.
  2. Classifier (Sonnet 4.5) routes the event to 1 of 7 domains. This is the critical routing decision: miscategorization cascades through the entire pipeline, so it uses the strongest model.
  3. Prosecutor (Haiku 4.5) argues for dislocation using domain-specific evidence (e.g., reserve depletion for FX, leverage metrics for market structure).
  4. Defender (Haiku 4.5) argues against dislocation using domain-specific counter-evidence, independently of the Prosecutor.
  5. Domain Expert (Sonnet 4.5) provides category-specific specialist analysis.
  6. Domain Adjudicator (Sonnet 4.5) synthesizes all arguments using a domain-specific adjudicator prompt with containment checks, and outputs a verdict plus confidence.
  7. Gap Characterization (Haiku 4.5) describes the detected gap for fund managers (affected assets, key tension, time horizon, example actions); it runs in parallel with Calibration.
  8. Calibration Agent (Haiku 4.5) adjusts the probability to reduce expected calibration error (ECE).
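The four sequential round-trips implied by the list above (classify, parallel debate, adjudicate, parallel gap + calibration) can be sketched with `asyncio`. This is a minimal illustration, not the project's code: `call_agent` is a hypothetical stand-in that returns a canned string instead of calling an LLM, the short model labels are placeholders, and news enrichment is assumed to have already happened.

```python
import asyncio


async def call_agent(role: str, model: str, payload: str) -> str:
    """Hypothetical stand-in for an LLM call; returns a canned label offline."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    return f"{role}[{model}]"


async def run_pipeline(enriched_event: str) -> dict:
    round_trips = 0

    # Round-trip 1: the classifier picks the domain used by every later prompt.
    domain = await call_agent("classifier", "sonnet", enriched_event)
    round_trips += 1

    # Round-trip 2: the three debaters run concurrently and never see each other.
    debaters = await asyncio.gather(
        call_agent("prosecutor", "haiku", enriched_event),
        call_agent("defender", "haiku", enriched_event),
        call_agent("expert", "sonnet", enriched_event),
    )
    round_trips += 1

    # Round-trip 3: the adjudicator reads all three arguments together.
    verdict = await call_agent("adjudicator", "sonnet", "\n".join(debaters))
    round_trips += 1

    # Round-trip 4: gap characterization and calibration run concurrently.
    gap, calibrated = await asyncio.gather(
        call_agent("gap", "haiku", verdict),
        call_agent("calibration", "haiku", verdict),
    )
    round_trips += 1

    return {
        "domain": domain,
        "debaters": list(debaters),
        "verdict": verdict,
        "gap": gap,
        "calibrated": calibrated,
        "round_trips": round_trips,
    }


result = asyncio.run(run_pipeline("Central bank abandons its currency peg"))
print(result["round_trips"], result["verdict"])
```

Grouping independent calls under `asyncio.gather` is what keeps seven LLM calls down to four sequential waits.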

All LLM calls use temperature 0.0 for full determinism: the same input always produces the same output, enabling reproducible backtests and meaningful A/B comparisons.

Total: 28+ specialized prompt templates (7 Prosecutor + 7 Defender + 7 Expert + 7 Adjudicator). Each category's entire pipeline is domain-aware, including domain-specific containment checks for sovereign debt, central bank, and geopolitical events.
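One way to picture the 7 × 4 template grid is a registry keyed by (role, domain). The sketch below is an assumption about layout, not the project's actual structure: the `prompts/<role>/<domain>.txt` paths are hypothetical, and `market_structure` is an assumed identifier spelled by analogy with the category names used in the backtest tables.

```python
from itertools import product

# Four debate roles, each with one template per domain.
ROLES = ["prosecutor", "defender", "expert", "adjudicator"]

# Seven domains; "market_structure" is an assumed spelling.
DOMAINS = [
    "currency_peg",
    "commodity",
    "central_bank",
    "geopolitical",
    "market_structure",
    "sovereign_debt",
    "real_estate",
]

# Hypothetical file layout: one template file per (role, domain) pair.
TEMPLATES = {
    (role, domain): f"prompts/{role}/{domain}.txt"
    for role, domain in product(ROLES, DOMAINS)
}


def template_for(role: str, domain: str) -> str:
    """Return the template path, failing loudly on an unknown combination."""
    try:
        return TEMPLATES[(role, domain)]
    except KeyError:
        raise ValueError(f"no template for role={role!r}, domain={domain!r}") from None


print(len(TEMPLATES), template_for("prosecutor", "currency_peg"))
```

Failing loudly on an unknown pair matters here because a silently substituted generic prompt would defeat the domain specialization.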

Domain expert routing

Events are classified into 7 categories, each with a specialized expert prompt:

graph TD
  CLASS["📋 Classifier<br/>Sonnet 4.5"]
  CLASS --> FX["💱 FX/Peg Expert<br/>Reserve adequacy, intervention<br/>capacity, carry trade mechanics"]
  CLASS --> COMM["🛢️ Physical Markets Expert<br/>Supply chains, storage,<br/>producer concentration"]
  CLASS --> CB["🏦 Central Bank Expert<br/>Reaction function, policy<br/>transmission, forward guidance"]
  CLASS --> GEO["🌍 Geopolitical Expert<br/>Sanctions, trade wars, regime<br/>change, EM contagion"]
  CLASS --> MS["⚡ Market Structure Expert<br/>Leverage, short squeezes,<br/>margin cascades, LDI"]
  CLASS --> CREDIT["📉 Credit Expert<br/>Debt sustainability, contagion,<br/>restructuring, deposit risk"]
  CLASS --> BUBBLE["🏠 Bubble Expert<br/>Valuations, reflexivity loops,<br/>policy backstop assessment"]
  style CLASS fill:#f0f9ff,stroke:#0284c7,color:#0c4a6e
  style FX fill:#eef4ff,stroke:#4f7ee8,color:#0f2b57
  style COMM fill:#fef3c7,stroke:#d97706,color:#78350f
  style CB fill:#e0e7ff,stroke:#4f46e5,color:#312e81
  style GEO fill:#fee2e2,stroke:#dc2626,color:#7f1d1d
  style MS fill:#f5f3ff,stroke:#7c3aed,color:#4c1d95
  style CREDIT fill:#fef3c7,stroke:#d97706,color:#78350f
  style BUBBLE fill:#dcfce7,stroke:#16a34a,color:#14532d

7 specialized experts, each tunable independently with domain-specific prompt engineering

Evaluation setup

Current state (2026-05-19): v10 pipeline + adversarial exit backtest

This is the canonical result: the full 12.3 years (2014-01-01 → 2026-05-01) on real news feeds, with the Sonnet category classifier and temperature 0.0 for full determinism.

Full statistical summary

| Metric | Value | Notes |
|---|---|---|
| Coverage | 2014-01-01 → 2026-05-01 | 12.3 years |
| Headlines indexed | 626,245 | Guardian + NYT + GDELT + FOMC + EDGAR |
| Active windows processed | 215 | after quality gates |
| Unique signals in store | 635 | after narrative collapse |
| Signals backtested | 303 | those with price data available |
| Mean P&L (adversarial exit) | +7.98% | per trade |
| Mean P&L (hold-to-time) | +4.86% | per trade |
| Adversarial alpha | +3.11% | exit system adds value |
| Worst drawdown | -19.3% | vs -57.3% for hold |
| Estimated annual return | ~22% p.a. | at 10% allocation |
| Cost per signal (pipeline) | ~$0.04 | Sonnet classifier + 8 agents |

Per-category P&L (adversarial exit backtest)

| Category | N | Exit P&L | Hold P&L | Alpha |
|---|---|---|---|---|
| commodity | 45 | +32.8% | +21.8% | +11.0% |
| currency_peg | 24 | +16.8% | +2.9% | +13.9% |
| sovereign_debt | 111 | +3.2% | +2.0% | +1.2% |
| geopolitical | 35 | +3.2% | +3.6% | -0.3% |
| central_bank | 67 | +0.9% | +1.3% | -0.4% |
| real_estate | 21 | +0.6% | -0.5% | +1.1% |
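The per-category rows can be cross-checked against the headline figures with an N-weighted mean. The sketch below uses only the numbers in the table above; it assumes the headline means are simple per-trade averages over all backtested signals, and because the table values are rounded to one decimal place, agreement is only expected to about two decimals.

```python
# (category, N, exit P&L %, hold P&L %) copied from the per-category table.
ROWS = [
    ("commodity", 45, 32.8, 21.8),
    ("currency_peg", 24, 16.8, 2.9),
    ("sovereign_debt", 111, 3.2, 2.0),
    ("geopolitical", 35, 3.2, 3.6),
    ("central_bank", 67, 0.9, 1.3),
    ("real_estate", 21, 0.6, -0.5),
]

total_n = sum(n for _, n, _, _ in ROWS)

# Weight each category mean by its trade count to recover the overall mean.
exit_mean = sum(n * exit_pnl for _, n, exit_pnl, _ in ROWS) / total_n
hold_mean = sum(n * hold_pnl for _, n, _, hold_pnl in ROWS) / total_n
alpha = exit_mean - hold_mean

print(f"N={total_n} exit={exit_mean:.2f}% hold={hold_mean:.2f}% alpha={alpha:.2f}%")
```

The weighted means land within rounding distance of the reported +7.98%, +4.86%, and +3.11%, and the category counts sum to the 303 backtested signals. The check also shows how concentrated the result is: commodity and currency_peg account for 69 of the 303 trades but carry most of the mean.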

Key v10 improvements over prior versions

  1. Sonnet category classifier: the routing decision uses the strongest model (miscategorization cascades through the pipeline)
  2. Temperature 0.0 everywhere: full determinism, reproducible results
  3. Intelligent instrument selector: 3-layer (disk cache → static registry → Sonnet LLM fallback) with regression testing
  4. Real estate (RE) relevance gate: filters miscategorized signals (retail/insurance/utility bankruptcies wrongly labeled as real_estate)
  5. RE forcing catalyst filter: requires an acute catalyst (entity failure, rate shock, contagion) for real estate signals
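The three-layer lookup order for the instrument selector (disk cache, then static registry, then LLM fallback) might look like the sketch below. Everything specific in it is an assumption made for illustration: the JSON cache format, the registry key and ticker placeholder, the `llm_fallback` callable, and the choice to write only LLM answers back to the cache. The regression-testing step is not shown.

```python
import json
from pathlib import Path
from typing import Callable, Tuple

# Hypothetical static registry; keys and ticker placeholders are illustrative.
STATIC_REGISTRY = {"commodity:crude_oil": "EXAMPLE_OIL_ETF"}


def select_instrument(
    signal_key: str,
    cache_path: Path,
    llm_fallback: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (ticker, layer) using cache -> registry -> LLM lookup order."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}

    # Layer 1: a previously resolved answer stored on disk.
    if signal_key in cache:
        return cache[signal_key], "disk_cache"

    # Layer 2: a hand-maintained mapping that needs no model call.
    if signal_key in STATIC_REGISTRY:
        return STATIC_REGISTRY[signal_key], "static_registry"

    # Layer 3: ask the model, then persist so the call is not repeated
    # (assumption: only LLM answers are written back to the cache).
    ticker = llm_fallback(signal_key)
    cache[signal_key] = ticker
    cache_path.write_text(json.dumps(cache, indent=2))
    return ticker, "llm_fallback"
```

Ordering the layers from cheapest to most expensive means the model is consulted only for signals that neither the cache nor the registry can resolve.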

Calibration note

The model is slightly underconfident: it predicts ~75% probability on average when actual P&L is positive ~85% of the time. This is a benign bias for fund-manager use (less over-conviction → less over-sizing) but means the raw probability numbers are conservative.
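The gap described here is what expected calibration error (ECE) measures: the bin-weighted average of |observed hit rate - mean predicted probability|. The sketch below is a standard equal-width-bin ECE, not the project's calibration code, and the sample data is synthetic, constructed only to reproduce the ~75% predicted vs ~85% observed averages quoted above.

```python
from typing import Sequence


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: sum over bins of (bin share) * |accuracy - confidence|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        confidence = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - confidence)
    return ece


# Synthetic data: every signal predicted at 0.75, 17 of 20 end with positive P&L.
probs = [0.75] * 20
outcomes = [1] * 17 + [0] * 3
gap = expected_calibration_error(probs, outcomes)
print(round(gap, 3))
```

With all predictions in one bin, the ECE reduces to the plain difference between the 85% hit rate and the 75% mean prediction, i.e. about 0.10.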

Full stats & prior methodology: results/bigrun_20260505_full_stats.md. Latest backtest snapshot: results/adversarial_exit_20260519_064643_v10_pipeline_rerun.json.

Historical run snapshot (legacy 3-run cross-validation on 204 hand-labeled events)

| Metric | Run 1 | Run 2 | Run 3 |
|---|---|---|---|
| Positive accuracy | 82.7% | 78.1% | 76.0% |
| FPR (quiet periods) | 38.9% | 20.8% | 18.5% |
| F1 | 0.841 | 0.841 | 0.832 |
| Brier | 0.101 | 0.119 | 0.152 |
| Wall time | 30.8 min | 36.1 min | 10.0 min |

Cross-run stability (positives)

| Status | Count | % |
|---|---|---|
| Always correct | 99 | 66% |
| Always missed | 15 | 10% |
| Stochastic (varies) | 36 | 24% |
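The three stability buckets follow directly from per-event outcomes across the runs. The sketch below shows the classification rule on a toy input and then recomputes the percentages from the counts in the table; the event IDs and the `results_by_event` shape are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def classify_stability(results_by_event: Dict[str, List[bool]]) -> Counter:
    """Bucket each positive event by whether every, no, or some runs detected it."""
    buckets = Counter()
    for runs in results_by_event.values():
        if all(runs):
            buckets["always_correct"] += 1
        elif not any(runs):
            buckets["always_missed"] += 1
        else:
            buckets["stochastic"] += 1
    return buckets


# Toy input: one list of three run outcomes per (hypothetical) event ID.
toy = {
    "event_a": [True, True, True],
    "event_b": [False, False, False],
    "event_c": [True, False, True],
}
toy_buckets = classify_stability(toy)

# Percentages recomputed from the counts reported in the table.
reported = {"always_correct": 99, "always_missed": 15, "stochastic": 36}
total = sum(reported.values())
shares = {status: round(100 * count / total) for status, count in reported.items()}
print(toy_buckets, total, shares)
```

The reported counts sum to 150 positive events, which is also the sum of the denominators in the per-category stability table.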

Stability by category

| Category | Stable correct | Stability |
|---|---|---|
| Currency Peg | 26/35 | 74% |
| Commodity | 20/27 | 74% |
| Sovereign Debt | 14/19 | 74% |
| Real Estate | 16/22 | 73% |
| Central Bank | 8/18 | 44% |
| Geopolitical | 9/16 | 56% |
| Market Structure | 8/13 | 62% |

Model assignment strategy

graph LR
  subgraph CHEAP["Fast Models (Haiku 4.5)"]
    direction TB
    B1["⚔️ Prosecutor"]
    B2["🛡️ Defender"]
    B3["🔍 Gap Characterization"]
    B4["📊 Calibration"]
  end
  subgraph EXPENSIVE["Quality Models (Sonnet 4.5)"]
    direction TB
    S0["📋 Classifier"]
    S1["🔬 Domain Expert"]
    S2["⚖️ Adjudicator"]
    S3["🎯 Instrument Selector"]
  end
  S0 -->|"routes to<br/>correct expert"| S1
  B1 -->|"firm commitment<br/>clean signal"| S2
  B2 -->|"firm commitment<br/>clean signal"| S2
  S1 -->|"independent analysis"| S2
  style CHEAP fill:#dcfce7,stroke:#16a34a
  style EXPENSIVE fill:#e0e7ff,stroke:#4f46e5
  style B1 fill:#f0fdf4,stroke:#16a34a,color:#14532d
  style B2 fill:#f0fdf4,stroke:#16a34a,color:#14532d
  style B3 fill:#f0fdf4,stroke:#16a34a,color:#14532d
  style B4 fill:#f0fdf4,stroke:#16a34a,color:#14532d
  style S0 fill:#eef2ff,stroke:#4f46e5,color:#312e81
  style S1 fill:#eef2ff,stroke:#4f46e5,color:#312e81
  style S2 fill:#eef2ff,stroke:#4f46e5,color:#312e81
  style S3 fill:#eef2ff,stroke:#4f46e5,color:#312e81

Sonnet for high-stakes decisions (routing, expert, judge, instrument); Haiku for committed advocacy
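The role-to-model split in the diagram can be captured as a small configuration table. This is a sketch under assumptions: the labels `"sonnet-4.5"` and `"haiku-4.5"` are placeholders rather than real API model identifiers, and `request_params` is a hypothetical helper showing how temperature 0.0 could be pinned in one place.

```python
# Placeholder model labels; substitute the real API model identifiers.
QUALITY_MODEL = "sonnet-4.5"
FAST_MODEL = "haiku-4.5"

# Role -> model, mirroring the two groups in the diagram above.
MODEL_BY_ROLE = {
    "classifier": QUALITY_MODEL,
    "domain_expert": QUALITY_MODEL,
    "adjudicator": QUALITY_MODEL,
    "instrument_selector": QUALITY_MODEL,
    "prosecutor": FAST_MODEL,
    "defender": FAST_MODEL,
    "gap_characterization": FAST_MODEL,
    "calibration": FAST_MODEL,
}


def request_params(role: str) -> dict:
    """Build per-call settings; temperature is fixed at 0.0 for every role."""
    return {"model": MODEL_BY_ROLE[role], "temperature": 0.0}


print(request_params("classifier"))
```

Keeping the mapping in a single table makes a model swap for one role a one-line change that an A/B backtest can then isolate.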

Key technical insights

  1. Full domain specialization beats generic debate.
    All debate agents (Prosecutor, Defender, Expert, Adjudicator) now use category-specific prompts, 28+ templates in total. A currency peg Prosecutor argues about reserve depletion; a sovereign debt Adjudicator has containment checks for debt scares vs actual defaults.
  2. Domain-specific containment checks reduce FPR.
    Sovereign debt, central bank, and geopolitical adjudicators distinguish "scares that resolved" from "crises that happened": two-sided rules that cap scare scores while protecting confirmed events.
  3. Sonnet for routing, Haiku for advocacy.
    The category classifier uses Sonnet because miscategorization cascades through the entire pipeline. Advocates use Haiku because firm commitment produces cleaner signal for the judge.
  4. Temperature 0.0 for full determinism.
    All LLM calls use temperature 0.0. Same input → same output. This enables reproducible backtests and meaningful A/B comparisons between pipeline versions.
  5. Adversarial exit outperforms mechanical rules.
    An LLM weekly check ("is the gap still open?") beats fixed hold periods + stop-losses by +3.11% per trade and reduces worst-case drawdown by 38 percentage points.
  6. Intelligent instrument selection with regression testing.
    3-layer selector (disk cache → static registry → Sonnet LLM) picks the best tradeable expression. Before replacing a cached instrument, the system regression-tests against all other signals using that ticker.
  7. Signal quality gates before instrument optimization.
    RE relevance gate + forcing catalyst filter remove miscategorized signals at the source, which is better than finding the "least bad instrument" for a wrong signal.
  8. Real news enrichment with per-domain search strategies.
    Guardian API (1992+) and GNews (2004+) with category-specific search terms. Per-domain source selection (e.g., commodity events skip Guardian).
  9. Deterministic guardrails prevent false conviction.
    Per-domain score/probability thresholds. When score or probability is below threshold, direction is forced to neutral regardless of LLM output.
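The guardrail in item 9 is a plain deterministic check applied after the LLM output. A minimal sketch follows; the threshold values, the 0-10 score scale, the default fallback, and the direction labels are all assumptions made for illustration, since the source does not give them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    min_score: float        # assumed 0-10 adjudicator score scale
    min_probability: float  # calibrated probability in [0, 1]


# Hypothetical per-domain values; the real thresholds are not given in the source.
THRESHOLDS = {
    "currency_peg": Thresholds(min_score=6.0, min_probability=0.60),
    "central_bank": Thresholds(min_score=7.0, min_probability=0.70),
}
DEFAULT_THRESHOLDS = Thresholds(min_score=7.0, min_probability=0.65)


def apply_guardrail(domain: str, score: float, probability: float, llm_direction: str) -> str:
    """Force 'neutral' when either the score or the probability is below threshold."""
    limits = THRESHOLDS.get(domain, DEFAULT_THRESHOLDS)
    if score < limits.min_score or probability < limits.min_probability:
        return "neutral"
    return llm_direction


print(apply_guardrail("currency_peg", 8.0, 0.55, "short"))
```

Because the check uses `or`, a high score cannot rescue a low probability (or the reverse), which is what keeps a confident-sounding LLM direction from passing on weak evidence.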

Source files