Technical Summary

Architecture

Market Synthesis uses adversarial orchestration with domain-routed experts:

graph TD
    subgraph S1["Stage 1: Single Agent Baseline"]
        direction LR
        E1["Event"] --> H1["Haiku 4.5<br/>1 API call"]
        H1 --> R1["Verdict"]
    end
    subgraph S2["Stage 2: Multi-Agent (no orchestration)"]
        direction LR
        E2["Event"] --> A2a["Agent 1"]
        E2 --> A2b["Agent 2"]
        E2 --> A2c["Agent 3"]
        A2a --> V2["Majority Vote"]
        A2b --> V2
        A2c --> V2
    end
    subgraph S3["Stage 3: Adversarial Debate (current)"]
        direction TB
        E3["Event"] --> CLS3["📋 Classify<br/>Sonnet"]
        CLS3 -->|"routes category-specific<br/>prompts to all 3"| PROS3["⚔️ Prosecutor<br/>Haiku"]
        CLS3 --> EXP3["🔬 Expert<br/>Sonnet"]
        CLS3 --> DEF3["🛡️ Defender<br/>Haiku"]
        PROS3 -->|"parallel,<br/>independent"| JUDGE3["⚖️ Judge<br/>Sonnet"]
        EXP3 --> JUDGE3
        DEF3 --> JUDGE3
        JUDGE3 --> TRADE3["🔍 Gap<br/>Haiku"]
        JUDGE3 --> CAL3["📊 Calibrate<br/>Haiku"]
        TRADE3 --> BEST["✅ Stage 3 result<br/>635 signals · +7.98% mean P&L<br/>303 backtested · 2014–2026<br/>adversarial exit system"]
        CAL3 --> BEST
    end
    S1 -.->|"Stage 1 → Stage 2<br/>96.9% detection,<br/>no discrimination"| S2
    S2 -.->|"Stage 2 → Stage 3<br/>77% detection,<br/>no orchestration"| S3
    style S1 fill:#f8fafc,stroke:#94a3b8
    style S2 fill:#fef3c7,stroke:#d97706
    style S3 fill:#eef4ff,stroke:#4f7ee8
    style BEST fill:#dcfce7,stroke:#16a34a,color:#14532d
    style CLS3 fill:#f0f9ff,stroke:#0284c7,color:#0c4a6e
    style PROS3 fill:#dcfce7,stroke:#16a34a,color:#14532d
    style DEF3 fill:#fee2e2,stroke:#dc2626,color:#7f1d1d
    style EXP3 fill:#fef3c7,stroke:#d97706,color:#78350f
    style JUDGE3 fill:#e0e7ff,stroke:#4f46e5,color:#312e81
    style TRADE3 fill:#fef3c7,stroke:#d97706,color:#78350f
    style CAL3 fill:#f0f9ff,stroke:#0284c7,color:#0c4a6e

Evolution from single agent to adversarial debate — orchestration is what makes multi-agent work

The current production architecture (8 agents, 4 sequential round-trips):

News Enrichment fetches real contemporaneous coverage (Guardian API 1992+, GNews 2004+, NYT if key available) with per-domain search strategies.
Classifier (Sonnet 4.5) routes each event to 1 of 7 domains. This is the critical routing decision: miscategorization cascades through the entire pipeline, so it uses the strongest model.
Prosecutor (Haiku 4.5) argues for dislocation using domain-specific evidence (e.g., reserve depletion for FX, leverage metrics for market structure).
Defender (Haiku 4.5) argues against dislocation using domain-specific counter-evidence, independently of the Prosecutor.
Domain Expert (Sonnet 4.5) provides category-specific specialist analysis.
Domain Adjudicator (Sonnet 4.5) synthesizes all arguments using a domain-specific adjudicator prompt with containment checks, and outputs a verdict plus a confidence value.
Gap Characterization (Haiku 4.5) describes the detected gap for fund managers (affected assets, key tension, time horizon, example actions) — runs in parallel with Calibration.
Calibration Agent (Haiku 4.5) adjusts the probability to reduce expected calibration error (ECE).
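
The steps above fit into four sequential round-trips when independent agents run concurrently. The sketch below shows one way to arrange that, assuming the grouping suggested by the Stage 3 diagram: classify, then three debaters in parallel, then adjudicate, then gap characterization and calibration in parallel. `call_agent` and `run_pipeline` are hypothetical names, the agent call is a stub rather than a real model request, and news enrichment is assumed to have happened before the first call.

```python
from concurrent.futures import ThreadPoolExecutor


def call_agent(role: str, payload: dict) -> dict:
    """Hypothetical stand-in for one LLM call; returns a canned result."""
    return {"role": role, "category": payload.get("category", "commodity")}


def run_pipeline(event: str) -> dict:
    trips = []  # records which agents ran in each sequential round-trip
    with ThreadPoolExecutor(max_workers=3) as pool:
        # Round-trip 1: the classifier picks the domain for everything downstream.
        category = call_agent("classifier", {"event": event})["category"]
        trips.append(["classifier"])
        payload = {"event": event, "category": category}

        # Round-trip 2: three independent agents run concurrently.
        debaters = ["prosecutor", "defender", "expert"]
        arguments = list(pool.map(lambda role: call_agent(role, payload), debaters))
        trips.append(debaters)

        # Round-trip 3: the adjudicator sees all three arguments.
        verdict = call_agent("adjudicator", {**payload, "arguments": arguments})
        trips.append(["adjudicator"])

        # Round-trip 4: the two post-verdict agents run concurrently.
        post = ["gap_characterization", "calibration"]
        post_payload = {**payload, "verdict": verdict}
        extras = list(pool.map(lambda role: call_agent(role, post_payload), post))
        trips.append(post)
    return {"category": category, "verdict": verdict,
            "extras": extras, "round_trips": trips}


result = run_pipeline("Central bank abandons currency floor")
print(len(result["round_trips"]))  # number of sequential round-trips
```

Because the Prosecutor, Defender, and Expert share no intermediate state in this arrangement, running them in one batch also keeps them independent of each other's arguments.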

All LLM calls use temperature 0.0 for full determinism — the same input always produces the same output, enabling reproducible backtests and meaningful A/B comparisons.

Total: 28+ specialized prompt templates (7 Prosecutor + 7 Defender + 7 Expert + 7 Adjudicator) — each category’s entire pipeline is domain-aware, including domain-specific containment checks for sovereign debt, central bank, and geopolitical events.
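
One plausible way to organize these templates is a registry keyed by (role, category), so that every debate agent resolves its prompt from the classifier's output. The sketch below is illustrative only: the category identifiers, the placeholder template text, and `build_prompt` are assumptions, not the contents of `adversarial_detector.py`.

```python
from itertools import product

ROLES = ("prosecutor", "defender", "expert", "adjudicator")
# Category identifiers are assumed; "market_structure" in particular is a guess
# at the label used for the Market Structure domain.
CATEGORIES = (
    "currency_peg", "commodity", "central_bank", "geopolitical",
    "market_structure", "sovereign_debt", "real_estate",
)

# Placeholder template text; real prompts would be far longer and hand-written.
TEMPLATES = {
    (role, category): f"You are the {role} for {category} events. Event: {{event}}"
    for role, category in product(ROLES, CATEGORIES)
}


def build_prompt(role: str, category: str, event: str) -> str:
    """Look up the category-specific template and fill in the event text."""
    try:
        template = TEMPLATES[(role, category)]
    except KeyError:
        raise ValueError(f"no template for role={role!r}, category={category!r}") from None
    return template.format(event=event)


print(len(TEMPLATES))  # 4 roles x 7 categories
```

Failing loudly on an unknown (role, category) pair matters here because a silent fallback to a generic prompt would hide exactly the kind of miscategorization the classifier is meant to prevent.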

Domain expert routing

Events are classified into 7 categories, each with a specialized expert prompt:

graph TD
    CLASS["📋 Classifier<br/>Sonnet 4.5"]
    CLASS --> FX["💱 FX/Peg Expert<br/>Reserve adequacy, intervention<br/>capacity, carry trade mechanics"]
    CLASS --> COMM["🛢️ Physical Markets Expert<br/>Supply chains, storage,<br/>producer concentration"]
    CLASS --> CB["🏦 Central Bank Expert<br/>Reaction function, policy<br/>transmission, forward guidance"]
    CLASS --> GEO["🌍 Geopolitical Expert<br/>Sanctions, trade wars, regime<br/>change, EM contagion"]
    CLASS --> MS["⚡ Market Structure Expert<br/>Leverage, short squeezes,<br/>margin cascades, LDI"]
    CLASS --> CREDIT["📉 Credit Expert<br/>Debt sustainability, contagion,<br/>restructuring, deposit risk"]
    CLASS --> BUBBLE["🏠 Bubble Expert<br/>Valuations, reflexivity loops,<br/>policy backstop assessment"]
    style CLASS fill:#f0f9ff,stroke:#0284c7,color:#0c4a6e
    style FX fill:#eef4ff,stroke:#4f7ee8,color:#0f2b57
    style COMM fill:#fef3c7,stroke:#d97706,color:#78350f
    style CB fill:#e0e7ff,stroke:#4f46e5,color:#312e81
    style GEO fill:#fee2e2,stroke:#dc2626,color:#7f1d1d
    style MS fill:#f5f3ff,stroke:#7c3aed,color:#4c1d95
    style CREDIT fill:#fef3c7,stroke:#d97706,color:#78350f
    style BUBBLE fill:#dcfce7,stroke:#16a34a,color:#14532d

7 specialized experts — each tunable independently with domain-specific prompt engineering

Evaluation setup

Dataset: 204 events total (150 positive + 54 quiet periods, 1971–2025)
7 event categories with per-category accuracy tracking
Cross-run consistency analysis (3 runs) to separate systematic failures from stochastic ones
Intra-event parallelism: 4 round-trips instead of 8 sequential calls
Inter-event parallelism: ThreadPoolExecutor with configurable workers
Deterministic “no trade” override: score < 50 OR gap_rejected → direction forced to neutral
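
The "no trade" override in the last item is plain post-processing on the model output, so it can be expressed without any LLM involvement. A minimal sketch, assuming a 0-100 score scale and the field and direction names shown (all hypothetical):

```python
from dataclasses import dataclass, replace

SCORE_FLOOR = 50  # threshold from the rule above


@dataclass(frozen=True)
class Verdict:
    score: float        # adjudicator score, assumed to be on a 0-100 scale
    direction: str      # "long", "short", or "neutral" (labels are assumptions)
    gap_rejected: bool  # True when the gap stage rejects the signal


def apply_no_trade_override(verdict: Verdict) -> Verdict:
    """Force direction to neutral when the score is low or the gap was rejected."""
    if verdict.score < SCORE_FLOOR or verdict.gap_rejected:
        return replace(verdict, direction="neutral")
    return verdict


print(apply_no_trade_override(Verdict(score=42, direction="long", gap_rejected=False)))
print(apply_no_trade_override(Verdict(score=80, direction="short", gap_rejected=False)))
```

Applying the rule after the model call, rather than asking the model to follow it, means the override holds regardless of what direction the LLM emits.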

Current state (2026-05-19): v10 pipeline + adversarial exit backtest

This is the canonical result: the full 12.3 years (2014-01-01 → 2026-05-01) on real news feeds, with the Sonnet category classifier and temperature 0.0 for full determinism.

Full statistical summary

Metric	Value	Notes
Coverage	2014-01-01 → 2026-05-01	12.3 years
Headlines indexed	626,245	Guardian + NYT + GDELT + FOMC + EDGAR
Active windows processed	215	after quality gates
Unique signals in store	635	after narrative collapse
Signals backtested	303	those with price data available
Mean P&L (adversarial exit)	+7.98%	per trade
Mean P&L (hold-to-time)	+4.86%	per trade
Adversarial alpha	+3.11%	exit system adds value
Worst drawdown	-19.3%	vs -57.3% for hold
Estimated annual return	~22% p.a.	at 10% allocation
Cost per signal (pipeline)	~$0.04	Sonnet classifier + 8 agents

Per-category P&L (adversarial exit backtest)

Category	N	Exit P&L	Hold P&L	Alpha
commodity	45	+32.8%	+21.8%	+11.0%
currency_peg	24	+16.8%	+2.9%	+13.9%
sovereign_debt	111	+3.2%	+2.0%	+1.2%
geopolitical	35	+3.2%	+3.6%	-0.3%
central_bank	67	+0.9%	+1.3%	-0.4%
real_estate	21	+0.6%	-0.5%	+1.1%
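
As a consistency check, the per-category rows can be combined into N-weighted means and compared with the headline figures in the statistical summary. The snippet below only re-does arithmetic on the numbers in the table; it is not part of the backtest code.

```python
# (category, N, exit P&L %, hold P&L %) copied from the table above
rows = [
    ("commodity", 45, 32.8, 21.8),
    ("currency_peg", 24, 16.8, 2.9),
    ("sovereign_debt", 111, 3.2, 2.0),
    ("geopolitical", 35, 3.2, 3.6),
    ("central_bank", 67, 0.9, 1.3),
    ("real_estate", 21, 0.6, -0.5),
]

total_n = sum(n for _, n, _, _ in rows)
exit_mean = sum(n * exit_pnl for _, n, exit_pnl, _ in rows) / total_n
hold_mean = sum(n * hold_pnl for _, n, _, hold_pnl in rows) / total_n

print(total_n, round(exit_mean, 2), round(hold_mean, 2))
```

The counts sum to the 303 backtested signals and the weighted exit mean comes out at about 7.98%, matching the summary. The weighted hold mean lands near 4.87% against the reported +4.86%, a difference small enough to be explained by the per-category figures being rounded to one decimal.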

Key v10 improvements over prior versions

Sonnet category classifier — routing decision uses strongest model (miscategorization cascades through pipeline)
Temperature 0.0 everywhere — full determinism, reproducible results
Intelligent instrument selector — 3-layer (disk cache → static registry → Sonnet LLM fallback) with regression testing
RE relevance gate — filters miscategorized signals (retail/insurance/utility bankruptcies wrongly labeled as real_estate)
RE forcing catalyst filter — requires acute catalyst (entity failure, rate shock, contagion) for real estate signals
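
The three-layer instrument lookup listed above (disk cache, then static registry, then LLM fallback) can be sketched as a simple fall-through. This is an assumption-laden illustration, not the code in `instrument_selector.py`: the registry contents are placeholders, `llm_fallback` is a hypothetical callable standing in for the Sonnet call, and the regression-testing step is omitted.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

# Placeholder registry; these are not real tickers.
STATIC_REGISTRY = {"commodity": "PLACEHOLDER_COMMODITY", "currency_peg": "PLACEHOLDER_FX"}


def select_instrument(signal_id: str, category: str, cache_path: Path,
                      llm_fallback: Callable[[str, str], str]) -> tuple[str, str]:
    """Return (ticker, source_layer), trying cache, then registry, then the LLM."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    if signal_id in cache:                        # layer 1: disk cache
        return cache[signal_id], "cache"
    if category in STATIC_REGISTRY:               # layer 2: static registry
        ticker, source = STATIC_REGISTRY[category], "registry"
    else:                                         # layer 3: LLM fallback
        ticker, source = llm_fallback(signal_id, category), "llm"
    cache[signal_id] = ticker                     # persist for the next run
    cache_path.write_text(json.dumps(cache, indent=2))
    return ticker, source


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "instrument_cache.json"
    stub_llm = lambda signal_id, category: "LLM_PICK"
    first = select_instrument("sig-1", "commodity", path, stub_llm)
    second = select_instrument("sig-1", "commodity", path, stub_llm)
    third = select_instrument("sig-2", "market_structure", path, stub_llm)

print(first[1], second[1], third[1])
```

Ordering the layers from cheapest to most expensive means the model is only consulted for signals that neither earlier layer can resolve, and a repeated lookup for the same signal is served from disk.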

Calibration note

The model is slightly underconfident: it predicts ~75% probability on average when actual P&L is positive ~85% of the time. This is a benign bias for fund-manager use (less over-conviction → less over-sizing) but means the raw probability numbers are conservative.
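
For readers unfamiliar with the metric, expected calibration error (ECE) bins predictions by stated probability and averages the gap between mean predicted probability and observed hit rate, weighted by bin size. The sketch below is a generic equal-width-bin implementation on synthetic data chosen to mirror the 75% versus 85% figures above; it is not the project's calibration code, and the toy inputs are not real signals.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |hit rate - mean predicted probability|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_p = sum(p for p, _ in bucket) / len(bucket)
        hit_rate = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / total * abs(hit_rate - mean_p)
    return ece


# Synthetic example: every prediction is 0.75 and 17 of 20 outcomes are positive (85%).
probs = [0.75] * 20
outcomes = [1] * 17 + [0] * 3
ece = expected_calibration_error(probs, outcomes)
print(round(ece, 2))
```

With all predictions in one bin, the ECE is simply the 10-point gap between the 85% hit rate and the 75% stated probability, which is the underconfidence the note describes.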

Full stats & prior methodology: results/bigrun_20260505_full_stats.md. Latest backtest snapshot: results/adversarial_exit_20260519_064643_v10_pipeline_rerun.json.

Historical run snapshot (legacy 3-run cross-validation on 204 hand-labeled events)

Metric	Run 1	Run 2	Run 3
Positive accuracy	82.7%	78.1%	76.0%
FPR (quiet periods)	38.9%	20.8%	18.5%
F1	0.841	0.841	0.832
Brier	0.101	0.119	0.152
Wall time	30.8 min	36.1 min	10.0 min
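
The F1 row can be cross-checked from the two rates above it, assuming "positive accuracy" means recall on the 150 positive events and FPR is measured on the 54 quiet periods. Under that assumption, expected true positives are recall × 150 and expected false positives are FPR × 54. This is a sanity check on the table, not the evaluation code.

```python
N_POS, N_NEG = 150, 54  # dataset composition from the evaluation setup


def f1_from_rates(recall: float, fpr: float) -> float:
    """F1 implied by recall on positives and false-positive rate on negatives."""
    tp = recall * N_POS
    fn = N_POS - tp
    fp = fpr * N_NEG
    return 2 * tp / (2 * tp + fp + fn)


# (positive accuracy, FPR) per run, copied from the table above
runs = {"Run 1": (0.827, 0.389), "Run 2": (0.781, 0.208), "Run 3": (0.760, 0.185)}
implied = {name: f1_from_rates(recall, fpr) for name, (recall, fpr) in runs.items()}
for name, value in implied.items():
    print(name, round(value, 3))
```

The implied values agree with the reported F1 scores to within about 0.001, which also explains why Run 1 and Run 2 share an F1 of 0.841: Run 2's lower recall is offset by its much lower false-positive rate.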

Cross-run stability (positives)

Status	Count	%
Always correct	99	66%
Always missed	15	10%
Stochastic (varies)	36	24%
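
The three status buckets follow directly from per-event results across runs: correct in every run, missed in every run, or mixed. A minimal sketch of that bucketing, using hypothetical event IDs and made-up outcomes:

```python
from collections import Counter


def classify_stability(results: list[bool]) -> str:
    """Bucket one event by its correctness across repeated runs."""
    if all(results):
        return "always_correct"
    if not any(results):
        return "always_missed"
    return "stochastic"


# Hypothetical per-event outcomes over 3 runs (True = detected correctly)
runs_by_event = {
    "evt_a": [True, True, True],
    "evt_b": [False, False, False],
    "evt_c": [True, False, True],
    "evt_d": [True, True, True],
}

counts = Counter(classify_stability(r) for r in runs_by_event.values())
print(dict(counts))
```

Events in the always-missed bucket point at systematic gaps worth fixing in prompts or data, while the stochastic bucket measures run-to-run variance.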

Stability by category

Category	Stable Correct	Stability
Currency Peg	26/35	74%
Commodity	20/27	74%
Sovereign Debt	14/19	74%
Real Estate	16/22	73%
Central Bank	8/18	44%
Geopolitical	9/16	56%
Market Structure	8/13	62%

Model assignment strategy

graph LR
    subgraph CHEAP["Fast Models (Haiku 4.5)"]
        direction TB
        B1["⚔️ Prosecutor"]
        B2["🛡️ Defender"]
        B3["🔍 Gap Characterization"]
        B4["📊 Calibration"]
    end
    subgraph EXPENSIVE["Quality Models (Sonnet 4.5)"]
        direction TB
        S0["📋 Classifier"]
        S1["🔬 Domain Expert"]
        S2["⚖️ Adjudicator"]
        S3["🎯 Instrument Selector"]
    end
    S0 -->|"routes to<br/>correct expert"| S1
    B1 -->|"firm commitment<br/>clean signal"| S2
    B2 -->|"firm commitment<br/>clean signal"| S2
    S1 -->|"independent analysis"| S2
    style CHEAP fill:#dcfce7,stroke:#16a34a
    style EXPENSIVE fill:#e0e7ff,stroke:#4f46e5
    style B1 fill:#f0fdf4,stroke:#16a34a,color:#14532d
    style B2 fill:#f0fdf4,stroke:#16a34a,color:#14532d
    style B3 fill:#f0fdf4,stroke:#16a34a,color:#14532d
    style B4 fill:#f0fdf4,stroke:#16a34a,color:#14532d
    style S0 fill:#eef2ff,stroke:#4f46e5,color:#312e81
    style S1 fill:#eef2ff,stroke:#4f46e5,color:#312e81
    style S2 fill:#eef2ff,stroke:#4f46e5,color:#312e81
    style S3 fill:#eef2ff,stroke:#4f46e5,color:#312e81

Sonnet for high-stakes decisions (routing, expert, judge, instrument); Haiku for committed advocacy

Key technical insights

Full domain specialization beats generic debate.
All debate agents (Prosecutor, Defender, Expert, Adjudicator) now use category-specific prompts — 28+ templates total. A currency peg Prosecutor argues about reserve depletion; a sovereign debt Adjudicator has containment checks for debt scares vs actual defaults.
Domain-specific containment checks reduce FPR.
Sovereign debt, central bank, and geopolitical adjudicators distinguish “scares that resolved” from “crises that happened” — two-sided rules that cap scare scores while protecting confirmed events.
Sonnet for routing, Haiku for advocacy.
The category classifier uses Sonnet because miscategorization cascades through the entire pipeline. Advocates use Haiku because firm commitment produces cleaner signal for the judge.
Temperature 0.0 for full determinism.
All LLM calls use temperature 0.0. Same input → same output. This enables reproducible backtests and meaningful A/B comparisons between pipeline versions.
Adversarial exit outperforms mechanical rules.
An LLM weekly check (“is the gap still open?”) beats fixed hold periods + stop-losses by 3.11 percentage points per trade and reduces worst-case drawdown by 38 percentage points (-19.3% vs -57.3%).
Intelligent instrument selection with regression testing.
3-layer selector (disk cache → static registry → Sonnet LLM) picks the best tradeable expression. Before replacing a cached instrument, the system regression-tests against all other signals using that ticker.
Signal quality gates before instrument optimization.
RE relevance gate + forcing catalyst filter remove miscategorized signals at the source — better than finding the “least bad instrument” for a wrong signal.
Real news enrichment with per-domain search strategies.
Guardian API (1992+) and GNews (2004+) with category-specific search terms. Per-domain source selection (e.g., commodity events skip Guardian).
Deterministic guardrails prevent false conviction.
Per-domain score/probability thresholds. When score or probability is below threshold, direction is forced to neutral regardless of LLM output.
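
To make the adversarial exit insight above concrete, the weekly check can be pictured as a loop that asks "is the gap still open?" once per week and closes the position on the first "no". The sketch below is a simplified long-only illustration: `gap_still_open` is a hypothetical callable standing in for the LLM check, `max_weeks` is an assumed safety cap, and the prices are invented.

```python
from typing import Callable, Sequence


def adversarial_exit(weekly_prices: Sequence[float], entry_price: float,
                     gap_still_open: Callable[[int, float], bool],
                     max_weeks: int = 26) -> tuple[int, float]:
    """Return (exit_week, pnl_pct) for a long position checked once per week."""
    last_week = min(len(weekly_prices), max_weeks)
    for week in range(1, last_week + 1):
        price = weekly_prices[week - 1]
        # Exit when the check says the gap has closed, or when data or the cap runs out.
        if week == last_week or not gap_still_open(week, price):
            return week, (price / entry_price - 1.0) * 100.0
    raise ValueError("weekly_prices must not be empty")


# Invented weekly closes after entering at 100; the stub reports the gap closed in week 3.
prices = [102.0, 105.0, 110.0, 108.0]
exit_week, exit_pnl = adversarial_exit(prices, 100.0, lambda week, price: week < 3)
hold_week, hold_pnl = adversarial_exit(prices, 100.0, lambda week, price: True)
print(exit_week, round(exit_pnl, 1), hold_week, round(hold_pnl, 1))
```

Passing a check that always answers "still open" reduces the loop to a hold-to-time baseline, which makes it easy to compare the two exit policies on the same price path.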

Source files

development/detectors/adversarial_detector.py — 8-agent detector with 28+ domain-specific prompts (including domain adjudicators)
development/detectors/adversarial_exit_detector.py — LLM-based weekly exit check (gap still open?)
development/shared/instrument_selector.py — 3-layer instrument selector with disk cache + regression testing
development/shared/historical_news.py — News enrichment (Guardian/GNews/NYT APIs) with per-domain search strategies
development/shared/domain_config.py — Per-category pipeline configuration (FRED series, tickers, thresholds, search strategies)
development/shared/macro_data.py — Real macro data (FRED + World Bank + yfinance) with per-domain indicator selection
development/shared/signal_extractor.py — Signal Extraction Agent (Haiku) for structured signal extraction from news
development/shared/bedrock_client.py — AWS Bedrock client with SSO auto-refresh
development/tests/backtest_adversarial_exit.py — Full adversarial exit backtest (303 signals, v10)
development/tests/backtest_parallel.py — Parallel backtest with domain-specific thresholds and direction-agnostic scoring
development/data/instrument_cache.json — Living map of signal → instrument assignments