Technical Summary
Architecture
Market Synthesis uses adversarial orchestration with domain-routed experts:
[Figure: pipeline evolution across three stages]
- Stage 1: a single agent (1 API call) → Verdict.
- Stage 2: Multi-Agent (no orchestration): Event → Agent 1, Agent 2, Agent 3 → Majority Vote.
- Stage 3: Adversarial Debate (current): Event → Classify (Sonnet) → routes category-specific prompts to all 3 of Prosecutor (Haiku), Expert (Sonnet), and Defender (Haiku), which run in parallel and independently → Judge (Sonnet) → Gap (Haiku) and Calibrate (Haiku).
- Stage 3 result: 635 signals · +7.98% mean P&L · 303 backtested · 2014–2026 · adversarial exit system.
- Stage 1 → Stage 2: 96.9% detection, no discrimination. Stage 2 → Stage 3: 77% detection, no orchestration.
Evolution from single agent to adversarial debate: orchestration is what makes multi-agent work
The current production architecture (8 agents, 4 sequential round-trips):
- News Enrichment fetches real contemporaneous coverage (Guardian API 1992+, GNews 2004+, NYT if key available) with per-domain search strategies.
- Classifier (Sonnet 4.5) routes each event to 1 of 7 domains. This is the critical routing decision: miscategorization cascades through the entire pipeline, so it uses the strongest model.
- Prosecutor (Haiku 4.5) argues for dislocation using domain-specific evidence (e.g., reserve depletion for FX, leverage metrics for market structure).
- Defender (Haiku 4.5) argues against dislocation using domain-specific counter-evidence, independently of the Prosecutor.
- Domain Expert (Sonnet 4.5) provides category-specific specialist analysis.
- Domain Adjudicator (Sonnet 4.5) synthesizes all arguments using a domain-specific adjudicator prompt with containment checks, and outputs verdict + confidence.
- Gap Characterization (Haiku 4.5) describes the detected gap for fund managers (affected assets, key tension, time horizon, example actions); it runs in parallel with Calibration.
- Calibration Agent (Haiku 4.5) adjusts the probability to reduce expected calibration error (ECE).
All LLM calls use temperature 0.0 for full determinism: the same input always produces the same output, enabling reproducible backtests and meaningful A/B comparisons.
Total: 28+ specialized prompt templates (7 Prosecutor + 7 Defender + 7 Expert + 7 Adjudicator). Each category's entire pipeline is domain-aware, including domain-specific containment checks for sovereign debt, central bank, and geopolitical events.
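The four round-trips can be sketched as follows. This is a minimal illustration, not the code in adversarial_detector.py: `call_agent` and `classify` are hypothetical stubs, the grouping of agents into round-trips is one reading of the list above, and news enrichment (which runs before the first LLM call) is left out.

```python
from concurrent.futures import ThreadPoolExecutor


def call_agent(role: str, model: str, context: dict) -> dict:
    """Stub for one LLM call (hypothetical); a real version would call the model API."""
    return {"role": role, "model": model, "category": context.get("category")}


def classify(event: dict) -> str:
    """Stub classifier (hypothetical): always routes to one fixed category."""
    return "currency_peg"


def run_pipeline(event: dict) -> dict:
    with ThreadPoolExecutor(max_workers=3) as pool:
        # Round-trip 1: routing decision.
        ctx = {**event, "category": classify(event)}

        # Round-trip 2: three independent arguments, issued in parallel.
        debaters = [("prosecutor", "haiku"), ("defender", "haiku"), ("expert", "sonnet")]
        debate = list(pool.map(lambda rm: call_agent(rm[0], rm[1], ctx), debaters))

        # Round-trip 3: the adjudicator sees all three arguments.
        verdict = call_agent("adjudicator", "sonnet", {**ctx, "debate": debate})

        # Round-trip 4: gap characterization and calibration, in parallel.
        post = [("gap", "haiku"), ("calibration", "haiku")]
        gap, calibration = pool.map(
            lambda rm: call_agent(rm[0], rm[1], {**ctx, "verdict": verdict}), post
        )
    return {"category": ctx["category"], "debate": debate,
            "verdict": verdict, "gap": gap, "calibration": calibration}


result = run_pipeline({"headline": "Central bank abandons currency floor"})
print([d["role"] for d in result["debate"]], result["verdict"]["role"])
```

Because the Prosecutor, Defender, and Expert calls share no state, they cost one round-trip of latency rather than three; the same holds for the two post-verdict agents.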
Domain expert routing
Events are classified into 7 categories, each with a specialized expert prompt:
[Figure: domain expert routing. The classifier node (labeled "Haiku 4.5" in the diagram) routes each event to one of seven experts:]
- FX/Peg Expert: reserve adequacy, intervention capacity, carry trade mechanics
- Physical Markets Expert: supply chains, storage, producer concentration
- Central Bank Expert: reaction function, policy transmission, forward guidance
- Geopolitical Expert: sanctions, trade wars, regime change, EM contagion
- Market Structure Expert: leverage, short squeezes, margin cascades, LDI
- Credit Expert: debt sustainability, contagion, restructuring, deposit risk
- Bubble Expert: valuations, reflexivity loops, policy backstop assessment
7 specialized experts, each tunable independently with domain-specific prompt engineering
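One way to picture domain routing is a template registry keyed by (role, category). The sketch below is hypothetical: the template text is a placeholder, `get_prompt` is an illustrative helper, and the category identifiers follow the category names used in the results tables (with `market_structure` assumed), not necessarily the keys in domain_config.py.

```python
ROLES = ("prosecutor", "defender", "expert", "adjudicator")
CATEGORIES = (
    "currency_peg", "commodity", "sovereign_debt", "real_estate",
    "central_bank", "geopolitical", "market_structure",
)

# Placeholder template text (hypothetical); real templates carry domain-specific instructions.
TEMPLATES = {
    (role, cat): f"You are the {role} for a {cat} event. Event: {{event}}"
    for role in ROLES
    for cat in CATEGORIES
}


def get_prompt(role: str, category: str, event: str) -> str:
    """Look up the template for (role, category) and fill in the event text."""
    try:
        template = TEMPLATES[(role, category)]
    except KeyError:
        raise ValueError(f"no template for role={role!r}, category={category!r}") from None
    return template.format(event=event)


print(len(TEMPLATES))  # 28 = 4 roles x 7 categories
print(get_prompt("prosecutor", "currency_peg", "Reserves fall 40% in a quarter"))
```

Failing loudly on an unknown category, rather than falling back to a generic prompt, keeps a misrouted event from silently receiving the wrong expert.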
Evaluation setup
- Dataset: 204 events total (150 positive + 54 quiet periods, 1971–2025)
- 7 event categories with per-category accuracy tracking
- Cross-run consistency analysis (3 runs) to separate systematic vs stochastic failures
- Intra-event parallelism: 4 round-trips instead of 8 sequential calls
- Inter-event parallelism: ThreadPoolExecutor with configurable workers
- Deterministic "no trade" override: score < 50 OR gap_rejected → direction forced to neutral
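The "no trade" override is a plain rule applied after the LLM output. A minimal sketch, assuming a verdict carries a score, a `gap_rejected` flag, and a direction; the `Verdict` type and the direction labels are illustrative, not taken from the source code:

```python
from dataclasses import dataclass, replace

SCORE_FLOOR = 50  # threshold stated in the rule above


@dataclass(frozen=True)
class Verdict:
    score: float
    gap_rejected: bool
    direction: str  # assumed labels: "long", "short", or "neutral"


def apply_no_trade_override(v: Verdict) -> Verdict:
    """Force direction to neutral when score < 50 OR the gap was rejected."""
    if v.score < SCORE_FLOOR or v.gap_rejected:
        return replace(v, direction="neutral")
    return v


print(apply_no_trade_override(Verdict(42, False, "long")).direction)   # neutral
print(apply_no_trade_override(Verdict(80, True, "short")).direction)   # neutral
print(apply_no_trade_override(Verdict(80, False, "short")).direction)  # short
```

Keeping this check outside the model means a low-scoring signal can never carry a directional call, whatever the LLM returned.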
Current state (2026-05-19): v10 pipeline + adversarial exit backtest
The canonical result: the full 12.3 years (2014-01-01 to 2026-05-01) on real news feeds, with the Sonnet category classifier and temperature 0.0 for full determinism.
Full statistical summary
| Metric | Value | Notes |
|---|---|---|
| Coverage | 2014-01-01 to 2026-05-01 | 12.3 years |
| Headlines indexed | 626,245 | Guardian + NYT + GDELT + FOMC + EDGAR |
| Active windows processed | 215 | after quality gates |
| Unique signals in store | 635 | after narrative collapse |
| Signals backtested | 303 | those with price data available |
| Mean P&L (adversarial exit) | +7.98% | per trade |
| Mean P&L (hold-to-time) | +4.86% | per trade |
| Adversarial alpha | +3.11% | exit system adds value |
| Worst drawdown | -19.3% | vs -57.3% for hold |
| Estimated annual return | ~22% p.a. | at 10% allocation |
| Cost per signal (pipeline) | ~$0.04 | Sonnet classifier + 8 agents |
Per-category P&L (adversarial exit backtest)
| Category | N | Exit P&L | Hold P&L | Alpha |
|---|---|---|---|---|
| commodity | 45 | +32.8% | +21.8% | +11.0% |
| currency_peg | 24 | +16.8% | +2.9% | +13.9% |
| sovereign_debt | 111 | +3.2% | +2.0% | +1.2% |
| geopolitical | 35 | +3.2% | +3.6% | -0.3% |
| central_bank | 67 | +0.9% | +1.3% | -0.4% |
| real_estate | 21 | +0.6% | -0.5% | +1.1% |
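As a consistency check, the per-category rows can be rolled up into the headline figures. The sketch below assumes the headline mean P&L is a simple per-trade average over the 303 backtested signals, so the N-weighted mean of the category rows should reproduce it up to rounding of the inputs.

```python
# (category, N, exit P&L %, hold P&L %) copied from the per-category table.
rows = [
    ("commodity", 45, 32.8, 21.8),
    ("currency_peg", 24, 16.8, 2.9),
    ("sovereign_debt", 111, 3.2, 2.0),
    ("geopolitical", 35, 3.2, 3.6),
    ("central_bank", 67, 0.9, 1.3),
    ("real_estate", 21, 0.6, -0.5),
]

total_n = sum(n for _, n, _, _ in rows)
exit_mean = sum(n * e for _, n, e, _ in rows) / total_n
hold_mean = sum(n * h for _, n, _, h in rows) / total_n

print(total_n)                            # number of backtested signals
print(round(exit_mean, 2))                # N-weighted exit P&L
print(round(hold_mean, 2))                # N-weighted hold P&L
print(round(exit_mean - hold_mean, 2))    # implied adversarial alpha
```

The rows sum to 303 signals and give about +7.98% for the exit system and about +4.87% for hold-to-time, in line with the reported +7.98% and +4.86%. The roll-up also shows that commodity and currency_peg, with 69 of 303 trades, account for most of the aggregate P&L.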
Key v10 improvements over prior versions
- Sonnet category classifier: the routing decision uses the strongest model (miscategorization cascades through the pipeline)
- Temperature 0.0 everywhere: full determinism, reproducible results
- Intelligent instrument selector: 3-layer (disk cache → static registry → Sonnet LLM fallback) with regression testing
- RE relevance gate: filters miscategorized signals (retail/insurance/utility bankruptcies wrongly labeled as real_estate)
- RE forcing catalyst filter: requires an acute catalyst (entity failure, rate shock, contagion) for real estate signals
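The 3-layer lookup order of the instrument selector can be sketched as below. This is an assumption-laden illustration, not the code in instrument_selector.py: an in-memory dict stands in for the disk cache, the ticker strings are placeholders, `llm_fallback` is a hypothetical callable, and the regression-testing step is not shown.

```python
from typing import Callable, Dict, Tuple


def select_instrument(
    signal_id: str,
    category: str,
    cache: Dict[str, str],
    registry: Dict[str, str],
    llm_fallback: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (ticker, source), trying cache, then static registry, then the LLM."""
    # Layer 1: previously assigned instrument for this signal.
    if signal_id in cache:
        return cache[signal_id], "cache"
    # Layer 2: static per-category default.
    if category in registry:
        ticker, source = registry[category], "registry"
    else:
        # Layer 3: ask the model only when nothing cheaper applies.
        ticker, source = llm_fallback(category), "llm"
    cache[signal_id] = ticker  # remember the assignment for later runs
    return ticker, source


cache: Dict[str, str] = {}
registry = {"commodity": "TICKER_A"}       # placeholder ticker
fallback = lambda category: "TICKER_LLM"   # placeholder for a model call

print(select_instrument("sig-1", "commodity", cache, registry, fallback))
print(select_instrument("sig-1", "commodity", cache, registry, fallback))
print(select_instrument("sig-2", "real_estate", cache, registry, fallback))
```

Ordering the layers from cheapest to most expensive means the model is consulted only for signals that neither the cache nor the registry can resolve.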
Calibration note
The model is slightly underconfident: it predicts ~75% probability on average when actual P&L is positive ~85% of the time. This is a benign bias for fund-manager use (less over-conviction → less over-sizing) but means the raw probability numbers are conservative.
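To make the size of that gap concrete, here is a minimal sketch of expected calibration error (ECE) using equal-width probability bins. The binning scheme is an assumption (the source does not say how ECE is computed), and the inputs are synthetic values chosen only to mirror the ~75% predicted vs ~85% realized figures.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """ECE: bin-size-weighted mean of |observed hit rate - mean predicted probability|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        hit_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(hit_rate - mean_conf)
    return ece


# Synthetic illustration only: 20 signals predicted at 0.75, 17 of them profitable.
probs = [0.75] * 20
outcomes = [1] * 17 + [0] * 3
print(round(expected_calibration_error(probs, outcomes), 3))  # 0.1
```

ECE is unsigned, so it measures the size of the miscalibration but not its direction; the underconfidence reading comes from comparing the two averages directly.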
Full stats & prior methodology: results/bigrun_20260505_full_stats.md.
Latest backtest snapshot: results/adversarial_exit_20260519_064643_v10_pipeline_rerun.json.
Historical run snapshot (legacy 3-run cross-validation on 204 hand-labeled events)
| Metric | Run 1 | Run 2 | Run 3 |
|---|---|---|---|
| Positive accuracy | 82.7% | 78.1% | 76.0% |
| FPR (quiet periods) | 38.9% | 20.8% | 18.5% |
| F1 | 0.841 | 0.841 | 0.832 |
| Brier | 0.101 | 0.119 | 0.152 |
| Wall time | 30.8 min | 36.1 min | 10.0 min |
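The F1 values can be cross-checked against the accuracy and FPR rows. The sketch below assumes "positive accuracy" is recall over the 150 positive events and FPR is taken over the 54 quiet periods, and reconstructs whole counts by rounding. Run 2 is left out because its percentages do not correspond to whole counts under that assumption.

```python
def f1_from_rates(recall, fpr, n_pos=150, n_neg=54):
    """Rebuild confusion counts from rates, then compute F1 = 2TP / (2TP + FP + FN)."""
    tp = round(recall * n_pos)  # true positives
    fn = n_pos - tp             # missed positives
    fp = round(fpr * n_neg)     # quiet periods flagged as events
    return 2 * tp / (2 * tp + fp + fn)


for name, recall, fpr in [("Run 1", 0.827, 0.389), ("Run 3", 0.760, 0.185)]:
    print(name, round(f1_from_rates(recall, fpr), 3))
# Run 1 0.841
# Run 3 0.832
```

Both reconstructed values match the table, which supports reading the accuracy row as recall on the positive class.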
Cross-run stability (positives)
| Status | Count | % |
|---|---|---|
| Always correct | 99 | 66% |
| Always missed | 15 | 10% |
| Stochastic (varies) | 36 | 24% |
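The three stability buckets follow from comparing each event's outcome across runs. A minimal sketch with toy per-event results (the event names and outcomes are illustrative, not the real data), followed by a check of the percentages in the table:

```python
from collections import Counter


def classify_stability(run_results):
    """Bucket one event by whether it was detected in all, none, or some runs."""
    if all(run_results):
        return "always_correct"
    if not any(run_results):
        return "always_missed"
    return "stochastic"


# Toy outcomes across 3 runs (hypothetical events).
events = {
    "event_a": [True, True, True],
    "event_b": [False, False, False],
    "event_c": [True, False, True],
}
counts = Counter(classify_stability(r) for r in events.values())
print(dict(counts))

# Counts reported in the table, out of 150 positive events.
reported = {"always_correct": 99, "always_missed": 15, "stochastic": 36}
shares = {k: round(100 * v / 150) for k, v in reported.items()}
print(shares)  # {'always_correct': 66, 'always_missed': 15 -> 10, ...} see values below
```

The reported counts sum to 150 and give 66%, 10%, and 24%. Events that are always missed point to systematic failures, while the stochastic bucket reflects run-to-run variation.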
Stability by category
| Category | Stable Correct | Stability |
|---|---|---|
| Currency Peg | 26/35 | 74% |
| Commodity | 20/27 | 74% |
| Sovereign Debt | 14/19 | 74% |
| Real Estate | 16/22 | 73% |
| Central Bank | 8/18 | 44% |
| Geopolitical | 9/16 | 56% |
| Market Structure | 8/13 | 62% |
Model assignment strategy
[Figure: model assignment diagram. Two node groups, CHEAP (B1–B4) and EXPENSIVE (S0–S3); node labels are not recoverable. Surviving edge labels: "correct expert" (into S1), "firm commitment, clean signal" (B1 → S2 and B2 → S2), "independent analysis" (S1 → S2).]
Sonnet for high-stakes decisions (routing, expert, judge, instrument); Haiku for committed advocacy
Key technical insights
- Full domain specialization beats generic debate. All debate agents (Prosecutor, Defender, Expert, Adjudicator) now use category-specific prompts, 28+ templates in total. A currency peg Prosecutor argues about reserve depletion; a sovereign debt Adjudicator has containment checks for debt scares vs actual defaults.
- Domain-specific containment checks reduce FPR. Sovereign debt, central bank, and geopolitical adjudicators distinguish "scares that resolved" from "crises that happened": two-sided rules that cap scare scores while protecting confirmed events.
- Sonnet for routing, Haiku for advocacy. The category classifier uses Sonnet because miscategorization cascades through the entire pipeline. Advocates use Haiku because firm commitment produces cleaner signal for the judge.
- Temperature 0.0 for full determinism. All LLM calls use temperature 0.0. Same input → same output. This enables reproducible backtests and meaningful A/B comparisons between pipeline versions.
- Adversarial exit outperforms mechanical rules. An LLM weekly check ("is the gap still open?") beats fixed hold periods + stop-losses by +3.11% per trade and reduces worst-case drawdown by 38 percentage points.
- Intelligent instrument selection with regression testing. A 3-layer selector (disk cache → static registry → Sonnet LLM) picks the best tradeable expression. Before replacing a cached instrument, the system regression-tests against all other signals using that ticker.
- Signal quality gates before instrument optimization. The RE relevance gate + forcing catalyst filter remove miscategorized signals at the source, which is better than finding the "least bad instrument" for a wrong signal.
- Real news enrichment with per-domain search strategies. Guardian API (1992+) and GNews (2004+) with category-specific search terms. Per-domain source selection (e.g., commodity events skip Guardian).
- Deterministic guardrails prevent false conviction. Per-domain score/probability thresholds. When score or probability is below threshold, direction is forced to neutral regardless of LLM output.
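The adversarial exit idea (a weekly "is the gap still open?" check instead of a fixed hold period) can be sketched as a simple loop. This is an illustration under stated assumptions, not the logic in adversarial_exit_detector.py: `gap_still_open` is a hypothetical callback standing in for the LLM check, prices are weekly closes, and costs and stop-losses are ignored.

```python
from typing import Callable, Sequence, Tuple


def adversarial_exit(
    weekly_prices: Sequence[float],
    gap_still_open: Callable[[int], bool],
    direction: int = 1,  # +1 for long, -1 for short
) -> Tuple[int, float]:
    """Hold from week 0 until the weekly check says the gap has closed.

    Returns (exit_week, pnl_percent). If the check never fires, the
    position is held to the last available week.
    """
    entry = weekly_prices[0]
    exit_week = len(weekly_prices) - 1
    for week in range(1, len(weekly_prices)):
        if not gap_still_open(week):
            exit_week = week
            break
    pnl = direction * (weekly_prices[exit_week] / entry - 1.0) * 100
    return exit_week, pnl


prices = [100.0, 110.0, 120.0, 90.0]  # synthetic weekly closes
print(adversarial_exit(prices, lambda week: week < 2))  # check fires in week 2
print(adversarial_exit(prices, lambda week: True))      # held to the end
```

In this synthetic series the weekly check exits near the peak while the hold-to-end path gives the gain back, which is the kind of difference the exit-vs-hold comparison in the results measures.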
Source files
- development/detectors/adversarial_detector.py: 8-agent detector with 28+ domain-specific prompts (including domain adjudicators)
- development/detectors/adversarial_exit_detector.py: LLM-based weekly exit check (gap still open?)
- development/shared/instrument_selector.py: 3-layer instrument selector with disk cache + regression testing
- development/shared/historical_news.py: News enrichment (Guardian/GNews/NYT APIs) with per-domain search strategies
- development/shared/domain_config.py: Per-category pipeline configuration (FRED series, tickers, thresholds, search strategies)
- development/shared/macro_data.py: Real macro data (FRED + World Bank + yfinance) with per-domain indicator selection
- development/shared/signal_extractor.py: Signal Extraction Agent (Haiku) for structured signal extraction from news
- development/shared/bedrock_client.py: AWS Bedrock client with SSO auto-refresh
- development/tests/backtest_adversarial_exit.py: Full adversarial exit backtest (303 signals, v10)
- development/tests/backtest_parallel.py: Parallel backtest with domain-specific thresholds and direction-agnostic scoring
- development/data/instrument_cache.json: Living map of signal → instrument assignments