Findings (Plain English)

Market Synthesis was evaluated on 97 historical macro-financial dislocation events.
The system reached 91.8% headline accuracy, rising to 92.8% adjusted accuracy once one event label was corrected.
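The two accuracy figures can be checked with simple arithmetic. The sketch below assumes the headline figure corresponds to 89 correct calls out of 97; that count is inferred from the percentage (it is the only whole number that rounds to 91.8%) and is not stated in this summary. It also assumes the label correction turned one miss into a hit. The function and constant names are illustrative.

```python
def accuracy_pct(correct: int, total: int) -> float:
    """Return accuracy as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * correct / total, 1)


TOTAL_EVENTS = 97                        # stated in the summary
HEADLINE_CORRECT = 89                    # ASSUMPTION: inferred from 91.8%, not reported
ADJUSTED_CORRECT = HEADLINE_CORRECT + 1  # ASSUMPTION: one corrected label adds one hit

print(accuracy_pct(HEADLINE_CORRECT, TOTAL_EVENTS))  # 91.8
print(accuracy_pct(ADJUSTED_CORRECT, TOTAL_EVENTS))  # 92.8
```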

What worked

Prompt simplification fixed truncation problems: flat JSON with short fields eliminated output cut-offs and improved reliability.
Confidence anchors improved scoring: explicit score bands improved calibration and directional consistency.
Role-specific model assignment improved outcomes: using Haiku as debater and Sonnet as judge outperformed an all-Sonnet configuration in this setup.
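To make the first two points concrete, here is a minimal sketch of what a flat output format with anchored confidence bands might look like. The field names, the 200-character cap, and the band boundaries are all hypothetical; this summary does not give the actual schema or anchors.

```python
import json

# HYPOTHETICAL flat schema: every value is a short scalar, with no nesting.
REQUIRED_FIELDS = {"direction": str, "confidence": int, "rationale": str}
MAX_RATIONALE_CHARS = 200  # ASSUMPTION: cap that keeps outputs short

# HYPOTHETICAL confidence anchors; the real bands are not given in the summary.
CONFIDENCE_BANDS = [
    (0, 39, "weak"),
    (40, 69, "moderate"),
    (70, 100, "strong"),
]


def band_for(confidence: int) -> str:
    """Map a 0-100 confidence score to its anchored band label."""
    for low, high, label in CONFIDENCE_BANDS:
        if low <= confidence <= high:
            return label
    raise ValueError(f"confidence out of range: {confidence}")


def parse_verdict(raw: str) -> dict:
    """Parse one flat JSON verdict and attach its confidence band.

    Truncated output fails in json.loads, which raises json.JSONDecodeError
    (a subclass of ValueError), so cut-offs are caught rather than scored.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("verdict must be a JSON object")
    for name, expected_type in REQUIRED_FIELDS.items():
        value = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"missing or mistyped field: {name}")
    if len(data["rationale"]) > MAX_RATIONALE_CHARS:
        raise ValueError("rationale exceeds the short-field cap")
    data["band"] = band_for(data["confidence"])
    return data


verdict = parse_verdict(
    '{"direction": "risk_off", "confidence": 82, "rationale": "funding stress"}'
)
print(verdict["band"])  # strong
```

Keeping every field a short scalar means a cut-off response fails loudly at parse time instead of yielding a partially filled nested structure.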

Calibration and risk profile

Brier score: 0.047 (0 is perfect; well calibrated for this task)
Quiet-period false positive rate (FPR): 36.8% across all negatives tested
FPR on truly calm/boring negatives: ~15%

False positives cluster around near-miss stress events, not fully quiet markets.
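For readers unfamiliar with these metrics, the sketch below shows the standard definitions: the Brier score as the mean squared error between forecast probabilities and 0/1 outcomes, and FPR as the share of true negatives that were flagged. It also splits FPR by a hypothetical group tag ("near_miss" versus "calm") to mirror the breakdown above. The toy inputs are invented for illustration and are not the evaluation data.

```python
from collections import defaultdict


def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def false_positive_rate(predicted, actual):
    """FP / (FP + TN), computed over the actual negatives only."""
    flags_on_negatives = [p for p, a in zip(predicted, actual) if not a]
    if not flags_on_negatives:
        raise ValueError("no negative cases to score")
    return sum(1 for p in flags_on_negatives if p) / len(flags_on_negatives)


def fpr_by_group(predicted, actual, groups):
    """FPR computed separately for each group tag among the negatives."""
    buckets = defaultdict(list)
    for p, a, g in zip(predicted, actual, groups):
        if not a:
            buckets[g].append(p)
    return {g: sum(1 for p in flags if p) / len(flags) for g, flags in buckets.items()}


# Toy data only: five negatives, tagged with HYPOTHETICAL group labels.
predicted = [True, False, True, False, False]
actual = [False, False, False, False, False]
groups = ["near_miss", "near_miss", "near_miss", "calm", "calm"]

print(false_positive_rate(predicted, actual))
print(fpr_by_group(predicted, actual, groups))
print(brier_score([0.9, 0.1, 0.8, 0.3], [1, 0, 1, 0]))
```

In the toy data every false positive sits in the near-miss group, so the overall FPR overstates how often calm periods are flagged. This is the same pattern as the gap between the two FPR figures reported above.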

Where it struggled

Common miss categories:

Niche or recent events with limited model knowledge
Intervention-heavy central bank events
Direction-label ambiguity (for example, instrument framing differences)

For the full narrative and event-by-event discussion, see the repository analysis: development/results/FINDINGS.md