Findings (Plain-English)
Market Synthesis was evaluated on 97 historical macro-financial dislocation events.
The system reached 91.8% headline accuracy and 92.8% adjusted accuracy after one label correction.
What worked
- Prompt simplification fixed truncation problems. Flat JSON with short fields eliminated output cut-offs and improved reliability.
- Confidence anchors improved scoring. Explicit score bands improved calibration and directional consistency.
- Role-specific model assignment improved outcomes. Haiku as debater + Sonnet as judge outperformed all-Sonnet in this setup.
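The flat-JSON point can be illustrated with a minimal sketch. The field names below are hypothetical, not the system's actual schema; the idea is that flat keys with short scalar values leave no nested structure for a token-limit cut-off to land inside:

```python
import json

# Hypothetical flat output schema (field names are illustrative only):
# every value is a short scalar, no nesting, so a truncated response
# fails loudly at the JSON parser instead of silently losing a branch.
flat_output = {
    "verdict": "stress",       # short categorical label
    "direction": "risk_off",   # directional call
    "confidence": 0.85,        # score from an anchored band
    "rationale": "Credit spreads widening; funding stress visible.",
}
print(json.dumps(flat_output))
```

Deeply nested objects, by contrast, tend to truncate mid-structure, which makes partial outputs look superficially parseable.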
Calibration and risk profile
- Brier score: 0.047 (well-calibrated for this task)
- Quiet-period FPR: 36.8% across all negatives tested
- FPR on truly calm/boring negatives: ~15%
False positives cluster around near-miss stress events, not fully quiet markets.
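For reference, the two metrics above can be sketched as follows. The data here is invented for illustration and is not the evaluation set; only the formulas match the report:

```python
# Brier score: mean squared difference between the forecast probability
# and the binary outcome (0 or 1). Lower is better; 0 is perfect.
def brier_score(probs, outcomes):
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# False-positive rate: share of true negatives flagged as positive.
def false_positive_rate(preds, labels):
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    return fp / (fp + tn)

# Hypothetical forecasts, not real evaluation data:
probs = [0.9, 0.2, 0.85, 0.1]
outcomes = [1, 0, 1, 0]
print(round(brier_score(probs, outcomes), 4))  # → 0.0206
```

The gap between the 36.8% quiet-period FPR and the ~15% FPR on truly calm negatives is the same `false_positive_rate` computed over two different negative subsets.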
Where it struggled
Common miss categories:
- Niche or recent events with limited model knowledge
- Intervention-heavy central bank events
- Direction-label ambiguity (for example, instrument framing differences)
For full narrative and event-by-event discussion, see the repository analysis:
development/results/FINDINGS.md