DeepSeek-V3: Deciding Which Score Matters — Interpreting 3.9% vs 6.1% (Old vs New)
https://alexissbrilliantchat.cavandoragh.org/why-did-o3-mini-high-jump-from-0-8-to-4-8-on-vectara-s-benchmark-and-what-it-means-for-document-length-evaluations
Why you should care: 3.9% or 6.1% changes decisions, budgets, and trust

If you manage models, evaluate vendor claims, or run A/B experiments, a jump from 3.9% to 6.1% is not trivial.
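
To see why the gap is not trivial, it helps to separate the absolute change from the relative one. The sketch below is a minimal illustration, not part of any benchmark tooling; the helper name `describe_change` is hypothetical, and the only inputs are the two figures quoted above.

```python
def describe_change(old_pct: float, new_pct: float) -> tuple[float, float]:
    """Return (absolute change in percentage points, relative change in percent).

    Hypothetical helper for illustration only.
    """
    absolute_pp = new_pct - old_pct                  # difference in percentage points
    relative_pct = absolute_pp / old_pct * 100.0     # change relative to the old score
    return absolute_pp, relative_pct


# The two scores discussed in this article.
absolute_pp, relative_pct = describe_change(3.9, 6.1)
print(f"{absolute_pp:.1f} percentage points, {relative_pct:.1f}% relative increase")
# prints "2.2 percentage points, 56.4% relative increase"
```

A 2.2-point gap sounds small when read as percentage points, but relative to the old score it is an increase of more than half, which is the framing that tends to matter for budgets and vendor comparisons.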