Why one benchmark wasn't enough: Interpreting Perplexity Sonar Pro and Gemini 2.5 Pro results
https://ricardosmasterchat.lucialpiazzale.com/refuse-or-guess-making-the-right-choice-for-high-stakes-ai-outputs
3 key factors when choosing an evaluation strategy for large language models

When you compare model claims such as "Perplexity Sonar Pro shows 37% citation errors" versus "Gemini 2.5 Pro reports 7.0% hallucination, improving on Gemini 2