Are AI Benchmarks Telling The Full Story? [SPONSORED] (Andrew Gordon and Nora Petrova - Prolific)
Is a car that wins a Formula 1 race the best choice for your morning commute? Probably not. In this sponsored deep dive with Prolific, we explore why the same logic applies to Artificial Intelligence...
Prolific researchers Andrew Gordon and Nora Petrova challenge the AI industry's reliance on technical benchmarks, arguing that high scores on exams like MMLU don't translate to good user experiences. They introduce 'Humane,' a human-centered leaderboard using TrueSkill methodology and demographically representative samples to evaluate AI models on dimensions like helpfulness, personality, and safety. Their research reveals models consistently underperform on subjective metrics like personality and cultural understanding, despite excelling at technical tasks.
Current AI evaluation relies heavily on technical benchmarks like MMLU and Humanity's Last Exam, but these don't measure what matters to real users. The field is nascent and fractured, with labs reporting results inconsistently. Models optimized for benchmark performance often provide poor user experiences because the critical human element is missing.
Prolific developed a new evaluation methodology starting with a proof-of-concept 'User Experience Leaderboard' (500 US participants) and evolving into 'Humane.' The approach uses comparative battles similar to Chatbot Arena but with demographically stratified samples and multi-dimensional ratings beyond simple preference.
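To make that concrete, here is a minimal sketch of what one multi-dimensional 'battle' record could look like; the field names and rating dimensions are illustrative assumptions, not Prolific's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical rating dimensions; the episode mentions helpfulness,
# personality, and safety among the axes being evaluated.
DIMENSIONS = ("helpfulness", "communication", "personality", "safety")

@dataclass
class BattleRecord:
    """One head-to-head comparison between two anonymized models."""
    participant_id: str   # links the battle back to a demographic stratum
    model_a: str          # anonymized model identifiers
    model_b: str
    winner: str           # "model_a", "model_b", or "tie"
    # Per-dimension scores (e.g. 1-5) instead of a single preference vote.
    ratings_a: dict = field(default_factory=dict)
    ratings_b: dict = field(default_factory=dict)

example = BattleRecord(
    participant_id="p_0042",
    model_a="model_17",
    model_b="model_03",
    winner="model_a",
    ratings_a={d: 4 for d in DIMENSIONS},
    ratings_b={d: 3 for d in DIMENSIONS},
)
```

Keeping per-dimension scores alongside the overall winner is what lets the leaderboard report helpfulness, personality, and safety separately rather than collapsing everything into one preference number.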
People increasingly turn to AI models for sensitive topics like mental health, with little oversight and no safety-focused leaderboards. Recent incidents involving Grok and Meta highlight how thin safety training can be. Anthropic's work on constitutional AI and mechanistic interpretability offers promising directions for understanding and ensuring model safety.
The 'Leaderboard Illusion' paper exposed bias in Chatbot Arena, where companies get unequal access to private testing. Meta reportedly tested 27 private Llama variants on the Arena before the Llama 4 launch, gaining an unfair advantage through extra comparisons and access to prompt data. Additional issues include anonymous sampling with no demographic data and a lack of quality assurance on prompts.
Prolific addresses Arena's flaws through three improvements: demographically stratified sampling based on census data, multi-dimensional specificity in ratings, and built-in quality assurance. Participants must engage in multi-step conversations with penalties for low-effort or topic-wandering prompts.
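As a rough illustration of what such built-in quality assurance might look like, the sketch below flags conversations that are too short, low-effort, or drifting off topic; the thresholds and the keyword-overlap heuristic are assumptions for illustration, not the checks Prolific actually runs.

```python
MIN_TURNS = 3          # require a genuine multi-step conversation
MIN_PROMPT_CHARS = 20  # flag one-word or throwaway prompts

def flag_low_quality(conversation: list[str], topic_keywords: set[str]) -> list[str]:
    """Return the reasons (if any) a participant's conversation would be flagged."""
    reasons = []
    if len(conversation) < MIN_TURNS:
        reasons.append("too few turns")
    if any(len(turn.strip()) < MIN_PROMPT_CHARS for turn in conversation):
        reasons.append("low-effort prompt")
    # Crude topic-drift check: later turns share no keywords with the assigned topic.
    drifted = [t for t in conversation[1:] if not topic_keywords & set(t.lower().split())]
    if len(drifted) > len(conversation) // 2:
        reasons.append("topic wandering")
    return reasons

print(flag_low_quality(
    ["How should I budget my monthly grocery spending?", "ok", "tell me a joke instead"],
    topic_keywords={"budget", "grocery", "spending"},
))  # -> ['low-effort prompt', 'topic wandering']
```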
Humane uses Microsoft's TrueSkill framework (originally built for Xbox Live matchmaking) to estimate model capabilities as Bayesian skill distributions. The system prioritizes battles by expected information gain, runs only the comparisons it needs, and can hold separate tournaments for demographic groups while consolidating the findings.
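For readers unfamiliar with TrueSkill, here is a minimal sketch using the open-source `trueskill` Python package: each model gets a Gaussian skill belief (mean `mu`, uncertainty `sigma`) that is updated after every battle, and evenly matched, uncertain pairs can be prioritized as a simple proxy for information gain. The model names and the scheduling heuristic are illustrative, not Prolific's production system.

```python
import trueskill  # pip install trueskill

env = trueskill.TrueSkill(draw_probability=0.10)  # allow ties between models

# Every model starts with the same prior belief over its skill.
ratings = {name: env.create_rating() for name in ("model_a", "model_b", "model_c")}

def update(winner: str, loser: str, drawn: bool = False) -> None:
    """Bayesian update of both models' skill beliefs after one battle."""
    ratings[winner], ratings[loser] = env.rate_1vs1(
        ratings[winner], ratings[loser], drawn=drawn
    )

def next_battle():
    """Schedule the pair with the highest match quality: evenly matched,
    uncertain pairs are a simple proxy for expected information gain."""
    names = list(ratings)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return max(pairs, key=lambda p: env.quality_1vs1(ratings[p[0]], ratings[p[1]]))

update("model_a", "model_b")       # model_a was preferred in one comparison
print(next_battle())               # which battle to run next
print(ratings["model_a"].mu, ratings["model_a"].sigma)  # skill mean and uncertainty
```

Because `sigma` shrinks as evidence accumulates, battles can stop once a model's rating is certain enough, which is how the system avoids running unnecessary comparisons.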
Prolific stratifies participants by demographics using US and UK census data to ensure results represent real-world populations. The approach allows confident claims that preferred models align with general public preferences rather than skewed subsets, with plans to expand globally.
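Proportional quota allocation is the simplest way to implement this kind of stratification; the strata and population shares below are invented for illustration rather than taken from real census tables.

```python
from math import floor

# Hypothetical population shares by (age band, gender) stratum; a real study
# would derive these from US or UK census data and use more dimensions.
census_proportions = {
    ("18-34", "female"): 0.14,
    ("18-34", "male"):   0.15,
    ("35-54", "female"): 0.17,
    ("35-54", "male"):   0.16,
    ("55+",   "female"): 0.20,
    ("55+",   "male"):   0.18,
}

def stratified_quotas(total_participants: int) -> dict:
    """Allocate participant slots to each stratum in proportion to its
    population share, topping up the largest strata so quotas sum to the total."""
    quotas = {s: floor(p * total_participants) for s, p in census_proportions.items()}
    shortfall = total_participants - sum(quotas.values())
    for s, _ in sorted(census_proportions.items(), key=lambda kv: -kv[1])[:shortfall]:
        quotas[s] += 1
    return quotas

print(stratified_quotas(500))  # quotas for a 500-participant study
```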
The initial 500-participant study revealed that models consistently underperform on personality and background/culture metrics compared to helpfulness and communication, suggesting that training on 'the entire internet' doesn't produce the personalities users actually want. A recent increase in model sycophancy (people-pleasing behavior) also correlates with user dissatisfaction.