Today we’re releasing AttuneBench v1.0: the first open EQ benchmark built from real multi-turn human-AI conversations and scored against what the person in the conversation actually felt and wanted.

Most existing EQ benchmarks rely on synthetic prompts, single-turn interactions, or third-party ratings. But emotional intelligence is private knowledge. Only the person in the conversation knows whether they felt understood, what kind of response they wanted, and whether the interaction actually helped.

So instead of asking outside annotators to judge conversations, we asked the participants themselves.

We evaluated 11 frontier models across 200 real conversations and 50,000+ participant annotations.

One result stood out immediately: models were much better at recognizing how another model responded than at predicting what the participant actually wanted from the conversation.

For example, a model could correctly identify that a response was:
- validating
- analytical
- reassuring
- advice-oriented

while still failing to predict whether the participant actually wanted reassurance, analysis, or advice in that moment.

Several frontier models also ranked responses opposite to participant preference more often than with it.

In other words: sounding emotionally fluent is not the same thing as understanding what a person actually needs.

We also found that emotional understanding dropped substantially for participants reporting mental health diagnoses, even while models remained relatively strong at identifying surface conversational patterns.

AttuneBench v1.0 is open-source and free to run, and we’ll continue updating it as models and methods evolve.

Paper, leaderboard, code, and blog post are linked in the comments.

Huge credit to Kate Lubrano and Mark Whiting on the Pareto AI research team, Karina Nguyen and the Thoughtful team, and especially the Pareto annotators who lived and labeled these conversations firsthand.