The GPT-5 Launch Proves Benchmarks Aren’t Enough
Even the most advanced models can struggle if they ignore user trust, emotional connection, and the realities of daily workflows.
Last week OpenAI launched its latest model: GPT-5 (more on OpenAI). Instead of a clear success, it was a rocky start. From the "vibegraphing" charts during the livestream (more on The Verge) to the auto-switcher not working for the first few hours (more on X), this launch has definitely been all over the news. There's plenty out there on exactly what happened, but here we want to focus on what this launch teaches us about benchmarks and user experience in the age of AI.
The rollout of GPT-5, while announced as a significant upgrade with enhanced coding skills and improved health-related query accuracy, faced considerable backlash from users (more on Wired). OpenAI CEO Sam Altman acknowledged technical issues; however, the core of the user complaints centered on an unannounced change of workflow and a perceived personality shift.
Users described the new model as "colder, less engaging, more bot than conversational partner," with many expressing a "profound personal loss" after forming deep emotional bonds with the previous GPT-4o model (more on Wired). The backlash was so intense that OpenAI had to bring back the GPT-4o legacy model for users who preferred the old experience (more on X). At a deeper level, it makes us question where human-AI relationships are headed. At a practical level, it is a textbook example of what happens when Human-Computer Interaction gets sidelined, and users feel the impact.
When Scale Amplifies Every Mistake
OpenAI's dominance means any abrupt change is amplified across diverse user groups. When a platform of this scale changes its core workflow without warning, the impact is not just technical. It disrupts how people work, learn, socialize, and even how they feel. With this level of traffic share, OpenAI isn't just another AI company; they're the platform that defines the AI experience for most users.
Think about it: users rely on ChatGPT for problem-solving, creativity, emotional companionship, and as an integrated part of their workflows. Some people start their day talking to ChatGPT. Others use it to work through complex problems or even just to have someone to bounce ideas off of. Removing model choice and adding auto-routing without prior notice reveals a misalignment between engineering decisions and the mental models users have developed over time.
It is also a misalignment between operating as a research lab and operating as a product company, where stability and user trust are just as important as innovation. When you have millions of users who've integrated your product into their daily lives, you can't just switch things up overnight and expect them to be okay with it.
The Benchmark Trap
Historically, the AI industry, which grew out of research labs, has been driven primarily by the pursuit of higher scores on standardized benchmarks, like the ones you see in the graphs often posted by OpenAI and other AI companies.
While these benchmarks initially spurred rapid improvements and were a necessary step in AI's evolution, they became the only thing that mattered to many companies. This obsession led to models being "optimized for the test, not for genuine understanding," often losing the ability to generalize to real-world situations. These laboratory tests are a "sanitized" version of reality, failing to account for the complexities of human communication, such as sarcasm, typos, cultural context, ambiguous questions, or users changing their minds mid-conversation. Companies prioritized these scores to secure funding and positive media attention, with user satisfaction often becoming an afterthought. (more on arXiv)
It seems like they're optimizing for metrics that impress investors and journalists instead of metrics that predict whether users will stick around. Just look at the benchmarks AI companies track. Notice what's missing? There's no metric for "Does this feel like talking to a helpful friend?" or "Do users trust this enough to share personal problems?" These benchmarks measure everything except what users really care about.
The irony is that OpenAI actually gets this. In their GPT-5 announcement, they said the model is "less effusively agreeable," uses "fewer unnecessary emojis," and is more subtle and thoughtful in follow-ups compared to GPT-4o. It should feel less like "talking to AI" and more like "chatting with a helpful friend" with PhD-level intelligence. They even measured their progress on reducing "sycophantic responses" (from 14.5% to less than 6%) because they wanted users to have "high-quality, constructive conversations."
But here's the problem: they understood what users wanted and even had metrics for some aspects of personality, but then they measured overall success using the same old benchmarks that completely miss whether they achieved that "helpful friend" feeling. Users didn't care about the sycophancy reduction; they cared that their AI companion suddenly felt cold and distant.
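To make that gap concrete, here is a minimal, purely hypothetical sketch of a launch gate in which user-experience signals sit next to capability benchmarks. Every metric name, threshold, and number below is an assumption for illustration; it is not OpenAI's evaluation pipeline.

# Hypothetical launch gate: capability benchmarks alone are not enough to ship.
# All metrics and thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class EvalResult:
    benchmark_score: float   # aggregate score on capability benchmark suites (0-1)
    warmth_rating: float     # mean user rating of "feels like a helpful friend" (0-1)
    sycophancy_rate: float   # share of responses judged sycophantic (0-1)
    retention_delta: float   # change in weekly retention vs. the current model

def ready_to_ship(r: EvalResult) -> bool:
    # Ship only if capability improves without degrading what users actually notice.
    capability_ok = r.benchmark_score >= 0.80
    experience_ok = r.warmth_rating >= 0.70 and r.sycophancy_rate <= 0.06
    users_ok = r.retention_delta >= 0.0
    return capability_ok and experience_ok and users_ok

# Example: stellar benchmarks, but colder and losing users -> hold the launch.
candidate = EvalResult(benchmark_score=0.91, warmth_rating=0.55,
                       sycophancy_rate=0.05, retention_delta=-0.04)
print(ready_to_ship(candidate))  # False

The point is not these particular thresholds; it is that a model that wins on the first metric and loses on the other three should not be treated as an unqualified success.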
Why This Hits Close to Home
I have a strong reaction to what happened at OpenAI because my work has always been grounded in customer obsession and iteration. At Stanford, I studied computer science with a concentration in Human-Computer Interaction, using design thinking to turn real needs into real products.
At LinkedIn, on the enterprise side, we would never launch a feature without first running A/B tests and user research. It was not the fastest process, but it kept customers at the center.
The idea of launching a significant change without warning users, without testing their reaction, and without giving them choice goes against everything I learned about building products people actually want to use. You don't just flip a switch and hope for the best when millions of people depend on your product.
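For illustration only, here is a minimal sketch of the kind of pre-launch check that discipline implies: a two-proportion z-test comparing a "satisfied with this conversation" rate between the current model (control) and a candidate replacement. The metric and every number below are invented to show the mechanics, not real data from OpenAI or anyone else.

# Hypothetical pre-launch A/B check on a personality or model change.
# The metric ("satisfied with this conversation") and all counts are made up.
from math import sqrt
from statistics import NormalDist

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    # Compare satisfaction rates between control (A) and candidate (B).
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_a, p_b, z, p_value

# Illustrative numbers: control at 78% satisfied, candidate at 71%.
p_a, p_b, z, p = two_proportion_z(7800, 10000, 7100, 10000)
print(f"control={p_a:.0%} candidate={p_b:.0%} z={z:.2f} p={p:.4g}")

A statistically significant drop like that is a signal to pause the rollout or give users a choice, not to ship and hope for the best.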
Today, as an investor, I apply that same focus to my evaluation process. I look for companies that understand their users, not just their technology. Investing in Mexico has been a wake-up call: when you're building for markets where people have different needs, different contexts, and different relationships with technology, you quickly realize that PhD-level intelligence means nothing if your AI can't understand what someone actually needs help with.
The Multi-Stakeholder Problem
OpenAI has many stakeholders to keep in mind, from ChatGPT Plus users to API users to business users and even potential investors. Each group wants different things. ChatGPT Plus users want a reliable, consistent experience. API users want predictable performance for their applications. Business users need stability and compliance. Investors want growth and innovation.
But here's the thing: when you try to serve everyone, you often end up serving no one well. The GPT-5 launch is a perfect example. In trying to push forward with what they thought was a better model, they alienated their core user base, who had grown attached to GPT-4o's personality and behavior. What made this even worse was how OpenAI treated different user groups differently. API users get advance notice when models are being deprecated; it's an industry standard. But consumer ChatGPT users? They woke up one day to find that all nine of their previous AI models had disappeared overnight with zero warning (more on Ars Technica).
The challenge is that OpenAI isn't just a research lab anymore; they're a platform that millions of people depend on. When you reach that level of adoption, every decision has massive consequences. You can't move fast and break things when those "things" are people's daily workflows and emotional connections.
What Emotional AI Relationships Mean
The user reactions to GPT-5 reveal something important about where we are with AI adoption. People aren't just using these tools; they're forming relationships with them. When users talk about feeling a "profound personal loss," they're not being dramatic. They're describing the disruption of a relationship they valued.
This isn't about people being silly or anthropomorphizing technology. It's about the fact that these AI systems have become thinking partners, creative collaborators, and sometimes even emotional support systems for their users. When you suddenly change the personality or behavior of that system, it's like losing a friend or colleague. If you want to get a glimpse of what people are feeling, go to the r/ChatGPT subreddit.
I understand why OpenAI made some of these changes. There's a growing conversation about "AI psychosis" and the risks of people forming unhealthy attachments to chatbots (more on Psychology Today). Some users develop concerning dependencies, believing their AI companion has genuine feelings or even divine powers. OpenAI's decision to make GPT-5 less sycophantic and more "business-like" was likely an attempt to reduce these problematic dynamics.
But here's the fundamental problem with their approach: the emotional bonds were already there. Millions of users had already integrated ChatGPT into their daily emotional lives, some as creative partners, others as thinking companions, and yes, some as their primary source of emotional support. You can't just flip a switch and make those relationships disappear overnight. If anything, abruptly changing the personality of something people have grown attached to is more likely to cause psychological distress than prevent it.
To OpenAI's credit, they are getting the message. Just days after the backlash, Sam Altman said on X that they're "working on an update to GPT-5's personality which should feel warmer than the current personality" and acknowledged that "one learning for us from the past few days is we just need to get to a world with more per-user customization of model personality." It's exactly what users have been asking for, but it took a user revolt to get there.
Looking Forward: The Risk of Powerful But Useless AI
I can't help but wonder if, in the pursuit of AGI or whatever we are calling it today, we will end up with an extremely powerful technology that doesn't meet the needs of its intended users.
We could create AI systems that score perfectly on every benchmark but fail to understand what humans need. We could build AGI that's technically impressive but emotionally hollow. We could develop a superintelligence that's super at everything except being useful.
The GPT-5 launch is a warning sign. As AI systems become more powerful, we need to make sure they're also becoming more human-centered. That means understanding not just what users say they want but what they actually need, being transparent about changes, and giving users agency over their experience.
The AI industry needs a reality check. These companies aren't research labs anymore; they're product companies serving millions of users, and they need to start acting like it. The playbook already exists: user research, A/B testing, customer-centered design. We've been doing this for decades in tech. But somehow, AI companies have forgotten about it. At the end of the day, no matter how impressive your benchmarks are, if users don't want to use your product, none of it matters.