INDEPENDENT BENCHMARK REPORT

A rigorous blind human evaluation by Josh Talks Research placed Commotion ahead of three globally recognised providers across dimensions that matter in real-world customer support voice AI.

The evaluation: Rigorous, blind, and built for the real world

When it comes to voice AI for customer support, synthetic demos and lab benchmarks tell only part of the story. The conditions that define the real world, such as telephony audio at 8 kHz, distressed callers, and complex resolution paths, are exactly where most providers struggle.

Josh Talks Research, a specialist in voice AI evaluation, conducted a fully blind, head-to-head human evaluation comparing Commotion Laya v1.5 against three other leading commercial providers in a call centre setting. The word “blind” is key: listeners had no knowledge of which model produced each audio sample, eliminating brand bias from the results entirely.

The scale of the study gives its findings statistical weight:

  • 10,546: Total votes analysed

  • 332: Unique evaluators

  • 906: Unique sentences tested

  • 8: Languages tested

  • 10: Industry use cases represented (insurance, banking, telecom, travel, healthcare, e-commerce, logistics, ride-hailing, broadband, general support)

For this benchmark, Josh Talks Research selected eight languages: English, the most universally sought language in enterprise voice AI, alongside seven linguistically complex Indian languages: Hindi, Tamil, Malayalam, Marathi, Kannada, Telugu, and Bangla.

This combination is a deliberate stress test. The Dravidian and Indo-Aryan language families represented here are phonologically diverse, widely spoken, and notoriously difficult for Text-to-Speech (TTS) systems to handle well. A model that performs strongly across this range is demonstrating genuine linguistic sophistication, not just polished performance on a narrow set of easy inputs.

All audio was rendered at 8 kHz to replicate true telephony conditions. Evaluation prompts were drawn from real CX-style workflows, grounded in specific intents and resolution paths such as claim rejection, delivery delay, and SIM activation.

The evaluation was designed not to test “good-sounding text-to-speech” in the abstract, but to test which model sounds right when a distressed customer is on the other end of the line.

The results: An edge across all three competitors

The headline finding is unambiguous. On decisive votes, participants expressed a clear preference for Commotion Laya v1.5 in the main telephony benchmark.

When tied votes are excluded and only vote-based industry benchmark results are used, the margin is even wider: Commotion received 83% to 85% of votes head-to-head against each of the three providers. These figures held consistently across every language included in each comparison. This was not a result driven by strength in one region or one language.

Why Commotion performed better: Reliability, not just preference

Preference data tells you what listeners choose. Issue-rate data tells you why. In this benchmark, the competing models were tagged far more frequently for problems that directly impact customer experience in a live call:

  • Irregular pacing

  • Mispronunciation

  • Robotic delivery

  • Missing-word errors

Commotion Laya v1.5 carried an issue rate of approximately 31%, and its output was flagged for far fewer problems than the competing models. That reliability advantage is not cosmetic. In a live call-centre environment, every mispronounced name, every robotic pause, and every dropped word affects customer trust.

Empathy: The dimension that separates good from great

Standard TTS benchmarks measure whether a voice sounds pleasant. This evaluation went further by testing whether Commotion Laya v1.5’s voice sounded appropriately human, considerate, and emotionally aware in situations such as claims, hospitalisation follow-ups, and reassurance scenarios.

The empathy benchmark asked listeners not just which voice they preferred, but which one they perceived as more empathetic. Commotion Laya v1.5 demonstrated positive results on both measures:

  • 83.8% preference rate in empathy scenarios

  • Empathy “Yes” recognition for Commotion: 92.3%, 91.6%, and 86.1% across the three head-to-head comparisons

The explanation, again, comes down to delivery breakdowns. When a model mispronounces a word mid-sentence or stumbles in pacing at a moment of high emotional weight, it doesn’t just sound wrong. It sounds uncaring. Commotion’s lower rate of delivery errors means its voice remains coherent and steady precisely when it matters most.

Consistent across every use case

One of the most operationally significant findings is the breadth of Commotion’s strong performance. Commotion’s consistency across various use case categories indicates a model that is operationally ready for the full spectrum of customer interactions, not just the easy ones.

What this means for organisations evaluating Voice AI

Voice AI is no longer a novelty. It is infrastructure. Every mispronounced policy number, every robotic pause during a claims call, every missed word on a payment confirmation is a moment where a customer’s confidence in your brand erodes. The providers that will win in this space are those that are reliable under real-world conditions, not just impressive in demos.

This benchmark, conducted by an independent specialist with no stake in the outcome, provides exactly that kind of real-world signal. Across 10,546 preference votes, 332 unique evaluators, 8 languages, and 10 industry use cases, Commotion came out ahead on every dimension tested: overall preference, industry-specific performance, and empathy recognition.

One further finding deserves a post of its own.

India is one of the most linguistically complex markets on earth, with 22 constitutionally recognised languages, dozens of major dialects, and code-switching patterns that defeat most voice AI systems. Commotion achieved favourable outcomes across all seven Indian languages tested in this market. The implications of that result, for what it says about Commotion’s technology and its readiness for the world’s most demanding multilingual markets, are explored in our next post.

Schedule a conversation to understand what this benchmark means for your customer experience strategy.


Disclaimer: The study reported in this blog post was commissioned by Commotion, Inc. and independently conducted by Josh Talks Research. The views and conclusions expressed in this blog post represent the views of Commotion, Inc., on the report. Commotion, Inc. is not responsible or liable for the content, interpretations, methodology, or conclusions presented in the report. Click [here] to access the report dated April 12, 2026, published by Josh Talks Research to form your own views. All third-party trademarks belong to their respective owners.

Source: “Call-Center CX TTS Benchmark: Commotion Laya v1.5 vs Sarvam Bulbul v3, Cartesia Sonic v3, and ElevenLabs Turbo 2.5,” Josh Talks Research, April 12, 2026.