Amazon has unveiled Nova Sonic, a generative AI model that can process voice and generate natural-sounding speech. In tests measuring speed, speech recognition, and conversation quality, Sonic has proven itself competitive with leading voice models from OpenAI and Google.


Nova Sonic is Amazon’s answer to newer voice AI models like the one that powers ChatGPT, which hold more natural conversations than earlier versions of Alexa. Technological advances in recent years have made older models and digital assistants, including Alexa and Apple’s Siri, seem stilted by comparison. Nova Sonic is available through Bedrock, Amazon’s platform for enterprise AI developers, where it is accessed through a bidirectional streaming API. According to Amazon, Nova Sonic costs 80 percent less to operate than OpenAI’s multimodal GPT-4o, and its components already power the upcoming Alexa+.

Nova Sonic excels at routing user requests to various APIs: it knows when it needs to retrieve information from the web in real time, analyze a proprietary data source, or perform an action in an external app, and it uses the appropriate tool to do so. During a two-way conversation, Nova Sonic waits to speak “at the right time,” accounting for the pauses and hesitations of the other person. It also creates a text transcript of the user’s speech that developers can use for a variety of apps.

In speech recognition tasks, it is less error-prone than other voice AI models, meaning it understands the user relatively well, even if they mumble, make mistakes, or are in a noisy environment. In the Multilingual LibriSpeech benchmark, which measures speech recognition performance across languages and dialects, Nova Sonic achieved a word error rate (WER) of just 4.2% on average across English, French, Italian, German, and Spanish. That means about four out of every hundred words in its transcript differ from a human’s transcription of the same audio.
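For readers unfamiliar with the metric, WER is conventionally computed as the number of word-level substitutions, deletions, and insertions needed to turn the model's transcript into the human reference, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only. It is not Amazon's evaluation code, the function name `word_error_rate` and the sample sentences are made up for illustration, and real benchmarks apply more careful text normalization than the lowercasing and whitespace splitting assumed here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Simplifying assumption: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one wrong word out of five reference words.
reference = "turn on the kitchen lights"
hypothesis = "turn on the kitchen light"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")  # WER: 20.0%
```

By this definition, a WER of 4.2% corresponds to roughly 4 word-level errors per 100 reference words.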

In the Augmented Multi-Party Interaction benchmark, which tests recognition of spoken conversations with multiple participants, Nova Sonic was 46.7% more accurate in terms of WER than OpenAI’s GPT-4o-transcribe. Amazon’s model was also very fast, with an average latency of 1.09 seconds versus 1.18 seconds for GPT-4o, which underlies OpenAI’s Realtime API.

The company plans to introduce several more AI models capable of processing images, video, voice, and “other sensory data that is needed when translating into the physical world.”
