Meta AI recently introduced a “breakthrough” text-to-speech (TTS) generator that is claimed to produce results 20 times faster than current AI models with comparable performance.
The new system, dubbed Voicebox, eschews the traditional TTS architecture in favor of a model more like OpenAI’s ChatGPT or Google’s Bard.
One of the main differences between Voicebox and comparable TTS models such as ElevenLabs Prime Voice AI is that Meta's offering can generalize through in-context learning.
Like ChatGPT and other transformer models, Voicebox is trained on large-scale datasets. Previous attempts to train on huge amounts of audio data resulted in severely degraded audio output. For this reason, most TTS systems use small, carefully curated, labeled datasets.
Meta overcomes this limitation with a new training scheme that drops labeling and curation in favor of an architecture capable of “filling in” missing audio information.
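The “filling in” idea can be illustrated with a toy sketch. This is not Meta's code or architecture: the function name `mask_span`, the 30 percent mask fraction, and the list of floats standing in for audio features are all assumptions made for illustration. The sketch only shows how a training example for an infilling objective might be set up, with one span of audio hidden and the model asked to reconstruct it from the surrounding context.

```python
import random

def mask_span(frames, mask_fraction=0.3, seed=0):
    """Hide one contiguous span of frames (hypothetical helper).

    Returns the masked sequence and the (start, end) indices of the
    hidden span, with end exclusive.
    """
    rng = random.Random(seed)
    # At least one frame is hidden, and never more than the whole clip.
    span_len = min(len(frames), max(1, int(len(frames) * mask_fraction)))
    start = rng.randrange(0, len(frames) - span_len + 1)
    end = start + span_len
    masked = [None if start <= i < end else f for i, f in enumerate(frames)]
    return masked, (start, end)

# Stand-in for a sequence of audio features (an assumption, not real audio).
frames = [0.1 * i for i in range(10)]
masked, (start, end) = mask_span(frames)

# An infilling model would see `masked` (plus the transcript) and be
# trained to predict `targets`, the frames that were hidden.
targets = frames[start:end]
```

Because the hidden span is recovered from context rather than from hand-written labels, a setup along these lines does not depend on labeled data in the way a conventional TTS pipeline does.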
As Meta AI reported in a June 16 blog post, Voicebox is “the first model that can generalize speech generation tasks for which it has not been specifically trained to perform with state-of-the-art performance.”
This allows Voicebox to convert text to speech, remove unwanted noise by synthesizing replacement speech, and even apply a speaker's voice to output in other languages.
According to an accompanying research paper published by Meta, its pre-trained Voicebox system can do all of this with just the desired output text and a three-second audio clip.
The emergence of reliable speech generation comes at a particularly sensitive time: social media companies continue to grapple with content moderation, and the looming US presidential election threatens to once again test the limits of online misinformation detection.
For example, former US President Donald Trump is currently facing accusations that he mishandled sensitive government materials after leaving office. Among the alleged evidence cited in the case against him are audio recordings in which he allegedly confessed to possible wrongdoing.
While there is currently no indication that the former president intends to deny the content described in the audio files, his case shows that data integrity is at the core of the US legal system and, by extension, its democracy.
Voicebox is not the first tool of its kind, but it appears to be among the most robust. For that reason, Meta has also developed a tool for determining whether speech was generated by Voicebox, which the company claims can “trivially detect” the difference between real and fake audio. According to the blog post:
“As with other powerful AI innovations, we understand that this technology can lead to misuse and unintended harm. In our article, we detail how we created a highly efficient classifier that can distinguish between genuine speech and Voicebox generated audio to mitigate these possible future risks.”
In the world of cryptocurrencies, artificial intelligence has become as integral to the daily activities of most businesses as the Internet or electricity. Major exchanges rely on AI-powered chatbots for customer interaction and sentiment analysis, and trading bots have become commonplace.
The advent of robust text-to-speech systems such as Voicebox, coupled with automated trading, could help would-be cryptocurrency traders who rely on TTS systems and who may currently struggle with cryptocurrency jargon or limited multilingual support.