Meta, the parent company of Facebook, has revealed a new AI model known as Voicebox AI. The tool generates speech from text input and could reshape voice assistants and a wide range of other applications. Although Meta has not yet released the program or its source code, Voicebox AI represents a significant leap in AI-driven speech generation technology.
Like OpenAI’s ChatGPT and DALL-E, Voicebox AI is a generative model, but it produces spoken language rather than text or images. The system has been trained on an extensive dataset comprising 50,000 hours of unfiltered audio. This dataset consists of recorded speech and matching transcripts from publicly available audiobooks in English, French, Spanish, German, Polish, and Portuguese.
The diversity of this dataset enables Voicebox AI to generate “more conversational speech,” regardless of the languages spoken by users. Meta’s researchers also report that speech recognition models trained on synthetic speech generated by Voicebox AI perform nearly as well as those trained on real speech.
Meta claims that Voicebox outperforms Microsoft’s VALL-E in text-to-speech synthesis on both intelligibility (a 1.9% word error rate for Voicebox versus 5.9% for VALL-E, where lower is better) and audio similarity (a score of 0.681 versus 0.580, where higher is better), all while being up to 20 times faster.
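For readers unfamiliar with the intelligibility metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn a transcript of the generated audio into the reference text, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only; the function name `word_error_rate` and the sample sentences are made up for illustration, and this is not the evaluation code Meta used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative helper (hypothetical name), not Meta's evaluation pipeline.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substituted word and one dropped word out of six.
wer = word_error_rate(
    "the dog barked loudly at night",
    "the dog parked loudly night",
)
print(f"{wer:.1%}")  # prints 33.3%
```

A lower value means a speech recognizer could transcribe the synthetic audio more accurately, which is why the metric serves as a proxy for intelligibility.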
Voicebox AI boasts several valuable features, including the ability to edit audio, remove noise, and even correct mispronounced words. Users can identify segments of distorted speech caused by noise, such as a barking dog, and instruct the model to correct or update those segments.
Meta has employed a novel training method called Flow Matching to develop this speech synthesis technology from scratch. Currently, only the research paper and audio examples are available to the public. The Voicebox program and its source code have not been released due to concerns about potential misuse.
The researchers envision a wide range of applications for Voicebox AI in the future. The technology could power voice prostheses for patients with damaged vocal cords, enhance non-player characters (NPCs) in games, and support more advanced digital assistants.
It’s worth noting that in February, Meta released its LLaMA AI language model to the AI research community, with access to the model granted on request. However, concerns about misuse arose when the model’s download links appeared on online forums shortly after its release.
Meta has also developed the Segment Anything Model (SAM), an AI image segmentation model that can identify specific objects in images or videos based on user cues. Additionally, the company has released open-source code and a dataset of 180,000 images for its Animated Drawings project, which uses AI to animate simple drawings.
While Meta’s Voicebox AI is not yet available for public use, it represents a significant advancement in AI-driven speech generation technology, with the potential to transform various industries and applications once it becomes more widely accessible.