The new model, called VSSFlow, uses a novel architecture to generate sound and speech in a single integrated system, delivering state-of-the-art results. Watch (and listen to) the demo below.
The problem
Currently, most video-to-sound models (that is, models trained to generate audio from silent video) are not very good at generating speech. Similarly, most text-to-speech models can't produce non-speech sounds, since they were designed for a different purpose.
Moreover, previous attempts to integrate the two tasks were often built on the assumption that joint training degrades performance, forcing setups that train sound and speech in separate stages and complicating the pipeline.
Against this backdrop, three researchers from Apple and six researchers from Renmin University of China developed VSSFlow, a new AI model that can generate both sound effects and speech from silent videos in a single system.
Not only that, but in the architecture they developed, sound training improves speech training (and vice versa), rather than the two interfering with each other.
The solution
In a nutshell, VSSFlow combines several generative AI concepts. These include converting transcripts into phoneme token sequences and learning to reconstruct audio from noise using flow matching, which essentially trains a model to travel efficiently from random noise to the desired signal.
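To make that concrete, here is a minimal flow-matching training step in PyTorch. This is an illustrative sketch, not the authors' code: the `model` signature and the `cond` conditioning tensor are hypothetical stand-ins, but the straight-line interpolation between noise and data, with the network predicting the velocity along that path, is the core idea.

```python
# Minimal flow-matching training step (illustrative sketch, not the
# paper's code). `model` and `cond` are hypothetical stand-ins.
import torch
import torch.nn.functional as F

def flow_matching_loss(model, audio_latent, cond):
    noise = torch.randn_like(audio_latent)
    # One random time per example, broadcast over the latent dims.
    t = torch.rand(audio_latent.shape[0], *([1] * (audio_latent.dim() - 1)),
                   device=audio_latent.device)
    # Point on the straight path from noise (t=0) to data (t=1).
    x_t = (1 - t) * noise + t * audio_latent
    # The target is the constant velocity along that path.
    target_velocity = audio_latent - noise
    pred_velocity = model(x_t, t.flatten(), cond)
    return F.mse_loss(pred_velocity, target_velocity)
```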
All of this is built into a 10-layer architecture that blends video and transcript signals directly into the audio generation process, allowing the model to handle both sound effects and speech within a single system.
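The paper describes a condition aggregation mechanism for feeding video and transcript signals into a DiT-style transformer stack. One common way to do this, sketched below purely as an illustration (the block structure, dimensions, and token layout are assumptions, not VSSFlow's actual design), is to concatenate the condition tokens with the noisy audio tokens so self-attention can mix all three modalities.

```python
# Hypothetical joint-attention block: attend over noisy audio tokens,
# video features, and phoneme tokens in one sequence. Not VSSFlow's
# actual architecture; shapes and sizes are illustrative.
import torch
import torch.nn as nn

class JointConditionBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, audio_tokens, video_tokens, phoneme_tokens):
        # Concatenate all modalities so attention can mix them; only
        # the audio positions are kept as the block's output.
        seq = torch.cat([audio_tokens, video_tokens, phoneme_tokens], dim=1)
        h = self.norm1(seq)
        seq = seq + self.attn(h, h, h, need_weights=False)[0]
        seq = seq + self.mlp(self.norm2(seq))
        return seq[:, : audio_tokens.shape[1]]
```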
Perhaps more interestingly, the researchers found that joint sound-speech training actually improved performance on both tasks, rather than pitting them against each other or degrading either one.
To train VSSFlow, the researchers fed the model a combination of silent video paired with environmental sounds (V2S), silent talking-head video paired with transcripts (VisualTTS), and text-to-speech data (TTS), allowing it to learn both sound effects and spoken dialogue together in a single end-to-end training process.
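A rough picture of what that three-way data mix could look like in code, purely as a hypothetical sketch (the names and sampling strategy are assumptions, not the paper's pipeline):

```python
# Hypothetical sketch of the three-way training mix (names and sampling
# strategy are assumptions, not the paper's pipeline).
import random

def sample_training_example(v2s_data, visualtts_data, tts_data):
    task = random.choice(["v2s", "visualtts", "tts"])
    if task == "v2s":
        video, audio = random.choice(v2s_data)
        return video, None, audio   # no transcript for ambient-sound clips
    if task == "visualtts":
        video, text, audio = random.choice(visualtts_data)
        return video, text, audio   # talking video plus its transcript
    text, audio = random.choice(tts_data)
    return None, text, audio        # plain TTS: no video condition
```

In practice, a model would typically substitute learned placeholder embeddings for the missing (`None`) conditions rather than passing them through as-is.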
Importantly, they note that VSSFlow, out of the box, can't generate background sounds and spoken dialogue simultaneously in a single output.
To accomplish that, they fine-tuned the already-trained model on a large set of synthesized examples in which speech and environmental sounds are mixed together, so the model learns how both should sound at the same time.
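As a toy illustration of that kind of data synthesis, one could overlay a speech clip and an ambient clip as below; the gain and normalization choices here are assumptions, and the paper's actual synthesis recipe may differ.

```python
# Toy overlay of a speech clip and an ambient clip into one training
# target (gain and normalization are assumptions; the paper's actual
# synthesis recipe may differ).
import numpy as np

def mix_clips(speech, ambience, ambience_gain=0.5):
    n = min(len(speech), len(ambience))
    mixed = speech[:n] + ambience_gain * ambience[:n]
    return mixed / max(1.0, np.abs(mixed).max())  # keep peaks below 1.0
```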
Making VSSFlow work
To generate sound and speech from silent videos, the model starts with random noise and uses visual cues sampled from the video at 10 frames per second to shape the environmental sounds. At the same time, a transcript of what is said guides the generated speech.
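Put together, inference in a flow-matching model like this typically looks like a simple Euler integration from noise to an audio latent. The sketch below is an assumed setup (the `model` signature, step count, and conditioning inputs are illustrative, not VSSFlow's released API):

```python
# Illustrative Euler sampler for a flow-matching model (assumed setup;
# the `model` signature and step count are not VSSFlow's released API).
import torch

@torch.no_grad()
def generate_audio_latent(model, video_feats, phonemes, shape, steps=25):
    x = torch.randn(shape)                      # start from pure noise at t=0
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        v = model(x, t, video_feats, phonemes)  # predicted velocity field
        x = x + v * dt                          # one Euler step toward data
    return x  # a separate decoder/vocoder would turn this into a waveform
```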
When tested against task-specific models built for sound effects only or speech only, VSSFlow delivered competitive results across both tasks and outperformed them on several key metrics, despite using a single integrated system.
The researchers published multiple demos of sound, speech, and co-generation results (using Veo 3 videos), as well as comparisons between VSSFlow and several alternative models. You can see some of the results below, but be sure to visit the demo page to see them all.
And here’s the really cool thing: the researchers have open-sourced the VSSFlow code on GitHub, and they say they’re working on making the model weights public as well, along with an inference demo.
As for what happens next, the researchers say:
In this study, we present a unified flow model that integrates the video-to-sound (V2S) and visual text-to-speech (VisualTTS) tasks, establishing a new paradigm for video-conditioned sound and speech generation. Our framework demonstrates an effective condition aggregation mechanism for incorporating audio and video conditions into DiT architectures. Furthermore, through analysis, we reveal the mutually promoting effect of sound-speech joint learning and highlight the value of the unified generative model. Several directions for future research merit further exploration. First, the lack of high-quality video, sound, and speech data limits the development of unified generative models. Furthermore, developing better representations for sound and speech that preserve audio details while maintaining a compact size is an important future challenge.
Learn more in the study, titled “VSSFlow: Unifying video-conditioned sound and speech generation through joint learning,” at this link.