VALL-E can mimic a person's voice when given a three-second audio sample
Once it has learned the voice, VALL-E can synthesise audio of that person saying just about anything
VALL-E even attempts to maintain the speaker's emotional tone
The creators of VALL-E believe that the AI model could be used for high-quality text-to-speech applications
This approach differs from traditional text-to-speech methods, which manipulate waveforms
Instead, VALL-E analyses how a person sounds and uses its training to match what it "thinks" that person would sound like speaking other phrases
In future, Microsoft's VALL-E may be able to edit a recording of a person so that it matches a text transcript, and could be used for audio content creation
This also means that it could be used to create fake audio files
VALL-E In Action
You can see VALL-E in action on Microsoft's VALL-E example website
Some of the results from the research almost pass for human speech, which is the ultimate goal
Because the software could easily be misused to manipulate what people appear to be saying, it has not been released to the public
The software could spoof voice identification or impersonate a specific speaker
Microsoft is working on a way of detecting audio that has been synthesised by VALL-E