Microsoft Has Unveiled VALL-E, An Audio AI

VALL-E is a new text-to-speech AI model

VALL-E can mimic a person's voice when given a three-second audio sample

Once it has learned the voice, VALL-E can synthesise audio of that person saying just about anything

VALL-E even attempts to maintain the speaker's emotional tone

The creators of VALL-E believe that the AI model could be used for high-quality text-to-speech applications

This text-to-speech method differs from traditional methods, which manipulate waveforms

Unlike those methods, VALL-E analyzes how a person sounds and uses its training to match what it "thinks" that person would sound like speaking other phrases

Soon, Microsoft's VALL-E may be able to edit a recording of a person to match a text transcript, and it could be used for audio content creation

This does mean that it could be used to create fake audio files.

VALL-E In Action

You can see VALL-E in action on Microsoft's VALL-E example website

Some of the results from the research almost pass for human speech, which is the ultimate goal

Because the software could easily be misused to manipulate what people appear to say, it has not been shared with the public

The software could spoof voice identification or impersonate a specific speaker

Microsoft is working on a way of detecting audio that has been synthesized by VALL-E