Microsoft’s VASA-1 can generate realistic talking faces from just one image.

In a recent white paper, Microsoft introduced a new AI model that produces a talking head that looks and sounds realistic and is generated by only uploading a still photograph and a voice sample.

The new model is named VASA-1, and it requires only one portrait style picture and an audio file of voice and fuses them together to make a short video of a talking head with facial expressions, lip syncing, and head movements. The produced head can even sing songs, and that in the voice uploaded at the time of creation.

Microsoft VASA-1 is a breakthrough for animation

According to Microsoft, the new AI model is still in the research phase, and there are still no plans to release it to the general public, and only Microsoft researchers have access to it. However, the company shared quite a few samples of the demonstrations, which show stunning realism and lip movements that seem to be too lifelike.

The demo shows people who look real, as if they were sitting in front of a camera and getting filmed. The movements of the heads are realistic and look quite natural, and the lip movement to match the audio is quite outstanding, provided that there seems very little to be noted anything for not being natural. The overall mouth synchronization is phenomenal.

Microsoft said the model was developed to animate virtual characters, and it claimed that all the people shown in the demo are synthetic, as they said, the models were generated from DALL-E, which is the image generator of OpenAI. So we think if it can animate an AI generated model, then obviously there is much more potential in it to animate photos of any real person, which should be more realistic and much easier for it to handle.

Use cases of Vasa-1 and its potential misuse

If we look at the potential of VASA-1 for practical usage, then on the baseline, it can be used to animate characters in animated movies, which will give characters a more realistic feel with natural facial expressions and head movements. Another use could be in video games, for the very same reason, think of Grand Theft Auto and the like. In the future it may be used for hyper realistic AI generated movies or series where characters can be generated from image generators and could be animated by VASA-1, and the audience may not even feel that the characters are not humans.

Along with creative use of the tool, it can also be leveraged to create content for malicious purposes. The potential misuse of VASA-1 could be its utilization for deepfakes, as it will make it easy for anyone involved in deepfake creations to scale up their bad tactics and generate more realistic misguiding content. Remember the robocall scandal involving Biden’s voice to refrain people from voting before a primary election? Now it could be a robovideo after the robocall, and that with very realistic human expressions.

The potential risk of misuse is may be the reason that Microsoft has limited its testing to its researchers only. According to Microsoft researchers, the tool can be used for creating misleading and deceivable content for impersonating humans, like some other tools, but they are aiming for positive use applications. Nvidia and Runway AI have also released their models for the same function, but VASA-1 seems far more realistic and a promising candidate.

The research paper can be seen here, and Microsoft’s note here.