Microsoft Unveils Azure AI Speech Text to Speech Avatar, Raising Ethical Concerns

At its Ignite 2023 event, Microsoft unveiled Azure AI Speech Text to Speech Avatar, a tool for creating photorealistic avatars that can be animated to speak scripted content. Users upload images of a person and write a script; the tool then uses one model to animate the avatar and a text-to-speech model to vocalize the script. The feature is aimed at producing videos such as training videos, product introductions, and customer testimonials efficiently from simple text input. The avatars can speak multiple languages and can be integrated with AI models such as OpenAI's GPT-3.5 to respond to off-script questions.

While the tool opens up creative possibilities, it also raises ethical concerns. Microsoft acknowledges the potential for abuse and limits access to custom avatars, which are available only through a registration process for specific use cases. Most Azure subscribers will have access only to prebuilt avatars at launch. The concerns echo those facing the entertainment industry, where AI-generated likenesses became a contentious issue in the recent SAG-AFTRA strike. Microsoft's stance on compensating individuals for their AI-generated likenesses remains unclear.

Alongside the avatar tool, Microsoft introduced Personal Voice, a feature within its custom neural voice service that replicates a user's voice from a one-minute speech sample used as an audio prompt. Microsoft positions it as a way to create personalized voice assistants, dub content into other languages, and generate custom narration for audio applications. To address potential legal concerns, Microsoft requires users to give explicit consent in a recorded statement before using Personal Voice. Access is currently restricted and subject to registration, and users must agree to usage terms that include limits on sharing or publishing the voice models and their output.


