Microsoft Releases New AI Models That Can Generate Images, Audio and Transcribe Text

Microsoft has released MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 AI models.

Written by Akash Dutta, Edited by Ketan Pratap | Updated: 3 April 2026 18:44 IST

Photo Credit: Microsoft

The Image-2 model is being rolled out to Copilot, Bing, and PowerPoint

Highlights

These models are available via Microsoft Foundry and MAI Playground
MAI-Transcribe-1 is said to outperform Google and OpenAI’s models
Voice-1 can generate realistic speech with an emotional range

Microsoft released three specialised artificial intelligence (AI) models on Thursday, focusing on image generation, voice generation, and speech-to-text transcription. The Redmond-based tech giant claims that these models outperform specialised models from rivals such as Google and OpenAI. The models, MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2, are also said to focus on fast generation and competitive pricing. They are currently available via Microsoft Foundry and are also being rolled out to various consumer products.

Microsoft Brings Three New AI Models

In a newsroom post, the tech giant introduced the three new AI models. All of them are currently available via Microsoft Foundry and the MAI Playground. The biggest highlight is MAI-Transcribe-1, which the company claims delivers state-of-the-art (SOTA) speech-to-text transcription across the 25 most-used languages.

The claims are based on Microsoft's internal testing on the FLEURS benchmark, where MAI-Transcribe-1 is said to outperform Gemini 3.1 Flash and GPT-Transcribe in error rate. Additionally, the company says Foundry users will find that it offers the “best price-performance of any large cloud provider.”
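
For readers unfamiliar with how transcription error rates are scored, here is a minimal sketch in Python. The article does not say which metric Microsoft used; this assumes word error rate (WER), a common choice for speech-to-text benchmarks, and the function name and sample sentences are purely illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: real evaluations usually normalise casing and
    punctuation first, which this sketch does not do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# One substituted word out of six reference words.
score = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"{score:.3f}")
```

A lower score means fewer mistakes, so "outperforming" on this kind of metric means producing a smaller number than the competing model on the same audio.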

Coming to MAI-Voice-1, the model is said to generate “natural, realistic speech, rich with nuance, emotional range, and expression.” It is also said to maintain consistent speech and voice identity during long-form content generation. Inside Foundry, the model will also allow users to create a custom voice with a few seconds of audio.

Microsoft claims that this custom voice creation process is safe and secure. The model is said to generate 60 seconds of audio in a single second. Notably, it will also power Copilot Audio Expressions and Copilot Podcasts.
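
To put the claimed generation speed in perspective, here is a back-of-the-envelope sketch in Python. It assumes the 60-to-1 ratio holds linearly for longer clips, which Microsoft has not stated; the function name is hypothetical and no real API is being called.

```python
def synthesis_seconds(audio_seconds: float, speedup: float = 60.0) -> float:
    """Estimated compute time to generate `audio_seconds` of speech.

    `speedup` is the claimed ratio of audio produced per second of
    compute (60 in the article). Linear scaling is an assumption.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


# A hypothetical 30-minute podcast episode (1,800 seconds of audio).
print(synthesis_seconds(30 * 60))
```

Under that assumption, a half-hour episode would take about half a minute to generate, before any network or queueing overhead.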

Finally, the MAI-Image-2 model builds on the capabilities of its predecessor and is said to deliver better output quality at higher speed. Microsoft revealed that the model was created in collaboration with photographers, designers, and visual storytellers, and that it focuses on natural lighting, accurate textures, and clear in-image text. Notably, WPP is among the first enterprise partners to have adopted the AI model.

Like the other two models, MAI-Image-2 is available via Microsoft Foundry and the MAI Playground. It is also rolling out to Copilot, Bing, and PowerPoint.
