Microsoft breaks from OpenAI: Launches own AI models for speech, transcription, and images
Microsoft launched three in-house AI models under its MAI (Microsoft AI) initiative on April 5th: MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2. The release signals a clear strategic move to reduce the company's dependence on OpenAI and to compete directly with Google and Anthropic.
MAI-Transcribe-1 is an enterprise-grade speech recognition model supporting 25 languages at approximately 50% lower GPU cost than leading alternatives. It achieves a lower average word error rate than GPT-Transcribe and Gemini 3.1 Flash on accuracy benchmarks.
MAI-Voice-1 generates 60 seconds of expressive audio in under one second on a single GPU. It can create custom voices from just a few seconds of sample audio, enabling voice personalization at scale.
MAI-Image-2 is Microsoft's second-generation text-to-image model and debuted at the top of the Arena.ai leaderboard. It generates images at least twice as fast as its predecessor on Foundry and Copilot.
The models are already integrated into Copilot, Bing, Azure Speech, and PowerPoint, and are available via Microsoft Foundry and the MAI Playground for developers and enterprises.
For enterprise decision-makers, this matters for three reasons: it intensifies competition among frontier-model vendors, it puts downward pressure on enterprise AI pricing, and it means Microsoft products will increasingly run on first-party models rather than OpenAI infrastructure.
📬 Did you like this?
AI news for leaders. Curated by a CIO who builds it himself. Daily in your inbox.