AI for Automation
2026-03-28 · voice-ai · open-source · text-to-speech · speech-recognition · microsoft · podcast · audio

Microsoft just open-sourced a voice AI with 25K stars

Microsoft VibeVoice brings frontier TTS and speech recognition to everyone for free — 90-minute audio, 50+ languages, MIT license.


A Voice AI That Sounds Like the Future — and Is Free

Microsoft just released VibeVoice, an open-source voice AI system that has already earned nearly 25,000 GitHub stars — a remarkable signal that developers and creators worldwide consider this a game-changer. The project combines two powerful capabilities in one package: TTS (text-to-speech — technology that converts written text into spoken audio) and ASR (automatic speech recognition — the technology that transcribes spoken words into text).

What makes VibeVoice stand out is its ability to generate audio that lasts up to 90 minutes in a single run, with multiple speakers talking naturally back and forth — something previous open-source tools simply could not do: until now, free tools could barely handle a few minutes before quality fell apart.

At a glance — what VibeVoice can do:
  • VibeVoice-TTS (1.5B) — Creates up to 90 minutes of spoken audio with up to 4 distinct voices in a single pass
  • VibeVoice-ASR (7B) — Transcribes up to 60 minutes of audio at once, identifying who said what and when
  • VibeVoice-Realtime (0.5B) — A lightweight model for real-time applications, first audio in just ~300 milliseconds
VibeVoice demo showing podcast generation with multiple speakers

Three Models, One Toolkit — Who Is It For?

VibeVoice ships as a family of three model sizes (a "model" is the AI brain that does the actual work — bigger models are more accurate but need more computing power). Each is designed for a different use case.

The 1.5B TTS model is the one most creators will care about. It can generate a full podcast episode — complete with two or more hosts, natural pauses, and even emotional variation — from a plain text script. It supports English and Chinese with more languages planned. The model uses a clever technique called continuous speech tokenizers (a way of breaking audio into chunks at just 7.5 frames per second, versus the 75 or more frames per second typical tokenizers use, which makes processing much faster and cheaper) that lets it handle very long audio without running out of memory.
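The practical effect of that low frame rate is easy to quantify. A back-of-the-envelope sketch of how many frames a 90-minute episode requires at each rate:

```python
# Frames needed to represent 90 minutes of audio at each tokenizer rate
seconds = 90 * 60

frames_vibevoice = int(seconds * 7.5)  # continuous tokenizer at 7.5 fps
frames_typical = int(seconds * 75)     # conventional tokenizer at 75 fps

print(frames_vibevoice)                     # 40500
print(frames_typical)                       # 405000
print(frames_typical // frames_vibevoice)   # 10x fewer frames to process
```

Tenfold fewer frames means a tenfold shorter sequence for the model to attend over, which is what makes hour-plus generation tractable in memory.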

The 7B ASR model is the transcription powerhouse. It can take a 60-minute interview recording and turn it into a structured document that names each speaker, adds timestamps, and organizes the content. It supports over 50 languages natively, and you can feed it a custom vocabulary list of unusual words — technical jargon, brand names, or proper nouns — to improve accuracy. On the Open ASR Leaderboard (an independent ranking of speech recognition quality), VibeVoice-ASR achieved an average word error rate of 7.77% across eight English datasets, hitting as low as 2.20% on the widely used LibriSpeech benchmark.
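The word error rate figures quoted above follow the standard definition: substitutions, deletions, and insertions (found by edit-distance alignment) divided by the number of words in the reference transcript. A minimal implementation of that metric:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a classic Levenshtein dynamic-programming table."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") across 6 reference words: about 0.167
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A 7.77% average WER means roughly one word in thirteen is transcribed wrong, which is why the custom vocabulary option matters for jargon-heavy recordings.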

The 0.5B Realtime model is designed to run inside apps and services. It accepts streaming text (words arriving as they are typed) and starts producing audio in about 300 milliseconds — fast enough to feel instant to a human listener. It can generate around 10 minutes of continuous speech, making it suitable for voice assistants, accessibility tools, and interactive games.
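Conceptually, a realtime TTS client consumes small audio chunks as the text arrives, and the latency a listener actually perceives is the time to the first chunk. A runnable sketch of that consumption pattern, with a stub generator standing in for the model (hypothetical — the actual VibeVoice-Realtime interface may differ):

```python
import time

def stub_realtime_tts(words):
    """Stand-in for a streaming TTS model (hypothetical, not the real
    VibeVoice API). Yields one small raw PCM chunk per input word."""
    for _ in words:
        yield b"\x00" * 640  # 20 ms of silence at 16 kHz, 16-bit mono

chunks = []
start = time.monotonic()
for chunk in stub_realtime_tts("hello there listener".split()):
    if not chunks:
        # Time-to-first-audio: the ~300 ms figure quoted above is this number
        first_audio_ms = (time.monotonic() - start) * 1000
    chunks.append(chunk)  # in a real app, write each chunk to the audio device

audio = b"".join(chunks)
```

The point of the pattern is that playback can begin while the rest of the sentence is still being generated, which is what makes 300 ms feel instant.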

VibeVoice architecture diagram showing the LLM and diffusion components

How to Install and Run VibeVoice Right Now

VibeVoice is built in Python and requires Python 3.9 or higher. Installation is straightforward through pip (Python's standard package manager — a tool that downloads and installs software libraries with a single command).

The quickest path is to install the community fork, which mirrors the official codebase:

# Option 1: Install via pip (simplest method)
pip install vibevoice

# Option 2: Install from source with GPU support
git clone https://github.com/vibevoice-community/VibeVoice.git
cd VibeVoice
pip install -e ".[gpu]"   # quotes stop the shell from expanding the brackets

Once installed, you can generate speech from a Python script with just a few lines:

from vibevoice import tts

# Generate speech from text
tts(
    text="Hello, this is a test of VibeVoice.",
    model_path="vibevoice/VibeVoice-1.5B",
    output_file="output.wav"
)

For a podcast-style multi-speaker script, you can launch the included demo interface:

# Launch the interactive web demo (opens in your browser)
python demo/gradio_demo.py \
    --model_path vibevoice/VibeVoice-1.5B \
    --share

# Generate from a prepared script file with named speakers
python demo/inference_from_file.py \
    --model_path ./models/VibeVoice-1.5B \
    --txt_path ./my_podcast_script.txt \
    --speaker_names Alice Frank
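The file passed via --txt_path is plain text with speaker-labeled lines, and the --speaker_names arguments map onto those labels in order. A hypothetical my_podcast_script.txt (check the demo folder in the repository for the exact format it expects):

```
Speaker 1: Welcome back to the show! Today we're talking about open-source voice AI.
Speaker 2: And there's a lot to cover, starting with Microsoft's VibeVoice release.
Speaker 1: Let's get into it.
```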

The models themselves are hosted on Hugging Face (a popular platform for sharing AI models) and download automatically when you first run the code. The 1.5B TTS model weighs in at approximately 3 billion total parameters across its components, so a dedicated GPU is recommended for comfortable generation speeds, though CPU-only runs are possible for shorter clips.
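As a rough sizing guide for that GPU recommendation (an estimate assuming half-precision weights at 2 bytes per parameter, not an official requirement):

```python
params = 3e9          # ~3 billion total parameters across the 1.5B model's components
bytes_per_param = 2   # fp16 / bf16 weights

weights_gb = params * bytes_per_param / 1024**3
print(round(weights_gb, 1))  # ~5.6 GB for the weights alone, before activations
```

In practice that puts comfortable generation in the range of consumer GPUs with 8 GB or more of VRAM, leaving headroom for activations and long audio buffers.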

The License, the Stars, and the Caveats

VibeVoice is released under the MIT License (one of the most permissive open-source licenses — it means you can use, modify, and distribute the software with almost no restrictions). With 24,800+ GitHub stars and over 2,700 forks (copies that developers have made to build their own versions), the project has become one of the fastest-growing voice AI repositories of 2026.

That said, Microsoft is transparent about limitations. The project page explicitly states they do not recommend commercial use without further testing and responsible deployment planning. Every audio file generated by VibeVoice includes an audible AI disclaimer and an imperceptible watermark (a hidden digital signature baked into the audio that allows experts to verify it was AI-generated). Inference requests are also logged to help detect misuse.

In January 2026, Microsoft also released VibeVoice-ASR as a standalone speech-to-text model, now integrated directly into the popular Hugging Face Transformers library (a widely used collection of AI model tools) as of version 5.3.0. This means you can use it with the same code patterns you might already know from working with other transcription tools.

The community fork at vibevoice-community/VibeVoice continues active development and has added support for Apple Silicon chips, making it accessible to Mac users without expensive GPU hardware.

Practical use cases to try today:
  • Turn your blog posts or newsletters into podcast episodes automatically
  • Transcribe hours of meeting recordings with automatic speaker labels
  • Add voice narration to educational materials in multiple languages
  • Create audiobooks from written manuscripts with consistent character voices

The official project page at microsoft.github.io/VibeVoice includes audio demos, and the technical paper is available on arXiv for those who want to understand the research behind the approach.
