
Monday, 8 December 2025

VibeVoice


microsoft.github.io/VibeVoice/

Open-Source Frontier Voice AI: VibeVoice

Project Page · Hugging Face · Technical Report


📰 News

New Realtime TTS

2025-12-03: 📣 We open-sourced VibeVoice‑Realtime‑0.5B, a real‑time text‑to‑speech model that supports streaming text input and robust long-form speech generation. Try it on Colab.

2025-12-09: 📣 We've added experimental speakers in nine languages (DE, FR, IT, JP, KR, NL, PL, PT, ES) for exploration; we welcome you to try them out and share your feedback.

To mitigate deepfake risks and ensure low latency for the first speech chunk, voice prompts are provided in an embedded format. For users requiring voice customization, please reach out to our team. We will also be expanding the range of available speakers.

[Video: VibeVoice_Realtime.mp4]

(Launch your own realtime demo via the websocket example in Usage).
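For readers who want to see the shape of such a client, here is a minimal sketch in Python. Everything specific in it is an assumption for illustration: the ws://localhost:8765 endpoint, the JSON message schema, and the raw-PCM reply format are hypothetical, not the project's documented API; consult the repo's websocket example for the real interface.

```python
# Minimal realtime TTS client sketch (illustrative only).
# Assumptions, NOT the project's documented API: a local server at
# ws://localhost:8765 that accepts JSON text chunks and streams back
# raw PCM audio frames as binary websocket messages.
import asyncio
import json

import websockets  # pip install websockets


async def stream_tts(text_chunks, out_path="out.pcm"):
    async with websockets.connect("ws://localhost:8765") as ws:
        # Send text incrementally, since the model supports streaming text input.
        for chunk in text_chunks:
            await ws.send(json.dumps({"type": "text", "data": chunk}))
        await ws.send(json.dumps({"type": "end"}))

        # Collect audio frames as they arrive; per the model card, the first
        # audible chunk should land in roughly 300 ms.
        with open(out_path, "wb") as f:
            async for message in ws:
                if isinstance(message, bytes):
                    f.write(message)  # raw PCM frame
                elif json.loads(message).get("type") == "done":
                    break


asyncio.run(stream_tts(["Hello, ", "this is a streaming ", "TTS test."]))
```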

2025-09-05: VibeVoice is an open-source research framework intended to advance collaboration in the speech synthesis community. After release, we discovered instances where the tool was used in ways inconsistent with the stated intent. Since responsible use of AI is one of Microsoft’s guiding principles, we have disabled this repo until we are confident that out-of-scope use is no longer possible.

Overview

VibeVoice is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.

VibeVoice currently includes two model variants:

  • Long-form multi-speaker model: Synthesizes conversational or single-speaker speech up to 90 minutes long, with up to 4 distinct speakers, surpassing the typical 1–2 speaker limit of many prior models.
  • Realtime streaming TTS model: Produces initial audible speech in ~300 ms and supports streaming text input for low-latency, single-speaker real-time speech generation.

A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. VibeVoice employs a next-token diffusion framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.
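To see why the 7.5 Hz frame rate matters, consider the sequence lengths involved. The back-of-the-envelope comparison below assumes a 50 Hz baseline, a typical frame rate for neural audio codecs and not a figure from the report:

```python
# Sequence length for 90 minutes of audio at VibeVoice's 7.5 Hz tokenizer
# versus an assumed 50 Hz conventional codec rate (illustrative baseline).
DURATION_S = 90 * 60            # a 90-minute session
vibevoice_frames = 7.5 * DURATION_S
baseline_frames = 50 * DURATION_S

print(f"7.5 Hz -> {vibevoice_frames:,.0f} frames")  # 40,500
print(f"50 Hz  -> {baseline_frames:,.0f} frames")   # 270,000
# The LLM attends over ~6.7x fewer positions for the same audio duration,
# which is what makes 90-minute multi-speaker sessions tractable.
```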

[Figures: MOS preference results; VibeVoice overview]

🎵 Demo Examples

Video Demo

We produced this video with Wan2.2. We sincerely appreciate the Wan-Video team for their great work.

 

Risks and limitations

While efforts have been made to optimize the model through various techniques, it may still produce outputs that are unexpected, biased, or inaccurate. VibeVoice inherits any biases, errors, or omissions of its base model (specifically Qwen2.5-1.5B in this release).

Potential for Deepfakes and Disinformation: High-quality synthetic speech can be misused to create convincing fake audio for impersonation, fraud, or the spread of disinformation. Users must ensure transcripts are reliable, check content for accuracy, and avoid using generated content in misleading ways. Users are expected to deploy the models and use the generated content lawfully, in full compliance with all applicable laws and regulations in the relevant jurisdictions. It is best practice to disclose the use of AI when sharing AI-generated content.

English and Chinese only: Transcripts in languages other than English or Chinese may result in unexpected audio outputs.

Non-Speech Audio: The model focuses solely on speech synthesis and does not handle background noise, music, or other sound effects.

Overlapping Speech: The current model does not explicitly model or generate overlapping speech segments in conversations.

We do not recommend using VibeVoice in commercial or real-world applications without further testing and development. This model is intended for research and development purposes only. Please use responsibly.

From https://github.com/microsoft/VibeVoice

------------------------------------------------------

Microsoft Open-Sources the VibeVoice Speech Synthesis Model: Natural, Expressive Multi-Speaker Long-Form Dialogue

Microsoft Research recently open-sourced VibeVoice, a next-generation text-to-speech (TTS) framework aimed at the hard problem of synthesizing long-form, multi-speaker audio. The model makes major strides in expressiveness, long-range consistency, and natural multi-speaker dialogue, producing high-quality audio comparable to a human-hosted podcast. The VibeVoice project is released under the MIT license.

🌟 Core Innovations

    Ultra-low-frame-rate continuous speech tokenizers
        Acoustic and semantic tokenizers run at an ultra-low 7.5 Hz frame rate, preserving audio fidelity while boosting long-sequence processing efficiency by 300% and supporting up to 90 minutes of continuous speech generation.
    Hybrid diffusion framework (see the sketch after this list)
        The LLM understands textual context: a large language model parses dialogue logic and speaker relationships.
        The diffusion model generates the details: a diffusion head synthesizes high-fidelity acoustic features.
        The two work in concert to achieve natural turn-taking and speaker consistency.
    Multi-speaker dialogue breakthrough
        Handles complex conversations with up to 4 distinct speakers at once, far beyond the 1–2 speaker limit of traditional models, making it suitable for professional scenarios such as podcasts and radio dramas.
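As a rough illustration of how the two halves cooperate, the sketch below mimics the next-token diffusion loop with tiny stand-in modules. All names here (TinyContextLM, TinyDiffusionHead, LATENT_DIM) are hypothetical placeholders for exposition, not the repository's actual classes:

```python
# Conceptual sketch of next-token diffusion (all names are hypothetical
# placeholders, not the repository's API): an autoregressive model tracks
# dialogue context while a diffusion head denoises one continuous acoustic
# latent per 7.5 Hz frame.
import torch
import torch.nn as nn

LATENT_DIM = 64  # illustrative size of one continuous acoustic token


class TinyContextLM(nn.Module):
    """Stand-in for the LLM backbone: tracks dialogue context over frames."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRUCell(LATENT_DIM, LATENT_DIM)

    def step(self, hidden, prev_latent):
        return self.rnn(prev_latent, hidden)


class TinyDiffusionHead(nn.Module):
    """Stand-in denoiser: maps (noise, LM hidden state) -> acoustic latent."""
    def __init__(self, steps=4):
        super().__init__()
        self.net = nn.Linear(2 * LATENT_DIM, LATENT_DIM)
        self.steps = steps

    def denoise(self, noise, cond):
        x = noise
        for _ in range(self.steps):  # a few denoising refinement steps
            x = x - self.net(torch.cat([x, cond], dim=-1))
        return x


lm, head = TinyContextLM(), TinyDiffusionHead()
hidden = torch.zeros(1, LATENT_DIM)   # would be initialized from the script
latent = torch.zeros(1, LATENT_DIM)
frames = []
for _ in range(8):                    # 8 frames ~= 1 second at 7.5 Hz
    hidden = lm.step(hidden, latent)          # LLM advances dialogue context
    latent = head.denoise(torch.randn(1, LATENT_DIM), cond=hidden)
    frames.append(latent)  # continuous tokens; a decoder would render audio
```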

⚙️ Why the Breakthroughs Matter

Traditional TTS pain point        VibeVoice's answer
Long audio breaks / distortion    90 minutes of continuous generation without gaps
Stiff speaker transitions         Natural turn-taking and consistent voice identity
Heavy compute consumption         7.5 Hz tokenizers cut GPU memory use by 80%
💡 Application Outlook

    Immersive audio content: automatically generated multi-speaker podcasts and radio dramas
    AI virtual hosts: dynamically interactive live-stream commentary systems
    Long-form audiobooks: 90 minutes of uninterrupted continuous narration

Source code: https://github.com/microsoft/VibeVoice
