Clone a voice in 5 seconds to generate arbitrary speech in real time.
This repository is an implementation of Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (SV2TTS) with a vocoder that works in real time. Feel free to check my thesis if you're curious or if you're looking for info I haven't documented. Mostly I would recommend taking a quick look at the figures beyond the introduction.
SV2TTS is a three-stage deep learning framework that creates a numerical representation of a voice from a few seconds of audio and uses it to condition a text-to-speech model trained to generalize to new voices.
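The data flow between the three stages can be sketched as below. This is only an illustration of how the stages chain together, not this repo's API: `clone_voice`, the type aliases, and the three callables are hypothetical names, and the intermediate mel spectrogram is how the SV2TTS paper describes the synthesizer output.

```python
from typing import Callable, List, Sequence

# Hypothetical type aliases for this sketch (not the repository's API).
Waveform = Sequence[float]            # raw audio samples
Embedding = List[float]               # numerical representation of a voice
MelSpectrogram = List[List[float]]    # frames x mel channels


def clone_voice(
    reference_audio: Waveform,
    text: str,
    encoder: Callable[[Waveform], Embedding],
    synthesizer: Callable[[str, Embedding], MelSpectrogram],
    vocoder: Callable[[MelSpectrogram], Waveform],
) -> Waveform:
    """Chain the three stages: audio -> embedding -> spectrogram -> audio."""
    embedding = encoder(reference_audio)   # stage 1: speaker encoder
    mel = synthesizer(text, embedding)     # stage 2: text-to-speech conditioned on the voice
    return vocoder(mel)                    # stage 3: spectrogram to waveform
```

Any three callables with matching signatures can be plugged in, which is what makes it possible to train and swap the stages independently.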
Papers implemented
| arXiv ID | Designation | Title | Implementation source |
|---|---|---|---|
| 1806.04558 | SV2TTS | Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis | This repo |
| 1802.08435 | WaveRNN (vocoder) | Efficient Neural Audio Synthesis | fatchord/WaveRNN |
| 1703.10135 | Tacotron (synthesizer) | Tacotron: Towards End-to-End Speech Synthesis | fatchord/WaveRNN |
| 1710.10467 | GE2E (encoder) | Generalized End-To-End Loss for Speaker Verification | This repo |
News
14/02/21: This repo now runs on PyTorch instead of TensorFlow, thanks to the help of @bluefish. If you wish to run the TensorFlow version instead, check out commit 5425557.
13/11/19: I'm now working full time and I will not maintain this repo anymore. To anyone who reads this:
- If you just want to clone your voice (and not someone else's): I recommend our free plan on Resemble.AI. You will get better voice quality and fewer prosody errors.
- If this is not your case: proceed with this repository, but you might end up being disappointed by the results. If you're planning to work on a serious project, my strong advice: find another TTS repo. Go here for more info.
20/08/19: I'm working on resemblyzer, an independent package for the voice encoder. You can use your trained encoder models from this repo with it.
06/07/19: Need to run within a docker container on a remote server? See here.
25/06/19: Experimental support for low-memory GPUs (~2 GB) added for the synthesizer. Pass --low_mem to demo_cli.py or demo_toolbox.py to enable it. It adds significant overhead, so it's not recommended if you have enough VRAM.
Setup
1. Install Requirements
Python 3.6 or 3.7 is needed to run the toolbox.
- Install PyTorch (>=1.0.1).
- Install ffmpeg.
- Run pip install -r requirements.txt to install the remaining necessary packages.
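Before installing, you may want to check the prerequisites listed above. The following is a minimal sketch using only the standard library; `check_prerequisites` is a hypothetical helper, not part of this repo, and it only checks that PyTorch is importable as `torch`, not its version.

```python
import importlib.util
import shutil
import sys


def check_prerequisites(version_info=sys.version_info):
    """Return a dict mapping each prerequisite name to True or False."""
    return {
        # The requirements above ask for Python 3.6 or 3.7.
        "python": tuple(version_info[:2]) in {(3, 6), (3, 7)},
        # PyTorch is importable under the module name "torch".
        "pytorch": importlib.util.find_spec("torch") is not None,
        # ffmpeg must be reachable on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing or unsupported'}")
```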
2. Download Pretrained Models
Download the latest here.
3. (Optional) Test Configuration
Before you download any dataset, you can begin by testing your configuration with:
python demo_cli.py
If all tests pass, you're good to go.
4. (Optional) Download Datasets
For playing with the toolbox alone, I only recommend downloading LibriSpeech/train-clean-100. Extract the contents as <datasets_root>/LibriSpeech/train-clean-100 where <datasets_root> is a directory of your choosing. Other datasets are supported in the toolbox, see here.
You're free not to download any dataset, but then you will need your own data as audio files or you will have to record it with the toolbox.
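To confirm that the extraction landed in the expected place, a small check like the one below may help. It is a sketch under assumptions: the helper names are hypothetical, and it assumes the LibriSpeech audio files carry the .flac extension.

```python
from pathlib import Path


def librispeech_dir(datasets_root):
    """Path where train-clean-100 is expected after extraction."""
    return Path(datasets_root) / "LibriSpeech" / "train-clean-100"


def count_audio_files(datasets_root):
    """Count .flac files under the expected directory, or 0 if it is absent."""
    target = librispeech_dir(datasets_root)
    if not target.is_dir():
        return 0
    return sum(1 for _ in target.rglob("*.flac"))
```

A result of 0 usually means the archive was extracted one level too deep or too shallow relative to <datasets_root>.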
5. Launch the Toolbox
You can then try the toolbox:
python demo_toolbox.py -d <datasets_root>
or
python demo_toolbox.py
depending on whether you downloaded any datasets. If you are running an X-server or if you have the error Aborted (core dumped), see this issue.
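The choice between the two commands can also be scripted. This is a minimal sketch: `toolbox_command` is a hypothetical helper, and it uses only the script name and the -d option shown above.

```python
import sys
from pathlib import Path


def toolbox_command(datasets_root=None):
    """Build the launch command, adding -d only if the dataset directory exists."""
    cmd = [sys.executable, "demo_toolbox.py"]
    if datasets_root is not None and Path(datasets_root).is_dir():
        cmd += ["-d", str(datasets_root)]
    return cmd
```

From the repository root, the result can be passed to `subprocess.run(toolbox_command("<datasets_root>"), check=True)`.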
from https://github.com/CorentinJ/Real-Time-Voice-Cloning
-----------------------------------------------------------------------------
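Cloning and imitating human voices: Real Time Voice Cloning
Real Time Voice Cloning is an open-source tool for cloning voices in real time. Given just a few seconds of a speaker's original audio, it uses deep learning to imitate that speaker's voice and read text aloud. It is built on Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (SV2TTS). SV2TTS is a three-stage deep learning framework that turns a few seconds of speech into a numerical representation and then uses it with a trained text-to-speech model to generate new speech. Real Time Voice Cloning is released under the MIT license.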
------------------------
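Today I'm recommending an impressive open-source project: from just 5 seconds of your speech, it can clone your voice and generate arbitrary speech in that voice in real time.
Let me give an example. If I have 300 recordings of you speaking and train this open-source project on that data, then once training is finished I can use the trained model to generate anything in your voice.
You will hear someone who sounds exactly like you saying things you never said, which is genuinely unsettling when you think about it.
The project is Real-Time-Voice-Cloning. It is open source and has 24K stars on GitHub. Most importantly, it provides a GUI that is very easy to operate: voice capture, training, and generation can all be done interactively, which is very convenient.
Address: https://github.com/CorentinJ/Real-Time-Voice-Cloning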
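Environment setup
First you need a Python 3.6 environment with PyTorch installed (version >= 1.0.1). PyTorch is a deep learning framework; you can install it by following this page:
https://pytorch.org/get-started/locally/
Next, install ffmpeg:
Address: https://ffmpeg.org/download.html#get-packages. You also need to install the remaining dependencies. Download the project, then run pip install -r requirements.txt in the directory that contains requirements.txt.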
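Download the pretrained models
Download the models the author has already trained; there is no need to train them yourself, you can use them directly: https://github.com/CorentinJ/Real-Time-Voice-Cloning/wiki/Pretrained-models
Once downloaded, place them at the following locations: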
encoder\saved_models\pretrained.pt
synthesizer\saved_models\pretrained\pretrained.pt
vocoder\saved_models\pretrained\pretrained.pt
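A quick way to confirm that the three files ended up in the right places is sketched below. `missing_models` is a hypothetical helper, not part of the project; the paths are the ones listed above, written with pathlib so they work on both Windows and Unix-like systems.

```python
from pathlib import Path

# Relative locations of the pretrained models, as listed above.
EXPECTED_MODELS = [
    Path("encoder/saved_models/pretrained.pt"),
    Path("synthesizer/saved_models/pretrained/pretrained.pt"),
    Path("vocoder/saved_models/pretrained/pretrained.pt"),
]


def missing_models(repo_root="."):
    """Return the expected model files that are not present under repo_root."""
    root = Path(repo_root)
    return [p for p in EXPECTED_MODELS if not (root / p).is_file()]


if __name__ == "__main__":
    for path in missing_models():
        print(f"missing: {path}")
```

Run from the project root, it prints nothing when all three models are in place.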
Details about model training and audio samples can be found here: https://blue-fish.github.io/experiments/RTVC-7.html
Launch
Once the environment is set up, you can try it out. Run python demo_toolbox.py to launch the toolbox.
Below are more detailed tutorials; consult them if you run into problems:
https://www.bilibili.com/video/av79481223?zw
https://blog.csdn.net/weixin_41010198/article/details/113186232
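One last note: I tried this model myself. Because the project was open-sourced by a developer outside China, the training data is English speech. When I had it speak Chinese, it sounded exactly like a foreigner who cannot speak Chinese attempting it. Now I'm starting to doubt what is real.
Open-source repository: https://github.com/CorentinJ/Real-Time-Voice-Cloning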
