Tuesday, 23 January 2024

Real-time voice cloning with MockingBird


Demo video:
 https://www.bilibili.com/video/BV17Q4y1B7mY/

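Real-time voice cloning: clone your voice within 5 seconds and generate arbitrary speech content (others can then impersonate your voice convincingly, and it is hard to tell real from fake).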


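Features
• Supports Mandarin and has been tested with multiple Chinese datasets
• Works with PyTorch; tested with version 1.9.0 on Tesla T4 and GTX 2060 GPUs
• Runs on Windows and Linux
• Gives good results with only a downloaded or newly trained synthesizer
• Can serve your trained results for remote calls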

Open-source repository: https://github.com/babysor/MockingBird

---------

AI voice cloning: clone a voice in 5 seconds to generate arbitrary speech in real time.

MIT License

Features

🌍 Chinese: supports Mandarin and has been tested with multiple datasets: aidatatang_200zh, magicdata, aishell3, data_aishell, etc.

🤩 PyTorch: tested with version 1.9.0 (the latest in August 2021) on Tesla T4 and GTX 2060 GPUs

🌍 Windows + Linux: runs on both Windows and Linux (and even on M1 macOS)

🤩 Easy & Awesome: good results with only a newly trained synthesizer, by reusing the pretrained encoder/vocoder

🌍 Webserver Ready: serves your results for remote calling

DEMO VIDEO

Quick Start

1. Install Requirements

1.1 General Setup

Follow the original repo to test whether your environment is ready. Python 3.7 or higher is needed to run the toolbox.

If you get the error "ERROR: Could not find a version that satisfies the requirement torch==1.9.0+cu102 (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)", the cause is probably a Python version that is too low. Try Python 3.9 and the install should succeed.

  • Install ffmpeg.
  • Run pip install -r requirements.txt to install the remaining necessary packages.
  • If you need webrtcvad, install it with pip install webrtcvad-wheels.

or

  • Install dependencies with conda or mamba:

    conda env create -n env_name -f env.yml

    mamba env create -n env_name -f env.yml

    Either command creates a virtual environment with the necessary dependencies installed. Switch to the new environment with conda activate env_name.

    env.yml only includes the dependencies needed to run the project, for now without monotonic-align. Check the official PyTorch website to install the GPU version of PyTorch.
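
Before moving on, it can help to confirm that the interpreter and key packages are visible. The following is a minimal sketch, not part of the project: the function name check_environment and the default package list are assumptions chosen for illustration, and the check only tests whether a package can be found, not whether it works with your GPU.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 7), packages=("torch", "webrtcvad")):
    """Report whether the interpreter is new enough and the packages are importable."""
    report = {"python_ok": sys.version_info[:2] >= tuple(min_python)}
    for name in packages:
        # find_spec returns None when a top-level package cannot be found.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_environment().items():
        print(f"{item}: {'ok' if ok else 'missing'}")
```

A "missing" entry for torch usually means the wrong virtual environment is active.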

1.2 Setup with a M1 Mac

The following steps are a workaround for using the original demo_toolbox.py directly, without changing any code.

The main issue is that the PyQt5 package used by demo_toolbox.py is not compatible with M1 chips. To train models on an M1 chip, either forgo demo_toolbox.py or try the project's web.py instead.

1.2.1 Install PyQt5, with ref here.
  • Create and open a Rosetta Terminal, with ref here.
  • Use system Python to create a virtual environment for the project:
    /usr/bin/python3 -m venv /PathToMockingBird/venv
    source /PathToMockingBird/venv/bin/activate
  • Upgrade pip and install PyQt5:
    pip install --upgrade pip
    pip install pyqt5

1.2.2 Install pyworld and ctc-segmentation

Both packages seem to be unique to this project and are not used in the original Real-Time Voice Cloning project. Neither package has a wheel available here, so pip install tries to compile them from C source and fails because it cannot find Python.h.

  • Install pyworld

    • brew install python: Python.h comes with the Python installed by brew.
    • export CPLUS_INCLUDE_PATH=/opt/homebrew/Frameworks/Python.framework/Headers: the path of the brew-installed Python.h shown here is specific to M1 macOS, and it must be added to the environment variables manually.
    • pip install pyworld: this should now succeed.
  • Install ctc-segmentation

    The same method does not work for ctc-segmentation; it has to be compiled from the source code on GitHub.

    • git clone https://github.com/lumaku/ctc-segmentation.git
    • cd ctc-segmentation
    • source /PathToMockingBird/venv/bin/activate: activate the virtual environment if it is not already active.
    • cythonize -3 ctc_segmentation/ctc_segmentation_dyn.pyx
    • /usr/bin/arch -x86_64 python setup.py build: build with x86 architecture.
    • /usr/bin/arch -x86_64 python setup.py install --optimize=1 --skip-build: install with x86 architecture.
1.2.3 Other dependencies
  • /usr/bin/arch -x86_64 pip install torch torchvision torchaudio: installs PyTorch with pip as an example; the arch prefix makes sure it is installed for x86 architecture.
  • pip install ffmpeg: install ffmpeg.
  • pip install -r requirements.txt: install the other requirements.
1.2.4 Run the inference (with Toolbox)

To run the project on x86 architecture (ref):

  • vim /PathToMockingBird/venv/bin/pythonM1: create an executable file pythonM1 in /PathToMockingBird/venv/bin that wraps the Python interpreter.
  • Write the following content:
    #!/usr/bin/env zsh
    mydir=${0:a:h}
    /usr/bin/arch -x86_64 $mydir/python "$@"
    
  • chmod +x pythonM1: make the file executable.
  • If you use the PyCharm IDE, configure the project interpreter to pythonM1 (steps here). If you use Python from the command line, run /PathToMockingBird/venv/bin/pythonM1 demo_toolbox.py

2. Prepare your models

Note that we are using the pretrained encoder/vocoder but not the synthesizer, since the original model is incompatible with Chinese symbols. This means demo_cli does not work at the moment, so additional synthesizer models are required.

You can either train your models or use existing ones:

2.1 Train encoder with your dataset (Optional)

  • Preprocess the audio and the mel spectrograms: python encoder_preprocess.py <datasets_root>. Pass --dataset {dataset} to choose the datasets you want to preprocess; only the train set of each dataset is used. Possible names: librispeech_other, voxceleb1, voxceleb2. Use commas to separate multiple datasets.

  • Train the encoder: python encoder_train.py my_run <datasets_root>/SV2TTS/encoder

For training, the encoder uses visdom. You can disable it with --no_visdom, but it's nice to have. Run "visdom" in a separate CLI/process to start your visdom server.
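
The comma-separated --dataset value described in the preprocessing step can be pictured with a small sketch. This is an illustration only, not the project's actual argument parser: the helper name split_datasets and the constant KNOWN_ENCODER_DATASETS are hypothetical, and the allowed names are simply the three listed above.

```python
# Hypothetical helper; names are illustrative and not taken from the project.
KNOWN_ENCODER_DATASETS = ("librispeech_other", "voxceleb1", "voxceleb2")


def split_datasets(value, known=KNOWN_ENCODER_DATASETS):
    """Split a comma-separated --dataset value and reject unknown names."""
    names = [name.strip() for name in value.split(",") if name.strip()]
    unknown = [name for name in names if name not in known]
    if unknown:
        raise ValueError(f"unknown dataset(s): {', '.join(unknown)}")
    return names


print(split_datasets("librispeech_other, voxceleb1"))
```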

2.2 Train synthesizer with your dataset

  • Download a dataset and unzip it: make sure you can access all the .wav files in the folder.

  • Preprocess the audio and the mel spectrograms: python pre.py <datasets_root>. Pass --dataset {dataset} to select aidatatang_200zh, magicdata, aishell3, data_aishell, etc. If this parameter is not passed, the default dataset is aidatatang_200zh.

  • Train the synthesizer: python train.py --type=synth mandarin <datasets_root>/SV2TTS/synthesizer

  • Go to the next step once the attention line appears and the loss meets your needs; check the training folder synthesizer/saved_models/.

2.3 Use pretrained model of synthesizer

Thanks to the community, some models will be shared:

Author | Download link | Preview video | Info
@author | https://pan.baidu.com/s/1iONvRxmkI-t1nHqxKytY3g (Baidu code: 4j5d) | | 75k steps, trained on multiple datasets
@author | https://pan.baidu.com/s/1fMh9IlgKJlL2PIiRTYDUvw (Baidu code: om7f) | | 25k steps, trained on multiple datasets; only works under version 0.0.1
@FawenYo | https://yisiou-my.sharepoint.com/:u:/g/personal/lawrence_cheng_fawenyo_onmicrosoft_com/EWFWDHzee-NNg9TWdKckCc4BC7bK2j9cCbOWn0-_tK0nOg?e=n0gGgC | input, output | 200k steps with a local Taiwanese accent; only works under version 0.0.1
@miven | https://pan.baidu.com/s/1PI-hM3sn5wbeChRryX-RCQ (code: 2021) or https://www.aliyundrive.com/s/AwPsbo8mcSP (code: z2m0) | https://www.bilibili.com/video/BV1uh411B7AD/ | only works under version 0.0.1

2.4 Train vocoder (Optional)

Note: the choice of vocoder makes little difference to the result, so you may not need to train a new one.

  • Preprocess the data: python vocoder_preprocess.py <datasets_root> -m <synthesizer_model_path>

Replace <datasets_root> with your dataset root and <synthesizer_model_path> with the directory of your best trained synthesizer models, e.g. synthesizer\saved_models\xxx

  • Train the wavernn vocoder: python vocoder_train.py mandarin <datasets_root>

  • Train the hifigan vocoder: python vocoder_train.py mandarin <datasets_root> hifigan

3. Launch

3.1 Using the web server

You can then run python web.py and open it in a browser; the default address is http://localhost:8080

3.2 Using the Toolbox

You can then try the toolbox: python demo_toolbox.py -d <datasets_root>

3.3 Using the command line

You can then try the command: python gen_voice.py <text_file.txt> your_wav_file.wav. You may need to install cn2an with pip install cn2an for better handling of numbers.

Reference

This repository is forked from Real-Time-Voice-Cloning, which only supports English.

Paper (arXiv ID) | Designation | Title | Implementation source
1803.09017 | GlobalStyleToken (synthesizer) | Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis | This repo
2010.05646 | HiFi-GAN (vocoder) | Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis | This repo
2106.02297 | Fre-GAN (vocoder) | Fre-GAN: Adversarial Frequency-consistent Audio Synthesis | This repo
1806.04558 | SV2TTS | Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis | This repo
1802.08435 | WaveRNN (vocoder) | Efficient Neural Audio Synthesis | fatchord/WaveRNN
1703.10135 | Tacotron (synthesizer) | Tacotron: Towards End-to-End Speech Synthesis | fatchord/WaveRNN
1710.10467 | GE2E (encoder) | Generalized End-To-End Loss for Speaker Verification | This repo

FAQ

1. Where can I download the dataset?

Dataset | Original Source | Alternative Sources
aidatatang_200zh | OpenSLR | Google Drive
magicdata | OpenSLR | Google Drive (Dev set)
aishell3 | OpenSLR | Google Drive
data_aishell | OpenSLR |

After unzipping aidatatang_200zh, you also need to unzip all the files under aidatatang_200zh\corpus\train

2. What is <datasets_root>?

If the dataset path is D:\data\aidatatang_200zh, then <datasets_root> is D:\data
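
In other words, <datasets_root> is the parent folder of the dataset folder. A minimal sketch of that relationship, using the example path above (the helper name datasets_root_for is made up for illustration):

```python
from pathlib import PureWindowsPath


def datasets_root_for(dataset_path):
    """Return the folder that contains the dataset folder."""
    return str(PureWindowsPath(dataset_path).parent)


print(datasets_root_for(r"D:\data\aidatatang_200zh"))  # D:\data
```

PureWindowsPath is used so the example behaves the same on any operating system.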

3. Not enough VRAM

Train the synthesizer: adjust the batch_size in synthesizer/hparams.py

# Before
tts_schedule = [(2,  1e-3,  20_000,  12),   # Progressive training schedule
                (2,  5e-4,  40_000,  12),   # (r, lr, step, batch_size)
                (2,  2e-4,  80_000,  12),   #
                (2,  1e-4, 160_000,  12),   # r = reduction factor (# of mel frames
                (2,  3e-5, 320_000,  12),   #     synthesized for each decoder iteration)
                (2,  1e-5, 640_000,  12)],  # lr = learning rate
# After
tts_schedule = [(2,  1e-3,  20_000,  8),   # Progressive training schedule
                (2,  5e-4,  40_000,  8),   # (r, lr, step, batch_size)
                (2,  2e-4,  80_000,  8),   #
                (2,  1e-4, 160_000,  8),   # r = reduction factor (# of mel frames
                (2,  3e-5, 320_000,  8),   #     synthesized for each decoder iteration)
                (2,  1e-5, 640_000,  8)],  # lr = learning rate
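
The change above only touches the fourth element of each (r, lr, step, batch_size) tuple. As a sketch of the same edit expressed in code (with_batch_size is a hypothetical helper, not something the project provides, and the two-row schedule is shortened for the example):

```python
def with_batch_size(schedule, batch_size):
    """Return a copy of the schedule with every batch size replaced."""
    return [(r, lr, step, batch_size) for (r, lr, step, _old) in schedule]


# Shortened schedule for illustration: (r, lr, step, batch_size)
tts_schedule = [(2, 1e-3, 20_000, 12),
                (2, 5e-4, 40_000, 12)]

smaller = with_batch_size(tts_schedule, 8)
print(smaller)
```

The reduction factor, learning rate, and step boundaries stay as they were; only the memory-related value changes.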

Train vocoder, preprocess the data: adjust synthesis_batch_size in synthesizer/hparams.py

# Before
### Data Preprocessing
        max_mel_frames = 900,
        rescale = True,
        rescaling_max = 0.9,
        synthesis_batch_size = 16,                  # For vocoder preprocessing and inference.
# After
### Data Preprocessing
        max_mel_frames = 900,
        rescale = True,
        rescaling_max = 0.9,
        synthesis_batch_size = 8,                  # For vocoder preprocessing and inference.

Train vocoder, train the vocoder: adjust voc_batch_size in vocoder/wavernn/hparams.py

# Before
# Training
voc_batch_size = 100
voc_lr = 1e-4
voc_gen_at_checkpoint = 5
voc_pad = 2

# After
# Training
voc_batch_size = 6
voc_lr = 1e-4
voc_gen_at_checkpoint = 5
voc_pad = 2

4. What if I get RuntimeError: Error(s) in loading state_dict for Tacotron: size mismatch for encoder.embedding.weight: copying a param with shape torch.Size([70, 512]) from checkpoint, the shape in current model is torch.Size([75, 512])?

Please refer to issue #37

5. How can I improve CPU and GPU utilization?

Adjust the batch_size as appropriate.

6. What if I get the error "the page file is too small to complete the operation"?

Please refer to this video and change the virtual memory to 100G (102400). For example, if the files are on drive D, change the virtual memory of drive D.

7. When should I stop during training?

For reference, my attention line appeared after 18k steps and the loss dropped below 0.4 after 50k steps.
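
That rule of thumb can be written down as a simple check. This is only a sketch under assumptions: the function name ready_to_stop is invented, the 0.4 threshold is the author's personal figure rather than a general rule, and whether attention has appeared is something you judge by eye from the training plots and pass in as a flag.

```python
def ready_to_stop(recent_losses, attention_visible, target_loss=0.4):
    """Return True when attention has appeared and the recent average loss is below target."""
    if not attention_visible or not recent_losses:
        return False
    average = sum(recent_losses) / len(recent_losses)
    return average < target_loss


print(ready_to_stop([0.41, 0.38, 0.37], attention_visible=True))
```

Averaging a few recent values avoids stopping on a single unusually low step.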

from https://github.com/babysor/MockingBird 

------------

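What is OpenVoice

OpenVoice is a free, open-source, versatile instant AI voice cloning tool released by MyShell. It needs only a short audio clip of the reference speaker to replicate their voice, and it can generate speech in multiple languages. Besides replicating timbre, OpenVoice offers fine-grained control over voice style, including emotion, accent, rhythm, pauses, and intonation. OpenVoice can also perform zero-shot cross-lingual voice cloning without the support of a massive-speaker training set. It is also computationally efficient, costing tens of times less than commercial APIs that perform worse.

Demo 1: https://www.lepton.ai/playground/openvoice

Demo 2: https://app.myshell.ai/bot/z6Bvua/1702636181

Demo 3: https://huggingface.co/spaces/myshell-ai/OpenVoice

GitHub: https://github.com/myshell-ai/OpenVoice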

--------------------------------------------------------------

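Learn voice cloning in 10 minutes! One-click launcher package released! Make AI audio as a side business from home:

 https://17yongai.com/11451.html


The voice cloning tool best suited to beginners, and very friendly:

https://17yongai.com/11770.html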

-------------------------------------------------------------------

 EmotiVoice 😊: a Multi-Voice and Prompt-Controlled TTS Engine.

EmotiVoice is a powerful and modern open-source text-to-speech engine that is available to you at no cost. EmotiVoice speaks both English and Chinese and offers over 2000 different voices (refer to the List of Voices for details). Its most prominent feature is emotional synthesis, allowing you to create speech with a wide range of emotions, including happy, excited, sad, angry and others.

An easy-to-use web interface is provided. There is also a scripting interface for batch generation of results.

Demo

A demo is hosted on Replicate, EmotiVoice.

Hot News

Features under development

  • Support for more languages, such as Japanese and Korean. #19 #22

EmotiVoice prioritizes community input and user requests. We welcome your feedback!

Quickstart

EmotiVoice Docker image

The easiest way to try EmotiVoice is by running the Docker image. You need a machine with an NVIDIA GPU. If you have not done so, set up the NVIDIA Container Toolkit by following the instructions for Linux or Windows WSL2. Then EmotiVoice can be run with:

docker run -dp 127.0.0.1:8501:8501 syq163/emoti-voice:latest

The Docker image was updated on January 4th, 2024. If you have an older version, please update it by running the following commands:

docker pull syq163/emoti-voice:latest
docker run -dp 127.0.0.1:8501:8501 -p 127.0.0.1:8000:8000 syq163/emoti-voice:latest

Now open your browser and navigate to http://localhost:8501 to start using EmotiVoice's powerful TTS capabilities.

Starting from this version, the 'OpenAI-compatible-TTS API' is now accessible via http://localhost:8000/.

Full installation

conda create -n EmotiVoice python=3.8 -y
conda activate EmotiVoice
pip install torch torchaudio
pip install numpy numba scipy transformers soundfile yacs g2p_en jieba pypinyin pypinyin_dict

Prepare model files

We recommend that users refer to the wiki page How to download the pretrained model files if they encounter any issues.

git lfs install
git lfs clone https://huggingface.co/WangZeJun/simbert-base-chinese WangZeJun/simbert-base-chinese

or, you can run:

git clone https://www.modelscope.cn/syq163/WangZeJun.git

Inference

  1. You can download the pretrained models by simply running the following command:
git clone https://www.modelscope.cn/syq163/outputs.git
  2. The inference text format is <speaker>|<style_prompt/emotion_prompt/content>|<phoneme>|<content>.
  • inference text example: 8051|Happy|<sos/eos> [IH0] [M] [AA1] [T] engsp4 [V] [OY1] [S] engsp4 [AH0] engsp1 [M] [AH1] [L] [T] [IY0] engsp4 [V] [OY1] [S] engsp1 [AE1] [N] [D] engsp1 [P] [R] [AA1] [M] [P] [T] engsp4 [K] [AH0] [N] [T] [R] [OW1] [L] [D] engsp1 [T] [IY1] engsp4 [T] [IY1] engsp4 [EH1] [S] engsp1 [EH1] [N] [JH] [AH0] [N] . <sos/eos>|Emoti-Voice - a Multi-Voice and Prompt-Controlled T-T-S Engine.
  3. You can get phonemes by running python frontend.py data/my_text.txt > data/my_text_for_tts.txt.

  4. Then run:

TEXT=data/inference/text
python inference_am_vocoder_joint.py \
--logdir prompt_tts_open_source_joint \
--config_folder config/joint \
--checkpoint g_00140000 \
--test_file $TEXT

The synthesized speech is under outputs/prompt_tts_open_source_joint/test_audio.

  5. Or if you just want to use the interactive TTS demo page, run:
pip install streamlit
streamlit run demo_page.py
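
The four-field inference text format described in the steps above can be checked before running inference. The following is a minimal sketch, not EmotiVoice code: InferenceLine and parse_inference_line are hypothetical names, and the sample line is hand-written and shortened for illustration rather than real frontend.py output.

```python
from typing import NamedTuple


class InferenceLine(NamedTuple):
    speaker: str
    prompt: str
    phonemes: list
    content: str


def parse_inference_line(line):
    """Split one '<speaker>|<prompt>|<phoneme>|<content>' line into its fields."""
    # maxsplit=3 keeps any '|' inside the final content field intact.
    parts = line.rstrip("\n").split("|", 3)
    if len(parts) != 4:
        raise ValueError("expected 4 '|'-separated fields")
    speaker, prompt, phoneme_text, content = parts
    return InferenceLine(speaker, prompt, phoneme_text.split(), content)


# Hand-written, shortened sample for illustration only.
sample = "8051|Happy|<sos/eos> [HH] [AY1] . <sos/eos>|Hi."
parsed = parse_inference_line(sample)
print(parsed.speaker, parsed.prompt, len(parsed.phonemes), parsed.content)
```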

OpenAI-compatible-TTS API

Thanks to @lewangdev for adding an OpenAI compatible API #60. To set it up, use the following command:

pip install fastapi pydub uvicorn[standard] pyrubberband
uvicorn openaiapi:app --reload
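
Once the server is running, a client could call it roughly as follows. This is a sketch under stated assumptions, not documented EmotiVoice behavior: the route /v1/audio/speech and the JSON fields (model, input, voice, response_format) are modelled on OpenAI's text-to-speech API, the model name "emoti-voice" is a placeholder, and the voice "8051" is the speaker ID from the inference example earlier. Check openaiapi.py for the names the server really expects.

```python
import json
import urllib.request


def build_speech_request(text, voice, base_url="http://localhost:8000"):
    """Build a POST request in the style of OpenAI's speech endpoint (assumed route)."""
    payload = {
        "model": "emoti-voice",      # placeholder model name (assumption)
        "input": text,
        "voice": voice,              # assumed to be an EmotiVoice speaker ID
        "response_format": "mp3",    # assumed to be supported
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, voice, out_path="speech.mp3"):
    """Send the request to a running server and save the returned audio bytes."""
    with urllib.request.urlopen(build_speech_request(text, voice)) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return out_path
```

Calling synthesize("Hello there.", "8051") requires the uvicorn server from the command above to be listening on port 8000.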

Wiki page

You may find more information from our wiki page.

Training

Voice Cloning with your personal data was released on December 13th, 2023.

Roadmap & Future work

  • Our future plan can be found in the ROADMAP file.
  • The current implementation focuses on emotion/style control by prompts. It uses only pitch, speed, energy, and emotion as style factors, and does not use gender. But it is not complicated to change it to style/timbre control.
  • Suggestions are welcome. You can file issues or contact @ydopensource on Twitter.

WeChat group

You are welcome to scan the QR code below and join the WeChat group.

Credits

 from https://github.com/netease-youdao/EmotiVoice

---------------------------------------------------------------------

Voice Cloning with your personal data

People have different preferences for voices. In response to the community's needs, we are thrilled to release the voice cloning code with tutorials.

Precautions Before Starting:

  1. At least one NVIDIA GPU is required for training and voice cloning.
  2. Data for target voice is crucial for voice cloning. The detailed requirements are provided in the next section.
  3. Currently, only Chinese and English are supported, meaning you can use either Chinese data or English data, or both to train your voice, resulting in a model capable of speaking both languages.
  4. Although EmotiVoice supports emotional prompts, if you want your voice to convey emotions, your data should already contain emotional elements.
  5. After training solely with your data, the original voices from EmotiVoice will be altered. This means that the new model will be entirely customized based on your data. If you wish to use EmotiVoice's original 2000+ voices, it is recommended to use the pre-trained model instead.

Detailed requirements for training data

  1. Audio data should be of high quality, such as clear and undistorted speech from a single individual.
  2. Text corresponding to each audio should align with the content of the speech. Before training, the original text is converted into phonemes using G2P. It is important to pay special attention to short pauses (sp*) and polyphones, as they can have an impact on the quality of training.
  3. If you desire your voice to convey emotions, your data should already contain emotional elements. Additionally, the content of the tag 'prompt' should be appropriately modified for each audio. Prompts can include emotions, speed, and any form of text descriptions of the speaking style.
  4. After that, you should obtain a data directory which contains two subdirectories, named train and valid. Each subdirectory has a datalist.jsonl file in which every line has the following format: {"key": "LJ002-0020", "wav_path": "data/LJspeech/wavs/LJ002-0020.wav", "speaker": "LJ", "text": ["<sos/eos>", "[IH0]", "[N]", "engsp1", "[EY0]", "[T]", "[IY1]", "[N]", "engsp1", "[TH]", "[ER1]", "[T]", "[IY1]", "[N]", ".", "<sos/eos>"], "original_text": "In 1813", "prompt": "common"}.
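
A quick way to catch malformed entries before training is to check each datalist.jsonl line for the fields shown in the example. This is a minimal sketch and not part of EmotiVoice: parse_datalist_line and REQUIRED_KEYS are hypothetical names, and the required keys are simply those that appear in the sample line above.

```python
import json

# Keys taken from the sample line; treated as required for this sketch.
REQUIRED_KEYS = ("key", "wav_path", "speaker", "text", "original_text", "prompt")


def parse_datalist_line(line):
    """Parse one JSON line and verify it has the expected fields."""
    entry = json.loads(line)
    missing = [name for name in REQUIRED_KEYS if name not in entry]
    if missing:
        raise ValueError(f"missing field(s): {', '.join(missing)}")
    if not isinstance(entry["text"], list):
        raise ValueError("'text' must be a list of phoneme tokens")
    return entry


sample = (
    '{"key": "LJ002-0020", "wav_path": "data/LJspeech/wavs/LJ002-0020.wav", '
    '"speaker": "LJ", "text": ["<sos/eos>", "[IH0]", "[N]", "engsp1", "[EY0]", '
    '"[T]", "[IY1]", "[N]", "engsp1", "[TH]", "[ER1]", "[T]", "[IY1]", "[N]", '
    '".", "<sos/eos>"], "original_text": "In 1813", "prompt": "common"}'
)
entry = parse_datalist_line(sample)
print(entry["key"], len(entry["text"]))
```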

Step-by-Step Training Process:

The best tutorial for Mandarin Chinese is our DataBaker Recipe, and for English is LJSpeech Recipe. Below is a summary:

  1. Prepare the training environment; this step is only necessary once.

    # create conda environment
    conda create -n EmotiVoice python=3.8 -y
    conda activate EmotiVoice
    # then run:
    pip install EmotiVoice[train]
    # or
    git clone https://github.com/netease-youdao/EmotiVoice
    pip install -e .[train]
  2. Prepare the data according to the Detailed requirements for training data section. Of course, you can use the provided methods and scripts from the DataBaker Recipe and LJSpeech Recipe.

  3. Next, run the following command to create a directory for training: python prepare_for_training.py --data_dir <data directory> --exp_dir <experiment directory>.

    Replace <data directory> with the actual path to your data directory and <experiment directory> with the desired path for your experiment directory.

  4. You can customize the training settings by modifying the parameters in <experiment directory>/config/config.py based on your server and data. Once you have made the necessary changes, initiate the training process by running the following command: torchrun --nproc_per_node=1 --master_port 8018 train_am_vocoder_joint.py --config_folder <experiment directory>/config --load_pretrained_model True. This command will start the training process using the specified configuration folder and load any pre-trained models if specified.

  5. After several training epochs, select some checkpoints and run the following command to perform inference and verify that they meet your expectations: python inference_am_vocoder_exp.py --config_folder exp/DataBaker/config --checkpoint g_00010000 --test_file data/inference/text. Remember to modify the speaker name in data/inference/text. If the results are satisfactory, you can use the new model as desired. We also provide a modified version of the demo page: demo_page_databaker.py.

  6. If the results are not up to par, you can either wait for further training epochs or review your data and environment. Of course, you can consult with the community or create an issue for assistance.

Reference Information for Running Time:

The following information regarding running time and hardware environment is provided for your reference:

  • Software versions: Python 3.8.18, torch 1.13.1, CUDA 11.7
  • GPU card type: NVIDIA GeForce RTX 3090, NVIDIA A40
  • Training time: Approximately 1 to 2 hours are required to train for 10,000 steps.

Training is even possible without an NVIDIA GPU; it just takes longer, so be patient.
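
The figure of roughly 1 to 2 hours per 10,000 steps can be scaled to other step counts. A minimal sketch, assuming training time grows linearly with steps on hardware similar to the GPUs listed above (estimate_training_hours is a made-up helper name):

```python
def estimate_training_hours(steps, hours_per_10k=(1.0, 2.0)):
    """Scale the quoted 1 to 2 hours per 10,000 steps to a given step count."""
    low, high = hours_per_10k
    return (steps / 10_000 * low, steps / 10_000 * high)


print(estimate_training_hours(50_000))  # (5.0, 10.0)
```

Actual times depend on the GPU, batch size, and data, so treat the result as a rough range.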

from https://github.com/netease-youdao/EmotiVoice/wiki/Voice-Cloning-with-your-personal-data

----------------

Related posts:

https://briteming.blogspot.com/2021/04/real-time-voice-cloning.html

https://briteming.blogspot.com/2022/07/voice-deepfake.html

https://briteming.blogspot.com/2023/12/ai.html

https://briteming.blogspot.com/2024/01/clone-voice.html

