Training Your Own AI Voice Model with So-VITS-SVC in 2025!

Published on 2025-09-06


Final training results demo (伊涅芙 model)

Training machine: AMD Ryzen 9 9955HX + 32 GB DDR5-5600 + NVIDIA RTX 5070 Laptop GPU
Training platform: Windows 11 Enterprise + Python 3.10.6 + torch 2.7.1 + cu128
Training time: 2 h for the main model, 30 min for the diffusion model
Dataset: 158 voice lines of the character 伊涅芙
Full output download: https://driver.haoa.moe/s/lMCE

Deployment

Before You Deploy

Because svc-develop-team archived the original project long ago, some of its modules no longer work well on modern hardware. This guide therefore uses So-VITS-SVC-Fix, a fork of So-VITS-SVC that fixes these issues for modern devices.

Deployment Steps

Prerequisites:

  1. A stable VPN/proxy (at minimum, enough to git clone the project from GitHub)
  2. At least one NVIDIA GPU (6 GB+ VRAM) or an Ascend NPU (Ascend 910B recommended)
  3. At least 16 GB of RAM recommended (at least 32 GB on Windows 10+)

Deployment · Part 0

  1. Prepare a sufficient dataset (ideally more than half an hour of audio, with each slice no longer than 15 s; audio-slicer-gui can be used for slicing)
  2. Prepare the pretrained models and place them in the corresponding locations.
    Recommended layout:
    1. checkpoint_best_legacy_500.pt in the pretrain directory
    2. nsf_hifigan.zip, extracted into the pretrain/nsf_hifigan directory
    3. fcpe.pt in the pretrain directory
    4. rmvpe.pt in the pretrain directory
    5. G_0.pth and D_0.pth in the logs/44k directory
    6. model_0.pt in the logs/44k/diffusion directory
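The layout above is easy to get wrong, so it can be sanity-checked with a short script before training. This is a minimal sketch: the paths come from the list above and are assumed to be relative to the so-vits-svc-fix project root.

```python
import os

# Expected pretrained-file locations from the list above,
# relative to the so-vits-svc-fix project root.
EXPECTED_FILES = [
    "pretrain/checkpoint_best_legacy_500.pt",
    "pretrain/nsf_hifigan",        # directory extracted from nsf_hifigan.zip
    "pretrain/fcpe.pt",
    "pretrain/rmvpe.pt",
    "logs/44k/G_0.pth",
    "logs/44k/D_0.pth",
    "logs/44k/diffusion/model_0.pt",
]

def missing_files(root="."):
    """Return every expected file that is not yet in place under root."""
    return [p for p in EXPECTED_FILES if not os.path.exists(os.path.join(root, p))]

if __name__ == "__main__":
    for path in missing_files():
        print("missing:", path)
```

Run it from the project root; if it prints nothing, everything is where the training scripts expect it.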

Deployment · Part 1

  1. Verify that the GPU/NPU driver is installed correctly:
    # For NVIDIA:
    nvidia-smi
    # For Ascend:
    ls /usr/local/Ascend/ascend-toolkit/latest/opp
  2. Clone the project with Git:
    # Works on any OS with Git properly installed
    # Mind the directory you clone into!
    git clone https://github.com/USELESSER-HAOA/so-vits-svc-fix

Deployment · Part 2

Create and activate a virtual environment (Python 3.10.6 recommended)

python -m venv .venv
# For Linux
source .venv/bin/activate
# For Windows
./.venv/Scripts/activate

Installing dependencies on different hardware

# For NVIDIA RTX 50-series GPUs
# First install the first two sub-components of the "Desktop development with C++" workload from the Microsoft Visual C++ 14.0 installer
python -m pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu128

# For Ascend NPUs
python -m pip install -r requirements_ascend.txt -i https://mirrors.aliyun.com/pypi/simple/
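Before starting a multi-hour training run, it is worth confirming that PyTorch actually sees an accelerator. A minimal check, assuming the cu128 wheel (or the torch_npu plugin for Ascend) installed by the commands above:

```python
import importlib

def backend_available():
    """Report which accelerated backend this PyTorch install can see."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return "torch not installed"
    if torch.cuda.is_available():
        return f"cuda {torch.version.cuda} ({torch.cuda.get_device_name(0)})"
    try:
        # the Ascend plugin registers the NPU backend when imported
        importlib.import_module("torch_npu")
        return "npu"
    except ImportError:
        return "cpu only"

if __name__ == "__main__":
    print(backend_available())
```

If this prints "cpu only" on an NVIDIA machine, the wrong (CPU) wheel was installed; reinstall with the --index-url shown above.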

Deployment · Part 3

Training with the WebUI

python webui.py

Training from the CLI (not recommended)

python resample.py
python preprocess_flist_config.py --speech_encoder vec768l12
python preprocess_hubert_f0.py --f0_predictor rmvpe --use_diff
# For NVIDIA
python train.py -c configs/config.json -m 44k
# For Ascend
python train_ascend.py -c configs/config.json -m 44k
# For NVIDIA
python train_diff.py -c configs/diffusion.yaml
# For Ascend
python train_diff_ascend.py -c configs/diffusion.yaml
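The CLI flow above (resample → filelist/config → feature and F0 preprocessing → training) can be wrapped in a small driver so that a failed step aborts the run instead of silently feeding bad data to the next one. A sketch for the NVIDIA path, using the exact commands above:

```python
import subprocess

# Preprocessing and training steps from the CLI section, in order
# (NVIDIA path; swap in train_ascend.py / train_diff_ascend.py for Ascend).
PIPELINE = [
    ["python", "resample.py"],
    ["python", "preprocess_flist_config.py", "--speech_encoder", "vec768l12"],
    ["python", "preprocess_hubert_f0.py", "--f0_predictor", "rmvpe", "--use_diff"],
    ["python", "train.py", "-c", "configs/config.json", "-m", "44k"],
    ["python", "train_diff.py", "-c", "configs/diffusion.yaml"],
]

def run_pipeline(dry_run=False):
    """Run each step in order, aborting on the first non-zero exit code."""
    for cmd in PIPELINE:
        print("->", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_pipeline(dry_run=True)  # drop dry_run to actually train
```

Run it from the project root with the virtual environment activated.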

Running inference with your trained model

Inference with the WebUI

python webui.py

Inference from the CLI (strongly discouraged)

Using the command line is not recommended unless specifically required.

Use inference_main.py

# Example
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "君の知らない物語-src.wav" -t 0 -s "nen"

Required parameters:

  • -m | --model_path: path to the model.
  • -c | --config_path: path to the configuration file.
  • -n | --clean_names: a list of wav file names located in the raw folder.
  • -t | --trans: pitch shift, supports positive and negative (semitone) values.
  • -s | --spk_list: Select the speaker ID to use for conversion.
  • -cl | --clip: Forced audio clipping; set to 0 to disable (default), or to a non-zero value (duration in seconds) to enable.

Optional parameters:

  • -lg | --linear_gradient: The cross-fade length of two audio slices in seconds. If the voice sounds discontinuous after forced slicing, you can adjust this value; otherwise, the default of 0 is recommended.
  • -f0p | --f0_predictor: Select an F0 predictor. Options are crepe, pm, dio, harvest, rmvpe, and fcpe; the default is pm (note: F0 mean pooling is enabled automatically when using crepe).
  • -a | --auto_predict_f0: Automatic pitch prediction. Do not enable this when converting singing voices, as it can cause serious pitch issues.
  • -cm | --cluster_model_path: Cluster model or feature-retrieval index path. If left blank, it is automatically set to the default path of these models. If no clustering or feature-retrieval model was trained, it can be left as anything.
  • -cr | --cluster_infer_ratio: The proportion of the clustering scheme or feature retrieval, ranging from 0 to 1. If no clustering model or feature retrieval was trained, leave it at the default of 0.
  • -eh | --enhance: Whether to use the NSF_HIFIGAN enhancer. It can improve sound quality for models trained on small datasets, but degrades well-trained models, so it is disabled by default.
  • -shd | --shallow_diffusion: Whether to use shallow diffusion, which can fix some electrical-sounding artifacts. Disabled by default. When enabled, the NSF_HIFIGAN enhancer is disabled.
  • -usm | --use_spk_mix: Whether to use dynamic voice fusion.
  • -lea | --loudness_envelope_adjustment: Mixing ratio between the input source's loudness envelope and the output loudness envelope. The closer to 1, the more the output loudness envelope is used.
  • -fr | --feature_retrieval: Whether to use feature retrieval. If enabled, the clustering model is disabled, and the cm and cr parameters become the feature-retrieval index path and mixing ratio.

Shallow diffusion settings:

  • -dm | --diffusion_model_path: Diffusion model path
  • -dc | --diffusion_config_path: Diffusion config file path
  • -ks | --k_step: The larger k_step is, the closer the result is to the diffusion model's output. The default is 100.
  • -od | --only_diffusion: Whether to use diffusion-only mode, which skips loading the So-VITS model and runs inference with the diffusion model alone.
  • -se | --second_encoding: Apply an additional encoding to the original audio before shallow diffusion. This can yield varying results, sometimes positive and sometimes negative.

Cautions

If running inference with the whisper-ppg speech encoder, you need to set --clip to 25 and -lg to 1; otherwise inference will fail.
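If you script your inference runs, this requirement is easy to encode so it cannot be forgotten. A hypothetical helper (the function name and the speech_encoder argument are illustrative, not part of the project's CLI):

```python
def build_infer_cmd(model, config, wav, trans, speaker, speech_encoder="vec768l12"):
    """Assemble an inference_main.py command line from the flags documented above."""
    cmd = ["python", "inference_main.py",
           "-m", model, "-c", config, "-n", wav,
           "-t", str(trans), "-s", speaker]
    if speech_encoder == "whisper-ppg":
        # whisper-ppg needs forced clipping and a 1 s cross-fade, per the caution above
        cmd += ["-cl", "25", "-lg", "1"]
    return cmd
```

For example, build_infer_cmd("logs/44k/G_30400.pth", "configs/config.json", "song.wav", 0, "nen", speech_encoder="whisper-ppg") appends -cl 25 -lg 1 automatically, while the default encoder leaves the command unchanged.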

Afterword

This post began as a collection of fixes for problems the author hit while training the 伊涅芙 model with the original So-VITS-SVC project. Since so much had already been written to patch the original project, it made sense to bundle it all together, and that is how So-VITS-SVC-Fix came to be. The fixes won't stop here, either: feel free to share the problems you run into and any new ideas in the Issues, and the author will do their best to implement them!

A minecraft player!
Last updated 2025-09-06