Comparison between xTTS-v2, F5-TTS and GPT-SoVITS-v2

VRAM usage:

  1. GPT-SoVITS-v2: ~2.0 GB
  2. xTTS-v2: ~2.5 GB
  3. F5-TTS: ~7.2 GB

Output Sample Rate:

Inference time:

Inference is quick with xTTS-v2 and GPT-SoVITS-v2: short quotes like the ones below render in about one second. F5-TTS is noticeably slower (roughly 6×).

GPT-SoVITS info

"GPT-SoVITS-v2 - Base" are results using only the one reference audio clip, without finetuning on the speaker's voice.
"GPT-SoVITS-v2 - Finetuned" are results using the reference audio clip and a finetune model of the speaker's voice.
Finetuning is very quick (about 5 minutes). Captioning of audio was automated with faster-whisper (it is required that the audio is captioned).
With the default batch size of 12, training takes 9.5~ GB.
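The transcription step can be scripted with faster-whisper. A minimal sketch, assuming a pipe-delimited annotation format of the shape `wav_path|speaker|language|text` for the finetuning filelist (treat that exact format, and the paths, as assumptions rather than GPT-SoVITS's documented spec):

```python
from pathlib import Path


def format_list_line(wav_path: str, speaker: str, lang: str, text: str) -> str:
    # One annotation line per clip (assumed format): path|speaker|language|text
    return f"{wav_path}|{speaker}|{lang}|{text.strip()}"


def transcribe_clips(audio_dir: str, speaker: str, lang: str = "en") -> list[str]:
    # Imported lazily so format_list_line works without faster-whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel("small", device="cuda", compute_type="float16")
    lines = []
    for wav in sorted(Path(audio_dir).glob("*.wav")):
        # transcribe() returns a lazy generator of segments plus metadata;
        # joining the segment texts gives the full transcript of the clip.
        segments, _info = model.transcribe(str(wav), language=lang)
        text = " ".join(seg.text.strip() for seg in segments)
        lines.append(format_list_line(str(wav), speaker, lang, text))
    return lines
```

Writing the returned lines to a `.list` file then gives a ready-made annotation file for the finetuning step; swap `device="cuda"` for `"cpu"` with `compute_type="int8"` on machines without a GPU.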

Finetuning dataset length:


The quick brown fox jumps over the lazy dog.

[Audio samples for Miyuki, Laverne, Glados, iDroid, Bernard, and Ellis: Reference, xTTS-v2, F5-TTS, GPT-SoVITS-v2 Base, GPT-SoVITS-v2 Finetuned]

To me, the most critical thing in the hobby market right now is the lack of good software courses, books and software itself.
Without good software and an owner who understands programming, a hobby computer is wasted. Will quality software be written for the hobby market?

[Audio samples for Miyuki, Laverne, Glados, iDroid, Bernard, and Ellis: Reference, xTTS-v2, F5-TTS, GPT-SoVITS-v2 Base, GPT-SoVITS-v2 Finetuned]

Haha, I'm just not going to! Hahahaha hahahaha

[Audio samples for Miyuki, Laverne, Glados, iDroid, Bernard, and Ellis: Reference, xTTS-v2, F5-TTS, GPT-SoVITS-v2 Base, GPT-SoVITS-v2 Finetuned]

Phew. I thought you were going to do something very stupid just now.

[Audio samples for Miyuki, Laverne, Glados, iDroid, Bernard, and Ellis: Reference, xTTS-v2, F5-TTS, GPT-SoVITS-v2 Base, GPT-SoVITS-v2 Finetuned]

mmmm, what else is there to do? Hehe, I think I know.

[Audio samples for Miyuki, Laverne, Glados, iDroid, Bernard, and Ellis: Reference, xTTS-v2, F5-TTS, GPT-SoVITS-v2 Base, GPT-SoVITS-v2 Finetuned]