Muyan-TTS: Open-Source LLM-Based TTS Optimized for Podcasting—Built on a $50K Budget
The Text-to-Speech (TTS) landscape is undergoing a transformation—powered by Large Language Models (LLMs) that significantly improve semantic understanding and speech naturalness. But despite these advancements, most LLM-based TTS systems remain closed-source, resource-intensive, or ill-suited for real-world audio applications like podcasting.
Introducing Muyan-TTS, a fully open-source, trainable TTS system purpose-built for podcast scenarios—achieving high naturalness, zero-shot synthesis, and fast inference within a $50,000 total training budget.
🔍 Key Innovations:
✅ LLM-Based Synthesis: Replaces the traditional autoregressive (AR) model with a pre-trained LLaMA-3.2-3B, bridging the gap between LLM expressiveness and TTS fidelity.
✅ Parallel Audio-Text Tokenization: Text is tokenized with the LLM's own tokenizer, while audio is quantized into discrete GPT-SoVITS audio tokens, giving speech and text a shared, aligned representation (see the sketch just after this list).
✅ VITS-Based Decoder: A SoVITS decoder fine-tuned on podcast audio to mitigate hallucinations, with a structured G2P (grapheme-to-phoneme) module to improve pronunciation.
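To make the parallel tokenization concrete, here is a minimal sketch of how text tokens and quantized audio tokens could be packed into a single LLM sequence. The `<audio_i>` token names and the Hugging Face calls are illustrative assumptions, not the exact Muyan-TTS implementation:

```python
# Minimal sketch (assumption, not the exact Muyan-TTS code): extend an LLM
# tokenizer with 1024 audio tokens and pack one text/audio pair into a
# single training sequence.
from transformers import AutoTokenizer

# Any causal-LM tokenizer works for the sketch; the real system uses LLaMA-3.2-3B.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

AUDIO_VOCAB = 1024  # matches the 1024 learned audio tokens described below
audio_token_strs = [f"<audio_{i}>" for i in range(AUDIO_VOCAB)]
tokenizer.add_tokens(audio_token_strs, special_tokens=True)
# The model's embedding matrix would be resized to match, e.g. via
# model.resize_token_embeddings(len(tokenizer)).

def build_training_sequence(text: str, audio_codes: list[int]) -> list[int]:
    """Concatenate text token IDs and audio token IDs into one LLM sequence."""
    text_ids = tokenizer.encode(text)
    audio_ids = tokenizer.convert_tokens_to_ids(
        [f"<audio_{c}>" for c in audio_codes]
    )
    return text_ids + audio_ids  # text first, then its parallel speech tokens
```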
⚙️ System Overview:
📥 Data Pipeline: Over 150K hours of multilingual raw audio, collected and filtered down to 100K+ high-quality hours using Whisper and FunASR transcription plus MOS/NISQA quality thresholds; cleaning steps include Music Source Separation, DeEcho, DeReverb, and more (~$30K in GPU compute). A filtering sketch follows this overview.
🧠 LLM Pre-training: LLaMA-3.2-3B is trained for 15 epochs on 80 A100 GPUs over parallel audio-text pairs, with speech represented by 1024 learned audio tokens. (~$19.2K)
🎯 SFT (Speaker Adaptation): Efficient post-training on just a few minutes of target-speaker audio, at roughly 15 minutes per GPU node for each hour of speech; see the data-prep sketch below.
🔊 Decoder Training: SoVITS fine-tuned on a curated 10K-hour podcast set with MOS > 4.5. (~$1.34K)
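As a rough illustration of the data-pipeline step above, the filtering can be sketched as an ASR-plus-quality gate. The `predict_mos` callable stands in for a NISQA-style MOS predictor, and the 3.0 cutoff is an assumed threshold, not the project's exact setting:

```python
# Hedged sketch of the corpus-filtering gate: transcribe with Whisper and keep
# only clips that transcribe cleanly and clear an assumed MOS threshold.
import whisper  # pip install openai-whisper

asr_model = whisper.load_model("large-v3")

def passes_quality_gate(wav_path: str, predict_mos) -> bool:
    """Return True if a clip should survive into the training corpus."""
    result = asr_model.transcribe(wav_path)
    if not result["text"].strip():       # unintelligible audio: drop it
        return False
    return predict_mos(wav_path) >= 3.0  # assumed cutoff; wraps a NISQA-style model
```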
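The speaker-adaptation (SFT) step can be sketched similarly: transcribe a few minutes of the target speaker, quantize each clip, and reuse `build_training_sequence` from the tokenization sketch above. Here `quantize_audio` is a hypothetical stand-in for the GPT-SoVITS quantizer:

```python
# Hedged sketch of SFT data prep; `asr_model` and `build_training_sequence`
# come from the earlier sketches, and `quantize_audio` is a placeholder.
from pathlib import Path

def make_sft_examples(speaker_dir: str, quantize_audio) -> list[list[int]]:
    """Build LLM training sequences from a small folder of one speaker's clips."""
    examples = []
    for wav in sorted(Path(speaker_dir).glob("*.wav")):
        text = asr_model.transcribe(str(wav))["text"]  # ASR transcript as the label
        codes = quantize_audio(str(wav))               # hypothetical audio quantizer
        examples.append(build_training_sequence(text, codes))
    return examples
```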
📊 Evaluation Highlights:
📈 WER: 3.44% (LibriSpeech), 4.09% (SEED) — outperforming Spark-TTS, FireRedTTS, and GPT-SoVITS.
🗣️ MOS: 4.58 (LibriSpeech), 4.32 (SEED) — strong perceptual quality, especially after SFT (4.97 MOS).
👥 Speaker Similarity: Improved from 0.37 → 0.46 (SIM) post-SFT.
⚡ Inference Speed: Fastest of all tested models (Step-Audio, CosyVoice2, Spark-TTS, FireRedTTS, and GPT-SoVITS v3), with a real-time factor of 0.33. A short sketch of the WER and RTF computations follows this list.
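For readers who want to reproduce the two headline metrics, here is how WER and the real-time factor are conventionally computed; the `synthesize` entry point is a placeholder, not Muyan-TTS's actual API. An RTF of 0.33 means one second of audio takes about a third of a second to generate.

```python
# Standard metric definitions: WER via the jiwer library, RTF as
# synthesis wall-clock time divided by the duration of the generated audio.
import time
import jiwer

def word_error_rate(references: list[str], hypotheses: list[str]) -> float:
    return jiwer.wer(references, hypotheses)

def real_time_factor(synthesize, text: str) -> float:
    start = time.perf_counter()
    audio, sample_rate = synthesize(text)  # placeholder TTS call -> (samples, sr)
    elapsed = time.perf_counter() - start
    duration_s = len(audio) / sample_rate  # seconds of generated speech
    return elapsed / duration_s            # < 1.0 means faster than real time
```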
🎯 Why It Matters:
First open-source LLM-TTS stack optimized end-to-end for long-form podcast content.
Fully reproducible training pipeline: data collection, LLM training, decoding, and inference—all open.
Achieves production-grade synthesis and real-time capabilities on modest compute—busting the myth that LLM-based TTS is out of reach for small labs and indie developers.
🛠️ Explore the code, pre-trained models, and docs: https://github.com/MYZY-AI/Muyan-TTS