INTP (Intelligibility Preference Speech Dataset): Making Zero-Shot TTS More Robust and Intelligible
Despite impressive progress, zero-shot Text-to-Speech (TTS) models still struggle with challenging linguistic scenarios — think tongue twisters, repeated words, code-switching, and cross-lingual synthesis. A recent study directly tackles these weaknesses through preference alignment and a new dataset: INTP (Intelligibility Preference Speech Dataset).
🔍 What’s New:
Introduced INTP, a curated set of ~250K preference pairs targeting tough intelligibility cases.
Adapted Direct Preference Optimization (DPO) to align multiple TTS architectures: autoregressive (AR), flow-matching, and masked generative models.
Preference signals were constructed from WER-based intra- and inter-model comparisons, plus LLM-generated perturbations (see the sketch after this list).
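For intuition, here is a minimal sketch of two of these ingredients: ranking same-text candidates by WER to form an intra-model (chosen, rejected) pair, and the standard DPO objective applied to sequence-level log-probabilities of speech tokens. This is an illustration, not the authors' exact recipe; the `transcribe` ASR callable, the `margin` threshold, and `beta` are placeholder assumptions.

```python
import torch.nn.functional as F
from jiwer import wer  # pip install jiwer


def build_intra_model_pair(text, candidates, transcribe, margin=0.1):
    """Keep the lowest-WER candidate as 'chosen' and the highest as 'rejected'.

    `candidates` is a list of waveforms synthesized from the same text prompt;
    `transcribe` is any ASR callable (placeholder) returning a transcript string.
    """
    scores = [wer(text, transcribe(audio)) for audio in candidates]
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0])
    (best_wer, chosen), (worst_wer, rejected) = ranked[0], ranked[-1]
    if worst_wer - best_wer < margin:
        return None  # no clear intelligibility gap, so skip this prompt
    return {"text": text, "chosen": chosen, "rejected": rejected}


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss on (batch,) tensors holding the summed log-probability
    of the chosen / rejected speech-token sequence under the trainable policy
    and the frozen reference model."""
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    return -F.logsigmoid(logits).mean()
```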
📈 Why It Matters:
Across diverse zero-shot TTS models (e.g., ARS, F5-TTS, MaskGCT, CosyVoice 2, Ints), alignment on INTP yields significant WER reductions and improved naturalness (N-CMOS), without sacrificing speaker similarity.
INTP shows weak-to-strong generalization, improving even strong models whose outputs were not used to build the dataset.
🌀 The study also demonstrates a flywheel effect: a scalable loop of data and model improvement through iterative preference alignment.
📢 The authors will release the INTP dataset, DPO-based alignment code, and improved model checkpoints in the Amphion toolkit to support further research.
More details: https://intalign.github.io/